Solved! ARM Apple High-End CPU - Intel replacement

Richie Rich · Oct 14, 2019

There is a first rumor about Intel replacement in Apple products:

ARM based high-end CPU
8 cores, no SMT
IPC +30% over Cortex A77
desktop performance (Core i7/Ryzen R7) with much lower power consumption
introduction with new gen MacBook Air in mid 2020 (considering also MacBook PRO and iMac)
massive AI accelerator

Source Coreteks:

name99 · Mar 4, 2020

soresu said:
Doesn't that depend on your interconnect and uncore efficiency?

Yes, but these are not as hard to scale up as the Intel zealots claim. Every one of the (thinly financed) ARM server companies has managed to produce a creditable interconnect.
It's kinda amazing how much more efficient your designers can be when they aren't weighed down by 40 years of bad decisions legacy...

name99 · Mar 4, 2020

soresu said:
As much as I love to dump on Intel and their overpriced laxity - you really have the wrong end of the stick there.

A77/Deimos is an iteration on the "ground up" work of A76, which was a completely new core that took years to develop (first mentioned the year A72 was announced I believe) - and like Zen2 alongside Zen1, it was likely in development alongside A76 for years, as A78 will have been.

RR's numbers are wrong, but in the right direction.
Apple seems to have a turnaround time for their cores that is ASTONISHINGLY fast, between 3 and 4 years. (One way to see this is to look at the gap between when Arm releases a new version of the ARMv8 ISA and when Apple releases a conforming chip, though of course we don't know how long a period there is between the details being finalized, presumably with Apple at the table, and being released.)

ARM seem to be lagging a solid 2.5 to 3 years or so behind Apple, never catching up, but also never dropping back. Presumably they have much the same turnaround time.
(Though it's weird. They've been very slow in adopting new 8.x ISA additions, right up till the N1 cores. And they seem to have totally -Redacted- up the timing of their small relative to their large cores, which constantly holds them back. Compare to Apple shipping three new cores a year, all three in lock-step and on time. WHAT IS GOING ON? Massive incompetence at the ARM management level?)

On the other hand Intel seems to be at anywhere for 6 to 8 years with their turnaround time. There have been public lectures where they give this time (but some years ago, around Nehalem) as seven years. Certainly if you look at how slow they are able to change THEIR ISA features (think of the hash that is Lakefield because of the mis-match between the versions of AVX on the two different chips) you see a very different story.

Profanity (even abbreviated) is not allowed in the tech forums.

Daveybrat
AT Moderator

name99 · Mar 4, 2020

senttoschool said:
Did Apple release Intel Macs for the low end first or did they just do a grand replacement?

There is this thing called Google...

Mac transition to Intel processors - Wikipedia

en.wikipedia.org

- WWDC announcement (June 2006)
- MacBook Pro and iMac (January2007)
- Mac mini
- Macbook
- Mac Pro (August 2007, HW transition complete)

name99 · Mar 4, 2020

eek2121 said:
YOU should read the article again.

Specifically this page for memory latency and bandwidth: https://www.anandtech.com/show/14892/the-apple-iphone-11-pro-and-max-review/3

After you are done showing yourself out, then I'll leave it up to you to find the floating point results.

I'm objective, but calling Apple's ARM CPUs "competitive" with high end is a stretch.

You're making the classic mistake of assuming that the ONLY way to solve a problem is the wya you know.
No-one buys a CPU for the latency and bandwidth, they buy it for the performance on their workload. Apple achieves their uncore performance the same way they achieve their core performance --- by throwing massive amounts of (low power, but high area) logic at the problem rather than (high power) speed. Thus, as I keep telling you, their caches and prefetchers are ASTONISHINGLY good, so that you get vastly more hits in cache and just don't need to go to DRAM that often.

Look at their caches.
Compare

https://images.anandtech.com/doci/14892/A13.jpg

to

https://imgur.com/r/intel/4SG36

In particular look at the L2 and L3 and compare the amount of storage (the regular grid pattern) to the amount of logic (the irregular stuff in the same block). Note how Apple has VASTLY more logic in its caches. That logic is being used to ensure that its caches hold much more data that will be useful in future, compared to Intel's caches.

How is this done? Well, go read many many academic papers. But for example when you have a cache and a read misses you get that line from RAM and then you have various choices like
- do you put that line in L3, or L2, or only L1?
- which line in each of these caches do you remove to make space for the new line?

There are easy answers to these questions, answers known in the 1980s (like make all the caches inclusive, and use random replacement, or pseudo-LRU). But those answers aren't very good! For example on Intel L3, under most circumstances, over half the lines are dead (ie sitting there, but will never be touched again). Or how does your L2 treat D vs I lines? I lines are more critical because the I fetch subsystem can only tolerate a few cycles before grinding to a halt, while D can tolerate 50 cycles or more. So you should prioritize I lines over D in L2 and L3. But that requires a smarter cache. (I don't think Intel or AMD do such prioritization, I've never heard of it.)

Working harder you can come up with much better answers, so that you caches hold vastly more USEFUL data. Connect that up to smarter prefetchers and, woo hoo, you no longer need to burn so much power connecting to DRAM as fast as possible.

And this is everywhere. TLBs are also a caching system. And once again you can run your TLBs (and the caches that feed them) the smart way or the easy way.

soresu · Mar 4, 2020

name99 said:
And they seem to have totally fscked up the timing of their small relative to their large cores, which constantly holds them back.

The little cores have to remain ISA exact to the big cores for big little/DynamiQ to work - the last change was with A75 and A55.

A65 is a new 'little' core with it's ST IPC being pretty much the next step up from A55 - probably SMT is the only reason it hasn't made it into mobile configurations, because otherwise it seems a big step up in efficiency.

I would expect either A78/Hercules or Matterhorn to have a matching little core, and perhaps ARM's cash infusion from SoftBank may lead to a release cadence parity between big and little cores going from there.

name99 · Mar 4, 2020

soresu said:
The little cores have to remain ISA exact to the big cores for big little/DynamiQ to work - the last change was with A75 and A55.

A65 is a new 'little' core with it's ST IPC being pretty much the next step up from A55 - probably SMT is the only reason it hasn't made it into mobile configurations, because otherwise it seems a big step up in efficiency.

I would expect either A78/Hercules or Matterhorn to have a matching little core, and perhaps ARM's cash infusion from SoftBank may lead to a release cadence parity between big and little cores going from there.

Well duh! You’re repeating what I said!

They NEED lockstep in their large and small cores if they want to move the ISA forward.
They don’t HAVE that lockstep — so their implementation of the ISA lags years behind Apple. Where is their mobile support for PAC, for example?
This is DUMB.

Which part of these three sentence above do you disagree with?

Annual big core cadence requires annual little core cadence.

USER8000 · Mar 4, 2020

Richie Rich said:
Not mentioning Intel needed 4 years for this iteration and ARM only 1 year. This shows how Intel became horribly lazy during Bulldozer period.

soresu said:
No, you still had the wrong end of the stick saying A77 took 1 year to develop - nothing that complicated takes one year of development, you have confused the release/announcement cadence of ARM's big cores with the development time.

Richie Rich said:
I wrote:
- it takes 4 years to develop core
- parallel development of 4 cores needed to have every year new core release

I never wrote such a non sense as core dev takes one year.

USER8000 said:
That is from your own post - you implied that ARM took one year to develop cores. I know people who had friends who worked at ARM and I am from the UK,you really seem to be overegging things a bit mate.

Richie Rich said:
then you better stay isolated on your island, mate...

The very isolated island which has a huge amount of scientists and engineers which have changed the history of the world. ARM is a UK company(from that island I am on) - people here have a lot of nostalgia for ARM as its a homegrown company,just saying. I actually know people who had friends at ARM(from that island I am on),and worked in the UK tech industry in Cambridge,so don't know what you are attempting to do here with your reply.

ARM was started by a PC company called Acorn(from work done at Cambridge university). Many of us grew up with Acorn PCs such as the BBC Micro. ARM is Acorn Risc Machines,and the first ARM based chips were coprocessors for the 6502 based BBC Micro. I was probably using a PC with ARM tech in it before many of you were born!

ARM was spun out of work done at Cambridge university,and there are cluster of R and D labs located around Cambridge ,if you ever actually visited the place in the realworld - which I doubt you ever have,as it appears you are yet to leave your own basement,as you are raging at everyone.

Have fun.

soresu · Mar 4, 2020

USER8000 said:
ARM was spun out of Cambridge university,and there are cluster of R and D labs located around Cambridge,if you ever actually visited the place in the realworld,which I doubt you ever have,as it appears you are yet to leave your own basement,as you are raging at everyone.

There are at least 2 campuses I can remember the names of - Sophia Antipolis in France and Austin, Texas in the US.

There is another that co developed the A65/E1 core with the Cambridge team in Chandler (Arizona, US), I believe they may be responsible for the SMT features in that design.

USER8000 · Mar 4, 2020

soresu said:
There are at least 2 campuses I can remember the names of - Sophia Antipolis in France and Austin, Texas in the US.

There is another that co developed the A65/E1 core with the Cambridge team in Chandler (Arizona, US), I believe they may be responsible for the SMT features in that design.

There are other r and d locations,hence why the US restrictions on Huawei nearly lead to ARM withholding the license to them,as they were not certain if the work done at Texas(if I remember this right) would be applicable to them.

Also in general as people in the area who work in the various companies(or associated university r and d) have a lot of close contact to each other...probably due to the abundance of pubs around the area,and the drinking culture. Cambridge isn't the largest of cities either(which you probably know). So its really interesting in the sense the amount of different people from companies and the universities and the interactions you see there....its really a hot pot of ideas. Watson and Crick sat in the Eagle pub formulating their ideas as an example.

Carfax83 · Mar 4, 2020

name99 said:
You're making the classic mistake of assuming that the ONLY way to solve a problem is the wya you know.
No-one buys a CPU for the latency and bandwidth, they buy it for the performance on their workload. Apple achieves their uncore performance the same way they achieve their core performance --- by throwing massive amounts of (low power, but high area) logic at the problem rather than (high power) speed. Thus, as I keep telling you, their caches and prefetchers are ASTONISHINGLY good, so that you get vastly more hits in cache and just don't need to go to DRAM that often.

Yeah but, how can you extrapolate Apple's solutions to Intel's when it comes to workloads that aren't seen in mobile space?

Just as an example, gaming. Gaming from what I have seen is very reliant on low memory latency. Large caches help immensely no doubt, but caches can only hold so much data before the CPU has no choice but to go to system memory. Many modern games will easily approach double digit figures when it comes to RAM utilization in fact. No other consumer desktop applications are as memory intensive as gaming. Even with massive L3 caches, slightly better IPC and a much denser process node, Zen 2 is still inferior overall compared to Intel's best gaming CPUs (which are using microarchitecture technology from nearly 5 years ago) in gaming ostensibly due to much higher memory latency.

Doug S · Mar 5, 2020

Carfax83 said:
Yeah but, how can you extrapolate Apple's solutions to Intel's when it comes to workloads that aren't seen in mobile space?

Just as an example, gaming. Gaming from what I have seen is very reliant on low memory latency. Large caches help immensely no doubt, but caches can only hold so much data before the CPU has no choice but to go to system memory. Many modern games will easily approach double digit figures when it comes to RAM utilization in fact. No other consumer desktop applications are as memory intensive as gaming. Even with massive L3 caches, slightly better IPC and a much denser process node, Zen 2 is still inferior overall compared to Intel's best gaming CPUs (which are using microarchitecture technology from nearly 5 years ago) in gaming ostensibly due to much higher memory latency.

Apple's SoCs have worse memory latency than Intel's desktop CPUs, but a phone must operate within a limited power budget. One could easily envision them including a memory controller that used less power and incurred more wait states (and less prefetching and so forth) when used in more restrictive environments like a phone/tablet, and more power with fewer wait states (etc.) when operating with the more liberal power budget of a laptop or desktop.

Carfax83 · Mar 5, 2020

Doug S said:
Apple's SoCs have worse memory latency than Intel's desktop CPUs, but a phone must operate within a limited power budget. One could easily envision them including a memory controller that used less power and incurred more wait states (and less prefetching and so forth) when used in more restrictive environments like a phone/tablet, and more power with fewer wait states (etc.) when operating with the more liberal power budget of a laptop or desktop.

That's in line with the overall point I was making, that you can't extrapolate Apple's methods with the A series to what Intel has done, because the workloads and requirements for the hardware are totally different.

Doug S · Mar 5, 2020

Carfax83 said:
That's in line with the overall point I was making, that you can't extrapolate Apple's methods with the A series to what Intel has done, because the workloads and requirements for the hardware are totally different.

But the point is that Apple's performance is competitive with Intel's best even without doing obvious things like this that would further improve their performance. It isn't as though Intel's latency advantage makes Apple's SoCs unsuitable for gaming - their latency is better than most Intel and almost all AMD CPUs until recently. Were Intel CPUs unsuitable for gaming a decade ago?

soresu · Mar 6, 2020

More news on the ARM platform GPU open driver front, the Panfrost driver for Midgard and Bifrost generations of Mali GPU's has finally started landing Bifrost (G31-G76) compiler support, not too long after tentative OGL ES 3 support landed.

RK3588 could end up being quite a decent open platform at this rate, for CPU and GPU anyways (assuming it uses G76 of course).

Link here for the Bifrost compiler news.

Link here for OGL ES 3 news.

ultimatebob · Mar 9, 2020

name99 said:
There is this thing called Google...

Mac transition to Intel processors - Wikipedia

en.wikipedia.org

- WWDC announcement (June 2006)
- MacBook Pro and iMac (January2007)
- Mac mini
- Macbook
- Mac Pro (August 2007, HW transition complete)

Yeah... and by 2009, Apple completely dropped support for Power PC processors and those shiny and expensive PowerMac G5's that folks bought just a few years ago became doorstops when you couldn't get new software releases for them anymore. That's probably something that you should think about if you're thinking of buying an Intel Mac Pro right now.

Richie Rich · Mar 10, 2020

ultimatebob said:
Yeah... and by 2009, Apple completely dropped support for Power PC processors and those shiny and expensive PowerMac G5's that folks bought just a few years ago became doorstops when you couldn't get new software releases for them anymore. That's probably something that you should think about if you're thinking of buying an Intel Mac Pro right now.

Transition in 2009 was about an elimination of performance gap to WinXP+x86 as soon as possible. Today Win10+x86 has no advantage over Apple OSX+x86 so there is no reason for such a aggressive move nowadays. IMHO Apple can sell and support both x86 and ARM in parallel for as long as they need to. This will have an advantage for Apple as they can ask premium price for premium ARM performance and battery consumption. And direct comparison of state of the art Apple's ARM A14 vs. x86 will speed up ARM expansion in laptops.

Andrei. · Mar 10, 2020

name99 said:
They don’t HAVE that lockstep — so their implementation of the ISA lags years behind Apple. Where is their mobile support for PAC, for example?

Apple being ahead in ISA is actually a weird contractual anomaly, without Apple, those ISA revisions wouldn't exist. I wouldn't put much worth and weight into the whole thing.

You should expect a whole new lineup of several cores from Arm in 2021 on v9.

Andrei. · Mar 10, 2020

*delete

Andrei. · Mar 10, 2020

*delete

Gideon · Mar 10, 2020

Anandtech has the Graviton review up:

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

Really strong results. Some bandwidth figures are insane. I guess with 64MB L3 and higher TDP/Clocks it would be very close to Rome. There still is a long way to go, but comparing to Graviton 1 the improvements are huge.

AMD/Intel really have their work cut out for them to improve things until Next generation ARM Server CPUs appear.

Though it's also clear that Amazon really tries to avoid direct comparison to Rome. Both from their initial marketing slides and this:

It’s to be noted that we would have loved to be able to include AMD EPYC2 Rome based (c5a/c5ad) instances in this comparison; Amazon had announced they had been working on such deployments last November, but alas the company wasn’t willing to share with us preview access (One reason given was the Rome C-type instances weren’t a good comparison to the Graviton2’s M-type instance, although this really doesn’t make any technical sense). As these instances are getting closer to preview availability, we’ll be working on a separate article to add that important piece of the puzzle of the competitive landscape.

I really hope they haven't cancelled their Rome instances for political reasons.

Nothingness · Mar 10, 2020

Gideon said:
Anandtech has the Graviton review up:

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

Really strong results.

Single thread results are quite good as Andrei hinted

And both for integer and FP, which surprises me a bit.

MT scaling does not look good for some of the tests. I wonder if this isn't due to hardware prefetchers being too aggressive.

Anyway this looks competitive against AMD and Intel.

Andrei. · Mar 10, 2020

Nothingness said:
Single thread results are quite good as Andrei hinted And both for integer and FP, which surprises me a bit.

MT scaling does not look good for some of the tests. I wonder if this isn't due to hardware prefetchers being too aggressive.

Anyway this looks competitive against AMD and Intel.

Prefetchers are a lot more lax than mobile A76's, so it's not even being as aggressive as it can be. The thing is just cache and bandwidth starved at high core counts, it should have had at least 64MB, 128MB even better, this is one aspect where I expect Rome to beat it quite easily, and I'm quite worried about Ampere's 80-core 32MB unit.

joesiv · Mar 10, 2020

Gideon said:
Anandtech has the Graviton review up:

AnandTech Forums: Technology, Hardware, Software, and Deals

Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

www.anandtech.com

Really strong results. Some bandwidth figures are insane. I guess with 64MB L3 and higher TDP/Clocks it would be very close to Rome. There still is a long way to go, but comparing to Graviton 1 the improvements are huge.

AMD/Intel really have their work cut out for them to improve things until Next generation ARM Server CPUs appear.

Yeah, the results are better than I thought they would be. The key to me is that amazon didn't do too much custom work to these CPU's, and apple has the exact same access to this silicon IP. It was more of an eye opener for Amazon to spin off a custom CPU (to me), since they have no history in it (before gravaton 1), wheras apple has been spinning their own custom ARM APUs for a long time in the mobile space. This should be a cinch.

Imagine a 64 core gravaton like, or better? CPU in a mac pro? or scale back, 16, 32 core imacs? seems skys the limit.

Gideon said:
Though it's also clear that Amazon really tries to avoid direct comparison to Rome. Both from their initial marketing slides and this:

I really hope they haven't cancelled their Rome instances for political reasons.

I don't think it's a problem, they just need to make sure the rome instances are priced 'accordingly' if you know what I mean. More profit for amazon if people want the Rome instances, and if they're want the discounted gravaton instances, then they probably make good profit there too because of the lack of middle men.

jpiniero · Mar 15, 2020

https://pbs.twimg.com/media/ETI4ymHUYAEswK-?format=jpg&name=small

Probably fake but a GB5 score has appeared for the 2020 iPhone. 1658/4612 is really good for having 2 big cores and 4 small.

scannall · Mar 15, 2020

jpiniero said:
https://pbs.twimg.com/media/ETI4ymHUYAEswK-?format=jpg&name=small

Probably fake but a GB5 score has appeared for the 2020 iPhone. 1658/4612 is really good for having 2 big cores and 4 small.

The frequency is interesting (assuming it's not fake, big assumption) at 3.1 Ghz. Might be some insight on how TSMC's 5nm process performs.

Solved! ARM Apple High-End CPU - Intel replacement

Senior member

Senior member

Senior member

Senior member

Senior member

Diamond Member

Senior member

Golden Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Lifer

Senior member

Senior member

Senior member

Senior member

Platinum Member

Diamond Member

Senior member

Member

Lifer

Golden Member