Solved! ARM Apple High-End CPU - Intel replacement


Richie Rich

Senior member
Jul 28, 2019
470
229
76
The first rumor of an Intel replacement in Apple products is out:
  • ARM-based high-end CPU
  • 8 cores, no SMT
  • IPC +30% over Cortex A77
  • desktop performance (Core i7/Ryzen R7) with much lower power consumption
  • introduction with the new-gen MacBook Air in mid-2020 (MacBook Pro and iMac also under consideration)
  • massive AI accelerator

Source Coreteks:
 
  • Like
Reactions: vspalanki
Solution
What an understatement :D And it looks like it doesn't want to die. Yet.


Yes, the A13 is competitive against Intel chips, but the emulation tax is about 2x. So given that A13 ~= Intel, for emulated x86 programs you'd get half the speed of an equivalent x86 machine. This is one of the reasons they haven't yet switched.

Another reason is that it would prevent the use of Windows on their machines, something some say is very important.

The level of ignorance in this thread would be shocking if it weren't depressing.
Let's state some basics:

(a) History. Apple has never let backward compatibility limit what they do. They are not Intel, they are not Windows. They don't sell perpetual compatibility as a feature. Christ, the big...

Doug S

Platinum Member
Feb 8, 2020
2,269
3,521
136
That's my whole point. The Graviton 2 is substantially different from the A13 because it's not narrowly optimized for single-threaded burst performance like the A13 is. My whole problem with this thread is how certain people have been implying that the A13 core could be akin to a drop-in solution that would be successful across a wide variety of workloads just as it is. Scaling the A13 up to succeed in more diverse, multithreaded workloads would probably require some serious architectural changes, which, from what I've read in this thread, would lower the single-thread/IPC performance substantially.


The A13 ITSELF would not work well for heavily multithreaded loads, because it has only two big cores. But the CPU cores in the A13 would work perfectly for multithreaded loads, in a larger design that had more cores. There's no such thing as optimizing a CPU core for ONLY single threaded performance.

There are no changes to the core needed to handle large multithreaded workloads, but it would need changes to the uncore - more/wider memory controllers, a bigger switch/ring fabric to connect the cores.

You claim the A13 is somehow optimized not only for single threaded performance but single threaded "burst" performance. That's simply not possible; the only thing that keeps the A13 from running flat out 24x7 is that it has no cooling in the iPhone and is powered by a battery. Give it even minimally better cooling (as it would get in even a MacBook Air form factor without a fan, simply by having a much larger surface across which to dissipate the heat) and plug it into an AC outlet and it will run forever at maximum performance.

Getting the exact same maximum performance at the exact same clock rate out of 32 instead of 2 big cores would have nothing to do with the design of the core, and everything to do with the design of the uncore. Obviously no one scales perfectly (on real loads) in a 32-core design, but how close you come to the ideal is determined by how good your uncore is (and the number of memory channels it can access). But such a 32-way design would be able to reach the same single-threaded performance. There might be a few more wait states to higher levels of cache due to the larger fabric, but that (L3/L4) cache would be a lot bigger, so the two factors would pretty much cancel out.
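The uncore-versus-core argument can be put in back-of-envelope terms. The sketch below is purely illustrative (all bandwidth numbers are invented, not A13 measurements): once aggregate demand exceeds what the memory controllers supply, scaling flattens regardless of how good the core itself is.

```python
# Back-of-envelope sketch (not a real model): past some core count,
# scaling is capped by the uncore/memory subsystem, not the core
# design. All numbers below are invented for illustration.

def effective_speedup(n_cores, bw_per_core, total_mem_bw):
    """Speedup over 1 core when aggregate bandwidth demand may
    exceed what the memory controllers can supply."""
    if n_cores * bw_per_core <= total_mem_bw:
        return float(n_cores)           # uncore keeps up: ideal scaling
    return total_mem_bw / bw_per_core   # bandwidth-bound plateau

# Hypothetical: each core wants 5 GB/s, the uncore supplies 100 GB/s.
for n in (2, 8, 16, 32, 64):
    print(f"{n:2d} cores -> {effective_speedup(n, 5, 100):.1f}x")
```

Raising `total_mem_bw` (more/wider memory controllers) moves the plateau upward, which is exactly the "changes to the uncore" being argued for here.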
 

name99

Senior member
Sep 11, 2010
404
303
136
The first ARM Mac will almost certainly be the MBA. (Or a new consumer-focused product, and to put it bluntly, a Facebook machine, which is why the new iPad Pro makes me think they might not even do that.)

If there were an MBP coming this fall, you'd have heard about the software developers working on software for it already.

The first ARM Mac will be a developer machine. Probably something like a Mac mini, though maybe not as elegant. (Cheap, simple, doesn't carry implications about the look of future consumer machines.)

Compare the Intel transition:
That used something like a Mac Pro case. Conceivably Apple could use today's mac pro case, but I'm guessing that's too expensive, and nicer than most developers need?

The PPC mac mini already existed at the time of WWDC 2005. So why wasn't it used?
It had only been around for a few months by the time of WWDC. Likely it was happening in a different group that maybe didn't even talk to the Intel group? And things were too locked into the Mac Pro style case by that point?

Also, today much more so than back then, most people just don't care about cards and internal expansion any more; unless you have REALLY specialized needs, you can do pretty much everything (including serious code development) with a Mac mini-style machine and maybe a few external SSDs, even maybe an external GPU.

Of course the first CONSUMER machine...
Yeah, I'd agree MBA makes sense. For Intel Apple went with MBA and MBP, and a month later mac mini. I could see them doing the same here, only all three on the first day.
MBA for volume, MBP to show "we can make a really high end laptop". And mac mini because it's easy-ish and shows "AND we can make a really fast desktop".
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
The A13 ITSELF would not work well for heavily multithreaded loads, because it has only two big cores. But the CPU cores in the A13 would work perfectly for multithreaded loads, in a larger design that had more cores. There's no such thing as optimizing a CPU core for ONLY single threaded performance.

There are no changes to the core needed to handle large multithreaded workloads, but it would need changes to the uncore - more/wider memory controllers, a bigger switch/ring fabric to connect the cores.

My point in bringing up the single thread performance is the A13's cache structure. It has a huge L1 cache and a huge L2 cache, that doubtless helps tremendously in achieving high IPC for single threaded workloads. Obviously the cache structure would have to be redesigned to make it more scalable and performant in multithreaded workloads, which would probably result in lower single threaded performance.

You claim the A13 is somehow optimized not only for single threaded performance but single threaded "burst" performance. That's simply not possible; the only thing that keeps the A13 from running flat out 24x7 is that it has no cooling in the iPhone and is powered by a battery. Give it even minimally better cooling (as it would get in even a MacBook Air form factor without a fan, simply by having a much larger surface across which to dissipate the heat) and plug it into an AC outlet and it will run forever at maximum performance.

Someone posted a graph of the voltage/frequency curve for the A12 a few pages back, and it didn't look too convincing. The voltage spikes big time at around the 2.6 GHz mark, so can you imagine the power draw if you put even more of these cores on a single die with a more robust uncore? I just don't see the A13, or a CPU like it, being competitive with Intel and AMD in heavier multithreaded workloads without a substantial redesign of the entire CPU, and not just the uncore. The entire microarchitecture would need to be redesigned, I believe.
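The worry about the voltage spike can be made concrete with the classic dynamic-power relation P ≈ C·V²·f. The V/f points below are invented to mimic the "knee" shape described, not real A12 measurements:

```python
# Illustrative sketch of dynamic switching power, P ~ C_eff * V^2 * f.
# The (frequency, voltage) points are invented to mimic the "knee"
# described above; they are NOT real A12 measurements.

def dynamic_power(c_eff, volts, freq_ghz):
    """Dynamic switching power in arbitrary units."""
    return c_eff * volts ** 2 * freq_ghz

vf_curve = [          # hypothetical (GHz, V) operating points
    (2.0, 0.75),
    (2.4, 0.85),
    (2.6, 1.00),      # the "knee": voltage starts climbing fast
    (2.8, 1.20),
]

for f, v in vf_curve:
    print(f"{f} GHz @ {v} V -> {dynamic_power(1.0, v, f):.2f} units")
```

With these made-up numbers, a 40% clock bump (2.0 to 2.8 GHz) costs roughly 3.6x the dynamic power, which is why many-core designs usually run each core lower on the curve rather than at the single-core peak.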

Also, the A13 only has NEON.

 

name99

Senior member
Sep 11, 2010
404
303
136
So we can't compare a collection of individual benchmarks to form a custom suite, we have to stick to Spec's collection of tests?

The reality is these CPUs will be tested by customers on their own optimized setups with their own actual flow being tested. Everything else is just talking points but as just consumers on a consumer forum, that's all we really have, right? The only way we'd have any info to go off of would be someone to release their internal testing (which almost no one will do).

Look, I'm not trying to downplay what ARM is doing in the server space; they've made a ton of progress and are starting to become a real threat to x86. I'm just trying to bring some perspective compared to the marketing from ARM partners, which is all we have to go off of because they haven't (and maybe won't) release test systems for independent reviewers to publish their results. My personal opinion is that ARM isn't quite there yet with this generation, but the next generation could be a whole different story, especially with Intel continuing to struggle to put anything really competitive out in this space. The next generation or two may come down to how valuable system admins think sticking with x86 and using AMD would be, versus switching to an ARM ecosystem.

Uhh, you apparently ARE trying to downplay what ARM is doing in the server space.
Graviton2 has been tested by "many customers on their own optimized setups with their own actual flow being tested". Here's today's version of the story:

You're just unwilling to accept that there's been a steady stream of these sorts of results for graviton2 in the data warehouse space.
The complaints now are basically (let's be honest) "I can't buy a PC with a kickass ARM SoC in it and run my own tests".
Which is true. But it is not the same thing as "there are no tests out there showing how well ARM works for customer workloads".
 

Hitman928

Diamond Member
Apr 15, 2012
5,321
8,005
136
Uhh, you apparently ARE trying to downplay what ARM is doing in the server space.
Graviton2 has been tested by "many customers on their own optimized setups with their own actual flow being tested". Here's today's version of the story:

I'm sorry, but a single plot on a blog post is not exactly what I was alluding to. Also, I can't actually see the plot in your blog link, not sure why, but I'll just take the performance increase they give in the text at face value.

As we have shown, m6g.4xlarge is up to 30 percent faster than the current generation m5.4xlarge instance type.

So less than the 40% being advertised by Amazon. How much faster is Rome over Skylake-SP and at what power? Maybe even Icelake-SP coming soon? Those are Graviton's real competitors, not Skylake.

You're just unwilling to accept that there's been a steady stream of these sorts of results for graviton2 in the data warehouse space.
The complaints now are basically (let's be honest) "I can't buy a PC with a kickass ARM SoC in it and run my own tests".
Which is true. But it is not the same thing as "there are no tests out there showing how well ARM works for customer workloads".

If you have further tests to post, feel free to post here, I'm always open to more data. I for one never said I wanted a PC with ARM inside or to see tests with one, but there are sites and groups out there who receive server test systems (or sometimes remote access to one) all the time for 3rd party testing. Servethehome was able to get a hold of a ThunderX2 test system for review and has tested many x86 systems over the years. It's not like it's an unusual thing to see server processors / systems being tested by an independent 3rd party.

And if you want some good indepth blog posts, remember when Cloudflare announced that their next gen servers would be powered by ARM and that they would be "Intel free" by the end of the year?


What happened there?


We looked very seriously at ARM-based CPUs and continue to keep our software up to date for the ARM architecture so that we can use ARM-based CPUs when the requests per watt is interesting to us.
In the meantime, we've deployed AMD's EPYC processors as part of Gen X server platform and for the first time are not using any Intel components at all.

Well, I guess the Intel free part came true.

Again, I'm not saying that ARM isn't becoming a threat and making inroads; I'm just trying to pierce all the marketing to see where they actually are today, both from a performance perspective and a platform perspective. IMO, they're not quite there yet this gen but might really start to make some waves next gen. Being consistent with performance and platform improvements while building the needed relationships will be key for ARM vendors going forward, just as it has been for AMD.
 

name99

Senior member
Sep 11, 2010
404
303
136
My point in bringing up the single thread performance is the A13's cache structure. It has a huge L1 cache and a huge L2 cache, that doubtless helps tremendously in achieving high IPC for single threaded workloads. Obviously the cache structure would have to be redesigned to make it more scalable and performant in multithreaded workloads, which would probably result in lower single threaded performance.



Someone posted a graph of the voltage/frequency curve for the A12 a few pages back, and it didn't look too convincing. The voltage spikes big time at around the 2.6 GHz mark, so can you imagine the power draw if you put even more of these cores on a single die with a more robust uncore? I just don't see the A13, or a CPU like it, being competitive with Intel and AMD in heavier multithreaded workloads without a substantial redesign of the entire CPU, and not just the uncore. The entire microarchitecture would need to be redesigned, I believe.

Also, the A13 only has NEON.

"Obviously the cache structure would have to be redesigned to make it more scalable and performant in multithreaded workloads, which would probably result in lower single threaded performance."

Why? Like so many x86 people, you simply cannot understand that there are DIFFERENT WAYS to solve the same problem.
Look at something like the Graviton 2 topology
Look at something like A64FX

There are MANY ways to solve these issues.
For example Apple could create a hierarchical system. Fundamental units are 4 CPUs + a large L2, and multiples of these "tiles" share a distributed LLC. Look at what Graviton did.
With a performant enough L2, you can even scale this up to 8 cores sharing an L2 without difficulty (look at A64FX).
With few enough of these larger tiles (say 4 or 8) a ring is fine, and the NoC uses two-level addressing: first to a station, then within the station (which is probably what Apple is already doing).
It's simply nonsense to claim that "scaling up number of cores" is some crazy hard problem that only x86 knows how to do properly.
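The "first to a station, then within the station" addressing can be sketched in a few lines. Tile size and the 8-tile layout below are assumptions for illustration, not Apple's actual design:

```python
# Toy model of two-level NoC addressing: route first to a "station"
# (tile) on the ring, then to a core within it. Tile size and layout
# are hypothetical.

CORES_PER_TILE = 4   # assumption: 4 CPUs sharing a large L2
N_TILES = 8          # assumption: 32 cores total on a ring

def route(core_id):
    """Split a flat core id into (tile, local core) coordinates."""
    tile, local = divmod(core_id, CORES_PER_TILE)
    assert tile < N_TILES, "core id out of range"
    return tile, local

assert route(0) == (0, 0)
assert route(13) == (3, 1)    # 4th tile, 2nd core within it
assert route(31) == (7, 3)    # last tile, last core
```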

A more serious critique would worry about things that impact cross-core operations, things like locks. But ARM has a better ISA here (better scope for specifying just how much coherency you require and no more) and similarly scoped atomics.

And the A13 doesn't just have NEON. Even apart from the A14 probably getting SVE, the A13 also has AMX. We don't know much about it yet, but it's there on the core, and it's claimed to give 6x the throughput of the 3 NEON pipes together. We'll probably learn a whole lot more about it (and get compilers that target it) with WWDC and the Xcode release at that point.
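Taking the claim at face value, the arithmetic is simple. The sketch below assumes 3 NEON pipes of 128 bits each doing FP32 FMAs (2 FLOPs per lane per cycle); the 6x multiplier is the claimed figure, not a measurement:

```python
# Rough throughput arithmetic behind the "6x the three NEON pipes"
# claim. Assumptions: 3 NEON pipes, 128-bit wide, FP32 FMA counted
# as 2 FLOPs per lane per cycle. The 6x factor is the claim, not
# a measurement.

neon_pipes = 3
fp32_lanes = 128 // 32        # 4 FP32 lanes per 128-bit pipe
flops_per_lane = 2            # fused multiply-add = mul + add

neon_flops_per_cycle = neon_pipes * fp32_lanes * flops_per_lane   # 24
amx_flops_per_cycle = 6 * neon_flops_per_cycle                    # 144

print(f"NEON: {neon_flops_per_cycle} FLOP/cycle, "
      f"claimed AMX: {amx_flops_per_cycle} FLOP/cycle")
```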
 
  • Like
Reactions: Etain05

yeshua

Member
Aug 7, 2019
166
134
86
Well, it's now semi-official:


Intel has very little time to dig itself out of the grave of its never-ending 10nm troubles and its yet-to-be-released 7nm node.
 
  • Like
Reactions: Nothingness

OriAr

Member
Feb 1, 2019
63
35
91
Well, it's now semi-official:


Intel has very little time to dig itself out of the grave of its never-ending 10nm troubles and its yet-to-be-released 7nm node.
1. ARM Macs have been coming since 2012, I will believe it when I see it launched.
2. If it's true, it more has to do with the fact that Apple always wanted maximum control over its own hardware rather than any issues Intel had with 10nm.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
2. If it's true, it more has to do with the fact that Apple always wanted maximum control over its own hardware rather than any issues Intel had with 10nm.
IMHO it's not about HW control this time. In 2021 there will be plenty of Cortex A78-based chips for ultra-thin laptops which will clearly outperform Intel and AMD (at least the new 8cx from Qualcomm). Apple is being pushed to migrate to ARM; otherwise they would be outperformed by Cortex cores (a similar situation to when PowerPC was outperformed by Intel). It's a purely defensive move that minimizes damage, typical of Tim Cook. Steve Jobs would have moved to ARM much earlier; he was the more offensive type (but you need a vision for that).
 

Nothingness

Platinum Member
Jul 3, 2013
2,422
754
136
So we can't compare a collection of individual benchmarks to form a custom suite, we have to stick to Spec's collection of tests?
That's not what I mean. You were the one asking if we should throw out SPEC because microbenchmarks give a different picture ;) The answer is no, especially when one of the benchmarks is Dhrystone...

The reality is these CPUs will be tested by customers on their own optimized setups with their own actual flow being tested. Everything else is just talking points but as just consumers on a consumer forum, that's all we really have, right? The only way we'd have any info to go off of would be someone to release their internal testing (which almost no one will do).
I fully agree with that! But that doesn't mean we should make apples-to-oranges comparisons and be happy with that.

Look, I'm not trying to downplay what ARM is doing in the server space; they've made a ton of progress and are starting to become a real threat to x86. I'm just trying to bring some perspective compared to the marketing from ARM partners, which is all we have to go off of because they haven't (and maybe won't) release test systems for independent reviewers to publish their results. My personal opinion is that ARM isn't quite there yet with this generation, but the next generation could be a whole different story, especially with Intel continuing to struggle to put anything really competitive out in this space. The next generation or two may come down to how valuable system admins think sticking with x86 and using AMD would be, versus switching to an ARM ecosystem.
I think ARM is already a contender this generation for some workloads. Not all workloads need the highest possible performance (which undoubtedly is Rome, except perhaps for some HPC workloads). You need good enough performance for most workloads and the right price. And Graviton2 is there.

The problem is rather that IT departments prefer Intel no matter what. When you see AMD perf results you wonder why, but that's how things are... at the moment.
 

Nothingness

Platinum Member
Jul 3, 2013
2,422
754
136
My point in bringing up the single thread performance is the A13's cache structure. It has a huge L1 cache and a huge L2 cache, that doubtless helps tremendously in achieving high IPC for single threaded workloads. Obviously the cache structure would have to be redesigned to make it more scalable and performant in multithreaded workloads, which would probably result in lower single threaded performance.
The L2 cache is large because it's shared. Think of the L3 cache on Intel/AMD but with lower latency. And if you think that's a big deal, ARM went from a shared L2 on the Cortex-A73 to a private L2 on the Cortex-A75, a derivative of the A73 released a year later. Do you really think Apple can't do such a change?
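A crude average-memory-access-time (AMAT) model shows why a large, low-latency shared L2 can play the role of a conventional L3. All hit rates and latencies below are invented examples, not measured A13 or x86 numbers:

```python
# Crude AMAT (average memory access time) sketch. Every level is
# probed in sequence; misses fall through to the next level. All
# hit rates and latencies are invented illustrative numbers.

def amat(levels, mem_latency):
    """levels: list of (hit_rate, latency_cycles), innermost first."""
    cycles, reach = 0.0, 1.0          # reach = fraction of accesses
    for hit_rate, latency in levels:  #         that get this deep
        cycles += reach * latency
        reach *= (1.0 - hit_rate)
    return cycles + reach * mem_latency

# Hypothetical "big L1 + big shared L2, no L3" (A13-like shape):
two_level = amat([(0.95, 3), (0.90, 14)], 150)
# Hypothetical "small L1 + small L2 + big shared L3" (x86-like shape):
three_level = amat([(0.90, 4), (0.70, 12), (0.85, 40)], 150)

print(f"two-level: {two_level:.2f} cycles, three-level: {three_level:.2f} cycles")
```

With these particular made-up inputs the flat two-level hierarchy wins, but shifting any hit rate or latency can flip the result, which is the point: neither structure is inherently unscalable.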

Also, the A13 only has NEON.
And the previous generation of AMD CPUs only had a 128-bit FPU. Did that make them stink? Or useless?
 

Nothingness

Platinum Member
Jul 3, 2013
2,422
754
136
1. ARM Macs have been coming since 2012, I will believe it when I see it launched.
So will I.

2. If it's true, it more has to do with the fact that Apple always wanted maximum control over its own hardware rather than any issues Intel had with 10nm.
Do you really think that they would release something that's worse than what they already have just to have control? No, it has to be competitive against what they currently offer, and no matter what you think their CPU are competitive for some segments.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Will you be revisiting this prediction anytime soon? In case you are wrong? It's hard enough to get anything A76-based today. What makes you think there will be widespread A78 next year?
Because:
- A76 was widespread in 2019 (phones and MS Surface X)
- A77 spreads this year, 2020
- A78 will spread in 2021

A Cortex A78 core with higher IPC than Zen 3? That could be a strong accelerator for ARM in laptops. Also the first useful substitute for Apple's high-performance cores. I admit I'm maybe too optimistic on this.
 

soresu

Platinum Member
Dec 19, 2014
2,665
1,865
136
Because:
- A76 was widespread in 2019 (phones and MS Surface X)
- A77 spreads this year, 2020
- A78 will spread in 2021
All bets are off for next year until we know how bad the impact of the current problem is; for sure it will take a huge chunk out of the potential market share of the A77 this year too.

As for laptops with ARM, there's still too little on the software front. MS is making waves with Collabora to translate OGL 3.3 and OCL 1.2 to DX12, which should shore up feature compatibility for Windows on ARM, but there's still a long way to go before it matches x86 Windows for compatibility's sake.
 

Doug S

Platinum Member
Feb 8, 2020
2,269
3,521
136
Well, it's now semi-official:


Intel has very little time to dig itself out of the grave of its never-ending 10nm troubles and its yet-to-be-released 7nm node.



In what world does a story where the byline starts "Apple is predicted" mean anything has changed from the hundreds of similar articles that have been written over the years?

If Apple is going to introduce ARM Macs, Intel has no time to "dig itself out of the grave". Once Apple has made the decision, Intel having something on the drawing board they will try to claim will be faster than whatever Apple will have at the same time won't make any difference.

Besides, why should ANYONE, least of all Apple, believe Intel will be able to out-execute TSMC in the next five years when they've so utterly failed to do so in the previous five? If they show Apple a test system containing an early 7nm CPU this fall and say "look how great this is" it means nothing. They were showing off 10nm CPUs to partners in test systems about five years ago and only in the past six months or so have begun selling them in any real quantity.
 

name99

Senior member
Sep 11, 2010
404
303
136
Only in the big cores I think?

Well duh! Why is that relevant?
The L2 cache is large because it's shared. Think of the L3 cache on Intel/AMD but with lower latency. And if you think that's a big deal, ARM went from a shared L2 on the Cortex-A73 to a private L2 on the Cortex-A75, a derivative of the A73 released a year later. Do you really think Apple can't do such a change?


And the previous generation of AMD CPUs only had a 128-bit FPU. Did that make them stink? Or useless?

As did Intel. Penryn-class CPUs had a large L2 shared between two cores; then Intel moved to a small private L2 and a large shared L3 with Nehalem.

These toggles between different cache structures are trivial in the grand scheme of things; any of these companies can (and do) flip between them based on many other considerations.
There is no "best", insofar as best depends on a dozen variables. Not just your process and your GHz target, but also things like the nature of your prefetchers, or how much intelligence is in your cache (placement and replacement algorithms, compression?, even cutting-edge things like using the TLB to store a way directory). If you design with chiplets, that's yet another important variable.

Just because Intel felt it made sense to pivot to a small L2 and large L3 doesn't mean that this will be the right choice for Apple's future high-core-count SoCs. (Or that it makes sense for Intel to stay there, except insofar as they now appear to be utterly terrified to change ANYTHING about their CPUs.)
 

Hitman928

Diamond Member
Apr 15, 2012
5,321
8,005
136
That's not what I mean. You were the one asking if we should throw SPEC because microbenchmarks give a different picture ;) The answer is no, especially when one of the benchmarks is Dhrystone...

Perhaps I wasn't clear enough that I was being facetious, but I was doing so to make a point. I also don't know why you keep calling test collections outside of SPEC "microbenchmarks", seemingly trying to discredit them, as if SPEC isn't itself made up of a collection of "microbenchmarks".

Additionally, I'm not aware of any evidence of AOCC using cheats like ICC has been accused of in the past. On the contrary, AOCC is built upon current versions of LLVM and then given additional Zen optimizations (and increased compile time to try and produce the fastest code). That gives AMD data on which optimizations produce the best real world effects so that those optimizations can then be incorporated into future industry standard compilers. From my understanding ARM does the same but upstreams their optimizations into GCC so you get the optimizations earlier with that compiler compared to Zen (if product timelines were equivalent).
 

DrMrLordX

Lifer
Apr 27, 2000
21,640
10,858
136
Because:
- A76 was widespread in 2019 (phones and MS Surface X)
- A77 spreads this year, 2020
- A78 will spread in 2021

In phones. You still can't get a reasonably priced A76 (or A77) SBC in the United States. Want something A76-like in a laptop? 8cx or bust; there's just no way to buy these cores in any other notebook or desktop form factor. Where are the "serious" A77 machines this year? I don't see them.

A76 was widespread in 2018, furthermore. It took over a year to get it anywhere that wasn't a phone. A77 is new in phones this year. I would not expect much/any access to A77 outside of mobile until NEXT year.

As for Ampere and ThunderX3 . . . still waiting!

Well duh! Why is that relevant?

The small cores only have one NEON unit, I think, but Graviton2 has them on every core. It's notable that Amazon (and really it was ARM with the Neoverse reference design) committed all that silicon to SIMD while Apple mostly didn't. One's a phone SoC and the other isn't. Consider the previous context of this thread as to why that's relevant.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
@ name99 and Nothingness, my argument doesn't preclude Apple from adjusting or tweaking the cache hierarchy or even the microarchitecture of the A series to make it more competitive against Intel and AMD's best in multithreaded workloads. I'm well aware Apple has the engineering talent and resources to pull it off.

I think you guys need to read this thread again from front to back. Certain people in this thread have been going on and on about the single-threaded IPC of Apple's A-series CPUs and how superior it is to x86-64 CPUs, claiming that all Apple would need to do is scale the chip up to 8 cores (without touching the cache hierarchy or the microarchitecture) and it would destroy any Intel or AMD CPU in overall performance. My point is that many of those people have not taken into consideration the changes that would need to be made to the overall design of the A-series CPUs to make them more scalable and performant in multithreaded workloads, and that these necessary changes would almost certainly result in a significant reduction of single-threaded performance.

Essentially, they are acting as though single-threaded performance is the be-all and end-all of CPU performance, when it's not. If the A series were such a winning design for other workloads, we would have seen it in x86 land already. The fact that AMD and Intel are both pursuing similar designs tells me all I need to know about the most ideal designs for these types of workloads.