Solved! ARM Apple High-End CPU - Intel replacement

Richie Rich · Oct 14, 2019

There is a first rumor about an Intel replacement in Apple products:

ARM based high-end CPU
8 cores, no SMT
IPC +30% over Cortex A77
desktop performance (Core i7/Ryzen R7) with much lower power consumption
introduction with new gen MacBook Air in mid 2020 (considering also MacBook Pro and iMac)
massive AI accelerator

Source Coreteks:

Nothingness · Mar 22, 2020

Richie Rich said:
Looks like some people have a problem accepting Apple's monstrous transistor count, the same way they had a problem accepting its IPC.

If you were paying attention to what people write you'd know I'm not one of those. As many told you, the way you count is stupid and doesn't demonstrate anything except that you don't want to learn and think you are always right.

Explanation is simple: Apple's enormous IPC performance has to come from somewhere.

Not only from the number of transistors, but you know that, right?

name99 · Mar 22, 2020

amrnuke said:
All exactly fair points, which is why I admitted to "blindly" multiplying things out. However, back to the original point: we don't know how many transistors are on each core, and Richie Rich is wrong to say that A13 has double when he has zero evidence to back that up. Any rudimentary calculations made in good faith point toward him being wrong. I'm open to changing my mind if further data comes to light.

In summary, as you state, if Apple focused on parallelism and high clocks, they would have to make their core less dense, just as AMD have done. If AMD wanted to focus on single threaded IPC and efficiency, they could make their cores more dense, and clock them lower, just as Apple have done. These are chips designed for completely different market segments. I am not sure why he is hell-bent on comparing them.

Well, one reason to compare them is that most of us believe that soon they will be in comparable market segments, when Apple ships ARM Macs.
And it is interesting as a TECHNICAL point how they differ because we can extrapolate, based on their technological underpinnings, how each will be able to evolve over say the next five years.

Doug S · Mar 22, 2020

naukkis said:
Both Zen2 and A13 are built on the same process, so they will both have about equal transistor count/mm2. The different chip-level transistor densities come from the area used for things other than CPU core logic: GPU and caches are much denser than CPU cores.

Yes, THIS!

People who are taking total transistor counts on a chip, dividing by the size, and using the resulting transistors/mm^2 figure to extrapolate the core transistor counts by multiplying by the area of the core are CLUELESS!

It is like taking how much a new house cost to build, dividing it by its size in ft^2/m^2, and thinking that will tell you how much it will cost to add a bathroom onto your house. (hint: it won't, because bathrooms and kitchens cost far more to build per unit of size than bedrooms and living rooms, similar to how CPU cores are far less dense in transistors/mm^2 than caches or even GPUs/NPUs)
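A minimal sketch of the point above, using made-up numbers (none of the areas or densities below are measurements of any real chip): when a die mixes dense blocks such as cache and GPU with a less dense CPU core, the die-wide average density overestimates the core's transistor count.

```python
# Hypothetical die: (area in mm2, density in Mtr/mm2) per block.
# All values are illustrative assumptions, not data for Zen2 or A13.
blocks = {
    "cpu_core": (3.0, 40.0),
    "cache": (5.0, 110.0),
    "gpu": (4.0, 90.0),
}

total_area = sum(area for area, _ in blocks.values())
total_mtr = sum(area * density for area, density in blocks.values())
average_density = total_mtr / total_area  # what "transistors / die size" gives you

core_area, core_density = blocks["cpu_core"]
naive_core_mtr = core_area * average_density  # the extrapolation being criticized
actual_core_mtr = core_area * core_density    # what the core holds in this toy die

print(f"average density: {average_density:.1f} Mtr/mm2")
print(f"naive core estimate: {naive_core_mtr:.0f} Mtr, actual: {actual_core_mtr:.0f} Mtr")
```

In this toy die the naive estimate is more than double the actual core count, purely because the dense blocks pull the average up.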

amrnuke · Mar 23, 2020

Richie Rich said:
You can compare core either with L2$ or without. Mixing it together is monkey logic.

Including L2 cache:

- Zen2: core Mtr= 3.6mm2 * 52 Mtr/mm2 = 187 Mtr

- A13: core Mtr= 4.5mm2 * 86 Mtr/mm2 = 387 Mtr ..... 2.1x more transistors

Excluding L2 cache:

- Zen2: core Mtr= 2.7mm2 * 52 Mtr/mm2 = 140 Mtr

- A13: core Mtr= 2.6mm2 * 86 Mtr/mm2 = 223 Mtr ...... 1.6x more transistors

Looks like some people have a problem accepting Apple's monstrous transistor count, the same way they had a problem accepting its IPC.

You are right, we must normalize for L2$ or lack thereof, because it's a true apples-to-oranges comparison to compare a Zen2 core + 512KB dedicated L2$ to an A13 core + 8MB shared L2$. L2$ takes up a lot of die space and is higher density, skewing the results.
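For reference, the quoted figures can be reproduced in a few lines. This sketch only re-runs the arithmetic from the quoted post (core area times whole-chip average density); it says nothing about whether the average density actually applies to the core.

```python
# Inputs copied from the quoted post: core area (mm2) and whole-chip density (Mtr/mm2).
DENSITY = {"Zen2": 52, "A13": 86}
AREA_WITH_L2 = {"Zen2": 3.6, "A13": 4.5}
AREA_WITHOUT_L2 = {"Zen2": 2.7, "A13": 2.6}

def core_mtr(areas):
    """Estimated transistors per core (millions) = area * average density."""
    return {chip: areas[chip] * DENSITY[chip] for chip in areas}

for label, areas in (("including L2", AREA_WITH_L2), ("excluding L2", AREA_WITHOUT_L2)):
    est = core_mtr(areas)
    ratio = est["A13"] / est["Zen2"]
    print(f"{label}: Zen2 {est['Zen2']:.0f} Mtr, A13 {est['A13']:.0f} Mtr, ratio {ratio:.1f}x")
```

The only difference from the quoted numbers is rounding: 2.6 * 86 is 223.6, which the post truncates to 223.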

Excluding L2$:
The Zen2 core 2.7-2.87 mm2 (smaller than before, because we are excluding L2$)
The A13 core 2.6-2.61 mm2 (same as before)

Including 8MB L2$ on each:
Since you're using specint2006 (single-threaded) as your benchmark of choice, the Lightning core would be able to use all 8MB of the shared L2$. (In reality, for reasons unknown, likely because 2MB may still be earmarked as L2E for the Thunder cores, it only uses 6MB at a really brisk rate. That's really Apple's decision, but nonetheless the 8MB L2$ is shared between the two Lightning cores and, technically, a single-threaded application would use all 8MB.)

L2$ area for Zen2 = 0.8mm2 or so, times 16 to get 8MB for the core - that gives us 0.8mm2 x 16 = 12.8mm2 additional die area.

Zen2 + 8MB Zen2-density L2$ = 16.1 mm2
A13 + 8MB A13-density L2$ = 4.5 mm2

If we just include same L2$ size from A13 and apply it to Zen2, to try to make it more even:
Zen2 + 8MB Apple L2$ = 6.5 mm2
A13 + 8MB Apple L2$ = 6.4 mm2

Even if you do broadly apply transistor density of each chip to its core, in no way would you end up with "double" the transistors on an A13 core unless you're including its ridiculously large L2$. I agree, yes, L2$ has a lot of transistors.

Richie Rich said:
Explanation is simple: Apple's enormous IPC performance has to come from somewhere.

It probably comes from an L2$ that's 16 times the size of that on Zen2.

If you'll reference the cache size sensitivity of specint2006, you'll notice that miss rates are going to vary massively between a 512KB cache and an 8MB cache. Using that single test as your lodestar is misleading you because it can be manipulated so easily, or, in Apple's case, they are designing around single-thread performance and so a massive L2$ and 2 Lightning cores makes much more sense.

Carfax83 · Mar 23, 2020

amrnuke said:
If you'll reference the cache size sensitivity of specint2006, you'll notice that miss rates are going to vary massively between a 512KB cache and an 8MB cache. Using that single test as your lodestar is misleading you because it can be manipulated so easily, or, in Apple's case, they are designing around single-thread performance and so a massive L2$ and 2 Lightning cores makes much more sense.

The cache size argument was brought up in this thread pages ago by several people. The last forumite to bring it up was lobz, especially in relation to how it improves single thread performance.

Which begs the argument. The more I read this thread, the more I realize how absurd it is to draw conclusions on the ARM vs x86 debate using the A series CPUs as a comparison point with Intel and AMD desktop/workstation/server variants.

Multithreaded performance has become much more important these days, even for laptops/desktops. Therefore while single thread performance is still very important, both Intel and AMD (which target more than just mobile) have decided to put their eggs in the same basket and design CPUs with greater emphasis on parallelism; both at the core/thread level as well as the instruction level with wider SIMDs. Personally I prefer this approach, and I'm glad to have witnessed the evolution of modern CPUs from the Athlon and P4 days to the huge multicore big SIMD monstrosities we have today.

Guys like Richie Rich are implying that if Apple made a laptop or desktop CPU with the A series as a foundation, then those CPUs would retain the very high IPC/single thread performance characteristic without any sacrifice while dramatically increasing multicore performance.

But as you and others have shown, that's not realistic at all.

naukkis · Mar 23, 2020

Carfax83 said:
Guys like Richie Rich are implying that if Apple made a laptop or desktop CPU with the A series as a foundation, then those CPUs would retain the very high IPC/single thread performance characteristic without any sacrifice while dramatically increasing multicore performance.

But as you and others have shown, that's not realistic at all.

And why not? Look at how the A76, optimized for server workloads with server-class cache and memory controllers, performs: its IPC and multicore scores improve a lot, IPC increases 30% and so on. Don't expect an Apple chip to behave any differently. So a desktop-class version of the Apple core will probably have better IPC and much improved multicore speed compared to the phone chip.

Nothingness · Mar 23, 2020

Carfax83 said:
Which begs the argument. The more I read this thread, the more I realize how absurd it is to draw conclusions on the ARM vs x86 debate using the A series CPUs as a comparison point with Intel and AMD desktop/workstation/server variants.

Multithreaded performance has become much more important these days, even for laptops/desktops. Therefore while single thread performance is still very important, both Intel and AMD (which target more than just mobile) have decided to put their eggs in the same basket and design CPUs with greater emphasis on parallelism; both at the core/thread level as well as the instruction level with wider SIMDs. Personally I prefer this approach, and I'm glad to have witnessed the evolution of modern CPUs from the Athlon and P4 days to the huge multicore big SIMD monstrosities we have today.

Guys like Richie Rich are implying that if Apple made a laptop or desktop CPU with the A series as a foundation, then those CPUs would retain the very high IPC/single thread performance characteristic without any sacrifice while dramatically increasing multicore performance.

But as you and others have shown, that's not realistic at all.

The problem in my opinion is that at one end of the spectrum you have that phone/tablet chip that is getting single thread performance close to high end desktop chips from Intel and AMD. At the other end you have a server chip from AWS that is competitive against Intel and AMD chips.

And in the middle nothing. At least nothing I find satisfying. Cortex-A76 and upward chips from ARM are competitive, but please let them run a real OS or at least let the user get rid of Windows.

Richie Rich · Mar 23, 2020

RetroZombie said:
Maybe by having better employees than Raja and Jim Keller

Do not touch Jim Keller. He is able to turn sand into a CPU just by looking at it. There is a rumor he is related to the Chuck Norris family. They have similar faces so who knows...

EDIT: some people might be surprised what Intel will develop under Keller and that's not a joke

b0btehninja · Mar 23, 2020

Well let's hope they pick AMD CPUs instead.

Richie Rich · Mar 23, 2020

amrnuke said:
Even if you do broadly apply transistor density of each chip to its core, in no way would you end up with "double" the transistors on an A13 core unless you're including its ridiculously large L2$. I agree, yes, L2$ has a lot of transistors.

It probably comes from an L2$ that's 16 times the size of that on Zen2.

If you'll reference the cache size sensitivity of specint2006, you'll notice that miss rates are going to vary massively between a 512KB cache and an 8MB cache. Using that single test as your lodestar is misleading you because it can be manipulated so easily, or, in Apple's case, they are designing around single-thread performance and so a massive L2$ and 2 Lightning cores makes much more sense.

1) Regarding transistor count. I don't agree that those two dies have identical transistor density. There was an article about how Zen2 lowered resistivity to support high clocks, but I cannot find the link now.
2) If you are trying to say Apple is able to gain a massive +80% IPC over Zen2 at similar die size and transistor count despite massive L1$ then it's an even worse situation for x86.
3) Thanks for the SPECint cache sensitivity document, very interesting. IMHO it's not Apple's fault that the Zen2 core has just a fraction of the L2 cache compared to A13. The x86 boys had shared L2$ a long time ago too (Core2Duo had a big shared L2$, Bulldozer also had shared L2$).

amrnuke · Mar 24, 2020

Richie Rich said:
1) Regarding transistor count. I don't agree that those two dies have identical transistor density.

I never proposed that.

Richie Rich said:
There was an article about how Zen2 lowered resistivity to support high clocks, but I cannot find the link now.

I think that's general knowledge at this point, no need to link it.

Richie Rich said:
2) If you are trying to say Apple is able to gain a massive +80% IPC over Zen2 at similar die size and transistor count despite massive L1$ then it's an even worse situation for x86.

You are comparing apples to oranges again. Apple has +80% IPC PER GHZ at 2.65 GHz. What is the A13's IPC at 4.6 GHz? Do you expect it to scale up linearly with clock increases?

Similarly, 3950X raw score is 50.02, does it drop that much if you artificially down-clock to 2.65 GHz?

I'm not sure CPUs are all that dissimilar from cars (that is, you can't run one at one speed and another at another speed, and then look at the RPM normalized to speed without considering gear ratios and so on). I have a lot of qualms about normalizing for clock speed, because somewhere in there, engineers have made decisions to permit better performance in one area at the expense of performance in another. A 3950X has basically the same specint2006 score as the A13, but can do a hell of a lot more than the A13 in so many other areas, that it makes little sense to try to "ding" the 3950X for this design decision.
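To make the normalization under discussion concrete, here is a minimal sketch of the per-GHz arithmetic. It uses only numbers mentioned in this thread (the 3950X raw score of 50.02, the 4.6 GHz and 2.65 GHz clocks, and the roughly 83% per-GHz advantage cited in the thread). The A13 raw score is back-computed from those as an assumption, not taken from a benchmark listing.

```python
ZEN2_SCORE = 50.02        # 3950X raw score quoted in the thread
ZEN2_GHZ = 4.6            # clock used for the 3950X in this discussion
A13_GHZ = 2.65            # clock used for the A13 in this discussion
PER_GHZ_ADVANTAGE = 1.83  # "+83% per GHz" claim, assumed exact for this sketch

def score_per_ghz(score, ghz):
    """Clock-normalized score: a rough stand-in for 'IPC' in this debate."""
    return score / ghz

zen2_norm = score_per_ghz(ZEN2_SCORE, ZEN2_GHZ)
# A13 raw score implied by the claim; an inferred value, not a measured one.
a13_implied_score = PER_GHZ_ADVANTAGE * zen2_norm * A13_GHZ

print(f"3950X: {zen2_norm:.2f} points/GHz")
print(f"implied A13 raw score: {a13_implied_score:.1f}")
print(f"raw score ratio A13/3950X: {a13_implied_score / ZEN2_SCORE:.2f}")
```

Under these inputs the raw scores land within a few percent of each other, so most of the per-GHz gap is simply the 4.6 / 2.65 clock ratio.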

Richie Rich said:
3) Thanks for the SPECint cache sensitivity document, very interesting. IMHO it's not Apple's fault that the Zen2 core has just a fraction of the L2 cache compared to A13. The x86 boys had shared L2$ a long time ago too (Core2Duo had a big shared L2$, Bulldozer also had shared L2$).

Exactly. Apple's advantage is, in large part, due to (A) just tacking on L2$ and (B) not having to worry as much about parallel/MT processes, so they don't have to carve out as much transistor space for that and can devote more to ST performance.

They're different chips for different markets designed for different purposes. If AMD were designing for ST performance, they'd probably achieve just as good a result as A13. And if Apple were designing for the phoronix test suite, I'm sure they'd be able to cobble together something competitive.

[Graviton2 discussion moved to Graviton2 thread]

Nothingness · Mar 24, 2020

amrnuke said:
Apple's IPC gain is only "83%" per GHz because you're putting the chips on uneven playing fields. Either up-clock the A13 to 4.6 GHz or downclock the 3950X to 2.65 GHz, then you'll have a more even playing field and a more valid result.

More valid but still not correct. You don't design a micro-architecture the same way when targeting high frequency, where you usually need more pipe stages, hence lower IPC even when you scale down frequency.

IPC alone is uninteresting for performance (but that doesn't mean A13 isn't a very impressive and excellent micro-arch).

Exactly. Apple's advantage is, in large part, due to (A) just tacking on L2$ and (B) not having to worry as much about parallel/MT processes, so they don't have to carve out as much transistor space for that and can devote more to ST performance.

That's nonetheless still impressive, and I'd take a 4-core A13 over my i7-8650U without hesitation.

An aside:

As Graviton2 has shown

Please post that Rome vs Graviton2 in the Graviton2 thread where it belongs, that's interesting

DrMrLordX · Mar 24, 2020

Nothingness said:
Please post that Rome vs Graviton2 in the Graviton2 thread where it belongs, that's interesting

Seconded. @amrnuke please do, I would like to discuss it there.

Thala · Mar 24, 2020

Nothingness said:
More valid but still not correct. You don't design a micro-architecture the same way when targeting high frequency, where you usually need more pipe stages, hence lower IPC even when you scale down frequency.

You cannot isolate frequency as you did in your example above. The CPUs are running at vastly different voltages, on top of a mobile CPU generally using slower low-leakage cells. If you remove both restrictions, the A13 should easily clock 4 GHz+ without adding a single pipeline stage. You have to understand that, from a microarchitecture point of view, a mobile CPU is as high-frequency as the architectures used for desktop - it is an architectural challenge to achieve 2.66 GHz in the above-mentioned context.
The above conclusion would only be valid if everything else were the same, namely cell library and voltage - however, they are not.

Nothingness · Mar 24, 2020

Thala said:
You cannot isolate frequency as you did in your example above. The CPUs are running at vastly different voltages, on top of a mobile CPU generally using slower low-leakage cells. If you remove both restrictions, the A13 should easily clock 4 GHz+ without adding a single pipeline stage. You have to understand that, from a microarchitecture point of view, a mobile CPU is as high-frequency as the architectures used for desktop - it is an architectural challenge to achieve 2.66 GHz in the above-mentioned context.
The above conclusion would only be valid if everything else were the same, namely cell library and voltage - however, they are not.

I'll take your word for it, I'm not familiar enough with process and implementation.

But do you think the A13 can work at 4 GHz while keeping a low enough power to fit a 4-core in a laptop? Won't power skyrocket? That's what happened to some of the CPU designs I was involved in (but that was years ago and perhaps the implementation guys were not that good and/or the process they used not adequate).

Thala · Mar 24, 2020

Nothingness said:
But do you think the A13 can work at 4 GHz while keeping a low enough power to fit a 4-core in a laptop? Won't power skyrocket? That's what happened to some of the CPU designs I was involved in (but that was years ago and perhaps the implementation guys were not that good and/or the process they used not adequate).

Power will obviously go up, and significantly too. So 4 GHz all-core in a 15W envelope looks challenging. On the other hand, it does not have to run at 4 GHz to beat everything else.
In order to fit an A13-like microarchitecture into a laptop you would increase the core count, say to 8, and only slightly adjust frequency - it would not be an A13 as such that goes into a laptop.
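A rough sketch of why power climbs so quickly with clock: the usual first-order approximation for CMOS dynamic power is P ~ C * V^2 * f, and higher clocks generally need higher voltage. The voltages below are hypothetical placeholders, not A13 measurements, and leakage is ignored.

```python
def relative_dynamic_power(f_ghz, volts, base_f_ghz, base_volts):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f (C cancels out)."""
    return (f_ghz / base_f_ghz) * (volts / base_volts) ** 2

# Baseline: 2.66 GHz at an assumed 1.0 V. Target: 4.0 GHz at an assumed 1.3 V.
# Both voltages are illustrative guesses only.
scale = relative_dynamic_power(4.0, 1.3, base_f_ghz=2.66, base_volts=1.0)
print(f"relative dynamic power per core: {scale:.2f}x")
```

Under these assumed numbers, a roughly 50% clock increase costs about 2.5x the dynamic power per core, before leakage is counted.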

Doug S · Mar 24, 2020

Laptops aren't running four cores at 4 GHz non stop either, that's why they have a normal clock that's very low (much lower than mobile SoC clocks in some cases) and can turbo to 4 GHz or higher. But they can't maintain that turbo frequency on all cores continuously in a laptop.

Just like mobile SoCs can't run at their standard clock on all cores continuously - at least not when used in a phone. With so little cooling ability there is no way to dissipate that much power other than by making the phone so hot it would hurt to touch.

OriAr · Mar 24, 2020

This is the V/f curve for Apple A12:

For comparison, the 9700K hits 4.7GHz at the same voltage.
Based on what we know about the A13, it's safe to assume the V/f curve is very similar to A12's. Now obviously you are not hitting 4GHz here, and even 3GHz would take significantly more power (The curve already starts to get out of control by 2.6 GHz).

Now, while the current performance is certainly good, we have the issue of software compatibility between x86 and ARM. The emulation penalty is about 30% right now, which means that to make the jump you either need to have lots of software ready (possible, but very hard to do with professional-grade stuff), or have a performance jump so big that paying the penalty would be trivial, and that basically means at least doubling the performance, which as we see here is not happening, especially with a mobile power budget.
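As a back-of-the-envelope check on the emulation argument, assuming the 30% penalty means emulated x86 code runs at 70% of the chip's native speed (an interpretation, since the post does not define the penalty):

```python
def native_speedup_to_break_even(penalty):
    """Native speedup needed so emulated code matches the old machine.

    Assumes emulated speed = native speed * (1 - penalty).
    """
    if not 0 <= penalty < 1:
        raise ValueError("penalty must be in [0, 1)")
    return 1.0 / (1.0 - penalty)

def emulated_speed(native_speedup, penalty):
    """Speed of emulated code relative to the old machine (1.0 = parity)."""
    return native_speedup * (1.0 - penalty)

print(f"break-even native speedup: {native_speedup_to_break_even(0.30):.2f}x")
print(f"emulated speed at 2x native: {emulated_speed(2.0, 0.30):.2f}x")
```

Under this reading, about 1.43x native performance is the break-even point, while the doubling mentioned above would leave emulated software roughly 1.4x faster than before.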

There might be an ARM MacBook Air some day (and even that is looking doubtful with the new iPad Pro and the new MBA with quad cores in it) but I don't think the MBP is going ARM anytime soon, especially when Intel and AMD have TGL-H/RMB ready to go for the next MBP refresh in a year, which will bring some much-anticipated performance gains.

Richie Rich · Mar 24, 2020

amrnuke said:
You are comparing apples to oranges again. Apple has +80% IPC PER GHZ at 2.65 GHz. What is the A13's IPC at 4.6 GHz? Do you expect it to scale up linearly with clock increases?
Similarly, 3950X raw score is 50.02, does it drop that much if you artificially down-clock to 2.65 GHz?

No need to artificially down-clock Zen2; you can see how much IPC Zen2 gains when it's running at 2.6 GHz in the 64c EPYC. Almost nothing, maybe a single-digit percentage, because the limit is the narrow back end.
The thing is that A13 doesn't need to run at 4 GHz even if it can. 95% of all devices (phones, tablets, laptops, most servers) prefer efficiency over absolute peak performance (PC gaming, HPC workloads).
Just compare an old 4-core A12X - it's a beast, and the brand new Ice Lake is beaten even while using a much higher TDP.

OriAr said:
This is the V/f curve for Apple A12:

When you look at this curve you can see that the frequency scaling is not ideal. There is a big penalty after 2.2 GHz, probably as a result of high transistor density, different libs etc. A high-frequency AMD CPU has a similar V/f curve up to 2.2 GHz when using the same node. After 2.2 GHz it scales much better, up to 1.4V. That's clearly silicon optimization for high frequency, not an architectural limit. That is what Thala is trying to say.

Anyway, if Apple won't deliver an ARM CPU to the laptop market then Qualcomm and others will (actually QComm already did with the 8cx SoC based on the weak A76). A77 or A78 with IPC higher than Zen2 will be super competitive, cheap and dangerous. IMHO Apple has lost some vision. Steve Jobs would probably have migrated laptops to the A12X a year ago. Not sure what they're waiting for.

soresu · Mar 24, 2020

OriAr said:
This is the V/f curve for Apple A12:
[Attachment 18574: V/f curve for Apple A12]

For comparison, the 9700K hits 4.7GHz at the same voltage.
Based on what we know about the A13, it's safe to assume the V/f curve is very similar to A12's. Now obviously you are not hitting 4GHz here, and even 3GHz would take significantly more power (The curve already starts to get out of control by 2.6 GHz).

Now, while the current performance is certainly good, we have the issue of software compatibility between x86 and ARM. The emulation penalty is about 30% right now, which means that to make the jump you either need to have lots of software ready (possible, but very hard to do with professional-grade stuff), or have a performance jump so big that paying the penalty would be trivial, and that basically means at least doubling the performance, which as we see here is not happening, especially with a mobile power budget.

There might be an ARM MacBook Air some day (and even that is looking doubtful with the new iPad Pro and the new MBA with quad cores in it) but I don't think the MBP is going ARM anytime soon, especially when Intel and AMD have TGL-H/RMB ready to go for the next MBP refresh in a year, which will bring some much-anticipated performance gains.

Thank you, a comment long overdue.

Nothingness · Mar 25, 2020

OriAr said:
For comparison, the 9700K hits 4.7GHz at the same voltage,

9700 reaches 4.7 GHz at less than 1.1V? And runs prime95 with no trouble?

coercitiv · Mar 25, 2020

Nothingness said:
9700 reaches 4.7 GHz at less than 1.1V? And runs prime95 with no trouble?

Data from Anandtech. I'd say 4500-4600 is a safe bet.

Nothingness · Mar 25, 2020

coercitiv said:
Data from Anandtech. I'd say 4500-4600 is a safe bet.

[Attachment 18604]

Not prime95, but blender certainly is good enough. Thanks a lot

trivik12 · Mar 25, 2020

So 1st Arm Macbook would use A14x. I wonder if 14" MBP will use Tigerlake or A series SOC.

Doug S · Mar 25, 2020

trivik12 said:
So 1st Arm Macbook would use A14x. I wonder if 14" MBP will use Tigerlake or A series SOC.

No way they'd put ARM CPUs in the Pro line first. They'd test it in consumer laptops first. The MBP coming this fall will be Intel.
