Solved! ARM Apple High-End CPU - Intel replacement

Richie Rich · Oct 14, 2019

There is a first rumor about Intel replacement in Apple products:

ARM based high-end CPU
8 cores, no SMT
IPC +30% over Cortex A77
desktop performance (Core i7/Ryzen R7) with much lower power consumption
introduction with new gen MacBook Air in mid 2020 (considering also MacBook Pro and iMac)
massive AI accelerator

Source Coreteks:

Doug S · Mar 20, 2020

Now that we have benchmarks it seems pretty clear that this was a pretty small step like most of us were suggesting. They activated one dormant GPU unit in the A12X and slapped a new name on it. Didn't even bother to increase the clock rate - I'll bet it isn't even using N7P. Probably coming off the same N7 line as the A12X and just differs in not fusing off one GPU unit.

Nothingness · Mar 20, 2020

eek2121 said:
Regarding the Graviton 2, I read that article the day it came out: We don't have enough data to make a real comparison. That article was meant for comparison between cloud platforms on one provider alone. For example, one can look at the actual Zen 2/EPYC benchmarks and see that they are higher than the Graviton 2 benchmarks.

You wrote that it was for a single provider and then say now let's add this to the comparison. If Zen 2 isn't available on AWS then it's pointless. Or I could add some Fujitsu A64FX to the mix and laugh at the poor FP performance of x86 AVX-512 units.

Apple is using node shrinks for performance increases.

Yes but Axx performance doesn't only increase thanks to node shrinks.

Regarding the power consumption: AMD's 15 watt parts easily beat out the A13 by most performance metrics.

OMG a 15W chip is faster. All hell's breaking loose.

The important takeaway here is that Apple's A13 is only faster for very specific workloads (source below). This says to me that the A13 has a vastly superior cache subsystem or maybe something else, but it in no way means that the A13 is a faster chip!

So having a superior cache subsystem is not a property of A13? Ridiculous.

Chips do not, and will not scale up linearly. An 8 core, 8 thread A13 would have a 45 watt TDP. I've seen the A13 in my iPhone draw 7 watts of power and heat the phone up until it was uncomfortably hot, and that was in a GAME. While TDP isn't an indicator of power, the two are usually pretty close. Keep this in mind with the data I present below.

It didn't occur to you most of that power (how did you measure it BTW?) might be going to the GPU?

The A13 doesn't accelerate AVX, SSE, or a myriad of other instructions that current processors do.

x86 doesn't accelerate NEON. Breaking news.

Now let's look at some data.

Here is an early benchmark on Geekbench 5 (my favorite benchmark) of the 15 watt Ryzen 4800U: https://browser.geekbench.com/v5/cpu/1373084
Here is a random benchmark of the iPhone 11 Pro Max: https://browser.geekbench.com/v5/cpu/1498904

You like GB5 and you don't know how to make a proper comparison? I will help you: https://browser.geekbench.com/v5/cpu/compare/1498904?baseline=1373084
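For anyone who wants to build the same side-by-side view for other results, here is a minimal sketch. It only reproduces the URL pattern visible in the link above; the function name is made up for illustration and this is not an official Geekbench API.

```python
def geekbench5_compare_url(result_id: int, baseline_id: int) -> str:
    """Build a Geekbench 5 CPU comparison URL from two result IDs.

    Assumption: the pattern is copied from the compare link in this thread.
    """
    return (
        "https://browser.geekbench.com/v5/cpu/compare/"
        f"{result_id}?baseline={baseline_id}"
    )


# Result IDs taken from the two links posted above.
IPHONE_11_PRO_MAX = 1498904
RYZEN_4800U = 1373084

print(geekbench5_compare_url(IPHONE_11_PRO_MAX, RYZEN_4800U))
```

The first ID is the result being compared and the second is the baseline it is measured against.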

In most cases, it's a pretty narrow victory. However, that's not what is alarming here. What IS alarming is the pattern that emerges. It's almost like those very specific mobile oriented workloads are being accelerated in some way! With the exception of the SQLite benchmark, N-Body Physics, and Ray Tracing, All of the functions above are used on a smartphone. As the developers of Geekbench have little control over how their benchmark is built, it's very likely that Apple is using a number of accelerators in both the CPU and GPU to accelerate these workloads, while the Ryzen 4800U is doing everything brute force with the exception of built in instruction sets.

If you think Apple can magically use HW accelerators for C++ code (Geekbench is C++ you know that, right?), then they will destroy the competition (hint: that's not happening).

Clang once again consists of parsing a bunch of text (1094 lines of it!), want to know what would really accelerate this process? The ability to accelerate pattern recognition and text parsing! I don't see any evidence of Geekbench actually running a linker. Their own document suggests that they are building for the AArch64 architecture, but they don't mention how much of it they do. Furthermore, the benchmark is listed as klines/sec. Interesting indeed. That suggests to me that they aren't going as far as generating code, but instead are only going as far as parsing the source code. We can speculate on this all day, but it's a very small win, regardless.

37% faster is a small win?

BTW the clang test goes until assembly code generation.

I'll stop here, the rest of your post is about some magic Apple can do at accelerating C++ code with HW accelerators. That's not true and if it was true then they'd be years ahead of the competition and we could conclude x86 is dead (which is not the case).

Please let me know if you see any typos, I typed this up in a relative hurry and barely skimmed over it for proof reading.

Yeah sure you typed your whole message in a few minutes.

eek2121 · Mar 20, 2020

Nothingness said:
You wrote that it was for a single provider and then say now let's add this to the comparison. If Zen 2 isn't available on AWS then it's pointless. Or I could add some Fujitsu A64FX to the mix and laugh at the poor FP performance of x86 AVX-512 units.

Yes but Axx performance doesn't only increase thanks to node shrinks.

OMG a 15W chip is faster. All hell's breaking loose.

So having a superior cache subsystem is not a property of A13? Ridiculous.

It didn't occur to you most of that power (how did you measure it BTW?) might be going to the GPU?

x86 doesn't accelerate NEON. Breaking news.

You like GB5 and you don't know how to make a proper comparison? I will help you: https://browser.geekbench.com/v5/cpu/compare/1498904?baseline=1373084

If you think Apple can magically use HW accelerators for C++ code (Geekbench is C++ you know that, right?), then they will destroy the competition (hint: that's not happening).

37% faster is a small win?

BTW the clang test goes until assembly code generation.

I'll stop here, the rest of your post is about some magic Apple can do at accelerating C++ code with HW accelerators. That's not true and if it was true then they'd be years ahead of the competition and we could conclude x86 is dead (which is not the case).

Yeah sure you typed your whole message in a few minutes.

Apple controls the compiler. Full stop. You don’t know what that compiler is doing. Everything in that benchmark is screaming to me that either the NPU is involved, or the A13 is just really good at tree parsing and average or below average at everything else.

What’s great about my post is that it separates the men from the boys. I am seeing the names of all the shills here and can safely ignore them in the future, and I encourage others to do the same.

Nothingness · Mar 20, 2020

eek2121 said:
Apple controls the compiler. Full stop. You don’t know what that compiler is doing. Everything in that benchmark is screaming to me that either the NPU is involved, or the A13 is just really good at tree parsing and average or below average at everything else.

What is screaming here is that you can't admit the truth.

What’s great about my post is that it separates the men from the boys. I am seeing the names of all the shills here and can safely ignore them in the future, and I encourage others to do the same.

What's great is that one of us two is working in CPU design teams, and has seen and played with Geekbench source code. And it obviously is not you.

Nothingness · Mar 20, 2020

Carfax83 said:
I'm sure the proponents of ARM and the Apple A series are going to bring up its performance in Spec. Geekbench is one thing, but Spec is another. What are your thoughts on the Spec performance?

Something tells me he won't get to this as this would burst his bubble since source code is available and was built and run by someone on this very forum.

eek2121 · Mar 20, 2020

Nothingness said:
What is screaming here is that you can't admit the truth.

What's great is that one of us two is working in CPU design teams, and has seen and played with Geekbench source code. And it obviously is not you.

We can safely say you haven’t that’s for sure. I like to try and avoid personal attacks because you know, it’s the rules and stuff. I also am not a fanboy for a particular platform. All of them are evil.

For future reference, my background is compiler work for new and existing architectures, device driver and operating system work, etc. so you may want to work on those counter arguments.

To the folks that are screaming that I am comparing a 15w TDP CPU to a “5w” TDP CPU it should be noted that the Renoir chip has 4 times the number of “big” cores with only 3 times the TDP. It isn’t an apples-to-apples comparison in either direction of course. However, it only takes a view of objectivity to realize that cutting 6 vega cores and 6 cpus off a 15 watt design would easily make it consume the power of a smartphone SoC; we don’t even need to imagine this, AMD has 5-6 watt embedded CPUs out there on 14nm and 7nm 5 watt CPUs should land late this year or Q1 2021.

amrnuke · Mar 20, 2020

name99 said:
The problem is that blindly multiplying density by area is almost certainly seriously inaccurate. The AMD CPUs clock higher, meaning that a substantial fraction of their logic transistors need to be physically larger (probably achieved through more fins).
There just isn't enough info to know.

I prefer to read the claim not as literally true about *cores* (because there isn't enough data to make the point one way or another) but as pointing out an important difference between AMD (and Intel)'s design points and ARM design points, namely what I keep saying:
- that x86 are designing for speed. ARM are designing for IPC.
- this shows up, not least, in transistor density, because achieving high GHz means physically larger transistors.

Could AMD pivot to a more brainiac design with smaller transistors, and so hit current performance with, say 4/3 more IPC at 3/4 GHz? Clearly the ARM ISA allows for this; I don't know the x86 constraints.
Firstly they have a lousy ISA to work with, which may limit various smarts ARM can throw at the problem (eg going forward something like value prediction is probably going to break their memory ordering model? and memory ordering may today constrain how aggressive they can be in their load/store queues?) Along with that there are all the known problems of course -- flags crap, stack crap, split registers crap.
Second they have a fan base a substantial fraction of which judge performance by GHz.
Third they don't have the luxury to redesign from scratch. God, can you imagine how much design then validation effort that would be to build an x86 from zero?

So their choices are understandable. But that's not the same thing as saying they're what would be optimal in a world of no constraints.
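The "4/3 more IPC at 3/4 GHz" trade-off in the quote can be checked with a tiny sketch. It assumes the usual first-order model where performance is roughly IPC times clock, and it ignores memory latency, power and everything else; the function name is just for illustration.

```python
from fractions import Fraction


def relative_performance(ipc_scale: Fraction, clock_scale: Fraction) -> Fraction:
    """First-order model: performance scales with IPC multiplied by clock."""
    return ipc_scale * clock_scale


# A wider, slower design: 4/3 of the IPC at 3/4 of the clock.
brainiac = relative_performance(Fraction(4, 3), Fraction(3, 4))
print(brainiac)
```

Under that model the two factors cancel exactly, which is the point being made: same performance, reached by a different route.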

All exactly fair points, which is why I admitted to "blindly" multiplying things out. However, back to the original point: we don't know how many transistors are on each core, and Richie Rich is wrong to say that A13 has double when he has zero evidence to back that up. Any rudimentary calculations made in good faith point toward him being wrong. I'm open to changing my mind if further data comes to light.

In summary, as you state, if Apple focused on parallelism and high clocks, they would have to make their core less dense, just as AMD have done. If AMD wanted to focus on single threaded IPC and efficiency, they could make their cores more dense, and clock them lower, just as Apple have done. These are chips designed for completely different market segments. I am not sure why he is hell-bent on comparing them.

amrnuke · Mar 20, 2020

Carfax83 said:
I'm sure the proponents of ARM and the Apple A series are going to bring up its performance in Spec. Geekbench is one thing, but Spec is another. What are your thoughts on the Spec performance?

That's a nice result, but won't likely drive much uptake. Either 1) momentum/mindshare or 2) real workload needs are what will drive uptake of these chips. If you're running in an environment that favors ARM, you pick ARM. If you're running in an environment that favors AMD, you pick AMD, and same for Intel. Someone making a 6-7 figure decision, hopefully is making a logical one. And if they aren't, all it depends on is who has the better marketing and mindshare. There shouldn't be any situation where someone making decisions about buying a CPU would ever say, "Yeah, I know we leverage MKL and Epyc is best at that, and I know we have a contract with Intel, but look at that Graviton2 spec score!" And if they make a decision based on that, I'd hope they'd lose their job quickly.

Thala · Mar 20, 2020

eek2121 said:
It isn’t an apples-to-apples comparison in either direction of course. However, it only takes a view of objectivity to realize that cutting 6 vega cores and 6 cpus off a 15 watt design would easily make it consume the power of a smartphone SoC; we don’t even need to imagine this, AMD has 5-6 watt embedded CPUs out there on 14nm and 7nm 5 watt CPUs should land late this year or Q1 2021.

Looks like you have not understood how turbo mode works. It allows a single CPU in an 8 core system to take much more than just 1/8 of the available power. So if you cut off CPUs which are not even involved in the single core benchmark you took as reference, then the power won't change much...
But hey, seeing your other absurd theories about C++ compilers producing code for HW accelerators, this won't burst your bubble anyway...
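A rough sketch of the turbo argument, as a toy model only. The 15 W budget and the 8 cores come from the thread; the per-core and uncore wattages below are made-up placeholders, not measurements of any real chip.

```python
def naive_share_w(package_w: float, cores: int) -> float:
    """Even split of the package budget across all cores."""
    return package_w / cores


def package_power_w(cores, active_cores, active_core_w, idle_core_w, uncore_w):
    """Toy model: uncore plus busy cores plus (mostly power-gated) idle cores."""
    idle_cores = cores - active_cores
    return uncore_w + active_cores * active_core_w + idle_cores * idle_core_w


# Hypothetical placeholder values, chosen only to illustrate the shape of the argument.
ACTIVE_CORE_W = 8.0   # one core boosting in a single-thread run
IDLE_CORE_W = 0.1     # a parked core
UNCORE_W = 2.0        # fabric, memory controller, etc.

even_split = naive_share_w(15.0, 8)
eight_core = package_power_w(8, 1, ACTIVE_CORE_W, IDLE_CORE_W, UNCORE_W)
two_core = package_power_w(2, 1, ACTIVE_CORE_W, IDLE_CORE_W, UNCORE_W)

print(f"even split per core:        {even_split:.3f} W")
print(f"single thread, 8-core chip: {eight_core:.1f} W")
print(f"single thread, 2-core chip: {two_core:.1f} W")
```

With these placeholder numbers, removing six idle cores barely moves single-thread package power, because the boosting core and the uncore dominate.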

eek2121 · Mar 20, 2020

Thala said:
Looks like you have not understood how turbo mode works. It allows a single CPU in an 8 core system to take much more than just 1/8 of the available power. So if you cut off CPUs which are not even involved in the single core benchmark you took as reference, then the power won't change much...
But hey, seeing your other absurd theories about C++ compilers producing code for HW accelerators, this won't burst your bubble anyway...

Hence why I said it wasn’t an apples-to-apples comparison. Do you shills lack reading comprehension? The Ryzen R1102G is a 14nm Zen 1 based 6w CPU. I am willing to bet that Zen 2 would raise that clock ceiling by 30% while allowing for additional cores.

Also, all these non-developers keep talking about compilers. One of the stages in compiling your software is *gasp* optimization for the target hardware platform.

Doug S · Mar 20, 2020

eek2121 said:
Hence why I said it wasn’t an apples-to-apples comparison. Do you shills lack reading comprehension? The Ryzen R1102G is a 14nm Zen 1 based 6w CPU. I am willing to bet that Zen 2 would raise that clock ceiling by 30% while allowing for additional cores.

Also, all these non-developers keep talking about compilers. One of the stages in compiling your software is *gasp* optimization for the target hardware platform.

Benchmarks cross-compile, so they target the same platform and thus do the exact same amount of work no matter what CPU they are running on, and they use the same version/source of the compiler.

The GCC subtest in SPEC isn't generating ARM code when running on ARM and x86 code when running on x86, and the LLVM subtest in Geekbench isn't running Apple's version of LLVM when run on an iPhone and some other version when run on x86.

Andrei. · Mar 21, 2020

eek2121 said:
Also, all these non-developers keep talking about compilers. One of the stages in compiling your software is *gasp* optimization for the target hardware platform.

None of these benchmarks are even explicitly targeting the specific CPU microarchitecture, just the target ISA. If your background is compiler work then I hope you can differentiate between a compiler's front-end and back-end. Your previous big post is just a huge pile of nonsense, full of technical inaccuracies. I wish people would stop chiming in from an authoritative position when in reality they live in lala-land.

lobz · Mar 21, 2020

Nothingness said:
A remark which I hope for you was meant as a joke

I took the joke a bit further, as Intel is at least as bad at power efficiency in servers. I have no idea how no one got that. Good job for downvoting your own joke by the way

soresu · Mar 21, 2020

Andrei. said:
I hope you can differentiate between a compiler's front-end and back-end.

Front end is source/language input and back end is ISA specific assembly/target language output depending on what type of compiler it is right?

Nothingness · Mar 21, 2020

lobz said:
I took the joke a bit further, as Intel is at least as bad at power efficiency in servers. I have no idea how no one got that. Good job for downvoting your own joke by the way

Yeah I lowered myself to reach your level

You could have explained yourself rather than playing that silly game. I still don't get what you're trying to say. Really.

lobz · Mar 21, 2020

Nothingness said:
Yeah I lowered myself to reach your level. You could have explained yourself rather than playing that silly game. I still don't get what you're trying to say. Really.

Which game, man? I explained my comment and I still don't have any idea what your problem is - or maybe you could stop trying to decide for other people what they were thinking.

Richie Rich · Mar 22, 2020

amrnuke said:
Zen2 core is 3.64mm2 on N7
A13 big core is 2.61mm2 on N7P

It is 100% false that an A13 Lightning core has double the transistors of a Zen2 core.

Please let's do the right match, because you compare Zen2 including L2$ and A13 without L2$.

There is also different transistor density:
- Zen2 chiplet: 3.9 billion tr per 74mm2 = 52 Mtr/mm2
- A13 SoC: 8.5 billion tr per 98mm2 = 86 Mtr/mm2

Including L2 cache:

- Zen2: core Mtr= 3.6mm2 * 52 Mtr/mm2 = 187 Mtr
- A13: core Mtr= 4.5mm2 * 86 Mtr/mm2 = 387 Mtr ..... 2.1x more transistors

Excluding L2 cache:

- Zen2: core Mtr= 2.7mm2 * 52 Mtr/mm2 = 140 Mtr
- A13: core Mtr= 2.6mm2 * 86 Mtr/mm2 = 223 Mtr ...... 1.6x more transistors
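The arithmetic above can be reproduced with a short sketch. It uses only the figures quoted in this post and keeps the same assumption the post makes, namely that the whole-chip average density applies uniformly to the CPU core. Other posters in this thread dispute that assumption, so treat the output as a restatement of the estimate, not as a measurement. The function names are just for illustration.

```python
def density_mtr_per_mm2(transistors_billion: float, die_area_mm2: float) -> float:
    """Whole-chip average density in million transistors per mm2."""
    return transistors_billion * 1000 / die_area_mm2


def core_estimate_mtr(core_area_mm2: float, density: float) -> float:
    """Core transistor estimate, assuming the chip-average density holds for the core."""
    return core_area_mm2 * density


# Figures as quoted in the post (die totals and core areas).
zen2_density = density_mtr_per_mm2(3.9, 74)
a13_density = density_mtr_per_mm2(8.5, 98)

ratio_with_l2 = core_estimate_mtr(4.5, a13_density) / core_estimate_mtr(3.6, zen2_density)
ratio_without_l2 = core_estimate_mtr(2.6, a13_density) / core_estimate_mtr(2.7, zen2_density)

print(f"Zen2 density: {zen2_density:.1f} Mtr/mm2, A13 density: {a13_density:.1f} Mtr/mm2")
print(f"A13 vs Zen2 including L2: {ratio_with_l2:.2f}x")
print(f"A13 vs Zen2 excluding L2: {ratio_without_l2:.2f}x")
```

Note that the ratios are driven almost entirely by the density ratio, since the core areas themselves are close; that is why the uniform-density assumption matters so much.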

So it's pretty clear A13 is a monstrous beast compared to any existing x86 core.
The x86 world is lucky that this A13 core cannot be licensed to anybody like the ARM Cortex cores can, and that Apple was blind to Gerard Williams's plans to expand into the server segment. However, this will change with Nuvia, and A78 cores will give x86 a hard time in servers. Not to mention the new ARMv9 with SVE2 and ARM's completely new core line-up (A80-series?).

x86 is in serious trouble from a speed-of-development point of view. They need to speed up development, otherwise x86 will lose its server and laptop majority in 5 years. IMHO, in the long term x86 is not gonna make it and it's done. Remember, x86 won over PowerPC due to lower price and multiple competing manufacturers pushing faster development and capturing the majority of the market's money. And all these advantages are on ARM's side now. According to the laws of economics, x86 is dying.

Nothingness · Mar 22, 2020

Richie Rich said:
Please let's do the right match, because you compare Zen2 including L2$ and A13 without L2$.

There is also different transistor density:
- Zen2 chiplet: 3.9 billion tr per 74mm2 = 52 Mtr/mm2
- A13 SoC: 8.5 billion tr per 98mm2 = 86 Mtr/mm2

Including L2 cache:

- Zen2: core Mtr= 3.6mm2 * 52 Mtr/mm2 = 187 Mtr

- A13: core Mtr= 4.5mm2 * 86 Mtr/mm2 = 387 Mtr ..... 2.1x more transistors

Excluding L2 cache:

- Zen2: core Mtr= 2.7mm2 * 52 Mtr/mm2 = 140 Mtr

- A13: core Mtr= 2.6mm2 * 86 Mtr/mm2 = 223 Mtr ...... 1.6x more transistors

So it's pretty clear A13 is a monstrous beast compared to any existing x86 core.

@amrnuke comparison makes more sense to me: for Zen2 the L2 cache is private to each core, while for A13 it's shared (let's say it's like Zen2 L3).

Richie Rich · Mar 22, 2020

Nothingness said:
@amrnuke comparison makes more sense to me: for Zen2 the L2 cache is private to each core, while for A13 it's shared (let's say it's like Zen2 L3).

No. Apple A13 SoC has 16MB of System Level Cache shared between CPU and GPU.... this is like L3 cache: "Going further out into the cache hierarchy we’re hitting the SLC, which would act as an L3 to the large performance cores" Link Andrei's A13 test

Then it's fair to compare without L2 cache. So it's clear that A13 core is at least 1.6x bigger than Zen2. Unless you decide to make another "fair" comparison to match your statement at any cost.

naukkis · Mar 22, 2020

Richie Rich said:
Then it's fair to compare without L2 cache. So it's clear that A13 core is at least 1.6x bigger than Zen2. Unless you decide to make another "fair" comparison to match your statement at any cost.

It's fair to compare how much silicon area they use, and from that we can conclude that A13 and Zen2 are very similar in size. Fast logic transistors have to be much larger than slower logic or cache transistors, so transistor count cannot be estimated this way. Silicon area can. x86 CPUs need to spend logic on instruction decoding and legacy support, which A13 doesn't have to, so it can use more of that area for execution itself.

There's a sweet spot on every manufacturing process for how large a core can be. Apple and AMD have pretty similar core sizes. Intel's core is about twice that big, so it consumes much more power and won't clock high, resulting in pretty low efficiency, only similar to the previous 14nm CPUs.

Nothingness · Mar 22, 2020

Richie Rich said:
No. Apple A13 SoC has 16MB of System Level Cache shared between CPU and GPU.... this is like L3 cache: "Going further out into the cache hierarchy we’re hitting the SLC, which would act as an L3 to the large performance cores" Link Andrei's A13 test

Then it's fair to compare without L2 cache. So it's clear that A13 core is at least 1.6x bigger than Zen2. Unless you decide to make another "fair" comparison to match your statement at any cost.

You're funny, I'm not the one who wants to be right at any cost

I don't care, keep living in your fantasy land.

Richie Rich · Mar 22, 2020

Nothingness said:
You're funny, I'm not the one who wants to be right at any cost. I don't care, keep living in your fantasy land.

You can compare core either with L2$ or without. Mixing it together is monkey logic.

Including L2 cache:

- Zen2: core Mtr= 3.6mm2 * 52 Mtr/mm2 = 187 Mtr
- A13: core Mtr= 4.5mm2 * 86 Mtr/mm2 = 387 Mtr ..... 2.1x more transistors

Excluding L2 cache:

- Zen2: core Mtr= 2.7mm2 * 52 Mtr/mm2 = 140 Mtr
- A13: core Mtr= 2.6mm2 * 86 Mtr/mm2 = 223 Mtr ...... 1.6x more transistors

Looks like some people have a problem accepting Apple's monstrous transistor count, the same way they had a problem accepting its IPC.
The explanation is simple: Apple's enormous IPC performance has to come from somewhere.

Thunder 57 · Mar 22, 2020

Richie Rich said:
You can compare core either with L2$ or without. Mixing it together is monkey logic.

Including L2 cache:

- Zen2: core Mtr= 3.6mm2 * 52 Mtr/mm2 = 187 Mtr

- A13: core Mtr= 4.5mm2 * 86 Mtr/mm2 = 387 Mtr ..... 2.1x more transistors

Excluding L2 cache:

- Zen2: core Mtr= 2.7mm2 * 52 Mtr/mm2 = 140 Mtr

- A13: core Mtr= 2.6mm2 * 86 Mtr/mm2 = 223 Mtr ...... 1.6x more transistors

Looks like some people have a problem accepting Apple's monstrous transistor count, the same way they had a problem accepting its IPC.
The explanation is simple: Apple's enormous IPC performance has to come from somewhere.

And that's why server farms are full of iphones

naukkis · Mar 22, 2020

Both Zen2 and A13 are built on the same process, so they will both have about equal transistor count/mm2. The different chip-level transistor densities come from area used for things other than CPU core logic: GPU and caches are much denser than CPU cores.

RetroZombie · Mar 22, 2020

Richie Rich said:
Apple's enormous IPC performance has to come from somewhere.

Maybe by having better employees than Raja and Jim Keller
