TOP 20 of the World's Most Powerful CPU Cores - IPC/PPC comparison


Richie Rich

Senior member
Jul 28, 2019
Added cores:
  • A53 - little core used in some low-end smartphones in 8-core config (Snapdragon 450)
  • A55 - used as little core in every modern Android SoC
  • A72 - "high" end Cortex core used in the Snapdragon 650/652 or the Raspberry Pi 4
  • A73 - "high" end Cortex core
  • A75 - "high" end Cortex core
  • Bulldozer - infamous AMD core
Geekbench 5.1 PPC chart 6/23/2020:

| Pos | Man | CPU | Core | Year | ISA | GB5 Score | GHz | PPC (score/GHz) | Relative to 9900K | Relative to Zen3 |
|----:|-----|-----|------|------|-----|----------:|----:|----------------:|------------------:|-----------------:|
| 1 | Nuvia | (Est.) | Phoenix (Est.) | 2021 | ARMv9.0 | 2001 | 3.00 | 667.00 | 241.0% | 194.1% |
| 2 | Apple | A15 (est.) | (Est.) | 2021 | ARMv9.0 | 1925 | 3.00 | 641.70 | 231.8% | 186.8% |
| 3 | Apple | A14 (est.) | Firestorm | 2020 | ARMv8.6 | 1562 | 2.80 | 558.00 | 201.6% | 162.4% |
| 4 | Apple | A13 | Lightning | 2019 | ARMv8.4 | 1332 | 2.65 | 502.64 | 181.6% | 146.3% |
| 5 | Apple | A12 | Vortex | 2018 | ARMv8.3 | 1116 | 2.53 | 441.11 | 159.4% | 128.4% |
| 6 | ARM Cortex | V1 (est.) | Zeus | 2020 | ARMv8.6 | 1287 | 3.00 | 428.87 | 154.9% | 124.8% |
| 7 | ARM Cortex | N2 (est.) | Perseus | 2021 | ARMv9.0 | 1201 | 3.00 | 400.28 | 144.6% | 116.5% |
| 8 | Apple | A11 | Monsoon | 2017 | ARMv8.2 | 933 | 2.39 | 390.38 | 141.0% | 113.6% |
| 9 | Intel | (Est.) | Golden Cove (Est.) | 2021 | x86-64 | 1780 | 4.60 | 386.98 | 139.8% | 112.6% |
| 10 | ARM Cortex | X1 | Hera | 2020 | ARMv8.2 | 1115 | 3.00 | 371.69 | 134.3% | 108.2% |
| 11 | AMD | 5900X (Est.) | Zen 3 (Est.) | 2020 | x86-64 | 1683 | 4.90 | 343.57 | 124.1% | 100.0% |
| 12 | Apple | A10 | Hurricane | 2016 | ARMv8.1 | 770 | 2.34 | 329.06 | 118.9% | 95.8% |
| 13 | Intel | 1065G7 | Icelake | 2019 | x86-64 | 1252 | 3.90 | 321.03 | 116.0% | 93.4% |
| 14 | ARM Cortex | A78 | Hercules | 2020 | ARMv8.2 | 918 | 3.00 | 305.93 | 110.5% | 89.0% |
| 15 | Apple | A9 | Twister | 2015 | ARMv8.0 | 564 | 1.85 | 304.86 | 110.1% | 88.7% |
| 16 | AMD | 3950X | Zen 2 | 2019 | x86-64 | 1317 | 4.60 | 286.30 | 103.4% | 83.3% |
| 17 | ARM Cortex | A77 | Deimos | 2019 | ARMv8.2 | 812 | 2.84 | 285.92 | 103.3% | 83.2% |
| 18 | Intel | 9900K | Coffee Lake-R | 2018 | x86-64 | 1384 | 5.00 | 276.80 | 100.0% | 80.6% |
| 19 | Intel | 10900K | Comet Lake | 2020 | x86-64 | 1465 | 5.30 | 276.42 | 99.9% | 80.5% |
| 20 | Intel | 6700K | Skylake | 2015 | x86-64 | 1032 | 4.00 | 258.00 | 93.2% | 75.1% |
| 21 | ARM Cortex | A76 | Enyo | 2018 | ARMv8.2 | 720 | 2.84 | 253.52 | 91.6% | 73.8% |
| 22 | Intel | 4770K | Haswell | 2013 | x86-64 | 966 | 3.90 | 247.69 | 89.5% | 72.1% |
| 23 | AMD | 1800X | Zen 1 | 2017 | x86-64 | 935 | 3.90 | 239.74 | 86.6% | 69.8% |
| 24 | Apple | A13 | Thunder | 2019 | ARMv8.4 | 400 | 1.73 | 231.25 | 83.5% | 67.3% |
| 25 | Apple | A8 | Typhoon | 2014 | ARMv8.0 | 323 | 1.40 | 230.71 | 83.4% | 67.2% |
| 26 | Intel | 3770K | Ivy Bridge | 2012 | x86-64 | 764 | 3.50 | 218.29 | 78.9% | 63.5% |
| 27 | Apple | A7 | Cyclone | 2013 | ARMv8.0 | 270 | 1.30 | 207.69 | 75.0% | 60.5% |
| 28 | Intel | 2700K | Sandy Bridge | 2011 | x86-64 | 723 | 3.50 | 206.57 | 74.6% | 60.1% |
| 29 | ARM Cortex | A75 | Prometheus | 2017 | ARMv8.2 | 505 | 2.80 | 180.36 | 65.2% | 52.5% |
| 30 | ARM Cortex | A73 | Artemis | 2016 | ARMv8.0 | 380 | 2.45 | 155.10 | 56.0% | 45.1% |
| 31 | ARM Cortex | A72 | Maya | 2015 | ARMv8.0 | 259 | 1.80 | 143.89 | 52.0% | 41.9% |
| 32 | Intel | E6600 | Core2 | 2006 | x86-64 | 338 | 2.40 | 140.83 | 50.9% | 41.0% |
| 33 | AMD | FX-8350 | BD | 2012 | x86-64 | 566 | 4.20 | 134.76 | 48.7% | 39.2% |
| 34 | AMD | Phenom 965 BE | K10.5 | 2009 | x86-64 | 496 | 3.70 | 134.05 | 48.4% | 39.0% |
| 35 | ARM Cortex | A57 (est.) | Atlas | n/a | ARMv8.0 | 222 | 1.80 | 123.33 | 44.6% | 35.9% |
| 36 | ARM Cortex | A15 (est.) | Eagle | n/a | ARMv7 32-bit | 188 | 1.80 | 104.65 | 37.8% | 30.5% |
| 37 | AMD | Athlon 64 X2 3800+ | K8 | 2005 | x86-64 | 207 | 2.00 | 103.50 | 37.4% | 30.1% |
| 38 | ARM Cortex | A17 (est.) | n/a | n/a | ARMv7 32-bit | 182 | 1.80 | 100.91 | 36.5% | 29.4% |
| 39 | ARM Cortex | A55 | Ananke | 2017 | ARMv8.2 | 155 | 1.60 | 96.88 | 35.0% | 28.2% |
| 40 | ARM Cortex | A53 | Apollo | 2012 | ARMv8.0 | 148 | 1.80 | 82.22 | 29.7% | 23.9% |
| 41 | Intel | Pentium D | P4 | 2005 | x86-64 | 228 | 3.40 | 67.06 | 24.2% | 19.5% |
| 42 | ARM Cortex | A7 (est.) | Kingfisher | n/a | ARMv7 32-bit | 101 | 1.80 | 56.06 | 20.3% | 16.3% |
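For readers who want to reproduce the derived columns, here is a minimal sketch of the arithmetic (scores and clocks taken from the table above; the baselines are the 9900K and Zen 3 rows, and the Zen 3 baseline appears to have been computed from unrounded inputs, so it differs by a fraction of a point):

```python
# Minimal sketch of how the derived columns are computed:
# PPC = GB5 score / clock (GHz); the two "Relative" columns
# normalize each PPC against the 9900K and Zen 3 (est.) rows.

def ppc(score: float, ghz: float) -> float:
    """Performance per clock: Geekbench 5 score divided by GHz."""
    return score / ghz

BASE_9900K = ppc(1384, 5.00)  # 276.80
BASE_ZEN3 = ppc(1683, 4.90)   # ~343.5 (the table lists 343.57)

for name, score, ghz in [("Apple A13 Lightning", 1332, 2.65),
                         ("Intel 9900K", 1384, 5.00),
                         ("Cortex X1", 1115, 3.00)]:
    p = ppc(score, ghz)
    print(f"{name:>20}: PPC {p:6.2f} | "
          f"vs 9900K {p / BASE_9900K:6.1%} | vs Zen3 {p / BASE_ZEN3:6.1%}")
```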

[Chart: GB5 PPC evolution]

[Chart: GB5 ST performance evolution]

[Chart: TOP 10 PPC CPU frequency evolution]



TOP 10 - Performance Per Area comparison at ISO-clock (PPA/GHz)

Copied from the locked thread. They're trying to keep people from seeing this comparison of how bad x86 is.

| Pos | Man | CPU | Core | Core Area (mm²) | Year | ISA | SPEC PPA/GHz | Relative |
|----:|-----|-----|------|----------------:|------|-----|-------------:|---------:|
| 1 | ARM Cortex | A78 | Hercules | 1.33 | 2020 | ARMv8 | 9.41 | 100.0% |
| 2 | ARM Cortex | A77 | Deimos | 1.40 | 2019 | ARMv8 | 8.36 | 88.8% |
| 3 | ARM Cortex | A76 | Enyo | 1.20 | 2018 | ARMv8 | 7.82 | 83.1% |
| 4 | ARM Cortex | X1 | Hera | 2.11 | 2020 | ARMv8 | 7.24 | 76.9% |
| 5 | Apple | A12 | Vortex | 4.03 | 2018 | ARMv8 | 4.44 | 47.2% |
| 6 | Apple | A13 | Lightning | 4.53 | 2019 | ARMv8 | 4.40 | 46.7% |
| 7 | AMD | 3950X | Zen 2 | 3.60 | 2019 | x86-64 | 3.02 | 32.1% |
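The Relative column is just each core's SPEC PPA/GHz normalized to the A78 leader; a quick sanity check over the table's values (small rounding differences aside):

```python
# Sanity check of the "Relative" column: each SPEC PPA/GHz value
# from the table, normalized to the Cortex-A78 leader (9.41).
ppa_per_ghz = {
    "Cortex-A78": 9.41, "Cortex-A77": 8.36, "Cortex-A76": 7.82,
    "Cortex-X1": 7.24, "Apple A12 Vortex": 4.44,
    "Apple A13 Lightning": 4.40, "AMD Zen 2": 3.02,
}
leader = ppa_per_ghz["Cortex-A78"]
for core, ppa in ppa_per_ghz.items():
    print(f"{core:>20}: {ppa / leader:6.1%}")
```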



It's impressive how fast the generic Cortex cores are evolving:
  • A72 (2015), which can be found in most SBCs, has 1/3 the IPC of the new Cortex X1 - they tripled IPC in just 5 years.
  • A73 and A75 (2017), which power the majority of Android smartphones today, have 1/2 the IPC of the new Cortex X1 - they doubled IPC in 3 years.

Comparing x86 vs. Cortex cores:
  • A75 (2017) compared to Zen1 (2017) loses a massive 34% PPC to x86. As expected.
  • A77 (2019) compared to Zen2 (2019) closed the gap and is equal in PPC. Surprising - the Cortex cores caught up with the x86 cores.
  • X1 (2020) is another +30% IPC over the A77. Zen3 needs to bring a 30% IPC jump to stay on par with the X1.

Comparison to Apple cores:
  • AMD's Zen2 core is slower per clock than Apple's A9 from 2015... so AMD is 4 years behind Apple.
  • Intel's Sunny Cove core in Ice Lake is slower per clock than Apple's A10 from 2016... so Intel is 3 years behind Apple.
  • The Cortex A77 core is slower per clock than Apple's A9 from 2015... but
  • the new Cortex X1 core is slower per clock than Apple's A11 from 2017, so ARM Ltd is 3 years behind Apple and closing the gap.



GeekBench5.1 comparison from 6/22/2020:
  • added Cortex X1 and A78 performance projections from Andrei here
  • 2020 awaiting new Apple A14 Firestorm core and Zen3 core



EDIT:
Please note, to stop the endless discussion about PPC frequency scaling: to keep the comparison fair and clean, I use only the top (highest-clocked) version of each core as its representative for top performance.
 

Doug S

Platinum Member
Feb 8, 2020
To contrast GB5 with CBR20 (4.4 GHz, 1.344v-1.38v)

CBR20:

ST: ~49W package power
MT: ~162W package power

GB5:

ST: ~42W average, peaked at 46W in Structure from Motion
MT: All over the place, so average seems pointless, but it was ~91W. Ray tracing seemed to push power up to around 142-147W, while Structure from Motion hit the 130s.

Pretty sure Primate Labs claims to use AVX, but . . .


On the ST side the power usage is pretty similar. Cinebench is NOT a general purpose benchmark; it tests only one thing and is pretty meaningless if what you do isn't that one thing or closely related to it. Geekbench and SPEC run a variety of tests to try to form more of an average of performance across a variety of tasks. Some things (especially if they have portions that are mostly cache bound, so the memory controller isn't exercised as much) will end up using less power than others.

For instance, if you test a database load versus a heavy streaming load (which I assume Cinebench is, though I haven't really looked at what it tests because it isn't in the realm of stuff I care about) you will see the database load use a lot less power on a CPU with a lot of cores. It isn't because the database load isn't stressing it; it is because databases can't effectively use all cores all the time due to locking and such. Tasks that are considered "embarrassingly parallel", i.e. those that will benefit from more cores assuming they can get enough memory bandwidth, will burn more and more power up to the package max because there are no inter-thread dependencies.
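The contrast between lock-bound database loads and embarrassingly parallel ones is essentially Amdahl's law; here is a toy sketch (the serial fractions are made-up illustration numbers, not measurements):

```python
# Toy Amdahl's-law sketch of the point above: a lock-bound load stops
# scaling (and stops loading the package) long before an embarrassingly
# parallel one does. Serial fractions are invented for illustration.

def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup when serial_fraction of the work cannot parallelize."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in (2, 4, 8, 16, 32):
    db = amdahl_speedup(0.10, cores)   # database-like: ~10% serialized on locks
    ep = amdahl_speedup(0.001, cores)  # render-like: almost perfectly parallel
    print(f"{cores:2d} cores: database x{db:5.2f}, parallel x{ep:5.2f}")
```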
 

Doug S

Platinum Member
Feb 8, 2020
Anyway, Intel, AMD and IBM's high core count CPUs all have a few things in common: small L1 and L2 caches, and absolutely humongous L3 caches. This is in sharp contrast to the Apple A series, which has huge L1 and L2 caches with no L3 cache.

You're making an incorrect assumption here. There is nothing stopping a high core count CPU from having huge L1 and L2 caches. The cost of doing so is a bit of area (which designers have in abundance these days) but it does not limit the number of cores other than at the margins - i.e. if bigger caches make the cores a bit larger maybe you only have room for 25 cores instead of 28 at a given die size.

Smaller caches are actually related much more to higher clock rates, because the larger the cache, the slower it is when measured in absolute time (i.e. ns instead of cycles). You don't care too much about absolute latency though; you care about latency measured in clock cycles - if your pipeline is such that you have to wait a few extra cycles for every L1 access, your core will perform terribly. Thus a smaller cache that is faster in terms of latency measured in clock cycles is kind of forced on you if your design targets a high clock rate.
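To put numbers on the ns-versus-cycles point, a back-of-the-envelope conversion (the 1 ns L1 latency is an assumed round number, not any particular chip's spec):

```python
import math

# Same physical cache, fixed ~1 ns access time: the faster you clock,
# the more pipeline cycles an L1 hit costs. 1.0 ns is an assumed round
# number for illustration, not a measured figure.
L1_LATENCY_NS = 1.0

for ghz in (2.6, 4.0, 5.0):
    cycles = math.ceil(L1_LATENCY_NS * ghz)  # ns * (cycles per ns)
    print(f"{ghz:.1f} GHz -> L1 hit takes {cycles} cycles")
```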

Everything is a tradeoff in CPU design. If you clock faster then you can get twice as much work done versus a core that's clocked half as fast, at least when you don't have to wait on memory or branches or whatever - but clocking higher means all levels of cache and memory are further away in terms of clock cycles so you have to adjust for that with smaller caches with fewer ways. If you go wider, you can get more work done per cycle, at least when your code allows you to fill all the slots, but a wider design burns more power and is more difficult to clock as high.

There isn't one "right" way to do it, different teams choose different points on the high clock / moderate clock and wide / not as wide spectrum, along with many other decisions they have to make. But just about every decision you make implies tradeoffs in other stuff like cache sizing, number of register ports, TLB size and on and on. About the only thing we know for sure that's "wrong" is pursuing clock rate above everything else. That was Intel's goal with the P4 when they talked about hitting 10 GHz eventually. Those 'half cycle' instructions pointed to a pipeline that was 'double pumped' internally. Their goal was to expose that so the half clock becomes a full clock and the clock rate doubles. Unfortunately they found that such a high clock rate burned an unacceptably high amount of power.
 

Carfax83

Diamond Member
Nov 1, 2010
You're making an incorrect assumption here. There is nothing stopping a high core count CPU from having huge L1 and L2 caches.

Except die space and power limits. From what I understand, SRAM takes up a lot of die space because it's not very dense and it also burns a lot of power because it's often running at the same frequency as the core.

Smaller caches are actually related much more to higher clock rates, because the larger the cache, the slower it is when measured in absolute time (i.e. ns instead of cycles). You don't care too much about absolute latency though; you care about latency measured in clock cycles - if your pipeline is such that you have to wait a few extra cycles for every L1 access, your core will perform terribly. Thus a smaller cache that is faster in terms of latency measured in clock cycles is kind of forced on you if your design targets a high clock rate.

Agreed, which is why Intel, AMD and IBM prefer to have a much larger L3 cache for their multicore CPUs, as that lowers latency for the entire CPU while keeping power and thermals within a reasonable state, compared to increasing the size of the L1 and L2 caches. And this is also why a theoretical A13 scaled to 8 big cores would not have 64MB of L2 cache, or even 32MB.

Apple's methodology only makes sense for small core count CPUs that will be doing predominantly single threaded workloads.

If you go wider, you can get more work done per cycle, at least when your code allows you to fill all the slots, but a wider design burns more power and is more difficult to clock as high.

Which is why I am very eager to see how a very wide CPU with relatively low clock speeds like the Apple A series compares to a traditional x86 design across a wide variety of workloads.
 

Doug S

Platinum Member
Feb 8, 2020
Except die space and power limits. From what I understand, SRAM takes up a lot of die space because it's not very dense and it also burns a lot of power because it's often running at the same frequency as the core.


Your understanding is exactly backwards. Cache is much more dense than random logic. A new process is often brought up first with SRAM, since that's the simplest and most dense structure. When Intel claims it has x density for a process, their numbers are based on a die of pure SRAM. The power a cache uses is proportional to its clock rate and the type of transistor used, not its size, so going from say 64K to 128K of L1 doesn't really use more power, other than the leakage current present in any active transistors, unless you added more ways of associativity when you increased its size.

L1 is the least dense and L3 the most dense, because L1 is more complex (more ways, more ports, etc.) and uses the fastest possible transistors - which implies less density - however, even L1 is quite a bit denser than logic blocks like an ALU. L3 is the densest because it's as simple as possible and the fastest transistors aren't used, since area and power matter more there than speed.
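As a rough illustration of how little area even a doubled L1 costs, here is a sketch using the published TSMC N7 high-density SRAM bit-cell size (~0.027 µm²); the 1.5x array-overhead factor for decoders, sense amps and tags is an assumed round number, not a measured figure:

```python
# Rough SRAM area sketch. 0.027 um^2/bit is the published TSMC N7
# high-density bit-cell figure; the 1.5x overhead for decoders, sense
# amps and tags is an assumption for illustration.
BITCELL_UM2 = 0.027
ARRAY_OVERHEAD = 1.5

def sram_area_mm2(kib: float) -> float:
    """Approximate macro area in mm^2 for a cache of `kib` KiB."""
    bits = kib * 1024 * 8
    return bits * BITCELL_UM2 * ARRAY_OVERHEAD / 1e6  # um^2 -> mm^2

for label, kib in [("64 KiB L1", 64), ("128 KiB L1", 128),
                   ("8 MiB L2", 8 * 1024)]:
    print(f"{label:>10}: ~{sram_area_mm2(kib):5.2f} mm^2")
```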

Which is why it is really stupid to compare chips in terms of density, as I see done here too often (not saying you do it, but those who do know who they are). A wider design with more ALUs will be less dense, a design with a bigger cache will be more dense, and modern chips have all sorts of other structures like memory controllers, GPUs, NPUs, IPUs and whatever else. I don't know enough about their properties to know the typical relative density of, say, a memory controller or an NPU, but it's safe to say that in a modern SoC like the A13 the CPU cores and caches are such a small part of the overall die that comparing its density to something else made on the same process, like HiSilicon's SoC, let alone something made on a totally different process like an Intel CPU, is a fool's errand. You might as well compare a Mustang and a Corvette based on how their exhaust smells.
 

DrMrLordX

Lifer
Apr 27, 2000
I'm looking forward to seeing Apple duke it out with Intel and AMD in anything but Geekbench and Spec2006.

Same. Based on how my Snapdragon 855+ handled the limited Java benchmarking I threw at it (a Java adaptation of part of Dr. Cutress' 3DPM), it looked really strong. Kicked the crap out of my old A10-7700K. And we know that Apple has stronger cores than the A76. That wasn't even a workload that mobile CPUs "should be good at", and yet, there it was.

On the ST side the power usage is pretty similar. Cinebench is NOT a general purpose benchmark; it tests only one thing and is pretty meaningless if what you do isn't that one thing or closely related to it.

Therein lies the problem. What do you use to gauge IPC? GB5 is more like Antutu in that it is at least partially a user-experience benchmark. Do we regularly rate a CPU's performance based on how quickly it loads a PDF? Normally, no. Geekbench does, and it's a part of the score. I'm reluctant to take any benchmark suite seriously if it includes too many tests that leave execution resources unutilized.
 

Richie Rich

Senior member
Jul 28, 2019
Which is why I am very eager to see how a very wide CPU with relatively low clock speeds like the Apple A series compares to a traditional x86 design across a wide variety of workloads.
@Doug S gave you the example from the past with the P4. It was exactly the same situation:
  • year 2000 - narrow 2xALU P4 core with higher clocks VS. 50% wider 3xALU K8 core with lower clocks
  • year 2020 - narrow 4xALU Intel/AMD core with higher clocks VS. 50% wider 6xALU A13 core with lower clocks
An almost exactly identical situation, and we all know which approach won. The situation is even worse for x86 now, because Apple's A13 is designed with an extreme emphasis on power consumption for mobile devices, which the K8 was not. Intel was lucky it had the mobile Pentium M (Banias) branch, so it was able to get back on track pretty fast. But Intel and AMD have no similar product today and no experience doing that (look how badly Samsung failed to develop a wide core to compete with Apple - and Samsung tried it with the Bobcat and Jaguar teams they dragged over from AMD during the Bulldozer exodus).

Another bombshell is that Apple is moving its whole lineup, including the Mac Pros that use Xeons. This means Apple is working on a server-grade CPU:
  • A14 is starting mass production now
  • an A15-class, Xeon-replacing server-grade CPU is about to tape out (or maybe samples are already running)

This would explain the still-ongoing lawsuit against Nuvia, the server-CPU company. Maybe Apple will follow Amazon and create its own cloud hardware to save a tremendous amount of money. There is a huge difference between paying 7,500 USD for a 64-core EPYC 7742 and 500 USD for their own 64-core silicon - and one twice as powerful at that.
 

lobz

Platinum Member
Feb 10, 2017
OK, this is enough for me. Reading through these 5 threads with the same message from Richie Rich but with different titles has been such an excruciating chore that he managed to turn one of my favorite topics into a PTSD trigger. Good luck to everyone else.
 

Doug S

Platinum Member
Feb 8, 2020
Therein lies the problem. What do you use to gauge IPC? GB5 is more like Antutu in that it is at least partially a user-experience benchmark. Do we regularly rate a CPU's performance based on how quickly it loads a PDF? Normally, no. Geekbench does, and it's a part of the score. I'm reluctant to take any benchmark suite seriously if it includes too many tests that leave execution resources unutilized.

You don't. IPC is mostly meaningless, because people care about actual performance not performance per clock. Far too much attention is paid to it on these forums.

Personally, when I look at a benchmark suite like Geekbench or SPEC, I pretty much only look at the compiler benchmark (gcc, clang, LLVM, whatever). That's impossible to game with compiler tricks and will have a lot of impossible-to-predict branches, so there has never been a CPU that performs well compiling code that doesn't perform well on all general purpose code. If a CPU falls short on that benchmark, you know it has a glass jaw somewhere. You won't know exactly what, but you'll know it has one and can't be trusted for general purpose performance, even if it performs terrific on some more narrowly focused benchmark.

If you care about stuff like file compression, or whatever exactly Cinebench is measuring, then look at the components which do stuff like that, or use a narrow benchmark like Cinebench itself. Just because a benchmark suite like Geekbench has tests that are irrelevant to you doesn't mean the whole thing is irrelevant. The purpose of measuring how long it takes to open a PDF isn't because that is in and of itself important, but because a PDF/PS interpreter is a good proxy for a lot of applications that have to parse a complex file format. Excel will do something similar when you load a complex spreadsheet, a CAD program when you load your design, and so on.

Tests aren't irrelevant simply because they don't use all execution resources. A lot of what you do doesn't use all execution resources. If you want to see poor execution unit usage and terrible real world IPC, look at the trace of a CPU running Oracle sometime. It can't come close to one instruction per cycle on any CPU, despite companies like IBM and Intel investing billions in trying to make it go faster, because the market for hardware that runs databases better is worth billions. If you looked only at execution unit usage you would say a database is not a worthy benchmark, when for a lot of the market it is the ONLY benchmark that matters.
 

Carfax83

Diamond Member
Nov 1, 2010
Your understanding is exactly backwards. Cache is much more dense than random logic. A new process is often brought up first with SRAM, since that's the simplest and most dense structure. When Intel claims it has x density for a process, their numbers are based on a die of pure SRAM. The power a cache uses is proportional to its clock rate and the type of transistor used, not its size, so going from say 64K to 128K of L1 doesn't really use more power, other than the leakage current present in any active transistors, unless you added more ways of associativity when you increased its size.

I guess I should have been more specific. By density I mean capacity per unit area. You can fit less of it in a given area because four to six transistors (typically) are required per bit, whereas DRAM, as a comparison, requires just one transistor per bit.

At any rate, I think my point still stands. A CPU can be optimized to favor single threaded workloads and/or multithreaded workloads by nature of its cache hierarchy. Currently, Apple has optimized their CPUs for single threaded workloads for obvious reasons, while Intel and AMD prefer a more balanced approach as they have greater platform diversity. Of course things will likely change in the future, as Apple starts to target other platforms as well.
 

Carfax83

Diamond Member
Nov 1, 2010
Almost exactly identical situation and we all know which approach won. It's even worse situation for x86 now because Apple's A13 is designed with extreme emphasis about power consumption for mobile devices which K8 was not. Intel was lucky that had mobile Pentium M Banias branch so he was able to shift back on track pretty fast. But Intel/AMD they have no similar product and no experience to do that (look how Samsung terribly failed to develop wide core to compete with Apple). And Samsung did that with Bobcat and Jaguar teams they dragged from AMD during BD exodus.

This only holds true because Intel lost so much time with that 10nm fiasco, and AMD was way behind Intel in terms of IPC until Zen. Now both companies are hell-bent on making wider designs. Sunny Cove, according to Intel, is a 5-wide design, and Golden Cove will likely be even wider:

[Image: Ronak28.jpg]
 


Richie Rich

Senior member
Jul 28, 2019
@Carfax83 The Sunny Cove core is not that great compared to the new ARM cores.


Number of ports:
  • ARM Apple A13 .... 11 wide
  • ARM Cortex X1 ..... 15 wide
  • x86 Sunny Cove .... 10 wide
  • x86 Zen2 ............... 11 wide


ALU comparison:
  • ARM Apple A13 .... 6xALU (2xBranch shared)
  • ARM Cortex X1 ..... 4xALU + 2xBranch in separated ports
  • x86 Sunny Cove .... 4xALU (2xBranch shared, also shared with 3xFPU)
  • x86 Zen2 ............... 4xALU (2xBranch shared)

You can see that Intel and AMD have the narrower/weaker designs here. The ARM cores are the leaders, especially Apple's design with the world's first 6xALU core. No wonder Apple has an 82% IPC/PPC lead over the Intel/AMD cores. Even the Cortex X1 has 40% higher IPC/PPC than Zen2. Those are huge numbers.


AGU comparison:
  • ARM Apple A13 .... 2xAGU (load & store)
  • ARM Cortex X1 ..... 2xAGU (load & store) + 1xAGU (load) + 2x Store
  • x86 Sunny Cove .... 4xAGU (2xload + 2xstore) + 2x Store
  • x86 Zen2 ............... 2xAGU (load & store) + 1x Store

Sunny Cove looks like the winner here, but it's very store-oriented (it has only 2x load AGUs), which suggests it's built for SIMD operation. Usually around 40% of instructions are loads, so for high IPC in general code the winner is the Cortex X1 with its 3x load AGUs - AFAIK the first core in the world with three load AGUs. The big question mark is Apple's AGUs, because theirs looks like the poorest design here. There must be something we don't know, IMO.
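A crude way to see why the load-AGU count matters: if ~40% of instructions are loads, sustained IPC cannot exceed the number of load ports divided by the load fraction. A toy ceiling calculation using the port counts listed above (an upper bound, not a simulation):

```python
# Toy load-bound IPC ceiling: with ~40% of instructions being loads,
# sustained IPC <= load_ports / load_fraction. This ignores every other
# structural limit, so treat it as an upper bound, not a prediction.
LOAD_FRACTION = 0.40

load_ports = {"Apple A13": 2, "Cortex X1": 3, "Sunny Cove": 2, "Zen 2": 2}

for core, ports in load_ports.items():
    print(f"{core:>10}: load-bound IPC ceiling ~{ports / LOAD_FRACTION:.1f}")
```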

FPU comparison:
  • ARM Apple A13 .... 3xFPU 128-bit
  • ARM Cortex X1 ..... 4xFPU 128-bit
  • x86 Sunny Cove .... 3xFPU 256(?)-bit
  • x86 Zen2 ............... 2xFPU 256-bit (in 4xpipes)

The FPU is the good part of the current x86 designs; however, the ARM cores made huge improvements and matched it. Don't forget the new SIMD instruction set SVE2 is coming next year, and it will be another huge step up. The new ARM-based Fujitsu A64FX with its 2x512-bit SVE SIMD/FPU is beating supercomputers based on Volta GPUs. And SVE2 is designed to scale to 2048-bit, so in theory ARM cores with massive 2048-bit FPUs could appear next year. It isn't about ISA limits anymore; ARM cores can adopt vectors as wide as they need. Such wide SIMD doesn't make sense for smartphones due to power consumption, of course. But what about the next-gen A64FX? Do you think Fujitsu is sitting still and enjoying the fame? I expect an A64FX-2 is under development, and if Fujitsu keeps a conservative two-year cycle we can expect it next year (maybe with 2x1024-bit FPUs).
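For scale, peak FP64 FLOPs per cycle per core is roughly pipes x (vector bits / 64) x 2 for fused multiply-add; applied to the A64FX and to the hypothetical 2x1024-bit successor mentioned above:

```python
# Peak FP64 FLOPs/cycle/core ~= pipes * (vector_bits / 64) * 2 (FMA).
# The 2x1024-bit configuration is the speculative successor mentioned
# above, not an announced product.

def peak_fp64_flops_per_cycle(pipes: int, vector_bits: int) -> int:
    return pipes * (vector_bits // 64) * 2

for name, pipes, bits in [("A64FX, 2x512-bit SVE", 2, 512),
                          ("hypothetical 2x1024-bit", 2, 1024)]:
    print(f"{name}: {peak_fp64_flops_per_cycle(pipes, bits)} FLOPs/cycle")
```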

[Slides: Cortex-A78 vs. Cortex-X1 comparison (A78-X1-crop-23, A78-X1-crop-24)]



The funny thing is that the weak A76 in Graviton2 (the A76 has only 3xALU + 1x branch, 2xAGU, 2xFPU) is beating Zen1 servers very badly and delivers higher performance per thread than Zen2 EPYC Rome systems. And now the Cortex X1 has 60% higher IPC than the poor A76. Imagine the damage it is going to do to x86 server systems.
 

Doug S

Platinum Member
Feb 8, 2020
At any rate, I think my point still stands. A CPU can be optimized to favor single threaded workloads and/or multithreaded workloads by nature of its cache hierarchy. Currently, Apple has optimized their CPUs for single threaded workloads for obvious reasons, while Intel and AMD prefer a more balanced approach as they have greater platform diversity. Of course things will likely change in the future, as Apple starts to target other platforms as well.

No, it doesn't. Apple's cache hierarchy is not "optimized to favor single threaded workloads". That same cache hierarchy would work just as well for multi-threaded workloads. The fact that Apple doesn't post huge numbers in multithreaded workloads has only to do with the fact that Apple has not designed (or at least has not publicly released) anything with dozens of cores. Intel and AMD have not chosen a "more balanced approach"; they simply made a different choice than Apple's designers did.

You are taking the conclusion you want to reach as a given and making up a reason to justify it.
 

JoeRambo

Golden Member
Jun 13, 2013
The fact Apple doesn't post huge numbers in multithreaded workloads has only to do with the fact that Apple has not designed (or at least has not publicly released) anything with dozens of cores.

I think that's where the question "can they add dozens of cores while keeping the same cache hierarchy" lies.

Remember good old Core2, which competed with the K8 (which had an IMC) by virtue of having a blob of 4MB of fast L2? Apple has something like that right now - a cache hierarchy very much optimized for its clock speed and core count.

They will need to make tradeoffs if they want to increase clocks (like keeping the same latency in ns, but more of it in cycles from the CPU's point of view). Or, once they add more cores, they will need interconnects (be it rings, crossbars or whatever) and to partition the cache into segments (think Skylake-X with 1MB of L2, or AMD's CCX). And then things like maintaining coherency start to bite when you no longer have a blob of fast cache to deal with it, but need to send that traffic to the other side of your chip or to memory.

But one thing is obvious: Apple has the team and resources to pull off whatever they want.
 

TheGiant

Senior member
Jun 12, 2017
I think that's where the question "can they add dozens of cores while keeping the same cache hierarchy" lies.

Remember good old Core2, which competed with the K8 (which had an IMC) by virtue of having a blob of 4MB of fast L2? Apple has something like that right now - a cache hierarchy very much optimized for its clock speed and core count.

They will need to make tradeoffs if they want to increase clocks (like keeping the same latency in ns, but more of it in cycles from the CPU's point of view). Or, once they add more cores, they will need interconnects (be it rings, crossbars or whatever) and to partition the cache into segments (think Skylake-X with 1MB of L2, or AMD's CCX). And then things like maintaining coherency start to bite when you no longer have a blob of fast cache to deal with it, but need to send that traffic to the other side of your chip or to memory.

But one thing is obvious: Apple has the team and resources to pull off whatever they want.
QFT.
I have little doubt the performance is there for the A series, but as I understand it Apple uses very low-latency caches.
But if they can keep the 8-large/8-small core concept with reasonably low L2 latency as a monolithic die, this will be a monster.
 

Richie Rich

Senior member
Jul 28, 2019
They will need to make tradeoffs if they want to increase clocks (like keeping the same latency in ns, but more of it in cycles from the CPU's point of view). Or, once they add more cores, they will need interconnects (be it rings, crossbars or whatever) and to partition the cache into segments (think Skylake-X with 1MB of L2, or AMD's CCX). And then things like maintaining coherency start to bite when you no longer have a blob of fast cache to deal with it, but need to send that traffic to the other side of your chip or to memory.

But one thing is obvious: Apple has the team and resources to pull off whatever they want.
Yeah, if you read Andrei's article about the A13 carefully (HERE), you will find out that Apple has the best cache subsystem in the game. They have the best engineers, so it's easy for them to develop a server-grade subsystem that scales a high core count up efficiently. Look at Graviton2, based on the Neoverse N1, as a first tryout by ARM Ltd - it's beating Zen1 and in many ways Zen2 systems. If ARM Ltd was able to do that, you can bet Apple can do it with its left hand.
 

Doug S

Platinum Member
Feb 8, 2020
I think that's where the question "can they add dozens of cores while keeping the same cache hierarchy" lies.

Remember good old Core2, which competed with the K8 (which had an IMC) by virtue of having a blob of 4MB of fast L2? Apple has something like that right now - a cache hierarchy very much optimized for its clock speed and core count.

They will need to make tradeoffs if they want to increase clocks (like keeping the same latency in ns, but more of it in cycles from the CPU's point of view). Or, once they add more cores, they will need interconnects (be it rings, crossbars or whatever) and to partition the cache into segments (think Skylake-X with 1MB of L2, or AMD's CCX). And then things like maintaining coherency start to bite when you no longer have a blob of fast cache to deal with it, but need to send that traffic to the other side of your chip or to memory.

But one thing is obvious: Apple has the team and resources to pull off whatever they want.


You aren't comparing like for like. Core 2 had an L1 in each core and an L2 shared by the two cores. Skylake had an L1 and L2 in each core, and an L3 shared by the cores. So going from Core 2 to Skylake Intel ADDED per core cache by giving each core its own L2, and made L3 the cache level shared by all. The latter is like Apple, and why Apple will not need to change anything about their L1 and L2 sizes to increase the number of cores, though that doesn't preclude them making such a decision for other reasons - but it won't be simply about increasing the number of cores. There's absolutely nothing stopping them from making a 32 core Mac Pro beast that has the same L1/L2 cache size as found in the iPhone 11's SoC. Or hell, even bigger caches if that's what they think is the best way forward.

Yes you are correct that if they want to increase clocks to any great extent they will either need to accept a latency increase in their caches or reduce their size/complexity to reduce their absolute latency to maintain the same per cycle latency. But who says they are going to target Intel/AMD like clocks? They are able to match Intel/AMD CPUs with FAR higher clock rates at 2.6 GHz so they don't need 5 GHz or even 4 GHz. Probably at about 3.1 GHz they are beating the single thread performance of the fastest turbo'ed (but not overclocked) x86 CPUs on the market. And that's easily within reach using TSMC's N5 and a bit of a boost to the power budget above what a phone's form factor will allow.
 

Doug S

Platinum Member
Feb 8, 2020
QFT.
I have little doubt the performance is there for the A series, but as I understand it Apple uses very low-latency caches.
But if they can keep the 8-large/8-small core concept with reasonably low L2 latency as a monolithic die, this will be a monster.

Apple's "low latency" caches are more a result of their lower clock rate. Cache latency in nanoseconds is pretty much fixed on a certain process once you've made you choices about its size and complexity. Apple's clock rate is half as fast so their cycles are twice as long in nanoseconds. So even though their caches are bigger which makes them slower in absolute time in nanoseconds, they are pretty quick when measured in cycles due to that slower clock.
 

JoeRambo

Golden Member
Jun 13, 2013
The latter is like Apple, and why Apple will not need to change anything about their L1 and L2 sizes to increase the number of cores, though that doesn't preclude them making such a decision for other reasons - but it won't be simply about increasing the number of cores. There's absolutely nothing stopping them from making a 32 core Mac Pro beast that has the same L1/L2 cache size as found in the iPhone 11's SoC.


I don't really follow this argument. Even taking a basic look at Apple core die shots, it is quite obvious that "32 cores" would need quite a redesign. Sure, Apple could in theory duplicate 16 clusters of 2 big cores + the big-core L2, then proceed to connect said clusters with 16 links to a System Agent + System Cache (L3-ish), but that is where the problems start, with coherency, tags, extra latencies and so on. It would be a challenge to physically fit things and to logically connect them.

Looking at things like Graviton2 from Amazon reveals that the realities of an ARM server CPU are unlike those 2-core clusters with an uber-fast, large L2 - more like a mesh network a la Skylake-X, with a reduced L2 size compared to what Apple's client CPUs have right now.
 

Doug S

Platinum Member
Feb 8, 2020
I don't really follow this argument. Even taking a basic look at Apple core die shots, it is quite obvious that "32 cores" would need quite a redesign. Sure, Apple could in theory duplicate 16 clusters of 2 big cores + the big-core L2, then proceed to connect said clusters with 16 links to a System Agent + System Cache (L3-ish), but that is where the problems start, with coherency, tags, extra latencies and so on. It would be a challenge to physically fit things and to logically connect them.

Looking at things like Graviton2 from Amazon reveals that the realities of an ARM server CPU are unlike those 2-core clusters with an uber-fast, large L2 - more like a mesh network a la Skylake-X, with a reduced L2 size compared to what Apple's client CPUs have right now.


I'm not claiming you can cut and paste additional cores in, but adding cores is a pretty well-understood problem. The hard part is going from one to two; once you've accomplished that, adding a third or a 33rd core is a lot easier.

Whether your L2 is 512K or 16MB, you still have to connect it to L3 and main memory, and handle snooping, coherency and all that fun stuff. It isn't going to be any more of a challenge to "fit things" other than size - i.e. if the overall core is 25% larger due to a bigger L2, then you have room for fewer cores on a die at a given die size target. The fabric and its connectivity take up the same amount of space regardless of L2 size.

Graviton2's design isn't telling you "this is the only way things can be done" so Apple has to do it like this if they want more cores. It is telling you the decision Amazon's team made. Almost 30 years ago HP designed their PA-RISC chips for servers and workstations with 1MB of L1I and 2MB of L1D. So large for the technology of the time that their caches were off-chip. Only when they went 64 bit in the mid 90s (and Moore's Law caught up to their designs) were they able to move those massive caches on chip. If everyone only solved problems by watching what others do and assuming that's the only way to solve it, there would never be any progress.
 

name99

Senior member
Sep 11, 2010
I don't really follow this argument. Even taking a basic look at Apple core die shots, it is quite obvious that "32 cores" would need quite a redesign. Sure, Apple could in theory duplicate 16 clusters of 2 big cores + the big-core L2, then proceed to connect said clusters with 16 links to a System Agent + System Cache (L3-ish), but that is where the problems start, with coherency, tags, extra latencies and so on. It would be a challenge to physically fit things and to logically connect them.

Looking at things like Graviton2 from Amazon reveals that the realities of an ARM server CPU are unlike those 2-core clusters with an uber-fast, large L2 - more like a mesh network a la Skylake-X, with a reduced L2 size compared to what Apple's client CPUs have right now.

No-one is claiming that you would cut and paste the iPhone SoC design into a 32 core design!
The claims are that
- sizewise it's hardly outrageous. An Apple large core (A13) is about 4.5mm^2, and the L2 is about 4.5mm^2. Group together 4 cores with an L2 and you're at about 23mm^2; 8 of those and you're at around 180mm^2. Hardly outrageous. Of course you need some L3, and you need memory controllers. But the point is, we're still well within the limits of "easy" areas, not up at the 600-800mm^2 difficult areas.
It's pointless investigating the issue further than these rough numbers because we have zero idea what Apple's plans are for how they will handle many core machines.
+ a 16+16 core baseline and dual socket?
+ chiplets?
+ IO/memory/L3 (or L4) on a separate chiplet?
+ GPU as a separate chiplet? Or distributed as slices across each compute chiplet? Or a separate chip (but perhaps mounted on the same package as the compute die?)

- why are you so convinced that designing a NoC and L3 that can scale to many cores is a task that's beyond Apple's capabilities? Marvell can do it, Ampere can do it, Amazon can do it -- but somehow when Apple tackles this problem it will fall on its face?
How does this assumption make sense?
 

Doug S

Platinum Member
Feb 8, 2020
- why are you so convinced that designing a NoC and L3 that can scale to many cores is a task that's beyond Apple's capabilities? Marvell can do it, Ampere can do it, Amazon can do it -- but somehow when Apple tackles this problem it will fall on its face?
How does this assumption make sense?

I don't think people are arguing Apple will fall on its face; they just want to believe that in order to go many-core Apple will have to make changes (like reducing cache size) that will reduce single thread performance. It's the only way they can preserve their belief that Apple is somehow "cheating" with its (compared to Intel & AMD) overly high single thread results.

First it was because GB/SPEC were somehow tilted in Apple's or ARM's favor; now it is because Apple has very large caches, which they want to believe are only possible because Apple SoCs have far fewer cores than the biggest Intel/AMD CPUs. It doesn't make sense, but they have to keep scrambling for some excuse to hold onto, because the alternative would be to accept that Macs just might beat Intel & AMD PCs in BOTH single and multi thread performance in a couple of years.
 

Carfax83

Diamond Member
Nov 1, 2010
You can see Intel and AMD has a narrower/weaker design here. ARMs are leaders, especially Apple's design, with its first 6xALU core on the world. No wonder Apple has 82% IPC/PPC lead over Intel/AMD cores. Even Cortex X1 has 40% higher IPC/PPC than Zen2. Those are huge numbers.

You keep forgetting that clock frequency is the other half of the equation for CPU performance. Intel and AMD obviously see the value in designing a more balanced architecture than just focusing on width like the ARM CPUs do.
 

Carfax83

Diamond Member
Nov 1, 2010
Intel and AMD have not chosen a "more balanced approach"; they simply made a different choice than Apple's designers did.

Serious question: do you know of a single high core count CPU (8 cores and greater) that has had a massive L2 cache attached to it without an L3?

As I said before, I'm not a chip architect and I don't even work in the tech industry. But I follow the industry fairly closely and I can't recall a single high core count CPU using that cache hierarchy.

You are taking the conclusion you want to reach as a given and making up a reason to justify it.

Just because something is possible doesn't mean it makes sense or should be done. I don't doubt that a high core count CPU with an enormous L2 cache and no L3 could be designed, but would it be as effective as the multilevel cache systems that Intel, AMD and IBM use?

I don't think so.
 

Carfax83

Diamond Member
Nov 1, 2010
They are able to match Intel/AMD CPUs with FAR higher clock rates at 2.6 GHz so they don't need 5 GHz or even 4 GHz. Probably at about 3.1 GHz they are beating the single thread performance of the fastest turbo'ed (but not overclocked) x86 CPUs on the market. And that's easily within reach using TSMC's N5 and a bit of a boost to the power budget above what a phone's form factor will allow.

I agree, but it must be said that Intel's failed 10nm process didn't do x86 any favors. If Intel had been successful with 10nm, one might wonder whether we would even be having this conversation.