
Info TOP 20 of the World's Most Powerful CPU Cores - IPC/PPC comparison


Hitman928

Platinum Member
Apr 15, 2012
2,914
2,436
136
First he defends GB as a good indicator of overall CPU performance (which imho is fair game), then he questions the benchmark's reliability when it comes to even multiple "random" server-class entries.

So, just to make it clear - GB5 is a good tool for estimating server CPU performance, but only as long as the result comes from a reputable source, unless the CPU is unreleased, in which case an anonymous forum user called Richie Rich is to be considered a reputable source. Scientific criteria on one side, personal preference on the other.

FYI - the OP rejected x86 server-class GB5 results not on reliability grounds, but because of the lower clocks used by most x86 server CPUs, combined with the arbitrary rule of having only 1 entry per architecture.


It's not the app's fault; think about what reducing clocks does to the balance between the CPU and the memory subsystem. We're not comparing absolute performance, or at least some proper relative indicator such as perf/watt (which would also put Apple cores first, mind you); it's an artificial comparison of PPC that heavily favors doing the work slowly so that the memory subsystem can keep up.

The reason for the PPC ranking is that Apple and ARM products have traditionally been mobile oriented, and the main argument brought against them as server/desktop replacements in the forums was scaling (frequency, core count, interconnect). Therefore perf/watt, while stunning for Apple at least, was not enough to lead imagination to new heights. Instead of waiting for more actual server/desktop silicon from Nuvia/Apple and other entities, the OP created his own narrative in which revolutionary high-performance ARM cores are always imminent, always around the next corner, and to support this scenario he chose to rely on PPC. This is also the reason we started seeing Nuvia and Apple future product estimates - an uncontrollable desire to make predictions happen.

You can witness his cognitive dissonance at work in the Apple A14 thread where even the remote possibility of A14 providing a generational leap in performance based mostly on higher clocks got him instantly tilted: first he rejected the GB results because of the clocks (he had no issue with the score), then quickly moved back to fantasy land with the wider A15 16c beast and the A19 12XALU juggernaut. If the present doesn't fit expectations, move to the future.



FYI, Thala was already presented with this information early this year.
Yes, it is quite obvious by now that this is not an honest attempt to build a conclusion from data, but rather a search for data to support a preconceived conclusion, where any evidence to the contrary will be ignored.
 
  • Like
Reactions: Elfear

Hitman928

Platinum Member
Apr 15, 2012
2,914
2,436
136
  • You claim that Renoir scores 1254 pts @ 1.8 GHz, resulting in 697 pts/GHz. The 16-core Ryzen 3950X has only 286 pts/GHz. So you claim that Renoir has 2.4x higher IPC? Really? Why would AMD bother with Zen3's +15% IPC development when they could put Renoir everywhere and go on a 10-year vacation?
  • You claim Epyc scores 1094 pts @ 2.25 GHz, resulting in 486 pts/GHz, which is 1.7x higher IPC than the 3950X. No comment.
  • You claim this is how we should do the IPC comparison?


This is exactly why I refused to update the IPC table with all these crazy, nonsensical low-clocked numbers. I told people: create your own, better IPC table. Show me how much better you are. Nothing. Silence. Obviously they like to hate, but they are smart enough to know that their own IPC table, full of these crazy results, would discredit them, not me. So they stick with hate.

Running the CPU at max speed:
  • eliminates frequency uncertainty.
  • delivers max performance, and that's what we are searching for.
  • means we compare IPC at max performance for a given uarch - that's the goal.

It's clear that most people are angry about Apple's almost double IPC and much bigger 6xALU architecture compared to their poor 4xALU x86 CPUs. All this denial of results, the doubting, the claims of ASIC circuits and hidden aggressiveness - it's just like taking a toy from a little kid. x86 CPUs have sucked for the last 5 years. Apple is the new uarch leader now. Get over it. And get used to it, because Nuvia Phoenix with 8xALU+4xBr = a 12-wide uarch is coming soon.
;)
You are literally just making stuff up now to conquer straw men in your own mind. Point to me where I claimed any of what you said, actually quote me where I say what you claim.
 

Hitman928

Platinum Member
Apr 15, 2012
2,914
2,436
136
The problem is you are cherry-picking clock speeds to best suit your narrative. For example, in your chart you assume the 3950X is pegged at 4.6 GHz for the entire test, which is almost certainly not true. Then, for the results you do not like, you assume they're pegged at base clocks the entire time, so that they'd have ridiculously high numbers you could just throw out - which is also not true. The Epyc 7742 has a 3.4 GHz max boost speed - yielding 321.7 pts/GHz - and that's assuming it stayed at 3.4 GHz the entire time. The Renoir chips you also tossed out have a 4.2 GHz max boost, yielding 298.5 pts/GHz - still better than the 3950X result you chose.

The 3950X result you chose shows a 4% IPC increase over Zen 1 in your own chart. How can you even seriously entertain that as accurate?
Exactly. You can actually see what clock speeds the chips run at for the first part of the benchmark by adding .gb5 to the end of the link. If you do that, it shows that all the examples I linked are at or very near max boost speed during the single-core test (which they obviously should be), which yields ~300 pts/GHz for Renoir (which has a reduced L3) and 320+ pts/GHz for Rome. These are not just randomly picked examples; I've looked through pages of results and the scores are consistent. They are also laptops and server machines, so it's not like someone is "modding" them for the highest GB5 scores - that's a ridiculous claim.

Also, everyone should be aware that the claimed 20% PPC increase in GB5 at reduced clocks is another straw man the OP created to try to excuse away results better than the ones in his table. You will not get +20% PPC simply by reducing clocks; you may gain a few percent, that's it. For instance, you can find plenty of examples of Ryzen cores at over 4 GHz that match the PPC of Rome at 3.4 GHz. It was just another excuse by the OP to ignore results that didn't fit his narrative.
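To make the arithmetic in this back-and-forth explicit, here is a minimal sketch (the scores and clocks are the ones quoted in the posts above; the ~1316 pt 3950X score is inferred from the disputed 286 pts/GHz at 4.6 GHz, so treat it as an assumption):

```python
# Points-per-GHz depends entirely on which clock you divide by.
def ppc(score, clock_ghz):
    """Geekbench 5 single-core points per GHz."""
    return score / clock_ghz

# Epyc 7742 result quoted above: 1094 pts.
print(ppc(1094, 2.25))  # divided by base clock -> ~486 pts/GHz
print(ppc(1094, 3.40))  # divided by max boost  -> ~322 pts/GHz

# 3950X: ~1316 pts, assumed pegged at its 4.6 GHz boost.
print(ppc(1316, 4.60))  # -> ~286 pts/GHz
```

Dividing the same Epyc score by base clock instead of boost clock inflates its apparent PPC by roughly 50%, which is the whole dispute in a nutshell.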
 

Doug S

Senior member
Feb 8, 2020
354
507
96
Sorry @Thala, but pumping more power into a chip and using HPC cells instead of low power cells will not let you run a 2.6 GHz Apple SoC at 5 GHz.

The design matters. Read up on FO4 delay, but the simple version is that a pipeline stage (i.e. the work that happens in a single clock cycle) can be further broken down by number of FO4 delays. If you target a lower clock rate, you might have for example 8 FO4 delays per clock cycle, but if you target high frequency you might design your circuits allowing only 6 FO4 delays per clock cycle. I recall the optimal for performance was pegged in the range of 6 to 8, not sure if that's still the case today but that's where I got the numbers for this example.

What that means is that the CPU that has 8 FO4 delays per cycle can get 33% more work done per cycle. If you had for example a multiplier circuit that required 24 FO4 delays to produce a result it could produce that result in 3 clock cycles on the low frequency design but require 4 clock cycles on the high frequency design. There's (part of) your IPC difference.

FO4 delay is defined by how fast transistors switch, so sure adding voltage and using faster transistors will reduce the FO4 delay (which is measured in picoseconds) but if you are trying to do 8 FO4 delays of work at 3 GHz and only 6 FO4 delays of work at 5 GHz I think it is easy to see why that 3 GHz design will have difficulty running at 5 GHz even if you give it more voltage and use faster transistors. As you clock it up, at some point it won't have enough time to get 8 FO4 delays worth of work accomplished in a clock cycle, and it fails to work properly.

Now, like I said, I'm giving the Cliff's Notes version, so there's a lot more to this; you don't really design circuits that require 8 FO4 delays of work when you have exactly 8 FO4 delays available in a clock cycle at your target frequency. You have to leave some headroom for process variance or your yield will suck, and there is overhead, so you don't have all 8 FO4 delays to allocate to moving your multiplier forward, and so on. So in reality your circuits might require anywhere from 5 to 7.5 FO4 delays per cycle, but the ones that require 7.5 are the ones that will hit the wall first and cause something like, say, Prime95 to start producing incorrect results when you push things too far while overclocking.
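The multiplier example above can be put in numbers. A toy sketch (the 24-FO4 multiplier depth and the 6/8 FO4-per-cycle budgets are the hypothetical figures from the post, not real design data):

```python
import math

def cycles_needed(work_fo4, fo4_per_cycle):
    """Clock cycles needed to finish a circuit of a given FO4 depth."""
    return math.ceil(work_fo4 / fo4_per_cycle)

MULTIPLIER_FO4 = 24  # hypothetical depth of the multiplier circuit

print(cycles_needed(MULTIPLIER_FO4, 8))  # low-clock design:  3 cycles
print(cycles_needed(MULTIPLIER_FO4, 6))  # high-clock design: 4 cycles
# Work per cycle: 8/6 = 1.33x, i.e. the 33% figure from the post.
```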
 
Last edited:

Doug S

Senior member
Feb 8, 2020
354
507
96
If we assume that Apple's A14 is running at 2.99/3.00 GHz as the leaked GB5 benchmarks indicate, they should be able to reach at least 3.6 GHz in a higher power design. HPC cells will buy you about 10%, and TSMC has said that other knobs they can turn where power is less important will buy another 10%. If they can bin a bit, the A14 might come within spitting distance of 4 GHz but wouldn't quite get there. That's about the best it will do, no amount of power increase would ever get it anywhere near 5 GHz. It simply isn't designed for that.

This is thinking about desktop stuff like iMac/Mac Pro where power draw doesn't matter, the Macbooks are a different matter. I wouldn't expect to see clock rates much above 3 GHz on the Macbook unless they do a totally different core for the Mac line.
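The headroom estimate compounds multiplicatively. A quick sketch of the arithmetic (the 2.99 GHz baseline and the two ~10% knobs are the figures from the post; the binning gain is a purely hypothetical extra):

```python
base_ghz = 2.99       # A14 clock per the leaked GB5 entries
hpc_cells = 1.10      # HPC standard cells: ~10%
process_knobs = 1.10  # TSMC tuning where power matters less: ~10%

redesigned = base_ghz * hpc_cells * process_knobs
print(round(redesigned, 2))  # ~3.62 GHz

# With, say, another ~8% from binning the best dies (pure assumption):
print(round(redesigned * 1.08, 2))  # within spitting distance of 4 GHz, nowhere near 5
```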
 

naukkis

Senior member
Jun 5, 2002
365
208
116
If we assume that Apple's A14 is running at 2.99/3.00 GHz as the leaked GB5 benchmarks indicate, they should be able to reach at least 3.6 GHz in a higher power design. HPC cells will buy you about 10%, and TSMC has said that other knobs they can turn where power is less important will buy another 10%. If they can bin a bit, the A14 might come within spitting distance of 4 GHz but wouldn't quite get there. That's about the best it will do, no amount of power increase would ever get it anywhere near 5 GHz. It simply isn't designed for that.

This is thinking about desktop stuff like iMac/Mac Pro where power draw doesn't matter, the Macbooks are a different matter. I wouldn't expect to see clock rates much above 3 GHz on the Macbook unless they do a totally different core for the Mac line.
Why so obsessed with clock frequency? If Apple could clock the A14 to 4 GHz, that would mean about 2100 in Geekbench ST - good luck to Intel and AMD matching it.
 
  • Like
Reactions: Etain05 and SAAA

Gideon

Golden Member
Nov 27, 2007
1,028
1,684
136
Geekbench does not scale linearly with clock speed, so it will be a bit less (this is another reason why this graph is inherently unfair; a 4 GHz A14 would have a noticeably worse score).

But yeah, if the A14 can get anywhere near 4 GHz, it will be the fastest CPU by quite a margin.
 
  • Like
Reactions: Tlh97

Richie Rich

Senior member
Jul 28, 2019
470
227
76
That is really not true at all. Core design has a HUGE impact on clocks. Look back at AMD's Winchester vs. Orleans cores. Same process - but one maxed out at 2200 MHz, the other at 2600 MHz at the same power. Or how about we compare two different architectures on the same node - AMD Zen1 vs. AMD Polaris. Polaris struggles to achieve much more than ~1.3 GHz no matter how much power you give it.

Just because an Apple A-whatever can run at 2.6 GHz doesn't mean it'll hit 4 GHz even with all the power headroom in the world. If that were true, we'd already have 4-5 GHz "fast" ARM desktop CPUs. But we don't.

The problem is you are cherry-picking clock speeds to best suit your narrative. For example, in your chart you assume the 3950X is pegged at 4.6 GHz for the entire test, which is almost certainly not true. Then, for the results you do not like, you assume they're pegged at base clocks the entire time, so that they'd have ridiculously high numbers you could just throw out - which is also not true. The Epyc 7742 has a 3.4 GHz max boost speed - yielding 321.7 pts/GHz - and that's assuming it stayed at 3.4 GHz the entire time. The Renoir chips you also tossed out have a 4.2 GHz max boost, yielding 298.5 pts/GHz - still better than the 3950X result you chose.

The 3950X result you chose shows a 4% IPC increase over Zen 1 in your own chart. How can you even seriously entertain that as accurate?
So you claim that Epyc has 321.7 pts/GHz => higher IPC than Ice Lake. Show me at least 3 good reviews where they claim Ice Lake has lower IPC than Zen2. AFAIK there is no such crazy conclusion. Again, I admit that down-clocked CPUs, which Epyc @ 3.4 GHz and Renoir @ 4.2 GHz are, can provide slightly higher IPC. That's why I use the CPU with the highest ST score. The Ryzen 3950X has over 1300 pts and that's why I picked it.

What is so difficult to understand?
If there were an Apple A13 version @ 2.9 GHz instead of 2.6 GHz, I would use the faster version with the higher ST score. The same way, when the A14M for MacBook shows up @ 3.2 GHz, I'm going to update the table with that faster one, even if it probably means lower IPC.

BTW, your downclocked Epyc shows only 12% higher IPC compared to the 3950X in my table. That poor 12% IPC gain is still far, far away from Apple's A13 Lightning core's 76% IPC advantage over the Zen2 3950X. Even if you downclocked Zen2 to 1 Hz, you couldn't get higher IPC than the A13. It's in the microarchitecture:
  • 6xALU + 2xBr = a 9-wide integer core for A13 vs. a poor 4xALU for Zen2
  • L1 cache: 128 kB for A13 vs. only a poor 32 kB for Zen2
  • L2 cache: 8 MB for A13 vs. only a poor 0.5 MB for Zen2
  • die size: 4.5 mm² on 7nm HD for A13 vs. only 3.6 mm² on 7nm HP for Zen2
  • on the same 7nm HP process, the A13 core would be around 8 mm²... that's more than a 2x larger core

Get over it. Apple has an absolutely monstrous CPU in every way, several generations ahead of anybody else, thanks to genius architect Gerard Williams III. The sooner you stop denying this fact, the smaller the brain-stroke you'll have from ARM MacBook performance.





The design matters. Read up on FO4 delay, but the simple version is that a pipeline stage (i.e. the work that happens in a single clock cycle) can be further broken down by number of FO4 delays. If you target a lower clock rate, you might have for example 8 FO4 delays per clock cycle, but if you target high frequency you might design your circuits allowing only 6 FO4 delays per clock cycle.
You don't know what you are talking about. The ARM Cortex A72, which runs @ 1.5 GHz in the Raspberry Pi 4 (28nm) and around 2.0 GHz in the Snapdragon 652, was made by TSMC on the 7nm HP process and reached 4.2 GHz pretty easily. And those were just test samples, without process tuning and without binning the best chips out of millions mass produced.

A72_VF_curve_4GHz.png

And now compare to Zen2 with binning. That's a massive difference between the 3600 and 3800X. You can see that the Ryzen 3600 has a very similar voltage curve to that Cortex A72 on 7nm. And if ARM pushed voltage to 1.5 V like Ryzen sometimes does, it's pretty clear that the best-binned A72s could go to 4.5 - 4.6 GHz.

Zen2_vf_curve.jpeg


You don't understand the basics of pipeline/stage design. Both low-power ARMs and high-speed x86 use short 6 FO4 stages, but for different reasons:
  • x86 uses short 6 FO4 stages for high frequencies, obviously (with big help from the HP version of the process, which allows voltages up to 1.5 V)
  • ARM uses short 6 FO4 stages for low voltage at medium frequency within a given TDP (the HD process has almost double the density - huge money savings - at the cost that it will never go over 4 GHz, which is nonsense in a smartphone environment anyway).
Both x86 and ARM try to make stages as short as possible to get maximum frequency/performance out of them. The main difference is that x86 is limited by max voltage and thermal density/overheating, while ARMs are limited only by an ultra-low TDP. That's why the A72 can run way over 4.2 GHz if it's manufactured on an HP process.

It also works the opposite way, letting x86 run at super low TDP. AMD released Zen1 APUs with a 6 W TDP. Intel also has CPUs with a 7 W TDP, or Lakefield with Sunny Cove running close to smartphone TDP. It's pretty sad how many people here don't understand the physics.

How many people here know about multi-wave propagation? Starting another clock while the signal wave is still in the middle of the stage, effectively multiplying frequency without needing multiple times more stages, and saving a lot of SRAM for latches between stages. A technology discovered in the 90s. I guess nobody.
 

Attachments

Last edited:

Hitman928

Platinum Member
Apr 15, 2012
2,914
2,436
136
So you claim that Epyc has 321.7 pts/GHz => higher IPC than Ice Lake. Show me at least 3 good reviews where they claim Ice Lake has lower IPC than Zen2. AFAIK there is no such crazy conclusion. Again, I admit that down-clocked CPUs, which Epyc @ 3.4 GHz and Renoir @ 4.2 GHz are, can provide slightly higher IPC.
Nope, your Icelake score is also too low. Examples have already been given but Icelake can reach up to 360+ points per clock.

That's why I use the CPU with the highest ST score. The Ryzen 3950X has over 1300 pts and that's why I picked it.
There are a lot of examples of 3950x with higher points per clock. A 3950x can reach over 1400 points at stock.

BTW, your downclocked Epyc shows only 12% higher IPC compared to the 3950X in my table. That poor 12% IPC gain is still far, far away from Apple's A13 Lightning core's 76% IPC advantage over the Zen2 3950X. Even if you downclocked Zen2 to 1 Hz, you couldn't get higher IPC than the A13.
The Epyc isn't downclocked; it's running stock. It just doesn't boost as high as a 3950X, but that doesn't make it an invalid result. Also, the whole lower-clock thing is a giant red herring. There are numerous examples of 3950X chips reaching even higher PPC than that Epyc. Here's one example at 324 PPC (higher than the Epyc example), and it's not even the highest-scoring or highest-PPC 3950X in the database.


Also, no one is arguing that the latest Apple chips aren't the highest-IPC chips around. What we're saying is that using a single mobile-focused benchmark is a poor way to determine absolute IPC, that you're exaggerating the lead they do have in that one benchmark, and that scaling up the Apple CPUs would be more difficult than you make it sound. Not that it's impossible, but they can't just paste a bunch of cores together, crank up the voltage, and all of a sudden have a chip that beats everything from Intel/AMD.


You don't know what you are talking about. The ARM Cortex A72, which runs @ 1.5 GHz in the Raspberry Pi 4 (28nm) and around 2.0 GHz in the Snapdragon 652, was made by TSMC on the 7nm HP process and reached 4.2 GHz pretty easily. And those were just test samples, without process tuning and without binning the best chips out of millions mass produced.

View attachment 31095
Maybe you missed the part in there where those test A72 cores were custom designed for high frequency and even then ran L2 and L3 cache at 1/2 and 1/4 rate.


How many people know about multi-wave propagation here? Starting another clock when signal wave is in the middle of the stage. Effectively multipling frequency without need of multiple times more stages, saving a lot of SRAM for latches between stages. Technology discovered in 90's. I guess nobody.
Maybe you should explain in detail how multi-wave propagation in digital circuits work? There are people on this forum with very technical backgrounds so feel free to go into as much depth as you'd like.
 

Thunder 57

Golden Member
Aug 19, 2007
1,546
1,447
136
Maybe you should explain in detail how multi-wave propagation in digital circuits work? There are people on this forum with very technical backgrounds so feel free to go into as much depth as you'd like.
Nah, nobody here knows anything, even from the 90s. Guys, we all need to go back to school.
 
  • Haha
Reactions: lightmanek
Apr 30, 2020
26
64
51
So you claim that Epyc has 321.7 pts/GHz => higher IPC than Ice Lake. Show me at least 3 good reviews where they claim Ice Lake has lower IPC than Zen2. AFAIK there is no such crazy conclusion. Again, I admit that down-clocked CPUs, which Epyc @ 3.4 GHz and Renoir @ 4.2 GHz are, can provide slightly higher IPC. That's why I use the CPU with the highest ST score. The Ryzen 3950X has over 1300 pts and that's why I picked it.
All of your scores are suspect. That is the problem. Your own data shows Zen 2 with only a 4% IPC advantage over Zen 1, which is complete nonsense. You are purposely cherry-picking results that fit your narrative better. As others have mentioned, there are many 3950X GB runs with significantly higher single-threaded scores at the same clock speed. Yet you're purposely ignoring them and choosing low-scoring runs to better fit your narrative.
 

Doug S

Senior member
Feb 8, 2020
354
507
96
Geekbench does not scale linearly with clock speed, so it will be a bit less (this is another reason why this graph is inherently unfair; a 4 GHz A14 would have a noticeably worse score).
No benchmark scales linearly with clock speed unless it runs entirely within cache. If you double the clock rate of a CPU the main memory is the same speed as it was before, but latency has doubled in terms of clock cycles. Likewise if you cut a CPU's clock rate in half, memory latency in terms of cycles is halved.
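A toy latency model makes that non-linearity concrete (every number below is an illustrative assumption, not a measurement of any real chip):

```python
def runtime_s(freq_ghz, compute_cycles, mem_accesses, mem_latency_ns=80.0):
    """Core time shrinks with frequency; DRAM latency is fixed in wall-clock time."""
    core_ns = compute_cycles / freq_ghz     # cycles / (cycles per ns)
    mem_ns = mem_accesses * mem_latency_ns  # unaffected by the core clock
    return (core_ns + mem_ns) * 1e-9

t_slow = runtime_s(3.0, 1e9, 1e6)
t_fast = runtime_s(6.0, 1e9, 1e6)
print(round(t_slow / t_fast, 2))  # doubling the clock buys well under 2x
```

In this toy model the memory-bound fraction of runtime grows as the clock rises, which is exactly why PPC computed at low clocks flatters a core.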
 

Doug S

Senior member
Feb 8, 2020
354
507
96
You don't understand the basics of pipeline/stage design. Both low-power ARMs and high-speed x86 use short 6 FO4 stages, but for different reasons:
  • x86 uses short 6 FO4 stages for high frequencies, obviously (with big help from the HP version of the process, which allows voltages up to 1.5 V)
  • ARM uses short 6 FO4 stages for low voltage at medium frequency within a given TDP (the HD process has almost double the density - huge money savings - at the cost that it will never go over 4 GHz, which is nonsense in a smartphone environment anyway).
Both x86 and ARM try to make stages as short as possible to get maximum frequency/performance out of them. The main difference is that x86 is limited by max voltage and thermal density/overheating, while ARMs are limited only by an ultra-low TDP. That's why the A72 can run way over 4.2 GHz if it's manufactured on an HP process.

It also works the opposite way, letting x86 run at super low TDP. AMD released Zen1 APUs with a 6 W TDP. Intel also has CPUs with a 7 W TDP, or Lakefield with Sunny Cove running close to smartphone TDP. It's pretty sad how many people here don't understand the physics.

How many people here know about multi-wave propagation? Starting another clock while the signal wave is still in the middle of the stage, effectively multiplying frequency without needing multiple times more stages, and saving a lot of SRAM for latches between stages. A technology discovered in the 90s. I guess nobody.

Please provide your source showing the FO4 delays of recent x86 and Apple SoCs. Stanford used to have a list but I don't think it has been updated for a long time, and I doubt anyone knows the FO4 delays of Apple's SoCs.

As far as wave pipelining goes, I know HP used it for their huge off-chip cache in the PA-RISC 7xxx series, to avoid taking the wire delay hit. That's not relevant for internal circuits like, say, a multiplier. Show us some references stating this is being used inside the core of modern designs. I could believe it might be used for cache access (especially cache that's relatively "far away" on the die, like the L3), but I'd be shocked if it is being used for logic. AFAIK none of the automated tools can handle it; it is basically asynchronous circuit design, so you would be on your own and have to do the timing calculations (including safety margin for process variation) by hand. I just don't buy that any x86 or ARM designs are doing this. Maybe IBM, I could see them doing something crazy like that in POWER.

Just because something has "been known since the 90s" doesn't mean it gets used. Asynchronous circuits have been known since the 50s, but even today no one does them despite the obvious performance and power benefits - because it's way too damn hard (some researchers did build an asynchronous ARM once... google AMULET).
 

DrMrLordX

Lifer
Apr 27, 2000
16,489
5,456
136
I told people: create your own and better IPC table. Show me how better you are.
You should not create IPC tables. You should benchmark two CPUs against one another in a large suite of software and then draw conclusions about where each CPU excels in which workloads. IPC tables are useless. IPC itself is, at best, an estimate of performance between different generations of the same hardware platform running basically the same software - for example, it's normal for people to compare IPC of Netburst designs to Core2, to show how much stronger Core2 was, but you don't really understand why Core2 was better until you look at the benchmark results.

The operative question is, "how does this CPU make my work better/faster/easier?" and comprehensive benchmarks show you that. Geekbench 5 alone does not grant you that information. It probably never will.

It's clear that most people are angry about Apple's almost double IPC and much bigger 6xALU architecture
No. It is clear that people are angry at you for manipulating disparate data to push some ridiculous anti-x86 agenda.

In what world is BLENDER a better benchmark than SPEC? :rolleyes:
In this one? Blender is an outstanding benchmark. Many people use it to do real work. It's a great FP benchmark. We've been using CPU rendering software of various sorts to benchmark desktop CPUs for years. If someone wants to start comparing phone SoCs to desktop/workstation/server CPUs, why not ask for some Blender results? And you can compile it to run on a Mac, so the A14 Macs will give us hard numbers.

Of course I would want other software results as well before I started making comprehensive comparisons between A14 and anything desktop x86.
 

Richie Rich

Senior member
Jul 28, 2019
470
227
76
Maybe you missed the part in there where those test A72 cores were custom designed for high frequency and even then ran L2 and L3 cache at 1/2 and 1/4 rate.
Maybe you missed the part where Intel also runs its L3 cache at a lower uncore frequency.
Maybe you missed the part where Apple's big cores run at 2.6 GHz, the little cores at 1.8 GHz, and the GPU below 1 GHz, so the L3 SLC cache also runs with some multiplier. Apple's L3 cache is shared with the GPU and NPU as well. This shared L3 cache for the GPU is something AMD Renoir and Intel can only dream of, not to mention that they have no NPU to connect to L3. Apple is light years ahead of anybody.

I admit the A72 on CoWoS is not a 100% core copy, but maybe a 99% copy. Minor changes. When ARM can make these minor changes for some experimental side project like CoWoS, then imagine Apple with all their massive resources.

Intel and AMD should be scared not of high-frequency ARMs (that's easy to do) but of chiplets on interposer. Neoverse V1 is stated to use HBM2E memory and a coherent chiplet architecture.

Maybe you should explain in detail how multi-wave propagation in digital circuits work? There are people on this forum with very technical backgrounds so feel free to go into as much depth as you'd like.
I doubt there are people with a technical background on this forum. The best people, like Andrei, left this forum due to permanent attacks from know-nothing people. Just search what happened in the Graviton2 thread. How do you want to learn lesson #10 about multi-wave propagation when you don't know lessons #1-9 about how pipeline stages work and what critical path length is? Otherwise you would never claim that ARMs cannot be clocked to high frequencies. There is a paper regarding multi-wave; you can google it.


Please provide your source showing the FO4 delays of recent x86 and Apple SoCs. Stanford used to have a list but I don't think it has been updated for a long time, and I doubt anyone knows the FO4 delays of Apple's SoCs.
I found that AMD's FPU from 2000 (Athlon XP probably) has a 35.2 FO4 latency for the double-precision FP adder (in two stages, probably). So a single stage has a 17.6 FO4 latency, which means all other execution units in the CPU have to be about the same length:

FPUadder_FO4_delay_stage_length_critical_path--table.png
FPUadder_FO4_delay_stage_length_critical_path--scheme.png
 

Hitman928

Platinum Member
Apr 15, 2012
2,914
2,436
136
Maybe you missed the part where Intel also runs its L3 cache at a lower uncore frequency.
Maybe you missed the part where Apple's big cores run at 2.6 GHz, the little cores at 1.8 GHz, and the GPU below 1 GHz, so the L3 SLC cache also runs with some multiplier. Apple's L3 cache is shared with the GPU and NPU as well. This shared L3 cache for the GPU is something AMD Renoir and Intel can only dream of, not to mention that they have no NPU to connect to L3. Apple is light years ahead of anybody.
Intel's cache ratio is ~9/10, not 1/4. That was really just a side note anyway, the main thing is that the whole core was custom designed for high frequency compared to a standard A72 core. This was just a proof of concept project to promote TSMC's foundry. No one is arguing you couldn't clock an ARM chip up to high frequencies but you'd have to make some design changes and the extent of the design changes would depend on what chip you started with both in terms of uarch and physical design.

I admit the A72 on CoWoS is not a 100% core copy, but maybe a 99% copy. Minor changes. When ARM can make these minor changes for some experimental side project like CoWoS, then imagine Apple with all their massive resources.
OK, since you know what has and hasn't been customized, why not lay it out for us? What blocks were changed and what blocks weren't? How did their stack affect frequency? Which Vt cells did they choose and why? Let's see the details of this 99% unchanged design.

Intel and AMD should be scared not of high-frequency ARMs (that's easy to do) but of chiplets on interposer. Neoverse V1 is stated to use HBM2E memory and a coherent chiplet architecture.
So AMD and Intel should be scared that ARM server CPUs will start using technology AMD has been using for years now? Somehow I doubt this keeps Lisa Su up at night thinking about it.

I doubt there are people with a technical background on this forum. The best people, like Andrei, left this forum due to permanent attacks from know-nothing people. Just search what happened in the Graviton2 thread. How do you want to learn lesson #10 about multi-wave propagation when you don't know lessons #1-9 about how pipeline stages work and what critical path length is? Otherwise you would never claim that ARMs cannot be clocked to high frequencies. There is a paper regarding multi-wave; you can google it.
Stop trying to deflect away from the topic you brought up. No one ever claimed you couldn't design an ARM CPU to reach high frequencies. Geez man, do you do anything but create straw men in your arguments so you can feel like you are constantly winning? You are the one who brought up multi-wave propagation in digital circuits, so I would like to hear your explanation. If you really understood it, you should be able to explain it in a way anyone could understand, but since many on this forum have advanced degrees in computer science/EE, I think it would be fine for you to go into lots of detail. So let's hear it: what are the details, the pros/cons, why should it be used on modern high-performance CPUs? Don't tell me to just go google it, I want to hear what you have to say about it.
 

name99

Senior member
Sep 11, 2010
253
226
116
Why so obsessed with clock frequency? If Apple could clock the A14 to 4 GHz, that would mean about 2100 in Geekbench ST; good luck to Intel and AMD matching it.
Think of it this way.
This forum is populated by the 10% who know what they are talking about and the other 90%.
Which is which?
Well, which are the ones that endlessly squawk about clock frequency?

It's useful having a bozo filter.
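The ~2100 projection above can be sanity-checked with a simple linear frequency extrapolation. This is only a sketch: the ~3.0 GHz clock and ~1600 GB5 ST score used for the A14 are approximate, and assuming constant IPC across clocks is optimistic, since (as noted elsewhere in the thread) memory latency does not scale with core clock.

```python
# Back-of-the-envelope check of the "A14 at 4 GHz ~ 2100 GB5 ST" claim.
# Assumes score scales linearly with clock (constant IPC) - optimistic,
# because the memory subsystem does not speed up with the core.

A14_CLOCK_GHZ = 3.0   # approximate A14 big-core peak clock (assumption)
A14_GB5_ST = 1600     # approximate A14 Geekbench 5 ST score (assumption)

def scaled_score(score: float, clock_ghz: float, target_ghz: float) -> float:
    """Linear frequency extrapolation at constant IPC."""
    return score * target_ghz / clock_ghz

projected = scaled_score(A14_GB5_ST, A14_CLOCK_GHZ, 4.0)
print(round(projected))  # ~2133, in line with the ~2100 figure quoted
```

The point of the exercise is only that the quoted figure is internally consistent, not that the A14 could actually reach that clock.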
 

name99

Also, no one is arguing that the latest Apple chips aren't the highest-IPC chips around. What we're saying is that using a single mobile-focused benchmark is a poor way to determine absolute IPC, that you're exaggerating the lead they do have in that one benchmark, and that scaling up the Apple CPUs would be more difficult than you make it sound. Not that it's impossible, but they can't just paste a bunch of cores together, crank up the voltage, and all of a sudden have a chip that beats everything from Intel/AMD.
Don't treat us like idiots. You know full well that Apple chips have been benchmarked via SPEC, via browsers, via my own Mathematica tests, and all results are consistent with GB5.
To keep whining that "GB5 is a single mobile focused benchmark" says a lot more about you than about GB5 or Apple chips.
 

Hitman928

Platinum Member
Apr 15, 2012
2,914
2,436
136
Don't treat us like idiots. You know full well that Apple chips have been benchmarked via SPEC, via browsers, via my own Mathematica tests, and all results are consistent with GB5.
To keep whining that "GB5 is a single mobile focused benchmark" says a lot more about you than about GB5 or Apple chips.
Let's hear it, what do you think my posts say about me, specifically?
 

Doug S

Senior member
Feb 8, 2020
354
507
96
In this one? Blender is an outstanding benchmark. Many people use it to do real work. It's a great FP benchmark. We've been using CPU rendering software of various sorts to benchmark desktop CPUs for years. If someone wants to start comparing phone SoCs to desktop/workstation/server CPUs, why not ask for some Blender results? And you can compile it to run on a Mac, so the A14 Macs will give us hard numbers.

Of course I would want other software results as well before I started making comprehensive comparisons between A14 and anything desktop x86.

That makes Blender a good benchmark for people doing Blender-like tasks, and an outstanding benchmark for people actually running Blender. That doesn't make it better than SPEC, which not only tests a much wider array of FP tasks than Blender but also tests INT, which Blender doesn't really exercise at all.

So no, Blender is a far inferior benchmark to SPEC unless rendering is the main thing you care about.
 

lobz

Golden Member
Feb 10, 2017
1,401
1,673
106
People wanted to cheat. I remember one guy who downclocked his x86 CPU to 2.6 GHz, similar to Apple, probably boosted the uncore and DDR memory to maximum, and claimed a score about 20% higher than my table. This is cheating. An Apple core could also provide higher IPC at 1 GHz. The GB database is full of tweaked and overclocked systems with wrongly reported frequency. I try to use GB results from reviews as much as possible to get reasonable numbers. I stated that this table is a chart of IPC under normal maximum performance, to rule out downclocking (cheating).
How you still manage to trick decent people here into thinking you're actually arguing with them, while deliberately lying and 'forgetting' arguments and facts already established in this very topic after 1-2 days and carrying on with your insane propaganda, is totally beyond my comprehension.
 
This shared L3 cache for the GPU is something AMD Renoir and Intel can only dream of, not to mention that they have no NPU connected to the L3. Apple is light years ahead of everybody.
What do you mean? You don't think Intel or AMD are capable of sharing cache between the GPU and CPU? Most Intel iGPUs already share the L3 with the main CPU cores. Ice Lake adds a dedicated L3 to the iGPU - but it's still connected to the LLC of the main CPU cores. Don't forget about Broadwell, with its EDRAM L4 shared between the CPU and GPU as well.
I doubt there are people with a technical background on this forum. The best people, like Andrei, left this forum due to constant attacks from clueless posters. Just search what happened in the Graviton2 thread. How do you want to learn lesson #10 about multi-wave propagation when you don't know lessons #1-9 about how pipeline stages work and what critical path length is? Otherwise you would never claim that ARM cores cannot be clocked to high frequencies. There is a paper on multi-wave propagation; you can google it.
No one is claiming that ARM CPUs cannot be clocked to high frequencies. We're claiming that an ARM implementation that TARGETS a ~2.5 GHz design frequency is not likely to be able to scale up to 4+ GHz. It's not a matter of simply cramming more volts into the CPU or swapping a logic library here and there. There is a LOT of work required to push clock speeds up that high.

Look at AMD's Llano vs Richland APUs. Both on the same GF 32nm SOI process. Llano could barely wheeze out 3GHz, while later Richland APUs on a different architecture pushed that to 4.4GHz.
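The "lots of work" point can be made concrete with the standard timing relation: the cycle time is bounded by the slowest pipeline stage, f_max = 1 / (t_logic + t_setup + t_skew). A minimal sketch follows; the per-stage picosecond figures are purely illustrative assumptions, not real numbers for any shipping core.

```python
# Why a core designed for ~2.5 GHz can't simply be "volted" to 4+ GHz:
# the max clock is set by the critical (slowest) pipeline stage.
#   f_max = 1 / (t_logic + t_setup + t_clock_skew)
# All delay values below are illustrative assumptions.

def f_max_ghz(t_logic_ps: float, t_setup_ps: float, t_skew_ps: float) -> float:
    """Maximum clock in GHz given per-stage delays in picoseconds."""
    period_ps = t_logic_ps + t_setup_ps + t_skew_ps
    return 1000.0 / period_ps  # 1000 ps per ns; 1/ns = GHz

# A design whose longest stage has ~360 ps of logic tops out near 2.5 GHz:
print(round(f_max_ghz(360, 25, 15), 2))  # 2.5

# Hitting 4 GHz requires cutting the critical path to ~210 ps of logic:
print(round(f_max_ghz(210, 25, 15), 2))  # 4.0
```

Voltage only buys a modest reduction in gate delay; getting from 400 ps to 250 ps per stage means re-pipelining, restructuring critical paths, and different cell libraries, which is exactly the redesign work described above.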
 
