Solved! ARM Apple High-End CPU - Intel replacement


Richie Rich

Senior member
Jul 28, 2019
326
157
76
People who think x86 has some unique advantages over ARM have their heads in the sand. I guess it will take Apple releasing the first ARM Macs to finally admit this, though I imagine some will still manage to find a few things x86 does better and try to claim those are the things that really matter.
Most of them know they are wrong, but they keep living in denial because their egos got badly hurt. Especially hardcore fans: imagine they recently bought a very expensive 64-core EPYC, expecting to get a CPU based on the best core on the market. And now you tell them that the A13 in an iPhone has almost double the PPC? In other words, you are saying their EPYC is based on a slow, outdated uarch and that Apple has an advantage of about 4-5 years in core development?

But the sad thing is that, due to this ignorance, this forum lost smart guys like Andrei and Nothingness. I asked a moderator if they are OK with losing those smart people. He answered that it's OK and that there will be no change in this trend. And this is super sad, because CPU trends cannot be stopped at this forum and the big ARM rise will happen anyway. Maybe Andrei started his own AnreiTech.com up?


More moderation callouts. They are still not allowed.
And no moderator said what you claimed above about losing "smart people".
That's plain false.


esquared
Anandtech Forum Director
 
Last edited by a moderator:
  • Haha
Reactions: CHADBOGA

Doug S

Member
Feb 8, 2020
78
91
51
Most of them know they are wrong, but they keep living in denial because their egos got badly hurt. Especially hardcore fans: imagine they recently bought a very expensive 64-core EPYC, expecting to get a CPU based on the best core on the market. And now you tell them that the A13 in an iPhone has almost double the PPC? In other words, you are saying their EPYC is based on a slow, outdated uarch and that Apple has an advantage of about 4-5 years in core development?

Quit harping on IPC, you are just as bad as those who have their heads in the sand and are doing your argument no favors. News flash, if you measure IPC of an Intel CPU at 5 GHz and then measure again at 2.5 GHz you will find it has gone up. If you design your core with a target of 2.5 GHz you can have shorter pipelines and lower cycle counts for cache and increase your IPC by a lot. That's what Apple has done, their designers made different choices.

What matters (as far as Apple adopting ARM for the Mac) is how x86 and Apple's cores compare directly, each running at their target clock rates, and Apple compares pretty well there especially if you limit it to mainstream laptop CPUs and ignore stuff like the 'X' CPUs topping out over 5 GHz for now. Those are the comparisons that matter, not comparing on IPC - that's only slightly less ignorant than comparing by clock rate.
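A toy model of the frequency point above, as a rough sketch only. Every number in it is hypothetical and chosen to show the trend, not measured on any real CPU. The idea: a DRAM miss costs a fixed number of nanoseconds, so it costs twice as many cycles at 5 GHz as at 2.5 GHz, and measured IPC falls as the clock rises even though throughput still goes up.

```python
# Toy model: CPI = core CPI + (misses per instruction) * (miss latency in cycles).
# All parameter values are hypothetical, chosen only to show the trend.

BASE_CPI = 0.25           # assumed cycles/instruction with a perfect memory system
MISSES_PER_INSTR = 0.002  # assumed DRAM misses per instruction
DRAM_LATENCY_NS = 80.0    # assumed DRAM latency, fixed in wall-clock time


def ipc(freq_ghz: float) -> float:
    """Instructions per cycle at a given clock, under the toy model above."""
    miss_latency_cycles = DRAM_LATENCY_NS * freq_ghz  # ns * cycles/ns
    cpi = BASE_CPI + MISSES_PER_INSTR * miss_latency_cycles
    return 1.0 / cpi


def perf(freq_ghz: float) -> float:
    """Throughput in billions of instructions per second (IPC * GHz)."""
    return ipc(freq_ghz) * freq_ghz


for f in (2.5, 5.0):
    print(f"{f} GHz: IPC = {ipc(f):.2f}, perf = {perf(f):.2f} Ginstr/s")
```

In this toy setup the 2.5 GHz run shows the higher IPC while the 5 GHz run still delivers more instructions per second, which is why comparing on IPC alone says little about which chip is faster.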
 

Hitman928

Platinum Member
Apr 15, 2012
2,450
1,489
136
@Hitman928


But there are some differences between your measurements and mine:
  • my comparison was in ST, where A72 could benefit from 64-bit. A 3-core load is still OK, while 4-core load performance suffers a lot (probably due to a memory bandwidth bottleneck). So for a core-to-core comparison, an ST load is much more realistic because it eliminates the bottleneck.
  • no downclock of my 3700X and SMT ON, just a recalculation based on the given frequency per thread (16t)

Blender MT results:
  • Zen2 Ryzen 3700X 8c/16t ..... 179 s ..... 11 466 s/GHz/thread
  • Cortex A72 (RPi4) 4c/4t ...... 4077 s ..... 24 464 s/GHz/thread .... that's only 47% PPC per thread of Zen2
Please note that we compare an A72 core to a Zen2 thread, which means that AMD can get more than 4x higher PPC out of a Zen2 core thanks to SMT2 (while the RPi4 bottlenecks at all-core load). This corresponds with your 200% higher PPC (Zen2 has +300%).

What is your time when you run Blender as single core? It would be more interesting to compare whether Zen+ scales similarly to Zen2.
First, calculating using SMT is obviously tricky, but SMT only gives about 15-20% speedup in modern versions of Blender when rendering the simpler scenes. Even with more complex scenes, you're looking at ~40% if you're lucky, so it would be more like:

Zen2 179s -> 6873 s/GHz-core to 8019 s/GHz-core which would put it at 205% - 255% faster perf/GHz-core than Rpi4 which is what I would expect given my 2700 results. Again, a much bigger improvement compared to GB4.
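The normalization behind these numbers can be reproduced roughly as below. This is a sketch under assumptions: the 4.0 GHz clock for the 3700X is back-derived from the quoted 6873 and 8019 figures rather than stated anywhere, 1.5 GHz is the stock RPi4 clock, and `core_ghz_seconds` is just an illustrative helper name. The quantity is render time multiplied by cores and GHz, so lower is better.

```python
def core_ghz_seconds(render_s, cores, ghz, smt_speedup=1.0):
    """Render time scaled to one core at 1 GHz (lower is better).

    smt_speedup removes the assumed SMT gain, i.e. it estimates how long
    the same render would take with one thread per core.
    """
    return render_s * smt_speedup * cores * ghz


ZEN2_GHZ = 4.0   # assumption: clock implied by the figures quoted above
RPI4_GHZ = 1.5   # stock Raspberry Pi 4 clock

rpi4 = core_ghz_seconds(4077, cores=4, ghz=RPI4_GHZ)
for smt in (1.20, 1.40):
    zen2 = core_ghz_seconds(179, cores=8, ghz=ZEN2_GHZ, smt_speedup=smt)
    faster_pct = (rpi4 / zen2 - 1) * 100
    print(f"SMT gain {smt - 1:.0%}: Zen2 {zen2:.0f}, RPi4 {rpi4:.0f}, "
          f"Zen2 is {faster_pct:.0f}% faster per GHz-core")
```

The larger the SMT gain you assume, the smaller the per-core lead that remains, which is why the 40% case gives the low end of the range.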

Second, to be honest, your single core Blender score is an outlier that doesn't match up with any other data and I don't trust it. Did you check the output frame to make sure it rendered correctly? The speedup you are seeing in the multicore score from Raspbian to Ubuntu is explained by the different versions of Blender being used as v2.8 performs ~25% faster versus 2.79 which is right in line with the multicore speedup you saw. There's no reason to think you should see a 100% increase in ST performance though and that a memory bottleneck is then causing terrible scaling to 4 cores when scaling was just fine in 2.79. I think that data points to there being something wrong with the ST render but obviously there would have to be some investigation to conclude this. For now, the multicore score is in line with expected performance increase from Blender 2.79 and is the number I think we should use.

All of this is really just to say (as has been pointed out again and again) that using a single benchmark (even if it's a collection of small benchmarks) and extrapolating that to ultimate IPC in a desktop/workstation/server type environment is silly, but you keep doing it again and again as if it means something. It's fine to talk about the progress ARM has made and that Apple has made, but you need to do it from a perspective of understanding that decisions are made in each design that don't necessarily translate to different markets and that comparing chips designed for very different purposes like this is foolhardy. ARM vendors have server chips in the market we can compare as such. The last gen still favored x86 solutions for most things. This gen is just being released so the story is still yet to be written but hopefully we can get some independent reviews going soon.
 
Last edited:

Hitman928

Platinum Member
Apr 15, 2012
2,450
1,489
136
BTW, my single thread Blender BMW result for a 2700 @ 1.5GHz is 5368s which gives me a 3.96x speedup when running 4 core. Just about perfect scaling. I also check every time to make sure the image rendered correctly.
 

name99

Member
Sep 11, 2010
142
123
116
Okay, show me the results. I'm waiting.







You must comment to the links you dropped.


esquared
Anandtech Forum Director
 
Last edited by a moderator:

DrMrLordX

Lifer
Apr 27, 2000
15,494
4,281
136

Hitman928

Platinum Member
Apr 15, 2012
2,450
1,489
136
@name99

https://www.phoronix.com/scan.php?page=article&item=amazon-graviton2-benchmarks&num=8

is what I was looking for. Thanks! For anyone not interested in poring over all the specific results, the geometric mean is here:

https://www.phoronix.com/scan.php?page=article&item=amazon-graviton2-benchmarks&num=12

The bare metal results for Graviton2 are honestly not that great. Though when you consider its power consumption, I guess you can't really complain too much.
Yeah, it looks great against the backdrop of current AWS, but that's comparing older CPUs versus Graviton2 and also giving Graviton2 twice the available cores. When comparing bare metal to bare metal, it doesn't look so hot. It's a shame we'll probably never get actual power usage numbers for a Graviton2 server to compare efficiency.

 

DrMrLordX

Lifer
Apr 27, 2000
15,494
4,281
136
Hmmm, where did you read this? I'm pretty sure ARM claimed between 105 W and 150 W for the Neoverse 64 core variant depending on how it was configured.
https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd

Total power consumption of the SoC is something that Amazon wasn’t too willing to disclose in the context of our article – the company is still holding some aspects of the design close to its chest even though we were able to test the new chipset in the cloud. Given the chip’s more conservative clock rate, Arm’s projected figure of around 105W for a 64-core 2.6GHz implementation, and Ampere’s recent disclosure of their 80-core 3GHz N1 server chip coming in at 210W, we estimate that the Graviton2 must come in around anywhere between 80W as a low estimate to around 110W for a pessimistic projection.
It's guesswork, and of course I'm trying to be nice to Graviton2 here. Officially we don't know.

What does "bare metal" mean in this context?
Normally it means running OS + software directly on the hardware as opposed to relying on VM provisioning. AWS is all about VMs, though certain instances get you all the cores from one CPU, so that's as close as you're going to get to bare metal results from Graviton2. The rest of the results in the bare metal comparison are from hardware that Phoronix has access to now or has used in the past to produce benchmark results.

In other words, Phoronix took a 64c m6g instance and compared it to some non-AWS hardware, as opposed to restricting the comparison to other AWS instance types.
 
  • Like
Reactions: Tlh97 and Carfax83

Hitman928

Platinum Member
Apr 15, 2012
2,450
1,489
136
https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd



It's guesswork, and of course I'm trying to be nice to Graviton2 here. Officially we don't know.



Normally it means running OS + software directly on the hardware as opposed to relying on VM provisioning. AWS is all about VMs, though certain instances get you all the cores from one CPU, so that's as close as you're going to get to bare metal results from Graviton2. The rest of the results in the bare metal comparison are from hardware that Phoronix has access to now or has used in the past to produce benchmark results.

In other words, Phoronix took a 64c m6g instance and compared it to some non-AWS hardware, as opposed to restricting the comparison to other AWS instance types.
Amazon offers bare metal instances but I don't know if they have done this yet for Graviton2. I assumed this was the case for their tests but I'll have to look closer and see if they specify.

Edit: Yep, they do specify it as a m6g.metal instance so it is bare metal.
 

Hitman928

Platinum Member
Apr 15, 2012
2,450
1,489
136
BTW, I dropped my RAM speed from 3200 MT/s to 1600 MT/s to see the effect it would have on scaling. It still scaled from 1 to 4 cores by 3.9x performance. So a slight decrease in scaling, but nothing major to suggest that bandwidth is a limiting factor when scaling to 4 cores with DDR4.
 

name99

Member
Sep 11, 2010
142
123
116
@name99

https://www.phoronix.com/scan.php?page=article&item=amazon-graviton2-benchmarks&num=8

is what I was looking for. Thanks! For anyone not interested in poring over all the specific results, the geometric mean is here:

https://www.phoronix.com/scan.php?page=article&item=amazon-graviton2-benchmarks&num=12

The bare metal results for Graviton2 are honestly not that great. Though when you consider its power consumption, I guess you can't really complain too much.
Uh, wait what? So now we've gone from "ARM SoCs cannot handle tasks involving lots of cores, lots of DRAM, or lots of IO" to "single socket Graviton 2 is only comparable to a single socket x86, not vastly superior"?
And this is a failure?

This is why no-one takes your arguments seriously; because you have no interest in actual understanding, only in litigating a particular point. And you keep changing your argument every time your current point gets refuted.

The argument was "What nobody seems to get, or understand, is that Apple and ARM AT THE MOMENT seems to be very strong in single core, non IO dependent benchmarks. They were designed for that purpose and do it well. But what about things that have high IO and multi-threaded requirements ? Blender, is just one. What about a huge database server, that serves Anandtech ? or Amazon ?"

That point has been answered. (And apparently at least part of the answer includes that Graviton 2 seems to handle virtualization with rather less overhead than x86...)
If you want to make an argument about cost to use these large SoCs, that has also been answered.
If you want to make an argument about the design time and team size required to create these SoCs that is also apparently answered -- certainly look at the size of the Marvell, Ampere, and Annapurna teams, and their turnover (*substantial* improvements every year to eighteen months) compared to Intel's size and pace.

Amazon has created a SoC that matches a substantial fraction of their needs. Period. If it doesn't match your needs, fine. (But, honestly, tell me you NEED a dual core Xeon Platinum system, and I'm going to call BS...) If you want to complain that it doesn't ship in a dual socket version (right now --- wait till the 2020 model...) or that it doesn't have feature X or feature Y, you need to explain to us
(a) why anyone cares.
(b) why this supposed lack is intrinsic to the ARM ecosystem, rather than not being provided because everyone involved (who, BTW, has a lot more serious incentives than you) simply doesn't consider it important or desirable.
 
  • Like
Reactions: Lodix

DrMrLordX

Lifer
Apr 27, 2000
15,494
4,281
136
Uh, wait what? So now we've gone from "ARM SoCs cannot handle tasks involving lots of cores, lots of DRAM, or lots of IO" to "single socket Graviton 2 is only comparable to a single socket x86, not vastly superior"?
Look, I don't know what else you're really going on about, but personally all I wanted to see were results that weren't SPEC or Geekbench. And "only comparable to a single socket x86" is being charitable. 1P EPYC 7742 is decisively faster than Graviton2. At higher power, sure. If people had been saying "okay, it's slower, but it's more efficient", I might agree with that now that I've seen how Graviton2 measures up. That's assuming power consumption is in the 80-110W range. We'll probably never know for sure. Regardless, nobody's been making that argument about ARM in general.

Graviton2 also seems to have a few critical weaknesses. The linux compile test did not go very well. Graviton2 took 86.66 seconds to complete that test. A desktop CPU - AMD 3950x - did it in 39.61 seconds:

https://www.phoronix.com/scan.php?page=article&item=amd-ryzen9-3950x&num=5

That's a big "ouch" for Graviton2.
 

naukkis

Senior member
Jun 5, 2002
327
157
116

Hitman928

Platinum Member
Apr 15, 2012
2,450
1,489
136
https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd



It's guesswork, and of course I'm trying to be nice to Graviton2 here. Officially we don't know.
Yeah, I disagreed with him then and I still do. ARM already estimated 1W - 1.8W (2.6 GHz - 3.1 GHz) of power for each core + L2. Ampere's 80 core CPU based on the same core is rated at 210 W for 3 GHz performance. A quick calculation (being generous and assuming 3 GHz cores consume the same as 3.1 GHz) says that the cores + L2 consume 144 W. That still leaves 66 W of power outside of the cores + L2. Now if we say that a 2.5 GHz core + L2 consumes ~0.95 W, you get 60.8 W consumed by the 64 cores. So he's saying that everything outside of the cores + L2 of a 64 core SoC only consumes ~19 W, compared to 66 W for the 80 core Ampere. I just don't buy it.
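The arithmetic above, as a small sketch. The per-core wattages are the figures quoted in this post; `uncore_watts` is just an illustrative helper name, and the 95 W and 110 W rows are the midpoint and the top of the estimated 80-110 W range.

```python
def uncore_watts(tdp_w, cores, watts_per_core):
    """Power left for everything outside cores + L2 (mesh, memory, I/O)."""
    return tdp_w - cores * watts_per_core


# Ampere 80-core part: 210 W, cores generously charged at the 3.1 GHz figure.
ampere_uncore = uncore_watts(210, cores=80, watts_per_core=1.8)

# 64-core, 2.5 GHz part at ~0.95 W per core, for three candidate TDPs.
for tdp in (80, 95, 110):
    g2_uncore = uncore_watts(tdp, cores=64, watts_per_core=0.95)
    print(f"TDP {tdp} W -> uncore {g2_uncore:.1f} W (Ampere: {ampere_uncore:.1f} W)")
```

Only the 80 W case leaves the implausibly small ~19 W for the uncore; the higher estimates leave a budget closer in scale to what the Ampere numbers imply.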

Additionally, ARM previously estimated a 32 core Neoverse SoC at 3.1 GHz would have a TDP of approximately 100 W and that a hyperscale datacenter SoC would start at 64 cores and 150 W. If we compare that to an 80 core Ampere CPU at 3 GHz with 210W TDP, the numbers seem reasonable. If we compare that to a 64 core SoC at 2.5 GHz with a TDP of 80 W, it just doesn't add up.

Edit: I don't completely disagree with his estimated range, just the description of it. I would put the 80 W at probably overly optimistic. The center point of 95 W I think is a good guess, but I just think that it's more likely a little higher rather than lower than that number.
 
Last edited:
  • Like
Reactions: Schmide

name99

Member
Sep 11, 2010
142
123
116
Look, I don't know what else you're really going on about, but personally all I wanted to see were results that weren't SPEC or Geekbench. And "only comparable to a single socket x86" is being charitable. 1P EPYC 7742 is decisively faster than Graviton2. At higher power, sure. If people had been saying "okay, it's slower, but it's more efficient", I might agree with that now that I've seen how Graviton2 measures up. That's assuming power consumption is in the 80-110W range. We'll probably never know for sure. Regardless, nobody's been making that argument about ARM in general.

Graviton2 also seems to have a few critical weaknesses. The linux compile test did not go very well. Graviton2 took 86.66 seconds to complete that test. A desktop CPU - AMD 3950x - did it in 39.61 seconds:

https://www.phoronix.com/scan.php?page=article&item=amd-ryzen9-3950x&num=5

That's a big "ouch" for Graviton2.
As I KEEP reminding you: is your goal understanding, or is your goal to wave a tribal flag?

If your goal is understanding, then the *intelligent* response to those compilation tests is to assume that there is a bug or misconfiguration, not to say "the system performs well almost everywhere except X, so I will latch onto X as my 'proof' of how bad the system is".

Why would the compilation time be so bad? Not a freaking clue but ALMOST ALWAYS when you see a glass jaw like that, the issue is something stupid in the software not "SoC does well everywhere except for this task".
We have seen versions of this where LLVM did really badly compared to GCC because of different decisions made in the standard library (in the case I am thinking of, LLVM prioritizing a "more" random RNG than GCC). We may well be seeing the same thing here in that some aspect of the Linux build system (at least when run on ARMv8) is not properly set up to handle so many cores on a single SoC because it has never encountered that before.
If you think that's unlikely, look at
https://reviews.llvm.org/D71786
which mentions (tangentially) the exact same thing in the context of LLVM, system set to handle up to 28 (basically less than 32) cores, and behaves much worse when given 36 cores...
 

Hitman928

Platinum Member
Apr 15, 2012
2,450
1,489
136
I think you got your wires crossed here, the Neoverse N1 promo materials on Anandtech say 64C for 105W:

No, N1 is designed to be configured for how the developer wants to use the CPU, both in terms of modular blocks in the SoC as well as cache sizes and target frequency. This slide is an example of one such possible configuration; it is designed for high core counts and high efficiency (which means lower clocks), just like Graviton2, in order to hit the 105 W TDP. A higher clocked 64 core SoC will use more like 150W.




Graviton 2 is running at 2.5 GHz instead of 2.6 GHz, which is the low-side limit of the reference design, but you're not going to save 25 W by dropping that last 100 MHz. The voltage/freq table of the Ares cores shows you are only going to get about 10% power savings @ 2.5 GHz compared to 2.6 GHz, so you are saving more like 6.5 W, maybe, considering the cores themselves take up only about 64 W to begin with at 2.6 GHz.
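A quick sketch of that estimate, using only the figures in this post. Treating everything outside the cores as unchanged between 2.6 GHz and 2.5 GHz is an added simplifying assumption, and the constant names are illustrative.

```python
CORES = 64
WATTS_PER_CORE_2_6GHZ = 1.0   # ~64 W for all cores + L2 at 2.6 GHz, as stated above
SAVING_AT_2_5GHZ = 0.10       # ~10% from the voltage/frequency table, as read above
REFERENCE_TDP_W = 105         # reference 64-core configuration at 2.6 GHz

core_power_w = CORES * WATTS_PER_CORE_2_6GHZ   # 64 W
saved_w = core_power_w * SAVING_AT_2_5GHZ      # 6.4 W
estimated_tdp_w = REFERENCE_TDP_W - saved_w    # assumes uncore power is unchanged

print(f"saved {saved_w:.1f} W, estimated TDP {estimated_tdp_w:.1f} W")
```

Under these assumptions the 100 MHz drop recovers roughly 6.4 W, far short of the 25 W needed to get from 105 W down to 80 W.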
 

Hitman928

Platinum Member
Apr 15, 2012
2,450
1,489
136
I can't edit my above post for some reason, just wanted to clarify that the reference chip @ 105W is running at 2.6 GHz, in case my prior post wasn't clear.
 

DrMrLordX

Lifer
Apr 27, 2000
15,494
4,281
136
As I KEEP reminding you: is your goal understanding, or is your goal to wave a tribal flag?
What I'm trying to do is shut down at least one poster here whose goal it is to "wave a tribal flag" for ARM - or anything that isn't x86 - who insists that we can accurately calculate server ARM CPU performance by running Geekbench on phone SoCs.

@Hitman928

Well I said I was erring on the side of being nice to Graviton2.

@naukkis

Graviton2 is undoubtedly an improvement for AWS customers in some workloads. I'm more interested in what implications it has for the future of modern (read: not A72) ARM cores deployed in multicore configurations aimed at desktop/workstation/server. The Phoronix bare metal testing was more-informative in that area.

I'd kind of like to see Graviton2 head-to-head against chips like the 3950x and 10980X in the same benchmarks. Graviton2 is already losing out to the 3950x big time in Phoronix's Linux kernel compilation benchmark.
 
Last edited:

name99

Member
Sep 11, 2010
142
123
116
What I'm trying to do is shut down at least one poster here whose goal it is to "wave a tribal flag" for ARM - or anything that isn't x86 - who insists that we can accurately calculate server ARM CPU performance by running Geekbench on phone SoCs.

@Hitman928

Well I said I was erring on the side of being nice to Graviton2.

@naukkis

Graviton2 is undoubtedly an improvement for AWS customers in some workloads. I'm more interested in what implications it has for the future of modern (read: not A72) ARM cores deployed in multicore configurations aimed at desktop/workstation/server. The Phoronix bare metal testing was more-informative in that area.

I'd kind of like to see Graviton2 head-to-head against chips like the 3950x and 10980X in the same benchmarks. Graviton2 is already losing out to the 3950x big time in Phoronix's Linux kernel compilation benchmark.
"who insists that we can accurately calculate server ARM CPU performance by running Geekbench on phone SoCs"

Thing is, this is NOT a crazy claim. It boils down to -- what is the hard part of creating a SoC?
The hardest part is the part tested by GeekBench!
Yes, IN THEORY, an idiot could go to the effort of building or acquiring a kick-ass core, then totally waste it by coupling many such cores to an inadequate uncore, or an inappropriate NoC, or lousy memory controllers. But it's dumb to assume that because in essence what you are saying is "I'm smarter than the people at Marvell/Ampere/Amazon/Apple/ARM, and they will probably make mistakes that even I know to avoid"...

"Graviton2 is already losing out to the 3950x big time in Phoronix's Linux kernel compilation benchmark."
I ALREADY EXPLAINED this to you. Did you even bother to read the link I sent? And absorb anything from it -- from lock styles to malloc designs to memory capacity to large page usage to processor groups? For someone who claims to be so interested in large core count performance you seem remarkably ignorant of everything that actually affects such performance.

And yet you have the gall to ignore my explanation and tell me "What I'm trying to do is shut down at least one poster..." like you're so interested in the truth, not tribalism.
OK, I'm done. You're on the list of people it's not worth arguing against, or even presenting with facts.
 

lobz

Golden Member
Feb 10, 2017
1,130
1,013
106
"who insists that we can accurately calculate server ARM CPU performance by running Geekbench on phone SoCs"

Thing is, this is NOT a crazy claim. It boils down to -- what is the hard part of creating a SoC?
The hardest part is the part tested by GeekBench!
Have you ever considered the possibility, that it's not necessarily you who gets to decide that?
 
