Discussion AWS Graviton2 64 vCPU Arm CPU Heightens War of Intel Betrayal

Status
Not open for further replies.

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Imagine ARM competition for AMD's Renoir 8c/16t in laptops. They could use 16 A77 cores (an ARM core has half the area, so the total area comes out the same), run them at 2.5 GHz, and still get performance similar to a Renoir clocked at 5 GHz (which isn't even possible), all while having a 4x lower TDP. That's the ARM power for which x86 has no answer.
Care to share your data supporting this situation?
 

ksec

Senior member
Mar 5, 2010
420
117
116
Wait a min, I am not following: are we arguing that the N1 in Graviton is not good enough to compete with x86 on single-thread / per-core performance?

Or do we simply have a disagreement over whether ARM is so good that it will take over the (server and PC) world?

And can't we agree that while the N1 could in theory compete with x86 on single-core performance, that doesn't necessarily mean it will take over the world? That is far too simplistic a view.
 
Last edited:

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Care to share your data supporting this situation?
ARM's Neoverse N1 is based on the A76 (1.2mm2 with 512KB L2$, 1.4mm2 with 1MB L2$). You can read the AnandTech article about that. https://www.anandtech.com/show/13959/arm-announces-neoverse-n1-platform/2
The A77 has about 17% more transistors, so roughly 1.4 and 1.6mm2 for the two L2$ sizes.
A Zen2 core was measured at about 3.6mm2 (512KB L2$).

[Image: slide from Arm's 2019 Infrastructure Tech Day, Filippo's Neoverse N1 presentation]
 

Elfear

Diamond Member
May 30, 2004
7,097
644
126

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
ARM's Neoverse N1 is based on the A76 (1.2mm2 with 512KB L2$, 1.4mm2 with 1MB L2$). You can read the AnandTech article about that. https://www.anandtech.com/show/13959/arm-announces-neoverse-n1-platform/2
The A77 has about 17% more transistors, so roughly 1.4 and 1.6mm2 for the two L2$ sizes.
A Zen2 core was measured at about 3.6mm2 (512KB L2$).

[Image: slide from Arm's 2019 Infrastructure Tech Day, Filippo's Neoverse N1 presentation]
This does not at all support your conclusion that a 16 core A77 clocked at 2.5 GHz would have the same "performance" (whatever you mean by that - gaming? photo work? rendering? browsing?) as Renoir 8c/16t clocked at 5 GHz while still having 4 times lower TDP.

I asked you to support your statement about performance, and you gave me information on die size and transistors and cache.

Again, I ask you to support your statement with some kind of hard data, some benchmark, some comparative evaluation of actual performance.

I'm not saying you're wrong (how could I prove that? There are no 5 GHz Renoir chips nor any 16 core A77-based CPUs). But I'm saying you need to produce some data before you just start throwing claims around.
 
Last edited:

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Why would the ARM vendors bother? It's not a terribly lucrative niche. There's a reason the only desktop chips today are either overclocked laptop chips or cut down server chips.
Fair point. One could even argue that Renoir is also just a cut-down server CPU uarch (i.e., Zen2 CCXs) taped onto silicon next to some 7nm Vega CUs.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
This does not at all support your conclusion that a 16 core A77 clocked at 2.5 GHz would have the same "performance" (whatever you mean by that - gaming? photo work? rendering? browsing?) as Renoir clocked at 5 GHz while still having 4 times lower TDP.

I asked you to support your statement about performance, and you gave me information on die size and transistors.

Again, I ask you to support your statement with some kind of hard data, some benchmark, some comparative evaluation of actual performance.
Ok, let me explain it in more detail.

Assuming the same IPC for A77 vs. Zen2 (the A77 is a bit faster in SPECint2006, but let's set that aside for now; it depends on the SW).
Renoir 8c at 5 GHz vs. 16c A77 at 2.5 GHz..... identical MT performance for both.
SMT lowers performance per thread to about half, so it's now equal to the A77 at half frequency. Performance per thread is identical for both.

Half frequency means about 4x lower power consumption (doubling the A77 core count cancels out ARM's double energy efficiency).

Summary: Same performance per thread while 4x lower power consumption.
But in reality Renoir physically cannot be clocked at 5 GHz, to say nothing of laptop TDP, while the A77 can go to 2.8 GHz easily. Unfortunately there is no such 16-core A77 CPU on the market, while Renoir is. That's the little catch :)
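For what it's worth, the 4x figure falls out of the usual first-order dynamic-power model, P ~ C * V^2 * f with voltage assumed to scale roughly linearly with frequency. A toy sketch of that assumption (the exponents are textbook idealizations, not measurements of any real A77 or Zen2 part):

```python
# First-order dynamic-power model: P ~ C * V^2 * f, with the common
# idealization that V scales linearly with f, so per-core power goes
# as f^3. Purely illustrative; no real silicon behaves this cleanly.
def relative_power(freq_ratio, core_ratio=1.0):
    """Total power relative to a baseline configuration."""
    return core_ratio * freq_ratio ** 3

# 16 cores at half the clock vs. 8 cores at full clock (same raw throughput):
p = relative_power(freq_ratio=0.5, core_ratio=2.0) / relative_power(1.0, 1.0)
print(p)  # -> 0.25, i.e. ~4x lower total power
```

Whether real parts actually sit on that curve at these clocks is exactly what the rest of the thread disputes.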
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Ok, let me explain it in more detail.

Assuming the same IPC for A77 vs. Zen2 (the A77 is a bit faster in SPECint2006, but let's set that aside for now; it depends on the SW).
SPECint2006
A77 at 2.6 GHz is estimated to be about 31.66 (link)
3900X at 4.6 GHz is 52.12 (link)
EPYC 7742 at 3.4 GHz is 41.9 (link)
= A77 score per GHz is 12.18
= 3900X score per GHz is 11.33
= EPYC 7742 score per GHz is 12.32
Let's call it even.
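Those per-GHz figures are straightforward to reproduce from the linked scores and clocks:

```python
# Reproducing the SPECint2006-score-per-GHz figures quoted above.
scores = {
    "A77 @ 2.6 GHz":       (31.66, 2.6),
    "3900X @ 4.6 GHz":     (52.12, 4.6),
    "EPYC 7742 @ 3.4 GHz": (41.9,  3.4),
}
for name, (score, ghz) in scores.items():
    print(f"{name}: {score / ghz:.2f} per GHz")
# A77 12.18, 3900X 11.33, EPYC 7742 12.32 -- close enough to call even
```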

Renoir 8c at 5 GHz vs. 16c A77 at 2.5 GHz..... identical MT performance for both.
That's a huge jump. Assuming one can just build out an A77 to perform in real, actual multithreaded applications in the laptop/desktop/HEDT market, which has never, not once, ever been done.

SMT lowers performance per thread to about half so it's now equal to A77 at half frequency. Performance per thread is identical for both.
That's just entirely untrue.

1) Disabling SMT has a very small performance difference in single-threaded applications

2) SMT vs non-SMT tests show that SMT cores achieve somewhere between 54% and 82% of what would be expected from the addition of a true extra core (heavily threaded apps including wPrime, CBR20, Blender, Corona, Keyshot, MySQL, 7z-decompression are all used in this calculation). On average, an SMT core is worth about 66% of a real core.

3900X | SMT off | SMT on | 12c/12t per-thread score | 12c/24t per-thread score | Expected 24c/24t time / per-thread score | % of expected for added threads
wPrime | 82.39 | 56.59 | n/a | n/a | 41.195 | 62.63
CBR20 | 5553.2 | 7260.3 | 462.767 | 302.512 | 462.767 | 65.37
Blender | 229.79 | 156.92 | n/a | n/a | 114.895 | 63.42
Corona | 176.8 | 129 | n/a | n/a | 88.4 | 54.07
Keyshot | 208.4 | 303.6 | 17.367 | 12.650 | 17.367 | 72.84
MySQL | 220741 | 277754 | 18395.083 | 11573.083 | 18395.083 | 62.91
7z-decomp | 53722 | 88438 | 4476.833 | 3684.917 | 4476.833 | 82.31
Average | | | | | | 66.22
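For clarity, the "% of expected" column follows directly from each SMT off/on pair: for score-based tests it's the SMT-on result against a hypothetical doubling of the SMT-off score, and for time-based tests it's the achieved time reduction against an ideal halving. A sketch of both cases:

```python
# Deriving the "% of expected" column from the SMT off/on results above.
def pct_of_expected(off, on, lower_is_better=False):
    if lower_is_better:
        # time-based tests (wPrime, Blender, Corona): achieved reduction
        # vs. the ideal halving of the run time
        return 100 * (off - on) / (off - off / 2)
    # score-based tests (CBR20, Keyshot, MySQL, 7z): SMT-on result vs.
    # a hypothetical doubling of the SMT-off score
    return 100 * on / (2 * off)

print(round(pct_of_expected(5553.2, 7260.3), 2))                      # CBR20 -> 65.37
print(round(pct_of_expected(82.39, 56.59, lower_is_better=True), 2))  # wPrime -> 62.63
```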

So we can derive the expected performance of virtual cores relative to real cores as (# of real cores) + (# of SMT threads * 0.5 * 0.66).

E.g. if we are comparing a 3900X with SMT disabled to a 3600 with SMT on, assuming both are clocked the same:

3600 (6c/12t) relative performance to 3900X (12c/12t) = 6 + (12 * 0.5 * 0.66) = 9.96 "real" cores vs 12 real cores

So it's not a 50% performance hit. It's more like a 1 - (9.96 / 12) = 17% performance hit compared to using real cores.
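That effective-core arithmetic, written out (this just reproduces the formula as stated; note it credits each SMT thread 0.5 * 0.66, roughly a third of a core, and the 0.66 factor itself is what later replies dispute):

```python
# The effective-core formula above: real cores, plus each SMT thread
# credited 0.5 * 0.66 (~1/3) of a real core.
def effective_cores(real, threads, smt_worth=0.66):
    return real + threads * 0.5 * smt_worth

eff = effective_cores(real=6, threads=12)   # 3600, 6c/12t
deficit = 100 * (1 - eff / 12)              # vs. 12 real cores (3900X, SMT off)
print(round(eff, 2), round(deficit, 1))     # -> 9.96 17.0
```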

Half frequency means about 4x lower power consumption (doubling the A77 core count cancels out ARM's double energy efficiency).
You're shoving Zen2 WAAAAAY up the voltage-frequency curve, from 4.2 GHz (4800H, for example) to 5 GHz, while pushing the A77 DOWN the curve from 2.6 to 2.5 GHz. That's not a fair basis for comparing power consumption.

If we just take them as they are, and assume the A77 scales up to 16 cores perfectly:

Zen2 = ~11.8 per-GHz score ((12.3 + 11.3) / 2)
A77 = ~12.2 per-GHz score

Renoir 4800 at 4.2 GHz = 11.8 * 4.2 = 49.56 for each core
A77 at 2.6 GHz = 12.18 * 2.6 = 31.66 for each core (as per the above)

Since we know from my above calculations that enabling SMT only results in a 20% performance hit compared to using a real core, we can easily extrapolate this out.

Renoir 4800 = 49.56 * (8 real cores + 16 threads * 0.5 * 0.66) = 49.56 * 13.28 = 658.16
A77 = 31.66 * 16 real cores = 506.56
Renoir's per-thread performance lead would be 30%, even though half of its threads aren't even "real" cores!
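Spelling out that extrapolation under the same assumptions as above (perfect 16-core scaling for the A77, and the 0.66 SMT credit for Zen2):

```python
# The Renoir-vs-16xA77 extrapolation above, same assumptions as the post
# (perfect 16-core scaling for A77; 0.66 credit per SMT thread pair).
zen2_per_core = 11.8 * 4.2   # per-GHz score * clock -> 49.56
a77_per_core = 31.66         # 12.18 * 2.6, as above

renoir = zen2_per_core * (8 + 16 * 0.5 * 0.66)  # 8 real cores + SMT credit
a77_16 = a77_per_core * 16                      # 16 real cores

print(round(renoir, 2), round(a77_16, 2))                  # -> 658.16 506.56
print(f"Renoir lead: {100 * (renoir / a77_16 - 1):.0f}%")  # -> 30%
```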

Summary: Same performance per thread while 4x lower power consumption.
Summary: You have overestimated the penalty for using SMT, and then misapplied it.

As for power consumption, the 4800 has a 10-45W TDP. I don't know what a 16-core A77 would consume. But if we match it up against a 4800U at a 10-watt TDP, I doubt a 16-core A77 would have the 2-watt TDP it would need to quadruple efficiency.
 
Last edited:

ksec

Senior member
Mar 5, 2010
420
117
116
Yea, I am really getting sick of the ARM advocates, as many of them have no clue about reality.

Let's be fair, it really isn't ARM's advocates that are the problem; it is the lack of understanding, the gaps in knowledge, and the unwillingness to learn. You see that with Intel and AMD advocates as well.

I think we need a term for these people.

( Edit: I am still waiting for TSMC to ship 5nm to prove my point. )
 

DrMrLordX

Lifer
Apr 27, 2000
21,644
10,862
136
He's pretty similar to REDACTED - I see exactly the same fixation on ARM architecture.

Do not speak the naaaame!

Unlike our resident "Strong Man", he-who-must-not-be-named can smite an entire forum through sheer force of will. We would surely perish in his presence. Our resident "Strong Man"/troll is persistently foolish. That is all.

Why would the ARM vendors bother? It's not a terribly lucrative niche. There's a reason the only desktop chips today are either overclocked laptop chips or cut down server chips.

Interesting question! I think Linus Torvalds wrote a good piece on the importance of having consumer ARM hardware available for developers:


We will see if that has any bearing on the future of ARM. But he has a point.
 
Last edited:

name99

Senior member
Sep 11, 2010
404
303
136
SPECint2006
A77 at 2.6 GHz is estimated to be about 31.66 (link)
3900X at 4.6 GHz is 52.12 (link)
EPYC 7742 at 3.4 GHz is 41.9 (link)
= A77 score per GHz is 12.18
= 3900X score per GHz is 11.33
= EPYC 7742 score per GHz is 12.32
Let's call it even.


That's a huge jump. Assuming one can just build out an A77 to perform in real, actual multithreaded applications in the laptop/desktop/HEDT market, which has never, not once, ever been done.


That's just entirely untrue.

1) Disabling SMT has a very small performance difference in single-threaded applications

2) SMT vs non-SMT tests show that SMT cores achieve somewhere between 54% and 82% of what would be expected from the addition of a true extra core (heavily threaded apps including wPrime, CBR20, Blender, Corona, Keyshot, MySQL, 7z-decompression are all used in this calculation). On average, an SMT core is worth about 66% of a real core.

3900X | SMT off | SMT on | 12c/12t per-thread score | 12c/24t per-thread score | Expected 24c/24t time / per-thread score | % of expected for added threads
wPrime | 82.39 | 56.59 | n/a | n/a | 41.195 | 62.63
CBR20 | 5553.2 | 7260.3 | 462.767 | 302.512 | 462.767 | 65.37
Blender | 229.79 | 156.92 | n/a | n/a | 114.895 | 63.42
Corona | 176.8 | 129 | n/a | n/a | 88.4 | 54.07
Keyshot | 208.4 | 303.6 | 17.367 | 12.650 | 17.367 | 72.84
MySQL | 220741 | 277754 | 18395.083 | 11573.083 | 18395.083 | 62.91
7z-decomp | 53722 | 88438 | 4476.833 | 3684.917 | 4476.833 | 82.31
Average | | | | | | 66.22

So we can derive the expected performance of virtual cores relative to real cores as (# of real cores) + (# of SMT threads * 0.5 * 0.66).

E.g. if we are comparing a 3900X with SMT disabled to a 3600 with SMT on, assuming both are clocked the same:

3600 (6c/12t) relative performance to 3900X (12c/12t) = 6 + (12 * 0.5 * 0.66) = 9.96 "real" cores vs 12 real cores

So it's not a 50% performance hit. It's more like a 1 - (9.96 / 12) = 17% performance hit compared to using real cores.


You're shoving Zen2 WAAAAAY up the voltage-frequency curve, from 4.2 GHz (4800H, for example) to 5 GHz, while pushing the A77 DOWN the curve from 2.6 to 2.5 GHz. That's not a fair basis for comparing power consumption.

If we just take them as they are, and assume the A77 scales up to 16 cores perfectly:

Zen2 = ~11.8 per-GHz score ((12.3 + 11.3) / 2)
A77 = ~12.2 per-GHz score

Renoir 4800 at 4.2 GHz = 11.8 * 4.2 = 49.56 for each core
A77 at 2.6 GHz = 12.18 * 2.6 = 31.66 for each core (as per the above)

Since we know from my above calculations that enabling SMT only results in a 20% performance hit compared to using a real core, we can easily extrapolate this out.

Renoir 4800 = 49.56 * (8 real cores + 16 threads * 0.5 * 0.66) = 49.56 * 13.28 = 658.16
A77 = 31.66 * 16 real cores = 506.56
Renoir's per-thread performance lead would be 30%, even though half of its threads aren't even "real" cores!


Summary: You have overestimated the penalty for using SMT, and then misapplied it.

As for power consumption, the 4800 has a 10-45W TDP. I don't know what a 16-core A77 would consume. But if we match it up against a 4800U at a 10-watt TDP, I doubt a 16-core A77 would have the 2-watt TDP it would need to quadruple efficiency.

There's a much simpler way to say it. Essentially what you are asserting is that with SMT2 a second thread is equivalent to about 1/3 of a core. (At least I think that's what you're saying; I'm not interested enough to validate your arithmetic and reverse-engineer what you're calculating.)

Now, is this true? I'd say it's way too optimistic in general.
Over a wider range of benchmarks, I'd say SMT being worth 25% of a core is a better approximation.
To do better than this requires code that
- doesn't spend all its time in a single execution unit (usually SIMD).
- doesn't utilize memory in "normal" ways.

The first is obvious -- if thread A is using the SIMD unit(s) on 90% of cycles, and thread B wants to do the same, then both are going to run at close to half speed.
The second is less obvious (and a prime reason why SMT just never gets much better, no matter how much proponents try to push it). For most code, the single biggest bottleneck is the L1 cache. But with SMT, under most conditions you're now halving the effective size of that cache :-( (and of other "cache-like" structures, such as branch tables). What you lose from that takes away much of the win you might hope for from a naive analysis of SMT.

I don't know many of the benchmarks listed. But I expect most of them take the form of extreme computation (with little reference to memory) while not using much AVX.
What happens when you get rid of those assumptions? Well then you get something like this:

Sometimes great -- and sometimes basically nothing...

If you want an extra 25% throughput, you can add SMT. Or you can add an ARM small core (which generally delivers about the same extra throughput relative to the big core). The ARM small cores are small enough that they're basically lost in the chip-area noise (even the Apple ones are small.)
You lose the SYMMETRY of SMT, sure. But you also lose the INSECURITY of SMT. And you gain the optionality of lower power.

SMT isn't some superpower that x86 has and ARM does not. It's ONE way of increasing throughput at low area. Not the only way, not a great solution for some purposes. And of course should a company think it does make sense for their products (Broadcom, now Marvell) it can be added easily enough.
(Personally I think this was a dumb decision by Marvell, both generically and wrt details of how they did it. But it's their company not mine, we'll see the consequences soon enough. Anyone else in the ARM space can also add it -- and maybe they will, hopefully done right rather than done dumb.)
 

coercitiv

Diamond Member
Jan 24, 2014
6,215
11,963
136
If you want an extra 25% throughput, you can add SMT. Or you can add an ARM small core (which generally delivers about the same extra throughput relative to the big core). The ARM small cores are small enough that they're basically lost in the chip-area noise (even the Apple ones are small.)
You lose the SYMMETRY of SMT, sure. But you also lose the INSECURITY of SMT. And you gain the optionality of lower power.
While this may look true and completely valid at first sight, one immediately asks the next logical question: why stop at a 25% throughput increase when the small cores are basically "lost in the chip area noise"? Why not go for 50% or even 75%?

Sounds to me like that SYMMETRY may be much more important than you think.
 
  • Like
Reactions: lobz

Nothingness

Platinum Member
Jul 3, 2013
2,423
755
136
Yea, I am really getting sick of the ARM advocates, as many of them have no clue about reality.
I have the same issue with AMD fanatics you know. Or Intel ones.

Just ignore ARM threads like many ignore threads where AMD advocates are making claims that prove they have no clue about reality.
 

Nothingness

Platinum Member
Jul 3, 2013
2,423
755
136
Let's be fair, it really isn't ARM's advocates that is the problem, it is the lack of understanding, gap in knowledge and unwillingness to learn. You see that with Intel and AMD advocates as well.

I think we need a term for these people.
I call them fanbois, but that's considered an insult on this forum :) Someone used "fanatics" and I like it. Would that be OK with forum rules?
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
On average, an SMT core is worth about 66% of a real core.
3600 (6c/12t) relative performance to 3900X (12c/12t) = 6 + (12 * 0.5 * 0.66) = 9.66 "real" cores vs 12 real cores
So it's not a 50% performance hit. It's more like a 1 - ( 9.66 / 12 ) = 19.5% performance hit compared to using real cores.
I do not see such a huge SMT benefit on my 3700X. Do Coercitiv and Markfw see an average 66% SMT benefit too?

I like any calculation, because it's hard to bend numbers. But if you do the math with incorrect input data, such as the assumption that SMT brings 66-80% more performance, then no wonder Renoir wins everywhere. I think if you "tune" the input data even further, you may get Renoir into the TOP500 supercomputer ranking too :D
 
  • Haha
Reactions: Zucker2k

DrMrLordX

Lifer
Apr 27, 2000
21,644
10,862
136
SMT is generally not going to add +66% throughput. AMD's implementation is very good, so you might see +40% in some benchmarks, especially where AVX2 is not in use.
 

Nothingness

Platinum Member
Jul 3, 2013
2,423
755
136
1) Disabling SMT has a very small performance difference in single-threaded applications
That's correct now. The opposite used to be true on early Intel chips, where some HW resources were statically partitioned rather than dynamically shared, but that was fixed long ago.

2) SMT vs non-SMT tests show that SMT cores achieve somewhere between 54% and 82% of what would be expected from the addition of a true extra core (heavily threaded apps including wPrime, CBR20, Blender, Corona, Keyshot, MySQL, 7z-decompression are all used in this calculation). On average, an SMT core is worth about 66% of a real core.

3900X | SMT off | SMT on | 12c/12t per-thread score | 12c/24t per-thread score | Expected 24c/24t time / per-thread score | % of expected for added threads
wPrime | 82.39 | 56.59 | n/a | n/a | 41.195 | 62.63
CBR20 | 5553.2 | 7260.3 | 462.767 | 302.512 | 462.767 | 65.37
Blender | 229.79 | 156.92 | n/a | n/a | 114.895 | 63.42
Corona | 176.8 | 129 | n/a | n/a | 88.4 | 54.07
Keyshot | 208.4 | 303.6 | 17.367 | 12.650 | 17.367 | 72.84
MySQL | 220741 | 277754 | 18395.083 | 11573.083 | 18395.083 | 62.91
7z-decomp | 53722 | 88438 | 4476.833 | 3684.917 | 4476.833 | 82.31
Average | | | | | | 66.22
I hate it when people don't cite their source: https://www.techpowerup.com/review/amd-ryzen-9-3900x-smt-off-vs-intel-9900k/3.html :)

You excluded many of the tests. Some are explicitly stated as being single-threaded, so I can understand that. Are you sure the others are not?

Any workload that is faster with SMT on than with SMT off has to be multithreaded, right? So I would have included all such tests.

Anyway, a good study would be one that examines how apps scale while varying the number of physical cores enabled and toggling SMT on/off. I failed to find one :(

Summary: You have overestimated the penalty for using SMT, and then misapplied it.
IMHO you're making it look better than it is. The "truth" likely lies somewhere in between.

EDIT: Forgot to say that I find the results you showed excellent (better than what I was expecting). Thanks for sharing :) That's convincing me even more that my next desktop will be AMD-based.
 
Last edited:

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
There's a much simpler way to say it. Essentially what you are asserting is that for Intel SMT2 a second thread is equivalent to about 1/3 of a core. (At least I think that's what you're saying, I'm not interested enough to validate your arithmetic and try to reverse engineer what you're calculating).
[...]
No. Based on the calculations I did above, SMT provides about a 66% benefit for each additional thread added by SMT2. That is, if you compare the 3900X with SMT off (12 cores/12 threads) vs SMT on (12 cores/24 threads), those additional 12 threads add about 8 "cores" of performance instead of 12.

About my protocol: I didn't pick the benchmarks willy-nilly or to make things look better. I picked them from benchmarks that scale well with cores, and excluded several because they clearly do not scale with increasing core count or because they are specifically single-threaded or very lightly threaded.

To vet the benchmarks, I compared 3600X vs 3700X scores (both boost to 4.4 GHz): the 3700X has 33% more cores, so a benchmark that scales well with cores should show something approaching 33% better performance. There are some limitations: both have the same amount of L3$ and I/O bandwidth available despite the 3700X having 33% more cores, so I gave a lot of wiggle room. In most cases I also confirmed that the 3700X vs 3900X scaled up somewhere around 50% (the 3900X has 50% more cores plus a 5% boost-frequency advantage, so I actually expected more than that, but again allowed some wiggle room).

There are many more details that could explain why a benchmark doesn't scale with cores, but if the 3600X-to-3700X improvement wasn't even 67% of expected (i.e., less than 22% for a 33% core-count increase), I don't think it's a test that scales well or should be used here.

Doesn't scale well with cores (doesn't even come close to a 33% score improvement when comparing the 3600X vs the 3700X):
Unreal Engine 4 - 6% difference
VS C++ - 6.5% difference
Tensorflow - 12.7% difference
Euler3D - 13.8% difference
DigiCortex - 9.4% difference
Tesseract OCR - 14.2% difference
WinRAR compress - 11.5% difference
x265 - 18.5% difference
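The cutoff described above (a benchmark must keep at least 67% of the ideal 33% core-scaling gain, i.e. more than 22%) is easy to mechanize; the scores below are made-up illustrations, not TechPowerUp numbers:

```python
# The vetting rule from the post: going 3600X -> 3700X adds 33% more
# cores, so a "scales with cores" benchmark should keep at least 67%
# of that ideal gain, i.e. improve by > 22%.
def scales_with_cores(score_6c, score_8c, ideal=1/3, keep_frac=0.67):
    gain = score_8c / score_6c - 1
    return gain >= ideal * keep_frac

# Hypothetical scores, just to show the rule in action:
print(scales_with_cores(100, 125))  # 25% gain -> True, passes the cutoff
print(scales_with_cores(100, 110))  # 10% gain -> False, excluded
```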

Specifically single-threaded or VERY lightly threaded
SuperPi
CBR20 Single
Octane
Kraken
WebXPRT
Tensorflow
Office
Photoshop CC
Premiere Pro CC
3dF Zephyr
VMWare Workstation 15
LAME

Special exclusions // benchmarks I could have included but didn't, for reasons below:

Java SE 8 is not included because it is a conglomeration of single-threaded and multi-threaded tests, so I couldn't justify including it in a comparison that seeks to focus on multi-threaded questions.

7-Zip compress is not included because the results are inconsistent: a 3900X has 50% more cores and threads than a 3700X, yet only saw a 30% boost in compress performance, of which 5% was the higher boost clock. It's just an odd test.

VeraCrypt - the 3600X actually has higher throughput than the 3900X and 3700X.

x264 - 27% scaling vs 33% expected on 3600X -> 3700X jump but only 23.5% scaling vs 50%+ expected on 3700X -> 3900X jump.


And if you look at my original post, the worst of the "heavily threaded" apps was Corona, which BARELY made the 22% cutoff (23% scaling vs the 33% expected, 3600X vs 3700X).


In the end, the benefit of SMT in benchmarks largely depends on the benefit of adding cores in that benchmark. So I picked benchmarks that benefit from adding cores, then saw how they do when adding virtual cores.


What I may do over the next couple of days:
For all of the application benchmarks, take the 3600X vs 3700X vs 3900X and compare the per-core benefit against the benefit of SMT on vs. off on the 3900X. That should normalize for apps that don't scale well. I'll still exclude the single-threaded/lightly threaded apps, since those make little sense to include, full stop.
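If it helps, that normalization could look something like this (the function and the scores are hypothetical, just to show the shape of the comparison, not results from any review):

```python
# Sketch of the proposed normalization: measure how much of the ideal
# core-scaling gain a benchmark actually achieves, so SMT's per-thread
# gain can be compared against the app's real per-core scaling.
def core_scaling(score_a, cores_a, score_b, cores_b):
    """Fraction of ideal scaling achieved when adding cores."""
    ideal = cores_b / cores_a
    return (score_b / score_a - 1) / (ideal - 1)

# Hypothetical CB-style scores for 6c, 8c and 12c parts:
print(round(core_scaling(2500, 6, 3200, 8), 2))   # 8c vs 6c -> 0.84
print(round(core_scaling(3200, 8, 4480, 12), 2))  # 12c vs 8c -> 0.8
```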
 