Discussion AWS Graviton2 64 vCPU Arm CPU Heightens War of Intel Betrayal


DrMrLordX

So given the lack of data you do approximations upon approximations. That's what I strongly disagree with.

Okay, find some better data then.

Yeah 16->32 scales better than 7601, OMG 7601 is so bad!!!!!!!1111111!!!!!

Sure, why not? It's an aging product, and it has some problems. AMD made a lot of compromises choosing multiple dice instead of going monolithic. Though a 2P system was used, so I expect that to affect scaling as well.

And for 32->64 7601 speedup is 67% while for G2 it's 59%. So really you insist on saying G2 is much worse?

When you consider how badly it choked on 403.gcc, and then you look at how it starts to tumble after 32t in a "reduced memory pressure" environment, it does actually look pretty bad.

In particular when SPECrate is not like doing parallel builds, much more pessimistic?

Why would parallel builds favor Graviton2 more than Spec?
 

amrnuke

So given the lack of data you do approximations upon approximations. That's what I strongly disagree with.


No need to download it: if you compile the kernel you won't be compiling the same file 64 times at the same time, right?


Yeah 16->32 scales better than 7601, OMG 7601 is so bad!!!!!!!1111111!!!!! And for 32->64 7601 speedup is 67% while for G2 it's 59%. So really you insist on saying G2 is much worse? In particular when SPECrate is not like doing parallel builds, much more pessimistic?
To quote a very wise person:

"Bait for wenchmarks."

We know SO LITTLE about Graviton2's performance in the real world that we have to rely on assumptions and tests that are not done head-to-head in a controlled environment but rather disparately and pieced together.

What is clear is that there is a use case for G2 where it will likely do well (power-efficient, cheap), and a use case for EPYC where it does very well. Those two may overlap a little as well. In the end, sometimes you need a broom to clean off your porch, and sometimes you need a power washer to do it. You wouldn't power-wash dry leaves and loose pollen off your porch (the water might move some of it, but the associated air movement will blow leaves and pollen all over the place, and they will settle right in places you already cleaned), and you wouldn't try to sweep away mildew/mold and bird droppings with a broom (it just won't get the job done well). Different tools for different types of the same job (cleaning the porch, or running AWS instances).
 

DrMrLordX

"Bait for wenchmarks."

M6g instances are only available in preview right now (which means Amazon has to select you to use one), but once they're available, it should be trivial to run some parts of the Phoronix Test Suite on instances of different sizes. Until then . . . still baiting for the wenchmarks!
 

moinmoin

@Nothingness While I agree with some of your criticism, can you please add better data to the discussion? As it is, you are shouting down somebody who is actually making the effort to find some suitable data; the shouting is silly since it is not moving the discussion forward at all.
 

Nothingness

Okay, find some better data then.
You're the one making claims with no proof, so why should I provide data?

Sure, why not? It's an aging product, and it has some problems. AMD made a lot of compromises choosing multiple dice instead of going monolithic. Though a 2P system was used, so I expect that to affect scaling as well.
Agreed, we need at least a 7742 to compare against.

When you consider how badly it choked on 403.gcc, and then you look at how it starts to tumble after 32t in a "reduced memory pressure" environment, it does actually look pretty bad.
I see you're insisting on being right at all cost :D

Why would parallel builds favor Graviton2 more than Spec?
I will repeat it for the 3rd or 4th time: SPECrate compiles the same file in each instance launched, so the profile of the compiler run is the same. That leads to more pathological cases where all processes are attacking memory at the same time. If you compile different files at the same time, as in a kernel compile, you're less likely to get into such pathological cases.
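To make that concrete, here is a minimal, purely illustrative sketch (not the actual SPEC tooling; file names are placeholders): the first pool mimics a SPECrate-style run where every copy compiles the same input, the second mimics a kernel-style build where every job compiles a different file.

```python
# Illustrative only: contrast a SPECrate-style run (N copies of the same
# input, all hitting the same program phases together) with a kernel-style
# build (N different translation units). File names are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NCOPIES = 64

def compile_one(path):
    # One independent compiler invocation per job; output is discarded.
    subprocess.run(["gcc", "-O2", "-c", path, "-o", "/dev/null"], check=False)

# SPECrate-like: every copy compiles the identical file, so all 64 processes
# tend to stress the memory subsystem in lock-step.
with ThreadPoolExecutor(max_workers=NCOPIES) as pool:
    list(pool.map(compile_one, ["same_input.c"] * NCOPIES))

# Kernel-build-like: 64 different files with different sizes and access
# patterns, so the pathological "everyone attacks memory at once" case is
# less likely.
with ThreadPoolExecutor(max_workers=NCOPIES) as pool:
    list(pool.map(compile_one, [f"unit_{i}.c" for i in range(NCOPIES)]))
```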
 

Richie Rich

Okay, so let's see how 25% SMT benefit in Zen 2 leads to ~50% performance per thread.

(100%+25%)/2 = 62.5% = 50% -> totally checks out! Numbers are straight, which is a huge sigh of relief because somehow I feared that a 8c/16t Zen2 @ 5GHZ may actually be equal to a 10c/10t Zen @ 5Ghz.
It was just simple comparison without going into details (without 25% SMT benefit, assuming Renoir can run at 5GHz and that A77 has 8% higher IPC).

But I see you are interested in those details, so let's do the calc (normalized to Zen2 ST = 100%):
  • 8c/16t Renoir SMT ........ 8x 125% total performance .... 16x 62.5% perf per thread
  • 16c A77 .......................16x 108% total performance .... 16x 108% perf per thread

The ratio of performance per thread is .......108/62.5 = 1.728 .... Renoir must clock 1.728 times more to get identical performance
If A77 clock in Snapdragon is 2.8 GHz .... then 2.8 x 1.728 = 4.83 GHz...... and that's not possible.

  • My 3700X at 4.3GHz all core consumes about 90W, Renoir 4900HS takes 60W peak with all core boost around 4.2?
  • A77 at 2.8 GHz in SD865 is 2.3W (whole SoC, core itself is about 1.4W) so 16x2.3=36.8W (multiplying 16x memory ctrl from SD865 which is overkill)

So you can see 16-core ARM A77 @2.8GHz would provide 15% more (4.83/4.2) performance than Renoir@4.2GHz.
Power consumption is rough estimation but ARM would consume about half the power while providing higher performance.

Intel and AMD are lucky that there is no such 16-core ARM product from Qualcomm or Samsung, etc.
 

Nothingness

@Nothingness While I agree with some of your criticism, can you please add better data to the discussion? As it is, you are shouting down somebody who is actually making the effort to find some suitable data; the shouting is silly since it is not moving the discussion forward at all.
Why should I provide data when I'm not the one making the claim? OTOH I can give reasons why the methodology is not correct and that's what I have been doing.

The problem is that several people are trying to make comparisons that are pointless due to too many differences in the methodology. And they insist on being correct. Stubbornness drives me mad.

And apologies if I seem to be shouting...
 

DrMrLordX

Huzzah, I found some SPEC2006 data for dual EPYC 7601 from Anandtech! It's not a perfect complement to Andrei's testing, but it's close. Ish.




Going to have to get a bit creative, since the only direct data we see here is how a 2P 7601 scales from 1T->128T. But! They gave us SMT scaling for 403.gcc when running the MT bench on a single core (2T). It's +19% for SMT being active.

So we can plausibly get the 64T score (SMT off) by reducing the 128T score given on page 16. The 128T score is 1400, so without SMT it should be ~1176. The ST score for 403.gcc on a 2P 7601 is 35.1. Going from 1T->64T gives us a scaling factor of 0.52. That's quite a bit better than Graviton2.

Alternatively, we can try using the m5a numbers from Andrei's article (looks like the m5a is 1x EPYC 7571 with SMT enabled for 64t). If we use that data instead, it looks like EPYC is scaling from 1t->64t with a factor of .35 (actually .34557 but whatever). That's more in line with Graviton2's .32, though that still doesn't make Graviton2 look very good when you consider that the m5a instance has half the core count.
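For anyone who wants to check the arithmetic, here is a quick sketch that only reproduces the numbers quoted above; nothing in it is a new measurement.

```python
# Reproduces the back-of-the-envelope numbers above; nothing here is a
# fresh measurement.

def scaling_factor(st_score, mt_score, threads):
    """Per-thread scaling: (MT score / ST score) / thread count."""
    return (mt_score / st_score) / threads

# 2P EPYC 7601, 403.gcc: 128T score of 1400, SMT assumed to be worth +19%,
# ST score of 35.1 -> estimated 64T (SMT off) score of ~1176.
est_64t = 1400 / 1.19
print(round(scaling_factor(35.1, est_64t, 64), 2))  # ~0.52

# The same formula applied to the m5a (1x EPYC 7571, SMT on, 64 threads)
# numbers from Andrei's article gives ~0.35, versus ~0.32 for Graviton2.
```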

If all we're looking at is multicore scaling, it would probably be better to take SMT off the table, but that's just my opinion . . .

Why should I provide data when I'm not the one making the claim?

To educate. We aren't here to be right or wrong. We're here to learn. Or at least, that was my assumption.
 

amrnuke

First of all, don't take this the wrong way. I'm going to be very pedantic and very unrealistic in my response to you, because what you have given is factually incorrect information, which you have then applied incorrectly.

25% SMT benefit
Going from 8c/8t to 8c/16t gives you a performance boost of 67% (8c/8t = 1.00, 8c/16t = 1.67) based on my calculations and those of others in highly-threaded / highly-scalable benchmarks. I don't understand why you keep quoting 25%. But let's call it 50% since you're so insistent that the performance boost is 50%. Or is it 25%? You seem to be really unsure and give different numbers all the time. For instance, when you're touting SMT4, it's the best thing in the world and would provide a huge benefit and AMD are stupid not to use it. But when you're arguing against the benefit of SMT2, it's like SMT is pure doodoo.

But I see you are interested in those details, so let's do the calc:
  • 8c/16t Renoir SMT ........ 8x 125% total performance .... 16x 62.5% perf per thread
  • 16c A77 .......................16x 108% total performance .... 16x 108% perf per thread
Correction:
Renoir would be 8 x 150% total performance - 16 x 75% performance per thread
A77, sure whatever, 16 x 108% performance per thread

The ratio of performance per thread is .......108/62.5 = 1.728 .... Renoir must clock 1.728 times more to get identical performance
If A77 clock in Snapdragon is 2.8 GHz .... then 2.8 x 1.728 = 4.83 GHz...... and that's not possible.
Correction:
108/75 = 1.44
Renoir needs to clock 1.44 times higher than A77 to get identical performance by your metrics. Assuming of course they can design a front-end and interconnect and all the other stuff for a 16 core A77 chip that is competitive with Renoir.
So 2.8 GHz x 1.44 = 4.03 GHz
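For anyone checking the numbers, a small sketch of the corrected calculation above (the 50% SMT uplift and the claimed 8% A77 IPC advantage are the assumptions used in this exchange, not measurements):

```python
# Per-thread arithmetic from the correction above, normalized to Zen 2
# single-thread = 1.0. The 50% SMT uplift and the 8% A77 IPC edge are the
# assumptions used in this exchange, not measured values.
smt_uplift = 0.50          # 8c/16t Renoir: 8 x 1.50 total, 16 x 0.75 per thread
a77_ipc_vs_zen2 = 1.08     # per thread, per clock

renoir_per_thread = (1 + smt_uplift) / 2   # 0.75
a77_per_thread = a77_ipc_vs_zen2           # 1.08

clock_ratio = a77_per_thread / renoir_per_thread   # 1.44
print(round(2.8 * clock_ratio, 2))                  # ~4.03 GHz needed from Renoir
```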

  • My 3700X at 4.3GHz all core consumes about 90W, Renoir 4900HS takes 60W peak with all core boost around 4.2?
  • A77 at 2.8 GHz in SD865 is 2.3W (whole SoC, core itself is about 1.4W) so 16x2.3=36.8W (multiplying 16x memory ctrl from SD865 which is overkill)
I don't see how your 3700X's power consumption is relevant.

Sure, let's assume you're right about the 4900HS's boost consumption and all-core boost :). The most power-hungry Renoir takes 60W at 4.2 GHz, but downclocked to equal A77 performance it would probably consume far less based on known voltage-frequency curves. I'm guessing it may even be as power efficient as a 16-core A77, but with the option to boost much higher and crush it in performance when necessary!

So you can see 16-core ARM A77 @2.8GHz would provide 15% more (4.83/4.2) performance than Renoir@4.2GHz.
No. It would provide 4% less performance (4.03/4.2 = 95.95%) than Renoir @ 4.2 GHz.

Power consumption is rough estimation but ARM would consume about half the power while providing higher performance.
If the Renoir 4900HS draws 60W peak and we downclock it to 4.03 GHz, I'm sure its power consumption would be a lot better, and closer to that of A77 x 16.
Also, what if we used the 4800U with 8 cores, 16 threads, 4.2 GHz boost, and 15W TDP / 25W cTDP-up?

Anyway, if you really want to know the numbers, go buy a 16 core A77 chip and compare it to a downclocked 4900HS.
And also, if you want to know the full capabilities of an A77 x 16 compared to a Renoir, run them against each other in Blender or CBR20 MT or PrimeGrid and see which one does the work faster.

Intel and AMD are lucky that there is no such 16-core ARM product from Qualcomm or Samsung, etc.
Oh, that's right, there isn't one.
 

coercitiv

It was just simple comparison without going into details (without 25% SMT benefit, assuming Renoir can run at 5GHz and that A77 has 8% higher IPC).
So you start with this assumption on your own, you say Ryzen 8c/16t can do 5Ghz for the purpose of this theoretical comparison.

You then redo your calculations - increase clocks for A77 from 2.5Ghz to 2.8Ghz - and end up with a requirement of 4.83Ghz for Ryzen in a workload that perfectly scales with 16 threads.

Conclusion?
4.83 GHz...... and that's not possible.
Of course it is, remember you were not comparing actual products, but theoretical products. You alone started with a set of parameters in the beginning and then modified them to ensure a "positive" result.

Let's do this properly: let's see what happens with performance as workloads scale through thread count. We assume both cores have equal IPC, SMT is at +25%, and workloads scale perfectly with the number of threads. In order to somewhat compensate for lower clocks during heavy workloads we will look at different scenarios, with different relative speeds between the cores. We assume the ARM chip does not need to drop clocks because... magic.

Let's start with 4.2 GHz vs 2.8 GHz.

Layout               | GHz | 1T   | 2T   | 4T   | 6T   | 8T   | 10T   | 12T   | 14T      | 16T
8c / 16t             | 4.2 | 1    | 2    | 4    | 6    | 8    | 8.5   | 9     | 9.5      | 10
16c / 16t            | 2.8 | 0.67 | 1.33 | 2.67 | 4.00 | 5.33 | 6.67  | 8.00  | 9.33     | 10.67
Relative performance |     | 1.5  | 1.5  | 1.5  | 1.5  | 1.5  | 1.275 | 1.125 | 1.017857 | 0.9375

Now let's look at 3.3 GHz vs 2.8 GHz.

Layout               | GHz | 1T   | 2T   | 4T   | 6T   | 8T   | 10T  | 12T   | 14T   | 16T
8c / 16t             | 3.3 | 1    | 2    | 4    | 6    | 8    | 8.5  | 9     | 9.5   | 10
16c / 16t            | 2.8 | 0.85 | 1.70 | 3.39 | 5.09 | 6.79 | 8.48 | 10.18 | 11.88 | 13.58
Relative performance |     | 1.18 | 1.18 | 1.18 | 1.18 | 1.18 | 1.00 | 0.88  | 0.80  | 0.74

And finally, let's see what happens with clock parity.

Layout               | GHz | 1T   | 2T   | 4T   | 6T   | 8T   | 10T   | 12T   | 14T   | 16T
8c / 16t             | 2.8 | 1    | 2    | 4    | 6    | 8    | 8.5   | 9     | 9.5   | 10
16c / 16t            | 2.8 | 1.00 | 2.00 | 4.00 | 6.00 | 8.00 | 10.00 | 12.00 | 14.00 | 16.00
Relative performance |     | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.85  | 0.75  | 0.68  | 0.63

Hmm, looking at these tables with scaling workloads.... things are not quite as black and white as you made them look, are they?
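For reference, a minimal sketch that reproduces the model behind these tables, under the stated assumptions (equal IPC, +25% SMT, perfect thread scaling, fixed clocks):

```python
# Equal IPC, SMT worth +25%, perfect scaling with threads, and the 16c part
# pinned at 2.8 GHz. Output is relative performance of 8c/16t vs 16c/16t,
# approximately matching the "Relative performance" rows above (up to rounding).

def perf_8c16t(threads):
    """8 cores / 16 threads: full cores first, SMT siblings add +25% each."""
    full = min(threads, 8)
    smt = max(threads - 8, 0)
    return full + 0.25 * smt

def perf_16c16t(threads, clock_ghz, ref_ghz):
    """16 physical cores, no SMT, scaled by the clock ratio."""
    return threads * clock_ghz / ref_ghz

for ghz in (4.2, 3.3, 2.8):
    row = [round(perf_8c16t(t) / perf_16c16t(t, 2.8, ghz), 2)
           for t in (1, 2, 4, 6, 8, 10, 12, 14, 16)]
    print(f"8c/16t @ {ghz} GHz vs 16c/16t @ 2.8 GHz:", row)
```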

Let's write down what you ignored in your previous estimates:
  • in low-threaded workloads, performance per thread for an SMT core is 100%, which can make the 8c/16t chip up to 50% faster at 4.2 GHz vs 16c/16t @ 2.8 GHz
  • only after the workload exceeds 8T will SMT cores drop down to 62.5%, and only for the cores running 2 threads at once (the others still run at 100%)
  • the 16c/16t CPU gains ground gradually with thread count; even when the 8c/16t is clocked at 3.3 GHz it takes at least 10T to obtain performance parity
  • even with clock parity, the two CPUs will exhibit equal performance for 8T workloads and below
Care to guess what happens in consumer workloads?
 

Nothingness

Huzzah, I found some SPEC2006 data for dual EPYC 7601 from Anandtech! It's not a perfect complement to Andrei's testing, but it's close. Ish.




Going to have to get a bit creative, since the only direct data we see here is how a 2P 7601 scales from 1T->128T. But! They gave us SMT scaling for 403.gcc when running the MT bench on a single core (2T). It's +19% for SMT being active.

So we can plausibly get the 64T score (SMT off) by reducing the 128T score given on page 16. The 128T score is 1400, so without SMT it should be ~1176. The ST score for 403.gcc on a 2P 7601 is 35.1. Going from 1T->64T gives us a scaling factor of 0.52. That's quite a bit better than Graviton2.
I prefer this comparison to the ones you made before, thanks :) But I will make two objections:
  1. I'm not sure the SMT scaling can easily be deduced since as noted in the article starting at 88 threads the CPU load reaches 100%. So SMT might bring less than 19%. As you write below we'd need SMT off.
  2. The two sockets mean twice the memory bandwidth of a single socket (and no NUMA effect as previously discussed since no memory is shared).
As you can note, this goes in two opposing directions.

Alternatively, we can try using the m5a numbers from Andrei's article (looks like the m5a is 1x EPYC 7571 with SMT enabled for 64t). If we use that data instead, it looks like EPYC is scaling from 1t->64t with a factor of .35 (actually .34557 but whatever). That's more in line with Graviton2's .32, though that still doesn't make Graviton2 look very good when you consider that the m5a instance has half the core count.
And that's still no 7742.

If all we're looking at is multicore scaling, it would probably be better to take SMT off the table, but that's just my opinion . . .
And I 100% agree.

In fact, a 7742 run with no SMT and with compilers as close as possible (not some peak score obtained with a vendor compiler) would be ideal. But alas, I couldn't find any such public data (I can get internal data for Rome/Naples/various Xeons with/without SMT, and I guess most people in the processor industry have that kind of data, but obviously I can't share it).

To educate. We aren't here to be right or wrong. We're here to learn. Or at least, that was my assumption.
That's exactly what I'm trying to do: educate you so that you understand why your methodology was flawed. It matters much more than providing data, don't you agree? And I am glad I insisted, because that forced you to look for more data, data that I (and I hope you too) think is better :D

And I've repeatedly proven I accept being wrong with no hesitation.
 

Nothingness

Going from 8c/8t to 8c/16t gives you a performance boost of 67% (8c/8t = 1.00, 8c/16t = 1.67) based on my calculations and those of others in highly-threaded / highly-scalable benchmarks. I don't understand why you keep quoting 25%.
The link @DrMrLordX posted shows much lower SMT speedups for 7601 ranging from 7% up to 51% for an average of 28%.

Yeah I know that's SPECrate but still :D
 

coercitiv

The link @DrMrLordX posted shows much lower SMT speedups for 7601 ranging from 7% up to 51% for an average of 28%.

Yeah I know that's SPECrate but still :D
In all seriousness, look at Marvell's claims from my previous post. It shows how you can both be right at the same time, because workloads are not created equal. (see Integer vs. Database)
 

amrnuke

The link @DrMrLordX posted shows much lower SMT speedups for 7601 ranging from 7% up to 51% for an average of 28%.

Yeah I know that's SPECrate but still :D
However SMT speed-ups for workloads including Blender, CBR20, and other highly threaded benchmarks were higher in the 3900X SMT on vs off test.
There are so many variations. But I think Blender, CBR20, 7-zip, MySQL reflect more appropriate real-world scenarios of multi-thread usage than does SPECrate.
In any case, it seems like scenarios where SMT doesn't create benefit are also scenarios where added physical cores wouldn't benefit.
 

amrnuke

We know that depending on the benchmark, performance gains from additional cores will produce varying results.

A hypothesis: If SMT is perfect, then the addition of SMT to a chip will result in the same performance as a chip with twice the cores, but no SMT. As a corollary, we can estimate the benefit of the addition of SMT vs the addition of real cores by comparing the increase in performance seen by each method of reaching a specific number of "expected" cores (whether via SMT or addition of physical cores).

There are minor variances in clock speed that can be accounted for; fortunately, we have excellent comparators with which to test this hypothesis. This Techpowerup benchmark run provides us with three results in particular that are of high utility. We have the 3600X (1 IOD + 1 chiplet with 2 disabled cores, with SMT on), and the 3900X (1 IOD + 2 chiplets with a total of 4 disabled cores, with SMT off and on). Since both CPUs will see the same constraints from an I/O standpoint, the main differences between the two when comparing 6C/12T to 12C/12T would be 1) the inefficiencies of SMT vs more physical cores, 2) the inefficiencies of inter-chiplet communication vs on-chiplet SMT, and 3) the doubling of L3$ per thread on the 3900X SMT-off compared to the 3600X. We can account for the 3900X's 0.2 GHz speed benefit by docking the 3900X result by 4.35%.

In these benchmarks we can compare SMT vs real cores (3600X SMT vs 3900X no SMT), and we can also see the benefit of doubling cores (3600X SMT vs 3900X SMT). I think both pieces of data would be interesting. We can also compare the benefit of adding SMT (3900X no SMT vs 3900X SMT) and compare that to the other results.

The results are interesting; the data is posted at the end (note that for perfect scalers I lowered the threshold from 80% to 75% but forgot to change the text in the graphic below). I'll let others digest them further. But here are my take-aways.

Perfect Scalers
After adjusting for clock speeds, there are several benchmarks which scaled almost perfectly up with core count, with 6c/12t -> 12c/24t scores increasing by at least 75% -- wPrime, CBR20 MT, Blender, Corona, Keyshot, 7-zip decompress.

On those tests:

The 3900X with SMT off (12 threads) had on average a 27.994% performance benefit over a 3600X with SMT on (12 threads). The 3600X acts like a 3900X with SMT off that has 8.64 cores. As a result, this estimates that a 3600X with SMT on performs about 44% better than it would with SMT off (8.64 core equivalents / 6 "real" cores = 1.44).

In another test, enabling SMT on the 3900X resulted in, on average, a 45% improvement in performance, meaning a 3900X with SMT on acted like it had 17.4 real cores. However, we know scaling wasn't perfect: on these tests, doubling cores only resulted in an 84.795% increase in performance. If we normalize the SMT thread-doubling performance increase to account for the imperfect core-doubling performance increase, you get 45% / 84.8% = 53% performance increase.
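As a quick sanity check, here is that normalization in code form, using only the averages quoted in this post:

```python
# Normalization step for the "perfect scaler" subset, using the averages
# quoted in this post (clock-adjusted), not new measurements.
smt_gain_3900x = 0.45         # 3900X 12c/24t vs 12c/12t
core_doubling_gain = 0.84795  # 3900X SMT on vs 3600X SMT on (2x the cores)

# Doubling physical cores only bought ~85%, so credit SMT against that
# imperfect baseline rather than against ideal 100% scaling.
normalized_smt_gain = smt_gain_3900x / core_doubling_gain
print(f"{normalized_smt_gain:.0%}")  # ~53%
```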

All Pertinent Tests (excluding the VERY poor scalers)
We will exclude benchmarks that are poorly-threaded. If a benchmark sees <10% benefit from the doubling of cores (that is, comparing 3600X to 3900X, both with SMT on, performance difference is <10%), when we account for clock speed (4.35%) the difference falls too close to the margin of error for the test. Further, it means the test is likely designed for single or few cores/threads, poorly scalable to multiple cores/threads, constrained by non-processor limits, or some other limitation. In other words, they introduce unnecessary noise to the data or are just poor tests for multi-core performance. This removes SuperPi, CBR20 ST, Octane, Kraken, WebXPRT, Word, PowerPoint, Excel, Photoshop, Premier, Zephyr, VMWare, VeraCrypt, and Lame. Sorry, but in all but one of those, the 3900X when normalized for clocks performs WORSE than a 3600X with fewer cores, which is absurd and makes them functionally useless for our purposes.

12c/24t benefit over 6c/12t (doubling cores and threads) - average 52.9%

12c/12t benefit over 6c/12t (achieving 12 threads via SMT or via real cores) - real cores on average 23.85% better (the 3600X acts like a 3900X/SMT off that only has 9.14 cores). This would estimate a 52% performance increase for SMT.

SMT on vs SMT off on 3900X results in 23.44% increase in performance, normalized to the fact that these are not well-scaling tests overall, divide 23.44% / 52.9% = 44.3% benefit of SMT.

All Tests, even the ones that don't scale well
Even if we INCLUDE the poor scalers, SMT benefit is 42.644% (3900X SMT on vs 3900X SMT off), when normalized for overall benefit of adding real cores (11.968% benefit of 12c/24t over 12c/12t, divided by 28.065% benefit of 12c/24t over 6c/12t).

Conclusion
Overall it appears we should expect a 42 to 53% performance increase when SMT is enabled vs not. That is, if a chip has performance of 1.00 with SMT off, it would have performance of 1.42 - 1.53 with SMT on. What's amazing to me is that however you slice it, SMT produces a fairly tight window of improvement in performance. This is in accordance with the scientific literature, for example this paper citing 30-70% increase.

What I Really Want
While I think this is pretty clear and is consistent with the literature, I still want more data, because... I'm bored due to social distancing and shelter-in-place orders. I want benchmarks with 3600X matched to 3900X (6 cores per chiplet), 3700X matched to 3950X (8 cores per chiplet), and also include 3960X, 3970X, and 3990X --- all with SMT on and off. Heck, give me a 7742 2P with SMT on and off! This would allow an incredibly comprehensive evaluation of SMT implementation/benefit on AMD's Zen2 chiplet. And I'd love to compare it to Intel's implementation as well. And even compare it to Zen and Zen+. Give me all the data!

Edit: Interesting points after looking at it some more - 3:11 PM CST
Several tests when going from 6c/12t to 12c/24t had a double digit benefit, but when going from 12c/12t to 12c/24t had a far lower benefit. Such tests include UE4, VS C++, Euler3D, DigiCortex, and x265. These application-specific limits confound the data. We would need SMT on/off tests on 3600X, 3700X, and ideally the HEDT chips to replicate this and confirm that those tests would be poor benchmarks to use in future analysis of high-core-count processors. As it stands, I'm not sure why there would be such small benefits unless the tests are heavily skewed against SMT and benefit from physical cores (do they saturate the front-end even with only 1 thread per core? are they just very efficient tests in that they produce almost no pipeline downtime, such that there is no real place for a second thread to operate?). I would like to confirm this, not sure if anyone else could comment on it.


(Attached image: 1585765978248.png — the results data referenced in this post.)
 

Richie Rich

Let's write down what you ignored in your previous estimates:
  • in low-threaded workloads, performance per thread for an SMT core is 100%, which can make the 8c/16t chip up to 50% faster at 4.2 GHz vs 16c/16t @ 2.8 GHz
  • only after the workload exceeds 8T will SMT cores drop down to 62.5%, and only for the cores running 2 threads at once (the others still run at 100%)
  • the 16c/16t CPU gains ground gradually with thread count; even when the 8c/16t is clocked at 3.3 GHz it takes at least 10T to obtain performance parity
  • even with clock parity, the two CPUs will exhibit equal performance for 8T workloads and below
Care to guess what happens in consumer workloads?
I agree with your calculation now. You forgot that A77 has 8% higher IPC than Zen2 but nothing's perfect.
I agree with your conclusion that Renoir is faster at low thread and ST performance, no doubt about that. But honestly I always talked about MT performance. So why the rage? Just cool down, man. If you want a real ST champion then you can recalculate the table against Apple's A13 Lightning core. But we know that the A13 beats a Ryzen 3950X at 4.7 GHz, so it would win everywhere :)

The biggest advantage of such a low frequency 16c A77 CPU would be:
  • low power consumption: just lowering the frequency from 4.2 -> 2.8 GHz you get about 3.3x lower consumption (Renoir from 60W -> 18W). But that 16-core A77 would consume around 25W at higher performance than the 60W Renoir.
  • smaller die size: an A77 core is 1.4mm2 .... 2x A77 = 2.8mm2 .... compared to Zen2's 3.6mm2 (A77 would save 23% of the die size)


That's not a bad result for a CPU that any Chinese or Taiwanese company can license and build. That's something you can't do in the x86 yard. It's especially impressive when you consider that two years ago ARM had only the super slow 2xALU A75 core (or the lazy A72 in the Raspberry Pi 4 and Graviton1), and now they have a pretty wide A77 machine, wider than Zen2 and with higher IPC than Zen2. If somebody had told me that two years ago I wouldn't have believed them. Some cheap ARM cores can put almighty x86 into trouble in terms of IPC? WTF? However it's happening and Graviton2 is just beginning. And that's why x86 has to worry about Graviton3 (based on upcoming A78) and Graviton4 (probably based on ARMv9 and SVE2 2048-bit instruction set).
 

amrnuke

I agree with your calculation now. You forgot that A77 has 8% higher IPC than Zen2 but nothing's perfect.
I agree with your conclusion that Renoir is faster at low thread and ST performance, no doubt about that. But honestly I always talked about MT performance.
A77 has no 8+ core parts available and no verifiable MT benchmarks. Period.

So why the rage? Just cool down, man. If you want a real ST champion then you can recalculate the table against Apple's A13 Lightning core. But we know that the A13 beats a Ryzen 3950X at 4.7 GHz, so it would win everywhere :)
No. Just no. You keep making these assertions - "A13 [...] would win everywhere" - you have ZERO data to conclude that a 16 core/32 thread or 32 core A13 would compete in multi-threaded tasks with 3950X.

The biggest advantage of such a low frequency 16c A77 CPU would be:
  • low power consumption: just lowering the frequency from 4.2 -> 2.8 GHz you get about 3.3x lower consumption (Renoir from 60W -> 18W). But that 16-core A77 would consume around 25W at higher performance than the 60W Renoir.
  • smaller die size: an A77 core is 1.4mm2 .... 2x A77 = 2.8mm2 .... compared to Zen2's 3.6mm2 (A77 would save 23% of the die size)
Agree. A77 could be great for low power consumption and size. But Zen2-3-4 will for the foreseeable future win in pure performance per socket.

That's not a bad result for a CPU that any Chinese or Taiwanese company can license and build.
Wow, if it's so easy, why haven't they?!

Some cheap ARM cores can put almighty x86 into trouble in terms of IPC? WTF?
When the design is built for IPC, it will be good at IPC. When it's built for parallelism, it'll be good at parallelism. When it has to do both, there will be compromises. How is this not apparent to you?

However it's happening and Graviton2 is just beginning. And that's why x86 has to worry about Graviton3 (based on upcoming A78) and Graviton4 (probably based on ARMv9 and SVE2 2048-bit instruction set).
Again, no parts. When Graviton3 comes out it has to compete not with Zen3, but Zen4. You forgot that x86 is advancing as well.

Yet again, this is a LOT of wet dreaming. I'm excited to see what ARM puts out. But you're trying to turn this into a religious/political type discussion where one side is right and the other is wrong and it's pure black and white, when we have NEXT TO NO PROOF of anything you are saying about MT performance. In fact, all we have for A77 ST performance is what ARM is claiming, not what has been verified. So, like, zero proof about anything having to do with A77.

I'll wait and see. I don't really care who wins. But I do care when you post information that is speculative based on manufacturer claims. If that's what we went by, we'd have Zen4 at 5.5 GHz with single thread performance doubling that of A77 at the same power draw.
 

DrMrLordX

I'm not sure the SMT scaling can easily be deduced since as noted in the article starting at 88 threads the CPU load reaches 100%. So SMT might bring less than 19%. As you write below we'd need SMT off.

True! But if that's the case, the 64c SMT off score would be higher than the one I estimated, making Naples's multicore scaling factor better than the .52 from my calculations. So I'm comfortable erring in a manner that actually weakens my case.

The two sockets mean twice the memory bandwidth of a single socket (and no NUMA effect as previously discussed since no memory is shared)


Anandtech's review of the 2P 7601 with 16xDDR4-2400 shows maximum rated memory bandwidth of 207 GB/s. Anandtech's review of Graviton2 has this to say about memory bandwidth:

The Arm chip is quite impressive, and we only seemingly needed 8 CPU cores to saturate the write bandwidth of the system, and only 16 cores for the read bandwidth, with the highest figure reaching about 190GB/s, near the theoretical 204GB/s peak of the system, and this is only using scalar 64B accesses. Very impressive.

The bandwidth advantage isn't that large for 2P EPYC 7601, despite having twice as many memory banks.

As you can note, this goes in two opposing directions.

True.

And that's still no 7742.

Indeed! I might be able to dig up some SPEC data on the 7742 for comparison. The 7571 is also a weird chip. It's 32c like the 7601, but it appears to have lower ST boost but similar MT clocks? If you compare the 2P 7601 scores from Anandtech's older article to the 1P 7571 (m5a) scores from the Graviton2 article, the MT score for the 7571 is exactly half of the 2P 7601 score, but the m5a ST score is lower than that of the 7601. That definitely changes some of the scaling calculations.

It matters much more than providing data, don't you agree?

Data with rebuttal is nice though. Helps dig through cruft faster.
 

Nothingness

However SMT speed-ups for workloads including Blender, CBR20, and other highly threaded benchmarks were higher in the 3900X SMT on vs off test.
There are so many variations. But I think Blender, CBR20, 7-zip, MySQL reflect more appropriate real-world scenarios of multi-thread usage than does SPECrate.
In any case, it seems like scenarios where SMT doesn't create benefit are also scenarios where added physical cores wouldn't benefit.
Yeah, SPECrate doesn't tell you a lot when considered alone, in particular when saturating the CPU. But in that case going from one to two physical cores would double the score, while doing it with SMT gains much less than that.

So I think you can't just say those results can be excluded. There is a whole category of software that doesn't benefit from SMT: highly tuned software. More generally, any piece of software that's been carefully optimized and is computation bound won't benefit from SMT at all, even if its parallel speedup is perfect.

Here is an example of a study about this (even if it's old, the conclusions are still relevant): An Empirical Study of Hyper-Threading in High Performance Computing Clusters

My point is that saying all parallelizable software will benefit a lot from SMT is simply not true. But of course most software isn't saturating the hardware resources of a core, so there's room for SMT to bring some speedup, and sometimes quite nice speedups. OTOH I will always prefer 8 physical cores to 4 SMT2 cores (but that's another discussion :)).
 

Nothingness

Conclusion
Overall it appears we should expect a 42 to 53% performance increase when SMT is enabled vs not. That is, if a chip has performance of 1.00 with SMT off, it would have performance of 1.42 - 1.53 with SMT on. What's amazing to me is that however you slice it, SMT produces a fairly tight window of improvement in performance. This is in accordance with the scientific literature, for example this paper citing 30-70% increase.

What I Really Want
While I think this is pretty clear and is consistent with the literature, I still want more data, because... I'm bored due to social distancing and shelter-in-place orders.
Beware of confirmation bias; I fall into that trap too often :D Try this exercise: try to find literature that contradicts your point. I always do that when I'm convinced something is true, and I often find I'm just wrong and that reality is much more complex.

(Sorry if this is obvious and you already did so, I'm confined but don't have a lot of time available to do that myself on this subject.)

Edit: Interesting points after looking at it some more - 3:11 PM CST
Several tests when going from 6c/12t to 12c/24t had a double digit benefit, but when going from 12c/12t to 12c/24t had a far lower benefit. Such tests include UE4, VS C++, Euler3D, DigiCortex, and x265. These application-specific limits confound the data. We would need SMT on/off tests on 3600X, 3700X, and ideally the HEDT chips to replicate this and confirm that those tests would be poor benchmarks to use in future analysis of high-core-count processors. As it stands, I'm not sure why there would be such small benefits unless the tests are heavily skewed against SMT and benefit from physical cores (do they saturate the front-end even with only 1 thread per core? are they just very efficient tests in that they produce almost no pipeline downtime, such that there is no real place for a second thread to operate?). I would like to confirm this, not sure if anyone else could comment on it.
It's not a question of "being skewed against SMT"; a single thread can completely saturate a core. For instance, I know of highly tuned parallelized software that can saturate FP resources and hence doesn't benefit from SMT but still benefits from more cores.

As an example, Intel MKL is configured by default not to use SMT.
The key points:


By default, Intel MKL uses the number of OpenMP threads equal to the number of physical cores on the system


Hyper-Threading Technology (HT Technology) is especially effective when each thread is performing different types of operations and when there are under-utilized resources on the processor. Intel MKL fits neither of these criteria as the threaded portions of the library execute at high efficiencies using most of the available resources and perform identical operations on each thread. You may obtain higher performance when using Intel MKL without HT Technology enabled.
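For illustration, a small sketch of the practice those notes describe: pin the thread count to physical cores before the numerical library starts up. The psutil dependency and the matrix sizes are assumptions for the example, not something the MKL documentation prescribes.

```python
# One worker thread per physical core, set before the numerical library
# initializes its thread pool. psutil and the matrix sizes are assumptions
# for illustration, not something the MKL notes above prescribe.
import os
import psutil

physical_cores = psutil.cpu_count(logical=False)
os.environ["MKL_NUM_THREADS"] = str(physical_cores)
os.environ["OMP_NUM_THREADS"] = str(physical_cores)

import numpy as np  # imported after the env vars so the settings take effect

a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)
c = a @ b  # dense GEMM: compute-bound, typically gains little from SMT
```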
 

Richie Rich

SMT on vs SMT off on 3900X results in 23.44% increase in performance, normalized to the fact that these are not well-scaling tests overall, divide 23.44% / 52.9% = 44.3% benefit of SMT.


The last column with the 23.4% SMT benefit finally looks reasonable. However I still don't understand why you try to rape this measured number by some crazy 52.9%. If you want to eliminate Amdahl's scaling penalty to get the pure SMT benefit, then the cleanest way is to measure it by running multiple ST instances. That's why @Nothingness mentioned the SPECrate 28% SMT benefit.

To sum up: I was right from the beginning with my 25%, wasn't I?
 

coercitiv

However I still don't understand why you try to rape this measured number by some crazy 52.9%.
Because he compared 12c/12t versus 6c/12t and showed the obvious: workloads don't scale with more cores as expected.

Assume SMT scaling is 25% and core scaling is 100%
1c/2t - resulting performance is 125%
2c/2t - resulting performance is 200%
--> relative performance between 2c/2t and 1c/2t: 1.6X

However, in the real world the relative performance between 12c/12t and 6c/12t is only 1.24X, meaning either:
Option A - SMT gains are actually 60%
Option B - workloads don't scale linearly with cores, hence SMT gains should be adjusted to account for that (that's what you called "raping")

In other words, when you evaluate SMT gains going from 12c/12t to 12c/24t, you're actually applying diminishing returns on SMT by increasing thread count. You could argue that is fair game, as the purpose of SMT is to increase thread count, but then again you would have to admit that all your other performance estimates were built using perfect scaling for physical cores, which is simply inaccurate.
 

DrMrLordX

In other words, when you evaluate SMT gains going from 12c/12t to 12c/24t, you're actually applying diminishing returns on SMT by increasing thread count.

That's why I really appreciated reviews like Anandtech's EPYC 7601 review where they showed the effects of SMT on a single-core basis. It is a bit academic since in "real world" benchmarks, Amdahl's law is going to punish you for adding SMT, but it's also going to punish you for adding more cores.
 