While this may look true and completely valid at first sight, one immediately asks the next logical question: why stop at a 25% throughput increase when the small cores are basically "lost in the chip area noise"? Why not go for 50% or even 75%?
Sounds to me like SYMMETRY may be much more important than you think.
It depends on the task, doesn't it?
A GPU or an NPU is indeed the logical extreme of a "sea of small, very simple cores", and both have many uses... Things like the original Broadcom Vulcan were another version of the idea, this time targeting simple (rather than very simple) cores, but again aimed at a particular throughput market, in this case networking.
There are then two questions:
- are there enough use cases for a sea of small simple (but not VERY SIMPLE) cores to justify their production? The jury is still out on this.
We've obviously seen Intel try to sell things like this (Centerton and Avoton), along with the first round of ARM32 servers.
They weren't great successes but I don't consider that dispositive.
There have been two problems so far:
+ companies have crippled these chips to prevent them from being too competitive with their expensive chips (in things like memory bandwidth). It's notable that when these devices (GPU, NPU, even Vulcan) aren't believed to be a threat to the expensive products, they miraculously pick up an astonishingly performant throughput-optimized memory system...
+ the companies that would be using these chips are on the same treadmill as everyone else; they have their hands full simply trying to keep pace with new ideas for the product, with security threats, with new large core SoCs. Refactoring the stack to target a sea of small cores is the kind of neat idea that might form the basis of a PhD thesis, but right now it's lower priority than everything else going on.
Even things like GPUs (or aggressive use of AVX512 for text processing), which can at least target existing hardware, are lagging far behind where they should be. Change takes time.
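To be concrete about what "AVX512 for text processing" buys you, here's a minimal sketch (my own illustration, not anyone's shipping code) that counts newlines 64 bytes at a time using AVX-512BW intrinsics; the function name and structure are just for the example:

#include <immintrin.h>
#include <stddef.h>

/* Count '\n' bytes in buf[0..len). Assumes AVX-512BW is available;
 * a real version would dispatch on CPUID.
 * Compile with e.g. gcc -O2 -mavx512bw -mpopcnt */
size_t count_newlines(const char *buf, size_t len)
{
    size_t count = 0, i = 0;
    const __m512i nl = _mm512_set1_epi8('\n');

    /* One unaligned 64-byte load plus one compare per iteration;
     * the compare yields a 64-bit mask, one bit per byte. */
    for (; i + 64 <= len; i += 64) {
        __m512i chunk = _mm512_loadu_si512((const void *)(buf + i));
        __mmask64 hits = _mm512_cmpeq_epi8_mask(chunk, nl);
        count += (size_t)_mm_popcnt_u64(hits);
    }

    /* Scalar tail for the last <64 bytes. */
    for (; i < len; i++)
        count += (buf[i] == '\n');

    return count;
}

Nothing clever, but it shows why the memory system, not the core, tends to become the bottleneck for this class of workload, which loops back to the crippling point above.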
I'd be curious to see if Amazon, in particular, has a second team looking at whether there'd be value in an alternative Graviton instance consisting of, say, 256 Cortex-A35-class cores, targeting massive-throughput text-streaming type apps.
- the second question is the value of optionality. Optionality (being able to toggle between a latency-optimized vs throughput-optimized SoC, or a latency-optimized vs energy-optimized SoC) is useful in personal devices (phones, PCs, ...) that do a dozen different things in a day. It's much less obviously useful in server applications, where you could just run each microtask on a SoC with the appropriate capabilities.
This is another reason I question the value of SMT4 on ThunderX3 (unless they're primarily targeting a license runaround, like IBM...); that SMT4 optionality is just not worth much.