Discussion AWS Graviton2 64 vCPU Arm CPU Heightens War of Intel Betrayal

Status
Not open for further replies.

Richie Rich

Senior member
Jul 28, 2019
470
229
76
@everybody: bla bla bla .... Apple will never move into servers .... bla bla bla...
Reality: ex-Apple architects start up NUVIA

@everybody: bla bla bla .... Cortex cores cannot scale up, and if they did they would consume the same power as x86 .... bla bla bla...
Reality: Graviton2

There will be much more crying, sobbing and whining from guys like Thunder57, lobs, coercitive etc. in 2021 when Ampere Mystique and Graviton3 based on A77 will come. Yeah, prepare your tissues boys because I'm gonna pull out some older posts of yours. This is gonna be fun :D
Just because the majority agrees on a certain opinion does not mean that they are right. It could just mean they are sheep....



Member callouts not allowed.
Dial back your rhetoric. It is not productive here.


esquared
Anandtech Forum Director
 
Last edited by a moderator:

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
@everybody: bla bla bla .... Apple will never move into servers .... bla bla bla...
Reality: ex-Apple architects start up NUVIA

Nuvia isn't Apple. Plus lawsuits. Apple is busy trying to bury Nuvia, not support them. But hey, go right on with that tangent. Actually, don't.

Reality: Graviton2

Then why waste your time with Apple's cores in a thread about Graviton2? Honestly. And it (Graviton2) still hasn't benched against Rome. Nor has anything from Ampere. Or ThunderX3. I haven't seen Kunpeng benched against it either. Or A64FX!

There will be much more crying, sobbing and whining from guys

Again there is no crying, whining, or sobbing. People are tired of repetitive posting. The only one crying here is you.
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
There will be much more crying, sobbing and whining from guys like Thunder57, lobs, coercitive etc. in 2021 when Ampere Mystique and Graviton3 based on A77 will come. Yeah, prepare your tissues boys because I'm gonna pull out some older posts of yours. This is gonna be fun :D
Just because the majority agrees on a certain opinion does not mean that they are right. It could just mean they are sheep....
I think you should seek psychiatric help.
 

RetroZombie

Senior member
Nov 5, 2019
464
386
96
I hope somebody will prove me wrong but nobody delivers single digit.
If you can point to a review of one of those new Arm server chips (it must already be available on the market) versus what AMD already has on the market (released more recently), what Intel has on the market (2S, 4S, 8S), and what IBM PowerPC already has available, I will be happy to discuss those digits with you.

Remember, the product must exist and must be available.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Graviton3 based on A77 will come
More likely to be A78/Hercules based, given the lack of an N2 announcement last month to match the N1 announcement in February last year. I think N2 will be the A78 server superset core.

A78 is also the next target for the high-end 'AE' core variant going by LinkedIn postings; possibly we might see a successor to A65/E1 at the same time.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
[Posted in the ARM thread, asked by others to post here. Some modifications made.]

As Graviton2 has shown, it's possible for Arm to be applied to a server chip and produce results that seem somewhat competitive. Since they couldn't put Graviton2 up against Rome directly (yet), I went ahead and compared specint2006 published results, and yes, I know there are a lot of caveats. But it doesn't really look terribly competitive.

On the surface, the IPC looks great in the specint scores. So I went ahead and compared to 7742 published results.

Graviton2 is gcc9.2, 7742 is gcc8.3


Single-thread
IPC is what everyone clamors about with Graviton/Arm. Let's compare. This was easy because we had direct-comparison specint2006 scores.

specint2006, single-thread, normalized to GHz
Graviton2 +14% over Rome

specint2006, single-thread, raw
Rome +16.2% over Graviton2

It appears that there is likely an IPC advantage for Arm in those given setups, which doesn't matter much since Rome can clock higher, giving it a 16% lead in raw single-threaded specint2006 score.


Multi-thread
This was harder, not entirely apples to apples, which I'll discuss below.

Compare:
(1) Graviton2 specint2006 MT 64 vCPU rate scores
(2) Rome 7742 64 cores in a 2P system (128 cores) and divide by 2 to normalize to 64 cores
(3) 7601 32 cores in a 2P system (total 64 cores)

Now, we can talk about Graviton2 not being SMT-enabled, hence limiting its performance (it does, but wouldn't make a difference against 7742). Or about whether using 7742 in 2P and dividing by 2 is fair. However, if anything, I think it hampers EPYC performance to have their chips in this comparison in a 2P configuration.

Here are the results, in brief:

specint2006, multi-thread, raw
Graviton2 - 100%
7601 - 115%
7742 - 125% ((( edited from previous, I made a formula error in Excel )))


Granted, none of this is apples to apples yet until they do a direct head-to-head comparison. Arm is doing something, but it's still many steps behind in the server market in my estimation. It's slower than 7601 in 2P (32 x 2 = 64 cores) and 7742 in 2P / 2.

So the brand new 2020 Graviton2 is 80% the speed of 2019's 7742. And AMD still have Milan to release this year.
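
The normalization used above (halve a 2P result, then express one chip relative to another) can be sketched in a few lines. This is only an illustration of the arithmetic: the function names are made up here, the inputs are the relative percentages from this post rather than raw SPEC scores, and the 250 figure fed to `per_socket` is a hypothetical 2P value chosen so that halving it gives the 125% listed above.

```python
def per_socket(score_2p: float) -> float:
    """Approximate a single-socket score by halving a 2P result (assumes perfect scaling)."""
    return score_2p / 2.0


def rebase(score: float, baseline: float) -> float:
    """Express `score` as a fraction of `baseline`."""
    return score / baseline


# Relative MT results, with Graviton2 as the 100% baseline.
graviton2 = 100.0
epyc_7742_2p = 250.0  # hypothetical 2P figure, for illustration only
epyc_7742 = per_socket(epyc_7742_2p)

graviton2_vs_7742 = rebase(graviton2, epyc_7742)
print(f"7742 per socket: {epyc_7742:.0f}%")
print(f"Graviton2 is {graviton2_vs_7742:.0%} of the 7742")
```

Halving a 2P score assumes the second socket scales perfectly, which is the caveat already raised about this comparison.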
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
@everybody: bla bla bla .... Apple will never move into servers .... bla bla bla...
Reality: ex-Apple architects start up NUVIA

Has the thought ever occurred to you that these ex-Apple people made their own company because Apple did not want to move their ARM cores into laptops and servers? I mean, if Apple had such a project, there would be no need for them to leave and make their own company.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
specint2006, single-thread, normalized to GHz
Graviton2 +14% over Rome

specint2006, single-thread, raw
Rome +16.2% over Graviton2

It appears that there is likely an IPC advantage for Arm in those given setups, which doesn't matter much since Rome can clock higher, giving it a 16% lead in raw single-threaded specint2006 score.
Graviton2 is beating Zen2 Rome in IPC by 14%, nice.
Rome's ST clock is about 32% higher (1.14 x 1.16 = 1.32; 1.32 x 2.5 GHz = 3.3 GHz, which matches Rome's ST turbo clock). That leads to 2.3x higher power consumption than Rome at 2.5 GHz (and approx. 5x compared to Graviton2, which is a big price for turbo clock).
In other words, the Rome system consumes 5x more electricity for just a 16% performance advantage. Rome turbo helps in some HPC tasks, but is it good for cloud service economics? Not sure.
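
A rough sketch of the arithmetic behind the clock and power figures. The clock ratio follows directly from the two percentages. The power step is an assumption added here, not something stated in the post: it uses the common rule of thumb that dynamic power scales with f x V^2 and that voltage rises roughly linearly with frequency, so power goes as f^3. The 5x figure versus Graviton2 cannot be derived from these numbers and is left out.

```python
ipc_advantage_graviton2 = 1.14  # Graviton2 per-GHz lead over Rome (from the quoted post)
raw_advantage_rome = 1.16       # Rome raw single-thread lead (rounded from 16.2%)
base_clock_ghz = 2.5            # clock used as the baseline in the post

# raw score = per-GHz score x clock, so the clock ratio is the product of both leads.
clock_ratio = ipc_advantage_graviton2 * raw_advantage_rome
rome_st_clock_ghz = clock_ratio * base_clock_ghz

# ASSUMPTION: power ~ f * V^2 with V proportional to f, i.e. power ~ f^3.
power_ratio = clock_ratio ** 3

print(f"clock ratio: {clock_ratio:.2f}")
print(f"implied Rome ST clock: {rome_st_clock_ghz:.1f} GHz")
print(f"power ratio under cubic scaling: {power_ratio:.1f}x")
```

Real chips deviate from cubic scaling (static power, voltage floors, boost behaviour), so the last number is an order-of-magnitude check only.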


Multi-thread
specint2006, multi-thread, raw

Graviton2 - 100%
7601 - 115%
7742 - 250%


Granted, none of this is apples to apples yet until they do a direct head-to-head comparison. Arm is doing something, but it's still many steps behind in the server market in my estimation. It's slower than 7601 in 2P (32 x 2 = 64 cores) and it's so much slower than 7742 that any amount of SMT isn't going to get it over the hump.

So the brand new 2020 Graviton2 is 40% the speed of 2019's 7742. And AMD still have Milan to release this year.
That's strange, because Graviton2 was 1.4x faster than Zen1:
Graviton2 AnandTech test
I would expect 64-core Rome to be 1.5x faster than Graviton2. But 2.5x looks like too much.

Don't forget die size:
  • 64-core Rome is 1004mm2 total die size
  • 64-core Graviton is around 250-300mm2 estimated (N1/A76 is 1.4mm2 x 64 = 90mm2, 32 MB L3 cache is approx 34mm2, so 124mm2 + MEM CTRL and I/O)

So a 2P Rome system is a 2008mm2 Goliath compared to a very tiny and cheap 300mm2 piece of Graviton2 silicon (8x smaller and cheaper). Not a bad result for Graviton2 at all (while being crippled by a tiny 32MB L3).
Especially when we take into account power consumption which is typically half of x86. What was the 2P Rome TDP? 2x180W?
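
The die-size estimate above can be reproduced as follows. All inputs are the figures quoted in this post (they are estimates, not measured die sizes), and the variable names are made up for illustration.

```python
CORES = 64
core_area_mm2 = 1.4      # N1/A76 core area quoted in the post
l3_area_mm2 = 34.0       # 32 MB L3 estimate quoted in the post
rome_total_mm2 = 1004.0  # 64-core Rome total die area quoted in the post

cores_mm2 = core_area_mm2 * CORES
cores_plus_l3_mm2 = cores_mm2 + l3_area_mm2

# The post's 250-300 mm2 range for the full die (memory controllers and I/O included).
graviton2_low_mm2, graviton2_high_mm2 = 250.0, 300.0

rome_2p_mm2 = 2 * rome_total_mm2
ratio_small = rome_2p_mm2 / graviton2_high_mm2
ratio_large = rome_2p_mm2 / graviton2_low_mm2

print(f"cores: {cores_mm2:.0f} mm2, cores + L3: {cores_plus_l3_mm2:.0f} mm2")
print(f"2P Rome vs one Graviton2: {ratio_small:.1f}x to {ratio_large:.1f}x the silicon")
```

Note that this pits two Rome sockets (128 cores) against a single 64-core Graviton2, as the post does; per socket the ratio halves.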

Summary: Amazon can sell Graviton2 cloud service at a much lower price than Rome (which is expensive to buy and has higher power consumption). Let's see Amazon's Rome prices. But it looks like the x86 world cannot win in cloud services. And Zen3 won't make that better. A Graviton3 based on 128 A78 cores at 5nm, with a higher IPC advantage over Zen3, will be cheaper and also faster than 2P Milan systems. I'm afraid x86 is done.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
@amrnuke

What do you think is the reason for the poor MT scaling in Graviton2? The reduced-size L3 cache?

@Richie Rich

Seems you don't fully appreciate how badly Graviton2 fares against Rome in that comparison. More data is needed, but I wouldn't be celebrating any victories for ARM Holdings or Amazon.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
It's for a 2P system with its score cut in half.

To quote @amrnuke

Now, we can talk about Graviton2 not being SMT-enabled, hence limiting its performance (it does, but wouldn't make a difference against 7742). Or about whether using 7742 in 2P and dividing by 2 is fair. However, if anything, I think it hampers EPYC performance to have their chips in this comparison in a 2P configuration.

Something clearly goes off the rails for Graviton2 in the MT bench. Either the mesh interconnect is not so great, and/or the reduced L3 is causing problems. ARM's reference Neoverse design called for higher clocks and twice the L3 that Amazon chose for Graviton2. Even Naples beats Graviton2 in the MT test. That's 2017's server platform!
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Rome has 256 MB L3 cache (4MB per core).
Graviton has 32 MB L3 cache (0.5 MB per core).

If the tested load fits into Rome's L3 cache but not into Graviton2's, that could be the reason. What was the test load?
 

Hitman928

Diamond Member
Apr 15, 2012
5,182
7,632
136
It's for a 2P system with its score cut in half.

To quote @amrnuke



Something clearly goes off the rails for Graviton2 in the MT bench. Either the mesh interconnect is not so great, and/or the reduced L3 is causing problems. ARM's reference Neoverse design called for higher clocks and twice the L3 that Amazon chose for Graviton2. Even Naples beats Graviton2 in the MT test. That's 2017's server platform!

I don't know about the mesh interconnect but as you mentioned the small (comparatively) amount of L3 I'm sure hurts scaling. It's interesting that both the Graviton2 and Ampere chip designers opted for 32 MB of L3 instead of the 64 MB ARM suggested for high core count chips. The choice was made obviously not for performance so it's got to be either for yield or power concerns. We know the TDP of the Ampere chip at full tilt (> 200 W) so my guess would be power.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Rome has 256 MB L3 cache (4MB per core).
Graviton has 32 MB L3 cache (0.5 MB per core).

If the tested load fits into Rome's L3 cache but not into Graviton2's, that could be the reason. What was the test load?
I doubt the L3$ cache plays a big role from that standpoint.

Graviton2 has 32MB L3$ shared across 64 cores. So if the benchmark is 20MB, it fits into L3$.
EPYC gen 1 has 8MB L3$ per CCX, meaning if the benchmark is 20MB, it won't fit into ANY of the CCX's L3$.
EPYC gen 2 has 16MB L3$ per CCX, and still a 20MB benchmark won't fit into any of the CCX's L3$.
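
The three cases above boil down to comparing one working set against the largest L3 a single core can reach without leaving its cache domain. A minimal sketch, assuming a single 20MB working set and ignoring everything else (per-copy footprints in rate runs, L2 sizes, memory bandwidth); the class and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L3Domain:
    name: str
    shared_l3_mb: int  # L3 reachable by one core inside its own cache domain


def fits(domain: L3Domain, working_set_mb: int) -> bool:
    """True if the working set fits entirely in that domain's L3."""
    return working_set_mb <= domain.shared_l3_mb


domains = [
    L3Domain("Graviton2 (shared by 64 cores)", 32),
    L3Domain("EPYC gen 1 (per CCX)", 8),
    L3Domain("EPYC gen 2 (per CCX)", 16),
]

WORKING_SET_MB = 20  # hypothetical benchmark footprint from the post

for d in domains:
    verdict = "fits" if fits(d, WORKING_SET_MB) else "does not fit"
    print(f"{d.name}: {WORKING_SET_MB} MB {verdict} in {d.shared_l3_mb} MB")
```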

L3$ doesn't tell the whole story. (Especially since Graviton2 actually has double the L1I$ and quadruple the L1D$ per thread compared to the 7601, and double the L2$ to boot.)

Zen2 has some major advantages over Graviton2 that go beyond this part of the cache configuration.
 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
L3$ doesn't tell the whole story.

No, but it does aid in maintaining cache coherency. It used to be a big issue in multi-socket systems (and still is), but now I think it's going to have an impact in MT workloads handled by CPUs with massive core counts and complex interconnects.

SPECINT2006 is probably much larger than a 20MB working set. Hmm:


Yeah thought so.
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
[Posted in the ARM thread, asked by others to post here. Some modifications made.]
Thanks.

You should have linked the test you used for these computations. I guess it's this:

Single-thread
IPC is what everyone clamors about with Graviton/Arm. Let's compare. This was easy because we had direct-comparison specint2006 scores.

specint2006, single-thread, normalized to GHz
Graviton2 +14% over Rome

specint2006, single-thread, raw
Rome +16.2% over Graviton2
I get a score of 39.25 for Rome against 32.34 for Graviton so that's +21%. How did you compute?

7742 - 250%
I get a score of 2962 for the 2x64 EPYC 7742, so that's 2.6 times faster. So +30% per socket.

I get about the same advantage for 7601. Which shows, as you wrote, that this whole exercise needs a fairer comparison.

So the brand new 2020 Graviton2 is 40% the speed of 2019's 7742.
I find 70%. One of us is miscalculating (and I don't exclude it's me ;))
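
For anyone who wants to re-run the numbers in this post, here is a small sketch. The three inputs are the figures quoted above; the implied Graviton2 rate score is back-computed from the rounded 2.6x and is not a published score.

```python
rome_st = 39.25       # single-thread score quoted for Rome
graviton2_st = 32.34  # single-thread score quoted for Graviton2
st_lead = rome_st / graviton2_st - 1.0

epyc_2p_rate = 2962.0  # 2 x 64-core EPYC 7742 rate score quoted above
speedup_2p = 2.6       # quoted 2P advantage over Graviton2 (rounded)

per_socket_lead = speedup_2p / 2.0 - 1.0            # halve the 2P result
graviton2_fraction = 1.0 / (speedup_2p / 2.0)       # Graviton2 relative to one 7742
implied_graviton2_rate = epyc_2p_rate / speedup_2p  # back-computed, not published

print(f"Rome ST lead: +{st_lead:.0%}")
print(f"7742 per-socket lead: +{per_socket_lead:.0%}")
print(f"Graviton2 vs one 7742: {graviton2_fraction:.0%}")
print(f"implied Graviton2 rate score: {implied_graviton2_rate:.0f}")
```

With the rounded 2.6x as input this lands near 77%, in the same ballpark as the 70% above; the exact figure depends on the unrounded Graviton2 score.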
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
I find 70%. One of us is miscalculating (and I don't exclude it's me ;))
That 70% sounds reasonable as A76 has about 80% IPC of Zen2 in SPEC2006.
Andrei's Graviton test results:

[Image: 115106.png]

And what about throughput per dollar?
I doubt Rome can improve throughput per dollar enough to beat Graviton2. Sure Rome will get closer than Naples/Zen1 but not win.
And next year Graviton3 comes (probably at 5nm with 128-core A78 and about 128MB L3 cache) and will be faster than anything from x86 world. x86 is done in cloud servers.

[Image: perf-per-usd.png]

Note that A76 was introduced in 2018, so it's older than Zen2. A better competitor for Rome would be A77 from 2019, with 25% higher IPC. Given that Zen2 brought only a 15% IPC jump over Zen1, you can see that a server CPU based on A77 would win over Zen2 Rome with a much higher performance margin than Graviton2 has over Zen1 Naples. And A78 is coming this year with another 20% IPC jump. And then a new ARMv9 + SVE core lineup.

x86 cannot win in high core count servers because these systems are TDP limited. ARM has much better power efficiency, so they will always have higher throughput from the same TDP. This means better economy in terms of throughput per dollar.

Unless x86 improves efficiency so much that it can compete in smartphones.... and that is unrealistic.

 

DrMrLordX

Lifer
Apr 27, 2000
21,582
10,785
136
Working set and memory requirements are different things.

They are until they aren't. SPEC hasn't outlined exactly why they require that much memory. Yeah it can be a smaller working set . . . or it may not be. They may have stuff dumped into main memory just so they don't have to read it off storage during the disk (see comments on the SPEC page about pagefile performance). Unless someone here has run the bench, it's hard to know what the actual size of the working set is. I would be surprised if the MT bench could be made to fit in Rome's L3 (chiplet restrictions notwithstanding).
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
They are until they aren't. SPEC hasn't outlined exactly why they require that much memory. Yeah it can be a smaller working set . . . or it may not be. They may have stuff dumped into main memory just so they don't have to read it off storage during the disk (see comments on the SPEC page about pagefile performance). Unless someone here has run the bench, it's hard to know what is the actual size of the working set.
I gave a link to a paper with data about WSS. What more do you want? Do you want me to redo the study? :)

I would be surprised if the MT bench could be made to fit in Rome's L3 (chiplet restrictions notwithstanding).
IMHO SPEC rate is uninteresting anyway; what's the point of running the same thing on multiple cores? There's no shared data and in the end the only thing that could be an issue is main memory bandwidth which is shared by multiple cores, and only for some of the tests.

I can even think of pathological cases where running multiple instances of the same workload can give you a super linear speedup :D Though perhaps not on SPEC.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
That 70% sounds reasonable as A76 has about 80% IPC of Zen2 in SPEC2006.
Andrei's Graviton test results:

[Image: 115106.png]

And what about throughput per dollar?
I doubt Rome can improve throughput per dollar enough to beat Graviton2. Sure Rome will get closer than Naples/Zen1 but not win.
And next year Graviton3 comes (probably at 5nm with 128-core A78 and about 128MB L3 cache) and will be faster than anything from x86 world. x86 is done in cloud servers.

[Image: perf-per-usd.png]

Note that A76 was introduced in 2018, so it's older than Zen2. A better competitor for Rome would be A77 from 2019, with 25% higher IPC. Given that Zen2 brought only a 15% IPC jump over Zen1, you can see that a server CPU based on A77 would win over Zen2 Rome with a much higher performance margin than Graviton2 has over Zen1 Naples. And A78 is coming this year with another 20% IPC jump. And then a new ARMv9 + SVE core lineup.

x86 cannot win in high core count servers because these systems are TDP limited. ARM has much better power efficiency, so they will always have higher throughput from the same TDP. This means better economy in terms of throughput per dollar.

Unless x86 improves efficiency so much that it can compete in smartphones.... and that is unrealistic.
All valid points.

But throughput per socket is also a major consideration because it's not just the chip but the surrounding system that costs money. The rest of the system ain't cheap, and the less cheap the rest of the system is, the more value Graviton2 loses.

((( Below is edited )))

As a key point, the reference Daytona rack that is 2 x 7742 costs $25,000, of which about $14,000 is the 2x7742. So... the systems ain't cheap ($11,000 for the surrounding system).

Imagine a 2P Graviton2 system: even if the chips cost $2,000 each, which would be a good bargain, a similar system would cost $15,000.

2P implementation comparison
2 x Graviton -- $15,000 -- 100% speed
2 x 7742 -- $25,000 -- 125% speed

Or if you want to try to equal the 7742
10 x Graviton -- $75,000 -- 500% speed
8 x 7742 -- $100,000 -- 500% speed

Assuming Graviton2 scales to 2P platforms well, this could be great.
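
The cost comparison above can be written out as a small calculation. Everything here uses the figures from this post; the $2,000 Graviton2 chip price is the hypothetical one assumed above (Graviton2 is not sold as a standalone part), and the helper names are invented for illustration.

```python
DAYTONA_2P_TOTAL = 25_000  # 2 x 7742 reference system, from the post
EPYC_7742_PAIR = 14_000    # the two CPUs, from the post
PLATFORM = DAYTONA_2P_TOTAL - EPYC_7742_PAIR  # everything except the CPUs
GRAVITON2_CHIP = 2_000     # hypothetical price per chip, as assumed in the post


def system_cost(chip_price: int, chips_per_system: int = 2) -> int:
    """Cost of one 2P system: shared platform plus its CPUs."""
    return PLATFORM + chip_price * chips_per_system


def fleet(chip_count: int, chip_price: int, speed_per_system: float):
    """Total cost and total relative speed for chip_count chips in 2P systems."""
    systems = chip_count // 2
    return systems * system_cost(chip_price), systems * speed_per_system


graviton_cost, graviton_speed = fleet(10, GRAVITON2_CHIP, 100.0)
epyc_cost, epyc_speed = fleet(8, EPYC_7742_PAIR // 2, 125.0)

print(f"10 x Graviton2: ${graviton_cost:,} for {graviton_speed:.0f}% speed")
print(f" 8 x 7742:      ${epyc_cost:,} for {epyc_speed:.0f}% speed")
print(f"Graviton2 speed per dollar advantage: {epyc_cost / graviton_cost - 1:.0%}")
```

The result is sensitive to the platform cost: the more the rest of the system costs, the smaller the advantage from a cheap chip, which is the point made earlier in this post.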
 

Thunder 57

Platinum Member
Aug 19, 2007
2,647
3,706
136
I doubt Rome can improve throughput per dollar enough to beat Graviton2. Sure Rome will get closer than Naples/Zen1 but not win.
And next year Graviton3 comes (probably at 5nm with 128-core A78 and about 128MB L3 cache) and will be faster than anything from x86 world. x86 is done in cloud servers.

Hahaha, that's a good one. Way to jump the gun. There's still far too much in play to make that kind of bold conclusion.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Can you (or someone else that doesn't have me on ignore :D) double check your computation? Where did the 40% speed come from?
250% => 1/2.5 = 40% .... recalculated to different base (Zen2)

All valid points.

But throughput per socket is also a major consideration because it's not just the chip but the surrounding system that costs money. The rest of the system ain't cheap, and the less cheap the rest of the system is, the more value Graviton2 loses. So while Graviton2 might look like a good value, if total system cost for TWO graviton systems exceeds the cost of ONE 7742, then it is not a winning proposition globally.

As a key point, the reference Daytona rack that is 2 x 7742 costs $25,000, of which about $14,000 is the 2x7742. So... the systems ain't cheap ($11,000 for the surrounding system).

Imagine a 2 x Graviton2, even if the chips cost $2,000 each, which would be a good bargain, a similar system would cost $15,000.

2P implementation comparison
2 x Graviton -- $15,000 -- 100% speed
2 x 7742 -- $25,000 -- 250% speed

Or if you want to try to equal the 2 x 7742
4 x Graviton -- $30,000 -- 200% speed
2 x 7742 -- $25,000 -- 250% speed

Unless they're selling Graviton2 for like... well, they could give it away, and the 4 x implementation would still barely be cheaper than the 2 x 7742 that is faster. And 4 x Graviton (2 x 2P systems) will not have the energy efficiency benefit over a 2 x 7742 (1 x 2P system) that Amazon are claiming for the chip alone.

How many years will Graviton2 take to make up the speed deficit with its cost-efficiency? Until we know pricing and implementation, that's hard to say. And I'm not sure of the profit Amazon make on the contracts to lease out the instances (in other words, profit per instance and # of instances per thread per day), which is also a consideration.
I like your calculation a lot. It makes sense that Rome will be a tough competitor for Graviton2. But there are two points:
1) Graviton2 die size is about 300mm2 ... that's the price of a 7nm Vega2 GPU .... Amazon's cost per Graviton2 CPU is about $200-300? That's the magic of buying an ARM license and making your own CPU.
2) Electricity cost is also important. Especially when your CPU price is super low, most of the cost is the electricity bill.

Let's see what price Amazon will set for a Rome system. I would be surprised if it is lower than Graviton2. IMHO ARM will dominate cloud systems sooner or later. However, x86 will still win where it can use higher clocks (>4GHz), like HPC systems. But only until Nuvia CPUs arrive. That's a pretty dark future for x86.
 