Discussion AWS Graviton2 64 vCPU Arm CPU Heightens War of Intel Betrayal

Status
Not open for further replies.

Nothingness

Diamond Member
Jul 3, 2013
3,063
2,047
136
Clearly they're being run by buffoons who know nothing. /s
This isn't the place to discuss it, but clearly some Intel management people were/are buffoons, or at least they made some utterly stupid claims (and just plain lied about their process issues dating back to 14nm) and made the wrong decisions.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Or the iPhone, why aren't we comparing it with Apple A13 like we do in every serious ARM vs. x86 server discussion? /s
Because A13 Lightning core has about 4.5mm2 while Zen2 core has 3.8?mm2. Its IPC is huge, but at the cost of die size.
  • Cortex A76/N1 core used in Graviton2/Altra is about 1.4mm2 ...... compared to 3.8mm2 for Zen2
  • Cortex A77 core in next year's server products has +17% transistors (1.6mm2) while delivering +25% IPC (beating Zen2 ST IPC and almost double MT IPC when Rome uses SMT2).

Less than half the area delivers higher IPC than Zen2 while using less than half the power. What? Tell me how AMD and Intel could fight such specs in large-core-count server CPUs. I see no way, due to TDP limits. x86 can hold HPC territory thanks to higher clocks. But then again, only until Nuvia arrives with Apple's IPC.
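
A quick sanity check of the area figures quoted above, as a rough sketch. The only inputs are the approximate per-core areas from the post; the idea that area grows linearly with transistor count for A77 is an assumption made here for illustration, not something the post or Arm states.

```python
# Core areas in mm^2 as quoted in the post. They are approximate, and the
# Zen2 figure is marked as uncertain there ("3.8?mm2").
N1_AREA = 1.4
ZEN2_AREA = 3.8
A77_TRANSISTOR_GROWTH = 0.17  # "+17% transistors" from the post


def area_ratio(small: float, large: float) -> float:
    """Return small / large as a fraction."""
    return small / large


# Assumption: area scales linearly with transistor count on the same node.
a77_area = N1_AREA * (1 + A77_TRANSISTOR_GROWTH)

print(f"N1 / Zen2 area:     {area_ratio(N1_AREA, ZEN2_AREA):.0%}")
print(f"Estimated A77 area: {a77_area:.2f} mm^2")
print(f"A77 / Zen2 area:    {area_ratio(a77_area, ZEN2_AREA):.0%}")
```

On these inputs both cores come out below half the Zen2 area, and the scaled A77 estimate lands close to the 1.6mm2 quoted in the list. This says nothing about power or IPC, which need measured data.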

In some ways this cheap ARM Neoverse uarch is even more dangerous for x86 than Apple's monster uarch.
[Image: Arm Infra Tech Day 2019 slide (Filippo), Neoverse N1]
 

DrMrLordX

Lifer
Apr 27, 2000
22,027
11,607
136
Because A13 Lightning core has about 4.5mm2 while Zen2 core has 3.8?mm2.

I hope you realize you were deadpan responding to a sarcastic remark, right? Nobody in their right mind thinks that A13 belongs in any discussion about Altra or Graviton2 (except maybe in the context of "what if Apple made a server CPU" which by this point is overplayed).
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
I hope you realize you were deadpan responding to a sarcastic remark, right? Nobody in their right mind thinks that A13 belongs in any discussion about Altra or Graviton2 (except maybe in the context of "what if Apple made a server CPU" which by this point is overplayed).
Yeah, I know. I try to provoke them to get some numbers out of them. I hope somebody will prove me wrong, but nobody delivers a single digit. That's a good sign. They deliver only whining, crying, sobbing, sarcasm and some good old hatred. This helps me make a good decision about buying some stocks, especially now when everything is falling rapidly, so you can double your money in a few weeks.
 

DrMrLordX

Lifer
Apr 27, 2000
22,027
11,607
136
Yeah, I know. I try to provoke them to get some numbers out of them. I hope somebody will prove me wrong, but nobody delivers a single digit. That's a good sign. They deliver only whining, crying, sobbing, sarcasm and some good old hatred. This helps me make a good decision about buying some stocks, especially now when everything is falling rapidly, so you can double your money in a few weeks.

Nobody takes you seriously because A13 is not a server CPU. The ARM world is moving into the server business and they are not bringing Apple with them. They (the posters) are making fun of you for continuously hyping a very good mobile CPU. Not crying, sobbing, or hating.

The "numbers" are already available: Apple has yet to license any of their core designs to anyone who can or will put together a 64-or-more core design CPU. Huawei has done it, Ampere is doing it, Amazon has done it, and I'm sure I'm leaving out someone. Has Broadcom made it to 64c yet? Fujitsu is coming but their core is so niche that we may not hear much about it outside of some academic circles.

If you came to me two or more years ago and said, "A1x is a threat, Intel needs to watch out!" I might agree, assuming Apple was serious about rolling out big server CPUs based on their core designs with the proper I/O and interconnects. Apple has shown no interest. You can clearly see who are the players in the ARM server world moving forward. Sadly, we haven't yet seen any of these CPUs benched against Rome, and the benches we have for Graviton2 against Intel CPUs are kinda limited (though Cooper Lake isn't going to help Intel gain any ground).

Try to pay more attention to what is out on the market now or what is going to come out in the near future. Graviton2 is showing some promise. We may get more high-clockspeed ARM parts with many cores and capable interconnects that will do nicely in many server functions. Until then, people will continue poking fun at you over things like your sig.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Until then, people will continue poking fun at you over things like your sig.
And him adding coronavirus spread data to his sig...
He's literally trying to goad everyone into off-topic discussions. On topics he is clearly poorly-versed in (as a physician, his comments thus far are so factually incorrect as to be laugh-inducing, were it not for the severity of the condition).

So here's the real deal. Apple don't care about the server market. Period.

I would agree with him, buy Apple stock. At P/E 19.6 it's a steal. But not because ARM is some magical unicorn that's going to dominate the world. It's because Apple have so many brand-loyal customers who are locked into their ecosystem.
 

JasonLD

Senior member
Aug 22, 2017
487
447
136
Apple definitely isn't moving to their own ARM-based processors unless they are ready to make a full transition within a calendar year. The Mike Filippo hire was just last year, so at least I can see Apple is definitely trying to scale their processors beyond mobile, but I can't see that happening until 2022 at the earliest.
 

RetroZombie

Senior member
Nov 5, 2019
464
386
96
I hope somebody will prove me wrong, but nobody delivers a single digit.
If you can point to a review of one of those new Arm server chips (it must already be available on the market) versus what AMD already has on the market (released more recently), what Intel has on the market (2S, 4S, 8S), and what IBM's POWER line already has available, I will be happy to discuss those digits with you.

Remember, the product must exist and must be available.
 

soresu

Diamond Member
Dec 19, 2014
3,208
2,480
136
Graviton3 based on A77 will come
More likely to be A78/Hercules based given the lack of N2 announcement last month to match the N1 announcement in February last year - I think N2 will be the A78 server superset core.

A78 is also the next target for the high-end 'AE' core variant going by LinkedIn postings; possibly we might see a successor to A65/E1 at the same time.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
[Posted in the ARM thread, asked by others to post here. Some modifications made.]

As Graviton2 has shown, it's possible for Arm to be applied to a server chip and produce results that seem somewhat competitive. Since they couldn't put Graviton2 up against Rome directly (yet), I went ahead and compared specint2006 published results, and yes, I know there are a lot of caveats. But it doesn't really look terribly competitive.

On the surface, the IPC looks great in the specint scores. So I went ahead and compared to 7742 published results.

Graviton2 is gcc9.2, 7742 is gcc8.3


Single-thread
IPC is what everyone clamors about with Graviton/Arm. Let's compare. This was easy because we had direct-comparison specint2006 scores.

specint2006, single-thread, normalized to GHz
Graviton2 +14% over Rome

specint2006, single-thread, raw
Rome +16.2% over Graviton2

It appears that Arm likely has an IPC advantage in those setups, which doesn't matter much since Rome can clock higher, giving it a 16% lead in raw single-threaded specint2006 score.


Multi-thread
This was harder, not entirely apples to apples, which I'll discuss below.

Compare:
(1) Graviton2 specint2006 MT 64 vCPU rate scores
(2) Rome 7742 64 cores in a 2P system (128 cores) and divide by 2 to normalize to 64 cores
(3) 7601 32 cores in a 2P system (total 64 cores)

Now, we can talk about Graviton2 not being SMT-enabled, hence limiting its performance (it does, but wouldn't make a difference against 7742). Or about whether using 7742 in 2P and dividing by 2 is fair. However, if anything, I think it hampers EPYC performance to have their chips in this comparison in a 2P configuration.

Here are the results, in brief:

specint2006, multi-thread, raw
Graviton2 - 100%
7601 - 115%
7742 - 125% ((( edited from previous, I made a formula error in Excel )))


Granted, none of this is apples to apples yet until they do a direct head-to-head comparison. Arm is doing something, but it's still many steps behind in the server market in my estimation. It's slower than 7601 in 2P (32 x 2 = 64 cores) and 7742 in 2P / 2.

So the brand new 2020 Graviton2 is 80% the speed of 2019's 7742. And AMD still have Milan to release this year.
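
The relationship between "X is 25% faster" and "Y runs at 80% of X" can be checked with a few lines. This is only a sketch using the relative scores listed above (Graviton2 = 100%); the function names are made up for illustration, and the per-socket division assumes perfect scaling across sockets, the same simplification made when halving the 2P result.

```python
# Relative multi-thread scores from the post, with Graviton2 as the baseline.
RELATIVE_SCORE = {"Graviton2": 100.0, "7601": 115.0, "7742": 125.0}


def per_socket(score_2p: float, sockets: int = 2) -> float:
    """Normalize a multi-socket rate score to one socket by simple division.

    Assumes perfect scaling across sockets, which real systems only
    approximate.
    """
    return score_2p / sockets


def speed_relative_to(chip: str, reference: str) -> float:
    """Return chip's score as a fraction of the reference chip's score."""
    return RELATIVE_SCORE[chip] / RELATIVE_SCORE[reference]


for ref in ("7601", "7742"):
    frac = speed_relative_to("Graviton2", ref)
    print(f"Graviton2 runs at {frac:.0%} of the {ref}")

# 2962 is the 2P EPYC 7742 rate score quoted later in this thread.
print(f"7742 per socket: {per_socket(2962):.0f}")
```

A 25% lead for the 7742 puts Graviton2 at 80% of its speed, not 75%, which matches the figure above.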
 

beginner99

Diamond Member
Jun 2, 2009
5,231
1,605
136
@everybody: bla bla bla .... Apple will never move into servers .... bla bla bla...
Reality: ex-Apple architects start up NUVIA

Has the thought ever occurred to you that these ex-Apple people started their own company because Apple did not want to move their ARM cores into laptops and servers? I mean, if Apple had such a project, there would be no need for them to leave and start their own company.
 

DrMrLordX

Lifer
Apr 27, 2000
22,027
11,607
136
@amrnuke

What do you think is the reason for the poor MT scaling in Graviton2? The reduced-size L3 cache?

@Richie Rich

Seems you don't fully appreciate how badly Graviton2 fares against Rome in that comparison. More data is needed, but I wouldn't be celebrating any victories for ARM Holdings or Amazon.
 

DrMrLordX

Lifer
Apr 27, 2000
22,027
11,607
136
It's for a 2P system with its score cut in half.

To quote @amrnuke

Now, we can talk about Graviton2 not being SMT-enabled, hence limiting its performance (it does, but wouldn't make a difference against 7742). Or about whether using 7742 in 2P and dividing by 2 is fair. However, if anything, I think it hampers EPYC performance to have their chips in this comparison in a 2P configuration.

Something clearly goes off the rails for Graviton2 in the MT bench. Either the mesh interconnect is not so great, and/or the reduced L3 is causing problems. ARM's reference Neoverse design called for higher clocks and twice the L3 that Amazon chose for Graviton2. Even Naples beats Graviton2 in the MT test. That's 2017's server platform!
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Rome has 256 MB L3 cache (4MB per core).
Graviton has 32 MB L3 cache (0.5 MB per core).

If the tested load fits into Rome's L3 cache while not into Graviton2's that could be the reason. What was the test load?
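
The per-core figures above follow from a simple division. A minimal sketch, assuming 64 cores for both chips (the Rome core count is implied by the 4MB-per-core figure rather than stated):

```python
# Total L3 in MB as stated in the post; 64 cores each is an assumption
# implied by the per-core numbers given there.
CHIPS = {
    "Rome": {"l3_mb": 256, "cores": 64},
    "Graviton2": {"l3_mb": 32, "cores": 64},
}


def l3_per_core(l3_mb: float, cores: int) -> float:
    """Return the average L3 capacity per core in MB."""
    return l3_mb / cores


for name, chip in CHIPS.items():
    print(f"{name}: {l3_per_core(chip['l3_mb'], chip['cores'])} MB L3 per core")
```

This is an average only. It says nothing about how the cache is partitioned or which slice a given core can reach quickly.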
 

Hitman928

Diamond Member
Apr 15, 2012
6,123
10,527
136
It's for a 2P system with its score cut in half.

To quote @amrnuke



Something clearly goes off the rails for Graviton2 in the MT bench. Either the mesh interconnect is not so great, and/or the reduced L3 is causing problems. ARM's reference Neoverse design called for higher clocks and twice the L3 that Amazon chose for Graviton2. Even Naples beats Graviton2 in the MT test. That's 2017's server platform!

I don't know about the mesh interconnect but as you mentioned the small (comparatively) amount of L3 I'm sure hurts scaling. It's interesting that both the Graviton2 and Ampere chip designers opted for 32 MB of L3 instead of the 64 MB ARM suggested for high core count chips. The choice was made obviously not for performance so it's got to be either for yield or power concerns. We know the TDP of the Ampere chip at full tilt (> 200 W) so my guess would be power.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
Rome has 256 MB L3 cache (4MB per core).
Graviton has 32 MB L3 cache (0.5 MB per core).

If the tested load fits into Rome's L3 cache while not into Graviton2's that could be the reason. What was the test load?
I doubt the L3$ plays a big role from that standpoint.

Graviton2 has 32MB L3$ shared across 64 cores. So if the benchmark is 20MB, it fits into L3$.
EPYC gen 1 has 8MB L3$ per CCX, meaning if the benchmark is 20MB, it won't fit into ANY of the CCX's L3$.
EPYC gen 2 has 16MB L3$ per CCX, and still a 20MB benchmark won't fit into any of the CCX's L3$.
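
The three cases above can be written as a simple capacity check. This is a deliberately naive sketch: the 20MB working set is the hypothetical from the post, and the check ignores associativity, data belonging to other cores, and the fact that a rate run keeps one copy of the workload per core.

```python
WORKING_SET_MB = 20  # hypothetical benchmark size used in the post

# Largest L3 pool a single core can use locally, in MB, per the post.
LOCAL_L3_MB = {
    "Graviton2 (shared across 64 cores)": 32,
    "EPYC gen 1 (per CCX)": 8,
    "EPYC gen 2 (per CCX)": 16,
}


def fits_in_l3(working_set_mb: float, l3_mb: float) -> bool:
    """Naive check: does the working set fit in the given L3 capacity?"""
    return working_set_mb <= l3_mb


for name, l3_mb in LOCAL_L3_MB.items():
    verdict = "fits" if fits_in_l3(WORKING_SET_MB, l3_mb) else "does not fit"
    print(f"{name}: {WORKING_SET_MB} MB {verdict} in {l3_mb} MB")
```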

L3$ doesn't tell the whole story. (Especially since Graviton2 actually has double the L1I$ and quadruple the L1D$ per thread compared with the 7601, and double the L2$ to boot.)

Zen2 has some major advantages over Graviton2 that go beyond this part of the cache configuration.
 

DrMrLordX

Lifer
Apr 27, 2000
22,027
11,607
136
L3$ doesn't tell the whole story.

No, but it does aid in maintaining cache coherency. It used to be a big issue in multi-socket systems (and still is), but now I think it's going to have an impact in MT workloads handled by CPUs with massive core counts and complex interconnects.

SPECINT2006 is probably much larger than a 20MB working set. Hmm:


Yeah thought so.
 

Nothingness

Diamond Member
Jul 3, 2013
3,063
2,047
136
[Posted in the ARM thread, asked by others to post here. Some modifications made.]
Thanks.

You should have linked the test you used for these computations. I guess it's this:

Single-thread
IPC is what everyone clamors about with Graviton/Arm. Let's compare. This was easy because we had direct-comparison specint2006 scores.

specint2006, single-thread, normalized to GHz
Graviton2 +14% over Rome

specint2006, single-thread, raw
Rome +16.2% over Graviton2
I get a score of 39.25 for Rome against 32.34 for Graviton so that's +21%. How did you compute?
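
For anyone who wants to redo the single-thread arithmetic, here is a small sketch. The two scores are the ones quoted just above. The clock speeds are assumptions that do not come from this thread (Graviton2 at 2.5 GHz, the 7742 at its 3.4 GHz boost clock), so the per-GHz result is only as good as those guesses.

```python
# Single-thread specint2006 scores quoted in the post.
ROME_SCORE = 39.25
GRAVITON2_SCORE = 32.34

# Assumed clocks in GHz. These are NOT from the thread, and sustained
# clocks during a real run may differ.
ROME_GHZ = 3.4
GRAVITON2_GHZ = 2.5


def lead(a: float, b: float) -> float:
    """Return how far a is ahead of b, as a fraction (0.10 means +10%)."""
    return a / b - 1.0


def per_ghz(score: float, ghz: float) -> float:
    """Normalize a score by clock speed as a crude IPC proxy."""
    return score / ghz


raw = lead(ROME_SCORE, GRAVITON2_SCORE)
normalized = lead(
    per_ghz(GRAVITON2_SCORE, GRAVITON2_GHZ),
    per_ghz(ROME_SCORE, ROME_GHZ),
)

print(f"Raw: Rome {raw:+.1%} over Graviton2")
print(f"Per GHz: Graviton2 {normalized:+.1%} over Rome")
```

The raw lead reproduces the +21% above. The per-GHz number moves with whatever clocks are plugged in, which may account for part of the disagreement between the two sets of figures.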

7742 - 250%
I get a score of 2962 for the 2x64 EPYC 7742, so that's 2.6 times faster. So +30% per socket.

I get about the same advantage for 7601. Which shows, as you wrote, that this whole exercise needs a fairer comparison.

So the brand new 2020 Graviton2 is 40% the speed of 2019's 7742.
I find 70%. One of us is miscalculating (and I don't exclude it's me ;))
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
I find 70%. One of us is miscalculating (and I don't exclude it's me ;))
That 70% sounds reasonable as A76 has about 80% IPC of Zen2 in SPEC2006.
Andrei's Graviton test results:

[Image: 115106.png]

And what about throughput per dollar?
I doubt Rome can improve throughput per dollar enough to beat Graviton2. Sure Rome will get closer than Naples/Zen1 but not win.
And next year Graviton3 comes (probably at 5nm with 128-core A78 and about 128MB L3 cache) and will be faster than anything from x86 world. x86 is done in cloud servers.

[Image: perf-per-usd.png]

Note that A76 was introduced in 2018, so it's older than Zen2. A better competitor for Rome would be A77 from 2019, with 25% higher IPC. Given that Zen2 brought only a 15% IPC jump over Zen1, you can see that a server CPU based on A77 would beat Zen2 Rome by a much larger performance margin than Graviton2 beats Zen1 Naples. And A78 is coming this year with another 20% IPC jump. And then a new ARMv9 + SVE core lineup.
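
The generational claims above compound, and the compounding is easy to sketch. The inputs are taken at face value from this post (A76 at about 80% of Zen2 IPC in SPEC2006, then +25% and +20%); they are projections, not measurements, and clock speed is ignored entirely.

```python
ZEN2_IPC = 1.0                # baseline
A76_IPC = 0.80 * ZEN2_IPC     # "A76 has about 80% IPC of Zen2 in SPEC2006"
A77_UPLIFT = 0.25             # claimed A77 gain over A76
A78_UPLIFT = 0.20             # claimed A78 gain over A77


def apply_uplift(ipc: float, uplift: float) -> float:
    """Return the IPC after a fractional generational uplift."""
    return ipc * (1 + uplift)


a77_ipc = apply_uplift(A76_IPC, A77_UPLIFT)
a78_ipc = apply_uplift(a77_ipc, A78_UPLIFT)

for name, ipc in (("A76", A76_IPC), ("A77", a77_ipc), ("A78", a78_ipc)):
    print(f"{name}: {ipc / ZEN2_IPC:.0%} of Zen2 IPC")
```

On these inputs A77 lands at about parity with Zen2 per clock and A78 about 20% ahead, before any difference in clock speed is taken into account.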

x86 cannot win in high core count servers because these systems are TDP limited. ARM has much better power efficiency so they will always have higher throughput from same TDP. This means better economy in terms of throughput per dollar.

Unless x86 improves efficiency enough to compete in smartphones... and that is unrealistic.

 

DrMrLordX

Lifer
Apr 27, 2000
22,027
11,607
136
Working set and memory requirements are different things.

They are until they aren't. SPEC hasn't outlined exactly why they require that much memory. Yeah it can be a smaller working set . . . or it may not be. They may have stuff dumped into main memory just so they don't have to read it off storage during the disk (see comments on the SPEC page about pagefile performance). Unless someone here has run the bench, it's hard to know what is the actual size of the working set. I would be surprised if the MT bench could be made to fit in Rome's L3 (chiplet restrictions notwithstanding).
 

Nothingness

Diamond Member
Jul 3, 2013
3,063
2,047
136
They are until they aren't. SPEC hasn't outlined exactly why they require that much memory. Yeah it can be a smaller working set . . . or it may not be. They may have stuff dumped into main memory just so they don't have to read it off storage during the disk (see comments on the SPEC page about pagefile performance). Unless someone here has run the bench, it's hard to know what is the actual size of the working set.
I gave a link to a paper with data about WSS. What more do you want? Do you want me to redo the study? :)

I would be surprised if the MT bench could be made to fit in Rome's L3 (chiplet restrictions notwithstanding).
IMHO SPEC rate is uninteresting anyway; what's the point of running the same thing on multiple cores? There's no shared data and in the end the only thing that could be an issue is main memory bandwidth which is shared by multiple cores, and only for some of the tests.

I can even think of pathological cases where running multiple instances of the same workload can give you a super linear speedup :D Though perhaps not on SPEC.
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
That 70% sounds reasonable as A76 has about 80% IPC of Zen2 in SPEC2006.
Andrei's Graviton test results:

[Image: 115106.png]

And what about throughput per dollar?
I doubt Rome can improve throughput per dollar enough to beat Graviton2. Sure Rome will get closer than Naples/Zen1 but not win.
And next year Graviton3 comes (probably at 5nm with 128-core A78 and about 128MB L3 cache) and will be faster than anything from x86 world. x86 is done in cloud servers.

[Image: perf-per-usd.png]

Note that A76 was introduced in 2018, so it's older than Zen2. A better competitor for Rome would be A77 from 2019, with 25% higher IPC. Given that Zen2 brought only a 15% IPC jump over Zen1, you can see that a server CPU based on A77 would beat Zen2 Rome by a much larger performance margin than Graviton2 beats Zen1 Naples. And A78 is coming this year with another 20% IPC jump. And then a new ARMv9 + SVE core lineup.

x86 cannot win in high core count servers because these systems are TDP limited. ARM has much better power efficiency so they will always have higher throughput from same TDP. This means better economy in terms of throughput per dollar.

Unless x86 improves efficiency enough to compete in smartphones... and that is unrealistic.
All valid points.

But throughput per socket is also a major consideration because it's not just the chip but the surrounding system that costs money. The rest of the system ain't cheap, and the less cheap the rest of the system is, the more value Graviton2 loses.

((( Below is edited )))

As a key point, the reference Daytona rack that is 2 x 7742 costs $25,000, of which about $14,000 is the 2x7742. So... the systems ain't cheap ($11,000 for the surrounding system).

Imagine a 2P Graviton2 system: even if the chips cost $2,000 each, which would be a good bargain, a similar system would cost $15,000.

2P implementation comparison
2 x Graviton -- $15,000 -- 100% speed
2 x 7742 -- $25,000 -- 125% speed

Or if you want to try to equal the 7742
10 x Graviton -- $75,000 -- 500% speed
8 x 7742 -- $100,000 -- 500% speed

Assuming Graviton2 scales to 2P platforms well, this could be great.
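
The cost comparison above can be reproduced with a small model. This is a sketch under the post's own assumptions: the $2,000 Graviton2 chip price is hypothetical, the $11,000 platform cost is reused unchanged for both systems, and speed is the relative 2P figure (Graviton2 = 100).

```python
import math

DAYTONA_2P_COST = 25_000        # reference 2 x 7742 system, from the post
EPYC_7742_PAIR_COST = 14_000    # cost of the two 7742 chips, from the post
PLATFORM_COST = DAYTONA_2P_COST - EPYC_7742_PAIR_COST  # rest of the system

GRAVITON2_CHIP_PRICE = 2_000    # hypothetical price from the post
EPYC_7742_CHIP_PRICE = EPYC_7742_PAIR_COST // 2

# Relative speed of one 2P system, from the post.
SPEED_2P = {"Graviton2": 100, "7742": 125}
CHIP_PRICE = {"Graviton2": GRAVITON2_CHIP_PRICE, "7742": EPYC_7742_CHIP_PRICE}


def system_cost(chip_price: int, sockets: int = 2) -> int:
    """Cost of one system: the chips plus the shared platform cost."""
    return sockets * chip_price + PLATFORM_COST


def cost_for_speed(chip: str, target_speed: int) -> int:
    """Cost of enough whole 2P systems to reach the target relative speed."""
    systems = math.ceil(target_speed / SPEED_2P[chip])
    return systems * system_cost(CHIP_PRICE[chip])


for chip in ("Graviton2", "7742"):
    cost = system_cost(CHIP_PRICE[chip])
    per_k = SPEED_2P[chip] / (cost / 1000)
    print(f"2 x {chip}: ${cost:,} per system, {per_k:.2f} speed points per $1,000")
    print(f"  500% speed: ${cost_for_speed(chip, 500):,}")
```

The platform cost is the term to watch: the more the surrounding system costs, the less a cheap chip helps, which is the point made at the top of this post.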
 