
Ampere Altra Launched with 80 Arm Cores for the Cloud
The Ampere Altra 80-core Arm server CPU is upon us with 128 PCIe Gen4 lanes per CPU, CCIX, and dual-socket server capabilities. We assess the impact.



How did he put it? [Embedded tweet no longer available.] EDIT: Or as Andreas kindly put it: https://twitter.com/i/web/status/1234883956849377281
> Much PR fluff
>
> First, Ampere is claiming that its part outperforms the dual AMD EPYC 7742 configuration by a small margin. That is AMD's second-highest-performing EPYC part behind the HPC-focused AMD EPYC 7H12, but it is also AMD's highest-end mainstream SKU. Ampere Altra is also claiming a massive estimated SPECrate2017_int_base figure over the dual Intel Xeon Platinum 8280 configuration. As we discussed in our Big 2nd Gen Intel Xeon Scalable Refresh piece, Intel has a new Xeon Gold 6258R which is a better comparison point now. While we expect the Xeon Gold 6258R to perform like the Xeon Platinum 8280, comparable numbers have not been published. Ampere, to its credit, is not using the Platinum 8280 list price, so this is well done.
>
> [Figure: Ampere Altra Performance SPECrate2017_int_base]
>
> When we get to the endnotes, we see how Ampere got to these figures. The Altra part is a dual-socket 3.3GHz platform using GCC 8.2. Ampere did not disclose the TDP here, but that is OK at this point. What we will note is that Ampere de-rated both the AMD EPYC 7742 and Xeon Platinum 8280 results, by 16.5% and 24% respectively. This was done to adjust for using GCC versus AOCC 2.0 and ICC 19.0.1.144. Ampere disclosed this, and it is a big impact. Arm servers tend to use GCC as the compiler, while there are more optimized compilers out there for AMD and Intel. For a reference point, that is why we showed both optimized and GCC numbers in our large launch-day ThunderX2 Review and Benchmarks piece.
>
> [Figure: Ampere Altra End Note 1]
>
> This de-rate practice for ICC and AOCC is common in the industry, and Ampere disclosed it clearly. We will note that while it is not enough to tip the balance on the Xeon side, it does mean that the 2019-era AMD EPYC 7742 can provide more performance than the future 2020-era Ampere Altra 3.3GHz part.
>
> Ampere, in the slides above, states that the Altra has a 3.0GHz maximum turbo. It is interesting that they are using a part here that is running at 10% higher clock speeds, especially with a 4% lead over AMD.

What bothers me is not that they had to de-rate the SPEC scores; it is a known fact that icc applies unrealistic optimizations, for instance. What bothers me is that they had to estimate that ratio. Don't they have access to Intel or AMD machines to make real measurements?

That is likely a scalar comparison not accounting for SIMD/vector performance. The Altra is based on the Neoverse N1 design, which in turn is a souped-up A76 for server/datacenter markets; it has at most 128-bit NEON units, while the EPYC 2 series has 256-bit AVX2 units.
> What bothers me is not that they had to de-rate the SPEC scores; it is a known fact that icc applies unrealistic optimizations, for instance. What bothers me is that they had to estimate that ratio. Don't they have access to Intel or AMD machines to make real measurements?

I can't comment on the Ampere, but you'll have equal-compiler figures from us on Graviton2/Cascade/EPYC.
For reference, the above-quoted ServeTheHome article shows that the icc vs gcc 7 ratio for the Gold 6148 is ~0.65 (so a 35% de-rate), and ~0.70 for AOCC on the EPYC 7601.
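To make the de-rate arithmetic concrete, here is a minimal C sketch of how such an adjustment works. The 100.0 score is a placeholder, not a published result; the ratios are the ones quoted above (Ampere's estimated 16.5% and 24% de-rates, plus the measured ~0.70/~0.65 ratios), and the helper name is mine, not anything from Ampere's disclosure.

/* Sketch of the compiler de-rate arithmetic discussed above: an estimated
 * GCC score is the vendor-compiler score multiplied by (1 - de-rate).
 * The score is a placeholder; the ratios are the ones quoted in this
 * thread, not official figures. */
#include <stdio.h>

static double gcc_estimate(double vendor_compiler_score, double derate)
{
    return vendor_compiler_score * (1.0 - derate);
}

int main(void)
{
    double score = 100.0; /* placeholder vendor-compiler SPECrate score */

    printf("Ampere's AOCC de-rate (16.5%%): %.1f\n", gcc_estimate(score, 0.165));
    printf("Ampere's ICC de-rate (24%%):    %.1f\n", gcc_estimate(score, 0.240));
    printf("Measured AOCC/gcc ratio on EPYC 7601 (~0.70): %.1f\n", score * 0.70);
    printf("Measured icc/gcc7 ratio on Gold 6148 (~0.65): %.1f\n", score * 0.65);
    return 0;
}

The gap between the estimated de-rates and the measured ratios is exactly the concern raised above: the correction factor itself is an estimate, and with only a 4% claimed lead, a few points of error in that estimate can flip the comparison.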
By the way @Det0x, it would have been more honest to provide a link to the article you copied your message verbatim from: https://www.servethehome.com/ampere-altra-80-arm-cores-for-cloud/
> Based on the numbers they pushed out, these CPUs have much lower IPC than competitors. It's late so I won't get into it, but do the math...

They're stronger on IPC than Intel/AMD, but not by a lot.
> That is likely a scalar comparison not accounting for SIMD/vector performance. The Altra is based on the Neoverse N1 design, which in turn is a souped-up A76 for server/datacenter markets; it has at most 128-bit NEON units, while the EPYC 2 series has 256-bit AVX2 units.

There's some SIMD in SPEC, but it's not like it's a math-crunching HPC test that fully utilises it. For the workloads it's meant for, there's absolutely no problem with the throughput.
> No! just no!
> Even if we accept that the instructions are simpler, the latest x86 decodes more, fits more in the cache, and has much wider execution units. Not to mention, SSE/AVX/AVX2 have way more custom instructions than NEON.
> Just the base math
> zen2 2.0 x 64 x 1.04 = ampere 3.0 x 80
> in a single benchmark is enough to refute this.

Can't agree more; I was wondering how one even does this IPC comparison. Intel x86 vs AMD x86 would probably make some sense.
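Reading the quoted "base math" as clock (GHz) times core count, the claim is that it takes 80 Altra cores at 3.0GHz to beat 64 Zen 2 cores (taken here at 2.0GHz) by 4%, which would imply much lower per-core, per-clock throughput. A quick C sketch of that arithmetic, using only the poster's numbers:

/* Quick check of the arithmetic in the post quoted above, reading the
 * figures as clock (GHz) x core count. The 2.0GHz and 3.0GHz values are
 * the poster's numbers, not measured all-core clocks under SPECrate load;
 * both claims are dual-socket, so the socket count cancels out. */
#include <stdio.h>

int main(void)
{
    double zen2_corehz  = 2.0 * 64;   /* GHz x cores, per socket */
    double altra_corehz = 3.0 * 80;
    double claimed_lead = 1.04;       /* Altra's claimed 4% SPECrate advantage */

    /* Implied Altra throughput per core, per GHz, relative to Zen 2. */
    double relative = (zen2_corehz * claimed_lead) / altra_corehz;
    printf("Altra per-core, per-GHz throughput vs Zen 2: %.2f\n", relative);
    /* ~0.55 with these inputs, which is the poster's point. */
    return 0;
}

The obvious caveats, picked up in the replies below, are whether core count times clock is a meaningful proxy for IPC at all, and whether 2.0GHz is the right all-core clock to assume for the EPYC 7742.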
> No! just no! Even if we accept that the instructions are simpler, the latest x86 decodes more, fits more in the cache, and has much wider execution units. [...]

It didn't occur to you that Andrei is talking about results *he* measured himself on Graviton2? That's how I understood his claim.
> No! just no! Even if we accept that the instructions are simpler, the latest x86 decodes more, fits more in the cache, and has much wider execution units. [...]

Even if you live in the land of the obtuse and count aggregate instruction throughput across all cores in a system, I hope you're also counting more clocks, since all the cores together tick more, right?
> Can't agree more; I was wondering how one even does this IPC comparison. Intel x86 vs AMD x86 would probably make some sense.

Last I checked, the retired instruction count of x86 vs AArch64 differed by less than 10%, so let's also retire that broken old argument. Thank you for bringing this up; I should cover it in the review to end this for good.
Just try compiling some day-to-day code with an index operation from the heap and see.
If you remove the -O/-s flags for readability, you can see the x86 code has fewer instructions for the C/C++ function than the arm64 code does.
Regarding the int benchmarks, it makes me wonder how that comparison is being made.
x86 has so many complex instructions, and I am certain most compilers fuse a bunch of steps into one instruction wherever possible for most application code.
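For anyone who wants to try the compile-and-compare exercise described above, here is a small self-contained C function of that shape. The function and the compiler invocations in the comment are only an illustrative sketch: any small heap-indexing function and any recent GCC will do, assuming an AArch64 cross compiler such as aarch64-linux-gnu-gcc is installed.

/* tiny.c - a small heap-indexing function, to compare generated
 * instruction counts across ISAs. Illustrative only.
 *
 * x86-64:   gcc -O2 -S tiny.c -o tiny_x86.s
 * AArch64:  aarch64-linux-gnu-gcc -O2 -S tiny.c -o tiny_a64.s
 * Then compare the instruction counts of the two .s files.
 */
#include <stddef.h>

long sum_indexed(const long *heap_buf, const size_t *idx, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += heap_buf[idx[i]];   /* indexed load from a heap array */
    return total;
}

Keep in mind that the static instruction count of one function is not the same thing as the dynamically retired instruction count over a whole benchmark, which is what the less-than-10% figure above refers to.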
> It didn't occur to you that Andrei is talking about results *he* measured himself on Graviton2? That's how I understood his claim.

Yes.
> Even if you live in the land of the obtuse and count aggregate instruction throughput across all cores in a system, I hope you're also counting more clocks, since all the cores together tick more, right?
I find MT IPC discussions stupid, as at that point you're not discussing a core's microarchitectural IPC capabilities anymore but the overlying system's memory capabilities. Why are you even accounting for the notion of cores here? AMD/Intel use SMT, while Arm fits two cores in the space of one x86 core. N1 single-threaded IPC is higher than Rome and Cascade Lake, and "system-wide" IPC is also higher than at least Cascade (didn't get to do that comparison to Rome yet).
As a note, the N1 in the server chips has notably higher IPC than a mobile A76 because it has much bigger caches and a significantly better memory subsystem. It is very, very good.
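For anyone who wants to put a number on IPC in the single-threaded, microarchitectural sense being argued about here, the usual recipe on Linux is retired instructions divided by core cycles from the hardware counters (perf stat reports exactly this). Below is a minimal, self-contained C sketch using perf_event_open; the busy-work loop is just a stand-in for whatever workload you actually care about.

/* Minimal IPC measurement sketch using Linux perf_event_open:
 * IPC = retired instructions / CPU cycles for the calling thread.
 * The busy-work loop is only a placeholder workload. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;           /* instructions or cycles */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int instr_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cycle_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (instr_fd < 0 || cycle_fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(instr_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(cycle_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(instr_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycle_fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 1.0;        /* placeholder workload */
    for (long i = 0; i < 100000000L; i++)
        x = x * 1.0000001 + 0.0000001;

    ioctl(instr_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycle_fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t instructions = 0, cycles = 0;
    read(instr_fd, &instructions, sizeof(instructions));
    read(cycle_fd, &cycles, sizeof(cycles));

    printf("instructions: %llu\ncycles: %llu\nIPC: %.2f\n",
           (unsigned long long)instructions, (unsigned long long)cycles,
           cycles ? (double)instructions / (double)cycles : 0.0);
    return 0;
}

This is the per-thread notion of IPC. Summing the same counters across 64 or 80 loaded cores and dividing by cycles gives the "system-wide IPC" questioned above, which says as much about core count and the memory subsystem as it does about the core microarchitecture.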
> Ehh, if you're not running at your best available clock, you'd better be doing AVX-512. Not all instructions are the same, nor are compilers, platforms, cooling, etc.

I don't understand what you're even trying to convey here. I was bringing up how system-wide IPC is a stupid metric since you were normalising "clock" for a system-wide metric; at the same time, I can normalise processors as a system-wide metric, at which point it doesn't matter how many cores it takes to achieve the system-wide throughput result.
> Single-/multi-threaded are just metrics. What you're saying here about memory, die area, or any other factor goes straight to the point. IPC, while often considered a direct reflection of computer performance, is very relative to the workload and platform.

I'm talking about an aggregate of a wide variety of workloads. IPC matters to single-threaded workloads because that's a limitation of the serial nature of the workload. Once you can go parallel, whether you reach high throughput through SMT/IPC or just more cores shouldn't matter, as long as per-thread performance doesn't tank dramatically. In this case the N1 is better in both regards.
> Estimates, rates, synthetics, and overwhelmingly stream-oriented workloads may favor a certain niche. Other workloads, not so much.

A certain niche of what? Quit beating around the bush and talk in actual workloads. Are you saying SPEC doesn't cover real workloads? There's a very clear delineation of workload types that roughly fall into the categories of high-compute/execution-bottlenecked and high-memory-pressure-bottlenecked, with varying shades of other small aspects between those two.
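To make those two categories concrete, here is a toy C sketch contrasting an execution-bound kernel with a memory-pressure-bound one (the second in the spirit of a STREAM-style triad); the function names and constants are arbitrary and only meant to illustrate the split described above.

/* Toy contrast between an execution-bound kernel and a memory-bound one.
 * Iteration counts and array sizes are arbitrary; illustrative only. */
#include <stddef.h>

/* Execution/compute bound: a long dependent chain of arithmetic that
 * stays in registers, so execution units and scheduling dominate. */
double compute_bound(long iters)
{
    double x = 1.0;
    for (long i = 0; i < iters; i++)
        x = x * 1.0000001 + 1e-9;
    return x;
}

/* Memory-pressure bound: a triad-style pass over arrays far larger than
 * the caches, so sustained memory bandwidth dominates. */
void memory_bound(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 3.0 * c[i];
}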
> x86 is on its fourth major iteration of SIMD. NEON, while moving fast, has still not even reached the level of SSE.

SIMD is mostly irrelevant for non-HPC workloads; games are probably the only other everyday area where it matters more. Have you actually compiled a large codebase with and without AVX? The difference is 5-10%. Saying NEON doesn't even reach SSE is insane; please name one thing you cannot do with NEON. High-density matrix operations are literally the only thing at which they aren't as fast right now, until you see Arm's 2x256-bit core in two years. But again, you saw how AMD did just dandy with 2x128b in Zen and also beats Intel with 2x256b in Zen 2.
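For readers following the 128-bit vs 256-bit argument, here is a small sketch of the same element-wise add written with NEON intrinsics and with 256-bit AVX intrinsics. Per instruction, the NEON version handles four single-precision floats and the AVX version eight, which is the width gap being debated; this is illustrative only and says nothing about how many vector pipes a given core has.

/* The same element-wise float add with 128-bit NEON and 256-bit AVX
 * intrinsics, to illustrate the width difference discussed above.
 * Build the NEON path on AArch64 and the AVX path on x86 with -mavx2. */
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
void vec_add(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {                 /* 4 floats per NEON op */
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
    for (; i < n; i++) dst[i] = a[i] + b[i];     /* scalar tail */
}
#elif defined(__AVX2__)
#include <immintrin.h>
void vec_add(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {                 /* 8 floats per 256-bit op */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) dst[i] = a[i] + b[i];     /* scalar tail */
}
#endif

How much that per-instruction width matters in practice is exactly the disagreement above: a lot for dense math kernels, much less for the kind of integer-heavy code the rest of the thread is arguing about.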