News ARM Server CPUs

Gideon · Mar 16, 2020

With all the upcoming ARM servers (let's not forget Nuvia, etc.) it probably makes sense to have one thread for them all, instead of creating new ones for each announcement (if not, I will rename it).

However:
Anandtech: Marvell Announces 3rd Gen Arm Server ThunderX3: 96 Cores/384 Threads
ServeTheHome: Marvell ThunderX3 Arm Server CPU with 768 Threads in 2020

Could be pretty impressive, though a 25% single-threaded performance gain seems a bit meh compared to X2, which at 2.5 GHz was ~50% slower than a Xeon at 3.8 GHz.

But Their marketing slides sure show potential:

Richie Rich · Mar 16, 2020

What all the experts think about SMT4 benefits now?

Hitman928 · Mar 16, 2020

Richie Rich said:
What all the experts think about SMT4 benefits now?

The Thunder line is strictly a server line of CPUs. There was never any doubt that SMT4 could be beneficial in many server applications. The PowerPC line already showed this. The push back against SMT4 was it being implemented in AMD CPUs that have a single architecture that covers all segments. SMT4 also tends to have some drawbacks in terms of efficiency but that's more a question of design balance than a showstopper. Marvell made some pretty bold statements when they announced ThunderX2 that didn't really pan out, so we'll see how well ThunderX3 does. On paper it looks like a much stronger competitor though it will be fighting Milan (AMD Zen 3 cores) more than Rome or anything Intel, so we'll have to wait for both to release to see how things pan out.

Andrei. · Mar 16, 2020

Richie Rich said:
What all the experts think about SMT4 benefits now?

It's completely useless in client workloads. As I explain in the article, the only gains are in data-plane workloads in which the work data is accessed at higher latency; i.e. server and cloud environments. Marvell very much said it themselves in their briefing that higher compute workloads won't see any large benefits.

DrMrLordX · Mar 16, 2020

Richie Rich said:
What all the experts think about SMT4 benefits now?

Not much. They're bagging on ST performance.

edit: it's good to see that Marvell is still working on the ThunderX line. I won't deny that. They seem to be carving themselves a niche instead of trying to go head-to-head with Ampere, Huawei, and Amazon. Also where are the Huawei benchmarks? Their core is custom but should be close to A76, like Graviton2.

Richie Rich · Mar 18, 2020

Andrei. said:
It's completely useless in client workloads. As I explain in the article, the only gains are in data-plane workloads in which the work data is accessed at higher latency; i.e. server and cloud environments. Marvell very much said it themselves in their briefing that higher compute workloads won't see any large benefits.

I agree with you, SMT lowers per-thread performance, and that's pretty clear from your Graviton2 test.
But there are two very different ways to use SMT:

- today: put SMT on the current core (gain +20% throughput at the cost of per-thread performance dropping to 0.6x)
- future: merge two A76 cores without SMT into one big core with SMT2 (gain +20% throughput while gaining 1.2x per-thread performance)

Such a core consisting of two A76 (3xALU+1xJump) would be 6xALU+2xJump wide, quite close to Apple's A13 Lightning core (4xALU+2xALU/Jump), so definitely feasible to create. IMHO shared resources can bring some nice gains in terms of efficiency. But it's definitely not easy to develop. And it doesn't make sense right now, when Apple shows there is a way to double IPC/throughput over A76.
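For what it's worth, the arithmetic behind those two bullets can be sketched in a few lines of Python. This is only a sketch: the 0.6x and 1.2x per-thread factors are the figures from the post, throughput is normalised so that one A76 core without SMT equals 1.0, and the helper name `smt_throughput` is made up for illustration.

```python
def smt_throughput(threads, per_thread_perf):
    """Total throughput relative to one non-SMT A76 core (= 1.0)."""
    return threads * per_thread_perf

# "Today": SMT2 added to an existing core, each thread runs at 0.6x.
today = smt_throughput(threads=2, per_thread_perf=0.6)

# "Future": two A76 cores merged into one wide SMT2 core, each thread at 1.2x.
future = smt_throughput(threads=2, per_thread_perf=1.2)

# Baseline for the "future" case: the two separate A76 cores being replaced.
two_plain_cores = smt_throughput(threads=2, per_thread_perf=1.0)

print(f"today:  {today:.2f}x of one core")
print(f"future: {future:.2f}x, versus {two_plain_cores:.2f}x for two separate cores")
print(f"future gain over two separate cores: {future / two_plain_cores:.2f}x")
```

Both cases land on the same +20% throughput; the difference is only whether each thread ends up slower (0.6x) or faster (1.2x) than a plain A76 core.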

amrnuke · Mar 18, 2020

Richie Rich said:
What all the experts think about SMT4 benefits now?

Nothing new. SMT2 produces similar benefits. At best, SMT4 is offering an 80% improvement over SMT1 in this application, which is not even double, despite having quadruple the threads. Also, we don't know how SMT4 compares to SMT2 in ThunderX3 chips. So you touting these benchmarks as proving SMT4 over SMT2 is curious. Can you explain what you are implying?

Some other thoughts on the release:
- They are comparing a 96 core, 384 thread part against a 64 core, 128 thread part (7742) in these comparisons. Even with 384 threads, triple that of 7742, it only manages, in the best case, to double performance, and in most cases not even that.
- We don't know what the real power consumption is. They claim a great performance per watt number, which is expected on a 96 core, 384 thread part running at lower clocks, but we will need more data on a broad number of benchmarks to make any statement about this chip.
- If AMD put out a 96 core 192 thread part, clocked it low to hit a 240W TDP, I'm guessing their performance per watt would also be great.
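To put rough numbers on the scaling argument, here is a quick Python sketch. The 1.8x is the best-case "80% improvement over SMT1" mentioned above, the 2x is the best-case claim against the 7742, and `scaling_efficiency` is just a helper name invented for this example. Comparing thread counts across two different architectures is crude, so treat the second figure as a sanity check rather than a measurement.

```python
def scaling_efficiency(speedup, thread_ratio):
    """Fraction of linear scaling achieved (1.0 = speedup equals thread ratio)."""
    return speedup / thread_ratio

# SMT4 vs SMT1 on the same chip: 4x the threads, at best 1.8x the performance.
smt4_vs_smt1 = scaling_efficiency(speedup=1.8, thread_ratio=4)

# ThunderX3 (384 threads) vs EPYC 7742 (128 threads): at best 2x the performance.
tx3_vs_7742 = scaling_efficiency(speedup=2.0, thread_ratio=384 / 128)

print(f"SMT4 vs SMT1:      {smt4_vs_smt1:.2f} of linear scaling")
print(f"ThunderX3 vs 7742: {tx3_vs_7742:.2f} of linear scaling")
```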

Nothingness · Mar 18, 2020

amrnuke said:
Some other thoughts on the release:
- They are comparing a 96 core, 384 thread part against a 64 core, 128 thread part (7742) in these comparisons. Even with 384 threads, triple that of 7742, it only manages, in the best case, to double performance, and in most cases not even that.

Is there a CPU with more cores/thread per socket? If not then the comparison is valid. Better performance per socket is a very good thing for a class of workloads.

- We don't know what the real power consumption is. They claim a great performance per watt number, which is expected on a 96 core, 384 thread part running at lower clocks, but we will need more data on a broad number of benchmarks to make any statement about this chip.

Agreed. And the same applies to performance, we need independent testing.

- If AMD put out a 96 core 192 thread part, clocked it low to hit a 240W TDP, I'm guessing their performance per watt would also be great.

Can they make a 96 core/192 thread chip? I guess if they economically could, they would, as there's a market for such chips.

Richie Rich · Mar 22, 2020

A bench of the Kunpeng 920 would be interesting (not a Cortex core, it should be a Chinese design).

Richie Rich · Jun 23, 2020

A 128-core version of the ARM server CPU Ampere Altra is coming next year. Probably still A76/Neoverse N1 on TSMC 7nm. If AMD Rome had trouble beating the 64-core ARM Graviton2 in some areas (like performance per thread), then this 128-core CPU will be more powerful than 64-core Rome for sure. The only question is how Zen3 will stand against this 128-core monolith. If Zen3 keeps the same 64-core count, then the 128-core Altra could be faster.

Ampere Preps 7nm 128-Core Server CPU to Take on AMD and Intel

The power of Arm

www.tomshardware.com

LightningZ71 · Jun 23, 2020

I believe that the server side improvements for Zen3 are somewhat more modest. The ccx improvements should improve average multi-threaded benchmarks a bit, both by making the L3 a bit more efficient and by improving inter-thread communications a bit. The IPC improvements will lift overall scores by a noticeable amount. The refinement to the node will add a little efficiency and clock speed. It is also expected that AVX capabilities will also be improved.

I think that, in the relevant multi-threaded benchmarks, we'll see Zen 3 Epyc be at least 20-25% faster per thread at a minimum for anything that isn't highly memory bound. For certain workloads, 50% improvement isn't unthinkable.

The question is, will it be enough to keep up with the competitive ARM server chips? AMD won't have more cores per package until Zen 4 at the earliest unless they heavily revise the EPYC package, go to a smaller I/O die, and pack on a few more chiplets.

itsmydamnation · Jun 24, 2020

LightningZ71 said:
I believe that the server side improvements for Zen3 are somewhat more modest. The ccx improvements should improve average multi-threaded benchmarks a bit, both by making the L3 a bit more efficient and by improving inter-thread communications a bit. The IPC improvements will lift overall scores by a noticeable amount. The refinement to the node will add a little efficiency and clock speed. It is also expected that AVX capabilities will also be improved.

I think that, in the relevant multi-threaded benchmarks, we'll see Zen 3 Epyc be at least 20-25% faster per thread at a minimum for anything that isn't highly memory bound. For certain workloads, 50% improvement isn't unthinkable.

The question is, will it be enough to keep up with the competitive ARM server chips? AMD won't have more cores per package until Zen 4 at the earliest unless they heavily revise the EPYC package, go to a smaller I/O die, and pack on a few more chiplets.

Next year we will see Zen4 on 5nm.
Where can I buy the Ampere 80-core? Where are the benchmarks?
Why is Ampere 80 being compared to Zen2? After all, Zen3 is sampling to vendors just like Altra is, and Altra is a 2020 release product just like Milan is.

I suppose it doesn't look very good for a certain someone to compare to the actual competition (I don't expect Ampere to compare to an unknown product), but we can at least project a range of values.

DisEnchantment · Jun 24, 2020

If you think Cavium ThunderX blew the socks off x86 server chips, wait till you see Tachyum Prodigy
Their universal processor will beat all known CPUs and GPUs ever existed, by massive margins. Official statement. (You can find this also on @kokhua tweets)
Then Nuvia will top that and some more.

yuri69 · Jun 24, 2020

DisEnchantment said:
If you think Cavium ThunderX blew the socks off x86 server chips, wait till you see Tachyum Prodigy
Their universal processor will beat all known CPUs and GPUs ever existed, by massive margins. Official statement.

Have you actually seen any of the architecture description of the Tachyum Prodigy? The massive reliance on the compiler is a red light - IA64 says hello.

DisEnchantment · Jun 24, 2020

yuri69 said:
Have you actually seen any of the architecture description of the Tachyum Prodigy? The massive reliance on the compiler is a red light - IA64 says hello.

Why are you doubting? They said it. So it is true.

And.. btw
You will not be able to buy it anywhere
We can ship one to you, but you have to sign an NDA
For benchmarks you have to trust our numbers.

Richie Rich · Jun 24, 2020

itsmydamnation said:
Next year we will see Zen4 on 5nm.
Where can I buy the Ampere 80-core? Where are the benchmarks?
Why is Ampere 80 being compared to Zen2? After all, Zen3 is sampling to vendors just like Altra is, and Altra is a 2020 release product just like Milan is.

I agree that we need to wait for benchmarks because neither Altra nor Zen3 is out yet. However, we can estimate based on Graviton2 @ 2.5 GHz, which is based on the same A76/N1 core:

- 64c/64t Graviton2 (score at Phoronix): 29.07 pts
- 64c/128t AMD Epyc 7742 (Rome, Zen2): 39.93 pts, 37% higher performance per socket
- 128c/128t Altra estimation (2*G2): 58.14 pts, 45% higher performance than AMD Rome
- 64c/128t Zen3 estimation (20% IPC jump): 47.92 pts (39.93*1.2), 18% slower perf than Altra128
- 64c/128t Zen4 estimation (10% IPC jump): 52.71 pts (47.92*1.1), 9% slower perf than Altra128
- 96c/192t Zen4 estimation: 79.07 pts (52.71*1.5), 35% higher perf than Altra128
Altra is clocked at 3 GHz and boosts to 3.3 GHz, but I expect the 128-core version to be clocked similarly to G2. The gain from 2.5 -> 3.0 GHz is another 20%, so you can do the math.
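The chain of estimates above can be reproduced with a short Python sketch. Only the two measured scores (29.07 and 39.93) come from the linked Phoronix test as quoted in the post; every multiplier is the post's own assumption, including perfectly linear scaling with core count. The variable names and the helper `pct_vs` are made up for this example.

```python
G2 = 29.07    # 64c/64t Graviton2, score quoted in the post
ROME = 39.93  # 64c/128t Epyc 7742, score quoted in the post

# Assumptions from the post, not measurements:
altra128 = G2 * 2            # perfect 2x scaling from 64 to 128 cores
zen3_64c = ROME * 1.20       # +20% IPC over Zen2
zen4_64c = zen3_64c * 1.10   # +10% IPC over Zen3
zen4_96c = zen4_64c * 1.5    # 1.5x the cores, again assuming linear scaling

def pct_vs(a, b):
    """Percent difference of score a relative to score b."""
    return (a / b - 1) * 100

rows = [
    ("Rome vs Graviton2", ROME, G2),
    ("Altra128 vs Rome", altra128, ROME),
    ("Zen3 64c vs Altra128", zen3_64c, altra128),
    ("Zen4 64c vs Altra128", zen4_64c, altra128),
    ("Zen4 96c vs Altra128", zen4_96c, altra128),
]
for label, a, b in rows:
    print(f"{label:22s} {a:6.2f} pts  {pct_vs(a, b):+6.1f}%")

# Extra 20% from the clock bump mentioned in the post (2.5 -> 3.0 GHz).
clock_gain = 3.0 / 2.5
print(f"clock gain 2.5 -> 3.0 GHz: {clock_gain:.2f}x")
```

The printed values can differ from the post in the last digit because the post rounds after every step while this sketch rounds only when printing.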

Benchmarking Amazon's Graviton2 Performance With 64 Neoverse N1 Cores Against Intel Xeon, AMD EPYC - Phoronix

www.phoronix.com

So AMD would need Zen4 at 5nm to beat an ARM server CPU on old 7nm based on an old and slow core like A76 (from 2018). We expect ARM to unveil its new server architecture, Neoverse N2, later this year. If Altra128 (Mystique) is based on N2/Cortex A77, then this would be a different beast.

Comparison of Cortex A76 core vs. A77:

IPC jump 20% in INT and 35% in FPU
transistor count increase only 17%
core area 1.2mm2 A76 vs. 1.4mm2 A77 (both including 1MB L2 cache, compare to 3.6mm2 Zen2)

And if Neoverse N2 is based on the A78 Hercules unveiled a few weeks ago, with its +7% IPC and -5% die size, things become even more scary. IMHO I expected Altra Mystique to be based on A77. I really don't see any sense in investing a lot of money to create new masks for a 3-year-old architecture like A76 on the same process, just to add a few more cores. This is more of an Intel-like big company approach, not that of a small start-up like Ampere. I'd expect at least N7P and A77 cores, which would annihilate any x86 CPU including Zen4 at 5nm.

lobz · Jun 24, 2020

Here we go again. You guys fell into the trap.

DrMrLordX · Jun 24, 2020

lobz said:
Here we go again. You guys fell into the trap.

I'm expecting AMD and Intel to collapse any day now. ARM is too stronk.

JoeRambo · Jun 24, 2020

Richie Rich = dr. Sharikou of Intel bankrupt by Q3 2007 fame.

Still for certain uses 128 cores of monolith are better than 16 or 8 shards of ZenX cores. And certainly better than 28 or 38 monolith from Intel.

Richie Rich · Jun 24, 2020

No, neither Intel nor AMD will go bankrupt in the future. But they can end up like VIA, with 0.1% market share, selling some old legacy HW.

x86 CPUs have about 10% market share out of total CPUs sold worldwide. And this market share is still decreasing over time as ARM is gaining in servers and supercomputers. So yeah, x86 is well on its way to disappearing.

lobz · Jun 24, 2020

Things sure took a dark turn pretty quickly.

Hitman928 · Jun 24, 2020

Richie Rich said:
I agree that we need to wait for benchmarks because neither Altra nor Zen3 is out yet. However, we can estimate based on Graviton2 @ 2.5 GHz, which is based on the same A76/N1 core:

- 64c/64t Graviton2 (score at Phoronix): 29.07 pts
- 64c/128t AMD Epyc 7742 (Rome, Zen2): 39.93 pts, 37% higher performance per socket
- 128c/128t Altra estimation (2*G2): 58.14 pts, 45% higher performance than AMD Rome
- 64c/128t Zen3 estimation (20% IPC jump): 47.92 pts (39.93*1.2), 18% slower perf than Altra128
- 64c/128t Zen4 estimation (10% IPC jump): 52.71 pts (47.92*1.1), 9% slower perf than Altra128
- 96c/192t Zen4 estimation: 79.07 pts (52.71*1.5), 35% higher perf than Altra128

Altra is clocked at 3 GHz and boosts to 3.3 GHz, but I expect the 128-core version to be clocked similarly to G2. The gain from 2.5 -> 3.0 GHz is another 20%, so you can do the math.

Benchmarking Amazon's Graviton2 Performance With 64 Neoverse N1 Cores Against Intel Xeon, AMD EPYC - Phoronix

www.phoronix.com

So AMD would need Zen4 at 5nm to beat an ARM server CPU on old 7nm based on an old and slow core like A76 (from 2018). We expect ARM to unveil its new server architecture, Neoverse N2, later this year. If Altra128 (Mystique) is based on N2/Cortex A77, then this would be a different beast.

Comparison of Cortex A76 core vs. A77:

IPC jump 20% in INT and 35% in FPU

transistor count increase only 17%

core area 1.2mm2 A76 vs. 1.4mm2 A77 (both including 1MB L2 cache, compare to 3.6mm2 Zen2)

And if Neoverse N2 is based on the A78 Hercules unveiled a few weeks ago, with its +7% IPC and -5% die size, things become even more scary. IMHO I expected Altra Mystique to be based on A77. I really don't see any sense in investing a lot of money to create new masks for a 3-year-old architecture like A76 on the same process, just to add a few more cores. This is more of an Intel-like big company approach, not that of a small start-up like Ampere. I'd expect at least N7P and A77 cores, which would annihilate any x86 CPU including Zen4 at 5nm.

In that test:

Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 24c/16t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?
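To show how much the scaling assumption matters, here is a small Python sketch that applies the 10% gain seen when doubling the 7742 to the Graviton2 score, next to the naive 2x estimate. Treating a single 128-core Altra like a dual-socket Epyc in this test suite is itself an assumption made only for illustration, and the variable names are invented for this example.

```python
G2_SCORE = 29.07          # 64c/64t Graviton2 average, as quoted in the thread
ROME_SCORE = 39.93        # 64c/128t Epyc 7742 average, as quoted in the thread
OBSERVED_2X_GAIN = 0.10   # 7742 -> 2x7742 average gain in that test

# Naive estimate: doubling the cores doubles the score.
naive_altra128 = G2_SCORE * 2

# Estimate using the scaling actually observed for doubled core counts
# (assumption: Altra scales like the dual-socket Epyc did in this suite).
scaled_altra128 = G2_SCORE * (1 + OBSERVED_2X_GAIN)

print(f"naive 2x estimate:        {naive_altra128:.2f} pts")
print(f"with observed 10% gain:   {scaled_altra128:.2f} pts")
print(f"single 7742 for reference: {ROME_SCORE:.2f} pts")
```

Under that assumption the 128-core estimate stays below the single 7742 score instead of landing 45% above it.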

Markfw · Jun 24, 2020

Hitman928 said:
In that test:

Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 24c/16t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?

I have learned to ignore anything he says for reasons like this.

podspi · Jun 24, 2020

Richie Rich said:
I agree with you, SMT lowers per-thread performance, and that's pretty clear from your Graviton2 test.
But there are two very different ways to use SMT:

- today: put SMT on the current core (gain +20% throughput at the cost of per-thread performance dropping to 0.6x)
- future: merge two A76 cores without SMT into one big core with SMT2 (gain +20% throughput while gaining 1.2x per-thread performance)

Such a core consisting of two A76 (3xALU+1xJump) would be 6xALU+2xJump wide, quite close to Apple's A13 Lightning core (4xALU+2xALU/Jump), so definitely feasible to create. IMHO shared resources can bring some nice gains in terms of efficiency. But it's definitely not easy to develop. And it doesn't make sense right now, when Apple shows there is a way to double IPC/throughput over A76.

You're describing CMT (Bulldozer) not SMT.

NostaSeronx · Jun 24, 2020

podspi said:
You're describing CMT (Bulldozer) not SMT.

There is also clustered SMT, by the way.

CS Lecture 20 The Case for a Single-Chip Multiprocessor - ppt download

CMP vs. Wide-Issue Superscalar What is the best use of on-chip real estate? wide-issue processor (complex design/clock, diminishing ILP returns) CMP (simple design, high TLP, lower ILP) Contributions: Takes area and latencies into account Attempts fine-grain parallelization

slideplayer.com

=> https://player.slideplayer.com/78/12863145/slides/slide_16.jpg

Euro-Par 2003 Parallel Processing

Euro-ParConferenceSeries The European Conference on Parallel Computing (Euro-Par) is an international conference series dedicated to the promotion and advancement of all aspects of parallel and distributed computing. The major themes fall into the categories of hardware, software, algorithms...

books.google.com

https://dl.acm.org/doi/pdf/10.1145/1024295.1024304?download=true

Conclusion: "Large scale clustered SMT processors with many shared specialised or configurable function units come very close to the ideal of a softening of hardware and offer an attractive alternative to both, pure FPGA implementations and to costly ASIC designs."
Test design is derivative of https://www.semanticscholar.org/pap...ders/5a821d4ed6c23a66fd0e8d0eb3c3acc31a742e2f

https://www.researchgate.net/publication/2978897_5-GHz_32-bit_integer_execution_core_in_130-nm_dual-Vsub_T_CMOS

Development tree wise: cSMT is the end for SMT and CMT.
CMP -> SMT -> cSMT
CMP -> CMT -> cSMT
Both styles of cSMT (strong thread-cluster cohesion, the CMT derivative, and weak thread-cluster cohesion, the SMT derivative) eventually fuse into distributive multithreading, which is by far the most advanced style of threading to date.
