News ARM Server CPUs

Gideon

Golden Member
Nov 27, 2007
1,598
3,520
136
With all the upcoming ARM servers (let's not forget Nuvia, etc.) it probably makes sense to have one thread for them all, instead of creating a new one for each announcement (if not, I will rename it).

However:
Anandtech: Marvell Announces 3rd Gen Arm Server Thunder X3: 96 Cores/384 threads
ServeTheHome: Marvell ThunderX3 Arm Server CPU with 768 Threads in 2020

MarvellTX3_13_575px.jpg


Could be pretty impressive, though a 25% single-threaded performance gain seems a bit meh compared to X2, which at 2.5 GHz was ~50% slower than a Xeon at 3.8 GHz.

But their marketing slides sure show potential:

MarvellTX3_16_575px.jpg



MarvellTX3_15_575px.jpg
 

Hitman928

Diamond Member
Apr 15, 2012
5,160
7,595
136
What all the experts think about SMT4 benefits now? :cool:

The Thunder line is strictly a server line of CPUs. There was never any doubt that SMT4 could be beneficial in many server applications. The PowerPC line already showed this. The pushback against SMT4 was about it being implemented in AMD CPUs, which use a single architecture that covers all segments. SMT4 also tends to have some drawbacks in terms of efficiency, but that's more a question of design balance than a showstopper. Marvell made some pretty bold statements when they announced ThunderX2 that didn't really pan out, so we'll see how well ThunderX3 does. On paper it looks like a much stronger competitor, though it will be fighting Milan (AMD Zen 3 cores) more than Rome or anything Intel, so we'll have to wait for both to release to see how things pan out.
 

Andrei.

Senior member
Jan 26, 2015
316
386
136
What all the experts think about SMT4 benefits now? :cool:
It's completely useless in client workloads. As I explain in the article, the only gains are in data-plane workloads in which the work data is accessed at higher latency; i.e. server and cloud environments. Marvell very much said it themselves in their briefing that higher compute workloads won't see any large benefits.
 

DrMrLordX

Lifer
Apr 27, 2000
21,571
10,764
136
What all the experts think about SMT4 benefits now? :cool:

Not much. They're bagging on ST performance.

edit: it's good to see that Marvell is still working on the ThunderX line. I won't deny that. They seem to be carving themselves a niche instead of trying to go head-to-head with Ampere, Huawei, and Amazon. Also where are the Huawei benchmarks? Their core is custom but should be close to A76, like Graviton2.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
It's completely useless in client workloads. As I explain in the article, the only gains are in data-plane workloads in which the work data is accessed at higher latency; i.e. server and cloud environments. Marvell very much said it themselves in their briefing that higher compute workloads won't see any large benefits.
I agree with you, SMT lowers thread performance and it's pretty clear from your Graviton2 test.
But there are two very different ways to use SMT:
  • today: put SMT on the current core (gain +20% throughput at the cost of lower per-thread performance, by a factor of 0.6x)
  • future: merge two A76 cores without SMT into one big core with SMT2 (gain +20% throughput while gaining 1.2x per-thread performance)

Such a core consisting of two A76 (3xALU+1xJump) would be 6xALU+2xJump wide, quite close to Apple's A13 Lightning core (4xALU+2xALU/Jump), so it is definitely feasible to create. IMHO shared resources can bring some nice gains in terms of efficiency. But it's definitely not easy to develop. And it doesn't make sense right now, when Apple shows there is a way to double IPC/throughput over A76.
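The two cases above can be put into numbers with a small Python sketch. It only restates the figures from the post (+20% throughput, two threads per core); the function names are hypothetical, and the baseline of 1.0 is one A76 core running one thread.

```python
def smt_on_existing_core(throughput_gain=0.20, threads=2):
    """SMT added to an unchanged core.

    Returns (total throughput, per-thread performance), both relative to
    one non-SMT core running a single thread (= 1.0).
    """
    total = 1.0 + throughput_gain
    return total, total / threads


def merged_core_with_smt2(cores_merged=2, throughput_gain=0.20):
    """Two cores merged into one wide core that runs two threads (SMT2).

    The baseline for total throughput is the two separate cores (= 2.0);
    per-thread performance is still relative to one original core.
    """
    total = cores_merged * (1.0 + throughput_gain)
    return total, total / 2


if __name__ == "__main__":
    for name, fn in [("SMT on current core", smt_on_existing_core),
                     ("merged core + SMT2", merged_core_with_smt2)]:
        total, per_thread = fn()
        print(f"{name}: total {total:.2f}x, per thread {per_thread:.2f}x")
```

Under these assumptions the first case gives 1.2x total at 0.6x per thread, and the second gives 2.4x total (1.2x over two separate cores) at 1.2x per thread, which matches the factors quoted in the post. Whether a merged core could really reach that per-thread gain is the open question.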
 

amrnuke

Golden Member
Apr 24, 2019
1,181
1,772
136
What all the experts think about SMT4 benefits now? :cool:
Nothing new. SMT2 produces similar benefits. At best, SMT4 is offering an 80% improvement over SMT1 in this application, which is not even double, despite having quadruple the threads. Also, we don't know how SMT4 compares to SMT2 in ThunderX3 chips. So you touting these benchmarks as proving SMT4 over SMT2 is curious. Can you explain what you are implying? :cool:

Some other thoughts on the release:
- They are comparing a 96 core, 384 thread part against a 64 core, 128 thread part (7742) in these comparisons. Even with 384 threads, triple that of 7742, it only manages, in the best case, to double performance, and in most cases not even that.
- We don't know what the real power consumption is. They claim a great performance per watt number, which is expected on a 96 core, 384 thread part running at lower clocks, but we will need more data on a broad number of benchmarks to make any statement about this chip.
- If AMD put out a 96 core 192 thread part, clocked it low to hit a 240W TDP, I'm guessing their performance per watt would also be great.
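As a rough check on the thread-count argument, here is a minimal Python sketch that divides a claimed speedup by the increase in thread count. The inputs are the best-case figures quoted in this post (1.8x for SMT4 over SMT1, 2x for the 384-thread part over the 128-thread 7742); the helper name is hypothetical.

```python
def per_thread_ratio(total_speedup, thread_multiplier):
    """Throughput per hardware thread, relative to the baseline part."""
    return total_speedup / thread_multiplier


if __name__ == "__main__":
    # SMT4 vs SMT1 on the same chip: 4x the threads, at best 1.8x the work.
    print(f"SMT4 vs SMT1: {per_thread_ratio(1.8, 4):.2f}x per thread")
    # ThunderX3 (384 threads) vs Epyc 7742 (128 threads): at best 2x.
    print(f"TX3 vs 7742:  {per_thread_ratio(2.0, 384 / 128):.2f}x per thread")
```

So even in the best case each SMT4 thread delivers a bit under half of what a single thread does alone, and each ThunderX3 thread about two thirds of a 7742 thread. That is fine for throughput-bound work, less so where per-thread latency matters.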
 

Nothingness

Platinum Member
Jul 3, 2013
2,364
707
136
Some other thoughts on the release:
- They are comparing a 96 core, 384 thread part against a 64 core, 128 thread part (7742) in these comparisons. Even with 384 threads, triple that of 7742, it only manages, in the best case, to double performance, and in most cases not even that.
Is there a CPU with more cores/threads per socket? If not, then the comparison is valid. Better performance per socket is a very good thing for a class of workloads.

- We don't know what the real power consumption is. They claim a great performance per watt number, which is expected on a 96 core, 384 thread part running at lower clocks, but we will need more data on a broad number of benchmarks to make any statement about this chip.
Agreed. And the same applies to performance, we need independent testing.

- If AMD put out a 96 core 192 thread part, clocked it low to hit a 240W TDP, I'm guessing their performance per watt would also be great.
Can they make a 96 core/192 thread chip? I guess if they economically could, they would, as there's a market for such chips.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
A 128-core version of the ARM server CPU Ampere Altra is coming next year, probably still based on A76/Neoverse N1 and TSMC 7nm. If AMD Rome had trouble beating the 64-core ARM Graviton2 in some areas (like performance per thread), then this 128-core CPU will be more powerful than 64-core Rome for sure. The only question is how Zen3 will stand against this 128-core monolith. If Zen3 keeps the same 64-core count, then the 128-core Altra could be faster.

 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
I believe that the server-side improvements for Zen3 are somewhat more modest. The CCX improvements should improve average multi-threaded benchmarks a bit, both by making the L3 a bit more efficient and by improving inter-thread communication a bit. The IPC improvements will lift overall scores by a noticeable amount. The refinement to the node will add a little efficiency and clock speed. It is also expected that AVX capabilities will be improved.

I think that, in the relevant multi-threaded benchmarks, we'll see Zen 3 Epyc be at least 20-25% faster per thread for anything that isn't highly memory bound. For certain workloads, a 50% improvement isn't unthinkable.

The question is, will it be enough to keep up with the competitive ARM server chips? AMD won't have more cores per package until Zen 4 at the earliest, unless they heavily revise the EPYC package, go to a smaller I/O die, and pack on a few more chiplets.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,731
3,063
136
I believe that the server-side improvements for Zen3 are somewhat more modest. The CCX improvements should improve average multi-threaded benchmarks a bit, both by making the L3 a bit more efficient and by improving inter-thread communication a bit. The IPC improvements will lift overall scores by a noticeable amount. The refinement to the node will add a little efficiency and clock speed. It is also expected that AVX capabilities will be improved.

I think that, in the relevant multi-threaded benchmarks, we'll see Zen 3 Epyc be at least 20-25% faster per thread for anything that isn't highly memory bound. For certain workloads, a 50% improvement isn't unthinkable.

The question is, will it be enough to keep up with the competitive ARM server chips? AMD won't have more cores per package until Zen 4 at the earliest, unless they heavily revise the EPYC package, go to a smaller I/O die, and pack on a few more chiplets.
Next year we will see Zen4 on 5nm.
Where can I buy the 80-core Ampere? Where are the benchmarks?
Why is the 80-core Ampere being compared to Zen2? After all, Zen3 is sampling to vendors just like Altra is, and Altra is a 2020 release product just like Milan is.

I suppose it doesn't look very good for a certain someone to compare to the actual competition (I don't expect Ampere to compare to an unknown product), but we can at least project a range of values.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,587
5,703
136
If you think Cavium ThunderX blew the socks off x86 server chips, wait till you see Tachyum Prodigy.
Their universal processor will beat all known CPUs and GPUs that ever existed, by massive margins. Official statement. (You can also find this in @kokhua's tweets.)
Then Nuvia will top that and then some.
 

yuri69

Senior member
Jul 16, 2013
366
561
136
If you think Cavium ThunderX blew the socks off x86 server chips, wait till you see Tachyum Prodigy.
Their universal processor will beat all known CPUs and GPUs that ever existed, by massive margins. Official statement.
Have you actually seen any of the architecture description of the Tachyum Prodigy? The massive reliance on the compiler is a red light - IA64 says hello.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,587
5,703
136
Have you actually seen any of the architecture description of the Tachyum Prodigy? The massive reliance on the compiler is a red light - IA64 says hello.
Why are you doubting? They said it. So it is true.

And.. btw
You will not be able to buy it anywhere
We can ship one to you, but you have to sign an NDA
For benchmarks you have to trust our numbers.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Next year we will see Zen4 on 5nm.
Where can I buy the 80-core Ampere? Where are the benchmarks?
Why is the 80-core Ampere being compared to Zen2? After all, Zen3 is sampling to vendors just like Altra is, and Altra is a 2020 release product just like Milan is.
I agree that we need to wait for benchmarks, because neither Altra nor Zen3 is out yet. However, we can estimate based on Graviton2 @ 2.5 GHz, which is based on the same A76/N1 core:
  • 64c/64t Graviton2 score at Phoronix: 29.07 pts
  • 64c/128t AMD Epyc 7742 (Rome, Zen2): 39.93 pts (37% higher performance per socket)
  • 128c/128t Altra estimate = 2*G2: 58.14 pts (45% higher performance than AMD Rome)
  • 64c/128t Zen3 estimate = 20% IPC jump: 47.92 pts (39.93*1.2) (18% slower than Altra128)
  • 64c/128t Zen4 estimate = 10% IPC jump: 52.71 pts (47.92*1.1) (9% slower than Altra128)
  • 96c/192t Zen4 estimate: 79.07 pts (52.71*1.5) (35% higher performance than Altra128)
  • Altra is clocked at 3 GHz and boosts to 3.3 GHz, but I expect the 128-core version to be clocked similarly to G2. The gain from 2.5 -> 3.0 GHz is another 20%, so you can do the math.
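For anyone who wants to play with the factors, the chain of estimates above can be reproduced with a short Python sketch. The two measured scores are the ones quoted in the list; every multiplier (perfect 2x scaling with core count, the 20% and 10% IPC jumps, the 1.5x core ratio) is an assumption from this post, not a measurement, and the function names are hypothetical.

```python
# Suite averages quoted in the post (points).
GRAVITON2_64C = 29.07
EPYC_7742_64C = 39.93


def estimate_scores(core_scaling=2.0, zen3_ipc=1.20, zen4_ipc=1.10,
                    zen4_core_ratio=1.5):
    """Chain the post's assumptions; change any factor to see its effect."""
    altra128 = GRAVITON2_64C * core_scaling   # assumes perfect core scaling
    zen3_64c = EPYC_7742_64C * zen3_ipc       # assumed +20% IPC
    zen4_64c = zen3_64c * zen4_ipc            # assumed +10% IPC on top
    zen4_96c = zen4_64c * zen4_core_ratio     # 96 cores / 64 cores
    return {"altra128": altra128, "zen3_64c": zen3_64c,
            "zen4_64c": zen4_64c, "zen4_96c": zen4_96c}


def percent_vs(score, baseline):
    """Percentage difference of score relative to baseline."""
    return (score / baseline - 1.0) * 100.0


if __name__ == "__main__":
    est = estimate_scores()
    for name, score in est.items():
        delta = percent_vs(score, est["altra128"])
        print(f"{name}: {score:.2f} pts ({delta:+.1f}% vs Altra128)")
```

The sketch does not round intermediate values, so the last digits can differ slightly from the hand-calculated list. The result is most sensitive to `core_scaling`: anything below 2.0 pulls the Altra128 estimate down directly.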

So AMD would need Zen4 at 5nm to beat an ARM server CPU on old 7nm based on an old and slow core like A76 (from 2018). We expect ARM to unveil the new server architecture Neoverse N2 later this year. If Altra128 (Mystique) is based on N2/Cortex A77, then this would be a different beast.

Comparison of Cortex A76 core vs. A77:

  • IPC jump 20% in INT and 35% in FPU
  • transistor count increase only 17%
  • core area 1.2mm2 A76 vs. 1.4mm2 A77 (both including 1MB L2 cache, compare to 3.6mm2 Zen2)

And if Neoverse N2 is based on the A78 Hercules unveiled a few weeks ago, with its +7% IPC and -5% die size, things become even more scary. IMHO I expected Altra Mystique to be based on A77. I really don't see any sense in investing a lot of money to create new masks for a 3-year-old architecture like A76 on the same process, just to add a few more cores. This is more of an Intel-like big-company approach, not that of a small startup like Ampere. I'd expect at least N7P and A77 cores, which would annihilate any x86 CPU including Zen4 at 5nm.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Richie Rich = dr. Sharikou of Intel bankrupt by Q3 2007 fame.

Still, for certain uses a 128-core monolith is better than 8 or 16 shards of ZenX cores, and certainly better than a 28- or 38-core monolith from Intel.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
No, neither Intel nor AMD will go bankrupt in the future. But they can end up like VIA, with 0.1% market share, selling some old legacy HW.

x86 CPUs have about 10% market share of total CPUs sold worldwide, and this share is still decreasing over time as ARM gains in servers and supercomputers. So yeah, x86 is on its way to disappearing.
 

Hitman928

Diamond Member
Apr 15, 2012
5,160
7,595
136
I agree that we need to wait for benchmarks, because neither Altra nor Zen3 is out yet. However, we can estimate based on Graviton2 @ 2.5 GHz, which is based on the same A76/N1 core:
  • 64c/64t Graviton2 score at Phoronix: 29.07 pts
  • 64c/128t AMD Epyc 7742 (Rome, Zen2): 39.93 pts (37% higher performance per socket)
  • 128c/128t Altra estimate = 2*G2: 58.14 pts (45% higher performance than AMD Rome)
  • 64c/128t Zen3 estimate = 20% IPC jump: 47.92 pts (39.93*1.2) (18% slower than Altra128)
  • 64c/128t Zen4 estimate = 10% IPC jump: 52.71 pts (47.92*1.1) (9% slower than Altra128)
  • 96c/192t Zen4 estimate: 79.07 pts (52.71*1.5) (35% higher performance than Altra128)
  • Altra is clocked at 3 GHz and boosts to 3.3 GHz, but I expect the 128-core version to be clocked similarly to G2. The gain from 2.5 -> 3.0 GHz is another 20%, so you can do the math.

So AMD would need Zen4 at 5nm to beat an ARM server CPU on old 7nm based on an old and slow core like A76 (from 2018). We expect ARM to unveil the new server architecture Neoverse N2 later this year. If Altra128 (Mystique) is based on N2/Cortex A77, then this would be a different beast.

Comparison of Cortex A76 core vs. A77:
  • IPC jump 20% in INT and 35% in FPU
  • transistor count increase only 17%
  • core area 1.2mm2 A76 vs. 1.4mm2 A77 (both including 1MB L2 cache, compare to 3.6mm2 Zen2)

And if Neoverse N2 is based on the A78 Hercules unveiled a few weeks ago, with its +7% IPC and -5% die size, things become even more scary. IMHO I expected Altra Mystique to be based on A77. I really don't see any sense in investing a lot of money to create new masks for a 3-year-old architecture like A76 on the same process, just to add a few more cores. This is more of an Intel-like big-company approach, not that of a small startup like Ampere. I'd expect at least N7P and A77 cores, which would annihilate any x86 CPU including Zen4 at 5nm.

In that test:

Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 16c/32t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?
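To show how much the scaling assumption matters, here is a rough Python sketch that applies the doubling factors listed above to the Graviton2 score instead of assuming 2x. Using dual-socket Epyc scaling as a stand-in for a monolithic 128-core part is purely an illustrative assumption, and the names in the code are hypothetical.

```python
GRAVITON2_64C = 29.07  # suite average quoted in the estimate being questioned

# Average gain observed in that test suite when core count was doubled.
DOUBLING_FACTORS = {
    "7742 -> 2x7742": 1.10,
    "7262 -> 2x7262": 1.17,
}


def scaled_estimates(base_score, factors):
    """Project a doubled-core score for each observed scaling factor."""
    return {label: base_score * factor for label, factor in factors.items()}


if __name__ == "__main__":
    print(f"linear 2x assumption: {2 * GRAVITON2_64C:.2f} pts")
    for label, score in scaled_estimates(GRAVITON2_64C, DOUBLING_FACTORS).items():
        print(f"scaling like {label}: {score:.2f} pts")
```

With scaling like that, a 128-core estimate lands in the low-to-mid 30s rather than at 58 pts, which is the gap the question is pointing at.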
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,439
14,409
136
In that test:

Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 16c/32t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?
I have learned to ignore anything he says for reasons like this.
 

podspi

Golden Member
Jan 11, 2011
1,965
71
91
I agree with you, SMT lowers thread performance and it's pretty clear from your Graviton2 test.
But there are two very different ways to use SMT:
  • today: put SMT on the current core (gain +20% throughput at the cost of lower per-thread performance, by a factor of 0.6x)
  • future: merge two A76 cores without SMT into one big core with SMT2 (gain +20% throughput while gaining 1.2x per-thread performance)

Such a core consisting of two A76 (3xALU+1xJump) would be 6xALU+2xJump wide, quite close to Apple's A13 Lightning core (4xALU+2xALU/Jump), so it is definitely feasible to create. IMHO shared resources can bring some nice gains in terms of efficiency. But it's definitely not easy to develop. And it doesn't make sense right now, when Apple shows there is a way to double IPC/throughput over A76.

You're describing CMT (Bulldozer) not SMT.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,683
1,218
136
You're describing CMT (Bulldozer) not SMT.
There is also clustered SMT, by the way.
=> https://player.slideplayer.com/78/12863145/slides/slide_16.jpg
Conclusion: "Large scale clustered SMT processors with many shared specialised or configurable function units come very close to the ideal of a softening of hardware and offer an attractive alternative to both, pure FPGA implementations and to costly ASIC designs."
Test design is derivative of https://www.semanticscholar.org/pap...ders/5a821d4ed6c23a66fd0e8d0eb3c3acc31a742e2f

Development tree wise: cSMT is the end for SMT and CMT.
CMP -> SMT -> cSMT
CMP -> CMT -> cSMT
Both styles of cSMT (strong thread-cluster cohesion, the CMT derivative, and weak thread-cluster cohesion, the SMT derivative) eventually fuse into distributive multithreading, which is by far the most advanced style of threading to date.
 