
News ARM Server CPUs

Gideon

Golden Member
Nov 27, 2007
1,092
1,943
136
With all the upcoming ARM servers (let's not forget Nuvia, etc.), it probably makes sense to have one thread for them all, instead of creating a new one for each announcement (if not, I will rename it).

However:
Anandtech: Marvell Announces 3rd Gen Arm Server ThunderX3: 96 Cores/384 Threads
ServeTheHome: Marvell ThunderX3 Arm Server CPU with 768 Threads in 2020



Could be pretty impressive, though the 25% single-threaded performance gain seems a bit meh, considering that ThunderX2 at 2.5 GHz was ~50% slower than a Xeon at 3.8 GHz.
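Normalizing those figures per clock gives a crude sense of the per-cycle gap (a back-of-envelope sketch using only the numbers above; performance per GHz is only a rough IPC proxy):

```python
def per_clock(relative_perf, clock_ghz):
    """Performance per GHz - a crude proxy for IPC."""
    return relative_perf / clock_ghz

xeon = per_clock(1.00, 3.8)   # Xeon baseline at 3.8 GHz
tx2  = per_clock(0.50, 2.5)   # ThunderX2: ~50% slower overall at 2.5 GHz
tx3  = tx2 * 1.25             # ThunderX3: claimed +25% single-thread gain

print(f"TX2 per clock vs Xeon: {tx2 / xeon:.2f}x")   # 0.76x
print(f"TX3 per clock vs Xeon: {tx3 / xeon:.2f}x")   # 0.95x
```

So even if the +25% lands, per-clock it only roughly closes the gap to that Xeon, not passes it.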

But their marketing slides sure show potential:




 

NostaSeronx

Diamond Member
Sep 18, 2011
3,209
724
136
You're describing CMT (Bulldozer) not SMT.
There is also clustered SMT, by the way.
=> https://player.slideplayer.com/78/12863145/slides/slide_16.jpg
Conclusion: "Large scale clustered SMT processors with many shared specialised or configurable function units come very close to the ideal of a softening of hardware and offer an attractive alternative to both, pure FPGA implementations and to costly ASIC designs."
Test design is derivative of https://www.semanticscholar.org/paper/5-GHz-32-bit-Integer-Execution-Core-in-130-nm-CMOS-Vangal-Anders/5a821d4ed6c23a66fd0e8d0eb3c3acc31a742e2f

Development-tree-wise, cSMT is the end state for both SMT and CMT:
CMP -> SMT -> cSMT
CMP -> CMT -> cSMT
Both styles of cSMT (strong thread-cluster cohesion, the CMT derivative, and weak thread-cluster cohesion, the SMT derivative) eventually fuse into distributive multithreading, which is by far the most advanced style of threading to date.
 

Hitman928

Diamond Member
Apr 15, 2012
3,200
3,209
136
What do all the experts think about SMT4 benefits now? :cool:
The Thunder line is strictly a server line of CPUs. There was never any doubt that SMT4 could be beneficial in many server applications. The PowerPC line already showed this. The push back against SMT4 was it being implemented in AMD CPUs that have a single architecture that covers all segments. SMT4 also tends to have some drawbacks in terms of efficiency but that's more a question of design balance than a showstopper. Marvell made some pretty bold statements when they announced ThunderX2 that didn't really pan out, so we'll see how well ThunderX3 does. On paper it looks like a much stronger competitor though it will be fighting Milan (AMD Zen 3 cores) more than Rome or anything Intel, so we'll have to wait for both to release to see how things pan out.
 

Andrei.

Senior member
Jan 26, 2015
316
384
136
What do all the experts think about SMT4 benefits now? :cool:
It's completely useless in client workloads. As I explain in the article, the only gains are in data-plane workloads in which the work data is accessed at higher latency; i.e. server and cloud environments. Marvell very much said it themselves in their briefing that higher compute workloads won't see any large benefits.
 

DrMrLordX

Lifer
Apr 27, 2000
16,641
5,644
136
What do all the experts think about SMT4 benefits now? :cool:
Not much. They're bagging on ST performance.

edit: it's good to see that Marvell is still working on the ThunderX line. I won't deny that. They seem to be carving themselves a niche instead of trying to go head-to-head with Ampere, Huawei, and Amazon. Also where are the Huawei benchmarks? Their core is custom but should be close to A76, like Graviton2.
 

Richie Rich

Senior member
Jul 28, 2019
470
227
76
It's completely useless in client workloads. As I explain in the article, the only gains are in data-plane workloads in which the work data is accessed at higher latency; i.e. server and cloud environments. Marvell very much said it themselves in their briefing that higher compute workloads won't see any large benefits.
I agree with you, SMT lowers per-thread performance, and that's pretty clear from your Graviton2 test.
But there are two very different ways to use SMT:
  • today's: put SMT on the current core (gain ~+20% throughput at the cost of lowering per-thread performance to ~0.6x)
  • future: merge two A76 cores without SMT into one big core with SMT2 (gain ~+20% throughput while gaining ~1.2x per-thread performance)

Such a core consisting of two A76s (3xALU+1xJump each) would be 6xALU+2xJump wide, quite close to Apple's A13 Lightning core (4xALU+2xALU/Jump), so definitely feasible to create. IMHO shared resources can bring some nice gains in terms of efficiency. But it's definitely not easy to develop. And it doesn't make sense right now when Apple shows there is a way to double IPC/throughput over A76.
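The two strategies can be sketched numerically (a toy model; the 0.6x and 1.2x multipliers are the claims above, not measurements):

```python
# Toy model of the two SMT strategies, using the post's own factors.
# Baseline = one A76-class core running a single thread at speed 1.0.

def smt_on_existing_core(base=1.0):
    """SMT2 bolted onto the current core: ~+20% total throughput,
    but the two threads share one core's resources."""
    throughput = base * 1.2
    per_thread = throughput / 2      # each thread runs at ~0.6x solo speed
    return throughput, per_thread

def merged_wide_core(base=1.0):
    """Two A76-width cores fused into one double-wide SMT2 core:
    ~+20% throughput over the two separate cores (2.0 -> 2.4)."""
    throughput = 2 * base * 1.2
    per_thread = throughput / 2      # each thread gets ~1.2x solo speed
    return throughput, per_thread

print(smt_on_existing_core())   # (1.2, 0.6)
print(merged_wide_core())       # (2.4, 1.2)
```

The difference is that in the merged case a lone thread can in principle use the full 6-ALU width, so per-thread speed goes up rather than down.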
 

amrnuke

Senior member
Apr 24, 2019
999
1,506
96
What do all the experts think about SMT4 benefits now? :cool:
Nothing new. SMT2 produces similar benefits. At best, SMT4 is offering an 80% improvement over SMT1 in this application, which is not even double despite having quadruple the threads. Also, we don't know how SMT4 compares to SMT2 in ThunderX3 chips. So your touting these benchmarks as proving SMT4 over SMT2 is curious. Can you explain what you are implying? :cool:

Some other thoughts on the release:
- They are comparing a 96 core, 384 thread part against a 64 core, 128 thread part (7742) in these comparisons. Even with 384 threads, triple that of 7742, it only manages, in the best case, to double performance, and in most cases not even that.
- We don't know what the real power consumption is. They claim a great performance per watt number, which is expected on a 96 core, 384 thread part running at lower clocks, but we will need more data on a broad number of benchmarks to make any statement about this chip.
- If AMD put out a 96-core, 192-thread part and clocked it low to hit a 240W TDP, I'm guessing its performance per watt would also be great.
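The arithmetic behind that "not even double" point, as a quick sketch (the 1.8x SMT4 figure is the one quoted above; the +30% SMT2 figure is purely an assumed illustration):

```python
# Per-thread speed implied by a given SMT throughput gain:
# more threads sharing one core means each runs slower than solo.

def per_thread_speed(throughput_gain, threads_per_core):
    """Relative speed of each thread vs. running alone on the core."""
    return throughput_gain / threads_per_core

smt4 = per_thread_speed(1.8, 4)   # quoted ~80% SMT4 uplift -> 0.45x per thread
smt2 = per_thread_speed(1.3, 2)   # assumed +30% SMT2 uplift -> 0.65x per thread
print(smt4, smt2)                 # 0.45 0.65
```

In other words, at an 80% uplift each of the four SMT4 threads runs at well under half its solo speed, which is why latency-sensitive workloads see little benefit.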
 

Nothingness

Platinum Member
Jul 3, 2013
2,153
398
126
Some other thoughts on the release:
- They are comparing a 96 core, 384 thread part against a 64 core, 128 thread part (7742) in these comparisons. Even with 384 threads, triple that of 7742, it only manages, in the best case, to double performance, and in most cases not even that.
Is there a CPU with more cores/threads per socket? If not, then the comparison is valid. Better performance per socket is a very good thing for a class of workloads.

- We don't know what the real power consumption is. They claim a great performance per watt number, which is expected on a 96 core, 384 thread part running at lower clocks, but we will need more data on a broad number of benchmarks to make any statement about this chip.
Agreed. And the same applies to performance, we need independent testing.

- If AMD put out a 96-core, 192-thread part and clocked it low to hit a 240W TDP, I'm guessing its performance per watt would also be great.
Can they make a 96 core/192 thread chip? I guess if they economically could, they would, as there's a market for such chips.
 

Richie Rich

Senior member
Jul 28, 2019
470
227
76
A 128-core version of the ARM server CPU Ampere Altra is coming next year, probably still A76/N1 Neoverse on 7nm TSMC. If AMD Rome had trouble defeating the 64-core ARM Graviton2 in some areas (like performance per thread), then this 128-core CPU will be more powerful than 64-core Rome for sure. The only question is how Zen 3 will stand against this 128-core monolith. If Zen 3 keeps the same 64-core count, then the 128-core Altra could be faster.

 

LightningZ71

Senior member
Mar 10, 2017
571
511
106
I believe that the server-side improvements for Zen 3 are somewhat more modest. The CCX improvements should improve average multi-threaded benchmarks a bit, both by making the L3 a bit more efficient and by improving inter-thread communication a bit. The IPC improvements will lift overall scores by a noticeable amount. The refinement to the node will add a little efficiency and clock speed. AVX capabilities are also expected to be improved.

I think that, in the relevant multi-threaded benchmarks, we'll see Zen 3 Epyc be at least 20-25% faster per thread at a minimum for anything that isn't highly memory bound. For certain workloads, 50% improvement isn't unthinkable.

The question is, will it be enough to keep up with the competitive ARM server chips? AMD won't have more cores per package until Zen 4 at the earliest unless they heavily revise the EPYC package, go to a smaller I/O die, and pack on a few more chiplets.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,153
1,674
136
I believe that the server-side improvements for Zen 3 are somewhat more modest. The CCX improvements should improve average multi-threaded benchmarks a bit, both by making the L3 a bit more efficient and by improving inter-thread communication a bit. The IPC improvements will lift overall scores by a noticeable amount. The refinement to the node will add a little efficiency and clock speed. AVX capabilities are also expected to be improved.

I think that, in the relevant multi-threaded benchmarks, we'll see Zen 3 Epyc be at least 20-25% faster per thread at a minimum for anything that isn't highly memory bound. For certain workloads, 50% improvement isn't unthinkable.

The question is, will it be enough to keep up with the competitive ARM server chips? AMD won't have more cores per package until Zen 4 at the earliest unless they heavily revise the EPYC package, go to a smaller I/O die, and pack on a few more chiplets.
Next year we will see Zen 4 on 5nm.
Where can I buy an Ampere 80-core? Where are the benchmarks?
Why is the Ampere 80-core being compared to Zen 2? After all, Zen 3 is sampling to vendors just like Altra is; Altra is a 2020 release product just like Milan is.

I suppose it doesn't look very good for a certain someone to compare to the actual competition (I don't expect Ampere to compare to an unknown product), but we can at least project a range of values.
 

DisEnchantment

Senior member
Mar 3, 2017
699
1,619
106
If you think Cavium ThunderX blew the socks off x86 server chips, wait till you see Tachyum Prodigy
Their universal processor will beat all known CPUs and GPUs that have ever existed, by massive margins. Official statement. (You can also find this in @kokhua's tweets.)
Then Nuvia will top that and some more.
 

yuri69

Member
Jul 16, 2013
77
67
91
If you think Cavium ThunderX blew the socks off x86 server chips, wait till you see Tachyum Prodigy
Their universal processor will beat all known CPUs and GPUs that have ever existed, by massive margins. Official statement.
Have you actually seen any of the architecture descriptions of the Tachyum Prodigy? The massive reliance on the compiler is a red flag - IA64 says hello.
 

DisEnchantment

Senior member
Mar 3, 2017
699
1,619
106
Have you actually seen any of the architecture descriptions of the Tachyum Prodigy? The massive reliance on the compiler is a red flag - IA64 says hello.
Why are you doubting? They said it. So it is true.

And.. btw
You will not be able to buy it anywhere
We can ship one to you, but you have to sign an NDA
For benchmarks you have to trust our numbers.
 

Richie Rich

Senior member
Jul 28, 2019
470
227
76
Next year we will see Zen 4 on 5nm.
Where can I buy an Ampere 80-core? Where are the benchmarks?
Why is the Ampere 80-core being compared to Zen 2? After all, Zen 3 is sampling to vendors just like Altra is; Altra is a 2020 release product just like Milan is.
I agree that we need to wait for benchmarks because neither Altra nor Zen 3 is out yet. However, we can estimate based on Graviton2 @ 2.5 GHz, which is based on the same A76/N1 core:
  • 64c/64t Graviton2 score at Phoronix: 29.07 pts
  • 64c/128t AMD Epyc 7742 (Rome, Zen 2): 39.93 pts (37% higher performance per socket)
  • 128c/128t Altra estimate = 2x G2: 58.14 pts (45% higher performance than AMD Rome)
  • 64c/128t Zen 3 estimate, +20% IPC: 47.92 pts (39.93*1.2) (18% slower than Altra128)
  • 64c/128t Zen 4 estimate, +10% IPC: 52.71 pts (47.92*1.1) (9% slower than Altra128)
  • 96c/192t Zen 4 estimate: 79.07 pts (52.71*1.5) (35% faster than Altra128)
  • Altra is clocked @ 3 GHz with boost to 3.3 GHz, but I expect the 128-core version to clock similarly to G2. The gain from 2.5 -> 3.0 GHz is another 20%, so you can do the math.
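That estimate chain can be written out so the assumptions are explicit (the two scores are the Phoronix figures quoted above; every multiplier is an assumption: perfectly linear core scaling and flat IPC gains):

```python
# Measured inputs (Phoronix geometric-mean scores quoted above).
graviton2 = 29.07                  # 64c/64t Graviton2
rome      = 39.93                  # 64c/128t Epyc 7742 (Rome)

# Assumed multipliers, per the post.
altra128 = 2 * graviton2           # linear 2x core scaling: 58.14
zen3     = rome * 1.2              # +20% IPC, same core count: ~47.92
zen4_64c = zen3 * 1.1              # further +10% IPC: ~52.71
zen4_96c = zen4_64c * 1.5          # linear scaling to 96 cores: ~79.06

print(f"Altra128 vs Rome:     {altra128 / rome - 1:+.0%}")
print(f"Zen3 vs Altra128:     {zen3 / altra128 - 1:+.0%}")
print(f"Zen4-96c vs Altra128: {zen4_96c / altra128 - 1:+.0%}")
```

Every conclusion downstream rides on that first "linear 2x" multiplier.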

So AMD would need Zen 4 at 5nm to beat an ARM server CPU on old 7nm based on an old and slow core like A76 (from 2018). We expect ARM to unveil the new server architecture Neoverse N2 later this year. If Altra128 (Mystique) is based on N2/Cortex A77, then this would be a different beast.

Comparison of Cortex A76 core vs. A77:

  • IPC jump 20% in INT and 35% in FPU
  • transistor count increase only 17%
  • core area 1.2mm2 A76 vs. 1.4mm2 A77 (both including 1MB L2 cache, compared to 3.6mm2 for Zen 2)

And if Neoverse N2 is based on the A78 Hercules unveiled a few weeks ago, with its +7% IPC and -5% die size, things become even more scary. IMHO I expected Altra Mystique to be based on A77. I really don't see any sense in investing a lot of money to create new masks for a 3-year-old architecture like A76 on the same process, just to add a few more cores. That is more a big-company, Intel-like approach, not a small startup like Ampere. I'd expect at least N7P and A77 cores, which would annihilate any x86 CPU including Zen 4 at 5nm.
 

JoeRambo

Senior member
Jun 13, 2013
913
648
136
Richie Rich = Dr. Sharikou of "Intel bankrupt by Q3 2007" fame.

Still, for certain uses 128 cores in a monolith are better than 16 or 8 chiplet shards of ZenX cores. And certainly better than a 28- or 38-core monolith from Intel.
 

Hitman928

Diamond Member
Apr 15, 2012
3,200
3,209
136
I agree that we need to wait for benchmarks because neither Altra nor Zen 3 is out yet. However, we can estimate based on Graviton2 @ 2.5 GHz, which is based on the same A76/N1 core:
  • 64c/64t Graviton2 score at Phoronix: 29.07 pts
  • 64c/128t AMD Epyc 7742 (Rome, Zen 2): 39.93 pts (37% higher performance per socket)
  • 128c/128t Altra estimate = 2x G2: 58.14 pts (45% higher performance than AMD Rome)
  • 64c/128t Zen 3 estimate, +20% IPC: 47.92 pts (39.93*1.2) (18% slower than Altra128)
  • 64c/128t Zen 4 estimate, +10% IPC: 52.71 pts (47.92*1.1) (9% slower than Altra128)
  • 96c/192t Zen 4 estimate: 79.07 pts (52.71*1.5) (35% faster than Altra128)
  • Altra is clocked @ 3 GHz with boost to 3.3 GHz, but I expect the 128-core version to clock similarly to G2. The gain from 2.5 -> 3.0 GHz is another 20%, so you can do the math.

So AMD would need Zen 4 at 5nm to beat an ARM server CPU on old 7nm based on an old and slow core like A76 (from 2018). We expect ARM to unveil the new server architecture Neoverse N2 later this year. If Altra128 (Mystique) is based on N2/Cortex A77, then this would be a different beast.

Comparison of Cortex A76 core vs. A77:
  • IPC jump 20% in INT and 35% in FPU
  • transistor count increase only 17%
  • core area 1.2mm2 A76 vs. 1.4mm2 A77 (both including 1MB L2 cache, compared to 3.6mm2 for Zen 2)

And if Neoverse N2 is based on the A78 Hercules unveiled a few weeks ago, with its +7% IPC and -5% die size, things become even more scary. IMHO I expected Altra Mystique to be based on A77. I really don't see any sense in investing a lot of money to create new masks for a 3-year-old architecture like A76 on the same process, just to add a few more cores. That is more a big-company, Intel-like approach, not a small startup like Ampere. I'd expect at least N7P and A77 cores, which would annihilate any x86 CPU including Zen 4 at 5nm.
In that test:

Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 16c/32t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?
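One way to quantify that objection is to invert Amdahl's law and ask what non-scaling fraction would explain the observed dual-socket numbers (a rough model: it lumps memory-bound, I/O-bound, and fixed-size work into one "serial" fraction):

```python
def serial_fraction(speedup, n):
    """Solve Amdahl's law, speedup = 1 / (f + (1 - f) / n), for f."""
    return (1 / speedup - 1 / n) / (1 - 1 / n)

# Doubling 7742 sockets bought only ~10% average speedup in that suite,
# implying roughly 82% of the runtime does not scale with cores.
print(f"{serial_fraction(1.10, 2):.0%}")   # 82%
```

A suite that non-scaling gives no basis for assuming a perfect 2x from doubling Graviton2's cores.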
 

Markfw

CPU Moderator, VC&G Moderator, Elite Member
Super Moderator
May 16, 2002
20,886
9,080
136
In that test:

Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 16c/32t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?
I have learned to ignore anything he says for reasons like this.
 

podspi

Golden Member
Jan 11, 2011
1,938
39
91
I agree with you, SMT lowers per-thread performance, and that's pretty clear from your Graviton2 test.
But there are two very different ways to use SMT:
  • today's: put SMT on the current core (gain ~+20% throughput at the cost of lowering per-thread performance to ~0.6x)
  • future: merge two A76 cores without SMT into one big core with SMT2 (gain ~+20% throughput while gaining ~1.2x per-thread performance)

Such a core consisting of two A76s (3xALU+1xJump each) would be 6xALU+2xJump wide, quite close to Apple's A13 Lightning core (4xALU+2xALU/Jump), so definitely feasible to create. IMHO shared resources can bring some nice gains in terms of efficiency. But it's definitely not easy to develop. And it doesn't make sense right now when Apple shows there is a way to double IPC/throughput over A76.
You're describing CMT (Bulldozer) not SMT.
 

Richie Rich

Senior member
Jul 28, 2019
470
227
76
Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 16c/32t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?
Good catch! However, that poor scaling is a software problem of that test suite. Don't you think cloud providers like Amazon, Google or MS Azure will just sell customers twice as many instances and so effectively sidestep that scaling issue?

I assumed linear scaling, of course, to keep it simple. Some software, like raytracing engines, scales linearly. I guess nobody is gonna use a 128-core Altra for software that cannot scale beyond 8 threads.


You're describing CMT (Bulldozer) not SMT.
Jeez, do not scare me with the Bulldozer horror, please. You didn't get my point; it's the opposite way around. My idea was that a super-wide core with 6xALU+2xBranch, 4xAGU, 4xFPU and SMT2 would be more powerful than 2x cores with 3xALU+1xBranch, 2xAGU, 2xFPU and no SMT (Cortex A76). Transistor-wise similar, but performance-wise a single thread can use all 6 ALUs and 4 FPUs. This idea was also behind the SMT4-in-Zen3 speculation - if AMD created Zen 3 twice as wide as Zen 2, it would consist of 6 or 8xALU, 4xAGU, 4xFPU (8 pipes). Applying SMT4 to that double-wide core wouldn't hurt performance per thread at all; even a significant gain is possible.
 

podspi

Golden Member
Jan 11, 2011
1,938
39
91
Development-tree-wise, cSMT is the end state for both SMT and CMT:
CMP -> SMT -> cSMT
CMP -> CMT -> cSMT
Both styles of cSMT (strong thread-cluster cohesion, the CMT derivative, and weak thread-cluster cohesion, the SMT derivative) eventually fuse into distributive multithreading, which is by far the most advanced style of threading to date.
Are there any actual, shipping products with this technology? I've never heard of cSMT before.

Jeez, do not scare me with the Bulldozer horror, please. You didn't get my point; it's the opposite way around. My idea was that a super-wide core with 6xALU+2xBranch, 4xAGU, 4xFPU and SMT2 would be more powerful than 2x cores with 3xALU+1xBranch, 2xAGU, 2xFPU and no SMT (Cortex A76). Transistor-wise similar, but performance-wise a single thread can use all 6 ALUs and 4 FPUs. This idea was also behind the SMT4-in-Zen3 speculation - if AMD created Zen 3 twice as wide as Zen 2, it would consist of 6 or 8xALU, 4xAGU, 4xFPU (8 pipes). Applying SMT4 to that double-wide core wouldn't hurt performance per thread at all; even a significant gain is possible.
For integer, yes, Bulldozer did not combine the ALUs, etc. However, the FPU and frontend were shared. I don't deny that the idea is plausible, but I can't think of any situation where it has actually happened. Look at the POWER CPUs from IBM: tons of threads per core, but AFAIK if you are using just one thread you're just wasting transistors, not benefiting from those extra execution resources.

I do think we're going to start seeing more innovative designs as fabbing improvements slow down, but I don't think we're there yet with Zen3 or Zen4.
 
