News ARM Server CPUs


Richie Rich

Senior member
Jul 28, 2019
421
196
76
Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 16c/32t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?
Good catch! However, that poor scaling is a SW problem of that test suite. Don't you think that cloud providers like Amazon, Google or MS Azure will just put twice as many customers on the machine and so effectively solve the scaling issue?

I assumed linear scaling, of course, to keep it simple. Some SW, like raytracing engines, scales linearly. I guess nobody is gonna use a 128-core Altra for SW that cannot scale beyond 8 threads.
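To put rough numbers on the scaling argument, here is a minimal sketch that assumes the test suite behaves like a single Amdahl's-law workload. That is a simplification of mine, not something either poster claimed: the suite is an average over many benchmarks, and going to two sockets also adds NUMA effects. The function names are made up for illustration. It back-solves the parallel fraction implied by the quoted gains when the core count doubles.

```python
def amdahl_speedup(p, n):
    """Speedup over one core for parallel fraction p on n cores (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)


def implied_parallel_fraction(gain, n):
    """Parallel fraction p that yields `gain` (e.g. 1.10) when going from n to 2n cores.

    Derived by solving amdahl_speedup(p, 2n) / amdahl_speedup(p, n) == gain for p.
    """
    return (gain - 1.0) / ((gain - 1.0) + (2.0 - gain) / (2.0 * n))


# Gains quoted in the thread: (label, gain when doubling, cores before doubling)
quoted = [("7742 -> 2x7742", 1.10, 64), ("7262 -> 2x7262", 1.17, 8)]

for label, gain, cores in quoted:
    p = implied_parallel_fraction(gain, cores)
    print(f"{label}: implied parallel fraction ~ {p:.3f}")
```

A gain of exactly 2.0 maps to p = 1.0, which is the linear-scaling case assumed above for raytracing-style SW.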


You're describing CMT (Bulldozer) not SMT.
Jeez, please do not scare me with that Bulldozer horror. You didn't get my point, because it's the opposite way around. My idea was that a super wide core with 6xALU+2xBranch, 4xAGU, 4xFPU + SMT2 would be more powerful than 2x cores with 3xALU+1xBranch, 2xAGU, 2xFPU and no SMT (Cortex A76). Transistor-wise they are similar, but performance-wise a ST load can use all 6xALU and 4xFPU. This idea was also behind SMT4 in Zen3: if AMD created Zen3 twice as wide as Zen2, it would consist of 6 or 8xALU, 4xAGU, 4xFPU (8x pipes). Applying SMT4 to that double-wide core wouldn't hurt performance per thread at all. Even some significant gain is possible.
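A toy model of that argument, only as a sketch: assume each thread has some ILP (how many ALU ops per cycle it could issue with unlimited units), and a core issues at most as many ops per cycle as it has ALUs. Everything else (frontend, caches, scheduler, FPU/AGU limits) is ignored, and the function names and ILP numbers are made up for illustration.

```python
def core_ipc(alus, thread_ilp):
    """ALU ops per cycle for one core: capped by ALU count and by what its threads can supply."""
    return min(alus, sum(thread_ilp))


def wide_smt2(thread_ilp):
    """One 6xALU core running up to two threads via SMT2."""
    assert len(thread_ilp) <= 2
    return core_ipc(6, thread_ilp)


def two_narrow(thread_ilp):
    """Two 3xALU cores without SMT, one thread per core."""
    assert len(thread_ilp) <= 2
    return sum(core_ipc(3, [ilp]) for ilp in thread_ilp)


# (description, hypothetical per-thread ILP)
cases = [("one high-ILP thread", [5]),
         ("two uneven threads", [2, 4]),
         ("two even threads", [3, 3])]

for name, ilp in cases:
    print(f"{name}: wide SMT2 = {wide_smt2(ilp)}, 2x narrow = {two_narrow(ilp)}")
```

In this model the wide core wins whenever one thread can use more than 3 ALUs or the other thread leaves units idle, and ties when both threads fit a narrow core exactly.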
 

podspi

Golden Member
Jan 11, 2011
1,925
22
81
Development tree wise: cSMT is the end for SMT and CMT.
CMP -> SMT -> cSMT
CMP -> CMT -> cSMT
Both styles of cSMT (strong thread-cluster cohesion (CMT-derivative) and weak thread-cluster cohesion (SMT-derivative)) eventually fuse into distributive multithreading, which is by far the most advanced style of threading to date.
Are there any actual, shipping products with this technology? I've never heard of cSMT before.

Jeez, please do not scare me with that Bulldozer horror. You didn't get my point, because it's the opposite way around. My idea was that a super wide core with 6xALU+2xBranch, 4xAGU, 4xFPU + SMT2 would be more powerful than 2x cores with 3xALU+1xBranch, 2xAGU, 2xFPU and no SMT (Cortex A76). Transistor-wise they are similar, but performance-wise a ST load can use all 6xALU and 4xFPU. This idea was also behind SMT4 in Zen3: if AMD created Zen3 twice as wide as Zen2, it would consist of 6 or 8xALU, 4xAGU, 4xFPU (8x pipes). Applying SMT4 to that double-wide core wouldn't hurt performance per thread at all. Even some significant gain is possible.
For Integer, yes, Bulldozer did not combine the ALUs, etc. However, the FPU and frontend were shared. I don't deny that the idea is plausible, but I can't think of any situation where this actually has happened. Look at the Power CPUs from IBM, tons of threads per core, but AFAIK if you are just using one thread you're just wasting transistors, not benefitting from those extra execution resources.

I do think we're going to start seeing more innovative designs as fabbing improvements slow down, but I don't think we're there yet with Zen3 or Zen4.
 

NostaSeronx

Platinum Member
Sep 18, 2011
2,990
579
126
Are there any actual, shipping products with this technology? I've never heard of cSMT before.
POWER9 is pretty much cSMT, with each core slice being:
1x Retire/Scheduler + 1x Register File + 1x 128-bit Int/1x 128-bit FPU + 1x LSU & Cache
SMT4 contains two of the above. If IBM had strong thread-to-slice cohesion, it would be two cores with four threads.
SMT8 contains four of the above. It is the same idea (SMT8 has two weakly cohesive SMT4 resources), but with strong cohesion it would be four cores with eight threads.

However, the SMT8 model does have stronger cohesion: SMT1 only runs on one SMT4 core resource, and SMT2 can run on one SMT4 core resource (actual SMT2) or on both SMT4 core resources (pseudo-CMT/CMP).
Logical cores 0-3 can only run on core resource A (two physical cores that have four logical core slots in total) and logical cores 4-7 on core resource B (two physical cores with four logical core slots).
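The logical-core placement described above can be written down as a tiny lookup. This is only a sketch of the mapping as stated in this post, not something taken from IBM documentation; the function name and the "A"/"B" labels are just for illustration.

```python
def core_resource(logical_core):
    """Return which SMT4 core resource ("A" or "B") an SMT8 logical core runs on.

    Per the description above: logical cores 0-3 -> resource A, 4-7 -> resource B.
    """
    if not 0 <= logical_core <= 7:
        raise ValueError("an SMT8 core has logical cores 0-7")
    return "A" if logical_core < 4 else "B"


for lc in range(8):
    print(f"logical core {lc} -> core resource {core_resource(lc)}")
```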

10.1147/JRD.2018.2854039 => IBM POWER9 processor core
Figure 1 & 2

// Absolute cohesion => Logical core 0 will always run on physical core 0.
// Strong cohesion => Logical core 0 will mostly run on physical core 0, with logical core 1 sometimes running on it if there is room and physical core 1 can't hit its retire time constraints, and vice versa.
CMT starts with the two above.
SMT starts with the one below.
// Weak cohesion => Logical cores 0/1 have a free-for-all over the resources of physical core 0.

In the ARM category, the closest were the Soft Machines processors at TSMC (28HPM & 16FFP).
=> Sep 9, 2016, restart from scratch (AMD 2012 -> 2017), should be 4-5 years before we see it in Intel's TSMC 5nm push.
 

Richie Rich

Senior member
Jul 28, 2019
421
196
76
For Integer, yes, Bulldozer did not combine the ALUs, etc. However, the FPU and frontend were shared. I don't deny that the idea is plausible, but I can't think of any situation where this actually has happened. Look at the Power CPUs from IBM, tons of threads per core, but AFAIK if you are just using one thread you're just wasting transistors, not benefitting from those extra execution resources.

I do think we're going to start seeing more innovative designs as fabbing improvements slow down, but I don't think we're there yet with Zen3 or Zen4.
The idea is plausible but very hard to design. Look how long it took to add just one single ALU, moving from 3xALU PIII to 4xALU Haswell: more than a decade. You need to rebuild the entire core to do that. It's much easier to boost OoO buffers like Ice Lake did, or to copy/paste double FPU width like Zen1->Zen2 did.

Apple went the hard way and now it pays off for them, as they have a CPU core with almost double the IPC/PPC compared to x86 cores. And Apple is not using SMT at all, which is surprising, because a wide core plus SMT is a win-win synergy. SMT increases efficiency under heavy load, and a smart scheduler can effectively turn SMT off (by leaving the second logical core empty).

I see the ideal future core as 8xALU, 4xAGU, 8xFPU + SMT4. That is the same number of units as two Zen2 cores, but with super strong ST IPC, higher MT IPC due to higher SMT4 efficiency, and fewer transistors thanks to shared resources. And if you can decouple the decoder and run micro-ops natively like Transmeta/Tachyum, then you can do code-morphing and you are partly ISA independent. IMO this is the future. OTOH, I'm not saying it's easy to do.

From time to time some fool appears who doesn't know it's impossible, and he creates something extraordinary (like Apple or Elon Musk these days). Apple's A11 Monsoon, the first 6xALU core in the world, is a good example. Apple probably forgot to ask AMD for some useful advice regarding 2xALU Bulldozer. Funny? No, because Apple's A7 Cyclone, the first 4xALU ARM core, was released in September 2013 (the first x86 4xALU core, Haswell, came in June 2013) and had been in development since 2008(?), while Bulldozer was released in 2011. So yeah, better to do your own maximum no matter what others do (infinite game player).
 
