News ARM Server CPUs


Gideon

Golden Member
Nov 27, 2007
1,622
3,645
136
With all the upcoming ARM servers (let's not forget Nuvia, etc.) it probably makes sense to have one thread for them all, instead of creating new ones for each announcement (if not, I will rename it).

However:
AnandTech: Marvell Announces 3rd Gen Arm Server ThunderX3: 96 Cores/384 threads
ServeTheHome: Marvell ThunderX3 Arm Server CPU with 768 Threads in 2020

MarvellTX3_13_575px.jpg


Could be pretty impressive, though a 25% single-threaded performance gain seems a bit meh compared to the X2, which at 2.5 GHz was ~50% slower than a Xeon at 3.8 GHz.
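
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes "~50% slower" means half the Xeon's single-thread performance and that the claimed 25% gain applies to the same workload; the variable names are just for illustration.

```python
# Relative single-thread performance, normalised to the Xeon @ 3.8 GHz = 1.0.
# Assumption: "~50% slower" is read as half the Xeon's performance.
XEON = 1.0
tx2 = XEON * (1 - 0.50)   # ThunderX2 @ 2.5 GHz
tx3 = tx2 * (1 + 0.25)    # ThunderX3 with the claimed +25% ST gain

deficit = 1 - tx3 / XEON  # how far behind the Xeon the X3 would still be
print(f"TX3 vs Xeon: {tx3:.3f}x, i.e. {deficit:.1%} slower")
```

Under those assumptions the X3 would land at roughly 0.625x of that Xeon in single-threaded work, which is why the gain looks modest.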

But their marketing slides sure show potential:

MarvellTX3_16_575px.jpg



MarvellTX3_15_575px.jpg
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 16c/32t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?
Good catch! However, that poor scaling is a SW problem of that test suite. Don't you think that cloud providers like Amazon, Google or MS Azure will just put twice as many customers on it and so effectively solve that scaling issue?

I assumed linear scaling, of course, to keep it simple. Some SW, like raytracing engines, scales linearly. I guess nobody is gonna use a 128-core Altra for SW that cannot scale beyond 8 threads.
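
For reference, a small Python sketch of how far the quoted dual-socket results are from linear scaling. It only uses the two socket-doubling figures quoted above; `scaling_efficiency` is a made-up helper name, and "efficiency" here simply means measured speedup divided by the increase in cores.

```python
def scaling_efficiency(resource_ratio: float, perf_gain: float) -> float:
    """Fraction of ideal (linear) scaling achieved.

    resource_ratio: how many times more cores (2.0 = doubled)
    perf_gain:      measured average gain (0.10 = +10%)
    """
    return (1.0 + perf_gain) / resource_ratio

# Figures quoted above: doubling sockets in that test suite.
quoted = {
    "7742 -> 2x7742 (64c -> 128c)": (2.0, 0.10),
    "7262 -> 2x7262 (8c -> 16c)": (2.0, 0.17),
}
for name, (ratio, gain) in quoted.items():
    print(f"{name}: {scaling_efficiency(ratio, gain):.1%} of linear")
```

Linear scaling would give 100% here, so the gap between the linear assumption and that particular suite is large. Whether it matters depends on the workload mix, as argued above.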


You're describing CMT (Bulldozer) not SMT.
Jeeez Christ, please do not scare me with the Bulldozer horror. You didn't get my point, because it's the opposite. My idea was that a super-wide core with 6xALU+2xBranch, 4xAGU, 4xFPU + SMT2 would be more powerful than 2x cores with 3xALU+1xBranch, 2xAGU, 2xFPU and no SMT (Cortex A76). Transistor-wise they are similar, but performance-wise a ST load can use all 6xALU and 4xFPU. This idea was also behind SMT4 in Zen3: if AMD made Zen3 twice as wide as Zen2, it would consist of 6 or 8xALU, 4xAGU, 4xFPU (8x pipes). Applying SMT4 to that double-wide core wouldn't hurt performance per thread at all. Even some significant gain is possible.
 

podspi

Golden Member
Jan 11, 2011
1,965
71
91
Development tree wise: cSMT is the end for SMT and CMT.
CMP -> SMT -> cSMT
CMP -> CMT -> cSMT
Both styles of cSMT(strong thread-cluster cohesion(CMT-derivative) and weak thread-cluster cohesion(SMT-derivative)), eventually fuse into distributive multithreading. Which is by far the most advanced style of threading to date.

Are there any actual, shipping products with this technology? I've never heard of cSMT before.

Jeeez Christ, please do not scare me with the Bulldozer horror. You didn't get my point, because it's the opposite. My idea was that a super-wide core with 6xALU+2xBranch, 4xAGU, 4xFPU + SMT2 would be more powerful than 2x cores with 3xALU+1xBranch, 2xAGU, 2xFPU and no SMT (Cortex A76). Transistor-wise they are similar, but performance-wise a ST load can use all 6xALU and 4xFPU. This idea was also behind SMT4 in Zen3: if AMD made Zen3 twice as wide as Zen2, it would consist of 6 or 8xALU, 4xAGU, 4xFPU (8x pipes). Applying SMT4 to that double-wide core wouldn't hurt performance per thread at all. Even some significant gain is possible.

For Integer, yes, Bulldozer did not combine the ALUs, etc. However, the FPU and frontend were shared. I don't deny that the idea is plausible, but I can't think of any situation where this actually has happened. Look at the Power CPUs from IBM, tons of threads per core, but AFAIK if you are just using one thread you're just wasting transistors, not benefitting from those extra execution resources.

I do think we're going to start seeing more innovative designs as fabbing improvements slow down, but I don't think we're there yet with Zen3 or Zen4.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Are there any actual, shipping products with this technology? I've never heard of cSMT before.
POWER9 is pretty much cSMT.
With core slices being:
1x Retire/Scheduler + 1x Register File + 1x 128-bit Int/1x 128-bit FPU + 1x LSU & Cache
SMT4 contains two of the above. If IBM had strong thread-to-slice cohesion, it would be two cores with four threads.
SMT8 contains four of the above. This is the same as the above (SMT8 has two weakly cohesive SMT4 resources), but with four cores and eight threads.

However, the SMT8 model does have stronger cohesion: SMT1 only runs on one SMT4 core resource, and SMT2 can run on one SMT4 core resource (actual SMT2) or on both SMT4 core resources (pseudo-CMT/CMP).
Logical cores 0-3 can only run on core resource A (two physical cores with four logical core slots in total) and logical cores 4-7 on core resource B (two physical cores with four logical core slots).

10.1147/JRD.2018.2854039 => IBM POWER9 processor core
Figure 1 & 2
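
A toy Python sketch of the SMT8 placement rule as described in this post (logical cores 0-3 on core resource A, 4-7 on core resource B). The function names and the "A"/"B" labels are illustrative only, not IBM terminology, and this models nothing beyond that one rule.

```python
# Sketch of the SMT8 placement rule described above (names are illustrative,
# not IBM terminology): logical cores 0-3 -> core resource A, 4-7 -> B.
def core_resource(logical_core: int) -> str:
    if not 0 <= logical_core <= 7:
        raise ValueError("an SMT8 core exposes logical cores 0-7")
    return "A" if logical_core < 4 else "B"

def resources_used(active_logical_cores):
    """Which SMT4 core resources a set of active threads occupies."""
    return sorted({core_resource(lc) for lc in active_logical_cores})

print(resources_used([0, 1]))  # both threads share one resource ("actual SMT2")
print(resources_used([0, 4]))  # one thread per resource ("pseudo-CMT/CMP")
```

The two calls correspond to the two SMT2 cases mentioned above: two threads sharing one SMT4 resource versus one thread on each.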

//Absolute cohesion => Logical core 0 will always run on physical core 0.
//Strong cohesion => Logical core 0 will mostly run on physical core 0, with logical core 1 sometimes running on it if there is room and physical core 1 can't hit retire time constraints, vice versa.
CMT starts with the above.
SMT starts with the below.
//Weak cohesion => Logical cores 0/1 compete freely for all the resources of physical core 0.

In the ARM category the closest was all the SoftMachines processors at TSMC; 28HPM & 16FFP.
=> Sep 9, 2016, restart from scratch (AMD 2012 -> 2017), should be 4-5 years before we see it in Intel's TSMC 5nm push.
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
For Integer, yes, Bulldozer did not combine the ALUs, etc. However, the FPU and frontend were shared. I don't deny that the idea is plausible, but I can't think of any situation where this actually has happened. Look at the Power CPUs from IBM, tons of threads per core, but AFAIK if you are just using one thread you're just wasting transistors, not benefitting from those extra execution resources.

I do think we're going to start seeing more innovative designs as fabbing improvements slow down, but I don't think we're there yet with Zen3 or Zen4.
The idea is plausible but very hard to design. Look how long it took to add just one single ALU, moving from the 3xALU PIII to the 4xALU Haswell: more than a decade. You need to rebuild the entire core to do that. It's much easier to enlarge OoO buffers like Ice Lake did, or copy/paste to double the FPU width like Zen1->Zen2 did.

Apple went the hard way, and now it pays off for them, as they have a CPU core with almost double the IPC/PPC compared to x86 cores. And Apple is not using SMT at all, which is surprising, because a wide core and SMT are a win-win synergy. SMT increases efficiency under heavy load, and a smart scheduler can turn SMT off (by leaving a logical core empty).

I see the ideal future core as 8xALU, 4xAGU, 8xFPU + SMT4. The same number of units as two Zen2 cores, but with super strong ST IPC, higher MT IPC due to higher SMT4 efficiency, and fewer transistors thanks to shared resources. And if you can decouple the decoder and run micro-ops natively like Transmeta/Tachyum, then you can do code-morphing and you are partly ISA independent. IMO this is the future. OTOH I'm not saying it's easy to do.

From time to time some fool appears who didn't know it's impossible, and he creates something extraordinary (like Apple or Elon Musk these days). Apple's A11 Monsoon, the first 6xALU core in the world, is a good example. Apple probably forgot to ask AMD for some useful advice regarding the 2xALU Bulldozer. Funny? No, because Apple's A7 Cyclone, the first 4xALU ARM core, was released in September 2013 (the first x86 4xALU core, Haswell, came in June 2013) and had been in development since 2008(?), while Bulldozer was released in 2011. So yeah, better to do your own maximum no matter what others do (infinite game player).
 

Gideon

Golden Member
Nov 27, 2007
1,622
3,645
136
Well well, Nuvia throws down the gauntlet:

50% better performance than Zen 2 at 33% of the power, apparently. By the same guys that brought you the Apple cores:


NUVIA’s claim is that the Phoenix core is set to offer from +50% to +100% peak performance of the other cores, either for the same power as other Arm cores or for a third of the power of x86 cores. NUVIA’s wording for this graph includes the phrase ‘we have left the upper part of the curve out to fully disclose at a later date’, indicating that they likely intend for Phoenix cores to go beyond 5W per core.

N2_575px.png
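
As a rough Python sketch of what that claim would imply for performance per watt: this assumes both halves of the claim hold at the same operating point (+50% to +100% performance at one third of the x86 power), which is one reading of the marketing text rather than a measured result. `perf_per_watt_ratio` is a hypothetical helper.

```python
# Assumption: both halves of the claim hold at the same operating point,
# i.e. +50% to +100% performance while drawing one third of the x86 power.
def perf_per_watt_ratio(perf_ratio: float, power_ratio: float) -> float:
    """Perf/W relative to the baseline, given relative perf and relative power."""
    return perf_ratio / power_ratio

low = perf_per_watt_ratio(1.5, 1 / 3)
high = perf_per_watt_ratio(2.0, 1 / 3)
print(f"implied perf/W advantage: {low:.1f}x to {high:.1f}x")
```

On that reading, the implied efficiency advantage over the x86 cores would be somewhere around 4.5x to 6x per core.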
 

Hitman928

Diamond Member
Apr 15, 2012
5,243
7,792
136
I don't find too much interesting info from a single core performance graph that is seemingly normalized to idle power (if I'm reading it right). There is a lot of room for manipulation there. With that said, the single core performance claims, while vague, are certainly interesting and it will be interesting to see how far they plan to scale out their CPUs.

For competition, if they are using Arm v9 that probably puts them in the 2022 time frame at the earliest (I'm guessing). So you're looking at competing against the next round of ARM's enterprise design as well as Zen 4 at that point. Should be fun!
 

Gideon

Golden Member
Nov 27, 2007
1,622
3,645
136
Here is the NUVIA article:



At least they were pretty clear in what they tested and (unlike some not-to-be-named people here) actually tested mobile vs mobile, which makes the most sense, given what is currently available from ARM:

Table 1: List of devices tested


What does all this data mean for the server market? It means a lot. All current and future flagship server SoCs are power constrained, very much like mobile SoCs. This trend is only going to continue as there is a push to integrate even more cores. In addition, the AMD, Intel and ARM client computing cores tested are comparable to their current and future data center products. As core count increases, what is not increasing is the TDP. TDPs are likely going to remain in the 250W - 300W range, which is the maximum power that can be dissipated in an air-cooled environment in a typical datacenter. Hyperscalers and other enterprise data centers still must operate their servers within these TDP limits to optimize the TCO for their data centers. As more cores are added, the power allocation per core shrinks significantly. A rough calculation can be made to determine the high and low bounds of the per-core power allocation in servers. We can assume that future flagship SoCs will have a minimum of 64 cores and a maximum of 128 cores. The TDP range is 250W - 300W, and the power outside of the CPU can range between 10W - 120W depending upon the workload. Taking into consideration these factors, the amount of power that each CPU can be allocated ranges between 1W - 4.5W when heavily utilized, as is the case in a datacenter environment. Drawing a set of vertical bars denoting this power range, it becomes evident why the NUVIA Phoenix CPU core has the potential to reset the bar for the market. No matter which scenario is considered, either unconstrained peak performance or power/thermally constrained performance, Phoenix should have a significant lead. Below is a preview into the planned capabilities of Phoenix, however we have left the upper part of the curve out to fully disclose at a later date. When measured against current products available in-market in the 1W-4.5W power envelope (per core), the Phoenix CPU core performs up to 2X faster than the competition. 
NUVIA’s Phoenix CPU performance is projected using architectural performance modeling techniques consistent with industry-standard practices on future CPU cores.
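
The 1W - 4.5W per-core range in that quote can be reproduced with a quick Python sketch. The pairing of extremes (lowest TDP, highest non-CPU power and most cores for the low bound, and the reverse for the high bound) is my assumption about how they got there; the article only gives the input ranges. `per_core_power` is a made-up helper name.

```python
def per_core_power(tdp_w: float, uncore_w: float, cores: int) -> float:
    """Power left for each core after subtracting non-CPU power from the TDP."""
    return (tdp_w - uncore_w) / cores

# Low bound: smallest TDP, largest non-CPU power, most cores.
low = per_core_power(tdp_w=250, uncore_w=120, cores=128)
# High bound: largest TDP, smallest non-CPU power, fewest cores.
high = per_core_power(tdp_w=300, uncore_w=10, cores=64)
print(f"{low:.2f} W to {high:.2f} W per core")
```

That gives roughly 1.0 W to 4.5 W per core, which matches the range NUVIA draws its vertical bars around.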

I'm certainly not taking this at face value (as they clearly seem to be targeting investors), but I really like how they explained what they did, why they did it, the test setup and how they measured it.

In all measurements, we normalized the static idle power to remove constant system-level power taxes, and baseline idles are tuned to be as low as possible. We enabled all power management features, set panels to minimum brightness, turned off radios, and disabled all unneeded features. Batteries are fully charged and not charging while systems are under test. The power is measured at the battery output via the insertion of a high-precision series sense resistor and includes the DC-DC conversion loss of each platform. Typical high-efficiency buck converters used today have similar conversion losses, and therefore this is a more or less common factor amongst all devices.

And finally:

We realize the companies we have measured against in these tests are not standing still and will have new products in the market over the next 18 months. That said, we believe that even with significant performance gains (20%+) with new CPU architectures, we will continue to hold a clear position of leadership in performance-per-watt.

All in all, a PR talk but surprisingly informative and BS free, considering what most of the contemporaries put out (including Apple, etc)
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
I can only wish Nuvia good luck in being able to sell their product in the mass market as soon as possible. The longer they take, the less impressive their now-clear lead will be. And being an Arm platform, they'll need all the clear lead they can get to make a significant part of the market willing to take on all the hassle involved in switching.
 

ScopedAndDropped

Junior Member
Feb 15, 2020
7
3
41
So based on that graph, the 18-month product availability and Apple's current progress, the A15 should be somewhere within that range. Very impressive to claim that they'll match the current mobile performance king with their first product. They have the people to match the claim, but they are unproven as a team.

I hope that they do come close to what they claim. They'll realistically be competing with the A15, Cortex X2(?), Zen 4 and Golden Cove/whatever Sapphire Rapids uses (hopefully on TSMC's 6nm and not 10nm).
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
And being an Arm platform, they'll need all the clear lead they can get to make a significant part of the market willing to take on all the hassle involved in switching.

That's the sad part. They actually need the mentioned lead over the competition at launch time. I mean, switching to AMD EPYC should be trivial, yet Intel is making bank with server CPUs that are just much worse on any metric than AMD's offerings. So an ecosystem switch would need a gigantic incentive to happen.

Time is also an issue. Cloud providers would probably be the first to adopt, and then it will slowly trickle down. The question is whether it trickles down fast enough for Nuvia to actually make a profit before running out of money.
 

Gideon

Golden Member
Nov 27, 2007
1,622
3,645
136
Are they comparing a product from 18 months into the future using second gen N5 or N3 TSMC or Samsung equivalent vs N7 Zen2/Intel 14nm++ Skylake+?
They are (and obviously targeting investors), but even with 20% IPC gain on both Milan and Genoa and a 2x power/perf improvement from 5nm (as zen2 did on 7nm) they would still have a 40% power advantage and up to 10% perf advantage.

if they manage to pull it off that is.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Are they comparing a product from 18 months into the future using second gen N5 or N3 TSMC or Samsung equivalent vs N7 Zen2/Intel 14nm++ Skylake+?

N7 Zen2/Intel is not even close to what they aim for; N7 Apple A13 and ARM Cortex A77 are much closer. They are not even concerned about AMD/Intel at all.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,601
5,780
136
They are (and obviously targeting investors), but even with 20% IPC gain on both Milan and Genoa and a 2x power/perf improvement from 5nm (as zen2 did on 7nm) they would still have a 40% power advantage and up to 10% perf advantage.

if they manage to pull it off that is.
I don't know, I am having a sense of deja vu. As someone who commissioned some arm clusters in the past to automate the testing of applications for arm64 targets we deploy to field devices from our backend hubs, I have to say I am quite weary of such claims. We had to switch to using qemu and qemu multiarch docker images on x86 to emulate the arm64 targets.
Looks like I will have to deal with some team members with smart ideas again this time too. Anyway each to his own. Not all deployments are the same.
 

Gideon

Golden Member
Nov 27, 2007
1,622
3,645
136
I don't know, I am having a sense of deja vu. As someone who commissioned some arm clusters in the past to automate the testing of applications for arm64 targets we deploy to field devices from our backend hubs, I have to say I am quite weary of such claims. We had to switch to using qemu and qemu multiarch docker images on x86 to emulate the arm64 targets.
Looks like I will have to deal with some team members with smart ideas again this time too. Anyway each to his own. Not all deployments are the same.
True that, though I give Nuvia some benefit of the doubt. They have the guys who architected the Apple cores that are now going into laptops and desktops. They also have people from Google (including from their ML HW division) and the biggest name in ARM server software (ex-Red Hat's Jon Masters).

Even then, they still obviously have a lot to prove.
 

moinmoin

Diamond Member
Jun 1, 2017
4,944
7,656
136
N7 Zen2/Intel is not even close to what they aim for; N7 Apple A13 and ARM Cortex A77 are much closer. They are not even concerned about AMD/Intel at all.
Depends on the markets they try to target. Apple is no real rival since they won't ever replace any Apple Silicon. ARM Cortex they may replace for some licensees, but here it will fully depend on the balance whether the increase in performance and efficiency is worth the additional cost over the stock ARM design. Against AMD/Intel chips the potential of impressing people is the biggest, and the margins in the datacenter market are a likely target.
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
if they manage to pull it off that is.

Single-threaded? For sure possible (isn't Apple already close to that anyway in selected benchmarks?), but does it make sense for a server CPU? I think they will realize it's one thing to have this extreme ST performance but entirely another to make 32 or 64 such cores work together efficiently. And that for more "esoteric" workloads outside the consumer world, it's suddenly a problem to get such good ST.

EDIT: At least I would assume that is the case, or is x86 really that bad? Or do Intel and AMD engineers combined suck so bad compared to these "rockstars" at Nuvia that they can't get it done? I mean, if they can "come out of nowhere" so quickly and beat the top giant in these metrics by a landslide, one must ask why it is possible at all. x86 bad or own engineers bad? Or management/bureaucracy/office politics overhead?
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
Apr 30, 2015
131
10
81
Some references on ARM architecture, present and future:

This is a series of videos on ARMv8 by Grisenthwaite, their chief architect, from 2011:
Work began in 2007.

This is a description of SVE:

Further into the future, there is CHERI:
Capability Hardware Enhanced RISC Instructions. This will feature a test SoC and board, available to researchers in 2021. The intention is to feed into forthcoming architectures. It is of interest to hyper-scalers. AArch64 code will run on it, unmodified, but code will have to be modified to take advantage of the new features.

See also:
 

name99

Senior member
Sep 11, 2010
404
303
136
Well unless this new v8.7-A GCC entry is a placeholder I'd say that v9-A ist kaput?

Link here.

Maybe, maybe not.
For one thing we have no idea what the transition plan for v9 is.
They COULD have a temporary period during which devices can decode both v8 (hopefully 64-bit only!) and v9 instructions. How feasible this is depends on how ambitious they are in redesigning v9.

But they could also say that the clients for v9, at first anyway, are going to be cloud companies that are able to recompile everything from scratch, so no need for backward compatibility. I.e. the Neoverse track moves over to v9 fairly soon, even as the mobile track stays on v8 for as long as Google/Android want to remain there...
I mean, is Android, even today, fully 64-bit? Could ARM get away with shipping a v8 core that's 64-bit with no 32-bit, à la Apple?
 

soresu

Platinum Member
Dec 19, 2014
2,657
1,858
136
I mean, is Android, even today, fully 64-bit?
No.

Even the brand new Google Chromecast built on Android TV is in fact 32 bit because of the low RAM capacity - as is the cheaper variant of the new SHIELD TV with 2GB instead of 3GB on the more expensive model.

Sadly there are still many Android based devices that have less RAM than you would ideally need to run the 64 bit version of the OS.

As single RAM dies reach 32 Gbit (4 GB), we will start to see a step transition to all-64-bit Android, I think.
Could ARM get away with shipping a v8 core that's 64 bit with no 32-bit, ala Apple?
That's coming with Makalu in 2022, and there is a variant of Cortex A32 that is 64 bit only (A34?) for some reason, though I think it uses the original v8.0-A ISA revision rather than the v8.2-A used since A75 and A55.
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
Not got a subscription, so I don't know who it is. My guess would be Marvell?

Probably. That would be the easy guess anyway. It would be more interesting if it were Ampere or someone like that. Because Marvell has all but taken their ThunderX3 off the market, so it would hardly be news if they nixed the custom orders for it.