News ARM Server CPUs

Gideon · Mar 16, 2020

With all the upcoming ARM servers (let's not forget Nuvia, etc.) it probably makes sense to have one thread for them all, instead of creating a new one for each announcement (if not, I will rename it).

However:
AnandTech: Marvell Announces 3rd Gen Arm Server ThunderX3: 96 Cores/384 Threads
ServeTheHome: Marvell ThunderX3 Arm Server CPU with 768 Threads in 2020

Could be pretty impressive, though a 25% single-threaded performance gain seems a bit meh compared to the X2, which at 2.5 GHz was ~50% slower than a Xeon at 3.8 GHz.

But their marketing slides sure show potential:

Richie Rich · Jun 24, 2020

Hitman928 said:
Going from 64c/128t AMD Epyc (7742) to 128c/256t Epyc (2x7742) results in a 10% average performance increase.
Going from 8c/16t AMD Epyc (7262) to 16c/32t Epyc (2x7262) results in a 17% average performance increase.
Going from 16c/32t AMD Epyc (7F52) to lower clocked 64c/128t Epyc (7742) results in a 4% average performance increase.

How in the world do you estimate 128c/128t Altra will be 2* 64c/64t Graviton2 in average performance in that test suite?

Good catch! However, that poor scaling is a SW problem of that test suite. Don't you think that cloud providers like Amazon, Google or MS Azure will just put twice as many customers on it and so effectively sidestep the scaling issue?

I assumed linear scaling, of course, to keep it simple. Some SW, like raytracing engines, scales linearly. I guess nobody is gonna use a 128-core Altra for SW that cannot scale beyond 8 threads.
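To put a number on the gap between the quoted results and the linear-scaling assumption, here is a quick sketch. It uses only the two socket-doubling figures from the quote above (the 7F52 -> 7742 case is left out because the clocks differ). The function name and the metric (measured speedup divided by core-count ratio) are my own choices for illustration, not anything from the test suite.

```python
def scaling_efficiency(core_ratio, speedup):
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup / core_ratio

# (core-count ratio, measured speedup), taken from the quoted post
cases = {
    "7742 -> 2x7742": (2.0, 1.10),
    "7262 -> 2x7262": (2.0, 1.17),
}

efficiencies = {
    name: scaling_efficiency(ratio, speedup)
    for name, (ratio, speedup) in cases.items()
}

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.1%} of linear scaling")
```

In that suite, doubling the cores delivers a bit over half of what linear scaling would give, which is the gap the linear assumption glosses over.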

podspi said:
You're describing CMT (Bulldozer) not SMT.

Jeez Christ, do not scare me with the Bulldozer horror, please. You didn't get my point, because it's the opposite. My idea was that a super-wide core with 6xALU+2xBranch, 4xAGU, 4xFPU + SMT2 would be more powerful than 2x cores with 3xALU+1xBranch, 2xAGU, 2xFPU and no SMT (Cortex A76). Transistor-wise they are similar, but performance-wise, in an ST load the wide core can use all 6xALU and 4xFPU. This idea was also behind SMT4 in Zen3: if AMD built Zen3 twice as wide as Zen2, it would consist of 6 or 8xALU, 4xAGU, 4xFPU (8 pipes). Applying SMT4 to that double-wide core wouldn't hurt performance per thread at all. Even some significant gain is possible.

podspi · Jun 24, 2020

NostaSeronx said:
Development tree wise: cSMT is the end for SMT and CMT.
CMP -> SMT -> cSMT
CMP -> CMT -> cSMT
Both styles of cSMT (strong thread-cluster cohesion, the CMT derivative, and weak thread-cluster cohesion, the SMT derivative) eventually fuse into distributive multithreading, which is by far the most advanced style of threading to date.

Are there any actual, shipping products with this technology? I've never heard of cSMT before.

Richie Rich said:
Jeez Christ, do not scare me with the Bulldozer horror, please. You didn't get my point, because it's the opposite. My idea was that a super-wide core with 6xALU+2xBranch, 4xAGU, 4xFPU + SMT2 would be more powerful than 2x cores with 3xALU+1xBranch, 2xAGU, 2xFPU and no SMT (Cortex A76). Transistor-wise they are similar, but performance-wise, in an ST load the wide core can use all 6xALU and 4xFPU. This idea was also behind SMT4 in Zen3: if AMD built Zen3 twice as wide as Zen2, it would consist of 6 or 8xALU, 4xAGU, 4xFPU (8 pipes). Applying SMT4 to that double-wide core wouldn't hurt performance per thread at all. Even some significant gain is possible.

For Integer, yes, Bulldozer did not combine the ALUs, etc. However, the FPU and frontend were shared. I don't deny that the idea is plausible, but I can't think of any situation where this actually has happened. Look at the Power CPUs from IBM, tons of threads per core, but AFAIK if you are just using one thread you're just wasting transistors, not benefitting from those extra execution resources.

I do think we're going to start seeing more innovative designs as fabbing improvements slow down, but I don't think we're there yet with Zen3 or Zen4.

NostaSeronx · Jun 24, 2020

podspi said:
Are there any actual, shipping products with this technology? I've never heard of cSMT before.

POWER9 is pretty much cSMT.
With each core slice being:
1x Retire/Scheduler + 1x Register File + 1x 128-bit Int/1x 128-bit FPU + 1x LSU & Cache
SMT4 contains two of the above. If IBM had strong thread-to-slice cohesion, then it would be two cores with four threads.
SMT8 contains four of the above. This is the same as the above (SMT8 has two weakly cohesive SMT4 resources), but with four cores and eight threads.

However, the SMT8 model does have stronger cohesion: SMT1 only runs on one SMT4 core resource, and SMT2 can run on one SMT4 core resource (actual SMT2) or on both SMT4 core resources (pseudo-CMT/CMP).
Logical cores 0-3 can only run on core resource A (two physical cores that have four logical core slots in total) and logical cores 4-7 on core resource B (two physical cores with four logical core slots).

10.1147/JRD.2018.2854039 => IBM POWER9 processor core
Figure 1 & 2

//Absolute cohesion => Logical core 0 will always run on physical core 0.
//Strong cohesion => Logical core 0 will mostly run on physical core 0, with logical core 1 sometimes running on it if there is room and physical core 1 can't hit retire time constraints, vice versa.
CMT starts with the above.
SMT starts with the below.
//Weak cohesion => Logical cores 0/1 have a free-for-all over the resources of physical core 0.

In the ARM category the closest was all the SoftMachines processors at TSMC; 28HPM & 16FFP.
=> Sep 9, 2016, restart from scratch (AMD 2012 -> 2017), should be 4-5 years before we see it in Intel's TSMC 5nm push.

Richie Rich · Jun 28, 2020

podspi said:
For Integer, yes, Bulldozer did not combine the ALUs, etc. However, the FPU and frontend were shared. I don't deny that the idea is plausible, but I can't think of any situation where this actually has happened. Look at the Power CPUs from IBM, tons of threads per core, but AFAIK if you are just using one thread you're just wasting transistors, not benefitting from those extra execution resources.

I do think we're going to start seeing more innovative designs as fabbing improvements slow down, but I don't think we're there yet with Zen3 or Zen4.

The idea is plausible but very hard to design. Look how long it took to add just one ALU, going from the 3xALU PIII to the 4xALU Haswell: more than a decade. You need to rebuild the entire core to do that. It's much easier to enlarge the OoO buffers like Ice Lake did, or to copy/paste double the FPU width like Zen1->Zen2.

Apple went the hard way and now it is paying off: they have a CPU core with almost double the IPC/PPC of x86 cores. And Apple is not using SMT at all, which is surprising because a wide core plus SMT is a win-win synergy. SMT increases efficiency under heavy load, and a smart scheduler can effectively turn it off (by leaving a logical core empty).

I see the ideal future core as 8xALU, 4xAGU, 8xFPU + SMT4. That is the same number of units as two Zen2 cores, but with super strong ST IPC, higher MT IPC thanks to better SMT4 efficiency, and fewer transistors thanks to shared resources. And if you can decouple the decoder and run micro-ops natively like Transmeta/Tachyum, then you can do code morphing and become partly ISA-independent. IMO this is the future. OTOH I'm not saying it's easy to do.

From time to time some fool appears who didn't know it was impossible and creates something extraordinary (like Apple or Elon Musk these days). Apple's A11 Monsoon, the first 6xALU core in the world, is a good example. Apple probably forgot to ask AMD for some useful advice regarding the 2xALU Bulldozer. Funny? No, because Apple's A7 Cyclone, the first 4xALU ARM core, was released in September 2013 (the first 4xALU x86 core, Haswell, came in June 2013) and had been in development since 2008(?), while Bulldozer was released in 2011. So yeah, better to do your own maximum no matter what others do (infinity game player).

Gideon · Aug 11, 2020

Well well, Nuvia throws down the gauntlet:

www.anandtech.com

50% better performance than Zen 2 at 33% of the power, apparently. By the same guys that brought you the Apple cores:

NUVIA’s claim is that the Phoenix core is set to offer from +50% to +100% peak performance of the other cores, either for the same power as other Arm cores or for a third of the power of x86 cores. NUVIA’s wording for this graph includes the phrase ‘we have left the upper part of the curve out to fully disclose at a later date’, indicating that they likely intend for Phoenix cores to go beyond 5W per core.

Hitman928 · Aug 11, 2020

I don't find too much interesting info from a single core performance graph that is seemingly normalized to idle power (if I'm reading it right). There is a lot of room for manipulation there. With that said, the single core performance claims, while vague, are certainly interesting and it will be interesting to see how far they plan to scale out their CPUs.

For competition, if they are using Arm v9 that probably puts them in the 2022 time frame at the earliest (I'm guessing). So you're looking at competing against the next round of ARM's enterprise design as well as Zen 4 at that point. Should be fun!

Gideon · Aug 11, 2020

Here is the NUVIA article:

https://nuviainc.com/blog/performancedeliveredanewway

At least they were pretty clear in what they tested and (unlike some not-to-be-named people here) actually tested mobile vs mobile, which makes the most sense, given what is currently available from ARM:

What does all this data mean for the server market? It means a lot. All current and future flagship server SoCs are power constrained, very much like mobile SoCs. This trend is only going to continue as there is a push to integrate even more cores. In addition, the AMD, Intel and ARM client computing cores tested are comparable to their current and future data center products. As core count increases, what is not increasing is the TDP. TDPs are likely going to remain in the 250W - 300W range, which is the maximum power that can be dissipated in an air-cooled environment in a typical datacenter. Hyperscalers and other enterprise data centers still must operate their servers within these TDP limits to optimize the TCO for their data centers. As more cores are added, the power allocation per core shrinks significantly. A rough calculation can be made to determine the high and low bounds of the per-core power allocation in servers. We can assume that future flagship SoCs will have a minimum of 64 cores and a maximum of 128 cores. The TDP range is 250W - 300W, and the power outside of the CPU can range between 10W - 120W depending upon the workload. Taking into consideration these factors, the amount of power that each CPU can be allocated ranges between 1W - 4.5W when heavily utilized, as is the case in a datacenter environment. Drawing a set of vertical bars denoting this power range, it becomes evident why the NUVIA Phoenix CPU core has the potential to reset the bar for the market. No matter which scenario is considered, either unconstrained peak performance or power/thermally constrained performance, Phoenix should have a significant lead. Below is a preview into the planned capabilities of Phoenix, however we have left the upper part of the curve out to fully disclose at a later date. When measured against current products available in-market in the 1W-4.5W power envelope (per core), the Phoenix CPU core performs up to 2X faster than the competition. 
NUVIA’s Phoenix CPU performance is projected using architectural performance modeling techniques consistent with industry-standard practices on future CPU cores.
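For what it's worth, the 1W - 4.5W per-core range in the quote can be reproduced with a rough calculation. The sketch below only uses the figures NUVIA states (TDP, non-CPU power, core counts). The function name and the pairing of worst-case with worst-case inputs are my assumptions about how they arrived at the bounds.

```python
def per_core_power(tdp_w, non_cpu_w, cores):
    """Power left for each core after subtracting non-CPU power from the TDP."""
    return (tdp_w - non_cpu_w) / cores

# Low bound: lowest TDP, highest non-CPU power, most cores.
low = per_core_power(tdp_w=250, non_cpu_w=120, cores=128)

# High bound: highest TDP, lowest non-CPU power, fewest cores.
high = per_core_power(tdp_w=300, non_cpu_w=10, cores=64)

print(f"{low:.2f} W to {high:.2f} W per core")
```

That works out to roughly 1.0 W at the low end and 4.5 W at the high end, which matches the range in the blog post.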

I'm certainly not taking this at face value (as they clearly seem to be targeting investors), but I really like how they explained what they did, why they did it, the test setup and how they measured it.

In all measurements, we normalized the static idle power to remove constant system-level power taxes, and baseline idles are tuned to be as low as possible. We enabled all power management features, set panels to minimum brightness, turned off radios, and disabled all unneeded features. Batteries are fully charged and not charging while systems are under test. The power is measured at the battery output via the insertion of a high-precision series sense resistor and includes the DC-DC conversion loss of each platform. Typical high-efficiency buck converters used today have similar conversion losses, and therefore this is a more or less common factor amongst all devices.

And finally:

We realize the companies we have measured against in these tests are not standing still and will have new products in the market over the next 18 months. That said, we believe that even with significant performance gains (20%+) with new CPU architectures, we will continue to hold a clear position of leadership in performance-per-watt.

All in all, it's PR talk, but surprisingly informative and BS-free considering what most of their contemporaries put out (including Apple, etc.).

moinmoin · Aug 11, 2020

I can only wish Nuvia good luck in getting their product onto the mass market as soon as possible. The longer they take, the less impressive their currently clear lead will be. And being an Arm platform, they'll need all the clear lead they can get to make a significant part of the market willing to take on all the hassle involved in switching.

ScopedAndDropped · Aug 11, 2020

So based on that graph, the 18-month product availability and Apple's current progress, the A15 should be somewhere within that range. Very impressive to claim that they'll match the current mobile performance king with their first product. They have the people to match the claim, but they are unproven as a team.

I hope that they do come close to what they claim. They'll realistically be competing with the A15, Cortex X2??, Zen 4 and Golden Cove/whatever Sapphire Rapids uses (hopefully on TSMC's 6nm and not 10nm).

beginner99 · Aug 12, 2020

moinmoin said:
And being an Arm platform, they'll need all the clear lead they can get to make a significant part of the market willing to take on all the hassle involved in switching.

That's the sad part. They actually need the mentioned lead over the competition at launch time. I mean, switching to AMD EPYC should be trivial, yet Intel is making bank with server CPUs that are just much worse on any metric than AMD's offerings. So an ecosystem switch would need a gigantic incentive to happen.

Time is also an issue. Cloud providers would probably be the first to adopt, and then it will slowly trickle down. The question is whether it trickles down fast enough for Nuvia to actually turn a profit before running out of money.

DisEnchantment · Aug 12, 2020

Are they comparing a product from 18 months into the future using second gen N5 or N3 TSMC or Samsung equivalent vs N7 Zen2/Intel 14nm++ Skylake+?

Gideon · Aug 12, 2020

DisEnchantment said:
Are they comparing a product from 18 months into the future using second gen N5 or N3 TSMC or Samsung equivalent vs N7 Zen2/Intel 14nm++ Skylake+?

They are (and obviously targeting investors), but even with a 20% IPC gain on both Milan and Genoa and a 2x power/perf improvement from 5nm (as Zen 2 got on 7nm) they would still have a 40% power advantage and up to a 10% perf advantage.

If they manage to pull it off, that is.

Thala · Aug 12, 2020

DisEnchantment said:
Are they comparing a product from 18 months into the future using second gen N5 or N3 TSMC or Samsung equivalent vs N7 Zen2/Intel 14nm++ Skylake+?

N7 Zen2/Intel is not even close to what they aim for - N7 Apple A13 and ARM Cortex A77 are much closer - they are not even concerned about AMD/Intel at all.

DisEnchantment · Aug 12, 2020

Gideon said:
They are (and obviously targeting investors), but even with a 20% IPC gain on both Milan and Genoa and a 2x power/perf improvement from 5nm (as Zen 2 got on 7nm) they would still have a 40% power advantage and up to a 10% perf advantage.

If they manage to pull it off, that is.

I don't know, I am having a sense of deja vu. As someone who commissioned some arm clusters in the past to automate the testing of applications for arm64 targets we deploy to field devices from our backend hubs, I have to say I am quite wary of such claims. We had to switch to using qemu and qemu multiarch docker images on x86 to emulate the arm64 targets.
Looks like I will have to deal with some team members with smart ideas again this time too. Anyway each to his own. Not all deployments are the same.

Gideon · Aug 12, 2020

DisEnchantment said:
I don't know, I am having a sense of deja vu. As someone who commissioned some arm clusters in the past to automate the testing of applications for arm64 targets we deploy to field devices from our backend hubs, I have to say I am quite wary of such claims. We had to switch to using qemu and qemu multiarch docker images on x86 to emulate the arm64 targets.
Looks like I will have to deal with some team members with smart ideas again this time too. Anyway each to his own. Not all deployments are the same.

True that, though I give Nuvia some benefit of the doubt. They have the guys who architected the cores Apple is now putting into laptops and desktops. They also have people from Google (including from their ML HW division) and the biggest name in ARM server software (ex-Red Hat Jon Masters).

Even then, they obviously still have a lot to prove.

moinmoin · Aug 12, 2020

Thala said:
N7 Zen2/Intel is not even close to what they aim for - N7 Apple A13 and ARM Cortex A77 are much closer - they are not even concerned about AMD/Intel at all.

Depends on the markets they try to target. Apple is no real rival since they won't ever replace any Apple Silicon. ARM Cortex they may replace for some licensees, but here it will fully depend on the balance whether the increase in performance and efficiency is worth the additional cost over the stock ARM design. Against AMD/Intel chips the potential of impressing people is the biggest, and the margins in the datacenter market are a likely target.

beginner99 · Aug 13, 2020

Gideon said:
if they manage to pull it off that is.

Single-threaded? For sure possible (isn't apple already close to that anyway in selected benchmarks?) but does it make sense for a server-cpu? I think they will realize it's one thing to have this extreme ST-performance but entirely another to make 32 or 64 such cores work together efficiently. And that for more "esoteric" workloads outside the consumer world, it's suddenly a problem to get such good ST.

EDIT: at least I would assume that is the case, or is x86 really that bad? Or do Intel and AMD engineers combined suck so bad compared to these "rockstars" at Nuvia that they can't get it done? I mean, if they can "come out of nowhere" so quickly and beat the top giant in these metrics by a landslide, one must ask why that is possible at all. x86 bad or their own engineers bad? Or management/bureaucracy/office politics overhead?

Gideon · Oct 12, 2020

Welp, Amazon is starting to transition whole services over to their ARM CPUs, ElastiCache being the first:

AWS makes its own Arm CPUs the default for ElastiCache in-memory data store service

Bills home-brewed silicon as the upgrade path to better Redis or memcached

www.theregister.com

soresu · Oct 12, 2020

Gideon said:
Welp, Amazon is starting to transition whole services over to their ARM CPUs, ElastiCache being the first:

AWS makes its own Arm CPUs the default for ElastiCache in-memory data store service

Bills home-brewed silicon as the upgrade path to better Redis or memcached

www.theregister.com

No surprise there - Graviton3 is going to be a very nice upgrade if it's using V1, especially for any SIMD heavy workloads.

Systems analyst · Oct 19, 2020

Some references on ARM architecture, present and future:

This is a series of videos on ARMv8 by Grisenthwaite, their chief architect, from 2011:

Work began in 2007.

This is a description of SVE:

https://arxiv.org/pdf/1803.06185.pdf

Further into the future, there is CHERI:

Department of Computer Science and Technology – CHERI: The Arm Morello Board

Capability Hardware Enhanced RISC Instructions. This will feature a test SoC and board, available to researchers in 2021. The intention is to feed into forthcoming architectures. It is of interest to hyper-scalers. AArch64 code will run on it, unmodified, but code will have to be modified to take advantage of the new features.

See also:

Digital Security by Design: Technology Platform - Richard Grisenthwaite, ARM

KTN ran a collaborators' workshop on 26 September 2019 in London to explain more about the Digital Security by Design Challenge announced by the government. The Digital Security by Design challenge has been recently announced by the Department for Business, Energy & Industrial Strategy (BEIS)...

www.slideshare.net

soresu · Oct 29, 2020

Well unless this new v8.7-A GCC entry is a placeholder I'd say that v9-A ist kaput?

Link here.

name99 · Oct 29, 2020

soresu said:
Well unless this new v8.7-A GCC entry is a placeholder I'd say that v9-A ist kaput?

Link here.

Maybe, maybe not.
For one thing we have no idea what the transition plan for v9 is.
They COULD have a temporary period during which devices can decode both v8 (hopefully 64-bit only!) and v9 instructions. How feasible this is depends on how ambitious they are in redesigning v9.

But they could also say that the clients for v9, at first anyway, are going to be cloud companies that are able to recompile everything from scratch, so no need for backward compatibility. ie the Neoverse track moves over to v9 fairly soon, even as the mobile track stays on v8 for as long as Google/Android want to remain there...
I mean, is Android, even today, fully 64-bit? Could ARM get away with shipping a v8 core that's 64 bit with no 32-bit, ala Apple?

soresu · Oct 29, 2020

name99 said:
I mean, is Android, even today, fully 64-bit?

No.

Even the brand new Google Chromecast built on Android TV is in fact 32 bit because of the low RAM capacity - as is the cheaper variant of the new SHIELD TV with 2GB instead of 3GB on the more expensive model.

Sadly there are still many Android based devices that have less RAM than you would ideally need to run the 64 bit version of the OS.

As single RAM dies reach 32 Gbit (4 GB), I think we will start to see a step transition to all-64-bit Android.

name99 said:
Could ARM get away with shipping a v8 core that's 64 bit with no 32-bit, ala Apple?

That's coming with Makalu in 2022, and there is a variant of Cortex A32 that is 64 bit only (A34?) for some reason, though I think it uses the original v8.0-A ISA revision rather than the v8.2-A used since A75 and A55.

NTMBK · Oct 30, 2020

Rumour from Charlie of big layoffs at an ARM server vendor: https://semiaccurate.com/2020/10/30/arm-server-vendor-lays-off-130-and-cancels-cores/

Not got a subscription, so I don't know who it is. My guess would be Marvell?

DrMrLordX · Oct 30, 2020

NTMBK said:
Not got a subscription, so I don't know who it is. My guess would be Marvell?

Probably. That would be the easy guess anyway. It would be more interesting if it were Ampere or someone like that. Because Marvell has all but taken their ThunderX3 off the market, so it would hardly be news if they nixed the custom orders for it.
