News ARM Server CPUs


Gideon

Golden Member
Nov 27, 2007
1,608
3,573
136
With all the upcoming ARM servers (let's not forget Nuvia, etc.) it probably makes sense to have one thread for them all, instead of creating new ones for each announcement (if not, I will rename it).

However:
Anandtech: Marvell Announces 3rd Gen Arm Server ThunderX3: 96 Cores/384 Threads
ServeTheHome: Marvell ThunderX3 Arm Server CPU with 768 Threads in 2020

MarvellTX3_13_575px.jpg


Could be pretty impressive, though a 25% single-threaded performance gain seems a bit meh, considering the ThunderX2 @ 2.5 GHz was ~50% slower than a Xeon @ 3.8 GHz.

But their marketing slides sure show potential:

MarvellTX3_16_575px.jpg



MarvellTX3_15_575px.jpg
 
Last edited:

Thala

Golden Member
Nov 12, 2014
1,355
653
136
That's not true for most consoles ... (the HLE strategy you mentioned isn't going to work for modern consoles)

Console games do offline compilation for their shaders and ship native GPU bytecode which is why runtime shader/pipeline compilation (increased load times/stutter) doesn't exist on consoles compared to PC ...

Console gfx APIs like GNM or Xbox D3D are statically linked (anti-HLE) too so it's totally pointless to reimplement APIs when games emit PM4 packets. You obviously have no understanding about GPU emulation at all ...

That's not what you do when developing a high-performance emulator. The API entry points are located, intercepted, and wrapped to call into existing native APIs. Of course, if you want a slow emulator, you can optionally try to emulate the GPU at the HW level.
That having been said, a few things typically need to be emulated - for instance non-existent texture formats and similar things, if not directly supported by the target HW. Shaders need to be translated either statically or dynamically - but that's what you need to do anyway, independent of whether the CPU is ARM or not.
In addition - and as far as Direct3D is concerned - offline precompiled shaders are in binary DXIL format and are still HW-independent.
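To make the "locate and wrap" idea concrete, here is a minimal sketch, assuming an x86-64 host and that the emulator has already found the statically linked entry point via its own symbol or pattern scan (the Linux mprotect path and the wrapper are assumptions for illustration, not how any particular console emulator does it):

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Patch the located guest entry point so calls land in a host-side wrapper
// that forwards to the native graphics API instead of the guest driver code.
void install_hook(void* guest_entry, void* host_wrapper) {
    // x86-64 absolute jump: mov rax, imm64 ; jmp rax  (12 bytes)
    uint8_t thunk[12] = {0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0, 0xFF, 0xE0};
    std::memcpy(&thunk[2], &host_wrapper, sizeof(host_wrapper));

    // Temporarily make the page(s) holding the entry point writable.
    long page = sysconf(_SC_PAGESIZE);
    void* base = (void*)((uintptr_t)guest_entry & ~(uintptr_t)(page - 1));
    mprotect(base, 2 * page, PROT_READ | PROT_WRITE | PROT_EXEC);
    std::memcpy(guest_entry, thunk, sizeof(thunk));
    mprotect(base, 2 * page, PROT_READ | PROT_EXEC);
}
```

Everything after the patch is then argument translation inside the wrapper; the guest never reaches its own driver code.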
 
Last edited:

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
That's not what you do when developing a high-performance emulator. The API entry points are located, intercepted, and wrapped to call into existing native APIs. Of course, if you want a slow emulator, you can optionally try to emulate the GPU at the HW level.
That having been said, a few things typically need to be emulated - for instance non-existent texture formats and similar things, if not directly supported by the target HW. Shaders need to be translated either statically or dynamically - but that's what you need to do anyway, independent of whether the CPU is ARM or not.
In addition - and as far as Direct3D is concerned - offline precompiled shaders are in binary DXIL format and are still HW-independent.

With consoles, you have no choice but to emulate the GPUs at a HW level if you want to be even remotely accurate ... (every emulator that I've looked at for modern consoles is LLEing the GPU)

On Xbox D3D12, shaders are offline compiled into native GCN bytecode instead of DXIL on PC. Even with DXIL, drivers are still doing runtime (JIT) compilation ... (JIT compilation wouldn't be necessary on some PC configurations with AMD GPUs if developers could ship GCN bytecode over there too)

On other platforms (PC/iOS/Android) developers practice dynamic linking, high-level APIs, and multi-architecture binaries (DXIL/Universal 2). Unfortunately, the console world is a very different place and shows absolutely no mercy for emulators: static linking is everywhere (Apple hates this), low-level APIs are only compatible within the same HW generation even from the same HW vendor, and native binaries are shipped as much as possible ... (there is no compiler running inside console applications; everything there is truly precompiled x86/GCN bytecode unless developers are dumb enough to include a compiler with their applications)

The biggest blocker to modern console emulation is going to be the GPUs, because they aren't even remotely close to converging on hardware design unless the industry decides tomorrow to standardize on AMD GPUs ...
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
Not really, FX!32 was simply a JIT that cached its results and optimized/saved them for future runs. That was new 25 years ago but is par for the course for JITs these days.

Here's a snippet from the Usenix paper abstract on it:
The translator provides native Alpha code for the portions of an x86 application which have been previously executed.
It is not a JIT: it's an ISS backed by offline translation of traces. From the USENIX paper you cited:
DIGITAL FX!32 makes a different tradeoff. No translation is done while the application is executing. Rather, the emulator captures an execution profile. Later, a binary translator [1] uses the profile to translate the parts of the application that have been executed into native Alpha code.
What they called the translator is the offline process that takes care of translating IA32 into Alpha code.
The translator is invoked by the server to translate x86 images which have been executed by the emulator. As a result of executing the image, a profile for the image will exist in the DIGITAL FX!32 database. The translator uses the profile to produce a translated image. On subsequent executions of the image, the translated code will be used, substantially speeding up the application.
That's definitely not a typical JIT.
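For readers unfamiliar with the split, here is a rough, purely hypothetical illustration of the division of labour the paper describes (none of this is Digital's code; the function names and the flat profile file are made up):

```cpp
#include <cstdint>
#include <fstream>
#include <set>
#include <string>

// Runtime side: the emulator only records which guest code it executed.
// No translation happens while the application runs.
static std::set<uint64_t> g_profile;

void record_executed_block(uint64_t guest_pc) {
    g_profile.insert(guest_pc);
}

// Offline side: after the run, the profile is handed to a separate binary
// translator, which emits native code that later launches will pick up.
void dump_profile_for_translator(const std::string& db_path) {
    std::ofstream db(db_path, std::ios::app);
    for (uint64_t pc : g_profile) {
        db << std::hex << pc << '\n';
    }
}
```

The key point is that translation cost is paid between runs, not during them, which is exactly why it isn't a typical JIT.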
 

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
It is not a JIT: it's an ISS backed by offline translation of traces. From the USENIX paper you cited:

What they called the translator is the offline process that takes care of translating IA32 into Alpha code.

That's definitely not a typical JIT.

But it is nothing like what Rosetta 2 does in translating an entire binary without executing it, so there is no "first run" penalty; and if Digital's strategy were viable, others would have done the same thing. Any patents on it expired long ago, so anyone writing a JIT is free to do the same.
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
But it is nothing like what Rosetta 2 does in translating an entire binary without executing it, so there is no "first run" penalty; and if Digital's strategy were viable, others would have done the same thing. Any patents on it expired long ago, so anyone writing a JIT is free to do the same.
Yeah, definitely what Apple did is more usable than what Digital did back then. I was just discussing the novelty. IMHO Apple didn't invent anything new here; that's engineering at its best, with proper planning to have OS and HW support. That's all I was arguing about :)
 
  • Like
Reactions: AkulaMD

Thala

Golden Member
Nov 12, 2014
1,355
653
136
I was just discussing the novelty. IMHO Apple didn't invent anything new here; that's engineering at its best, with proper planning to have OS and HW support. That's all I was arguing about :)

In addition, writing a static translator is much more trivial than writing a high-performance JIT. I think it is more impressive that Microsoft's JIT has roughly the same performance range as Apple's AOT translator (aka Rosetta 2).
 
Last edited:

Thala

Golden Member
Nov 12, 2014
1,355
653
136
...
On Xbox D3D12, shaders are offline compiled into native GCN bytecode instead of DXIL on PC.
...

Do you have a reference for this? Last time I checked the XDK it was compiling to DXIL - same as on PC. It does not hurt much either, as the target code is ideally compiled in a background task while loading other assets. The really performance-demanding part is the compilation HLSL -> DXIL, which in general is precompiled.
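For what it's worth, the PC-side D3D12 equivalent of that background path looks roughly like the sketch below (this is plain desktop D3D12, not the Xbox XDK, and device lifetime / error handling are glossed over), with the DXIL-to-ISA step happening inside CreateGraphicsPipelineState on a worker thread while the loader streams other assets:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#include <future>

using Microsoft::WRL::ComPtr;

// Kick off pipeline creation on a worker thread. The driver translates the
// precompiled DXIL referenced by `desc` into native GPU code here; doing it
// off the loading thread hides that cost behind other asset I/O.
// Note: the bytecode buffers referenced by `desc` must stay alive until the
// task completes, and `device` must outlive it as well.
std::future<ComPtr<ID3D12PipelineState>> CreatePsoAsync(
    ID3D12Device* device, D3D12_GRAPHICS_PIPELINE_STATE_DESC desc)
{
    return std::async(std::launch::async, [device, desc]() {
        ComPtr<ID3D12PipelineState> pso;
        device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso));
        return pso;  // empty on failure
    });
}
```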

The remaining parts of your contribution I already covered in my previous posts. In particular, I did mention that statically linked library calls are intercepted and wrapped into native API calls. Further, I do not understand how the way PC/Android/iOS do things is relevant to the console discussion. Things are built differently on a console.
 
Last edited:

Doug S

Platinum Member
Feb 8, 2020
2,201
3,405
136
In addition, writing a static translator is much more trivial than writing a high-performance JIT. I think it is more impressive that Microsoft's JIT has roughly the same performance range as Apple's AOT translator (aka Rosetta 2).

If writing a static translator is so simple, why is Apple's the first one?

There was a thread in RWT about a year ago where I speculated Apple might do a static translator for their ARM Macs, made possible by their developer monoculture and long runway for planning. Linus among others was highly skeptical and pointed out all the reasons why he thought it wouldn't be feasible. If you think it is "trivial", all that tells me is that you have absolutely no clue what all is involved.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Do you have a reference for this? Last time I checked the XDK it was compiling to DXIL - same as on PC.

The remaining parts of your contribution I already covered in my previous posts. In particular, I did mention that statically linked library calls are intercepted and wrapped into native API calls. Further, I do not understand how the way PC/Android/iOS do things is relevant to the console discussion. Things are built differently on a console.

Is this for UWP apps? Because for native applications there are Xbox-specific HLSL intrinsics which correspond exactly to some AMD GPU instructions, so I'd imagine that offline compilation is supported by design on Xbox ... (there's an option to use PC D3D12 on Xbox, but that is far from ideal from a performance perspective, especially since their leading competitor will capitalize on this pitfall)

On PS4/PS5 GNM, their wave compiler supports offline compilation of PSSL shaders into GCN bytecode, according to leaked documentation ... (there's no option to compile to other bytecodes like SPIR-V or DXIL)

PC/Android/iOS applications are naturally designed to be more portable, so static recompilation and reimplementing APIs is somewhat feasible for those platforms. Console emulator authors had better be prepared to write JITs and emulate the GPU at a low level. Console developers want both low-level access and backwards compatibility, which will constrain HW design a lot ...
 

Nothingness

Platinum Member
Jul 3, 2013
2,371
713
136
If writing a static translator is so simple, why is Apple's the first one?
You wrote it: because of vertical integration. No one has full control of both the software and the hardware stack like Apple.

Anyway, evidence has been provided that FX!32 was using some form of static translation. Though it's based on previous execution to identify code areas, it still is a form of static translation. Also, Apple doesn't rely only on static translation - not just for JITed x86 code, but very likely also for other parts of the code, due to the undecidability of separating code from data. So they're not fully static.

Android ART does a form of static translation, translating bytecode to host code instead of having to rely only on a JIT. Nothing as fancy as Apple, but that definitely is static translation used in production.

If you think it is "trivial", all that tells me is that you have absolutely no clue what all is involved.
You seem to be overreacting. No one has been saying Apple's Rosetta 2 isn't a great achievement. But like it or not, writing a usable high-performance JIT is harder than writing a purely static translator (which Apple's isn't anyway).
 
  • Like
Reactions: Thala

Thala

Golden Member
Nov 12, 2014
1,355
653
136
If writing a static translator is so simple, why is Apple's the first one?

There was a thread in RWT about a year ago where I speculated Apple might do a static translator for their ARM Macs, made possible by their developer monoculture and long runway for planning. Linus among others was highly skeptical and pointed out all the reasons why he thought it wouldn't be feasible. If you think it is "trivial", all that tells me is that you have absolutely no clue what all is involved.

Why should I answer you, when you think I have no clue? I might still give you a hint. The challenge is not the translation itself but getting the linking right - this involves answering whether a dynamically linked module contains self-modifying code, etc. Another, more general challenge is dynamically generated branch targets - which you can solve if you have an idea how the compiler generates these - a monoculture helps here big time. Ideally you try to regain knowledge about what the source code looked like.
If you can solve the above issues, a static translator is inherently simpler and produces inherently faster code.

For a JIT, however, aside from the fact that you generally have much less time for translation (and have to do good register allocation on top, which is NP-hard), you generally have much smaller units of translation, which prevents you from doing more global optimizations, and you have to dynamically manage all the linking of basic blocks and associated data structures.

Having worked on projects involving both static translation and JITs, I can tell you that if you get the showstoppers out of the way, the implementation of static translation is much simpler and the translated code is very performant. In my particular case I knew precisely what the compiler and linker were doing - so the static translator was relatively trivial. On top of this I could reuse much of the existing static compiler technology - like a register allocator, instruction scheduler and static linker.
Writing a high-performance JIT, on the other hand, is an art. You have to work much more with clever heuristics instead of, say, brute-forcing register allocation with graph coloring. This holds in particular if a performance range similar to that of a static translator is achieved. And that's precisely what I am seeing with the x64 JIT built into WoA.
 
Last edited:

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Is this for UWP apps? Because for native applications there are Xbox-specific HLSL intrinsics which correspond exactly to some AMD GPU instructions, so I'd imagine that offline compilation is supported by design on Xbox ... (there's an option to use PC D3D12 on Xbox, but that is far from ideal from a performance perspective, especially since their leading competitor will capitalize on this pitfall)

It's been a few years since I tinkered with the Xbox devkit. What I can say is that it was not for UWP apps. I remember that the DXIL shaders (which I had precompiled offline) were compiled in the background after I called CreateXXXShader(). The shaders were generally done compiling by the time I finished loading other assets - so there was no performance impact. But yeah - maybe my memory is cheating me - which is why I was asking for a link or something.

But coming back to the original discussion - some shader code translation would be necessary in any case, even if the GPU vendor stays the same. And this is relatively orthogonal to the question of whether the CPU is going to be ARM.
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,581
712
126
(snip) The challenge is not the translation itself but getting the linking right - this involves answering whether a dynamically linked module contains self-modifying code, etc.

Can you elaborate on the usage of self-modifying code? I understand the legacy of it, but it just doesn't fit into modern processors. When you have to write the changed opcodes to the cache line, invalidate and reload that cache line, then execute it, it seems futile. You're basically stalling the processor to optimize something that could easily be handled by dereferencing or predication.
 

NTMBK

Lifer
Nov 14, 2011
10,208
4,940
136
Can you elaborate on the usage of self-modifying code? I understand the legacy of it, but it just doesn't fit into modern processors. When you have to write the changed opcodes to the cache line, invalidate and reload that cache line, then execute it, it seems futile. You're basically stalling the processor to optimize something that could easily be handled by dereferencing or predication.

Maybe you are, e.g., writing an image processing algorithm. Baking various constants (image size, stride, format, etc.) into the binary would let you speed things up, but obviously you can't precompile every supported image size. So at runtime you modify the code once before executing it over the entire image.
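A minimal sketch of that "bake the constant in" trick, assuming an x86-64 Linux host and skipping error handling (a real image kernel would patch strides and offsets inside its inner loop the same way, once per frame format):

```cpp
#include <sys/mman.h>
#include <cstdint>
#include <cstring>

using GetStrideFn = uint32_t (*)();

// Copy a tiny code template (mov eax, imm32 ; ret) into an executable page
// and patch the immediate with the stride that is only known at runtime.
GetStrideFn bake_stride(uint32_t stride) {
    uint8_t tmpl[] = {0xB8, 0, 0, 0, 0,   // mov eax, imm32
                      0xC3};              // ret
    std::memcpy(&tmpl[1], &stride, sizeof(stride));

    void* page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  // strict W^X systems may refuse this
    std::memcpy(page, tmpl, sizeof(tmpl));
    return reinterpret_cast<GetStrideFn>(page);
}
```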
 

Schmide

Diamond Member
Mar 7, 2002
5,581
712
126
Maybe you are, e.g., writing an image processing algorithm. Baking various constants (image size, stride, format, etc.) into the binary would let you speed things up, but obviously you can't precompile every supported image size. So at runtime you modify the code once before executing it over the entire image.

Then we're getting into semantics. If you modify the code before execution, that's just-in-time. If the code is running and then it changes itself, that's self-modifying - which generally isn't allowed in OOP.
 

Thala

Golden Member
Nov 12, 2014
1,355
653
136
Can you elaborate on the usage of self modifying code? I understand the legacy of it but it just doesn't fit into modern processors. When you would have to write changed opcodes to the cache line, invalidate and reload that cache line, then execute it; it seems to be futile. You're basically stalling the processor to optimize something that could easily be handled by dereferencing or predication.

I didn't mean self-modifying code literally - so maybe I should have avoided the term. I meant code that itself produces code at runtime - like, for instance, what a JIT compiler does. I mean, these days you cannot really write self-modifying code in the classical sense, as the code pages are write-protected by the OS.
When you generate code at runtime, you write the instructions to an RW+X page, then clean your data cache by virtual address, invalidate your instruction cache by virtual address, synchronization barrier - done. That's on ARM, where the instruction cache is not coherent, unlike x64. On the x64 architecture you can avoid the explicit cache maintenance operations.
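A minimal sketch of that sequence on AArch64 Linux (GCC/Clang), where __builtin___clear_cache takes care of the DC CVAU / IC IVAU / barrier dance over the written range and compiles to nothing on x86-64; the emitted instructions are just a stand-in:

```cpp
#include <sys/mman.h>
#include <cstdint>
#include <cstring>

using JitFn = int (*)();

JitFn emit_return_42() {
    // AArch64: mov w0, #42 ; ret
    static const uint32_t code[] = {0x52800540, 0xD65F03C0};

    // Write the instructions into an RW+X page (strict W^X systems would
    // require a separate remap step instead).
    void* page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    std::memcpy(page, code, sizeof(code));

    // Clean D-cache / invalidate I-cache by virtual address over the range,
    // plus the required barriers, before the first fetch of the new code.
    __builtin___clear_cache((char*)page, (char*)page + sizeof(code));

    return reinterpret_cast<JitFn>(page);
}
```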
 
Last edited:

Schmide

Diamond Member
Mar 7, 2002
5,581
712
126
I mean, these days you cannot really write self-modifying code in the classical sense, as the code pages are write-protected by the OS.

In the context of consoles and emulators, I would assume that the protections are granted. It would be quite a round trip to modify and request an OS reload.
 

videogames101

Diamond Member
Aug 24, 2005
6,777
19
81
I didn't mean self-modifying code literally - so maybe I should have avoided the term. I meant code that itself produces code at runtime - like, for instance, what a JIT compiler does. I mean, these days you cannot really write self-modifying code in the classical sense, as the code pages are write-protected by the OS.
When you generate code at runtime, you write the instructions to an RW+X page, then clean your data cache by virtual address, invalidate your instruction cache by virtual address, synchronization barrier - done. That's on ARM, where the instruction cache is not coherent, unlike x64. On the x64 architecture you can avoid the explicit cache maintenance operations.

Armv8 certainly allows for a coherent I$, even if most Armv8 processors don't have one. On both Graviton 2 and Ampere Altra the instruction caches are fully coherent, and therefore no explicit cache maintenance is required in the case you outlined. Of course synchronization barriers would still be required to ensure the store of the opcode has been made visible to the memory system before the opcode is subsequently fetched and executed.

Interestingly I believe x86-64 does not even require a synchronization barrier, and the processors will do the hazard detection between store and fetch automatically.
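As an aside, AArch64 user space can usually check this directly: CTR_EL0 has DIC (bit 29) and IDC (bit 28) flags that report whether the I-cache invalidate and the D-cache clean are actually required, and Neoverse N1 based parts like Graviton 2 and Altra reportedly set both. A small sketch (assuming Linux, which exposes CTR_EL0 to EL0):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t ctr;
    // Read the cache type register from user space.
    asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
    bool dic = (ctr >> 29) & 1;   // 1: no IC IVAU needed for coherence
    bool idc = (ctr >> 28) & 1;   // 1: no DC CVAU to PoU needed
    std::printf("DIC=%d IDC=%d\n", (int)dic, (int)idc);
    return 0;
}
```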
 
Last edited:
  • Like
Reactions: scineram and Gideon

Gideon

Golden Member
Nov 27, 2007
1,608
3,573
136
Whoa, Ampere is moving super fast:


  • The two biggest Chinese clouds and Microsoft Azure on board
  • 128-core Altra CPU out and still 3.0 GHz at the same 250W TDP as the 80-core one (and easily faster than Milan)
  • And finally, new custom-core 5nm designs coming, sampling to customers in early 2022
And this is the 250W 128-core design still based on the relatively weak Neoverse N1:

AmpereStrategyRoadmapFinal_7.png
 

Hitman928

Diamond Member
Apr 15, 2012
5,177
7,629
136
STH has a bit more complete coverage:


There's also a leaker on Twitter with some additional info (if legit) on the new 128-core architecture:
  • System-level cache reduced from 32 MB to 16 MB compared to the 80-core architecture.
  • Actual base clock of the 128-core model is 2.8 GHz.
  • SPEC CPU 2017 int rate score is 19% higher compared to the 80-core model.

The custom cores announcement is very interesting considering the convergence on the 'stock' architectures over the years. I think STH's analysis is probably accurate in that ARM is trying to gear their architectures moving forward to compete more generally against AMD/Intel whereas Ampere wants to focus their designs on providing the best performance for cloud based hosting servers.
 
  • Like
Reactions: Tlh97

Hitman928

Diamond Member
Apr 15, 2012
5,177
7,629
136
Also, if Andrei is correct that it isn't sampling yet, then it probably won't actually launch until late 2021, which makes me wonder if their 5 nm chip will actually launch in 2022. If not, then this will really be competing against Genoa and SPR on the x86 side. If it does, then it might be a tough sell when partners know a much better chip is right around the corner.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Not sure how they will survive after going public (in the 3rd quarter according to Yahoo); backers can only throw money at them for so long. Tencent/MS/Google and other big cloud providers can basically do everything Ampere does and can go into a bidding war for wafers.
They need to sell enough products to finance custom development and tapeout costs. But to sell something they need wafers. Are they capable of outbidding Apple/AMD/NV/QCOM all the way into 2022 and beyond? Doubtful.
Making gigantic chips is also not sustainable without proper packaging and interconnect tech.
Not a rosy picture tbh.
Tachyum and Ampere Computing are in the same boat, waiting for a potential buyer. Nuvia made it.
 
Last edited:

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
Good stuff. Also good news that several cloud providers are backing it, which is way preferable to in-house solutions like AWS' Graviton.

As I keep saying, this is the actual competition AMD faces (and Intel has a lot of catching up to do there). Though more comparisons of per-core performance, using more up-to-date Milan and Ice Lake SP, as well as recognition of the impact of SMT, would be nice.

Ooof, the "Altra Max" name is really really unfortunate in charts like these. I first had to go into the article to find out what the "Max" part refers to, if it's some kind of turbo boost or something... o_O

Not sure how they will survive after going public (in the 3rd quarter according to Yahoo); backers can only throw money at them for so long. Tencent/MS/Google and other big cloud providers can basically do everything Ampere does and can go into a bidding war for wafers.
They need to sell enough products to finance custom development and tapeout costs. But to sell something they need wafers. Are they capable of outbidding Apple/AMD/NV/QCOM all the way into 2022 and beyond? Doubtful.
Making gigantic chips is also not sustainable without proper packaging and interconnect tech.
Not a rosy picture tbh.
Tachyum and Ampere Computing are in the same boat, waiting for a potential buyer. Nuvia made it.
It's quite the conundrum, isn't it? For ARM as a server ecosystem, some form of economy of scale would be way preferable to a further increase in fragmented in-house solutions not open to the public. But companies like Tachyum and Ampere need to get there before being bought out. I just hope some of these service providers opt to guarantee business with one of these companies instead, to ensure their products stay available on the open market.
 

Gideon

Golden Member
Nov 27, 2007
1,608
3,573
136
Not sure how they will survive after going public (in the 3rd quarter according to Yahoo); backers can only throw money at them for so long. Tencent/MS/Google and other big cloud providers can basically do everything Ampere does and can go into a bidding war for wafers.
They need to sell enough products to finance custom development and tapeout costs. But to sell something they need wafers. Are they capable of outbidding Apple/AMD/NV/QCOM all the way into 2022 and beyond? Doubtful.
Making gigantic chips is also not sustainable without proper packaging and interconnect tech.
Not a rosy picture tbh.
Tachyum and Ampere Computing are in the same boat, waiting for a potential buyer. Nuvia made it.
Well, one avenue I can see is executing well enough that someone (e.g. Microsoft) buys them out, rather than going through the effort of building their own team from scratch. They definitely need custom cores for that (and these need to be better than standard Neoverse).
 

beginner99

Diamond Member
Jun 2, 2009
5,208
1,580
136
I think STH's analysis is probably accurate in that ARM is trying to gear their architectures moving forward to compete more generally against AMD/Intel whereas Ampere wants to focus their designs on providing the best performance for cloud based hosting servers.


Does ARM have any provisions or designs for chiplet-based processors? Maybe that is the core thing. Pure core performance and efficiency is one thing; being able to produce more chips and sell them cheaper is another. I'd say it's actually these "secondary" aspects that will become more and more important, including packaging.