Based on Neoverse cores, not a Denver derivative.
This makes sense with the ARM acquisition attempt. They want to fully control the server stack.
One thing I didn't pick up at the time: ServeTheHome noticed that Grace's per-thread performance isn't actually that impressive.
Per thread? Did NVIDIA announce the number of cores?
It's hilarious that there was even a thought that Nvidia could somehow get a stranglehold over x86 vendors just by purchasing ARM Ltd ...
CUDA offloading on Arm was only introduced under a year ago with the CUDA 11 SDK, so the vast majority of CUDA customers still run their host code on x86 CPUs, and Nvidia would be screwing over many of their own customers if they tried to lock CUDA to their own ARM-based systems. If developers suddenly found they couldn't use CUDA on x86 systems, most would sooner drop CUDA and stick to pure C++, optimizing their kernels with AVX or even AVX-512, than rewrite their host code to run on Arm. Some developers, if they're brave enough, will attempt a transition to other heterogeneous compute platforms like ROCm or oneAPI if they're looking for more performance ...
Most developers don't optimize their CUDA kernels with low-level PTX assembly; they rely heavily on NVCC to deliver the big speedups that make porting their C/C++ kernels worthwhile in the first place. By comparison, far more code is optimized for x86 because its ISA is stable and far more ubiquitous ...
CUDA GPUs have plenty of undesirable limitations: an unstable ISA (PTX changes every generation!), kernels that can't express all of C++, and plenty of parallel algorithms they can't accelerate. x86 CPUs have none of these drawbacks, so they're far easier for programmers maintaining long-term projects, and they're much more widespread ...
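To make the host-code-versus-kernels distinction above concrete, here is a minimal, hypothetical SAXPY sketch (not from the thread): the __global__ kernel is what NVCC compiles down to PTX/SASS regardless of the host ISA, everything in main() is ordinary C++ host code that builds the same whether the host CPU is x86 or Arm, and the AVX2 path is the kind of pure-CPU fallback the post describes.

```cpp
// Hypothetical illustration only: a toy SAXPY showing the host/device split.
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>
#ifdef __AVX2__
#include <immintrin.h>   // x86-only intrinsics for the CPU fallback path
#endif

// Device kernel: compiled by NVCC to PTX/SASS, independent of the host ISA.
__global__ void saxpy_gpu(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Pure-CPU alternative (shown for comparison, not called below): plain C++
// plus optional AVX2 intrinsics, which only exist on x86.
void saxpy_cpu(int n, float a, const float* x, float* y) {
#ifdef __AVX2__
    int i = 0;
    const __m256 va = _mm256_set1_ps(a);
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
    }
    for (; i < n; ++i) y[i] = a * x[i] + y[i];
#else
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
#endif
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // Host code: ordinary CUDA runtime calls, identical on x86 and Arm hosts.
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

    cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    printf("y[0] = %f\n", y[0]);   // expect 4.0
    return 0;
}
```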
if they tried to lock CUDA to their own ARM-based systems
NV's core is and will be GPUs.
Disagree, I fully expect them to go hard after server CPUs.
I think you misunderstood. Not locking CUDA to ARM; that would be "suicide", I agree. What I meant was that they "lock" ARM into CUDA, or NV GPUs respectively, by forcing any (server CPU) implementation to carry NVLink or some other NV-proprietary piece that helps them and makes it very hard not to choose an NV GPU for an ARM server.
NV's core is and will be GPUs. Hence, in their mind, all they need is some mediocre CPU to run the general-purpose code. If they can force an entire ecosystem (ARM) onto NV GPUs, that's a huge win. Plus, they want to stop depending on Intel or AMD, because those CPU purchases fund R&D that also flows into competing GPUs. Honestly, I think AMD is in the worst spot. Intel has its huge income and oneAPI, and is in general a huge corporation that can invest a lot. NV has CUDA and is basically the GPU/AI market leader. AMD has the hardware, CPU, GPU and FPGA, but they simply miss the software, which honestly is the more crucial part. And I fear AMD lacks the manpower, and even more so the company culture, to finally solve that problem. ROCm is simply going way, way too slow. And as long as you can't prototype on Windows, forget it. Devs work in companies, and many of those still force you onto Windows, on the laptop and on the server!
If you think Nvidia can dominate with mediocre CPUs then you obviously don't understand what it takes to make a compelling heterogeneous compute platform ...
If you want to talk about who's behind on their software stack then look no further than oneAPI, since it still doesn't offer GPU acceleration with either TensorFlow or PyTorch. At the very least ROCm works with both, and with upstream support to boot. If Intel does have a huge amount of cash, it's not showing up in oneAPI, because between it and ROCm, the latter is arguably the more mature stack ...
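For a sense of what the ROCm side of this comparison looks like at the source level, here is a hypothetical sketch of the same kind of kernel written against ROCm's HIP runtime. HIP deliberately mirrors the CUDA runtime API almost name-for-name, which is what makes the "transition to ROCm" mentioned earlier largely mechanical; this assumes the standard hipcc toolchain and is an illustration, not a tested build.

```cpp
// Hypothetical illustration: SAXPY via ROCm's HIP runtime. Note how the calls
// mirror the CUDA runtime (hipMalloc vs cudaMalloc, hipMemcpy vs cudaMemcpy).
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    float *dx = nullptr, *dy = nullptr;
    hipMalloc(&dx, n * sizeof(float));
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, y.data(), n * sizeof(float), hipMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);   // hipcc accepts this launch syntax

    hipMemcpy(y.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(dx);
    hipFree(dy);

    printf("y[0] = %f\n", y[0]);   // expect 4.0
    return 0;
}
```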
Well, Intel also doesn't have a product on the market yet, so I expect this to be working once they officially launch to the public. Maybe I'm overly optimistic, but given their perceived push into AI/ML, not having oneAPI work with the most common frameworks would be a pretty big failure given Intel's resources.
Have you ever actually tried ROCm? I've heard different stories. And even if you get it to work, it's Linux-only anyway. Where I work, we are full Windows. Even simple web servers have to be Windows. I guess it's no different at many other non-tech large orgs that standardize their stack.
While I agree that software has historically been a problem for AMD, I disagree that AMD's current approach to it is wrong or bad. On the contrary, the all-out open-source push is the one approach that lets AMD focus on its hardware while the software is open to everyone to use and improve. But open source needs a community to work, and that community isn't found on Windows (though WSL2 lets you merge it in to some degree; that's why Microsoft created WSL to begin with). Serious scientific research uses open source, AMD's two upcoming exascale supercomputers build on open source, and in fact improving ROCm is part of those contracts.
ROCm being tied to Linux is an intentional design decision. ROCm was initially built on AMDKFD, which was AMD's HSA kernel driver at the time. Windows also has WDDM kernel-driver limitations that prevent powerful APIs from being implemented, so they aren't interested in dealing with an inferior implementation ...
Linux is pretty much the future for compute and pretty soon there'll be more APIs that can only be properly implemented over there ...
Microsoft added support for Hyper-V DRM infrastructure to emulate a GPU and intercept calls from WSL, redirecting them to Windows.
You can implement a non-graphics, compute-only driver on Windows that circumvents the WDDM limitations. See the TCC driver for CUDA.
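As a minimal sketch of what that distinction looks like from host code: the CUDA runtime's cudaDeviceProp does expose a tccDriver field that reports whether a Windows GPU is running the compute-only TCC driver or the graphics WDDM driver. Treat the snippet as illustrative rather than production-grade (no error handling).

```cpp
// Hypothetical illustration: report whether each GPU runs under the
// compute-only TCC driver or the graphics WDDM driver on Windows.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d (%s): %s\n", d, prop.name,
               prop.tccDriver ? "TCC (compute-only driver)" : "WDDM (graphics driver)");
    }
    return 0;
}
```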
@Thala, obviously Nvidia will not even hint at their upcoming vendor lock-in practices in their press releases until the Arm deal is done. It would be moronic to do so (as they are trying to convince regulators that they "couldn't possibly even think of that, honest to god").
Where did you get the "mediocre CPU" info?
To quote STH:
On their DGX, Nvidia is going from dual Epyc, which produces 800-ish SPECint, to octo Grace, which will be around 2400 and will offer an order of magnitude more bandwidth than anything available in the same timeframe.
The point is that Nvidia is now a system vendor, so comparing Grace, which is soldered onto an SXM-type module, to a socketed CPU makes no sense.
They don't contradict what I said. In fact, they make the same mistake of comparing CPUs where Nvidia is selling systems. And they forget, intentionally or not, to compare the future DGX with 8 Grace CPUs to the best hypothetical 2023 DGX with 2 future x86 CPUs (roughly 2400 SPECint for the Grace solution vs 1000~1200 for 2 top-of-the-line x86 CPUs).
Arm-azing Grace Combines Arm CPU, NVIDIA GPU, and NVLink: We discuss the NVIDIA Grace platform announcement, what it means for the industry, and the misleading figure in the presentation (www.servethehome.com)
A system which replaces CXL capability with a lot of mediocre CPUs, gotcha.
As I said, nothing will match the CPU performance of DGX in this category and that's what matters.
Also, if Nvidia's future DGX systems need twice the number of CPU sockets to be competitive against the x86 alternatives, then maybe Grace CPUs are mediocre and they have a failure on their hands ... (the most common systems will always be either 1P or 2P; 4P/8P rarely ever sees deployment)
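For reference, the per-socket arithmetic behind this disagreement, taking the rough SPECint figures quoted in this exchange at face value (these are the posters' own estimates, not official numbers):

$$\frac{2400}{8\ \text{Grace sockets}} \approx 300\ \text{per socket},
\qquad
\frac{1000\text{--}1200}{2\ \text{x86 sockets}} \approx 500\text{--}600\ \text{per socket}.$$

So the eight-socket Grace DGX leads by roughly 2x in aggregate, while each x86 socket is roughly 1.7-2x faster on its own, which is exactly the aggregate-versus-per-socket split the two posters are arguing over.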