Based on Neoverse cores, not a Denver derivative.
This makes sense with the ARM acquisition attempt. They want to fully control the server stack.
One thing I didn't pick up at the time: ServeTheHome noticed that Grace's per-thread performance isn't actually that impressive.
Per thread? Did NVIDIA announce the number of cores?
It's hilarious that there was even a thought that Nvidia could somehow get a stranglehold over x86 vendors just by purchasing ARM Ltd ...
CUDA offloading on Arm was only introduced under a year ago with the CUDA 11 SDK, so the vast majority of CUDA customers still run their host code on x86 CPUs, and Nvidia would be screwing over many of their own customers if they tried to lock CUDA to their own ARM-based systems. If developers suddenly found they couldn't use CUDA on x86 systems, most would sooner drop CUDA and stick to pure C++, optimizing their kernels with AVX or even AVX-512, than rewrite their host code to run on Arm. Some developers, if they're brave enough, will attempt a transition to other heterogeneous compute platforms like ROCm or oneAPI if they're looking for more performance ...
Most developers don't optimize their CUDA kernels with low-level PTX assembly; they rely heavily on NVCC to deliver the big speedups that make porting their C/C++ kernels worthwhile in the first place. By comparison, far more code is optimized for x86 because its ISA is stable and far more ubiquitous ...
CUDA GPUs have plenty of undesirable limitations: an unstable ISA (PTX changes every generation!), kernels that can't express all of C++, and plenty of parallel algorithms they can't accelerate. x86 CPUs have none of these drawbacks, so they're far easier for programmers maintaining long-term projects, and they're much more widespread ...
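To make the host-code-versus-kernels distinction above concrete, here is a minimal, hypothetical SAXPY sketch (not from the thread): the __global__ kernel is what NVCC compiles down to PTX/SASS regardless of the host ISA, everything in main() is ordinary C++ host code that builds the same whether the host CPU is x86 or Arm, and the AVX2 path is the kind of pure-CPU fallback the post describes.

```cpp
// Hypothetical illustration only: a toy SAXPY showing the host/device split.
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>
#ifdef __AVX2__
#include <immintrin.h>   // x86-only intrinsics for the CPU fallback path
#endif

// Device kernel: compiled by NVCC to PTX/SASS, independent of the host ISA.
__global__ void saxpy_gpu(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Pure-CPU alternative (shown for comparison, not called below): plain C++
// plus optional AVX2 intrinsics, which only exist on x86.
void saxpy_cpu(int n, float a, const float* x, float* y) {
#ifdef __AVX2__
    int i = 0;
    const __m256 va = _mm256_set1_ps(a);
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
    }
    for (; i < n; ++i) y[i] = a * x[i] + y[i];
#else
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
#endif
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // Host code: ordinary CUDA runtime calls, identical on x86 and Arm hosts.
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);

    cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    printf("y[0] = %f\n", y[0]);   // expect 4.0
    return 0;
}
```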
if they tried to lock CUDA to their own ARM-based systems
NV's core is and will be GPUs.
Disagree, I fully expect them to go hard after server CPUs.
I think you misunderstood. Not locking CUDA to ARM; that would be "suicide", I agree. What I meant was that they "lock" ARM into CUDA, or NV GPUs respectively, by forcing any (server CPU) implementation to carry NVLink or some other NV-proprietary piece that helps them and makes it very hard not to choose an NV GPU for an ARM server.
NV's core is and will be GPUs. Hence, in their mind, all they need is some mediocre CPU to run the general-purpose code. If they can force an entire ecosystem (ARM) onto NV GPUs, that's a huge win. Plus, they want to stop depending on Intel or AMD, because those CPU purchases fund R&D that also flows into competing GPUs. Honestly, I think AMD is in the worst spot. Intel has its huge income and oneAPI, and is in general a huge corporation that can invest a lot. NV has CUDA and is basically the GPU/AI market leader. AMD has the hardware, CPU, GPU and FPGA, but they simply miss the software, which honestly is the more crucial part. And I fear AMD lacks the manpower, and even more so the company culture, to finally solve that problem. ROCm is simply going way, way too slow. And as long as you can't prototype on Windows, forget it. Devs work in companies, and many of those still force you onto Windows, on the laptop and on the server!
If you think Nvidia can dominate with mediocre CPUs then you obviously don't understand what it takes to make a compelling heterogeneous compute platform ...
If you want to talk about who's behind on their software stack then look no further than oneAPI, since it still doesn't offer GPU acceleration with either TensorFlow or PyTorch. At the very least ROCm works with both, and with upstream support to boot. If Intel does have a huge amount of cash, it's not showing up in oneAPI, because between it and ROCm, the latter is arguably the more mature stack ...
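For a sense of what the ROCm side of this comparison looks like at the source level, here is a hypothetical sketch of the same kind of kernel written against ROCm's HIP runtime. HIP deliberately mirrors the CUDA runtime API almost name-for-name, which is what makes the "transition to ROCm" mentioned earlier largely mechanical; this assumes the standard hipcc toolchain and is an illustration, not a tested build.

```cpp
// Hypothetical illustration: SAXPY via ROCm's HIP runtime. Note how the calls
// mirror the CUDA runtime (hipMalloc vs cudaMalloc, hipMemcpy vs cudaMemcpy).
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    float *dx = nullptr, *dy = nullptr;
    hipMalloc(&dx, n * sizeof(float));
    hipMalloc(&dy, n * sizeof(float));
    hipMemcpy(dx, x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, y.data(), n * sizeof(float), hipMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);   // hipcc accepts this launch syntax

    hipMemcpy(y.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(dx);
    hipFree(dy);

    printf("y[0] = %f\n", y[0]);   // expect 4.0
    return 0;
}
```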
Well, Intel also doesn't have a product on the market yet, so I expect this to be working once they officially launch to the public. Maybe I'm overly optimistic, but given their perceived push into AI/ML, not having oneAPI work with the most common frameworks would be a pretty big failure given Intel's resources.
Have you ever actually tried ROCm? I've heard different stories. And even if you get it to work, it's Linux-only anyway. Where I work, we are full Windows. Even simple web servers have to be Windows. I guess it's no different at many other non-tech large orgs that standardize their stack.
While I agree that software has historically been a problem for AMD, I disagree that AMD's current approach to it is wrong or bad. On the contrary, the all-out open-source push is the one approach that lets AMD focus on its hardware while the software is open to everyone to use and improve. But open source needs a community to work, and that community isn't found on Windows (though WSL2 lets you merge it in to some degree; that's why Microsoft created WSL to begin with). Serious scientific research uses open source, AMD's two upcoming exascale supercomputers build on open source, and in fact improving ROCm is part of those contracts.
ROCm being tied to Linux is an intentional design decision. ROCm was initially built on AMDKFD, which was AMD's HSA kernel driver at the time. Windows also has WDDM kernel-driver limitations that prevent powerful APIs from being implemented, so they aren't interested in dealing with an inferior implementation ...
Linux is pretty much the future for compute and pretty soon there'll be more APIs that can only be properly implemented over there ...
Microsoft added support for Hyper-V DRM infrastructure to emulate a GPU and intercept calls from WSL, redirecting them to Windows.
You can implement a non-graphics, compute-only driver on Windows that circumvents the WDDM limitations. See the TCC driver for CUDA.
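As a minimal sketch of what that distinction looks like from host code: the CUDA runtime's cudaDeviceProp does expose a tccDriver field that reports whether a Windows GPU is running the compute-only TCC driver or the graphics WDDM driver. Treat the snippet as illustrative rather than production-grade (no error handling).

```cpp
// Hypothetical illustration: report whether each GPU runs under the
// compute-only TCC driver or the graphics WDDM driver on Windows.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d (%s): %s\n", d, prop.name,
               prop.tccDriver ? "TCC (compute-only driver)" : "WDDM (graphics driver)");
    }
    return 0;
}
```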
@Thala, obviously Nvidia will not even hint at their upcoming vendor lock-in practices in their press releases until the Arm deal is done. It would be moronic to do so (as they are trying to convince regulators that they "couldn't possibly even think of that, honest to god").
Where did you get the "mediocre CPU" info?
To quote STH:
On their DGX, Nvidia is going from dual Epyc, which produces 800-ish SPECint, to octo Grace, which will be around 2400 and will offer an order of magnitude more bandwidth than anything available in the same timeframe.
The point is that Nvidia is now a system vendor, so comparing Grace, which is soldered onto an SXM-type module, to a socketed CPU makes no sense.
They don't contradict what I said. In fact, they make the same mistake of comparing CPUs where Nvidia is selling systems. And they forget, intentionally or not, to compare the future DGX with 8 Grace CPUs to the best hypothetical 2023 DGX with 2 future x86 CPUs (roughly 2400 SPECint for the Grace solution vs 1000~1200 for 2 top-of-the-line x86 CPUs).
Arm-azing Grace Combines Arm CPU, NVIDIA GPU, and NVLink: We discuss the NVIDIA Grace platform announcement, what it means for the industry, and the misleading figure in the presentation (www.servethehome.com)
A system which replaces CXL capability with a lot of mediocre CPUs, gotcha.
As I said, nothing will match the CPU performance of DGX in this category and that's what matters.
Also, if Nvidia's future DGX systems need twice the number of CPU sockets to be competitive against the x86 alternatives, then maybe Grace CPUs are mediocre and they have a failure on their hands ... (the most common systems will always be either 1P or 2P; 4P/8P rarely ever sees deployment)
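For reference, the per-socket arithmetic behind this disagreement, taking the rough SPECint figures quoted in this exchange at face value (these are the posters' own estimates, not official numbers):

$$\frac{2400}{8\ \text{Grace sockets}} \approx 300\ \text{per socket},
\qquad
\frac{1000\text{--}1200}{2\ \text{x86 sockets}} \approx 500\text{--}600\ \text{per socket}.$$

So the eight-socket Grace DGX leads by roughly 2x in aggregate, while each x86 socket is roughly 1.7-2x faster on its own, which is exactly the aggregate-versus-per-socket split the two posters are arguing over.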