Question Nvidia to enter the server CPU market


DisEnchantment

Golden Member
Mar 3, 2017
1,607
5,799
136
Fusion lives on in CCIX. The name is gone, but the original concepts are fulfilled. To be specific, they're finally achieving something similar to Torrenza
Just to add...
Ironic to say HSA is dead, when AMD is beating this drum ever so loudly these days.
1619270874387.png
And the LLVM target for using ROCm is .... amdhsa.
Well, if you use ROCm you'd know this :)
Also, Xilinx is working to integrate the FPGA compute infra within ROCm. Xilinx FPGAs use CCIX for coherence.
1619271759375.png
The foundation being dead has nothing to do with AMD's HSA vision being dead.
Intel's and AMD's visions of HSA are well underway with the IF 3.0/CXL/CCIX interconnects and oneAPI/ROCm. Aldebaran KFD support is already present in mainline.
NV wants in as well using their updated coherent NVLink Interconnect. This acquisition by NV makes a lot of sense.... maybe not for their "Partners"
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,663
136
ROCm is ironically the polar opposite since it's a closed standard dictated purely by AMD
I've come to the conclusion you just like spouting a lot of hot air.

CUDA is often underrated since, contrary to the common impression, CUDA isn't just a programming framework for GPUs but a whole set of tools and an integrated ecosystem that allows its seamless, productive use. That is also both AMD's weakness in software and its goal with ROCm, with the big difference that where Nvidia has proprietary closed-source solutions, AMD often uses and adapts existing open-source efforts and doesn't try to lock its users into "one true" approach, instead trying to support different existing approaches:

[Image: ROCm open source chart]


AMD's big CUDA (as in the programming framework) "replacement" within ROCm is HIP, which is in a way little more than a translator (a subset of CUDA, in fact) that allows the result to be portable between GPUs by Nvidia (CUDA 4.0+) and by AMD without changes to the source code.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I've come to the conclusion you just like spouting a lot of hot air.

CUDA is often underrated since, contrary to the common impression, CUDA isn't just a programming framework for GPUs but a whole set of tools and an integrated ecosystem that allows its seamless, productive use. That is also both AMD's weakness in software and its goal with ROCm, with the big difference that where Nvidia has proprietary closed-source solutions, AMD often uses and adapts existing open-source efforts and doesn't try to lock its users into "one true" approach, instead trying to support different existing approaches:

*snip*

AMD's big CUDA (as in the programming framework) "replacement" within ROCm is HIP, which is in a way little more than a translator (a subset of CUDA, in fact) that allows the result to be portable between GPUs by Nvidia (CUDA 4.0+) and by AMD without changes to the source code.

How about you start seeing things for what they really are and stop being so blind?

Even MrLord admits that ROCm is mostly a closed standard that's ONLY developed by AMD and no one else. Not even Intel or Nvidia or any other hardware vendor gives a damn about ROCm ...

open implementation =/= open standard :sunglasses:
 

DrMrLordX

Lifer
Apr 27, 2000
21,634
10,848
136
Whatever leadership AMD had in GPUs was truly gone by that point which is why I maintain that they don't have a GPU guru culture anymore

Seems to me they don't need it. Ultimately ROCm/HIP are fulfilling a vision they had for their own product stack 15 years ago. Their only fault was in letting NV get there first with CUDA.

I'm pretty sure Xeon Phi got killed off by regular CPUs and GPGPU can't run standard C++ code either.

Not true. Phi was killed by:

1). Intel's failed 10nm process leading to the cancellation of Knights Hill (which is being replaced by Xe)
2). NV's HPC-oriented dGPUs

Also, if you thought you could just "run standard C++ code" on earlier Phi products . . . it wasn't quite that simple. Knights Landing was, I think, the first (and last) Phi product that made coding for it about as easy as writing code for any old Xeon. Assuming you could hand-tune AVX-512, but whatever. In any case, Phi competed directly against other HPC hardware, which at the time was (and still is) dGPUs. There is no "general purpose" CPU, x86 or otherwise, that pushed Phi out of its niche.

Xeon phi did not win all that much compared to regular CPUs

See Tianhe-2. You must concede that Phi products could produce greater throughput/watt in appropriate workloads than any of Intel's standard Xeon products of the same generation. Phi was eventually replaced by Cascade Lake-AP, and you ought to know how popular THAT was.

since there was kernel launch overhead

I thought that was eliminated by Knights Landing?

They were better off just implementing AVX-512 straight into the CPUs.

Intel had every intention of continuing Phi with Knights Mill despite implementing AVX-512 in standard Xeons as far back as Skylake-SP.

CPU guru culture

Now you're just making things up.

GPU compute platforms have existed ever since CUDA was created, so adding crappy CPUs to it won't make their entire stack more compelling than it already is ...

Simply not true. NVidia has always relied on someone else to supply CPUs and chipsets to host their devices, even when they were able to supply proprietary interconnects for their high-end stuff. There is the very real threat that Intel and AMD will simply kick them off their systems altogether. Drafting their own platform design with their own CPUs and chipsets offers a safe haven to their precious GPGPU business.

Also, Xilinx is working to integrate the FPGA compute infra within ROCm. Xilinx FPGAs use CCIX for coherence.

Pretty sure this didn't happen until after AMD bought them?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,607
5,799
136
Pretty sure this didn't happen until after AMD bought them?
You can read it here
They already had some demo code last year.
In any case, AMD will enable CCIX on their platform, and Xilinx would want to use that against Altera and others in some of these HPC segments.

The technology demonstration showcases:
  • Unified discovery and reservation of AMD and Xilinx accelerators using a converged runtime in the AMD ROCm open software platform;
  • Dispatch of work to Alveo accelerators using the same user-space queues used for low-latency work dispatch to AMD Instinct accelerators;
  • Peer-to-peer synchronization between GPU and FPGA devices; and
  • Access to memory on GPU, CPU, and FPGA devices using a common, shared virtual address space
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Seems to me they don't need it. Ultimately ROCm/HIP are fulfilling a vision they had for their own product stack 15 years ago. Their only fault was in letting NV get there first with CUDA.

Which is exactly why they had a radical shift in their strategy to be closer to Intel's, and that's perfectly fine ...

Also AMD's original vision for HSA was for it to become a new industry standard which is why the HSA Foundation started in the first place so that they can standardize HSAIL (HSA intermediate language) for targets like OpenCL, C++ AMP or maybe even SYCL. AMD didn't create HSA with the original intent of using it solely on ROCm or just cloning CUDA (HIP) for the others here that don't know any better ...

The original model for HSA was that we start with OpenCL (or some other open standard like C++ AMP/SYCL) and then drivers would ingest HSAIL, but I don't think AMD envisioned back then that none of this would play out. The HSA project started out in a standards committee (the HSA Foundation) and then went on to die as a low-level interface for AMD. AMD reused their HSA kernel driver purely out of convenience, since it was the lowest-level interface for their GPUs at the time, and that really has nothing to do with HSA itself. ROCm itself doesn't even support HSAIL and instead supports compilation of HIP kernels (closed standard) into GCN assembly (proprietary ISA), which defeats the original purpose of HSA being 'portable' ...

Not true. Phi was killed by:

1). Intel's failed 10nm process leading to the cancellation of Knights Hill (which is being replaced by Xe)
2). NV's HPC-oriented dGPUs

Also, if you thought you could just "run standard C++ code" on earlier Phi products . . . it wasn't quite that simple. Knights Landing was, I think, the first (and last) Phi product that made coding for it about as easy as writing code for any old Xeon. Assuming you could hand-tune AVX-512, but whatever. In any case, Phi competed directly against other HPC hardware, which at the time was (and still is) dGPUs. There is no "general purpose" CPU, x86 or otherwise, that pushed Phi out of its niche.

See Tianhe-2. You must concede that Phi products could produce greater throughput/watt in appropriate workloads than any of Intel's standard Xeon products of the same generation. Phi was eventually replaced by Cascade Lake-AP, and you ought to know how popular THAT was.

I thought that was eliminated by Knights Landing?

Intel had every intention of continuing Phi with Knights Mill despite implementing AVX-512 in standard Xeons as far back as Skylake-SP.

Phi was mostly eliminated by general purpose CPUs. CPUs now have core counts going up to 64 cores, and EPYC Genoa will have up to 96 cores with AVX-512 to boot. Phi also made programming more complex, as kernels needed to be separated by host or device for execution, and the latency overhead of splitting kernels between host/device was never solved. Instead, general purpose CPUs prevailed, since there was no significant performance benefit to the Xeon Phi and regular CPUs were easier to program when programmers didn't have to deal with device-specific code nonsense ...
 

NTMBK

Lifer
Nov 14, 2011
10,237
5,020
136
Phi was mostly eliminated by general purpose CPUs. CPUs now have core counts going up to 64 cores, and EPYC Genoa will have up to 96 cores with AVX-512 to boot. Phi also made programming more complex, as kernels needed to be separated by host or device for execution, and the latency overhead of splitting kernels between host/device was never solved. Instead, general purpose CPUs prevailed, since there was no significant performance benefit to the Xeon Phi and regular CPUs were easier to program when programmers didn't have to deal with device-specific code nonsense ...

Knights Landing solved that problem. There was no "host"; the Phi did everything. Knights Hill was going to be the same, until it got killed.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Knights Landing solved that problem. There was no "host"; the Phi did everything. Knights Hill was going to be the same, until it got killed.

I'm pretty sure the Intel compiler has explicit extensions for 'offloading', so it's not as transparent as you believe it to be. Automatic offloading is only available if you're using Intel's MKL library ...

Knights Landing only solves the compatibility issues with other Xeon processors since Knights Corner had a pretty different x86 ISA implementation ...
 

NTMBK

Lifer
Nov 14, 2011
10,237
5,020
136
I'm pretty sure the Intel compiler has explicit extensions for 'offloading', so it's not as transparent as you believe it to be. Automatic offloading is only available if you're using Intel's MKL library ...

Knights Landing only solves the compatibility issues with other Xeon processors since Knights Corner had a pretty different x86 ISA implementation ...

The offloading was needed for devices on PCIe. Knights Landing as a host could just use AVX-512 instructions inside any arbitrary code sequence.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
The offloading was needed for devices on PCIe. Knights Landing as a host could just use AVX-512 instructions inside any arbitrary code sequence.

Yeah, but using Knights Landing as a pure host defeats the intended concept behind Phi, which was supposed to be a co-processor or a generic accelerator. Why even bother with Xeon Phi when you can get much higher single-threaded performance and not much of a performance deficit on parallel code with server CPUs? This is why Xeon Phi's design was flawed from the get-go ...
 

fkoehler

Member
Feb 29, 2008
193
145
116
The history is interesting, however there's an excellent chance this will all come to naught.

With the current situation vis-a-vis silicon, and regardless of the fact that ARM doesn't actually manufacture anything, I would not bet on NV getting the go-ahead from at least the UK for the acquisition.
And I wouldn't be surprised in the least if China blocks it just to spite the US.
 

DrMrLordX

Lifer
Apr 27, 2000
21,634
10,848
136
Which is exactly why they had a radical shift in their strategy to be closer to Intel's, and that's perfectly fine ...

Radical shift my butt.

Also AMD's original vision for HSA was for it to become a new industry standard

Smoke and mirrors. If you actually followed the hardware behind HSA you'd see that the HSA Foundation was populated by AMD and AMD alone. They were the only vendor to produce hardware that was HSA-compliant. It's no different with ROCm. AMD may have gotten some other code contributors but it did them very little good. Again, did you ever use the HSA software stack with Kaveri or Carrizo?

Phi was mostly eliminated by general purpose CPUs.

I don't know why you persist in this fantasy. Phi-based supercomputers like Tianhe-2 were supplanted on the Top 500 by systems like Summit that use NV dGPUs. Phi was effectively dead in 2017 when Intel's 10nm was completely botched. The last Phi product to reach the market - Knights Landing - was launched in Q2 2016.

CPUs now have core counts going up to 64 cores, and EPYC Genoa will have up to 96 cores

You don't think that people build HPC/ML training machines solely around those, do you?

Yeah but using Knights Landing as a pure host defeats the intended concept behind Phi which was supposed to be a co-processor or a generic accelerator.

If that's what you think about Phi, then maybe you don't understand the significance of Grace either. If NV could boot a system to a functioning OS entirely using only their own dGPUs without a motherboard + chipset and CPU, they most certainly would.

The history is interesting, however there's an excellent chance this will all come to naught.

With the current situation vis-a-vis silicon, and regardless of the fact that ARM doesn't actually manufacture anything, I would not bet on NV getting the go-ahead from at least the UK for the acquisition.
And I wouldn't be surprised in the least if China blocks it just to spite the US.

Even if NV's acquisition of ARM Ltd. is blocked, they can still produce Grace. What they can't do is prevent hyperscalers or smaller shops from producing their own divergent ARM designs or "infect" the ARM Neoverse reference platform with their own proprietary tech.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Smoke and mirrors. If you actually followed the hardware behind HSA you'd see that the HSA Foundation was populated by AMD and AMD alone. They were the only vendor to produce hardware that was HSA-compliant. It's no different with ROCm. AMD may have gotten some other code contributors but it did them very little good. Again, did you ever use the HSA software stack with Kaveri or Carrizo?

Pretty sure there were other members in the HSA Foundation like ARM, PowerVR, and Qualcomm. I don't deny that AMD was the biggest technical contributor to its specifications, but HSA arguably had a bigger vision in the past compared to now, which is why I consider it to be mostly dead ... (the HSA kernel driver was renamed to "ROCK kernel driver" to reflect this)

I don't know why you persist in this fantasy. Phi-based supercomputers like Tianhe-2 were supplanted on the Top 500 by systems like Summit that use NV dGPUs. Phi was effectively dead in 2017 when Intel's 10nm was completely botched. The last Phi product to reach the market - Knights Landing - was launched in Q2 2016.

You don't think that people build HPC/ML training machines solely around those, do you?

If that's what you think about Phi, then maybe you don't understand the significance of Grace either. If NV could boot a system to a functioning OS entirely using only their own dGPUs without a motherboard + chipset and CPU, they most certainly would.


Again, Phi was designed to run C++ code and not CUDA code, which is reflected in the fact that it is compatible with standard C++ compilers like ICC or GCC, while NV GPUs needed their special NVCC compiler at the time. I don't see how GPUs could've killed Phi when Phi was designed to get higher throughput on some portions of the CPU code ...

Phi was flawed from the beginning, like I stated before. Regular CPUs are supposed to run host code, which contains kernels that are mostly bound by single-threaded performance or sensitive to latency. You can run an OS on GPUs, but it would be horrendously slow, so it's just one more reason why AMD and Intel should just focus on CPUs so they can get relatively moderate performance on all code ...

That's why the GP in 'GPGPU' will never truly take off: Nvidia integrates "special purpose hardware" like tensor cores on their GPUs, and they'll never come close to even the programming capabilities of the Xeon Phi either ...

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
They were the only vendor to produce hardware that was HSA-compliant.
Chips using these three IPs (Cortex-A73, Mali-G71, and CoreLink CCI-550) are HSA 1.1 compliant.

The above means, for example, that the HiSilicon Kirin 960 (among others) is HSA 1.1 compliant.

Saylick

Diamond Member
Sep 10, 2012
3,163
6,392
136
Let's see how they stack up to Sapphire Rapids and Genoa

For reference:
122609.png


144 ARM cores producing 740 SPECint2017... that's about 5.14 points per core.
2x64 Milan gets 512 points, or 4 points per core.

Genoa likely gets a >20% IPC bump along with some clock increases, so I think it's totally feasible for AMD to get >920 SPECint2017 out of a 2-socket Genoa server with 96 cores per socket. Bergamo is supposed to do 2x the throughput of top Milan per socket, so a 2-socket Bergamo server with 128 cores per socket should do about 1024 points.
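
The per-core arithmetic above can be reproduced with a short sketch. It only uses the figures quoted in this post; the variable names are made up for illustration, and the Genoa projection assumes exactly a 20% per-core uplift, no clock gain, and perfectly linear scaling with core count, which real systems will not match exactly.

```python
# Figures quoted in the post (SPECint2017 rate estimates).
grace_score, grace_cores = 740, 144          # Grace CPU Superchip estimate
milan_score, milan_cores = 512, 2 * 64       # 2-socket Milan estimate

def points_per_core(score, cores):
    """Rate score divided by the total number of cores."""
    return score / cores

grace_ppc = points_per_core(grace_score, grace_cores)
milan_ppc = points_per_core(milan_score, milan_cores)

# Assumption: Genoa = Milan per-core score * 1.20, scaled to 2 x 96 cores.
genoa_cores = 2 * 96
genoa_estimate = milan_ppc * 1.20 * genoa_cores

# Assumption: Bergamo doubles Milan's per-socket throughput, so a 2-socket
# Bergamo system doubles the 2-socket Milan score.
bergamo_estimate = 2 * milan_score

print(f"Grace:   {grace_ppc:.2f} points/core")
print(f"Milan:   {milan_ppc:.2f} points/core")
print(f"Genoa:   ~{genoa_estimate:.0f} points (2 sockets)")
print(f"Bergamo: ~{bergamo_estimate} points (2 sockets)")
```

Under these assumptions the Genoa figure lands just above 920, which is where the ">920" in the post comes from.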
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
For reference:
122609.png


144 ARM cores producing 740 SPECint2017... that's about 5.14 points per core.
2x64 Milan gets 512 points, or 4 points per core.

Genoa likely gets a >20% IPC bump along with some clock increases, so I think it's totally feasible for AMD to get >920 SPECint2017 out of a 2-socket Genoa server with 96 cores per socket. Bergamo is supposed to do 2x the throughput of top Milan per socket, so a 2-socket Bergamo server with 128 cores per socket should do about 1024 points.

I got this out of WCCFTECH

 

Saylick

Diamond Member
Sep 10, 2012
3,163
6,392
136

DisEnchantment

Golden Member
Mar 3, 2017
1,607
5,799
136
144 ARM cores producing 740 SPECint2017... that's about 5.14 points per core.
2x64 Milan gets 512 points, or 4 points per core.
You cannot compare scores just like that; it is very easy to find SPECint2017 scores where Milan hits far higher, closer to 1000.

Hardware Vendor | System | Peak Result | Base Result | Energy Peak Result | Energy Base Result | # Cores | # Chips | Published
ASUSTeK Computer Inc. | ASUS RS720A-E11(KMPP-D32) Server System 2.45 GHz, AMD EPYC 7763 | 913 | 861 | -- | -- | 128 | 2 | Dec-2021
ASUSTeK Computer Inc. | ASUS RS720A-E11(KMPP-D32) Server System 2.45 GHz, AMD EPYC 7763 | 892 | 839 | -- | -- | 128 | 2 | Mar-2021
Cisco Systems | Cisco UCS C225 M6 (AMD EPYC 7763 64-Core Processor) | 898 | 851 | -- | -- | 128 | 2 | Sep-2021
Cisco Systems | Cisco UCS C245 M6 (AMD EPYC 7763 64-Core Processor) | 898 | 854 | -- | -- | 128 | 2 | Jul-2021
Cisco Systems | Cisco UCS C245 M6 (AMD EPYC 7763 64-Core Processor) | 892 | 850 | -- | -- | 128 | 2 | Jun-2021
Cisco Systems | Cisco UCS C245 M6 (AMD EPYC 7763 64-Core Processor) | -- | 852 | -- | -- | 128 | 2 | Jun-2021
Dell Inc. | PowerEdge C6525 (AMD EPYC 7763 64-Core Processor) | 848 | 800 | -- | -- | 128 | 2 | May-2021
Dell Inc. | PowerEdge C6525 (AMD EPYC 7763 64-Core Processor) | 835 | 790 | -- | -- | 128 | 2 | Mar-2021
Dell Inc. | PowerEdge R6525 (AMD EPYC 7763 64-Core Processor) | 872 | 822 | -- | -- | 128 | 2 | Jun-2021
Dell Inc. | PowerEdge R6525 (AMD EPYC 7763 64-Core Processor) | 845 | 801 | -- | -- | 128 | 2 | Mar-2021
Dell Inc. | PowerEdge R7525 (AMD EPYC 7763 64-Core Processor) | 872 | 821 | -- | -- | 128 | 2 | May-2021
Dell Inc. | PowerEdge R7525 (AMD EPYC 7763 64-Core Processor) | 853 | 802 | -- | -- | 128 | 2 | Apr-2021
Dell Inc. | PowerEdge R7525 (AMD EPYC 7763 64-Core Processor) | 846 | 798 | -- | -- | 128 | 2 | Mar-2021
Fujitsu | PRIMERGY RX2450 M1, AMD EPYC 7763 2.45 GHz | -- | 824 | -- | -- | 128 | 2 | Oct-2021
GIGA-BYTE TECHNOLOGY CO., LTD. | R282-Z90 (AMD EPYC 7763, 2.45GHz) | 866 | 813 | -- | -- | 128 | 2 | Mar-2021
GIGA-BYTE TECHNOLOGY CO., LTD. | R282-Z90 (AMD EPYC 7763, 2.45GHz) | 884 | 832 | -- | -- | 128 | 2 | Jul-2021
Hewlett Packard Enterprise | ProLiant DL365 Gen10 Plus (2.45 GHz, AMD EPYC 7763) | 865 | 813 | -- | -- | 128 | 2 | May-2021
Hewlett Packard Enterprise | ProLiant DL385 Gen10 Plus v2 (2.45 GHz, AMD EPYC 7763) | 872 | 821 | -- | -- | 128 | 2 | Mar-2021
Lenovo Global Technology | ThinkSystem SR645 2.45 GHz, AMD EPYC 7763 | 874 | 819 | -- | -- | 128 | 2 | Mar-2021
Lenovo Global Technology | ThinkSystem SR645 2.45 GHz, AMD EPYC 7763 | 870 | 819 | -- | -- | 128 | 2 | Mar-2021
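
To put the published numbers next to the figures discussed earlier in the thread, here is a rough sketch. It takes the highest and lowest 2-socket EPYC 7763 results from the table above and compares them with the 740-point Grace estimate and the 512-point Milan estimate quoted earlier. The helper name is hypothetical, and the thread does not say whether the 740 figure is a base or a peak result, so both comparisons are shown.

```python
# Values copied from the table above (2 x EPYC 7763, 128 cores).
best_peak = 913      # ASUS RS720A-E11, Dec-2021, peak result
best_base = 861      # same submission, base result
lowest_base = 790    # Dell PowerEdge C6525, Mar-2021, base result

# Figures quoted earlier in the thread.
grace_estimate = 740
milan_estimate = 512

def percent_ahead(score, reference):
    """How far `score` is ahead of `reference`, in percent."""
    return (score / reference - 1.0) * 100.0

for label, score in [("best peak", best_peak),
                     ("best base", best_base),
                     ("lowest base", lowest_base)]:
    print(f"Milan {label}: {score} -> "
          f"{percent_ahead(score, grace_estimate):.1f}% ahead of the Grace estimate")

print(f"Best published peak vs. the 512-point Milan estimate: "
      f"{percent_ahead(best_peak, milan_estimate):.1f}% higher")
```

Depending on which published result is used, Milan comes out somewhere between roughly 7% and 23% ahead of the 740-point figure in this sketch.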
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,560
14,515
136
Thanks for the heads up. I thought that AT's own internal estimates would be good enough; looks like I was wrong.
It's kind of sad that it gets beat by 33% by a CPU that's been out for over a year (by the time it launches). By the time Genoa comes out (close to its launch), I am sure it will get beat by 100%.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,607
5,799
136
It's kind of sad that it gets beat by 33% by a CPU that's been out for over a year (by the time it launches). By the time Genoa comes out (close to its launch), I am sure it will get beat by 100%.
It is not sad; that chip was made specially for AI and HPC, where memory bandwidth is more important than int throughput.
They are not advertising it for general compute as far as I have seen. The primary focus is the accelerators, which are NV's core business.
A fair comparison should be Trento+MI250X (or their successor) setup vs Grace+Hopper setup. The bulk of the compute is from the accelerator and the CPU is just the enabler.
 

beginner99

Diamond Member
Jun 2, 2009
5,210
1,580
136
By the time Genoa comes out (close to its launch) I am sure it will get beat by 100%
Maybe, but only in this benchmark. The main advantage is the tight integration between CPU and GPU over a super fast bus; this matters for "AI" or other compute workloads. In a mixed workload, though, single-threaded CPU performance also matters greatly, and there ARM certainly draws the short stick.