Discussion [Tomshardware] EPYC Genoa and Radeon Instinct to Power Two-Exaflop DOE Supercomputer


DisEnchantment

Golden Member
Mar 3, 2017
EPYC win for AMD for the two-exaflop exascale supercomputer.
EPYC Genoa + Future Radeon Instinct
Zen4 + DDR5 + Radeon Instinct

Big win for ROCm and users of open source SW.


I hope this gives a big boost to ROCm.
As a Linux user I am really looking forward to ROCm support being upstreamed for most frameworks, so the community can benefit without having to reverse engineer their way through the stack to find bugs and root causes when something is not working in the SW.

(Intel is also a big OSS contributor so I hope they win some too)


Update 1
AT Link

However the most interesting claim is that these IF 3.0 device nodes will support unified memory across the CPU and GPU, which is something AMD doesn’t offer today.

I hope it means what it sounds like.
That we can write code which treats memory on the GPU the same way we treat memory shared with another thread/core on the CPU? Sounds like a dream come true.
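Something along these lines is what I am hoping for - just a rough sketch in HIP, assuming the platform supports managed/coherent memory (today via hipMallocManaged; with full IF 3.0 coherency even plain host allocations should behave this way):

Code:
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // GPU writes straight into the shared buffer
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    hipMallocManaged((void**)&data, n * sizeof(float));   // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;           // CPU writes, no hipMemcpy anywhere

    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, data, n, 2.0f);
    hipDeviceSynchronize();                               // wait for the GPU, like joining a thread

    printf("data[0] = %f\n", data[0]);                    // CPU reads the result directly
    hipFree(data);
}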


Best bit for me

Scott said, “As part of this procurement, the Department of Energy has provided additional funds beyond the purchase of the machine to fund non-recurring engineering efforts and one major piece of that is to work closely with AMD on enhancing the programming environment for their new CPU-GPU architecture.” Work is ongoing by all three partners to take the critical applications and workloads forward and optimize them to get the best performance in the machine when El Capitan is delivered.


Update 2
It is time to start a new Zen 4 thread :)
 

Hitman928

Diamond Member
Apr 15, 2012
My guess is Zen3 gets released 4Q this year, still on AM4, and then AMD gives a bit more time between generations to allow for the transition to AM5 (or whatever they call it) and the corresponding server socket, which means that Zen4 (Genoa) will be released late 1H 2022. Just a guess though.

From Anandtech's front page:
To that end, AMD confirmed what we essentially knew, with Zen 3 based Milan coming in ‘late 2020’.


Zen 4 based Genoa has already been announced as the CPU to power the El Capitan supercomputer, and in this roadmap AMD has put it as coming out by 2022. We asked AMD for clarification, and they stated that in this sort of graph, we should interpret it as the full stack of Genoa should be formally launched by the end of 2022. Given AMD’s recent 12-15 month cadence with the generations of EPYC, and the expected launch of Milan late this year, we would expect to see Genoa in early 2022.


Not too bad a guess it seems.
 

DisEnchantment

Golden Member
Mar 3, 2017
I must be confused.... wasn't the whole point of Fusion/HSA a unified memory architecture?

Wasn't that already fully realised as of Volcanic Islands/GCN3?

GCN does not have cache coherency. Read further below

APUs and pre-CDNA 2
APUs do have a heterogeneous architecture with a single memory space, but since they are not cache coherent, access to the same memory location by the CPU and GPU causes problems.
In an APU, technically, the memory seen by the GPU and CPU is in the same memory space, but since there is no coherency, they could be accessing/modifying the same memory region and turning the data into garbage.
Therefore APUs tend to have memory reserved for the GPU, so that the address range is split and not accessed by both the CPU and GPU at the same time, even though they can see each other's entire memory space.
Compare this to a discrete GPU where the GPU memory is unknown to the CPU.
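For contrast, this is roughly what the discrete-GPU model looks like today (a minimal HIP sketch, not AMD's exact example): the GPU has its own memory pool, so data gets staged back and forth with explicit copies.

Code:
#include <hip/hip_runtime.h>
#include <vector>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);                     // CPU-side data

    float* dev = nullptr;
    hipMalloc((void**)&dev, n * sizeof(float));           // separate GPU-only allocation
    hipMemcpy(dev, host.data(), n * sizeof(float), hipMemcpyHostToDevice);   // copy in

    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, dev, n, 2.0f);

    hipMemcpy(host.data(), dev, n * sizeof(float), hipMemcpyDeviceToHost);   // copy back
    hipFree(dev);
}

With CPU-GPU coherency, the two hipMemcpy calls and the duplicate buffer are exactly what should go away.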


CDNA 2 + EPYC Genoa

(Attached: two AMD slides showing the code example referenced below.)


In the upcoming Infinity Architecture 3.0 there is cache coherency between the CPU and GPU, which means the CPU knows when data is modified by the GPU.
This is similar to how CPU cores know when data is modified by other cores, through the cache hierarchy and other coherency mechanisms.

From a programming perspective this is a very big deal, because as you can see in the example from AMD above, it simplifies code in a big way.
It is exactly like programming in a multithreaded way, except that for the GPU there are more specific function calls to schedule the job on what is effectively another thread.

Imagine allocating data in memory that a separate thread can modify, and being able to read it without doing any copy operation.
I imagine AMD already has some mechanism to synchronize between GPU and CPU threads, which would make this operation very seamless.

If you are developing HPC code doing heavy math operations, instead of doing some matrix math operation on the CPU you could just perform a blocking function call, without going to CUDA for example.
This would make your otherwise slow code much faster and could easily beat the vector CPU extensions from either Intel or AMD, but without the complexity of performing memory synchronization when using a GPU.
If you are a developer you will be able to appreciate how much of a paradigm shift this is.
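To make it concrete, here is a hypothetical sketch of such a blocking call in HIP - gpu_saxpy is an illustrative name, not an existing AMD/ROCm API, and with the kind of CPU-GPU coherency described above the managed allocations could in principle become plain malloc/new pointers:

Code:
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Blocking helper: to the caller this looks like an ordinary synchronous CPU
// library call, but the work actually runs on the GPU.
void gpu_saxpy(int n, float a, const float* x, float* y) {
    hipLaunchKernelGGL(saxpy_kernel, dim3((n + 255) / 256), dim3(256), 0, 0, n, a, x, y);
    hipDeviceSynchronize();   // block until the GPU is done
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr, *y = nullptr;
    // With full coherency these could be plain host pointers; today use managed memory.
    hipMallocManaged((void**)&x, n * sizeof(float));
    hipMallocManaged((void**)&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    gpu_saxpy(n, 3.0f, x, y);        // no copies before or after the call

    printf("y[0] = %f\n", y[0]);
    hipFree(x); hipFree(y);
}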


For this reason I believe Genoa will be a game changer, and you can bet AMD's competitors are very concerned. Why, you might ask?
Because:
- The on-chip memory could be used in a cache hierarchy to achieve coherency between the GPU and the CPU.
- Genoa is scheduled on 5nm, which means more density.
- Genoa lines up with CDNA 2, the architecture that supports the new Infinity Architecture 3.0.
- The X3D interconnect will probably be used on Genoa to stack dies and achieve the 10x bandwidth AMD claimed.
- High bandwidth and low latency mean the GPU could basically act as another CPU node on which you can schedule your threads.

In light of this, it is no surprise to me that AMD has been bagging exascale deals left and right.

What it means for general purpose graphics remains to be seen; I might explore this later, but I am only familiar with OpenGL and not Vulkan and others, so my insight could be limited.

AMD and current Linux activities
Outside of this particular topic, AMD has been very active with GPU virtualization, and Azure offered the new virtualized GPU instances for the very first time together with the Rome instances.
AMD is now also active in bringing GPU cgroup control to the Linux kernel, although I have seen some resistance from other members (but only from a code standpoint, not the concept).
All these activities to use the GPU as seamlessly as the CPU will make vector extensions in the CPU less unique, because a GPU is by nature several orders of magnitude faster in this regard.
The cgroup management for GPUs and GPU virtualization will let Docker and VMs use GPUs seamlessly, including resource allocation/splitting like Linux already does for memory and CPU core/time/weight.
 

soresu

Platinum Member
Dec 19, 2014
GCN does not have cache coherency. Read further below
[...]
In upcoming Infinity Architecture 3.0, there is cache coherency between CPU and GPU which means the CPU knows when the data is modified by the GPU.
[...]
I wonder if this feature will remain exclusive to CDNA; it would certainly be beneficial to general compute tasks on RDNA too - perhaps we might finally be able to step away from the ever-widening SIMD units on CPUs if so.
 

moinmoin

Diamond Member
Jun 1, 2017
I wonder if this feature will remain exclusive to CDNA; it would certainly be beneficial to general compute tasks on RDNA too - perhaps we might finally be able to step away from the ever-widening SIMD units on CPUs if so.
I'd expect AMD to introduce cache coherent heterogeneous architecture to gaming down the line, so that future RDNA and especially APUs can profit from it as well.

Cache coherent protocols in general are not specific to Infinity Fabric/Architecture but are an industry-wide movement, with Gen-Z, CCIX, OpenCAPI, and CXL all pushing to become the industry-wide standard. They mostly depend on PCIe 5, so the non-server market may only get this once PCIe 5 is introduced there as well. AMD offering both CPU and GPU, as well as an existing protocol capable of covering both, just strengthens its first mover advantage (and very successfully so, considering the exascale wins).
 

soresu

Platinum Member
Dec 19, 2014
Interesting... big question:
Will future APUs be CDNA or RDNA?
Very possibly we may see a future divergence between consumer and pro/server/workstation APUs for pure compute, as with the dGPUs themselves - both have their uses, but ditching graphics will make the GPU/accelerator of a pro APU a much more enticing prospect, especially in this current era of ML importance.
 

soresu

Platinum Member
Dec 19, 2014
I'd expect AMD to introduce cache coherent heterogeneous architecture to gaming down the line, so that future RDNA and especially APUs can profit from it as well.

Cache coherent protocols in general are not specific to Infinity Fabric/Architecture but are an industry-wide movement, with Gen-Z, CCIX, OpenCAPI, and CXL all pushing to become the industry-wide standard. They mostly depend on PCIe 5, so the non-server market may only get this once PCIe 5 is introduced there as well. AMD offering both CPU and GPU, as well as an existing protocol capable of covering both, just strengthens its first mover advantage (and very successfully so, considering the exascale wins).
It may finally bring programmable video codecs towards using the GPU efficiently - or I hope so anyway; 64C CPUs are a lot more expensive than a halfway decent graphics card.

I wonder if all those new protocols will shortly be receiving an upgrade from the expected PCI-E 6.0 final standard next year?
 

moinmoin

Diamond Member
Jun 1, 2017
Minor update regarding AMD's ranking with supercomputers (didn't think it was worth a new thread):

— Frontier supercomputer, powered by AMD EPYC CPUs and AMD Instinct Accelerators, achieves number one spots on Top500, Green500 and HPL-AI performance lists, an industry first —

— AMD powers five of the top ten most powerful and eight of the top ten most efficient supercomputers in the world —
 

Vattila

Senior member
Oct 22, 2004
Cache coherent protocols in general are not specific to Infinity Fabric/Architecture but are an industry-wide movement,

Right. And actually, compared to the off-the-shelf EPYC "Milan", cache-coherence is purportedly the big differentiator in "Trento" in Frontier. The speculation is that AMD brought forward the IF 3.0 cache-coherence feature for the IOD chiplet for "Trento".

"Not much is known about Trento, but it is widely expected to be a custom “Milan” part that takes the same cores as are used in Milan but marries them to a new I/O and memory chiplet that has Infinity Fabric 3.0 links on the port, and enough of them so that the CPU memory of a single socket and the memory of four GPUs can be all linked together into a single, coherent, shared memory." Oak Ridge has coherent memory in the IBM Power9 CPU-Nvidia Volta GPU compute complexes that are the basis of the “Summit” supercomputer, and this coherence, enabled by the addition of NVLink ports on the Power9 processors, was one of the salient characteristics of the architecture that allowed IBM and Nvidia to win the deal to build Summit. There is no way Oak Ridge would award the Frontier system to any vendor that didn’t have such coherence in their CPU and GPU complexes."

AMD Wants To Put Together The Complete Package (nextplatform.com)

"These Trento CPUs run at 2 GHz and have 64-cores; it is basically comprised of “Milan” core complexes linked to a memory and I/O die that has the Infinity Fabric 3.0 coherence interconnect enabled. Infinity Fabric 3.0 was not slated to appear until the “Genoa” Epyc 7004s later this year, and Frontier itself was expected to weigh in at around 1.5 exaflops and be installed in 2021. So AMD pulled this capability forward in the custom Trento chip, which was necessary because Oak Ridge already had CPU-GPU memory coherence in the “Summit” supercomputer installed in 2018. (Once you have coherence, you can’t go back. Programmers will revolt.)"

Frontier: Step By Step, Over Decades, To Exascale (nextplatform.com)

By the way, I am in awe of AMD's efficiency lead. AMD CTO Mark Papermaster's ambitious and relentless focus on power-efficiency (30x by 2025) has really paid off — with superstar engineers such as Corporate Fellow Sam Naffziger to make it happen. In the chart below, note that Selene uses AMD EPYC CPUs combined with Nvidia GPUs.

(Attached: the power-efficiency chart referenced above, including Selene.)


Also worth noting is that Frontier is now the number 1 system for AI, as measured by the HPL-AI benchmark. AI (and HPC) workloads will be programmed on these AMD-based systems without CUDA in sight — as will the 2+ exaflop Aurora and El Capitan systems arriving next year, as well. So the proprietary CUDA moat is being steadily overcome in the supercomputing space, which is good to see.

 

DrMrLordX

Lifer
Apr 27, 2000
Also worth noting is that Frontier is now the number 1 system for AI, as measured by the HPL-AI benchmark. AI (and HPC) workloads will be programmed on these AMD-based systems without CUDA in sight — as will the 2+ exaflop Aurora and El Capitan systems arriving next year, as well. So the proprietary CUDA moat is being steadily overcome in the supercomputing space, which is good to see.

If that trend extends into the commercial sector then we may see some progress away from CUDA. There will probably be some in academia whose preferences may be altered by the advent of publicly-funded supercomputers that can operate without CUDA; that being said, with all the toolchains emerging that allow translation of CUDA to SYCL, maybe not.
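For anyone curious what the non-CUDA code ends up looking like, here is a minimal vendor-neutral SYCL sketch (roughly the shape a CUDA-to-SYCL translation produces; this is my own illustrative example, buildable with DPC++ or AdaptiveCpp-style toolchains):

Code:
#include <sycl/sycl.hpp>

int main() {
    const int n = 1 << 20;
    sycl::queue q;                                    // picks a default device
    float* data = sycl::malloc_shared<float>(n, q);   // USM: visible to host and device

    for (int i = 0; i < n; ++i) data[i] = 1.0f;       // host initializes

    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        data[i[0]] *= 2.0f;                           // device kernel
    }).wait();                                        // block until the kernel finishes

    sycl::free(data, q);
}

AMD's hipify tooling does the equivalent CUDA-to-HIP translation, which is presumably how a lot of existing CUDA code will land on Frontier and El Capitan.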
 

moinmoin

Diamond Member
Jun 1, 2017
AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues | TechPowerUp

Sensational headline. The actual scientists aren't too concerned and say that it's nothing out of the ordinary for computing at this scale.
I wanted to say that it's better to rely on dedicated HPC sites for this kind of news, but their primary source already is such an article, so here's the direct link:
So essentially the supercomputer opens to everybody in January, so they are trying to solve all the possible problems until then. Everybody who has worked on getting any first-of-its-kind, state-of-the-art tech to run knows that these things seldom go smoothly from the very beginning, so it's kind of a non-story where the interesting part is the details.