
Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 6000)


What do you expect with Zen 4?


  • Total voters
    168

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
It would certainly tie in well with AMD getting support for OneAPI through HipSYCL/ROCm.

The whole point with OneAPI for Intel was unifying CPU, GPU, FPGA, AI/ML accelerators etc under one coding roof.

Here's hoping, for their sake, that it doesn't take AMD forever to deliver on the SW implementation, because Intel will certainly have a humongous head start on the FPGA side unless Xilinx's own software platform can be made to simply take OneAPI code without a giant amount of engineering in the interim.
I think it has more to do with whether the SYCL runtime implementation is compliant with the SYCL standard as specified by Khronos, rather than with OneAPI supporting AMD HW.
The SYCL runtime is just one of many runtimes that the libraries exposed to the various frameworks can make use of.


You can check out this video on programming Xilinx FPGAs with SYCL.

SYCL was developed with heterogeneous systems in mind.
Like I wrote previously, ROCm is a complete ecosystem, with all the math libs, communication libs for multi-node clusters, work dispatch, library integration into popular frameworks, etc.
Integrating a SYCL runtime with ROCm should be technically feasible, if not easy, provided AMD has a business case for it.
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
There is hipSYCL which already integrates with ROCm for AMD GPUs (while also supporting CPUs through C++17 OpenMP compilers and Nvidia GPUs through clang/CUDA).
It is not an official ROCm component; the AMD ROCm runtime debs don't have it. SYCL is not planned to be a part of ROCm 4.0, afaik.
It just uses ROCm infrastructure to make SYCL on AMD possible.
Once the entire ROCm ecosystem is up, adding SYCL is not much of a challenge imo.
 
  • Like
Reactions: Tlh97 and Vattila

Vattila

Senior member
Oct 22, 2004
551
576
136
AMD should get Microsoft aboard to implement SYCL in their C++ compiler. Microsoft partnered with AMD to create C++ AMP (ref. AMD's Fusion Developer Summits a few years ago), which is similar in philosophy, but requires a small non-standard extension to the C++ language. SYCL supersedes that effort by eliminating the need for that extension, using standard-compliant C++ to express the code to be run on the accelerator (GPU "kernels", FPGA algorithms, etc.).

For now, AMD seems to be concentrating on HIP as a CUDA replacement. Unfortunately, the tool chain is not supported on Windows.
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
Zen 4 on N5 (+ AMD sauce), a new IOD and AM5 will have so many new knobs that it's going to be really interesting. It will be another inflection point for PCs.
  • The new socket AM5 could bring a bigger substrate and package area for even bigger chips or more chiplets
  • Ignoring absolute process density and using the relative density gain of the N7 -> N5 progression (1.7x), Zen 4 chiplets would be roughly 60-65% of the size of a Zen 2 chiplet, somewhere around 50mm2. Suddenly AMD's chiplets seem genius, because they can be smaller than typical phone SoCs and can tolerate bad yields.
  • An improved process for the IOD. If not made by TSMC, most likely GF 12LP+, which is a major improvement over 12LP
  • Improved efficiency, again.

There is so much more silicon area to play with. If some form of 3D stacking is there, that is going to be even more transistors packed per chip.
Regarding the new IOD, we won't even have to wait for Zen 4; it is coming soon with a specialized Zen 3/Milan SKU for a specific HPC deployment.
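As a back-of-the-envelope check of the chiplet-size estimate above (the ~74 mm2 Zen 2 CCD area and the 1.7x relative density figure are assumptions based on public estimates, not official numbers), a quick sketch:

```python
# Back-of-the-envelope: scale a Zen 2 CCD by the relative N7 -> N5 density gain.
zen2_ccd_mm2 = 74.0          # approximate Zen 2 chiplet area (assumption)
relative_density_gain = 1.7  # N7 -> N5 relative logic density gain

# If the design shrank perfectly, area would scale by 1 / density gain.
ideal_shrink = zen2_ccd_mm2 / relative_density_gain          # ~43.5 mm2
print(f"ideal shrink: {ideal_shrink:.1f} mm2 "
      f"({ideal_shrink / zen2_ccd_mm2:.0%} of Zen 2)")

# SRAM and analog scale worse than logic, so ~60-65% of the original
# area (the figure quoted above) is a more realistic landing zone.
realistic = zen2_ccd_mm2 * 0.65                              # ~48 mm2
print(f"realistic estimate: {realistic:.1f} mm2")
```

Either way the chiplet lands around or below typical phone-SoC die sizes, which is the yield argument being made.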
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
AMD should get Microsoft aboard to implement SYCL in their C++ compiler. Microsoft partnered with AMD to create C++ AMP (ref. AMD's Fusion Developer Summits a few years ago), which is similar in philosophy, but requires a small non-standard extension to the C++ language. SYCL supersedes that effort by eliminating the need for that extension, using standard-compliant C++ to express the code to be run on the accelerator (GPU "kernels", FPGA algorithms, etc.).

For now, AMD seems to be concentrating on HIP as a CUDA replacement. Unfortunately, the tool chain is not supported on Windows.
MS's goal right now is to support most of these things via WSL2, at least in the short term.
From AMD's perspective, ROCm (the runtime at least) is planned to come to Windows using PAL, as reiterated by John Bridgman many times. But if WSL2 really takes off as envisioned by MS, their plans might change.
MS is really aggressive in pushing upstream all the changes for the Hyper-V subsystem, which is responsible for redirecting most of the Linux kernel requests to Windows.
With Aurora on the backburner now, Frontier will be the first US exascale system, and with that the first major deployment for ROCm.
It will be a big boost, and contributions from academia will trickle in. Right now this development was AMD-only, funded as part of the exascale procurement.
 

ThatBuzzkiller

Senior member
Nov 14, 2014
989
131
106
People are still talking about SYCL this late into the game as if it will be the new unifying compute standard? Here's the reality for all of you people here ...

AMD is not interested in making a SPIR-V compiler for SYCL/OpenCL kernels, so the community would need to step up and do AMD's work for them there. Good luck to anyone who doesn't have an army of compiler engineers with spare time lying around, since this'll remain fruitless for many years to come ... (I remember a time when HSAIL was *standardized* in name only by the HSA Foundation, and when AMD once had a SPIR compiler running on gfx8 GPUs for their PAL OpenCL driver stack, but those days are well behind us)

Nvidia does not have any representatives for SYCL so they are not going to adopt it as a standard anytime soon ...

Intel is the only one taking SYCL seriously out of all of the other vendors but even they're adding in their own vendor specific extensions to define their own proprietary DPC++ standard ...
 
  • Like
Reactions: Vattila

soresu

Golden Member
Dec 19, 2014
1,495
714
136
Suddenly AMD's chiplets seem genius because they can be smaller than typical phone SoCs and can tolerate bad yields.
That was the point from the get go from what I read into AMD's previous words on the subject.

It's a natural progression, when yields get worse with each new process node, to make the dies as small as possible.

Chiplets do that, and as a bonus add a huge amount of versatility to their SKU segmentation options.
 
  • Like
Reactions: Tlh97

ThatBuzzkiller

Senior member
Nov 14, 2014
989
131
106
I was under the impression that HIP code can compile to CUDA too?
Yes, but the big caveat is that it'll use the NVCC or CUDA-Clang compiler instead of the HIP-Clang compiler, so you can't totally expect consistent results between both compiler backends without maintaining them ...
 
  • Like
Reactions: Vattila

Mk pt

Member
Nov 23, 2013
67
17
81
Zen 4 on N5 (+ AMD sauce), a new IOD and AM5 will have so many new knobs that it's going to be really interesting. It will be another inflection point for PCs.
  • The new socket AM5 could bring a bigger substrate and package area for even bigger chips or more chiplets
  • Ignoring absolute process density and using the relative density gain of the N7 -> N5 progression (1.7x), Zen 4 chiplets would be roughly 60-65% of the size of a Zen 2 chiplet, somewhere around 50mm2. Suddenly AMD's chiplets seem genius, because they can be smaller than typical phone SoCs and can tolerate bad yields.
  • An improved process for the IOD. If not made by TSMC, most likely GF 12LP+, which is a major improvement over 12LP
  • Improved efficiency, again.

There is so much more silicon area to play with. If some form of 3D stacking is there, that is going to be even more transistors packed per chip.
Regarding the new IOD, we won't even have to wait for Zen 4; it is coming soon with a specialized Zen 3/Milan SKU for a specific HPC deployment.
Zen 4:
- AM5
- DDR5
- 5nm chiplets

- 7nm IOD. IO takes a huge share of power consumption, and going to 7nm reduces die area. By 2022, 7nm will be cheap enough to make the IOD.

- Moar cores... if Intel ups their game. With a smaller CCD and IOD, AMD can easily put in more CCDs, losing a bit in single-thread but improving multithread.

Usual improvements: faster IF and more cache


Launch in Jan/Feb of 2022.
 

Mk pt

Member
Nov 23, 2013
67
17
81
Zen 4:
- Moar cores... if Intel ups their game. With a smaller CCD and IOD, AMD can easily put in more CCDs, losing a bit in single-thread but improving multithread.
In desktop^

Epyc will double the number of cores.
Poor Intel in servers...
 
  • Like
Reactions: Tlh97

Vattila

Senior member
Oct 22, 2004
551
576
136
I was under the impression that HIP code can compile to CUDA too?
Yes, AMD's HIP implementation uses the CUDA toolchain as a backend for targeting the Nvidia platform. For targeting the AMD platform it uses the ROCm toolchain. But that is really an implementation detail. In theory, you could compile HIP source code down to an executable to be run on an OpenCL driver.

With ROCm, you ideally write and maintain your code using the open HIP programming model, thus allowing portable code (for now, only between AMD and Nvidia platforms, though). As far as I understand, HIP copies the CUDA programming model as closely as possible, for familiarity and ease of porting for CUDA users. The main difference is the naming (e.g. a "hip" prefix instead of "cu" on function calls). ROCm includes a converter tool that automates the rewrite of CUDA code to HIP, allegedly doing more than 90% of the work in the common case.
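The prefix rename that converter automates can be illustrated with a toy sketch. The real hipify tools are clang-based and handle far more (types, libraries, kernel launch syntax); `toy_hipify` here is a hypothetical regex-only stand-in, though the CUDA/HIP API names in the sample are real:

```python
import re

# Toy version of ROCm's hipify idea: rename CUDA runtime calls to their
# HIP equivalents by swapping the "cuda" prefix for "hip". Only the
# simple prefix cases are covered here.
def toy_hipify(source: str) -> str:
    return re.sub(r"\bcuda([A-Z]\w*)", r"hip\1", source)

cuda_snippet = (
    "cudaMalloc(&d_buf, size);\n"
    "cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);\n"
    "cudaFree(d_buf);\n"
)
print(toy_hipify(cuda_snippet))
# hipMalloc(&d_buf, size);
# hipMemcpy(d_buf, h_buf, size, hipMemcpyHostToDevice);
# hipFree(d_buf);
```

Because the mapping is mostly mechanical like this, the claimed >90% automation figure is plausible for straightforward CUDA code.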

That said, in real life projects, you probably have to dip down into platform-specific details for some of your code, I guess. Lacking experience with the solution, I don't know how mature HIP/ROCm has become and the coverage of CUDA functionality it currently achieves (core functionality, libraries, profiling, debugging, etc.). Perhaps someone with practical experience can comment.

On the other hand, hipSYCL is an implementation of the fledgling Khronos SYCL standard. The SYCL programming model is based on pure standard C++ language and libraries, originally intended as a higher-level programming model for OpenCL. It is quite different to CUDA and HIP. The hipSYCL implementation can use OpenMP, HIP/ROCm or CUDA toolchains as backends (with interoperability with Intel's DPC++ SYCL compiler in the pipeline, I think, considering that Heidelberg University, which leads and funds hipSYCL development, recently partnered with Intel on the oneAPI initiative). Notably, hipSYCL does not support OpenCL as a backend, unlike most SYCL implementations (such as Intel's DPC++, Codeplay's ComputeCPP, Xilinx's triSYCL and Peter Žužek's sycl-gtx, which all support OpenCL, as well as various other backends).

To me, SYCL looks to be the open standard for the future. It encompasses more than just the GPGPU that HIP and CUDA focus on. For example, SYCL is in use for FPGA programming. The C++ Standard Committee, universities, national institutions (such as Cineca and Argonne National Laboratory) and companies (such as Intel, Codeplay and Xilinx) are investing in and contributing to SYCL use and development.

Links:

- ROCm/HIP
- CUDA
- SYCL
- SYCL 2020 announcement
- hipSYCL
- Heidelberg University and Intel team up

PS. This discussion is somewhat off-topic, so apologies for that, but heterogeneous systems are the future, and the programming models will be an important part of it.
 
Last edited:

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
Yes, AMD's HIP implementation uses the CUDA toolchain as a backend for targeting the Nvidia platform. For targeting the AMD platform it uses the ROCm toolchain. But that is really an implementation detail. In theory, you could compile HIP source code down to an executable to be run on an OpenCL driver.

With ROCm, you ideally write and maintain your code using the open HIP programming model, thus allowing portable code (for now, only between AMD and Nvidia platforms, though). As far as I understand, HIP copies the CUDA programming model as closely as possible, for familiarity and ease of porting for CUDA users. The main difference is the naming (e.g. a "hip" prefix instead of "cu" on function calls). ROCm includes a converter tool that automates the rewrite of CUDA code to HIP, allegedly doing more than 90% of the work in the common case.

That said, in real life projects, you probably have to dip down into platform-specific details for some of your code, I guess. Lacking experience with the solution, I don't know how mature HIP/ROCm has become and the coverage of CUDA functionality it currently achieves (core functionality, libraries, profiling, debugging, etc.). Perhaps someone with practical experience can comment.

On the other hand, hipSYCL is an implementation of the fledgling Khronos SYCL standard. The SYCL programming model is based on pure standard C++ language and libraries, originally intended as a higher-level programming model for OpenCL. It is quite different to CUDA and HIP. The hipSYCL implementation can use OpenMP, HIP/ROCm or CUDA toolchains as backends (with interoperability with Intel's DPC++ SYCL compiler in the pipeline, I think, considering that Heidelberg University, which leads and funds hipSYCL development, recently partnered with Intel on the oneAPI initiative). Notably, hipSYCL does not support OpenCL as a backend, unlike many other SYCL implementations (such as Intel's DPC++, Codeplay's ComputeCPP, triSYCL and sycl-gtx, which all support OpenCL as well as various other backends).

To me, SYCL looks to be the open standard for the future. It encompasses more than just the GPGPU that HIP and CUDA focus on. For example, SYCL is in use for FPGA programming. The C++ Standard Committee, universities, national institutions (such as Cineca and Argonne National Laboratory) and companies (such as Intel, Codeplay and Xilinx) are investing in and contributing to SYCL use and development.

Links:

- ROCm/HIP
- CUDA
- SYCL
- SYCL 2020 announcement
- hipSYCL
- Heidelberg University and Intel team up

PS. This discussion is somewhat off-topic, so apologies for that, but heterogeneous systems are the future, and the programming models will be an important part of it.
There is a list of changes to be upstreamed to enable LLVM to emit SPIR-V IR code.

The bigger question is how to integrate these changes from Intel if another IHV is doing the work in parallel.
I checked the diffs; they're not that big to me (granted, we work with several codebases of more than 30,000 kloc each). Even AMD's downstream ROCm LLVM fork has a 35K+ diff from upstream, and they are constantly issuing PRs, almost 5 a day, to get it all in.
Consuming it with an OpenCL runtime is the easier part.

These days almost everything uses the LLVM infrastructure.
ROCm does too. For AMD's part, they are also making a lot of new proposals for ELF/DWARF and new tooling for debugging heterogeneous systems.

Update:
Looks like Intel has indeed been upstreaming these LLVM changes. Kudos to Intel.
You can check the meeting notes; the last one is from two weeks ago.
 
Last edited:

Vattila

Senior member
Oct 22, 2004
551
576
136
[Intel has upstreamed changes] to enable LLVM to emit SPIR-V IR code.
Great!

For those (like me) that don't know much about this, the Khronos SPIR-V standard is an abstract and portable intermediate representation (IR) that a SYCL/OpenCL toolchain may produce as it translates the high-level accelerator code into distributable program files. The SPIR-V code is fed to the OpenCL driver as the program is executed on the target platform. The OpenCL driver translates the SPIR-V IR code to device-specific machine code, which is then executed by the target platform hardware.

Here are a couple of charts, from the links provided in my previous post, giving an overview of these technologies:



 
Last edited:

amrnuke

Senior member
Apr 24, 2019
876
1,143
96
In desktop^

Epyc will double the number of cores.
Poor Intel in servers...
Yup. With AM5, I imagine they go larger on the socket size. Combined with the process shrink to 5nm, they can still keep transistor density relatively low while cramming a lot more onto the chip. For the other sockets I don't expect a significant change.

32c/64t 6950X
128c/256t 6990X
128c/256t or 128c/512t Epyc 7xx3

Fun times ahead when you consider the IPC/power consumption improvements expected.
 
  • Like
Reactions: Tlh97 and Vattila

Cardyak

Member
Sep 12, 2018
33
39
61
Yup. With AM5, I imagine they go larger on the socket size. Combined with the process shrink to 5nm, they can still keep transistor density relatively low while cramming a lot more onto the chip. For the other sockets I don't expect a significant change.

32c/64t 6950X
128c/256t 6990X
128c/256t or 128c/512t Epyc 7xx3

Fun times ahead when you consider the IPC/power consumption improvements expected.
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds will have to witness a fairly large regression (Probably somewhere in the region of 10%) in order to keep TDP in check. I'm not sure if sacrificing single-threaded performance in order to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3
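The clock-speed trade-off in that argument can be sketched with rough numbers (the ~30% iso-frequency power reduction and the cubic clock/power relationship are simplifying assumptions for illustration, not measured figures):

```python
# Rough TDP check for doubling cores on N5 (illustrative numbers only).
power_scale = 0.70   # N5 at ~70% of N7 power at the same frequency (~30% cut)
cores_scale = 2.0    # hypothetical doubling of core count

# Keeping the same socket TDP, doubled cores at unchanged clocks overshoot:
power_at_same_clocks = cores_scale * power_scale   # 1.4x the budget

# Dynamic power scales roughly with f^3 (P ~ f * V^2, with V tracking f),
# so clocks need to drop by about the cube root of the overshoot.
clock_cut = 1 - (1 / power_at_same_clocks) ** (1 / 3)
print(f"power at unchanged clocks: {power_at_same_clocks:.2f}x TDP")
print(f"approx. clock reduction to fit: {clock_cut:.0%}")   # ~11%
```

That lands near the ~10% regression figure quoted above; a 50% core increase instead of a doubling shrinks the overshoot to 1.05x and nearly eliminates the clock penalty.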
 

Hitman928

Platinum Member
Apr 15, 2012
2,915
2,454
136
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds will have to witness a fairly large regression (Probably somewhere in the region of 10%) in order to keep TDP in check. I'm not sure if sacrificing single-threaded performance in order to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3
ARM is supposedly coming with up to 192 cores and up to 350W TDP within the next couple of years, so they may want something as a response to that.
 
  • Like
Reactions: Tlh97 and Vattila

soresu

Golden Member
Dec 19, 2014
1,495
714
136
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.
Depends what 5nm process you are talking about, because N5P will give more than 30% improvement for power consumption.
 
  • Like
Reactions: Tlh97 and Vattila

moinmoin

Golden Member
Jun 1, 2017
1,915
2,142
106
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds will have to witness a fairly large regression (Probably somewhere in the region of 10%) in order to keep TDP in check. I'm not sure if sacrificing single-threaded performance in order to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3
In plenty of datacenters, core density and energy efficiency are more important than absolute frequency. As @Hitman928 correctly points out, the competition there is ARM servers, with plans for even more cores already. And in any case IPC needs to increase to be competitive with ARM chips, and that can cover at least part of the reduction in frequency.

Also this is not an either/or; AMD can still (continue to) offer packages with fewer cores that as a result have a bigger TDP headroom for higher frequencies. All of its top end products are "hampered" by the TDP limit (even consumer chips like the 3950X and 5950X), but as a result those are also more energy efficient; that's part of the balance customers can choose between.

5nm is the next node with a significant increase in transistor density, and AMD has so far always gone with powers of 2 for the top end of its Zen packages.
 
  • Like
Reactions: Tlh97 and Vattila

Gideon

Golden Member
Nov 27, 2007
1,028
1,704
136
In plenty of datacenters, core density and energy efficiency are more important than absolute frequency. As @Hitman928 correctly points out, the competition there is ARM servers, with plans for even more cores already. And in any case IPC needs to increase to be competitive with ARM chips, and that can cover at least part of the reduction in frequency.

Also this is not an either/or; AMD can still (continue to) offer packages with fewer cores that as a result have a bigger TDP headroom for higher frequencies. All of its top end products are "hampered" by the TDP limit (even consumer chips like the 3950X and 5950X), but as a result those are also more energy efficient; that's part of the balance customers can choose between.

5nm is the next node with a significant increase in transistor density, and AMD has so far always gone with powers of 2 for the top end of its Zen packages.
Agreed on all points. Besides, a considerable amount of power can be saved just by moving from the current package design to a 2.5D or 3D approach, e.g. an active interposer and such.
 
  • Like
Reactions: Tlh97
