Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

Untitled2.png


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

DisEnchantment

Golden Member
Mar 3, 2017
It would certainly tie in well with AMD getting support for OneAPI through HipSYCL/ROCm.

The whole point with OneAPI for Intel was unifying CPU, GPU, FPGA, AI/ML accelerators etc under one coding roof.

Here's hoping, for their sake, that it doesn't take AMD forever to deliver on the SW implementation, because Intel will certainly have a humongous head start on the FPGA side unless Xilinx's own software platform can be made to simply take OneAPI code without a giant amount of engineering in the interim.
I think it has more to do with whether the SYCL runtime implementation is compliant with the SYCL standard as specified by Khronos than with OneAPI supporting AMD HW.
The SYCL runtime is just one of many runtimes that the libraries exposed to the various frameworks can make use of.

1602441195152.png

You can check out this video for programming Xilinx FPGAs with SYCL.

SYCL was developed with heterogeneous systems in mind.
Like I wrote previously, ROCm is a complete ecosystem, with all the math libs, communication libs for multi-node clusters, work dispatch, library integration into popular frameworks, etc.
Integrating a SYCL runtime with ROCm should be technically feasible, if not easy, provided AMD has a business case for it.
 

DisEnchantment

Golden Member
Mar 3, 2017
There is hipSYCL which already integrates with ROCm for AMD GPUs (while also supporting CPUs through C++17 OpenMP compilers and Nvidia GPUs through clang/CUDA).
It is not an official ROCm component. The AMD ROCm runtime debs don't include it. SYCL is not planned to be a part of ROCm 4.0 afaik.
It just uses ROCm infrastructure to make SYCL on AMD possible.
Once the entire ROCm ecosystem is up, adding SYCL is not much of a challenge imo.
 

Vattila

Senior member
Oct 22, 2004
AMD should get Microsoft aboard to implement SYCL in their C++ compiler. Microsoft partnered with AMD to create C++ AMP (ref. AMD's Fusion Developer Summits a few years ago), which is similar in philosophy, but requires a small non-standard extension to the C++ language. SYCL supersedes that effort by eliminating the need for that extension, using standards-compliant C++ to express the code to be run on the accelerator (GPU "kernels", FPGA algorithms, etc.).

For now, AMD seems to be concentrating on HIP as a CUDA replacement. Unfortunately, the tool chain is not supported on Windows.
 

DisEnchantment

Golden Member
Mar 3, 2017
Zen4 on N5 (+ AMD Sauce), a new IOD and AM5 will have so many new knobs that it's going to be really interesting. It will be another inflection point for PCs.
  • The new socket AM5 could bring a bigger substrate and package area, for even bigger chips or more chiplets
  • Ignoring absolute process density and using the relative density gain of the N7->N5 progression (1.7x), Zen4 chiplets would be roughly 60-65% the size of a Zen2 chiplet today, somewhere around 50mm2. Suddenly AMD's chiplets seem genius, because they can be smaller than typical phone SoCs and can tolerate bad yields.
  • An improved process for the IOD. If not made by TSMC, most likely GF 12LP+, which is a major improvement over 12LP
  • Improved efficiency, again.

There is so much more silicon area to play with. If some form of 3D stacking is there, that is going to be even more transistors packed per chip.
Regarding the new IOD, we won't even have to wait for Zen4; it is coming soon with a specialized Zen3/Milan SKU for a specific HPC deployment.
 

DisEnchantment

Golden Member
Mar 3, 2017
AMD should get Microsoft aboard to implement SYCL in their C++ compiler. Microsoft partnered with AMD to create C++ AMP (ref. AMD's Fusion Developer Summits a few years ago), which is similar in philosophy, but requires a small non-standard extension to the C++ language. SYCL supersedes that effort by eliminating the need for that extension, using standards-compliant C++ to express the code to be run on the accelerator (GPU "kernels", FPGA algorithms, etc.).

For now, AMD seems to be concentrating on HIP as a CUDA replacement. Unfortunately, the tool chain is not supported on Windows.
MS's goal right now is to support most of these things via WSL2, at least in the short term.
From AMD's perspective, ROCm (the runtime at least) is planned to come to Windows using PAL, as reiterated by John Bridgman many times. But if WSL2 really takes off as envisioned by MS, those plans might change.
MS is really aggressive in pushing upstream all the changes for the Hyper-V subsystem, which is responsible for redirecting most of the Linux kernel requests to Windows.
With Aurora on the back burner now, Frontier will be the first US exascale system, and with that the first major deployment for ROCm.
It will be a big boost; contributions from academia will trickle in. Until now this development was AMD-only, funded as part of the exascale procurement.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
People are still talking about SYCL this late into the game as if it will be the new unifying compute standard? Here's the reality for all of you here...

AMD is not interested in making a SPIR-V compiler for SYCL/OpenCL kernels, so the community would need to step up and do AMD's work for them there. Good luck to anyone who doesn't have an army of compiler engineers with spare time lying around, since this'll remain fruitless for many years to come... (I remember a time when HSAIL was *standardized* in name only by the HSA Foundation, and when AMD had a SPIR compiler running on gfx8 GPUs for their PAL OpenCL driver stack, but those days are well behind us.)

Nvidia does not have any representatives on SYCL, so they are not going to adopt it as a standard anytime soon...

Intel is the only vendor taking SYCL seriously, but even they're adding their own vendor-specific extensions to define their own proprietary DPC++ standard...
 

soresu

Platinum Member
Dec 19, 2014
Suddenly AMD's chiplets seems genius because they can be smaller than typical phone SoCs and can tolerate bad yields.
That was the point from the get-go, from what I read of AMD's previous statements on the subject.

When yields get worse with each new process node, the natural progression is to make the dies as small as possible.

Chiplets do that, and as a bonus they add a huge amount of versatility to AMD's SKU segmentation options.
 

Mk pt

Member
Nov 23, 2013
Zen4 on N5 (+ AMD Sauce), a new IOD and AM5 will have so many new knobs that it's going to be really interesting. It will be another inflection point for PCs.
  • The new socket AM5 could bring a bigger substrate and package area, for even bigger chips or more chiplets
  • Ignoring absolute process density and using the relative density gain of the N7->N5 progression (1.7x), Zen4 chiplets would be roughly 60-65% the size of a Zen2 chiplet today, somewhere around 50mm2. Suddenly AMD's chiplets seem genius, because they can be smaller than typical phone SoCs and can tolerate bad yields.
  • An improved process for the IOD. If not made by TSMC, most likely GF 12LP+, which is a major improvement over 12LP
  • Improved efficiency, again.

There is so much more silicon area to play with. If some form of 3D stacking is there, that is going to be even more transistors packed per chip.
Regarding the new IOD, we won't even have to wait for Zen4; it is coming soon with a specialized Zen3/Milan SKU for a specific HPC deployment.
Zen 4:
- AM5
- DDR5
- 5nm chiplets

- 7nm IOD. IO accounts for a huge share of power consumption, and going to 7nm also reduces die area. By 2022, 7nm will be cheap enough to make the IOD.

- Moar cores... if Intel ups their game. With a smaller CCD and IOD, AMD can easily fit in more CCDs, losing a bit in single-thread but improving multithread.

The usual improvements: faster IF and more cache.

Launch in Jan/Feb of 2022.
 

Vattila

Senior member
Oct 22, 2004
I was under the impression that HIP code can compile to CUDA too?

Yes, AMD's HIP implementation uses the CUDA toolchain as a backend for targeting the Nvidia platform. For targeting the AMD platform it uses the ROCm toolchain. But that is really an implementation detail. In theory, you could compile HIP source code down to an executable to be run on an OpenCL driver.

With ROCm, you ideally write and maintain your code using the open HIP programming model, thus allowing portable code (for now, only between AMD and Nvidia platforms, though). As far as I understand, HIP copies the CUDA programming model as closely as possible, for familiarity and ease of porting for CUDA users. The main difference is change of naming (e.g. "hip" prefix instead of "cu" on function calls). ROCm includes a converter tool that automates the rewrite of CUDA code to HIP, allegedly doing more than 90% of the work in the common case.

That said, in real-life projects you probably have to dip down into platform-specific details for some of your code, I guess. Lacking experience with the solution, I don't know how mature HIP/ROCm has become, nor what coverage of CUDA functionality it currently achieves (core functionality, libraries, profiling, debugging, etc.). Perhaps someone with practical experience can comment.

On the other hand, hipSYCL is an implementation of the fledgling Khronos SYCL standard. The SYCL programming model is based on pure standard C++ language and libraries, originally intended as a higher-level programming model for OpenCL. It is quite different from CUDA and HIP. The hipSYCL implementation can use OpenMP, HIP/ROCm or CUDA toolchains as backends (with interoperability with Intel's DPC++ SYCL compiler in the pipeline, I think, considering that Heidelberg University, which leads and funds hipSYCL development, recently partnered with Intel on the oneAPI initiative). Notably, hipSYCL does not support OpenCL as a backend, unlike most SYCL implementations (such as Intel's DPC++, Codeplay's ComputeCpp, Xilinx's triSYCL and Peter Žužek's sycl-gtx, which all support OpenCL, as well as various other backends).

To me, SYCL looks to be the open standard for the future. It encompasses more than just the GPGPU that HIP and CUDA focus on. For example, SYCL is in use for FPGA programming. The C++ Standard Committee, universities (such as Heidelberg University), national institutions (such as Cineca and Argonne National Laboratory) and companies (such as Intel, Codeplay and Xilinx) are investing in and contributing to SYCL use and development.

Links:

- ROCm/HIP
- CUDA
- SYCL
- SYCL 2020 announcement
- hipSYCL
- Heidelberg University and Intel team up

PS. This discussion is somewhat off-topic, so apologies for that, but heterogeneous systems are the future, and the programming models will be an important part of it.
 

DisEnchantment

Golden Member
Mar 3, 2017
Yes, AMD's HIP implementation uses the CUDA toolchain as a backend for targeting the Nvidia platform. For targeting the AMD platform it uses the ROCm toolchain. But that is really an implementation detail. In theory, you could compile HIP source code down to an executable to be run on an OpenCL driver.

With ROCm, you ideally write and maintain your code using the open HIP programming model, thus allowing portable code (for now, only between AMD and Nvidia platforms, though). As far as I understand, HIP copies the CUDA programming model as closely as possible, for familiarity and ease of porting for CUDA users. The main difference is change of naming (e.g. "hip" prefix instead of "cu" on function calls). ROCm includes a converter tool that automates the rewrite of CUDA code to HIP, allegedly doing more than 90% of the work in the common case.

That said, in real-life projects you probably have to dip down into platform-specific details for some of your code, I guess. Lacking experience with the solution, I don't know how mature HIP/ROCm has become, nor what coverage of CUDA functionality it currently achieves (core functionality, libraries, profiling, debugging, etc.). Perhaps someone with practical experience can comment.

On the other hand, hipSYCL is an implementation of the fledgling Khronos SYCL standard. The SYCL programming model is based on pure standard C++ language and libraries, originally intended as a higher-level programming model for OpenCL. It is quite different from CUDA and HIP. The hipSYCL implementation can use OpenMP, HIP/ROCm or CUDA toolchains as backends (with interoperability with Intel's DPC++ SYCL compiler in the pipeline, I think, considering that Heidelberg University, which leads and funds hipSYCL development, recently partnered with Intel on the oneAPI initiative). Notably, hipSYCL does not support OpenCL as a backend, unlike many other SYCL implementations (such as Intel's DPC++, Codeplay's ComputeCpp, triSYCL and sycl-gtx, which all support OpenCL as well as various other backends).

To me, SYCL looks to be the open standard for the future. It encompasses more than just the GPGPU that HIP and CUDA focus on. For example, SYCL is in use for FPGA programming. The C++ Standard Committee, universities, national institutions (such as Cineca and Argonne National Laboratory) and companies (such as Intel, Codeplay and Xilinx) are investing in and contributing to SYCL use and development.

Links:

- ROCm/HIP
- CUDA
- SYCL
- SYCL 2020 announcement
- hipSYCL
- Heidelberg University and Intel team up

PS. This discussion is somewhat off-topic, so apologies for that, but heterogeneous systems are the future, and the programming models will be an important part of it.
There is a list of changes to be upstreamed to enable LLVM to emit SPIR-V IR code.

The bigger question is how to integrate these changes from Intel if another IHV is doing the work in parallel.
I checked the diffs; they're not that big to me (granted, we work with several codebases of more than 30000 kloc each). Even AMD's downstream ROCm LLVM fork has a 35K+ diff from upstream, and they are constantly issuing PRs, almost 5 times a day, to get it all in.
Consuming it with an OpenCL runtime is the easier part.

These days almost everything uses the LLVM infrastructure.
ROCm does too. For AMD's part, they are also making a lot of new proposals to ELF/DWARF and new tooling for debugging heterogeneous systems.

Update:
Looks like Intel has indeed been upstreaming these LLVM changes. Kudos to Intel.
You can check the meeting notes; the last one is from two weeks ago.
 

Vattila

Senior member
Oct 22, 2004
[Intel has upstreamed changes] to enable LLVM to emit SPIR-V IR code.

Great!

For those (like me) that don't know much about this, the Khronos SPIR-V standard is an abstract and portable intermediate representation (IR) that a SYCL/OpenCL toolchain may produce as it translates the high-level accelerator code into distributable program files. The SPIR-V code is fed to the OpenCL driver as the program is executed on the target platform. The OpenCL driver translates the SPIR-V IR code to device-specific machine code, which is then executed by the target platform hardware.

Here are a couple of charts, from the links provided in my previous post, giving an overview of these technologies:

2020-05-sycl-landing-page-02_2.jpg


2020-spir-landing-page-01.jpg
 

amrnuke

Golden Member
Apr 24, 2019
In desktop^

Epyc will double the number of cores.
Poor Intel in servers...
Yup. With AM5, I imagine they go larger on the socket size. Combined with the process shrink to 5nm, they can still keep transistor density lower while cramming a lot more onto the chip. For the other sockets I don't expect a significant change.

32c/64t 6950X
128c/256t 6990X
128c/256t or 128c/512t Epyc 7xx3

Fun times ahead when you consider the IPC/power consumption improvements expected.
 

Cardyak

Member
Sep 12, 2018
Yup. With AM5, I imagine they go larger on the socket size. Combined with the process shrink to 5nm, they can still keep transistor density lower while cramming a lot more onto the chip. For the other sockets I don't expect a significant change.

32c/64t 6950X
128c/256t 6990X
128c/256t or 128c/512t Epyc 7xx3

Fun times ahead when you consider the IPC/power consumption improvements expected.

I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds would have to see a fairly large regression (probably somewhere in the region of 10%) to keep TDP in check. I'm not sure if sacrificing single-threaded performance to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3
 

Hitman928

Diamond Member
Apr 15, 2012
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds would have to see a fairly large regression (probably somewhere in the region of 10%) to keep TDP in check. I'm not sure if sacrificing single-threaded performance to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3

ARM is supposedly coming with up to 192 cores and up to 350W TDP within the next couple of years; they may want something as a response to that.
 

moinmoin

Diamond Member
Jun 1, 2017
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds would have to see a fairly large regression (probably somewhere in the region of 10%) to keep TDP in check. I'm not sure if sacrificing single-threaded performance to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3
In plenty of datacenters, core density and energy efficiency matter more than absolute frequency. As @Hitman928 correctly points out, the competition there is ARM servers, with plans for even more cores already. And in any case, IPC needs to increase to be competitive with ARM chips, and that can cover at least part of the reduction in frequency.

Also, this is not an either/or: AMD can still (continue to) offer packages with fewer cores that as a result have more TDP headroom for higher frequencies. All of its top-end products are "hampered" by the TDP limit (even consumer chips like the 3950X and 5950X), but as a result those are also more energy efficient; that's part of the balance customers can choose between.

5nm is the next node with a significant increase in transistor density, and AMD has so far always gone with powers of two for the top end of its Zen packages.
 

Gideon

Golden Member
Nov 27, 2007
In plenty of datacenters, core density and energy efficiency matter more than absolute frequency. As @Hitman928 correctly points out, the competition there is ARM servers, with plans for even more cores already. And in any case, IPC needs to increase to be competitive with ARM chips, and that can cover at least part of the reduction in frequency.

Also, this is not an either/or: AMD can still (continue to) offer packages with fewer cores that as a result have more TDP headroom for higher frequencies. All of its top-end products are "hampered" by the TDP limit (even consumer chips like the 3950X and 5950X), but as a result those are also more energy efficient; that's part of the balance customers can choose between.

5nm is the next node with a significant increase in transistor density, and AMD has so far always gone with powers of two for the top end of its Zen packages.

Agreed on all points. Besides, a considerable amount of power can be saved just by moving from the current package design to a 2.5D or 3D approach, e.g. an active interposer and the like.
 