
Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 6000)


What do you expect with Zen 4?


  • Total voters
    168

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
It would certainly tie in well with AMD getting support for OneAPI through HipSYCL/ROCm.

The whole point with OneAPI for Intel was unifying CPU, GPU, FPGA, AI/ML accelerators etc under one coding roof.

Here's hoping, for their sake, that it doesn't take AMD forever to deliver on the SW implementation, because Intel will certainly have a humongous head start on the FPGA side unless Xilinx's own software platform can be made to simply take OneAPI code without a giant amount of engineering in the interim.
I think it has more to do with whether the SYCL runtime implementation is compliant with the SYCL standard as specified by Khronos, rather than with OneAPI supporting AMD HW.
The SYCL runtime is just one of many runtimes that the libraries exposed to the various frameworks can make use of.


You can check out this video on programming Xilinx FPGAs with SYCL.

SYCL was developed with heterogeneous systems in mind.
Like I wrote previously, ROCm is a complete ecosystem, with all the math libs, communication libs for multi-node clusters, work dispatch, library integration into popular frameworks, etc.
Integrating a SYCL runtime with ROCm should be technically feasible, if not easy, provided AMD has a business case for it.
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
There is hipSYCL which already integrates with ROCm for AMD GPUs (while also supporting CPUs through C++17 OpenMP compilers and Nvidia GPUs through clang/CUDA).
It is not an official ROCm component; the AMD ROCm runtime debs don't have it. SYCL is not planned to be a part of ROCm 4.0, afaik.
It just uses ROCm infrastructure to make SYCL on AMD possible.
Once the entire ROCm ecosystem is up, adding SYCL is not much of a challenge imo.
 
  • Like
Reactions: Tlh97 and Vattila

Vattila

Senior member
Oct 22, 2004
551
576
136
AMD should get Microsoft aboard to implement SYCL in their C++ compiler. Microsoft partnered with AMD to create C++ AMP (ref. AMD's Fusion Developer Summits a few years ago), which is similar in philosophy, but requires a small non-standard extension to the C++ language. SYCL supersedes that effort by eliminating the need for that extension, using standard-compliant C++ to express the code to be run on the accelerator (GPU "kernels", FPGA algorithms, etc.).

For now, AMD seems to be concentrating on HIP as a CUDA replacement. Unfortunately, the tool chain is not supported on Windows.
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
Zen 4 on N5 (+ AMD sauce), a new IOD and AM5 will have so many new knobs that it's going to be really interesting. It will be another inflection point for PCs.
  • The new socket AM5 could bring a bigger substrate and package area for even bigger chips or more chiplets
  • Ignoring absolute process density and using the relative density gain of the N7 -> N5 progression (1.7x), Zen 4 chiplets would be roughly 60-65% of the size of a Zen 2 chiplet, somewhere around 50mm2. Suddenly AMD's chiplets seem genius, because they can be smaller than typical phone SoCs and can tolerate bad yields.
  • An improved process for the IOD. If not made by TSMC, most likely GF 12LP+, which is a major improvement over 12LP
  • Improved efficiency, again.

There is so much more silicon area to play with. If some form of 3D stacking is there, that is going to be even more transistors packed per chip.
Regarding the new IOD, we won't even have to wait for Zen 4; it is coming soon with a specialized Zen 3/Milan SKU for a specific HPC deployment.
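As a back-of-the-envelope check of the chiplet-size estimate above (the ~74 mm2 Zen 2 CCD area and the 1.7x relative density figure are assumptions based on public estimates, not official numbers), a quick sketch:

```python
# Back-of-the-envelope: scale a Zen 2 CCD by the relative N7 -> N5 density gain.
zen2_ccd_mm2 = 74.0          # approximate Zen 2 chiplet area (assumption)
relative_density_gain = 1.7  # N7 -> N5 relative logic density gain

# If the design shrank perfectly, area would scale by 1 / density gain.
ideal_shrink = zen2_ccd_mm2 / relative_density_gain          # ~43.5 mm2
print(f"ideal shrink: {ideal_shrink:.1f} mm2 "
      f"({ideal_shrink / zen2_ccd_mm2:.0%} of Zen 2)")

# SRAM and analog scale worse than logic, so ~60-65% of the original
# area (the figure quoted above) is a more realistic landing zone.
realistic = zen2_ccd_mm2 * 0.65                              # ~48 mm2
print(f"realistic estimate: {realistic:.1f} mm2")
```

Either way the chiplet lands around or below typical phone-SoC die sizes, which is the yield argument being made.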
 

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
AMD should get Microsoft aboard to implement SYCL in their C++ compiler. Microsoft partnered with AMD to create C++ AMP (ref. AMD's Fusion Developer Summits a few years ago), which is similar in philosophy, but requires a small non-standard extension to the C++ language. SYCL supersedes that effort by eliminating the need for that extension, using standard-compliant C++ to express the code to be run on the accelerator (GPU "kernels", FPGA algorithms, etc.).

For now, AMD seems to be concentrating on HIP as a CUDA replacement. Unfortunately, the tool chain is not supported on Windows.
MS's goal right now is to support most of these things via WSL2, at least in the short term.
From AMD's perspective, ROCm (the runtime at least) is planned to come to Windows using PAL, as reiterated by John Bridgman many times. But if WSL2 really takes off as envisioned by MS, their plans might change.
MS is really aggressive in pushing upstream all the changes for the Hyper-V subsystem, which is responsible for redirecting most of the Linux kernel requests to Windows.
With Aurora on the backburner now, Frontier will be the first US exascale system, and with that the first major deployment for ROCm.
It will be a big boost, and contributions from academia will trickle in. Right now this development was AMD-only, funded as part of the exascale procurement.
 

ThatBuzzkiller

Senior member
Nov 14, 2014
989
131
106
People are still talking about SYCL this late into the game as if it will be the new unifying compute standard? Here's the reality for all of you people here ...

AMD is not interested in making a SPIR-V compiler for SYCL/OpenCL kernels, so the community would need to step up and do AMD's work for them there. Good luck to anyone who doesn't have an army of compiler engineers with spare time lying around, since this'll remain fruitless for many years to come ... (I remember a time when HSAIL was *standardized* in name only by the HSA Foundation, and when AMD once had a SPIR compiler running on gfx8 GPUs for their PAL OpenCL driver stack, but those days are well behind us)

Nvidia does not have any representatives for SYCL so they are not going to adopt it as a standard anytime soon ...

Intel is the only one taking SYCL seriously out of all of the other vendors but even they're adding in their own vendor specific extensions to define their own proprietary DPC++ standard ...
 
  • Like
Reactions: Vattila

soresu

Golden Member
Dec 19, 2014
1,495
714
136
Suddenly AMD's chiplets seem genius because they can be smaller than typical phone SoCs and can tolerate bad yields.
That was the point from the get go from what I read into AMD's previous words on the subject.

It's a natural progression, when yields get worse with each new process node, to make the dies as small as possible.

Chiplets do that, and as a bonus add a huge amount of versatility to their SKU segmentation options.
 
  • Like
Reactions: Tlh97

ThatBuzzkiller

Senior member
Nov 14, 2014
989
131
106
I was under the impression that HIP code can compile to CUDA too?
Yes, but the big caveat is that it'll use the NVCC or CUDA-Clang compiler instead of the HIP-Clang compiler, so you can't totally expect consistent results between both compiler backends without maintaining them ...
 
  • Like
Reactions: Vattila

Mk pt

Member
Nov 23, 2013
67
17
81
Zen 4 on N5 (+ AMD sauce), a new IOD and AM5 will have so many new knobs that it's going to be really interesting. It will be another inflection point for PCs.
  • The new socket AM5 could bring a bigger substrate and package area for even bigger chips or more chiplets
  • Ignoring absolute process density and using the relative density gain of the N7 -> N5 progression (1.7x), Zen 4 chiplets would be roughly 60-65% of the size of a Zen 2 chiplet, somewhere around 50mm2. Suddenly AMD's chiplets seem genius, because they can be smaller than typical phone SoCs and can tolerate bad yields.
  • An improved process for the IOD. If not made by TSMC, most likely GF 12LP+, which is a major improvement over 12LP
  • Improved efficiency, again.

There is so much more silicon area to play with. If some form of 3D stacking is there, that is going to be even more transistors packed per chip.
Regarding the new IOD, we won't even have to wait for Zen 4; it is coming soon with a specialized Zen 3/Milan SKU for a specific HPC deployment.
Zen 4:
- AM5
- DDR5
- 5nm chiplets

- 7nm IOD. IO takes a huge share of power consumption, and going to 7nm reduces die area. By 2022, 7nm will be cheap enough to make the IOD.

- Moar cores... if Intel ups their game. With a smaller CCD and IOD, AMD can easily put in more CCDs, losing a bit in single-thread but improving multithread.

Usual improvements: faster IF and more cache


Launch in Jan/Feb of 2022.
 

Mk pt

Member
Nov 23, 2013
67
17
81
Zen 4:
- Moar cores... if Intel ups their game. With a smaller CCD and IOD, AMD can easily put in more CCDs, losing a bit in single-thread but improving multithread.
In desktop^

Epyc will double the number of cores.
Poor Intel in servers...
 
  • Like
Reactions: Tlh97

Vattila

Senior member
Oct 22, 2004
551
576
136
I was under the impression that HIP code can compile to CUDA too?
Yes, AMD's HIP implementation uses the CUDA toolchain as a backend for targeting the Nvidia platform. For targeting the AMD platform it uses the ROCm toolchain. But that is really an implementation detail. In theory, you could compile HIP source code down to an executable to be run on an OpenCL driver.

With ROCm, you ideally write and maintain your code using the open HIP programming model, thus allowing portable code (for now, only between AMD and Nvidia platforms, though). As far as I understand, HIP copies the CUDA programming model as closely as possible, for familiarity and ease of porting for CUDA users. The main difference is the naming (e.g. a "hip" prefix instead of "cu" on function calls). ROCm includes a converter tool that automates the rewrite of CUDA code to HIP, allegedly doing more than 90% of the work in the common case.
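The prefix rename that converter automates can be illustrated with a toy sketch. The real hipify tools are clang-based and handle far more (types, libraries, kernel launch syntax); `toy_hipify` here is a hypothetical regex-only stand-in, though the CUDA/HIP API names in the sample are real:

```python
import re

# Toy version of ROCm's hipify idea: rename CUDA runtime calls to their
# HIP equivalents by swapping the "cuda" prefix for "hip". Only the
# simple prefix cases are covered here.
def toy_hipify(source: str) -> str:
    return re.sub(r"\bcuda([A-Z]\w*)", r"hip\1", source)

cuda_snippet = (
    "cudaMalloc(&d_buf, size);\n"
    "cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);\n"
    "cudaFree(d_buf);\n"
)
print(toy_hipify(cuda_snippet))
# hipMalloc(&d_buf, size);
# hipMemcpy(d_buf, h_buf, size, hipMemcpyHostToDevice);
# hipFree(d_buf);
```

Because the mapping is mostly mechanical like this, the claimed >90% automation figure is plausible for straightforward CUDA code.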

That said, in real life projects, you probably have to dip down into platform-specific details for some of your code, I guess. Lacking experience with the solution, I don't know how mature HIP/ROCm has become and the coverage of CUDA functionality it currently achieves (core functionality, libraries, profiling, debugging, etc.). Perhaps someone with practical experience can comment.

On the other hand, hipSYCL is an implementation of the fledgling Khronos SYCL standard. The SYCL programming model is based on pure standard C++ language and libraries, originally intended as a higher-level programming model for OpenCL. It is quite different to CUDA and HIP. The hipSYCL implementation can use OpenMP, HIP/ROCm or CUDA toolchains as backends (with interoperability with Intel's DPC++ SYCL compiler in the pipeline, I think, considering that Heidelberg University, which leads and funds hipSYCL development, recently partnered with Intel on the oneAPI initiative). Notably, hipSYCL does not support OpenCL as a backend, unlike most SYCL implementations (such as Intel's DPC++, Codeplay's ComputeCPP, Xilinx's triSYCL and Peter Žužek's sycl-gtx, which all support OpenCL, as well as various other backends).

To me, SYCL looks to be the open standard for the future. It encompasses more than just the GPGPU that HIP and CUDA focus on. For example, SYCL is in use for FPGA programming. The C++ Standard Committee, universities, national institutions (such as Cineca and Argonne National Laboratory) and companies (such as Intel, Codeplay and Xilinx) are investing in and contributing to SYCL use and development.

Links:

- ROCm/HIP
- CUDA
- SYCL
- SYCL 2020 announcement
- hipSYCL
- Heidelberg University and Intel team up

PS. This discussion is somewhat off-topic, so apologies for that, but heterogeneous systems are the future, and the programming models will be an important part of it.
 
Last edited:

DisEnchantment

Senior member
Mar 3, 2017
648
1,427
106
Yes, AMD's HIP implementation uses the CUDA toolchain as a backend for targeting the Nvidia platform. For targeting the AMD platform it uses the ROCm toolchain. But that is really an implementation detail. In theory, you could compile HIP source code down to an executable to be run on an OpenCL driver.

With ROCm, you ideally write and maintain your code using the open HIP programming model, thus allowing portable code (for now, only between AMD and Nvidia platforms, though). As far as I understand, HIP copies the CUDA programming model as closely as possible, for familiarity and ease of porting for CUDA users. The main difference is the naming (e.g. a "hip" prefix instead of "cu" on function calls). ROCm includes a converter tool that automates the rewrite of CUDA code to HIP, allegedly doing more than 90% of the work in the common case.

That said, in real life projects, you probably have to dip down into platform-specific details for some of your code, I guess. Lacking experience with the solution, I don't know how mature HIP/ROCm has become and the coverage of CUDA functionality it currently achieves (core functionality, libraries, profiling, debugging, etc.). Perhaps someone with practical experience can comment.

On the other hand, hipSYCL is an implementation of the fledgling Khronos SYCL standard. The SYCL programming model is based on pure standard C++ language and libraries, originally intended as a higher-level programming model for OpenCL. It is quite different to CUDA and HIP. The hipSYCL implementation can use OpenMP, HIP/ROCm or CUDA toolchains as backends (with interoperability with Intel's DPC++ SYCL compiler in the pipeline, I think, considering that Heidelberg University, which leads and funds hipSYCL development, recently partnered with Intel on the oneAPI initiative). Notably, hipSYCL does not support OpenCL as a backend, unlike many other SYCL implementations (such as Intel's DPC++, Codeplay's ComputeCPP, triSYCL and sycl-gtx, which all support OpenCL as well as various other backends).

To me, SYCL looks to be the open standard for the future. It encompasses more than just the GPGPU that HIP and CUDA focus on. For example, SYCL is in use for FPGA programming. The C++ Standard Committee, universities, national institutions (such as Cineca and Argonne National Laboratory) and companies (such as Intel, Codeplay and Xilinx) are investing in and contributing to SYCL use and development.

Links:

- ROCm/HIP
- CUDA
- SYCL
- SYCL 2020 announcement
- hipSYCL
- Heidelberg University and Intel team up

PS. This discussion is somewhat off-topic, so apologies for that, but heterogeneous systems are the future, and the programming models will be an important part of it.
There is a list of changes to be upstreamed to enable LLVM to emit SPIR-V IR code.

The bigger question is how to integrate these changes from Intel if another IHV is doing the work in parallel.
I checked the diffs; they're not that big to me (granted, we work with several codebases of more than 30,000 kloc each). Even AMD's downstream ROCm LLVM fork has a 35K+ diff from upstream, and they are constantly issuing PRs, almost 5 a day, to get it all in.
Consuming it with an OpenCL runtime is the easier part.

These days almost everything uses the LLVM infrastructure.
ROCm does too. For AMD's part, they are also making a lot of new proposals for ELF/DWARF and new tooling for debugging heterogeneous systems.

Update:
Looks like Intel has indeed been upstreaming these LLVM changes. Kudos to Intel.
You can check the meeting notes; the last one is from two weeks ago.
 
Last edited:

Vattila

Senior member
Oct 22, 2004
551
576
136
[Intel has upstreamed changes] to enable LLVM to emit SPIR-V IR code.
Great!

For those (like me) that don't know much about this, the Khronos SPIR-V standard is an abstract and portable intermediate representation (IR) that a SYCL/OpenCL toolchain may produce as it translates the high-level accelerator code into distributable program files. The SPIR-V code is fed to the OpenCL driver as the program is executed on the target platform. The OpenCL driver translates the SPIR-V IR code to device-specific machine code, which is then executed by the target platform hardware.

Here are a couple of charts, from the links provided in my previous post, giving an overview of these technologies:



 
Last edited:

amrnuke

Senior member
Apr 24, 2019
876
1,143
96
In desktop^

Epyc will double the number of cores.
Poor Intel in servers...
Yup. With AM5, I imagine they go larger on the socket size. Combined with the process shrink to 5nm, they can still keep transistor density relatively low while cramming a lot more onto the chip. For the other sockets I don't expect a significant change.

32c/64t 6950X
128c/256t 6990X
128c/256t or 128c/512t Epyc 7xx3

Fun times ahead when you consider the IPC/power consumption improvements expected.
 
  • Like
Reactions: Tlh97 and Vattila

Cardyak

Member
Sep 12, 2018
33
39
61
Yup. With AM5, I imagine they go larger on the socket size. Combined with the process shrink to 5nm, they can still keep transistor density relatively low while cramming a lot more onto the chip. For the other sockets I don't expect a significant change.

32c/64t 6950X
128c/256t 6990X
128c/256t or 128c/512t Epyc 7xx3

Fun times ahead when you consider the IPC/power consumption improvements expected.
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds will have to witness a fairly large regression (Probably somewhere in the region of 10%) in order to keep TDP in check. I'm not sure if sacrificing single-threaded performance in order to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3
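The clock-speed trade-off in that argument can be sketched with rough numbers (the ~30% iso-frequency power reduction and the cubic clock/power relationship are simplifying assumptions for illustration, not measured figures):

```python
# Rough TDP check for doubling cores on N5 (illustrative numbers only).
power_scale = 0.70   # N5 at ~70% of N7 power at the same frequency (~30% cut)
cores_scale = 2.0    # hypothetical doubling of core count

# Keeping the same socket TDP, doubled cores at unchanged clocks overshoot:
power_at_same_clocks = cores_scale * power_scale   # 1.4x the budget

# Dynamic power scales roughly with f^3 (P ~ f * V^2, with V tracking f),
# so clocks need to drop by about the cube root of the overshoot.
clock_cut = 1 - (1 / power_at_same_clocks) ** (1 / 3)
print(f"power at unchanged clocks: {power_at_same_clocks:.2f}x TDP")
print(f"approx. clock reduction to fit: {clock_cut:.0%}")   # ~11%
```

That lands near the ~10% regression figure quoted above; a 50% core increase instead of a doubling shrinks the overshoot to 1.05x and nearly eliminates the clock penalty.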
 

Hitman928

Platinum Member
Apr 15, 2012
2,915
2,454
136
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds will have to witness a fairly large regression (Probably somewhere in the region of 10%) in order to keep TDP in check. I'm not sure if sacrificing single-threaded performance in order to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3
ARM is supposedly coming with up to 192 cores and up to 350W TDP within the next couple of years, so they may want something as a response to that.
 
  • Like
Reactions: Tlh97 and Vattila

soresu

Golden Member
Dec 19, 2014
1,495
714
136
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.
Depends what 5nm process you are talking about, because N5P will give more than 30% improvement for power consumption.
 
  • Like
Reactions: Tlh97 and Vattila

moinmoin

Golden Member
Jun 1, 2017
1,915
2,142
106
I'd be surprised if core counts doubled in the move from 7nm -> 5nm.

Even though density improves by a factor of ~1.84x, the power consumption only reduces by approximately 30%.

This means that core counts can indeed double, but clock speeds will have to witness a fairly large regression (Probably somewhere in the region of 10%) in order to keep TDP in check. I'm not sure if sacrificing single-threaded performance in order to push more cores this early is palatable to the average consumer.

I expect AMD will opt for a compromise. A 50% increase in core counts with slightly larger and beefier Zen 4 cores, and clock speeds will be roughly similar to Zen 2 and Zen 3 speeds.

24c/48t 6950X
96c/192t 6990X
96c/192t Epyc 7xx3
In plenty of datacenters, core density and energy efficiency are more important than absolute frequency. As @Hitman928 correctly points out, the competition there is ARM servers, with plans for even more cores already. And in any case IPC needs to increase to be competitive with ARM chips, and that can cover at least part of the reduction in frequency.

Also this is not an either/or; AMD can still (continue to) offer packages with fewer cores that as a result have a bigger TDP headroom for higher frequencies. All of its top end products are "hampered" by the TDP limit (even consumer chips like the 3950X and 5950X), but as a result those are also more energy efficient; that's part of the balance customers can choose between.

5nm is the next node with a significant increase in transistor density, and AMD has so far always gone with powers of 2 for the top end of its Zen packages.
 
  • Like
Reactions: Tlh97 and Vattila

Gideon

Golden Member
Nov 27, 2007
1,028
1,704
136
In plenty of datacenters, core density and energy efficiency are more important than absolute frequency. As @Hitman928 correctly points out, the competition there is ARM servers, with plans for even more cores already. And in any case IPC needs to increase to be competitive with ARM chips, and that can cover at least part of the reduction in frequency.

Also this is not an either/or; AMD can still (continue to) offer packages with fewer cores that as a result have a bigger TDP headroom for higher frequencies. All of its top end products are "hampered" by the TDP limit (even consumer chips like the 3950X and 5950X), but as a result those are also more energy efficient; that's part of the balance customers can choose between.

5nm is the next node with a significant increase in transistor density, and AMD has so far always gone with powers of 2 for the top end of its Zen packages.
Agreed on all points. Besides, a considerable amount of power can be saved just by moving from the current package design to a 2.5D or 3D approach, e.g. an active interposer and such.
 
  • Like
Reactions: Tlh97
