Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

[Attached image: Untitled2.png]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Unfortunately, they provide professional services only. What is given there is only a teaser. If you think semiaccurate's 1K/year is too much ....
But that is expected if you want real data for your competitive business analysis.
@kokhua probably buys a bunch of these papers.


It is best case scenario. Even TSMC itself says 35-40%

Reality is more like the lower end of that, which is to no surprise within reach of 5LPE if Samsung is being a little bit more honest.
35 to 40% seems like it is relatively good given the circumstances. Perhaps I am missing something. An AMD cpu chiplet may be relatively similar to a mobile SoC. Mobile SoCs have very little IO and an AMD cpu chiplet only has a single infinity fabric link (32-bit serdes?). Mobile SoCs usually have a gpu though, which is a bit more logic heavy than a cpu. Mobile SoC caches are generally quite small. I would expect an AMD cpu chiplet to have a possibly lower amount of IO but a significantly higher memory to logic ratio. That may allow it to beat the average scaling for a mobile SoC.
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
They may move to a 16 core die with 2 CCX per die though. That would not be that large at 5 nm.
Well, with the scaling limitations going from N7 to N5, 16 cores would make the CCD a fair bit larger than in Zen3 (assuming more xtors for Zen4 as well).

That still might be a decent trade off for AMD though, at least for Genoa. Not so great for Ryzen, unless AMD sets 8 cores as the bottom of their consumer stack. The number of dice with 8 or more defective cores would seem likely to be low given TSMC's comments on yields.
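To put rough numbers on the salvage point above, here is a minimal Python sketch. It assumes each core on a hypothetical 16-core die fails independently with the same probability, which is a simplification (real defects cluster, and a defect can land in cache or fabric instead of a core). The function name and the 5% per-core defect probability are illustrative assumptions, not figures from TSMC or AMD.

```python
from math import comb


def prob_at_least_k_defective(cores: int, k: int, p_core_defect: float) -> float:
    """Binomial tail: chance that at least k of `cores` cores are defective.

    Assumes independent, equally likely core defects (a simplification).
    """
    return sum(
        comb(cores, i) * p_core_defect**i * (1.0 - p_core_defect) ** (cores - i)
        for i in range(k, cores + 1)
    )


# Hypothetical per-core defect probability, chosen only for illustration.
P_CORE_DEFECT = 0.05

for k in (1, 4, 8):
    share = prob_at_least_k_defective(16, k, P_CORE_DEFECT)
    print(f"share of 16-core dice with >= {k} bad cores: {share:.2e}")
```

Under these assumptions the share of dice with 8 or more bad cores is tiny compared with the share that has at least one, so an 8-core product cut from a 16-core die would mostly mean fusing off working cores.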
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Well, with the scaling limitations going from N7 to N5, 16 cores would make the IOD a fair bit larger than in Zen3 (assuming more xtors for Zen4 as well).
That still might be a decent trade off for AMD though, at least for Genoa. Not so great for Ryzen, unless AMD sets 8 cores as the bottom of their consumer stack.
The number of dice with 8 or more defective cores would seem likely to be low given TSMC's comments on yields.
You're assuming that they are going to stack the CCD on top of the IO die? I was thinking more of 2.5D for the desktop part and possibly 2.5D plus some 3D components for Epyc. Also, I am thinking parts with 8 or fewer cores may be monolithic APUs, even on desktop, for Zen 4.
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
You're assuming that they are going to stack the CCD on top of the IO die? I was thinking more of 2.5D for the desktop part and possibly 2.5D plus some 3D components for Epyc. Also, I am thinking parts with 8 or fewer cores may be monolithic APUs, even on desktop, for Zen 4.
Oh, didn’t have coffee b/4 posting. I meant CCD.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
35 to 40% seems like it is relatively good given the circumstances. Perhaps I am missing something. An AMD cpu chiplet may be relatively similar to a mobile SoC. Mobile SoCs have very little IO and an AMD cpu chiplet only has a single infinity fabric link (32-bit serdes?). Mobile SoCs usually have a gpu though, which is a bit more logic heavy than a cpu. Mobile SoC caches are generally quite small. I would expect an AMD cpu chiplet to have possibly lower amount of IO but a significantly higher memory to logic ratio. That may allow it to beat the average scaling for a mobile SoC.
Quite the contrary actually: the Zen2/3 CCD is cache heavy compared to mobile SoCs, which are logic heavy. Given that SRAM scaling is around 1.2x for N7 --> N5, for a Zen3 CCD with more than 50% cache the scaling is going to be even worse than for current mobile devices.
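As a rough illustration of why a cache-heavy die shrinks less, here is a small Python sketch that blends separate SRAM and logic density gains by area share. The 1.2x SRAM figure is the one quoted above. The 1.8x logic figure, the 25% and 50% cache shares, and the function name are assumptions for illustration only, and analog/IO area is ignored.

```python
def blended_area_ratio(
    cache_fraction: float,
    sram_density_gain: float = 1.2,   # figure quoted in the thread
    logic_density_gain: float = 1.8,  # assumed headline logic density gain
) -> float:
    """Return new_area / old_area for a die that is part SRAM, part logic."""
    if not 0.0 <= cache_fraction <= 1.0:
        raise ValueError("cache_fraction must be between 0 and 1")
    logic_fraction = 1.0 - cache_fraction
    # Each region shrinks by its own density gain; the results add up.
    return cache_fraction / sram_density_gain + logic_fraction / logic_density_gain


examples = [
    ("logic-heavy mobile SoC (assumed 25% cache)", 0.25),
    ("cache-heavy CCD (assumed 50% cache)", 0.50),
]
for label, fraction in examples:
    ratio = blended_area_ratio(fraction)
    print(f"{label}: area shrinks by {(1.0 - ratio) * 100:.1f}%")
```

With these assumed inputs the 50% cache die shrinks by about 31% while the 25% cache die shrinks by about 37.5%, so the higher the cache share, the further the result falls below the headline number.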
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Not so great for Ryzen, unless AMD sets 8 cores as the bottom of their consumer stack. The number of dice with 8 or more defective cores would seem likely to be low given TSMC's comments on yields.
I fully expect AMD to only offer APUs at the bottom of their consumer stack at some point. (The point could be now already if Renoir were readily available.)
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
I fully expect AMD to only offer APUs at the bottom of their consumer stack at some point. (The point could be now already if Renoir were readily available.)
Well, if AMD can get enough wafers for their APUs, they would be a great choice for business class desktops and run of the mill consumer desktops. Seems to me that in a couple of years, AMD will need its own mega fab at TSMC.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Quite the contrary actually, Zen2/3 CCD is cache heavy compared to Mobile SoCs which are logic heavy. Given that SRAM scaling is around 1.2x for N7 --> N5, For a Zen3 CCD with more than 50% cache, the scaling is going to be even worse than current mobile devices.
I said higher memory (cache) to logic ratio for Zen 3 vs. mobile SoCs. Is the "contrary" referring to the cache ratio (which you seem to agree with me on) or the scaling? It seems unclear how good cache scaling will be between 7 and 5 nm, but I think the L3 is already large enough, so I don't think that will be increasing again at 5 nm due to increased latency penalty. They may optimize it a bit more for higher bandwidth or lower latency. I suspect cache increases in Zen 4 will be L4 if it doesn't come with Milan. It would be great if it did come with Milan, but redesigning the IO die right before the switch to DDR5, PCI-E 5, etc. doesn't make a whole lot of sense.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Oh, didn’t have coffee b/4 posting. I meant CCD.
A 16 core CCD will be larger, but yields seem like they are looking good at 5 nm. The Zen 3 die, since it is unified cache, has all of the cores to one side and the infinity fabric along one edge rather than in the middle, between two CCX as it was in Zen 2. (edit: I suspect that they will add another CCX with the infinity fabric connection in the middle again. They look similar to HBM to some extent.) They are probably still around 75 square mm, so double that up to 150 for the 16-core then shrink it a bit for 5 nm. Also, if it is made for stacking, the infinity fabric link may actually be smaller since it doesn't need high speed serdes. That is only about 10 percent of the die or so though. Anyway, it could be smaller than you think. 150 square mm with 30% scaling and maybe a little extra shrink for infinity fabric changes could bring it down to the 100 to 120 square mm range with no increase in cache size.
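The back-of-envelope estimate above can be written out as a short Python sketch. The 75 square mm base, the doubling, the 30% scaling and the 10% fabric share are the numbers from this post. The function name and the `fabric_shrink` parameter (extra saving from dropping the serdes) are hypothetical, added only so the inputs can be varied.

```python
def estimate_ccd_area(
    base_area_mm2: float = 75.0,    # rough 8-core CCD size from the post
    ccx_count: int = 2,             # two 8-core CCX for a 16-core die
    area_reduction: float = 0.30,   # "30% scaling" from the post
    fabric_fraction: float = 0.10,  # "about 10 percent of the die"
    fabric_shrink: float = 0.0,     # hypothetical extra saving on the fabric link
) -> float:
    """Return an estimated 5 nm die area in square mm."""
    doubled = base_area_mm2 * ccx_count
    fabric = doubled * fabric_fraction
    rest = doubled - fabric
    # Apply the extra fabric saving first, then the node shrink to everything.
    return (rest + fabric * (1.0 - fabric_shrink)) * (1.0 - area_reduction)


for reduction in (0.20, 0.30):
    area = estimate_ccd_area(area_reduction=reduction)
    print(f"{reduction:.0%} area reduction -> {area:.1f} square mm")

# Same estimate with half of the fabric area assumed to disappear.
print(f"{estimate_ccd_area(fabric_shrink=0.5):.1f} square mm with a smaller fabric link")
```

A 20% to 30% area reduction lands at roughly 120 down to 105 square mm, which matches the 100 to 120 range given above. Halving the fabric area only moves the result by a few square mm.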

If it is using TSMC stacking tech that does not need solder micro balls, then they could easily make a 32 core desktop part. Parts with 8 cores or fewer may be APUs. As noted above by someone else, that could still happen with Zen 3 parts. For Zen 4, it could be APUs up to 8 cores and then maybe a 16-core chiplet, or 2 chiplets stacked directly on top of each other. The AM5 core count could go up to 32 if that is the case. For Epyc, 4 stacks would actually be 64 cores from the start and then 128 for 2 layers. They could just do 2 stacks for a 32-core base model, 64 cores with 2 layers, etc. This is all speculation, but TSMC's stacking tech seems to allow this. The next gen sockets could technically be smaller with use of die stacking, so space for the die may not be a concern.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Well, things went faster than I imagined.
In today's Q3 call, Hans Mosesmann specifically asked about the SW stack for HPC, and Lisa Su and Victor Peng (Xilinx CEO) said Xilinx has a great SW stack for compute (for FPGAs) and they will bring it together with ROCm at some point in time.
And guess what: the committee for SYCL is actually Intel, Xilinx, Codeplay and Argonne. One of the major contributors to LLVM for SYCL besides Intel is Xilinx (Keryell).
So in a quick turn of events it seems like it will basically be AMD contributing to SYCL for FPGAs via Xilinx.
[Attached image: 1605771798637.png]

Here you go, straight from the horse's mouth.
SYCL for FPGAs is coming earlier than I thought, and via ROCm at that.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
I said higher memory (cache) to logic ratio for Zen 3 vs. mobile SoCs. Is the "contrary" referring to the cache ratio (which you seem to agree with me on) or the scaling? It seems unclear how good cache scaling will be between 7 and 5 nm, but I think the L3 is already large enough, so I don't think that will be increasing again at 5 nm due to increased latency penalty. They may optimize it a bit more for higher bandwidth or lower latency. I suspect cache increases in Zen 4 will be L4 if it doesn't come with Milan. It would be great if it did come with Milan, but redesigning the IO die right before the switch to DDR5, PCI-E 5, etc. doesn't make a whole lot of sense.
I might have misunderstood you on the cache ratio. But we can agree, and it is obvious from the die shots, that Zen3, like Zen2, is very cache heavy.
Regarding the scaling, it is from TSMC's own numbers and also from the SemiAnalysis report dissecting the A14 die that SRAM scaling is going to be far below that of logic: at best 1.2x.
Also agree that the cache is big enough that it does not warrant increasing it any further.

Note also the rumor from ExecutableFix that Genoa is not going to raise the core count per CCD; rather, the CCD count is raised to achieve the 96 core count.
Noteworthy from the various patents, however, are techniques like tagging acceleration mechanisms, compression and more.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
*snip*

Here you go, straight from the Horse's mouth.
SYCL for FPGAs is coming earlier than I thought, and via ROCm at that.

*snip*

Actually, if you take a look at the next slide at 2:25 in the video under "Integrated SW Stack" it suggests that SYCL is totally absent in their future plans ...

If anything, AMD are doubling down on their proprietary HIP API standard so that programmers will eventually be forced to use HIP to be able to develop for Xilinx FPGAs ...

AMD acquiring Xilinx will most likely undermine development of the SYCL standard rather than help it. Running SYCL on top of HIP via hipSYCL doesn't really count, since it'll never be production ready when only a single person is working on the project in his free time. At the end of the day hipSYCL doesn't solve the issue of multiple compiler backends (HIP-Clang vs DPC++ compiler vs NVCC) either, so a portable compute standard will remain out of reach for the foreseeable future ...
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Actually, if you take a look at the next slide during 2:25 in the video under "Integrated SW Stack" it suggests that SYCL is totally absent in their future plans ...

If anything, AMD are doubling down on their proprietary HIP API standard so that programmers will eventually be forced to use HIP to be able to develop for Xilinx FPGAs ...

AMD acquiring Xilinx will most likely undermine development of the SYCL standard rather than help it. Running SYCL on top of HIP via hipSYCL doesn't really count, since it'll never be production ready when only a single person is working on the project in his free time. At the end of the day hipSYCL doesn't solve the issue of multiple compiler backends (HIP-Clang vs DPC++ compiler vs NVCC) either, so a portable compute standard will remain out of reach for the foreseeable future ...
As someone who uses ROCm from time to time, we can agree to disagree.
The ROCm stack is a framework of everything: libs, compiler framework, math libs, comm libs, framework integration, etc., with the compile toolchain being a fork of LLVM with upstreaming on a daily basis.
LLVM right now can emit SPIR-V code that can be consumed by the FPGA device runtime (which is not present in ROCm right now). There are some extensions from Intel, but the plan is to add those to the provisional spec as well.
What you are saying would imply AMD nerfing the SYCL features in the LLVM fork, but this is contrary to what they are doing now: they are trying to get everything upstreamed and minimize the diffs with upstream code.

When ROCm rebases on LLVM 12+, which has initial SYCL support (currently it is at 11), and the formal integration of the Xilinx device runtime into the ROCm infrastructure happens, you can come back and comment on whether AMD will shut out SYCL.
SYCL is a mandate of the Software Readiness task force of the Exascale Computing Project to force suppliers to use vendor-agnostic standards, and I doubt AMD will go in the opposite direction. Argonne National Lab is a member of the SYCL committee for this particular reason.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
SYCL is a mandate of the Software Readiness task force of the Exascale Computing Project to force suppliers to use vendor agnostic standards and I have doubts AMD will go the opposite direction. Argonne National Lab is a member of the SYCL committee for this particular reason.
Argonne's Aurora is not using AMD hardware though?

El Capitan for LLNL and Frontier for ORNL are AMD supercomputer contracts - exactly how much use they will make of SYCL will determine just how much attention that gets from AMD.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
El Capitan for LLNL and Frontier for ORNL are AMD supercomputer contracts - exactly how much use they will make of SYCL will determine just how much attention that gets from AMD.
The Software Readiness task force encompasses all the major labs (ORNL, LLNL, ANL, etc.) that will be part of the Exascale Computing Project. I can imagine that they wouldn't be happy if AMD shut out a standard which they are proposing.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
As someone who uses ROCm time to time we can agree to disagree.
The ROCm stack is a framework of everything, libs, compiler framework, math libs, comm libs, framework integration, etc with the compile toolchain being a fork of LLVM with upstreaming on daily basis.
LLVM right now can emit SPIRV code that can be consumed by the FPGA device runtime (which is not present in ROCm right now). There are some extensions from Intel but the plan is to add those to the provisional spec as well.
What you are saying would imply AMD to nerf the SYCL features in the LLVM fork but this is contrary to what they are doing now, they are trying to get everything upstreamed and minimize the diffs with upstream code.

When ROCm will rebase on LLVM12+ which has initial SYCL support (currently it is at 11) and the formal integration of the Xilinx device runtime into the ROCm infrastructure will happen you can come back and comment if AMD will shut out SYCL
SYCL is a mandate of the Software Readiness task force of the Exascale Computing Project to force suppliers to use vendor agnostic standards and I have doubts AMD will go the opposite direction. Argonne National Lab is a member of the SYCL committee for this particular reason.

It should be stressed that LLVM is just a shared compiler infrastructure with multiple backends targeting wildly different programming language/HW combinations ...

The work going into Intel's backend compiler for LLVM supporting SPIR-V kernel modules is totally unrelated to AMD's HIP-Clang compiler in LLVM which is used for ROCm. Similarly, LLVM supports CUDA kernel compilation via the CUDA-Clang backend, but neither AMD nor Intel are thinking about directly supporting CUDA just because a certain backend (PTX specifically) of LLVM can do it ... (PTX is tied to Nvidia HW, so CUDA is only going to work on their HW as well, regardless of whether LLVM is used or not)

At the end of the day no committee is going to change AMD's corporate agenda of ignoring development of a SPIR-V kernel compiler and doubling down on HIP. Someone else could write a SPIR-V kernel compiler for AMD HW, but who on earth in their right mind is going to write over millions of lines of code and maintain it? AMD believes that the community should do this for them, while the community feels the opposite since they don't want to do the work of a corporation for free. Maybe it should be Intel that makes this SPIR-V kernel compiler for AMD HW, if they believe so strongly in the SYCL standard proliferating, instead of working on Intel specific extensions to SYCL (DPC++)?

With all that out of the way let's get to the breakdown. ROCm and oneAPI may be open source projects but are in no way developed by the community (anyone else but Intel). CUDA is a closed source project, but guess what they all have in common? They're all "corporate projects" if you guessed correctly, regardless of whether they are open source or built on an open 'standard' ... (SYCL can hardly be defined as an industry standard when only Intel is making an effort to support it)
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I might have misunderstood you on the cache ratio. But we can agree and it is obvious from the die shots that Zen3 like Zen2 is very cache heavy.
Regarding the scaling it is TSMC's own numbers and also from Semianalysis report from dissecting the A14 die that SRAM scaling is going to be far below that of logic. At best 1.2x.
Also agree that the Cache is big enough that it does not warrant increasing it any further.

Note also the rumor from Executable fix, that Genoa is not going to raise core count per CCD rather CCD count raised to achieve the 96 core count.
Noteworthy from the various patents are however various techniques like Tagging acceleration mechanisms, compression and more.

Zen 3 is massively cache heavy with 32 MB for 8 cores on top of a lot of independent L2 caches. The amount of cache on mobile SoCs is probably closer to just the L2 with no L3. If cache scaling isn't that great for 7 nm -> 5 nm, then the overall scaling is not going to be that good. Looking at the numbers though, it doesn't seem like it needs to be that good. A 16-core die still isn't going to be that large.

I don't expect any increase in core count per CCX, but I am still thinking that 16-core CCDs are a possibility. Eight core CCDs are going to be rather small in 5 nm; possibly about half the size of an HBM2 die. An 8-core CCD is better from a salvage perspective but the 16-core may be reasonable if the lower end is filled with APUs. Chip stacking may not be cost effective for lower end desktop parts, so an APU up to 8-cores makes a lot of sense.

I know the rumors "only" say 96 cores (6 x 16 or 12 x 8). That would make sense without stacking. They would either need 3 x 16 cores on each side of the IO die or 6 tiny 8-core on each side. That may be the limit of what they can fit without stacking. In that case, the IO die may be a stacked device with L4 cache, which means it could actually be smaller. That might allow them to fit larger 16-core die, perhaps up to 8 for 128 cores. It is less risk and better thermals to stack the IO die, but not the cpu die. With cpu die stacking, the 16-core die seems more efficient, although stacks of 8 core die, while tiny, would be less wasteful if something goes wrong in the stacking process. It is still hard to speculate when chip stacking starts to be used. It seems like they would want to go for a doubling of the max number of cores. There is some possibility that 128-core ARM processors will be a thing, so they might need it.
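The core-count combinations being thrown around here are simple multiplication, but it helps to see them side by side. This Python sketch just tabulates the layouts mentioned in my posts; the function name and the tuple layout are made up for illustration, and none of these configurations is confirmed by AMD.

```python
def total_cores(cores_per_die: int, dies_per_layer: int, layers: int = 1) -> int:
    """Total cores for a package with identical CPU dies, optionally stacked."""
    return cores_per_die * dies_per_layer * layers


# (description, cores per die, dies per layer, stacked layers)
configs = [
    ("12 x 8-core CCDs, no stacking", 8, 12, 1),
    ("6 x 16-core CCDs, no stacking", 16, 6, 1),
    ("8 x 16-core CCDs next to a smaller stacked IO die", 16, 8, 1),
    ("4 stacks of 16-core dies, 2 layers each", 16, 4, 2),
]

for description, cores, dies, layers in configs:
    print(f"{description}: {total_cores(cores, dies, layers)} cores")
```

The first two layouts give the rumored 96 cores, and the last two are the ways to reach 128 discussed above.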
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Would they even need to change the I/O die with Milan/Genoa? The 8 core CCXs only seem to take one IF port now, instead of the old CCDs that took two, one per CCX. They could just stick 6 smaller CCDs on each side of the I/O die and use the existing IF links. Or, am I reading this all wrong?
 

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Would they even need to change the I/O die with Milan/Genoa? The 8 core CCXs only seem to take one IF port now, instead of the old CCDs that took two, one per CCX. They could just stick 6 smaller CCDs on each side of the I/O die and use the existing IF links. Or, am I reading this all wrong?
That should be possible if the available space is sufficient, though the IOD (if the concept itself doesn't change) would need to be updated for PCIe 5 and DDR5.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Would they even need to change the I/O die with Milan/Genoa? The 8 core CCXs only seem to take one IF port now, instead of the old CCDs that took two, one per CCX. They could just stick 6 smaller CCDs on each side of the I/O die and use the existing IF links. Or, am I reading this all wrong?
I have never seen anything that indicates that the Epyc IO die uses more than one physical link to each CCD. Anyone have a link to where they say it is more than one link? The two CCX per CCD in Rome don't talk to each other, but AFAIK, it is a shared link to the IO die. I have seen a slide where it says (for die-to-die) 16B read + 16B write per FCLK for Epyc 7xx1 and 32B read + 16B write per FCLK for 7xx2 (Rome). So they did increase the read bandwidth, but it still seems to be a single link per CCD. Both CCX need to talk to the IO die to maintain cache coherency. Talking directly to a CCX on the same die would be much more complicated and not that useful.
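For a feel of what those per-clock widths mean, here is a small Python sketch converting bytes per FCLK into GB/s. The byte widths are the ones from the slide mentioned above. The 1.5 GHz fabric clock is a placeholder I picked for illustration, not a published spec for either generation, and the names are hypothetical.

```python
def link_bandwidth_gbs(bytes_per_clk: int, fclk_ghz: float) -> float:
    """Peak one-direction bandwidth in GB/s: bytes per clock times clock in GHz."""
    return bytes_per_clk * fclk_ghz


# Placeholder fabric clock, used for both parts purely to compare widths.
ASSUMED_FCLK_GHZ = 1.5

# (read bytes per FCLK, write bytes per FCLK) per die-to-die link
links = {
    "Epyc 7xx1": (16, 16),
    "Epyc 7xx2 (Rome)": (32, 16),
}

for name, (read_bytes, write_bytes) in links.items():
    read_bw = link_bandwidth_gbs(read_bytes, ASSUMED_FCLK_GHZ)
    write_bw = link_bandwidth_gbs(write_bytes, ASSUMED_FCLK_GHZ)
    print(f"{name}: {read_bw:.0f} GB/s read, {write_bw:.0f} GB/s write per link")
```

At any given FCLK the Rome link has twice the read bandwidth and the same write bandwidth, which is consistent with a single, wider link per CCD rather than two separate ones.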

Milan may be exactly the same IO die as Rome. They didn't change it with the desktop Zen 3 part. It would be great if we could get infinity cache with Milan, but it doesn't make much sense to redesign the IO die now. Milan has been in the hands of large partners for a while, so I think we probably would have heard a rumor if it actually includes infinity cache. Genoa will definitely be a new IO die since it needs to support DDR5 and PCI-E 5. It also may use chip stacking of some kind. With chip stacking, there could be a completely different cache architecture. I am expecting infinity cache of some form to make it into Epyc at some point, otherwise it would not be called infinity cache.
 

LightningZ71

Golden Member
Mar 10, 2017
1,627
1,898
136
Well, I see where I went wrong with my understanding. When I read that the two CCX units could only talk to each other via the IOD, I apparently made the assumption that the CCXs had private, physical connections to the IOD. Instead, they use the same physical interface, just with different logical paths.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Zen 3 is massively cache heavy with 32 MB for 8 cores on top of a lot of independent L2 caches. The amount of cache on mobile SoCs is probably closer to just the L2 with no L3. If cache scaling isn't that great for 7 nm -> 5 nm, then the overall scaling is not going to be that good. Looking at the numbers though, it doesn't seem like it needs to be that good. A 16-core die still isn't going to be that large.

I don't expect any increase in core count per CCX, but I am still thinking that 16-core CCDs are a possibility. Eight core CCDs are going to be rather small in 5 nm; possibly about half the size of an HBM2 die. An 8-core CCD is better from a salvage perspective but the 16-core may be reasonable if the lower end is filled with APUs. Chip stacking may not be cost effective for lower end desktop parts, so an APU up to 8-cores makes a lot of sense.

I know the rumors "only" say 96 cores (6 x 16 or 12 x 8). That would make sense without stacking. They would either need 3 x 16 cores on each side of the IO die or 6 tiny 8-core on each side. That may be the limit of what they can fit without stacking. In that case, the IO die may be a stacked device with L4 cache, which means it could actually be smaller. That might allow them to fit larger 16-core die, perhaps up to 8 for 128 cores. It is less risk and better thermals to stack the IO die, but not the cpu die. With cpu die stacking, the 16-core die seems more efficient, although stacks of 8 core die, while tiny, would be less wasteful if something goes wrong in the stacking process. It is still hard to speculate when chip stacking starts to be used. It seems like they would want to go for a doubling of the max number of cores. There is some possibility that 128-core ARM processors will be a thing, so they might need it.
I strongly suspect that 2-high stacks of 8 core CCDs may be the path forward for Zen4, albeit only for server/Genoa SKUs running at lower (sub 3 GHz) clock frequencies.

This is where keeping the CCD light begins to pay off - both in yields and max thermals for any single die.

Zen3 clearly had significant perf/watt improvements even on the same process - if they can get even more such improvements in Zen4 combined with N5P it could definitely work with a little creative thermal design in the packaging.

Of course when you start stacking logic like that the IF bandwidth may become a concern, so they might institute an L4 cache die between the CCDs and the IOD to mitigate the power draw of that necessary bandwidth - though I guess adding an extra die in between the CCDs and IOD would add its own problems with latency on top of everything else.
 

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
I strongly suspect that 2-high stacks of 8 core CCDs may be the path forward for Zen4, albeit only for server/Genoa SKUs running at lower (sub 3 GHz) clock frequencies.
This is currently the same thought in my IRC groups. A good friend of mine who I've known since '01 called it first more than a year ago to heckles. Heat dissipation method will be interesting, though. The Xilinx purchase only furthers AMD's lead IMO.