Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
800
1,364
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,623
5,894
136
Is it actually listed as NPS8? I haven't actually seen the BIOS settings. I thought it was NPS 1, 2, or 4 for no NUMA partitioning, each half of the IO die as a separate node, or each quadrant of the die as a separate NUMA node, respectively. There is also a setting to make each L3 cache into a separate NUMA node, which would be up to 16 nodes on Rome and 8 on Milan (CCX = CCD). That is different since it is based on the CPU die, not IO die partitioning. The IO die partitioning affects the memory interleave, while settings based on the L3 cache do not.

If you look at the diagram of the IO die layout here:


It looks like there is a bigger penalty for going across the 2 halves of the IO die than I thought. It gets complicated to measure this due to the number of different parts. If you only have a 4-CPU-die part, then setting NPS4 is equivalent to having a separate NUMA node for each L3 with Milan, but that is not the case with Rome or with devices with more than 4 CPU chips. I have some 7313s (4 cores per CCD/CCX, with 4 CCDs), which I am running in NPS1. I also have a dual-socket 7F32 (1 core per CCX, with 8 CCXs in, I assume, 4 CCDs), which is currently set to NPS4. That gives me 8 NUMA nodes with 2 CPUs each, in 2 separate CCXs on one CCD. I think the NPS4 setting is not optimal for this part and the software I am running.

Image above is from this article:

You are right, it is NPS4, but with L3AsNumaNode set you basically get a number of NUMA nodes equal to the number of CCXs/CCDs for Milan.

COD partitioning as per the patch is to partition by the number of fabric instances.
So it does hint that there are bigger CCDs with multiple SDPs (on Genoa), or that they have rearranged the CCX in some other way, like having multiple CCXs per CCD again (for Bergamo).

Anyhow, it remains to be seen; because the patch is incomplete, it is a guess on my part how this will be implemented.
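The NPS / L3AsNumaNode interplay described above can be sketched as a tiny helper. This is my own illustration (the function and argument names are invented, not an AMD API), just encoding the rules from the posts:

```python
# Hypothetical helper (names are my own invention, not an AMD API) sketching
# how the NUMA node count per socket follows from the BIOS options above:
#  - NPS1/2/4 partition the IO die into 1, 2 or 4 memory-interleave domains
#  - L3AsNumaNode instead exposes one node per L3 cache (i.e. per CCX):
#    16 on Rome (8 CCDs x 2 CCXs), 8 on Milan (1 CCX per CCD)

def numa_nodes_per_socket(nps: int, l3_as_numa: bool,
                          ccds: int, ccx_per_ccd: int) -> int:
    """NUMA nodes exposed by one socket under the given BIOS settings."""
    if l3_as_numa:
        return ccds * ccx_per_ccd      # one node per L3/CCX
    if nps not in (1, 2, 4):
        raise ValueError("IO-die partitioning supports NPS1/2/4 only")
    return nps

print(numa_nodes_per_socket(4, True,  ccds=8, ccx_per_ccd=2))  # Rome: 16
print(numa_nodes_per_socket(4, True,  ccds=8, ccx_per_ccd=1))  # Milan: 8
print(numa_nodes_per_socket(4, False, ccds=8, ccx_per_ccd=2))  # plain NPS4: 4
```

This also makes the point above concrete: with 1 CCX per CCD on Milan, NPS4 on a 4-CCD part happens to coincide with L3-as-NUMA, but on Rome (2 CCXs per CCD) the two settings diverge.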
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
You are right, it is NPS4, but with L3AsNumaNode set you basically get a number of NUMA nodes equal to the number of CCXs/CCDs for Milan.

COD partitioning as per the patch is to partition by the number of fabric instances.
So it does hint that there are bigger CCDs with multiple SDPs (on Genoa), or that they have rearranged the CCX in some other way, like having multiple CCXs per CCD again (for Bergamo).

Anyhow, it remains to be seen; because the patch is incomplete, it is a guess on my part how this will be implemented.
I don’t think AMD will change the number of cores per CCX with a Zen 4 derivative. A 16-core CCX would push the limits of the ring-bus-like tech used internal to the CCX (see the article in my previous post). I would expect that Bergamo is 2 CCXs per chip unless they do something strange. One possibility I have considered is to use 2 cores sharing the same L2; the number of L3 clients remains constant in that case. That change seems unlikely in a Zen 4 derived part. Also, if the 16-core chiplet with dense cores is similar in size to a regular 8-core chiplet, then it seems like they actually could do 12 chiplets x 16 cores each for 192 cores, if Genoa supports 12 chiplets. Perhaps they are not the same shape, or the power would be too high, though.

I am also really wondering what they are going to do for a workstation / HEDT part. SP5 will be very large and expensive. Are they still going to base Threadripper on salvage IO dies when the IO die is made on TSMC's 6 nm process? Perhaps they will have sufficient salvage parts. I was hoping they would do a modular IO die with stacking tech to make the Epyc IO die out of 4 separate chips, one for each quadrant. That would allow a smaller, cheaper socket for Threadripper with only 2 IO die chiplets. Using one of the modular chips for Ryzen parts would also be a good thing. That is seeming unlikely given the die sizes quoted for the IO die, unless it is multiple chiplets mounted in a package and then the package mounted on the SP5 substrate. The IO die will still be close to 400 mm2, which is plausible on 6 nm; GPUs are likely running that large. It seems like yields would be a lot better with four 100 mm2 dies connected with LSI or something. That would also allow cheap implementations for smaller parts. Perhaps we do not get such a thing until Zen 5 with some form of stacking.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,623
5,894
136
Leaked Genoa manuals already say 8-core CCX/CCD, but of course Bergamo is still up for speculation.
What the manual also says is that there are two Data Fabric ports from each CCD.
However, I have not seen the manual myself, nor will I attempt to seek it out, so I trust the folks who have seen it.

But we should know something more soon; AMD, AMD Server, and AMD Instinct are tweeting about the event every few hours.
It seems they are eager to share something. They even posted a countdown.
It is about time; this year there was no event like FAD 2020, nor any of the New Horizon or Next Horizon events etc. that we had in the past.

Update:
Add YT Link

Lisa Su, Dan McNamara and Forrest Norrod will present the Accelerated Data Center event from AMD
 
May 17, 2020
122
233
116
In the image in the tweet there are 16 DDR4 slots for each EPYC CPU, whereas actual EPYC motherboards have 8 DDR4 slots per CPU. So if it's Trento, has the IOD been updated to support more memory channels?
 

remsplease

Junior Member
Oct 22, 2021
16
3
41
-New AM5 socket for desktop. More rectangular than AM4. I expect (most) current AM4 coolers will work on AM5 in lower core-count scenarios.

-Sockets SP3 and TR4 EPYC/Threadripper remain the same.

-New chipsets supporting DDR5, pcie5, etc.

Zen4 (Ryzen, TR/EPYC) is a 5nm die-shrink of Zen 3 with an updated memory controller and other minor design changes for IO.

TR/EPYC substrate changes to accept 5nm chips. No pin count changes required.

Everything else is pretty much the same.
 

leoneazzurro

Senior member
Jul 26, 2016
951
1,514
136

Hans Gruber

Platinum Member
Dec 23, 2006
2,153
1,099
136
Just a reminder: when AMD introduced the Zen architecture (2017), they had it mapped out and planned through Zen 4. Anything after would be a new architecture. Intel utilizes the tick-tock approach with CPUs; AMD since Ryzen uses the ding-dong approach. The tick is weak and the tock is strong, while ding and dong are equal. When Intel falls behind, they take the Bill O'Reilly approach. And with Alder Lake, everything at Intel is good again.
 
May 17, 2020
122
233
116
-New chipsets supporting DDR5, pcie5, etc.

Zen4 (Ryzen, TR/EPYC) is a 5nm die-shrink of Zen 3 with an updated memory controller and other minor design changes for IO.
It's an SoC, so the memory controller is in the IOD, not in the chipset (even before Ryzen the memory controller was in the CPU; that has been the case for a while), and the same goes for PCIe 5, which is also handled in the SoC, not only in the chipset.
 

remsplease

Junior Member
Oct 22, 2021
16
3
41
We already know that isn't true. Genoa supports AVX-512.
To which part are you referring?
It's an SoC, so the memory controller is in the IOD, not in the chipset (even before Ryzen the memory controller was in the CPU; that has been the case for a while), and the same goes for PCIe 5, which is also handled in the SoC, not only in the chipset.

Supporting the standard is required across the hardware stack. Not all pcie connections are directly to the CPU.
 

Ajay

Lifer
Jan 8, 2001
15,624
7,950
136
Supporting the standard is required across the hardware stack. Not all pcie connections are directly to the CPU.
Yeah, that's not the way it works. The SoC can support PCIe 5.0 with the chipset supporting PCIe 4.0 (or 3.0).
DRAM support has nothing to do with the chipset.
 

AAbattery

Member
Jan 11, 2019
25
54
91
L3 is chopped but L2 gets beefed.

Puts new light on this quote from Mike Clark: "and as we continue to go forward, getting more cores, and getting more cores in a sharing L3 environment, we’ll still try to manage that latency so that when there are lower thread counts in the system, you still getting good latency out of that L3. Then the L2 - if your L2 is bigger then you can cut back some on your L3 as well."

Will be exciting to see Zen 4 and 5 come to fruition and their dense variants as well.

Summary of what we know: (feel free to correct or add):

Zen 4 (thanks to Gigabyte leak and "floundersedition" on Reddit among others for organizing it together)
  • 4-Int pipeline (like Zen 3)
  • AVX-512, full support except VP2INTERSECT
  • bigger load store
  • same L1 size
  • 1MB L2, (2x but otherwise the same as Zen 3)
  • 72 entries for L1 DTLB (up from 64)
  • 1.5x bigger L2 DTLB
  • same L3 sizes
  • 2x IF-links per CCD
  • at least DDR5-4800 and thus higher IF speed (2400 MHz), probably 5200/2600
  • PCIe 5.0 and more lanes, CXL, CCIX, Gen Z (at least for server)
  • no USB 4 on Raphael, but on Rembrandt
  • iGPU
maybe an updated branch predictor?
prefetcher changes?
increased Reorder Buffer to go along with increased OoO window?
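For the cache/TLB bullets above, the scaling factors work out as plain arithmetic. The Zen 3 baselines (512 KB L2, 64-entry L1 DTLB) are from AMD's public Zen 3 specs; the Zen 4 values are the leaked ones:

```python
# Plain arithmetic on the leaked figures: Zen 3 baselines vs leaked Zen 4.
zen3 = {"L2_KB": 512, "L1_DTLB_entries": 64}
zen4 = {"L2_KB": 1024, "L1_DTLB_entries": 72}

for key in zen3:
    print(f"{key}: {zen3[key]} -> {zen4[key]} ({zen4[key] / zen3[key]:g}x)")
# L2 doubles (2x); the L1 DTLB grows 72/64 = 1.125x
```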

Zen4D (thanks to MLID)
Dense library TSMC 5nm
2 8-core CCXs per CCD
16MB L3 cache per CCX?
1.25 or 1.5 MB L2 per core?
Reduced FP resources?

Zen 5
MOAR BIGGERER
MOAR BETTERER
 

soresu

Platinum Member
Dec 19, 2014
2,721
1,921
136
You claim Zen4 is a die shrink of Zen3, but that is not possible. Otherwise it would not support AVX-512.
Perhaps they meant supporting AVX512 without doubling SIMD?

There are enough SIMD units to support it already in Zen 2 (obviously with changes to the core for the purpose), albeit only at 1x 512-bit instruction per cycle vs. Intel's cores, which I think can do at least 2x 512-bit instructions per cycle for the last few generations.
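The "reuse the existing 256-bit units" idea above amounts to a simple throughput model. This is my own toy illustration (not AMD's documented implementation): if each 512-bit op is cracked into 512/width micro-ops, peak 512-bit ops per cycle halves on a 256-bit datapath:

```python
# Toy throughput model (my own illustration): a 512-bit op issued over
# narrower SIMD units takes 512/width micro-ops, so peak 512-bit
# ops/cycle = units / (512 / width).

def avx512_ops_per_cycle(simd_width_bits: int, units: int) -> float:
    uops_per_512b_op = 512 / simd_width_bits   # 2 if double-pumped on 256-bit
    return units / uops_per_512b_op

print(avx512_ops_per_cycle(256, units=2))  # Zen-2-style 2x 256-bit units: 1.0
print(avx512_ops_per_cycle(512, units=2))  # Intel-style 2x 512-bit units: 2.0
```

Even at half the peak rate, AVX-512 can still pay off via the wider ISA (masking, more registers, new instructions) without the die-area cost of true 512-bit datapaths.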
 

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
maybe an updated branch predictor, a temporal prefetcher and bigger OOO
Didn't they already have temporal prefetchers? Also when looking at the increased front end and TLBs I am pretty sure they will increase the Reorder Buffer as well in order to increase the OoO window just as you mentioned.
 

soresu

Platinum Member
Dec 19, 2014
2,721
1,921
136
.....

Zen 4 (thanks to Gigabyte leak and "floundersedition" on Reddit among others for organizing it together)
  • ....
  • same L1 size
  • 1MB L2, (2x but otherwise the same as Zen 3)
  • ....
  • 1.5x bigger L2 DTLB
  • same L3 sizes
  • .....
....

Zen4D
Dense library TSMC 5nm
2 8-core CCXs per CCD
16MB L3 cache per CCX?
1.25 or 1.5 MB L2 per core?
....
It's highly unlikely that the dense variant of Zen4 would have more cache per core than the vanilla variant.

SRAM cache is probably the most area-expensive part of the CCD, so if anything, cache will be cut as much as possible without completely axing Zen4's IPC gains, to make room for more cores in the dense variant.
 

AAbattery

Member
Jan 11, 2019
25
54
91
Didn't they already have temporal prefetchers? Also when looking at the increased front end and TLBs I am pretty sure they will increase the Reorder Buffer as well in order to increase the OoO window just as you mentioned.

Right. Just now I went to Ian's reviews of Milan and Vermeer to refresh my memory and he described Zen 3's new prefetch as "region-based".
 

uzzi38

Platinum Member
Oct 16, 2019
2,666
6,192
146
Perhaps they meant supporting AVX512 without doubling SIMD?

There are enough SIMD units to support it already in Zen 2 (obviously with changes to the core for the purpose), albeit only at 1x 512-bit instruction per cycle vs. Intel's cores, which I think can do at least 2x 512-bit instructions per cycle for the last few generations.
Doesn't matter. Being intentionally vague here - there's a significant ST performance boost coming with Zen 4 that will show even in benchmarks that don't care for larger caches much. Calling it just a shrink of Zen 3 is dead wrong.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,804
3,256
136
Doesn't matter. Being intentionally vague here - there's a significant ST performance boost coming with Zen 4 that will show even in benchmarks that don't care for larger caches much. Calling it just a shrink of Zen 3 is dead wrong.
Yes.
Interestingly, I was wrong and other people were right (Zen 3 is the new core, not Zen 4). If we listen to Mike Clark, Zen 3 is a completely new, from-scratch core, and Zen 4 will be the single iteration of that core before Zen 5, which is a new core. So if Zen 1 -> 2 can be 15-20% IPC, more clock, and 128-bit SIMD to 256-bit, I don't see any reason Zen 3 -> 4 can't be 15-20% more IPC, more clock, and 256-bit SIMD to 512-bit.

Based on this I'm actually a little disappointed in Zen 3, in that it's not really any wider than Zen 2. No, I don't count the extra execution pipelines etc., because the number of register-file ports is the same; AMD was just smarter in usage. I want bigger and smarter!!!
 

lightmanek

Senior member
Feb 19, 2017
390
763
136
Yes.
Interestingly, I was wrong and other people were right (Zen 3 is the new core, not Zen 4). If we listen to Mike Clark, Zen 3 is a completely new, from-scratch core, and Zen 4 will be the single iteration of that core before Zen 5, which is a new core. So if Zen 1 -> 2 can be 15-20% IPC, more clock, and 128-bit SIMD to 256-bit, I don't see any reason Zen 3 -> 4 can't be 15-20% more IPC, more clock, and 256-bit SIMD to 512-bit.

Based on this I'm actually a little disappointed in Zen 3, in that it's not really any wider than Zen 2. No, I don't count the extra execution pipelines etc., because the number of register-file ports is the same; AMD was just smarter in usage. I want bigger and smarter!!!

Zen 3 would look better in the context of Zen 2, but AMD engineers managed to port the new prefetchers back to Zen 2 before launch, improving its IPC gains vs. Zen 1 and lowering the gap between the planned Zen 2 and Zen 3 ;)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,623
5,894
136
Zen 3 would look better in the context of Zen 2, but AMD engineers managed to port the new prefetchers back to Zen 2 before launch, improving its IPC gains vs. Zen 1 and lowering the gap between the planned Zen 2 and Zen 3 ;)
The opportunity was there and they took it. They could add more transistors in Zen2 without increasing power drastically.

Some perspective:

Zen --> Zen 2
2x CCX for 8 cores + 2x 8 MB L3 @ 2800 MTr (the entire Zeppelin die is 4800 MTr) --> 1 CCD/CCX for 8 cores + 1x 32 MB L3 @ 3800 MTr
That's a 1.35x MTr gain for 15% IPC, and total TDP goes 95 W --> 105 W
More than half of the 1.35x gain over Zen 1's 2x CCX is due to the doubling of L3, the addition of 256-bit FP units, and the addition of GMI and SMU in the Zen 2 CCD
The real core + L2 is only ~15% more MTr

Zen 2 --> Zen 3
~1.1x MTr gain for 19% IPC, total TDP 105 W --> 105 W

Zen 3 --> Zen 4
Between 25 and 40% MTr gain; IPC?
L3/SMU largely similar MTr count
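The Zen -> Zen 2 comparison above is just a ratio of the quoted transistor counts; spelled out:

```python
# The arithmetic behind the Zen -> Zen 2 comparison above
# (MTr = millions of transistors, figures as quoted in the post).
zen1_2ccx_mtr = 2800   # 2x CCX: 8 cores + 2x 8 MB L3
zen2_ccd_mtr  = 3800   # 1 CCD: 8 cores + 32 MB L3 + GMI/SMU + 256-bit FP

gain = zen2_ccd_mtr / zen1_2ccx_mtr
print(f"Zen -> Zen 2 MTr gain: {gain:.2f}x for ~15% IPC")  # ~1.36x
# Per the post, over half of that growth is the doubled L3, the widened
# FP units, and the added GMI/SMU blocks, not the core + L2 itself (~15%).
```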