Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
800
1,364
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,623
5,894
136
Is it actually listed as NPS8? I haven't actually seen the BIOS settings. I thought it was NPS 1, 2, or 4 for no NUMA partitioning, each half of the IO die as a separate node, or each quadrant of the die as a separate NUMA node, respectively. There is also a setting to make each L3 cache into a separate NUMA node, which would be up to 16 nodes on Rome and 8 on Milan (CCX = CCD). That is different since it is based on the CPU die, not IO die partitioning. The IO die partitioning affects the memory interleave, while settings based on the L3 cache do not.

If you look at the diagram of the IO die layout here:


It looks like there is a bigger penalty for going across the 2 halves of the IO die than I thought. It gets complicated to measure this due to the number of different parts. If you only have a 4-CPU-die part, then setting NPS4 is equivalent to having a separate NUMA node for each L3 with Milan, but that is not the case with Rome or with devices with more than 4 CPU chips. I have some 7313s (4 cores per CCD/CCX, with 4 CCDs), which I am running in NPS1. I also have a dual-socket 7F32 (1 core per CCX, with 8 CCXs in, I assume, 4 CCDs), which is currently set to NPS4. That gives me 8 NUMA nodes with 2 CPUs each, in 2 separate CCXs on one CCD. I think the NPS4 setting is not optimal for this part and the software I am running.

Image above is from this article:

You are right, it is NPS4, but with L3AsNumaNode set you basically get a number of NUMA nodes equal to the number of CCXs/CCDs for Milan.

COD partitioning as per the patch is to partition by the number of fabric instances.
So it does hint that there are bigger CCDs with multiple SDPs (on Genoa), or that they have rearranged the CCX in some other way, like having multiple CCXs per CCD again (for Bergamo).

Anyhow, it remains to be seen; because the patch is incomplete, it is a guess on my part how this will be implemented.
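The NPS / L3AsNumaNode interplay described above can be sketched as a tiny helper. This is my own illustration (the function and argument names are invented, not an AMD API), just encoding the rules from the posts:

```python
# Hypothetical helper (names are my own invention, not an AMD API) sketching
# how the NUMA node count per socket follows from the BIOS options above:
#  - NPS1/2/4 partition the IO die into 1, 2 or 4 memory-interleave domains
#  - L3AsNumaNode instead exposes one node per L3 cache (i.e. per CCX):
#    16 on Rome (8 CCDs x 2 CCXs), 8 on Milan (1 CCX per CCD)

def numa_nodes_per_socket(nps: int, l3_as_numa: bool,
                          ccds: int, ccx_per_ccd: int) -> int:
    """NUMA nodes exposed by one socket under the given BIOS settings."""
    if l3_as_numa:
        return ccds * ccx_per_ccd      # one node per L3/CCX
    if nps not in (1, 2, 4):
        raise ValueError("IO-die partitioning supports NPS1/2/4 only")
    return nps

print(numa_nodes_per_socket(4, True,  ccds=8, ccx_per_ccd=2))  # Rome: 16
print(numa_nodes_per_socket(4, True,  ccds=8, ccx_per_ccd=1))  # Milan: 8
print(numa_nodes_per_socket(4, False, ccds=8, ccx_per_ccd=2))  # plain NPS4: 4
```

This also makes the point above concrete: with 1 CCX per CCD on Milan, NPS4 on a 4-CCD part happens to coincide with L3-as-NUMA, but on Rome (2 CCXs per CCD) the two settings diverge.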
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
You are right, it is NPS4, but with L3AsNumaNode set you basically get a number of NUMA nodes equal to the number of CCXs/CCDs for Milan.

COD partitioning as per the patch is to partition by the number of fabric instances.
So it does hint that there are bigger CCDs with multiple SDPs (on Genoa), or that they have rearranged the CCX in some other way, like having multiple CCXs per CCD again (for Bergamo).

Anyhow, it remains to be seen; because the patch is incomplete, it is a guess on my part how this will be implemented.
I don’t think AMD will change the number of cores per CCX with a Zen 4 derivative. A 16-core CCX would push the limits of the ring-bus-like tech used internal to the CCX (see the article in my previous post). I would expect that Bergamo is 2 CCXs per chip unless they do something strange. One possibility I have considered is to use 2 cores sharing the same L2; the number of L3 clients remains constant in that case. That change seems unlikely in a Zen 4 derived part. Also, if the 16-core chiplet with dense cores is similar in size to a regular 8-core chiplet, then it seems like they actually could do 12 chiplets x 16 cores each for 192 cores, if Genoa supports 12 chiplets. Perhaps they are not the same shape, or the power would be too high, though.

I am also really wondering what they are going to do for a workstation / HEDT part. SP5 will be very large and expensive. Are they still going to base Threadripper on salvage IO dies when the IO die is made on TSMC's 6 nm process? Perhaps they will have sufficient salvage parts. I was hoping they would do a modular IO die with stacking tech to make the Epyc IO die out of 4 separate chips, one for each quadrant. That would allow a smaller, cheaper socket for Threadripper with only 2 IO die chiplets. Using one of the modular chips for Ryzen parts would also be a good thing. That is seeming unlikely given the die sizes quoted for the IO die, unless it is multiple chiplets mounted in a package and then the package mounted on the SP5 substrate. The IO die will still be close to 400 mm2, which is plausible on 6 nm; GPUs are likely running that large. It seems like yields would be a lot better with four 100 mm2 dies connected with LSI or something. That would also allow cheap implementations for smaller parts. Perhaps we do not get such a thing until Zen 5 with some form of stacking.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,623
5,894
136
Leaked Genoa manuals already say 8-core CCX/CCD, but of course Bergamo is still up for speculation.
What the manual also says is that there are two Data Fabric ports from each CCD.
However, I have not seen the manual myself, nor will I attempt to seek it out, so I trust the folks who have seen it.

But we should know something more soon; AMD, AMD Server, and AMD Instinct are tweeting about the event every few hours.
It seems they are eager to share something. They even posted a countdown.
It is about time; this year there was no event like FAD 2020, nor any of the New Horizon or Next Horizon events etc. that we had in the past.

Update:
Add YT Link

Lisa Su, Dan McNamara and Forrest Norrod will present the Accelerated Data Center event from AMD
 
May 17, 2020
122
233
116
In the image in the tweet there are 16 DDR4 slots for each EPYC CPU, whereas actual EPYC motherboards have 8 DDR4 slots per CPU. So if it's Trento, has the IOD been updated to support more memory channels?
 

remsplease

Junior Member
Oct 22, 2021
16
3
41
-New AM5 socket for desktop. More rectangular than AM4. I expect (most) current AM4 coolers will work on AM5 in lower core-count scenarios.

-Sockets SP3 and TR4 EPYC/Threadripper remain the same.

-New chipsets supporting DDR5, pcie5, etc.

Zen4 (Ryzen, TR/EPYC) is a 5nm die-shrink of Zen 3 with an updated memory controller and other minor design changes for IO.

TR/EPYC substrate changes to accept 5nm chips. No pin count changes required.

Everything else is pretty much the same.
 

leoneazzurro

Senior member
Jul 26, 2016
951
1,514
136

Hans Gruber

Platinum Member
Dec 23, 2006
2,153
1,099
136
Just a reminder: when AMD introduced the Zen architecture (2017), they had it mapped out and planned through Zen 4. Anything after would be a new architecture. Intel utilizes the tick-tock approach with CPUs; AMD since Ryzen uses the ding-dong approach. The tick is weak and the tock is strong, while ding and dong are equal. When Intel falls behind, they take the Bill O'Reilly approach. And with Alder Lake, everything at Intel is good again.
 
May 17, 2020
122
233
116
-New chipsets supporting DDR5, pcie5, etc.

Zen4 (Ryzen, TR/EPYC) is a 5nm die-shrink of Zen 3 with an updated memory controller and other minor design changes for IO.
It's an SoC, so the memory controller is in the IOD, not in the chipset (even before Ryzen the memory controller was in the CPU; that has been the case for a while), and the same goes for PCIe 5, which is also handled in the SoC, not only in the chipset.
 

remsplease

Junior Member
Oct 22, 2021
16
3
41
We already know that isn't true. Genoa supports AVX-512.
To which part are you referring?
It's an SoC, so the memory controller is in the IOD, not in the chipset (even before Ryzen the memory controller was in the CPU; that has been the case for a while), and the same goes for PCIe 5, which is also handled in the SoC, not only in the chipset.

Supporting the standard is required across the hardware stack. Not all pcie connections are directly to the CPU.
 

Ajay

Lifer
Jan 8, 2001
15,624
7,950
136
Supporting the standard is required across the hardware stack. Not all pcie connections are directly to the CPU.
Yeah, that's not the way it works. The SoC can support PCIe 5.0 with the chipset supporting PCIe 4.0 (or 3.0).
DRAM support has nothing to do with the chipset.
 

AAbattery

Member
Jan 11, 2019
25
54
91
L3 is chopped but L2 gets beefed.

Puts new light on this quote from Mike Clark: "and as we continue to go forward, getting more cores, and getting more cores in a sharing L3 environment, we’ll still try to manage that latency so that when there are lower thread counts in the system, you still getting good latency out of that L3. Then the L2 - if your L2 is bigger then you can cut back some on your L3 as well."

Will be exciting to see Zen 4 and 5 come to fruition and their dense variants as well.

Summary of what we know: (feel free to correct or add):

Zen 4 (thanks to Gigabyte leak and "floundersedition" on Reddit among others for organizing it together)
  • 4-Int pipeline (like Zen 3)
  • AVX-512, full support except VP2INTERSECT
  • bigger load store
  • same L1 size
  • 1MB L2, (2x but otherwise the same as Zen 3)
  • 72 entries for L1 DTLB (up from 64)
  • 1.5x bigger L2 DTLB
  • same L3 sizes
  • 2x IF-links per CCD
  • at least DDR5-4800 and thus higher IF speed (2400 MHz), probably 5200/2600
  • PCIe 5.0 and more lanes, CXL, CCIX, Gen Z (at least for server)
  • no USB 4 on Raphael, but on Rembrandt
  • iGPU
maybe an updated branch predictor?
prefetcher changes?
increased Reorder Buffer to go along with increased OoO window?
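For the cache/TLB bullets above, the scaling factors work out as plain arithmetic. The Zen 3 baselines (512 KB L2, 64-entry L1 DTLB) are from AMD's public Zen 3 specs; the Zen 4 values are the leaked ones:

```python
# Plain arithmetic on the leaked figures: Zen 3 baselines vs leaked Zen 4.
zen3 = {"L2_KB": 512, "L1_DTLB_entries": 64}
zen4 = {"L2_KB": 1024, "L1_DTLB_entries": 72}

for key in zen3:
    print(f"{key}: {zen3[key]} -> {zen4[key]} ({zen4[key] / zen3[key]:g}x)")
# L2 doubles (2x); the L1 DTLB grows 72/64 = 1.125x
```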

Zen4D (thanks to MLID)
Dense library TSMC 5nm
2 8-core CCXs per CCD
16MB L3 cache per CCX?
1.25 or 1.5 MB L2 per core?
Reduced FP resources?

Zen 5
MOAR BIGGERER
MOAR BETTERER
 

soresu

Platinum Member
Dec 19, 2014
2,721
1,921
136
You claim Zen4 is a die shrink of Zen3, but that is not possible. Otherwise it would not support AVX-512.
Perhaps they meant supporting AVX512 without doubling SIMD?

There are enough SIMD units to support it already in Zen 2 (obviously with changes to the core for the purpose), albeit only at 1x 512-bit instruction per cycle vs. Intel's cores, which I think can do at least 2x 512-bit instructions per cycle for the last few generations.
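The "reuse the existing 256-bit units" idea above amounts to a simple throughput model. This is my own toy illustration (not AMD's documented implementation): if each 512-bit op is cracked into 512/width micro-ops, peak 512-bit ops per cycle halves on a 256-bit datapath:

```python
# Toy throughput model (my own illustration): a 512-bit op issued over
# narrower SIMD units takes 512/width micro-ops, so peak 512-bit
# ops/cycle = units / (512 / width).

def avx512_ops_per_cycle(simd_width_bits: int, units: int) -> float:
    uops_per_512b_op = 512 / simd_width_bits   # 2 if double-pumped on 256-bit
    return units / uops_per_512b_op

print(avx512_ops_per_cycle(256, units=2))  # Zen-2-style 2x 256-bit units: 1.0
print(avx512_ops_per_cycle(512, units=2))  # Intel-style 2x 512-bit units: 2.0
```

Even at half the peak rate, AVX-512 can still pay off via the wider ISA (masking, more registers, new instructions) without the die-area cost of true 512-bit datapaths.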
 

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
maybe an updated branch predictor, a temporal prefetcher and bigger OOO
Didn't they already have temporal prefetchers? Also when looking at the increased front end and TLBs I am pretty sure they will increase the Reorder Buffer as well in order to increase the OoO window just as you mentioned.
 

soresu

Platinum Member
Dec 19, 2014
2,721
1,921
136
.....

Zen 4 (thanks to Gigabyte leak and "floundersedition" on Reddit among others for organizing it together)
  • ....
  • same L1 size
  • 1MB L2, (2x but otherwise the same as Zen 3)
  • ....
  • 1.5x bigger L2 DTLB
  • same L3 sizes
  • .....
....

Zen4D
Dense library TSMC 5nm
2 8-core CCXs per CCD
16MB L3 cache per CCX?
1.25 or 1.5 MB L2 per core?
....
It's highly unlikely that the dense variant of Zen4 would have more cache per core than the vanilla variant.

SRAM cache is probably the most area-expensive part of the CCD, so if anything, cache will be cut as much as possible without completely axing Zen4's IPC gains, to make room for more cores in the dense variant.
 

AAbattery

Member
Jan 11, 2019
25
54
91
Didn't they already have temporal prefetchers? Also when looking at the increased front end and TLBs I am pretty sure they will increase the Reorder Buffer as well in order to increase the OoO window just as you mentioned.

Right. Just now I went to Ian's reviews of Milan and Vermeer to refresh my memory and he described Zen 3's new prefetch as "region-based".
 

uzzi38

Platinum Member
Oct 16, 2019
2,666
6,192
146
Perhaps they meant supporting AVX512 without doubling SIMD?

There are enough SIMD units to support it already in Zen 2 (obviously with changes to the core for the purpose), albeit only at 1x 512-bit instruction per cycle vs. Intel's cores, which I think can do at least 2x 512-bit instructions per cycle for the last few generations.
Doesn't matter. Being intentionally vague here - there's a significant ST performance boost coming with Zen 4 that will show even in benchmarks that don't care for larger caches much. Calling it just a shrink of Zen 3 is dead wrong.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,804
3,256
136
Doesn't matter. Being intentionally vague here - there's a significant ST performance boost coming with Zen 4 that will show even in benchmarks that don't care for larger caches much. Calling it just a shrink of Zen 3 is dead wrong.
Yes.
Interestingly, I was wrong and other people were right (Zen 3 is the new core, not Zen 4). If we listen to Mike Clark, Zen 3 is a completely new, from-scratch core, and Zen 4 will be the single iteration of that core before Zen 5, which is a new core. So if Zen 1 -> 2 can be 15-20% IPC, more clock, and 128-bit SIMD to 256-bit, I don't see any reason Zen 3 -> 4 can't be 15-20% more IPC, more clock, and 256-bit SIMD to 512-bit.

Based on this I'm actually a little disappointed in Zen 3, in that it's not really any wider than Zen 2. No, I don't count the extra execution pipelines etc., because the number of register-file ports is the same; AMD was just smarter in usage. I want bigger and smarter!!!
 

lightmanek

Senior member
Feb 19, 2017
390
763
136
Yes.
Interestingly, I was wrong and other people were right (Zen 3 is the new core, not Zen 4). If we listen to Mike Clark, Zen 3 is a completely new, from-scratch core, and Zen 4 will be the single iteration of that core before Zen 5, which is a new core. So if Zen 1 -> 2 can be 15-20% IPC, more clock, and 128-bit SIMD to 256-bit, I don't see any reason Zen 3 -> 4 can't be 15-20% more IPC, more clock, and 256-bit SIMD to 512-bit.

Based on this I'm actually a little disappointed in Zen 3, in that it's not really any wider than Zen 2. No, I don't count the extra execution pipelines etc., because the number of register-file ports is the same; AMD was just smarter in usage. I want bigger and smarter!!!

Zen 3 would look better in the context of Zen 2, but AMD engineers managed to port the new prefetchers back to Zen 2 before launch, improving its IPC gains vs. Zen 1 and lowering the gap between the planned Zen 2 and Zen 3 ;)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,623
5,894
136
Zen 3 would look better in the context of Zen 2, but AMD engineers managed to port the new prefetchers back to Zen 2 before launch, improving its IPC gains vs. Zen 1 and lowering the gap between the planned Zen 2 and Zen 3 ;)
The opportunity was there and they took it. They could add more transistors in Zen2 without increasing power drastically.

Some perspective:

Zen --> Zen 2
2x CCX for 8 cores + 2x 8 MB L3 @ 2800 MTr (the entire Zeppelin die is 4800 MTr) --> 1 CCD/CCX for 8 cores + 1x 32 MB L3 @ 3800 MTr
That's a 1.35x MTr gain for 15% IPC, and total TDP goes 95 W --> 105 W
More than half of the 1.35x gain over Zen 1's 2x CCX is due to the doubling of L3, the addition of 256-bit FP units, and the addition of GMI and SMU in the Zen 2 CCD
The real core + L2 is only ~15% more MTr

Zen 2 --> Zen 3
~1.1x MTr gain for 19% IPC, total TDP 105 W --> 105 W

Zen 3 --> Zen 4
Between 25 and 40% MTr gain; IPC?
L3/SMU largely similar MTr count
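The Zen -> Zen 2 comparison above is just a ratio of the quoted transistor counts; spelled out:

```python
# The arithmetic behind the Zen -> Zen 2 comparison above
# (MTr = millions of transistors, figures as quoted in the post).
zen1_2ccx_mtr = 2800   # 2x CCX: 8 cores + 2x 8 MB L3
zen2_ccd_mtr  = 3800   # 1 CCD: 8 cores + 32 MB L3 + GMI/SMU + 256-bit FP

gain = zen2_ccd_mtr / zen1_2ccx_mtr
print(f"Zen -> Zen 2 MTr gain: {gain:.2f}x for ~15% IPC")  # ~1.36x
# Per the post, over half of that growth is the doubled L3, the widened
# FP units, and the added GMI/SMU blocks, not the core + L2 itself (~15%).
```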