Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Page 88 - AnandTech Forums

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

moinmoin

Diamond Member
Jun 1, 2017
4,956
7,675
136
Zen 4 seems to belong to the same Family 19h as Zen 3. Core uarch and ISA would largely remain similar (though Zen 3 did add a whole bunch of ISA extensions over Zen 2). :confused:

I wonder where the major part of the perf will come from.
AVX512 is confirmed. That was pretty clear when AMD never objected to feature level 4 for x86 in gcc/clang making AVX512 mandatory.
An interposer will have to wait a bit.
None of that should be a surprise really. AMD appears to apply a tick tock cadence of its own to the Zen family:
  • Zen 1: new core
  • Zen 2: same core on new node with increased FPU capability
  • Zen 3: new core on same node
  • Zen 4 by all indications so far: same core on new node with increased FPU capability
  • Zen 5: new core on same node?
 

Magic Carpet

Diamond Member
Oct 2, 2011
3,477
231
106
AMD's TDP formula is as follows:
TDP (Watts) = (tCase°C - tAmbient°C) / (HSF θca)
where HSF θca (°C/W) is defined as the minimum °C-per-Watt rating of the heatsink needed to achieve rated performance

and the numbers fit:
169.56 = (46.7 - 35) / 0.069

The problem here is that tCase is very low, and one doesn't choose a much lower die/heatspreader junction temperature unless it is actually needed. I reckon the 170W TDP SKU(s) will have very aggressive boosting, probably the highest in the entire lineup.
I don't understand why the ambient temp is rated at only 35 degrees, though. The 125W Thuban TDP was derived in a similar way, but with a more realistic ambient of 44 degrees. A way to keep the potential TDP figure lower, perhaps (power draw goes up as the temperature increases). Liquid cooling is suggested for a reason, imo.
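That formula is easy to sanity-check with the figures quoted above; here is a minimal sketch (plain Python, `tdp_watts` is just an illustrative name):

```python
# Quick check of the TDP formula quoted above:
#   TDP (W) = (tCase - tAmbient) / (HSF theta_ca)
def tdp_watts(t_case_c: float, t_ambient_c: float, theta_ca_c_per_w: float) -> float:
    """Heatsink-limited TDP given case temp, ambient temp, and θca in °C/W."""
    return (t_case_c - t_ambient_c) / theta_ca_c_per_w

# Numbers from the post for the rumoured 170W SKU:
print(round(tdp_watts(46.7, 35.0, 0.069), 2))  # ≈ 169.57
```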

 
Last edited:

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
None of that should be a surprise really. AMD appears to apply a tick tock cadence of its own to the Zen family:
If it is more or less the same core uarch, the alleged 22H2 launch seems a bit strange, because it is not like they are new to N5; they are already shipping N5 products as of the current quarter.
From my experience at least, most of the time from design to manufacture goes into IP verification and validation anyway; with the same core uarch that time should be much reduced.
Or the rumors are just that, rumors.

  • Zen 5: new core on same node?
Rumors allegedly indicate Zen5 on N3
 

Abwx

Lifer
Apr 2, 2011
10,970
3,515
136
Moving from N7 (w/secret sauce!) to N5 will bring with it some performance improvement @ isopower. Not sure if that will be the only improvement, but it will count for something. I doubt it'll be +29% though.

If those leaks are accurate area-wise, then given TSMC's density at 5nm vs 7nm, Zen 4 has 60% more transistors than Zen 3. Whether this is due to increased caches or anything else, I don't think that amount is there just for fun.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
If those leaks are accurate area wise and given TSMC s density at 5nm vs 7nm then Zen 4 has 60% more transistors than Zen 3, wether this is due to increased caches or anything else i dont think that this amount is here just for the fun.
They won't get that much density gain, especially not on a cache-heavy design like Zen.
Apple managed 1.49x with relatively far more logic per die area than Zen.
AMD will get around a 1.35x-1.4x density gain at best, which would translate to around 30% more MTr.
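The arithmetic behind that estimate can be sketched as follows (a rough model; the 1.35x-1.4x gains and the slightly-smaller-die figure are this thread's speculation, not confirmed specs):

```python
# Rough transistor-budget model: transistor count scales with
# effective density gain times the die-area ratio (new/old).
def extra_transistors(density_gain: float, area_ratio: float = 1.0) -> float:
    """Fractional increase in transistor count."""
    return density_gain * area_ratio - 1.0

# Same die area at 1.35x-1.4x density:
for gain in (1.35, 1.40):
    print(f"{gain}x density -> {extra_transistors(gain):.0%} more transistors")

# A slightly smaller die (e.g. 0.96x area) at 1.35x density lands near +30%:
print(f"{extra_transistors(1.35, 0.96):.0%}")
```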

But a 30% MTr gain would be huge for Zen 4 if it materializes, especially if the caches remain more or less the same.
Zen bottlenecks are listed here; not sure about the methodology, but there is some merit to the analysis.

Regarding the area, it is directly from AMD's design guide, so it is pretty much a given.
Have to say, ExecuFix has some deep moles.
Pretty interesting to get tidbits from such a leaker, but professionally, as someone responsible for providing design info to our suppliers, I dread it very much.
 
Last edited:

yuri69

Senior member
Jul 16, 2013
389
624
136
Zen 4 belonging to Family 19h has been known for quite a long time. Family changes seem to be related to cache topology: K8 to K10, Bobcat to Jaguar, Zen 2 to Zen 3.
 

Abwx

Lifer
Apr 2, 2011
10,970
3,515
136
They won't get that much density gain, especially not on a cache-heavy design like Zen.
Apple managed 1.49x with relatively far more logic per die area than Zen.
AMD will get around a 1.35x-1.4x density gain at best, which would translate to around 30% more MTr.

TSMC claims a 45% die-size reduction, and they are talking about a whole SoC, not a specific circuit like a memory cell.

The Apple chip is not comparable since it includes the IMC and PCH.


Edit: Assuming square-root scaling of transistors to perf, this points to ~27% better MT perf at the same frequency.
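That square-root (Pollack's-rule-style) estimate works out like this (a quick sketch; `perf_gain_from_transistors` is just an illustrative helper):

```python
from math import sqrt

# Pollack's-rule-style estimate: performance scales roughly with the
# square root of the transistor count.
def perf_gain_from_transistors(transistor_ratio: float) -> float:
    """Fractional perf gain assuming perf ~ sqrt(transistors)."""
    return sqrt(transistor_ratio) - 1.0

# +60% transistors -> roughly the ~27% MT figure above:
print(f"{perf_gain_from_transistors(1.60):.1%}")
```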
 
Last edited:

AMDK11

Senior member
Jul 15, 2019
234
153
116
None of that should be a surprise really. AMD appears to apply a tick tock cadence of its own to the Zen family:
  • Zen 1: new core
  • Zen 2: same core on new node with increased FPU capability
  • Zen 3: new core on same node
  • Zen 4 by all indications so far: same core on new node with increased FPU capability
  • Zen 5: new core on same node?
I dare say that each Zen generation, i.e. Zen, Zen 2, Zen 3 and Zen 4, is a new x86 core. The fact that Zen 3 is almost a completely new design from scratch does not mean that Zen 2 and Zen are the same apart from the FPU block. Between Zen and Zen 2 there were changes not only in the FPU but also in the front-end, back-end, execution units and load-store subsystem.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
TSMC claim 45% die reduction, they are talking of a whole SoC not of a specific circuit like a memory cell.
This is what they said

At IEDM, Geoffrey Yeap gave a little more color to that density by reporting that for a typical mobile SoC which consists of 60% logic, 30% SRAM, and 10% analog/IO, their 5 nm technology scaling was projected to reduce chip size by 35%-40%.
But Apple managed 1.49x scaling, or a 33% die-size reduction. If you look at the numbers closely and compare an Apple SoC with the Zen 3 die, the Zen 3 die has a huge percentage of cache, which biases it even further towards ~1.35x scaling (in ideal conditions).
But it matters less if the cache remains more or less the same: with the 9% MTr gain from Zen 2 to Zen 3 they got 19% IPC, so now consider a 30% MTr gain.
I am also hoping for what @uzzi38 is saying, that L1/L3 would remain largely the same.
Reading papers around, the bottlenecks are in many places, like the retire buffer, OoO window etc.,
some of which can be solved by throwing more regfile silicon at them.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
AVX512 support will burn a ton of area: 4x the area for the FP register file alone, and the execution units need to be widened to 512 bits as well. There is also the question of widening the load/store datapaths; you need to handle at least one of each at 512 bits to get decent performance.

I think Zen 4 is gonna be like the Zen 1 -> Zen 2 transition: Zen 3 made more capable in the FP department, with a lot of those increased resources benefiting IPC all around, but I don't expect more execution ports or a widened core.
 

Abwx

Lifer
Apr 2, 2011
10,970
3,515
136
This is what they said


But Apple managed 1.49x scaling, or a 33% die-size reduction. If you look at the numbers closely and compare an Apple SoC with the Zen 3 die, the Zen 3 die has a huge percentage of cache, which biases it even further towards ~1.35x scaling (in ideal conditions).
But it matters less if the cache remains more or less the same: with the 9% MTr gain from Zen 2 to Zen 3 they got 19% IPC, so now consider a 30% MTr gain.
I am also hoping for what @uzzi38 is saying, that L1/L3 would remain largely the same.
Reading papers around, the bottlenecks are in many places, like the retire buffer, OoO window etc.,
some of which can be solved by throwing more regfile silicon at them.

He says a 35-40% reduction including the I/O, the part that shrinks the least and is very big; as already said, that is not included in the chiplet, so my point holds even better.

AVX512 support will burn a ton of area: 4x the area for the FP register file alone, and the execution units need to be widened to 512 bits as well. There is also the question of widening the load/store datapaths; you need to handle at least one of each at 512 bits to get decent performance.

I think Zen 4 is gonna be like the Zen 1 -> Zen 2 transition: Zen 3 made more capable in the FP department, with a lot of those increased resources benefiting IPC all around, but I don't expect more execution ports or a widened core.

There should be more exe ports, since they'll surely implement AVX512 with two separate 256-bit units; they can easily use such an arrangement to increase AVX/AVX2/SSE2-4.2 throughput per Hz.
 
Last edited:
  • Like
Reactions: Tlh97 and Joe NYC

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
He says a 35-40% reduction including the I/O, the part that shrinks the least and is very big; as already said, that is not included in the chiplet, so my point holds even better.
It is a mobile SoC; everything is built in, without much I/O compared to a desktop processor.
But let's see, we just have to wait a few quarters.
N7 as advertised by TSMC is 91 MTr/mm2, but what AMD got with the Zen 3 CCD is ~52 MTr/mm2.
 
  • Like
Reactions: Tlh97

uzzi38

Platinum Member
Oct 16, 2019
2,635
5,984
146
It is a mobile SoC; everything is built in, without much I/O compared to a desktop processor.
But let's see, we just have to wait a few quarters.
N7 as advertised by TSMC is 91 MTr/mm2, but what AMD got with the Zen 3 CCD is ~52 MTr/mm2.
I mean, CCDs don't exactly have a lot of I/O either. Just the IFOP SerDes. They're the only thing supposedly on N5 anyway, the IODs are still N6 afaik.
 
  • Like
Reactions: Tlh97

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
I mean, CCDs don't exactly have a lot of I/O either. Just the IFOP SerDes. They're the only thing supposedly on N5 anyway, the IODs are still N6 afaik.
Indeed, the only point I was trying to make is that Zen 3 is really cache-heavy; looking at the die shot, about half is used by the L3. Neither Apple's nor any mobile SoC has lots of I/O, nor do the Zen CCDs, but mobile/Apple SoCs do have a lot of logic, especially with the GPU blocks.
But no reason to split hairs over this; Zen 4 will launch and we will know.
My opinion, though: not close to 1.8x scaling. In other words, very, very far from the advertised 171 MTr/mm2.
 

Ajay

Lifer
Jan 8, 2001
15,468
7,874
136
Indeed, the only point I was trying to make is that Zen 3 is really cache-heavy; looking at the die shot, about half is used by the L3. Neither Apple's nor any mobile SoC has lots of I/O, nor do the Zen CCDs, but mobile/Apple SoCs do have a lot of logic, especially with the GPU blocks.
But no reason to split hairs over this; Zen 4 will launch and we will know.
My opinion, though: not close to 1.8x scaling.
Taking the idea that half the chip is logic and the other half is cache, the average scaling from TSMC's numbers is around 1.5x. So I would expect that value +/- a bit, depending on design-rule limitations. There will be much wider datapaths in the ALUs, registers and caches to support AVX512; not sure what impact that will have on scaling, but I'd guess not very much.
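That ~1.5x average can be reproduced with a simple area-weighted model (a sketch; the ~1.8x logic and ~1.35x SRAM density gains are the approximate figures discussed in this thread, not official per-block numbers):

```python
# Effective whole-die density gain from per-block area fractions
# (on the old node) and per-block density gains on the new node.
def effective_density_gain(fractions_and_gains):
    """fractions_and_gains: iterable of (area_fraction, density_gain)."""
    new_area = sum(frac / gain for frac, gain in fractions_and_gains)
    return 1.0 / new_area

# 50% logic at ~1.8x, 50% SRAM at ~1.35x (assumed split and gains):
print(f"{effective_density_gain([(0.5, 1.8), (0.5, 1.35)]):.2f}x")
```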
 
  • Like
Reactions: Tlh97 and Saylick

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
There should be more exe ports since they ll surely implement AVX512 with two separate 256bit units, they can easily make use of such an arrangement to increase AVX/AVX2/SSE2-4.2 throughput/Hz.

It was easy to do with AVX2 and its 256-bit instructions: basically there was a 128-bit lower part and a 128-bit upper part, and instructions mostly worked on either the full 256 bits or the lower 128 bits. Instructions that "mixed" data between the lower and upper parts were rather uncommon.

With AVX512 it is completely different. "Lane", "shift" and "mask" instructions are very common and operate on the full width. Imagine having to synchronize two 256-bit units that use the same mask register to conditionally sum two 512-bit registers: that's not gonna perform well, and you won't get away with the small 1-cycle penalty that Zen 1's 128-bit units had. Nor is shuffling bytes around inside a 512-bit register gonna work if you don't have a full-width unit and the full register width available. Imagine the pain of implementing a shift across two 256-bit halves with two ALUs.
The proper way is to have a 512-bit datapath and ALU/FPU in the execution port.
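To see why the mask register couples the two halves, here is a toy model of a merge-masked 512-bit add in plain Python (16 float lanes, not real intrinsics; with split 256-bit units, lanes 0-7 and 8-15 would live in different halves yet consume the same mask):

```python
# Toy model of an AVX-512 merge-masked add (vaddps-style semantics):
# dst[i] = a[i] + b[i] where mask bit i is set, else src[i] is kept.
def masked_add(src, a, b, mask):
    """16 lanes of a 512-bit register as floats; mask is a 16-bit int."""
    return [a[i] + b[i] if (mask >> i) & 1 else src[i] for i in range(16)]

src = [0.0] * 16
a = [float(i) for i in range(16)]
b = [1.0] * 16
# One mask whose set bits span both 256-bit halves (lanes 0, 7 and 15):
result = masked_add(src, a, b, 0b1000000010000001)
print(result)
```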
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,776
3,164
136
It was easy to do with AVX2 and its 256-bit instructions: basically there was a 128-bit lower part and a 128-bit upper part, and instructions mostly worked on either the full 256 bits or the lower 128 bits. Instructions that "mixed" data between the lower and upper parts were rather uncommon.

With AVX512 it is completely different. "Lane", "shift" and "mask" instructions are very common and operate on the full width. Imagine having to synchronize two 256-bit units that use the same mask register to conditionally sum two 512-bit registers: that's not gonna perform well, and you won't get away with the small 1-cycle penalty that Zen 1's 128-bit units had. Nor is shuffling bytes around inside a 512-bit register gonna work if you don't have a full-width unit and the full register width available. Imagine the pain of implementing a shift across two 256-bit halves with two ALUs.
The proper way is to have a 512-bit datapath and ALU/FPU in the execution port.
I think he is saying that at the point you do that, you might as well go to the effort of being able to execute 2x 256-bit ops concurrently in your single 512-bit unit, scheduling considerations aside.

I also think everyone is wrong: Zen 1, 2 and 3 are fundamentally the same base core. Zen 2 improved branch/uop/256-bit ops, and Zen 3 improved execution and load/store, but both did so without fundamentally changing the core (same number of register-file ports, same number of stages, same job per stage, etc.). Jim Keller talked about the Zen core having "big bones", and I think that's exactly what he means. My bet is that Zen 4 is the next new set of big bones for the following uarches to grow into.

If the rumoured perf uplifts are true, it kind of has to be. I would ignore the families; that's about how they behave/handle certain operations, and is not directly linked to how they execute or the performance they achieve.
 

uzzi38

Platinum Member
Oct 16, 2019
2,635
5,984
146
I think L1 and L3 are expected to remain the same, but L2 sizes per-core are doubling. So some of the extra transistors will go to cache, but most of the extra die area should be for core improvements.
Apparently details in the leak suggest that at the very least what I wrote here about L1 and L2 is correct. Associativity is the same. L3 is still unknown, though.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
My bet is Zen4 is the next new set of big bones which will allow the following uarches to grow into.

I think it will continue to iterate on Zen 3, adding AVX512 and enlarging core resources but keeping everything else the same. Combined with being well fed by 1 MB of L2 and a massive L3, they can extract the rumoured IPC improvements.
 

Gideon

Golden Member
Nov 27, 2007
1,646
3,712
136
Apparently details in the leak suggest that at the very least what I wrote here for L1 and L2 are correct. Associativity is the same. L3 is still unknown though.
Unconfirmed, yes, but almost certainly unchanged.

Considering 5nm's poor SRAM scaling and the existence of V-cache SKUs, adding more makes no sense.

Dropping to 16 MB is even less likely; it would mean going all-in on V-cache for servers. No way they'd take that risk during the finalization of the design, which must have happened at least a year ago.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
Some good news on Linux at least: the whole kernel can be built against the feature levels instead of against the ISA features of processors from more than a decade ago.
That should help many benchmarks/applications where x86 CPUs are being held back by SW targeting legacy processors.
Windows needs to do the same, but the openness of the PC ecosystem means SW targets the least common denominator, at the cost of holding back performance on newer CPUs.
Zen 4 will be the first AMD processor to be x86-64-v4.
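For illustration, the feature levels can be modelled as cumulative flag sets (a sketch; the flag names loosely follow /proc/cpuinfo spellings, and the lists are abbreviated, not each level's full requirements):

```python
# Sketch of the x86-64 microarchitecture levels (v1 = baseline x86-64).
# Each level requires all flags of the levels below it, plus its own.
LEVELS = {
    "x86-64-v2": {"popcnt", "sse3", "ssse3", "sse4_1", "sse4_2", "cx16"},
    "x86-64-v3": {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe"},
    "x86-64-v4": {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}

def feature_level(cpu_flags: set) -> str:
    """Return the highest level whose cumulative flag set is present."""
    level, required = "x86-64-v1", set()
    for name in ("x86-64-v2", "x86-64-v3", "x86-64-v4"):
        required |= LEVELS[name]
        if not required <= cpu_flags:
            break
        level = name
    return level

zen3_flags = LEVELS["x86-64-v2"] | LEVELS["x86-64-v3"]  # no AVX-512
zen4_flags = zen3_flags | LEVELS["x86-64-v4"]           # adds AVX-512
print(feature_level(zen3_flags), feature_level(zen4_flags))
```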
 

Bigos

Member
Jun 2, 2019
129
287
136
Most of the kernel cannot even use SSE, and that is not because of feature levels but because of the design (tasks that don't clobber XMM registers don't need to save/restore them on context switches). I don't think these feature levels will mean much in the kernel, but I would like to be proven wrong (some features are unrelated to the vector registers and might make the kernel more performant).

The userspace is a completely different matter.