Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
809
1,412
136
Except for details about the microarchitectural improvements, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

Abwx

Lifer
Apr 2, 2011
11,557
4,349
136
This is what they said (see the attached slide).


But Apple managed 1.49x scaling, or a 33% die-size reduction. If you look at the numbers closely and compare an Apple SoC with a Zen 3 die, the Zen 3 die has a huge percentage of cache, which biases it even further towards the ~1.35x (ideal-conditions) SRAM scaling.
But it matters less if the cache remains more or less the same: with a 9% MTr gain from Zen 2 to Zen 3 they got 19% IPC; now consider a 30% MTr gain.
I am also hoping, as @uzzi38 says, that L1/L3 will remain largely the same.
Reading papers around, the bottlenecks are in many places, like the retire buffer, OOO window, etc.
Some of these can be solved by throwing more register-file silicon at the problem.

He says a 35-40% reduction including the I/O, the part that shrinks the least and is very big. As already said, this is not included in the chiplet, so my point holds even better.

AVX512 support will burn a ton of area: 4x the area for the FP register file alone, and the execution units need to be widened to 512 bits as well. There is also the question of widening the load/store datapaths; you need to handle at least one load and one store at 512 bits to get decent performance.
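The 4x register-file figure checks out from the architectural state alone; a back-of-the-envelope sketch (a simplification that ignores physical-register-file sizing, which is larger but scales the same way):

```python
# Why AVX-512 roughly quadruples FP register file bits: AVX2 exposes 16
# architectural 256-bit vector registers, AVX-512 exposes 32 architectural
# 512-bit ones, so both width and register count double.
avx2_bits = 16 * 256
avx512_bits = 32 * 512
ratio = avx512_bits / avx2_bits
print(ratio)  # -> 4.0
```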

I think Zen 4 is gonna be like the Zen 1 -> Zen 2 transition: Zen 3 made more capable in the FP department, with a lot of those increased resources benefiting IPC all around, but I don't expect more execution ports or a widened core.

There should be more exe ports, since they'll surely implement AVX-512 with two separate 256-bit units; they can easily use such an arrangement to increase AVX/AVX2/SSE2-4.2 throughput per clock.
 
Last edited:
  • Like
Reactions: Tlh97 and Joe NYC

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
He says a 35-40% reduction including the I/O, the part that shrinks the least and is very big. As already said, this is not included in the chiplet, so my point holds even better.
It is a mobile SoC, everything is integrated; there's not much I/O compared to a desktop processor.
But let's see, we just have to wait a few quarters.
N7 as advertised by TSMC is 91 MTr/mm2; what AMD got with the Zen 3 CCD is ~52 MTr/mm2.
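For reference, the ~52 MTr/mm2 figure can be reproduced from the commonly cited (unofficial) Zen 3 CCD numbers:

```python
# Rough density check for the Zen 3 CCD, using commonly cited (not official)
# figures: ~4.15 billion transistors on a ~80.7 mm^2 die.
transistors_mtr = 4150.0   # millions of transistors
die_area_mm2 = 80.7
density = transistors_mtr / die_area_mm2
print(round(density, 1))   # -> 51.4, vs the ~91 MTr/mm^2 TSMC advertises for N7
```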
 
  • Like
Reactions: Tlh97

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
It is a mobile SoC, everything is integrated; there's not much I/O compared to a desktop processor.
But let's see, we just have to wait a few quarters.
N7 as advertised by TSMC is 91 MTr/mm2; what AMD got with the Zen 3 CCD is ~52 MTr/mm2.
I mean, CCDs don't exactly have a lot of I/O either, just the IFOP SerDes. The CCDs are the only thing supposedly on N5 anyway; the IODs are still N6, afaik.
 
  • Like
Reactions: Tlh97

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
I mean, CCDs don't exactly have a lot of I/O either, just the IFOP SerDes. The CCDs are the only thing supposedly on N5 anyway; the IODs are still N6, afaik.
Indeed, the only point I was trying to make is that Zen 3 is really cache-heavy: looking at the die shot, about half is used by the L3. Neither Apple's nor any mobile SoC has lots of I/O, nor do the Zen CCDs, but mobile/Apple SoCs do have a lot of logic, especially in the GPU blocks.
But no reason to split hairs over this; Zen 4 will launch and we will know.
But in my opinion, not close to 1.8x scaling. In other words, very, very far away from the advertised 171 MTr/mm2.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
Indeed, the only point I was trying to make is that Zen 3 is really cache-heavy: looking at the die shot, about half is used by the L3. Neither Apple's nor any mobile SoC has lots of I/O, nor do the Zen CCDs, but mobile/Apple SoCs do have a lot of logic, especially in the GPU blocks.
But no reason to split hairs over this; Zen 4 will launch and we will know.
But in my opinion, not close to 1.8x scaling.
Taking the idea that half the chip is logic and the other half is cache, the average scaling from TSMC's numbers is around 1.5x. So I would expect that value +/- a bit, depending on design-rule limitations. There will be much wider datapaths in the ALUs, registers and caches to support AVX-512; not sure what impact that will have on scaling, but I'd guess not very much.
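The ~1.5x average can be made concrete with an area-weighted calculation (the factors here are assumptions: TSMC's advertised ~1.84x logic density gain for N7 -> N5, and ~1.3x for SRAM):

```python
# Effective density scaling for a die that is half logic, half SRAM by area.
# Assumed N7 -> N5 factors: ~1.84x for logic, ~1.3x for SRAM (SRAM shrinks
# far less than logic on 5nm-class nodes).
logic_scale, sram_scale = 1.84, 1.3
new_area = 0.5 / logic_scale + 0.5 / sram_scale   # old area normalised to 1.0
effective = 1.0 / new_area
print(round(effective, 2))  # -> 1.52
```

Note the halves combine harmonically (areas add, densities don't), which is why the result sits well below the headline 1.84x.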
 
  • Like
Reactions: Tlh97 and Saylick

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
There should be more exe ports, since they'll surely implement AVX-512 with two separate 256-bit units; they can easily use such an arrangement to increase AVX/AVX2/SSE2-4.2 throughput per clock.

It was easy to do with AVX2 and its 256-bit instructions: basically there was a 128-bit lower part and a 128-bit upper part, and instructions mostly worked on either the full 256 bits or just the lower 128 bits. Instructions that "mixed" data between the lower and upper parts were rather uncommon.

With AVX-512 it is completely different. "Lane", "shift" and "mask" instructions are very common and operate on the full width. Imagine having to synchronize two 256-bit units that use the same mask register to conditionally sum two 512-bit registers: that's not gonna perform well, and you won't get away with the small 1-cycle penalty that Zen 1's 128-bit units had. Nor is shuffling bytes around inside a 512-bit register gonna work if you don't have a full-width unit and the full register width available. Imagine the pain of implementing a shift across two 256-bit halves with two ALUs.
The proper way is to have a 512-bit datapath and ALU/FPU in the execution port.
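The mask-register point can be illustrated with a scalar sketch of a masked 512-bit add (a simplification of `vaddps zmm {k}` merge-masking; lane semantics follow the AVX-512 spec, everything else is illustrative):

```python
# Scalar model of an AVX-512 masked add: one 16-bit mask register governs all
# 16 float lanes of a 512-bit operation at once. Split the datapath into two
# independent 256-bit halves and both must see the same k-register state in
# the same cycle, which is the synchronization headache described above.
def masked_add_512(dst, a, b, mask):
    # dst/a/b: 16 lanes; mask: 16-bit int, bit i enables lane i (merge-masking)
    return [x + y if (mask >> i) & 1 else d
            for i, (d, x, y) in enumerate(zip(dst, a, b))]

lanes = masked_add_512([0.0] * 16, list(range(16)), [10.0] * 16, 0x00FF)
print(lanes)  # lower 8 lanes summed, upper 8 keep the old dst values
```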
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,931
3,556
136
It was easy to do with AVX2 and its 256-bit instructions: basically there was a 128-bit lower part and a 128-bit upper part, and instructions mostly worked on either the full 256 bits or just the lower 128 bits. Instructions that "mixed" data between the lower and upper parts were rather uncommon.

With AVX-512 it is completely different. "Lane", "shift" and "mask" instructions are very common and operate on the full width. Imagine having to synchronize two 256-bit units that use the same mask register to conditionally sum two 512-bit registers: that's not gonna perform well, and you won't get away with the small 1-cycle penalty that Zen 1's 128-bit units had. Nor is shuffling bytes around inside a 512-bit register gonna work if you don't have a full-width unit and the full register width available. Imagine the pain of implementing a shift across two 256-bit halves with two ALUs.
The proper way is to have a 512-bit datapath and ALU/FPU in the execution port.
I think he is saying that at the point you do that, you might as well go to the effort of being able to execute 2x 256-bit ops concurrently in your single 512-bit unit, scheduling considerations aside.

I also think everyone is wrong: Zen 1, 2 and 3 are fundamentally the same base core. Zen 2 improved branch/uop/256-bit ops, Zen 3 improved execution and load/store, but both did so without fundamentally changing the core (same number of reg-file ports, same number of stages, same job per stage, etc.). Jim Keller talked about the Zen core having "big bones", and I think that's exactly what he means. My bet is Zen 4 is the next new set of big bones for the following uarches to grow into.

If the rumoured perf uplifts are true, it kind of has to be. I would ignore "families"; that's about how cores behave/handle certain operations, and how they execute or the performance they execute with is not directly linked.
 

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
I think L1 and L3 are expected to remain the same, but L2 sizes per core are doubling. So some of the extra transistors will go to cache, but most of the extra die area should be for core improvements.
Apparently details in the leak suggest that, at the very least, what I wrote here for L1 and L2 is correct. Associativity is the same. L3 is still unknown, though.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
My bet is Zen 4 is the next new set of big bones for the following uarches to grow into.

I think it will continue to iterate on Zen 3, adding AVX-512 and enlarging core resources but keeping everything else the same. Combined with being well fed by 1 MB of L2 and a massive L3, they can extract the rumoured IPC improvements.
 

Gideon

Golden Member
Nov 27, 2007
1,774
4,145
136
Apparently details in the leak suggest that, at the very least, what I wrote here for L1 and L2 is correct. Associativity is the same. L3 is still unknown, though.
Unconfirmed, yes, but almost certainly unchanged.

Considering 5nm's poor SRAM scaling and the existence of V-cache SKUs, adding more makes no sense.

Dropping to 16 MB is even less likely: it would mean going all-in on V-cache for servers. No way they'd take that risk during the finalization of the design, which must have happened at least a year ago.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Some good news on Linux at least: the whole kernel can be built against feature levels instead of the ISA features of processors from more than a decade ago.
This should help with many benchmarks/applications where x86 CPUs are being held back by SW targeting legacy processors.
Windows needs to do the same, but the openness of the PC ecosystem means SW targets the least common denominator, at the cost of holding back performance on newer CPUs.
Zen 4 will be the first AMD processor to be x86-64-v4.
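For context, the feature levels mentioned here group ISA extensions roughly as follows (summarised from the x86-64 psABI; a sketch, not the authoritative list):

```python
# x86-64 microarchitecture levels (abridged). Building a distro or kernel with
# -march=x86-64-v3 lets the compiler assume every feature up to and including v3.
X86_64_LEVELS = {
    "x86-64-v2": ["CMPXCHG16B", "POPCNT", "SSE3", "SSSE3", "SSE4.1", "SSE4.2"],
    "x86-64-v3": ["AVX", "AVX2", "BMI1", "BMI2", "F16C", "FMA", "LZCNT", "MOVBE"],
    "x86-64-v4": ["AVX512F", "AVX512BW", "AVX512CD", "AVX512DQ", "AVX512VL"],
}
# Each level implies all lower ones, so a v4-capable CPU (e.g. Zen 4, if the
# leaks hold) runs binaries built for v2 or v3 as well.
print("AVX2" in X86_64_LEVELS["x86-64-v3"])  # -> True
```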
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Most of the kernel cannot even use SSE, and that is not because of feature levels but because of the design (tasks that don't clobber XMM registers don't need to save/restore them on context switches). I don't think these feature levels will mean much in the kernel, but I would like to be proven wrong (some features are unrelated to the vector registers and might make the kernel more performant).
We have our SW ISV providing us a distro with AVX enabled in netfilter, and the packet latency from the load balancer to the worker nodes is a night-and-day difference.
The more complex the ebtables/iptables rules are, the better the performance gains seem to be. Packet mangling and traversing the rules are much faster.
This is one case that I know first-hand; I wonder how many more there are.
 
Last edited:
  • Like
Reactions: RnR_au

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
We have our SW ISV providing us a distro with AVX enabled in netfilter, and the packet latency from the load balancer to the worker nodes is a night-and-day difference.
The more complex the ebtables/iptables rules are, the better the performance gains seem to be. Packet mangling and traversing the rules are much faster.

It might not even have anything to do with AVX or vectorization. What could happen is that those flags enable the use of instructions like Haswell's BMI, letting the compiler generate efficient logic and bit-testing code. These can make a huge difference as well.
But it could be vectorization too, as AVX can be very efficient in pattern and substring search operations. The real question is: can the compiler generate such good code, or does the distro simply contain hand-optimized assembly routines for the hot paths? Are you getting binaries, or source to build?


Yeah, AVX-512 does not come for free: take the AVX2 FP PRF area, multiply by 4, then multiply by a not-small factor to account for signals having to cross larger physical distances, and also multiply by the FP PRF size growth, if any, for Zen 4.
I think Golden Cove core+L2 area versus Zen 4 core+L2 area is gonna be as valid a process-and-architecture-prowess comparison as we are ever gonna get.
 

dr1337

Senior member
May 25, 2020
417
691
136
That sounds a bit too large to me, honestly. That would be an absolutely huge chunk of the die space gained from the die shrink, lost just to AVX-512 registers. Not saying it's not feasible, just that I would expect it to be smaller than that.
I can kinda believe it if they've had to spread out the logic more to deal with thermals. Dropping clocks is by far Intel's biggest issue with AVX-512, so maybe AMD is trading density for FP clockspeed?
 

AMDK11

Senior member
Jul 15, 2019
426
338
136
I also think everyone is wrong: Zen 1, 2 and 3 are fundamentally the same base core. Zen 2 improved branch/uop/256-bit ops, Zen 3 improved execution and load/store, but both did so without fundamentally changing the core (same number of reg-file ports, same number of stages, same job per stage, etc.). Jim Keller talked about the Zen core having "big bones", and I think that's exactly what he means. My bet is Zen 4 is the next new set of big bones for the following uarches to grow into.

If the rumoured perf uplifts are true, it kind of has to be. I would ignore "families"; that's about how cores behave/handle certain operations, and how they execute or the performance they execute with is not directly linked.

You are looking too much at the diagrams that describe the core microarchitecture. Such a diagram is a far-reaching simplification, meant to illustrate the main features of the core to at least some extent. Many parts of the core cannot be represented this way, or only with great difficulty: the new algorithms and the new logic controlling the core's resources cannot be presented and described graphically.

For example, Zen 3 is a project almost completely new from scratch, and you say it is basically the same as Zen and Zen 2 because the decoders and execution ports are roughly the same, but it is not. That's like saying two different cars are the same because they both have 4 wheels and a five-cylinder engine.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
But it could be vectorization too, as AVX can be very efficient in pattern and substring search operations. The real question is: can the compiler generate such good code, or does the distro simply contain hand-optimized assembly routines for the hot paths? Are you getting binaries, or source to build?
We are getting sources (of everything they package in the distro) and images. AVX is indeed used, but I am not sure if it is the upstream version or a custom implementation.
BTW, it is a major Linux ISV and I can't put a name here. But they supply everything: compilers, libraries, distros, CVE patching, HW enablement, etc. They provide images for our compute clusters, CI/CD infrastructure, edge devices and so on.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136

Yeah, that is great use of classic AVX vectorization, hand-writing the algorithm in pseudo-assembly; it can be extended to AVX-512 with some effort too. It helps in targeted places like crypto, packet processing and hashing, so if your server spends most of its time encrypting or processing packets it can be a huge win.
But for the general public, the "system" time of a typical application is minimal, and heavy cases like drivers already make great use of AVX, so the kernel being compiled for feature levels won't speed things up as much as people expect.

Though with that said, there is a large opportunity for distros like Clear Linux that provide optimized builds from the kernel to user space. Now that is where the real advancements are to be had.
 

Doug S

Platinum Member
Feb 8, 2020
2,785
4,750
136
Some good news on Linux at least: the whole kernel can be built against feature levels instead of the ISA features of processors from more than a decade ago.
This should help with many benchmarks/applications where x86 CPUs are being held back by SW targeting legacy processors.
Windows needs to do the same, but the openness of the PC ecosystem means SW targets the least common denominator, at the cost of holding back performance on newer CPUs.
Zen 4 will be the first AMD processor to be x86-64-v4.


The kernel already checks for specific features on startup and uses code paths that take advantage of them where it helps. I don't see how compiler directives improve on this, other than saving a handful of microseconds at startup.

Few applications, and even fewer benchmarks, involve the kernel all that much. If you want to speed things up, you go for user-level stuff like glibc, not the kernel. That's where most of the action is.
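The startup-time dispatch described above can be sketched like this (illustrative only; the function names are hypothetical, and real kernels/glibc do this with cpuid plus ifunc/alternatives rather than Python):

```python
# Runtime dispatch: detect CPU features once at startup, bind the best
# implementation, and pay nothing per call beyond an indirect call.
def add_generic(a, b):
    return [x + y for x, y in zip(a, b)]

def add_avx2(a, b):            # stand-in for a SIMD-optimized path
    return [x + y for x, y in zip(a, b)]

def select_impl(cpu_features):
    # One check at startup selects the code path for the process lifetime.
    return add_avx2 if "avx2" in cpu_features else add_generic

impl = select_impl({"sse2", "avx", "avx2"})   # hypothetical detected set
print(impl([1, 2], [3, 4]))  # -> [4, 6]
```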
 

Mopetar

Diamond Member
Jan 31, 2011
8,114
6,770
136
They won't get that much density gain, especially not on a cache-heavy design like Zen.

It really depends on how much they intend to use die stacking. Cache can scale upwards instead of outwards to some degree. The real trick will be finding other parts of the chip they can build on top of without creating heat issues.

He says a 35-40% reduction including the I/O, the part that shrinks the least and is very big. As already said, this is not included in the chiplet, so my point holds even better.

I posted an article with an analysis several pages back where someone looked at die shots and found Apple got no scaling on any of their I/O moving from 7nm to 5nm.

I'm assuming AMD gets decent scaling, based on the rumored figures and the additions they're making.
 

moinmoin

Diamond Member
Jun 1, 2017
5,064
8,032
136
Is anybody keeping track of the teams at AMD? We know that Zen 1 and Zen 2 were handled by the same team, whereas Zen 3 was done by a different one. Is the latter team known to be working on Zen 4 as well?
 

Saylick

Diamond Member
Sep 10, 2012
3,532
7,859
136
Is anybody keeping track of the teams at AMD? We know that Zen 1 and Zen 2 were handled by the same team, whereas Zen 3 was done by a different one. Is the latter team known to be working on Zen 4 as well?
Not sure if there were any rumors or confirmation of which team is working on Zen 4, but given that Zen 3 was a ground-up rebuild and Zen 4 is also likely to be a heavy lift, I would imagine that Zen 3 and Zen 4 were designed by two different teams.