Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila

Senior member
Oct 22, 2004
817
1,450
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) and new memory support (likely DDR5).

[Attached image: Untitled2.png, slide from the leaked presentation]


What else do you think we will see with Zen 4? PCI Express 5.0 support? Increased core count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
I really wonder if COVID pushed out the Zen 4 product release dates. That, or we are being gaslit on the rumored late release dates. AMD only puts out vague roadmaps publicly, so we'll have to wait until a really solid leak comes from an OEM :sad_panda:
Under-promise and over-deliver. Much better than Intel ramming their dumb roadmaps down our throats for years, even before their 10nm woes, with their BS targets and claims.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Under-promise and over-deliver. Much better than Intel ramming their dumb roadmaps down our throats for years, even before their 10nm woes, with their BS targets and claims.

If anything, we should be wary of corporations shouting about their products from the rooftops.

Cyberpunk 2077's release is a prime example. The most overhyped launch ever. Celebrity endorsements, teasers of "gameplay" months and months before release, YouTube channels giving glowing endorsements without even playing the actual game (probably because CD Projekt RED gave them canned test scenarios).

Shut up, get to work, and let the product speak for itself.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
If anything, we should be wary of corporations shouting about their products from the rooftops.

Cyberpunk 2077's release is a prime example. The most overhyped launch ever. Celebrity endorsements, teasers of "gameplay" months and months before release, YouTube channels giving glowing endorsements without even playing the actual game (probably because CD Projekt RED gave them canned test scenarios).

Shut up, get to work, and let the product speak for itself.
I had absolutely no idea what that was until I looked it up. I'll take your word for it though!
 

scineram

Senior member
Nov 1, 2020
376
295
136
What are the expectations for this new core?
I would like to see doubled L1 caches and an at least doubled L2 cache, maybe more. I don't know how feasible it is to have more than 4 decoders, but if possible, increasing that to maybe 6 would be very cool.
 

Gideon

Platinum Member
Nov 27, 2007
2,003
4,959
136
What are the expectations for this new core?
I would like to see doubled L1 caches and an at least doubled L2 cache, maybe more. I don't know how feasible it is to have more than 4 decoders, but if possible, increasing that to maybe 6 would be very cool.
I really hope they go the Tremont way and add a second decode cluster working on a different branch (of which there are almost always enough, or the code fits in the uop cache anyway). The second cluster can be simplified if it wastes too much die area. It should also improve SMT performance in some cases.

That should allow them to add more execution resources as well.
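A toy Python sketch of what I mean (my own illustrative model of clustered decode, not Intel's actual Tremont implementation): cut the fetch stream at taken branches and hand alternating chunks to two decode clusters, then re-stitch them in program order.

from itertools import cycle

def split_at_branches(instructions):
    # Yield runs of instructions, each run ending at a taken branch.
    chunk = []
    for inst in instructions:
        chunk.append(inst)
        if inst.startswith("br"):
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def clustered_decode(instructions, n_clusters=2):
    # Hand out branch-delimited chunks to the clusters in alternation;
    # each cluster decodes its chunk independently and in parallel.
    return [(cid, chunk) for cid, chunk in
            zip(cycle(range(n_clusters)), split_at_branches(instructions))]

stream = ["add", "mul", "br.loop", "load", "br.exit", "sub", "store"]
for cid, chunk in clustered_decode(stream):
    print(f"cluster {cid}: {chunk}")

The hard part in real hardware is keeping both clusters fed and re-ordering their output without adding latency, which is presumably where the die-area trade-off comes in.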
 

DisEnchantment

Golden Member
Mar 3, 2017
1,774
6,757
136
I really hope they go the Tremont way and add a second decode cluster working on a different branch (of which there are almost always enough, or the code fits in the uop cache anyway). The second cluster can be simplified if it wastes too much die area. It should also improve SMT performance in some cases.

That should allow them to add more execution resources as well.
Decoding an instruction means power expended. This is why they are opting for virtualizing and/or increasing the uop cache. From most of the patents you can already see they don't want to add decode units.
The decode unit is one of the weaknesses of x86 in terms of power use.
There are a myriad of security issues now with uop caches; not sure what's next. From what I read somewhere, the next SEV evolution after SNP is encrypted TLBs and such; not sure if this also extends to the uop cache.
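A rough toy model of what "virtualizing" could mean (just my reading of the patent language, not AMD's actual design): uops evicted from the dedicated uop cache spill into the regular cache hierarchy instead of being discarded, so a later fetch can reload them without re-decoding.

from collections import OrderedDict

class VirtualizedUopCache:
    def __init__(self, capacity=4):
        self.hot = OrderedDict()   # dedicated uop cache, LRU order
        self.spill = {}            # uops parked in the normal L2
        self.capacity = capacity
        self.decodes = 0

    def fetch(self, pc):
        if pc in self.hot:                     # uop-cache hit: no decode
            self.hot.move_to_end(pc)
            return self.hot[pc]
        if pc in self.spill:                   # reload from L2: no decode
            uops = self.spill.pop(pc)
        else:                                  # expensive x86 decode
            self.decodes += 1
            uops = f"decoded({pc})"
        self.hot[pc] = uops
        if len(self.hot) > self.capacity:
            old_pc, old_uops = self.hot.popitem(last=False)
            self.spill[old_pc] = old_uops      # spill instead of discard
        return uops

c = VirtualizedUopCache()
for pc in [0, 1, 2, 3, 4, 0, 1]:   # a loop that re-touches 0 and 1
    c.fetch(pc)
print(c.decodes)   # 5 decodes instead of 7: spilled uops were reused

The win is that the expensive x86 decode happens once; the L2 copy is just storage, not another decoder.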
 

Gideon

Platinum Member
Nov 27, 2007
2,003
4,959
136
Decoding an instruction means power expended. This is why they are opting for virtualizing and/or increasing the uop cache. From most of the patents you can already see they don't want to add decode units.
The decode unit is one of the weaknesses of x86 in terms of power use.
There are a myriad of security issues now with uop caches; not sure what's next. From what I read somewhere, the next SEV evolution after SNP is encrypted TLBs and such; not sure if this also extends to the uop cache.
Yet the 4-wide decoder is a serious bottleneck (also according to Agner) and there is a limit to how many uop cache entries you can add. If they ever want to go wider, drop back to ~3.5 GHz and work up from there (which IMO they should, considering the future of process nodes), they have to figure something out or they can't widen their architecture.

I would really like to see another "Zen" moment from AMD, widening the cores and lowering clock speeds. Now obviously they can't replicate the increase Zen delivered, as their previous core is kickass, not a failure, but something between 30-40% IPC is totally doable IMO (the M1 is a reference).

Say, increasing IPC 35% and reducing clocks 20% might regress single-threaded perf by a few % compared to "doing another Zen 3" (e.g. adding ~20% IPC at 5 GHz), but it would be way better across the stack. When designed right, real-life MT performance would go up about ~30% at roughly the same power consumption (due to the nature of the power/frequency curve). That means way higher real-life performance in laptops, servers and HEDT (all of which are the main money-makers).

All of this provided they execute well (which is hard!). If they do it like Samsung, they'll just waste power and die area.

Yeah, 250W gaming PCs might not benefit much, but the reality is that once they have this new 3.5-4 GHz architecture, it will eventually go up to 5 GHz again (as Core and Zen did).

TL;DR:

Due to the nature of the power/frequency curve and the reality of upcoming process nodes, it's better to have a +40% IPC but -20% clocks design than a "vanilla" +20% IPC design, provided perf/watt remains similar and you can beat the "square root law" for area as with Zen 3.
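A back-of-the-envelope sketch of that trade-off in Python (the scaling factors are my illustrative assumptions, not measured data: dynamic power ~ f * V^2, and the 20% clock drop buys roughly a 5% voltage reduction):

def perf(ipc_gain, clock_scale):
    # Single-thread performance relative to the baseline core.
    return (1 + ipc_gain) * clock_scale

def power(clock_scale, voltage_scale):
    # Dynamic power relative to the baseline core (P ~ f * V^2).
    return clock_scale * voltage_scale ** 2

# Design A: "vanilla" +20% IPC at unchanged clocks and voltage.
a_perf, a_power = perf(0.20, 1.00), power(1.00, 1.00)
# Design B: +40% IPC, -20% clocks, assumed -5% voltage.
b_perf, b_power = perf(0.40, 0.80), power(0.80, 0.95)

print(f"A: {a_perf:.2f}x perf at {a_power:.2f}x power")
print(f"B: {b_perf:.2f}x perf at {b_power:.2f}x power")
print(f"B vs A in perf/W: {(b_perf / b_power) / (a_perf / a_power):.2f}x")

Design B gives up ~7% single-thread versus A but comes out ~30% ahead in perf/W, which is the headroom I mean for MT performance at the same power.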
 
Last edited:

leoneazzurro

Golden Member
Jul 26, 2016
1,102
1,830
136
Depends on what's going on down there. If they have some interesting and exotic microchannel cooling going on to fight hotspots and make the IHS relevant again, then I'm stoked for a package like that.

The main reason may be the different heights of the dies beneath it. But a thick IHS should also improve the hotspot situation because of better thermal transient impedance.
 

Kepler_L2

Senior member
Sep 6, 2020
797
3,212
136
Yet the 4-wide decoder is a serious bottleneck (also according to Agner) and there is a limit to how many uop cache entries you can add. If they ever want to go wider, drop back to ~3.5 GHz and work up from there (which IMO they should, considering the future of process nodes), they have to figure something out or they can't widen their architecture.

I would really like to see another "Zen" moment from AMD, widening the cores and lowering clock speeds. Now obviously they can't replicate the increase Zen delivered, as their previous core is kickass, not a failure, but something between 30-40% IPC is totally doable IMO (the M1 is a reference).

Say, increasing IPC 35% and reducing clocks 20% might regress single-threaded perf by a few % compared to "doing another Zen 3" (e.g. adding ~20% IPC at 5 GHz), but it would be way better across the stack. When designed right, real-life MT performance would go up about ~30% at roughly the same power consumption (due to the nature of the power/frequency curve). That means way higher real-life performance in laptops, servers and HEDT (all of which are the main money-makers).

All of this provided they execute well (which is hard!). If they do it like Samsung, they'll just waste power and die area.

Yeah, 250W gaming PCs might not benefit much, but the reality is that once they have this new 3.5-4 GHz architecture, it will eventually go up to 5 GHz again (as Core and Zen did).

TL;DR:

Due to the nature of the power/frequency curve and the reality of upcoming process nodes, it's better to have a +40% IPC but -20% clocks design than a "vanilla" +20% IPC design, provided perf/watt remains similar and you can beat the "square root law" for area as with Zen 3.
I expect this for Zen 5 on 3nm. They are unlikely to keep increasing cache sizes since SRAM scaling is terrible and heat density is a big issue on such a dense FinFET node.

Going wide with lower clocks is the way.
 
  • Like
Reactions: Tlh97 and Gideon

Gideon

Platinum Member
Nov 27, 2007
2,003
4,959
136
I expect this for Zen 5 on 3nm. They are unlikely to keep increasing cache sizes since SRAM scaling is terrible and heat density is a big issue on such a dense FinFET node.

Going wide with lower clocks is the way.

Looks to be so, and it correlates with this leak about Zen 5 + Zen 4D. IMO the slides look very amateurish and fake, but the info seems to be correct according to other leakers.

If I had to guess, I would imagine the cut-down Zen 4D will focus on density and, among other things, will cut the FP blocks in half (still executing the same instructions, but some in 2 cycles, sorta like the PS5's Zen 2), thus becoming a true "small core" (e.g. Arm's A78 vs. X1).
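Roughly like this toy illustration of "double-pumping" a 256-bit op through 128-bit hardware over two cycles (purely illustrative Python, not how a real scheduler is built):

def vadd256(a, b, lanes=4):
    # Execute one 256-bit vector add as two 128-bit halves, one per cycle.
    result = [0] * (2 * lanes)
    for cycle, half in enumerate((slice(0, lanes), slice(lanes, 2 * lanes))):
        result[half] = [x + y for x, y in zip(a[half], b[half])]
        print(f"cycle {cycle}: lanes {half.start}..{half.stop - 1} done")
    return result

print(vadd256([1] * 8, [2] * 8))

Same ISA, same results, half the FP datapath area, at the cost of throughput on wide vector code.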
 

DisEnchantment

Golden Member
Mar 3, 2017
1,774
6,757
136
Yet the 4-wide decoder is a serious bottleneck (also according to Agner) and there is a limit to how many uop cache entries you can add. If they ever want to go wider, drop back to ~3.5 GHz and work up from there (which IMO they should, considering the future of process nodes), they have to figure something out or they can't widen their architecture.

I would really like to see another "Zen" moment from AMD, widening the cores and lowering clock speeds. Now obviously they can't replicate the increase Zen delivered, as their previous core is kickass, not a failure, but something between 30-40% IPC is totally doable IMO (the M1 is a reference).

Say, increasing IPC 35% and reducing clocks 20% might regress single-threaded perf by a few % compared to "doing another Zen 3" (e.g. adding ~20% IPC at 5 GHz), but it would be way better across the stack. When designed right, real-life MT performance would go up about ~30% at roughly the same power consumption (due to the nature of the power/frequency curve). That means way higher real-life performance in laptops, servers and HEDT (all of which are the main money-makers).

All of this provided they execute well (which is hard!). If they do it like Samsung, they'll just waste power and die area.

Yeah, 250W gaming PCs might not benefit much, but the reality is that once they have this new 3.5-4 GHz architecture, it will eventually go up to 5 GHz again (as Core and Zen did).
Probably one of the reasons why the patent aims to keep the uop cache size in check by virtualizing it, with the goal of doing very little decoding in the first place.
I think we can expect a clock-gated decode unit (most likely not even a complex one) to be added, but the actual goal is to have the core execute out of the virtualized uop cache.
This is one advantage ARM has over x86: decoding does not take much power, so they can just keep adding decoders without a severe power penalty.

Going wide with lower clocks is the way.
This depends on whether the bean counters will allow the die size to balloon too much. Cost per wafer is the same, but the number of dies per wafer is different. For a non-fully-integrated company, that's a very tough choice to make.
Also, every Zen generation we see conservative but slightly wider/improved back ends, while the front end remains largely the same. Not saying the front end will stay the same, but to be realistic, for an x86 core expect a very conservative increase and a very measured approach here.

Regarding MT efficiency, some patents and research indicate that SMT results in the sharing of a number of important queues, and AMD aims to put in place a mechanism that allows a thread in a core to compete for those resources, so that it can access more of them and effectively improve performance.
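A toy model of that competitive sharing (my reading of the idea, not the mechanism from the actual patents): instead of a static 50/50 queue split between two threads, each thread can claim entries up to a dynamic limit that preserves a small reserve for its sibling.

class SharedQueue:
    def __init__(self, size=64, reserve=8):
        self.size = size
        self.reserve = reserve        # entries guaranteed to each thread
        self.used = {0: 0, 1: 0}

    def try_allocate(self, tid):
        other = 1 - tid
        # A thread may grow until only the sibling's reserve (or its
        # current footprint, whichever is larger) remains untouchable.
        limit = self.size - max(self.reserve, self.used[other])
        if self.used[tid] < limit:
            self.used[tid] += 1
            return True
        return False

q = SharedQueue()
while q.try_allocate(0):   # thread 0 is demanding, thread 1 is idle
    pass
print(q.used)              # {0: 56, 1: 0}: far beyond a static 32/32 split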

 

moinmoin

Diamond Member
Jun 1, 2017
5,206
8,367
136
I like it, but the render of the IHS, in my opinion, is too thick.
I agree, the thickness is hilarious, not something I would take seriously. Does anybody really expect a slab of metal 5 mm or thicker as an IHS? I think what the render was supposed to showcase was all the notches, not the IHS's thickness.
 

beginner99

Diamond Member
Jun 2, 2009
5,312
1,750
136
It's what you do with the real estate that matters. Vapor chamber? Tiny targeted TECs? Color me intrigued.

Or maybe just to make people talk about it? The notches at least. If the real thing ends up being much thinner than it appears in this render, they can simply claim the render is deceptive.
 

eek2121

Diamond Member
Aug 2, 2005
3,325
4,884
136
Looks to be so, and it correlates with this leak about Zen 5 + Zen 4D. IMO the slides look very amateurish and fake, but the info seems to be correct according to other leakers.

If I had to guess, I would imagine the cut-down Zen 4D will focus on density and, among other things, will cut the FP blocks in half (still executing the same instructions, but some in 2 cycles, sorta like the PS5's Zen 2), thus becoming a true "small core" (e.g. Arm's A78 vs. X1).

By his own admission he is not a 3D artist.

Probably one of the reasons why the patent aims to keep the uop cache size in check by virtualizing it, with the goal of doing very little decoding in the first place.
I think we can expect a clock-gated decode unit (most likely not even a complex one) to be added, but the actual goal is to have the core execute out of the virtualized uop cache.
This is one advantage ARM has over x86: decoding does not take much power, so they can just keep adding decoders without a severe power penalty.


This depends on whether the bean counters will allow the die size to balloon too much. Cost per wafer is the same, but the number of dies per wafer is different. For a non-fully-integrated company, that's a very tough choice to make.
Also, every Zen generation we see conservative but slightly wider/improved back ends, while the front end remains largely the same. Not saying the front end will stay the same, but to be realistic, for an x86 core expect a very conservative increase and a very measured approach here.

Regarding MT efficiency, some patents and research indicate that SMT results in the sharing of a number of important queues, and AMD aims to put in place a mechanism that allows a thread in a core to compete for those resources, so that it can access more of them and effectively improve performance.


AMD has said in the past they want to reach margins of something like 65%. For that reason alone, they will optimize designs around smaller die sizes.
 

GodisanAtheist

Diamond Member
Nov 16, 2006
7,927
9,047
136
Mmmm, no. A thick IHS to reduce hotspots may be a thing, but what's the point of those notches?

- RE: notches, probably a second-tier failsafe to make sure the CPU is being installed in the right orientation.

If the small directional cut-out notches on the package get ignored, the asymmetrical notches on the metal latching mechanisms will prevent the retention bracket from closing, potentially saving the CPU from being damaged by being forced into the socket the wrong way.
 

Triskain

Member
Sep 7, 2009
63
33
91
The IHS is not thick; it's the same as Broadwell-E or Skylake-X (images for reference: IHS, PCB), meaning there is an additional substrate layer on top of the layer that actually contacts the socket. This is done to allow lowering the bump pitch / increasing the bump density of the die (for higher-current power delivery, more interface pins, etc.) without blowing up the layer count/complexity/cost of the package substrate. There was an Intel paper about this a couple of years ago; I'll see if I can dig it up.

Edit: Found it. Server CPU Package Design Using PoINT Architecture
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Due to the nature of the power/frequency curve and the reality of upcoming process nodes, it's better to have a +40% IPC but -20% clocks design than a "vanilla" +20% IPC design, provided perf/watt remains similar and you can beat the "square root law" for area as with Zen 3.

The square root law is about the transistor count increase, and thus the power increase as well.

The cube law for clocks only applies if the voltage increases/decreases proportionally. Voltage scaling (where voltages decreased substantially with every new node) died back in the Pentium 4 days, sometime around 2002.

So rather than 20% lower clocks resulting in half the power (0.8^3 ≈ 0.51), you'll get something like 0.8 x 0.95 x 0.95 ≈ 0.72, or about a 30% power reduction, and that's if they even get a 5% voltage reduction at the peak operating frequency.

So the 35% better perf/clock design might end up using 30-40% more power.
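Rough numbers, reading the "square root law" as power ~ transistors ~ (perf/clock)^2 and reusing the same assumed 5% voltage reduction (illustrative assumptions, not measurements):

ipc_gain = 1.35                   # +35% perf/clock
area_power = ipc_gain ** 2        # ~1.82x transistors -> roughly 1.82x power
clock_power = 0.80 * 0.95 ** 2    # 20% lower clocks, assumed 5% lower voltage
print(f"net power: {area_power * clock_power:.2f}x")   # ~1.32x, i.e. ~30% more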

There is a side benefit to area/power use by aiming for lower clocks, but it's still a very difficult uphill battle.

The patent for the virtual uop cache was filed like a year ago, so I don't expect to see that in Zen 4. Maybe in later generations? It sounds very cool though: being able to use the regular caches to store decoded instructions.
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,883
9,022
136
So AMD announced 3D stacking of SRAM at their Computex keynote. I'm guessing with a bunch more LLC, they can feasibly implement that virtualized micro-op cache. The best part is that it's essentially HBM-style stacked SRAM, so not only are the hit rate and capacity improved, the bandwidth is much higher as well. Separate SRAM chiplets can be manufactured on separate wafers, which should be easy to produce as the dies are small and SRAM yields are quite good. This tech is worth double-digit IPC by itself.
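To get a feel for why extra LLC alone can be worth double-digit IPC, here is a crude average-memory-access-time estimate; every hit rate and latency below is a made-up round number, not a measured Zen figure.

def amat(l3_hit_rate, l3_lat=50, dram_lat=300):
    # amat = L3 latency + L3 miss rate * DRAM penalty, in cycles.
    return l3_lat + (1 - l3_hit_rate) * dram_lat

base = amat(0.70)   # hypothetical 32 MB L3
big = amat(0.85)    # hypothetical 96 MB stacked L3: fewer DRAM trips
print(f"{base:.0f} vs {big:.0f} cycles: memory side ~{base / big - 1:.0%} faster")

The realized IPC gain is only some fraction of that memory-side speedup, depending on how memory-bound the workload is, but double digits looks plausible for cache-sensitive code.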