Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Apple used the same chip for storage and RAM. I was really speaking of RAM (not storage) built into the main board to give insane base data rates to the CPU. Dedicate the first memory controller to an array of aggressively timed memory soldered onboard, and have another memory controller for expansion slots that are not soldered to the main board. I am not suggesting storage be soldered onboard; that is one place where I think Apple got it wrong.


Stacked memory suggests serialized access. I'd want access to be as parallel as possible. You'll probably see either one 'good enough' performance RAM, or separate high-performance and low-cost options on the same board. Spend as much of the premium as possible on performance, but keep lots of cheap storage that isn't as fast yet offers a good long-term solution.
I am not sure what you mean by serial vs. parallel. There are various types of stacked interfaces that allow different levels of parallelism. HBM is a 1024-bit-wide interface, but it is still DRAM, so access to a page that isn't latched into an SRAM buffer is high latency. There has been some stacked memory for use on memory modules that uses a rather narrow, but still parallel, interface.

If you have massive amounts of memory in the CPU package, then a high-pin-count parallel bus may not be the best solution. Adding such memory in the package adds components that can fail, which would require replacing the entire device, and I don't like that. If you had 16 or 32 GB of stacked DRAM in the package in some manner (like HBM), then adding DRAM on an external parallel bus would be a waste. Adding something like Optane or other flash as swap space on a low-pin-count serial link would be more efficient. We almost had serially attached memory before with Rambus.
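To put the wide-parallel vs. narrow-bus comparison in numbers, here is a rough peak-bandwidth sketch in Python. The bus widths and transfer rates are generic illustrative figures, not specs of any product discussed here.

```python
# Rough peak-bandwidth comparison of a wide parallel (HBM-style) bus
# versus a narrow (DDR-style) bus. Figures are generic examples.
def bus_bw_gbs(width_bits: int, transfer_rate_gts: float) -> float:
    """Peak bandwidth in GB/s: (bus width in bytes) * (GT/s)."""
    return width_bits / 8 * transfer_rate_gts

print(f"1024-bit HBM2-style stack @ 2.0 GT/s: {bus_bw_gbs(1024, 2.0):.0f} GB/s")
print(f"  64-bit DDR4-style DIMM  @ 3.2 GT/s: {bus_bw_gbs(64, 3.2):.1f} GB/s")
```

The wide-but-slower interface wins easily on raw throughput per stack; the trade-off the post describes is really about pin count, latency, and replaceability.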
 
  • Like
Reactions: Vattila

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
1) Even without any 3D stacking, it is already hard to make the L1 larger while keeping latency as low. It is basically on the critical path for a ton of things; 32 KB or 48 KB L1D is the maximum currently possible for x86 with 4 KB pages and 4-5 cycles of latency. Latency and power usage are the problem here, not area.
2) L2 is a sweet spot of 10-15 cycles of latency and size. You can grow it somewhat, but at some point latency and power start to rise, and the real question becomes: does every core really need a huge private L2, or could those transistors be better used as L3 cache?
3) So we arrive at the L3 cache, which has 40-50 cycles of latency and is shared. It can't serve as L2: it is way too slow, and "shared" means the cost of address conflicts etc. would kill performance real fast.

So all 3 layers are needed, and only L2 and L3 can benefit from 3D stacking:

1) For L2 the benefits are moot: 3D stacking adds some latency, and fast access in a large cache adds power usage. Before long you arrive at situations like Intel had with Skylake - the base product had a 256 KB 4-way L2, which was hurting performance in a big way. It was done so the server product could have a 1 MB 16-way L2. The irony is that they had to live with this ill-considered decision for way longer than they planned :)
I doubt AMD would repeat this mistake and gimp products without 3D stacks to some tiny L2 with bad structural decisions.

2) L3 is where the holy grail of 3D stacking is: reasonable latency requirements, and an architecture - independent address-hashed slices of L3 - ready for scaling big time. If you look at the Zen 3 diagram, L1<->L2 is 32+32 bytes and L2<->L3 is also 32+32 bytes. There is no bottleneck, so a core has access to full bandwidth even when it is coming from L3. In other words, it is ready to scale with L3 size.
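The 32+32 bytes-per-cycle datapaths translate directly into per-core bandwidth. A quick sketch (the clock speed is an assumed example value, not a quoted spec):

```python
# Per-core cache bandwidth implied by a 32 B/cycle read path plus a
# 32 B/cycle write path. The clock speed is an assumed example value.
def path_bw_gbs(bytes_per_cycle: int, clock_ghz: float) -> float:
    """One direction of a cache datapath, in GB/s."""
    return bytes_per_cycle * clock_ghz  # B/cycle * 1e9 cycles/s = GB/s

clock_ghz = 5.0  # assumed
read_gbs = path_bw_gbs(32, clock_ghz)
write_gbs = path_bw_gbs(32, clock_ghz)
print(f"per core @ {clock_ghz} GHz: {read_gbs:.0f} GB/s read + {write_gbs:.0f} GB/s write")
```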
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Bandwidth is 2-3x higher for the L1. And yes, I would be worried about wire length and latency.



Guess I'll just say . . . I'll believe it when I see it? See other responses wrt integration of L1 into core design. I'm really, really skeptical about the extra travel distances in the z plane being so small as to be a rounding error.
I don't think it would be reasonable to move L1 and L2 to a stacked die, for other reasons detailed in another post. In fact, I expect all of the chips that use this stacked cache to still have an on-die L3. The TSMC die-stacking technology without micro-solder balls is exceptionally thin, though; this post indicates a die thickness of less than 50 microns.


That may still be significant for an L1 cache: 50 microns is 50,000 nm, while a human hair is only about 75 microns thick. The TSVs are also spaced a certain distance apart, which may come into play.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
1) Even without any 3D stacking, it is already hard to make the L1 larger while keeping latency as low. It is basically on the critical path for a ton of things; 32 KB or 48 KB L1D is the maximum currently possible for x86 with 4 KB pages and 4-5 cycles of latency. Latency and power usage are the problem here, not area.
2) L2 is a sweet spot of 10-15 cycles of latency and size. You can grow it somewhat, but at some point latency and power start to rise, and the real question becomes: does every core really need a huge private L2, or could those transistors be better used as L3 cache?
3) So we arrive at the L3 cache, which has 40-50 cycles of latency and is shared. It can't serve as L2: it is way too slow, and "shared" means the cost of address conflicts etc. would kill performance real fast.

So all 3 layers are needed, and only L2 and L3 can benefit from 3D stacking:

1) For L2 the benefits are moot: 3D stacking adds some latency, and fast access in a large cache adds power usage. Before long you arrive at situations like Intel had with Skylake - the base product had a 256 KB 4-way L2, which was hurting performance in a big way. It was done so the server product could have a 1 MB 16-way L2. The irony is that they had to live with this ill-considered decision for way longer than they planned :)
I doubt AMD would repeat this mistake and gimp products without 3D stacks to some tiny L2 with bad structural decisions.

2) L3 is where the holy grail of 3D stacking is: reasonable latency requirements, and an architecture - independent address-hashed slices of L3 - ready for scaling big time. If you look at the Zen 3 diagram, L1<->L2 is 32+32 bytes and L2<->L3 is also 32+32 bytes. There is no bottleneck, so a core has access to full bandwidth even when it is coming from L3. In other words, it is ready to scale with L3 size.
I expect that all of the chips that use this will still have L3 on die. Moving L1 and L2 to a separate die is probably a non-starter for many reasons. If they don't have L3 on die, then they cannot easily make a non-stacked version. With it on die, they can sell a device without any stacked chip, although it might need a spacer to make up for the z-height difference, unless they just don't polish the base die down in the first place. That seems to be what current Zen 3 dies actually are, although there may be some changes in the latest stepping to enable this. If something goes wrong with the stacking, they could just disable it and sell it as a regular Zen 3.

For Zen 4, I think they will reorganize the caches a bit, since it will probably have significantly more floating-point resources. It will need a lot more bandwidth for AVX-512-level parallelism. The data paths will almost certainly be wider. The L1 might be larger. The L2 will likely be larger. The L3 will probably be the same size, 32 MB. If the cache chip is still 7 nm, then it may cover more of the 5 nm Zen 4 die; perhaps it covers the L2 and L3 caches. The stacking could allow them to scale the L3 from 32 MB on the low end to 288 MB on the high end with 4 stacked 64 MB dies.
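The 32 MB-to-288 MB range is just the base die plus one to four stacked 64 MB dies; a trivial check (all figures are the speculative numbers from the post above):

```python
# Speculative stacked-L3 capacities: a 32 MB base L3 on the CCD plus
# up to four stacked 64 MB cache dies.
BASE_L3_MB = 32
STACKED_DIE_MB = 64

def total_l3_mb(stacked_dies: int) -> int:
    """Total L3 for a chiplet with the given number of stacked dies."""
    return BASE_L3_MB + stacked_dies * STACKED_DIE_MB

for n in range(5):
    print(f"{n} stacked dies -> {total_l3_mb(n)} MB total L3")
```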
 

Gideon

Golden Member
Nov 27, 2007
1,646
3,712
136
In the current Zen design, L2 is private to the core and inclusive of L1; L3 is a victim cache. You cannot really skip L2 - the implications of that are really huge.
L1/L2 are involved in MMU operations and paging as well; skipping L2 basically means the tiny L1 would have to manage the working set of fairly big pages. Not sure.
Yeah, considering all of the L2 is right next to the L3 on the chip, they could have just made the V-cache chip slightly bigger and fit extended L2 there as well. The fact that they didn't hints that it's not as easy:
(attached: annotated Zen 3 die shot showing the L2 and L3 layout)

I do wonder, however, whether they plan to enlarge the L2. It would add latency, but I think they have to, as it currently peaks at ~1 TB/s (aggregated) for 8 cores, according to AIDA64. That seems like quite the bottleneck compared to L3.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
I do wonder, however, whether they plan to enlarge the L2. It would add latency, but I think they have to, as it currently peaks at ~1 TB/s (aggregated) for 8 cores, according to AIDA64. That seems like quite the bottleneck compared to L3.

It is not a bottleneck. AMD is "citing" AIDA numbers, which are cumulative bandwidth across all cores. So ~2 TB/s for a 12C chip: they get that L3 and L2 bandwidth at ~5 GHz with 32-byte reads/writes. A 16C chip would have even bigger cumulative L2 bandwidth; L3 might be somewhat behind, but still incredible and awesome for L3.
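Those cumulative figures follow directly from cores × 32 B/cycle × clock; a quick sanity check, with assumed round-number clocks:

```python
# Cumulative (all-core) cache bandwidth, AIDA64-style:
# cores * 32 B/cycle * clock. Clocks are assumed round numbers.
def aggregate_bw_tbs(cores: int, bytes_per_cycle: int, clock_ghz: float) -> float:
    """Aggregate bandwidth in TB/s across all cores."""
    return cores * bytes_per_cycle * clock_ghz / 1000.0  # GB/s -> TB/s

print(f"8C  @ 4.8 GHz: {aggregate_bw_tbs(8, 32, 4.8):.2f} TB/s")   # ~1.2
print(f"12C @ 5.0 GHz: {aggregate_bw_tbs(12, 32, 5.0):.2f} TB/s")  # ~1.9
print(f"16C @ 5.0 GHz: {aggregate_bw_tbs(16, 32, 5.0):.2f} TB/s")  # ~2.6
```

The per-core figure is the same in every case; only the core count (and clock) changes, which is why the "bottleneck" reading of the aggregate number is misleading.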
 

Det0x

Golden Member
Sep 11, 2014
1,031
2,963
136
L1 latency DOES depend on clock speed. It just happens to be part of the core so it scales with clock.

In terms of CPU cycles it'll be consistent. In terms of nanoseconds the CPU with higher clocks will have a faster L1 cache.
You are correct.
I have always been in the 4.8 to 5.2 GHz ST domain, so I have hardly seen the L1 latency change from 0.8 ns... But after some further testing I see it scales just as well as L3 latency does.

4800 MHz
L1 latency @ 0.8 ns
L3 latency @ 10.5 ns

4300 MHz
L1 latency @ 0.9 ns
L3 latency @ 11.9 ns

3900 MHz
L1 latency @ 1.0 ns
L3 latency @ 13.1 ns

4800 MHz / 3900 MHz = 1.23 → 23% increase in clock speed
1.0 ns / 0.8 ns = 1.25 → ~25% improvement in L1 latency
13.1 ns / 10.5 ns = 1.25 → ~25% improvement in L3 latency
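Converting those measurements to core cycles (cycles = latency_ns × clock_GHz) shows why: the latency is essentially fixed in cycles, so the time in nanoseconds scales inversely with clock.

```python
# AIDA64 latency readings from the post above, converted to cycles.
# If latency is fixed in cycles, these should come out roughly constant.
measurements = [  # (clock GHz, L1 ns, L3 ns)
    (4.8, 0.8, 10.5),
    (4.3, 0.9, 11.9),
    (3.9, 1.0, 13.1),
]
for ghz, l1_ns, l3_ns in measurements:
    print(f"{ghz} GHz: L1 ~ {l1_ns * ghz:.1f} cycles, L3 ~ {l3_ns * ghz:.1f} cycles")
```

The L1 works out to roughly 4 cycles and the L3 to roughly 51 cycles at every clock, consistent with the cycle counts discussed earlier in the thread.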
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
What I find crazy is that there didn't seem to be a whiff of this before it got dropped like a bomb.

I mean, hats off on that front.

For some workloads this will be a huge game changer. On the one hand I am stoked. On the other, it makes me think my 5800Xs might not have been quite as great an investment as I thought. I know this is a Zen 4 thread, but if this comes to Zen 3 as well and the cost is... reasonable... obviously the long-haul AM4 chip to get is the Zen3+StupidlyAwesomeCache variety.

I mean, the pieces were/are there. Several days/weeks ago we saw a leak claiming Warhol was canceled and that we were getting a new part that performed really well in gaming. We also had patents and a few other leaks about the stacking.
 

blckgrffn

Diamond Member
May 1, 2003
9,128
3,069
136
www.teamjuchems.com
I mean, the pieces were/are there. Several days/weeks ago we saw a leak claiming Warhol was canceled and that we were getting a new part that performed really well in gaming. We also had patents and a few other leaks about the stacking.

That's fair, but it's such a bombshell in its entirety, and there wasn't any huge countdown, etc. Just like: "Oh yeah, we've figured out this thing. It's probably pretty disruptive. And our current parts that you know and love will be enhanced with it within months."

So many times I feel like there are these pumped-up announcements, and then they happen and it's kinda "meh". This was like the flip side of that, which is great. I am used to marketing totally overselling these things.
 

Rigg

Senior member
May 6, 2020
472
974
106
I mean, the pieces were/are there. Several days/weeks ago we saw a leak claiming Warhol was canceled and that we were getting a new part that performed really well in gaming. We also had patents and a few other leaks about the stacking.
Everything surrounding Warhol has been unsubstantiated rumor-mill nonsense. One day it's 6 nm Zen 3+, the next day it's canceled, blah blah blah. The only thing remotely solid on Warhol is the leaked roadmap from a year ago. Assuming that's even real, it says Warhol is Zen 3 on 7 nm and uses PCIe 4. I think this strongly indicates that what Dr. Su showed the other day IS Warhol, and all the other 'leaks' are made-up BS.
 

uzzi38

Platinum Member
Oct 16, 2019
2,635
5,984
146
Everything surrounding Warhol has been unsubstantiated rumor-mill nonsense. One day it's 6 nm Zen 3+, the next day it's canceled, blah blah blah. The only thing remotely solid on Warhol is the leaked roadmap from a year ago. Assuming that's even real, it says Warhol is Zen 3 on 7 nm and uses PCIe 4. I think this strongly indicates that what Dr. Su showed the other day IS Warhol, and all the other 'leaks' are made-up BS.
I do not believe Warhol was related to any kind of 3D stacking.
 

Timmah!

Golden Member
Jul 24, 2010
1,428
650
136
Just learned about this cache-stacking thing - somehow I missed it :-O It's a big thing, right?

If I understood correctly, they showed a 5900X, which already has L3 cache on the chiplet die, with more L3 stacked in another layer on top of the chiplet, right?
Do you expect future chiplets to have no L3 at all, with it moved completely to this layer and moar cores in its space instead? Or more L1/L2 in that space? I know Intel has the EMIB/Foveros thing, which I suppose is all about stacking as well, but have they demonstrated any product like this so far?
 

Thibsie

Senior member
Apr 25, 2017
750
805
136
Just learned about this cache-stacking thing - somehow I missed it :-O It's a big thing, right?

If I understood correctly, they showed a 5900X, which already has L3 cache on the chiplet die, with more L3 stacked in another layer on top of the chiplet, right?
Do you expect future chiplets to have no L3 at all, with it moved completely to this layer and moar cores in its space instead? Or more L1/L2 in that space? I know Intel has the EMIB/Foveros thing, which I suppose is all about stacking as well, but have they demonstrated any product like this so far?

You got it.
Probably they'll keep part of the L3 in the chiplet; otherwise the chiplet can't be used if stacking is impossible for whatever reason.
 
  • Like
Reactions: Tlh97 and Timmah!

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Just learned about this cache-stacking thing - somehow I missed it :-O It's a big thing, right?

If I understood correctly, they showed a 5900X, which already has L3 cache on the chiplet die, with more L3 stacked in another layer on top of the chiplet, right?
Do you expect future chiplets to have no L3 at all, with it moved completely to this layer and moar cores in its space instead? Or more L1/L2 in that space? I know Intel has the EMIB/Foveros thing, which I suppose is all about stacking as well, but have they demonstrated any product like this so far?

I am a noob here too. Yes, extra L3 on top of the existing L3, not on top of the other logic, as I understand it. Zen 3 is already prepared for it :). The closeness means there is no extra latency penalty, so you can have your cake and eat it too.
The assumption is that future chips will have it too, in more layers than the current one. They will have the basic L3 as now as a minimum, but the added layers mean you can target dies at segments more precisely - e.g. gamers, some encoding and compiling, stuff that fits in a larger cache.
L1 can hardly benefit from it, as it must stay small due to the need for low latency. L2 is in between. But L3 is perfect for this stuff. Amazing stuff this early.
I am in line for one as a high-fps gamer. Lol.
 
  • Like
Reactions: Tlh97 and Timmah!

gdansk

Platinum Member
Feb 8, 2011
2,123
2,629
136
Just learned about this cache-stacking thing - somehow I missed it :-O It's a big thing, right?

If I understood correctly, they showed a 5900X, which already has L3 cache on the chiplet die, with more L3 stacked in another layer on top of the chiplet, right?
Do you expect future chiplets to have no L3 at all, with it moved completely to this layer and moar cores in its space instead? Or more L1/L2 in that space? I know Intel has the EMIB/Foveros thing, which I suppose is all about stacking as well, but have they demonstrated any product like this so far?
My speculation: reduced L3 on the CCD, but not removed (yet) - 16 or 8 MB. Some high-volume SKUs will avoid stacked dies, and to avoid an even larger performance delta it would be helpful if those still have some L3 cache. Secondly, L2 is currently private per core, and I suspect they want some shared L3. Third, some L3 gives a good area to place through-silicon vias.

As for what they will do with the extra space, I have no idea. It's hard to guess which changes would perform best on their tests, but bigger buffers, more execution units, a wider FPU and more L1/L2 cache are in the cards. Maybe not in Zen 4, however; its conception may have been a bit early to bet on stacked L3.
 
Last edited:

Ajay

Lifer
Jan 8, 2001
15,468
7,874
136
Maybe not in Zen 4, however. Its conception may have been a bit early to bet on stacked L3.
On the contrary, Zen 4, being later than Zen3, will most surely be using stacked SRAM. It may even have more stacked layers - super interesting architectural topologies to come!
 

gdansk

Platinum Member
Feb 8, 2011
2,123
2,629
136
On the contrary, Zen 4, being later than Zen3, will most surely be using stacked SRAM. It may even have more stacked layers - super interesting architectural topologies to come!
It's later, but I'm not sure it is late enough. Zen 4 will offer stacked cache, but the amount of CCD die area wasted on L3 will indicate their state of mind when designing Zen 4. If it's still 32 MB, they had little confidence in it. If it's 16 MB or less, they knew it was working.
 
Last edited:
  • Like
Reactions: KompuKare

uzzi38

Platinum Member
Oct 16, 2019
2,635
5,984
146
What is/was Warhol then? It seems like a plausible hypothesis to me.
Can't say for certain. All I know is that it was never Zen 3+ (which is Rembrandt-only), that it was on AM4, and that OEMs had some reason to think it was like the XT SKUs. I wish I knew what that reason was, but the only thing it could be is that they were provided scores, because as far as OEMs are concerned, Warhol never existed outside of the roadmaps it was taken off.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
My speculation: reduced L3 on CCD but not removed (yet). 16 or 8MB. Some high volume SKUs will avoid stacked dies. To avoid an even larger performance delta, it would be helpful if they have some L3 cache. Secondly, L2 is currently private per core. I suspect they want some shared L3. Third some L3 gives a good area to place through-silicon-vias.

As for what they will do with the extra space, I have no idea. It's hard to guess which changes would perform best on their tests. But bigger buffers, more execution units, wider FPU and more L1/L2 cache are in the cards. Maybe not in Zen 4, however. Its conception may have been a bit early to bet on stacked L3.
I think they are unlikely to reduce the L3 on Zen 4. It is on 5 nm, so the cache area will be even smaller. They still want to stack the cache over the top of the L2 and L3 for thermal reasons. I expect the on-die L3 to stay at 32 MB even with the process shrink. The L2 is likely to get larger. They are likely to add a huge amount of FP processing power, and vector FP units are huge compared to integer units. Die size may be similar or slightly larger due to the added FP units and L2 cache. They would still want the chiplet to be usable without any stacked cache die for lower-end parts; regressing to 16 or 8 MB would not be competitive.

Edit: actually, it may be plausible that Zen 4 chiplets will go up to 64 MB on the base die at 5 nm. The lower-end market is likely to be covered by APUs, probably with smaller on-die caches. The chiplets will be for higher-end desktop, server, and workstation parts.
 
Last edited:

jamescox

Senior member
Nov 11, 2009
637
1,103
136
On the contrary, Zen 4, being later than Zen3, will most surely be using stacked SRAM. It may even have more stacked layers - super interesting architectural topologies to come!
Up to 4 stack settings have already been seen in BIOS settings. Zen 4 needs a completely different board for DDR5 and PCI Express 5, so those settings may be for a Zen 3 derivative. TSMC supports up to 12 layers though, which would be absolutely ridiculous - something like 800 MB on a single chiplet if they managed to do that. It would be ridiculously expensive too.
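The "something like 800 MB" figure is just the 32 MB base L3 plus twelve 64 MB layers (both speculative numbers):

```python
# Speculative maximum: 32 MB on-die L3 plus 12 stacked 64 MB layers.
base_l3_mb = 32
layer_mb = 64
max_layers = 12
total_mb = base_l3_mb + max_layers * layer_mb
print(f"{total_mb} MB of L3 on a single chiplet")
```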
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Thinking of Zen 4 with respect to the stacked cache, and the expectation/rumour of more cores - up to 12C per CCD. The expectation, and hence the refutation, being that 6C on either side of the cache would be a weird layout.

But what if this now means there is no cache at all on the compute silicon? Rather than 4C/cache/4C, could it be 4C/4C/4C with two blocks (side by side, not stacked higher) of 64 MB on top? 12C with 128 MB of cache per CCD would be a nice step up over Zen3+ in and of itself.

The silicon would be roughly the same physical size, and you'd still only need two chiplets rather than an odd count of 3 (if they didn't go 6C/cache/6C). Possible?

I don't think we have much real information on Zen 4. The Zen 3 with stacked cache seems to have come as a surprise, so AMD is keeping leaks to a minimum.

It would be interesting if they went to 12 cores and 48 MB of cache on die. That would probably just be adding another 2 cores on each side of the cache. There are a lot of reasons to keep a full-size L3 on die. They would want to use the die without any stacking for lower-end parts. It also allows more opportunities for salvage if something goes wrong in the stacking process: if you don't have any L3 on die and the stacking fails, then the whole thing is probably garbage. I don't know if they could sell a cache-less, Celeron-like part.

I have been expecting them to stick with the 8-core CCD. The cores may be quite a bit larger due to significantly larger FP vector units. If they also increase L1 and/or L2 sizes, then that, along with the larger FP units, may explain how the die area is used. They may also increase the L3 size again if the transistor budget allows for it on 5 nm.
 

gdansk

Platinum Member
Feb 8, 2011
2,123
2,629
136
I think they are unlikely to reduce the L3 on Zen 4. It is on 5 nm, so the cache area will be even smaller. They still want to stack the cache over the top of the L2 and L3 for thermal reasons. I expect the on-die L3 to stay at 32 MB even with the process shrink. The L2 is likely to get larger. They are likely to add a huge amount of FP processing power, and vector FP units are huge compared to integer units. Die size may be similar or slightly larger due to the added FP units and L2 cache. They would still want the chiplet to be usable without any stacked cache die for lower-end parts; regressing to 16 or 8 MB would not be competitive.

Edit: actually, it may be plausible that Zen 4 chiplets will go up to 64 MB on the base die at 5 nm. The lower-end market is likely to be covered by APUs, probably with smaller on-die caches. The chiplets will be for higher-end desktop, server, and workstation parts.
I outlined previously why AMD should move some L3 off the CCD as soon as possible: the CCD uses a process optimized for logic, while the stacked cache will use a process optimized for SRAM (achieving nearly 2x MB/mm²). You can get more cache in the same total die space by moving the cache out of the CCD. Sure, stacking adds some cost, but when designs are approaching 50% L3 cache by area, a reduction in L3 area could pay off quickly - or allow them to build a more competitive design to stay ahead of their aggressive ARM competitors. Increasing the CCD L3 size is bad economics. But it's easy, so maybe they will.
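A toy model of that area argument. The CCD area, the ~50% L3 fraction, and the normalized densities are all made-up illustrative numbers; only the ~2x density ratio comes from the post above.

```python
# Toy model: how much SRAM fits in the L3's footprint on a
# logic-optimized process vs. an SRAM-optimized stacked die.
ccd_area_mm2 = 80.0     # assumed total CCD area
l3_fraction = 0.5       # "approaching 50% L3 cache by area"
logic_mb_per_mm2 = 1.0  # normalized density on the logic process
sram_mb_per_mm2 = 2.0   # "nearly 2x MB/mm^2" on the SRAM process

l3_area = ccd_area_mm2 * l3_fraction
on_die_mb = l3_area * logic_mb_per_mm2
stacked_mb = l3_area * sram_mb_per_mm2
print(f"{l3_area:.0f} mm^2 of L3 footprint: {on_die_mb:.0f} MB on the CCD "
      f"vs {stacked_mb:.0f} MB on a stacked cache die")
```

Under these assumptions, the same footprint holds twice the cache on the SRAM-optimized die, which is the economic case for moving L3 off the CCD.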
 
Last edited:
  • Like
Reactions: Vattila