Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", which I think is likely to double to 64 MB).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) with new memory support (likely DDR5).

[Attached slide from the leaked presentation: Untitled2.png]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

Hitman928

Diamond Member
Apr 15, 2012
That would require additional work in thermal management. The demonstrated setup showed the L3 stack being directly over the L3 section of the CCD, and it's likely there for more than distance reasons. Getting heat pumped out of the CPU core hotspots is a challenge now. Doing it with a cache die in the way? Likely much harder.

I suspect that, in the next iteration, the L3 cache die will be below the CCD, which will allow better heat dissipation. Why wasn't it done this time? In my opinion, it's because the tech wasn't ready for mass production when Zen3 went final.

Since the current solution is the same total height as the non-stacked version, the thermal difference in hot spots should be very, very small, if noticeable at all. Without the stack, the heat still has to travel through just as much silicon to get to the TIM and then the heat spreader. The difference now is that they are removing that extra silicon from the bottom die and putting another die of the same thickness in its place. They may even be able to use a material with lower thermal resistance, which would help transfer heat better than the original die. The potential 'bottleneck' then becomes the bond between dies and the thermal resistance of that interface, but most likely this interface will be designed for very low thermal resistance as well.
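To put rough numbers on that, here's a one-dimensional series-resistance sketch of both stacks. Every thickness, conductivity, area and power figure below is an illustrative assumption (generic textbook values), not an AMD or TSMC specification:

```python
# 1-D thermal sketch: die stack as series thermal resistances.
# All thicknesses, conductivities, areas and power are illustrative
# assumptions, not AMD/TSMC figures. Lateral spreading is ignored.

def r_layer(thickness_m, conductivity_w_mk, area_m2):
    """Thermal resistance of a slab in K/W: R = t / (k * A)."""
    return thickness_m / (conductivity_w_mk * area_m2)

AREA = 80e-6   # ~80 mm^2 of die area, expressed in m^2
POWER = 60.0   # watts assumed to flow through that area

# Conventional die: full-thickness silicon, then solder TIM.
flat = [
    ("silicon ~750 um", r_layer(750e-6, 148.0, AREA)),
    ("solder TIM ~50 um", r_layer(50e-6, 50.0, AREA)),
]
# V-cache-style stack: thinned CCD, bond interface, top die, solder TIM.
stacked = [
    ("thinned CCD ~50 um", r_layer(50e-6, 148.0, AREA)),
    ("bond interface ~1 um", r_layer(1e-6, 5.0, AREA)),
    ("top die ~700 um", r_layer(700e-6, 148.0, AREA)),
    ("solder TIM ~50 um", r_layer(50e-6, 50.0, AREA)),
]

for name, layers in (("flat", flat), ("stacked", stacked)):
    r_total = sum(r for _, r in layers)
    print(f"{name:8s} R = {r_total:.4f} K/W, deltaT = {POWER * r_total:.2f} K")
```

With these made-up numbers the two paths end up within a fraction of a kelvin of each other, which is the point being argued: total silicon thickness, not the bond itself, dominates.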

If the cache goes on the bottom, then you have to route all the IO through the bottom die. I don't know how effective that would be, and it probably wouldn't be worth it when thermals aren't really an issue with their current stacking technique. If they want to start stacking high-current regions over non-dummy silicon in the future (e.g. logic over logic or cache over logic), then a different approach may be warranted.
 

Timorous

Golden Member
Oct 27, 2008
Not unless we have some innovations around cooling. Stacking cache is easy. Cores? Not so much. Even little cores need to be cooled, and pushing that heat up through the big cores will cause problems there as well.

I think AMD is currently taking the best approach.

EDIT: I would love to see AMD take this to an obscene level with a halo Threadripper product.

You don't need to do core on top of core, though. Another option is to design the little cores to use the big cores' L3, so what sits over the cache is the little cores rather than another cache slice.
 

gdansk

Platinum Member
Feb 8, 2011
It is interesting that even Zen 3 may have had some preparation for stacked caches.

For Zen 4/5, is it likely AMD planned to reduce the L3 cache size on the base CCD in anticipation of having stacked cache? If they are using a process optimized for SRAM (as Ian says) then it makes sense to move L3 from the logic die. The stacked cache would be more area efficient. That'd also leave more area for core/L1/L2.
 

HurleyBird

Platinum Member
Apr 22, 2003
For Zen 4/5, is it likely AMD planned to reduce the L3 cache size on the base CCD in anticipation of having stacked cache?

If not Zen 4, then certainly Zen 5.

If they are using a process optimized for SRAM (as Ian says) then it makes sense to move L3 from the logic die. The stacked cache would be more area efficient.

And by extension of being more area efficient, you'll get better latency also.

That'd also leave more area for core/L1/L2.

There's no reason you can't also stack L1 and L2. The former is where things get really interesting. If you can get the L1 (data) to, say, 512KB via stacking, first thing you're looking at is a massive performance increase, and the second thing is that L2 stops making sense as a bridge between L1 and L3 and you can drop an entire level of the cache hierarchy.
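A toy average-memory-access-time (AMAT) calculation makes that trade-off concrete. The hit rates and cycle counts below are invented for illustration, not measured Zen numbers:

```python
# Toy AMAT model: would a huge stacked L1 make L2 redundant?
# Hit rates and latencies are invented for illustration only.

def amat(levels, mem_cycles):
    """levels = [(hit_rate, latency_cycles), ...] innermost to outermost."""
    total, p_reach = 0.0, 1.0
    for hit_rate, latency in levels:
        total += p_reach * latency       # accesses reaching this level pay its latency
        p_reach *= 1.0 - hit_rate        # the misses continue outward
    return total + p_reach * mem_cycles  # remaining misses go to DRAM

# Today-ish: 32 KB L1, 512 KB L2, large L3 (made-up rates and latencies).
three_level = amat([(0.95, 4), (0.80, 12), (0.90, 46)], mem_cycles=200)
# Hypothetical: 512 KB stacked L1 at unchanged latency, straight to L3.
two_level = amat([(0.99, 4), (0.90, 46)], mem_cycles=200)

print(f"3-level AMAT: {three_level:.2f} cycles")  # ~5.3
print(f"2-level AMAT: {two_level:.2f} cycles")    # ~4.7
```

The whole argument hinges on a 512 KB L1 keeping its 4-cycle latency, which is exactly the claim disputed downthread.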
 

blckgrffn

Diamond Member
May 1, 2003
What I find crazy is that there didn't seem to be a whiff of this before it got dropped like a bomb.

I mean, hats off on that front.

For some workloads this will be a huge game changer. On the one hand I am stoked. On the other, it makes me think my 5800Xs might not have been quite as great an investment as I thought. I know this is a Zen 4 thread, but if this is a Zen 3 thing as well and the cost is... reasonable... obviously the long-haul AM4 chip to get is the Zen3+StupidlyAwesomeCache variety.
 

DrMrLordX

Lifer
Apr 27, 2000
If you can get the L1 (data) to, say, 512KB via stacking, first thing you're looking at is a massive performance increase, and the second thing is that L2 stops making sense as a bridge between L1 and L3 and you can drop an entire level of the cache hierarchy.

Are they going to be able to stack L1 and keep latency as low and bandwidth as high as in existing implementations? Here's a sample of Zen3 in Aida64 (credit to jesdals on the techpowerup forums):

[Screenshot: AIDA64 cache & memory benchmark at 1900 MHz FCLK]
 

jpiniero

Lifer
Oct 1, 2010
There's no reason you can't also stack L1 and L2. The former is where things get really interesting. If you can get the L1 (data) to, say, 512KB via stacking, first thing you're looking at is a massive performance increase, and the second thing is that L2 stops making sense as a bridge between L1 and L3 and you can drop an entire level of the cache hierarchy.

Packaging costs alone probably make stacking L1 and L2 unrealistic.
 

Justinus

Diamond Member
Oct 10, 2005
Are they going to be able to stack L1 and keep latency as low and bandwidth as high as in existing implementations? Here's a sample of Zen3 in Aida64 (credit to jesdals on the techpowerup forums):

[Screenshot: AIDA64 cache & memory benchmark]

A more complete run with an updated AGESA (1.2.0.2), which provides more stable and reliable L3 cache figures in AIDA64.

L3 bandwidth isn't that much lower than L1's; it's entirely the latency, which is over an order of magnitude worse, that I imagine will be the real hurdle to stacking L1.
 

coercitiv

Diamond Member
Jan 24, 2014
What I find crazy is that there didn't seem to be a whiff of this before it got dropped like a bomb.
Some people knew something was "cooking"; there was talk on Twitter, and Lisa Su talked about 3D chip stacking recently:
So look, we've been sort of a leader in this idea of sort of advanced packaging and how do you use silicon for its best performance and feature set. And really, this is the key aspect of innovation.

When you think about sort of all that's said about Moore's Law slowing, it means that you're getting performance gains by going to smaller geometries, but not necessarily the same gains that you got a few years ago. So we were very early in sort of the idea of using 2.5D packaging with high-bandwidth memory together with our GPUs, as well as using a chiplet architecture to really get the incredible performance that we're seeing with each generation of EPYC. And you'll see us continue to innovate on that road map.

So 3D chip stacking is definitely on the road map. We see it as another tool in the tool chest as you think about how you put these different pieces together. And I think what you'll also see is you might see different technologies used along the price curve. So you can imagine the highest-performance technologies can afford different elements. And then as you get into more cost-sensitive segments, you might not be able to use all that complexity.
But think about it as AMD will push the envelope on 2.5D and 3D packaging as we go forward because it's a key element to unlock that next level of performance. And again, we'll talk a little bit more about that as we go through the next number of months as we roll out the next phase of our road maps.

@uzzi38 was kind enough to let us know about both the "lasagna" talk and the JPMorgan conference.
 

Det0x

Golden Member
Sep 11, 2014

A more complete run with updated AGESA which provides more stable and reliable L3 cache figures in AIDA64 (1.2.0.2)

Bandwidth isn't that much lower than L1, it's entirely the latency that is over an order of magnitude worse and I imagine will be the real hurdle to stacking L1.
L3 latency depends on CPU clock speed; L1 latency does not.
[Screenshot: L3 latency at different CPU clock speeds]
Also, AIDA64 is a pretty bad program for measuring "real" memory performance.
Lately some of us have been using Linpack Xtreme to compare real performance, as it does a much better job. And very few can show real scaling above 1900 FCLK, but that is another topic for another thread :)
(no scaling in other real benchmarks, even though some are rocking 50 ns memory latency in AIDA with FCLK above 2100+ and no WHEA errors)
Why Linpack?
It's as close to a "real-world" memory test as any benchmark gets: it will hammer the memory controller AND the CPU, so the CCD portion of the Infinity Fabric gets hot; it uses a fairly representative memory footprint in its "Standard" benchmark (3 GB of heavy IO is about typical for most games/computational software); and it is multi-threaded. The problem with the AIDA64 bandwidth and latency tests is that they put absolutely zero strain on the memory controller, Geekbench is too short and too synthetic, and y-cruncher is a fair substitute but far heavier than any real-world use case, so it's only really useful for scoring HWBot points or really proving a point about stability.
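The spirit of that test is easy to approximate yourself: time a large dense LU solve and count the floating-point operations. This is a rough numpy stand-in for illustration (the matrix size is an arbitrary choice giving a ~0.5 GB working set), not the actual Linpack Xtreme binary:

```python
# Rough Linpack-style probe: time a dense solve, report GFLOPS.
# A numpy stand-in for illustration, not Linpack Xtreme itself.
import time
import numpy as np

n = 8192                              # n*n*8 bytes ~ 0.5 GB working set
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.perf_counter()
x = np.linalg.solve(a, b)             # LU factorisation + triangular solves
dt = time.perf_counter() - t0

flops = (2.0 / 3.0) * n**3            # classic LU operation count
print(f"n={n}: {dt:.2f} s, ~{flops / dt / 1e9:.1f} GFLOPS")
print("residual:", np.linalg.norm(a @ x - b) / np.linalg.norm(b))
```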
 

HurleyBird

Platinum Member
Apr 22, 2003
Are they going to be able to stack L1 and keep latency as low and bandwidth as high as in existing implementations?

Yes.

In this current implementation, as relayed to Ian on his YouTube channel by AMD, latency is supposed to remain the same (you can round the extra travel distance in the z plane down to 0). AMD is claiming that they're getting 2 TB/s throughput via interleaving, which is a big jump.

The (main) reason why L1 needs to be small is to minimize wire length which in turn minimizes latency. Stacking lets you multiply the area devoted to cache without (in practical terms) increasing wire length.
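Two textbook scaling rules sketch why: the edge of an SRAM array (and with it the wordline/bitline length) grows roughly with the square root of capacity, and the delay of an unrepeated RC wire grows roughly with the square of its length. The 32 KB / 250 µm baseline below is an arbitrary assumption:

```python
# Illustrative scaling: wire length and RC delay vs. SRAM capacity.
# The 32 KB / 250 um baseline is an arbitrary assumption.
import math

BASE_KB, BASE_LEN_UM = 32, 250.0

def wire_len_um(capacity_kb):
    # array edge, hence wordline/bitline length, ~ sqrt(capacity)
    return BASE_LEN_UM * math.sqrt(capacity_kb / BASE_KB)

def rel_rc_delay(capacity_kb):
    # unrepeated RC wire delay ~ length^2, normalised to the baseline
    return (wire_len_um(capacity_kb) / BASE_LEN_UM) ** 2

for kb in (32, 64, 128, 512):
    print(f"{kb:4d} KB: wires ~{wire_len_um(kb):5.0f} um, RC delay x{rel_rc_delay(kb):4.1f}")
```

Growing a planar L1 16x means roughly 4x the wire length and, unrepeated, up to 16x the wire delay; a second die directly above adds only the die-to-die hop, which is the crux of the claim.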

Packaging costs alone probably make stacking L1 and L2 unrealistic.

Based on what? The packaging costs of a single stack aren't prohibitive enough that you can't make a consumer desktop product with it, obviously. A single stack can also be ~2x as dense with a cache-optimized process (AMD has confirmed that their 64 MB V-cache silicon is a single stack). Even with one stack you're getting a big benefit, and there's no reason you can't put L2 on the same die as L3 (although L1 might benefit from a change in layout). In terms of cost, expect economies of scale (and yield improvements) to set in as you move more fully to stacking, and each additional stack to be cheaper to add than the previous one. Look at Micron's new 176-layer NAND chip for evidence of that.
 

naukkis

Senior member
Jun 5, 2002
The (main) reason why L1 needs to be small is to minimize wire length which in turn minimizes latency. Stacking lets you multiply the area devoted to cache without (in practical terms) increasing wire length.

Minimize latency and power use. Stacked chiplets aren't a free lunch; it takes more power to access the stack than a direct on-die cache, and for L1 they won't want to increase access energy. And as the L1 cache needs to be really, really small for wire-length reasons, a stack for the L1 cache would be way smaller than what could be made. So no, stacking won't help the L1 cache at all.
 

Ajay

Lifer
Jan 8, 2001
Are they going to be able to stack L1 and keep latency as low and bandwidth as high as in existing implementations? Here's a sample of Zen3 in Aida64 (credit to jesdals on the techpowerup forums):

[Screenshot: AIDA64 cache & memory benchmark at 1900 MHz FCLK]
There is no reason to do that, though. In future designs of the Zen architecture, the whole of the L3$ can be on a vertically stacked die, freeing up more space for larger cores (including L1$ & L2$) or for adding more cores.
 

HurleyBird

Platinum Member
Apr 22, 2003
L1 cache would be way smaller than what could be made.

First implementation of stacked L1 would just add it onto the existing cache chiplet, surrounded by dead silicon. Given a CCD-like layout, you would want the L1 regions to be as close to the middle as you can get them to minimize wastage.
 

Doug S

Platinum Member
Feb 8, 2020
There's no reason you can't also stack L1 and L2. The former is where things get really interesting. If you can get the L1 (data) to, say, 512KB via stacking, first thing you're looking at is a massive performance increase, and the second thing is that L2 stops making sense as a bridge between L1 and L3 and you can drop an entire level of the cache hierarchy.


I'm very skeptical you can stack L1. The length of the wires between dies is pretty significant even with thinned dies; the RC delay will kill your latency on the scale of an L1. There's no way you don't add at least one cycle of latency, and if you're willing to add a cycle of latency, you can go bigger in the main die.

Stacking might benefit L2, but I don't believe it would be at all viable for L1.
 

maddie

Diamond Member
Jul 18, 2010
I'm very skeptical you can stack L1. The length of the wires between dies is pretty significant even with thinned dies; the RC delay will kill your latency on the scale of an L1. There's no way you don't add at least one cycle of latency, and if you're willing to add a cycle of latency, you can go bigger in the main die.

Stacking might benefit L2, but I don't believe it would be at all viable for L1.
What's the Z height in the stack? 1 mm? Surely less than the horizontal distance between the I-cache and the data cache.

If heat were not a problem, ha, then stacking the major logic elements of the core vertically would surely give the least distance between elements.
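For a sense of scale, with generic ballpark figures (the thicknesses and distances below are assumptions, not disclosed AMD numbers):

```python
# Ballpark: a vertical die-to-die hop vs. lateral on-die routing.
# All figures are generic assumptions, not disclosed AMD numbers.

thinned_die_um = 50.0   # assumed thinned top-die thickness
bond_um = 1.0           # assumed hybrid-bond interface thickness
z_hop_um = thinned_die_um + bond_um

lateral_um = {
    "L1I to L1D across a core": 800.0,
    "across a ~4 mm wide core": 4000.0,
    "core to far L3 slice": 6000.0,
}

for path, dist in lateral_um.items():
    print(f"{path:26s}: {dist:6.0f} um ({dist / z_hop_um:5.1f}x the ~{z_hop_um:.0f} um z-hop)")
```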
 

jamescox

Senior member
Nov 11, 2009
Since the current solution is the same total height as the non-stacked version, the thermal difference in hot spots should be very, very small, if noticeable at all. Without the stack, the heat still has to travel through just as much silicon to get to the TIM and then the heat spreader. The difference now is that they are removing that extra silicon from the bottom die and putting another die of the same thickness in its place. They may even be able to use a material with lower thermal resistance, which would help transfer heat better than the original die. The potential 'bottleneck' then becomes the bond between dies and the thermal resistance of that interface, but most likely this interface will be designed for very low thermal resistance as well.

If the cache goes on the bottom, then you have to route all the IO through the bottom die. I don't know how effective that would be, and it probably wouldn't be worth it when thermals aren't really an issue with their current stacking technique. If they want to start stacking high-current regions over non-dummy silicon in the future (e.g. logic over logic or cache over logic), then a different approach may be warranted.
I believe there have been a few patents from AMD involving cooling for stacked logic chips, but those are generally going to be expensive if or when they are implemented. You wouldn't want to stack core chiplets, since hot spots would line up. The thermal issues may not be as bad as some expect with just the stacked cache, since dies thinned for stacking are very thin. The wafer is polished down so far that I believe it is actually flexible like a sheet of paper if you were to pick it up; the dies are probably thinner than a sheet of common paper, and TSMC supports stacks up to 12 high. You just don't want to stack high-thermal-output / high-power-draw components on top of each other, so we get the current solution with the cache mostly on top of existing cache. High power delivery in the stack may pose difficulties. SRAM still burns some power, since a lot of its transistors are active at any given time; not as bad as logic, though. DRAM draws significantly less power, since it has comparatively very few active transistors.

The IO argument doesn't make that much sense if it is based on the number of IO pins being too large. The CPU chiplets only have a single link to the IO die, so it actually isn't a lot of IO pins. There could be some challenges passing power up to the CPU if it were on top; you would have to get a lot of current through those TSVs. There are a lot of issues with putting the cache on the bottom, but the number of IO pins doesn't seem like a big problem compared to the others. One issue is that putting the cache chip on the bottom would make it non-standard; it would likely be specific to the CPU chiplet, and I suspect that this cache chip will be used across the product stack.

The chiplet-based GPUs may use the same SRAM cache chips as Infinity Cache. They are talking about 2 TB/s of bandwidth out of one of these things (or out of the L3 combined). Perhaps we get a chiplet-based GPU with one of these on top of each chiplet and a stack of HBM next to it, or just regular GDDR; external memory bandwidth becomes less important with the larger caches. It would be great if they could go up to more layers. The main issue with ray tracing is that it needs random access to the whole scene (geometry and surface information), since you can't predict which way the rays will bounce, so latency matters in a way it doesn't for rasterization. If you could stack 256 MB on top of each GPU chiplet, then ray tracing performance could increase massively.
 

jamescox

Senior member
Nov 11, 2009
I'm very skeptical you can stack L1. The length of the wires between dies is pretty significant even with thinned dies; the RC delay will kill your latency on the scale of an L1. There's no way you don't add at least one cycle of latency, and if you're willing to add a cycle of latency, you can go bigger in the main die.

Stacking might benefit L2, but I don't believe it would be at all viable for L1.
It is almost certainly a shorter path going vertical. The stacked dies are very thin, and TSMC goes up to 12-high stacks. You could put the cache right on top of the core, which would decrease path length for lower latency. There are probably a lot of other issues with this, though. Power density probably goes up significantly with L2 vs. L3 and L1 vs. L2, so stacking that on top of the already high-power-density logic of the CPU is probably a problem. The cache chiplet would also not be usable for any other product; it would be very specific to the device it stacks on top of, whereas there is a good chance the current cache chiplets will be used across multiple products (Infinity Cache). Also, the device would not be usable without the cache chiplet on top, and if something goes wrong in bonding, both are garbage. With the way it is implemented currently, they may be able to do some tests prior to bonding and reject any die whose TSV connections aren't perfect; the rejects just get sold without the extra L3, and if something goes wrong in bonding, they could probably just disable the stacked cache and sell it as a regular chip. Although, I don't know how much testing they can do prior to bonding with TSMC's process; with HBM, they definitely test the stacks before placing them on the interposer. It may be possible to make a stacked device with lower-level caches, but it probably isn't economical and almost certainly has some engineering challenges.
 

IntelUser2000

Elite Member
Oct 14, 2003
Based on what? The packaging costs of a single stack aren't prohibitive enough that you can't make a consumer desktop product with it, obviously.

You know L1/L2 caches are tiny in both capacity and die area, right? So you'd go through the effort of making a chiplet that's possibly 1 mm² in size? The wafer cost might be negligible, but the packaging costs won't be much different, because you are still stacking and connecting.

And remember the part about AMD evening out the rest of the die in terms of z-height? They'll have to consider that separately just for that 1 mm² stack.

Also, you said earlier that a large L1 would mean L2 can be skipped. Actually, you mean L3 would be skipped, because the "3" in L3 means third level of access.

L3 latency depends on CPU clock speed; L1 latency does not.

L1 latency DOES depend on clock speed. It just happens to be part of the core so it scales with clock.

In terms of CPU cycles it'll be consistent. In terms of nanoseconds the CPU with higher clocks will have a faster L1 cache.

Actually, for some CPUs it's the opposite of what you said. L3 cache since Haswell isn't tied to the CPU clock, so you might have a faster-clocked Intel CPU but the same L3 cache latency.
 

MadRat

Lifer
Oct 14, 1999
So you want things to be built like Apple does, where if your SSD goes bad you have to replace the whole system board because it is soldered on? Apple does this because pick-and-place machines are cheaper than having workers install components and put screws in. It is more reliable in some respects, but it has its downsides also. You can get soldered-on memory and such in laptops and an Xbox or PlayStation. The whole point of a PC is that you can mix and match components.
Apple used the same chip for storage and RAM. I was really speaking of RAM (not storage) built into the main board to give insane base data rates to the CPU. Dedicate the first memory controller to this array of aggressively timed memory that comes soldered onboard, and have another memory controller for further expansion slots that are not soldered to the main board. I do not suggest your storage be soldered onboard; that's one aspect where I think Apple went stupid.

jamescox said:
They may eventually integrate enough memory (probably HBM or other stacked DRAM) that you wouldn't really need to add any more DRAM. Just plug in an Optane or flash drive. Most off-chip IO seems to be moving toward PCI-Express physical-layer signalling, which is low pin count for the bandwidth. It would be higher latency to use that for memory, but with massive caches it may make sense eventually.
Stacked memory suggests serialized access; I'd want access to be as parallel as possible. You'll probably see either one 'good enough' performance RAM, or separate high-performance and low-cost options on the same board. Afford as much premium as possible to assure performance, but maintain lots of cheap storage that isn't as high-performing yet offers a great long-term solution.
 

naukkis

Senior member
Jun 5, 2002
First implementation of stacked L1 would just add it onto the existing cache chiplet, surrounded by dead silicon. Given a CCD-like layout, you would want the L1 regions to be as close to the middle as you can get them to minimize wastage.

And that's part of the problem: L1 isn't just a cache region the way L3 is. The L1 caches are part of the core; the instruction L1 feeds the core's decode stage, and the L1 data cache is tightly coupled to the ALU/load-store unit. In Zen designs the L2 sits physically between the cores and the L3, so implementing stacked L1 would mean the cores themselves have to be stacked, meaning the stacked chiplet becomes just as big as the main chiplet. Using that much silicon just for a cache chiplet would be ineffective, so they would have to design two different main chiplets whose L1 caches stack onto each other. They won't be doing that for sure; designing the main chiplet is already hard enough without that extra complexity.
 

DrMrLordX

Lifer
Apr 27, 2000
L3 bandwidth isn't that much lower than L1's; it's entirely the latency, which is over an order of magnitude worse, that I imagine will be the real hurdle to stacking L1.

Bandwidth is 2-3x higher for the L1. And yes I would be worried about wire length and latency.

Yes.

In this current implementation, as relayed to Ian on his YouTube channel by AMD, latency is supposed to remain the same (you can round the extra travel distance in the z plane down to 0). AMD is claiming that they're getting 2 TB/s throughput via interleaving, which is a big jump.

The (main) reason why L1 needs to be small is to minimize wire length which in turn minimizes latency. Stacking lets you multiply the area devoted to cache without (in practical terms) increasing wire length.

Guess I'll just say... I'll believe it when I see it? See the other responses wrt integration of L1 into the core design. I'm really, really skeptical about the extra travel distance in the z plane being so small as to be a rounding error.
 

JoeRambo

Golden Member
Jun 13, 2013
L1 latency DOES depend on clock speed. It just happens to be part of the core so it scales with clock.

In terms of CPU cycles it'll be consistent. In terms of nanoseconds the CPU with higher clocks will have a faster L1 cache.

Actually, for some CPUs it's the opposite of what you said. L3 cache since Haswell isn't tied to the CPU clock, so you might have a faster-clocked Intel CPU but the same L3 cache latency.

Good points, especially about decoupled uncores with the L3 running in a different clock domain (like every Intel big-core design since Haswell, including server parts that run at an abysmal 2.4 GHz).
L1 latency does change, but since it is measured in ns, it takes multiple hundreds of MHz to move it by 0.1 ns - the "precision" of AIDA testing.

A 4-cycle L1 @ 5 GHz => 0.8 ns; assuming some rounding and measurement error, the clock would need to drop to ~4.7 GHz for AIDA to show 0.9 ns, and it would keep showing 0.9 ns down to around 4.2 GHz.
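Spelled out (assuming a fixed 4-cycle L1 and an AIDA readout rounded to 0.1 ns):

```python
# A fixed 4-cycle L1 expressed in ns, as a 0.1 ns-granularity readout shows it.
L1_CYCLES = 4

for ghz in (5.0, 4.7, 4.4, 4.3, 4.0):
    ns = L1_CYCLES / ghz                     # cycles divided by cycles-per-ns
    print(f"{ghz:.1f} GHz: {ns:.3f} ns -> displayed as ~{round(ns, 1):.1f} ns")
```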
Quick reality check with a 4.4 GHz static system:

[Screenshot: AIDA64 L1 latency on a 4.4 GHz static system]
 

DisEnchantment

Golden Member
Mar 3, 2017
There's no reason you can't also stack L1 and L2. The former is where things get really interesting. If you can get the L1 (data) to, say, 512KB via stacking, first thing you're looking at is a massive performance increase, and the second thing is that L2 stops making sense as a bridge between L1 and L3 and you can drop an entire level of the cache hierarchy.
In the current Zen design, L2 is private to the core and inclusive of L1, and L3 is a victim cache. You cannot really skip L2; the implications of that would be really, really huge. L1/L2 are involved in MMU operations and paging as well, so skipping L2 basically leaves the tiny L1 managing the working set of fairly big pages; I'm not sure about that.

Besides that, L1 is basically part of the core itself. The floor plan of the core includes the L1, the TLBs, the uop cache and so on. Because the other elements of the core and the L1 run at the same clock, it is important to maintain the floor plan to ensure signal integrity, propagation delay, etc. for high-clocking architectures like x86. Stacking is no option here unless you don't mind limiting clock speeds, or someone comes up with a totally radical 3D floor plan.
The L3, with its own clock domain (AMD themselves mentioned it can be clock-gated independently), can operate independently of the core.
So I would suppose that even in the several incarnations of the Zen core (mobile, consoles, desktop, etc.) they are tinkering with the L3 while the floor plan of the core remains the same.
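For anyone unfamiliar with the terms, here is a toy sketch of an inclusive L2 and a victim L3 (a simplified illustration of the general technique, not AMD's actual coherence protocol; capacities are arbitrary):

```python
# Toy model: inclusive L2 (filled alongside L1) and victim L3
# (filled only by L2 evictions). Not AMD's real protocol.
from collections import OrderedDict

class Lru:
    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.data = OrderedDict()
    def touch(self, addr):
        """Insert/refresh a line; return the evicted victim, if any."""
        self.data.pop(addr, None)
        self.data[addr] = True
        if len(self.data) > self.capacity:
            victim, _ = self.data.popitem(last=False)  # least recently used
            return victim
        return None

l1, l2, l3 = Lru(8), Lru(64), Lru(512)   # toy capacities, in cache lines

def access(addr):
    if addr in l1.data:
        l1.touch(addr)
        return "L1 hit"
    result = "L2 hit" if addr in l2.data else (
             "L3 hit" if addr in l3.data else "miss -> DRAM")
    if addr in l3.data:
        l3.data.pop(addr)                 # victim cache: a hit moves the line back up
    l1.touch(addr)                        # fill L1 ...
    evicted = l2.touch(addr)              # ... and L2 together (inclusive)
    if evicted is not None:
        l3.touch(evicted)                 # L3 only receives L2 victims
        l1.data.pop(evicted, None)        # inclusion: back-invalidate the L1 copy
    return result

for a in [1, 2, 3, 1, 2, 3]:
    print(a, access(a))
```

Dropping the L2 level in a design like this isn't just deleting a box: the victim L3 would then have to be fed by L1 evictions, and the inclusion property that simplifies snooping goes with it.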
 