Any time you break data up into smaller packets for transmission, you add time on the other end to reassemble those chunks into their original state. That's fine for slower access pathways, but if you're aiming for performance you want to keep data in its original state during transmission. Trace lengths are obviously critical at each cache level: as trace length increases, the time it takes to transmit and receive signals across that trace increases accordingly. The lengths of the traces, and the sheer number of them, also create timing skew in parallel interfaces.
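To put rough numbers on that (back of the envelope, assuming ~15 cm/ns signal propagation in FR4; actual boards vary with the stackup):

# Rough propagation delay and skew on PCB traces.
# Assumes ~15 cm/ns in FR4, roughly half the speed of light in vacuum.
PROP_CM_PER_NS = 15.0

def prop_delay_ns(trace_cm):
    """Time for a signal edge to travel the length of one trace."""
    return trace_cm / PROP_CM_PER_NS

def skew_ns(shortest_cm, longest_cm):
    """Timing skew between the shortest and longest trace in a parallel bus."""
    return prop_delay_ns(longest_cm) - prop_delay_ns(shortest_cm)

print(prop_delay_ns(10.0))  # ~0.67 ns for a 10 cm trace out to a DIMM
print(prop_delay_ns(2.0))   # ~0.13 ns for a 2 cm on-package trace
print(skew_ns(9.0, 10.0))   # ~0.07 ns of skew across a bus with 1 cm of mismatch

Shorter traces cut the flight time, and a wide parallel bus has to budget for the worst-case mismatch across all of its lines.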
I find the notion that placing memory on the main board in a way that minimizes trace distances offers 'no advantage' ridiculous. Apple obviously saw the advantage. Game consoles do it. Video cards do it. It cuts latency and decreases access time. You cannot match that latency by simply serializing data over longer trace lengths; reassembling the data always pays an additional overhead cost. SSDs moved to serial interfaces because they simply don't need the latency figures that something like an L3 cache would need to improve performance.
We also shouldn't assume that a serialized device's data travels over the same narrow set of traces all the way from the device to the CPU. We often see localized caching or root hosts close to the device attached to wider interfaces. So from the device to its local cache the interface is narrow by nature; from that device-attached cache to the CPU, the data moves through stages of a bus architecture. The M.2 interface is 67 pins. The M.2 connects over PCIe x4. The PCIe x4 link connects to the PCIe x16 controller. The PCIe x16 interface is 82 pins per side. Pin counts generally increase with each incremental interface closer to the CPU; these narrower interfaces eventually reach the CPU through a common southbridge interface.
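As a rough sketch of those stages (the lane widths and the chipset-uplink topology here are just an illustrative assumption for a PCIe 4.0 board, not any specific product):

# Illustrative only: per-direction bandwidth at each stage of a hypothetical
# NVMe -> chipset -> CPU path on a PCIe 4.0 system.
# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding.
GT_PER_LANE = 16e9
ENCODING = 128 / 130

def link_bandwidth_gbs(lanes):
    """Approximate usable bandwidth of a PCIe 4.0 link in GB/s, one direction."""
    return lanes * GT_PER_LANE * ENCODING / 8 / 1e9

stages = [
    ("NVMe SSD over M.2 (x4)", 4),
    ("Chipset uplink to CPU (x4)", 4),   # the shared link everything funnels through
    ("GPU slot (x16)", 16),
]
for name, lanes in stages:
    print(f"{name}: ~{link_bandwidth_gbs(lanes):.1f} GB/s")
# ~7.9 GB/s, ~7.9 GB/s, ~31.5 GB/s

Everything hanging off the chipset funnels through that one uplink, which is where the pressure builds.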
A bottleneck eventually emerges from the limits on data transmission between the CPU and the southbridge. The solution wasn't more pins on the main board, but rather cutting parts out of the southbridge and moving them onto the CPU. Pin counts keep climbing to feed the components that moved: not only did they shrink, but the shorter trace lengths dramatically tightened their timings, so they need more lines of data to keep them fed. So even as the southbridge lost functions, pin counts continued to grow, and CPU pin counts grew too. Last-generation AM4 is 1,331 pins; the next-generation Zen 4 socket is going to 1,718 pins.
AMD is talking about stacking 64 MB L3 slices for a total of up to 192 MB on their next-generation chips. Caches are great, but eventually you need access to RAM. While integrated memory will never become as fast as L3 cache, the aggressive move to decrease trace lengths to the first 8-16 GB of RAM will certainly help keep these future CPUs fed. To suggest it would just be another thing to break on the main board minimizes the fact that main boards evolve in complexity as a matter of progress. Systems that use integrated graphics have the most to gain.
I am still not sure what you are arguing for or against here. Game consoles mostly use unified memory that is all graphics-type memory; that's because the gpu needs the bandwidth and it is cheaper not to have separate system memory. I don't think I said anything about trace lengths being irrelevant. They are definitely important for L1 cache, but signal propagation speed becomes less important as you move further out the memory hierarchy. I doubt that the absolute signal propagation time in a DDR connection is a significant fraction of the latency; most of the latency will be the sense amps reading the DRAM cells. Since it is a parallel interface, keeping all of the traces matched as closely as possible is important, and keeping them short matters for signal integrity. I once tried to get an old cpu to work attached through a ribbon cable in the electronics lab. It was nearly impossible, since the ribbon cable was essentially an antenna for interference.
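Rough numbers to show the scale I mean (these are my assumptions, not measurements: ~7.5 cm of trace, ~15 cm/ns in FR4, DDR4-3200 at CL16):

# Trace flight time vs. DRAM array latency, ballpark figures only.
trace_cm = 7.5                 # assumed CPU-to-DIMM routing distance
prop_ns = trace_cm / 15.0      # ~15 cm/ns in FR4 -> ~0.5 ns each way

ddr_io_clock_mhz = 1600        # DDR4-3200 I/O clock
cas_cycles = 16                # CL16
cas_ns = cas_cycles / (ddr_io_clock_mhz / 1e3)   # ~10 ns just for CAS

print(f"trace flight time: {prop_ns:.2f} ns")   # ~0.50 ns
print(f"CAS latency alone: {cas_ns:.1f} ns")    # ~10 ns
# Full load-to-use latency (queuing, RAS, CAS, transfer) is typically
# tens of nanoseconds, so the wire itself is close to a rounding error.

The wire delay is real, but it is buried under the time spent in the DRAM array and the controller.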
Serial interfaces have length restrictions also. It has been an issue with pci-express 4 and it will be more of an issue with pci-express 5. The trace length matters in different ways, though: with serial connections each lane is independent and carries an embedded clock, so the traces do not need to be length-matched. As for accessing DRAM via serial interfaces, we already do that. The higher latency on pci-express devices doesn't really have much to do with the physical-level interface; it has to do with it being a high-level, software-driven protocol. On AMD systems, and on Intel systems with multiple processors, memory is already accessed over a serialized interface. For AMD, the connection between the cpu chiplet and the IO die that holds the memory controller is infinity fabric, which is based on a physical layer very similar to, if not the same as, pci-express. It is a multi-layer protocol, but it is all in hardware. In multi-socket systems, Intel or AMD, remote memory accesses go over a serialized link. AMD has their system matched very well: an AMD IO die has a 128-bit (2x64-bit channel) memory controller, the connection to the cpu chiplets is 32 bits wide but at more than 4x the memory clock, and the connection to another socket in Epyc is 16 bits wide but at more than 8x the memory clock.
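The matching is easy to check with round numbers (normalized, since the actual fabric and xGMI clocks vary by product and the real links run a bit faster than the break-even point):

# Why a narrow serial link can keep up with a wide parallel memory bus:
# bandwidth ~ width * clock, so a link 1/4 as wide at 4x the clock breaks even.
mem_width_bits = 128        # 2 x 64-bit DDR channels on the IO die
mem_clock = 1.0             # normalized memory clock

links = {
    "DDR channels (128-bit @ 1x)":         (128, 1.0),
    "IO die -> cpu chiplet (32-bit @ 4x)":  (32, 4.0),
    "socket -> socket xGMI (16-bit @ 8x)":  (16, 8.0),
}
base = mem_width_bits * mem_clock
for name, (width, clock_mult) in links.items():
    rel = width * clock_mult / base
    print(f"{name}: {rel:.2f}x the DDR bandwidth")
# All come out at ~1.0x -- the serialized links are sized to just keep up
# with the memory controller, and in practice run slightly faster than that.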
Anyway, as you move further out the memory hierarchy, latency goes up and speed goes down, and you also generally handle larger and larger chunks. The virtual memory system generally works with 4k pages. I saw something about Apple using 16k pages on their systems, which might give them a performance advantage. AMD treats the HBM memory on their GPUs as cache with a fully virtualized memory system, but I don't know what the page or cache line size is. If you have a massive amount of DRAM on die, then you probably don't need more external DRAM; the bandwidth would be wasted. The consoles are already like this: instead of on-package DRAM, they have unified graphics memory backed up by a fast SSD. If someone made a laptop like that, it is something I might consider, if it had a lot of graphics memory. We generally don't get an APU with graphics memory though; they get slow system memory.
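The page-size point is easy to put numbers on (the TLB entry count below is a made-up example, not Apple's actual hardware):

# With the same number of TLB entries, 16 KiB pages cover 4x as much
# memory as 4 KiB pages before you start taking TLB misses.
tlb_entries = 1536   # hypothetical TLB size for illustration

for page_kib in (4, 16):
    reach_mib = tlb_entries * page_kib / 1024
    print(f"{page_kib:>2} KiB pages: TLB reach ~{reach_mib:.0f} MiB")
# 4 KiB pages:  ~6 MiB of reach
# 16 KiB pages: ~24 MiB of reach

Bigger pages also mean fewer page-table walks per amount of memory touched, which is one plausible reason the larger page size could help.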