Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
809
1,412
136
Except for details about the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).

[Image: slide from the leaked presentation (Untitled2.png)]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
Maybe not in Zen 4, however. Its conception may have been a bit early to bet on stacked L3.
On the contrary: Zen 4, being later than Zen 3, will most surely be using stacked SRAM. It may even have more stacked layers. Super interesting architectural topologies to come!
 

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
What is/was Warhol then? It seems like a plausible hypothesis to me.
Can't say for certain. All I know is that it was never Zen 3+ (which is Rembrandt-only), that it was on AM4, and that OEMs had some reason to think it was like the XT SKUs. Wish I knew what that reason was, but the only thing it could be is that they were provided scores, because as far as OEMs are concerned, Warhol never existed outside of the roadmaps it was taken off.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
My speculation: reduced L3 on the CCD, but not removed (yet); 16 or 8 MB. Some high-volume SKUs will avoid stacked dies, and to avoid an even larger performance delta, it would help if those still have some L3 cache. Secondly, L2 is currently private per core, and I suspect they want some shared L3. Third, some L3 gives a good area to place through-silicon vias.

As for what they will do with the extra space, I have no idea. It's hard to guess which changes would perform best on their tests. But bigger buffers, more execution units, wider FPU and more L1/L2 cache are in the cards. Maybe not in Zen 4, however. Its conception may have been a bit early to bet on stacked L3.
I think that they are unlikely to reduce the L3 on Zen 4. It is on 5 nm, so the cache area will be even smaller. They still want to stack the cache over top of the L2 and L3 for thermal reasons. I expect the on-die L3 to stay at the same 32 MB even with the process shrink. The L2 is likely to get larger. They are likely to add a huge amount of FP processing power, and vector FP units are huge compared to integer units. Die size may be similar or slightly larger due to the added FP units and L2 cache. They would still want the chiplet to be usable without any stacked cache die for lower-end parts; regressing to 16 or 8 MB would not be competitive.

Edit: actually, it may be plausible that Zen 4 chiplets will go up to 64 MB on the base die at 5 nm. The lower-end market is likely to be covered by APUs, probably with smaller on-die caches. The chiplets will serve higher-end desktop parts, servers, and workstations.
 
Last edited:

jamescox

Senior member
Nov 11, 2009
644
1,105
136
On the contrary: Zen 4, being later than Zen 3, will most surely be using stacked SRAM. It may even have more stacked layers. Super interesting architectural topologies to come!
Up to 4-high stack settings have already been seen in a BIOS. Zen 4 needs a completely different board for DDR5 and PCI Express 5, so the settings may be for a Zen 3 derivative. TSMC supports up to 12 layers though, which would be absolutely ridiculous: that would be something like 800 MB on a single chiplet if they managed to do it. It would be ridiculously expensive also.
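
A quick back-of-envelope check on those capacity figures, assuming the announced 32 MB base-die L3 and 64 MB per stacked layer (any layer count beyond one is pure speculation):

```python
# Stacked-L3 capacity per chiplet, assuming 32 MB on the base die
# (Zen 3 CCD) and 64 MB per stacked SRAM layer (the announced
# V-cache die). Layer counts beyond 1 are speculative.
BASE_L3_MB = 32
PER_LAYER_MB = 64

for layers in (1, 4, 12):
    total_mb = BASE_L3_MB + layers * PER_LAYER_MB
    print(f"{layers:>2} layer(s): {total_mb} MB per chiplet")

#  1 layer(s):  96 MB  (the demoed V-cache config)
#  4 layer(s): 288 MB  (the 4-high BIOS setting)
# 12 layer(s): 800 MB  (TSMC's stated maximum)
```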
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Thinking of Zen 4 WRT the stacked cache, and the expectation/rumour of more cores, up to 12C per CCD. The expectation, and therefore the refutation, being that 6C on either side of the cache would be a weird layout.

But what if this now means that there is no cache at all on the compute silicon? Rather than 4C/cache/4C, could it be 4C/4C/4C with two blocks (side by side, not stacked higher) of 64 MB on top? 12C with 128 MB of cache per CCD. That would be a nice step up over Zen 3+ in and of itself.

The silicon would be roughly the same physical size, and you'd still only need two chiplets rather than an odd count of three (if they didn't go 6C/cache/6C). Possible?

I don't think we have much real information on Zen 4. The Zen 3 with stacked cache seems to have come as a surprise, so AMD is keeping leaks to a minimum.

It would be interesting if they went to 12 cores and 48 MB of cache on die. That would probably just be adding another 2 cores on each side of the cache. There are a lot of reasons to keep a full-size L3 on die. They would want to use the die without any stacking for lower-end parts. It also allows some more opportunities for salvage if something goes wrong in the stacking process. If you don't have any L3 on die and something goes wrong in the stacking, then the whole thing is probably garbage. I don't know if they could sell a cache-less, Celeron-like part.

I have been expecting them to stick with the 8-core CCD. The cores may be quite a bit larger due to significantly larger FP vector units. If they also increase L1 and/or L2 sizes, then that, along with the larger FP units, may explain how the die area is used. They may also increase L3 size again if the transistor budget allows for it on 5 nm.
 

gdansk

Platinum Member
Feb 8, 2011
2,962
4,493
136
I think that they are unlikely to reduce the L3 on Zen 4. It is on 5 nm, so the cache area will be even smaller. They still want to stack the cache over top of the L2 and L3 for thermal reasons. I expect the on-die L3 to stay at the same 32 MB even with the process shrink. The L2 is likely to get larger. They are likely to add a huge amount of FP processing power, and vector FP units are huge compared to integer units. Die size may be similar or slightly larger due to the added FP units and L2 cache. They would still want the chiplet to be usable without any stacked cache die for lower-end parts; regressing to 16 or 8 MB would not be competitive.

Edit: actually, it may be plausible that Zen 4 chiplets will go up to 64 MB on the base die at 5 nm. The lower-end market is likely to be covered by APUs, probably with smaller on-die caches. The chiplets will serve higher-end desktop parts, servers, and workstations.
I outlined previously why AMD should move some L3 off the CCD as soon as possible: the CCD uses a process optimized for logic, while stacked cache will use a process optimized for SRAM (achieving nearly 2x the MB/mm²). You can get more cache in the same total die space by moving the cache out of the CCD. Sure, stacking adds some cost, but when designs are approaching 50% L3 cache by area, a reduction in L3 area could pay off quickly. Or allow them to build a more competitive design to stay ahead of their aggressive ARM competitors. Increasing the CCD L3 size is bad economics. But it's easy, so maybe they will.
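
To put rough numbers on that density argument (the areas below are public estimates, not AMD-confirmed: ~32 MB of on-CCD L3 in roughly 32 mm² of the N7 die, versus 64 MB in the ~36 mm² V-cache die):

```python
# MB per mm^2: logic-optimized CCD L3 vs. SRAM-optimized cache die.
# Both areas are rough public estimates, not AMD-confirmed numbers.
ccd_l3_mb, ccd_l3_mm2 = 32, 32.0   # on-CCD L3 block (estimate)
stack_mb, stack_mm2 = 64, 36.0     # stacked V-cache die (estimate)

ccd_density = ccd_l3_mb / ccd_l3_mm2    # ~1.0 MB/mm^2
stack_density = stack_mb / stack_mm2    # ~1.8 MB/mm^2
print(f"CCD L3:    {ccd_density:.2f} MB/mm^2")
print(f"Cache die: {stack_density:.2f} MB/mm^2")
print(f"Ratio:     {stack_density / ccd_density:.1f}x")  # the "nearly 2x"

# Spending the freed ~32 mm^2 on stacked cache instead would hold
# roughly 32 * 1.8 ≈ 57 MB in the same total silicon area.
```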
 
Last edited:
  • Like
Reactions: Vattila

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
I outlined previously why AMD should move some L3 off the CCD as soon as possible: the CCD uses a process optimized for logic, while stacked cache will use a process optimized for SRAM (achieving nearly 2x the MB/mm²). You can get more cache in the same total die space by moving the cache out of the CCD. Sure, stacking adds some cost, but when designs are approaching 50% L3 cache by area, a reduction in L3 area could pay off quickly. Or allow them to build a more competitive design to stay ahead of their aggressive ARM competitors. Increasing the CCD L3 size is bad economics. But it's easy, so maybe they will.
^Yes! I think it would be a good play to move all of the L3$ onto a vertical stack for Zen 4 (not the L3 controller, though). More room for other SRAM structures on the CCD, plus more logic and wider data paths (well, those aren't taking up space in the floor plan anyway).
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
I outlined previously why AMD should move some L3 off the CCD as soon as possible: the CCD uses a process optimized for logic, while stacked cache will use a process optimized for SRAM (achieving nearly 2x the MB/mm²). You can get more cache in the same total die space by moving the cache out of the CCD. Sure, stacking adds some cost, but when designs are approaching 50% L3 cache by area, a reduction in L3 area could pay off quickly. Or allow them to build a more competitive design to stay ahead of their aggressive ARM competitors. Increasing the CCD L3 size is bad economics. But it's easy, so maybe they will.
That is one reason for moving it off die. There are several reasons for keeping some L3 (probably at least 32 MB) on die. While the density is higher on the cache die, it is still probably going to be cheaper to have a non-stacked base model. If you moved the L3 off die completely, then you would be stacking L3 over top of the CPU cores, which might be a thermal issue. And if something goes wrong in the stacking, both dies are probably garbage if there is no L3 on the base die. I have said all of this previously also. They will get higher density at 5 nm with Zen 4, and the stacked cache chip is likely to remain 7 nm for a while due to wafer shortages. Given the size differences, it may make sense to make the base-die cache larger, so that it covers roughly the same area as the stacked cache die, rather than having the cache die overlap the cores. The 5 nm die will still be relatively small even with the same or a larger L3.

I wouldn't rule out moving the L3 completely off die, but there seem to be a lot of reasons not to do that. Many applications do not respond that much to larger caches, so there will still be a big market for the base version with no stacking.
 

LightningZ71

Golden Member
Mar 10, 2017
1,798
2,156
136
I'm in the camp of AMD keeping the CCX size at 8 cores and retaining at least some L3 cache on the CCD. In my opinion, Zen 4 will likely increase the L2 and gain significant transistor count due to expanded FP resources and possibly additional core widening.

We know that SRAM doesn't scale well, at least on dies that are logic-optimized. I suspect that the CCD L3 stays around 32 MB, and that the N5 CCDs remain about the same size as the N7 dies, perhaps a little smaller. This allows AMD to use a trailing-node SRAM stack over the L3 on the CCD, just like now. Perhaps it will be an SRAM-optimized N6 die by then, with up to 4 stacks of cache, giving a total of 288 MB of L3 per CCD.

While we know that Intel is making noises about integrating HBM on their Foveros products as an L4, I wonder how such a large and significantly faster, lower-latency L3 shakes out against a MUCH larger HBM-based L4?
 

moinmoin

Diamond Member
Jun 1, 2017
5,064
8,032
136
Yeah, considering all of the L2 is right next to the L3 on the chip, they could have just made the V-cache chip slightly bigger and fit extended L2 there as well. The fact that they didn't hints that it's not as easy:
[Image: Zen 3 CCD die shot (pa9c5mu9y2y51.jpg)]

I do wonder, however, if they plan to enlarge the L2? It would add latency, but I think they have to, as it currently peaks at ~1 TB/s (aggregated) for 8 cores, according to AIDA64. That seems like quite the bottleneck compared to the L3.
Note how with Zen 3 AMD unified the L3$, finally making all 32 MB accessible to all cores. Turns out it didn't stop there but was planned to be scalable well beyond that: 96 MB as demoed, 288 MB as assumed for 4 stacks.

L2$ is in a completely different position, being private to each core. With L3$, all the TSVs add up to an advantage for every core; with L2$, one would need massive numbers of TSVs just for the benefit of a single core each.

I'm sure there will be innovative ways to expand on the usability of L2$. But increasing its size through stacking seems very unlikely to me.

Just learned about this cache stacking thing - somehow I missed it :-O It's a big thing, right?
With every major Zen gen AMD found some way to massively scale out something.

Looks like this is what AMD worked on for Zen 3, which would have allowed for 288 MB of L3$ per CCX, adding up to 2304 MB on a 64-core Epyc chip.

The fact it appears this late (likely as a Ryzen-only feature, as server chips need lengthy validation, which is not worth doing mid-gen) tells me the tech likely wasn't fully working back when the Zen 3 generation originally launched. I expect it to be an integral part of Zen 4, if not expanded on significantly further by then.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
One of the main points, apart from extending it onto Zen 4, was that they stated this new cache packaging is a first step. The implications are good. Zen 4 and Zen 5 should be incredible if you've read into their statements as much as I have.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
One of the main points, apart from extending it onto Zen 4, was that they stated this new cache packaging is a first step. The implications are good. Zen 4 and Zen 5 should be incredible if you've read into their statements as much as I have.
The fact that there is a Zen 3 BIOS showing up to 4-high stacks seems to indicate that we will get 4 stacks on "Milan-X". Desktop Ryzen is probably limited to a 1-high stack, though. For Zen 4, we may actually get cores stacked on an interposer. The power for PCI Express 5-speed CPU-to-IO-die interconnects will be high, so getting them stacked on an interposer would reduce power significantly. It is very hard to speculate once chip stacking comes into play, since there are so many possibilities.
 

MadRat

Lifer
Oct 14, 1999
11,946
265
126
Any time you break up data and transmit it in smaller packets, you add time to re-assemble those chunks back into their original state. That's fine for slower access pathways, but if you're aiming for performance, keep data in its original state during transmission. Trace lengths are obviously critical for each cache level. As trace length increases, the time it takes to transmit and receive signals across that trace increases accordingly. The lengths and sheer number of traces also create timing skew in parallel interfaces.

I find the notion that creating onboard memory attached to the main board - in such a way as to minimize trace distances - offers 'no advantage' ridiculous. Apple obviously saw the advantage. Game consoles do it. Video cards do it. It cuts latency and decreases access time. You cannot compete with its latency by simply serializing data over longer trace lengths; re-assembly of data always pays an additional overhead cost. SSDs moved to serial interfaces because they simply do not require the latency figures that something like an L3 cache needs to improve performance.

We also shouldn't assume that a serialized device transmits through the same narrow set of traces all the way to the CPU. We often see localized caching or root hosts close to the device, attached to wider interfaces. So from the device to its local cache it is a narrow interface by nature; from that device-attached cache onward, data moves in stages via a bus architecture. The M.2 interface is 67 pins. The M.2 connects to PCIe x4. The PCIe x4 connects to the PCIe x16 controller. The PCIe x16 interface is 82 pins. Pin counts generally increase with each incremental interface closer to the CPU; these less complex interfaces eventually reach the CPU through a common southbridge interface.

A bottleneck eventually emerges due to the limits of data transmission between the CPU and southbridge. The solution wasn't more pins on the main board, but rather cutting parts out of the southbridge and moving them to the CPU. Pin counts continue to increase to feed the components that moved: not only did they shrink, but the decrease in trace lengths dramatically improved their timings, so they require more lines of information to feed them. So even as the southbridge lost functions, pin counts continued to grow. CPU pin counts grew, too. Last-generation AM4 is 1,331 pins. Next-generation Zen 4 chips are going for 1,718 pins.

AMD is talking about 64 MB L3 caches stacked to form up to 192 MB in total on their next-generation chips. Caches are great, but eventually you need access to RAM. While integrated memory will never be as fast as L3 cache, an aggressive move to decrease trace lengths to the first 8-16 GB of RAM would certainly help keep these future CPUs fed. To suggest it would be just another thing to break on the main board ignores the fact that main boards grow more complex as they progress. Systems that use integrated graphics have the most to gain.
 
Last edited:

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Apple is just like consoles and the old Ford Model T: you can have it in any color as long as it's black.

X86 caters to a far larger market, from dirt cheap to expensive servers.
Stacking L3 solves part of that problem. From a market-segmentation perspective, and for the total cost structure of the entire portfolio, it's just brilliant. Especially combined with the chiplet methodology.

Which customer, at the same total cost, is better served by a special memory solution than by stacked L3, chiplets and, soon, DDR5?
 

Gideon

Golden Member
Nov 27, 2007
1,774
4,145
136
The fact it appears this late (likely as a Ryzen-only feature, as server chips need lengthy validation, which is not worth doing mid-gen) tells me the tech likely wasn't fully working back when the Zen 3 generation originally launched. I expect it to be an integral part of Zen 4, if not expanded on significantly further by then.

Yeah, that's IMO the most impressive thing about it. Look at TSMC's own slides about the timeline of this feature (Chip on Wafer):

[Image: TSMC slide with the Chip-on-Wafer timeline (B6e12vG.png)]


The qualification will only be done by Q4 2021! This means AMD is using this feature the second it becomes available at the foundry. And mind you, that's the first foundry in the world to provide it, as Intel will only get to the equivalent 3D tech in 2023, it seems.

So AMD managed to mass-produce a CCD in Q4 2020 with tons of TSVs baked in for this feature, which would only become available a year later. And they managed to do it in near-total secrecy (hidden in plain sight)!

And I thought Zen 3, while a very nice upgrade, looked a bit underwhelming compared to the paradigm shifts of Zen 1 and Zen 2. Well, not anymore :D
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Any time you break up data and transmit it in smaller packets, you add time to re-assemble those chunks back into their original state. That's fine for slower access pathways, but if you're aiming for performance, keep data in its original state during transmission. Trace lengths are obviously critical for each cache level. As trace length increases, the time it takes to transmit and receive signals across that trace increases accordingly. The lengths and sheer number of traces also create timing skew in parallel interfaces.

I find the notion that creating onboard memory attached to the main board - in such a way as to minimize trace distances - offers 'no advantage' ridiculous. Apple obviously saw the advantage. Game consoles do it. Video cards do it. It cuts latency and decreases access time. You cannot compete with its latency by simply serializing data over longer trace lengths; re-assembly of data always pays an additional overhead cost. SSDs moved to serial interfaces because they simply do not require the latency figures that something like an L3 cache needs to improve performance.

We also shouldn't assume that a serialized device transmits through the same narrow set of traces all the way to the CPU. We often see localized caching or root hosts close to the device, attached to wider interfaces. So from the device to its local cache it is a narrow interface by nature; from that device-attached cache onward, data moves in stages via a bus architecture. The M.2 interface is 67 pins. The M.2 connects to PCIe x4. The PCIe x4 connects to the PCIe x16 controller. The PCIe x16 interface is 82 pins. Pin counts generally increase with each incremental interface closer to the CPU; these less complex interfaces eventually reach the CPU through a common southbridge interface.

A bottleneck eventually emerges due to the limits of data transmission between the CPU and southbridge. The solution wasn't more pins on the main board, but rather cutting parts out of the southbridge and moving them to the CPU. Pin counts continue to increase to feed the components that moved: not only did they shrink, but the decrease in trace lengths dramatically improved their timings, so they require more lines of information to feed them. So even as the southbridge lost functions, pin counts continued to grow. CPU pin counts grew, too. Last-generation AM4 is 1,331 pins. Next-generation Zen 4 chips are going for 1,718 pins.

AMD is talking about 64 MB L3 caches stacked to form up to 192 MB in total on their next-generation chips. Caches are great, but eventually you need access to RAM. While integrated memory will never be as fast as L3 cache, an aggressive move to decrease trace lengths to the first 8-16 GB of RAM would certainly help keep these future CPUs fed. To suggest it would be just another thing to break on the main board ignores the fact that main boards grow more complex as they progress. Systems that use integrated graphics have the most to gain.
I am still not sure what you are arguing for or against here. Game consoles mostly use unified memory with just graphics memory; that is because the GPU needs the bandwidth, and it is cheaper not to have separate system memory. I don't think I said anything about trace lengths being irrelevant. They are definitely important for L1 cache, but signal propagation speed becomes less important as you move out through the memory hierarchy. I doubt that the absolute value of the signal propagation time in a DDR connection is any significant fraction of the latency; most of the latency is the sense amps reading the DRAM cells. Since it is a parallel interface, keeping all traces matched in length is important, and keeping traces short for signal integrity is important. I once tried to get an old CPU to work attached through a ribbon cable in an electronics lab. It was nearly impossible, since the ribbon cable was essentially an antenna for interference.
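
As a rough sanity check on that propagation-time point (ballpark textbook numbers, not measurements of any particular board):

```python
# Trace propagation delay vs. DRAM access latency (ballpark values).
velocity_cm_per_ns = 15.0   # ~c/2 in FR-4 stripline (assumption)
trace_cm = 10.0             # typical CPU-to-DIMM trace run (assumption)
cas_ns = 13.75              # DDR4-3200 CL22: 22 cycles / 1.6 GHz

prop_ns = trace_cm / velocity_cm_per_ns
print(f"One-way trace delay: {prop_ns:.2f} ns")         # ~0.67 ns
print(f"Share of CAS latency: {prop_ns / cas_ns:.0%}")  # ~5%
```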

Serial interfaces have length restrictions also. It has been an issue with PCI Express 4, and it will be more of an issue with PCI Express 5. Trace length matters in different ways, though. In a serial connection each lane is independent, with an embedded clock, so the traces do not need to be length-matched. As for accessing DRAM via serial interfaces, we already do that. The higher latency of PCI Express devices doesn't really have that much to do with the physical-level interface; it comes from it being a high-level, software-driven protocol. On AMD systems, and on Intel systems with multiple processors, memory is already accessed over a serialized interface. For AMD, the connection between the CPU chiplet and the IO die with the memory controller is Infinity Fabric, which is based on a physical layer very similar, if not identical, to PCI Express. It is a multi-layer protocol, but it is all in hardware. In multi-socket systems, Intel or AMD, remote memory accesses go over a serialized link. AMD has their system matched very well: an AMD IO die has a 128-bit (2x 64-bit channel) memory controller, the connection to the CPU chiplets is 32 bits wide but runs at more than 4x the memory clock, and the connection to another socket in Epyc is 16 bits wide but at more than 8x the memory clock.
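
The width-versus-clock matching can be sanity-checked with DDR4-3200 as the example (illustrative numbers, not a spec sheet):

```python
# Bandwidth matching: a narrow, fast link can carry the same traffic
# as a wide, slower parallel bus. DDR4-3200 used as the example.
def gbps(bus_bits: int, mtps: int) -> float:
    """Peak bandwidth in GB/s for a bus of given width and transfer rate."""
    return bus_bits / 8 * mtps / 1000

dram = gbps(bus_bits=128, mtps=3200)   # 2 x 64-bit channels, DDR4-3200
link = gbps(bus_bits=32, mtps=12800)   # 32-bit link at 4x the transfer rate
print(f"Dual-channel DDR4-3200: {dram:.1f} GB/s")   # 51.2 GB/s
print(f"32-bit link at 4x rate: {link:.1f} GB/s")   # 51.2 GB/s
```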

Anyway, as you move further out the memory hierarchy, latency goes up and speed goes down. You also generally handle larger and larger chunks. The virtual memory system generally works with 4K pages. I saw something about Apple using 16K pages on their systems, which might give them a performance advantage. AMD treats HBM memory on their GPUs as cache with a fully virtualized memory system, but I don't know what the page or cache-line size is. If you have a massive amount of DRAM on die, then you probably don't need more external DRAM; the bandwidth would be wasted. The consoles are already like this: instead of on-package DRAM, they have unified graphics memory backed up by a fast SSD. If someone made a laptop like that, then I guess it is something I might consider, if it had a lot of graphics memory. We generally don't get an APU with graphics memory, though; they have slow system memory.
 

Gideon

Golden Member
Nov 27, 2007
1,774
4,145
136
For AMD, the connection between the CPU chiplet and the IO die with the memory controller is Infinity Fabric, which is based on a physical layer very similar, if not identical, to PCI Express. It is a multi-layer protocol, but it is all in hardware. In multi-socket systems, Intel or AMD, remote memory accesses go over a serialized link
Speaking of that interface, I've wondered ever since Zen 2 why AMD doesn't offer optional memory compression for servers. I mean, they already have excellent memory encryption; they might as well add some flags to enable some sort of fast hardware compression algorithm as well (with similar granularity to the encryption).

Considering how bandwidth-limited, yet often compressible, some server workloads are, wouldn't it make sense to offer the ability to do that for certain customers?

It would seem natural to do the compression/decompression on entering/leaving L3, and obviously in a granular way (some memory ranges for some processes, not the entire memory; AFAIK memory encryption already supports that level of granularity).
It would add latency, sure, but for many bandwidth-bound tasks it would be totally worth it. Not only would it increase effective memory bandwidth, it would also save on Infinity Fabric traffic between the I/O die and the CCDs (and therefore potentially save power).
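
A toy model of that tradeoff (the 1.5x compression ratio and 10 ns penalty are invented parameters, purely to show the shape of the win, not anything based on a real AMD implementation):

```python
# Toy model: opt-in memory compression trades a per-access latency
# penalty for higher effective bandwidth. All parameters invented.
raw_bw = 51.2      # GB/s, dual-channel DDR4-3200
ratio = 1.5        # assumed average compression ratio
extra_ns = 10.0    # assumed (de)compression latency per access

effective_bw = raw_bw * ratio
print(f"Effective bandwidth: {effective_bw:.1f} GB/s at +{extra_ns} ns/access")

# Streaming 100 GB of compressible data: ~2.0 s of memory traffic
# uncompressed vs. ~1.3 s compressed. A bandwidth-bound task wins
# despite the added per-access latency.
print(f"100 GB stream: {100 / raw_bw:.1f} s -> {100 / effective_bw:.1f} s")
```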

There is probably a catch there somewhere, but I don't immediately see it.
 
  • Like
Reactions: Vattila and Tlh97

eek2121

Diamond Member
Aug 2, 2005
3,100
4,398
136
Can't say for certain. All I know is that it was never Zen 3+ (which is Rembrandt-only), that it was on AM4, and that OEMs had some reason to think it was like the XT SKUs. Wish I knew what that reason was, but the only thing it could be is that they were provided scores, because as far as OEMs are concerned, Warhol never existed outside of the roadmaps it was taken off.

The funny thing is that these chips will likely debut as “XT” chips. I doubt AMD is going to name these chips Ryzen 6000 (it would limit their ability to upsell). What I expect them to do is announce right before Alder Lake drops (or right after, but it is fun to rain on your competitor’s parade). I suspect we will only see 2, or possibly 3, SKUs at most.

Then again, the amount of secrecy surrounding AMD products as of late has been unreal, so who knows?
 

uzzi38

Platinum Member
Oct 16, 2019
2,705
6,427
146
The funny thing is that these chips will likely debut as “XT” chips. I doubt AMD is going to name these chips Ryzen 6000 (it would limit their ability to upsell). What I expect them to do is announce right before Alder Lake drops (or right after, but it is fun to rain on your competitor’s parade). I suspect we will only see 2, or possibly 3, SKUs at most.

Then again, the amount of secrecy surrounding AMD products as of late has been unreal, so who knows?
It's possible. It's also possible that AMD is somewhat misleading OEMs, not telling the whole truth about their upcoming products in an attempt to hide details. They did the same thing with Navi 21 and Navi 22, after all: 2-3 weeks before the release of both, they provided a new driver to OEMs that added an additional 10-20% performance. Which is why the N21 == 3070 'leaks' existed; they were technically real, but they got jebaited hard.

I'm not entirely sure why AMD is playing this game, because their competitors aren't idiots. They are able to make their own estimations of where these chips should land performance-wise. But here we are nonetheless.

What I'm trying to get at with all this is simple: be very hesitant to immediately trust rumours regarding AMD's products.
 
  • Like
Reactions: Tlh97

eek2121

Diamond Member
Aug 2, 2005
3,100
4,398
136
Possibly a red herring.
Warhol is real.

AMD has multiple teams working on multiple designs at any given time.

Zen 4 development ran in parallel with Zen 5 development, for example. (I say ran, because Zen 4 is likely finished.)

Everything on the roadmap appears to be dead on so far. It is quite old, however.
 
  • Like
Reactions: Tlh97

jpiniero

Lifer
Oct 1, 2010
15,223
5,768
136
The funny thing is that these chips will likely debut as “XT” chips. I doubt AMD is going to name these chips Ryzen 6000 (it would limit their ability to upsell).

Not if they rebrand. There are going to be some chips where the V-cache fails, or the packaging goes wrong, or something; if the product is still otherwise good, they are going to want to sell those. It's going to get announced roughly when mobile Rembrandt does, and Rembrandt will for sure be 6xxx.
 

maddie

Diamond Member
Jul 18, 2010
4,881
4,951
136
Yeah, that's IMO the most impressive thing about it. Look at TSMC's own slides about the timeline of this feature (Chip on Wafer):

[Image: TSMC slide with the Chip-on-Wafer timeline (B6e12vG.png)]


The qualification will only be done by Q4 2021! This means AMD is using this feature the second it becomes available at the foundry. And mind you, that's the first foundry in the world to provide it, as Intel will only get to the equivalent 3D tech in 2023, it seems.

So AMD managed to mass-produce a CCD in Q4 2020 with tons of TSVs baked in for this feature, which would only become available a year later. And they managed to do it in near-total secrecy (hidden in plain sight)!

And I thought Zen 3, while a very nice upgrade, looked a bit underwhelming compared to the paradigm shifts of Zen 1 and Zen 2. Well, not anymore :D
So Zen 4 is Q4 2022 at the earliest if it uses this tech from the beginning, barring any late hiccups.
 

Timmah!

Golden Member
Jul 24, 2010
1,513
832
136
That is one reason for moving it off die. There are several reasons for keeping some L3 (probably at least 32 MB) on die. While the density is higher on the cache die, it is still probably going to be cheaper to have a non-stacked base model. If you moved the L3 off die completely, then you would be stacking L3 over top of the CPU cores, which might be a thermal issue. And if something goes wrong in the stacking, both dies are probably garbage if there is no L3 on the base die. I have said all of this previously also. They will get higher density at 5 nm with Zen 4, and the stacked cache chip is likely to remain 7 nm for a while due to wafer shortages. Given the size differences, it may make sense to make the base-die cache larger, so that it covers roughly the same area as the stacked cache die, rather than having the cache die overlap the cores. The 5 nm die will still be relatively small even with the same or a larger L3.

I wouldn't rule out moving the L3 completely off die, but there seem to be a lot of reasons not to do that. Many applications do not respond that much to larger caches, so there will still be a big market for the base version with no stacking.

If you think they'll keep some of the cache on die because they can't stack it on top of the cores, wouldn't the stacked cache need to be on the same process as the cache on the die? Because otherwise the datapaths (or whatever to call them) would not line up? Using a building analogy: the upper floor would use a different modular system than the bottom floor, so the columns would not sit on top of each other.
 

naukkis

Senior member
Jun 5, 2002
903
786
136
If you think they'll keep some of the cache on die because they can't stack it on top of the cores, wouldn't the stacked cache need to be on the same process as the cache on the die? Because otherwise the datapaths (or whatever to call them) would not line up? Using a building analogy: the upper floor would use a different modular system than the bottom floor, so the columns would not sit on top of each other.

Yeah, the TSMC roadmap only shows CoW stacking of 7 nm on 7 nm, with 5 nm on 5 nm coming later. It would be more difficult to stack dies from different nodes at such a fine pitch; different nodes would either need compatible layering patterns or libraries designed from the ground up for cross-node compatibility. So far TSMC's roadmaps don't show that any such thing is planned; they only show CoW stacking of same-node chiplets.
 
  • Like
Reactions: Tlh97