Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

Exist50 · Nov 20, 2022

maddie said:
The term Beachfront is actually used?

BorisTheBlade82 said:
Seems so, and IMHO it is a quite fitting term - just as real Beachfront it is rather limited on a die.

I'm more familiar with the term "shoreline" for that same thing. Think it works better, tbh.

BorisTheBlade82 · Nov 20, 2022

igor_kavinski said:
Wouldn't a unified large L2 cache be the next evolution in cache performance? A big slab of cache in the middle and cores placed on all sides of it?

Cache topologies do exist for a reason. If it was so easy to do a big, fast and unified cache, why not make a 1Gb unified L1 cache?

maddie · Nov 20, 2022

BorisTheBlade82 said:
Cache topologies do exist for a reason. If it was so easy to do a big, fast and unified cache, why not make a 1Gb unified L1 cache?

TANSTAAFL. One of my favorite hard Sci-Fi terms.

BorisTheBlade82 · Nov 20, 2022

maddie said:
TANSTAAFL. One of my favorite hard Sci-Fi terms.

😆
I have to admit, I needed to look up that abbreviation.

Joe NYC · Nov 20, 2022

Exist50 said:
This whole thing sounds like complete nonsense. Doubling the cores? +30% IPC? Unified L2 cache? Unified stacked L3 for everything? Yeah, I'm calling BS.

Unified, stacked L3 does not have to be on a separate die. It can still be stacked on top of CCDs.

If there is an order of magnitude (+) improvement in IOD-CCD links, such as in RDNA3 or an SoIC_H interposer, allowing a mesh between CCDs, unified L3 could be implemented algorithmically, in being able to access L3 of any CCD

By the time Zen 5 is released, TSMC / AMD probably have full potential of stacking (several layers) worked out, say 512 MB per CCD, and it would be leaving a lot of performance on the table if 8 or 12 CCDs can't share it.

Joe NYC · Nov 20, 2022

BorisTheBlade82 said:
Cache topologies do exist for a reason. If it was so easy to do a big, fast and unified cache, why not make a 1Gb unified L1 cache?

Agreed. I have doubts abouts.

But it could be just an algorithmic implementation of L2 of one core being a potential victim cache of another core's L2 that is not too busy.

Something like what IBM described.

I think more likely, this type of sharing will be implemented on L3 level, where the growth of the size is less restricted - it is going to 3rd dimension (stacked). So, there will be more to share. The sharing would be across the entire CPU, not just a single CCD.

There are many potential technologies on the near-term horizon to make communication between IOD-CCD and CCD-CCD possible and increasingly efficient.

DisEnchantment · Nov 20, 2022

Joe NYC said:
By the time Zen 5 is released

MI300 A/CPU. 1H23.
It will have many new stuffs related to chiplet and packaging, a culmination of a decade of AMD's Exascale APU vision

2.5D EFB for HBM
3D stacking CCDs/GCDs on Infinity Cache
- The interesting part to see would be how AMD wire up the communication across the CCXs
IOD in Infinity Cache or base die?
Interconnects for coherent shared LLC , IC <---> IC interconnects.
- GCD cannot access CPU L3. I think GPU and CPU coherent at IC.

I think there is a possibility DT and generic Zen 5 server SKUs will carry N4 based CCDs. N4/5 stacking would be already be fully qualified for a year. And it can be fabbed in US (but packaged elsewhere probably).
So far, only known info from reliable leaker is that Turin is 600W cTDP

https://twitter.com/x/status/1453707924338089990

EPYC 3rd Gen --> EPYC 4th Gen : +40% TDP and +50% Cores
EPYC 4th Gen --> EPYC 5th Gen : +50% TDP and another 33% (128 ) Cores?? That would be a bare minimum.
Things look bad because the interconnects ate a lot of power. Not just data lines are needed, there are numerous control and clock lines as well which are not counted in usual pJ/bit calculations.

Joe NYC said:
Agreed. I have doubts abouts.

But it could be just an algorithmic implementation of L2 of one core being a potential victim cache of another core's L2 that is not too busy.

Something like what IBM described.

I think more likely, this type of sharing will be implemented on L3 level, where the growth of the size is less restricted - it is going to 3rd dimension (stacked). So, there will be more to share. The sharing would be across the entire CPU, not just a single CCD.

There are many potential technologies on the near-term horizon to make communication between IOD-CCD and CCD-CCD possible and increasingly efficient.

MI300 CPU have access to IFC. So they have in the pipe another level of cache.
There is an interesting patent for memory prefetching in the LLC for a CPU, which could work well with HBM due to many concurrent channels. Likely for MI300 too.
https://www.freepatentsonline.com/y2022/0318151.html
20220318151 - METHOD AND APPARATUS FOR A DRAM CACHE TAG PREFETCHER

Paul98 · Nov 20, 2022

I want something like 24 cores + 8 core with 3D cache

Tigerick · Nov 21, 2022

DisEnchantment said:
MI300 A/CPU. 1H23.
It will have many new stuffs related to chiplet and packaging, a culmination of a decade of AMD's Exascale APU vision

2.5D EFB for HBM

3D stacking CCDs/GCDs on Infinity Cache

The interesting part to see would be how AMD wire up the communication across the CCXs

IOD in Infinity Cache or base die?

Interconnects for coherent shared LLC , IC <---> IC interconnects.

GCD cannot access CPU L3. I think GPU and CPU coherent at IC.

I think there is a possibility DT and generic Zen 5 server SKUs will carry N4 based CCDs. N4/5 stacking would be already be fully qualified for a year. And it can be fabbed in US (but packaged elsewhere probably).
So far, only known info from reliable leaker is that Turin is 600W cTDP

https://twitter.com/x/status/1453707924338089990
EPYC 3rd Gen --> EPYC 4th Gen : +40% TDP and +50% Cores
EPYC 4th Gen --> EPYC 5th Gen : +50% TDP and another 33% (128 ) Cores?? That would be a bare minimum.
Things look bad because the interconnects ate a lot of power. Not just data lines are needed, there are numerous control and clock lines as well which are not counted in usual pJ/bit calculations.

MI300 CPU have access to IFC. So they have in the pipe another level of cache.
There is an interesting patent for memory prefetching in the LLC for a CPU, which could work well with HBM due to many concurrent channels. Likely for MI300 too.
https://www.freepatentsonline.com/y2022/0318151.html
20220318151 - METHOD AND APPARATUS FOR A DRAM CACHE TAG PREFETCHER

MI300 release date is too near for Zen 5 core. With such complicated design, I would expect Zen 4/4c core initially. If AMD go for Zen 4c core, then we should be expecting 16 Zen 4c cores multiply by four IOD dies total 64 CPU cores. Pretty competitive compared to NV Grace's 72 ARM cores.

yuri69 · Nov 21, 2022

MI300 has CPUID 00A80F00. Phoenix has 00A70F00 and Bergamo most likely has 00AA0F00. This makes it fit to the Zen 3/4 family.

jamescox · Nov 28, 2022

Joe NYC said:
Agreed. I have doubts abouts.

But it could be just an algorithmic implementation of L2 of one core being a potential victim cache of another core's L2 that is not too busy.

Something like what IBM described.

I think more likely, this type of sharing will be implemented on L3 level, where the growth of the size is less restricted - it is going to 3rd dimension (stacked). So, there will be more to share. The sharing would be across the entire CPU, not just a single CCD.

There are many potential technologies on the near-term horizon to make communication between IOD-CCD and CCD-CCD possible and increasingly efficient.

I could see them increasing the core count by sharing an L2 cache between 2 cores. This requires the same number of connections for the L3 cache as Zen 4, but bandwidth may be an issue. That allows 16 cores in 1 CCX without blowing up the L3 cache connectivity. Sharing 8 cores in one L2 seems like too much; I would expect the latency to be too high. Having 4 cores sharing an L2 might be plausible with a 16 core being 4 L2 with 4 cores each. Zen 5 is supposed to be a new implementation, so such changes are plausible.

Making a hybrid Zen 4 / Zen 5 device seems to have a lot of issues. If they want to transparently switch between the low power Zen 4 derivative and the Zen 5 derivative, then they need to be very close, possibly sharing an L2. If they are on the same die, then they don't get the process tech advantage of using a low power process for the Zen 4 derivative. Stacking may make sense, but that also has problems. They could possibly put a low power Zen 4 die (maybe with extra L3) on top, but that would increase thermal issues, but may still be doable if they are not active at the same time. Placing the high power Zen 5 die on top makes sense from a thermal perspective, but they may have problems passing that much power up the stack through SoIC style connections. May be okay for mobile where the Zen 5 would be lower power. If they can put a Zen 5 die on top, then it would make sense to essentially use an APU with low power Zen 4 cores and just stack a Zen 5 die on top, to only be used when the performance is needed. The top die doesn't actually need much modification; it is the bottom die that needs to be thinned significantly and needs to have TSVs.

It would be great if the connectivity used for RDNA3 could also be used in Bergamo, but that would require a different IO die or the current IO die would need to have this interconnect already present but unused. If the current IO die has both, then the fan out version could just bypass the SerDes PHY and use whatever minimal PHY is required for this new fan out interconnect. It would save a lot of power. They could possibly fit four Zen 4c die along each side; they are likely rectangular if they contain two CCX. That may also still allow routing underneath for IO. Perhaps wishful thinking though.

It is unclear how any of this will be used for Zen 5. It doesn't seem like we even know how some of the upcoming Zen 4 products will work. MI300 seems to be Zen 4 with HBM connected to the cpus somehow? Anyone know how that is going to work? I was wondering if the MCD will allow connection to more than one type of memory or if there are more than one version of MCD. If they use MCD, then how would the MCD connect to the Zen 4 chiplets? This makes me wonder if AMD is going to use some special version of HBM with this new fan out interconnect rather than needing to use a bridge die / EFB. Looking at Trento, each compute die needs a lot of interconnect somehow, likely the new fan out connectivity for the adjacent die, but also a number of GMI links for die further away. They wouldn't want the interconnect on the compute die, so perhaps it is just split out into another chiplet rather than stacked in a base die. The current IO die already has a large number of internal switches and repeaters with different latencies, so having is split into separate chips in some manner may not be an issue. Slide 21 shows the internal structure of the Rome/Milan IO die with switches and repeaters:

AMD Chiplet Architecture for High-Performance Server and Desktop Products

AMD Chiplet Architecture for High-Performance Server and Desktop Products - Download as a PDF or view online for free

www.slideshare.net

I thought I posted this already from my phone, but I don't see it...

jamescox · Nov 28, 2022

DisEnchantment said:
MI300 A/CPU. 1H23.
It will have many new stuffs related to chiplet and packaging, a culmination of a decade of AMD's Exascale APU vision

2.5D EFB for HBM

3D stacking CCDs/GCDs on Infinity Cache

The interesting part to see would be how AMD wire up the communication across the CCXs

IOD in Infinity Cache or base die?

Interconnects for coherent shared LLC , IC <---> IC interconnects.

GCD cannot access CPU L3. I think GPU and CPU coherent at IC.

I think there is a possibility DT and generic Zen 5 server SKUs will carry N4 based CCDs. N4/5 stacking would be already be fully qualified for a year. And it can be fabbed in US (but packaged elsewhere probably).
So far, only known info from reliable leaker is that Turin is 600W cTDP

https://twitter.com/x/status/1453707924338089990
EPYC 3rd Gen --> EPYC 4th Gen : +40% TDP and +50% Cores
EPYC 4th Gen --> EPYC 5th Gen : +50% TDP and another 33% (128 ) Cores?? That would be a bare minimum.
Things look bad because the interconnects ate a lot of power. Not just data lines are needed, there are numerous control and clock lines as well which are not counted in usual pJ/bit calculations.

MI300 CPU have access to IFC. So they have in the pipe another level of cache.
There is an interesting patent for memory prefetching in the LLC for a CPU, which could work well with HBM due to many concurrent channels. Likely for MI300 too.
https://www.freepatentsonline.com/y2022/0318151.html
20220318151 - METHOD AND APPARATUS FOR A DRAM CACHE TAG PREFETCHER

RDNA3 does not seem to use any 3D stacking for infinity cache. The MCD seem to just be mounted next to the GCD without any silicon-based interconnect. It is some kind of advanced packaging tech, possibly cheaper than EFB. I have seen the MCD die size reported as 37 mm2, which is close to the v-cache die size, so perhaps they can stack more cache on top of the MCD? I was wondering why the cache MCD was only 16 MB. I guess that is because the GDDR6 memory controller is on the MCD. There is a possibility that they would want to use MCD with HBM also. HBM is still DRAM, so it has DRAM-like latencies even though the bandwidth is a lot higher, so having SRAM cache may still be useful.

The MCD are probably not suitable as L3 cpu cache. The level of connectivity is probably around an order of magnitude lower than the v-cache chips using SoIC. It is unclear how they would connect some Zen 4 chiplets into the MI300 system. Going by Trento, they need a lot of interconnect to other die that may not be directly adjacent, so it seems like they will need GMI links in a base die or other chiplet. They may be able to make it using the infinity link fan out tech, but that requires that the die be adjacent. They cannot be any distance apart. Perhaps they have a chiplet with some IO and 4x GMI for 2 (wide mode) or 4 (narrow mode) cpu chiplets and then either infinity link fan out or more GMI to connect to the gpus. The wide GMI for 4 chiplet Genoa seems odd since the parts listed so far are very low clocked, low-end parts. Is the wide GMI really going to help that much? Perhaps there is another use for wide GMI. That might require another chiplet for connection to HBM and some way to pass through the connection at the same speed as the gpus, so I don't know if that works. A base die seems like a good option, but there might be issues with passing sufficient power up the stack.

At this link: https://www.pcgamer.com/amd-infinity-links-rdna-3/

They say that infinity link is 4x64-bit links. In the Gamers Nexus interview with Sam Naffziger, I believe he said something about it being quad pumped. 4 * 64 * 4 = 1024, which is the width of all of the channels on an HBM stack, so they likely have sufficient bandwidth to connect to HBM through an MCD.

soresu · Nov 28, 2022

jamescox said:
so perhaps they can stack more cache on top of the MCD?

Possibly, but doing so would require the rest of the silicon have those things 5800X3D uses to balance the height of the package out for cooling, which could possibly reduce clockspeed even after the trial and error of their first gen X3D design experience.

Insert_Nickname · Nov 29, 2022

jamescox said:
Making a hybrid Zen 4 / Zen 5 device seems to have a lot of issues. If they want to transparently switch between the low power Zen 4 derivative and the Zen 5 derivative, then they need to be very close, possibly sharing an L2. If they are on the same die, then they don't get the process tech advantage of using a low power process for the Zen 4 derivative. Stacking may make sense, but that also has problems. They could possibly put a low power Zen 4 die (maybe with extra L3) on top, but that would increase thermal issues, but may still be doable if they are not active at the same time. Placing the high power Zen 5 die on top makes sense from a thermal perspective, but they may have problems passing that much power up the stack through SoIC style connections. May be okay for mobile where the Zen 5 would be lower power. If they can put a Zen 5 die on top, then it would make sense to essentially use an APU with low power Zen 4 cores and just stack a Zen 5 die on top, to only be used when the performance is needed. The top die doesn't actually need much modification; it is the bottom die that needs to be thinned significantly and needs to have TSVs.

I think that is an interesting idea. Having a high power and a low power core connected to the same L2 cache seems a good implementation of the big-little concept. Getting data from the big to the small core or vice versa would be so much easier and possibly lower power then having to transfer to an entirely different core complex.

There'd be all sorts of issues to iron out, but the idea has merit.

jamescox · Nov 29, 2022

Insert_Nickname said:
I think that is an interesting idea. Having a high power and a low power core connected to the same L2 cache seems a good implementation of the big-little concept. Getting data from the big to the small core or vice versa would be so much easier and possibly lower power then having to transfer to an entirely different core complex.

There'd be all sorts of issues to iron out, but the idea has merit.

It isn't exactly new. Apple processors have quite a few cores sharing the same L2 cache. I think they split the low power and high power cores to use separate L2 caches in the later versions, but all cores are visible to the OS. They also use relatively large L1 caches. Some applications like a large L2 cache. When VR headsets first came out, I remember people being surprised that some old core 2 quad processors were on the "sufficient" list. They had large L2 caches and no L3, which actually performed quite well. The next gen nehalem processors went to 256 KB L2 and various L3 sizes, which may have hurt performance on some applications without other improvements. It gets complicated to optimize the cache hierarchy for different applications, especially between single threaded and mult-threaded. I don't know if anyone has tested the Zen4 SMT efficiency. I was wondering how much the larger L2 helps with SMT.

Anyway, switching to a completely different CCX would have quite a lot of overhead if it was harware switching not visible to the OS. Stacking could allow the low power cores to share caches very simply, especially if they are not active at the same time. If they are visible to the OS by some preferred core mechanism or something, then haveing two completely separate chiplets makes sense.

jamescox · Nov 29, 2022

soresu said:
Possibly, but doing so would require the rest of the silicon have those things 5800X3D uses to balance the height of the package out for cooling, which could possibly reduce clockspeed even after the trial and error of their first gen X3D design experience.

Reduce clock speed of what? The MCD seem to be just adjacent to the gpu in RDNA3, not under. The MCD is just cache and memory controller, so placing another die on top probably is not a thermal issue. The MCD also has a very similar die size to the v-cache die, so likely no filler pieces of silicon would be needed. It seems like overkill to use a v-cache die though. The MCD are high bandwidth, probably similar to a stack of HBM, but not as high as L3 caches. HBM is 8x128b channels and the connection to the gpu may be split into channels also, so perhaps it would need to look a lot like an L3 cache.

soresu · Dec 1, 2022

jamescox said:
The MCD is just cache and memory controller, so placing another die on top probably is not a thermal issue. The MCD also has a very similar die size to the v-cache die, so likely no filler pieces of silicon would be needed

Unlikely, they would not have made the substrate of the base SKUs so that there was a significant difference between MCD and GCD in height just so that V cache would fit in later.

Even a few millimeters could reduce thermal conductivity significantly, which would not be ideal at power densities like this.

Possible solutions are:

#1. The substrate of the v cache SKU raises the GCD instead of adding filler silicon, or filler is added below and power/data routed through it while the GCD at the top has max exposure to IHS/vapor chamber.

#2. They make the MCDs for the V cache variant thinner, this is something Samsung have been doing with for each successive VNAND generation as they increase the number of layers. I think HBM must be doing this too for low stack height with many dies.

Or I'm just blowing it all out of proportions and the vapor chamber equalises the height for base SKUs 😅 - it could work as long as the power density in the MCD isn't too high, but this is SRAM cache and not DRAM, so I have no idea what that would be.

requesttop · Dec 2, 2022

I don't think the FPU will be widened until there is substantial amount of AVX-512 code out there. Certainly not Zen5, probably not Zen6 or 7 either. If Intel reintroduces it soon, then Zen8 or so would be realistic.

mobdro 2022

jamescox · Dec 6, 2022

soresu said:
Unlikely, they would not have made the substrate of the base SKUs so that there was a significant difference between MCD and GCD in height just so that V cache would fit in later.

Even a few millimeters could reduce thermal conductivity significantly, which would not be ideal at power densities like this.

Possible solutions are:

#1. The substrate of the v cache SKU raises the GCD instead of adding filler silicon, or filler is added below and power/data routed through it while the GCD at the top has max exposure to IHS/vapor chamber.

#2. They make the MCDs for the V cache variant thinner, this is something Samsung have been doing with for each successive VNAND generation as they increase the number of layers. I think HBM must be doing this too for low stack height with many dies.

Or I'm just blowing it all out of proportions and the vapor chamber equalises the height for base SKUs 😅 - it could work as long as the power density in the MCD isn't too high, but this is SRAM cache and not DRAM, so I have no idea what that would be.

This doesn't fit with how the die stacking works, or at least my understanding of it. If they make an MCD not meant for stacking, then they would just not thin the wafer down at all and it would have the same z-height as other die. It would have some space wasted on unused TSVs. They don't have to make two different versions, they just skip the thinning step if it isn't going to be used for stacking. This is the same as the cpu die. They have the TSVs, but they are not thinned if they are not going to be used for stacking.

If they are using it with stacking, then they need to thin the MCD down very thin to expose TSVs; something like 10 to 20 microns thick. Millimeters is huge when talking about die thickness. The top cache die needs to be thinned to match the z-height of the other die, but that would only remove the thickness of the base die, so something like 10 to 20 microns. If they are stacking multiple layers, then all intermediate layers would get thinned down to 10 to 20 microns (or whatever it actually is) to expose the TSVs with the top die thinned sufficiently to match the height. There doesn't need to be any filler silicon on top. There also may not be any filler silicon on the sides since the die may be the same size. Deliberately making it the same size would simplify this process significantly. They could plausibly join wafers without any dicing. With the vcache die for cpus, they have to thin the cpu wafer ridculously thin to expose the TSVs, make a so called "reconstituted" wafer with diced vcache die and filler silicon pieces, and then bond them with hybrid bonding.

Stacking whole wafers may mean that this process will be cheaper than the cpu vcache stacking. If something goes wrong, you lose MCD and vcache chips rather than cpu or gpu die. The MCD and vcache chips are also made on a cheaper process than the compute die. If they get multiple layers working, then perhaps we could see MCD with huge caches. Six MCD with 4 working layers would be 1.5 GB of added cache.

Exist50 · Dec 6, 2022

jamescox said:
Stacking whole wafers may mean that this process will be cheaper than the cpu vcache stacking. If something goes wrong, you lose MCD and vcache chips rather than cpu or gpu die. The MCD and vcache chips are also made on a cheaper process than the compute die. If they get multiple layers working, then perhaps we could see MCD with huge caches. Six MCD with 4 working layers would be 1.5 GB of added cache.

Chip-on-Wafer vs Wafer-on-Wafer. As you say, the latter has interesting potential cost advantages, but it does come with its own challenges. Means you're required to match the amount of silicon on each die instead of more granular scaling for each. Do the economics of V-Cache work if you need to have 2-3x worth of it? No idea. Or they could put something else in there. L2? Fabric? But that would mean you can't have a non-stacked option at all.

jamescox · Dec 6, 2022

Exist50 said:
Chip-on-Wafer vs Wafer-on-Wafer. As you say, the latter has interesting potential cost advantages, but it does come with its own challenges. Means you're required to match the amount of silicon on each die instead of more granular scaling for each. Do the economics of V-Cache work if you need to have 2-3x worth of it? No idea. Or they could put something else in there. L2? Fabric? But that would mean you can't have a non-stacked option at all.

The die sizes I have heard for vcache chips and MCD are about the same. The vcache chips are 64 MB of cache and nothing else. The MCD is 16 MB of cache + two GDDR6 memory channels. I was wondering why it was only 16 MB since this seemed very small and is actually a regression. Previous generation had 128 MB of infinity cache on die. Six MCD is only 6 * 16 = 96 MB. If they were trying to match the die size, then this could partially explain why it is only 16 MB in the MCD. It wouldn't have been that much larger with 32 since it is made on a process optimized for cache density; 64 MB is only around 40 mm2 for the vcache die, so only around 10 mm2 per 16 MB.

If they implemented this, then low end parts can use the base die with no stacking or salvage parts where the stacking went wrong. CDNA parts are likely expensive enough to justify die stacking. This is about moving as much as possible off the compute die made on the expensive process. The RDNA3 compute die is a lot smaller with the cache and memory controllers all moved off die. They will need a lot of vcache wafers if they use multiple layers. For expensive products, like CDNA devices or very high end RDNA, it may be worth it. The yield on the vcache die is probably very high. They are very small and cache can use duplicated blocks to make it resistant to defects. If they are going to use these across many products, then they will be cranking out a lot of these, so possibly good economy of scale.

It is still unclear how this will be used for MI300 and/or Zen 5, so it is kind of off topic for a Zen 5 thread.

Exist50 · Dec 6, 2022

jamescox said:
The die sizes I have heard for vcache chips and MCD are about the same. The vcache chips are 64 MB of cache and nothing else. The MCD is 16 MB of cache + two GDDR6 memory channels. I was wondering why it was only 16 MB since this seemed very small and is actually a regression. Previous generation had 128 MB of infinity cache on die. Six MCD is only 6 * 16 = 96 MB. If they were trying to match the die size, then this could partially explain why it is only 16 MB in the MCD. It wouldn't have been that much larger with 32 since it is made on a process optimized for cache density; 64 MB is only around 40 mm2 for the vcache die, so only around 10 mm2 per 16 MB.

If they implemented this, then low end parts can use the base die with no stacking or salvage parts where the stacking went wrong. CDNA parts are likely expensive enough to justify die stacking. This is about moving as much as possible off the compute die made on the expensive process. The RDNA3 compute die is a lot smaller with the cache and memory controllers all moved off die. They will need a lot of vcache wafers if they use multiple layers. For expensive products, like CDNA devices or very high end RDNA, it may be worth it. The yield on the vcache die is probably very high. They are very small and cache can use duplicated blocks to make it resistant to defects. If they are going to use these across many products, then they will be cranking out a lot of these, so possibly good economy of scale.

It is still unclear how this will be used for MI300 and/or Zen 5, so it is kind of off topic for a Zen 5 thread.

I have a way to bring it back on topic

. I think the big question about AMD's use of V-Cache over the next few years (CPUs, GPUs, whatever) is not wafer concerns, but rather a question of how much manufacturing capacity (tools deployed, etc.) they have available. Clearly they've being very cautious in their deployment of it so far. If they could go "all in" on a major product, that would open new possibilities, but we're probably a little ways out from that yet.

LightningZ71 · Dec 6, 2022

jamescox said:
The die sizes I have heard for vcache chips and MCD are about the same. The vcache chips are 64 MB of cache and nothing else. The MCD is 16 MB of cache + two GDDR6 memory channels. I was wondering why it was only 16 MB since this seemed very small and is actually a regression. Previous generation had 128 MB of infinity cache on die. Six MCD is only 6 * 16 = 96 MB. If they were trying to match the die size, then this could partially explain why it is only 16 MB in the MCD. It wouldn't have been that much larger with 32 since it is made on a process optimized for cache density; 64 MB is only around 40 mm2 for the vcache die, so only around 10 mm2 per 16 MB.

If they implemented this, then low end parts can use the base die with no stacking or salvage parts where the stacking went wrong. CDNA parts are likely expensive enough to justify die stacking. This is about moving as much as possible off the compute die made on the expensive process. The RDNA3 compute die is a lot smaller with the cache and memory controllers all moved off die. They will need a lot of vcache wafers if they use multiple layers. For expensive products, like CDNA devices or very high end RDNA, it may be worth it. The yield on the vcache die is probably very high. They are very small and cache can use duplicated blocks to make it resistant to defects. If they are going to use these across many products, then they will be cranking out a lot of these, so possibly good economy of scale.

It is still unclear how this will be used for MI300 and/or Zen 5, so it is kind of off topic for a Zen 5 thread.

It seems to me that the MCDs with the 16MB of iCache and 2 GDDR6 controllers is essentially a copy and paste of the same blocks off of the 6500XT die. The needed cells are already implemented on N6, so just reuse them in this application. Use six of them and boom, your needs are covered.

I do think that we will see some of this tech carry over to the desktop and server Zen5 designs.

Joe NYC · Dec 7, 2022

Exist50 said:
I have a way to bring it back on topic . I think the big question about AMD's use of V-Cache over the next few years (CPUs, GPUs, whatever) is not wafer concerns, but rather a question of how much manufacturing capacity (tools deployed, etc.) they have available. Clearly they've being very cautious in their deployment of it so far. If they could go "all in" on a major product, that would open new possibilities, but we're probably a little ways out from that yet.

I think Mi300 (rather than Zen 5) is the product where AMD is going "all in".

Since, according to rumors and AMD Financial Analyst Day presentation, the compute modules placed on the base die may be interchangeable, it might be possible that the Mi300 approach may overtake other dedicated implementations in CPU (desktop, server) APU, GPU.

The basic approach of Mi300 is rumored to be a large (360 mm2?) N6 die, connected to HBM modules and stacked on top of the base die, there would be 2 compute units on N5 node:

Joe NYC · Dec 14, 2022

There has been a lot of talk in these forums how to overcome the limitations of connecting chiplets through SerDes and organic substrate.

AMD has used:
- EFB - Mi250
- RDL - RDNA3
- 3D hybrid bond - Zen 3 V-Cache

Now, Mark Papermaster floated this new concept of EFB + Hybrid Bond:

Making Sure AMD Has The Complete Tech Package

The chief technology officers in the tech world have done their time in the engineering trenches, writing their papers, getting their patents, and helping

www.nextplatform.com

Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Golden Member

Platinum Member

Senior member

Diamond Member

Senior member

Golden Member

Golden Member

Golden Member

Diamond Member

Senior member

Senior member

Senior member

Senior member

Platinum Member

Diamond Member

Senior member

Senior member

Platinum Member

Junior Member

Senior member

Platinum Member

Senior member

Platinum Member

Golden Member

Golden Member

Golden Member