The term Beachfront is actually used?
I'm more familiar with the term "shoreline" for that same thing. I think it works better, tbh.

Seems so, and IMHO it is quite a fitting term: just as with real beachfront, it is rather limited on a die.
"Wouldn't a unified large L2 cache be the next evolution in cache performance? A big slab of cache in the middle and cores placed on all sides of it?"

Cache topologies do exist for a reason. If it were so easy to make a big, fast, unified cache, why not make a 1 GB unified L1 cache?
TANSTAAFL. One of my favorite hard sci-fi terms.
😆
"This whole thing sounds like complete nonsense. Doubling the cores? +30% IPC? Unified L2 cache? Unified stacked L3 for everything? Yeah, I'm calling BS."

Unified, stacked L3 does not have to be on a separate die. It can still be stacked on top of the CCDs.
"Cache topologies do exist for a reason. If it were so easy to make a big, fast, unified cache, why not make a 1 GB unified L1 cache?"

Agreed. I have doubts about it.
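To put toy numbers on the "no free lunch" point above, here is a minimal sketch using the standard average-memory-access-time formula. All latencies and miss rates are made-up, illustrative assumptions, not figures for any real design; the point is only that flattening the hierarchy into one huge array trades hit rate for hit latency on every access.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Classic average memory access time: hit time + miss rate * miss penalty (cycles)."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical multi-level hierarchy: small fast L1, mid-sized L2, big L3, then DRAM.
hierarchy = amat(4, 0.10, amat(14, 0.30, amat(50, 0.10, 300)))

# Hypothetical single giant unified cache: excellent hit rate, but the huge SRAM
# array pushes the hit latency way up for *every* access.
giant_unified = amat(40, 0.01, 300)

print(f"small L1 + hierarchy   : ~{hierarchy:.1f} cycles on average")
print(f"one giant unified cache: ~{giant_unified:.1f} cycles on average")
```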
"By the time Zen 5 is released …"

MI300 A/CPU. 1H23.
"Agreed. I have doubts about it."

The MI300 CPUs have access to the IFC, so they have another level of cache in the pipe.
But it could be just an algorithmic implementation where one core's L2 can act as a potential victim cache for another core's L2 that is not too busy.
Something like what IBM described.
I think it is more likely that this type of sharing will be implemented at the L3 level, where growth in size is less restricted because it is going into the third dimension (stacking). So there will be more to share. The sharing would be across the entire CPU, not just a single CCD.
There are many potential technologies on the near-term horizon to make communication between IOD-CCD and CCD-CCD possible and increasingly efficient.
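Since the "one core's L2 acting as a victim cache for another" idea is easiest to see in code, here is a toy Python model of just the lookup policy: local L2 first, then peer L2s treated as a shared victim space (in the spirit of what IBM described), then memory. Every structure and parameter here is an illustrative assumption, not AMD's or IBM's actual mechanism.

```python
from collections import OrderedDict

class L2Cache:
    """A tiny LRU cache standing in for one core's private L2."""
    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()              # address -> data, kept in LRU order

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)        # refresh LRU position
            return self.lines[addr]
        return None

    def insert(self, addr, data):
        victim = None
        if addr not in self.lines and len(self.lines) >= self.capacity:
            victim = self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[addr] = data
        return victim                            # (addr, data) or None

def access(core_id, addr, l2s, memory):
    """Check the local L2, then peer L2s acting as a shared victim space, then memory."""
    local = l2s[core_id]
    if local.lookup(addr) is not None:
        return "local L2 hit"
    for peer_id, peer in enumerate(l2s):
        if peer_id != core_id and peer.lookup(addr) is not None:
            local.insert(addr, peer.lines[addr])       # pull the line back to the requester
            return f"hit in core {peer_id}'s L2 (acting as victim cache)"
    victim = local.insert(addr, memory[addr])
    if victim is not None:
        # Spill our own victim into a peer L2 (here simply the next core,
        # standing in for "an L2 that is not too busy").
        l2s[(core_id + 1) % len(l2s)].insert(*victim)
    return "miss, filled from memory"

memory = {a: f"line-{a}" for a in range(64)}
l2s = [L2Cache(capacity_lines=4) for _ in range(4)]
print(access(0, 5, l2s, memory))   # miss, filled from memory
print(access(1, 5, l2s, memory))   # hit in core 0's L2 (acting as victim cache)
```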
MI300's release date is too near for a Zen 5 core. With such a complicated design, I would expect Zen 4/4c cores initially. If AMD goes with Zen 4c cores, then we should be expecting 16 Zen 4c cores multiplied by four IOD dies, for 64 CPU cores in total. Pretty competitive compared to NV Grace's 72 ARM cores.
It will have a lot of new stuff related to chiplets and packaging, the culmination of a decade of AMD's Exascale APU vision.
I think there is a possibility that DT and generic Zen 5 server SKUs will carry N4-based CCDs. N4/N5 stacking would already have been fully qualified for a year by then. And it can be fabbed in the US (but probably packaged elsewhere).
- 2.5D EFB for HBM
- 3D stacking CCDs/GCDs on Infinity Cache
- The interesting part to see would be how AMD wires up the communication across the CCXs
- IOD in Infinity Cache or base die?
- Interconnects for a coherent shared LLC, IC <---> IC interconnects.
- The GCD cannot access the CPU L3. I think the GPU and CPU are coherent at the IC.
So far, the only known info from a reliable leaker is that Turin is 600 W cTDP.
EPYC 3rd Gen --> EPYC 4th Gen : +40% TDP and +50% Cores
EPYC 4th Gen --> EPYC 5th Gen : +50% TDP and another +33% Cores (128)?? That would be a bare minimum.
Things look bad because the interconnects eat a lot of power. It is not just data lines that are needed; there are numerous control and clock lines as well, which are not counted in the usual pJ/bit calculations.
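Here is the back-of-envelope arithmetic behind those generation-to-generation lines, plus a rough feel for why pJ/bit matters. The deltas are the ones stated above; the base Milan figures are the well-known 64-core/280 W parts, and the pJ/bit and bandwidth values are placeholder assumptions purely for illustration.

```python
# Watts per core across generations, using the deltas quoted in the post.
gens = {
    "EPYC 3rd Gen (Milan)":         {"tdp_w": 280, "cores": 64},
    "EPYC 4th Gen (Genoa, cTDP)":   {"tdp_w": 400, "cores": 96},   # ~+40% TDP, +50% cores
    "EPYC 5th Gen (Turin, rumour)": {"tdp_w": 600, "cores": 128},  # +50% TDP, +33% cores (minimum)
}
for name, g in gens.items():
    print(f"{name}: {g['tdp_w'] / g['cores']:.2f} W/core at {g['cores']} cores")

# Why interconnect energy matters: even a modest pJ/bit figure adds up once
# terabytes per second move between chiplets. Both numbers below are assumed,
# round values, not measurements of any real link.
pj_per_bit = 2.0              # assumed package-level link energy
traffic_gb_s = 1000           # assumed aggregate die-to-die traffic, GB/s
watts = pj_per_bit * 1e-12 * traffic_gb_s * 1e9 * 8
print(f"~{watts:.0f} W just to move {traffic_gb_s} GB/s at {pj_per_bit} pJ/bit (data lines only)")
```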
The MI300 CPUs have access to the IFC, so they have another level of cache in the pipe.
There is an interesting patent for memory prefetching in the LLC for a CPU, which could work well with HBM due to the many concurrent channels. Likely for MI300 too.
https://www.freepatentsonline.com/y2022/0318151.html
20220318151 - METHOD AND APPARATUS FOR A DRAM CACHE TAG PREFETCHER
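For context, here is a toy illustration of the general idea behind a DRAM cache tag prefetcher, not the specific mechanism claimed in the patent. When DRAM/HBM is used as a cache, the tags are too large to keep entirely in SRAM, so even determining hit or miss can cost a DRAM access; prefetching the tags for sets you expect to touch next hides that cost. The latencies and the trivial "next set" predictor below are invented for the example.

```python
DRAM_LATENCY = 100   # illustrative cost of reading a tag from the DRAM-resident tag array
SRAM_LATENCY = 5     # illustrative cost when the tag is already in the on-die tag cache

class TagStore:
    def __init__(self, prefetch_next=False):
        self.resident = set()             # set indices whose tags are currently held in SRAM
        self.prefetch_next = prefetch_next
        self.cost = 0

    def access(self, set_index):
        if set_index in self.resident:
            self.cost += SRAM_LATENCY     # cheap hit/miss determination
        else:
            self.cost += DRAM_LATENCY     # had to fetch the tag from DRAM first
            self.resident.add(set_index)
        if self.prefetch_next:
            self.resident.add(set_index + 1)   # toy "next set" tag prefetch

baseline, prefetcher = TagStore(), TagStore(prefetch_next=True)
for s in range(32):                       # a simple streaming access pattern
    baseline.access(s)
    prefetcher.access(s)
print("tag lookup cost, no prefetch:  ", baseline.cost)    # 32 * 100
print("tag lookup cost, with prefetch:", prefetcher.cost)  # 100 + 31 * 5
```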
I could see them increasing the core count by sharing an L2 cache between 2 cores. This requires the same number of connections to the L3 cache as Zen 4, but bandwidth may be an issue. That allows 16 cores in 1 CCX without blowing up the L3 cache connectivity. Sharing 8 cores in one L2 seems like too much; I would expect the latency to be too high. Having 4 cores share an L2 might be plausible, with a 16-core CCX being 4 L2s with 4 cores each. Zen 5 is supposed to be a new implementation, so such changes are plausible.
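Just to restate that connection-count argument numerically, a trivial sketch; the configurations are only the ones discussed above, nothing official:

```python
def l3_clients(cores, cores_per_l2):
    """Number of L2 slices the L3 fabric has to serve."""
    assert cores % cores_per_l2 == 0
    return cores // cores_per_l2

for cores, share in [(8, 1), (16, 2), (16, 4)]:
    print(f"{cores} cores, {share} core(s) per L2 -> {l3_clients(cores, share)} L2 clients on the L3")
```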
RDNA3 does not seem to use any 3D stacking for Infinity Cache. The MCDs seem to just be mounted next to the GCD without any silicon-based interconnect. It is some kind of advanced packaging tech, possibly cheaper than EFB. I have seen the MCD die size reported as 37 mm2, which is close to the V-cache die size, so perhaps they can stack more cache on top of the MCD? I was wondering why the cache in the MCD was only 16 MB. I guess that is because the GDDR6 memory controller is on the MCD. There is a possibility that they would want to use MCDs with HBM also. HBM is still DRAM, so it has DRAM-like latencies even though the bandwidth is a lot higher, so having SRAM cache may still be useful.
"so perhaps they can stack more cache on top of the MCD?"

Possibly, but doing so would require the rest of the silicon to have the support silicon the 5800X3D uses to balance out the package height for cooling, which could possibly reduce clock speed even after the trial and error of their first-gen X3D design experience.
"Making a hybrid Zen 4 / Zen 5 device seems to have a lot of issues. If they want to transparently switch between the low-power Zen 4 derivative and the Zen 5 derivative, then they need to be very close, possibly sharing an L2. If they are on the same die, then they don't get the process-tech advantage of using a low-power process for the Zen 4 derivative. Stacking may make sense, but that also has problems. They could possibly put a low-power Zen 4 die (maybe with extra L3) on top, but that would increase thermal issues; it may still be doable if they are not active at the same time. Placing the high-power Zen 5 die on top makes sense from a thermal perspective, but they may have problems passing that much power up the stack through SoIC-style connections. It may be okay for mobile, where the Zen 5 would be lower power. If they can put a Zen 5 die on top, then it would make sense to essentially use an APU with low-power Zen 4 cores and just stack a Zen 5 die on top, to be used only when the performance is needed. The top die doesn't actually need much modification; it is the bottom die that needs to be thinned significantly and needs to have TSVs."

I think that is an interesting idea. Having a high-power and a low-power core connected to the same L2 cache seems a good implementation of the big-little concept. Getting data from the big to the small core or vice versa would be so much easier and possibly lower power than having to transfer to an entirely different core complex.
It isn't exactly new. Apple processors have quite a few cores sharing the same L2 cache. I think they split the low-power and high-power cores to use separate L2 caches in the later versions, but all cores are visible to the OS. They also use relatively large L1 caches. Some applications like a large L2 cache. When VR headsets first came out, I remember people being surprised that some old Core 2 Quad processors were on the "sufficient" list. They had large L2 caches and no L3, which actually performed quite well. The next-gen Nehalem processors went to 256 KB L2 and various L3 sizes, which may have hurt performance on some applications without other improvements. It gets complicated to optimize the cache hierarchy for different applications, especially between single-threaded and multi-threaded. I don't know if anyone has tested Zen 4's SMT efficiency. I was wondering how much the larger L2 helps with SMT.
There'd be all sorts of issues to iron out, but the idea has merit.
"Possibly, but doing so would require the rest of the silicon to have the support silicon the 5800X3D uses to balance out the package height for cooling, which could possibly reduce clock speed even after the trial and error of their first-gen X3D design experience."

Reduce clock speed of what? The MCDs seem to be just adjacent to the GPU in RDNA3, not under it. The MCD is just cache and a memory controller, so placing another die on top probably is not a thermal issue. The MCD also has a very similar die size to the V-cache die, so likely no filler pieces of silicon would be needed. It seems like overkill to use a V-cache die, though. The MCDs are high bandwidth, probably similar to a stack of HBM, but not as high as L3 caches. HBM is 8 x 128-bit channels, and the connection to the GPU may be split into channels also, so perhaps it would need to look a lot like an L3 cache.
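For a rough sense of what "similar to a stack of HBM" means in numbers, a quick sketch; the per-pin data rates are assumed round values, not the spec of any particular product mentioned here:

```python
def hbm_stack_gbytes_per_s(channels=8, bits_per_channel=128, gbps_per_pin=6.4):
    """Peak bandwidth of one HBM stack: channels * width * data rate, converted bits -> bytes."""
    return channels * bits_per_channel * gbps_per_pin / 8

print(f"8 x 128-bit at 6.4 Gb/s/pin: ~{hbm_stack_gbytes_per_s():.0f} GB/s per stack")
print(f"8 x 128-bit at 3.2 Gb/s/pin: ~{hbm_stack_gbytes_per_s(gbps_per_pin=3.2):.0f} GB/s per stack")
```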
"The MCD is just cache and a memory controller, so placing another die on top probably is not a thermal issue. The MCD also has a very similar die size to the V-cache die, so likely no filler pieces of silicon would be needed."

Unlikely; they would not have made the substrate of the base SKUs such that there is a significant height difference between the MCD and the GCD just so that V-cache would fit in later.
This doesn't fit with how the die stacking works, or at least my understanding of it. If they make an MCD not meant for stacking, then they would just not thin the wafer down at all, and it would have the same z-height as the other dies. It would have some space wasted on unused TSVs. They don't have to make two different versions; they just skip the thinning step if it isn't going to be used for stacking. This is the same as the CPU dies: they have the TSVs, but they are not thinned if they are not going to be used for stacking.
Even a few millimeters could reduce thermal conductivity significantly, which would not be ideal at power densities like this.
Possible solutions are:
#1. The substrate of the V-cache SKU raises the GCD instead of adding filler silicon, or filler is added below and power/data routed through it, while the GCD at the top has maximum exposure to the IHS/vapor chamber.
#2. They make the MCDs for the V-cache variant thinner; this is something Samsung has been doing for each successive V-NAND generation as they increase the number of layers. I think HBM must be doing this too, to keep stack height low with many dies.
Or I'm just blowing it all out of proportion and the vapor chamber equalises the height for base SKUs 😅. It could work as long as the power density in the MCD isn't too high, but this is SRAM cache and not DRAM, so I have no idea what that would be.
"Stacking whole wafers may mean that this process will be cheaper than the CPU V-cache stacking. If something goes wrong, you lose MCD and V-cache chips rather than CPU or GPU dies. The MCD and V-cache chips are also made on a cheaper process than the compute die. If they get multiple layers working, then perhaps we could see MCDs with huge caches. Six MCDs with 4 working layers would be 1.5 GB of added cache."

Chip-on-Wafer vs Wafer-on-Wafer. As you say, the latter has interesting potential cost advantages, but it does come with its own challenges. It means you're required to match the amount of silicon on each die instead of scaling each more granularly. Do the economics of V-Cache work if you need to have 2-3x worth of it? No idea. Or they could put something else in there. L2? Fabric? But that would mean you can't have a non-stacked option at all.
The die sizes I have heard for V-cache chips and the MCD are about the same. The V-cache chips are 64 MB of cache and nothing else. The MCD is 16 MB of cache + two GDDR6 memory channels. I was wondering why it was only 16 MB, since this seemed very small and is actually a regression; the previous generation had 128 MB of Infinity Cache on die. Six MCDs are only 6 * 16 = 96 MB. If they were trying to match the die size, then this could partially explain why it is only 16 MB in the MCD. It wouldn't have been that much larger with 32 MB, since it is made on a process optimized for cache density; 64 MB is only around 40 mm2 for the V-cache die, so only around 10 mm2 per 16 MB.
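Writing that area arithmetic out explicitly; the die sizes are the rough figures quoted in this thread, not official numbers:

```python
VCACHE_MB, VCACHE_MM2 = 64, 40      # ~64 MB V-cache die at roughly 40 mm^2 (as quoted above)
MCD_MB, MCD_MM2, MCD_COUNT = 16, 37, 6

mm2_per_16mb = VCACHE_MM2 / (VCACHE_MB / 16)
print(f"~{mm2_per_16mb:.0f} mm^2 per 16 MB on the cache-optimised process")
print(f"Infinity Cache across {MCD_COUNT} MCDs: {MCD_MB * MCD_COUNT} MB (vs 128 MB on-die last gen)")
print(f"going from 16 to 32 MB per MCD would add only ~{mm2_per_16mb:.0f} mm^2 to a {MCD_MM2} mm^2 die")
```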
I have a way to bring it back on topic. I think the big question about AMD's use of V-Cache over the next few years (CPUs, GPUs, whatever) is not wafer concerns, but rather a question of how much manufacturing capacity (tools deployed, etc.) they have available. Clearly they've been very cautious in their deployment of it so far. If they could go "all in" on a major product, that would open new possibilities, but we're probably a little ways out from that yet.
If they implemented this, then low end parts can use the base die with no stacking or salvage parts where the stacking went wrong. CDNA parts are likely expensive enough to justify die stacking. This is about moving as much as possible off the compute die made on the expensive process. The RDNA3 compute die is a lot smaller with the cache and memory controllers all moved off die. They will need a lot of vcache wafers if they use multiple layers. For expensive products, like CDNA devices or very high end RDNA, it may be worth it. The yield on the vcache die is probably very high. They are very small and cache can use duplicated blocks to make it resistant to defects. If they are going to use these across many products, then they will be cranking out a lot of these, so possibly good economy of scale.
It is still unclear how this will be used for MI300 and/or Zen 5, so it is kind of off topic for a Zen 5 thread.
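On the "yield on the V-cache die is probably very high" point, a minimal sketch with the standard Poisson yield model; the defect density and the larger die areas are generic, assumed values, not foundry data, and cache dies additionally tolerate defects through redundant blocks, which this simple model ignores:

```python
import math

def poisson_yield(area_mm2, defects_per_cm2=0.1):
    """Fraction of dies expected to be defect-free: Y = exp(-A * D0)."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)

for name, area_mm2 in [("V-cache / MCD-sized die", 40),
                       ("mid-size compute die", 300),
                       ("large monolithic die", 600)]:
    print(f"{name} (~{area_mm2} mm^2): ~{poisson_yield(area_mm2) * 100:.0f}% defect-free")
```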
It seems to me that the MCDs, with their 16 MB of Infinity Cache and 2 GDDR6 controllers, are essentially a copy-and-paste of the same blocks from the 6500 XT die. The needed cells are already implemented on N6, so just reuse them in this application. Use six of them and boom, your needs are covered.
"If they could go 'all in' on a major product, that would open new possibilities, but we're probably a little ways out from that yet."

I think MI300 (rather than Zen 5) is the product where AMD is going "all in".