I wonder if they could just turn it on its head. The thermal density is in the cores; you want them to touch the IHS directly. The current V-cache and its dummy spacer prevent that.
Of course we are not privy to many things, but cache underneath does seem like a better scenario from a thermal-constraint perspective, and concrete proof of that is in MI300, where the IC is underneath. Without doing a detailed analysis, my intuition tells me that the cooling situation will be better for the cache-below design and not comparable to the present V-cache.
FinFlex can save space on a monolithic layout, but don't the optimized libraries for the 6nm cache provide the same saving?
As the third major enhancement of TSMC’s 5nm family, N4P will deliver an 11% performance boost over the original N5 technology and a 6% boost over N4. Compared to N5, N4P will also deliver a 22% improvement in power efficiency as well as a 6% improvement in transistor density. In addition, N4P lowers process complexity and improves wafer cycle time by reducing the number of masks.
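As a quick sanity check on those figures (my own back-of-envelope arithmetic, assuming the percentages are simple multiplicative factors at iso-conditions, which TSMC does not spell out), the two performance numbers imply N4 sits roughly 5% above N5:

```python
# Back-of-envelope check of the quoted N4P figures (assumed interpretation:
# the percentages are relative multipliers at iso-conditions).
n4p_vs_n5_perf = 1.11   # N4P: +11% performance over N5 (quoted)
n4p_vs_n4_perf = 1.06   # N4P: +6% performance over N4 (quoted)

# Implied N4 vs N5 performance, if both figures are consistent:
n4_vs_n5_perf = n4p_vs_n5_perf / n4p_vs_n4_perf
print(f"Implied N4 vs N5 performance: +{(n4_vs_n5_perf - 1) * 100:.1f}%")  # ~ +4.7%
```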
Regarding his 3rd point (if you can, do not put an additional BEOL layer above your hottest sections): backside power delivery is going to throw a real wrench into things there. It makes it impossible to thin the die down to the transistors.
In terms of thermal density, it's worth remembering that it's not just an individual core you have to worry about, but what happens when you put a bunch of them together. Right now, a CCX benefits from having a bunch of cache to space cores out a bit. Even if you keep the thermal density for a single core constant, if you move all that cache to another die and pack those cores closer together, you have a problem.
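To make that concrete, here is a rough sketch with illustrative numbers (the per-core power, core area, and the share of CCX area taken by L3 are assumptions, not AMD figures): even at constant per-core thermal density, removing the cache area and packing the same cores into a smaller footprint raises the average power density the cooler has to absorb.

```python
# Toy power-density comparison: same cores, with and without on-die L3
# spacing them apart. All numbers are illustrative assumptions.
cores = 8
core_power_w = 12.0          # assumed per-core power under load
core_area_mm2 = 4.0          # assumed per-core area
l3_area_mm2 = 36.0           # assumed L3 + spacing area inside the CCX

ccx_power = cores * core_power_w          # cache power ignored for simplicity
with_cache_area = cores * core_area_mm2 + l3_area_mm2
without_cache_area = cores * core_area_mm2

print(f"avg density with L3 spacing:    {ccx_power / with_cache_area:.2f} W/mm^2")
print(f"avg density, cores packed only: {ccx_power / without_cache_area:.2f} W/mm^2")
```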
I'm not following. The Z direction to the heat spreader is, by far, the smaller distance compared to the X/Y spacing between the cores. If the thermal density is similar, why do you see a problem?
You would still have, possibly large, L2 and L1 caches on the compute die, so that could give them some area for stacking L3 cache on top without covering cores. It seems like it makes more sense to stack cache on top of IO though, which is part of why I was thinking of stacked cache on top of MCD-type chips or IO dies. They may have it underneath in MI300. It could be an SoIC stacked cache chip and then an IO die / bridge die using micro-solder ball tech. Trying to extend SoIC across multiple, overlapping chips is a thing, but it is likely very difficult to do and may have many more issues with thermal expansion failure.

Using non-SoIC stacking would still allow the 4 separate gpu / cpu stacks in MI300 to have relatively high connectivity for sharing all 128 GB of HBM across the whole device. Such connections are the same level of bandwidth as HBM3 connections, and they can use even wider links to allow connection to the 2 HBM stacks in each multi-chip gpu device. It would still appear as at least 4 separate gpus, but that mostly does not matter for CDNA HPC devices.

For MI300, and possibly some Zen 5 variants, I am thinking that there may be a separate IO die bridge chip that extends under neighboring sets of chiplets rather than a monolithic base die for cache and IO. TSMC has a lot of different stacking options though, and a few, like the infinity fabric fan-out used for RDNA3/MCD connectivity, do not seem to map to a specific TSMC-developed technology.
While I am at it, here is one patent application I discovered:
DIRECT-CONNECTED MACHINE LEARNING ACCELERATOR
[Attachment: figure from the patent application]
Seems to match this one, in the slide from Victor at FAD22. There are EPYC chips with XDNA coming next.
[Attachment: slide from Victor's FAD22 presentation]
I am wondering if the stuff from MI300 will arrive for mainstream CPUs/APUs: SLC stacked on the IOD, shareable between CPU and GPU. Maybe I am just getting ahead of myself.
How will it be implemented? Backside power delivery implies processing the wafer from both sides, while buried power rails imply changing the order of processing steps so the metal layers are laid down first. If it is processed from both sides then you're right, that's going to be a problem for anything that requires thinning the die.

Proper backside power delivery involves building an entire metal stack on the other side of the wafer. IIRC, it might even involve building it on a separate wafer and bonding them together, but I'm not 100% on the specifics.
Cores don't just heat themselves in isolation. They heat their neighbors directly, and they also heat the heat spreader their neighbors rely on to cool themselves. It's a compounding effect. Basically, you're just further condensing the heat from the cooling system's perspective, and that will be a challenge.
The use of dark silicon is at the micro level, not a chip-wide technique. The same considerations do not apply to using the cache as a whole as a heat buffer/equalizer, since Z <<<<< X,Y.
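A crude 1D conduction estimate supports the Z <<<<< X,Y point (the geometry and the bulk-silicon conductivity are assumptions; a real package adds TIM, IHS and spreading effects): thermal resistance scales as length over cross-sectional area, and the vertical path is short and wide while the lateral path is long and thin.

```python
# 1D conduction resistance R = L / (k * A), silicon k ~ 150 W/(m*K).
# Purely illustrative geometry: a 4 mm x 4 mm core with ~0.5 mm of silicon
# above it in Z, versus conducting the same heat ~4 mm sideways through a
# 0.1 mm thick die to reach a neighbor.
k_si = 150.0                       # W/(m*K), bulk silicon (assumed)
core_area = 4e-3 * 4e-3            # m^2, footprint toward the IHS
die_cross_section = 4e-3 * 0.1e-3  # m^2, 4 mm wide x 0.1 mm thick die (assumed)

r_vertical = 0.5e-3 / (k_si * core_area)        # K/W, up toward the IHS
r_lateral = 4e-3 / (k_si * die_cross_section)   # K/W, sideways to a neighbor

print(f"R vertical ~ {r_vertical:.3f} K/W, R lateral ~ {r_lateral:.1f} K/W")
```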
I think stacking cache on top of IO makes more sense for an L4/system cache/memory-side cache, but L3 (assuming the existing hierarchy) will still need to be close to the cores.
Edit: also, what do you mean by "thin the die down to the transistors"?

If you're willing as a producer to pay the cost (e.g. 10900k), or as an enthusiast to "die lap" it yourself, you can remove some of the excess carrier silicon to reduce the gap between the transistors and the heat spreader. You can do this because there's no active logic or wires on that side of the wafer. But with backside metal, there's wiring on both sides, so this is no longer an option.
For N3E-based Zen 5, they can use FinFlex to minimize the size of the L3 SRAM arrays if need be.
No, stacking the L3 on top of the IO die doesn't make more sense; that would kill latency, and the excellent low latency of the L3$ despite its size is one of Zen's major advantages.
It may not be L3, although if L2 is much larger and shared, then L3 can move farther out. My initial thought for Zen 5 was a larger L2 shared between multiple cores; I had been thinking 2 to 4 cores, but rumors have been saying all 8. Then possibly a much smaller or stacked-only L3. If cache is stacked on top of the memory controllers, then it may cache for the local controller, so possibly an L4 or memory-side cache. I think it would be doable, especially if a stacked interconnect is used. The latency penalty for a stacked interconnect is small, but I agree that it would not be suitable for the L3 as designed in Zen 4; that is, it is not a replacement for SoIC-stacked cache. Zen 5 will likely have a completely different cache hierarchy.
I agree with some here, but I guess we have different understandings of heat flux behavior and how to cool objects.
I would not assume that Zen 5 will have a similar cache hierarchy to Zen 4; I expect it to be radically different. A lot depends on the yield of SoIC stacking. If yield is very high, then stacking on (or under) expensive compute die makes sense. For very expensive products, like MI300, it may still make sense, although the longevity also has to be very good for such a product. If yield is an issue, then stacking on a cheap MCD or IO die seems to make more sense, as long as they have a stacked or other high-bandwidth interface (infinity fabric fan-out) to the compute chiplets and the cache hierarchy is designed to make good use of L4-level latency/bandwidth. If they lose a few MCDs, it isn't nearly as expensive as losing a compute die. It seems to me that the latency would still be very low, but the bandwidth would not be close to SoIC-stacked cache. If Zen 5 has massively more powerful FP processing, then the requirements may look more like a GPU, where they are already using MCDs.
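One way to reason about whether an L4-class cache on an MCD or IO die pays off is a simple average-memory-access-time model; the sketch below uses placeholder latencies and hit rates, not any leaked Zen 5 numbers.

```python
# Toy AMAT (average memory access time) model for an added L4 / memory-side
# cache on an MCD or IO die. All latencies and hit rates are assumptions.
def amat(levels):
    """levels: list of (hit_latency_ns, hit_rate); the last level must hit."""
    total, reach = 0.0, 1.0
    for latency, hit_rate in levels:
        total += reach * hit_rate * latency   # fraction of accesses served here
        reach *= (1.0 - hit_rate)             # fraction that falls through
    return total

baseline = [(1.0, 0.95), (4.0, 0.60), (12.0, 0.50), (80.0, 1.0)]   # L1/L2/L3/DRAM
with_l4  = [(1.0, 0.95), (4.0, 0.60), (12.0, 0.50), (30.0, 0.60), (80.0, 1.0)]

print(f"AMAT without L4: {amat(baseline):.2f} ns")
print(f"AMAT with L4:    {amat(with_l4):.2f} ns")
```

Whether the extra level helps depends entirely on how often the L3 misses and how much cheaper the L4 hit is than DRAM, which is exactly the hierarchy-design question being debated here.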
IMHO the N31 solution pretty much exactly matches InFO-R, or am I missing something?
Maybe. I remember an interview with an AMD engineer where it sounded like they took something that was more intended for mobile and turned it into a super high bandwidth connection, so perhaps a standard TSMC tech, but not the original intended use.
/edit: The discussion of the last couple of posts is a good example of why I am a member here - thanks to everyone involved 😊
I think I know which quote you mean. Maybe this was meant for a less technically versed audience. To my knowledge, InFO in general was first used for mobile SoCs from Apple and others.
Zen 3 V-Cache can support multiple stacked SRAM layers; they probably did not do it because of excessive latency and power.
Also, Andreas Schilling showed the BIOS config for the 8-Hi stacks.
Since the 5800X3D is one of the highest-selling AM4 chips, if not the highest, I would say the economics are slowly working out. There is no dearth of 5800X3D supply at all; the chip can be had for less than 350 bucks. Many current and upcoming products are chiplet-based with advanced packaging: 5800X3D, 7XX0X3D, Milan-X, Genoa-X, RDNA3, MI300, STX (supposedly), RDNA4 (supposedly).
This year TF-AMD's biggest packaging facility will come online in Malaysia. TSMC's AP2C came online in 2H2022.
I'm assuming that the extra V-cache stacks would just increase the associativity of the cache. That's probably something that cloud providers would want on their server chips, but I'm not sure it adds a lot for consumers outside of a few niche workloads. I'm not even sure how much extra scaling games would get out of the next 64 MB of cache.
It's definitely something that might be really useful for CDNA GPUs as it lets them work more effectively with larger data sets. With AMD moving the infinity cache to separate chiplets, they could stack multiple v-cache layers on each chiplet to create a massive amount of cache. Even with the current setup you could theoretically create a GPU with 3 GB of cache assuming 64 MB layers stacked 8-hi.
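The capacity arithmetic behind that, and behind the associativity assumption above, is straightforward; the stack height, layer size, and MCD count below are just the figures floated in these posts, not a confirmed configuration.

```python
# Capacity math for hypothetically stacked Infinity Cache (figures taken from
# the discussion above, not from any confirmed product).
layer_mb = 64          # one V-cache layer
stack_height = 8       # "8-hi" stack
mcds = 6               # assumed number of MCD-style cache chiplets per GPU

per_stack_mb = layer_mb * stack_height            # 512 MB per chiplet stack
total_gb = per_stack_mb * mcds / 1024             # 3.0 GB across the GPU
print(f"{per_stack_mb} MB per stack, {total_gb:.1f} GB total")

# If extra stacks keep the number of sets fixed, the added capacity shows up
# as added ways (the associativity assumption above): e.g. a 32 MB, 16-way
# base L3 growing to 96 MB would go to 48 ways under that assumption.
base_mb, base_ways = 32, 16
new_ways = base_ways * (base_mb + 64) // base_mb
print(f"{base_mb + 64} MB at fixed sets -> {new_ways} ways")
```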
Has AMD confirmed that it is SoIC? It has obvious bandwidth advantages, but other types of stacking allow them to mix chips on different processes or even made at different fabs. Also, they are not mutually exclusive. They can have an SoIC stack under or on top of chips using other stacking methods.
I think I already said something about CXL possibly being used for MI300. They definitely need more off package memory. With 128 GB of HBM cache, CXL as backing store would be fine, so it is unclear if SH5 will have DDR memory controllers. The connectivity to neighboring GPUs needs to be very high, so they may use the same infinity fabric fan out used for RDNA3/MCDs or some bridge chips.
I don’t know if they need the Zen 4 chiplets to be different. Making a version specifically for MI300 is unlikely, but there is some possibility of a refresh version meant for stacking that is more widely used. I think it may also be possible to use a standard Zen 4 die. As I said somewhere earlier, they could have designed connectivity over the IO area for CPU chiplets. It would not be SoIC, but CPUs just do not need that much bandwidth. In fact, the bandwidth from micro-solder ball type stacking is sufficient for most things and probably overkill for CPUs.
AMD has been making more and more specialized chips since they have a much larger budget now, but they still need to make things in a modular and reusable manner to keep costs down. They can’t compete with Intel without being very efficient in their usage of expensive silicon. Nvidia seems to be able to afford to keep making giant, monolithic dies for now, but they will need to go the chiplet route too, just for scalability. There may be a single base die per GPU in MI300, but I think it may make sense to have separate cache and IO dies embedded under the compute die. Perhaps an IO / switch die extends under neighboring GPU devices to provide the connectivity.
From the economic perspective, it sounds intriguing indeed. It is quite compelling when you think about the Zen 4 CCD, with 35% of the die area composed of L3 and the 2x GMI holding all those T-coils and related circuitry for the line drivers.
However, if they use the silicon bridge interconnect, they could cut back the CCD die size by a few mm², if they could replace the T-coils/line drivers with a more parallel interface, like HBI for instance, at even lower power than the Cu RDL fan-outs. I keep seeing patents like these below, so maybe they will do it after all; but if not, these Cu RDL fan-out links should still help cut out substrate-based routing, at 1/7 of the 2 pJ/bit of the 36 Gbps GMI3. They can increase the link width and speed to reduce the penalty of going to the L3 on another CCD.
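To get a feel for what that energy-per-bit gap means, here is a rough power calculation using the figures quoted above (2 pJ/bit for GMI3 versus roughly 1/7 of that for the Cu RDL fan-out); the 100 GB/s traffic level is just an example, not a measured workload.

```python
# Interconnect power from energy-per-bit: P = E_bit * bits_per_second.
# Energy figures are the ones quoted in the post; traffic is an example value.
gmi3_pj_per_bit = 2.0
fanout_pj_per_bit = 2.0 / 7.0          # the "1/7 of the 2 pJ" claim
traffic_gbytes_per_s = 100.0           # example sustained traffic on the link

bits_per_s = traffic_gbytes_per_s * 8e9
gmi3_w = gmi3_pj_per_bit * 1e-12 * bits_per_s
fanout_w = fanout_pj_per_bit * 1e-12 * bits_per_s
print(f"GMI3:    {gmi3_w:.2f} W per 100 GB/s of traffic")
print(f"Fan-out: {fanout_w:.2f} W per 100 GB/s of traffic")
```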
Nevertheless, I am doubtful at this point for Zen 5 because, if you read the semiengineering.com literature on hybrid bonding, instantaneous heat from a 5 GHz+ core attached to a lower-clocked, mostly unaccessed SRAM will create a temperature differential that degrades the hybrid bond over long periods of time. Additionally, it seems the thermal cycles also degrade the TSVs themselves on the base die, due to continuous expansion and contraction of the die. This seems to be a known problem.
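The underlying mechanism is thermomechanical: a temperature swing across materials with different coefficients of thermal expansion produces a cyclic mismatch strain, and repeated cycles fatigue the bond and TSV interfaces. A minimal sketch with typical textbook material values (assumed, not data for any specific AMD stack):

```python
# Thermal-mismatch strain: epsilon = delta_CTE * delta_T.
# CTE values are typical textbook numbers, not measurements of this stack.
cte_si = 2.6e-6        # 1/K, silicon (typical)
cte_cu = 17e-6         # 1/K, copper pads / TSV fill (typical)
delta_t = 60.0         # K, assumed swing between idle and a 5 GHz+ burst

strain = (cte_cu - cte_si) * delta_t
print(f"Mismatch strain per cycle: {strain:.2e}")   # ~8.6e-4 per cycle

# Fatigue life falls steeply as the per-cycle strain grows (Coffin-Manson-type
# behaviour), which is why frequent large swings, not one-off heating, are the worry.
```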
Or even 8 sockets, in theory. The Frontier supercomputer used half-sized nodes, each taking half of the rack width, with 2 of them together forming 1 unit in the rack. So 8 MI250s fit in the square footage of one node position in the rack, and 8 MI300s in a node could be a possibility. 8x MI300 would be 1 TB of HBM memory among them.
I think we can be 100% sure that the Zen 4 chiplets for MI300 will be a different CCD. They will have all their communications go through TSVs, they will not need any L3, and they will likely have 12 cores and a slightly larger size.
But with that in mind, and knowing AMD really likes to re-use dies, if there is a new Zen 4 CCD for MI300 it will be interesting to see whether any other implementation outside of MI300 emerges...