I wonder if they could just turn it on its head. The thermal density is in the cores; you want them to touch the IHS directly. The current V-cache and its dummy spacer prevent that.
Of course we are not privy to many things, but cache underneath does seem like a better scenario from a thermal-constraint perspective, and concrete proof of that is in MI300, where the IC is underneath. Without doing a detailed analysis, my intuition tells me that the cooling situation will be better for the cache-below design and not comparable to the present V-cache.
FinFlex can save space on a monolithic layout, but don't the optimized libraries for the 6nm cache provide the same saving?
As the third major enhancement of TSMC’s 5nm family, N4P will deliver an 11% performance boost over the original N5 technology and a 6% boost over N4. Compared to N5, N4P will also deliver a 22% improvement in power efficiency as well as a 6% improvement in transistor density. In addition, N4P lowers process complexity and improves wafer cycle time by reducing the number of masks.
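As a quick sanity check on those figures (my own back-of-envelope arithmetic, assuming the percentages are simple multiplicative factors at iso-conditions, which TSMC does not spell out), the two performance numbers imply N4 sits roughly 5% above N5:

```python
# Back-of-envelope check of the quoted N4P figures (assumed interpretation:
# the percentages are relative multipliers at iso-conditions).
n4p_vs_n5_perf = 1.11   # N4P: +11% performance over N5 (quoted)
n4p_vs_n4_perf = 1.06   # N4P: +6% performance over N4 (quoted)

# Implied N4 vs N5 performance, if both figures are consistent:
n4_vs_n5_perf = n4p_vs_n5_perf / n4p_vs_n4_perf
print(f"Implied N4 vs N5 performance: +{(n4_vs_n5_perf - 1) * 100:.1f}%")  # ~ +4.7%
```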
Regarding his 3rd point (if you can, do not put an additional BEOL layer above your hottest sections): backside power delivery is going to throw a real wrench into things there. It makes it impossible to thin the die down to the transistors.
In terms of thermal density, it's worth remembering that it's not just an individual core you have to worry about, but what happens when you put a bunch of them together. Right now, a CCX benefits from having a bunch of cache to space cores out a bit. Even if you keep the thermal density for a single core constant, if you move all that cache to another die and pack those cores closer together, you have a problem.
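To make that concrete, here is a rough sketch with illustrative numbers (the per-core power, core area, and the share of CCX area taken by L3 are assumptions, not AMD figures): even at constant per-core thermal density, removing the cache area and packing the same cores into a smaller footprint raises the average power density the cooler has to absorb.

```python
# Toy power-density comparison: same cores, with and without on-die L3
# spacing them apart. All numbers are illustrative assumptions.
cores = 8
core_power_w = 12.0          # assumed per-core power under load
core_area_mm2 = 4.0          # assumed per-core area
l3_area_mm2 = 36.0           # assumed L3 + spacing area inside the CCX

ccx_power = cores * core_power_w          # cache power ignored for simplicity
with_cache_area = cores * core_area_mm2 + l3_area_mm2
without_cache_area = cores * core_area_mm2

print(f"avg density with L3 spacing:    {ccx_power / with_cache_area:.2f} W/mm^2")
print(f"avg density, cores packed only: {ccx_power / without_cache_area:.2f} W/mm^2")
```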
I'm not following. The Z direction to the heat spreader is, by far, the smaller distance compared to the X/Y spacing between the cores. If the thermal density is similar, why do you see a problem?
You would still have, possibly large, L2 and L1 caches on the compute die, so that could give them some area for stacking L3 cache on top without covering cores. It seems like it makes more sense to stack cache on top of IO though, which is part of why I was thinking of stacked cache on top of MCD-type chips or IO dies. They may have it underneath in MI300. It could be an SoIC stacked cache chip and then an IO die / bridge die using micro-solder ball tech. Trying to extend SoIC across multiple, overlapping chips is a thing, but it is likely very difficult to do and may have many more issues with thermal expansion failure.

Using non-SoIC stacking would still allow the 4 separate gpu / cpu stacks in MI300 to have relatively high connectivity for sharing all 128 GB of HBM across the whole device. Such connections are the same level of bandwidth as HBM3 connections, and they can use even wider links to allow connection to the 2 HBM stacks in each multi-chip gpu device. It would still appear as at least 4 separate gpus, but that mostly does not matter for CDNA HPC devices.

For MI300, and possibly some Zen 5 variants, I am thinking that there may be a separate IO die bridge chip that extends under neighboring sets of chiplets rather than a monolithic base die for cache and IO. TSMC has a lot of different stacking options though, and a few, like the infinity fabric fan-out used for RDNA3/MCD connectivity, do not seem to map to a specific TSMC-developed technology.
While I am at it, here is one patent application I discovered:
DIRECT-CONNECTED MACHINE LEARNING ACCELERATOR
[Attachment: figure from the patent application]
Seems to match this one, in the slide from Victor at FAD22. There are EPYC chips with XDNA coming next.
[Attachment: slide from Victor's FAD22 presentation]
I am wondering if the stuff from MI300 will arrive for mainstream CPUs/APUs: SLC stacked on the IOD, shareable between CPU and GPU. Maybe I am just getting ahead of myself.
How will it be implemented? Backside power delivery implies processing the wafer from both sides, while buried power rails imply changing the order of processing steps so the metal layers are laid down first. If it is processed from both sides then you're right, that's going to be a problem for anything that requires thinning the die.

Proper backside power delivery involves building an entire metal stack on the other side of the wafer. IIRC, it might even involve building it on a separate wafer and bonding them together, but I'm not 100% on the specifics.
Cores don't just heat themselves in isolation. They heat their neighbors directly, and they also heat the heat spreader their neighbors rely on to cool themselves. It's a compounding effect. Basically, you're just further condensing the heat from the cooling system's perspective, and that will be a challenge.
The use of dark silicon is at the micro level, not a chip-wide technique. The same considerations do not apply to using the cache as a whole as a heat buffer/equalizer, since Z <<<<< X,Y.
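A crude 1D conduction estimate supports the Z <<<<< X,Y point (the geometry and the bulk-silicon conductivity are assumptions; a real package adds TIM, IHS and spreading effects): thermal resistance scales as length over cross-sectional area, and the vertical path is short and wide while the lateral path is long and thin.

```python
# 1D conduction resistance R = L / (k * A), silicon k ~ 150 W/(m*K).
# Purely illustrative geometry: a 4 mm x 4 mm core with ~0.5 mm of silicon
# above it in Z, versus conducting the same heat ~4 mm sideways through a
# 0.1 mm thick die to reach a neighbor.
k_si = 150.0                       # W/(m*K), bulk silicon (assumed)
core_area = 4e-3 * 4e-3            # m^2, footprint toward the IHS
die_cross_section = 4e-3 * 0.1e-3  # m^2, 4 mm wide x 0.1 mm thick die (assumed)

r_vertical = 0.5e-3 / (k_si * core_area)        # K/W, up toward the IHS
r_lateral = 4e-3 / (k_si * die_cross_section)   # K/W, sideways to a neighbor

print(f"R vertical ~ {r_vertical:.3f} K/W, R lateral ~ {r_lateral:.1f} K/W")
```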
I think stacking cache on top of IO makes more sense for an L4/system cache/memory-side cache, but L3 (assuming the existing hierarchy) will still need to be close to the cores.
Edit: also, what do you mean by "thin the die down to the transistors"?

If you're willing as a producer to pay the cost (e.g. 10900k), or as an enthusiast to "die lap" it yourself, you can remove some of the excess carrier silicon to reduce the gap between the transistors and the heat spreader. You can do this because there's no active logic or wires on that side of the wafer. But with backside metal, there's wiring on both sides, so this is no longer an option.
For N3E-based Zen 5, they can use FinFlex to minimize the size of the L3 SRAM arrays if need be.
No, stacking the L3 on top of the IO die doesn't make more sense; that would kill latency, and the excellent low latency of the L3$ despite its size is one of Zen's major advantages.
It may not be L3, although if L2 is much larger and shared, then L3 can move farther out. My initial thought for Zen 5 was a larger L2 shared between multiple cores; I had been thinking 2 to 4 cores, but rumors have been saying all 8. Then possibly a much smaller or stacked-only L3. If cache is stacked on top of the memory controllers, then it may cache for the local controller, so possibly an L4 or memory-side cache. I think it would be doable, especially if a stacked interconnect is used. The latency penalty for a stacked interconnect is small, but I agree that it would not be suitable for the L3 as designed in Zen 4; that is, it is not a replacement for SoIC-stacked cache. Zen 5 will likely have a completely different cache hierarchy.
I agree with some here, but I guess we have different understandings of heat flux behavior and how to cool objects.
I would not assume that Zen 5 will have a similar cache hierarchy to Zen 4; I expect it to be radically different. A lot depends on the yield of SoIC stacking. If yield is very high, then stacking on (or under) expensive compute die makes sense. For very expensive products, like MI300, it may still make sense, although the longevity also has to be very good for such a product. If yield is an issue, then stacking on a cheap MCD or IO die seems to make more sense, as long as they have a stacked or other high-bandwidth interface (infinity fabric fan-out) to the compute chiplets and the cache hierarchy is designed to make good use of L4-level latency/bandwidth. If they lose a few MCDs, it isn't nearly as expensive as losing a compute die. It seems to me that the latency would still be very low, but the bandwidth would not be close to SoIC-stacked cache. If Zen 5 has massively more powerful FP processing, then the requirements may look more like a GPU, where they are already using MCDs.
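One way to reason about whether an L4-class cache on an MCD or IO die pays off is a simple average-memory-access-time model; the sketch below uses placeholder latencies and hit rates, not any leaked Zen 5 numbers.

```python
# Toy AMAT (average memory access time) model for an added L4 / memory-side
# cache on an MCD or IO die. All latencies and hit rates are assumptions.
def amat(levels):
    """levels: list of (hit_latency_ns, hit_rate); the last level must hit."""
    total, reach = 0.0, 1.0
    for latency, hit_rate in levels:
        total += reach * hit_rate * latency   # fraction of accesses served here
        reach *= (1.0 - hit_rate)             # fraction that falls through
    return total

baseline = [(1.0, 0.95), (4.0, 0.60), (12.0, 0.50), (80.0, 1.0)]   # L1/L2/L3/DRAM
with_l4  = [(1.0, 0.95), (4.0, 0.60), (12.0, 0.50), (30.0, 0.60), (80.0, 1.0)]

print(f"AMAT without L4: {amat(baseline):.2f} ns")
print(f"AMAT with L4:    {amat(with_l4):.2f} ns")
```

Whether the extra level helps depends entirely on how often the L3 misses and how much cheaper the L4 hit is than DRAM, which is exactly the hierarchy-design question being debated here.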
IMHO the N31 solution pretty much exactly matches InFO-R, or am I missing something?
Maybe. I remember an interview with an AMD engineer where it sounded like they took something that was more intended for mobile and turned it into a super high bandwidth connection, so perhaps a standard TSMC tech, but not the original intended use.
/edit: The discussion of the last couple of posts is a good example of why I am a member here - thanks to everyone involved 😊
I think I know which quote you mean. Maybe this was meant for a less technically versed audience. To my knowledge, InFO in general was first used for mobile SoCs from Apple and others.
Zen 3 V-Cache can support multiple stacked SRAM layers; they probably did not do it because of excessive latency and power.
Also, Andreas Schilling showed the BIOS config for the 8-Hi stacks.
Since the 5800X3D is one of the highest-selling AM4 chips, if not the highest, I would say the economics are slowly working out. There is no dearth of 5800X3D supply at all; the chip can be had for less than 350 bucks. Many current and upcoming products are chiplet-based with advanced packaging: 5800X3D, 7XX0X3D, Milan-X, Genoa-X, RDNA3, MI300, STX (supposedly), RDNA4 (supposedly).
This year TF-AMD's biggest packaging facility will come online in Malaysia. TSMC's AP2C came online in 2H2022.
I'm assuming that the extra V-cache stacks would just increase the associativity of the cache. That's probably something that cloud providers would want on their server chips, but I'm not sure it adds a lot for consumers outside of a few niche workloads. I'm not even sure how much extra scaling games would get out of the next 64 MB of cache.
It's definitely something that might be really useful for CDNA GPUs as it lets them work more effectively with larger data sets. With AMD moving the infinity cache to separate chiplets, they could stack multiple v-cache layers on each chiplet to create a massive amount of cache. Even with the current setup you could theoretically create a GPU with 3 GB of cache assuming 64 MB layers stacked 8-hi.
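The capacity arithmetic behind that, and behind the associativity assumption above, is straightforward; the stack height, layer size, and MCD count below are just the figures floated in these posts, not a confirmed configuration.

```python
# Capacity math for hypothetically stacked Infinity Cache (figures taken from
# the discussion above, not from any confirmed product).
layer_mb = 64          # one V-cache layer
stack_height = 8       # "8-hi" stack
mcds = 6               # assumed number of MCD-style cache chiplets per GPU

per_stack_mb = layer_mb * stack_height            # 512 MB per chiplet stack
total_gb = per_stack_mb * mcds / 1024             # 3.0 GB across the GPU
print(f"{per_stack_mb} MB per stack, {total_gb:.1f} GB total")

# If extra stacks keep the number of sets fixed, the added capacity shows up
# as added ways (the associativity assumption above): e.g. a 32 MB, 16-way
# base L3 growing to 96 MB would go to 48 ways under that assumption.
base_mb, base_ways = 32, 16
new_ways = base_ways * (base_mb + 64) // base_mb
print(f"{base_mb + 64} MB at fixed sets -> {new_ways} ways")
```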
Has AMD confirmed that it is SoIC? It has obvious bandwidth advantages, but other types of stacking allow them to mix chips on different processes or even made at different fabs. Also, they are not mutually exclusive. They can have an SoIC stack under or on top of chips using other stacking methods.
I think I already said something about CXL possibly being used for MI300. They definitely need more off package memory. With 128 GB of HBM cache, CXL as backing store would be fine, so it is unclear if SH5 will have DDR memory controllers. The connectivity to neighboring GPUs needs to be very high, so they may use the same infinity fabric fan out used for RDNA3/MCDs or some bridge chips.
I don’t know if they need the Zen 4 chiplets to be different. Making a version specifically for MI300 is unlikely, but there is some possibility of a refresh version meant for stacking that is more widely used. I think it may also be possible to use a standard Zen 4 die. As I said somewhere earlier, they could have designed connectivity over the IO area for CPU chiplets. It would not be SoIC, but CPUs just do not need that much bandwidth. In fact, the bandwidth from micro-solder ball type stacking is sufficient for most things and probably overkill for CPUs.
AMD has been making more and more specialized chips since they have a much larger budget now, but they still need to make things in a modular and reusable manner to keep costs down. They can’t compete with Intel without being very efficient in their usage of expensive silicon. Nvidia seems to be able to afford to keep making giant, monolithic dies for now, but they will need to go the chiplet route too, just for scalability. There may be a single base die per GPU in MI300, but I think it may make sense to have separate cache and IO dies embedded under the compute die. Perhaps an IO / switch die extends under neighboring GPU devices to provide the connectivity.
From the economic perspective, it sounds intriguing indeed. It is quite compelling when you think about the Zen 4 CCD, with 35% of the die area composed of L3 and the 2x GMI holding all those T-coils and related circuitry for the line drivers.
However, if they use the silicon bridge interconnect, they could cut back the CCD die size by a few mm², if they could replace the T-coils/line drivers with a more parallel interface, like HBI for instance, at even lower power than the Cu RDL fan-outs. I keep seeing patents like these below, so maybe they will do it after all; but if not, these Cu RDL fan-out links should still help cut out substrate-based routing, at 1/7 of the 2 pJ/bit of the 36 Gbps GMI3. They can increase the link width and speed to reduce the penalty of going to the L3 on another CCD.
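To get a feel for what that energy-per-bit gap means, here is a rough power calculation using the figures quoted above (2 pJ/bit for GMI3 versus roughly 1/7 of that for the Cu RDL fan-out); the 100 GB/s traffic level is just an example, not a measured workload.

```python
# Interconnect power from energy-per-bit: P = E_bit * bits_per_second.
# Energy figures are the ones quoted in the post; traffic is an example value.
gmi3_pj_per_bit = 2.0
fanout_pj_per_bit = 2.0 / 7.0          # the "1/7 of the 2 pJ" claim
traffic_gbytes_per_s = 100.0           # example sustained traffic on the link

bits_per_s = traffic_gbytes_per_s * 8e9
gmi3_w = gmi3_pj_per_bit * 1e-12 * bits_per_s
fanout_w = fanout_pj_per_bit * 1e-12 * bits_per_s
print(f"GMI3:    {gmi3_w:.2f} W per 100 GB/s of traffic")
print(f"Fan-out: {fanout_w:.2f} W per 100 GB/s of traffic")
```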
Nevertheless, I am doubtful at this point for Zen 5 because, if you read the semiengineering.com literature on hybrid bonding, instantaneous heat from a 5 GHz+ core attached to a lower-clocked, mostly unaccessed SRAM will create a temperature differential that degrades the hybrid bond over long periods of time. Additionally, it seems the thermal cycles also degrade the TSVs themselves on the base die, due to continuous expansion and contraction of the die. This seems to be a known problem.
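The underlying mechanism is thermomechanical: a temperature swing across materials with different coefficients of thermal expansion produces a cyclic mismatch strain, and repeated cycles fatigue the bond and TSV interfaces. A minimal sketch with typical textbook material values (assumed, not data for any specific AMD stack):

```python
# Thermal-mismatch strain: epsilon = delta_CTE * delta_T.
# CTE values are typical textbook numbers, not measurements of this stack.
cte_si = 2.6e-6        # 1/K, silicon (typical)
cte_cu = 17e-6         # 1/K, copper pads / TSV fill (typical)
delta_t = 60.0         # K, assumed swing between idle and a 5 GHz+ burst

strain = (cte_cu - cte_si) * delta_t
print(f"Mismatch strain per cycle: {strain:.2e}")   # ~8.6e-4 per cycle

# Fatigue life falls steeply as the per-cycle strain grows (Coffin-Manson-type
# behaviour), which is why frequent large swings, not one-off heating, are the worry.
```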
Or even 8 sockets, in theory. The Frontier supercomputer used half-sized nodes, each taking half of the rack width, with 2 of them together forming 1 unit in the rack. So 8 MI250s fit in the square footage of one node position in the rack, and 8 MI300s in a node could be a possibility. 8x MI300 would be 1 TB of HBM memory among them.
I think we can be 100% sure that the Zen 4 chiplets for MI300 will be a different CCD. They will have all their communications go through TSVs, they will not need any L3, and they will likely have 12 cores and a slightly larger size.
But with that in mind, and knowing AMD really likes to re-use dies, if there is a new Zen 4 CCD for MI300 it will be interesting to see whether any other implementation outside of MI300 emerges...