There have been some rumors saying that the L2 cache may be shared across all 8 cores. Such an L2 cache would have rather low density, due to the poor scaling of cache and also because L2 needs to be much faster than L3. If those rumors are true, then there may be a large amount of die area that is just L2. Sharing L2 across 8 cores seemed unlikely, but I think most Apple processors have used a shared L2 among multiple cores. They may be able to stack L3 on top of the L2, and maybe the GMI link area. They might be able to make a cheap version with no L3. I am not sure how large the die area would be. I expect Zen 5 to have significantly increased FP processing capability, which takes a lot of die area and a lot of bandwidth. Such streaming code does not necessarily need an increase in cache size, though.

Especially when the real reason is so obvious, there is no need to speculate. The whole purpose of chiplets is cost efficiency. AMD is widening its chip to increase IPC. Going from 5N to 4N doesn't provide significant density improvements, and AMD will probably choose a blend for the best performance/efficiency. More than 8 cores just doesn't make sense economically.
They could fit a ridiculous amount of cache on each one, but caches that large have not proved all that useful for GPUs; the Infinity Cache on RDNA3 is only 96 MB. It has 8 stacks of HBM3, so it doesn't need the bandwidth boost from an infinity cache; it already has ridiculously high bandwidth. If it is stacked with SoIC rather than other stacking tech, though, it could be a very different beast.
That could allow compute units to have massive local caches rather than a monolithic but much more distant L3 cache. All chiplets used would need to be designed with that in mind, though. The Zen 4 chiplets likely would not be able to use it. In fact, it is unclear how stacked Zen 4 chiplets will work on a base die anyway.
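For a sense of scale on that bandwidth point, here is a back-of-the-envelope sum for 8 HBM3 stacks. These are generic HBM3 figures (1024-bit interface per stack, 5.2-6.4 Gbps per pin), not confirmed MI300 specs:

```python
# Back-of-the-envelope aggregate bandwidth for 8 stacks of HBM3.
# Assumed generic HBM3 figures, not confirmed MI300 specs:
# 1024-bit interface per stack, 5.2-6.4 Gbps per pin.
STACKS = 8
BUS_BITS = 1024
for gbps_per_pin in (5.2, 6.4):
    per_stack_gbs = BUS_BITS * gbps_per_pin / 8      # GB/s per stack
    total_tbs = STACKS * per_stack_gbs / 1000        # TB/s aggregate
    print(f"{gbps_per_pin} Gbps/pin: {per_stack_gbs:.0f} GB/s per stack, "
          f"{total_tbs:.1f} TB/s total")
```

Multiple TB/s either way, which is why an Infinity-Cache-style bandwidth amplifier buys little here unless the working set actually fits and gets reused.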
Given AMD's modular approach, something like embedded MCDs (off-package memory controllers + cache), an embedded IO die, bridge chips (LSI), etc. seems like it may make more sense: chiplets that can be used across many different products rather than just MI300. I am not sure if we have any info on what the MI300 will have for off-package connectivity in the SH5 socket. Will it have the same IO as SP5? HPC often needs TBs of memory, so it can't just be the 128 GB of HBM. Also, sending signals across these giant interposers may be problematic. They don't daisy-chain or run silicon under CPU chiplets in Epyc; it is better to just go the IFOP route with a separate connection than to have to route across multiple chips in silicon. The Epyc IO die already has a number of switches and repeaters internally that add latency. Due to the scaling differences between IO, cache, and logic, I am still thinking that the "base die" may be assembled from a number of different pieces of silicon, made on slightly different processes.
Maybe V-cache Zen 5 is a 2 Hi stack.
They could alternatively stack it underneath. That would have its own challenges, but it might be better for thermals. Though we never did get a deep dive on what the limiting factor actually is for clocks with V-Cache.
I think they could still get away with 96c reasonably comfortably if they had to, but if not, they could wait for a 3nm Zen 5 version (refresh?) and then do a sort of mid-cycle upgrade. Or they could just make a new socket with a 128c, 16 channel config, but that's probably not an ideal outcome.
AMD is all about making production cheap. That is the primary motivation behind the chiplets. MI300 seems to be the first really complex and expensive product - the target market's demands are high.

I would expect Zen 5 to start using stacked die in some manner, so it may look more like MI300 than Genoa, which is partially why I am so interested in exactly what is in MI300. If they can economically use the same interconnect used in RDNA3/MCDs, then that would reduce power consumption significantly. Going up to PCIe 5 or 6 speeds for SerDes-based GMI has to cost a lot of power. Speculating on how stacked die are going to be used or arranged is very difficult without more information.
There is very little area scaling for cache (95%?), so once [cost of L3 cache in 6nm + die stacking cost] < [cost of L3 cache in 3nm], they can do it cheaper. Basically the wafer cost difference will determine profitability.

AMD is all about making production cheap. That is the primary motivation behind the chiplets. MI300 seems to be the first really complex and expensive product - the target market's demands are high.
Mandating Zen 5 to use complex packaging doesn't seem that great. Remember that in Rome/Milan times, the platform cost made low-tier server offerings prohibitive. This has been amplified by the 12-channel SP5, hence the need to split the server platforms by adding the lower-tier 6-channel SP6. Making Zen 5 default to stacking would make the cost high again. The client cost would also be affected.
I would expect AMD to keep things cheap (with all the drawbacks/bottlenecks of the IFOP approach).
I definitely don't want to get into a situation like being stuck at 4C/8T for 10 years, but 16/32 doesn't seem like too much of a limitation in MT workloads at present. Unless I've missed something fundamental, I'm not sure 8 full cores and 16 compact cores wouldn't be a step back on the desktop. They're not really low-power efficiency cores, and they're also not as performant as a full core. They might show higher MT numbers in Cinebench, but regressions in other things. At least for right now, the number of people who would benefit much from 32/64 on a consumer platform is pretty small versus single-threaded improvements. What they really need is to get TR releases closer in time to the release of the desktop platforms.

RGT claims only 16C/32T for the top Zen 5 desktop AM5 part. That seems kinda low, as I was expecting AMD to do Zen 5 + Zen 5C (or whatever they'll call it). They could easily do 8C Zen 5 + 16C Zen 5C for a total of 24 Zen 5 cores. The ISA would be the same, and Zen 5C might clock ~15-20% lower, but that's fine as they would still get a ~30% boost versus 16C Zen 5 in MT workloads.
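A quick sanity check of that quoted ~30% figure. The assumptions here (identical IPC, compact cores running 15-20% lower all-core clocks, throughput scaling linearly with clock) are mine, purely for illustration:

```python
# Sanity check of the quoted "~30% MT boost" for 8x Zen 5 + 16x Zen 5C
# versus 16x Zen 5, normalized to the big cores' all-core clock.
# Assumes identical IPC and a 15-20% clock deficit on the compact
# cores; all of these inputs are speculative.
big, compact = 8, 16
for clock_deficit in (0.15, 0.20):
    hybrid = big * 1.0 + compact * (1.0 - clock_deficit)
    homogeneous = 16 * 1.0
    gain = hybrid / homogeneous - 1.0
    print(f"{clock_deficit:.0%} slower 5C cores -> {gain:+.0%} MT throughput")
```

With a 20% clock deficit this lands at exactly +30%, so the quoted estimate is internally consistent.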
If it's nothing new the patent is invalid and completely wasted.
Eh, the patent system is a mess, particularly with tech. Half of it seems to be outsourced to the courts to fix later. This is one of my favorite examples of a patent that by all rights shouldn't exist: https://patents.google.com/patent/US7862780

The fact that a patent was granted means that the patent office agreed with AMD that whatever they did was different enough from existing implementations to warrant a patent.
Last I heard, one of the critical machines for hybrid bonding was still at risk/trial production levels. Probably this year or next we'll start to see those actually show up in volume, and thus many more/more common hybrid-bonded designs.

I doubt there is something about 3D stacking that is inherently very expensive. I think the concern that AMD may have is volume, i.e. whether TSMC will have sufficient capacity for millions of chips to be stacked.
Zen 3 V-Cache can have many stacks of SRAM; they did not do it, probably because of excessive latency and power.

So, it remains to be seen. It would be surprising to me if AMD did not make any improvement to V-Cache capacity in Zen 4. Since we know that the stacked die is the same 64 MB as Zen 3, 2 layers would be the only way to add extra capacity.
Since the 5800X3D is one of the best-selling AM4 chips, if not the best-selling, I would say the economics are slowly working out. There is no dearth of 5800X3D at all; the chip can be had for less than 350 bucks. Many current and upcoming products are chiplet-based with advanced packaging: 5800X3D, 7XX0X3D, Milan-X, Genoa-X, RDNA3, MI300, STX (supposedly), RDNA4 (supposedly).

I doubt there is something about 3D stacking that is inherently very expensive. I think the concern that AMD may have is volume, i.e. whether TSMC will have sufficient capacity for millions of chips to be stacked.
The packaging facility doesn't do the 3D stacking, if I am not mistaken. That's just substrates and all that jazz. The 3D stacking (hybrid bonding) is entirely on TSMC, no?

Zen 3 V-Cache can have many stacks of SRAM; they did not do it, probably because of excessive latency and power.
Also Andreas Schilling showed the BIOS config for the 8 Hi Stacks.
This year TF-AMD's biggest packaging facility will come online in Malaysia. TSMC's AP2C came online in 2H2022.
Yes, all FE packaging (e.g. SoIC) is done by TSMC. AP2C is one of the latest 3D packaging fabs at TSMC. But other BEOL stuff (fan-outs etc.) can be done by TF-AMD.

The packaging facility doesn't do the 3D stacking, if I am not mistaken. That's just substrates and all that jazz. The 3D stacking (hybrid bonding) is entirely on TSMC, no?
Multi-layer stacking has additional complications vs 2-layer. Yes, AMD has support for it in the BIOS, but that doesn't necessarily mean the manufacturing side is ready for it.

Zen 3 V-Cache can have many stacks of SRAM; they did not do it, probably because of excessive latency and power.
And if that is the decision, with the CCD sitting on top of the L3 / system-level cache, it would have to be standard, not optional as is the case now.
Has AMD confirmed that it is SoIC? It has obvious bandwidth advantages, but other types of stacking allow them to mix chips on different processes or even made at different fabs. Also, they are not mutually exclusive. They can have an SoIC stack under or on top of chips using other stacking methods.

Yes, it is SoIC, Hybrid Bond stacking.
Eliminating all of the L3 and all of the I/O will allow more efficient use of the N5 (or N4) compute die area, and even less area for I/O than RDNA3. Hybrid bonding has higher density than RDL, so even less area needs to be spent on connecting to the outside world.
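A rough sketch of that density claim. The pitches are ballpark public figures (~9 um for SoIC-class hybrid bonds, ~40 um for RDL micro-bumps), not AMD numbers, and the signal count is invented:

```python
# Rough connection-density comparison: hybrid bonding vs an RDL
# interface. Bond pitches are assumed ballpark figures, not AMD specs.
SIGNALS = 10_000                      # hypothetical die-to-die signal count
for name, pitch_um in (("hybrid bond", 9), ("RDL bump", 40)):
    per_mm2 = (1000 / pitch_um) ** 2  # pads per mm^2 on a square grid
    area = SIGNALS / per_mm2          # beachfront area needed, mm^2
    print(f"{name}: {per_mm2:,.0f} pads/mm^2 -> {area:.1f} mm^2 "
          f"for {SIGNALS} signals")
```

An order of magnitude or two in pad density is why the hybrid-bonded interface can shrink to a fraction of a mm^2 of die area.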
We will see how the various workloads (from scientific to AI) respond to the extremely high bandwidth (and capacity) of a system-level cache. The need for extremely high bandwidth is well known, and it is the primary bottleneck (not compute), so we will see how much the cache helps in addition to HBM.
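A toy roofline calculation shows why bandwidth, not compute, is the usual limiter; every number below is an illustrative assumption, not an MI300 spec:

```python
# Toy roofline check: how much arithmetic intensity (FLOPs per byte)
# does a kernel need before memory bandwidth stops being the limiter?
# All numbers are illustrative assumptions, not MI300 specs.
peak_tflops = 100.0      # assumed accelerator compute peak, TFLOP/s
hbm_tbs     = 5.0        # assumed HBM bandwidth, TB/s
cache_tbs   = 15.0       # assumed stacked system-level-cache bandwidth, TB/s

ridge_hbm   = peak_tflops / hbm_tbs    # FLOPs/byte to saturate compute from HBM
ridge_cache = peak_tflops / cache_tbs  # same, if data is served from the cache
print(f"need {ridge_hbm:.0f} FLOPs/byte from HBM, "
      f"{ridge_cache:.1f} FLOPs/byte from cache")
# A streaming stencil or SpMV at ~0.1-1 FLOPs/byte sits far below
# either ridge point, i.e. firmly bandwidth-bound; the cache only
# helps if the working set fits and is actually reused.
```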
It will have to be a different Zen 4 CCD to work stacked on top of the base die.
But once AMD has this type of implementation of the Zen 4 CCD, I wonder if there may be any possibility of using it elsewhere, such as 1/4 of the MI300 with a CPU-only compute unit inside the AM5 socket, with 2 stacks of HBM for 32 GB of memory.
It would have a higher cost than current desktop Zen 4 chips, but a shared last-level cache of perhaps 256 MB of SRAM would be another level beyond the V-Cache versions of Zen 4, and it would sidestep the problems that the 7950X3D would face, those being:
- more challenging cooling, with the cache on top of the die
- the cache not being shared, so threads jumping between CCDs lose access to previously used L3 content (see the sketch after this list)
- the asymmetry of the 7950X3D implementation
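A minimal sketch of the second problem, with made-up latencies and hit rates just to show the shape of the penalty:

```python
# Toy model of the CCD-migration penalty with split vs shared L3.
# Latencies (ns) and hit rates are illustrative assumptions only.
L3_HIT_NS, DRAM_NS = 12.0, 80.0

def avg_latency(l3_hit_rate):
    return l3_hit_rate * L3_HIT_NS + (1 - l3_hit_rate) * DRAM_NS

warm, cold = 0.7, 0.1   # assumed L3 hit rates before/after migrating
print(f"split L3, thread lands on a cold CCD: {avg_latency(cold):.0f} ns avg")
print(f"shared LLC, cache stays warm:         {avg_latency(warm):.0f} ns avg")
```

With a base-die LLC shared by all CCDs, a migrated thread keeps its warm cache instead of starting over against DRAM latency.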
Or a laptop implementation, with one GPU, one CPU compute unit, and HBM, eliminating the need for the laptop to have separate DRAM.
It remains to be seen whether there will be any DDR5 controllers on the base die, and whether the SH5 socket has that capability.
The desire on the part of the cloud providers is to move to CXL memory that can be pooled. So there would be a tradeoff between spending pins on multiple channels of DDR5 or on far more PCIe lanes, which could be reconfigured as CXL lanes for memory or storage access, or as Infinity Fabric lanes for connecting, say, 4-way, 4-socket MI300 nodes.
If that is the approach SH5 takes, then re-use of the IO dies would be limited or impossible. If AMD does take the base-die approach for Zen 5, then even if the IO die is different between SP5 and SH5, Zen 5 could in theory re-use the Zen 5 CCDs, if being stacked on top of a base die becomes the new standard.
I think AMD might just as well have 2 different architectures (and sockets) and let clients decide what works best for them.
For AMD to grow its market share in the datacenter, they may in fact need solutions for a number of different tasks, from Sienna SP6 to Genoa SP5 and MI300 SH5.
From the economic perspective, it sounds intriguing indeed. Quite compelling when you think about the Zen 4 CCD, with 35% of the die area composed of L3, and the 2x GMI holding all those TCoils and the related line-driver circuitry.

There is very little area scaling for cache (95%?), so once [cost of L3 cache in 6nm + die stacking cost] < [cost of L3 cache in 3nm], they can do it cheaper. Basically the wafer cost difference will determine profitability.
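That break-even can be made concrete with a toy cost model. The wafer prices, L3 area, and bonding cost below are all invented inputs; only the structure of the comparison matters:

```python
# Toy version of the break-even condition quoted above:
#   cost(L3 on N6) + stacking cost  <  cost(L3 on N3)
# Wafer prices, areas, and the ~5% SRAM scaling are invented inputs.
MM2_PER_WAFER = 70_000                 # rough usable mm^2 on a 300 mm wafer
wafer_n6, wafer_n3 = 10_000, 20_000    # assumed $ per wafer
l3_mm2_n6 = 36.0                       # assumed 32 MB L3 + tags on N6, mm^2
l3_mm2_n3 = l3_mm2_n6 * 0.95           # SRAM barely scales (~5%)
stack_cost = 2.0                       # assumed $ per die for SoIC bond + test

cost_n6 = l3_mm2_n6 / MM2_PER_WAFER * wafer_n6 + stack_cost
cost_n3 = l3_mm2_n3 / MM2_PER_WAFER * wafer_n3
print(f"L3 on N6 + stacking: ${cost_n6:.2f}  vs  L3 in the N3 CCD: ${cost_n3:.2f}")
```

With these made-up inputs the N6-cache-plus-stacking route already wins; in reality it hinges on exactly what the quote says, the wafer-price gap versus the bonding cost.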
My thinking is that we're already there, and Zen 5, being a more clean-sheet design, will reflect this.
Aren't wider archs also harder to clock higher?
But then, when you think about the complexities they need to tackle, they might not do it after all. They are going to lose a lot of frequency headroom due to the thermal density. The presentation below is from Naffziger in 2021.
[Attachment 76406: slide from Naffziger's 2021 presentation]
For instance, 7800X3D loses 10%+ top end frequency over 7700X.
Since Zen 4 missed the N4P PDK at tape-out (N4P taped out in 2H2022, but Zen 4 taped out a year earlier than that), Zen 5 might gain even more frequency over Zen 4 for the N4P-based parts.
6GHz+ should be quite easy for N4P based Zen 5 (my 7950X consistently hit 5.88GHz on 280mm AIO), I suppose, whereas a 3D stacked Zen 5 might not make it past 5GHz.
Additionally disclosures of V-Cache variants would be odd since it is unclear where they would stack them otherwise.
Besides, for next year the N5/N4 nodes are already mature and plentiful, ~180k wpm from F18 P1~4 and Arizona F21 P1, with the likes of Apple moving to N3 again, which has ~160k wpm from F18 P5~8.
For N3E based Zen 5, they can work with FinFlex for minimizing the size of the L3 SRAM arrays if need be.
However, if they use the silicon bridge interconnect, they could cut the CCD die size by a few mm2 if they could replace the TCoils/line drivers with a more parallel interface, like HBI for instance, at even lower power than the Cu RDL fan-outs. I keep seeing patents like these below, so maybe they will do it after all; but if not, these Cu RDL fan-out links should still help cut substrate-based routing to roughly 1/7 of the 2 pJ/bit of the 36 Gbps GMI3. They can increase the link width and speed to reduce the penalty of going to the L3 on another CCD.
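Link power is just energy-per-bit times bit rate, so those quoted figures translate directly into watts; the sustained per-CCD bandwidth below is an assumption for illustration:

```python
# Link power = (energy per bit) x (bit rate). The 2 pJ/bit for the
# 36 Gbps GMI3 SerDes and the ~1/7 ratio for Cu RDL fan-out links are
# the figures quoted above; the per-CCD bandwidth is an assumption.
GMI3_PJ = 2.0
FANOUT_PJ = GMI3_PJ / 7                 # ~0.29 pJ/bit
agg_gbyte_s = 64                        # assumed sustained GB/s per CCD
bits_per_s = agg_gbyte_s * 8e9
for name, pj in (("GMI3 SerDes", GMI3_PJ), ("Cu RDL fan-out", FANOUT_PJ)):
    watts = pj * 1e-12 * bits_per_s
    print(f"{name}: {watts:.2f} W per CCD at {agg_gbyte_s} GB/s")
```

Multiplied across the 8-12 CCDs of an Epyc package, that difference is worth several watts of package power.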
I wonder if they could just turn it on its head. The thermal density is in the cores; you want them directly touching the IHS. The current V-Cache and its dummy spacer prevent that.
Strictly from a cooling viewpoint.