If we go by AMD's comments, they are no longer blowing up the memory hierarchy with every design; it is too hard to get consistent performance. So unless they implement stacked L2 cache in every market and every product, we will see the same register file/LDS/L0/L1/L2/L3/whatever ratios and designs for shader cores and GPUs.
What are the odds that stacked dies will become as much of a norm (at least for AMD) as the chiplet design that started with Zen?
If you take the current trends and leaks for Zen 6, every client chip except one (MDS1) is chiplet-based and will need advanced packaging.
Advanced packaging adds some cost, as does SoIC 3D stacking. Adding SoIC on its own is purely a cost adder, but replacing other advanced packaging with SoIC offsets part of that cost.
So in theory, what seemed like a far-fetched scenario, future client chips looking like the MI300-MI455 designs, may be closer to reality than before.
If stacked L2 + L3 can secure unparalleled CPU performance, it is an attractive proposition to proliferate across all segments. You could move a bunch more of a notebook chip's functions to the base die, including IO and memory controllers, and then have just the core complexes (CPU and GPU) sitting on top of it. That would use much less of the most advanced process silicon, with the difference made up by a base die on a cheaper node. It's just a little more complex in notebook chips than in server / desktop CCDs.
In general, there are a number of barriers holding you back from achieving higher performance. AMD leadership figured out that the best approach is to apply maximum pressure to the barrier that is easiest to move. That is how AMD gained server CPU leadership with EPYC: by applying maximum pressure on increasing the chiplet count.
Stacked L3 was a very similar scenario: increasing L3 size through stacking was a weaker barrier than growing the chip to fit more L3.
If the CPU CCDs (desktop and server) are already in the process of transitioning to a 2-die stacked design (the single-die CCD may disappear), then it is a no-brainer to also add L2 to the stacked die.
If some 33% of the Zen 5 die (minus SerDes) is SRAM, and a 16-core Zen 7 die would normally be 100 mm2, then moving both L2 and L3 off the base die would shrink the chip to about 66 mm2.
Then the remaining 33 mm2 of SRAM, moved to the stacked die, could grow to 66 mm2 to match, leaving enough room for 2x L2 and 2x L3. In other words, 2 MB L2 per core and 8 MB L3 per core. That would maintain the cache hierarchy.
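The back-of-envelope math above can be sketched in a few lines of Python. This is only a rough check under the post's own assumptions: a hypothetical 100 mm2 die that is about one third SRAM, a cache die sized to match the shrunken logic die, and the same SRAM density on the cache die as on the original die. The per-core baselines (1 MB L2, 4 MB L3) are Zen 5's; the function and key names are made up for illustration.

```python
# Rough area split for moving L2 + L3 onto a stacked cache die.
# Every input is an assumption taken from the post, not a measured figure.

def stacked_cache_split(total_mm2=100.0, sram_fraction=1 / 3,
                        l2_mb_per_core=1.0, l3_mb_per_core=4.0):
    # Area taken by L2 + L3 on the original monolithic CCD.
    sram_mm2 = total_mm2 * sram_fraction
    # What is left once the caches move off the die.
    logic_die_mm2 = total_mm2 - sram_mm2
    # Assumption: the cache die matches the logic die footprint.
    cache_die_mm2 = logic_die_mm2
    # Assumption: same SRAM density, so capacity scales with area.
    scale = cache_die_mm2 / sram_mm2
    return {
        "logic_die_mm2": logic_die_mm2,
        "cache_die_mm2": cache_die_mm2,
        "scale": scale,
        "l2_mb_per_core": l2_mb_per_core * scale,
        "l3_mb_per_core": l3_mb_per_core * scale,
    }


if __name__ == "__main__":
    for name, value in stacked_cache_split().items():
        print(f"{name}: {value:.2f}")
```

With the default inputs the cache area doubles, which is where the 2 MB L2 and 8 MB L3 per core come from. Change `sram_fraction` to see how sensitive the result is to the 33% guess.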