Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


MadRat

Lifer
Oct 14, 1999
11,999
307
126
I think Mi300 (rather than Zen 5) is the product where AMD is going "all in".

Since, according to rumors and AMD's Financial Analyst Day presentation, the compute modules placed on the base die may be interchangeable, the Mi300 approach may end up overtaking other dedicated implementations in CPU (desktop, server), APU, and GPU.

The basic approach of Mi300 is rumored to be a large (360 mm²?) N6 base die, connected to HBM modules, with 2 compute units on the N5 node stacked on top of it:

[Attachment: diagram of the rumored Mi300 base die and stacked compute units]
This makes too much sense. You can probably vary memory type and capacity on a per-module basis, giving major flexibility with target markets. How much capacity is the million-dollar question.
 

Geddagod

Golden Member
Dec 28, 2021
1,524
1,620
106
This whole thing sounds like complete nonsense. Doubling the cores? +30% IPC? Unified L2 cache? Unified stacked L3 for everything? Yeah, I'm calling BS.
It doesn't really make sense since AMD already confirmed Zen 5 vanilla and Zen 5 3D stacked variants. So unless they just have a lineup without L3 at all (which I HIGHLY doubt), I don't think 3D stacked L3 cache is going to be the standard on Zen 5.
 
  • Like
Reactions: Mopetar

RnR_au

Platinum Member
Jun 6, 2021
2,675
6,119
136
It doesn't really make sense since AMD already confirmed Zen 5 vanilla and Zen 5 3D stacked variants. So unless they just have a lineup without L3 at all (which I HIGHLY doubt), I don't think 3D stacked L3 cache is going to be the standard on Zen 5.
Could X SKUs having V-Cache and non-X not having it be a possibility?
 

Joe NYC

Diamond Member
Jun 26, 2021
3,647
5,183
136
It doesn't really make sense since AMD already confirmed Zen 5 vanilla and Zen 5 3D stacked variants. So unless they just have a lineup without L3 at all (which I HIGHLY doubt), I don't think 3D stacked L3 cache is going to be the standard on Zen 5.

It's a new architecture, for which AMD already has a customer: the El Capitan supercomputer.

It seems flexible enough to allow a number of different configurations.

BTW, I think Zen 4c will be the CPU compute die for Mi300. The L3 of Zen 4c will likely be small or absent.
 

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
People keep acting as though v-cache is a magical solution to everything, when in reality it does very little or nothing for most productivity workloads. Insisting that it be a part of every CPU just makes them more expensive for the people who won't benefit from the inclusion.

It's certainly good for games, particularly so for a small handful of titles where it's better than moving to a top-end GPU, and cloud providers find it valuable, but for anyone else it really depends on your workload and mix. Mark has said it's good for some of his compute workloads, but for a lot of other professional work it doesn't do much if anything.
 
  • Like
Reactions: Tlh97 and scannall
Jul 27, 2020
28,008
19,125
146
People keep acting as though v-cache is a magical solution to everything, when in reality it does very little or nothing for most productivity workloads.
Unfortunately, no one has done a multitasking benchmark comparison of the 5800X and 5800X3D. I think the multitasking experience on the 5800X3D would be considerably improved, as more application data would fit in the larger L3 cache.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
People keep acting as though v-cache is a magical solution to everything, when in reality it does very little or nothing for most productivity workloads. Insisting that it be a part of every CPU just makes them more expensive for the people who won't benefit from the inclusion.

It's certainly good for games, particularly so for a small handful of titles where it's better than moving to a top-end GPU, and cloud providers find it valuable, but for anyone else it really depends on your workload and mix. Mark has said it's good for some of his compute workloads, but for a lot of other professional work it doesn't do much if anything.
The main reason why V-Cache might become the new normal is this:
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,667
2,532
136
Now that the io-die is made on N6, an errant thought that's been bouncing around in the back of my head: why not put the L3 on it and stack the CPU on top? For more v-cache, add layers in between the ccd and iod.

This would eliminate the fairly power-hungry and latency-adding interface between the iod and the ccd, and allow making L3 with the more economical N6 node which is still quite competitive in SRAM density.

Of course, since it would mean every product is stacked they would need a lot more manufacturing throughput. Probably not viable for Zen5 yet.
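For a rough sense of the power at stake, here's a back-of-the-envelope sketch; the energy-per-bit figures are ballpark numbers floating around for organic-substrate SerDes links versus hybrid bonding, not official AMD specs:

```python
# Back-of-the-envelope: power cost of CCD<->IOD traffic at a given bandwidth.
# Energy-per-bit figures are rough ballpark values, not official AMD numbers.
IFOP_PJ_PER_BIT = 2.0      # organic-substrate SerDes link (assumed)
HYBRID_PJ_PER_BIT = 0.05   # hybrid-bonded 3D stack (assumed)

def link_power_watts(bandwidth_gb_s: float, pj_per_bit: float) -> float:
    """Power (W) = bandwidth (bits/s) * energy per bit (J)."""
    bits_per_s = bandwidth_gb_s * 1e9 * 8   # GB/s -> bits/s
    return bits_per_s * pj_per_bit * 1e-12  # pJ -> J

bw = 64.0  # GB/s of CCD<->IOD traffic, arbitrary illustration
print(f"IFOP link:   {link_power_watts(bw, IFOP_PJ_PER_BIT):.2f} W")   # ~1.02 W
print(f"Hybrid bond: {link_power_watts(bw, HYBRID_PJ_PER_BIT):.3f} W") # ~0.026 W
```

Roughly a watt versus tens of milliwatts at the same traffic, which is why removing the SerDes hop saves real power.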
 
  • Like
Reactions: Joe NYC

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
Unfortunately, no one has done a multitasking benchmark comparison of the 5800X and 5800X3D. I think the multitasking experience on the 5800X3D would be considerably improved, as more application data would fit in the larger L3 cache.

I've speculated as much myself, but that's going to be a difficult thing to quantify and the switching has to be frequent enough for people to care.

The main reason why V-Cache might become the new normal is this:

Unless the entire L3 cache can get moved, it wouldn't make a lot of sense. The L1 and L2 cache sizes don't really need to increase because the added hit time probably doesn't offset the reduction in miss rate. Moving those to another layer might not be as viable due to timing reasons.
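To illustrate the hit-time versus miss-rate tradeoff with the standard average-memory-access-time formula (all cycle counts and miss rates below are invented for illustration):

```python
# AMAT = hit_time + miss_rate * miss_penalty.
# All cycle counts and miss rates are made-up illustrative values.
def amat(hit_cycles: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_cycles + miss_rate * miss_penalty

# A small, fast L1 vs. a hypothetical doubled L1 with one extra hit cycle.
print(amat(hit_cycles=4, miss_rate=0.10, miss_penalty=40))  # 8.0 cycles
print(amat(hit_cycles=5, miss_rate=0.08, miss_penalty=40))  # 8.2 cycles
# The extra hit cycle more than eats the benefit of the lower miss rate.
```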

Just like IO, which hasn't scaled well for a while now, companies will just have to suck it up and accept that some parts of the die won't shrink. Smart designers will just use it as a way to help spread heat out more by moving hot spots close to the cache.

Edit: minor grammar fix to improve clarity
 
Last edited:

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
Now that the io-die is made on N6, an errant thought that's been bouncing around in the back of my head: why not put the L3 on it and stack the CPU on top? For more v-cache, add layers in between the ccd and iod.

This would eliminate the fairly power-hungry and latency-adding interface between the iod and the ccd, and allow making L3 with the more economical N6 node which is still quite competitive in SRAM density.

Of course, since it would mean every product is stacked they would need a lot more manufacturing throughput. Probably not viable for Zen5 yet.
It's an interesting idea, but the misalignment between the amount of IO die silicon and compute die silicon might be a problem. Also, it means that for lower tier SKUs, they can't really scale the L3 cache down. Though perhaps the biggest issue is that it would really constrain them if they wanted to reuse the IO die for another gen.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,508
3,190
136
I've speculated as much myself, but that's going to be a difficult thing to quantify and the switching has to be frequent enough for people to care.



Unless the entire L3 cache can get moved, it wouldn't make a lot of sense. The L1 and L2 cache sizes don't really need to increase because the added hit time probably doesn't offset the reduction in miss rate. Moving those to another layer might not be as viable due to timing reasons.

Just like IO, which hasn't scaled well for a while now, companies will just have to suck it up and accept that some parts of the die won't shrink. Smart designers will just use it as a way to help spread heat out more by moving hot spots close to the cache.
The entire L3 need not move. They can shrink the L3 on the compute die to 16 or 8 MB and have a large stacked L3 cache die in the 96 MB range. This could allow more efficient die area usage, still allow lower-end products without the stacked L3 die, allow more aggressive core transistor growth for a given die size, or allow more aggressive CCD shrinkage.
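Rough area math for that idea (the mm² per MB densities below are loose guesses, not measured figures):

```python
# Rough area accounting for shrinking the on-die L3 and stacking the rest.
# Densities (mm^2 per MB, incl. tags/control) are loose guesses.
N5_MM2_PER_MB = 0.8         # L3 on the N5 compute die (assumed)
CACHE_DIE_MM2_PER_MB = 0.6  # dedicated SRAM die on an older node (assumed)

ccd_area_now   = 32 * N5_MM2_PER_MB         # current 32 MB on-CCD L3
ccd_area_small = 8 * N5_MM2_PER_MB          # shrunk to 8 MB
stacked_die    = 96 * CACHE_DIE_MM2_PER_MB  # 96 MB stacked cache die

print(f"Expensive CCD area freed: {ccd_area_now - ccd_area_small:.1f} mm^2")  # 19.2
print(f"Stacked cache die size:   {stacked_die:.1f} mm^2")                    # 57.6
# ~19 mm^2 of leading-edge area freed for cores, paid for with a cheaper
# dedicated SRAM die -- the tradeoff described above.
```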
 

eek2121

Diamond Member
Aug 2, 2005
3,410
5,049
136
The entire L3 need not move. They can shrink the L3 on the compute die to 16 or 8 MB and have a large stacked L3 cache die in the 96 MB range. This could allow more efficient die area usage, still allow lower-end products without the stacked L3 die, allow more aggressive core transistor growth for a given die size, or allow more aggressive CCD shrinkage.
This. They could also do this with Zen 4 if they wanted to. That would actually give them room to add a third chiplet or more cores. I don't see AMD sticking with a 16-core limit when Intel is clearly willing to push things further.

I suspect the manufacturing capacity (for stacking) might still be an issue, however. We will see what happens with Zen 5. I am expecting AMD to surprise everyone with something different. Maybe they build something similar to what Apple did with the M1 Max, complete with GPU?

EDIT: I want to remind everyone that AMD is the company that brought us stuff like Bulldozer and x86-64. While Bulldozer's execution was flawed, the concept is pretty sound. We are way overdue for an optimization phase for silicon. Why have 16-24 AVX units if they will rarely be utilized? As low-latency, high-bandwidth on-package interconnect technologies mature, I suspect we will begin to see a decoupling of the various bits of a CPU. The term 'core' may even go away completely. We may see CPUs with 32 integer units but only 16 floating point units. Possibly even shared L1 and L2 cache.

I'm not saying that Zen 5 will do any of this (though it is a possibility), but node shrinks are expensive, and being able to spend your silicon exactly where you need it, without duplicating functionality where it isn't needed, will matter in the future. Especially if AMD and Intel start building serious GPU horsepower into their designs. Look, for example, at some of the legacy x86 stuff. With a disaggregated design, you can keep that logic in one area for the really old stuff, and the rest of the chip can exclude it. Even better, if you have an on-chip FPGA, that logic can be loaded and used if needed, then unloaded so other instructions can be used.

Just some wild thoughts by me.
 
Last edited:

Joe NYC

Diamond Member
Jun 26, 2021
3,647
5,183
136
Now that the io-die is made on N6, an errant thought that's been bouncing around in the back of my head: why not put the L3 on it and stack the CPU on top? For more v-cache, add layers in between the ccd and iod.

This would eliminate the fairly power-hungry and latency-adding interface between the iod and the ccd, and allow making L3 with the more economical N6 node which is still quite competitive in SRAM density.

Of course, since it would mean every product is stacked they would need a lot more manufacturing throughput. Probably not viable for Zen5 yet.

What you describe is what AMD is (apparently) planning for Mi300: a 300+ mm² N6 base die with 2 compute units stacked on top of it, using hybrid bonding.

What separates Mi300 from consumer-level products is that this approach is well within the budget of a supercomputer or data center product.
 

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
The entire L3 need not move. They can shrink the L3 on the compute die to 16 or 8 MB and have a large stacked L3 cache die in the 96 MB range. This could allow more efficient die area usage, still allow lower-end products without the stacked L3 die, allow more aggressive core transistor growth for a given die size, or allow more aggressive CCD shrinkage.

The silicon cost ends up about the same either way, unless the v-cache is on an older, cheaper node. Otherwise, it's just trading the cost of a bigger base die against the cost of assembly and packaging.

The interesting question is whether AMD could reduce the L3 by some amount with so little performance loss that it wouldn't be noticed in most workloads, and whether they'd do that in order to make a smaller chip.
 

moinmoin

Diamond Member
Jun 1, 2017
5,242
8,456
136
Well, it doesn't scale for N3, and quite likely not even N2, but who knows what the future holds. I don't think SRAM scaling is completely dead.
GAA should improve it again; by how much remains to be seen. But until then, calling it dead is apt. SRAM not scaling makes it very expensive to keep caches at the same size, never mind increase them in size. No more doubling of cache sizes on monolithic dies. Moving SRAM to a separate die on N5 or older is the only economical solution.
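The economics in one line: cost per MB = (mm² per MB) × ($ per mm²). If the area stops shrinking while wafer cost keeps climbing, newer nodes make SRAM strictly more expensive per bit. A sketch with guessed wafer prices and densities (not actual TSMC pricing):

```python
# Cost per MB of SRAM = (mm^2 per MB) * ($ per mm^2).
# Wafer prices and densities are illustrative guesses, not TSMC pricing.
nodes = {
    # node: (mm^2 per MB of SRAM, $ per 300 mm wafer)
    "N7": (0.75, 9_000),
    "N5": (0.60, 14_000),
    "N3": (0.58, 20_000),  # SRAM barely scales from N5 to N3
}
WAFER_AREA_MM2 = 70_000  # usable area of a 300 mm wafer, roughly

for node, (mm2_per_mb, wafer_cost) in nodes.items():
    cost_per_mm2 = wafer_cost / WAFER_AREA_MM2
    print(f"{node}: ${mm2_per_mb * cost_per_mm2:.3f} per MB")
# N7: $0.096, N5: $0.120, N3: $0.166 -- newer nodes make each MB pricier,
# which is why cache wants to live on an older, cheaper die.
```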
 

LightningZ71

Platinum Member
Mar 10, 2017
2,508
3,190
136
I think that SRAM cache dies will find their next long-term home on N4. While it doesn't offer much in the way of scaling, it should provide the needed switching speed to keep the L3 fast enough to be relevant. GAA should allow for a useful bump in total density, but it will likely not keep shrinking beyond that. At that node, it'll still likely be more efficient to move that die area (or at least a sizeable chunk of it) to N4 stacked cache and use the rest on compute logic.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,667
2,532
136
I think that SRAM cache dies will find their next long-term home on N4. While it doesn't offer much in the way of scaling, it should provide the needed switching speed to keep the L3 fast enough to be relevant. GAA should allow for a useful bump in total density, but it will likely not keep shrinking beyond that. At that node, it'll still likely be more efficient to move that die area (or at least a sizeable chunk of it) to N4 stacked cache and use the rest on compute logic.

Beyond GAA, backside power delivery is expected to approximately double SRAM density (because at that point, the wiring will be the hard constraint by far, and backside power will approximately double the achievable density of logic wiring on the frontside). TSMC is now expecting that to show up in some N2 node (but not the first one).

However, backside power delivery will also add a lot of cost, because it will add a whole new flow of manufacturing steps, and a lot of new things that can hurt yield.

I would not be shocked if when backside power is available, the lowest cost per bit is still on some older process, despite the greatly increased density.
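Same arithmetic as the node-cost comparison, applied to that caveat; the 2x density gain and the wafer-cost uplift below are assumptions for illustration only:

```python
# Even a 2x SRAM density gain from backside power doesn't guarantee cheaper
# bits if the extra process steps raise wafer cost enough. All assumptions.
WAFER_AREA_MM2 = 70_000    # usable area of a 300 mm wafer, roughly
base_mm2_per_mb = 0.58     # hypothetical N3-class SRAM density
base_wafer_cost = 20_000   # $, illustrative

bsp_mm2_per_mb = base_mm2_per_mb / 2    # ~2x density from backside power
bsp_wafer_cost = base_wafer_cost * 1.8  # assumed wafer-cost uplift

for name, mm2, cost in [("pre-BSP node", base_mm2_per_mb, base_wafer_cost),
                        ("BSP N2-class", bsp_mm2_per_mb, bsp_wafer_cost)]:
    print(f"{name}: ${mm2 * cost / WAFER_AREA_MM2:.3f} per MB")
# pre-BSP: $0.166, BSP: $0.149 -- at a 1.8x cost uplift the new node wins
# only narrowly; at 2x it ties, and beyond that the older process stays cheaper.
```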
 

Doug S

Diamond Member
Feb 8, 2020
3,574
6,311
136
I would not be shocked if when backside power is available, the lowest cost per bit is still on some older process, despite the greatly increased density.


Sure, but even if that's the case, you will still benefit for the lower cache levels that must remain on-chip. Maybe it is cheaper to stack an L3 that takes 3x as much silicon area per bit in an older process on top of an N2+ BPR die, in exchange for increased latency.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
Beyond GAA, backside power delivery is expected to approximately double SRAM density (because at that point, the wiring will be the hard constraint by far, and backside power will approximately double the achievable density of logic wiring on the frontside). TSMC is now expecting that to show up in some N2 node (but not the first one).

However, backside power delivery will also add a lot of cost, because it will add a whole new flow of manufacturing steps, and a lot of new things that can hurt yield.

I would not be shocked if when backside power is available, the lowest cost per bit is still on some older process, despite the greatly increased density.
Where are you seeing that backside power delivery should double the density for SRAM?
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Sure, but even if that's the case, you will still benefit for the lower cache levels that must remain on-chip. Maybe it is cheaper to stack an L3 that takes 3x as much silicon area per bit in an older process on top of an N2+ BPR die, in exchange for increased latency.
Yeah, reading this 'news' it looks like on-die L3$ will wind up going the way of the dodo in a few node cycles. So stacked L3$ will rule the day then. I wonder if memory-side caches (or an L4$, either on the IOD) will come in to alleviate some of the latency issues in the whole memory system. I don't know if memory-side caches need to be snoopable; it seems like it wouldn't be worth it if that was the case.
 

Joe NYC

Diamond Member
Jun 26, 2021
3,647
5,183
136
Yeah, reading this 'news' it looks like on-die L3$ will wind up going the way of the dodo in a few node cycles. So stacked L3$ will rule the day then. I wonder if memory-side caches (or an L4$, either on the IOD) will come in to alleviate some of the latency issues in the whole memory system. I don't know if memory-side caches need to be snoopable; it seems like it wouldn't be worth it if that was the case.

Stacking may be it, and for now 3D stacking using hybrid bonding is the only way to go.

But there are two possible 2.5D (horizontal) approaches in the future: AMD has mentioned EFB with hybrid bonding, and TSMC has mentioned SoIC_H, which would be an interposer with chiplets attached via hybrid bond.

Once either or both of these technologies make it to production, L3 (on-die or stacked) could migrate to a separate SRAM stack, placed on top of either the memory controller or the I/O die. The SRAM could then work as a system-level cache, shared between different CPU chiplets and also graphics.
 
  • Like
Reactions: Kaluan