Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

MadRat · Dec 17, 2022

Joe NYC said:
I think Mi300 (rather than Zen 5) is the product where AMD is going "all in".

Since, according to rumors and AMD Financial Analyst Day presentation, the compute modules placed on the base die may be interchangeable, it might be possible that the Mi300 approach may overtake other dedicated implementations in CPU (desktop, server) APU, GPU.

The basic approach of Mi300 is rumored to be a large (360 mm2?) N6 die, connected to HBM modules and stacked on top of the base die, there would be 2 compute units on N5 node:

View attachment 72428

This makes too much sense. You can probably vary memory type and capacity on a per module basis, giving major flexibility with target markets. How much capacity is the million dollar question.

Geddagod · Dec 18, 2022

Exist50 said:
This whole thing sounds like complete nonsense. Doubling the cores? +30% IPC? Unified L2 cache? Unified stacked L3 for everything? Yeah, I'm calling BS.

It doesn't really make sense since AMD already confirmed Zen 5 vanilla and Zen 5 3D stacked variants. So unless they just have a lineup without L3 at all (which I HIGHLY doubt), I don't think 3D stacked L3 cache is going to be the standard on zen 5

RnR_au · Dec 18, 2022

Geddagod said:
It doesn't really make sense since AMD already confirmed Zen 5 vanilla and Zen 5 3D stacked variants. So unless they just have a lineup without L3 at all (which I HIGHLY doubt), I don't think 3D stacked L3 cache is going to be the standard on zen 5

Coudl X sku's have vcache and non-X won't have be a possibility?

Joe NYC · Dec 18, 2022

Geddagod said:
It doesn't really make sense since AMD already confirmed Zen 5 vanilla and Zen 5 3D stacked variants. So unless they just have a lineup without L3 at all (which I HIGHLY doubt), I don't think 3D stacked L3 cache is going to be the standard on zen 5

It's a new architecture, for which AMD already has a customer - El Capitan supercomputer.

It seems flexible enough to allow number of different configurations.

BTW, I think Zen 4c will be the CPU compute die for Mi300. L3 of Zen 4c will likely be small or none.

Mopetar · Dec 19, 2022

People keep acting as though v-cache is a magical solution to everything, when in reality is does very little or nothing for most productivity workloads. Insisting that it be a part of every CPU is just making them more expensive for the people who won't benefit from the inclusion.

It's certainly good for games, particularly so for a small handful of titles where it's better than moving to a top-end GPU, and cloud providers find it valuable, but for anyone else it really depends on your workload and mix. Mark has said it's good for some of his compute workloads, but for a lot of other professional work it doesn't do much if anything.

igor_kavinski · Dec 19, 2022

Mopetar said:
People keep acting as though v-cache is a magical solution to everything, when in reality is does very little or nothing for most productivity workloads.

Unfortunately, no one has done a multitasking benchmark comparison of 5800X and 5800X3D. I think the multitasking experience on the 5800X3D would be considerably improved as more application data would fit on the larger L3 cache.

BorisTheBlade82 · Dec 19, 2022

Mopetar said:
People keep acting as though v-cache is a magical solution to everything, when in reality is does very little or nothing for most productivity workloads. Insisting that it be a part of every CPU is just making them more expensive for the people who won't benefit from the inclusion.

It's certainly good for games, particularly so for a small handful of titles where it's better than moving to a top-end GPU, and cloud providers find it valuable, but for anyone else it really depends on your workload and mix. Mark has said it's good for some of his compute workloads, but for a lot of other professional work it doesn't do much if anything.

The main reason why V-Cache might become the new normal is this:

IEDM 2022: Did We Just Witness The Death Of SRAM?

IEDM 2022: Did we just witness the death of SRAM? While foundries continue to show strong logic transistor scaling, SRAM scaling has completely collapsed.

fuse.wikichip.org

Tuna-Fish · Dec 19, 2022

Now that the io-die is made on N6, an errant thought that's bouncing around in the back of my head, why not put L3 on it and stack the CPU on it? For more v-cache, add the layers in between the ccd and iod.

This would eliminate the fairly power-hungry and latency-adding interface between the iod and the ccd, and allow making L3 with the more economical N6 node which is still quite competitive in SRAM density.

Of course, since it would mean every product is stacked they would need a lot more manufacturing throughput. Probably not viable for Zen5 yet.

Mopetar · Dec 19, 2022

igor_kavinski said:
Unfortunately, no one has done a multitasking benchmark comparison of 5800X and 5800X3D. I think the multitasking experience on the 5800X3D would be considerably improved as more application data would fit on the larger L3 cache.

I've speculated as much myself, but that's going to be a difficult thing to quantify and the switching has to be frequent enough for people to care.

BorisTheBlade82 said:
The main reason why V-Cache might become the new normal is this:

IEDM 2022: Did We Just Witness The Death Of SRAM?

IEDM 2022: Did we just witness the death of SRAM? While foundries continue to show strong logic transistor scaling, SRAM scaling has completely collapsed.

fuse.wikichip.org

Unless the entire L3 cache can get moved it wouldn't make a lot of sense. The L1 and L2 cache sizes don't really need to increase because the added hit time probably doesn't offset the reduction in miss rate. Moving those to another layer might not be as viable due to timing reasons.

Just like IO which hasn't scaled well for a while now, the companies will just have to suck it up and accept that some parts of the die won't shrink. Smart designers will just use it as a way to help spread heat out more by moving hot spots close to the cache.

Edit: minor grammar fix to improve clarity

Exist50 · Dec 19, 2022

Tuna-Fish said:
Now that the io-die is made on N6, an errant thought that's bouncing around in the back of my head, why not put L3 on it and stack the CPU on it? For more v-cache, add the layers in between the ccd and iod.

This would eliminate the fairly power-hungry and latency-adding interface between the iod and the ccd, and allow making L3 with the more economical N6 node which is still quite competitive in SRAM density.

Of course, since it would mean every product is stacked they would need a lot more manufacturing throughput. Probably not viable for Zen5 yet.

It's an interesting idea, but the misalignment between the amount of IO die silicon and compute die silicon might be a problem. Also, it means that for lower tier SKUs, they can't really scale the L3 cache down. Though perhaps the biggest issue is that it would really constrain them if they wanted to reuse the IO die for another gen.

LightningZ71 · Dec 19, 2022

Mopetar said:
If he speculated as much myself, but that's going to be a difficult thing to quantify and the switching has to be frequent enough for people to care.

Unless the entire L3 cache can get moved it wouldn't make a lot of sense. The L1 and L2 cache sizes don't really need to increase because the added hit time probably doesn't offset the reduction in miss rate. Moving those to another layer might not be as viable due to timing reasons.

Just like IO which hasn't scaled well for a while now, the companies will just have to suck it up and accept that some parts of the die won't shrink. Smart designers will just use it as a way to help spread heat out more by moving hot spots close to the cache.

The entire L3 need not move. They can shrink the L3 on the compute die to 16 or 8 MB and have a large stacked L3 cache die in the 96MB range. This could allow more efficient die area usage, still allow lower end products without the stacked L3 die, allow more agressive core transistor growth for a given die size, or allows them more agressive ccd shrinkage.

eek2121 · Dec 20, 2022

LightningZ71 said:
The entire L3 need not move. They can shrink the L3 on the compute die to 16 or 8 MB and have a large stacked L3 cache die in the 96MB range. This could allow more efficient die area usage, still allow lower end products without the stacked L3 die, allow more agressive core transistor growth for a given die size, or allows them more agressive ccd shrinkage.

This. They can also do this with Zen 4 if they wanted. That would actually give them room to add a third chiplet or more cores. I don’t see AMD sticking with a 16 core limit when Intel is clearly willing to push things further.

I suspect the manufacturing capacity (for stacking) might still be an issue, however. We will see what happens with Zen 5. I am expecting AMD to surprise everyone with something different. Maybe they build something similar to what Apple did with the M1 Max, complete with GPU?

EDIT: I want to remind everyone that AMD is the company that brought us stuff like Bulldozer and x86-64. While Bulldozer's execution was flawed, the concept is pretty sound. We are way overdue to an optimization phase for silicon. Why have 16-24 AVX units if they will rarely be utilized? As low latency, high bandwidth on-package interconnect technologies mature, I suspect we will begin to see a disconnect between the various bits of a CPU. The term 'core' may even go away completely. We may see CPUs with 32 integer units, but only 16 floating point units. Possibly even shared L1 and L2 cache.

I'm not saying that Zen 5 will do any of this (though it is a possibility), but node shrinks are expensive, and being able to spend your silicon exactly where you need it and not needing to duplicate functionality where not needed will matter in the future. Especially if AMD and Intel start building serious GPU horsepower into their designs. Look, for example, at some of the legacy x86 stuff. By having a disconnected design, you can keep that logic in one area for the really old stuff, and the rest of the chip can exclude it. Even better, if you have an on chip FPGA, that logic can be loaded and used if needed, then unloaded so other instructions can be used.

Just some wild thoughts by me.

Joe NYC · Dec 20, 2022

Tuna-Fish said:
Now that the io-die is made on N6, an errant thought that's bouncing around in the back of my head, why not put L3 on it and stack the CPU on it? For more v-cache, add the layers in between the ccd and iod.

This would eliminate the fairly power-hungry and latency-adding interface between the iod and the ccd, and allow making L3 with the more economical N6 node which is still quite competitive in SRAM density.

Of course, since it would mean every product is stacked they would need a lot more manufacturing throughput. Probably not viable for Zen5 yet.

What you describe is what AMD is (apparently) planning for Mi300. 300+ mm N6 base with 2 compute units stacked on top of it, using hybrid bond.

What separates Mi300 from consumer level products is that this is well within the budget of a potential supercomputer or data center product.

Mopetar · Dec 20, 2022

LightningZ71 said:
The entire L3 need not move. They can shrink the L3 on the compute die to 16 or 8 MB and have a large stacked L3 cache die in the 96MB range. This could allow more efficient die area usage, still allow lower end products without the stacked L3 die, allow more agressive core transistor growth for a given die size, or allows them more agressive ccd shrinkage.

The silicon cost ends up about the same either way unless the v-cache is on an older node that can integrate. Otherwise, it's just trading cost between a bigger base die and assembly and packaging.

The interesting question is if AMD could just reduce the L3 by some amount with such minimal performance loss it wouldn't be noticed for most workloads, and if they'd do that in order to make a smaller chip.

scineram · Dec 20, 2022

Everything here is just trading costs between things.

BorisTheBlade82 · Dec 20, 2022

Mopetar said:
The silicon cost ends up about the same either way unless the v-cache is on an older node that can integrate.

For me the writing is on the wall that exactly this will happen - Cache on legacy processes.

moinmoin · Dec 20, 2022

Mopetar said:
unless the v-cache is on an older node that can integrate.

That's the whole point. SRAM doesn't scale further after N5 despite the higher cost. So V-Cache is bound to stay on N5, and as @LightningZ71 suggested it may even be worth moving existing SRAM to V-Cache to save area on the newer nodes.

Exist50 · Dec 20, 2022

moinmoin said:
SRAM doesn't scale further after N5 despite the higher cost.

Well, it doesn't scale for N3, and quite likely not even N2, but who knows what the future holds. I don't think SRAM scaling is completely dead.

moinmoin · Dec 20, 2022

Exist50 said:
Well, it doesn't scale for N3, and quite likely not even N2, but who knows what the future holds. I don't think SRAM scaling is completely dead.

GAA should improve it again, to be seen by how much. But until then calling it dead is apt. SRAM not scaling makes it very expensive to keep at the same size, nevermind increase in size. No more doubling of cache sizes on monolithic dies. Moving SRAM to a separate die on N5 or older is the only economical solution.

LightningZ71 · Dec 21, 2022

I think that SRAM cache dies will find their next long-term home on N4. While it doesn't offer much in the way of scaling, it should provide the needed switching speed to keep the L3 fast enough to be relevant. GAA should allow for a useful bump in total density, but, it will likely not carry forward shrinking even more. At that node, it'll still likely be more efficient to move that die area to N4 stacked cache (for at least a sizeable chunk of it) and use the rest on compute logic.

Tuna-Fish · Dec 21, 2022

LightningZ71 said:
I think that SRAM cache dies will find their next long-term home on N4. While it doesn't offer much in the way of scaling, it should provide the needed switching speed to keep the L3 fast enough to be relevant. GAA should allow for a useful bump in total density, but, it will likely not carry forward shrinking even more. At that node, it'll still likely be more efficient to move that die area to N4 stacked cache (for at least a sizeable chunk of it) and use the rest on compute logic.

Beyond GAA, backside power delivery is expected to approximately double SRAM density (because at that point, the wiring will be the hard constraint by far, and backside power will approximately double the achievable density of logic wiring on the frontside). TSMC is now expecting that to show up in some N2 node (but not the first one).

However, backside power delivery will also add a lot of cost, because it will add a whole new flow of manufacturing steps, and a lot of new things that can hurt yield.

I would not be shocked if when backside power is available, the lowest cost per bit is still on some older process, despite the greatly increased density.

Doug S · Dec 21, 2022

Tuna-Fish said:
I would not be shocked if when backside power is available, the lowest cost per bit is still on some older process, despite the greatly increased density.

Sure but even if that's the case you will benefit for lower cache layers that must remain on chip. Sure maybe it is cheaper to stack an L3 that is 3x as much silicon area per bit in an older process on top of a N2+ BPR die in exchange for increased latency.

Exist50 · Dec 21, 2022

Tuna-Fish said:
Beyond GAA, backside power delivery is expected to approximately double SRAM density (because at that point, the wiring will be the hard constraint by far, and backside power will approximately double the achievable density of logic wiring on the frontside). TSMC is now expecting that to show up in some N2 node (but not the first one).

However, backside power delivery will also add a lot of cost, because it will add a whole new flow of manufacturing steps, and a lot of new things that can hurt yield.

I would not be shocked if when backside power is available, the lowest cost per bit is still on some older process, despite the greatly increased density.

Where are you seeing that backside power delivery should double the density for SRAM?

Ajay · Dec 21, 2022

Doug S said:
Sure but even if that's the case you will benefit for lower cache layers that must remain on chip. Sure maybe it is cheaper to stack an L3 that is 3x as much silicon area per bit in an older process on top of a N2+ BPR die in exchange for increased latency.

Yeah, reading this 'news' it looks like on die L3$ will wind up going the way of the dodo in a few node cycles. So stacked L3$ will rule the day then. I wonder if memory side caches will come into to alleviate some of the latency issues in the whole memory system (or L4$, either on the IOD). I don't know if memory side caches need to be snoopable, seems like it wouldn't be worth it if that was the case.

Joe NYC · Dec 21, 2022

Ajay said:
Yeah, reading this 'news' it looks like on die L3$ will wind up going the way of the dodo in a few node cycles. So stacked L3$ will rule the day then. I wonder if memory side caches will come into to alleviate some of the latency issues in the whole memory system (or L4$, either on the IOD). I don't know if memory side caches need to be snoopable, seems like it wouldn't be worth it if that was the case.

Stacking may be it, while 3D stacked, using hybrid bond, is the only way to go.

But there are 2 possible approaches in the future, that would be 2.5D - horizontal. AMD mentioned EFB with hybrid bond, and TSMC mentioned SoIC_H, which would be interposer with chiplet attached via hybrid bond.

Once either or both of these technologies make it to production, then L3 on die or stacked could migrate to a separate SRAM stack or stacked either on top of memory controller or I/O die. And then, SRAM can work as system level cache, shared between different CPU chiplets or also graphics.

Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Golden Member

Lifer

Golden Member

Platinum Member

Diamond Member

Diamond Member

Lifer

Senior member

Golden Member

Diamond Member

Platinum Member

Platinum Member

Diamond Member

Diamond Member

Diamond Member

Senior member

Senior member

Diamond Member

Platinum Member

Diamond Member

Platinum Member

Golden Member

Diamond Member

Platinum Member

Lifer

Diamond Member