
Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 6000)


What do you expect with Zen 4?



A///

Golden Member
Feb 24, 2017
1,018
740
136
They could be testing out higher stacks of things, but keep in mind that the base die has to support connectivity to a specific number of stacks. Zen 3 is likely designed to support a maximum of 4 layers of cache die. HBM has similar limitations with the stacked die supporting pass-through for a limited number of layers. It takes die area for TSVs to support connection to each layer, so going from 4 to 12 might not be trivial. TSMC supposedly has the pitch down to 0.9 microns which is significantly smaller than any micro-solder ball tech, but that is still 900 nm, so tens of thousands of TSVs will still take some area. You also get a higher probability of bonding failure, so they often are not going to use the max except for extremely high end devices. With AMD’s one chiplet for many products strategy, we are likely to only see a limited height stack, not the max TSMC supports.
Even if they did support the max stack per TSMC's ability, you'd need to address other issues even if the bonding was a success. File it under easier said than done in some applications.
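The TSV area point can be put in rough numbers. A back-of-envelope sketch, where only the 0.9 micron pitch comes from the discussion above; the per-layer bond count is an assumption for illustration:

```python
# Back-of-envelope TSV/bond-pad area estimate. All counts are assumptions
# for illustration, not AMD/TSMC specs; only the 0.9 um pitch is from the post.

def bond_area_mm2(n_bonds: int, pitch_um: float) -> float:
    """Area consumed by a square grid of hybrid bonds at the given pitch."""
    pitch_mm = pitch_um / 1000.0
    return n_bonds * pitch_mm ** 2

# One pass-through signal set per stacked layer: going from 4 to 12 layers
# triples the TSV wiring each die must carry.
per_layer_bonds = 25_000          # assumed signal/power bonds per layer
pitch = 0.9                       # um, the hybrid-bond pitch cited above

area_4 = bond_area_mm2(4 * per_layer_bonds, pitch)
area_12 = bond_area_mm2(12 * per_layer_bonds, pitch)
print(f"4 layers:  {area_4:.3f} mm^2")
print(f"12 layers: {area_12:.3f} mm^2")
# Small against a ~36 mm^2 cache die, but keep-out zones and routing
# around each TSV add real overhead on top of the raw pad area.
```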
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
I have only seen up to 4 layers of cache, but that may be a Milan-X limitation rather than a Zen 4 limitation. I believe TSMC supports up to 12 layers, though. I don't know if they can test the die before stacking them. With HBM, they have "known good die stacks" to start with. Something can still go wrong in the bonding process, though.
From the tidbits I have come across, the hybrid bonding is going to be a very high yielding process.

Also, I don't know if the stacking has to start from the most expensive die (with cores) or not.

Suppose 2 (inexpensive) SRAM dies are first stacked together and tested, then 2x2 stacks are bonded and tested, and only the good stacks of 4 are bonded on top of the core die.

That would mean that even if there are yield losses, they would be minimized.

The cpu die with cache chips might be much more expensive since you are adding another set of steps where something can go wrong and reduce yield. I think only the very high end HPC products will get 4 layers. Those that end up in the consumer market as Ryzen parts might actually be salvage and/or single layer parts. If something goes wrong with a 4 layer stack, you might still be able to use it as a 2 layer, single layer, or no stacking if all of the cache die are unusable. They might have a specific single layer part for high end, but not ridiculously expensive Epyc processors. If something goes wrong with the single layer part, then they can probably still sell it as an Epyc or Ryzen without stacked cache enabled. There should be lots of opportunities for salvage, but there is still a huge amount of silicon going into these things.
If substrate is the bottleneck and not silicon, both TSMC and AMD would be happy to sell more silicon at a good mark up.
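The known-good-stack idea above can be sketched with a toy yield model. The per-bond success probability is an assumed number, purely for illustration:

```python
# Toy model of why pre-testing SRAM stacks protects the expensive core die.
# The bond yield below is an assumed figure, not a real TSMC number.

def naive_core_yield(p_bond: float, layers: int) -> float:
    """All cache layers are bonded directly onto the expensive core die;
    any single failed bond scraps the core die."""
    return p_bond ** layers

def pretested_core_yield(p_bond: float) -> float:
    """SRAM dies are stacked and tested separately (2 -> 4), and only a
    known-good stack is bonded to the core die: one risky bond per core."""
    return p_bond

p = 0.98                           # assumed per-bond success probability
print(naive_core_yield(p, 4))      # ~0.922: ~8% of core dies lost
print(pretested_core_yield(p))     # 0.980: ~2% of core dies lost
```

Failed pre-tested stacks cost only cheap SRAM dies, which is the salvage argument made above.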

BTW, I don't think the math changes greatly from Zen 3 to Zen 3D. The Zen 3 chiplet can go into a very high end server part, and the same chiplet also goes into the $299 5600X and $449 5800X.

AMD has the CPU gaming crown because it is not withholding these chips from desktop and gaming.

I am not sure this would need to change much. If it did change a lot (say, by limiting 4-layer Zen 3 to Milan), AMD would be handing the gaming desktop CPU market to Intel's Alder Lake.

That would be a dumb move, and I don't think AMD will do that.

If something like a GB (or 2) of SRAM on one package actually exists, I doubt it will be classified as “affordable” by most people.
This would only be possible near term for MCMs with 8 chiplets, and those are already not affordable to most people. But the 5800X is affordable.

For certain HPC applications, maybe big database servers and other things with expensive, per core licensing, it may be well worth the cost, but probably still not “affordable”.
These will likely be competing with Sapphire Rapids HBM in late 2022 / 2023, which will also most likely be +$5000 per CPU for HBM.

Cost of 4 layers of L3 x 8 chiplets will probably be ~$500. Well worth the price to beat Sapphire Rapids even with Milan X, (and Genoa on top of that).
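The ~$500 guess above, made explicit as arithmetic. The per-die cost is the implied assumption here, not a known figure:

```python
# The cost guess from the post, decomposed: 8 chiplets x 4 cache layers at
# an assumed per-cache-die cost (silicon plus bonding) of roughly $15-16.
chiplets = 8
layers = 4
cost_per_cache_die = 15.6          # assumed, chosen to land near the $500 guess

total = chiplets * layers * cost_per_cache_die
print(f"~${total:.0f} for {chiplets * layers} stacked cache dies")
```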
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
Chances are they're testing out 4-and-over layered stacks and have been for a while. I mean, why not dogfood your own future product, especially when it would net you hundreds of millions if not billions.
Exactly. If AMD wanted to have only 64MB more SRAM and not more, it would be simpler and maybe even less expensive to just add it to the base die.

AMD did not go into all the effort to only get 64MB.

TSMC is not building a huge packaging facility for 3D stacking just to sell a 36mm2 of L3.
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
Some TSMC stacking tech, like SoIC, mostly remove the power penalties for using multiple die, so it would be technologically possible (from a power consumption perspective) to make a stacked APU, perhaps with base IO die and cpu/gpu/memory chiplets stacked on top. It just might not be economically feasible to do that vs. a monolithic APU. If they are going to make all of their chiplets to be stacked rather than BGA package mounted, then they have to use stacking everywhere or do two different versions of the chiplet, one for stacking and one for BGA packaging.
Stacking would definitely be a way to split up the APU and reassemble it with smaller, cheaper, more optimized chiplets - through stacking. With Infinity Fabric On Package, I don't think AMD will ever get there because its power overhead makes it uncompetitive for laptops.

The stacking may not necessarily all be vertical. There was a patent from AMD for GPU design that incorporated active silicon bridge (also loaded with L3) that was stacked on top of GPU dies, and spanned multiple chiplets.

The main performance bottleneck being the memory, a monolithic APU with some HBM2e connected via embedded silicon bridge would perform very, very well without using any other stacking. It might be lower power if the HBM is used as a cache for the whole APU to reduce power needed to communicate with off package memory. I kind of doubt that AMD will move to a “stacking only” chiplet with Zen 4, so monolithic APUs will probably be around for a while. Perhaps everything will be stacked with Zen 5.
I have a feeling that TSMC and AMD just nailed SoIC stacking recently, and with this technology available to AMD (while Intel is 3 years behind with Foveros Direct), some of the AMD roadmaps are probably being re-drawn.
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
They could be testing out higher stacks of things, but keep in mind that the base die has to support connectivity to a specific number of stacks. Zen 3 is likely designed to support a maximum of 4 layers of cache die. HBM has similar limitations with the stacked die supporting pass-through for a limited number of layers. It takes die area for TSVs to support connection to each layer, so going from 4 to 12 might not be trivial. TSMC supposedly has the pitch down to 0.9 microns which is significantly smaller than any micro-solder ball tech, but that is still 900 nm, so tens of thousands of TSVs will still take some area. You also get a higher probability of bonding failure, so they often are not going to use the max except for extremely high end devices. With AMD’s one chiplet for many products strategy, we are likely to only see a limited height stack, not the max TSMC supports.
The hints from the BIOS of the Daytona platform point to 4 layers. That may be it for Zen 3.

But I bet a higher number will be supported by Zen 4.
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
Oops didn't notice your comment before. I think the only issue with the I/O die of desktop chips is that it just uses a lot of power, period, regardless of how it's oriented in the package. Stacking really isn't going to fix that. The I/O function of the APUs is cut down to save on idle power and total power consumption.
Communicating vertically through TSVs has minimal power cost vs. Infinity Fabric On Package. Also, the current bandwidth between the IOD and chiplets is the bare minimum, because of the high power cost.

With stacked connection, the bandwidth can go up by several orders of magnitude.

GPU is even more bandwidth hungry, which would burn even more power.

Not sure that the gamer buying one of those laptops is really going to care about an iGPU, either. The OEM might for some reason, but really, when it comes to AMD losing sales from lack of an iGPU on their desktop chips, that isn't the market that comes to mind.
High bandwidth connection (through stacking or EMIB) might enable a separate GPU chiplets, which could totally replace a separate mobile GPU market (which NVidia mostly owns).
 

jamescox

Senior member
Nov 11, 2009
283
504
136
Communicating vertically through TSVs has minimum power cost vs. Infinity Fabric On Package, Also, the current bandwidth between IOD and chiplets is bare minimum, because of high power cost.

With stacked connection, the bandwidth can go up by several orders of magnitude.

GPU is even more bandwidth hungry, which would burn even more power.



High bandwidth connection (through stacking or EMIB) might enable a separate GPU chiplets, which could totally replace a separate mobile GPU market (which NVidia mostly owns).
Are you saying that the stacked version would be too high of power vs. a monolithic die? That doesn’t seem to be the case. The SoIC stacking should be only marginally more power than having the units on the same die. In some cases, it may even be lower power due to shorter interconnect paths vs. running traces to the other side of the die.

The other forms of stacking that use micro-solder balls are an order of magnitude lower connectivity than SoIC and they take a little more power. It is significantly lower power than running through the package substrate (IFOP). IFOP are significantly lower power than going off package; IFIS (pci-express) or DDR memory connections.
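The interconnect power hierarchy described above (SoIC < micro-bump < IFOP substrate < off-package) can be made concrete with energy-per-bit figures. The numbers below are ballpark assumptions for illustration, not published specs:

```python
# Assumed/illustrative energy cost per bit for each interconnect class
# mentioned in the post, in picojoules per bit transferred.
ENERGY_PJ_PER_BIT = {
    "SoIC hybrid bond":         0.05,
    "micro-bump 2.5D":          0.30,
    "IFOP (package substrate)": 2.0,
    "off-package (DDR/SerDes)": 15.0,
}

def watts(link: str, gbytes_per_s: float) -> float:
    """Power needed to sustain a given bandwidth over a link."""
    bits_per_s = gbytes_per_s * 8e9
    return bits_per_s * ENERGY_PJ_PER_BIT[link] * 1e-12

# Power to move 100 GB/s over each class of link:
for link in ENERGY_PJ_PER_BIT:
    print(f"{link:26s} {watts(link, 100):6.2f} W")
```

This is why bandwidth over IFOP is kept to the bare minimum while a stacked connection can afford orders of magnitude more.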
 

jamescox

Senior member
Nov 11, 2009
283
504
136
Exactly. If AMD wanted to have only 64MB more SRAM and not more, it would be simpler and maybe even less expensive to just add it to the base die.

AMD did not go into all the effort to only get 64MB.

TSMC is not building a huge packaging facility for 3D stacking just to sell a 36mm2 of L3.
One layer is 96 MB total. Four would be 288 MB. I don't think we will see the 288 MB versions in the desktop market. There are likely diminishing returns and trade-offs of some kind, like more power consumption. Some games didn't really improve much with the jump to 96 MB. I suspect that the 288 MB version will be Milan-X only, or perhaps Milan-X and some Threadrippers. They will probably only be for HPC and certain high end servers.
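The capacity arithmetic behind these figures: the Zen 3 CCD has 32 MB of L3 on the base die, and each stacked V-Cache die adds 64 MB.

```python
# L3 capacity per Zen 3 CCD: 32 MB base + 64 MB per stacked V-Cache die.
def ccd_l3_mb(stacked_layers: int) -> int:
    return 32 + 64 * stacked_layers

assert ccd_l3_mb(1) == 96     # the announced Zen 3 V-Cache configuration
assert ccd_l3_mb(4) == 288    # the hypothetical 4-high stack discussed here

# An 8-chiplet EPYC package at 4 layers each:
print(8 * ccd_l3_mb(4), "MB")  # the "GB-range SRAM" scenario from earlier posts
```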
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
Are you saying that the stacked version would be too high of power vs. a monolithic die?
No, I was not comparing stacking vs. monolithic die. I was just saying that the cost of stacking is minimal (vs any other connection technology - IFOP, EMIB, which have much higher power cost).

The SoIC stacking should be only marginally more power than having the units on the same die. In some cases, it may even be lower power due to shorter interconnect paths vs. running traces to the other side of the die.
I agree with that.

Additionally, it can enable some architectures that would otherwise have prohibitive power costs, or be outright impossible.

The other forms of stacking that use micro-solder balls are an order of magnitude lower connectivity than SoIC and they take a little more power. It is significantly lower power than running through the package substrate (IFOP). IFOP are significantly lower power than going off package; IFIS (pci-express) or DDR memory connections.
The IFOP is a great technology that got AMD where it is, but even Milan X will prove it to be extremely limiting for the future.

Suppose chiplet 1 needs data that is in L3 of chiplet 2. The bandwidth limit and latency hit would make those accesses only marginally faster than single channel DRAM access.

OTOH, with some stacked silicon bridges, latency, bandwidth, and power consumption could all improve by orders of magnitude.
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
One layer is 96MB total. Four would be 288 MB. I don’t think we will see the 288 MB versions in the desktop market.
AMD is surely analyzing all this. But the beauty is that productizing Zen 3D with 1 layer or 4 does not take another die. It can be decided late in the process, during assembly, based on market reception and conditions.

Threadrippers are most likely super low volume, yet AMD productized a number of different SKUs.

There is no reason not to do the same for desktop / gaming, and let the market decide if gamers want to pay the premium.

Some people might prefer 16 cores, others might prefer 8 cores with an additional 256 MB of L3.

There is one game I play a lot that is totally CPU limited and benefits from more L3, so I would be a buyer of that 256 MB L3, 8-core model, and I would not care about the 16-core models.

There is likely some diminishing returns and trade-offs
Particularly in gaming, I would definitely take diminishing returns of extra L3 over no return of extra cores.

and trade-offs of some kind, like more power consumption. Some games didn’t really improve much with the jump to 96 MB.
We will see if any reviewers start using a better review metric, such as units of task completed per unit of energy used.

I could see the L3 feeding the cores so much better that the power consumption of the cores goes up. So it would be wrong to blame the L3 for the increased power consumption.

And at the same time, energy spent on unit of task completion likely went down.

I suspect that the 288 MB version will be Milan-x only or perhaps Milan-x and some threadrippers. They will probably only be for HPC and certain high end servers.
Whoever decides to pay the premium. AMD's cost for 8 x 4 x 64 MB may be $500 (my guess). If anyone wants to pay $2,000 for it, AMD will be more than happy to make the sale.
 

NTMBK

Diamond Member
Nov 14, 2011
9,366
2,830
136
No, I was not comparing stacking vs. monolithic die. I was just saying that the cost of stacking is minimal (vs any other connection technology - IFOP, EMIB, which have much higher power cost).



I agree with that.

Additionally, can be enabling some architectures that otherwise would have prohibitive power costs, or outright impossible.



The IFOP is a great technology that got AMD where it is, but even Milan X will prove it to be extremely limiting for the future.

Suppose chiplet 1 needs data that is in L3 of chiplet 2. The bandwidth limit and latency hit would make those accesses only marginally faster than single channel DRAM access.

OTOH, with some stacked silicon bridges, both latency, bandwidth and power consumption could improve by orders of magnitude.
I thought the L3 was dedicated to the CCX it is attached to? That's why Zen 3 was a big jump, because the L3 each core had access to doubled.
 

A///

Golden Member
Feb 24, 2017
1,018
740
136
Exactly. If AMD wanted to have only 64MB more SRAM and not more, it would be simpler and maybe even less expensive to just add it to the base die.

AMD did not go into all the effort to only get 64MB.

TSMC is not building a huge packaging facility for 3D stacking just to sell a 36mm2 of L3.
Precisely, and I remember an article from late 2019 about TSMC wanting to integrate its various offerings more into AMD's products, but the GloFo contract put a damper on those plans.
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
I thought the L3 was dedicated to the CCX it is attached to? That's why Zen 3 was a big jump, because the L3 each core had access to doubled.
I think it is a victim cache - meaning it holds data that was evicted from L1 and L2. So it is recently used code and data.

The chiplet considers it to be its own, exclusive memory.

But the contents of L3 correspond to a certain region of DRAM, and if other cores want to use that DRAM, they have to check whether it is in the L3 of any of the chiplets. (I may not be 100% accurate on this, it's just the general idea.)

So if the portion of memory that a core wants to use is in another core's L3, it can get it from there faster. But not that much faster, because of the limits on IFOP links.

BTW, it seems that AMD is now calling these links GMI...

So, anyway, as L3s grow into the GB range, there is a higher likelihood that data is in one of the L3s. The bigger the L3, the bigger the benefit from a faster, lower latency chiplet interconnect - and the bigger the need for a next gen interconnect.
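The victim-cache behavior described above - the L3 filled by lines evicted from the inner caches rather than by fills from DRAM - can be sketched with a toy model. This is illustrative only, not AMD's actual replacement policy:

```python
# Toy model of a victim L3: it receives lines evicted from L2 and itself
# evicts its least-recently-inserted line toward DRAM when full.
from collections import OrderedDict

class VictimL3:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()          # address -> data, in LRU order

    def insert_victim(self, addr, data):
        """Called when L2 evicts a line; the L3 is filled only this way."""
        self.lines[addr] = data
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict LRU line toward DRAM

    def lookup(self, addr):
        """Returns the line if present, else None (miss goes to DRAM)."""
        return self.lines.get(addr)

l3 = VictimL3(capacity=2)
l3.insert_victim(0x100, "a")
l3.insert_victim(0x200, "b")
l3.insert_victim(0x300, "c")   # capacity exceeded: 0x100 falls out to DRAM
assert l3.lookup(0x100) is None
assert l3.lookup(0x300) == "c"
```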
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
Precisely, and I remember an article from late 2019 about TSMC wanting to integrate their various offerings more into AMD's products but the GloFlo contract was a damper in those plans.
Then the move of the IO die to TSMC makes these chiplets "eligible" for additional TSMC technology, including packaging and stacking technologies.
 

Vattila

Senior member
Oct 22, 2004
644
852
136
Suppose chiplet 1 needs data that is in L3 of chiplet 2. The bandwidth limit and latency hit would make those accesses only marginally faster than single channel DRAM access.
Just a little clarification (I think we've discussed this before in another forum): The L3 is not shared between CCXs. That would create horrible contention for 64-core EPYC with 8 CCXs, and even more so for 2-socket 128-core systems. The states of the caches are only kept consistent within the rules of the x86 memory model, using a cache-coherency algorithm which is designed to do as little as possible — just enough to make it possible for all cores to agree on the state of memory.

Apart from any synchronisation needed by cache-coherency, an L3 miss goes straight to memory, as I understand it. Correct me if I am wrong.

Interestingly, the thing that kills performance and makes inter-CCX latency a bottleneck is high use of shared memory and locks. This puts the cache-coherency algorithm in overdrive with a lot of synchronisation between cores and CCXs. When cores work on separate memory only, or treat any shared data as read-only, the need for synchronisation goes away.

PS. The non-sharing of L3 is also why increasing the L3 available to each CCX has such a big effect on performance, even in chips with multiple CCXs and high total L3, since that total isn't accessible to each CCX. We saw this with the increase from a 4-core to an 8-core CCX with a larger shared L3 cache. V-Cache multiplies this effect.
 

naukkis

Senior member
Jun 5, 2002
461
316
136
I think it is a victim cache - meaning it has data that was ejected from L1 and L2. So it is recently used code and data.

The chiplet considers it to be its own, exclusive memory

But the content of L3 corresponds to certain region of DRAM, and if other cores want to use DRAM, they have to check if it is not in any of the L3 of any of the chiplets. (I may not be 100% accurate on this just a general idea).

So if the portion of memory that a core wants to use is in another core's L3, it can get it from there faster. But not that much faster, because of the limits on IFOP links.

BTW, it seems that AMD is now calling these links GMI...

So, anyway as L3s grow to GB range, there is higher likelihood that data is in one of L3s, so bigger L3, bigger the benefit from super faster and lower latency chiplet interconnect. Also, the bigger the need for next gen interconnect.
The speed limitation isn't coming from the interconnect. After a CCX L3 miss, the memory request is sent to the IO die. The IO die has the L3 tags from all chiplets and does the L3-tag comparison at the same time as the memory request to DRAM. That massive L3-tag search (probably multi-part: first at the IO die and, on a hit, then in the CCX that got hit) is actually slowing down DRAM access too, so improving the IO-die-to-chiplet interconnection won't speed things up much at all. They need a different kind of topology if they want to boost cache access from other chiplets - and that probably won't be worth it, at least for now.
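The flow described here - a directory of L3 tags on the IO die, checked in parallel with a speculative DRAM read - can be sketched roughly. All names and data structures below are illustrative, not AMD's actual implementation:

```python
# Sketch of a miss resolved through an IO-die tag directory (probe filter).
# The DRAM read is launched in parallel with the directory lookup; if the
# directory says another CCD's L3 holds the line, that copy wins.
def resolve_miss(addr, probe_filter, dram, remote_l3):
    dram_data = dram[addr]               # speculative DRAM read, launched anyway
    owner = probe_filter.get(addr)       # directory lookup on the IO die
    if owner is not None:
        hit = remote_l3[owner].get(addr)  # forward a probe to the owning CCD
        if hit is not None:
            return hit, f"L3 of CCD{owner}"
    return dram_data, "DRAM"

probe_filter = {0x40: 1}                 # line 0x40 may live in CCD1's L3
dram = {0x40: "stale", 0x80: "fresh"}
remote_l3 = {1: {0x40: "dirty-latest"}}

print(resolve_miss(0x40, probe_filter, dram, remote_l3))  # served from CCD1
print(resolve_miss(0x80, probe_filter, dram, remote_l3))  # served from DRAM
```

The point naukkis makes is that the directory search itself sits on the critical path of every DRAM access, so a faster chiplet link alone doesn't remove it.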
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
Speed limitations ain't coming from interconnect. After CCX L3 miss memory request is asked from IO-die. IO-die has L3 tags from all chiplets and makes L3-tag comparison same time with memory request from dram. That massive (probably multiple part, first at IO-die and if hit then in CCX that got hit)L3-tag search is actually slowing down dram-access too so improving IO-die and chiplet interconnection won't speed things up much at all. They need different kind of topology if they want to boost cache-access from other chiplets - and that's probably won't be worth of, at least now.
Interesting. I wonder if a faster I/O chiplet could perform these lookups faster than making the memory access. Also, higher bandwidth and lower latency connections should change the math, so that there would not be redundant accesses going out of the MCM.

I am still not sure how data gets written to memory - the cache write policies. It does not make a lot of sense to me that the main memory would hold up-to-date data that is being manipulated in a chiplet.
 

jpiniero

Diamond Member
Oct 1, 2010
9,939
2,285
136
BTW, looking at the title, Raphael and Phoenix are probably going to be 7000 series. Zen 3 w/vcache and Rembrandt are 6000.
 

jamescox

Senior member
Nov 11, 2009
283
504
136
Exactly. If AMD wanted to have only 64MB more SRAM and not more, it would be simpler and maybe even less expensive to just add it to the base die.

AMD did not go into all the effort to only get 64MB.

TSMC is not building a huge packaging facility for 3D stacking just to sell a 36mm2 of L3.
They are certainly going to be using the stacking tech for all manner of things, not just cache, in the future. Cache is just the first step since stacking logic likely requires cooling tech where the level of complexity will be significantly increased. It is a lot more difficult to stack anything other than memory due to power delivery through the stack and thermals to get the heat out of the stack. Future Zen parts may be an IO base layer with an array of different chips stacked on top, but anything that requires significant power will be a problem.

While they are going to be making 4-cache-layer die stacks, I think it is questionable whether they will sell those as Ryzen parts. AMD has not gone for as much artificial market segmentation as Intel, but they still want some segmentation. If a 4-stack Ryzen part existed, I would likely get it instead of Threadripper Pro since it would perform spectacularly for compile jobs, but it would likely be at such a high price that you might as well get Threadripper Pro anyway, so there's probably no point in making it.
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
They are certainly going to be using the stacking tech for all manner of things, not just cache, in the future. Cache is just the first step since stacking logic likely requires cooling tech where the level of complexity will be significantly increased. It is a lot more difficult to stack anything other than memory due to power delivery through the stack and thermals to get the heat out of the stack. Future Zen parts may be an IO base layer with an array of different chips stacked on top, but anything that requires significant power will be a problem.

While they are going to be making 4 cache layer die stacks, I think it is questionable whether they will sell those as a Ryzen parts. AMD has not gone for as much artificial market segmentation as Intel, they still want some segmentation. If a 4 stack Ryzen part existed, I would likely get it instead of threadripper pro since it would perform spectacularly for compile jobs but it would likely be at such a high price that you might as well get threadripper pro anyway, probably no point in making it.
I think AMD will have an SKU with as many layers of L3 as it takes to beat Alder Lake in gaming, to keep the gaming crown.

Adding one layer and still losing would be silly, a waste of energy, and would backfire.

A good metaphor would be inventing a drug that takes a 100 mg dose to save lives. Administering 25 mg and letting the patient die would achieve nothing and would discredit the drug.
 

jamescox

Senior member
Nov 11, 2009
283
504
136
I think AMD will have an SKU with as many layers of L3 as it takes to beat Alder Lake in gaming, to keep the gaming crown.

Adding one layer and still losing would be silly and waste of energy and would backfire.

Good metaphor would be inventing a drug that takes 100 mg dose to save lives. Administering 25 mg and letting the patient die would achieve nothing and would discredit the drug.
It is what it is. They likely can’t make an arbitrary number of cache die layers. The number of layers was decided a long time ago. They have to thin the base die to a specific thickness for either 1 or 4 cache die layers to get the proper height. They could possibly make the height different between Ryzen and Epyc, which would mean that there is likely no cross over. Epyc is 4 or nothing and Ryzen is 1 or nothing. If the height is the same, then the 4 high stack has to be thinned more. Also, they probably aren’t going to do anything other than 1 and 4. If they have 2 high stacks, it will likely be a salvaged 4 high stack part. Four high is likely the maximum for Zen 3 since the base die and cache die have to be designed with the connectivity and pass through TSVs for a specific number of layers. It is possible that the actual volume production will be 6 nm, possibly with some other tweaks in addition to the added cache die.

 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
It is what it is. They likely can’t make an arbitrary number of cache die layers. The number of layers was decided a long time ago. They have to thin the base die to a specific thickness for either 1 or 4 cache die layers to get the proper height. They could possibly make the height different between Ryzen and Epyc, which would mean that there is likely no cross over. Epyc is 4 or nothing and Ryzen is 1 or nothing. If the height is the same, then the 4 high stack has to be thinned more. Also, they probably aren’t going to do anything other than 1 and 4. If they have 2 high stacks, it will likely be a salvaged 4 high stack part. Four high is likely the maximum for Zen 3 since the base die and cache die have to be designed with the connectivity and pass through TSVs for a specific number of layers. It is possible that the actual volume production will be 6 nm, possibly with some other tweaks in addition to the added cache die.

It's possible that 4 is the limit from technical reasons.

But my point was to the discussion of:
"1 is all there will be"
"Price of > 1 would be astronomical"
"> 1 will be reserved for a sacred place in a cathedral HPC"

My point is that if it takes 2 layers to win the gaming crown, there will be an SKU with 2.
If it takes 4 layers to win the gaming crown, there will be 4.

For AMD to do something that falls short while winning is within easy grasp - that would not be smart.

We already have someone at AMD who came up with the idea that the 5950X, not the 5800X, should have the highest turbo clock. Hopefully that person has been transferred to AMD's office in Outer Mongolia.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,325
758
136
my point is that if it takes 2 layers to win gaming crown, there will be an SKU with 2.
If it takes 4 layers to win the gaming crown, there will be 4.
This would absolutely be correct if Jensen Huang was in charge of AMD. I think it's probably true for Lisa also, not because she has the same pathological need to "win at any cost" as Jensen does, but rather that she's smart enough to see how greatly this philosophy has benefited Nvidia.
 

Joe NYC

Senior member
Jun 26, 2021
272
218
76
This would absolutely be correct if Jensen Huang was in charge of AMD. I think it's probably true for Lisa also, not because she has the same pathological need to "win at any cost" as Jensen does, but rather that she's smart enough to see how greatly this philosophy has benefited Nvidia.
If you listen to Lisa Su, she states that her goal for AMD is to be the leader in HPC.

She achieved it in servers (with gaming as a side bonus). She is not about to relinquish it because some on this thread think that stacking a second $6 die of silicon is just too much trouble compared to stacking the first one.
 
