Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
Except for details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

MadRat

Lifer
Oct 14, 1999
That's just stupid... Even a dual-socket EPYC board is only about $650. The top-of-the-line Alder Lake and Ryzen motherboards are $600. A CPU and video card are a lot more complex than a motherboard. Memory is only about speed and quantity; it would also not be $1000. Your post makes no sense, and is pointless.
So things like a WRX80 chipset are stupid? If we are talking workstation iron I surely hope you are kidding. If you are throwing out that kind of money then SOC is a waste of the CPU potential.
 

Mopetar

Diamond Member
Jan 31, 2011
A lot of this speculation comes from what people expect RDNA3 to be. Navi31 and Navi32 are supposed to have 2 GPU modules with the abbreviation "GCD" and one connecting die named "MCD".

This is the best picture I could find from patent applications:

[attachment: figure from the patent application]


So we will find out what is possible, what AMD has been up to, when RDNA3 is announced. Since Bergamo is after RDNA3, the technology that RDNA3 uses should also be available to Bergamo.

From the patent it sounds like the bridge die only contains cache along with the means to connect the LLC of any connected chiplets so that they can function as one large cache from the perspective of the CPU or even the GPU itself. The patent itself shows multiple arrangements where the bridge can be below (Fig. 2) the chiplets or sit on top (Fig. 3) of them, but this is probably just to cover an easy derivative workaround that wouldn't otherwise be covered by the patent.

I don't think this approach is well suited to use in the manner you suggest, at least at this time. It certainly could, in some limited situations, but for something like Bergamo it would need to be able to overlap all chiplets, which becomes increasingly complex when there are more than 4 chiplets as some chiplets are going to be completely covered by (or completely cover) the bridge die. That makes for more complicated heat management issues as well.

Since it's largely for stacked cache you'd want to position everything so that the extra layer is sitting over top of the cache. There are a few ways to do this with clever design because chiplets can be rotated so that the areas with the cache are where the stacked layer will overlap, or can be placed to allow for a bridge to band across a row of chiplets, but it gets complicated in a hurry beyond those arrangements.

If AMD were to do what you suggested, they'd likely change the design of the chiplets to include parts of the IO, just like Intel's new server chips where each tile has some of the IO and there is no IO die. It's the same with Navi where each of the GPU chiplets contains the IO as well and there is no IO die. If something like that happens, it's with Zen 5 at the earliest which is going to be the first major new design.

Even if Bergamo were to utilize these cache/bridge layers to create chips with massive L3 caches, though not necessarily a complete SLC, it would still probably have a separate IO die that the chiplets all connect to.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
So things like a WRX80 chipset are stupid? If we are talking workstation iron I surely hope you are kidding. If you are throwing out that kind of money then SOC is a waste of the CPU potential.
You did not mention that you were talking about WRX80, but even then it's $850-900. You just said $1000 for every item, and that's just wrong. You have to know the platform first. There is desktop, then HEDT, then server, and they are all very different.
 

jamescox

Senior member
Nov 11, 2009
Infinity Cache != 3D V-Cache

On Ponte Vecchio Rambo Cache tiles appear to be used as interposers under the central tiles, connecting them all. Cache connecting GPU MCM chiplets would be the next logical step for Infinity Cache as used in RDNA2. Looks like both Genoa and RDNA3 will take that step at once.
Already said in the post above, but I don’t see why they couldn’t share the same die. There are several ways they could make a design that could be used either way. There aren’t even any bumps with the hybrid bonding, so they can put TSVs anywhere. They just might go unused in some products.
 

NostaSeronx

Diamond Member
Sep 18, 2011
Nosta I think would suggest not shrinking but going FDSOI for the IOX.
The only case for FDX being inserted into the IOD mix is to fit in low-cost, long-life variants. Dropping the Ryzen name for the Athlon name, most likely exclusive to V-cache variants to offset single-channel DDR4/DDR5. Basically exclusive to CPU, GPU (PCIe 4.0 x4), single-stick (1x8/1x16/1x32) DDR4/DDR5 business OEM models. Optimizing for the worst config, which sells everywhere when they pop up.
With the above in mind, would shrinking the IO die to, say, 6/7nm possibly make more sense if they added a small iGPU and/or an L4 cache and kept the IO mostly near the perimeter?
Specifically, cIOD3:
7/6nm IOD has GFX, VCN, and ACP.
+Small GPU
+Added Features w/ GPU(Multimedia cores for Video&Audio Dec/Enc, Audio post-processing) ==> Raphael-H is in private mode with Sound Open Firmware.
 

amd6502

Senior member
Apr 21, 2017
As far as FDX for the IOX, I thought a big advantage over the current 12nm IOX would be energy efficiency. But cheaper IO dies are good as well, and very important.

I think those additions (iGPU especially) would be a great move for enthusiast desktop, even notebooks (mid to high end), and workstations. The energy efficiency of the newer node (over 12nm APU) would probably (?) just about offset the cost of MCM (extra energy for chiplet to IOX communication). That would allow them to skip monolithic 8c APU for Zen4. (I think they should've skipped it for either Zen2 or Zen3).

Current consumer oriented 8c monolithic Zen3 APU should be great for OEM desktops (traditional and all-in-ones) as well as high end laptop.

For the first Zen4 monolithic APU they should take a hint from Apple's M1 (which is equivalent to a very big core 4c/8t even though it is an 8c with its big.little 4+4). They are centered on mainstream computing and cover everything above bottom end mobile to just below very high end desktops---basically something that will work really well for 99% of the typical computer users. Imho that would be a Zen4 4c/8t to 6c/12t APU with a somewhat more capable GPU than Raven/Picasso's 11 Vega CU--perhaps 7 to 8 CU of Navi. Top models should just about max out what can be done with typical DDR5 dual channel memory. Apple kind of went way beyond that GPU power, I think piggybacking on the advantage they have with the on-chip RAM.
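A quick sanity check on that dual-channel bandwidth ceiling (the speed grades and channel widths below are my illustrative assumptions, not confirmed platform specs):

```python
# Peak DRAM bandwidth = transfers/s * bytes per transfer * channels.
# Speed grades here are illustrative picks, not confirmed specs.
def ddr_bw_gbs(mt_s: float, channels: int, bus_bits: int = 64) -> float:
    return mt_s * (bus_bits / 8) * channels / 1e9

ddr4_dual = ddr_bw_gbs(3200e6, 2)  # DDR4-3200 dual channel (Raven/Picasso class)
ddr5_dual = ddr_bw_gbs(5200e6, 2)  # DDR5-5200 dual channel
print(ddr4_dual, ddr5_dual)        # 51.2 GB/s vs 83.2 GB/s
```

Roughly 60% more bandwidth over DDR4-3200, which is consistent with the point that the top models would max out around 7-8 Navi CUs rather than scaling much further on an ordinary dual-channel socket.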

They should also do the vanilla update to the existing IOX. That should have a cost savings of about 3-4 billion transistors, which at the consumer level might mean you pay $50/$80/$120 more for the graphic capable MCM (3CU/6CU/8CU enabled).

Now if Zen4 doesn't land on AM4, I think it'd be important to have some BGA versions available to DIY consumers. Basically, an inseparable CPU+motherboard where the consumer has no desire to upgrade. (That would save a lot of the headaches of the AM4 era where newbs broke the pins off some fancy Ryzen.)

These things are getting so powerful that I think that if you get 6c or more it won't be obsolete for decades. Moore's law is slowing; the issue would be more whether the manufacturer continues to update software/firmware/security. So BGA is starting to make more sense in the desktop world, at least in the lower-mid mainstream range.
 

moinmoin

Diamond Member
Jun 1, 2017
If AMD were to do what you suggested, they'd likely change the design of the chiplets to include parts of the IO, just like Intel's new server chips where each tile has some of the IO and there is no IO die. It's the same with Navi where each of the GPU chiplets contains the IO as well and there is no IO die. If something like that happens, it's with Zen 5 at the earliest which is going to be the first major new design.
That'd be like going back to the Zen 1 Zeppelin design.
 

Joe NYC

Golden Member
Jun 26, 2021
The problem as I see it is that AMD will continue to use a lot of 7/6 nm wafers for compute/bridge/cache dies. Plus the console chips aren't moving to 5 nm, so they are taking up a significant amount of 7/6 nm wafers as well. I don't expect AMD to be able to make enough 5nm and 7/6nm products even if the IO die stays on GF. Moving the IO die to 7/6 nm will only further constrain their ability to supply their customers. Maybe it's worth it if AMD can significantly improve efficiency by moving the IO die to TSMC to keep a clear premium-tier product lead, but I have my doubts that moving the IO die makes that much of a difference. Whatever process it is on, I hope someone does a deep dive into the new IOD and how it compares to the old one.

I think another argument for TSMC IOD is going to be reuse. AMD already has a number of functional blocks on 6nm in Rembrandt (monolithic APU), in CDNA2, soon likely on desktop and server CPU I/O, and in discrete consumer GPUs.

One way to think about TSMC N6 is that it may be AMD's process technology for I/O for the next decade, across a large (and, with the Xilinx merger, growing) portfolio of products. The dividends will come over time, not necessarily on Day 1.

Now let's see what TSMC is doing:

TSMC to build new chip factory in Taiwan's southern city amid shortage

"TAIPEI, Nov 9 (Reuters) - Taiwan Semiconductor Manufacturing Co (2330.TW) (TSMC) said on Tuesday it will set up a new chip factory in the island's southern city of Kaohsiung...

....plant will produce advanced 7-nanometer chips as well as mature 28-nanometer semiconductors."
TSMC to build new chip factory in Taiwan's southern city amid shortage | Reuters


The key is 7nm. Here's how it was described in planning stages, before the announcement, as 7nm hub:
7nm process hub in Kaohsiung? TSMC does not rule it out - Focus Taiwan
 

ryanjagtap

Member
Sep 25, 2021
You know, I had a thought. It's not related to zen4 but I don't know where to post it.
AMD is already doing chiplets with a compute die and I/O die, and they have also done an optimized SRAM process which fits 64MB of cache in the size of 32MB of cache for the V-cache technology. What if they separate the L3 from the compute die and use the optimized process to make smaller but still dense SRAM dies, and create a processor with the amalgamation of these technologies?
 

Joe NYC

Golden Member
Jun 26, 2021
512 MB as 64 MB die would be 8 die which is the same number of CPU die as in Bergamo. It would sit on top of or under the IO die and overlap the cpu chiplets to act as a bridge and as cache die. I was hoping for 1 GB of cache in Bergamo though, either as L3 or L4. Perhaps we will get an increase up to 128 MB for some die or it will be more than one stack thick. If they can go up to 4 high, then it could have 2 GB. With the TSV connections it could act as L3 with significantly faster connection to the rest of the system. It really would look a lot more like a monolithic L3 cache. It would fit in with the lower power consumption goals of Bergamo since there would be no serdes links to cpu chiplets. The TSV connections would be much lower power and significantly higher bandwidth.

Yup, it would be sort of nirvana of this technology / path, of breaking up a large monolithic die, then reconnect it to a large monolithic die with minimum of costs (in terms of power, latency, bandwidth limits).

As far as the total size of the L3 for the Zen 4 generation on the server, it is a safe assumption that it will grow well beyond Zen 3's, and if all of this L3 could be seen as one massive L3 by all threads on all cores, it would increase the pool of server applications that would benefit from it.

And it would also lessen the pain of potentially moving data from (fast) local memory to (slow) CXL-type devices.

It would also allow massive caches without using any extra 2D packaging area since it would be stacked. The entire IO die plus 8 cpu chiplets might actually fit in a single reticle size. The IO die would be smaller than the Genoa die since it would have half of the number of serdes links. The space for TSVs is likely much smaller than serdes links that can operate at pci-e 5 speeds or greater. The cpu chiplets would be similar in size even with 2x cores since cache would be stacked and possibly higher density, lower power libraries/process.

Between SerDes and L3 (with just 32 MB), it is half of the Zen 3 die. So that's huge savings on a very expensive node, from moving this functionality from N5 to N6.

And like you said, if Bergamo were to adopt this approach and move L3 from the main die to stacked dies on bridges, Bergamo could move to 16-core chiplets. And even with extra new functionality, the chiplet could be the same size as an 8-core Zen 3 chiplet.

On IO die, Genoa will have 12 of the SerDes links, so if that were to switch to 8 TSV links, the die savings would be even greater on IOD.

They definitely could use the same cache chip across several products. The TSV area isn’t that large and they could do something like connections in the middle for acting as v-cache and connections on the ends when acting as a bridge.

I don’t know how this fits with the RDNA3 GPUs, which are (I believe) rumored to have 256 and 512 MB variants. With 2 graphics die and 4 bridge/cache chips, that would be 256 MB. There likely would not be room for 8 cache die between 2 graphics die, so that would indicate that the cache chips are 128 MB, either larger die or 2 stacks. The 64 MB die is only 36 mm2, so going with a 128 MB die at around 72 mm2 (roughly the same size as a cpu chiplet) would make a lot of sense. I don’t know how they would reuse those as v-cache chips though, since it would cover the whole cpu chiplet. That would possibly cause thermal issues unless the cache die moves to the bottom of the stack with the cpu on top.

On the GPU, assuming Infinity Cache on the MCD, it would make sense to have one design for the MCD, shared between Navi 31 and Navi 32, with Navi 31 using 2 of the dies vs. 1 on Navi 32. I think more dies would be overkill. With SRAM, there is redundancy and extremely high yields, so smaller dies would not improve yields, only add assembly costs.
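For what it's worth, the die-size figures quoted above are self-consistent. Taking the 64 MB / 36 mm² V-cache die as the density baseline (the poster's numbers, not measured dies), a 128 MB MCD comes out right at ~72 mm²:

```python
# Density implied by the quoted 64 MB / 36 mm2 V-cache die (~1.78 MB/mm2).
# All figures are this thread's estimates, not confirmed measurements.
MB_PER_MM2 = 64 / 36

for cap_mb in (64, 128):
    print(f"{cap_mb} MB -> ~{cap_mb / MB_PER_MM2:.0f} mm2")
```

So a same-density 128 MB die lands almost exactly at CPU-chiplet size, as the post says.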

As far as what is on top and what is on the bottom, I think the advantage of the bridge being on top is that the bottom of the IOD would be limited to external IO, which could lead to IO die size savings.
 

Joe NYC

Golden Member
Jun 26, 2021
I don't think this approach is well suited to use in the manner you suggest, at least at this time. It certainly could, in some limited situations, but for something like Bergamo it would need to be able to overlap all chiplets, which becomes increasingly complex when there are more than 4 chiplets as some chiplets are going to be completely covered by (or completely cover) the bridge die. That makes for more complicated heat management issues as well.

I think the idea would be that each CCD of Bergamo would have its own bridge to IOD, keeping the size very manageable.

And since this bridge would be mandatory, and each CCD would have its own bridge with L3, the L3 could be removed from the N5 CCD and placed on the N6 bridge. That would optimize which node is used for which function, and that alone could create savings to offset the cost of assembly.

So, say the bridge has 128 MB of memory x 8 chiplets (of 16 cores) and Bergamo could have 1 GB of L3 standard, on optimized node, with optimized density.

And going forward, when AMD / TSMC figure out stacks higher than 1, AMD could just create 2-high or 4-high bridges.
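The capacity arithmetic is straightforward (the per-bridge size, chiplet count, and stack heights are all speculation from this thread, not AMD specs):

```python
# Speculated Bergamo L3 capacity: per-bridge cache * chiplets * stack height.
BRIDGE_MB = 128   # hypothetical bridge/cache die capacity
CCDS = 8          # rumored Bergamo chiplet count

for stack in (1, 2, 4):
    total_mb = BRIDGE_MB * CCDS * stack
    print(f"{stack}-high: {total_mb} MB ({total_mb // 1024} GB)")
```

So 1 GB standard, and 2-high or 4-high bridges would take it to 2 GB or 4 GB without touching the 2D footprint.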

At first, I thought the name Zen4c (for Cloud) was kind of cringy, but now I think it is not just pertaining to the core alone, but to re-optimizing the design of the whole MCD.

If AMD were to do what you suggested, they'd likely change the design of the chiplets to include parts of the IO, just like Intel's new server chips where each tile has some of the IO and there is no IO die. It's the same with Navi where each of the GPU chiplets contains the IO as well and there is no IO die. If something like that happens, it's with Zen 5 at the earliest which is going to be the first major new design.

That would be an option, but from the point of view of optimizing process nodes for their best function, this would be suboptimal, because IO would go on a super expensive N5 node.

Instead, keeping it an IOD and using TSVs up to connect the chiplets, the IOD could be less pad limited.

Even if Bergamo were to utilize these cache/bridge layers to create a chips with massive L3 caches, though not necessarily a complete SLC, it would still probably have a separate IO die that the chiplets all connect to.

I agree.
 

Joe NYC

Golden Member
Jun 26, 2021
You know, I had a thought. It's not related to zen4 but I don't know where to post it.
AMD is already doing chiplets with a compute die and I/O die, and they have also done an optimized SRAM process which fits 64MB of cache in the size of 32MB of cache for the V-cache technology. What if they separate the L3 from the compute die and use the optimized process to make smaller but still dense SRAM dies, and create a processor with the amalgamation of these technologies?

Yes, we have been discussing this on and off on this thread.

L3 benefits much less from the N5 process node than the logic parts of the core, mainly because it does not shrink as much, and the compute die is optimized to maximize compute performance.

So, taking L3 out of the compute die completely will add efficiencies. The L3 die can be optimized for density and cost (using N6/N7 instead of N5).

Hopefully, it will happen in Bergamo. But, from reading between the lines of Mike Clark interview on Anandtech, Genoa will still have some base L3 on the compute die.
 

Mopetar

Diamond Member
Jan 31, 2011
I think the idea would be that each CCD of Bergamo would have its own bridge to IOD, keeping the size very manageable.

And since this bridge would be mandatory, and each CCD would have its own bridge with L3, the L3 could be removed from the N5 CCD and placed on the N6 bridge. That would optimize which node is used for which function, and that alone could create savings to offset the cost of assembly.

I don't think that would work at all. The whole point of the bridge is that it connects each chiplet with every other chiplet so that the cache can be treated as one large pool. Multiple independent bridges don't make this any easier than having to include logic to snoop the cache on other chiplets and the increased latency that comes with that.

The patent (and even AMD's Zen 3D implementation of stacked dies) puts the bridge (and the additional cache) over top of the existing cache on the die. If that cache isn't there that stacked layer is overlapping something else and I suspect that for thermal reasons it's not workable to have it overlapping with anything that's going to be switching frequently and generating a lot of heat.

Even the assembly stages in the patent show a lot of additional steps in order to assemble a chip with bridges that may involve several sub-steps when it comes to implementation. It's not done because it's less expensive than existing alternatives, but because it enables chiplets to act as though they have an SLC even though the L3 is spread across multiple chiplets.

For a high-end datacenter GPU that cost is probably worth paying because the applications it will be used in will benefit substantially from a large memory pool like that. It's less clear how many applications on a CPU benefit from a massive SLC and of those that do, how many couldn't just be ported to run on a GPU that has more cores that can work on that shared data set.

If I'm reading you right, you also seem to suggest that the bridge is going to replace the infinity fabric and act as a replacement for any data being transmitted from the IO die to a CCD. That is going to further complicate the design since the bridge would need to connect to the on-chiplet logic that's responsible for handling all of that IO. This is turning into the kind of major design overhaul that's extremely unlikely before Zen 5. It's the type of thing that needs to be developed in small increments because otherwise there's too much that can go wrong.
 

Joe NYC

Golden Member
Jun 26, 2021
I don't think that would work at all. The whole point of the bridge is that it connects each chiplet with every other chiplet so that the cache can be treated as one large pool. Multiple independent bridges don't make this any easier than having to include logic to snoop the cache on other chiplets and the increased latency that comes with that.

For Epyc, connecting all chips with one piece of silicon would be a non-starter, because it would have to be gigantic.

Each chiplet having its own bridge to the IOD, while also housing the chiplet's L3, would have the right size, scale and complexity.

There would have to be some magic to make 8 separate caches look like one large SLC, but that's what the designers and engineers live for.

If the IOD is on N6 and has very low latency to each chiplet, it could assist by working as a directory / traffic cop, without a big latency hit.
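To make the traffic-cop idea concrete, here is a toy directory protocol: the IOD tracks which chiplets' caches hold each line, so a write only snoops the actual sharers instead of broadcasting to all 8 bridges. Everything here (names, structure) is invented for illustration; it is not AMD's design.

```python
class Directory:
    """Toy IOD-side directory: cache-line address -> set of CCDs holding it."""

    def __init__(self) -> None:
        self.owners: dict[int, set[int]] = {}

    def read(self, ccd: int, line: int) -> str:
        # Classify the access, then record this CCD as a sharer.
        sharers = self.owners.setdefault(line, set())
        if ccd in sharers:
            result = "local-hit"
        elif sharers:
            result = "remote-hit"   # serviced from another chiplet's L3 slice
        else:
            result = "miss"         # has to go out to DRAM (or CXL) instead
        sharers.add(ccd)
        return result

    def write(self, ccd: int, line: int) -> int:
        # Invalidate every other sharer; return the number of snoops sent.
        sharers = self.owners.get(line, set())
        snoops = len(sharers - {ccd})
        self.owners[line] = {ccd}
        return snoops
```

For example, `d.read(0, 0x40)` is a miss, a later `d.read(1, 0x40)` is a remote hit, and `d.write(2, 0x40)` then costs 2 snoops. The point is that directory state on a low-latency N6 IOD keeps coherence traffic targeted, which is what would let 8 physically separate slices behave like one big L3.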

Even the assembly stages in the patent show a lot of additional steps in order to assemble a chip with bridges that may involve several sub-steps when it comes to implementation. It's not done because it's less expensive than existing alternatives, but because it enables chiplets to act as though they have an SLC even though the L3 is spread across multiple chiplets.

Yes, the assembly of it looks very complex. But this is really the future of the high end, and TSMC and AMD must be investing in it.

As far as acting as an SLC, even in RDNA3 there would potentially be multiple MCDs, potentially 2 or more in the high-end Navi31, so the logic for making a number of L3s look like one SLC must already have a number of solutions. IBM's was illustrated in one of the AnandTech articles.

For a high-end datacenter GPU that cost is probably worth paying because the applications it will be used in will benefit substantially from a large memory pool like that. It's less clear how many applications on a CPU benefit from a massive SLC and of those that do, how many couldn't just be ported to run on a GPU that has more cores that can work on that shared data set.

If the prospect for the cloud and large datacenter CPUs is worsening memory latency by using CXL, anything that can offset it would be valuable. HBM acting as a local cache would be another way to address it, with a larger "cache" but much worse latency than SRAM L3.

But even without that, I think database applications would benefit from all cores having access to a single SLC (equivalent).

If I'm reading you right, you also seem to suggest that the bridge is going to replace the infinity fabric and act as a replacement for any data being transmitted from the IO die to a CCD. That is going to further complicate the design since the bridge would need to connect to the on-chiplet logic that's responsible for handling all of that IO. This is turning into the kind of major design overhaul that's extremely unlikely before Zen 5. It's the type of thing that needs to be developed in small increments because otherwise there's too much that can go wrong.

Yes, exactly, getting rid of the SerDes links, which are power hungry, mostly serial, add latency, and have low bandwidth. I think AMD has been targeting the bandwidth of the Infinity Fabric between CCD and IOD to be just over 1 memory channel's bandwidth, which can be a bottleneck, and also prevents a fast implementation of an SLC.
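Rough numbers behind that claim, using the commonly reported per-cycle IFOP link widths for Zen 2/3 (reported figures, not official AMD specs):

```python
# Infinity Fabric on-package link vs one DRAM channel, back of the envelope.
# Link widths are commonly reported community figures, not AMD documentation.
FCLK_HZ = 1.6e9        # typical fabric clock when paired with DDR4-3200
READ_B_PER_CLK = 32    # reported CCD read width (bytes/cycle)
WRITE_B_PER_CLK = 16   # reported CCD write width (bytes/cycle)

ifop_read = READ_B_PER_CLK * FCLK_HZ / 1e9     # GB/s toward the CCD
ifop_write = WRITE_B_PER_CLK * FCLK_HZ / 1e9   # GB/s toward the IOD
ddr4_channel = 3200e6 * 8 / 1e9                # one 64-bit DDR4-3200 channel

print(ifop_read, ifop_write, ddr4_channel)     # 51.2, 25.6, 25.6
```

The write path sits right at one channel's worth of bandwidth, which is the ceiling the post is pointing at; a TSV bridge would not have that limit.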

In theory, a silicon bridge die with several layers of metal has enough wiring potential to even have a mesh between all 8 CCDs, or at least limited point-to-point links for snooping.

Yeah, that is a major overhaul, but AMD already signaled there is some overhaul going on when Zen4c was announced.

Suppose the steps already were:
1. Design the new Zen 4 core and put it into a product that is just an extension of the proven Zen 2 / Zen 3 architecture; prioritize time to market, and clear any big unknowns and potential showstoppers in this step.
2. Re-optimize the architecture and implement the ambitious technologies that could not make it into #1.
 

MadRat

Lifer
Oct 14, 1999
I can't believe AMD's focus is SOC for the bottom line. It just doesn't make sense to gear everything to the lowest profit-margin end user. I can see the benefit of caching, but there has to be more to it. If I'm AMD then I'm looking at enhancing Northbridge sales to Pro/Commercial users, too.

Maybe they will separate SOC consumers from their commercial/pro Northbridge chipset users by how IOD is organized. SOC will hit bottlenecks beyond the simple number of PCI lanes supported. No Northbridge (SOC) being present means reliance on internal communication that is simplified. Hook up a Northbridge and it shuts down the local controller so that Northbridge runs it, opening up advanced communications.

SOC users, especially dual-core ones, would be extremely bottlenecked because the internal controller would be swamped handling internal and inter-chip negotiation of a shared connection. With each internal chip competing for bandwidth both internally and externally, it is bound to be full of signal overlap/collisions when accessing resources.

Each CPU in a Pro system would lean on the superduper Northbridge built to adequately handle the switching and control flow between CPUs. CPUs would be able to connect more seamlessly to other CPUs using a Northbridge because it doesn't allow inter-CPU or external signal collisions by design. All of the connections would enjoy relatively decent latency hidden by CPU on-die caches, but a Northbridge version would probably benefit more in designs suited for multiple CPUs. The advantage isn't in local latency, but in structured use of resources. Maybe they are aiming for more than paired CPUs, too.

And maybe the Northbridge allows signalling in smaller chunks where appropriate, and combining multiple larger chunks to maximize effort where appropriate. And let's not forget that big jump in the number of PCI lanes with a Northbridge.
 

Mopetar

Diamond Member
Jan 31, 2011
For Epyc, connecting all chips with one piece of silicon would be a non-starter, because it would have to be gigantic.

Each chiplet having its own bridge to IOD, while also housing chiplet's L3 would have the right size, scale and complexity.

. . .

AMD hasn't even released a product yet with a single silicon bridge, yet you've jumped to something that would utilize multiple bridges that contain all of the L3 cache for a chiplet and include functionality to replace IFOPs.

I think all of this stems from some desire on your part to forward or defend a notion that AMD must be using TSMC for their IO dies, as opposed to a more reasoned analysis of why they might or might not. You're starting with a conclusion and trying to reason backwards from there, as opposed to developing arguments that can be used to support a conclusion.

I'm inclined to think that they won't be moving IO dies to TSMC, particularly TSMC 6nm, for at least another few years. Even if AMD were to use TSMC to manufacture an IO die for some Zen 4 products, I highly doubt that any of the various ideas that you've brought up would be implemented.
 

jamescox

Senior member
Nov 11, 2009
I don't think that would work at all. The whole point of the bridge is that it connects each chiplet with every other chiplet so that the cache can be treated as one large pool. Multiple independent bridges don't make this any easier than having to include logic to snoop the cache on other chiplets and the increased latency that comes with that.

The patent (and even AMD's Zen 3D implementation of stacked dies) puts the bridge (and the additional cache) over top of the existing cache on the die. If that cache isn't there that stacked layer is overlapping something else and I suspect that for thermal reasons it's not workable to have it overlapping with anything that's going to be switching frequently and generating a lot of heat.

Even the assembly stages in the patent show a lot of additional steps in order to assemble a chip with bridges that may involve several sub-steps when it comes to implementation. It's not done because it's less expensive than existing alternatives, but because it enables chiplets to act as though they have an SLC even though the L3 is spread across multiple chiplets.

For a high-end datacenter GPU that cost is probably worth paying because the applications it will be used in will benefit substantially from a large memory pool like that. It's less clear how many applications on a CPU benefit from a massive SLC and of those that do, how many couldn't just be ported to run on a GPU that has more cores that can work on that shared data set.

If I'm reading you right, you also seem to suggest that the bridge is going to replace the infinity fabric and act as a replacement for any data being transmitted from the IO die to a CCD. That is going to further complicate the design since the bridge would need to connect to the on-chiplet logic that's responsible for handling all of that IO. This is turning into the kind of major design overhaul that's extremely unlikely before Zen 5. It's the type of thing that needs to be developed in small increments because otherwise there's too much that can go wrong.
I don’t know if that is the whole point. You have to take the physical constraints and design constraints into account. Rome, with 2 CCX on each die, did not have direct communication between the CCX on the same die. That would have been very complicated to design and manage. All CCX, even those on the same die, connect to the IO die for everything. This is necessary to manage cache coherency and it is a lot simpler. If they add bridge chips with cache, I would expect it to work in a similar manner where all accesses require snoop traffic to the IO die. The connections to a stacked die would be significantly faster bandwidth and latency though, so they could possibly treat all of the cache as slices and access via some address bits. That could be plausible since it would look the same as whatever system is in place in a single L3 on a CCX, it would just be managing much larger slices.

Even if it is basically the same connectivity as the current IO die, TSV connections make it look like everything is on one chip as far as latency and bandwidth are concerned. That would allow it to look like a monolithic cache to some extent, but all traffic to “remote” caches would likely be arbitrated via the IO die for cache coherency. The design could be very similar to the Rome / Milan design, except a lot faster and with higher bandwidth. Just direct connect through TSVs; no SerDes. The simple, quick design would be to basically just remove all of the physical links and connect the circuitry on each side directly. It would then work the same, but latencies would be lower, bandwidth higher, and power consumption significantly lower.

For the physical constraints, the most likely organization is an IO die with 4 CPU chiplets along each side. The die sizes make anything else unlikely. Connecting this with bridge chips can be accomplished several ways. If they want to reuse some 64 MB cache chips, then they would likely use 4 bridge chips along each side. A single bridge / cache die (MCD) along each side would be plausible also. With the cache-optimized process, it would only be about 150 mm2 for 512 MB. Whatever it is, it is likely the same die used for the GPUs. In that case, it might be that the cache die really does look like a single L3 on a CCX, but Bergamo would still have 2 of them, one on each side, so likely two 64-core NUMA zones. EPYC processors already have several NUMA options. You can configure it by memory controller (NPS1, 2, or 4), since the IO die is split into 4 quadrants and there is also different latency between the 2 halves. There is also an option to make a separate NUMA domain per L3.
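To make the "slices via address bits" idea concrete, here is a toy sketch; the slice count, line size, and plain-modulo hash are all illustrative assumptions, not AMD's actual scheme:

```python
# Toy address-to-slice mapping (illustrative only; real designs hash many
# more bits to avoid pathological striding patterns).
LINE_BITS = 6    # assume 64-byte cache lines -> low 6 bits are line offset
NUM_SLICES = 8   # assumed slice count

def slice_for_address(paddr: int) -> int:
    """Pick an L3 slice from the address bits just above the line offset."""
    return (paddr >> LINE_BITS) % NUM_SLICES

# Consecutive cache lines spread across consecutive slices:
addrs = [0x1000 + i * 64 for i in range(4)]
print([slice_for_address(a) for a in addrs])  # → [0, 1, 2, 3]
```

The point is that slice selection is stateless: any requester can compute where a line lives without consulting a directory, which is why the same mechanism could scale from a single CCX's L3 to much larger slices.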
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
I don’t know if that is the whole point. You have to take the physical constraints and design constraints into account. Rome, with 2 CCX on each die, did not have direct communication between the CCX on the same die. That would have been very complicated to design and manage. All CCX, even those on the same die, connect to the IO die for everything. This is necessary to manage cache coherency, and it is a lot simpler. If they add bridge chips with cache, I would expect it to work in a similar manner, where all accesses require snoop traffic to the IO die. The connections to a stacked die would have significantly higher bandwidth and lower latency though, so they could possibly treat all of the cache as slices and access via some address bits. That could be plausible since it would look the same as whatever system is in place in a single L3 on a CCX; it would just be managing much larger slices.

Based on the way the patent for the GPU is presented, the bridge isn't intended to do anything other than connect the caches across multiple chiplets, and perhaps serve as additional stacked cache. If you look at the die shot for Zen 3, you can see how you could construct such a bridge chip to run across multiple chiplets. Frankly, the easiest product on which to implement something like that would be something like Zen 3D that uses multiple chiplets.

The main problem is that unless the underlying chiplets have their cache designed around doing something like this, the additional cache isn't necessarily useful. With Zen 3D it appears that all it really does is triple the number of sets for the L3 cache, which isn't particularly useful for many applications. However, if you knew that you were going to use a bridge with additional cache to add more sets then it may lead you to change the layout for the cache on the chiplet so that the individual cache lines are longer.
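The sets-versus-ways distinction can be sketched with simple arithmetic: capacity is sets × ways × line size, so a 3x capacity jump can come from either dimension. The 16-way, 32 MiB baseline below matches published Zen 3 figures, but treat the geometry as an assumption:

```python
def cache_capacity(sets: int, ways: int, line: int = 64) -> int:
    """Cache capacity in bytes: sets x ways x bytes-per-line."""
    return sets * ways * line

MiB = 2**20
print(cache_capacity(sets=32768, ways=16) // MiB)      # 32 (baseline L3)
# Tripling capacity two different ways (both hypothetical layouts):
print(cache_capacity(sets=3 * 32768, ways=16) // MiB)  # 96, via more sets
print(cache_capacity(sets=32768, ways=48) // MiB)      # 96, via more ways
```

Whether the extra capacity shows up as sets or as ways changes which addresses conflict with each other, which is why the chiplet's base cache layout matters for how useful stacked cache ends up being.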

The largest benefit of a massive shared L3 cache comes when you either have a lot of different applications running and want better response time (achieved because more cache sets means data isn't being evicted as often), or you have a single highly parallelized application that tends to access memory in a way where recently used data will be reused frequently but isn't small enough to fit into the L1 or L2 caches. The problem with the second class of applications is that you have to ask yourself if there's a reason they couldn't just be run on a GPU. The first case looks a lot like cloud providers that want to keep utilization high, which means shuffling through a lot of different processes that may have periodic spikes in activity and benefit greatly from not having to wait around for main memory.
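The eviction argument can be shown with a toy LRU simulation (purely illustrative; a real L3 is set-associative with far more complex replacement and prefetching):

```python
from collections import OrderedDict

def hit_rate(capacity: int, accesses) -> float:
    """Simulate a fully associative LRU cache and return the hit rate."""
    cache, hits = OrderedDict(), 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)

# A 40-line working set swept cyclically 10 times:
trace = list(range(40)) * 10
print(hit_rate(32, trace))  # → 0.0: working set slightly too big thrashes
print(hit_rate(64, trace))  # → 0.9: only the 40 cold misses remain
```

That cliff between "almost fits" and "fits" is exactly why a big jump in L3 capacity can transform some workloads while leaving others untouched.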

But if they're using this as a replacement for IFOPs, then it doesn't make a lot of sense to try to treat the L3 cache as a massive pool. The access latency takes a big hit when the data is stored in the bridge furthest away, which requires routing that request across the bridge, through the IO die, and into another bridge. I'm not sure that's a worthwhile tradeoff, especially if there's any additional latency because the bridge is being used to transmit other IO as well.
 

Joe NYC

Golden Member
Jun 26, 2021
1,893
2,188
106
I think all of this stems from some desire on your part to forward or defend a notion that AMD must be using TSMC for their IO dies, as opposed to a more reasoned analysis of why they might or might not. You're starting with a conclusion and trying to reason backwards from there, as opposed to developing arguments that can be used to support a conclusion.

I would phrase it a little differently: If AMD is moving IOD to TSMC, a major reason for that is to allow more advanced packaging technologies in the future.

As far as moving now or a year later, and if the decision could be made today, I would favor the decision that will maximize the capacity - which could be staying with Global Foundries.

But the decision can't be made today. It was already made, years ago, and Genoa is already in production. We can't turn back time, it is a little late to argue now where IOD should be made.

A better discussion would be to find the reasons that led to whatever decision AMD made.

I'm inclined to think that they won't be moving IO dies to TSMC, particularly TSMC 6nm, for at least another few years. Even if AMD were to use TSMC to manufacture an IO die for some Zen 4 products, I highly doubt that any of the various ideas that you've brought up would be implemented.

A big hint will be what RDNA3 turns out to be. If it turns out to use silicon bridges with 3D stacking, then you can bet that similar technology will make it into EPYC chips.

If RDNA3 does not use 3D stacking to connect the GCD dies, then any of the ideas I brought up will have to wait a little longer to make it into EPYC.
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
I would phrase it a little differently: If AMD is moving IOD to TSMC, a major reason for that is to allow more advanced packaging technologies in the future.

Which don't do them much good until they're ready to go, and which require serious design considerations that look like a serious gamble. Take for example your idea that bridges replace IFOP. That's an all-in design choice: you now need all IO dies to be fabricated at TSMC, because you've built a chiplet that won't integrate with anything it can't be bridged to, or else you have to build all of your chiplets so that they can work in either configuration.

As far as moving now or a year later, and if the decision could be made today, I would favor the decision that will maximize the capacity - which could be staying with Global Foundries.

But the decision can't be made today. It was already made, years ago, and Genoa is already in production. We can't turn back time, it is a little late to argue now where IOD should be made.

Even if you were making that decision years ago, you'd choose Global Foundries because it allows for more production. AMD would know how many wafers they can buy, they would know that they're going to be making new console SoCs, they would know that they're going to be expanding their GPU offerings, and they would know that they're likely to have a strong advantage in HEDT and servers due to their chiplet-based approach and be able to grow in those markets.

If you move your IO dies to TSMC, you need roughly two-thirds more wafers to cover the ~40% of CPU silicon area that the IO dies represent. Even without a crystal ball to predict a pandemic and supply chain issues, you'd have to be foolish to make that move. It just doesn't make sense logistically.
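Rough arithmetic behind that, taking the ~40% IO-die share of total CPU silicon at face value (illustrative, not AMD's actual floorplans):

```python
# If the IO die is ~40% of a CPU's total silicon, moving it from GF to TSMC
# raises TSMC area per CPU from 60% of the total to 100% of it.
io_share = 0.40
ccd_share = 1.0 - io_share

relative_tsmc_area = (ccd_share + io_share) / ccd_share
print(round(relative_tsmc_area, 2))  # → 1.67, i.e. ~two-thirds more wafers
```

And that extra leading-edge wafer demand lands precisely where supply is tightest, which is the logistical point being made here.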

A better discussion would be to find the reasons that led to whatever decision AMD made.

I'm not sure that's happening. I keep pointing out why it doesn't make sense, and the only response is "But what about this other really cool, but unlikely possibility?" I honestly don't think the basic points that strongly suggest continued use of Global Foundries have even been addressed. Recently they extended their commitments to GF and have now agreed to buy $2.1 billion in wafers through the next four years. But rather than address that it's just "But what about cool hypothetical future technology? Wouldn't that be sweet?"

A big hint will be what RDNA3 turns out to be. If it turns out to use silicon bridges with 3D stacking, then you can bet that similar technology will make it into EPYC chips.

If RDNA3 does not use 3D stacking to connect the GCD dies, then any of the ideas I brought up will have to wait a little longer to make it into EPYC.

I'm not sure that's necessarily the case. It seems most likely that RDNA3 uses two bridged chiplets, if it uses any at all. It may not even do that for all models, and those chiplets are perfectly fine if they aren't connected, because they're really just GPUs being put on the same package, as opposed to modules that need a separate die in order to function.

Keep in mind that Zen 2 apparently had the technology in place to utilize stacked cache, but it took until the tail end of Zen 3 to actually implement it. We're talking about bleeding-edge technologies that haven't been used at scale before, and it takes a lot of time to get the kinks worked out. The path to the cool future technology is a slow one of incremental evolution and building on top of previous successes, not a giant leap made with reckless abandon and all caution thrown to the wind.
 

Frenetic Pony

Senior member
May 1, 2012
218
179
116
As far as FDX for the IOX goes, I thought a big advantage over the current 12nm IOX would be energy efficiency. But cheaper IO dies are good as well, and very important.

I think those additions (iGPU especially) would be a great move for enthusiast desktop, even notebooks (mid to high end), and workstations. The energy efficiency of the newer node (over 12nm APU) would probably (?) just about offset the cost of MCM (extra energy for chiplet to IOX communication). That would allow them to skip monolithic 8c APU for Zen4. (I think they should've skipped it for either Zen2 or Zen3).

Current consumer oriented 8c monolithic Zen3 APU should be great for OEM desktops (traditional and all-in-ones) as well as high end laptop.

For the first Zen4 monolithic APU they should take a hint from Apple's M1 (which is equivalent to a very big core 4c/8t even though it is an 8c with its big.little 4+4). They are centered on mainstream computing and cover everything above bottom end mobile to just below very high end desktops---basically something that will work really well for 99% of the typical computer users. Imho that would be a Zen4 4c/8t to 6c/12t APU with a somewhat more capable GPU than Raven/Picasso's 11 Vega CU--perhaps 7 to 8 CU of Navi. Top models should just about max out what can be done with typical DDR5 dual channel memory. Apple kind of went way beyond that GPU power, I think piggybacking on the advantage they have with the on-chip RAM.

Eh, the TDP being taken up by the IO chiplet is way too high. Power efficiency is probably one good reason among many that both Amazon and, as rumored, Microsoft are looking at or already building their own server CPUs. AMD is no doubt cognizant of this and will be looking at vastly increasing efficiency where it can. All the profit margins in the world aren't helpful if there aren't any sales, and AMD would be wise not to get too far ahead of itself with its recent success.

Zen 4 APUs will probably get the new RDNA2 iGPU design over anything else initially. The more interesting concept might be large chiplet APUs. Shoving the same CU count as a 6600 XT into a chiplet to compete with the M1X and future variants seems entirely plausible. People being able to do professional video work on a laptop, for example, is awesome for that market segment, and there's also motivation for AMD to compete in other segments. A 3-4 chiplet APU (one monolithic die with 8 cores/IO/media, 1-2 64 MB V-Cache chiplets, and a big GPU chiplet) could still save die space over separate GPU/CPU setups, be enough power for a giant segment of the market including gamers, and open up desktop form factors like mini PCs to more uses for AMD. Imagine how well an 8-core CPU / 11-teraflop RDNA3 GPU combo sold for like $400-500 would sell. Seems like an all-around win.
 

ryanjagtap

Member
Sep 25, 2021
108
127
96
Eh, the TDP being taken up by the IO chiplet is way too high. Power efficiency is probably one good reason among many that both Amazon and, as rumored, Microsoft are looking at or already building their own server CPUs. AMD is no doubt cognizant of this and will be looking at vastly increasing efficiency where it can. All the profit margins in the world aren't helpful if there aren't any sales, and AMD would be wise not to get too far ahead of itself with its recent success.

Zen 4 APUs will probably get the new RDNA2 iGPU design over anything else initially. The more interesting concept might be large chiplet APUs. Shoving the same CU count as a 6600 XT into a chiplet to compete with the M1X and future variants seems entirely plausible. People being able to do professional video work on a laptop, for example, is awesome for that market segment, and there's also motivation for AMD to compete in other segments. A 3-4 chiplet APU (one monolithic die with 8 cores/IO/media, 1-2 64 MB V-Cache chiplets, and a big GPU chiplet) could still save die space over separate GPU/CPU setups, be enough power for a giant segment of the market including gamers, and open up desktop form factors like mini PCs to more uses for AMD. Imagine how well an 8-core CPU / 11-teraflop RDNA3 GPU combo sold for like $400-500 would sell. Seems like an all-around win.
This is all well and good for desktop, but on a laptop such a large die would force OEMs to build new cooling solutions for the heat it generates, and IMO OEMs don't want to do that much work. And 3D packaging is still for niche halo products, not for HVM, so 1-2 stacks of V-Cache seems unfeasible.
 

MadRat

Lifer
Oct 14, 1999
11,908
228
106
Let me try to make sense of what people are saying here. So Milan supports 24 cores (3 x 8c) per chiplet? The long-term goal would be for a maximum of 3 chiplets per CPU? The base would be only 32MB of L3 cache per cache spot. Would AMD be looking at a 4 CPU system target? So maybe the launch model would be a 1-2 CPU system in SOC, 8-24 active cores per CPU, with 32MB-128MB cache per CPU? That gives them time to bring out an advanced Northbridge for better multiple CPU systems with much broader PCI lane support. Your early 'upper end' model would be capped at 128MB of L3 per CPU, using four 32MB L3 caches. Initially you're probably looking at much less memory support for SOC. Each chiplet would probably support 12 memory channels. Does that mean internally they could address 36 memory channels in a 3 chiplet design? That gives them time to work out kinks in multi-chiplet models. Down the road they can offer the stacked L3, up to four caches per CPU, 4 layers of 32MB each cache layer, to max out at 512MB per CPU. Two CPU systems up to 192 cores? Four CPU systems up to 384 cores?
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
Eh, the TDP being taken up by the IO chiplet is way too high. Power efficiency is probably one good reason among many that both Amazon and, as rumored, Microsoft are looking at or already building their own server CPUs. AMD is no doubt cognizant of this and will be looking at vastly increasing efficiency where it can. All the profit margins in the world aren't helpful if there aren't any sales, and AMD would be wise not to get too far ahead of itself with its recent success.

I'm not sure this is a serious issue. Regardless of whether AMD uses TSMC 6N or GF 12LP+ for the IO die, power use should go down by a fair amount compared to current chips. Microsoft also bought exclusivity for Milan-X, so AMD clearly has some advantages to leverage if customers are that interested.

Zen 4 APUs will probably get the new RDNA2 iGPU design over anything else initially. The more interesting concept might be large chiplet APUs. Shoving the same CU count as a 6600 XT into a chiplet to compete with the M1X and future variants seems entirely plausible.

The Zen 3+ APUs already have RDNA2 iGPUs. I'm not sure AMD will ever go for an APU with a GPU that large as a monolithic design. Apple can because it targets their primary market. If anything, it's more likely for AMD to use a design where a GPU chiplet is put on the same package as a CPU chiplet, but that would have to offer advantages over the traditional route of pairing a more powerful discrete card with an APU/CPU.