Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
  • Like
Reactions: richardllewis_01

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
AMD already has much of the IP needed for an N6 IOD from designing Cezanne, Renoir, and their revisions on N7/N6, so that's less complicated to migrate as well.

That's a very good point, I didn't think about that.

Hopefully, another iteration and re-spin for N6 can add some more optimization.
 
  • Like
Reactions: Tlh97

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
The IO die can't make as much use of a node shrink because, no matter how physically small the transistors get, the physical interfaces can't shrink. Obviously there's more to the die than just that, but it's still a big part of the total area.

Even if the GPU part has only a really small number of CUs, it still needs the front end and the other parts for video display. Normally those take up only a small part of a GPU's total area, but they'll be a bigger share at a low CU count.

Any space savings you'd get from going to a new node are probably largely lost by adding a GPU, so you're not saving any money going with that new node. More likely it costs a fair bit more.

I just don't see the point, outside of providing a bare-bones setup to drive a display, which some people would no doubt appreciate. However, I don't think it adds as much value to the product as it costs them to include.

Even if it otherwise was just an empty area on the I/O die?

Because the pins for the interfaces will need a certain minimum area.

Booting up with integrated graphics has some value when you build a PC, even if you turn it off in the long run.

And if the integrated video helps with some design wins from the OEMs, then it has served its purpose.
 

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
It may be a moot point anyway, since I think the main reason AMD is using N6 for the IO die is that that's the node the IP has been designed for.

The reason AMD is moving the IO die to N6 rather than N7 is that TSMC is urging all its clients to make that move - N6 performs even better than N7, and optimizations in the number of processing steps (more EUV layers, fewer masks) actually make N6 cheaper for TSMC than N7.

Since both of the IO dies, for Ryzen and for Epyc, are brand-new designs going into production in the future, it only makes sense to use the best-performing and least expensive node.

Also, the Epyc IO die is huge and quite power-hungry. There is good reason to chase die-area and power savings for that die, and the Ryzen die just follows its lead, so that AMD is not spreading design resources unnecessarily.
 
  • Like
Reactions: scineram and Tlh97

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
There will be a glut of N6 capacity by the time Zen 4 comes out. The mobile hordes are already moving from N7 to N5, and we are still at least 6 months away from the first wafer starts for Zen 4 and its IO die.

By then, TSMC's mobile customers will not only be gone from N7/N6, they will also be in their seasonal low periods.

I'm not sure of that. Apple is going to be buying up a lot of N5 as they continue to transition their product line from Intel to their own ARM SoCs. Those aren't the highest-volume sellers, but we're talking about their entire product line on top of all of their iDevices. We also don't know what Nvidia's plans are, but given that AMD is more competitive with their RDNA2 GPUs, that may push Nvidia to try to get back on TSMC as well just to ensure no…

Sony and Microsoft console sales haven't even managed to get down to MSRP yet and are still being heavily scalped online, suggesting continued strong demand. They'll buy up any additional N7/N6 wafers they can get their hands on, especially since demand will likely be just as strong this next holiday season.

I don't think there's as much free capacity at TSMC in the near or even mid-term as you might be expecting.

A separate chip is a goal I bet AMD is aspiring to, but it would mean another communication link (from this GPU chiplet to the I/O die), with its power and latency overhead.

The rumors about MCM GPUs have been around for a while, but they are apparently hard to pull off, at least when it comes to gaming performance, which is why they may not materialize for a bit longer. Obviously, if it can be done, it provides the same kind of economic advantages that Zen did, possibly even more. However, there is another possibility: they create a graphics chiplet designed for professional/datacenter workloads, which apparently don't suffer from being split up across chiplets nearly as much.

A single chiplet designed for those GPUs could still be used with an APU as long as any additional necessary logic gets added to the IO die. I don't know if they'd need to design another communication link, though, if they could just recycle the existing link that connects CPU chiplets to the IO die. The only real concern is bandwidth, but really they only need to match the available memory bandwidth. They also already have a high-speed link designed for connecting their GPUs that might be more suitable and could perhaps be repurposed, but I don't know.
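To put a rough number on the bandwidth concern (the DDR5 speed grade here is an assumed, illustrative figure, not anything confirmed for Zen 4):

[CODE=python]
# Back-of-envelope: peak DRAM bandwidth a GPU chiplet link would need
# to match on a desktop APU. DDR5-5200 is an assumed speed grade.
channels = 2              # dual-channel desktop memory
transfers_per_sec = 5200e6
bytes_per_transfer = 8    # 64 bits per channel
peak = channels * transfers_per_sec * bytes_per_transfer
print(f"{peak / 1e9:.1f} GB/s peak")  # -> 83.2 GB/s
[/CODE]

Anything they recycle only has to sustain on that order of bandwidth, which is a far cry from what a discrete GPU's GDDR bus moves.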

But if AMD is going to keep reusing the CCD between desktop and server, with the potential for multiple CCDs per package, putting the graphics on the CCD is a non-starter.

I agree that doesn't make sense. Either they make a monolithic APU or they make a special chiplet for graphics. It seems like we may still be another generation removed from seeing chiplet-based GPUs, though. There are some rumors about RDNA 3 having multi-die GPUs, but I think these may just be more akin to traditional multi-GPU designs, where two standalone monolithic dies on a single card/package are connected together.
 

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
It appears that node scaling is rapidly approaching a point where it will be more efficient to take the SerDes power hit on APUs and split the I/O section off to a separate die on a larger process whose design rules are power-optimized, with a CCD that is density- and performance-optimized.

It could be quite a big power hit - too big for the target markets for APUs (laptops, SFF PCs).

At some point, AMD will move away from the SerDes interface to something faster and higher-performance (and likely more costly), and that will be the time to do that re-partitioning.
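To put a very rough number on that power hit: figures on the order of 2 pJ/bit have been cited for AMD's IFOP SerDes links, while dense 2.5D/3D links are usually quoted well under 1 pJ/bit. A sketch (the energy figures and the 50 GB/s load are assumptions for illustration):

[CODE=python]
# Sketch: link power = bandwidth x energy-per-bit.
# The 2 pJ/bit (IFOP-class SerDes) and 0.3 pJ/bit (dense 2.5D/3D link)
# figures are assumptions for illustration, as is the 50 GB/s load.
def link_watts(gbytes_per_sec: float, pj_per_bit: float) -> float:
    bits_per_sec = gbytes_per_sec * 1e9 * 8
    return bits_per_sec * pj_per_bit * 1e-12

for name, pj in [("SerDes (IFOP-class)", 2.0), ("dense 2.5D/3D link", 0.3)]:
    print(f"{name}: {link_watts(50, pj):.2f} W at 50 GB/s")
[/CODE]

A watt or so per link is nothing on desktop, but out of a 15 W laptop budget it's real money, which is why I think the re-partitioning waits for the faster interface.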
 
  • Like
Reactions: Tlh97

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
Do they need to? Forgive me if I've got the details wrong here, but I was under the impression the core layers are flipped upwards in the 3D Stacked Zen 3 processors to allow for better heat distribution and cooling compared to regular Zen 3 chiplets.

The exotic cooling might be needed for logic on logic stacking, but not really for L3 SRAM stacking.

I've heard a rumor of HBMe for Zen 5, but again, that's a rumor. It's simpler to take a stab in the dark here with AMD and maybe be 33% correct. I almost miss AMD's wild claims and their falling short on 80% of them.

With all the 3D stacking going on, by the time we get to Zen 5 the 2.5D method of attaching HBM may be obsolete.
 
  • Like
Reactions: Tlh97

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
Even if it otherwise was just an empty area on the I/O die?

Because the pins for the interfaces will need a certain minimum area.

If you had absolutely nothing else to use the space for, sure, you could include that additional logic, assuming it would fit. However, it also potentially requires a larger die to include more pins for the display output. The only reason to include it at all is if there's free space that otherwise goes completely unused and it doesn't increase the size of the die.

Booting up with integrated graphics has some value when you build a PC, even if you turn it off in the long run.

And if the integrated video helps with some design wins from the OEMs, then it has served its purpose.

I don't deny this, but I'm not really sure they need it, given that they already can't supply the market with enough Zen 3 CPUs, which completely lack this capability. They also have new APUs with integrated video that cover that side of the market.

The number of customers who want a 16-core Zen CPU with onboard graphics isn't terribly large. I'm not sure the added cost across an entire product line is worth the niche market segments they might pick up, or the small bit of extra convenience that onboard graphics provides if a GPU goes bad and there isn't a spare to use.

Since both of the IO dies, for Ryzen and for Epyc, are brand-new designs going into production in the future, it only makes sense to use the best-performing and least expensive node.

N6 might be better performing, but it isn't going to be nearly as cheap as using 12LP+.

Also, the Epyc IO die is huge and quite power-hungry. There is good reason to chase die-area and power savings for that die, and the Ryzen die just follows its lead, so that AMD is not spreading design resources unnecessarily.

There were some more recent test results done by AT that seem to suggest that a lot of the excess power draw seen from the IO die with Milan may have been largely a result of using the original motherboard that they received with the Naples CPUs they tested when Zen originally launched. Using a motherboard designed for Milan yields much better power characteristics.

I also hate to say this for what feels like the 100th time, but a lot of the parts of the IO die don't benefit from a node shrink. You don't get the same kind of cost savings (or just break-even depending on the cost of the new node) that you typically get with other chips.

Using TSMC to manufacture the IO dies also means fewer wafers for all of their other chips that are currently in short supply. As I outlined in a previous post, I don't think there's anywhere near enough capacity at TSMC for AMD to devote wafers there to producing IO dies.
 

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
I'm not sure of that. Apple is going to be buying up a lot of N5 as they continue to transition their product line from Intel to their own ARM SoCs. Those aren't the highest-volume sellers, but we're talking about their entire product line on top of all of their iDevices. We also don't know what Nvidia's plans are, but given that AMD is more competitive with their RDNA2 GPUs, that may push Nvidia to try to get back on TSMC as well just to ensure no…

Apple moves very quickly away from producing last-generation phones - and phones are by far their biggest-volume product. Apple had a rapid transition from N7 to N5 a year ago.

I would venture to guess that right now Apple does not have any iPhone SoCs on N7 being processed by TSMC.

Q2 '21 was when the other mobile players started their transitions. By Q1 2022, when the Zen 4 wafers will likely start, only the bottom feeders of the mobile space and trailing-edge products will be left on that node.

Probably the only high-profile product left on N7 will be Nvidia with their A100 line.

Also, when AMD starts the transition to Zen 4 on N5, even more N7/N6 capacity will open up at the same time, since AMD is likely the biggest customer on N7. So, like I said, there will be a glut on that node.

Sony and Microsoft console sales haven't even managed to get down to MSRP yet and are still being heavily scalped online, suggesting continued strong demand. They'll buy up any additional N7/N6 wafers they can get their hands on, especially since demand will likely be just as strong this next holiday season.

Sony and MSFT will be beneficiaries of this node migration as well. IMO, supply will match demand by Christmas.

I don't think there's as much free capacity at TSMC in the near or even mid-term as you might be expecting.

The big crowd will be on N5, including the Zen 4 die.

It's a good thing that AMD may have a very competitive Zen 3D (V-Cache), which is entirely on N7/N6, while there is likely a shortage of N5 capacity.

The rumors about MCM GPUs have been around for a while, but they are apparently hard to pull off, at least when it comes to gaming performance, which is why they may not materialize for a bit longer. Obviously, if it can be done, it provides the same kind of economic advantages that Zen did, possibly even more. However, there is another possibility: they create a graphics chiplet designed for professional/datacenter workloads, which apparently don't suffer from being split up across chiplets nearly as much.

A single chiplet designed for those GPUs could still be used with an APU as long as any additional necessary logic gets added to the IO die. I don't know if they'd need to design another communication link, though, if they could just recycle the existing link that connects CPU chiplets to the IO die. The only real concern is bandwidth, but really they only need to match the available memory bandwidth. They also already have a high-speed link designed for connecting their GPUs that might be more suitable and could perhaps be repurposed, but I don't know.

I think those standalone GPUs are too massive currently, and their partitioning will likely proceed in conservative half-steps - nowhere near achieving a practical 50-100 mm2 chiplet that could be used standalone in an APU.

I agree that doesn't make sense. Either they make a monolithic APU or they make a special chiplet for graphics. It seems like we may still be another generation removed from seeing chiplet-based GPUs, though. There are some rumors about RDNA 3 having multi-die GPUs, but I think these may just be more akin to traditional multi-GPU designs, where two standalone monolithic dies on a single card/package are connected together.

Agreed. Perhaps not even the upcoming Zen 3 based Rembrandt; maybe the one after that, the Zen 4 based Phoenix (?).
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
637
1,103
136
AMD actually has two versions of the IO die: a larger one for EPYC/Threadripper (which has an 8-channel DDR4 memory controller, more PCIe 4.0 lanes, etc.) and another for Ryzen/X570. AMD should still have two versions of the IO die, because Genoa will have a 10-channel DDR5 memory controller and still more PCIe lanes than the IO die for Ryzen.
I am quite aware of the Epyc IO die since I have been messing with NUMA settings trying to get software to scale on a dual socket Epyc the same as it does on a single socket Epyc. That hasn’t worked too well so far, possibly due to interplay with the GPUs, which are not connected quite as I would prefer.

AMD has the Ryzen IO die, which is basically 2 memory channels, 2 CPU links, and 2 x16 PCI Express. They also use that same IO die as a chipset, just with the memory controller and CPU links unused; it only uses the PCI Express.

The Epyc IO die is basically 4x the Ryzen IO die. It is split into 4 quadrants internally; you get different memory latency depending on whether the access goes to a controller in the same quadrant or the same half as the CPU die. To help optimize this, they have NPS settings in the BIOS, which stands for "NUMA per socket". NPS1 is 1 NUMA node per socket, which optimizes for bandwidth by striping memory addresses across all 8 controllers; you get higher latency, though. At NPS2, you get 2 NUMA nodes per socket, and memory is striped across the 4 channels of 2 quadrants - kind of a north half and a south half. At NPS4, each quadrant is a separate NUMA node, which offers the best latency, but memory is only striped across the dual-channel controller of each quadrant. They have a few other settings, like making each CCX a separate NUMA node, so technically you could have up to 16 NUMA nodes in a dual-socket Milan.
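If anyone wants to see how an NPS change actually presents to the OS, here is a minimal sketch (Linux only, standard sysfs layout assumed) that lists each NUMA node with its CPUs and memory:

[CODE=python]
# Minimal sketch: enumerate the NUMA topology Linux sees after an
# NPS1/NPS2/NPS4 BIOS change (standard sysfs layout assumed).
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpus = (node / "cpulist").read_text().strip()
    # First line of the per-node meminfo is "Node <n> MemTotal: <x> kB".
    mem_kib = int((node / "meminfo").read_text().split()[3])
    print(f"{node.name}: CPUs {cpus}, {mem_kib / 2**20:.1f} GiB")
[/CODE]

numactl --hardware gives the same picture plus the inter-node distance matrix, which is where the quadrant/half latency differences show up.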

I was trying to run in NPS4 mode, but that is a mess. You get 8 NUMA nodes in a dual-socket system and only half of them have GPUs attached (4 GPUs). We can't go over 4 GPUs, since the amount of system memory per GPU would be too low and the system would probably go from 2U to 4U in size. NPS2 might have been a good solution, since it would have been 4 NUMA nodes across 2 CPU sockets for 4 GPUs. That didn't work, since the evaluation system I had ended up with 2 GPUs in one node and none in the other. The placement of the InfiniBand card also has to be taken into account.

We ended up going with single-socket CPU systems with just 2 GPUs each. We are still running in NPS1 to optimize for bandwidth and reduce communication issues with the GPUs and the InfiniBand card.

So, while the Epyc IO die has a lot more stuff, it is still arranged as if it were 4 separate dies, like Naples; the infinity fabric switch is just internal. It is a different layout, but each quadrant has almost exactly the same stuff as a Ryzen IO die.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
These statements about the PHY blocks staying about the same size with node shrinks are a bit confusing to me. Surely they have shrunk over the years, or is it being claimed that they are almost the same size as they were on, say, the 22 nm or 28 nm nodes?

I can see the scaling factor being lower than for other logic or cache. In that case, does anyone know the value? For example, does it scale at 1/2 the rate of logic?
They would have shrunk, but there are limits. For something like the memory controller, they have a relatively large unit that is a "unified memory controller". It basically does everything except communicate with the memory, so it is everything that isn't in the physical interface. The physical interface is the part that doesn't shrink well; anything labeled PHY in the images someone posted earlier. The transistors that drive the external interfaces need to drive a lot of power, a lot more than the drive power required for anything on die, even sending data all of the way across the die. This means that they have to be significantly larger transistors. The size will be determined by the drive power required and what can be achieved with the process tech. There may be some optimizations that could be done to the process tech that would not be as good for logic, which is part of why stacking could make sense. They could make a chip with all of the PHY interfaces and then stack the required logic portions on top, maybe with some cache and other useful stuff.

Reduction in PHY size has probably come from changing the interfaces more than anything else. For SDR, I think it was 3.3 V. For DDR, I think it was 2.5 V originally → 1.8 V (DDR2) → 1.5 V (DDR3) → 1.2 V (DDR4) → 1.1 V for DDR5. Lower voltage means lower power and smaller transistors, but they have to do more and more complicated things to get the necessary signal integrity, so there are trade-offs. HBM reduces die size for the memory interface significantly, even though it is 1024 bits wide per stack. That is due to the lower clock allowing simpler, lower-voltage interfaces. The GDDR interfaces are massively complex to maintain signal integrity at very high clocks.
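As a rough illustration of why those voltage steps matter so much: dynamic switching power scales with C * V^2 * f, so at fixed capacitance and data rate the interface voltage alone gives something like this (a toy model; real PHYs add termination and equalization power that this ignores):

[CODE=python]
# Toy model: relative switching energy ~ V^2 at fixed capacitance.
# Ignores termination/equalization power, which dominate at high clocks.
gens = [("SDR", 3.3), ("DDR", 2.5), ("DDR2", 1.8),
        ("DDR3", 1.5), ("DDR4", 1.2), ("DDR5", 1.1)]
base = gens[0][1] ** 2
for name, volts in gens:
    print(f"{name} ({volts} V): {volts ** 2 / base:.2f}x switching energy")
[/CODE]

By that crude measure, a DDR5 pin switches at roughly a ninth of the energy of an SDR pin, which is where most of the PHY power progress came from even though the area barely moved.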
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
It appears that node scaling is rapidly approaching a point where it will be more efficient to take the SerDes power hit on APUs and split the I/O section off to a separate die on a larger process whose design rules are power-optimized, with a CCD that is density- and performance-optimized.
Stacking would be the optimal solution in some respects. You make the base die with all of the physical interfaces, stack the cpu cores on top, and cache die on top of that, and then maybe a DRAM die next to it. You can have a larger, cheaper process for the IO with optimized processes for logic, SRAM, DRAM on the other die. Problem is, that stack is probably a lot more expensive than just making a monolithic APU.
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
637
1,103
136
According to some of the latest rumors, there will be no HBM for Zen 4.

Whether that means there will be nothing supporting 2.5D HBM stacking at all, or just no high-speed (3D-stacked) memory, is not quite clear.

I have seen another rumor of a higher memory channel count (12?).

So it seems the major thrust may be on bigger L3s, and perhaps improved bandwidth to each individual CCD.

DDR5 should add more bandwidth, and a possible higher number of memory channels could add bandwidth further. And L3 could address the latency.

That is, unless AMD has something additional up its sleeve. There are some rumors regarding the MCD (Memory Cache Die for RDNA 3). I wonder if any of it might be applicable to Zen 4.
They might be behind Intel on HBM for HPC then, although Intel has been relegated to “I will believe it when I see it” status. AMD was there for a while. AMD could make up for not having HBM by just having ridiculous quantities of SRAM cache.
 
  • Like
Reactions: Tlh97

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
I don't deny this, but I'm not really sure they need it, given that they already can't supply the market with enough Zen 3 CPUs, which completely lack this capability. They also have new APUs with integrated video that cover that side of the market.

The number of customers who want a 16-core Zen CPU with onboard graphics isn't terribly large. I'm not sure the added cost across an entire product line is worth the niche market segments they might pick up, or the small bit of extra convenience that onboard graphics provides if a GPU goes bad and there isn't a spare to use.

If we think about potential bottlenecks preventing more Ryzen chips from reaching the market, there are two candidate sources:
- the N7 node
- substrates (and other components)

Looking at TSMC's Q2 revenue, the N7 crunch has eased quite a bit, to the point that the bottleneck for AMD may no longer be N7 but substrates.

In Q1 2022, we will be 3 quarters past N7/N6 being a bottleneck - nowhere near being one. Zen 4 sales will be constrained by the N5 bottleneck.

The problem with APUs covering this market: no, they cover a different market.
- APUs will not have 3D stacking available to them (most likely)
- APUs will be on an older generation of core (Zen 2, Zen 3)

So the APUs will be levels below Zen 4 - nowhere near a replacement just because they have graphics.

N6 might be better performing, but it isn't going to be nearly as cheap as using 12LP+.

That is very likely going to be the case.

But when Zen 4 and Genoa enter the market, they will be very high-end products, and a $5-10 difference in cost on desktop is not going to make a big difference - especially if it buys you a simple GPU.

But in the longer run, 3D stacking will keep gaining importance. We probably haven't thought of a tenth of the stacking ideas that AMD is evaluating - I bet that includes stacking on top of the IO die.

My favorite here is DRAM. If a stack of DRAM sits on an I/O die that also has the graphics, you get instant nirvana-level performance out of those CUs.

There were some more recent test results done by AT that seem to suggest that a lot of the excess power draw seen from the IO die with Milan may have been largely a result of using the original motherboard that they received with the Naples CPUs they tested when Zen originally launched. Using a motherboard designed for Milan yields much better power characteristics.

True, but AMD is going to be pushing those links for increased bandwidth, which could potentially double the power draw on the old process from 50 W back to 100 W. And perhaps going to 6 nm brings it back to 50 W (just speculating here).

Using TSMC to manufacture the IO dies also means fewer wafers for all of their other chips that are currently in short supply. As I outlined in a previous post, I don't think there's anywhere near enough capacity at TSMC for AMD to devote wafers there to producing IO dies.

Assuming Zen 3D and Milan X are still extremely competitive in the marketplace, AMD does not have to risk an unnecessarily sharp ramp of Zen 4.

So in, say, Q1, AMD may have 80% Zen 3 wafer starts (with all their IO dies from GloFo) and only 20% Zen 4 starts. These will be for sale in Q2/Q3.

So AMD will still be buying a ton of IO dies from GlobalFoundries.

And like I said, I expect Q1 2022 to have a glut of N6 capacity.
 
  • Like
Reactions: Tlh97 and Vattila

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
Stacking would be the optimal solution in some respects. You make the base die with all of the physical interfaces, stack the cpu cores on top, and cache die on top of that, and then maybe a DRAM die next to it. You can have a larger, cheaper process for the IO with optimized processes for logic, SRAM, DRAM on the other die. Problem is, that stack is probably a lot more expensive than just making a monolithic APU.

If you have next-gen IO + cores + SRAM + DRAM + medium-level graphics, that could be 1000 mm2 (especially once you add the DRAM).

You can't make that monolithic.
 
  • Like
Reactions: Tlh97

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
They might be behind Intel on HBM for HPC then, although Intel has been relegated to “I will believe it when I see it” status. AMD was there for a while. AMD could make up for not having HBM by just having ridiculous quantities of SRAM cache.

Yup, and Milan X could already have a good amount of L3 three quarters before the first HBM Sapphire Rapids is released.

And from Milan to Genoa, Genoa will add 4 more external memory channels and will likely raise the bar even higher, with SRAM stacking to ridiculous levels.

And as far as power consumption inside the MCM goes, I think SRAM L3 reduces power consumption, while HBM DRAM increases it.
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
637
1,103
136
It's also an argument against using a smaller node in the first place. Really the only reason I could see for doing it is because there's a way they could add a huge layer of v-cache to it that functions as an L4.



I think GF has some newer node (12LP+) available that does have better power characteristics than the older stuff AMD has been using. I don't know a lot about the new requirements for DDR5 (some don't think Zen 4 will use it, but I'm inclined to believe it will), but I don't think a process shrink helps with a lot of the IO, because as the transistors get smaller the resistance increases, so more voltage is needed to offset this. Since the area used by the PHY components stays about the same, the energy cost doesn't decrease. Any logic that does shrink has better power characteristics, but without knowing how much power the various parts of the chip draw, it's hard to determine the overall benefit.



Seems unlikely just from looking at the IO die that they use.

[die shot of the EPYC IO die]


First, since it's a server chip, it has even more memory channels, interconnects to chiplets, PCIe lanes, etc. that it needs to connect, so it benefits even less from a die shrink. Making it modular seems like it could help, because it adds more room for connectors, but they'd also need some redundant controllers so that each module has at least one, plus interconnects to transfer between the IO modules. I'm not sure how much latency that extra step adds, but that alone might make it less desirable.

The 12LP+ node from GlobalFoundries is supposed to offer a 40% improvement in power use over their 12LP process, so if that's true and AMD is using it, it would alleviate a lot of the power problems without requiring a jump to a smaller node. There's also the several billion dollars in wafers that AMD is going to buy from GF through 2024 as part of a recent agreement. Some rumors suggested that they'd make Athlon CPUs at GF, but it's hard to believe they'd buy that many of them. That also leads me to believe that they'll still be using GF for the IO dies.

As I have stated previously, the Epyc IO die is split into quadrants with each quadrant having almost the exact same components and interfaces as a Ryzen IO die. You get different latency if you cross a quadrant boundary. I think you also get different latency for top and bottom halves. This is exposed in different “NUMA per socket” (NPS) settings. I could see them making it as 4 or more separate chips on 6 nm and connecting 4 of them together with embedded silicon bridges for Epyc. That would take very little power and would look almost like on die connections as far as latency and bandwidth. The bridges would be unused in Ryzen and only half used in threadripper, but they wouldn’t take much die area. Having a single IO die across multiple products seems like it would be a good solution. They have done things before where they have unused interfaces. The original Zen 1 die actually had 4 IFOP links on die, but only 3 of them were used per die in Epyc. It was just to simplify package routing. They were completely unused in Ryzen parts.

The original slides from 2018 showing the package routing are here:


I could see them possibly making a version of the IO die at GF for desktop Ryzen since it isn’t as power sensitive as Epyc. I could see them also making a low end APU or perhaps even a low end chiplet at GF to supply the business / low end OEM market. I don’t think we have any low end Zen 3 parts since they can sell everything they make as higher end parts for a lot more money. I wonder if an 8 core APU design for desktop, where power consumption isn’t as important, would be a possibility. Perhaps R3 parts will be all GF. They would want mobile and Epyc on the latest process for power consumption and stacking tech may add other limitations on where they make things.

Another possibility is that Ryzen gets an IO die made at GF while Epyc gets the new lower power IO die. There isn’t much of a power constraint for desktop Ryzen. If they do a monolithic IO die for Epyc, they could get rid of the quadrants such that there is no variation in latency. It would still be a big chip though, even on 6 nm, so it does seem wasteful. I have wondered if there could be any sharing between GPU and CPU memory designs. Their GPUs support full virtual memory and they are 256-bit, but QDR, so perhaps 1024-bit unified memory controller internally. Epyc is 8x64 for 512-bit, but DDR, so actually the unified memory controller could be 1024-bit also, if it was combined. The current IO die has it split into essentially 4x 128-bit controllers with 2 channels each. I had expected infinity cache to make it into an Epyc IO die, although the stacked L3 seems to make it less necessary. I am wondering if they can share some design here with infinity cache and just swap out the GDDR and DDR physical interfaces. Adding infinity cache to an Epyc IO die would perform very well; it would give the system that monolithic last level cache. If they were adding 128 MB infinity cache, then they would almost certainly need to make it on a TSMC advanced process.
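On that 1024-bit point, the arithmetic lines up on both sides if you treat the GPU's 256-bit GDDR6 as quad-pumped and Epyc's 8 DDR channels as double-pumped (how the UMC datapath is actually organized internally is my assumption, not a confirmed detail):

[CODE=python]
# Effective bits moved per interface clock, the sense in which a GPU
# and an Epyc unified memory controller could share a 1024-bit
# internal datapath. (The internal organization is an assumption.)
gpu_bits = 256 * 4       # 256-bit GDDR6, quad data rate
epyc_bits = 8 * 64 * 2   # 8 channels x 64-bit DDR, double data rate
print(gpu_bits, epyc_bits)  # -> 1024 1024
[/CODE]

So a shared controller design with swappable GDDR/DDR PHYs isn't crazy from a width standpoint.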
 
  • Like
Reactions: Tlh97

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
I am quite aware of the Epyc IO die since I have been messing with NUMA settings trying to get software to scale on a dual socket Epyc the same as it does on a single socket Epyc. That hasn’t worked too well so far, possibly due to interplay with the GPUs, which are not connected quite as I would prefer.

If AMD did not already obsolete the dual-socket server with the 64-core CPU, I hope they can finally hammer the stake through its heart with Genoa or Bergamo...
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
There will be a glut of N6 capacity by the time Zen 4 comes out. The mobile hordes are already moving from N7 to N5, and we are still at least 6 months away from the first wafer starts for Zen 4 and its IO die.

By then, TSMC's mobile customers will not only be gone from N7/N6, they will also be in their seasonal low periods.



A separate chip is a goal I bet AMD is aspiring to, but it would mean another communication link (from this GPU chiplet to the I/O die), with its power and latency overhead.

Big OEMs like to have some base video.

Whether AMD tries to go beyond that remains to be seen.



But if AMD is going to keep reusing the CCD between desktop and server, with the potential for multiple CCDs per package, putting the graphics on the CCD is a non-starter.
There is almost no overhead for the SoIC stacking that doesn't use any micro-solder balls. The stacked solution will be more expensive, though, so there needs to be a reason to make the stacked device on a separate process rather than just put it on the same die. For the cache die, they seem to have achieved significantly higher density in the 64 MB cache chip by making it on a slightly different process variant. It is probably almost entirely cache, though, with a lot of the control already on the base CCD. I don't think there is much of an advantage for the GPU to be on a separate process from the CPU, although the GPU might be better on a more density-optimized variant with the CPU on a more high-performance variant. I don't know if there is enough of an advantage that incurring the cost of a stacked device will make sense. I have no idea how much the stacking tech costs, though, compared to just using more die area.
 
  • Like
Reactions: Tlh97 and Joe NYC

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
The exotic cooling might be needed for logic on logic stacking, but not really for L3 SRAM stacking.
Oh, I see now. You were referring to core layers, in reference to the theory that AMD would layer cores/logic atop one another and would have to come up with a method to cool those layers. This is a very interesting and hotly discussed (pun not intended) topic when it comes to future Zens, since space on the mobo is at a premium, and more so the smaller the mobo gets.
 

A///

Diamond Member
Feb 24, 2017
4,352
3,154
136
We don't know exactly where the I/O overhead lies, but if a big part of it is in the I/O die, then perhaps that would be part of the motivation.

Stacking on top of the I/O die might be simpler with TSMC, and also, if the GPU resides in the I/O die, it could be a beefier and more efficient GPU.

I think the long-term objective might be to move APUs to the chiplet era, and using TSMC could bring AMD closer to that goal.
That's exactly what I'm alluding to. TSMC has a wealth of tech they can implement for a tighter stack of a product from start to finish for clients. For reference, "tighter stack" here doesn't mean physically stacked layers, but the overall product offering, public knowledge or not.
 
  • Like
Reactions: Tlh97 and Joe NYC

Joe NYC

Golden Member
Jun 26, 2021
1,899
2,197
106
There is almost no overhead for the SoIC stacking that doesn't use any micro-solder balls. The stacked solution will be more expensive, though, so there needs to be a reason to make the stacked device on a separate process rather than just put it on the same die. For the cache die, they seem to have achieved significantly higher density in the 64 MB cache chip by making it on a slightly different process variant. It is probably almost entirely cache, though, with a lot of the control already on the base CCD. I don't think there is much of an advantage for the GPU to be on a separate process from the CPU, although the GPU might be better on a more density-optimized variant with the CPU on a more high-performance variant. I don't know if there is enough of an advantage that incurring the cost of a stacked device will make sense. I have no idea how much the stacking tech costs, though, compared to just using more die area.

There are a few advantages to a stack vs. the same die:
- an optimized process (as you mentioned), or even an optimized node: 5 nm Zen 4 may have 6 nm L3 V-Cache
- the cost of a processed SRAM wafer is likely a fraction of the price of a logic wafer, so L3 on the logic wafer is a lot more expensive. If you have 80 mm2 of logic and 288 mm2 of L3 (see below), that cost differential may be higher than the entire cost of stacking
- the 8x64 MB would actually not be 288 mm2 on a logic-optimized die; it may in fact be close to double that area
- stacks 1, 2, 4, or 8 levels high require just 2 types of die, rather than 5 (base + 4) or salvaging dies with less L3
- the area of a core CCD + 8 levels of L3 is 80 + 8*36 = 80 + 288 = 368 mm2, which could hurt yields (see the yield sketch at the end of this post)
- vs. assembling 9 tested-good dies
- AMD can decide late whether it wants to stack any L3, and how many layers
- distances are actually shorter in a stacked die vs. monolithic; each layer is extremely thin

The advantages of stacking IMO could completely outweigh the costs. Additionally, stacking turns an insane product spec - say, an Epyc with 2 GB of L3 - into something quite feasible and affordable.
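On the yield point from the list above: with the usual exponential defect model Y = exp(-area x D0) and an assumed defect density of 0.1 defects/cm2 (a placeholder, not a published TSMC number), the hypothetical 368 mm2 monolithic die vs. nine small tested dies looks like this:

[CODE=python]
# Exponential yield model: Y = exp(-area_cm2 * D0).
# D0 = 0.1 defects/cm^2 is an assumed, illustrative defect density.
from math import exp

D0 = 0.1  # defects per cm^2 (assumption)

def die_yield(area_mm2: float) -> float:
    return exp(-(area_mm2 / 100.0) * D0)

print(f"368 mm2 monolithic: {die_yield(368):.0%}")  # ~69%
print(f"80 mm2 core CCD:    {die_yield(80):.0%}")   # ~92%
print(f"36 mm2 L3 slice:    {die_yield(36):.0%}")   # ~96%
[/CODE]

Stacking only ever combines dies that already tested good, so the gap between 69% and 92-96% is silicon the monolithic approach throws away, and it widens fast as D0 or die area goes up.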
 
  • Like
Reactions: Tlh97