Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Page 85 - AnandTech Forums

Vattila

Senior member
Oct 22, 2004
809
1,412
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

[Attached image: Untitled2.png]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
Or...

They make one larger CCD that has two eight-core CCXs with their L3 caches aligned such that they share a common long, central axis. The TSVs (vias) can be placed in the middle as with Zen 3, and a single V-Cache die can be constructed to align with that axis. That would allow a single cache die to stack on the CCD and connect to both CCX units.

I actually suggest that they could design these high-density CCDs with half the L3 per CCX, at 16 MB. Then a four-high stack of V-Cache can be placed on it, with four layers of 32 MB of cache over each CCX, giving 144 MB of cache per eight-core CCX (288 MB per CCD). That's still plenty.
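The capacity arithmetic in this scenario can be checked with a short script. All figures here are the post's speculative assumptions (16 MB base L3 per eight-core CCX, four stacked 32 MB layers, two CCXs per CCD), not confirmed specs:

```python
# Speculative cache math from the scenario above (assumed figures, not specs):
# 16 MB base L3 per CCX plus four stacked 32 MB V-Cache layers.
BASE_L3_MB = 16   # assumed halved base L3 per eight-core CCX
LAYER_MB = 32     # assumed capacity of one stacked V-Cache layer
LAYERS = 4        # assumed stack height

per_ccx = BASE_L3_MB + LAYERS * LAYER_MB   # 16 + 128 = 144 MB per CCX
per_ccd = 2 * per_ccx                      # two CCXs per CCD in this scenario

print(per_ccx, per_ccd)  # 144 288
```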

I have a feeling it is far more likely to go in the opposite direction - enlarging the L3 area, not shrinking it.

For SRAM there is little yield hit from a larger die, since SRAM arrays include redundancy. And higher potential L3 capacity could leave room for custom configurations from clients who want extra-large L3s.

The area lost under L3 - perhaps AMD will find some good use for it.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
I guess we are going to see some possibly exotic solutions, but probably not for a while. Although, I may have said that in the past about AMD and was wrong. I think they might be able to do at least 2 layers by just using well-binned devices at lower clocks. If they can keep it down to 4- or maybe 6-die stacks, then they could use LSI or other stacking tech for connecting CCD stacks to the IO die. I was expecting the initial version of Genoa to use serial connections, but that is not very power efficient with another doubling of speed. Using LSI would make sense, but it does require that the chips are adjacent, so it limits the number of CCDs. The 6-stack solution is still asymmetric with respect to the 4 IO die quadrants, but they already make such devices. Perhaps the 128-core device comes a bit later and uses 4-high stacks with some exotic cooling.

I agree, probably not for a while for exotic cooling solutions. And also, probably not likely for logic on logic stacking this time around.

AMD has a clear field to run quite far with the existing approach, 3D stacked L3, before taking on more complexity with exotic cooling methods or Logic on Logic stacking.

I am wondering if AMD could just push the envelope on interconnect just a bit in every direction:
- lower voltage
- perhaps increased width
- increased frequency

To squeeze out just enough to maintain the current low-cost MCM architecture.
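As a rough sanity check on that trade-off, link power is just bandwidth times energy per bit. A minimal sketch, with purely illustrative pJ/bit figures (the ~2 pJ/bit organic-substrate SerDes and ~0.5 pJ/bit denser-link numbers are assumptions for discussion, not AMD data):

```python
# Illustrative link-power model: power (W) = bits/s * energy per bit (J).
# The pJ/bit values used below are rough assumed figures, not AMD specs.
def link_power_w(bandwidth_gb_s: float, pj_per_bit: float) -> float:
    """Power in watts for a link moving bandwidth_gb_s GB/s at pj_per_bit pJ/bit."""
    bits_per_s = bandwidth_gb_s * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12

serdes = link_power_w(100, 2.0)   # 100 GB/s at an assumed 2 pJ/bit -> 1.6 W
local = link_power_w(100, 0.5)    # same bandwidth at an assumed 0.5 pJ/bit -> 0.4 W
print(round(serdes, 2), round(local, 2))  # 1.6 0.4
```

Scaled to a 12-CCD package, a 4x difference in pJ/bit is the kind of gap that decides whether the cheap MCM approach can be stretched another generation.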

And perhaps add some more exotic methods to the next half iteration with Bergamo.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
I agree, probably not for a while for exotic cooling solutions. And also, probably not likely for logic on logic stacking this time around.

AMD has a clear field to run quite far with the existing approach, 3D stacked L3, before taking on more complexity with exotic cooling methods or Logic on Logic stacking.

I am wondering if AMD could just push the envelope on interconnect just a bit in every direction:
- lower voltage
- perhaps increased width
- increased frequency

To squeeze out just enough to maintain the current low-cost MCM architecture.

And perhaps add some more exotic methods to the next half iteration with Bergamo.
Do they need to? Forgive me if I've got the details wrong here, but I was under the impression the core layers are flipped upwards in the 3D Stacked Zen 3 processors to allow for better heat distribution and cooling compared to regular Zen 3 chiplets.

I've heard a rumor of HBMe for Zen 5, but again that's a rumor. It's simpler to take a stab in the dark here with AMD and maybe be 33% correct. I almost miss AMD's wild claims and falling short by 80% of them now.
 
  • Like
Reactions: Tlh97

jpiniero

Lifer
Oct 1, 2010
15,223
5,768
136
Are they going to be able to sell however many Picasso dies they can get out of ~$1.6 billion in wafer purchases? Depending on the price (probably $3,000 - $4,000 based on common estimates) that's something near half a million wafers. That's an awful lot of Picasso dies, probably well over 100 million, maybe even closer to 150 million. There were some rumors about them using GF for the Athlon refresh which no doubt takes up some of those wafers, but probably not all of them.
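The back-of-the-envelope math behind that estimate can be spelled out. The wafer price is the post's assumed $3,000-$4,000 range, and the dies-per-wafer figure is a rough assumption for a Picasso-class (~210 mm2) die on a 300 mm wafer, yield included:

```python
# Rough wafer arithmetic for the ~$1.6B GF purchase estimate in the post.
# Price per wafer and dies per wafer are assumptions, not disclosed numbers.
TOTAL_SPEND = 1.6e9       # assumed total wafer purchase commitment, USD
PRICE_PER_WAFER = 3_500   # assumed mid-point of the $3,000-$4,000 range

wafers = TOTAL_SPEND / PRICE_PER_WAFER   # ~457k wafers

# A ~210 mm^2 die on a 300 mm wafer gives very roughly ~250 good dies (assumed).
DIES_PER_WAFER = 250
dies = wafers * DIES_PER_WAFER           # ~114 million dies

print(int(wafers), int(dies / 1e6))      # ~457142 wafers, ~114 million dies
```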

You also have Rome and Milan sales/support. Would they price a theoretical Zen 3 12LP+ low enough to attract sales?
 

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
Stacked on the IO die would mean the IO Die and chiplets are on the same node right? TSMC don't do cross node stacking or is it that they don't do cross node stacking yet?

First, I don't think there will be any logic stacking on top of the I/O die.

But TSMC does allow mixing nodes for stacking. This picture just shows when the node becomes eligible for top and bottom of the stack.
[Attached image: 1626561395136.png]
 
  • Like
Reactions: Tlh97

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
Perhaps AMD is chasing more than just power usage reduction by using TSMC for the entirety of their mainstream, HEDT and enterprise stack.

We don't know exactly where the I/O overhead lies, but if a big part of it is in I/O die, then perhaps that would be part of the motivation.

The possibility of stacking on top of the I/O die might be simpler with TSMC, and also, if the GPU resides in the I/O die, it could be a beefier and more efficient GPU.

I think the long-term objective might be to move APUs to the chiplet era, and using TSMC could bring AMD closer to that goal.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
Why build an IO die on N6, though, when you don't see much size reduction (remember the physical interfaces always take up the same amount of space)? Given that AMD already can't get enough wafers to satisfy all of the demand they're seeing, it seems bizarre to go down that route.

There will be glut of N6 capacity at the time Zen 4 comes out. The mobile hordes are already moving from N7 to N5, and we are still at least 6 months away from first wafer starts for Zen 4 and its IO die.

By then, the mobile customers of TSMC will not only be gone from N7/N6, they will also be in their seasonal low periods.

Also pairing an IO die with such low-end GPU capabilities seems pointless outside of ensuring that everything now has some minimal onboard video. If you want something more powerful then you need yet another piece of silicon. Are they going to make another IO die with 8 - 12 CU?

A separate chip is a goal I bet AMD is aspiring to, but it would mean another communication link (from this GPU chiplet to the I/O die), with its power and latency overhead.

Big OEMs like to have some base video.

Whether AMD tries to go beyond that - remains to be seen.

If you wanted to make a lowest viable CU product, just make it part of a monolithic die. There were some other rumors about AMD doing an Athlon refresh on a newer node at Global Foundries, so it would seem odd to duplicate that using a far more expensive TSMC node.

But if AMD is going to keep re-using the CCD between desktop, and server, with potential of multiple CCDs per package, putting the graphics on CCD is a non-starter.
 
  • Like
Reactions: Tlh97

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
AMD already has much of what's needed for an N6 IOD designed for N7/N6 from working on Cezanne, Renoir, and their revisions, so that's less complicated to migrate as well.

That's a very good point, I didn't think about that.

Hopefully, another iteration and re-spin for N6 can add some more optimization.
 
  • Like
Reactions: Tlh97

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
The IO die can't make as much use because no matter how physically small the transistors can get, the physical interfaces can't shrink. Obviously there's more to the die than just that, but it's still a big part of the total area.

Even if the GPU part only has a really small number of CUs it still needs the front end and other parts for video display. Normally those only take up a small part of a GPU's total area, but they'll be a bigger part relative to a low CU count.

Any space savings you'd get from going to a new node probably are largely lost adding a GPU in so you're not saving any money going with that new node. More likely it costs a fair bit more.

I just don't see the point outside of doing it to provide a bare bones setup to drive a display which some people would no doubt appreciate. However I don't think it adds as much value to the product as it costs them to include.

Even if it otherwise was just an empty area on the I/O die?

Because the pins for the interfaces will need a certain minimum area.

Booting up a PC with integrated graphics has some value when you build one, even if you turn it off in the long run.

And if the integrated video helps with some design wins from the OEMs, then it has served its purpose.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
It may be a moot point anyway since I think the main reason AMD is using N6 for the IO die is because that's where the IP has been designed for.

The reason AMD is moving the IO die to N6 rather than N7 is that TSMC is urging all its clients to make that move: N6 performs even better than N7, and some optimizations in the number of processing steps make N6 actually cheaper for TSMC than N7.

Since both IO dies, for Ryzen and for Epyc, are brand-new designs going into production in the future, it only makes sense to use the best-performing and least expensive node.

Also, the Epyc IO die is huge and quite power hungry. There is a good reason to pursue die-area and power savings for that die, and the Ryzen die is just following the lead, so that AMD is not spreading design resources unnecessarily.
 
  • Like
Reactions: scineram and Tlh97

Mopetar

Diamond Member
Jan 31, 2011
8,114
6,770
136
There will be glut of N6 capacity at the time Zen 4 comes out. The mobile hordes are already moving from N7 to N5, and we are still at least 6 months away from first wafer starts for Zen 4 and its IO die.

By then, the mobile customers of TSMC will not only be gone from N7/N6, they will also be in their seasonal low periods.

I'm not sure of that. Apple is going to be buying up a lot of N5 as they continue to transition their product line from Intel to their own ARM SoCs. They aren't the highest volume sales, but we're talking about their entire product line on top of all of their iDevices. We also don't know what Nvidia's plans are, but given AMD is more competitive with their RDNA2 GPUs that may push Nvidia to try to get back on TSMC as well just to ensure no

Sony and Microsoft console sales haven't even managed to get down to MSRP yet and are still being heavily scalped online, suggesting continued strong demand. They'll buy up any additional N7/N6 wafers they can get their hands on, especially since it's likely that demand will be just as strong this next holiday season.

I don't think there's as much free capacity at TSMC in the near or even mid-term as you might be expecting.

A separate chip is a goal I bet AMD is aspiring to, but it would mean another communication link (from this GPU chiplet to the I/O die), with its power and latency overhead.

The rumors about getting MCM GPUs have been around for a while, but are apparently hard to do at least when it comes to gaming performance which is why it may not materialize for a bit longer. Obviously if it can be done it provides the same kind of economic advantages that Zen did, possibly even more. However, there is another possibility in that they create a graphics chiplet designed for professional/datacenter workloads which apparently don't suffer from being split up across chiplets nearly as much.

A single chiplet designed for those GPUs could still be used with an APU as long as any additional necessary logic gets added to the IO die. I don't know if they'd need to design another communication link, though, if they could just recycle the existing link that connects CPU chiplets to the IO die. The only real concern is bandwidth, but really they only need to match the available memory bandwidth. They also already have a high-speed link designed for connecting their GPUs that might be more suitable and could perhaps be repurposed, but I don't know.

But if AMD is going to keep re-using the CCD between desktop, and server, with potential of multiple CCDs per package, putting the graphics on CCD is a non-starter.

I agree that doesn't make sense. Either they make a monolithic APU or they make a special chiplet for graphics. It seems like we still may be another generation removed before seeing chiplet based GPUs though. There are some rumors about RDNA 3 having multi-die GPUs, but I think these may just be more akin to traditional multi-die GPUs where there are two standalone monolithic dies on a single card/package that are connected together.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
It appears that node scaling is rapidly approaching a point where it will be more efficient to take the SerDes power hit on APUs: split the I/O section onto a separate die on a larger, power-optimized process, with a CCD that is density- and performance-optimized.

It could be quite a big power hit - too big for the target markets for APUs (laptops, SFF PCs).

At some point, AMD will move away from the SerDes interface to something faster and higher-performance (and likely more costly), but that will be the time to make that re-partitioning.
 
  • Like
Reactions: Tlh97

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
Do they need to? Forgive me if I've got the details wrong here, but I was under the impression the core layers are flipped upwards in the 3D Stacked Zen 3 processors to allow for better heat distribution and cooling compared to regular Zen 3 chiplets.

The exotic cooling might be needed for logic on logic stacking, but not really for L3 SRAM stacking.

I've heard a rumor of HBMe for Zen 5, but again that's a rumor. It's simpler to take a stab in the dark here with AMD and maybe be 33% correct. I almost miss AMD's wild claims and falling short by 80% of them now.

With all the 3D stacking going on, by the time we get to Zen 5, the 2.5D method of HBM may be obsolete.
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
644
1,105
136
AMD actually has two versions of the IO die: a larger one for EPYC/Threadripper (with an 8-channel DDR4 memory controller, more PCIe 4.0 lanes, etc.) and another for Ryzen/X570. AMD should still have two versions of the IO die, because Genoa will have a 10-channel DDR5 memory controller and still more PCIe lanes than the Ryzen IO die.
I am quite aware of the Epyc IO die since I have been messing with NUMA settings trying to get software to scale on a dual socket Epyc the same as it does on a single socket Epyc. That hasn’t worked too well so far, possibly due to interplay with the GPUs, which are not connected quite as I would prefer.

AMD has the Ryzen IO die that is basically 2 channel memory, 2 cpu links, and 2 x16 pci-express. They also use that same IO die as a chipset, just with the memory controller and cpu links unused. It only uses the pci-express.

The Epyc IO die is basically 4x the Ryzen IO die. It is split into 4 quadrants internally; you get different memory latency depending on whether the access is to the controller on the same quadrant or same half as the CPU die. To help optimize this, they have NPS settings in the BIOS. This stands for “NUMA per socket”. It can be 1 NUMA per socket, which optimizes for bandwidth by striping memory addresses across all 8 controllers. You get higher latency though. At NPS2, you get 2 NUMA nodes per socket and memory is striped across the 4 channels of 2 quadrants; kind of a north and south half. At NPS4, each quadrant is a separate NUMA node which offers the best latency, but you only get memory stripes across the dual channel controller of each quadrant. They have a few other settings, like making each CCX a separate numa node, so technically you could have up to 16 NUMA nodes in a dual socket Milan.
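The NPS striping described above can be sketched as a toy address-to-channel model. This is a deliberate simplification for illustration only - the real interleave granularity, hashing, and node selection differ from this sketch:

```python
# Toy model of Epyc NPS address interleaving: which of 8 channels
# (4 quadrants x 2 channels) serves a given address. Simplified for
# illustration; real interleave granularity and hashing are different.
LINE = 64  # bytes per interleave unit in this toy model

def channel_for(addr: int, nps: int) -> int:
    """Return the memory channel (0-7) for addr under NPS1/NPS2/NPS4 striping."""
    line = addr // LINE
    if nps == 1:   # stripe across all 8 channels for maximum bandwidth
        return line % 8
    if nps == 2:   # stripe across the 4 channels of one half (north/south)
        half = (addr >> 32) & 1
        return half * 4 + line % 4
    if nps == 4:   # stripe across only the 2 channels of one quadrant
        quadrant = (addr >> 32) & 3
        return quadrant * 2 + line % 2
    raise ValueError("nps must be 1, 2 or 4")

# Consecutive lines fan out across all 8 channels in NPS1, only 2 in NPS4.
print([channel_for(i * 64, 1) for i in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
print([channel_for(i * 64, 4) for i in range(8)])  # [0, 1, 0, 1, 0, 1, 0, 1]
```

The trade-off the post describes falls out directly: NPS1 spreads every stream over all controllers (bandwidth), while NPS4 keeps a node's accesses on its local quadrant (latency).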

I was trying to run in NPS4 mode, but that is a mess. You get 8 NUMA nodes in a dual-socket system and only half of them have GPUs attached (4 GPUs). We can’t go over 4 GPUs since the amount of system memory for each GPU would be too low and the system would probably go from 2U to 4U in size. NPS2 might have been a good solution since it would have been 4 NUMA nodes across 2 CPU sockets for 4 GPUs. That didn’t work since the evaluation system I had ended up with 2 GPUs in one node and none in the other. We also have to take the placement of the InfiniBand card into account.

We ended up going with a single socket CPU solution with just 2 GPUs each. We are still running in NPS1 to optimize for bandwidth and reduce communication issues with GPUs and the Infiniband card.

So, while the Epyc IO die has a lot more stuff, it is still arranged as if it is 4 separate die like Naples. The infinity fabric switch is just internal. It is a different layout, but each quadrant has almost exactly the same stuff as a Ryzen IO die.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
These statements about the PHY blocks staying about the same size with node shrinks are a bit confusing to me. Surely they have shrunk over the years or is it being claimed that they are almost the same size as they were on the 22nm or 28nm nodes for example?

I can see the scaling factor as being less than other logic or cache. In that case does anyone know the value? For example, it scales at 1/2 the rate of logic.
They would have shrunk, but there are limits. For something like the memory controller, they have a relatively large unit that is a “unified memory controller”. It basically will do everything except communication with the memory, so it would be everything that isn’t in the physical interface. The physical interface is the part that doesn’t shrink well; anything labeled PHY in the images someone posted earlier. The transistors that drive the external interfaces need to drive a lot of power, a lot more than the drive power required for anything on die, even sending data all of the way across the die. This means that they have to be significantly larger transistors. The size will be determined by the drive power required and what can be achieved with the process tech. There may be some optimizations that could be done to the process tech that would not be as good for logic, which is part of why stacking could make sense. They could make a chip with all of the PHY interfaces and then stack the required logic portions on top, maybe with some cache and other useful stuff.

Reduction in PHY size has probably come from changing the interfaces more than anything else. For SDR, I think it was 3.3 V. For DDR, I think it was 2.5 V originally -> 1.8 V -> 1.5 V -> 1.2 V -> 1.1 V for DDR5. Lower voltage means lower power and smaller transistors, but they have to do more and more complicated things to get the necessary signal integrity, so there are some trade-offs. HBM reduces die size for the memory interface significantly, even though it is 1024 bits wide per stack. That is due to the lower clock allowing simpler, lower-voltage interfaces. The GDDR interfaces are massively complex to maintain signal integrity at very high clocks.
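Since driver switching power scales roughly with V² at a fixed frequency and load, the voltage steps above imply a large cumulative saving. A minimal sketch of that scaling - it deliberately ignores the higher clocks and equalization circuitry each generation added, so it is only a rough illustration:

```python
# Relative DDR interface drive power under a simple V^2 scaling model.
# Ignores frequency increases and equalization added per generation,
# so this is only a rough illustration of the voltage effect.
def rel_drive_power(v: float, v_ref: float = 3.3) -> float:
    """V^2 drive power relative to the 3.3 V SDR-era rail."""
    return (v / v_ref) ** 2

voltages = {"SDR": 3.3, "DDR": 2.5, "DDR2": 1.8, "DDR3": 1.5, "DDR4": 1.2, "DDR5": 1.1}
for gen, v in voltages.items():
    print(f"{gen}: {v} V -> {rel_drive_power(v):.0%} of SDR drive power")
```

By this crude measure, a DDR5 PHY rail at 1.1 V dissipates about a ninth of the drive power of the old 3.3 V SDR rail per switching event, which is where much of the PHY power reduction has actually come from.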
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
It appears that node scaling is rapidly approaching a point where it will be more efficient to take the SerDes power hit on APUs: split the I/O section onto a separate die on a larger, power-optimized process, with a CCD that is density- and performance-optimized.
Stacking would be the optimal solution in some respects. You make the base die with all of the physical interfaces, stack the cpu cores on top, and cache die on top of that, and then maybe a DRAM die next to it. You can have a larger, cheaper process for the IO with optimized processes for logic, SRAM, DRAM on the other die. Problem is, that stack is probably a lot more expensive than just making a monolithic APU.
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
644
1,105
136
According to some of the latest rumors - there will be no HBM for Zen 4.

Whether that means there will not be anything supporting HBM 2.5D stacking, or no high-speed memory (3D stacked) at all, is not quite clear.

I have seen another rumor of higher memory channel count (12?).

So it seems the major thrust may be on bigger L3s, and perhaps improved bandwidth to each individual CCD.

DDR5 should add more bandwidth, and a possibly higher number of memory channels could add bandwidth further. And L3 could address the latency.
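For scale, peak theoretical bandwidth is just channels x transfer rate x bus width. A quick sketch comparing Milan's known 8-channel DDR4-3200 against a rumored 12-channel DDR5 configuration - the channel count and the DDR5-4800 speed are assumptions from the rumors, not confirmed specs (and this treats a channel as a full 64-bit bus, glossing over DDR5's split 32-bit subchannels):

```python
# Peak theoretical DRAM bandwidth = channels * transfer rate * bytes/transfer.
# The 12-channel count and DDR5-4800 speed are rumored/assumed figures.
# A "channel" here is a 64-bit bus; DDR5 actually splits it into 2x32-bit.
def peak_bw_gb_s(channels: int, mt_s: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s for `channels` DDR channels of `bus_bytes` width."""
    return channels * mt_s * bus_bytes / 1000

milan = peak_bw_gb_s(8, 3200)    # 8ch DDR4-3200 -> 204.8 GB/s
genoa = peak_bw_gb_s(12, 4800)   # assumed 12ch DDR5-4800 -> 460.8 GB/s
print(milan, genoa)  # 204.8 460.8
```

Even under these assumptions, that is more than a 2x jump in peak socket bandwidth, which is why the channel-count rumor matters as much as DDR5 itself.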

That is, unless AMD has something additional up its sleeve. There are some rumors regarding the MCD (Memory Cache Die for RDNA 3). I wonder if any of it might be applicable to Zen 4.
They might be behind Intel on HBM for HPC then, although Intel has been relegated to “I will believe it when I see it” status. AMD was there for a while. AMD could make up for not having HBM by just having ridiculous quantities of SRAM cache.
 
  • Like
Reactions: Tlh97

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
I don't deny this, but I'm not really sure they need it given that they already can't supply the market with enough Zen 3 CPUs which completely lack this capability. They also have new APUs with integrated video that cover that side of the market.

The number of customers who want a 16-core Zen CPU with onboard graphics isn't terribly large. I'm not sure the added cost across an entire product line is worth what niche market segments they might be able to pick up or the small bit of extra convenience that the onboard graphics provides if a GPU goes bad and there isn't a spare to use.

If we think about potential bottlenecks preventing more Ryzen chips on the market, it may be from 2 sources:
- N7 node bottleneck
- substrate (and other component) bottleneck.

Looking at TSMC's Q2 revenue, the N7 bottleneck has eased quite a bit, to the point that N7 may no longer be the bottleneck for AMD - the substrate may be.

In Q1 2022, we will be 3 quarters removed from N7/N6 being a bottleneck - nowhere near being one. Zen 4 sales will be constrained by the N5 bottleneck instead.

The problem with APUs covering this market - no, they cover a different market:
- APUs will not have 3D stacking available to them (most likely)
- APUs will be on an older generation of core (Zen 2, Zen 3)

So the APUs will be levels below Zen 4 - nowhere near a replacement just because they have graphics.

N6 might be better performing, but it isn't going to be nearly as cheap as using 12LP+.

That is likely very much going to be the case.

But when Zen 4 and Genoa enter the market, they will be very high-end products, and a $5-10 difference in cost for desktop is not going to make a big difference - especially if it buys you a simple GPU.

But in the longer run, 3D stacking will keep gaining importance. We probably haven't thought of 1 / 10th of the stacking ideas that AMD is evaluating. I bet including stacking on top of IO die.

My favorite here is DRAM. If a stack of DRAM is on I/O die that also has the graphics, you get instant nirvana level performance out of those CUs.

There were some more recent test results done by AT that seem to suggest that a lot of the excess power draw seen from the IO die with Milan may have been largely a result of using the original motherboard that they received with the Naples CPUs they tested when Zen originally launched. Using a motherboard designed for Milan yields much better power characteristics.

True, but AMD is going to be pushing those links for increased bandwidth that could potentially double the power draw on the old process from 50W back to 100W. And perhaps going to 6nm brings it back to 50W.
(just speculating here)

Using TSMC to manufacture the IO dies also means fewer wafers for all of their other chips that are currently in short supply. As I outlined in a previous post, I don't think there's anywhere near enough capacity at TSMC for AMD to devote wafers there to producing IO dies.

Assuming that the Zen 3D and Milan X are still extremely competitive in the market place, AMD does not have to risk an unnecessarily sharp ramp of Zen 4.

So in, say Q1, AMD may have 80% Zen 3 wafer starts (with all their I/O dies from GloFo) and only 20% for Zen 4 starts. These will be for sale in Q2/Q3.

So AMD will still continue to be buying a ton of IO dies from Global Foundries.

And like I said, I expect Q1 2022 to have a glut of N6 capacity.
 
  • Like
Reactions: Tlh97 and Vattila

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
Stacking would be the optimal solution in some respects. You make the base die with all of the physical interfaces, stack the cpu cores on top, and cache die on top of that, and then maybe a DRAM die next to it. You can have a larger, cheaper process for the IO with optimized processes for logic, SRAM, DRAM on the other die. Problem is, that stack is probably a lot more expensive than just making a monolithic APU.

If you have next-gen IO + cores + SRAM + DRAM + medium-level graphics, that could be 1,000 mm2 (especially adding DRAM).

You can't make that monolithic.
 
  • Like
Reactions: Tlh97

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
They might be behind Intel on HBM for HPC then, although Intel has been relegated to “I will believe it when I see it” status. AMD was there for a while. AMD could make up for not having HBM by just having ridiculous quantities of SRAM cache.

Yup, and Milan-X could already have a good amount of L3 three quarters before the first HBM Sapphire Rapids is released.

And from Milan to Genoa, Genoa will add 4 more external memory channels and will likely raise the bar even higher with SRAM stacking, to ridiculous levels.
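Those "ridiculous levels" are easy to quantify. Using the publicly described Milan-X configuration (32 MB base L3 plus 64 MB stacked per CCD, 8 CCDs per package) - the taller-stack variants discussed earlier in the thread remain pure speculation:

```python
# L3 totals for a Milan-X-style part: 32 MB base + 64 MB stacked per CCD,
# 8 CCDs per package. Follows the public Milan-X description; taller
# stacks are speculation from the thread, not included here.
BASE_MB, STACKED_MB, CCDS = 32, 64, 8

per_ccd = BASE_MB + STACKED_MB   # 96 MB of L3 per CCD
total = per_ccd * CCDS           # 768 MB of L3 per socket
print(per_ccd, total)  # 96 768
```

Three-quarters of a gigabyte of SRAM per socket, before any of the speculated multi-layer stacks.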

And as far as power consumption inside the MCM, I think SRAM L3 reduces the power consumption, while HBM DRAM increases it.
 
  • Like
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
644
1,105
136
It's also an argument against using a smaller node in the first place. Really the only reason I could see for doing it is because there's a way they could add a huge layer of v-cache to it that functions as an L4.



I think GF has some newer node (12 LP+) available that does have better power characteristics than the older stuff that AMD has been using. I don't know a lot about the new requirements for DDR5 (some don't think Zen 4 will use it, but I'm inclined to believe that it will) but I don't think a process shrink helps with a lot of the IO because as the transistors get smaller the resistance increases so more voltage is needed to offset this. Since the area used by the PHY components stays about the same, the energy cost doesn't decrease. Any logic that does shrink has better power characteristics, but without knowing how much power the various parts of the chip draw it's hard to determine the overall benefits.



Seems unlikely just from looking at the IO die that they use.

[Attached image: xYU7j4MZVB84reym.jpg]


First, since it's a server chip, it has even more memory channels, interconnects to chiplets, PCIe lanes, etc. that it needs to connect to, so it benefits even less from a die shrink. Making it modular seems like it could help because it adds more room for connectors, but they'd also need some redundant controllers so that each module has at least one. They'd also need to add some interconnects to transfer between the IO modules. I'm not sure how much latency that extra step adds, but that alone might make it less desirable.

The 12LP+ node from Global Foundries is supposed to offer a 40% improvement in power use over their 12LP process, so if that's true and AMD is using it it would alleviate a lot of the power problems without requiring a jump to a smaller node. There's also the several billion dollars in wafers that AMD is going to buy from GF through 2024 as part of a recent agreement. Some rumors suggested that they'd make Athlon CPUs at GF, but it's hard to believe that they'd buy that many of them. That also leads me to believe that they'll still be using GF for the IO dies.

As I have stated previously, the Epyc IO die is split into quadrants with each quadrant having almost the exact same components and interfaces as a Ryzen IO die. You get different latency if you cross a quadrant boundary. I think you also get different latency for top and bottom halves. This is exposed in different “NUMA per socket” (NPS) settings. I could see them making it as 4 or more separate chips on 6 nm and connecting 4 of them together with embedded silicon bridges for Epyc. That would take very little power and would look almost like on die connections as far as latency and bandwidth. The bridges would be unused in Ryzen and only half used in threadripper, but they wouldn’t take much die area. Having a single IO die across multiple products seems like it would be a good solution. They have done things before where they have unused interfaces. The original Zen 1 die actually had 4 IFOP links on die, but only 3 of them were used per die in Epyc. It was just to simplify package routing. They were completely unused in Ryzen parts.

The original slides from 2018 showing the package routing are here:


I could see them possibly making a version of the IO die at GF for desktop Ryzen since it isn’t as power sensitive as Epyc. I could see them also making a low end APU or perhaps even a low end chiplet at GF to supply the business / low end OEM market. I don’t think we have any low end Zen 3 parts since they can sell everything they make as higher end parts for a lot more money. I wonder if an 8 core APU design for desktop, where power consumption isn’t as important, would be a possibility. Perhaps R3 parts will be all GF. They would want mobile and Epyc on the latest process for power consumption and stacking tech may add other limitations on where they make things.

Another possibility is that Ryzen gets an IO die made at GF while Epyc gets the new lower power IO die. There isn’t much of a power constraint for desktop Ryzen. If they do a monolithic IO die for Epyc, they could get rid of the quadrants such that there is no variation in latency. It would still be a big chip though, even on 6 nm, so it does seem wasteful. I have wondered if there could be any sharing between GPU and CPU memory designs. Their GPUs support full virtual memory and they are 256-bit, but QDR, so perhaps 1024-bit unified memory controller internally. Epyc is 8x64 for 512-bit, but DDR, so actually the unified memory controller could be 1024-bit also, if it was combined. The current IO die has it split into essentially 4x 128-bit controllers with 2 channels each. I had expected infinity cache to make it into an Epyc IO die, although the stacked L3 seems to make it less necessary. I am wondering if they can share some design here with infinity cache and just swap out the GDDR and DDR physical interfaces. Adding infinity cache to an Epyc IO die would perform very well; it would give the system that monolithic last level cache. If they were adding 128 MB infinity cache, then they would almost certainly need to make it on a TSMC advanced process.
 
  • Like
Reactions: Tlh97

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
I am quite aware of the Epyc IO die since I have been messing with NUMA settings trying to get software to scale on a dual socket Epyc the same as it does on a single socket Epyc. That hasn’t worked too well so far, possibly due to interplay with the GPUs, which are not connected quite as I would prefer.

If AMD did not already obsolete the dual-socket server with the 64-core CPU, I hope they can finally hammer the stake through its heart with Genoa or Bergamo...