Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
821
1,457
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
  • Like
Reactions: richardllewis_01

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Maybe I just misread what you wrote, but you basically made it sound like you were proposing the following:

N5: Zen chiplet
N6: IO die
??: GPU chiplet

It makes far more sense to use the following configuration:

N5: Zen chiplet
N6: GPU chiplet
??: IO die

To better illustrate why this is, here are some die annotations from @GPUsAreMagic on Twitter.

This first one is Zeppelin (Zen 1), which was made on the Global Foundries 14LP process:

[Annotated die shot of Zeppelin]


The next is Renoir (Zen 2), which was made on the TSMC 7nm process:

[Annotated die shot of Renoir]


Even without doing a detailed analysis it's pretty easy to eyeball the core sizes and compare them to the DDR PHY. In the first image it looks like about 2 of the Zen 1 cores (including the blue band identified as the L2 cache) would fit in the same area as the DDR PHY. Now compare this with the second image (here the gold areas identified as "Core" contain the L2 cache), where at least 4 of those Zen 2 cores would fit. It's pretty clear that the cores were able to get a lot smaller thanks to the shrink from the 14nm GF node to the 7nm TSMC node. The IO doesn't get a similar benefit.

For one more point of reference, here's the IO die that was used with Matisse (Zen 2), which I believe used the 12nm process from Global Foundries:

[Annotated die shot of the Matisse IO die]


All of the parts annotated PHY are parts that don't really benefit from a process node shrink. An eyeball estimate puts them at slightly less than half of the die area.

Maybe in an ideal world where you can get as many wafers as you want, you put your IO die on the best node available just because it will have better power characteristics, and that's important. But we don't live in that reality, so why waste wafers on a chip that is mostly parts that don't shrink, rather than on another chip that scales far, far better?
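To put rough numbers on that asymmetry, here's a quick sketch; the die sizes, PHY fractions and scaling factor are illustrative assumptions, not measured values:

```python
def shrunk_area(total_mm2, phy_fraction, logic_scale, phy_scale=1.0):
    """Estimate die area after a shrink when logic and PHY scale differently."""
    phy = total_mm2 * phy_fraction
    logic = total_mm2 - phy
    return logic * logic_scale + phy * phy_scale

io_die = 125.0      # assume a ~125 mm2 client IO die, a bit under half PHY
cpu_chiplet = 80.0  # assume an ~80 mm2 CCD, mostly logic and SRAM

# Assume logic/SRAM lands at ~0.55x of its old area on the new node; PHY stays ~1.0x.
print(shrunk_area(io_die, phy_fraction=0.45, logic_scale=0.55))      # ~94 mm2, only ~25% smaller
print(shrunk_area(cpu_chiplet, phy_fraction=0.10, logic_scale=0.55)) # ~48 mm2, ~40% smaller
```

Under those assumptions the chiplet gets roughly 40% smaller while the IO die only gets about 25% smaller, which is the whole argument for spending the leading-edge wafers on the chiplet.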

Perhaps there's some really clever design possible, such as stacking a lot of 3D cache on top of part (or all) of that PHY, but I can't really speak to the feasibility of doing something like that. However, the only reason to move the IO die to N6 would be that they can't make anything else on it, and given the utter shortage of GPUs right now, that's really hard to believe.
I haven’t read all of this discussion, but I see several reasons why they would want to use N6. If they are doing any kind of stacking, including LSI (embedded silicon bridges) for connections to the CPU die or other components, then they almost certainly have to make both chips at TSMC. Stuff using interposers with micro-solder bumps can be mixed a bit more freely, but a lot of the TSMC tech probably has to be done completely in house.

The IO die isn’t that large on 14/12 nm GF. The PHY will not shrink much, but there is a lot there that will. That could make it pad limited, which means the die has to be a certain size due to the bump pitch for connections to the substrate. That would allow them to add extra stuff, like a small GPU, extra caches, or other components since they have extra silicon that would go to waste otherwise. Having a tiny gpu with the media unit (hardware video decode and such) would be useful for a lot of products. Although, most of those products would be served better by an APU anyway, so I am still a bit uncertain about including a GPU in the IO die.
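For a rough sense of what pad-limited means here, a small sketch; the bump count and pitch are made-up round numbers, not actual package specs:

```python
import math

def pad_limited_area_mm2(n_bumps, bump_pitch_um):
    """Minimum die area needed just to place n_bumps on a square grid at the
    given pitch; ignores keep-out zones, routing and non-square layouts."""
    side_mm = math.ceil(math.sqrt(n_bumps)) * bump_pitch_um / 1000.0
    return side_mm ** 2

# Made-up round numbers: a few thousand flip-chip bumps at a ~150 um pitch.
print(pad_limited_area_mm2(3500, 150))  # ~81 mm2 floor, no matter how well the logic shrinks
```

If the shrunk logic plus PHY would fit in less area than that floor, the leftover silicon is essentially free, which is the case for tucking in a small GPU or extra cache.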

For 6nm though, one other thing that I can think of is that it needs to clock a lot higher or be a lot wider to handle the increased clocks and bandwidth from DDR5 and/or pci-express 5. It might be difficult to build such a device on 12 nm GF, or whatever they have available. It is a huge jump in bandwidth, so doing it on an older process may also take way too much power.

There is some possibility that they will design the IO die to be modular such that they can use the same chips for Epyc. Epyc needs a power reduction for the IO die, so going with the smaller process seems like a good idea. An integrated GPU wouldn’t be needed, so perhaps salvage could go to Epyc or be used as chipsets. Epyc would make use of possibly 4 of the IO die, possibly connected by LSI. It doesn’t seem like they would want to make a possibly over 300 mm2 IO die on N6. Cranking out a bunch of much smaller IO die with a lot of opportunity for salvage seems to make more sense (fully functional -> Epyc, not quite -> threadripper, less functional -> Ryzen, really less functional -> chipset; only needs pci-express working).

There has been some talk of a 96-core Threadripper on Zen 4. If that is remotely accurate, then I suspect that it must be a single layer of CPU die; that would be too much power to stack multiple layers. I am wondering if instead of 4 IO die quadrants in Epyc (possibly in 4 separate 6 nm chips), we will actually get 6 IO die “quadrants”, or just 6 separate IO dies connected together. That would allow them to keep the 2 memory channels, 2 CPU links, and 2 x16 PCIe per partition, but it would come with an increase to as many as 160 PCIe lanes per socket. I don’t think we have any rumors about an increase in PCIe links.
 
May 17, 2020
123
233
116
There is some possibility that they will design the IO die to be modular such that they can use the same chips for Epyc. Epyc needs a power reduction for the IO die, so going with the smaller process seems like a good idea. An integrated GPU wouldn’t be needed, so perhaps salvage could go to Epyc or be used as chipsets. Epyc would make use of possibly 4 of the IO die, possibly connected by LSI. It doesn’t seem like they would want to make a possibly over 300 mm2 IO die on N6. Cranking out a bunch of much smaller IO die with a lot of opportunity for salvage seems to make more sense (fully functional -> Epyc, not quite -> threadripper, less functional -> Ryzen, really less functional -> chipset; only needs pci-express working).
AMD actually has two versions of the IO die: a larger one for EPYC/Threadripper (which has an 8-channel DDR4 memory controller, more PCIe 4.0 lanes, etc.) and another for Ryzen/X570. AMD should still have two versions of the IO die, because Genoa will have a 10-channel DDR5 memory controller and still more PCIe lanes than the IO die for Ryzen.
 
  • Like
Reactions: Tlh97 and Joe NYC

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
The IO die isn’t that large on 14/12 nm GF. The PHY will not shrink much, but there is a lot there that will. That could make it pad limited, which means the die has to be a certain size due to the bump pitch for connections to the substrate. That would allow them to add extra stuff, like a small GPU, extra caches, or other components since they have extra silicon that would go to waste otherwise.

It's also an argument against using a smaller node in the first place. Really, the only reason I could see for doing it is that there's a way they could add a huge layer of V-Cache to it that functions as an L4.

For 6nm though, one other thing that I can think of is that it needs to clock a lot higher or be a lot wider to handle the increased clocks and bandwidth from DDR5 and/or pci-express 5. It might be difficult to build such a device on 12 nm GF, or whatever they have available.

I think GF has some newer node (12LP+) available that does have better power characteristics than the older stuff AMD has been using. I don't know a lot about the new requirements for DDR5 (some don't think Zen 4 will use it, but I'm inclined to believe that it will), but I don't think a process shrink helps with a lot of the IO, because as the transistors get smaller the resistance increases, so more voltage is needed to offset this. Since the area used by the PHY components stays about the same, the energy cost doesn't decrease. Any logic that does shrink has better power characteristics, but without knowing how much power the various parts of the chip draw it's hard to determine the overall benefit.
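To make that last point concrete, a toy model; the total power, the PHY share and the 40% logic-power reduction are assumptions for illustration, not measurements:

```python
def power_after_shrink(total_w, phy_fraction, logic_power_scale):
    """Toy model: PHY power is unchanged by the shrink, only logic power scales."""
    phy_w = total_w * phy_fraction
    return phy_w + (total_w - phy_w) * logic_power_scale

# Assume a ~15 W IO die and a 40% logic-power reduction on the newer node.
for phy_share in (0.3, 0.5, 0.7):
    new_w = power_after_shrink(15.0, phy_share, logic_power_scale=0.6)
    print(f"PHY share {phy_share:.0%}: {new_w:.1f} W ({1 - new_w / 15.0:.0%} saved)")
```

The overall saving ranges from roughly 12% to 28% in this toy example, entirely depending on how much of the power the PHYs actually draw.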

There is some possibility that they will design the IO die to be modular such that they can use the same chips for Epyc. Epyc needs a power reduction for the IO die, so going with the smaller process seems like a good idea.

Seems unlikely just from looking at the IO die that they use.

[Annotated die shot of the EPYC server IO die]


First, since it's a server chip it has even more memory channels, chiplet interconnects, PCIe lanes, etc. to connect to, so it benefits even less from a die shrink. Making it modular seems like it could help, because it adds more room for connectors, but they'd also need redundant controllers so that each module has at least one, plus interconnects to transfer between the IO modules. I'm not sure how much latency that extra hop adds, but that alone might make it less desirable.

The 12LP+ node from Global Foundries is supposed to offer a 40% improvement in power use over their 12LP process, so if that's true and AMD is using it, it would alleviate a lot of the power problems without requiring a jump to a smaller node. There's also the several billion dollars in wafers that AMD is going to buy from GF through 2024 as part of a recent agreement. Some rumors suggested that they'd make Athlon CPUs at GF, but it's hard to believe they'd buy that many of them. That also leads me to believe that they'll still be using GF for the IO dies.
 

jpiniero

Lifer
Oct 1, 2010
16,810
7,253
136
There's also the several billion dollars in wafers that AMD is going to buy from GF through 2024 as part of a recent agreement.

Could respin Picasso to 12LP+ for cheap Chromebooks. Doubt they would make much but it'd be a way to satisfy the agreement.
 

maddie

Diamond Member
Jul 18, 2010
5,156
5,544
136
These statements about the PHY blocks staying about the same size with node shrinks are a bit confusing to me. Surely they have shrunk over the years or is it being claimed that they are almost the same size as they were on the 22nm or 28nm nodes for example?

I can see the scaling factor as being less than other logic or cache. In that case does anyone know the value? For example, it scales at 1/2 the rate of logic.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
PHY is the physical interface of the IO; its size is set by what the physical implementation of that IO needs, and the manufacturing process won't change it as long as the logic behind it doesn't need more area than the IO itself. Because of that, Intel had free space in their chipsets, so they implemented an iGPU to make use of it. The situation is probably still the same: Intel chips with fewer CPU cores have a bigger iGPU because the chip can't be made any smaller than the IO requires, and power delivery is part of that.
 

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
Could respin Picasso to 12LP+ for cheap Chromebooks. Doubt they would make much but it'd be a way to satisfy the agreement.

Are they going to be able to sell however many Picasso dies they can get out of ~$1.6 billion in wafer purchases? Depending on the price (probably $3,000 - $4,000 based on common estimates) that's something near half a million wafers. That's an awful lot of Picasso dies, probably well over 100 million, maybe even closer to 150 million. There were some rumors about them using GF for the Athlon refresh which no doubt takes up some of those wafers, but probably not all of them.
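Rough numbers behind that estimate; the wafer price, die size and yield are assumptions, so treat this as a ballpark only:

```python
import math

def gross_dies_per_wafer(die_mm2, wafer_diameter_mm=300):
    """Classic rough estimate of gross dies on a round wafer (ignores scribe lines etc.)."""
    d = wafer_diameter_mm
    return int(math.pi * d**2 / (4 * die_mm2) - math.pi * d / math.sqrt(2 * die_mm2))

spend = 1.6e9        # ~$1.6B in wafer purchases
wafer_price = 3500   # assumed: middle of the $3,000 - $4,000 range
picasso_mm2 = 210    # Picasso is roughly 210 mm2
yield_rate = 0.85    # assumed

wafers = spend / wafer_price
good_dies = wafers * gross_dies_per_wafer(picasso_mm2) * yield_rate
print(f"{wafers:,.0f} wafers -> about {good_dies / 1e6:.0f} million good dies")
```

That lands around 113 million; push the wafer price down to $3,000 and it climbs past 130 million, so the 100-150 million ballpark holds across reasonable assumptions.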
 
  • Like
Reactions: Tlh97

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
These statements about the PHY blocks staying about the same size with node shrinks are a bit confusing to me. Surely they have shrunk over the years or is it being claimed that they are almost the same size as they were on the 22nm or 28nm nodes for example?

I can see the scaling factor as being less than other logic or cache. In that case does anyone know the value? For example, it scales at 1/2 the rate of logic.

To some degree there's a lower limit on how much they can shrink, because they serve as a physical connection to some bus (memory, PCIe, USB, etc.) and those standards put requirements on wire spacing and so on.

Someone did an analysis of the A13 and A14 SoCs, which use the TSMC 7nm and 5nm nodes respectively. It was noted that the LPDDR PHY did not shrink at all.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,508
3,191
136
It appears that node scaling is rapidly approaching a point where it will be more efficient to take the SerDes power hit on APUs and split the I/O section onto a separate die on a larger, power-optimized process, with a CCD that is density- and performance-optimized.
 

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
It appears that node scaling is rapidly approaching a point where it will be more efficient to take the SerDes power hit on APUs and split the I/O section onto a separate die on a larger, power-optimized process, with a CCD that is density- and performance-optimized.

Based on AT reporting, the 12LP+ that Global Foundries has seems like a good fit. 40% power reduction seems to put it closer to 7nm in terms of power consumption than its characteristics would otherwise suggest.
 
  • Like
Reactions: Lodix

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
Based on AT reporting, the 12LP+ that Global Foundries has seems like a good fit. 40% power reduction seems to put it closer to 7nm in terms of power consumption than its characteristics would otherwise suggest.
Perhaps AMD is chasing more than just power usage reduction by using TSMC for the entirety of their mainstream, HEDT and enterprise stack.
 

Joe NYC

Diamond Member
Jun 26, 2021
3,650
5,189
136
Intel claims to be using HBM2e in HPC processors in late 2022, so I would expect AMD to have an HBM device also. I don’t know how competitive they will be if they just have massive SRAM caches. It allowed them to use a narrower memory interface on their GPUs though. It is unclear how they would add HBM. If the CCDs are still connected by serialized IFOP links, then the HBM may be on top of the IO die or next to it using LSI. It might be interesting if they have split the IO die into 4 separate, but identical, chips. HBM2 is about 92 square mm and a 6 nm, “single quadrant”, IO die might be about that size, so perhaps they just place one HBM stack on top of each chip. They could place 4 of these die stacks for an Epyc processor. With HBM2 die stacks being larger than CCDs, I don’t think they would want to place HBM2 die stacks next to the IO die. It would probably take a lot of space and limit routing under the stacks to the CCDs. Placing the HBM on top of the IO die(s) makes quite a bit of sense. I was wondering if they were going to use some number of smaller interposers, but that is seeming less likely.

According to some of the latest rumors, there will be no HBM for Zen 4.

Whether that means there will not be anything supporting HBM (2.5D stacking) or no high-speed memory (3D stacked) at all is not quite clear.

I have seen another rumor of higher memory channel count (12?).

So it seems the major thrust may be on bigger L3s, and perhaps improved bandwidth to each individual CCD.

DDR5 should add more bandwidth, and a possibly higher number of memory channels could add even more (rough numbers below). And L3 could address the latency.

That is, unless AMD has something additional up its sleeve. There are some rumors regarding the MCD (Memory Cache Die for RDNA 3). I wonder if any of it might be applicable to Zen 4.
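For the bandwidth point above, the raw math is simple; the 12-channel count and DDR5-4800 speed are rumored figures, not confirmed specs:

```python
def peak_bw_gbs(channels, mega_transfers, bytes_per_channel=8):
    """Peak theoretical bandwidth for 64-bit (8-byte) DDR channels, in GB/s."""
    return channels * mega_transfers * 1e6 * bytes_per_channel / 1e9

print(peak_bw_gbs(8, 3200))   # Milan, 8ch DDR4-3200:          ~204.8 GB/s
print(peak_bw_gbs(12, 4800))  # rumored Genoa, 12ch DDR5-4800: ~460.8 GB/s
```

That would be a bit more than a 2x jump in peak per-socket bandwidth, which is roughly what a large core-count increase would need just to keep per-core bandwidth moving forward.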
 

Joe NYC

Diamond Member
Jun 26, 2021
3,650
5,189
136
Or...

They make one larger CCD that has two 8-core CCXs with their L3 caches aligned such that they share a common long, central axis. The vias can be placed in the middle, like with Zen 3, and a single V-Cache die can be constructed to align with that axis. That would allow a single cache die to stack on the CCD and connect to both CCX units.

I actually suggest that they could design these high-density CCDs with half the L3 per CCX, at 16MB. Then, a four-high stack of V-Cache can be placed on it with four layers of 32MB of cache over each CCX, giving 144MB of cache for each eight-core CCX. That's still plenty.

I have a feeling it will more likely go in the opposite direction - enlarging the L3 area, not shrinking it.

For SRAM there is no yield hit from a larger die. And a higher potential L3 capacity could leave room for custom configurations for clients who want extra-large L3s.

As for the area lost under the L3 - perhaps AMD will find some good use for it.
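Either way, the capacity math is easy to play with; the stack heights and per-layer sizes below are just the numbers from the quoted proposal plus the known Zen 3 V-Cache configuration:

```python
def l3_per_ccx_mb(base_mb, layers, mb_per_layer):
    """Total L3 seen by one CCX: its base L3 plus stacked V-Cache layers."""
    return base_mb + layers * mb_per_layer

print(l3_per_ccx_mb(16, layers=4, mb_per_layer=32))  # quoted proposal: 144 MB per 8-core CCX
print(l3_per_ccx_mb(32, layers=1, mb_per_layer=64))  # Zen 3 + V-Cache today: 96 MB per CCX
```

A bigger base L3 with fewer layers and a smaller base L3 with a taller stack can land in the same ballpark; the difference is how much of the capacity sits on the expensive base die versus the stacked SRAM.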
 

Joe NYC

Diamond Member
Jun 26, 2021
3,650
5,189
136
I guess we are going to see some possibly exotic solutions, but probably not for a while. Although, I may have said that in the past about AMD and was wrong. I think they might be able to do at least 2 layers by just using well binned devices at lower clocks. If they can keep it down to 4 or maybe 6 die stacks, then they could use LSI or other stacking tech for connecting CCD stacks to the IO die. I was expecting the initial version of Genoa to use serial connections, but it is not very power efficient with another doubling of speed. Using LSI would make sense, but it does require that the chips are adjacent, so it limits the number of CCD. The 6 stack solution is still asymmetric with respect to the 4 IO die quadrants, but they already make such devices. Perhaps the 128-core device comes a bit later and uses 4 high stacks with some exotic cooling.

I agree, probably not for a while for exotic cooling solutions. And also, probably not likely for logic on logic stacking this time around.

AMD has a clear field to run quite far with the existing approach, 3D stacked L3, before taking on more complexity with exotic cooling methods or Logic on Logic stacking.

I am wondering if AMD could just push the envelope on interconnect just a bit in every direction:
- lower voltage
- perhaps increased width
- increased frequency

To squeeze out just enough to maintain the current low-cost MCM architecture.

And perhaps add some more exotic methods to the next half iteration with Bergamo.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
I agree, probably not for a while for exotic cooling solutions. And also, probably not likely for logic on logic stacking this time around.

AMD has a clear field to run quite far with the existing approach, 3D stacked L3, before taking on more complexity with exotic cooling methods or Logic on Logic stacking.

I am wondering if AMD could just push the envelope on interconnect just a bit in every direction:
- lower voltage
- perhaps increased width
- increased frequency

To squeeze out just enough to maintain the current low-cost MCM architecture.

And perhaps add some more exotic methods to the next half iteration with Bergamo.
Do they need to? Forgive me if I've got the details wrong here, but I was under the impression the core layers are flipped upwards in the 3D Stacked Zen 3 processors to allow for better heat distribution and cooling compared to regular Zen 3 chiplets.

I've heard a rumor of HBMe for Zen 5, but again that's a rumor. It's simpler to take a stab in the dark here with AMD and maybe be 33% correct. I almost miss AMD's wild claims and falling short by 80% of them now.
 
  • Like
Reactions: Tlh97

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
Perhaps AMD is chasing more than just power usage reduction by using TSMC for the entirety of their mainstream, HEDT and enterprise stack.

They're not going to use GF to fab their chiplets or anything outside of the low-end Athlon chips that may not even be economical to produce on TSMC's more expensive wafers, but for the reasons I've outlined previously it doesn't make a lot of sense to expect an IO die on TSMC's 7nm or some enhanced version of it, at least not for Zen 4. The IO dies don't benefit nearly as much from a node shrink, and we also know (as has been reported via press release) that AMD is committed to buying a lot of wafers from GF over the next few years.
 
  • Like
Reactions: Tlh97

jpiniero

Lifer
Oct 1, 2010
16,810
7,253
136
Are they going to be able to sell however many Picasso dies they can get out of ~$1.6 billion in wafer purchases? Depending on the price (probably $3,000 - $4,000 based on common estimates) that's something near half a million wafers. That's an awful lot of Picasso dies, probably well over 100 million, maybe even closer to 150 million. There were some rumors about them using GF for the Athlon refresh which no doubt takes up some of those wafers, but probably not all of them.

You also have Rome and Milan sales/support. Would they price a theoretical Zen 3 12LP+ low enough to attract sales?
 

Joe NYC

Diamond Member
Jun 26, 2021
3,650
5,189
136
Stacked on the IO die would mean the IO die and chiplets are on the same node, right? TSMC doesn't do cross-node stacking, or is it that they don't do it yet?

First, I don't think there will be any logic stacking on top of the I/O die.

But TSMC does allow mixing nodes for stacking. This picture just shows when each node becomes eligible for the top and bottom of the stack.
[Chart: TSMC node availability for the top and bottom dies of a stack]
 
  • Like
Reactions: Tlh97

Joe NYC

Diamond Member
Jun 26, 2021
3,650
5,189
136
Perhaps AMD is chasing more than just power usage reduction by using TSMC for the entirety of their mainstream, HEDT and enterprise stack.

We don't know exactly where the I/O overhead lies, but if a big part of it is in the I/O die, then perhaps that would be part of the motivation.

Stacking on top of the I/O die might also be simpler with TSMC, and if the GPU resides in the I/O die, it could be a beefier and more efficient GPU.

I think the long-term objective might be to move APUs to the chiplet era, and using TSMC could bring AMD closer to that goal.
 

Joe NYC

Diamond Member
Jun 26, 2021
3,650
5,189
136
Why build an IO die on N6 though when you don't see much size reduction (remember the physical interfaces always take up the same amount of space)? Given that AMD already can't get enough wafers to satisfy all of the demand they're seeing, it seems bizarre to go down that route.

There will be a glut of N6 capacity by the time Zen 4 comes out. The mobile hordes are already moving from N7 to N5, and we are still at least 6 months away from first wafer starts for Zen 4 and its IO die.

By then, the mobile customers of TSMC will not only be gone from N7/N6, they will also be in their seasonal low periods.

Also pairing an IO die with such low-end GPU capabilities seems pointless outside of ensuring that everything now has some minimal onboard video. If you want something more powerful then you need yet another piece of silicon. Are they going to make another IO die with 8 - 12 CU?

A separate chip is a goal I bet AMD is aspiring to, but it would mean another communication link (from this GPU chiplet to the I/O die), with its power and latency overhead.

Big OEMs like to have some base video.

Whether AMD tries to go beyond that - remains to be seen.

If you wanted to make a lowest viable CU product, just make it part of a monolithic die. There were some other rumors about AMD doing an Athlon refresh on a newer node at Global Foundries, so it would seem odd to duplicate that using a far more expensive TSMC node.

But if AMD is going to keep re-using the CCD between desktop and server, with the potential for multiple CCDs per package, putting the graphics on the CCD is a non-starter.
 
  • Like
Reactions: Tlh97

Joe NYC

Diamond Member
Jun 26, 2021
3,650
5,189
136
That image looks like the sample Lisa Su showed of the 3d cache CPU where there's the IO die and then 2 chiplets, one with 3d cache and one without to show the difference.

Both chiplets had the 3D cache; one chiplet was just partially uncovered.