Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila

Senior member
Oct 22, 2004
809
1,412
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3); the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely doubling to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) with new memory support (likely DDR5).

[Attached image: Untitled2.png, slide from the leaked presentation]


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

Mopetar

Diamond Member
Jan 31, 2011
8,114
6,770
136
It's on two different nodes - the IO die is presumably on N6 while the CPUs are on N5. Ideally the IGP would be a chiplet on a cheap node, but I don't think AMD wants to spend the effort backporting the RDNA2+ IP to some GloFo node.

This makes even less sense. Why would you put an IO die on N6 and make a graphics chiplet on an older node? The graphics logic pretty much all benefits from a die shrink whereas likely half of the IO die doesn't. Put the graphics chiplet on N6 and the IO die on the cheap node.
 

eek2121

Diamond Member
Aug 2, 2005
3,100
4,398
136
I don't think that's a particularly good argument, considering AMD also sells APUs which contain graphics. While a GPU isn't limited to gaming, anyone who needs one for the kinds of workloads GPUs excel at is going to buy a discrete card, because what's included with a CPU typically isn't enough for professional work.

Also, it would probably work considerably better to design separate circuitry for AI/ML tasks, as dedicated hardware will be better at that than offloading it to a GPU. Apple does this with the "neural engine" in their SoCs. Even Nvidia has special tensor cores in their GPUs to handle these tasks.

Finally, AMD is having a hard time keeping enough of their Zen 3 CPUs (which completely lack a built-in GPU) in stock to satisfy consumer demand. I really don't think they need to go tacking on something that not every consumer needs or wants just because the competition is doing it. If AMD stuck to what all of their competitors were doing, we wouldn't even have Zen in the first place.

Why would I buy a GPU for a virus scanner? What about encoding movies? What about photography work? There are several use cases for a GPU capable of GPGPU workloads. Windows Defender already uses the GPU for virus scanning. Expect Windows and other software to use the GPU for more workloads in the future. Smart compression? Encryption? Fast hashing? Video/photo upscaling?

Why should an end user have to buy a GPU?
 

jpiniero

Lifer
Oct 1, 2010
15,223
5,768
136
This makes even less sense. Why would you put an IO die on N6 and make a graphics chiplet on an older node? The graphics logic pretty much all benefits from a die shrink whereas likely half of the IO die doesn't. Put the graphics chiplet on N6 and the IO die on the cheap node.

Power savings? The graphics could use the power savings too; I just think the tradeoff for the cost savings could be worth it in some cases.

But not everyone needs an iGPU. Why should they be forced to buy something that they don't need or will never use?

Eventually I think AMD will separate the IGP out as a chiplet. When that will happen, I have no idea.

I could see the DIY-focused models having the IGP disabled for yield purposes.
 

Mopetar

Diamond Member
Jan 31, 2011
8,114
6,770
136
It may be a moot point anyway, since I think the main reason AMD is using N6 for the IO die is that that's the node the IP has been designed for.

That's just as well explained by an eventual APU (or other products, like SoCs or GPUs, that would also use those units) being designed for N6. An APU needs to incorporate all of the IO on-die, so there would be a need to have designs for all of that on the process as well. Rumors about someone doing design work for that on a particular node don't necessarily imply it's for an IO die.
 

moinmoin

Diamond Member
Jun 1, 2017
5,064
8,032
136
Including the iGPU on the IOD makes perfect sense for one reason: the GPU needs bandwidth first and foremost, and the memory controllers are part of the IOD. The bandwidth of the links to the individual chiplets, while sufficient for CPU cores, is rather lackluster; a link to a GPUlet would need to run through something other than the organic substrate to achieve the necessary higher bandwidth, likely whatever a future MCM-based GPU would also use.
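To put rough numbers on that gap, here is a sketch. The IFOP width-per-clock figures are commonly reported approximations rather than official specs, and the FCLK and dGPU comparison point are assumptions:

```python
# Rough comparison of an IFOP chiplet link against what even a small
# discrete GPU is used to. All figures are approximations for illustration.

FCLK_HZ = 1.6e9           # assumed fabric clock (1:1 with DDR4-3200)
IFOP_READ_BYTES = 32      # commonly reported bytes read per fabric clock
IFOP_WRITE_BYTES = 16     # commonly reported bytes written per fabric clock

ifop_read = FCLK_HZ * IFOP_READ_BYTES / 1e9    # ~51 GB/s
ifop_write = FCLK_HZ * IFOP_WRITE_BYTES / 1e9  # ~26 GB/s

SMALL_DGPU_BW = 224       # GB/s, e.g. a 128-bit GDDR6 card at 14 Gbps

print(f"IFOP read : {ifop_read:.0f} GB/s")
print(f"IFOP write: {ifop_write:.0f} GB/s")
print(f"Small dGPU has ~{SMALL_DGPU_BW / ifop_read:.1f}x the IFOP read bandwidth")
```

Even under these favorable assumptions, a single IFOP link delivers a fraction of what a low-end discrete card gets from its own memory, which is the crux of the argument above.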

Seems like it would make much more sense to target 7N for Zen5 IODs. Save EUV for the chiplets where it will make the most difference.
TSMC is encouraging all of its N7 customers to move to N6 going forward.
 

tomatosummit

Member
Mar 21, 2019
184
177
116
Including the iGPU on the IOD makes perfect sense for one reason: the GPU needs bandwidth first and foremost, ...

TSMC is encouraging all of its N7 customers to move to N6 going forward.

I agree the GPU should be on the IO die, but I really don't think it's going to be a high-performance part; 3-6 CUs would be more than adequate, so the need for extra bandwidth might be a moot point.
I think it's mostly to increase the target market for the high-performance Zen 4 desktop CPUs: OEMs would eat up being able to market 16-core business PCs, and you'd have the most powerful desktop-replacement laptops around, without relying on a discrete GPU.

Leaks have said for a while that the IO die will shrink; N7 versus N6 is neither here nor there in the grand scheme, given the shared production capability and design compatibility.
What I think will come with the shrink is left-over die space. People keep saying the PHY modules will not shrink, and that might set a minimum size for the die. If I weren't lazy, I'd take the Renoir die layout, remove the two CCXs and a bit of uncore, and see how it fits together after some rearrangement. Two (or three) IF links have to be added as well, which takes up more edge space on the already PHY-heavy IO die. Once all that is done, I think there'll be some left-over space in the middle that would suit a small iGPU.
I'll wait for someone to tell me PHYs can be inset now and I'm an idiot.
 

Mopetar

Diamond Member
Jan 31, 2011
8,114
6,770
136
The bandwidth of the links to the individual chiplets, while sufficient for CPU cores, is rather lackluster; a link to a GPUlet would need to run through something other than the organic substrate to achieve the necessary higher bandwidth, likely whatever a future MCM-based GPU would also use.

That makes a certain amount of sense. It's not necessarily a problem if more IF links are used for a GPU chiplet, but it will still ultimately be limited by the bandwidth between the IO die and system memory. DDR5 will increase that by a fair amount, and there's also the possibility of using Infinity Cache to help offset it. I'm sure the IFOPs for Zen 4/Zen 5 will need to be a bit beefier just to handle the additional memory bandwidth that DDR5 will provide.

Another possibility is adding a specialized link to allow higher transfer rates to the GPU. They've already talked about a 100 GB/s link that they've developed. I believe Navi 21 has two of those links, but they may be intended for professional cards only, since they haven't been talked about much. Of course, that's not necessarily ideal, because it adds a specialized link to the IO die that wouldn't always be used. However, it is rather close to the 112 GB/s of bandwidth that DDR5 will bring.

Still, that also creates a strong argument for just continuing to build a monolithic APU. Personally, I think it would be interesting if they could build something where the CPU and GPU share an exceptionally large L3 cache that can be provisioned as needed. That makes for the most efficient overall design from the perspective of minimizing power use, which for a laptop chip is always one of the most important design constraints.

Based on RDNA 1 cards, AMD seemed to have somewhere around 10 GB/s of memory bandwidth available per CU, at least for the non-OEM parts that didn't have gimped memory. Assuming future APUs/chiplets don't have any Infinity Cache to alleviate some of that, they probably wouldn't want more than 12 CUs in the design, depending on what clock speeds they're targeting. That aside, if they are targeting some kind of MCM design for GPUs in the future, there are probably some parts of the GPU that would do better on whatever they call the die that connects all of the GPU chiplets.
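A minimal sketch of that arithmetic: the DDR5 speed grades below are assumptions, and the 10 GB/s-per-CU figure is just the rule of thumb from the post:

```python
# Back-of-the-envelope CU budget for an iGPU fed from system memory,
# using the ~10 GB/s-per-CU RDNA 1 rule of thumb mentioned above.

def ddr_bandwidth_gbps(mt_per_s: int, channels: int = 2, bus_bits: int = 64) -> float:
    """Peak DRAM bandwidth in GB/s: transfers/s x channels x bytes per transfer."""
    return mt_per_s * channels * (bus_bits / 8) / 1e3

GBPS_PER_CU = 10  # approximate RDNA 1 bandwidth per compute unit

for speed in (4800, 5600, 7000):  # assumed DDR5 speed grades
    bw = ddr_bandwidth_gbps(speed)
    print(f"DDR5-{speed}: {bw:5.1f} GB/s -> ~{int(bw // GBPS_PER_CU)} CUs")

# Dual-channel DDR5-7000 gives the ~112 GB/s figure cited above,
# which lines up with the roughly-12-CU ceiling suggested in the post.
```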
 

Thibsie

Senior member
Apr 25, 2017
865
973
136
They don't. There are plenty of APUs that include a bare-bones GPU for people who don't need a lot of 3D graphical capabilities or just do some light processing. Anyone doing any kind of professional work is going to want a discrete GPU. Hell, even a casual user might be able to benefit from a discrete card if they spend a considerable amount of time using those capabilities.

But not everyone needs an iGPU. Why should they be forced to buy something that they don't need or will never use?

Mom and Dad don't need AVX, so why do they pay for it?
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Maybe I just misread what you wrote, but you basically made it sound like you were proposing the following:

N5: Zen chiplet
N6: IO die
??: GPU chiplet

It makes far more sense to use the following configuration:

N5: Zen chiplet
N6: GPU chiplet
??: IO die

To better illustrate why this is, here are some die annotations from @GPUsAreMagic on Twitter.

This first one is Zeppelin (Zen 1), which was made on the Global Foundries 14LP process:

[Annotated die shot: Zeppelin]


The next is Renoir (Zen 2), which was made on the TSMC 7nm process:

[Annotated die shot: Renoir]


Even without doing a detailed analysis, it's pretty easy to eyeball the core sizes and compare them to the DDR PHY. In the first image, it looks like about 2 of the Zen 1 cores (including the blue band identified as the L2 cache) would fit in the same physical area as the PHY. Now compare this with the second image (here the gold areas identified as "Core" contain the L2 cache), where you can fit at least 4 of those Zen 2 cores. It's pretty clear that the cores got a lot smaller thanks to the shrink from the 14nm GF node to the 7nm TSMC node. The size of the IO doesn't get a similar benefit.

For one more point of reference, here's the IO die used with Matisse (Zen 2), which I believe was made on the 12nm process from Global Foundries:

[Annotated die shot: Matisse IO die]


All of the parts annotated PHY are parts that don't really benefit from a process node shrink. An eyeball estimate puts them at somewhere slightly less than half of the die area.

Maybe in an ideal world where you can get as many wafers as you want, you put your IO die on the best node available just because it will have better power characteristics, and that's important, but we don't live in that reality. So why waste wafers on a chip that mostly contains parts that don't shrink well, rather than on another chip that scales far, far better?
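As a sketch of how much that matters, weight the shrink by how much of the die is PHY. All numbers below are illustrative assumptions, not measurements:

```python
# Why an IO-heavy die gains little from an expensive node shrink.
# Assumptions: PHY is ~45% of the die and shrinks only ~10%;
# logic/SRAM shrinks ~60% (roughly a 14nm-to-7nm class jump).

phy_fraction = 0.45
phy_shrink = 0.10      # PHY is pad/analog limited, barely scales
logic_shrink = 0.60    # logic scales well with the new node

io_die_new_area = (phy_fraction * (1 - phy_shrink)
                   + (1 - phy_fraction) * (1 - logic_shrink))
logic_die_new_area = 1 - logic_shrink

print(f"IO die shrinks to    {io_die_new_area:.0%} of its old area")     # ~63%
print(f"Logic die shrinks to {logic_die_new_area:.0%} of its old area")  # 40%
# The leading-edge wafers buy far more area reduction on the logic-heavy chip.
```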

Perhaps there's some really clever design possible, such as stacking a lot of 3D cache on top of part (or all) of that PHY, but I can't really speak to the feasibility of something like that. However, the only reason to move the IO die to N6 would be that they can't make anything else on those wafers, and given the utter shortage of GPUs right now, that's really hard to believe.
I haven't read all of this discussion, but I see several reasons why they would want to use N6. If they are doing any kind of stacking, including LSI (embedded silicon bridges) for connections to the CPU die or other components, then they almost certainly have to make both chips at TSMC. Stuff using interposers with micro-solder bumps can be mixed a bit more freely, but a lot of the TSMC tech probably has to be done completely in house.

The IO die isn't that large on 14/12 nm GF. The PHY will not shrink much, but there is a lot there that will. That could make it pad-limited, meaning the die has to be a certain size due to the bump pitch for connections to the substrate. That would allow them to add extra stuff, like a small GPU, extra caches, or other components, since they would have extra silicon that would otherwise go to waste. Having a tiny GPU with the media unit (hardware video decode and such) would be useful for a lot of products. Although most of those products would be served better by an APU anyway, so I am still a bit uncertain about including a GPU in the IO die.

For 6nm though, one other thing I can think of is that the IO die needs to clock a lot higher or be a lot wider to handle the increased clocks and bandwidth from DDR5 and/or PCI Express 5. It might be difficult to build such a device on 12nm GF, or whatever they have available. It is a huge jump in bandwidth, so doing it on an older process may also take way too much power.

There is some possibility that they will design the IO die to be modular, such that they can use the same chips for Epyc. Epyc needs a power reduction for the IO die, so going with the smaller process seems like a good idea. An integrated GPU wouldn't be needed there, so perhaps salvage parts could go to Epyc or be used as chipsets. Epyc would make use of possibly 4 of the IO dies, possibly connected by LSI. It doesn't seem like they would want to make a possibly over-300 mm² IO die on N6. Cranking out a bunch of much smaller IO dies with a lot of opportunity for salvage seems to make more sense (fully functional -> Epyc, not quite -> Threadripper, less functional -> Ryzen, really less functional -> chipset, which only needs PCI Express working).
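A toy illustration of that salvage flow; the tiers follow the post, but the block names and thresholds are invented for illustration:

```python
# Hypothetical binning of a modular IO die into product tiers,
# following the salvage order suggested in the post.

def bin_io_die(mem_ctrl_ok: int, cpu_links_ok: int, pcie_ok: int) -> str:
    """Map counts of working blocks to a product tier. Thresholds are invented."""
    if mem_ctrl_ok == 2 and cpu_links_ok == 2 and pcie_ok == 2:
        return "Epyc"          # fully functional quadrant
    if mem_ctrl_ok == 2 and pcie_ok == 2:
        return "Threadripper"  # tolerates a bad CPU link
    if mem_ctrl_ok >= 1 and pcie_ok >= 1:
        return "Ryzen"         # client parts need less IO
    if pcie_ok >= 1:
        return "chipset"       # only PCI Express has to work
    return "scrap"

print(bin_io_die(2, 2, 2))  # Epyc
print(bin_io_die(0, 0, 1))  # chipset
```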

There has been some talk of a 96-core Threadripper on Zen 4. If that is remotely accurate, then I suspect it must be a single layer of CPU dies; stacking multiple layers would be too much power. I am wondering if, instead of 4 IO-die quadrants in Epyc (possibly as 4 separate 6nm chips), we will actually get 6 IO-die "quadrants", or just 6 separate IO dies connected together. That would allow them to keep 2 memory channels, 2 CPU links, and 2 x16 PCI Express links per partition, but it would come with an increase to up to 160 PCI Express lanes per socket. I don't think we have any rumors about an increase in PCI Express links.
 
May 17, 2020
123
233
116
There is some possibility that they will design the IO die to be modular, such that they can use the same chips for Epyc. Epyc needs a power reduction for the IO die, so going with the smaller process seems like a good idea. An integrated GPU wouldn't be needed there, so perhaps salvage parts could go to Epyc or be used as chipsets. Epyc would make use of possibly 4 of the IO dies, possibly connected by LSI. It doesn't seem like they would want to make a possibly over-300 mm² IO die on N6. Cranking out a bunch of much smaller IO dies with a lot of opportunity for salvage seems to make more sense (fully functional -> Epyc, not quite -> Threadripper, less functional -> Ryzen, really less functional -> chipset, which only needs PCI Express working).
AMD actually has two versions of the IO die: a larger one for EPYC/Threadripper (which has an 8-channel DDR4 memory controller, more PCIe 4.0 lanes, ...) and another for Ryzen/X570. AMD should still have two versions of the IO die, because Genoa will have a 10-channel DDR5 memory controller and still more PCIe lanes than the IO die for Ryzen.
 

maddie

Diamond Member
Jul 18, 2010
4,881
4,951
136
These statements about the PHY blocks staying about the same size across node shrinks are a bit confusing to me. Surely they have shrunk over the years? Or is it being claimed that they are almost the same size as they were on, say, the 22nm or 28nm nodes?

I can see the scaling factor being smaller than for other logic or cache. In that case, does anyone know the value? For example, does it scale at half the rate of logic?
 

naukkis

Senior member
Jun 5, 2002
903
786
136
PHY is the physical interface of the IO; its size is set by what the physical implementation of that IO needs, and the manufacturing process won't change it as long as the logic required isn't bigger than the interface itself. Because of that, Intel had free space in their chipsets, so they implemented an iGPU to make use of it. The situation is probably still the same: Intel chips with fewer CPU cores have a bigger iGPU, because they can't scale the actual chip any smaller than the IO requires, and power delivery is part of that.
 

Mopetar

Diamond Member
Jan 31, 2011
8,114
6,770
136
Could respin Picasso to 12LP+ for cheap Chromebooks. Doubt they would make much, but it'd be a way to satisfy the agreement.

Are they going to be able to sell however many Picasso dies they can get out of ~$1.6 billion in wafer purchases? Depending on the price (probably $3,000-$4,000 per wafer, based on common estimates), that's something near half a million wafers. That's an awful lot of Picasso dies: probably well over 100 million, maybe even closer to 150 million. There were some rumors about them using GF for the Athlon refresh, which no doubt takes up some of those wafers, but probably not all of them.
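Checking that arithmetic with a quick sketch; the wafer price, die size, and yield are all estimates, not disclosed figures:

```python
# Rough sanity check of the wafer-agreement math in the post.
import math

COMMITMENT_USD = 1.6e9
WAFER_PRICE_USD = 3_500   # assumed midpoint of the $3,000-$4,000 range
wafers = COMMITMENT_USD / WAFER_PRICE_USD        # ~457k wafers

# Standard gross-dies-per-wafer approximation for a 300mm wafer,
# assuming a Picasso-class die of ~210 mm^2.
DIE_MM2 = 210.0
WAFER_MM = 300.0
gross = (math.pi * (WAFER_MM / 2) ** 2 / DIE_MM2
         - math.pi * WAFER_MM / math.sqrt(2 * DIE_MM2))   # ~290 dies

YIELD = 0.80              # assumed yield on a mature node
good_dies = wafers * gross * YIELD

print(f"~{wafers / 1e3:.0f}k wafers, ~{gross:.0f} gross dies each, "
      f"~{good_dies / 1e6:.0f}M good dies")   # lands just over 100M
```

With those assumptions the estimate comes out around 106 million good dies, consistent with the "well over 100 million" figure above.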
 

Mopetar

Diamond Member
Jan 31, 2011
8,114
6,770
136
These statements about the PHY blocks staying about the same size across node shrinks are a bit confusing to me. Surely they have shrunk over the years? Or is it being claimed that they are almost the same size as they were on, say, the 22nm or 28nm nodes?

I can see the scaling factor being smaller than for other logic or cache. In that case, does anyone know the value? For example, does it scale at half the rate of logic?

To some degree there's a limit to how much they can shrink, because they serve as a physical connection to some bus (memory, PCIe, USB, etc.), and those standards impose requirements on wire spacing and the like.

Someone did an analysis of the A13 and A14 SoCs, which use the TSMC 7nm and 5nm nodes respectively. It was noted that the LPDDR PHY did not shrink at all.
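A sketch of why that floor exists: the PHY has to present a fixed number of signals at a packaging-limited pitch, regardless of how small the transistors get. The signal count, pitch, and row count below are illustrative assumptions:

```python
# Why a DDR PHY barely shrinks: its footprint is set by the external
# interface, not the transistors. All numbers are illustrative.

SIGNALS_PER_CHANNEL = 120   # data, strobes, command/address, power/ground
BUMP_PITCH_UM = 150         # assumed flip-chip bump pitch
BUMP_ROWS = 4               # bumps staggered in a few rows along the die edge

edge_mm = SIGNALS_PER_CHANNEL / BUMP_ROWS * BUMP_PITCH_UM / 1000
print(f"Minimum 'beachfront' per channel: ~{edge_mm:.1f} mm")  # ~4.5 mm

# That edge length is dictated by the packaging and the DRAM standard,
# which is why the A13 -> A14 node jump left the LPDDR PHY the same size.
```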
 

Mopetar

Diamond Member
Jan 31, 2011
8,114
6,770
136
It appears that node scaling is rapidly approaching a point where it will be more efficient to take the SerDes power hit on APUs and split the I/O section onto a separate die on a larger, power-optimized process, with a CCD that is density- and performance-optimized.

Based on AT reporting, GlobalFoundries' 12LP+ seems like a good fit. The 40% power reduction seems to put it closer to 7nm in terms of power consumption than its other characteristics would suggest.
 
  • Like
Reactions: Lodix

A///

Diamond Member
Feb 24, 2017
4,351
3,158
136
Based on AT reporting, GlobalFoundries' 12LP+ seems like a good fit. The 40% power reduction seems to put it closer to 7nm in terms of power consumption than its other characteristics would suggest.
Perhaps AMD is chasing more than just power usage reduction by using TSMC for the entirety of their mainstream, HEDT and enterprise stack.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,535
3,466
106
Intel claims to be using HBM2e in HPC processors in late 2022, so I would expect AMD to have an HBM device as well. I don't know how competitive they will be if they just have massive SRAM caches; it did allow them to use a narrower memory interface on their GPUs, though. It is unclear how they would add HBM. If the CCDs are still connected by serialized IFOP links, then the HBM may be on top of the IO die or next to it using LSI. It might be interesting if they have split the IO die into 4 separate but identical chips. HBM2 is about 92 mm², and a 6nm "single quadrant" IO die might be about that size, so perhaps they just place one HBM stack on top of each chip. They could place 4 of these die stacks for an Epyc processor. With HBM2 die stacks being larger than CCDs, I don't think they would want to place HBM2 stacks next to the IO die; it would probably take a lot of space and limit routing under the stacks to the CCDs. Placing the HBM on top of the IO die(s) makes quite a bit of sense. I was wondering if they were going to use some number of smaller interposers, but that is seeming less likely.

According to some of the latest rumors, there will be no HBM for Zen 4.

Whether that means nothing supporting 2.5D HBM stacking, or no high-speed (3D-stacked) memory at all, is not quite clear.

I have seen another rumor of a higher memory channel count (12?).

So it seems the major thrust may be bigger L3 caches, and perhaps improved bandwidth to each individual CCD.

DDR5 should add more bandwidth, and a possibly higher channel count could add more still, while L3 could address the latency.

That is, unless AMD has something additional up its sleeve. There are some rumors regarding the MCD (Memory Cache Die) for RDNA 3; I wonder if any of that might be applicable to Zen 4.