Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

eek2121 · Jun 19, 2021

uzzi38 said:
Rembrandt's iGPU too wide to just be designed around the laptop market alone. Based off my 6700XT testing, I see absolutely no reason for AMD to go with 12CUs for mobile alone, as in the same 25W power budget for the GPU CUs alone, you could instead go with 8CUs (-33%) clocked 300MHz higher (+15%). At lower iGPU only power (AKA in more CPU heavy games or lower power budgets), this gap would measurably shrink.

Especially not as the iGPU will likely be enough to beat Intel ones out to 2023 or perhaps even later if MTL and LNL do actually come with Gen12 as previously rumoured.

Large APUs on the desktop are pretty much going to be necessary going forwards, as taping out small dGPUs becomes less and less cost cost effective going onwards from here.

Without doing something awfully strange like backporting new IP to 12LP+ or something that is.

You can't compare Vega to RDNA2. Especially when there is an 18% die shrink involved.

uzzi38 · Jun 19, 2021

eek2121 said:
You can't compare Vega to RDNA2. Especially when there is an 18% die shrink involved.

I didn't compare it to Vega?

Like, at all. I just said that the width of the iGPU doesn't seem balanced around laptops to me.

eek2121 · Jun 19, 2021

uzzi38 said:
I didn't compare it to Vega?

Like, at all. I just said that the width of the iGPU doesn't seem balanced around laptops to me.

A few thoughts:

Porting Zen 3 to 6nm would save anywhere from 10-18% in die space.

The 12 RDNA CUs + 8 Zen 3 cores is probably in a smaller die than Cezanne.

AMD likely wants more CUs for ML related tasks. Even Windows Defender uses GPU compute, if available.

Going with LPDDR5 means more memory bandwidth, so there will be a significant upgrade in gaming from the CU increase.

uzzi38 · Jun 19, 2021

eek2121 said:
A few thoughts:

Porting Zen 3 to 6nm would save anywhere from 10-18% in die space.

The 12 RDNA CUs + 8 Zen 3 cores is probably in a smaller die than Cezanne.

AMD likely wants more CUs for ML related tasks. Even Windows Defender uses GPU compute, if available.

Going with LPDDR5 means more memory bandwidth, so there will be a significant upgrade in gaming from the CU increase.

Rembrandt is 208mm^2.

They don't need more CUs for "ML tasks" (which they barely have a working software stack for already). Windows Defender using GPU compute isn't all that important either.

Memory bandwidth isn't the main reason why RDNA2 iGPUs far more effective at gaming than current gen ones. LPDDR5 is only a ~30% increase to mem b/w.

AAbattery · Jun 19, 2021

uzzi38 said:
Rembrandt is 208mm^2.

They don't need more CUs for "ML tasks" (which they barely have a working software stack for already). Windows Defender using GPU compute isn't all that important either.

Memory bandwidth isn't the main reason why RDNA2 iGPUs far more effective at gaming than current gen ones. LPDDR5 is only a ~30% increase to mem b/w.

According to Crucial, (and others agree), DDR5 will offer close to 2X the bandwidth of DDR4. Crucial's calculation is based off an increase of MT/s or frequency by 50%. If you assume an increase of only 30%, then the effective bandwidth is still ~70% higher than DDR4.

Even knowing this, I was only expecting 8 CUs in Rembrandt.

I just hope price and availability aren't terrible.

Mopetar · Jun 19, 2021

Adding additional CUs might make sense if they're going the route of leaning heavily on their FSR technology for laptops. In that market going for a high FPS super sample makes a certain amount of sense as even inexpensive laptops come with 1440p screens these days.

I think it's a bit disingenuous to market such a system as a 1440p gaming laptop, but it's going to sell well, especially when it's several hundred dollars cheaper than an actual 1440p gaming laptop.

jpiniero · Jun 19, 2021

LightningZ71 said:
The issue is that that end of the market doesn't want a dGPU.

Possibly for DIY, but OEMs love upselling dGPUs to that market.

LightningZ71 · Jun 19, 2021

Vega 7 and 8 iGPUs perform decently for 1080p gaming right now. Either you make the tradeoff for quality and set it to medium or low presets, or, where it's supported, set the rendering resolution at 720p with high or ultra quality and upscale with sharpening to 1080p for what usually is a decently playable frame rate.

As for bandwidth, "dual channel 128 bit LPDDR4-4266 is capable of pushing about 68GB of data per second. That's the fastest JEDEC config out there in the notebook world. DDR4-3200, which is the fastest desktop and SODIMM config that's officially supported is 51GB per second. DDR5 is CURRENTLY specced up to just over 100GB per second for "dual channel" 128 bit configurations (though that's not really currently commercially available), which is a 100% improvement on the desktop and a 50% increase for thin and light's that use soldered LPDDR4X at max spec. That's also about the same memory bandwidth as:

Radeon RX 550/560
XBOX One release edition EDRAM memory buffer
Nvidia GeForce 1050/ti

So, I dare say that an iGPU equipped APU, with 12 RDNA2 "CU" equivalents, with it's improved memory efficiency, will be capable of handling 1080p gaming on laptops and uSFF desktops just fine. With upscaling, it won't be bad at 1440p either. What it won't have is usable ray tracing.

soresu · Jun 19, 2021

xilli_fiberbit said:
some leaks stated that warhol was canceled but it's not the case

Almost certainly its the new V cache augmented Zen3.

It seems less likely now that there are any µArch based IPC enhancements to it as some rumored, probably entirely cache based.

soresu · Jun 19, 2021

jpiniero said:
If Navi 24 type products no longer make sense to design, that's fine with OEMs. They will just stick with 23 type products and better. Or buy nVidia.

I'd predict that if rumors correctly put Rembrandt as 12 CU and N24 as 16 CU then quite possibly it will be clockspeed that AMD uses to really differentiate them.

That way they can optimise Rembrandt for lower power devices and N24 for getting the most out of those 16 CUs possible with Raphael, assuming that is indeed the chip targeted for Raphael which seems pretty likely.

Mopetar · Jun 22, 2021

soresu said:
I'd predict that if rumors correctly put Rembrandt as 12 CU and N24 as 16 CU then quite possibly it will be clockspeed that AMD uses to really differentiate them.

That way they can optimise Rembrandt for lower power devices and N24 for getting the most out of those 16 CUs possible with Raphael, assuming that is indeed the chip targeted for Raphael which seems pretty likely.

Wouldn't there also be a pretty massive difference in memory bandwidth? The Navi 5300 (current low-end discrete part) has 168 GB/s even though it's only using a 96-bit bus because of the GDDR5 memory. That's a ~70% improvement over the suspected 100 GB/s that Rembrandt would have.

tomatosummit · Jun 22, 2021

Mopetar said:
Wouldn't there also be a pretty massive difference in memory bandwidth? The Navi 5300 (current low-end discrete part) has 168 GB/s even though it's only using a 96-bit bus because of the GDDR5 memory. That's a ~70% improvement over the suspected 100 GB/s that Rembrandt would have.

N24 also has infinity cache as well to further expand on that lead and help flatten any performance quirks that apus tend to have, I find the apu performance is wildly inconsistent across software and games not to mention never reaching it's potential outside of a desktop platform.
The other thing is that the apus are getting very expensive now, raven ridge and picasso being good value is in the past. Unless there's a 4core apu with full fat gpu at sub $200 it all ends up being very expensive for those few people who want mini pcs with palpable gpu performance.

Just to swing this thread back to zen4, has there been any information on the igpu in raphael?
It's expected to be n7/n6 so all the IP from the latest apus can be used, any guesses on sizes? can it be made around 100mm^2 or does the I/O require it to be closer to 150mm^2 which would mean there's a sizable space for a relatively large igpu.

eek2121 · Jun 22, 2021

tomatosummit said:
N24 also has infinity cache as well to further expand on that lead and help flatten any performance quirks that apus tend to have, I find the apu performance is wildly inconsistent across software and games not to mention never reaching it's potential outside of a desktop platform.
The other thing is that the apus are getting very expensive now, raven ridge and picasso being good value is in the past. Unless there's a 4core apu with full fat gpu at sub $200 it all ends up being very expensive for those few people who want mini pcs with palpable gpu performance.

Just to swing this thread back to zen4, has there been any information on the igpu in raphael?
It's expected to be n7/n6 so all the IP from the latest apus can be used, any guesses on sizes? can it be made around 100mm^2 or does the I/O require it to be closer to 150mm^2 which would mean there's a sizable space for a relatively large igpu.

No details, yet. We don’t even know if it will be a part of the IO die.

soresu · Jun 22, 2021

tomatosummit said:
Just to swing this thread back to zen4, has there been any information on the igpu in raphael?
It's expected to be n7/n6 so all the IP from the latest apus can be used, any guesses on sizes? can it be made around 100mm^2 or does the I/O require it to be closer to 150mm^2 which would mean there's a sizable space for a relatively large igpu.

A question is whether the Raphael GPU is in fact a whole separate die with it's own memory controller and IO etc, or whether it is basically just CU's with just an IF/IA port.

If we assume that the die is only going to use the DDR5 system memory anyways then it makes little sense to have a separate controller taking up space on the GCD.

The latter unified memory access option could mean it possibly having more CU's than we might expect with much of the uncore missing vs the other RDNA2 dies, especially if even on AM5 the Zen4 CCD's and hypothetical GCD end up sitting on top of an IOD/bridge/interposer rather than placed around it.

Shivansps · Jun 23, 2021

uzzi38 said:
As if that matters. OEMs are more than happy to throw a 3400G in a box and slap a gaming sticker on it - hell there's instances of some doing the same with the 3000G.

Do not understimate the 3400G, if it has dual channel DDR4-3200(it can support 3466 but its tricky to get it working right) it is a completely viable gaming rig, but you need to know how to use it, 1080P has to be avoided almost always, it is a mistake to go for 1080p, if you use 900p instead, you get a massive FPS boost, that allows for using higher quality presets with a image quality diference with 1080p that is almost impossible to see whiout a side by side comparison. 720p is were things get ugly, that has to be avoided as well.
The 3400G can also beat both the GT1030 and RX550 depending on the game, it was so good that AMD is still struggling to beat it with a considerable margin with APUs that has double the MSRP, double and better CPU cores and running way faster rams.

As i said many times, i used a 2200G for a year as my main pc whiout a dgpu, the last time i played Witcher 3 was with that 2200G, at 900p with a mix of ultra/high/med/low settings. I could not tell the diference from the RX 480 i had just a few weeks before. People that never used a Ryzen APU as its main pc has no idea of how good they actually are, because review sites suck at APU testing, because they never used one either.

Shivansps · Jun 23, 2021

LightningZ71 said:
QUOTE="jpiniero, post: 40527363, member: 281730"]
And the very low end IGP would be more than capable of doing that. It's gaming where it would be lackluster. That's where the upsell comes in.

But a tiny, low end iGPU doesn't do it for the whole market! Yes, for office machines, video players, and modest content creation, even the existing Vega iGPUs are fine. However, for anything north of 720p upscaled setups, they don't cut it for decent games at decent quality settings. The issue is that that end of the market doesn't want a dGPU. They want the smallest possible box that will do the job, preferably as quietly as possible. This is where a DDR5 fed iGPU could shine. The DDR5 spec currently tops out around the same memory bandwidth as an RX560 in a "dual channel" setup at top specs. RDNA2 is more efficient with memory bandwidth than the Polaris. Those RDNA "CU" equivalents will likely be more performant than the 560 as well. It is an easy stretch to believe that there won't be any real need for a dGPU in those systems for 1080p or 4K upscale situations unless the user wants to be competitive in an FPS, or they just want the highest possible quality. I know several people that do just fine gaming with 1050s and 560s, which I believe that these iGPUs will be comparable to.

The problem with using smaller iGPU there is that we are going to be in a awkward position were a more expensive APU is going to have lower (and probably by a lot if you go for something very small like 1-3CUs) performance than cheaper APU models.
Even some production software is going to slower whiout a dGPU, not only games. But thats way better than nothing.

LightningZ71 · Jun 23, 2021

You mean like the quad core Tiger Lake i7 G7 models have a vastly more capable iGPU than the more expensive 6 and 8 core H series parts?

It's an integrated package that's sold for a target audience and market segment. My only disappointment with AMD is not selling APUs targeted at the audience that would like a full-fat iGPU but doesn't need a fully enabled CPU side, like a quad core with all 8 CUs enabled on the gpu. But, that's a market decision.

Thibsie · Jun 23, 2021

LightningZ71 said:
You mean like the quad core Tiger Lake i7 G7 models have a vastly more capable iGPU than the more expensive 6 and 8 core H series parts?

It's an integrated package that's sold for a target audience and market segment. My only disappointment with AMD is not selling APUs targeted at the audience that would like a full-fat iGPU but doesn't need a fully enabled CPU side, like a quad core with all 8 CUs enabled on the gpu. But, that's a market decision.

It would be way easier with a modular approach to APUs.
Just adapt the GPU chiplet and that's it....

jpiniero · Jun 23, 2021

Thibsie said:
It would be way easier with a modular approach to APUs.
Just adapt the GPU chiplet and that's it....

I imagine AMD will do that eventually, but for desktop they would only realistically put the smallest one possible to upsell to dGPUs.

scineram · Jun 23, 2021

eek2121 said:
No details, yet. We don’t even know if it will be a part of the IO die.

The material GN shared indicated it is. I see no reason to doubt that. 12-16 CU maybe?

Thibsie · Jun 23, 2021

jpiniero said:
I imagine AMD will do that eventually, but for desktop they would only realistically put the smallest one possible to upsell to dGPUs.

Would they lose that much money ?
GPU partners would be in tears for sure but, not sure they'd lose that much. And it would free resources for CPUs/APUs.

jpiniero · Jun 23, 2021

Thibsie said:
Would they lose that much money ?
GPU partners would be in tears for sure but, not sure they'd lose that much. And it would free resources for CPUs/APUs.

Just the opposite. Sell more dGPUs, or encourage it.

LightningZ71 · Jun 23, 2021

Well, it would still be a "low end" product by GPU standards. I don't see Nvidia or AMD shedding many tears over that sort of competition to dGPUs as it seems that the low end of the market generates very little profit and they produce very little for it, outside of the OEM world, it seems.

Thibsie · Jun 23, 2021

LightningZ71 said:
Well, it would still be a "low end" product by GPU standards. I don't see Nvidia or AMD shedding many tears over that sort of competition to dGPUs as it seems that the low end of the market generates very little profit and they produce very little for it, outside of the OEM world, it seems.

Yeah, that's mostly how I see things.
Less boards to manufacture (the gpu cards) for the same performance: win for OEMs and maybe win for AMD.
This won't prevent sales from proper GPUs anyway.

jamescox · Jun 24, 2021

Vattila said:
On the question of silicon interposer/bridges for the "Zen 4" generation, AMD's paper by Sam Naffziger et al goes into detail about their considerations of using silicon interposer for the "Zen 3" generation, explaining why they ended up using IFOP (Infinity Fabric On Package) in the organic package substrate instead. It was primarily down to high cost and lack of reach. Relying on IFOP was quite a feat, though, and comes with its own drawbacks. In particular, there is not much room in the package for links so they had to be very creative, including co-design in the chiplets themselves (compromises/innovation in power delivery). Obviously, IFOP has higher latency. Energy per bit for IFOP is over an order of magnitude greater than links on silicon. And, 14% of the CCD die size is devoted to IFOP due to the large interconnect bump pitch. Surprisingly, IFOP isn't a bandwidth limiter though — it has ample bandwidth for the "Zen 3" generation, at least.

"3) Packaging Technology Decisions: AMD was among the first companies to commercially introduce silicon interposer technologies starting with the AMD Radeon™ R9 “Fury”
GPUs with high-bandwidth memory (HBM) in 2015 [16]. A natural question for our chiplet-based products is why we chose to use package substrate routing rather than the higher density interconnects enabled by silicon interposers. There are several factors that drove the decision to not use silicon interposers for our chiplet-based processors. The first is the communication requirements of our chiplets. With eight CCDs and eight memory channels, on average each chiplet’s IFOP only needs to handle approximately one DDR4 channel’s worth of bandwidth. Using DDR4-2933 as an example, a single channel would correspond to ~23.5 GB/s of peak bandwidth. Even accounting for some load imbalance across the CCDs, a single CCD’s IFOP would still be expected to observe no more than a few tens of GB/s of traffic, and in fact each link can support approximately 55GB/s of effective bandwidth. Point-to-point links in the package substrate routing layers are more than sufficient to handle this modest level of bandwidth. In contrast, a single HBM stack can deliver hundreds of GB/s of memory bandwidth, which far exceeds the capabilities of the organic package substrate, and this is why HBM-enabled GPU products need a higher-bandwidth solution such as silicon interposers [2][16][17]. The second factor against silicon interposers for our chiplet-based processors is the reach of the interposer-based interconnects. While interposers can provide great signal density for very high bandwidths, the lengths of the signals are limited and as such constrain the connections to edge-to-edge links. The reach of interposer-based interconnects can in principle be extended using wider metal routes and greater spacing between routes, but this would decrease the effective bandwidth per interface because fewer total routes could be supported for a fixed width of routing tracks. This argument also applies to silicon bridge technologies [12]. The next subsection describes the challenges of providing sufficient IFOP bandwidth across the package substrate. Figure 10 illustrates a hypothetical interposer-based processor design. The edge connectivity constraint would limit the architecture to only four CCDs, which would render the product concept to be far less compelling. Even if interconnect reach was not a limiting factor, the IOD and the eight CCDs would require so much area that the underlying interposer would greatly exceed the reticle limit (while a passive interposer does not contain any transistors, the metal layers are still lithographically created and therefore must stay within the same reticle field constraints). Figure 10 shows the placement where an additional CCD would have to be, which is both outside the boundary of a maximum-sized interposer and too far for the unbuffered interposer routes to reach while supporting required bandwidths. Recent advancements in silicon interposer manufacturing have enabled reticle stitching to create very large interposers [11], but such an approach would have been cost prohibitive for this market segment. Last, the silicon interposer itself adds more cost to the overall solution. A CCD with the twice the core count could have been used, but that would have resulted in lower yield and decreased configurability. For all these reasons, routing IFOP directly across the package substrate was chosen for this product family. The total area consumed by multiple chiplets is typically greater than a monolithic chip with equivalent functionality. While this could theoretically cause a corresponding increase in the overall package size, the size of the SP3 processor package used by AMD EPYC™ processors is primarily determined by the large number of package pins required to support the eight DDR memory channels, 128 lanes of PCIe plus other miscellaneous I/O, and all the power and ground connections."

Pioneering Chiplet Technology and Design for the AMD EPYC™ and Ryzen™ Processor Families: Industrial Product (computer.org)

I think the reasoning behind their decision strongly hints that silicon interposer/bridges will be coming as soon as reach and cost allow. However, there is a chance there might be low-risk and low-cost reasons to extend the current scheme, if they find it is sufficient for their performance targets. If they can elongate the package a little more, and there is room underneath the CCDs to route another IFOP link (or two), maybe they can fit another 4 (or 8) CCDs on the package, in a manner compatible with the ugly mock-ups shared by leakers. But the routing is already pretty cramped, as can be seen in this figure from the paper:

View attachment 45971

PS. Could we perhaps see 96-core "Genoa" using IFOP, and a 128-core follow-up using silicon interposer/bridges? Could they do that in the same socket?

Thanks. I have been looking for something like this for a while. This mostly fits the speculation in my previous post. I was assuming that the package size is limited by the number of pins and that giant interposers would be too expensive. This causes me to lean more towards initial Genoa being similar to Milan and Rome packages. The embedded silicon interconnect doesn’t really work with the 4 rows of 3 layout unless they use pass through bridges, which I thought might be a possibility, but probably unlikely.

The other idea I had was smaller die stacks connected with IFOP routing or local silicon interconnect. There would only be 4 of them, so edge connect with LSI would be doable. Basically, one quadrant of an Epyc IO die for each interposer. Making a die stack with a active interposer on the bottom utilizing micro bumb tech would make some sense. Micro-bump stacking would allow the use of a much older process tech. The physical layer interfaces do not scale well, so they could continue to make those on an older process and not waste 5, 6, or 7 nm production unnecessarily. Making the entire IO die on 6nm seems very wasteful. Then they could stack some cpu chiplets on top. If the 96 core rumors are true, then I guess it would fit 3 cpu die and perhaps an accessory chip that contains portions of the IO die that would scale well to a newer process. Some things like the unified memory controller and such would scale to 6 nm, just not the physical layer interfaces. The 128-core device that supposedly isn’t Genoa might be a second type of small interposer. This one might have room for up to 4 cpu die and some HBM stacks for very high-end systems.

Small interposers would be very modular and would allow scaling of cost with core count. All Epyc processors would need 4 interposers to get the full amount of IO though. I guess I am still leaning towards just IFOP again for Genoa after reading this paper. Perhaps they just make a stacked IO die to better use available wafers and the cpu die are still package mounted. They will already have done parts of the IO design in a modern process for the APUs though. Perhaps the 4 quadrants of the Epyc IO die will, in fact, be separate 6 nm chips that can be reused for desktop Ryzen, and even the next generation chipset.

Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Senior member

Platinum Member

Platinum Member

Platinum Member

Platinum Member

Member

Diamond Member

Lifer

Golden Member

Platinum Member

Platinum Member

Diamond Member

Member

Platinum Member

Platinum Member

Diamond Member

Diamond Member

Golden Member

Senior member

Lifer

Senior member

Senior member

Lifer

Golden Member

Senior member

Senior member