Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

Rigg · Nov 9, 2021

jpiniero said:
Then what do you do with Bergamo dies that don't cut it?

Use them as E cores for desktop SKU's??

jamescox · Nov 9, 2021

jpiniero said:
I guess the point is why do this when you'd likely be able to sell all the Genoa N5 wafers AMD is willing to buy regardless. Then what do you do with Bergamo dies that don't cut it?

They could be possibly used as mobile parts, especially if they use a silicon bridge for connection to the IO die.

nicalandia · Nov 9, 2021

jamescox said:
They could be possibly used as mobile parts, especially if they use a silicon bridge for connection to the IO die.

No IO die on Mobile parts, those will always be Monolithic due to power constrains

maddie · Nov 9, 2021

nicalandia said:
No IO die on Mobile parts, those will always be Monolithic due to power constrains

I disagree.

With 3D stacking you should be able to do everything electrically in as close to an identical manner as a monolithic die. Mobile, being a low power part, allows you a lot of design freedom, as thermals are the big barrier to stacking cores and other higher power sections. Mobile has a small fraction of the heat flux of desktop.

Ajay · Nov 9, 2021

uzzi38 said:
Why do that when you could instead cut the size of the core even further by removing the TSVs and all of the cache tags for handling V-Cache instead?

The aim with Bergamo is still to save on die area as much as possible while retaining as much per-core performance for cloud workloads as much as possible.

You lost me a bit here, so are you saying keep ~32Mb of L3. I figured (incorrectly I guess), the the V-Cache tags would be stored in on die L3$. I also didn't know the TSVs used up a significant amount of area (figured at M1, the diameter would be very small). Is the current thinking still 2 by 8 CCXs on one CCD?

Joe NYC · Nov 9, 2021

Saylick said:
The timeframe makes it a possibility. Early 2023 launch for Bergamo is after Qualification for N5-on-N5. I'm just not sure if they'd try to have the L3 be ONLY on the stacked die because N5 ain't cheap. That would imply that EVERY Bergamo CCD uses V-cache, but why spend the added costs of chip stacking and the extra N5 die if Zen 4c are supposed to be bare-bones cores. My opinion is that you'd want the much larger cache on a core that can actually leverage it, not vice-versa.

AMD is not going to waste money and wafer availability of N5 on SRAM, when N6/N7 can do the same job cheaper, and with much more available wafers.

leoneazzurro · Nov 9, 2021

maddie said:
I disagree.

With 3D stacking you should be able to do everything electrically in as close to an identical manner as a monolithic die. Mobile, being a low power part, allows you a lot of design freedom, as thermals are the big barrier to stacking cores and other higher power sections. Mobile has a small fraction of the heat flux of desktop.

For what is worth, leakers say that the Zen5 APU (and maybe the following) will be MCM.

nicalandia · Nov 9, 2021

maddie said:
I disagree.

With 3D stacking you should be able to do everything electrically in as close to an identical manner as a monolithic die. Mobile, being a low power part, allows you a lot of design freedom, as thermals are the big barrier to stacking cores and other higher power sections. Mobile has a small fraction of the heat flux of desktop.

All of the APUs from Amd have been Monolithic, low power devices don't need more than 8 cores, yes you can stack 3D V-Cache on top of the monolithic die to boost IPC performance for both CPU and GPU at the same time, there is no need for an IO, there will be no need to exceed 8 core CCDs

maddie · Nov 9, 2021

nicalandia said:
All of the APUs from Amd have been Monolithic, low power devices don't need more than 8 cores, yes you can stack 3D V-Cache on top of the monolithic die to boost IPC performance for both CPU and GPU at the same time, there is no need for an IO, there will be no need to exceed 8 core CCDs

What I'm saying is that due to the low power of mobile, you can stack a lot more than just cache and still have a manageable heat flux.

A zen core chiplet is ~ 80 mm^2. A Cezanne APU is > 2X this. Rembrandt should be even bigger. When combined with optimized nodes and libraries for each sub-unit, I can believe savings in cost and performance gains are possible.

You do something different when there are benefits to be gained, not just because the present isn't working.

jamescox · Nov 9, 2021

nicalandia said:
All of the APUs from Amd have been Monolithic, low power devices don't need more than 8 cores, yes you can stack 3D V-Cache on top of the monolithic die to boost IPC performance for both CPU and GPU at the same time, there is no need for an IO, there will be no need to exceed 8 core CCDs

Regardless of how many cores you think people need, the interface allowed by EFB tech (elevated fanout bridge) would be very low power. If they make a chiplet version for high end mobile devices, it would be nice to have a stack of HBM2E connected to the IO die with integrated gpu and then the cpu chiplet, possibly with some disabled cores, also connected by EFB. The connection to the cpu doesn’t actually need to be that fast since the gpu would likely be in or stacked on top of the IO die. I don’t think this interface would be too high of power consumption for mobile. Such a device would be expensive, but if anyone else wants to compete with Apple M1 Pro/Max level of devices, then that is basically what they need. I doubt that the regular Zen 4 cores will be as power efficient as Apple’s designs, so using Zen 4c might help.

They might have another variant that uses a similar low power design with a few things tweaked for mobile, but it seems like it could work well as is once you have cost effective chip stacking. It will depend on how expensive EFB is. HBM with massive interposers has just been too expensive for many consumer devices, but EFB may bring the cost down to make it an option. It would still be a high end device though. Cheap options will likely remain monolithic except perhaps som SoIC type stacking.

jamescox · Nov 9, 2021

maddie said:
Speaking about that.
Anyone know why different nodes can't be stacked with COW or is it not technical difficulties but time to qualify the process?
At the interface where the actual fusion takes place, does it matter what's in the rest of the bulk material?

They seem to use different pitch for the TSVs. I don’t know if that is a real limitation since I would think that the more advanced process could use larger TSV to be backward compatible. They do use a bit of die area; they are visible in die photos of Zen 3. I don’t know if this is right; I have seen several different numbers for the TSV pitch

AMD Unveils More Ryzen 3D Packaging and V-Cache Details at Hot Chips

Stacking the chips

www.tomshardware.com

“The first image shows the interconnect density between three different interconnect approaches. While AMD's new interconnect comes with a 9-micrometer (μm) pitch (distance between TSV), standard C4 packaging has a 130 μm pitch, and Microbump 3D comes with a 50 μm pitch.

TSMC has talked about getting down to less than 1 micron pitch. Zen 3 supposedly has something like 25,000 TSV if someone wants to try to do the math.

eek2121 · Nov 9, 2021

nicalandia said:
No IO die on Mobile parts, those will always be Monolithic due to power constrains

As others have stated, don’t bet on it. A refined IO die could drop the power down to as low as 1-2 watts. According to a leak, AMD is already playing with the idea of Zen 4 desktop having a mobile option.

uzzi38 · Nov 10, 2021

Ajay said:
You lost me a bit here, so are you saying keep ~32Mb of L3. I figured (incorrectly I guess), the the V-Cache tags would be stored in on die L3$. I also didn't know the TSVs used up a significant amount of area (figured at M1, the diameter would be very small). Is the current thinking still 2 by 8 CCXs on one CCD?

No, the L3 is still smaller. Look, the goal is to retain as much per-core performance as possible for cloud workloads, whilst trimming down the size of each core enough such that it becomes feasible to fit 128 cores in the kinds of space we could fit 96 cores before.

In the cloud larger L2s and smaller L3s are preferred. It makes no sense to stack L3 cache on top as V-Cache on Bergamo to make up for what's missing on die, it makes more sense to cut down the L3 and remove the TSVs and cache tags needed to handle all of that extra cache from the die, and then beef up the L2 instead. That would get you the same per-core performance in these specific workloads in a smaller area.

lobz · Nov 10, 2021

uzzi38 said:
No, the L3 is still smaller. Look, the goal is to retain as much per-core performance as possible for cloud workloads, whilst trimming down the size of each core enough such that it becomes feasible to fit 128 cores in the kinds of space we could fit 96 cores before.

In the cloud larger L2s and smaller L3s are preferred. It makes no sense to stack L3 cache on top as V-Cache on Bergamo to make up for what's missing on die, it makes more sense to cut down the L3 and remove the TSVs and cache tags needed to handle all of that extra cache from the die, and then beef up the L2 instead. That would get you the same per-core performance in these specific workloads in a smaller area.

Feasible. That being said, I'd be kinda shocked if AMD put even more than the 1MB/core L2 into the c variant, what the regular Zen 4 core will already boast.

Joe NYC · Nov 10, 2021

uzzi38 said:
No, the L3 is still smaller. Look, the goal is to retain as much per-core performance as possible for cloud workloads, whilst trimming down the size of each core enough such that it becomes feasible to fit 128 cores in the kinds of space we could fit 96 cores before.

In the cloud larger L2s and smaller L3s are preferred. It makes no sense to stack L3 cache on top as V-Cache on Bergamo to make up for what's missing on die, it makes more sense to cut down the L3 and remove the TSVs and cache tags needed to handle all of that extra cache from the die, and then beef up the L2 instead. That would get you the same per-core performance in these specific workloads in a smaller area.

If 4c uses mobile, rather than high performance variant of N5, and doesn't even strive for high clock speeds. Then, who knows if it might be possible to put V-Cache over the entire chip?

Or, another alternative, if AMD can get something similar to the MCDs rumored to be in RDNA3 as a bridge between the CCD and IO Die, which would include L3, then there would not have to be any L3 on base CCD, and AMD could easily fit 16 cores per CCD with die savings achieved by moving L3 off of the expensive N5 die.

One of the benefits of partitioning is being able to use optimal and most cost effective process for each chiplet / component, which would be N5 for CCDs and N6 for SRAM.

As far as 4c being for "cloud", I think the naming may be a little bit of a gimmick. I think AMD will deploy the core widely, including in APUs for notebooks.

uzzi38 · Nov 10, 2021

Joe NYC said:
If 4c uses mobile, rather than high performance variant of N5, and doesn't even strive for high clock speeds. Then, who knows if it might be possible to put V-Cache over the entire chip?

"Lets optimise for cost efficiency and then just throw twice the silicon on top".

Do I even need to point out how redundant that is?

Joe NYC said:
Or, another alternative, if AMD can get something similar to the MCDs rumored to be in RDNA3 as a bridge between the CCD and IO Die, which would include L3, then there would not have to be any L3 on base CCD, and AMD could easily fit 16 cores per CCD with die savings achieved by moving L3 off of the expensive N5 die.

They're also stacked via SoIC, so eh, been through that already.

Joe NYC said:
As far as 4c being for "cloud", I think the naming may be a little bit of a gimmick. I think AMD will deploy the core widely, including in APUs for notebooks.

Oh it absolutely is, but I don't think you've thought through this idea all that well. Why would you want to stack cache on the L3 of the little cores in particular? It would make more sense to either:

1. Only stack on the big cores' cache

2. Create a seperate system level cache and stack on that instead.

Unless you're suggesting little core only products as budget oriented solutions that you stack additional SRAM on top of. Surely you must realise that's just a tad bit silly, eh?

DisEnchantment · Nov 10, 2021

To combat the SPR CPUs with HBM probably Genoa will very likely have multiple stacks of Vcache.
4 stacks would be enough to match the on package HBM in cache mode.
Power, BW and latency (roughly around 18-20 ns using Zen3 estimations and lower frequency) should be way better for V Cache though
But since N5 on N5 only is seen on TSMC stacking roadmap, not sure how they will leverage the older N7 node.

I guess this is the time to add SLC on top of IOD. N7 on N7
The IOD is a real mystery.

But I am rather more excited about the Zen4c cores, this should go in a laptop.
Putting high power Zen4 cores is not good if you are already constrained in frequency and thermals.
Might as well target 3.8 GHz max to begin with and get all that efficiency and density instead of targetting 4.4GHz but sustaining it only for 10 seconds for example.
I will see if Rembrandt is efficient enough otherwise I wait for Zen4 laptop

Bigos · Nov 10, 2021

Isn't single-thread performance rather important in notebooks? I don't see how AMD can use frequency-constrained parts in this segment, unless they will pair these with a few high-frequency cores like middle/big cores on ARM SoCs (before Cortex X1 was a thing).

Unless we are talking about budget options, but I don't think AMD wants to invest too much design money in these parts.

uzzi38 · Nov 10, 2021

Bigos said:
Isn't single-thread performance rather important in notebooks? I don't see how AMD can use frequency-constrained parts in this segment, unless they will pair these with a few high-frequency cores like middle/big cores on ARM SoCs (before Cortex X1 was a thing).

Unless we are talking about budget options, but I don't think AMD wants to invest too much design money in these parts.

3.8GHz Zen 4 would still be somewhere in the region of ~1400pts on GB5 I believe (give or take 100pts), which you'll probably note is above the performance of even the Cortex-X2 (~1200pts), forget the X1.

I'd wager that's plenty for most notebook users.

DrMrLordX · Nov 10, 2021

Mopetar said:
Wasn't one of the selling points of the stacked dies that different nodes could be used? Make the V-cache on an older node like 6N since SRAM doesn't scale well anyhow.

Depends on how the stacked dice are connected, I guess? In the case of stacking an L3 die, it seems to require precise TSV alignment (and a relatively large number of TSVs to boot). Other types of chips might not be so picky. Just guessing!

jpiniero said:
I guess the point is why do this when you'd likely be able to sell all the Genoa N5 wafers AMD is willing to buy regardless.

Because some cloud provider is probably asking for it, the same way they want Altra MAX instead of Altra (from Ampere).

DisEnchantment · Nov 10, 2021

Bigos said:
Isn't single-thread performance rather important in notebooks? I don't see how AMD can use frequency-constrained parts in this segment, unless they will pair these with a few high-frequency cores like middle/big cores on ARM SoCs (before Cortex X1 was a thing).

Depends on what is the achievable target with Zen4.
For an 8 Core CPU, will 3.8 Ghz Zen4 at <14W have a perf equivalent to 4.4Ghz Zen3 at 28W?
If they can do 4GHz+ at 15W then why not. Otherwise doing 4GHz+ at 28W then no thanks for me.
I will take the <14W CPU and spare myself the throttling especially since I am not a fan of carrying bigger latops since I travel a lot.

That is why I am saying the 4c core is interesting.
Because we know AMD pushed N7 to the limits to get the performance.
What if they scaled back just a little on the frequency target, as per TSMC ~4.2 GHz is still right on the sweet spot of the Shmoo curve for N5
In the end it could be that the frequency for the 4c core in limited core count scenarious is not that much lower becuase they could sacrifice a little bit more power to operate out of its efficeincy range for the additional boosts.
Whenever they get their big.LITTLE it will a different scenario for sure.

HurleyBird · Nov 10, 2021

uzzi38 said:
3.8GHz Zen 4 would still be somewhere in the region of ~1400pts on GB5 I believe (give or take 100pts), which you'll probably note is above the performance of even the Cortex-X2 (~1200pts), forget the X1.

I'd wager that's plenty for most notebook users.

For sure. I'd guess closer to ~1550. And that's on Windows vs Android.

soresu · Nov 10, 2021

uzzi38 said:
3.8GHz Zen 4 would still be somewhere in the region of ~1400pts on GB5 I believe (give or take 100pts), which you'll probably note is above the performance of even the Cortex-X2 (~1200pts), forget the X1.

I'd wager that's plenty for most notebook users.

X3 will be announced months before Zen4 release too, and probably in phones before Zen4 is easily found in stock.

It's anyones guess what Sophia-Antipolis has been cooking for A720 and X3 though.

eek2121 · Nov 10, 2021

Bigos said:
Isn't single-thread performance rather important in notebooks? I don't see how AMD can use frequency-constrained parts in this segment, unless they will pair these with a few high-frequency cores like middle/big cores on ARM SoCs (before Cortex X1 was a thing).

Unless we are talking about budget options, but I don't think AMD wants to invest too much design money in these parts.

It is, but 2-4 “big” cores is all you actually need if there are sufficient small cores. This is one of the reasons I think Intel will have a great laptop offering with Alder Lake. If you can run a blender render in the background and have the cores use 15W or so while your big cores remain available for bursty performance, it is a win/win IMO.

AMD has avoided mentioning a hybrid approach outright, but I think that is where the market is headed. Once it is mainstream, user experience will improve as a result. Encoding a video with a software codec could use little power. Same with 3D rendering, doing a virus scan, windows updates, etc.

The same thing applies to having decent GPU compute and/or AI units on every machine.There are use cases for a compute only block outside of the GPU to do things like image/video scaling/enhancements, accelerated virus scans, speech/image recognition, etc. Modern x86 CPUs should all come with a separate, power efficient, dedicated (don’t use the APU unless inactive) compute block for those types of tasks.

Joe NYC · Nov 10, 2021

uzzi38 said:
"Lets optimise for cost efficiency and then just throw twice the silicon on top".

Do I even need to point out how redundant that is?

No, I think it's more like "Let's beat Ampere", using whatever tools we need to get the job done.

uzzi38 said:
They're also stacked via SoIC, so eh, been through that already.

Which would be AMD asset, not a liability, if AMD can do it earlier and better than others.

uzzi38 said:
Oh it absolutely is, but I don't think you've thought through this idea all that well. Why would you want to stack cache on the L3 of the little cores in particular? It would make more sense to either:

1. Only stack on the big cores' cache

2. Create a seperate system level cache and stack on that instead.

Unless you're suggesting little core only products as budget oriented solutions that you stack additional SRAM on top of. Surely you must realise that's just a tad bit silly, eh?

I was thinking only the "c" cores in some of the lower end mobile targeted APUs (Chromebooks, tablets) - which could be based solely on "c" cores, while being at the opposite end of the spectrum from hyperscaler Bergamo CPUs as you can get.

I think the idea you have missed is that this core will be AMD's weapon to fight Arm.

Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Senior member

Senior member

Senior member

Diamond Member

Diamond Member

Lifer

Diamond Member

Golden Member

Diamond Member

Diamond Member

Senior member

Senior member

Diamond Member

Platinum Member

Platinum Member

Diamond Member

Platinum Member

Golden Member

Senior member

Platinum Member

Lifer

Golden Member

Platinum Member

Diamond Member

Diamond Member

Diamond Member