Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
820
1,456
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) with new memory support (likely DDR5).

Untitled2.png


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

jamescox

Senior member
Nov 11, 2009
644
1,105
136
I guess the point is why do this when you'd likely be able to sell all the Genoa N5 wafers AMD is willing to buy regardless. Then what do you do with Bergamo dies that don't cut it?
They could possibly be used as mobile parts, especially if they use a silicon bridge for the connection to the IO die.
 

maddie

Diamond Member
Jul 18, 2010
5,151
5,537
136
No IO die on mobile parts; those will always be monolithic due to power constraints
I disagree.

With 3D stacking you should be able to do everything electrically in as close to an identical manner as a monolithic die. Mobile, being a low power part, allows you a lot of design freedom, as thermals are the big barrier to stacking cores and other higher power sections. Mobile has a small fraction of the heat flux of desktop.
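To put rough numbers on that heat-flux gap (all four inputs below are ballpark assumptions on my part, not official AMD specs):

```python
# Rough heat-flux comparison: a mobile monolithic APU vs. a desktop CCD.
# Every figure here is a ballpark assumption, not an official number.
mobile_power_w = 15        # typical U-series sustained package power
mobile_area_mm2 = 180      # approx. Cezanne monolithic die size
desktop_power_w = 140      # approx. sustained package power, desktop part
desktop_area_mm2 = 80      # approx. a single Zen 3 CCD

mobile_flux = mobile_power_w / mobile_area_mm2      # W per mm^2
desktop_flux = desktop_power_w / desktop_area_mm2

print(f"mobile:  {mobile_flux:.2f} W/mm^2")
print(f"desktop: {desktop_flux:.2f} W/mm^2")
print(f"ratio:   {desktop_flux / mobile_flux:.0f}x")
```

With these assumptions the desktop CCD sees on the order of 20x the heat flux of the mobile die, which is the "small fraction" being argued here.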
 
  • Like
Reactions: Tlh97

Ajay

Lifer
Jan 8, 2001
16,094
8,112
136
Why do that when you could instead cut the size of the core even further by removing the TSVs and all of the cache tags for handling V-Cache instead?

The aim with Bergamo is still to save on die area as much as possible while retaining as much per-core performance for cloud workloads as possible.
You lost me a bit here. So are you saying keep ~32 MB of L3? I figured (incorrectly, I guess) that the V-Cache tags would be stored in the on-die L3$. I also didn't know the TSVs used up a significant amount of area (I figured at M1 the diameter would be very small). Is the current thinking still 2x 8-core CCXs on one CCD?
 

Joe NYC

Diamond Member
Jun 26, 2021
3,375
4,945
136
The timeframe makes it a possibility. Early 2023 launch for Bergamo is after qualification for N5-on-N5. I'm just not sure if they'd try to have the L3 be ONLY on the stacked die, because N5 ain't cheap. That would imply that EVERY Bergamo CCD uses V-Cache, but why spend the added cost of chip stacking and the extra N5 die if Zen 4c is supposed to be bare-bones cores? My opinion is that you'd want the much larger cache on a core that can actually leverage it, not vice versa.
SoIC.jpg

AMD is not going to waste money and N5 wafer supply on SRAM when N6/N7 can do the same job cheaper, and with much greater wafer availability.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,866
136
I disagree.

With 3D stacking you should be able to do everything electrically in as close to an identical manner as a monolithic die. Mobile, being a low power part, allows you a lot of design freedom, as thermals are the big barrier to stacking cores and other higher power sections. Mobile has a small fraction of the heat flux of desktop.

For what it's worth, leakers say that the Zen 5 APU (and maybe the following one) will be MCM.
 

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
I disagree.

With 3D stacking you should be able to do everything electrically in as close to an identical manner as a monolithic die. Mobile, being a low power part, allows you a lot of design freedom, as thermals are the big barrier to stacking cores and other higher power sections. Mobile has a small fraction of the heat flux of desktop.
All of the APUs from AMD have been monolithic. Low-power devices don't need more than 8 cores, and yes, you can stack 3D V-Cache on top of the monolithic die to boost IPC performance for both CPU and GPU at the same time. There is no need for an IO die, and there will be no need to exceed 8-core CCDs.
 

maddie

Diamond Member
Jul 18, 2010
5,151
5,537
136
All of the APUs from AMD have been monolithic. Low-power devices don't need more than 8 cores, and yes, you can stack 3D V-Cache on top of the monolithic die to boost IPC performance for both CPU and GPU at the same time. There is no need for an IO die, and there will be no need to exceed 8-core CCDs.
What I'm saying is that due to the low power of mobile, you can stack a lot more than just cache and still have a manageable heat flux.

A Zen core chiplet is ~80 mm^2. A Cezanne APU is >2X this. Rembrandt should be even bigger. When combined with nodes and libraries optimized for each sub-unit, I can believe savings in cost and gains in performance are possible.

You do something different when there are benefits to be gained, not just because the present isn't working.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
All of the APUs from AMD have been monolithic. Low-power devices don't need more than 8 cores, and yes, you can stack 3D V-Cache on top of the monolithic die to boost IPC performance for both CPU and GPU at the same time. There is no need for an IO die, and there will be no need to exceed 8-core CCDs.
Regardless of how many cores you think people need, the interface allowed by EFB tech (elevated fanout bridge) would be very low power. If they make a chiplet version for high end mobile devices, it would be nice to have a stack of HBM2E connected to the IO die with integrated gpu and then the cpu chiplet, possibly with some disabled cores, also connected by EFB. The connection to the cpu doesn’t actually need to be that fast since the gpu would likely be in or stacked on top of the IO die. I don’t think this interface would be too high of power consumption for mobile. Such a device would be expensive, but if anyone else wants to compete with Apple M1 Pro/Max level of devices, then that is basically what they need. I doubt that the regular Zen 4 cores will be as power efficient as Apple’s designs, so using Zen 4c might help.

They might have another variant that uses a similar low-power design with a few things tweaked for mobile, but it seems like it could work well as is once you have cost-effective chip stacking. It will depend on how expensive EFB is. HBM with massive interposers has just been too expensive for many consumer devices, but EFB may bring the cost down enough to make it an option. It would still be a high-end device though. Cheap options will likely remain monolithic, except perhaps some SoIC-type stacking.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Speaking about that.
Anyone know why different nodes can't be stacked with COW or is it not technical difficulties but time to qualify the process?
At the interface where the actual fusion takes place, does it matter what's in the rest of the bulk material?
They seem to use a different pitch for the TSVs. I don't know if that is a real limitation, since I would think that the more advanced process could use a larger TSV to be backward compatible. They do use a bit of die area; they are visible in die photos of Zen 3. I don't know if this is right; I have seen several different numbers for the TSV pitch:



“The first image shows the interconnect density between three different interconnect approaches. While AMD's new interconnect comes with a 9-micrometer (μm) pitch (distance between TSV), standard C4 packaging has a 130 μm pitch, and Microbump 3D comes with a 50 μm pitch.”

TSMC has talked about getting down to less than 1 micron pitch. Zen 3 supposedly has something like 25,000 TSV if someone wants to try to do the math.
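Taking the 9 μm pitch and the ~25,000 TSV count quoted above at face value (both are unofficial figures), the invited back-of-envelope math works out roughly as follows:

```python
# Back-of-envelope TSV array area, assuming each TSV occupies a
# pitch x pitch cell in a regular grid. Inputs are the unofficial
# figures quoted in the post above.
pitch_um = 9
tsv_count = 25_000
ccd_area_mm2 = 80                           # approx. Zen 3 CCD size

area_per_tsv_mm2 = (pitch_um * 1e-3) ** 2   # 81 um^2 per TSV cell
total_mm2 = tsv_count * area_per_tsv_mm2

print(f"TSV array area: {total_mm2:.2f} mm^2")
print(f"share of CCD:   {100 * total_mm2 / ccd_area_mm2:.1f}%")
```

So even if both numbers are only in the right ballpark, the TSVs would eat a couple of percent of the CCD, consistent with "a bit of die area" rather than a major cost.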
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
You lost me a bit here. So are you saying keep ~32 MB of L3? I figured (incorrectly, I guess) that the V-Cache tags would be stored in the on-die L3$. I also didn't know the TSVs used up a significant amount of area (I figured at M1 the diameter would be very small). Is the current thinking still 2x 8-core CCXs on one CCD?
No, the L3 is still smaller. Look, the goal is to retain as much per-core performance as possible for cloud workloads, whilst trimming down the size of each core enough that it becomes feasible to fit 128 cores in the kind of space we could fit 96 cores in before.

In the cloud, larger L2s and smaller L3s are preferred. It makes no sense to stack L3 cache on top as V-Cache on Bergamo to make up for what's missing on die; it makes more sense to cut down the L3, remove the TSVs and cache tags needed to handle all of that extra cache from the die, and then beef up the L2 instead. That would get you the same per-core performance in these specific workloads in a smaller area.
 
  • Like
Reactions: RnR_au and Tlh97

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
No, the L3 is still smaller. Look, the goal is to retain as much per-core performance as possible for cloud workloads, whilst trimming down the size of each core enough that it becomes feasible to fit 128 cores in the kind of space we could fit 96 cores in before.

In the cloud, larger L2s and smaller L3s are preferred. It makes no sense to stack L3 cache on top as V-Cache on Bergamo to make up for what's missing on die; it makes more sense to cut down the L3, remove the TSVs and cache tags needed to handle all of that extra cache from the die, and then beef up the L2 instead. That would get you the same per-core performance in these specific workloads in a smaller area.
Feasible. That being said, I'd be kinda shocked if AMD put even more than the 1 MB/core L2 into the c variant, which the regular Zen 4 core will already boast.
 

Joe NYC

Diamond Member
Jun 26, 2021
3,375
4,945
136
No, the L3 is still smaller. Look, the goal is to retain as much per-core performance as possible for cloud workloads, whilst trimming down the size of each core enough that it becomes feasible to fit 128 cores in the kind of space we could fit 96 cores in before.

In the cloud, larger L2s and smaller L3s are preferred. It makes no sense to stack L3 cache on top as V-Cache on Bergamo to make up for what's missing on die; it makes more sense to cut down the L3, remove the TSVs and cache tags needed to handle all of that extra cache from the die, and then beef up the L2 instead. That would get you the same per-core performance in these specific workloads in a smaller area.

If 4c uses the mobile rather than the high-performance variant of N5, and doesn't even strive for high clock speeds, then who knows, it might be possible to put V-Cache over the entire chip?

Or, another alternative: if AMD can get something similar to the MCDs rumored to be in RDNA3 as a bridge between the CCD and IO die, which would include L3, then there would not have to be any L3 on the base CCD, and AMD could easily fit 16 cores per CCD, with die savings achieved by moving L3 off of the expensive N5 die.

One of the benefits of partitioning is being able to use optimal and most cost effective process for each chiplet / component, which would be N5 for CCDs and N6 for SRAM.

As far as 4c being for "cloud", I think the naming may be a little bit of a gimmick. I think AMD will deploy the core widely, including in APUs for notebooks.
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
If 4c uses the mobile rather than the high-performance variant of N5, and doesn't even strive for high clock speeds, then who knows, it might be possible to put V-Cache over the entire chip?

"Let's optimise for cost efficiency and then just throw twice the silicon on top."

Do I even need to point out how redundant that is?

Or, another alternative: if AMD can get something similar to the MCDs rumored to be in RDNA3 as a bridge between the CCD and IO die, which would include L3, then there would not have to be any L3 on the base CCD, and AMD could easily fit 16 cores per CCD, with die savings achieved by moving L3 off of the expensive N5 die.

They're also stacked via SoIC, so eh, been through that already.

As far as 4c being for "cloud", I think the naming may be a little bit of a gimmick. I think AMD will deploy the core widely, including in APUs for notebooks.

Oh it absolutely is, but I don't think you've thought through this idea all that well. Why would you want to stack cache on the L3 of the little cores in particular? It would make more sense to either:

1. Only stack on the big cores' cache

2. Create a separate system level cache and stack on that instead.

Unless you're suggesting little-core-only products as budget-oriented solutions that you stack additional SRAM on top of. Surely you must realise that's just a tad bit silly, eh?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,786
136
To combat the SPR CPUs with HBM, Genoa will very likely have multiple stacks of V-Cache.
4 stacks would be enough to match the on-package HBM in cache mode.
Power, BW and latency (roughly around 18-20 ns, using Zen 3 estimations and lower frequency) should be way better for V-Cache though.
But since only N5-on-N5 is seen on TSMC's stacking roadmap, I'm not sure how they will leverage the older N7 node.

I guess this is the time to add an SLC on top of the IOD. N7 on N7.
The IOD is a real mystery.

But I am rather more excited about the Zen 4c cores; these should go in a laptop.
Putting high-power Zen 4 cores there is not good if you are already constrained in frequency and thermals.
Might as well target 3.8 GHz max to begin with and get all that efficiency and density, instead of targeting 4.4 GHz but sustaining it for only 10 seconds, for example.
I will see if Rembrandt is efficient enough; otherwise I will wait for a Zen 4 laptop.
 

Bigos

Member
Jun 2, 2019
199
515
136
Isn't single-thread performance rather important in notebooks? I don't see how AMD can use frequency-constrained parts in this segment, unless they will pair these with a few high-frequency cores like middle/big cores on ARM SoCs (before Cortex X1 was a thing).

Unless we are talking about budget options, but I don't think AMD wants to invest too much design money in these parts.
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
Isn't single-thread performance rather important in notebooks? I don't see how AMD can use frequency-constrained parts in this segment, unless they will pair these with a few high-frequency cores like middle/big cores on ARM SoCs (before Cortex X1 was a thing).

Unless we are talking about budget options, but I don't think AMD wants to invest too much design money in these parts.
3.8GHz Zen 4 would still be somewhere in the region of ~1400pts on GB5 I believe (give or take 100pts), which you'll probably note is above the performance of even the Cortex-X2 (~1200pts), forget the X1.

I'd wager that's plenty for most notebook users.
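One way to sanity-check that figure: scale a Zen 3 result by the clock ratio and a guessed IPC uplift. Both the baseline score and the ~15% uplift below are assumptions for illustration, not leaks:

```python
# Crude GB5 single-thread estimate for 3.8 GHz Zen 4, scaled from an
# assumed Zen 3 baseline. Baseline score and IPC uplift are guesses.
zen3_score = 1650        # assumed GB5 ST for a ~4.9 GHz Zen 3 part
zen3_clock_ghz = 4.9
zen4_clock_ghz = 3.8
ipc_uplift = 1.15        # assumed ~15% IPC gain for Zen 4

estimate = zen3_score * (zen4_clock_ghz / zen3_clock_ghz) * ipc_uplift
print(f"estimated GB5 ST: {estimate:.0f}")   # lands in the ~1400-1500 range
```

Which is consistent with the ~1400 give-or-take-100 figure above, under those assumptions.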
 

DrMrLordX

Lifer
Apr 27, 2000
22,752
12,755
136
Wasn't one of the selling points of the stacked dies that different nodes could be used? Make the V-Cache on an older node like N6, since SRAM doesn't scale well anyhow.

Depends on how the stacked dice are connected, I guess? In the case of stacking an L3 die, it seems to require precise TSV alignment (and a relatively large number of TSVs to boot). Other types of chips might not be so picky. Just guessing!

I guess the point is why do this when you'd likely be able to sell all the Genoa N5 wafers AMD is willing to buy regardless.

Because some cloud provider is probably asking for it, the same way they want Altra MAX instead of Altra (from Ampere).
 
  • Like
Reactions: Tlh97 and Joe NYC

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,786
136
Isn't single-thread performance rather important in notebooks? I don't see how AMD can use frequency-constrained parts in this segment, unless they will pair these with a few high-frequency cores like middle/big cores on ARM SoCs (before Cortex X1 was a thing).
Depends on what the achievable target with Zen 4 is.
For an 8-core CPU, will 3.8 GHz Zen 4 at <14W have perf equivalent to 4.4 GHz Zen 3 at 28W?
If they can do 4 GHz+ at 15W, then why not. If it takes 28W to do 4 GHz+, then no thanks for me.
I will take the <14W CPU and spare myself the throttling, especially since I am not a fan of carrying bigger laptops, as I travel a lot.
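As a quick gut-check on that break-even question: with an assumed ~15% IPC uplift for Zen 4 (a guess, not a confirmed figure), 3.8 GHz Zen 4 would land almost exactly on 4.4 GHz Zen 3 in per-core throughput:

```python
# At what Zen 3 clock would 3.8 GHz Zen 4 land, assuming a ~15% IPC
# uplift? The uplift figure is an assumption for illustration only.
zen4_clock_ghz = 3.8
ipc_uplift = 1.15

zen3_equiv_ghz = zen4_clock_ghz * ipc_uplift
print(f"Zen 3-equivalent clock: {zen3_equiv_ghz:.2f} GHz")  # ~4.37 GHz
```

So under that assumption the <14W part would sit right at the 4.4 GHz Zen 3 level, before even counting sustained-clock advantages.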

That is why I am saying the 4c core is interesting.
Because we know AMD pushed N7 to the limits to get the performance.
What if they scaled back just a little on the frequency target? As per TSMC, ~4.2 GHz is still right in the sweet spot of the shmoo curve for N5.
In the end, the frequency for the 4c core in limited-core-count scenarios might not be that much lower, because they could sacrifice a little more power to operate outside its efficiency range for the additional boosts.
Whenever they get their big.LITTLE, it will be a different scenario for sure.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,801
1,528
136
3.8GHz Zen 4 would still be somewhere in the region of ~1400pts on GB5 I believe (give or take 100pts), which you'll probably note is above the performance of even the Cortex-X2 (~1200pts), forget the X1.

I'd wager that's plenty for most notebook users.

For sure. I'd guess closer to ~1550. And that's on Windows vs Android.
 

soresu

Diamond Member
Dec 19, 2014
3,939
3,371
136
3.8GHz Zen 4 would still be somewhere in the region of ~1400pts on GB5 I believe (give or take 100pts), which you'll probably note is above the performance of even the Cortex-X2 (~1200pts), forget the X1.

I'd wager that's plenty for most notebook users.
X3 will be announced months before Zen4 release too, and probably in phones before Zen4 is easily found in stock.

It's anyone's guess what Sophia-Antipolis has been cooking for A720 and X3, though.
 

eek2121

Diamond Member
Aug 2, 2005
3,391
5,014
136
Isn't single-thread performance rather important in notebooks? I don't see how AMD can use frequency-constrained parts in this segment, unless they will pair these with a few high-frequency cores like middle/big cores on ARM SoCs (before Cortex X1 was a thing).

Unless we are talking about budget options, but I don't think AMD wants to invest too much design money in these parts.

It is, but 2-4 “big” cores are all you actually need if there are sufficient small cores. This is one of the reasons I think Intel will have a great laptop offering with Alder Lake. If you can run a Blender render in the background and have the cores use 15W or so while your big cores remain available for bursty performance, it is a win/win IMO.

AMD has avoided mentioning a hybrid approach outright, but I think that is where the market is headed. Once it is mainstream, user experience will improve as a result. Encoding a video with a software codec could use little power. Same with 3D rendering, doing a virus scan, windows updates, etc.

The same thing applies to having decent GPU compute and/or AI units on every machine. There are use cases for a compute-only block outside of the GPU to do things like image/video scaling/enhancement, accelerated virus scans, speech/image recognition, etc. Modern x86 CPUs should all come with a separate, power-efficient, dedicated (don’t use the APU unless inactive) compute block for those types of tasks.
 

Joe NYC

Diamond Member
Jun 26, 2021
3,375
4,945
136
"Let's optimise for cost efficiency and then just throw twice the silicon on top."

Do I even need to point out how redundant that is?

No, I think it's more like "Let's beat Ampere", using whatever tools we need to get the job done.

They're also stacked via SoIC, so eh, been through that already.

Which would be AMD asset, not a liability, if AMD can do it earlier and better than others.

Oh it absolutely is, but I don't think you've thought through this idea all that well. Why would you want to stack cache on the L3 of the little cores in particular? It would make more sense to either:

1. Only stack on the big cores' cache

2. Create a separate system level cache and stack on that instead.

Unless you're suggesting little-core-only products as budget-oriented solutions that you stack additional SRAM on top of. Surely you must realise that's just a tad bit silly, eh?

I was thinking only of the "c" cores in some of the lower-end mobile-targeted APUs (Chromebooks, tablets), which could be based solely on "c" cores while sitting at the opposite end of the spectrum from the hyperscaler Bergamo CPUs.

I think the idea you have missed is that this core will be AMD's weapon to fight Arm.
 
  • Like
Reactions: BorisTheBlade82