Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
820
1,456
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

Joe NYC

Diamond Member
Jun 26, 2021
3,358
4,923
136
One wonders, however, if there will be a "Genoa X" at some future date. I suppose that would come after 5nm supply improves. Or perhaps they can stack 7/6nm cache on 5nm CCD?

I think it is almost certain there will be Genoa X and the SRAM will almost certainly be N6/N7, not N5.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,866
136
Are you implying it will be added in Genoa only to be removed in Bergamo? We know Genoa has AVX-512 support.
I suspect the smaller Zen4 cores in Bergamo will either have a much smaller L2 and larger L3, or they will strip down some of the cores, and use the neat little trick described in their “big.little” patent for the instructions the small cores don’t support. I seriously doubt the chip won’t support AVX-512 at all. Bergamo lands around the time Intel catches up on process, assuming no delays.

I am implying that there is a concrete possibility of AVX512 being removed in Bergamo for area/power savings, given that Bergamo is a cloud-optimized die with very high core density per die and lower power. How much die space does AVX512 take? IIRC it is not a trivial amount. Of course it is no longer 40% of the die like in the first implementation, but on a lean core like those of the Zen family it would not be small, even with a 2x256 approach (you'd need 512-bit registers anyway, and so on). And it would be replicated x8, x12 or x16, according to the number of cores in the Bergamo CCD. And it would take power. And it would be practically unused in the target market Bergamo addresses (high-density racks for web/cloud appliances). Having a much smaller L2 and a larger L3 depends on the performance/power balance. I can see L2 going back to 512 KB, but at that point a larger L3 offsets all the area savings from that, meaning you have less space for actual cores, which in a die made for high density is a bit... odd. Then, if AMD managed to create a low-area, low-power version of AVX512 that performs relatively well, yes, I could see it being in Bergamo. It is a big "IF", btw. Anyway, this is all speculation; we will see when there is more information about this from AMD itself.
 

gdansk

Diamond Member
Feb 8, 2011
4,343
7,292
136
Remove or half-rate? I can see the latter but AMD said same ISA as Zen 4. If Zen 4 can decode it then Zen 4c must too. Doesn't say anything about how much silicon they are using/wasting on making AVX512 fast.
 

yuri69

Senior member
Jul 16, 2013
672
1,202
136
@Genoa-X: Initial availability of TSMC's 5nm-on-5nm CoW (the tech for the Zen 4 V-Cache) is scheduled for Q3 2022. Given the timing of 7nm-on-7nm V-Cache for Zen 3, realistic mass availability & product launch would be late Q4 2022 at the earliest. This might be OK for Zen 4: it could still launch in 2022 w/ or w/o the cache.

@avx-512: Remember, AMD was not shy about cutting the Zen 2 FP RF for the consoles. The AVX-512 RF can be huge, so cutting that would fit Zen 4c.
 

moinmoin

Diamond Member
Jun 1, 2017
5,236
8,443
136
@avx-512: Remember, AMD was not shy about cutting the Zen 2 FP RF for the consoles. The AVX-512 RF can be huge, so cutting that would fit Zen 4c.
Those being custom designs, that's down to the console makers. That wasn't AMD calling the shots; it was Sony and Microsoft respectively.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,786
136
Remove or half-rate? I can see the latter but AMD said same ISA as Zen 4. If Zen 4 can decode it then Zen 4c must too. Doesn't say anything about how much silicon they are using/wasting on making AVX512 fast.

Not sure how Lisa Su's statement can be interpreted differently short of reading the PPR.
Unless she is lying.


At t=2114s
Bergamo is also socket compatible with Genoa with the same Zen4 instruction set
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,786
136
Maybe leoneazzurro doesn't consider AVX-512 a part of the ISA.
I understand ISA could be misconstrued as something like generic x86 ISA, but she explicitly says same instruction set

Funnily, this thought crossed my mind.
Bergamo is same Instruction set Architecture
Bergamo is same Instruction set
Bergamo is same Instructions
Bergamo is same
 

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,866
136
Maybe leoneazzurro doesn't consider AVX-512 a part of the ISA.

AVX512 is an x86 ISA extension. That means that when it is present, software may take advantage of it; when it is not present, programs generally still run, but at a reduced level of performance. But well, as I stated, my concerns are about the opportunity cost and about power savings. I have nothing against AVX512 per se. If AMD found a way to have it at low area cost (not cutting out other maybe more useful features and power savings), low power and reasonably performing, all the better.
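To illustrate the "runs either way, just slower" point, here is a minimal sketch of runtime dispatch on an ISA extension. It assumes a Linux-style /proc/cpuinfo text, where AVX-512 Foundation shows up as the `avx512f` flag; the function names and the two code-path labels are made up for illustration and are not from any real library.

```python
def has_avx512f(cpuinfo_text):
    """Return True if a Linux-style cpuinfo dump lists the avx512f flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like: "flags\t\t: fpu vme ... avx2 avx512f ..."
            flags = line.split(":", 1)[1].split()
            return "avx512f" in flags
    return False


def pick_code_path(cpuinfo_text):
    """Hypothetical dispatcher: wide vector path if available, else fallback."""
    return "avx512_path" if has_avx512f(cpuinfo_text) else "fallback_path"


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; return an empty string where it does not exist."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    print(pick_code_path(read_cpuinfo()))
```

The same binary works on both kinds of CPU; only the selected path (and so the throughput) changes.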
 

soresu

Diamond Member
Dec 19, 2014
3,937
3,368
136
If AMD found a way to have it at low area cost (not cutting out other maybe more useful features and power savings), low power and reasonably performing, all the better.
Waiting until 6 or 5nm would probably be no small advantage in that, much as with their delay to go fully 256 bit with Zen2.
 

Saylick

Diamond Member
Sep 10, 2012
3,945
9,204
136
Given what we know about the Zen 3 CCD, the L3 cache takes up just as much space as the cores themselves, so it seems obvious to me that the majority of the core-packing effort comes from removing as much L3 cache as possible and replacing that die space with more cores. AMD allegedly gets an optimized TSMC N5 process (whether it be N5P or some HPC variant of N5, doesn't matter) hence why they are quoting 2x transistor density over N7 (or possibly N7P). Cache doesn't scale as well as logic anyways, so I don't see why they couldn't do half-rate AVX-512 and still fit 16 Zen 4c on a slightly larger CCD if they take out half of the cache.

Put it this way:
- Zen 3 CCD = half cache, half cores = 1.0x Area
- Strip out half of the cache, so remaining cache is 0.25x Area
- Assume Zen 4c cores have 1.5x transistor count but 2x density
- Assume cache gets 1.2x density
- Zen 4c CCD = 2*(0.5 * 1.5 / 2) for 16 Zen 4c cores + 0.25/1.2 for L3 cache = 0.958x Area of Zen 3 CCD.

Yes, I am aware that there's also a bunch of IO and miscellaneous off-die links, but just wanted to do some napkin math to get a ballpark of feasibility.
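The napkin math above can be written as a small Python sketch so the assumptions are easy to vary. Every number is one of the post's assumptions (half cores / half cache split, 1.5x transistors per core, 2x logic density, 1.2x cache density), not a measured figure; the function and parameter names are made up for illustration.

```python
def ccd_relative_area(core_area=0.5, cache_area=0.5,
                      core_count_scale=2.0, core_transistor_scale=1.5,
                      logic_density=2.0, cache_kept=0.5, cache_density=1.2):
    """Estimate CCD area relative to a Zen 3 CCD (= 1.0), ignoring IO and links."""
    # 16 cores instead of 8, each with 1.5x the transistors, at 2x logic density
    cores = core_count_scale * core_area * core_transistor_scale / logic_density
    # Half of the original L3 kept, shrunk by the weaker SRAM scaling
    cache = cache_area * cache_kept / cache_density
    return cores + cache


if __name__ == "__main__":
    print(f"{ccd_relative_area():.3f}")                   # the post's assumptions
    print(f"{ccd_relative_area(logic_density=1.8):.3f}")  # slightly worse logic scaling
```

With the post's inputs this gives about 0.958x. Dropping the assumed logic density gain from 2x to 1.8x already pushes the estimate slightly above 1.0x, so the result is quite sensitive to that one assumption.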
 

eek2121

Diamond Member
Aug 2, 2005
3,390
5,014
136
Given what we know about the Zen 3 CCD, the L3 cache takes up just as much space as the cores themselves, so it seems obvious to me that the majority of the core-packing effort comes from removing as much L3 cache as possible and replacing that die space with more cores. AMD allegedly gets an optimized TSMC N5 process (whether it be N5P or some HPC variant of N5, doesn't matter) hence why they are quoting 2x transistor density over N7 (or possibly N7P). Cache doesn't scale as well as logic anyways, so I don't see why they couldn't do half-rate AVX-512 and still fit 16 Zen 4c on a slightly larger CCD if they take out half of the cache.

Put it this way:
- Zen 3 CCD = half cache, half cores = 1.0x Area
- Strip out half of the cache, so remaining cache is 0.25x Area
- Assume Zen 4c cores have 1.5x transistor count but 2x density
- Assume cache gets 1.2x density
- Zen 4c CCD = 2*(0.5 * 1.5 / 2) for 16 Zen 4c cores + 0.25/1.2 for L3 cache = 0.958x Area of Zen 3 CCD.

Yes, I am aware that there's also a bunch of IO and miscellaneous off-die links, but just wanted to do some napkin math to get a ballpark of feasibility.

That was my original thought. Reduced cache, maybe shared across the entire 16 core CCD?
 

Saylick

Diamond Member
Sep 10, 2012
3,945
9,204
136
That was my original thought. Reduced cache, maybe shared across the entire 16 core CCD?
I think we're going to be looking at two 8-core CCXs on the same Zen 4c CCD. So 8MB shared over 8 cores, and another 8MB shared over the other 8 cores. Same ring topology as Zen 3, but just a quarter of the cache per core (since the total cache is cut in half, and the number of cores is doubled).
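A quick check of the "quarter of the cache per core" arithmetic, as a sketch. The Zen 3 baseline (32 MB of L3 shared by 8 cores per CCD) is the shipping configuration; the Zen 4c layout is purely the guess from this post, and the helper name is made up.

```python
def l3_layout(total_l3_mb, cores, ccx_count):
    """Split a CCD's L3 evenly across its CCXs and report the per-core share."""
    return {
        "cores_per_ccx": cores // ccx_count,
        "l3_per_ccx_mb": total_l3_mb / ccx_count,
        "l3_per_core_mb": total_l3_mb / cores,
    }


zen3 = l3_layout(total_l3_mb=32, cores=8, ccx_count=1)
# Speculated Zen 4c CCD: half the total L3, double the cores, two CCXs
zen4c_guess = l3_layout(total_l3_mb=16, cores=16, ccx_count=2)

ratio = zen4c_guess["l3_per_core_mb"] / zen3["l3_per_core_mb"]

if __name__ == "__main__":
    print(zen3)
    print(zen4c_guess)
    print(ratio)
```

Under those assumptions each CCX gets 8 MB for 8 cores, i.e. 1 MB per core against Zen 3's 4 MB per core.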
 

dacostafilipe

Senior member
Oct 10, 2013
804
305
136
How about removing a big portion of the cache on the die to add it back via V-cache?

Placing the Vias could be an issue, but I find this idea kinda funny xD
 

Mopetar

Diamond Member
Jan 31, 2011
8,447
7,649
136
Is there anything about them using V-cache for those parts? Part of Zen 3's boost over Zen 2 came from their larger cache and the Zen 3D numbers show additional cache can benefit several applications.

If they can move a lot of the L3 cache to the stacked die it does allow for a much more densely packed core.
 

Saylick

Diamond Member
Sep 10, 2012
3,945
9,204
136
The timeframe makes it a possibility. Early 2023 launch for Bergamo is after Qualification for N5-on-N5. I'm just not sure if they'd try to have the L3 be ONLY on the stacked die because N5 ain't cheap. That would imply that EVERY Bergamo CCD uses V-cache, but why spend the added costs of chip stacking and the extra N5 die if Zen 4c are supposed to be bare-bones cores. My opinion is that you'd want the much larger cache on a core that can actually leverage it, not vice-versa.
 

jpiniero

Lifer
Oct 1, 2010
16,573
7,072
136
The timeframe makes it a possibility. Early 2023 launch for Bergamo is after Qualification for N5-on-N5. I'm just not sure if they'd try to have the L3 be ONLY on the stacked die because N5 ain't cheap. That would imply that EVERY Bergamo CCD uses V-cache, but why spend the added costs of chip stacking and the extra N5 die if Zen 4c are supposed to be bare-bones cores. My opinion is that you'd want the much larger cache on a core that can actually leverage it, not vice-versa.

Hmm... maybe it is just a 2x 8-core CCX with no L3 but V-cache on top of the logic, with really only transistor changes to mitigate heat issues and to maximize density.

It'd be a lot better if you could do N7 on N5.
 

Mopetar

Diamond Member
Jan 31, 2011
8,447
7,649
136
Wasn't one of the selling points of the stacked dies that different nodes could be used? Make the V-cache on an older node like N6 since SRAM doesn't scale well anyhow.

While not everything will benefit from cache, a lot will and the hypothetically slashed L3 suggested seems like it would hurt performance in a lot of cases. But stacking the cache is one way to reconcile a lower amount on the die.
 

maddie

Diamond Member
Jul 18, 2010
5,151
5,537
136
Hmm... maybe it is just a 2x 8-core CCX with no L3 but V-cache on top of the logic, with really only transistor changes to mitigate heat issues and to maximize density.

It'd be a lot better if you could do N7 on N5.
Speaking about that.
Anyone know why different nodes can't be stacked with CoW? Or is it not a technical difficulty but the time needed to qualify the process?
At the interface where the actual fusion takes place, does it matter what's in the rest of the bulk material?
 

uzzi38

Platinum Member
Oct 16, 2019
2,746
6,653
146
How about removing a big portion of the cache on the die to add it back via V-cache?

Placing the Vias could be an issue, but I find this idea kinda funny xD
Why do that when you could instead cut the size of the core even further by removing the TSVs and all of the cache tags for handling V-Cache instead?

The aim with Bergamo is still to save as much die area as possible while retaining as much per-core performance for cloud workloads as possible.
 

jpiniero

Lifer
Oct 1, 2010
16,573
7,072
136
I guess the point is why do this when you'd likely be able to sell all the Genoa N5 wafers AMD is willing to buy regardless. Then what do you do with Bergamo dies that don't cut it?