Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

Joe NYC · Nov 9, 2021

gdansk said:
One wonders, however, if there will be a "Genoa X" at some future date. I suppose that would come after 5nm supply improves. Or perhaps they can stack 7/6nm cache on 5nm CCD?

I think it is almost certain there will be Genoa X and the SRAM will almost certainly be N6/N7, not N5.

leoneazzurro · Nov 9, 2021

eek2121 said:
Are you implying it will be added in Genoa only to be removed in Bergamo? We know Genoa has AVX-512 support.
I suspect the smaller Zen4 cores in Bergamo will either have a much smaller L2 and larger L3, or they will strip down some of the cores, and use the neat little trick described in their “big.little” patent for the instructions the small cores don’t support. I seriously doubt the chip won’t support AVX-512 at all. Bergamo lands around the time Intel catches up on process, assuming no delays.

I am implying that there is a concrete possibility of AVX512 being removed in Bergamo for area/power savings, given that Bergamo is a cloud-optimized die with very high core density per die and lower power. How much die space does AVX512 take? IIRC it is not a trivial amount. Of course it is no longer 40% of the die like in the first implementation, but on a lean core like those of the Zen family it would not be small even if you took a 2x256 approach to it (you'd need 512-bit registers anyway, and so on). And it would be replicated x8, x12, or x16 depending on the number of cores in the Bergamo CCD. And it would take power. And it would be practically unused in the target market Bergamo addresses (high-density racks for web/cloud appliances). Whether to have a much smaller L2 and a larger L3 depends on the performance/power balance. I can see L2 going back to 512 KB, but at that point a larger L3 offsets all the area savings from that, meaning you have less space for actual cores, which in a die made for high density is a bit... odd. Then, if AMD managed to create a low-area, low-power version of AVX512 that performs relatively well, yes, I could see it being in Bergamo. It is a big "IF", btw. Anyway, this is all speculation; we will see when there is more information about this from AMD itself.

gdansk · Nov 9, 2021

Remove or half-rate? I can see the latter but AMD said same ISA as Zen 4. If Zen 4 can decode it then Zen 4c must too. Doesn't say anything about how much silicon they are using/wasting on making AVX512 fast.

yuri69 · Nov 9, 2021

@Genoa-X: TSMC's 5nm-on-5nm CoW - the tech for the Zen 4 V-Cache - is scheduled for initial availability in Q3 2022. Given the timing of 7nm-on-7nm V-Cache for Zen 3, realistic mass availability & product launch would be late Q4 2022 at the earliest. This might be OK for Zen 4 - it could still launch in 2022 w/ or w/o the cache.

@avx-512: Remember, AMD was not shy about cutting the Zen 2 FP RF for the consoles. The AVX-512 RF can be huge, so cutting that would fit Zen 4c.

DrMrLordX · Nov 9, 2021

tamz_msc said:
SEV has major vulnerabilities because it leverages the PSP, which has been shown to be vulnerable to voltage-glitching attacks.

AMD has had time to fix those problems. I would hope to see fixes in B2-stepping Zen3 (Milan-X, Zen3D, Vermeer XT/Refresh) and all Zen4 products.

moinmoin · Nov 9, 2021

yuri69 said:
@avx-512: Remember, AMD was not shy about cutting the Zen 2 FP RF for the consoles. The AVX-512 RF can be huge, so cutting that would fit Zen 4c.

Those being custom designs, that's down to the console makers. That wasn't AMD calling the shots, it was Sony and Microsoft respectively.

DisEnchantment · Nov 9, 2021

gdansk said:
Remove or half-rate? I can see the latter but AMD said same ISA as Zen 4. If Zen 4 can decode it then Zen 4c must too. Doesn't say anything about how much silicon they are using/wasting on making AVX512 fast.

Not sure how Lisa Su's statement can be interpreted differently short of reading the PPR.
Unless she is lying.

At t=2114s

Bergamo is also socket compatible with Genoa with the same Zen4 instruction set

moinmoin · Nov 9, 2021

Maybe leoneazzurro doesn't consider AVX-512 a part of the ISA.

DisEnchantment · Nov 9, 2021

moinmoin said:
Maybe leoneazzurro doesn't consider AVX-512 a part of the ISA.

I understand ISA could be misconstrued as something like generic x86 ISA, but she explicitly says same instruction set

Funnily, this thought crossed my mind.
Bergamo is same Instruction set Architecture
Bergamo is same Instruction set
Bergamo is same Instructions
Bergamo is same

DisEnchantment · Nov 9, 2021

DisEnchantment said:
Anyway, now that they let the cat out, linux patches can start coming in.

Well, that did not take long

[PATCH 0/3] k10temp/amd_nb: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh - Babu Moger

leoneazzurro · Nov 9, 2021

moinmoin said:
Maybe leoneazzurro doesn't consider AVX-512 a part of the ISA.

AVX512 is an x86 ISA extension. That means that when it is present, software may take advantage of it; when it is not present, programs generally still run, but at a reduced level of performance. But well, as I stated, my concerns are about the opportunity cost and about power savings. I have nothing against AVX512 per se. If AMD found a way to have it at low area cost (not cutting out other maybe more useful features and power savings), low power and reasonably performing, all the better.
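To make the "use it when present, fall back when absent" point concrete, here is a rough sketch of how software might pick a code path. It assumes a Linux-style `/proc/cpuinfo` "flags" line, where the AVX-512 Foundation subset shows up as `avx512f`. The function names and the kernel labels (`avx512`, `avx2_fallback`) are made up for illustration and are not from any real library.

```python
def has_cpu_flag(cpuinfo_text, flag="avx512f"):
    """Return True if `flag` appears in the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like: "flags : fpu vme ... avx2 avx512f ..."
            _, _, rest = line.partition(":")
            return flag in rest.split()
    return False


def pick_kernel(cpuinfo_text):
    """Choose a (hypothetical) code path based on the advertised extensions."""
    # Same program either way; only the speed of the chosen path differs.
    return "avx512" if has_cpu_flag(cpuinfo_text) else "avx2_fallback"


if __name__ == "__main__":
    # On Linux, the real text could be read with open("/proc/cpuinfo").read().
    sample_with = "processor : 0\nflags : fpu sse2 avx2 avx512f avx512vl\n"
    sample_without = "processor : 0\nflags : fpu sse2 avx2\n"
    print(pick_kernel(sample_with))
    print(pick_kernel(sample_without))
```

Real software usually does this through CPUID in compiled code rather than by parsing text, but the dispatch logic is the same idea.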

soresu · Nov 9, 2021

leoneazzurro said:
If AMD found a way to have it at low area cost (not cutting out other maybe more useful features and power savings), low power and reasonably performing, all the better.

Waiting until 6 or 5nm would probably be no small advantage in that, much as with their delay to go fully 256 bit with Zen2.

MadRat · Nov 9, 2021

Are they growing interconnects via a die process up the edges?

lobz · Nov 9, 2021

DisEnchantment said:
Bergamo is same Instruction set Architecture
Bergamo is same Instruction set
Bergamo is same Instructions
Bergamo is same

Bergamo is love

Saylick · Nov 9, 2021

Given what we know about the Zen 3 CCD, the L3 cache takes up just as much space as the cores themselves, so it seems obvious to me that the majority of the core-packing effort comes from removing as much L3 cache as possible and replacing that die space with more cores. AMD allegedly gets an optimized TSMC N5 process (whether it be N5P or some HPC variant of N5, doesn't matter) hence why they are quoting 2x transistor density over N7 (or possibly N7P). Cache doesn't scale as well as logic anyways, so I don't see why they couldn't do half-rate AVX-512 and still fit 16 Zen 4c on a slightly larger CCD if they take out half of the cache.

Put it this way:
- Zen 3 CCD = half cache, half cores = 1.0x Area
- Strip out half of the cache, so remaining cache is 0.25x Area
- Assume Zen 4c cores have 1.5x transistor count but 2x density
- Assume cache gets 1.2x density
- Zen 4c CCD = 2*(0.5 * 1.5 / 2) for 16 Zen 4c cores + 0.25/1.2 for L3 cache = 0.958x Area of Zen 3 CCD.

Yes, I am aware that there's also a bunch of IO and miscellaneous off-die links, but just wanted to do some napkin math to get a ballpark of feasibility.
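For anyone who wants to play with the assumptions, the napkin math above can be sketched in a few lines of Python. Every default below is one of the guesses from the list (half cache / half cores, 1.5x transistors, 2x logic density, 1.2x cache density); the function and parameter names are just illustrative, and IO and off-die links are ignored exactly as in the estimate.

```python
def ccd_area_estimate(core_area=0.5,          # Zen 3 cores: half of the CCD
                      cache_area=0.5,         # Zen 3 L3: other half of the CCD
                      core_count_scale=2.0,   # 8 cores -> 16 cores
                      transistor_scale=1.5,   # assumed Zen 4c transistor growth
                      logic_density=2.0,      # assumed N5 logic density vs N7
                      cache_kept=0.5,         # keep half of the L3
                      cache_density=1.2):     # assumed N5 SRAM density vs N7
    """Return (core_area, cache_area, total) relative to a Zen 3 CCD = 1.0."""
    cores = core_count_scale * (core_area * transistor_scale / logic_density)
    cache = (cache_area * cache_kept) / cache_density
    return cores, cache, cores + cache


if __name__ == "__main__":
    cores, cache, total = ccd_area_estimate()
    print(f"cores={cores:.3f}x cache={cache:.3f}x total={total:.3f}x")
```

With the defaults this lands at roughly 0.958x, matching the figure above. Changing `logic_density` or `cache_kept` shows how sensitive the result is to those two guesses.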

eek2121 · Nov 9, 2021

Saylick said:
Given what we know about the Zen 3 CCD, the L3 cache takes up just as much space as the cores themselves, so it seems obvious to me that the majority of the core-packing effort comes from removing as much L3 cache as possible and replacing that die space with more cores. AMD allegedly gets an optimized TSMC N5 process (whether it be N5P or some HPC variant of N5, doesn't matter) hence why they are quoting 2x transistor density over N7 (or possibly N7P). Cache doesn't scale as well as logic anyways, so I don't see why they couldn't do half-rate AVX-512 and still fit 16 Zen 4c on a slightly larger CCD if they take out half of the cache.

Put it this way:
- Zen 3 CCD = half cache, half cores = 1.0x Area
- Strip out half of the cache, so remaining cache is 0.25x Area
- Assume Zen 4c cores have 1.5x transistor count but 2x density
- Assume cache gets 1.2x density
- Zen 4c CCD = 2*(0.5 * 1.5 / 2) for 16 Zen 4c cores + 0.25/1.2 for L3 cache = 0.958x Area of Zen 3 CCD.

Yes, I am aware that there's also a bunch of IO and miscellaneous off-die links, but just wanted to do some napkin math to get a ballpark of feasibility.

That was my original thought. Reduced cache, maybe shared across the entire 16 core CCD?

Saylick · Nov 9, 2021

eek2121 said:
That was my original thought. Reduced cache, maybe shared across the entire 16 core CCD?

I think we're going to be looking at two 8-core CCXs on the same Zen 4c CCD. So 8MB shared over 8 cores, and another 8MB shared over the other 8 cores. Same ring topology as Zen 3, but just a quarter of the cache per core (since the total cache is cut in half, and the number of cores is doubled).
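A quick check of the per-core cache estimate above, as a small Python sketch. The Zen 3 figures (32 MB L3 shared by 8 cores per CCD) are the known baseline; the Zen 4c figures (two CCXs of 8 cores with 8 MB each) are only the guess from this post, and the helper name is made up.

```python
def l3_per_core_mb(l3_per_ccx_mb, cores_per_ccx):
    """L3 capacity per core, in MB, within one CCX."""
    return l3_per_ccx_mb / cores_per_ccx


if __name__ == "__main__":
    zen3 = l3_per_core_mb(32, 8)         # Zen 3: one 8-core CCX, 32 MB L3
    zen4c_guess = l3_per_core_mb(8, 8)   # speculated: 8 MB per 8-core CCX
    print(zen3, zen4c_guess, zen4c_guess / zen3)
```

Half the total cache over twice the cores works out to a quarter of the L3 per core, as stated.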

dacostafilipe · Nov 9, 2021

How about removing a big portion of the cache on the die to add it back via V-cache?

Placing the Vias could be an issue, but I find this idea kinda funny xD

Mopetar · Nov 9, 2021

Is there anything about them using V-cache for those parts? Part of Zen 3's boost over Zen 2 came from their larger cache and the Zen 3D numbers show additional cache can benefit several applications.

If they can move a lot of the L3 cache to the stacked die it does allow for a much more densely packed core.

Saylick · Nov 9, 2021

The timeframe makes it a possibility. Early 2023 launch for Bergamo is after Qualification for N5-on-N5. I'm just not sure if they'd try to have the L3 be ONLY on the stacked die because N5 ain't cheap. That would imply that EVERY Bergamo CCD uses V-cache, but why spend the added costs of chip stacking and the extra N5 die if Zen 4c are supposed to be bare-bones cores. My opinion is that you'd want the much larger cache on a core that can actually leverage it, not vice-versa.

jpiniero · Nov 9, 2021

Saylick said:
The timeframe makes it a possibility. Early 2023 launch for Bergamo is after Qualification for N5-on-N5. I'm just not sure if they'd try to have the L3 be ONLY on the stacked die because N5 ain't cheap. That would imply that EVERY Bergamo CCD uses V-cache, but why spend the added costs of chip stacking and the extra N5 die if Zen 4c are supposed to be bare-bones cores. My opinion is that you'd want the much larger cache on a core that can actually leverage it, not vice-versa.

Hmm... maybe it is just a 2x 8-core CCX with no L3 but V-cache on top of the logic; with only really transistor changes to mitigate heat issues and to maximize density.

It'd be a lot better if you could do N7 on N5.

Mopetar · Nov 9, 2021

Wasn't one of the selling points of the stacked dies that different nodes could be used? Make the V-cache on an older node like N6 since SRAM doesn't scale well anyhow.

While not everything will benefit from cache, a lot will and the hypothetically slashed L3 suggested seems like it would hurt performance in a lot of cases. But stacking the cache is one way to reconcile a lower amount on the die.

maddie · Nov 9, 2021

jpiniero said:
Hmm... maybe it is just a 2x 8-core CCX with no L3 but V-cache on top of the logic; with only really transistor changes to mitigate heat issues and to maximize density.

It'd be a lot better if you could do N7 on N5.

Speaking about that.
Anyone know why different nodes can't be stacked with CoW? Or is it not a technical difficulty but the time needed to qualify the process?
At the interface where the actual fusion takes place, does it matter what's in the rest of the bulk material?

uzzi38 · Nov 9, 2021

NeoLuxembourg said:
How about removing a big portion of the cache on the die to add it back via V-cache?

Placing the Vias could be an issue, but I find this idea kinda funny xD

Why do that when you could instead cut the size of the core even further by removing the TSVs and all of the cache tags for handling V-Cache instead?

The aim with Bergamo is still to save as much die area as possible while retaining as much per-core performance for cloud workloads as possible.

jpiniero · Nov 9, 2021

I guess the point is why do this when you'd likely be able to sell all the Genoa N5 wafers AMD is willing to buy regardless. Then what do you do with Bergamo dies that don't cut it?
