Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Page 138 - AnandTech Forums

Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

(Attached slide: Untitled2.png)


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

LightningZ71

Golden Member
Mar 10, 2017
1,628
1,898
136
The additional latency hit for stacked L3 can TYPICALLY be mitigated by increasing the L2 size to the next tier up. Doubling the L2, while it may cost a cycle of access latency (maybe not, if N5 gives an advantage there), should reduce the average latency of memory access by enough to make a few extra cycles on the L3 a wash overall. Yes, I realize that doubling the L2 is not trivial, especially if you want to keep latencies low, and that it eats into the die-area budget. However, if you are going to stack large amounts of L3, you want to make it power efficient by accessing it less frequently (thus, larger L2), and you want to hide the latency hit (thus, larger L2).

In addition, if you believe that going vertical is the future, you can certainly make it a foundational design principle for the CCD and reduce the on-die L3 to 16 MB (which still leaves you with 80 MB if you are single-stacking 64 MB). This removes from the CCD something we have been told over and over doesn't scale well (L3 cache), without taking a massive performance hit.

If we look at Cezanne as a starting point, doubling the L2 cache area doesn't take up a massive amount of space and wouldn't make a CCD developed from that chip's CCX much bigger. I realize that's Zen 3, and we expect Zen 4 to be significantly larger, at the very least due to the inclusion of AVX-512. That would make the comparative L2 size even smaller.
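The average-latency claim here is easy to sanity-check with a back-of-the-envelope AMAT (average memory access time) calculation. All hit rates and cycle counts below are illustrative round numbers, not AMD's:

```python
# AMAT sketch for the argument above: a larger L2 catches more accesses,
# which can hide a slower (stacked) L3. Illustrative numbers only.

def amat(l1_hit, l1_lat, l2_hit, l2_lat, l3_hit, l3_lat, mem_lat):
    """Average access latency (cycles) for a 4-level hierarchy.
    Hit rates are local, per-level probabilities."""
    return (l1_lat
            + (1 - l1_hit) * (l2_lat
            + (1 - l2_hit) * (l3_lat
            + (1 - l3_hit) * mem_lat)))

# Baseline: 512 KB L2 with an 80% local hit rate, 46-cycle L3.
base = amat(0.95, 4, 0.80, 12, 0.50, 46, 200)

# Doubled L2: better hit rate (86%), +1 cycle L2 latency,
# and a stacked L3 that costs 4 extra cycles.
big_l2 = amat(0.95, 4, 0.86, 13, 0.50, 50, 200)

print(f"baseline AMAT:      {base:.2f} cycles")    # 6.06
print(f"2x L2 + stacked L3: {big_l2:.2f} cycles")  # 5.70
```

Under these made-up numbers the bigger L2 more than pays for the slower L3, which is the "wash overall" being argued.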
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
We're not in 2005 anymore, when more cores meant reduced frequencies across all-threaded workloads.

If they have a 24-core CPU, I assure you that at 16 cores it'll be clocked just as high as Zen 3. And for everything beyond 16 cores, it'll be faster anyway. That's why Turbo is great!

Uh, what? I don't see my 5950X boosting to 5 GHz for all-core workloads. It barely hits 4.2 GHz.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Uh, what? I don't see my 5950X boosting to 5 GHz for all-core workloads. It barely hits 4.2 GHz.

You are saying 24 cores are a loss since they'll clock lower. So what? With 16 or fewer cores active, the same CPU will clock just as high as the current top-of-the-line 5950X. So you don't lose at all, because if more than 16 cores are active, it'll be faster anyway even if it clocks down a bit.

By the way this is what you said as a reminder:
I worry the increased core count would eat significantly into multicore frequencies.

What I said is quite straightforward.
 

Joe NYC

Golden Member
Jun 26, 2021
1,970
2,354
106
The additional latency hit for stacked L3 can TYPICALLY be mitigated by increasing the L2 size to the next tier up. Doubling the L2, while it may cost a cycle of access latency (maybe not, if N5 gives an advantage there), should reduce the average latency of memory access by enough to make a few extra cycles on the L3 a wash overall. Yes, I realize that doubling the L2 is not trivial, especially if you want to keep latencies low, and that it eats into the die-area budget. However, if you are going to stack large amounts of L3, you want to make it power efficient by accessing it less frequently (thus, larger L2), and you want to hide the latency hit (thus, larger L2).

In addition, if you believe that going vertical is the future, you can certainly make it a foundational design principle for the CCD and reduce the on-die L3 to 16 MB (which still leaves you with 80 MB if you are single-stacking 64 MB). This removes from the CCD something we have been told over and over doesn't scale well (L3 cache), without taking a massive performance hit.

If we look at Cezanne as a starting point, doubling the L2 cache area doesn't take up a massive amount of space and wouldn't make a CCD developed from that chip's CCX much bigger. I realize that's Zen 3, and we expect Zen 4 to be significantly larger, at the very least due to the inclusion of AVX-512. That would make the comparative L2 size even smaller.

From the Mike Clark interview, it seems that AMD is going to keep some base amount of L3 on die, so that the chip can be sold without stacking, at least for the next 1-2 generations.

But beyond that, it may well become more economical to remove the L3 from the (very expensive) base die and put it on a cheap stacked die.
 

Abwx

Lifer
Apr 2, 2011
10,971
3,532
136
Uh, what? I don't see my 5950X boosting to 5 GHz for all-core workloads. It barely hits 4.2 GHz.

A 5950X has 64% better MT perf than a 5800X at about the same TDP.

A theoretical 32C wouldn't compare as favorably against the 5950X (since the 5800X is pushed into a steeper part of the V/F curve), but it should still have a roughly 50% advantage over the 16C part.
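The "roughly 50%" estimate is consistent with a crude cubic power-vs-clock model: dynamic power goes as f·V², and voltage is roughly linear in frequency over the usable range, so per-core power scales roughly with f³ at a fixed package budget. The 142 W figure is roughly the stock 5950X package power limit (PPT); the 6 W at 3 GHz per-core anchor is an assumption for illustration:

```python
# Sketch: at a fixed package power budget, how do all-core clocks and
# throughput compare for 16 vs 32 cores? Assumes per-core power ~ f^3.

def clock_at_tdp(cores, tdp_w, f0=3.0, p0=6.0):
    """Highest all-core clock (GHz) that fits the budget, given an
    (assumed) anchor of p0 watts per core at f0 GHz."""
    return f0 * (tdp_w / (cores * p0)) ** (1 / 3)

TDP = 142.0                  # ~stock 5950X PPT, in watts
f16 = clock_at_tdp(16, TDP)  # ~3.4 GHz
f32 = clock_at_tdp(32, TDP)  # ~2.7 GHz

# Throughput ~ cores * clock: the 32C part clocks lower but still wins.
gain = (32 * f32) / (16 * f16) - 1
print(f"16C at {f16:.2f} GHz, 32C at {f32:.2f} GHz, MT gain {gain:.0%}")
```

The model lands at about 59% more throughput, in the same ballpark as the 50% figure above. Real chips deviate (static power, voltage floors), so treat it as a shape argument, not a prediction.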

 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
A theoretical 32C wouldn't compare as favorably against the 5950X (since the 5800X is pushed into a steeper part of the V/F curve), but it should still have a roughly 50% advantage over the 16C part.

Exactly. Abwx nailed it. And with modern Turbo, when only 16 cores are used, then it'll clock just as high as the 5950X, meaning it'll be faster in everything.

From the Mike Clark interview, it seems that AMD is going to keep some base amount of L3 on die, so that the chip can be sold without stacking, at least for the next 1-2 generations.

Latency doesn't have to be higher on the V-cache. It could be arranged so that the base cache keeps the same latency and only the V-cache layer is higher latency.

Also, when it comes to costs, the tiny die might cost little, but the complexity of stacking is what raises costs.

It's similar to how packaging costs dominate in sub-100 mm² CPUs but matter less for larger dies. Things like packaging and stacking are fixed costs that are better amortized in expensive, larger configurations.
 

DrMrLordX

Lifer
Apr 27, 2000
21,644
10,864
136
I worry the increased core count would eat significantly into multicore frequencies.

Was going to respond to this but @IntelUser2000 and @Abwx beat me to it. They're right though, you don't lose anything by going with more cores. Except money out of your wallet.

What worries me more is the addition of AVX512 and the frequency throttling that this may bring.

AMD isn't Intel (and tbh even Intel doesn't do this anymore, at least not on IceLake and TigerLake). They already dynamically clock the CPU based on v/t/f curves for AVX and AVX2 workloads, just like everything else. The entire CPU isn't going to tank because of a single AVX/AVX2/AVX512 thread. Only issue might be instruction latency for AVX512. We will see.

Could AMD implement power-control features that scale core frequencies better with regard to the overall load on the cores? Ideally you scale each core independently as needed. Aren't they implementing a vastly better power-management system in Zen 4? Could this be it?

AMD already scales each core independently as needed.
 

turtile

Senior member
Aug 19, 2014
614
294
136

Rumor suggests that AMD will make 16-core chiplets: 8 cores under stacked L3 cache running at a low TDP, and the rest at full power.

This makes sense if they can charge enough. Especially if these are actually two chiplets bridged together.
 

Saylick

Diamond Member
Sep 10, 2012
3,172
6,415
136

Rumor suggests that AMD will make 16-core chiplets: 8 cores under stacked L3 cache running at a low TDP, and the rest at full power.

This makes sense if they can charge enough. Especially if these are actually two chiplets bridged together.
LOL. Did anyone here email Hassan some of the napkin sketches we were musing over when we discussed how AMD could fit 16 Zen 4c cores on a similarly sized chiplet? I know some of us hypothesized moving the bulk of the L3 onto V-cache to fit more Zen 4c cores onto the base die, but this rumor is just ridiculous.

16 Zen 4 cores (I presume they are all feature equivalent and thus similar in die area), with a lot less L3 cache, just to make it up with V-cache, all for use in the desktop space??? To top it off, Hassan claims that the intent is for the "low TDP" cores to only be active once the main cores exceed 100% utilization??? I'm sorry, I didn't realize that we needed low TDP cores in the desktop space. My bad. This take really sounds like someone saw Alderlake and came up with something similar for AMD but with V-cache. @Kepler_L2 thinks it's fake. I think it's fake. WCCFTech at it again with writing up BS clickbaity articles.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
LOL. Did anyone here email Hassan some of the napkin sketches we were musing over when we discussed how AMD could fit 16 Zen 4c cores on a similarly sized chiplet? I know some of us hypothesized moving the bulk of the L3 onto V-cache to fit more Zen 4c cores onto the base die, but this rumor is just ridiculous.

16 Zen 4 cores (I presume they are all feature equivalent and thus similar in die area), with a lot less L3 cache, just to make it up with V-cache, all for use in the desktop space??? To top it off, Hassan claims that the intent is for the "low TDP" cores to only be active once the main cores exceed 100% utilization??? I'm sorry, I didn't realize that we needed low TDP cores in the desktop space. My bad. This take really sounds like someone saw Alderlake and came up with something similar for AMD but with V-cache. @Kepler_L2 thinks it's fake. I think it's fake. WCCFTech at it again with writing up BS clickbaity articles.
Seconded. V-cache for every SKU? Are they joking? And why would they need such a weird hybrid scheme for just 16 cores? Vermeer works plenty fine.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
(Attached image: dieshotmodded.jpg)

I guess it could work? The low TDP is probably because those cores sit under the SRAM die: the FinFET FEOL ends up next to the package bottom rather than the IHS on top.

Using the 3D V-Cache outline as a 36 mm² boundary, the above die works out to ~71.78 mm² and the Zen 4 core to ~2.85 mm², if we go by the new green-line locations being where things land after the shrink. If there is a smaller Zen 4 core, that leaves room for a larger L2 in that area.

0.8 -> 0.58 -> 1.1 (1 MB) -> 2.3 (2 MB) => Zen 4 at ~1.74 mm²; if AMD is aiming at exactly a 2x area shrink => ~1.62 mm².

57 CPP -> 51 CPP = ~0.9x shrink
40 MP -> 32 MP = 0.8x shrink
Multiplied together, that is a ~0.72x shrink.

6-track -> 5-track = ~0.5x
That allows for a maximum ~0.36x shrink: ~0.36x for logic vs ~0.72x for memories. Averaging the two gives a potential ~0.54x shrink: 3.24 mm² * 0.54 => ~1.75 mm².
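For anyone checking the arithmetic, here it is spelled out. The pitch and track figures are the post's own; the 50/50 logic/SRAM split and the 3.24 mm² starting area are its implicit assumptions (the post rounds the intermediate factors up slightly):

```python
# The shrink arithmetic from the post above, step by step.

cpp = 51 / 57           # contacted poly pitch shrink, ~0.89x
mp = 32 / 40            # metal pitch shrink, 0.80x
pitch = cpp * mp        # ~0.72x, taken as the shrink for memories

track = 0.5             # claimed 6-track -> 5-track benefit, ~0.5x
logic = pitch * track   # ~0.36x maximum shrink for logic

avg = (logic + pitch) / 2   # crude 50/50 logic/SRAM average, ~0.54x
core = 3.24 * avg           # projected core area in mm^2, ~1.74

print(f"pitch {pitch:.3f}, logic {logic:.3f}, avg {avg:.3f}")
print(f"projected core: {core:.3f} mm^2")
```

The exact product is ~1.74 mm², close to the post's 1.7496 mm² (which uses the rounded 0.72 factor).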
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
5,325
8,027
136
FinFET-FEOL is next to package bottom rather than package IHS at top.

This isn't a Finfet thing, it's a packaging choice. They could be FEOL down or up, just like most every other CMOS process.

Edit: Actually, after re-reading this, your comment is just wrong. My reply about it being a packaging choice still stands, though.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
This isn't a Finfet thing, it's a packaging choice. They could be FEOL down or up, just like most every other CMOS process.
I wasn't making it a FinFET thing, until you made it one...

FinFETs have worse self-heating effects than PD-SOI; more than 70% of that heat leaks into the BEOL, where the faster part of the fin resides. Flipping the orientation over releases the heat on the faster-dissipating side. So the positioning is potentially "innovative" in getting rid of the heat expressed in the BEOL faster.

14nm saw >40% leak into the BEOL
7nm saw >60% leak into the BEOL
5nm saw >70% leak into the BEOL, etc.

Standard-orientation going forward:
(Attached image: beolflip.jpeg)

New-orientation going forward:
(Attached image: beolflip2.jpeg)
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,172
6,415
136
View attachment 55415

I guess it could work? The Low TDP is probably because the cores are under the SRAM die. FinFET-FEOL is next to package bottom rather than package IHS at top.
It could work from a die size perspective, but it would still be a waste of area to have 8 extra cores taking up a ton of area just to have their combined TDP capped at 30W. Those low TDP cores would need to be super small to make sense, which basically puts this rumor as Alderlake-envy but with an extra helping of V-cache. It's literally an AMD fanboy's wet dream of outdoing Intel but with AMD's own flavors.
 
  • Like
Reactions: Joe NYC

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
It could work from a die size perspective, but it would still be a waste of area to have 8 extra cores taking up a ton of area just to have their combined TDP capped at 30W. Those low TDP cores would need to be super small to make sense, which basically puts this rumor as Alderlake-envy but with an extra helping of V-cache. It's literally an AMD fanboy's wet dream of outdoing Intel but with AMD's own flavors.
With 5nm we get a 1.25x performance increase, 2x area reduction, and 2x power reduction. That still gives an all-core guaranteed clock (P1 state) of ~3.5 GHz for the cores under V-Cache, while the cores not under V-Cache get a ~5 GHz boost. Exterior cores (not under V-Cache) get aggressive boosting at high TDP; interior cores (under V-Cache) get guaranteed clocks at low TDP.

It is feasible, it is workable, and it doesn't actually need new cores... will AMD actually do it? Who knows but them.

Intel's solution => big.Little director
AMD's solution => existing or improved CPPC2/SMU preferred cores
 
Last edited:

Hitman928

Diamond Member
Apr 15, 2012
5,325
8,027
136
I wasn't making it a FinFET thing, until you made it one...

FinFETs have worse self-heating effects than PD-SOI; more than 70% of that heat leaks into the BEOL, where the faster part of the fin resides. Flipping the orientation over releases the heat on the faster-dissipating side. So the positioning is potentially "innovative" in getting rid of the heat expressed in the BEOL faster.

14nm saw >40% leak into the BEOL
7nm saw >60% leak into the BEOL
5nm saw >70% leak into the BEOL, etc.

Standard-orientation going forward:
View attachment 55416

New-orientation going forward:
View attachment 55417

You are just showing flip-chip orientation, which has been used for many generations of these types of chips, starting before FinFETs. The rest of your post seems to be numbers plucked out of the air, so I'm not sure what to make of them.

I'm still confused, though, because you label both images as the orientation going forward, so I don't know which way you think AMD/Intel are packaging their chips.
 
Jul 27, 2020
16,350
10,359
106
It is feasible, it is workable. It doesn't actually need new cores... will AMD actually do it. Who knows but them.

Intel's solution => big.Little director
AMD's solution => existing or improved CPPC2/SMU preferred cores
The battery efficiency gains in mobile Alder Lake SKUs must be huge for AMD to be considering this "hack".
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
You are just showing flip chip orientation which has been used for many generations for these types of chips, starting before finfets. The rest of your post seem to be numbers plucked out of the air so not sure what to make of them.

I’m still confused though because you say both are the orientation going forward so I don’t know which way you think AMD/Intel are packaging their chips.
It's Nosta just BSing about stuff he doesn't understand, like usual.
 
  • Like
Reactions: scineram

Kedas

Senior member
Dec 6, 2018
355
339
136
The die size doesn't look very big for 16 cores (unless Zen 4 is much bigger than Zen 3); they removed a lot of L3 and L2 cache and replaced it with cores that are power limited. (Not sure about the L2; I hope it's more than 1 MB.)
And the cost of stacking L3 cache is probably a lot cheaper than you think.
Also remember that the separate L3 cache die has twice the density for some reason (64 MB stacked on Zen 3's 32 MB).

And power isn't linear, so those 30 W (8 cores) can probably do >50% extra work.
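The ">50% extra work" point can be eyeballed with a simple cubic power-vs-clock model; the 6 W at 3 GHz per-core anchor and the 95 W full-power budget are illustrative assumptions, not measured figures:

```python
# How much throughput do 8 cores capped at 30 W add, relative to
# 8 uncapped cores? Assumes per-core power ~ f^3 (illustrative anchors).

def clock(watts_per_core, f0=3.0, p0=6.0):
    """Clock (GHz) that fits a per-core power budget, anchored at an
    assumed p0 watts at f0 GHz."""
    return f0 * (watts_per_core / p0) ** (1 / 3)

full = 8 * clock(95 / 8)   # 8 full-power cores sharing ~95 W
low = 8 * clock(30 / 8)    # 8 capped cores sharing 30 W

print(f"extra work from the 30 W cores: {low / full:.0%} of full-power")
```

Despite a threefold power gap, the capped cores still deliver about two-thirds of the uncapped throughput, which is how "30 W can do >50% extra work" pencils out.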
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Yet the WCCFTech rumor says that this chiplet will be for desktop Ryzen. 🤦‍♂️
3 cores active = 3.7/3.5 GHz, 6 cores active = 3.3/3.0 GHz => "unnamed" desktop products.

While the above trims off idle/below-TDP time, the 16-core CCD + 64 MB L3 also means moving from low-leakage SRAM on the base die to high-leakage logic sitting below low-leakage, high-density SRAM.
The battery efficiency gains in mobile Alder Lake SKUs must be huge for AMD to be considering this "hack".
I don't think it is about mobile, but if we are talking mobile...

CPU-H product with this design would basically go toe to toe with Zen/Zen+ Threadripper in Mobile.

45 W to >65 W only gives 30 W for the internal cores and >15 W for the external cores. So, in that case it is guaranteed clocks only, anyway.

Ryzen Threadripper 2990WX => 32 cores * 3 GHz (128-bit units) + 256-bit @ 2933 MHz + 64 MB L3
vs
Hypothetical 2+2:1 penta-die CPU-H => 32 cores * ~2.7 GHz (512-bit units) + 128-bit @ 5866 MHz (max) + 128 MB L3

2990WX => $1500 to $1800 currently
The hypothetical also $1500 to $1800 => Perfect for >$3000 workstation laptops.

Going from ~800 mm² worth of die area to under 373 mm² for the same price is not a bad jump in margins.
 
Last edited:

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136

Rumor suggests that AMD will make 16-core chiplets: 8 cores under stacked L3 cache running at a low TDP, and the rest at full power.

This makes sense if they can charge enough. Especially if these are actually two chiplets bridged together.

I read that earlier. It doesn't match up with other leaks, however. If AMD did manage to pull this off, Intel would be in heaps of trouble, not just on the desktop: it would also mean a drastic increase in core counts for Threadripper and EPYC.

We will see how things play out.
 
  • Like
Reactions: Tlh97