Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
HEDT needs more than just cores; there is also the question of connectivity and memory bandwidth. Different use cases will require different memory bandwidth, but the move to DDR5 only increases bandwidth by ~50% until faster sticks come out, and Zen 4 cores are getting faster (IPC and clock speed). The result is that, just to keep pace with the existing 5950X, you would need a third primary memory slot (I'm trying to stay away from "channel", because DDR5 is complicating that terminology).
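As a rough back-of-envelope illustration of the bandwidth-per-core point above, here is a minimal sketch. The channel bandwidths are nominal JEDEC figures; the per-core demand factors are purely hypothetical.

```python
# Back-of-envelope sketch of the bandwidth-per-core argument above.
# Channel bandwidths are nominal JEDEC figures; the demand factors are
# purely illustrative assumptions.

DDR4_3200 = 25.6   # GB/s per 64-bit channel
DDR5_4800 = 38.4   # GB/s per channel (~1.5x DDR4-3200)

def bw_per_core(channels: int, gbps_per_channel: float, cores: int) -> float:
    """Aggregate memory bandwidth split evenly across cores."""
    return channels * gbps_per_channel / cores

baseline = bw_per_core(2, DDR4_3200, 16)   # 5950X: 2 channels of DDR4, 16 cores
print(f"5950X baseline: {baseline:.1f} GB/s per core")

# If Zen 4's per-core bandwidth demand grows by more than DDR5's ~1.5x uplift,
# two channels are no longer enough for a 16-core part:
for demand in (1.25, 1.5, 1.75):
    need = baseline * demand
    for ch in (2, 3):
        have = bw_per_core(ch, DDR5_4800, 16)
        verdict = "ok" if have >= need else "short"
        print(f"demand x{demand}: {ch} DDR5 channels -> "
              f"{have:.1f} GB/s/core vs {need:.1f} needed ({verdict})")
```

Under these assumptions, two DDR5 channels only keep pace if per-core demand grows by less than the ~1.5x DDR5 uplift; beyond that, a third channel (or slot) is needed.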
Anyone needing that stuff will rely on an OEM and not their own hackjob of a PC. Gamers and casuals don’t need that stuff.
The SRAM L3 die on top will be 6nm, no problem.

View attachment 58609
Note the release date. Likely too late for AMD to use.
5nm goals:
View attachment 58643
The press slides show 5nm improvements versus the 7nm process. Basically, if AMD achieves its goals, Zen 4 will be half the size of Zen 3 within the same Family 19h. None of the Zen 4 details in the leak indicate units larger than Zen 3's. A near-identical core to Zen 3 would be roughly half the size on 5nm, given the 2x density goal.
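A one-line version of that scaling claim, using the roughly 81 mm2 Zen 3 CCD as the reference point; the area figure is a public approximation, and the 2x factor is just the stated density goal.

```python
# Minimal sketch of the "half the size" claim: a near-identical core shrunk
# under a 2x logic-density goal. The ~81 mm^2 Zen 3 CCD figure is a public
# approximation used only for illustration.

zen3_ccd_mm2 = 81.0
density_goal = 2.0
print(f"Estimated Zen 4 CCD at 2x density: ~{zen3_ccd_mm2 / density_goal:.0f} mm^2")
```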

The Zen 4 / Zen 4c split being the way it is resembles Milan/Milan-X (Genesis/Genesis-X): Genoa is the non-X3D die, and Bergamo is the X3D die. Basically, it is similar to going from Altra (32 MB SLC) to Altra Max (16 MB SLC): same architecture, more cores, less L3 hogging die area. Hence, it is "Zen 4" with a cloud-optimized cache hierarchy.
Milan-X = 96 MB * 8
Bergamo = 64 MB * 8 up to 256 MB * 8, since Bergamo isn't the prototype die but rather the production die.
-- ARMv9 N3 cores for cloud, in tape-out since July 2021, have indicated that Neoverse on 3nm is targeting 128+ cores with 2 MB L2 + 128 MB L3. In servers, AMD is competing against single-architecture, cloud-oriented ARMv9 processors.

On the power-efficiency comments: since it is fully X3D-optimized, there might be more aggressive power savings added in going from Zen 4's on-die cache hierarchy to Zen 4's stacked-die cache hierarchy:
View attachment 58646
Different die, different optimizations.

Raphael would thus be conventionally split into two dies:
Durango CCD = Genoa CCD = 8c/1 MB L2 + on-die 32 MB Hi-Current SRAM L3.
Durango-X CCD = Bergamo CCD = 16c/2 MB L2 + stacked-vertical-cache 64 MB Hi-Density SRAM via L3D.

Specifically, I am not operating on the assumption that wccftech is accurate, as there have been no indications of a 2x512 KB L2 Durango part.

General profit model:
Vermeer = $300 for 8-core
Vermeer-X = $450 for 8-core (~80 mm2 + ~40 mm2 stacked-die)
Raphael 8C CCD = $375 for 8-core (Limited ASP gain)
Raphael 16C CCD = $675~$900 for 16-core (~80 mm2 + ~40 mm2 stacked-die; increase of 1.5x~2x ASP over Vermeer-X)

Alder Lake has two incompatible architectures in an 8P+8E config, without AVX512, for ~$600.
Raphael 16C CCD has a single architecture in a 16P config, with AVX512, for >$600 (a better solution can ask for a higher price).
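To make the margin argument concrete, here is the list above turned into a tiny revenue-per-mm2 table; all prices and die areas are the hypothetical figures from the list, not confirmed AMD numbers.

```python
# Sketch of the ASP reasoning above, using only the hypothetical prices and
# die areas from the list (not confirmed AMD figures).

parts = {
    # name:                (price_usd, ccd_mm2, stacked_mm2)
    "Vermeer 8C":          (300, 80, 0),
    "Vermeer-X 8C":        (450, 80, 40),
    "Raphael 8C CCD":      (375, 80, 0),
    "Raphael 16C (low)":   (675, 80, 40),
    "Raphael 16C (high)":  (900, 80, 40),
}

for name, (price, ccd, stacked) in parts.items():
    silicon = ccd + stacked
    print(f"{name:18s} ${price:4d} for ~{silicon:3d} mm^2 of CCD silicon "
          f"-> ${price / silicon:5.2f} per mm^2")
```

Under those assumptions, the 16-core stacked part roughly doubles the revenue per mm2 of CCD silicon relative to Vermeer, which is essentially the margin argument being made.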
Zen 4 has a larger FPU and more cache at minimum. There are also other changes.

The “rumor” is very likely fake.

EDIT: although if AMD did do this, moving TDP from 105W to 125W would ensure boost clocks aren’t negatively impacted. It would be a clever way to wipe out Intel’s attempt at huge multicore gains since the top SKU would have 32C/64T.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Zen 4 has a larger FPU and more cache at minimum. There are also other changes.

The “rumor” is very likely fake.

EDIT: although if AMD did do this, moving TDP from 105W to 125W would ensure boost clocks aren’t negatively impacted. It would be a clever way to wipe out Intel’s attempt at huge multicore gains since the top SKU would have 32C/64T.
The rumors are likely wrong in specification and timing. I do not think Raphael will launch with a 16c / 32 MB L2 + 64 MB L3D part, unless the 2H22 roadmap slips into 2023. The reason to make the switch would be a higher profit margin per CCD: 16-core dies should sell at a higher price than 8-core dies.

Zen4 might be larger in transistor quantity, but it might be implemented smaller.

Zen3 to Zen4 potential change in logic cells:
Critical path = multi-height 2-fin to multi-height 2-fin (no change from the 1.8x logic density) || 1.8x density and 0.7x power (reliability is unharmed)
Non-critical path = single-height 2-fin to single-height 1-fin (up to a 3.2x logic-density change) || 3.2x density and 0.4x power (due to reduced wire length, Fmax goes up)
^-- "Height" here isn't track height, which is purely 6T in 7nm, with 1-fin+SDB devices being ~5T. Samsung's 5nm single-fin devices have higher maximum and nominal frequencies, so TSMC's 5nm single-fin likely does too.

Lowest area usage of logic = 2x density
Highest area usage of logic = 3x density
Memories universal = 1.35x density
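For scale, here is a rough area model that applies those factors to an ~81 mm2 Zen 3 CCD; the 50/50 logic/SRAM split is an assumption for illustration, not a measured breakdown.

```python
# Rough CCD-area model using the density factors listed above.
# The ~81 mm^2 Zen 3 CCD and the 50/50 logic/SRAM split are assumptions
# for illustration, not a measured breakdown.

zen3_ccd_mm2  = 81.0
sram_fraction = 0.5     # assumed share of the die that scales like "memories"
memory_gain   = 1.35

logic_mm2 = zen3_ccd_mm2 * (1 - sram_fraction)
sram_mm2  = zen3_ccd_mm2 * sram_fraction

for label, logic_gain in (("2x logic density", 2.0), ("3x logic density", 3.0)):
    scaled = logic_mm2 / logic_gain + sram_mm2 / memory_gain
    print(f"{label}: ~{scaled:.0f} mm^2 per CCD before any new-feature growth")
```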

Zen4 to Zen4c change in logic cells:
Critical path goes from the densest-BEOL standard cells to HPC cells with relaxed BEOLs.
Memories with the HS prefix also have relaxed BEOLs.
1x metals to 1.1x metals, etc.
However, I doubt the above; AMD is only using N5HPC for the new MIMs, to alter the frequency/power curve.

On the Zen 4 FPU, it has been indicated that it is not native AVX512 execution; one AVX512 macro-op is split into smaller 256-bit ops by the NSQ/FPU retire-rename.

From Zen 3's 2x 160 x 128-bit entries
to
2x 80 x 256-bit entries,
which in Zen 4 can be:
32 lo-half + 32 hi-half = one full set of AVX512 architectural registers, leaving 48 lo+hi pairs as AVX512 rename registers.

~80+ 512-bit registers with only a minor size increase over Zen 3 to support the AVX512 bits (4x 128+xx bits per register for flags, masks, instructions like the Z-bit, etc.). Thus, in my honest opinion, the Zen 4 FPU is going to be much smaller than most people expect.
Jaguar, for example, has 36 virtual 256-bit registers => 16 architectural, 20 for rename.
Zen 4, if it keeps the data registers the same, has 80 512-bit registers => 32 architectural, 48 for rename. Make the last 8 of those the special AVX512 registers and, boom, Zen 4 implements double what Jaguar's PRF did to support 256-bit.
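The register accounting above, spelled out; the 2x 80-entry figure is the poster's speculation, and only the 32 architectural zmm registers are fixed by the ISA.

```python
# Sketch of the speculated AVX512 register-file accounting above. The 80-entry
# halves are speculation; only the 32 architectural zmm registers are fixed
# by the AVX512 ISA.

lo_half_entries = 80     # speculated 256-bit entries holding bits [255:0]
hi_half_entries = 80     # speculated 256-bit entries holding bits [511:256]

regs_512b = min(lo_half_entries, hi_half_entries)  # one lo + one hi per 512-bit reg
ARCH_ZMM  = 32                                     # zmm0..zmm31
rename    = regs_512b - ARCH_ZMM

print(f"512-bit registers from paired halves: {regs_512b}")
print(f"architectural: {ARCH_ZMM}, left for rename: {rename}")

# Comparison point from the post: Jaguar's 36 virtual 256-bit registers.
jaguar_total, jaguar_arch = 36, 16
print(f"Jaguar: {jaguar_arch} arch + {jaguar_total - jaguar_arch} rename (256-bit)")
```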

Zen 4c + 2 MB L2 with vertical L3 cache should still be able to fit 16 cores in ~80 mm2, especially if it drops the on-die L3 SRAMs. The increased latency of the 2 MB L2 hides the vertical-cache latency. A vertical-cache-only implementation means the move to just HD cells should reduce power consumption at higher data rates.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
No problem. Just trying to call out BS when I see it, and WCCFtech is just full of it.

The rumor feels almost as if someone saw Alder Lake's hybrid approach and thought up some way AMD could do the same with its own technology by simply adding V-cache. The moment I saw that it requires V-cache to work, it was an absolute bust in my opinion, because it runs counter to AMD's ethos of modularity and re-usability.
The wccftech rumor might be BS, but I have expected Bergamo to be a stacked device from the start. This isn't like Zen 1, where AMD had one chip for everything; they are making a much wider variety of dies now. Some things might be correct, though. Zen 4c might be almost the same core as Zen 4, just made on a much more power-optimized process. Zen 3 at 2.45 GHz with 64 cores is 280 W TDP; going down to 2 GHz is 225 W TDP. What will it be at 128 Zen 4c cores? Regular Zen 4 seems to be able to do over 5 GHz; Zen 4c is likely nowhere near that, although it is unclear what it would reach with a larger power budget. I wouldn't expect it to boost very high even with more thermal headroom.

The power savings might extend to the IO die by using silicon bridge chips with TSV instead of serdes connections. For modularity, they may be using the same infinity cache chips that will be used on GPUs. If they have 512 MB to 1 GB of infinity cache in stacked bridge chip(s) using TSV or other stacked connections, then they don’t need L3 on die. Also, since it would require a different IO die with TSVs, it would likely not be usable on desktop parts. It will not clock anywhere near as high as regular Zen 4 anyway, so you probably wouldn’t want it in a desktop part.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
Thanks. I’ve been following this debate and see no business case for use of a more complex SoC in the desktop space. Using a specialized server chiplet is patently ridiculous - all because of some clickbait rumors.
How many different dies has AMD taped out recently? Is it really that unbelievable that they would have two different Zen 4 chiplets? How are they doing 128 cores without a server-specific chiplet? Using the standard 8-core Genoa chiplet only gets them to 96 cores with 12 serdes links. That isn't very power efficient.
 

jamescox

Senior member
Nov 11, 2009
637
1,103
136
All Zen 2 and later CCDs actually have an area designated for test probes/bumps, with probing done before singulation (the grey area in this annotation).
View attachment 58452

That point you mentioned seems to be one of the main reasons why chip-last 2/3D back-end packaging is the only kind AMD seems to be interested in so far:
it allows everything to be tested before the final logic dies are packaged together.
CoWoS/EFB/interposer are all chip-last packaging technologies. InFO is not, but some chip makers who can absorb the cost use it anyway, because they can eat the loss due to defects or because their dies are small.

In chip-first packaging, the wafer on which the logic dies are made is the carrier, and packaging is done before singulation (i.e. InFO).

SoIC is CoW, but it is front-end packaging; it is also done before singulation, so they need to probe before bonding, otherwise they will take some loss.
I don’t know how much testing is worth doing though. They are releasing some lower end Zen 3 based parts now. Given the flexibility and modularity of the chiplets, they can sell just about every part they make in some manner. It can have a lot of bad cores and still be used somewhere. It could even have bad cache since they have some Epyc parts with half sized caches. If something is wrong with the stacked cache, perhaps they can still sell it as a part without stacked cache.

Is it actually worth trying to do some testing before dicing? What do they do about a failed die? Do they put a dummy or failed cache die over that CPU? I don't know, but for tiny CPU chiplets, trying to test before dicing seems like it might not be worth it even if it is possible.
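One way to put numbers on that question is a toy Poisson yield model comparing the bonding cost wasted on dead CCDs with and without a pre-bond probe. Every figure here (defect density, die size, bonding cost) is made up for illustration, and the model ignores that a partially defective CCD can still be sold as a salvage part, which is exactly the counter-argument above.

```python
# Toy model of "is wafer-level test worth it before stacking?".
# All numbers (defect density, areas, costs) are made up for illustration.
import math

def poisson_yield(area_mm2: float, d0_per_mm2: float) -> float:
    """Probability that a die of the given area has zero killer defects."""
    return math.exp(-area_mm2 * d0_per_mm2)

ccd_mm2   = 80.0
d0        = 0.001    # assumed killer defects per mm^2
bond_cost = 5.0      # assumed cost ($) to stack one cache die on one CCD

y_ccd = poisson_yield(ccd_mm2, d0)

# Without a pre-bond probe, every CCD gets a cache die, so the bonds landing
# on dead CCDs are wasted. With a probe, only known-good CCDs are bonded.
wasted_blind = (1 - y_ccd) * bond_cost

print(f"CCD yield (zero killer defects): {y_ccd:.1%}")
print(f"Bonding cost wasted per CCD without pre-bond test: ${wasted_blind:.2f}")
```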
 

Saylick

Diamond Member
Sep 10, 2012
3,171
6,404
136
FWIW, Charlie from SemiAccurate had some musings on Bergamo about a month ago (not even sure how reputable this info is), but I'll summarize it as follows:
- Bergamo takes same IOD as Genoa, but puts 8 Bergamo CCDs instead of Genoa's 12.
- Each Bergamo CCD has (16) Zen 4c cores but the same 32 MB of L3 as Genoa's CCD.
- The Zen 4c CCD splits its (16) Zen 4c cores into two CCXs, and each CCX shares half of the total L3 (i.e. 2 MB of L3 per Zen 4c core).
- Given that there are two CCXs on each Bergamo CCD, there is likely a latency penalty when a core on one CCX needs to access another core's data on a different CCX, even if that CCX is on the same CCD.
- AMD likely has figured out how to connect (12) memory channels to (8) CCDs given that Milan already handles this situation fine.
- Twice the socket performance of top Milan for all key foundational workloads.
- Bergamo can run in non-SMT mode, which helps with per-thread performance. On a thread vs thread basis, 128C/128T Bergamo is about 60% more performant than 64C/128T Milan.

Edit: If anyone has an issue with me summarizing this info because it technically was behind a paywall, let me know and I can delete it from this thread.
 

Justinus

Diamond Member
Oct 10, 2005
3,175
1,518
136
FWIW, Charlie from SemiAccurate had some musings on Bergamo about a month ago (not even sure how reputable this info is), but I'll summarize it as follows:
- Bergamo takes same IOD as Genoa, but puts 8 Bergamo CCDs instead of Genoa's 12.
- Each Bergamo CCD has (16) Zen 4c cores but the same 32 MB of L3 as Genoa's CCD.
- The Zen 4c CCD splits its (16) Zen 4c cores into two CCXs, and each CCX shares half of the total L3 (i.e. 2 MB of L3 per Zen 4c core).
- Given that there are two CCXs on each Bergamo CCD, there is likely a latency penalty when a core on one CCX needs to access another core's data on a different CCX, even if that CCX is on the same CCD.
- AMD likely has figured out how to connect (12) memory channels to (8) CCDs given that Milan already handles this situation fine.
- Twice the socket performance of top Milan for all key foundational workloads.
- Bergamo can run in non-SMT mode, which helps with per-thread performance. On a thread vs thread basis, 128C/128T Bergamo is about 60% more performant than 64C/128T Milan.

Edit: If anyone has an issue with me summarizing this info because it technically was behind a paywall, let me know and I can delete it from this thread.

Summarizing what, exactly?
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
FWIW, Charlie from SemiAccurate had some musings on Bergamo about a month ago (not even sure how reputable this info is), but I'll summarize it as follows:
- Bergamo takes same IOD as Genoa, but puts 8 Bergamo CCDs instead of Genoa's 12.
- Each Bergamo CCD has (16) Zen 4c cores but the same 32 MB of L3 as Genoa's CCD.
- The Zen 4c CCD splits its (16) Zen 4c cores into two CCXs, and each CCX shares half of the total L3 (i.e. 2 MB of L3 per Zen 4c core).
- Given that there are two CCXs on each Bergamo CCD, there is likely a latency penalty when a core on one CCX needs to access another core's data on a different CCX, even if that CCX is on the same CCD.
- AMD likely has figured out how to connect (12) memory channels to (8) CCDs given that Milan already handles this situation fine.
- Twice the socket performance of top Milan for all key foundational workloads.
- Bergamo can run in non-SMT mode, which helps with per-thread performance. On a thread vs thread basis, 128C/128T Bergamo is about 60% more performant than 64C/128T Milan.

Edit: If anyone has an issue with me summarizing this info because it technically was behind a paywall, let me know and I can delete it from this thread.
I only have a problem with the fact that you don't do this often enough.
Oh, and thank you! 😊
 

Kepler_L2

Senior member
Sep 6, 2020
338
1,211
106
FWIW, Charlie from SemiAccurate had some musings on Bergamo about a month ago (not even sure how reputable this info is), but I'll summarize it as follows:
- Bergamo takes same IOD as Genoa, but puts 8 Bergamo CCDs instead of Genoa's 12.
- Each Bergamo CCD has (16) Zen 4c cores but the same 32 MB of L3 as Genoa's CCD.
- The Zen 4c CCD splits its (16) Zen 4c cores into two CCXs, and each CCX shares half of the total L3 (i.e. 2 MB of L3 per Zen 4c core).
- Given that there are two CCXs on each Bergamo CCD, there is likely a latency penalty when a core on one CCX needs to access another core's data on a different CCX, even if that CCX is on the same CCD.
- AMD likely has figured out how to connect (12) memory channels to (8) CCDs given that Milan already handles this situation fine.
- Twice the socket performance of top Milan for all key foundational workloads.
- Bergamo can run in non-SMT mode, which helps with per-thread performance. On a thread vs thread basis, 128C/128T Bergamo is about 60% more performant than 64C/128T Milan.

Edit: If anyone has an issue with me summarizing this info because it technically was behind a paywall, let me know and I can delete it from this thread.
So Zen 4 SMT2 gives a more-than-50% perf increase? Isn't that way higher than the current Zen 3 implementation?
 

Abwx

Lifer
Apr 2, 2011
10,953
3,474
136
Probably have to consider clocks. SMT normally isn't great for perf/watt.

There is not much, if any, MT improvement to be extracted from higher frequencies; they have to rely solely on IPC, given that 2x the core count will put them at the same TDP with 128C as with the previous 64C, and that's at the same throughput per core.

Frequency increases power at an almost cubic rate versus throughput, while SMT increases power quasi-linearly versus throughput.
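A toy model of that trade-off: dynamic power scales roughly as C*V^2*f, and if voltage tracks frequency, power grows about as f^3 while throughput grows only as f. SMT is modeled with an assumed ~25% throughput gain for a ~15% power adder; both SMT numbers are illustrative, not measured.

```python
# Toy comparison of the two throughput levers discussed above. The 25% SMT
# uplift and the 15% SMT power adder are illustrative assumptions.

def via_frequency(speedup: float):
    """Assume V scales with f, so dynamic power ~ f^3 while throughput ~ f."""
    return speedup ** 3, speedup          # (power_x, throughput_x)

def via_smt(perf_gain: float = 0.25, power_adder: float = 0.15):
    """SMT: roughly linear power cost for its throughput gain."""
    return 1.0 + power_adder, 1.0 + perf_gain

for label, (power, perf) in (("+25% clocks", via_frequency(1.25)),
                             ("SMT (+25% throughput)", via_smt())):
    print(f"{label:22s} power {power:.2f}x, throughput {perf:.2f}x, "
          f"perf/W {perf / power:.2f}x")
```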
 

Tarkin77

Member
Mar 10, 2018
74
159
106
FWIW, Charlie from SemiAccurate had some musings on Bergamo about a month ago (not even sure how reputable this info is), but I'll summarize it as follows:
- Bergamo takes same IOD as Genoa, but puts 8 Bergamo CCDs instead of Genoa's 12.
- Each Bergamo CCD has (16) Zen 4c cores but the same 32 MB of L3 as Genoa's CCD.
- Zen 4c CCDs splits up the (16) Zen 4c cores into two CCXs, and each CCX shares half of the total L3 (i.e. 2 MB of L3 per Zen 4c).
- Given that there's two CCXs on each Bergamo CCD, it is likely that there is a latency penalty when a core on one CCX needs to access another core's data on a different CCX, even if that CCX is on the same CCD.
- AMD likely has figured out how to connect (12) memory channels to (8) CCDs given that Milan already handles this situation fine.
- Twice the socket performance of top Milan for all key foundational workloads.
- Bergamo can run in non-SMT mode, which helps with per-thread performance. On a thread vs thread basis, 128C/128T Bergamo is about 60% more performant than 64C/128T Milan.

Edit: If anyone has an issue with me summarizing this info because it technically was behind a paywall, let me know and I can delete it from this thread.

You realise that this information is behind the paywall? That's not fair to Charlie.
 

maddie

Diamond Member
Jul 18, 2010
4,746
4,687
136
There is not much, if any, MT improvement to be extracted from higher frequencies; they have to rely solely on IPC, given that 2x the core count will put them at the same TDP with 128C as with the previous 64C, and that's at the same throughput per core.

Frequency increases power at an almost cubic rate versus throughput, while SMT increases power quasi-linearly versus throughput.
Why don't you just say "SMT is one of the most efficient ways of increasing perf/W". :)
 

Saylick

Diamond Member
Sep 10, 2012
3,171
6,404
136
You realise that this information is behind the paywall? That's not fair to Charlie.
I do realize it, which is why I said that if anyone has an issue I can delete the post... Anyway, a student subscription is only $100/year, so y'all are more than welcome to subscribe if you found this info helpful. 😉

No, Bergamo has 2x the performance of Milan with SMT; without SMT it has 60% more performance.
200% / 160% = 1.25x, i.e. ~25%

A 25% uplift from SMT alone. This is in line with the uplift most people expect from enabling SMT.
This is my interpretation of those numbers as well. If anything, the biggest conclusion one can draw is that Zen 4c has roughly the same IPC as Zen 3, assuming Bergamo's clocks are roughly the same as Milan's, i.e. AMD spent all of the 5nm gains on efficiency rather than on raising clocks, so that they could pack more cores into the same TDP.

Thread vs. thread uplift:
128 Zen 4c cores without SMT / (64 Zen 3 cores x 1.25 for SMT) = ~1.60.

Socket vs. socket uplift:
128 Zen 4c cores x 1.25 for SMT / (64 Zen 3 cores x 1.25 for SMT) = ~2.0

Those two numbers are what Charlie reports, so the only way it works out is if the clocks x IPC for Zen 4c are roughly similar to Zen 3.
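The same arithmetic as a tiny script; the 1.25x SMT factor and the equal clocks x IPC assumption are exactly the ones stated above.

```python
# The uplift arithmetic above, spelled out. The 1.25x SMT factor and the
# "same clocks x IPC as Zen 3" assumption come straight from the post.

zen3_cores, zen4c_cores = 64, 128
smt = 1.25                                   # assumed SMT throughput uplift

thread_vs_thread = zen4c_cores / (zen3_cores * smt)            # ~1.6
socket_vs_socket = (zen4c_cores * smt) / (zen3_cores * smt)    # 2.0

print(f"128T Bergamo (no SMT) vs 128T Milan (SMT): {thread_vs_thread:.2f}x")
print(f"socket vs socket, both with SMT:           {socket_vs_socket:.2f}x")
```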
 

Doug S

Platinum Member
Feb 8, 2020
2,267
3,519
136
You realise that this information is behind the paywall? That's not fair to Charlie.


If Saylick is a subscriber (especially if his company pays for it rather than him personally), then yeah, I'd agree. I would assume Charlie requires subscribers to agree not to publicly repost information from his articles.

If, however, Saylick found that information repeated elsewhere by someone else, then it's fair game, IMHO.

I may be annoyed that all of Charlie's best info is behind a paywall, but he's got a right to make a living.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
After having subbed to many of these analysts, I feel Dylan is one of the best.
David Schor is really good but he makes an article once in a blue moon.
SemiAccurate articles are not really technically precise.
So far I have not read anything interesting from SemiAccurate that was not already leaked by the usual suspects, and there is hardly any technical depth in his articles. But of course it is good enough for most investors and investment firms.
Dylan does a much better job of explaining everything from the most basic to the most complex, with enough detail to interest the technical investor and the technical audience, while stopping short of being an outright engineering treatise.
And he covers the entire spectrum of the semi ecosystem, and does some speculation and supply-chain/market intelligence, which is interesting when put in the overall perspective.

If Ian Cutress were to take an analyst role he would be one of the best, but I am not sure he would be doing speculation and such. Tim Morgan, Patrick Moorehead, etc., are too superficial to interest a reasonably technical audience; they cater more to the financial types.

And many folks on this forum have very good technical insights. I can assure you that most of what SemiAccurate has been writing has already been discussed here; it's just that the people who don't subscribe don't know that they can already surmise from open information what SA has learnt.
 

Frenetic Pony

Senior member
May 1, 2012
218
179
116
Would 4 threads per core increase the power efficiency of i9-12900K?

It's a thing I'd actually enjoy seeing benchmarked. SMT was made for latency hiding, and latency has gotten so bad that that the memory hierarchy has become a prime target for optimization. But the other way you can optimize is potentially just go even wider and have more threads.