Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
AVX512 is complex topic that can be split into two parts:

1) The marketing-driven and misguided part, where a core can now do up to 2x 512-bit FMAs and looks great on paper in GFLOPS, or the equally misguided "neural network" pushes with byte multiplications or BF16 support, which provides the unique benefit of accumulating results in a register wider than the operand size, and so on.
This ship has already sailed: the leading supercomputer delivers ~50 GFLOPS per watt, and hint -> that efficiency does not come from CPUs. GPUs already rule this roost, and I think this year NV and/or AMD might ship an SKU that touches 100 TFLOPS on a gaming card.
The game has been over for AVX-512 in this area for quite some time. No one cares about your FMA/IFMA throughput or latency when it is at minimum an order of magnitude slower and less efficient than GPUs.

2) The really useful part of AVX-512 is the amazing instruction set. Despite the support mess that requires Venn diagrams to untangle, it is actually very simple, and by ICL-SP you get some amazing instructions that let you parallelize plenty more algorithms and enable completely new capabilities: one AVX-512 instruction can do what would otherwise require a chain of instructions.
For example, starting from an already very parallel and fast baseline, AVX-512 gains 60% in a JSON parsing library, which is a very important gain if you deal with bulk data in that format.

And there are plenty of hashing, crypto, bit-bashing etc. algorithms that can be made several times faster using the AVX-512 GFNI instructions, which are some of the most flexible vector bit-manipulation instructions.
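To illustrate the "one instruction replaces a chain" point, here is a rough Python model of the semantics of AVX-512's VPCOMPRESSD (the instruction name is real; the Python is only a conceptual sketch, not SIMD code):

```python
# Conceptual model of AVX-512 masked compress (VPCOMPRESSD):
# one instruction packs the selected lanes of a vector to the front,
# replacing a per-element compare/branch/store loop.

def vpcompressd(vec, mask):
    """Model of VPCOMPRESSD: keep lanes whose mask bit is set, packed left."""
    return [v for v, m in zip(vec, mask) if m]

def scalar_filter(vec, predicate):
    """The equivalent chain of scalar operations."""
    out = []
    for v in vec:
        if predicate(v):   # compare + branch per element
            out.append(v)  # store per element
    return out

lanes = [3, 17, 8, 255, 0, 42, 9, 100]        # one 256-bit vector of dwords
mask = [v > 10 for v in lanes]                # produced by a single vector compare
print(vpcompressd(lanes, mask))               # [17, 255, 42, 100]
print(scalar_filter(lanes, lambda v: v > 10)) # same result, many more operations
```

In hardware, the compare and the compress are one instruction each, which is exactly why branchy filtering and parsing loops vectorize so well with AVX-512.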

The fun thing is that this case does not even require 512-bit width; it could be done on 256-bit AVX2 registers perfectly fine, with some of the same speedups and efficiency gains. Intel has realized this to some degree and provides a subset of the very useful stuff on Gracemont.

Fun times ahead with AMD bringing AVX-512 support to all its CPUs; that would really spur adoption and light a fire under Intel.
 

Henry swagger

Senior member
Feb 9, 2022
372
239
86
Not really a die shrink. Based on known Linux patches alone, Zen 4 seems like a big change in many areas besides AVX-512.
L3 block heavily redesigned to support new RAS features, new PMU and new IBS
MPDMA hints also present in SMCA patches

L2 is doubled
L1 and L2 DTLB increased. Dual SDP ports
UAI
57-bit addressing

The IOD is a massive overhaul that needs no introduction: CXL, DDR5, IF 3.0, PCIe 5, NVDIMM-P, out-of-the-box cache-coherent interconnect, etc.


You can blame AMD's messaging for the disappointment.
AMD has been throwing shade at the competitor's IPC increases and at how bloated the competitor's core is: in interviews with Taylor/Hallock, in Hallock's appearances for the Zen3+/6000 release, and probably more beyond what I have watched.
Going by DerBauer's video putting the Zen 4 CCD at ~70mm2, there is a massive >50% gain in MTr even assuming a modest 90MTr/mm2 (compared to >134MTr/mm2 for Apple chips).
>50%... this is a gigantic gain in MTr (Zen 3 was only about a 9% gain in MTr).
Imagine 50 percent more transistors for a measly single-digit gain in performance per clock. I doubt you'd be impressed if you're being honest. The frequency and efficiency gains are for the most part TSMC's achievements, not AMD's.
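A quick sanity check of that >50% figure. This is a sketch: the Zen 3 CCD transistor count below is the commonly cited ~4.15 billion figure for the 7nm die, an assumption not stated in the thread.

```python
# Back-of-envelope check of the ">50% MTr" claim.
# Assumption (not from the post): Zen 3 CCD ~= 4150 MTr,
# the commonly cited transistor count for the 7nm die.
zen3_mtr = 4150                 # MTr
zen4_area = 70.0                # mm^2, DerBauer's measurement
density = 90                    # MTr/mm^2, the "modest" N5 assumption
zen4_mtr = zen4_area * density  # 6300 MTr
gain = zen4_mtr / zen3_mtr - 1
print(f"estimated Zen 4 CCD: {zen4_mtr:.0f} MTr, gain vs Zen 3: {gain:.0%}")
```

Which lands right around 52%, consistent with the "massive >50%" described above.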

While I will reserve my disappointment/opinion/etc. on the Zen 4 core until the launch, the messaging is not very consistent, which does not inspire confidence, hence all this discussion we are having.
However, I will not deny that I enjoyed seeing some AMD evangelists squirm at the reveal :)

That being said, there could be more than what Hallock is allowed to reveal.
They cannot just throw in 2 billion additional transistors that do nothing. I feel there could be a lot more INT improvement than FP this time around, but we shall see.
Hallock loves throwing shade at Intel when you hear his interviews 😏
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
Imagine Intel having done the same in ADL and not even enabling it for desktop...
... except when you check Intel's AVX-512 implementation, you can see it is not a huge uptick in die area, and not remotely close to an additional 50% of silicon real estate.
So they don't care if it is disabled, even though they don't share CCDs like AMD does. Skylake's AVX2 vs AVX-512 implementations are a good reference.
 

leoneazzurro

Senior member
Jul 26, 2016
930
1,465
136
... except when you check Intel's AVX-512 implementation, you can see it is not a huge uptick in die area, and not remotely close to an additional 50% of silicon real estate.
So they don't care if it is disabled, even though they don't share CCDs like AMD does. Skylake's AVX2 vs AVX-512 implementations are a good reference.

But that is because AMD reuses the CCDs in both server and desktop parts (apart from seemingly having a better implementation, instruction-wise, than Intel). In ADL's case it's really wasted silicon. In AMD's case, it's a niche feature consuming die space, but it may actually get used (if AMD does not fuse it off, which would be idiotic at this point), and it is reused in servers. Also, I would wait to see what Zen 4 really is.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
But that is because AMD reuses the CCDs in both server and desktop parts (apart from seemingly having a better implementation, instruction-wise, than Intel). In ADL's case it's really wasted silicon. In AMD's case, it's a niche feature consuming die space, but it may actually get used (if AMD does not fuse it off, which would be idiotic at this point).
The point was about consuming the extra area afforded by N5 mostly for AVX512, not about discussing the obvious (CCD sharing b/w desktop and server or whether AVX512 is relevant).
My actual post was that there is much more to the increase in MTr than just AVX512.
 

leoneazzurro

Senior member
Jul 26, 2016
930
1,465
136
The point was about consuming the extra area afforded by N5 mostly for AVX512, not about discussing the obvious (CCD sharing b/w desktop and server or whether AVX512 is relevant).
My actual post was that there is much more to the increase in MTr than just AVX512.

That is true, but as said we have the increased L2, a modified L1, possibly some changes in L3 as well, and who knows what else. And higher clocks, which everyone seems to credit only to the process, but which I think don't come for free in transistor count either.
 

moinmoin

Diamond Member
Jun 1, 2017
4,952
7,666
136
Over Zen 2, the redesigned Zen 3 core managed to improve performance by more than it increased MTr, most likely because both are on the same node. But I'd still expect AMD to aim for similar improvements in Zen 4, even if only in some selective rather than general workloads. It's obvious that aside from the leaked (and now officially confirmed through the announced NN/AI support) AVX-512 implementation and the increased L2$ size, we still know very little about the next Zen core.

AMD is already dominating Intel in most server workloads. Adding even more cores would let AMD dominate those workloads a bit more, but Intel would still have the AVX 512 niche all to themselves.
AMD does both, with Zen 4 introducing AVX-512 and likely some other significant changes not yet known, for 50% more MTr.
Genoa brings 50% more cores, while Bergamo adds the same amount again on top.

It's also possible that AMD made a purposeful density tradeoff to chase clock speeds.
That's already the case for all non-mobile Zen chips. That's why Apple chips have significantly higher density on the same node as AMD's non-mobile chips.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,684
1,268
136
That's already the case for all non-mobile Zen chips. That's why Apple chips have significantly higher density on the same node as AMD's non-mobile chips.

And it could be even more the case for Zen 4. Frequency vs density is not binary; it's a gradient. When you have a die that goes all-in on density (Zen 4c) anyway, it makes sense for your other die to push further toward frequency than your previous generation did.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,774
3,153
136
Has the pipeline depth remained the same from Zen to Zen 3? Maybe AMD has increased the pipeline depth in Zen 4 to accommodate the higher frequencies.
Unlikely. If that were going to happen, it would be Zen 3 or Zen 5. You don't just change pipeline length, because you have to refactor what gets done in each stage. Increasing execution pipeline length (more cycles per op) would be terrible for performance, and unless they increase complexity in the front end (more decode or dispatch), why would they lengthen the pipeline? Increasing pipeline length around retirement and store seems pointless, because those aren't really on the critical path for performance.

Going from bog-standard 7nm to custom 5nm more than explains a ~10-25% clock-rate increase; the apparently much higher MT clock rates are the giveaway.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
Zen 4 seems like a die shrink for the most part. The only major changes are the increased L2 cache (it seems they doubled the number of sets) and the addition of AVX-512 support. The IO die getting built-in graphics is also a pretty big deal. A lot of work has gone into Zen 4, but it's just not the same scale of change we saw in the transition from Zen 2 to Zen 3.

From the AT interview with Mike Clark last fall, I get the impression that Zen 5 is more of a fresh design. Mike doesn't outright say what's changing, but the conversation topics do hint at it. He also states that their high-level design approach is to create a ground-up redesign, then a derivative that improves on it; because a second derivative doesn't have as much room for improvement, they then start a fresh design based on lessons learned.

From this we should expect that Zen 5 may have some radical design changes. I'm expecting the front end to have seen some serious work, because it's difficult to keep making a wider design with more execution units if you can't keep them all fed.
Honestly, I would be surprised by a redesign. It seems like there is still plenty they can do with the current architecture.
Psst, Zen 4 is Genoa too, ya know.

I am curious what clocks for Genoa will look like. I'm not expecting much from the 96-core part, but we should see similar gains in the other parts.
 

Mopetar

Diamond Member
Jan 31, 2011
7,842
5,996
136
Honestly, I would be surprised by a redesign. It seems like there is still plenty they can do with the current architecture.

There'll probably be some substantial changes to the front end, and I wouldn't be too surprised if the design of the CCX is modified in some way either. I also expect the cache designs to be developed around most chips having some form of V-Cache, which I think will be far more commonplace in future chips. Perhaps we even see something similar to the rumored RDNA3 design, where the LLC sits entirely on a separate chiplet.
 

naad

Member
May 31, 2022
63
176
66
I'm of the opinion that the performance increases in the core, at least the ones we know of so far, don't match the extra silicon budget from the new process; however, that budget can be used for more than just scalar or SIMD performance.

I would like to see a GMI link from chiplet to chiplet, which, if used on an active interposer, might enable significantly higher cache per core. That's just a dream of mine for now; you would have to include some kind of system agent on the CCD to enable proper communication with the other chiplet, and somebody has probably already done the math on that.

Now for the extra xtors:

Various sensors, monitors, state trackers and telemetry in general: quite a few of their slides show these increasing gen over gen, and they can directly improve power management and wake-up/clock-ramp latency. I already said I expect Zen 4 to significantly improve the V/F curve in the 4-5GHz range, and this doesn't come free in area.
There are already a few rumors of Genoa clocking 10% higher than Milan, and that's with 50% more cores and probably not even QS silicon yet.
What's SP5's package power? 400W? Compared to 280W for SP3? This is looking very good, much more so than desktop.

Taller fins are needed for the large clock increases; everyone knows HP libraries don't come close to HD libraries when it comes to density in the real world.

Dark silicon, everyone's (un)favorite

Though I doubt it, there could be more analog circuitry in the CCD, and analog simply doesn't scale.
All we have now is the die size; for all we know, when someone opens it up, the core area might not be as big as we expect.
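As a rough check on those SP5 vs SP3 numbers, here is a crude throughput-per-watt proxy (cores times relative clock per package watt). The wattages and the +10% clock figure are the rumors quoted above, and the proxy deliberately ignores IPC changes:

```python
# Crude perf/W proxy for the rumored SP5 vs SP3 figures: cores x relative
# clock per package watt. Ignores IPC gains, so it understates Zen 4 if
# anything. The 280 W / 400 W and +10% clock numbers are rumors, not specs.
milan = {"cores": 64, "clock": 1.00, "watts": 280}
genoa = {"cores": 96, "clock": 1.10, "watts": 400}

def perf_per_watt(p):
    return p["cores"] * p["clock"] / p["watts"]

ratio = perf_per_watt(genoa) / perf_per_watt(milan)
print(f"Genoa vs Milan throughput/W (proxy): {ratio:.2f}x")
```

That works out to roughly a 15% perf/W gain before counting any IPC improvement, which is why the server picture looks better than desktop here.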
 

CakeMonster

Golden Member
Nov 22, 2012
1,392
501
136
I know the designs for Z4 and Z5 were started several years back, but are there any recent trends in CPU usage and demand that we might use to predict how well the upcoming designs will work out for regular people?

CPU mining never really became a thing, and the various SETI/research apps seem niche. There is certainly ML, but it hasn't taken off for the masses; that could still happen, but it will probably be driven more by the GPU. The only recent hype I can think of is AI upscaling for video and images, and even though that is GPU-accelerated it's still quite heavy on the CPU (even with my 12c). Anything else that could be the next big thing for the masses?
 

Mopetar

Diamond Member
Jan 31, 2011
7,842
5,996
136
Though I doubt it, there could be more analog circuitry in the CCD, and analog simply doesn't scale.

From what we know there shouldn't be. Going forward there might be even less, if AMD starts using other techniques and bridge chiplets, as they're rumored to be doing with RDNA3 for Navi 31.

All we have now is the die size; for all we know, when someone opens it up, the core area might not be as big as we expect.

I don't know how accurate they are, but someone posted some mock-ups based on what is known, and the extra space seems to be used mainly by the extra L2 cache and the hardware for handling AVX-512.

CPU mining never really became a thing, and the various SETI/research apps seem niche. There is certainly ML, but it hasn't taken off for the masses; that could still happen, but it will probably be driven more by the GPU. The only recent hype I can think of is AI upscaling for video and images, and even though that is GPU-accelerated it's still quite heavy on the CPU (even with my 12c). Anything else that could be the next big thing for the masses?

Even phone SoCs have dedicated hardware for the more ML-oriented tasks like image processing. Anything that benefits well enough from specialized hardware gets it eventually depending on the need and the transistor budget.

I'm not really sure the masses need powerful new hardware. Plenty of them are already fine using a mobile phone or a tablet as their primary device. Casual internet usage isn't too demanding, especially if you block the ads and the bloated JavaScript.
 
Jul 27, 2020
16,332
10,345
106
Anything else that could be the next big thing for the masses?
Natural language processing using AI: telling computers what we need them to do, like it was originally supposed to be. The AI would understand that ambiguity may lead to wrong results or even disastrous consequences, and would have the intelligence to ask additional clarifying questions to make sure it does only what we want it to do.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,608
5,816
136
I don't think cache + AVX-512 takes up the >50% MTr gain.
DerBauer's measurement puts the Z4 CCD at 70.35mm2 including just the die itself (excluding the filler); subtract ~2mm2 of that for scribe lines and you get 68.35mm2 of actual logic area.
Assuming a very modest 90MTr/mm2, that is a gain of around 2000 MTr vs the Z3 CCD (for a CCD with a similar layout of SRAM/logic blocks to the pic below).

[attached image: CCD die shot with SRAM/logic blocks]

On Zen 3 the FPU is 11% of the CCD, 9.7mm2. A straight shrink at 1.8x (TSMC's value) puts it at 5.4mm2, or 8% of the 68.35mm2 Z4 CCD.
On Intel CPUs, the server version of the FPU with AVX-512 is about 1.35x~1.45x the area of the client version without it.
Even if AMD were to spend 1.75x more MTr to implement AVX-512, that is 9.6mm2, or 13% of the 68.35mm2 Z4 CCD (vs 11% on the Z3 CCD).
We can estimate that supporting AVX-512 on one port would consume only ~18%-25% of the extra MTr, depending on how well they implement it (the leaked manual indicates no change in the number of execution ports).
If AMD's AVX128 -> AVX256 transition is any indication, it will be fairly optimized.

[attached image]

L2 on Z3 is 6.7mm2, or 8% of the CCD. Applying a 1.35x shrink (TSMC's value for SRAM on N5) and doubling the capacity raises it to 9.9mm2, or 14% of the Z4 CCD.
We can estimate the L2 increase to consume ~22% of the MTr gain.

That leaves around 50% of the MTr gain available for something else; like mentioned in this post, there are many candidates.
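The accounting above can be reproduced directly. This is a sketch using the post's own area figures; the Zen 3 CCD transistor count of ~4150 MTr is the commonly cited figure, not a number from this thread:

```python
# Reproduce the MTr accounting above, using the post's own numbers.
zen3_mtr = 4150      # MTr, commonly cited Zen 3 CCD figure (assumption)
area = 70.35 - 2.0   # mm^2: measured die minus ~2 mm^2 of scribe lines
density = 90         # MTr/mm^2, the "very modest" N5 assumption
gain_mtr = area * density - zen3_mtr
print(f"MTr gain: ~{gain_mtr:.0f}")                         # ~2000

# FPU: 9.7 mm^2 on Zen 3; plain 1.8x shrink vs a 1.75x-bigger AVX-512 FPU
fpu_shrunk = 9.7 / 1.8                                      # ~5.4 mm^2
avx512_extra = (fpu_shrunk * 1.75 - fpu_shrunk) * density   # extra MTr

# L2: 6.7 mm^2 on Zen 3; 1.35x SRAM shrink, then doubled capacity
l2_shrunk = 6.7 / 1.35
l2_extra = (l2_shrunk * 2 - l2_shrunk) * density            # extra MTr

print(f"AVX-512 share of the gain: {avx512_extra / gain_mtr:.0%}")  # ~18%
print(f"L2 share of the gain:      {l2_extra / gain_mtr:.0%}")      # ~22%
left = 1 - (avx512_extra + l2_extra) / gain_mtr
print(f"left for everything else:  {left:.0%}")
```

Under these assumptions well over half the transistor gain is still unaccounted for, consistent with the conclusion that around 50% or more is left for other changes.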
On top of that, there is supposed to be a second SDP port using an additional IFOP macro, which would take up a good chunk of die area as well.
One thing that Hallock mentioned in the stream with HotHardware (which I did not see in the initial Computex briefing) is that the 15% improvement is only for CB23, which is mainly FP; since there are no additional FP execution units/ports, it makes sense.

One thing to note is that AMD is quoting a higher density gain than TSMC (2x density).
This can be attributed to the finer M0/M1 lines brought by N5 EUV, which can offset the extra tracks AMD added to make N7 hit 5GHz.
Additionally, the FPU PRF uses flip-flops rather than SRAM, so density is quite high for the changed FPU block.

While AMD's absolute density cannot be on par with TSMC's quoted values, the relative gain should be similar (e.g. AMD's density is behind the advertised values on N7 and will be behind on N5 as well, but AMD's gain from the N7 -> N5 transition should be comparable to TSMC's quoted gains for the specific IP blocks).

Regarding high clocks, the x86 pipeline is deep enough to sustain them without any redesign.
The current clock tree and core macro designs should already be robust enough to handle 5.5+GHz, since the CPUs already work normally at 5.1GHz.
The smaller core layout should in fact help with higher clocks, because wire delay is smaller.
I think the main thing AMD/TSMC achieved (besides faster transistor switching speeds) to sustain 5.5+GHz is improved RC characteristics and parasitics, which are the main culprits for signal delay at gate inputs at higher frequencies.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,774
3,153
136
18% IPC on top of that clock increase would be much more than the 15% performance AMD quoted; Zen 4 would be a monstrous uplift over Zen 3. No way.
There is also the much-feared Osborne effect.

But CB, being an FPU workload that fits in L2, might actually see little "IPC" benefit. If load/store keeps the same number of ports, queue sizes grow only incrementally, decode stays the same, and the FPU for SSE/AVX1&2 stays the same, you could have a bunch of improvements in the core that have little effect on CB IPC.

So there could be a big improvement for complex branches/mispredicts. There could be more rework of the integer pipeline, or more flexibility in the load and store ports or store-to-load forwarding, etc. Then there are the increased L2 and TLBs, which mainly play out on the scalar side or in things with higher miss rates. Or there could be none of those things, and ~7% average IPC across the board is real.

TLDR just @Markfw post with more dribble :)
 