Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

Untitled2.png


What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
View attachment 55872

Made a quick estimation of the probable efficiency gains using AMD's announced process characteristics for Zen4 and Zen 2 N7 data from the ISSCC presentation. Data is for 4C/8T.
N5 certainly does make those efficiency numbers look sweet up until 4 GHz
Nice work, too bad AMD doesn't have a chart with Zen3. Zen 2->Zen4 looks huge, but from Zen3, it will be a bit less impressive. Curious to see how the uArch changes this (depending on what AMD actually was giving away in their presentation).
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
Nice work, too bad AMD doesn't have a chart with Zen3. Zen 2->Zen4 looks huge, but from Zen3, it will be a bit less impressive. Curious to see how the uArch changes this (depending on what AMD actually was giving away in their presentation).

Strong probability of a > 20% uplift in single core performance. (uplift, not IPC)
Good probability of a 25% uplift.
A 52% uplift over Zen 2 would let them pat themselves on the back (they could compare it to the original Ryzen launch) and is totally achievable. (Zen 3 is 10-25% faster in many workloads vs. Zen 2 for reference).
 

moinmoin

Diamond Member
Jun 1, 2017
4,933
7,619
136
Nice work, too bad AMD doesn't have a chart with Zen3. Zen 2->Zen4 looks huge, but from Zen3, it will be a bit less impressive. Curious to see how the uArch changes this (depending on what AMD actually was giving away in their presentation).
The impression of it being less impressive likely depends on what you focus on. Zen 3 fares much better at reaching higher frequencies than Zen 2 did, but it may actually have regressed a little at lower frequencies efficiency wise.

But yeah, indeed too bad AMD doesn't have a chart with Zen 3.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Nice work, too bad AMD doesn't have a chart with Zen3. Zen 2->Zen4 looks huge, but from Zen3, it will be a bit less impressive. Curious to see how the uArch changes this (depending on what AMD actually was giving away in their presentation).
Zen 3 did not do much process wise (at least from efficiency perspective). I read somewhere it has two more metal layers and slightly larger metal pitches (needed for higher clocks) but that is about it. Been hunting for that source of info since forever.

The interesting thing, though, is that this chart depicts the HPC-optimized process for server and desktop. Mobile Zen 4/Bergamo should have slightly better efficiency.
A Van Gogh successor is really going to shine.

This makes me think that, going forward (e.g. N3), deeply pipelined architectures might actually get extra leg room, being able to hit far higher clocks with a much smaller penalty than in the past.

Strong probability of a > 20% uplift in single core performance.
Good probability of a 25% uplift.
A 52% uplift over Zen 2 would let them pat themselves on the back (they could compare it to the original Ryzen launch) and is totally achievable. (Zen 3 is 10-25% faster in many workloads vs. Zen 2 for reference).
The bigger question is what AMD is doing with so many more transistors.
[Zen 3] 4150 MTr --> [Zen 4] 7000 - 7500 MTr, between 70% and 80% more transistors. (Assuming the 72.225 mm2 CCD is right (and it is, we have the "X-Ray" of the CPU) and AMD's public figure of 2x density improvement over the Zen 2/3 process)
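
As a quick sanity check on that range, here is a minimal Python sketch that uses only the figures quoted in this post (4150 MTr for Zen 3, the 7000 - 7500 MTr guess for Zen 4, and the 72.225 mm2 CCD). The function and variable names are made up for illustration.

```python
def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0

ZEN3_MTR = 4150.0                   # Zen 3 CCD transistor count (MTr), as quoted
ZEN4_MTR_RANGE = (7000.0, 7500.0)   # estimated Zen 4 CCD range (MTr), as quoted
ZEN4_CCD_MM2 = 72.225               # Zen 4 CCD area (mm2), as quoted

# Growth over Zen 3 for the low and high end of the estimate.
growth = [percent_increase(ZEN3_MTR, m) for m in ZEN4_MTR_RANGE]
# Average density implied by the estimate and the CCD area.
density = [m / ZEN4_CCD_MM2 for m in ZEN4_MTR_RANGE]

print("growth: %.1f%% to %.1f%%" % tuple(growth))
print("implied density: %.1f to %.1f MTr/mm2" % tuple(density))
```

That works out to roughly 69% to 81% more transistors, and an average of about 97 to 104 MTr/mm2 across the whole CCD.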
 

yuri69

Senior member
Jul 16, 2013
373
573
136
The bigger question is what AMD is doing with so many more transistors.
[Zen 3] 4150 MTr --> [Zen 4] 7000 - 7500 MTr, between 70% and 80% more transistors. (Assuming the 72.225 mm2 CCD is right (and it is, we have the "X-Ray" of the CPU) and AMD's public figure of 2x density improvement over the Zen 2/3 process)
The same situation (or even better?) was present during the Zen 1/14nm -> Zen 2/7nm transition.

For Zen 2 cores AMD doubled the FPU width, doubled L3 size, switched to a fat BPU, widened a few things/buffers, and switched to a new GMI. This resulted in a ~15% IPC uplift (and ability to scale to 64c).

For Zen 4 cores AMD is gonna double the FPU width(?), double L2 size, widen a few things/buffers, and switch to a new GMI. On the OoO front, the Gigabyte leak hasn't really promised anything as balls-to-the-wall as Golden Cove. So the transistors will most likely be eaten by the FPU and L2 size increases.

Besides that, AMD will support many new features like Infinity Architecture 3, CXL, various RAS thingies, updated SEV, etc. All of those will eat into the transistor budget of either the CCD or the IOD.

----

...a wild theory would speculate about transistors being spent on the frequency optimizations a la Intel... Since, you know, the demoed desktop ES was reportedly running at all-core 5GHz. But reality will surely not be that wild.
 

Abwx

Lifer
Apr 2, 2011
10,847
3,297
136
Since, you know, the demoed desktop ES was reportedly running at all-core 5GHz. But reality will surely not be that wild.

So they displayed something that would never be released?
It could be an 8C/16T running at full 125W TDP.

FTR at isoperf vs a 5950X a Zen 4 based 16C/32T will consume something like 40W, at isofrequency it is rumoured to have at least 20% better perf and will consume 65-70W.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
I still hope it's more interesting stuff than just AVX512. :(
Doubling the FPU is going to add only 9% more MTr. Also, the FPU RF uses mainly flops, not SRAM, so density should be quite high.
Doubling the L2 adds another 9% more MTr. And L3 is same.
Bulk of the other known changes are in sIOD: DDR5 UMC, CXL, GenZ, MPIO, MPDMA, PCIe5, IFIS 3.0 PHY, +4 DDR5 UMC, new NB, SCM

What the remaining 50%-60% more MTr goes to is totally unknown.
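
The bookkeeping behind that leftover can be sketched in a few lines of Python. This assumes the two 9% figures are measured against the same Zen 3 baseline as the 70%-80% total growth estimate discussed earlier in the thread; the names are hypothetical.

```python
def unexplained_share(total_growth_pct, known_items_pct):
    """Subtract known contributions from total growth (all in % of the Zen 3 count)."""
    return total_growth_pct - sum(known_items_pct)

# Contributions as quoted in the post; assumed to share the Zen 3 baseline.
KNOWN = {"FPU doubling": 9.0, "L2 doubling": 9.0}
TOTAL_RANGE = (70.0, 80.0)  # total transistor growth estimate, in percent

remaining = [unexplained_share(t, KNOWN.values()) for t in TOTAL_RANGE]
print(remaining)  # [52.0, 62.0]
```

So with the CCD-side changes accounted for, a bit over half of the Zen 3 transistor count is still unexplained growth.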
 

eek2121

Platinum Member
Aug 2, 2005
2,904
3,906
136
So they displayed something that would never be released?
It could be an 8C/16T running at full 125W TDP.

FTR at isoperf vs a 5950X a Zen 4 based 16C/32T will consume something like 40W, at isofrequency it is rumoured to have at least 20% better perf and will consume 65-70W.

Not only that, but leaks confirmed that there is a possibility of a 170W SKU. However, these chips are on N5, and tied together with the rest of AMD's marketing slides, I can 100% believe that the cores were all running at 5 GHz for this workload. That would be a 19% increase over what my 5950X does for this workload.
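
For reference, a tiny Python sketch of what that 19% implies, assuming the uplift is read purely as an all-core clock difference (an assumption; the helper name is made up):

```python
def implied_baseline_ghz(new_ghz, uplift_pct):
    """Clock that new_ghz is uplift_pct percent higher than."""
    return new_ghz / (1.0 + uplift_pct / 100.0)

# 5 GHz all-core on the demo chip, 19% above the 5950X in this workload.
baseline = implied_baseline_ghz(5.0, 19.0)
print(round(baseline, 2))  # 4.2
```

In other words, the comparison point is a 5950X holding about 4.2 GHz all-core in that workload.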
 

yuri69

Senior member
Jul 16, 2013
373
573
136
So they displayed something that would never be released?
It could be an 8C/16T running at full 125W TDP.

FTR at isoperf vs a 5950X a Zen 4 based 16C/32T will consume something like 40W, at isofrequency it is rumoured to have at least 20% better perf and will consume 65-70W.
Is it really all about the power requirements? I guess the power delivery quality and quantity surely have a role in that but still.

Zen 2/3 don't really raise clocks easily even with high voltage, compared to Intel's classic designs. Is 5nm alone gonna change this?
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
Is it really all about the power requirements? I guess the power delivery quality and quantity surely have a role in that but still.

Zen 2/3 don't really raise clocks easily even with high voltage, compared to Intel's classic designs. Is 5nm alone gonna change this?

It really depends on the reason why Zen 2/3 didn't scale well beyond a certain clock speed. If it's a matter of process there's a possibility that the TSMC 5nm node doesn't have those limitations. If it's a matter of design, it's possible that AMD has made adjustments to Zen 4 to allow for that. It's certainly likely that some of the transistor budget would have gone towards the largest architectural bottlenecks present in previous iterations of Zen. Some of that could be to address limitations in clock speed those earlier chips faced.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,743
3,071
136
Doubling the FPU is going to add only 9% more MTr. Also, the FPU RF uses mainly flops, not SRAM, so density should be quite high.
Doubling the L2 adds another 9% more MTr. And L3 is same.
Bulk of the other known changes are in sIOD: DDR5 UMC, CXL, GenZ, MPIO, MPDMA, PCIe5, IFIS 3.0 PHY, +4 DDR5 UMC, new NB, SCM

What the remaining 50%-60% more MTr goes to is totally unknown.
blow everyones mind and its a 12 core ccx


yes i know thats not going to happen
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
blow everyones mind and its a 12 core ccx


yes i know thats not going to happen

But we can go ahead and confirm it again :)
1642195220362.png


But perhaps I can depict it in a more pictorial way

1642196945068.png

L3 being same, even though huge, would really drive up the Xtor count everywhere else.

One thing you will notice is that the die size increase from Zen 2 to Zen 3 is actually less than 8% if we discard the 4x long strips of silicon area allocated to TSVs.
The increased size was in the FPU due to native AVX256 support and a small bump in L2.
Core itself did not increase much overall (AMD did cut the BTB size in Zen 3 a bit vs Zen2)
Worst case 90MTr/mm2 is basically crappy density from a process targeting 191 MTr/mm2 (Apple got 134 MTr/mm2 from N5, and 82 MTr/mm2 from N7)
Zen2 and Zen3 data from Wikichip/Die shots
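
To put the density figures side by side, here is a small Python sketch. All numbers are taken as quoted in this post (including the 191 MTr/mm2 process target); the names are only illustrative.

```python
N5_TARGET = 191.0  # MTr/mm2 process target, as quoted in the post

# Achieved average densities in MTr/mm2, as quoted in the post.
ZEN4_WORST_CASE_N5 = 90.0
APPLE_N5 = 134.0
APPLE_N7 = 82.0

def fraction_of_target(density, target=N5_TARGET):
    """Share of the process target density actually achieved."""
    return density / target

zen4_frac = fraction_of_target(ZEN4_WORST_CASE_N5)
apple_frac = fraction_of_target(APPLE_N5)
apple_n7_to_n5 = APPLE_N5 / APPLE_N7  # Apple's own N7 -> N5 density scaling

print("Zen 4 worst case: %.0f%% of target" % (zen4_frac * 100))
print("Apple on N5:      %.0f%% of target" % (apple_frac * 100))
print("Apple N7 -> N5:   %.2fx" % apple_n7_to_n5)
```

With these inputs the worst case for Zen 4 lands at about 47% of the target, against about 70% for Apple, whose own N7 to N5 scaling comes out around 1.63x.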
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,743
3,071
136
AMD did increase FPU resources for Zen3, but Zen2 has one-pass 256-bit AVX support as well. The original Zen also has AVX256 support, but only 128-bit execution units, so execution speed is the same for AVX128 and AVX256.
The thing is, Zen3 really didn't: on both the scalar and SIMD sides they added more execution pipelines but kept the same number of read/write ports to the register files.
 

Saylick

Diamond Member
Sep 10, 2012
3,084
6,184
136
But we can go ahead and confirm it again :)
View attachment 55991


But perhaps I can depict it in a more pictorial way

View attachment 55994

L3 being same, even though huge, would really drive up the Xtor count everywhere else.

One thing you will notice is that the die size increase from Zen 2 to Zen 3 is actually less than 8% if we discard the 4x long strips of silicon area allocated to TSVs.
The increased size was in the FPU due to native AVX256 support and a small bump in L2.
Core itself did not increase much overall (AMD did cut the BTB size in Zen 3 a bit vs Zen2)
Worst case 90MTr/mm2 is basically crappy density from a process targeting 191 MTr/mm2 (Apple got 134 MTr/mm2 from N5, and 82 MTr/mm2 from N7)
Zen2 and Zen3 data from Wikichip/Die shots
Do you have an estimate of the Zen 4 core die size in comparison to Zen 2 and Zen 3? With and without L2?

I was going to take your transistor counts and convert them to mm2 but I figured it would be easier for you since you had the numbers on hand. :)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
Do you have an estimate of the Zen 4 core die size in comparison to Zen 2 and Zen 3? With and without L2?
Zen 2 : 3.68 mm2 and 2.96 mm2 [w/o L2] --> official data from ISSCC.
Zen 3 : 3.98 mm2 and 3.1 mm2 [w/o L2]. (Around 2-3 mm2 of area went to TSVs) --> Die shot
Zen 4 : 4.2-4.6 mm2 and 3.0-3.2 mm2 [w/o L2] --> rough estimate.
Estimating area involves quite a bit of guesswork, because the MTr value used in the chart above is an average, while density is higher for logic. (Zen 4 without L3 will be bigger than Zen 3 without L3, though of course Zen 4 density will be much higher.)
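
For anyone who wants the L2 share spelled out, a minimal Python sketch that subtracts the "w/o L2" figure from the full core figure. It pairs the low end with the low end and the high end with the high end of the Zen 4 ranges, which is an assumption; the helper name is made up.

```python
def l2_area(with_l2, without_l2):
    """Area attributed to L2 in mm2: core with L2 minus core without L2."""
    return round(with_l2 - without_l2, 2)

zen2 = l2_area(3.68, 2.96)       # ISSCC figures quoted above
zen3 = l2_area(3.98, 3.10)       # die-shot figures quoted above
zen4_low = l2_area(4.2, 3.0)     # low end of the rough estimate
zen4_high = l2_area(4.6, 3.2)    # high end of the rough estimate

print(zen2, zen3, zen4_low, zen4_high)  # 0.72 0.88 1.2 1.4
```

So the L2 portion goes from 0.72 mm2 on Zen 2 to 0.88 mm2 on Zen 3, and to roughly 1.2-1.4 mm2 in the Zen 4 estimate.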
 

MadRat

Lifer
Oct 14, 1999
11,908
228
106
Would there ever be situations where a modern CPU needs information not in cache where it would be faster to send both the call for the information and the workload out of the main CPU to get that function done sooner and moved to where it is needed in fewer cycles than simply waiting for that information to come to the CPU from main memory? Or, worse yet, calling information from slow storage? That was kind of the idea of having audio processors, graphics cards, etc. With CPUs going monolithic like they've become, I'd think that it may open up room for external co-processing again.
 

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Would there ever be situations where a modern CPU needs information not in cache where it would be faster to send both the call for the information and the workload out of the main CPU to get that function done sooner and moved to where it is needed in fewer cycles than simply waiting for that information to come to the CPU from main memory? Or, worse yet, calling information from slow storage? That was kind of the idea of having audio processors, graphics cards, etc. With CPUs going monolithic like they've become, I'd think that it may open up room for external co-processing again.
Sure, fixed function units. Intel IGP and AMD APU include such things as A/V encode/decode blocks that accelerate those functions while using less power. Oh, they aren't external, plenty of room on chip for these (very small on die footprint).
 

Mopetar

Diamond Member
Jan 31, 2011
7,797
5,899
136
Would there ever be situations where a modern CPU needs information not in cache where it would be faster to send both the call for the information and the workload out of the main CPU to get that function done sooner and moved to where it is needed in fewer cycles than simply waiting for that information to come to the CPU from main memory? Or, worse yet, calling information from slow storage? That was kind of the idea of having audio processors, graphics cards, etc. With CPUs going monolithic like they've become, I'd think that it may open up room for external co-processing again.

Not really. If you were going to do something like that in the first place, it's a case where dedicated hardware (video decode, etc.) is faster anyway, so you wouldn't put the workload on the CPU.

The problem is that you don't know you have a cache miss until you don't find the data you wanted in the cache. If it's not in L1 you move on to look in L2, and then the L3 cache. At this point you can either load it in from memory (and into all of the various caches along the way) or signal something else to do it, which takes time. It would need to load that same data from memory, incur the same cost of time to do so, and then assuming the CPU needs the result, have a way to send it back while bypassing the entirety of the memory hierarchy, which makes the design a lot more complex.

If you've already got some dedicated hardware to handle the work you wouldn't have the CPU do it since even if there weren't any slowdowns due to cache misses, the dedicated hardware will do the work faster and for less power use.
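
A toy Python model of the argument above, for illustration only. The cycle counts are invented placeholders, not measurements of any real CPU; the point is just that the offload path pays for miss detection and the memory access too, plus the signalling and the trip back.

```python
# Hypothetical latencies in cycles; every value here is an assumption.
LAT = {"l1": 4, "l2": 14, "l3": 46, "dram": 250, "signal": 100, "send_back": 100}

def miss_detect(lat):
    """Cycles spent discovering the miss: L1, then L2, then L3."""
    return lat["l1"] + lat["l2"] + lat["l3"]

def local_cost(lat, compute):
    """CPU waits for memory itself, then does the work."""
    return miss_detect(lat) + lat["dram"] + compute

def offload_cost(lat, compute):
    """CPU detects the miss, signals a helper, helper loads the same data,
    does the work, and sends the result back."""
    return (miss_detect(lat) + lat["signal"] + lat["dram"]
            + compute + lat["send_back"])

def offload_wins(lat, local_compute, remote_compute):
    """True only if the helper's compute saving beats the added overhead."""
    return offload_cost(lat, remote_compute) < local_cost(lat, local_compute)

# Same compute time on both sides: offloading only adds overhead.
print(offload_cost(LAT, 50) - local_cost(LAT, 50))  # 200
print(offload_wins(LAT, 50, 50))                    # False
# A helper that is much faster at the work itself can still come out ahead.
print(offload_wins(LAT, 1000, 100))                 # True
```

In this model the only way offloading wins is when the helper is much faster at the work itself, which is the dedicated-hardware case where you wouldn't have put the workload on the CPU to begin with.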