Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
821
1,457
136
Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,073
3,897
136
They will be able to scale this new core for several generations - THEY ALREADY HAVE IT, while AMD is at the end of the road for the Zen architecture.
So you agree then that Intel has gotten, relative to everyone else, abysmal results from its P-core (GC) for the resources it deploys? Also, it isn't the "end of the road": much of Zen 4 can remain or be evolved upon in Zen 5. The three main areas I can see that need to grow are:

PRF, both in terms of ports and size
Decode throughput
An increase in load/store ports

Any decode-width increase probably needs to be "smart": some form of clustering/splitting to keep power low while preserving clock scaling, etc. AMD also has patents around flushing decoded instructions to a cache, etc.

The PRF would be interesting. If it's just a fifth set of read/write ports then they might be able to brute-force it, but if they go wider in Zen 5 they would need something smart there. How does Apple achieve 8-wide with a ~600-entry ROB? 2x 4-wide clusters?
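The clustered-decode idea can be sketched with a toy throughput model. This is purely illustrative: the round-robin block assignment and 4-wide cluster width are assumptions, loosely in the style of Intel's Tremont/Gracemont clustered decoders, not a description of Apple's or AMD's actual front end.

```python
# Toy model of clustered decode. Instruction fetch is split into blocks
# at taken branches; blocks are handed out round-robin to identical
# decode clusters that work in parallel.

def decode_cycles(block_lengths, clusters, width):
    """Cycles to decode all blocks; each cluster decodes `width` insns/cycle."""
    busy = [0] * clusters  # cycles of decode work queued per cluster
    for i, n in enumerate(block_lengths):
        busy[i % clusters] += -(-n // width)  # ceil(n / width)
    return max(busy)

blocks = [6, 3, 9, 4, 5, 7]  # made-up basic-block sizes (insns between taken branches)

single = decode_cycles(blocks, clusters=1, width=4)  # one 4-wide decoder: 11 cycles
dual = decode_cycles(blocks, clusters=2, width=4)    # 2x 4-wide clusters: 7 cycles
```

In this toy, two 4-wide clusters only approach 8-wide throughput when basic blocks are plentiful and balanced; long straight-line code serializes on one cluster, which is the usual objection to clustering.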


Zen 3 already has, on both the INT and FP execution sides of the core, the width needed to go 6-wide, just not the PRF ports to match.
Zen 3 also already has great flexibility in the L/S pipeline, being able to do 3 loads, 2L/1S, or 1L/2S; they just need another port, which could be a dedicated load port.
Zen 2/3 already has a higher retirement width than decode width.

I expect that, just like Zen 1, we will see far more than just the minimum needed to hit an increase in core width. It will be interesting to see what that is, how much is in the patents we have already seen, and how much comes right out of the blue.

AMD is not guaranteed to succeed with Zen5 architecture, they might get delayed, might take a page too many from Bulldozer or K10 "success" stories. They are sure executing real good lately, but local circlejerkers were already overjoyed by "massive IPC increase of Zen4", so while unlikely it might happen.
I find this funny: AMD might not succeed, all the while Intel has been pretty mediocre on the uarch side since Haswell. Even if you look at Bulldozer, it did lots of "good" things (the move to a PRF, way better memory disambiguation, a big wide front end); it was just aimed at the wrong target, which is what we call bad leadership, or a lack of it. Also, the AMD SOI process it launched on was rubbish. Neither of those is much of a problem with the current setup.


The reality is that Zen 3 is currently losing in PPC to ADL very substantially, and mediocre PPC gains ensure that Raptor Lake will also be in the lead versus Zen 4. I'd be damned if AMD was not more efficient with a full node advantage, but their days of performance leadership over 2013-era designs, as in 2020, are over.
This is true only in the relative position you construct for yourself. You pick your reference point and then extrapolate; just see how you excluded AnandTech's SPEC setup, which reflects the way the vast majority of laptops, desktops, and servers operate!

But this goes back to my previous point: all those core resources in GC, and it can't beat Zen 3 in a "standard" operating environment. Look at pieces like Chips and Cheese's, which point out that the increased size of the L1D is needed for latency hiding so they can clock the core so high. Well, it looks like Zen 4 is going to drastically increase clock scaling to near the same point without having to sacrifice latency, and at way less power.

But sure, make it about PPC; if the shoe were on the other foot, you totally wouldn't move the argument to total performance.

Intel's future will be decided by their process, if they can come up with Intel 4 and move to Intel 3 on time, they will be just fine.
So we agree it's not going to be saved by its P-core designers.

Come on, even AnandTech with their JEDEC loving, but not for real-world testing, found:
redacted, which is the timing and speed most laptops, desktops (remember, most are prebuilt), and servers run at. You're in your own feedback loop!

Hard to imagine Intel not continuing to iterate on these things; they sure plan to use 32 E-cores to make AMD's day miserable in throughput tests. The power "efficiency" of a core pushed to 4 GHz by marketing on 10nm might not apply to a chip on Intel 3 at the correct voltage, sipping power.
Now you're doing the exact same thing you called out above about Zen 4 IPC......... your epidermis bias is showing.
If I did the same thing: AMD is going to make Intel miserable in ST with its 5.5-6 GHz, 8-wide, 40%+ IPC Zen 5 monster...........
Edit: I just realised that with this last paragraph you totally jumped the shark, moving the goal posts not from PPC to total ST performance but all the way to per-socket throughput. :eek:




We still do not allow profanity in the tech forums.


esquared
Anandtech Forum Director
 
Last edited by a moderator:

DrMrLordX

Lifer
Apr 27, 2000
22,905
12,974
136
At the risk of stating the obvious: from this discussion about modern turbos, PPT, voltages, V/F curves, and frequency points, it is quite possible to design various SKUs. And depending on the V/F curve, my worry is that such a design @ 5.5+ GHz might be a bridge too far and efficiency might suffer tremendously.

Raphael won't be pushing boost clocks that high with any significant number of cores. It's a non-issue. Also if Zen4 truly is just "Zen3 with more L2 and AVX512" then the process change alone should allow Raphael to hit 5.5 GHz ST @ isopower when compared to Vermeer @ 5 GHz. In fact it would hit at least 5.75 GHz, or more! Personally I suspect there's more to Zen4 than meets the eye, so simple node comparisons will not entirely suffice, but again . . . 5.5 GHz on one core pulling 20W does not seem that far-fetched unless there are some really weird design-related clockspeed walls. The process alone would not explain such ridiculous inefficiency.
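The isopower argument can be put in back-of-envelope form. A minimal sketch, assuming dynamic power scales as V² · f and a hypothetical ~20% voltage reduction from the node change; both numbers are illustrative assumptions, not AMD data:

```python
# Back-of-envelope dynamic power: P is proportional to V^2 * f
# (switching capacitance folded into the constant and held equal).

def rel_power(f_ghz, v):
    return v * v * f_ghz

vermeer = rel_power(f_ghz=5.0, v=1.30)         # assumed Zen 3 operating point
raphael = rel_power(f_ghz=5.5, v=1.30 * 0.80)  # +10% clock, assumed -20% voltage

ratio = raphael / vermeer  # ~0.70: higher clock at lower dynamic power
```

Static leakage and the steep top of the V/F curve are ignored here, which is exactly why a design pushed past its efficient range can still blow through such estimates.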
 

RTX2080

Senior member
Jul 2, 2018
343
542
136
Also if Zen4 truly is just "Zen3 with more L2 and AVX512"

So the three main areas I can see that need to grow are:

PRF, both in terms of ports and size
Decode throughput
An increase in load/store ports

Any decode-width increase probably needs to be "smart": some form of clustering/splitting to keep power low while preserving clock scaling, etc. AMD also has patents around flushing decoded instructions to a cache, etc.

The PRF would be interesting. If it's just a fifth set of read/write ports then they might be able to brute-force it, but if they go wider in Zen 5 they would need something smart there. How does Apple achieve 8-wide with a ~600-entry ROB? 2x 4-wide clusters?

maybe I'm late but here's speculation by InstLatX64


I still think it is physically impossible for a CPU to implement a wider instruction window (like AVX-256/512) without anything else changing; in other words, Zen 4 = Zen 3 + AVX-512 is also impossible. The core has to be revised to adopt it, and the 8-10% IPC gain mentioned by AMD already reflects that. I speculate the changes will be like going from Haswell to Skylake(-X): no miracle, and that's reasonable.
 

DrMrLordX

Lifer
Apr 27, 2000
22,905
12,974
136
The core has to be revised to adopt it, and the 8-10% IPC gain mentioned by AMD already reflects that.

It was just some handwaving on my part to describe how it was mostly impossible that Zen4 would use that much more power with 5.5 GHz boost clocks given the node jump (unless, again, AMD introduces some ugly Fmax limitations elsewhere). The idea that Zen4 is just Zen3 + L2 + AVX512 is a popular meme though, and it's kind of fun to poke at it.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
maybe I'm late but here's speculation by InstLatX64


I still think it is physically impossible for a CPU to implement a wider instruction window (like AVX-256/512) without anything else changing; in other words, Zen 4 = Zen 3 + AVX-512 is also impossible. The core has to be revised to adopt it, and the 8-10% IPC gain mentioned by AMD already reflects that. I speculate the changes will be like going from Haswell to Skylake(-X): no miracle, and that's reasonable.
We have known for a long time that Zen 4 would be an evolution of Zen 3 (same family) and Zen 5 will be a new family with much more radical changes. I suspect that the changes for Zen 4 are actually very large, but they are mostly limited to increasing the size of buffers or widening interconnect and such. Doubling the size of the L2 while preventing latency regression may have taken a lot of die area. They may also have done considerable work on reducing power since that is the main limiter on clock speed and Zen 4 seems to be essentially their “little efficiency core” compared to Zen 5. They also need a significant power reduction for running massive core counts in Epyc. Increasing all kinds of other buffers like TLB and trace caches can eat a lot of space since they need to be very fast. The interconnect increases to supply AVX512 and such will also increase die area since that is also buffered and pipelined. The pipeline depth is almost certainly still the same. Rebalancing a pipeline is a significant design change unlikely to have been done on Zen 3. Increasing performance on integer code is quite difficult since it is so latency dependent, so only seeing something like 10% per clock isn’t that unexpected.

It is a lot easier to increase performance on floating point applications. We probably get a large increase in floating point applications with Zen 4, part of which might just be significantly increased bandwidth and clock speed. The clock speed increases can affect bandwidth limited applications much more than latency limited applications. A lot of games benefit from large caches but they also benefit from higher clocks more than most applications, so I suspect that the 5800 X3D will not beat Zen 4 processors in games. I don’t know if that is really a concern since most games seem to be gpu limited or are already running at >144 fps. New, more powerful GPUs will need CPUs to keep up with them though. A giant APU could be a significant jump in performance and efficiency vs. separate cpu and gpu. Not needing to constantly copy memory around should be a big win, so we may see cases where the same gpu chiplet performs better in an APU (with HBM cache or large SRAM cache) compared to the same chiplet and memory on a pci express card.
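The clock-sensitivity point can be framed Amdahl-style: only the core-bound fraction of runtime scales with frequency, while time spent stalled on memory latency does not. A small sketch; the fractions and the 15% clock bump are illustrative assumptions:

```python
# Speedup from a clock bump when only part of runtime scales with it.

def speedup(clock_ratio, core_fraction):
    # Normalized runtime: the core-bound part shrinks with clock,
    # the memory-stall part does not.
    return 1.0 / (core_fraction / clock_ratio + (1.0 - core_fraction))

bump = 1.15  # hypothetical 15% clock increase

compute_heavy = speedup(bump, core_fraction=0.95)  # ~1.14x
memory_bound = speedup(bump, core_fraction=0.40)   # ~1.06x
```

This is why a large cache (which raises the core-bound fraction by removing stalls) and a clock bump compound: each makes the other more effective.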

The initial Zen 4 seems to be very conservative, looking very similar to Zen 3; just MCMs, maybe with some advanced packaging tech (RDL, etc), but no silicon bridges. I am thinking that the Zen 4 refresh and Bergamo (Zen 4c) may start to include more stacking tech; silicon bridges or perhaps an infinity cache base die. It is still unclear whether they have some modular base die that can be used across multiple products or if they are just going to put essentially v-cache die under other chips. I thought they were talking about the HPC APU coming with Zen 4, so that seems to indicate that stacking will be used with Zen 4, but we already know that the initial releases are MCM, unless there is some stuff hiding in the package. HBM seems to be coming to Epyc, possibly with Zen 4. It doesn’t seem like it can really consume that level of bandwidth. Perhaps it is only with products containing gpu chiplets? It is unclear how they would make use of HBM otherwise; would the CPU chiplet have an HBM interface? HBM is significantly more power efficient than going out to system memory, so even if they can’t take full advantage of the bandwidth, it may still be worthwhile for power consumption.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
That's not Zen 4. That's Mendocino.
Thanks, it was a link from Tomshardware.com. The article talks about several upcoming processors, but the links and pics don't seem to match their descriptions. Or I'm just dumb today.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
If only those cache chiplets could be made on smaller processes. If you are going modular, think of the advantages.

1. Building cache is much simpler than a CPU.
2. The cache would run significantly less warm.
3. The cache can be positioned nearer the CPU due to lower thermal emissions.
4. Shorter trace lengths allow more aggressive timings.
5. You could fit more cache in a package by area and volume.
 

Schmide

Diamond Member
Mar 7, 2002
5,744
1,033
126
If only those cache chiplets could be made on smaller processes. If you are going modular, think of the advantages.

1. Building cache is much simpler than a CPU.
2. The cache would run significantly less warm.
3. The cache can be positioned nearer the CPU due to lower thermal emissions.
4. Shorter trace lengths allow more aggressive timings.
5. You could fit more cache in a package by area and volume.

I think the days of off-die Last Level Cache are over. Whole bunch of latency for little gain. Not to mention memory speed is squeezing from the outside: HBM and other wide buses make it effectively obsolete.
 

eek2121

Diamond Member
Aug 2, 2005
3,414
5,051
136
If only those cache chiplets could be made on smaller processes. If you are going modular, think of the advantages.

1. Building cache is much simpler than a CPU.
2. The cache would run significantly less warm.
3. The cache can be positioned nearer the CPU due to lower thermal emissions.
4. Shorter trace lengths allow more aggressive timings.
5. You could fit more cache in a package by area and volume.

I try to eat modularly all the time, especially in THIS economy. 😉
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
I think the days of off-die Last Level Cache are over. Whole bunch of latency for little gain. Not to mention memory speed is squeezing from the outside: HBM and other wide buses make it effectively obsolete.
Why would HBM have any more advantage over 'on package' cache? HBM is a standard, but it's not running at processor speeds. Nothing limits your cache to less than the width of the HBM bus. But by choosing HBM you do elect to run at its design limits. I believe if you're designing your own chiplet then you can break the constraints of the standard and play by your own expectations.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
I think the days of off-die Last Level Cache are over. Whole bunch of latency for little gain. Not to mention memory speed is squeezing from the outside. HBM and other wide buses make it effectivly obsolete.
What type of “off die” cache are you talking about? Stacked cache will definitely be a thing and it is “off die”. The level of connectivity for SoIC type stacking makes it appear on die from a design perspective. It has orders of magnitude larger connectivity than the type of interfaces used for HBM. HBM is not that great as a processor cache because it is DRAM rather than SRAM. You still get DRAM latencies, you can just get a lot higher bandwidth. HBM can be made at different locations by different companies though. The SoIC type of stacking requires that the stacked chips be designed together and they must be made at TSMC. They may use stacking of things other than memory going forward, like CDNA, RDNA, XDNA (FPGA), and compute core chiplets stacked on top of other die.
 

Schmide

Diamond Member
Mar 7, 2002
5,744
1,033
126
Why would HBM have any more advantage over 'on package' cache? HBM is a standard, but it's not running at processor speeds. Nothing limits your cache to less than the width of the HBM bus. But by choosing HBM you do elect to run at its design limits. I believe if you're designing your own chiplet then you can break the constraints of the standard and play by your own expectations.

HBM is on-package. You could say the same thing about Apple's M1 memory. The point I'm making is that there is little room for any custom chiplet that contains only cache (which I think is what you are proposing).

So on a processor you typically see:

latency

L1 < 2ns
L2 < 4ns
L3 < 20ns
mem < 50ns

and bandwidth

L1 1-4 TB/s
L2 0.5-2 TB/s
L3 300-600 GB/s
Mem 30-128 GB/s

Zen 3 DDR4: ~45-55 GB/s
HBM (8-channel stack): ~128 GB/s (give or take a bunch based on generation and width)

infinity fabric is provisioned to roughly match the memory speed.

The general point being, once you go to infinity fabric you're at memory speeds and latency.

Edit: Revising numbers as pointed out.
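The DDR4 and HBM figures above follow from peak-bandwidth arithmetic: channels × bus width × transfer rate. A quick sketch, assuming first-generation HBM numbers (8 channels of 128 bits at 1 GT/s per stack) for the 128 GB/s figure:

```python
# Peak theoretical memory bandwidth in GB/s:
# channels * bus width in bytes * mega-transfers per second.

def peak_gb_s(channels, bus_bits, mega_transfers):
    return channels * (bus_bits // 8) * mega_transfers / 1000.0

ddr4_dual = peak_gb_s(channels=2, bus_bits=64, mega_transfers=3200)    # 51.2 GB/s
hbm1_stack = peak_gb_s(channels=8, bus_bits=128, mega_transfers=1000)  # 128.0 GB/s
```

Measured copy bandwidth lands below these peaks (hence ~45-55 GB/s for dual-channel DDR4-3200), and later HBM generations raise the per-pin transfer rate well past 1 GT/s.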
 

Det0x

Golden Member
Sep 11, 2014
1,465
4,999
136
So on a processor you typically see:

latency

L1 < 4ns
L2 < 20ns
L3 < 40ns
mem < 80ns

and bandwidth

L1 1-4 TB/s
L2 0.5-2 TB/s
L3 300-600 GB/s
Mem 30-128 GB/s

Zen3 ddr4 ~45GB/s
HBM 8 stack ~128GB/s (give or take a bunch based on generation and width)

infinity fabric is provisioned to roughly match the memory speed.

The general point being, once you go to infinity fabric you're at memory speeds and latency.
What kind of processor is that? Do you have a Zen(3) CPU?

I'm getting the following with my dual-CCD Zen 3:

L1 = 0.8ns
L2 = 2.3ns
L3 = 10ns
mem = 51.6ns (dual channel @1900:3800MT/s)

*edit*
And for a comparison, my single-CCD 3D-stacked V-Cache Zen 3, which is limited to 4450 MHz:
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
HBM is on-package. You could say the same thing about Apple's M1 memory. The point I'm making is that there is little room for any custom chiplet that contains only cache (which I think is what you are proposing).

So on a processor you typically see:

latency

L1 < 4ns
L2 < 20ns
L3 < 40ns
mem < 80ns

and bandwidth

L1 1-4 TB/s
L2 0.5-2 TB/s
L3 300-600 GB/s
Mem 30-128 GB/s

Zen3 ddr4 ~45GB/s
HBM 8 stack ~128GB/s (give or take a bunch based on generation and width)

infinity fabric is provisioned to roughly match the memory speed.

The general point being, once you go to infinity fabric you're at memory speeds and latency.
The v-cache chiplet used on the 5800X3D is a “custom chiplet that contains only cache”?

Also, HBM is still more in the system memory range as far as latency. It is not a replacement for SRAM-based cache.
 

Schmide

Diamond Member
Mar 7, 2002
5,744
1,033
126
The v-cache chiplet used on the 5800X3D is a “custom chiplet that contains only cache”?

Also, HBM is still more in the system memory range as far as latency. It is not a replacement for SRAM-based cache.

3D cache is multi-layer chip stacking. When integrated on the package it constitutes one chiplet.
 