Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
Except for details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 

nicalandia

Diamond Member
Jan 10, 2019
So we have recently learned that the Zen 3 CCD size is actually 74 mm^2 (from AMD's product page) and that the Zen 4 CCD (from the Gigabyte leak) is going to be 72 mm^2. Knowing the SRAM and logic density of TSMC 5nm, we can assume AMD is packing a lot of grunt into those Zen 4 cores: full AVX-512, 1 MiB L2$...

Cyan area: 2x 256-bit vector registers
Green: 1 MiB L2$


The Zen 4 die is longer than Zen 3's (10.7 mm vs 10.15 mm) due to the extra L2$ and AVX-512 registers, but narrower (6.75 mm vs 7.3 mm).
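As a sanity check, the quoted dimensions multiply out to roughly the reported die areas:

```python
# Cross-check: do the quoted die dimensions match the reported CCD areas?
zen3_w, zen3_h = 7.30, 10.15   # mm (width x length quoted above)
zen4_w, zen4_h = 6.75, 10.70   # mm

zen3_area = zen3_w * zen3_h    # ~74.1 mm^2, in line with AMD's ~74 mm^2 figure
zen4_area = zen4_w * zen4_h    # ~72.2 mm^2, in line with the leaked ~72 mm^2
print(f"Zen 3 CCD: {zen3_area:.1f} mm^2, Zen 4 CCD: {zen4_area:.1f} mm^2")
```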
 
Last edited:

Mopetar

Diamond Member
Jan 31, 2011
$ has ALWAYS been a bottleneck, no way around that, since the beginning of CPU time. There is no reason to point that out as if it just happened to Zen3

Currently it's only a bottleneck for games and specific niche workloads that cloud providers have, at least for current-generation Zen CPUs. If it were a bottleneck for other applications, we would see a similar performance uplift as in games, but we don't. Either the L3 cache isn't the bottleneck for these programs, or any increase in performance is exactly offset by the decrease in clock speed.

In most resolutions the CPU isn't a bottleneck at all. I'm not expecting Zen 3D to do anything in 4K, would be quite surprised to see it affect more than a small handful of titles in 1440p, and we only start to see the advertised 10-15% uplift once we get to 1080p and the GPU stops being the bottleneck. Maybe it's even more than that at 720p, which would give an idea of the theoretical maximum uplift as GPUs continue to grow, but that isn't particularly useful for anyone in reality.

There is no universal workload, so anything can always be a bottleneck for some piece of code and knowing the hardware, it's always possible to construct some code specifically to cause a bottleneck at a specific part of the hardware. But it's difficult to claim that L3 cache is a current bottleneck when tripling it has such a small overall impact in such a small set of applications.
 
Jul 27, 2020
But it's difficult to claim that L3 cache is a current bottleneck when tripling it has such a small overall impact in such a small set of applications.


With LPDDR5-6400, the Zen 3+ CPU is able to match the ADL CPU with LPDDR5-5200. This makes me think two things:

1. AMD's DDR5 controller is beyond amazing, able to keep latencies low.
2. Zen 3 is bandwidth starved and, when given enough of it, overcomes the relatively lower single-threaded IPC vs. ADL.
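Back-of-the-envelope peak bandwidth for the two configurations, assuming both laptops use a 128-bit (dual-channel) LPDDR5 bus — an assumption on my part, not stated in the chart:

```python
# Peak theoretical bandwidth for the two memory configs in the chart,
# assuming a 128-bit bus for both machines (assumption, not from the chart).
def peak_bw_gbs(mt_per_s, bus_bits=128):
    """Peak bandwidth in GB/s: transfers/s x bytes per transfer."""
    return mt_per_s * 1e6 * (bus_bits / 8) / 1e9

zen3plus = peak_bw_gbs(6400)  # LPDDR5-6400 -> 102.4 GB/s
adl      = peak_bw_gbs(5200)  # LPDDR5-5200 -> 83.2 GB/s
print(f"{zen3plus:.1f} GB/s vs {adl:.1f} GB/s (+{100 * (zen3plus / adl - 1):.0f}%)")
```

So the Zen 3+ machine has roughly a 23% raw-bandwidth advantage in this comparison, which makes the matching results less surprising if bandwidth is indeed the limiter.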

The 1% low fps figures especially are much better on Zen 3+. Wow.

Exciting times ahead with Zen 4 if this is just a taste of what's to come.
 

DisEnchantment

Golden Member
Mar 3, 2017
Some new patents about radical front end changes from AMD.
Very likely a bit too late for Zen4, filed in 2020 but who knows.

This is AMD's attempt to tackle the much-debated x86 decode-width issue, i.e. the claim that x86 cannot increase decode width without a massive power/area penalty.

Instead of one unit decoding many more instructions than they do today via multiple fast/slow paths, they are attempting multiple fetch-decode units decoding different branch windows of the instruction stream in parallel. The pipelines are not always all active; the extra one kicks in only when a single pipeline cannot handle the instruction stream anymore.
Instructions get decoded in parallel on all pipelines and get reordered before dispatch if needed.

From a high-level functional perspective it sounds very intriguing and scalable.

20220100519 - PROCESSOR WITH MULTIPLE FETCH AND DECODE PIPELINES
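For illustration only, here is a toy sketch of the general idea as I read it: split the fetch stream at predicted branches, decode the windows on independent pipelines, then reorder the decoded ops back into program order before dispatch. All names and the window-splitting heuristic here are made up, not taken from the patent claims:

```python
# Toy model of the multiple fetch/decode pipeline idea: cut the instruction
# stream into windows at predicted branches, decode windows independently
# (conceptually in parallel), then reorder into program order before dispatch.

def split_into_windows(stream, branch_addrs):
    """Cut the (addr, insn) stream at addresses of predicted branches."""
    windows, current = [], []
    for addr, insn in stream:
        current.append((addr, insn))
        if addr in branch_addrs:        # a predicted branch ends the window
            windows.append(current)
            current = []
    if current:
        windows.append(current)
    return windows

def decode_window(window):
    """Stand-in for one fetch/decode pipeline (decode is just tagging here)."""
    return [(addr, f"uop({insn})") for addr, insn in window]

def frontend(stream, branch_addrs):
    windows = split_into_windows(stream, branch_addrs)
    decoded = [decode_window(w) for w in windows]   # conceptually in parallel
    ops = [op for w in decoded for op in w]
    ops.sort(key=lambda x: x[0])                    # reorder before dispatch
    return [op for _, op in ops]

stream = [(0, "add"), (1, "cmp"), (2, "jne"), (3, "mov"), (4, "ret")]
print(frontend(stream, branch_addrs={2}))
```

The interesting part in hardware is of course everything this sketch hand-waves away: predicting where the second window starts before the first is decoded, and merging the streams without adding dispatch latency.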


Similarly, there are multiple op-cache pipelines as well, feeding a reorder block along with the decoded instructions from above.

20220100663 - PROCESSOR WITH MULTIPLE OP CACHE PIPELINES


The above two are complementary patents and make better sense when read together.
If they keep the current decoders and op cache and just double them, with the addition of the reorder block, the frontend throughput would be quite large.

And to complement these, they suggest uop compression in the dispatch stage as well.

20220100501 - Compressing Micro-Operations in Scheduler Entries in a Processor



Very likely too late for Zen 4. But nevertheless, whichever uarch ends up with these is going to be interesting, to say the least.
 

szrpx

Member
Jan 12, 2022
Very likely too late for Zen4. But nevertheless seems whichever uarch is going to have these is going to be interesting to say the least.

Probably Zen 5; even Mike Clark was excited about it during Ian's interview. Actually, I think the interview was around the same time these patents were filed.
 

Mopetar

Diamond Member
Jan 31, 2011
How much of a problem is that with modern code, though? I can't imagine modern compilers not producing instructions that are far easier to decode and schedule. The rest is there for legacy reasons, and I question to what extent it matters for most of what we're running today.

If anything, this just looks like a redesign of the front end that's needed because the back end has been getting wider for both AMD and Intel. At some point they're going to struggle to keep it fed, and rather than just scaling the current approach, a clean design may be in order if it makes for better throughput and an easier ability for AMD to reuse the design across multiple products, scaling with the needs of whatever is being developed.
 

jamescox

Senior member
Nov 11, 2009
Probably cause the cache isn't big enough it's relying on main memory more.
If aggressive prefetch is in use, it trades bandwidth for latency to some extent. Higher bandwidth is generally easier to achieve than lower latency, so it is usually a good trade. It is generally not as good a trade-off if you have a massive number of cores hitting the same memory system, so prefetch behaviour may differ between desktop and EPYC parts. Prefetchers are also something that can be upgraded independently of the core and easily enabled or disabled. I think they behaved differently between Zen 2 and Zen 3, so I would expect bigger changes with Zen 4 due to higher-bandwidth memory.
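A crude average-memory-access-time model shows the trade: prefetch coverage hides some miss latency while spending extra bandwidth on speculative fills. All numbers below are illustrative, not measured Zen figures:

```python
# Toy AMAT (average memory access time) model of the prefetch trade-off:
# prefetching converts a fraction of demand misses into hits ("coverage"),
# at the cost of extra bandwidth spent on prefetches that may go unused.

def amat_ns(hit_ns, miss_ns, miss_rate, prefetch_coverage=0.0):
    """Average access time; coverage = fraction of misses hidden by prefetch."""
    effective_miss_rate = miss_rate * (1 - prefetch_coverage)
    return hit_ns + effective_miss_rate * miss_ns

base = amat_ns(hit_ns=1.0, miss_ns=80.0, miss_rate=0.05)   # no prefetch
pf   = amat_ns(hit_ns=1.0, miss_ns=80.0, miss_rate=0.05,
               prefetch_coverage=0.6)                      # 60% of misses hidden
print(f"{base:.1f} ns -> {pf:.1f} ns average access time")
```

With many cores sharing one memory system, the "wasted" prefetch bandwidth is no longer free, which is exactly why the trade can look different on EPYC than on desktop.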

Zen 3 has the issue that latency is only really low for 8 MB from each core due to TLB limitations; that is covered in the original Zen 3 article at AnandTech. TLBs are very likely to be improved in Zen 4 in some manner. The changes shouldn't be too drastic since it is the same family as Zen 3. Hopefully, low-thread-count applications will be able to take better advantage of the L3 cache, even without going to V-Cache. It would be interesting to see how much DDR5 affects memory performance. It goes down to 32-bit-wide channels, 2 per DIMM, which should provide better granularity. It should help with more than just bandwidth.
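The ~8 MB figure lines up with data-TLB reach: Zen 3's 2048-entry L2 DTLB with 4 KiB pages maps exactly 8 MiB, beyond which even L3 hits pay a page-walk penalty.

```python
# TLB reach: how much memory the L2 data TLB can map with base 4 KiB pages.
# Zen 3's L2 DTLB is 2048 entries (per AnandTech's Zen 3 deep dive).
entries   = 2048
page_kib  = 4
reach_mib = entries * page_kib / 1024
print(f"TLB reach: {reach_mib:.0f} MiB")
```

Larger pages (2 MiB) or more TLB entries both extend that fast region, which is one way Zen 4 could make the bigger L3 more useful to a single thread.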
 

Frenetic Pony

Senior member
May 1, 2012
Probably Zen 5, even Mike Clark was excited about it during Ian's Interview. Actually, I think the interview was around the same time these patents were filed.

How much does the date of a patent filing have to do with the date of it being thought of, and possibly designed into hardware? If there's a long enough lead time between the two, it could be in Zen 4. I imagine it'd have to be thought of and perhaps even tested first (simulated), and only then go into the lawyers' inbox, come up next in the queue, have them go over it, and only then be filed.
 

DisEnchantment

Golden Member
Mar 3, 2017
How much does the date of a patent filing have to do with the date of it being thought of, and possibly designed into hardware? If there's a long enough lead time between the two, it could be in Zen 4. I imagine it'd have to be thought of and perhaps even tested first (simulated), and only then go into the lawyers' inbox, come up next in the queue, have them go over it, and only then be filed.
In AMD's case, it has in the past taken 3 years or more from patent filing to showing up in an actual CPU product, and less than 2 years for GPU products.
So Zen 5 is a likely candidate. But with the huge R&D boost AMD is getting, the time from architectural concept to implementation should improve.

The concept is not new, first showing up in Tremont, hence my wording "AMD's attempt" in the post. But the patent claims would have to be totally different, worked out by the legal advisors, and the patent is in fact more tailored for high performance, especially the use of the op cache to augment each fetch-decode pipeline for its particular branch window.

The concept is too interesting for AMD not to look at; the power, area and perf benefits as described are very tangible.
 

eek2121

Platinum Member
Aug 2, 2005
2,883
3,859
136
If aggressive prefetch is in use, it trades bandwidth for latency to some extent. Higher bandwidth is generally easier to achieve than lower latency, so it is usually a good trade. It is generally not as good a trade-off if you have a massive number of cores hitting the same memory system, so prefetch behaviour may differ between desktop and EPYC parts. Prefetchers are also something that can be upgraded independently of the core and easily enabled or disabled. I think they behaved differently between Zen 2 and Zen 3, so I would expect bigger changes with Zen 4 due to higher-bandwidth memory.

Zen 3 has the issue that latency is only really low for 8 MB from each core due to TLB limitations; that is covered in the original Zen 3 article at AnandTech. TLBs are very likely to be improved in Zen 4 in some manner. The changes shouldn't be too drastic since it is the same family as Zen 3. Hopefully, low-thread-count applications will be able to take better advantage of the L3 cache, even without going to V-Cache. It would be interesting to see how much DDR5 affects memory performance. It goes down to 32-bit-wide channels, 2 per DIMM, which should provide better granularity. It should help with more than just bandwidth.

I am skeptical of AMD using different parts in server vs. desktop. They’ve actually shown some indications that make me think they don’t even want a separate mobile SKU.
 

jamescox

Senior member
Nov 11, 2009
I am skeptical of AMD using different parts in server vs. desktop. They’ve actually shown some indications that make me think they don’t even want a separate mobile SKU.
I don’t think it would be a different piece of silicon. They can easily change such things via BIOS, microcode, or even fusing things off.
 

moinmoin

Diamond Member
Jun 1, 2017
Very likely too late for Zen4. But nevertheless seems whichever uarch is going to have these is going to be interesting to say the least.
As Zen 4 is an extension of the Zen 3 family, this actually fits perfectly as part of the expected core overhaul for the subsequent family, Zen 5. Thanks for the write-up!
 

soresu

Platinum Member
Dec 19, 2014
As Zen 4 is an extension of the Zen 3 family, this actually fits perfectly as part of the expected core overhaul for the subsequent family, Zen 5. Thanks for the write-up!
Papermaster's recent interview did imply that a wider core was on the horizon when the subject was broached; Zen 5 would be as good a candidate as any.

I'm pretty content with my 3950X for the moment and would probably wait for Zen 5 and whatever top end AM5 SKU is fielded for that generation.
 

deasd

Senior member
Dec 31, 2013
*wrong thread, sorry, so many amd thread now*

Up to 11% faster than the 5800X in Blender, but no uplift in some other pure benchmarks; no gaming tests yet.

 

nicalandia

Diamond Member
Jan 10, 2019
Personally I doubt that the Zen 4 comes with PCI-E 5 but Zen 5 likely could.
It has been confirmed that Genoa (Zen 4) will have 128 lanes of PCIe 5.0.


Plus, PCIe 5.0 can use CXL DDR5 memory expansion to reach terabyte-level DDR5 capacities.


 
Jul 27, 2020
In addition to CXL hardware innovation, Samsung has incorporated several controller and software technologies like memory mapping, interface converting and error management, which will allow CPUs or GPUs to recognize the CXL-based memory and utilize it as the main memory.

I hope GPUs of the future STOP coming with their own RAM and just use a CXL memory-expansion device attached to PCIe 5.0 or 6.0. No more segmentation based on memory size.
 
Jul 27, 2020
Nonsense. Bandwidth, latency and energy efficiency will all be an order of magnitude worse than direct attached GDDR or HBM.
It's expanded memory. Maybe the GPUs will have a small amount of really fast cache, but most of the data will be stored in the memory-expansion device. Or another possibility is that all future GPUs come with either 2 GB or 4 GB of built-in RAM and the rest is up to the user's choice.
 

Asterox

Golden Member
May 15, 2012
Cinebench doesn't benefit from large L3 at all.

Yes, and it's the same story with system memory speed.

For Blender/Cinebench/POV-Ray, L3 cache size is not important at all.

One example with two different Zen 2 CPUs. Default all-core turbo is very similar, or 100-150 MHz higher on the Renoir APU:

- R5 3600: 32 MB L3

- R5 Pro 4650G: 8 MB L3



 

maddie

Diamond Member
Jul 18, 2010
It's expanded memory. Maybe the GPUs will have a small amount of really fast cache, but most of the data will be stored in the memory-expansion device. Or another possibility is that all future GPUs come with either 2 GB or 4 GB of built-in RAM and the rest is up to the user's choice.
I thought you hated the RX 6500XT? Your idea would be magnitudes worse.
 

JoeRambo

Golden Member
Jun 13, 2013
It's expanded memory. Maybe the GPUs will have a small amount of really fast cache, but most of the data will be stored in the memory-expansion device. Or another possibility is that all future GPUs come with either 2 GB or 4 GB of built-in RAM and the rest is up to the user's choice.

Why not take it to the extreme and attach two terabytes of SSD directly to the GPU, just like those Radeon SSG cards did years ago?
The reality is that there is a certain amount of (very fast and very energy-efficient) framebuffer that is required, and a mainstream card already communicates with the machine's DRAM via PCIe without fancy names for the technology.

CXL is good for providing a tier of memory, just like Optane does, but GPUs are the least suited for it.
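A rough peak-bandwidth comparison illustrates the objection; the link and memory speeds below are typical examples, not tied to any specific product:

```python
# Rough peak-bandwidth comparison: a full PCIe 5.0 x16 link vs. local GDDR6.
# Example figures: PCIe 5.0 runs 32 GT/s per lane with 128b/130b encoding
# (~1.5% overhead); a midrange GPU might have 16 Gbps GDDR6 on a 256-bit bus.
pcie5_x16_gbs = 32 * 16 / 8 * (128 / 130)  # per direction, after encoding
gddr6_gbs     = 16 * 256 / 8               # pin speed x bus width

print(f"PCIe 5.0 x16: ~{pcie5_x16_gbs:.0f} GB/s, "
      f"256-bit GDDR6: {gddr6_gbs:.0f} GB/s "
      f"({gddr6_gbs / pcie5_x16_gbs:.0f}x)")
```

Roughly an 8x gap even before latency is considered, which is why CXL memory makes more sense as a capacity tier behind local VRAM than as a replacement for it.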
 