Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

DisEnchantment · Aug 18, 2021

Bigos said:
Most of the kernel cannot even use SSE, and that is not because of feature levels but because of the design (tasks that don't clobber XMM registers don't need to save/restore them on context switches). I don't think these feature levels will mean much in the kernel, but I would like to be proven wrong (some features are unrelated to the vector registers and might make the kernel more performant).

We have our SW ISV providing us a distro with AVX enabled in netfilter, and the packet latency from the load balancer to the worker nodes is a night and day difference.
The more complex the ebtables/iptables rules are, the better the performance gains seem to be. Packet mangling and traversing the rules are much faster.
This is one case that I know first hand, I wonder how many more.

Thibsie · Aug 18, 2021

https://twitter.com/x/status/1427929018234458114

512 indeed ?

JoeRambo · Aug 18, 2021

DisEnchantment said:
We have our SW ISV providing us a distro with AVX enabled in netfilter, and the packet latency from the load balancer to the worker nodes is a night and day difference.
The more complex the ebtables/iptables rules are, the better the performance gains seem to be. Packet mangling and traversing the rules are much faster.

It might not even have anything to do with AVX or vectorization. What could be happening is that those flags enable instructions like Haswell's BMI, so the compiler can generate efficient logic and bit-testing code. These can make a huge difference as well.
But it could also be vectorization, as AVX can be very efficient in pattern and substring search operations. The real question is: can the compiler generate such good code, or does the distro simply contain hand-optimized assembly routines for hot paths? Are you getting binaries, or source to build?

Thibsie said:
https://twitter.com/x/status/1427929018234458114

512 indeed ?

Yeah, AVX512 does not come free. Take the AVX2 FP PRF area, multiply by 4, then multiply by a not-so-small factor to account for signals having to cross larger physical distances, and multiply again by the FP PRF size growth, if any, for Zen4.
I think Golden Cove Core+L2 area versus Zen4 Core+L2 is gonna be as valid a comparison of process and architecture prowess as we are ever gonna get.

uzzi38 · Aug 18, 2021

Thibsie said:
https://twitter.com/x/status/1427929018234458114

512 indeed ?

That sounds a bit too large to me honestly. That would be an absolutely huge chunk of the die space gained from the die shrink lost just to AVX-512 registers. Not saying it's not feasible, just that I would expect it to be smaller than that.

dr1337 · Aug 18, 2021

uzzi38 said:
That sounds a bit too large to me honestly. That would be an absolutely huge chunk of the die space gained from the die shrink lost just to AVX-512 registers. Not saying it's not feasible, just that I would expect it to be smaller than that.

I can kinda believe it if they've had to spread out the logic more to deal with thermals. Dropping clocks is by far Intel's biggest issue with AVX512, so maybe AMD is trading density for FP clock speed?

AMDK11 · Aug 18, 2021

itsmydamnation said:
I also think everyone is wrong, Zen 1 2 and 3 are fundamentally the same base core. Zen2 improved branch/uop/256-bit ops, Zen3 improved execution and load/store, but they did both in a way that didn't fundamentally change the core (same number of reg file ports, number of stages, what each stage does etc). Jim Keller talked about the Zen core having big bones and I think that's exactly what he means. My bet is Zen4 is the next new set of big bones which the following uarches will be able to grow into.

If the rumoured perf uplifts are true it kind of has to be. I would ignore families; that's about how they behave/handle certain operations. How they execute, or the performance they execute with, is not directly linked.

You are relying too much on the diagrams that describe the core microarchitecture. A diagram of the core's structure is a far-reaching simplification meant to illustrate, at least to some extent, the main features of the core. Many parts of the core cannot be represented this way, or only with great difficulty. New algorithms and new logic controlling the core resources cannot be presented this way and described graphically.

For example, Zen3 is a project that is almost completely new from scratch. And you say that it is basically the same as Zen and Zen2 because the decoders and execution ports are roughly the same, but it is not. It's like saying that two different cars are the same because they have 4 wheels and a five-cylinder engine.

DisEnchantment · Aug 18, 2021

JoeRambo said:
But it could also be vectorization, as AVX can be very efficient in pattern and substring search operations. The real question is: can the compiler generate such good code, or does the distro simply contain hand-optimized assembly routines for hot paths? Are you getting binaries, or source to build?

We are getting sources (of everything they packaged in the distro) and images. AVX is indeed used but I am not sure if it is the upstream version or a custom implementation.
BTW it is a major Linux ISV and I can't put a name here. But they supply everything, compilers, libraries, distros, patching CVEs, enablement of HW, etc. They provide images for our compute clusters, CI/CD infras, edge devices and so on

Bigos · Aug 18, 2021

DisEnchantment said:
We have our SW ISV providing us a distro with AVX enabled in netfilter, and the packet latency from the load balancer to the worker nodes is a night and day difference.
The more complex the ebtables/iptables rules are, the better the performance gains seem to be. Packet mangling and traversing the rules are much faster.
This is one case that I know first hand, I wonder how many more.

I would say that such things should be implemented using runtime detection (the Linux kernel already detects a lot of hardware features dynamically). Unless this is general purpose code that has been autovectorized, but I doubt it (Linux kernel doesn't use aggressive compiler options for a reason).

DisEnchantment · Aug 18, 2021

Bigos said:
I would say that such things should be implemented using runtime detection (the Linux kernel already detects a lot of hardware features dynamically). Unless this is general purpose code that has been autovectorized, but I doubt it (Linux kernel doesn't use aggressive compiler options for a reason).

The code is here

[PATCH nf-next 8/8] nft_set_pipapo: Introduce AVX2-based lookup implementation - Stefano Brivio

DisEnchantment · Aug 18, 2021

Thibsie said:
512 indeed ?

Straight from the horse's mouth

JoeRambo · Aug 18, 2021

DisEnchantment said:
The code is here
[PATCH nf-next 8/8] nft_set_pipapo: Introduce AVX2-based lookup implementation - Stefano Brivio

Yeah, that is a great use of classic AVX vectorization, hand-writing the algorithm in pseudo-assembly, and it can be extended to AVX512 with some effort too. It helps in targeted places like crypto, packet processing and hashing, so if your server spends most of its time encrypting or processing packets it can be a huge win.
But for the general public, the "system" time of a typical application is minimal, and heavy cases like drivers usually already make great use of AVX, so compiling the kernel for "feature levels" won't speed things up as much as people expect.

That said, there is a large opportunity for distros like Clear Linux that provide optimized builds from kernel to user space. That is where the real advancements are to be had.
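
To make the idea concrete, here is a minimal Python sketch of a bitmap-AND rule lookup, the kind of work that wide vector registers speed up. It is a toy loosely modeled on the general approach (split the key into 4-bit groups, fetch one rule bitmap per group, AND them together) and is not the kernel's nft_set_pipapo code; `build_tables`, `lookup` and the rule encoding are made up for illustration.

```python
from typing import List, Optional, Sequence

# Hypothetical rule encoding: one entry per 4-bit group of the key,
# most significant group first. None means "wildcard" for that group.
Rule = Sequence[Optional[int]]


def build_tables(rules: List[Rule], groups: int) -> List[List[int]]:
    """For each group and each nibble value, store a bitmap of matching rules."""
    tables = [[0] * 16 for _ in range(groups)]
    for idx, rule in enumerate(rules):
        for g in range(groups):
            values = range(16) if rule[g] is None else (rule[g],)
            for v in values:
                tables[g][v] |= 1 << idx  # bit idx set = rule idx matches
    return tables


def lookup(tables: List[List[int]], key: int) -> Optional[int]:
    """Return the index of the first matching rule, or None."""
    groups = len(tables)
    candidates = -1  # all bits set: every rule is still a candidate
    for g in range(groups):
        shift = 4 * (groups - 1 - g)
        nibble = (key >> shift) & 0xF
        # This AND over the whole rule bitmap is the step that a wide
        # vector register can do for many rules at once.
        candidates &= tables[g][nibble]
        if candidates == 0:
            return None
    # Index of the lowest set bit = first rule that matched every group.
    return (candidates & -candidates).bit_length() - 1


# Three toy rules over an 8-bit key (two 4-bit groups).
rules = [(0xA, 0x1), (0xA, None), (None, 0x1)]
tables = build_tables(rules, groups=2)
```

With hundreds or thousands of rules each bitmap is many machine words long, and the AND in the inner loop is the part that a 256-bit or 512-bit register can get through in far fewer instructions.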

Doug S · Aug 18, 2021

DisEnchantment said:
Proposed: Allow Building The Linux Kernel With x86-64 Microarchitecture Feature Levels - Phoronix

www.phoronix.com

Some good news on Linux at least: the whole kernel could be built with feature levels instead of being built with the ISA features of processors from more than a decade ago.
Should help with many benchmarks/applications where x86 CPUs are getting held back by SW targeting legacy processors.
Windows needs to do the same. But the openness of the PC ecosystem means SW targets the lowest common denominator, at the cost of holding back performance on newer CPUs.
Zen4 will be the first AMD processor to be x86-64-v4.

The kernel already checks for specific features on startup and uses code paths to utilize them where it will help. I don't see how compiler directives improve on this, other than saving a handful of microseconds on startup.

Few applications and even fewer benchmarks involve the kernel all that much. If you want to speed things up you go for user level stuff like glibc, not the kernel. That's where most of the action is.
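
For anyone unsure what the feature levels being discussed actually group together, here is a rough Python sketch that classifies a set of CPU flags into x86-64-v2/v3/v4. The level contents follow the x86-64 psABI definitions as I understand them; the flag spellings assume /proc/cpuinfo naming (for example `pni` for SSE3, `abm` for LZCNT, `xsave` standing in for OSXSAVE), and `feature_level` and `read_cpu_flags` are hypothetical helper names.

```python
from typing import Iterable, Set

# Each level requires everything in the levels before it.
# Flag names are an assumption based on /proc/cpuinfo spelling.
LEVEL_FLAGS = [
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "pni",
                   "sse4_1", "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c",
                   "fma", "abm", "movbe", "xsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd",
                   "avx512dq", "avx512vl"}),
]


def feature_level(flags: Iterable[str]) -> str:
    """Return the highest level whose flags (and all lower levels') are present."""
    have = set(flags)
    level = "x86-64"  # baseline, assumed for any 64-bit x86 CPU
    for name, required in LEVEL_FLAGS:
        if not required <= have:
            break  # a missing flag caps the level here
        level = name
    return level


def read_cpu_flags(path: str = "/proc/cpuinfo") -> Set[str]:
    """Read the flags line on Linux; return an empty set elsewhere."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()
```

On a Linux box, `feature_level(read_cpu_flags())` gives the level a distro-wide build could target, which is a build-time baseline and a separate thing from the per-feature runtime checks the kernel already does.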

Mopetar · Aug 18, 2021

DisEnchantment said:
They won't get that much density gain. Especially not on a cache heavy design like Zen.

It really depends on how much they intend on using die stacking. Cache can scale upwards instead of outwards to some degree. The real trick will be finding other parts of the chip they can build on top of without creating heat issues.

Abwx said:
He says 35-40% reduction including the I/O, the part that shrinks the least and is very big. As already said, this is not included in the chiplet, so my point holds even better.

I posted an article with an analysis several pages back where someone looked at die shots for their analysis and found Apple got no scaling on any of their IO moving to 5nm from 7nm.

I'm assuming AMD gets decent scaling based on the rumored figures and the additions they're making.

moinmoin · Aug 19, 2021

Anybody keeping track of the teams at AMD? We know that both Zen 1 and 2 were handled by the same team, whereas Zen 3 was by a different one. Are the ones of the latter known to work on Zen 4 as well?

Saylick · Aug 19, 2021

moinmoin said:
Anybody keeping track of the teams at AMD? We know that both Zen 1 and 2 were handled by the same team, whereas Zen 3 was by a different one. Are the ones of the latter known to work on Zen 4 as well?

Not sure if there were any rumors/confirmation of which team is working on Zen 4, but given that Zen 3 was a ground-up rebuild and Zen 4 is also likely to be a heavy lift, I would imagine that Zen 3 and Zen 4 are designed by two different teams.

inf64 · Aug 19, 2021

Here is my speculation about AMD/Intel HEDT parts in 2022. I think it will be very close in ST. AMD will most likely win MT hands down, so I will not even address that portion.

Zen4 should have >20% higher IPC than Zen3, with no Vcache on either. Add the layer of stacked cache to both and the percentage probably stays the same; the only difference is the performance gain versus vanilla Zen3. Since we have GC numbers now, and they show 19% higher IPC versus Rocket Lake, we can extrapolate a *possible* performance for Golden Cove (GC), Zen3D, Zen4 and Raptor Lake parts.

Zen3 has around 7.7% higher IPC versus Cypress Cove, which Intel used in their presentation. The numbers are 1.10x higher int and 1.054x higher fp according to the AT 11700K review; the geomean of both is around 7.7%. I used the ucode-patched (higher) results for the 11700K and adjusted for the difference in ST boost clocks between the 5800X and 11700K. That means GC has around 10% higher IPC than vanilla Zen3 (1.19 / 1.077 ≈ 1.10) and ~16-17% higher ST performance if it reaches 5.3 GHz in ST workloads.
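
The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the estimate: the input ratios are the ones quoted in the post, while `st_ratio` and the Zen3 boost clock passed to it are my own assumptions, since the post does not state which Zen3 clock was used.

```python
from math import sqrt

# Ratios quoted in the post (Zen3 vs Cypress Cove, Golden Cove vs Cypress Cove).
ZEN3_VS_CC_INT = 1.10
ZEN3_VS_CC_FP = 1.054
GC_VS_CC = 1.19


def geomean(a: float, b: float) -> float:
    """Geometric mean of two ratios."""
    return sqrt(a * b)


# Zen3 IPC relative to Cypress Cove: about 1.077, i.e. ~7.7% higher.
zen3_vs_cc = geomean(ZEN3_VS_CC_INT, ZEN3_VS_CC_FP)

# Golden Cove IPC relative to Zen3: divide out the common baseline.
gc_vs_zen3 = GC_VS_CC / zen3_vs_cc


def st_ratio(ipc_ratio: float, clock_a_ghz: float, clock_b_ghz: float) -> float:
    """Single-thread performance of A over B, modeled as IPC ratio times clock ratio."""
    return ipc_ratio * clock_a_ghz / clock_b_ghz


# Hypothetical example: GC at 5.3 GHz versus an assumed 5.0 GHz Zen3 boost clock.
gc_st_vs_zen3 = st_ratio(gc_vs_zen3, 5.3, 5.0)
```

The model treats single-thread performance as IPC times frequency, so the final figure moves with whatever Zen3 boost clock is plugged in.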

Zen3D might get a ~5% boost in general-purpose apps thanks to that massive Vcache, and if AMD really deploys it on a 6nm node, we might get a 5.1-5.2 GHz ST boost. The best case scenario for AMD is near parity between Zen3D and GC in ST performance, if GC tops out in the 5.3 GHz range. There will always be benchmark outliers on both ends, but this is probably the best AMD can get with the Zen3D iteration.

That brings us to Zen4. Zen4 is rumored to be the next major overhaul after Zen3, boosting the FP/AVX datapaths to full 512-bit and bringing the huge Vcache to mainstream parts again.
If the rumors are true, Zen4 with Vcache could be a huge 30% faster than vanilla Zen3 at ISO clocks. If AMD manages a few percent of ST clock bump, it could be even faster.
Zen4 should go up against Raptor Lake, which will have beefed-up GC cores with more cache. It is fair to assume this might bring another 5% of performance at ISO clocks, and if Intel reaches 5.5 GHz the net result could be an almost 10% ST performance uplift versus GC. So, versus vanilla Zen3, Raptor Lake could be a significant 25-30% faster in ST, while being ~15% faster than Zen3D.
Finally, Zen4 vs Raptor Lake looks like Zen3 versus Cypress Cove/Willow Cove: a slight edge for AMD if they manage to get clocks near the 5.3 GHz range.

Gaming is a different story, as there the huge Vcache AMD has in Zen3D and Zen4 will probably give them the edge. It is a smart move by AMD, as it might be an "easy" fix for the ADL problem between now and the Zen4 launch. It will be an interesting year for sure. I hope this brings prices down, but that might be a pipe dream; they might simply keep the same prices or even jack them up for higher margins.

exquisitechar · Aug 19, 2021

inf64 said:
AMD will most likely win MT hands down so I will not even address that portion.

You think? I can't see a 16 core Zen 4 CPU beating Raptor Lake if it's 8 P-cores/16 E-cores and I think it's unlikely there will be a 24 core Raphael CPU.

Abwx · Aug 19, 2021

exquisitechar said:
You think? I can't see a 16 core Zen 4 CPU beating Raptor Lake if it's 8 P-cores/16 E-cores and I think it's unlikely there will be a 24 core Raphael CPU.

Rumour is that it's 29% faster at iso-frequency, but that's in a server environment, so a sizeable chunk of the improvement is surely due to higher RAM bandwidth. Nevertheless, the results should be good enough in DT.

Besides, we'll have to wait for real ADL benchmarks from Intel. So far the suite they used seems selected to produce a high number; the only relevant benchmarks in their average are Spec_int and possibly Geekbench and WebXPRT3. All the rest are totally irrelevant as far as IPC is the sought metric.

itsmydamnation · Aug 19, 2021

AMDK11 said:
For example, Zen3 is a project that is almost completely new from scratch. And you say that it is basically the same as Zen and Zen2 because the decoders and execution ports are roughly the same, but it is not. It's like saying that two different cars are the same because they have 4 wheels and a five-cylinder engine.

I didn't even say that, but look at what they changed and how they changed it. Did

peak L/S bandwidth increase, No
peak ALU throughput increase, No
peak FPU throughput increase, No
peak dispatch/issue/retire increase, No
peak decode throughput increase, No
number of register file ports increase, No
number of FP register file ports increase, No
the floor plan change, No
internal structures increase massively in size (reg, L/S queue, dispatch queue etc), No
the pipeline change, No

Get the point? Zen3 is bound by the scope of Zen1, and Zen1 was designed with a big enough fundamental base through the core that large gen-on-gen improvements could happen.

Now compare that list with Willow to Golden or Tremont to Gracemont.

I'm saying I think Zen4 won't be bound by the scope of Zen3, in the same way that Zen wasn't bound by Bulldozer even though lots of things came straight from it (like the FPU still supporting XOP). So I expect fundamental changes in fetch/decode, execution and retirement/memory access for Zen 4.

To go with your metaphor: go stick a 560HP 6L engine from an Enzo in a Golf GTI and see how well it works. They're both cars, they both have 4 wheels.

Ajay · Aug 19, 2021

itsmydamnation said:
I'm saying I think Zen4 won't be bound by the scope of Zen3, in the same way that Zen wasn't bound by Bulldozer even though lots of things came straight from it (like the FPU still supporting XOP). So I expect fundamental changes in fetch/decode, execution and retirement/memory access for Zen 4.

Yes, thank you for the above summary. Given its development time, and Papermaster's (IIRC) comments on tock, tock, tock, tock, I expect Zen4 to be a very substantial evolution of the Zen architecture. Looking forward to seeing what ~~Jim~~ Mike Clark's team has come up with.

DisEnchantment · Aug 19, 2021

Ajay said:
I expect Zen4 to be a very substantial evolution of the Zen architecture. Looking forward to seeing what Jim Clark's team has come up with.

You mean Mike Keller ?

FWIW, from here

AMD Zen 3: An AnandTech Interview with CTO Mark Papermaster

www.anandtech.com

MP: It’s generational - if you look to the future we drive improvements in every generation. So you will see AMD transition to PCIe Gen 5 and that whole ecosystem. You should expect to hear from us in our next round of generational improvements across both the next-gen core that is in design as well as that next-gen IO and memory controller complex.

Papermaster mentioned a next-gen core and next-gen IO last October.
IIRC it was also Papermaster who said AMD was not throwing silicon at the problem and stayed conservative with die size increases for Zen3 while achieving the 19% IPC gain. That is why I doubt AMD would throw tens of mm2 of die area at cache for a questionable lead across the board, save for enterprise loads.

Next Gen IO = PCIe 5
Next Gen Memory Controller = DDR5
Next Gen Core = Persephone is next Gen?

moinmoin · Aug 19, 2021

DisEnchantment said:
Next Gen Core = Persephone is next Gen?

A gen above Cerberus anyway.

leoneazzurro · Aug 20, 2021

itsmydamnation said:
I didn't even say that, but look at what they changed and how they changed it. Did

peak L/S bandwidth increase, No
peak ALU throughput increase, No
peak FPU throughput increase, No
peak dispatch/issue/retire increase, No
peak decode throughput increase, No
number of register file ports increase, No
number of FP register file ports increase, No
the floor plan change, No
internal structures increase massively in size (reg, L/S queue, dispatch queue etc), No
the pipeline change, No

Well, Zen3 is an iterative design in its general structure, but I think you got some points wrong: Zen3 has larger retire capability, the register files have more ports, and FP is definitely changed, with doubled schedulers and

AMD Zen 3 Ryzen Deep Dive Review: 5950X, 5900X, 5800X and 5600X Tested

www.anandtech.com

Floating point capability in certain cases has doubled and there are significant improvements in many areas.

itsmydamnation · Aug 20, 2021

leoneazzurro said:
Well, Zen3 is an iterative design in its general structure, but I think you got some points wrong: Zen3 has larger retire capability, the register files have more ports, and FP is definitely changed, with doubled schedulers and

AMD Zen 3 Ryzen Deep Dive Review: 5950X, 5900X, 5800X and 5600X Tested

www.anandtech.com

FADD and FMUL throughput has doubled (even if with higher latency) and there are significant improvements in many areas.

So I was very specific for a reason.

I said large gains in structures. Intel went from like 224 to 356 to 512; AMD went, what, 192, 224, 256 (cbf looking)?
Also, I am right on the reg file ports: they added more pipelines, but the actual read/write to the reg file is the same, which is why I was always talking about peak numbers. Zen1 to 2 to 3 has been about increasing utilisation, which is why they have been getting perf/watt improvements at ISO process.

leoneazzurro · Aug 20, 2021

Yes, but peak throughput means very little without utilization. Intel went to 356 for the regfile and IPC stayed lower than Zen3, so...
If anything, the fact that AMD stayed conservative with Zen3 while achieving a significant IPC increase points to plenty of "low-hanging fruit" that can be picked in Zen4...
