Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", which I think will likely double to 64 MB).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5) and new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
  • Like
Reactions: richardllewis_01

jamescox

Senior member
Nov 11, 2009
637
1,103
136
I think AMD have already stated that Zen 4 will be a fairly significant uArch change by itself, so we can't rule anything out.

AVX-512 itself is just instructions (fragmented though it is); the main thing is the actual FP/SIMD unit that executes them after decode. As Zen 2 went from 128-bit to 256-bit units despite being a "minor uArch update", that means anything goes.

That said, given AMD's return to competitiveness, it isn't impossible that we could see an entirely different SIMD solution for 512 bit - unlike with XOP, where the uncompetitive state of Bulldozer doomed those extensions to very niche applications; I'm not sure any commercial apps ever supported it at all.

What we may also see is AVX-512 instruction support, albeit on fused 256-bit units - then just a doubling of those units should give the far more widespread AVX2 code a real boost, as it did for Zen 2.
I expect them to significantly widen the interconnect all the way through, to support increased DDR5 bandwidth and the throughput offered by chip stacking (possibly 1024-bit or wider interfaces). I also expect that there will be Infinity Cache, either in the interposer or stacked as a separate piece of silicon; that could be embedded under other dies with TSMC's chip-stacking technologies. That might be the time to add more FP units. They may also restructure the cache again for a larger L2 and for the bandwidth requirements of more FP units.

The possibility of an embedded FPGA opens up a lot of other options. I don't know if it would be big enough to program an AVX-512 unit with all instructions. You might be able to program an instruction set just for bfloat16 and have it be significantly wider than a regular unit that has to include a huge number of other instructions. I don't have time to read all of the links, so I am wondering whether the embedded FPGA is a Zen 4 feature or farther in the future.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
A lot of servers don’t really use FP processing at all and it takes significant die area. Consumer applications don’t really need it either.
AVX is both FP and INT SIMD.

Plenty of applications use SIMD processing - it's even in Windows code.

The dav1d AV1 decoder uses a royal shedload of AVX2 assembly to make AV1 playable on x86 platforms, as do most software media decoders, and plenty of emulators use lots of SIMD.

I'd be surprised if game engines like Unreal and Unity don't use a whole lot of it too.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,744
3,077
136
AVX is both FP and INT SIMD.

Plenty of applications use SIMD processing - it's even in Windows code.

The dav1d AV1 decoder uses a royal shedload of AVX2 assembly to make AV1 playable on x86 platforms, as do most software media decoders, and plenty of emulators use lots of SIMD.

I'd be surprised if game engines like Unreal and Unity don't use a whole lot of it too.
I would think every game engine that's 64-bit would have at least some SIMD, and I would think most from the P4/K7 era onwards would as well.
 

uzzi38

Platinum Member
Oct 16, 2019
2,565
5,575
146
AVX is both FP and INT SIMD.

Plenty of applications use SIMD processing - it's even in Windows code.

The dav1d AV1 decoder uses a royal shedload of AVX2 assembly to make AV1 playable on x86 platforms, as do most software media decoders, and plenty of emulators use lots of SIMD.

I'd be surprised if game engines like Unreal and Unity don't use a whole lot of it too.
Any game engine that implements PhysX uses AVX2 - which is a very large number of them, to say the least.

As for servers, though, it really depends. Cloud providers, for example, are most interested in improvements to INT over FP, just because that's what they see the most use for.
 

naukkis

Senior member
Jun 5, 2002
702
571
136
Not if they want to run on Celeron, Pentium.

All x86-64 CPUs have at least SSE2 SIMD capability, and all the new Celeron/Pentium lines have up to SSE4.2, so most programs target SSEx SIMD. AVX support in programs is very rare, thanks to Intel's segmentation: to support AVX, a developer needs either to maintain two code paths or to lose the users that don't have AVX support. Most do the obvious thing and support only the commonly supported instruction sets.
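To illustrate what the two code paths look like in practice, here is a minimal sketch (my own, with made-up function names) using GCC/Clang's __builtin_cpu_supports to pick a path at runtime; SSE2 can be the unconditional fallback because it is the x86-64 baseline:

    #include <stddef.h>

    /* Baseline path: plain loop; on x86-64 the compiler can always use SSE2. */
    static float sum_sse2(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* AVX2 path: same loop, but the compiler may vectorize it 8-wide. */
    __attribute__((target("avx2")))
    static float sum_avx2(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Dispatch based on a CPUID-backed runtime check. */
    float sum(const float *a, size_t n) {
        if (__builtin_cpu_supports("avx2"))
            return sum_avx2(a, n);
        return sum_sse2(a, n);
    }

Every vectorized routine needs this kind of duplication and testing on both paths, which is exactly the maintenance cost that keeps developers on the common baseline.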
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,744
3,077
136
This seems pretty cool.


So lots of games requiring MMX, but they don't have a category for SSE :(


Edit: the best proxy for SSE would be a minimum requirement of a P3.

Looking at that list brings back memories.

 

naukkis

Senior member
Jun 5, 2002
702
571
136
So lots of games requiring MMX, but they don't have a category for SSE :(

Every game from the last twenty years uses SSE, but that needs no attention on x64 systems, as SSE and SSE2 are part of the x64 baseline. Many games twenty years ago needed MMX support from the CPU just to run, and MMX was optional on x86, so those lists had some point.

For AVX that kind of list does have a point, but today the list is rather short. There are maybe one or two titles that require AVX support.
 

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
I have to think most games don't use AVX(+) because it's not any faster on Jaguar. That will eventually change, with AVX2 becoming a hard requirement.
 

naukkis

Senior member
Jun 5, 2002
702
571
136
I have to think most games don't use AVX+ because it's not any faster on Jaguar. That will eventually change.

AVX isn't that much faster on any CPU in games - 256-bit vectors aren't really useful for much of anything. But the main reason not to use AVX is that not every CPU capable of running the software will support it. And as long as a program targets something like a Pentium, it can't easily support AVX. No software designer wants to lose a big chunk of potential sales for a little performance upgrade - or to about double their code-upkeep effort by supporting two different code paths.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,744
3,077
136
AVX isn't that much faster on any CPU in games - 256-bit vectors aren't really useful for much of anything. But the main reason not to use AVX is that not every CPU capable of running the software will support it. And as long as a program targets something like a Pentium, it can't easily support AVX. No software designer wants to lose a big chunk of potential sales for a little performance upgrade - or to about double their code-upkeep effort by supporting two different code paths.
Why does this whole "256-bit vectors are useless for gaming" idea persist?

It's twice the throughput; you just have to write for it. The other reason someone might want to use it is double-precision throughput.
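For concreteness, here's a rough sketch (mine, untested, names arbitrary) of the same multiply-add loop written 4 wide with SSE and 8 wide with AVX intrinsics - the AVX version does the same work in half the iterations:

    #include <immintrin.h>

    /* y[i] = a * x[i] + y[i]; n is assumed to be a multiple of 8
       to keep the sketch short (a real version handles the tail). */

    void axpy_sse(float a, const float *x, float *y, int n) {
        __m128 va = _mm_set1_ps(a);
        for (int i = 0; i < n; i += 4) {          /* 4 floats per iteration */
            __m128 vx = _mm_loadu_ps(x + i);
            __m128 vy = _mm_loadu_ps(y + i);
            _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
        }
    }

    void axpy_avx(float a, const float *x, float *y, int n) {
        __m256 va = _mm256_set1_ps(a);
        for (int i = 0; i < n; i += 8) {          /* 8 floats per iteration */
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
        }
    }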

Every game from the last twenty years uses SSE, but that needs no attention on x64 systems, as SSE and SSE2 are part of the x64 baseline. Many games twenty years ago needed MMX support from the CPU just to run, and MMX was optional on x86, so those lists had some point.

For AVX that kind of list does have a point, but today the list is rather short. There are maybe one or two titles that require AVX support.
There are games I ran into that I couldn't play on K8/K10 because of SSSE3.
 

jpiniero

Lifer
Oct 1, 2010
14,510
5,159
136
AVX isn't that much faster on any CPU in games - 256-bit vectors aren't really useful for much of anything. But the main reason not to use AVX is that not every CPU capable of running the software will support it. And as long as a program targets something like a Pentium, it can't easily support AVX. No software designer wants to lose a big chunk of potential sales for a little performance upgrade - or to about double their code-upkeep effort by supporting two different code paths.

Core-based Celerons and Pentiums are such a tiny part of sales that it's not really a big deal. Maybe there are people still using Gulftown out there. Once developers start targeting the current-gen consoles, AVX2 will just become the standard. That's basically Haswell+ and all the Ryzens.
 

naukkis

Senior member
Jun 5, 2002
702
571
136
Why does this whole "256-bit vectors are useless for gaming" idea persist?

It's twice the throughput; you just have to write for it.

There aren't a lot of things that can be parallelized to 8-wide vectors. And for those that could be, making it really effective - not just a minor performance variation - requires that every CPU running the code can execute it, instead of falling back to some less-optimized path. In other words, developers won't go for a 10% improvement on AVX-enabled CPUs if the trade-off is 40% less performance on CPUs that don't support AVX.

ARM had the same problem a few years ago, when NEON wasn't mandatory and there was one phone SoC that didn't support it - most code shipped without NEON support even though only a tiny fraction of devices lacked it.
 

Bigos

Member
Jun 2, 2019
127
281
136
If you parallelize code, you don't do it over N-wide logical vectors (the AoS, or array-of-structs, layout). You do it over arrays of scalars - usually SoA, struct-of-arrays: a set of scalar arrays, each holding one component of the logical vector. With such code it doesn't matter whether you mathematically work on 3-wide or 4-wide vectors; you can speed it up with 4-wide, 8-wide, or even 512-bit-wide hardware vectors (given enough entries in the array).

Of course, such a programming paradigm requires planning and is not straightforward to implement - especially if retrofitted into an implementation that wants to treat each "object" as separate. Any sane game engine would use such an approach anyway, as it makes better use of the CPU cache (which in 90% of cases matters more than CPU execution resources). I would thus assume such a scheme is possible for the majority of performance-critical computations in most games.
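As a minimal illustration of the layout difference (my own sketch, invented names): with SoA, a pass that touches one component walks a contiguous array, so the same loop vectorizes at whatever hardware width is available:

    /* AoS: one struct per particle; the y-velocities are strided in memory. */
    struct ParticleAoS { float x, y, z, vx, vy, vz; };

    /* SoA: one contiguous array per component. */
    struct ParticlesSoA {
        float *x, *y, *z;
        float *vx, *vy, *vz;
        int count;
    };

    /* The gravity pass touches only vy[], which is now a dense stream, so
       the compiler is free to use 4-, 8- or 16-wide vectors for this loop. */
    void apply_gravity(struct ParticlesSoA *p, float g, float dt) {
        for (int i = 0; i < p->count; i++)
            p->vy[i] += g * dt;
    }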
 
  • Like
Reactions: lightmanek

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Upcoming AAA games on current consoles (not cross-gen) likely won't run on processors without AVX2 due to performance requirements, so AVX2 may well become the differentiator for minimum specs then.
 

soresu

Platinum Member
Dec 19, 2014
2,617
1,812
136
Upcoming AAA games on current consoles (not cross-gen) likely won't run on processors without AVX2 due to performance requirements, so AVX2 may well become the differentiator for minimum specs then.
Hopefully they stop making AAA game ports for the XB1/PS4 generation of consoles by 2023, by which time I would expect plenty of PS5 and XSX units to be in consumers' hands.
 
  • Like
Reactions: moinmoin

moinmoin

Diamond Member
Jun 1, 2017
4,934
7,619
136
Interesting. At first I thought of the ARM core already included on every Zen chip. But the flow chart suggests it would only handle the pathfinding (probably at the SCF level), with the mini processor slotting in between pure I/O and core-complex work. I'm undecided on whether it's a full core; the patent text does suggest instructions are handled (in [0008]), which means either a full core or at least a full frontend outside the core complex.

The patent essentially covers the following two areas:
  • Conserving energy by having several staged levels work on a task (GPIO, mini processor, core complex), with the mini processor working transparently (so not big.LITTLE-like, which depends on OS support)
  • Similarly, there can be several processors covering different subsets of the supported instruction set, with tasks moved between them according to support (which sounds fitting for including FPGA cores)
The first point makes me wonder if the mini processor would fit on the IOD, allowing whole chiplets to rest.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,590
5,722
136
The flow chart from the patent shows how the proposed system works.

Actually, a more consequential patent is this one, which I also posted some pages back.
AMD's patent for their hybrid chip is quite interesting. They rely on an illegal-instruction exception to wake up the big cores and transfer the register state to them, avoiding the dependence on the OS scheduler that current hybrid designs have. It sounds like a bad idea from a security perspective on server chips, though, so it will probably be client-only.
The US patent has been awarded; it was filed on Oct 27, 2017.

What is interesting is that AMD has applied for another, similar patent. It seems likely that they found additional things to do with it and applied again. It was filed June 25, 2020.
It is not yet awarded, still in the application state, but it will surely be granted, as it is just a continuation of the awarded patent.

Similar abstract, but with 20 additional claims.

You can read both; they are awesome.




Like I wrote in my earlier post, there are shared registers between the big and small cores.
In low-power mode the small cores run only a small subset of low-power instructions; when a complex instruction is encountered, a trap occurs and the big core takes over seamlessly.
The OS is not even aware that any of this is happening. To the OS it is basically the same core.
Because of the shared registers, L2, etc., the number of small cores is exactly the same as the number of big cores, and at any given time either the big core or the small core is active, not both.
In my opinion this could be done fairly cheaply in terms of die real estate: they could power-gate selected blocks in the current core, add special execution ports, L/S ports and other tidbits, and that's all.
I imagine in low power the first target would be to power-gate the complex decoder block (leaving just the simple decode blocks), power-gate all the execution ports except the simplest and most power-efficient one, power-gate a chunk of the register file, and so on.
More innovative than big.LITTLE, imo.

I suppose this opens up new possibilities for what the CPU can do.
It could execute multiple ISAs with the same core :blush: - an illegal-opcode trap in one core switches the CPU to another, which continues execution. And the OS is not aware.
Dreams...
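As a rough userspace analogue of that trap-and-take-over flow (just an illustration on x86-64 Linux, not the patent's actual mechanism): a SIGILL handler can catch an "unsupported" opcode and resume execution afterwards, much as the big core would resume from the small core's fault:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <ucontext.h>

    /* Illustration only: trap an "unsupported" instruction (ud2) via SIGILL
       and resume, loosely mimicking the patent's fault-then-hand-off flow. */
    static void on_sigill(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)si;
        ucontext_t *uc = (ucontext_t *)ctx;
        /* In the patent, this is where the big core would take over with the
           shared register state; here we just skip the 2-byte ud2 opcode. */
        uc->uc_mcontext.gregs[REG_RIP] += 2;
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_sigill;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGILL, &sa, NULL);

        puts("before the 'complex' instruction");
        __asm__ volatile("ud2");  /* stands in for an instruction the small core lacks */
        puts("resumed after the trap");
        return 0;
    }

The hardware version would of course do the hand-off in silicon with shared register state rather than through the OS signal path.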
 

NTMBK

Lifer
Nov 14, 2011
10,208
4,940
136
If you parallelize code, you don't do it over N-wide logical vectors (the AoS, or array-of-structs, layout). You do it over arrays of scalars - usually SoA, struct-of-arrays: a set of scalar arrays, each holding one component of the logical vector. With such code it doesn't matter whether you mathematically work on 3-wide or 4-wide vectors; you can speed it up with 4-wide, 8-wide, or even 512-bit-wide hardware vectors (given enough entries in the array).

Of course, such a programming paradigm requires planning and is not straightforward to implement - especially if retrofitted into an implementation that wants to treat each "object" as separate. Any sane game engine would use such an approach anyway, as it makes better use of the CPU cache (which in 90% of cases matters more than CPU execution resources). I would thus assume such a scheme is possible for the majority of performance-critical computations in most games.

Except such code is a total pain in the arse to work with. It's harder to write, harder to modify, harder to debug, harder to extend. And as soon as you have any divergent control flow, your efficiency falls off a cliff.

For anything where you can transform the algorithm to fit that pattern, it's absolutely the most performant approach. But you're going to see that type of code in key centralised systems, not entire engines written that way. Programmer productivity and effectiveness are valuable resources.
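To show why divergence hurts, a small sketch (mine, AVX intrinsics, n assumed a multiple of 8): the scalar branch becomes "compute both sides, then blend", so every lane pays for both paths:

    #include <immintrin.h>

    /* Scalar logic: out[i] = (in[i] > 1.0f) ? in[i] * 0.5f : in[i];
       Vectorized 8 wide, the branch disappears: both the scaled and the
       unscaled values are computed for every lane, then blended by a mask. */
    void clamp_scale(const float *in, float *out, int n) {
        const __m256 limit = _mm256_set1_ps(1.0f);
        const __m256 half  = _mm256_set1_ps(0.5f);
        for (int i = 0; i < n; i += 8) {
            __m256 v      = _mm256_loadu_ps(in + i);
            __m256 scaled = _mm256_mul_ps(v, half);              /* "taken" side */
            __m256 mask   = _mm256_cmp_ps(v, limit, _CMP_GT_OQ);
            /* keep v where v <= 1.0f, take scaled where v > 1.0f */
            _mm256_storeu_ps(out + i, _mm256_blendv_ps(v, scaled, mask));
        }
    }

With only two cheap sides this costs little, but the wasted work grows with every additional divergent path, which is why heavily branchy logic resists this style.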
 
  • Like
Reactions: Gideon

Ajay

Lifer
Jan 8, 2001
15,332
7,792
136
Has there been any new info on when Zen 4 (Ryzen) might hit the retail market?
There was noise for a while about 'Warhol', which made me think 2022 - but not many rumors have popped up recently (AFAICT).