Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

NostaSeronx

Diamond Member
Sep 18, 2011
3,687
1,221
136
If they are fused, they are 512-bit.

If you think they are separate, please explain how a _mm512_permutex2var is going to execute in separate units with 256-bit registers.
Specifically, my comment was about a double-sized PRF and increased-width execution:
2x 320-entry PRFs + 6x 512-bit pipes

There is no native ZMM register, thus no 512-bit entry. Instead it is several 128-bit entries becoming a virtual 512-bit entry.

Rename/NSQ, which determines register names:
ZMM_Lo in Scheduler0 on FP0? and ZMM_Hi in Scheduler0 on FP1?

The same issue would have come up with 256-bit permutes in Zen2, if it were an actual problem, since Zen2 segments them via 128-bit units.

Zen2 FPU:
2x 128-bit FMUL, 2x 128-bit FADD :: Low-128b
2x 128-bit FMUL, 2x 128-bit FADD :: High-128b
Parallel YMM execution, no FlexFPU capability: no extra MULs/ADDs for 128-bit.

Zen3 FPU superseded by Zen4 FPU (very minimal visual change from Zen3):
1x 256-bit FMUL, 1x 256-bit FADD, 1x 256-bit FSTORE :: Low-256b
1x 256-bit FMUL, 1x 256-bit FADD, 1x 256-bit FSTORE :: High-256b
Parallel ZMM execution, FlexFPU capable: all ports are usable for 256-bit or 512-bit FPU ops alike.

Any perceived problem or breakage would have occurred with 256-bit permutes in Zen2.

Zen4 PPR:
FP128=0
FP256=1
There is no FP512.

Register width in Zen4 is XMM (physical width), YMM (virtual width), ZMM_Lo (virtual width), ZMM_Hi (virtual width).

The largest increase in Zen4 is in the load/store unit, not the FPU: the load/store change is meant to handle 64B loads/stores split across both FPU clusters.
 
Last edited:
  • Like
Reactions: Tlh97 and Kaluan

Asterox

Golden Member
May 15, 2012
1,026
1,775
136
Highly doubt that. Not if the 13400 is still a 65W part. Single-core boost could be similar to the 12600K, but all-core boost should be lower at that TDP.

For example, the i5 12400 is 6+0 with a 65W label.

- all-core turbo is 4 GHz

- single-core is 4.4 GHz

If you add 4 E cores to make the i5 13400 a 6+4 part, you either have to keep it under 65W or, if that doesn't work, raise the TDP.

The R5 7600 (no X) will probably still have better single-thread scores, with much higher single-thread frequency plus higher IPC. Intel non-K CPUs = much lower CPU frequency.
 
  • Like
Reactions: Tlh97 and Kaluan

coercitiv

Diamond Member
Jan 24, 2014
6,217
11,988
136
I'd really like to see how the Ryzen 7600 (6+6 = 12 threads) will compete against the RL 13400 (6+4 = 16 threads); the 13400 @ $200-220 should be faster vs the current 12600K.
Do you have a source for 13400 being 6+4? The only bit of rumor/leak on the small die SKUs was talking about using an Alder Lake Refresh instead.

To me it makes sense that at least some of the locked i5 SKUs introduce E cores, otherwise the performance gap going down from the 13600K will be too big. That being said, I've seen nothing to confirm this so far.
 
  • Like
Reactions: Tlh97 and Kaluan

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Am I missing something here? Is _mm512_permutex2var not able to be split into two micro-ops which are then scheduled independently?

If you were to split it into multiple instructions it would take more than two, require extra registers and operands, and a second pass through the pipeline.

(very complex decoding where the vector index is parsed into 4 sub-vectors and 2 masks)

permute high for high
permute low for high
blend
permute high for low
permute low for low
blend

This is seriously a case of not seeing the forest for the trees.

If you're doing flat operations, it doesn't matter: every lane is independent. The ability to cross lanes is a major piece of AVX-256/512 functionality. I highly doubt the above is anywhere close to the actual implementation.
 
  • Like
Reactions: Tlh97 and Kaluan

inf64

Diamond Member
Mar 11, 2011
3,703
4,034
136
If you were to split it into multiple instructions it would take more than two, require extra registers and operands, and a second pass through the pipeline.

(very complex decoding where the vector index is parsed into 4 sub-vectors and 2 masks)

permute high for high
permute low for high
blend
permute high for low
permute low for low
blend

This is seriously a case of not seeing the forest for the trees.

If you're doing flat operations, it doesn't matter: every lane is independent. The ability to cross lanes is a major piece of AVX-256/512 functionality. I highly doubt the above is anywhere close to the actual implementation.
I agree with you. It is logical that AMD opted for a native implementation of AVX-512, as the other option seems too convoluted.
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
If you were to split it into multiple instructions it would take more than two, require extra registers and operands, and a second pass through the pipeline.

This is seriously a case of not seeing the forest for the trees.
The most recent leaks on Genoa put its single-core AVX-512 performance on par with Sapphire Rapids, and well ahead in MT AVX-512 due to the much larger core count. So at least in benchmarks it seems to be performing that well.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,687
1,221
136
If you were to split it into multiple instructions it would take more than two, require extra registers and operands, and a second pass through the pipeline.

(very complex decoding where the vector index is parsed into 4 sub-vectors and 2 masks)

permute high for high
permute low for high
blend
permute high for low
permute low for low
blend

This is seriously a case of not seeing the forest for the trees.

If you're doing flat operations, it doesn't matter: every lane is independent. The ability to cross lanes is a major piece of AVX-256/512 functionality. I highly doubt the above is anywhere close to the actual implementation.
It was an insignificant issue in Zen2, so it is a non-issue in Zen4.
I agree with you. It is logical that AMD opted for a native implementation of AVX-512, as the other option seems too convoluted.
Zen4 is non-native for AVX-512. Zen2 has the exact same setup for AVX-256, but without a second scheduler, so an AVX-256 instruction only cracks at the Unit/PRF.

AVX512 remains the full operation for:
Decode stage
Retire stage
Rename stage

It is split into ZMM_Low and ZMM_High at the non-scheduler queue, where ZMM_Lo (lower 32B) is loaded to PRF0 and ZMM_Hi (upper 32B) is loaded to PRF1.
AVX-512 is split across Scheduler0 and Scheduler1: ZMM_Low executes on P0 (the FMUL port for MUL/permute) and ZMM_High executes on P1 (the second FMUL port for MUL/permute). Transferring elements cross-domain (across PRFs) is only a single-cycle penalty: ZMM_High <-> ZMM_Low is not 0-cycle, but 1-cycle. The same issue is present in Zen2, where YMM_Low-to-YMM_High permutes have a +1-cycle Hi<->Lo mix penalty. Zen3 has full YMM, but two separate YMM domains: with YMM reg A in PRF0 and YMM reg B in PRF1, a rotate permute adds an extra cycle for crossing domains. Moving the low 128 bits to the high 128 bits is a single-cycle penalty when they sit on separate PRFs; the penalty for moving the low 256 bits to the high 256 bits would be the same.

A single instruction can be split into many sub-operations by the scheduler. Permute instructions can be scheduled as many permute sub-operations in this case.

The Zen4 FPU is a scaled variant of the Zen3 FPU. There are no changes in regard to Unit#/PRF#.

1. AVX-512 is split across two AVX-256 units, like Zen2's AVX-256 being split across two AVX-128 units.
2. There is no 512-bit entry in the PRF; the maximum load size for one unit is 256-bit (2 entries of 128-bit).
3. The Zen4 FPU isn't significantly larger than the Zen3 FPU.

The most significant change in Zen4 is the load/store unit, which can load/store from BOTH FPU clusters, whereas in Zen3 it could only serve a SINGLE FPU cluster at a time.

Increased size (Biggest increase to smallest increase):
Load/store unit <-> Integer execution+Floating Point execution0&1

Decreased size (Biggest decrease to smallest decrease):
Branch Predictor unit <-> Fetch/Pick

Reminder for N5 scaling:
Memories = 1.35x
HP Standard Cells = 1.38x
HD Standard Cells = 1.8x
Analog = 1.2x

Zen3 -> Zen4 comes with up to a 1 GHz increase in boost clocks. That basically makes it impossible for the units to be 512 bits wide or for the PRF to double, since the power increase for such a change would be greater than 4x.

8-core 4.5 GHz power consumption:
Baseline = ~4.5 GHz @ 130W+
N5 and enlarged (full 512-bit, doubled PRF) = ~4.5 GHz @ 360W+
whereas
N5 and similar (fusion of 256-bit, fusion of PRF) = ~4.5 GHz @ 90W+ and ~5.625 GHz @ 130W+ <-- Only this one fits the bill; there is no frequency/power differential between AVX-256 and AVX-512: an AVX-512 workload on a single core = 5.7 GHz, an AVX-256 workload on a single core = 5.7 GHz.
 
Last edited:

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
That guy from ComputerBase used overclocked 12th gen models with faster memory. 12900K +42% faster DDR5, 12700K +25% faster DDR4, 5800X3D +19% faster DDR4. That comparison doesn't say much about Ryzen 7000. The advantage of the 12900K in that test is clearly based on OC and fast DDR5. Ryzen 7000 also can use fast DDR5.

Don't assume that just because Alder Lake benefits greatly from faster DDR5, Zen 4 will as well. Alder Lake has only 30 MB of L3 cache to share among 16 cores (and the L3 cache isn't very fast), which is why it benefits so heavily from fast DDR5 memory, likely due to all the cache misses in games.

Also, Alder Lake is monolithic while Zen 4 will be chiplet-based. I don't think Zen 4 will benefit as much from higher-speed DDR5, given its humongous, fast L3 cache and its chiplet design.

Large caches reduce the impact of faster memory.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
humm... Zen4 L3 bandwidth is ~50% higher, userbenchmark sees a ~50% increase :smirk:

It's not 50% higher. AIDA64 scales with multiple cores, which is why Zen 4 (and even Zen 3) scores so highly in those benches compared to Alder Lake.

The 5950X has similar L3 read bandwidth to the 7950X.