Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
799
1,351
136
Except for details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts! :)
 
Last edited:
  • Like
Reactions: richardllewis_01

NostaSeronx

Diamond Member
Sep 18, 2011
3,687
1,221
136
If they are fused, they are 512-bit.

If you think they are separate, please explain how a _mm512_permutex2var is going to execute in separate units with 256-bit registers.
Specifically, my comment was about a double-sized PRF and increased-width execution:
2x 320-entry PRFs + 6x 512-bit pipes

There is no native ZMM register, thus no 512-bit entry. Instead it is several 128-bit entries becoming a virtual 512-bit entry.

Rename/NSQ, which determines register names:
ZMM_Lo in Scheduler0 on FP0? and ZMM_Hi in Scheduler0 on FP1?

The same issue would have come up with 256-bit permutes in Zen2, if it were an actual problem, since Zen2 segments them via 128-bit units.

Zen2 FPU:
2x 128-bit FMUL, 2x 128-bit FADD :: Low-128b
2x 128-bit FMUL, 2x 128-bit FADD :: High-128b
Parallel YMM execution, no FlexFPU capability: no extra MULs/ADDs for 128-bit.

Zen3 FPU superseded by Zen4 FPU (very minimal visual change from Zen3):
1x 256-bit FMUL, 1x 256-bit FADD, 1x 256-bit FSTORE :: Low-256b
1x 256-bit FMUL, 1x 256-bit FADD, 1x 256-bit FSTORE :: High-256b
Parallel ZMM execution, FlexFPU capable: all ports are usable for 256-bit or 512-bit FPU ops alike.

Any perceived problem or breakage would have occurred with 256-bit permutes in Zen2.

Zen4 PPR:
FP128=0
FP256=1
There is no FP512.

Register width in Zen4 is XMM (physical width), YMM (virtual width), ZMM_Lo (virtual width), ZMM_Hi (virtual width).

The largest increase in Zen4 is in the load/store unit, not the FPU: the load/store change is meant to handle 64B loads/stores split across both FPU clusters.
 
Last edited:
  • Like
Reactions: Tlh97 and Kaluan

Asterox

Golden Member
May 15, 2012
1,026
1,775
136
Highly doubt that. Not if the 13400 is still a 65W part. Single-core boost could be similar to the 12600K, but all-core boost should be lower at that TDP.

For example, the i5 12400 is 6+0 with a 65W label.

- all-core turbo is 4 GHz

- single-core is 4.4 GHz

If you add 4 E cores to make the i5 13400 a 6+4 part, you either have to keep it under 65W or, if that doesn't work, raise the TDP.

The R5 7600 (no X) will probably still have better single-thread scores, with much higher single-thread frequency plus higher IPC. Intel non-K CPUs = much lower CPU frequency.
 
  • Like
Reactions: Tlh97 and Kaluan

coercitiv

Diamond Member
Jan 24, 2014
6,217
11,988
136
I'd really like to see how the Ryzen 7600 (6+6 = 12 threads) will compete against the RL 13400 (6+4 = 16 threads); the 13400 @ $200-220 should be faster vs the current 12600K.
Do you have a source for 13400 being 6+4? The only bit of rumor/leak on the small die SKUs was talking about using an Alder Lake Refresh instead.

To me it makes sense that at least some of the locked i5 SKUs introduce E cores, otherwise the performance gap going down from the 13600K will be too big. That being said, I've seen nothing to confirm this so far.
 
  • Like
Reactions: Tlh97 and Kaluan

Schmide

Diamond Member
Mar 7, 2002
5,587
719
126
Am I missing something here? Is _mm512_permutex2var not able to be split into two micro-ops which are then scheduled independently?

If you were to split it into multiple instructions it would take more than two, require extra registers and operands, and a second pass through the pipeline.

(very complex decoding where the vector index is parsed into 4 sub-vectors and 2 masks)

permute high for high
permute low for high
blend
permute high for low
permute low for low
blend

This is seriously a case of not seeing the forest for the trees.

If you're doing flat operations, it doesn't matter: every lane is independent. The ability to cross lanes is a major piece of AVX-256/512 functionality. I highly doubt the above is anywhere close to the actual implementation.
 
  • Like
Reactions: Tlh97 and Kaluan

inf64

Diamond Member
Mar 11, 2011
3,703
4,034
136
If you were to split it into multiple instructions it would take more than two, require extra registers and operands, and a second pass through the pipeline.

(very complex decoding where the vector index is parsed into 4 sub-vectors and 2 masks)

permute high for high
permute low for high
blend
permute high for low
permute low for low
blend

This is seriously a case of not seeing the forest for the trees.

If you're doing flat operations, it doesn't matter: every lane is independent. The ability to cross lanes is a major piece of AVX-256/512 functionality. I highly doubt the above is anywhere close to the actual implementation.
I agree with you. It is logical that AMD opted for a native implementation of AVX-512, as the other option seems too convoluted.
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
If you were to split it into multiple instructions it would take more than two, require extra registers and operands, and a second pass through the pipeline.

This is seriously a case of not seeing the forest for the trees.
The most recent leaks on Genoa put its single-core AVX-512 performance on par with Sapphire Rapids, and well ahead in MT AVX-512 due to the much larger core count. So at least in benchmarks it seems to be performing that well.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,687
1,221
136
If you were to split it into multiple instructions it would take more than two, require extra registers and operands, and a second pass through the pipeline.

(very complex decoding where the vector index is parsed into 4 sub-vectors and 2 masks)

permute high for high
permute low for high
blend
permute high for low
permute low for low
blend

This is seriously a case of not seeing the forest for the trees.

If you're doing flat operations, it doesn't matter: every lane is independent. The ability to cross lanes is a major piece of AVX-256/512 functionality. I highly doubt the above is anywhere close to the actual implementation.
It was an insignificant issue in Zen2, so it is a non-issue in Zen4.
I agree with you. It is logical that AMD opted for a native implementation of AVX-512, as the other option seems too convoluted.
Zen4 is non-native for AVX-512. Zen2 has the exact same setup for AVX-256, but without a second scheduler, so an AVX-256 instruction only cracks at the Unit/PRF.

AVX512 remains the full operation for:
Decode stage
Retire stage
Rename stage

It is split into ZMM_Low and ZMM_High at the non-scheduler queue, where ZMM_Lo (lower 32B) is loaded to PRF0 and ZMM_Hi (upper 32B) is loaded to PRF1.
AVX-512 is split across Scheduler0 and Scheduler1: ZMM_Low executes on P0 (the FMUL port for MUL/permute) and ZMM_High executes on P1 (the second FMUL port for MUL/permute). Transferring elements cross-domain (across PRFs) is only a single-cycle penalty: ZMM_High <-> ZMM_Low is not 0-cycle, but 1-cycle. The same issue is present in Zen2, where YMM_Low-to-YMM_High permutes have a +1-cycle Hi<->Lo mix penalty. Zen3 has full YMM, but two separate YMM domains: with YMM reg A in PRF0 and YMM reg B in PRF1, a rotate permute adds an extra cycle for crossing domains. Moving the low 128 bits to the high 128 bits is a single-cycle penalty when they sit on separate PRFs; the penalty for moving the low 256 bits to the high 256 bits would be the same.

A single instruction can be split into many sub-operations by the scheduler. Permute instructions can be scheduled as many permute sub-operations in this case.

The Zen4 FPU is a scaled variant of the Zen3 FPU. There are no changes in regard to Unit#/PRF#.

1. AVX-512 is split across two AVX-256 units, like Zen2's AVX-256 being split across two AVX-128 units.
2. There is no 512-bit entry in the PRF; the maximum load size for one unit is 256-bit (2 entries of 128-bit).
3. The Zen4 FPU isn't significantly larger than the Zen3 FPU.

The most significant change in Zen4 is the load/store unit, which can load/store from BOTH FPU clusters, whereas in Zen3 it could only serve a SINGLE FPU cluster at a time.

Increased size (Biggest increase to smallest increase):
Load/store unit <-> Integer execution+Floating Point execution0&1

Decreased size (Biggest decrease to smallest decrease):
Branch Predictor unit <-> Fetch/Pick

Reminder for N5 scaling:
Memories = 1.35x
HP Standard Cells = 1.38x
HD Standard Cells = 1.8x
Analog = 1.2x

Zen3 -> Zen4 comes with up to a 1 GHz increase in boost clocks. That basically makes it impossible for the units to be 512 bits wide or for the PRF to double, since the power increase for such a change would be greater than 4x.

8-core 4.5 GHz power consumption:
Baseline = ~4.5 GHz @ 130W+
N5 and enlarged (full 512-bit, doubled PRF) = ~4.5 GHz @ 360W+
whereas
N5 and similar (fusion of 256-bit, fusion of PRF) = ~4.5 GHz @ 90W+ and ~5.625 GHz @ 130W+ <-- Only this one fits the bill; there is no frequency/power differential between AVX-256 and AVX-512: an AVX-512 workload on a single core = 5.7 GHz, an AVX-256 workload on a single core = 5.7 GHz.
 
Last edited:

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
That guy from ComputerBase used overclocked 12th gen models with faster memory. 12900K +42% faster DDR5, 12700K +25% faster DDR4, 5800X3D +19% faster DDR4. That comparison doesn't say much about Ryzen 7000. The advantage of the 12900K in that test is clearly based on OC and fast DDR5. Ryzen 7000 also can use fast DDR5.

Don't assume that just because Alder Lake benefits greatly from faster DDR5, Zen 4 will as well. Alder Lake has only 30 MB of L3 cache to share among 16 cores (and the L3 cache isn't very fast), which is why it benefits so heavily from fast DDR5 memory, likely due to all the cache misses in games.

Also, Alder Lake is monolithic while Zen 4 will be chiplet-based. I don't think Zen 4 will benefit as much from higher-speed DDR5, given its humongous, fast L3 cache and its chiplet design.

Large caches reduce the impact of faster memory.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
humm... Zen4 L3 bandwidth is ~50% higher, userbenchmark sees a ~50% increase :smirk:

It's not 50% higher. AIDA64 scales with multiple cores, which is why Zen 4 (and even Zen 3) scores so highly in those benches compared to Alder Lake.

The 5950X has similar L3 read bandwidth to the 7950X.