Discussion Zen 5 Architecture & Technical discussion


naukkis

Golden Member
Jun 5, 2002
1,010
852
136
This is the direct quote from the Software Optimization Guide for the AMD Zen5 Microarchitecture published in August 2024, revision 1.00:

So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

That isn't a split between simple and complex instructions - complex instructions can be quite short too. There's really no point in making the decode fetch matrix wider to allow decoding those overly long instructions simultaneously - there isn't the fetch bandwidth or mop extraction bandwidth to support those kinds of instruction combinations anyway.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,810
1,289
136
So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.
Decode pipe0 = D0,D1,D2,D3
D0 = >10-byte, Vector-path
Decode pipe1 = D4,D5,D6,D7
D4 = >10-byte, Vector-path

Fastpath, Double Fastpath = All decode slots.
Vectorpath = Only the first slot of each.
This behavior has actually been present on all AMD processors since this terminology was created - even Steamroller/Excavator behaves this way.

"The outputs of the early decoders keep all (DirectPath or VectorPath) instructions in program order. Early decoding produces three macro-ops per cycle from either path. The outputs of both decoders are multiplexed together and passed to the next stage in the pipeline, the instruction control unit. Decoding a VectorPath instruction may prevent the simultaneous decoding of a DirectPath instruction." Virtually D0/D4 each act like a large VectorPath/microcode decoder.
 

Cardyak

Member
Sep 12, 2018
78
181
106
His diagram contains errors (there is only one complex decoder in the cluster if we are to trust the software optimization guide), so I would not rule out other mistakes.
Yeah, I was guessing here regarding the 2 micro-op queues in Zen 5.

If you find out the true answer, please let me know and I'll update the diagram.
 

naukkis

Golden Member
Jun 5, 2002
1,010
852
136
So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

x86 considers an instruction "simple" when it outputs just one micro-op; complex instructions output at least 2 micro-ops. All AMD decoders except the K5's are "complex", being able to decode instructions which output more than just one mop. Intel decoders are split into simple and complex - simple decoders can only decode one-to-one mapped instructions, meaning the hardware instruction is equal to the ISA instruction and does not split into multiple internal operations.
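
To make that concrete, here is a small illustrative sketch (the micro-op counts are the usual textbook cases, not measurements on any specific core):

Code:
/* Illustrative only: typical x86 examples of a "simple" (single micro-op)
 * instruction vs a "complex" (multi micro-op) one. Exact cracking varies
 * by microarchitecture. */
#include <stdint.h>

uint64_t demo(uint64_t x, uint64_t y, uint64_t *mem) {
    /* reg-reg ADD: maps 1:1 to a single micro-op, so an Intel "simple"
       decoder can handle it. */
    __asm__ volatile("add %1, %0" : "+r"(x) : "r"(y));

    /* read-modify-write ADD to memory: typically cracks into load + ALU +
       store micro-ops, so on Intel it generally needs the "complex" decoder,
       while AMD decoders handle it as a fastpath macro-op in any slot. */
    __asm__ volatile("add %1, %0" : "+m"(*mem) : "r"(y));

    return x;
}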
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
x86 considers an instruction "simple" when it outputs just one micro-op; complex instructions output at least 2 micro-ops. All AMD decoders except the K5's are "complex", being able to decode instructions which output more than just one mop. Intel decoders are split into simple and complex - simple decoders can only decode one-to-one mapped instructions, meaning the hardware instruction is equal to the ISA instruction and does not split into multiple internal operations.
Thanks for the explanation. Obviously this was a lack of knowledge on my side, as I had simply assumed the split was based on instruction length rather than on the number of micro-ops produced.

Therefore the decoders were correctly labelled on Cardyak's diagram. Sorry for the confusion.
 

StefanR5R

Elite Member
Dec 10, 2016
6,623
10,469
136
I was guessing here regarding the 2 micro-op queues in Zen 5
Looks plausible, given that the µop cache is dual-ported. (The split could be static or dynamic though...)

Gotta re-read CnC's analysis and the SOG to see whether anything is said about
(a) how many µops/cycle a single thread can pull from the µop cache: up to 6, or up to 6+6?,
(b) the µop queue depth. If it is shallower in 1T mode than ideally possible, then it may be harder to make full use of the next stage (the ROB) in 1T mode, in which case Zen 5's deficit at "stitching the out-of-order instruction streams back in-order at the micro-op queue" might hinder 1T performance beyond just the decode bandwidth limit...?
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
(a) how many µops/cycle a single thread can pull from the µop cache: up to 6, or up to 6+6?,
To quote: https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile
To further speed up instruction delivery, Zen 5 fills decoded micro-ops into a 6K entry, 16-way set associative micro-op cache. This micro-op cache can service two 6-wide fetches per cycle. Evidently both 6-wide fetch pipes can be used for a single thread.
This also matches what can be found in software optimization guide (chapter 2.9.1)
The Op Cache (OC) is a cache of previously decoded instructions. When instructions are being fetched from the Op Cache, normal instruction fetch and decode are bypassed. This improves pipeline latency because the Op Cache pipeline is shorter than the traditional fetch and decode pipeline. It improves bandwidth because the maximum throughput from the Op Cache is 12 instructions per cycle, whereas the maximum throughput from the traditional fetch and decode pipeline is 4 instructions per cycle per thread.
 

Jan Olšan

Senior member
Jan 12, 2017
568
1,119
136
Somebody found a nice use (huge performance boosts?) for the VP2INTERSECT instruction in Zen 5.
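
For anyone unfamiliar with it, VP2INTERSECT takes two vectors and reports which lanes of each have a matching value somewhere in the other, as a pair of masks. A minimal sketch using the _mm512_2intersect_epi32 intrinsic (assuming a compiler exposing AVX-512 VP2INTERSECT and a CPU that supports it, such as Zen 5 - this is not the specific use case referenced above):

Code:
#include <immintrin.h>
#include <stdio.h>

/* Minimal sketch of VP2INTERSECT: one instruction reports which lanes of a
 * occur anywhere in b, and which lanes of b occur anywhere in a.
 * Build with e.g. gcc -O2 -mavx512f -mavx512vp2intersect. */
int main(void) {
    __m512i a = _mm512_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8,
                                  9, 10, 11, 12, 13, 14, 15, 16);
    __m512i b = _mm512_setr_epi32(4, 8, 15, 16, 23, 42, 0, 0,
                                  0, 0, 0, 0, 0, 0, 0, 0);
    __mmask16 match_a, match_b;

    /* match_a bit i is set if a[i] equals some element of b;
       match_b bit j is set if b[j] equals some element of a. */
    _mm512_2intersect_epi32(a, b, &match_a, &match_b);

    printf("lanes of a found in b: 0x%04x\n", (unsigned)match_a);
    printf("lanes of b found in a: 0x%04x\n", (unsigned)match_b);
    return 0;
}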

 

Kepler_L2

Senior member
Sep 6, 2020
970
4,037
136
Somebody found a nice use (huge performance boosts?) for the VP2INTERSECT instruction in Zen 5.

Too bad Zen6 deprecates it
 

yuri69

Senior member
Jul 16, 2013
673
1,202
136
Very impressive gains; I think 2025 will finally showcase the software reorg AMD has been working on for some years.
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
 

carrotmania

Member
Oct 3, 2020
119
300
136
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
What kind of logic is that? Does that mean every ray-traced game coming out this year is 5 years late? Software is done when it's done, and 3 months "late" is better than never at all, which was AMD's previous form. 400% is worth the delay. I take it this will run even better on MI300...
 

branch_suggestion

Senior member
Aug 4, 2023
763
1,662
106
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
Progress is progress.
Better to release late than not at all. And better than an on-time release that is buggy and missing features.
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
Very impressive gains; I think 2025 will finally showcase the software reorg AMD has been working on for some years.
I find this comparison lacking in details. I mean, they give you enough to compare ZenDNN against ZenDNN, but not against other solutions. You don't know if they are playing catch-up or whether they actually improved things. Inference is heavily dependent on memory BW, and they don't give information on what that memory BW is, so it is hard to estimate how it would do against other frameworks.
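
For what it's worth, a hedged back-of-the-envelope sketch of why the missing bandwidth number matters (all figures below are made-up placeholders, not numbers from AMD's post, and it assumes batch-1 LLM token generation where every weight is streamed from memory per token):

Code:
#include <stdio.h>

/* Back-of-the-envelope bound for memory-bandwidth-limited inference.
 * All numbers are hypothetical placeholders, not AMD's figures. */
int main(void) {
    double mem_bw_gbs      = 500.0; /* assumed sustained memory bandwidth, GB/s */
    double params_billions = 7.0;   /* assumed model size in parameters */
    double bytes_per_param = 2.0;   /* e.g. bf16 weights */

    double model_bytes = params_billions * 1e9 * bytes_per_param;
    double tokens_per_s_bound = (mem_bw_gbs * 1e9) / model_bytes;

    printf("upper bound: ~%.1f tokens/s for a single batch-1 stream\n",
           tokens_per_s_bound);
    return 0;
}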
 

branch_suggestion

Senior member
Aug 4, 2023
763
1,662
106
I find this comparison lacking in details. I mean, they give you enough to compare ZenDNN against ZenDNN, but not against other solutions. You don't know if they are playing catch-up or whether they actually improved things. Inference is heavily dependent on memory BW, and they don't give information on what that memory BW is, so it is hard to estimate how it would do against other frameworks.
They compare it to IPEX 2.4.0 iso-hardware.
Phoronix will compare it to GNR, don't worry.
 

moinmoin

Diamond Member
Jun 1, 2017
5,240
8,454
136
Moving this Zen 5 architecture discussion from the Nova Lake thread here:
Even with the first implementation in Zen 1 most of it was already competitively shared though.


[Image: AMD Zen SMT thread-resource-sharing diagram from Hot Chips 28]

  • Red - Competitively shared structures
  • Turquoise - Competitively shared and SMT tagged
  • Blue - Competitively shared with Algorithmic Priority
  • Green - Statically Partitioned
AMD has an update for Zen 5 at https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html

This is incorrect for Zen 5. Zen 5 has the whole front-end statically partitioned; it has decoders, op-cache and so on duplicated for each thread. They use a significant amount of silicon just for SMT which isn't used for 1T at all.
Nope, they're all watermarked.
Zen 5's micro-op cache is shared, per Chips and Cheese's testing.
The uOP cache has a capacity of 6k entries; if SMT is active, each thread gets 3k entries. Generally only the decoders are statically partitioned (one decoder per thread even if you disable SMT in the BIOS). Everything else is either competitively shared or watermarked.

It is interesting that AMD repeatedly refers to resources as competitively shared when, in cases like the dual decoder, they seem to be effectively statically partitioned (due to a bug?).
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
It is interesting that AMD repeatedly refers to resources as competitively shared
I think they could hire somebody to proofread. I mean, in the software optimization guide for Zen 5 you can read:
The maximum capacity of the Op Cache is 6 K instructions or fused instructions. The actual limit may be less due to efficiency considerations. Avoid hot code regions that approach this size when only one thread is running on a physical core, or half this size when two threads share a physical core. The Op Cache is physically tagged, which allows Op Cache entries to be shared between both threads when fetching shared code.
Which to me suggests at least watermarking, as otherwise the limit would not be so precise. But in the same manual you can find a table:
[Screenshot: table from the Zen 5 software optimization guide]

The decoders are their own mystery, since the messaging from the company could suggest either terrible internal communication (a fixed decoder per thread was always intended and is working as planned) or a bug of some sort that they were unable to fix, without correcting all the public materials after they realized they had a problem.
 

naukkis

Golden Member
Jun 5, 2002
1,010
852
136
That AMD documentation also claims that the op-cache is physically tagged and able to share code between threads. That doesn't make any sense to me - has anyone tested whether that is real? There are very real side-channel possibilities there and only an extremely small possible advantage from those additional physically-tagged op-cache hits between threads.
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
That doesn't make any sense to me - has anyone tested whether that is real?
If you are doing fork and join, that should dramatically boost efficiency (basically both workers run the same code on different slices of data).
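
To illustrate what I mean by fork and join, a minimal hypothetical sketch (pthreads-based): both workers execute the exact same function on different halves of an array, which is the situation where physically-tagged op-cache sharing between SMT siblings could in principle help.

Code:
#include <pthread.h>
#include <stdio.h>

/* Build with -pthread. Both workers run the same code on their own
 * half of the array, then the main thread joins them and combines. */
#define N 1000000
static double data[N];
static double partial[2];

static void *sum_slice(void *arg) {
    int id = (int)(long)arg;              /* worker 0 or 1 */
    int begin = id * (N / 2);
    int end = (id + 1) * (N / 2);
    double s = 0.0;
    for (int i = begin; i < end; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t t[2];
    for (long id = 0; id < 2; id++)        /* fork */
        pthread_create(&t[id], NULL, sum_slice, (void *)id);
    for (int id = 0; id < 2; id++)         /* join */
        pthread_join(t[id], NULL);

    printf("sum = %f\n", partial[0] + partial[1]);
    return 0;
}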

There are very real side-channel possibilities there and only an extremely small possible advantage from those additional physically-tagged op-cache hits between threads.
Could you explain the dangers you see? Leaking code is unlike leaking data, I would expect, but side channels are nasty things.
 

naukkis

Golden Member
Jun 5, 2002
1,010
852
136
If you are doing fork and join, that should dramatically boost efficiency (basically both workers run the same code on different slices of data).
The op-cache is supposedly shared via physical tags - those gains aren't there, because hits are only shared after the TLB, with simultaneous L1I and op-cache scans. Both threads would only perform optimally with their own op-cache hits.

Could you explain the dangers you see? Leaking code is unlike leaking data, I would expect, but side channels are nasty things.

With that scheme there are two timing paths, depending on whether the code is shared with the other CPU thread (process) or not. OK, code timing side channels aren't as bad as data ones, but creating one for no gain is still an unnecessary risk.