Discussion Zen 5 Architecture & Technical discussion


naukkis

Golden Member
Jun 5, 2002
1,010
852
136
This is the direct quote from the Software Optimization Guide for the AMD Zen5 Microarchitecture published in August 2024, revision 1.00:

So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

That isn't a split between simple and complex instructions - complex instructions can be quite short too. There's really no point in making the decode fetch matrix wider to allow decoding those overly long instructions simultaneously - there isn't the fetch bandwidth or mop extraction bandwidth to support those kinds of instruction combinations anyway.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,810
1,289
136
So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.
Decode pipe0 = D0,D1,D2,D3
D0 = >10-byte, Vector-path
Decode pipe1 = D4,D5,D6,D7
D4 = >10-byte, Vector-path

Fastpath, Double Fastpath = All decode slots.
Vectorpath = Only the first slot of each.
This behavior has actually been present on all AMD processors since this terminology was created - even Steamroller/Excavator behaves this way.

"The outputs of the early decoders keep all (DirectPath or VectorPath) instructions in program order. Early decoding produces three macro-ops per cycle from either path. The outputs of both decoders are multiplexed together and passed to the next stage in the pipeline, the instruction control unit. Decoding a VectorPath instruction may prevent the simultaneous decoding of a DirectPath instruction." Virtually D0/D4 each act like a large VectorPath/microcode decoder.
 

Cardyak

Member
Sep 12, 2018
78
181
106
His diagram contains errors (there is only one complex decoder in the cluster if we are to trust the software optimization guide), so I would not rule out other mistakes.
Yeah, I was guessing here regarding the 2 micro-op queues in Zen 5.

If you find out the true answer, please let me know and I'll update the diagram.
 

naukkis

Golden Member
Jun 5, 2002
1,010
852
136
So I took it to mean that the setup is asymmetrical, since they underline only the first slot, not any slot. Of course, I might have read into it too literally, but in that case I find the wording confusing. Still, it would be a waste to put in more "complex" decoders if only the first slot will do the "complex" decoding, unless this is being muxed for some purpose.

x86 considers an instruction "simple" when it outputs just one micro-op; complex instructions output at least 2 micro-ops. All AMD decoders except the K5's are "complex", being able to decode instructions which output more than just one mop. Intel decoders are split into simple and complex - simple decoders can only decode one-to-one mapped instructions, meaning the hardware instruction is equal to the ISA instruction and does not split into multiple internal operations.
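
To make that concrete, here is a small illustrative sketch (the micro-op counts are the usual textbook cases, not measurements on any specific core):

Code:
/* Illustrative only: typical x86 examples of a "simple" (single micro-op)
 * instruction vs a "complex" (multi micro-op) one. Exact cracking varies
 * by microarchitecture. */
#include <stdint.h>

uint64_t demo(uint64_t x, uint64_t y, uint64_t *mem) {
    /* reg-reg ADD: maps 1:1 to a single micro-op, so an Intel "simple"
       decoder can handle it. */
    __asm__ volatile("add %1, %0" : "+r"(x) : "r"(y));

    /* read-modify-write ADD to memory: typically cracks into load + ALU +
       store micro-ops, so on Intel it generally needs the "complex" decoder,
       while AMD decoders handle it as a fastpath macro-op in any slot. */
    __asm__ volatile("add %1, %0" : "+m"(*mem) : "r"(y));

    return x;
}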
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
x86 considers an instruction "simple" when it outputs just one micro-op; complex instructions output at least 2 micro-ops. All AMD decoders except the K5's are "complex", being able to decode instructions which output more than just one mop. Intel decoders are split into simple and complex - simple decoders can only decode one-to-one mapped instructions, meaning the hardware instruction is equal to the ISA instruction and does not split into multiple internal operations.
Thanks for the explanation. Obviously this was a lack of knowledge on my side, as I had simply assumed the split was based on instruction length rather than on the number of micro-ops produced.

Therefore the decoders were correctly labelled on Cardyak's diagram. Sorry for the confusion.
 

StefanR5R

Elite Member
Dec 10, 2016
6,623
10,469
136
I was guessing here regarding the 2 micro-op queues in Zen 5
Looks plausible, given that the µop cache is dual-ported. (The split could be static or dynamic though...)

Gotta re-read CnC's analysis and the SOG to see whether anything is said about
(a) how many µops/cycle a single thread can pull from the µop cache: up to 6, or up to 6+6?,
(b) the µop queue depth. If it is shallower in 1T mode than ideally possible, then it may be harder to make full use of the next stage (the ROB) in 1T mode, in which case Zen 5's deficit at "stitching the out-of-order instruction streams back in-order at the micro-op queue" might hinder 1T performance beyond just the decode bandwidth limit...?
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
(a) how many µops/cycle a single thread can pull from the µop cache: up to 6, or up to 6+6?,
To quote: https://chipsandcheese.com/p/amds-strix-point-zen-5-hits-mobile
To further speed up instruction delivery, Zen 5 fills decoded micro-ops into a 6K entry, 16-way set associative micro-op cache. This micro-op cache can service two 6-wide fetches per cycle. Evidently both 6-wide fetch pipes can be used for a single thread.
This also matches what can be found in software optimization guide (chapter 2.9.1)
The Op Cache (OC) is a cache of previously decoded instructions. When instructions are being fetched from the Op Cache, normal instruction fetch and decode are bypassed. This improves pipeline latency because the Op Cache pipeline is shorter than the traditional fetch and decode pipeline. It improves bandwidth because the maximum throughput from the Op Cache is 12 instructions per cycle, whereas the maximum throughput from the traditional fetch and decode pipeline is 4 instructions per cycle per thread.
 

Jan Olšan

Senior member
Jan 12, 2017
568
1,119
136
Somebody found a nice use (huge performance boosts?) for the VP2INTERSECT instruction in Zen 5.
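
For anyone unfamiliar with it, VP2INTERSECT takes two vectors and reports which lanes of each have a matching value somewhere in the other, as a pair of masks. A minimal sketch using the _mm512_2intersect_epi32 intrinsic (assuming a compiler exposing AVX-512 VP2INTERSECT and a CPU that supports it, such as Zen 5 - this is not the specific use case referenced above):

Code:
#include <immintrin.h>
#include <stdio.h>

/* Minimal sketch of VP2INTERSECT: one instruction reports which lanes of a
 * occur anywhere in b, and which lanes of b occur anywhere in a.
 * Build with e.g. gcc -O2 -mavx512f -mavx512vp2intersect. */
int main(void) {
    __m512i a = _mm512_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8,
                                  9, 10, 11, 12, 13, 14, 15, 16);
    __m512i b = _mm512_setr_epi32(4, 8, 15, 16, 23, 42, 0, 0,
                                  0, 0, 0, 0, 0, 0, 0, 0);
    __mmask16 match_a, match_b;

    /* match_a bit i is set if a[i] equals some element of b;
       match_b bit j is set if b[j] equals some element of a. */
    _mm512_2intersect_epi32(a, b, &match_a, &match_b);

    printf("lanes of a found in b: 0x%04x\n", (unsigned)match_a);
    printf("lanes of b found in a: 0x%04x\n", (unsigned)match_b);
    return 0;
}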

 

Kepler_L2

Senior member
Sep 6, 2020
970
4,037
136
Somebody found a nice use (huge performance boosts?) for the VP2INTERSECT instruction in Zen 5.

Too bad Zen6 deprecates it
 

yuri69

Senior member
Jul 16, 2013
673
1,202
136
Very impressive gains; I think 2025 will finally showcase the software reorg AMD has been working on for some years.
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
 

carrotmania

Member
Oct 3, 2020
119
300
136
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
What kind of logic is that? Does that mean every ray-traced game coming out this year is 5 years late? Software is done when it's done, and 3 months "late" is better than never at all, which was AMD's previous form. 400% is worth the delay. I take it this will run even better on MI300...
 

branch_suggestion

Senior member
Aug 4, 2023
763
1,662
106
It seems great, until you realize this particular release targets a product released in Oct 2024. That means this release is 3 months late...
Progress is progress.
Better to release late than not at all. And better than an on-time release that is buggy and missing features.
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
Very impressive gains; I think 2025 will finally showcase the software reorg AMD has been working on for some years.
I find this comparison lacking in details. I mean, they give you enough to compare ZenDNN against ZenDNN, but not against other solutions. You don't know if they are playing catch-up or whether they actually improved things. Inference is heavily dependent on memory BW, and they don't give information on what that memory BW is, so it is hard to estimate how it would do against other frameworks.
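
For what it's worth, a hedged back-of-the-envelope sketch of why the missing bandwidth number matters (all figures below are made-up placeholders, not numbers from AMD's post, and it assumes batch-1 LLM token generation where every weight is streamed from memory per token):

Code:
#include <stdio.h>

/* Back-of-the-envelope bound for memory-bandwidth-limited inference.
 * All numbers are hypothetical placeholders, not AMD's figures. */
int main(void) {
    double mem_bw_gbs      = 500.0; /* assumed sustained memory bandwidth, GB/s */
    double params_billions = 7.0;   /* assumed model size in parameters */
    double bytes_per_param = 2.0;   /* e.g. bf16 weights */

    double model_bytes = params_billions * 1e9 * bytes_per_param;
    double tokens_per_s_bound = (mem_bw_gbs * 1e9) / model_bytes;

    printf("upper bound: ~%.1f tokens/s for a single batch-1 stream\n",
           tokens_per_s_bound);
    return 0;
}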
 

branch_suggestion

Senior member
Aug 4, 2023
763
1,662
106
I find this comparison lacking in details. I mean, they give you enough to compare ZenDNN against ZenDNN, but not against other solutions. You don't know if they are playing catch-up or whether they actually improved things. Inference is heavily dependent on memory BW, and they don't give information on what that memory BW is, so it is hard to estimate how it would do against other frameworks.
They compare it to IPEX 2.4.0 iso-hardware.
Phoronix will compare it to GNR, don't worry.
 

moinmoin

Diamond Member
Jun 1, 2017
5,240
8,454
136
Moving this Zen 5 architecture discussion from the Nova Lake thread here:
Even with the first implementation in Zen 1 most of it was already competitively shared though.


[Image: AMD Zen SMT thread-resource-sharing diagram from Hot Chips 28]

  • Red - Competitively shared structures
  • Turquoise - Competitively shared and SMT tagged
  • Blue - Competitively shared with Algorithmic Priority
  • Green - Statically Partitioned
AMD has an update for Zen 5 at https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html

This is incorrect for Zen 5. Zen 5 has the whole front-end statically partitioned; it has decoders, op-cache and so on duplicated for each thread. They use a significant amount of silicon just for SMT which isn't used for 1T at all.
Nope, they're all watermarked.
Zen 5's micro-op cache is shared, per Chips and Cheese's testing.
The uOP cache has a capacity of 6k entries; if SMT is active, each thread gets 3k entries. Generally only the decoders are statically partitioned (one decoder per thread even if you disable SMT in the BIOS). Everything else is either competitively shared or watermarked.

It is interesting that AMD repeatedly refers to resources as competitively shared when, in cases like the dual decoder, they seem to be effectively statically partitioned (due to a bug?).
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
It is interesting that AMD repeatedly refers to resources as competitively shared
I think they could hire somebody to proofread. I mean, in the software optimization guide for Zen 5 you can read:
The maximum capacity of the Op Cache is 6 K instructions or fused instructions. The actual limit may be less due to efficiency considerations. Avoid hot code regions that approach this size when only one thread is running on a physical core, or half this size when two threads share a physical core. The Op Cache is physically tagged, which allows Op Cache entries to be shared between both threads when fetching shared code.
Which to me suggests at least watermarking, as otherwise the limit would not be so precise. But in the same manual you can find a table:
[Screenshot: table from the Zen 5 software optimization guide]

The decoders are their own mystery, since the messaging from the company could suggest either terrible internal communication (a fixed decoder per thread was always intended and is working as planned) or a bug of some sort that they were unable to fix, without correcting all the public materials after they realized they had a problem.
 

naukkis

Golden Member
Jun 5, 2002
1,010
852
136
That AMD documentation also claims that the op-cache is physically tagged and able to share code between threads. That doesn't make any sense to me - has anyone tested whether that is real? There are very real side-channel possibilities there and only an extremely small possible advantage from those additional physically-tagged op-cache hits between threads.
 

MS_AT

Senior member
Jul 15, 2024
826
1,677
96
That doesn't make any sense to me - has anyone tested whether that is real?
If you are doing fork and join, that should dramatically boost efficiency (basically both workers run the same code on different slices of data).
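
To illustrate what I mean by fork and join, a minimal hypothetical sketch (pthreads-based): both workers execute the exact same function on different halves of an array, which is the situation where physically-tagged op-cache sharing between SMT siblings could in principle help.

Code:
#include <pthread.h>
#include <stdio.h>

/* Build with -pthread. Both workers run the same code on their own
 * half of the array, then the main thread joins them and combines. */
#define N 1000000
static double data[N];
static double partial[2];

static void *sum_slice(void *arg) {
    int id = (int)(long)arg;              /* worker 0 or 1 */
    int begin = id * (N / 2);
    int end = (id + 1) * (N / 2);
    double s = 0.0;
    for (int i = begin; i < end; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t t[2];
    for (long id = 0; id < 2; id++)        /* fork */
        pthread_create(&t[id], NULL, sum_slice, (void *)id);
    for (int id = 0; id < 2; id++)         /* join */
        pthread_join(t[id], NULL);

    printf("sum = %f\n", partial[0] + partial[1]);
    return 0;
}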

There are very real side-channel possibilities there and only an extremely small possible advantage from those additional physically-tagged op-cache hits between threads.
Could you explain the dangers you see? Leaking code is unlike leaking data, I would expect, but side channels are nasty things.
 

naukkis

Golden Member
Jun 5, 2002
1,010
852
136
If you are doing fork and join, that should dramatically boost efficiency (basically both workers run the same code on different slices of data).
The op-cache is supposedly shared via physical tags - those gains aren't there, because hits are only shared after the TLB, with simultaneous L1I and op-cache scans. Both threads would only perform optimally with their own op-cache hits.

Could you explain the dangers you see? Leaking code is unlike leaking data, I would expect, but side channels are nasty things.

With that scheme there are two timing paths, depending on whether the code is shared with the other CPU thread (process) or not. OK, code timing side channels aren't as bad as data ones, but creating one for no gain is still an unnecessary risk.