Discussion Zen 5 Architecture & Technical discussion


LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
Given that certain enterprise-class, server-hosted software is licensed per thread, it may make sense for Turin to offer different decoder behavior, at least as a BIOS setting, to enhance the ST throughput of each core. That could apply either to the low-core-count, higher-clocked boutique parts or to the maximum-core-count normal and dense parts, where some clients turn off SMT for various reasons.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Sorry, could you explain how you came to that conclusion?
You can find it in the microbenchmarks from Chips and Cheese

Each Zen 5 cluster only handles a single thread, and maximum frontend throughput can only be achieved if both SMT threads are loaded. Intel’s scheme has all clusters working in parallel on different parts of a single thread’s instruction stream.


Also David Huang said the same

In this test, a single Zen 5 thread still performs like a 4-decode x86 core. But when we enable two SMT threads for testing, we can see that the throughput doubles, and the instruction throughput reaches 8 in the L1-L2 and even L3 ranges, and in the DRAM range it returns to the same normal level as Zen 4.
 

StefanR5R

Elite Member
Dec 10, 2016
5,889
8,757
136
Have these analyses of the apparent decoder width in 1T vs. 2T been done only on Strix Point so far, or did anybody reproduce them on Granite Ridge already?

Plus, Strix Point tests = Asus Zenbook tests, on which there is no BIOS option to disable SMT, or is there?

(I'm afraid I lost track.)
 

CouncilorIrissa

Senior member
Jul 28, 2023
520
1,991
96
Have these analyses of the apparent decoder width in 1T vs. 2T been done only on Strix Point so far, or did anybody reproduce them on Granite Ridge already?

Plus, Strix Point tests = Asus Zenbook tests, on which there is no BIOS option to disable SMT, or is there?

(I'm afraid I lost track.)
Huang performed his tests on the Zenbook, whereas C&C did theirs on the PX13. No GNR tests yet.
 

gdansk

Platinum Member
Feb 8, 2011
2,836
4,218
136
You should probably also include the funniest bit:
Personally, I think VP2INTERSECT is a disgusting instruction...
the instruction probably should never have existed in the first place in its current form. This may be one of the reasons Intel decided to get rid of it. It's unclear exactly what the original task was that it was meant for.
 

DavidC1

Senior member
Dec 29, 2023
778
1,236
96
Edit: once all the figures are in, Skymont is still a huge leap in efficiency and PPA for Intel but not quite class leading. The irony is that in servers Skymont and Zen 4c/5c are both going for the one market where ARM does really well.
Skymont clocks noticeably higher. Arrowlake's P core had to downgrade by 5%, but the E core clock went up by 200MHz. It goes up to 4.6GHz on Arrowlake.
 

gdansk

Platinum Member
Feb 8, 2011
2,836
4,218
136
Skymont clocks noticeably higher. Arrowlake's P core had to downgrade by 5%, but the E core clock went up by 200MHz. It goes up to 4.6GHz on Arrowlake.
Do we know the clock rates of N3E Zen 5C?
I'm also assuming it is lower than that but it is odd to be so confident without measuring.
 

KompuKare

Golden Member
Jul 28, 2009
1,163
1,426
136

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
Please kindly stick to Zen 5 technical aspects; there are many threads for all the other things. There were many insightful posts, some from new members, which is very welcome after we lost some long-term knowledgeable members.

Do we know the clock rates of N3E Zen 5C?
It would not matter so much for Turin D. I suspect it will be an efficiency play, so let's say 192 cores at 2.8G@360W, up from 128C at 2.25G@360W, would already be a big perf uplift, with no need for 3.5G+ clocks. Power per core would really be the most interesting part in my view.
 

gdansk

Platinum Member
Feb 8, 2011
2,836
4,218
136
It would not matter so much for Turin D. I suspect it will be an efficiency play, so let's say 192 cores at 2.8G@360W, up from 128C at 2.25G@360W, would be a big perf uplift. Power per core would really be the most interesting part in my view.
Yes, that seems possible. But I think this is lower power per core than what it was being compared to (253W for 24 cores)
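
For reference, the per-core numbers being compared can be worked out in a few lines. This is only a rough sketch: the three configurations are the round figures quoted in the posts above, not measurements, and the function names are made up for illustration. Dividing package power evenly across cores ignores uncore, I/O and memory power, and cores x clock ignores IPC entirely.

```python
# Back-of-the-envelope comparison using only figures quoted in the thread.
# All three configurations are the posters' round numbers, not measurements.

def watts_per_core(package_watts: float, cores: int) -> float:
    """Package power split evenly across cores (ignores uncore, I/O, memory)."""
    return package_watts / cores


def core_ghz(cores: int, clock_ghz: float) -> float:
    """Crude throughput proxy: cores x clock. Ignores IPC, SMT, memory limits."""
    return cores * clock_ghz


# (cores, clock in GHz or None if not given in the thread, package watts)
configs = {
    "hypothetical 192C @ 2.8 GHz, 360 W": (192, 2.8, 360.0),
    "128C @ 2.25 GHz, 360 W": (128, 2.25, 360.0),
    "24 cores, 253 W (clock not given)": (24, None, 253.0),
}

for name, (cores, clock, watts) in configs.items():
    line = f"{name}: {watts_per_core(watts, cores):.2f} W/core"
    if clock is not None:
        line += f", {core_ghz(cores, clock):.1f} core-GHz"
    print(line)
```

On these numbers the 192C guess works out to under 2 W per core and the 128C part to under 3 W, against roughly 10.5 W per core for 253 W across 24 cores, so the comparison is between very different operating points.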
 

Jan Olšan

Senior member
Jan 12, 2017
396
680
136
Regarding the dual decode capability for 1T, David Huang writes:

Overall, despite the suspected shrinkage of op cache equivalent capacity in previous tests, Zen 5 maintains fairly high op cache utilization in most cases. That is, for SPEC int, the width of its x86 decoding is in most cases harmless to the overall performance, as will be seen later in the performance analysis. (DeepL)


Maybe people are focusing too much on the dual decode not being active for 1T. Dual-feeding is still possible from the uOP cache, and that may be a bigger factor than decode capacity itself.

By looking at the PMCs associated with the mop source, we can see that even though the mop cache of Zen 5 has shrunk from 6.75K to 6K, the actual efficiency has increased. In the vast majority of cases, the percentage of mop from x86 decoders is smaller for Zen 5 than for Zen 4.


Op cache gives a performance boost of around 23% for the Zen 5. As a comparison (non-rigorous), based on readily available test data, the 7950X with op cache turned on has a performance gain of only about 13%. As you can see, op cache is far more important for Zen 5 than for previous generation architectures.
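
To make the "op cache may matter more than decode width" argument concrete, here is a toy model, not anything taken from Huang's article. It assumes each op costs 1/width cycles on whichever path delivers it and mixes the two paths by the share of ops coming from the x86 decoders. The 4-wide decoder figure is the 1T number quoted earlier in the thread; the 8-wide op cache delivery rate is a hypothetical placeholder, and the model ignores path-switch penalties and dispatch limits.

```python
def frontend_ops_per_cycle(decoder_share: float,
                           decoder_width: float,
                           op_cache_width: float) -> float:
    """Average ops delivered per cycle when a fraction `decoder_share` of ops
    comes from the x86 decoders and the rest from the op cache.

    Time-weighted (harmonic) mix: each op costs 1/width cycles on its path.
    """
    if not 0.0 <= decoder_share <= 1.0:
        raise ValueError("decoder_share must be between 0 and 1")
    cycles_per_op = (decoder_share / decoder_width
                     + (1.0 - decoder_share) / op_cache_width)
    return 1.0 / cycles_per_op


# Assumed, illustrative widths: 4/cycle from one decode cluster (the 1T figure
# quoted in the thread) and a hypothetical 8/cycle from the op cache.
for share in (0.05, 0.20, 0.50):
    one_cluster = frontend_ops_per_cycle(share, 4.0, 8.0)
    two_clusters = frontend_ops_per_cycle(share, 8.0, 8.0)
    print(f"decoder share {share:.0%}: "
          f"{one_cluster:.2f} vs {two_clusters:.2f} ops/cycle")
```

Under these assumptions, the second decode cluster only starts to matter once a sizeable share of ops misses the op cache, which is the point of the quoted PMC data.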

I have something to say about the finding that SPEC x264 is shown to be heavily execution backend bound tho:

x264 subcomponent: significantly higher bottlenecks in the back-end execution unit, access memory, and some bandwidth bottleneck increase in the front-end. In the end, the two new dispatch slots added by Zen5 only improve retire/ipc by 0.4;
This is very easy to understand. x264 has a very large number of compiler-generated SIMD integer and SIMD-access instructions, whereas the mobile Zen 5 has a significant regression in this area (SIMD integer latency and throughput drop) or no significant improvement (SIMD access, especially 128/256bit)

That could also be an artifact of the SPEC code only using autovectorization and disabling all the hand-written ASM code that is a large part of x264. That assembly does a lot of work in tightly tuned SIMD routines that pack as much work as possible into as few cycles as possible.
When you disable the assembly code, you force the compiler to recreate all of that from C code by autovectorization, and autovectorization is notorious for not working well with complicated multimedia code that is not straightforward numerical HPC code. So basically the autovectorized code generates a massive number of extra adds, muls, and shuffles, probably heavily shifting the balance of the execution characteristics. I suspect that x264 compiled properly would show a lesser level of backend bottleneck even on the modest 256bit mobile Zen 5 core, although software video encoding of this quality is still extremely computationally intensive, so it may be more backend-bottlenecked than other code.
 

MS_AT

Senior member
Jul 15, 2024
202
474
96
Maybe people are focusing too much on the dual decode not being active for 1T. Dual-feeding is still possible from the uOP cache, and that may be a bigger factor than decode capacity itself.
To be honest I just want to understand how it works, that is why I hope somebody will check ;) uOP caches are probably doing wonders for well-written math kernels, but when it comes to branchy workloads [hello games] they can only do so much.
That is probably an artifact of the SPEC code only using autovectorization and disabling all the hand-written ASM code that is a large part of x264. That assembly does a lot of work in tightly tuned SIMD routines that pack as much work as possible into as few cycles as possible.
When you disable the assembly code, you force the compiler to recreate all of that from C code by autovectorization, and autovectorization is notorious for not working well with complicated multimedia code that is not straightforward numerical HPC code. So basically the autovectorized code generates a massive number of extra adds, muls, and shuffles, probably heavily shifting the balance of the execution characteristics. I suspect that x264 compiled properly would show a lesser level of backend bottleneck even on the modest 256bit mobile Zen 5 core, although software video encoding of this quality is still extremely computationally intensive, so it may be more backend-bottlenecked than other code.
While I would generally agree, for the comparison David did, Zen 4 and Zen 5 were running the same code, so x264 could be the one place where the 1c increase in SIMD instruction latency is holding back gains from other places. SIMD int add, for example, was 1c latency on Zen 4; it's 2c on Zen 5. An instruction breakdown of x264 would be nice to see, to check whether the adds could really have this influence.
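
Whether a 1c to 2c latency change shows up at all depends on how dependent the adds are on each other. A minimal sketch of the two extremes, assuming an idealized pipelined unit; the issue rate of 4 adds per cycle is a hypothetical placeholder, not a Zen 4 or Zen 5 figure:

```python
import math


def chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops where each op needs the previous op's result."""
    return n_ops * latency


def independent_cycles(n_ops: int, latency: int, per_cycle: int) -> int:
    """Cycles for n_ops with no dependencies on a pipelined unit that can
    start `per_cycle` ops each cycle: issue time plus the last op's latency."""
    return math.ceil(n_ops / per_cycle) + latency - 1


N = 1000
ISSUE_RATE = 4  # assumed for illustration only

for latency in (1, 2):
    print(f"latency {latency}c: dependent chain {chain_cycles(N, latency)} cycles, "
          f"independent ops {independent_cycles(N, latency, ISSUE_RATE)} cycles")
```

A fully dependent chain doubles in time, while independent adds barely notice, so an instruction and dependency breakdown of x264 really would be needed to say which case dominates.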
 

gdansk

Platinum Member
Feb 8, 2011
2,836
4,218
136
Good, now it got an 11% IPC gain. Still a long way from marketing's 16% figure.
This is integer rate. IPC would include some weighting of fp rate, right?

I guess you can argue it too is misleading because, in the past, AMD typically matched what their slide said to the SPEC int rate and this time they didn't. But it's not like the gaming slides, even if it is still a decline in marketing accuracy.
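
As a sanity check on the "weighting of fp rate" idea, one can ask what FP uplift would be needed to turn an 11% INT gain into a 16% overall figure. This is a sketch under a loud assumption: that the overall number is a weighted geometric mean of one INT and one FP uplift. AMD's actual workload list and weighting are not given in this thread, so the function and its 50/50 default are purely hypothetical.

```python
def required_fp_gain(overall_gain: float,
                     int_gain: float,
                     int_weight: float = 0.5) -> float:
    """FP uplift needed for a weighted geometric mean of INT and FP uplifts
    to reach `overall_gain`. Gains are fractions (0.11 means +11%)."""
    if not 0.0 <= int_weight < 1.0:
        raise ValueError("int_weight must be in [0, 1)")
    fp_weight = 1.0 - int_weight
    # (1 + int)^w * (1 + fp)^(1 - w) = 1 + overall, solved for (1 + fp)
    fp_ratio = ((1.0 + overall_gain)
                / (1.0 + int_gain) ** int_weight) ** (1.0 / fp_weight)
    return fp_ratio - 1.0


# Thread figures: 11% measured INT gain, 16% marketing figure.
for weight in (0.25, 0.5, 0.75):
    needed = required_fp_gain(0.16, 0.11, weight)
    print(f"INT weight {weight:.2f}: FP gain needed = {needed:.1%}")
```

With an even split, the FP side would need to come in at about 21% for the average to land on 16%.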
 

yuri69

Senior member
Jul 16, 2013
530
946
136
This is integer rate. IPC would include some weighting of fp rate, right?

I guess you can argue it too is misleading because, in the past, AMD typically matched what their slide said to the SPEC int rate and this time they didn't. But it's not like the gaming slides. Still a bad regression in marketing accuracy.
It's more like a tongue-in-cheek post.

The marketing dept. apparently abandoned the Zen 1-4 tradition of picking "IPC workloads" that more or less match SPEC INT rate. Frankly, they did not have much choice. Zen 5 does not follow the same philosophy as its Zen predecessors. So slides presenting the ~10% figure would look rather bad for their "from the ground up core".

Again, Zen 5 should not be presented as a successor to Zen 1-4 but something of a different lineage. Then we would likely not see these oddities.
 

Abwx

Lifer
Apr 2, 2011
11,516
4,302
136
Good, now it got an 11% IPC gain. Still a long way from marketing's 16% figure.

They included FP in their slides, so, the below is not marketing..?..

AMD COMPUTEX CLIENT PRESS DECK-01-01 (12).png


Edit: They always included FP in their IPC claims. So far they have never stated that it was only for SPEC int; in all their previous slides it was a mix of INT and FP.
 

Jan Olšan

Senior member
Jan 12, 2017
396
680
136
(... ) Zen 5 does not follow the same philosophy as its Zen predecessors. (...)

Again, Zen 5 should not be presented as a successor to Zen 1-4 but something of a different lineage. Then we would likely not see these oddities.
I find this argumentation weird. Besides the thing that calling it differently doesn't change any real behavior or merit of the core... how exactly do you determine if it uses the same philosophy as Zen 1-4? On what is that based?

Is it based on die footprint? (Isn't it about the same size as Zen 4?) Is it because it increases FPU/SIMD width? (Zen 2 did.) Is it clocks not increasing? (Zen 1 didn't raise clocks above Excavator, but well, this can often be about process tech/practical limits too...)

If you decide that only on performance and the IPC increase, then I'm not sure that's a good idea. Because what if the core is not performing as well as it should because some features had to be disabled or slowed down? (Historical analogy: the original Zen 1 had its L2 latency relaxed from the originally intended 12 clocks to 17 clocks, which is the source of the IPC increase in Zen+, because it fixed that.)
The design "philosophy" is something that is applied during the designing, yet here you determine it solely on end results that may not match exactly what was intended...

(Note - we don't know if the IPC is affected by "bugs" in this way, this is hypothetical. IPC can probably be lower than intended also due to the architecture not quite matching pre-silicon modelling as a whole system due to other complex reasons.)
 

yuri69

Senior member
Jul 16, 2013
530
946
136
I find this argumentation weird. Besides the thing that calling it differently doesn't change any real behavior or merit of the core... how exactly do you determine if it uses the same philosophy as Zen 1-4? On what is that based?
It is based largely on the decisions made by the architects. There is no silver bullet so these decisions are basically trade-offs.

These decisions affect the project sub-budgets - areas of development investment. These areas are defined by projected goals using various metrics.

Originally, the Zen IP targeted mobile, server and even desktop/workstation workloads in a rather symmetric way. With Zen 5 things went a different way.

Zen 5 cache + structures + data paths got reworked in order to feed the brand new 512b-wide FPU. This development investment is disproportionate to the INT investment. On top of that, it led to regressions for various instructions. So having a 512b FPU does seem like a grand goal for the core. Is it a generally usable goal? Not really.

Btw Zen 2 doubled the FPU width after years of being stuck on 128b. For context, Intel has been riding the 512b width for two years already.

The uncore got almost no upgrade - no investment was made in that area. Was it a good trade-off given the previous gen was already bottlenecked?

Completely reworking the frontend which (now?) dedicates resources to SMT is another strange design choice given the profiling (now?) shows the frontend acts as a significant single-thread bottleneck. Was a server-class workloads investment a good trade-off?

This is my layman PoV.
 

Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
Well, it looks like the issue is now solved. David updated the results. Zen 5 is ahead of Zen 4 in x264.

With the compiler bug fixed, Zen 5 actually looks pretty good overall. I know 11% still falls short of expectations, but it gets good improvements almost across the board, only deepsjeng and leela show relatively flat performance and drag the average down a bit.

SPECint2017-Zen5-APU-IPC-vs-competition.png


According to this paper, both tests are on the lower end in terms of how branchy the code is. Not sure why it’s performing relatively less in them though.

1723558130056.png
 

gdansk

Platinum Member
Feb 8, 2011
2,836
4,218
136
On top of that, it led to regressions for various instructions.
It has these regressions in 256b version too. I don't think the regression is the result of the vector register width but some other change.
 

Nothingness

Diamond Member
Jul 3, 2013
3,029
1,971
136
According to this paper, both tests are on the lower end in terms of how branchy the code is. Not sure why it’s performing relatively less in them though.

Not sure where the table comes from, but it doesn't seem to show branch misprediction rate. Even when you have few branches, they can be hard to predict.

Leela and deepsjeng are running through game trees, making decisions depending on evaluation functions. That's very data-dependent, and branch predictors can have a hard time with that.
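
The "few branches can still hurt" point is easy to see with a first-order cost estimate: cycles lost scale with branch density times misprediction rate times penalty. The sketch below uses made-up workloads and an assumed 15-cycle penalty purely for illustration; none of these numbers come from the paper or from Zen 5 measurements.

```python
def mispredict_cycles_per_kilo_instr(branches_per_ki: float,
                                     mispredict_rate: float,
                                     penalty_cycles: float) -> float:
    """Cycles lost to branch mispredictions per 1000 instructions."""
    return branches_per_ki * mispredict_rate * penalty_cycles


# Made-up workloads for illustration only; none of these values are measured.
PENALTY = 15.0  # assumed pipeline flush/refill cost in cycles

branchy_but_predictable = mispredict_cycles_per_kilo_instr(200.0, 0.01, PENALTY)
sparse_but_hard = mispredict_cycles_per_kilo_instr(100.0, 0.08, PENALTY)

print(f"branchy, predictable: {branchy_but_predictable:.0f} cycles per 1000 instr")
print(f"fewer branches, hard: {sparse_but_hard:.0f} cycles per 1000 instr")
```

In this example the workload with half the branch density loses several times more cycles, which is why a table of branch counts alone says little about how predictor-bound a test is.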
 

Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
Not sure where the table comes from, but it doesn't seem to show branch misprediction rate. Even when you have few branches, they can be hard to predict.

Leela and deepsjeng are running through game trees, making decisions depending on evaluation functions. That's very data-dependent, and branch predictors can have a hard time with that.

Sorry, forgot to embed the link, it is from here, https://lca.ece.utexas.edu/pubs/HPCA_SPEC17_ShuangSong.pdf . I had more time to read through and it does say that leela shows bad misprediction penalties, at least on their Intel Skylake test CPU,

Several benchmarks (e.g., leela_r, mcf_r, xz_r) spend a significant fraction of their execution time on front-end stalls as they suffer from higher branch misprediction rates

Even still, I would think Zen 5's improved branch prediction and dual ucode feed would help in this situation, rather than being flat. Looking at David Huang's results, it looks like it is a front-end issue, but earlier in his post, he shows that Zen 5 uop utilization and op fusion are both as good or better on these tests than Zen 4. I guess that the branches are just terribly unpredictable in these tests, such that even the improved BPU isn't really helping and it's getting stuck having to do a lot of fetch and decode. Since Zen 5 can't utilize the dual decode on a single thread, these 2 tests get bottlenecked by the lack of improvement in single-thread fetch/decode. I'm assuming/hoping (based on the below results) that AMD is focusing on improving the front end (the really front, front end) with Zen 6 in a way that can be seen with a single thread. I'm not a digital designer, so this is getting into deeper waters than I usually swim in. Just interesting results.

I wonder if games tend to profile the same way and that's why they don't show much improvement either, or if that lack of improvement is mainly due to the improved cores being starved for data by the same memory/fabric as the prior generation. I think the X3D chips should at least give us an answer on that.

Zen4-APU-SPEC-topdown.png

Zen5-APU-SPEC-topdown.png
 