Discussion Zen 5 Architecture & Technical discussion


LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
Deepsjeng has a memory footprint of over 700MB in the rate version, and a much larger one in the speed version. It seems likely to spill out of the L3 cache regularly and quite heavily, putting lots of strain on the memory controller. The regression vs the 7840U is likely due to contention between the two CCXs for access to the memory controller, plus any cross-CCX data transfers that have to happen. I haven't looked at the actual memory transfers in flight to see what its real behavior is. It's a chess engine, and those typically crawl data trees constantly.
 
  • Like
Reactions: Vattila

Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
Deepsjeng has a memory footprint of over 700MB in the rate version, and a much larger one in the speed version. It seems likely to spill out of the L3 cache regularly and quite heavily, putting lots of strain on the memory controller. The regression vs the 7840U is likely due to contention between the two CCXs for access to the memory controller, plus any cross-CCX data transfers that have to happen. I haven't looked at the actual memory transfers in flight to see what its real behavior is. It's a chess engine, and those typically crawl data trees constantly.

Testing rate 1T shouldn’t encounter any cross CCX or memory controller contention issues though, unless I’m mistaken. . .
 

LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
Looking at Hitman's post, Deepsjeng is spending more time waiting for data to arrive and then streaming the data into the core. Both front-end latency and bandwidth take up more resources than in Zen4. To me, it looks like it's just stalling on memory contention. With the small size of the L3 on the C CCX, this isn't surprising.
 

LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
Testing rate 1T shouldn’t encounter any cross CCX or memory controller contention issues though, unless I’m mistaken. . .
True, I hadn't noticed that it was 1T. My mistake. Still, with a 700MB footprint and only 16MB of L3, that's a lot of cache misses. It is still possible that, since there's a second CCX, there's an added amount of latency in each memory request to deal with the possibility that addresses in that second L3 may get invalidated. It doesn't have to be much to impact these tests.
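
For a rough sense of scale, here is a minimal Python sketch of the capacity argument above. It only computes the largest share of the footprint that could sit in L3 at once. It is not a hit-rate model, since a tree search reuses some data far more often than the rest. The function name and the 32 MiB comparison point are illustrative assumptions, and 700 is a lower bound because the post says "over 700MB".

```python
def max_resident_fraction(l3_mib: float, footprint_mib: float) -> float:
    """Upper bound on the share of the working set that can sit in L3 at once."""
    return min(1.0, l3_mib / footprint_mib)


FOOTPRINT_MIB = 700  # "over 700MB" in the thread, so treat this as a lower bound

# 16 MiB is the L3 size quoted in the post; 32 MiB is a hypothetical doubled L3.
for l3_mib in (16, 32):
    share = max_resident_fraction(l3_mib, FOOTPRINT_MIB)
    print(f"{l3_mib} MiB L3: at most {share:.1%} of the footprint resident")
```

Even with the L3 doubled, under 5% of the footprint fits, so by this crude measure capacity alone would not be expected to separate the two cases by much.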
 
  • Like
Reactions: Tlh97 and Hitman928

LightningZ71

Golden Member
Mar 10, 2017
1,781
2,135
136
Hmm, I went and compared Deepsjeng on a few similar Zen4 SKUs where the main difference was 3D cache being present or absent, and the scores didn't really seem to change much between them. These were larger desktop/server CCXs, so twice the L3. I am going to conclude that it isn't much of an L3 cache issue, and the L2 remaining the same seems to indicate that it isn't part of this either. It could still be an issue with the memory controller on Strix having slightly higher latency than Phoenix/Hawk Point. Of note, since I was looking at server SKUs, they were running at notably lower frequencies than the APUs, so there's less pressure on memory performance as the core isn't demanding as much.
 

Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
Is that data available in tabular format?

Not really, this is the closest thing:

Zen5-Zen4-slots.png


True, I hadn't noticed that it was 1T. My mistake. Still, with a 700MB footprint and only 16MB of L3, that's a lot of cache misses. It is still possible that, since there's a second CCX, there's an added amount of latency in each memory request to deal with the possibility that addresses in that second L3 may get invalidated. It doesn't have to be much to impact these tests.

That makes sense, but I guess I would again expect the improved BPU to enable better pre-fetch to improve the scores at least somewhat, but if the BPU isn't showing better performance in these tests, then it kind of all makes sense as being essentially a memory + decode bottleneck.

Edit: @LightningZ71 the Vcache vs non-Vcache results being the same are interesting. Maybe with test results on desktop CPUs with no CCX issues and larger L3s things may get a little more clear.
 

Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
BTW, David Huang said it was a GCC bug that caused the Zen 5 improvements to appear lower, but if I'm reading his explanation right, it seems like it was his error in non-matching compiler flags that caused the issue:

Previously, some old data from znver3 was used, but the new test data used znver4, which led to the conclusion that x264 had almost no improvement. After unifying the flag, both znver3 and znver4/5 can achieve similar improvements.

Google translate was used for the quote.
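
A mismatch like this is easy to catch mechanically. Below is a minimal Python sketch, not anything David Huang actually used, that pulls the `-march=` value out of each compiler command line and checks that the compared runs target the same architecture. `-march=znver3` and `-march=znver4` are real GCC options; the helper names, command lines, and file names are hypothetical.

```python
import shlex
from typing import Optional


def march_of(command: str) -> Optional[str]:
    """Return the value of the last -march= option in a compiler command line."""
    value = None
    for token in shlex.split(command):
        if token.startswith("-march="):
            value = token.split("=", 1)[1]  # the last occurrence wins, as in GCC
    return value


def same_march(*commands: str) -> bool:
    """True when every command line targets the same -march value."""
    return len({march_of(c) for c in commands}) == 1


# Hypothetical command lines standing in for the old and new test runs.
OLD_RUN = "gcc -O3 -march=znver3 -o x264_old x264.c"
NEW_RUN = "gcc -O3 -march=znver4 -o x264_new x264.c"

if not same_march(OLD_RUN, NEW_RUN):
    print("flag mismatch:", march_of(OLD_RUN), "vs", march_of(NEW_RUN))
```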
 

CouncilorIrissa

Senior member
Jul 28, 2023
520
1,991
96
Completely reworking the frontend which (now?) dedicates resources to SMT is another strange design choice given the profiling (now?) shows the frontend acts as a significant single-thread bottleneck. Was a server-class workloads investment a good trade-off?

This is my layman PoV.
I very much doubt that this was their intention. We still need to see GNR tests (as the Zen 5 core seems to have bits and pieces chopped off left, right and center) to draw a final conclusion about the front-end's capabilities, but I think it's something that blew up very late in development. You don't dedicate so much area to something that would only bring benefits with SMT; that goes against the whole idea of SMT being an area-efficient way to get some nT performance.

Besides, aren't there some server workloads that perform better with SMT off anyway?
 

Nothingness

Diamond Member
Jul 3, 2013
3,029
1,971
136

CouncilorIrissa

Senior member
Jul 28, 2023
520
1,991
96

It's official, no parallel decoding with 2 clusters SMT off.

cross-CCD latency has been absolutely destroyed compared to Zen 4.
1723642379071.png
1723642388648.png
 

poke01

Golden Member
Mar 8, 2022
1,991
2,527
106

It's official, no parallel decoding with 2 clusters SMT off.

cross-CCD latency has been absolutely destroyed compared to Zen 4.
Umm, Zen4 supremacy Anyone!

They were supposedly going to charge $1000 for this junk. No wonder they reduced the price.
 

JustViewing

Senior member
Aug 17, 2022
216
381
106
Seems like AMD needs Zen6 to iron out the shortcomings of Zen5. This means, Zen6 may come next year.
 

JustViewing

Senior member
Aug 17, 2022
216
381
106
Umm, Zen4 supremacy Anyone!

They were supposedly going to charge $1000 for this junk. No wonder they reduced the price.
They were never going to charge $1000. It was a pure fantasy by some people in this forum. No one would buy 16 cores for $1000.
 
  • Like
Reactions: poke01

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,054
15,197
136

As I said before, their AVX-512 is way faster than even Intel's best! They will rock in that!!
 

MS_AT

Senior member
Jul 15, 2024
202
474
96
I recommend the updated Zen 5 analysis from the y-cruncher author: http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/ It has some interesting bits about the clocks and voltage observed.

Also I wonder if the core parking feature AMD tried to add last minute to this launch is the band-aid for the cross CCD latency. Still the rumor from MLID that they improved cross CCD latency did not age well, I think ;) [unless AT is right, the driver is botched and makes latency worse on Windows].
 
Jul 27, 2020
19,613
13,476
146
I'm interested in knowing what happens if the chipset driver package isn't installed? Or will Windows update forcefully install it?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,747
6,598
136
cross-CCD latency has been absolutely destroyed compared to Zen 4.

I don't know if there is something AMD inhibited in firmware, but it makes no sense for latency to go from 80ns to 200ns using a similar IOD.
Roughly speaking, on Z4 it is something like twice the L3 latency and a round trip to get the cache line, which would be around 60ns for snooping traffic + 20ns L3 latency. For a cache line they need a minimum of two transfers due to 32B/cycle.
If you cut the 20ns L3 latency and imagine 3x more time to fetch the data from the other CCX, it does not make any sense. It is like they throttled something.
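
The back-of-the-envelope numbers above can be written out as a minimal Python sketch. It assumes the latency figures are in nanoseconds (the usual unit for core-to-core latency) and a 64-byte x86 cache line. The variable names are illustrative only and the split into snoop and L3 time is the rough estimate from this post, not a measurement.

```python
import math

CACHE_LINE_BYTES = 64        # x86 cache line size
LINK_BYTES_PER_CYCLE = 32    # link width quoted in the post


def transfers_per_line(line_bytes: int = CACHE_LINE_BYTES,
                       link_bytes: int = LINK_BYTES_PER_CYCLE) -> int:
    """Minimum number of link transfers needed to move one cache line."""
    return math.ceil(line_bytes / link_bytes)


# Figures from the post, read as nanoseconds (assumption).
ZEN4_L3_NS = 20
ZEN4_SNOOP_NS = 60
ZEN5_MEASURED_NS = 200

zen4_total_ns = ZEN4_L3_NS + ZEN4_SNOOP_NS         # Zen 4 estimate: 80
zen5_non_l3_ns = ZEN5_MEASURED_NS - ZEN4_L3_NS     # what is left after L3: 180
slowdown = zen5_non_l3_ns / ZEN4_SNOOP_NS          # versus the Zen 4 snoop time

print(transfers_per_line(), zen4_total_ns, zen5_non_l3_ns, slowdown)
# prints: 2 80 180 3.0
```

So the 200 figure only adds up if the non-L3 part of the trip takes about three times as long as on Zen 4, which is the "3x more time" in the post.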

And another funny thing is that, according to C&C, Strix has more IF write bandwidth than GNR.
 

Jan Olšan

Senior member
Jan 12, 2017
396
680
136
It is based largely on the decisions made by the architects. There is no silver bullet so these decisions are basically trade-offs.

These decisions affect the project sub-budgets - areas of development investment. These areas are defined by projected goals using various metrics.

Originally, the Zen IP targeted mobile, server and even desktop/workstation workloads in a rather symmetric way. With Zen 5 things went a different way.

Zen 5 cache + structures + data paths got reworked in order to feed the brand new 512b-wide FPU. This development investment is disproportionate to the INT investment. On top of that, it led to regressions for various instructions. So having a 512b FPU does seem like a grand goal for the core. Is it a generally usable goal? Not really.
Well, integer got increased load-store capacity (4 loads / 2 stores / 4 total, up from 3/1/3) and +2 ALUs, which also improved the mix from 3 simple + 1 complex to 3 simple + 3 complex (although not all three can handle every complex instruction). A jump to 8 ALUs would probably be too complex to do at once (but you can't increase SIMD widths in +50% steps, although arguably you could say that AMD did half of the job in Zen 4 and half in Zen 5).

Btw Zen 2 doubled the FPU width after years of being stuck on 128b. For context, Intel has been riding the 512b width for two years already.

Perhaps, but I would say generally AMD was late with the SIMD upgrades. Fully 128bit SSE* took them just one year after Intel with Barcelona, full-speed AVX2 in Zen 2 was 6 years behind Intel. Intel had AVX-512 since 2017... (it gets a bit complicated with the question of when it became "full-speed" tho...)
Generally there are people doubting that AVX-512 is useful, but if you accept you want to have it, then the sooner, the better.
The uncore got almost no upgrade - no investment was made in that area. Was it a good trade-off given the previous gen was already bottlenecked?

Well, IOD was never going to be updated, that's sadly the policy of the desktop lineup, to have it on half-cadence. I absolutely agree that the IOD and chiplet scheme is the biggest weakness for performance but also for power consumption and efficiency. I wish they got something more efficient via advanced packaging.
Completely reworking the frontend which (now?) dedicates resources to SMT is another strange design choice given the profiling (now?) shows the frontend acts as a significant single-thread bottleneck. Was a server-class workloads investment a good trade-off?

This is my layman PoV.
The reworked frontend isn't completely limited to SMT mode. The improved branch prediction still works in 1T, I think, as does dual-fetch from the micro-op cache, which is luckily the more common source of instructions, at least if the code works as AMD engineers intended. The majority of apps should probably run from the uop cache significantly more than 50% of the time, although there may be outliers.
However, there is a good chance the split decoders will get the ability to feed 1T too, the way Intel does it. I have no idea if that feature was buggy and had to be disabled or if it is yet to be added, but I'm pretty confident the whole reason for this scheme is to eventually be able to do it for 1T. If that was not the goal, they would do it like Golden Cove and Lion Cove and try to add decoders in a single cluster. They seem to see the x86 future in the Atom-like scheme with multiple clusters, which may well be the more efficient way (and thus in line with the balanced "Zen" philosophy, perhaps?) to increase the decoding width.

Of course, that future prospect doesn't help Zen 5, but if it takes some pain now to get the benefits in the follow-up cores, it may be worth it.
 

Hitman928

Diamond Member
Apr 15, 2012
6,024
10,352
136
I don't know if there is something AMD inhibited in firmware, but it makes no sense for latency to go from 80ns to 200ns using a similar IOD.
Roughly speaking, on Z4 it is something like twice the L3 latency and a round trip to get the cache line, which would be around 60ns for snooping traffic + 20ns L3 latency. For a cache line they need a minimum of two transfers due to 32B/cycle.
If you cut the 20ns L3 latency and imagine 3x more time to fetch the data from the other CCX, it does not make any sense. It is like they throttled something.

And another funny thing is that, according to C&C, Strix has more IF write bandwidth than GNR.

I think it has to do with the new driver, though I’m just guessing. I posted about it here.