Discussion Zen 5 Architecture & Technical discussion

lopri

Elite Member
Jul 27, 2002
13,254
628
126
Why is this thing so bad at Super Pi? With the 7700X I got 6 seconds for 1M, but with the 9700X I can't get it under 7 seconds.
 

lopri

Elite Member
Jul 27, 2002
13,254
628
126
Super Pi is purely single-threaded. The power limit has no impact until you go way lower.
 

Det0x

Golden Member
Sep 11, 2014
1,307
4,284
136
Ok, maybe I wasn't clear enough. The CCD to IOD interface limits you to 64GB/s, while a 6000MT/s DDR5 setup provides a theoretical 96GB/s. Since CCD to IOD bandwidth is the limiting factor here, it doesn't matter how fast your DRAM is if you saturate the CCD to IOD link first [probably better to have it a bit higher for various controller-related overheads].

AVX512 would love to use the bandwidth but it won't be able to.
Not correct, a single-CCD Zen 4 scales a little with memory speed even in 2:1 8000MT/s vs 1:1 6600MT/s.
My own results with Clam cache/mem benchmark:

Latency ranking:
  1. SR 2x16gigs @ 6600MT/s 1:1 mode = 68.75 ns
  2. DR 2x32gigs @ 6600MT/s 1:1 mode = 70.17 ns
  3. SR 2x16gigs @ 8000MT/s 2:1 mode = 70.24 ns
  4. DR 2x32gigs @ 8000MT/s 2:1 mode = 71.84 ns

Bandwidth read-modify-write (ADD) ranking:
  1. SR 2x16gigs @ 8000MT/s 2:1 mode = 97.11 GB/s
  2. DR 2x32gigs @ 8000MT/s 2:1 mode = 92.87 GB/s
  3. SR 2x16gigs @ 6600MT/s 1:1 mode = 91.23 GB/s
  4. DR 2x32gigs @ 6600MT/s 1:1 mode = 87.34 GB/s
A few comments, in random order, on my findings above :)

A single 8-core Zen 4 CCD can take advantage of the higher bandwidth afforded by 2:1 mode vs 1:1 mode, even if the common misconception on many forums is that there is no benefit because people can hardly see any difference in the gimmicky AIDA64 memory bench. (It's also easy to double-check this in other benchmarks such as y-cruncher / GB3 membench, which will show the same thing.)
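For anyone wondering what these read-modify-write (ADD) numbers actually measure, here is a rough Rust sketch of the idea, not Clam's actual benchmark; the buffer size, thread count and pass count are arbitrary placeholders:

Code:
use std::time::Instant;

fn main() {
    // Every thread streams a[i] += 1.0 over its own chunk of one large buffer.
    // Each element counts as 16 bytes of traffic (8 read + 8 written).
    const THREADS: usize = 8;
    const ELEMS: usize = 128 * 1024 * 1024; // 1 GiB of f64, well past the caches
    const PASSES: usize = 4;

    let mut buf = vec![1.0f64; ELEMS];
    let chunk = ELEMS / THREADS;

    let start = Instant::now();
    std::thread::scope(|s| {
        for part in buf.chunks_mut(chunk) {
            s.spawn(move || {
                for _ in 0..PASSES {
                    for x in part.iter_mut() {
                        *x += 1.0;
                    }
                }
            });
        }
    }); // the scope joins all worker threads before returning
    let secs = start.elapsed().as_secs_f64();

    let bytes = (ELEMS * PASSES * 16) as f64;
    println!("ADD bandwidth: {:.2} GB/s (checksum {})", bytes / secs / 1e9, buf[0]);
}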

The next question would naturally be what the "best memory setup" is: 1:1 mode with its lower latency, or 2:1 with its higher bandwidth. There is no easy answer for this, as it all depends on which benchmark/game you are comparing the numbers in. Some will prefer latency while others prefer bandwidth, so you just have to check on an individual basis. :eek:

But what I can say is that higher memory speed is pretty much always better, be it in 1:1 mode or 2:1 mode... From time to time I see some people limit themselves to something like 6000/6200MT/s because they think it's faster in games than, say, 6400MT/s for some reason (?)

My next observation is that I did not find any bandwidth benefit from "dual rank" (quad) in the Clam cache/mem benchmark, but Karhu is seemingly showing higher MB/s. I suspect this is because of the larger memory size tested, not increased bandwidth from DR. I will do some more DR Karhu runs where I limit the tested memory size to the same as SR and check if the numbers change. (y) Edit: it's also possible that the forced GDM with DR is eating up the bandwidth benefit compared to SR.

I have also seen some complaints about people having a hard time tuning memory on the 1.1.7.0 PatchA FireRangeP AGESA; I can only say that it is working pretty well for me on the ASUS GENE, even if I'm using a beta BIOS. But be warned, stabilizing DR 64 gigs @ 8000MT/s is still insanely hard; I think I spent about 5x the time on this profile compared to all the others combined... It's really on a razor's edge: ±5 mV on some rails and you can forget about 10k Karhu.

I have also saved all the pictures here as well, in case this forum goes loco again with the screenshots.
The same should be true for Zen 5, as they share the same memory system.
 
Last edited:
Jul 27, 2020
20,089
13,762
146
Super Pi is purely single-threaded. The power limit has no impact until you go way lower.
It's legacy code. If PiFast from BenchMate or the following multithreaded Rust program also shows Zen 5 losing to Zen 4, then yes, we can say: Houston, we have a problem.
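A minimal sketch of such a multithreaded pi computation (plain std::thread, midpoint-rule integration of 4/(1+x^2); the step and thread counts here are arbitrary, not tuned for benchmarking):

Code:
use std::thread;

fn main() {
    // Estimate pi by integrating 4/(1+x^2) over [0, 1] with the midpoint rule,
    // splitting the iterations across a fixed number of threads.
    const STEPS: u64 = 200_000_000;
    const THREADS: u64 = 8;
    let step = 1.0 / STEPS as f64;

    let handles: Vec<_> = (0..THREADS)
        .map(|t| {
            thread::spawn(move || {
                let mut local = 0.0f64;
                let mut i = t;
                while i < STEPS {
                    let x = (i as f64 + 0.5) * step;
                    local += 4.0 / (1.0 + x * x);
                    i += THREADS; // strided partitioning of the index range
                }
                local
            })
        })
        .collect();

    let pi: f64 = handles.into_iter().map(|h| h.join().unwrap()).sum::<f64>() * step;
    println!("pi ~= {:.12}", pi);
}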

 

Nothingness

Diamond Member
Jul 3, 2013
3,104
2,104
136
It's legacy code. If PiFast from BenchMate or the following multithreaded Rust program also shows Zen 5 losing to Zen 4, then yes, we can say: Houston, we have a problem.
There are two sides to this: some people run some random obsolete benchmark, get odd results and draw conclusions (not saying anyone is doing the latter here); OTOH legacy code has to run fast enough.

But at this point I'm not sure what the point of running that obsolete, unmaintained PiFast is, especially if it hasn't been characterized (what SIMD extensions does it use? Is it memory bottlenecked?).
 

Det0x

Golden Member
Sep 11, 2014
1,307
4,284
136
In all its beauty 😘
[attached image]

But on a more serious note, guys, watch out with the direct die frame v2. Even if TG says it supports Zen 5, it's not without problems...

Long story short, I was getting a pretty bad temperature spread across the cores after the delid.
[attached image]

7 remounts later I found the problem (yes, this took hours): the frame had been pressing down on the glue on each side of the CCDs.
[attached image]

The temperature spread @ 310W PPT after the fix is looking much better :)
[attached image]

[attached image]
 
Last edited:

MS_AT

Senior member
Jul 15, 2024
261
594
96
Not correct, a single-CCD Zen 4 scales a little with memory speed even in 2:1 8000MT/s vs 1:1 6600MT/s.
My own results with Clam cache/mem benchmark:

Latency ranking:
  1. SR 2x16gigs @ 6600MT/s 1:1 mode = 68.75 ns
  2. DR 2x32gigs @ 6600MT/s 1:1 mode = 70.17 ns
  3. SR 2x16gigs @ 8000MT/s 2:1 mode = 70.24 ns
  4. DR 2x32gigs @ 8000MT/s 2:1 mode = 71.84 ns

Bandwidth read-modify-write (ADD) ranking:
  1. SR 2x16gigs @ 8000MT/s 2:1 mode = 97.11 GB/s
  2. DR 2x32gigs @ 8000MT/s 2:1 mode = 92.87 GB/s
  3. SR 2x16gigs @ 6600MT/s 1:1 mode = 91.23 GB/s
  4. DR 2x32gigs @ 6600MT/s 1:1 mode = 87.34 GB/s

The same should be true for Zen 5, as they share the same memory system.
I could have been more precise. So, just to clear up the first point: I was talking about bandwidth only, not latency.

The part that I ignored is the fact that the CCD to IOD connection is 32B/16B per cycle for read and write respectively (based on one of the earlier C&C investigations), with both lanes, so to speak, usable at the same time, which gives you a higher bandwidth limit for a test that mixes reads and writes. A pure read should show bandwidth closer to 32B x IF clock. Unless I have missed something in my analysis.
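As a back-of-the-envelope check, assuming a typical 2000 MHz FCLK (an assumption; the exact clock depends on the setup), the ceilings work out like this:

Code:
fn main() {
    // Theoretical per-CCD link ceilings, assuming FCLK = 2000 MHz (not measured values).
    let fclk_hz = 2.0e9;
    let read_bw = 32.0 * fclk_hz / 1e9;  // 32 B/cycle read lane  -> 64 GB/s
    let write_bw = 16.0 * fclk_hz / 1e9; // 16 B/cycle write lane -> 32 GB/s
    let mixed_bw = read_bw + write_bw;   // both lanes busy       -> 96 GB/s

    // Dual-channel DDR5-6000: 6000 MT/s * 8 B per channel * 2 channels.
    let dram_bw = 6000.0e6 * 8.0 * 2.0 / 1e9; // -> 96 GB/s theoretical

    println!("CCD read-only ceiling:  {:.0} GB/s", read_bw);
    println!("CCD read+write ceiling: {:.0} GB/s", mixed_bw);
    println!("DDR5-6000 theoretical:  {:.0} GB/s", dram_bw);
}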
 

yuri69

Senior member
Jul 16, 2013
545
979
136
Reading the Lion Cove/Skymont analysis at David Huang's Blog, there are interesting comparisons to Zen 5.

* Intel really went from 6-wide to 8-wide x86 decode this gen, while AMD apparently sticks to 4-wide
* Skymont internal structure sizing is dangerously close to Zen 5 (except FP-related)
* Lion Cove vs Zen 5 SPEC2017 INT scores achieved at 4.2GHz are very close
 

coercitiv

Diamond Member
Jan 24, 2014
6,693
14,367
136
Reading the Lion Cove/Skymont analysis at David Huang's Blog, there are interesting comparisons to Zen 5.

* Intel really went from 6-wide to 8-wide x86 decode this gen, while AMD apparently sticks to 4-wide
* Skymont internal structure sizing is dangerously close to Zen 5 (except FP-related)
* Lion Cove vs Zen 5 SPEC2017 INT scores achieved at 4.2GHz are very close
TL;DR - cores dangerously close to each other, collisions expected.

The one thing that does not sit right with me is the efficiency of the cores in the dense cluster; lower efficiency than the vanilla cores is weird.
[attached image]
 

Abwx

Lifer
Apr 2, 2011
11,560
4,358
136
* Skymont internal structure sizing is dangerously close to Zen 5 (except FP-related)

20/30% better perf/clock in INT/FP for Zen 5c vs SKT in GB6 ST, so that is only apparently close; in the real world they are far apart, about two generations apart. And to think that we had people here expecting SKT to match or even beat Zen 4; I once said that it was at Zen 3 level at best.
 

Saylick

Diamond Member
Sep 10, 2012
3,554
7,921
136
TL;DR - cores dangerously close to each other, collisions expected.

The one thing that does not sit right with me is the efficiency of the cores in the dense cluster; lower efficiency than the vanilla cores is weird.
[attached image]
They are called dense cores instead of efficient cores for a reason, although it is weird that they aren’t as efficient even though AMD touted a perf/W gain, e.g.
[attached image]
 

coercitiv

Diamond Member
Jan 24, 2014
6,693
14,367
136
They are called dense cores instead of efficient cores for a reason
AMD themselves showed Zen 4c improving efficiency in low-power scenarios in Phoenix 2. Also note the wording: "better optimized for NT efficiency, and size". The idea was to gain the density jump while also preserving or preferably improving efficiency. That being said, David Huang's package power readings might not be enough to tell the whole story here, but for what it's worth, they show a regression.

[attached image]