CakeMonster
Golden Member
I'd love to see a re-review of the findings with regard to the inter-core and inter-CCX latencies compared to Zen 4 once new AGESA/BIOS releases are out and chipset drivers and Windows are properly updated.
Clock is ticking. AMD needs to get the fixed AGESA out the door BEFORE the Arrow Lake launch.
And what needs fixing? CCD-to-CCD latency? Was anyone able to correlate poor performance in this synthetic test with an impact on any particular workload? Or is there something else that needs fixing? Chips&Cheese did one article on this a long time ago (https://chipsandcheese.com/2021/06/...to-core-latency-and-the-role-that-locks-play/); let me quote the conclusion:

"Typically games have around 20-30 per 10000 instructions suffering an L3 cache miss, which means that games are much more bound by memory latency than lock latency. If you picked an instruction at random, it's 20-30 times more likely to miss L3 than require a core to core transfer. The situation is more skewed for very parallel productivity workloads like Cinebench, where L3 misses happen about 80x as often as core to core transfers. So in conclusion, a core to core latency test using locks isn't very indicative of how a CPU will perform with real world usage either of games or productivity workloads. Core to core latency is merely one part of a CPU's overall performance, and plays a small role compared to other factors like the performance of a CPU's cache and memory hierarchy."

That is not to say such workloads do not exist; it might also be that C&C's test approach was flawed or their game selection insufficient. Either way, this synthetic test got a lot of attention because of the huge regression, but I am unaware of anyone who was able to successfully link it to a real-world performance regression.
In Zen you also suffer a 1-cycle latency penalty when you move data from the scalar/integer register file to the FP/SIMD register file, if I understand the documentation correctly. But this penalty is different from the penalty suffered exclusively by otherwise 1-cycle-latency SIMD ops.

Alpha has that additional latency when instructions have to cross register files. The scheduler, in Alpha's case with help from the programmer, tries to keep dependent instructions on the same side of the register file. But in a situation like three adds to be scheduled in one clock on a 2+2 ALU configuration, one instruction has to take the wrong side, and instructions dependent on that result see a one-cycle latency penalty. When the scheduler can isolate dependent chains to their own sides, full throughput can be maintained without any penalties.
Anything that AMD hasn't been able to do so far. I don't know what their TODO list for Zen 5 related AGESA improvements looks like, but they'd better get to it fast. Also, if they release the X870E/X870 chipsets soon, those may help performance a bit when used with Zen 5 optimized EXPO kits. And of course, if they are able to release the new X3D SKUs, that would further improve their standing in benchmarks.
Agree, 6400 MT/s 1:1 must be the new "sweet spot"; anything less is robbing the efficiency gains from Zen 5, and it would also help in the market fight with Intel's upcoming Arrow Lake.
The fabric clock sweet spot is 2000 MHz; the CCD to IOD link is 32 B/cycle -> max bandwidth 64 GB/s. If you are lucky you get 70.4 GB/s at 2200 MHz.
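The arithmetic behind those numbers is simple enough to sketch; note the 32 B-per-fabric-clock read width is the figure used in this thread, an assumption rather than an official spec:

```python
def fabric_read_bw_gbs(fclk_mhz, bytes_per_clock=32):
    """Peak CCD->IOD read bandwidth in decimal GB/s (bytes_per_clock is assumed)."""
    return fclk_mhz * 1e6 * bytes_per_clock / 1e9

print(fabric_read_bw_gbs(2000))  # 64.0
print(fabric_read_bw_gbs(2200))  # 70.4
```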
AVX-512 workloads using 512-bit registers will see a benefit. They are the ones most in need of any extra bandwidth compared to the same instructions executing on Zen 4.
Ok, maybe I wasn't clear enough. The CCD to IOD interface limits you to 64 GB/s, while a 6000 MT/s DDR5 setup provides a theoretical 96 GB/s. Since CCD to IOD bandwidth is the limiting factor here, it doesn't matter how fast your DRAM is if you saturate the CCD to IOD link first (probably better to go a bit higher for various controller-related overheads).
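A toy model of this argument: effective read bandwidth is the minimum of what the DRAM offers and what a single CCD-to-IOD link can carry. The 16 B DRAM bus and 32 B/clock link width are the assumptions used in this thread, not vendor specs:

```python
def effective_read_bw_gbs(dram_mts, fclk_mhz):
    dram_bw = dram_mts * 1e6 * 16 / 1e9     # 128-bit (16 B) dual-channel DDR5 bus
    fabric_bw = fclk_mhz * 1e6 * 32 / 1e9   # one CCD, 32 B per fabric clock (assumed)
    return min(dram_bw, fabric_bw)

# DDR5-6000 offers 96 GB/s, but a single CCD tops out at the 64 GB/s link:
print(effective_read_bw_gbs(6000, 2000))  # 64.0
```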
AVX-512 would love to use the bandwidth, but it won't be able to.
Let me quote myself:

The big Ryzen 7000 Memory and OC Tuning Guide - Infinity Fabric, EXPO, Dual-Rank, Samsung and Hynix DDR5 in Practice test with Benchmarks and Recommendations | Page 7 | igor´sLAB (www.igorslab.de)
If one is able to get DDR5-6400 CL30 or lower working at 2133 IF with a high-end DDR5-8200 RAM kit, AVX-512 WILL see gains. It may not be as much as one would like, but going for 6400 MT/s won't be a total waste over stock DDR5-6000 CL30. In AIDA64, that's 13% copy, 12.6% read and 14.2% write memory bandwidth gains. Not insignificant by any means.
Thanks for finding measurements that confirm this hypothesis 😉 It's a 7950X, so a 2-CCD SKU, and AIDA uses both CCDs 😉
Let me quote myself once again:

SkatterBencher #78: Ryzen 7 9700X Overclocked to 5860 MHz (skatterbencher.com): "We overclock the AMD Ryzen 7 9700X to 5860 MHz with the ASUS ROG Crosshair X670E Hero motherboard and AIO water cooling."
DDR5-6400 with tuned memory timings and IF @ 2200 on a 9700X yielded above 20% improvement in V-Ray and AI Bench, both of which presumably use AVX-512.
I have never said timings are not important. In fact they are more important than pure bandwidth for single-CCD SKUs due to the IF limitation. Second, if you read through his article with some attention, you will see that:

"Despite the Ryzen 7 9700X having only 8 cores, the performance is restricted by its maximum power to 65W. By enabling PBO, we can easily double the power budget in all-core workloads. Combine that with enabling higher memory speeds and it translates into significant performance gains across the board. The geomean performance improvement is +4.04%, and we get a maximum improvement of +18.07% in the AI Benchmark."

So this 20% AI Benchmark gain is not thanks to memory alone. The results shown are 9700X PBO + Memory EXPO vs 9700X stock with 4800 MT/s RAM.
I get that you really want to prove your point, Igor, but now it's turning into pure spam of hastily thrown links that are supposed to validate what you say. I will drop the subject now in order not to deteriorate the thread further.
If 6000 MT/s is the so-called "sweet spot", then why do tech YouTubers like Tech YES City conduct testing between the 9700X, 7700X & 7800X3D at 6200 MT/s? Are they cheating or something?
For DDR: 6400 MT/s on a 128-bit (16 B) bus is 102.4 GB/s. With the current sweet spot of 6000 MT/s you are at 96 GB/s.
In other words, the new sweet spot will be meaningless to 1 CCD SKUs. Only 2 CCD SKUs will be able to benefit provided you can engage both CCD dies. When it comes to bandwidth, at least.
Phoronix benchmarked the new AMD BIOS: https://www.phoronix.com/review/amd-9950x-agesa-1202

I did not see a median or average speedup, but it looks like ~10% faster.
Geomean is at the end; no difference.
The word "Geomean" does not exist on page 3. The power is identical, though; that's at the end.
How I missed that is a mystery, sorry. Here:
Per-benchmark % differences, sorted by magnitude: 16.5, 12.1, -10.3, 9.9, 9.8, 8.7, 7.9, -7.9, 7.5, 6.9, 6.2, 5.5, 5.3, -4.9, 4.8, -4.4, 4.4, 4.2, 4.1, 3.9, -3.8, 3.8, -3.7, 3.6, 3.6, -3.5, -3.4, -3.4, -3.4, 3.3, 3.3, 3.1, 3.0, 3.0, 2.9, -2.9, -2.8, 2.7, 2.7, 2.7, -2.6, -2.5, 2.4, -2.3, -2.3, -2.3, 2.3, 2.2, -2.2, 2.2, -2.2, -2.1, -2.1, 2.1, 2.0, 2.0, 2.0, 1.712281
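For reference, a geometric mean over percentage deltas like these has to be taken over the multiplicative factors, not the raw percentages. A minimal sketch with toy numbers (not the Phoronix data):

```python
import math

def geomean_delta(percent_deltas):
    """Geometric-mean % change: average the multiplicative speedup factors."""
    factors = [1 + d / 100 for d in percent_deltas]
    gm = math.prod(factors) ** (1 / len(factors))
    return (gm - 1) * 100

# Offsetting +10% and -10% results do NOT cancel to zero:
print(round(geomean_delta([10, -10]), 3))  # -0.501
```

This is why a list with many small positive and negative swings can still net out to a near-zero geomean overall.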