Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


HurleyBird

Platinum Member
Apr 22, 2003
2,678
1,266
136
No it'll load all 128t on one socket then move on, in 1c/2t increments.

Not a server guy, and this is hard for me to either verify or disprove. GPT-4 seems to think that within the same socket all the physical cores are usually assigned before logical ones, but we know it isn't always trustworthy.

Can you provide a source, or absent that, explain the benefit of scheduling tasks this way? The only thing I can think of is improving security a bit (speculative/side-channel attacks) in some scenarios, at least when everything aligns properly.
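For what it's worth, the "fill physical cores before SMT siblings" policy that consumer schedulers use can be sketched from Linux-style topology data. The sibling map below is hypothetical (a made-up 4c/8t socket); on a real Linux box the same data comes from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list.

```python
# Sketch: order CPUs so each physical core's first thread is used
# before any SMT sibling, i.e. minimize SMT sharing under partial load.

def physical_first_order(siblings):
    """siblings: dict cpu_id -> tuple of SMT siblings (including itself).
    Returns CPU ids with one thread per physical core first,
    then the remaining SMT siblings."""
    seen_cores = set()
    primaries, secondaries = [], []
    for cpu in sorted(siblings):
        core = tuple(sorted(siblings[cpu]))
        if core not in seen_cores:
            seen_cores.add(core)
            primaries.append(cpu)    # first thread of this core
        else:
            secondaries.append(cpu)  # SMT sibling, schedule last
    return primaries + secondaries

# Hypothetical 4c/8t socket where cpu N and N+4 share a physical core:
topo = {0: (0, 4), 1: (1, 5), 2: (2, 6), 3: (3, 7),
        4: (0, 4), 5: (1, 5), 6: (2, 6), 7: (3, 7)}
print(physical_first_order(topo))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With that enumeration, threads 1-4 each get a full core to themselves, and only threads 5-8 start sharing.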
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,040
2,613
96
explain the benefit of scheduling tasks this way?
It's a single-task, embarrassingly parallel rendering workload, aka the thing you do NOT want to spill out of the socket.
Same exact reason why MCM GPUs like MI300 were a never-ever thing until recently.

Please just quote your estimate for socket level SIR2017 bumps for Turin over Genoa (both 96c and 128c) and be done with it.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,678
1,266
136
It's a single task silly parallel workload of rendering aka the thing you do NOT want to spill out of the socket.

That makes sense but isn't what I'm asking. I'm asking about the how and why of thread scheduling inside the socket, not across them.

Please just quote your estimate for socket level SIR2017 bumps for Turin over Genoa (both 96c and 128c) and be done with it.

Why would I have one? I'm not trying to be Nostradamus, I'm trying to make sense of the existing information.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,678
1,266
136
It consumes all 128t available on socket before spilling to the next P.

Right, again, that's not in contention. What's in question is the claim that inside each socket the scheduler will not, in contrast to modern consumer schedulers, attempt to minimize SMT usage. That's not relevant to the first socket in this scenario, but it is to the second.
 

Saylick

Diamond Member
Sep 10, 2012
3,116
6,263
136
Yes, socket SIR score is how vendors love guiding their things.
You see it everywhere, Intel, AMD, ARM, whatever.

I.e. Turin is %redacted% amount faster in SIR, EMR is low teens over SPR, GNR is ~2x EMR and so on and so forth.

Not the end-it-all metric but a useful proxy.
Gotcha.

I'll toss what should be a relatively safe hat into the ring: 30% SPECint rate improvement at 1T over Zen 4, the same as what Jim Keller estimated in his presentation. :p
 

H433x0n

Senior member
Mar 15, 2023
873
937
96
Oh, and for similar cores... so 64 cores gets 615 and 60 cores of SR get 495. So at the closest we can bench, it's 25% faster for 7% more Genoa cores.
That's not a perfect apples-to-apples comparison. You'd want to compare single-socket configurations against each other, or dual-socket against dual-socket. A dual-socket setup always scales to less than 2x of a single socket.
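Setting the socket-count caveat aside, the quoted numbers (615 for 64 Genoa cores, 495 for 60 SR cores) can also be normalized per core, which is the fairer framing when core counts differ:

```python
# Quick arithmetic on the scores quoted above. SPECint-rate-style
# throughput numbers only compare fairly per core when core counts differ.
genoa_score, genoa_cores = 615, 64
spr_score, spr_cores = 495, 60  # "SR" = Sapphire Rapids

socket_uplift = genoa_score / spr_score - 1                              # ~24% at the socket
per_core_uplift = (genoa_score / genoa_cores) / (spr_score / spr_cores) - 1  # ~16% per core

print(f"socket: +{socket_uplift:.0%}, per core: +{per_core_uplift:.0%}")
```

So the ~25% socket-level gap shrinks to roughly a 16% per-core advantage once the extra four cores are accounted for, before even considering the single- vs dual-socket scaling issue.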
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,536
14,488
136
That's not a perfect apples-to-apples comparison. You'd want to compare single-socket configurations against each other, or dual-socket against dual-socket. A dual-socket setup always scales to less than 2x of a single socket.
Best that I could find. From experience with mine, I am sure the SR chips are far inferior. This and the benchmarks I have posted elsewhere show that as well.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,599
5,765
136
So, if the slides are to be believed (and they look quite authentic), the core ends up being much more similar to Alder Lake than I anticipated. But still noticeably fatter.

  • The same 12-way 48KB L1 cache as Golden Cove (hopefully without the latency penalty)
  • 8-wide dispatch (+2 vs Alder Lake and Zen 4)
  • 6 ALUs (+1 vs Alder Lake, +2 vs Zen 4)
  • 4 loads / 2 stores per cycle (vs 3/2 for Golden Cove, 2/1 for Zen 4)
    • If I'm reading this right, these are 512-bit (64-byte)? That's a massive uplift from Zen 4 if true (4x the throughput in ideal AVX-512 scenarios)
The biggest unknown for me is how they plan to feed the beast. There are no mentions of any decoder changes; surely that would be an absurd bottleneck if left unchanged?


Anyway, looking forward to comparisons against the Arrow Lake core. In the end, they could end up pretty similar in width, so it would all come down to execution.
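The "4x load throughput" reading above checks out arithmetically if the widths are taken at face value. The Zen 4 figure assumes 2 loads/cycle of 256 bits (AVX-512 ops double-pumped over 256b datapaths) and the Zen 5 figure assumes 4 loads/cycle of 512 bits; both are the slide interpretation here, not confirmed specs.

```python
# Sanity check of the claimed best-case L1 load bandwidth uplift,
# using the assumed per-load widths (slide interpretation, unconfirmed).
zen4_bytes_per_cycle = 2 * (256 // 8)  # 2 loads x 32 B = 64 B/cycle
zen5_bytes_per_cycle = 4 * (512 // 8)  # 4 loads x 64 B = 256 B/cycle

print(zen5_bytes_per_cycle / zen4_bytes_per_cycle)  # 4.0
```

Real AVX-512 code would only approach that 4x in load-bound inner loops; anything store- or compute-limited sees less.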
Looking at this again, vs Z4:
  • +2 rename/dispatch
  • +2 ALUs
  • +1 LD/cycle
  • 512b FP width
  • 64B LD/ST queues
  • 48K L1D
  • OOO structures increased
  • Usual generational architectural improvements scattered around
  • New BP with larger BTBs -> zero bubble conditional branches sounded like the patent I listed before where a second BP scans the other conditional branch
  • Decode width unknown, doubtful it is going to be beyond 6 wide if at all they even increase.
  • uop cache unknown
  • "2 basic block fetch" --> Does this mean 2x fetch and decode blocks akin to Tremont?
Does not seem terribly bloated; it would indeed seem akin to the Z2 -> Z3 evolution. The unknowns, however, do seem like the kind of big-ticket items. I think the zero-bubble conditional branch could be tied to the "2 basic block fetch".

Low Power core
  • Probably the low-power core option is not having 512b FP pipes or 64B LD/ST queues (they mentioned FP-512 variants, which would mean 512b pipes and data structures are not standard across all cores)
  • Denser node/efficiency optimized libs as usual
  • Cache reduction as usual
  • If the 2x basic block fetch is akin to what I described, they could clock gate the second fetch block aggressively for mobile
However, a major departure from the Zen 3/4 series is the return to Zen 2-style unified schedulers for INT and FP. Would be interesting to see latencies with Zen 5.
 
Last edited:

Saylick

Diamond Member
Sep 10, 2012
3,116
6,263
136
Looking at this again, vs Z4:
  • +2 rename/dispatch
  • +2 ALUs
  • +1 LD/cycle
  • 512b FP width
  • 64B LD/ST queues
  • 48K L1D
  • OOO structures increased
  • Usual generational architectural improvements scattered around
  • New BP with larger BTBs -> zero bubble conditional branches sounded like the patent I listed before where a second BP scans the other conditional branch
  • Decode width unknown, doubtful it is going to be beyond 6 wide if at all they even increase.
  • uop cache unknown
  • "2 basic block fetch" --> Does this mean 2x fetch and decode blocks akin to Tremont?
Does not seem terribly bloated; it would indeed seem akin to the Z2 -> Z3 evolution. The unknowns, however, do seem like the kind of big-ticket items. I think the zero-bubble conditional branch could be tied to the "2 basic block fetch".

Low Power core
  • Probably the low-power core option is not having 512b FP pipes or 64B LD/ST queues (they mentioned FP-512 variants, which would mean 512b pipes and data structures are not standard across all cores)
  • Denser node/efficiency optimized libs as usual
  • Cache reduction as usual
  • If the 2x basic block fetch is akin to what I described, they could clock gate the second fetch block aggressively for mobile
However, a major departure from the Zen 3/4 series is the return to Zen 2-style unified schedulers for INT and FP. Would be interesting to see latencies with Zen 5.
Do you think you can do us a favor and prepare this data into a table which compares various architectures (e.g. Zen 3, Zen 4, Zen 5, GLC)? Please and thank you! :)
 

DisEnchantment

Golden Member
Mar 3, 2017
1,599
5,765
136
Wait until you see the ROB and PRF sizes.
If they implemented something like what is described below in their patents, it won't be as bloated as other designs.

APPARATUS AND METHODS EMPLOYING A SHARED READ PORT REGISTER FILE
From <https://www.freepatentsonline.com/y2023/0034072.html>
Methods and systems for utilizing a master-shadow physical register file based on verified activation
From <https://www.freepatentsonline.com/11599359.html>

Otherwise it's diminishing returns at the expense of a lot of power/area, unless they go further in the direction of the ideas in their patents.

Do you think you can do us a favor and prepare this data into a table which compares various architectures (e.g. Zen 3, Zen 4, Zen 5, GLC)? Please and thank you! :)
I have a table from Zen 1 to Zen 4, but Zen 5 has too many unknowns, so I'm not posting it at the moment.


Two patents possibly related to the conditional branch thing
ALTERNATE PATH FOR BRANCH PREDICTION REDIRECT
From <https://www.freepatentsonline.com/y2022/0075624.html>
Instruction address translation and caching for primary and alternate branch prediction paths
From <https://www.freepatentsonline.com/11579884.html>

And the possible patents for the '2 basic block fetch' thing, if it is a thing at all.
PROCESSOR WITH MULTIPLE OP CACHE PIPELINES
From <https://www.freepatentsonline.com/y2022/0100663.html>
PROCESSOR WITH MULTIPLE FETCH AND DECODE PIPELINES
From <https://www.freepatentsonline.com/y2022/0100519.html>
 
Last edited: