Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Jan Olšan

Senior member
Jan 12, 2017
407
718
136
I'm not convinced the clustered decode on Zen 5 works well on ST. David Huang got zero from his ST tests

He used a sequence of NOPs specially crafted to measure it, not realistic code. The explanation could be that the sequence had no branches. Different microbenchmarking code would be needed to catch the effect of both decoder clusters getting used.
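Something like this rough sketch is what I have in mind (purely my own toy code, not David Huang's actual test; the block sizes and iteration counts are arbitrary): time a long branch-free NOP run against the same NOP count chopped into small basic blocks by taken jumps. If the second decode cluster only engages past a taken branch, the branchy variant should decode noticeably faster once the footprint no longer fits in the op cache. A real test would pin the thread and read uops/cycle from performance counters rather than raw rdtsc.

Code:
// x86-64, GCC/Clang only; a hedged sketch, not a validated microbenchmark
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>               // __rdtsc

#define REP16(x) x x x x x x x x x x x x x x x x

static uint64_t straight(void) {     // 256 NOPs per iteration, no branch inside
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 1000000; i++)
        __asm__ volatile (REP16(REP16("nop\n")));
    return __rdtsc() - t0;
}

static uint64_t branchy(void) {      // same NOPs, plus a taken jmp after every 16
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 1000000; i++)
        __asm__ volatile (REP16(REP16("nop\n") "jmp 1f\n1:\n"));
    return __rdtsc() - t0;
}

int main(void) {
    printf("straight: %llu cycles\n", (unsigned long long)straight());
    printf("branchy:  %llu cycles\n", (unsigned long long)branchy());
    return 0;
}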
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,930
3,556
136
He used a sequence of NOPs specially crafted to measure it, not realistic code. The explanation could be that the sequence had no branches. Different microbenchmarking code would be needed to catch the effect of both decoder clusters getting used.
Yes, I said in the other thread that the fact they are trying to look further ahead for branches, and that the uop cache is multi-ported, says they are trying to run ahead and stick stuff in the op cache.
 
  • Like
Reactions: lightmanek

HurleyBird

Platinum Member
Apr 22, 2003
2,760
1,455
136
He used a sequence of NOPs specially crafted to measure it, not realistic code. The explanation could be that the sequence had no branches. Different microbenchmarking code would be needed to catch the effect of both decoder clusters getting used.

I'm not sure whether this is pertinent to the benchmark results, but it's worth mentioning that Clark stated in the Chips & Cheese interview that no-op (NOP) fusion was removed in Zen 5. Reading between the lines a bit, it sounds like AMD sacrificed a number of optimizations that would have needed to be rebuilt for dual decode on the altar of cadence.
 
Last edited:

Jan Olšan

Senior member
Jan 12, 2017
407
718
136
There were similar cases of features in Zen 2 not making it to Zen 3 but then reappearing in Zen 4. It probably shows how they are not lying when saying they basically re-architect the core anew in the odd-numbered generations (refining them in the even ones). Sometimes they don't invest the effort to bring in all the features from the prior core, probably calculating that they will hit nice IPC gains even without them and preferring to focus on other parts that are perhaps more critical for the new uarch. Sometimes it could be due to not branching the development from the finished n-1 core, but more from some point after n-2 with some bits of n-1. But then the stuff can get back in later.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,930
3,556
136
I'm not sure whether this is pertinent to the benchmark results, but it's worth mentioning that Clark stated in the Chips & Cheese interview that no-op (NOP) fusion was removed in Zen 5. Reading between the lines a bit, it sounds like AMD sacrificed a number of optimizations that would have needed to be rebuilt for dual decode on the altar of cadence.
Except for all the op fusion they kept. Why would dual decode affect op fusion? It's post-decode at dispatch AFAIK.

It's probably a width-of-dispatch vs complexity vs timing/power thing.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,760
1,455
136
except for all the op fusion they kept. why would dual decode affect op fusion , its post decode at dispatch AFAIK.

From Mike Clark:
Part of the reason I would say we didn’t put let’s say no op fusion into Zen 5 is that we had that wider dispatch. Zen 1 to Zen 4 had that 6 wide dispatch and 4 ALUs, so getting the most out of that 6-wide dispatch was important and it drove some complexity into the dispatch interface to be able to do that. When looking at having the capability of an 8-wide dispatch and putting no op fusion on top of it, it didn’t really seem to pay off for the complexity because we had that wider dispatch natively. But you may see it come back. Zen 5 is sort of a foundational change to get to that 8-wide dispatch and 6 ALUs. We’re now going to try to optimize that pinch point of the architecture to get more and more out of it and so you know as we move forward, no op fusion is likely to come back as a good leverage of that eight wide dispatch. But for the first generation, we didn’t want to bite off the complexity.

There are two micro-op caches... are there two micro-op queues also? If the two paths converge at dispatch, it's plausible there's complexity in fusing ops that arrive from different paths. If they converge prior to dispatch, maybe fusion is taking place earlier?
 
Last edited:

itsmydamnation

Platinum Member
Feb 6, 2011
2,930
3,556
136
From Mike Clark:


There are two micro-op caches... are there two micro-op queues also? If the two paths converge at dispatch, it's plausible there's complexity in fusing ops that arrive from different paths. If they converge prior to dispatch, maybe fusion is taking place earlier?
why lie by omission.............

Mike Clark: We don’t support no op (NOP) fusion. We do have a lot of op fusion that’s similar, we still fuse branches and there’s some other cases that we fuse.

Part of the reason I would say we didn’t put let’s say no op fusion into Zen 5 is that we had that wider dispatch. Zen 1 to Zen 4 had that 6 wide dispatch and 4 ALUs, so getting the most out of that 6-wide dispatch was important and it drove some complexity into the dispatch interface to be able to do that. When looking at having the capability of an 8-wide dispatch and putting no op fusion on top of it, it didn’t really seem to pay off for the complexity because we had that wider dispatch natively. But you may see it come back. Zen 5 is sort of a foundational change to get to that 8-wide dispatch and 6 ALUs. We’re now going to try to optimize that pinch point of the architecture to get more and more out of it and so you know as we move forward, no op fusion is likely to come back as a good leverage of that eight wide dispatch. But for the first generation, we didn’t want to bite off the complexity.
 

MS_AT

Senior member
Jul 15, 2024
243
565
96
Have AMD themselves mentioned that the FPUs differ?
Nothing I can find. And if what David found is true, meaning the silicon is actually different and it wasn't an ES effect [not fully working microcode etc.], then it's a big lie by omission on AMD marketing's part, since the press materials never differentiate the Strix Point Zen 5 core from the Granite Ridge Zen 5 core; they only mention the distinction between Zen 5 and Zen 5c.
 

gaav87

Member
Apr 27, 2024
124
173
76
RTG: Ladies and Gentlemen, I was told from sources that Zen 6 will have mid-double digits IPC gain. Well, the middle between 10% and 99% is roughly 60%. Zen60% confirmed.

Source (probably): Expect modest gains for Zen 6, like no more than 15% IPC gain.
Watch out!
Red Gaming Tech will make a video out of your post
 

poke01

Platinum Member
Mar 8, 2022
2,205
2,802
106
Cores do usually gain performance over time as codebases get updated, but Z5 does seem to be an outsized FineWine candidate based on what Clark said.
You can get FineWine already with CachyOS and their latest Zen 4/Zen 5-optimised release.

Excluding the AVX-512 datasets it's about a 14.5% gain in IPC. It's clear that Zen 5 is a server-first architecture more than any other Zen.
 

MS_AT

Senior member
Jul 15, 2024
243
565
96
I was talking about ALU count, looks like it's the same six as Zen 4? Did they just make them wider then? I was under the impression that the ALU count was substantially increased, hence all this 'jebaited expectations' spiel of the last few weeks.
Why would they increase the execution resources on the FP side if they cannot sustain more than 2x512b loads per cycle? [It's still a great improvement over Zen 4 btw, which could do only 1x512b.] Not sure what the story is with FP stores, whether they can do 2x512b or 1x512b, but either is also a nice improvement over Zen 4, which could only do 0.5x512b stores per cycle.

What I want to say is that to increase FPU resources even further they would have needed to provide more bandwidth; otherwise it would be wasted silicon. [They can already do 2x512b ADDs and 2x512b FMAs per cycle, and I am not sure Intel ever had that on any core. I remember 2x512b FMA but not sure if concurrent adds were possible.]
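To put a toy example behind that (my own sketch, nothing from AMD material; plain AVX-512 intrinsics, compile with -mavx512f): a streaming 512b daxpy step needs two loads, one FMA and one store per vector, so with at most 2x512b loads per cycle the loop can retire at most one 512b FMA per cycle, no matter how many FMA pipes there are.

Code:
// y[i] = a*x[i] + y[i], 8 doubles per 512-bit step.
// Load bandwidth, not FMA count, bounds this kind of loop.
#include <immintrin.h>
#include <stddef.h>

void daxpy512(double a, const double *x, double *y, size_t n) {
    __m512d va = _mm512_set1_pd(a);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m512d vx = _mm512_loadu_pd(x + i);   // load #1
        __m512d vy = _mm512_loadu_pd(y + i);   // load #2
        vy = _mm512_fmadd_pd(va, vx, vy);      // the only FMA this iteration
        _mm512_storeu_pd(y + i, vy);           // one store
    }
    for (; i < n; i++)                         // scalar tail
        y[i] = a * x[i] + y[i];
}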
 

Det0x

Golden Member
Sep 11, 2014
1,299
4,234
136
Dunno if real numbers, but this was posted over at the Wccftech "forum" by a 13900KF user

13900KF @ 150 W package power = 34.9k points in Cinebench R23

13900KF @ 170 W package power = 36.3k points in Cinebench R23

13900KF @ 190 W package power = 37.2k points in Cinebench R23

13900KF @ 270 W package power = 40.4k points in Cinebench R23

I'm not too familiar with Raptor Lake's power/performance curve, are these normal/average numbers for the 13900K SKU?
Maybe they can be used as a comparison for the higher ES PPT numbers 🧐
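Quick points-per-watt math from those screenshots, assuming they are genuine: 34.9k/150 W ≈ 233, 36.3k/170 W ≈ 214, 37.2k/190 W ≈ 196 and 40.4k/270 W ≈ 150. The last 80 W only buys about 3.2k points, so it's the usual diminishing returns at the top of the V/F curve.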
 
Last edited:

Abwx

Lifer
Apr 2, 2011
11,557
4,349
136
Dunno if real numbers, but this was posted over at the Wccftech "forum" by a 13900KF user

13900KF @ 150 W package power = 34.9k points in Cinebench R23

13900KF @ 170 W package power = 36.3k points in Cinebench R23

13900KF @ 190 W package power = 37.2k points in Cinebench R23

There's nothing exceptional, the guy is just unaware that his chip still consumes much more than Zen 4 in this very bench, which is a best case for Intel.

FI he boasts 36.3k at 170W; just imagine that it's about the score of a stock 7950X3D, which uses barely 130W to do so, and with some UV and tweaking like this one you can get it at 110W.

FTR a 14900KS does 41k at stock while using 330W, so his score at 270W is not even much better overall than a stock chip.

Guess that's telling how deep in denial some people are, seeing as great what is actually very mediocre, but hey, "that's my preferred brand".