Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


DisEnchantment

Golden Member
Mar 3, 2017
1,752
6,625
136
One thing which they appear to have done in Zen 5 relative to Zen 3/4, according to Granite Ridge die shots, is to implement the L3 cache within a considerably reduced area. Right now I can't see though how this could affect penalties to cross-CCX traffic.

The other obvious change in Zen 5 is that they co-designed for 8c and 16c CCXs. But again it is not obvious to me how this pertains to traffic outside a CCX.

I thought about this and I think both the above points could be related to the L3 which is the actual SDF/SCF master and which is responsible for CCX to CCX traffic. L3 has a very important role in maintaining coherency across all cores.

The L3 has directory entries for 16C now, so unless they increased the number of directory entries they would need some hashing or filtering to keep access latencies the same.
Imagine this is now designed to scale from the 8C GNR CCX to the 16C Turin D CCX and the upcoming 32C Venice D CCX. Since Z6 is a Z5 derivative, the design would already be forward-looking. Going from 8C to 32C, and if they also increase L2, they would need to increase the directory by 8x at the bare minimum, unless they implement some new filtering.
I think they implemented some filtering/hashing to minimize the size of the L3 directory for the increased number of L2 clients, at the cost of extra traffic to narrow down the cache lines across CCXs.
If this is the case, the new interconnect should be able to help, I think, by increasing the number of bytes/cycle and allowing communication over multiple SDPs.
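
To put rough numbers on the 8x figure, here is a minimal sketch. It assumes the directory needs one entry per L2 cache line of every core in the CCX, 64-byte lines, and 1 MiB of L2 per core as the starting point. The 2 MiB L2 for a 32C CCX is purely hypothetical, and the function name is made up for illustration.

```python
LINE_BYTES = 64          # assumed cache line size
MIB = 1024 * 1024

def directory_entries(cores: int, l2_bytes_per_core: int) -> int:
    """Entries needed if every L2 line of every core gets its own directory entry."""
    return cores * (l2_bytes_per_core // LINE_BYTES)

base = directory_entries(8, 1 * MIB)        # 8C CCX, 1 MiB L2 per core (assumed)
same_l2 = directory_entries(32, 1 * MIB)    # 32C CCX, unchanged L2
bigger_l2 = directory_entries(32, 2 * MIB)  # 32C CCX, hypothetical 2 MiB L2

print(base, same_l2 // base, bigger_l2 // base)
```

Under these assumptions the core count alone gives 4x, and it takes a doubled L2 on top of that to reach 8x, which is where a filter or hash would start to pay off.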

The fact they could maintain same L3 latency and reduce it is good if not remarkable.
This may be good enough for server but probably not for high clocked DT parts.
 

Philste

Senior member
Oct 13, 2023
262
455
96
In their Zen 5-specific gaming review, Computerbase use RAM at up to 8000MT/s and say that it runs smoothly at this frequency, and that's with a 2 x 24GB kit; even the 7950X3D gets up to 7200.
But what's the point when the fabric bottlenecks everything starting at 6000MT/s? No wonder 8000 runs smoothly, the CPU doesn't even really notice the difference from 6000MT/s.
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,690
136
Something is definitely wrong with Zen 5 and Windows. Latest video from HU:


The huge inter CCD latencies are a big red flag. If this was just a chiplet swap and the IO die is the same, why would this generation see such a huge regression in the latencies?
Also, AMD not providing the raw performance numbers for Zen 5 and even excluding the 7700X from the reviewer's guide was a telltale sign that something is way off. They wasted such a huge opportunity to make a great leap ahead of Intel with this generation. X3D will likely still be the fastest chip after ARL launches and will probably just trade blows in ST and MT (according to current ARL leaks). Still, because of the weird design or cost-saving decisions they made, this generation looks like a total waste for most desktop users. It will probably dominate in the server and workstation space, but for the average desktop user it is a complete failure vs the Ryzen 7000 series.
 

Abwx

Lifer
Apr 2, 2011
11,616
4,474
136
Philste said:
But what's the point when the fabric bottlenecks everything starting at 6000MT/s? No wonder 8000 runs smoothly, the CPU doesn't even really notice the difference from 6000MT/s.

inf64 said:
It will probably dominate in the server and workstation space, but for the average desktop user it is a complete failure vs the Ryzen 7000 series.

For the average DT user all that matters is WebXPRT 4 and Speedometer, that is, browser tests, or possibly Affinity Photo; see the numbers.
Now if you need a lot of computing power then Zen 5 will also fit the purpose, so what is not good here?
 

MS_AT

Senior member
Jul 15, 2024
390
855
96
inf64 said:
The huge inter CCD latencies are a big red flag. If this was just a chiplet swap and the IO die is the same, why would this generation see such a huge regression in the latencies?
Also, AMD not providing the raw performance numbers for Zen 5 and even excluding the 7700X from the reviewer's guide was a telltale sign that something is way off. They wasted such a huge opportunity to make a great leap ahead of Intel with this generation. X3D will likely still be the fastest chip after ARL launches and will probably just trade blows in ST and MT (according to current ARL leaks). Still, because of the weird design or cost-saving decisions they made, this generation looks like a total waste for most desktop users. It will probably dominate in the server and workstation space, but for the average desktop user it is a complete failure vs the Ryzen 7000 series.
The worst part of this launch is the AMD PR/marketing department, which looks terrible compared to the Zen 4 launch or previous launches. I am not sure whether they were unable to find a good replacement for Robert Hallock or what the deal is, but they definitely presented the product terribly. They shouldn't have highlighted gaming so much if they knew there was no gaming uplift. That would have spared them a lot of bad press, and I bet the application improvements [and they are application improvements, look at browser benchmarks, which is probably the one piece of software that is used more than games on the average person's PC ;)] could entice a few people to upgrade. I mean, based on internal numbers they must have known they would not sway gamers anyway, so why bother?

The latency itself isn't as big of a problem as people make it out to be. It only made the existing issue more visible, as anyone who wanted to extract max performance already had to take the CCD-to-CCD latency into consideration, since it was 4-5 times bigger than the intra-CCD latency. I mean, it's definitely a regression and I am very curious to find out the reason behind it. Is it a bug, an intentional design choice that saved some space on the die, or maybe a provision for a new IO die for server parts that we have yet to see?
 

Philste

Senior member
Oct 13, 2023
262
455
96
@Abwx that is not an answer. My point is that it doesn't matter that Zen 5 can run DDR5-8000 without crashing when the performance is the same, because internally the IF is bottlenecking everything to DDR5-6000 speeds.
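
A back-of-the-envelope sketch of that argument, comparing theoretical peak DRAM bandwidth with what a single CCD link could carry. It assumes dual-channel DDR5 at 8 bytes per transfer per channel, a read width of 32 bytes per fabric clock per CCD link, and an FCLK of 2000 MHz. The link width and FCLK are illustrative assumptions, not figures from this thread, and the function names are made up.

```python
def dram_peak_gbs(mt_per_s: int, channels: int = 2, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s."""
    return mt_per_s * 1e6 * channels * bytes_per_transfer / 1e9

def fabric_read_gbs(fclk_mhz: int, bytes_per_clk: int = 32) -> float:
    """Theoretical peak read bandwidth of one CCD link in GB/s (assumed width)."""
    return fclk_mhz * 1e6 * bytes_per_clk / 1e9

link = fabric_read_gbs(2000)  # assumed FCLK of 2000 MHz
for speed in (6000, 8000):
    dram = dram_peak_gbs(speed)
    ceiling = min(dram, link)  # a single CCD cannot read faster than its link
    print(f"DDR5-{speed}: DRAM {dram:.0f} GB/s, single-CCD ceiling {ceiling:.0f} GB/s")
```

Under these assumptions a single CCD hits the same ceiling at both memory speeds. The real link width and clocks would need to be checked against measurements.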
 

LightningZ71

Golden Member
Mar 10, 2017
1,972
2,372
136
Another notable change compared to Zen 4 is having to support an additional physical L3 size AND the possibility of CCXs with dissimilar L3 sizes (Strix has only 8MB of L3 on its c CCX and 16MB on the P CCX). I'm assuming that there is LOGICAL commonality between the Strix Point and desktop/server CCX designs, which means that, for simplification purposes, the IO die/MMC needs to be able to handle that situation. It likely adds a bit of latency to deal with that possibility. I can't imagine that it would have to be as big as it is, but if it's having to take a couple of cycles to check a buffer in the memory controller for each cross-CCX access, it could get ugly fast.

Has anyone done testing with different fabric speeds on desktop to see how the latency changes? That could give us some insight into where the delay is.

If this bears out, it's an indictment of the plan to keep the same IO die with minimal to no revision.
 

Josh128

Senior member
Oct 14, 2022
557
926
106
inf64 said:
The huge inter CCD latencies are a big red flag. If this was just a chiplet swap and the IO die is the same, why would this generation see such a huge regression in the latencies?
Also, AMD not providing the raw performance numbers for Zen 5 and even excluding the 7700X from the reviewer's guide was a telltale sign that something is way off. They wasted such a huge opportunity to make a great leap ahead of Intel with this generation. X3D will likely still be the fastest chip after ARL launches and will probably just trade blows in ST and MT (according to current ARL leaks). Still, because of the weird design or cost-saving decisions they made, this generation looks like a total waste for most desktop users. It will probably dominate in the server and workstation space, but for the average desktop user it is a complete failure vs the Ryzen 7000 series.

Why? The answer is quite obvious now: it was purposely designed that way. It's not due to cross-CCD, it's due to cross-CCX, as monolithic Strix Point has the identical issue.

My post on the previous page seems to have largely been glossed over, but nobody is noticing that cross-CCX traffic on monolithic Strix Point is almost as bad as cross-CCX/CCD traffic on Granite Ridge. Just from that, you can throw out anything to do with chiplets, IF, or the IOD. This is somehow related to the new core designs. A conscious decision made by the design team. The question now is, why was it necessary to make that decision?
 

Abwx

Lifer
Apr 2, 2011
11,616
4,474
136
Philste said:
@Abwx that is not an answer. My point is that it doesn't matter that Zen 5 can run DDR5-8000 without crashing when the performance is the same, because internally the IF is bottlenecking everything to DDR5-6000 speeds.

In games IF doesn't matter that much, because 8 cores are more than enough and a single CCD will do the job.

The thing is that, to be on par with the 5800X3D, the 7700X not only has 12-13% better INT IPC than Zen 3 but also, and foremost, a 25% MT and 20% SC frequency uplift compared to the Zen 3 X3D. You can see on the CB graphs that the 7800X3D is 13% faster than the 9700X in games, so there's 600-700MHz missing to reproduce the previous feat.
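
The missing-clock estimate can be reproduced with simple arithmetic. This is a rough sketch that assumes game performance scales linearly with clock speed (which is optimistic) and takes 5.0 to 5.5 GHz as the 9700X's clock range in games. Only the 13% gap comes from the post; the clocks and the function name are assumptions.

```python
def extra_clock_mhz(base_mhz: float, perf_gap: float) -> float:
    """Extra clock needed to close a relative performance gap under linear scaling."""
    return base_mhz * perf_gap

for base in (5000, 5500):  # assumed sustained and peak clocks in MHz
    print(base, round(extra_clock_mhz(base, 0.13)))
```

With those assumed clocks the result lands at roughly 650 to 715 MHz, in the same ballpark as the 600-700MHz quoted.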
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,690
136
Josh128 said:
Why? The answer is quite obvious now: it was purposely designed that way. It's not due to cross-CCD, it's due to cross-CCX; monolithic Strix Point has the identical issue.

My post on the previous page seems to have largely been glossed over, but nobody is noticing that cross-CCX traffic on monolithic Strix Point is almost as bad as cross-CCX/CCD traffic on Granite Ridge. Just from that, you can throw out anything to do with chiplets, IF, or the IOD. This is somehow related to the new core designs. A conscious decision made by the design team. The question now is, why was it necessary to make that decision?
If it was a conscious decision to make it that way, they should come out and let us know their logic behind it. It all boils down to bad choices they made early in the design process. Zen 4 looks like a juggernaut design when compared to Zen 5, and this is all on AMD.
 

LightningZ71

Golden Member
Mar 10, 2017
1,972
2,372
136
I'm beginning to think that the "neat" improvements to the 9xxx series X3D chips will be a larger helping of 3d cache. Likely doubling it to 128MB. Lacking greater frequency and integer throughput, reducing latencies will be paramount to getting additional gains. Since memory throughput is roughly stagnant between Zen4 and 5, unlike Zen3 to 4, they have to get it from somewhere else. It's also going to help AVX-512 throughput as that is also heavily memory bound, at least for modest data sets.
 

IEC

Elite Member
Super Moderator
Jun 10, 2004
14,475
5,555
136
Looks like at least one Youtuber has partially tested RAM scaling:

6200 CL28 giving some huge gains in some game benchmarks vs JEDEC 4800. Though I wish JEDEC 5600 had been the baseline so people could see the difference versus most reviews out there:
[Attached image: 1723813739877.png]
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,690
136
LightningZ71 said:
I'm beginning to think that the "neat" improvements to the 9xxx series X3D chips will be a larger helping of 3d cache. Likely doubling it to 128MB. Lacking greater frequency and integer throughput, reducing latencies will be paramount to getting additional gains. Since memory throughput is roughly stagnant between Zen4 and 5, unlike Zen3 to 4, they have to get it from somewhere else. It's also going to help AVX-512 throughput as that is also heavily memory bound, at least for modest data sets.
I'm afraid that this is not going to happen. The bean counters at AMD have already cut Zen 5 down to mediocre at best with their weird design choices. Adding 2x stacked L3, which would increase cost even more, is highly unlikely.
 

Josh128

Senior member
Oct 14, 2022
557
926
106
LightningZ71 said:
I'm beginning to think that the "neat" improvements to the 9xxx series X3D chips will be a larger helping of 3d cache. Likely doubling it to 128MB. Lacking greater frequency and integer throughput, reducing latencies will be paramount to getting additional gains. Since memory throughput is roughly stagnant between Zen4 and 5, unlike Zen3 to 4, they have to get it from somewhere else. It's also going to help AVX-512 throughput as that is also heavily memory bound, at least for modest data sets.
I think it was already rumored (not sure if confirmed), that the X3D die will still be 64MB. I tend to believe this rumor, I doubt it will be increased.

**EDIT: I found it. This rumor was posted on several hardware news sites.

 

Philste

Senior member
Oct 13, 2023
262
455
96
Abwx said:
In games IF doesn't matter that much, because 8 cores are more than enough and a single CCD will do the job.

That makes no sense. IF bottlenecking games is the entire reason 3D performs that well, because the bigger L3 avoids huge parts of the traffic through IF.
 

Timmah!

Golden Member
Jul 24, 2010
1,530
862
136
Josh128 said:
Why? The answer is quite obvious now: it was purposely designed that way. It's not due to cross-CCD, it's due to cross-CCX, as monolithic Strix Point has the identical issue.

My post on the previous page seems to have largely been glossed over, but nobody is noticing that cross-CCX traffic on monolithic Strix Point is almost as bad as cross-CCX/CCD traffic on Granite Ridge. Just from that, you can throw out anything to do with chiplets, IF, or the IOD. This is somehow related to the new core designs. A conscious decision made by the design team. The question now is, why was it necessary to make that decision?
So they can fix it with Zen6 and have a selling point for it :p
 

marees

Senior member
Apr 28, 2024
654
755
96
From a customer perspective, why would anyone buy Kraken Point over Hawk Point?

Hawk would have better CPU MT performance and better GPU performance. The CPU ST performance advantage of Kraken will be tiny. As far as I can see the only reason to buy Kraken instead of Hawk is the 50 TOPS NPU.
Gary Colomb says that Zen 5 on Strix Point is good enough at low power to create an Xbox Series S experience on a handheld (720p?)

Kraken Point has half the GPU of Strix Point, but keep in mind that in these designs CPU/memory is more of a bottleneck for GPU performance

I expect Kraken Point with 8 CU to match Hawk Point with 12 CU (in most 60+ fps scenarios) but at much, much lower power usage

Hawk Point could still pull ahead in 30fps high resolution/detail scenarios (but that may not matter much)

This would be an interesting battle vs the Lunar Lake based MSI Claw on TSMC 3nm with better optimized low/idle power usage
 

MS_AT

Senior member
Jul 15, 2024
390
855
96
Philste said:
That makes no sense. IF bottlenecking games is the entire reason 3D performs that well, because the bigger L3 avoids huge parts of the traffic through IF.
The IF bottleneck is the bandwidth bottleneck. Games are memory latency bound not throughput bound most of the time. That is why the IF bottleneck doesn't matter much.
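
One way to see the latency vs throughput distinction is Little's law: sustained bandwidth is roughly the number of outstanding requests times the line size, divided by latency. A minimal sketch follows; the 80 ns latency and the in-flight miss counts are illustrative assumptions, not measurements, and the function name is made up.

```python
LINE_BYTES = 64  # assumed cache line size

def achieved_gbs(in_flight_misses: int, latency_ns: float) -> float:
    """Little's law: bytes in flight divided by latency (bytes per ns equals GB/s)."""
    return in_flight_misses * LINE_BYTES / latency_ns

# Pointer-chasing style code: one dependent miss at a time.
dependent = achieved_gbs(1, 80.0)
# Streaming style code: many independent misses in flight.
streaming = achieved_gbs(48, 80.0)

print(f"dependent loads: {dependent:.1f} GB/s, streaming loads: {streaming:.1f} GB/s")
```

Code dominated by dependent misses stays far below any plausible fabric bandwidth limit in this model, so shaving latency helps it while extra bandwidth does not.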
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,690
136
Timmah! said:
So they can fix it with Zen6 and have a selling point for it :p
That is the only silver lining I can see post this launch. Zen 6 has a huge opportunity to fix everything that is wrong with Zen 5 and then add some more. Only caveat is that this is AMD we are talking about, so nobody knows if the real slim shady will stand up (or not)

To be honest, on CPU front they were executing perfectly, especially with Zen 3 and Zen 4 (which brought HUGE performance uplifts). We got spoiled.
 

moinmoin

Diamond Member
Jun 1, 2017
5,151
8,250
136
It could also be the scheduler never giving a single process maximum cycles available, so that QoS is better for multitasking/GUI responsiveness.
That's actually the now traditional approach of Linux schedulers, whereas the Windows scheduler prefers the process of the currently active window. And that QoS approach actually turned out to be great for multitasking but bad for GUI responsiveness. It took Linux schedulers quite some time to get to the point of offering bottleneck-free fair sharing of cycles without significantly increasing latency, which is what actually hurts GUI responsiveness. The Windows scheduler never tried to share cycles completely fairly, which is why Linux is commonly able to achieve much higher throughput in heavy-load situations.

The standard Linux scheduler is called CFS (Completely Fair Scheduler). A latency optimized scheduler available is EEVDF (Earliest Eligible Virtual Deadline First). A try at combining both is BORE (Burst-Oriented Response Enhancer).
 

marees

Senior member
Apr 28, 2024
654
755
96
inf64 said:
That is the only silver lining I can see post this launch. Zen 6 has a huge opportunity to fix everything that is wrong with Zen 5 and then add some more. Only caveat is that this is AMD we are talking about, so nobody knows if the real slim shady will stand up (or not)

To be honest, on CPU front they were executing perfectly, especially with Zen 3 and Zen 4 (which brought HUGE performance uplifts). We got spoiled.
If Zen 6 is on DDR6, will there be a Zen 5+ on DDR5 & 3nm?
 

H T C

Senior member
Nov 7, 2018
590
427
136
inf64 said:
Something is definitely wrong with Zen 5 and Windows. Latest video from HU:

The huge inter CCD latencies are a big red flag. If this was just a chiplet swap and the IO die is the same, why would this generation see such a huge regression in the latencies?
Also, AMD not providing the raw performance numbers for Zen 5 and even excluding the 7700X from the reviewer's guide was a telltale sign that something is way off. They wasted such a huge opportunity to make a great leap ahead of Intel with this generation. X3D will likely still be the fastest chip after ARL launches and will probably just trade blows in ST and MT (according to current ARL leaks). Still, because of the weird design or cost-saving decisions they made, this generation looks like a total waste for most desktop users. It will probably dominate in the server and workstation space, but for the average desktop user it is a complete failure vs the Ryzen 7000 series.

It seems to me (and I have only watched A PORTION of the video thus far) that they are ruling out an AM5 bug when that's NOT NECESSARILY the case: it could STILL be 2 (or more) DISTINCT problems causing the discrepancy between AMD's and the reviewers' results.

Also, and regarding the testing of AM5 CPUs only: why haven't they tried this on AM4 versions too? Who knows how long this has been happening?

I 100% agree: whatever the "culprits" might be, those inter CCD latencies are WAY TOO HIGH, and are MOST DEFINITELY a red flag.