Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


AMDK11

Senior member
Jul 15, 2019
291
198
116
Yes, but you're going to see that covered by the prefetch and load parts of the breakdown diagram. But that will only be a component of those sections.
It is largely Front-End and Load Store.

I don't know how else to explain it.

If cache had nothing to do with IPC, 3D V-Cache would provide absolutely nothing. And yet it gives a measurable boost.

The biggest problem with Zen2 was the division into 2 CCXs, which made communication at the cache level between cores from a separate CCX much slower and the cores waited longer for data.

On its IPC uplift slide, AMD includes tests that are not 1T, among them games and some applications which, thanks to a unified cache that every core can access equally, see lower latency and higher achieved IPC.

If you could turn off the entire L3, you would see a very large drop in IPC, not only in games.
 

Vattila

Senior member
Oct 22, 2004
800
1,364
136
But I would suppose you are getting weary of all the back-and-forth hype and anti-hype, the sarcasm and mudslinging. You are not alone. There is a group who enjoys this kind of discussion, but there is another group who doesn't and can no longer participate in any discussion on the topic of Zen 5. Maybe we should create a different thread to discuss only the architectural/technical aspects of this upcoming product with limited speculation (no hype, no market share discussion, no memes) based on publicly available evidence (patches, GB results, manuals, official statements).

I agree. I think we need separate threads for technical discussion and product chat. The latter preoccupies these forums, thus drowning out the former.

Although I now rarely participate, I originated many of the speculation threads on "Zen" and other technical topics, and I am somewhat proud that "Speculation: Zen 4" is still a trending discussion thread here with over 13 thousand replies. But for technical content, it is not worth following anymore, for the most part. Much of the interesting discussion has been lost in the banter and bickering (unfortunately, as far as I am aware, there is still no AI strong enough to summarise the good content from all the noise).

I have noticed that there is less contribution from the technically inclined posters (including myself), so a separation of threads and focus may help to keep likeminded forum members engaged.

I recommend you start two "Zen 6" discussion threads:
  1. "Zen 6 architecture (technical discussion only)" — for members interested in CPU architecture. Twitter/X egos built on leaking information are not welcome (*). Nor are quarrelers, drama queens, comedians and meme peddlers. The thread should encourage long detailed posts full of technical content, not chat. Quality over quantity, as a general posting guideline. If you don't have 15 minutes to spare on creating a thoughtful post, don't post here.

  2. "Zen 6 chat (product details; performance, pricing, etc.)" — this is for members seeking buying advice, in particular, as well as those who just want to chat about products related to "Zen 6". Leakers are not welcome (*).
* Leaking or encouraging the leaking of confidential undisclosed information is not ethical, unless for justified journalistic purposes serving the common good (e.g. SemiAccurate reporting that not all was as rosy as depicted by Intel's leadership during recent years of process development issues — which highlighted the leadership's neglect in properly informing Intel shareholders of the true state of affairs). Leaks for fame or fortune should not be endorsed.

PS. Perhaps moderators can help improve things by enforcing on-topic discussion and prohibiting leaking and its encouragement.
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
C&C:
"VCache provides a notable 33% L3 hitrate increase here. Bringing average hitrate to 78% is more than enough to compensate for the slight L3 latency increase. GHPC enjoys a 9.67% IPC gain from running on the VCache CCD, so the other CCD should fall short even with its higher clock speed."

"With affinity set to the VCache CCD, IPC increased from 1.26 to 1.43. That’s a 13.4% increase, or basically a generational jump in performance per clock. VCache really turns in an excellent performance here. L3 hitrate with VCache is 63.74% – decent for a game, but not the best in absolute terms. Therefore, there’s still plenty of room for improvement. Modern CPUs have a lot of compute power, and DRAM performance is so far behind that a lot of that CPU capability is left on the table. Cyberpunk 2077 is an excellent demonstration of that."

"Zen 4’s normal 32 MB cache suffers heavily in this game, eating a staggering 8.66 MPKI while hitrates average under 50%. VCache mitigates the worst of these issues. Hitrate goes up by 47%, while IPC increases by over 19%."

"Hitrate improves by 16.75%, going from 61.5% to 72.8%. That’s a measurable and significant increase in hitrate, but like DCS, libx264 doesn’t suffer a lot of L3 misses in the first place. It’s not quite as extreme at 1.48 L3 MPKI with the non-VCache CCD. But for comparison, Cyberpunk and GHPC saw 5 and 5.5 L3 MPKI respectively. We still see a 4.9% IPC gain, but that’s not great when the regular CCD clocks 7% higher. Performance doesn’t scale linearly with clock speed, but that’s mostly because memory access latency falls further behind core clock. But given libx264’s low L3 miss rate, it’ll probably come close."

"With affinity set to the VCache CCD, we see a 29.37% hitrate improvement. IPC increases by 9.75%, putting it in-line with GHPC. This is a very good performance for VCache, and shows that increased caching capacity can benefit non-gaming workloads. However, AMD’s default policy is to place regular applications on the higher clocked CCD. Users will have to manually set affinity if they have a program that benefits from VCache."

"Zen 4’s VCache implementation is an excellent follow-on to AMD’s success in stacking cache on top of Zen 3. The tradeoffs in L3 latency is very minor in comparison to the massive capacity increase, meaning that VCache can provide an absolute performance advantage in quite a few scenarios. Zen 4’s larger L2 also puts it in a better position to tolerate the small latency penalty created by VCache, because the L2 satisfies more memory requests without having to incur L3 latency. The results speak for themselves. While we didn’t test a lot of scenarios, VCache provided an IPC gain in every one of them. Sometimes, the extra caching capacity alone is enough to provide a generational leap in performance per clock, without any changes to the core architecture."

It was developed by someone smarter who didn't just look at marketing slides ;)
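The libx264 trade-off in the quotes above (a 4.9% IPC gain on the VCache CCD against a regular CCD that clocks 7% higher) can be checked with a quick calculation. This is only a rough sketch: it assumes performance is simply IPC times clock, which the quoted article itself notes is optimistic for the clock term, and it assumes the same 7% clock gap applies to GHPC, which the quote does not state. The function and variable names are made up for illustration.

```python
def relative_performance(ipc_gain: float, clock_gain: float) -> float:
    """Performance relative to a baseline, assuming perf = IPC x clock.

    Both arguments are fractional gains (0.049 means +4.9%).
    Real workloads scale less than linearly with clock, so the
    clock term here is an upper bound.
    """
    return (1.0 + ipc_gain) * (1.0 + clock_gain)


# libx264: VCache CCD has +4.9% IPC, regular CCD clocks 7% higher.
libx264_vcache = relative_performance(ipc_gain=0.049, clock_gain=0.0)
libx264_regular = relative_performance(ipc_gain=0.0, clock_gain=0.07)

# GHPC: VCache CCD has +9.67% IPC (7% clock gap assumed, not quoted).
ghpc_vcache = relative_performance(ipc_gain=0.0967, clock_gain=0.0)
ghpc_regular = relative_performance(ipc_gain=0.0, clock_gain=0.07)

for name, vcache, regular in (
    ("libx264", libx264_vcache, libx264_regular),
    ("GHPC", ghpc_vcache, ghpc_regular),
):
    winner = "VCache CCD" if vcache > regular else "regular CCD"
    print(f"{name}: VCache {vcache:.3f}x vs regular {regular:.3f}x -> {winner}")
```

Under these assumptions the regular CCD can at best edge out the VCache CCD in libx264, while in GHPC the IPC gain is larger than the clock deficit.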
 

RnR_au

Golden Member
Jun 6, 2021
1,742
4,259
106
No, not really. E.g. I just quoted this regarding SpecINT being very close to average IPC:
Just on this, remember that the rumours come from server-side SKUs: +40% SPEC Int Rate 1N and +50% SPEC Int Rate tN (with +25% power). While on the server side SPEC Int increases may be a good proxy for general application performance increases, that might not be the case for desktop scenarios. The rumour is currently that the IOD is not changed for Granite Ridge, so you have to be cautious in applying the server-side performance numbers to desktop SKUs, since memory bandwidth is not likely to change much.

I have no idea how many benchmarks in AMD's IPC suite are bandwidth sensitive under nT loads, but I would wager it's greater than zero.

Hence SPEC Int gain !== IPC gain for Granite Ridge.
 

techjunkie123

Junior Member
May 1, 2024
3
2
36
Just on this, remember that the rumours come from server-side SKUs: +40% SPEC Int Rate 1N and +50% SPEC Int Rate tN (with +25% power). While on the server side SPEC Int increases may be a good proxy for general application performance increases, that might not be the case for desktop scenarios. The rumour is currently that the IOD is not changed for Granite Ridge, so you have to be cautious in applying the server-side performance numbers to desktop SKUs, since memory bandwidth is not likely to change much.

I have no idea how many benchmarks in AMD's IPC suite are bandwidth sensitive under nT loads, but I would wager it's greater than zero.

Hence SPEC Int numbers !== IPC for Granite Ridge.

Also, Genoa 1T boost speeds (4.4 GHz) are way below current DT (5.7 GHz), right? Maybe with Zen 5 the server products clock at around 5 GHz (+15%, with the remaining +20% being the IPC increase)? I don't know anything though, just speculation.
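To make the split between clock and IPC explicit: gains compound multiplicatively, so the IPC share is the total uplift divided by the clock uplift, not the difference between them. The sketch below uses the rumoured +40% figure and the speculative 5 GHz clock from this post. Both inputs are unconfirmed, and the function name is just for illustration.

```python
def implied_ipc_gain(total_gain: float, clock_gain: float) -> float:
    """IPC gain implied by a total uplift and a clock uplift.

    Assumes performance = IPC x clock, so the gains multiply:
    (1 + total) = (1 + ipc) * (1 + clock).
    """
    return (1.0 + total_gain) / (1.0 + clock_gain) - 1.0


# Speculative inputs from the post: 4.4 GHz -> 5.0 GHz, +40% total.
clock_gain = 5.0 / 4.4 - 1.0
ipc_gain = implied_ipc_gain(total_gain=0.40, clock_gain=clock_gain)

print(f"clock gain: {clock_gain:+.1%}")
print(f"implied IPC gain: {ipc_gain:+.1%}")

# Same calculation with the rounded +15% clock figure.
print(f"with +15% clock: {implied_ipc_gain(0.40, 0.15):+.1%}")
```

Either way the implied IPC gain lands a little above +20%, so the rough split in the post is in the right ballpark if the rumoured numbers hold.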
 

gdansk

Platinum Member
Feb 8, 2011
2,212
2,836
136
Wow, you've proved AMD's comparison between Zen 2 and Zen 3 wrong using a small subset of tests comparing Zen 4 to Zen 4 X3D.

Honestly I'm not sure what the point is here other than obfuscation. More cache is good for some games and that's why Zen 3 was well above the geomean AMD showed for some games. And Zen 3D further ahead still in even more games for the same reasons.
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
Wow, you've proved AMD's comparison between Zen 2 and Zen 3 wrong using a small subset of tests comparing Zen 4 to Zen 4 X3D.

Honestly I'm not sure what the point is here other than obfuscation. More cache is good for some games and that's why Zen 3 was well above the geomean AMD showed for some games. And Zen 3D further ahead still in even more games for the same reasons.
Obfuscation? Read the previous posts. Previous posters claim that cache has nothing to do with the increase in IPC. I have provided clear proof that it is quite the opposite, and the proof is VCache, which can raise IPC considerably. Not in every application, but the L3 design itself and its capacity affect IPC.
 

gdansk

Platinum Member
Feb 8, 2011
2,212
2,836
136
Obfuscation? Read the previous posts. You claim that cache has nothing to do with the IPC increase. I provided clear proof that it is quite the opposite, and the proof of this is VCache, which can raise IPC considerably. Not in every application, but the L3 design itself and its capacity have an impact on IPC measurements.
And where did I say that?

Here is what AMD says about it:
"The core complex unit (CCX) consists of 8 Zen 3 cores, each with a 0.5MB private L2 cache, and a 32MB shared L3 cache. Increasing this from 4 cores and 16MB L3 in the prior generation provides additional performance uplift, in addition to the IPC and frequency improvements."

They didn't include it in their IPC breakdown. Because Zen 3, in other manifestations, has different L3 configurations.
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
Almighty Patterson, bless my soul.
Not in case of vanilla Z3.
V$ triples the cache.
Z3 didn't add a single meg.
Neither does Z5.
Are you sure it didn't increase L3? Each Zen3 core has direct access to 32 MB instead of just 16 MB like a single Zen2 core. I see the difference, but you don't see it.
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
It's for ISSCC. From the people who designed it. I'll take that over 'a gamer explains'.

There's a reason AMD markets the X3Ds as 'the ultimate in latency reduction' not 'the ultimate in IPC'.
You clearly wanted to deny that cache has any effect on IPC because AMD didn't state that on the slide, while also claiming that latency has no impact on IPC.

You were both wrong.
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
Clearly not. Stop tilting at windmills and read.
At what point am I wrong? Enlighten me.


Sorry. My fault. I took this post in the wrong context.

That point was mainly addressed to the previous poster, who pretends to be an expert and still does not see that the transition from Zen 2 to Zen 3 increases the L3 each core can access directly from 16 MB to 32 MB :D

How did he put it earlier? If he pays me, I can draw this for him if he still has a problem with it.
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
Yea.

There isn't much, that's the point.
I see this is still a problem.

A Zen 2 CCD has 2x CCX (4 cores and 16 MB of L3 each), i.e. 2x 16 MB (32 MB total).

The problem is that each Zen 2 core only has direct access to 16 MB, and the other 16 MB is reached over the much slower IF.

A Zen 3 CCD has 1x CCX, i.e. 8 cores and 32 MB. This allows each Zen 3 core to have direct access to 32 MB of L3.

Can you see the difference?
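For anyone who prefers it spelled out, here is a minimal sketch of the two layouts as plain data. The class and method names are made up for illustration. The only facts it encodes are the ones above: core counts and L3 sizes per CCX, and the rule that a core has direct access only to the L3 of its own CCX.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CCX:
    cores: int
    l3_mb: int  # L3 shared by the cores of this CCX


@dataclass(frozen=True)
class CCD:
    name: str
    ccxs: Tuple[CCX, ...]

    def total_cores(self) -> int:
        return sum(ccx.cores for ccx in self.ccxs)

    def total_l3_mb(self) -> int:
        # Headline number: all L3 on the die added together.
        return sum(ccx.l3_mb for ccx in self.ccxs)

    def direct_l3_per_core_mb(self) -> int:
        # A core only has direct access to the L3 of its own CCX.
        # min() covers the general case; here all CCXs are identical.
        return min(ccx.l3_mb for ccx in self.ccxs)


zen2_ccd = CCD("Zen 2", (CCX(cores=4, l3_mb=16), CCX(cores=4, l3_mb=16)))
zen3_ccd = CCD("Zen 3", (CCX(cores=8, l3_mb=32),))

for ccd in (zen2_ccd, zen3_ccd):
    print(
        f"{ccd.name}: {ccd.total_cores()} cores, "
        f"{ccd.total_l3_mb()} MB L3 total, "
        f"{ccd.direct_l3_per_core_mb()} MB directly accessible per core"
    )
```

Same total L3 per CCD in both cases, but the amount a single core can reach directly doubles.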
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,804
3,268
136
Are you sure it didn't increase L3? Each Zen3 core has direct access to 32 MB instead of just 16 MB like a single Zen2 core. I see the difference, but you don't see it.
If you have bigger caches (at any given point in the cache hierarchy) and your prefetchers and predictors aren't any different or better, then all you have is a larger victim cache. If the working set, hot loop, etc. fits within the existing victim cache size, then you're going to see next to zero benefit from a larger cache.
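The working-set point can be illustrated with a toy simulation. To be clear about the assumptions: this is a generic fully associative LRU cache, not a model of AMD's actual victim L3, the access pattern is uniform random over a fixed set of lines, and the sizes are arbitrary. It only shows the shape of the effect.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Hit rate of a fully associative LRU cache holding `capacity` lines."""
    cache = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)


# Hypothetical workload: 50,000 accesses over a 1,000-line working set.
rng = random.Random(0)
working_set_lines = 1000
trace = [rng.randrange(working_set_lines) for _ in range(50_000)]

for capacity in (250, 500, 1000, 2000, 4000):
    print(f"capacity {capacity:>4} lines: hit rate {lru_hit_rate(trace, capacity):.3f}")
```

Hit rate climbs while the cache is smaller than the working set and then goes flat: once everything fits, only the cold misses remain and extra capacity buys nothing for this workload.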

So, as people keep saying, the front end is going to be super critical: prefetch and predict, along with the decode that feeds the beast. Cache supports that; it is not the key contributor. If it were so important to enabling performance, AMD would be shipping 4-hi/8-hi VCache in server and rolling up with 6 GB of cache per socket. If it enabled that much performance, they could charge whatever price they liked, because CPUs are something like 1-5% of the hardware cost of a server (depending on exact config) and irrelevant in full-life TCO. But obviously those parts don't exist as products, so what does that tell you?
 

adroc_thurston

Platinum Member
Jul 2, 2023
2,508
3,661
96
A Zen 2 CCD has 2x CCX (4 cores and 16 MB of L3 each), i.e. 2x 16 MB (32 MB total).

The problem is that each Zen 2 core only has direct access to 16 MB, and the other 16 MB is reached over the much slower IF.

A Zen 3 CCD has 1x CCX, i.e. 8 cores and 32 MB. This allows each Zen 3 core to have direct access to 32 MB of L3.

Can you see the difference?
It doesn't matter that much outside of like DB workloads.
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
If you have bigger caches (at any given point in the cache hierarchy) and your prefetchers and predictors aren't any different or better, then all you have is a larger victim cache. If the working set, hot loop, etc. fits within the existing victim cache size, then you're going to see next to zero benefit from a larger cache.

So, as people keep saying, the front end is going to be super critical: prefetch and predict, along with the decode that feeds the beast. Cache supports that; it is not the key contributor. If it were so important to enabling performance, AMD would be shipping 4-hi/8-hi VCache in server and rolling up with 6 GB of cache per socket. If it enabled that much performance, they could charge whatever price they liked, because CPUs are something like 1-5% of the hardware cost of a server (depending on exact config) and irrelevant in full-life TCO. But obviously those parts don't exist as products, so what does that tell you?
And did I write somewhere that there are no other improvements apart from the cache?

VCache clearly shows that cutting effective memory latency, by going out to RAM less often, yields additional IPC, and it is mainly games that benefit.
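A back-of-the-envelope way to see why fewer trips to RAM show up as IPC is the textbook stall model: effective CPI equals a base CPI plus misses per instruction times the exposed miss penalty. Every number below is hypothetical and chosen only for illustration. Real cores overlap much of the miss latency, so this overstates the effect if taken literally.

```python
def effective_ipc(base_cpi: float, l3_mpki: float, exposed_penalty_cycles: float) -> float:
    """IPC under a simple additive stall model.

    base_cpi: cycles per instruction with a perfect L3 (hypothetical).
    l3_mpki: L3 misses per 1000 instructions.
    exposed_penalty_cycles: stall cycles per miss that are NOT hidden
        by out-of-order execution or prefetching (hypothetical).
    """
    cpi = base_cpi + (l3_mpki / 1000.0) * exposed_penalty_cycles
    return 1.0 / cpi


BASE_CPI = 0.5          # hypothetical
EXPOSED_PENALTY = 60.0  # hypothetical, in core cycles

small_l3 = effective_ipc(BASE_CPI, l3_mpki=8.0, exposed_penalty_cycles=EXPOSED_PENALTY)
large_l3 = effective_ipc(BASE_CPI, l3_mpki=5.0, exposed_penalty_cycles=EXPOSED_PENALTY)

print(f"IPC at 8.0 MPKI: {small_l3:.2f}")
print(f"IPC at 5.0 MPKI: {large_l3:.2f}")
print(f"relative IPC gain: {large_l3 / small_l3 - 1.0:+.1%}")
```

The core is unchanged in both cases. Only the miss rate moves, and IPC moves with it.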
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
It doesn't matter that much outside of like DB workloads.
Everything matters. Games benefit mainly from this.

Just a moment ago you said that you don't see the difference: each core has direct access to a common 32 MB L3 instead of 16 MB, and at the same time inter-core communication benefits, because a core no longer has to reach another CCX on the same CCD over the slower IF.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,804
3,268
136
And did I write somewhere that there are no other improvements apart from the cache?

VCache clearly shows that cutting effective memory latency, by going out to RAM less often, yields additional IPC, and it is mainly games that benefit.
like Nar der....

But it's called a diminishing returns curve, otherwise we would have 6 GB cache processors right now...
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
like Nar der....

But it's called a diminishing returns curve, otherwise we would have 6 GB cache processors right now...
VCache spends billions of additional transistors to extract additional IPC from the same cores. It only lets you get closer to the theoretical peak IPC of a given architecture. To see further gains again, you need a new and more complex core design (to put it very simply).
 

AMDK11

Senior member
Jul 15, 2019
291
198
116
No, they benefit from total cache capacity.
(They also love fat L2's, as you've seen in RPL).
Raptor Lake has no inter-chiplet latency problem, and the memory controller is on the same chip. Thanks to 2 MB of L2 instead of 1.25 MB, Raptor Lake gains approximately 4-5% IPC.

Zen has its memory controller on a separate IOD, so it needs a larger L3 to compensate, and this is mainly why it benefits from a large L3 plus VCache in games.
Moreover, one CCD communicates with the neighboring CCD via the IOD.

Zen 6 is expected to introduce a single CCD with 16 cores and a shared 64MB L3.
The only question is whether there will be 2x CCX with 8 cores and 32MB L3 per 1 CCD or whether it will be 1 CCX with 16 cores and a shared L3 of 64MB.
 