Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 948

eek2121

Diamond Member
Aug 2, 2005
3,270
4,791
136
Has any of the reviewers tested gaming while running something semi-heavy in the background? Not necessarily encoding that taxes all threads, but enough to keep the CPU busy enough that it doesn't core park. I typically launch a game while doing other stuff, and I want to know what it does compared to my 7950X, which does not core park.
It performs exactly like a 9950X, but with bonus 3D cache. If the cores don't park, they don't park. If you mean gaming while the second CCD is loaded, I haven't seen anything.

Are there any reviews on overclocking?
 

CakeMonster

Golden Member
Nov 22, 2012
1,593
779
136
Yeah, I don't trust that it will core park if I run something heavier in the background, and I would like to see how it performs under that scenario (it will probably have less power budget for the gaming CCD).

HUB (the only review I've watched so far) had some numbers with PBO; not an impressive performance increase.
 

eek2121

Diamond Member
Aug 2, 2005
3,270
4,791
136
Yeah, I don't trust that it will core park if I run something heavier in the background, and I would like to see how it performs under that scenario (it will probably have less power budget for the gaming CCD).

HUB (the only review I've watched so far) had some numbers with PBO; not an impressive performance increase.

The clocks are the same as a vanilla 9950X. I fail to see the issue here. The game will be pinned to the 3D cache die regardless, even if those cores don't park. In what scenario would you EVER game while also running a "heavy workload" in the background? Most heavy workloads scale, so it is beneficial NOT to run them in the background.
 
igor_kavinski

Jul 27, 2020
23,098
16,259
146
I've had my 9950X3D ES for 9 months already ;)
Meh.

My issue with AMD is that they can't put out a product that IGOR_KAVINSKI wants! (price and performance BOTH matter to me)

Seems there are some 9900X CPUs collecting dust in an Amazon warehouse, and they want to get rid of them at half the price they're selling for in my region.

I was like, sure, Amazon. Here's my money. Now to see if they hold up their end of the bargain.
 
  • Wow
Reactions: lightmanek

StefanR5R

Elite Member
Dec 10, 2016
6,320
9,717
136
[3D V-cache in multi-chiplet CPUs]
I feel like stacking it on the memory controller means they'd only need one still, and could benefit multiple dies (be they CPU or GPU).
Shared cache is beneficial (dramatically, sometimes) to algorithms in which several program threads share hot data. For any other workloads, the downsides of replacing CCX-internal cache by MALL cache may negate what was won by unifying the last level cache: Cache access latency goes up, aggregate bandwidth goes down, or/and energy consumption of last level cache accesses goes up.

(If you move the last level cache out of the core complex into the north bridge, you end up with the CCXs' coherent master block (CM), the Infinity Fabric, and the memory controller's coherent slave block sitting between cores and last level cache. Depending on how much all this is scaled up for this purpose, you lose quality of service during concurrent cache accesses or/and have a power hungrier fabric.)
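If anyone wants to see that latency penalty first-hand, a simple pointer-chase microbenchmark is enough: the average load-to-use latency jumps once the working set spills out of the CCX-local L3 and every access has to cross the fabric to the memory side. A minimal sketch only; the buffer sizes and iteration count are arbitrary assumptions, and for clean numbers you would pin the thread to one CCX and build with something like g++ -O2 -std=c++17.

Code:
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Time dependent loads around a single-cycle pointer chain covering the buffer.
static double chase_ns(std::size_t bytes) {
    std::size_t n = bytes / sizeof(std::size_t);
    std::vector<std::size_t> order(n), next(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (std::size_t i = 0; i < n; ++i)
        next[order[i]] = order[(i + 1) % n];   // one cycle touching every slot
    std::size_t idx = 0;
    const std::size_t iters = 20'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i)
        idx = next[idx];                       // each load depends on the previous one
    auto t1 = std::chrono::steady_clock::now();
    volatile std::size_t sink = idx;           // keep the chain from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    // Working sets straddling typical L2, CCX-local L3, and DRAM (sizes are guesses).
    for (std::size_t kib : {512, 2048, 8192, 16384, 32768, 65536, 131072})
        std::printf("%8zu KiB  ~%.1f ns per dependent load\n", kib, chase_ns(kib * 1024));
}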
 

Hulk

Diamond Member
Oct 9, 1999
5,043
3,508
136
[3D V-cache in multi-chiplet CPUs]

Shared cache is beneficial (dramatically, sometimes) to algorithms in which several program threads share hot data. For any other workloads, the downsides of replacing CCX-internal cache by MALL cache may negate what was won by unifying the last level cache: Cache access latency goes up, aggregate bandwidth goes down, or/and energy consumption of last level cache accesses goes up.

(If you move the last level cache out of the core complex into the north bridge, you end up with the CCXs' coherent master block (CM), the Infinity Fabric, and the memory controller's coherent slave block sitting between cores and last level cache. Depending on how much all this is scaled up for this purpose, you lose quality of service during concurrent cache accesses or/and have a power hungrier fabric.)
I'm looking at the 9950X vs. the 9950X3D and having a hard time justifying the X3D for productivity, as much as I WANT to see it. Maybe a little in AI, but my GPU handles that load.
Games? Yes, of course. You get the better game performance and maintain productivity. But I don't see an advantage for productivity alone.
 

MS_AT

Senior member
Jul 15, 2024
526
1,111
96
  • Like
Reactions: igor_kavinski

Joe NYC

Platinum Member
Jun 26, 2021
2,909
4,279
106
[3D V-cache in multi-chiplet CPUs]

Shared cache is beneficial (dramatically, sometimes) to algorithms in which several program threads share hot data. For any other workloads, the downsides of replacing CCX-internal cache by MALL cache may negate what was won by unifying the last level cache: Cache access latency goes up, aggregate bandwidth goes down, or/and energy consumption of last level cache accesses goes up.

(If you move the last level cache out of the core complex into the north bridge, you end up with the CCXs' coherent master block (CM), the Infinity Fabric, and the memory controller's coherent slave block sitting between cores and last level cache. Depending on how much all this is scaled up for this purpose, you lose quality of service during concurrent cache accesses or/and have a power hungrier fabric.)

All good points.

Also, with the upcoming 12-core chiplet, a single chiplet will address a larger percentage of users. If additional cores have diminishing returns in:
a) performance
b) sales of CPUs

then the 9th core is far more valuable than the 16th core. CPUs with 2 chiplets will become an even smaller niche after people see what a 12-core chiplet can do.

If the main CCD has 48 MB and the V-Cache version has ~96 MB, the extra +50% L3 at roughly the same latency will improve the CPU performance (of the extra cores) far more than a MALL can.

As far as IPC gains in 1T, a good percentage of Zen 3's gains came from a single thread having access to +100% L3. Likewise, there will be a 1T performance improvement in Zen 6 just from having access to +50% more L3.

V-Cache CPUs showed limited 1T performance uplift because the clock speed went down. Since AMD is narrowing the gap between the clock speeds of V-Cache and non-V-Cache parts, there is a possibility that by Zen 6 the clock rates will match, and there will be yet more 1T performance just from the L3.

From Zen 3 -> Zen 4 -> Zen 5 there were only tiny IPC gains from L3 optimization, but there will be real gains in Zen 6 from L3, which is something a lot of people are ignoring.
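A quick way to put rough numbers on the "more L3 helps 1T" argument is the old square-root rule of thumb (miss rate roughly proportional to 1 / sqrt(cache size)). Back-of-the-envelope only; the 10% baseline miss rate and the exponent are assumptions, not measurements of any real workload.

Code:
#include <cmath>
#include <cstdio>

int main() {
    // Rule of thumb (assumed): miss_rate ~ base_miss * sqrt(base_size / size).
    const double base_size_mb = 32.0;   // Zen 4/5 CCD L3
    const double base_miss    = 0.10;   // assumed baseline LLC miss rate for some workload
    for (double size_mb : {32.0, 48.0, 96.0, 128.0}) {
        double miss = base_miss * std::sqrt(base_size_mb / size_mb);
        std::printf("L3 %4.0f MB -> est. miss rate %.1f%%\n", size_mb, miss * 100.0);
    }
}

With those assumptions, 32 -> 48 MB cuts the estimated miss rate from 10% to about 8.2%, and 96 MB gets it down to about 5.8%; the real uplift obviously depends entirely on the workload.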
 

StefanR5R

Elite Member
Dec 10, 2016
6,320
9,717
136
This should've been the generation with a shared V-cache design across the two CCDs. Alas.
They already performed two related steps in this generation:
– Changed the CCX topology from "optimized ring bus" to "mesh". (The only publicly known benefit of this in the current generation: It enables the Turin-dense CCX.)
– Changed the V-cache stacking from [substrate - core die - cache die - structural die - heat spreader] to [substrate - cache die - core die - heat spreader].

You are asking that they go an additional step of extending the CCX topology to reach across three dies (two core dies sitting on one cache die). Would this be a little extra step, or a big one…?

Intel's server CPUs above a certain core count already have got a mesh which reaches across two…four chiplets, connected through EMIB. The latencies between cores and cache segments and memory controllers in these meshes are a lot higher than in client CPUs (high enough that it's worthwhile to logically divide them into NUMA domains), but admittedly these are considerably larger meshes.
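Incidentally, on Linux you can read exactly where those L3 (and hence CCX) boundaries fall straight out of sysfs: each core reports which CPUs share its last-level cache, so today's dual-CCD Ryzens show two disjoint groups, and a cache shared across both core dies would show up as a single group. A minimal sketch, assuming the standard sysfs layout and that index3 is the L3 (worth double-checking via cache/index3/level):

Code:
#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main() {
    std::set<std::string> l3_groups;
    for (int cpu = 0; cpu < 1024; ++cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                        "/cache/index3/shared_cpu_list");
        if (!f) break;                 // ran out of CPUs (or no L3 info exposed)
        std::string cpus;
        std::getline(f, cpus);
        l3_groups.insert(cpus);        // one entry per distinct L3 domain
    }
    std::cout << l3_groups.size() << " L3 domain(s):\n";
    for (const std::string& g : l3_groups)
        std::cout << "  CPUs " << g << "\n";
}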
 

Joe NYC

Platinum Member
Jun 26, 2021
2,909
4,279
106
They already performed two related steps in this generation:
– Changed the CCX topology from "optimized ring bus" to "mesh". (The only publicly known benefit of this in the current generation: It enables the Turin-dense CCX.)
– Changed the V-cache stacking from [substrate - core die - cache die - structural die - heat spreader] to [substrate - cache die - core die - heat spreader].

You are asking that they go an additional step of extending the CCX topology to reach across three dies (two core dies sitting on one cache die). Would this be a little extra step, or a big one…?

Intel's server CPUs above a certain core count already have got a mesh which reaches across two…four chiplets, connected through EMIB. The latencies between cores and cache segments and memory controllers in these meshes are a lot higher than in client CPUs (high enough that it's worthwhile to logically divide them into NUMA domains), but admittedly these are considerably larger meshes.

It seems that V-Cache just scales the amount of memory at each stop of the ring bus, while maintaining the same ring bus functionality.

Sharing V-Cache would need an entirely new algorithm for accessing L3 and an entirely new topology.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,909
4,279
106
That's something for the cooks to worry about. All we want is our double layered V-cache cake :p

1.5x is probable for Zen 6 V-Cache. 2x could be possible.

In Zen 3 and Zen 4, 64 MB of V-Cache took approx. 35 mm², on a CCD that was barely 2x that size.

With Zen 6, the die size is going to ~75 mm², which is more than 2x, so even on the same N7-class node, 2x the capacity, or 128 MB, could fit. But if AMD stays with N6/N7, a 96 MB V-Cache is more likely.
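Putting that area math in one place (all of these are the approximate/rumored figures from above, not confirmed specs, so treat the output as napkin math):

Code:
#include <cstdio>

int main() {
    // Approximate figures from the post above (assumptions, not specs).
    const double base_mb  = 64.0;   // Zen 3/4 V-Cache capacity
    const double base_mm2 = 35.0;   // area of that cache die on N7-class silicon
    const double zen6_mm2 = 75.0;   // rumored Zen 6 CCD footprint
    const double density  = base_mb / base_mm2;           // ~1.8 MB per mm^2
    std::printf("SRAM density: ~%.2f MB/mm^2\n", density);
    std::printf("%.0f mm^2 on the same node: ~%.0f MB\n",  // ~137 MB, so 128 MB plausibly fits
                zen6_mm2, density * zen6_mm2);
}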
 
  • Like
Reactions: igor_kavinski

Gideon

Golden Member
Nov 27, 2007
1,964
4,814
136
if they nail shared L3 cache in medusa it will be huge

48 + 48 + 96 + 96 = 288 MB L3 global cache

good stuff
I would very much like to see this amount of cache, but at those sizes I think it might be better to either have an L4 cache or at least larger L2s. I don't think a 1-2 MB L2 and a 288 MB L3 spread across three dies is an optimal solution (see the rough latency sketch after the list below).

In an ideal world I'd rather take:
  • 1MB L2
  • 48MB L3 (intra-chiplet)
  • 192-256 MB L4 (unified large cache die)
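To make that preference concrete, here is a crude average-memory-access-time comparison of the two layouts. Every latency and hit rate below is a made-up illustrative number rather than anything measured on Zen, so it is a sketch of the tradeoff, not a prediction.

Code:
#include <cstdio>

// Average memory access time for a simple multi-level hierarchy:
//   AMAT = hit_time + miss_rate * (AMAT of the next level)
// All latencies (ns) and miss rates are illustrative assumptions.
int main() {
    const double l2_hit = 3.0, l2_miss = 0.30, dram = 80.0;

    // Option A: one big 288 MB L3 spanning three dies; assume it hits more
    // often but is slower on average because requests may hop to another die.
    const double bigL3_lat = 18.0, bigL3_miss = 0.08;
    const double amat_a = l2_hit + l2_miss * (bigL3_lat + bigL3_miss * dram);

    // Option B: fast 48 MB chiplet-local L3 backed by a 256 MB L4 cache die.
    const double L3_lat = 9.0,  L3_miss = 0.20;
    const double L4_lat = 22.0, L4_miss = 0.15;   // of the traffic that missed L3
    const double amat_b = l2_hit + l2_miss * (L3_lat + L3_miss * (L4_lat + L4_miss * dram));

    std::printf("A: big shared L3          -> AMAT ~%.1f ns\n", amat_a);
    std::printf("B: local L3 + big L4 die  -> AMAT ~%.1f ns\n", amat_b);
}

With these particular numbers the local-L3-plus-L4 option comes out ahead, but different hit rates or a faster cross-die fabric could easily flip the result.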