Zen 6 Speculation Thread


Joe NYC

Diamond Member
V$ is completely irrelevant for fmax SKUs.
It's completely irrelevant in server outside of specific HPC workloads.

I think that information dates to the Zen 3 / Zen 4 era, when V-Cache came with an fmax penalty. So the winning cases were only those where the extra cache could make up for a 300-400 MHz clock speed deficit.

In Zen 5, boost and sustained clocks are much closer between V-Cache and non-V-Cache processors, so that deficit largely disappears.

Overall, Phoronix showed a 12% gain for V-Cache vs. non-V-Cache processors across its server / workstation tests. Database tests, for example, showed big boosts in performance.

Now that Intel has closed the gap a little bit with Xeon 6, I think AMD should extend the lead with Zen 6, so that Intel is, again, not competitive.
 

adroc_thurston

Diamond Member
I think that information dates to the Zen 3 / Zen 4 era, when V-Cache came with an fmax penalty. So the winning cases were only those where the extra cache could make up for a 300-400 MHz clock speed deficit.
Oh I'm not talking about fmax penalties or anything.
V$ is just usable in a few niche workloads in DC and that's it.
Database tests, for example, showed big boosts in performance.
They're fixed, tiny working sets, irrelevant to how real DBs work irl.
I think AMD should extend the lead with Zen 6, so that Intel is, again, not competitive.
venice-d inherently makes the lead bigger than ever.
 

Joe NYC

Diamond Member
Oh I'm not talking about fmax penalties or anything.
V$ is just usable in a few niche workloads in DC and that's it.

Those were the old comparisons. They showed some workloads with big leads, some tied, and some where the V-Cache chip was behind.

But once you eliminate the clock speed penalty, all workloads move up. That's why the 9800X3D leads the 9700X by 11.9% in the Phoronix tests.


They're fixed, tiny working sets, irrelevant to how real DBs work irl.

Database tests are a mixture of data that does and does not fit in the caches. The more of the tables you can fit into cache, the less memory bandwidth is consumed and the more processing can happen at L3 latency instead of DRAM latency.
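A back-of-the-envelope sketch of why that matters; the latency figures and hit rates below are assumed round numbers for illustration, not measurements:

```python
# Crude model of average memory access latency for a query mix.
# All numbers are illustrative assumptions, not measurements.

L3_LATENCY_NS = 12.0     # assumed L3 hit latency
DRAM_LATENCY_NS = 110.0  # assumed server DRAM latency

def avg_latency_ns(l3_hit_rate: float) -> float:
    """Average access latency for a given L3 hit rate."""
    return l3_hit_rate * L3_LATENCY_NS + (1.0 - l3_hit_rate) * DRAM_LATENCY_NS

for hit_rate in (0.70, 0.85, 0.95):
    print(f"L3 hit rate {hit_rate:.0%}: ~{avg_latency_ns(hit_rate):.0f} ns average access")
# Going from 70% to 95% hits (e.g. thanks to a much larger L3) more than halves
# the average latency and cuts DRAM traffic by ~6x in this toy model.
```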

venice-d inherently makes the lead bigger than ever.

For Oracle databases, you would probably want classic Zen 6, due to licensing costs.

And offering them classic Zen 6 with V-Cache would move the performance a generation ahead: something like Zen 7 performance out of a Zen 6 processor.

We can assume that Zen 6 will already be quite well endowed as far as clock speeds go, so the main thing holding it back will be memory latency.
 

adroc_thurston

Diamond Member
That's why the 9800X3D leads the 9700X by 11.9% in the Phoronix tests
12% perf bump on a part that's normally crippled by cIOD being Pretty Bad?
Whoa man serious stuff right here.
Database tests are a mixture of data that does and does not fit in the caches. The more of the tables you can fit into cache, the less memory bandwidth is consumed and the more processing can happen at L3 latency instead of DRAM latency.
The average perf bump will be tiny.
We can assume that Zen 6 will already be quite well endowed as far as clock speeds go, so the main thing holding it back will be memory latency.
And offering them classic Zen 6 with V-Cache would move the performance a generation ahead: something like Zen 7 performance out of a Zen 6 processor.
no, 10% skt perf bump is not "Florence-like performance".
Please get real and stop projecting nerd dreams onto products made for serious people.
 

basix

Senior member
V-Cache in servers will probably get less relevant in the future:
- Z5 = 32 MByte per CCD
- Z6 classic = 48 MByte per CCD
- Z6c = 128 MByte per CCD
- Z7 = 7 MByte/core is rumored; not clear if that's for classic or dense. I would assume dense, which would mean 224+ MByte per CCD (quick math below). Classic could stick with that or move to even more cache per core, because its cores are bigger

For some use cases V-Cache will still bring a performance bump. But the margin should get thinner.
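The quick math behind those per-CCD figures; the cores-per-CCD counts are the rumored ones from this thread, so treat them as assumptions:

```python
# Per-CCD L3 from per-core figures; cores per CCD are rumored/assumed values.
configs = {
    "Zen 5 classic": {"cores": 8,  "mb_per_core": 4},  # 32 MByte per CCD
    "Zen 6 classic": {"cores": 12, "mb_per_core": 4},  # 48 MByte per CCD
    "Zen 6 dense":   {"cores": 32, "mb_per_core": 4},  # 128 MByte per CCD
    "Zen 7 (rumor)": {"cores": 32, "mb_per_core": 7},  # 224 MByte if dense keeps 32 cores
}

for name, cfg in configs.items():
    total = cfg["cores"] * cfg["mb_per_core"]
    print(f"{name}: {cfg['cores']} cores x {cfg['mb_per_core']} MB/core = {total} MByte per CCD")
```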
 

basix

Senior member
Some applications show >50% performance increases with V-Cache (some CFD solvers, RTL simulation, ...). I would not call that thin. But it will get thinner the more cache per core and per CCD in total you have. Diminishing returns, as usual.

V-Cache benefits are very application specific. For most use cases the gains are slim, so V-Cache SKUs are not a no-brainer. You need to know whether it's worth it for your use case.
 

itsmydamnation

Diamond Member
I wonder if they would ever do a memory-side V-Cache. That would/should get better hit rates for in-memory DBs or for jobs using more than a CCD's worth of cores. Obviously worse latency, but you would still complete the access sooner than going all the way out to slow-ass server DRAM.
 

basix

Senior member
You need a very large memory-side cache to see gains compared to a large L3$. Theoretically possible (e.g. stacking a large cache below the IOD), but I'm not sure how big the gains would be.
I think it would make more sense to stack DRAM instead of SRAM on the IOD. Because it is stacked, you could reduce latency and increase bandwidth compared to regular DDR. It would be worse than SRAM in both regards, but offer much higher capacity (e.g. 16 GByte instead of 512 MByte). The result would be similar to what Intel did with the Xeon Max HBM integration, but without the high HBM cost.
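A very crude way to compare the two options; every capacity, latency, and the working-set model itself are assumptions for illustration only:

```python
# Toy model: hit rate ~ capacity / working set (capped at 1), then a weighted latency.
# All capacities and latencies are illustrative assumptions.

DRAM_NS = 110.0  # assumed regular server DRAM latency

def avg_latency_ns(cache_gb: float, cache_ns: float, working_set_gb: float) -> float:
    hit_rate = min(1.0, cache_gb / working_set_gb)
    return hit_rate * cache_ns + (1.0 - hit_rate) * DRAM_NS

WORKING_SET_GB = 8.0  # assumed hot data of an in-memory DB shard

sram_cache   = avg_latency_ns(cache_gb=0.5,  cache_ns=40.0, working_set_gb=WORKING_SET_GB)
stacked_dram = avg_latency_ns(cache_gb=16.0, cache_ns=80.0, working_set_gb=WORKING_SET_GB)

print(f"512 MByte SRAM memory-side cache: ~{sram_cache:.0f} ns average")
print(f"16 GByte stacked DRAM on the IOD: ~{stacked_dram:.0f} ns average")
# With a working set far larger than 512 MByte, the big-but-slower stacked DRAM
# comes out ahead despite its worse per-access latency.
```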
 

StefanR5R

Elite Member
[dense CCD cache size]
Where are you getting this from?
I'd answer if only I could. It's been claimed a while ago, certainly here in this thread. Unfortunately it is too hard to keep track of where the rumors are buried within the speculation.

[CCD--IOD IF width]
Well, for AMD it would have been easy to do just what you wrote with Strix Halo - but they decided to keep the exact same bandwidth.
It's not exactly the same, since writes are now symmetrical with reads, both at 32 B per clock cycle.
I understood that Strix Point has this also, per CCX. (On-die there, of course, but still.) Strix Halo's off-die fabric design was evidently a one-step-at-a-time thing.

[...] I fear them to align more with the lower bound described above than anything else.
I sure hope they don't skimp on that; power requirements are supposed to be a lot lower. Their cores have so much SIMD execution width nowadays... Furthermore, concerning their die-to-die interconnects, hopefully they don't regress from the impressive internal uniformity of their current sIOD, if the next sIOD is made of two chiplets instead.
 

ToTTenTranz

Senior member
So the only difference between Zen 6 and Zen 6 dense is the transistor library? Or is the FPU less wide as well?

I guess now it makes more sense that the PS6 handheld only uses Zen 6c cores. Performance should scale better relative to the home console's CPU cores.
 

BorisTheBlade82

Senior member
That makes sense:
- 64B/clk = 205 GByte/s at MCRDIMM-12'800
- 128B/clk = 410 GByte/s at MCRDIMM-12'800

That is perfectly suited so that you can max. out the 1.6 TB/s total memory bandwidth with 8x 12C chiplets (96C) or with all Zen 6c SKUs (4x 32C = 128C or more CCDs).
Isn't the big unknown the clock for a new interconnect? I mean, with a BoW couldn't they also simply go very wide per clock but clock rather low, as long as latency doesn't nosedive?
I mean, in isolation 2.2 GHz is only worth about 0.5 ns of latency per cycle, so going down to 1 GHz would still only be around 1 ns.
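For reference, the arithmetic behind those numbers; the widths and clocks are the ones quoted above, the rest is just unit conversion:

```python
# Link bandwidth = bytes per clock x fabric clock; cycle time = 1 / frequency.

def link_bw_gbs(bytes_per_clk: int, fclk_ghz: float) -> float:
    return bytes_per_clk * fclk_ghz  # bytes x GHz = GB/s

def cycle_ns(freq_ghz: float) -> float:
    return 1.0 / freq_ghz

print(f"64 B/clk  @ 3.2 GHz = {link_bw_gbs(64, 3.2):.0f} GB/s per CCD link")
print(f"128 B/clk @ 3.2 GHz = {link_bw_gbs(128, 3.2):.0f} GB/s per CCD link")
print(f"8 links x 64 B/clk  = {8 * link_bw_gbs(64, 3.2) / 1000:.2f} TB/s aggregate")

print(f"One cycle at 2.2 GHz = {cycle_ns(2.2):.2f} ns")
print(f"One cycle at 1.0 GHz = {cycle_ns(1.0):.2f} ns")
```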
 

adroc_thurston

Diamond Member
So the only difference between Zen 6 and Zen 6 dense is the transistor library
And physdes. Makes all the difference in the world.
I guess now it makes more sense that the PS6 handheld only uses Zen 6c cores
It's only dense because of a magical thing called 'cost'.
Isn't the big unknown the clock for a new interconnect?
There is no 'interconnect', it's just wires.
Everything's running at fclk speed for very obvious reasons.
 

basix

Senior member
You could theoretically clock it at 50% or 200% of fclk and change the bus width accordingly for the chip-to-chip interface. But that is probably not worth it and you are adding some SERDES again (although very simple 2:1 or 1:2 ones). I mean, 3.2 Gbps PHY speed is very slow and energy efficient; I do not see a reason to reduce clocks even further. RDNA 3 MCDs used much higher 9.2 Gbps, HBM also runs at 8+ Gbps, and IFOP xGMI links even higher at 16 Gbps (Zen 3) or 32 Gbps (Zen 4).

If 2:1 SERDES is maybe added again for DDR6 (keeping the fabric clock at 3.2 GHz for energy-efficiency reasons but increasing the PHY speed to 6.4 Gbps), you are still in a quite cozy frequency range and keep a bunch-of-wires connection scheme.
Or you can simply increase the bus width, if die area allows it.
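A quick sketch of that width-vs-rate trade-off; the lane counts are made up for illustration, only the 3.2 and 6.4 Gbps per-pin rates come from the post:

```python
# Link bandwidth = lanes x per-pin rate / 8 (Gbps -> GB/s).
# Halving the lane count while doubling the per-pin rate (a 2:1 SERDES ratio)
# keeps bandwidth constant; the lane counts are made up for illustration.

def link_bw_gbs(lanes: int, gbps_per_pin: float) -> float:
    return lanes * gbps_per_pin / 8.0

wide_slow   = link_bw_gbs(lanes=512, gbps_per_pin=3.2)  # bunch-of-wires style
narrow_fast = link_bw_gbs(lanes=256, gbps_per_pin=6.4)  # 2:1 serialization

print(f"512 lanes @ 3.2 Gbps = {wide_slow:.0f} GB/s")
print(f"256 lanes @ 6.4 Gbps = {narrow_fast:.0f} GB/s")
assert abs(wide_slow - narrow_fast) < 1e-9  # same bandwidth, half the pins, extra PHY complexity
```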
 

adroc_thurston

Diamond Member
But that is probably not worth it and you are adding some SERDES again
SERDES is a very specific thing, and putting d2d sludge into a separate clock domain does *not* make it serdes lmao.
RDNA 3 MCDs used much higher 9.2 Gbps
uh no they ain't.
If 2:1 SERDES is maybe added again for DDR6 (keeping the fabric clock at 3.2 GHz for energy-efficiency reasons but increasing the PHY speed to 6.4 Gbps)
you do understand that 2.5D at 25um pitch makes pins cheap? SDP spam is way of the future(tm).
 

basix

Senior member
If you change the interface width but keep the bandwidth the same, you have to add SERDES in one way or another. Or change the modulation from NRZ to PAM4, use DDR, or whatever.
But you can explain how you want to do that with just a separate clock domain. Show me your magic tricks.

And yeah, RDNA 3 used 9.2 Gbps. If you don't believe me, check AMD's presentations.