Zen 6 Speculation Thread


Joe NYC

Diamond Member
Jun 26, 2021
So dense has the same L3/core as classic this time, instead of half?

I wonder what the difference in size is between the classic and dense then.

Not only that, but with

32 cores * 4 MB = 128 MB

The size of the L3 pool goes up 4x from Turin Dense to Venice Dense, which allows active cores that use a lot of memory to get an even bigger allocation of that L3.
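For anyone who wants to sanity-check that pool math, a quick sketch (the Venice Dense figures are the rumored ones from this thread, not confirmed; Turin Dense is Zen 5c's known 16 cores at 2 MB/core per CCD):

```python
# Rumored Venice Dense CCD vs. known Turin Dense CCD, L3 pool per CCD.
# Venice Dense numbers are thread speculation, not confirmed by AMD.

turin_cores, turin_l3_per_core = 16, 2      # MB/core, Zen 5c CCD
venice_cores, venice_l3_per_core = 32, 4    # MB/core, rumored Zen 6c CCD

turin_pool = turin_cores * turin_l3_per_core      # 32 MB
venice_pool = venice_cores * venice_l3_per_core   # 128 MB

print(f"Turin Dense L3 pool per CCD:  {turin_pool} MB")
print(f"Venice Dense L3 pool per CCD: {venice_pool} MB")
print(f"Growth: {venice_pool / turin_pool:.0f}x")  # 4x
```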
 

adroc_thurston

Diamond Member
Jul 2, 2023
Or even fewer cores. There are EPYCs with 1 enabled core per CCD; they are popular with this crowd.
Kind of, you're not winning much freq from going to 16c.
I think there's a better than 50% probability that AMD comes back with V-Cache for the Zen 6 classic, for highest single-core performance.
V$ is completely irrelevant for fmax SKUs.
It's completely irrelevant in server outside of specific HPC workloads.
 
  • Like
Reactions: Tlh97 and marees

Tuna-Fish

Golden Member
Mar 4, 2011
Kind of, you're not winning much freq from going to 16c.
The main purpose being to push core counts down because Oracle doesn't accept "sure we have 32 but we are only using 16 of them".
It's completely irrelevant in server outside of specific HPC workloads.
If you can fit some of your most important indexes in cache, it suddenly becomes very relevant for databases.
 

adroc_thurston

Diamond Member
Jul 2, 2023
The main purpose being to push core counts down because Oracle doesn't accept "sure we have 32 but we are only using 16 of them".
World ain't ending on Oracle.
If you can fit some of your most important indexes in cache, it suddenly becomes very relevant for databases.
Niche and hard to execute on.
So far the main V$ use cases in DC are down to carefully MPI-sliced HPC workloads.
That's why Turin-X went the way of the dodo.
 
  • Like
Reactions: madtronik

Joe NYC

Diamond Member
Jun 26, 2021
The main purpose being to push core counts down because Oracle doesn't accept "sure we have 32 but we are only using 16 of them".

If you can fit some of your most important indexes in cache, it suddenly becomes very relevant for databases.

One of the benefits of F CPUs with reduced core count per CCD was increased L3 per core. With Genoa-X, V-Cache on top of the CPU was still slowing down the FMax.

The new options on the table for Zen 6 with V-Cache:
- also increase L3 per core
- but don't reduce FMax
So both can be done without wasting precious N2 die by disabling perfectly fine cores.
 

Joe NYC

Diamond Member
Jun 26, 2021
BTW, does anyone know which version of Venice is being used in the Helios rack-scale installation? 256-core dense or 96-core classic?
 

Saylick

Diamond Member
Sep 10, 2012
BTW, does anyone know which version of Venice is being used in the Helios rack-scale installation? 256-core dense or 96-core classic?
Total guess, but probably the 96-core classic? For reference, Vera is an 88-core CPU, also with SMT.
 
  • Like
Reactions: Joe NYC

adroc_thurston

Diamond Member
Jul 2, 2023
With Genoa-X, V-Cache on top of the CPU was still slowing down the FMax.

The new options on the table for Zen 6 with V-Cache:
- also increase L3 per core
- but don't reduce FMax
So both can be done without wasting precious N2 die by disabling perfectly fine cores.
just forget about the V$.
For reference, Vera is an 88-core CPU, also with SMT.
We don't care what NV does, server Tegras are poo poo.
 

StefanR5R

Elite Member
Dec 10, 2016
I am not sure what the total market is for 128-core Turin, or Venice, but my 9755 is not alone: several of the applications it runs use all 128 fat cores (no SMT) or 256 threads (with SMT). I just don't know how many do, but it certainly is not zero.
Turin yes, but not Venice in the very same way. The relationship of Venice's classic and dense parts to each other will not be the same as we are used to from Turin and from Genoa, Bergamo, and Siena. (And correspondingly, the relationship of the sockets SP7 and SP8 to each other will be very different from the relationship of SP5 and SP6 to each other.) Hence, certain conclusions which some have drawn from Turin to Venice here in this thread are not valid.

So, the Turin situation that 9755 > 9745 and that 9755 has got its use (niche or not) is readily apparent. But with Venice, things will play out differently. Remember that the dense CCX is presumed to be heavily upgraded to the same ratio of L3$ to cores as the classic CCX, and thereby to much more L3$ per core complex than classic. And this is only one system-level change among several more from Turin to Venice.
 

StefanR5R

Elite Member
Dec 10, 2016
One of the benefits of F CPUs with reduced core count per CCD was increased L3 per core. [...V-cache as potential alternative...]
Another effect of using fewer cores per CCD is a higher ratio of die-to-die bandwidth to core count. (With the current IFoP implementation, "wide GMI" is an alternative or additional means to this end.) Will be interesting to learn what die-to-die bandwidth the 12c and 32c CCDs will be configured with.
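To put rough numbers on that ratio argument, a sketch (the 64 GB/s per-link figure is just a placeholder assumption for current IFoP, not a spec, and "wide GMI" is modeled as simply bonding two links to one CCD):

```python
# Sketch: die-to-die bandwidth per core as enabled cores per CCD shrink.
# LINK_BW is an assumed nominal per-GMI-link figure, not a confirmed spec;
# "wide GMI" is modeled here as two links feeding one CCD.

LINK_BW = 64.0  # GB/s per link, placeholder assumption

for cores in (8, 4, 2, 1):
    narrow = LINK_BW / cores       # one link per CCD
    wide = 2 * LINK_BW / cores     # wide GMI: two links per CCD
    print(f"{cores} cores/CCD: {narrow:5.1f} GB/s/core (narrow), "
          f"{wide:5.1f} GB/s/core (wide)")
```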
 

BorisTheBlade82

Senior member
May 1, 2020
Will be interesting to learn what die-to-die bandwidth the 12c and 32c CCDs will be configured with.
As stated before, it must be at least 200 GByte/s/CCD in order to saturate the new max memory BW with an 8-CCD SKU under perfect circumstances.
According to C&C, total RAM BW is shared between reads and writes, so with a typical 2:1 ratio I see 128 GByte/s read, 64 GByte/s write as the lower boundary. I would hope for 256 GByte/s read, 128 GByte/s write, for two reasons:
- It is better for a broader range of workloads with less evenly distributed RAM BW demands.
- It gives SKUs with <8 CCDs more room to stretch their legs, just like GMI-Wide does today.
And no, I cannot imagine them being able to combine two links for one CCD, as the layout/routing constraints will make that impossible.
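For reference, a sketch of where the ~200 GByte/s/CCD floor comes from (assuming the rumored 16-channel MCRDIMM-12800 configuration for Venice, which is speculation, not a confirmed spec):

```python
# Lower bound on per-CCD die-to-die bandwidth so that 8 CCDs can
# saturate total DRAM bandwidth. 16ch MCRDIMM-12800 is the rumored
# Venice config assumed here.

channels = 16
transfers_per_s = 12_800e6   # 12,800 MT/s per channel
bytes_per_transfer = 8       # 64-bit channel

total_bw = channels * transfers_per_s * bytes_per_transfer  # ~1.64 TB/s
per_ccd = total_bw / 8                                      # 8-CCD SKU

print(f"Total DRAM BW: {total_bw / 1e12:.2f} TB/s")
print(f"Per-CCD floor: {per_ccd / 1e9:.0f} GB/s")           # ~205 GB/s
# A 2:1 read:write split of that is ~137 GB/s read / ~68 GB/s write,
# which rounds down to the 128/64 lower boundary above.
```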
 

Joe NYC

Diamond Member
Jun 26, 2021
Another effect of using fewer cores per CCD is a higher ratio of die-to-die bandwidth to core count. (With the current IFoP implementation, "wide GMI" is an alternative or additional means to this end.) Will be interesting to learn what die-to-die bandwidth the 12c and 32c CCDs will be configured with.

With new technology allowing higher bandwidth to be delivered efficiently, the cap will likely be raised so much that it is almost never a bottleneck.

So that argument, fewer cores per CCD in order for each core to have more bandwidth, should go away.
 

BorisTheBlade82

Senior member
May 1, 2020
With new technology allowing higher bandwidth to be delivered efficiently, the cap will likely be raised so much that it is almost never a bottleneck.

So that argument, fewer cores per CCD in order for each core to have more bandwidth, should go away.
Well, for AMD it would have been easy to do just what you wrote with Strix Halo, but they decided to keep the exact same bandwidth. So I'd guess that power consumption, area, and routing effort are still enough of a thing for them not to widen the interconnect beyond sanity. This is why I fear they will align more with the lower bound described above than with anything else.
 
  • Like
Reactions: Tlh97 and Joe NYC

adroc_thurston

Diamond Member
Jul 2, 2023
Well, for AMD it would have been easy to do just what you wrote with Strix Halo, but they decided to keep the exact same bandwidth
It's not the exact same, since writes are now symmetrical to reads, both 32B a clock cycle.
It's one SDP there.
My guess is Zen 6 will move to an SDP-per-mesh-column arrangement, giving the baby CCD 64 bytes a clock, and the big boy a nice and chonky 128B/clk.
 

basix

Senior member
Oct 4, 2024
That makes sense:
- 64B/clk = 205 GByte/s at MCRDIMM-12'800
- 128B/clk = 410 GByte/s at MCRDIMM-12'800

That is perfectly suited so that you can max out the 1.6 TB/s total memory bandwidth with 8x 12C chiplets (96C) or with all Zen 6c SKUs (4x 32C = 128C, or more CCDs).
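A sketch checking those numbers (the 3.2 GHz link clock is my assumption, a quarter of the 12,800 MT/s rate; it is what makes the arithmetic line up, not a confirmed spec):

```python
# 64B/clk and 128B/clk SDP widths vs. total DRAM bandwidth.
# Assumes the die-to-die link runs at 3.2 GHz (12,800 MT/s / 4);
# that clock choice is an assumption, not a confirmed spec.

FCLK = 3.2e9  # Hz, assumed link clock

for label, bytes_per_clk, ccds in (("12C CCD", 64, 8), ("32C CCD", 128, 4)):
    per_ccd = bytes_per_clk * FCLK   # bytes/s per CCD
    total = per_ccd * ccds
    print(f"{label}: {per_ccd / 1e9:.0f} GB/s/CCD x {ccds} CCDs "
          f"= {total / 1e12:.2f} TB/s")   # both land at ~1.64 TB/s
```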
 

Josh128

Golden Member
Oct 14, 2022
Remember that the dense CCX is presumed to be heavily upgraded to the same ratio of L3$ to cores as the classic CCX, and thereby to much more L3$ per core complex than classic.
Where are you getting this from? No longer just 2 MB L3 per core for Zen 6c, but 4 MB?? That's news to me.