Zen 6 Speculation Thread


OneEng2

Golden Member
Sep 19, 2022
How did they get >1.7x perf uplift?

Efficiency is also >1.7x, which means the same 500W TDP.
  • 8x IF links vs 12x on Turin Dense, so fewer links to power.
  • 2.5D packaging to bring immense power savings in signalling
  • New process for the IOD for more efficiency
  • But the signalling rate is higher at 64 Gbps vs 36 Gbps on Turin Dense
They would need to bring IOD power usage down to around 60W-80W to allow roughly 1.65W or less per core for 256 cores in the same SoC at 500W. Going below that honestly seems difficult considering all the various functions in the IOD.

If they can do <1.65W per core, they would need all the power efficiency gains from N2, at minimum 1.1x efficiency per core.

But they need frequency gains too to get to the 1.7x perf gain.

1.1x power efficiency gain on top of 1.1x frequency gain would be something of an optimistic outcome from the N2 process jump.

1.33x cores * 1.1x frequency * 1.18x perf/clock = 1.73x perf uplift
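
A quick back-of-the-envelope check of that arithmetic (all inputs are the speculative figures above, not confirmed numbers):

```python
# Rough power-budget and uplift arithmetic for a speculated 256-core Venice SKU.
# All inputs are the speculative figures from this post, not confirmed specs.

tdp_w = 500            # assumed same socket TDP as Turin Dense
iod_w = 70             # speculated IOD power, midpoint of the 60W-80W guess
cores = 256

per_core_w = (tdp_w - iod_w) / cores
print(f"Power left per core: {per_core_w:.2f} W")   # ~1.68 W, near the ~1.65 W target

# Multiplicative uplift estimate: more cores * higher clocks * higher IPC
cores_scale = 256 / 192   # vs. 192-core Turin Dense
freq_scale  = 1.10
ipc_scale   = 1.18
print(f"Estimated perf uplift: {cores_scale * freq_scale * ipc_scale:.2f}x")  # ~1.73x
```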

If there is indeed no increase in TDP, the clocks would be conservative to keep efficiency high, which means the perf-per-clock gain would have to be quite high. I would suspect high teens.

Another thing I suspect is 2x additional FP pipes (the "new AI pipelines" Papermaster mentioned suggest at least 2x), something new since Zen 3. This would bring some easy gains in common benches like Cinebench.
Note: The clocks on Turin are already thermally limited at 192 cores. The efficiency gain is the move from N3E to N2. I feel like AMD's got the goods on this one, or they wouldn't have released the information so early. They also must feel pretty confident that Intel is unable to answer Venice.

As for the added FP pipes, I doubt it. They are expensive and would require scheduling changes. I was thinking they would simply optimize the existing FP units... but I agree that this is a likely spot for them to do a little work. After all, it would eliminate the one win ARL currently has in benchmarks ;).
 

StefanR5R

Elite Member
Dec 10, 2016
Another thing I suspect is 2x additional FP pipes (the "new AI pipelines" Papermaster mentioned suggest at least 2x), something new since Zen 3.
Whatever they meant by "more AI pipelines" (Papermaster's FAD presentation) is unlikely to factor into the ">1.7x" projection though (McNamara's presentation), as the latter refers to SPECrate®2017_int_base specifically.
 

Abwx

Lifer
Apr 2, 2011
cinememe isn't a math benchmark at all.
Actually, I'm not sure that a custom scene designed in Cinema 4D, with quite different parameters than the ones used in Cinebench, would reproduce a perf advantage for Intel; the tables may well be turned.
 

Cheesecake16

Member
Aug 5, 2020
As for the added FP pipes, I doubt it. They are expensive and would require scheduling changes. I was thinking they would simply optimize the existing FP units... but I agree that this is a likely spot for them to do a little work. After all, it would eliminate the one win ARL currently has in benchmarks ;).
Well no... it wouldn't require any scheduler layout changes because Zen has always been 2 FMA units plus 2 FADD units...

They could make those 2 FADD units into FMA units and change nothing in terms of the scheduler layout... they would need to make all 4 of the AGUs capable of 512b operations so the core can do 4x 512b loads or 2x 512b stores, along with adding more ports to the Vector Register File, but the scheduler layout wouldn't need to be changed...

Now, is it likely that AMD will make the 2 FADD units into FMA units... no... IMO the "More AI Pipelines" likely refers to the AVX512-FP16 support that Zen 6 adds, assuming the GCC patches are accurate...

What Zen 6 brings in terms of architectural improvement is still largely up in the air... I can make some guesses but they would only be guesses...
 

Cheesecake16

Member
Aug 5, 2020
Feel free, it's a speculation thread ;) Just maybe mention it's speculation so people won't bother you asking for sources etc.
Well then... I'll speculate...

Here is my speculation for what Zen 6 may change compared to Zen 5, with the 12 core CCD being the focus for the L3 and CCD <-> IOD interconnect speculation...

- Starting at the frontend: surprisingly, this is where Zen 5 had the most bottlenecks, at least in a small number of games... usually you'd expect the core to be memory bound, but Zen 5 is much more frontend bound compared to other cores...

- Zen 5 struggles with frontend latency more than anything, which can be relieved in a number of ways, the most straightforward being a larger L1i... if AMD chooses not to make the L1i larger, they can improve the instruction prefetchers so they fetch into the L1i more aggressively in Zen 6 to reduce the effective latency... they could also make the iTLBs larger to reduce latency that way as well, though I suspect that could run into diminishing returns...

- Zen 5 is also frontend bandwidth bound, which potentially points to AMD making it so that a single thread can use both decode clusters at the same time, if for no other reason than to improve the bandwidth out of the L2 into the L1i... they can also improve the behavior of the op cache by having Zen 6 feed 2x8 instructions per cycle instead of the 2x6 instructions per cycle that Zen 5 currently does...
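
A trivial back-of-the-envelope on what that op-cache change would buy in peak feed rate (the 2x6 figure for Zen 5 and the 2x8 guess are the ones above):

```python
# Peak instructions per cycle out of the op cache: clusters x entries per cluster.
# Zen 5's 2x6 is from the post above; 2x8 is the speculated Zen 6 change.
zen5_feed = 2 * 6   # 12 instructions/cycle
zen6_feed = 2 * 8   # 16 instructions/cycle (speculative)
print(f"Peak op-cache feed: {zen5_feed} -> {zen6_feed} instr/cycle "
      f"({zen6_feed / zen5_feed:.2f}x)")
```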

- Moving to the Branch Predictor, I don't expect much to change here... Zen 5 already had an insanely good branch predictor so I suspect for Zen 6 that they may increase some structure sizes in the BPU and possibly make the return stack lower latency...

- Going down the pipe to dispatch and rename, I don't expect this to change; I expect Zen 6 to stick with 8 ops per cycle into the ROB...

- Speaking of the ROB, I do expect this to get a little bit larger, IMO 512 entries or so, with the retirement width either increasing from 8 per cycle to 12 or 16 per cycle or removing restrictions on the number of specific instructions the retirement queue can retire per cycle...

- Looking at the Integer side of the core, I expect the Physical Integer Register File to see a bump of at least 64 entries, if not more, to north of 300 entries, and I do expect the ALU scheduler to increase in size a little, perhaps to 96 entries or more, but I don't expect the port or ALU layout to change much, if at all...

- I really doubt that much, if anything, has changed for the vector side of the core other than new AVX512 instructions and getting rid of the 2-cycle hazard in Zen 5 that made basically all single-cycle instructions on the vector side take 2 cycles, i.e. Zen 6 will make those instructions single cycle again...

- Now, for the memory side of the core, I would not be surprised if AMD went from Zen 5's 4-AGU setup (2 able to do both loads and stores, 2 loads only), which limits you to 4 memory ops per cycle, to a 6-AGU setup with 4 load AGUs and 2 store AGUs so that Zen 6 can do 6 memory ops per cycle, along with making the memory scheduler larger or possibly adding a non-scheduling queue in front just like the vector side...

- For both the Load and Store queues, I don't expect much change in Zen 6...

- I do expect both data TLBs to be made larger; AMD has done this every generation, and honestly I wouldn't be surprised if the L2 dTLB increased from 4K to 5-6K entries or so... I also expect the cache miss buffers to get larger as well to better absorb cache and memory latency...
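
For a sense of what that dTLB bump would buy, a quick reach calculation (entry counts are the speculated ones above, and 4 KiB base pages are assumed):

```python
# TLB "reach" = entries x page size. Entry counts are the speculated values
# from this post; 4 KiB base pages are assumed for simplicity.
page_kib = 4

for label, entries in [("Zen 5 L2 dTLB (4K entries)", 4096),
                       ("Speculated Zen 6 (6K entries)", 6144)]:
    reach_mib = entries * page_kib / 1024
    print(f"{label}: covers {reach_mib:.0f} MiB of 4 KiB pages")
```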

- Moving to the L2, I don't expect AMD to make it larger, but I do expect them to improve the behavior of the L2 -> L1d bandwidth, because while it is nominally 64B per cycle, it's not consistently 64B per cycle and is closer to 32B per cycle if you are just doing reads...

- For the L3, I expect AMD to stick with the same mesh interconnect that Zen 5 has, just with more stops to accommodate the 12-core CCDs, which means each CCD will now have 48MB of L3 cache... this will likely increase the latency of the L3, but by how much I don't know... somewhere between 4 and 8 extra cycles seems likely, for an L3 latency of about 50 to 54 cycles... As for the new V-Cache die, I expect it to increase by 50% to 96MB, in line with the 50% L3 increase of the new 12-core CCD...
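
Putting those L3 numbers together in one place (all of them speculative, per the figures above):

```python
# Speculated Zen 6 CCD L3 figures from this post (not confirmed).
cores_per_ccd = 12
l3_per_core_mb = 4                          # same 4 MB/core slice as Zen 5 (assumed)
ccd_l3_mb = cores_per_ccd * l3_per_core_mb  # 48 MB per CCD
vcache_die_mb = 64 * 1.5                    # 50% bigger V-Cache die -> 96 MB
print(f"CCD L3: {ccd_l3_mb} MB, X3D total: {ccd_l3_mb + vcache_die_mb:.0f} MB")

# Latency guess: 4-8 extra cycles on top of the ~46-cycle baseline implied above
baseline_cycles = 46
print(f"L3 latency guess: {baseline_cycles + 4} to {baseline_cycles + 8} cycles")
```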

- For the new CCD to IOD interconnect... I would not be surprised if they moved to a 64B per cycle read path and a 32B per cycle write path per CCD, so a dual-CCD chip will be able to do 128B reads per cycle and 64B writes per cycle, in order to better utilize the higher frequency DDR5 we are seeing launched... as for the fabric clock speed... I expect that to be around the same 2GHz or so that Zen 5 currently runs, though that may go up as well... which means a single CCD will be able to handle about 128GB/s of reads and 64GB/s of writes to memory assuming 2GHz, and a dual-CCD setup will be able to do about 256GB/s of reads and 128GB/s of writes to memory...
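
And the arithmetic behind those GB/s figures (the bus widths and the ~2GHz fabric clock are the speculative values above):

```python
# GB/s = bytes per cycle x fabric clock. Widths and the ~2 GHz FCLK are the
# speculated values from this post, not confirmed specs.
fclk_ghz = 2.0

def link_gbps(bytes_per_cycle):
    # Treat 1 GHz as 1e9 cycles/s, so B/cycle * GHz gives GB/s directly
    return bytes_per_cycle * fclk_ghz

print(f"Single CCD: {link_gbps(64):.0f} GB/s read, {link_gbps(32):.0f} GB/s write")
print(f"Dual CCD:   {link_gbps(128):.0f} GB/s read, {link_gbps(64):.0f} GB/s write")
```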

So I am expecting Zen 6 to be an evolution of the Zen 5 core, with some widening of the core, fixes for the weak spots in frontend latency and bandwidth and the vector-side hazard, and the largest improvements happening in the L3 cache and memory subsystem along with the new AVX512 instructions... but again, this is just my speculation... I may very well be completely wrong...
 

OneEng2

Golden Member
Sep 19, 2022
Well then... I'll speculate...

Here is my speculation for what Zen 6 may change compared to Zen 5, with the 12 core CCD being the focus for the L3 and CCD <-> IOD interconnect speculation...
Wow. Thanks for that.

Do you think the front-end limitations were holding back some performance in CB24? It seems like Zen 5 dominates in CB23 but loses to ARL in CB24. I always chalked it up to a throughput problem, but wasn't sure where the bottleneck actually was.
 

Jan Olšan

Senior member
Jan 12, 2017
Most workloads don't need 1:1 read to write... but I can think of one, Y-Cruncher, that does... so a certain person will be happy if Zen 6 does 1:1 read to write instead of 2:1 read to write...

And I'd say benchmarks pretty much confirmed this - people have been panicking about "writes being half speed" since Zen 2, I think, but if there is little measurable impact, why the heck not do it this way.

Why make it symmetric when write bandwidth isn't needed?
If you want to blow extra budget on bandwidth, just widen both interfaces instead while keeping the 2:1 ratio. You will get a better payoff.
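
A toy sketch of why the asymmetric split tends to win for the same wire budget; the 3:1 read:write traffic mix below is purely an illustrative assumption, not a measured number:

```python
# Toy comparison of CCD<->IOD link sizing for a read-heavy traffic mix.
# The 3:1 read:write mix and the byte-per-cycle widths are illustrative
# assumptions, not measured or confirmed numbers.

def effective_bytes_per_cycle(read_width, write_width, read_share):
    """Bytes of mixed traffic moved per cycle when `read_share` of the bytes
    are reads; the more heavily loaded direction becomes the bottleneck."""
    write_share = 1.0 - read_share
    cycles_per_byte = max(read_share / read_width, write_share / write_width)
    return 1.0 / cycles_per_byte

read_share = 0.75  # assume 3 bytes read for every 1 byte written

# Same total wire budget (96B/cycle), split 2:1 vs 1:1
asym = effective_bytes_per_cycle(64, 32, read_share)
sym  = effective_bytes_per_cycle(48, 48, read_share)
print(f"2:1 split (64B/32B): {asym:.1f} B/cycle of mixed traffic")
print(f"1:1 split (48B/48B): {sym:.1f} B/cycle of mixed traffic")
```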
 

CouncilorIrissa

Senior member
Jul 28, 2023
And I'd say benchmarks pretty much confirmed this - people have been panicking about "writes being half speed" since Zen 2, I think, but if there is little measurable impact, why the heck not do it this way.

Why make it symmetric when write bandwidth isn't needed?
If you want to blow extra budget on bandwidth, just widen both interfaces instead while keeping the 2:1 ratio. You will get a better payoff.
The baseline here, however, is not Granite Ridge but rather Strix Halo, and that one is symmetric: 32B/cycle in both directions.
FWIW I personally think they'll stick to 32B in both directions on client and leave the L3 to fend for itself. Would be a very AMD thing to do.
 

Geddagod

Golden Member
Dec 28, 2021
Zen 5 struggles with frontend latency more than anything, which can be relieved in a number of ways, the most straightforward being a larger L1i...
Why do you think Zen 5's L1i MPKI is so much higher than Zen 4's despite being the same capacity? As in ranging from 10x worse to 3x worse across a variety of specint2017 subtests.
- Moving to the Branch Predictor, I don't expect much to change here... Zen 5 already had an insanely good branch predictor so I suspect for Zen 6 that they may increase some structure sizes in the BPU and possibly make the return stack lower latency...
The branch predictor has dramatically more L2 BTB overrides than Zen 4 from Huang's testing, which also exacerbates the front end latency issue. Seems extremely weird.
 

Cheesecake16

Member
Aug 5, 2020
Why do you think Zen 5's L1i MPKI is so much higher than Zen 4's despite being the same capacity? As in ranging from 10x worse to 3x worse across a variety of specint2017 subtests.
Simple: you are having to feed a much larger frontend... you'd expect more L1i misses given the same structure size...
The branch predictor has dramatically more L2 BTB overrides than Zen 4 from Huang's testing, which also exacerbates the front end latency issue. Seems extremely weird.
So that is very dependent on the workload; for example, compiling the Linux kernel, the L2 BTB overrides went down from about 12.86 MPKI to about 3 MPKI... whereas the L1 iTLB misses went up, which again isn't surprising considering that the L1 iTLB size didn't change from Zen 4 to Zen 5...
 

Doug S

Diamond Member
Feb 8, 2020
Most workloads don't need 1:1 read to write... but I can think of one, Y-Cruncher, that does... so a certain person will be happy if Zen 6 does 1:1 read to write instead of 2:1 read to write...

Yep, and designers design for the common case, not the outlier. Nor do they make things symmetrical simply out of an overdeveloped case of OCD. There has to be a reason to spend the extra resources on it.

If, e.g., you had the resources to perform three loads and three stores, you'd rather do four loads and two stores, because that would result in an overall average speedup of "all" code even if outliers like Y-Cruncher were leaving performance on the table.
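
A toy illustration of that trade-off, assuming the common rule of thumb that code issues roughly two loads per store (the mix is an assumption, not data from this thread):

```python
# Toy model: sustained memory ops/cycle for two port splits with the same
# total number of ports, assuming a rule-of-thumb ~2:1 load:store mix.
loads_per_store = 2.0
load_share = loads_per_store / (loads_per_store + 1)   # ~0.667
store_share = 1.0 - load_share

for load_ports, store_ports in [(3, 3), (4, 2)]:
    # Whichever port type is more contended sets the sustained rate
    cycles_per_op = max(load_share / load_ports, store_share / store_ports)
    print(f"{load_ports} load + {store_ports} store ports: "
          f"{1 / cycles_per_op:.1f} memory ops/cycle")
```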