adroc_thurston
> That sounds too optimistic, 9965 is 500W for 192 cores.

It runs that.
Note: the clocks on Turin are already thermally limited at 192 cores. The efficiency is the move from N3E to N2. I feel like AMD's got the goods on this one, or they wouldn't have released the information so early. They also must feel pretty confident that Intel is unable to answer Venice.
How did they get >1.7x perf uplift?
Efficiency is also >1.7x, which means the same 500W TDP.
They would need to bring IOD power usage down to around 60W - 80W to allow around 1.65W per core for 256 cores in the same SoC at 500W. Going below that honestly seems difficult considering all the various functions in the IOD.
- 8x IF links vs 12x on Turin Dense, so fewer links to power
- 2.5D packaging to bring immense power savings in signalling
- New process for the IOD for more efficiency
- But the signalling rate is higher, at 64 Gbps vs 36 Gbps on Turin Dense
If they can do <1.65W per core, they would need all the power efficiency gains from N2 - at minimum a 1.1x efficiency gain per core.
But they need frequency gains too to get to the 1.7x perf gain.
A 1.1x power efficiency gain on top of a 1.1x frequency gain would be something of an optimistic outcome from the N2 process jump.
1.33x cores * 1.1x frequency * 1.18x perf/clock = 1.73x perf uplift
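Quick back-of-envelope in Python to sanity-check both calculations; the 500W TDP, the 60-80W IOD estimate, and the scaling factors are all this thread's assumptions, not official figures:

```python
# All inputs are the thread's assumptions, not official figures.
TDP_W = 500          # assumed unchanged socket TDP
IOD_W = 70           # midpoint of the 60-80W IOD estimate above
CORES = 256

per_core = (TDP_W - IOD_W) / CORES
print(f"per-core budget: {per_core:.2f} W")  # ~1.68 W; <1.65 W needs the IOD nearer 80 W

# multiplicative decomposition of the >1.7x claim
core_scale, freq_scale, ipc_scale = 256 / 192, 1.10, 1.18
print(f"perf uplift: {core_scale * freq_scale * ipc_scale:.2f}x")  # ~1.73x
```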
If there is indeed no increase in TDP, the clocks would be conservative to keep efficiency high, which means the perf per clock gain would have to be quite high. I would suspect high teens.
Another thing I suspect is an additional 2x FP pipes (the new AI pipelines Papermaster mentioned, suggesting at least 2x), something new since Zen 3. This would bring some easy gains in common benchmarks like Cinebench.
> After all, it would eliminate the one win ARL currently has in benchmarks.

cinememe isn't a math benchmark at all.
> Another thing I suspect is an additional 2x FP pipes (new AI pipelines by Papermaster, suggesting at least 2x), something new since Zen 3.

Whatever they meant by "more AI pipelines" (Papermaster's FAD presentation) is unlikely to pay into the ">1.7x" projection though (McNamara's presentation), as the latter refers to SPECrate®2017_int_base specifically.
> cinememe isn't a math benchmark at all.

Actually I'm not sure that when designing a custom scene using Cinema 4D and with quite...
> As for the added FP pipes, I doubt it. They are expensive and would require scheduling changes. I was thinking they would simply optimize the existing FP units... but I agree that this is a likely spot for them to do a little work. After all, it would eliminate the one win ARL currently has in benchmarks.

Well no... it wouldn't require any scheduler layout changes because Zen has always been 2 FMA units plus 2 FADD units...
> I can make some guesses but they would only be guesses...

Feel free, it's a speculation thread.
> Feel free, it's a speculation thread.

> Just maybe mention it's a speculation so people wouldn't bother you with asking for sources etc.

Well then... I'll speculate...
> Well then... I'll speculate...

Wow. Thanks for that.
Here is my speculation for what Zen 6 may change compared to Zen 5, with the 12 core CCD being the focus for the L3 and CCD <-> IOD interconnect speculation...
- Starting at the frontend: surprisingly, this is where Zen 5 had the most bottlenecks, at least for a small number of games... usually you'd expect the core to be memory bound, but Zen 5 is much more frontend bound compared to other cores...
- Zen 5 struggles with frontend latency more than anything, which can be relieved in a number of ways, the most straightforward being to make the L1i larger... if AMD chooses not to make the L1i larger, they can improve the instruction prefetchers so that they more aggressively fetch into the L1i in Zen 6 to reduce the effective latency... they could also possibly make the iTLBs larger to reduce latency that way as well, though I suspect that could run into diminishing returns...
- Zen 5 is also frontend bandwidth bound, which potentially points to AMD making it so that a single thread can use both decode clusters at the same time, if for no other reason than to improve the bandwidth out of the L2 into the L1i... they can also improve the behavior of the op cache by making Zen 6 able to feed 2x8 instructions per cycle instead of the 2x6 per cycle that Zen 5 currently does...
- Moving to the Branch Predictor, I don't expect much to change here... Zen 5 already had an insanely good branch predictor so I suspect for Zen 6 that they may increase some structure sizes in the BPU and possibly make the return stack lower latency...
- Going down the pipe to dispatch and rename, I don't expect this to change; Zen 6 should stick with 8 ops per cycle into the ROB...
- Speaking of the ROB, I do expect this to get a little bit larger, IMO 512 entries or so, with the retirement width either increasing from 8 per cycle to 12 or 16 per cycle, or with restrictions removed on the number of specific instructions the retirement queue can retire per cycle...
- Looking at the integer side of the core, I expect the physical integer register file to see a bump of at least 64 entries, if not more, to north of 300 entries, and I do expect the ALU scheduler to increase in size a little, perhaps to 96 entries or more, but I don't expect the port or ALU layout to change much, if at all...
- I really doubt that much if anything will change on the vector side of the core, other than new AVX512 instructions and getting rid of the 2 cycle hazard in Zen 5 that made basically all single cycle instructions on the vector side take 2 cycles, i.e. Zen 6 will make those instructions single cycle again...
- Now, for the memory side of the core, I would not be surprised if AMD went from Zen 5's 4 AGU setup (2 AGUs able to do both loads and stores and 2 able to do only loads, which limits it to 4 memory ops per cycle) to a 6 AGU setup with 4 load AGUs and 2 store AGUs, so that Zen 6 can do 6 memory ops per cycle, along with making the memory scheduler larger or possibly adding a non-scheduling queue in front, just like the vector side...
- For both the Load and Store queues, I don't expect much change in Zen 6...
- I do expect both data TLBs to be made larger (AMD has done this every generation), and honestly I wouldn't be surprised if the L2 dTLB increased from 4K to 5-6K entries or so; I also expect the cache miss buffers to get larger to better absorb cache and memory latency...
- Moving to the L2, I don't expect AMD to make it larger, but I do expect them to improve the L2 -> L1d bandwidth behavior, because while it is nominally 64B per cycle, it isn't consistently 64B per cycle and is closer to 32B per cycle if you are just doing reads...
- For the L3, I expect AMD to stick with the same mesh interconnect that Zen 5 has, just with more stops to accommodate the 12 core CCDs, which means each CCD will now have 48MB of L3 cache... this will likely increase the latency of the L3, but by how much I don't know... somewhere between 4 and 8 cycles seems likely, for an L3 latency of about 50 to 54 cycles... as for the new V-Cache die, I expect it to increase by 50% to 96MB, in line with the 50% L3 increase of the new 12 core CCD...
- For the new CCD to IOD interconnect... I would not be surprised if they moved to a 64B per cycle read path and a 32B per cycle write path per CCD, so a dual CCD chip would be able to do 128B reads and 64B writes per cycle, in order to better utilize the higher frequency DDR5 we are seeing launched... as for the fabric clock, I expect it to be around the same 2GHz or so as Zen 5 currently, though that may go up as well... which means a single CCD would handle about 128GB/s reads and 64GB/s writes to memory assuming 2GHz, and a dual CCD setup about 256GB/s reads and 128GB/s writes (see the sketch after this post)...
So I am expecting Zen 6 to be an evolution of the Zen 5 core, with some widening of the core and fixing of weak spots WRT frontend latency and bandwidth and the vector side hazard, with the largest improvements happening in the L3 cache and memory subsystem along with the new AVX512 instructions... but again, this is just my speculation... I may very well be completely wrong...
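A minimal Python sketch of the interconnect and cache arithmetic in that post; the bus widths, the ~2GHz fabric clock, and the 12 core CCD layout are all that post's speculation, not confirmed figures:

```python
# All constants below are the post's speculation, not confirmed figures.
FCLK_HZ = 2.0e9             # assumed ~2 GHz fabric clock
READ_B, WRITE_B = 64, 32    # speculated bytes per cycle per CCD

for ccds in (1, 2):
    reads = ccds * READ_B * FCLK_HZ / 1e9
    writes = ccds * WRITE_B * FCLK_HZ / 1e9
    print(f"{ccds} CCD: {reads:.0f} GB/s reads, {writes:.0f} GB/s writes")
# -> 1 CCD: 128/64 GB/s; 2 CCDs: 256/128 GB/s, matching the post

# Speculated L3 scaling: 12 cores at the usual 4 MB/core, V-Cache die +50%
print("L3 per CCD:", 12 * 4, "MB; V-Cache die:", int(64 * 1.5), "MB")
```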
> For the new CCD to IOD interconnect... I would not be surprised if they moved to a 64B per cycle read path and a 32B per cycle write path per CCD

why asymmetry? wires are cheap on a 25um LSI.
> why asymmetry? wires are cheap on a 25um LSI.

I am guessing that it isn't needed, so why waste the area?
SDPs are generally symmetrical too.
> I am guessing that it isn't needed, so why waste the area?

You're not wasting much area, d2d shoreline is real cheap for dumb parallel interfaces.
> why asymmetry? wires are cheap on a 25um LSI.
> SDPs are generally symmetrical too.

Most workloads don't need 1:1 read to write... but I can think of one, Y-Cruncher, that does... so a certain person will be happy if Zen 6 does 1:1 read to write instead of 2:1 read to write...
> Most workloads don't need 1:1 read to write...

Indeed. But SDPs are inherently symmetrical.
> so a certain person will be happy if Zen 6 does 1:1 read to write instead of 2:1 read to write...

and you know well enough what I'm gonna do to him.
> Most workloads don't need 1:1 read to write... but I can think of one, Y-Cruncher, that does... so a certain person will be happy if Zen 6 does 1:1 read to write instead of 2:1 read to write...

But the baseline here is not Granite Ridge, but rather Strix Halo, and that one is symmetric: 32B/cycle in both directions. And I'd say benchmarks pretty much confirmed this - people have been panicking about "writes being half speed" since Zen 2, I think, but if there is little measurable impact, why the heck not do it.
Why make it symmetric when write bandwidth isn't needed?
If you want to blow extra budget on bandwidth, just widen both interfaces instead while keeping the 2:1 ratio. You will get better payoff.
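A toy model of that payoff argument; the 96B/cycle total width and the traffic mixes are made-up illustrative numbers, not anything AMD has disclosed:

```python
# Toy model: fixed 96 B/cycle of total link width, split 2:1 vs 1:1.
# Which split sustains more total traffic for a given read:write mix?
def sustained(read_w, write_w, read_frac):
    """Total B/cycle at which the busier direction saturates."""
    return min(read_w / read_frac, write_w / (1.0 - read_frac))

for split in ((64, 32), (48, 48)):
    for rf in (2 / 3, 1 / 2):  # read-heavy mix vs a y-cruncher-like 1:1 mix
        print(f"{split} at read fraction {rf:.2f}: "
              f"{sustained(*split, rf):.0f} B/cycle sustained")
# -> the 2:1 split wins on the typical read-heavy mix (96 vs 72 B/cycle);
#    the 1:1 split only wins when the mix really is 1:1 (96 vs 64).
```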
> Zen 5 struggles with frontend latency more than anything, which can be relieved in a number of ways, the most straightforward being to make the L1i larger...

Why do you think Zen 5's L1i MPKI is so much higher than Zen 4's despite being the same capacity? As in ranging from 10x worse to 3x worse across a variety of specint2017 subtests.
> Moving to the Branch Predictor, I don't expect much to change here... Zen 5 already had an insanely good branch predictor so I suspect for Zen 6 that they may increase some structure sizes in the BPU and possibly make the return stack lower latency...

The branch predictor has dramatically more L2 BTB overrides than Zen 4 in Huang's testing, which also exacerbates the frontend latency issue. Seems extremely weird.
> Why do you think Zen 5's L1i MPKI is so much higher than Zen 4's despite being the same capacity?

Simple, you are having to feed a much larger frontend... you expect more L1i misses assuming the same structure size...
> The branch predictor has dramatically more L2 BTB overrides than Zen 4 in Huang's testing, which also exacerbates the frontend latency issue. Seems extremely weird.

That is very dependent on the workload; for example, compiling the Linux kernel, the L2 BTB overrides went down from about 12.86 MPKI to about 3 MPKI... whereas the L1 iTLB misses went up, which again isn't surprising considering that the L1 iTLB size didn't change from Zen 4 to Zen 5...
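For reference, MPKI here is just events per thousand retired instructions; a minimal sketch, with raw counts that are hypothetical and chosen only to reproduce the figures quoted above:

```python
# MPKI = events per 1,000 retired instructions.
# The raw counts below are hypothetical, picked only to reproduce the
# Linux-kernel-compile figures quoted above.
def mpki(events: int, instructions: int) -> float:
    return events / (instructions / 1_000)

insts = 1_000_000_000                # 1e9 retired instructions (hypothetical)
print(mpki(12_860_000, insts))       # 12.86 -> Zen 4 L2 BTB overrides
print(mpki(3_000_000, insts))        # 3.0   -> Zen 5 L2 BTB overrides
```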
> Most workloads don't need 1:1 read to write... but I can think of one, Y-Cruncher, that does... so a certain person will be happy if Zen 6 does 1:1 read to write instead of 2:1 read to write...

Are you sure y-cruncher behaves like that?
> FWIW I personally think they'll stick to 32B in both directions on client and leave the L3 to fend for itself. Would be a very AMD thing to do.

It's also a server CCD, with the strict requirement of making 8ch DDR5-8000 work with 8 CCDs.
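A back-of-envelope for why that server requirement pushes the link wider; the 2GHz FCLK and the even spreading of traffic are my own assumptions layered on the post's figures:

```python
# My own back-of-envelope on the post's 8ch DDR5-8000, 8-CCD figures.
channels, mts, bytes_per_transfer = 8, 8000e6, 8
socket_gbs = channels * mts * bytes_per_transfer / 1e9   # 512 GB/s of DRAM bandwidth
per_ccd_gbs = socket_gbs / 8                             # 64 GB/s if spread evenly
old_link_gbs = 32 * 2.0e9 / 1e9                          # 32 B/cycle @ assumed 2 GHz FCLK
print(socket_gbs, per_ccd_gbs, old_link_gbs)             # 512.0 64.0 64.0
# -> a 32B/cycle read path would already be saturated just streaming DRAM,
#    which is one argument for the wider 64B read path discussed upthread.
```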
well I'll be, unified int scheduler hasn't even lived for that long
> well I'll be, unified int scheduler hasn't even lived for that long

Is this a good or a bad thing? What does it mean for performance, do you think?
> Is this a good or a bad thing?

I can't tell a single thing without looking at the rest of the core.
