Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Page 15 - AnandTech Forums

Timorous

Golden Member
Oct 27, 2008
1,627
2,798
136
That's the claim AMD themselves made. Probably talking about core + L2.

Again, where are all of these numbers you keep throwing out coming from? No, by all current indications, Zen 4 in Phoenix is basically identical to Zen 4 elsewhere.

The APU is 25B xtors in 176mm², which is about 140M xtors per mm² despite all the IO in a monolithic design. Compared to the 94M xtors per mm² of the normal Zen 4 CCD, that represents a 49% increase in density.

So ~11B transistors at 140M xtors per mm² is 79mm² for a 16c 32MB L3 CCD.
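A quick sanity check of that arithmetic, as a rough sketch using only the approximate figures quoted above (all inputs are estimates, not exact die data):

```python
# Back-of-envelope check of the density figures quoted in this post.
# All inputs are approximate numbers from the discussion, not exact data.

apu_xtors = 25e9           # Phoenix APU: ~25B transistors
apu_area_mm2 = 176.0       # Phoenix die size, mm²
ccd_density = 94e6         # desktop Zen 4 CCD, transistors per mm²

apu_density = apu_xtors / apu_area_mm2        # transistors per mm²
gain = apu_density / ccd_density - 1          # density advantage vs the CCD

# Hypothetical 16c Zen 4c CCD: ~11B transistors at APU-like density.
zen4c_xtors = 11e9
zen4c_area = zen4c_xtors / 140e6              # mm² at 140M xtors/mm²

print(f"APU density: {apu_density / 1e6:.0f}M xtors/mm²")   # ~142M
print(f"Density gain vs CCD: {gain:.0%}")                   # ~51% (49% if you round to 140M first)
print(f"16c CCD estimate: {zen4c_area:.0f} mm²")            # ~79 mm²
```

The unrounded numbers land slightly above the 140M / 49% quoted above, but the ~79mm² CCD estimate is unchanged.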

None of that is exact, of course. Zen 4c may be denser still due to having less IO than the APU, and it may have slightly different transistor counts to standard Zen 4, so the die size will vary. But what 79mm² does tell you is that it is small enough to make 8 CCDs work on Bergamo, which seems to have 2 IO dies (probably 2 of the ones used for Siena, which would be typical AMD).

My question remains: what is the business case for AMD to design yet another 4N Zen 4 CCX at >140M xtors/mm² using a different library? Why do that work twice only to not fit more cores in?
 

nicalandia

Diamond Member
Jan 10, 2019
3,330
5,281
136
My question remains: what is the business case for AMD to design yet another 4N Zen 4 CCX at >140M xtors/mm² using a different library? Why do that work twice only to not fit more cores in?
A single Bergamo CPU is worth $15,000 (selling for $22,000 right now in China), so as you can see they will be exceedingly profitable, enough to warrant a different design.
 

Timorous

Golden Member
Oct 27, 2008
1,627
2,798
136
A single Bergamo CPU is worth $15,000 (selling for $22,000 right now in China), so as you can see they will be exceedingly profitable, enough to warrant a different design.

Sure, and a CCD designed around 2 Phoenix CCXs will do the job nicely. Why spend tens of millions designing yet another small variation of that design? It is not a semi-custom gig where AMD gets paid to do this.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
The APU is 25B xtors in 176mm², which is about 140M xtors per mm² despite all the IO in a monolithic design. Compared to the 94M xtors per mm² of the normal Zen 4 CCD, that represents a 49% increase in density.
You're comparing apples and oranges. The density of a high speed, cache-heavy compute die will be vastly different from an SoC with a GPU and everything else included. I'm sure if you actually look at the Zen 4 cluster in isolation, the density will be very similar to the desktop Zen 4.
 

Timorous

Golden Member
Oct 27, 2008
1,627
2,798
136
You're comparing apples and oranges. The density of a high speed, cache-heavy compute die will be vastly different from an SoC with a GPU and everything else included. I'm sure if you actually look at the Zen 4 cluster in isolation, the density will be very similar to the desktop Zen 4.

That was not the case for Renoir or Cezanne. Further, Navi 31 is not 140M xtors per mm², although getting the density of the 5nm portion of the die alone is tricky, because I don't think we have a transistor count for the GCD on its own, just the whole thing.

Anyway, we will find out when we get more details, but I don't see AMD doing the same thing twice for no benefit (like, say, squeezing 3 CCXs into a CCD).
 

Timorous

Golden Member
Oct 27, 2008
1,627
2,798
136
Yes, it was.

I just went and checked and you are right. Locuza did some analysis and the difference between the Cezanne die size and Vermeer die size was purely the L3 cache and lack of TSVs. The core itself was exactly the same.

So given that, my entire argument is nonsense, since AMD already copy/pasted the existing CCX into the APU. So yes, AMD will need to design a higher-density version for Zen 4c, and that will come with clock speed tradeoffs.

I still find it curious how AMD manages such high transistor density for the APUs when the CCXs and the compute units appear to be the same density as their desktop counterparts and there is a lot of IO. I guess the power gating circuitry can be really dense.
 

uzzi38

Platinum Member
Oct 16, 2019
2,639
6,000
146
That was not the case for Renoir or Cezanne. Further, Navi 31 is not 140M xtors per mm², although getting the density of the 5nm portion of the die alone is tricky, because I don't think we have a transistor count for the GCD on its own, just the whole thing.

Anyway, we will find out when we get more details, but I don't see AMD doing the same thing twice for no benefit (like, say, squeezing 3 CCXs into a CCD).
Just compared from a die shot, the Zen 3 core + L2 cache on Cezanne is the same size as the desktop one.
 
  • Like
Reactions: Vattila

moinmoin

Diamond Member
Jun 1, 2017
4,956
7,676
136
I was also very surprised to see that the core sizes were the same. Some kind of changes must have been made to produce the significant difference in V/F curves.
 

eek2121

Platinum Member
Aug 2, 2005
2,930
4,026
136
I just went and checked and you are right. Locuza did some analysis and the difference between the Cezanne die size and Vermeer die size was purely the L3 cache and lack of TSVs. The core itself was exactly the same.

So given that, my entire argument is nonsense, since AMD already copy/pasted the existing CCX into the APU. So yes, AMD will need to design a higher-density version for Zen 4c, and that will come with clock speed tradeoffs.

I still find it curious how AMD manages such high transistor density for the APUs when the CCXs and the compute units appear to be the same density as their desktop counterparts and there is a lot of IO. I guess the power gating circuitry can be really dense.

There may be a few things at play here.
  1. N5 actually provided a decent SRAM shrink.
  2. Analog portions of a chip also see a shrink.
  3. AMD used a custom-tailored process for desktop and EPYC that allows them to scale clocks.
  4. TSMC has various “scaling boosters” available on N5, which were not used on desktop due to #1.
As I have stated in another thread, Mobile has a similar density to the Apple M2. Bergamo will likely have even higher density. I will be surprised if it does not.

EDIT:

Another thing I wanted to add (from the business side of things) is that with Zen 2 and Zen 3, AMD wanted to clock their mobile chips as high as possible (read: competitive with Intel) while being power efficient, which is why they went with a similar design to desktop. With Zen 4, AMD very likely added a ton of dark silicon and otherwise sacrificed density in order to hit insanely high clocks. They did this in order to compete with Intel. On mobile, they have no need to hit such high clock targets; instead, their objective would be density and power savings. After all, a mobile chip doesn't need to hit 5.75 GHz or higher. By removing dark silicon areas, using more scaling boosters, and increasing density, they can bring the die size down so the chip costs less to make (and they can make more of them) without sacrificing all that much in the way of clocks.

I expect mobile Zen 4 to be extremely efficient from a perf/watt standpoint. Perf/$? Not so much.
 
Last edited:

Khanan

Senior member
Aug 27, 2017
203
91
111
RDNA3 with Navi31 has extreme density, as mentioned by AMD in the slides. It could, however, be marketing wash.
 
Last edited:

Kaluan

Senior member
Jan 4, 2022
500
1,071
96
RDNA3 with Navi31 has extreme density, as mentioned by AMD in the slides. It could, however, be marketing wash.
IDK, I believe that part is true.
Even N33 manages to be almost 40% more dense than N23, of which less than half can at best be explained by N7 -> N6.
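A rough check of that claim, using approximate public figures for the two dies (Navi 23: ~11.06B transistors in 237mm² on N7; Navi 33: ~13.3B in 204mm² on N6; TSMC quotes roughly 18% higher logic density for N6 over N7), so treat the exact percentages loosely:

```python
# Rough check of the N23 -> N33 density jump. Transistor counts and die
# sizes are approximate public figures and may be off slightly.

n23_density = 11.06e9 / 237   # Navi 23: ~11.06B xtors, 237 mm² (N7)
n33_density = 13.3e9 / 204    # Navi 33: ~13.3B xtors, 204 mm² (N6)

gain = n33_density / n23_density - 1
print(f"N33 density gain over N23: {gain:.0%}")   # ~40%

# TSMC quotes roughly 18% higher logic density for N6 vs N7, so less
# than half of the observed gain is attributable to the node alone.
node_share = 0.18 / gain
print(f"Share attributable to N7 -> N6: {node_share:.0%}")   # ~45%
```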
 
  • Like
Reactions: Khanan

DisEnchantment

Golden Member
Mar 3, 2017
1,609
5,817
136
The MI300 infographic disclosure did raise even more questions regarding the tech in upcoming AMD products. It seems to me the CPUs are sharing the Infinity Cache with the GPU, so a possible SLC could trickle down to consumer parts. The CPU L3 likewise is not on the top die, which could lend some credibility to the rumor of CPU cores stacked on top of cache, or at least on top of an SLC. The GCDs appearing as one single GPU to the SW would also mean they have sorted out the multiple-GCD concept, but that's for another thread.

However, for me the interconnects are the most interesting part of MI300, along with how they would carry over to Zen 5. Lots of interesting design choices would have to be made for Zen 5, which is fairly exciting for the folks working on such things, considering the tape-out would be within a quarter or so if not done already.

In case there really is stacking throughout the Zen 5 DT/server product lines:
  • Infinity fabric links via the package would be gone if all dies are stacked on the base die. This would be a necessity if the supposed 64 Gbps links (from LinkedIn) are to be implemented. Current-gen Ryzen is bottlenecked by the IF clock to some extent.
  • More cores per CCX (alluded to by Mike Clark) sharing an L3 would need some stacking; that would make routing each core to the L3 possible with uniform distance, avoid bloating the core area too much, and leave the RDL on the base-die side instead of the core die. This could in fact shave off a non-trivial amount of traces if the distance from the core to the L3 is cut.
  • The base die can host the IO and other PCIe blocks. They might even get away with N6 for the base die on the desktop.
One drawback is that they would need to limit Tjmax to something like 85°C to keep the hybrid bond from degrading, so they might need a big IPC jump to recover the lost ST perf (for instance, the 7800X3D tops out at 5 GHz). But the thought of having the big L3 chunks off the leading-edge nodes would make the bean counters extremely happy.

If the CCDs are not stacked, they would need something like the interconnects on RDNA3 to replace the current Z4 GMI. Then they have FinFlex to play with.
 

JustViewing

Member
Aug 17, 2022
136
233
76
The MI300 infographic disclosure did raise even more questions regarding the tech in upcoming AMD products. It seems to me the CPUs are sharing the Infinity Cache with the GPU, so a possible SLC could trickle down to consumer parts. The CPU L3 likewise is not on the top die, which could lend some credibility to the rumor of CPU cores stacked on top of cache, or at least on top of an SLC. The GCDs appearing as one single GPU to the SW would also mean they have sorted out the multiple-GCD concept, but that's for another thread.

However, for me the interconnects are the most interesting part of MI300, along with how they would carry over to Zen 5. Lots of interesting design choices would have to be made for Zen 5, which is fairly exciting for the folks working on such things, considering the tape-out would be within a quarter or so if not done already.

In case there really is stacking throughout the Zen 5 DT/server product lines:
  • Infinity fabric links via the package would be gone if all dies are stacked on the base die. This would be a necessity if the supposed 64 Gbps links (from LinkedIn) are to be implemented. Current-gen Ryzen is bottlenecked by the IF clock to some extent.
  • More cores per CCX (alluded to by Mike Clark) sharing an L3 would need some stacking; that would make routing each core to the L3 possible with uniform distance, avoid bloating the core area too much, and leave the RDL on the base-die side instead of the core die. This could in fact shave off a non-trivial amount of traces if the distance from the core to the L3 is cut.
  • The base die can host the IO and other PCIe blocks. They might even get away with N6 for the base die on the desktop.
One drawback is that they would need to limit Tjmax to something like 85°C to keep the hybrid bond from degrading, so they might need a big IPC jump to recover the lost ST perf (for instance, the 7800X3D tops out at 5 GHz). But the thought of having the big L3 chunks off the leading-edge nodes would make the bean counters extremely happy.

If the CCDs are not stacked, they would need something like the interconnects on RDNA3 to replace the current Z4 GMI. Then they have FinFlex to play with.

All well and good, but the big question is: would it bring down the cost of consumer CPUs? AMD may separate the desktop and server CPU package designs.
 
  • Like
Reactions: moinmoin

moinmoin

Diamond Member
Jun 1, 2017
4,956
7,676
136
the big question is: would it bring down the cost of consumer CPUs? AMD may separate the desktop and server CPU package designs.
I agree. I continue to expect AMD to use more complex packages only for higher performance high margin products. The FAD reference to "AMD chiplet technology" for Phoenix appears to have turned out to be a false flag. At this rate Intel may well be first to chiplet based lower performance mobile parts with Meteor Lake. Will be very interesting to see if that is a competitive cost advantage or disadvantage.
 

Exist50

Platinum Member
Aug 18, 2016
2,445
3,043
136
The MI300 infographic disclosure did raise even more questions regarding the techs for the upcoming AMD products. It seems to me the CPUs are sharing the IC with the GPU so a possible SLC could trickle down to consumer parts. The CPU L3 as well is not on the top die, this could lend some credibility to the rumor of CPU cores stacked on top of cache or at least on top of an SLC.
I might have missed info, but have they shown that the L3 is moved to the base die? Actually, what (if anything) is confirmed for it?
More cores per CCX (alluded to by Mike Clark) sharing an L3 would need some stacking, that would make routing each core to the L3 possible with uniform distance and not bloat the core area too much and leave the RDL on the base die side instead of core die.
They moved to a ring bus with the move to an 8c CCX, so I don't think stacking is a necessity for them to scale per-CCX core count. Nor do I think 2 layer hybrid bonding alone would be enough to solve that routing issue, so they'll probably keep the ring.
One drawback is that they would need to limit the Tjmax to something like 85 C to keep the hybrid bond from degrading they might need a big IPC jump to recover the lost ST perf ( for instance 7800X3D tops out at 5 GHz )
Putting the cache on the bottom will probably help a lot. No longer need to go through the top die for cooling. And maybe future gens of hybrid bonding can get the thermal resilience up to where it's a non-issue, if that's even the limiting factor today. Might just be a lack of telemetry on the cache die driving a more conservative approach for now.
 

TESKATLIPOKA

Platinum Member
May 1, 2020
2,364
2,854
106
This is a continuation of what @Glo. posted here about Strix Point possibly having 24 CU.
I don't really believe in such a big IGP, because I don't think there would be a big enough market for it, but let's imagine it's real.
I will use LLC in place of L3 and IC, and it will have its own chiplet.

There could be 10 different chiplets and 5 chiplets per APU:
1.) Zen5 chiplet in 2 versions, 4core and 8core (3nm)
2.) Zen4c chiplet in 2 versions, 4core and 8core (3nm)
3.) IGP chiplet in 2 versions, 16CU and 24CU (3nm)
4.) Cache chiplet in 3 versions, 36MB, 64MB and 96MB (6nm)
5.) A single IO chiplet with 2x64-bit DDR5 (4x32-bit LPDDR5), video blocks, PHYs, etc. (5-6nm)

This is how the different models could look in real life, though these sizes are just for show.
Width would stay the same; only height would change depending on which chiplet you use.
[Mock-up images: 8C16T 16CU, 8C16T 24CU, 12C24T 24CU, 16C32T 16CU, 16C32T 24CU]

You could make a lot of combinations with different chiplets while not cutting them down much, the LLC chiplet excluded. The power limit and LLC were sized around the IGP's needs. I made a table with 15 models:
| Model | Zen 5 | Zen 4c | IGP | IGP gaming frequency | Last Level Cache | Power limit (gaming) |
|-------|-------|--------|-----|----------------------|------------------|----------------------|
| 6C16T | 3 cores; L2: 6MB | 3 cores; L2: 1.5MB | 12 CU; 48 TMU; 24 ROP | 2400 MHz | 21MB (CPU 9 + IGP 12) | 30W (CPU 15 + IGP 15) |
| 6C16T | 3 cores; L2: 6MB | 3 cores; L2: 1.5MB | 16 CU; 64 TMU; 32 ROP | 2400 MHz | 33MB (CPU 9 + IGP 24) | 35W (CPU 15 + IGP 20) |
| 8C16T | 4 cores; L2: 4MB | 4 cores; L2: 2MB | 16 CU; 64 TMU; 32 ROP | 2400 MHz | 36MB (CPU 12 + IGP 24) | 35W (CPU 15 + IGP 20) |
| 8C16T | 4 cores; L2: 4MB | 4 cores; L2: 2MB | 20 CU; 80 TMU; 40 ROP | 2400 MHz | 48MB (CPU 12 + IGP 36) | 40W (CPU 15 + IGP 25) |
| 8C16T | 4 cores; L2: 4MB | 4 cores; L2: 2MB | 24 CU; 96 TMU; 48 ROP | 2400 MHz | 60MB (CPU 12 + IGP 48) | 45W (CPU 15 + IGP 30) |
| 10C20T | 4 cores; L2: 4MB | 6 cores; L2: 3MB | 16 CU; 64 TMU; 32 ROP | 2800 MHz | 46MB (CPU 14 + IGP 32) | 50W (CPU 20 + IGP 30) |
| 10C20T | 4 cores; L2: 4MB | 6 cores; L2: 3MB | 20 CU; 80 TMU; 40 ROP | 2800 MHz | 60MB (CPU 14 + IGP 46) | 57W (CPU 20 + IGP 37) |
| 12C24T | 4 cores; L2: 4MB | 8 cores; L2: 4MB | 16 CU; 64 TMU; 32 ROP | 2800 MHz | 48MB (CPU 16 + IGP 32) | 55W (CPU 25 + IGP 30) |
| 12C24T | 4 cores; L2: 4MB | 8 cores; L2: 4MB | 20 CU; 80 TMU; 40 ROP | 2800 MHz | 62MB (CPU 16 + IGP 46) | 62W (CPU 25 + IGP 37) |
| 12C24T | 4 cores; L2: 4MB | 8 cores; L2: 4MB | 24 CU; 96 TMU; 48 ROP | 2800 MHz | 76MB (CPU 16 + IGP 60) | 70W (CPU 25 + IGP 45) |
| 14C28T | 6 cores; L2: 6MB | 8 cores; L2: 4MB | 16 CU; 64 TMU; 32 ROP | 2800 MHz | 52MB (CPU 20 + IGP 32) | 60W (CPU 30 + IGP 30) |
| 14C28T | 6 cores; L2: 6MB | 8 cores; L2: 4MB | 20 CU; 80 TMU; 40 ROP | 2800 MHz | 66MB (CPU 20 + IGP 46) | 67W (CPU 30 + IGP 37) |
| 16C32T | 8 cores; L2: 8MB | 8 cores; L2: 4MB | 16 CU; 64 TMU; 32 ROP | 3200 MHz | 64MB (CPU 24 + IGP 40) | 75W (CPU 35 + IGP 40) |
| 16C32T | 8 cores; L2: 8MB | 8 cores; L2: 4MB | 20 CU; 80 TMU; 40 ROP | 3200 MHz | 80MB (CPU 24 + IGP 56) | 85W (CPU 35 + IGP 50) |
| 16C32T | 8 cores; L2: 8MB | 8 cores; L2: 4MB | 24 CU; 96 TMU; 48 ROP | 3200 MHz | 96MB (CPU 24 + IGP 72) | 95W (CPU 35 + IGP 60) |

Edit: the IO chiplet could have a 2 CU IGP for models without an IGP (GPU) chiplet.
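The mix-and-match idea above can be sketched as a toy model; all chiplet options and numbers are taken from the hypothetical table in this post, not from any real product:

```python
from dataclasses import dataclass

# Toy model of the hypothetical mix-and-match APU described above.
# Every field mirrors a column of the speculative table, nothing more.

@dataclass
class APUConfig:
    zen5_cores: int
    zen4c_cores: int
    igp_cus: int
    cpu_llc_mb: int     # LLC on the CPU side
    igp_llc_mb: int     # LLC on the IGP side
    cpu_watts: int
    igp_watts: int

    @property
    def threads(self) -> int:
        # Both core types assumed SMT-capable (2 threads per core).
        return 2 * (self.zen5_cores + self.zen4c_cores)

    @property
    def total_llc_mb(self) -> int:
        return self.cpu_llc_mb + self.igp_llc_mb

    @property
    def power_limit_w(self) -> int:
        return self.cpu_watts + self.igp_watts

# The top 16C32T / 24 CU model from the table:
top = APUConfig(zen5_cores=8, zen4c_cores=8, igp_cus=24,
                cpu_llc_mb=24, igp_llc_mb=72, cpu_watts=35, igp_watts=60)
print(f"{top.zen5_cores + top.zen4c_cores}C{top.threads}T, "
      f"{top.igp_cus}CU, LLC {top.total_llc_mb}MB, {top.power_limit_w}W")
# -> 16C32T, 24CU, LLC 96MB, 95W
```

Swapping in the smaller chiplet variants just means constructing a different `APUConfig`; the derived totals fall out of the same three properties.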
 
Last edited:

Mopetar

Diamond Member
Jan 31, 2011
7,850
6,015
136
That's essentially where I'd like to see AMD go in the future and really is the ultimate end-game of a chiplet based approach.

There's probably always going to be some market for a monolithic design, but I could see that being relegated to niche markets over time.

I suspect that we potentially get some dual-GPU designs as well. As long as the physical size doesn't interfere, there's no reason not to offer a 36/48 CU option for a gaming APU that provides pretty good performance without the need to add in a discrete card.

I could also imagine AMD doing something like Intel and designing a smaller core built around providing more throughput for CPU compute. The Zen core is already considerably smaller than Intel's performance core, so there's not as much pressure for them to do this. You could also say that the more densely packed Zen 4c is already them doing this, but I wonder how much further they could take it.
 
  • Like
Reactions: TESKATLIPOKA