Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
Interesting thing today with the RDNA 3 launch is that N31 uses EFB as the interconnect. And Angstronomics (who happens to get things right on N31) also says N32 will use chiplets as well.
N32 will likely be priced in the range of ~600 USD and will contain 4 interconnects.
N32 will be high volume parts, so would this mean AMD has sorted out the economics of EFB? EFB is done at AMD's packaging facilities.

Interesting thing was that the guy presenting the chiplet tech in N31 is Sam Naffziger, who was also a key person for the Zen chiplets.

If they were to use EFB for CPUs, then for two CCDs they would need only two EFB interconnects to connect the IOD to the CCDs.
I am just wondering if this is the tech they would use to replace the SerDes links.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
Interesting thing today with the RDNA 3 launch is that N31 uses EFB as the interconnect. And Angstronomics (who happens to get things right on N31) also says N32 will use chiplets as well.
N32 will likely be priced in the range of ~600 USD and will contain 4 interconnects.
N32 will be high volume parts, so would this mean AMD has sorted out the economics of EFB? EFB is done at AMD's packaging facilities.

Interesting thing was that the guy presenting the chiplet tech in N31 is Sam Naffziger, who was also a key person for the Zen chiplets.

If they were to use EFB for CPUs, then for two CCDs they would need only two EFB interconnects to connect the IOD to the CCDs.
I am just wondering if this is the tech they would use to replace the SerDes links.
IMHO there are four main aspects:
  • Bandwidth - The IFoP only has around 1/10th of the bandwidth that N3x has between each MCD and the GCD (900 GB/s); see the rough sketch below. Something like InFO-R should be enough in this regard.
  • Reticle limit - EFB gives you total freedom. But even InFO-R should provide multiple times the reticle limit through reticle stitching.
  • Energy consumption - As I understand it, EFB should be much better compared to InFO-R - but maybe I am wrong, as there are not a lot of figures available.
  • Costs - EFB should have come down in cost but might still be much more expensive than InFO-R.
So it pretty much boils down to priorities and to the question of whether InFO-R would give them enough total area for stitching all those Zen5 EPYC CCDs together.
Might as well be that they deem IFoP sufficient for yet another generation.
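To put that bandwidth gap in rough numbers, here is a minimal sketch; the IFoP width and clock (32 B read + 16 B write per fabric clock at ~2 GHz) are assumptions taken from figures discussed later in this thread, not official specs:

```python
# Rough ratio between one IFoP/GMI link and one MCD<->GCD link on N31.
# Assumptions (not official specs): IFoP moves 32 B read + 16 B write per
# fabric clock at ~2 GHz; the 900 GB/s figure is per MCD<->GCD link.

fclk_ghz = 2.0
ifop_gb_s = (32 + 16) * fclk_ghz      # ~96 GB/s, both directions combined
mcd_link_gb_s = 900                   # per MCD<->GCD link on N31

print(f"IFoP link : ~{ifop_gb_s:.0f} GB/s")
print(f"MCD link  : ~{mcd_link_gb_s} GB/s")
print(f"Ratio     : ~1/{mcd_link_gb_s / ifop_gb_s:.0f}")  # close to the ~1/10 quoted above
```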
 
Last edited:

gdansk

Diamond Member
Feb 8, 2011
4,568
7,682
136
Then what's wrong with calling AMD's hybrid approach big little?
In the original context it seems he was correcting the rumor, i.e. saying that it was NOT big little. The nomenclature isn't important, but AMD calling it heterogeneous in their own documentation more or less confirms at least 2 different core types. The old rumors from Chinese forums saying it isn't should probably be treated as less reputable.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,667
2,532
136
N32 will be high volume parts ...
I am just wondering if this is the tech they would use to replace the SerDes links.

Entirely possible. But I'd note that high volume for an upper mid-range GPU is very far from high volume for a CPU. It's quite possible that they are spinning up the plants and using them for the GPUs this gen, and once they have enough experience to trust them and capacity catches up with demand, they plan to use them for the CPUs.
 

yuri69

Senior member
Jul 16, 2013
677
1,215
136
Interesting thing was that the guy presenting the chiplet tech in N31 is Sam Naffziger, who was also a key person for the Zen chiplets.
Mr. Naffziger seems to be the person driving AMD's advanced power-related techniques. His effort was aimed mainly at getting Bulldozer's successors power-efficient, then adapting the power saving tech for both GCN and APUs (e.g. Bristol Ridge), followed by Zen and RDNA.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
IMHO there are four main aspects:
  • Bandwidth - The IFoP only has around 1/10th of the bandwidth that N3x has between each MCD and the GCD (900 GB/s). Something like InFO-R should be enough in this regard.
  • Reticle limit - EFB gives you total freedom. But even InFO-R should provide multiple times the reticle limit through reticle stitching.
  • Energy consumption - As I understand it, EFB should be much better compared to InFO-R - but maybe I am wrong, as there are not a lot of figures available.
  • Costs - EFB should have come down in cost but might still be much more expensive than InFO-R.
So it pretty much boils down to priorities and to the question of whether InFO-R would give them enough total area for stitching all those Zen5 EPYC CCDs together.
Might as well be that they deem IFoP sufficient for yet another generation.
Well, it seems AnandTech retracted their article about N31 using an EFB interconnect. So quite likely Angstronomics is right again about N31 using InFO-oS.
Your hypothesis could be right regarding the usage of InFO-R, but with the addition of the local interconnect (InFO-LSI is basically InFO-R + LSI). Today I saw an AMD patent using InFO-LSI, or at least some flavor of it.

11469183 : Multirow semiconductor chip connections
A method of manufacturing a semiconductor device includes mounting an interconnect chip to a redistribution layer structure and mounting a first, second, and third semiconductor chip to the redistribution layer structure, where the second semiconductor chip is interposed between the first and the third semiconductor chips, and the interconnect chip communicatively couples the first, second, and third semiconductor chips to one another.

1667860595863.png
1667862834862.png

So it seems to be similar tech to what AMD would be deploying on N31 and what is described in this patent, save for the local bridge.
It actually makes more sense than EFB, but what is quite strange is that in every earnings call AMD says they are expanding on packaging, and I am wondering if they are building all of these in house instead of at TSMC.
In some embodiments, the first semiconductor chip is a core complex die, the second semiconductor chip is a core complex die, and the third semiconductor chip is an input/output die.
Also, from LinkedIn we can surmise GMI4 runs at 64 Gbps on N3 nodes, which is only possible if there is an interconnect chip with repeaters instead of high-energy medium-range PHYs. Otherwise they will burn even more power than they currently do with GMI3.
And they do these connections in multiple rows (as the patent title says) to form a big, typical chiplet-based EPYC CPU. Obviously a chunk of it can be taken out to form a client CPU.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
@DisEnchantment
So at least in that patent they seem to tunnel under one CCD with the bridge in order to connect another one. I would not have thought that this makes sense - with regard to power supply, general routing of IO, etc.
But they also mention TSVs for the bridge chip.
Maybe they want to increase bandwidth massively in order to make the L3 of neighbouring CCDs accessible to each other?
 
  • Like
  • Wow
Reactions: Kaluan and Joe NYC

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
But they also mention TSVs for the bridge chip.
In the patent, they mention TSVs for supplying power and for some contacts, in case the CCD needs to reach the substrate where it is blocked by the interconnect chip.
But they could also just relocate the contact elsewhere because the RDL is there; that is why the interconnect chip is quite a thin strip. They have some flexibility there.

Maybe they want to increase bandwidth massively in order to make the L3 of neighboring CCDs accessible to each other?
In the patent, the main idea is to overcome the limited reach of short-range interconnects (obviously not using medium-range PHYs like GMI2/3/4).
This is the biggest drawback to overcome in my eyes: replacing the medium-range PHYs with low-power ones that only reach a few mm. LSI is basically limited to a few mm, and where needed a repeater is added in the bridge.

Otherwise there is no way they can hit the 64 Gbps GMI4 target without blowing up power consumption.
For reference, GMI3's reach is up to several tens of mm, and XGMI3 can reach hundreds of mm for inter-socket communication.

They have additional benefits for BW though as described in this concept.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
@DisEnchantment
Generally I am with you. But to my knowledge IFoP is already at 64 GB/s - at least in one direction (32 bytes up / 16 bytes down * 2 GHz).
And at the moment no one can confirm whether this is the "narrow mode" with only one port, or whether it applies to both of the ports each CCD has.
Also, after having done the math again:
Even at 2 pJ/bit (a pessimistic estimate for IFoP), consumption is only around 1.5 W per CCD connection at full load. To be honest, I had miscalculated at first and came up with 15 W. 1.5 W might still seem too much for mobile, but for desktop/server it's not that big of a deal.
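As a quick back-of-the-envelope check of that figure, a minimal sketch assuming the link widths quoted above (32 B read + 16 B write per cycle at 2 GHz FCLK) and the 2 pJ/bit estimate:

```python
# Back-of-the-envelope check of the per-CCD link power quoted above.
# Assumptions (from the post, not official figures): 32 B/cycle read,
# 16 B/cycle write, FCLK = 2 GHz, and a pessimistic 2 pJ/bit over the substrate.

FCLK_HZ = 2e9              # assumed fabric clock
READ_BYTES_PER_CLK = 32    # assumed link width, read direction
WRITE_BYTES_PER_CLK = 16   # assumed link width, write direction
PJ_PER_BIT = 2.0           # pessimistic IFoP estimate from the post

bw_read = READ_BYTES_PER_CLK * FCLK_HZ        # 64 GB/s
bw_write = WRITE_BYTES_PER_CLK * FCLK_HZ      # 32 GB/s
bits_per_s = (bw_read + bw_write) * 8         # both directions combined

power_w = bits_per_s * PJ_PER_BIT * 1e-12
print(f"Read  : {bw_read / 1e9:.0f} GB/s")
print(f"Write : {bw_write / 1e9:.0f} GB/s")
print(f"Power at full load: {power_w:.2f} W")  # ~1.5 W
```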
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
But to my knowledge IFoP is already at 64 GB/s
GMI3 is 32 GT/s/lane. We are talking about different things here, I believe. BW is not as critical for CPUs as latency is. The higher the transfers per second per lane, the lower the latency.
I don't have the GMI3 slides though, below is GMI2
1667903235060.png

GMI2 --> Max 25 GT/s/lane - as configured in the product: 14.6 GT/s @ 1466 FCLK / 2933 MCLK
1667905543727.png

Excerpt from LinkedIn
1667903363812.png
BW is of course decided by the number of lanes * GT/s/lane

GMI3 --> Max 32 GT/s/lane - as configured in the product: unknown.
  • I think they added more lanes and also slightly higher frequency. AFAIK you can run FCLK all the way to 2000 MHz in 1:1.

GMI4 is 64 GT/s/lane
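
To illustrate the "number of lanes * GT/s per lane" relation with these per-lane rates, here is a small sketch; the lane counts used are purely hypothetical, since AMD has not published the GMI3/GMI4 link widths:

```python
# Illustration of "BW = number of lanes * GT/s per lane" using the per-lane
# rates quoted above. The lane counts are purely hypothetical examples --
# AMD has not published the GMI3/GMI4 link widths.

def link_bw_gb_s(lanes: int, gt_per_s: float, bits_per_transfer: int = 1) -> float:
    """Raw one-direction bandwidth in GB/s for a parallel link."""
    return lanes * gt_per_s * bits_per_transfer / 8

for name, gt in [("GMI2 (max)", 25), ("GMI3 (max)", 32), ("GMI4 (rumored)", 64)]:
    for lanes in (16, 32):  # assumed widths, illustration only
        print(f"{name:15s} {lanes:2d} lanes -> {link_bw_gb_s(lanes, gt):6.1f} GB/s")
```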

What could change with GMI4 is replacing the line driver shown below (the triangle) with simple traces and repeaters where needed (but routed through RDL+LSI instead of longer substrate traces).

1667903476658.png

Update: now found the slides for the lanes
Changed Gbps to GT/s to be more precise
 
Last edited:

leoneazzurro

Golden Member
Jul 26, 2016
1,114
1,867
136

This is interesting. Why do I post here instead of in the RDNA3 thread? Because this may be a hint about the interconnect speeds that could be achievable on advanced packaging, and these could then be applicable to Zen5, too. We have seen that IF links are already becoming a limit on Zen4, so improving them in Zen5 could remove a performance limitation.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136

This is interesting. Why do I post here instead of in the RDNA3 thread? Because this may be a hint about the interconnect speeds that could be achievable on advanced packaging, and these could then be applicable to Zen5, too. We have seen that IF links are already becoming a limit on Zen4, so improving them in Zen5 could remove a performance limitation.
Looks like plain InFO-R or AMD's equivalent of this tech. At best 4 copper layers, but definitely miles better than driving something through the substrate.

1668426137012.png

The biggest challenge is managing thermal expansion on CPUs, which degrades the chip's fan-out structure over thermal cycles.
But it looks simple enough, and the density is quite low (compared to even 65nm), so AMD could just fab this in house.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
Unrelated to AMD, but GLink provides an inter-die link similar to AMD's IFOP/GMI at 0.3 pJ/bit on InFO-R(_oS).

The GLink-2.5D IP utilizes single-ended signaling on parallel bus with DDR clock forwarding. This allows for up to 8/16Gbps per pin consuming only 0.25pJ/bit on TSMC’s RDL-based InFO (Integrated-Fan-Out) or CoWoS (Chip-on-Wafer-on-Substrate). One slice has 32 full-duplex lanes and one PHY has 8 slices with 2/4Tbps maximum bandwidth. For the next generation GLink, one slice will have 56 full-duplex lanes and one PHY has 8 slices with 7.5 Tbps maximum bandwidth.
1668430136578.png

AMD's IFOP seems more advanced than this scheme, at least based on the openly documented architecture (low-swing single-ended signaling).
If they migrate their link to InFO-R they should be able to match this, if not do better.
IFOP via substrate --> ~2 pJ/bit.
IFOP via RDL --> ~0.3 pJ/bit.
Up to ~7x reduction in pJ/bit.
BW is important, but latency is even more important in CPUs. How high they can clock it will be very critical.
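
A minimal sketch checking the GLink figures quoted above and the ~7x estimate; the pJ/bit values are the rounded estimates from this post, not measured data:

```python
# Sanity check of the GLink figures quoted above and of the ~7x pJ/bit estimate.

# GLink-2.5D: 32 full-duplex lanes per slice, 8 slices per PHY, 8 or 16 Gbps per pin.
lanes_per_slice, slices = 32, 8
for gbps_per_pin in (8, 16):
    tbps = lanes_per_slice * slices * gbps_per_pin / 1000
    print(f"{gbps_per_pin:2d} Gbps/pin -> {tbps:.1f} Tbps per PHY")  # 2.0 / 4.1 Tbps

# Energy comparison from the post (rounded estimates, not measured data).
substrate_pj_per_bit, rdl_pj_per_bit = 2.0, 0.3
print(f"Substrate vs RDL energy: ~{substrate_pj_per_bit / rdl_pj_per_bit:.1f}x")  # ~6.7x, i.e. "up to ~7x"
```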
 

Henry swagger

Senior member
Feb 9, 2022
511
313
106

Tigerick

Senior member
Apr 1, 2022
846
799
106
Rgt saying zen 5 will have a unified l2 cache around the ccx and stacked l3.. will have zen 5 + zen 4 cores 🤔🤔
This is the leak I strongly believe applies to the Zen5 desktop CPU architecture. With the removal of L3, AMD can double up the Zen5 cores while maintaining a similar die size, which is important for the Turin server CPU. And by sharing all of the L2 cache, AMD can remedy the latency issue of an external L3 cache.
 
  • Like
Reactions: Henry swagger

Exist50

Platinum Member
Aug 18, 2016
2,452
3,106
136
Rgt saying zen 5 will have a unified l2 cache around the ccx and stacked l3.. will have zen 5 + zen 4 cores 🤔🤔
This whole thing sounds like complete nonsense. Doubling the cores? +30% IPC? Unified L2 cache? Unified stacked L3 for everything? Yeah, I'm calling BS.
 

poke01

Diamond Member
Mar 8, 2022
4,202
5,549
106
With the removal of L3, AMD can double up the Zen5 cores while maintaining a similar die size, which is important for the Turin server CPU. And by sharing all of the L2 cache, AMD can remedy the latency issue of an external L3 cache.
Man, Apple's ex-chief designer was way ahead. They already moved the L3 cache ages ago, and moved to 8-wide decode in 2017, I think?

If Gerard's cores, i.e. Nuvia, come to PC with native Windows/Linux, then Intel and AMD will have tough days. Qualcomm will be aiming for servers, laptops/mobile, and autos.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
Rgt saying zen 5 will have a unified l2 cache around the ccx and stacked l3.. will have zen 5 + zen 4 cores 🤔🤔
That RGT video was posted more than 7 months ago. Pretty sure it was already discussed back then...
Regarding the caches: Instead of restructuring all the mentioned caches, I would rather imagine them introducing an L4/LLC. But latency could be a problem. And the question is how big it needs to be in order to make an impact.
 
  • Like
Reactions: Kaluan and yuri69

maddie

Diamond Member
Jul 18, 2010
5,156
5,544
136
Unrelated to AMD, but GLink provides an inter-die link similar to AMD's IFOP/GMI at 0.3 pJ/bit on InFO-R(_oS).


View attachment 71129

AMD's IFOP seems more advanced than this scheme, at least based on the openly documented architecture (low-swing single-ended signaling).
If they migrate their link to InFO-R they should be able to match this, if not do better.
IFOP via substrate --> ~2 pJ/bit.
IFOP via RDL --> ~0.3 pJ/bit.
Up to ~7x reduction in pJ/bit.
BW is important, but latency is even more important in CPUs. How high they can clock it will be very critical.
The term Beachfront is actually used?
 
Jul 27, 2020
28,008
19,125
146
This whole thing sounds like complete nonsense. Doubling the cores? +30% IPC? Unified L2 cache? Unified stacked L3 for everything? Yeah, I'm calling BS.
Wouldn't a unified large L2 cache be the next evolution in cache performance? A big slab of cache in the middle and cores placed on all sides of it?