Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

DisEnchantment · Sep 29, 2022

Speculate at will

poke01 · Nov 2, 2022

DisEnchantment said:
AMD calls it heterogenous

The word heterogenous is not exclusive to AMD

yuri69 · Nov 2, 2022

poke01 said:
Which people can't buy en masse till July or August....

Hence the quotation marks in my original post.

DisEnchantment · Nov 3, 2022

poke01 said:
The word heterogenous is not exclusive to AMD

Neither is CPU, or Zen. In fact my favorite sports bar is named Zen.
Does not make it any more invalid than if AMD calls their CPU ... CPU

DisEnchantment · Nov 3, 2022

Interesting thing today with RDNA 3 launch is that N31 uses EFB as interconnect. And Angstronomics ( who happens to get things right on N31) also says N32 will use chiplets as well.
N32 will likely be priced in the range of ~600 USD and containing 4 interconnects.
N32 will be high volume parts so would this mean AMD has sorted out the economics of EFB? EFB is done at AMD's packaging facilities

Interesting thing was that the guy presenting the chiplet tech in N31 is Sam Naffziger who also was a key person for Zen chiplets.

If they were to use EFB for CPUs, for two CCDs, they would need only two EFB interconnects to connect the IOD to CCDs.
I am just wondering if this is the tech they would use to replace the SerDes links.

BorisTheBlade82 · Nov 3, 2022

DisEnchantment said:
Interesting thing today with RDNA 3 launch is that N31 uses EFB as interconnect. And Angstronomics ( who happens to get things right on N31) also says N32 will use chiplets as well.
N32 will likely be priced in the range of ~600 USD and containing 4 interconnects.
N32 will be high volume parts so would this mean AMD has sorted out the economics of EFB? EFB is done at AMD's packaging facilities

Interesting thing was that the guy presenting the chiplet tech in N31 is Sam Naffziger who also was a key person for Zen chiplets.

If they were to use EFB for CPUs, for two CCDs, they would need only two EFB interconnects to connect the IOD to CCDs.
I am just wondering if this is the tech they would use to replace the SerDes links.

IMHO there are four main aspects:

Bandwidth - The IFoP only has around 1/10th of the bandwidth of what N3x has between each MCD and the GCD (900Gbyte/s). Something like InFo-R should be enough in this regard.
Reticle Limit - EFB gives you total freedom. But even InFo-R should provide multiple times the Reticle Limit through reticle stitching.
Energy consumption - As I understand it EFB should be much better compared to InFo-R - but maybe I am wrong as there are not a lot of figures available.
Costs - EFB should have come down on costs but might still be much more expensive than InFo-R.

So it pretty much boils down to priorities and the question if InFo-R would give them enough total area for stitching all those Zen5 EPYC CCDs together.
Might as well be that they deem IFoP sufficient for yet another generation.

poke01 · Nov 4, 2022

DisEnchantment said:
Neither is CPU, or Zen. In fact my favorite sports bar is named Zen.
Does not make it any more invalid than if AMD calls their CPU ... CPU

Then whats wrong in calling AMD's hybrid approach big little?

gdansk · Nov 4, 2022

poke01 said:
Then whats wrong in calling AMD's hybrid approach big little?

In the original context it seems he was correcting the rumor that it was NOT big little. The nomenclature isn't important but AMD calling it heterogeneous in their own documentation more or less confirms at least 2 different core types. While some old rumors from Chinese forums say it isn't should probably be treated as less reputable.

Tuna-Fish · Nov 4, 2022

DisEnchantment said:
N32 will be high volume parts ...
I am just wondering if this is the tech they would use to replace the SerDes links.

Entirely possible. But I'd note that high volume for a upper mid range GPU is very far from high volume for a CPU. It's quite possible that they are spinning up the plants and using them for the GPUs this gen, and once they have enough experience to trust them and when capacity meets up with demand, plan use them for the CPUs.

yuri69 · Nov 5, 2022

DisEnchantment said:
Interesting thing was that the guy presenting the chiplet tech in N31 is Sam Naffziger who also was a key person for Zen chiplets.

Mr. Naffziger seems to be the person driving AMD's advanced power-related techniques. His effort was aimed mainly at getting Bulldozer's successors power-efficient, then adapting the power saving tech for both GCN and APUs (e.g. Bristol Ridge), followed by Zen and RDNA.

DisEnchantment · Nov 7, 2022

BorisTheBlade82 said:
IMHO there are four main aspects:

Bandwidth - The IFoP only has around 1/10th of the bandwidth of what N3x has between each MCD and the GCD (900Gbyte/s). Something like InFo-R should be enough in this regard.

Reticle Limit - EFB gives you total freedom. But even InFo-R should provide multiple times the Reticle Limit through reticle stitching.

Energy consumption - As I understand it EFB should be much better compared to InFo-R - but maybe I am wrong as there are not a lot of figures available.

Costs - EFB should have come down on costs but might still be much more expensive than InFo-R.

So it pretty much boils down to priorities and the question if InFo-R would give them enough total area for stitching all those Zen5 EPYC CCDs together.
Might as well be that they deem IFoP sufficient for yet another generation.

Well, seems AnandTech retracted their article on N31 using EFB interconnect. So quite likely Angstronomics is right again on N31 using InFO-oS.
Your hypothesis could be right regarding the usage of InFO-R, but with the addition of the local interconnect (InFO-L is basically InFO-R + LSI). Today I saw an AMD patent using InFO-LSI or at least some flavor of it.

11469183 : Multirow semiconductor chip connections
A method of manufacturing a semiconductor device includes mounting an interconnect chip to a redistribution layer structure and mounting a first, second, and third semiconductor chip to the redistribution layer structure, where the second semiconductor chip is interposed between the first and the third semiconductor chips, and the interconnect chip communicatively couples the first, second and third, semiconductor chips to one another.

https://www.freepatentsonline.com/11469183.html

So it seems similar tech to what AMD would be deploying on N31 and what is described in this patent, save for the local bridge.
Makes more sense than EFB actually, but what is quite strange is that in every earnings call AMD is saying they are expanding on packaging and I am wondering if they building all of these in house instead of at TSMC.

In some embodiments, the first semiconductor chip is a core complex die, the second semiconductor chip is a core complex die, and the third semiconductor chip is an input/output die.

Also from LinkedIn we can surmise GMI4 runs at 64 Gbps on N3 nodes which is only possible if there is an interconnect chip with repeaters instead of high energy medium range PHYs. Otherwise they will burn even more power than what they current do on GMI3.
And they do these connections in multi rows (as titled in the patent) to form a big typical chiplet based EPYC CPU. Obviously a chunk of it can be taken out to form a Client CPU.

BorisTheBlade82 · Nov 7, 2022

@DisEnchantment
So at least in that patent they seem to tunnel one CCD with the bridge in order to connect another one. I would not have thought that this makes sense - with regards to power supply, general routing of IO, etc.
But they also mention TSVs for the bridge chip.
Maybe they want to increase bandwidth massively in order to make the L3 of neighbouring CCDs accessible to each other?

DisEnchantment · Nov 8, 2022

BorisTheBlade82 said:
But they also mention TSVs for the bridge chip.

In the patent, they mentioned TSVs for allowing power and some contacts in case the CCD needs to reach the substrate, when it is blocked by interconnect chip.
But they could just relocate the contact elsewhere because the RDL is there, that is why the interconnect chip is quite a thin strip. Some flexibilities they have.

BorisTheBlade82 said:
Maybe they want to increase bandwidth massively in order to make the L3 of neighboring CCDs accessible to each other?

In the patent, the main idea is to overcome the reach of short range interconnects (obviously not using medium range PHYs like GMI2/3/4)
This is the biggest drawback to be overcome in my eyes, to replace the medium range PHYs with low power within few mm range. LSI basically few mms and when needed add repeater in the bridge.

Otherwise no way can hit 64 Gbps GMI4 target without blowing up power consumption.
For reference GMI3 range is up to several tens of mms, XGMI3 can reach hundreds of mms for inter socket comm.

They have additional benefits for BW though as described in this concept.

BorisTheBlade82 · Nov 8, 2022

@DisEnchantment
Generally I am with you. But to my knowledge IFoP already is at 64Gbyte/s - at least in one direction (32 byte up / 16 byte down * 2 GHz).
And at the moment no one could confirm if this is the "narrow mode" with only one port, or for both ports, each CCD has.
Also, after having done the math again:
Even at 2pJ/bit (pessimistic estimate for IFoP) consumption is only around 1,5w per CCD connection at full load. To be honest, I had miscalculated it first and came in at 15w. 1,5w might still seem too much for mobile, but for Desktop/Server it's not that big of a deal.

DisEnchantment · Nov 8, 2022

BorisTheBlade82 said:
But to my knowledge IFoP already is at 64Gbyte/s

GMI3 is 32 GT/s/lane. We are talking about different things here I believe. BW is not as critical for CPUs as latency. The higher transfers per second it has per lane the lower the latency.
I don't have the GMI3 slides though, below is GMI2

GMI2 --> Max 25 GT/s/lane - as configured in product 14.6GT/s @1466FCLK/2933MCLK

Excerpt from LinkedIn

BW is of course decided by the number of lanes * GT/s/lane

GMI3 --> Max 32 GT/s/lane - as configured in product unknown.

I think they added more lanes and also slightly higher frequency. AFAIK you can run FCLK all the way to 2000 MHz in 1:1.

GMI4 is 64 GT/s/lane

What could change with the GMI4 is to remove the line driver as shown below (with triangle) with simple traces and repeater where needed (but routed through RDL+LSI instead of longer substrate traces).

Update: now found the slides for the lanes
Changed Gbps to GT/s to be more precise

leoneazzurro · Nov 14, 2022

https://twitter.com/x/status/1592108845068951554

This is interesting. Why do I post here instead than in the RDNA3 thread? Because this may be a hint about the interconnect speeds that could be achievable on advanced packaging and then these could be applicable on Zen5, too. We have seen that IF links are becoming a limit already on Zen4, so improving these in Zen5 could remove a performance limitation.

DisEnchantment · Nov 14, 2022

leoneazzurro said:
https://twitter.com/x/status/1592108845068951554

This is interesting. Why do I post here instead than in the RDNA3 thread? Because this may be a hint about the interconnect speeds that could be achievable on advanced packaging and then these could be applicable on Zen5, too. We have seen that IF links are becoming a limit already on Zen4, so improving these in Zen5 could remove a performance limitation.

Looks like plain InFO-R or AMD's equivalent of this tech. At best 4 copper layers, but definitely miles better than driving something through the substrate.

Biggest challenge is managing thermal expansion on CPUs which degrade the chip Fan out structure over thermal cycles.
But it looks simple enough and density is quite low (compared to even 65nm), can just fab this in house at AMD.

DisEnchantment · Nov 14, 2022

Unrelated to AMD, GLink provides similar inter die link like AMD's IFOP/GMI at 0.3pJ/bit on InFO-R(_oS)

The GLink-2.5D IP utilizes single-ended signaling on parallel bus with DDR clock forwarding. This allows for up to 8/16Gbps per pin consuming only 0.25pJ/bit on TSMC’s RDL-based InFO (Integrated-Fan-Out) or CoWoS (Chip-on-Wafer-on-Substrate). One slice has 32 full-duplex lanes and one PHY has 8 slices with 2/4Tbps maximum bandwidth. For the next generation GLink, one slice will have 56 full-duplex lanes and one PHY has 8 slices with 7.5 Tbps maximum bandwidth.

AMD's IFOP seems more advanced than this scheme, at least based on open architecture. (low swing single ended signaling)
If they migrate their link to InFO-R they should be able to match this if not better.
IFOP via substrate --> ~2pJ/bit.
IFOP via RDL --> ~0.3pJ/bit
Up to ~7x reduction in pJ/bit.
BW is important, but latency is even more important in CPUs. How high they can clock would be very critical.

Henry swagger · Nov 20, 2022

Leak suggests AMD Zen 5 CPUs to pack impressive IPC performance gains and hugely increased core counts

RedGamingTech has leaked some early AMD Zen 5 info. The leak suggests Zen 5 chips will come with a hybrid design similar to Intel Alder Lake CPUs and will pack impressive gen-on-gen IPC performance gains.

www.notebookcheck.net

Rgt saying zen 5 will have a unified l2 cache around the ccx and stacked l3.. will have zen 5 + zen 4 cores 🤔🤔

Tigerick · Nov 20, 2022

Henry swagger said:
Leak suggests AMD Zen 5 CPUs to pack impressive IPC performance gains and hugely increased core counts

RedGamingTech has leaked some early AMD Zen 5 info. The leak suggests Zen 5 chips will come with a hybrid design similar to Intel Alder Lake CPUs and will pack impressive gen-on-gen IPC performance gains.

www.notebookcheck.net

Rgt saying zen 5 will have a unified l2 cache around the ccx and stacked l3.. will have zen 5 + zen 4 cores 🤔🤔

This leak is the one I strongly believe would be for Zen5 desktop CPU architecture. With removal of L3, AMD can double up Zen5 cores while maintaining similar die size which is important for Turin server CPU. And by sharing all L2 cache AMD can remedy latency issue with external L3 cache.

Exist50 · Nov 20, 2022

Henry swagger said:
Leak suggests AMD Zen 5 CPUs to pack impressive IPC performance gains and hugely increased core counts

RedGamingTech has leaked some early AMD Zen 5 info. The leak suggests Zen 5 chips will come with a hybrid design similar to Intel Alder Lake CPUs and will pack impressive gen-on-gen IPC performance gains.

www.notebookcheck.net

Rgt saying zen 5 will have a unified l2 cache around the ccx and stacked l3.. will have zen 5 + zen 4 cores 🤔🤔

This whole thing sounds like complete nonsense. Doubling the cores? +30% IPC? Unified L2 cache? Unified stacked L3 for everything? Yeah, I'm calling BS.

poke01 · Nov 20, 2022

Tigerick said:
With removal of L3, AMD can double up Zen5 cores while maintaining similar die size which is important for Turin server CPU. And by sharing all L2 cache AMD can remedy latency issue with external L3 cache.

man Apple's ex-cheif designer was way ahead. They already moved L3 cache ages ago and already moved to 8 wide decode in 2017 i think?

If Gerald's cores ie Nuvia come to PC that is native Windows\linux, then Intel and AMD have tough days. Qualcomm will be aiming for server, laptops/mobile and autos.

BorisTheBlade82 · Nov 20, 2022

Henry swagger said:
Leak suggests AMD Zen 5 CPUs to pack impressive IPC performance gains and hugely increased core counts

RedGamingTech has leaked some early AMD Zen 5 info. The leak suggests Zen 5 chips will come with a hybrid design similar to Intel Alder Lake CPUs and will pack impressive gen-on-gen IPC performance gains.

www.notebookcheck.net

Rgt saying zen 5 will have a unified l2 cache around the ccx and stacked l3.. will have zen 5 + zen 4 cores 🤔🤔

That RGT video was posted more than 7 months ago. Pretty sure, it was already discussed back then...
Regarding the caches: Instead of restructuring all the mentioned caches I would rather imagine them to introduce a L4/LLC. But latency could be a problem. And the question is how big it needs to be in order to make an impact.

maddie · Nov 20, 2022

DisEnchantment said:
Unrelated to AMD, GLink provides similar inter die link like AMD's IFOP/GMI at 0.3pJ/bit on InFO-R(_oS)

View attachment 71129

AMD's IFOP seems more advanced than this scheme, at least based on open architecture. (low swing single ended signaling)
If they migrate their link to InFO-R they should be able to match this if not better.
IFOP via substrate --> ~2pJ/bit.
IFOP via RDL --> ~0.3pJ/bit
Up to ~7x reduction in pJ/bit.
BW is important, but latency is even more important in CPUs. How high they can clock would be very critical.

The term Beachfront is actually used?

igor_kavinski · Nov 20, 2022

Exist50 said:
This whole thing sounds like complete nonsense. Doubling the cores? +30% IPC? Unified L2 cache? Unified stacked L3 for everything? Yeah, I'm calling BS.

Wouldn't a unified large L2 cache be the next evolution in cache performance? A big slab of cache in the middle and cores placed on all sides of it?

BorisTheBlade82 · Nov 20, 2022

maddie said:
The term Beachfront is actually used?

Seems so, and IMHO it is a quite fitting term - just as real Beachfront it is rather limited on a die.

Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

Golden Member

Diamond Member

Senior member

Golden Member

Golden Member

Senior member

Diamond Member

Diamond Member

Golden Member

Senior member

Golden Member

Senior member

Golden Member

Senior member

Golden Member

Golden Member

Golden Member

Golden Member

Senior member

Senior member

Platinum Member

Diamond Member

Senior member

Diamond Member

Lifer

Senior member