Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
I guess it's Pat now.
You must have had a bump on your head earlier today, because Paul stopped being CEO a decade ago and hasn't been among the living for half that! But I agree with what you said. Jensen, Lisa and Pat need to get together for a CEOs' luncheon, have a lot of drinks and some oregano in between bites of food, and take turns prank calling that Leno-jawed bonehead whose videos are worth less than dung.
 
  • Like
Reactions: Mopetar

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
A split (bisected) ring just means two rings, with one 'stop' (switch/router) from each ring to the other - two actually, for clockwise and counter-clockwise, but still just one stop for computing latency. You also need a 'node' for the IF connection to the ring (again, two of them). There is a math formula for it, but I don't recall it at the moment. So just start at node n=1 to n=10 and sum the latency from each node to the others, then divide by 10. I'm sure the route uses SPF (shortest path first). AMD could probably just make the IF node another route point - but I don't know if they do. Intel had separate stops for memory, I/O and ring-to-ring connections.

Anyway, pretty sure this is close to correct for napkin math - without knowing implementation details. Anyone who cares to can correct me.
How come you are so sure that the IFoP is even part of the ring? I could also imagine that it is directly attached to the L3 on a separate path. IMHO all memory traffic goes through that path, which is why the latencies of all cache stages as well as memory latency add up.

As to the calculations:
I know how to calculate the average hops. The thing is that the formula is different for each topology. And for more complex topologies like a bisected ring or the ladder (which sounds like a 2x4 grid/mesh to me) it is a bit of a pain in the a.. to simply count them. That is why I am wondering why seemingly no one in the world has made an online calculator - maybe this is my next hobby project 🤔
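
Just to sketch what that hobby project could look like (the node counts and edge lists below are toy assumptions of mine, not AMD's actual layout): instead of deriving a formula per topology, describe the topology as a graph and let a BFS count the hops.

```python
# Toy average-hop calculator: any topology as an edge list, BFS does the
# counting. The three 8-node topologies are illustrative assumptions only.
from collections import deque

def avg_hops(n, edges):
    """Average shortest-path hop count over all unordered node pairs."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total = pairs = 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for node, d in dist.items() if node > src)
        pairs += sum(1 for node in dist if node > src)
    return total / pairs

ring = [(i, (i + 1) % 8) for i in range(8)]          # plain 8-stop ring
bisected = ring + [(0, 4)]                           # ring plus a cross link
ladder = ([(i, i + 1) for i in range(3)]             # 2x4 grid: top rail,
          + [(i, i + 1) for i in range(4, 7)]        # bottom rail,
          + [(i, i + 4) for i in range(4)])          # and the rungs

for name, edges in (("ring", ring), ("bisected", bisected), ("ladder", ladder)):
    print(f"{name}: {avg_hops(8, edges):.2f} avg hops")
```

For a bidirectional ring the BFS shortest path is exactly the minimum of the clockwise and counter-clockwise distances, so it agrees with the per-topology formula approach.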
 

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
You must have had a bump on your head earlier today, because Paul stopped being CEO a decade ago and hasn't been among the living for half that!

I'm just going to blame it on the fact that they both have names starting with P. That and I think my brain wants to forget that Brian Krzanich ever happened to Intel.
 
  • Like
Reactions: Thibsie

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
I'm just going to blame it on the fact that they both have names starting with P. That and I think my brain wants to forget that Brian Krzanich ever happened to Intel.
That one would be a double p.


and the reason he resigned
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
How come you are so sure that the IFoP is even part of the ring? I could also imagine that it is directly attached to the L3 on a separate path. IMHO all memory traffic goes through that path, which is why the latencies of all cache stages as well as memory latency add up.

As to the calculations:
I know how to calculate the average hops. The thing is that the formula is different for each topology. And for more complex topologies like a bisected ring or the ladder (which sounds like a 2x4 grid/mesh to me) it is a bit of a pain in the a.. to simply count them. That is why I am wondering why seemingly no one in the world has made an online calculator - maybe this is my next hobby project 🤔
I don't think the ring interconnect would be attached to the IF. The L3 also is not directly attached to the IFOP, but rather to the SDF/SCF; the IFOP/IFIS is at least a level below.
My own understanding is that the ring interconnect used to reach the other cores is on the L3, which also maintains a copy of the tags of all the other L2s. That is how data is routed from one core's private caches to another when needed.
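
As a toy illustration of that shadow-tag idea (the names and structure here are mine; AMD's real implementation is not public):

```python
# Toy model of L2 shadow tags mirrored at the L3 (illustrative only).
# Because the L3 holds a copy of every core's L2 tags, a request that
# misses in the requester's private caches can be steered straight to
# the one core whose L2 holds the line, instead of broadcasting probes.

NUM_CORES = 8

# shadow_tags[core] = set of line addresses currently in that core's L2
shadow_tags = {core: set() for core in range(NUM_CORES)}

def l2_fill(core, line_addr):
    """A core's L2 gains a line; the L3 updates its shadow copy."""
    shadow_tags[core].add(line_addr)

def l2_evict(core, line_addr):
    """Line leaves the core's L2 (e.g. victim to L3); shadow copy updated."""
    shadow_tags[core].discard(line_addr)

def locate_line(requester, line_addr):
    """On an L3 lookup, consult shadow tags to find which core owns the line."""
    for core in range(NUM_CORES):
        if core != requester and line_addr in shadow_tags[core]:
            return core          # route a cache-to-cache transfer to this core
    return None                  # in no L2 -> serve from L3 or memory

l2_fill(3, 0xDEADBEC0)
print(locate_line(0, 0xDEADBEC0))  # -> 3: core 0's miss is steered to core 3
```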

Some core to core latency measurements from AT
[attached image: core-to-core latency matrix from AnandTech]
It looks like it needs 2 hops between two adjacent cores and an average of 5 to 6 hops for the other cores.

Edit:
Per @Bigos' comment below, this graph indeed includes SMT.
 
Last edited:

Bigos

Senior member
Jun 2, 2019
204
519
136
This graph is for an 8-core CCD. CPU0 and CPU1 are two SMT threads of the same physical core, hence the massively lower latency.

It looks like the latency between physical cores (looking at the first row only) ranges from 13.8 to 17.6. You cannot really tell the number of hops from that alone, as there is the base L3 latency in there as well (and CPU0 and CPU1 most probably do not communicate through the L3). Just looking at CPU0 -> CPU2 and CPU0 -> CPU3, we can conclude that the variance in this testing procedure is pretty large, as CPU2 and CPU3 are again two SMT threads of the same physical core (judging by the latency between CPU2 and CPU3). We can still see that some cores are closer than others, and some kind of analysis involving math would be able to tell more.
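
For instance, a sketch of such an analysis, with made-up numbers shaped like the ones above (SMT siblings far cheaper, cross-core pairs in the 13.8-17.6 band):

```python
# Sketch of "analysis involving math" on a core-to-core latency matrix.
# The 4x4 matrix is invented for illustration, not real measurement data.
import numpy as np

lat = np.array([
    [ 0.0,  7.0, 15.1, 15.3],   # CPU0
    [ 7.0,  0.0, 14.9, 15.2],   # CPU1 = SMT sibling of CPU0
    [15.1, 14.9,  0.0,  7.1],   # CPU2
    [15.3, 15.2,  7.1,  0.0],   # CPU3 = SMT sibling of CPU2
])

n = len(lat)
masked = np.where(np.eye(n, dtype=bool), np.inf, lat)   # ignore the diagonal
siblings = {i: int(np.argmin(masked[i])) for i in range(n)}
print("SMT sibling pairs:", sorted((i, s) for i, s in siblings.items() if i < s))

# Crude hop proxy: subtract the cheapest cross-core latency (~ base L3 cost)
# and look at the residual spread. Turning residuals into hop counts would
# additionally need the per-stop traversal cost, which this data alone lacks.
cross = lat[lat > 10]                    # drop self and SMT-sibling entries
print("residual spread over base:", round(float(cross.max() - cross.min()), 2), "ns")
```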
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
I don't think the ring interconnect would be attached to IF. L3 also is not directly attached to IFOP but rather to SDF/SCF. IFOP/IFIS is at least a level below.
Isn't that a bit nitpicky? 😉
But yeah, that is what I was saying all along - the "interconnect to the IOD", to be as general as possible, is not part of the ring.
I am not sure about the L2 tags, though - but only because I never dug too deep into that matter. Core-to-core communication across several CCDs is handled by L3 coherency, but on the same CCD that is not needed.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
Core-to-core communication across several CCDs is handled by L3 coherency, but on the same CCD that is not needed.
The L2$ is inclusive of the L1$, and the L3$ has shadow tags for the core-private L2$. So if one core needs data from another core, the shadow tags are used to find out which core has the data. The data is not routed via the L3$, however, so I assume there is some interconnect here; the L3$ only holds data evicted from the L2$s. Other than this I don't know of any other primitives for barrier synchronization, message passing etc. between cores. But for obvious reasons there will never be any public info around this anyway.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Isn't that a bit nitpicky? 😉
But yeah, that is what I was saying all along - the "interconnect to the IOD", to be as general as possible, is not part of the ring.
Data to and from the IOD needs exactly the same routing as other L3 traffic. Why would AMD build a duplicated interconnect network for IOD traffic only? Intel designs have the memory controller as part of the ring - as do AMD GPUs.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
Data to and from the IOD needs exactly the same routing as other L3 traffic. Why would AMD build a duplicated interconnect network for IOD traffic only? Intel designs have the memory controller as part of the ring - as do AMD GPUs.
Maybe you misunderstood me:
The L3 on Zen is exclusive to each CCD (unlike SPR by default). So there is absolutely zero L3 traffic via the IFoP/IOD - except for L3 coherency, where AMD uses some kind of MOESI. And that is exactly how the cores talk to each other when on separate CCDs - otherwise they would have horrible latency from going through RAM. And that is the beauty: although IFoP bandwidth is very limited, there is no common workload to my knowledge where this is detrimental.
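
For reference, textbook MOESI looks like this - the generic protocol family AMD documents for its CPUs; the exact Zen transitions are not public, so treat it as a sketch:

```python
# Textbook MOESI states and one owner-side transition (generic sketch,
# not reverse-engineered Zen behavior).
from enum import Enum

class MOESI(Enum):
    MODIFIED  = "M"   # dirty, the only cached copy
    OWNED     = "O"   # dirty, but shared copies may exist; owner supplies data
    EXCLUSIVE = "E"   # clean, the only cached copy
    SHARED    = "S"   # clean here (a dirty owner may exist elsewhere)
    INVALID   = "I"   # not present

def owner_on_remote_read(state):
    """Owner-side transition when a core on another CCD read-probes the line.
    The Owned state is what lets a dirty line be shared cache-to-cache over
    the IFoP without a writeback to DRAM first."""
    if state in (MOESI.MODIFIED, MOESI.OWNED):
        return MOESI.OWNED       # supply the dirty data, keep ownership
    if state == MOESI.EXCLUSIVE:
        return MOESI.SHARED      # clean line, just downgrade
    return state                 # SHARED / INVALID are unchanged

print(owner_on_remote_read(MOESI.MODIFIED))   # -> MOESI.OWNED
```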
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Maybe you misunderstood me:
The L3 on Zen is exclusive to each CCD (unlike SPR by default). So there is absolutely zero L3 traffic via the IFoP/IOD - except for L3 coherency, where AMD uses some kind of MOESI. And that is exactly how the cores talk to each other when on separate CCDs - otherwise they would have horrible latency from going through RAM. And that is the beauty: although IFoP bandwidth is very limited, there is no common workload to my knowledge where this is detrimental.

Every bit of data in and out of the CCD goes through that IFoP link. And the IFoP link is one of the Zen 3 ring stops - just like on Intel chips with a ring bus, where the memory controller sits at one ring stop.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
Every bit of data in and out of the CCD goes through that IFoP link. And the IFoP link is one of the Zen 3 ring stops - just like on Intel chips with a ring bus, where the memory controller sits at one ring stop.
Care to share a source saying it is a ring stop? At least I can't see why this should be a given.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
how reliable is the chubby lad from youtube with the 80s rocker hair and jabba the hutt double chin that sounds like a particularly s***e version of gordon elliot?
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Care to share a source saying it is a ring stop? At least I can't see why this should be a given.

AMD's Zen3 presentation:

[attached image: slide from AMD's Zen 3 presentation]


Every core and every L3 slice needs a connection to the other cores and to I/O. A ring bus is one widely used interconnect for that.

 
Last edited:

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
why would you have 16 connections to a port that has so little bandwidth relative to the number of connections?
Exactly - because it is such a small-bandwidth connection, it costs relatively few transistors, nets you uniform RAM latency for each cache slice, and doesn't introduce crosstalk on the ring.

AMD's Zen3 presentation:

[attached image: slide from AMD's Zen 3 presentation]


Every core and every L3 slice needs a connection to the other cores and to I/O. A ring bus is one widely used interconnect for that.

I am still failing to see proof in this. All I see is 8 cores connected to an L3 block which, as we already knew from another source, has its slices connected via some form of bidirectional ring. And then we have another connection from that block to the outside - but we have no idea how this is implemented.

To make myself clear: I do not deny the possibility of it being a ring node. But so far you could not provide hard facts as to why this should be considered, well, a fact.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
I don't think the ring interconnect would be attached to the IF. The L3 also is not directly attached to the IFOP, but rather to the SDF/SCF; the IFOP/IFIS is at least a level below.
Well, yes, it would be through the SDF. I/O and memory need to be connected to the ring somehow - that's the point of a ring. It wouldn't make any sense to add a mesh or P2P interconnect for data underneath the ring and use the ring only for cache snooping and L3$-to-L3$ data transfers.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Exactly - because it is such a small-bandwidth connection, it costs relatively few transistors, nets you uniform RAM latency for each cache slice, and doesn't introduce crosstalk on the ring.
The IFoP is not a small-bandwidth connection. The ring also doubles as the request queue / load balancing - with a direct connection from each L3 slice to the IFoP, those would have to be implemented some other way, basically a duplicated second interconnect network.
 

gdansk

Diamond Member
Feb 8, 2011
4,568
7,682
136
That's interesting. I would have expected a similar, almost identical IOd for Zen 5.
 

AMDK11

Senior member
Jul 15, 2019
473
407
136
The chiplets look similar to Zen 4's. Maybe it's Zen 4 with some IOd variant, or a test version of the new IOd for Zen 5 but still on Zen 4 chiplets? On the one hand, a new-generation IOd prepared earlier for testing purposes could be combined with other chiplets, but it may as well be an APU.
 
Last edited:

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
Well, yes, it would be through the SDF. I/O and memory need to be connected to the ring somehow - that's the point of a ring. It wouldn't make any sense to add a mesh or P2P interconnect for data underneath the ring and use the ring only for cache snooping and L3$-to-L3$ data transfers.
The L3 is unified within a CCD, so there is A LOT of traffic going on from L3 accesses alone - more than enough to justify a ring solely for this.

The IFoP is not a small-bandwidth connection. The ring also doubles as the request queue / load balancing - with a direct connection from each L3 slice to the IFoP, those would have to be implemented some other way, basically a duplicated second interconnect network.
Maybe we have a different understanding of "large bandwidth". The IFoP has 64/32 GByte/s (read/write), while the L3 has almost 1.5 TByte/s, see https://chipsandcheese.com/2023/04/23/amds-7950x3d-zen-4-gets-vcache/

That is around 20x more, in case you might have missed it. At this point I am not so sure you have your facts together, so your statements seem less and less trustworthy.
 
  • Like
Reactions: Joe NYC

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Maybe we have a different understanding of "large bandwidth". The IFoP has 64/32 GByte/s (read/write), while the L3 has almost 1.5 TByte/s, see https://chipsandcheese.com/2023/04/23/amds-7950x3d-zen-4-gets-vcache/

That is around 20x more, in case you might have missed it. At this point I am not so sure you have your facts together, so your statements seem less and less trustworthy.

Look at the picture in your own link: https://i0.wp.com/chipsandcheese.co...23/04/zen4_ring_vs_broadwell_drawio.png?ssl=1

And you just calculated total bandwidth - what matters is the individual bandwidth between ring stops. The IFoP bandwidth is in the same league as any other ring traffic: ring link speed is 32 B * ring clock, which at 4 GHz equals 128 GB/s. Zen 4 could also use a double-link IFoP from the CCD to provide that bandwidth with the server IOD.
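
Spelled out (the figures are the ones already quoted in this exchange; the 4 GHz ring clock is my assumption):

```python
# Napkin math with the figures quoted in this exchange; the 4 GHz ring
# clock is an assumption, not a measured value.
ring_bytes_per_clk = 32                 # 32 B per ring clock, per link
ring_clk_ghz = 4.0                      # assumed ring clock
ifop_read_gbs, ifop_write_gbs = 64, 32  # per-CCD IFoP figures cited above
l3_aggregate_gbs = 1500                 # ~1.5 TB/s aggregate L3 bandwidth

ring_link_gbs = ring_bytes_per_clk * ring_clk_ghz
print(f"one ring link:            {ring_link_gbs:.0f} GB/s")   # 128 GB/s
print(f"IFoP read vs one link:    {ifop_read_gbs / ring_link_gbs:.2f}x")
print(f"aggregate L3 vs one link: {l3_aggregate_gbs / ring_link_gbs:.1f}x")
```

The point: 1.5 TB/s is an aggregate over all slices, while a single ring stop only ever sees one link's worth, so the IFoP sits in the same league as any other stop.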
 
Last edited: