Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
I guess it's Pat now.
You must have had a bump on your head earlier today, because Paul stopped being CEO a decade ago and hasn't been among the living for half that! But I agree with what you said. Jensen, Lisa and Pat need to get together for a CEOs' luncheon, have a lot of drinks and some oregano in between bites of food, and take turns prank calling that Leno-jawed bonehead whose videos are worth less than dung.
 
  • Like
Reactions: Mopetar

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
A split (bisected) ring just means two rings, with one 'stop' (switch/router) from each ring to the other - two actually, for clockwise and counter-clockwise, but still just one stop for computing latency. You also need a 'node' for the IF connection to the ring (again, two of them). There is a math formula for it, but I don't recall it at the moment. So just start at node n=1 to n=10 and sum the latency from each node to the others, then divide by 10. I'm sure the route uses SPF (shortest path first). AMD could probably just make the IF node another route point - but I don't know if they do. Intel had separate stops for memory, I/O and ring-to-ring connections.

Anyway, pretty sure this is close to correct for napkin math - without knowing implementation details. Anyone who cares to can correct me.
How come you are so sure that the IFoP is even part of the ring? I could also imagine that it is directly attached to the L3 on a separate path. IMHO all memory traffic goes through that path, which is why the latencies of all cache stages as well as memory latency add up.

As to the calculations:
I know how to calculate the average hops. The thing is that the formula is different for each topology. And for more complex topologies like a bisected ring or the ladder (which sounds like a 2x4 grid/mesh to me) it is a bit of a pain in the a.. to simply count them. That is why I am wondering why seemingly no one in the world has made an online calculator - maybe this is my next hobby project 🤔
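
Just to sketch what that hobby project could look like (the node counts and edge lists below are toy assumptions of mine, not AMD's actual layout): instead of deriving a formula per topology, describe the topology as a graph and let a BFS count the hops.

```python
# Toy average-hop calculator: any topology as an edge list, BFS does the
# counting. The three 8-node topologies are illustrative assumptions only.
from collections import deque

def avg_hops(n, edges):
    """Average shortest-path hop count over all unordered node pairs."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total = pairs = 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for node, d in dist.items() if node > src)
        pairs += sum(1 for node in dist if node > src)
    return total / pairs

ring = [(i, (i + 1) % 8) for i in range(8)]          # plain 8-stop ring
bisected = ring + [(0, 4)]                           # ring plus a cross link
ladder = ([(i, i + 1) for i in range(3)]             # 2x4 grid: top rail,
          + [(i, i + 1) for i in range(4, 7)]        # bottom rail,
          + [(i, i + 4) for i in range(4)])          # and the rungs

for name, edges in (("ring", ring), ("bisected", bisected), ("ladder", ladder)):
    print(f"{name}: {avg_hops(8, edges):.2f} avg hops")
```

For a bidirectional ring the BFS shortest path is exactly the minimum of the clockwise and counter-clockwise distances, so it agrees with the per-topology formula approach.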
 

Mopetar

Diamond Member
Jan 31, 2011
8,487
7,726
136
You must have had a bump on your head earlier today, because Paul stopped being CEO a decade ago and hasn't been among the living for half that!

I'm just going to blame it on the fact that they both have names starting with P. That and I think my brain wants to forget that Brian Krzanich ever happened to Intel.
 
  • Like
Reactions: Thibsie

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
I'm just going to blame it on the fact that they both have names starting with P. That and I think my brain wants to forget that Brian Krzanich ever happened to Intel.
That one would be a double p.


and the reason he resigned
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
How come you are so sure that the IFoP is even part of the ring? I could also imagine that it is directly attached to the L3 on a separate path. IMHO all memory traffic goes through that path, which is why the latencies of all cache stages as well as memory latency add up.

As to the calculations:
I know how to calculate the average hops. The thing is that the formula is different for each topology. And for more complex topologies like a bisected ring or the ladder (which sounds like a 2x4 grid/mesh to me) it is a bit of a pain in the a.. to simply count them. That is why I am wondering why seemingly no one in the world has made an online calculator - maybe this is my next hobby project 🤔
I don't think the ring interconnect would be attached to the IF. The L3 also is not directly attached to the IFOP, but rather to the SDF/SCF; the IFOP/IFIS is at least a level below.
My own understanding is that the ring interconnect used to reach the other cores is on the L3, which also maintains a copy of the tags of all the other L2s. That is how data is routed from one core's private caches to another when needed.
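
As a toy illustration of that shadow-tag idea (the names and structure here are mine; AMD's real implementation is not public):

```python
# Toy model of L2 shadow tags mirrored at the L3 (illustrative only).
# Because the L3 holds a copy of every core's L2 tags, a request that
# misses in the requester's private caches can be steered straight to
# the one core whose L2 holds the line, instead of broadcasting probes.

NUM_CORES = 8

# shadow_tags[core] = set of line addresses currently in that core's L2
shadow_tags = {core: set() for core in range(NUM_CORES)}

def l2_fill(core, line_addr):
    """A core's L2 gains a line; the L3 updates its shadow copy."""
    shadow_tags[core].add(line_addr)

def l2_evict(core, line_addr):
    """Line leaves the core's L2 (e.g. victim to L3); shadow copy updated."""
    shadow_tags[core].discard(line_addr)

def locate_line(requester, line_addr):
    """On an L3 lookup, consult shadow tags to find which core owns the line."""
    for core in range(NUM_CORES):
        if core != requester and line_addr in shadow_tags[core]:
            return core          # route a cache-to-cache transfer to this core
    return None                  # in no L2 -> serve from L3 or memory

l2_fill(3, 0xDEADBEC0)
print(locate_line(0, 0xDEADBEC0))  # -> 3: core 0's miss is steered to core 3
```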

Some core to core latency measurements from AT
[attached image: core-to-core latency matrix from AnandTech]
It looks like it needs 2 hops between two adjacent cores and an average of 5 to 6 hops for the other cores.

Edit:
Per @Bigos' comment below, this graph indeed includes SMT.
 
Last edited:

Bigos

Senior member
Jun 2, 2019
204
519
136
This graph is for an 8-core CCD. CPU0 and CPU1 are two SMT threads of the same physical core, hence the massively lower latency.

It looks like the latency between physical cores (looking at the first row only) ranges from 13.8 to 17.6. You cannot really tell the number of hops from that alone, as there is the base L3 latency in there as well (and CPU0 and CPU1 most probably do not communicate through the L3). Just looking at CPU0 -> CPU2 and CPU0 -> CPU3, we can conclude that the variance in this testing procedure is pretty large, as CPU2 and CPU3 are again two SMT threads of the same physical core (judging by the latency between CPU2 and CPU3). We can still see that some cores are closer than others, and some kind of analysis involving math would be able to tell more.
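
For instance, a sketch of such an analysis, with made-up numbers shaped like the ones above (SMT siblings far cheaper, cross-core pairs in the 13.8-17.6 band):

```python
# Sketch of "analysis involving math" on a core-to-core latency matrix.
# The 4x4 matrix is invented for illustration, not real measurement data.
import numpy as np

lat = np.array([
    [ 0.0,  7.0, 15.1, 15.3],   # CPU0
    [ 7.0,  0.0, 14.9, 15.2],   # CPU1 = SMT sibling of CPU0
    [15.1, 14.9,  0.0,  7.1],   # CPU2
    [15.3, 15.2,  7.1,  0.0],   # CPU3 = SMT sibling of CPU2
])

n = len(lat)
masked = np.where(np.eye(n, dtype=bool), np.inf, lat)   # ignore the diagonal
siblings = {i: int(np.argmin(masked[i])) for i in range(n)}
print("SMT sibling pairs:", sorted((i, s) for i, s in siblings.items() if i < s))

# Crude hop proxy: subtract the cheapest cross-core latency (~ base L3 cost)
# and look at the residual spread. Turning residuals into hop counts would
# additionally need the per-stop traversal cost, which this data alone lacks.
cross = lat[lat > 10]                    # drop self and SMT-sibling entries
print("residual spread over base:", round(float(cross.max() - cross.min()), 2), "ns")
```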
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
I don't think the ring interconnect would be attached to IF. L3 also is not directly attached to IFOP but rather to SDF/SCF. IFOP/IFIS is at least a level below.
Isn't that a bit nitpicky? 😉
But yeah, that is what I was saying all along - the "interconnect to the IOD", to be as general as possible, is not part of the ring.
I am not sure about the L2 tags, though - but only because I never dug too deep into that matter. Core-to-core communication across several CCDs is handled by L3 coherency, but on the same CCD that is not needed.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,777
6,791
136
Core-to-core communication across several CCDs is handled by L3 coherency, but on the same CCD that is not needed.
The L2$ is inclusive of the L1$, and the L3$ has shadow tags for the core-private L2$. So if one core needs data from another core, the shadow tags are used to find out which core has the data. The data is not routed via the L3$, however, so I assume there is some interconnect here; the L3$ only holds data evicted from the L2$s. Other than this I don't know of any other primitives for barrier synchronization, message passing etc. between cores. But for obvious reasons there will never be any public info around this anyway.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Isn't that a bit nitpicky? 😉
But yeah, that is what I was saying all along - the "interconnect to the IOD", to be as general as possible, is not part of the ring.
Data to and from the IOD needs exactly the same routing as other L3 traffic. Why would AMD build a duplicated interconnect network for IOD traffic only? Intel designs have the memory controller as part of the ring - as do AMD GPUs.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
Data to and from the IOD needs exactly the same routing as other L3 traffic. Why would AMD build a duplicated interconnect network for IOD traffic only? Intel designs have the memory controller as part of the ring - as do AMD GPUs.
Maybe you misunderstood me:
The L3 on Zen is exclusive to each CCD (unlike SPR by default). So there is absolutely zero L3 traffic via the IFoP/IOD - except for L3 coherency, where AMD uses some kind of MOESI. And that is exactly how the cores talk to each other when on separate CCDs - otherwise they would have horrible latency from going through RAM. And that is the beauty: although IFoP bandwidth is very limited, there is no common workload to my knowledge where this is detrimental.
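
For reference, textbook MOESI looks like this - the generic protocol family AMD documents for its CPUs; the exact Zen transitions are not public, so treat it as a sketch:

```python
# Textbook MOESI states and one owner-side transition (generic sketch,
# not reverse-engineered Zen behavior).
from enum import Enum

class MOESI(Enum):
    MODIFIED  = "M"   # dirty, the only cached copy
    OWNED     = "O"   # dirty, but shared copies may exist; owner supplies data
    EXCLUSIVE = "E"   # clean, the only cached copy
    SHARED    = "S"   # clean here (a dirty owner may exist elsewhere)
    INVALID   = "I"   # not present

def owner_on_remote_read(state):
    """Owner-side transition when a core on another CCD read-probes the line.
    The Owned state is what lets a dirty line be shared cache-to-cache over
    the IFoP without a writeback to DRAM first."""
    if state in (MOESI.MODIFIED, MOESI.OWNED):
        return MOESI.OWNED       # supply the dirty data, keep ownership
    if state == MOESI.EXCLUSIVE:
        return MOESI.SHARED      # clean line, just downgrade
    return state                 # SHARED / INVALID are unchanged

print(owner_on_remote_read(MOESI.MODIFIED))   # -> MOESI.OWNED
```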
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Maybe you misunderstood me:
The L3 on Zen is exclusive to each CCD (unlike SPR by default). So there is absolutely zero L3 traffic via the IFoP/IOD - except for L3 coherency, where AMD uses some kind of MOESI. And that is exactly how the cores talk to each other when on separate CCDs - otherwise they would have horrible latency from going through RAM. And that is the beauty: although IFoP bandwidth is very limited, there is no common workload to my knowledge where this is detrimental.

Every bit of data in and out of the CCD goes through that IFoP link. And the IFoP link is one of the Zen 3 ring stops - just like on Intel chips with a ring bus, where the memory controller sits at one ring stop.
 

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
Every bit of data in and out of the CCD goes through that IFoP link. And the IFoP link is one of the Zen 3 ring stops - just like on Intel chips with a ring bus, where the memory controller sits at one ring stop.
Care to share a source saying it is a ring stop? At least I can't see why this should be a given.
 

A///

Diamond Member
Feb 24, 2017
4,351
3,160
136
how reliable is the chubby lad from youtube with the 80s rocker hair and jabba the hutt double chin that sounds like a particularly s***e version of gordon elliot?
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Care to share a source saying it is a ring stop? At least I can't see why this should be a given.

AMD's Zen3 presentation:

[attached image: slide from AMD's Zen 3 presentation]


Every core and every L3 slice needs a connection to the other cores and to I/O. A ring bus is one widely used interconnect for that.

 
Last edited:

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
why would you have 16 connections to a port that has so little bandwidth relative to the number of connections?
Exactly - because it is such a small-bandwidth connection, it costs relatively few transistors, nets you uniform RAM latency for each cache slice, and doesn't introduce crosstalk on the ring.

AMD's Zen3 presentation:

[attached image: slide from AMD's Zen 3 presentation]


Every core and every L3 slice needs a connection to the other cores and to I/O. A ring bus is one widely used interconnect for that.

I am still failing to see proof in this. All I see is 8 cores connected to an L3 block which, as we already knew from another source, has its slices connected via some form of bidirectional ring. And then we have another connection from that block to the outside - but we have no idea how this is implemented.

To make myself clear: I do not deny the possibility of it being a ring node. But so far you could not provide hard facts as to why this should be considered, well, a fact.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
I don't think the ring interconnect would be attached to the IF. The L3 also is not directly attached to the IFOP, but rather to the SDF/SCF; the IFOP/IFIS is at least a level below.
Well, yes, it would be through the SDF. I/O and memory need to be connected to the ring somehow - that's the point of a ring. It wouldn't make any sense to add a mesh or P2P interconnect for data underneath the ring and use the ring only for cache snooping and L3$-to-L3$ data transfers.
 

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Exactly - because it is such a small-bandwidth connection, it costs relatively few transistors, nets you uniform RAM latency for each cache slice, and doesn't introduce crosstalk on the ring.
The IFoP is not a small-bandwidth connection. The ring also doubles as the request queue / load balancing - with a direct connection from each L3 slice to the IFoP, those would have to be implemented some other way, basically a duplicated second interconnect network.
 

gdansk

Diamond Member
Feb 8, 2011
4,568
7,682
136
That's interesting. I would have expected a similar, almost identical IOd for Zen 5.
 

AMDK11

Senior member
Jul 15, 2019
473
407
136
The chiplets look similar to Zen 4's. Maybe it's Zen 4 with some IOd variant, or a test version of the new IOd for Zen 5 but still on Zen 4 chiplets? On the one hand, a new-generation IOd prepared earlier for testing purposes could be combined with other chiplets, but it may as well be an APU.
 
Last edited:

BorisTheBlade82

Senior member
May 1, 2020
707
1,130
136
Well, yes, it would be through the SDF. I/O and memory need to be connected to the ring somehow - that's the point of a ring. It wouldn't make any sense to add a mesh or P2P interconnect for data underneath the ring and use the ring only for cache snooping and L3$-to-L3$ data transfers.
The L3 is unified within a CCD, so there is A LOT of traffic going on from L3 accesses alone - more than enough to justify a ring solely for this.

The IFoP is not a small-bandwidth connection. The ring also doubles as the request queue / load balancing - with a direct connection from each L3 slice to the IFoP, those would have to be implemented some other way, basically a duplicated second interconnect network.
Maybe we have a different understanding of "large bandwidth". The IFoP has 64/32 GByte/s (read/write), while the L3 has almost 1.5 TByte/s, see https://chipsandcheese.com/2023/04/23/amds-7950x3d-zen-4-gets-vcache/

That is around 20x more, in case you might have missed it. At this point I am not so sure you have your facts together, so your statements seem less and less trustworthy.
 
  • Like
Reactions: Joe NYC

naukkis

Golden Member
Jun 5, 2002
1,020
853
136
Maybe we have a different understanding of "large bandwidth". The IFoP has 64/32 GByte/s (read/write), while the L3 has almost 1.5 TByte/s, see https://chipsandcheese.com/2023/04/23/amds-7950x3d-zen-4-gets-vcache/

That is around 20x more, in case you might have missed it. At this point I am not so sure you have your facts together, so your statements seem less and less trustworthy.

Look at the picture in your own link: https://i0.wp.com/chipsandcheese.co...23/04/zen4_ring_vs_broadwell_drawio.png?ssl=1

And you just calculated total bandwidth - what matters is the individual bandwidth between ring stops. The IFoP bandwidth is in the same league as any other ring traffic: ring link speed is 32 B * ring clock, which at 4 GHz equals 128 GB/s. Zen 4 could also use a double-link IFoP from the CCD to provide that bandwidth with the server IOD.
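
Spelled out (the figures are the ones already quoted in this exchange; the 4 GHz ring clock is my assumption):

```python
# Napkin math with the figures quoted in this exchange; the 4 GHz ring
# clock is an assumption, not a measured value.
ring_bytes_per_clk = 32                 # 32 B per ring clock, per link
ring_clk_ghz = 4.0                      # assumed ring clock
ifop_read_gbs, ifop_write_gbs = 64, 32  # per-CCD IFoP figures cited above
l3_aggregate_gbs = 1500                 # ~1.5 TB/s aggregate L3 bandwidth

ring_link_gbs = ring_bytes_per_clk * ring_clk_ghz
print(f"one ring link:            {ring_link_gbs:.0f} GB/s")   # 128 GB/s
print(f"IFoP read vs one link:    {ifop_read_gbs / ring_link_gbs:.2f}x")
print(f"aggregate L3 vs one link: {l3_aggregate_gbs / ring_link_gbs:.1f}x")
```

The point: 1.5 TB/s is an aggregate over all slices, while a single ring stop only ever sees one link's worth, so the IFoP sits in the same league as any other stop.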
 
Last edited: