Question: The IF interconnect topology inside Rome's sIOD?

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
It is now very clear from the die shot that Rome's sIOD consists of 4 slices of the cIOD, but it remains quite unclear how these 4 slices are connected (ring, crossbar, or some other kind of topology).
Does anybody have any idea on this?
 

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
AMD haven't done a ring since R600; with 4 IOD slices it's likely just a full mesh.
There is an interesting experiment probing the sIOD's interconnect by testing core-to-core ping latency at Chiphell https://www.chiphell.com/thread-2183951-1-1.html (Google Translate is needed).
The findings in short: the ping from core 0 to cores 1-63 shows a 5-step staircase (inside the CCX -> CCX attached to the same IOD slice: +5~10 ns -> neighbouring IOD slice: +10~20 ns -> next IOD slice: +5~10 ns -> farthest slice),
which indicates (the author said, and I quote) that the 4 cIOD slices inside the sIOD are likely connected in a chain (so-called half-ring) topology rather than a crossbar or a closed ring.
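For anyone who wants to poke at this without the Chiphell tool: the measurement boils down to bouncing a cache line between two pinned threads. A minimal ping-pong sketch along those lines (my own illustration, Linux-specific, not the author's code; the core numbers are just examples):

```c
/* Rough core-to-core "ping" sketch: two threads pinned to different
 * cores bounce a shared atomic flag, and the round-trip time divided
 * by the number of bounces approximates the core-to-core latency.
 * Linux-specific (pthread_setaffinity_np), illustration only -- this
 * is not the Chiphell author's tool. Build: gcc -O2 -pthread ping.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Atomic int flag = 0;            /* 1: ponger's turn, 0: pinger's turn */

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *ponger(void *arg)
{
    pin_to_core(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                           /* wait for the ping */
        atomic_store_explicit(&flag, 0, memory_order_release);  /* pong */
    }
    return NULL;
}

int main(void)
{
    int core_a = 0, core_b = 1;         /* example pair; sweep core_b over 1..63 */
    pthread_t t;
    pthread_create(&t, NULL, ponger, &core_b);
    pin_to_core(core_a);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);  /* ping */
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                           /* wait for the pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("core %d <-> core %d: ~%.1f ns per one-way hop\n",
           core_a, core_b, ns / (2.0 * ITERS));
    return 0;
}
```

Fixing core 0 and sweeping the other core from 1 to 63 should reproduce the kind of staircase described above.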
 
  • Like
Reactions: moinmoin

moinmoin

Diamond Member
Jun 1, 2017
5,236
8,443
136
There is an interesting experiment probing the sIOD's interconnect by testing core-to-core ping latency at Chiphell https://www.chiphell.com/thread-2183951-1-1.html (Google Translate is needed).
The findings in short: the ping from core 0 to cores 1-63 shows a 5-step staircase (inside the CCX -> CCX attached to the same IOD slice: +5~10 ns -> neighbouring IOD slice: +10~20 ns -> next IOD slice: +5~10 ns -> farthest slice),
which indicates (the author said, and I quote) that the 4 cIOD slices inside the sIOD are likely connected in a chain (so-called half-ring) topology rather than a crossbar or a closed ring.
Thanks for the heads up. The chart they ("Kagamine"?) created is also plenty insightful.

122445j2jjdr3iieo2gt2sdka2.png


Surprising that they didn't use a crossbar and didn't even close the ring. Guess the decrease in latency wasn't worth the extra power the higher bandwidth requirement would have cost.
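The staircase is also easy to sanity-check against candidate topologies on paper. A throwaway sketch (purely illustrative, nothing measured here) that prints the hop count between every pair of the four slices for a chain, a closed ring and a crossbar:

```c
/* Hop counts between every pair of the four IOD slices under three
 * candidate topologies: a chain (0-1-2-3 line), a closed ring and a
 * full crossbar. Purely illustrative, nothing measured.
 */
#include <stdio.h>

#define N 4

int main(void)
{
    for (int src = 0; src < N; src++) {
        for (int dst = 0; dst < N; dst++) {
            int d     = dst > src ? dst - src : src - dst;  /* |src - dst| */
            int chain = d;                                  /* walk the line */
            int ring  = d < N - d ? d : N - d;              /* shorter way round */
            int xbar  = d ? 1 : 0;                          /* one hop to anyone */
            printf("slice %d -> %d: chain=%d ring=%d xbar=%d\n",
                   src, dst, chain, ring, xbar);
        }
    }
    return 0;
}
```

From an end node a chain yields four distinct levels (0/1/2/3 hops), a closed ring only three (0/1/2) and a crossbar two, which is why the 5-step staircase (one intra-CCX level plus four slice levels) points at a chain.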

Edit: May be worth also sharing Fritzchens Fritz's dieshot of the server IOD:


Notable is that while the four nodes are mostly structured similarly, the structure in the gap right next to the MC in each of the four corners is very distinct. In general the uncore/IOD still contains plenty of "secret sauce" that people haven't (bothered to) figure out yet.
 
Last edited:
  • Like
Reactions: lightmanek

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
Thanks for the heads up. The chart they ("Kagamine"?) created is also plenty insightful.

122445j2jjdr3iieo2gt2sdka2.png


Surprising that they didn't use a crossbar and didn't even close the ring. Guess the decrease in latency wasn't worth the extra power the higher bandwidth requirement would have cost.

Edit: May be worth also sharing Fritzchens Fritz's dieshot of the server IOD:


Notable is that while the four nodes are mostly structured similarly, the structure in the gap right next to the MC in each of the four corners is very distinct. In general the uncore/IOD still contains plenty of "secret sauce" that people haven't (bothered to) figure out yet.

This topology is still more a hypothesis than a confirmed conclusion, and the debate is heated at CHH. For now the conclusion seems correct (the 5-step ladder is confirmed by a cross-core data-coherency ping-pong tool (cache-irrelevant) by LambdaDelta https://www.chiphell.com/forum.php?mod=redirect&goto=findpost&ptid=2183951&pid=44195238 and by Intel's cache-transfer ping-pong tool https://www.chiphell.com/thread-2184273-1-1.html).
But I still cannot believe AMD uses such a stupid topology. A closed loop would be much better.
I am actually here to find out whether there is any more evidence.
 

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
That would be a bidirectional closed ring.
unidirectional means one-direction?
Sorry for that confusion.
The author's reply: a unidirectional ring was my previous conjecture. Now, after several people have tested it over several nights, it can be determined that it is linear.


For a linear topology, the nodes in the middle will have a different latency profile from the end nodes, which is what is observed.
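To put toy numbers on that (my own illustration of the argument, not the author's data): in a 4-slice chain 0-1-2-3, an end slice is 1, 2 and 3 hops from the other slices, while a middle slice is at most 2 hops from anything; in a closed ring every slice sees the same 1-1-2 profile. So the mere fact that different source cores show differently shaped staircases already rules out a closed ring.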
 
  • Like
Reactions: moinmoin and NTMBK

111alan

Junior Member
Mar 15, 2017
5
5
51
I'm the creator of both topology diagrams. I'm currently settling for AMD's declared numbers instead of my own tested numbers, to play it safe. I still need a better tool to test out those latencies. On Ryzen CPUs Intel's mlc seems to fail if I set -C too small (like -C8), so it's hard to get a clean latency from mlc on Ryzen. For LambdaDelta's tool the margin of error is too big for us to see a clear topology. The "cleanest" result came from cache_to_cache, but I forgot who I got this test from, nor do I have the source code to see how it works.

Here are the two tools I currently have (mlc with scripts for 32-64c Rome) and cache_to_cache. Disable SMT before testing.
link:https://pan.baidu.com/s/1-88Nam-xCWy9USUGVSUdRg
pw:vaj2

Here's mlc's official site:

And LambdaDelta doesn't seem to be pleased if other people share his code elsewhere...

Request for progress... It seems that AMD wants to advertise their I/O die as a monolith instead of 4 slices; if that's true we probably won't ever get an official reveal.
Official numbers:
EPYC2-Rome-Interconnections.png

Update: Seems there's a ring after all, tested with NPS4 and mlc latency matrix:
EPYC2-Rome-Interconnections-speculation.png

Latency map generated by cache_to_cache (core 0 to 1, then to 2, etc., then core 1 to 2, etc., until the second-to-last core to the last core). I wonder if there is a way to find out how this small tool really works.
Lat.png
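For what it's worth, regenerating that iteration order is trivial even without knowing what cache_to_cache does internally. A driver sketch (measure_pair() is a hypothetical placeholder for whichever core-to-core probe one prefers, e.g. an atomic ping-pong; this is not the cache_to_cache tool itself):

```c
/* Driver sketch for an upper-triangular core-to-core latency map in
 * exactly that order: core 0 vs 1..N-1, then core 1 vs 2..N-1, and so
 * on. measure_pair() is a hypothetical placeholder (it just returns 0
 * here so the sketch compiles) -- swap in a real probe.
 */
#include <stdio.h>

#define NCORES 64

static double measure_pair(int a, int b)
{
    (void)a; (void)b;
    return 0.0;                          /* placeholder: insert a real probe */
}

int main(void)
{
    static double lat[NCORES][NCORES];

    for (int i = 0; i < NCORES; i++)
        for (int j = i + 1; j < NCORES; j++)
            lat[i][j] = measure_pair(i, j);

    for (int i = 0; i < NCORES; i++) {
        for (int j = i + 1; j < NCORES; j++)
            printf("%7.1f", lat[i][j]);
        printf("\n");
    }
    return 0;
}
```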

but still, have a nice day.
 
Last edited:

moinmoin

Diamond Member
Jun 1, 2017
5,236
8,443
136
Seems that AMD want to advertise their IO die to be a monolith instead of 4 slices
Actually there is a setting on Epyc Rome chips, NPS (NUMA per socket, similar to Intel's SNC/Sub-NUMA Clustering), that goes up to 4 and as such allows partitioning the chip so that each of the four slices accesses only its local IMC.

Thanks to avoiding the chain between slices, this increases memory bandwidth tremendously, as STH showed (but conversely turns the chip into something like Epyc Naples, just with double the amount of cores per NUMA domain):
AMD-EPYC-7002-NPS-Impact-on-Stream-Bandwidth.jpg
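For anyone mapping software onto this: under NPS4 the usual libnuma recipe is what keeps a working set on the local quadrant's two channels. A minimal sketch (generic libnuma usage, nothing Rome-specific; link with -lnuma):

```c
/* Minimal NPS4-friendly allocation sketch: find the NUMA node of the
 * CPU we are running on and allocate the working set there, so a
 * STREAM-like kernel only touches the two DRAM channels behind the
 * local IOD quadrant. Generic libnuma usage, nothing Rome-specific.
 * Build: gcc -O2 nps4_alloc.c -lnuma
 */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);    /* the local quadrant under NPS4 */

    size_t bytes = 256UL << 20;          /* 256 MiB working set */
    double *buf  = numa_alloc_onnode(bytes, node);
    if (!buf)
        return 1;

    memset(buf, 0, bytes);               /* fault the pages in on that node */
    printf("cpu %d: %zu MiB allocated on NUMA node %d\n",
           cpu, bytes >> 20, node);

    numa_free(buf, bytes);
    return 0;
}
```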
 

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
Actually there is a setting on Epyc Rome chips, NPS (NUMA per socket, similar to Intel's SNC/Sub-NUMA Clustering), that goes up to 4 and as such allows partitioning the chip so that each of the four slices accesses only its local IMC.

Thanks to avoiding the chain between slices, this increases memory bandwidth tremendously, as STH showed (but conversely turns the chip into something like Epyc Naples, just with double the amount of cores per NUMA domain):
AMD-EPYC-7002-NPS-Impact-on-Stream-Bandwidth.jpg
NPS4 will be better for Rome because interleaving across all 8 memory channels will make the cross-die XGMI links the bottleneck, since they run at memclk/2, about 2 channels' worth of memory bandwidth?
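Rough numbers for that intuition (my back-of-the-envelope, taking the memclk/2 link assumption above at face value): one DDR4-3200 channel peaks at 3200 MT/s × 8 B ≈ 25.6 GB/s, so each quadrant's two local channels give ~51.2 GB/s and all eight channels ~204.8 GB/s. With NPS1 interleaving, roughly 6 of every 8 accesses from a quadrant target DRAM behind the other three slices and have to cross the inter-slice links; if each of those links only carries on the order of a 2-channel equivalent, they cap throughput well below the 8-channel figure, which would fit the NPS1 vs NPS4 gap in the STREAM chart above.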
 

111alan

Junior Member
Mar 15, 2017
5
5
51
Actually there is a setting on Epyc Rome chips, NPS (NUMA per socket, similar to Intel's SNC/Sub-NUMA Clustering), that goes up to 4 and as such allows partitioning the chip so that each of the four slices accesses only its local IMC.

Thanks to avoiding the chain between slices, this increases memory bandwidth tremendously, as STH showed (but conversely turns the chip into something like Epyc Naples, just with double the amount of cores per NUMA domain):
AMD-EPYC-7002-NPS-Impact-on-Stream-Bandwidth.jpg
I did know that. And here's a quote from AMD's engineers in AnandTech's Rome review:
"In NPS4, the NUMA domains are reported to software in such a way that chiplets always access the near (2 channels) DRAM. In NPS1 the 8ch are hardware-interleaved and there is more latency to get to further ones. It varies by pairs of DRAM channels, with the furthest one being ~20-25ns (depending on the various speeds) further away than the nearest. Generally, the latencies are +~6-8ns, +~8-10ns, +~20-25ns in pairs of channels vs the physically nearest ones."

This adds more to the mystery. The furthest pair is too far away compared with the closer two. Three clumped-together nodes and one separated node doesn't seem reasonable to me.
 

111alan

Junior Member
Mar 15, 2017
5
5
51
Update: Set the 7502P to NPS4 and tested with mlc; this is the latency matrix we get. Seems there's a ring after all, but it's a rectangular ring.
NPS-4_mlc.png
 
  • Like
Reactions: moinmoin