> AMD haven't done a ring since R600; with 4 IODs, likely just a full mesh.

Some interesting experiment to scope into the sIOD's interconnect by testing core-to-core ping, over at Chiphell: https://www.chiphell.com/thread-2183951-1-1.html (Google Translate is needed).
Discovery in short: the ping from core 0 to cores 1-63 shows a 5-step stair (inside the CCX -> connected to the same IOD slice -> +5~10 ns to the neighbouring IOD slice -> +10~20 ns to the next IOD slice -> +5~10 ns to the farthest slice), which indicates (the author said, and I quote) that the four cIOD-like slices inside the sIOD are likely connected in a chain (so-called half-ring) topology rather than a crossbar or a closed ring.
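For reference (this is not the Chiphell author's code), a minimal sketch of how such a core-to-core ping is usually measured on Linux: pin two threads to two cores, bounce an atomic flag between them, and average the round-trip time. The core numbers, iteration count and `gcc -O2 -pthread` build are illustrative assumptions.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 1000000

/* Token bounced between the two pinned cores; keep it on its own cache line. */
static _Atomic int flag __attribute__((aligned(64))) = 0;

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Second thread: wait for the "ping" (1), answer with the "pong" (2). */
static void *responder(void *arg)
{
    pin_to_core(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;
        atomic_store_explicit(&flag, 2, memory_order_release);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    /* e.g. ./ping 0 32 to cross CCD/IOD-slice boundaries; which pairs are
       interesting depends on how the OS numbers the cores on that machine. */
    int core_a = argc > 1 ? atoi(argv[1]) : 0;
    int core_b = argc > 2 ? atoi(argv[2]) : 1;
    pthread_t t;
    struct timespec t0, t1;

    pthread_create(&t, NULL, responder, &core_b);
    pin_to_core(core_a);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);   /* ping */
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2)
            ;                                                     /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("core %d <-> core %d: %.1f ns round trip, ~%.1f ns one way\n",
           core_a, core_b, ns / ITERS, ns / ITERS / 2.0);
    return 0;
}
```

Plotting the averaged latency from core 0 to every other core is what produces the stair chart referenced in the thread.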
Thanks for the heads up. The chart they ("Kagamine"?) created is also plenty insightful.
![]()
Surprising that they didn't use a crossbar and didn't even close the ring. Guess the decrease in latency wasn't worth the extra power for links with such a high bandwidth requirement.
Edit: May be worth also sharing Fritzchens Fritz's die shot of the server IOD:
Notable is that while the four nodes are mostly similarly structured, in the four corners the structure of the gap right next to the MC is very distinct. In general the uncore/IOD still contains plenty of "secret sauce" that people haven't (bothered to) figure out yet.
> Would those results not line up with a closed-but-unidirectional ring?

A closed ring would be a 4-step ladder: inside the CCX -> inside the IOD slice -> closest slice -> diagonal slice.
That would be a bidirectional closed ring.
Unidirectional means one direction?
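To make the step counting in this exchange concrete, here is a small sketch that enumerates the slice-to-slice hop counts a 4-slice open chain, a unidirectional ring and a bidirectional ring would produce as seen from slice 0 (for the chain, that means an end slice). Adding the inside-the-CCX level on top of the distinct hop counts gives the expected number of stairs.

```c
#include <stdio.h>

#define SLICES 4

/* Open chain 0-1-2-3: no wrap-around link. */
static int chain(int a, int b)
{
    int d = b - a;
    return d < 0 ? -d : d;
}

/* Closed ring with traffic allowed in one direction only. */
static int ring_uni(int a, int b)
{
    return (b - a + SLICES) % SLICES;
}

/* Closed ring with traffic taking the shorter of the two directions. */
static int ring_bi(int a, int b)
{
    int fwd = (b - a + SLICES) % SLICES;
    int rev = SLICES - fwd;
    return fwd < rev ? fwd : rev;
}

static void report(const char *name, int (*hops)(int, int))
{
    int seen[SLICES] = {0};
    int levels = 0;

    printf("%-20s hops from slice 0:", name);
    for (int s = 0; s < SLICES; s++) {
        int d = hops(0, s);
        printf(" %d", d);
        if (!seen[d]) {
            seen[d] = 1;
            levels++;
        }
    }
    /* +1 for the inside-the-CCX level that sits below the same-slice one. */
    printf("  -> %d distinct slice levels, %d latency steps overall\n",
           levels, levels + 1);
}

int main(void)
{
    report("open chain", chain);
    report("unidirectional ring", ring_uni);
    report("bidirectional ring", ring_bi);
    return 0;
}
```

The bidirectional ring collapses the two adjacent slices into one level and gives the 4-step ladder described above, while the open chain (measured from an end slice) and the unidirectional ring both give 5 steps; measurements started from cores on the other slices would be one way to tell those two apart.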
> Seems that AMD want to advertise their IO die as a monolith instead of 4 slices.

Actually there is a setting on Epyc Rome chips, NPS (NUMA nodes per socket, similar to Intel's SNC/Sub-NUMA Clustering), that goes up to 4 and as such allows partitioning the four slices so that each accesses only its local IMC.
NPS4 will be better for Rome, because interleaving the 8 memory channels would make the cross-die xGMI the bottleneck; doesn't it run at memclk/2, i.e. roughly 2 channels' worth of memory bandwidth?
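A back-of-the-envelope version of that argument, taking the post's own guess of roughly two channels' worth of bandwidth per cross-die hop at face value. The only hard number used is DDR4-3200 channel peak bandwidth; everything else is an assumption for illustration.

```c
#include <stdio.h>

int main(void)
{
    /* Hard number: one DDR4-3200 channel peaks at 64 bit * 3200 MT/s. */
    double ch_gbs    = 25.6;
    int    channels  = 8;
    int    quadrants = 4;

    /* Assumption from the post above: a cross-die hop carries roughly
       two channels' worth of bandwidth. */
    double link_gbs = 2.0 * ch_gbs;

    double socket_peak = channels * ch_gbs;

    /* With all 8 channels interleaved (NPS1), 6 of the 8 channel targets are
       remote for any given quadrant, so ~3/4 of its DRAM traffic must come
       over the fabric. Treat one such link as the quadrant's ingress (an end
       slice of the chain has exactly one) and ignore multi-hop routing. */
    double remote_fraction = (double)(channels - channels / quadrants) / channels;
    double quadrant_limit  = link_gbs / remote_fraction;

    printf("socket DRAM peak:            %.1f GB/s\n", socket_peak);
    printf("assumed cross-die link:      %.1f GB/s\n", link_gbs);
    printf("remote share under NPS1:     %.0f %%\n", remote_fraction * 100.0);
    printf("rough per-quadrant ceiling:  %.1f GB/s\n", quadrant_limit);
    return 0;
}
```

Under these assumptions a quadrant tops out well below what the 8 channels could supply socket-wide, which is the gist of why NPS4 (keeping each quadrant on its own two channels) helps bandwidth-bound workloads.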
Thanks to avoiding the chain between slices, this increases memory bandwidth tremendously, as STH showed (but conversely turns the chip into something like Epyc Naples, just with double the number of cores per NUMA node):
![]()
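Not the STH benchmark itself, but a quick libnuma sketch (Linux, build with `gcc -O2 nps.c -lnuma`) showing what the NPS setting changes from software's point of view: the number of NUMA nodes and the firmware-reported distance matrix, plus an allocation forced onto one node's local channels. The node number and allocation size are arbitrary examples.

```c
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return 1;
    }

    int nodes = numa_num_configured_nodes();
    printf("configured NUMA nodes: %d\n\n", nodes);

    /* Relative distance matrix as reported by the firmware/kernel (SLIT). */
    printf("      ");
    for (int j = 0; j < nodes; j++)
        printf("node%-3d", j);
    printf("\n");
    for (int i = 0; i < nodes; i++) {
        printf("node%-2d", i);
        for (int j = 0; j < nodes; j++)
            printf(" %5d ", numa_distance(i, j));
        printf("\n");
    }

    /* Memory that should land on a chosen node's local channels, e.g. node 0. */
    size_t sz = 64UL << 20;   /* 64 MiB */
    void *buf = numa_alloc_onnode(sz, 0);
    if (buf) {
        printf("\nallocated %zu MiB on node 0 (local DRAM channels under NPS4)\n",
               sz >> 20);
        numa_free(buf, sz);
    }
    return 0;
}
```

Under NPS1 the whole socket shows up as a single node and the interleaving is invisible to software; under NPS4 the scheduler and allocators can keep threads and their memory on the same slice.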
I did know that. And a quote from AMD's engineering in AnandTech's Rome review:
"In NPS4, the NUMA domains are reported to software in such a way that chiplets always access the near (2 channels) DRAM. In NPS1 the 8ch are hardware-interleaved and there is more latency to get to further ones. It varies by pairs of DRAM channels, with the furthest one being ~20-25ns (depending on the various speeds) further away than the nearest. Generally, the latencies are +~6-8ns, +~8-10ns, +~20-25ns in pairs of channels vs the physically nearest ones."
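Plugging the quoted deltas in: if NPS1 interleaves evenly across the four channel pairs, the average extra DRAM latency versus always hitting the nearest pair comes out to roughly 9-10 ns. This uses midpoints of the quoted ranges; it is a rough estimate, not an AMD figure.

```c
#include <stdio.h>

int main(void)
{
    /* Nearest pair, plus the quoted +~6-8, +~8-10 and +~20-25 ns as midpoints. */
    double delta_ns[4] = { 0.0, 7.0, 9.0, 22.5 };
    double avg = 0.0;

    for (int i = 0; i < 4; i++)
        avg += delta_ns[i] / 4.0;   /* even interleave: 1/4 of accesses per pair */

    printf("average extra latency with NPS1 interleaving: ~%.1f ns\n", avg);
    return 0;
}
```

So the single-node convenience of NPS1 costs on the order of 10 ns of average DRAM latency (plus the cross-slice bandwidth ceiling discussed earlier), which is what NPS4 avoids at the price of exposing four NUMA domains per socket.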