Question: The IF interconnect topology inside Rome's sIOD?

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
It is now very clear from the die shot that Rome's sIOD consists of 4 slices of the cIOD, but it remains quite unclear how these 4 slices are connected (ring, crossbar, or some other kind of topology).
Does anybody have any idea on this?
 

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
AMD haven't done a ring since R600; with 4 IOD slices it's likely just a full mesh.
There is an interesting experiment probing the sIOD's interconnect by testing core-to-core ping latency at Chiphell https://www.chiphell.com/thread-2183951-1-1.html (Google Translate is needed).
The findings in short: the ping from core 0 to cores 1-63 shows a 5-step staircase (inside the CCX -> CCX attached to the same IOD slice: +5~10 ns -> neighbouring IOD slice: +10~20 ns -> next IOD slice: +5~10 ns -> farthest slice),
which indicates (the author said, and I quote) that the 4 cIOD slices inside the sIOD are likely connected in a chain (so-called half-ring) topology rather than a crossbar or a closed ring.
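For anyone who wants to poke at this without the Chiphell tool: the measurement boils down to bouncing a cache line between two pinned threads. A minimal ping-pong sketch along those lines (my own illustration, Linux-specific, not the author's code; the core numbers are just examples):

```c
/* Rough core-to-core "ping" sketch: two threads pinned to different
 * cores bounce a shared atomic flag, and the round-trip time divided
 * by the number of bounces approximates the core-to-core latency.
 * Linux-specific (pthread_setaffinity_np), illustration only -- this
 * is not the Chiphell author's tool. Build: gcc -O2 -pthread ping.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Atomic int flag = 0;            /* 1: ponger's turn, 0: pinger's turn */

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *ponger(void *arg)
{
    pin_to_core(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                           /* wait for the ping */
        atomic_store_explicit(&flag, 0, memory_order_release);  /* pong */
    }
    return NULL;
}

int main(void)
{
    int core_a = 0, core_b = 1;         /* example pair; sweep core_b over 1..63 */
    pthread_t t;
    pthread_create(&t, NULL, ponger, &core_b);
    pin_to_core(core_a);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);  /* ping */
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                           /* wait for the pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("core %d <-> core %d: ~%.1f ns per one-way hop\n",
           core_a, core_b, ns / (2.0 * ITERS));
    return 0;
}
```

Fixing core 0 and sweeping the other core from 1 to 63 should reproduce the kind of staircase described above.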
 
  • Like
Reactions: moinmoin

moinmoin

Diamond Member
Jun 1, 2017
5,236
8,443
136
There is an interesting experiment probing the sIOD's interconnect by testing core-to-core ping latency at Chiphell https://www.chiphell.com/thread-2183951-1-1.html (Google Translate is needed).
The findings in short: the ping from core 0 to cores 1-63 shows a 5-step staircase (inside the CCX -> CCX attached to the same IOD slice: +5~10 ns -> neighbouring IOD slice: +10~20 ns -> next IOD slice: +5~10 ns -> farthest slice),
which indicates (the author said, and I quote) that the 4 cIOD slices inside the sIOD are likely connected in a chain (so-called half-ring) topology rather than a crossbar or a closed ring.
Thanks for the heads up. The chart they ("Kagamine"?) created is also plenty insightful.

122445j2jjdr3iieo2gt2sdka2.png


Surprising that they didn't use a crossbar and didn't even close the ring. Guess the decrease in latency wasn't worth the extra power the higher bandwidth requirement would have cost.
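The staircase is also easy to sanity-check against candidate topologies on paper. A throwaway sketch (purely illustrative, nothing measured here) that prints the hop count between every pair of the four slices for a chain, a closed ring and a crossbar:

```c
/* Hop counts between every pair of the four IOD slices under three
 * candidate topologies: a chain (0-1-2-3 line), a closed ring and a
 * full crossbar. Purely illustrative, nothing measured.
 */
#include <stdio.h>

#define N 4

int main(void)
{
    for (int src = 0; src < N; src++) {
        for (int dst = 0; dst < N; dst++) {
            int d     = dst > src ? dst - src : src - dst;  /* |src - dst| */
            int chain = d;                                  /* walk the line */
            int ring  = d < N - d ? d : N - d;              /* shorter way round */
            int xbar  = d ? 1 : 0;                          /* one hop to anyone */
            printf("slice %d -> %d: chain=%d ring=%d xbar=%d\n",
                   src, dst, chain, ring, xbar);
        }
    }
    return 0;
}
```

From an end node a chain yields four distinct levels (0/1/2/3 hops), a closed ring only three (0/1/2) and a crossbar two, which is why the 5-step staircase (one intra-CCX level plus four slice levels) points at a chain.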

Edit: May be worth also sharing Fritzchens Fritz's dieshot of the server IOD:


Notable is that while the four nodes are mostly structured similarly, the structure in the gap right next to the MC in each of the four corners is very distinct. In general the uncore/IOD still contains plenty of "secret sauce" that people haven't (bothered to) figure out yet.
 
Last edited:
  • Like
Reactions: lightmanek

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
Thanks for the heads up. The chart they ("Kagamine"?) created is also plenty insightful.

122445j2jjdr3iieo2gt2sdka2.png


Surprising that they didn't use a crossbar and didn't even close the ring. Guess the decrease in latency wasn't worth the extra power the higher bandwidth requirement would have cost.

Edit: May be worth also sharing Fritzchens Fritz's dieshot of the server IOD:


Notable is that while the four nodes are mostly structured similarly, the structure in the gap right next to the MC in each of the four corners is very distinct. In general the uncore/IOD still contains plenty of "secret sauce" that people haven't (bothered to) figure out yet.

This topology is still more a hypothesis than a confirmed conclusion, and the debate is heated at CHH. For now the conclusion seems correct (the 5-step ladder is confirmed by a cross-core data-coherency ping-pong tool (cache-irrelevant) by LambdaDelta https://www.chiphell.com/forum.php?mod=redirect&goto=findpost&ptid=2183951&pid=44195238 and by Intel's cache-transfer ping-pong tool https://www.chiphell.com/thread-2184273-1-1.html).
But I still cannot believe AMD uses such a stupid topology. A closed loop would be much better.
I am actually here to find out whether there is any more evidence.
 

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
That would be a bidirectional closed ring.
unidirectional means one-direction?
Sorry for that confusion.
The author's reply: a unidirectional ring was my previous conjecture. Now, after several people have tested it over several nights, it can be determined that it is linear.


For a linear topology, the nodes in the middle will have a different latency profile from the end nodes, which is what is observed.
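To put toy numbers on that (my own illustration of the argument, not the author's data): in a 4-slice chain 0-1-2-3, an end slice is 1, 2 and 3 hops from the other slices, while a middle slice is at most 2 hops from anything; in a closed ring every slice sees the same 1-1-2 profile. So the mere fact that different source cores show differently shaped staircases already rules out a closed ring.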
 
  • Like
Reactions: moinmoin and NTMBK

111alan

Junior Member
Mar 15, 2017
5
5
51
I'm the creator of both topology diagrams. I'm currently settling for AMD's declared numbers instead of my own tested numbers, to play it safe. I still need a better tool to test out those latencies. On Ryzen CPUs Intel's mlc seems to fail if I set -C too small (like -C8), so it's hard to get a clean latency from mlc on Ryzen. For LambdaDelta's tool the margin of error is too big for us to see a clear topology. The "cleanest" result came from cache_to_cache, but I forgot who I got this test from, nor do I have the source code to see how it works.

Here are the two tools I currently have (mlc with scripts for 32-64c Rome) and cache_to_cache. Disable SMT before testing.
link:https://pan.baidu.com/s/1-88Nam-xCWy9USUGVSUdRg
pw:vaj2

Here's mlc's official site:

And LambdaDelta doesn't seem to be pleased if other people share his code elsewhere...

Request for progress... It seems that AMD wants to advertise their I/O die as a monolith instead of 4 slices; if that's true we probably won't ever get an official reveal.
Official numbers:
EPYC2-Rome-Interconnections.png

Update: Seems there's a ring after all, tested with NPS4 and mlc latency matrix:
EPYC2-Rome-Interconnections-speculation.png

Latency map generated by cache_to_cache (core 0 to 1, then to 2, etc., then core 1 to 2, etc., until the second-to-last core to the last core). I wonder if there is a way to find out how this small tool really works.
Lat.png
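For what it's worth, regenerating that iteration order is trivial even without knowing what cache_to_cache does internally. A driver sketch (measure_pair() is a hypothetical placeholder for whichever core-to-core probe one prefers, e.g. an atomic ping-pong; this is not the cache_to_cache tool itself):

```c
/* Driver sketch for an upper-triangular core-to-core latency map in
 * exactly that order: core 0 vs 1..N-1, then core 1 vs 2..N-1, and so
 * on. measure_pair() is a hypothetical placeholder (it just returns 0
 * here so the sketch compiles) -- swap in a real probe.
 */
#include <stdio.h>

#define NCORES 64

static double measure_pair(int a, int b)
{
    (void)a; (void)b;
    return 0.0;                          /* placeholder: insert a real probe */
}

int main(void)
{
    static double lat[NCORES][NCORES];

    for (int i = 0; i < NCORES; i++)
        for (int j = i + 1; j < NCORES; j++)
            lat[i][j] = measure_pair(i, j);

    for (int i = 0; i < NCORES; i++) {
        for (int j = i + 1; j < NCORES; j++)
            printf("%7.1f", lat[i][j]);
        printf("\n");
    }
    return 0;
}
```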

but still, have a nice day.
 
Last edited:

moinmoin

Diamond Member
Jun 1, 2017
5,236
8,443
136
Seems that AMD want to advertise their IO die to be a monolith instead of 4 slices
Actually there is a setting on Epyc Rome chips, NPS (NUMA per socket, similar to Intel's SNC/Sub-NUMA Clustering), that goes up to 4 and as such allows partitioning the chip so that each of the four slices accesses only its local IMC.

Thanks to avoiding the chain between slices, this increases memory bandwidth tremendously, as STH showed (but conversely turns the chip into something like Epyc Naples, just with double the amount of cores per NUMA domain):
AMD-EPYC-7002-NPS-Impact-on-Stream-Bandwidth.jpg
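For anyone mapping software onto this: under NPS4 the usual libnuma recipe is what keeps a working set on the local quadrant's two channels. A minimal sketch (generic libnuma usage, nothing Rome-specific; link with -lnuma):

```c
/* Minimal NPS4-friendly allocation sketch: find the NUMA node of the
 * CPU we are running on and allocate the working set there, so a
 * STREAM-like kernel only touches the two DRAM channels behind the
 * local IOD quadrant. Generic libnuma usage, nothing Rome-specific.
 * Build: gcc -O2 nps4_alloc.c -lnuma
 */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);    /* the local quadrant under NPS4 */

    size_t bytes = 256UL << 20;          /* 256 MiB working set */
    double *buf  = numa_alloc_onnode(bytes, node);
    if (!buf)
        return 1;

    memset(buf, 0, bytes);               /* fault the pages in on that node */
    printf("cpu %d: %zu MiB allocated on NUMA node %d\n",
           cpu, bytes >> 20, node);

    numa_free(buf, bytes);
    return 0;
}
```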
 

Gnyueh

Junior Member
Feb 10, 2019
19
5
51
Actually there is a setting on Epyc Rome chips, NPS (NUMA per socket, similar to Intel's SNC/Sub-NUMA Clustering), that goes up to 4 and as such allows partitioning the chip so that each of the four slices accesses only its local IMC.

Thanks to avoiding the chain between slices, this increases memory bandwidth tremendously, as STH showed (but conversely turns the chip into something like Epyc Naples, just with double the amount of cores per NUMA domain):
AMD-EPYC-7002-NPS-Impact-on-Stream-Bandwidth.jpg
NPS4 will be better for Rome because interleaving across all 8 memory channels will make the cross-die XGMI links the bottleneck, since they run at memclk/2, about 2 channels' worth of memory bandwidth?
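Rough numbers for that intuition (my back-of-the-envelope, taking the memclk/2 link assumption above at face value): one DDR4-3200 channel peaks at 3200 MT/s × 8 B ≈ 25.6 GB/s, so each quadrant's two local channels give ~51.2 GB/s and all eight channels ~204.8 GB/s. With NPS1 interleaving, roughly 6 of every 8 accesses from a quadrant target DRAM behind the other three slices and have to cross the inter-slice links; if each of those links only carries on the order of a 2-channel equivalent, they cap throughput well below the 8-channel figure, which would fit the NPS1 vs NPS4 gap in the STREAM chart above.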
 

111alan

Junior Member
Mar 15, 2017
5
5
51
Actually there is a setting on Epyc Rome chips, NPS (NUMA per socket, similar to Intel's SNC/Sub-NUMA Clustering), that goes up to 4 and as such allows partitioning the chip so that each of the four slices accesses only its local IMC.

Thanks to avoiding the chain between slices, this increases memory bandwidth tremendously, as STH showed (but conversely turns the chip into something like Epyc Naples, just with double the amount of cores per NUMA domain):
AMD-EPYC-7002-NPS-Impact-on-Stream-Bandwidth.jpg
I did know that. And here's a quote from AMD's engineers in AnandTech's Rome review:
"In NPS4, the NUMA domains are reported to software in such a way that chiplets always access the near (2 channels) DRAM. In NPS1 the 8ch are hardware-interleaved and there is more latency to get to further ones. It varies by pairs of DRAM channels, with the furthest one being ~20-25ns (depending on the various speeds) further away than the nearest. Generally, the latencies are +~6-8ns, +~8-10ns, +~20-25ns in pairs of channels vs the physically nearest ones."

This adds more to the mystery. The furthest pair is too far away compared with the closer two. Three clumped-together nodes and one separated node doesn't seem reasonable to me.
 

111alan

Junior Member
Mar 15, 2017
5
5
51
Update: Set the 7502P to NPS4 and tested with mlc; this is the latency matrix we get. Seems there's a ring after all, but it's a rectangular ring.
NPS-4_mlc.png
 
  • Like
Reactions: moinmoin