AMD's intra-CCX latency is HALF Intel's latency between cores (the flatness of Intel's solution is a result of having a bidirectional ring bus - you will average the same number of hops over many runs).
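A quick sketch of why a bidirectional ring averages out: every core ends up with the same mean hop count to the others, which is consistent with the flat profile. The 8-core ring size here is just an illustrative assumption, not a figure from this thread.

```python
# On a bidirectional ring bus, every core sees the same average hop count
# to the other cores - which would explain a "flat" latency profile.
# N = 8 is an illustrative assumption, not a number from this thread.
N = 8
per_core = []
for i in range(N):
    # shortest path in either direction around the ring
    dists = [min(abs(i - j), N - abs(i - j)) for j in range(N) if j != i]
    per_core.append(sum(dists) / len(dists))
print(per_core)  # the same average for every core
```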
AMD has an additional ~100ns delay when communicating between cores in different CCXes. What we need to know is whether this is direct communication or whether it goes through main memory. The simplest answer is that it goes through main memory, in which case the latency would be CCX latency + DF latency + IMC latency + DF latency + CCX latency.
44 + ? + 19~30 + ? + 44 = ~140
This would suggest that the data fabric latency itself is actually fairly low - from 11ns to 16.5ns on average. All testing seems to suggest there's no real difference between accessing one CCX versus the other - but there IS one. We can see it in the first image above, where accesses to the left CCX are just a couple of nanoseconds longer than those to the right.
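The arithmetic above can be made explicit in a few lines; all numbers are the ones quoted in this thread (ns).

```python
# Implied per-crossing Data Fabric latency under the through-memory
# hypothesis: 44 + DF + IMC + DF + 44 = ~140 (all values in ns, from above).
ccx = 44      # intra-CCX core-to-core ping
total = 140   # cross-CCX core-to-core ping
# solve for DF at both ends of the quoted IMC latency range (19-30ns)
df = {imc: (total - 2 * ccx - imc) / 2 for imc in (19, 30)}
print(df)  # {19: 16.5, 30: 11.0}
```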
That feature almost certainly indicates that the data is not being communicated directly, but is hitting system memory - with the other thread requesting that data (listening on a port, accessing an address, etc.). The two CCXes would not need to know about the outside world - which has major advantages when it comes to design.
This raises the question - what is Intel doing that hides the ring bus latency from benchmarks and applications? IMC latency to a core appears to average about 80ns - so if Intel were going through memory they shouldn't be able to show 20ns; it should be closer to 100ns. There may be a simple answer - I'm not fully versed in what Intel is doing these days.
But weren't those pings between cores, not L3 caches? So you would have to subtract 20ns from the 140ns figure.
So to access the L3 cache on the second CCX you would need about 120ns.
According to AMD's slides, each L3 is connected to the DF, and memory is connected to the DF on the other side. Going to the second part of the L3 cache through memory makes no sense, as it would imply two DFs with memory connected to both in a mirror-like way:
core==L3==DF==memory==DF==L3==core
while in fact it is like this:
core==L3==DF==L3==core
          ||
        memory
Effective memory latency was about 98ns (I will round it to 100 for convenience).
So:
- core to L3 (20)
- core to memory (100) - from hardware.fr test
- core to another CCX L3 (120) - from core ping result
We don't know the DF latency and we don't know the memory-to-DF latency, BUT:
say "a" is the latency from L3 to DF
say "b" is the latency from DF to memory
say "c" is the latency from core to L3
then
c+a+b=100 (latency of access to memory)
c+a+a=120 (latency of access to L3 from the other CCX) - consistent with c+a+a+c=140 when pinging a core in the other CCX
from that, a=50 and b=30 (using c=20)
"a" is the combined latency of the link between the L3 cache and half a trip through the DF itself
"b" is the combined latency of the link between memory and the other half trip through the DF itself
It is not really possible to tell what the latency of the DF itself is, but we don't need that to tell that access to the second part of L3 is NOT through memory. If it were, the latency would be close to 200ns - and it makes no sense from a designer's point of view anyway, since such a configuration would bring no advantage to either performance OR expandability of the design.
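The bookkeeping above can be sketched directly; every input value is one of the measurements quoted in this thread (ns).

```python
# Solving the two latency equations from above (all values in ns).
c = 20           # core to L3
mem = 100        # core to memory = c + a + b
remote_l3 = 120  # core to the other CCX's L3 = c + a + a
a = (remote_l3 - c) / 2   # L3-to-DF link plus half a trip through the DF
b = mem - c - a           # DF-to-memory link plus the other half trip
direct = c + 2 * a              # the direct cross-CCX path
via_memory = c + 2 * a + 2 * b  # hypothetical detour through memory
print(a, b, direct, via_memory)  # 50.0 30.0 120.0 180.0
```

The hypothetical through-memory path comes out well above the measured 120ns, which is the point being made above.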
Btw, why would 44ns be the CCX latency? If you mean L3 latency, it is about 20ns; the 44ns from the core ping test is a round trip - core=L3=core.