This will be my last OT post in this thread.
https://tu-dresden.de/zih/forschung...benchit/2014_MSPC_authors_version.pdf?lang=en
Code:
location                latency in ns
                        L2      L3      RAM
local                   9.1
2nd core                19.5    27.3    82.3
on-chip                 88.6
2nd die in MCM          129     116     133
other socket, 1 hop     136     123     146
other socket, 2 hops    178     164     187
including probe (max)   198     185     -
Even on Bulldozer, the difference between an on-chip non-local L2 access and an MCM L2 access is only 33%. But then go and look at the different SoC uarchs to see just how much better Zeppelin's is than Bulldozer's.
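For anyone who wants to check that gap against the table, here is a quick sketch of the arithmetic. It assumes the lone "on-chip" value (88.6) is an L2 figure, which is how I read the table; the function names are just made up for the example.

```python
# L2 latencies in ns, copied from the Bulldozer table above.
# Assumption: the lone "on-chip" value (88.6) belongs to the L2 column.
L2_ON_CHIP_NS = 88.6
L2_MCM_NS = 129.0


def increase_over_base(base_ns: float, other_ns: float) -> float:
    """Extra latency of other_ns, as a fraction of base_ns."""
    return (other_ns - base_ns) / base_ns


def share_of_total(base_ns: float, other_ns: float) -> float:
    """Extra latency of other_ns, as a fraction of other_ns itself."""
    return (other_ns - base_ns) / other_ns


mcm_vs_on_chip = increase_over_base(L2_ON_CHIP_NS, L2_MCM_NS)
mcm_extra_share = share_of_total(L2_ON_CHIP_NS, L2_MCM_NS)

print(f"MCM L2 is {mcm_vs_on_chip:.1%} slower than on-chip L2")
print(f"{mcm_extra_share:.1%} of the MCM L2 latency is the extra hop")
```

Which percentage you get depends on the base you pick: measured against the MCM figure the gap is a bit under a third, measured against the on-chip figure it is closer to a half.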
Bulldozer used the same HT interfaces within the MCM and between sockets. It also had the SRQ, which was one giant bottleneck, even for accessing the directory cache. To get to the other chip in the package you had to go SRQ -> crossbar -> HT -> crossbar -> SRQ -> directory/L3/L2.
Zen is much better: the cache directories are attached directly to the UMCs, the UMCs/GMIs/IO hub/core complexes are all attached to the fabric, and the L3 holds tags for the L2s within a CCX, so it is a much more scalable solution.
Then just look at the size and number of PHYs for the GMI interfaces. Sure, they are going to cost you some power to go over the interposer, but nothing like PCIe or memory accesses. Fudzilla's leaks (everything else in them has been correct) had each of those GMI interfaces at 25 GB/s, which is roughly twice the bandwidth of BD's HT (12.8 GB/s), so we are looking at around 200 GB/s of GMI bandwidth for each SoC. What we don't know is whether it is a full mesh or a ring (I think a ring is more likely, with 4 controllers and 8 PHYs and only 4 stops).
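The bandwidth numbers are simple enough to sanity-check. This is only a sketch: it assumes each of the 8 PHYs is one 25 GB/s link and that all of them can run at full rate at once, neither of which is confirmed.

```python
GMI_LINK_GB_S = 25.0  # per-link figure from the leak
BD_HT_GB_S = 12.8     # Bulldozer HT figure quoted above
GMI_PHYS_PER_SOC = 8  # assumption: each of the 8 PHYs is one 25 GB/s link

# How much faster one GMI link is than one Bulldozer HT link.
gmi_vs_ht = GMI_LINK_GB_S / BD_HT_GB_S

# Aggregate GMI bandwidth per SoC if every link runs at full rate.
aggregate_gb_s = GMI_LINK_GB_S * GMI_PHYS_PER_SOC

print(f"GMI link vs BD HT: {gmi_vs_ht:.2f}x")
print(f"Aggregate GMI bandwidth per SoC: {aggregate_gb_s:.0f} GB/s")
```

So "twice" is really a touch under 2x per link, and the 200 GB/s figure only holds if all 8 links are counted.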
So then it comes down to latency. We don't know what it is, but I'll bet you it's no more than the 40 ns of BD; I reckon it will be around the 20 ns mark per hop, just like inter-CCX on the same chip. Just look at BD: the difference between inter-socket and inter-MCM is only 13 ns, so distance isn't a big contributor, and we know Zeppelin has been designed from the ground up to have a distributed memory hierarchy.
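To put rough numbers on that, here is a small sketch comparing a ring and a full mesh. The 20 ns per hop is only my guess from above and the 4 stops are an assumption about the topology, so treat the output as illustration, not data. The last part just subtracts the Bulldozer table rows to show where the 13 ns comes from.

```python
from itertools import combinations

PER_HOP_NS = 20.0  # the guess from the post, not a measurement
STOPS = 4          # assumption: four stops on the fabric


def ring_hops(a: int, b: int, stops: int) -> int:
    """Shortest hop count between two stops on a bidirectional ring."""
    d = abs(a - b)
    return min(d, stops - d)


def mesh_hops(a: int, b: int) -> int:
    """In a full mesh every other stop is exactly one hop away."""
    return 0 if a == b else 1


pairs = list(combinations(range(STOPS), 2))
ring = [ring_hops(a, b, STOPS) for a, b in pairs]
mesh = [mesh_hops(a, b) for a, b in pairs]

ring_worst_ns = max(ring) * PER_HOP_NS
ring_avg_ns = sum(ring) / len(ring) * PER_HOP_NS
mesh_worst_ns = max(mesh) * PER_HOP_NS

# Bulldozer, from the table above: "other socket, 1 hop" minus "2nd die in MCM".
bd_mcm = {"L2": 129, "L3": 116, "RAM": 133}
bd_socket = {"L2": 136, "L3": 123, "RAM": 146}
bd_extra_ns = {level: bd_socket[level] - bd_mcm[level] for level in bd_mcm}

print(f"ring: worst {ring_worst_ns:.0f} ns, average {ring_avg_ns:.1f} ns")
print(f"mesh: worst {mesh_worst_ns:.0f} ns")
print(f"BD socket vs MCM extra latency (ns): {bd_extra_ns}")
```

With those assumptions a 4-stop ring tops out at two hops, which is where the 40 ns ceiling would come from, while a full mesh would stay at a single hop.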
Last but not least, I'll leave you with this tidbit from an OEM when we were talking about 32-core Naples:
https://forum.beyond3d.com/threads/amd-ryzen-cpu-architecture-for-2017.56187/
Terms and conditions apply here. I can't say the same about DP, but let's say it looks really good in a single CPU configuration.
We know why DP doesn't look as good: GMIx needs to use PCIe lanes, so in 2P you lose half your lanes and have nowhere near the bandwidth of the on-package fabric.