Now imagine if Intel's L3 wasn't inclusive - this is actually the food for thought we should probably focus on. The inclusive L3 is the source of the inter-core latency & bandwidth advantage Intel enjoys. The flip side is that a single high-priority "rogue" thread could destroy most of the benefit of the LLC for everyone else.
Ryzen uses a mostly exclusive L3 per CCX. It keeps a copy of the L2 tags, and the cost of getting data from a neighboring core's L2 is actually quite similar to the cost of getting it from the L3, AFAICT from testing. I can only imagine AMD specifically designed it that way.
I think a future improvement for the L3 would be to prefetch directly into it - that should help with in-page random-access latency and would help some of the algorithms that currently behave badly on Ryzen. There are a lot of caveats to that, though.
From what I can tell on Ryzen, any one core can only evict into 4MB of the L3 - which is why single-threaded cache-latency tests show a sudden latency hit once the working set exceeds 4MB. Every core can read from any part of the L3, though, so there can be multiple cache-tag searches in flight at once.
Some of my testing with a mutex-free user-mode spinlock suggests inter-CCX latency is only 32~60 cycles (one way) for commands or simple data. As soon as the payload exceeds 128 bits, latency seems to skyrocket - but I still have to refine the test; it's pretty cruddy right now.
I think the command bus has much lower latency than the data bus and can even carry a small data payload. That would help explain why Ryzen has amazing multi-threaded scaling even across CCXes while light but data-heavy workloads suffer immensely.