Everything in computing is a balance of complexity vs. speed. IBM's POWER8 seems to use the most efficient core-complex size of three. Note: AMD and IBM worked closely together in the past, and I think AMD's current design borrows quite a bit from the POWER8 architecture.
Intel, on the other hand, used a ring bus until Skylake-SP/X.
The ring bus has the advantage of relatively flat latency up to 8 cores (~80 ns). That is slower than AMD's intra-CCX latency (~40 ns), yet faster than Infinity Fabric between CCXs (~90-140 ns). However, once the ring bus was extended to 16 or more cores, it exhibited the same distance-dependent latency that AMD has. I do not know the exact numbers.
So Intel is going with a mesh bus. Again, I do not know the numbers, but I would surmise latency scales by a nearest-neighbor factor: the farther a request travels, the more hops it incurs.
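To make the distance-dependent latency concrete, here is a toy model (entirely my own illustration; the topology sizes are assumptions and no real per-hop timings are used) comparing worst-case hop counts on a ring versus a 2D mesh as the core count grows:

```python
# Toy model: core-to-core distance on a ring vs. a 2D mesh.
# Illustrative only -- not measured Intel/AMD figures.

def ring_hops(a, b, n):
    """Hop count between stops a and b on an n-stop ring
    (messages take the shorter direction around)."""
    d = abs(a - b)
    return min(d, n - d)

def mesh_hops(a, b, cols):
    """Manhattan distance between cores a and b laid out
    row-major on a grid with `cols` columns."""
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    return abs(ax - bx) + abs(ay - by)

def max_hops(n, dist):
    """Worst-case distance over all core pairs."""
    return max(dist(a, b) for a in range(n) for b in range(n))

for n in (8, 16, 32):
    cols = max(1, round(n ** 0.5))  # roughly square grid
    r = max_hops(n, lambda a, b: ring_hops(a, b, n))
    m = max_hops(n, lambda a, b: mesh_hops(a, b, cols))
    print(f"{n} cores: ring worst case {r} hops, mesh worst case {m} hops")
```

The ring's worst case grows linearly with core count (n/2 hops), while the mesh grows roughly with the square root, which is presumably why Intel switched once core counts passed the point where the ring stayed flat.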
Oh, I'm sure they use parallel hardware searches (all ways of a set compared at once).
The nature of a cache is that you have a block of memory, you subdivide it by way, and then map memory into each block. What we have seen is that the fewer ways you have, the faster the cache can be; the same can be said for cache size. AMD has gone a step further and subdivided the cache per core. I would imagine this has the effect of making it act like a 4 × 16 = 64-way cache with many restrictions, the most notable being that each core can only write to its own victim area.
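To illustrate the way/block subdivision, here is a sketch of how a set-associative cache splits an address into tag, set index, and byte offset. The parameters loosely echo a Zen CCX L3 (8 MB, 16-way, 64-byte lines) but are illustrative assumptions, not a precise model of AMD's hardware:

```python
# Sketch of set-associative address mapping.
# Parameters are assumptions loosely matching a Zen CCX L3,
# not a precise model of AMD's implementation.

LINE_BYTES = 64
WAYS = 16
CACHE_BYTES = 8 * 1024 * 1024
NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 8192 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 6 bits
INDEX_BITS = NUM_SETS.bit_length() - 1          # 13 bits

def decompose(addr):
    """Split a physical address into (tag, set_index, byte_offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# On a lookup, the hardware reads all WAYS tags stored in set `index`
# and compares them against `tag` in parallel. Fewer ways means fewer
# comparators and less wiring, which is one reason lower associativity
# can run faster.
```

Subdividing that cache per core, as the post speculates, would restrict which ways of a set each core may fill while still letting every core hit in any way on a read.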
The whole nature of this speculation is: what is the optimal complexity to run at the fastest speed?