We have. But I cannot remember any consensus building around your interpretation. Actually, I'm surprised that you maintain this interpretation and present it as fact. For those interested, see discussion earlier in this thread:
I have seen and read that; at some point I just stopped bothering to respond.
I think the conventional interpretation, as described in AMD's presentations and slides, is the correct one: the L3 cache controller acts as a crossbar between the 4 cache slices in a CCX, requiring 6 links for a fully connected topology.
That is not the conventional interpretation, and it is not backed by AMD's slides.
You are basing too much of your interpretation on a few arrows on a slide that makes no attempt to accurately portray the situation. The "L3 is a crossbar" interpretation makes no sense whatsoever from an engineering standpoint, so it needs substantial evidence behind it; extraordinary claims require extraordinary evidence, and absent that evidence it should be discarded. In contrast, there is substantial evidence for a fully connected topology from examining how code actually runs on the chip.
I'm surprised that you come to that conclusion. The way it sounds to me, Clark makes the point that memory interleaving is used to achieve uniform latency on average, thus weakening, not supporting, the L3 crossbar interpretation.
How exactly would address interleaving produce uniform latency in your topology? Just to make sure you get the basics right: every cache line lives in exactly one of the slices, and interleaving is done between adjacent cache lines. That is, cache line 0 (addresses 0x0..0x3f) is in slice 0, cache line 1 (0x40..0x7f) is in slice 1, and so on, until cache line 4 (0x100..0x13f) lands in slice 0 again. It is easy to confirm this by allocating an array that fits into the L3, accessing only every fourth cache line, measuring the throughput, and comparing that to accessing the whole array linearly. Accessing all of it gets substantially higher throughput, because the accesses are spread across all four slices instead of hammering a single one.
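For anyone who wants to reproduce this, here is a minimal sketch of that experiment in C. The 8 MiB L3 capacity, buffer size, and repeat count are assumptions on my part, and hardware prefetching will blur the numbers somewhat:

```c
/* Compare reading every cache line of an L3-sized buffer (spread across
 * all slices under the interleaving hypothesis) against reading every
 * fourth line (confined, under that hypothesis, to a single slice). */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define CACHE_LINE 64
#define BUF_SIZE   (8u << 20)   /* assumed L3 capacity of one CCX: 8 MiB */
#define REPEAT     200

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Read one 64-bit word from every `stride`-th cache line of buf.
 * The volatile qualifier keeps the compiler from hoisting the loads
 * out of the repeat loop. */
static uint64_t sweep(const volatile uint64_t *buf, size_t lines, size_t stride)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < lines; i += stride)
        sum += buf[i * (CACHE_LINE / sizeof(uint64_t))];
    return sum;
}

int main(void)
{
    size_t lines = BUF_SIZE / CACHE_LINE;
    uint64_t *buf = aligned_alloc(CACHE_LINE, BUF_SIZE);
    if (!buf)
        return 1;
    for (size_t i = 0; i < BUF_SIZE / sizeof(uint64_t); i++)
        buf[i] = i;             /* touch everything once to warm the L3 */

    /* stride 1 = all slices; stride 4 = one slice (under interleaving) */
    for (size_t stride = 1; stride <= 4; stride *= 4) {
        uint64_t sink = 0;
        double t0 = now_sec();
        for (int r = 0; r < REPEAT; r++)
            sink += sweep(buf, lines, stride);
        double dt = now_sec() - t0;
        double bytes = (double)(lines / stride) * CACHE_LINE * REPEAT;
        printf("stride %zu: %6.2f GB/s (checksum %llu)\n",
               stride, bytes / dt / 1e9, (unsigned long long)sink);
    }
    free(buf);
    return 0;
}
```

Under the interleaving hypothesis the stride-4 sweep stays within a single slice, which is why its bandwidth comes out substantially lower than the linear sweep's.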
If access to every L3 slice went through the closest one, you would expect a substantially lower latency to that closest slice, yet no such difference is measurable. You can also measure that core 0 accessing L3 slice 3 does not impact the throughput of core 3 accessing L3 slice 0, which is not what you'd expect if they were sharing a link.
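A sketch of that interference test follows. Both the slice mapping (line index mod 4) and the assumption that logical CPU n is physical core n are mine and would need checking on real hardware (SMT siblings, for instance, shift the numbering):

```c
/* Pin one thread to core 0 reading "slice 3" lines and another to
 * core 3 reading "slice 0" lines; compare per-thread bandwidth when
 * run alone versus together. Build with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define CACHE_LINE 64
#define BUF_SIZE   (4u << 20)   /* small enough to sit in the L3 */
#define REPEAT     500

static uint64_t *buf;

struct job { int cpu; int slice; double gbps; };

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *worker(void *arg)
{
    struct job *j = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(j->cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    size_t lines = BUF_SIZE / CACHE_LINE;
    volatile uint64_t sink = 0;  /* volatile: keep the loads alive */
    double t0 = now_sec();
    for (int r = 0; r < REPEAT; r++)
        for (size_t i = j->slice; i < lines; i += 4)  /* one slice only */
            sink += buf[i * (CACHE_LINE / sizeof(uint64_t))];
    double dt = now_sec() - t0;
    j->gbps = (double)(lines / 4) * CACHE_LINE * REPEAT / dt / 1e9;
    (void)sink;
    return NULL;
}

int main(void)
{
    buf = malloc(BUF_SIZE);
    if (!buf)
        return 1;
    for (size_t i = 0; i < BUF_SIZE / sizeof(uint64_t); i++)
        buf[i] = i;

    struct job a = { .cpu = 0, .slice = 3 };  /* core 0 -> far slice */
    struct job b = { .cpu = 3, .slice = 0 };  /* core 3 -> far slice */

    /* Baselines: calling worker() directly pins the main thread. */
    worker(&a);
    printf("core 0 alone:    %.2f GB/s\n", a.gbps);
    worker(&b);
    printf("core 3 alone:    %.2f GB/s\n", b.gbps);

    /* Now both at once. */
    pthread_t ta, tb;
    pthread_create(&ta, NULL, worker, &a);
    pthread_create(&tb, NULL, worker, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("core 0 together: %.2f GB/s\n", a.gbps);
    printf("core 3 together: %.2f GB/s\n", b.gbps);
    free(buf);
    return 0;
}
```

If the two request streams shared a physical link, the "together" numbers would drop below the "alone" numbers; with dedicated per-core links they should be essentially identical.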
You keep repeating that 6 links are fewer than, and therefore better than, 16, while ignoring that your interpretation actually involves 10 links (6 between the slices plus 4 from each core to its closest slice), and that those extra 4 links would have to be different and substantially beefier, because there is a measurable throughput difference between accessing all of the cache and accessing just one slice. And you never consider what, exactly, it is that you are saving.
Your method would cost more power, would potentially require buffering in two places instead of one, and would increase latency, both because of the longer total distance traveled and because it would force a long-distance signal to be brought back down from the uppermost metal layers for no reason. And it would save nothing of consequence.
The physical links themselves are free, because they occupy an area of the die that would otherwise be plain blank. As for routing logic, your interpretation has a 1x4 crossbar at every L3 slice (just because one of those links leads to the L3 slice itself doesn't mean you can leave it out); mine has a 1x4 crossbar at every core. Since there are as many cores as there are L3 slices, that comes out to the same amount of logic. Except that your method would result in multiple routing hops per transmitted line, as opposed to my one hop, so it would actually require more logic for the same total throughput.
Consider the case of core 0 accessing L3 slice 3, and how it would work under heavy contention. Note that the distance between the core and the cache is multiple cycles, so the requesting core cannot know the readiness of the cache at the moment it issues the request.
With a fully connected topology with no shared links:
Immediately after confirming the L2 miss, the core knows which slice the line can potentially be in. Each core has a 1x4 crossbar and can immediately send a request through it to the appropriate slice. Each slice has 4 separate reservation stations that store requests, one for each core. Every time a core sends a request, it consumes a slot, and every time it gets a response, it counts one slot as freed. The core keeps track of the occupancy of its own reservation station at each slice and is only allowed to send a request when a slot is free. This way, you only need one layer of buffering to get full throughput out of any link.
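The bookkeeping this requires is trivial. Here is a toy model of the credit scheme; the slot count and all names are my own illustration, not anything from AMD documentation:

```c
/* Toy model of credit-based flow control: each core keeps a local
 * count of free slots in its private reservation station at every
 * slice, so it never sends a request the slice cannot accept. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_SLICES     4
#define SLOTS_PER_CORE 8   /* assumed depth of one core's station at one slice */

struct core_credits {
    int free_slots[NUM_SLICES];  /* core-local view; no round trip needed */
};

static void init(struct core_credits *c)
{
    for (int s = 0; s < NUM_SLICES; s++)
        c->free_slots[s] = SLOTS_PER_CORE;
}

/* Called when an L2 miss resolves to a slice; fails if no credit left. */
static bool try_send_request(struct core_credits *c, int slice)
{
    if (c->free_slots[slice] == 0)
        return false;            /* stall locally; the slice is never overrun */
    c->free_slots[slice]--;
    return true;
}

/* Called when a response (data or ack) comes back from a slice. */
static void on_response(struct core_credits *c, int slice)
{
    c->free_slots[slice]++;
}

int main(void)
{
    struct core_credits core0;
    init(&core0);
    /* First 8 requests are sent; 9 and 10 stall for lack of credit. */
    for (int i = 0; i < 10; i++)
        printf("request %d to slice 3: %s\n", i,
               try_send_request(&core0, 3) ? "sent" : "stalled (no credit)");
    on_response(&core0, 3);      /* one response frees one slot */
    printf("after one response: %s\n",
           try_send_request(&core0, 3) ? "sent" : "stalled");
    return 0;
}
```

Because the counter lives at the core, the send decision needs no round trip to the slice, and the single layer of buffering at the slice is always sufficient.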
With your topology:
After confirming the L2 miss, the core knows which slice the line is in, but for some reason sends the request to the closest L3 slice instead. At that slice there is a 1x4 crossbar which forwards the request to the actual target slice. Since the link between the closest slice and the one you actually want to talk to might be saturated for reasons core 0 cannot predict, there needs to be a buffer both at this intermediate point and at the final L3 slice.