Dual 4-core CCX in Matisse design.
It seems pretty obvious that the two CCXs in Zeppelin scale perfectly to 4 CCXs using direct-connect. Why not take advantage of that?
This forms a hierarchical, two-layered topology of direct-connected quads (what I, probably somewhat incorrectly, call a quad-tree topology in my OP). Then optimise this topology by adding further connections as far as the metal layers allow, creating a more complex topology that brings down the average latency between any two cores.
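To put a rough number on "brings down average latency", here is a small sketch that counts average hops between cores in such a two-layer topology. Everything in it is an assumption for illustration, not AMD's actual wiring: cores inside a CCX are modelled as fully connected, the baseline links each pair of CCXs through a single port (core 0), and the "optimised" variant adds one link per core index between every pair of CCXs. Hop count is only a crude proxy for latency, and the names `build_topology` and `average_hops` are made up for this example.

```python
from collections import deque
from itertools import combinations


def build_topology(num_ccx=4, cores_per_ccx=4, extra_links=False):
    """Build an adjacency map for a hypothetical two-layer CCX topology.

    Nodes are (ccx, core) tuples. This is an illustrative model only.
    """
    adj = {(c, i): set() for c in range(num_ccx) for i in range(cores_per_ccx)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    # Layer 1: cores inside one CCX are direct-connected to each other.
    for c in range(num_ccx):
        for i, j in combinations(range(cores_per_ccx), 2):
            link((c, i), (c, j))

    # Layer 2: every pair of CCXs is direct-connected.
    # Baseline (assumed): one link per CCX pair, through core 0.
    # Optimised (assumed): one link per core index, as if spare
    # metal layers allowed the extra wiring.
    ports = range(cores_per_ccx) if extra_links else [0]
    for a, b in combinations(range(num_ccx), 2):
        for i in ports:
            link((a, i), (b, i))

    return adj


def average_hops(adj):
    """Average shortest-path length (in hops) over all core pairs, via BFS."""
    total = 0
    pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs


if __name__ == "__main__":
    base = average_hops(build_topology(extra_links=False))
    optimised = average_hops(build_topology(extra_links=True))
    print(f"baseline:  {base:.2f} hops on average")
    print(f"optimised: {optimised:.2f} hops on average")
```

Under these assumptions the 16-core baseline averages 2.2 hops and the variant with extra links averages 1.6. The absolute numbers mean little, but the direction is the point: extra connections within the die shorten the average path between any two cores.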
Then connect up to 4 of these 4-CCX dies together using direct-connect on the package, as they currently do. This avoids yet another sub-optimal interconnect scheme between the 6 to 8 dies in your approach, which would also require the packaging to change to a large interposer underpinning all the dies.
The simplest options I see:
- If we assume AMD does not move to a chiplet design, then just add two direct-connected 4-core CCXs to the die.
- Assuming AMD moves to a chiplet design, then implement the uncore on an active interposer with 4 x 4-core CCX chiplets mounted on top.
Both approaches can reuse the current MCM packaging scheme.