Allow me to give a fuller explanation of how I arrived at the diagram:
Ever since the rumor that ROME will move to a 9-die configuration surfaced in July, I've been trying to make sense of why AMD might decide to do that. I had assumed that the main motivation was smaller CPU dies on an immature 7nm process, and that ROME would follow the same basic architecture as NAPLES, just extended to 8 CPU dies instead of 4. I also imagined the I/O die to be a simple chip containing only the Management/Security processor, the South Bridge, and maybe some PCIe lanes.
But the tradeoffs didn't make sense:
1. 8ch DDR4 does not have sufficient bandwidth to feed 64 cores. (This issue stems from the 64-core count itself, not specifically from the 9-die layout; see the rough numbers after this list.)
2. 1ch DDR4 per CPU die is a severe bottleneck. An active core can only draw on one channel's worth of bandwidth at any time, whether that memory is local or multiple hops away. With only one channel, there is also no opportunity to use techniques like channel interleaving to hide DRAM latency. Memory utilization efficiency will be very poor.
3. Too many Infinity Fabric (IF) links (at least 7 per CPU die) are required to fully connect 8 CPU dies. This presents major power-consumption and latency issues.
4. Packaging issues: the complex interconnections between the CPU and I/O dies would practically necessitate the use of expensive silicon interposers.
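To put rough numbers on points 1 and 3, here is a quick back-of-the-envelope sketch (the DDR4-3200 speed is my own assumption, not a confirmed ROME spec):

```python
# Rough arithmetic behind points 1 and 3 above.
# Assumption: DDR4-3200, i.e. 25.6 GB/s per 64-bit channel.

channels = 8
bw_per_channel_gbs = 25.6
cores = 64

total_bw = channels * bw_per_channel_gbs                      # 204.8 GB/s
print(f"Total DRAM bandwidth: {total_bw:.1f} GB/s")
print(f"Bandwidth per core:   {total_bw / cores:.1f} GB/s")   # ~3.2 GB/s

# Point 3: a fully connected topology of n dies needs n-1 links per die
# and n*(n-1)/2 links in total.
dies = 8
links_per_die = dies - 1                # 7 IF links per CPU die
total_links = dies * (dies - 1) // 2    # 28 die-to-die links in the package
print(f"{links_per_die} links per die, {total_links} links in the package")
```

Roughly 3 GB/s per core is not much for memory-bound workloads, and 7 IF links hanging off every CPU die is a lot of SERDES area and power.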
Then AdoredTV released a video titled "Intel's Epyc Battle, AMD heads to the Moon" on Sep 14. This video offered some clues. In it, Jim mentioned a couple of important things: (a) ROME is a completely new design, and (b) AMD will move away from NUMA altogether. The UMA rumor, in particular, finally allowed me to unravel the conundrum of 9-die ROME. Referring to my diagram:
1. It would mean that all the memory channels must move to the System Controller or SC die (I prefer to call it that), and you now have one big shared memory space served by 8 DDR4 memory controllers. To answer the question of 8ch DDR4 feeding 64C, I added an L4 eDRAM cache and memory compression. This is purely speculation on my part. Simply collecting all 8 memory controllers together would allow much more flexibility in optimizing the memory controller architecture to improve utilization efficiency (see the interleaving sketch at the end of this post). The L4 cache and/or memory compression may not be needed.
2. The problem of 1ch DDR4 per CPU die no longer applies.
3. In this architecture, there is no need for IF links to connect the CPU dies to each other; the cache-coherent network on the SC die takes care of that. The link between the CPU dies and the SC die must be very low latency. IF serial links may not be appropriate because the SERDES latency would sit directly in the memory data path. A wide, high-speed parallel link may be more appropriate; this could simply be a parallel version of IF (a rough bandwidth calculation is at the end of this post).
4. Serial IF links are still needed for inter-socket connections for 2P configurations. The appropriate thing to do is to move them to the SC die as well.
5. It then follows that the PCIe Gen4 lanes will also move to the SC die since they share the same multi-mode SERDES as IF.
6. All the duplicated blocks like the Management/Security Processor, the Server Controller Hub (aka south bridge), etc. get eliminated, leaving the CPU die with only the cores.
7. There are seemingly many packaging options:
(a) Organic MCM. Since the connections between the CPU dies and the SC die are now very short and direct (at most 2-3 mm), and are located at the edges of the dies, the drivers can be very low power. An organic MCM might be sufficient for the job. This would be the cheapest option.
(b) Passive interposer. Similar to (a) but using a passive silicon interposer in place of the organic substrate. It would offer better performance than an organic MCM but is also much more expensive. The interposer size would exceed the reticle limit and require stitching. In this case, the SC die cannot be made too large, as that would make the interposer even bigger, which means a meaningfully large L4 cache may not be practical. It seems like paying a high price without getting a commensurate payback in performance. I think this option is overkill.
(c) Active interposer. In this case, the CPU dies would be stacked on top of an active interposer which is also the SC die. The interposer would be large but would not exceed the reticle limit. Normally there is no need to use a 14nm node to make this interposer, but if the rumor that the SC die uses 14nm is true, then you might as well make full use of the available area by adding a large L4 eDRAM cache. The result would be a monster! The L4 cache would mitigate both the increased memory latency from moving the memory controllers off-die and the limited bandwidth of 8ch DDR4.
(d) EMIB. Intel's EMIB looks like the perfect packaging option for connecting the CPU dies to the SC die, but obviously AMD can't use EMIB. I consulted someone in the packaging business, and he told me that there are currently no commercially available alternatives to EMIB. Even though Intel claims that EMIB can theoretically accommodate up to 8 bridges per die, in practice it is very difficult to achieve perfect alignment with more than a couple of bridges, so yields would be very bad. Interestingly, AMD has a patent that describes an alternative to EMIB:
https://patents.google.com/patent/US20180102338A1/en?oq=20180102338
But it is not clear if this is what they will use.
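Going back to points 1 and 2 above: here is a toy sketch of why pooling all 8 memory controllers behind the SC die improves utilization. The 256-byte interleave granularity and the function name are purely my own illustrative assumptions; the real mapping is unknown.

```python
# Toy illustration: a unified SC die can interleave the physical address
# space across all 8 DDR4 channels, so even a single streaming core
# engages every channel. (Granularity and mapping are assumed, not known.)

CHANNELS = 8
INTERLEAVE = 256  # bytes per interleave chunk (assumed)

def channel_for_address(addr: int) -> int:
    """Map consecutive 256 B chunks to channels in round-robin order."""
    return (addr // INTERLEAVE) % CHANNELS

# A core streaming 2 KiB of cache lines touches all 8 channels:
touched = {channel_for_address(a) for a in range(0, 2048, 64)}
print(sorted(touched))  # [0, 1, 2, 3, 4, 5, 6, 7]

# Contrast with 1ch per CPU die (my original NAPLES-style assumption):
# the same stream out of local memory would be pinned to a single channel.
```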
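And on point 3, a toy calculation of what a wide parallel CPU-die-to-SC-die link could carry. The 256-bit width and 2 GT/s rate are made-up illustrative figures, not known ROME specs:

```python
# Rough sizing of a parallel die-to-SC link versus one die's DRAM demand.
# All figures here are illustrative assumptions.

total_dram_bw_gbs = 8 * 25.6          # 8ch DDR4-3200 = 204.8 GB/s (assumed)
dies = 8
avg_bw_per_die_gbs = total_dram_bw_gbs / dies   # ~25.6 GB/s average per die

def link_bw_gbs(width_bits: int, rate_gts: float) -> float:
    """Raw bandwidth of a parallel link: (width in bytes) x (transfer rate)."""
    return width_bits / 8 * rate_gts

example = link_bw_gbs(256, 2.0)       # 256-bit @ 2 GT/s = 64 GB/s per direction
print(f"Average DRAM demand per die:   {avg_bw_per_die_gbs:.1f} GB/s")
print(f"Example 256-bit @ 2 GT/s link: {example:.1f} GB/s per direction")
```

Even a modest parallel interface comfortably exceeds one die's average share of DRAM bandwidth, without putting a SERDES in the memory data path.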