And how will the game see it as one GPU?
As I understand it, and I'm by no means an expert, getting games to see it as one GPU is not the hard part, though it would come with its own difficulties.
The problem we're trying to tackle is how to increase GPU performance, and as I understand it there are three currently feasible "physical" ways: increase die size, shrink the process, or go to a chiplet design to increase yields and decrease cost. Maybe someone could add other possible routes; obviously improving the overall microarchitecture can play a huge role too.
The limits, as best I can tell, are fairly well known. Process shrinks will only be an answer for a few more cycles on current electronics; perhaps photonics or quantum is the next route, but that's going to be expensive. You could also just wait until process and silicon costs drop and increase die size, but there are a lot of heat and topology issues that mean each additional transistor doesn't add linearly to the power of the processor: as you add more to the edges, communication between one edge and the other takes longer, which is exacerbated on a larger process, and heat dissipation gets more difficult. A much harder answer, of course, is a chiplet design, because latency and bandwidth matter so much and there is currently no robust communication link that works at low latency over the longer distances involved (i.e., inter-chiplet).
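To put a number on the "each extra transistor helps less" point, here's a toy back-of-envelope model (every number here is made up by me, not taken from any real design): assume raw compute scales with die area but the worst-case cross-die signal distance scales with the diagonal, so part of the gain from a bigger die gets eaten by communication.

```python
# Toy model (invented numbers, illustrative only): compute scales with area,
# but worst-case cross-die wire distance scales with the die diagonal, so
# effective throughput grows sub-linearly as the die gets bigger.
import math

def effective_throughput(area_mm2, base_area=250.0, comm_penalty=0.15):
    """Relative throughput vs. a baseline die, with a crude penalty that
    grows with the ratio of die diagonals (i.e., cross-die distance)."""
    compute = area_mm2 / base_area                         # linear in transistors
    distance = math.sqrt(area_mm2) / math.sqrt(base_area)  # diagonal ratio
    return compute / (1.0 + comm_penalty * (distance - 1.0))

for area in (250, 400, 600, 800):
    print(f"{area} mm^2 -> {effective_throughput(area):.2f}x")
```

With these made-up constants an 800 mm^2 die gives roughly 2.9x the baseline instead of the ideal 3.2x, which is the kind of diminishing return I mean.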
For a GPU chiplet, unlike the Zen2 chiplet design, they'd almost certainly require a direct chiplet-to-chiplet communication link (as best I can tell), because the latency from the distance of two hops (chiplet to I/O die to chiplet) would be unacceptable in a GPU. I think it's already been theorized that they'd need active interposers or some other more robust link directly between chiplets.
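Just as rough arithmetic on why the extra hop matters (the link latency and clock below are placeholders I picked, not figures from any real interconnect):

```python
# Crude hop-count comparison with invented numbers, only to show how an
# extra hop through an I/O die doubles the cross-chiplet round trip.
link_ns = 9          # assumed per-hop link latency in nanoseconds (made up)
gpu_clock_ghz = 1.8  # assumed shader clock (made up)

one_hop_cycles = link_ns * gpu_clock_ghz       # chiplet -> chiplet direct
two_hop_cycles = 2 * link_ns * gpu_clock_ghz   # chiplet -> I/O die -> chiplet

print(f"direct link: ~{one_hop_cycles:.0f} shader cycles")
print(f"via I/O die: ~{two_hop_cycles:.0f} shader cycles")
```

A GPU trying to keep thousands of threads fed can hide some of that, but stalling tens of cycles on every cross-chiplet access is the kind of thing that makes the naive Zen2-style layout look unattractive here.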
So there might be two possible routes for chiplets to become feasible for GPUs: 1) someone figures out a cost-effective and robust solution to the inter-chiplet latency/bandwidth issue, making it feasible to use chiplets; or 2) some new development makes SLI/Crossfire-style delegation of work between chiplets more feasible. Rather than relying on drivers/software, perhaps a hardware scheduler-like solution would be more efficient.
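To make the second option a bit more concrete, here's a purely conceptual sketch of the kind of work delegation I mean: something like the split-frame rendering SLI did in software, but done by a front-end scheduler handing screen tiles to chiplets. The split itself is the easy part; the real difficulty (which this sketch ignores) is all the shared state like geometry, textures, and render targets.

```python
# Conceptual sketch only: a scheduler-style front end delegating screen
# tiles to chiplets in a checkerboard pattern. Real hardware would have to
# deal with shared geometry/texture/render-target state, which is the hard
# part; this just shows the work split.
from collections import defaultdict

def delegate_tiles(screen_w, screen_h, tile=64, num_chiplets=2):
    """Assign each tile of the frame to a chiplet in a checkerboard pattern."""
    assignment = defaultdict(list)
    tiles_x = (screen_w + tile - 1) // tile
    tiles_y = (screen_h + tile - 1) // tile
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            chiplet = (tx + ty) % num_chiplets
            assignment[chiplet].append((tx, ty))
    return assignment

work = delegate_tiles(1920, 1080)
for chiplet, tiles in sorted(work.items()):
    print(f"chiplet {chiplet}: {len(tiles)} tiles")
```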
Another consideration is to shrink the die down to mostly the execution section. 70-75% of a current GPU die is the actual execution hardware (unlike CPUs, which allocate a smaller share of the die to execution), but moving the 25-30% front end off-die (with a 5-10% die-size tax for communication between the front end and the execution units) might still allow reasonable performance gains and some cost reduction. Latency, again, is a big issue once you start talking about having the scheduler and other front-end work done from afar.
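Running the rough numbers from that paragraph (the die size and the exact percentages are just assumptions I picked from the ranges above):

```python
# Rough area arithmetic for the "move the front end off-die" idea,
# using mid-range values from the percentages above (illustrative only).
die_mm2 = 500.0                 # assumed monolithic die size
exec_frac = 0.72                # ~70-75% of the die is execution hardware
front_end_frac = 1.0 - exec_frac
link_tax = 0.08                 # ~5-10% of the die spent on the off-die link

exec_area = die_mm2 * exec_frac
freed = die_mm2 * front_end_frac - die_mm2 * link_tax
print(f"execution area today:      {exec_area:.0f} mm^2")
print(f"area freed for more units: {freed:.0f} mm^2 "
      f"(~{freed / exec_area:.0%} more execution hardware on the same die)")
```

So on these assumptions you'd get on the order of a quarter more execution hardware for the same die size, which is why the idea is tempting despite the latency problem.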
I'd like to hear more from those with more knowledge of GPU design; it's really interesting from the little bit I've read about it.