From the looks of things, the chiplets will be symmetrical, given how they are meant to sit tightly together in a square formation in the 4-chiplet configuration. One will be the primary simply because it is connected directly to the PCIe bus, and the rest need to be accessed through it. I wonder if that means there will be more duplicated structures than usual (like multiple VCN blocks).
Regarding latency, it is pretty much a non-issue in GPU workloads, which have very long pipelines. Using a tile-based rasterizer would also make it possible to subdivide the geometry between the chiplets, reducing the inter-chiplet bandwidth requirements. However, things like texture/geometry access might need to go through the interposer, as each chiplet will have its own memory bank (something akin to NUMA or Zen 1 EPYC). I wonder how much bandwidth the interposer solution will allow. It was evidently not possible on an organic substrate (like EPYC and Ryzen 3000+ use), as it would probably burn too much power for the required bandwidth.
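To make the tile-based split a bit more concrete, here is a rough sketch of how screen tiles could be handed out to four chiplets and triangles binned per chiplet. Everything in it is my own assumption for illustration (the 32-pixel tile size, the 2x2 interleave, the names `chiplet_for_tile` and `bin_triangles`); it is not how any actual hardware is known to do it.

```python
from collections import defaultdict

TILE = 32  # hypothetical tile size in pixels (assumption, not a real hardware value)


def chiplet_for_tile(tx, ty):
    # Hypothetical 2x2 interleave: neighbouring tiles land on different chiplets (0..3).
    return (tx % 2) + 2 * (ty % 2)


def bin_triangles(triangles):
    """Map each chiplet to the indices of triangles whose bounding box touches one of its tiles.

    Triangles are lists of three (x, y) screen-space points with non-negative coordinates.
    Bounding-box binning is conservative: it may include tiles the triangle does not cover.
    """
    work = defaultdict(set)
    for idx, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        tx0, tx1 = int(min(xs)) // TILE, int(max(xs)) // TILE
        ty0, ty1 = int(min(ys)) // TILE, int(max(ys)) // TILE
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                work[chiplet_for_tile(tx, ty)].add(idx)
    return work


triangles = [
    [(1, 1), (10, 1), (1, 10)],      # small, fits inside one tile
    [(40, 5), (60, 5), (50, 20)],    # small, fits inside the next tile over
    [(20, 20), (70, 20), (20, 70)],  # large, spans several tiles
]
work = bin_triangles(triangles)
```

In this toy case the two small triangles are each needed by a single chiplet, while the large one has to be visible to all four. That is the part that would generate cross-chiplet traffic if its vertex or texture data lives in another chiplet's memory bank.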