So with MCM the work distribution (what lanes run what) is still centrally controlled. There's a single input stream to deal with, and all the figuring out of how to route that instruction stream across chiplets, and how to get results back to where they're needed, is done in the hardware itself, over what needs to be a very high-bandwidth, very low-latency, and preferably very cheap link (which is why it's taken so long to pull off).
The problem with SLI and the like is indeed latency, in part, and bandwidth. With two instruction streams going to two different chips, the chips may have to shuttle data back and forth to each other, which costs a lot of bandwidth, careful handling of latency, and a lot of complication, all over PCIe (which may already be saturated at points). Alternatively, everything gets copied out twice to both cards, which loses much of the benefit, since you can't perfectly "split the screen in half" and have each card render its own half. Either way, you pay double the cost for something that is not double the performance.
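To put a rough number on the bandwidth cost of mirroring, here's a back-of-envelope sketch of what just copying a 4K framebuffer to a second card every frame would eat of a PCIe 3.0 x16 link. The resolution, frame rate, and nominal link bandwidth are illustrative assumptions, not measurements, and real SLI traffic (textures, geometry, sync) would come on top of this:

```python
# Illustrative assumptions: 4K RGBA framebuffer at 60 fps,
# nominal PCIe 3.0 x16 bandwidth of ~16 GB/s.
width, height, bytes_per_pixel = 3840, 2160, 4
fps = 60

frame_bytes = width * height * bytes_per_pixel        # ~33 MB per frame
copy_gbytes_per_s = frame_bytes * fps / 1e9           # ~2 GB/s just for the copy

pcie3_x16_gbytes_per_s = 16.0
fraction_used = copy_gbytes_per_s / pcie3_x16_gbytes_per_s

print(f"{copy_gbytes_per_s:.2f} GB/s, {fraction_used:.0%} of PCIe 3.0 x16")
```

So even the dumbest possible mirroring burns a noticeable slice of the bus before any actual inter-GPU coordination happens.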
Meanwhile, if you split a GPU in two, each half costs less than one big whole, depending on the sizes: splitting a 200mm² chip into two 100mm² chips doesn't save you much, and might even cost more, since you need extra hardware for the link between chiplets. But if you split a big enough chip, like a huge GPU or a 64-core CPU, into two or more chiplets, then even with the extra hardware the overall cost goes down.
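The cost argument falls out of yield math: defects land randomly on the wafer, so a big die is much more likely to catch one than a small die. A quick sketch using the standard Poisson yield approximation; the defect density, wafer size, and die areas are made-up illustrative numbers, and it ignores edge loss, scribe lines, and the packaging cost of joining the chiplets:

```python
import math

def good_gpus_per_wafer(die_area_mm2, dies_per_gpu=1,
                        wafer_diam_mm=300, defect_density_per_cm2=0.1):
    """Approximate good GPUs per wafer under a Poisson yield model.

    Assumed numbers for illustration; ignores edge loss and scribe lines.
    """
    wafer_area_mm2 = math.pi * (wafer_diam_mm / 2) ** 2
    dies = wafer_area_mm2 / die_area_mm2
    # Poisson yield: fraction of dies that catch zero defects
    yield_frac = math.exp(-defect_density_per_cm2 * die_area_mm2 / 100)
    return dies * yield_frac / dies_per_gpu

monolithic = good_gpus_per_wafer(600)                 # one 600 mm² die per GPU
chiplet = good_gpus_per_wafer(300, dies_per_gpu=2)    # two 300 mm² dies per GPU

print(f"monolithic: {monolithic:.0f} GPUs/wafer, chiplet: {chiplet:.0f} GPUs/wafer")
```

Even though the chiplet version uses the same total silicon per GPU, more of it survives fabrication, so you get more sellable GPUs per wafer, and that margin has to cover the link hardware and packaging.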