The thing is, not everything is about the compute units. What do you think the memory controller is doing? What about texture throughput? Compute doubles, but the ROP/TMU counts stay the same. Memory bandwidth increases, but not as much as compute does.
People assume compute units/FLOPS = performance because, for balance, when you increase compute units by 50% you generally try to bump everything else (ROPs, TMUs, memory bandwidth) by 50% too.
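A rough way to see why scaling compute alone doesn't scale performance is an Amdahl-style estimate: split a frame's time by which unit is the bottleneck, speed up only the compute slice, and see what's left. The function name and the 60/20/20 split below are made up for illustration; real frames don't divide this cleanly.

```python
def frame_speedup(time_fraction: dict[str, float], unit_speedup: dict[str, float]) -> float:
    """Amdahl-style estimate: each slice of the old frame time shrinks by the
    speedup of the unit that bottlenecks it. Units missing from unit_speedup
    are assumed unchanged (speedup 1.0)."""
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    new_time = sum(frac / unit_speedup.get(unit, 1.0) for unit, frac in time_fraction.items())
    return 1.0 / new_time

# Hypothetical split of a frame's time by bottleneck (made-up numbers).
fractions = {"compute": 0.6, "rop_tmu": 0.2, "memory": 0.2}

compute_only = frame_speedup(fractions, {"compute": 2.0})
balanced = frame_speedup(fractions, {"compute": 1.5, "rop_tmu": 1.5, "memory": 1.5})

print(f"2x compute only: {compute_only:.2f}x")
print(f"+50% everywhere: {balanced:.2f}x")
```

With these made-up numbers, doubling compute alone (about 1.43x) ends up slightly behind a balanced 50% bump across the board (1.50x).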
I know that, but even with the same number of ROPs, TMUs, etc., had it been able to do FP+FP+INT at the same time, performance would probably still have been quite a bit higher, even if not all the cores were fully utilized. That said, it certainly wasn't worth the added chip size, complexity, and hence cost for a suboptimal solution that couldn't fully utilize the cores.
That is what Turing did: it could do 1 INT + 1 FP at the same time. Ampere is the next evolution: the second datapath can now handle FP as well as INT, which allows either 1 FP + 1 FP or 1 FP + 1 INT.
I've been thinking a bit more about this. They state that a Turing SM could do 64 FP + 64 INT, while an Ampere SM does either 64 FP + 64 INT, or 64 FP + 64 FP.
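To make the two SM layouts in this post concrete, here is a toy per-clock model. It only uses the lane counts quoted above and assumes ideal scheduling (no dependencies, stalls, or memory waits), so the results are lower bounds on clock count, not real performance figures. The function names and instruction mixes are hypothetical.

```python
import math

# Per-SM lane counts as quoted above (per clock):
#   Turing: 64 FP32-only lanes + 64 INT32-only lanes
#   Ampere: 64 FP32-only lanes + 64 lanes that do FP32 *or* INT32
# Both functions return the minimum clocks under ideal scheduling.

def turing_clocks(fp_ops: int, int_ops: int) -> int:
    # FP and INT run on separate lanes, so the busier side sets the pace.
    return max(math.ceil(fp_ops / 64), math.ceil(int_ops / 64))

def ampere_clocks(fp_ops: int, int_ops: int) -> int:
    # INT work can only use the 64 shared lanes; FP work can use all 128.
    return max(math.ceil(int_ops / 64), math.ceil((fp_ops + int_ops) / 128))

# Hypothetical instruction mixes (FP ops, INT ops); the numbers are made up.
for fp, integer in [(6400, 0), (6400, 1280), (6400, 6400)]:
    t, a = turing_clocks(fp, integer), ampere_clocks(fp, integer)
    print(f"FP={fp:5d} INT={integer:5d}  Turing={t:3d}  Ampere={a:3d}  speedup={t / a:.2f}x")
```

In this model the doubled FP lanes give 2x only for pure-FP work (100 vs 50 clocks); with an even FP/INT split, Ampere gains nothing over Turing (100 vs 100).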
Now, if Ampere physically has double the FP32 cores, why not do it like Turing and give all the FP cores their own paths? Wouldn't that result in a 128 FP + 64 INT setup?
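The same kind of toy model can compare Ampere's layout with the hypothetical 128 FP + 64 INT layout suggested here. Again, this is a sketch assuming ideal scheduling; the "hypothetical" layout is not a real chip, and the names and mixes are made up.

```python
import math

def ampere_clocks(fp_ops: int, int_ops: int) -> int:
    # 64 FP-only lanes + 64 shared FP/INT lanes (ideal-scheduling lower bound).
    return max(math.ceil(int_ops / 64), math.ceil((fp_ops + int_ops) / 128))

def hypothetical_clocks(fp_ops: int, int_ops: int) -> int:
    # Hypothetical layout: 128 FP-only lanes + 64 INT-only lanes, all usable
    # in the same clock (i.e. FP + FP + INT at once).
    return max(math.ceil(fp_ops / 128), math.ceil(int_ops / 64))

# Hypothetical instruction mixes (FP ops, INT ops); the numbers are made up.
for fp, integer in [(6400, 0), (6400, 1280), (6400, 3200), (6400, 6400)]:
    a, h = ampere_clocks(fp, integer), hypothetical_clocks(fp, integer)
    print(f"FP={fp:5d} INT={integer:5d}  Ampere={a:3d}  128FP+64INT={h:3d}")
```

In this model the extra INT path only pays off for mixed workloads (for example 75 vs 50 clocks at a 2:1 FP/INT mix); pure-FP work and an even FP/INT split take the same number of clocks on both layouts.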
Actually, are there even twice as many FP cores? Jensen said they made the FP cores double-issue; wouldn't that imply that the number of physical cores is actually the same, but they've been upgraded to do twice the work?
In that case, perhaps by "datapath", they mean a logical path, not a physical one?
I'm just speculating here (I don't really know exactly how these things work): perhaps the scheduler/dispatcher itself can issue at most two instructions at once, so it can issue either FP + FP or FP + INT.
Maybe they could have upgraded the scheduler/dispatcher to triple-issue, but it would have been too big and complex, and probably unnecessary since the other GPU units weren't increased in number?
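The dual-issue vs triple-issue speculation can be played out with a small in-order dispatch simulation. This models the rules guessed at above (FP+FP or FP+INT, versus FP+FP+INT), not Nvidia's documented scheduler; the function name, parameters, and instruction streams are all hypothetical, and every instruction is assumed independent.

```python
def cycles_in_order(stream: str, width: int, max_fp: int, max_int: int) -> int:
    """Cycles needed to issue `stream` (e.g. "FFIF") strictly in program order.

    Each cycle the dispatcher takes instructions from the head of the stream
    until it hits its issue width or the next instruction has no free slot of
    its kind. Assumes all instructions are independent (no stalls)."""
    if set(stream) - {"F", "I"}:
        raise ValueError("stream may only contain 'F' and 'I'")
    if min(width, max_fp, max_int) < 1:
        raise ValueError("width and slot limits must be at least 1")
    cycles, i = 0, 0
    while i < len(stream):
        fp_used = int_used = 0
        while i < len(stream) and fp_used + int_used < width:
            if stream[i] == "F" and fp_used < max_fp:
                fp_used += 1
            elif stream[i] == "I" and int_used < max_int:
                int_used += 1
            else:
                break  # next instruction has to wait for the following cycle
            i += 1
        cycles += 1
    return cycles

# Speculative dispatcher rules from the post (assumptions, not a real design):
dual = dict(width=2, max_fp=2, max_int=1)    # FP+FP or FP+INT
triple = dict(width=3, max_fp=2, max_int=1)  # FP+FP+INT

pure_fp = "F" * 300
mixed = "FFI" * 100  # hypothetical mix: one INT for every two FP
print(cycles_in_order(pure_fp, **dual), cycles_in_order(pure_fp, **triple))
print(cycles_in_order(mixed, **dual), cycles_in_order(mixed, **triple))
```

Under these assumptions, triple-issue does nothing for pure-FP code (150 cycles either way) and only helps mixed FP/INT streams (150 vs 100 cycles), which fits the idea that it might not have been worth the extra scheduler complexity.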