Register file size. Titan Xp has 3840 CUDA cores, and each of its SMs has 128 cores. Each SM is fed by a register file of a particular size. In the GP100 chip, a register file of the same size serves only 64 cores, so those cores are less starved for resources. This is why Maxwell was such a huge jump in efficiency over Kepler. It was not because of tile-based rasterization, which does not increase performance, only efficiency (it saves the power required to move data). It was because Maxwell lowered the number of cores that share a particular pool of resources.
Funniest part: GP102 and GP104 have the same SM/register file layout as Maxwell, which is why there was no difference in performance clock for clock and core for core.
This is not true. The block diagram (which you can see on this page) for GP104 clearly indicates that each SM has 64 CUDA cores, just like on GP100. The only difference is that GP100's SMs also have extra dedicated FP64 CUDA cores (which are needed for HPC but serve no purpose during gaming).
Reducing CUDA cores per SM from 128 to 64 had minimal impact on performance; as you note, on a per-TFlop basis, there was little difference between Maxwell and Pascal. (The performance boosts that Nvidia obtained with Pascal came largely from higher clock speeds and from packing in a couple more shaders thanks to the die shrink.) If going from 128 to 64 CUDA cores per SM didn't do much, then it seems unlikely that the massive gains in perf/TFlop (about 33%) from Kepler to Maxwell were due to going from 192 to 128 cores per SM.
The addition of tiled rendering remains the most likely explanation for Maxwell's substantial boost in utilization (DX11 perf/TFlop). Whether any other such breakthroughs are on the horizon remains to be seen. It's safe to say that Nvidia will try to position GV104 (when it arrives) above GP102, since they have historically tried to have each new card beat the previous generation's card from one tier up. That could be accomplished with 3584 CUDA cores at about 2 GHz, if they were fed with adequate memory bandwidth (probably via GDDR6, and a 256-bit bus as always for 4-series chips). It would not necessarily require improved perf/TFlop, and I'm not sure we will see meaningful additional gains on that front; maybe 5%-10%. I'm assuming a ~400mm^2 chip, similar in size to GM204, with 4/6 as many shaders as the big chip. Presumably the new "12FFN" process offers better clock speeds, since calculating the transistor density increase from GV100 over GP100 shows only about a 4% improvement on that front.
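For anyone who wants to check the arithmetic behind that GV104 guess, here is a rough sketch in Python. It assumes the usual peak-FP32 convention of 2 FLOPs per CUDA core per clock (one fused multiply-add), and it uses 1.582 GHz as the Titan Xp boost clock, which is my own number and not something stated above. The function and variable names are made up for illustration, and the GV104 figures are the speculation from this comment, not a spec.

```python
def peak_fp32_tflops(cuda_cores, clock_ghz):
    """Peak FP32 throughput, assuming 2 FLOPs per core per clock (one FMA)."""
    return 2 * cuda_cores * clock_ghz / 1000.0

# Hypothetical GV104 from the comment above: 3584 cores at about 2 GHz.
gv104_guess = peak_fp32_tflops(3584, 2.0)

# GP102 as Titan Xp: 3840 cores; the 1.582 GHz boost clock is an assumption.
titan_xp = peak_fp32_tflops(3840, 1.582)

# "4/6 as many shaders as the big chip" implies this many on the big chip.
big_chip_shaders = 3584 * 6 // 4

print(f"GV104 guess: {gv104_guess:.2f} TFLOPS")
print(f"Titan Xp:    {titan_xp:.2f} TFLOPS")
print(f"Implied big-chip shader count: {big_chip_shaders}")
```

On paper the guess lands ahead of GP102 without any perf/TFlop improvement, which is the point being made; real clocks under load will differ from these nominal figures.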