Maybe I should try rephrasing what I'm fundamentally trying to understand:
Why are Intel's IPC increases only incremental at this point? Is there a technological reason it can't improve faster? If there is a ceiling, why does that ceiling exist, fundamentally?
All the low-hanging fruit is gone. Totally gone. Increases in speed require more power. Bigger hardware structures require superlinear area, and thus cost and power. Part of the reason we even had the P4 and IA-64 was that some very smart people genuinely believed that the CPUs of today, with their mildly increasing IPC, were going to be impossible to design.
Ultimately, today, it comes down to memory for IPC. Maybe 4 cycles for an L1 hit (I think that's what Haswell has), ~10-12 for L2 (can't recall exactly), 30+ for L3 (varies with clock speeds), and 100+ for RAM (RAM latency varies a lot, and it has only gotten worse with the power-saving features in modern CPUs). Those are cycles where, on a miss, nothing gets done by that thread. It doesn't take many cache misses to ruin the effective IPC of what looks like fine, fast code. A new, disruptive RAM technology that can be commercially viable is a requirement for any major IPC increase in the future.
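To make that concrete, here's a back-of-envelope sketch in Python. The numbers are illustrative assumptions, not measurements: a core that sustains 4 IPC when everything hits L1, and a ~200-cycle stall for every miss that goes all the way to DRAM, with the simplifying assumption that a miss fully stalls the thread.

# Back-of-envelope: effect of cache misses on effective IPC.
# Assumptions (illustrative only): the core sustains 4 IPC when all loads
# hit L1, and every DRAM miss stalls the thread for ~200 cycles.

def effective_ipc(base_ipc, miss_rate, miss_penalty_cycles):
    """Effective IPC when a given fraction of instructions miss to DRAM."""
    base_cpi = 1.0 / base_ipc                          # cycles/instruction, all hits
    cpi = base_cpi + miss_rate * miss_penalty_cycles   # add average stall per instruction
    return 1.0 / cpi

if __name__ == "__main__":
    for miss_rate in (0.0, 0.005, 0.01, 0.02, 0.05):
        ipc = effective_ipc(base_ipc=4.0, miss_rate=miss_rate,
                            miss_penalty_cycles=200)
        print(f"{miss_rate:5.1%} of instructions miss to DRAM -> IPC ~{ipc:.2f}")

Even a 1% DRAM-miss rate drops that hypothetical 4-wide core to well under 0.5 IPC in this simple model. Out-of-order execution and prefetching hide some of that latency in real cores, but not 200 cycles' worth.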
Apple could do so much because (a) they bought an all-star design team, (b) Apple knows R&D spending matters, so they try not to be too stingy with it, (c) ARM's own CPU designs suffer from design-by-committee inefficiencies, and (d) Apple didn't get stingy with die space either, giving the whole thing plenty of cache. Qualcomm, for instance, has kept rolling its own cores to optimize the hardware for its target markets, and had been doing so since before Apple was. Broadcom used to do this too, but I haven't heard much from them lately involving custom cores.