The future of CPU performance scaling is in programmable logic: programmable logic linked tightly to the actual execution units of the CPU core. I mean smaller blocks, probably only a square millimeter each or perhaps even less, but many of them, just as Skylake has multiple execution units behind its 8 dispatch ports. Each programmable block would be about the same size as one of those existing execution units. They would have direct access to the prefetcher, the scheduler, and the instruction and data caches, and they would be power gated.

Applications that want to make use of these blocks could program them on the fly. Any block of code that repeats the same sequence of instructions would be a candidate for a massive IPC increase. I'm thinking about JavaScript specifically. Such a core would free software developers to write extremely fast and efficient code. The blocks could be reprogrammed very quickly on every context switch. I view this as the ultimate bare-metal programming: you could literally write an ASIC to execute your code.

In about 10 years I expect the CPU to construct its own programmable logic blocks in real time based on what is running, in essence profiling and optimizing itself to achieve the highest possible IPC.
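
To make the "repeats the same sequence of instructions" idea a bit more concrete, here is a toy sketch in Python. It is not a model of any real CPU or profiler: the opcode names, the `hot_sequences` function, and the idea of a flat opcode trace are all assumptions made up for illustration. It simply counts which fixed-length runs of opcodes recur in a trace, which is roughly the kind of candidate a programmable block might be configured for.

```python
from collections import Counter


def hot_sequences(trace, length=3, min_count=2):
    """Return (sequence, count) pairs for opcode runs that repeat.

    Slides a window of `length` opcodes over the trace, counts each
    distinct window (overlapping windows are counted too), and keeps
    those seen at least `min_count` times, most frequent first.
    """
    windows = Counter(
        tuple(trace[i:i + length])
        for i in range(len(trace) - length + 1)
    )
    return [(seq, n) for seq, n in windows.most_common() if n >= min_count]


# Hypothetical trace: a four-instruction loop body executed three times.
trace = ["load", "add", "mul", "store"] * 3 + ["ret"]
candidates = hot_sequences(trace, length=4)

# The most frequent window is the candidate for a dedicated block.
best_sequence, best_count = candidates[0]
```

A real implementation would have to work on decoded micro-ops with register dependencies and memory effects, not opcode names, so treat this only as an illustration of the "find the hot repeated sequence" step.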