I still think that mainstream many-core CPUs should adopt a model much like GPUs, only not SIMD: a wide array of execution pipelines fed by a hardware thread scheduler. Think SMT, but 128-, 256-, 512-, or 1024-way rather than 2-way. And if current cores have somewhere between 5 and 8 pipelines, then perhaps a massive array of pipelines, say 256 of them, divided up among the various types (L/S, AGU, FP, INT, MUL, DIV, etc.).
Basically, make CPUs much more like GPUs, capable of MASSIVE multi-threading.
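
To make the idea a bit more concrete, here is a toy sketch of that kind of scheduler in Python. Everything in it is my own assumption for illustration, not a real design: the `simulate` function, single-cycle ops, at most one issue per thread per cycle, and a rotating round-robin priority are all made up to keep it short.

```python
from collections import deque

def simulate(threads, pipelines):
    """Toy model of a wide SMT-style scheduler (hypothetical, for illustration).

    threads:   list of instruction streams, each a list of op-type strings
    pipelines: dict mapping op type (e.g. "INT", "FP") to number of pipelines
    Returns the number of cycles until every thread has issued all its ops.
    Assumes every op occupies its pipeline for exactly one cycle.
    """
    queues = [deque(t) for t in threads]
    # Reject ops that no pipeline can ever execute, to avoid spinning forever.
    for q in queues:
        for op in q:
            if pipelines.get(op, 0) <= 0:
                raise ValueError(f"no pipeline for op type {op!r}")

    n = len(queues)
    cycles = 0
    start = 0  # rotating priority so no thread is permanently starved
    while any(queues):
        free = dict(pipelines)  # all pipelines become free each cycle
        for i in range(n):
            q = queues[(start + i) % n]
            # Issue at most one op per thread per cycle, if a matching
            # pipeline is still free this cycle.
            if q and free[q[0]] > 0:
                free[q[0]] -= 1
                q.popleft()
        start = (start + 1) % n
        cycles += 1
    return cycles

if __name__ == "__main__":
    work = [["INT", "INT"] for _ in range(4)]
    print(simulate(work, {"INT": 2}))  # narrow: 2 INT pipelines
    print(simulate(work, {"INT": 4}))  # wide: 4 INT pipelines
```

With the same four threads, doubling the INT pipelines halves the cycle count in this model, which is the whole point: as long as there are enough ready threads, the extra pipelines stay busy.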