Clearly there is a reason why more and more cores are required, even hundreds for some specific workloads.
"Required" is an odd word to use. It is simpler to design a CPU with more cores than it is to keep scaling existing cores, but there is no inherent requirement in software for more cores. If anything, more cores are detrimental to the software, making it much more difficult to program.
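To put rough numbers on why extra cores are not a free win for the software: unless the whole program parallelizes, each added core helps less than the one before it. A quick sketch (the 75% parallel fraction below is just an assumed number for illustration):

# Rough Amdahl's-law style estimate: speedup from n cores when only a
# fraction of the program can actually run in parallel.
def speedup(parallel_fraction, cores):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Assume (purely as an example) that 75% of the work parallelizes cleanly.
for n in (1, 2, 4, 8, 100):
    print(n, "cores ->", round(speedup(0.75, n), 2), "x")
# 2 cores give ~1.6x, 8 cores only ~2.9x, and even 100 cores top out near 4x.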
It is clearly a case of science vs. engineering: for each architecture/manufacturing process there is a physical ceiling beyond which it is simply more efficient to have more, slower cores as opposed to fewer, faster ones.
I disagree. I think it is significantly cheaper and faster to develop and design more, slower cores than fewer, faster cores, but I completely disagree that there is any physical effect that makes more cores MORE efficient.
Remember that the THERMALS argument of 2 cores at 2.5 GHz vs. 1 core at 3 GHz comes from the non-linear increase of power consumption with core clock speed and voltage. However, do remember that the dual core uses nearly twice as many transistors.
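To put rough numbers on that non-linearity: dynamic power scales roughly with voltage squared times frequency, and higher clocks generally need higher voltage. A quick sketch with assumed figures (the 10% voltage bump needed for 3 GHz is an assumption, not a datasheet number):

# Dynamic CPU power scales roughly as capacitance * voltage^2 * frequency.
# The numbers below are illustrative assumptions, not measured figures.
def rel_power(freq_ghz, volts, cores=1):
    return cores * (volts ** 2) * freq_ghz

base = rel_power(2.5, 1.00, cores=1)         # one core at 2.5 GHz, nominal voltage
dual = rel_power(2.5, 1.00, cores=2) / base  # two such cores
fast = rel_power(3.0, 1.10, cores=1) / base  # one 3 GHz core, assuming ~10% more voltage
print(round(dual, 2), round(fast, 2))        # ~2.0 vs ~1.45

So the 3 GHz core pays roughly 45% more power for 20% more clock, while the dual core pays roughly 100% more power (and nearly twice the transistors) for at best 2x the throughput.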
A single-core design with the same number of transistors as the dual core @ 2.5 GHz would be even more efficient overall... but that requires significant departures from x86.
Performance: single core @ 2.5 GHz @ 2B transistors > dual core @ 2.5 GHz @ 2B transistors > single core @ 3 GHz @ 1B transistors
Development cost: single core @ 2.5 GHz @ 2B transistors < dual core @ 2.5 GHz @ 2B transistors < single core @ 3 GHz @ 1B transistors
Manufacturing cost: single core @ 2.5 GHz @ 2B transistors = dual core @ 2.5 GHz @ 2B transistors > single core @ 3 GHz @ 1B transistors
The above of course assumes that you have a good design, and are not just inflating transistor counts with "more cache" or some such... but actually put the transistors to optimal use (which might be more cache, might be other things, depending on the exact design).
Note that both ATI and nVidia have such a design in their GPUs. They call SPs "cores" but they are not actual cores, they are parallel execution units, just like how a single "core" in x86 has multiple ALUs (arithmetic logic units). The software sees 1 GPU, not hundreds of nVidia CUDA cores, and the underlying architecture allows them to be utilized in parallel quite well. This is despite the fact that each SP is identical to every other SP, simply duplicated many times over.
GPUs, which are not hampered by x86, achieve amazing "single core" scaling via massively parallel execution units. x86 has a very rigid structure: while it allows a certain number of execution units per core, it doesn't allow flexibly increasing or decreasing that number. Instead, designers duplicate whole cores (the easiest route), with the attendant waste... on top of the waste of the x86 instruction set itself, btw.
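As a rough analogy in Python (not actual GPU code, and obviously not how the driver works internally): with the data-parallel model the program issues one logical operation and leaves the spreading-out to the hardware, while explicit multi-core forces the programmer to partition the work by hand.

# With a data-parallel device the program issues ONE logical operation and
# the hardware spreads it over however many execution units exist; with
# multiple CPU cores the programmer has to split the work up explicitly.
import numpy as np
from multiprocessing import Pool

a = np.arange(1_000_000, dtype=np.float32)
b = np.arange(1_000_000, dtype=np.float32)

# "GPU-style": one operation, the library/hardware decides how to parallelize.
c = a + b

# "Multi-core-style": the programmer explicitly chunks the data across workers.
def add_chunk(args):
    x, y = args
    return x + y

if __name__ == "__main__":
    chunks = list(zip(np.array_split(a, 4), np.array_split(b, 4)))
    with Pool(4) as pool:
        c2 = np.concatenate(pool.map(add_chunk, chunks))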
I think this is a big reason why Larrabee was scrapped (at least in its first iteration).