I'm not at all convinced by this...
Look at CPU architectures from the 70s and 80s, like the 6502 used in the Apple ][, Commodore 64, and others.
It had only about 150 opcodes. That is counting every possible addressing mode of an operation as a different instruction, and counting "Increment X register" as a different instruction from "Increment Y register".
Really, if you treated the addressing mode and the register being used as dimensions separate from the "instruction" itself, the count would probably be around 20-25 instructions. I'm pretty sure that's fewer than any GPU.
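To illustrate the fan-out (the per-mnemonic mode counts below are rough, just to show how a small set of operations multiplies out into a much bigger opcode table):

    # Illustrative only: approximate addressing-mode counts for a few 6502
    # mnemonics, showing how few "real" operations are behind the opcode count.
    modes_per_mnemonic = {
        "LDA": 8,   # immediate, zero page, zp,X, absolute, abs,X, abs,Y, (ind,X), (ind),Y
        "STA": 7,   # same as LDA minus immediate
        "ADC": 8,
        "AND": 8,
        "INX": 1,   # implied mode only
        "INY": 1,   # implied mode only
    }
    print(sum(modes_per_mnemonic.values()))  # 33 opcodes from just 6 mnemonics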
Offering real-number (floating-point) instructions at high clock frequencies is non-trivial.
You can cite an Apple ][, but the 6502 was a low-clock, non-pipelined design. Even modern renditions of the chip only reach the low tens of MHz.
It's not capable of floating-point operations, so modern software would have to fall back on software emulation and would run abysmally slowly.
Looking ahead to chips from that era that supported floating point in hardware, we find things like the 8087 FPU. Sure, it's only about 45,000 transistors (simple), but it's also a non-pipelined design. As such, even on a modern process, it would probably run WELL under 1 GHz (closer to 100 MHz, I'd guess). At the same time, it can take up to roughly 150 CPU cycles just to do a single FMUL.
This is WAY faster than software emulation, which would require tens of thousands of cycles to do the same thing, but it's still not what you're looking for when you talk about performance IMPROVEMENTS from massively parallel systems.
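Back-of-the-envelope, using the guesses above (the 100 MHz clock and the cycle counts are my assumptions, not measurements):

    # Rough throughput for a non-pipelined 8087-style FPU on a modern process,
    # using the assumed numbers from the paragraphs above.
    clock_hz = 100e6              # assumed ~100 MHz
    cycles_per_fmul = 150         # worst-case hardware FMUL latency cited above
    emu_cycles_per_fmul = 20_000  # the "tens of thousands" software-emulation figure

    hw_rate = clock_hz / cycles_per_fmul       # ~667,000 FMUL/s
    emu_rate = clock_hz / emu_cycles_per_fmul  # ~5,000 FMUL/s
    print(f"hardware: {hw_rate:,.0f}/s, emulated: {emu_rate:,.0f}/s, ~{hw_rate / emu_rate:.0f}x faster")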
The first chip that could do important operations like FMUL in under 20 cycles was the 486DX series, but now we're talking about over a million transistors, and still only a short, five-stage pipeline, limiting best-case frequency scaling to a fraction of a modern CPU's.
Modern chips do an FMUL in a handful of cycles, fully pipelined, and do it at multi-GHz clock speeds. That's a non-trivial accomplishment that cannot be compared to a MOS 6502 from an Apple ][.
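To put rough numbers on that progression (the clock speeds and latencies here are ballpark figures, not exact specs):

    # era: (clock in Hz, FMUL latency in cycles) -- ballpark figures only
    eras = {
        "8087 @ 5 MHz":        (5e6,  150),
        "486DX @ 66 MHz":      (66e6,  16),
        "modern core @ 4 GHz": (4e9,    4),  # latency; pipelining and SIMD push throughput far higher
    }
    for name, (hz, latency) in eras.items():
        print(f"{name}: ~{hz / latency:,.0f} FMUL/s per core")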
This isn't even addressing SIMD (SSE, MMX, etc.), memory addressing, speculative execution, out-of-order execution, branch prediction, cache coherence, translation lookaside buffers, cache associativity, and all the other performance enhancements that make modern computers not suck so much, plus everything required for multi-core processing. Then there are security enhancements like NX bits, virtualization enhancements like VT, etc.
Giving all of that up in favor of 100 simple CPUs could speed up certain applications, such as high-volume math, but it would drastically slow down work that is serial in nature, which, by the very nature of computing, will be close to half of execution.
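That's just Amdahl's law. A quick back-of-the-envelope, assuming (as above) that roughly half the work is serial:

    # Amdahl's law: speedup = 1 / (serial_fraction + (1 - serial_fraction) / n_cores)
    def speedup(serial_fraction, n_cores):
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

    print(speedup(0.5, 100))   # ~1.98x: 100 cores barely double performance
    print(speedup(0.1, 100))   # ~9.2x: even 10% serial work caps the gain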
I think the "large, complex, speculative, out-of-order CPU" combined with a "multi-core specialized ASIC" is the way to go, in much the way we currently pair a CPU with a GPU.
What I was saying in my original post is that, once you combine all the features we expect from a multipurpose CPU, you won't be able to scale it out by a factor of 100.
If you want to ditch floating point, cache coherency, and a hundred other features, you could build a multipurpose CPU small enough to put 100 on a die, but you would have a crappy computer.
A set of special-purpose engines (in massively parallel configurations), however, might solve this problem. Have 100 GPU-like cores that do a limited set of matrix operations, another few dozen cores that only do basic FMUL/FDIV, and another set that handles highly pipelined, speculative integer operations. The combination just might beat the current CPU/GPU combination, but only if it were very carefully built.
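Just to sketch the kind of work-splitting I mean (every engine name here is hypothetical, not a real scheduler or API):

    # Hypothetical routing of work to specialized engine pools; purely an
    # illustration of the idea above.
    def pick_engine(workload_kind):
        if workload_kind == "matrix":       # bulk matrix math
            return "100x GPU-like matrix cores"
        if workload_kind == "scalar_fp":    # isolated FMUL/FDIV work
            return "a few dozen small FP cores"
        return "deep, speculative integer cores"  # the inherently serial part

    for kind in ("matrix", "scalar_fp", "pointer_chasing"):
        print(kind, "->", pick_engine(kind))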