While this may look true and completely valid at first sight, one immediately asks the next logical question: why stop at a 25% throughput increase when the small cores are basically "lost in the chip area noise"? Why not go for 50% or even 75%?
Sounds to me like SYMMETRY may be much more important than you think.
It depends on the task, doesn't it?
A GPU or an NPU is indeed the logical extreme of a "sea of small, very simple cores", and both have many uses... Things like the original Broadcom Vulcan were another version of the idea, this time targeting simple (rather than very simple) cores, but again aimed at a particular throughput market, in this case networking.
There are then two questions:
- are there enough use cases for a sea of small simple (but not VERY SIMPLE) cores to justify their production? The jury is still out on this.
We've obviously seen Intel try to sell things like this (Centerton and Avoton), along with the first round of ARM32 servers.
They weren't great successes but I don't consider that dispositive.
There have been two problems so far:
+ companies have crippled these chips to prevent them from being too competitive with their expensive chips (in things like memory bandwidth). It's notable that when these devices (GPU, NPU, even Vulcan) aren't believed to be a threat to the expensive products, they miraculously pick up an astonishingly performant throughput-optimized memory system...
+ the companies that would be using these chips are on the same treadmill as everyone else; they have their hands full simply trying to keep pace with new ideas for the product, with security threats, with new large core SoCs. Refactoring the stack to target a sea of small cores is the kind of neat idea that might form the basis of a PhD thesis, but right now it's lower priority than everything else going on.
Even things like GPUs (or aggressive use of AVX512 for text processing), which can at least target existing hardware, are lagging far behind where they should be. Change takes time.
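To be concrete about what "AVX512 for text processing" buys you, here's a minimal sketch (my own illustration, not anyone's shipping code) that counts newlines 64 bytes at a time using AVX-512BW intrinsics; the function name and structure are just for the example:

#include <immintrin.h>
#include <stddef.h>

/* Count '\n' bytes in buf[0..len). Assumes AVX-512BW is available;
 * a real version would dispatch on CPUID.
 * Compile with e.g. gcc -O2 -mavx512bw -mpopcnt */
size_t count_newlines(const char *buf, size_t len)
{
    size_t count = 0, i = 0;
    const __m512i nl = _mm512_set1_epi8('\n');

    /* One unaligned 64-byte load plus one compare per iteration;
     * the compare yields a 64-bit mask, one bit per byte. */
    for (; i + 64 <= len; i += 64) {
        __m512i chunk = _mm512_loadu_si512((const void *)(buf + i));
        __mmask64 hits = _mm512_cmpeq_epi8_mask(chunk, nl);
        count += (size_t)_mm_popcnt_u64(hits);
    }

    /* Scalar tail for the last <64 bytes. */
    for (; i < len; i++)
        count += (buf[i] == '\n');

    return count;
}

Nothing clever, but it shows why the memory system, not the core, tends to become the bottleneck for this class of workload, which loops back to the crippling point above.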
I'd be curious to see if Amazon, in particular, has a second team looking at whether there'd be value in an alternative Graviton instance consisting of, say, 256 Cortex-A35-class cores, targeting massive-throughput text-streaming type apps.
- the second question is the value of optionality. Optionality (being able to toggle between a latency-optimized vs throughput-optimized SoC, or a latency-optimized vs energy-optimized SoC) is useful in personal devices (phones, PCs, ...) that do a dozen different things in a day. It's much less obviously useful in server applications, where you could just run each microtask on a SoC with the appropriate capabilities.
This is another reason I question the value of SMT4 on ThunderX3 (unless they're primarily targeting a license runaround, like IBM...); that SMT4 optionality is just not worth much.