And yet the things that benefit from SSE4 already have support for SSE4. Ever since SSE2, though, the improvements have been very minor. Until AVX2.
So please stop looking at the past adoption rate of instruction set extensions. Even the major SSE2 extension had a slow adoption rate, because its 128-bit instructions were initially executed on 64-bit execution units!
I fear you may be misunderstanding what those numbers are: those are not the % of applications using instruction set extensions, those are the percentage of active Steam users' PCs which can utilise those instruction set extensions. And given the decline in PC sales growth of late, I imagine the rate at which end users get access to new ISEs will only get slower.
Furthermore, I think it's silly to expect a significant number of applications to take advantage of more than four cores, but not AVX2. I can tell you first-hand that scaling beyond four cores, without the help of TSX, gets very hard. In comparison, it will be a breeze to take advantage of AVX2 to increase throughput. So rest assured that the developers who want their application to run faster will make use of AVX2 very quickly.
In other words, AVX2 is in most cases every bit as good as having twice the number of cores, if not better.
The nature of large vector units and SIMD is that you need to be performing the same calculations across multiple chunks of data. This can also be done with separate cores, but separate cores can also perform entirely independent tasks.
I am not, however, especially advocating vastly increased core counts. In server chips, yes, of course- we need all the cores we can get there. But in consumer devices, core count is not the most important factor. The single most useful improvement for everyday users is transparent speedup of existing applications- that is, increased clock speeds and increased IPC.
You're probably thinking about 'horizontal' vectorization. AVX2 is particularly suited for 'vertical' vectorization. To have 8 data elements to work on simultaneously, all you need is a scalar loop with independent iterations. Each AVX2 instruction can then execute the same scalar operation for 8 iterations simultaneously!
Pretty much all arithmetic-intensive applications have performance-critical loops like that. And there's no need to "code these instructions by hand". It's very straightforward for the compiler to perform this form of vectorization.
Yes, I am well aware of how vectorisation works, and how to use vector units. I write code full of __m128s for a living. And outside of matrix multiplication, there really are not that many situations that have long loops with independent iterations. Of course I don't have access to a vast range of different codebases outside of my speciality, so I can't reliably predict how well AVX2 autovectorisation will perform. I reserve judgement for the time being, but I am certainly not pinning my hopes upon it.
Yes they will. People who are the first to buy new hardware are also the first to buy new software. So developers have to implement support for new features early on or they'll lose an important piece of the market to the competition. Benchmarks of competing software products are typically released not long after the new hardware is released. So they can't afford to look bad.
Developers will need to write code utilising these new extensions in addition to the existing code path(s) that give the same functionality (but slightly slower), or else the large majority of their potential users will be utterly unable to use their software. Repeat this every time Intel brings out a new ISE, and you have an unmanageable hierarchy of code branches targeting different instruction set levels for all the potential permutations. The costs for this are huge.
I do not deny that in certain very specific segments, this is indeed the case. Small but mission-critical algorithms that companies pay a lot of money to have run as fast as possible will indeed use any edge they can get over their rival developers. But the majority of software written does not bother with marginal performance increases for a tiny fraction of its potential market (who have the newest processors anyway, and hence are the least likely to complain about lack of performance). Developers will use an extension if the vast majority of computers in use support it, and if the only computers which do not support it would be too rubbish to run the software properly anyway. At the moment, that title belongs to SSE2 and arguably SSE3.