I don't. I conclude that OpenCL, DirectCompute (HLSL/Cg, basically), CUDA, etc., came about due to GPUs, and that DX10.1+ GPUs are far more capable than most vector coprocessors. Yet, unfortunately for the GPU manufacturers (especially NVidia, in the long run), there is nothing keeping [mostly scalar] CPUs from doing the same work just as well, regardless of how they go about it.
Ok, it wasn't entirely clear to me that you agree about the hardware aspects.
What I disagree with you on is ignoring the software infrastructure, which, as of late, MS, Apple, and GPU companies have been fostering.
I'm not ignoring the software infrastructure. It is critically important, but I don't see any obstacles. First of all, every GPGPU API will get an AVX2 implementation in the short term. Intel and Apple are working on an LLVM-based implementation of OpenCL, and LLVM also supports PTX, which is used by CUDA. Next, Microsoft's implementation of C++ AMP will definitely support AVX2 as well.
Which brings us to throughput computing using auto-vectorization.
Visual Studio 11 supports it (note the "up to 8 times faster"). GCC supports it. Clang (built on LLVM and used by XCode and compatible with GCC) supports it...
So there's no reason for concern. GPGPU on the other hand has many restrictions and getting good performance is a minefield. So AVX2 has everything going for it.
When the compiler doesn't turn your obviously-unrollable loop into a nice block of AVX2 instructions, how are you going to go about asking it why? For that matter, why should you have to gently coerce it, and just assume it will do what you want? There is great value, and time saved, in being able to tell it what you want done. When you could be dealing with a performance difference of hundreds of percent, hoping you can satisfy the compiler shouldn't be the best option.
Auto-vectorization is just one of the options. And a very important one for those who desire the best gain versus effort ratio. If you do want to push the hardware to its limits, auto-vectorization is probably not the best option. But there are countless alternatives to choose from.
Besides, it would only take a tiny change for an auto-vectorizing compiler to tell you whether it actually vectorized the loop you wanted vectorized. Just as __forceinline produces a warning when the compiler fails to inline a function, there could be a __forcevectorize keyword, or a pragma.
So it's really a non-issue. The important thing is that AVX2 has no hardware limitations, and the software infrastructure can offer a very wide range of solutions: from being completely transparent to the developer, through minor hints, to explicit parallel programming.