- That still makes it 3-4+ years into the future. On top of what BenchPress said...
For the record, applications that made use of SSE4 were available on the day Nehalem launched. The only problem was, SSE4 only offered an improvement for a narrow field of applications.
AVX2, on the other hand, is incredibly generic. It widens every integer SIMD instruction from 128-bit to 256-bit, and adds FMA for twice the peak floating-point performance. So each and every application that used SIMD before will have access to twice the throughput with Haswell!
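To make the FMA point concrete, here's a minimal sketch in C with intrinsics (the function name is mine, purely for illustration): a fused multiply-add does a multiply and an add in a single instruction, which is where the doubled peak floating-point rate comes from.

```c
#include <immintrin.h>

/* Sketch: evaluate y = a*x + b for eight floats at once.
   Without FMA this costs a multiply plus an add; Haswell's FMA3
   fuses them into one instruction, doubling peak FLOPs. */
__m256 axpb(__m256 a, __m256 x, __m256 b)
{
    return _mm256_fmadd_ps(a, x, b);  /* a*x + b in one instruction */
}
```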
But that's not all. Thanks to the addition of gather and vector-vector shift instructions, a slew of applications which previously couldn't benefit from SIMD suddenly become prime candidates...
- I have no real understanding of what that means in terms of being totally different than other SIMD extensions (I am a programmer, high and low, but never dealt with graphics or SIMD), why it is totally different to have equivalents of every major scalar instruction or not...
Thanks for asking. It used to be that to benefit from SIMD, you needed code that was explicitly using vector math. For instance, to move an object in 3D space you'd add a 3-component displacement vector to a 3-component position vector. SSE uses 4 x 32-bit component vector registers, so you're not making use of the fourth component in this example, but still, it's faster than adding each component separately.
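In code, that looks something like this (a minimal sketch with SSE intrinsics; the fourth lane is just padding):

```c
#include <xmmintrin.h>

/* One 3D position in a 4-lane SSE register: (x, y, z, unused).
   A single addps moves the object, versus three scalar adds. */
__m128 move_object(__m128 position, __m128 displacement)
{
    return _mm_add_ps(position, displacement);  /* adds all 4 lanes at once */
}
```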
AVX widens the vectors to 256-bit, so that's 8 x 32-bit. Now obviously this seems like a big waste, since most vector code will consist of ~3 components. But the trick is to not process one vector at a time, but eight at a time! Each AVX register would hold one component of eight different vectors. So you can move eight 3D object positions with three vector additions, instead of the eight additions the SSE approach above needs. For this example, the AVX code would be 8/3 ≈ 2.7x faster.
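Here's a sketch of that structure-of-arrays layout (names are made up for illustration):

```c
#include <immintrin.h>

/* Eight 3D positions stored structure-of-arrays style: one AVX
   register per component, each lane holding that component for a
   different object. Three vector adds move all eight positions. */
void move_eight_objects(__m256 *px, __m256 *py, __m256 *pz,
                        __m256 dx, __m256 dy, __m256 dz)
{
    *px = _mm256_add_ps(*px, dx);  /* x components of all 8 objects */
    *py = _mm256_add_ps(*py, dy);  /* y components */
    *pz = _mm256_add_ps(*pz, dz);  /* z components */
}
```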
The caveat is that you need to get the original eight 3-component vectors into three 8-component vectors. This is where AVX2's gather support comes in super handy. It can load eight 32-bit values from non-consecutive locations in memory.
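Here's roughly what that looks like with the AVX2 gather intrinsic (a sketch; it assumes the eight 3-float vectors sit consecutively in memory, so the x components are 3 floats apart):

```c
#include <immintrin.h>

/* Gather the x components of eight 3-float vectors stored
   back-to-back in memory (stride of 3 floats). One vgatherdps
   replaces eight scalar loads plus the shuffling to pack them. */
__m256 gather_x(const float *vecs)
{
    /* float-element offsets of vecs[0].x, vecs[1].x, ..., vecs[7].x */
    __m256i idx = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
    return _mm256_i32gather_ps(vecs, idx, 4);  /* scale: 4 bytes per float */
}
```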
Now, the really exciting part is that this can be done for every code loop with independent iterations. Eight iterations can run simultaneously, each using one component of the AVX registers. And this is where the vector-vector shift instruction of AVX2 is critical. Obviously a lot of code loops contain shift instructions, but since the vectorized form uses shift values from eight different iterations, each lane needs its own independent shift amount. Prior to AVX2 such an instruction was not available, and the shifts would have to be performed one component at a time. With AVX2, that can be done eight times faster.
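For the curious, the per-lane shift is a single instruction (minimal sketch):

```c
#include <immintrin.h>

/* AVX2's vpsllvd: each 32-bit lane is shifted left by its own count.
   Pre-AVX2 SIMD shifts applied one count to every lane, so eight
   independent shifts needed eight separate operations. */
__m256i shift_eight(__m256i values, __m256i counts)
{
    return _mm256_sllv_epi32(values, counts);  /* values[i] << counts[i] */
}
```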
- Either way, in my mind it would be a 'simple' matter of recompilation...
Compilers have been pretty bad at auto-vectorizing code. And that's precisely because all previous SIMD instruction sets lacked things like gather and vector-vector shift. With AVX2 this changes radically, because vectorizing a loop becomes straightforward when you have vector equivalents of every scalar instruction. The code looks nearly the same; it's just skipping ahead eight iterations at a time.
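To illustrate, here's a scalar loop and its vectorized form side by side (hand-written, but this is the shape an auto-vectorizer would produce; the loop body is a made-up example, and it assumes n is a multiple of 8 to keep the sketch short):

```c
#include <immintrin.h>

/* Scalar loop: one element per iteration. */
void scalar_loop(int *out, const int *a, const int *b, const int *s, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (a[i] + b[i]) << s[i];
}

/* Vectorized form: the same operations, eight iterations per step.
   The per-lane shift is what makes this possible at all. */
void avx2_loop(int *out, const int *a, const int *b, const int *s, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256i va  = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb  = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i vs  = _mm256_loadu_si256((const __m256i *)(s + i));
        __m256i sum = _mm256_add_epi32(va, vb);
        _mm256_storeu_si256((__m256i *)(out + i),
                            _mm256_sllv_epi32(sum, vs));  /* per-lane shift */
    }
}
```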
- What minimum architecture are games compiled for these days? I assume we've moved past 80386.
Most games nowadays demand SSE2 as a bare minimum.
- The theoretical throughput of SIMD/3DNow!/whatever has always "blown our mind"... I have yet to be blown away (meh).
Again, that's because the current SIMD extensions have only been applicable in narrow fields, and even when they are applicable, the full potential is left unused due to a few missing instructions. With AVX2, Intel will finally offer everything that's required to make it useful for the vast majority of software.
- But yes, Haswell does come with a lot of small 'mmm's that may all add up to 'umpf'! Remains to be seen, 3-4-5 years down the line.
It won't take that long. Developers of high-performance applications would be insane to leave the potential of AVX2 and TSX unused for years.