How important are SSE1,2,3,4?

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Do these instruction sets make your CPU faster if coded properly for, say, games?
Yes. You get benefits even without any use for vector processing. SSE and SSE2 allow scalar FP without the x87 register stack, so they have been preferred for ages, and they became the standard with x86-64. MS' compiler, for instance, has supported selectively using SSE for scalar math, where it is obviously faster than x87, for a while now.
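For illustration, here is roughly what that means in practice (a minimal sketch; the exact flags and generated instructions depend on your compiler):

```cpp
// The same C++ source can be compiled to x87 stack code or to SSE2 scalar code.
// MSVC uses SSE2 with /arch:SSE2 (and by default on x64); GCC/Clang do the same
// with -msse2 -mfpmath=sse on 32-bit x86.
double axpy(double a, double x, double y)
{
    // x87:  fld / fmul / fadd juggling the register stack
    // SSE2: mulsd / addsd on xmm registers, no stack at all
    return a * x + y;
}
```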
 

nenforcer

Golden Member
Aug 26, 2008
1,767
1
76
Certain games won't run without them, like Alien Swarm for Steam / PC which won't run on the PC in my sig but will on my Core 2 Duo system.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
To answer the original question.

How important are SSE1,2,3,4?

SSE1, not so important. It provides very basic functionality.
SSE2, EXTREMELY important. It brought everything that SSE1 didn't bring. You will very rarely see applications that will require SSE1, but you will frequently see an SSE2 requirement.
SSE3, Meh, it was more of an SSE1-like move. Semi-important in a few specific cases.
SSE4/4.1/SSSE3, Importantish. More important than 3, less important than 2. They introduce some nifty functionality, but nothing earth-shattering.

AVX (which is, in many ways, just extended SSE): we have yet to see, but I imagine it will be pretty important for scientific-type calculations. The main thing it brings to the table is the extension of the SIMD registers to 256 bits.
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
SSE2 : most important - finally brings double-precision floating point without using a stack on x86

AVX2 : next important - gather, integer support (see the sketch after this list)

all others : not too important
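To make the gather point concrete, here is a hedged sketch of what AVX2's gather instructions buy you; the function and variable names are mine, not from any library:

```cpp
#include <immintrin.h>   // compile with AVX2 enabled, e.g. -mavx2

// Load 8 floats from table[idx[0]] ... table[idx[7]] in a single AVX2 gather
// (vgatherdps). Before AVX2 this took eight scalar loads plus vector inserts.
__m256 gather8(const float* table, const int* idx)
{
    __m256i vindex = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
    return _mm256_i32gather_ps(table, vindex, 4);  // scale: 4 bytes per float
}
```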
 

Mars999

Senior member
Jan 12, 2007
304
0
0
AVX1/2 is a huge boost in speed over SSE2, judging from the demos and in-game tests from Intel that I have run. So yeah, it's important if used correctly.
 

pantsaregood

Senior member
Feb 13, 2011
993
37
91
Speaking of SSE: where exactly do the SSE4a instructions AMD uses fit in this? Are they something like a subset of SSE4.1/4.2, or are they completely unrelated instructions that got a similar name? Intel doesn't seem to be interested in adopting them, so I'm assuming they'll die out like 3DNow! did.
 

Mars999

Senior member
Jan 12, 2007
304
0
0
SSE4.2 or 4.1 has some useful string acceleration stuff so don't throw that out....
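For the curious, the string instructions are part of SSE4.2 (PCMPxSTRx). A rough, hedged sketch of how they get used for something like strlen:

```cpp
#include <nmmintrin.h>   // SSE4.2 intrinsics (PCMPISTRI)
#include <cstddef>

// Sketch of an SSE4.2 strlen: scan 16 bytes per iteration and let PCMPISTRI
// report the position of the first zero byte. Simplified: the unaligned loads
// can read past the terminator, so a real version aligns the pointer first.
std::size_t strlen_sse42(const char* s)
{
    const __m128i zeros = _mm_setzero_si128();
    std::size_t offset = 0;
    for (;;) {
        __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + offset));
        // Index of the first zero byte in 'chunk', or 16 if there is none.
        int idx = _mm_cmpistri(zeros, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx < 16)
            return offset + idx;
        offset += 16;
    }
}
```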
 

_Rick_

Diamond Member
Apr 20, 2012
3,945
69
91
Depends on your application.
I've seen someone hand-optimise some assembly routines for the Smith-Waterman algorithm for SSSE3, and that gave a speed-up over legacy implementations of around a factor of 10. In fact, it was faster than running the code on a mid-range GPU, IIRC.

Essentially, what these extensions do is allow your CPU to behave more like a GPU. In some applications this means insane speed-ups; in others it's not so important.
Java code, for example, has no support for SIMD at all; it only uses the additional SSE registers.

There's a nice SSE vector library available that gives relatively convenient access to those instructions (you still have to think/design in SIMD). Many modern compilers are useless at figuring out which code can be turned into SIMD form, especially for the newer extensions.


TL;DR: (modern) SSE extensions are like GPGPU code: awesome if your problem works that way, and you explicitly code for it, useless otherwise.
 

Mars999

Senior member
Jan 12, 2007
304
0
0
Depends on your application.
I've seen someone hand-optimise some assembly routines for the Smith-Waterman algorithm for SSSE3, and that gave a speed-up over legacy implementations of around a factor of 10. In fact, it was faster than running the code on a mid-range GPU, IIRC.

Essentially, what these extensions do is allow your CPU to behave more like a GPU. In some applications this means insane speed-ups; in others it's not so important.
Java code, for example, has no support for SIMD at all; it only uses the additional SSE registers.

There's a nice SSE vector library available that gives relatively convenient access to those instructions (you still have to think/design in SIMD). Many modern compilers are useless at figuring out which code can be turned into SIMD form, especially for the newer extensions.


TL;DR: (modern) SSE extensions are like GPGPU code: awesome if your problem works that way, and you explicitly code for it, useless otherwise.

Would you be referring to libSIMDx86? If not, I would like to give it a go if you can remember the lib...

Totally agree with what you said. I've seen a new AVX shadow mapping demo, and it was ~50-100% faster with AVX on, IIRC. Nice!
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
SSE4.2 or 4.1 has some useful string acceleration stuff so don't throw that out....

This is one of the reasons I rank it above SSE3 and SSE1 in importance (but not 2, 2 is just the bee's knees).

It does provide some useful stuff, but not at the same level as SSE2.
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106
There's a nice SSE vector library available that gives relatively convenient access to those instructions (you still have to think/design in SIMD). Many modern compilers are useless at figuring out which code can be turned into SIMD form, especially for the newer extensions.

Yep, you pretty much need to hand-feed the compiler when it comes to vectorization. Not only that, but you have to remember to specifically target an architecture that supports those instructions. Many compilers, by default, will target Pentium Pro or Pentium II processors.
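As a hedged example of what that hand-feeding looks like (flag names as I recall them for GCC/Clang and MSVC; check your compiler's docs):

```cpp
// Build with an explicit target, e.g. GCC/Clang: -O3 -msse4.1 (or -mavx2),
// MSVC: /O2 /arch:AVX2. Without such a switch the compiler assumes a much
// older baseline and won't emit the newer instructions at all.
//
// __restrict promises the compiler that dst doesn't alias a or b, which is
// often the difference between a vectorized loop and a scalar one.
void add_arrays(float* __restrict dst, const float* __restrict a,
                const float* __restrict b, int n)
{
    for (int i = 0; i < n; ++i)   // independent iterations: vectorizable
        dst[i] = a[i] + b[i];
}
```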
 

Cogman

Lifer
Sep 19, 2000
10,284
138
106

It uses the scalar floating-point side of SSE2, and maybe even some of the vectorization. However, I doubt that it is doing a great job at the vectorization. Most compilers aren't very good at this. Vectorization is just hard to do, and unless you know about it and how to hand-feed it to your compiler, you are likely not getting the full benefit it can provide.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Many modern compilers are useless at figuring out which code can be turned into SIMD form, especially for the newer extensions.
That will change dramatically with AVX2. It has a vector equivalent of every relevant scalar operation, and hence loops with independent iterations can easily be auto-vectorized in an SPMD fashion.
TL;DR: (modern) SSE extensions are like GPGPU code: awesome if your problem works that way, and you explicitly code for it, useless otherwise.
With AVX2, developers won't have to explicitly code for it, and they can use any programming language they like. That's why homogeneous throughput computing will have a far greater impact on the consumer market than GPGPU.
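To illustrate the SPMD-on-SIMD idea (my own hedged sketch; whether a given compiler actually vectorizes this loop is up to the compiler):

```cpp
// Each loop iteration is one "program instance"; AVX2 can run eight of them
// per instruction. The indexed load can map to an AVX2 gather (vgatherdps)
// and the conditional to a compare plus blend, so the loop can be vectorized
// without the source being written "for" AVX2.
void lookup_scale(float* out, const float* table, const int* idx,
                  const float* x, int n)
{
    for (int i = 0; i < n; ++i) {
        float t = table[idx[i]];
        out[i] = (t > 0.0f) ? t * x[i] : 0.0f;
    }
}
```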
 

_Rick_

Diamond Member
Apr 20, 2012
3,945
69
91
With AVX2 you will have to explicitly code for it, as it still requires its own opcodes and SIMD-stacking. If you compile for x86_64, you won't get AVX2 code out of it.
Of course, well developed production code will have loadable libraries for each architecture, so that won't be such a huge problem, but still, I'm skeptical of your claims of magic.
Java, for example, is still in the stone age, and while yes, it does support SSE2, it only uses it without any vectorization, rendering it not quite impotent, but far from the awesomeness that it could be. Java will not magically benefit from AVX2. All compilers will still have to produce AVX2 code, and they will do so at different degrees of quality. I don't doubt that further accelerated vector extensions will be great, and I even believe that it may be easier to work with AVX2, but I don't see how the basic issues of vectorization will be overcome by yet another set of instructions.

As for that vector library, I found it here:
http://www.agner.org/optimize/#vectorclass

That site also contains a very nice study of how the Intel compiler builds code that runs slower on non-Intel CPUs, even though it shouldn't have to (he tests this by manipulating the CPUID of a VIA CPU, then running benchmarks to see whether one CPUID setting is faster than the other. Spoiler: MKL also cheats.) Great resource for all things CPU vector extensions.
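For reference, using that vector class library looks roughly like this (a sketch from memory; check the library's manual for the exact class and method names):

```cpp
#include "vectorclass.h"   // Agner Fog's vector class library

// Multiply-add over arrays, 8 floats at a time (AVX width).
// Assumes n is a multiple of 8; a real version handles the remainder.
void muladd(float* r, const float* a, const float* b, const float* c, int n)
{
    for (int i = 0; i < n; i += 8) {
        Vec8f va, vb, vc;       // 8-wide single-precision vectors
        va.load(a + i);
        vb.load(b + i);
        vc.load(c + i);
        (va * vb + vc).store(r + i);
    }
}
```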
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
With AVX2 you will have to explicitly code for it, as it still requires its own opcodes and SIMD-stacking. If you compile for x86_64, you won't get AVX2 code out of it.
Coding for a specific architecture, and compiling for it, are very different things. Most of my code is compiled for i686 and yet it takes advantage of everything up to SSE4.1.
Of course, well developed production code will have loadable libraries for each architecture, so that won't be such a huge problem, but still, I'm skeptical of your claims of magic.
There is absolutely no need for a loadable library per architecture. You can have different code paths all within the same binary.
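As a hedged sketch of what that looks like (GCC/Clang-specific: __builtin_cpu_supports and the target attribute; MSVC would use __cpuid and separately compiled files instead):

```cpp
#include <cstddef>

// Baseline path, compiled for the default target.
static void sum_scalar(float* r, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) r[i] = a[i] + b[i];
}

// Same loop, but this one function is compiled with AVX2 enabled, so the
// compiler may auto-vectorize it with 256-bit instructions.
__attribute__((target("avx2")))
static void sum_avx2(float* r, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) r[i] = a[i] + b[i];
}

// One entry point, several code paths in the same binary, picked at run time.
void sum(float* r, const float* a, const float* b, std::size_t n)
{
    static const bool has_avx2 = __builtin_cpu_supports("avx2");
    (has_avx2 ? sum_avx2 : sum_scalar)(r, a, b, n);
}
```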
Java, for example, is still in the stone age, and while yes, it does support SSE2, it only uses it without any vectorization, rendering it not quite impotent, but far from the awesomeness that it could be. Java will not magically benefit from AVX2.
Any application written in Java simply isn't performance-oriented. So don't expect AVX2, or anything else for that matter, to make a difference (intentionally).
All compilers will still have to produce AVX2 code, and they will do so at different degrees of quality.
No they won't. Any (performance-oriented) compiler will produce equivalent AVX2 code. And that's because with AVX2 there is exactly one way to parallelize loops with independent iterations: replacing every scalar operation with its vector equivalent. There's no two ways about it.
I don't doubt that further accelerated vector extensions will be great, and I even believe that it may be easier to work with AVX2, but I don't see how the basic issues of vectorization will be overcome by yet another set of instructions.
This isn't a matter of "belief". It's science, so please do your research. AVX2 is the very first x86 instruction set extension where every important scalar instruction will have a 256-bit vector equivalent. This enables the SPMD-on-SIMD programming model, the same vectorization approach that makes GPUs execute code in a massively parallel fashion. So yes, "yet another" set of instructions will overcome the legacy issues of vectorization and will have a huge impact on the future of high performance computing.
 

_Rick_

Diamond Member
Apr 20, 2012
3,945
69
91
Coding for a specific architecture, and compiling for it, are very different things. Most of my code is compiled for i686 and yet it takes advantage of everything up to SSE4.1.

This either means you have huge fat binaries, with a boatload of code paths, or you lack optimization for some extensions that add registers etc.
Just because code runs on a platform, doesn't mean it runs especially well.

There is absolutely no need for a loadable library per architecture. You can have different code paths all within the same binary.
Which is the same thing, only worse, as you can't optimize for install footprint.
Any application written in Java simply isn't performance-oriented. So don't expect AVX2, or anything else for that matter, to make a difference (intentionally).

They aren't, no. Still, there are people running Java on their HPC clusters who would love to have magic performance increases.
But this was more in reply to a previous post, proclaiming Java's SSE2 capabilities.

No they won't. Any (performance-oriented) compiler will produce equivalent AVX2 code. And that's because with AVX2 there is exactly one way to parallelize loops with independent iterations: replacing every scalar operation with its vector equivalent. There's no two ways about it.
And yet different compilers already differ wildly on any number of other optimizations, so one would expect differing sensitivity in detecting independence, and different ways of arranging the code into the AVX registers. And you still have to code for AVX, by making sure your loops contain only AVX2-compatible operations and have independent iterations. This may have to be forced at times, especially with compilers that don't atomize loops properly. Writing parallel code is still a specific challenge, and the one I was talking about. While AVX2 makes it somewhat more straightforward by removing the operation limitations, it's still not the same as writing fast code for sequential execution, as loops now get unrolled differently depending on what's going on inside them.
Believing that the compiler will do magic is what Intel did when they pushed VLIW with Itanium. It kind of didn't work out that way.

This isn't a matter of "belief". It's science, so please do your research. AVX2 is the very first x86 instruction set extension where every important scalar instruction will have a 256-bit vector equivalent. This enables the SPMD-on-SIMD programming model, the same vectorization approach that makes GPUs execute code in a massively parallel fashion. So yes, "yet another" set of instructions will overcome the legacy issues of vectorization and will have a huge impact on the future of high performance computing.

It's an unknown future, where humans work with a tool. Yes, it may be a great tool, but that doesn't mean it's guaranteed to improve the work. The term "belief" fits well until we know, after several years of experience, that yes, AVX2 did in fact hugely improve the execution of vectorizable code. Until then, it's merely a promise, with no implementation, only a specification of what should happen.
 

Mars999

Senior member
Jan 12, 2007
304
0
0
As I have always said, you can't replace the best optimizer (the one right between your ears), and it will continue to be that way. You will not beat an intelligent, well-thought-out design with some off-the-shelf, one-size-fits-all method. That isn't going to work all the time.

The problem is that the compiler doesn't know what you intend to use the AVX extensions for...
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
This either means you have huge fat binaries, with a boatload of code paths, or you lack optimization for some extensions that add registers etc.
Optimizing the hotspots with multiple code paths does not produce "huge fat binaries". They are but a minor fraction of the total code size.
And yet different compilers already differ wildly on any number of other optimizations...
I wouldn't say "wildly". The results typically vary by at most a few tens of percent between the relevant contenders, which is nothing compared to the eightfold parallelization you get with AVX2. So compiler differences will be irrelevant to its success.
It's an unknown future...
Not really. AVX2 consists of exactly the kind of instructions used by GPUs, so any work that has been done on GPGPU in the last several years also applies to AVX2. And AVX2 is far more flexible (it supports legacy programming languages) and won't suffer from heterogeneous overhead, nor from driver issues and such. So it's not hard to see that it will have a bright future.