> Do these instruction sets make your CPU faster if coded properly for, say, games?

Yes. You get benefits without even having any use for vector processing. SSE and SSE2 allow FP math without using a stack, so they have been preferred for ages, and were standardized with x86-64. MS's compiler, for instance, has supported selectively using SSE for scalar FP, when it would obviously be faster than x87, for a while now.
How important are SSE1,2,3,4?
Depends on your application.
I've seen someone hand-optimise some assembly routines for the Smith-Waterman algorithm for SSSE3, and that gave a speed-up over legacy implementations of around a factor of 10. In fact, it was faster than running the code on a mid-range GPU, IIRC.
Essentially, what these extensions do is allow your CPU to behave like a GPU. In some applications this means insane speed-ups; in others it's not so important.
Java code, for example, has no support for SIMD at all; it only uses the additional SSE registers.
There's a nice SSE vector library available that gives relatively convenient access to those instructions (you still have to think/design in SIMD). Many modern compilers are useless when it comes to figuring out which code can be optimized to be treated as a SIMD set, especially for the newer extensions.
TL;DR: (modern) SSE extensions are like GPGPU code: awesome if your problem works that way, and you explicitly code for it, useless otherwise.
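To make "explicitly code for it" concrete, here is a minimal sketch of what hand-written SSE code looks like in C with compiler intrinsics. The function name and the multiply-add operation are my own illustration, not something from the thread; it uses only SSE/SSE2 intrinsics, which every x86-64 CPU supports.

```c
#include <immintrin.h>  /* SSE/SSE2 intrinsics */
#include <stddef.h>

/* Multiply-add: out[i] = a[i] * b[i] + c[i], four floats per step.
   Assumes n is a multiple of 4 to keep the sketch short. */
void madd_sse(const float *a, const float *b, const float *c,
              float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        __m128 r  = _mm_add_ps(_mm_mul_ps(va, vb), vc);
        _mm_storeu_ps(out + i, r);         /* store 4 results at once */
    }
}
```

The "think/design in SIMD" part is visible even here: you process four elements per iteration, so you have to arrange your data and loop bounds around the vector width.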
SSE4.2 or 4.1 has some useful string acceleration stuff so don't throw that out....
> Many modern compilers are useless when it comes to figuring out which code can be optimized to be treated as a SIMD set, especially for the newer extensions.

That will change dramatically with AVX2. It has a vector equivalent of every relevant scalar operation, and hence loops with independent iterations can easily be auto-vectorized in an SPMD fashion.
> TL;DR: (modern) SSE extensions are like GPGPU code: awesome if your problem works that way, and you explicitly code for it, useless otherwise.

With AVX2, developers won't have to explicitly code for it, and they can use any programming language they like. That's why homogeneous throughput computing will have a far greater impact on the consumer market than GPGPU.
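A "loop with independent iterations" of the kind being discussed looks like this in plain C (the function name is just for illustration). Nothing in it is AVX2-specific; `restrict` is what lets the compiler prove the iterations are independent, and with vectorization enabled (e.g. `gcc -O3`) a compiler will typically map this loop straight onto vector instructions.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on a[i] and out[i],
   never on another index, so the compiler is free to process 4 or 8
   elements per step.  'restrict' promises the arrays don't overlap,
   which is what makes that independence provable. */
void saxpy(float scale, const float *restrict a,
           float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = scale * a[i] + out[i];
}
```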
> With AVX2 you will have to explicitly code for it, as it still requires its own opcodes and SIMD-stacking. If you compile for x86_64, you won't get AVX2 code out of it.

Coding for a specific architecture, and compiling for it, are very different things. Most of my code is compiled for i686 and yet it takes advantage of everything up to SSE4.1.
> Of course, well-developed production code will have loadable libraries for each architecture, so that won't be such a huge problem, but still, I'm skeptical of your claims of magic.

There is absolutely no need for a loadable library per architecture. You can have different code paths all within the same binary.
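One way to get "different code paths within the same binary", sketched under my own assumptions: GCC and Clang provide the `__builtin_cpu_supports` builtin, which checks CPUID at run time, so a single executable can pick the best path at startup. The function names are hypothetical, and the AVX2 path is a placeholder here (real code would compile it with its own target attribute).

```c
#include <stddef.h>

/* Two code paths in one binary, chosen at run time.  Both compute the
   same sum here; in real code sum_avx2() would hold 256-bit intrinsics
   and be built separately, e.g. with __attribute__((target("avx2"))). */
static float sum_scalar(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

static float sum_avx2(const float *a, size_t n)
{
    /* placeholder: same math, would hold the vectorized loop */
    return sum_scalar(a, n);
}

float sum_dispatch(const float *a, size_t n)
{
    /* __builtin_cpu_supports is a GCC/Clang builtin that queries
       CPUID at run time -- no loadable library per architecture. */
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2(a, n);
    return sum_scalar(a, n);
}
```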
> Java, for example, is still in the stone ages, and while yes, it does support SSE2, it only uses it without any vectorization, rendering it not quite impotent, but far from the awesomeness that it could be. Java will not magically benefit from AVX2.

Any application written in Java simply isn't performance-oriented. So don't expect AVX2 or anything else, for that matter, to make a difference (intentionally).
> All compilers will still have to produce AVX2 code. And they will do so at different degrees of quality.

No they won't. Any (performance-oriented) compiler will produce equivalent AVX2 code. That's because AVX2 demands parallelizing loops with independent iterations in exactly one way: by replacing every scalar operation with its vector equivalent. There's no two ways about it.
> I don't doubt that further accelerated vector extensions will be great, and I even believe that it may be easier to work with AVX2, but I don't see how the basic issues of vectorization will be overcome by yet another set of instructions.

This isn't a matter of "belief". It's science, so please do your research. AVX2 is the very first x86 instruction set extension where every important scalar instruction has a 256-bit vector equivalent. This enables the SPMD-on-SIMD programming model, the same vectorization approach that makes GPUs execute code in a massively parallel fashion. So yes, "yet another" set of instructions will overcome the legacy issues of vectorization and will have a huge impact on the future of high-performance computing.
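What "SPMD on SIMD" means concretely: every scalar operation, including a branch, gets a vector counterpart, with the branch becoming a compare-mask plus a masked select. A sketch of my own using 128-bit SSE2 (the AVX2 version has the same shape, just 8 lanes instead of 4); the function names and the clamp operation are illustrative.

```c
#include <immintrin.h>
#include <stddef.h>

/* Scalar program: out[i] = a[i] > t ? a[i] - t : 0 */
void clamp_scalar(const float *a, float t, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] > t ? a[i] - t : 0.0f;
}

/* Same program run per SIMD lane: the comparison yields an
   all-ones/all-zeros mask in each lane, and the branch becomes a
   masked select.  n assumed to be a multiple of 4 for brevity. */
void clamp_sse2(const float *a, float t, float *out, size_t n)
{
    __m128 vt = _mm_set1_ps(t);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va   = _mm_loadu_ps(a + i);
        __m128 mask = _mm_cmpgt_ps(va, vt);   /* lane-wise a > t */
        __m128 diff = _mm_sub_ps(va, vt);
        /* mask & diff keeps diff where the test was true, 0 elsewhere */
        _mm_storeu_ps(out + i, _mm_and_ps(mask, diff));
    }
}
```

This is exactly how GPUs handle divergent branches too: both sides of the branch run, and a per-lane mask decides which result each lane keeps.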
> There is absolutely no need for a loadable library per architecture. You can have different code paths all within the same binary.

Which is the same thing, only worse, as you can't optimize for install footprint.
> No they won't. Any (performance-oriented) compiler will produce equivalent AVX2 code. And that's because AVX2 demands to parallelize loops with independent iterations in exactly one way; by replacing every scalar operation with its vector equivalent. There's no two ways about it.

And yet different compilers are already wildly different on any number of other optimizations, so one would expect differing sensitivity to the detection of independence in the compiler, and different ways of arranging the code into the AVX registers. And you still have to code for AVX, by making sure your loops contain only AVX2-compatible instructions and are independent. This may have to be forced at times, especially with compilers that don't atomize loops properly. Writing parallel code is still a specific challenge, and the one I was talking about. While AVX2 makes it somewhat more straightforward by removing the operation limitations, it's still not the same as writing fast code for sequential execution, as loops are now unrolled differently, according to what's going on inside the loop.
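The independence point is the crux of the disagreement, so here is a minimal illustration of my own of the two loop shapes. The first has fully independent iterations and vectorizes trivially; the second carries a dependence from one iteration to the next, which no amount of instruction-set support removes by itself.

```c
#include <stddef.h>

/* Vectorizable: iteration i touches only index i. */
void scale_all(float *a, float s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] *= s;
}

/* Not directly vectorizable: iteration i reads the result of
   iteration i-1, a loop-carried dependence.  A compiler must either
   leave this scalar or substitute a different algorithm entirely
   (e.g. a parallel scan). */
void prefix_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}
```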
> This either means you have huge fat binaries, with a boatload of code paths, or you lack optimization for some extensions that add registers etc.

Optimizing the hotspots with multiple code paths does not produce "huge fat binaries". They are but a minor fraction of the total code size.
> And yet different compilers are already wildly different on any number of other optimizations...

I wouldn't say "wildly". The results typically vary by at most a few tens of percent between the relevant contenders, which is nothing compared to the eightfold parallelization you get with AVX2. So compiler differences will be irrelevant to its success.
> It's an unknown future...

Not really. AVX2 consists of exactly the kind of instructions used by GPUs, so any work that has been done on GPGPU in the last several years also applies to AVX2. And AVX2 is far more flexible (it supports legacy programming languages) and won't suffer from the heterogeneous overhead, nor from driver issues and such. So it's not hard to realize that it will have a bright future.