Intel extends AVX to 512-bit

NTMBK

Lifer
Nov 14, 2011
10,419
5,712
136
Yup- and it's first being supported in Knights Landing. Good stuff, although we really need to rework our code to use these vector widths...
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
It's going to revolutionize computing as we know it. It brings all of the general-purpose computing power of the GPU into the CPU cores. No more heterogeneous overhead.

R.I.P. GPGPU.
 
Mar 10, 2006
11,715
2,012
126
It's going to revolutionize computing as we know it. It brings all of the general-purpose computing power of the GPU into the CPU cores. No more heterogeneous overhead.

R.I.P. GPGPU.

BenchPress,

Could you give us more detailed thoughts on what you think of this new ISA extension?

I'm all ears!
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
AVX adds like what, a 10% performance boost to x264 encoding when compiled to use it? So AVX-512 might add what, 20%? lol it is hardly a threat to gpgpu.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
AVX adds like what, a 10% performance boost to x264 encoding when compiled to use it? So AVX-512 might add what, 20%? lol it is hardly a threat to gpgpu.
AVX only extended floating-point operations to 256-bit. x264 uses integer operations. AVX2 offers 256-bit integer vector operations. AVX-512 does not extend them to 512-bit, for now.

Just to be clear though, GPGPU is not efficient at processing small integer elements.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
AVX only extended floating-point operations to 256-bit. x264 uses integer operations. AVX2 offers 256-bit integer vector operations. AVX-512 does not extend them to 512-bit, for now.
So why is avx2 not gaining a lot on x264? A few percent only.
 

BallaTheFeared

Diamond Member
Nov 15, 2010
8,115
0
71
In the test the forum did with Handbrake, not all of the process was using AVX2; in fact, looking at my temps, only a small percentage of the actual workload was using AVX2.

I could probably use adaptive voltage and a desktop recorder to show the voltage bump when it uses AVX2.

I believe I gained around 12% performance from it, which was enough to set it apart from the 8350s at 4.8GHz.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
12% is not a lot given that Haswell is supposed to have a few percent of IPC improvement on its own. Also, there's a point after which vector widening won't bring you anything for most traditional tasks a user does (including transcoding).
 

BallaTheFeared

Diamond Member
Nov 15, 2010
8,115
0
71
It's 30% faster than an i5-2500k at the same clocks.

It's also 15% faster than an i5-3570k at the same clocks.

Handbrake isn't using AVX2 for the entire time, only a small part of it.
 

mikk

Diamond Member
May 15, 2012
4,292
2,382
136
So why is avx2 not gaining a lot on x264? A few percent only.


Because AVX doesn't gain anything on C code, which rules out half of x264. Only a few algorithms in x264 gain from AVX or AVX2. That doesn't mean other types of software only improve by a couple of percent; it isn't that simple.
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
Because AVX doesn't gain anything on C code, which rules out half of x264. Only a few algorithms in x264 gain from AVX or AVX2. That doesn't mean other types of software only improve by a couple of percent; it isn't that simple.
I wasn't the one who brought x264 to the discussion ;). BTW most critical routines of x264 are written in assembly language.

So what other programs that matter to end user would benefit a lot from these wide vectors?
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
BenchPress,

Could you give us more detailed thoughts on what you think of this new ISA extension?

I'm all ears!
Interestingly it's not exactly new. It's for the most part the Xeon Phi ISA, made compatible with the legacy 256-bit and 128-bit instructions.

There's a new zeroing behavior option when using the mask registers, which seems of particular interest to out-of-order execution architectures, to break dependencies.

Extending to 32 registers probably made sense even for out-of-order architectures for keeping load/store usage reasonable. It costs an extra encoding byte but it brings other useful things that were lacking from AVX2 such as broadcast for most instructions.

There also appears to be room to extend it to 1024-bit, which would be useful for hiding latency when executed in two cycles (especially when using gather instructions). I also like how the k registers are specified as 64-bit, but current instructions that operate on them have a 'W' suffix for 16-bit masks for 512-bit vector operations, suggesting that there could be 'D' suffix ones in the future for AVX-1024.

So apparently it was a big design goal to converge the Xeon Phi and CPU ISAs back together. It seems to be successful at that, but some of the differences make for an awkward mix. That seems to be an x86 tradition though. :p
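To make the zeroing-vs-merging distinction concrete, here's a rough scalar sketch of the per-lane semantics (plain C++ simulating the behavior rather than real intrinsics; `masked_add` is a made-up name for illustration):

```cpp
#include <array>
#include <cstdint>

// Scalar simulation of AVX-512 per-lane masking semantics.
// With merging masking, lanes whose mask bit is 0 keep the
// destination's old value; with zeroing masking ({z}) they become 0,
// which breaks the dependency on the previous register contents.
using Vec = std::array<int32_t, 16>;  // one "zmm" register of 16 dwords

Vec masked_add(const Vec& old_dst, const Vec& a, const Vec& b,
               uint16_t mask, bool zeroing) {
    Vec out{};
    for (int i = 0; i < 16; ++i) {
        if (mask & (1u << i))
            out[i] = a[i] + b[i];               // lane enabled: compute
        else
            out[i] = zeroing ? 0 : old_dst[i];  // lane disabled
    }
    return out;
}
```

The zeroing case is the interesting one for out-of-order cores: the result no longer depends on the old destination register, so the scheduler doesn't have to wait for it.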
 

NTMBK

Lifer
Nov 14, 2011
10,419
5,712
136
It's a programming problem.

OpenCL is the best way to harness the power that AVX2 provides. Or HSA with legacy fallback.

Meh, OpenCL is horribly clunky in my eyes.

Going against some of my earlier stances, I actually have big hopes for autovectorization of AVX-512. It has gather, scatter, operation masking, and plenty of other cool little tricks which should finally make it feasible to really crack autovectorization. It is, frankly, going to be awfully close to a GPU- not surprising, given that the instructions are very much inspired by the Phi (and hence by Larrabee).
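As a rough illustration (a hypothetical loop, not from any real codebase), this is the sort of code that older autovectorizers give up on, but that gather plus per-lane masking handles naturally:

```cpp
#include <cstddef>

// A loop with an indirect load (needs a gather) and a data-dependent
// branch (needs per-lane masking). With AVX-512 a compiler can in
// principle emit one gather, one compare-to-mask, and one masked
// add/store per 16 elements instead of giving up and staying scalar.
void sparse_accumulate(float* out, const float* table,
                       const int* idx, const float* x, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float v = table[idx[i]];   // indirect load -> gather
        if (x[i] > 0.0f)           // divergent branch -> mask
            out[i] += v;           // conditional update -> masked op
    }
}
```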
 
Mar 10, 2006
11,715
2,012
126
Meh, OpenCL is horribly clunky in my eyes.

Going against some of my earlier stances, I actually have big hopes for autovectorization of AVX-512. It has gather, scatter, operation masking, and plenty of other cool little tricks which should finally make it feasible to really crack autovectorization. It is, frankly, going to be awfully close to a GPU- not surprising, given that the instructions are very much inspired by the Phi (and hence by Larrabee).

AVX2 had gather, but not scatter, correct?
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
So what other programs that matter to end user would benefit a lot from these wide vectors?
That's a bit like asking what programs benefit from having more execution units per core. Sure, some have more instruction level parallelism (ILP) than others, but you can't draw a line between ones that do and ones that don't benefit from it. Likewise, these wide vector instructions are very generic, and they can be used to extract any data level parallelism (DLP):

Any non-trivial program contains loops, which are sections of code which are repeated many times. If each iteration of the loop code is independent of each other, which is quite common, then vector instructions can be used to execute multiple iterations in parallel.

That's it. So basically any computationally intensive program could potentially benefit from it. And with AVX-512, it can benefit a lot. It can execute loops up to 16 times faster. It's hard to tell in advance to what extent an application will benefit from it as a whole. There's Amdahl's law which causes diminishing returns, but at the same time new algorithms are devised every day that are more suitable for vectorization.
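To sketch what that means in code (a minimal made-up example): each iteration below touches only element i, so iterations are independent and a compiler can pack 16 of them into a single 512-bit operation (16 floats x 32 bits = 512 bits):

```cpp
#include <cstddef>

// A textbook vectorizable loop: no iteration reads what another
// iteration writes, so 16 iterations can run in parallel per
// 512-bit register, which is where the "up to 16x" comes from.
void saxpy(float* y, const float* x, float a, size_t n) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```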

GPGPU tries to do the same thing by using the GPU's wide vectors, but it suffers from high overhead for communicating between the CPU and GPU. So certain classes of computationally intensive programs haven't been very successful yet. It's also a question of ROI. Developers really don't like to target various models of GPUs, each with their own bottlenecks and complex optimization quirks, not to mention unpredictable driver robustness. AVX-512 on the CPU won't suffer from the heterogeneous overhead and it's much more straightforward to develop for. Compilers can take care of the vectorization, and there aren't any major performance pitfalls or robustness issues to worry about.

So you can expect to see many new applications. Stuff that hasn't been done (successfully) before. Use your imagination, and someone might create it.
 

zlatan

Senior member
Mar 15, 2011
580
291
136
AVX2 had gather, but not scatter, correct?
Yes. But the next AVX implementation should support scatter.
But this is not a huge problem. With AVX, Intel wants very efficient execution of MIMD->SPMD algorithms. This is a good thing, but there is no good standardized SPMD extension for C++. C++ AMP has one, but WARP doesn't support AVX/AVX2. OpenCL is the only way right now to support AVX2 with good efficiency. It's also easier in OpenCL to compile pre-vectorized input for the GPU. There are some other problems, too. Doing a divergent branch on an AVX unit is worse than doing it on a modern GPU; the CPUs don't have the same flexibility in their vector units.

What we need now is a good standardized SPMD extension for C++. This is a must. Or C++ AMP is also a good solution, but then WARP must support AVX and AVX2.
 

NTMBK

Lifer
Nov 14, 2011
10,419
5,712
136
Yes. But the next AVX implementation should support scatter.
But this is not a huge problem. With AVX, Intel wants very efficient execution of MIMD->SPMD algorithms. This is a good thing, but there is no good standardized SPMD extension for C++. C++ AMP has one, but WARP doesn't support AVX/AVX2. OpenCL is the only way right now to support AVX2 with good efficiency. It's also easier in OpenCL to compile pre-vectorized input for the GPU. There are some other problems, too. Doing a divergent branch on an AVX unit is worse than doing it on a modern GPU; the CPUs don't have the same flexibility in their vector units.

What we need now is a good standardized SPMD extension for C++. This is a must. Or C++ AMP is also a good solution, but then WARP must support AVX and AVX2.

Some of the stuff in the Intel compiler seems promising. Take a look at their "elemental function" stuff from Cilk Plus- you define the operation on one element of a vector using entirely standard C++, then ask the compiler to vectorize it with a little hint at the start of the function. It's been a bit limited in the past, but things like vector masking should let it handle much branchier, harder-to-predict code.
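Roughly, an elemental function looks like this (sketched with the OpenMP 4.0 `declare simd` spelling of the same idea, since Cilk Plus's own annotation is Intel-compiler-specific; the function names here are made up):

```cpp
#include <cmath>

// An "elemental function": written for one element, in plain C++.
// The annotation asks the compiler to also generate a vector variant
// that processes a full SIMD register of inputs per call.
#pragma omp declare simd
float stretch(float x) {
    // Branchy per-element logic; a vector variant would handle
    // this with per-lane masking rather than an actual branch.
    return x > 0.0f ? std::sqrt(x) : 0.0f;
}

// The caller just writes a normal loop; the compiler can then call
// the vector variant for a full register's worth of lanes at a time.
void apply(float* out, const float* in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = stretch(in[i]);
}
```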
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Doing a divergent branch on an AVX unit is worse than doing it on a modern GPU. The CPUs don't have the same flexibility in their vector units.
What makes you think that?
no vector-op masking, which is a very big thing.
Same question.

I don't perceive the lack of dedicated masking registers as that big of an issue. AVX has 'blend' instructions for predication, and Intel CPUs have two execution ports for them. Masking can reduce power consumption by disabling unused lanes, and it eliminates the blend instructions, but I don't think it's a deal breaker not to have it.

That said, AVX-512 offers access to 32 vector registers and 8 mask registers, which is certainly better than AVX2's 16 registers if you have to use some of them for masks and temporary results before blending.
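To illustrate the blend-style predication in scalar form (a simulation of the semantics, not actual intrinsics; `select_ge` is an invented name):

```cpp
#include <array>

using Vec8 = std::array<float, 8>;  // one "ymm" register of 8 floats

// AVX-style predication: both sides of the branch are computed
// unconditionally, a per-lane mask comes from the comparison, and a
// blend picks one value per lane. This mirrors what vblendvps does;
// AVX-512 instead folds the selection into the operation itself via
// the k mask registers, saving the separate blend instruction.
Vec8 select_ge(const Vec8& x, const Vec8& a, const Vec8& b) {
    Vec8 out{};
    for (int i = 0; i < 8; ++i)
        out[i] = (x[i] >= 0.0f) ? a[i] : b[i];  // compare -> mask -> blend
    return out;
}
```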
 

Nothingness

Diamond Member
Jul 3, 2013
3,292
2,360
136
So you can expect to see many new applications. Stuff that hasn't been done (successfully) before. Use your imagination, and someone might create it.
I highly doubt that a killer app will appear. If such a killer app existed, don't you think nVidia would have shown it, given how long they've been claiming GPGPU was the next big thing?

Don't get me wrong, I'm sure I could use that, but that's because I have specific, HPC-like needs.

Add to that Intel will segment as usual and I bet this won't be used for many years for consumer apps.

So definitely nice, but certainly not a revolution as you claimed.