Intel extends AVX to 512-bit

BenchPress · Jul 25, 2013

Nothingness said:
I highly doubt that a killer app will appear. If such a killer app existed, don't you think nVidia would have shown it, given how long they've been claiming GPGPU was the next big thing?

They tried but failed, due to the inherent heterogeneous overhead and programming complications. With the Kepler architecture they're focusing on graphics again, which is where the money is for them, and they've taken a serious step back from consumer GPGPU. AVX-512 instead is homogeneous, which opens up a whole new world of possibilities.

I could list a bunch of things, but that would just be the tip of the iceberg and not do it justice. Don't look for one specific killer app. There are three different kinds of parallelism for increasing performance beyond clock speed scaling: ILP, TLP, and DLP. ILP is pretty much maxed out, TLP you get from multiple cores, and DLP is most efficiently extracted using vector instructions. So although the seed had already been planted with AVX, AVX-512 will really add another dimension to CPU performance as a whole, and not target one specific killer app.

Think about how superscalar execution and multi-core have transformed computing. There is no single killer app for them, but you sure don't want to go back to single-issue or single-core. The same thing will happen with wide vectors. A lot of applications will benefit. Some more, some less, but you'll soon be wondering how we ever lived without them.

Add to that the fact that Intel will segment as usual, and I bet this won't be used in consumer apps for many years.

So definitely nice, but certainly not a revolution as you claimed.

Of course it will take many years. The same was true of multi-core in 2005, and there are still software companies that only recently started looking at it. TSX should help, but won't be ubiquitous for many years either (especially since Intel has indeed segmented support for it). But we still think of multi-core and transactional memory as revolutionary.

So you shouldn't look at the slowness of the market to determine whether something is revolutionary or not. A 512-bit register holds sixteen 32-bit elements, so AVX-512 can execute loops up to 16 times faster than legacy scalar code working on 32-bit values. That's not going to leave things unchanged. That's a revolution.
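To make the "up to 16 times" figure concrete, here is a minimal sketch in plain C with no intrinsics, so it runs on any machine. The function name `saxpy_blocked` and the constant `LANES_512` are made up for illustration. It only shows how a loop over 32-bit floats breaks into blocks of 16 lanes; whether a compiler actually emits 512-bit instructions for it depends on the target and flags, and real speedups will usually be below the theoretical maximum.

```c
#include <stddef.h>

/* Number of 32-bit elements in one 512-bit vector: 512 / 32 = 16.
   LANES_512 is an illustrative name, not part of any Intel API. */
enum { LANES_512 = 512 / 32 };

/* y[i] = a * x[i] + y[i], written in blocks of LANES_512 elements.
   Each inner block is the amount of work that a single 512-bit
   vector instruction could cover in principle. */
void saxpy_blocked(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;

    /* Full blocks of 16 floats. */
    for (; i + LANES_512 <= n; i += LANES_512) {
        for (size_t lane = 0; lane < LANES_512; ++lane)
            y[i + lane] = a * x[i + lane] + y[i + lane];
    }

    /* Scalar remainder for the last n % 16 elements. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With n = 20 this runs one full 16-lane block plus four leftover scalar iterations, which is also why short loops see much less than the 16x ceiling.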

Nothingness · Jul 25, 2013

OK, definitely not worth discussing.

SlickR12345 · Jul 25, 2013

Big deal. We need to see 6-core processors become standard on the desktop.

I mean, unless I see 6-core processors at 3.3GHz with 8MB L3 at $200, I'd consider the next generation a fail.

Intel's own projections from 2010 showed that we should have had 8-core processors by now.

Instead we are stuck with 4 cores, unless you are hardcore and have enough money to spare $1000 for a 6-core processor.

Sweepr · Jul 25, 2013

Skylake: AVX3.2, DDR4, PCIe 4.0... can we have mainstream 6C/12T CPUs too?

BenchPress · Jul 25, 2013

SlickR12345 said:
Big deal. We need to see 6-core processors become standard on the desktop.

I mean, unless I see 6-core processors at 3.3GHz with 8MB L3 at $200, I'd consider the next generation a fail.

Intel's own projections from 2010 showed that we should have had 8-core processors by now.

We have 10-core processors right now. You have the (lack of) competition to blame for keeping the prices high on anything beyond quad-core. AMD's Steamroller architecture with four modules might finally perform a little closer to an 8-core. That might push Intel to release affordable 6- or 8-core models.

That said, AMD hasn't even put AVX2 support on the roadmap yet, and each module only has one shared SIMD cluster. If Skylake features AVX-512, which the announcement strongly hints at, then AMD will again be severely lacking in performance.

Dresdenboy · Jul 26, 2013

BenchPress said:
That said, AMD hasn't even put AVX2 support on the roadmap yet, and each module only has one shared SIMD cluster. If Skylake features AVX-512, which the announcement strongly hints at, then AMD will again be severely lacking in performance.

I saw one AMD guy on LinkedIn mentioning AVX2 for Excavator. He even wrote it in a way which might include SR too.

zlatan · Jul 26, 2013

BenchPress said:
What makes you think that?

I tried it. It's much easier to optimize for GCN than for AVX.
The only problem with GPUs is the separate memory space, but with the new APUs this problem will be gone.
Try an Xbox One. I can't talk about what I'm working on now, but I want to use bitonic mergesort for sorting. I tried an AVX implementation first, but now I use the iGPU for it. It can sort in-place in system memory, and it was much easier to optimize. Actually, I was shocked at how easy it was.
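For readers who haven't met bitonic mergesort, a minimal scalar sketch in C follows. It is the textbook in-place sorting network, not the AVX or iGPU implementation described above, and the name `bitonic_sort` is chosen here for illustration. It assumes the length is a power of two. The point to notice is that every compare-and-swap in one pass of the inner loop is independent of the others, which is what makes the algorithm map well onto SIMD lanes or GPU threads.

```c
#include <stddef.h>

/* In-place bitonic sort of n ints, ascending.
   Assumption: n is a power of two. Returns 0 on success,
   -1 if n is zero or not a power of two. */
int bitonic_sort(int *a, size_t n)
{
    if (n == 0 || (n & (n - 1)) != 0)
        return -1;

    /* k is the size of the bitonic sequences being merged. */
    for (size_t k = 2; k <= n; k <<= 1) {
        /* j is the distance between compared elements. */
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            /* All iterations of this loop are independent,
               so they could run in parallel lanes or threads. */
            for (size_t i = 0; i < n; ++i) {
                size_t partner = i ^ j;
                if (partner > i) {
                    /* Direction alternates per k-sized block. */
                    int ascending = (i & k) == 0;
                    int out_of_order = ascending ? (a[i] > a[partner])
                                                 : (a[i] < a[partner]);
                    if (out_of_order) {
                        int tmp = a[i];
                        a[i] = a[partner];
                        a[partner] = tmp;
                    }
                }
            }
        }
    }
    return 0;
}
```

The network performs the same sequence of comparisons regardless of the data, so there are no data-dependent branches to diverge on beyond the swap itself.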

NTMBK · Jul 26, 2013

Dresdenboy said:
I saw one AMD guy on LinkedIn mentioning AVX2 for Excavator. He even wrote it in a way which might include SR too.

I'd hope that Steamroller could get AVX2. If we have to wait for Excavator, then AMD will be almost a year and a half behind Intel in introducing new instruction sets...

ShintaiDK · Jul 26, 2013

I doubt it's in SR, simply because the units are still 128-bit.

AMD not only lacks AVX2, they also lack 256-bit paths and units.

Nothingness · Jul 26, 2013

ShintaiDK said:
I doubt it's in SR, simply because the units are still 128-bit.

AMD not only lacks AVX2, they also lack 256-bit paths and units.

You don't need 256-bit paths or units to handle 256-bit SIMD instructions (of course you'd take a perf hit) so this doesn't prove anything.

NTMBK · Jul 26, 2013

ShintaiDK said:
I doubt it's in SR, simply because the units are still 128-bit.

AMD not only lacks AVX2, they also lack 256-bit paths and units.

They used to do SSE on 64-bit vectors, they already do AVX(1) on 128-bit vectors in Jaguar, and they do it on the pair of 128-bit vector units in BD. Instruction cracking and a bit of microcode is nothing new, and Piledriver already supports FMA3, and supported it before Intel did.
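As a rough software analogy for that kind of instruction cracking, the sketch below models a 256-bit packed-float add executed as two 128-bit operations. The types `vec128` and `vec256` and the function names are hypothetical, and real hardware does this in the decoder and scheduler rather than in C, but the result is the same: the architectural width of the instruction does not have to match the width of the execution unit.

```c
/* Illustrative types only: four and eight packed 32-bit floats. */
typedef struct { float lane[4]; } vec128;
typedef struct { vec128 lo, hi; } vec256;

/* Stand-in for one pass through a 128-bit wide adder. */
static vec128 add128(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* A 256-bit add "cracked" into two 128-bit operations.
   The halves are independent, so they can issue to two units
   in parallel or to one unit back to back (the perf hit). */
vec256 add256_cracked(vec256 a, vec256 b)
{
    vec256 r;
    r.lo = add128(a.lo, b.lo);
    r.hi = add128(a.hi, b.hi);
    return r;
}
```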

JoeRambo · Jul 26, 2013

AVX2 is nothing special really. Once you have AVX rolling on 256-bit packed double/single-precision floats and get the so-called VEX-encoded instructions up, the AVX2 instructions that operate on 256-bit vector integers are no big deal. The main headache will probably come from the gather instructions. Given AMD's total incompetence in the cache/TLB department in the past, I am scared to think about them trying to get it right so early. Microcoded or not, those things need to work correctly, and to do so with operands at arbitrary byte alignment, crossing cache lines, pages, etc.

P.S. I am aware that Intel's current AVX2 gather implementation is not exactly known for speedups either.
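To illustrate why gathers are awkward for the memory subsystem, here is a minimal scalar sketch of what an 8-lane gather of 32-bit floats does, plus a helper that checks whether a single element straddles a cache line. The names `gather8` and `crosses_cache_line` are made up for this example, and the 64-byte line size is an assumption that matches common x86 parts but is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes. */
enum { CACHE_LINE = 64 };

/* Scalar model of an 8-lane gather: each lane loads from its own
   index, so one instruction can touch up to eight unrelated cache
   lines (and pages). */
void gather8(float *dst, const float *base, const int32_t *idx)
{
    for (int lane = 0; lane < 8; ++lane)
        dst[lane] = base[idx[lane]];
}

/* Returns 1 if an access of `size` bytes at address `addr`
   spans two cache lines, 0 otherwise. Assumes size <= CACHE_LINE. */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    return (addr % CACHE_LINE) + size > CACHE_LINE;
}
```

For example, a 4-byte element at byte offset 62 within a line spills into the next line, while one at offset 60 does not. Hardware has to handle every lane hitting the bad case, including faults partway through.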
