The Official AVX2 Thread


BenchPress

Senior member
Nov 8, 2011
392
0
0
It seems to me the whole point of AVX2 is to enable the CPU to perform more parallel functions, thereby widening the pipeline. Correct or not?
Pipeline width typically refers to the issue width: the number of instructions the core can execute in parallel. This isn't expected to change with Haswell, since increasing it is prohibitively expensive. Instead, the instructions themselves become wider. AVX2 features 256-bit vectors, which can hold eight 32-bit values, or any other combination of element count and element type that fits in 256 bits. Haswell should have three such 256-bit execution units, so the arithmetic issue width is 3.
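To make that concrete, here's a minimal sketch using the AVX/FMA intrinsics from immintrin.h (my own illustration; it assumes an FMA3-capable core and a compiler with AVX2/FMA enabled):

Code:
#include <immintrin.h>

/* One 256-bit instruction operates on eight packed 32-bit floats at once.
   The fused multiply-add below (a*b + c) is a single instruction on FMA3 hardware. */
void fma8(const float *a, const float *b, const float *c, float *out)
{
    __m256 va = _mm256_loadu_ps(a);            /* load 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);
    __m256 vr = _mm256_fmadd_ps(va, vb, vc);   /* 8 multiply-adds in one instruction */
    _mm256_storeu_ps(out, vr);
}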

GPUs typically have at least 512-bit vector units, and their issue width varies from 1/4 to 2 (a fractional number means it takes multiple cycles to process an entire vector). And while they settle for a lower clock speed, they often have more cores.

The computing density of a CPU with AVX2 won't differ very much from a GPU though, which raises the question of whether we should rely on the GPU at all for general purpose computing, or whether we just need more CPU cores. Homogeneous computing on the CPU is inherently more efficient, since the data doesn't have to be moved back and forth to/from the GPU.
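As a rough back-of-the-envelope (my own numbers, assuming a quad-core at ~3.5 GHz with two 256-bit FMA units per core): 4 cores × 2 FMA units × 8 floats × 2 FLOP (multiply + add) = 128 FLOP per cycle, or roughly 450 single-precision GFLOPS. That's the same order of magnitude as a mainstream GPU.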
From what I've read in this and other threads it appears that Haswell will have 64 gpu-like cores to widen the pipeline.
That would be the 'effective' number of compute cores for 32-bit data, yes. Although you have to keep in mind that it's running at about three times the clock speed of most GPUs.
Now, unless I'm mistaken, I think these can still coexist just fine. Haswell will handle parallelism up to 64 'whatevers' wide and the GPU will take over and run anything requiring more cores than that.

Make sense?
Not really. There is nothing that "requires" more cores. A single core is better than two cores at half the frequency. There's always a portion of a task that can't be parallelized, and Amdahl's law shows that more cores become increasingly less effective. GPUs have been very successful for graphics because graphics has a great deal of parallelism, but general purpose tasks often have far less parallelism, so they're better off with fewer but faster cores.
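To put numbers on that (just plugging into Amdahl's law, speedup = 1 / ((1 - P) + P/N)): if 90% of a task parallelizes, 64 cores only get you about 8.8x, and even infinitely many cores can never exceed 10x.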

Also, this isn't the entire picture yet. A GPU doesn't have out-of-order execution, so a thread will always stall when trying to execute dependent instructions or when loading data from memory. It's only because it runs hundreds of threads per core, which it constantly switches between, that it can achieve high throughput. But this means a GPU needs a truly massive amount of parallelism in the workload to achieve good efficiency. And because there are so many stalled threads which each need a set of registers, the GPU can actually run out of registers when processing complex tasks.
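For example (rough, made-up numbers just to illustrate the mechanism): if a core has a 256 KB register file and keeps 1,024 threads in flight to hide latency, each thread gets only 64 32-bit registers; a kernel that needs more forces the GPU to run fewer threads, and with fewer threads it loses its ability to hide stalls.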

What's more, these threads have to share the caches too, and so the data locality is pretty horrible, often leading to slow RAM accesses instead of reading things from cache. Fortunately discrete GPUs have lots of bandwidth, but that's not the case for APUs. There are clever compression techniques for graphics, but not for general purpose workloads.
Also, one question - how much advantage do AMD's APUs get by being on-die with the GPU? In other words, if you compared an APU to an equivalent discrete GPU/CPU pair (same cores/speeds) how much better throughput would the APU offer?
This greatly depends on the task. Graphics essentially doesn't have to communicate anything back to the CPU; it's just sent straight to your monitor. But something like physics calculations would have to be retrieved back by the CPU, and there can be an order of magnitude difference in performance between an APU and an equivalent discrete GPU.

Either way, homogeneous computing with AVX2 doesn't suffer at all from the latency or bandwidth bottlenecks that GPGPU faces.
 

Denithor

Diamond Member
Apr 11, 2004
6,298
23
81
So the integrated GPU in an APU (whether AMD or Intel) is going to offer considerable advantage over discrete GPU/CPU combo for compute-related (non-gaming) tasks.

Which actually makes nVidia's decision to move the GTX 680 into more of a pure 'gaming' role, versus packing in the full compute hardware, look very smart going forward. I.e., let consumer video cards go back to being just for gaming and focus the high-end pro cards on HPC duties (where I suppose the higher TDP will give them an advantage even over Haswell in certain highly parallelized applications). They produce a cheap core with excellent gaming performance without gobbling all the extra power running superfluous compute hardware.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
So the integrated GPU in an APU (whether AMD or Intel) is going to offer considerable advantage over discrete GPU/CPU combo for compute-related (non-gaming) tasks.

Games use CPU power; the lower latency of APUs gives them an advantage over discrete GPUs...

The reason we don't see APUs dominating any game is because...

well, it would result in a ~600mm² die using ~800 W, with expensive GDDR5 memory and a list of other problems...
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
Games use CPU power; the lower latency of APUs gives them an advantage over discrete GPUs...

The reason we don't see APUs dominating any game is because...

well, it would result in a ~600mm² die using ~800 W, with expensive GDDR5 memory and a list of other problems...

Well, that isn't really true... an all-out APU would probably end up somewhere in the 200-300 W range and be pretty competitive in the mainstream market.

I just don't think we'll see such a monster, because cooling a 200-300 W chip that size isn't easy, so OEMs would never bite. Much better to slap it on an add-in card that exhausts out of the case, and charge the customer lots for it.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
So the integrated GPU in an APU (whether AMD or Intel) is going to offer considerable advantage over discrete GPU/CPU combo for compute-related (non-gaming) tasks.
Only in relative terms, not so much in absolute terms. The problem is that while an APU is more efficient at heterogeneous general purpose computing, it's also considerably weaker than the average discrete card.

So in my opinion it's not worth sacrificing any graphics performance to make the GPU (either discrete or integrated) more suitable for GPGPU. Instead the GPU should focus on graphics workloads, which are massively parallel and don't require reading back any results. The CPU can handle complex general purpose workloads with limited parallelism much more efficiently, without any sacrifices to its other roles. This way any and all heterogeneous processing overhead is avoided.
Which actually makes nVidia's decision to move the GTX 680 into more of a pure 'gaming' role, versus packing in the full compute hardware, look very smart going forward. I.e., let consumer video cards go back to being just for gaming and focus the high-end pro cards on HPC duties (where I suppose the higher TDP will give them an advantage even over Haswell in certain highly parallelized applications). They produce a cheap core with excellent gaming performance without gobbling all the extra power running superfluous compute hardware.
Exactly. It's actually a funny bit of history. When NVIDIA released the Fermi architecture, it was ahead of AMD in GPGPU efficiency, but there were inherent compromises to its graphics performance, which allowed AMD to gain a lot of popularity among gamers. Performance per dollar was definitely in AMD's favor.

NVIDIA didn't make much, if any, money from GPGPU in the consumer market. It was too hard for developers to create interesting consumer applications for it, and it was largely limited to NVIDIA customers only. And with CPUs getting more cores and wide vectors with FMA and gather, they realized the GPU was running out of advantages for general purpose computing. So with Kepler they concentrated on getting the best possible graphics performance out of a relatively small chip, like AMD was doing before.
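For reference, "gather" means loading a whole vector from non-contiguous addresses in one instruction. A minimal sketch with the AVX2 intrinsic (my own illustration):

Code:
#include <immintrin.h>

/* AVX2 gather: load eight floats data[idx[0]] .. data[idx[7]] in one instruction. */
__m256 gather8(const float *data, const int *idx)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx); /* 8 32-bit indices */
    return _mm256_i32gather_ps(data, vindex, 4);               /* scale = sizeof(float) */
}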

So the roles reversed. AMD now needs a big and expensive chip to keep up with NVIDIA in graphics performance, and they're going to lose even more computing density when implementing the remaining HSA features. They just hope to capitalize on it by enticing developers with layers of software which should make it somewhat easier to develop for, and more open, than NVIDIA's proprietary CUDA. But apparently they didn't count on CPUs gaining all the same throughput computing qualities as a GPU, in one homogeneous solution that is much easier to program for than HSA will ever be.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Also note that Intel has a cross-licensing agreement with NVIDIA, so they should be able to improve the graphics efficiency of their integrated GPUs. Which also means that AMD doesn't have enough of an advantage in GPU technology to create an APU that is efficient at GPGPU while still trumping Intel in graphics performance.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Will AVX2 keep BF3 over 60fps? I doubt it. Therefore, I don't care.
How can you be sure you "don't care" if you only "doubt" it? There's a contradiction in your level of assertion here.

Let's be a little more scientific here. What is preventing BF3 from consistently running at over 60 FPS? This is a very complex question unless you know the game's engine code inside-out or you've profiled the binary. Have you?

If there's a hotspot caused by arithmetic throughput, then AVX2 should help (after patching of course). If there's a cache bandwidth bottleneck, then AVX2 as presumably implemented by Haswell should help too. And if there's a multi-threading efficiency issue, then Haswell's TSX technology can make all the difference.
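To give an idea of what TSX looks like in practice, here's a minimal lock-elision sketch using the RTM intrinsics from immintrin.h (entirely my own illustration; the names and the simple flag lock are made up, and it needs a compiler with RTM support, e.g. built with -mrtm):

Code:
#include <immintrin.h>

static volatile int lock = 0;   /* fallback lock, also read inside the transaction */

void add_score(long *score, long delta)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (lock) _xabort(0xff);        /* lock is held: abort, take the slow path */
        *score += delta;                /* optimistic fast path, no lock taken */
        _xend();                        /* commit the hardware transaction */
    } else {
        while (__sync_lock_test_and_set(&lock, 1))
            ;                           /* conventional spinlock fallback */
        *score += delta;
        __sync_lock_release(&lock);
    }
}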

Anyhow, you make it sound as if you actually know a better alternative. So can you share with us what technology you do care about? Otherwise your post seems a bit pointless.
 

Riek

Senior member
Dec 16, 2008
409
15
76
Games are typically limited by branch prediction and misprediction penalties.

AVX2 (the FPU side) will shine in exactly the software that shines best on GPUs. So basically saying "here is a program that doesn't do well on GPUs compared to a normal CPU... but with AVX2 the CPU gets GPU technology and it will run a lot faster" is a big contradiction. Either the program is lazily programmed and AVX2 won't do a thing, or it isn't limited by independent operation throughput and AVX2 won't make much difference either, since the program is limited by factors other than ILP.

Believing a compiler will utilize AVX2 to its full potential on current and future code is believing in fairy tales. (That's basically why you will see OpenCL, AMP, etc. applications be faster even when run on the CPU only... because that rewritten application/kernel code is programmed with such behavior in mind, which does not happen when programming in .NET, Java, C++, ...)
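As a toy illustration of the difference (my own example; whether a given compiler actually vectorizes these depends on the compiler and flags, e.g. -O3 with AVX2 enabled): the first loop is the kind of code compilers vectorize easily, while the second has a loop-carried dependency they generally leave scalar.

Code:
/* Straightforward to auto-vectorize: independent iterations, no aliasing. */
void scale(float *restrict out, const float *restrict in, float k, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * k;
}

/* Hard to auto-vectorize: each iteration needs the previous result. */
void prefix_sum(float *out, const float *in, int n)
{
    if (n <= 0) return;
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}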

Instruction set additions always fall short of their hype, because they only speed up a fraction of the code and by no means reach their full potential. Yes, you can create tools that use them as much as possible... but most of those programs are borderline useless (e.g. SiSoft, ...).
It's great to talk about the best case of the best and how significant it is when you can reach it for that one operation... it's just not meaningful at all.

(Note that similar problems exist for GPUs of course, but in contrast to AVX2 we have real benchmarks of GPUs, and some people compare the known, real-life performance of current hardware with theoretical increases from non-existent hardware, and then hail the theoretical one...)

Also, programming in a language that supports GPGPU will bring big advantages for AVX2, because you are programming in a way that is closer to the AVX2 design philosophy than any other language currently available allows.


Also, this discussion is just as futile as the discussions we had on Bulldozer... but we can't stop human thinking, I guess? At least I'm out of this 'discussion' until it brings something new and relevant to the table (e.g. a real-life benchmark of a recompiled application; in about six months we might see some leaks...).
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
That's basically why you will see OpenCL, AMP, etc. applications be faster even when run on the CPU only...

Mmm, do you have a *concrete example* to share? I mean an actual OpenCL kernel that will be faster than the C++ equivalent compiled with today's Intel Parallel Studio XE for native AVX targets?
 

Edrick

Golden Member
Feb 18, 2010
1,939
230
106
How can you be sure you "don't care" if you only "doubt" it? There's a contradiction in your level of assertion here.

Let's be a little more scientific here. What is preventing BF3 from consistently running at over 60 FPS? This is a very complex question unless you know the game's engine code inside-out or you've profiled the binary. Have you?

If there's a hotspot caused by arithmetic throughput, then AVX2 should help (after patching of course). If there's a cache bandwidth bottleneck, then AVX2 as presumably implemented by Haswell should help too. And if there's a multi-threading efficiency issue, then Haswell's TSX technology can make all the difference.

Anyhow, you make it sound as if you actually know a better alternative. So can you share with us what technology you do care about? Otherwise your post seems a bit pointless.

You realize that you are wasting your time responding to a kid who only games with his PC and knows nothing of the real-world gains to be had from AVX2.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Here is a raw comparison of current GPU "cores" (the SIMD engines, not shader cores) vs. a Haswell core with correct architectural register count, FLOPS (assuming two FMA3 units per core) and some "worst case" base numbers based on SB:
[Attached image: haswell_vs_gpus.png]

(GPU data from http://realworldtech.com/page.cfm?ArticleID=RWT032212172023 )

Here are two of Intels OpenCL talk presentations:
http://www.khronos.org/assets/uploa...of_opencl/OpenCL-BOF-Intel-SIGGRAPH-Jul10.pdf
http://llvm.org/devmtg/2011-11/Rotem_IntelOpenCLSDKVectorizer.pdf

Quote: "We plan to work on AVX2 together with the community"

I think extending a CPU's SIMD capabilities and improving GPUs for compute tasks both increase the number of cases each architecture is well suited for. But a somewhat complex compute task may well have some components that run better on a wide-SIMD CPU and others that run better (simply faster, or at lower total energy) on GPU-like architectures. Since this would require switching back and forth, several people in the past have already had the idea of bringing the two together. AMD is just one company bringing that to life; they're not the inventors. Check academic research for the history of heterogeneous computing.

Here are some interesting results of different architectures including SB (from techreport):
[Attached image: luxmark.gif]
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
What applications are likely to see the biggest boosts from using AVX2?
All throughput-limited applications. This includes, but is not limited to: all types of multimedia processing, physics simulations, content creation, various parts of games, all applications that make use of OpenCL, speech recognition, gesture recognition, augmented reality, artificial intelligence, global illumination, ray tracing, compression, encryption, crosstalk cancellation, video conferencing, etc.

Really anything that benefits from data parallelism.