Engineers boost Llano GPU performance by 20% without overclocking

CPUarchitect · Feb 9, 2012

Abwx said:
http://pcper.com/news/General-Tech/AFDS11-Microsoft-Announces-C-AMP-Competitor-OpenCL

That's using a discrete Radeon HD 5800 series card, not just Llano. In fact if you watch the video you'll see that Llano only reaches 100 GFLOPS.

And while you're at it, take a look at the code and you'll see that the CPU implementation is shamefully unoptimized. It's single-threaded, and uses array-of-structure processing instead of structure-of-array processing.

Phynaz · Feb 9, 2012

Abwx said:
http://pcper.com/news/General-Tech/AFDS11-Microsoft-Announces-C-AMP-Competitor-OpenCL

Self-ownage. Lol!

bronxzv · Feb 9, 2012

CPUarchitect said:
AVX2 adds fused multiply-add (FMA)

nope, FMA is another feature flag in another CPUID leaf than AVX2 and btw AMD will support it before Intel

Cerb · Feb 9, 2012

soccerballtux said:
why branch predict when you can compute all outcomes?

Because every conditional branch doubles your potential outcomes*, and you simply don't have that much memory bandwidth to spare.

CPUarchitect said:
It's ironic that the CPU is being used to speed up the GPU.

Things like this reinforce that the future will be homogeneous architectures, not heterogeneous ones. The CPU simply needs high-throughput execution units like a GPU. And Intel will offer just that with the AVX2 support in Haswell next year. GPGPU's days are numbered.

No. Current x86 CPUs are not homogeneous, themselves. Vector processing bolted on to a scalar CPU is a perfect example of it. GPGPU is different, in that the very idea is to add TLP, but it is not conceptually different than vector extensions. What is truly novel about it is that it is coming from a dedicated graphics processor towards being a chunk of the CPU.

So why not skip all those GPU-specific limitations and just support C++ as-is (or any other language for that matter). AVX2 can be used to auto-vectorize any code loop with independent iterations.

Without extending the language a great deal, how are you going to successfully auto-vectorize it? As it is, it's practically a miracle that compilers can automatically vectorize much of anything. We need the language to support vector processing, to make good use of it. Standard C++ does not have this, at least not to enough of a degree that you won't have to read your hardware optimization docs to make it useful (no more easy ARM SoC port). So, you'll either need to use a customized version (AMP, CUDA), or a whole other language, to make good use of the hardware.

At this point, IMO, we are at least 90% limited by standards body bureaucracies.

* admittedly, it will not double memory needs, but I'm not sure of any work done on actual code execution to figure out a typical curve.

Olikan · Feb 9, 2012

CPUarchitect said:
I'm sorry but clearly you didn't do it very well because there is no scatter instruction. Not in AVX2, and not in any GPU.

uh oh! 🙁

Effective performance is the only thing that matters. The very topic of this thread shows that GPUs leave performance on the floor when not assisted by the CPU. And that's only from cache misses. There's lots of additional overhead from passing things back and forth and there are GPU-specific bottlenecks that make GPGPU applications only reach a fraction of the theoretical GFLOPS number

this is old news...GPUs are just now doing compute, it's like comparing a child to an adult.

yet, again...amd the ones behind gpgpu, they developed XOP

CPUarchitect · Feb 9, 2012

bronxzv said:
nope, FMA is another feature flag in another CPUID leaf than AVX2 and btw AMD will support it before Intel

Sure, strictly speaking it's a separate feature. But for Intel it will arrive at the same time as AVX2. And yes AMD supported it first, but what you're quoting is from a text explaining that between Nehalem and Haswell, Intel will have increased the peak throughput by a factor of four. AMD is irrelevant to that argument.

CPUarchitect · Feb 9, 2012

Cerb said:
No. Current x86 CPUs are not homogeneous, themselves.

I said future architectures, not current ones.

GPGPU is different, in that the very idea is to add TLP, but it is not conceptually different than vector extensions. What is truly novel about it is that it is coming from a dedicated graphics processor towards being a chunk of the CPU.

How is where it came from relevant? Haswell will be capable of 2 x 256-bit FMA per core per cycle. This isn't fundamentally different from a Fermi GPU executing 2 x 512-bit, at a lower frequency.

Also, TLP is just a mechanism to hide latency. It's not a desirable feature to have many threads because it means they're fighting for register and cache space. Remember how Hyper-Threading initially was a tossup between gaining or losing performance due to contention? And that was with just two threads. Having hundreds of them is preventing the GPU from achieving good performance with irregular workloads and also suffers from Amdahl's Law. DLP is the real goal here, and as the comparison against Fermi illustrates, nothing is preventing the CPU from achieving similar throughput per core.

Latency hiding on a CPU is achieved with out-of-order execution, and prefetching (the topic of this thread which makes the CPU help the GPU perform better). In the future we can expect to see AVX-1024 being executed in multiple cycles to further help hide latency, and also allow more clock gating of the front-end for lower power consumption.

Without extending the language a great deal, how are you going to successfully auto-vectorize it? As it is, it's practically a miracle that compilers can automatically vectorize much of anything.

Yes it's a miracle today, since there is no vector equivalent of each scalar instruction yet. AVX2 changes that, making auto-vectorization far more straightforward. There's no need to extend the language. Any loop with independent iterations is a candidate for running several of them in parallel on each SIMD lane.

The only piece of information the compiler doesn't know is whether the loop will run for many iterations (unless it's a constant amount). But this is easily solved with profile-guided optimization, JIT compilation, or a simple hint provided by the developer (like a #pragma or something like the 'inline' keyword). You don't need anything near as invasive as CUDA or C++ AMP.

bronxzv · Feb 9, 2012

CPUarchitect said:
Sure, strictly speaking it's a separate feature. But for Intel it will arrive at the same time as AVX2.

in a physical Intel processor yes, but FMA was featured in the SDE ages ago, unlike AVX2 which is a recent addition

CPUarchitect said:
AMD is irrelevant to that argument.

you were stating in the very same post that

"If AMD was the first to implement AVX2 I would applaud them instead of Intel. "

CPUarchitect · Feb 9, 2012

bronxzv said:
in a physical Intel processor yes, but FMA was featured in the SDE ages ago, unlike AVX2 which is a recent addition

True, but the SDE emulator is irrelevant to the discussion, which was clearly about the performance of physical processors.

you were stating in the very same post that

"If AMD was the first to implement AVX2 I would applaud them instead of Intel. "

Yes, but please don't ignore the context. Since AMD already has FMA, it didn't need mentioning. And I was obviously also talking about a high-performance implementation, not the 1 x 256-bit per two cores they have today.

Cerb · Feb 9, 2012

CPUarchitect said:
How is where it came from relevant?

That without that context, it wouldn't exist. The trappings of what it is derived from are largely what makes it special, and that if those are removed, the barrier is one of effectively telling the computer to do X in a massively-parallel fashion.

This isn't fundamentally different from a Fermi GPU executing 2 x 512-bit, at a lower frequency.

Exactly.

Yes it's a miracle today, since there is no vector equivalent of each scalar instruction yet. AVX2 changes that, making auto-vectorization far more straightforward. There's no need to extend the language. Any loop with independent iterations is a candidate for running several of them in parallel on each SIMD lane.

The bolded part is not so easy to prove with procedural code, and sane compilers will err on the side of not doing it. A compiler needs to come up with a positive proof that it is safe to do, which it can't do so easily as a human. The language having ways of telling it that is the case would be far more straight-forward. To be widely effective, the language needs a way for the independence of the processing to be defined at a high level. It won't take moving mountains to get the job done, but it will take people doing it now, and letting standards get worked out later.

bronxzv · Feb 9, 2012

Cerb said:
To be widely effective, the language needs a way for the independence of the processing to be defined at a high level. It won't take moving mountains to get the job done, but it will take people doing it now, and letting standards get worked out later.

did you try Cilk Plus with Array Notations already ? it looks like something meeting some of your requirements

Nemesis 1 · Feb 9, 2012

Cerb said:
That without that context, it wouldn't exist. The trappings of what it is derived from are largely what makes it special, and that if those are removed, the barrier is one of effectively telling the computer to do X in a massively-parallel fashion.Exactly.

The bolded part is not so easy to prove with procedural code, and sane compilers will err on the side of not doing it. A compiler needs to come up with a positive proof that it is safe to do, which it can't do so easily as a human. The language having ways of telling it that is the case would be far more straight-forward. To be widely effective, the language needs a way for the independence of the processing to be defined at a high level. It won't take moving mountains to get the job done, but it will take people doing it now, and letting standards get worked out later.

I may be wrong as I don't understand the subject as well as some here. But I was under the impression thats what the Vec prefix is used for in AVX

Cerb · Feb 9, 2012

Nemesis 1 said:
I may be wrong as I don't understand the subject as well as some here. But I was under the impression thats what the Vec prefix is used for in AVX

AVX doesn't exist in C, C++ (standard), C#, Java, etc.. A compiler for such a language must, through very limited programmed-in means, verify that some loop's iterations can be processed in parallel, then figure out if it can come up with a way to do that with proprietary hardware extensions, again using fairly limited programmed-in methods. In other words, there are many cases where a loop could be unrolled, or even implemented by as a set of vector maths, but won't be. Any number of additions to the language could do it easier, and more effectively, leaving the compiler just the job of figuring how to do it, having been told that it can be done. If any one change can be adopted by multiple major players (MS and GCC easily being the most important), and people use it, everyone else pretty much follows by default, and it will become a standard in practice long before it becomes one officially. Being required to understand how the compiler's logic was implemented, and coerce it into doing it, is simply not going to catch on. It would be easier to do it manually.

CPUarchitect · Feb 9, 2012

Cerb said:
That without that context, it wouldn't exist.

Vector processors with multiple lanes have existed since at least 1982. So it's not something that was invented by GPU manufacturers. Also, x86 processors had SIMD instructions before gaphics chips were even programmable. So you can't conclude that AVX2 would not have existed without GPUs.

Either way how we got to this point, it won't affect whether the dominant future architecture will be heterogeneous or homogeneous.

Exactly.

Then what were you trying to argue?

The bolded part is not so easy to prove with procedural code, and sane compilers will err on the side of not doing it. A compiler needs to come up with a positive proof that it is safe to do, which it can't do so easily as a human. The language having ways of telling it that is the case would be far more straight-forward. To be widely effective, the language needs a way for the independence of the processing to be defined at a high level. It won't take moving mountains to get the job done, but it will take people doing it now, and letting standards get worked out later.

Proving that loop iterations are independent might not be "easy", but compiler optimizations never are. So that's not a very convincing argument. Compiler writers will go to great lengths to extract more performance. Especially for something that can make 8 iterations run in parallel there is lots of incentive to solve "hard" problems. Auto-vectorization is a hot topic among the LLVM developers, and the Polly project shows some phenomenal potential. Note that for C++ the strict aliasing assumption helps a great deal too to determine (non)dependencies.

Of course that doesn't mean hints provided by the developer wouldn't be useful. C99 already has the 'restrict' keyword and most major C++ compilers support it too. And the Intel compiler has additional pragmas to ignore aliasing. And that's really all you need.

Eveything else "added" by C++ AMP or CUDA or OpenCL is a restriction caused by the limitations of today's GPGPU hardware. And while those restrictions may be lifted in future versions, AVX2 won't have these restrictions in the first place. Hence it's superior and will see greater adoption than GPGPU because it doesn't require much effort from developers.

denev2004 · Feb 9, 2012

bronxzv said:
in a physical Intel processor yes, but FMA was featured in the SDE ages ago, unlike AVX2 which is a recent addition

you were stating in the very same post that

"If AMD was the first to implement AVX2 I would applaud them instead of Intel. "

Actually I guess the reason why they have AVX2 is they do not want SnB which has AVX unit supports FMA. And in order to make the new AVX in Haswell ,which has a "new function" compared to SnB called FMA, more attractive, they establish a new name for it.

Nemesis 1 · Feb 10, 2012

Topic Title is a bit misleading isn't it?

bronxzv · Feb 10, 2012

denev2004 said:
which has a "new function" compared to SnB called FMA, more attractive, they establish a new name for it.

there isn't a new name for FMA, the name of the feature flag is "FMA" and official documentation like the Intrinsics Guide for AVX has a section named "FMA" (and another section named "AVX2" for AVX2 intrinsics)

FMA (aka FMA3) will be supported by AMD later this year, so it's important to have a different code path for FMA and yet another one for FMA+AVX2

Cerb · Feb 10, 2012

CPUarchitect said:
Vector processors with multiple lanes have existed since at least 1982. So it's not something that was invented by GPU manufacturers. Also, x86 processors had SIMD instructions before gaphics chips were even programmable. So you can't conclude that AVX2 would not have existed without GPUs.

I don't. I conclude that OpenCL, DirectCompute (HLSL/Cg, basically), CUDA, etc., came about due to GPUs, and that DX10.1+ GPUs are far more capable that most vector coprocessors. Yet, unfortunately for the GPU manufacturers (especially NVidia, in the long run), there is nothing keeping [mostly scalar] CPUs from doing the same work just as well, regardless of how they go about it.

What I disagree with you on is ignoring the software infrastructure, which, as of late, MS, Apple, and GPU companies have been fostering.

Eveything else "added" by C++ AMP or CUDA or OpenCL is a restriction caused by the limitations of today's GPGPU hardware. And while those restrictions may be lifted in future versions, AVX2 won't have these restrictions in the first place. Hence it's superior and will see greater adoption than GPGPU because it doesn't require much effort from developers.

AVX2 might not have restrictions, but again, it's not AVX2 that limits results, if the compiler must figure out what the programmer should be able to tell it: it's MSVC and GCC C++ implementations.

When the compiler doesn't turn your obviously-unrollable loop into a nice ugly block of AVX2 instructions, how are you going to go about asking it why? For that matter, why should you have to attempt to gently coerce it, and assume it will do what you wish for it to, anyway? There is great value, and time saved, in being to tell it what you want done. When you could be dealing with a performance difference of hundreds of percent, hoping you can satisfy the compiler shouldn't be the best option.

BenchPress · Feb 10, 2012

Nemesis 1 said:
Topic Title is a bit misleading isn't it?

The original topic is silly. It's some academic research that will never be used in practice. Why? Because it shows a flaw in the GPGPU hardware. It should be solved in hardware, not software. It will take AMD till 2014 to complete its HSA architecture, so developers won't bother putting a lot of effort into squeezing 20% performance out of a chip that has no significant market share whatsoever.

So the discussion went 'off topic' to talk about the kind of technology that is much more likely to be relevant to developers and consumers...

Olikan · Feb 10, 2012

Nemesis 1 said:
Topic Title is a bit misleading isn't it?

we are still talking about gpus and gpu compute 😉

BenchPress · Feb 10, 2012

bronxzv said:
FMA (aka FMA3) will be supported by AMD later this year, so it's important to have a different code path for FMA and yet another one for FMA+AVX2

All Chips supporting AVX2 will also support FMA, so let's not split things up unnecessarily. And it makes sense for Intel to implement them together considering that both the 256-bit integer instructions in AVX2, and FMA, demand doubling the bandwidth.

That said, I seriously doubt that many developers will manually write separate code paths for AVX, AVX+FMA, and AVX2. That's a lot of work (and you need additional paths for older processors). Also, both Sandy Bridge and Bulldozer are flawed for not providing sufficient bandwidth. Not to mention AVX1 is flawed for being so wide without having support for parallel load (gather) and parallel shift to make it efficient to vectorize loops!

In other words a lot of developers are not bothering with AVX1 or FMA4. They're waiting for AVX2 to make the 256-bit instruction set complete. I know I am.

BenchPress · Feb 10, 2012

Olikan said:
we are still talking about gpus and gpu compute 😉

Actually on second thought the topic title is indeed misleading. Most people would expect it to be about graphics performance, not generic computing. 😵

CPUarchitect · Feb 10, 2012

Cerb said:
I don't. I conclude that OpenCL, DirectCompute (HLSL/Cg, basically), CUDA, etc., came about due to GPUs, and that DX10.1+ GPUs are far more capable that most vector coprocessors. Yet, unfortunately for the GPU manufacturers (especially NVidia, in the long run), there is nothing keeping [mostly scalar] CPUs from doing the same work just as well, regardless of how they go about it.

Ok, it wasn't entirely clear to me that you agree about the hardware aspects.

What I disagree with you on is ignoring the software infrastructure, which, as of late, MS, Apple, and GPU companies have been fostering.

I'm not ignoring the software infrastructure. It is critically important, but I don't see any obstacles. First of all, every API for GPGPU will get an AVX2 implementation in the short term. Intel and Apple are working on an LLVM-based implementation of OpenCL. And LLVM also supports PTX which is used by CUDA. Next, Microsoft's implementation of C++ AMP will definitely support AVX2 as well.

Which brings us to throughput computing using auto-vectorization. Visual Studio 11 supports it (note the "up to 8 times faster"). GCC supports it. Clang (built on LLVM and used by XCode and compatible with GCC) supports it...

So there's no reason for concern. GPGPU on the other hand has many restrictions and getting good performance is a minefield. So AVX2 has everything going for it.

When the compiler doesn't turn your obviously-unrollable loop into a nice ugly block of AVX2 instructions, how are you going to go about asking it why? For that matter, why should you have to attempt to gently coerce it, and assume it will do what you wish for it to, anyway? There is great value, and time saved, in being to tell it what you want done. When you could be dealing with a performance difference of hundreds of percent, hoping you can satisfy the compiler shouldn't be the best option.

Auto-vectorization is just one of the options. And a very important one for those who desire the best gain versus effort ratio. If you do want to push the hardware to its limits, auto-vectorization is probably not the best option. But there are countless alternatives to choose from.

Besides, it only takes a tiny change to tell whether an auto-vectorizing compiler effectively vectorized the loop you desired to be vectorized. Just like the __forceinline keyword will give you a warning when the compiler didn't inline it, there could be a __forcevectorize keyword, or a pragma.

So it's really a non-issue. The important thing is that AVX2 has no hardware limitations, and the software infrastructure can offer a very wide range of solutions from being completely transparent to the developer, through minor hints, to explicit parallel programming.

bronxzv · Feb 10, 2012

BenchPress said:
In other words a lot of developers are not bothering with AVX1 or FMA4. They're waiting for AVX2 to make the 256-bit instruction set complete. I know I am.

I'm not waiting since in my use cases (mostly 3D rendering) no single critical loop requires per element shift counts and that processing seldom packed int instructions in a single instruction instead of two has a very low impact when comparing the AVX2 path with the AVX path (actual SDE statistics), also hardware gather should have less than 5% overall impact (under the very optimistic assumption that hardware gather will be 4x faster on average than software synthetized gather), so most of the expected (ISA related) speedup on Haswell will come from FMA (if we get 2 FMA units per core), some hot spots enjoy nearly a 1.8x theoretical speedup thanks to FMA (per SDE stats), the most useful AVX2 capability in the AVX2 path is the generic permute, as I said, all of this speaking of my very own use cases based on a preliminary AVX2 and FMA port, I'm sure other people will report other findings

anyway, if we don't support legacy "AVX1" we will leave performance on the table for several millions AVX-enabled Sandy Bridge and Ivy Bridge CPUs in the installed base when Haswell is launched so that's not a very good idea IMO

BenchPress · Feb 10, 2012

bronxzv said:
I'm not waiting since in my use cases (mostly 3D rendering) no single critical loop requires per element shift counts and that processing seldom packed int instructions in a single instruction instead of two has a very low impact when comparing the AVX2 path with the AVX path (actual SDE statistics), also hardware gather should have less than 5% overall impact (under the very optimistic assumption that hardware gather will be 4x faster on average than software synthetized gather)...

Huh? The engineers I've talked to who work on 3D content production on the CPU told me they expect gather to make a massive difference. Texture filtering with lots of taps, the implementation of transcendental functions using lookup tables, dynamic indexing into constant buffer, etc. If you're expecting to only see a 5% speedup from a gather instruction that's four times faster than scalar loads, that would mean only about 6.5% of all your current code would perform scalar loads. That seems awfully little.

I need to ask, is your software capable of advanced shaders (motion blur, Bokeh, depth of field, caustics, soft shadows, etc.), or more like a toy project? Most 3D rendering software falls into one of these two categories, and the former seems like it could make good use of gather.

...so most of the expected (ISA related) speedup on Haswell will come from FMA (if we get 2 FMA units per core), some hot spots enjoy nearly a 1.8x theoretical speedup thanks to FMA (per SDE stats)...

But wouldn't that make scalar loads become more of a bottleneck? 😛

...the most useful AVX2 capability in the AVX2 path is the generic permute, as I said, all of this speaking of my very own use cases based on a preliminary AVX2 and FMA port, I'm sure other people will report other findings

I'm curious what you need the generic permute for in 3D rendering. GPUs don't support any cross-lane operations as far as I know.

anyway, if we don't support legacy "AVX1" we will leave performance on the table for several millions AVX-enabled Sandy Bridge and Ivy Bridge CPUs in the installed base when Haswell is launched so that's not a very good idea IMO

Do you have a target market of millions of people who have a CPU only capable of AVX1? And why would that be significant compared to the billions of people who don't have AVX support at all? In other words, would you lose important paying clients if you skipped AVX? I've heard it only improves performance by 30% in the best case while AVX2 has a lot more potential.

Engineers boost Llano GPU performance by 20% without overclocking

Senior member

Lifer

Senior member

Elite Member

Platinum Member

Senior member

Senior member

Senior member

Senior member

Elite Member

Senior member

Lifer

Elite Member

Senior member

Member

Lifer

Senior member

Elite Member

Senior member

Platinum Member

Senior member

Senior member

Senior member

Senior member

Senior member