Engineers boost Llano GPU performance by 20% without overclocking

Page 4 - AnandTech Forums

bronxzv

Senior member
Jun 13, 2011
Huh? The engineers I've talked to who work on 3D content production on the CPU
Just let them talk directly here; I really don't like this kind of "I have a friend" discussion. Please ask them to explain to me how AVX2 gather instructions can help to sample textures when there are neither 8-bit nor 16-bit element gather instructions. Do they work with FP32 textures? Do they know it's plain overkill in most cases?

Btw, when I say "I'm sure other people will report other findings" I really mean it. If they have a working AVX2 path, just let them report their actual results; it will be far more interesting than hearing you overhype something for which you lack first-hand experience.

and the former seems like it could make good use of gather.
If, for example, they apply convolution filters to FP32 textures, then yes. Otherwise, I mean if they want something fast, AVX2 gather isn't very practical. But maybe AVX3 gather will be great? Let's skip AVX2?

But wouldn't that make scalar loads become more of a bottleneck? :p
Good point, so let's say a 9% speedup from hardware gather.

I'm curious what you need the generic permute for in 3D rendering.
One frequent use case is emulating the MIC VCOMPRESS instruction; one easy-to-grasp example is back-face elimination. Ask one of your "engineers who work on 3D production" if you don't get it.

GPUs don't support any cross-lane operations as far as I know.
How are GPU limitations relevant to rendering on the CPU? Are you among all these people thinking that only GPUs can do graphics?

And why would that be significant compared to the billions of people who don't have AVX support at all?
Indeed, why? Where did I say that we don't have a legacy SSE path?
 

Nemesis 1

Lifer
Dec 30, 2006
Ok, it wasn't entirely clear to me that you agree about the hardware aspects.

I'm not ignoring the software infrastructure. It is critically important, but I don't see any obstacles. First of all, every API for GPGPU will get an AVX2 implementation in the short term. Intel and Apple are working on an LLVM-based implementation of OpenCL. And LLVM also supports PTX which is used by CUDA. Next, Microsoft's implementation of C++ AMP will definitely support AVX2 as well.

Which brings us to throughput computing using auto-vectorization. Visual Studio 11 supports it (note the "up to 8 times faster"). GCC supports it. Clang (built on LLVM and used by XCode and compatible with GCC) supports it...

So there's no reason for concern. GPGPU on the other hand has many restrictions and getting good performance is a minefield. So AVX2 has everything going for it.

Auto-vectorization is just one of the options. And a very important one for those who desire the best gain versus effort ratio. If you do want to push the hardware to its limits, auto-vectorization is probably not the best option. But there are countless alternatives to choose from.

Besides, it only takes a tiny change to tell whether an auto-vectorizing compiler effectively vectorized the loop you desired to be vectorized. Just like the __forceinline keyword will give you a warning when the compiler didn't inline it, there could be a __forcevectorize keyword, or a pragma.

So it's really a non-issue. The important thing is that AVX2 has no hardware limitations, and the software infrastructure can offer a very wide range of solutions from being completely transparent to the developer, through minor hints, to explicit parallel programming.

OK, now you've answered my original question in a way I could best comprehend.
 

BenchPress

Senior member
Nov 8, 2011
Please ask them to explain to me how AVX2 gather instructions can help to sample textures when there are neither 8-bit nor 16-bit element gather instructions. Do they work with FP32 textures?
I know that much. Textures typically have RGBA format which is 4 x 8-bit = 32-bit. So no need for individual 8-bit or 16-bit element gather, AVX2's 32-bit element gather will do just fine. And while textures with smaller texels do exist, note that it is recommended by GPU manufacturers to pack data into as few textures as possible to minimize the number of lookups (they can't do narrow texel fetches any faster). So I don't see why you expect the CPU to go beyond what a GPU can do. The hardware cost for 16-bit element gather and 8-bit element gather would be much greater than 32-bit element gather, while they'd hardly have any use cases.
good point, so let's say 9% speedup from hardware gather
You're still assuming to have very few gather operations in the first place!

For argument's sake let's say your shader consists of merely 5% texture sampling operations. In other words 1 texture operation for every 19 arithmetic operations. Nowadays trilinear anisotropic filtering is the standard and I expect on average the effective anisotropy is around 2. That means 16 texel taps per texture lookup. With 8 lanes that's 128 individual scalars you need to load sequentially if you don't have gather support...

So that's 128 scalar load instructions versus 19 arithmetic vector operations! Now, I know there's more to texture sampling than just the texel fetches, but doesn't it seem extremely doubtful that you'll only gain a 9% speedup from parallelizing those loads? And I was only looking at the texture lookups, where I expect the real ratio to be higher than 1/20. And then there are table lookups for transcendental functions, constant buffers and whatnot as well.

So would you care to explain in sufficient detail how it is you're only expecting a 9% speedup? Is your 3D rendering software very different from that used in the content creation industry?
One frequent use case is the emulation of the MIC VCOMPRESS instruction, one easy to grasp example is back faces elimination
Ah, I didn't realize those permute instructions take variable indexes (unlike previous shuffle instructions). It's basically a gather within a register. :cool: That would have a few uses I imagine. AVX2 is looking more impressive every day.

Anyway I'm not sure how back face elimination would be implemented in software exactly, but since you're talking about using a compress operation I assume that would be to reject the data of the back facing ones? Can't you just use a gather on the front-facing ones instead?
How GPUs limitations are relevant to rendering on the CPU ? Are you among all these people thinking that only GPUs can do graphics ?
No, I know CPUs are pretty good at ray-tracing and other complex offline rendering tasks. Like I said, I have contacts in the 3D content creation industry.

Anyway, I was wrong to assume that the GPU's lack of cross-lane instructions within the shader cores would mean there would be no use for it for 3D graphics on the CPU either. But I now realize that it's potentially useful for input and output that happens outside of the actual shader code.
indeed why ? where did I say that we don't have a legacy SSE path ?
I didn't imply that. It's just that most developers tell me they have an SSE2 path because it's widely supported and provides a nice speedup where applicable, but they skipped SSE3, SSE4 and AVX because they add little benefit for the effort of writing and maintaining additional code paths. They do look forward to supporting AVX2 though.

Anyway, I'd still like to understand why you think gather is the least exciting feature when I and everyone I know think it will be a breakthrough for auto-vectorization.
 

bronxzv

Senior member
Jun 13, 2011
I know that much. Textures typically have RGBA format which is 4 x 8-bit = 32-bit. So no need for individual 8-bit or 16-bit element gather, AVX2's 32-bit element gather will do just fine.
True, but not enough, since you end up with a pattern such as r7|g7|b7|a7 .. r0|g0|b0|a0 that is useless as is for any computation (you mentioned multi-tap filtering). To make it useful you need to unpack the data; in my case I need the colors in SoA, so the legacy method with vanilla AVX swizzling is nearly as fast.

The hardware cost for 16-bit element gather and 8-bit element gather would be much greater than 32-bit element gather
I have in mind an 8 x 8-bit element gather followed by an expansion like the one provided by VPMOVSXBQ. This doesn't look more complex than an 8 x 32-bit element gather and would be invaluable for texture mapping.

individual scalars you need to load sequentially if you don't have gather support...
Nope: 32-bit moves, then swizzle.

then there's table lookups for transcendental functions and constant buffers and whatnot as well.
With 3rd-order polynomials you'll need 4 gathers with the exact same packed indices; that's not much faster than legacy AVX with a swizzle from your coefficients stored in AoS.


I assume that would be to reject the data of the back facing ones? Can't you just use a gather on the front-facing ones instead?
Nope, because to compute the front-face mask you need the transformed normals in registers, and when you already have everything in registers, why access memory again?
 

BenchPress

Senior member
Nov 8, 2011
True, but not enough, since you end up with a pattern such as r7|g7|b7|a7 .. r0|g0|b0|a0 that is useless as is for any computation (you mentioned multi-tap filtering). To make it useful you need to unpack the data; in my case I need the colors in SoA, so the legacy method with vanilla AVX swizzling is nearly as fast.
Useless? Just mask out the component you need and it's in SoA form. I don't see how you could do it "nearly as fast" with scalar loads and swizzles.
I have in mind a 8 x 8-bit elements gather followed by an expansion like the one provided by VPMOVSXBQ, this doesn't looks more complex than a 8 x 32-bit elements gather and will be invaluable for texture mapping
But then you need four times as many of these 8 x 8-bit gathers! Why not fetch all four texel components at once (they're next to each other anyway)? Masking out the components should be cheaper than a gather (both in cycles and power consumption), and gets you SoA instantly.
with 3rd order polynomials you'll need 4 gathers with the exact same packed indices, that's not much faster than the legacy AVX with swizzle from you coefficients stored in AoS
The indices are not necessarily the same. I know a few implementations of transcendental math functions for supercomputers where the exponent and mantissa are used in separate table lookups.

You have a good point that some data locality could be lost when gathering polynomial coefficients from separate tables. But note that you can use 64-bit element gathers to fetch two single-precision coefficients at the same time from an 'interleaved' table. The VHADDPS instruction should come in handy here (or VDPPS if you have more coefficients). You can also just duplicate the indices within the register and add an offset to access the other coefficients.

Anyway, that assumes that the tables would be fairly big and the indices highly random. For these uses they are typically quite small and there's some correlation between the index values so multiple gathers with the same index vectors should work just fine.
Cool, thanks. Looks like your product is in a bit of a niche market though. Anyway, it seems safe to conclude that AVX2 will have something for everyone. We'll have to wait and see which approaches work best.
 

bronxzv

Senior member
Jun 13, 2011
Useless *as is*, i.e. it must be massaged before it can be used.

Just mask out the component you need and it's in SoA form.
For 3 of the 4 components, before you can "just mask out the component", you need to shift right a copy of the original gather result. With AVX2 that will be 1 vpgatherdd + 3 vpand + 3 vmovdqa + 3 vpsrld; in other words the actual gather is only 1 in 10 instructions. When you factor in the computation of the indices and some useful computations based on the gathered values, gather will amount to something like 1/40 to 1/20 of the instructions, and we are speaking of a gather-intensive kernel.

I don't see how you could do it "nearly as fast" with scalar loads and swizzles.
Not scalar loads: 8x 32-bit RGBA moves, or 4x 64-bit R1G1B1A1R0G0B0A0 moves with adjacent texels, then a couple of punpck(h/l)bw & punpck(h/l)wd. I'm not sure the gather path will be faster; I'll be glad to test it.

But then you need four times more of these 8 x 8-bit gathers! Why not fetch all four texel components at once (they're next to each other anyway). Masking
It would be useful for 24-bit RGB textures, but indeed a bad idea for 32-bit textures.