GTX480 Breaks Sorting Algorithm Giga-Sort Barrier

Wreckage

Banned
Jul 1, 2005
5,529
0
0
http://code.google.com/p/back40computing/wiki/RadixSorting

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).

They also benchmarked a bunch of different cards. Really cool stuff.
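
For anyone who hasn't looked at how radix sorting works, here's the textbook least-significant-digit version in plain C++ -- just a sketch of the structure, not their code; the back40computing implementation parallelizes each of these phases (histogram, prefix sum, scatter) across the GPU and is far more heavily tuned:

Code:
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential LSD radix sort of 32-bit keys, 8 bits (one "digit") per pass.
void lsd_radix_sort(std::vector<uint32_t>& keys)
{
    const int kRadixBits = 8;
    const int kBuckets   = 1 << kRadixBits;        // 256 buckets per pass
    std::vector<uint32_t> scratch(keys.size());

    for (int shift = 0; shift < 32; shift += kRadixBits) {
        // 1) Histogram: count how many keys fall into each bucket.
        std::size_t count[kBuckets] = {};
        for (uint32_t k : keys)
            ++count[(k >> shift) & (kBuckets - 1)];

        // 2) Exclusive prefix sum: turn counts into starting offsets.
        std::size_t offset = 0;
        for (int b = 0; b < kBuckets; ++b) {
            std::size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        // 3) Stable scatter into the scratch buffer, then swap buffers.
        for (uint32_t k : keys)
            scratch[count[(k >> shift) & (kBuckets - 1)]++] = k;
        keys.swap(scratch);
    }
}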
 

Barfo

Lifer
Jan 4, 2005
27,539
212
106
text redacted

Cool it with the mocking, not acceptable.

Moderator Idontcare
 
Last edited by a moderator:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Being a radix sort, I wonder how well it could be made to scale to larger data types (like Decimal types)? Not in theory, but in actual performance, implemented within a reasonable time frame.

IMO, this is a pretty big step in terms of showing that it can really be done on real devices. Based on their stated CPU numbers, performance per watt isn't great... but it isn't shabby either, and it would probably look very good if offloaded to a GTX 460, the upcoming GTX 470 replacement (full GF104), etc.

It's this 'boring' stuff that will really push GPGPU and force the maturation of OpenCL, DirectCompute, etc. The kind of performance seen here would be enough to make it worth trying on something like Llano's or SB's IGP, for instance.
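
To put rough numbers on what I mean: the work should scale with the number of digit passes (key bits divided by bits per digit), and anything that isn't already an unsigned integer first needs an order-preserving remap of its bits. The snippets below are the standard tricks, not anything from this project, and something like a 128-bit Decimal is the hard case because its layout doesn't compare lexicographically at all:

Code:
#include <cstdint>
#include <cstring>

// Number of counting passes: 32-bit keys -> 4, 64-bit -> 8, 128-bit -> 16
// (at 8 bits per digit), so cost grows roughly linearly with key width.
inline int radix_passes(int key_bits, int digit_bits)
{
    return (key_bits + digit_bits - 1) / digit_bits;
}

// Map a signed int to an unsigned key with the same ordering: flip the sign bit.
inline uint32_t sortable_bits(int32_t v)
{
    return static_cast<uint32_t>(v) ^ 0x80000000u;
}

// Map a float to an unsigned key with the same ordering (NaNs aside):
// negatives get all bits inverted, positives just get the sign bit set.
inline uint32_t sortable_bits(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}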
 

brybir

Senior member
Jun 18, 2009
241
0
0
Being a radix sort, I wonder how well it could be made to scale to larger data types (like Decimal types)? Not in theory, but in actual performance, implemented within a reasonable time frame.

IMO, this is a pretty big step in terms of showing that it can really be done on real devices. Based on their stated CPU numbers, performance per watt isn't great... but it isn't shabby either, and it would probably look very good if offloaded to a GTX 460, the upcoming GTX 470 replacement (full GF104), etc.

It's this 'boring' stuff that will really push GPGPU and force the maturation of OpenCL, DirectCompute, etc. The kind of performance seen here would be enough to make it worth trying on something like Llano's or SB's IGP, for instance.


I think that is certainly the future. Most workloads are not one specific type of data computation, and therefore some parts will rock on GPGPU and others will fail hard.

If you follow the Folding At Home folks over the past few years you see this in action: PS3 clients and GPU clients are very powerful and work very well for *some* of their data crunching, but they still require massive CPU implementations for the data types that do not work well on GPGPU devices.

We are also seeing GPUs beginning to look more like CPUs in some ways. Fermi has added much larger data caches and other features that let it act more like a CPU in certain operations. Similarly, Intel is releasing new hardware implementations for media transcoding and vector operations that should help it in some areas.

Five years from now we should be seeing standard integer/floating-point units combined on die with GPU cores and specialized cores, which should make for an interesting few years to watch as this market evolves. The question to me, at least, is how Nvidia gets on board this trend: its GPU is a monster number cruncher for some data types, but it lacks a clear path to adding integer and floating-point units to its execution pipelines to provide a coherent product that competes with Intel and AMD. Should be interesting to watch Nvidia's strategy over the next few years as well.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I think that is certainly the future. Most workloads are not one specific type of data computation, and therefore some parts will rock on GPGPU and others will fail hard.
But the failures already have well-tuned CPU implementations, and the whole task can be sped up by letting the CPU do those parts. I.e., do the non-GPGPU-friendly stuff on the CPU, pass the GPU-friendly stuff to the GPU, do other CPU work or sleep while waiting, then do anything the CPU is best at on the CPU again. For cases where both are good, decide based on the resources currently available.
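
Something like this little host-side dispatcher is all I mean; the 1M-key cutoff is made up, and I'm using Thrust here only because it's the quickest way to get a GPU sort going (it falls back to a radix sort for primitive keys, as far as I know), not because it's what this project exposes:

Code:
#include <algorithm>
#include <cstddef>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Below this size the PCIe transfer + launch latency tends to eat the gain.
static const std::size_t kGpuCutoff = 1u << 20;

void hybrid_sort(std::vector<unsigned int>& keys)
{
    if (keys.size() < kGpuCutoff) {
        std::sort(keys.begin(), keys.end());                  // tuned CPU path
    } else {
        thrust::device_vector<unsigned int> d(keys.begin(), keys.end()); // to GPU
        thrust::sort(d.begin(), d.end());                      // sort on the GPU
        thrust::copy(d.begin(), d.end(), keys.begin());        // copy result back
    }
}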

If you follow the Folding At Home folks over the past few years you see this in action: PS3 clients and GPU clients are very powerful and work very well for *some* of their data crunching, but they still require massive CPU implementations for the data types that do not work well on GPGPU devices.
But, where it works, it tends to more than make up for the effort. Software maturity/flexibility and hardware latency are two huge humps to get over, right now. Intel has hardware latency basically taken care of, while AMD is taking smaller steps (grrr), and nVidia is moving towards their chip being a complete CPU (I'm sure they're either working on a custom ARM design, or a fully custom set of execution units to merge into a future design).
 

Acanthus

Lifer
Aug 28, 2001
19,915
2
76
ostif.org
I think as GPU's get more programmable we will see performance per watt reach parity with CPU's.

It is interesting how ARM and Atom are starting to turn some heads for large loads that can be heavily threaded.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I think as GPU's get more programmable we will see performance per watt reach parity with CPU's.
It might be there now, for some things, as long as you don't test on cards that were ever high-end. In the link they did not test a single card known for having even decent performance per watt.

I understand that was not their purpose, but if we're going to talk about performance per watt, the GTX 280 and GTX 480 being the better ones of the bunch is not a good place to start from. Of nVidia's line-up, it'd be nice to see how a GT 240 (would it be faster than an Athlon II X4 or an i3?), a GTS 250, and a GTX 460 manage, for instance, sticking with CUDA.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
In the best case, GPUs already beat CPUs in terms of performance/watt in an embarrassing way.
The fastest GPUs deliver 2+ TFLOPS at under 200 W TDP.
The fastest CPUs deliver about 120 GFLOPS at about 130 W TDP.
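
Spelled out: 2000 GFLOPS / 200 W is about 10 GFLOPS per watt for the GPU, versus 120 GFLOPS / 130 W, which is under 1 GFLOPS per watt for the CPU, so roughly an order of magnitude, on peak numbers at least.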
 

brybir

Senior member
Jun 18, 2009
241
0
0
But the failures already have well-tuned CPU implementations, and the whole task can be sped up by letting the CPU do those parts. I.e., do the non-GPGPU-friendly stuff on the CPU, pass the GPU-friendly stuff to the GPU, do other CPU work or sleep while waiting, then do anything the CPU is best at on the CPU again. For cases where both are good, decide based on the resources currently available.

But, where it works, it tends to more than make up for the effort. Software maturity/flexibility and hardware latency are two huge humps to get over, right now. Intel has hardware latency basically taken care of, while AMD is taking smaller steps (grrr), and nVidia is moving towards their chip being a complete CPU (I'm sure they're either working on a custom ARM design, or a fully custom set of execution units to merge into a future design).

Your first statement makes it sound like you are disagreeing with me but I think you are saying exactly what I said?

As to your second point, it remains to be seen whether Nvidia can successfully develop CPU-type processing abilities to integrate into its GPUs, and whether such a solution will be accepted by the market.

The overall point I was making is that both ends are converging, i.e. GPUs becoming more CPU-like and CPUs becoming more GPU-like. The interesting part will be seeing which types of solution gain traction and which do not.
 

brybir

Senior member
Jun 18, 2009
241
0
0
In the best case, GPUs already beat CPUs in terms of performance/watt in an embarrassing way.
The fastest GPUs deliver 2+ TFLOPS at under 200 W TDP.
The fastest CPUs deliver about 120 GFLOPS at about 130 W TDP.

When the data works for GPU's to compute it is exceptional. When it does not, it gets pretty ugly.

One interesting paper from a bit ago was published by Intel: http://portal.acm.org/citation.cfm?...&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1

I can't find a free version of it, but the highlight from their own summary says:

"In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an NVIDIA GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average."
 
Last edited:

Scali

Banned
Dec 3, 2004
2,495
0
0
When the data works for GPU's to compute it is exceptional. When it does not, it gets pretty ugly.

One interesting paper from a bit ago was published by Intel: http://portal.acm.org/citation.cfm?...&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1

I can't find a free version of it, but the highlight from their own summary says:

"In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an NVIDIA GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average."

2.5x would be enough to give GPUs the advantage in performance/watt, since their TDP is less than 2.5x that of a CPU (and the CPU figure doesn't even include the chipset and memory, which technically should be counted as well, since the GPU's TDP covers the card as a whole, memory and other on-card components included).
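
Put as a formula: the performance/watt ratio is the speedup times (CPU TDP / GPU TDP), so with a 2.5x speedup the GPU stays ahead as long as its TDP is under 2.5x the CPU's.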

And the beauty of it all is:
You have both, so in cases where the GPU is inefficient, you use the CPU, and vice versa.

But I'd say on average the GPU has already won the performance/watt over CPUs.
 
May 13, 2009
12,333
612
126
pic redacted

Cool it with the mocking, not acceptable.

Moderator Idontcare
 
Last edited by a moderator:

Keysplayr

Elite Member
Jan 16, 2003
21,209
50
91
When the data works for GPU's to compute it is exceptional. When it does not, it gets pretty ugly.

One interesting paper from a bit ago was published by Intel: http://portal.acm.org/citation.cfm?...&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1

I can't find a free version of it, but the highlight from their own summary says:

"In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an NVIDIA GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average."

Yes, I remember this. Many thought it would have been better if Intel hadn't published it. "GPUs aren't 10 to 100X faster than our i7 960 CPUs, they are only 2.5x faster." Mind you, that was addressing a GTX 280 (GT200), not even Fermi. In hindsight, Intel probably wishes they hadn't released it.
 

Voo

Golden Member
Feb 27, 2009
1,684
0
76
Yes, I remember this. Many thought it would have been better if Intel hadn't published it. "GPUs aren't 10 to 100X faster than our i7 960 CPUs, they are only 2.5x faster." Mind you, that was addressing a GTX 280 (GT200), not even Fermi. In hindsight, Intel probably wishes they hadn't released it.
Sure, you end up with good performance improvements if the algorithm can be coded in a certain way, but there are more things to consider. Mostly the added complexity (have you looked at the code and compared it to a standard radix sort implementation?), as well as the fact that you end up writing the code in 15 different ways for every architecture, register size, SM version and so on. Oh, and you've got to compile separately for all those versions as well, otherwise you end up with gigantic performance penalties (SM1.0-generated code on SM2.0 devices, for example). I've also heard of problems with partial writes on Fermi with ECC where you end up with 60-70% performance losses, and so on.
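
For reference, the usual way around the per-architecture compilation problem is building a fat binary that carries native code for each target plus PTX as a fallback, something along these lines (sort.cu is just a placeholder name):

Code:
nvcc -O3 sort.cu -o sort \
     -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_20,code=compute_20

The last -gencode embeds PTX that the driver can JIT for devices you didn't target explicitly, which avoids the SM1.x-code-on-SM2.0 penalty, though it does nothing for kernels hand-tuned around one architecture's register file and shared memory sizes.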

Lots of problems that have to be solved before it's really mass market ready, but hey, that's the interesting part ;)

And about performance/watt: for algorithms that work well on a GPU it's certainly more efficient and faster, but that's to be expected; if a specialized device couldn't beat a much more general one at the things it's supposed to be good at, something would be horribly wrong.

PS: Oh and their numbers don't include the driver overhead and the time needed to transfer data to the GPU.
 
Last edited:

Keysplayr

Elite Member
Jan 16, 2003
21,209
50
91
Sure, you end up with good performance improvements if the algorithm can be coded in a certain way, but there are more things to consider. Mostly the added complexity (have you looked at the code and compared it to a standard radix sort implementation?), as well as the fact that you end up writing the code in 15 different ways for every architecture, register size, SM version and so on. Oh, and you've got to compile separately for all those versions as well, otherwise you end up with gigantic performance penalties (SM1.0-generated code on SM2.0 devices, for example). I've also heard of problems with partial writes on Fermi with ECC where you end up with 60-70% performance losses, and so on.

Lots of problems that have to be solved before it's really mass market ready, but hey, that's the interesting part ;)

PS: Oh and their numbers don't include the driver overhead and the time needed to transfer data to the GPU.

Sure, there are situations where things are better left to run on the CPU because, for that particular app, it's faster. The 2.5x is a worst case scenario; it couldn't be anything else but that. We've seen apps that actually do run (best case scenario) many hundreds of times faster than a top end CPU, overhead and time needed to transfer data to the GPU or not.
 

brybir

Senior member
Jun 18, 2009
241
0
0
Yes, I remember this. Many thought it would have been better if Intel hadn't published it. "GPUs aren't 10 to 100X faster than our i7 960 CPUs, they are only 2.5x faster." Mind you, that was addressing a GTX 280 (GT200), not even Fermi. In hindsight, Intel probably wishes they hadn't released it.

Yeah... I am not technically inclined enough to know specifically what they were getting at, and since I cannot find a free copy I can't even read it. My guess would be that over time that 2.5x would start to spiral upwards, at least until Intel gets on its own GPU bandwagon, and then it can say "well, we saw that, and so now we've added it!" No pie on the face that way (or less of it, anyway).

On a general note, it is not surprising that GPUs are faster for certain data applications, but I would disagree with Scali that "But I'd say on average the GPU has already won the performance/watt over CPUs."

GPUs only win the performance per watt award when the data is of a type they excel at. When the data is not their type, they fail miserably. Computing tasks will always be some combination of truly serial work, mostly serial work that can be partially parallelized with a lot of coding effort, and work that is embarrassingly parallel. It is the rare real-world data set that is exclusively one of these types (although rasterization is one example that is embarrassingly parallel).

For the current crop of GPUs to be monsters, the data has to have certain features:
1. Parallelism throughout most or all of the data set. Workloads where only a portion is parallel do not work very well and incur significant performance penalties when run on a GPU.

2. Little or no branching, or

3. If they do branch, they need to branch the same way. Because the shaders work in groups (i.e. the "CUDA cores" in a warp execute in lockstep), a divergent branch causes a major stall; see the toy kernels right after this list. Not to mention that branch prediction, misprediction recovery and buffering logic are lacking in most GPUs.

4. The data needs to fit into the RAM on the GPU card. Latency and access penalties are very high for a GPU compared to a CPU right now, hence the Tesla cards having several GB of very fast RAM onboard. But there is a limit, and it makes the cost skyrocket. Intel server systems with 32GB of RAM are easy; 32GB Tesla cards... not so much.

5. The problem should be single precision, at least until double-precision support catches up over the hardware development cycles.
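
To illustrate point 3, here is a toy pair of CUDA kernels (a made-up example, nothing to do with the sorting code). In the first, odd and even threads within the same warp take different paths, so the warp has to execute both sides serially; in the second, the whole block makes the same decision, so there is no divergence inside any warp:

Code:
__global__ void divergent(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] & 1)            // data-dependent branch: splits each warp
        out[i] = in[i] * 3;
    else
        out[i] = in[i] / 3;
}

__global__ void uniform(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (blockIdx.x & 1)       // same decision for every thread in the block
        out[i] = in[i] * 3;
    else
        out[i] = in[i] / 3;
}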

So data that fits these rough parameters will be exceptional on a GPU; most other data types will not. For example, try using a GPU as the primary processor for a relational database and it will not work well at all, to the point where the CPU is the one with a multi-fold performance advantage.

Or look at certain specialized chips from TI or others: they can build custom, specialized silicon that runs a gigabit router and switches that traffic with ease. Try replacing that switch with an i7 or a Fermi and watch it fail.
 

konakona

Diamond Member
May 6, 2004
6,285
1
0
quoted pic redacted

lol cool jpg bro
 
Last edited by a moderator:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Well, first, with low enough latency and easy enough programming (by which I mean the likes of DirectCompute or OpenCL maturing, or something else coming in, as optimization currently takes too much work to reach even decent performance), DB work could actually benefit a good bit. As long as you have accumulators (even if you need hundreds of them...), you should be able to pawn off many things to a GPGPU, just not the whole job. While Llano and SB are not themselves terribly interesting performance-wise (SB more than Llano, but still), both appear to drastically reduce latency for the IGP compared to what we have now, and I think that's one of many signs of what lies ahead.

Second, forget cards. Cards will remain for some time, and will not be totally obsolete any time soon, but it's what gets into the CPU that will really count. Need to fit in the video card's RAM? Nah, you just need to have shared memory, or some address translation magic (really ugly code that you don't want to see, but with elegant-looking wrappers that look like nice functions). Get quad-channel RAM, and it won't be too bottlenecked by that, either. As the software 'gets here', having a capable GPGPU in every mainstream CPU (2012?) will count for quite a lot, especially in servers. This also neatly handles anything with more than a few major branches: use the CPU--it's good for that. With some shared memory between the two (such as Intel's way of doing things), you could have some parts going on each section, for completing the same work. The current very high latency makes it quite a PITA to even think about writing code for GPGPU, unless it just begs for it.