Tri- or Quad-GPU Beast for CUDA Compute - What Hardware?

Arc337

Junior Member
Sep 1, 2007
21
0
0
Hey guys,


I want to spec out a tri- or quad-GPU system for CUDA compute. Basically, I want to see how much GFLOPS-per-dollar performance I can get out of a machine. The budget is under $1500, but surprise me if you can get really great GFLOPS/dollar out of something that costs as much as $2500. Parts will be bought in the US. I won't be reusing any current parts; it'll be all new. I haven't found any similar threads on AnandTech. Assume, theoretically, that I'm building it NOW.

I need a board that can run PCI-E x16 on every slot that has a card in it, because I've seen some motherboards drop to PCI-E x8 or slower when several slots are populated.

The CPU doesn't matter, as long as it's at least quad core. I'd need 4 gigs of RAM, but that's not too important, either. What's important is that

a. we can put two, three, four, or more GPUs in a single system (I was thinking GTX 460s for good GFLOPS/dollar)
b. they all have x16 slots
c. the power supply can handle all of it!
d. it won't be overclocked, but the parts should be really solid (no unstable pre-overclocked stuff!)
e. good airflow (nothing should be getting too hot, because the calculations need to be correct!)
f. no rebates, since I'd like to be able to price for more than one build.

Recommendations for a cheap, great quality motherboard with at least 3 PCI-E x16 slots would be much appreciated, if that's all you have time for!
 

VirtualLarry

No Lifer
Aug 25, 2001
56,574
10,210
126
I'm not entirely sure what the benefit of all x16 PCI-E slots would be. Unless you somehow know that your application is going to be bottlenecked by x8 PCI-E 2.0.

I built a quad-GPU CUDA rig for F@H, using an MSI K9A2 Platinum motherboard, an AMD low-power dual-core CPU, and four 9600GSO single-slot cards.

16K PPD, 400W.
 

Arc337

Junior Member
Sep 1, 2007
21
0
0
VirtualLarry said:
I'm not entirely sure what the benefit of all x16 PCI-E slots would be. Unless you somehow know that your application is going to be bottlenecked by x8 PCI-E 2.0.

I built a quad-GPU CUDA rig for F@H, using an MSI K9A2 Platinum motherboard, an AMD low-power dual-core CPU, and four 9600GSO single-slot cards.

16K PPD, 400W.

The use would be for high performance computing, not a grid application. On the GPU, the number-crunching power is so great that we really want to avoid waiting on memory accesses, and transfers over the PCI-E bus are where a real bottleneck could show up. I'd just prefer to keep it at x16, since I'm not sure the program structure and memory access patterns will be similar to F@H's.
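For what it's worth, it's easy to measure what the slot actually delivers: time a large pinned-memory copy with CUDA events. A rough sketch (the 256 MB buffer size is just an arbitrary choice of mine), which you could run once in an x16 slot and once in an x8 slot to see whether it matters for your code:

```cuda
// Rough sketch: time one large pinned-memory host-to-device copy with CUDA
// events and report the achieved bandwidth. The 256 MB buffer size is an
// arbitrary illustrative choice.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull * 1024 * 1024;
    void *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost(&h_buf, bytes);            // pinned host memory, needed for full PCIe speed
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host -> device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```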
 

mfenn

Elite Member
Jan 17, 2010
22,400
5
71
www.mfenn.com
If your CUDA code is constantly sending data over the PCIe bus, you're doing it wrong. Any kind of sustained transfer (yes, even across an x16 link) is going to absolutely kill your performance.
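To make that concrete, the usual way to keep a sustained stream from stalling the GPU is to chunk the data and overlap cudaMemcpyAsync with kernel work on separate streams. A minimal sketch with a placeholder kernel and made-up chunk sizes, not anyone's real code:

```cuda
// Minimal sketch of overlapping transfers with compute: the data is split into
// chunks, and the async copy of chunk c+1 can proceed while the kernel is
// still working on chunk c. Chunk count, sizes, and the kernel are placeholders.
#include <cuda_runtime.h>

__global__ void process(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;               // stand-in for the real kernel
}

int main() {
    const int chunks = 4;
    const size_t n = 1 << 22;                 // elements per chunk
    float *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost((void **)&h_buf, chunks * n * sizeof(float));   // pinned, required for async copies
    cudaMalloc((void **)&d_buf, chunks * n * sizeof(float));

    cudaStream_t stream[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&stream[c]);

    for (int c = 0; c < chunks; ++c) {
        cudaMemcpyAsync(d_buf + c * n, h_buf + c * n, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[c]);
        process<<<(unsigned)((n + 255) / 256), 256, 0, stream[c]>>>(d_buf + c * n, n);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(stream[c]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Keep in mind the GeForce Fermi cards only have one copy engine, so you get copy/kernel overlap but not simultaneous uploads and downloads.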

Also, does your code need DP? If so, forget about using anything other than a real Tesla.

<-- Does HPC for a living
 

Arc337

Junior Member
Sep 1, 2007
21
0
0
GREAT to see another HPC user in this thread.

Mfenn, I completely agree, but our datasets may exceed the size of the device memory, in which case we have no choice. Of course, we're going to optimize for data reuse as much as possible, and I can't tell you exactly how we're going to do that, but the fewer bottlenecks, the better. As it is, device memory latency is already a bottleneck compared to shared memory. The number-crunching throughput of the Fermi cores, when the SIMD paradigm is used properly, far outruns what the device memory can feed them.
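As a toy illustration of that kind of reuse (not our real code, obviously), here's a 1D 3-point stencil that stages each tile in shared memory so neighbouring threads reuse the same global loads:

```cuda
// Toy example of data reuse through shared memory: a 3-point stencil where
// each block stages a tile (plus halo) once, so neighbouring threads reuse
// the same global loads instead of re-reading device memory.
#include <cuda_runtime.h>

#define TILE 256                              // launch with blockDim.x == TILE

__global__ void stencil3(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];          // one halo cell on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    if (gid < n)
        out[gid] = 0.25f * tile[lid - 1] + 0.5f * tile[lid] + 0.25f * tile[lid + 1];
}

int main() {
    const int n = 1 << 20;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));   // placeholder data

    stencil3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```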

I was considering that DP vs. non-DP question. First off, from what I've seen on Wikipedia, the Fermi cards aren't capped too badly (what's the artificial cap on the Fermi consumer cards, is it 1/4?). Regardless of the artificial DP cap, I was thinking of Amdahl's law: if we end up not being able to hide our memory accesses by switching to another task, we might just be better off getting 3 Fermi consumer compute cards for the price of 1. However, we'd have to consider things like the fact that the scientific cards have ECC. I don't know much about the other differences between the scientific and consumer versions. Thoughts, comments?

But for this build, let's assume SP for now :D
 

tynopik

Diamond Member
Aug 10, 2004
5,245
500
126
I would guess knowing the amount of memory you need is fairly critical

if 1.5GB causes a ton of swap but everything fits in 3GB, then you basically have no choice but to go for the 3GB card

and if 3GB is not enough, then you may have no choice but to go Tesla

basically I don't see how you can make an informed decision until you have a better handle on your memory requirements
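if it helps, a few lines of CUDA will tell you exactly how much memory each card in the box offers before committing to 1.5GB vs 3GB parts (quick sketch, nothing project-specific):

```cuda
// Quick device-memory survey: reports the total and currently free memory on
// every CUDA device in the system.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        cudaSetDevice(dev);

        size_t freeB = 0, totalB = 0;
        cudaMemGetInfo(&freeB, &totalB);
        printf("GPU %d (%s): %.2f GB total, %.2f GB free\n",
               dev, prop.name, totalB / 1e9, freeB / 1e9);
    }
    return 0;
}
```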
 

mfenn

Elite Member
Jan 17, 2010
22,400
5
71
www.mfenn.com
Arc337 said:
GREAT to see another HPC user in this thread.

Mfenn, I completely agree, but our datasets may exceed the size of the device memory, in which case we have no choice. Of course, we're going to optimize for data reuse as much as possible, and I can't tell you exactly how we're going to do that, but the fewer bottlenecks, the better. As it is, device memory latency is already a bottleneck compared to shared memory. The number-crunching throughput of the Fermi cores, when the SIMD paradigm is used properly, far outruns what the device memory can feed them.

I was considering that DP vs. non-DP question. First off, from what I've seen on Wikipedia, the Fermi cards aren't capped too badly (what's the artificial cap on the Fermi consumer cards, is it 1/4?). Regardless of the artificial DP cap, I was thinking of Amdahl's law: if we end up not being able to hide our memory accesses by switching to another task, we might just be better off getting 3 Fermi consumer compute cards for the price of 1. However, we'd have to consider things like the fact that the scientific cards have ECC. I don't know much about the other differences between the scientific and consumer versions. Thoughts, comments?

But for this build, let's assume SP for now :D

I believe the DP limit is 1/12 or 1/8, depending on the card.

When you're working with a large dataset, if your processing takes less time than streaming the data into memory, then you should probably forget about using GPUs.

PCIe Gen2 x8 = 32Gb/s
PCIe Gen2 x16 = 64Gb/s
Dual-channel DDR3 1333 = 170Gb/s

Another consideration is that with desktop platforms (1155), there are only 16 PCIe lanes directly connected to the CPU anyway, so if you're truly memory bandwidth-limited, it doesn't really matter how you split it up.
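For anyone checking the arithmetic: PCIe 2.0 gives 500 MB/s per lane in each direction after 8b/10b encoding, and DDR3-1333 moves 8 bytes per channel per transfer. A tiny sanity-check program (my arithmetic, nothing more):

```cuda
// Sanity check of the bandwidth figures above (compiles as plain host code).
// PCIe 2.0: 5 GT/s per lane, 8b/10b encoding -> 500 MB/s per lane per direction.
// DDR3-1333: 1333 MT/s x 8 bytes per channel x 2 channels.
#include <cstdio>

int main() {
    const double pcie2_lane_GBps = 0.5;                    // GB/s per lane, one direction
    const double ddr3_1333_GBps  = 1333e6 * 8 * 2 / 1e9;   // ~21.3 GB/s

    printf("PCIe Gen2 x8   : %.0f Gb/s\n",  8 * pcie2_lane_GBps * 8);
    printf("PCIe Gen2 x16  : %.0f Gb/s\n", 16 * pcie2_lane_GBps * 8);
    printf("DDR3-1333 dual : %.0f Gb/s\n", ddr3_1333_GBps * 8);
    return 0;
}
```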
 

Arc337

Junior Member
Sep 1, 2007
21
0
0
mfenn said:
If your CUDA code is constantly sending data over the PCIe bus, you're doing it wrong. Any kind of sustained transfer (yes, even across an x16 link) is going to absolutely kill your performance.

The plan as of now is to run more than one independent code per GPU. I could see why this is a problem if all the codes are waiting for memory accesses. Care to elaborate?


mfenn said:
I believe the DP limit is 1/12 or 1/8, depending on the card.


Thanks!

mfenn said:
When you're working with a large dataset, if your processing takes less time than streaming the data into memory, then you should probably forget about using GPUs.

You may have a point. If the CPU can run the program as a whole faster, then we should use the CPU. As it stands, I believe some of the code benefits from the GPU.


mfenn said:
PCIe Gen2 x8 = 32Gb/s
PCIe Gen2 x16 = 64Gb/s
Dual-channel DDR3 1333 = 170Gb/s

Another consideration is that with desktop platforms (1155), there are only 16 PCIe lanes directly connected to the CPU anyway, so if you're truly memory bandwidth-limited, it doesn't really matter how you split it up.

I wonder if AMD's latest stuff has the same problem. Also, does the host memory (the RAM on the mobo) have to be accessed by the GPU through these lanes?
 

Davidh373

Platinum Member
Jun 20, 2009
2,428
0
71
Arc337 said:
I wonder if AMD's latest stuff has the same problem. Also, does the host memory (the RAM on the mobo) have to be accessed by the GPU through these lanes?

Processing isn't as fast, BUT you can take a look at X58 and socket 1366, which have more PCI-E lanes (sorry, it's been a while, so I can't remember the exact specs).
 

mfenn

Elite Member
Jan 17, 2010
22,400
5
71
www.mfenn.com
Arc337 said:
The plan as of now is to run more than one independent code per GPU. I could see why this is a problem if all the codes are waiting for memory accesses. Care to elaborate?

Going back to Amdahl's law, you can consider an arbitrary number of independent codes to be the P of a larger workflow and the time spent transferring across the memory bus to be S. Thus, if S is large and grows in proportion to P, then there's no real point in making P fast. You'd be better off spending your energy making S constant, or at least a smaller factor.
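A back-of-envelope version of that, with completely made-up times, shows how quickly the S term caps the overall speedup no matter how fast you make P:

```cuda
// Made-up numbers, just to show the shape of the curve: once the transfer
// term S dominates, speeding up the compute term P stops paying off.
#include <cstdio>

int main() {
    const double compute_s  = 10.0;   // hypothetical CPU compute time, seconds (P)
    const double transfer_s = 2.0;    // hypothetical PCIe transfer time, seconds (S)
    const double speedups[] = {5.0, 20.0, 100.0};

    for (double s : speedups) {
        double total = compute_s / s + transfer_s;
        printf("%6.0fx on compute -> %.2fx overall\n",
               s, (compute_s + transfer_s) / total);
    }
    return 0;
}
```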

Arc337 said:
I wonder if AMD's latest stuff has the same problem. Also, does the host memory (the RAM on the mobo) have to be accessed by the GPU through these lanes?

All modern CPU architectures have their memory controllers on the CPU die. Current-generation AMD uniprocessor platforms are actually a bit worse off, because the PCIe lanes are on the chipset, which is connected to the CPU via the HyperTransport bus.
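And yes, on a discrete card host RAM is reached over those PCIe lanes. Mapped "zero-copy" memory makes that explicit, since the kernel dereferences host memory directly and every access crosses the bus. Toy sketch (sizes and the kernel are placeholders):

```cuda
// Toy demonstration that the GPU reaches host RAM over the PCIe link: with
// mapped ("zero-copy") memory the kernel dereferences host memory directly,
// so every access below crosses the bus.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;                   // each read and write goes over PCIe
}

int main() {
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);     // must be set before the context is created

    float *h_p = nullptr, *d_p = nullptr;
    cudaHostAlloc((void **)&h_p, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_p[i] = 1.0f;

    cudaHostGetDevicePointer((void **)&d_p, h_p, 0);
    scale<<<(n + 255) / 256, 256>>>(d_p, n);
    cudaDeviceSynchronize();

    printf("h_p[0] = %.1f (expect 2.0)\n", h_p[0]);
    cudaFreeHost(h_p);
    return 0;
}
```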

It's not until you get into the dual and quad socket market that you really see an explosion in available PCIe bandwidth. Take the HP SL390s G7 2U for example. It has two IOH chipsets, one connected to each CPU.

[Attached image: sl390-gpu-bandwidth.jpg]