CUDA performance - any reviews benching Fermi vs GT200 performance?

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
I use TMPGEnc for transcoding with a fair amount of video noise reduction and contour enhancement (the stuff CUDA handles well in TMPGEnc), and I'm looking at possibly upgrading to a GTX 470.

Are there any CUDA performance reviews out there for Fermi? Preferably ones that compare it to GT200 variants.

This is one of the most recent CUDA app performance reviews, and it is lackluster to say the least:
http://www.tomshardware.com/reviews/nvidia-cuda-gpgpu,2299-8.html

I don't game much on the rig targeted for the upgrade. I do use dual monitors for 2D desktop work (computer programming, etc.) and was a little surprised to see these power-consumption results when Fermi is running a dual-LCD setup at idle:

After a little more investigation I discovered that the GeForce GTX 480 video card was sitting at 90C in an idle state since I had two monitors installed on my system. I talked with some of the NVIDIA engineers about this 'issue' I was having and found that it wasn't really an issue per se, as they do it to prevent screen flickering. This is what NVIDIA said in response to our questions:

http://www.legitreviews.com/article/1258/15/

On that note - has anyone reported on the 2D desktop power consumption for 3-LCD Eyefinity setups? Is it also silly high?

edit: Legit Reviews posted an update to their dual-screen observations, including new BIOS support from Nvidia addressing the topic; I felt it needed to be included here for the sake of completeness.

Dual Monitor Temperature Issues Get Fixed
http://www.legitreviews.com/article/1271/1/
 

Lonyo

Lifer
Aug 10, 2002
21,938
6
81
No idea about the first bit, but ATI also increases clocks when multiple monitors are connected, like NV does, so power consumption is higher than with a single monitor (though not as high as the GTX's).
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
Anandtech had some CUDA benches (as well as OpenCL). As you can see, sometimes performance just scales linearly with the increase in CUDA cores, which is not good considering all the other compute enhancements in the Fermi architecture (concurrent kernels, for example). This is most likely a driver/application problem, as you can see from the Folding@home and ray-tracing results, where performance really flies.

I really should bookmark every interesting thread I come across :D. There was a discussion recently that linked to a response from Nvidia pretty much saying that multiple monitors and high clockspeeds are a hardware design issue and nothing much can really be done about it. Both ATI and Nvidia are in the same boat here, with their cards having to ramp up clockspeeds and voltage when a 2nd or 3rd monitor is connected, so don't expect a fix soon (if at all).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91

Thanks for that link... I definitely missed the "Compute" page in the review; too much skipping around on my part, no doubt.

As you can see, sometimes performance just scales linearly with the increase in CUDA cores, which is not good considering all the other compute enhancements in the Fermi architecture (concurrent kernels, for example). This is most likely a driver/application problem, as you can see from the Folding@home and ray-tracing results, where performance really flies.

I'm assuming similar (shader-count- and clockspeed-normalized) CUDA performance between Fermi and GT200 just shows the Fermi designers didn't break anything.

I'm not a GPU microarchitect, but wouldn't we expect the programs to need a recompile before any CUDA-specific optimizations or improvements from the Fermi architecture would really come to the forefront?

Perhaps those with more knowledge on the topic (Ben, etc.) could share their opinions here. It's not the point of the thread, but you've piqued my interest, so I have no issue with it getting discussed/debated here (if it hasn't already been discussed elsewhere... I've been out of the loop a few months).
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
I'm assuming similar (shader-count- and clockspeed-normalized) CUDA performance between Fermi and GT200 just shows the Fermi designers didn't break anything.

I'm not a GPU microarchitect, but wouldn't we expect the programs to need a recompile before any CUDA-specific optimizations or improvements from the Fermi architecture would really come to the forefront?

I'd agree. Looking at the evidence, it seems as if the applications that did not scale as expected are just making use of the same resources they were on GT200 (hence the linear increase in performance correlating with the doubling of shader resources). I'd assume the applications are just making the same calls they made on previous architectures; Fermi may have a few hardware improvements that innately speed things up just by running the same code on Fermi rather than on, say, a GTX 285 with the same amount of shader resources. I can see the independent SM dispatch logic and the 48KB of configurable L1/shared memory per SM as ways to improve throughput immediately. I'm not sure if parallel kernel processing needs to be coded for or if Fermi handles it at a low level; perhaps, as you say, a recompile or a later revision of CUDA may be required.
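For what it's worth, my reading of the CUDA docs is that concurrent kernels don't happen by themselves: the host code at least has to issue the kernels into separate (non-default) streams before Fermi even has a chance to overlap them. A rough sketch of what I mean (toy kernels, names made up purely for illustration):

Code:
#include <cuda_runtime.h>

// Two independent toy kernels standing in for real work.
__global__ void kernelA(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void kernelB(float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;
}

void launch_both(float *d_x, float *d_y, int n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    // Issued into different streams: Fermi *may* overlap them,
    // GT200 will simply run them back to back.
    kernelA<<<blocks, threads, 0, s0>>>(d_x, n);
    kernelB<<<blocks, threads, 0, s1>>>(d_y, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}

So existing code keeps working unchanged, but it would have to be structured like that before the concurrency buys anything.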

I'd be interested in what those familiar with CUDA would have to say about this.
 

ViRGE

Elite Member, Moderator Emeritus
Oct 9, 1999
31,516
167
106
I'm not a GPU microarchitect, but wouldn't we expect the programs to need a recompile before any CUDA-specific optimizations or improvements from the Fermi architecture would really come to the forefront?

Perhaps those with more knowledge on the topic (Ben, etc.) could share their opinions here. It's not the point of the thread, but you've piqued my interest, so I have no issue with it getting discussed/debated here (if it hasn't already been discussed elsewhere... I've been out of the loop a few months).
There are two ways to compile a CUDA application. You can compile it directly to machine code for certain GPU families, or you can compile it to CUDA bytecode (PTX). Programs compiled to machine code won't run on any other GPUs, so NVIDIA recommends compiling to PTX for most uses. The only time you want to compile to machine code is when you're knowingly targeting certain hardware, for example if you're running a Tesla farm.
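To make that concrete, the difference is basically just in how you invoke nvcc. A rough sketch (toy kernel; exact flag spellings may vary a bit between CUDA toolkit versions):

Code:
// scale.cu -- toy kernel used only to illustrate the two build paths.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// 1) Machine code for one specific family, e.g. GT200 (compute 1.3).
//    The result won't run on Fermi or anything else:
//        nvcc -arch=compute_13 -code=sm_13 -cubin scale.cu
//
// 2) PTX embedded in the binary; the driver JIT-compiles it at load time
//    for whatever GPU is actually present (GT200, Fermi, ...):
//        nvcc -arch=compute_13 -code=compute_13 -c scale.cu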

Everything that these guys have been able to benchmark was compiled to PTX, which removes the normal compiler from the equation. When compiled to PTX, performance rides on the JIT compiler in the drivers, and the hardware itself. So none of these programs need to be recompiled in the most literal sense.

This doesn't mean that the JIT compiler or the hardware is currently being used efficiently, though. On the hardware side, Fermi has configurable on-chip memory that allows each SM's 64KB to be split as either 16KB L1/48KB shared or 48KB L1/16KB shared. By default it acts more like GT200 for backwards compatibility purposes, which means it uses the 16/48 configuration. Depending on the application, 48/16 may be faster. As for the JIT compiler, it's entirely possible that NVIDIA has yet to wring everything out of the Fermi architecture with it, which again could impact performance.
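If memory serves, that split is just a per-kernel hint the application requests through the runtime API, along these lines (placeholder kernel name; as I understand it, pre-Fermi parts simply ignore the request):

Code:
#include <cuda_runtime.h>

// Placeholder kernel that would benefit from more L1 than shared memory.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void prefer_l1(void)
{
    // Ask for the 48KB L1 / 16KB shared split for this kernel instead of
    // the default 16KB L1 / 48KB shared. It's a hint, not a guarantee.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}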

In any case there's no specific reason why something needs to only be linearly faster on Fermi than GT200.

However rewriting (note: not recompiling) code to take advantage of Fermi's features would certainly go a long way towards maximizing its potential. Otherwise you don't get concurrent kernels, proper use of cache/shared memory, etc.
 

Genx87

Lifer
Apr 8, 2002
41,091
513
126
There are two ways to compile a CUDA application. You can compile it directly to machine code for certain GPU families, or you can compile it to CUDA bytecode (PTX). Programs compiled to machine code won't run on any other GPUs, so NVIDIA recommends compiling to PTX for most uses. The only time you want to compile to machine code is when you're knowingly targeting certain hardware, for example if you're running a Tesla farm.

Everything that these guys have been able to benchmark was compiled to PTX, which removes the normal compiler from the equation. When compiled to PTX, performance rides on the JIT compiler in the drivers, and the hardware itself. So none of these programs need to be recompiled in the most literal sense.

This doesn't mean that the JIT compiler or the hardware is currently being used efficiently, though. On the hardware side, Fermi has configurable on-chip memory that allows each SM's 64KB to be split as either 16KB L1/48KB shared or 48KB L1/16KB shared. By default it acts more like GT200 for backwards compatibility purposes, which means it uses the 16/48 configuration. Depending on the application, 48/16 may be faster. As for the JIT compiler, it's entirely possible that NVIDIA has yet to wring everything out of the Fermi architecture with it, which again could impact performance.

In any case there's no specific reason why something needs to only be linearly faster on Fermi than GT200.

However rewriting (note: not recompiling) code to take advantage of Fermi's features would certainly go a long way towards maximizing its potential. Otherwise you don't get concurrent kernels, proper use of cache/shared memory, etc.

Thanks for the information. I was going to ask whether the code is compiled by the driver or beforehand, and how much that would affect performance. Your post answered all my questions :D