CUDA and Stream Progress

Stiganator

Platinum Member
Oct 14, 2001
2,492
3
81
I heard Eran Badit was working with nvidia to enable CUDA on ATI cards. Anyone heard anything else about this since July? It would be nice to have a GPGPU standard for both cards.

Anyone know how effective each company's respective stream processors are?

ATI 48XX has 800 SPs, which seems very impressive.
NVIDIA 2XX has 240 SPs.

Are the ATI ones less efficient, or could ATI potentially blow NVIDIA out of the water in highly parallel tasks?
 

SunnyD

Belgian Waffler
Jan 2, 2001
32,675
146
106
www.neftastic.com
Oy. Essentially the way the SPs are counted is different between the two platforms. I forget exactly where the comparison is, but basically you have to multiply or divide to get apples and oranges onto the same common denominator as far as functionality goes.
 

Munky

Diamond Member
Feb 5, 2005
9,372
0
76
There IS a cross-platform GPU language, called Brook, and that's what AMD supports. NV just has to make everything proprietary, like CUDA and the Cg shader language. Game developers hardly use Cg now that standard languages like HLSL and GLSL exist, and I have a hunch that once GPU computing becomes a more mature field, hardly anyone will use CUDA either.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
#1 Rule of GPGPU: peak GFLOPS MOSTLY NEVER EVER EVER matters.
Why? The cards can only talk to VRAM at around 160GB/s peak; actually many models are less than half of that, and I'm being generous since only some of them go over 100GB/s sustained.
Divide 160GB/s by 4 bytes per value to get roughly 40G single-precision (32-bit float) values transferred per second.
So assume you're READING one SP value per calculation from VRAM and WRITING one SP value per calculation to VRAM; that's two transfers, so you're down to 20G calculations/s MAXIMUM -- not because your SPs can't achieve TERAFLOP-level peak speeds (50x faster), but because you just don't have the VRAM BANDWIDTH to read and write data anywhere near as fast as the SPs can calculate.

Granted, there ARE on-chip registers and caches that are much faster than VRAM access, so if your calculation data can FIT into the cache / registers / local memory, you can do sustained very-high-speed calculations (maybe 20x or more faster than talking to VRAM) on that on-chip storage. Keep in mind that the number of single-precision floating-point values that fit in ON-CHIP fast memory is measured in the several thousands, so you're not going to be crunching megabytes of data at the peak speeds of the SPs, due to the VRAM bandwidth limit. Granted, if you can read in one value and then do LOTS of math on-chip on that value, you CAN get close to PEAK SP efficiency, but you'd have to be doing something like 10-100 calculations using ON-CHIP resources for EVERY single-precision datum you read or write from/to VRAM to reach peak SP speeds.
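To make that arithmetic concrete, here's a rough CUDA-style sketch (illustrative only -- the ~160GB/s and ~1 TFLOP/s figures are just the round numbers from above, not measurements, and the kernel names are made up). The first kernel does one multiply per value moved and is purely bandwidth-bound; the second reuses each value for ~100 register-only operations, so the ALUs rather than VRAM become the limit:

Code:
// Back-of-envelope, using the round numbers above:
//   160e9 B/s / 4 B per float      = 40e9 floats moved per second
//   1 read + 1 write per operation = 20e9 sustained operations/s,
//   i.e. roughly 50x below a ~1e12 FLOP/s peak, purely from VRAM bandwidth.

__global__ void scale(const float *in, float *out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];      // 1 FLOP per 8 bytes of VRAM traffic: bandwidth-bound
}

// Same VRAM traffic per element, but ~100 FLOPs of register-only work per value read.
__global__ void iterate(const float *in, float *out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];                 // one read from VRAM
        for (int j = 0; j < 50; ++j)
            x = x * k + 0.5f;            // 2 FLOPs per iteration, all in registers
        out[i] = x;                      // one write to VRAM
    }
}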

#2 ATI's 3800 / 4800 series GPUs can do DOUBLE PRECISION at around 1/5th rate: there are 800 SP ALUs, and those same ALUs acting in groups of 5 give you up to 160 effective DP ALUs. NVIDIA's current GTX 2xx parts have significantly fewer DP units, so their peak DP performance is likely inferior to ATI's, though the actual performance of course depends on your calculation, the merits of the CUDA vs CAL implementations, and so on.

#3 If you're doing single-precision floating point (or integer, byte, or whatever) calculations, then both ATI and NVIDIA GPUs have tons of SPs you can use, and which is better is largely a function of CUDA vs CAL vs BROOK coding efficiencies, the on-chip architecture of the GPU, the way your threads schedule, how well your algorithm parallelizes, and so on.


#4 Look at OPENCL; that is a standard that may end up being supported by both camps. Apple, among others, is pushing it.
There are also proprietary cross-platform GPGPU tools like RapidMind's commercial solution.
BROOK is a language with cross-platform support, although in its most common form it uses a back-end that is basically OpenGL-level, so you don't really get access to a lot of the advanced features of current NVIDIA / ATI GPUs, like options relating to scatter, gather, double precision, etc.
ATI has commercialized BROOK and created "BROOK+", which they distribute in their SDK; it is not particularly mature or efficient yet even on their own GPUs, and in this version it is no longer a cross-platform (NVIDIA GPU) solution.
Nothing prevents you from writing GPGPU programs in OpenGL / DirectX / HLSL / etc. shading languages, and many early GPGPU programs were cross-platform because they were written using the graphics shading languages supported by common GPUs before the CUDA / CAL / BROOK+ languages were available. As above, you lose out on a LOT of the better architectural capabilities of the GPUs by doing this, but it is a cross-platform approach. The original F@H GPU code used DirectX; they've since abandoned it in favor of native CUDA and CAL implementations.

#5 CAL is the most efficient way to program modern generations of ATI GPUs; it is NOT user-friendly -- it is close to assembly language -- but because of that you get more direct control over the actual program / hardware.

#6 CUDA is more mature and offers better ease of use / programming than ATI/AMD's CAL, and there are lots more mature GPGPU codes out there that use CUDA due to NVIDIA's head start in offering such GPUs and language tools. Eventually BROOK+ / CAL / etc. may catch up to CUDA somewhat. I predict that OPENCL or other such common languages may become the more popular option, though.

#7 In terms of CODING EFFICIENCY -- the time required to implement an algorithm that works, and works reasonably efficiently, on the GPU -- CUDA wins for now, whatever the respective merits of the GPU silicon. You'll probably do the same engineering in 1/10th the development time using CUDA at present, assuming you can program in C anyway. Once you've optimized the code for CUDA and optimized it for CAL on ATI, it's mostly down to the merits of each individual GPU and its driver / scheduler. F@H currently has a handy points-per-day lead on NVIDIA GPUs (as in 2x or more the performance of ATI's most modern GPUs), even though it has both CUDA and CAL implementations.
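For a sense of what "programming in C" means here, a minimal complete CUDA program looks something like the sketch below (a hedged illustration, not production code -- the names and sizes are made up and error checking is omitted). The device kernel is essentially C, and the host side is plain C plus a handful of runtime calls:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Device code: each thread handles one element.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Ordinary C on the host side.
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Allocate VRAM and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover n elements.
    vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    // Copy the result back and check one element.
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);   // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}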



 

aka1nas

Diamond Member
Aug 30, 2001
4,335
1
0
Originally posted by: SunnyD
Oy. Essentially the way the SPs are counted is different between the two platforms. I forget exactly where the comparison is, but basically you have to multiply or divide to get apples and oranges onto the same common denominator as far as functionality goes.

IIRC (if this is totally off-base please correct me), AMD cards are using vector-based shaders, while Nvidia's are scalar-based. I think they are 5-element shaders (so 160 "actual" shaders on the 48xx). On shader code that can't be vectorized well, the 2xx series would have an edge, while on easily vectorized code the AMD shaders should be able to take full advantage of all their units.
 

Munky

Diamond Member
Feb 5, 2005
9,372
0
76
That's not entirely correct. AMD's shaders are superscalar, so each ALU in a 5D group can act as a scalar ALU. The overall difference between the GPUs is how the instruction threads are scheduled. A more accurate way to describe them is AMD having 10 SIMD arrays with 80 ALUs each and NV having 30 SM clusters with 8 ALUs each. The ALU efficiency of AMD GPUs is more heavily dependent on how much IPC the compiler can extract from the code, and in particular it takes a bigger hit with more dependent instructions.
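A rough C-style sketch of that dependence issue (illustrative only -- the comments describe how a 5-wide VLIW compiler would treat these in principle, not actual shader disassembly):

Code:
// Independent operations: a 5-wide VLIW compiler can in principle pack all
// five multiplies into one instruction bundle, keeping a whole 5-ALU group busy.
float independent(float a, float b, float c, float d, float e, float k)
{
    float r0 = a * k;
    float r1 = b * k;
    float r2 = c * k;
    float r3 = d * k;
    float r4 = e * k;
    return r0 + r1 + r2 + r3 + r4;
}

// Dependent chain: each multiply needs the previous result, so only one slot
// per bundle does useful work and the other ALUs in the group sit idle.
// A scalar-per-thread design (NV's approach) doesn't rely on this packing;
// it hides the dependency by running other threads instead.
float dependent(float x, float k)
{
    x = x * k;
    x = x * k;
    x = x * k;
    x = x * k;
    x = x * k;
    return x;
}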
 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
Originally posted by: munky
That's not entirely correct. AMD's shaders are superscalar, so each ALU in a 5D group can act as a scalar ALU. The overall difference between the GPUs is how the instruction threads are scheduled. A more accurate way to describe them is AMD having 10 SIMD arrays with 80 ALUs each and NV having 30 SM clusters with 8 ALUs each. The ALU efficiency of AMD GPUs is more heavily dependent on how much IPC the compiler can extract from the code, and in particular it takes a bigger hit with more dependent instructions.

from the first discussion:
So, I asked, what's the difference between CUDA and the Stream SDK? Harrell explained:

At their core, they're essentially a very similar idea. Brook+ was based on a graduate project out of Stanford called Brook, which has been around for years and is designed to target various architectures with a high-level API. And in this case there's back-ends for GPUs . . . What our engineering team did was take that project and bring that in-house, clean it up, write a new back-end that talks to our lower-level interface, and post the thing back out to open-source in keeping with our open systems philosophy.

Brook looks like C. . . Function calls go to the GPU very much like CUDA. In fact, the guy who was one of the core designers on Brook went to Nvidia and did CUDA. . . . And another guy who recently got his doctorate at Stanford and worked extensively on Brook at Stanford is one of the core Brook+ architects now at AMD. So, they were both born out of the same idea.

In terms of what we do differently, the one thing we've tried to do is publish all of our interfaces from top to bottom so that developers can access the technology at whatever level they want. So, underneath Brook we have what we call CAL, Compute Abstraction Layer, which you can think of as an evolution of the original CTM. It provides a run-time, driver layer, as well as an intermediate language. Think of it as analogous to an assembly language. So Brook has a back-end that targets CAL, basically, as does ACML and some of the other third-party tools that we're working on. . . . From the beginning we published the API for CAL as well as for Brook so people could program at either level. We also published the instruction set architecture . . . so [people] can essentially tune low-level performance however they want. And Brook+ itself is open-source.
i don't think it matters exactly how the architecture is set up; the way the respective languages are set up to address it is evidently very similar




i think AMD better get something out of their acquisition
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
There are pretty big differences when you're doing double precision (NVIDIA has very few DP ALUs on the GTX2xx and none on earlier models). ATI has the edge with DP FLOPs.

There are significant differences relating to threading efficiencies, scheduling, parallelism, thread local storage, and effects of thread divergence.
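On the thread-divergence point, for example: both vendors execute threads in groups that share one instruction stream, so a data-dependent branch like the hedged CUDA-style sketch below ends up running both paths serially for the whole group, with inactive threads masked off -- the group width (32 vs 64) and the predication details differ between the two architectures.

Code:
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // If threads in the same warp / wavefront take different sides of this
    // branch, the hardware runs both sides one after the other and masks off
    // the inactive threads, so in the worst case you pay for both paths.
    if (in[i] > 0.0f)
        out[i] = sqrtf(in[i]) * 3.0f;   // "expensive" path
    else
        out[i] = 0.0f;                  // cheap path
}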

There are also significant differences in the symmetry of the architecture -- SP and DP ALUs work very differently and have very different resource impacts on things like shared / local / global memory, registers, et. al. Even though ATI's 5 clustered ALUs are generally capable of doing some SP operations, there are important differences in that not all ALUs have certain capabilities (e.g. only one out of the five does transcendental operations IIRC), and they cannot be scheduled to perform operations totally independently since there are limits on the input / output data access and instruction mix the cluster can simultaneously use.

In addition to the vast differences in local storage between ATI and NVIDIA, global and shared storage are handled very differently as well.

There are also limits on what kinds of data types and vector sizes you can access in vectors / arrays, and significant limits on doing things like reading from and writing to arbitrary addresses within a shader.

Basically you have to understand the architecture related optimizations if you want to get more than about 1% to 10% efficiency out of any GPU with GPGPU.

 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
Of course .. that is why each company has released their own SDK with hopes it will be the ONE to be adopted industrywide

each company touts the benefits of using THEIR HW and their compiler

time will tell
:clock:

a long time, i think .. a couple of years for it all to play out
 

Denithor

Diamond Member
Apr 11, 2004
6,298
23
81
So is F@H running on SP or DP ALUs? Because it runs using CUDA on nV hardware and performs much much better than on faster ATi hardware (8800GS will outperform 4850 nearly 2:1 in ppd using the GPU client).
 

MegaWorks

Diamond Member
Jan 26, 2004
3,819
1
0
I just ordered 4 XFX 9800 GTs for Folding@Home. I think CUDA is amazing, and this is coming from someone who likes ATI a lot more. :)
 

ViRGE

Elite Member, Moderator Emeritus
Oct 9, 1999
31,516
167
106
Originally posted by: Denithor
So is F@H running on SP or DP ALUs? Because it runs using CUDA on nV hardware and performs much much better than on faster ATi hardware (8800GS will outperform 4850 nearly 2:1 in ppd using the GPU client).
It's all SP.

---

As for the matter of programming languages, I think it's a shame that AMD went the direction they did. Having played with both Brook+ and CUDA, they're really only comparable in concept; the execution differs wildly. Brook+ still feels like a research project: it does things that probably made sense in 2002 when it was being developed by underpaid grad students, but that make no sense from a professional development standpoint. NVIDIA, meanwhile, already went through that whole phase with Cg; CUDA, while far from perfect, definitely feels and works on a level more on par with C++ itself, with so-so debugging that's closer to what Xcode/VC++ can do than anything Brook+ offers.

Accordingly, the focus on CAL shouldn't be a surprise; that's the bit AMD did build well, but it's not nearly as useful as a good high-level language. Plus, most of what CAL offers, NVIDIA offers with PTX; it's just not heavily flaunted since NVIDIA is focusing on the high-level aspect. I'm not sure if AMD was expecting someone else to build a better high-level language for them, or if they couldn't get the resources to do it themselves. The utter irony is that right now AMD's hardware is going to be faster when put in the hands of hardcore academics who understand it well enough to extract its full performance, but even academics want to work in a high-level language when possible.

I just wish OpenCL information was easier to come by. The few short samples I've seen look like they're influenced by CUDA/Cg, but it's early and since it's Apple driving the boat, it's entirely possible that it can do anything from making a detour to Cuba to hitting an iceberg. Apple doesn't have the greatest history here with graphics/GPUs. It can't be good for AMD though if it's going to draw largely upon CUDA.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
AFAIK the ATI 4800 series would probably run DP circles around an NVIDIA GTX 280 if the algorithm could run efficiently with data kept on-chip and the code were written in CAL.

The problem with CAL, besides it being almost at assembly-language level, is that it (well, more specifically the GPU architecture) is poorly documented, so people who REALLY want to get down to the low level and use the chip's resources efficiently often cannot. Once you start to get into registers, local / global / shared memory, thread scheduling, memory access / streaming details, memory mapping of data, the ability to understand and handle read/write conflicts, the ability to communicate with the host PC, the ability to use on-chip hardware like the video interfaces, etc., the documentation is often absent, inadequate, or confusing.

NVIDIA's PTX and lower-level chip docs are ALMOST as bad, but still worlds better than AMD's -- there are at least some very nice CUDA documents on optimization and architecture that help you write effective code given the way the memory / threads / scheduling work. ATI often doesn't have very good sample code or documentation either.

NVIDIA has been better on multi-platform driver support: they have had preliminary Xorg 7.4 support working for many weeks, whereas ATI seems STILL to lack it even with the Catalyst 8.9 drivers released just hours ago. NVIDIA at least has some drivers for Solaris (although sadly not CUDA), and some for FreeBSD x86 (although not x64 or CUDA).

NVIDIA has had their BLAS / FFT libraries public for a long while, whereas ATI is still working on the ACML GPU beta.

Anyway I think for progress these things must happen from all vendors:
a: better hardware / architectural level documentation
b: better compilers, tools, sample codes, debugging, profiling support
c: fast, open-architecture, bidirectional interface ports to get data in/out of GPUs -- e.g. InfiniBand or 10Gb Ethernet or so.
d: Better parallel high level languages that can automatically compile down to EFFICIENT GPU code that takes advantage of all the relevant architectural capacities and optimizations -- OPENCL, CILK, Fortran90, C and OpenMP, whatever.
e: much faster VRAM on the GPUs -- right now peak VRAM bandwidth limits the GPU ALUs to about 0.5% of their peak performance if one has to read/write VRAM 1:1 with calculation operations. That is too poor and a waste of a good GPU chip.
f: much larger on-chip SRAM / register sets so one can have megabytes of fast SRAM or at least thousands more registers.