#1 Rule of GPGPU: peak GFLOPS performance almost NEVER matters.
Why? The cards can only talk to VRAM at around 160 GB/s peak; many models actually sustain less than half of that, and I'm being generous to cover the ones that exceed 100 GB/s sustained.
Divide 160 GB/s by 4 bytes per single precision (SP) 32-bit float and you get about 40 billion SP values moved to or from VRAM per second.
Now assume you READ one SP datum per calculation from VRAM and WRITE one SP datum per calculation back to VRAM. That's two transfers per calculation, so you're down to 20 billion calculations per second MAXIMUM, not because your SPs can't achieve TERAFLOP-level peak speeds (50x faster) but because you just don't have the VRAM BANDWIDTH to read and write data anywhere near fast enough to keep up with the SPs.
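To make the back-of-envelope arithmetic concrete, here is a tiny host-side sketch (plain C, which compiles fine as CUDA; the numbers are the rough assumptions above, not measurements from any particular card):

// bandwidth_limit.c -- back-of-envelope VRAM bandwidth limit
#include <stdio.h>

int main(void)
{
    double bandwidth_bytes_per_s = 160e9; // assumed peak VRAM bandwidth
    double bytes_per_sp_float    = 4.0;   // single precision 32-bit float
    double transfers_per_calc    = 2.0;   // one read + one write per calculation

    // 160e9 / 4 = 40e9 SP values streamed to/from VRAM per second
    double values_per_s = bandwidth_bytes_per_s / bytes_per_sp_float;

    // 40e9 / 2 = 20e9 calculations/s sustainable, far below a ~1 TFLOP peak (50x higher)
    double calcs_per_s = values_per_s / transfers_per_calc;

    printf("bandwidth-limited rate: %.0f billion calcs/s\n", calcs_per_s / 1e9);
    return 0;
}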
Granted, there ARE on-chip registers and caches that are much faster than VRAM. If your working set can FIT into the registers / caches / local (shared) memory, you can do sustained very high speed calculations on that on-chip data, maybe 20x faster or more than talking to VRAM. Keep in mind, though, that ON-CHIP fast memory only holds a few thousand single precision floats per multiprocessor, so you're not going to crunch megabytes of data at the SPs' peak speeds; the VRAM bandwidth limit gets you. If you can read in one value and then do LOTS of math on it on-chip, you CAN get close to PEAK SP efficiency, but that means doing something like 10-100 calculations using ON-CHIP resources for EVERY single precision datum you read or write from/to VRAM; see the kernel sketch below.
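Here is a minimal CUDA sketch of that contrast (hypothetical kernels, names made up for illustration): the first does one multiply per element read and written, so it is purely VRAM-bandwidth bound; the second does ~100 operations per element entirely in registers, so the SPs rather than the memory bus become the limit.

// One read + one write per multiply: bound by VRAM bandwidth, the SPs
// mostly sit idle waiting on memory.
__global__ void scale(const float *in, float *out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];      // 1 flop per 8 bytes of VRAM traffic
}

// Same VRAM traffic, but ~100 multiply-adds per element done entirely in
// registers: high enough arithmetic intensity to approach peak SP speed.
__global__ void iterate(const float *in, float *out, float a, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];         // one read from VRAM
        for (int k = 0; k < 100; ++k)
            x = a * x + b;       // on-chip math, no VRAM traffic
        out[i] = x;              // one write to VRAM
    }
}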
#2 ATI's 3800 / 4800 series GPUs can do DOUBLE PRECISION at roughly 1/5th the SP rate: the 4800 has 800 SP ALUs, and those same ALUs acting in bunches of 5 give you up to 160 DP units. NVIDIA's current GTX 2xx series parts have significantly fewer DP ALUs, so their peak DP performance may well be inferior to ATI's, though actual performance of course depends on your calculation, the merits of the CUDA vs CAL implementations, et al.
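As an aside, on the NVIDIA / CUDA side you can at least query at runtime whether a given card has the DP hardware (it arrived with compute capability 1.3 in the GTX 2xx generation); a rough sketch using the standard CUDA runtime API (the ATI / CAL side has its own query mechanism, not shown here):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Double precision support starts at compute capability 1.3;
        // earlier NVIDIA parts are single precision / integer only.
        int has_dp = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("%s: compute %d.%d, double precision: %s\n",
               prop.name, prop.major, prop.minor, has_dp ? "yes" : "no");
    }
    return 0;
}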
#3 If you're doing single precision floating point (or integer / byte / whatever) calculations, both ATI and NVIDIA GPUs have tons of SPs you can throw at the problem. Which is better is largely a function of CUDA vs CAL vs BROOK coding efficiency, the on-chip architecture of the GPU, how your threads schedule, how well your algorithm parallelizes, et al.
#4 Look at OPENCL; it is a standard that may end up being supported by both camps. Apple, et al. are pushing it.
Also there are proprietary cross platform GPGPU tools like RapidMind's commercial solution.
BROOK is a language with cross platform support, although in its most common form it uses a back-end that is basically at the OpenGL level, so you don't really get access to a lot of the advanced features of current NVIDIA / ATI GPUs, such as scatter, gather, double precision, etc.
ATI has commercialized BROOK as "BROOK+", which they distribute in their SDK; it is not particularly mature or efficient yet even on ATI's own GPUs, and in this version it is no longer a cross platform (NVIDIA GPU) solution.
Nothing prevents you from writing GPGPU programs in OpenGL / DirectX / HLSL / etc. shading languages, and many early GPGPU programs were cross platform because they were written in the graphics shading languages supported by common GPUs, even before the CUDA / CAL / BROOK+ languages were available. As above, you lose out on a LOT of the better architectural capabilities of the GPUs by doing this, but it is a cross platform solution. The original F@H GPU code used DirectX; they've since abandoned it in favor of native CUDA and CAL implementations.
#5 CAL is the most efficient way to program modern generations of ATI GPUs. It is NOT user friendly (it is close to assembly language), but because of that you get more direct control over the actual program and hardware.
#6 CUDA is more mature and offers better ease of use and programming than ATI/AMD's CAL, and there are lots more mature GPGPU codes out there that use CUDA, thanks to NVIDIA's head start in offering such GPUs and language tools. Eventually BROOK+ / CAL / et al. may catch up to CUDA somewhat, but I predict OPENCL or other such common languages will become the more popular options.
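To give a feel for why the C-like CUDA model is quick to pick up, here is a complete toy vector add (error checking omitted for brevity); doing the same in CAL means writing the kernel at close to the IL/assembly level plus a fair bit more host boilerplate:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host buffers
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Device buffers
    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover n elements
    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[12345] = %f (expect 37035)\n", hc[12345]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}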
#7 In terms of CODING EFFICIENCY, meaning the time required to implement an algorithm that works, and works reasonably efficiently, on the GPU, CUDA wins for now, whether or not the respective GPU silicon beats ATI's theoretical performance; you'll probably do the same engineering in about 1/10th the development time using CUDA at present, assuming you can program in C anyway... Once you optimize the code for CUDA and also optimize it for CAL on ATI, it becomes mostly a matter of the merits of each individual GPU and its driver / scheduler system... F@H currently shows a handy points-per-day lead on NVIDIA GPUs (as in 2x or more the performance of ATI's most modern GPUs), even though they have both CUDA and CAL implementations.
Originally posted by: Stiganator
I heard Eran Badit was working with nvidia to enable CUDA on ATI cards. Anyone heard anything else about this since July? It would be nice to have a GPGPU standard for both cards.
Anyone know how effective each company's respective stream processors are?
ATI 48XX has 800 SP which seems very impressive.
NVIDIA 2XX has 256 SP
Are the ATI ones less efficient, or could ATI potentially blow NVIDIA out of the water in highly parallel tasks?