Originally posted by: Scali
Originally posted by: munky
LOL, you talk like Cuda and GLSL run on different HW. Anyone with enough knowledge of OpenGL and GLSL can accomplish the same results as you can with Cuda or OpenCL on a modern GPU. Cuda and OpenCL exist so the developer doesn't have to write all the graphics-related code to get those results. GLSL is not limited to simple shaders.
Patently false

In a way you could say that Cuda and GLSL run on different hardware. GLSL was devised a few years ago when the first shader hardware arrived. Cuda was devised for the G80, which is a completely different architecture from the GPUs that were around when GLSL was devised.
And no, you can't just do what you can with Cuda/OpenCL in OpenGL/GLSL. That's exactly the point.
OpenGL only allows you to render from vertex buffers into output buffers, going through vertex shaders and pixel shaders.
There is no concept of local storage or anything, and the memory access is very limited as well. You can only read from textures, and you can only render to your output buffers (and you're not allowed to use the same texture for both input and output in a single pass).
Technically you may be able to devise some kind of multipass OpenGL scheme for whatever algorithm you want to implement... but it's in no way comparable to how Cuda/OpenCL handle code, input, output etc.
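To make the contrast concrete, here's a minimal CUDA sketch. The kernel name blockSum, the 256-thread block size and the reduction itself are just my own illustration, not anything from this thread, but it shows the kind of thing the rendering pipeline has no equivalent for: on-chip shared memory, barriers between cooperating threads, and a write to whatever output address you pick.

// Minimal CUDA sketch: per-block shared memory, a barrier, and a targeted
// write -- none of which exist in the vertex/pixel shader rendering model.
// Assumes a launch with 256 threads per block.
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float tile[256];          // on-chip local storage per block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                     // threads in the block cooperate

    // Simple tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];       // one thread writes the block's result
}

You'd launch it as something like blockSum<<<numBlocks, 256>>>(d_in, d_out, n). Try expressing that barrier-synchronized reduction through vertex/pixel shaders and you're straight into the multipass contortions I mentioned.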
Originally posted by: munky
You left out the important part: Nvidia's chip is not 240 independent scalar processors either. They are grouped into multiprocessor clusters, each working on a single program stream, and if that stream diverges due to heavy branching, you get a lot of bubbles, basically resulting in wasted cycles. So it's not like NV's architecture has no worst-case penalties either.
I left that part out because it isn't specific to nVidia. That part works very similarly on ATi, and will most likely also be similar on Larrabee.
This is because they are essentially SIMD processors, where the threads all share the same code, and even the same program counter. Technically there's only one instruction stream; it's just executed by many units at the same time.
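Here's a hedged sketch of what that shared program counter means in practice (expensiveA/expensiveB are made-up stand-ins for real work, not anything from this discussion): when threads in the same group disagree on a branch, the hardware has to run both paths one after the other with the inactive lanes masked off.

// Hypothetical divergent kernel: lanes in one SIMD group take different
// branches, so the single instruction stream walks through BOTH paths,
// wasting cycles on whichever lanes are masked off at the time.
__device__ float expensiveA(float x) { return x * x + 1.0f; }     // stand-in work
__device__ float expensiveB(float x) { return sqrtf(fabsf(x)); }  // stand-in work

__global__ void divergent(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Neighbouring lanes disagree on this branch, so both expensiveA and
    // expensiveB get executed by the group, one after the other.
    if (i % 2 == 0)
        data[i] = expensiveA(data[i]);
    else
        data[i] = expensiveB(data[i]);
}

On a CPU, each thread would simply take its own branch and never pay for the other path; that's the difference I get to below.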
That's exactly the difference between GPGPU and CPU processing in general. CPUs may not have the parallelism, but all their threads are completely independent and can branch however they like.
But I won't let that distract me. You actually did agree with me on everything I posted about the differences between ATi and nVidia in terms of GPGPU and code compilation. So you will understand my concerns relating to ATi's performance in OpenCL.