Technical Difference between ATI & nVidia Stream Processors


phaxmohdem

Golden Member
I've always been curious as to how nVidia gets away with using far fewer "Stream Processors" on their cards while reaching performance parity with an ATI card touting 2-3 times as many SPs (or, conversely, why ATI must use so many SPs for performance parity). I've scoured various forums, but I haven't seen much more than semi-informed fanboi drivel in most places.

The nearest I can figure is that an nVidia SP must somehow be able to do roughly twice the work of an ATI SP, and that it operates at a higher clock than the rest of the GPU:

For example's sake, let's take two roughly equivalent cards: a GTX 275 vs. a Radeon 4890.

GTX 275
----------------------
240 SP
GPU/SP Clocks = 633 / 1404 MHz respectively

Radeon 4890
----------------------
800 SP
GPU Clock = 850 MHz

Now, simply multiplying the shader count by the shader clock yields the following results:

GTX 275: (240x1404) = 336,960
HD 4890: (800x850) = 680,000

This admittedly oversimplified example shows that, ceteris paribus, the Radeon core should be able to do almost exactly twice the work of the GTX 275... which I take to mean that nVidia's SPs can somehow do twice the work of ATI's processors.
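To put rough FLOPS numbers on that intuition, here's a quick back-of-the-envelope in Python. The per-clock figures are the commonly quoted ones (each ATI ALU retiring one MAD = 2 flops per clock, each nVidia SP a MAD plus a co-issued MUL = 3 flops per clock); take them as assumptions for illustration rather than gospel.

```python
# Back-of-the-envelope peak FP32 throughput (marketing math, not measured).
# Assumed per-SP work per clock: ATI = 1 MAD (2 flops), nVidia = MAD + MUL (3 flops).

def peak_gflops(sp_count, shader_clock_mhz, flops_per_clock):
    """Theoretical peak in GFLOPS: SPs x clock x flops-per-SP-per-clock."""
    return sp_count * shader_clock_mhz * flops_per_clock / 1000.0

gtx275 = peak_gflops(240, 1404, 3)   # ~1010.9 GFLOPS
hd4890 = peak_gflops(800, 850, 2)    # ~1360.0 GFLOPS

print(f"GTX 275: {gtx275:7.1f} GFLOPS")
print(f"HD 4890: {hd4890:7.1f} GFLOPS")
print(f"paper advantage: {hd4890 / gtx275:.2f}x for the Radeon")
```

Even counting the co-issued MUL, the Radeon keeps a roughly 1.35x paper advantage, so whatever closes the real-world gap has to be how much of that theoretical throughput each architecture actually extracts.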

Does anyone have any low level explanation for why this is the case?
 

I do not know much about it. The best way is to read about CUDA and ATI Stream.
What I recall is that ATI uses a very wide SIMD system: single instruction on multiple data. Although ATI has more functional units, this also means that when a calculation takes place, it must be the same calculation across lots of data. Nvidia has fewer functional units, but those units seem to be arranged a bit more flexibly. In scenario A, ATI is faster, and in scenario B, Nvidia is faster. Scenario A: lots of data where the same calculation can be done in parallel. Scenario B: parallel data, but different instructions...
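To make scenarios A and B concrete, here is a purely illustrative toy scheduler (the widths and the cost model are made up; neither GPU really works this way, and both hide a lot of this behind threading):

```python
# Toy model of "wide but rigid" vs. "narrow but flexible" execution.
from collections import Counter
from math import ceil

def cycles_wide_simd(ops, width=16):
    """Wide SIMD: each cycle issues ONE op; only items needing that op can
    occupy the 16 lanes, so mixed work leaves most lanes idle."""
    return sum(ceil(count / width) for count in Counter(ops).values())

def cycles_flexible(ops, width=4):
    """Narrower but flexible: each cycle the 4 units may each run a
    different op, so any 4 pending items can retire per cycle."""
    return ceil(len(ops) / width)

scenario_a = ["mul"] * 64                   # same calculation on lots of data
scenario_b = [f"op{i}" for i in range(64)]  # every item wants a different op

for name, work in (("A (uniform)", scenario_a), ("B (mixed)", scenario_b)):
    print(f"scenario {name:12s} wide SIMD: {cycles_wide_simd(work):3d} cycles,"
          f" flexible: {cycles_flexible(work):3d} cycles")
```

In scenario A the wide machine finishes in 4 passes to the flexible machine's 16; in scenario B the wide machine degrades to 64 passes while the flexible one still takes 16.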


I was a bit too lazy to look it up right here at AnandTech, so this is what I had readily available.


ATI

http://www.beyond3d.com/content/reviews/53
http://www.beyond3d.com/content/reviews/52/9

The general organization we first encountered way back in the day with the R600 is still here: SIMDs made up of 16 blocks of 5 ALUs, each block able to issue up to 5 instructions depending on exploitable ILP in the instruction stream. There are 20 such SIMDs, and the physical arrangement of the shader core has changed versus the RV770, with SIMDs being split into two 10-tall blocks placed symmetrically around the central data request bus. Mind you, this is a physical layout change; logically they work in pretty much the same way, with a similar flow.

As you've probably already deduced, assuming sufficient ILP exists, and the compiler works its magic, up to 5 instructions can be co-issued per ALU block. However, dependencies can reduce it to as little as 1 instruction per cycle, in a purely serial instruction stream. From what we've seen, the general case seems to be around 3-4 instructions packed per cycle, which is decent. Speaking of packing, that's how the ALU blocks get their instructions, in a packed VLIW that can contain anywhere between 1 and 5 64-bit scalar ALU ops (lots of bits there, which gives you an idea about the complexity of this GPU's ISA), and up to two 64-bit literal constants. Control flow instructions are separately dispatched as 64-bit words, for execution on the branch processing unit.

The 4 slim ALUs handle all of their old tricks, so each is capable of 1 FP MAD/ADD/MUL or 1 INT ADD/AND/CMP, as well as integer shifts. Cypress adds to its slim ALUs the capability to do single cycle 24-bit INT MUL whereas before no INT MUL support existed at all. Getting the 24-bit INT is fairly easy, since the slim ALUs are FP32 so there are enough mantissa bits there to represent it, but there's not enough for a full 32-bit INT. There's also FMA support, which will bring benefits versus a simple MAD when it comes to precision loss introduced by rounding involved in the latter.
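A toy illustration of the "exploitable ILP" point in the excerpt above: a greedy packer (my simplification, nothing like AMD's real compiler) fills the 5-wide VLIW slots nicely when ops are independent, but a dependent chain collapses to one op per slot.

```python
# Simplified VLIW-5 packing: up to five independent scalar ops share one
# instruction slot; dependent ops cannot be co-issued. Toy scheduler only.

def vliw_slots(instructions, width=5):
    """instructions: list of (name, set_of_names_it_depends_on).
    Greedily packs ready (dependency-satisfied) ops, up to `width` per slot."""
    done, slots, remaining = set(), 0, list(instructions)
    while remaining:
        ready = [ins for ins in remaining if ins[1] <= done][:width]
        assert ready, "cyclic dependency"
        done |= {name for name, _ in ready}
        remaining = [ins for ins in remaining if ins not in ready]
        slots += 1
    return slots

# Plenty of ILP: eight independent MADs pack as 5 + 3 -> 2 slots.
independent = [(f"mad{i}", set()) for i in range(8)]
# No ILP: each op consumes the previous result -> 8 slots.
serial = [(f"op{i}", {f"op{i-1}"} if i else set()) for i in range(8)]

print("independent stream:", vliw_slots(independent), "VLIW slots")
print("serial stream:     ", vliw_slots(serial), "VLIW slots")
```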


Nvidia

http://www.beyond3d.com/content/reviews/51

At its core, GT200 is a MIMD array of SIMD processors, partitioned into what we call clusters, with each cluster a 3-way collection of shader multiprocessors, which we call SMs. Each SM, or streaming multiprocessor, comprises 8 scalar ALUs, with each capable of FP32 and 32-bit integer computation (the only exception being multiplication, which is INT24 and therefore still takes 4 cycles for INT32), a single 64-bit ALU for brand new FP64 support, and a discrete pool of shared memory 16KiB in size.

The FP64 ALU is notable not just in its inclusion, NVIDIA supporting 64-bit computation for the first time in one of its graphics processors, but in its ability. It's capable of a double precision MAD (or MUL or ADD) per clock, supports 32-bit integer computation, and somewhat surprisingly, signalling of a denorm at full speed with no cycle penalty, something you won't see in any other DP processor readily available (such as any x86 or Cell). The ALU uses the MAD to accelerate software support for specials and divides, where possible.

Those ALUs are paired with another per-SM block of computation units, just like G80, which provide scalar interpolation of attributes for shading and a single FP-only MUL ALU. That lets each SM potentially dual-issue 8 MAD+MUL instruction pairs per clock for general shading, with the MUL also assisting in attribute setup when required. However, as you'll see, that dual-issue performance depends heavily on input operand bandwidth.

Each warp of threads still runs for four clocks per SM, with up to 1024 threads managed per SM by the scheduler (which has knock-on effects for the programmer when thinking about thread blocks per cluster). The hardware still scales back threads in flight if there's register pressure of course, but that's going to happen less now the RF has doubled in size per SM (and it might happen more gracefully now to boot).

So, along with that pool of shared memory is connection to a per-SM register file comprising 16384 32-bit registers, double that available for each SM in G80. Each SP in each SM runs the same instruction per clock as the others, but each SM in a cluster can run its own instruction. Therefore in any given cycle, SMs in a cluster are potentially executing a different instruction in a shader program in SIMD fashion. That goes for the FP64 ALU per SM too, which could execute at the same time as the FP32 units, but it shares datapaths to the RF, shared memory pools, and scheduling hardware with them so the two can't go full-on at the same time (presumably it takes the place of the MUL/SFU, but perhaps it's more flexible than that). Either way, it's not currently exposed outside of CUDA or used to boost FP32 performance.
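Tying that excerpt back to the GTX 275 figures from the first post, the headline numbers fall out of the SM layout like this (my own arithmetic on the figures quoted above, using the 1404 MHz shader clock):

```python
# GT200-class GTX 275: how the advertised numbers follow from the SM layout.
clusters         = 10      # clusters enabled on a GTX 275
sms_per_cluster  = 3
sps_per_sm       = 8
shader_clock_ghz = 1.404

sps = clusters * sms_per_cluster * sps_per_sm
print("stream processors:", sps)                                    # 240

# Best-case dual issue per SP per clock: 1 MAD (2 flops) + 1 MUL (1 flop).
print(f"peak FP32: {sps * 3 * shader_clock_ghz:.1f} GFLOPS")         # ~1010.9

# One FP64 ALU per SM, one DP MAD (2 flops) per clock.
fp64 = clusters * sms_per_cluster * 2 * shader_clock_ghz
print(f"peak FP64: {fp64:.1f} GFLOPS")                               # ~84.2

# Register pressure, concretely: 16384 32-bit registers per SM shared by up
# to 1024 resident threads (ignoring allocation granularity).
print(16384 // 1024, "registers/thread at full occupancy")           # 16
regs_per_thread = 32       # a hypothetical register-hungry shader
print(16384 // regs_per_thread, "threads in flight for that shader")  # 512
```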


More links:

In-depth article on GT200 and CUDA:
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242
 

Cogman

Lifer
There isn't going to be too much information on this, as these are trade secrets.

My bet, though, is that it probably has to do with things like GPU instruction caching and pipelining. It might also come down to the drivers themselves (a crappy compiler will really slow things down).

Think of what CPUs do to speed things up, and I imagine that the same principles are being applied, to varying degrees, to AMD's and Nvidia's stream processors.
 

jimhsu

Senior member
My purely hypothetical conjecture is that ATI stream processors can have much higher theoretical performance under ideal conditions. In real life, though, the flexibility of Nvidia's SPs all but erases this gap. Fermi might change things, though. This is somewhat consistent with the Folding@home claims that they only manage to get about 50% of theoretical FLOPs on ATI; not sure about Nvidia's figure, though.
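Taking that ~50% figure at face value and applying it to the HD 4890's paper peak from the first post (just arithmetic; the 50% is the claim above, not something measured here):

```python
# Effective throughput if only ~50% of theoretical FLOPs are extracted.
hd4890_peak_gflops = 800 * 2 * 0.850    # SPs x flops/clock x GHz = 1360
claimed_efficiency = 0.50
print(f"effective: ~{hd4890_peak_gflops * claimed_efficiency:.0f} GFLOPS")  # ~680
```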
 

Ben90

Platinum Member
I cannot find the AnandTech article that explains this at the moment, but basically ATI's shaders are clustered in groups of 5. One of these can handle any type of GPU instruction, while the other four can only handle certain tasks. The 4890 has 160 "full-blown" shaders and 640 sidekick shaders.

Nvidia's advertised shaders can do most operations, but there are also dedicated "Special Function Units" (SFUs) that do not get counted among the advertised shaders. By having separate SFUs, Nvidia can decrease the complexity of its main shaders.

Due to the architectures, ATI has a lot more theoretical power, but it's dependent on the compiler.
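The counts line up exactly with the 5-wide grouping (simple arithmetic on the figures above):

```python
# How the 4890's 800 advertised SPs break down under the 5-wide grouping.
advertised_sps = 800
group_width    = 5

groups = advertised_sps // group_width          # 160 groups of 5
print(groups, "groups ->", groups, '"full-blown" shaders and', groups * 4, "sidekick shaders")
```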

*edit found it*

http://www.anandtech.com/video/showdoc.aspx?i=3341&p=1
Pages 3 and 6 explain it.
 

Lonyo

Lifer

And page 7 of that article is nice too, specifically:
This shows us that NVIDIA's architecture requires more than 2x the die area of AMD's in order to achieve the same level of peak theoretical performance

AMD has higher theoretical performance and higher performance density by a large margin, but the actual extracted performance is indeed typically much lower.

It's fairly obvious that you can't really compare the two kinds of SP directly, because ATI manages to fit "800" of them in a die smaller than the one holding NV's 240, so clearly there are architectural differences (which AT somewhat explains).
If you just looked at 240 vs. 800, you would think the 800 should win, but the 800 must be much simpler shaders: packing 800 shaders into 260 mm^2 is never going to give you individual shaders anywhere near as capable as 240 shaders spread over 412 mm^2.
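Putting the die sizes quoted here next to the paper FLOPS from earlier in the thread gives a rough performance-density comparison (all figures as quoted in the thread, so treat them as approximate):

```python
# Rough theoretical performance density, using the die sizes quoted above
# and the paper peaks computed earlier in the thread (approximate figures).
cards = {
    "HD 4890": {"gflops": 1360.0,  "die_mm2": 260},
    "GTX 275": {"gflops": 1010.88, "die_mm2": 412},
}
for name, c in cards.items():
    print(f"{name}: {c['gflops'] / c['die_mm2']:.2f} GFLOPS per mm^2")
# HD 4890 ~5.2, GTX 275 ~2.5 -> over 2x higher paper density for ATI.
```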

Page 8 of the AT article wraps it up:
But the hearts of GT200 and RV770, the SPA (Streaming Processor Array) and the DPP (Data Parallel Processing) Array, respectively, are quite different. The explicitly scalar, one-operation-per-thread-at-a-time approach that NVIDIA has taken is quite different from the 5-wide VLIW approach AMD has packed into their architecture. Both of them are SIMD in nature, but NVIDIA is more like S(operation)MD and AMD is S(VLIW)MD.
[...]
getting the most out of GT200 and RV770 requires vastly different approaches in some cases. Long shaders can benefit RV770 due to increased ILP that can be extracted, while the increased resource use of long shaders may mean less threads can be issued on GT200 causing lowered performance. Of course going the other direction would have the opposite effect. Caches and resource availability/management are different, meaning that tradeoffs and choices must be made in when and how data is fetched and used.
 