Stream Processors and Performance

HardcoreRomantic

Senior member
Jun 20, 2007
259
0
0
I was looking at some spec comparisons and saw that the 3870 has 320 stream processors while the 8800GT has 112. Why doesn't this translate into a performance lead? Same question for memory bus width: 256-bit vs. 512-bit?
 

Auric

Diamond Member
Oct 11, 1999
9,596
2
71
Memory bus width is akin to the diameter of a pipe, in that the total volume (or bits) passing through is also determined by the pressure (or clock frequency). Ergo, in practical terms, a wider bus only increases the performance of a GPU over an otherwise equivalent one when both are already limited by the fastest memory available. Otherwise, it is more cost-efficient to match the target needs of the GPU with a narrower bus and higher-spec memory.
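
To make the analogy concrete, a minimal sketch; the clocks are reference figures from memory, so treat them as approximate:

```python
# Peak memory bandwidth = (bus width in bytes) x (effective data rate).
# Card clocks below are reference figures from memory -- approximate.
def bandwidth_gb_s(bus_width_bits, effective_mhz):
    return (bus_width_bits / 8) * effective_mhz / 1000  # GB/s

cards = {
    "HD 3870 (256-bit, 2250MHz GDDR4)":    (256, 2250),
    "8800 GT (256-bit, 1800MHz GDDR3)":    (256, 1800),
    "HD 2900 XT (512-bit, 1656MHz GDDR3)": (512, 1656),
}
for name, (width_bits, mhz) in cards.items():
    print(f"{name}: {bandwidth_gb_s(width_bits, mhz):.1f} GB/s")

# HD 3870: 72.0 GB/s -- faster GDDR4 on a 256-bit bus recovers much of
# what the 2900 XT's 512-bit bus (106.0 GB/s) provided, at lower cost.
```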
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,712
142
106
Effective memory bandwidth has pretty much been the single most important factor in graphics performance over the last several years, IMO.
The memory bus width is an important factor in this.
 

superbooga

Senior member
Jun 16, 2001
333
0
0
Then why was the 2900 XT, with its massive memory bandwidth, slower than both the 8800GTX and the 3870?
 

Cookie Monster

Diamond Member
May 7, 2005
5,161
32
86
Firstly, "stream processor" is a fuzzy marketing term. Let me break this down for you in an easy-to-understand way.

ATi has 320 ALUs (Arithmetic Logic Units), or shaders, grouped into clusters of 5. They are therefore sometimes called "Vec5" shaders, though the term differs slightly from ATi's actual ALU setup. In RV670/R600, each vec5 shader (one group) consists of 4 plain scalar ALUs and one special, beefier ALU that can also perform other special ops. The RV670, i.e. the HD38x0 series, has 64 vec5 shaders clocked at the same frequency as the rest of the chip, i.e. the core clock.

OK, now that we've got this far, let's talk about how they work. Vec5 shaders can hold an advantage IF they are fed vec5 instructions: one big fat instruction (requiring all 5 ALUs at once) is handled by a single vec5 shader in one go, speeding the operation up considerably. The downside is that the instructions must be arranged in a way that takes advantage of all 5 ALUs. If a vec5 shader is fed a scalar instruction (one requiring just a single ALU), the other 4 ALUs sit idle for that cycle. So the utilization of a vec5 shader architecture can be quite low, even though in pure arithmetic crunching power it is faster than its nVIDIA counterparts.

Now let's compare G80/G9x. They have scalar ALUs, 128 of them to be exact, which looks low next to the 320 of R600/RV670. However, two differences are significant. Firstly, nVIDIA clocks different parts of the chip at different frequencies: the shader core (the ALUs) runs as fast as 1650MHz, compared to 775MHz for an RV670. That's quite the difference. Secondly, because these are scalar ALUs, even a vec5-style instruction (which can be slow for scalar ALUs) is simply broken up into 5 scalar instructions and issued in sequence. So the utilization of a scalar shader architecture stays roughly near ~100%. In terms of pure peak arithmetic power it lags behind the competition, though not as badly as the raw ALU counts suggest.
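
To put rough numbers on this, here's a toy sketch. It assumes one MAD (2 flops) per ALU per clock and ignores G80's extra MUL; the utilization percentages are made up purely for illustration:

```python
# Toy peak-vs-effective shader throughput comparison.
# Assumes 1 MAD (2 flops) per ALU per clock; utilization percentages
# are invented for illustration, not measured.
def gflops(alus, clock_ghz, utilization):
    return alus * clock_ghz * 2 * utilization  # 2 flops per MAD

# RV670: 320 ALUs at the 775MHz core clock; only the lanes the
# compiler manages to pack do useful work.
print(f"RV670 peak:        {gflops(320, 0.775, 1.00):.0f} GFLOPS")
print(f"RV670, 60% packed: {gflops(320, 0.775, 0.60):.0f} GFLOPS")

# G9x: 128 scalar ALUs on a separate ~1650MHz shader clock; nearly
# every instruction fills its lane.
print(f"G9x peak:          {gflops(128, 1.650, 1.00):.0f} GFLOPS")
print(f"G9x, 95% packed:   {gflops(128, 1.650, 0.95):.0f} GFLOPS")
```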

IMO the G80/G9x architecture handles those instructions fairly well, seeing as its shader core is clocked around twice as high as its ATi counterpart and keeps its scalar ALUs near ~100% utilization. That basically means nVIDIA cards will be consistent across games, while the ATi counterparts may show drawbacks because of their vec5 nature. There are other major variables involved too, such as texturing power, where the nVIDIA cards have two to three times as much as ATi. AA and AF performance are also worth mentioning, and both are weaker on ATi GPUs.

It's been years, and even now bandwidth has never been a major factor in game performance. One good example is R600 and its 512-bit bus (I still think this was more of a checkbox feature than something implemented for practical use). Or the 7800GTX 512MB with its monstrous 1800MHz memory clock back in the day (yet the 7900GTX outperformed it thanks to a higher core clock, with a reference memory clock of only 1600MHz). There are relatively rare occasions where a scene is bandwidth limited, typically when SSAA and high resolutions come into play, but even there the shader/texture/fillrate variables also have to be considered. IMO, cards were never fast enough to make use of overwhelming bandwidth, so bandwidth ended up last on the list. (Maybe it's a bit different with the G92s, given their lack of bandwidth and overwhelming shader/texturing performance.)

Hope this answers your question. Bear in mind I've left out a lot of other details. :)
 

BFG10K

Lifer
Aug 14, 2000
22,709
2,971
126
Effective memory bandwidth has pretty much been the single most important factor in graphics performance over the last several years, IMO.
No, shader performance has. Witness how 256-bit derivatives (3870, 8800 GT, 8800 GTS 512) perform so well compared to their older cousins with much more memory bandwidth.
 

AzN

Banned
Nov 26, 2001
4,112
2
0
Cookie knows how to explain this better than anyone here. Thanks, Cookie, for that well-written post.

To keep it short, though: AMD's SPs are not built the same as nVidia's SPs.

AMD's 320 SPs perform more like 96 of nVidia's SPs. Memory bandwidth alone doesn't do anything; you need the fillrate to take advantage of it.

Most of these games benefit more from greater fillrate and bandwidth. Only when a game is shader-bound does the card with the stronger shaders prevail.
 

SlowSpyder

Lifer
Jan 12, 2005
17,305
1,001
126
Can anyone explain what the stream processor actually does in the GPU? I'm guessing that with the pixel pipes in the old DX9 GPUs you could multiply the number of pipes by the GPU clock to see how many pixels could be drawn per second, right? (8 pixel pipelines x 500MHz = 4 gigapixel/s fillrate?) The texture units and shader units were similar: the number of units times the clock told you what could be applied to the pixels being drawn. They all had easy-to-follow labels. What does a stream processor actually do, since it's not drawing pixels? Is it adaptable to whatever the GPU needs at the time, i.e. shading or texturing? Thanks
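
To check my own math, a minimal sketch (using my hypothetical 8-pipe, 500MHz part):

```python
# Old fixed-pipeline fillrate math: pixel pipelines x core clock.
pipes, core_mhz = 8, 500  # hypothetical DX9-era part from my example
fillrate = pipes * core_mhz / 1000  # gigapixels per second
print(f"{pipes} pipes x {core_mhz}MHz = {fillrate:.0f} Gpixel/s")
# -> 8 pipes x 500MHz = 4 Gpixel/s, as in the example above.
```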
 

Dkcode

Senior member
May 1, 2005
995
0
0
Originally posted by: Cookie Monster
Firstly, "stream processor" is a fuzzy marketing term. Let me break this down for you in an easy-to-understand way. *snip*

Nice :thumbsup:
 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
Can anyone explain what the stream processor actually does in the GPU?

They have replaced the dedicated shader units, both pixel and vertex. Designers went this route because it allows more flexibility in allocating resources as the need arises; the downside is that they would be a little slower ("would" because they are clearly significantly faster than what they replaced, but a dedicated pixel, and particularly vertex, shader of a comparable generation would be faster still). This is somewhat interesting to me, as ATI was big on pushing unified shaders as fast as possible due to their flexibility, and then designed unified shader hardware that exemplifies their shortcomings (i.e. it needs an ideal workload to come remotely close to peak performance).

Moving forward, bandwidth is going to come back up as a major concern for overall GPU performance, but for a change it will be on the read end instead of the write end (500+ shader units are going to have to be fed, and the number of shaders in flight at once is skyrocketing).
 

Denithor

Diamond Member
Apr 11, 2004
6,300
23
81
Originally posted by: Soulkeeper
Effective memory bandwidth has pretty much been the single most important factor in graphics performance over the last several years, IMO.
The memory bus width is an important factor in this.

Originally posted by: superbooga
Then why was the 2900 XT, with its massive memory bandwidth, slower than both the 8800GTX and the 3870?

The analogy of a water hose is the best way to think about this effect.

If you hook up a fire hose to a home water faucet, the water doesn't come close to filling the pipe and simply trickles through (this is what happened with the 2900 XT: just because massive bandwidth was available did not mean the card could actually use it).

If you hook a garden hose to a fire hydrant you have the opposite problem: the garden hose does not have enough capacity to carry the load, so you get a constricted stream (this is what we saw with the venerable 7600 GT, which was strangled by its 128-bit memory interface [and would have been much, much faster with a 256-bit interface, a point supported by the significantly higher performance of 7600 GT cards that shipped with higher-clocked GDDR3 memory]).

The water pressure available in the hose depends on the source. Obviously a fire hydrant provides much more water than a garden faucet. This is equivalent to looking at different video card GPUs. The 8600GT for example is like the garden faucet while the 8800GT is closer to the fire hydrant in our analogy.

When designing video cards they need to make sure the bandwidth is adequate to handle the load the GPU can put out (so you don't constrict the flow -- 7600 GT) but they don't need an excessive amount (extra bandwidth doesn't help anything -- 2900 XT).

Remember that total memory bandwidth is a combination of the memory interface and the memory speed (both contribute to how much water you can move through the hose).

It's generally accepted that the G92 8800 cards are somewhat limited in bandwidth (256-bit is scrawny versus the 320-bit/384-bit available on the G80 cards).
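
Running the numbers with that formula shows the gap (the clocks are reference figures from memory, so roughly):

```python
# bandwidth = (bus bits / 8) x effective MHz / 1000, in GB/s.
# Memory clocks are approximate reference figures.
def bandwidth_gb_s(bus_bits, effective_mhz):
    return (bus_bits / 8) * effective_mhz / 1000

gtx = bandwidth_gb_s(384, 1800)   # 8800 GTX (G80): 86.4 GB/s
gts = bandwidth_gb_s(256, 1940)   # 8800 GTS 512 (G92): ~62.1 GB/s
# Effective memory clock a 256-bit bus would need to match the GTX:
needed_mhz = gtx * 1000 / (256 / 8)
print(f"GTX {gtx:.1f} GB/s vs G92 GTS {gts:.1f} GB/s")
print(f"256-bit needs ~{needed_mhz:.0f}MHz effective (GDDR4 territory) to match")
```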

My question is: why haven't any of the nVidia card makers produced 8800 series cards with faster GDDR4 memory to make up for the narrower 256-bit pipeline?
 

Denithor

Diamond Member
Apr 11, 2004
6,300
23
81
;)

No more hoses it is.

But interestingly enough, check out this article on DailyTech talking about the next generation 9800 cards.

9800GTX = G92 @ 675MHz core/2200MHz memory

If I'm not completely mistaken, that means basically an 8800GTS with GDDR4 for higher bandwidth. Granted, they will probably have a 1GB model but anyway...

The MSI 8800GTS 512MB OC at $220 after MIR is suddenly looking pretty good!
 

vanvock

Senior member
Jan 1, 2005
959
0
0
Originally posted by: Cookie Monster
Vec5 shaders can be advantageous against others IF its feeded (cant think of an easier term) with Vec5 instructions *snip*

fed

 

Munky

Diamond Member
Feb 5, 2005
9,372
0
76
ATi's 64 vec5 shaders are not limited to vec5 instructions. Each subcomponent is capable of executing a different instruction, so in theory the chip can execute up to 320 independent scalar instructions per clock. But it's up to the driver's shader compiler to pack the code for maximum utilization, and in actual games all 320 ALUs don't get used to the max.
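
A toy sketch of that packing problem: scalar ops can only share a 5-wide bundle if they don't depend on each other's results, so lane utilization tracks the instruction-level parallelism in the shader. The mini instruction stream here is made up for illustration:

```python
# Toy VLIW-style packer: greedily bundle independent scalar ops into
# 5-wide groups, the way the driver's shader compiler must for vec5
# units. An op can't be bundled until all of its inputs are done.
ops = [
    ("a = x*x", set()), ("b = y*y", set()), ("c = z*z", set()),
    ("d = a+b", {"a", "b"}), ("e = d+c", {"d", "c"}),
]

bundles, done, remaining = [], set(), list(ops)
while remaining:
    bundle, deferred = [], []
    for name, deps in remaining:
        if deps <= done and len(bundle) < 5:
            bundle.append(name)
        else:
            deferred.append((name, deps))
    done |= {n.split(" = ")[0] for n in bundle}  # results now available
    bundles.append(bundle)
    remaining = deferred

print(bundles)  # 3 bundles: the dependency chain leaves most lanes empty
print(f"lane utilization: {len(ops) / (5 * len(bundles)):.0%}")  # 33%
```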
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,712
142
106
Originally posted by: BFG10K
Effective memory bandwidth has pretty much been the single most important factor in graphics performance over the last several years, IMO.
No, shader performance has. Witness how 256-bit derivatives (3870, 8800 GT, 8800 GTS 512) perform so well compared to their older cousins with much more memory bandwidth.

That's why I said "IMO": I don't look at the same niche junk as the next guy.

But if you take the same GPU and put it on boards with different memory bandwidth configurations, the difference will be noticeable, more so than an overclock of the GPU.
This has been the case for at least 10 years.
 

BFG10K

Lifer
Aug 14, 2000
22,709
2,971
126
That's why I said "IMO": I don't look at the same niche junk as the next guy.

But if you take the same GPU and put it on boards with different memory bandwidth configurations, the difference will be noticeable, more so than an overclock of the GPU.
This has been the case for at least 10 years.
I'm not sure what you're saying here. In anything modern (say the last three years), shader performance is more important than memory bandwidth, and overclocking the core will yield more performance than overclocking the memory.

Again I refer to the examples of the 3870 and 8800 GT/GTS 512 being competitive with, and even better than, previous offerings that had vastly more memory bandwidth.

Memory bandwidth generally becomes more of a factor at high resolutions combined with high levels of AA.
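
A crude way to see why, counting only color and Z traffic for a single full-screen pass; real traffic is multiplied by overdraw, blending and texture reads (and reduced by Z/color compression), so treat it as an order-of-magnitude sketch with assumed byte counts:

```python
# Crude framebuffer-traffic estimate: why high res + AA eats bandwidth.
# Counts one color write plus one Z read and write per sample; the
# per-sample byte counts are assumptions, nothing else is modeled.
def traffic_gb_s(width, height, aa_samples, fps):
    samples = width * height * aa_samples
    bytes_per_sample = 4 + 4 + 4  # color write + Z read + Z write
    return samples * bytes_per_sample * fps / 1e9

print(f"1280x1024, no AA, 60fps: {traffic_gb_s(1280, 1024, 1, 60):.1f} GB/s")
print(f"2560x1600, 4xAA, 60fps:  {traffic_gb_s(2560, 1600, 4, 60):.1f} GB/s")
# The second case is ~12.5x the first, before overdraw or textures.
```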
 

Auric

Diamond Member
Oct 11, 1999
9,596
2
71
Originally posted by: BFG10K
I'm not sure what you're saying here. In anything modern (say the last three years), shader performance is more important than memory bandwidth, and overclocking the core will yield more performance than overclocking the memory.

Again I refer to the examples of the 3870 and 8800 GT/GTS 512 being competitive with, and even better than, previous offerings that had vastly more memory bandwidth.

Welp, that's 'cause they've got 'nuff. On 'tother hand take a G71 with 128-bit bus and o'erclocking the memory makes a beeeg difference. Indeed, a 7600GT outperforms a 7800GS 256-bit.

 

BFG10K

Lifer
Aug 14, 2000
22,709
2,971
126
Welp, that's 'cause they've got 'nuff. On 'tother hand take a G71 with 128-bit bus and o'erclocking the memory makes a beeeg difference. Indeed, a 7600GT outperforms a 7800GS 256-bit.
Sure, if you cripple anything enough it'll start becoming the bottleneck.