My 8800 Ultra Bottleneck Investigation

BFG10K

Lifer
Aug 14, 2000
22,709
3,002
126
We ran a few tests like this in the 9600 thread but it wasn't resolved to my satisfaction and a lot of unanswered questions remained.

So I decided to do something about it. :p

I hope you find the results interesting as I certainly did.

I'd also encourage people to submit similar tests in this thread, especially 8800 GT and 9600 GT owners, so we can see whether their bottlenecks are similar to those I observed on the 8800 Ultra.

Also let me know if you like the new result format over the old text columns. :)

Click.
 

JPB

Diamond Member
Jul 4, 2005
4,064
89
91
Awesome work as always.

Nice chart also mate :thumbsup:
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
Nicely laid out post and results. I too would have thought the shader clock makes a more significant impact on performance, but I guess not - good post.
 

Munky

Diamond Member
Feb 5, 2005
9,372
0
76
Good post. I did some similar testing on my 8800gt, only I overclocked the card, and tested how each component affected performance scaling. So far I only tested FEAR, but I'll try more games and see how it looks.
 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
Originally posted by: Sylvanas
Nicely laid out post and results. I too would have thought the shader clock makes a more significant impact on performance, but I guess not - good post.

Very well done!
[the 'usual' excellence we are now coming to expect :)]

i hope you don't mind i quote the OP's conclusions here:

Commentary
The biggest performance difference clearly comes from the core clock where some games are almost seeing a 1:1 performance delta with it. I expected it would be shader clocks making the biggest difference but clearly that isn't the case with the 8800 Ultra.

My theory is the Ultra's 128 SPs have plenty of shader power to burn as even a 1224 MHz shader clock is double the stock core clock (612 MHz). Ramping up the shader clock was a smart move on nVidia's part, especially since I doubt it affects yields too much.

Enabling AF + AA moves the bottleneck away from the shaders and onto the memory and core. I expected this as said features hit the ROPs, TMUs and bandwidth harder. Note that even though TMUs are tied to SPs on the G80/G92, they actually run at core clocks, not shader clocks.

These figures may be different across generations of cards and even cards in the same generation, so YMMV. But it's clear that if nVidia want to make an 8800 Ultra killer the simplest way is to ramp the core clock and leave everything else the same.

logically laid out .. my own guess for GT200 is that we will see the biggest jump come from its core clock. G80 was an awesome design to build on!
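To put some rough numbers on the "shader power to burn" idea, here's a minimal sketch. The 128 SPs, 1224 MHz shader and 612 MHz core figures come from the quoted conclusions; the 24 ROP count and the 2 flops/clock (MAD) figure are my own assumptions, so treat the outputs as illustrative only.

```python
# Rough arithmetic behind the "shader power to burn" point. SP count, shader
# clock and core clock are taken from the quoted conclusions; the 24 ROPs and
# the 2 flops/clock (MAD) figure are assumptions.

sp_count   = 128    # G80 stream processors
shader_mhz = 1224   # even the *underclocked* shader speed used in the test
core_mhz   = 612    # stock core clock
rop_count  = 24     # assumed ROP count for the 8800 Ultra

# Shader throughput scales with SPs x shader clock; pixel fillrate with ROPs x core clock.
shader_gflops  = sp_count * shader_mhz * 2 / 1000   # MAD = 2 flops per clock
pixel_fill_gps = rop_count * core_mhz / 1000

print(f"shader clock / core clock ratio : {shader_mhz / core_mhz:.1f}x")
print(f"approx shader throughput        : {shader_gflops:.0f} GFLOPS")
print(f"approx pixel fillrate           : {pixel_fill_gps:.1f} Gpixels/s")
```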
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,002
126
I did some similar testing on my 8800gt, only I overclocked the card, and tested how each component affected performance scaling. So far I only tested FEAR, but I'll try more games and see how it looks.
Do you have those Fear figures by any chance? I only used games from 2006 onwards but I'm still interested in older titles.

In particular Doom 3, Far Cry, Fear, Riddick and Call of Duty 2 could be quite interesting; despite being older titles they're still quite demanding.
 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
Originally posted by: BFG10K

Do you have those Fear figures by any chance? I only used games from 2006 onwards but I'm still interested in older titles.

In particular Doom 3, Far Cry, Fear, Riddick and Call of Duty 2 could be quite interesting; despite being older titles they're still quite demanding.

i am curious, i can do something similar with 2900xt, right?

The idea was that it was too "shader rich" - has it been "explored" as thoroughly as what you have done with your Ultra? ..
i think RT and ATT Tool have similar O/C'ing programs for my GPU so i can copy you [i hope :p]

i know O/C'ing my 2900Pro's core makes most of the performance difference, although its memory was certainly underclocked compared to where it is now.
 

Pantlegz

Diamond Member
Jun 6, 2007
4,627
4
81
I would like to help - any chance 8800 GTs in SLI would be of any use, or are you only interested in single GPU solutions at the moment?
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,002
126
i am curious, i can do something similar with 2900xt, right?
Unfortunately not quite. You can certainly test core vs memory but on ATi there's no way to decouple the shader clocks from the core clock that I know of, so if you touched the core you wouldn't really know what caused the performance change.

I would like to help - any chance 8800 GTs in SLI would be of any use, or are you only interested in single GPU solutions at the moment?
SLI might obscure the issue as there'd now be scaling factors to account for as well as possible synchronization issues. In theory you could run your system in single card mode and underclock both cards by the same amount, but YMMV.

The ideal tests are single G8x/G9x cards, and I'm particularly interested in 8800 GT and 9600 GT tests (especially shader clocks on the latter).
 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
Unfortunately not quite. You can certainly test core vs memory but on ATi there's no way to decouple the shader clocks from the core clock that I know of, so if you touched the core you wouldn't really know what caused the performance change.
i have pretty much already tested that, but ...

Too bad .. no wonder no articles explore this. i think AMD's shader clocks can be decoupled from the core clock just as NVIDIA's are, but we will need someone to write the program, i guess. :p
- Who could do it?
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
BFG I noticed something similar when I ran COD4 with half of my 8800GTS's shaders disabled. It barely made any difference to the gameplay.

What I would like to see someone definitively answer is: Why does it seem impossible to break the 40fps barrier in Crysis? Even quad-sli rigs struggle to go faster than that, regardless of resolution. More CPU doesn't seem to help either.

Perhaps it's related to clockspeed...
 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
i think "brute force" will finally do it for Crysis .. and it is certainly not YET fully optimized for either quad-core or Vista-64 .. Vista-64 should add at least +15-20% more FPS over the 32-bit version if FarCry or Hg:L is any indication.


GT200 x2 or whatever comes after r700 [x2] will probably finally manage "very high"
--maybe by the end of this year at upper-mid resolutions, is my hope

edited
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
Originally posted by: apoppin
i think "brute force" will finally do it for Crysis
What exactly do you mean by that? Just tons of TMUs, ROPs, and stream processors at extremely high clockspeeds?

To me, SLI/CF instantly doubles (or even quadruples) all of those factors except for clockspeed.

I'm guessing that certain parts of a game engine take advantage of different shaders. I also assume that there is a limit to the number of shaders in a given game, thus making many of the stream processors on a GPU redundant (à la R600). Therefore, you simply need fewer stream processors running at a higher clockspeed in order to gain more "brute force".

I hopefully just answered my own question. ;)

The 9600GT seems to be an example of this theory as well. :beer:
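A toy sketch of that trade-off, with purely made-up numbers: total shader throughput is roughly units x clock, but if a workload can only keep a limited number of SPs busy, a narrower part at a higher clock can come out ahead.

```python
# Toy model of "fewer SPs at a higher clock": total shader throughput is
# roughly units x clock, but a workload can only keep so many units busy.
# All numbers here are made up purely for illustration.

def effective_throughput(sp_count, clock_mhz, usable_sps):
    """Throughput proportional to clock times however many SPs the workload can actually feed."""
    return min(sp_count, usable_sps) * clock_mhz

# Suppose a game can only keep ~96 SPs busy at once:
wide_and_slow   = effective_throughput(sp_count=128, clock_mhz=1500, usable_sps=96)
narrow_and_fast = effective_throughput(sp_count=96,  clock_mhz=2000, usable_sps=96)

print(wide_and_slow, narrow_and_fast)  # 144000 vs 192000 - the narrower, faster part wins here
```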
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
Originally posted by: apoppin
Unfortunately not quite. You can certainly test core vs memory but on ATi there's no way to decouple the shader clocks from the core clock that I know of, so if you touched the core you wouldn't really know what caused the performance change.
i have pretty much already tested that, but ...

Too bad .. no wonder no articles explore this. i think AMD's shader clocks can be decoupled from the core clock just as NVIDIA's are, but we will need someone to write the program, i guess. :p
- Who could do it?

Nope, I would think it's impossible, and it's that way by architectural design - if the shader clocks for the ATI 2900/38xx were adjustable they'd be listed separately in the BIOS, and they are not.

It raises the question though: BFG has shown shader clocks don't make much of an impact, so I guess that's why AMD just ramped the core with the RV670 instead of having a separate shader frequency. Although I am sure there's much more to it than that, and it would be hard to make comparisons with vastly different architectures.
 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
i think another year will answer it for sure ... and Crysis will probably still gain nearly +25% more performance under ideal conditions once it is fully optimized - about the same time the HW is powerful enough.
- certainly within a year

Nope, I would think it's impossible, and it's that way by architectural design - if the shader clocks for the ATI 2900/38xx were adjustable they'd be listed separately in the BIOS, and they are not.
They could be hidden :p
--but you are probably right .. and too bad


edit

AFAIK this is not possible - you'd need a separate clock generator for the shaders at the hardware level.
- it's not worth a bump, but that is what i meant by "possible" ... you'd have to reverse engineer it; not likely .. clearly AMD set the shader clocks at the HW level and left no adjustments accessible in the BIOS.
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,002
126
Too bad .. no wonder no articles explore this. i think AMD's shader clocks can be decoupled from the core clock just as NVIDIA's are, but we will need someone to write the program, i guess
AFAIK this is not possible - you'd need a separate clock generator for the shaders at the hardware level.

BFG I noticed something similar when I ran COD4 with half of my 8800GTS's shaders disabled. It barely made any difference to the gameplay.
Keep in mind that when disabling SPs you're not only disabling shaders but also TMUs, and maybe even ROPs too.

In any case, my CoD 4 results had performance drops across the board, especially with AA+AF, where the core made a huge difference (a 19% core speed drop resulted in a 14.26% performance drop).
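As a quick sanity check on those CoD 4 numbers, the ratio of frame-rate drop to clock drop works out to roughly 0.75 (1.0 would be a perfect 1:1 delta):

```python
# Sanity check on the CoD 4 figures above: how closely did the frame rate
# track the core clock? A factor of 1.0 would be a perfect 1:1 delta.
core_clock_drop_pct = 19.0
fps_drop_pct        = 14.26

print(f"performance delta per core-clock delta: {fps_drop_pct / core_clock_drop_pct:.2f}")  # ~0.75
```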

As for Crysis, I tested that game too; I got the biggest difference from core clocks, followed by shader and then memory (game settings all on high).

To me nVidia's options seem simple: release a 65nm 8800 Ultra and ramp the core as high as possible (preferably 800 MHz or more). They can leave everything else the same if they like.

It raises the question though: BFG has shown shader clocks don't make much of an impact, so I guess that's why AMD just ramped the core with the RV670 instead of having a separate shader frequency.
I think the fact that they're running at greater than double the core speed means there's plenty of shader power to burn. If the shader clocks matched the core's like ATi's do, I'd wager we'd be seeing a much bigger difference, and likely the shader clock would become the biggest bottleneck.
 

ArchAngel777

Diamond Member
Dec 24, 2000
5,223
61
91
Nice work BFG. It takes a lot of time and dedication to run these tests; anyone who has tried to run a few knows how many hours of free time they can eat up. Appreciate the test results.

One point of interest to me, though, is that the ratio between core/shader speed and memory bandwidth on G80 is far superior to that of G92. Which means the G92 may in fact show something much closer to a 1:1 ratio between underclocking the memory and the performance impact it has.
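Here's a rough sketch of that ratio, using the commonly quoted stock clocks for an 8800 Ultra and an 8800 GT; treat the figures as approximate.

```python
# Rough numbers behind the bandwidth-to-core ratio, using commonly quoted
# stock clocks (treat as approximate).

def mem_bandwidth_gbs(bus_bits, effective_mhz):
    """Peak memory bandwidth in GB/s: bus width in bytes times effective data rate."""
    return bus_bits / 8 * effective_mhz / 1000

cards = {
    # name: (core MHz, memory bus width in bits, effective memory MHz)
    "8800 Ultra (G80)": (612, 384, 2160),
    "8800 GT (G92)":    (600, 256, 1800),
}

for name, (core_mhz, bus_bits, mem_mhz) in cards.items():
    bw = mem_bandwidth_gbs(bus_bits, mem_mhz)
    print(f"{name}: {bw:.1f} GB/s total, {bw / core_mhz * 1000:.0f} MB/s per core MHz")
```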

 

imported_Shaq

Senior member
Sep 24, 2004
731
0
0
Originally posted by: SickBeast
BFG I noticed something similar when I ran COD4 with half of my 8800GTS's shaders disabled. It barely made any difference to the gameplay.

What I would like to see someone definitively answer is: Why does it seem impossible to break the 40fps barrier in Crysis? Even quad-sli rigs struggle to go faster than that, regardless of resolution. More CPU doesn't seem to help either.

Perhaps it's related to clockspeed...

On the benchmark I am at 49.34fps with medium shadows and high/very high for everything else at 1680x1050, with an OC'd GX2 and Q6600. With all settings on high I am at ~51fps. According to the AT article it is platform related; I look forward to any updates he provides. OC'ing the Skulltrail platform raised framerates, I believe. Different reviews have different results, but a lot of them do hit a wall at 45fps or so. NASA is using a supercomputer to max out Crysis and they still haven't finished the whole game as it kept crashing. LOL It was in Game Informer magazine.

Definitely a nice chart, BFG10K. Thanks for sharing the results with us. Doom 3 will be interesting; as I recall, when it came out it responded very well to an increase in CPU cache size. There may be peculiarities in the GPU benchmark as well.
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,002
126
I also assume that there is a limit to the number of shaders in a given game, thus making many of the stream processors on a GPU redundant (à la R600). Therefore, you simply need fewer stream processors running at a higher clockspeed in order to gain more "brute force".
This is a good theory, SickBeast. It could be that parallelism is falling off in these situations and we simply need higher clocks to speed up the work that's already being done.

Doom 3 will be interesting; as I recall, when it came out it responded very well to an increase in CPU cache size. There may be peculiarities in the GPU benchmark as well.
I dunno about you guys but I'm sorely tempted to test more games. No promises but I'll see what I can do. :)
 

AzN

Banned
Nov 26, 2001
4,112
2
0
Good thing you went and tested it out. We would have been arguing forever. :p

The games today just aren't shader bound. That will change rather quickly, though, when cards with more SPs can flex their muscles. I know BioShock, for instance, uses shaders more heavily than other games, along with FEAR.

Crysis is definitely screaming for more SPs, but the biggest bottleneck seems to be pixel and texture fillrate. Post processing is one of those things that also eats up bandwidth and fillrate, along with AA and AF. That is why the 9600 GT does so well with all the filters on even though its raw performance is somewhere between a 3850 and a 3870.
 

Cookie Monster

Diamond Member
May 7, 2005
5,161
32
86
Having a fast shader core can give a positive boost to shader intensive games or even ALU intensive apps, but when the rest of the chip is crawling at a slow frequency then it's obvious that the fast shader core gets bottlenecked by the rest of the chip (TMUs, ROPs, fillrates, etc.), i.e. downclocking your core clock will have a bigger impact on performance.

I don't know if you can say there is a limit on the number of shaders; it depends on just how well the GPU can keep all its ALUs active. For instance R600/RV670 is a VLIW architecture using a shader setup similar to vec5. That alone can affect its performance, since it requires more effort to keep the 320 ALUs busy, unlike nVIDIA's G80 architecture where there's no need for specific coding from the compiler or any relevant overhead for each game due to its scalar nature. R600/RV670 is kind of a driver nightmare for software engineers, since the games out now all have massively variant shader instructions, making it hard for the R600 scheduler to do its job efficiently unless they go out of their way to optimize each specific title. Not to mention AA being done through the shaders on the R600/RV670.

That's why it's hard to say that increasing clock frequency ("brute force") is better than having more units; rather, it's about how they are "fed" or "utilized" in order to keep all of the ALUs doing work. For G80, i think we are clearly seeing a bottleneck: higher shader clocks seem to result in diminishing returns, which means there are other bottlenecks that are clearly affecting this negatively. One example is its triangle setup. But this architecture is almost 2 years old, so i think it's doing great.

Also i think that the whole "higher ALU to TEX ratio is the future" idea (kind of promoted by ATi, as can be seen in their 3~4:1 ALU:TEX designs) may be untrue, and that texturing is still important even today. Texturing demand seems to grow roughly linearly with shader usage, not exponentially as many have been led to believe. Imo texturing is still important enough to create bottlenecks for GPUs that in fact lack texturing performance, which is clear with the R6x0 series architecture.

G80 doesn't seem to be affected much by lower bandwidth, as seen in the GTX/Ultra comparison. However it would be interesting to see some G9x results (on how memory clock, i.e. bandwidth, affects performance) since bandwidth seems to be the most obvious limiting factor for G9x. Even the 512MB of VRAM seems to be another bottleneck in 19x12 (and above) + AA/AF scenarios.

edit - just some food for thought. :)
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
I really wonder how a card like an 8800 GT would perform running on DDR2 memory. DDR2 is soooo cheap. They could probably buy 512MB worth for $5 or something silly.

Plus, it would silence the critics of the Fusion idea (if it actually can perform decently).

I wonder if we could simulate DDR2 performance by clocking the memory speed really low on a graphics card.
 

Piuc2020

Golden Member
Nov 4, 2005
1,716
0
0
Underclock your memory to 450 MHz (900 MHz DDR) and you should get performance similar to DDR2; however, you'll see that it brings a huge performance drop. While BFG's tests show that the core gives the most gains, that doesn't mean the cards aren't limited by their memory bandwidth as well.
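Some back-of-the-envelope bandwidth numbers for that idea, assuming an 8800 GT-style 256-bit memory bus (purely illustrative):

```python
# Back-of-the-envelope check on the "simulate DDR2" suggestion, assuming an
# 8800 GT-style 256-bit memory bus. Figures are illustrative only.

bus_bits = 256

def bandwidth_gbs(effective_mhz):
    """Peak bandwidth in GB/s for the assumed 256-bit bus."""
    return bus_bits / 8 * effective_mhz / 1000

print(f"stock GDDR3 (1800 MT/s): {bandwidth_gbs(1800):.1f} GB/s")  # ~57.6 GB/s
print(f"downclocked (900 MT/s) : {bandwidth_gbs(900):.1f} GB/s")   # ~28.8 GB/s, roughly DDR2-class data rates
```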
 

AzN

Banned
Nov 26, 2001
4,112
2
0
Remember, memory bandwidth plays a role in pixel fillrate on nVidia's DX10 cards... not so much in texture fillrate.

pixel fillrate test

texture fillrate test

The 9600 GT would show the opposite of BFG's Ultra effect: underclocking the shader would hamper performance as much as the core.