Importance of GPGPU/Compute in future games


3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
Either way, the performance sucks and that's just not good. Not so subtle but it's the way I see it.

Rendering multiple dynamic light sources takes a lot of compute power. That's why there's a "performance hit" with it on. With typical lighting tech the game would be unrenderable in real time. What you are calling a performance hit is in reality a performance boost, to the point of it being usable in games where, before, it wasn't.
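
For illustration, here's a minimal compute-style sketch of that per-pixel light loop (my own hypothetical CUDA code, not from any engine). The point is that the work scales as pixels × lights, which is exactly what the compute path makes affordable:

```cuda
#include <cuda_runtime.h>

struct Light { float3 pos; float3 color; float radius; };

// Hypothetical sketch: one thread per pixel, each looping over every
// dynamic light. Total work is (pixels x lights), which is why dozens of
// dynamic lights get expensive fast without a compute-based approach.
__global__ void shadePixels(const float3* worldPos, const float3* normal,
                            const Light* lights, int numLights,
                            float3* outColor, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    float3 c = make_float3(0.f, 0.f, 0.f);
    for (int l = 0; l < numLights; ++l) {   // O(numLights) per pixel
        float3 d = make_float3(lights[l].pos.x - worldPos[i].x,
                               lights[l].pos.y - worldPos[i].y,
                               lights[l].pos.z - worldPos[i].z);
        float dist2 = d.x * d.x + d.y * d.y + d.z * d.z;
        float ndotl = fmaxf(0.f, normal[i].x * d.x + normal[i].y * d.y +
                                 normal[i].z * d.z) * rsqrtf(dist2);
        float atten = fmaxf(0.f, 1.f - dist2 / (lights[l].radius *
                                                lights[l].radius));
        c.x += lights[l].color.x * ndotl * atten;
        c.y += lights[l].color.y * ndotl * atten;
        c.z += lights[l].color.z * ndotl * atten;
    }
    outColor[i] = c;
}
```

Real implementations cull lights per screen tile first; this unculled version is the naive cost that the quoted "performance hit" is being compared against.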
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
Any top game today is using "compute". All it really means is that the GPU's processing units are being used. There are a couple of ways one could run a program on the GPU: you can do so on the geometry, on the pixels, or with no such 3D primitives in place. Most of the lighting today is done on the pixel image with a pixel shader program, and that is compute in the general sense of the word. It isn't likely that moving this into a specialist model and using another API is going to change the nature of the algorithm itself or change how it performs. Simply put, games don't need the increased-accuracy calculations, so the performance difference that we see in compute programs today isn't comparable.

If you're doing science you'll use 64-bit floats, perhaps 128-bit or even higher depending on your calculation. If it's a game and we are doing lighting, well, 32-bit floats are more than enough for that. NVidia's disadvantage is at 64-bit; it's much less prominent at gaming levels of quality, and I can't imagine anyone would waste performance on 64-bit colour channels, it would be ludicrous.
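
To put a number on that, here's a minimal hypothetical CUDA micro-benchmark (my own sketch, not rigorous) timing a chain of fused multiply-adds in float vs. double. On a consumer Kepler with 1/24-rate FP64 the double run is dramatically slower; at the 32-bit precision games actually use, the compute gap largely disappears:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread runs a long dependent FMA chain; with ~256k threads in
// flight, total kernel time roughly tracks arithmetic throughput.
template <typename T>
__global__ void fmaChain(T* out, T seed, int iters)
{
    T x = seed + (T)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * (T)1.000001 + (T)0.000001;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

template <typename T>
float timeKernel(int iters)
{
    T* out;
    cudaMalloc(&out, 1024 * 256 * sizeof(T));
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    cudaEventRecord(beg);
    fmaChain<T><<<1024, 256>>>(out, (T)1.0, iters);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, beg, end);
    cudaFree(out);
    return ms;
}

int main()
{
    printf("float : %.2f ms\n", timeKernel<float>(100000));
    printf("double: %.2f ms\n", timeKernel<double>(100000));
    return 0;
}
```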
 

cmdrdredd

Lifer
Dec 12, 2001
27,052
357
126
If you want to look at it that way sure. I'm thinking from a playability standpoint which is a different point of view. I understand where you're coming from. You wouldn't be able to do it at all before using traditional DX11 techniques. I'm looking at it as a game that went from super playable to near slideshow with one tick. It's two opposite views of the same thing. It is possible to appreciate what they were trying to do, but as was pointed out earlier some of it may have been overdone a bit.

We won't really know unless more games start using it, but so far there haven't been many that I can recall. As I said before, I can't think of any other examples that had that strong of a performance hit. Even Civ 5 ran just fine when using its DirectCompute texture decompression. That is a totally different way to use DirectCompute, though, and isn't all that taxing to begin with. We will really have to wait until some devs start talking about it a bit more.
 

boxleitnerb

Platinum Member
Nov 1, 2011
2,605
6
81
I made a thread like this before:
http://forums.anandtech.com/showthread.php?t=2300351

Dirt 3, Bioshock 3, Battlefield 3, Metro 2033, Metro Last Light, Far Cry 3, Hitman Absolution, Sleeping Dogs, Tomb Raider, Sniper Elite V2 (and I guess more that haven't come to mind) all use DirectCompute, mostly for DoF- and SSAO-type effects as far as I can gather from information on the web.
To what extent, I don't know; I couldn't even guess.

I would also like to point out that, for example, the 7970 GE has 4.3 TFLOPs of compute power and 288 GB/s of memory bandwidth. A full GK104 has considerably fewer resources in this department, so it's not necessarily a question of architecture alone.
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
To this day I still don't understand why or how the 680 ever even remotely competes with a 7970. It's down on memory bandwidth and compute, and it's a considerably smaller core. It does have some other aspects that are superior, but I have been under the impression that games mostly need more bandwidth and more compute performance these days for the effects. The 7970 never pulls ahead as much as its raw theoretical figures suggest, and that is either a point of concern (it's not very efficient in practice) or a shining potential for when developers start doing real things with that compute performance.
 

boxleitnerb

Platinum Member
Nov 1, 2011
2,605
6
81
There are cases where the 7970 (GE) pulls ahead quite significantly, up to 30%. But those are rather rare overall.
The problem AMD has is that their front end is weak. They get less performance out of x ALUs @ y MHz compared to Nvidia most of the time. The smaller pixels get, the less important this bottleneck becomes. Thus AMD cards can use their raw power more effectively at higher resolutions/with (OG)SSAA. This can also be verified by using SGSSAA. In contrast to OGSSAA, SGSSAA doesn't blow up the resolution, so the hardware deals with larger pixels, but shades them multiple times during the sampling process (2x-8x depending on SGSSAA mode). On average, GCN cards fall behind more using SGSSAA, sometimes significantly.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
To this day I still don't understand why or how the 680 ever even remotely competes with a 7970. It's down on memory bandwidth and compute, and it's a considerably smaller core. It does have some other aspects that are superior, but I have been under the impression that games mostly need more bandwidth and more compute performance these days for the effects. The 7970 never pulls ahead as much as its raw theoretical figures suggest, and that is either a point of concern (it's not very efficient in practice) or a shining potential for when developers start doing real things with that compute performance.

This is something I've often wondered myself. The 7970 has 50% more bandwidth than the GTX 680, but the GTX 680 can easily hang with the 7970, or outpace it at 2560x1600 with AA and all the bells and whistles in many titles.

It's not until you go above 1600p that the Radeon's extra bandwidth begins to assert itself, it seems. I guess NVidia's memory controller and bandwidth-enhancing/saving technology are more efficient.

Even the GTX 770 doesn't really gain much from the extra bandwidth (with the exception of a few games), which implies the GTX 680 isn't really bandwidth-starved.
 

SirPauly

Diamond Member
Apr 28, 2009
5,187
1
0
Very important for cinematic features like DOF, compression, dynamics, and potentially physics (particles, fluids), etc.
 

sushiwarrior

Senior member
Mar 17, 2010
738
0
71
This is something I've often wondered myself. The 7970 has 50% more bandwidth than the GTX 680, but the GTX 680 can easily hang with the 7970, or outpace it at 2560x1600 with AA and all the bells and whistles in many titles.

It's not until you go above 1600p that the Radeon's extra bandwidth begins to assert itself, it seems. I guess NVidia's memory controller and bandwidth-enhancing/saving technology are more efficient.

Even the GTX 770 doesn't really gain much from the extra bandwidth (with the exception of a few games), which implies the GTX 680 isn't really bandwidth-starved.

I think it may be a matter of "good enough", where more bandwidth isn't really necessary because both cards have enough to push 2560x1600, but past that point the bandwidth starts to become a bit of a bottleneck. I doubt it is something like the 680 having a more advanced MC (IIRC, GDDR5 memory controllers are all standardized?).
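
Some rough arithmetic behind the "good enough" intuition (my numbers, purely illustrative):

```latex
% Raw framebuffer traffic at 2560x1600, 60 fps, assuming a generous
% ~40 bytes of G-buffer/blend read+write traffic per pixel per frame:
\[
2560 \times 1600 \ \text{px} \times 40 \ \tfrac{\text{B}}{\text{px}}
\times 60 \ \tfrac{\text{frames}}{\text{s}} \approx 9.8\ \text{GB/s}
\]
```

That's a small fraction of either card's 192-288 GB/s; the real demand comes from texture and AA sample traffic, which grows with resolution, so a surplus at 1600p can still become a bottleneck beyond it.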

Given how Kepler is performing equally with "less" resources, I see a few possibilities.

1. AMD drivers still aren't giving GCN its full potential? Or GCN simply is good at GPGPU, but not really games.
2. Kepler is superior/equal at "standard" GPU work (polygons, texture fill, etc.), and hence is a smaller die than Tahiti due to no/limited inclusion of the architectural features necessary for GPGPU. If many games were to start using compute, GCN would gain more performance relative to Kepler.

GK110 is almost twice the size of GK104, but doesn't deliver nearly 2x the performance. Tahiti is ~60mm^2 bigger, but this could easily be attributed to the larger memory bus (significant die size hog). If Tahiti's "core" is the same size as Kepler's "core", yet gives equal game performance and better compute performance... then wow, that's pretty impressive. I just hope GCN2.0 isn't a disappointment, and I hope that Maxwell will bring GPGPU ability back into the equation for Nvidia.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
GK110 is almost twice the size of GK104, but doesn't deliver nearly 2x the performance. Tahiti is ~60mm^2 bigger, but this could easily be attributed to the larger memory bus (significant die size hog). If Tahiti's "core" is the same size as Kepler's "core", yet gives equal game performance and better compute performance... then wow, that's pretty impressive. I just hope GCN2.0 isn't a disappointment, and I hope that Maxwell will bring GPGPU ability back into the equation for Nvidia.

Lol. What a comparison. GCN delivers less geometry performance than Kepler. So there is nothing "wow" about a card which sacrifices one architectural point to get more out of another.
 

sushiwarrior

Senior member
Mar 17, 2010
738
0
71
Lol. What a comparison. GCN delivers less geometry performance than Kepler. So there is nothing "wow" about a card which sacrifices one architectural point to get more out of another.

GCN doesn't deliver less geometry performance than Kepler: Tahiti is roughly the same size as GK104, they perform the same in standard games, and Tahiti wins in most titles with GPGPU/DirectCompute implementations. There is something "wow" about a card that is the same price, the same performance, yet many, many times faster at GPGPU.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
You're absolutely right. 2 rasterizers instead of 4, and 2 geometry units instead of 8, is not "less geometry performance than Kepler".

...!
 

boxleitnerb

Platinum Member
Nov 1, 2011
2,605
6
81
Are we talking GCN in general or specific GPUs?
I would like to point out that there is always a trade-off between area efficiency and energy efficiency. GK110 is large, but also very efficient, about 25% more efficient than Tahiti in graphics and DP compute. The DP/SP ratio of 1:3 certainly costs transistors as well. It's a moot point comparing die sizes in this case, since the GPUs have different strengths and weaknesses that come into play and affect the size.
 

sushiwarrior

Senior member
Mar 17, 2010
738
0
71
You're absolutely right. 2 rasterizers instead of 4, and 2 geometry units instead of 8, is not "less geometry performance than Kepler".

...!

Comparing across architectures isn't that simple :confused: Given that GCN evidently has far fewer geometry units and rasterizers, how is it tied in games that are purely geometry-bound? Clearly you are oversimplifying the architectural choices.

Are we talking GCN in general or specific GPUs?
I would like to point out that there is always a trade-off between area efficiency and energy efficiency. GK110 is large, but also very efficient, about 25% more efficient than Tahiti in graphics and DP compute. The DP/SP ratio of 1:3 certainly costs transistors as well. It's a moot point comparing die sizes in this case, since the GPUs have different strengths and weaknesses that come into play and affect the size.

Yes, area efficiency =/= energy efficiency. While Tahiti saves on static leakage, the higher voltage means that dynamic power consumption could be adversely affected compared to GK110.

The DP/SP ratio is a bit of a moot point when Tahiti has 1.04 TFLOPS FP64 and Titan has 1.14 TFLOPS...
 

boxleitnerb

Platinum Member
Nov 1, 2011
2,605
6
81
Titan has 1.3 TFLOPs DP (depends on the actual load, downclocking below base clock may occur):
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3

In any case, coming from the professional segment, K20(X) is significantly stronger in DP than the W9000 (Tahiti), about 30% at the same TDP. I just think the 1:3 ratio isn't for free in terms of transistors. Same with dynamic parallelism and Hyper-Q: both are irrelevant in gaming, but they make the chip larger nonetheless.
 

sushiwarrior

Senior member
Mar 17, 2010
738
0
71
Titan has 1.3 TFLOPs DP (depends on the actual load, downclocking below base clock may occur):
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3

In any case, coming from the professional segment, K20(X) is significantly stronger in DP than the W9000 (Tahiti), about 30% at the same TDP. I just think the 1:3 ratio isn't for free in terms of transistors. Same with dynamic parallelism and Hyper-Q: both are irrelevant in gaming, but they make the chip larger nonetheless.

Sorry, my 1.14 was hasty and from a Google search. Tahiti looks like it delivers ~3/4 of Titan's FP64 performance, and more FP32 performance...
According to Tom's.

The 1:3 ratio is definitely a cost - GK110 is the same layout as GK104. GK104 has 8 dedicated FP64 units per SMX (they only do FP64, and nothing else) while GK110 has 64 FP64 units per SMX (which, again, only do FP64). It was purely a matter of spending more transistors.
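
For what it's worth, the published numbers line up with those unit counts. A rough sanity check (my arithmetic; the ~732 MHz figure assumes Titan throttles toward the K20X clock with full-rate FP64 enabled, per the downclocking caveat above, and Tahiti runs DP at 1/4 rate):

```latex
\[
\text{Titan: } 14 \ \text{SMX} \times 64 \ \tfrac{\text{FP64 units}}{\text{SMX}}
\times 2 \ \tfrac{\text{FLOP}}{\text{FMA}} \times 0.732 \ \text{GHz}
\approx 1.31 \ \text{TFLOPS}
\]
\[
\text{Tahiti GE: } \frac{2048 \ \text{ALUs}}{4}
\times 2 \ \tfrac{\text{FLOP}}{\text{FMA}} \times 1.05 \ \text{GHz}
\approx 1.08 \ \text{TFLOPS}
\]
```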
 

boxleitnerb

Platinum Member
Nov 1, 2011
2,605
6
81
I wonder, though, why there are dedicated FP64 units at all. Why not put two FP32 units together for an FP64 operation? Sorry for the OT, but does anybody have an idea about this?
 

bunnyfubbles

Lifer
Sep 3, 2001
12,248
3
0
By the time implementation of GPGPU/compute in games becomes something of a mainstream thing, GCN will be 2-3 generations old, making this speculation rather pointless.

We saw the same thing with GeForce 6 and its supposed SM3.0 advantage, or with the Radeon Evergreen series and DX11; this really isn't any different.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
I wonder, though, why there are dedicated FP64 units at all. Why not put two FP32 units together for an FP64 operation? Sorry for the OT, but does anybody have an idea about this?

Fewer transistors are active, and you have a much better memory transfer system.
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
Doing FP64 using 32-bit units isn't as simple as it sounds. With integers it takes about 3 operations to do 64-bit math on 32-bit units, but with floating point it takes quite a few more, at least 4x and as high as 10x from what I have seen.
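
For anyone curious what that emulation actually looks like, below is the classic "double-single" add (a hypothetical CUDA sketch based on Knuth's two-sum; the value is carried as an unevaluated sum hi + lo of two floats). One emulated add is already ~11 FP32 operations versus a single native FP64 op, and it must be compiled without fast-math/FMA contraction or the error terms get optimized away:

```cuda
// A double-single value: an unevaluated sum (hi + lo) of two floats,
// giving roughly 48 bits of significand instead of 24.
struct dsfloat { float hi, lo; };

__device__ dsfloat ds_add(dsfloat a, dsfloat b)
{
    // Knuth two-sum: exact sum of the high parts plus its rounding error
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);  // rounding error of s

    e += a.lo + b.lo;                         // fold in the low parts

    // Renormalize so |lo| stays small relative to |hi|
    dsfloat r;
    r.hi = s + e;
    r.lo = e - (r.hi - s);
    return r;
}
```

Multiplication is worse still, which is where the 4x-10x range comes from.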