Now, I've always thought that this touted special compute ability was rather irrelevant and that it was instead all about raw power (SP GFLOPs) and bandwidth. I recently had a look at some "Tahiti LE" reviews, which are quite interesting because Tahiti LE cards have about the same SP GFLOPs as a 670/680 hybrid and the same memory bandwidth.
That can't be the whole answer. Real-world evidence shows other factors are at play too.
HD7970 GE = 2048 SPs @ 1050 MHz = 4.3 TFLOPs
HD7950 V2 = 1792 SPs @ 950 MHz = 3.4 TFLOPs
SP GFLOPs advantage of 26%
HD7970 GE = 288 GB/sec memory bandwidth
HD7950 V2 = 240 GB/sec memory bandwidth
Memory bandwidth advantage of 20%
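For reference, those TFLOPs figures fall out of a simple formula: SPs × 2 FLOPs per clock (one fused multiply-add) × clock speed.

2048 × 2 × 1.050 GHz ≈ 4301 GFLOPs ≈ 4.3 TFLOPs (HD7970 GE)
1792 × 2 × 0.950 GHz ≈ 3405 GFLOPs ≈ 3.4 TFLOPs (HD7950 V2)

4301 / 3405 ≈ 1.26, i.e. the 26% SP advantage, and 288 / 240 = 1.20 for the 20% bandwidth advantage.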
In Hitman Absolution at 1920x1080 4AA, the HD7970 GE beats the HD7950 V2 by 29% on average and 33% in minimums.
"We ran each game test or benchmark twice and took the best result for the diagrams, but only if the difference between them didn’t exceed 1%. If it did exceed 1%, we ran the tests at least one more time to achieve repeatability of results."
Also, memory bandwidth helps feed the compute units, but that doesn't mean it's the most important factor either. The CUs can put the power to the ground more effectively by exploiting GCN's thread-level parallelism in compute-heavy games if they are not memory bandwidth bottlenecked. However, you can't assume linear scaling of compute performance with more memory bandwidth: the GTX590 has 327.7 GB/sec of memory bandwidth vs. 288 GB/sec for the HIS HD7970, and yet gets beaten by 28%.
But if you look at the 1180 MHz HD7970 with 288 GB/sec of memory bandwidth and compare it to the 800 MHz HD7950 with 240 GB/sec, with only 20% more memory bandwidth the HD7970 GE is putting down 45.4% higher FPS. Thus we know for sure memory bandwidth is not the bottleneck here, and therefore is not the only thing that matters for compute. What about SPs? The HIS 7970 GE has 68.7% more SP power than the HD7950 in that graph but only puts out 45.4% higher FPS. Therefore, SP (single-precision) GFLOPs also cannot be the answer. Sounds like Sniper Elite V2 is either pixel fillrate or texture fillrate limited as well.
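The 68.7% figure follows from the same formula: 2048 × 2 × 1.18 GHz ≈ 4.83 TFLOPs for the 1180 MHz HD7970 vs. 1792 × 2 × 0.80 GHz ≈ 2.87 TFLOPs for the 800 MHz HD7950, a ratio of about 1.69. So the actual FPS scaling (45.4%) lands well above the bandwidth delta (20%) but well below the SP delta (~69%), which is exactly why neither can be the whole story.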
The point is, it's more complex than just looking at single-precision GFLOPs or memory bandwidth. We would almost need to know what % of the game code in that scene uses Compute Shaders and where the bottlenecks could lie.
So my conclusion:
Kepler and GCN as architectures are equally good when it comes to compute-heavy games; there is no difference. What matters more, and what sets the two apart, is the actual amount of raw power their individual SKUs have.
Any thoughts?
First: when people talk about "compute in games", what does it actually mean?
A Compute Shader is a programmable shader stage that expands Microsoft Direct3D 11 beyond graphics programming. Like other programmable shaders (vertex and geometry shaders for example), a Compute Shader is designed and implemented with HLSL but that is just about where the similarity ends.
A compute shader provides high-speed general purpose computing and takes advantage of the large numbers of parallel processors on the graphics processing unit (GPU). The compute shader provides memory sharing and thread synchronization features to allow more effective parallel programming methods.
More here:
http://blogs.msdn.com/b/chuckw/archive/2010/07/14/directcompute.aspx
and here
http://msdn.microsoft.com/en-us/library/windows/desktop/ff476331(v=vs.85).aspx
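To make that concrete, here is a minimal sketch of what a Compute Shader looks like in HLSL. The buffer names, the kernel name, and the trivial doubling operation are my own illustration, not taken from the linked articles:

```hlsl
// Minimal cs_5_0 compute shader: scale every element of a buffer in parallel.
// No vertices, no pixels, no rasterizer -- just raw data in, raw data out.

StructuredBuffer<float>   gInput  : register(t0); // read-only input (SRV)
RWStructuredBuffer<float> gOutput : register(u0); // writable output (UAV)

// 64 threads per thread group (happens to match a GCN wavefront).
[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Each GPU thread independently processes one element.
    gOutput[dtid.x] = gInput[dtid.x] * 2.0f;
}
```

The application compiles this against the cs_5_0 profile and kicks it off with ID3D11DeviceContext::Dispatch(); thousands of these threads then run simultaneously across the shader cores, which is the "large numbers of parallel processors" part of the MSDN quote above.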
Second: talking about the Kepler architecture is different from talking about Kepler GK104 vs. Tahiti XT, because GK104 is not the "full" version of the Kepler architecture. When you are comparing GCN to Kepler here, the comparison has to be made in the context of GK104, since GK110 fixes at least one key issue of GK104 - the lack of a dynamic scheduler for compute work. GK110 vs. Tahiti XT is a different comparison.
Third: you are conflating compute with single-precision floating-point processing power. Those things are not always directly related in the context of DirectCompute / Compute Shaders in games. For example, the GTX680 has 3.09 TFLOPs of SP vs. 1.58 TFLOPs of SP in the GTX580. Obviously, looking at SP floating point and extrapolating it to "compute" in games is irrelevant when comparing the GTX580 to the 680. This is why one part of your conclusion is incorrect: "sets the two apart, is the actual amount of raw power their individual SKUs have." If that were true, the GTX680 would be nearly 2x faster than the GTX580 in games that use DirectCompute / Compute Shaders. It isn't. Therefore, since floating point alone is not enough to explain the differences, some other factors have to matter for a strong compute architecture, not just GFLOPs.
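The same SPs × 2 × clock formula shows where those figures come from: GTX680 = 1536 CUDA cores × 2 × 1.006 GHz ≈ 3.09 TFLOPs; GTX580 = 512 CUDA cores × 2 × 1.544 GHz (Fermi's hot-clocked shader domain) ≈ 1.58 TFLOPs. That is a 1.95x raw SP gap that plainly does not show up as 1.95x in DirectCompute-heavy games.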
If you want more details, please read the article below on how the GCN architecture works to understand what GCN has that makes it more effective for compute tasks (and no, this isn't SP floating-point operations only; compute tasks are also generally different from traditional graphical tasks).
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
What "compute in games" means is using Compute Shaders to perform workloads in parallel that would normally be run through traditional GPU pipeline methods, which are subject to stalls/wavefront scheduling sequences and thus inefficiencies. The DirectX 11 Compute Shader feature allows access to the shader cores/pipeline for Stream Computing (graphics acceleration) type applications and physics acceleration. DirectCompute essentially allows easier access to the GPU's many cores for parallel processing. To run DirectCompute tasks more efficiently, you have to figure out how to get those Stream Processors to put that power to the ground in a more effective way.
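As a quick sketch of what that access enables beyond a normal pixel shader, here is a hypothetical reduction kernel (names invented for illustration) where the 64 threads of a group cooperate through group shared memory and barriers instead of each thread redoing the work:

```hlsl
// Sketch: per-group parallel reduction using group shared memory.
// This cooperation between threads is what the traditional pixel
// pipeline can't express, and what the CUs are built to run well.

StructuredBuffer<float>   gInput       : register(t0);
RWStructuredBuffer<float> gPartialSums : register(u0);

groupshared float sdata[64]; // visible to all 64 threads in one group

[numthreads(64, 1, 1)]
void ReduceCS(uint3 dtid : SV_DispatchThreadID,
              uint  gi   : SV_GroupIndex,
              uint3 gid  : SV_GroupID)
{
    sdata[gi] = gInput[dtid.x];        // each thread loads one element
    GroupMemoryBarrierWithGroupSync(); // wait until the group is loaded

    // Tree reduction: half the threads add, then a quarter, and so on.
    for (uint s = 32; s > 0; s >>= 1)
    {
        if (gi < s)
            sdata[gi] += sdata[gi + s];
        GroupMemoryBarrierWithGroupSync();
    }

    if (gi == 0)
        gPartialSums[gid.x] = sdata[0]; // one sum per 64-element chunk
}
```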
Specific GCN Tahiti XT details:
GCN Tahiti XT is not simply a comparison of 2048 shaders vs. the 1536 shaders of the HD6970. Tahiti XT is actually 32 Compute Units made up of 64 shaders each (32 × 64 = 2048). This makes all the difference, because the Compute Unit is designed to perform both scalar and vector operations well, allowing it to handle both graphical and computing tasks and to exploit higher thread-level parallelism.
Here is why:
1. GCN's building blocks are Compute Units, not just shaders/TMUs/ROPs in a basic cluster. Traditionally, a single SIMD can execute vector operations well, but that's it. In GCN, however, the SIMDs are combined with a number of other functional units to make a complete CU capable of running the entire range of compute tasks well, not just traditional game code. (Tahiti XT +1)
2. Dynamic scheduler: the weakness of VLIW and GK104 is that the workload is statically scheduled ahead of time by the compiler. Like VLIW-4/5, GK104 has a static scheduler. As a result, if any dependencies crop up while code is being executed, there is no deviation from the schedule and efficiency drops. So the first change is immediate: in the GCN design, scheduling is moved from the compiler to the hardware. It is the CU that now schedules execution within its domain. The dynamic scheduler in GCN can cover up dependencies and other types of stalls, making it far more efficient for compute work. (Tahiti XT +1)
3. ACEs: the front-end of the GCN architecture contains 2 Asynchronous Compute Engines (ACEs) responsible for feeding the CUs, plus the geometry engines responsible for geometry setup. AMD's new Asynchronous Compute Engines serve as the command processors for compute operations on GCN. The principal purpose of the ACEs is to accept work and dispatch it off to the CUs for processing. As GCN is designed to concurrently work on several tasks, the ACEs decide on resource allocation, context switching, and task priority. ACEs can prioritize and reprioritize tasks, allowing tasks to be completed in a different order than they're received. This allows GCN to free up the resources those tasks were using as early as possible, rather than having a task consume resources for an extended period of time in a nearly-finished state. The end result is the ability to perform more parallel compute-based operations. (Tahiti XT +1)
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/5
---------
TL;DR:
1) DX11's Compute Shaders, also known as DirectCompute, are a feature which allows access to the shader cores/pipeline for Stream Computing (graphics acceleration) type applications and physics acceleration.
2) GCN Tahiti XT is designed around Compute Units (with a dynamic scheduler and ACEs) that more efficiently tap the power of the Stream Processors for compute tasks, because the combination of 2 command processors and a dynamic scheduler allows the SPs inside the Compute Units to perform more parallel compute-based operations.
3) Leveraging DirectCompute basically means exploiting the thread-level parallelism of many Stream Processors in a more efficient way than traditional VLIW/SIMD architectures do. Tahiti XT simply does this better than GK104 by virtue of its architecture. Kepler 2.0 is just Fermi rebalanced, while GCN is a brand new AMD architecture designed from the ground up for DirectCompute. That means that on the technology curve, Kepler dates back to Fermi 1.0 in 2010 (GTX480), while GCN launched in Dec 2011, making GCN a nearly two-year-newer architecture.
Just looking at pure compute benchmarks, GTX680 falls apart:
http://www.computerbase.de/artikel/grafikkarten/2012/test-grafikkarten-2012/8/
And if you compare the HD6970 to the HD7970 on SP floating point and memory bandwidth alone, some of those benchmark results would not make sense unless GCN's compute advantage were far better than VLIW-4/5's or GK104's. It is better, which is why it's smoking them in pure compute benchmarks, just like Fermi obliterated the HD5870/6970 in more tessellation-limited scenarios.