Now, I've always thought that this touted special compute ability was rather irrelevant and that it was instead all about raw power (SP GFLOPs) and bandwidth. I recently had a look at some "Tahiti LE" reviews, which are quite interesting because Tahiti LE cards have about the same SP GFLOPs as a 670/680 hybrid and the same memory bandwidth.
That can't be the whole answer. Real-world evidence shows other factors are at play too.
HD7970 GE = 2048 SPs @ 1050 MHz = 4.3 TFLOPs
HD7950 V2 = 1792 SPs @ 950 MHz = 3.4 TFLOPs
SP GFLOPs advantage of 26%
HD7970 GE = 288 GB/sec memory bandwidth
HD7950 V2 = 240 GB/sec memory bandwidth
Memory bandwidth advantage of 20%
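For reference, those TFLOPs figures fall out of a simple formula: SPs × 2 FLOPs per clock (one fused multiply-add) × clock speed.

2048 × 2 × 1.050 GHz ≈ 4301 GFLOPs ≈ 4.3 TFLOPs (HD7970 GE)
1792 × 2 × 0.950 GHz ≈ 3405 GFLOPs ≈ 3.4 TFLOPs (HD7950 V2)

4301 / 3405 ≈ 1.26, i.e. the 26% SP advantage, and 288 / 240 = 1.20 for the 20% bandwidth advantage.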
In Hitman Absolution at 1920x1080 4AA, the HD7970 GE beats the HD7950 V2 by 29% on average and 33% in minimums.
"We ran each game test or benchmark twice and took the best result for the diagrams, but only if the difference between them didn’t exceed 1%. If it did exceed 1%, we ran the tests at least one more time to achieve repeatability of results."
Also, memory bandwidth helps feed the compute units, but that doesn't mean it's the most important factor either. The CUs can put the power to the ground more effectively by exploiting GCN's thread-level parallelism in compute-heavy games if they are not memory bandwidth bottlenecked. However, you can't assume linear scaling of compute performance with more memory bandwidth: the GTX590 has 327.7 GB/sec of memory bandwidth vs. 288 GB/sec for the HIS HD7970, and yet gets beaten by 28%.
But if you look at the 1180 MHz HD7970 with 288 GB/sec of memory bandwidth and compare it to the 800 MHz HD7950 with 240 GB/sec, with only 20% more memory bandwidth the HD7970 GE is putting down 45.4% higher FPS. Thus we know for sure memory bandwidth is not the bottleneck here, and therefore is not the only thing that matters for compute. What about SPs? The HIS 7970 GE has 68.7% more SP power than the HD7950 in that graph but only puts out 45.4% higher FPS. Therefore, SP (single-precision) GFLOPs also cannot be the answer. Sounds like Sniper Elite V2 is either pixel fillrate or texture fillrate limited as well.
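The 68.7% figure follows from the same formula: 2048 × 2 × 1.18 GHz ≈ 4.83 TFLOPs for the 1180 MHz HD7970 vs. 1792 × 2 × 0.80 GHz ≈ 2.87 TFLOPs for the 800 MHz HD7950, a ratio of about 1.69. So the actual FPS scaling (45.4%) lands well above the bandwidth delta (20%) but well below the SP delta (~69%), which is exactly why neither can be the whole story.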
The point is, it's more complex than just looking at single-precision GFLOPs or memory bandwidth. We would almost need to know what % of the game code in that scene uses Compute Shaders and where the bottlenecks could lie.
So my conclusion:
Kepler and GCN as architectures are equally good when it comes to compute-heavy games; there is no difference. What matters more, and what sets the two apart, is the actual amount of raw power their individual SKUs have.
Any thoughts?
First: when people talk about "compute in games", what does it actually mean?
A Compute Shader is a programmable shader stage that expands Microsoft Direct3D 11 beyond graphics programming. Like other programmable shaders (vertex and geometry shaders for example), a Compute Shader is designed and implemented with HLSL but that is just about where the similarity ends.
A compute shader provides high-speed general purpose computing and takes advantage of the large numbers of parallel processors on the graphics processing unit (GPU). The compute shader provides memory sharing and thread synchronization features to allow more effective parallel programming methods.
More here:
http://blogs.msdn.com/b/chuckw/archive/2010/07/14/directcompute.aspx
and here
http://msdn.microsoft.com/en-us/library/windows/desktop/ff476331(v=vs.85).aspx
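To make that concrete, here is a minimal sketch of what a Compute Shader looks like in HLSL. The buffer names, the kernel name, and the trivial doubling operation are my own illustration, not taken from the linked articles:

```hlsl
// Minimal cs_5_0 compute shader: scale every element of a buffer in parallel.
// No vertices, no pixels, no rasterizer -- just raw data in, raw data out.

StructuredBuffer<float>   gInput  : register(t0); // read-only input (SRV)
RWStructuredBuffer<float> gOutput : register(u0); // writable output (UAV)

// 64 threads per thread group (happens to match a GCN wavefront).
[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Each GPU thread independently processes one element.
    gOutput[dtid.x] = gInput[dtid.x] * 2.0f;
}
```

The application compiles this against the cs_5_0 profile and kicks it off with ID3D11DeviceContext::Dispatch(); thousands of these threads then run simultaneously across the shader cores, which is the "large numbers of parallel processors" part of the MSDN quote above.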
Second: talking about the Kepler architecture is different from talking about Kepler GK104 vs. Tahiti XT, because GK104 is not the "full" version of the Kepler architecture. When you are comparing GCN to Kepler here, the comparison has to be made in the context of GK104, since GK110 fixes at least one key issue of GK104 - the lack of a dynamic scheduler for compute work. GK110 vs. Tahiti XT is a different comparison.
Third: you are conflating compute with single-precision floating-point processing power. Those things are not always directly related in the context of DirectCompute / Compute Shaders in games. For example, the GTX680 has 3.09 TFLOPs of SP vs. 1.58 TFLOPs of SP in the GTX580. Obviously, looking at SP floating point and extrapolating it to "compute" in games is irrelevant when comparing the GTX580 to the 680. This is why one part of your conclusion is incorrect: "sets the two apart, is the actual amount of raw power their individual SKUs have." If that were true, the GTX680 would be nearly 2x faster than the GTX580 in games that use DirectCompute / Compute Shaders. It isn't. Therefore, since floating point alone is not enough to explain the differences, some other factors have to matter for a strong compute architecture, not just GFLOPs.
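The same SPs × 2 × clock formula shows where those figures come from: GTX680 = 1536 CUDA cores × 2 × 1.006 GHz ≈ 3.09 TFLOPs; GTX580 = 512 CUDA cores × 2 × 1.544 GHz (Fermi's hot-clocked shader domain) ≈ 1.58 TFLOPs. That is a 1.95x raw SP gap that plainly does not show up as 1.95x in DirectCompute-heavy games.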
If you want more details, please read the article below on how the GCN architecture works to understand what GCN has that makes it more effective for compute tasks (and no, this isn't SP floating-point operations only; compute tasks are also generally different from traditional graphical tasks).
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
What "compute in games" means is using Compute Shaders to perform workloads in parallel that would normally be run through traditional GPU pipeline methods, which are subject to stalls/wavefront scheduling sequences and thus inefficiencies. The DirectX 11 Compute Shader feature allows access to the shader cores/pipeline for Stream Computing (graphics acceleration) type applications and physics acceleration. DirectCompute essentially allows easier access to the GPU's many cores for parallel processing. To run DirectCompute tasks more efficiently, you have to figure out how to get those Stream Processors to put that power to the ground in a more effective way.
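As a quick sketch of what that access enables beyond a normal pixel shader, here is a hypothetical reduction kernel (names invented for illustration) where the 64 threads of a group cooperate through group shared memory and barriers instead of each thread redoing the work:

```hlsl
// Sketch: per-group parallel reduction using group shared memory.
// This cooperation between threads is what the traditional pixel
// pipeline can't express, and what the CUs are built to run well.

StructuredBuffer<float>   gInput       : register(t0);
RWStructuredBuffer<float> gPartialSums : register(u0);

groupshared float sdata[64]; // visible to all 64 threads in one group

[numthreads(64, 1, 1)]
void ReduceCS(uint3 dtid : SV_DispatchThreadID,
              uint  gi   : SV_GroupIndex,
              uint3 gid  : SV_GroupID)
{
    sdata[gi] = gInput[dtid.x];        // each thread loads one element
    GroupMemoryBarrierWithGroupSync(); // wait until the group is loaded

    // Tree reduction: half the threads add, then a quarter, and so on.
    for (uint s = 32; s > 0; s >>= 1)
    {
        if (gi < s)
            sdata[gi] += sdata[gi + s];
        GroupMemoryBarrierWithGroupSync();
    }

    if (gi == 0)
        gPartialSums[gid.x] = sdata[0]; // one sum per 64-element chunk
}
```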
Specific GCN Tahiti XT details:
GCN Tahiti XT is not simply a comparison of 2048 shaders vs. the 1536 shaders of the HD6970. Tahiti XT is actually 32 Compute Units made up of 64 shaders each (32 × 64 = 2048). This makes all the difference, because the Compute Unit is designed to perform both scalar and vector operations well, allowing it to handle both graphical and computing tasks and to exploit higher thread-level parallelism.
Here is why:
1. GCN's building blocks are Compute Units, not just shaders/TMUs/ROPs in a basic cluster. Traditionally, a single SIMD can execute vector operations well, but that's it. In GCN, however, the SIMDs are combined with a number of other functional units to make a complete CU capable of running the entire range of compute tasks well, not just traditional game code. (Tahiti XT +1)
2. Dynamic scheduler: the weakness of VLIW and GK104 is that the workload is statically scheduled ahead of time by the compiler. Like VLIW-4/5, GK104 has a static scheduler. As a result, if any dependencies crop up while code is being executed, there is no deviation from the schedule and efficiency drops. So the first change is immediate: in the GCN design, scheduling is moved from the compiler to the hardware. It is the CU that now schedules execution within its domain. The dynamic scheduler in GCN can cover up dependencies and other types of stalls, making it far more efficient for compute work. (Tahiti XT +1)
3. ACEs: the front-end of the GCN architecture contains 2 Asynchronous Compute Engines (ACEs) responsible for feeding the CUs, plus the geometry engines responsible for geometry setup. AMD's new Asynchronous Compute Engines serve as the command processors for compute operations on GCN. The principal purpose of the ACEs is to accept work and dispatch it off to the CUs for processing. As GCN is designed to concurrently work on several tasks, the ACEs decide on resource allocation, context switching, and task priority. ACEs can prioritize and reprioritize tasks, allowing tasks to be completed in a different order than they're received. This allows GCN to free up the resources those tasks were using as early as possible, rather than having a task consume resources for an extended period of time in a nearly-finished state. The end result is the ability to perform more parallel compute-based operations. (Tahiti XT +1)
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/5
---------
TL;DR:
1) DX11's Compute Shaders, also known as DirectCompute, are a feature which allows access to the shader cores/pipeline for Stream Computing (graphics acceleration) type applications and physics acceleration.
2) GCN Tahiti XT is designed around Compute Units (with a dynamic scheduler and ACEs) that more efficiently tap the power of the Stream Processors for compute tasks, because the combination of 2 command processors and a dynamic scheduler allows the SPs inside the Compute Units to perform more parallel compute-based operations.
3) Leveraging DirectCompute basically means exploiting the thread-level parallelism of many Stream Processors in a more efficient way than traditional VLIW/SIMD architectures do. Tahiti XT simply does this better than GK104 by virtue of its architecture. Kepler 2.0 is just Fermi rebalanced, while GCN is a brand new AMD architecture designed from the ground up for DirectCompute. That means that on the technology curve, Kepler dates back to Fermi 1.0 in 2010 (GTX480), while GCN launched in Dec 2011, making GCN a nearly two-year-newer architecture.
Just looking at pure compute benchmarks, GTX680 falls apart:
http://www.computerbase.de/artikel/grafikkarten/2012/test-grafikkarten-2012/8/
And if you compare the HD6970 to the HD7970 on SP floating point and memory bandwidth alone, some of those benchmark results would not make sense unless GCN's compute advantage were far better than VLIW-4/5's or GK104's. It is better, which is why it's smoking them in pure compute benchmarks, just like Fermi obliterated the HD5870/6970 in more tessellation-limited scenarios.