GCN is always GPU limited...
Let's take the FuryX vs the Titan X as an example: the FuryX has 4,096 shading units and the Titan X has 3,072, but the Titan X has 96 GPixel/s of fillrate versus the FuryX's 67.2 GPixel/s.
Let's make it simpler: Nvidia has fewer but faster "cores", AMD has more but slower "cores".
No matter what you do you can't make the cores go faster (except by overclocking, but that's irrelevant to this conversation),
but you can overtax the GPU, by running at 4K for example. You don't need speed anymore since even with the X's you only run at ~30 FPS. It's the same thing as with GameWorks: you can't gain speed by using insane levels of tessellation, but you sure can make the other team's cards look even slower than your own.
And it's the same with whatever the Ashes benchmark is doing: it's not making the game faster, it just renders more stuff.
GCN is not always GPU limited.
In theory, GCN cards have a lower fillrate than GM200 cards, but games don't bear this out. You can see this as the resolution goes up in DX11 games: there are more pixels on screen, yet GCN cards catch up to their respective competitors.
Take the Tech Report's synthetic tests here for example:
http://techreport.com/review/28513/amd-radeon-r9-fury-x-graphics-card-reviewed/4
The FuryX achieves 64 GPixels/s out of a theoretical 67 GPixels/s. A GTX 980 Ti achieves 85 GPixels/s out of a theoretical 95 GPixels/s and the Titan X achieves 95 GPixels/s out of a theoretical 103 GPixels/s.
Theoretically, GCN is at a disadvantage, which should translate into GM200 cards pulling away from GCN cards as the resolution of a game rises. But we don't see this happening at all; we actually see the opposite (meaning that pixel throughput is not the bottleneck for GCN vs GM200).
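If you want to sanity check those synthetic numbers, here's a quick back-of-the-envelope in Python. It uses only the figures quoted above, nothing more:

```python
# Back-of-the-envelope check of the Tech Report synthetic fillrate numbers quoted above:
# achieved pixel fillrate as a fraction of the theoretical peak.
cards = {
    "FuryX":      (64, 67),
    "GTX 980 Ti": (85, 95),
    "Titan X":    (95, 103),
}

for name, (achieved, theoretical) in cards.items():
    print(f"{name}: {achieved}/{theoretical} GPixel/s = {achieved / theoretical:.0%} of peak")
```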
As for tessellation, it also makes use of the pixel pipelines. Each tessellation unit (geometry processor) has dedicated hardware for vertex fetch, tessellation, and coordinate transformations. They operate together with the raster engines, which turn newly tessellated triangles into a fine stream of pixels for shading. In other words, as the resolution rises you end up shading more pixels across all of those extra triangles.
If tessellation was the main bottleneck, GM200 would pull away from GCN as the resolution rises. But we don't see that happening in a majority of DX11 titles.
Take Rise of the Tomb Raider for example:
At 1600x900, look at how the GCN cards perform relative to the GM200 cards. First the R9 390x at 79.4 FPS vs a GTX 980 at 85.3 FPS: a ~6 FPS lead for the GTX 980. Now let's compare a FuryX at 87.6 FPS vs a GTX 980 Ti at 105.5 FPS: a near 18 FPS lead for the GTX 980 Ti.
Now let's move to 2560x1440:
At this resolution the R9 390x has a 1.7 FPS lead over the GTX 980, and that 18 FPS lead the GTX 980 Ti had over the FuryX? Down to 1.1 FPS.
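Putting those numbers side by side (the 1440p leads are as stated above; positive means the NVIDIA card is ahead, negative means the GCN card is ahead):

```python
# Rise of the Tomb Raider leads from the figures above, expressed as the NVIDIA card's
# advantage in FPS: positive = NVIDIA ahead, negative = the GCN card ahead.
leads_fps = {
    "GTX 980 vs R9 390x":  {"1600x900": 85.3 - 79.4, "2560x1440": -1.7},
    "GTX 980 Ti vs FuryX": {"1600x900": 105.5 - 87.6, "2560x1440": 1.1},
}

for matchup, leads in leads_fps.items():
    print(matchup, {res: round(lead, 1) for res, lead in leads.items()})
```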
So what happened? Well, as the resolution rises we become more GPU bound, whereas at lower resolutions we're more CPU bound.
If GCN were GPU bound at those lower resolutions, then as the resolution scaled higher, GM200 would continue to maintain a large lead over GCN.
Some people claim that memory bandwidth plays a role except that...
it doesn't. In theory, yes, but not in practice, thanks to the large L2 cache found on GM200 coupled with its superior color compression.
As for GCN having more, yet slower, cores: that's not true (speaking clock for clock and core for core). It has to do with the types of shaders (short or long) fed to both GCN and GM200. GCN likes long-running shaders while GM200 likes short-running shaders. Another determining factor is utilization. GCN is a wider architecture and requires a higher degree of parallelism in order to make use of its full compute capabilities. What do I mean by a wider architecture?
On GCN:
Each CU is composed of 64 SIMD cores and can have 40 wavefronts in flight concurrently, with each wavefront composed of 64 threads. So that's 2,560 threads per CU of 64 SIMD cores. For Fiji, with its 64 CUs, that's a total of 163,840 threads in flight.
Source:
http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4
On GM200:
Each SMM is composed of 128 SIMD cores and can have 64 warps in flight concurrently, with each warp composed of 32 threads. So that's 2,048 threads per SMM of 128 SIMD cores. For a GTX 980 Ti, with its 22 SMMs, that's a total of 45,056 threads in flight.
Source:
https://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/
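Working through that arithmetic side by side (same figures as above, nothing new):

```python
# Maximum threads in flight, using the per-CU / per-SMM figures quoted above.
def threads_in_flight(units, waves_per_unit, threads_per_wave):
    return units * waves_per_unit * threads_per_wave

fiji     = threads_in_flight(units=64, waves_per_unit=40, threads_per_wave=64)  # GCN CUs on Fiji
gtx980ti = threads_in_flight(units=22, waves_per_unit=64, threads_per_wave=32)  # Maxwell SMMs on a 980 Ti

print(f"Fiji (FuryX): {fiji:,} threads in flight")      # 163,840
print(f"GTX 980 Ti:   {gtx980ti:,} threads in flight")  # 45,056
```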
Per scheduling unit, GCN therefore has fewer SIMD cores dedicated towards doing more work, while GM200 has more SIMD cores dedicated towards doing less work.
So if you're feeding your GPU small amounts of compute work, GM200 will come out on top. If you're feeding your GPU large amounts of compute work, GCN will come out on top.
This is why GCN stands to benefit more from asynchronous compute + graphics than GM200 does: GCN has more thread slots idling than GM200. So long as you feed both architectures optimized code, they perform as expected.
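As a toy illustration of the utilization argument, here's the in-flight thread counts from above divided into a hypothetical small batch and a hypothetical large batch of compute threads. The workload sizes are made up and real schedulers are far more complicated, so treat it purely as an illustration:

```python
# Toy model of the utilization argument: occupancy = threads submitted / thread slots
# available, capped at 100%. The workload sizes are made up; this only illustrates
# how much headroom each architecture has left for async compute to fill.
MAX_THREADS = {"GCN (Fiji)": 163_840, "Maxwell (GTX 980 Ti)": 45_056}

def occupancy(threads_submitted, max_threads):
    return min(threads_submitted / max_threads, 1.0)

for workload in (32_000, 160_000):  # a hypothetical small batch and a large batch
    for arch, slots in MAX_THREADS.items():
        print(f"{workload:>7,} threads on {arch:<21}: {occupancy(workload, slots):.0%} of slots filled")
    print()
```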
If anything, GCN has superior compute cores to GM200.
What is ALU latency? The number of cycles the ALU takes to process a MADD operation; in other words, a measure of an SIMD unit's performance. GCN is around 4 cycles while GM200 is just over 5. This behavior surprised folks over at Beyond3D when the async compute controversy was in full force, see below...
Ext3h:
GCN has only 4 cycles latency for the simple SP instructions. Which I already accounted for, at least for "pure" SP loads, not to mention that this is also the minimum latency for all instructions on GCN and GCN also features a much larger register file.
The 6 cycle SP latency for Maxwell however is ... weird. I actually thought that Maxwell had a LOWER latency for primitive SP instructions than GCN, but the opposite appears to be the case???
Source:
https://forum.beyond3d.com/posts/1871515/
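For perspective, here's the trivial arithmetic for a chain of dependent MADDs, using the 4-cycle GCN figure and the 6-cycle Maxwell figure from the quote above. It ignores dual issue and any latency hiding from other wavefronts/warps, so take it only as a relative per-SIMD picture:

```python
# Cycles to retire a chain of dependent single-precision MADDs on one SIMD lane,
# using the ALU latencies discussed above. Dual issue and latency hiding from other
# wavefronts/warps are ignored; this only shows the relative per-SIMD picture.
latency_cycles = {"GCN": 4, "Maxwell (per the Beyond3D thread)": 6}
dependent_madds = 100

for arch, lat in latency_cycles.items():
    print(f"{arch}: {dependent_madds * lat} cycles for {dependent_madds} dependent MADDs")
```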
NVIDIA made some large advances with GM200 over Kepler, but they're still behind on per-SIMD performance. It's just that GM200 dedicates more of its weaker SIMDs towards doing smaller compute work items than GCN does (hence the NVIDIA recommendation of large batches of short-running shaders). Ext3h makes the same recommendation here:
http://ext3h.makegames.de/DX12_Compute.html
For a safe bet, go with the batched approach recommended for Nvidia hardware:
Choose sufficiently large batches of short running shaders. Long running shaders can complicate scheduling on Nvidia's hardware. Ensure that the GPU can remain fully utilized until the end of each batch. Tune this for Nvidia's hardware, AMD will adapt just fine.
It's also recommended by NVIDIA here:
https://developer.nvidia.com/dx12-dos-and-donts
The problem is that game titles are mostly sponsored by NVIDIA and the code they run tends to be optimized for NVIDIA. These games don't tend to be compute heavy at all, so when it comes to smaller compute work items, GM200 can come out on top.
In the professional realm, CUDA optimizations and documentation far surpass those of OpenCL.
The software makes the difference in this case.