While it's true that there's potential for Nvidia GPUs to idle in graphics/fixed-function heavy tasks like shadow map generation, rasterizing a G-buffer or a depth prepass, I don't believe that's the case the majority of the time in modern games ...
There are many ways to leverage parallelism without having async compute, like games buffering up several frames and pipelining them to keep the GPU fed ...
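To make the pipelining point concrete, here's a minimal sketch of the kind of frame buffering that argument refers to, assuming a D3D12-style renderer. The function names, the three-frames-in-flight depth and the structure are my own illustration, not anything from the posts above; the only point is that the CPU keeps submitting new frames while the GPU chews through old ones.

```cpp
// Minimal sketch of frame pipelining with D3D12 fences (illustrative only).
// The CPU records and submits frame N+1 while the GPU is still busy with
// frame N, so the GPU rarely sits idle waiting for new work.
#include <windows.h>
#include <d3d12.h>

static const UINT kFramesInFlight = 3;            // how many frames the CPU may run ahead
UINT64 frameFenceValues[kFramesInFlight] = {};    // last fence value signalled per slot

void RenderLoop(ID3D12CommandQueue* queue, ID3D12Fence* fence, HANDLE fenceEvent)
{
    UINT64 nextFenceValue = 1;
    for (UINT64 frame = 0; ; ++frame)
    {
        UINT slot = static_cast<UINT>(frame % kFramesInFlight);

        // Block only if the GPU has fallen more than kFramesInFlight frames behind.
        if (fence->GetCompletedValue() < frameFenceValues[slot])
        {
            fence->SetEventOnCompletion(frameFenceValues[slot], fenceEvent);
            WaitForSingleObject(fenceEvent, INFINITE);
        }

        // Record and submit this frame's command lists (omitted), then signal.
        // queue->ExecuteCommandLists(...);
        queue->Signal(fence, nextFenceValue);
        frameFenceValues[slot] = nextFenceValue++;
    }
}
```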
Nvidia also over-engineers their geometry processors and rasterizers to keep idling to a minimum, and it looks like that approach is working for them if you take a look at the market share ...
Memory bandwidth. Cache occupancy/spillage. NVIDIA's Maxwell GPU is beholden to its memory/cache bandwidth. The ROPs on Maxwell are tied to the memory controllers: each group of 16 ROPs is linked to a dedicated 64-bit memory controller. You can even see the bottleneck arise in synthetics. Even though the following synthetic test devotes all of the memory bandwidth to ROP operations, each Maxwell GPU is unable to hit its theoretical peak.
Kepler hits its theoretical peak and GCN is within striking distance of its theoretical peaks. Maxwell, on the other hand, is consistently 10 GPixel/s behind its theoretical peak across GM204 and GM200.
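To put rough numbers on why the ROPs can outrun the memory: a back-of-the-envelope comparison, assuming reference GTX 980 (GM204) figures of 64 ROPs, a ~1216 MHz boost clock and 224 GB/s of GDDR5 bandwidth. These figures are my assumptions for illustration, not measurements from the synthetic above, and the raw-bandwidth bound is pessimistic in practice because Maxwell's color compression stretches that bandwidth further.

```cpp
// Back-of-the-envelope fill-rate estimate (assumed reference GTX 980 specs,
// purely illustrative, not the benchmark numbers discussed above).
#include <algorithm>
#include <cstdio>

int main()
{
    const double rops        = 64.0;     // GM204 ROP count
    const double clockGHz    = 1.216;    // assumed boost clock in GHz
    const double bandwidthGB = 224.0;    // GB/s of raw GDDR5 bandwidth
    const double bytesPerPx  = 4.0;      // 32-bit color write, no compression

    double ropPeak = rops * clockGHz;              // GPixel/s the ROPs could emit (~77.8)
    double bwPeak  = bandwidthGB / bytesPerPx;     // GPixel/s the memory can absorb (~56)

    std::printf("ROP-limited peak:       %.1f GPixel/s\n", ropPeak);
    std::printf("Bandwidth-limited peak: %.1f GPixel/s\n", bwPeak);
    std::printf("Achievable (min of the two): %.1f GPixel/s\n", std::min(ropPeak, bwPeak));
    return 0;
}
```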
The end result is that Maxwell, despite having a theoretically more robust front end than Fiji, is unable to surpass Fiji when gaming at 4K without operating well beyond its designed frequencies. Raising the GPU frequency on Maxwell also raises the cache bandwidth, which helps Maxwell a great deal.
Maxwell's SMMs (Maxwell Streaming Multiprocessors) share their L1 and instruction caches with the texture mapping units and PolyMorph engines. Maxwell thus runs out of immediate caches somewhere between 16 and 32 concurrent warps per SMM and begins to spill into the L2 cache. We can see that here:
Maxwell makes up for these deficiencies by ensuring better compute utilization. This is done by having 4 warp schedulers per SMM, with each warp scheduler having dominion over its own group of 32 CUDA cores. Those 32 CUDA cores map perfectly onto a warp (32 threads), which allows Maxwell to make efficient use of its available compute resources.
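The arithmetic behind that claim, as I understand it (the Kepler comparison in the comments is my own framing, not from the post above):

```cpp
// Quick sanity check of the SMM partitioning described above (illustrative).
#include <cstdio>

int main()
{
    const int warpSize          = 32;   // threads per warp
    const int schedulersPerSMM  = 4;    // warp schedulers per Maxwell SMM
    const int coresPerScheduler = 32;   // CUDA cores owned by each scheduler

    int coresPerSMM = schedulersPerSMM * coresPerScheduler;   // 128
    std::printf("CUDA cores per SMM: %d\n", coresPerSMM);
    std::printf("Cores per scheduler vs warp size: %d vs %d -> 1:1 mapping\n",
                coresPerScheduler, warpSize);
    // A 1:1 mapping means each scheduler can keep its own 32 cores busy by
    // issuing one warp instruction per clock, with no cores shared between
    // schedulers (unlike Kepler's 192-core SMX, where the schedulers had to
    // share the "extra" cores to fill them).
    return 0;
}
```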
It's common knowledge among developers that Nvidia GPUs have better front ends and geometry processing. If we take a look at synthetic benchmarks that measure triangle throughput, Nvidia comes out far ahead of AMD ...
Async compute doesn't buffer those frames faster. Async compute is meant to make it possible to run a separate compute queue in parallel with the graphics queue ...
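For anyone following along, here's roughly what "a separate compute queue" looks like in D3D12. This is a hedged sketch, not anyone's engine code; the function name and the comments about concurrency are mine.

```cpp
// Minimal sketch: a dedicated compute queue alongside the graphics (direct)
// queue in D3D12. Work submitted to computeQueue may overlap with work on the
// graphics queue on hardware/drivers that support concurrent execution.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;      // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;    // compute + copy only
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // Whether the two queues actually execute concurrently, rather than being
    // serialized by the driver, is up to the hardware, which is exactly the
    // Maxwell-vs-GCN argument in this thread.
}
```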
Oh, but I do need to bring up the market share, because you seem to have a ton of misunderstanding or a severe lack of perspective. It's certainly not the case that GM200 has inferior performance compared to Fiji when we look at the majority of the games out today. You need to understand that AMD places too much faith in things going their way ...
Maxwell does have superior triangle output over Fiji, but this is mostly a non-issue. Neither is limited by its triangle output.
Geometry-wise, GCN suffers from a lack of primitive discard acceleration. That means GCN doesn't cull triangles smaller than a pixel. This causes issues when using tessellation factors beyond 16x.
ROP-wise, GCN does have the more efficient ROPs due to the dedicated color cache afforded to each group of 4 ROPs. Maxwell attempts to make up for its less robust ROP engineering by doubling Kepler's ratio of ROPs to memory controllers, going from 8 to 16 ROPs per 64-bit controller. This helps Maxwell achieve parity with Fiji at higher resolutions: 96 Maxwell ROPs end up roughly equivalent to 64 Fiji ROPs.
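The ROP counts fall straight out of that ratio. A quick illustration, assuming the usual 384-bit buses on GK110 and GM200 (six 64-bit controllers each); the comparison is mine, not a quote from the post:

```cpp
// ROP-to-memory-controller arithmetic from the paragraph above (illustrative).
#include <cstdio>

int main()
{
    const int controllers_GK110 = 6;     // 384-bit bus = six 64-bit controllers
    const int controllers_GM200 = 6;     // same bus width on GM200
    const int ropsPerMC_Kepler  = 8;     // Kepler: 8 ROPs per controller
    const int ropsPerMC_Maxwell = 16;    // Maxwell: 16 ROPs per controller

    std::printf("GK110 ROPs: %d\n", controllers_GK110 * ropsPerMC_Kepler);    // 48
    std::printf("GM200 ROPs: %d\n", controllers_GM200 * ropsPerMC_Maxwell);   // 96
    // Fiji gets by with 64 ROPs, each group of 4 backed by its own color
    // cache, which is the efficiency argument being made above.
    return 0;
}
```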
Maxwell also throws in delta color compression, which it needs because of its less efficient memory controllers:
Even in the midrange segment Nvidia is still competitive; it's only recently that AMD's equivalent caught up ...
If it's doing graphics, it's not doing compute?! It's clear at this point that you don't have any idea what you're talking about anymore. I have no idea why you or others go into details when you don't know what you're talking about. Question: are you even a developer?! If not, then any posts about GPU microarchitectures from you from this point onwards are NOT credible!
Async compute is just better utilization of the shaders. AMD is literally advertising the idea of FREE compute shaders!
How do you know that Nvidia would benefit? Have you done any logic simulation on the RTL design, or do you have any other evidence that suggests so? Or do you speak out of pure ignorance?! Well, which one is it?
AMD were held back by higher API overhead. Hardware-wise, GCN has been consistently architecturally superior to every NVIDIA architecture pitted against it, except maybe GM200.
This served as the basis for AMD's Mantle project and now Vulkan/DX12. AMD have made quite a bit of headway with their DX11 drivers, and we're starting to see AMD being quite competitive under DX11 scenarios. Even AMD's AotS DX11 driver has improved substantially, so much so that AMD's GCN DX11 performance is now greater than NVIDIA's in this title.
As for asynchronous compute + graphics, it's more like Hyper-Threading than it is about better utilization. Rather than executing a single thread of work sequentially, GCN can execute two in parallel, so to speak. This helps cut down frame time (frame latency), which translates into an FPS boost. You can execute more work per frame if need be, or execute the same work in less time.
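A toy model of why overlapping cuts frame time, with made-up millisecond figures that are purely mine (real games sit somewhere between the two extremes, since compute still competes for ALUs, caches and bandwidth):

```cpp
// Toy frame-time model of overlapping a compute pass with graphics work
// (hypothetical millisecond figures, just to illustrate the point above).
#include <algorithm>
#include <cstdio>

int main()
{
    const double graphicsMs = 12.0;   // assumed time the graphics queue needs per frame
    const double computeMs  = 4.0;    // assumed time the async compute work needs

    double serial     = graphicsMs + computeMs;            // one after the other
    double overlapped = std::max(graphicsMs, computeMs);   // ideal full overlap

    std::printf("Serial frame time:     %.1f ms (%.0f FPS)\n", serial, 1000.0 / serial);
    std::printf("Overlapped frame time: %.1f ms (%.0f FPS)\n", overlapped, 1000.0 / overlapped);
    return 0;
}
```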
As for Maxwell, if an SMM is doing texture work, it isn't doing compute work. The two cannot be done in parallel within an SMM because they share caches. That's what the whole preemption/context switch thing is all about.
Maxwell needs to flush the caches in an SMM before it can switch contexts. This adds a delay, and that delay can be significant if Maxwell is caught behind a long-running draw call. The result is coarse-grained preemption.
Yes you are a liar but hopefully you won't end up like John Fruehe of AMD ...
BTW, better utilization of fixed-function units like ROPs or the rasterizer comes from designing your game around a specific GPU bottleneck, so async compute won't do a damn thing if your rendering workload is already at a perfect balance between graphics and compute, or if the game is heavily geared towards compute ...
Rendering workloads are never at a perfect balance between graphics and compute. The Nitrous engine powering AotS has a 20:80 compute-to-graphics ratio, and that engine is considered compute heavy.
We're a LONG way from having a perfect balance between the two. Also, async compute + graphics would still help in such a scenario: you're still executing two workloads at once and thus reducing frame time.
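An Amdahl-style bound on that 20:80 split, purely as an illustration of the best case (the assumption of perfect overlap is mine; nobody gets the full amount in practice):

```cpp
// Upper bound on the gain from overlap given the 20:80 compute:graphics split
// mentioned above (idealized, assumes compute hides entirely behind graphics).
#include <cstdio>

int main()
{
    const double computeShare  = 0.20;   // fraction of serial frame time spent in compute
    const double graphicsShare = 0.80;   // fraction spent in graphics

    // If all compute hides behind graphics, the frame shrinks to the graphics share alone.
    double idealSpeedup = (computeShare + graphicsShare) / graphicsShare;
    std::printf("Best-case speedup from full overlap: %.2fx (~%.0f%% more FPS)\n",
                idealSpeedup, (idealSpeedup - 1.0) * 100.0);
    return 0;
}
```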
Asynchronous compute + graphics is not what you think it is. Sure, you need the available compute resources to process these extra compute jobs, and sure, GCN is heavily threaded on the compute side and has more dedicated caches throughout the architecture to handle extra compute jobs without spilling into the L2 cache.
I think we may be seeing Maxwell's compute limitations under AotS. It may very well be that the sheer number of parallel compute jobs is pushing Maxwell's SMMs into spilling into the L2 cache.
You're right that Maxwell may not benefit from asynchronous compute + graphics even if it were capable of executing such tasks. There simply isn't enough dedicated cache to handle that many compute threads. Heck, Maxwell is probably already taxed as-is under AotS. In other words, Maxwell will likely regress in performance in upcoming engines and titles, becoming another Kepler.
Fiji will likely have a longer life span.