From what I understand, NV GPUs have less idle time than AMD GPUs. This is why NV shines in DX11: they have a more efficient pathway and use much more of their hardware. However, while NV GPUs have less idle time, there is still some that async compute could use to increase GPU efficiency.
NV's current implementation of async compute carries such a performance hit that, rather than increasing performance, it actually hinders it. This may be why NV's AOTS benchmarks in DX12 come in lower than their DX11 numbers.
AMD, on the other hand, has more idle time and performs async compute more efficiently thanks to the ACEs, so performance increases significantly when AC is used. That may be why GCN performance in DX12 improves so much over DX11 in the bench.
I am not sure your theory holds true, because it runs counter to exactly how GCN was designed -- which is to minimize idle time. Are you assuming that because some units in GCN are underutilized under the older DX11 API, AMD's GPUs are more idle as a result? Sure, if the API isn't exposing the full benefits of the GCN architecture, the ACE engines will be sitting idle, but with DX12 that could change -- at least that's what Oxide is telling us, because they got a free performance boost by utilizing those ACEs. It's also not as simple as keeping the GPU occupied, since some tasks should have higher priority than others, but if you prioritize a task there is a big performance hit from context-switching overhead:
"In short, here's the thing, everybody expected NVIDIA Maxwell architecture to have full DX12 support, as it now turns out, that is not the case. AMD offers support on their Fury and Hawaii/Grenada/Tonga (GCN 1.2) architecture for DX12 asynchronous compute shaders. The rather startling news is that Nvidia's Maxwell architecture, and yeah that would be the entire 900 range does not support it." ~
Guru3D
I think AMD's DX11 performance has long been traced to the draw-call handling of the DX11 API and their driver, rather than to a lack of ACE utilization in DX11 games. That draw-call bottleneck is fixed with DX12, but AC is something totally different: it's about using the existing GPU resources more effectively (i.e., how we moved from typical shaders to DirectCompute shaders). So rather than addressing a specific bottleneck, as was the case with draw calls in the DX11 API, DX12/lower-level APIs allow the programmer to speed up certain tasks by taking advantage of specific hardware features, like the Asynchronous Shaders/Compute engines already present in the hardware.
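To make that concrete, the way DX12 exposes this is through separate command queues. Here's a rough sketch (not from any particular engine; the device pointer and queue names are just placeholders) of creating a dedicated compute queue next to the normal graphics queue, which is what gives independent compute work a path to hardware schedulers like the ACEs:

#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Sketch: create a graphics (direct) queue and a separate compute queue.
// Work submitted to the compute queue can, on hardware that supports it,
// be scheduled alongside graphics work instead of lining up behind it.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}

Whether that second queue actually buys you anything then depends on the GPU and driver, which is exactly the Maxwell vs. GCN question here.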
I think the context is totally different. When AMD presented GCN to the world, there was no DX12 or Mantle or Vulkan. Eric Demers and the architects behind GCN explained why they built GCN with ACE engines and so on.
Very good AT article on GCN architecture:
"Now GCN is not an out-of-order architecture; within a wavefront the instructions must still be executed in order, so you can’t jump through a pixel shader program for example and execute different parts of it at once. However the CU and SIMDs can select a different wavefront to work on; this can be another wavefront spawned by the same task (e.g. a different group of pixels/values) or it can be a wavefront from a different task entirely.
Meanwhile on the compute side, AMD’s new Asynchronous Compute Engines serve as the command processors for compute operations on GCN. The principal purpose of ACEs will be to accept work and to dispatch it off to the CUs for processing. As GCN is designed to concurrently work on several tasks, there can be multiple ACEs on a GPU, with the ACEs deciding on resource allocation, context switching, and task priority.
One effect of having the ACEs is that GCN has a limited ability to execute tasks out of order. As we mentioned previously GCN is an in-order architecture, and the instruction stream on a wavefront cannot be reordered. However the ACEs can prioritize and reprioritize tasks, allowing tasks to be completed in a different order than they’re received. This allows GCN to free up the resources those tasks were using as early as possible rather than having the task consuming resources for an extended period of time in a nearly-finished state. This is not significantly different from how modern in-order CPUs (Atom, ARM A8, etc) handle multi-tasking."
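To put that in API terms: on the software side all you really do is hand the compute work to the compute queue and fence it, and when or whether it overlaps with graphics is left to the hardware scheduler -- which on GCN is where the ACEs come in. A rough sketch, continuing from the two-queue snippet above (the queue and command-list names are just placeholders):

// Sketch: submit a compute workload on the compute queue and only make the
// graphics queue wait for it at the point where its results are consumed.
void SubmitFrame(ID3D12Device* device,
                 ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12GraphicsCommandList* graphicsCmdList,
                 ID3D12GraphicsCommandList* computeCmdList)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    ID3D12CommandList* computeLists[] = { computeCmdList };
    computeQueue->ExecuteCommandLists(1, computeLists);
    computeQueue->Signal(fence.Get(), 1);   // fence reaches 1 when compute is done

    // Graphics work that doesn't depend on the compute results is submitted
    // right away and is free to run concurrently with the compute queue.
    ID3D12CommandList* gfxLists[] = { graphicsCmdList };
    graphicsQueue->ExecuteCommandLists(1, gfxLists);

    // Anything submitted to the graphics queue after this Wait() stalls on the
    // GPU until the compute work has signalled the fence -- i.e. only where
    // the compute output is actually consumed.
    graphicsQueue->Wait(fence.Get(), 1);
}

Whether the hardware can actually interleave the two streams is the whole debate: on GCN the ACEs handle that scheduling, while the claim about Maxwell is that it can't do it without an expensive context switch.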
AMD always meant to design GCN as a compute monster; that's why they ditched the old VLIW design, created an ACE+CU architecture GPU design, and expanded the ACEs even further from the HD7970 to the R9 290X. DX12 does not look like it has anything to do with this architecture, because the architecture came first and was designed to be forward-looking for future software. It actually looks more like the GCN architecture was so far ahead of its time when it came to DirectCompute/Asynchronous Compute that it took until DX12 (or a specific low-level PS4 API) to actually expose the benefits of this architecture. That's my theory. That's why we are seeing console developers squeeze more performance out of the PS4 and XB1 using AC in games like Uncharted 4, while these benefits of GCN have largely been ignored on the PC.
We did see glimpses of the potential when DirectCompute was used for Global Illumination in Dirt Showdown or SSAA/shadows in games like Hitman Absolution and Sleeping Dogs, but those were limited use cases.
It's interesting how NV owners defended Fermi over poor compute -- deny, deny, deny -- and then the same thing was repeated with Kepler. In hindsight, both of those NV architectures bombed at compute, and now we are starting to see cracks in Maxwell, which, if true, would make it a third generation in a row where NV crippled compute in its architecture. If ACEs/DirectCompute become a key factor in DX12 performance, then bye bye Fermi, Kepler, and low-end Maxwell cards. Hard to say for now, though, because it'll be a while before we see many DX12 games, by which point many gamers will have upgraded to 16nm HBM2 GPUs.