The image he showed is what AMD can do. Notice that you have 3 lines of commands running concurrently with one another. Those 3 lines represent the 3 queues (Graphics/3D, Copy and Compute).
When NVIDIA Kepler/Maxwell execute the same code, they execute it like you see in the DirectX11 portion below, with some exceptions. When the Graphics queue is executing a compute command, the Compute queue can concurrently execute a compute command as well. Copy commands can also be executed concurrently on NVIDIA hardware. What NVIDIA don't support is mixing Compute with Graphics, so NVIDIA's current hardware does not support Asynchronous Compute + Graphics (which is what can significantly boost performance). AMD behave like the DirectX12 portion of this image and support all forms of concurrent execution:
Where you see a divergence is on the CPU side. Under DirectX12, Intel, AMD and NVIDIA all split the command buffer listing (the DirectX runtime, or red bar) across various cores, like so:
Of course there's another difference here: whereas AMD GCN executes the DirectX runtime under DX11 like the DirectX11 shot above, NVIDIA do not. NVIDIA support DirectX11 Multi-threaded command listing while AMD do not.
DirectX11 Multi-threaded command listing basically works like this: batches of commands are pre-recorded on multiple CPU cores, and the primary CPU thread simply plays back the pre-arranged and pre-computed command lists to the NVIDIA driver. The NVIDIA driver compiler orders them into grids and schedules them for execution.
Basically NVIDIA are able to split that red bar, in the DirectX11 shot above, across many CPU threads under DX11. AMD are not. NVIDIA also already employ a multi-threaded DirectX11 driver (pale blue bar).
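To make the record-then-play-back pattern a bit more concrete, here is a toy Python sketch. It is not Direct3D code and says nothing about what any driver really does internally; the function names (`record_batch`, `build_frame`) and the string "commands" are made up purely for illustration. In the real API the rough equivalents are deferred contexts recording command lists that the immediate context then executes.

```python
from concurrent.futures import ThreadPoolExecutor


def record_batch(batch_id, draw_calls):
    """Worker thread: pre-record one batch of commands. Nothing is executed here."""
    return [f"batch{batch_id}:draw{i}" for i in range(draw_calls)]


def build_frame(num_batches=4, draws_per_batch=3, workers=4):
    """Record batches in parallel, then play them back on the calling thread."""
    # Recording is spread across worker threads (the "split red bar").
    with ThreadPoolExecutor(max_workers=workers) as pool:
        command_lists = list(
            pool.map(record_batch, range(num_batches), [draws_per_batch] * num_batches)
        )

    # The primary thread only replays the pre-recorded lists, in batch order.
    submitted = []
    for command_list in command_lists:
        submitted.extend(command_list)
    return submitted


frame = build_frame()
```

The point of the sketch is only the division of labour: the expensive recording step runs on many threads, while the primary thread is left with a cheap, ordered playback.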
This is why NVIDIA don't gain much from DX12 over DX11 performance-wise and AMD gain a lot. AMD GCN hammers the primary CPU thread under DirectX11, leading to a CPU bottleneck. Vulkan will eventually show the same behaviour as the API matures: for AMD, more performance than DX11; for NVIDIA, similar performance, give or take (NVIDIA's new driver will improve things a bit by allowing concurrent execution of compute commands).
The end result is that AMD GCN gains performance from the get go by running DirectX12 over DirectX11.
If you throw Asynchronous Compute + Graphics into the mix, AMD gain even more performance. How? Asynchronous Compute + Graphics significantly lowers frame times (frame latency), thus boosting performance. It also raises GPU utilization, thus minimizing idling resources.
The thing with GCN is that some of its resources are almost always idling, because the architecture is very wide and highly parallel.
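A back-of-the-envelope way to see why overlapping compute with graphics lowers frame times: if the two workloads run one after the other, the frame costs their sum; if compute can hide behind graphics, the frame costs less, in the ideal case only as much as the longer of the two. The sketch below is just that arithmetic. The `overlap` parameter and the millisecond figures are hypothetical, not measurements, and real hardware will not reach the ideal case because both workloads compete for the same shader units and bandwidth.

```python
def frame_time_serial(graphics_ms, compute_ms):
    """Graphics and compute run back to back on one queue."""
    return graphics_ms + compute_ms


def frame_time_overlapped(graphics_ms, compute_ms, overlap=1.0):
    """Compute runs alongside graphics.

    overlap is the fraction (0..1) of the shorter workload that is hidden
    behind the longer one; 1.0 is the idealised best case.
    """
    hidden_ms = overlap * min(graphics_ms, compute_ms)
    return graphics_ms + compute_ms - hidden_ms


# Hypothetical frame: 12 ms of graphics work and 4 ms of compute work.
serial_ms = frame_time_serial(12.0, 4.0)              # 16.0
ideal_ms = frame_time_overlapped(12.0, 4.0)           # 12.0
partial_ms = frame_time_overlapped(12.0, 4.0, 0.5)    # 14.0
```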
So yeah, AMD's DX11 implementation is inferior to NVIDIA's. There's no denying that. Vulkan and DX12, however, are based on ideas spawned by the Mantle API and thus, like the Mantle API, are really tailored to AMD hardware as it pertains to "performance".
Want to see this multiple queue execution in action? GPUView allows you to do just that.
Here's the Fable Legends Fly by demo running on a TitanX. (Note: there are very few Asynchronous workloads in the Fable Legends Fly by test, but the released version will include spell effects and more):
Notice the Compute queue is pretty much empty?
Now the same test on a Fury:
Get the idea?
NVIDIA will be able to run Asynchronous Compute in the sense of running compute commands in the Compute queue concurrently with compute commands in the Graphics (3D) queue, but not compute commands concurrently with graphics commands (you see that in the Fable Legends screenshot above). This is what NVIDIA call "Asynchronous Compute". Kollock, an Oxide developer, mentioned that support for this was recently added to NVIDIA's driver but requires an NVIDIA-specific implementation. However, the real performance gains are to be had from concurrently executing Compute and Graphics tasks. This is something current NVIDIA architectures are incapable of doing.
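If it helps, the claims in this post about which kinds of work can overlap boil down to a small lookup table. The sketch below only encodes my reading of the post (it is not taken from any vendor documentation), and the names `SUPPORTED_PAIRS` and `can_overlap` are made up for the example.

```python
GRAPHICS, COMPUTE, COPY = "graphics", "compute", "copy"

# Pairs of work types that, per this post, can execute concurrently.
# frozenset makes the pair order-independent; frozenset([COMPUTE]) stands
# for compute running alongside compute.
SUPPORTED_PAIRS = {
    "NVIDIA Kepler/Maxwell": {
        frozenset([COMPUTE]),
        frozenset([COPY, GRAPHICS]),
        frozenset([COPY, COMPUTE]),
    },
    "AMD GCN": {
        frozenset([COMPUTE]),
        frozenset([COPY, GRAPHICS]),
        frozenset([COPY, COMPUTE]),
        frozenset([GRAPHICS, COMPUTE]),  # the Asynchronous Compute + Graphics case
    },
}


def can_overlap(architecture, work_a, work_b):
    """True if the post claims these two work types can run concurrently."""
    return frozenset([work_a, work_b]) in SUPPORTED_PAIRS[architecture]
```

For example, `can_overlap("AMD GCN", GRAPHICS, COMPUTE)` is true while the same query for "NVIDIA Kepler/Maxwell" is false, which is the whole difference this post is about.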
Conclusion:
- What AMD mean by Asynchronous Compute is not what NVIDIA mean.
- NVIDIA do not support concurrent execution of Compute + Graphics commands.
- GCN has idling resources from being a very wide architecture. Exploiting those resources through Asynchronous Compute can lead to significant performance improvements.
- NVIDIA have little to gain performance-wise on their current architectures under DX12/Vulkan.
- AMD's DirectX11 implementation is inferior and hammers the primary CPU thread, leading to a CPU bottleneck (Rise of the Tomb Raider highlights this).
And that's that.