Async compute is easy to use in D3D12, but tricky to get right. D3D12 is a multi-engine API, so choosing the right queue for the right job matters. The API can drive any hardware with the same multi-engine code, so developers only need to decide which jobs to run asynchronously. Texture streaming, for example, is an easy target, and it should be submitted to the copy engine. If the hardware cannot serve that queue with a dedicated DMA unit, D3D12 will automatically run the job on the graphics engine instead. This is a very clean multi-engine model, because graphics is a superset of compute, and compute is a superset of copy. In my opinion it is flawless, so kudos to Microsoft.
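To make this concrete, here is a minimal C++ sketch of the queue model, assuming an already-created ID3D12Device; the CreateQueue helper and the queue names in the usage comment are illustrative, not part of any official sample:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// One queue per engine type. If the hardware lacks a dedicated engine for a
// type, the scheduler transparently maps the work onto a more capable engine
// (copy -> compute -> graphics), so the same code runs everywhere.
static ComPtr<ID3D12CommandQueue> CreateQueue(ID3D12Device* device,
                                              D3D12_COMMAND_LIST_TYPE type)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type  = type;                          // DIRECT, COMPUTE or COPY
    desc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;
    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// Usage sketch: texture-streaming uploads belong on the COPY queue.
// auto graphicsQueue = CreateQueue(device, D3D12_COMMAND_LIST_TYPE_DIRECT);
// auto computeQueue  = CreateQueue(device, D3D12_COMMAND_LIST_TYPE_COMPUTE);
// auto copyQueue     = CreateQueue(device, D3D12_COMMAND_LIST_TYPE_COPY);
```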
Still, async compute is problematic, because the DXGK specification requires a very specific compute command engine in the hardware, one that supports fences and barriers in the right way. Fences are mostly used to synchronize workloads across queues, while barriers are used to block an operation on the GPU. In D3D12 the barriers have to support some specific conditions, which makes the API very future-proof but also imposes limits, because GCN is currently the only architecture whose compute command engines are designed the right way. Kepler and Maxwell have independent compute command engines, but these don't support the specific barrier conditions that DXGK requires for D3D12. Both architectures can execute standard multi-engine code, but they cannot run async compute on their compute command engines, so the async compute jobs get scheduled onto the main command engine, which leads to wasteful context switches. The result is a dramatic performance degradation with standard D3D12 multi-engine code. To avoid this, developers must either write an alternate codepath that uses the compute command engines in a less efficient way that suits NVIDIA better, or skip async compute entirely.
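As an illustration of the fence part, here is a hedged C++ sketch of cross-queue synchronization: the SubmitWithAsyncCompute helper, its parameters, and the one-command-list-per-queue setup are assumptions made for the example, not anything from a specific engine:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical helper: submit an async compute job, then make the graphics
// queue wait for it on the GPU timeline (no CPU stall).
void SubmitWithAsyncCompute(ID3D12Device* device,
                            ID3D12CommandQueue* computeQueue,
                            ID3D12CommandQueue* graphicsQueue,
                            ID3D12CommandList* computeList,
                            ID3D12CommandList* graphicsList)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    const UINT64 done = 1;

    // Kick off the async job on the dedicated compute queue, and signal
    // the fence from the GPU when it finishes.
    computeQueue->ExecuteCommandLists(1, &computeList);
    computeQueue->Signal(fence.Get(), done);

    // Graphics work that consumes the compute results waits on the fence
    // before executing.
    graphicsQueue->Wait(fence.Get(), done);
    graphicsQueue->ExecuteCommandLists(1, &graphicsList);
}
```

On hardware whose compute engines meet the DXGK requirements, the two queues genuinely overlap; on Kepler and Maxwell the same code ends up serialized on the main engine, with the context-switch cost described above.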