Async compute is an easy but tricky thing in D3D12. This is a multi-engine API, so choosing the right queue for the right job is important. The API can map multi-engine code to any hardware, so the devs only need to decide which jobs to run asynchronously. For example, texture streaming is an easy target and it should be submitted to the copy engine. If the hardware has no dedicated DMA unit, then D3D12 will automatically run the job on the graphics engine. This is a very simple multi-engine model, because graphics is a superset of compute, and compute is a superset of copy. This part is flawless in my opinion, so kudos to Microsoft.
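To make the queue model concrete, here is a minimal sketch of creating a dedicated copy queue for streaming work. It assumes a valid `device` pointer and omits error handling; the runtime maps the queue to a DMA engine when one exists, otherwise the work lands on the graphics engine as described above.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: a copy queue for texture streaming. `device` is assumed to be
// a valid ID3D12Device*; HRESULT checks are omitted for brevity.
ComPtr<ID3D12CommandQueue> CreateCopyQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;  // targets the DMA engine if the
                                               // hardware has one; otherwise the
                                               // driver runs it on graphics
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;

    ComPtr<ID3D12CommandQueue> copyQueue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&copyQueue));
    return copyQueue;
}
```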
Still, async compute is problematic, because the DXGK specification requires a very specific compute command engine in the hardware, one that supports fences and barriers in the right way. Fences are mostly used to synchronize workloads, while barriers are used to block an operation on the GPU. In D3D12 the barriers have to support some specific conditions, which makes this API really future-proof, but it also imposes limits, because GCN is the only architecture where the compute command engines are designed the right way. Kepler and Maxwell have independent compute command engines, but these don't support the specific barrier conditions that DXGK requires for D3D12. Both Kepler and Maxwell can execute a standard multi-engine code, but they aren't able to run async compute on the compute command engines, so the async compute jobs will be loaded onto the main command engine, and this leads to useless context switches. The result is a dramatic performance degradation with a standard D3D12 multi-engine code. To avoid this situation the devs must use an alternate codepath that uses the compute command engines in a not really efficient way but suits NVIDIA better, or they can choose not to use async compute at all.
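To illustrate the fence side of this, below is a minimal sketch of what a standard multi-engine codepath looks like: a compute queue synchronized against the graphics queue with a fence. It assumes `device`, `graphicsQueue`, and an already recorded `computeCmdList` exist, and skips error handling; whether the compute queue actually maps to a hardware compute command engine or falls back to the main engine is up to the driver, exactly as discussed above.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: submit async compute work that waits on the graphics queue.
// `device`, `graphicsQueue`, and `computeCmdList` are assumed valid.
void SubmitAsyncCompute(ID3D12Device* device,
                        ID3D12CommandQueue* graphicsQueue,
                        ID3D12CommandList* computeCmdList)
{
    // A compute queue feeds the hardware's compute command engine(s),
    // if the architecture exposes them the way DXGK expects.
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // The graphics queue signals the fence; the compute queue stalls until
    // that signal, then runs its workload concurrently with later graphics work.
    graphicsQueue->Signal(fence.Get(), 1);
    computeQueue->Wait(fence.Get(), 1);
    computeQueue->ExecuteCommandLists(1, &computeCmdList);
}
```

On GCN this kind of code keeps both engines busy at once; on Kepler and Maxwell the same fence logic still works correctly, it just doesn't buy any overlap.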