AMD
Compute engines can be used for multiple different purposes on GCN hardware:
- Long-running compute jobs can be offloaded to a compute queue. If a job is known to potentially waste a lot of time in stalls, it can be moved off busy queues. This has the added benefit of better shader utilization, as 3D and compute workloads can be interleaved at every level of the hardware, from the scheduler down to actual execution on the compute units.
- High-priority jobs can be scheduled to a dedicated compute queue. They will go into the next free execution slot on the corresponding ACE. They cannot preempt running shaders, but they will skip ahead of any queued ones. Set the priority on the compute queue accordingly to achieve this behaviour.
- Get around the execution slot limit. When executing compute shaders with tiny grids, down to the minimum of 64 threads per thread group, a single engine would underutilize the GPU. By utilizing all 8 ACEs together with the 3D engine, up to 640 active grids can be achieved on Fiji. This is precisely the upper occupation limit and maximizes utilization, even if each grid only yields a single wavefront. You should still prefer issuing fewer commands with larger grids instead; pushing the hardware to its limits like this can expose other unexpected bottlenecks.
- Create more back pressure. By providing additional jobs on a compute engine, the impact of blocking barriers in other queues can be avoided. Barriers or fences placed on other queues do not cause any interference.

GCN is also perfectly happy to accept compute commands in the 3D queue.
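The priority behaviour described above maps to the queue priority in the DX12 API. The following is a minimal sketch, not a complete implementation: it assumes a valid `ID3D12Device*` and omits `HRESULT` error handling.

```cpp
#include <d3d12.h>
#include <wrl/client.h>

// Sketch: create a dedicated high-priority compute queue. Work submitted
// here goes into the next free slot on the ACE, skipping ahead of queued
// (but not already running) shaders.
Microsoft::WRL::ComPtr<ID3D12CommandQueue>
CreateHighPriorityComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;    // compute engine, not the 3D engine
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;  // the priority setting mentioned above
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    Microsoft::WRL::ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));  // check the HRESULT in real code
    return queue;
}
```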
There is no penalty for mixing draw calls and compute commands in the 3D queue. In fact, compute commands have approximately the same performance as draw calls with proxy geometry¹⁰.
Compute commands should still be preferred for any non-geometry-related operation for practical reasons, such as access to local shared memory and a higher possible level of concurrency.
Offloading compute commands to the compute queue is a good opportunity to increase GPU utilization.
¹⁰ Proxy geometry refers to a technique where simple geometry, such as a single screen-filling square, is used to apply post-processing effects and the like to 2D buffers.
Nvidia
Due to the possible performance penalties from using compute commands concurrently with draw calls, compute queues should mostly be used to offload and execute compute commands in batch.
There are multiple points to consider when doing this:
- The workload on a single queue should always be sufficient to fully utilize the GPU. There is no parallelism between the 3D and the compute engine, so you should not try to split workload between regular draw calls and compute commands arbitrarily. Make sure to always properly batch both draw calls and compute commands. Pay close attention not to stall the GPU with solitary compute jobs limited by texture sample rate, memory latency or the like; other queues cannot become active as long as such a command is running.
- Compute commands should not be scheduled on the 3D queue. Doing so will hurt performance measurably: the 3D engine not only enforces sequential execution, but the reconfiguration of the SMM units impairs performance even further. Consider a draw call with proxy geometry instead when batching and offloading are not an option for you; this will still save a few microseconds compared to interleaving a compute command.
- Make 3D and compute sections long enough. Switching between compute and 3D queues results in a full flush of all pipelines, so the GPU should have spent enough time in one mode to justify the penalty for switching. Beware that there is no active preemption: a long-running shader in either engine will stall the transition.
- Despite the limitations, the use of compute shaders should still be considered. The reduced overhead and effectively higher level of concurrency compared to classic draw calls with proxy geometry can still yield remarkable performance gains.
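The batching advice above can be sketched with the DX12 API: record all compute work into a single command list, submit it to the compute queue in one go, and use a fence so the 3D queue only resumes once the batch has finished. This is an illustrative sketch, assuming valid, already-recorded objects; error handling is omitted.

```cpp
#include <d3d12.h>

// Sketch: one long compute section on the compute queue, followed by a
// fence that gates the 3D queue. Each engine thus spends enough time in
// one mode to amortize the pipeline flush on the switch.
void SubmitComputeBatch(ID3D12CommandQueue* computeQueue,
                        ID3D12CommandQueue* graphicsQueue,
                        ID3D12CommandList*  batchedComputeList,  // many dispatches recorded inside
                        ID3D12Fence*        fence,
                        UINT64              fenceValue)
{
    ID3D12CommandList* lists[] = { batchedComputeList };
    computeQueue->ExecuteCommandLists(1, lists);   // one large compute section in batch
    computeQueue->Signal(fence, fenceValue);       // mark the end of the batch
    graphicsQueue->Wait(fence, fenceValue);        // 3D work resumes only after the batch
}
```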
Additional care is required to cleanly separate the render pipeline into batches.
If async compute with support for high priority jobs and independent scheduling is a hard requirement, consider the use of CUDA for these jobs instead of the DX12 API.
With GK110 and later, CUDA bypasses the graphics command processor and is handled by a dedicated function unit in hardware which runs uncoupled from the regular compute or graphics engine. It even supports multiple asynchronous queues in hardware, as you would expect.
Ask your personal Nvidia engineer for how to share GPU side buffers between DX12 and CUDA.
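As an illustration of the CUDA route, the CUDA runtime exposes stream priorities directly on the host side. A minimal sketch, not verified against any particular toolkit version; buffer sharing with DX12 is deliberately left out, per the note above.

```cpp
#include <cuda_runtime.h>

// Sketch: create a high-priority CUDA stream. On GK110 and later these
// streams are scheduled by the dedicated hardware front end, independent
// of the graphics command processor.
cudaStream_t CreateHighPriorityStream()
{
    int lowest = 0, highest = 0;
    cudaDeviceGetStreamPriorityRange(&lowest, &highest);  // lower value = higher priority

    cudaStream_t stream = nullptr;
    cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, highest);
    return stream;  // launch high-priority kernels via <<<grid, block, 0, stream>>>
}
```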