ComputerBase: Ashes of the Singularity Beta 1 DirectX 12 Benchmarks


Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
If your hardware does not support API features, even if the driver exposes them, you have to add specific code for that specific vendor. It's as simple as that. The whole point of Asynchronous Compute is context switching between Compute and Graphics in the pipeline.
 

xthetenth

Golden Member
Oct 14, 2014
1,800
529
106
No, you don't add specific code for nVidia. You need specific code for every vendor you want to support. This is low level. Developers want more freedom? It will not come for free.

/edit: nVidia has published an article for Vulkan and OpenGL about this: https://developer.nvidia.com/transitioning-opengl-vulkan

No, you don't write vendor-specific code except as a last resort. Whenever possible you write your workflows to work from the best to the worst method based on the results of querying for support. The only reason you'd write vendor-specific code is if one vendor's implementation of a feature is either non-compliant or non-performant. The general case isn't vendor-specific, and I bet that if NV hadn't been overconfident they could've lobbied for a way to give a warning that their async performance is anything but.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
If your hardware does not support API features, even if the driver exposes them, you have to add specific code for that specific vendor. It's as simple as that. The whole point of Asynchronous Compute is context switching between Compute and Graphics in the pipeline.

And how can anybody ask the driver about that support?
You guys are still talking about the second step. Nobody has provided a way to get this kind of information from the driver. :D

No, you don't write vendor-specific code except as a last resort. Whenever possible you write your workflows to work from the best to the worst method based on the results of querying for support. The only reason you'd write vendor-specific code is if one vendor's implementation of a feature is either non-compliant or non-performant. The general case isn't vendor-specific, and I bet that if NV hadn't been overconfident they could've lobbied for a way to give a warning that their async performance is anything but.

That doesn't make sense with Vulkan and DX12. They are too low-level not to write different code paths.
 

Glo.

Diamond Member
Apr 25, 2015
5,930
4,991
136
Because it is the application that commands the hardware, not the driver. The driver only exposes the hardware to the low-level API application. Is that so hard to understand?
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
That has nothing to do with my question:
How can anybody query the driver for information about the multi-engine support level?

What you wrote is exactly my point:
nVidia provided a DX12 driver, so they support everything the same way AMD does...
 

PhonakV30

Senior member
Oct 26, 2009
987
378
136
The minimum X needed for Z is Y. If Nvidia can't provide Y, then it doesn't support / can't do Z!

The minimum price for an iPhone is $300. So do you have $300? No, you have $100. Then you can't buy an iPhone.

Is it really that hard to understand? I have the right to ask the question:
Is there any statement from Nvidia about async compute? Answer: no. Why?
We have been waiting for six months.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
You haven't answered the question. What is the cap which describes the multi-engine support of the underlying hardware?

Seriously, what you just wrote is nothing more than:
nVidia has provided a DX12 driver so they support everything AMD supports. :\
I'm still waiting to see how a "DX12 driver" can or cannot support Async Compute.
I did answer your question but I think you don't understand.

The driver tells the API what functions the hardware supports. This process is referred to as "the driver exposing the features".

So when the Microsoft tool queries the driver, the driver returns whatever it was programmed to report in terms of feature support. The driver tells the Microsoft tool what it supports.

Example:

Microsoft tool ---> driver
Microsoft tool <--- driver

At no point does the Microsoft tool test the feature support of the hardware itself.
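
For illustration, the query step looks roughly like this from the application side under D3D12 (a minimal sketch; the function name is made up and a valid ID3D12Device is assumed). Note that nothing returned here says whether compute queues actually execute concurrently with graphics:

Code:
#include <windows.h>
#include <d3d12.h>

// Sketch: ask the driver what it claims to support.
void QueryReportedCaps(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (SUCCEEDED(device->CheckFeatureSupport(
            D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options))))
    {
        // The driver fills 'options' with whatever it was programmed to
        // report (e.g. options.ResourceBindingTier, options.TiledResourcesTier).
        // There is no cap bit that says "compute runs concurrently with graphics".
    }
}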

That's why Oxide stated what they stated, and I'm paraphrasing here: "the driver reported that the Asynchronous Compute feature was available, but when we went to use it, the result was an unmitigated disaster. AFAIK Maxwell does not support Asynchronous Compute, so I don't know why the driver was trying to expose it."

Understand?
 

xthetenth

Golden Member
Oct 14, 2014
1,800
529
106
That doesn't make sense with Vulkan and DX12. They are too low-level not to write different code paths.

Of course you write different code paths, but you base them on features, not on a fixed list of hardware, because of course you don't. This is so self-evident it boggles the mind. Do you seriously believe a brave new future where devs have to patch in new rendering pipelines to support every new hardware release would have any developer support? You might not realize how extraordinary the claim is, but it requires some extraordinary proof.
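
A minimal sketch of that distinction (the helper names are made up; the vendor IDs are the standard PCI IDs): the preferred path branches on capabilities the driver reports, and only a last-resort workaround keys off the adapter's vendor ID:

Code:
#include <windows.h>
#include <d3d12.h>
#include <dxgi.h>

// Preferred: pick a code path from the caps the driver reports.
bool UseHigherBindingTierPath(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
    return SUCCEEDED(device->CheckFeatureSupport(
               D3D12_FEATURE_D3D12_OPTIONS, &opts, sizeof(opts)))
        && opts.ResourceBindingTier >= D3D12_RESOURCE_BINDING_TIER_2;
}

// Last resort: a vendor-specific workaround keyed off the PCI vendor ID.
bool IsNvidiaAdapter(IDXGIAdapter1* adapter)
{
    DXGI_ADAPTER_DESC1 desc = {};
    adapter->GetDesc1(&desc);
    return desc.VendorId == 0x10DE; // 0x1002 = AMD, 0x8086 = Intel
}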
 

Azix

Golden Member
Apr 18, 2014
1,438
67
91
That has nothing to do with my question:
How can anybody query the driver for information about the multi-engine support level?

What you wrote is exactly my point:
nVidia provided a DX12 driver, so they support everything the same way AMD does...

Oxide said the driver was showing a feature as available. So there you have it :thumbsup:
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Nvidia executes asynchronous compute code synchronously under DX12. Nvidia supports async compute through Hyper-Q under CUDA, but Hyper-Q doesn't support the additional wait conditions of barriers (a DX12 requirement). So no, there is no Async Compute for Fermi, Kepler or Maxwell under DX12.

Let me explain. Microsoft has introduced additional queue types for 3D applications with the DX12 API:

Graphics queue for primary rendering tasks
Compute queue for supporting GPU tasks (lighting, post processing, physics etc)
Copy queue for simple data transfers
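
For reference, a minimal sketch of how those three queue types are created in D3D12 (assuming a valid ID3D12Device; the names are illustrative):

Code:
#include <windows.h>
#include <d3d12.h>

// Sketch: one queue of each DX12 engine type.
void CreateQueues(ID3D12Device* device,
                  ID3D12CommandQueue** graphics,
                  ID3D12CommandQueue** compute,
                  ID3D12CommandQueue** copy)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics: accepts draw, compute and copy work
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(graphics));
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute: accepts compute and copy work
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(compute));
    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;     // copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(copy));
}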

Command lists within a specific queue are still executed synchronously, while those in different queues can execute asynchronously (i.e. concurrently and in parallel). What does asynchronous mean? It means that the order of execution of one queue relative to another is not defined: workloads submitted to these queues may start or complete in a different order than they were issued. Fences and barriers only apply to their respective queue. When the workload in one queue is blocked by a fence, the other queues can keep running and submitting work for execution. If synchronisation points between two or more queues are required, they can be defined and enforced using fences.
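
A minimal sketch of such a cross-queue synchronisation point (illustrative names; the queues and compute command lists are assumed to exist already): the graphics queue signals a shared fence and the compute queue waits on it, entirely on the GPU timeline, without blocking the CPU:

Code:
#include <windows.h>
#include <d3d12.h>

// Sketch: make the compute queue wait for work already submitted to the
// graphics queue, using a shared fence.
void SyncComputeAfterGraphics(ID3D12Device* device,
                              ID3D12CommandQueue* graphicsQueue,
                              ID3D12CommandQueue* computeQueue,
                              ID3D12CommandList* const* computeLists,
                              UINT numComputeLists)
{
    ID3D12Fence* fence = nullptr;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    const UINT64 fenceValue = 1;
    graphicsQueue->Signal(fence, fenceValue); // GPU signals when the graphics work completes
    computeQueue->Wait(fence, fenceValue);    // GPU-side wait; the CPU is not blocked
    computeQueue->ExecuteCommandLists(numComputeLists, computeLists);
    // Real code would keep the fence alive until the GPU has finished with it.
}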

Similar features have been available under OpenCL and CUDA for some time. The fences and signals under DX12 map directly to a subset of the event system under OpenCL and CUDA. Under DX12, however, barriers have additional wait conditions. These wait conditions are not supported by either OpenCL or CUDA; instead, a write-through of dirty buffers needs to be explicitly requested. Therefore asynchronous compute under DX12, though similar to asynchronous compute under OpenCL and CUDA, requires explicit feature support.
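
For concreteness, a D3D12 barrier in its simplest form looks like this (a minimal sketch with illustrative names): it transitions a buffer written by a compute pass into a state a pixel shader can read, and the driver is responsible for honouring the implied waits and cache flushes:

Code:
#include <windows.h>
#include <d3d12.h>

// Sketch: transition a compute-written buffer so graphics can sample it.
void TransitionForGraphicsRead(ID3D12GraphicsCommandList* cmdList,
                               ID3D12Resource* buffer)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
    barrier.Transition.pResource   = buffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    cmdList->ResourceBarrier(1, &barrier);
}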

These new queues are also different from the classic Graphics queue. While the classic Graphics queue can be fed compute commands, copy commands and graphics commands (draw calls), the new compute and copy queues can only accept compute and copy commands respectively, hence their names.

For Maxwell, Compute and Graphics can't be active at the same time under DX12 because there is only a single function unit (the Command Processor) rather than access to ACEs as well. Copy commands, however, can run asynchronously to Graphics and Compute commands due to the inclusion of more than one DMA engine in Maxwell. We see this when looking at how Fable Legends executes the various queues. What nvidia would need, in order to execute graphics and compute commands asynchronously, is to add support for the additional barrier wait conditions in their Hyper-Q implementation. Why? This would expose the additional execution unit under Hyper-Q. The Hyper-Q interface used for CUDA's concurrent execution supports asynchronous compute, as we see in DX11 + PhysX titles (the Batman Arkham series, for example). Hyper-Q is, however, not compatible with the DX12 API (for the reasons mentioned above). If it were compatible, there would be a hardware limit of 31 asynchronous compute queues and 1 Graphics queue (as Anandtech reported).

So, all that to say that if you fence often, you can get nvidia hardware to run the asynchronous code synchronously. You also have to make sure you use large batches of short-running shaders; long-running shaders would complicate scheduling on nvidia hardware and introduce latency. Oxide, because they were using AMD-supplied code, ran into this problem in Ashes of the Singularity.
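
To make the fencing point concrete, here is a purely illustrative sketch (none of these names come from Oxide's code): if every small batch is bracketed by a cross-queue signal/wait pair, the two queues end up running in lockstep, which is effectively synchronous execution:

Code:
#include <windows.h>
#include <d3d12.h>

// Sketch: fencing after every batch forces graphics and compute to alternate.
void LockstepSubmit(ID3D12CommandQueue* graphicsQueue,
                    ID3D12CommandQueue* computeQueue,
                    ID3D12Fence* fence,
                    ID3D12CommandList* const* graphicsBatches,
                    ID3D12CommandList* const* computeBatches,
                    UINT numBatches)
{
    UINT64 value = 0;
    for (UINT i = 0; i < numBatches; ++i)
    {
        graphicsQueue->ExecuteCommandLists(1, &graphicsBatches[i]);
        graphicsQueue->Signal(fence, ++value);
        computeQueue->Wait(fence, value);   // compute waits for graphics
        computeQueue->ExecuteCommandLists(1, &computeBatches[i]);
        computeQueue->Signal(fence, ++value);
        graphicsQueue->Wait(fence, value);  // graphics waits for compute
    }
}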

Since AMD are working with IO Interactive on the Hitman DX12 path, you can be sure that the DX12 path will be AMD-optimized. That means less fencing and longer-running shaders.

For Hitman, Nvidia would basically have to work with IO Interactive as well in order to add a vendor-ID-specific DX12 path (like we saw Oxide do). It's probably not worth it, seeing as nvidia have little to gain from DX12 over DX11. AMD, however, will likely suffer from a CPU bottleneck under Hitman DX11 (as they do under Rise of the Tomb Raider DX11). AMD have a lot to gain from working with developers on coding and optimizing a DX12 path.

So to summarize,

Nvidia do not support async compute under DX12. Hitman's DX12 path may run like crap on nvidia hardware unless nvidia convince IO Interactive to code a vendor-ID-specific path and supply IO with optimized short-running shaders; basically, the same thing that nvidia did with Oxide for Ashes of the Singularity. Since nvidia have little to gain from moving from DX11 to DX12, it's best for them not to waste time and money helping IO code a vendor-ID-specific path.

AMD will suffer performance issues due to a CPU bottleneck, brought on by the lack of support for DX11 multi-threaded command listing, when running the Hitman DX11 path. AMD have everything to gain from assisting IO Interactive in the implementation of a DX12 path. Asynchronous compute is just an added bonus on top of the removal of the CPU bottleneck which plagues AMD GCN under DX11 titles.
 

Paul98

Diamond Member
Jan 31, 2010
3,732
199
106
It will be interesting to see whether console games that use async shaders will also port them over to PC.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
http://gearnuke.com/deus-ex-mankind-divided-use-async-compute-enhance-pure-hair-simulation/

Wonder if we will end up with almost "free" PureHair in Deus Ex? Tomb Raider used it for lighting / BTAO on Xbox, with no DX12 support on PC.

Anyone have access to this?

http://dl.acm.org/citation.cfm?id=2775280.2775284&coll=DL&dl=GUIDE
Yes. TressFX/PureHair is a physics-based implementation which can be executed asynchronously, delivering pretty much no performance loss with it activated versus turned off. When comparing the feature running asynchronously versus synchronously, a slight boost in performance will be observed.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
It will be interesting to see whether console games that use async shaders will also port them over to PC.
That's all on AMD; they need to work on their developer partnerships or risk a repeat of the Rise of the Tomb Raider fiasco.

As gamers, we really lost out due to the lack of DX12 support in Rise of the Tomb Raider. Rumours are that a DX12 patch is incoming, though :)
 

PhonakV30

Senior member
Oct 26, 2009
987
378
136
I think async compute means adding more effects at the same FPS, NOT "adding more effects with more FPS".
 

Kenmitch

Diamond Member
Oct 10, 1999
8,505
2,250
136
I think async compute means adding more effects at the same FPS, NOT "adding more effects with more FPS".

It most likely can go both ways. Figure anything that wasn't async would probably have affected FPS to some extent.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
I think async compute means adding more effects at the same FPS, NOT "adding more effects with more FPS".
It depends on what you're comparing.

If you're comparing DX11 running the shaders synchronously against DX12 running the shaders asynchronously, then you will observe a boost in performance. This boost can allow you to run more shaders and more effects and still end up with more FPS than you would otherwise have obtained when running fewer shaders and fewer effects synchronously under DX11.

If you're comparing a shader-intensive feature being toggled on and off while running asynchronously against the same feature being toggled on and off while running synchronously under DX12, then what you said applies. You either achieve the same FPS with the feature on as with it off when running asynchronously, versus a performance hit when running synchronously, or a significantly reduced performance hit when running asynchronously versus synchronously.
 
Feb 19, 2009
10,457
10
76
The Hyper-Q interface used for CUDA's concurrent execution supports asynchronous compute, as we see in DX11 + PhysX titles (the Batman Arkham series, for example). Hyper-Q is, however, not compatible with the DX12 API (for the reasons mentioned above). If it were compatible, there would be a hardware limit of 31 asynchronous compute queues and 1 Graphics queue (as Anandtech reported).

You misunderstand Hyper-Q: it's intended to run 32 compute queues in parallel. It was never stated that it could run 1 graphics + 31/32 compute queues in the same pipeline.

The reason is fairly simple: if it were capable of running graphics and 31 compute queues simultaneously, there would be very little performance loss from GPU PhysX simulation, as it would be running asynchronously and therefore would not bottleneck graphics rendering.

What are the observed results? GPU PhysX often destroys performance. So it's clear that CUDA compute is slowing down graphics rendering. Anyone remember Borderlands? When GPU PhysX is on, FPS drops massively, minimum FPS especially.

That does not have the hallmark of compute being run in parallel. It looks exactly like serial-mode graphics + compute. The more compute tasks you add (GPU PhysX), the slower the graphics run, because they have to wait for the compute to be processed. This is why people used to rock a second GPU just for GPU PhysX, so it wouldn't tank performance.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
You misunderstand Hyper-Q: it's intended to run 32 compute queues in parallel. It was never stated that it could run 1 graphics + 31/32 compute queues in the same pipeline.

The reason is fairly simple: if it were capable of running graphics and 31 compute queues simultaneously, there would be very little performance loss from GPU PhysX simulation, as it would be running asynchronously and therefore would not bottleneck graphics rendering.

What are the observed results? GPU PhysX often destroys performance. So it's clear that CUDA compute is slowing down graphics rendering. Anyone remember Borderlands? When GPU PhysX is on, FPS drops massively, minimum FPS especially.

That does not have the hallmark of compute being run in parallel. It looks exactly like serial-mode graphics + compute. The more compute tasks you add (GPU PhysX), the slower the graphics run, because they have to wait for the compute to be processed. This is why people used to rock a second GPU just for GPU PhysX, so it wouldn't tank performance.
Early implementations of Hyper-Q (think Kepler and Maxwell v1) could not use Compute in conjunction with Graphics.

Maxwell v2 (900 series) supports running Compute in conjunction with Graphics under Hyper-Q.

http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading

A lot of the "nvidia abandoned Kepler" rhetoric in GameWorks titles which use PhysX stems from this fact. Kepler doesn't support this new feature.

Ever since the 900 series, running a second GPU for PhysX has offered either negligible improvements in performance or, as is often the case, a performance loss.
 
Feb 19, 2009
10,457
10
76
Early implementations of Hyper-Q (think Kepler and Maxwell v1) could not use Compute in conjunction with Graphics.

Maxwell v2 (900 series) supports running Compute in conjunction with Graphics under Hyper-Q.

http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading

A lot of the "nvidia abandoned Kepler" rhetoric in GameWorks titles which use PhysX stems from this fact. Kepler doesn't support this new feature.

Ever since the 900 series, running a second GPU for PhysX has offered either negligible improvements in performance or, as is often the case, a performance loss.

Don't link that AT article please; they got it wrong and haven't fixed it.

I would prefer to see some evidence of GPU PhysX (on/off) tanking in Kepler and not in Maxwell 2.

Do any recent titles with GPU PhysX show that result?
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Don't link that AT article please; they got it wrong and haven't fixed it.

I would prefer to see some evidence of GPU PhysX (on/off) tanking in Kepler and not in Maxwell 2.

Do any recent titles with GPU PhysX show that result?
Well, here's the proof that you can run compute and graphics together provided you're using CUDA (Batman Arkham Origins):
[image: GPUView capture, Batman Arkham Origins]


Here's what it looks like when attempting to run compute and graphics under DX12 (from Beyond3D):
[image: GPUView capture, compute + graphics under DX12, from Beyond3D]


This does seem to indicate that the specific aspect of Anandtech's article regarding Hyper-Q supporting Asynchronous Compute with Maxwell V2 is correct.

If we turn to Fable Legends (DX12) we have AMD GCN providing this result:
[image: Fable Legends DX12 queue capture, AMD GCN]


And Maxwell provides this result:
[image: Fable Legends DX12 queue capture, Maxwell]
 
Feb 19, 2009
10,457
10
76
Well, here's the proof that you can run compute and graphics together provided you're using CUDA (Batman Arkham Origins).

I wasn't aware you could run GPUView on a DX11 game and have it split up the queues in the DX12 fashion.

The question, though, is: if they can run in parallel, why does GPU PhysX hurt graphics performance so much? That goes against the point of async compute.
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
I wasn't aware you could run GPUView on a DX11 game and have it split up the queues in the DX12 fashion.

The question, though, is: if they can run in parallel, why does GPU PhysX hurt graphics performance so much? That goes against the point of async compute.

Possibly because nVidia doesn't have dedicated compute engines, so compute uses CU/CC resources that would normally be doing render work.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
I wasn't aware you could run GPUView on a DX11 game and have it split up the queues in the DX12 fashion.

The question, though, is: if they can run in parallel, why does GPU PhysX hurt graphics performance so much? That goes against the point of async compute.

I guess it depends on the title:
[images: GPU PhysX on/off performance results from two titles]


I wonder what the CPU usage is like for each of these titles. I also wonder how threaded they are and what DirectX level they were programmed for. It could be that properly threaded games take less of a hit than titles which only use 1-2 cores. I mean, multi-threaded command listing is only available in DX11 and has to be explicitly implemented both in the title and in the GPU driver.
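
"Multi-threaded command listing" in DX11 means deferred contexts: worker threads record command lists that the immediate context later replays, and the driver separately reports whether it supports that natively. A minimal sketch (illustrative names), assuming an existing ID3D11Device and immediate context:

Code:
#include <windows.h>
#include <d3d11.h>

// Sketch: check driver support, record on a deferred context, replay on the
// immediate context.
void RecordAndReplay(ID3D11Device* device, ID3D11DeviceContext* immediate)
{
    D3D11_FEATURE_DATA_THREADING threading = {};
    device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                &threading, sizeof(threading));
    // threading.DriverCommandLists == TRUE means the driver natively supports
    // command lists; otherwise the runtime emulates them, often with little benefit.

    ID3D11DeviceContext* deferred = nullptr;
    if (SUCCEEDED(device->CreateDeferredContext(0, &deferred)))
    {
        // ... record draw calls on 'deferred' from a worker thread ...
        ID3D11CommandList* cmdList = nullptr;
        deferred->FinishCommandList(FALSE, &cmdList);
        immediate->ExecuteCommandList(cmdList, TRUE);
        cmdList->Release();
        deferred->Release();
    }
}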

Other than that, I'm not sure why PhysX hurts performance as hard as it does in certain titles.
 