computerbase: Ashes of the Singularity Beta1 DirectX 12 Benchmarks


Mahigan

Senior member
Aug 22, 2015
573
0
0
Looks like Nvidia supports the "computer science" definition of asynchronous compute (meaning the execution of compute workloads without a defined order) but not the AMD Mantle/DX12 definition that we've all become accustomed to (which adds concurrent execution of Graphics and Copy workloads alongside Compute).

Basically, Nvidia doesn't support the performance-boosting aspects of what we've come to know as Asynchronous Compute. Nvidia's implementation boosts performance, but not by much.
 

Osjur

Member
Sep 21, 2013
92
19
81
So Nvidia can do graphics, compute, transfer, sparse_binding in a single queue while AMD has 3 queues with 3 different tasks. Why is AMD better?

Let's try to make this as simple as possible.

Nvidia: rendering a single frame serially because it can't run the queues concurrently.
Graphics queue = 10ms + Compute queue = 14ms + Copy queue = 16ms

AMD: rendering a single frame concurrently.
Graphics queue = 10ms + Compute queue = 10ms + Copy queue = 10ms = faster

So what happened? Well, AMD cards just ran the compute and copy queues on separate threads (compute engines) while running the graphics queue on the graphics engine.

Now these are most certainly exaggerated results, but they should give a hint as to why async compute is useful. Do more in less time...
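
For anyone who wants the arithmetic spelled out, here is a minimal Python sketch of the idea. The 10 ms per-queue costs are made-up numbers (an assumption, not a measurement), and the "concurrent" case assumes the queues overlap perfectly with no contention for shared hardware, which real GPUs will not fully achieve.

```python
# Hypothetical per-queue costs in milliseconds (assumed, not measured).
queue_ms = {"graphics": 10.0, "compute": 10.0, "copy": 10.0}


def serial_frame_ms(costs):
    """One queue at a time: the frame costs the sum of all queues."""
    return sum(costs.values())


def concurrent_frame_ms(costs):
    """Idealized overlap: the frame costs only as much as the slowest queue."""
    return max(costs.values())


print(serial_frame_ms(queue_ms))      # 30.0
print(concurrent_frame_ms(queue_ms))  # 10.0
```

The gap between the two numbers is the best case; any sharing of shader units between queues would shrink it.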
 

Azix

Golden Member
Apr 18, 2014
1,438
67
91
We are talking here about a "low level" API. What you describe is something like OpenGL or DX11 where most of the work is done within the driver.

It's not that low-level. Developers could not manage that.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
So Nvidia can do graphics, compute, transfer, sparse_binding in a single queue while AMD has 3 queues with 3 different tasks. Why is AMD better?
In the Graphics queue (3D queue) NVIDIA executes all tasks sequentially like this:
Graphics --> Compute --> Copy --> Graphics = 20ms

AMD, on the other hand, would execute the first 3 concurrently (at the same time rather than one after the other) by separating the tasks into 3 queues. Then AMD would execute the last task. Like this:
Graphics + Compute + Copy --> Graphics = 16ms

That time savings (reduction in latency) = FPS boost.
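
A rough sketch of how those two totals could come about. The individual task durations below are hypothetical, picked only so the sums land on the 20 ms and 16 ms figures above; nothing here is measured on real hardware.

```python
# Hypothetical task durations in ms, chosen only to reproduce the
# 20 ms and 16 ms totals quoted above.
FIRST_GRAPHICS, COMPUTE, COPY, LAST_GRAPHICS = 8.0, 2.0, 2.0, 8.0


def sequential_ms():
    # Graphics --> Compute --> Copy --> Graphics
    return FIRST_GRAPHICS + COMPUTE + COPY + LAST_GRAPHICS


def overlapped_ms():
    # (Graphics + Compute + Copy) --> Graphics
    # The first three run together, so that stage costs only its slowest task.
    return max(FIRST_GRAPHICS, COMPUTE, COPY) + LAST_GRAPHICS


def fps(frame_ms):
    # Frames per second for a given frame time in milliseconds.
    return 1000.0 / frame_ms


print(sequential_ms(), fps(sequential_ms()))  # 20.0 50.0
print(overlapped_ms(), fps(overlapped_ms()))  # 16.0 62.5
```

With these assumed durations the 4 ms saved per frame is worth 12.5 FPS.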
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
We are talking here about a "low level" API. What you describe is something like OpenGL or DX11 where most of the work is done within the driver.
Vulkan and DX12 are lower level, but they are not to-the-metal APIs.

The driver still translates the API commands into ISA commands (ISA being the language of a GPU) under DX12 and Vulkan.

The driver can thus still re-order commands or recompile commands. This is especially true for Nvidia as, ever since Kepler, they've returned to static scheduling. A large segment of Nvidia's scheduling hardware was removed and replaced with software in Nvidia's driver.
 
Feb 19, 2009
10,457
10
76
In the Graphics queue (3D queue) NVIDIA executes all tasks sequentially like this:
Graphics --> Compute --> Copy --> Graphics = 20ms

AMD, on the other hand, would execute the first 3 concurrently (at the same time rather than one after the other) by separating the tasks into 3 queues. Then AMD would execute the last task. Like this:
Graphics + Compute + Copy --> Graphics = 16ms

That time savings (reduction in latency) = FPS boost.

It's a much easier analogy for folks to understand when we refer to CARS.

Let's say there are 3 types of road vehicles.

Cars = Graphics
Trucks = Compute
Bikes = Copy

They are driving from A to B. Their goal is B = frame finishes rendering.

In DX11, the approach is a one lane road that can accommodate all 3 vehicles, but only one vehicle type at a time can be on the road.

In DX12, with async compute capability, it allows multiple lanes. Some of the lanes are reserved only for Trucks & Bikes.

Now all traffic can flow at the same time if you schedule it properly: Cars go to the main lane, while Trucks and Bikes go to the new lanes reserved for Compute tasks.

Now, on any given day (game), there may not be many Trucks*, it is just Cars. On this day, DX12 Async Compute (multi-lanes not used) does not have much benefit.

* Or rather, games may use compute, but if it's not queued properly and gets sent as a graphics workload, it goes into the Car lane as well.

You can visualize and understand why under scenarios of heavy compute/copy tasks, Async Compute in DX12 is a huge feature to have on the hardware.
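
To put rough numbers on the lanes picture, here is a small Python sketch. All the millisecond values are invented for illustration, and the multi-lane case assumes perfect overlap, so treat the percentages as an upper bound rather than a prediction.

```python
def async_gain(cars_ms, trucks_ms, bikes_ms):
    """Return (one_lane_ms, multi_lane_ms, percent_saved) for one frame.

    cars = graphics, trucks = compute, bikes = copy (all values assumed).
    """
    one_lane = cars_ms + trucks_ms + bikes_ms        # one vehicle type at a time
    multi_lane = max(cars_ms, trucks_ms, bikes_ms)   # idealized: lanes fully overlap
    saved = 100.0 * (one_lane - multi_lane) / one_lane
    return one_lane, multi_lane, saved


# A day with almost no Trucks: little to gain from the extra lanes.
print(async_gain(cars_ms=15.0, trucks_ms=0.5, bikes_ms=0.5))  # (16.0, 15.0, 6.25)

# A day with heavy compute/copy traffic: the extra lanes matter a lot.
print(async_gain(cars_ms=10.0, trucks_ms=6.0, bikes_ms=4.0))  # (20.0, 10.0, 50.0)
```

The footnote case (compute submitted as graphics work) would correspond to adding the Truck time to `cars_ms` and passing `trucks_ms=0.0`, which removes most of the gain.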
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Think of it like a relay race, except instead of having 1 baton, AMD has 3 and its runners all start running their own segments at the start of the race, while Nvidia is passing its single baton along.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Soon Nvidia will make use of concurrent compute and copy execution. Only Graphics seems implausible.

So
compute + compute
And
compute + copy

Only Graphics + Compute appears to be a no-no.
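
Written as a lookup, the expectation above might look like the sketch below. This only encodes the claim in this post; it is not taken from any Nvidia documentation, and the names `ALLOWED` and `may_overlap` are made up for the example.

```python
# Workload pairs this post expects to run concurrently (the poster's claim only).
ALLOWED = {
    frozenset(["compute"]),          # compute + compute
    frozenset(["compute", "copy"]),  # compute + copy
}


def may_overlap(a, b):
    """True if workloads a and b are expected to overlap under this claim."""
    return frozenset([a, b]) in ALLOWED


print(may_overlap("compute", "compute"))   # True
print(may_overlap("copy", "compute"))      # True
print(may_overlap("graphics", "compute"))  # False
```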
 

swilli89

Golden Member
Mar 23, 2010
1,558
1,181
136
Think of the bits of data as stars in a galaxy. You have the whole universe, tittleman's crest, sector 9 what have you. The bottom line is, you don't want to put the universe in a tube.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
I think the best analogy for this is the following.
This is Nvidia doing things concurrently:
tollbooth-600px.jpg

This is Nvidia when you introduce a graphics process along with the others at the same time:
Damn_2.jpg
 
Feb 19, 2009
10,457
10
76
https://community.amd.com/community...on-gpus-are-ready-for-the-vulkan-graphics-api

Only Radeon™ GPUs built on the GCN Architecture currently have access to a powerful capability known as asynchronous compute, which allows the graphics card to process 3D geometry and compute workloads in parallel. As an example, this would be useful when a game needs to calculate complex lighting and render characters at the same time.

Capture.PNG


I find it odd that AnandTech's Vulkan article didn't even mention that, they even linked their erroneous Async Shading article. -_-

@Mahigan
We know the Khronos consortium, whose current chief works for NVIDIA, expanded compatibility for various architectures. Yet Kepler/Maxwell still have only 1 lumped-together queue and are unable to access the Async Shading features of Vulkan. It's not a DX12-specific implementation issue. It's simply a hardware incapability: graphics and compute are not allowed in the rendering engine at the same time.

Because Maxwell still has the single engine, it's not capable of this feature. When more info on Pascal is revealed, if it has extra engines for compute, we will know it's hardware capable!
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
It's not that low-level. Developers could not manage that.

It is that low. Memory management, control of workflow (number of draw calls, work within a draw call, etc.), multi-engine, etc. must be optimized for every underlying hardware if you want the best performance.

The driver can thus still re-order commands or recompile commands. This is especially true for Nvidia as, ever since Kepler, they've returned to static scheduling. A large segment of Nvidia's scheduling hardware was removed and replaced with software in Nvidia's driver.

You should read a little bit more about DX12 and Vulkan. The driver doesn't do anything that is under the application's control. Otherwise DX12 and Vulkan wouldn't work.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
The driver translates what the program asks for into hardware commands... but this doesn't mean that the driver is there just to do that. If it were, this whole thread would be moot.
 

Krteq

Golden Member
May 22, 2015
1,007
719
136
You should read a little bit more about DX12 and Vulkan. The driver doesn't do anything that is under the application's control. Otherwise DX12 and Vulkan wouldn't work.
What the hell?!? There is nothing like "code to metal" in the DX12/Vulkan/Mantle/Metal APIs. There is still a need for HLSL coding, a driver exposing caps/features, a need to compile code, etc.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
https://community.amd.com/community...on-gpus-are-ready-for-the-vulkan-graphics-api



Capture.PNG


I find it odd that AnandTech's Vulkan article didn't even mention that, they even linked their erroneous Async Shading article. -_-

@Mahigan
We know the Khronos consortium, whose current chief works for NVIDIA, expanded compatibility for various architectures. Yet Kepler/Maxwell still have only 1 lumped-together queue and are unable to access the Async Shading features of Vulkan. It's not a DX12-specific implementation issue. It's simply a hardware incapability: graphics and compute are not allowed in the rendering engine at the same time.

Because Maxwell still has the single engine, it's not capable of this feature. When more info on Pascal is revealed, if it has extra engines for compute, we will know it's hardware capable!

What you (they) show in the picture is the basic DX12/Vulkan/low-level async stuff that almost any card can do, even Intel iGPUs.
Async compute, using graphics + compute, is just a feature where only GCN cards get a big boost.
But then again, gaining speed from doing graphics + compute in parallel only means that you cannot get all of your GPU's power when doing only one of them...
 
Feb 19, 2009
10,457
10
76
What you (they) show in the picture is the basic DX12/Vulkan/low-level async stuff that almost any card can do, even Intel iGPUs.

How many times does AMD have to say that only GCN is capable of Async Shading in DX12/Vulkan before you people stop spreading such FUD?
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
What the hell?!? There is nothing like "code to metal" in the DX12/Vulkan/Mantle/Metal APIs. There is still a need for HLSL coding, a driver exposing caps/features, a need to compile code, etc.

We aren't talking about "code to metal". DX12 and Vulkan are much lower level than DX11/OpenGL without extensions. The amount of work a developer has to do to get the same result is huge. The driver does less work and most of the work happens on the application side:
vulkan_gltransition_maintenance2.png

https://developer.nvidia.com/transitioning-opengl-vulkan

And this is from Croteam:
Q: That's better. Now, why was Vulkan support added to Talos?

A: Good question! We (Croteam) firmly believe that Vulkan is really the best low(est)-level API there can be. Fast and portable!
Yes, it has downside(s). For one, it's quite hard to program for. You have to do a lot of things manually, instead of relying on drivers to do the work for you. This is both good and bad at the same time. Good for performance reasons, because driver doesn't assume what game wants to render (I won't go into any more details here, sorry). Bad because there's a lot more coding and in general, it's a more complex approach. You better know what you're doing, because you won't get any help from the driver. You're on your own.
It's really great to have that much control. IF you know what you're doing!
 

maddie

Diamond Member
Jul 18, 2010
5,147
5,523
136
What you (they) show in the picture is the basic DX12/Vulkan/low-level async stuff that almost any card can do, even Intel iGPUs.
Async compute, using graphics + compute, is just a feature where only GCN cards get a big boost.
But then again, gaining speed from doing graphics + compute in parallel only means that you cannot get all of your GPU's power when doing only one of them...
This is a very, VERY deep misunderstanding of what is happening.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
This is a very, VERY deep misunderstanding of what is happening.
Is it?
AMD's Stream Processors are shader units, right?
Async compute does its compute on these shader units, shader units that could be used (by a fast enough core = CPU overhead) to run the graphics workload faster.
It's great for slow "moar coarz" CPUs (consoles) because they will be able to get more out of the GPUs, but don't confuse AMD PR talk with what is relevant.

Read pages 2-3-4
http://amd-dev.wpengine.netdna-cdn....10/Asynchronous-Shaders-White-Paper-FINAL.pdf
SCHEDULING
A basic requirement for asynchronous shading is the ability of the GPU to schedule work from multiple queues of different types across the available processing resources. For most of their history, GPUs were only able to process one command stream at a time, using an integrated command processor. Dealing with multiple queues adds significant complexity. For example, when two tasks want to execute at the same time but need to share the same processing resources, which one gets to use them first?
Ola, "use them first"? Not in parallel? NO! Not in parallel (unless you lose graphics time to gain compute time, which would be a wash at best).
All GCN does is give the CPU access to the 8 ACEs "threads" so up to 8 cores can throw queue items at the graphics card.
The ACEs can operate in parallel with the graphics command processor and two DMA engines. The graphics command processor handles graphics queues, the ACEs handle compute queues, and the DMA engines handle copy queues. Each queue can dispatch work items without waiting for other tasks to complete, allowing independent command streams to be interleaved on the GPU’s Shader Engines and execute simultaneously.

This architecture is designed to increase utilization and performance by filling gaps in the pipeline, where the GPU would otherwise be forced to wait for certain tasks to complete before working on the next one in sequence. It still supports prioritization and pre-emption when required, but this will often not be necessary if a high priority task is also a relatively lightweight one. The ACEs are designed to facilitate context switching, reducing the associated performance overhead.
Only the queues work truly in parallel; if two tasks need the same resources (shaders) they will still have to be processed one after another, or lose speed as before.
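
A toy Python model of the "filling gaps" argument, to make the disagreement concrete. It assumes (my simplification, not something the whitepaper states) that compute work can only use shader time that graphics leaves idle, and that `graphics_utilization` is the fraction of shader capacity graphics keeps busy. All numbers are hypothetical.

```python
def frame_ms(graphics_ms, compute_ms, graphics_utilization):
    """Estimate frame time when compute may only run in the gaps graphics leaves.

    graphics_utilization: fraction (0..1) of shader capacity used by graphics.
    """
    idle_ms = graphics_ms * (1.0 - graphics_utilization)  # spare shader time
    hidden = min(compute_ms, idle_ms)    # compute absorbed into the gaps
    leftover = compute_ms - hidden       # compute that still runs afterwards
    return graphics_ms + leftover


# Graphics already saturates the shaders: no gain, same as running serially.
print(frame_ms(10.0, 3.0, 1.0))   # 13.0

# Graphics leaves half the shader time idle: the compute work is fully hidden.
print(frame_ms(10.0, 3.0, 0.5))   # 10.0

# Partial gaps: only some of the compute work is hidden.
print(frame_ms(10.0, 4.0, 0.75))  # 11.5
```

Under this model both sides have a point: there is no free parallelism when the shaders are already fully busy, but any idle time in the graphics workload translates directly into a shorter frame.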
 

caswow

Senior member
Sep 18, 2013
525
136
116
Is it?
AMD's Stream Processors are shader units, right?
Async compute does its compute on these shader units, shader units that could be used (by a fast enough core = CPU overhead) to run the graphics workload faster.
It's great for slow "moar coarz" CPUs (consoles) because they will be able to get more out of the GPUs, but don't confuse AMD PR talk with what is relevant.

Read pages 2-3-4
http://amd-dev.wpengine.netdna-cdn....10/Asynchronous-Shaders-White-Paper-FINAL.pdf

So basically you're saying that Nvidia doesn't need async and shouldn't waste its time on parallelism? Is that what you're saying? That async as a whole is only beneficial for slow CPU cores, or for drivers that overcome overhead better than others? Watch async become the single best invention since the discovery of electricity IF Pascal does it as well as AMD's GCN D:
 

Krteq

Golden Member
May 22, 2015
1,007
719
136
Is it?
AMD's Stream Processors are shader units, right?
Async compute does its compute on these shader units, shader units that could be used (by a fast enough core = CPU overhead) to run the graphics workload faster.
It's great for slow "moar coarz" CPUs (consoles) because they will be able to get more out of the GPUs, but don't confuse AMD PR talk with what is relevant.

Read pages 2-3-4
http://amd-dev.wpengine.netdna-cdn....10/Asynchronous-Shaders-White-Paper-FINAL.pdf
Heh.

Just re-read the whole whitepaper, especially page 5 :rolleyes:
The ACEs can operate in parallel with the graphics command processor and two DMA engines. The graphics command processor handles graphics queues, the ACEs handle compute queues, and the DMA engines handle copy queues.

Each queue can dispatch work items without waiting for other tasks to complete, allowing independent command streams to be interleaved on the GPU’s Shader Engines and execute simultaneously.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
Heh.

Just re-read the whole whitepaper, especially page 5 :rolleyes:

Yeah, I also edited my post.
"allowing independent command streams to be interleaved on the GPU’s Shader Engines and execute simultaneously."
Command streams are the streams of commands coming from the CPU, not the execution of them by the GPU.
Also, "interleaved" is explained in AMD's chapter on SCHEDULING.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
So basically you're saying that Nvidia doesn't need async and shouldn't waste its time on parallelism? Is that what you're saying? That async as a whole is only beneficial for slow CPU cores, or for drivers that overcome overhead better than others? Watch async become the single best invention since the discovery of electricity IF Pascal does it as well as AMD's GCN D:

I am saying that they should test async on Athlon 5350 CPUs and see if Nvidia gains any speed there...