computerbase: Ashes of the Singularity Beta1 DirectX 12 Benchmarks

Bacon1 · Feb 16, 2016

NeoLuxembourg said:
Funny thing, nVidia's Vulkan driver only exposes one queue. So, no Async Compute supported on this new API. (http://vulkan.gpuinfo.org/displayreport.php?id=1)

Still waiting for AMD to release the Vulkan driver to check how many queues they expose.

Yep, looks like async compute is AMD only.

https://community.amd.com/community...on-gpus-are-ready-for-the-vulkan-graphics-api

Mahigan · Feb 16, 2016

Bacon1 said:
Yep, looks like async compute is AMD only.

https://community.amd.com/community...on-gpus-are-ready-for-the-vulkan-graphics-api

Looks like nvidia supports the "computer science" definition of Asynchronous compute (meaning the execution of compute workloads without a defined order) but not the AMD Mantle/DX12 definition that we've all become accustomed to (which adds Graphics and Copy workloads in concurrent execution with Compute).

Basically, nvidia doesn't support the performance-boosting aspects of what we've come to know as Asynchronous Compute. Nvidia's implementation boosts performance, but not by much.

Osjur · Feb 16, 2016

Good_fella said:
So Nvidia can do graphics, compute, transfer, sparse_binding in single queue while AMD has 3 queues with 3 different tasks. Why AMD is better?

Let's try to make this as simple as possible.

Nvidia : rendering a single frame serially because it can't run the queues concurrently.
Graphics queue = 10ms + Compute queue = 14ms + Copy queue = 16ms

AMD : rendering a single frame concurrently.
Graphics queue = 10ms + Compute queue = 10ms + Copy queue = 10ms = faster

So what happened? Well, AMD cards just ran the compute and copy queues on separate threads (compute engines) while running the graphics queue on the graphics engine.

Now these are most certainly exaggerated results, but they should give a hint as to why async compute is useful. Do more in less time...

Azix · Feb 16, 2016

sontin said:
We are talking here about a "low level" API. What you describe is something like OpenGL or DX11 where most of the work is done within the driver.

It's not that low-level. Developers could not manage that.

Mahigan · Feb 16, 2016

Good_fella said:
So Nvidia can do graphics, compute, transfer, sparse_binding in single queue while AMD has 3 queues with 3 different tasks. Why AMD is better?

In the Graphics queue (3D queue) NVIDIA executes all tasks sequentially like this:
Graphics --> Compute --> Copy --> Graphics = 20ms

AMD, on the other hand, would execute the first 3 concurrently (at the same time rather than one after the other) by separating the tasks into 3 queues. Then AMD would execute the last task. Like this:
Graphics + Compute + Copy --> Graphics = 16ms

That time savings (reduction in latency) = FPS boost.
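
A minimal sketch of that arithmetic. The per-task durations (8, 2, 2 and 8 ms) are assumptions picked only so the totals land on the 20ms and 16ms figures above; they are not measurements, and the model assumes perfect overlap with no contention for shared hardware.

```python
# Hypothetical per-task durations in ms, chosen to reproduce 20ms vs 16ms.
first_graphics, compute, copy, last_graphics = 8, 2, 2, 8

def serial_frame_ms(tasks):
    """Single queue: every task waits for the previous one to finish."""
    return sum(tasks)

def concurrent_frame_ms(overlapped, then):
    """Separate queues: overlapped tasks cost as much as the longest one,
    followed by whatever has to run afterwards."""
    return max(overlapped) + sum(then)

serial = serial_frame_ms([first_graphics, compute, copy, last_graphics])
concurrent = concurrent_frame_ms([first_graphics, compute, copy], [last_graphics])
print(serial, concurrent)  # 20 16
```

With these numbers the saving equals the compute and copy time that got hidden behind the first graphics task. If compute or copy took longer than that graphics task, the longest of the three would set the pace instead.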

Mahigan · Feb 16, 2016

sontin said:
We are talking here about a "low level" API. What you describe is something like OpenGL or DX11 where most of the work is done within the driver.

Vulkan and DX12 are lower level but not to-the-metal APIs.

The driver still translates the API commands into ISA commands (ISA being the language of a GPU) under DX12 and Vulkan.

The driver can thus still re-order commands or recompile commands. This is especially true for nvidia as, ever since Kepler, they've returned to static scheduling. A large segment of nvidia's scheduling hardware was removed and replaced with software in nvidia's driver.

Silverforce11 · Feb 16, 2016

Mahigan said:
In the Graphics queue (3D queue) NVIDIA executes all tasks sequentially like this:
Graphics --> Compute --> Copy --> Graphics = 20ms

AMD, on the other hand, would execute the first 3 concurrently (at the same time rather than one after the other) by separating the tasks into 3 queues. Then AMD would execute the last task. Like this:
Graphics + Compute + Copy --> Graphics = 16ms

That time savings (reduction in latency) = FPS boost.

It's a much easier analogy for folks to understand when we refer to CARS.

Let's say there are 3 road vehicles.

Cars = Graphics
Trucks = Compute
Bikes = Copy

They are driving from A to B. Their goal is B = frame finishes rendering.

In DX11, the approach is a one lane road that can accommodate all 3 vehicles, but only one vehicle type at a time can be on the road.

In DX12, with async compute capability, it allows multiple lanes. Some of the lanes are reserved only for Trucks & Bikes.

Now all traffic can flow at the same time if you schedule it properly: Cars go to the main lane, Trucks and Bikes go to the new lanes reserved for Compute and Copy tasks.

Now, on any given day (game), there may not be many Trucks*, it is just Cars. On this day, DX12 Async Compute (multi-lanes not used) does not have much benefit.

* Or rather, games may use compute, but it's not queued properly and gets sent as a graphics workload, so it goes into the Car lane as well.

You can visualize and understand why under scenarios of heavy compute/copy tasks, Async Compute in DX12 is a huge feature to have on the hardware.
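
The lane picture can be sketched as a toy model, assuming lanes run fully in parallel and work inside one lane runs back to back. The function name `frame_time_ms` and all durations are made up for illustration; real hardware shares shader units between lanes, so actual gains would be smaller.

```python
from collections import defaultdict

def frame_time_ms(submissions):
    """submissions: list of (lane, duration_ms).
    The frame is done when the busiest lane is done."""
    per_lane = defaultdict(int)
    for lane, duration in submissions:
        per_lane[lane] += duration
    return max(per_lane.values())

# Hypothetical workloads; durations are illustrative only.
one_lane = [("graphics", 10), ("graphics", 4), ("graphics", 2)]  # everything in the Car lane
multi_lane = [("graphics", 10), ("compute", 4), ("copy", 2)]     # Trucks and Bikes get their own lanes
misqueued = [("graphics", 10), ("graphics", 4), ("copy", 2)]     # compute sent as a graphics workload

print(frame_time_ms(one_lane))    # 16
print(frame_time_ms(multi_lane))  # 10
print(frame_time_ms(misqueued))   # 14
```

The third case is the footnote above: the extra lanes exist, but the Trucks were sent down the Car lane anyway, so most of the benefit is lost.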

Bacon1 · Feb 16, 2016

Think of it like a Relay race, except instead of having 1 baton AMD has 3 and their runners all start running their own segments at the start of the race, while Nvidia is passing theirs along.

Mahigan · Feb 16, 2016

Soon nvidia will make use of concurrent compute executions and copy executions. Only Graphics seems implausible.

So
compute + compute
And
compute + copy

Only Graphics + Compute appears to be a no no.

swilli89 · Feb 16, 2016

Think of the bits of data as stars in a galaxy. You have the whole universe, tittleman's crest, sector 9 what have you. The bottom line is, you don't want to put the universe in a tube.

airfathaaaaa · Feb 17, 2016

I think the best analogy for this is the following.
This is nvidia doing things concurrently:

This is nvidia when you introduce a graphics process along with the others at the same time:

Silverforce11 · Feb 17, 2016

https://community.amd.com/community...on-gpus-are-ready-for-the-vulkan-graphics-api

Only Radeon™ GPUs built on the GCN Architecture currently have access to a powerful capability known as asynchronous compute, which allows the graphics card to process 3D geometry and compute workloads in parallel. As an example, this would be useful when a game needs to calculate complex lighting and render characters at the same time.

I find it odd that AnandTech's Vulkan article didn't even mention that, they even linked their erroneous Async Shading article. -_-

@Mahigan
We know the Khronos consortium, whose current chief works for NVIDIA, expanded compatibility for various architectures. Yet Kepler/Maxwell still only has 1 lumped-up queue, unable to access the Async Shading features of Vulkan. It's not a DX12-specific implementation issue. It's simply a hardware incapability: graphics & compute are not allowed in the rendering engine at the same time.

Because Maxwell still has the single engine, it's not capable of this feature. When more info on Pascal is revealed, if it's got extra engines for compute, we will know it's hardware capable!
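
For anyone who wants to check a driver report themselves, here is a rough sketch of the test being discussed: does any queue family offer compute without graphics? The bit values are the `VkQueueFlagBits` constants from the Vulkan specification. The helper name `has_dedicated_compute_family` and both sample lists are hypothetical, modeled on the descriptions in this thread rather than copied from a real report.

```python
# Queue capability bits as defined by the Vulkan specification (VkQueueFlagBits).
VK_QUEUE_GRAPHICS_BIT = 0x1
VK_QUEUE_COMPUTE_BIT = 0x2
VK_QUEUE_TRANSFER_BIT = 0x4
VK_QUEUE_SPARSE_BINDING_BIT = 0x8

def has_dedicated_compute_family(families):
    """True if some queue family supports compute but not graphics."""
    return any(
        (flags & VK_QUEUE_COMPUTE_BIT) and not (flags & VK_QUEUE_GRAPHICS_BIT)
        for flags in families
    )

# Hypothetical report: one family that lumps every capability together.
single_lumped_queue = [
    VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT
    | VK_QUEUE_TRANSFER_BIT | VK_QUEUE_SPARSE_BINDING_BIT
]

# Hypothetical report: separate graphics, compute and transfer families.
three_families = [
    VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT | VK_QUEUE_TRANSFER_BIT,
    VK_QUEUE_COMPUTE_BIT | VK_QUEUE_TRANSFER_BIT,
    VK_QUEUE_TRANSFER_BIT,
]

print(has_dedicated_compute_family(single_lumped_queue))  # False
print(has_dedicated_compute_family(three_families))       # True
```

This only inspects what a driver reports. It says nothing about whether the hardware actually overlaps the work once it is submitted.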

sontin · Feb 17, 2016

Azix said:
It's not that low-level. Developers could not manage that.

It is that low. Memory management, control of workflow (number of draw calls, work within a draw call, etc.), multi-engine, etc. must be optimized for every underlying hardware if you want the best performance.

Mahigan said:
The driver can thus still re-order commands or recompile commands. This is especially true for nvidia as, ever since Kepler, they've returned to static scheduling. A large segment of nvidia's scheduling hardware was removed and replaced with software in nvidia's driver.

You should read a little bit more about DX12 and Vulkan. The driver doesn't do anything that is under the application's control. Otherwise DX12 and Vulkan wouldn't work.

airfathaaaaa · Feb 17, 2016

The driver translates what the program asks for into hardware commands... this doesn't mean that the driver is only there to do that... if that weren't the case, this whole thread would be nullified.

Krteq · Feb 17, 2016

sontin said:
You should read a little bit more about DX12 and Vulkan. The driver doesn't do anything that is under the application's control. Otherwise DX12 and Vulkan wouldn't work.

What the hell?!? There is nothing like "code to metal" in the DX12/Vulkan/Mantle/Metal APIs. There is still a need for HLSL coding, a driver exposing caps/features, a need to compile code, etc.

TheELF · Feb 17, 2016

Silverforce11 said:
https://community.amd.com/community...on-gpus-are-ready-for-the-vulkan-graphics-api

I find it odd that AnandTech's Vulkan article didn't even mention that, they even linked their erroneous Async Shading article. -_-

@Mahigan
We know the Khronos consortium, whose current chief works for NVIDIA, expanded compatibility for various architectures. Yet Kepler/Maxwell still only has 1 lumped-up queue, unable to access the Async Shading features of Vulkan. It's not a DX12-specific implementation issue. It's simply a hardware incapability: graphics & compute are not allowed in the rendering engine at the same time.

Because Maxwell still has the single engine, it's not capable of this feature. When more info on Pascal is revealed, if it's got extra engines for compute, we will know it's hardware capable!

What you (they) show in the picture is the basic DX12/Vulkan/low-level async stuff that almost any card can do, even Intel iGPUs.
Async compute, using graphics + compute, is just a feature where only GCN cards get a big boost.
But then again, gaining speed from doing graphics + compute in parallel only means that you cannot get all of your GPU's power when doing only one of them...

Silverforce11 · Feb 17, 2016

TheELF said:
What you (they) show in the picture is the basic DX12/Vulkan/low-level async stuff that almost any card can do, even Intel iGPUs.

How many times does AMD have to say that only GCN is capable of Async Shading in DX12/Vulkan before you people stop spreading such FUD?

sontin · Feb 17, 2016

Krteq said:
What the hell?!? There is nothing like "code to metal" in the DX12/Vulkan/Mantle/Metal APIs. There is still a need for HLSL coding, a driver exposing caps/features, a need to compile code, etc.

We aren't talking about "code to metal". DX12 and Vulkan are much lower level than DX11/OpenGL without extensions. The amount of work a developer has to do to get the same result is huge. The driver does less work and most of the work happens on the application side:

https://developer.nvidia.com/transitioning-opengl-vulkan

And this is from Croteam:

Q: That's better. Now, why was Vulkan support added to Talos?

A: Good question! We (Croteam) firmly believe that Vulkan is really the best low(est)-level API there can be. Fast and portable!
Yes, it has downside(s). For one, it's quite hard to program for. You have to do a lot of things manually, instead of relying on drivers to do the work for you. This is both good and bad at the same time. Good for performance reasons, because driver doesn't assume what game wants to render (I won't go into any more details here, sorry). Bad because there's a lot more coding and in general, it's a more complex approach. You better know what you're doing, because you won't get any help from the driver. You're on your own.
It's really great to have that much control. IF you know what you're doing!

zlatan · Feb 17, 2016

We should call these new APIs explicit and not low-level.

maddie · Feb 17, 2016

TheELF said:
What you (they) show in the picture is the basic DX12/Vulkan/low-level async stuff that almost any card can do, even Intel iGPUs.
Async compute, using graphics + compute, is just a feature where only GCN cards get a big boost.
But then again, gaining speed from doing graphics + compute in parallel only means that you cannot get all of your GPU's power when doing only one of them...

This is a very, VERY deep misunderstanding of what is happening.

TheELF · Feb 17, 2016

maddie said:
This is a very, VERY deep misunderstanding of what is happening.

Is it?
AMD's Stream Processors are shader units, right?
Async compute does its compute on these shader units, shader units that could be used (by a fast enough core = CPU overhead) to run the graphics workload faster.
It's great for slow moar coarz CPUs (consoles) because they will be able to get more out of the GPUs, but don't confuse AMD PR talk with what is relevant.

Read pages 2-3-4
http://amd-dev.wpengine.netdna-cdn....10/Asynchronous-Shaders-White-Paper-FINAL.pdf

SCHEDULING
A basic requirement for asynchronous shading is the ability of the GPU to schedule work from multiple queues of different types across the available processing resources. For most of their history, GPUs were only able to process one command stream at a time, using an integrated command processor. Dealing with multiple queues adds significant complexity. For example, when two tasks want to execute at the same time but need to share the same processing resources, which one gets to use them first?

Ola, "use them first"? Not in parallel? NO! Not in parallel (unless you lose graphics time to gain compute time, which would be a wash at best).
All GCN does is give the CPU access to the 8 ACEs "threads" so up to 8 cores can throw queue items at the graphics card.

The ACEs can operate in parallel with the graphics command processor and two DMA engines. The graphics command processor handles graphics queues, the ACEs handle compute queues, and the DMA engines handle copy queues. Each queue can dispatch work items without waiting for other tasks to complete, allowing independent command streams to be interleaved on the GPU’s Shader Engines and execute simultaneously.

This architecture is designed to increase utilization and performance by filling gaps in the pipeline, where the GPU would otherwise be forced to wait for certain tasks to complete before working on the next one in sequence. It still supports prioritization and pre-emption when required, but this will often not be necessary if a high priority task is also a relatively lightweight one. The ACEs are designed to facilitate context switching, reducing the associated performance overhead.

Only the queues work truly in parallel; if two tasks need the same resources (shaders), they will still have to be processed one after another, or lose speed as before.
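
Both readings can be put into one toy model, sketched here under loud assumptions: graphics work is a list of phases with a shader utilization between 0 and 1, compute work is measured as ms on an otherwise idle GPU, and async compute can only run in the idle share of each phase. The function name `frame_time_ms` and every number are hypothetical, not taken from the whitepaper.

```python
def frame_time_ms(graphics_phases, compute_ms, async_compute):
    """graphics_phases: list of (duration_ms, shader_utilization in 0..1).
    compute_ms: compute work, measured as ms on a fully idle GPU."""
    graphics_ms = sum(duration for duration, _ in graphics_phases)
    if not async_compute:
        # One after another: compute only starts once graphics is done.
        return graphics_ms + compute_ms
    # Async: compute fills whatever share of the shaders graphics leaves idle.
    idle_ms = sum(duration * (1.0 - util) for duration, util in graphics_phases)
    leftover_ms = max(0.0, compute_ms - idle_ms)
    return graphics_ms + leftover_ms

# Hypothetical frames; all numbers are illustrative.
gappy = [(4, 0.5), (6, 1.0)]   # first phase leaves half the shaders idle
saturated = [(10, 1.0)]        # graphics already uses every shader

print(frame_time_ms(gappy, 2, async_compute=False))      # 12
print(frame_time_ms(gappy, 2, async_compute=True))       # 10.0
print(frame_time_ms(saturated, 2, async_compute=True))   # 12.0
```

In this model the gain comes entirely from idle shader time. When graphics already saturates the shaders, async compute is a wash; when the pipeline has gaps, the compute work hides inside them.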

caswow · Feb 17, 2016

TheELF said:
Is it?
AMD's Stream Processors are shader units, right?
Async compute does its compute on these shader units, shader units that could be used (by a fast enough core = CPU overhead) to run the graphics workload faster.
It's great for slow moar coarz CPUs (consoles) because they will be able to get more out of the GPUs, but don't confuse AMD PR talk with what is relevant.

Read pages 2-3-4
http://amd-dev.wpengine.netdna-cdn....10/Asynchronous-Shaders-White-Paper-FINAL.pdf

So basically you're saying that nvidia doesn't need async and shouldn't waste their time on parallelism? Is that what you're saying? Async as a whole is only beneficial to slow CPU cores, or to drivers that overcome overhead better than others? Watch async become the single best invention since the discovery of electricity IF Pascal does it as well as AMD's GCN D:

Krteq · Feb 17, 2016

TheELF said:
Is it?
AMD's Stream Processors are shader units, right?
Async compute does its compute on these shader units, shader units that could be used (by a fast enough core = CPU overhead) to run the graphics workload faster.
It's great for slow moar coarz CPUs (consoles) because they will be able to get more out of the GPUs, but don't confuse AMD PR talk with what is relevant.

Read pages 2-3-4
http://amd-dev.wpengine.netdna-cdn....10/Asynchronous-Shaders-White-Paper-FINAL.pdf

Heh.

Just re-read the whole whitepaper, especially page 5

The ACEs can operate in parallel with the graphics command processor and two DMA engines. The graphics command processor handles graphics queues, the ACEs handle compute queues, and the DMA engines handle copy queues.

Each queue can dispatch work items without waiting for other tasks to complete, allowing independent command streams to be interleaved on the GPU’s Shader Engines and execute simultaneously.

TheELF · Feb 17, 2016

Krteq said:
Heh.

Just re-read the whole whitepaper, especially page 5

Yeah, I also edited my post.
"allowing independent command streams to be interleaved on the GPU’s Shader Engines and execute simultaneously."
Command streams are the streams of commands coming from the CPU, not the execution of them by the GPU.
Also, "interleaved" is explained in AMD's chapter on SCHEDULING.

TheELF · Feb 17, 2016

caswow said:
So basically you're saying that nvidia doesn't need async and shouldn't waste their time on parallelism? Is that what you're saying? Async as a whole is only beneficial to slow CPU cores, or to drivers that overcome overhead better than others? Watch async become the single best invention since the discovery of electricity IF Pascal does it as well as AMD's GCN D:

I am saying that they should test async on Athlon 5350 CPUs and see if nvidia gains any speed there...
