AMD
Compute engines can be used for multiple different purposes on GCN hardware:
- Long-running compute jobs can be offloaded to a compute queue. If a job is known to potentially waste a lot of time in stalls, it can be moved off busy queues. This has the added benefit of better shader utilization, as 3D and compute workloads can be interleaved at every level of the hardware, from the scheduler down to actual execution on the compute units.
- High-priority jobs can be scheduled to a dedicated compute queue. They will go into the next free execution slot on the corresponding ACE. They cannot preempt running shaders, but they will skip ahead of any queued ones. Set the priority on the compute queue accordingly to achieve this behaviour.
- Get around the execution slot limit. When executing compute shaders with tiny grids, down to the minimum of 64 threads per thread group, a single engine would underutilize the GPU. By utilizing all 8 ACEs together with the 3D engine, up to 640 active grids can be achieved on Fiji. This is precisely the upper occupation limit and maximizes utilization, even if each grid only yields a single wavefront. You should still prefer issuing fewer commands with larger grids instead; pushing the hardware to its limits like this can expose other unexpected bottlenecks.
- Create more back pressure. By providing additional jobs on a compute engine, the impact of blocking barriers in other queues can be avoided. Barriers or fences placed on other queues do not cause any interference.

GCN is also perfectly happy to accept compute commands in the 3D queue.
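The priority behaviour described above maps to the queue priority in the DX12 API. The following is a minimal sketch, not a complete implementation: it assumes a valid `ID3D12Device*` and omits `HRESULT` error handling.

```cpp
#include <d3d12.h>
#include <wrl/client.h>

// Sketch: create a dedicated high-priority compute queue. Work submitted
// here goes into the next free slot on the ACE, skipping ahead of queued
// (but not already running) shaders.
Microsoft::WRL::ComPtr<ID3D12CommandQueue>
CreateHighPriorityComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;    // compute engine, not the 3D engine
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;  // the priority setting mentioned above
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    Microsoft::WRL::ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));  // check the HRESULT in real code
    return queue;
}
```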
There is no penalty for mixing draw calls and compute commands in the 3D queue. In fact, compute commands have approximately the same performance as draw calls with proxy geometry¹⁰.
Compute commands should still be preferred for any non-geometry-related operation for practical reasons, such as access to local shared memory and a higher possible level of concurrency.
Offloading compute commands to the compute queue is a good opportunity to increase GPU utilization.
¹⁰ Proxy geometry refers to a technique where simple geometry, such as a single screen-filling square, is used to apply post-processing effects and the like to 2D buffers.
Nvidia
Due to the possible performance penalties from using compute commands concurrently with draw calls, compute queues should mostly be used to offload and execute compute commands in batch.
There are multiple points to consider when doing this:
- The workload on a single queue should always be sufficient to fully utilize the GPU. There is no parallelism between the 3D and the compute engine, so you should not try to split workload between regular draw calls and compute commands arbitrarily. Make sure to always properly batch both draw calls and compute commands. Pay close attention not to stall the GPU with solitary compute jobs limited by texture sample rate, memory latency or the like; other queues cannot become active as long as such a command is running.
- Compute commands should not be scheduled on the 3D queue. Doing so will hurt performance measurably: the 3D engine not only enforces sequential execution, but the reconfiguration of the SMM units impairs performance even further. Consider a draw call with proxy geometry instead when batching and offloading are not an option for you; this will still save a few microseconds compared to interleaving a compute command.
- Make 3D and compute sections long enough. Switching between compute and 3D queues results in a full flush of all pipelines, so the GPU should have spent enough time in one mode to justify the penalty for switching. Beware that there is no active preemption: a long-running shader in either engine will stall the transition.
- Despite the limitations, the use of compute shaders should still be considered. The reduced overhead and effectively higher level of concurrency compared to classic draw calls with proxy geometry can still yield remarkable performance gains.
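The batching advice above can be sketched with the DX12 API: record all compute work into a single command list, submit it to the compute queue in one go, and use a fence so the 3D queue only resumes once the batch has finished. This is an illustrative sketch, assuming valid, already-recorded objects; error handling is omitted.

```cpp
#include <d3d12.h>

// Sketch: one long compute section on the compute queue, followed by a
// fence that gates the 3D queue. Each engine thus spends enough time in
// one mode to amortize the pipeline flush on the switch.
void SubmitComputeBatch(ID3D12CommandQueue* computeQueue,
                        ID3D12CommandQueue* graphicsQueue,
                        ID3D12CommandList*  batchedComputeList,  // many dispatches recorded inside
                        ID3D12Fence*        fence,
                        UINT64              fenceValue)
{
    ID3D12CommandList* lists[] = { batchedComputeList };
    computeQueue->ExecuteCommandLists(1, lists);   // one large compute section in batch
    computeQueue->Signal(fence, fenceValue);       // mark the end of the batch
    graphicsQueue->Wait(fence, fenceValue);        // 3D work resumes only after the batch
}
```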
Additional care is required to cleanly separate the render pipeline into batches.
If async compute with support for high priority jobs and independent scheduling is a hard requirement, consider the use of CUDA for these jobs instead of the DX12 API.
With GK110 and later, CUDA bypasses the graphics command processor and is handled by a dedicated function unit in hardware which runs uncoupled from the regular compute or graphics engine. It even supports multiple asynchronous queues in hardware, as you would expect.
Ask your personal Nvidia engineer for how to share GPU side buffers between DX12 and CUDA.
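As an illustration of the CUDA route, the CUDA runtime exposes stream priorities directly on the host side. A minimal sketch, not verified against any particular toolkit version; buffer sharing with DX12 is deliberately left out, per the note above.

```cpp
#include <cuda_runtime.h>

// Sketch: create a high-priority CUDA stream. On GK110 and later these
// streams are scheduled by the dedicated hardware front end, independent
// of the graphics command processor.
cudaStream_t CreateHighPriorityStream()
{
    int lowest = 0, highest = 0;
    cudaDeviceGetStreamPriorityRange(&lowest, &highest);  // lower value = higher priority

    cudaStream_t stream = nullptr;
    cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, highest);
    return stream;  // launch high-priority kernels via <<<grid, block, 0, stream>>>
}
```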