computerbaseAshes of the Singularity Beta1 DirectX 12 Benchmarks

ThatBuzzkiller · Feb 5, 2016

I fully expect that we will be staying on 14/16nm for a long time so how well GPUs perform overtime will matter a lot more as far as perf/mm^2 ...

And with DX12 AMD should remain competitive in the foreseeable future ...

Glo. · Feb 5, 2016

sontin said:
GM200 is a better DX12 card than Fiji. GM204 is beating Tonga without problems. AMD has no advantage with DX12.
Looks like that AMD needs to catch up with them. D:

HSA? Is this not this thing nobody cares about?!

Sheer compute horsepower on Fiji is much higher than anything Nvidia offers.

That alone will make gigantic difference. The matter is this. Proper implementation of DX12 will make R9 390X on par with GTX 980 Ti. Just because both cards have so similar compute power. And GTX 980 Ti is not able to execute context switching properly. Other thing that Fiji is fundamentally flawed from design point of view, but still, it is capable of delivering. It will never however achieve its true, full potential.

About HSA. Everybody cares about HSA. If they would not they would completely ignore it. But they don't(mobile vendors, Apple, AMD, Nvidia, Intel - everybody does something about it).

ThatBuzzkiller · Feb 5, 2016

Glo. said:
Sheer compute horsepower on Fiji is much higher than anything Nvidia offers.

That alone will make gigantic difference. The matter is this. Proper implementation of DX12 will make R9 390X on par with GTX 980 Ti. Just because both cards have so similar compute power. And GTX 980 Ti is not able to execute context switching properly. Other thing that Fiji is fundamentally flawed from design point of view, but still, it is capable of delivering. It will never however achieve its true, full potential.

That's only if you consider the ideal case ...

And even then you would at least need 16 waves per CU to get full utilization ...

Glo. · Feb 5, 2016

I will put all this this way. Mastering the GCN architecture for DX11 was really hard. Mastering GCN architecture for DX12 is... much easier.

TheELF · Feb 5, 2016

You guys are talking as if every D3D12 game will use async extensively ,just look at Ps4 games there are what, 2-3 games with async in the 2 years the new gen consoles are out.

And that is if every game that comes out in the future will be D3D12 but it will take very long until a high percentage of games will be coming out in D3D12.

And for ashes in particular,don't confuse the specialized benchmark with actual gaming, one company being better than the other in the bench does not necessarily mean that the same will be true for gameplay.

As for HSA, 3d Xpoint will be coming out this summer in ssd/sata form so HSA might end up not being an advantage since next year any DDR4 system will be able to use that as mem/hdd.

Mahigan · Feb 5, 2016

Deders said:
Is this based on AOTS performance? The reason you see such an improvement with AMD cards and DX12 is down not needing as much CPU overhead. Nvidia already had this well optimised with DX11 drivers.

Not quite the same as DX12 performance should give, but enough to make a big difference.

Scheduling. NVIDIA have used software based scheduling ever since Kepler. This has allowed NVIDIA the capability of being able to multi-thread their scheduler in their driver. This was added to the NVIDIA driver at some point during the GTX 780 timeframe during a performance driver release. With this driver, NVIDIA GPUs are able to make use of some explicit features from within DX11 and thus use more than one CPU core to feed the GPU leading to lower CPU overhead on the first CPU core (or primary thread).

AMDs GCN uses hardware based scheduling, specifically the Graphics Command Processor. This processor can only execute 1 graphics or 1 compute command per clock. It is also not compatible with a multi-threaded driver. Because of this, AMD GCN cards cannot use more than one CPU thread for scheduling under DX11. So you end up with a high utilization on the primary CPU thread.

What does DX12 bring to the table? The ability to execute multiple compute commands per clock by using the ACEs through Asynchronous Computing.

In comes Polaris, notice something with Polaris? A brand new Graphics Command Processor. Guess what it can do? Yep...compatible with DX11 multi-threaded features.

So not only will Polaris incorporate improved SIMD units (and CUs in general), tesselation/geometry performance, caching and memory performance due to a new memory controller but also DX9/10/11 performance.

Add ACEs and you've got the potential for great DX12 performance too.

So it's not an AMD driver issue, it's an architectural limitation brought on by the lack of support for a DX11 feature by the Graphics Command Processor. A DX11 feature whose utility is apparent when looking at AMD vs. NVIDIA DX11 performance under Ashes of the Singularity.

ThatBuzzkiller · Feb 5, 2016

TheELF said:
You guys are talking as if every D3D12 game will use async extensively ,just look at Ps4 games there are what, 2-3 games with async in the 2 years the new gen consoles are out.

And that is if every game that comes out in the future will be D3D12 but it will take very long until a high percentage of games will be coming out in D3D12.

And for ashes in particular,don't confuse the specialized benchmark with actual gaming, one company being better than the other in the bench does not necessarily mean that the same will be true for gameplay.

As for HSA, 3d Xpoint will be coming out this summer in ssd/sata form so HSA might end up not being an advantage since next year any DDR4 system will be able to use that as mem/hdd.

There's a lot more PS4 games that use asynchronous compute more than you think ...

DICE is using asynchronous compute on ALL of their AAA games on PS4, all Infamous games on PS4 are using asynchronous compute, The Order 1886 is using asynchronous compute, The Tomorrow Children is using asynchronous compute, and other talented developers at Sony have already started using asynchronous compute too ...

Even Mark Cerny, lead PS4 architect has promoted asynchronous compute to the other teams at Sony ...

Eidos and Microsoft are already using asynchronous compute on the X1 ...

It'll only be a matter of time before other AAA 3rd party publishers PC teams will port the rendering back ends from the console teams ...

HSA is not about increasing the quality of memory subsystems, it's concerned with heterogeneous communication between different architectures!

Mahigan · Feb 5, 2016

ThatBuzzkiller said:
There's a lot more PS4 games that use asynchronous compute more than you think ... 
&nbsp;
DICE is using asynchronous compute on ALL of their AAA games on PS4, all Infamous games on PS4 are using asynchronous compute, The Order 1886 is using asynchronous compute, The Tomorrow Children is using asynchronous compute, and other talented developers at Sony have already started using asynchronous compute too ... 
&nbsp;
Even Mark Cerny, lead PS4 architect has promoted asynchronous compute to the other teams at Sony ...
&nbsp;
Eidos and Microsoft are already using asynchronous compute on the X1 ... 
&nbsp;
It'll only be a matter of time before other AAA 3rd party publishers PC teams will port the rendering back ends from the console teams ... 
&nbsp;
HSA is not about increasing the quality of memory subsystems, it's concerned with heterogeneous communication between different architectures!

and if HSA isn't important then why did NVIDIA hire the head of the HSA project and former VP at AMD? Because NVIDIA need HSA for their CUDA/ARM solutions. http://arstechnica.com/gadgets/2015/10/another-loss-for-amd-as-hsa-and-radeon-veteran-phil-rogers-joins-nvidia/

Silverforce11 · Feb 5, 2016

Genx87 said:
I am sure they are happy to know your concern for their purchase. The argument is tired and boring and predictable. Faux concern for cards and how they will perform in 2-3 years. When anybody buying high end gear wont stand for performance of said card in 2-3 years even if DX 12 never happened.

If you buy a GPU in 2015, in 2016, there will be DX12 games already, quite a few in fact.

It's not 2-3 years down the road, it's happening this year.

sontin · Feb 5, 2016

Mahigan said:
Scheduling. NVIDIA have used software based scheduling ever since Kepler. This has allowed NVIDIA the capability of being able to multi-thread their scheduler in their driver. This was added to the NVIDIA driver at some point during the GTX 780 timeframe during a performance driver release. With this driver, NVIDIA GPUs are able to make use of some explicit features from within DX11 and thus use more than one CPU core to feed the GPU leading to lower CPU overhead on the first CPU core (or primary thread).

So nVidia can lower the CPU overhead with more CPU overhead.

There is no "software scheduling". They just got rid of some part of the hardware scheduler. Most of the work happens on the gpu otherwise it wouldnt work.

They cant feed the GPU from more threads. This concept doesnt exist under DX11 and only with multithreading drivers it is possible in certain ways.

Mahigan · Feb 5, 2016

sontin said:
So nVidia can lower the CPU overhead with more CPU overhead.
There is no "software scheduling". They just got rid of some part of the hardware scheduler. Most of the work happens on the gpu otherwise it wouldnt work.

They cant feed the GPU from more threads. This concept doesnt exist under DX11 and only with multithreading drivers it is possible in certain ways.

Umm...

The*DirectX 11 API has the ability to use multiple CPU cores:

"DX11 adds multi-threading support that allows applications to simultaneously create resources or manage state and issue draw commands, all from an arbitrary number of threads. This may not significantly speed up the graphics subsystem (especially if we are already very GPU limited), but this does increase the ability to more easily explicitly massively thread a game and take advantage of the increasing number of CPU cores on the desktop."
http://www.anandtech.com/show/2716/3

DX11 adds multi-threaded capabilities to the pipeline when working with parallel loads:

" The major benefit I'm talking about here is multi-threading. Yes, eventually everything will need to be drawn, rasterized, and displayed (linearly and synchronously), but DX11 adds multi-threading support that allows applications to simultaneously create resources or manage state and issue draw commands, all from an arbitrary number of threads."
http://www.anandtech.com/show/2716/3

DirectX 11 adds the deferred context/command listing feature which allows for multi-workload management:

"A deferred contexts is a special ID3D11DeviceContext that can be called in parallel on a different thread than the main thread which is issuing commands to the immediate context. Unlike the immediate context, calls to a deferred contexts are not sent to the GPU at the time of call and must be marshalled into a command list which is then executed at a later date. It is also possible to execute a command list multiple times to replay a sequence of GPU work against different input data."
http://docs.nvidia.com/gameworks/c.../d3d_samples/d3d11deferredcontextssample.htm

Nvidia on the benefits of using deferred contexts under DX11:

" The entire reason for using or not using deferred contexts revolves around performance. There is a potential to parallelize CPU load onto idle CPU cores and improve performance.

You will be interested in using deferred context command lists if:
Your game is CPU bottlenecked.
You have a significant # of draw calls (>3000).
Your CPU bottleneck is from render thread load or Direct3D API calls.
You have a threaded renderer but serialize to a main render thread for mapping incurring sync point costs."
http://docs.nvidia.com/gameworks/c.../d3d_samples/d3d11deferredcontextssample.htm

So what are you talking about exactly?

TheELF · Feb 5, 2016

sontin said:
So nVidia can lower the CPU overhead with more CPU overhead.

It's not more overhead,it's the same overhead split up into more threads lowering the demand for speed from cores.

Mahigan · Feb 5, 2016

And it's not more CPU overhead, you see most of the commands are processed by the primary CPU thread under DX11. AMDs GCN is thus constrained by the performance of the CPUs single threaded performance under DX11. Since the scheduling hardware is on die with GCN, then the Graphics Command Processor is handling both compute and graphics commands. Nvidia have most of their scheduling components in software, with some remaining in hardware, Anandtech explains this here: http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3

" The end result is an interesting one, if only because by conventional standards its going in reverse. With GK104 NVIDIA is going*back*to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity has increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the codes efficiency. This in turn improves the performance of the processor.

However based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for few benefits. In particular, since Keplers math pipeline has a fixed latency, hardware scheduling of the instruction inside of a warp was redundant since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermis complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIAs compiler. In essence its a return to static scheduling."

Glo. · Feb 5, 2016

The scheduler on Nvidia hardware is called MegaThread Engine. The problem is: without software(drivers) in DX11 it completely does not know what to do.

Like I have written before. Nvidia went the route with drivers - they are handling both the software and hardware. Thats how they are able to get the most of DX11, apart from truly remarkable preemption and cache management, to feed the cores.

Everything for Nvidia hardware comes from software, really. Gaming drivers for DX11, CUDA software, all of this is done on software level. Even HSA 2.0 is done on Nvidia hardware through CUDA.

Mahigan · Feb 5, 2016

But Sontin has a point, under DX12... Nvidia's Kepler/Maxwell architectures become harder on the CPU under certain conditions which Ashes of the Singularity meet.

The tables turn, GCN is less taxing on the CPU as well under DX12.

Glo. · Feb 5, 2016

It is seen on 4K results from Ashes. Arstechnica some time ago tested it, and it was apparent that the more cores you have on CPU side, the better.

http://cdn.arstechnica.net/wp-conte...ew-chart-template-final-full-width-3.0021.png 4 cores, no HT
http://cdn.arstechnica.net/wp-conte...ew-chart-template-final-full-width-3.0011.png 6 cores, with HT. Look how it affects 4K resolution on Nvidia hardware.

sontin · Feb 6, 2016

Mahigan said:
So what are you talking about exactly?

Multithreading is part of the API and has nothing to do with the hardware. :\
AMD is free to support it, too.

ThatBuzzkiller · Feb 6, 2016

sontin said:
Multithreading is part of the API and has nothing to do with the hardware. :\
AMD is free to support it, too.

Very dependent on data set reuse and still requires submitting the contexts serially ...

Better off getting a bigger front end to improve GPU feeding than to directly support it ...

sontin · Feb 6, 2016

ThatBuzzkiller said:
Very dependent on data set reuse and still requires submitting the contexts serially ...

DX12 doesnt make a difference. Submitting happens "in the same way":

In D3D12 the concept of a command queue is the API representation of a roughly serial sequence of work submitted by the application. Barriers and other techniques allow this work to be executed in a pipeline or out of order, but the application only sees a single completion timeline. This corresponds to the immediate context in D3D11.

https://msdn.microsoft.com/de-de/library/windows/desktop/dn899217(v=vs.85).aspx

Better off getting a bigger front end to improve GPU feeding than to directly support it ...

"Feed the GPU" is part of the API.

ThatBuzzkiller · Feb 6, 2016

sontin said:
DX12 doesnt make a difference. Submitting happens "in the same way":

https://msdn.microsoft.com/de-de/library/windows/desktop/dn899217(v=vs.85).aspx

"Feed the GPU" is part of the API.

Umm, you didn't exactly read the entire D3D12 spec, did you ?

Command queues are very different from immediate contexts ...

Instead of a single immediate context, you are allowed to record for multiple command lists for better multi-threaded utilization. Command queues can be made more efficient with explicit synchronization too. There's THREE type's of command queue as well, one for 3D, one for compute, and one for copy can be ran at all times to exploit more concurrency. Then there's bundles too within the command lists for more reuse too ...

Then we have nice features like Resource Binding and ExecuteIndirect as well to do bindless!

DX12 gives an order of magnitude improvement when it comes to Multithreaded rendering. Deferred contexts on the other hand does not!

sontin · Feb 6, 2016

Execution of the queues happen in a similar way to DX11. Which brings us back to the point:
Feeding the GPU is part of the API.

The hardware scheduler is just scheduling the existing workload which already is on the GPU. It doesnt create anything (except the programmer wants this like Dynamic Parallelism).

So yeah, if a architecture couldnt support "Multithreading" under DX11 it wouldnt support it under DX12, too. But this is just nonsense and doesnt makes any sense.

guskline · Feb 6, 2016

Question. Beside me, how many posters on this thread actually own a copy of Ashes of Singularity and have run the benchmarks? If so could you post your results please?

On my 5960x @4.4Ghz and a single EVGA GTX980TI SC (vcore 1102 vs 1000 on test model)
I had overall average frame rate of 54.8 for DX11 and 54.6 for DX12.

On my 4790k rig @4.7Ghz and 2 R9 290s (Sapphire OC Tri-Xs at 1000 core)
I had overall frame rate of 36.4 for DX11 and 39.4 for DX12.

Thank you.

Glo. · Feb 6, 2016

sontin said:
DX12 doesnt make a difference. Submitting happens "in the same way":

https://msdn.microsoft.com/de-de/library/windows/desktop/dn899217(v=vs.85).aspx

"Feed the GPU" is part of the API.

I will ask you one question. Does all of what you are saying allow Out-of Order execution of pipeline on Nvidia hardware.

Simple answer: No. Because it IS serial architecture. Extremely fast in emptying the pipeline. But it will never be true Multithread/Out-of-Order execution. That is the point of Asynchronous Compute and Context Switching.

And you argue that it is meaningless.

ThatBuzzkiller · Feb 6, 2016

sontin said:
Execution of the queues happen in a similar way to DX11. Which brings us back to the point:
Feeding the GPU is part of the API.

The hardware scheduler is just scheduling the existing workload which already is on the GPU. It doesnt create anything (except the programmer wants this like Dynamic Parallelism).

So yeah, if a architecture couldnt support "Multithreading" under DX11 it wouldnt support it under DX12, too. But this is just nonsense and doesnt makes any sense.

Repeating "Feeding the GPU is part of the API" doesn't help further your point one bit when that is a trivial fact ...

Queues are NOT the same as immediate contexts, that is a FACT!

I am not talking about the hardware scheduler, I was focusing on the front end ...

And second it's called "deferred contexts", not "multithreading" so do no other options exist for D3D12 to achieve the same desired effect ? (rhetorical question for you)

It's painful to keep replying so please get your details right next time ...

sontin · Feb 6, 2016

ThatBuzzkiller said:
Repeating "Feeding the GPU is part of the API" doesn't help further your point one bit when that is a trivial fact ...

Queues are NOT the same as immediate contexts, that is a FACT!

Have you read what Microsoft wrote? I guess not.
Submitting the queues happen in a similiar way. But DX12 allows to fill those queues from different threads. This part has nothing to do with the underlaying hardware, it is just software.

I am not talking about the hardware scheduler, I was focusing on the front end ...

The front end is not responsible to fill lthe queues neither with the hardware scheduling of the workload.

And second it's called "deferred contexts", not "multithreading" so do no other options exist for D3D12 to achieve the same desired effect ? (rhetorical question for you)

It's painful to keep replying so please get your details right next time ...

This wasnt even part of the discussion or part of the posting i responded. :\

MAybe you should go back and re-read this again:

AMDs GCN uses hardware based scheduling, specifically the Graphics Command Processor. This processor can only execute 1 graphics or 1 compute command per clock. It is also not compatible with a multi-threaded driver. Because of this, AMD GCN cards cannot use more than one CPU thread for scheduling under DX11. So you end up with a high utilization on the primary CPU thread.

http://forums.anandtech.com/showpost.php?p=38011479&postcount=171

So if a developer just use the graphics queue AMD hardware cant use more than one CPU thread?

He hasnt understood what DX12 is doing. That's the reason why i said it doesnt make any sense.

Glo. said:
I will ask you one question. Does all of what you are saying allow Out-of Order execution of pipeline on Nvidia hardware.

Simple answer: No. Because it IS serial architecture. Extremely fast in emptying the pipeline. But it will never be true Multithread/Out-of-Order execution. That is the point of Asynchronous Compute and Context Switching.

And you argue that it is meaningless.

It is not meaningless. It is all about to fill the pipelines with work. AMD need a huge amount of workload to get past the geometry bottleneck. Async Shaders helps to fill the pipeline. But this comes with more work to do from a developers perspective. And even then the hardware isnt any better than nVidia. Fiji cant beat GM200. Tonga comes not even close to GM204.
Only Hawaii has a real chance because it is bruteforcing it way through the hardware bottleneck with nearly twice the power and 40% more compute performance.

computerbaseAshes of the Singularity Beta1 DirectX 12 Benchmarks

Golden Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Senior member

Golden Member

Senior member

Lifer

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Golden Member

Diamond Member