computerbaseAshes of the Singularity Beta1 DirectX 12 Benchmarks

Silverforce11 · Feb 6, 2016

How well GCN scales for DX12 will depend on how much compute is used that is asynchronous. If it's none or negligible, the ACEs are still idling and not contributing, it's essentially wasted die space.

Fable was noted by the developers to only utilize ~5% of the available compute in async mode (Joe Hruska posted on this). I don't know the figure for Ashes as the devs never gave a firm estimate.

What we do know, some console titles are pushing towards 20-30%. And it's because they need to, the hardware is pretty low powered so there is a need to extract as much peak performance from it as possible.

ThatBuzzkiller · Feb 6, 2016

sontin said:
Have you read what Microsoft wrote? I guess not.
Submitting the queues happen in a similiar way. But DX12 allows to fill those queues from different threads. This part has nothing to do with the underlaying hardware, it is just software.

Yes I did and Command Queues are NOT the same as immediate contexts! Is it so hard to accept that conclusion from Microsoft themselves ?

The front end is not responsible to fill lthe queues neither with the hardware scheduling of the workload.

Yes it is. Do you have ANY idea or knowledge about batching ?! You CAN be frontend limited if you don't submit enough triangles per draw. The graphics command processor and others like it is the one responsible for feeding these commands to the entire GPU.

This wasnt even part of the discussion or part of the posting i responded. :\

MAybe you should go back and re-read this again:
http://forums.anandtech.com/showpost.php?p=38011479&postcount=171

So if a developer just use the graphics queue AMD hardware cant use more than one CPU thread?
He hasnt understood what DX12 is doing. That's the reason why i said it doesnt make any sense.

Deferred Context =/= Multithreading (Period!)

I'll let the poster above in the quote speak for himself ...

ThatBuzzkiller · Feb 6, 2016

Silverforce11 said:
How well GCN scales for DX12 will depend on how much compute is used that is asynchronous. If it's none or negligible, the ACEs are still idling and not contributing, it's essentially wasted die space.

Fable was noted by the developers to only utilize ~5% of the available compute in async mode (Joe Hruska posted on this). I don't know the figure for Ashes as the devs never gave a firm estimate.

What we do know, some console titles are pushing towards 20-30%. And it's because they need to, the hardware is pretty low powered so there is a need to extract as much peak performance from it as possible.

You may not even need asynchronous compute at all if you ONLY use compute shaders for rendering! At that point VGPR usage count matters more ...

The point of asynchronous compute is to bypass bottlenecks in the graphics pipeline. If your done generating your depth buffer you can start rasterizing your shadow maps while using the compute shaders to render ambient occlusion or other things that come after it such as post processing, physics, and even volumetric lighting!

Silverforce11 · Feb 6, 2016

I've seen crazy talk of full compute rendering engines but let's be real, most engines will remain as is and any compute they do, they may not even make it run in async mode just because AMD doesn't have the marketshare to entice studios to go the extra step.

So if AMD wants that feature used, they better be sponsoring game developers for the PC port.

ThatBuzzkiller · Feb 6, 2016

Silverforce11 said:
I've seen crazy talk of full compute rendering engines but let's be real, most engines will remain as is and any compute they do, they may not even make it run in async mode just because AMD doesn't have the marketshare to entice studios to go the extra step.

So if AMD wants that feature used, they better be sponsoring game developers for the PC port.

I also foresee the graphics pipeline remaining for quite a while so that's where we are right now with asynchronous compute ...

garagisti · Feb 6, 2016

guskline said:
Question. Beside me, how many posters on this thread actually own a copy of Ashes of Singularity and have run the benchmarks? If so could you post your results please?

On my 5960x @4.4Ghz and a single EVGA GTX980TI SC (vcore 1102 vs 1000 on test model)
I had overall average frame rate of 54.8 for DX11 and 54.6 for DX12.

On my 4790k rig @4.7Ghz and 2 R9 290s (Sapphire OC Tri-Xs at 1000 core)
I had overall frame rate of 36.4 for DX11 and 39.4 for DX12.

Thank you.

Would you try the 290 (is that 290 or 290x, just to make sure) on the same rig with the 5960x?

Thanks.

Glo. · Feb 6, 2016

sontin said:
It is not meaningless. It is all about to fill the pipelines with work. AMD need a huge amount of workload to get past the geometry bottleneck. Async Shaders helps to fill the pipeline. But this comes with more work to do from a developers perspective. And even then the hardware isnt any better than nVidia. Fiji cant beat GM200. Tonga comes not even close to GM204.
Only Hawaii has a real chance because it is bruteforcing it way through the hardware bottleneck with nearly twice the power and 40% more compute performance.

That is Only your opinion, not a fact.

So far - Fury X in techpowerup recent reviews is faster in 11 out of 15 games games in 4K resolution than reference GTX 980 Ti. I will ask you this again: do you even know where Fiji is bottlenecked?

It is not about fighting between AMD and Nvidia. It is about how Nvidia hardware is not able to utilize anything that brings DX12 in context.

DownTheSky · Feb 6, 2016

guskline said:
Question. Beside me, how many posters on this thread actually own a copy of Ashes of Singularity and have run the benchmarks? If so could you post your results please?

On my 5960x @4.4Ghz and a single EVGA GTX980TI SC (vcore 1102 vs 1000 on test model)
I had overall average frame rate of 54.8 for DX11 and 54.6 for DX12.

On my 4790k rig @4.7Ghz and 2 R9 290s (Sapphire OC Tri-Xs at 1000 core)
I had overall frame rate of 36.4 for DX11 and 39.4 for DX12.

Thank you.

In Ashes multigpu doesn't work. So it's just ONE 290 vs 980ti

dogen1 · Feb 6, 2016

Mahigan said:
And it's not more CPU overhead, you see most of the commands are processed by the primary CPU thread under DX11. AMDs GCN is thus constrained by the performance of the CPUs single threaded performance under DX11. Since the scheduling hardware is on die with GCN, then the Graphics Command Processor is handling both compute and graphics commands. Nvidia have most of their scheduling components in software, with some remaining in hardware, Anandtech explains this here:

Mahigan said:
AMDs GCN uses hardware based scheduling, specifically the Graphics Command Processor. This processor can only execute 1 graphics or 1 compute command per clock. It is also not compatible with a multi-threaded driver. Because of this, AMD GCN cards cannot use more than one CPU thread for scheduling under DX11. So you end up with a high utilization on the primary CPU thread.

I've been wondering what the deal was with this for the longest time. Thank you for the information.

Any idea why nvidia is able to push ~50% more draw calls(going by the 3dmark draw call bench. 1.9M vs 1.3M), even in a single thread? Is it related to this? Or is their driver just plain faster?

Silverforce11 said:
How well GCN scales for DX12 will depend on how much compute is used that is asynchronous. If it's none or negligible, the ACEs are still idling and not contributing, it's essentially wasted die space.

Fable was noted by the developers to only utilize ~5% of the available compute in async mode (Joe Hruska posted on this). I don't know the figure for Ashes as the devs never gave a firm estimate.

What we do know, some console titles are pushing towards 20-30%.

20-30%? I know of a few games in development that do a lot more than that. Dreams is 100% compute, for example.

guskline · Feb 6, 2016

garagisti said:
Would you try the 290 (is that 290 or 290x, just to make sure) on the same rig with the 5960x?

Thanks.

Since they are all custom water cooled I won't move them from one machine to the other.

garagisti · Feb 7, 2016

guskline said:
Since they are all custom water cooled I won't move them from one machine to the other.

Oh, thank you for the update. Obviously they're different systems, so i thought i'd ask you to benchmark the cards on the same system. However, it seems to be too much work.

zlatan · Feb 7, 2016

sontin said:
If you offload dispatch calls to the compute queue, nVidia has no problem. This was verified by the guy from the beyond3d.com forum who wrote the Async Compute benchmark.

The problem is the mixed workload. NV don't support multi-engine concurrency in an advanced form (what D3D12 require), because the command lists that uses barriers have to be loaded to the main command engine, because the compute command engines don't allowed to block an operation on the GPU.

sontin said:
nVidia doesnt change anything. Execution of the compute queue happens in the graphics queue as a seperated queue.

That's what I'm talking about. NV don't able to execute a compute command list that uses barriers, in an independent compute command engine. Kepler and Maxwell have to load these tasks to the main engine.

sontin said:
It is a non problem on hardware which cant execute graphics and compute queues at the same time.

A Kepler/Maxwell multiprocessor can only able to execute one type of warp at the same time. Switching between these will require a full cache/scheduler flush in the multiprocessor.

zlatan · Feb 7, 2016

sontin said:
No, it is the job of the developer to implement it for every hardware in a way that it works.

Of course it is. That's why Oxide implemented an alternative codepath for Nv.

sontin said:
Oxide has demanded a low level API and more access to the GPUs. Blaming the hardware vendor when something doesnt work, it just an excuse.

Probably not the best thing for an ISV to blame an IHV for something publicly. Sure the devs are frustrated when a hardware is not really well designed for an important workload, but this doesn't matter as long as the ISV figures out a way to implement a codepath that works well.
Oxide was probably mad because NV attacked them at the press before the benchmark lauch. Intel and AMD was also angry for some workloads and optimizations, but they don't send letters to the press about it.

guskline · Feb 7, 2016

garagisti said:
Oh, thank you for the update. Obviously they're different systems, so i thought i'd ask you to benchmark the cards on the same system. However, it seems to be too much work.

Yes, sorry but it would require a drain of both systems and moving of one of the R9 - 290s to the 5960x rig, refilling , bleeding air etc. Then redrain and refill both systems. Not for the feighnt of heart!😱

Mahigan · Feb 7, 2016

zlatan said:
The problem is the mixed workload. NV don't support multi-engine concurrency in an advanced form (what D3D12 require), because the command lists that uses barriers have to be loaded to the main command engine, because the compute command engines don't allowed to block an operation on the GPU.

That's what I'm talking about. NV don't able to execute a compute command list that uses barriers, in an independent compute command engine. Kepler and Maxwell have to load these tasks to the main engine.

A Kepler/Maxwell multiprocessor can only able to execute one type of warp at the same time. Switching between these will require a full cache/scheduler flush in the multiprocessor.

Truth.

Kepler/Maxwell must have overlooked an API requirement in D3D12 because it works under CUDA.

This is likely why so many D3D12 titles have been pushed back and why Rise of the Tomb Raider was recoded to D3D11 rather than the D3D12 path used on the Xbox One (with Async Compute).

Mahigan · Feb 7, 2016

zlatan said:
Of course it is. That's why Oxide implemented an alternative codepath for Nv.

Probably not the best thing for an ISV to blame an IHV for something publicly. Sure the devs are frustrated when a hardware is not really well designed for an important workload, but this doesn't matter as long as the ISV figures out a way to implement a codepath that works well.
Oxide was probably mad because NV attacked them at the press before the benchmark lauch. Intel and AMD was also angry for some workloads and optimizations, but they don't send letters to the press about it.

Truth,

What NV did was supply optimized shaders to Oxide and then optimized the compiler, in their driver, so that it reordered the Grids (workloads) in order to maximize occupancy. That's why NV was able to obtain a boost with their drivers under AotS.

Glo. · Feb 7, 2016

I do not want to spawn fire, but, all what we are writing looks like absolute catastrophe for Nvidia GPUs.

And running it on CUDA is... well lets say stupid, the least.

Spjut · Feb 7, 2016

Mahigan said:
Truth.

Kepler/Maxwell must have overlooked an API requirement in D3D12 because it works under CUDA.

This is likely why so many D3D12 titles have been pushed back and why Rise of the Tomb Raider was recoded to D3D11 rather than the D3D12 path used on the Xbox One (with Async Compute).

Would Kepler/Maxwell have the same issue with Vulkan?

Mahigan · Feb 7, 2016

Spjut said:
Would Kepler/Maxwell have the same issue with Vulkan?

Not likely because of the open source nature of Vulcan and the fact that NVIDIA are now aware of this issue. Vulcan has likely been written in such a way that it will circumvent this issue. NVIDIA have been pretty vocal about their anticipation of Vucan whereas they tend to downplay the importance of DX12.

sontin · Feb 7, 2016

zlatan said:
The problem is the mixed workload. NV don't support multi-engine concurrency in an advanced form (what D3D12 require), because the command lists that uses barriers have to be loaded to the main command engine, because the compute command engines don't allowed to block an operation on the GPU.

nVidia and Intel support "multi-engine concurrency in an advanced form". There is no problem with Copy and Compute for example.

DX12 doesnt require support for "Asynchronous Shaders". It is just one possible workload of the multi engine concept.

Mahigan · Feb 7, 2016

sontin said:
nVidia and Intel support "multi-engine concurrency in an advanced form". There is no problem with Copy and Compute for example.

DX12 doesnt require support for "Asynchronous Shaders". It is just one possible workload of the multi engine concept.

Well,

It's pretty common knowledge amongst developers that NVIDIA are lacking in multi-engine support.

If you google "multi engine NVIDIA" you get this link:
http://ext3h.makegames.de/DX12_Compute.html

Have a read.

Mahigan · Feb 7, 2016

Many of us have been privied to off the record information from developers. We cannot disclose the sources for our information but a lot of it was discussed over at Overclock.net as well as Beyond3D.

sontin · Feb 7, 2016

You should tell this Microsoft:
Minute 43: https://channel9.msdn.com/Events/GDC/GDC-2015/Advanced-DirectX12-Graphics-and-Performance

Mahigan · Feb 7, 2016

sontin said:
nVidia and Intel support "multi-engine concurrency in an advanced form". There is no problem with Copy and Compute for example.

DX12 doesnt require support for "Asynchronous Shaders". It is just one possible workload of the multi engine concept.

Concurrently, yes. Sequentially no. But this is only true for D3D12 and NVIDIA.

Under CUDA, there are no issues whatsoever with multi-engine except if course that it is limited and less flexible than AMDs implementation.

AMD, hardware wise, are several years ahead of the competition. AMD software wise, are several years behind the competition. The exact opposite is true for NVIDIA.

Polaris vs Pascal should, in theory, expose AMDs advantages. This is especially true if OpenGPU takes off.

Mahigan · Feb 7, 2016

sontin said:
You should tell this Microsoft:
Minute 43: https://channel9.msdn.com/Events/GDC/GDC-2015/Advanced-DirectX12-Graphics-and-Performance

That is dated information. Those are the assumptions many of us had prior to the exposition of NVIDIA's hardware flaws under D3D12 back in August/Sept of 2015.

These assumptions are what Oxide were under when they began coding AotS until they ran into a wall leading to the Async Controversy last end of summer/fall.

Oxide had to implement a separate path for NVIDIA due to this flaw. This exposed the information, in the link you just provided, as being erroneous.

NVIDIA were either dishonest or were caught with their pants down. They likely believed their hardware could do what they informed Microsoft that it could do.

I prefer believing the latter as the former would be a hard pill to swallow.

computerbaseAshes of the Singularity Beta1 DirectX 12 Benchmarks

Lifer

Golden Member

Golden Member

Lifer

Golden Member

Senior member

Diamond Member

Senior member

Senior member

Diamond Member

Senior member

Senior member

Senior member

Diamond Member

Senior member

Senior member

Diamond Member

Senior member

Senior member

Diamond Member

Senior member

Senior member

Diamond Member

Senior member

Senior member