computerbase Ashes of the Singularity Beta1 DirectX 12 Benchmarks


Mahigan

Senior member
Aug 22, 2015
573
0
0
That's exactly what I said earlier

Think of a GPU as being two different pieces of hardware fused together.

1. A hardware accelerated rasterizer paired with fixed function units for transform and lighting of pixels, geometry operations and texture mapping.

2. A hardware accelerated parallel computation device.

Pixels are handled by the 1st piece of hardware as are textures and triangles. We often name this the GPU front end.

Pixel shading, post processing effects, lighting, physics are handled through mathematical algorithms in the 2nd piece of hardware.

The "threads", I mentioned, are part of the 2nd piece of hardware. The Pixels, are part of the first piece of hardware.

So the resolution has nothing to do with the computation threads.
 

parvadomus

Senior member
Dec 11, 2012
685
14
81
You guys are going too low-level, and everyone is making mistakes about how the archs and DX11 or DX12 work. For example, I'm 100% sure the AMD overhead in DX11 is not that their uarch has 52,662 or whatever more "threads" and it's harder to use them; otherwise lower-end AMD GPUs with fewer threads wouldn't have this problem, and that's not true.
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
Well, go ahead and give me your twisted, convoluted argument on why AMD loses perf relative to nVidia as resolution decreases. Why is it harder to do less?
At least losing performance when there is more data to work on is logical; losing less when there is more to do is a special kind of weird.

More API overhead because they don't do MTR. At lower res, CPU performance influences overall performance more.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
More API overhead because they don't do MTR. At lower res, CPU performance influences overall performance more.

AMD does support MTR

Using the DX11 Multithread sample:

Code:
    DEVICECONTEXT_IMMEDIATE,                // Traditional rendering, one thread, immediate device context
    DEVICECONTEXT_ST_DEFERRED_PER_SCENE,    // One thread, multiple deferred device contexts, one per scene 
    DEVICECONTEXT_MT_DEFERRED_PER_SCENE,    // Multiple threads, one per scene, each with one deferred device context
    DEVICECONTEXT_ST_DEFERRED_PER_CHUNK,    // One thread, multiple deferred device contexts, one per physical processor 
    DEVICECONTEXT_MT_DEFERRED_PER_CHUNK,    // Multiple threads, one per physical processor, each with one deferred device context

Immediate -> 8-9fps
ST Def / Scene -> 8.5-9.5 fps
MT Def / Scene -> 23-24fps
ST Def / Chunk -> 8-9 fps
MT Def / Chunk -> 19-20fps

https://code.msdn.microsoft.com/Direct3D-Multithreaded-d02193c0

Can we please dispel this myth now?
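Incidentally, whether the driver itself claims native command-list support (as opposed to the D3D11 runtime emulating deferred contexts in software) can also be queried from the API. Here's a minimal sketch of that check; this is my own code, not part of the MSDN sample, so treat it as illustrative:

Code:
    // Ask the D3D11 runtime whether the installed driver natively builds
    // deferred command lists, or whether the runtime emulates them.
    #include <d3d11.h>
    #include <cstdio>
    #pragma comment(lib, "d3d11.lib")

    int main()
    {
        ID3D11Device* device = nullptr;
        D3D_FEATURE_LEVEL level;
        if (FAILED(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                     nullptr, 0, D3D11_SDK_VERSION,
                                     &device, &level, nullptr)))
            return 1;

        D3D11_FEATURE_DATA_THREADING caps = {};
        device->CheckFeatureSupport(D3D11_FEATURE_THREADING, &caps, sizeof(caps));

        // DriverCommandLists == TRUE: the driver records command lists itself.
        // FALSE: the runtime emulates them (deferred contexts still work).
        std::printf("Concurrent creates: %d, driver command lists: %d\n",
                    caps.DriverConcurrentCreates, caps.DriverCommandLists);

        device->Release();
        return 0;
    }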
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
AMD does support MTR

Using the DX11 Multithread sample:

Code:
    DEVICECONTEXT_IMMEDIATE,                // Traditional rendering, one thread, immediate device context
    DEVICECONTEXT_ST_DEFERRED_PER_SCENE,    // One thread, multiple deferred device contexts, one per scene 
    DEVICECONTEXT_MT_DEFERRED_PER_SCENE,    // Multiple threads, one per scene, each with one deferred device context
    DEVICECONTEXT_ST_DEFERRED_PER_CHUNK,    // One thread, multiple deferred device contexts, one per physical processor 
    DEVICECONTEXT_MT_DEFERRED_PER_CHUNK,    // Multiple threads, one per physical processor, each with one deferred device context

Immediate -> 8-9fps
ST Def / Scene -> 8.5-9.5 fps
MT Def / Scene -> 23-24fps
ST Def / Chunk -> 8-9 fps
MT Def / Chunk -> 19-20fps

https://code.msdn.microsoft.com/Direct3D-Multithreaded-d02193c0

Can we please dispel this myth now?

I recall AMD saying they didn't support it because the gains aren't worth it. I've never seen them state otherwise. What are you showing above? Where is it from?
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
I recall AMD saying they didn't support it because the gains aren't worth it. I've never seen them state otherwise. What are you showing above? Where is it from?

From me running that code as I wrote the message. 290, 16.1.1.1 Feb 3 hotfix driver, Windows 10.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
So is it new for this release? Or have they been supporting it for a while?

It's been there for a while; not sure when, but it worked when I ran it a month or so ago as well. Don't feel like installing old drivers just to test, but someone else can feel free to :)
 

TheELF

Diamond Member
Dec 22, 2012
3,973
730
126
Think of a GPU as being two different pieces of hardware fused together.

1. A hardware accelerated rasterizer paired with fixed function units for transform and lighting of pixels, geometry operations and texture mapping.

2. A hardware accelerated parallel computation device.

Pixels are handled by the 1st piece of hardware as are textures and triangles. We often name this the GPU front end.

Pixel shading, post processing effects, lighting, physics are handled through mathematical algorithms in the 2nd piece of hardware.

The "threads", I mentioned, are part of the 2nd piece of hardware. The Pixels, are part of the first piece of hardware.

So the resolution has nothing to do with the computation threads.

So AMD has a rasterizer that is faster than nvidia's at 4K but slower than nvidia's at 1080p?

Or maybe the image/scene gets cut up into as many pieces as there are threads,
the scene gets calculated by the shaders/"threads",
and then the rasterizer only has to display/work on the final pixels?
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
It's been there for a while; not sure when, but it worked when I ran it a month or so ago as well. Don't feel like installing old drivers just to test, but someone else can feel free to :)

Just the way you said it, I thought you were insinuating they had always supported it (I know you didn't explicitly say that, which is why I wanted some qualification), but I know I had read they weren't really interested. Thus them wanting so badly to get away from DX11.

Thanks for the info.
 

Dygaza

Member
Oct 16, 2015
176
34
101
This guy said it best:

Alarchy said:
AMD DX11 drivers for all cards (GCN, Terascale) do not support multi-threaded command lists (an optional feature of DX11). Command lists are accepted and then single-threaded in the driver. This increases CPU overhead, and makes AMD cards highly reliant on fast IPC processors (which AMD processors are not). Nvidia implemented this shortly after the first DX11 game released (after spending two years on it) and saw immense performance gains and decreased CPU overhead.

AMD's performance boost in DX12 is because it now mandates multithreaded command lists and AMD was already working on a similar feature in Mantle. Basically, the boost AMD is seeing in DX12 is similar to the boost Nvidia saw with their Fermi cards in DX11.

Supporting multi-threaded rendering in DX11 gave Nvidia a huge performance advantage, and allowed them to slowly gut parts of their hardware and still keep competitive with AMD's drastically more powerful (hardware) cards. AMD GPUs have always been crippled by the lack of multithreaded command lists, but have been able to specifically optimize some games to lessen the impact (the performance increases seen over time).

Nvidia cards get barely any boost in DX12 because they were already supporting the feature of DX12 that makes it so fast - multithreaded rendering.

DX12 is finally exposing the true power of AMD GPUs, that was locked behind a single-threaded driver for years. If anything, their performance in DX12 is a testament to just how poor the DX11 driver was.

https://www.reddit.com/r/Amd/comments/3sm46y/we_should_really_get_amd_to_multithreaded_their/

Are multithreaded rendering and having multithreaded command lists really the same thing, or are we talking about two different things?

The 3DMark API tests, at least, don't show any performance increase from this. Also, as pointed out in many threads before, nvidia isn't just ahead in multithreaded; their singlethreaded performance is miles ahead of AMD's as well.
 

Paul98

Diamond Member
Jan 31, 2010
3,732
199
106
So AMD has a rasterizer that is faster than nvidia's at 4K but slower than nvidia's at 1080p?

Or maybe the image/scene gets cut up into as many pieces as there are threads,
the scene gets calculated by the shaders/"threads",
and then the rasterizer only has to display/work on the final pixels?

I am trying to figure out how you think 3d graphics work. I can read what you wrote in multiple ways.

Such as do you think that the rasterizer has "final pixels" that were calculated as described before? Or that the only work the rasterizer has to do is calculate what the final pixels are?

What are you talking about when you say "image/scene", or "shaders/threads"? What do you mean by "scene gets calculated"?
 

Tapoer

Member
May 10, 2015
64
3
36
So AMD has a rasterizer that is faster than nvidia's at 4K but slower than nvidia's at 1080p?

Or maybe the image/scene gets cut up into as many pieces as there are threads,
the scene gets calculated by the shaders/"threads",
and then the rasterizer only has to display/work on the final pixels?

Is it that hard for you to understand that at lower resolutions AMD GPUs are being bottlenecked by the CPU, mainly by draw calls in D3D11?

Low resolution --> CPU bottleneck
High resolution --> GPU bottleneck

The more CPU-heavy a game is, the worse AMD GPUs perform at lower resolutions, because the CPU cannot prepare frames fast enough for the GPU to draw; the GPU will sit idle more often.

This has nothing to do with how fast AMD GPUs really are hardware-wise.

Nvidia has fewer problems in D3D11 because their driver has lower CPU overhead.

For example, if a CPU is capable of 80 fps in a specific game (resolution doesn't change that), and the GPU can draw 200 fps at 720p, 100 fps at 1080p, 60 fps at 1440p and 30 fps at 4K, you will get:

720p --> ~80fps (CPU bottleneck)
1080p --> ~80fps (CPU bottleneck)
1440p --> ~60fps (GPU bottleneck)
4k --> ~30fps (GPU bottleneck)
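The same back-of-the-envelope model in code, with made-up numbers purely to illustrate the point (not measurements from any real game):

Code:
    // Illustrative only: delivered fps is roughly capped by the slower of CPU and GPU.
    #include <algorithm>
    #include <cstdio>

    int main()
    {
        const double cpu_fps = 80.0;  // what the CPU can prepare; resolution doesn't change it
        const struct { const char* res; double gpu_fps; } cases[] = {
            {"720p", 200.0}, {"1080p", 100.0}, {"1440p", 60.0}, {"4K", 30.0},
        };
        for (const auto& c : cases) {
            const double fps = std::min(cpu_fps, c.gpu_fps);
            std::printf("%-6s -> ~%.0f fps (%s bottleneck)\n", c.res, fps,
                        cpu_fps < c.gpu_fps ? "CPU" : "GPU");
        }
        return 0;
    }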
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,666
136
Is it that hard for you to understand that at lower resolutions AMD GPUs are being bottlenecked by the CPU, mainly by draw calls in D3D11?

Low resolution --> CPU bottleneck
High resolution --> GPU bottleneck

The more CPU-heavy a game is, the worse AMD GPUs perform at lower resolutions, because the CPU cannot prepare frames fast enough for the GPU to draw; the GPU will sit idle more often.

This has nothing to do with how fast AMD GPUs really are hardware-wise.

Nvidia has fewer problems in D3D11 because their driver has lower CPU overhead.

For example, if a CPU is capable of 80 fps in a specific game (resolution doesn't change that), and the GPU can draw 200 fps at 720p, 100 fps at 1080p, 60 fps at 1440p and 30 fps at 4K, you will get:

720p --> ~80fps (CPU bottleneck)
1080p --> ~80fps (CPU bottleneck)
1440p --> ~60fps (GPU bottleneck)
4k --> ~30fps (GPU bottleneck)
As an aside.

This is why I find it so misleading when you get all this "Your CPU is bottlenecking your GPU" advice without the poster having a clue about, or even asking about, the resolution in use.
 

Dygaza

Member
Oct 16, 2015
176
34
101
Also keep in mind that scenes in games vary a lot. In some games you are CPU-capped all the time (very rare), and in some games it's just one 5-second segment where you get a huge CPU bottleneck. These of course reflect directly in benchmark results.

Digital Foundry has some good videos on the AMD CPU bottleneck.

Here is one good example:

https://youtu.be/fAVxmfNUuRs?t=100
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
This guy said it best:
Exactly what I've communicated a few pages back. :)

Are multithreaded rendering and having multithreaded command lists really the same thing, or are we talking about two different things?

Multi-threaded command listing (DirectX runtime MT):
DirectX works by creating bundles (batches) of commands (command lists). These bundles or batches of commands are sent from the API to the graphics driver. The driver can perform some changes to these commands (shader replacements, reordering of commands, etc.) and then translates them into ISA (Instruction Set Architecture, the GPU's language) command lists (grids/threads) before sending them to the GPU for processing.

Multi-threaded command listing allows the DirectX driver to pre-record lists of commands on idling CPU cores. These lists of commands are then played back to the graphics driver using the CPU's primary core (thread 0). Why? The DirectX driver can only run on the primary CPU thread.


Multi-threaded rendering (DirectX runtime MT + DirectX driver MT):
Is more or less the same as above (the DirectX runtime can also scale past 4 cores) except for the last part: the DirectX driver doesn't need to play back the commands over the primary CPU thread; any CPU core/thread can talk directly to the graphics driver and send it its command lists. How? The DirectX driver is split amongst every CPU thread.


3dmark API tests atleast don't give any performance increase from this. Also as pointed out in many threads before, nvidia ain't just ahead in multithreaded, but their singlethreaded performance is miles ahead of AMD aswell.
NVIDIA's single-threaded DX11 performance is boosted by supporting multi-threaded command listing. So even though it is making use of single-threaded rendering, the command lists (DirectX runtime) are being processed by the available CPU threads (multi-threaded).

DX11 doesn't support multi-threaded rendering, so the performance is usually the same between single and multi-threading (or a bit faster with multi-threading, but the boost is negligible). This is because the DirectX driver only runs on the primary CPU thread under DX11.
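For anyone who wants to see the shape of that in code, here is a stripped-down sketch (my own simplification, not from any driver or engine) of the DX11 deferred-context path being described: recording of command lists can spread across spare CPU cores, but submission still funnels through the immediate context on a single thread.

Code:
    // DX11 multi-threaded command listing, roughly: worker threads record into
    // deferred contexts; the resulting command lists are played back serially
    // through the immediate context, which is the only path to the driver/GPU.
    #include <d3d11.h>
    #include <thread>
    #include <vector>

    void RecordScene(ID3D11DeviceContext* deferred, ID3D11CommandList** outList)
    {
        // ... state setup and Draw calls for one scene/chunk would go here ...
        deferred->FinishCommandList(FALSE, outList);   // bake into a command list
    }

    void RenderFrame(ID3D11Device* device, ID3D11DeviceContext* immediate, int workers)
    {
        std::vector<ID3D11DeviceContext*> deferred(workers, nullptr);
        std::vector<ID3D11CommandList*>   lists(workers, nullptr);
        std::vector<std::thread>          threads;

        for (int i = 0; i < workers; ++i)
            device->CreateDeferredContext(0, &deferred[i]);

        // Recording happens in parallel on idle CPU cores.
        for (int i = 0; i < workers; ++i)
            threads.emplace_back(RecordScene, deferred[i], &lists[i]);
        for (auto& t : threads)
            t.join();

        // Playback is serialized on one thread via the immediate context.
        for (int i = 0; i < workers; ++i) {
            immediate->ExecuteCommandList(lists[i], FALSE);
            lists[i]->Release();
            deferred[i]->Release();
        }
    }

Whether the translation work for those command lists then happens in the driver across threads (as described for NVIDIA above) or gets serialized onto the primary thread (as argued for AMD) is the part DX11 leaves up to the driver.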
 

Dygaza

Member
Oct 16, 2015
176
34
101
NVIDIA's single-threaded DX11 performance is boosted by supporting multi-threaded command listing. So even though it is making use of single-threaded rendering, the command lists (DirectX runtime) are being processed by the available CPU threads (multi-threaded).

DX11 doesn't support multi-threaded rendering, so the performance is usually the same between single and multi-threading (or a bit faster with multi-threading, but the boost is negligible). This is because the DirectX driver only runs on the primary CPU thread under DX11.

First of all, thanks for the good explanations.

Any reason AMD wouldn't benefit from using multi-threaded command listing? I remember reading about them testing it but not finding any significant improvements from it. Can't find the article, so it could all be my imagination as well.

Nvidia actually does get quite nice improvements in the 3DMark API test from MT compared to ST. But like I said, for us AMD users it's useless to drool over their MT numbers if we aren't even touching their ST numbers.

You shouldn't aim to be a long jumper if you can't run.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
First of all, thanks for the good explanations.

Any reason AMD wouldn't benefit from using multi-threaded command listing? I remember reading about them testing it but not finding any significant improvements from it. Can't find the article, so it could all be my imagination as well.

Nvidia actually does get quite nice improvements in the 3DMark API test from MT compared to ST. But like I said, for us AMD users it's useless to drool over their MT numbers if we aren't even touching their ST numbers.

You shouldn't aim to be a long jumper if you can't run.
I'm not sure, but almost a year ago AMD was looking to hire someone for that very task: https://www.linkedin.com/jobs2/view/31034254?trk=jobs_biz_pub

I've heard people make the claim that AMD tried implementing the feature but suffered negative scaling. I haven't seen anyone link any articles to back up that claim though.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Also this might interest you... http://www.pcgameshardware.com/aid,...h-Interview-What-DirectX-11-is-good-for/News/

Who is Dan Baker? Think Oxide/Kollock.

3) Do you use DX11 multithreading to reduce the CPU workload or another DirectX 11 feature? If there are special DirectX 11 visuals (except hardware tessellation), what are the graphical features that can only be rendered with shader model 5 hardware? What are the visual differences between the DX9 and DX11 version of Civilization 5?

Civilization V, as far as we know, is the first fully threaded DX11 game.

Unfortunately, because no other games have used this feature yet, neither Nvidia nor AMD have publically released threaded drivers, so users may not experience all the benefits just yet. We decided to keep threading enabled for Civilization V, however, because we are continuing to work closely with Nvidia and AMD on their support for multi-threading. We expect publically available threaded drivers shortly.

The internal architecture of the Civilization V graphics engine, however, is heavily multi-threaded and users will see multi-processor benefits even with drivers that are not threaded (including DX9). We have developed a series of configurable benchmark modes that we use internally for measuring our threading ability. These are fully described in the readme file. After some discussion, we decided to expose these internal tests on the released version so, if the users view the readme file, they will see that there are detailed instructions of these benchmark modes.

There are many notable improvements with the DX11 version of the game over the DX9 version of the game. Also, don't forget that the DX11 version includes all the DX10 features, so it has 2 generations of hardware features.

One big difference in the DX11 version is the terrain. On DX11, we are able to have a much more detailed terrain, since we are able to get a 4x compression on the textures. This allows us to keep most of the terrain cached. The DX9 players will notice some paging while they go to new parts of the map however; this is generally a non-issue on the DX11 version. We were also able to have specularity on the terrain, so players will notice marshes and snow that reflect the sun. The DX11 version also has a more detailed fog of war (on the higher settings), which uses weather simulation dynamics for a more realistic cloud movement.

Another huge visual improvement is our leader scenes. We have used a number of advanced features for lighting that are only available on DX10.1 and higher, and a number of these advanced features to give more realistic detail and correct shadowing.



4) From what DirectX 11 feature do you think your game profits most? What do you think about DirectX 11 in times of Cross-Platform-Development, can we expect more and more DX11 titles?

We benefit most from the DX11 Compute features. However, once Nvidia and AMD release threading enabled drivers, we expect the threading to be the biggest single benefit. We understand that many of our customers are hardware enthusiasts and want their games to use the latest technology. Since DirectX11 is leaps and bounds above the capabilities of current consoles it can be difficult to be cross platform and take advantage of the new capabilities. Fortunately, because we are a PC only game, this wasn't a concern for us. We can't speak for everyone, but we expect to see DX11 rapidly become the standard for gaming.

He was right: DX11 rapidly became the standard. Not only did he make the first DX11 game, he's now making the first DX12 game.

:)

CIV5 is the first game to exhibit this behaviour we've been discussing. Watch as an R9 290 starts dead last and, as the resolution rises, surpasses the GTX 970 and GTX 780 Ti. The R9 290X starts off more or less tied with a GTX 970, but as the resolution rises it distances itself from the GTX 970, catches up to the GTX 980 and finally surpasses it:

CIV 5 - Beyond Earth:
[benchmark charts at increasing resolutions]


This is why, for AMD, DirectX 12 is key.

1) DX12 removes AMD's multi-threaded rendering handicap and allows their GPUs to perform to their rendering architecture's fullest.

2) Asynchronous compute + graphics is a bonus, allowing AMD GCN to make better use of its idling compute resources.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
I've heard people make the claim that AMD tried implementing the feature but suffered negative scaling. I haven't seen anyone link any articles to back up that claim though.

As I showed above, the multithreading does work correctly. But I've seen that in draw-call-intensive scenes like the 3DMark API test, multithreading is slightly slower than single-threading. Some site tested Tomb Raider (not sure what patch, or if it was pre-launch day) and found that single core was fastest for AMD as well. So there is probably some additional threading overhead that makes threaded slower when a single core can already fill the GPU, but threaded is faster (as I showed) when a single core can't fill the GPU.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
As I showed above, the multithreading does work correctly. But I've seen that in draw-call-intensive scenes like the 3DMark API test, multithreading is slightly slower than single-threading. Some site tested Tomb Raider (not sure what patch, or if it was pre-launch day) and found that single core was fastest for AMD as well. So there is probably some additional threading overhead that makes threaded slower when a single core can already fill the GPU, but threaded is faster (as I showed) when a single core can't fill the GPU.
And since AMD hasn't been able to rectify the issue in software, it probably has something to do with hardware.

This is why I've hypothesized that it may have something to do with their Command Processor.

I can speculate that it could also be AMD's DirectX driver, under DX11, which incurs a larger overhead than NVIDIA's because it must re-batch NVIDIA-optimized code (32-thread kernels) into batches of 64 threads (one wavefront per kernel) in order to obtain better SIMD utilization (compute). Imagine re-batching 4-8 (depending on how many cores are involved) 32-thread command lists, when using multi-threaded command listing, into batches of 64 threads on the primary CPU thread. That would incur a heck of a lot of API overhead and would show up under CPU-bound conditions. (It wouldn't affect your Microsoft test because Microsoft likely included both NVIDIA and AMD vendor-specific paths in that test.)
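To put numbers on the SIMD-utilization part of that speculation (and only that part; this says nothing about what the driver actually does), here is the occupancy arithmetic for warp-sized versus wavefront-sized batches:

Code:
    // Occupancy arithmetic only: a 64-wide GCN wavefront executing a 32-thread
    // group leaves half of its SIMD lanes idle unless the work is repacked.
    #include <cstdio>

    int main()
    {
        const int wavefront_width = 64;        // GCN wavefront size
        const int group_sizes[]   = {32, 64};  // warp-sized vs wavefront-sized batches

        for (int threads : group_sizes) {
            const int wavefronts = (threads + wavefront_width - 1) / wavefront_width;
            const double util    = 100.0 * threads / (wavefronts * wavefront_width);
            std::printf("%2d-thread batch -> %d wavefront(s), %.0f%% lane utilization\n",
                        threads, wavefronts, util);
        }
        return 0;
    }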

Since the resolution doesn't affect the number of batches, scaling up the resolution would give us a better idea of AMD's and NVIDIA's true GPU strengths.

For all intents and purposes, Fiji is likely a match for a GTX 980 Ti, hardware-wise, when both are running reference clocks under optimized code.

DirectX 12 would alleviate this by spreading the DirectX driver across many cores. Hence AMD no longer incurs a CPU-bound problem under DX12 at low resolutions.

Just a thought, because when AMD partners with a game developer, this happens (Hitman BETA, DX11):

[Hitman beta DX11 benchmark charts]


Evidently, once Hitman releases with DX12 and asynchronous compute, GCN will perform admirably in this title.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Hmm, interesting how much the 980 Ti loses from low -> medium; anyone know what all changes there? Don't have the beta myself. It's a bigger hit than medium -> high though (Fury goes from behind to ahead @ 1080p there, then back to tied @ high).

But yeah, can't wait for some DX12 and Vulkan optimized games; get rid of this DX11 driver overhead please and let my card shine :)
 

Dygaza

Member
Oct 16, 2015
176
34
101
It's also very hard to understand how the new, improved command processor in Polaris could have so much of an effect that it would remove the CPU bottleneck, when it is physically located on the other side of the PCI-E lanes. The only way I can think of is that the current command processor is the bottleneck, but that doesn't explain why CPU clock speed has an almost linear effect on fps in those scenes.
 

Leadbox

Senior member
Oct 25, 2010
744
63
91
DX12 is such a new API that most of the companies starting with it really have to learn a lot of things. Folks like Oxide already got a really nice head start from Mantle.

I think if a certain IHV fully supported DX12 as their marketing material suggests, we would be there already.