computerbase Ashes of the Singularity Beta1 DirectX 12 Benchmarks


Mahigan

Senior member
Aug 22, 2015
573
0
0
That's exactly what I said earlier

Think of a GPU as being two different pieces of hardware fused together.

1. A hardware accelerated rasterizer paired with fixed function units for transform and lighting of pixels, geometry operations and texture mapping.

2. A hardware accelerated parallel computation device.

Pixels are handled by the 1st piece of hardware as are textures and triangles. We often name this the GPU front end.

Pixel shading, post processing effects, lighting, physics are handled through mathematical algorithms in the 2nd piece of hardware.

The "threads", I mentioned, are part of the 2nd piece of hardware. The Pixels, are part of the first piece of hardware.

So the resolution has nothing to do with the computation threads.
 

parvadomus

Senior member
Dec 11, 2012
685
14
81
You guys are going too low-level, and everyone is making mistakes about how the archs and DX11 or DX12 work. For example, I'm 100% sure the AMD overhead in DX11 is not that their uarch has 52,662 or whatever more "threads" and it's harder to use them; otherwise lower-end AMD GPUs with fewer threads wouldn't have this problem, and that's not true.
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
Well, go ahead and give me your twisted, convoluted argument on why AMD loses perf relative to nVidia as resolution decreases. Why is it harder to do less?
At least losing performance when there is more data to work on is logical; losing less when there is more to do is a special kind of weird.

More API overhead because they don't do MTR. At lower res, CPU performance influences overall performance more.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
More API overhead because they don't do MTR. At lower res, CPU performance influences overall performance more.

AMD does support MTR

Using the DX11 Multithread sample:

Code:
    DEVICECONTEXT_IMMEDIATE,                // Traditional rendering, one thread, immediate device context
    DEVICECONTEXT_ST_DEFERRED_PER_SCENE,    // One thread, multiple deferred device contexts, one per scene 
    DEVICECONTEXT_MT_DEFERRED_PER_SCENE,    // Multiple threads, one per scene, each with one deferred device context
    DEVICECONTEXT_ST_DEFERRED_PER_CHUNK,    // One thread, multiple deferred device contexts, one per physical processor 
    DEVICECONTEXT_MT_DEFERRED_PER_CHUNK,    // Multiple threads, one per physical processor, each with one deferred device context

Immediate -> 8-9fps
ST Def / Scene -> 8.5-9.5 fps
MT Def / Scene -> 23-24fps
ST Def / Chunk -> 8-9 fps
MT Def / Chunk -> 19-20fps

https://code.msdn.microsoft.com/Direct3D-Multithreaded-d02193c0

Can we please dispel this myth now?
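Incidentally, whether the driver itself claims native command-list support (as opposed to the D3D11 runtime emulating deferred contexts in software) can also be queried from the API. Here's a minimal sketch of that check; this is my own code, not part of the MSDN sample, so treat it as illustrative:

Code:
    // Ask the D3D11 runtime whether the installed driver natively builds
    // deferred command lists, or whether the runtime emulates them.
    #include <d3d11.h>
    #include <cstdio>
    #pragma comment(lib, "d3d11.lib")

    int main()
    {
        ID3D11Device* device = nullptr;
        D3D_FEATURE_LEVEL level;
        if (FAILED(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                     nullptr, 0, D3D11_SDK_VERSION,
                                     &device, &level, nullptr)))
            return 1;

        D3D11_FEATURE_DATA_THREADING caps = {};
        device->CheckFeatureSupport(D3D11_FEATURE_THREADING, &caps, sizeof(caps));

        // DriverCommandLists == TRUE: the driver records command lists itself.
        // FALSE: the runtime emulates them (deferred contexts still work).
        std::printf("Concurrent creates: %d, driver command lists: %d\n",
                    caps.DriverConcurrentCreates, caps.DriverCommandLists);

        device->Release();
        return 0;
    }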
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
AMD does support MTR

Using the DX11 Multithread sample:

Code:
    DEVICECONTEXT_IMMEDIATE,                // Traditional rendering, one thread, immediate device context
    DEVICECONTEXT_ST_DEFERRED_PER_SCENE,    // One thread, multiple deferred device contexts, one per scene 
    DEVICECONTEXT_MT_DEFERRED_PER_SCENE,    // Multiple threads, one per scene, each with one deferred device context
    DEVICECONTEXT_ST_DEFERRED_PER_CHUNK,    // One thread, multiple deferred device contexts, one per physical processor 
    DEVICECONTEXT_MT_DEFERRED_PER_CHUNK,    // Multiple threads, one per physical processor, each with one deferred device context

Immediate -> 8-9fps
ST Def / Scene -> 8.5-9.5 fps
MT Def / Scene -> 23-24fps
ST Def / Chunk -> 8-9 fps
MT Def / Chunk -> 19-20fps

https://code.msdn.microsoft.com/Direct3D-Multithreaded-d02193c0

Can we please dispel this myth now?

I recall AMD saying they didn't support it because the gains aren't worth it. I've never seen them state otherwise. What are you showing above? Where is it from?
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
I recall AMD saying they didn't support it because the gains aren't worth it. I've never seen them state otherwise. What are you showing above? Where is it from?

From me running that code as I wrote the message. 290, 16.1.1.1 Feb 3 hotfix driver, Windows 10.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
So is it new for this release? Or have they been supporting it for a while?

It's been there for a while; not sure when, but it worked when I ran it a month or so ago as well. Don't feel like installing old drivers just to test, but someone else can feel free to :)
 

TheELF

Diamond Member
Dec 22, 2012
3,973
730
126
Think of a GPU as being two different pieces of hardware fused together.

1. A hardware accelerated rasterizer paired with fixed function units for transform and lighting of pixels, geometry operations and texture mapping.

2. A hardware accelerated parallel computation device.

Pixels are handled by the 1st piece of hardware as are textures and triangles. We often name this the GPU front end.

Pixel shading, post processing effects, lighting, physics are handled through mathematical algorithms in the 2nd piece of hardware.

The "threads", I mentioned, are part of the 2nd piece of hardware. The Pixels, are part of the first piece of hardware.

So the resolution has nothing to do with the computation threads.

So AMD has a rasterizer that is faster than nvidia's at 4K but slower than nvidia's at 1080p?

Or maybe the image/scene gets cut up into as many pieces as there are threads,
the scene gets calculated by the shaders/"threads",
and then the rasterizer only has to display/work on the final pixels?
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
It's been there for a while; not sure when, but it worked when I ran it a month or so ago as well. Don't feel like installing old drivers just to test, but someone else can feel free to :)

Just the way you said it, I thought you were insinuating they had always supported it (I know you didn't explicitly say that, which is why I wanted some qualification), but I know I had read they weren't really interested. Thus them wanting so badly to get away from DX11.

Thanks for the info.
 

Dygaza

Member
Oct 16, 2015
176
34
101
This guy said it best:

Alarchy said:
AMD DX11 drivers for all cards (GCN, Terascale) do not support multi-threaded command lists (an optional feature of DX11). Command lists are accepted and then single-threaded in the driver. This increases CPU overhead, and makes AMD cards highly reliant on fast IPC processors (which AMD processors are not). Nvidia implemented this shortly after the first DX11 game released (after spending two years on it) and saw immense performance gains and decreased CPU overhead.

AMD's performance boost in DX12 is because it now mandates multithreaded command lists and AMD was already working on a similar feature in Mantle. Basically, the boost AMD is seeing in DX12 is similar to the boost Nvidia saw with their Fermi cards in DX11.

Supporting multi-threaded rendering in DX11 gave Nvidia a huge performance advantage, and allowed them to slowly gut parts of their hardware and still keep competitive with AMD's drastically more powerful (hardware) cards. AMD GPUs have always been crippled by the lack of multithreaded command lists, but have been able to specifically optimize some games to lessen the impact (the performance increases seen over time).

Nvidia cards get barely any boost in DX12 because they were already supporting the feature of DX12 that makes it so fast - multithreaded rendering.

DX12 is finally exposing the true power of AMD GPUs, that was locked behind a single-threaded driver for years. If anything, their performance in DX12 is a testament to just how poor the DX11 driver was.

https://www.reddit.com/r/Amd/comments/3sm46y/we_should_really_get_amd_to_multithreaded_their/

Are multithreaded rendering and having multithreaded command lists really the same thing, or are we talking about two different things?

The 3DMark API tests, at least, don't show any performance increase from this. Also, as pointed out in many threads before, nvidia isn't just ahead in multithreaded; their singlethreaded performance is miles ahead of AMD's as well.
 

Paul98

Diamond Member
Jan 31, 2010
3,732
199
106
So AMD has a rasterizer that is faster than nvidia's at 4K but slower than nvidia's at 1080p?

Or maybe the image/scene gets cut up into as many pieces as there are threads,
the scene gets calculated by the shaders/"threads",
and then the rasterizer only has to display/work on the final pixels?

I am trying to figure out how you think 3d graphics work. I can read what you wrote in multiple ways.

Such as do you think that the rasterizer has "final pixels" that were calculated as described before? Or that the only work the rasterizer has to do is calculate what the final pixels are?

What are you talking about when you say "image/scene", or "shaders/threads"? What do you mean by "scene gets calculated"?
 

Tapoer

Member
May 10, 2015
64
3
36
So AMD has a rasterizer that is faster than nvidia's at 4K but slower than nvidia's at 1080p?

Or maybe the image/scene gets cut up into as many pieces as there are threads,
the scene gets calculated by the shaders/"threads",
and then the rasterizer only has to display/work on the final pixels?

Is it that hard for you to understand that at lower resolutions AMD GPUs are being bottlenecked by the CPU, mainly by draw calls in D3D11?

Low resolution --> CPU bottleneck
High resolution --> GPU bottleneck

The more CPU-heavy a game is, the worse AMD GPUs perform at lower resolutions, because the CPU cannot prepare frames fast enough for the GPU to draw; the GPU will sit idle more often.

This has nothing to do with how fast AMD GPUs really are hardware-wise.

Nvidia has fewer problems in D3D11 because their driver has lower CPU overhead.

For example, if a CPU is capable of 80 fps in a specific game (resolution doesn't change that), and the GPU can draw 200 fps at 720p, 100 fps at 1080p, 60 fps at 1440p and 30 fps at 4K, you will get:

720p --> ~80fps (CPU bottleneck)
1080p --> ~80fps (CPU bottleneck)
1440p --> ~60fps (GPU bottleneck)
4k --> ~30fps (GPU bottleneck)
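The same back-of-the-envelope model in code, with made-up numbers purely to illustrate the point (not measurements from any real game):

Code:
    // Illustrative only: delivered fps is roughly capped by the slower of CPU and GPU.
    #include <algorithm>
    #include <cstdio>

    int main()
    {
        const double cpu_fps = 80.0;  // what the CPU can prepare; resolution doesn't change it
        const struct { const char* res; double gpu_fps; } cases[] = {
            {"720p", 200.0}, {"1080p", 100.0}, {"1440p", 60.0}, {"4K", 30.0},
        };
        for (const auto& c : cases) {
            const double fps = std::min(cpu_fps, c.gpu_fps);
            std::printf("%-6s -> ~%.0f fps (%s bottleneck)\n", c.res, fps,
                        cpu_fps < c.gpu_fps ? "CPU" : "GPU");
        }
        return 0;
    }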
 

maddie

Diamond Member
Jul 18, 2010
4,738
4,666
136
Is it that hard for you to understand that at lower resolutions AMD GPUs are being bottlenecked by the CPU, mainly by draw calls in D3D11?

Low resolution --> CPU bottleneck
High resolution --> GPU bottleneck

The more CPU-heavy a game is, the worse AMD GPUs perform at lower resolutions, because the CPU cannot prepare frames fast enough for the GPU to draw; the GPU will sit idle more often.

This has nothing to do with how fast AMD GPUs really are hardware-wise.

Nvidia has fewer problems in D3D11 because their driver has lower CPU overhead.

For example, if a CPU is capable of 80 fps in a specific game (resolution doesn't change that), and the GPU can draw 200 fps at 720p, 100 fps at 1080p, 60 fps at 1440p and 30 fps at 4K, you will get:

720p --> ~80fps (CPU bottleneck)
1080p --> ~80fps (CPU bottleneck)
1440p --> ~60fps (GPU bottleneck)
4k --> ~30fps (GPU bottleneck)
As an aside.

This is why I find it so misleading when you get all this "Your CPU is bottlenecking your GPU" advice without the poster having a clue about, or even asking about, the resolution in use.
 

Dygaza

Member
Oct 16, 2015
176
34
101
Also keep in mind that scenes in games vary a lot. In some games you are CPU-capped all the time (very rare), and in some games it's just one 5-second segment where you get a huge CPU bottleneck. These of course reflect directly in benchmark results.

Digital Foundry has some good videos on the AMD CPU bottleneck.

Here is one good example:

https://youtu.be/fAVxmfNUuRs?t=100
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
This guy said it best:
Exactly what I've communicated a few pages back. :)

Are multithreaded rendering and having multithreaded command lists really the same thing, or are we talking about two different things?

Multi-threaded command listing (DirectX runtime MT):
DirectX works by creating bundles (batches) of commands (command lists). These bundles or batches of commands are sent from the API to the graphics driver. The driver can perform some changes to these commands (shader replacements, reordering of commands, etc.) and then translates them into ISA (Instruction Set Architecture, the GPU's language) command lists (grids/threads) before sending them to the GPU for processing.

Multi-threaded command listing allows the DirectX driver to pre-record lists of commands on idling CPU cores. These lists of commands are then played back to the graphics driver using the CPU's primary core (thread 0). Why? The DirectX driver can only run on the primary CPU thread.


Multi-threaded rendering (DirectX runtime MT + DirectX driver MT):
Is more or less the same as above (the DirectX runtime can also scale past 4 cores) except for the last part: the DirectX driver doesn't need to play back the commands over the primary CPU thread; any CPU core/thread can talk directly to the graphics driver and send it its command lists. How? The DirectX driver is split amongst every CPU thread.


3dmark API tests atleast don't give any performance increase from this. Also as pointed out in many threads before, nvidia ain't just ahead in multithreaded, but their singlethreaded performance is miles ahead of AMD aswell.
NVIDIA's single-threaded DX11 performance is boosted by supporting multi-threaded command listing. So even though it is making use of single-threaded rendering, the command lists (DirectX runtime) are being processed by the available CPU threads (multi-threaded).

DX11 doesn't support multi-threaded rendering, so the performance is usually the same between single and multi-threading (or a bit faster with multi-threading, but the boost is negligible). This is because the DirectX driver only runs on the primary CPU thread under DX11.
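For anyone who wants to see the shape of that in code, here is a stripped-down sketch (my own simplification, not from any driver or engine) of the DX11 deferred-context path being described: recording of command lists can spread across spare CPU cores, but submission still funnels through the immediate context on a single thread.

Code:
    // DX11 multi-threaded command listing, roughly: worker threads record into
    // deferred contexts; the resulting command lists are played back serially
    // through the immediate context, which is the only path to the driver/GPU.
    #include <d3d11.h>
    #include <thread>
    #include <vector>

    void RecordScene(ID3D11DeviceContext* deferred, ID3D11CommandList** outList)
    {
        // ... state setup and Draw calls for one scene/chunk would go here ...
        deferred->FinishCommandList(FALSE, outList);   // bake into a command list
    }

    void RenderFrame(ID3D11Device* device, ID3D11DeviceContext* immediate, int workers)
    {
        std::vector<ID3D11DeviceContext*> deferred(workers, nullptr);
        std::vector<ID3D11CommandList*>   lists(workers, nullptr);
        std::vector<std::thread>          threads;

        for (int i = 0; i < workers; ++i)
            device->CreateDeferredContext(0, &deferred[i]);

        // Recording happens in parallel on idle CPU cores.
        for (int i = 0; i < workers; ++i)
            threads.emplace_back(RecordScene, deferred[i], &lists[i]);
        for (auto& t : threads)
            t.join();

        // Playback is serialized on one thread via the immediate context.
        for (int i = 0; i < workers; ++i) {
            immediate->ExecuteCommandList(lists[i], FALSE);
            lists[i]->Release();
            deferred[i]->Release();
        }
    }

Whether the translation work for those command lists then happens in the driver across threads (as described for NVIDIA above) or gets serialized onto the primary thread (as argued for AMD) is the part DX11 leaves up to the driver.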
 

Dygaza

Member
Oct 16, 2015
176
34
101
NVIDIA's single-threaded DX11 performance is boosted by supporting multi-threaded command listing. So even though it is making use of single-threaded rendering, the command lists (DirectX runtime) are being processed by the available CPU threads (multi-threaded).

DX11 doesn't support multi-threaded rendering, so the performance is usually the same between single and multi-threading (or a bit faster with multi-threading, but the boost is negligible). This is because the DirectX driver only runs on the primary CPU thread under DX11.

First of all, thanks for the good explanations.

Any reason AMD wouldn't benefit from using multi-threaded command listing? I remember reading about them testing it but not finding any significant improvements from it. Can't find the article, so it could all be my imagination as well.

Nvidia actually does get quite nice improvements in the 3DMark API test from MT compared to ST. But like I said, for us AMD users it's useless to drool over their MT numbers if we aren't even touching their ST numbers.

You shouldn't aim to be a long jumper if you can't run.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
First of all, thanks for the good explanations.

Any reason AMD wouldn't benefit from using multi-threaded command listing? I remember reading about them testing it but not finding any significant improvements from it. Can't find the article, so it could all be my imagination as well.

Nvidia actually does get quite nice improvements in the 3DMark API test from MT compared to ST. But like I said, for us AMD users it's useless to drool over their MT numbers if we aren't even touching their ST numbers.

You shouldn't aim to be a long jumper if you can't run.
I'm not sure, but almost a year ago AMD was looking to hire someone for that very task: https://www.linkedin.com/jobs2/view/31034254?trk=jobs_biz_pub

I've heard people make the claim that AMD tried implementing the feature but suffered negative scaling. I haven't seen anyone link any articles to back up that claim though.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Also this might interest you... http://www.pcgameshardware.com/aid,...h-Interview-What-DirectX-11-is-good-for/News/

Who is Dan Baker? Think Oxide/Kollock.

3) Do you use DX11 multithreading to reduce the CPU workload or another DirectX 11 feature? If there are special DirectX 11 visuals (except hardware tessellation), what are the graphical features that can only be rendered with shader model 5 hardware? What are the visual differences between the DX9 and DX11 version of Civilization 5?

Civilization V, as far as we know, is the first fully threaded DX11 game.

Unfortunately, because no other games have used this feature yet, neither Nvidia nor AMD have publically released threaded drivers, so users may not experience all the benefits just yet. We decided to keep threading enabled for Civilization V, however, because we are continuing to work closely with Nvidia and AMD on their support for multi-threading. We expect publically available threaded drivers shortly.

The internal architecture of the Civilization V graphics engine, however, is heavily multi-threaded and users will see multi-processor benefits even with drivers that are not threaded (including DX9). We have developed a series of configurable benchmark modes that we use internally for measuring our threading ability. These are fully described in the readme file. After some discussion, we decided to expose these internal tests on the released version so, if the users view the readme file, they will see that there are detailed instructions of these benchmark modes.

There are many notable improvements with the DX11 version of the game over the DX9 version of the game. Also, don't forget that the DX11 version includes all the DX10 features, so it has 2 generations of hardware features.

One big difference in the DX11 version is the terrain. On DX11, we are able to have a much more detailed terrain, since we are able to get a 4x compression on the textures. This allows us to keep most of the terrain cached. The DX9 players will notice some paging while they go to new parts of the map however; this is generally a non-issue on the DX11 version. We were also able to have specularity on the terrain, so players will notice marshes and snow that reflect the sun. The DX11 version also has a more detailed fog of war (on the higher settings), which uses weather simulation dynamics for a more realistic cloud movement.

Another huge visual improvement is our leader scenes. We have used a number of advanced features for lighting that are only available on DX10.1 and higher, and a number of these advanced features to give more realistic detail and correct shadowing.



4) From what DirectX 11 feature do you think your game profits most? What do you think about DirectX 11 in times of Cross-Platform-Development, can we expect more and more DX11 titles?

We benefit most from the DX11 Compute features. However, once Nvidia and AMD release threading enabled drivers, we expect the threading to be the biggest single benefit. We understand that many of our customers are hardware enthusiasts and want their games to use the latest technology. Since DirectX11 is leaps and bounds above the capabilities of current consoles it can be difficult to be cross platform and take advantage of the new capabilities. Fortunately, because we are a PC only game, this wasn't a concern for us. We can't speak for everyone, but we expect to see DX11 rapidly become the standard for gaming.

He was right: DX11 rapidly became the standard. Not only did he make the first DX11 game, he's now making the first DX12 game.

:)

CIV5 is the first game to exhibit this behaviour we've been discussing. Watch as an R9 290 starts dead last and, as the resolution rises, surpasses the GTX 970 and GTX 780 Ti. The R9 290X starts off more or less tied with a GTX 970, but as the resolution rises it distances itself from the GTX 970, catches up to the GTX 980 and finally surpasses it:

CIV 5 - Beyond Earth:
[benchmark charts at increasing resolutions]


This is why, for AMD, DirectX 12 is key.

1) DX12 removes AMD's multi-threaded rendering handicap and allows their GPUs to perform to their rendering architecture's fullest.

2) Asynchronous compute + graphics is a bonus, allowing AMD GCN to make better use of its idling compute resources.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
I've heard people make the claim that AMD tried implementing the feature but suffered negative scaling. I haven't seen anyone link any articles to back up that claim though.

As I showed above, the multithreading does work correctly. But I've seen that in draw-call-intensive scenes like the 3DMark API test, multithreading is slightly slower than single-threading. Some site tested Tomb Raider (not sure what patch, or if it was pre-launch day) and found that single core was fastest for AMD as well. So there is probably some additional threading overhead that makes threaded slower when a single core can already fill the GPU, but threaded is faster (as I showed) when a single core can't fill the GPU.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
As I showed above, the multithreading does work correctly. But I've seen that in draw-call-intensive scenes like the 3DMark API test, multithreading is slightly slower than single-threading. Some site tested Tomb Raider (not sure what patch, or if it was pre-launch day) and found that single core was fastest for AMD as well. So there is probably some additional threading overhead that makes threaded slower when a single core can already fill the GPU, but threaded is faster (as I showed) when a single core can't fill the GPU.
And since AMD hasn't been able to rectify the issue in software, it probably has something to do with hardware.

This is why I've hypothesized that it may have something to do with their Command Processor.

I can speculate that it could also be AMD's DirectX driver, under DX11, which incurs a larger overhead than NVIDIA's because it must re-batch NVIDIA-optimized code (32-thread kernels) into batches of 64 threads (one wavefront per kernel) in order to obtain better SIMD utilization (compute). Imagine re-batching 4-8 (depending on how many cores are involved) 32-thread command lists, when using multi-threaded command listing, into batches of 64 threads on the primary CPU thread. That would incur a heck of a lot of API overhead and would show up under CPU-bound conditions. (It wouldn't affect your Microsoft test because Microsoft likely included both NVIDIA and AMD vendor-specific paths in that test.)
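To put numbers on the SIMD-utilization part of that speculation (and only that part; this says nothing about what the driver actually does), here is the occupancy arithmetic for warp-sized versus wavefront-sized batches:

Code:
    // Occupancy arithmetic only: a 64-wide GCN wavefront executing a 32-thread
    // group leaves half of its SIMD lanes idle unless the work is repacked.
    #include <cstdio>

    int main()
    {
        const int wavefront_width = 64;        // GCN wavefront size
        const int group_sizes[]   = {32, 64};  // warp-sized vs wavefront-sized batches

        for (int threads : group_sizes) {
            const int wavefronts = (threads + wavefront_width - 1) / wavefront_width;
            const double util    = 100.0 * threads / (wavefronts * wavefront_width);
            std::printf("%2d-thread batch -> %d wavefront(s), %.0f%% lane utilization\n",
                        threads, wavefronts, util);
        }
        return 0;
    }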

Since the resolution doesn't affect the number of batches, scaling up the resolution would give us a better idea of AMD's and NVIDIA's true GPU strengths.

For all intents and purposes, Fiji is likely a match for a GTX 980 Ti, hardware-wise, when both are running reference clocks under optimized code.

DirectX 12 would alleviate this by spreading the DirectX driver across many cores. Hence AMD no longer incurs a CPU-bound problem under DX12 at low resolutions.

Just a thought, because when AMD partners with a game developer, this happens (Hitman BETA, DX11):

[Hitman beta DX11 benchmark charts]


Evidently, once Hitman releases with DX12 and asynchronous compute, GCN will perform admirably in this title.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Hmm, interesting how much the 980 Ti loses from low -> medium; anyone know what all changes there? Don't have the beta myself. It's a bigger hit than medium -> high though (Fury goes from behind to ahead @ 1080p there, then back to tied @ high).

But yeah, can't wait for some DX12 and Vulkan optimized games; get rid of this DX11 driver overhead please and let my card shine :)
 

Dygaza

Member
Oct 16, 2015
176
34
101
It's also very hard to understand how the new, improved command processor in Polaris could have so much of an effect that it would remove the CPU bottleneck, when it is physically located on the other side of the PCI-E lanes. The only way I can think of is that the current command processor is the bottleneck, but that doesn't explain why CPU clock speed has an almost linear effect on fps in those scenes.
 

Leadbox

Senior member
Oct 25, 2010
744
63
91
DX12 is such a new API that most of the companies starting with it really have to learn a lot of things. Folks like Oxide already got a really nice head start from Mantle.

I think if a certain IHV fully supported DX12 as their marketing material suggests, we would be there already.