Nvidia vs AMD's Driver Approach in DX11

Udgnim

Diamond Member
Apr 16, 2008
https://www.youtube.com/watch?v=nIoZB-cnjc0&feature=youtu.be

Summary:

DX11 puts draw calls on a single thread. Nvidia's DX11 driver uses a software scheduler to take a game's single-threaded DX11 draw calls and split them across multiple cores. This incurs a higher CPU overhead across all cores, but often results in improved performance because it avoids a single-threaded bottleneck.

AMD's GCN architecture uses a hardware scheduler and cannot take a game's single-threaded DX11 draw calls and split them across multiple cores.

The result is that in games that place game logic and draw calls heavily on a single thread, AMD performance will suffer while Nvidia performance will not.

However, in games that are multithreaded so that draw calls are dedicated to one core while game logic is spread across the other cores, AMD performance can pull ahead of Nvidia performance on similar-tier GPUs (e.g. the 480 vs. the 1060), because Nvidia's software scheduler incurs a CPU overhead hit across multiple cores to split the draw calls.
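For reference, the mechanism being described maps onto the multithreading hooks DX11 itself exposes to applications: deferred contexts record draw calls on worker threads into command lists, and the immediate context replays them serially. Below is a minimal sketch of that API path (device setup, rendering state, and error handling omitted; the thread count and function names are illustrative):

```cpp
// Minimal sketch of DX11's multithreaded rendering path (deferred contexts).
// Assumes an already-created device and immediate context; error handling
// and all rendering state are omitted.
#include <d3d11.h>
#include <thread>
#include <vector>

void RecordWork(ID3D11DeviceContext* deferred, ID3D11CommandList** outList)
{
    // Each worker thread records its share of draw calls on its own
    // deferred context instead of the single immediate context:
    // deferred->Draw(...), deferred->DrawIndexed(...), etc.
    deferred->FinishCommandList(FALSE, outList); // bake into a command list
}

void SubmitFrame(ID3D11Device* device, ID3D11DeviceContext* immediate)
{
    const int kThreads = 4; // illustrative
    std::vector<ID3D11DeviceContext*> deferred(kThreads);
    std::vector<ID3D11CommandList*>   lists(kThreads);
    std::vector<std::thread>          workers;

    for (int i = 0; i < kThreads; ++i) {
        device->CreateDeferredContext(0, &deferred[i]);
        workers.emplace_back(RecordWork, deferred[i], &lists[i]);
    }
    for (auto& t : workers) t.join();

    // Submission itself stays serial: the immediate context replays each
    // recorded command list in order.
    for (int i = 0; i < kThreads; ++i) {
        immediate->ExecuteCommandList(lists[i], FALSE);
        lists[i]->Release();
        deferred[i]->Release();
    }
}
```

Note the distinction the thread keeps circling back to: this is the application-visible API, whereas the video's claim is about Nvidia's driver doing similar splitting internally behind a single-threaded application.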
 

sontin

Diamond Member
Sep 12, 2011
Leaving aside that this whole video is nonsense, these two things are important to remember:

DX11 puts draw calls on a single thread. Nvidia's DX11 driver uses a software scheduler to take a game's single-threaded DX11 draw calls and split them across multiple cores. This incurs a higher CPU overhead across all cores, but often results in improved performance because it avoids a single-threaded bottleneck.

DX11 is software. The driver is software. Splitting draw-call processing (and other workloads) into more pieces, to create command lists for the command queue from multiple threads, has no bearing on how workloads are scheduled on the GPU. It is wrong to call this a "software scheduler" when you are talking about the GPU side.

AMD's GCN architecture uses a hardware scheduler and cannot take a game's single-threaded DX11 draw calls and split them across multiple cores.

So AMD doesn't use a DX11 driver and isn't compatible with DX11 software?

DX11 and DX12 can only use one command queue (let us ignore Async Compute for a moment). Multithreading happens on the software side, before the application/driver sends the queue to the GPU. So, no: neither nVidia nor AMD uses a software scheduler to schedule work on the GPU. That doesn't make sense and doesn't exist for any GPU in the world.
 

Krteq

Senior member
May 22, 2015
You obviously don't know what you are talking about. Since Kepler, nV has used a "hybrid" scheduling model: most of the work is done by the compiler in the driver, and the warp schedulers in hardware then handle the simpler part.

[Image: GK104 scheduler diagram from AnandTech's GTX 680 review]


The end result is an interesting one, if only because by conventional standards it’s going in reverse. With GK104 NVIDIA is going back to static scheduling. Traditionally, processors have started with static scheduling and then moved to hardware scheduling as both software and hardware complexity has increased. Hardware instruction scheduling allows the processor to schedule instructions in the most efficient manner in real time as conditions permit, as opposed to strictly following the order of the code itself regardless of the code’s efficiency. This in turn improves the performance of the processor.

However based on their own internal research and simulations, in their search for efficiency NVIDIA found that hardware scheduling was consuming a fair bit of power and area for few benefits. In particular, since Kepler’s math pipeline has a fixed latency, hardware scheduling of the instruction inside of a warp was redundant since the compiler already knew the latency of each math instruction it issued. So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling.
The DX12 API has a quite different execution model, so nV has to "modify" its compiler and warp schedulers.
 

sontin

Diamond Member
Sep 12, 2011
The compiler works after the DX11 driver or the DX12 application has created and filled the command queue. What nVidia did from Fermi to Kepler has no bearing on this topic.

So NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling.
http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3

The execution model of DX11 and DX12 is the same:
In D3D12 the concept of a command queue is the API representation of a roughly serial sequence of work submitted by the application. Barriers and other techniques allow this work to be executed in a pipeline or out of order, but the application only sees a single completion timeline. This corresponds to the immediate context in D3D11.
https://msdn.microsoft.com/en-us/library/windows/desktop/dn899217(v=vs.85).aspx
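To make the quoted point concrete: however many threads record command lists, the hand-off to the GPU goes through one queue. A minimal sketch (it assumes the lists were already recorded and closed; the names are illustrative):

```cpp
// Sketch of the serial step the MSDN quote describes: command lists may be
// recorded on many threads, but submission happens through a single queue.
#include <d3d12.h>
#include <vector>

void SubmitFrame(ID3D12CommandQueue* queue,
                 const std::vector<ID3D12GraphicsCommandList*>& recorded)
{
    // Gather as the base interface that ExecuteCommandLists expects.
    std::vector<ID3D12CommandList*> lists(recorded.begin(), recorded.end());

    // The "only serial process necessary": one call, one queue, in order.
    queue->ExecuteCommandLists(static_cast<UINT>(lists.size()), lists.data());
}
```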
 

Yakk

Golden Member
May 28, 2016
Nvidia was also the GPU in the Xbox 360, so they gathered a lot of experience and had influence on console development with DX9, and by extension DX10 and DX11 (which were just extensions of DX9).

Now AMD has this position with DX12, which also trickles down to PCs, as well as promoting Vulkan.
 

Krteq

Senior member
May 22, 2015
Nvidia was also the GPU in the PlayStation 3, so they gathered a lot of experience and had influence on console development with DX9, and by extension DX10 and DX11 (which were just extensions of DX9).
Fixed, Xbox 360 -> ATi Xenos
 

Krteq

Senior member
May 22, 2015
o_O Huh, what? Read the whole MSDN article; you will be surprised.

In Direct3D 11, all work submission is done via the immediate context, which represents a single stream of commands that go to the GPU. To achieve multithreaded scaling, games also have deferred contexts available to them, but like PSOs, deferred contexts also do not map perfectly to hardware, and so relatively little work can be done in them.

Direct3D 12 introduces a new model for work submission based on command lists that contain the entirety of information needed to execute a particular workload on the GPU. Each new command list contains information such as which PSO to use, what texture and buffer resources are needed, and the arguments to all draw calls. Because each command list is self-contained and inherits no state, the driver can pre-compute all necessary GPU commands up-front and in a free-threaded manner. The only serial process necessary is the final submission of command lists to the GPU via the command queue, which is a highly efficient process.

In addition to command lists, Direct3D 12 also introduces a second level of work pre-computation, bundles. Unlike command lists which are completely self-contained and typically constructed, submitted once, and discarded, bundles provide a form of state inheritance which permits reuse. For example, if a game wants to draw two character models with different textures, one approach is to record a command list with two sets of identical draw calls. But another approach is to “record” one bundle that draws a single character model, then “play back” the bundle twice on the command list using different resources. In the latter case, the driver only has to compute the appropriate instructions once, and creating the command list essentially amounts to two low-cost function calls.
DirectX 12 - What’s the big deal?
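A rough sketch of the bundle pattern quoted above, under stated assumptions: record one character's draw into a bundle, then play it back twice with different resources bound on the calling list. The root signature, PSO, buffers, and descriptors are assumed to exist already, and all parameter names are illustrative; root bindings set on the direct list are inherited by the bundle when it uses the same root signature as the caller:

```cpp
// Hedged sketch of the bundle reuse described in the quote: record one
// character's draw in a bundle, then "play it back" twice with different
// resources. Error handling and buffer/descriptor setup omitted.
#include <d3d12.h>

void DrawTwoCharacters(ID3D12Device* device,
                       ID3D12GraphicsCommandList* direct, // TYPE_DIRECT list
                       ID3D12RootSignature* rootSig,
                       ID3D12PipelineState* pso,
                       D3D12_GPU_DESCRIPTOR_HANDLE characterTextureA,
                       D3D12_GPU_DESCRIPTOR_HANDLE characterTextureB,
                       UINT indexCount)
{
    ID3D12CommandAllocator* alloc = nullptr;
    ID3D12GraphicsCommandList* bundle = nullptr;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_BUNDLE,
                                   IID_PPV_ARGS(&alloc));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_BUNDLE, alloc, pso,
                              IID_PPV_ARGS(&bundle));

    // "Record" once: the draw itself lives in the bundle. Topology must be
    // set inside the bundle; matching the caller's root signature lets the
    // bundle inherit the root bindings set on the direct list below.
    bundle->SetGraphicsRootSignature(rootSig);
    bundle->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    bundle->DrawIndexedInstanced(indexCount, 1, 0, 0, 0);
    bundle->Close();

    // "Play back" twice, switching only the texture binding between calls.
    direct->SetGraphicsRootSignature(rootSig);
    direct->SetGraphicsRootDescriptorTable(0, characterTextureA);
    direct->ExecuteBundle(bundle);
    direct->SetGraphicsRootDescriptorTable(0, characterTextureB);
    direct->ExecuteBundle(bundle);
}
```

The design point is exactly the one in the quote: the driver computes the bundle's instructions once, so each ExecuteBundle is a cheap replay.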
 

sontin

Diamond Member
Sep 12, 2011
You missed the critical part of your quote:
The only serial process necessary is the final submission of command lists to the GPU via the command queue, which is a highly efficient process.

This is the same for DX11 and DX12. Work submission to the GPU happens serially. Work submission to the command queue is different and has nothing to do with scheduling: it is an API process and happens on the CPU. I guess that makes it "software" for everyone.
 

Janooo

Golden Member
Aug 22, 2005
Nvidia was also the GPU in the Xbox 360, so they gathered a lot of experience and had influence on console development with DX9, and by extension DX10 and DX11 (which were just extensions of DX9).

Now AMD has this position with DX12, which also trickles down to PCs, as well as promoting Vulkan.
No, the Xbox 360 GPU was ATi's Xenos.
 

Carfax83

Diamond Member
Nov 1, 2010
I'm pretty sure sontin is right on this one. NVidia has obviously spent a lot of effort on designing their drivers to take advantage of multicore CPUs over the years, since well before DX11 became available. This is very different from the internal hardware schedulers that GPUs from AMD and NVidia use when actually drawing frames.
 

Bacon1

Diamond Member
Feb 14, 2016
NVidia has obviously spent a lot of effort on designing their drivers to take advantage of multicore CPUs over the years

Yes, their drivers, which run on the CPU. They moved away from handling the scheduling in hardware because not enough developers were multithreading their draw calls, and that was becoming a bottleneck.

Both AMD and Nvidia support deferred contexts, but Nvidia also supports driver command lists. So when DX11 games are single-thread heavy, Nvidia's driver will spread the work out and make it thread friendly. When developers take the time to make a game thread friendly to begin with, both AMD and Nvidia work well and neither ends up CPU bound.

So yes, Nvidia is more reliant on the CPU than AMD, since they handle it through drivers, but they aren't limited to a single CPU core if the game isn't optimized well.

So: poor CPU utilization = Nvidia faster (the driver moves work to other cores); good CPU utilization = AMD has lower overhead because the GPU handles it.
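Incidentally, the deferred-context/driver-command-list split above is something an application can query directly. A small sketch, assuming an existing device; if DriverCommandLists comes back FALSE, the D3D11 runtime emulates command lists in software:

```cpp
// Probe whether the driver natively supports the D3D11 threading features
// discussed above, or whether the runtime must emulate them.
#include <d3d11.h>
#include <cstdio>

void ReportThreadingSupport(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_THREADING caps = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                              &caps, sizeof(caps)))) {
        std::printf("Concurrent resource creation: %s\n",
                    caps.DriverConcurrentCreates ? "yes" : "no");
        std::printf("Driver command lists:         %s\n",
                    caps.DriverCommandLists ? "yes" : "no");
    }
}
```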

DX11 is still a massive bottleneck even when properly multithreaded as shown in this video and as you can see from any API Overhead chart from 3dMark.

[Chart: 3DMark API Overhead results — draw calls per second under DX11, DX12, and Vulkan]


http://www.anandtech.com/show/11223/quick-look-vulkan-3dmark-api-overhead

Now the differences between DX11 and either DX12 or Vulkan are massive
 

Carfax83

Diamond Member
Nov 1, 2010
Yes, their drivers, which run on the CPU. They moved away from handling the scheduling in hardware because not enough developers were multithreading their draw calls, and that was becoming a bottleneck.

To clarify, we're talking about instruction scheduling and not warp scheduling, right? As far as I know, instruction scheduling is done by the compiler, which is obviously software, but warp scheduling is done by a dedicated warp scheduler in hardware.

Both AMD and Nvidia support deferred contexts, but Nvidia also supports driver command lists. So when DX11 games are single-thread heavy, Nvidia's driver will spread the work out and make it thread friendly. When developers take the time to make a game thread friendly to begin with, both AMD and Nvidia work well and neither ends up CPU bound.

I don't know if you know this, but driver command lists have to be supported by the game. As such, very few games ever used driver command lists because, as you said, only NVidia had implemented them in their drivers.

So NVidia's advantage in DX11 is not due to driver command lists unless the game supports them, e.g. Civilization V.

So: poor CPU utilization = Nvidia faster (the driver moves work to other cores); good CPU utilization = AMD has lower overhead because the GPU handles it.

Then how do you explain NVidia's advantage in highly threaded games like Assassin's Creed Unity and Tom Clancy's Ghost Recon Wildlands?

DX11 is still a massive bottleneck even when properly multithreaded as shown in this video and as you can see from any API Overhead chart from 3dMark.

Yes, DX11 is a bottleneck, but obviously NVidia has figured out a way around this somewhat.
 

EightySix Four

Diamond Member
Jul 17, 2004
I don't know if you know this, but driver command lists have to be supported by the game. As such, very few games ever used driver command lists because, as you said, only NVidia had implemented them in their drivers.

So NVidia's advantage in DX11 is not due to driver command lists unless the game supports them, e.g. Civilization V.

Literally the whole argument the video made is that nvidia's driver examines the incoming calls and splits them into command lists across multiple threads, which are then delivered to the GPU. Nothing else you said provides evidence to the contrary, while the video makes some compelling arguments that this is the case.
 

dogen1

Senior member
Oct 14, 2014
DX11 is still a massive bottleneck even when properly multithreaded as shown in this video and as you can see from any API Overhead chart from 3dMark.

It's only a bottleneck when it's actually a bottleneck..
 

Bacon1

Diamond Member
Feb 14, 2016
I don't know if you know this, but driver command lists have to be supported by the game. As such, very few games ever used driver command lists because, as you said, only NVidia had implemented them in their drivers.

Did you not watch the video? It's explained there.

So NVidia's advantage in DX11 is not due to driver command lists unless the game supports them, e.g. Civilization V.

Civ V runs just as well on AMD hardware. It uses deferred contexts, not driver command lists. Deferred contexts are done by the developers; driver command lists are done... well, inside the drivers. That's why Nvidia can take a single-thread-heavy game and make it thread aware. Say the driver adds 5% CPU overhead, but it lets you take 30% CPU usage on one core and spread it over three other cores: you now use about 35% CPU total, split 15%/10%/10%, versus 30% still sitting on a single core on AMD (all made-up numbers, just to help visualize it). So there is additional overhead from having the driver do it, but it helps spread out the work and prevents the single-core bottleneck.

So in conclusion, the reason NVIDIA beats AMD in Civ V is that NVIDIA currently offers full support for multi-threaded rendering/deferred contexts/command lists, while AMD does not. Civ V uses massive amounts of objects and complex terrain, and because it's multi-threaded rendering capable the introduction of multi-threaded rendering support in NVIDIA's drivers means that NVIDIA's GPUs can now rip through the game.

https://forums.anandtech.com/thread...e-radeon-in-civ5.2155665/page-2#post-31520674

Once AMD updated their drivers to support deferred contexts, performance went way up as well.

Yes, DX11 is a bottleneck, but obviously NVidia has figured out a way around this somewhat.

Did you not see the massive gap between DX11 and Vulkan/DX12 for Nvidia? The other two are still ~15x faster. Obviously GPUs can't handle enough work to use that many draw calls in an actual game, but the difference between the APIs is massive, and it will grow as GPUs get more powerful and become limited by DX11.

Then how do you explain NVidia's advantage in highly threaded games like Assassin's Creed Unity and Tom Clancy's Ghost Recon Wildlands?

Because they are GameWorks titles, heavily optimized for Nvidia hardware?
 

Carfax83

Diamond Member
Nov 1, 2010
Literally the whole argument the video made is that nvidia's driver examines the incoming calls and splits them into command lists across multiple threads, which are then delivered to the GPU. Nothing else you said provides evidence to the contrary, while the video makes some compelling arguments that this is the case.

Command lists are part of the DirectX 11 specification, though they weren't mandatory. Whatever NVidia is doing is something else entirely, even if it resembles driver command lists. NVidia has been offloading "stuff" to the CPU for years, since before DX11, and I don't think the author of that video is in a position to know exactly what is going on.

NVidia drivers multithreaded since 2005!
 

Carfax83

Diamond Member
Nov 1, 2010
Did you not watch the video? It's explained there.

The author admits that it's supposition on his part. As I explained above, NVidia has been doing this for years, as early as 2005, years before DX11. So whatever they are doing, it may not have anything to do with driver command lists, since DCLs are actually part of the DX11 specification and support has to be on the game side for them to work.

Civ V runs just as well on AMD hardware. It uses deferred contexts, not driver command lists. Deferred contexts are done by the developers; driver command lists are done... well, inside the drivers. That's why Nvidia can take a single-thread-heavy game and make it thread aware. Say the driver adds 5% CPU overhead, but it lets you take 30% CPU usage on one core and spread it over three other cores: you now use about 35% CPU total, split 15%/10%/10%, versus 30% still sitting on a single core on AMD (all made-up numbers, just to help visualize it). So there is additional overhead from having the driver do it, but it helps spread out the work and prevents the single-core bottleneck.

https://forums.anandtech.com/thread...e-radeon-in-civ5.2155665/page-2#post-31520674

The quote that you cited from Ryan clearly explains that Civilization 5 used both deferred contexts and driver command lists. The two are meant to be used together to function properly, so you can't just use deferred contexts and not driver command lists.

And no, Civ V doesn't run as well on AMD hardware as it does on NVidia, at least not in CPU-bound scenarios, as you can see in the graph below, where the older GTX 580 is ahead of the more powerful 7970 at a low resolution.

[Chart: Civilization V benchmark at low resolution (CPU bound) — GTX 580 ahead of HD 7970]


Make things more GPU bound, however, and suddenly the Radeon surges ahead:

[Chart: Civilization V benchmark at higher resolution (GPU bound) — HD 7970 ahead]


Once AMD updated their drivers to support deferred contexts, performance went way up as well.

No. Deferred contexts require driver command lists to work properly. The two are complementary, and need to be used together for draw-call batch submission to work in parallel.

Did you not see the massive gap between DX11 and Vulkan/DX12 for Nvidia? The other two are still ~15x faster. Obviously GPUs can't handle enough work to use that many draw calls in an actual game, but the difference between the APIs is massive, and it will grow as GPUs get more powerful and become limited by DX11.

I think the really relevant question is: how many games today are actually draw-call bottlenecked? Very few, and even then only in certain circumstances. Also, many game developers have succeeded in implementing task-based parallelism in their engines, which makes scaling on multicore CPUs much more effective even when the API handicaps it. That's why much lower-clocked CPUs like the 6950X and the 6900K can compete with and outright beat much higher-clocked CPUs like the 7700K. For example, look at Ghost Recon Wildlands (see the chart and sketch below): it scales all the way up to a deca-core CPU, which is incredible.

[Chart: Ghost Recon Wildlands CPU scaling benchmark]
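For what it's worth, the task-based parallelism mentioned above just means the engine splits per-frame work into independent jobs that land on whatever cores are free. A toy sketch, with entirely hypothetical job names:

```cpp
// Toy sketch of task-based parallelism in a game loop: independent jobs
// are dispatched to whatever cores are free instead of sharing one thread.
#include <future>

// Hypothetical per-frame jobs (stubbed out here).
void SimulateAI()    { /* ... */ }
void UpdatePhysics() { /* ... */ }
void StreamAssets()  { /* ... */ }

void GameFrame()
{
    // Launch each job on its own thread; the render thread can keep
    // feeding the graphics API in the meantime.
    auto ai      = std::async(std::launch::async, SimulateAI);
    auto physics = std::async(std::launch::async, UpdatePhysics);
    auto assets  = std::async(std::launch::async, StreamAssets);

    // Synchronize before ending the frame.
    ai.wait();
    physics.wait();
    assets.wait();
}
```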


Because they are GameWorks titles, heavily optimized for Nvidia hardware?

Blaming GameWorks is intellectually lazy, to be honest :rolleyes:
 

Carfax83

Diamond Member
Nov 1, 2010
That article is about pre-unified shader stuff in the DX9 era. Irrelevant in the present day.

Might be, but it's still relevant because it shows that NVidia has been pursuing multithreaded drivers for a long time. Who knows what form it has taken in modern times? Only the NVidia software engineers know.