Nvidia vs AMD's Driver Approach in DX11


Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Here is some more reading material:

http://slideplayer.com/slide/1600266/

If anyone wants to play around with it, here is a sample that lets you change what kind of contexts are used:

https://code.msdn.microsoft.com/windowsdesktop/Direct3D-Multithreaded-d02193c0

It has the following options:

  • DEVICECONTEXT_IMMEDIATE, // Traditional rendering, one thread, immediate device context
  • DEVICECONTEXT_ST_DEFERRED_PER_SCENE, // One thread, multiple deferred device contexts, one per scene
  • DEVICECONTEXT_MT_DEFERRED_PER_SCENE, // Multiple threads, one per scene, each with one deferred device context
  • DEVICECONTEXT_ST_DEFERRED_PER_CHUNK, // One thread, multiple deferred device contexts, one per physical processor
  • DEVICECONTEXT_MT_DEFERRED_PER_CHUNK, // Multiple threads, one per physical processor, each with one deferred device context
I get a large increase when using the multithreaded versions, which proves that they do work.
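Roughly, the multithreaded modes boil down to this pattern (a minimal sketch of the idea, not the sample's actual code; DrawChunk is a placeholder for whatever draw calls a chunk would record):

```cpp
// Sketch of the MT_DEFERRED_PER_CHUNK idea: one deferred context per worker
// thread, each thread records its chunk of the scene, and the main thread
// plays the resulting command lists back on the immediate context.
#include <d3d11.h>
#include <thread>
#include <vector>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void RenderFrameMT(ID3D11Device* device, ID3D11DeviceContext* immediateCtx,
                   unsigned numChunks)
{
    std::vector<ComPtr<ID3D11DeviceContext>> deferredCtx(numChunks);
    std::vector<ComPtr<ID3D11CommandList>>   commandList(numChunks);
    std::vector<std::thread>                 workers;

    for (unsigned i = 0; i < numChunks; ++i)
        device->CreateDeferredContext(0, &deferredCtx[i]);

    for (unsigned i = 0; i < numChunks; ++i)
        workers.emplace_back([&, i] {
            // Record state setting and draw calls for this chunk.
            // DrawChunk(deferredCtx[i].Get(), i);          // placeholder
            deferredCtx[i]->FinishCommandList(FALSE, &commandList[i]);
        });

    for (auto& t : workers)
        t.join();

    // Playback is serialized on the immediate context.
    for (unsigned i = 0; i < numChunks; ++i)
        immediateCtx->ExecuteCommandList(commandList[i].Get(), FALSE);
}
```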
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Yes, a 290x much slower than a 970 is definitely running well :rolleyes:

So it should have been above the Titan X? LOL! :D Witcher 3 is one of the most optimized games out there. The fact that you don't consider it optimized simply because it uses Gameworks just goes to show how biased you are.

Again, wrong. I've used them myself, and they work without driver command lists (which AMD does not support!)

How about we go to the actual people that developed this technology?

Microsoft immediate and deferred rendering:

Deferred rendering records graphics commands in a command buffer so that they can be played back at some other time. Use a deferred context to record commands (rendering as well as state setting) to a command list. Deferred rendering is a new concept in Direct3D 11; deferred rendering is designed to support rendering on one thread while recording commands for rendering on additional threads. This parallel, multithread strategy allows you to break up complex scenes into concurrent tasks.

Direct3D generates rendering overhead when it queues up commands in the command buffer. In contrast, a command list executes much more efficiently during playback. To use a command list, record rendering commands with a deferred context and play them back using an immediate context.

So according to Microsoft, while you can use deferred contexts without using command lists, the end result is that much greater overhead is generated than if you had used a command list. This overhead likely mitigates any performance increase you'd get by using deferred rendering without any command lists.
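If you don't believe the docs, the runtime will even tell you whether the driver supports command lists natively. A rough sketch of the check (the reporting function itself is just for illustration):

```cpp
// Ask the D3D11 runtime whether the driver natively supports command lists and
// concurrent resource creation. When DriverCommandLists is FALSE, deferred
// contexts still work, but command lists are emulated in software by the
// runtime, which is where the extra overhead described above comes from.
#include <d3d11.h>
#include <cstdio>

void ReportThreadingSupport(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_THREADING caps = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                              &caps, sizeof(caps))))
    {
        std::printf("Driver concurrent creates: %s\n",
                    caps.DriverConcurrentCreates ? "yes" : "no");
        std::printf("Driver command lists:      %s\n",
                    caps.DriverCommandLists ? "yes" : "no");
    }
}
```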

I'm sorry, but that makes absolutely no sense. Task based parallelism isn't some magic thing. It just means structuring your code and game logic so all parts don't conflict and can be run and completed at different times. So you can run your AI separate from weapon collision separate from keyboard input etc. That way you don't bottleneck waiting for one of those to finish. And yes, it has been in game design for a very long time. The difference is being able to make GPU calls, not just game-engine logic, on different threads in DX11, which was further enhanced in DX12/Vulkan.

That's a half-assed explanation that doesn't even begin to cover the benefits of task based parallelism. The main advantage of task based parallelism is that you can run a single task across multiple threads concurrently, whether it be rendering, animation, physics, etcetera, but asynchronously. The task is basically broken up and executed in parallel, so that no single thread is carrying the burden of a particular task like rendering or physics. This approach can scale to much higher thread counts than thread level parallelism, and result in much greater performance.
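To make that concrete, here's a bare-bones illustration of the idea (not taken from any particular engine; std::async stands in for a real job scheduler):

```cpp
// Each frame's work is broken into small, independent tasks that any worker
// can pick up, instead of pinning whole subsystems to dedicated threads.
#include <functional>
#include <future>
#include <vector>

void RunFrameTasks()
{
    std::vector<std::function<void()>> tasks = {
        [] { /* animate character batch 0 */ },
        [] { /* animate character batch 1 */ },
        [] { /* simulate physics island A */ },
        [] { /* simulate physics island B */ },
        [] { /* build render commands for scene chunk 0 */ },
        [] { /* build render commands for scene chunk 1 */ },
    };

    std::vector<std::future<void>> pending;
    for (auto& task : tasks)
        pending.push_back(std::async(std::launch::async, task));

    for (auto& f : pending)
        f.wait();   // the frame is finished when every task has completed
}
```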

Here's an excellent PDF presentation by a programmer at Intel on how to implement task based parallelism in a game engine.

That presentation was created in 2011 for the Game Developers Conference, so this totally refutes your assertion that task based parallelism has been used in game engines for a "very long time." :rolleyes:
 

tamz_msc

Diamond Member
Jan 5, 2017
3,708
3,554
136
It's been shown multiple times that unless tessellation is involved, AMD has a similar performance hit as NVidia when running these Gameworks effects. HardOCP did a fairly good overview of it last year if I recall, using Far Cry 4.
That's the same point I made.
Besides, the last time I checked, most Gameworks effects can be disabled. If you don't like it, turn it off.
You're deviating from the point - you asked the question why some Gameworks titles don't take a hit on AMD cards, and even run better in some cases. I answered that question, to the best of my knowledge.
Who cares whether it's before unified shaders existed? You're missing the entire point of why I posted that article to begin with. I posted that article to show that NVidia for a LONG time has had an interest in exploiting multicore/multithreaded CPUs to increase their GPU performance.
The video that started this thread was in part about the advantages/disadvantages of NVIDIA's DX11 driver when it comes to draw calls. What they thought about DX9 multi-threading in 2005 isn't exactly bringing anything new to this present discussion.
Why do you think NVidia has the "threaded optimization" toggle option in the driver control panel? It's a remnant of the days where thread level parallelism was used in game engines, unlike the more efficient task based parallelism that is used today.
That's an OpenGL feature from back in time when multithreading OpenGL had its own performance issues and has nothing to do with DX11.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Witcher 3 is one of the most optimized games out there. The fact that you don't consider it optimized simply because it uses Gameworks, just goes to show how biased you are.

What? Why should a 290x be above a Titan X? You are crazy. It's a well-optimized game now, but not at launch. At launch it was horribly optimized due to Gameworks, which the developers themselves even stated they couldn't optimize for.


Here's an excellent PDF presentation by a programmer at Intel on how to implement task based parallelism in a game engine.

That presentation was created in 2011 for the Game Developers Conference, so this totally refutes your assertion that task based parallelism has been used in game engines for a "very long time." :rolleyes:

Here is 2009 talking about Xbox 360 using Parallel tasks http://stackoverflow.com/questions/...-such-as-games-use-loads-of-different-threads

And MS talking about it prior to that for 360 development: https://msdn.microsoft.com/en-us/library/ee416321.aspx?f=255&MSPPError=-2147217396

And here is a library for 360 development from 2008: http://paralleltasks.codeplex.com/SourceControl/list/changesets?page=1

So I'm guessing 2008 -> 2017 isn't a long time? Considering that in 2008 the best Nvidia GPU was the 9000 series, based off the 8000 series from 2006, I'd consider that a long time, yes.

It would be even easier to find older references, except most stuff isn't dated with when it was published.

This overhead likely mitigates any performance increase you'd get by using deferred rendering without any command lists.

Yeah, OK, I'll take your word that it doesn't work properly over the almost 3x performance gain I've seen myself. :rolleyes:
 

tamz_msc

Diamond Member
Jan 5, 2017
3,708
3,554
136
The primary thread is always loaded with other stuff, in both DX11 and DX12, so I don't see how this proves your point.
Your statement is wrong and I'll prove it right now.
DX11 Multi-core scaling @TH
Metro: LL Redux
Tom's Hardware: Can you go into more depth about how 4A uses threading and what benefits that confers?
4A Games: PC (for Redux games): One dedicated D3D API submission thread, two or three control (depends on hardware thread count), all other available cores/threads become workers + some aux + some GPU driver threads + a lot of random threads from other processes
[Benchmark chart: metro-19x10-fps.png]

Zero scaling from 6 to 10 cores. The 6700K takes a hit in avg. fps probably because of the other threads that the game spawns.

Project Cars - There is a debate over whether this game runs PhysX and AI off the primary thread. Here you see NVIDIA's driver doing well to scale across multiple cores:
[Benchmark chart: cars-19x10-fps.png]


Edit: after a bit of digging, it is the combination of the drivers + the game's physics engine that is responsible for its historically poor performance on AMD cards.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
And hey, here is 2006 ;)

Synchronous function parallel model
One way to include parallelism to a game loop is to find parallel tasks from an existing loop. To reduce the need for communication between parallel tasks, the tasks should preferably be truly independent of each other. An example of this could be doing a physics task while calculating an animation. Figure 1 shows a game loop parallelized using this technique.

http://www.gamasutra.com/view/feature/130247/multithreaded_game_engine_.php

Or even 2002 when it was known as behavior based ;)

http://www.cs.northwestern.edu/~rob/publications/applying-inexpensive-ai.ieee02.pdf
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
The video that started this thread was in part about the advantages/disadvantages of NVIDIA's DX11 driver when it comes to draw calls. What they thought about DX9 multi-threading in 2005 isn't exactly bringing anything new to this present discussion.

Actually it does, because it exposes a flaw in the theory of exactly how NVidia is accomplishing this. Whatever NVidia is doing is likely remarkably sophisticated, and probably doesn't include driver command lists (as understood in the context of the DX11 feature), since those require game-side changes as well, as I've already explained.

That's an OpenGL feature from back in time when multithreading OpenGL had its own performance issues and has nothing to do with DX11.

I've heard conflicting reports on this. I remember a long time ago when I was playing around with this setting when playing NWN2 on Vista, and it definitely seemed to impact the game's performance. And NWN2 was DX9 if I recall.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
What? Why should a 290x be above a Titan X? You are crazy. It's a well-optimized game now, but not at launch. At launch it was horribly optimized due to Gameworks, which the developers themselves even stated they couldn't optimize for.

Do yourself a favor and look up the word sarcasm in the dictionary while you're doing all of this research. :rolleyes:

Anyway, your attack on The Witcher 3 is pure nonsense. I had that game from the very beginning, and it was fairly optimized from the start on PC. Only on the consoles did it really have any issues. The sole Gameworks feature you're referring to that CDPR ended up optimizing was Hairworks, and that could be disabled. HBAO+, on the other hand, was just fine.

Here is 2009 talking about Xbox 360 using Parallel tasks http://stackoverflow.com/questions/...-such-as-games-use-loads-of-different-threads

And MS talking about it prior to that for 360 development: https://msdn.microsoft.com/en-us/library/ee416321.aspx?f=255&MSPPError=-2147217396

And here is a library for 360 development from 2008: http://paralleltasks.codeplex.com/SourceControl/list/changesets?page=1

So I'm guessing 2008 -> 2017 isn't a long time? Considering that in 2008 the best Nvidia GPU was the 9000 series, based off the 8000 series from 2006, I'd consider that a long time, yes.

Instead of showing people talking about it, why don't you find games that actually used it? Also, last time I checked we were talking about PC games, not console games. Console games don't play by the same rules as PC games as they use APIs that are programmed at a much lower level than what is found in PCs. As such, parallel rendering was never really problematic to implement on the consoles because they don't have the API and driver overhead of the PC platform to begin with.

Yeah, OK, I'll take your word that it doesn't work properly over the almost 3x performance gain I've seen myself. :rolleyes:

LOL, a 3x performance gain on some code sample in an artificial environment, no doubt. Why don't you try an actual game and then get back with me?

But I have to say, that if Microsoft themselves couldn't convince you that deferred context rendering requires command lists to function properly, then I'm pretty sure that I can't.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Your statement is wrong and I'll prove it right now.
DX11 Multi-core scaling @TH
Metro: LL Redux

[Benchmark chart: metro-19x10-fps.png]

Zero scaling from 6 to 10 cores. The 6700K takes a hit in avg. fps probably because of the other threads that the game spawns.

LOL, this is the absolute worst way to test CPU scaling. To do a CPU scaling test properly, you need to account for as many variables as possible, such as clock speed, cache size, and differences in architecture. It seems that Tom's Hardware only accounted for clock speed, and that was it.

So the test is flawed, because cache size is undoubtedly influencing the results of that test. It would have been much more effective if they had used the 6950x alone and then disabled cores and hyperthreading as necessary like what PCgameshardware.de did for Tom Clancy's Ghost Recon Wildlands.

That said, the fact that there is zero scaling from 6 to 10 cores means nothing in and of itself. If the game's engine can't scale above six threads, then nothing is going to change it.

Project Cars-
There is a debate whether this game runs PhysX and AI off the primary thread. Here you see NVIDIA's driver doing well to scale across multiple cores:
[Benchmark chart: cars-19x10-fps.png]


Edit:after a bit of digging, it is the combination of drivers+physics engine of the game that is responsible for its historically poor performance on AMD cards.

As for this test, it's just as flawed as the one above. The differences in this test can be attributed to L3 cache size, as the 6700K is actually slightly in front of the hexcore 6850K this time, which shows that it's not really scaling above four cores.

Also, newsflash: Project Cars uses CPU PhysX and not GPU PhysX, so you can't blame the physics engine for the performance issues on AMD hardware.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136

I think you're not understanding the context of our conversation. I wasn't asking you to provide examples of task based parallelism as a concept, but as an actual implementation in games. Task based parallelism as a programming model has been around for a long time as you said, and it has probably been used in other fields of programming for years and years.

But for PC games, it's fairly new. The most popular game engine by far last generation was UE3. UE3 used the old thread-level parallelism model, where certain tasks were pinned to certain threads: one thread each for physics, streaming, audio, game logic, and rendering. This is completely different from what we have nowadays, where all of those tasks could conceivably run on the SAME thread, but broken up.

We really didn't start seeing task based parallelism being implemented for PC games until the very end of the last gen console cycle (games like Crysis 3), and then it really ramped up for the cross gen titles on PS4 and Xbox One like Watch Dogs, BF4 etcetera.
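For contrast, the old thread-level layout looks roughly like this (purely illustrative, not actual UE3 code): each subsystem owns a dedicated thread, so the engine can never use more cores than it has subsystems, and the slowest subsystem sets the frame time.

```cpp
// Caricature of thread-level parallelism: one dedicated thread per subsystem.
#include <atomic>
#include <thread>

void RunEngineThreadLevel(std::atomic<bool>& running)
{
    std::thread gameLogic([&] { while (running) { /* tick game logic   */ } });
    std::thread physics  ([&] { while (running) { /* step physics      */ } });
    std::thread audio    ([&] { while (running) { /* mix audio         */ } });
    std::thread streaming([&] { while (running) { /* stream assets     */ } });
    std::thread render   ([&] { while (running) { /* submit draw calls */ } });

    gameLogic.join(); physics.join(); audio.join();
    streaming.join(); render.join();
}
```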
 

dogen1

Senior member
Oct 14, 2014
739
40
91
Here's an excellent PDF presentation by a programmer at Intel on how to implement task based parallelism in a game engine.

That presentation was created in 2011 for the Game Developers Conference, so this totally refutes your assertion that task based parallelism has been used in game engines for a "very long time." :rolleyes:

Well, it was in use at least as early as Metro 2033 (the lead programmer talked about it in a digital foundry interview), so that was 2009. Idk if the PC version did so, but I assume it did.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Well, it was in use at least as early as Metro 2033 (the lead programmer talked about it in a digital foundry interview), so that was 2009. Idk if the PC version did so, but I assume it did.

Can you show me that interview? I'm pretty sure Metro 2033 didn't use it, as the CPU scaling in that game wasn't particularly good if I recall, at least for the PC version.
 

dogen1

Senior member
Oct 14, 2014
739
40
91
Can you show me that interview? I'm pretty sure Metro 2033 didn't use it, as the CPU scaling in that game wasn't particularly good if I recall, at least for the PC version.

Sure. I think CPU scaling wasn't amazing because it was heavily GPU bound at the time.

http://www.eurogamer.net/articles/digitalfoundry-tech-interview-metro-2033

The most interesting/non traditional thing about our implementation of multi-threading is that we don't have dedicated threads for processing some specific tasks in-game with the exception of PhysX thread.

All our threads are basic workers. We use task-model but without any pre-conditioning or pre/post-synchronising. Basically all tasks can execute in parallel without any locks from the point when they are spawned. There are no inter-dependencies for tasks. It looks like a tree of tasks, which start from more heavyweight ones at the beginning of the frame to make the system self-balanced.
...
The last time I measured the statistics, we were running approximately 3,000 tasks per 30ms frame on Xbox 360 for CPU-intensive scenes with all HW threads at 100 per cent load.

In fact, they were using fibers years before teams like Naughty Dog (not a dig against Naughty Dog, of course).

The PS3 is not that different by the way. We use "fibres" to "emulate" a six-thread CPU, and then each task can spawn a SPURS (SPU) job and switch to another fibre. This is a kind of PPU off-loading, which is transparent to the system. The end result of this beautiful, albeit somewhat restricting, model is that we have perfectly linear scaling up to the hardware deficiency limits.
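Loosely, the worker/task model they describe has this shape (an illustration of the concept only, definitely not 4A's code): every thread is a generic worker, any task can spawn child tasks, and the frame is done when the outstanding-task count drains back to zero.

```cpp
// Minimal task pool: generic workers pull tasks from a shared queue; tasks may
// spawn further tasks (the "tree of tasks"), and WaitIdle() returns once the
// whole tree for the frame has drained.
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class TaskPool {
public:
    explicit TaskPool(unsigned workerCount) {
        for (unsigned i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { WorkerLoop(); });
    }
    ~TaskPool() {
        done_ = true;
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void Spawn(std::function<void()> task) {        // callable from any task
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(task));
            ++outstanding_;
        }
        cv_.notify_one();
    }
    void WaitIdle() {                               // end-of-frame sync point
        while (outstanding_.load() != 0) std::this_thread::yield();
    }
private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();                                 // may Spawn() child tasks
            --outstanding_;
        }
    }
    std::vector<std::thread> workers_;
    std::deque<std::function<void()>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::atomic<int> outstanding_{0};
    std::atomic<bool> done_{false};
};
```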
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
That's awesome. I figured the earliest game to use the task-based model was Crysis 3, because that game could scale to 6 threads easily. But I guess Metro 2033 might be the earliest candidate after all. I agree that it wouldn't have made much of an impact due to the game being overwhelmingly GPU bound.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Sure. I think CPU scaling wasn't amazing because it was heavily GPU bound at the time.

http://www.eurogamer.net/articles/digitalfoundry-tech-interview-metro-2033

Same thing with Battlefield / Frostbite engine in 2010:

http://www.techspot.com/articles-info/255/images/quad.png

http://www.techspot.com/article/255-battlefield-bad-company2-performance/page7.html

GPU bound, but it did scale well to all 4 cores. Games have been doing multithreading for a very long time.

Anyway, none of this has to do with DX11 drivers, which is the topic.
 

dogen1

Senior member
Oct 14, 2014
739
40
91
That's awesome. I figured the earliest game to use the task-based model was Crysis 3, because that game could scale to 6 threads easily. But I guess Metro 2033 might be the earliest candidate after all. I agree that it wouldn't have made much of an impact due to the game being overwhelmingly GPU bound.

Yeah, 4A Games has a very skilled engineering team. They were one of the first to use deferred shading (in STALKER); at least on PC they might've been first. I know Shrek did it in 2001 on Xbox, lol. I remember being super hyped for the game just because of that article, but I didn't even get it until recently, lol.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Same thing with Battlefield / Frostbite engine in 2010:

http://www.techspot.com/articles-info/255/images/quad.png

http://www.techspot.com/article/255-battlefield-bad-company2-performance/page7.html

GPU bound, but it did scale well to all 4 cores. Games have been doing multithreading for a very long time.

Ahem. The MP didn't scale well on cores either, so I don't think the Frostbite 2 engine had task based parallelism. It probably used the old thread level parallelism which was common at the time.

The differences in the benchmarks below can be attributed to cache size and architecture.

[Benchmark chart: Rs4ovd.png]


Anyway, none of this has to do with DX11 drivers, which is the topic.

Indeed, but let's remember the reason this whole hornet's nest was kicked off in the first place. The OP stated that NVidia uses a software scheduler to somehow split the rendering workload from a single thread into multiple threads.

Nobody really knows how NVidia is doing it, but we know for sure that NVidia uses a compiler to schedule instructions to the GPU, and they are most likely taking advantage of modern CPU characteristics such as high clock speed, multicore, SMT, extremely accurate branch predictors, tremendous flexibility etcetera to have an insanely efficient instruction scheduler that can rival or surpass a hardware scheduler. If true, that's incredible and a tremendous feat of software engineering.

Despite this, it's the warp scheduler in the end that actually assigns work to the various shader clusters, and that's definitely hardware based.
 

dogen1

Senior member
Oct 14, 2014
739
40
91
Ahem. The MP didn't scale well on cores either, so I don't think the Frostbite 2 engine had task based parallelism. It probably used the old thread level parallelism which was common at the time.

This is still kinda off topic, but yeah, I remember an announcement about one of the older Frostbite games stating they'd increased the core count the engine could scale to, to 6 or something like that. Might've been Medal of Honor. Sounds like a coarse-grained approach.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
This is still kinda off topic, but yeah, I remember an announcement about one of the older Frostbite games stating they'd increased the core count the engine could scale to, to 6 or something like that. Might've been Medal of Honor. Sounds like a coarse-grained approach.

Well, the current Frostbite 3 engine uses 8 threads; that I know for sure. Usually these things occur as a gradual evolution, and we know for sure that DICE has made continual updates to the Frostbite engine. It probably went from 4 to 6 and then to 8.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,708
3,554
136
Actually it does, because it exposes a flaw in the theory of exactly how NVidia is accomplishing this. Whatever NVidia is doing is likely remarkably sophisticated, and probably doesn't include driver command lists (as understood in the context of the DX11 feature), since those require game-side changes as well, as I've already explained.
NVIDIA does use DCLs in their DX11 driver. See their 2016 GDC talk.

I've heard conflicting reports on this. I remember a long time ago when I was playing around with this setting when playing NWN2 on Vista, and it definitely seemed to impact the game's performance. And NWN2 was DX9 if I recall.
Nobody really knows how it impacts DX11 titles, and it is definitely known to cause issues in OpenGL.

LOL, this is the absolute worst way to test CPU scaling. To do a CPU scaling test properly, you need to account for as many variables as possible, such as clock speed, cache size, and differences in architecture. It seems that Tom's Hardware only accounted for clock speed, and that was it.

Except for the i3, all of them have >8MB of L3. The size of the L3 matters when you have a small amount of it to begin with, like in the case of laptop i3s and i5s. (http://wccftech.com/intel-amd-l3-cache-gaming-benchmarks/ - OG source is 404ed)

That said, the fact that there is zero scaling from 6 to 10 cores means nothing in and of itself. If the game's engine can't scale above six threads, then nothing is going to change it.
It's a pity you don't read. Metro LL Redux uses ONE thread for DX11 draw calls. Of course it will not scale. And of course NVIDIA's driver can't do anything in this situation.
As for this test, it's just as flawed as the one above. The differences in this test can be attributed to L3 cache size, as the 6700K is actually slightly in front of the hexcore 6850K this time, which shows that it's not really scaling above four cores.

Also, newsflash: Project Cars uses CPU PhysX and not GPU PhysX, so you can't blame the physics engine for the performance issues on AMD hardware.
Wrong again. The Broadwell-E scaling can be perfectly understood if one knows how the game handles physics:
  • The MADNESS engine runs PhysX at only 50Hz and not at 600Hz as mentioned in several articles
  • The MADNESS engine uses PhysX for collision detection and dynamic objects, which is a small part of the overall physics systems
  • The MADNESS engine does not use PhysX for the SETA tyre model or for the chassis constraint solver (our two most expensive physics sub-systems)
  • The MADNESS engine does not use PhysX for the AI systems or for raycasting, we use a bespoke optimised solution for those
  • The physics systems run completely independently of the rendering and main game threads and utilises 2 cores at 600Hz
  • The physics threading does not interact with the rendering, it is a push system sending updated positional information to the render bridge at 600Hz (this is what is hinted at by the developers as causing the performance drop on AMD cards, in the [H]-forum post I linked previously)
  • Any performance difference with PhysX would not be reflected with differences in comparing rendering frame rates. There is no interaction between PhysX and the rendering
  • Overall, PhysX uses less than 10% of all physics thread CPU on PC. It is a very small part of the physics system so would not make a visual difference if run on the CPU or GPU
The 6700K is able to edge out the 6850K probably because it has better single-threaded performance, which helps with the physics calculations.
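The "push system" in those bullet points is easy to picture: a physics thread stepping at a fixed rate publishes positional snapshots, and the render side grabs the latest one whenever it builds a frame, so neither blocks the other. A rough sketch (the Snapshot type and names are made up; a real implementation would triple-buffer to avoid tearing):

```cpp
// Fixed-rate physics thread pushing snapshots to a shared "render bridge".
#include <atomic>
#include <chrono>
#include <thread>

struct Snapshot { /* car positions, orientations, ... */ };

Snapshot buffers[2];                 // shared with the render bridge
std::atomic<int> latest{0};          // index of the newest published snapshot

void PhysicsThread(std::atomic<bool>& running)
{
    using clock = std::chrono::steady_clock;
    const auto step = std::chrono::microseconds(1000000 / 600);   // ~600Hz
    auto next = clock::now();
    int write = 1;

    while (running)
    {
        // StepPhysics(buffers[write]);                    // placeholder
        latest.store(write, std::memory_order_release);    // publish snapshot
        write ^= 1;                                        // flip buffers
        next += step;
        std::this_thread::sleep_until(next);
    }
}

// Each frame the renderer copies buffers[latest.load(std::memory_order_acquire)]
// and never blocks the physics thread.
```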
 

Spjut

Senior member
Apr 9, 2011
928
149
106
I remember that slide. Repi was complaining that there were no drivers from AMD or NVidia that could do DX11 multithreading at the time. But one could easily see that he wasn't happy with the overall state of DX11 multithreading regardless, which is why he went on to design Mantle.

repi also said, even when Battlefield 4 was released, two years after Nvidia supported it in their mainstream drivers, that it was useless for them.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
NVIDIA does use DCLs in their DX11 driver. See their 2016 GDC talk.

I never said they didn't use DCLs. I'm saying that developers need to make the game engine aware of them for DCLs to work, so unless NVidia has found a way to bypass that, it's highly unlikely that DCLs are the reason behind their superior DX11 performance.

Except for the i3, all of them have >8MB of L3. The size of the L3 matters when you have a small amount of it to begin with, like in the case of laptop i3s and i5s. (http://wccftech.com/intel-amd-l3-cache-gaming-benchmarks/ - OG source is 404ed)

The 6700K has exactly 8MB of L3 cache. The 6850K has 15MB, and the 6900K has 20MB. The 6950x has 25MB. So out of the two game benchmarks that you posted, Metro LL Redux scales to six threads, which is why you see the HEDT CPUs take the point.

It's a pity you don't read. Metro LL Redux uses ONE thread for DX11 draw calls. Of course it will not scale. And of course NVIDIA's driver can't do anything in this situation.

Where did you see that Metro LL Redux uses one thread for draw calls? According to dogen1's source, the game uses task based parallelism which means that rendering is probably a shared task between worker threads.
Wrong again. The Broadwell-E scaling can be perfectly understood if one knows how the game handles physics:

No idea how this ties into your overall argument. Perhaps you should expound.

The 6700K is able to edge out the 6850K probably because it has better single-threaded performance, which helps with the physics calculations.

But if NVidia's driver were scaling the CPU threads like you claim, then the 6850K would be in front of it, wouldn't it? In games that can use a lot of threads, like Tom Clancy's Ghost Recon Wildlands, thread count matters. But in Project Cars, a quad-core + HT with less L3 cache barely edges out the 6850K, which tells me that the game itself uses no more than probably 4 threads (with HT for added efficiency), unlike Metro LL Redux, which can use at least 6.

This defeats your assertion that NVidia's driver is somehow able to bypass the game's actual programming and churn out extra worker threads using the CPU to increase rendering output.