Nvidia vs AMD's Driver Approach in DX11

Carfax83 · Apr 6, 2017

Spjut said:
repi also said even when Battlefield 4 was released, so two years after Nvidia supported it in their mainstream drivers, that it was useless for their them.

Yeah, I remember him saying that. The DX11 multithreading model is just too inefficient, but it was really a stopgap solution anyway until DX12.

tamz_msc · Apr 6, 2017

Carfax83 said:
I never said they didn't use DCLs. I'm saying that developers need to make the game engine aware for DCLs to work, so unless NVidia has found a way to bypass that, its highly unlikely that DCLs are the reason behind their superior DX11 performance.

If you cared to listen to NVIDIA's 2016 GDC talk where they talk about how developers should implement DCLs, then it would give you the context to properly understand what the OP video is talking about.

The 6700K has exactly 8MB of L3 cache. The 6850K has 15MB, and the 6900K has 20MB. The 6950x has 25MB. So out of the two game benchmarks that you posted, Metro LL Redux scales to six threads, which is why you see the HEDT CPUs take the point.

I don't even...one moment you say it scales because of cache, the very next moment you say it scales to six threads. So going from the 6850K,6900K, and 6950X the bigger change according to you is the cache and not the additional cores?

Show me benchmarks where going from 8MB L3 to 25MB L3 and beyond has more than a 1-2 fps difference in performance.

Where did you see that Metro LL Redux uses one thread for draw calls? According to dogen1's source, the game uses task based parallelism which means that rendering is probably a shared task between worker threads.

I have a better source - I don't know if you deliberately chose to ignore it in my original post or simply forgot about it.

Interview with 4A Games

Tom’s Hardware: Can you go into more depth about how 4A uses threading and what benefits that confers?

Dmitry: PC (for Redux games): One dedicated D3D API submission thread, two or three control (depends on hardware thread count), all other available cores/threads become workers + some aux + some GPU driver threads + a lot of random threads from other processes

No idea how this ties into your overall argument. Perhaps you should expound.

Well for starters, here is an example of the task based parallelism in effect. The game is very well threaded in itself. Unlike in the case of AMD where the physics engine updating the rendering engine+drivers cause GPUs to take a hit, NVIDIA's drivers are doing their job here.

But if NVidia's driver were scaling the CPU threads like you claim, then the 6850K would be in front of it wouldn't it?

Why don't you also take a look at Broadwell-E scaling in the same chart?

But in Project Cars, a quad core +HT with less L3 cache barely edges out the 6850K, which tells me that the game itself uses no more than probably 4 threads (with HT for added efficiency) unlike Metro LL Redux which can use at least 6.

Look at the charts again before making this erroneous conclusion.

In hindsight, Project Cars might not have been the best example. Given that you mention TCGR:W for the umpteenth time, I had a look at The Division instead -

aHR0cDovL21lZGlhLmJlc3RvZm1pY3JvLmNvbS9ML08vNTk5NTMyL29yaWdpbmFsL2RpdmlzaW9uLTI1eDE0LWZwcy5wbmc=

Tell me why there is reverse scaling with no. of cores on the Broadwell-E processors, and why the 6700K tops them all.

Carfax83 · Apr 7, 2017

tamz_msc said:
If you cared to listen to NVIDIA's 2016 GDC talk where they talk about how developers should implement DCLs, then it would give you the context to properly understand what the OP video is talking about.

Hmm, here's their technical speeches for GDC 2016, and I don't see anything about DCLs, at least not for DX11. In the interest of time, could you point it out?

NVidia GDC 2016

I don't even...one moment you say it scales because of cache, the very next moment you say it scales to six threads. So going from the 6850K,6900K, and 6950X the bigger change according to you is the cache and not the additional cores?

Deep breaths! If you read my post, I mentioned that it was influenced by L3 cache, and also that it scaled to six threads. But the fact that it scales to six threads is nothing remarkable in and of itself.

Several game engines will use six threads, or even more.

Show me benchmarks where going from 8MB L3 to 25MB L3 and beyond has more than a 1-2 fps difference in performance.

Not hard to find one. Fallout 4 responds well to L3 cache size, which is why the 5960x sits atop this benchmark chart:

I have a better source - I don't know if you deliberately chose to ignore it in my original post or simply forgot about it.

That's for draw call submissions. DX12 uses a single thread for draw call submissions as well. The command lists record from multiple threads, but the final submission to the GPU is serial.

Well for starters, here is an example of the task based parallelism in effect. The game is very well threaded in itself. Unlike in the case of AMD where the physics engine updating the rendering engine+drivers cause GPUs to take a hit, NVIDIA's drivers are doing their job here.

Once again, the physics engine is CPU based in Project Cars. It shouldn't have any effect on the ability of the CPU to issue draw calls unless you're using a weak CPU like a dual core without SMT or something.

AMD's problems with Project Cars are likely just due to overall driver inefficiency. BTW, Project Cars does support DX11 multithreading, but it's disabled by default the last time I checked.

Why don't you also take a look at Broadwell-E scaling in the same chart?

I have been looking at it. Because the test was conducted so poorly, it's difficult to tell whether it's from L3 cache size or threads. I suspect cache size and architecture are playing a bigger role here than thread count, because the 6700K is ahead of the 6850K.

Here's a better example of CPU scaling in Project Cars. As you can see, it doesn't use over four threads....as I suspected.

Look at the charts again before making this erroneous conclusion.

How about you look at my new chart

In hindsight, Project Cars might not have been the best example. Given that you mention TCGR:W for the umpteenth time, I had a look at The Division instead -

You really seem to have a liking for Toms Hardware I must say, even though they aren't that good.

Tell me why there is reverse scaling with no. of cores on the Broadwell-E processors, and why the 6700K tops them all.

For God's sake man, stop using Toms Hardware as a source!

It's obvious that the game is GPU bound in that benchmark. Doing a CPU scaling test at 1440p is insanely stupid first off. To do a proper CPU scaling test, you should run the benchmark at a much lower resolution to keep things CPU bound.

The Division scales to about 6 threads if I recall, so if the 6700K is ahead (and only by 1 FPS), it's just margin of error GPU bound stuff.

If you want to see some real CPU scaling tests on The Division, look at PCgameshardware.de

tamz_msc · Apr 7, 2017

Carfax83 said:
Hmm, here's their technical speeches for GDC 2016, and I don't see anything about DCLs. In the interest of time, could you point it out?

http://www.gdcvault.com/play/1023517/Advanced-Rendering-with-DirectX-11
11:50 onward.

Carfax83 · Apr 7, 2017

tamz_msc said:
http://www.gdcvault.com/play/1023517/Advanced-Rendering-with-DirectX-11
11:50 onward.

I'm pretty sure he was talking about DX12 and not DX11. DX12 also uses command lists.

tamz_msc · Apr 7, 2017

Carfax83 said:
Deep breaths! If you read my post, I mentioned that it was influenced by L3 cache, and also that it scaled to six threads. But the fact that it scales to six threads is nothing remarkable in and of itself.

Prove that 8MB+ L3 cache has an effect beyond 2-3fps at best. It is laughable that you show Fallout 4 to claim that bigger L3 on the 5960X is what puts it on top when it is known that Fallout 4 loves memory bandwidth.

That's for draw call submissions. DX12 uses a single thread for draw call submissions as well. The command lists record from multiple threads, but the final submission to the GPU is serial.

So what? That the game caps out at 6 threads doesn't necessarily imply that NVIDIA's driver also caps out at 6 threads, does it?

Once again, the physics engine is CPU based in Project Cars. It shouldn't have any effect on the ability of the CPU to issue draw calls unless you're using a weak CPU like a dual core without SMT or something.

AMD's problems with Project Cars are likely just due to overall driver inefficiency. BTW, Project Cars does support DX11 multithreading, but it's disabled by default the last time I checked.

In your earlier post you said that it doesn't scale beyond 4 cores, when in reality it's the opposite.

I have been looking at it. Because the test was conducted so poorly, it's difficult to tell whether it's from L3 cache size or threads. I suspect cache size and architecture are playing a bigger role here than thread count, because the 6700K is ahead of the 6850K.

No, just no. Read what I earlier linked about the physics engine of Project Cars. The physics engine, although independent, constantly updates the render engine based on positional information of the car+track+camera. It is more likely that the better single-threaded performance of the 6700K is helping it to pull ahead. 6850K has more cache which doesn't help it that much. So again you're wrong to claim that cache sizes make a difference.

For God's sake man, stop using Toms Hardware as a source!

It's obvious that the game is GPU bound in that benchmark. Doing a CPU scaling test at 1440p is insanely stupid first off. To do a proper CPU scaling test, you should run the benchmark at a much lower resolution to keep things CPU bound.

The Division scales to about 6 threads if I recall, so if the 6700K is ahead (and only by 1 FPS), it's just margin of error GPU bound stuff.

This is just beyond stupid. No argument can be made if it comes down to saying that the source of the data is bad. Please tell me why in a GPU-bound situation going from a six-core CPU to a ten-core CPU drops min. frame rates by 25%?

tamz_msc · Apr 7, 2017

Carfax83 said:
I'm pretty sure he was talking about DX12 and not DX11. DX12 also uses command lists.

He essentially talks about the optimum way of programming CLs in DX12 to prevent CPU bottlenecks. Why do you think that such bottlenecks exist in this case?

Carfax83 · Apr 7, 2017

tamz_msc said:
Prove that 8MB+ L3 cache has an effect beyond 2-3fps at best. It is laughable that you show Fallout 4 to claim that bigger L3 on the 5960X is what puts it on top when it is known that Fallout 4 loves memory bandwidth.

Fallout 4 loves memory bandwidth? LOL That's news to me. All the tests I've seen shows that it loves memory speed more than anything. No game uses the bandwidth provided by the X99 platform (which can be over 70GB/s), least of all Fallout 4 with it's last gen graphics.

Also, incase you didn't know, the size of the L3 cache reduces reliance on memory speed.

So what? That the game caps out at 6 threads doesn't necessarily imply that NVIDIA's driver also caps out at 6 threads, does it?

Wow, you just can't admit that you're wrong in anything can you? That's a serious character flaw there man. All submissions, whether on DX11 or DX12 occur on a single thread. It's the worker threads that collect all of the data, and as such, Metro LL Redux doesn't scale above six threads period because of how the engine is designed.

In your earlier post you said that it doesn't scale beyond 4 cores, when in reality it's the opposite.

Now you're just being intentionally obtuse. Did you not see the nice pretty graph from Computerbase.de that I posted which shows that Project Cars does not scale beyond four threads?

No, just no. Read what I earlier linked about the physics engine of Project Cars. The physics engine, although independent, constantly updates the render engine based on positional information of the car+track+camera. It is more likely that the better single-threaded performance of the 6700K is helping it to pull ahead. 6850K has more cache which doesn't help it that much. So again you're wrong to claim that cache sizes make a difference.

I also mentioned architecture. Whether it's cache or architecture, the point is that it's NOT using more than four threads. If the game used more than four threads, the 6850K would be ahead of the 6700K now wouldn't it?

Here, I'll make it easy for you. This is what a game looks like when it actually uses more than four cores. As you can see, the 7700K despite its tremendous clock speed advantage and being on the latest architecture, doesn't have a chance in hell against Intel's Broadwell-E HEDT CPUs.

Heck, even Ryzen is getting in on the action.

This is just beyond stupid. No argument can be made if it comes down to saying that the source of the data is bad. Please tell me why in a GPU-bound situation going from a six-core CPU to a ten-core CPU drops min. frame rates by 25%?

Minimum framerates are notoriously unreliable. No serious reviewer uses them anymore, as they don't provide an accurate estimation of performance. Most reviewers nowadays use percentiles which are much better.

Carfax83 · Apr 7, 2017

tamz_msc said:
He essentially talks about the optimum way of programming CLs in DX12 to prevent CPU bottlenecks. Why do you think that such bottlenecks exist in this case?

Because overloading a CPU with too many command lists will use too much CPU and impact performance elsewhere. This is the same problem that happens if you use DCLs inappropriately in DX11, although the threshold for screwing up is much lower in DX11 than it is for DX12.

Yep, too much of anything is usually bad for you.

tamz_msc · Apr 7, 2017

Carfax83 said:
Fallout 4 loves memory bandwidth? LOL That's news to me. All the tests I've seen shows that it loves memory speed more than anything.

Look here, someone is pretending to not know what increased memory speed does. Again claiming that it's the 4770K's bigger cache is what puts it ahead of the i3 4360. Still cannot show how having >=8MB L3 affects fps.

All submissions, whether on DX11 or DX12 occur on a single thread.

4A games said "all D3D submissions are done on a dedicated thread". By that do they mean draw call submissions or final submission before rendering? If its the former then regarding its multithreading capability NVIDIA claims otherwise.

No serious reviewer uses them anymore, as they don't provide an accurate estimation of performance.

Says the guy who just showed a graph with minimum frame rates from Techspot. Extreme double standards I say.

Because overloading a CPU with too many command lists will use too much CPU and impact performance elsewhere. This is the same problem that happens if you use DCLs inappropriately in DX11, although the threshold for screwing up is much lower in DX11 than it is for DX12.

Yeah and that is exactly the opposite(in most cases) of what AMD prescribes when programming for GCN.

William Gaatjes · Apr 9, 2017

Bacon1 said:
Yes, their drivers which is CPU. They moved away from handling the scheduling on hardware because not enough developers were multithreading the calls and thus it was bottlenecking.

Both AMD and Nvidia support deferred contexts, but Nvidia also supports driver command lists. So when DX11 games are single thread heavy, Nvidia's driver will spread out and make it thread friendly. When Developers take the time to make it thread friendly to begin with, both work AMD and Nvidia work well and neither end up CPU bound.

So yes, Nvidia is more reliant on the CPU than AMD since they handle it through drivers, but they aren't limited to a single CPU if the game isn't optimized well.

So, poor CPU utilization = Nvidia faster (driver moves to other cores), good CPU utilization = AMD has lower overhead because the GPU handles it.

DX11 is still a massive bottleneck even when properly multithreaded as shown in this video and as you can see from any API Overhead chart from 3dMark.

http://www.anandtech.com/show/11223/quick-look-vulkan-3dmark-api-overhead

Now the differences between DX11 and either DX12 or Vulkan are massive

On a side note, It is interesting to see in the anandtech article that the jump in drawcalls for the GTX1060 from DX11 to DX12/vulcan is massive. As is the same for the GTX1080 Ti. But the relative difference in drawcalls between the GTX1060 and the GTX 1080 i expected to be much higher. But it is not. 26.4million for the GTX 1060 and 32.4 for the GTX 1080 Ti. The i7-4960X = a 6 core/ 12 thread and seems to be the limiting factor in at least this drawcall test. Thus GTX1080 Ti is only interesting with a very powerful cpu.

dogen1 · Apr 9, 2017

William Gaatjes said:
But the relative difference in drawcalls between the GTX1060 and the GTX 1080 i expected to be much higher.

don't

William Gaatjes · Apr 9, 2017

dogen1 said:
don't

Please elaborate...

dogen1 · Apr 9, 2017

William Gaatjes said:
Please elaborate...

It's a (highly) synthetic test designed to highlight the difference between API (and CPU) as much as possible.

William Gaatjes · Apr 9, 2017

dogen1 said:
It's a (highly) synthetic test designed to highlight the difference between API (and CPU) as much as possible.

Ok.
I mentioned that in my post as well.
But the general thought that a gtx1080 ti needs a very powerful cpu is not wrong.
One thing i do not get, if it was purely cpu limited, why are the result not the same when using the same cpu ? I mean, the drawcalls are issued by the cpu. Then again, the gtx1080 ti services the drawcalls faster, so it should get a higher number until the cpu becomes the bottleneck. That would explain the slightly higher number.

edit:
Changed text.

Carfax83 · Apr 9, 2017

tamz_msc said:
Look here, someone is pretending to not know what increased memory speed does. Again claiming that it's the 4770K's bigger cache is what puts it ahead of the i3 4360. Still cannot show how having >=8MB L3 affects fps.

Like I told you a few pages ago, perhaps you should take your own advice

FYI memory speed is measured in cycles, and bandwidth is measured in GB/s. The two are often related, but obviously they aren't the same. The X99's bandwidth advantage comes from additional memory channels, NOT memory speed.

And for the last time, Fallout 4 does not respond to memory bandwidth. It responds to memory frequency. And a bigger L3 cache reduces the need for the CPU to go all the way to main memory for data, thus improving the speed at which the CPU can process data and increase frame rates for the game.

4A games said "all D3D submissions are done on a dedicated thread". By that do they mean draw call submissions or final submission before rendering? If its the former then regarding its multithreading capability NVIDIA claims otherwise.

As far as I know, draw call submission are recorded via command lists on multiple threads, and then sent to a dedicated thread for final submission. This is the same for both DX11 and DX12, but the former has much greater overhead than the latter.

Says the guy who just showed a graph with minimum frame rates from Techspot. Extreme double standards I say.

Yeah but I never made a song and dance about minimum framerates being important did I?

Yeah and that is exactly the opposite(in most cases) of what AMD prescribes when programming for GCN.

Well AMD has a hardware scheduler so maybe these rules don't apply to them. But in any case, they never used command lists in DX11.

dogen1 · Apr 9, 2017

William Gaatjes said:
Ok.
I mentioned that in my post as well.
But the general thought that a gtx1080 ti needs a very powerful cpu is not wrong.
One thing i do not get, if it was purely cpu limited, why are the result not the same when using the same cpu ? I mean, the drawcalls are issued by the cpu. Then again, the gtx1080 ti services the drawcalls faster, so it should get a higher number until the cpu becomes the bottleneck. That would explain the slightly higher number.

edit:
Changed text.

Yup, the 1080 draws all the geometry faster so the CPU can get started on the next frame a little bit sooner.

And I wouldn't say it needs a more powerful CPU necessarily. You can always increase GPU bound settings to take advantage, resolution, AA, shadows, etc.

William Gaatjes · Apr 9, 2017

dogen1 said:
Yup, the 1080 draws all the geometry faster so the CPU can get started on the next frame a little bit sooner.

And I wouldn't say it needs a more powerful CPU necessarily. You can always increase GPU bound settings to take advantage, resolution, AA, shadows, etc.

But since the GTX 1080 Ti is theoretically at least twice as fast, there is clearly a bottleneck somewhere. Maybe a combination of hardware and software. That is why i expected the gtx1080 ti to have a much higher number.

dogen1 · Apr 9, 2017

William Gaatjes said:
But since the GTX 1080 Ti is theoretically at least twice as fast, there is clearly a bottleneck somewhere. Maybe a combination of hardware and software. That is why i expected the gtx1080 ti to have a much higher number.

Yeah.. the CPU. Like I said, the test is mostly CPU bound and is designed to be.

tamz_msc · Apr 10, 2017

Carfax83 said:
Like I told you a few pages ago, perhaps you should take your own advice

FYI memory speed is measured in cycles, and bandwidth is measured in GB/s. The two are often related, but obviously they aren't the same. The X99's bandwidth advantage comes from additional memory channels, NOT memory speed.

And for the last time, Fallout 4 does not respond to memory bandwidth. It responds to memory frequency. And a bigger L3 cache reduces the need for the CPU to go all the way to main memory for data, thus improving the speed at which the CPU can process data and increase frame rates for the game.

Deep breaths......Who said that frequency and bandwidth are same? What changes when you get faster memory? Bandwidth and latency. Latency changes plateau after a certain point, but bandwidth continues to increase as long as the memory controller isn't saturated. Changing the memory frequency directly affects memory bandwidth. This is why it is said that you should get faster memory when your applications are memory bound - depending on the application it may be sensitive to bandwidth or latency, or both.

X99 has an inherent advantage here because quad-channel support means that even with lower frequency memory, you can get more bandwidth from it compared to a dual-channel setup. This is what I mean when I say that Fallout 4 loves memory bandwidth - because ultimately this is what you are changing by getting faster RAM or moving to X99. I suspect that it may be sensitive to latency as well.

Faster memory implies more bandwidth, as long as you don't loosen the timings.

Oh, you still haven't proved how more than 8MB L3 affects performance, apart from repeating a textbook statement. I showed you graphs where 2,4,6,8 MB L3 had an effect, albeit with diminishing returns, proving that it matters more when you have a low amount of it to begin with. Do your part and show me something similar for larger L3.

William Gaatjes · Apr 10, 2017

dogen1 said:
Yeah.. the CPU. Like I said, the test is mostly CPU bound and is designed to be.

That is what i mentioned in my first post as well.

The cpu is the limiting factor.

EightySix Four · Apr 11, 2017

Man, this effect seems even more obvious in the Ryzen 5 review. Check out the difference on Rocket League.

Bacon1 · Apr 11, 2017

EightySix Four said:
Man, this effect seems even more obvious in the Ryzen 5 review. Check out the difference on Rocket League.

Naw the 480 is normally 66% faster than the 1060 right? Nothing to see here, Nvidia drivers are fine

Yakk · Apr 11, 2017

EightySix Four said:
Man, this effect seems even more obvious in the Ryzen 5 review. Check out the difference on Rocket League.

WOW!

Just... Wow!

EightySix Four · Apr 11, 2017

Bacon1 said:
Naw the 480 is normally 66% faster than the 1060 right? Nothing to see here, Nvidia drivers are fine

Think about all the Ryzen benchmarks you've seen with a 1080 or Titan... It totally makes sense to use the the fastest card available for benchmarking but I don't think we've ever seen such a huge disparity due to display driver optimizations not matching up with a new CPU architecture before.

I don't think the Rocket League numbers are indicative of every game or even most games, but it's something to consider for sure.

Nvidia vs AMD's Driver Approach in DX11

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Lifer

Senior member

Lifer

Senior member

Lifer

Diamond Member

Senior member

Lifer

Senior member

Diamond Member

Lifer

Diamond Member

Diamond Member

Golden Member

Diamond Member