nvlddmkm crashes when running GPU tasks (any project)

Erich56

Junior Member
Aug 4, 2018
21
4
16
On the same machine with two GTX 980 Ti cards, on which I have been crunching GPUGRID (and sometimes other projects) for 2 1/2 years under Windows XP, I recently installed Windows 10.

The Mainboard is ASUS P9X79WS, the CPU is Intel i7-4930k, running @ 3.9GHz. RAM is 4x8GB Kingston HyperX Fury DDR3-1866. PSU is Corsair HX1000i 1000W.
The PC was assembled 2 1/2 years ago.

When crunching GPU projects on Windows 10 now, at irregular intervals (maybe after 1 hour, or after 8 hours, or some time in between) the monitor suddenly freezes and crunching stops.
Neither keyboard nor mouse responds any more, so all I can do is push the power button on the PC and reboot.

Then, the Windows event log under "system" shows the warning "the graphic driver nvlddmkm does no longer react and was restored".
This notice shows up every 4 seconds, from the time on the crash happened until I switched off the PC. So if this happens during night and I notice it only next morning, this entry is logged a few thousand times.
Under "details" it shows "eventID 4101", and under event data "nvlddmkm".
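
As a rough check on the "few thousand times" figure: assuming one warning every 4 seconds (the interval reported above) and a hypothetical unattended stretch of a few hours, the count can be estimated as sketched below. The function name and the example durations are illustrative only.

```python
def estimated_log_entries(hours_frozen: float, interval_seconds: float = 4.0) -> int:
    """Estimate how many Event ID 4101 warnings accumulate while the PC sits frozen.

    Assumes one warning per `interval_seconds`, as observed in the event log.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return int(hours_frozen * 3600 // interval_seconds)


if __name__ == "__main__":
    for hours in (1, 4, 8):  # hypothetical durations, not measured values
        print(f"{hours} h frozen -> about {estimated_log_entries(hours)} entries")
```

So a freeze that goes unnoticed overnight would plausibly leave several thousand identical entries.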

The first thing I tried was replacing the NVIDIA driver. Version 388... came with the Win10 installation; I did a clean uninstall (with DDU) and installed the latest driver from NVIDIA, version 398.36. (Note: on two other PCs with Windows 10, GPU crunching works well with the 388... driver.)
However, swapping the driver did not help at all, the problem persists :-(

What I also tried was to crunch only with 1 of the 2 GPUs - same problem.

What also needs to be said: With the NVIDIA Inspector, I set the GPU temperatures to 60/61°C, which yields GPU clocks around or even below default clock (i.e. about 1100MHz or even lower). So, there is no overclocking and no overheating of the GPUs. Power usage of the GPUs as indicated in the NVIDIA Inspector is around 50-60%.

As said above, I used the same settings under Windows XP, sometimes even with decent GPU overclocking (yielding a higher power usage of up to 75%), and this "nvlddmkm" crash never occurred.

Does anyone have an idea what could be causing this problem, and how I could get it solved?
 

mikeymikec

Lifer
May 19, 2011
19,936
14,188
136
Seems to me like three possibilities:

OC'd CPU (you mentioned a frequency that it does for turbo as if it's the typical running frequency)
Faulty graphics card(s)
Faulty PSU

I'd probably start by running the CPU at stock for a bit (and the RAM if that's overclocked). If that makes no difference, I'd pull a graphics card. Failing that, swap graphics cards.

I don't know what the power requirements of that graphics card are, so I'll leave PSU troubleshooting to others.

Do you get any graphics corruptions?
 

Erich56

mikeymikec said:
Seems to me like three possibilities:
...
Do you get any graphics corruptions?

The power requirement for a GTX980Ti is 360W. So if I run both at around 50%, I'll be at around 360W power requirement, which is not even half of what the PSU offers. And a 2 1/2-year-old PSU should normally still function well.
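
The arithmetic behind that estimate, as a small sketch. The 360 W per card, the 50% power usage and the 1000 W PSU rating are simply the figures quoted in this thread; CPU and other system draw are left out, so this covers GPU load only, not total system load.

```python
def gpu_load_watts(cards: int, watts_per_card: float, usage_fraction: float) -> float:
    """Combined GPU draw, assuming every card runs at the same fraction of its limit."""
    return cards * watts_per_card * usage_fraction


PSU_RATING_WATTS = 1000.0  # Corsair HX1000i rating as stated in the opening post

gpu_watts = gpu_load_watts(cards=2, watts_per_card=360.0, usage_fraction=0.5)
psu_fraction = gpu_watts / PSU_RATING_WATTS

if __name__ == "__main__":
    print(f"GPU load: {gpu_watts:.0f} W ({psu_fraction:.0%} of the PSU rating)")
```

On these numbers the two cards together account for a bit over a third of the PSU's rated output.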

Lowering the CPU clock from turbo 3.9GHz to default 3.4GHz might be a possibility to try. The RAM is at default with 1866MHz. Of course, I could lower that too, in the BIOS.

No graphic corruptions.

But, as said before: in Windows XP, the whole thing runs well with these settings. If there were some hardware defect, it would not run well under XP either, right?
 

mikeymikec

In the OP you referred to your XP use in the past tense, and you said you switched to Win10, so unless you switch back to XP and confirm it still works fine there, I honestly think you're looking at a hardware instability issue.

Another thing you could try is surfing the nvidia forums for Win10 980ti users having trouble and see whether there's anything to suggest that the 980ti and Win10 are a problematic combination. I personally doubt it, as while I've seen GPU trouble on Win10, it tends to be of the very low, older and poorly-supported end of the spectrum, and normally the problem starts straight after a Win10 feature update.

Just because the PSU purports to be able to do 1kW, it doesn't mean it will work forever. It would be lower down my list of priorities to test for though.
 

StefanR5R

Elite Member
Dec 10, 2016
6,390
9,846
136
@Erich56,
did the problem start immediately when you switched from XP to 10?
Do you still have the XP system disk, and have you reverted to it and re-run things successfully?
 

Erich56

I did switch back to XP for testing, and this worked fine.

Still, though, I guess it could be that a hardware configuration which worked fine in XP shows problems in Win10. In fact, I too have read about all kinds of problems with Win10, particularly after these huge "service updates".

Surfing the NVIDIA Forum sounds like a good idea.

I agree that a PSU ages; but the one I have was described as a very good one in various tests, and I guess that after 2 1/2 years it should not have a problem with 2 GPUs running at about 180 W each. That's at least what I hope.
 

mikeymikec

Is the board BIOS up-to-date?

It might also be worth checking the ASUS VIP forums to see whether anything similar to your problem has been posted.

I'd still be curious about whether the system performs properly with just one graphics card. It probably isn't faulty hardware, but perhaps it's incompatibility.
 

StefanR5R

Very interesting pointers from @mindless1. The first link has a long list of potential causes for the timeouts. Among them: GPU monitoring software. Device drivers for hardware entirely unrelated to the graphics stack. ...
 

Erich56

"XP didn't have TDR. I don't know the answer except try to disable that, or increase the timeout from 2 seconds to __ ???"

This sounds like a logical explanation (probably one of many others, I know).

I changed the TDR delay from 2 seconds to 8 seconds, as recommended in one of the links provided by mindless1.
GPUGRID has now been running for about 3 hours - so let's keep our fingers crossed :)

If this doesn't help either, I can always try disabling TDR completely (another recommendation in the linked articles).
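
For reference, a minimal sketch of the registry change described above, written as a Python helper that only produces the text of a .reg file (it does not touch the registry itself). It assumes the TDR values live under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers and that TdrDelay is a REG_DWORD holding seconds, as Microsoft's TDR documentation describes; the helper name is made up for illustration.

```python
# Registry key that holds the TDR settings (per Microsoft's TDR documentation).
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_delay_reg_text(seconds: int) -> str:
    """Return .reg file text that sets TdrDelay (a REG_DWORD, in seconds)."""
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must fit into an unsigned 32-bit DWORD")
    return "\n".join(
        [
            "Windows Registry Editor Version 5.00",
            "",
            f"[{GRAPHICS_DRIVERS_KEY}]",
            f'"TdrDelay"=dword:{seconds:08x}',  # .reg files store DWORDs as 8 hex digits
            "",
        ]
    )


if __name__ == "__main__":
    print(tdr_delay_reg_text(8))  # the 8-second delay tried in this thread
```

The printed text could be saved as a .reg file and imported; as far as I know, a restart is needed before a changed TDR value takes effect.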
 

StefanR5R

My understanding is that it would be best to determine what causes the latency and then fix that, instead of papering over the issue with a more relaxed watchdog timeout.
 

mikeymikec

StefanR5R said:
My understanding is that it would be best to determine what causes the latency and then fix that, instead of papering over the issue with a more relaxed watchdog timeout.

Yup. TDR is meant to help Windows avoid the BSOD scenario if a graphics card goes off the reservation. It shouldn't be happening full stop, let alone with any kind of regularity. For a graphics card to be regularly taking unscheduled holidays of likely billions of cycles per time kind of undermines the point of having a high-end graphics card, let alone two.

Apart from the earliest days of Windows Vista (when this feature was first introduced and when GPU manufacturers' drivers were not particularly reliable following the new driver model), the vast majority of the times I've seen 'driver has recovered', it's been due to faulty hardware IIRC. The only disclaimer I'll put on this assertion is that I rarely buy bleeding edge hardware (ie. buying a high-end graphics card weeks after its release), at which point drivers are a more likely cause of problems than otherwise.
 

Erich56

StefanR5R said:
My understanding is that it would be best to determine what causes the latency and then fix that, instead of papering over the issue with a more relaxed watchdog timeout.

mikeymikec said:
Yup. TDR is meant to help Windows avoid the BSOD scenario if a graphics card goes off the reservation.

Yes, guys, I fully agree.
If only I knew what the problem is :rolleyes:
 

mikeymikec

I'm trying to think of any way that playing with TDR might help troubleshoot the issue. Unfortunately I have little on which to base a guess about what goes on inside a graphics card when it goes off the reservation, whether it's going at full tilt doing *something*, or whether it becomes idle. If it's the latter, and given the problem appears to occur under load, I don't think it really tells us anything.

What's the nature of this "GPU project crunching"? Has that changed in any way (like say a newer version of the software for Win10)? Do you do any gaming on this rig, how does it fare if you do?
 

StefanR5R

StefanR5R said:
My understanding is that it would be best to determine what causes the latency and then fix that, instead of papering over the issue with a more relaxed watchdog timeout.

mikeymikec said:
Yup. TDR is meant to help Windows avoid the BSOD scenario if a graphics card goes off the reservation.

Erich56 said:
Yes, guys, I fully agree.
However, if I just knew what the problem is :rolleyes:

Hmm, on the other hand --- it is possible that the underlying problem was already there in Win XP, just that XP didn't have Vista's watchdog of course, hence wouldn't complain. If so, an arguably reasonable approach would be: Don't fix what isn't broken too much.

--------
Regarding eliminating some of the many potential trouble causes: Unplug all devices that you don't strictly need to boot and start boinc, or disable them in the BIOS or in device manager, and then see whether or not the hangs stay away.
 

Erich56

mikeymikec said:
What's the nature of this "GPU project crunching"? Has that changed in any way (like say a newer version of the software for Win10)? Do you do any gaming on this rig, how does it fare if you do?

My change from WinXP to Win10 coincided with a new GPUGRID app for Win10 (as well as for WinXP).
However, on 2 other systems I am also crunching GPUGRID with Win10, and there was no problem whatsoever after the release of the new app version.

No, I am not doing any gaming on this rig (nor on the other PCs, in fact).

StefanR5R said:
Hmm, on the other hand --- it is possible that the underlying problem was already there in Win XP, just that XP didn't have Vista's watchdog of course, hence wouldn't complain.

That's exactly what I believe myself by now.

The PC has now been running about 9 hours after I made this registry change for the Tdr delay (before, it crashed after between 1 and 8 hours). However, so far I crunch GPUGRID tasks on 1 GPU only, not 2. So let's keep our fingers crossed :)
If it runs okay with 1 GPU during the following night, I'll try it with 2 GPUs tomorrow.
 

Erich56

Most unfortunately, the problem recurred shortly after midnight (and I didn't find out until this morning).
So changing the TdrDelay in the registry from 2 to 8 seconds obviously did not help much. The only difference from before was that this time it took significantly longer until the graphics driver crashed: about 15 hours. Before, it was between 1 and 8 hours.

So, what I tried this morning was to set the TdrLevel to "0" (as suggested in one of the above linked articles, as an alternative to increasing the TdrDelay). However, this caused a major setback: when BOINC started, it showed the notice "no usable graphic card found". So apparently BOINC checks in some way whether there is such an entry in the registry (which really surprised me).
Hence, I deleted this entry and put back the one with the TdrDelay, the value of which I increased from 8 to 12. Whether this will ultimately help or not, I'll see during the course of the day.
 

mikeymikec

How long would you leave this number crunching job to run normally?

I wonder if both graphics cards give similar results.

Based on the evidence I'm inclined to say that it might be a compatibility issue, but since that card was released in 2016 that puts it firmly in Win10 territory already. Out of curiosity, if possible I'd try it on say Windows 7.

Is the BIOS up-to-date? I wonder whether there might be some stability issues with that board that only exhibit themselves with Win10. A long shot might be to check for graphics card BIOS updates.
 

Erich56

Erich56 said:
So, what I tried this morning was to set the TdrLevel to "0" (as suggested in one of the above linked articles, alternatively to increasing the TdrDelay). However, this yields a major setback: when starting BOINC, it brought the notice "no usable graphic card found".

I noticed something rather odd: when I re-entered this TdrLevel = 0 into the registry, this time as "DWORD" and NOT as "QWORD" (as suggested in the linked article; QWORD is presumably meant for 64-bit systems, which mine is), the BOINC notice "no usable GPU found" does NOT show up, and crunching works.
However, my suspicion is that using DWORD (instead of QWORD) will probably have no effect at all (and that's why BOINC does not complain about it).
On the other hand, all other registry entries I saw there were "DWORD". Why so, since my system is 64-bit?
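
On the DWORD question: as far as I know, a registry value's type is determined by whatever component reads it, not by whether Windows is 32-bit or 64-bit, and Microsoft's TDR documentation lists TdrLevel as a REG_DWORD (with 0 meaning detection is off). If that holds here, the DWORD entry would be the one that is actually honoured. The sketch below only illustrates how differently the two types are stored; `struct` is standard Python, and the function names are made up for this example.

```python
import struct


def as_reg_dword(value: int) -> bytes:
    """REG_DWORD payload: 4 bytes, little-endian, unsigned."""
    return struct.pack("<I", value)


def as_reg_qword(value: int) -> bytes:
    """REG_QWORD payload: 8 bytes, little-endian, unsigned."""
    return struct.pack("<Q", value)


def reg_file_line(name: str, value: int, kind: str) -> str:
    """Format a value the way a .reg file spells it for the given type."""
    if kind == "dword":
        return f'"{name}"=dword:{value:08x}'
    if kind == "qword":
        data = ",".join(f"{byte:02x}" for byte in as_reg_qword(value))
        return f'"{name}"=hex(b):{data}'  # hex(b) is the .reg spelling of REG_QWORD
    raise ValueError(f"unsupported kind: {kind!r}")


if __name__ == "__main__":
    print(reg_file_line("TdrLevel", 0, "dword"))
    print(reg_file_line("TdrLevel", 0, "qword"))
```

Both entries carry the value 0, but they are distinct types of different sizes, so a reader expecting one type may ignore or mishandle the other.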

Anyway, I restarted crunching GPUGRID with both GPUs, and will see what happens.
 

Erich56

"Your patience is to be commended! Sounds terribly aggravating!"

Well, I'd just like to find out what the problem is with this machine and Windows 10. As said before, it ran well under Windows XP.
However, I'll soon be at the point of giving up anyway.

I would so much like to crunch LHC tasks on Windows 10, as they finish about twice as fast under Windows 10 (obviously Windows 10 handles hyper-threading and multicore tasking much better than XP), even at significantly lower CPU temperatures.

Yesterday morning, I switched off TDR in the registry, which slightly changed the behaviour of the system (but still not satisfactorily): instead of these nvlddmkm error reports about unsuccessful graphics driver restores, NO error reports are now shown in the Windows event log, although the system still freezes at intervals of about 11-12 hours.
 

Howdy

Senior member
Nov 12, 2017
572
480
136
Out of curiosity, is this a fresh install of 10 or an update to 10 from XP?
 

mikeymikec

Howdy said:
Out of curiosity, is this a fresh install of 10 or an update to 10 from XP?

You can't "upgrade" from XP or Vista to 10. Though I assume it was a fresh install of Win10 and not a quick install of 7/8x and upgrade from there...?
 

lane42

Diamond Member
Sep 3, 2000
5,721
624
126
Erich56 said:
However, I'll be soon at the point of giving up, anyway.

:)
 