GPU tasks are causing Win10 machine to become unresponsive/restart (fixed by Disabling SLI)

Discussion in 'Distributed Computing' started by pandemonium, Nov 10, 2017.

  1. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    Long story short, I initially thought this was insufficient virtual memory sizing, and increasing the allocation to 1024 / 10240 MB did reduce the display-refresh issues upon wake. My theory was that GPU tasks (with my configuration, at least) require a decently sized paging file to keep the display driver from crashing when a task is paused and the desktop is reinitialized. Even so, I still had a few crashes afterward. Disabling SLI was the fix: with SLI disabled I've run for days with zero crashes and zero issues waking the desktop from sleep while crunching. A word to the wise: disable SLI if you're crunching!

    Hey, all.

    The last couple of days I've been having odd issues with GPUGRID GPU tasks (now seen with GPU tasks from other projects as well): the Windows display driver crashes, attempts to recover, and then display output is lost completely, with both of my displays going blank with no input (not just black). (Event Viewer logs these occurrences as display-driver recovery attempts.)

    I've run these tasks successfully before, so I'm attempting to narrow down which local changes (if the problem is indeed locally isolated) could be affecting these tasks specifically:

    • Added TdrLevel to the registry and set it to 3, along with TdrDelay (tried 60, 20, 10, 5, and the default of 2) (reverted, same issue)
    • Tweaked my CPU overclock (reverted, same issue)
    • Tweaked my GPU overclocks (reverted, same issue)
    • Updated MSIAfterburner to 4.3, then 4.4 (have not attempted to revert yet; this is my next step when I get home)
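
    For reference, the TDR values mentioned above live under HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers. A sketch of the equivalent reg commands (run from an elevated prompt; a reboot is needed for them to take effect, and the delay shown is just one of the values I tried):

    ```
    :: set TdrLevel=3 (recover on timeout) and a longer TdrDelay (in seconds)
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 3 /f
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 10 /f
    :: revert by deleting the values (defaults return after a reboot)
    reg delete "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /f
    reg delete "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /f
    ```
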
    I'm thinking the problem somehow lies with how MSI Afterburner is clocking my cards. I have 2x ASUS GTX 970s in SLI and have them synchronized to run identically according to the utility; however, in the monitoring read-out I notice one is oddly running at a different core clock than the other. Even if I disable the synchronize option, the clocks shown in the monitor never match what is set in the utility.

    My question to everyone here at this time: is anyone else running GPU tasks with MSI Afterburner 4.3 or 4.4 having problems? (Update: this was not the cause; I've since tried MSI Afterburner 4.2.0 with the same result.)

    I'm not at home to supply logs right now, but was hoping I could query you fine ladies and gentlemen to see if I'm alone with this. :eek:

    Current machine specs:
    MoBo: Sabertooth X99
    CPU: Intel 5820K @ 4.2 GHz (BCLK 100, multiplier 42)
    RAM: 32 GB Mushkin DDR4-2400 @ XMP 2400 MHz
    Storage: Intel 750 400 GB via ASUS Hyper Kit
    GPUs: 2x ASUS GTX 970 STRIX @ stock clocks (per the MSI Afterburner utility)
    PSU: Seasonic SS-760XP2
    OS: Windows 10 Professional

    Stability tests with Prime95, memtest, Unigine Valley, et cetera have been performed at varying stages since I built the machine in 2015, with no issues. The CPU rarely exceeds 80°C and the GPUs rarely exceed 75°C after crunching for 10+ hours.

    Subsequent troubleshooting done:
    • Rolled MSIAfterburner back to 4.2
    • "Underclocked" both cards using the power target meter on the utility, to as low as 70%
    • Patched Windows to latest updates
    • Ran dism online to clean old updates and sfc /scannow to verify Windows files (no problems)
    • Ran chkdsk (no errors)
    • Updated Nvidia drivers from 388.00 to 388.13 WHQL (clean install)
    • Set CPU to stock clocks
    • Rolled back to 385.41 WHQL Nvidia drivers (clean install)
    • Disabled all Visual effects enhancements in Windows
    • Verified latest bios version installed (Asus 3701)
    • Updated ASMedia USB3.1 eXtensible Host Controller to 1.16.38.1 (for external WD MyBook connection)
    • Increased virtual memory allocation from 128 / 1024 MB to 1024 / 4096 MB, later raised to 1024 / 10240 MB
    • Disabled SLI in Nvidia control panel
     
    #1 pandemonium, Nov 10, 2017
    Last edited: Jan 20, 2018

  3. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,672
    Likes Received:
    1,398
    I've had no problems at GPUGrid so far with two Pascal hosts: one X79-based with 1x GTX 1070, the other X99-based with 3x GTX 1080/1080 Ti (a wild mixture of non-Ti and Ti cards, hopefully to be cleaned up before our December Folding contest). Drivers: 384.94; MSI Afterburner: 4.3; OS: Windows 7 Pro.

    I don't game, hence I don't use SLI. Is a single GPUGrid task spread over both cards when SLI is enabled?
     
  4. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    Does your Afterburner monitor show the peak core clock at the value actually set in the utility? That's the only weird thing I've noticed happening.

    BOINC conveniently manages one GPU task per GPU, regardless of whether the cards are physically bridged or SLI is enabled. I was very happy when I found that out, having assumed I'd need to disable SLI every time I wanted to run it.
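
    As a side note, the BOINC client can be told explicitly to use every detected GPU (by default it may skip GPUs it considers less capable than the best one). use_all_gpus is a real client option; the file below is just a minimal sketch of a cc_config.xml placed in the BOINC data directory:

    ```xml
    <cc_config>
      <options>
        <!-- schedule tasks on every detected GPU, not just the most capable one -->
        <use_all_gpus>1</use_all_gpus>
      </options>
    </cc_config>
    ```
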

    I'm also speculating that my attempts at modifying the TDR values somehow messed with how the display driver communicates with the BOINC client. Gaming and general desktop use all function perfectly fine.
     
  5. TennesseeTony

    TennesseeTony Elite Member

    Joined:
    Aug 2, 2003
    Messages:
    2,837
    Likes Received:
    1,194
    I only use Afterburner for monitoring the cards/CPU/temps. I'd say it's a software change causing the issue. If Afterburner isn't the culprit, did you get the Win10 Fall update recently? New Nvidia drivers?
     
  6. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    I have not run Windows Update. The Nvidia drivers are the latest; I updated them about a week ago, IIRC: 388.13 WHQL.
     
  7. iwajabitw

    iwajabitw Senior member

    Joined:
    Aug 19, 2014
    Messages:
    778
    Likes Received:
    130
    I'm running dual GTX 980s in SLI. My current driver is 382.05 and I have no problems with anything I throw at them. I would start with a driver rollback, reset Afterburner to its defaults, and see how it goes. I had your issue for most of the summer on my AMD R9 280X systems until I finally found that the problem was Afterburner and Wattman not getting along: with the new Radeon driver suite (vs. the old Crimson suite), core and memory clocks would constantly fall to their idle state. I removed Afterburner from the AMD systems and everything became stable. DC projects usually lag gaming support by quite a bit, so if you game a lot you may have to find the driver that gives you a happy medium between games and DC.
     
  8. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    Thanks for the input, @iwajabitw. I've been reluctant to make any changes and halt compute production since the Universe sprint started, but I'll see what troubleshooting I can do tomorrow; Afterburner being the first step, then Nvidia drivers.

    I did test running SETI GPU tasks and those are running fine.
     
  9. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,672
    Likes Received:
    1,398
    With the 10x0 (a.k.a. Pascal) series at least, Afterburner and other tools don't set a peak core clock, but a core clock offset. More importantly, this offset is not applied against a fixed base: the GPU's firmware constantly adjusts the clock based on actual power usage, voltage requirement, and temperature.
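
    A toy model of the idea (the numbers, bin size, and simple one-bin-per-degree rule are made up for illustration; this is not NVIDIA's actual boost algorithm). It shows why two identical cards with identical settings can legitimately report different clocks:

    ```python
    def effective_clock_mhz(base_boost_mhz, offset_mhz, temp_c,
                            temp_limit_c=83, bin_mhz=13):
        """Toy GPU Boost model: tools set an offset, not an absolute clock,
        and the firmware sheds discrete clock 'bins' past its limits."""
        clock = base_boost_mhz + offset_mhz
        if temp_c > temp_limit_c:
            # drop one bin per degree over the limit (illustrative only)
            clock -= bin_mhz * (temp_c - temp_limit_c)
        return clock

    # Two identical cards, same settings, different temperatures:
    print(effective_clock_mhz(1304, 0, 70))  # cooler card holds full boost: 1304
    print(effective_clock_mhz(1304, 0, 85))  # warmer card sheds two bins: 1278
    ```
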

    Monitoring in Afterburner gives me the same readings as in other tools, such as GPU-Z and HWinfo64.
     
  10. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    Yeah, I knew it could vary, but it is odd that my secondary GPU (always lower in temperature) sits at the peak boost core clock set in the utility and never goes higher, while the primary GPU (almost always higher in temperature) can run many MHz faster.

    For instance, when the boost clock is applied my primary runs at 1329 MHz and my secondary at 1304 MHz. At the base peak core clock they're both at 1114 MHz.

    I just thought it was odd and I won't pretend to understand how this utility works. :eek:

    Edit: While in games, both cards run at the correct 1304 MHz. Only during BOINC GPU tasks is there a disparity between my GPUs' boost core clocks.
     
    #9 pandemonium, Nov 11, 2017
    Last edited: Nov 12, 2017
  11. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    Well, this is becoming extremely aggravating.

    Now, any time my computer is running BOINC tasks and I wake the monitor, or tasks stop via the automatic "do not run tasks while computer is in use" setting, the display driver either crashes with the new Windows 10 blue screen (BAD_POOL_CALLER) and the machine restarts, or it attempts to reset, displays nothing, and I have to restart anyway.

    This may be the cause of the errors while computing, and I'm losing a lot of completed work (or it could simply be the beta nature of the GPUGRID project):
    16687358 12857389 447109 15 Nov 2017 | 22:29:23 UTC 16 Nov 2017 | 3:42:23 UTC Error while computing 10,832.09 2,609.13 --- Long runs (8-12 hours on fastest card) v9.18 (cuda80)

    In addition to the original troubleshooting, I've also:
    • Rolled MSIAfterburner back to 4.2
    • "Underclocked" both cards using the power target meter on the utility, to as low as 70%
    • Patched Windows to latest updates
    • Ran dism online to clean old updates and sfc /scannow to verify Windows files (no problems)
    • Ran chkdsk (no errors)
    • Updated Nvidia drivers from 388.00 to 388.13 WHQL (clean install)
    My next course of action is to roll back the BOINC client version and the Nvidia drivers. After that, I'm running out of ideas. As previously stated, BOINC tasks are the only thing that causes any issues with my rig.
     
    #10 pandemonium, Nov 16, 2017
    Last edited: Nov 16, 2017
  12. Kiska

    Kiska Senior member

    Joined:
    Apr 4, 2012
    Messages:
    584
    Likes Received:
    112
    Have you tried reinstalling the driver? Doing a clean install of it?
     
  13. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    Yeah, I forgot to mention that. I was wrong about having 388.13; I actually had 388.00 installed, so I've now updated. (I always clean-install graphics drivers.)
     
  14. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,672
    Likes Received:
    1,398
    The BOINC client version is unrelated to such problems. I would concentrate on the drivers. (384.94 works for me on Pascal; I don't have any Maxwell cards.)
     
  15. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    I'll be rolling back Nvidia drivers when I get home. I don't recall having this problem pre 388.00.

    Wish me luck and thanks!
     
  16. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    Rolled back to 385.41 and so far, so good. We'll see; the problem wasn't consistent, occurring at random when stopping tasks or waking the monitors.
     
  17. ao_ika_red

    ao_ika_red Senior member

    Joined:
    Aug 11, 2016
    Messages:
    943
    Likes Received:
    347
    That's interesting. I also had to roll back my Radeon display driver a few days ago because my GPU core clock was stuck at idle (300 MHz) in BOINC. It was fine with other programs and games; it only had problems with BOINC projects.
     
  18. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    BAD_POOL_CALLER reared its ugly head again.

    As usual, accompanied by, "The description for Event ID 13 from source nvlddmkm cannot be found."

    Onward with the troubleshooting!
    • Disabling all Visual effects enhancements in Windows
     
  19. iwajabitw

    iwajabitw Senior member

    Joined:
    Aug 19, 2014
    Messages:
    778
    Likes Received:
    130
    If you haven't already, I would return the 5820K to default speeds, turn XMP off, and make sure the latest device drivers for the mobo are up to date via the manufacturer's site. I used to have my 5820K in my ASRock X99 board; when running BOINC I had to set "use at most x% of CPUs" to 84%, or even 75% with dual cards, to give one core to each GPU. BAD_POOL_CALLER errors may not actually pertain to the Nvidia driver, even though it's what crashes; something else may be fighting for the IRQ. Make sure your BIOS is up to date as well. I had this happen on shutdown a year or so ago: it was my Bluetooth/wireless network card causing the problem. When I added the second GTX 980 I had to pull the BT card out, as it sat between the two GPUs, and the problem went away. I also read during my struggle that Avast and some other software I had never heard of had drivers that caused this on Win 10.
     
    Ken g6 likes this.
  20. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    That certainly seems to be the stage I'm at: finding adjacent causes for this.

    I have BOINC set to 80% of the CPU in the client itself, so I can still use the computer for other things and temperatures don't climb higher than I prefer. At 80%, Task Manager shows 100% CPU utilization, but each core isn't actually capped; CoreTemp shows fluctuation between ~75-95% on each core.
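
    For what it's worth, the arithmetic behind leaving one logical CPU free per GPU (as suggested above) can be sketched like this; the function name is mine, not a BOINC API:

    ```python
    def use_at_most_cpu_percent(logical_cpus, gpu_tasks):
        """Value for BOINC's 'use at most x% of the CPUs' setting that
        leaves one logical CPU free to feed each running GPU task."""
        return 100.0 * (logical_cpus - gpu_tasks) / logical_cpus

    # 5820K: 6 cores / 12 threads, 2 GPUs -> ~83%, close to the 84% quoted above
    print(round(use_at_most_cpu_percent(12, 2)))  # 83
    ```
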

    I updated the BIOS to the latest a few months ago; according to ASUS's website, version 3701 (dated 5-26-17) is the latest. I've since double-checked that my installed BIOS is indeed 3701.

    My mistake for not mentioning I had tried setting stock clocks on the CPU for a bit. I did not try removing the XMP profile for the RAM though. I'll put that on the list.

    I do have an external WD MyBook connected via USB 3.1 that may somehow be causing the issues. (This drive likes to corrupt data, develop bad-file issues, and disappear from time to time, forcing me to unplug its data and power cables before it is recognized again. It's been on my list of things to replace for probably too long.)

    In fact, now that I think about it, I wouldn't be surprised if the external drive is the cause. I'll troubleshoot that next if disabling the visual enhancements doesn't solve the problem.

    Another possibility is the Samsung TV I have connected via HDMI. I'll have to run tasks and wake the monitor without it connected to see if it is somehow the culprit. These GPUs, for whatever reason, prioritize displaying the BIOS (init display) on HDMI over DisplayPort, which has always bothered me and may be another sign of these cards' flawed design (aside from the infamous slow 0.5 GB memory segment that hurts performance).

    I thought I had tweaked this machine to perfection, but the more I run BOINC tasks the more I realize that's not true. :eek:
     
    #19 pandemonium, Nov 17, 2017
    Last edited: Nov 17, 2017
  21. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,672
    Likes Received:
    1,398
    AFAIU, older USB 3.0 controllers, the ones that came out when the standard was still new, have conflicting power management that causes issues like these infamous disconnections of external disks. I haven't found a good recipe against it so far, but I haven't researched it extensively either.
     
  22. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    Interesting. I'll check the driver details for that controller when I get home and look into it. Thanks.

    Edit: Found a newer version for ASMedia USB3.1 eXtensible Host Controller. Updated.
     
    #21 pandemonium, Nov 17, 2017
    Last edited: Nov 17, 2017
  23. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    So, I'm going to feel ashamed about this next troubleshooting step, because so far it has worked, and the problem was self-inflicted. I'll likely know for certain after a few more days of run time, but the usual BAD_POOL_CALLER restarts and the display resetting and never coming back haven't happened.

    When I first set up this computer I took it upon myself to tweak everything, including virtual memory. Over the years I played with minimizing the amount of virtual memory Windows had at its disposal, and I was comfortable with my grasp of its function: I wanted as few reads/writes to my storage drives as possible, with the memory more heavily utilized in that domain. (I know that's not the end-all-be-all of how virtual memory works, but it's a significant part of it.) I will say that everything other than BOINC GPU tasks operates flawlessly with a minimal virtual memory setting, but these tasks are definitely beyond the norm. That's when it occurred to me that they might require a larger pool of virtual memory to pause tasks and reinitialize the display driver to the desktop.

    Originally, I had virtual memory set to 128 / 1024 MB. I changed it to 1024 / 4096 MB, restarted, and started crunching. After waking the monitor and stopping GPU tasks several times yesterday: no problems. A sufficiently large paging file appears to be a vital requirement for BOINC GPU tasks when the display driver is reinitialized.
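
    For anyone wanting to sanity-check their own setting, here's a toy sizing rule in the spirit of where this rig ended up (the floor and the quarter-of-RAM growth cap are my guesses, not a Microsoft recommendation):

    ```python
    def pagefile_mb(ram_gb, min_mb=1024):
        """Toy page file sizing: 1 GB initial size, allowed to grow to a
        quarter of RAM (but at least 4 GB) so paused GPU tasks can spill."""
        max_mb = max(4096, ram_gb * 1024 // 4)
        return min_mb, max_mb

    print(pagefile_mb(32))  # (1024, 8192)
    ```
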

    Hopefully my misadventure will be of assistance to someone else, should they also take it upon themselves to go beyond the norm by tweaking every little thing with their machine.
     
    TennesseeTony likes this.
  24. iwajabitw

    iwajabitw Senior member

    Joined:
    Aug 19, 2014
    Messages:
    778
    Likes Received:
    130
    I didn't think of low VM; very possible. My system has 16 GB of RAM, and I haven't changed VM settings since the XP days, so it never crossed my mind. Hope that puts your issue to rest.
     
  25. StefanR5R

    StefanR5R Golden Member

    Joined:
    Dec 10, 2016
    Messages:
    1,672
    Likes Received:
    1,398
    @pandemonium, thanks, that's interesting to learn. Checking my triple-GPU host (running Win 7 Pro) for that now... It turns out I set a fixed paging file of min 5 GB / max 5 GB a long time ago; the setting may even be older than this PC's actual hardware. Maybe not very wise in hindsight, but apparently sufficient, by a slim margin, for what I've run on it so far.
     
  26. pandemonium

    pandemonium Golden Member

    Joined:
    Mar 17, 2011
    Messages:
    1,700
    Likes Received:
    41
    I've had one instance of BAD_POOL_CALLER (out of roughly 30 possible occurrences), so I increased the maximum page file size to 8192 MB. I believe that should be sufficient for any and all GPU tasks.