Cruncher down - quad GPU not feeling well

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,227
126
Hmm. My quad-GPU F@H rig was running just fine last night when I showed it to someone. Then this morning, when I moved the mouse to wake it up so I could shut it down, it wouldn't wake up. :(

I hope nothing's wrong with it. I hit RESET and it came back up, and all of the GPUs showed up, so that's good. I wonder if I had a power glitch last night, because although I have a beefy UPS, it might not be enough to keep the F@H rig up.

Edit: I started it up again late today, and it was running for a short while, and I left the room. A few hours later, I came back to it and for some reason the screen was on. (In that time, the display should have gone to sleep.) When I went to move the mouse, the mouse pointer was frozen. Aww, shoot. Looks like something's not right here.

Edit: I left it overnight sitting at the desktop, and it wasn't frozen when I went to wake it up this morning. I enabled Prime95/SoB and HFM.net and RivaTuner, basically everything BUT the GPU crunching, and I'm going to test if that freezes it up.

I wonder if the PSU is failing or worn out - it's only been about 2 years.

Edit: I don't think it's the PSU. When my other computer, plugged into the VGA port on my monitor, went into standby, all of a sudden I got a BSOD, nv4_disp. I freaked out, thinking that my new build had bluescreened, then scratched my head because it doesn't have an NV card in it. Then I realized that was my F@H box plugged into the DVI. So something is freaking out the video driver. I didn't even have any of the GPUs crunching. Just the CPU, which is at stock, running Prime95/SoB. Oh, and RivaTuner running. I wonder if one of my cards is getting flaky.

According to MS, Stop 0x000000EA is "THREAD_STUCK_IN_DEVICE_DRIVER". Their recommendations are to update the drivers, or replace the graphics hardware.

I don't get it though, I haven't touched the drivers, and they have been fine for over a year. I wasn't even running the GPUs, they were just sitting at the desktop, when I got that BSOD.

I'm running a Memtest86+ on the rig, because I swapped in some new memory and want to test it. (Swapped in after the first freeze, but before the BSOD.)
 

Ken g6

Programming Moderator, Elite Member
Moderator
Dec 11, 1999
16,836
4,815
75
What version of nVIDIA drivers are you using? I've heard about a lot of problems with the latest 260.19.* drivers, and I've had quite a few problems myself with PrimeGrid. I suggest the 256.53 drivers.
 

VirtualLarry

I had been using 180.xx, I think .60. The first CUDA drivers that took like zero CPU under XP.

I tried updating to 260.99 for XP, and then I uninstalled RivaTuner, as it looks like it hasn't been updated in a year, and downloaded and installed EVGA Precision 2.01. By default, it links all of your GPU clocks. I adjusted them to 648/1620/850, and 100% fan speed, which is basically what they were set at with RivaTuner, and within five minutes, it froze up again. (Hadn't even started F@H.)

So I suspect hardware failure of some sort. It seems I'm going to have to start pulling cards and testing them individually. Hopefully I can get it sorted out by the holiday race.
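For what it's worth, the card-by-card isolation could be planned out roughly like the Python sketch below. The card labels and the `is_stable` callback are hypothetical stand-ins. In practice, "is_stable" means booting with only those cards installed and watching whether the box freezes, which nothing in this thread automates.

```python
from typing import Callable, List, Sequence


def find_flaky_cards(
    cards: Sequence[str],
    is_stable: Callable[[List[str]], bool],
) -> List[str]:
    """Test each card on its own and return the ones that fail.

    `is_stable` is a hypothetical callback: it receives the list of cards
    currently installed and returns True if the rig survived the test run.
    """
    flaky = []
    for card in cards:
        # Only one card is installed per test, so a failure points at it.
        if not is_stable([card]):
            flaky.append(card)
    return flaky


# Simulated run: pretend the card labelled "GPU2" is the bad one.
cards = ["GPU0", "GPU1", "GPU2", "GPU3"]  # hypothetical labels
print(find_flaky_cards(cards, lambda installed: "GPU2" not in installed))
```

If every card passes on its own, the fault is more likely in the combination (PSU load, a particular slot, or the motherboard) than in any single card.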
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
27,426
16,291
136
What PSU are you running? I have been upgrading the PSUs in my farm after I burned out a 1010 watt, >85% efficiency PSU. Another AX850 is coming tomorrow.
 

ZipSpeed

Golden Member
Aug 13, 2007
1,302
170
106
My GPU folding rig also bluescreened. Running the 260.99 drivers. I updated to the latest drivers when I added another GTX 460 and upgraded the PSU. I'm going to revert to 258.96, as I didn't experience a single issue with that set of drivers.
 

VirtualLarry

Well, now I can't seem to get it to crash. Strange.

I had this semi-brilliant idea that, instead of pulling cards, I would simply not extend my Windows desktop onto a given GPU. So I set up a multi-GPU screensaver, set it to 1 minute, and set the monitor power-down to none.

Only my primary desktop - no crash.
Two desktops - no crash.
Three desktops - no crash.
Four desktops - still no crash.

So now I re-overclocked with Precision 2.01 to 650/1620/900, set fan speed to 100%, and started F@H on GPU 0 and 1 (768MB PNY 9600GSOs, my newest pair).

So far, no crash.

One thing that I did notice, but didn't totally pin down, is that during various times of rebooting, the hardware device list shown by the BIOS was changing. It prints them in a two column list, and a few times, there was an extra item.

So perhaps if I can track that down, then I can solve the mystery.
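One way to track that down would be to copy the BIOS device list by hand on two different boots and diff them. The sketch below is only an illustration: the entry strings are made up, and it assumes the list is typed in manually, since the POST screen isn't something a script can read.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def diff_boot_lists(
    boot_a: Sequence[str],
    boot_b: Sequence[str],
) -> Tuple[List[str], List[str]]:
    """Return (only_in_a, only_in_b), counting duplicate entries separately."""
    count_a, count_b = Counter(boot_a), Counter(boot_b)
    # Counter subtraction keeps only positive counts, so four identical
    # "Display Cntrlr" lines on one boot vs. three on another still show up.
    only_in_a = sorted((count_a - count_b).elements())
    only_in_b = sorted((count_b - count_a).elements())
    return only_in_a, only_in_b


# Hypothetical entries, hand-copied from two POST screens.
boot_a = ["Display Cntrlr"] * 4 + ["USB Cntrlr", "IDE Cntrlr"]
boot_b = boot_a + ["Network Cntrlr"]
print(diff_boot_lists(boot_a, boot_b))
```

Noting which boots went on to freeze next to each list would show whether the extra entry lines up with the crashes.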

Edit: There are some new BIOS revisions out, pretty up-to-date actually. I can now use a Phenom II X6 with this K9A2 Platinum with the newest BIOS. That would be awesome! Probably need a new PSU if I did that, although if I get four GTS450 single-slot cards too, then I could upgrade everything at once.

I noticed a strange and dangerous bug in EVGA Precision 2.01, the newest version on MajorGeeks.com. Sometimes, even though you link the GPUs, turn off auto fan speed, and set it to 100, it loses the fan speed settings on some GPUs and then defaults to the slowest setting, 35. One of my GPUs was at 105C when I woke it up!!! Not good. So I clicked AUTO for the fan speed, and a split second later GPU2's fan speed suddenly dropped to 35. When I clicked on GPU2, fan speed was not on AUTO, and was set to the lowest setting possible. VERY BAD.
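A quick sanity check like the Python sketch below could flag that situation from readings jotted down by hand. Everything in it is an assumption for illustration: the `GpuReading` type, the sample numbers, and the 90C alarm threshold are not from any tool mentioned here, and it does not read anything from EVGA Precision.

```python
from typing import Dict, List, NamedTuple, Sequence


class GpuReading(NamedTuple):
    name: str      # hypothetical label, e.g. "GPU2"
    fan_pct: int   # fan speed in percent
    temp_c: int    # core temperature in Celsius


def find_fan_problems(
    readings: Sequence[GpuReading],
    expected_fan_pct: int = 100,
    max_temp_c: int = 90,  # assumed alarm threshold, adjust to taste
) -> Dict[str, List[str]]:
    """Map each problem GPU to a list of human-readable issues."""
    problems: Dict[str, List[str]] = {}
    for r in readings:
        issues = []
        if r.fan_pct < expected_fan_pct:
            issues.append(f"fan at {r.fan_pct}%, expected {expected_fan_pct}%")
        if r.temp_c > max_temp_c:
            issues.append(f"temperature {r.temp_c}C above {max_temp_c}C")
        if issues:
            problems[r.name] = issues
    return problems


# Made-up sample readings: one GPU has silently fallen back to 35% fan.
sample = [GpuReading("GPU0", 100, 70), GpuReading("GPU2", 35, 105)]
print(find_fan_problems(sample))
```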

Edit: Uh oh, it crashed again, after flashing the BIOS to the newest one and rebooting. It seems to have something to do with the BIOS hardware detection, I think.

Edit: I've had it crunching with all four cards for over 24 hours now, and it hasn't locked up. Last time I booted, the PCI/PCIE hardware enumeration list was offset by one. That seems to be the key to it working. I also disabled USB Legacy Support, whatever that does.
 