Failed GPU WU on 820MHz GTX460 1GB

VirtualLarry · Jan 13, 2011

How often do these happen? I don't think I had a failed WU all through Dec, during the race.

That was in a PC with an EarthWatts 650 and a couple of PCI-E power splitters, powering a Q6600 @ 3.6 GHz and two GTX460 1GB cards, which came factory overclocked to 715MHz, but I pushed them to 820MHz.

So I decommissioned the Q6600 rig, and pulled out the GPUs, and put one each into my existing desktop rigs, that I upgraded (one so far) from E2140 dual-cores to Q9300 quad-cores. I also pulled out a brand-new EarthWatts 650W PSU to put into the rig.

The back top of the case is really warm, where the PSU is. Not too much airflow coming out. The computer has its sides off, but it's in a cubby-hole in a desk, so airflow is somewhat constricted.

GPU temp right now is reported at 68C, fan speed 57%, 2100 RPM. Still pretty quiet.

Do GPU WUs just fail out of the blue?

ZipSpeed · Jan 13, 2011

WUs can sometimes fail due to inherent instability of that protein simulation. Doesn't necessarily mean your 460 is unstable. During the race, my GT 240 and 8800 GTS probably failed about 5 units per video card. No failures on the 460s though.

Unless you get multiple EUEs in a row, I wouldn't worry too much. My eVGA 460 (the hotter of the two 460s) gets to around 77C at full load and I haven't had any stability issues yet, other than the occasional BSOD due to bad drivers.

VirtualLarry · Jan 13, 2011

Well, I got back, and now the failed count is up to 3 in HFM.NET.

[02:10:20] Build host: SimbiosNvdWin7
[02:10:20] Board Type: NVIDIA/CUDA
[02:10:20] Core : x=15
[02:10:20] Window's signal control handler registered.
[02:10:20] Preparing to commence simulation
[02:10:20] - Looking at optimizations...
[02:10:20] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[02:10:20] - Created dyn
[02:10:20] - Files status OK
[02:10:20] sizeof(CORE_PACKET_HDR) = 512 file=<>
[02:10:20] - Expanded 43737 -> 172159 (decompressed 393.6 percent)
[02:10:20] Called DecompressByteArray: compressed_data_size=43737 data_size=172159, decompressed_data_size=172159 diff=0
[02:10:20] - Digital signature verified
[02:10:20]
[02:10:20] Project: 6806 (Run 3348, Clone 2, Gen 1)
[02:10:20]
[02:10:20] Assembly optimizations on if available.
[02:10:20] Entering M.D.
[02:10:22] Tpr hash work/wudata_02.tpr: 3471899655 1965075330 576058860 2762702867 3227574895
[02:10:22] Working on 2 PEPTIDE (1-42)
[02:10:22] Run: exception thrown in GuardedRun -- cannot continue further.
[02:10:22] Going to send back what have done -- stepsTotalG=0
[02:10:22] Work fraction=0.0000 steps=0.
[02:10:26] logfile size=0 infoLength=0 edr=0 trr=23
[02:10:26] + Opened results file
[02:10:26] - Writing 635 bytes of core data to disk...
[02:10:26] Done: 123 -> 124 (compressed to 100.8 percent)
[02:10:26] ... Done.
[02:10:26] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[02:10:26]
[02:10:26] Folding@home Core Shutdown: UNSTABLE_MACHINE
[02:10:31] CoreStatus = 7A (122)
[02:10:31] Sending work to server
[02:10:31] Project: 6806 (Run 3348, Clone 2, Gen 1)
[02:10:31] - Read packet limit of 540015616... Set to 524286976.
[02:10:31] + Attempting to send results [January 13 02:10:31 UTC]
[02:10:31] Gpu type=3 species=30.
[02:10:31] + Results successfully sent
[02:10:31] Thank you for your contribution to Folding@Home.

[18:58:44] Build host: SimbiosNvdWin7
[18:58:44] Board Type: NVIDIA/CUDA
[18:58:44] Core : x=15
[18:58:44] Window's signal control handler registered.
[18:58:44] Preparing to commence simulation
[18:58:44] - Looking at optimizations...
[18:58:44] DeleteFrameFiles: successfully deleted file=work/wudata_09.ckp
[18:58:44] - Created dyn
[18:58:44] - Files status OK
[18:58:44] sizeof(CORE_PACKET_HDR) = 512 file=<>
[18:58:44] - Expanded 43910 -> 172159 (decompressed 392.0 percent)
[18:58:44] Called DecompressByteArray: compressed_data_size=43910 data_size=172159, decompressed_data_size=172159 diff=0
[18:58:44] - Digital signature verified
[18:58:44]
[18:58:44] Project: 6806 (Run 3464, Clone 2, Gen 3)
[18:58:44]
[18:58:44] Assembly optimizations on if available.
[18:58:44] Entering M.D.
[18:58:46] Tpr hash work/wudata_09.tpr: 1666636903 1563452542 4042842797 2674782421 1536283653
[18:58:46] Working on 2 PEPTIDE (1-42)
[18:58:46] Run: exception thrown in GuardedRun -- cannot continue further.
[18:58:46] Going to send back what have done -- stepsTotalG=0
[18:58:46] Work fraction=0.0000 steps=0.
[18:58:50] logfile size=0 infoLength=0 edr=0 trr=23
[18:58:50] + Opened results file
[18:58:50] - Writing 635 bytes of core data to disk...
[18:58:50] Done: 123 -> 124 (compressed to 100.8 percent)
[18:58:50] ... Done.
[18:58:50] DeleteFrameFiles: successfully deleted file=work/wudata_09.ckp
[18:58:50]
[18:58:50] Folding@home Core Shutdown: UNSTABLE_MACHINE
[18:58:54] CoreStatus = 7A (122)
[18:58:54] Sending work to server
[18:58:54] Project: 6806 (Run 3464, Clone 2, Gen 3)
[18:58:54] - Read packet limit of 540015616... Set to 524286976.
[18:58:54] + Attempting to send results [January 13 18:58:54 UTC]
[18:58:54] Gpu type=3 species=30.
[18:58:55] + Results successfully sent
[18:58:55] Thank you for your contribution to Folding@Home.

[20:10:40] Completed 21000000 out of 50000000 steps (42%).
[20:11:45] Run: exception thrown in GuardedRun -- cannot continue further.
[20:11:45] Going to send back what have done -- stepsTotalG=50000000
[20:11:45] Work fraction=0.4264 steps=50000000.
[20:11:49] logfile size=0 infoLength=0 edr=0 trr=23
[20:11:49] + Opened results file
[20:11:49] - Writing 642 bytes of core data to disk...
[20:11:49] Done: 130 -> 126 (compressed to 96.9 percent)
[20:11:49] ... Done.
[20:11:49] DeleteFrameFiles: successfully deleted file=work/wudata_00.ckp
[20:11:49]
[20:11:49] Folding@home Core Shutdown: EARLY_UNIT_END
[20:11:52] CoreStatus = 72 (114)
[20:11:52] Sending work to server
[20:11:52] Project: 6806 (Run 3546, Clone 2, Gen 0)
[20:11:52] - Read packet limit of 540015616... Set to 524286976.
[20:11:52] + Attempting to send results [January 13 20:11:52 UTC]
[20:11:52] Gpu type=3 species=30.
[20:11:52] + Results successfully sent
[20:11:52] Thank you for your contribution to Folding@Home.
[20:11:56] - Preparing to get new work unit...

So what gives? Two of them looked like they errored out immediately, so bad WU?
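For anyone who wants to sort failures like the ones above without eyeballing the log, here is a minimal sketch in Python. It only relies on the three line patterns visible in the excerpts ("Work fraction=", "Core Shutdown:", and the "Project:" line that follows). The function name and the dict layout are made up for illustration, and skipping a FINISHED_UNIT status is an assumption about what a successful unit logs.

```python
import re

# Patterns copied from the log lines quoted in this thread.
WORK_RE = re.compile(r"Work fraction=([\d.]+) steps=(\d+)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def summarize_failures(log_text):
    """Return one dict per failed WU found in a client log excerpt.

    Hypothetical helper: a failure is any core shutdown whose status is
    not FINISHED_UNIT (assumed to be the success status).
    """
    failures = []
    fraction = None
    status = None
    for line in log_text.splitlines():
        m = WORK_RE.search(line)
        if m:
            fraction = float(m.group(1))
            continue
        m = SHUTDOWN_RE.search(line)
        if m:
            status = m.group(1)
            continue
        m = PROJECT_RE.search(line)
        # Only the Project line printed after a shutdown identifies the failed WU.
        if m and status is not None:
            if status != "FINISHED_UNIT":
                failures.append({
                    "wu": tuple(int(g) for g in m.groups()),
                    "status": status,
                    "fraction": fraction,
                    # True when the core gave up before computing anything.
                    "immediate": fraction == 0.0,
                })
            fraction = None
            status = None
    return failures
```

Run over the three excerpts above, this should report two immediate UNSTABLE_MACHINE failures and one EARLY_UNIT_END partway through the unit, which is the distinction the question is getting at.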

ZipSpeed · Jan 13, 2011

Are both your machines fresh installs of Windows 7? I had stability issues with the 260.89 and 260.99 drivers, reverted to 258.96, and haven't had any problems since then. If you do try changing drivers, run Driver Sweeper first to remove any remnants of the old driver.

VirtualLarry · Jan 13, 2011

Yes, pretty fresh install of Win7 64-bit HP, with 260.99 WHQL drivers.

ZipSpeed · Jan 13, 2011

The weird thing with overclocking is that sometimes even just moving to new hardware and software is enough that the new combination is unstable. I remember my old E6850 was stable at 3.6 GHz with Win XP but when I moved to Vista, the same computer was only able to achieve 3.4.

Maybe try backing down the overclock to 800 MHz and see if your issues still persist.

Markfw · Jan 13, 2011

If you are running the 258's, just re-boot the computer. I have had cases with NO changes in software, overclocking or anything, where all of a sudden it goes out to lunch. Re-boot, and problems disappear.

VirtualLarry · Jan 14, 2011

It crunched 5 more WUs successfully. So I'm going to just watch it.

Edit: Now 7 WUs crunched successfully.

GPU temp: 70C
GPU Usage: 99%
Fan speed: 61%
Fan RPM: 2310

Ambient temp 78F-80F

ZipSpeed · Jan 14, 2011

Good to hear. Did you do anything to the computer?

VirtualLarry · Jan 15, 2011

[23:10:20] Loaded queue successfully.
[23:10:20] Gpu type=3 species=30.
[23:10:21] + Closed connections
[23:10:21]
[23:10:21] + Processing work unit
[23:10:21] Core required: FahCore_15.exe
[23:10:21] Core found.
[23:10:21] Working on queue slot 09 [January 15 23:10:21 UTC]
[23:10:21] + Working ...
[23:10:21]
[23:10:21] *------------------------------*
[23:10:21] Folding@Home GPU Core
[23:10:21] Version 2.15 (Tue Nov 16 09:05:18 PST 2010)
[23:10:21]
[23:10:21] Build host: SimbiosNvdWin7
[23:10:21] Board Type: NVIDIA/CUDA
[23:10:21] Core : x=15
[23:10:21] Window's signal control handler registered.
[23:10:21] Preparing to commence simulation
[23:10:21] - Looking at optimizations...
[23:10:21] DeleteFrameFiles: successfully deleted file=work/wudata_09.ckp
[23:10:21] - Created dyn
[23:10:21] - Files status OK
[23:10:21] sizeof(CORE_PACKET_HDR) = 512 file=<>
[23:10:21] - Expanded 40808 -> 162639 (decompressed 398.5 percent)
[23:10:21] Called DecompressByteArray: compressed_data_size=40808 data_size=162639, decompressed_data_size=162639 diff=0
[23:10:21] - Digital signature verified
[23:10:21]
[23:10:21] Project: 6805 (Run 3037, Clone 0, Gen 0)
[23:10:21]
[23:10:21] Assembly optimizations on if available.
[23:10:21] Entering M.D.
[23:10:23] Tpr hash work/wudata_09.tpr: 2560135873 1818690808 4058277428 4186197612 2499892568
[23:10:23] Working on ALZHEIMER'S DISEASE AMYLOID
[23:10:23] Run: exception thrown in GuardedRun -- cannot continue further.
[23:10:23] Going to send back what have done -- stepsTotalG=0
[23:10:23] Work fraction=0.0000 steps=0.
[23:10:27] logfile size=0 infoLength=0 edr=0 trr=23
[23:10:27] + Opened results file
[23:10:27] - Writing 635 bytes of core data to disk...
[23:10:27] Done: 123 -> 124 (compressed to 100.8 percent)
[23:10:27] ... Done.
[23:10:27] DeleteFrameFiles: successfully deleted file=work/wudata_09.ckp
[23:10:27]
[23:10:27] Folding@home Core Shutdown: UNSTABLE_MACHINE
[23:10:31] CoreStatus = 7A (122)
[23:10:31] Sending work to server
[23:10:31] Project: 6805 (Run 3037, Clone 0, Gen 0)
[23:10:31] - Read packet limit of 540015616... Set to 524286976.

That's three WUs that IMMEDIATELY went UNSTABLE_MACHINE, without computing anything. Wonder what's up with that. I have the 1GB GTX460, so it shouldn't be running out of VRAM, I don't think.

Is there a new version of the GPU3 systray client, within the last two weeks? Perhaps I need to upgrade.

theAnimal · Jan 15, 2011

Are you still @820? If so, try backing off a step or two.

VirtualLarry · Jan 16, 2011

theAnimal said:
Are you still @820? If so, try backing off a step or two.

Somehow, I doubt that's the problem. I had both cards running @820 for the entire month of Dec., and no failed WUs that I could tell.

Mark has his at like 840.

Rudy Toody · Jan 16, 2011

Check this thread.

CoreStatus= 7A (122)
This appears in various forms but appears to be directly related to calculation errors detected by a GPU. Whether the errors are GPU hardware errors or are inherent in the WU is currently unknown.

If other crunchers are getting the same error on the same WU, then it is probably a WU problem.

Otherwise, it could be hardware failure.
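The rule of thumb above (same WU failing for several crunchers points at the WU, otherwise suspect hardware) can be sketched in a few lines of Python. This is only an illustration: the function name, the host names, and the second host's report are hypothetical, and the WU tuples are just (project, run, clone, gen) as printed in the client log.

```python
from collections import defaultdict


def classify_failed_wus(reports):
    """Group failure reports by work unit and apply the rule of thumb.

    reports: iterable of (host, wu) pairs, where wu is a
    (project, run, clone, gen) tuple taken from the client log.
    """
    hosts_by_wu = defaultdict(set)
    for host, wu in reports:
        hosts_by_wu[wu].add(host)
    return {
        wu: "probably a WU problem" if len(hosts) > 1 else "could be hardware"
        for wu, hosts in hosts_by_wu.items()
    }


# Hypothetical reports: the "rig-b" entry is invented for the example.
reports = [
    ("rig-a", (6806, 3348, 2, 1)),
    ("rig-b", (6806, 3348, 2, 1)),
    ("rig-a", (6805, 3037, 0, 0)),
]
for wu, verdict in classify_failed_wus(reports).items():
    print(wu, verdict)
```

A single failing host tells you very little either way, so the second label is deliberately worded as a possibility rather than a diagnosis.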

VirtualLarry · Jan 16, 2011

Hmm. The mystery grows deeper. The "2 PEPTIDE" WU type accounts for two of the failures, yet I got my other machine up and running, and it's currently processing one of those WUs.

So the idea that it's just a bad WU may indeed be wrong. But what's wrong with my primary machine? It's stable for one hour of 64-bit OCCT Linpack on the CPU.

Edit: More weirdness. My other machine that's running the "2 PEPTIDE" unit, suddenly the screen flashed and it dropped out of Aero. This happened about 30 seconds before the screen fade for monitor timeout occurred.

Does F@H interfere with Aero on anyone else's machine?

petrusbroder · Jan 17, 2011

Thanks for keeping us updated on the problems, it gives me reassurance that I am not the only one who has had problems with -bigadv WUs.
I had a very similar problem during the race, with 5 or 6 WUs, but that has passed. I did reboot and have added 1024 MB of RAM (from 2048 to 3072 MB). Whether that was the solution or not I do not know (since the GPU does not use RAM ...) but I have not had any problems since then.

ZipSpeed · Jan 17, 2011

I would still try backing down that GPU overclock a tad to eliminate overclocking issues from the equation.

VirtualLarry · Jan 17, 2011

[23:56:54] Loaded queue successfully.
[23:56:54] Gpu type=3 species=30.
[23:56:54] + Closed connections
[23:56:54]
[23:56:54] + Processing work unit
[23:56:54] Core required: FahCore_15.exe
[23:56:54] Core found.
[23:56:54] Working on queue slot 06 [January 17 23:56:54 UTC]
[23:56:54] + Working ...
[23:56:54]
[23:56:54] *------------------------------*
[23:56:54] Folding@Home GPU Core
[23:56:54] Version 2.15 (Tue Nov 16 09:05:18 PST 2010)
[23:56:54]
[23:56:54] Build host: SimbiosNvdWin7
[23:56:54] Board Type: NVIDIA/CUDA
[23:56:54] Core : x=15
[23:56:54] Window's signal control handler registered.
[23:56:54] Preparing to commence simulation
[23:56:54] - Looking at optimizations...
[23:56:54] DeleteFrameFiles: successfully deleted file=work/wudata_06.ckp
[23:56:54] - Created dyn
[23:56:54] - Files status OK
[23:56:54] sizeof(CORE_PACKET_HDR) = 512 file=<>
[23:56:54] - Expanded 43821 -> 172159 (decompressed 392.8 percent)
[23:56:54] Called DecompressByteArray: compressed_data_size=43821 data_size=172159, decompressed_data_size=172159 diff=0
[23:56:54] - Digital signature verified
[23:56:54]
[23:56:54] Project: 6806 (Run 3940, Clone 2, Gen 11)
[23:56:54]
[23:56:54] Assembly optimizations on if available.
[23:56:54] Entering M.D.
[23:56:56] Tpr hash work/wudata_06.tpr: 625212066 1241649038 2294914323 2415248780 2798237093
[23:56:56] Working on 2 PEPTIDE (1-42)
[23:56:56] Run: exception thrown in GuardedRun -- cannot continue further.
[23:56:56] Going to send back what have done -- stepsTotalG=0
[23:56:56] Work fraction=0.0000 steps=0.
[23:57:00] logfile size=0 infoLength=0 edr=0 trr=23
[23:57:00] + Opened results file
[23:57:00] - Writing 635 bytes of core data to disk...
[23:57:00] Done: 123 -> 124 (compressed to 100.8 percent)
[23:57:00] ... Done.
[23:57:00] DeleteFrameFiles: successfully deleted file=work/wudata_06.ckp
[23:57:00]
[23:57:00] Folding@home Core Shutdown: UNSTABLE_MACHINE
[23:57:05] CoreStatus = 7A (122)
[23:57:05] Sending work to server
[23:57:05] Project: 6806 (Run 3940, Clone 2, Gen 11)
[23:57:05] - Read packet limit of 540015616... Set to 524286976.

It's these darn "2 PEPTIDE" WUs that seem to be immediately erroring. Weird that my other machine, which has identical components, was running one of the "2 PEPTIDE" WUs fine. WTF.

And what's with the "stepsTotalG=" thing being 0? Could that have something to do with this? The only bona fide EUE listed "stepsTotalG" as 50000000, even though it didn't process all of those steps.

Edit: It's not all of the "2 PEPTIDE" WUs. I checked the log, and it processed two of them successfully. So I'm going with the theory that these are just bad WUs, with zero total steps allocated.

Edit: Now my OTHER machine, identical parts, failed one of those "2 PEPTIDE" WUs immediately too. But it had processed several others successfully.
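On the stepsTotalG question, a quick bit of arithmetic in Python, assuming "Work fraction" means completed steps divided by the total (the log does not say so explicitly, and the function name is made up). For the one mid-run EUE, 0.4264 of 50000000 lines up with the "Completed 21000000 out of 50000000 steps (42%)" line printed just before the exception; for the immediate failures both numbers are zero.

```python
def steps_completed(work_fraction, total_steps):
    """Estimate completed steps, assuming fraction = completed / total."""
    return round(work_fraction * total_steps)


# Mid-run EUE from the log: Work fraction=0.4264 steps=50000000
print(steps_completed(0.4264, 50_000_000))  # 21320000

# Immediate UNSTABLE_MACHINE failures: Work fraction=0.0000 steps=0
print(steps_completed(0.0, 0))  # 0
```

So under that assumption the EUE unit died roughly 320000 steps after its last 1% progress report, while the zero-step units never got started at all.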

VirtualLarry · Feb 4, 2011

Since I last rebooted, this machine has processed 69 WUs, all successful. So I think it was bad WUs myself.

Markfw · Feb 4, 2011

Every time you get that "unstable machine" crap (when you know it is) just reboot. Fixes it every time for me.
