Stung, by BumbleBee

TennesseeTony

Elite Member
Aug 2, 2003
4,331
3,800
136
www.google.com
Greetings my fellow enthusiasts!

I need some troubleshooting help. Below are my scattered thoughts on what happened, and my actions so far.

BumbleBee went down during the PrimeGrid Race recently. He currently is alive, but not well. He is not accepting decent GPUs at the moment.

He was running dual R9-280X's. They were well spaced, but the PSU is located at the bottom of the case, and is of such a high output that the fan doesn't feel the need to run continuously, allowing that heat to get drawn into the bottom GPU.

After the first crash, I finally noticed the PSU fan/heat problem. The bottom card would reach 99C and crash the system. This is a well-ventilated case, with three 120mm fans at max speed pushing air into the system. Removing the side cover did not reduce the temps. Bee would run about 3-5 minutes and crash. So I broke out an oscillating pedestal fan, running off the mains, and the temps dropped to 66C. Then crashed. And crashed. And crashed.

Corrupted software? Ok. Uninstall, restart, install latest drivers, restart, put a load on the GPUs: crash. Restart, no load on GPUs, crash. Remove one GPU: crash. Swap GPU: crash.

Thinking both GPUs are toast, I tried a third R9-280X, and 1-3 minutes: crash. ???? Single rail PSU, so not likely that the power system is at fault. EDIT: Actually a dual rail PSU, but the PCIe gets the 2nd rail to itself, 70 amps/840 watts.
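
As a quick sanity check on the 70 amps/840 watts figure: that pairing is what you get on a 12 V rail (watts = volts x amps). A minimal Python sketch of the arithmetic follows. The 250 W per-card number is an assumed typical board power for an R9-280X, not a measurement from this system, and the function name is made up for illustration.

```python
# Rough power-budget arithmetic for the PCIe rail described above.
RAIL_VOLTS = 12.0            # PCIe power connectors are fed from the 12 V rail
RAIL_AMPS = 70.0             # rating quoted in the post
ASSUMED_CARD_WATTS = 250.0   # ASSUMPTION: typical R9-280X board power, not measured


def rail_watts(volts, amps):
    """Return the rail's rated power in watts (P = V * I)."""
    return volts * amps


capacity = rail_watts(RAIL_VOLTS, RAIL_AMPS)
dual_card_draw = 2 * ASSUMED_CARD_WATTS

print(f"PCIe rail capacity:     {capacity:.0f} W")
print(f"Assumed dual-card draw: {dual_card_draw:.0f} W")
print(f"Headroom on paper:      {capacity - dual_card_draw:.0f} W")
```

On paper the rail has headroom under that assumption, but a rating says nothing about what an aging or faulty unit actually delivers under load.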

Replace AMD cards with very low end Nvidia Quadro: Runs forever. Hmmmmmmm. A Quadro 4000, one that requires 6pin power: Runs forever. Dual Quadro 4000: Runs forever.

Ordinarily I'd just do a bunch of part swapping, but my parts are either busy or have already been tested, which still leaves me scratching my head.

My thoughts at this point, in order of likelihood:

1.) Software corruption. I was initially thinking Windows is corrupted, but then why would the Nvidia cards run fine? Other than the driver uninstallation/reinstallation, what's a man to do?

2.) PSU issue? Perhaps the double Quadro (pulling only 75W + 75W) wasn't enough to expose the problem? Maybe the fan WAS supposed to be spinning, and I'm thinking of some other unit that has the temp controlled fan. It's Thermaltake brand, 1375 Watts. My past experience is that their products are (somewhat) innovative/stylish, but not of the best quality. Decent quality, but not the best.

3.) And the least likely suspect is that the 2 previously installed 280Xs are BOTH dead, not just the over-heated one, and that my working spare was dead before I installed it.

And with that long and drawn out story, I now conclude by asking for YOUR thoughts.

Thanks in advance, and Merry CHRISTmas!

Tony.
 

Pokey

Platinum Member
Oct 20, 1999
2,781
480
126
Do you have a power supply tester? I'd check the psu first.

My PSU tester is maybe the most used tool I have other than my Phillips head.

AMD drivers would be the next thing to check. Use DDU to completely uninstall and then re-install latest driver.

That's just what I would do first. Especially if you can't put one of the cards in another box.
 

TennesseeTony

... I'd check the psu first...psu tester is maybe the [2nd] most used tool I have....

That's what my gut is telling me. And is why I've already eBay'd a Corsair AX1200. ;)

My PSU tester is around here somewhere, but it is antiquated and defective on 5v I believe. I suppose I should place yet another Amazon order.

I may need to put a hammock, porta potty, and small refrigerator on the porch for my UPS driver, as they spend a lot of time there anyway.
 

TennesseeTony

So far, I have managed to use DDU (thank you very much for that tip Pokey) to wipe all traces of AMD's drivers, and successfully run two of the three questionable GPUs on the questionable PSU.

And, the 3rd GPU was able to not only crash the system, but corrupt the drivers to make the other two cards also crash the system (all installed one at a time, single GPU).

How very interesting! Surprisingly, it is NOT the overheated card that destabilizes the system. NOT surprisingly, it's one of two matching cards that have given me trouble from day one.

VisionTek is the brand to avoid, boys and girls. As the second owner, it may not be fair for me to blame the maker; however, one card managed to run for months in an open case, with no obstructions to the heatsink, and yet one of the plastic washers used to insulate the spring-loaded heatsink screws from the circuit board melted and allowed a short-circuit.

Anyway, enough with the rant on VisionTek. After a bit more stability testing on the second 'revived' card, I'll put the suspect VisionTek back in there, just to make sure it wasn't a fluke, not seated properly, etc.
 

TennesseeTony

Final update until the new PSU arrives at the end of the year.

Before I tried the suspect VisionTek GPU, I installed the 2nd good card, for dual GPU. Failed, drivers corrupted again. So I now suspect the PSU is indeed the root cause of failure. Hopefully the bad GPU can once again be revived, so as to not have a double loss (PSU and GPU).

This hardware failure reminds me....let's go undo the overclocking on those shiny, new, expensive 980's.

 

Drsignguy

Platinum Member
Mar 24, 2002
2,264
0
76
This hardware failure reminds me....let's go undo the overclocking on those shiny, new, expensive 980's.


Probably a good idea. Nothing against OC'ing, but when crunching, I have tried to let the card(s) do their own thing. Also, I try to make sure the temps stay pretty equal, at or around 70C. So far they seem to be working just fine.
Sorry to hear about the troubles, truly sucks! Hate killing hardware! Hopefully you got it solved......
 

Assimilator1

Elite Member
Nov 4, 1999
24,152
517
126
Hi Tony
That's not what you want on xmas eve! :(

Yeah, after swapping the GPUs round I'd be testing the PSU voltages with a digital multimeter. Oh, btw, just check that the GPU power plugs & sockets are in good nick, & not part singed/scorched (from slightly loose fitting pins causing high resistance & heating). I've had a couple of ATX plugs do that & cause random rebooting.
PSU tolerances on the 3.3, 5 & 12 V rails are +/- 5% IIRC, & if they haven't changed the specs in the past few years ;)
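
To turn that +/- 5% figure into numbers to compare against a multimeter reading, here is a minimal Python sketch. It assumes the tolerance as recalled above applies equally to all three rails; the function names and the sample readings are hypothetical, not measurements from BumbleBee.

```python
# Acceptable voltage windows for the main rails, assuming a flat +/- 5% tolerance.
NOMINAL_RAILS = (3.3, 5.0, 12.0)
TOLERANCE = 0.05  # ASSUMPTION: +/- 5% on every rail, as recalled in the post


def rail_window(nominal, tolerance=TOLERANCE):
    """Return the (low, high) acceptable voltage for a nominal rail voltage."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def in_spec(nominal, measured, tolerance=TOLERANCE):
    """Return True if a measured voltage falls inside the tolerance window."""
    low, high = rail_window(nominal, tolerance)
    return low <= measured <= high


for nominal in NOMINAL_RAILS:
    low, high = rail_window(nominal)
    print(f"{nominal:>4} V rail: {low:.3f} V to {high:.3f} V")

# Hypothetical multimeter readings on the 12 V rail under load.
for reading in (11.9, 11.2):
    verdict = "in spec" if in_spec(12.0, reading) else "OUT of spec"
    print(f"12 V rail reading {reading} V: {verdict}")
```

Readings are best taken under GPU load, since a rail that looks fine at idle can sag once the cards start pulling current.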
 

Rudy Toody

Diamond Member
Sep 30, 2006
4,267
421
126
The key to successful distributed computing is to start with the largest case you can find, take off the side panel, and fill it with one-hundred-dollar bills.
 

Rudy Toody

The best way is with a race horse, which is very efficient at turning one-hundred-dollar bills into manure.