New system crashes and BSODs, Part III

CurseTheSky

Diamond Member
Oct 21, 2006
5,401
2
0
As many of you have probably read by now, I've been having problems with my latest build. Nothing like some unexplained BSODs, crashes, and wacky behavior to make you feel like a noob again. :p

I THOUGHT my MCP was overheating, but I've been assured by eVGA that it can handle substantial heat. I THOUGHT my CPU was overheating, but 27C idle in BIOS (liquid cooled) has me convinced otherwise. I THOUGHT my PSU killed my Raptor, but now my HDD is working again. I THOUGHT a thousand other things.

I sat down and looked over my case's design and its manual. The manual had special mounting instructions for Purepower PSUs, but not for Toughpowers, like the one I have. The Toughpowers have a 140mm fan on the "bottom," and a grate on the back with no attached fan. Obviously the hot air all has to come out of one area. The Kandalf LCS case mounts the PSU on its side, with the fan facing the window of the case. However, the case also has a removable HDD cage right between the PSU and the window. While it does have a 90mm exhaust fan in this area, the metal from the HDD cage blocks most of the 140mm PSU fan. Combine that with two HDDs producing their own heat, and you have a nice oven going.

I've thought my problem was overheating from the beginning, given that the system would crash and not turn back on for quite some time - probably enough time to cool back down to acceptable levels. If I booted it up "cold" (not in use for several hours), it would often run for an hour or so, then crash. If I booted it "hot" (ten minutes from the last crash) it would usually BSOD again in 5-10 minutes. However, whether the computer is under load or not seems to be irrelevant to the crashes. Sometimes it crashes in Windows, and sometimes I can play an hour of Doom 3 with no problems. That leads me to believe that it's not one of the usual monitored components (CPU, GPU, SPP, MCP) that's causing the issue.

After the very last BSOD, when I thought the only thing left to do was an RMA or two (no rhyme intended), my Raptor started clicking constantly, and the BIOS would no longer detect it. I thought it was dead, straight off the bat, leading me to believe that the PSU was the problem. I completely took the system apart, inspected everything for damage, and reseated everything (including the MCP and SPP heatsinks with some AS5 - so far, not a whole lot of change at idle). I plugged the Raptor in just by chance... and it works!

My question is: will a system detect a HDD or PSU that's overheating, and shut down accordingly?
 

CurseTheSky

Well, scratch one more theory. Crashed again. :(

I'm putting my system under load, and watching the CPU, System, and GPU temps until it crashes. Even though they're obviously not 100% correct, they should give me some idea whether or not something is overheating under load.

This is frustrating. At least I'm starting to see some light at the end of the tunnel.
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
Originally posted by: CurseTheSky
My question is: will a system detect a HDD or PSU that's overheating, and shut down accordingly?

Not unless you've installed some sort of system monitoring software that looks for a certain temperature threshold (as measured by some monitoring chip on the MB) and shuts you down. Some HDDs have thermal sensors that can be monitored through SMART. Few if any consumer PSUs have temperature monitoring (beyond maybe a diode or thermistor to control fan speed).

That's at a software level, anyway. Modern CPUs actually have an 'overheat' or 'thermal trip' pin that the BIOS/MB may use to cut the power if the CPU diode detects that it is literally going to burn up (i.e., you turned it on with no heatsink). Sometimes this is user-configurable. But otherwise, no, your system will not nicely shut down if something is overheating.

I did see a system once where the PSU fan had failed, the symptom being that after 20 minutes or so the PSU would just shut off completely -- but that certainly wasn't by design.

Your description of the case/PSU setup (a HDD cage in front of the PSU air intake? Yikes!) makes me also think heat may be the culprit. Try running the case with the side panel off and a desk/floor fan blowing air right at the MB/CPU. If it runs stably like this, probably you are not getting enough airflow in the case.
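
To make the software-level monitoring mentioned above concrete, here is a minimal Python sketch of a threshold watchdog. It is only an illustration under stated assumptions: `read_temps` is a hypothetical callable you would have to supply (for example, something that wraps whatever your monitoring chip or SMART tool exposes), the limits are made-up numbers rather than vendor specifications, and `on_trip` is where a real tool would trigger a shutdown.

```python
import time
from typing import Callable, Dict, Optional


def first_over_limit(readings: Dict[str, float],
                     limits: Dict[str, float]) -> Optional[str]:
    """Return the name of the first sensor at or above its limit, else None."""
    for name, limit in limits.items():
        value = readings.get(name)  # sensors that are not reported are skipped
        if value is not None and value >= limit:
            return name
    return None


def watch(read_temps: Callable[[], Dict[str, float]],
          limits: Dict[str, float],
          on_trip: Callable[[str], None],
          interval_s: float = 5.0,
          max_polls: Optional[int] = None,
          sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Poll read_temps() until a sensor reaches its limit, then call on_trip.

    read_temps is a hypothetical, caller-supplied function returning
    {sensor_name: degrees_C}. Returns the tripped sensor name, or None
    if max_polls is reached without a trip.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        tripped = first_over_limit(read_temps(), limits)
        if tripped is not None:
            on_trip(tripped)  # e.g. log the event and request a shutdown
            return tripped
        polls += 1
        sleep(interval_s)
    return None
```

Note that a watchdog like this only sees sensors that are actually exposed to software, which is the limitation described above: a PSU with no readable temperature sensor would never trip it.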
 

CurseTheSky

Just tested CPU temps with two different Windows-based temperature monitors (Intel Thermal Analysis Tool and NVIDIA MonitorView) and thirty minutes of Orthos. It started out (idle) around 34-36C, then jumped up to 40C once Orthos started. During Orthos operation, it ranged from 39-45C under full load - I enabled logging and took screenshots in case of a crash. After Orthos was stopped, it almost instantly went back to 36C, and stayed stable at idle. During the test, the system was running with the side panel off, and all voltages and frequencies were set at stock or low settings. Memory was set to 4-4-4-12 and 2.1V.

It's not the best test in the world, but I honestly think that I can safely rule out processor overheating as the cause of the system crashes. The MCP / SPP heatpipe, Raptor HDD, and the RAM are all very hot to the touch, though.
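
For anyone wanting to go through a temperature log like this after a crash, a small Python sketch along these lines could summarize it. The log format is an assumption (plain `time,sensor,temp_c` rows); the actual export formats of the monitoring tools named above are not given here and will differ, so the parsing would need adapting.

```python
import csv
from typing import Dict, Iterable


def summarize(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Return min, max, and last temperature per sensor.

    Assumes hypothetical rows of the form "time,sensor,temp_c".
    Headers, blank lines, and rows cut short by a crash are skipped.
    """
    summary: Dict[str, Dict[str, float]] = {}
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # blank or truncated row
        _, sensor, raw = row
        try:
            temp = float(raw)
        except ValueError:
            continue  # header row or garbled value
        stats = summary.setdefault(
            sensor, {"min": temp, "max": temp, "last": temp})
        stats["min"] = min(stats["min"], temp)
        stats["max"] = max(stats["max"], temp)
        stats["last"] = temp  # final reading before the log ends
    return summary
```

The "last" value is the interesting one for crash hunting: if the final reading before a BSOD sits well inside the normal range, that sensor's component is an unlikely culprit.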

I did play around with nTune a little while back, and checked a BIOS option for manual fan control on the Northbridge. I set it to 100%, since during some of the crashes, I heard the Northbridge fan spin down a few seconds before the BSOD. So far, I'm about an hour, maybe an hour and fifteen minutes stable. Time to play some games and see what happens - here's to hoping I somehow fixed the problem.
 

CurseTheSky

Originally posted by: Matthias99
Originally posted by: CurseTheSky
My question is: will a system detect a HDD or PSU that's overheating, and shut down accordingly?

Not unless you've installed some sort of system monitoring software that looks for a certain temperature threshold (as measured by some monitoring chip on the MB) and shuts you down. Some HDDs have thermal sensors that can be monitored through SMART. Few if any consumer PSUs have temperature monitoring (beyond maybe a diode or thermistor to control fan speed).

That's at a software level, anyway. Modern CPUs actually have an 'overheat' or 'thermal trip' pin that the BIOS/MB may use to cut the power if the CPU diode detects that it is literally going to burn up (i.e., you turned it on with no heatsink). Sometimes this is user-configurable. But otherwise, no, your system will not nicely shut down if something is overheating.

I did see a system once where the PSU fan had failed, the symptom being that after 20 minutes or so the PSU would just shut off completely -- but that certainly wasn't by design.

Your description of the case/PSU setup (a HDD cage in front of the PSU air intake? Yikes!) makes me also think heat may be the culprit. Try running the case with the side panel off and a desk/floor fan blowing air right at the MB/CPU. If it runs stably like this, probably you are not getting enough airflow in the case.


First of all, thanks for the input. Everything is very welcome at this point.

Second, I've tried just about every different simple variable I can think of. The system crashes under full load (Orthos, about 35-40 minutes into it, BSOD), and at complete idle (let the system sit there overnight, not doing a thing - crashed out of the blue). I've run it with the side panel off and with the side panel on. It SEEMS to crash faster with the panel on, which is another reason I think it's a heat problem. The case does get decent airflow, except where the PSU fan is. Keep in mind, the PSU fan is on the bottom, and the PSU is mounted on its side. I have a feeling this case was designed before the Toughpower line was, so they never really thought they'd have this problem.

However, I've pretty much ruled out the PSU overheating, as I was running the computer with the HDDs installed in normal 3.5" bays, the top HDD cage (the one blocking the PSU fan) completely removed, and the side panel off. The PSU had tons of room to breathe. It still crashed, while idling, out of the blue.

So far so good with my recent tests. If it doesn't crash again, I'm going to assume that there was some sort of error in a BIOS setting, and my tinkering with nTune or the BIOS itself somehow fixed it. After playing with nothing other than the memory ratio, FSB speed, and memory speed, Windows failed to load, saying that some profile in ../System32/ was corrupted or missing. I restarted, and Windows loaded up with no problems - sort of odd.
 

Matthias99

Originally posted by: CurseTheSky
First of all, thanks for the input. Everything is very welcome at this point.

Second, I've tried just about every different simple variable I can think of. The system crashes under full load (Orthos, about 35-40 minutes into it, BSOD), and at complete idle (let the system sit there overnight, not doing a thing - crashed out of the blue). I've run it with the side panel off and with the side panel on. It SEEMS to crash faster with the panel on, which is another reason I think it's a heat problem. The case does get decent airflow, except where the PSU fan is. Keep in mind, the PSU fan is on the bottom, and the PSU is mounted on its side. I have a feeling this case was designed before the Toughpower line was, so they never really thought they'd have this problem.

However, I've pretty much ruled out the PSU overheating, as I was running the computer with the HDDs installed in normal 3.5" bays, the top HDD cage (the one blocking the PSU fan) completely removed, and the side panel off. The PSU had tons of room to breathe. It still crashed, while idling, out of the blue.

So far so good with my recent tests. If it doesn't crash again, I'm going to assume that there was some sort of error in a BIOS setting, and my tinkering with nTune or the BIOS itself somehow fixed it. After playing with nothing other than the memory ratio, FSB speed, and memory speed, Windows failed to load, saying that some profile in ../System32/ was corrupted or missing. I restarted, and Windows loaded up with no problems - sort of odd.

You've got my sympathy, if nothing else. I spent a week trying to debug my last build... turns out the RMA replacement for my motherboard (squealing caps under load in the first) was just flaky enough to run memtest86, make it through the Windows installation process, and then fail to boot up. I swapped out the CPU, HDD, and PSU before finally deciding it had to be the new board that was bad.

Random lockups/BSODs, frankly, suck to figure out. Crashing while sitting 'idle' implies some sort of hardware fault to me. Have you run memtest86/prime95 to just see if the CPU/RAM is working OK?
 

CurseTheSky

I've run Memtest, though not as many passes as I should, I'll admit. I let it go through two full passes, and an additional five passes of test 5 just to get a good idea of whether or not the memory was completely bad. No errors with that - I'll give Memtest another shot tonight or tomorrow, depending on whether it crashes again or not (TWO HOURS... SO FAR SO GOOD!).

Idle crashes did bother me a lot, and nothing I did seemed to help. Everything really pointed to either something that isn't directly temperature monitored overheating, motherboard failure, or PSU failure. Since I don't have a spare PSU or motherboard, I can't really test either of those. I was convinced for a while that my PSU had killed my Raptor, but both seem to be working just fine now. After that, I was convinced that either the HDDs or PSU was overheating. I moved those, and it crashed one more time. Every time I thought I had the answer, I'd fix it, and it would crash AGAIN.

As I said before, I screwed around with both nTune and the BIOS a few hours ago. I used the automatic overclocking (coarse tune) just for kicks, and ended up with a whopping 150MHz gain on the CPU (woot). While I was there, I also adjusted the Northbridge fan to 100% manual, so it would spin nice and fast at all times. I haven't heard it spin down since, and I haven't had a crash since... maybe that was the problem?

Additionally, I tinkered with some manual overclocking in the BIOS just to see if I could raise the processor temps a bit. My first overclock (2.4 to 2.55 GHz) worked, but my second (2.4 to 2.9 GHz, big jump just for fun) crashed. When I restarted, Windows gave me an error saying that some kind of setup or profile in ../System32 was corrupted or missing - I thought either the HDD was bad, or I had screwed something up and it was reformat time. I restarted for the hell of it, and Windows loaded with no problems. Maybe there was some kind of low-level file issuing bad commands to hardware in my system? It's a long shot, but just another thing to help me THINK I fixed my system.

This is the longest I've gone without a crash so far. I've managed to run 30 minutes of Orthos, post here for about 10-15 minutes, AND do a 40-minute Battlefield 2 patch download. So far, so good. Now it's time to install BF2: Special Forces and play some games... the whole reason I put this thing together. :D

Edit: As you can probably tell by now, a lot of my "debugging" is very disorganized and certainly not the best way to go about things. I just keep tinkering hoping that I can find the problem. It's worked for every system in the past... I just hope it works for this one.

Edit 2: I also have two programs actively logging and monitoring my system. Hopefully they'll provide some insight as to what went wrong in case my system does crash again. I wish I had thought of this earlier.
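
One thing that matters for this kind of logging is that the last samples actually reach the disk before a BSOD. Below is a minimal Python sketch of a crash-tolerant logger, not what the two programs above do internally: `log_sample` and the `time,sensor,temp` row layout are made up for illustration, and the readings themselves would have to come from whatever sensor source you have.

```python
import os
import time
from typing import Dict


def log_sample(path: str, readings: Dict[str, float], now: float) -> None:
    """Append one row per sensor and force it to disk immediately.

    readings maps a sensor name to degrees C (how you obtain them is up
    to you; this sketch does not read any hardware). The flush + fsync
    pair asks the OS to write the data out now, so the final samples
    before a hard crash are less likely to be lost in a buffer.
    """
    with open(path, "a", encoding="utf-8") as f:
        for sensor, temp in readings.items():
            f.write(f"{now:.0f},{sensor},{temp:.1f}\n")
        f.flush()
        os.fsync(f.fileno())


if __name__ == "__main__":
    # Example with a fixed, made-up reading; replace with real sensor data.
    log_sample("temps.csv", {"cpu": 36.0}, time.time())
```

Syncing on every sample is slow, but at one sample every few seconds that cost is negligible compared to losing the readings from the minute before the crash.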
 

CurseTheSky

Well, it ran all this time without a crash. I've done all kinds of various things that I would normally do during a day, so I'm HOPING that the issue is fixed. What finally fixed it? No idea. I did verify that the Northbridge fan setting reset to automatic, so I guess that didn't do it. It spun down a few times over the hours, but never crashed - thank God.

Now for the dreaded restart to see if it all still works. Wish me luck. :D
 

Rastus

Diamond Member
Oct 10, 1999
4,704
3
0
Try swapping RAM around in different configurations of the slots. Could be a bad slot.