2 identical systems, but one overheats regularly and shuts down

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
I have 2 identical Core 2 Quad Q9650 systems. Both run at full load 24/7, but one of them regularly shuts down due to overheating, based on the system beeping and the following message when I boot it up again:

The following are errors that were detected during this boot.
These can be viewed in setup on the Event Log Page.
WARNING: Processor Thermal Trip


I'm running Linux on both systems. What temperature monitoring programs are out there for Linux/Ubuntu? I want to monitor temps as a start to see what is going on.
 

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
I have 2 identical Core 2 Quad Q9650 systems. Both run at full load 24/7, but one of them regularly shuts down due to overheating, based on the system beeping and the following message when I boot it up again:

The following are errors that were detected during this boot.
These can be viewed in setup on the Event Log Page.
WARNING: Processor Thermal Trip


I'm running Linux on both systems. What temperature monitoring programs are out there for Linux/Ubuntu? I want to monitor temps as a start to see what is going on.

lm-sensors works for me.

https://help.ubuntu.com/community/SensorInstallHowto

It might be a good idea to reseat your heat sink/fan.
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
lm-sensors works for me.

https://help.ubuntu.com/community/SensorInstallHowto

It might be a good idea to reseat your heat sink/fan.
Thanks.

The funny thing is that of the two systems, the one that is overheating had the HSF installed most properly (all 4 clips worked with some effort), while the one that doesn't overheat only had 3 of the clips working (1/4 never clicked into place properly).

Seriously, both systems are identical in every way. Chassis, internal cable layout, components (both had brand new stock HSFs). Literally everything is the same. And yet...

Could it be that the overheating one is overheating simply due to positioning in the room? I have them in vertical cases, next to one another. The vents are on the side, meaning the system on the right vents into open space, while the system on the left vents at the other system, with a few inches of space. I can't imagine that being the reason though. Maybe it's a faulty HSF.

Gonna have to install that monitoring software...
 

Rudy Toody

Diamond Member
Sep 30, 2006
4,267
421
126
You could also check the thermal shut-down limits in the BIOS. Perhaps one is set too low.
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
You could test this theory by swapping the PC placement.
I'm quite aware of that, but I hate moving my computers or opening them unless I absolutely have to. It's a bit of OCD, I guess you could say (since I make it a huge operation every time).

You could also check the thermal shut-down limits in the BIOS. Perhaps one is set too low.
I'll check, but I doubt it's the case, as both mobos were brand new, and I'm fairly sure they were the same revision and BIOS, etc.
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
Thermal tripping system at full load:

w83627dhg-isa-0a10
Adapter: ISA adapter
Vcore: +1.06 V (min = +0.00 V, max = +1.74 V)
in1: +1.12 V (min = +0.09 V, max = +2.02 V)
AVCC: +3.33 V (min = +2.86 V, max = +0.91 V) ALARM
+3.3V: +3.31 V (min = +2.72 V, max = +3.79 V)
in4: +1.29 V (min = +0.31 V, max = +0.81 V) ALARM
in5: +0.76 V (min = +1.49 V, max = +0.20 V) ALARM
in6: +1.01 V (min = +1.48 V, max = +1.43 V) ALARM
3VSB: +3.41 V (min = +2.93 V, max = +3.47 V)
Vbat: +3.33 V (min = +2.21 V, max = +3.81 V)
fan1: 0 RPM (min = 351 RPM, div = 128) ALARM
fan2: 3183 RPM (min = 1068 RPM, div = 8)
fan3: 0 RPM (min = 351 RPM, div = 128) ALARM
fan4: 0 RPM (min = 10546 RPM, div = 128) ALARM
fan5: 0 RPM (min = 10546 RPM, div = 128) ALARM
temp1: +56.0°C (high = -12.0°C, hyst = -4.0°C) ALARM sensor = thermistor
temp2: +85.0°C (high = +80.0°C, hyst = +75.0°C) ALARM sensor = diode
temp3: +47.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
cpu0_vid: +0.000 V
intrusion0: ALARM

Thermal tripping system at idle:

w83627dhg-isa-0a10
Adapter: ISA adapter
Vcore: +1.06 V (min = +0.00 V, max = +1.74 V)
in1: +1.12 V (min = +0.09 V, max = +2.02 V)
AVCC: +3.33 V (min = +2.86 V, max = +0.91 V) ALARM
+3.3V: +3.31 V (min = +2.72 V, max = +3.79 V)
in4: +1.29 V (min = +0.31 V, max = +0.81 V) ALARM
in5: +0.76 V (min = +1.49 V, max = +0.20 V) ALARM
in6: +1.01 V (min = +1.48 V, max = +1.43 V) ALARM
3VSB: +3.41 V (min = +2.93 V, max = +3.47 V)
Vbat: +3.33 V (min = +2.21 V, max = +3.81 V)
fan1: 0 RPM (min = 351 RPM, div = 128) ALARM
fan2: 3183 RPM (min = 1068 RPM, div = 8)
fan3: 0 RPM (min = 351 RPM, div = 128) ALARM
fan4: 0 RPM (min = 10546 RPM, div = 128) ALARM
fan5: 0 RPM (min = 10546 RPM, div = 128) ALARM
temp1: +56.0°C (high = -12.0°C, hyst = -4.0°C) ALARM sensor = thermistor
temp2: +85.0°C (high = +80.0°C, hyst = +75.0°C) ALARM sensor = diode
temp3: +47.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
cpu0_vid: +0.000 V
intrusion0: ALARM

Virtually identical temps. Something is either wrong with the sensor(s), or with the cooling. Agreed? I could really use some help here.

For comparison, here is the healthy system at full load:

w83627dhg-isa-0a10
Adapter: ISA adapter
Vcore: +1.15 V (min = +0.00 V, max = +1.74 V)
in1: +1.11 V (min = +1.54 V, max = +0.84 V) ALARM
AVCC: +3.30 V (min = +3.09 V, max = +1.90 V) ALARM
+3.3V: +3.30 V (min = +2.43 V, max = +0.78 V) ALARM
in4: +1.30 V (min = +0.10 V, max = +1.54 V)
in5: +0.76 V (min = +0.58 V, max = +0.02 V) ALARM
in6: +1.04 V (min = +0.22 V, max = +1.69 V)
3VSB: +3.39 V (min = +0.02 V, max = +0.37 V) ALARM
Vbat: +3.33 V (min = +0.51 V, max = +1.06 V) ALARM
fan1: 0 RPM (min = 2636 RPM, div = 128) ALARM
fan2: 3068 RPM (min = 998 RPM, div = 8)
fan3: 0 RPM (min = 351 RPM, div = 128) ALARM
fan4: 0 RPM (min = 10546 RPM, div = 128) ALARM
fan5: 0 RPM (min = 10546 RPM, div = 128) ALARM
temp1: +70.0°C (high = -60.0°C, hyst = -100.0°C) ALARM sensor = thermistor
temp2: +75.0°C (high = +80.0°C, hyst = +75.0°C) sensor = diode
temp3: +55.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
cpu0_vid: +0.000 V
intrusion0: ALARM
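
To make the comparison between the two boxes easier to eyeball, here is a minimal sketch that pulls the temperature lines out of pasted `sensors` output and flags anything above its "high" limit, or whose limit is obviously nonsense (zero or negative). The function name `check_temps` and the status strings are made up for illustration, and the regex only assumes the line layout shown in the readouts above.

```python
import re

# Matches lines such as:
#   temp2: +85.0°C (high = +80.0°C, hyst = +75.0°C) ALARM sensor = diode
TEMP_LINE = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<value>[+-]?\d+(?:\.\d+)?)°C\s*"
    r"\(high = (?P<high>[+-]?\d+(?:\.\d+)?)°C"
)


def check_temps(sensors_text):
    """Return a (label, value, high, status) tuple for each temperature line."""
    results = []
    for line in sensors_text.splitlines():
        m = TEMP_LINE.match(line.strip())
        if not m:
            continue  # voltages, fans, headers, etc.
        value = float(m.group("value"))
        high = float(m.group("high"))
        if high <= 0:
            # A zero or negative "high" limit is almost certainly unconfigured.
            status = "limit looks bogus"
        elif value > high:
            status = "over limit"
        else:
            status = "ok"
        results.append((m.group("label"), value, high, status))
    return results


# Temperature lines copied from the thermal-tripping system's readout.
sample = """\
temp1: +56.0°C (high = -12.0°C, hyst = -4.0°C) ALARM sensor = thermistor
temp2: +85.0°C (high = +80.0°C, hyst = +75.0°C) ALARM sensor = diode
temp3: +47.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
"""

for row in check_temps(sample):
    print(row)
```

On the sample, only temp2 (the diode reading) is over a plausible limit; the temp1 alarm comes from a limit that was never set sensibly.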
 

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
The CPU temp sensors for Intel chips should show up under "coretemp".

this is my 2600K read out:

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +50.0°C (high = +80.0°C, crit = +98.0°C)
Core 0: +49.0°C (high = +80.0°C, crit = +98.0°C)
Core 1: +43.0°C (high = +80.0°C, crit = +98.0°C)
Core 2: +48.0°C (high = +80.0°C, crit = +98.0°C)
Core 3: +50.0°C (high = +80.0°C, crit = +98.0°C)

I would check your temps in the BIOS. If they are high there, then it's probably the HSF.
 

GLeeM

Elite Member
Apr 2, 2004
7,199
128
106
The vents are on the side, meaning the system on the right vents into open space, while the system on the left vents at the other system, with a few inches of space. I can't imagine that being the reason though.
I can imagine this as being THE reason!
My computer vents out very hot air. If it took in air that hot it would not run very long :(
 

Sunny129

Diamond Member
Nov 14, 2000
4,823
6
81
Could it be that the overheating one is overheating simply due to positioning in the room? I have them in vertical cases, next to one another. The vents are on the side, meaning the system on the right vents into open space, while the system on the left vents at the other system, with a few inches of space. I can't imagine that being the reason though. Maybe it's a faulty HSF.
hmm...i've never heard of such a computer case before. usually intake air comes in the front and/or the left side of the case and exhaust out the back or the top. you say your cases exhaust hot air out the right side of the case? if that is the case, and if you have intake fans on the left side (and not just at the front of the cases), then the left computer would be exhausting hot air right into the side intake of the right computer. that could be a major problem. but please confirm if i understood the configuration correctly or not...
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
Ok, just to clarify:

I don't think my cases (link) have either intake or exhaust fans (I could be wrong though; I will likely check the case later and update, since I'm probably going to have to open up the one case to investigate).

What I meant to say is that there is a "vent" on the one side, which I assume hot air tends to come out of. That one is facing the side of the other case.

In other words, I thought maybe the overheating case isn't getting enough "clearance" on the fanless vent, so the hot air wasn't escaping efficiently enough.
 

Sunny129

Diamond Member
Nov 14, 2000
4,823
6
81
Ok, just to clarify:

I don't think my cases (link) have either intake or exhaust fans (I could be wrong though; I will likely check the case later and update, since I'm probably going to have to open up the one case to investigate).

What I meant to say is that there is a "vent" on the one side, which I assume hot air tends to come out of. That one is facing the side of the other case.

In other words, I thought maybe the overheating case isn't getting enough "clearance" on the fanless vent, so the hot air wasn't escaping efficiently enough.
ahh...i didn't know you were using mATX slim cases, so that clears up a few things. first of all, the product description would mention the locations of chassis fans if there were any, and it doesn't say anything to that effect...so those vents are probably just for passive cooling intended to function by convection only. normally i would say that small, cramped, passively cooled cases aren't ideal for crunching machines, but you've managed to do it successfully with at least one of those machines. if your cases are standing vertically (and not lying flat like in the newegg product pictures), and there are no case fans, then the side vents probably act as intakes and the top vents act as exhaust vents by way of convection. i, like you, suspect that the problem may very well be insufficient distance between your two cases...but not because the lack of space isn't allowing enough hot air to escape out of the side vent. rather i think it is because the lack of space between the cases isn't letting enough cool air in through the side vent on the left case. not only that, but if the cases are in close enough proximity, any cool air trying to make its way between the cases and into the side vent on the left computer is probably getting warmed up before it ever makes it into the left computer b/c it has to pass by the left side of the right case first - and the mobo and CPU sit against the left side of these vertical cases, probably making them warmer than the right sides of these cases.
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
In BIOS...

What is "Processor Thermal Margin"? Is this the CPU temp? One CPU is around 24°C, while the other hovers between 0°C, -1°C, and -2°C (and I just started it up, straight into BIOS, lol?).

I'm guessing this is either a faulty/broken diode or whatnot, or something has caused the HSF to become unseated or not functioning (the fan is spinning at normal speed on the system in question). Is it possible for a thermal diode to simply "break"?

Something of importance I forgot to mention earlier: both processors were running at 100% load for over 2 months before the one started having issues. Based on that alone, I don't think the chassis positioning theory is true, given I would have had issues from the start. It must be either a diode issue, or HSF issue.

What I'm wondering at this point is: do diodes just stop working like that? Also: is the diode part of the CPU, or part of the motherboard?

If it matters, the mobos are the Intel DG41TY, nothing fancy.
 

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
The readout from w83627dhg-isa-0a10 is from motherboard sensors.

The coretemp thermal sensors are in the CPU. There is one thermal sensor per core. To accurately monitor CPU temp, you'll need the readout from those sensors.

I would suggest running sudo sensors-detect again. You should detect coretemp-isa-0000 if you answer yes to all the questions. Post the entire readout from that if you can. If that fails, then I'm not sure what to suggest. Perhaps this intel MB does not feature readouts from the CPU thermal sensors? Hard to believe.
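
If `sensors` still doesn't list coretemp, one way to check whether the kernel driver is exposing anything at all is to read the hwmon sysfs tree directly. This is only a sketch: `read_hwmon_temps` is a made-up helper name, and it assumes the usual layout where each `/sys/class/hwmon/hwmon*` directory has a `name` file and `temp*_input` files holding millidegrees Celsius. On some older kernels those files sit under a `device/` subdirectory instead, which this does not handle.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {chip_name: {sensor_file: degrees_C}} from a hwmon sysfs tree."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        if not name_file.is_file():
            continue  # layout differs (e.g. files under device/); skipped here
        name = name_file.read_text().strip()
        readings = {}
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                # Values are reported in millidegrees Celsius.
                readings[sensor.name] = int(sensor.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # unreadable or non-numeric sensor file
        if readings:
            temps.setdefault(name, {}).update(readings)
    return temps


if __name__ == "__main__":
    for chip, readings in read_hwmon_temps().items():
        for sensor, deg in readings.items():
            print(f"{chip} {sensor}: {deg:.1f} C")
```

If nothing named `coretemp` shows up, the module may simply not be loaded; `sudo modprobe coretemp` followed by another run is worth a try.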
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
The readout from w83627dhg-isa-0a10 is from motherboard sensors.

The coretemp thermal sensors are in the CPU. There is one thermal sensor per core. To accurately monitor CPU temp, you'll need the readout from those sensors.

I would suggest running sudo sensors-detect again. You should detect coretemp-isa-0000 if you answer yes to all the questions. Post the entire readout from that if you can. If that fails, then I'm not sure what to suggest. Perhaps this intel MB does not feature readouts from the CPU thermal sensors? Hard to believe.
What do you think of the BIOS readout above?

Either way, I'm going to open up the chassis tomorrow and take a look.
 

biodoc

Diamond Member
Dec 29, 2005
6,262
2,238
136
What is "Processor Thermal Margin"? Is this the CPU temp? One CPU is around 24°C, while the other hovers between 0°C, -1°C, and -2°C (and I just started it up, straight into BIOS, lol?).

I did some google searches and found this:

The Processor Thermal Margin is actually reporting the difference between Tjmax (the maximum theoretical safe temperature for the CPU) and the actual temperature of the CPU. So if the PTM is higher, it means your CPU temperature is much lower.

It looks as though the processor that reads 0 C has reached the maximum allowed temperature (Tjmax). The HSF is probably not seated properly. Is this the one with the shutdown problems?
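
To put numbers on that definition, here is a quick worked sketch. The Tjmax of 100°C is only an assumed placeholder for illustration, not a confirmed figure for the Q9650, and `cpu_temp_from_margin` is a made-up helper name.

```python
def cpu_temp_from_margin(margin_c, tjmax_c=100.0):
    """Estimate CPU temperature from the BIOS thermal margin.

    margin = Tjmax - temperature, so temperature = Tjmax - margin.
    tjmax_c defaults to 100.0 purely as an assumed placeholder value.
    """
    return tjmax_c - margin_c


# Margins reported in the BIOS for the two systems.
for margin in (24, 0, -2):
    print(f"margin {margin:+d} C -> about {cpu_temp_from_margin(margin):.0f} C")
```

Whatever the real Tjmax is, a margin of 24 means the chip is 24 degrees below it, and a margin of 0 or less means the chip is reporting itself at or past the limit.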
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
I did some google searches and found this:

The Processor Thermal Margin is actually reporting the difference between Tjmax (the maximum theoretical safe temperature for the CPU) and the actual temperature of the CPU. So if the PTM is higher, it means your CPU temperature is much lower.

It looks as though the processor that reads 0 C has reached the maximum allowed temperature (Tjmax). The HSF is probably not seated properly. Is this the one with the shutdown problems?
Yup.

I really don't get it though. It was running fine for a solid 2 months, at 100% load.
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
Ok, I opened up the system, and the HSF looks seated perfectly fine, and has almost no dust buildup either (I used compressed air on it anyway to clean it up). Fan seems to be spinning fine as well, based on how easily and noiselessly it spins when applying compressed air.

I see no point in removing the HSF and reapplying it, given I'd have to use up more thermal paste.

Diode problem perhaps? I really need your help on this.

Here is a pic of the "overheating" internals:

jQuykHt.jpg
 

Sunny129

Diamond Member
Nov 14, 2000
4,823
6
81
Something of importance I forgot to mention earlier: both processors were running at 100% load for over 2 months before the one started having issues. Based on that alone, I don't think the chassis positioning theory is true, given I would have had issues from the start. It must be either a diode issue, or HSF issue.
yeah, now that you mention this, i agree that the issue isn't case placement.

i don't see any bad caps on the board in your picture...everything "looks" kosher. my guess is that the thermal diode is crapping out, b/c let's face it - if the heatsink/fan assembly is seated correctly, and the fan is operating at the same or similar rpms as your working machine, then it must be a bogus temp reading that's shutting down the system - not an actual overheating event. that said, before you just assume it's a faulty diode or temp sensor, it might be wise to re-seat the heatsink/fan assembly. i know you don't want to do it b/c it's a hassle, but it's the only way you can be 100% sure that it isn't the source of the problem. and a small tube containing several applications' worth of thermal compound costs pennies in the grand scheme of things, should you need another.
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
yeah, now that you mention this, i agree that the issue isn't case placement.

i don't see any bad caps on the board in your picture...everything "looks" kosher. my guess is that the thermal diode is crapping out, b/c let's face it - if the heatsink/fan assembly is seated correctly, and the fan is operating at the same or similar rpms as your working machine, then it must be a bogus temp reading that's shutting down the system - not an actual overheating event. that said, before you just assume it's a faulty diode or temp sensor, it might be wise to re-seat the heatsink/fan assembly. i know you don't want to do it b/c it's a hassle, but it's the only way you can be 100% sure that it isn't the source of the problem. and a small tube containing several applications' worth of thermal compound costs pennies in the grand scheme of things, should you need another.
Well, the other thing is that the HSF was a huge hassle to install simply because the clips on it were slightly defective. I had to apply a lot of force to get them to "snap" into place. I'm afraid of breaking the board outright if I try again.

Is there any way to bypass the thermal shutdown process? I want to force it to run, even if it thinks it's overheating.

I checked the BIOS and didn't see any obvious options for thermal protection bypassing. Perhaps something from within Linux itself?

BIOS info: http://www.intel.com/support/motherboards/desktop/sb/CS-020304.htm
 

QuietDad

Senior member
Dec 18, 2005
523
79
91
You're NEVER going to figure it out until:
1: Try it pulled out in the open, maybe with the case open to see if it's something as silly as an airflow issue
2: Reseat the HSF with new thermal paste to make sure it's seated right.

Those two steps at MOST cost you $15 and an afternoon. You already have the case out for the pic you took. Forcing it to run hot will burn out a CPU and a motherboard for a minimum of $200 and when you rebuild it and it works because the new CPU was seated correctly, you'll swear you have it fixed, put the case back and it will overheat again if it was airflow.
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
You're NEVER going to figure it out until:
1: Try it pulled out in the open, maybe with the case open to see if it's something as silly as an airflow issue
2: Reseat the HSF with new thermal paste to make sure it's seated right.

Those two steps at MOST cost you $15 and an afternoon. You already have the case out for the pic you took. Forcing it to run hot will burn out a CPU and a motherboard for a minimum of $200 and when you rebuild it and it works because the new CPU was seated correctly, you'll swear you have it fixed, put the case back and it will overheat again if it was airflow.
Yea, I'm probably going to do option 1 later today. I just really, really doubt it's an airflow issue or something of that nature. I mean, I boot up the system from a standstill (totally shutdown), straight into BIOS, and the CPU is at its thermal limit already? It just seems like a diode issue.

EDIT: I tried it just now. Again, with the system case wide open, from a dead startup and straight into BIOS, it's "overheating". The air around the HSF is relatively cool (mobo temp reported as 39°C).
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
Okay, now I'm running the processor at 100% load within Linux (folding), with the case wide open. Let's see how long it lasts until it shuts down due to "overheating"...
 

Turbonium

Platinum Member
Mar 15, 2003
2,109
48
91
Over 3 hours later and it's still running. Weird.

I'm going to now put the case cover on and see what happens. If it runs for a few hours without shutting down, I'm going to try putting it in the vertical position. Perhaps the bearing on the fan has given way somehow and when it's vertical, it doesn't spin as efficiently. I'm really running out of ideas here.
 

QuietDad

Senior member
Dec 18, 2005
523
79
91
Have we looked at the temperatures 3 hours later? Be interesting to see if they are now more in line with the other PC. At this point I would just put it vertical. If that overheats, then try it flat. Saves a step if it's good, same number of steps if it's not.
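
If it helps, something like the sketch below could leave a timestamped trail of readings, so the last temps before a shutdown are still on disk afterwards. It is only a rough idea: `log_temps` and `read_sensors` are made-up names, and `read_sensors` assumes the `sensors` command from lm-sensors is installed and on the PATH.

```python
import subprocess
import time


def read_sensors():
    """Return the raw text printed by the lm-sensors `sensors` command."""
    return subprocess.run(
        ["sensors"], capture_output=True, text=True, check=True
    ).stdout


def log_temps(read_fn, out, samples, interval_s=60.0, sleep=time.sleep):
    """Write `samples` timestamped readings from read_fn() to the file `out`."""
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        out.write(f"--- {stamp} ---\n{read_fn()}\n")
        out.flush()  # get each sample out of Python's buffer right away
        if i < samples - 1:
            sleep(interval_s)


# Example use (not run here): one reading a minute for three hours.
# with open("temps.log", "a") as log:
#     log_temps(read_sensors, log, samples=180, interval_s=60.0)
```

Flushing after every sample only hands the data to the OS; a hard thermal trip can still lose the last entry or two, so treat the tail of the log with some caution.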