Did I damage my CPU?

dinkumthinkum

Senior member
Jul 3, 2008
203
0
0
I have a server (no overclocking), a few years old, with an E8400. I just had to move it from one place to another, and for the past day and a half it has been shutting itself down randomly. I ran some diagnostics and found machine check exceptions reporting thermal events, so I installed lm-sensors and checked the temps. It's idling in the high 70s to mid-80s. Under load it's spiking to 100C. I rebooted and opened the BIOS health status screen to be sure that I was reading the right temps. The moment I opened it, the CPU temp was 100C and climbing fast. Before I could even hit the power button, it was at 112C, and then the machine shut itself down.

I am about to go ahead and reseat the HSF with a fresh coating of Arctic Silver. But should I just go ahead and replace the CPU since it is already 3 years old and I may have cut its lifespan terribly?

Code:
IDLE:

it8718-isa-0290
Adapter: ISA adapter
in0:          +1.06 V  (min =  +0.00 V, max =  +4.08 V)
in1:          +2.03 V  (min =  +0.00 V, max =  +4.08 V)
in2:          +3.30 V  (min =  +0.00 V, max =  +4.08 V)
+5V:          +2.85 V  (min =  +0.00 V, max =  +4.08 V)
in4:          +4.08 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in5:          +0.05 V  (min =  +0.00 V, max =  +4.08 V)
in6:          +4.08 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in7:          +3.01 V  (min =  +0.00 V, max =  +4.08 V)
Vbat:         +3.07 V  
fan1:        1717 RPM  (min =    0 RPM)
fan2:           0 RPM  (min =    0 RPM)
temp1:        +40.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:        +67.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
temp3:         -2.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +78.0°C  (high = +78.0°C, crit = +100.0°C)  ALARM (CRIT)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:       +77.0°C  (high = +78.0°C, crit = +100.0°C)  ALARM (CRIT)


LOAD:

it8718-isa-0290
Adapter: ISA adapter
in0:          +1.09 V  (min =  +0.00 V, max =  +4.08 V)
in1:          +2.03 V  (min =  +0.00 V, max =  +4.08 V)
in2:          +3.28 V  (min =  +0.00 V, max =  +4.08 V)
+5V:          +2.85 V  (min =  +0.00 V, max =  +4.08 V)
in4:          +4.08 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in5:          +0.14 V  (min =  +0.00 V, max =  +4.08 V)
in6:          +4.08 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in7:          +3.01 V  (min =  +0.00 V, max =  +4.08 V)
Vbat:         +3.07 V  
fan1:        1708 RPM  (min =    0 RPM)
fan2:           0 RPM  (min =    0 RPM)
temp1:        +39.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:        +88.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
temp3:         -2.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor

coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +100.0°C  (high = +78.0°C, crit = +100.0°C)  ALARM (CRIT)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:       +98.0°C  (high = +78.0°C, crit = +100.0°C)  ALARM (CRIT)


LOG:

kernel: [ 1499.816014] [Hardware Error]: Machine check events logged
mcelog: HARDWARE ERROR. This is *NOT* a software problem!
mcelog: Please contact your hardware vendor
mcelog: MCE 0
mcelog: CPU 0 THERMAL EVENT TSC 373bbee5e1a 
mcelog: TIME 1307127492 Fri Jun  3 14:58:12 2011
mcelog: Processor 0 heated above trip temperature. Throttling enabled.
mcelog: Please check your system cooling. Performance will be impacted
mcelog: STATUS 88010023 MCGSTATUS 0
mcelog: MCGCAP 806 APICID 0 SOCKETID 0 
mcelog: CPUID Vendor Intel Family 6 Model 23
mcelog: HARDWARE ERROR. This is *NOT* a software problem!
mcelog: Please contact your hardware vendor
mcelog: MCE 1
mcelog: CPU 0 THERMAL EVENT TSC 373bc002b4d 
mcelog: TIME 1307127492 Fri Jun  3 14:58:12 2011
mcelog: Processor 0 below trip temperature. Throttling disabled
mcelog: STATUS 88010022 MCGSTATUS 0
mcelog: MCGCAP 806 APICID 0 SOCKETID 0 
mcelog: CPUID Vendor Intel Family 6 Model 23
kernel: [ 1616.571112] i2c /dev entries driver
kernel: [ 1752.243219] CPU0: Core temperature above threshold, cpu clock throttled (total events = 93581)
kernel: [ 1752.243609] CPU0: Core temperature/speed normal
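
For anyone who wants to check readings like these automatically, here is a minimal Python sketch that parses the coretemp lines from `sensors` output and compares each reading against the high and crit limits printed on the same line. The names `parse_core_temps` and `classify` are made up for illustration, and the regex assumes the exact line layout shown in the output above; other lm-sensors versions may print it differently.

```python
import re

# Matches coretemp lines in the layout shown above, e.g.
# "Core 0:      +100.0°C  (high = +78.0°C, crit = +100.0°C)  ALARM (CRIT)"
# Assumption: this layout is taken from the pasted output, not from a spec.
CORE_LINE = re.compile(
    r"^(Core \d+):\s+\+?(-?\d+(?:\.\d+)?)°C\s+"
    r"\(high = \+?(-?\d+(?:\.\d+)?)°C, crit = \+?(-?\d+(?:\.\d+)?)°C\)"
)

# The coretemp part of the LOAD reading from the post.
SAMPLE = """\
coretemp-isa-0000
Adapter: ISA adapter
Core 0:      +100.0°C  (high = +78.0°C, crit = +100.0°C)  ALARM (CRIT)

coretemp-isa-0001
Adapter: ISA adapter
Core 1:       +98.0°C  (high = +78.0°C, crit = +100.0°C)  ALARM (CRIT)
"""


def parse_core_temps(text):
    """Return a list of (name, temp, high, crit) tuples, temperatures in °C."""
    readings = []
    for line in text.splitlines():
        match = CORE_LINE.match(line.strip())
        if match:
            name = match.group(1)
            temp, high, crit = (float(match.group(i)) for i in (2, 3, 4))
            readings.append((name, temp, high, crit))
    return readings


def classify(temp, high, crit):
    """Label a reading relative to the limits reported by the sensor."""
    if temp >= crit:
        return "critical"
    if temp >= high:
        return "high"
    return "ok"


for name, temp, high, crit in parse_core_temps(SAMPLE):
    print(name, temp, classify(temp, high, crit))
```

With the load readings above, Core 0 is classified as critical and Core 1 as high. To use it on a live machine, the text could come from running `sensors` and capturing its output instead of the pasted sample.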
 

alanwest09872

Golden Member
Aug 12, 2007
1,100
0
0
If it serves your purpose, why replace it? I'm a newb, though. New CPUs are only as good as the software you want to use. If you use 3-year-old software then it should be adequate. But again, I am a newb.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Sounds like something didn't take the trip too well. I'd do a complete re-install of the CPU and heatsink.
 

dinkumthinkum

Senior member
Jul 3, 2008
203
0
0
I am examining and cleaning the CPU right now. No external signs of damage. The HSF seemed to be slightly loose when I first looked at it; that may be the culprit. I am about to reseat it with a fresh coat of thermal compound. But having hit temps of 112C, do you think that the CPU is likely to fail soon even if temps are restored to normal?
 

Zap

Elite Member
Oct 13, 1999
22,377
2
81
It will probably be fine. Is EIST enabled in BIOS as well as any protections?

FWIW I had an overclocked Pentium III that ran for a while with no heatsink on it whatsoever. The machine had gotten knocked over and the heatsink fell off the CPU - it was a socket 370 CPU in a slotket. I didn't know, and let it sit like that for a half hour. Came back to it and it was locked up. It was still fine after that. Intel CPUs are surprisingly robust with temperatures, especially at stock clock/voltages.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,327
10,035
126
The CPU is still likely to be fine. If it was shutting down due to temps, that means that the thermal throttle/shutdown feature is indeed working fine, which should save the CPU from damage.

I would (carefully) replace the heatsink, possibly with a new one if the push-pins are damaged, and put new thermal paste on.

Overall, though, I think it should be alright.
 

dinkumthinkum

Senior member
Jul 3, 2008
203
0
0
Well, the fun continues.

The heatsink fan isn't in the greatest shape, so I may have to replace it whatever may come. However, the temps are back down to normal: 45C idle, 65C load. The problem now is that suddenly, without any warning, the machine freezes after a few minutes of use.

I tried a number of things. After the first couple times, I rebooted into single user mode and tried to examine the temperature sensors and the system log. The temps were right -- 45C, and there were no errors in the system log. However the machine froze on me while I was reading the log. This is not heavy duty, it's just a text viewer in the system console.

I booted into BIOS and read the PC Health status screen: CPU Temp = 37C. I let it run a few minutes and sure enough: it froze in the BIOS.

At this point, I think I have to conclude that the CPU is damaged. It could also be the memory, though I am not sure why running the BIOS for a few minutes would instigate a memory problem. What do you think?
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
Yeah, I'm with Phynaz - take everything out and put it all back in again. And I'd run a memory check.
 

dinkumthinkum

Senior member
Jul 3, 2008
203
0
0
Status update: I went back to reseat everything and noticed that even after 5 minutes of being shut off, the chipset heatsink was pretty hot to the touch. Anyway, I made sure to re-assemble the full case so the airflow was properly restored. So far, it has not crashed. Tomorrow I will run some stress tests and memtest to be sure.

Thanks.
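
For a stress-test run like the one planned here, it can help to record temperatures and fan speeds as the test goes. A rough Python sketch, assuming the Linux hwmon sysfs interface, where `temp*_input` files hold millidegrees Celsius and `fan*_input` files hold RPM. The function name `read_hwmon` is hypothetical, and the exact directory layout depends on kernel and driver: some older kernels are reported to put the files under a `device/` subdirectory, so the sketch looks in both places as a guess.

```python
import glob
import os


def read_hwmon(root="/sys/class/hwmon"):
    """Return {path: value} for temperature (°C) and fan (RPM) inputs.

    Assumption: attributes live either directly in hwmonN/ or in
    hwmonN/device/, depending on the kernel and driver.
    """
    patterns = [
        os.path.join(root, "hwmon*", "*_input"),
        os.path.join(root, "hwmon*", "device", "*_input"),
    ]
    readings = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            name = os.path.basename(path)
            is_temp = name.startswith("temp")
            is_fan = name.startswith("fan")
            if not (is_temp or is_fan):
                continue  # skip voltages and other inputs
            try:
                with open(path) as handle:
                    raw = int(handle.read().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable or not a plain integer
            # Temperatures are in millidegrees C; fan speeds are already RPM.
            readings[path] = raw / 1000.0 if is_temp else float(raw)
    return readings


if __name__ == "__main__":
    # Take one snapshot; run this periodically (e.g. from a shell loop)
    # while the stress test is going to build up a log.
    for path, value in sorted(read_hwmon().items()):
        print(path, value)
```

A fan input that reads 0 RPM throughout the test would be worth a second look, although some fans are simply not wired to a tachometer header.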
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,712
142
106
I assume you are using Linux?
I had machine check exceptions on my server too, although it was really 100% stable.
For me, I just disabled the checks in the kernel.

Another option is updating to the latest kernel available for your distribution.

Things I'd do:
reseat heatsinks
clean out all dust with a compressor, PSU too
check for loose cables
run the machine for a short time with the case side off to ensure all fans are spinning
then run memtest86+ for an hour or so


EDIT: sorry I didn't read all the replies above, you can disregard most of this
 

dinkumthinkum

Senior member
Jul 3, 2008
203
0
0
I did recently update the kernel, 2.6.38-2-amd64 (Debian). However, since the freezes occurred in the BIOS, I assumed that it had nothing to do with the OS. The freezes did only occur with the case side off, so I started wondering if it had something to do with inadequate case cooling. I did notice that the Seasonic PSU fan was not spinning any time I checked. I presumed that was because the PSU was doing fine, but maybe that is also a sign of something.