- Feb 24, 2008
So I've got this very strange issue that has cropped up recently. When I leave my computer idling for 20-30 minutes or longer, it hard locks (reset button is the only solution). Additionally, if I try to start defragmenting my drive (a tip I found while searching for a solution), it triggers a hang almost immediately.
This is on a recent computer I built, sometime in June. The hardware is:
- Gigabyte GA-EP35-DS3R mainboard
- Intel Xeon E3110 (E8400 alternative)
- 4x 1GB OCZ DDR2-800 memory modules
- 2x Western Digital Caviar SE16 500GB drives
- Leadtek GeForce 7900 GS
- 550W Antec power supply
This problem needs a little back-story first. Once built, I installed Windows XP SP2 on the box, and put the 2 WD drives in a RAID1 configuration on the Intel ICH9R controller. Also at this time, I was using only 2GB of a different brand of memory. Things worked just fine for a few days, and then the RAID array was marked with a VERIFY flag. I didn't notice this immediately, but within a week my machine started rebooting itself at random times. I installed the Intel Matrix Storage Manager software and it detected inconsistencies in the array, but it could not complete a scan without the computer rebooting. I believe these reboots were related to the RAID array, because the storage manager icon quickly flashed a failure bubble in the corner before the machine restarted. Eventually I could no longer start the machine and make it past checkdisk. Breaking the array and running either of the two drives independently worked, and so my solution was to just run with 1 drive. I ran like this for perhaps a week or two before the OS just failed to boot entirely, and at this point the ICH9R refused to recognize the second (previously disconnected) drive as a bootable device.
At this point I gave up. I got a third hard drive, and was able to copy my data from the drive I had been running alone, so it wasn't a case of a failed drive. Although the memory checked fine, I swapped it out for new pairs, recreated the array on my two original drives, reloaded Windows, and copied my data back over, this time using the latest storage drivers from Intel. This has worked just grand for the past couple of months, so I assumed I just had bad RAID drivers. It wouldn't be the first time. But jump to yesterday: my machine is hanging at random times when it's idling, or when I try to defrag my drives. Now, I don't know what is triggering this problem, but I do know that Windows is hanging because the disk IO subsystem stops responding. At the point that the machine locks up, all hard drive activity ceases (lights stop flashing, drives stop clacking), and programs continue to run until they need to access the disk or swap, at which point the request is never fulfilled and so they stop responding. It only takes a few seconds until all I can do is move my mouse around (if I'm lucky), but other than that the reset button is the only way out.
So far, Intel Matrix Storage Manager claims that my mirror is fully intact, and the repeated abrupt restarts have not seemed to harm my system. I can find no evidence of problems or errors in the system logs (but that is not surprising, since it could not write to the logs if the disks had stopped responding!). I haven't installed any new software in over a week. Knowing that a defrag is a surefire way to kill the system, I've tried running a defrag in Safe Mode with nothing else running, and the behavior is still the same. Within a minute of the defrag starting, all disk IO activity ceases. I have tried observing this with the Performance Monitor while performing a defrag. Here is a snapshot I took with my camera of the graph after the disks stopped responding. The green is "Avg. Disk Bytes/Transfer" and the red is "Current Disk Queue Length". You can see that all transfer just ceases and IO requests build up.
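For anyone who would rather check for this pattern in logged counter values than squint at a photo of a graph, here is a minimal Python sketch of what I mean by "transfer ceases and requests build up". It is only an illustration: the `Sample` type, the `find_stall` helper, and the idea of feeding it values exported from a counter log are my own assumptions, and the numbers in the usage example are made up, not readings from my machine.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Sample:
    """One logged reading of the two counters (hypothetical structure)."""
    t: float                   # seconds since logging started
    bytes_per_transfer: float  # "Avg. Disk Bytes/Transfer"
    queue_length: int          # "Current Disk Queue Length"


def find_stall(samples: Iterable[Sample], min_samples: int = 3) -> Optional[float]:
    """Return the time at which the first stall begins, or None.

    A stall is a run of at least `min_samples` consecutive readings in
    which nothing is transferred while requests are still queued.
    """
    run_start = None
    run_len = 0
    for s in samples:
        stalled = s.bytes_per_transfer == 0 and s.queue_length > 0
        if stalled:
            if run_len == 0:
                run_start = s.t  # remember where this run began
            run_len += 1
            if run_len >= min_samples:
                return run_start
        else:
            run_len = 0  # activity resumed, so start counting again
    return None


if __name__ == "__main__":
    # Made-up readings: normal IO for two seconds, then a stall.
    log = [
        Sample(0, 65536, 1),
        Sample(1, 65536, 2),
        Sample(2, 0, 5),
        Sample(3, 0, 9),
        Sample(4, 0, 14),
    ]
    print(find_stall(log))
```

Requiring several consecutive stalled readings is there so that a single idle sample with a request briefly in flight is not mistaken for a hang.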
Kudos if you made it this far into my post. My question now is: what is the most likely source of failure here? Is it a bad ICH / RAID controller (most sources say the Intel controller is quite good as integrated/software RAID solutions go)? Is it a bad drive, even though the array manages to stay intact and early testing showed the drives could run independently of each other? Is it something more subtle in my mainboard or CPU? Has something randomly gone awry in Windows, or could there be software or a service capable of just killing the disk system? I'm ready to throw something out a window here, but I don't want to blindly sink money into a system when I just can't be sure of where the problem lies.
Thanks for taking the time to read this, and thanks to anyone who can lend me some insight into this problem!