Lightning + Hard drive + RAID = ?????

UnbiosedOpinion

Junior Member
Sep 19, 2005
14
0
61
Maybe someone can help me with this stumper. Can anyone tell me how this happened?

An Asus P5E Deluxe system running fine for 3 months, with two Western Digital Black SATA hard drives mirrored with RAID-1 on the Intel ICH9R controller.
Evidently lightning caused a surge through the phone line, fried the fax modem, motherboard, and wall AC outlet, and tripped the GFI circuit.
I'm 90% sure the system was not running at the time; it was probably in hibernation.
After replacing the motherboard with the same model and same BIOS level, and with hard drives disconnected, Memtest ran fine.

Because I was nervous about the hard drives, I reconnected them to the SATA RAID ports and immediately did a full image backup with Acronis True Image 11, using the Acronis boot disk. All partitions were visible, and the backup appeared to run and verify successfully.

So I figured I could finally boot the system (Vista 32-bit Ultimate OEM), and deal with the re-activation issues due to the new motherboard.
Instead I got the "BootMgr is missing" message. Couldn't boot the Diagnostic F8 utilities either.
So I booted the Vista DVD Repair utilities to fix boot errors. But this utility went nowhere; couldn't even find an installed Windows system.

Sensing disaster, I went to Command Prompt to try to back up user files to a USB drive. Although some data was recoverable, many of the copy attempts failed with various index/not found errors.

Next I ran chkdsk (in read-only mode), which found thousands of errors, mainly on the C drive: Orphaned files, bad index entries, bad file attributes, you name it.
The other volumes on the array had less extensive corruption, but here's a key point: Even the Vista Recovery partition showed chkdsk errors, and it's hidden from Vista!

Next I reconfigured the drives as IDE so I could test them individually with Western Digital Diagnostics.
But both drives tested fine, including the full surface scans.

So I ran chkdsk, again read-only, this time on each disk individually. The same errors appeared to be replicated across both drives.

I decided to try chkdsk repair on one of the drives, thinking I might at least get a bootable system and maybe recover more files.
Chkdsk did "fix" thousands of items, but also deleted a lot of orphans, including most of the user files.
Still no bootable system; various system files and drivers are "missing or corrupt".

So this is where I sit. I'm going to have to restore an earlier full-system backup. But I'm more interested in what the heck happened. Here are my main questions:

1) Can a power surge corrupt a hard drive that's not powered on?
2) I can see that a powered-off drive might get corrupted by a strong magnetic field generated by lightning, but how could the same damage be replicated across both drives?
With ICH9R/Intel Matrix, I thought the mirroring happened within Windows by the driver. If the damage was caused by lightning, the replication would have had to happen prior to booting Windows, maybe while the stand-alone Acronis backup was running. Is that possible?
3) Even if the PC was running at the time of the electrical hit, why would the "hidden" Recovery partition get corrupted?
4) Is it possible the corruption was happening before the lightning strike? Although my web searches haven't found any specific problems like mine, there is conflicting info on whether the ICH9R firmware needs to match the Intel Matrix software level. Intel says it can be different, which was the case on this machine.

Anything I'm not understanding? All wisdom is appreciated.
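
One way to check the "same errors on both drives" observation more directly than comparing chkdsk output is to image each disk and compare the images block by block. The following is only a minimal Python sketch. The function name `differing_blocks` and the 1 MiB block size are illustrative choices, not part of any tool mentioned in this thread, and it assumes each drive has already been imaged to a file by some read-only means.

```python
def differing_blocks(path_a, path_b, block_size=1024 * 1024):
    """Return the indices of the blocks that differ between two image files."""
    diffs = []
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        index = 0
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            # Stop only when both files are exhausted, so a length
            # mismatch shows up as differing trailing blocks.
            if not block_a and not block_b:
                break
            if block_a != block_b:
                diffs.append(index)
            index += 1
    return diffs
```

If the two members of a mirror are in sync, almost every block should match. Any per-disk RAID metadata could legitimately differ, so a few differing blocks would not by themselves prove a problem.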

 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
1) Yes. Why would being powered on have anything to do with it? Only UNPLUGGED systems are safe (BTW, why did you not have a surge protector on your PC?)... However, I do not think this is what you have here.
2) With ICH9R, replication happens within the motherboard controller using the CPU and RAM from a BIOS applet, not within Windows. Did you replace the board with the EXACT SAME MODEL? And did you then go through a controller-specific detailed array recovery process (aka "how do I recover a RAID1 array on an ICH9R")?
3) Hidden means the OS does not access it... what does that have to do with A LIGHTNING STRIKE DAMAGING THE PHYSICAL COMPONENTS! Really, this isn't magic here... you have a physical drive containing physical data. Lightning fries it... setting a partition to be "hidden" does not magically make it immune to lightning.
4) Not really...

Was your "RAID1" a mirror or a stripe? Because I have heard before of people ruining their array by letting chkdsk scan it... it "fixes" the "corrupt file fragments" which are actually good files that it simply does not recognize because the RAID array is not configured correctly. AKA, you COULD potentially have recovered your data before... but now that you let chkdsk run in fix mode it went and corrupted every single file on the drive.

My GUESS as to what happened:
1. You had set up a RAID0 array, thinking it is a RAID1 array...
2. Lightning fried your mobo
3. You replaced mobo, did not jump through the hoops of recovering array properly on an ubercrappy mobo controller (never ever use those, really!)
4. You ran chkdsk which corrupted all your data by "Fixing" the "fragments"

Alternatively:
0. ICH9R is even crappier than I thought and does something with RAID1 that makes the data look corrupt when the drives are not in RAID mode, instead of leaving two separate non-RAID drives that are each readable on their own (normal RAID1 behavior for a broken array)
1. Lightning fried your mobo
2. You replaced mobo, did not jump through the hoops of recovering array properly on an ubercrappy mobo controller (never ever use those, really!)
3. You ran chkdsk which corrupted all your data by "Fixing" the "fragments"
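
The mirror-versus-stripe distinction behind both guesses can be illustrated with a toy model. This is only a sketch in Python: the helper names `raid0_split` and `raid1_split`, the 4-byte stripe size, and the sample data are made up for illustration and do not reflect how the ICH9R actually lays out data on disk.

```python
def raid0_split(data, stripe_size, disks=2):
    """Toy RAID0: deal fixed-size stripes out to the members in turn."""
    members = [bytearray() for _ in range(disks)]
    for i in range(0, len(data), stripe_size):
        members[(i // stripe_size) % disks] += data[i:i + stripe_size]
    return [bytes(m) for m in members]


def raid1_split(data, disks=2):
    """Toy RAID1: every member holds a complete copy of the volume."""
    return [bytes(data) for _ in range(disks)]


volume = b"AAAABBBBCCCCDDDD"          # stand-in for a filesystem
stripe = raid0_split(volume, 4)       # each member has only half the stripes
mirror = raid1_split(volume)          # each member equals the whole volume
```

In this model a single RAID0 member holds `AAAACCCC`, which anything reading it as a standalone disk would see as a damaged filesystem, while a single RAID1 member is byte-for-byte the original volume.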
 

UnbiosedOpinion

Taltamir, thank you for your input.

I agree that only unplugged systems are safe. It just seemed odd that the platters of the drive could get corrupted even when its heads are parked. Especially since the drives themselves were not "fried by lightning"--they test fine now with WD's diagnostics.
But I guess lightning's strong magnetic field could corrupt the platters.

Also, the owner of the PC (who is not me) was using a surge protector, but neglected to get one with phone line protection.

Yes, the replacement mobo was the exact same model, hardware rev, and BIOS. There was no RAID recovery to do because 1) The RAID's pre-boot utility showed no array errors, everything looked perfect; and 2) The only other recovery I know of is in the Windows Matrix software, which was unbootable.

It was definitely a RAID1 mirrored array, so your first guess as to what happened doesn't apply.

The two insights you supplied that were most valuable to me were that:

1) Chkdsk can corrupt a RAID disk. I had read this elsewhere, and can see it being true for RAID 0, but I'm not sure about Intel's RAID 1. In any case, chkdsk was a last resort, and I have the Acronis full array image backup, which I took before doing anything that would update the disks.

2) If the mirroring/replication is done at the BIOS/hardware level without needing Windows, that explains how the damage from one drive was duplicated onto the other, even with the system never getting into Windows. (Which is surprising, since some people call ICH9R just a "software controller.")

Can anyone confirm or trash these last two points?
 

VirtualLarry

No Lifer
Aug 25, 2001
56,570
10,202
126
I suspect that you had corruption going on before the lightning strike. That's the only explanation that I can see.

If the lightning strike had corrupted the HD's platters, then it would have interfered with the factory low-level format too, and you would be getting "Sector not found" errors up the wazoo, as the disk could no longer seek to various sectors, because the hidden address ID bits would have been scrambled or wiped.

But as you say, the WD Data Lifeguard diagnostics didn't show anything. So it cannot be that.

I do wonder if perhaps there were some RAID0/RAID1 issues going on here. Intel's Matrix RAID allows you to create both RAID0 and RAID1 volumes on a single pair of disks. Chkdsk "repairing" thousands of files on an individual disk out of a RAID pair does sound suspicious.

 

MerlinRML

Senior member
Sep 9, 2005
207
0
71
Originally posted by: UnbiosedOpinion
1) Chkdsk can corrupt a RAID disk. I had read this elsewhere, and can see it being true for RAID 0, but I'm not sure about Intel's RAID 1. In any case, chkdsk was a last resort, and I have the Acronis full array image backup, which I took before doing anything that would update the disks.

2) If the mirroring/replication is done at the BIOS/hardware level without needing Windows, that explains how the damage from one drive was duplicated onto the other, even with the system never getting into Windows. (Which is surprising, since some people call ICH9R just a "software controller.")

Can anyone confirm or trash these last two points?

1) Chkdsk makes changes at the filesystem level. It should not affect the RAID metadata itself, but if you run the controller in a mode other than RAID, then the filesystem is in a different layout and state than expected and chkdsk will trash both RAID metadata and the data on the filesystem. There's a lot of misunderstanding about this point, and I wouldn't be surprised if someone comes in to tell me how wrong I am and how chkdsk corrupted their RAID array.

2) There is absolutely no mirroring/replication done at the BIOS level on the Intel Matrix RAID. The Intel Matrix RAID BIOS only updates or writes new RAID metadata to the disk. It does not actually do any data movement. That is completely handled by the driver, and is why the ICH9R is called a software controller.

Is it possible that the computer was not in hibernation for the lightning strike? You say you're pretty sure, but it sure sounds like a power surge caused the hard drives to go bonkers. Other than that, I'm in agreement with Larry that there may have been some problems beforehand.
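
To illustrate why mirroring done in the driver would put identical corruption on both drives, here is a toy model in Python. It is a sketch only: the `MirroredVolume` class and its methods are hypothetical and are not Intel's driver or its API. The assumption is simply that whatever layer does the mirroring sits below the filesystem and copies every write to each member without judging its content.

```python
class MirroredVolume:
    """Toy mirror: every write is applied to all members, good or bad."""

    def __init__(self, size, members=2):
        self.members = [bytearray(size) for _ in range(members)]

    def write(self, offset, data):
        # The mirror layer cannot tell valid filesystem data from garbage;
        # it just duplicates the bytes it is handed.
        for disk in self.members:
            disk[offset:offset + len(data)] = data

    def read(self, offset, length, member=0):
        return bytes(self.members[member][offset:offset + length])


vol = MirroredVolume(16)
vol.write(0, b"good index")
vol.write(5, b"\x00\x00\x00")   # a bad write arriving from above the mirror
```

After the second write both members hold the same damaged bytes, so in this model a mirror only protects against one drive failing, not against bad data being written in the first place.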
 

UnbiosedOpinion

Well, I'm glad you guys seem to agree there's no slam-dunk explanation as to what happened.

And thanks for the new information. Makes a lot of sense that platter damage from lightning on a powered-off drive should have zapped the low-level formatting and triggered WD diagnostic errors. (Which was not the case.)

And I put a lot of faith in Merlin's statements about ICH9R data mirroring happening at the OS driver level.

So, I think you guys are right suggesting that either the PC was running when the lightning hit, or the corruption was happening even earlier.

The owner was "quite sure" the PC was off during the storm, but he wasn't home at the time. Maybe it was just asleep. Maybe a phone call woke it up via the modem. It would explain a lot.

As for prior corruption, after pressing the owner a bit, he admitted needing to do hard shutdowns (via 5-second power button) when an app locked up a few times. I'm still not convinced that this could damage the hidden Vista recovery partition. But in the future I guess I need to shut off the write buffering to be safer. I wish Vista wasn't constantly doing disk I/O for who-knows-what reasons.
 

taltamir

Originally posted by: MerlinRML
2) There is absolutely no mirroring/replication done at the BIOS level on the Intel Matrix RAID. The Intel Matrix RAID BIOS only updates or writes new RAID metadata to the disk. It does not actually do any data movement. That is completely handled by the driver, and is why the ICH9R is called a software controller.

If that were the case, it would not be possible to boot Windows from such a drive.

Originally posted by: MerlinRML
1) Chkdsk makes changes at the filesystem level. It should not affect the RAID metadata itself, but if you run the controller in a mode other than RAID, then the filesystem is in a different layout and state than expected and chkdsk will trash both RAID metadata and the data on the filesystem. There's a lot of misunderstanding about this point, and I wouldn't be surprised if someone comes in to tell me how wrong I am and how chkdsk corrupted their RAID array.

How is that NOT chkdsk corrupting your data?
 

UnbiosedOpinion

I think I can help with this red state/blue state controversy over where the data replication for ICH9R mirroring takes place.

This is quoted from the "Intel I/O Controller Hub 9 (ICH9) Family Datasheet" at
http://www.intel.com/Assets/PDF/datasheet/316972.pdf (6.6MB)

5.16.6.1 Intel® Matrix Storage Manager RAID Option ROM
The Intel Matrix Storage Manager RAID Option ROM is a standard PnP Option ROM that
is easily integrated into any System BIOS. When in place, it provides the following
three primary functions:
• Provides a text mode user interface that allows the user to manage the RAID
configuration on the system in a pre-operating system environment. Its feature set
is kept simple to keep size to a minimum, but allows the user to create & delete
RAID volumes and select recovery options when problems occur.
• Provides boot support when using a RAID volume as a boot disk. It does this by
providing Int13 services when a RAID volume needs to be accessed by DOS
applications (such as NTLDR) and by exporting the RAID volumes to the System
BIOS for selection in the boot order.
• At each boot up, provides the user with a status of the RAID volumes and the
option to enter the user interface by pressing CTRL-I.

So the second bullet would seem to handle the boot problem pointed out by Taltamir.
 

MerlinRML

Originally posted by: UnbiosedOpinion
I think I can help with this red state/blue state controversy over where the data replication for ICH9R mirroring takes place.

This is quoted from the "Intel I/O Controller Hub 9 (ICH9) Family Datasheet" at
http://www.intel.com/Assets/PDF/datasheet/316972.pdf (6.6MB)

• Provides boot support when using a RAID volume as a boot disk. It does this by
providing Int13 services when a RAID volume needs to be accessed by DOS
applications (such as NTLDR) and by exporting the RAID volumes to the System
BIOS for selection in the boot order.

So the second bullet would seem to handle the boot problem pointed out by Taltamir.

Oh sure, bring in "proof". Where's the fun in that?!? :)

Taltamir is correct in that chkdsk is corrupting data. However, I was trying to point out that the reason it's misbehaving is a configuration problem, and chkdsk is just doing what it is supposed to do. It sure would be nice if chkdsk were smarter in those situations and knew when it shouldn't run. Unfortunately, it falls on the user to know when not to run chkdsk.