Any chance of recovering this raid array?

Red Squirrel

No Lifer
We had a long power outage; the UPS shut everything down properly, but the outage was so long that the drives all had a chance to completely cool down before they were powered up again. Upon powering up everything looked OK, but while the VMs were loading, 2 of the 5 drives dropped out of my main mdadm array. Everything started going insane, spamming dmesg errors and such, but I could still access some of the data on the array; it was hit and miss. Maybe it was cached, I don't know. So I rebooted and hoped for the best. All the drives are up, but the raid does not start; it comes up with two drives missing. I can add them back, and it actually says it's clean but not started. Ironically, there is no start command. Normally an array just starts automatically.

Is there any way to actually recover this, or am I screwed? I do have backups, but it would be such a royal pain having to sift through all that, do a restore, reset all the permissions so stuff works, etc. Just the thought of it is brutal.

Before I give up, just wondering if there's anything I can maybe try. Though even if I get it going I don't know how long it's really going to last.

Never buy Hitachi drives. They are the biggest pieces of crap ever. Ironically, I have 5 incoming in the mail that I RMAed not too long ago. Now 2 more dropped at once. Pure crap.

Guess it's time to go shopping and order 5 new drives. Maybe WD blacks.
 
Playing around with adding/removing drives (through mdadm), I managed to get it started again. It turns out there is a --run command to start it. It's still odd that it won't start at system startup and that the drives get pushed out, but now it's mounted.
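For anyone landing here in the same state, the sequence that got mine going looked roughly like this. This is a sketch, not a recipe: /dev/md0 and /dev/sd[b-f]1 are placeholder names standing in for the array and its member partitions, so substitute your own.

```shell
# Sketch: re-adding dropped members and starting an assembled-but-stopped
# mdadm array. Device names are placeholders for my setup.

# See what state the array thinks it is in
mdadm --detail /dev/md0

# Re-add the members that dropped out (sdc1/sde1 here as examples)
mdadm /dev/md0 --re-add /dev/sdc1
mdadm /dev/md0 --re-add /dev/sde1

# An array reported as assembled but not started can be kicked off with:
mdadm --run /dev/md0

# Or, from a full stop, assemble and start in one shot:
mdadm --stop /dev/md0
mdadm --assemble --run /dev/md0 /dev/sd[b-f]1
```

If the member event counters have drifted too far apart, --assemble may additionally need --force, which is worth reading up on before using it against a degraded array.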

mdadm's solidness never ceases to amaze me. My priority now is updating all my backups (thankfully the most important one was only 2 days old anyway) and then deciding on what new drives to order.

I'm also starting to wonder if it's the SATA cards, power, or the bays causing the drives to fail. I've had so many drives fail in this server it's getting scary.
 
Or it could be the fact that you bought 5 drives at the same time, which then began to fail in close proximity after the same time in service. Unless you have a different drive in the system with a different usage pattern, it's a sound theory.

Replace all of them.
 
I'm suspecting the backplane, cable, or controller, as I moved one of the drives that dropped out to another slot and it's working fine now.
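Before swapping hardware, one quick sanity check on a dropped drive is its own SMART log. This assumes smartmontools is installed, and the device name is a placeholder:

```shell
# Read the drive's SMART attributes and error log; watch for
# Reallocated_Sector_Ct and pending/uncorrectable sector counts
smartctl -a /dev/sdc

# Run a short self-test, then read the result a few minutes later
smartctl -t short /dev/sdc
smartctl -l selftest /dev/sdc
```

A drive that works fine in another slot and has logged no errors of its own points more toward the backplane/cable/controller side than the disk itself.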

Ordered all new drives and 2 new backplanes to be on the safe side. I may also order a new controller, but I'll start with those. I don't have time to start dealing with micro-level troubleshooting, so I'm just going to do one big swoop and hope for the best.

I rebuilt the array but the VMs won't load. Decided to try doing a fsck, but not much luck. Anyone know how I can get it to actually output what it's doing? It's been running for an hour and this is all I see:

Code:
[root@borg ~]# fsck -p -c -f -V /dev/md0 
fsck 1.41.3 (12-Oct-2008)
[/sbin/fsck.ext3 (1) -- /raid1] fsck.ext3 -p -c -f /dev/md0

I tried -v (lowercase v) and was getting no output at all except for the version. According to top there's stuff going on, but no idea what.
 
I was not thinking: it does not make sense to do a bad-block check on a raid array, so I think it must be a bug where it just locks up or something. I removed -c and now I'm getting output. Tons of illegal block errors that are being fixed. Maybe this will fix the VMs, though I have a feeling I'll have to restore those from backup, as they probably got corrupted beyond repair.
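As a side note, the verbose flags never report progress; e2fsck has a separate -C option that prints a completion bar. A sketch, using the same /dev/md0 as above:

```shell
# -C 0 writes a progress bar to stdout; -f forces the check even if
# the filesystem is marked clean. Run only on an unmounted filesystem.
fsck.ext3 -f -C 0 /dev/md0
```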

Gotta love Linux. Windows would probably just be BSODing over all this. 😛
 
Well, saw that you fixed it, but thought I would post this anyway.

There is a phrase, "baking your drive," which exists because that is what you had to do back in the day when a disk in a server that had been shut down for a while would no longer spin up when powered on. You physically took the hard drive(s), put them in an oven at about 250 degrees, and baked them for a few minutes. You then took the drives out of the oven, quickly installed them in the computer again, and powered it up.

I remember hitting (tapping) a drive, almost always SCSI, as the server started. There were lots of prayers on some of those days.
 
After changing one backplane and removing the other (still waiting for the second one to come in, as there are two), the issue is still present. I finished changing all the hard drives yesterday, so now I'll see if maybe it was the drives. This is really starting to cost a lot.

If it still happens, then it's probably the controller, which means I'll need a new motherboard, as the decent controllers are PCIe x8 or wider. I only have two x1 slots in there; the rest is PCI, and there's an x16 that can't be used as it's strictly for a video card (it's shared with the built-in card). So a new motherboard also means a new CPU (sockets change all the time) and new RAM (this board uses DDR while DDR3 is what's out there now). So yeah, I'm looking at about $1k just so I can get a controller. Hopefully the issue goes away.

Since I changed the backplane the issue is slightly different now though. Instead of drives dropping out of the array, I get errors like this:

Code:
INFO: task pdflush:14763 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pdflush       D ffff8801dac8fa80     0 14763      2
 ffff8800048c9da0 0000000000000046 ffff8800048c9d00 ffffffff8101686f
 ffffffff8162a500 ffffffff8162a500 ffff88020f12adc0 ffff8801ff4496e0
 ffff88020f12b108 000000020dc05dc6 ffff880028062570 ffff88020f12b108
Call Trace:
 [<ffffffff8101686f>] ? read_tsc+0xe/0x24
 [<ffffffff81033569>] ? __dequeue_entity+0x61/0x6a
 [<ffffffff8100e80e>] ? __switch_to+0x1b0/0x3e0
 [<ffffffff812c1f41>] __down_read+0xa3/0xbd
 [<ffffffff812c1250>] down_read+0x2a/0x2e
 [<ffffffff810c04ff>] sync_supers+0x4a/0xc4
 [<ffffffff810946e0>] wb_kupdate+0x35/0x119
 [<ffffffff81095163>] pdflush+0x16e/0x231
 [<ffffffff810946ab>] ? wb_kupdate+0x0/0x119
 [<ffffffff81094ff5>] ? pdflush+0x0/0x231
 [<ffffffff81094ff5>] ? pdflush+0x0/0x231
 [<ffffffff810534bf>] kthread+0x49/0x76
 [<ffffffff81011719>] child_rip+0xa/0x11
 [<ffffffff81010a37>] ? restore_args+0x0/0x30
 [<ffffffff81053476>] ? kthread+0x0/0x76
 [<ffffffff8101170f>] ? child_rip+0x0/0x11

It's not always pdflush; the blocked task is random. Either way, this causes all the Linux VMs to crash. For some reason the Windows ones don't crash. Don't really understand why.
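For what it's worth, that 120-second message is just the kernel's hung-task watchdog firing; the actual stall is in the I/O path underneath it. The knob the trace mentions can be inspected, and per-device latency watched, roughly like this (a sketch; iostat comes from the sysstat package):

```shell
# Threshold before the watchdog complains (120 seconds by default)
cat /proc/sys/kernel/hung_task_timeout_secs

# 0 disables the warning, as the trace itself says -- the underlying
# I/O stall is still there, you just stop being told about it
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Meanwhile, watch extended per-device stats to spot the disk that stalls
iostat -x 5
```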
 