Anyone deal with troubleshooting bad drives in a RAID set?

Discussion in 'Memory and Storage' started by sornywrx, Jan 4, 2013.

  1. sornywrx

    sornywrx Member

    Joined:
    Jun 16, 2010
    Messages:
    171
    Likes Received:
    0
    I have a friend who has a Dell Precision T7500. It has a PERC 6i RAID card with two Seagate 1.5TB SATA drives (plain cheap SATA, not SAS drives) in a RAID 1 config. He said his PC was running really slow and some programs would crash in Windows. After checking it out for a while I thought it felt like a failing HD, as I've seen enough of them to get a feeling from the way the machine acts. I checked the event log and saw that numerous programs were crashing and services were timing out and closing. The PC would actually boot up, but it'd run fine for 30 seconds or so and then Explorer or whatever program was open would stop responding for a minute or so.

    Anyways I'm thinking bad HD and didn't even realize it was a RAID array at first. So I go into the PERC 6i's config and it shows SMART status as OK with no errors (both drives). So I kept working on the PC for another two hours trying to find something else wrong like bad memory or problems with Windows itself but it still all points to a bad HD. I've replaced a LOT of hard drives, even drives in a RAID 1 array, but it's usually cut and dried, easy to diagnose, and has SMART errors. I'm not familiar with the PERC 6i... can I disconnect the drives from the controller, hook them up to my PC one at a time and run diags? So far every (free) diagnostic program I've run doesn't support RAID so it shows the logical drive, not the individual ones.

    I considered taking drive 0 offline, or just unhooking it, and letting it boot up on the second drive to see if there's a difference. Should be pretty easy to tell as even the Windows 7 "Welcome" screen takes 2 minutes to load and this is on a dual Xeon with 12GB RAM. Of course any filesystem corruption would also be on the mirrored drive, but if it was "freezing" because it was hitting bad sectors on drive 0 I would think this would fix that. And if I pull drive 0 offline on this PERC card, will hooking it back up be a problem? I haven't messed with PERC cards enough to remember; it's been a year or so since I've worked with one that had failing drives... or at least drives that had data that I would be worried about losing.
     
  3. iluvdeal

    iluvdeal Golden Member

    Joined:
    Nov 22, 1999
    Messages:
    1,975
    Likes Received:
    0
    I've done that before: unhook both drives from the RAID controller, connect them to onboard SATA, then run a bootable diagnostic CD from the drive manufacturer to check them for defects. You definitely can't go just by SMART status.
     
  4. sornywrx

    sornywrx Member

    Thanks! I was thinking that's how it would work but I've worked with a couple of generic (like $20) cheapo Chinese RAID cards that had cryptic menus. I have accidentally erased a logical drive when just trying to add the old drive back, or to add a new drive, after breaking a mirror. I'm actually hoping one bad drive is the problem because I don't want to start troubleshooting bad controllers or helping format/reinstall. Oh and yeah, I don't trust SMART much at all. Was hoping it'd be flagged to confirm what I was thinking but oh well. I had about 5 min to boot a Linux rescue CD I happened to have, and I opened and closed a bunch of programs without any problems. In Windows it would run a minute or two at a time, max, before Explorer or something else crashed, like it was having access problems on the hard drive. Even when I booted a Windows 7 Recovery CD it sat for 30 seconds or so after the part where you select the drive and click Next to get to the main menu. And then when running chkdsk at the command prompt it paused for about 2 minutes after I typed the chkdsk command and got the first line of the program's output.
     
    #3 sornywrx, Jan 4, 2013
    Last edited: Jan 4, 2013
  5. mikeford

    mikeford Diamond Member

    Joined:
    Jan 27, 2001
    Messages:
    5,099
    Likes Received:
    0
    I've got a post about 10 below this one. I'm running what I think is RAID 1, a simple mirror of two 2TB Seagate drives, and a year ago I had one drive drop from the array: no error, just out of the array, even though I could still read files on it. The Windows version of SeaTools could not see the serial number and failed it on the long test, so I left it in the system until I had more time to open the case all the way up and remove it, and just added a third drive and assigned it to the array, which I recall rebuilt automatically.

    I have no SATA ports except the motherboard's, which all run through the SB710 southbridge chip's RAID controller.

    On New Year's Day the system beeped and got flakier over a couple of hours and half a dozen reboots until it wouldn't boot to Windows at all; it saw no boot manager.

    Maybe we both have bad RAID controllers?
     
  6. rsutoratosu

    rsutoratosu Platinum Member

    Joined:
    Feb 18, 2011
    Messages:
    2,545
    Likes Received:
    1
    I would check Dell's site; there were some earlier issues on the PERC 5/6 that required a mandatory firmware update.

    If I were you, I would take both drives out, put in a fresh drive, install Windows and see if it's still acting up. Worst case, rebuild Windows and transfer the data back.
     
  7. sornywrx

    sornywrx Member

    The thought of a bad RAID controller crossed my mind when I saw this in the Event Log:

    The driver detected a controller error on \Device\Harddisk1\DR1

    It's listed at least 10-15x a day for the past week. Odd thing is that it changes from DR1 through DR4. Not sure about the DR part. I believe there are four partitions on the logical drive. But then I've seen controller errors in the Event Log where the cause wasn't the controller but just the hard drive, so I didn't think too much about it.
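
    A minimal sketch for tallying those entries per device path, which makes it easier to see whether one DRn dominates. It assumes the System log has been exported to plain text with one message per line; the export format, the function name, and the sample lines are all illustrative assumptions, not output from this machine.

```python
import re
from collections import Counter

# Matches the message quoted above and captures the device path.
PATTERN = re.compile(r"controller error on (\\Device\\Harddisk\d+\\DR\d+)")


def count_controller_errors(lines):
    """Count controller-error messages per device path."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, made up for illustration only.
sample = [
    r"The driver detected a controller error on \Device\Harddisk1\DR1",
    r"The driver detected a controller error on \Device\Harddisk1\DR3",
    r"The driver detected a controller error on \Device\Harddisk1\DR1",
    "Some unrelated event",
]

for device, count in count_controller_errors(sample).most_common():
    print(device, count)
```

    The counts alone don't say whether the controller or a disk is at fault; they only show how the errors are spread across the device names.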

    That's a good idea, I hadn't even thought of that. I could even be doing that while the drives are being tested. I was going to move them to a completely different PC to check them out so I wouldn't have to worry about the very low chance of a bad motherboard or memory (which I've tested) affecting the tests. I would think that Windows 7 setup would do the same thing, since the Windows 7 Recovery program was freezing for 30 seconds or more at a time like Windows was, so I might see a problem right off the bat before Windows even gets installed.

    If I do find a bad HD should I put them back in the computer and then add a third as a hot spare, then remove the failed disk? Is this required to keep the mirror intact? I read some of the PERC 6i manual on Dell's site and they say this regarding replacing a failing hard drive:

    If the disk is part of a redundant virtual disk:


    1. Select the redundant virtual disk that includes the physical disk that is receiving SMART alerts and perform the Check Consistency task. See "Check Consistency" for more information.
       CAUTION: To avoid potential data loss, you should perform a check consistency before removing a physical disk that is receiving SMART alerts. The check consistency verifies that all data is accessible within the redundant virtual disk and uses the redundancy to repair any bad blocks that may be present. In some circumstances, failure to perform a check consistency can result in data loss. This may occur, for example, if the physical disk receiving SMART alerts has bad disk blocks and you do not perform a check consistency before removing the disk.
    2. Select the disk that is receiving SMART alerts and execute the Offline task.
    3. Manually remove the disk.
    4. Insert a new disk. Make sure that the new disk is the same size or larger as the disk you are replacing. (On some controllers, you may not be able to use the additional disk space if you insert a larger disk. See "Virtual Disk Considerations for PERC 5/E, PERC 5/i, PERC 6/E, and PERC 6/I Controllers" for more information.) After you complete this procedure, a rebuild is automatically initiated because the virtual disk is redundant.


    Sounds simple enough but I wouldn't have thought to run the consistency check beforehand so I'm glad it mentioned that (if it's as important as it says).
     
  8. kleinkinstein

    kleinkinstein Senior member

    Joined:
    Aug 16, 2012
    Messages:
    823
    Likes Received:
    0
    Keep ReclaiMe close by. Very handy in uncertain times like these.
     
  9. sornywrx

    sornywrx Member

    I've never heard of this program and was just looking at their site, definitely something to have on hand, thank you!
     
  10. rsutoratosu

    rsutoratosu Platinum Member

    Now, no guarantees, but I have several Dell servers with PERC RAID. I was able to break mirrors and recreate them from existing RAID disks without using the clear or initialize function, so the data stays intact. I'll take a look when I get home at splitting the array and adding a disk back without destroying data.
     
  11. corkyg

    corkyg Elite Member, Super Moderator, Peripherals

    Joined:
    Mar 4, 2000
    Messages:
    26,390
    Likes Received:
    63
    I have not had to troubleshoot a bad drive in a RAID 1 array, but I have had to replace them with bigger ones. First, I clone the entire array to an external. Then I pull the old drives and replace them with the larger ones, and proceed to build the RAID 1 array. Now, there is nothing in that array.

    Final step is to clone the external backup drive to the array, and that is it. The array is 100% but with more capacity.

    You can also just use one of the old array drives. Each of the drives in a RAID 1 array is a clone of the other, and either can be used to reconstitute a new array. They can also be used singly as lettered drives.
     
  12. imagoon

    imagoon Diamond Member

    Joined:
    Feb 19, 2003
    Messages:
    5,199
    Likes Received:
    0
    Since it is Dell... Install OpenManage on the T7500 and look at the PERC logs. The PERC card will tell you if the drives are bad or failing. If you don't want to use OpenManage, go to LSI's website and download MSM (MegaRAID Storage Manager). Dell uses LSI, and the LSI MSM will connect to and manage PERC cards. You can pull the controller logs from there to see if there is a problem, and pull and rebuild the array hot if you want to. I also recommend updating the PERC firmware to the most current version and upgrading the PERC drivers to match.

    Rebuilding a PERC is dirt simple. Remove the crapped drive (you can do it hot if you want to), replace the drive (again, hot if you want to), then in OpenManage or MSM click "rebuild array." It will tell you that the bad disk is missing and give you the option to replace it with the new disk. Also, if it was configured with autorebuild, it will rebuild without asking you anything at all.
     
  13. sornywrx

    sornywrx Member

    That's what I thought SHOULD happen, but that's an example of where I would've been wrong with one of those cheapo RAID cards.


    I had thought about imaging the drive before I go any further because I would feel a lot better working on it if I know there's an image somewhere to go from. I thought I had used Macrium Reflect Free on a RAID set before but then I was thinking I read on their website that you can't make images of RAID sets. I didn't understand if this was a basic rule of RAID or if it was just that they just didn't support it.

    I have had a mirrored set that I didn't set offline or anything and just powered down and put the drives into another PC and they showed up as two separate, regular drives. But I also had another issue with a cheap controller where the drives had some weird signature that wouldn't show the drives as single drives on another card/computer.

    I went into an LSI program that was installed already, but I hadn't seen it before so I wasn't too familiar with it. I remember that it asked for a username and password and was logging in by IP address. But when I got in there and looked around, although it wasn't very thorough, I didn't really see many details on drive health except something on the drive's summary screen that maybe had "Predicted failures: 0" and another listing about device errors listing 0 on both drives.

    So if this is pretty accurate, combined with SMART showing zero problems, I may be troubleshooting in the wrong direction on this even though I was sure it was a HD problem of some sort.
     
  14. imagoon

    imagoon Diamond Member

    You may want to install the full version. The one that came with the Dells was the stripped-down release. However, since it says predicted failures: 0, I suspect the controller thinks the array is fine. You can kick off a consistency check / scrub on the array from the controller and it will tell you if it has issues. Disk issues would be logged in the text log with messages like "failed to read block ##########", "rebuilt block #########", etc.
     
  15. sornywrx

    sornywrx Member

    I bet a consistency check on mirrored 1.5TB drives will take quite a while (like overnight), so I should've started one last night when I was there and it might've been done when I stop back by today. Oh well, definitely something to try. He's chomping at the bit to get it fixed and get back to work but I told him RAID troubleshooting can be slow.
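
    A rough back-of-the-envelope for that guess, as a sketch only. It assumes the check is limited by one full sequential read of each disk, that both disks are read in parallel, and that the drive sustains somewhere between 50 and 100 MB/s; those rates are assumptions, not measurements from these Seagates or this PERC.

```python
def full_read_hours(capacity_bytes, throughput_bytes_per_s):
    """Hours needed to read a whole disk once at a steady sequential rate."""
    return capacity_bytes / throughput_bytes_per_s / 3600


CAPACITY = 1.5e12  # 1.5 TB drive, in decimal bytes as marketed

# Assumed sustained rates; real numbers depend on the drive, controller, and load.
for mb_per_s in (50, 100):
    hours = full_read_hours(CAPACITY, mb_per_s * 1e6)
    print(f"{mb_per_s} MB/s -> about {hours:.1f} hours")
```

    Under those assumptions an undisturbed pass lands in the range of several hours, so overnight is a sensible allowance, and it will stretch if the machine is in use or a disk keeps retrying reads.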

    Should I skip the individual disk tests through Seatools and just run the consistency check?
     
  16. imagoon

    imagoon Diamond Member

    I personally would. The PERC check does the same check as the extended SeaTools test (one long read of the whole disk). If you run it from MSM, it will run in idle time while the machine is up, so he can keep using it during the check.
     
  17. sornywrx

    sornywrx Member

    Great thank you. I wasn't sure if the consistency check was good for checking for physical disk problems or just for verifying the data itself.
     
  18. imagoon

    imagoon Diamond Member

    The RAID controller will read each sector on both disks and then compare the values. If the values are mismatched or one of the sectors is unreadable, it reports an error. It can also cause the array to drop into degraded status, but that will normally be accompanied by a message about which disk failed and caused the degrade. I don't think it is quite as thorough as SeaTools, but it is a good start that can run while the machine is in use.
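
    The read-and-compare idea can be sketched as a toy model. This is not PERC firmware or any real controller API: the disks are plain Python lists, `None` stands in for an unreadable sector, and the function name and report strings are made up for illustration.

```python
from typing import List, Optional, Tuple

Block = Optional[bytes]  # None stands in for a sector that could not be read


def check_mirror(disk_a: List[Block], disk_b: List[Block]) -> List[Tuple[int, str]]:
    """Walk two mirrored disks block by block and list the problems found."""
    problems = []
    for index, (a, b) in enumerate(zip(disk_a, disk_b)):
        if a is None and b is None:
            problems.append((index, "unreadable on both disks"))
        elif a is None:
            problems.append((index, "unreadable on disk A"))
        elif b is None:
            problems.append((index, "unreadable on disk B"))
        elif a != b:
            problems.append((index, "mismatch"))
    return problems


# Hypothetical four-block "disks": block 1 is unreadable on A, block 2 differs.
disk_a = [b"x", None, b"y", b"z"]
disk_b = [b"x", b"q", b"Y", b"z"]

for index, problem in check_mirror(disk_a, disk_b):
    print(f"block {index}: {problem}")
```

    Note that a plain mismatch only says the two copies differ, not which one is right, whereas an unreadable sector points at a specific disk.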
     
  19. sornywrx

    sornywrx Member

    Great, I'm going to try this and see if it finds anything. If it runs for a day and doesn't find any issues I'm not sure where to go from there. Does the PERC card have any way to diagnose itself?
     
  20. imagoon

    imagoon Diamond Member

    It does self-diagnostics on every boot. If it thought it was bad, it would halt the machine with a message just after the PERC 6i message in POST. Dell also includes the Dell diagnostic tools. I think you hit F12 during boot on those and select "diagnostics." It can test the entire machine.
     
  21. rsutoratosu

    rsutoratosu Platinum Member

    I would call Dell and see what they say. Is it still under warranty? Or check the Dell community forums; it might be a common issue.
     
  22. imagoon

    imagoon Diamond Member

    It honestly doesn't sound like a disk issue to me. If it was stuttering as badly as he mentioned, I would expect the PERC to be screaming about disk issues.
     
  23. sornywrx

    sornywrx Member

    I saw the F12 diagnostics and was going to use them to check the hard drives, but they just see the logical drive and not each physical drive. I just let it run long enough to test memory and CPU.
     
  24. imagoon

    imagoon Diamond Member

    Once the basic scan runs, the advanced one should see each disk. Also, scanning the logical volume will test the array; from messing with the T7500s I had, I believe it puts the PERC into diagnostic mode. I.e., in normal use the PERC only really reads one disk at a time in RAID 1, though it tends to stagger requests across both disks, unless there is a read error or a scrub is happening. The built-in diags, however, will hit both disks, and the advanced one afterwards can hit each one individually. MSM can also run a surface scan on a per-disk basis.
     
  25. sornywrx

    sornywrx Member

    That's what I'm worried about. At this point I am almost hoping it is a disk issue just because that's pretty cut and dried versus pinpointing motherboard or cable issues (which I know are a lot less likely). I was considering that it was likely a Windows issue until I booted into the recovery console (from the CD) and it was fine until there was any kind of disk access. But I didn't have enough time to say for sure that there weren't other issues, just that it seemed like the temporary freezing only happened while accessing the logical disk, and once it got past that it wasn't pausing (maybe that's a better word than freezing).
     
  26. corkyg

    corkyg Elite Member, Super Moderator, Peripherals

    Possibly true for software RAID, but Acronis TI handles my RAID perfectly any time I want it. The problem with software RAID is that cloning is usually done offline (command prompt reboot) and software RAID does not exist then. Hardware RAID - no problem!