Troubleshooting RAID1/Server problems - Please Help

runboy

Member
Dec 6, 2000
I have a colocated server sitting literally 5000 miles away that I am having
severe problems with. Sorry for the long story ;o) You can skip to the
bottom if you want the meat of the mission.

The server is a Dell PE2650 with 5 drives: two configured as a RAID1 mirror
and three as a RAID5 array. The mirror contains three partitions, one of them
being the OS (Windows 2000 Server).

Recently I updated my server with some new security fixes, and after
rebooting, the server bluescreened with INACCESSIBLE_BOOT_DEVICE
(0x88FCBAF0, 0xC0000102, 0x00000000, 0x00000000).
Afterwards it was able to reboot into Windows. Some ACLs on the filesystem
seemed to be broken, though (the webserver would no longer serve ASP pages,
plus there were some other snags). I decided to rebuild the ACLs from scratch,
and here I made a stupid mistake. I am having some problems with the RAC on
the server, so I was logged in via Terminal Services. The first step in
resetting ACLs is to pretty much give permission to the Administrators group
and SYSTEM only, and then build on top of that as you go. After doing this I
could no longer access the C: drive via Terminal Services.
I set up a trouble ticket at the datacenter to get them to physically go to
the server, log in, and reset the ACLs. I don't know why, but they decided to
reboot the server instead and ran into the following bluescreen: a c000021a
error with status code 0xc0000022 (0xc0000000 0xc0000000).
http://support.microsoft.com/kb/137400 - problems after messing with ACLs ;o)
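For anyone decoding these crashes: the codes above correspond to documented Windows NTSTATUS/bugcheck values, which is what pointed to the ACL change as the culprit. A minimal lookup sketch (the symbolic names come from Microsoft's documentation; the `describe` helper itself is just illustrative, not a real API):

```python
# Symbolic names for the stop/status codes seen in this thread,
# per Microsoft's documented NTSTATUS/bugcheck values.
STOP_CODES = {
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",          # boot volume unreadable
    0xC000021A: "STATUS_SYSTEM_PROCESS_TERMINATED",  # fatal system error
    0xC0000022: "STATUS_ACCESS_DENIED",              # e.g. broken ACLs
}

def describe(code: int) -> str:
    """Return the symbolic name for a known stop/status code."""
    return STOP_CODES.get(code, "unknown code 0x%08X" % code)

# The c000021a crash carried status 0xc0000022:
print(describe(0xC0000022))  # STATUS_ACCESS_DENIED
```

A c000021a with status 0xc0000022 means a critical system process was denied access to a file it needs at boot, which matches the KB137400 article exactly.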

The easiest solution at this point was to restore the C: drive with
DriveSnapshot. The datacenter charges $150/hr after the first 15 minutes, so
I didn't want them spending unnecessary time troubleshooting if the fix was
simple. I had a one-month-old copy of my C: drive lying around locally (5000
miles from the server), so I suggested the datacenter set up an FTP account;
then I could upload the backup and they could restore it. A really simple
process: I have a BartPE CD that will boot the server right up with
DriveSnapshot ready to do its work. Unfortunately, I hadn't read the fine
print when I placed my server at the datacenter. For liability reasons they
couldn't restore the backup for me. After a lot of back and forth with higher
management, I finally gave up (after 48 hours of trying) and shelled out the
$1500 for a ticket to go fix my server myself. I planned for a short trip -
back the next day - since I had to make it back for work.

I arrived at the Datacenter and within 1½ hours I had restored the backup
and my system was up and running again. Everything was good - I thought! I
was sure the security update had caused my original problem, mainly because I
had found a post on the net from a user claiming the same security patch had
caused his system to give the same bluescreen.

MEAT:

Long story short, the security update was not at fault. The system crashes
almost every time I restart it, and chkdsk will sometimes get it back up and
other times not. I rebuilt from the backup several times while
troubleshooting this problem. The pattern was the same:

1. The first restart after the restore goes without trouble.
2. After using the system a little, I get this entry in the Event Viewer
System log: "The file system structure on the disk is corrupt and unusable.
Please run the chkdsk utility on the volume C:."
3. The server will seem to run OK. (Actually, after the last restart it is
still going strong after almost a week under heavy load.)
4. Restarting the server will cause a bluescreen, plus a chkdsk autorun that
seems to find the same problem sectors every time.

I have tried defragging right after restoring, and that didn't help. Neither
the RAC nor the server is showing any indication of hardware problems. The
disks show as healthy.

Not having time to troubleshoot, and thinking the problem was hardware
related (controller or motherboard), I extended my return flight one day and
went out and found another server that will take over the next time the
problem server crashes. I am now in the process of setting that one up
remotely. Most of the time on location was spent getting the server and doing
the initial setup.

On my flight home, a great deal of thought went into what further
troubleshooting I could have done:

1. I regret not restoring the MBR and file structure via DriveSnapshot. (I
don't think it is a boot virus, since the server is locked down pretty well
and I have antivirus running on it, but you never know.)
2. If I had more time, I would have troubleshot the drives a little more.
Could a single drive in the mirror with bad sectors cause this problem, or
would it show up as not healthy?

What would be the best option for troubleshooting the mirror? I am thinking I
would proceed like this:

1. Remove one drive from the RAID1 and insert a new one. Let the RAID
rebuild, then restore the backup and see if things work. If not, swap the
other original drive out the same way, let the RAID rebuild, restore again,
and see if it works.

Any other options?

Could it be bad RAM? I upgraded the RAM from 2 to 4 GB two months ago.

Wouldn't I be having more problems keeping the system running if the issue
were with the RAID controller or the motherboard?
 

mechBgon

Super Moderator / Elite Member
Oct 31, 1999
If you had enough time, you could jump into the SCSI controller's BIOS and have the controller do a media check on the drives to locate bad sectors. I believe the bad sectors get added to the Grown Defect List that the drive maintains, so it would avoid them in the future.

 

RebateMonger

Elite Member
Dec 24, 2005
I know of no magic in cases like this. You start by verifying that the drive controller, the cables, the drives, the RAM, and the power are all good. Also, make sure that EVERYTHING is running the latest BIOS/firmware revision. I've seen cases where old firmware apparently caused problems with the latest software updates.

It SOUNDS like a problem with the drive system. I'd probably just replace the drive controller, the boot drives, and the boot drive cabling. That's what I did with my last new client who had apparent drive array problems. They had fired their previous IT person because they couldn't afford to be down anymore.

After testing the replaced items (at your convenience), you can always use them as spares if they verify as good.