Is md raid check necessary?

Red Squirrel

No Lifer
May 24, 2003
67,380
12,129
126
www.anyf.ca
I think I just figured out what randomly crashes my file server. I've learned that whenever disk I/O gets bogged down too much in Linux, it causes a chain reaction and stuff starts to crash. You get errors like "task delayed for 120 secs" and it starts to kill things off.

I think I found the source of what occasionally bogs down my entire file server, which in the end bogs down everything else, such as my VMs. It's the stupid raid-check cron job! I don't know whose bright idea that was, but it runs once a week and it checks ALL arrays at the same time! The load average right now on my server is over 10! It's barely responsive.

Is this raid check really necessary? Is there a way I can split it up so it only does one array at a time at least? The cron job is just this:

Code:
0 1 * * Sun root /usr/sbin/raid-check

Is there a way to make it only check one array at once, and perhaps do it at a lower priority?
 

manly

Lifer
Jan 25, 2000
11,019
2,135
126
I am not familiar with that cron job.
However, check your mdadm.conf. I'm fairly certain a monthly consistency check is normally enabled, so I don't see why it's a script called by a cron job?

FWIW a load of over 10 (assuming it's under 20) isn't really that extreme on modern Xeon procs. You're usually looking at anywhere from 4 cores (single socket) to 12 cores (i.e. dual Ivy Bridge-EP). Unless you're on an older dual-core proc, the server should be quite responsive still.
 

Red Squirrel

Yeah, I found it in /etc/cron.d. I got an alert on my phone that my server load average was 10+, so I started to investigate and found that job. Everything was brutally slow, even SSH. I disabled it for now, but I imagine I still want to run it once in a while. I'd want to run it in a staggered fashion, like md0 on the 1st of every month, md1 on the 3rd, and so on. My mdadm.conf does not have anything related to check; it just looks like this:

Code:
ARRAY /dev/md1 metadata=0.90 UUID=11f961e7:0e37ba39:2c8a1552:76dd72ee
ARRAY /dev/md0 metadata=1.2 name=isengard.loc:0 UUID=2e257e19:33dab86c:2e112e06:b386598e
ARRAY /dev/md3 metadata=1.2 name=isengard.loc:3 UUID=99f0389f:dbf75cb3:c841340e:33f62841

On a similar note, I have a dead md2 array. It refuses to go offline even though there are no drives in it. Any way to kill that? Every now and then I get I/O messages in dmesg related to that array. It always scares the crap out of me when I check it and see all the drives failed, then realize it's that dead array.
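A possible way to look at and stop a stale array, as a rough sketch only. It assumes the array shows up in /proc/mdstat and that nothing still has it mounted or open; `mdadm --stop` is the standard way to deactivate an array, but whether it succeeds on this particular md2 is not something the thread confirms. The function names are made up for illustration.

```shell
#!/bin/sh
# Print each md array and its state ("active", "inactive", ...) as listed
# in an mdstat-style file. The optional argument is a hypothetical override
# for trying this on a copy; by default it reads /proc/mdstat.
list_md_states() {
    awk '/^md[0-9]+ :/ { print $1, $3 }' "${1:-/proc/mdstat}"
}

# Deactivate one array, e.g.: stop_array md2
# Needs root. mdadm refuses if the array is still mounted or in use.
stop_array() {
    mdadm --stop "/dev/$1"
}
```

Running `list_md_states` first shows which arrays the kernel still knows about; `stop_array md2` would then be the step that tries to take the dead one offline.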
 

Red Squirrel

Ended up just nuking this, won't bother with it. Are there any repercussions?

It totally kills my server when it runs. Everything times out, VMs crash, etc. Just can't have that happen.
 
Feb 25, 2011
16,789
1,469
126
That kind of data integrity check stuff is usually important. Sorta.

Could you just use "nice" in the cron entry to make it less all-consuming? Or tinker with /etc/sysconfig/raid-check so it only does certain arrays at certain times, etc.

I'd just worry you're cutting off your nose to spite your face instead of tweaking/configuring it to suit your needs.
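For what it's worth, here is a minimal sketch of the one-array-at-a-time idea, done by hand through the md sysfs interface rather than through raid-check. It assumes the kernel exposes /sys/block/mdX/md/sync_action, where writing "check" starts a scrub and reading back "idle" means nothing is running. The function names, SYSFS_ROOT and POLL_SECONDS are made up for illustration, and this is not a claim about how raid-check itself is implemented.

```shell
#!/bin/sh
# Sketch: scrub md arrays one after another instead of all at once.
# SYSFS_ROOT is a hypothetical override so the functions can be tried
# against a scratch directory; on a real system it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Ask the kernel to start a read-only consistency check on one array.
start_check() {
    echo check > "$SYSFS_ROOT/block/$1/md/sync_action"
}

# Block until the array reports that no check or resync is running.
# POLL_SECONDS (hypothetical, default 60) sets how often to look.
wait_idle() {
    while [ "$(cat "$SYSFS_ROOT/block/$1/md/sync_action")" != "idle" ]; do
        sleep "${POLL_SECONDS:-60}"
    done
}

# Check each named array in turn, e.g.: check_serially md0 md1 md3
check_serially() {
    for md in "$@"; do
        start_check "$md" && wait_idle "$md"
    done
}
```

Calling `check_serially md0 md1 md3` as root from a cron job would scrub the arrays back to back. As far as I know, running the script itself under nice only lowers the priority of the script, not of the kernel thread doing the actual check.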
 

Red Squirrel

Hmm, I did not know about /etc/sysconfig/raid-check, and Google was not returning anything useful when I was trying to figure out if there's a way to configure it. This is what it looks like now:

Code:
ENABLED=yes
CHECK=check
NICE=low
# To check devs /dev/md0 and /dev/md3, use "md0 md3"
CHECK_DEVS=""
REPAIR_DEVS=""
SKIP_DEVS=""
MAXCONCURRENT=

I'll try this and see how it goes:

Code:
ENABLED=yes
CHECK=check
NICE=idle
# To check devs /dev/md0 and /dev/md3, use "md0 md3"
CHECK_DEVS=""
REPAIR_DEVS=""
SKIP_DEVS=""
MAXCONCURRENT=1
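If the check still hammers the disks with those settings, another knob that might help is the kernel's md speed limit. The sketch below assumes the usual /proc/sys/dev/raid/speed_limit_min and speed_limit_max files (values in KiB/s per device); RAID_PROC and the function names are hypothetical, and the number in the usage note is only an example, not a recommendation.

```shell
#!/bin/sh
# Sketch: inspect and cap the md check/resync speed.
# RAID_PROC is a hypothetical override for trying this on scratch files;
# on a real system it stays at /proc/sys/dev/raid.
RAID_PROC="${RAID_PROC:-/proc/sys/dev/raid}"

# Show the current limits in KiB/s per device.
show_limits() {
    echo "min=$(cat "$RAID_PROC/speed_limit_min") max=$(cat "$RAID_PROC/speed_limit_max")"
}

# Cap the maximum check/resync speed, e.g.: set_max_speed 20000
# Needs root. The setting does not survive a reboot.
set_max_speed() {
    echo "$1" > "$RAID_PROC/speed_limit_max"
}
```

For example, `set_max_speed 20000` before the check starts would hold each device to roughly 20 MB/s, leaving more I/O for the VMs at the cost of a longer check.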