Linux: Troubleshooting ext3 filesystem corruption?

manly

Lifer
Jan 25, 2000
12,607
3,397
136
Upon bootup of my SuSE Linux workstation, init reported that the root filesystem was corrupt and dropped me into single-user mode to repair the problem. I know my way around Linux systems, but suffice it to say I'm not a professional sysadmin, so I'm not too handy at hazard duty.

So I remounted / read-write and ran e2fsck on my ext3 root filesystem. Yes, it's safe to run e2fsck on ext3fs according to the man page. I ran e2fsck because the remount command suggested as much; other than that I wouldn't know how to "repair" the filesystem.

Anyhow, it ran through a series of passes, reporting "duplicate / bad blocks" (I really wish it differentiated between the two). It also changed the file type of a number of files (from type 2 to type 1, which I assume is from directory to regular file). I think this is the case because /root is now a regular file. :Q

To estimate the extent of the problem, there were probably on the order of 100 duplicate / bad blocks, and 20 file type errors, along with other assorted errors. I would characterize this as serious, but not fatal problems, at least judging by the fact I am now using the system instead of feverishly backing up. :eek: Unfortunately, I don't see any record of fsck in /var/log/messages. There are a bunch of files now sitting in /lost+found if I want to start digging to see what data was impacted.

Actually, I was dropped into single-user mode to repair / twice consecutively. Apparently, some of the fixes in the first fsck were no good, and likely reversed in the second fsck. If it happened (or happens) a third time, I would definitely have shut down the system and started a contingency plan. Furthermore, /home is a separate filesystem and losing / alone would not be catastrophic. I also have no reason to believe there is a physical disk failure that jeopardizes all filesystems, but device failure is always a possibility. I have not handled or moved the drive recently. Nor is there any history of mishandling or bad blocks.

So my question is basically: where do I go from here?
  • Are there any other filesystem repair utilities for ext3fs?
  • Besides the drive manufacturer's diagnostic utility, how can I investigate drive failure? S.M.A.R.T. is enabled, but appears fairly useless for desktop PCs (in theory, Linux should report any impending drive failures if support is activated).
  • How do you guys like to back up your workstation? I generally back up /home only once a year to CD-R. :eek: I've never been a fan of tape, but even DVD-R would not be a convenient general system backup medium for an 80 GB drive.
  • I always shut down Linux sanely, but the system does lock up on rare occasions (I can't recall a reason why, but for example I recently was playing with Nvidia video drivers). Is the combination of rare system reboots (without clean unmounts) and the short fsck for ext3fs one cause for this problem unexpectedly occurring, and the volume of errors?
In due time, I'll probably reinstall the OS (with some Linux flavor, most likely SuSE) to clear out the cruft that's apparently built up. But not until after I've backed up data and verified the drive is reliable. It appears the OS was installed last June. For the record, the drive is a Western Digital 120 GB IDE drive connected to a Promise ATA 133 controller (performs noticeably better than the onboard ATA-66 controller). Other than that, the system is fairly generic. I'm more concerned with the Linux software side of things than asking hardware-related questions.

Any thoughts & analysis are appreciated.
 

Barnaby W. Füi

Elite Member
Aug 14, 2001
12,343
0
0
How do you guys like to back up your workstation? I generally back up /home only once a year to CD-R. :eek: I've never been a fan of tape, but even DVD-R would not be a convenient general system backup medium for an 80 GB drive.
i back up /home, /usr/local, /etc, and /var, once every month or two. i back up my mp3's a bit less often. the problem i find with mp3's is that there is no way to intelligently do incremental backups. say i rename some mp3's, that in itself would warrant backup, because i don't want to restore to the old names. i wish there was some intelligent diff-like backup utility that could do diffs of file names, permissions, contents, etc, so you could only backup what has changed. i can backup everything onto one cd, EXCEPT my mp3's, which take lots of cd's (9GB). dvd-ram looks extremely appealing.
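One rough way to approximate the "diff-like" behaviour wished for above is to keep a checksum manifest from the previous backup and compare it with a fresh one. This is only a sketch under assumptions: make_manifest and changed_since are hypothetical helper names, it compares contents and paths but not permissions, and it assumes file names contain no newlines.

```shell
#!/bin/sh
# Write a sorted manifest of "checksum size path" for every file under DIR.
# Usage: make_manifest DIR > manifest.txt
make_manifest() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort )
}

# Print the manifest lines that appear in NEW but not in OLD, i.e. files
# that were added, modified, or renamed since OLD was written.
# Usage: changed_since OLD NEW
changed_since() {
    comm -13 "$1" "$2"
}
```

A renamed mp3 shows up as one new line whose checksum matches a line in the old manifest, so only files that are listed need to go onto the next disc.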


i can't really comment on much else, luckily i've never had a serious file system corruption problem, or hard drive failure (knock on wood).
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,731
155
106
e2fsck -v -y /dev/whatever should do it; if not, then it's prob more than the filesystem that's messed up
I find the best way to back things up is from one hard drive to another hehe
cd's just take too long and cost money
 

manly

BBWF,

Do you use tar to prep backups, or just mkisofs? Right now, my needs are simple, so I just use mkisofs, a 10x CD-RW that's reusable, and a path-list file along with the graft-points option. I know I could use Xcdroast or whatever, but I'm getting pretty handy with mkisofs/cdrecord so why change now.

I'm sure there are dedicated backup programs out there that do incremental backups (Arkeia comes to mind), but they are generally commercial, proprietary products.

FWIW, I backed up personal data in my home directory, but didn't really do much else. My system has been running fine for the past couple days, notwithstanding some data loss as detailed in the original post.

Will other Linux experts please chime in?
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
So I remounted / read-write and ran e2fsck on my ext3 root filesystem

If you're going to run e2fsck don't mount the filesystem read-write, that's asking for more trouble. You should never fsck a rw mounted filesystem.

Are there any other filesystem repair utilities for ext3fs?

e2fsck is it, it was written by the guy who designed the filesystem so it's pretty comprehensive. There's debugfs, but you need to know a good bit about the filesystem internals to do anything with it.

Besides the drive manufacturer's diagnostic utility, how can I investigate drive failure? S.M.A.R.T. is enabled, but appears fairly useless for desktop PCs (in theory, Linux should report any impending drive failures if support is activated).

Linux itself doesn't support S.M.A.R.T.; you have to use a userland tool. In Debian there's smartmontools, which includes smartctl and smartd: smartctl shows info and can run SMART tests on drives, while smartd runs as a daemon, tests drives periodically, and reports problems to syslog.

smartctl -a /dev/sda
smartctl version 5.1-4 Copyright (C) 2002 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: SEAGATE ST318451LW Version: 0003
Local Time is: Tue Feb 4 23:27:49 2003 EST
Device supports S.M.A.R.T. and is Enabled
Temperature Warning Disabled or Not Supported
S.M.A.R.T. Sense: Ok!
Current Drive Temperature: 37 C
Drive Trip Temperature: 65 C

SMART Self-test log
Num  Test              Status     segment  LifeTime  LBA_first_err  [SK ASC ASQ]
     Description                  number   (hours)
# 1  Background short  Completed  -        0         -              [-  -   -]
# 2  Background short  Completed  -        0         -              [-  -   -]
# 3  Background short  Completed  -        0         -              [-  -   -]
# 4  Background short  Completed  -        0         -              [-  -   -]
# 5  Background short  Completed  -        0         -              [-  -   -]
# 6  Background short  Completed  -        0         -              [-  -   -]

Is the combination of rare system reboots (without clean unmounts) and the short fsck for ext3fs one cause for this problem unexpectedly occurring, and the volume of errors?

The only thing that might contribute is that the longer the box is up, the more things are cached in memory, so a crash can cause more to be lost. ext3 being journaled should save you from running fsck; although it can't save your data, it can usually save the filesystem.
 

Barnaby W. Füi

Originally posted by: manly
BBWF,

Do you use tar to prep backups, or just mkisofs? Right now, my needs are simple, so I just use mkisofs, a 10x CD-RW that's reusable, and a path-list file along with the graft-points option. I know I could use Xcdroast or whatever, but I'm getting pretty handy with mkisofs/cdrecord so why change now.
i just tar/gzip var, usr/local, etc and home in to var.tgz, usr.local.tgz, etc.tgz, and home.tgz respectively. sometimes i back up the individual directories in my homedir for example images.tgz, text.tgz, mbox.gz, dotfiles.tgz, etc.
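A minimal sketch of that tar/gzip routine, assuming GNU or BSD tar and a writable destination directory. backup_dirs is a hypothetical helper name and /mnt/backup in the usage comment is a placeholder; the archive names follow the var.tgz / usr.local.tgz pattern described above.

```shell
#!/bin/sh
# Archive each directory under ROOT into DEST as a gzip-compressed tarball,
# naming the archive after the path with "/" replaced by ".".
# Usage: backup_dirs DEST ROOT DIR...
backup_dirs() {
    dest=$1
    root=$2
    shift 2
    for dir in "$@"; do
        name=$(printf '%s' "$dir" | tr '/' '.')
        # -C makes the paths stored in the archive relative to ROOT.
        tar -czf "$dest/$name.tgz" -C "$root" "$dir" || return 1
    done
}

# Example (placeholder destination):
# backup_dirs /mnt/backup / var usr/local etc home
```

Storing relative paths means the archives can be unpacked into a scratch directory for inspection instead of straight over the live system.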

as far as backing up music, i don't bother compressing it because it doesnt really have any effect. even with bzip2, when i tried it out, i only got at most 5% compression or so, and that's just not worth the hassle. the only problem i have with backing up to cd is long filenames. sometimes i rename my mp3's to shorter names (i.e. rename all mp3's that start with "really long freaking artist name" to "rlfan", and i just remember what it is and rename them back when i restore), but it seems that even if a name exceeds the iso standard, mkisofs still uses the long name, because i've had mkisofs tell me a name was too long, but when i mounted the iso, the filename was perfectly normal. using iso in general just kind of sucks. i'm actually contemplating just using something like ext2 when i burn data cds now. i have no need for using the cd's on a windows machine or mac (can osx read ext2?), so i figure "what the hell?" :) having to rename mp3's sucks when you have to deal with nearly 2000 of them. (thank god for the perl "rename" script :))

I'm sure there are dedicated backup programs out there that do incremental backups (Arkeia comes to mind), but they are generally commercial, proprietary products.

i dont really know, have you tried looking at freshmeat?
 

manly

Originally posted by: Nothinman
So I remounted / read-write and ran e2fsck on my ext3 root filesystem

If you're going to run e2fsck don't mount the filesystem read-write, that's asking for more trouble. You should never fsck a rw mounted filesystem.
For whatever reason, when SuSE drops me into single-user mode, the message printed is to remount the root FS read-write. So I blindly followed minimal instructions... What's the usefulness of running fsck on a read-only mounted FS? Will it correct any errors just the same?

Besides the drive manufacturer's diagnostic utility, how can I investigate drive failure? S.M.A.R.T. is enabled, but appears fairly useless for desktop PCs (in theory, Linux should report any impending drive failures if support is activated).

Linux itself doesn't support S.M.A.R.T.; you have to use a userland tool. In Debian there's smartmontools, which includes smartctl and smartd: smartctl shows info and can run SMART tests on drives, while smartd runs as a daemon, tests drives periodically, and reports problems to syslog.
True, but this is a Linux OS system feature to me, albeit not a kernel feature as you point out. On SuSE, the available tool is ide-smart. Does smartd actively notify of any warnings or errors upon initial bootup? So I gave ide-smart a try:
p3-800:/media # ide-smart -d /dev/hde
Id= 1, Status=11 {PreFailure , OnLine }, Value=200, Threshold= 51, Passed
Id= 3, Status= 7 {PreFailure , OnLine }, Value= 97, Threshold= 21, Passed
Id= 4, Status=50 {Advisory , OnLine }, Value=100, Threshold= 40, Passed
Id= 5, Status=51 {PreFailure , OnLine }, Value=200, Threshold=140, Passed
Id= 7, Status=11 {PreFailure , OnLine }, Value=200, Threshold= 51, Passed
Id= 9, Status=50 {Advisory , OnLine }, Value= 96, Threshold= 0, Passed
Id= 10, Status=19 {PreFailure , OnLine }, Value=100, Threshold= 51, Passed
Id= 11, Status=19 {PreFailure , OnLine }, Value=100, Threshold= 51, Passed
Id= 12, Status=50 {Advisory , OnLine }, Value=100, Threshold= 0, Passed
Id=196, Status=50 {Advisory , OnLine }, Value=200, Threshold= 0, Passed
Id=197, Status=18 {Advisory , OnLine }, Value=200, Threshold= 0, Passed
Id=198, Status=18 {Advisory , OnLine }, Value=200, Threshold= 0, Passed
Id=199, Status=10 {Advisory , OnLine }, Value=200, Threshold= 0, Passed
Id=200, Status= 9 {PreFailure , OffLine}, Value=200, Threshold= 51, Passed
OffLineStatus=130 {Completed}, AutoOffLine=Yes, OffLineTimeout=78 minutes
OffLineCapability=59 {Immediate Auto SuspendOnCmd}
SmartRevision=16, CheckSum=249, SmartCapability=3 {SaveOnStandBy AutoSave}
According to the documentation, there are both online and offline tests. I only tried the online tests, and it took less than a second, and the results (everything passed) are perhaps meaningless. Personally, I've never heard of anyone say SMART warned them of an imminent drive failure preemptively.
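Output like the above is easier to judge if you look at how far each normalized Value sits above its Threshold, since an attribute only stops reporting "Passed" once Value drops to or below Threshold. A rough sketch, assuming the line layout shown in this post; smart_margins is a hypothetical helper name, not part of ide-smart:

```shell
#!/bin/sh
# Read ide-smart attribute lines on stdin and print "ID TYPE MARGIN" for
# each, where MARGIN = Value - Threshold, smallest margin first.
# Lines that are not attribute lines (e.g. OffLineStatus=...) are skipped.
smart_margins() {
    sed -n 's/^Id= *\([0-9][0-9]*\),.*{\([A-Za-z]*\) *,.*Value= *\([0-9][0-9]*\), Threshold= *\([0-9][0-9]*\),.*/\1 \2 \3 \4/p' |
        awk '{ printf "%s %s %d\n", $1, $2, $3 - $4 }' |
        sort -k3,3n
}

# Example:
# ide-smart -d /dev/hde | smart_margins
```

A margin that shrinks from one run to the next is more informative than a single all-passed snapshot, so it may be worth saving the output with a date and comparing it later.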

Is the combination of rare system reboots (without clean unmounts) and the short fsck for ext3fs one cause for this problem unexpectedly occurring, and the volume of errors?

The only thing that might contribute is that the longer the box is up, the more things are cached in memory, so a crash can cause more to be lost. ext3 being journaled should save you from running fsck; although it can't save your data, it can usually save the filesystem.
So the metadata is always sane, but is the fact that fsck doesn't do any significant work on each abrupt restart a reason for corruption to "build up" over time? I'm just trying to get a sense for the cause and effect, as this is the first time I've had a real filesystem corruption issue.
 

Nothinman

For whatever reason, when SuSE drops me into single-user mode, the message printed is to remount the root FS read-write. So I blindly followed minimal instructions...

That should be changed then, mounting it rw will just cause more problems unless fsck has been run already.

What's the usefulness of running fsck on a read-only mounted FS? Will it correct any errors just the same?

Yes, e2fsck works on the raw block device, not the mounted filesystem. The problem with running e2fsck on a rw mounted partition is that what e2fsck examines on disk might not be the same as what the OS intends to be there because of filesystem caching and it may fix things that aren't broken.
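The read-write check can be scripted before e2fsck is ever started. This is a minimal sketch, assuming the device name you pass in matches the first column of /proc/mounts (the root filesystem is sometimes listed there as /dev/root instead); mount_mode and safe_fsck are hypothetical helper names, not part of e2fsprogs.

```shell
#!/bin/sh
# Report how a block device is currently mounted: "rw", "ro", or "unmounted".
# Usage: mount_mode DEVICE [MOUNTS_FILE]
# MOUNTS_FILE defaults to /proc/mounts; the parameter exists only so the
# function can be tried against a sample file.
mount_mode() {
    awk -v dev="$1" '
        $1 == dev {
            # Field 4 is the comma-separated option list; the access mode
            # appears in it as a standalone "rw" or "ro" entry.
            n = split($4, opts, ",")
            mode = "ro"
            for (i = 1; i <= n; i++) if (opts[i] == "rw") mode = "rw"
            found = 1
        }
        END { print (found ? mode : "unmounted") }
    ' "${2:-/proc/mounts}"
}

# Refuse to repair a filesystem that is mounted read-write.
# Usage: safe_fsck DEVICE
safe_fsck() {
    if [ "$(mount_mode "$1")" = "rw" ]; then
        echo "$1 is mounted read-write; remount it read-only first" >&2
        return 1
    fi
    # -f forces a full check even if the filesystem is marked clean.
    e2fsck -f -v "$1"
}
```

For the root filesystem in single-user mode, the usual way to get back to read-only is `mount -o remount,ro /` before running the check.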

Does smartd actively notify of any warnings or errors upon initial bootup? So I gave ide-smart a try:

ide-smart is available too, but I just noticed that smartctl and smartd were packaged and figured I'd give them a try. smartd writes events to syslog, so you need something monitoring the logs to see the warnings.
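A very small sketch of that kind of log check, assuming syslog goes to /var/log/messages and that the daemon's lines carry the usual `smartd[PID]` tag; both are assumptions about the local setup, and smartd_events is a hypothetical helper name.

```shell
#!/bin/sh
# Print the syslog lines written by smartd.
# Usage: smartd_events [LOGFILE]
# Exits non-zero when no such lines are found (grep's normal behaviour).
smartd_events() {
    grep 'smartd\[' "${1:-/var/log/messages}"
}

# Example: show the most recent events.
# smartd_events | tail -n 20
```

Run from cron with the output mailed to root, something like this at least puts the warnings where somebody will see them.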

Personally, I've never heard of anyone say SMART warned them of an imminent drive failure preemptively.

I've seen it dozens of times where I work; S.M.A.R.T. has saved a ton of people's data there.

So the metadata is always sane, but is the fact that fsck doesn't do any significant work on each abrupt restart a reason for corruption to "build up" over time?

The metadata is always sane so there's nothing to build up. You can have ext3 do data journaling too, but all it does is slow things down dramatically. And recently there was a problem with the data journaling that caused corrupted filesystems; I'm not saying it's still broken, but I'd be wary.

I'm just trying to get a sense for the cause and effect, as this is the first time I've had a real filesystem corruption issue.

I can't really say because I've had no real filesystem corruption myself. Once or twice I had a filesystem that needed xfs_repair run before it would mount (I use XFS on my main box, ext3 on my UltraSparcs but they don't reboot much), but after the repair it was fine, minus a few files that were lost because they were in the filesystem cache when the system went down.