
Data archiving philosophy questions (LONG)

The111

Member
Looking for critique on my thought process here, and a few unanswered questions regarding data corruption.

As a photographer and videographer I generate a sizeable amount of data very quickly, data that I want to last a lifetime. For years I archived it on optical discs in triplicate (2 copies stored at home, 1 away from home), but this grows increasingly impractical for many reasons, and I also still fear data dropout from the optical medium over time (everyone has had burned discs go bad).

So, I've swapped to HDD storage. Right now my current "archive of everything" is about 650GB. I have 3x 1TB drives (WDC Black) that I use as mirrored backup disks (2 stored at home, 1 away from home). I don't like the idea of RAID because I don't want any sort of disk array that relies on software or hardware external to the disk to function (mobo, drivers, etc). I refer to my manually mirrored triplicate disk concept as "ghetto RAID" since it accomplishes the same thing (and more, since the 3rd disk is off-site in case of fire, etc). The disks are only powered on when I am transferring data to or from them. I store them in ESD cases, and when I need to hook them up to a PC I use an e-SATA dock. There is another 1TB drive in my desktop machine which contains the same archive, for immediate access and also for daily updates (the backup disks get updated much less often).

I am mostly content with my methods, but one thing concerns me: data corruption. I don't know much about it, but I do know that data can go bad over time. More specifically, I'm worried about unintentionally replicating corrupt data over good data in my manual backup process.

Scenario 1 (acceptable corruption): Let's say that in my desktop I have my current archive defined as ABCDEF. And I also have this ABCDEF mirrored on my 3 external disks. Over the course of a few weeks, I update the archive in my machine such that ABCDEF becomes ABCDEF+G. Very easy to update all 3 external disks to ABCDEFG (simply add G to each disk). If, anywhere along the line, "C" has gone corrupt on any of the 4 disks, that corruption stays isolated to the one disk (since I have not over-written any of the C's). Somewhere in the future I become aware of the corruption, and it's no problem, I simply get a good "C" from one of the other disks.

Scenario 2 (unacceptable corruption): Let's call the archive A1B1C1D1E1F1. Over the course of a few weeks, the version in my machine which I update daily is transformed into A2B2C2D2E2F2+G. These 1->2 changes are minor. Maybe I've altered 4 photos out of a batch of 4,000 in each instance. Maybe I've also modified the directory tree structure in such a way that it's easiest to just completely rewrite all 3 of my backup disks with the brand new A2B2C2D2E2F2G set, and blow away A1B1C1D1E1F1 on all of these backups. Now, what if, unknown to me, a large part of C1 became corrupt on my internal drive. I changed C1 to C2 by modifying 4 out of 4,000 photos (another 2,000 have gone bad but I never discovered it). When I get rid of all my backup C1's and replace them with C2's... I am replicating this unknown corruption onto all my backups. VERY BAD. How do I avoid this?

I think the key here is that the corruption occurred "unknown to me." As I said, I don't know a lot about data corruption... I just know it happens. How can I check for it? Right now, if I copy a very large directory from one disk to another, when it is done copying, I check the "properties" in Windows for each folder and compare the file count and byte count. If both numbers are identical, I have always assumed that the backup was successful and that both folders on their respective disks are indeed identical. Is this a valid assumption? If it is, then I can always check for data corruption on my primary internal disk by comparing it to the external disks just before making any internal modifications which might result in overwriting archive data on all the external disks.
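For what it's worth, matching file counts and byte totals will not catch bit rot: a flipped bit changes content without changing size. Comparing actual content means comparing checksums. A minimal sketch of a per-file hash comparison between two directory trees, using Python's standard hashlib (the function names here are just for illustration, not from any particular tool):

```python
import hashlib
from pathlib import Path

def dir_hashes(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    hashes = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.md5()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so huge video files don't fill RAM.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            hashes[str(path.relative_to(root))] = h.hexdigest()
    return hashes

def compare_dirs(a, b):
    """Return the set of relative paths that are missing or differ in content."""
    ha, hb = dir_hashes(a), dir_hashes(b)
    return {p for p in ha.keys() | hb.keys() if ha.get(p) != hb.get(p)}
```

Two copies with identical sizes but one flipped bit would pass the properties-dialog check yet show up in `compare_dirs`.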

Absent these manual "corruption checks," I am sure there exist all sorts of "backup software" which I have always ignored because I prefer the KISS principle. I am guessing these programs employ sync operations which would ostensibly find corruption as well... but again, I like to KISS. No fancy software where it's not necessary. Should I be re-thinking this?

One of the very few advantages that optical had over HDD was that I could not forcibly overwrite anything that was archived on a write-once medium. Though, I do not think this one advantage alone is enough to draw me back to optical archiving.
 
I suggest that if it's an "archive", then you DON'T update it. Plan on removing one or two backup disks from your rotation once or twice a year and putting those archive disks someplace safe. Replace the old disk or disks with new (and probably larger) disks that you use in your current backup rotation. Rinse and repeat.

Doing it this way only costs one or two hundred dollars a year and minimizes the effect of corruption in your main data store. If you did this yearly, for instance, you'd have your current data store, your ongoing backups, plus your archive disks from, say, 2008, 2009, 2010, etc.

Additionally, for your ongoing backup sets, use software that doesn't overwrite old versions of files. Set your backup software so that it keeps all the original versions of the files and only makes additional backups of changed files.
 
I know at least SVN with fsfs offers a possible solution. With fsfs, each new version (commit) occupies a new file - existing repo data is never modified. If you then run a script weekly that generates MD5s for your repo data and emails them to you (I have such a script if you want an example), you can instantly monitor for corruption. And no, comparing file sizes is close to useless, since corruption flips a bit without inserting or deleting one.
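A weekly check like that boils down to hashing every file and diffing against a stored manifest, which is easy to script yourself. A rough sketch, assuming a plain "digest, two spaces, relative path" manifest format (the function names are mine, not from any particular script):

```python
import hashlib
import os

def md5_of(path):
    """MD5 of one file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(root, manifest):
    """Record 'md5  relative/path' for every file under root."""
    with open(manifest, "w") as out:
        for dirpath, _, names in os.walk(root):
            for name in sorted(names):
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                out.write(f"{md5_of(full)}  {rel}\n")

def check_manifest(root, manifest):
    """Return relative paths whose current MD5 no longer matches the manifest."""
    bad = []
    with open(manifest) as f:
        for line in f:
            digest, rel = line.rstrip("\n").split("  ", 1)
            if md5_of(os.path.join(root, rel)) != digest:
                bad.append(rel)
    return bad
```

Because the manifest records digests rather than sizes, a silent bit flip that leaves the file the same length is still caught; emailing the output of `check_manifest` weekly is left to cron or Task Scheduler.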

Don't know how well it handles a 650GB repo, but I run a repo with several gigs of binary data with success. The other advantage of SVN in this usage scenario over things like git, bazaar, etc is efficient diffs of huge (100+ MB) files. I run diffs of a 0.5GB truecrypt file every day without any problems. Performance is slow, though, especially if modifications are extensive.

Also, the online backup I use (Crashplan) supposedly does background checks of all backed-up data and notifies you of corruption pre-emptively. Don't know how well it works, though.
 
I've read things about ZFS using ECC to verify integrity of files.

I personally use PAR2 (Google it). You create PAR2 archive sets based on a set of files, and that data can be used both to verify its integrity and to repair damage to it (depending on how many PAR2 blocks you generated initially).

When I back up my HDs using Ghost 2003 for DOS, I generate split image files of 690MB ea (good for burning onto CD or DVD), and then I generate PAR2 files, and I store them all on DVDs (making multiple copies). That way, even if a block or two of the DVD becomes unreadable, I can still make a (corrupt) ISO file, and due to the way PAR2 works, I can extract all of the files from the ISO and repair the files, assuming that the corruption doesn't exceed the capacity of the PAR2 files to repair.

Another way I check for corruption is, every few months, I dig out my old backups, extract the files to a directory on a different drive, and then do a recursive full binary compare between the files (using WinDiff.exe, incidentally). It takes a while, and you have to manually sort through all of the files that have changed and decide whether or not those changes were intentional. Generally, though, I know what I've changed since my last backup, and if I see any other files that have changed, I can revert them back to the version from my backup.
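The same recursive byte-for-byte comparison can be scripted if you don't want to click through a GUI; a sketch using Python's standard filecmp module (the `deep_compare` name and the tuple it returns are my own choices):

```python
import filecmp

def deep_compare(a, b):
    """Recursively compare two directory trees byte-for-byte.
    Returns (changed, only_in_a, only_in_b) as lists of relative paths."""
    changed, only_a, only_b = [], [], []

    def walk(cmp, prefix=""):
        # shallow=False forces a full content comparison, not just
        # size/timestamp, so silent corruption is detected.
        match, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False)
        changed.extend(prefix + f for f in mismatch + errors)
        only_a.extend(prefix + f for f in cmp.left_only)
        only_b.extend(prefix + f for f in cmp.right_only)
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix + name + "/")

    walk(filecmp.dircmp(a, b))
    return changed, only_a, only_b
```

As in the WinDiff workflow, the "changed" list still has to be triaged by hand: intentional edits stay, anything unexpected gets reverted from the backup.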

This of course only works if you maintain a progressive historical set of backups, which means not overwriting your backups with newer backups. My data set is still small enough to fit on about 7 DVDs, so it's not too hard to manage, but with HDs it would be similar. But you would have to save the HDs to archive them, and not update them. (Too bad HDs don't have a write-protect jumper. There would definitely be a market for them.)
 
I've read things about ZFS using ECC to verify integrity of files.
And the same principle is used elsewhere too. AFAIK, zip-compressed packages do have a checksum, so one can check whether the package is intact. Many Linux distro CD/DVD images tend to have an (MD5 or SHA) checksum computed. You download the image and the checksum (a text file) and recalculate the checksum. A mismatch indicates corruption.

And then there's intrusion detection. You compute checksums of your OS (Linux) files. If you suspect a break-in, you can check whether the intruder has modified binaries.

Version control (Subversion, Git, etc) is good, but it is not a backup. RAID is good, but it is not a backup. The former uses a database, and databases can get corrupted too. The latter automatically duplicates all corruption as well.

Compression is not nice here: where one changed bit might have corrupted a single pixel in the original image, the same flip in a compressed version will corrupt many.
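This amplification is easy to see directly: flip one bit in a zlib-compressed buffer, and far more than one byte of the recovered data is affected; usually the stream will not decode at all, since zlib's own Adler-32 check fails. A small self-contained demonstration:

```python
import zlib

data = bytes(range(256)) * 64            # 16 KiB of sample "image" data
packed = bytearray(zlib.compress(data, level=9))
packed[len(packed) // 2] ^= 0x01         # flip a single bit mid-stream

try:
    out = zlib.decompress(bytes(packed))
    # If it decodes at all, count how many bytes no longer match.
    damaged = sum(x != y for x, y in zip(out, data)) + abs(len(out) - len(data))
except zlib.error:
    # Stream unreadable past the flip: treat everything as damaged.
    damaged = len(data)
```

One flipped bit in the uncompressed original would damage exactly one byte; here `damaged` ends up far larger.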


Thus, calculating and storing checksums could be a simple approach.
 
Good info from all... thanks so much for all the feedback.

I think the one major step missing from my manual workflow process was error checking. I ended up getting a manual MD5 hashing tool:

http://corz.org/windows/software/checksum/

Seems pretty popular judging from the search engine ranking... it's also very powerful and yet very simple to use, and it integrates into Windows context menus. The cool thing is I can create one .hash file for an entire massive directory with tons of recursive subdirectories, with ALL of those contents included in the hash.

So what I can do now to eliminate my concerns is sort of two-fold:

1) Avoid overwriting old remote archives with "the same" data from a primary "local archive"... in short, only append future updates to old archives (and also maybe completely LOCK DOWN certain old archive disks from time to time, replacing them in the rotation as RebateMonger suggested).
2) If I need to modify old archive data for whatever reason, I should compare hashes for all local and remote archives BEFORE making the modifications on one of the disks, and then immediately distribute only those modifications to the other disks... basically keeping archive checksums on hand for comparison, as mv2devnull suggested.

Thanks again for all the info!
 