What happens when non-ECC RAM goes bad in a ZFS system?
So pretend that our same 8-bit file is stored in RAM wrong.
Same file as above and same length: 00110011.
The file is loaded into RAM, and since it's going to be stored on your ultra-safe zpool, it goes to ZFS to have its parity and checksums calculated. So your RAIDZ2 zpool gets the file. But here's the tricky part: the file is already corrupted thanks to your bad RAM location.
ZFS gets 01110011 and is told to safely store that in the pool. So it calculates parity data and checksums for the corrupted data, and then this data is saved to the disk. Okay, not much worse than the scenario above, since the file is trashed either way. But your parity and checksums aren't going to help you, since they were calculated after the data was corrupted.
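To make the write path concrete, here's a minimal Python sketch of that failure (purely illustrative, not real ZFS code; the stuck_bit helper is made up, and SHA-256 is just a stand-in for the block checksum ZFS would compute). The point is that the checksum is calculated from the copy in RAM, so it happily "verifies" data that is already wrong:

```python
import hashlib

def stuck_bit(data: str, position: int) -> str:
    """Simulate a RAM cell stuck at 1: whatever bit lands at `position` reads back as 1."""
    bits = list(data)
    bits[position] = "1"
    return "".join(bits)

original = "00110011"               # the 8-bit "file" from the example
in_ram   = stuck_bit(original, 1)   # bad cell under the 2nd bit -> "01110011"

# ZFS never sees `original`; it calculates parity and checksums from whatever is in RAM.
stored_data     = in_ram
stored_checksum = hashlib.sha256(in_ram.encode()).hexdigest()

print(stored_data)   # 01110011 -- wrong before it ever reaches the disk
# And the stored checksum "verifies" that corrupted data perfectly:
print(stored_checksum == hashlib.sha256(stored_data.encode()).hexdigest())   # True
```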
But now things get messy. What happens when you read that data back? Let's pretend that the bad RAM location has moved relative to where the file lands in RAM. It's off by 3 bits, which puts it at position 5. So now I read back the data:
Read from the disk: 01110011.
But what gets stored in RAM? 01111011.
Okay, no problem. ZFS will check the data against its checksum. Oops, the checksum fails, since we now have a newly corrupted bit. Remember, that checksum was calculated after the corruption from the first memory error occurred. So now the parity data is used to "repair" the bad data, and the data is "fixed" in RAM. It's supposed to be corrected to 01110011, right? But since we have that bad 5th position, it's still bad! It's correct for potentially one clock cycle, but thanks to our bad RAM location it's corrupted again immediately. So we really didn't "fix" anything; 01111011 is still in RAM. Now, since ZFS has detected corrupted data from a disk, it's going to write the "fix" back to the drive. Except it's actually corrupting the data even more, because the repair didn't repair anything. So as you can see, things will only get worse as time goes on.
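Here's the read path in the same hypothetical sketch (again, just an illustration using the made-up stuck_bit helper from above): the checksum mismatch makes ZFS blame the disk, the "repair" lands in the same bad RAM cell, and the re-corrupted copy is what gets written back.

```python
def stuck_bit(data: str, position: int) -> str:
    """Simulate a RAM cell stuck at 1: whatever bit lands at `position` reads back as 1."""
    bits = list(data)
    bits[position] = "1"
    return "".join(bits)

on_disk  = "01110011"              # what was written (already corrupted once)
expected = "01110011"              # what the stored checksum says the block should be

in_ram = stuck_bit(on_disk, 4)     # bad cell now under the 5th bit -> "01111011"
print(in_ram != expected)          # True: checksum fails, ZFS thinks the DISK is bad

repaired = stuck_bit(expected, 4)  # the "repaired" copy sits in the same bad cell
print(repaired)                    # 01111011 -- the "fix" is corrupted again

on_disk = repaired                 # and this is what ZFS writes back to "heal" the disk
```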
Now let's think about your backups.
If you use rsync, then rsync is going to back up the file in its corrupted form. But what if the file was correct on disk and only corrupted later? Well, thanks to rsync, you've actually synced the corrupted copy to your backups! Great job on those backups! Yes, that last sentence was sarcasm.
What about ZFS replication? Surely that's better, right? Well, sort of. Thanks to those regular snapshots, your server will happily replicate the corruption to your backup server. And let's not forget the added risk of corruption during replication itself: the send stream and its checksums are assembled in that same bad RAM before being piped over SSH, so they might be corrupted too!
But we're really smart. We also do religious zpool scrubs. Well, guess what happens when you scrub the pool. As data is continually read and written through that stuck memory location, ZFS will attempt to "fix" corrupted data that it thinks came from your hard disk and write that data back. But it is actually reading good data from your drive, corrupting it in RAM, "fixing" it in RAM (which doesn't fix anything, as I've shown above), and then writing the "fixed" data back to your disk. So you are literally trashing your entire pool while trying to scrub it. So good job on being proactive with your zpool; you've trashed all of your data because you couldn't be bothered to buy appropriate server hardware. Yes, that last sentence was sarcasm again.
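The scrub scenario is just that read-"repair"-write loop run across every block in the pool. A rough sketch (purely illustrative; the positions are made up, and a real scrub obviously reconstructs from parity rather than a one-line assignment):

```python
def stuck_bit(data: str, position: int) -> str:
    """Simulate the bad RAM cell: whatever bit lands at `position` reads back as 1."""
    bits = list(data)
    bits[position] = "1"
    return "".join(bits)

block = "00110011"   # a perfectly good block sitting on disk
# Hypothetical 0-based positions where the block lands relative to the bad cell each pass.
for scrub_pass, position in enumerate([1, 4, 5], start=1):
    in_ram = stuck_bit(block, position)   # read good data, corrupt it in RAM
    # The checksum no longer matches, so ZFS "repairs" the block -- but the repaired
    # copy lands in the same bad cell, and THAT is what gets written back to disk.
    block = in_ram
    print(f"after scrub pass {scrub_pass}: {block}")
# Output: 01110011, 01111011, 01111111 -- a stuck-at-1 cell only ever adds damage.
```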
So in conclusion:
1. All that stuff about ZFS self-healing goes down the drain if you don't use ECC RAM.
2. Your backups will quite possibly be trashed because of bad RAM. Based on what forum users have reported over the last 18 months, you've got almost no chance your backups will still be safe by the time you realize your RAM is bad.
3. Scrubs are the best thing you can do for ZFS, but they can also be your worst enemy if you use bad RAM.
4. The parity data, checksums, and actual data all need to match. If they don't, repairs start taking place. And what are you to do when you need to replace a disk and your parity data and actual data don't match because of corruption? You lose your data. So was ZFS really that much more reliable than hardware RAID, given that you went with non-ECC RAM?
To protect your data from loss with ZFS, here's what you need to know:
1. Use ECC RAM. Period. If you don't like that answer, sorry. It's just a fundamental truth.
2. ZFS uses parity, checksums, mirrors, and the copies parameter to protect your data in various ways. Checksums detect that the data on the disk is corrupted; parity/mirrors/copies correct those errors. As long as you have enough parity/mirrors/copies to fix any error that ever occurs, your data is 100% safe (again, assuming you are using ECC RAM). So running a RAIDZ1 is very dangerous, because when one disk fails you have no more protection. During the long (and strenuous) task of resilvering your pool, you run a very high risk of encountering errors on the remaining disks. Any such error is detected, but not corrected (there's a toy parity sketch after this list). Let's hope the error isn't in your file system's own metadata, where corruption could be fatal for your entire pool. In fact, about 90% of users who lose their data had a RAIDZ1 and suffered 2 disk failures.
3. If you run out of parity/mirrors and your pool is unmountable, you are in deep trouble. There are no recovery tools for ZFS, and quotes from data recovery specialists start in the 5-digit range. All those tools you've used in the past to recover desktop file systems don't work with ZFS. ZFS is nothing like any of those file systems, and generic recovery tools typically just find billions of 4KB blocks that look like fragments of files. Good luck trying to reconstruct your files from that. Clearly it would be cheaper (and more reliable) to just keep a backup, even if you have to build a second FreeNAS server. And let's not forget that if bad RAM corrupts ZFS just badly enough to be unmountable, even if your files are mostly intact, you'll have to consider that 5-digit price tag too.
4. When RAM goes bad, you will usually lose more than a single memory location. The failure is often a breakdown of the insulation between cells, so adjacent locations start getting trashed too. This only creates more multi-bit errors.
5. ZFS is designed to repair corruption, not to carry on with corruption it can't correct. That's why there's no fsck/chkdsk for ZFS. So once ZFS's file structure is corrupted and you can't repair it because you have no redundancy left, you are probably going to lose the pool (and the system will probably kernel panic).
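To make point 2 above concrete, here's a toy single-parity example (plain XOR across a made-up 3-disk stripe standing in for RAIDZ1, not the real RAIDZ on-disk layout): with everything healthy, one failure is recoverable, but once a disk is already gone, a single additional read error leaves nothing to reconstruct from.

```python
# Toy stripe: two data "disks" plus one XOR parity "disk" (a stand-in for RAIDZ1).
d0, d1 = 0b00110011, 0b01010101
parity = d0 ^ d1

# One disk dies while the others are healthy: the missing data is recoverable.
rebuilt_d1 = d0 ^ parity
print(rebuilt_d1 == d1)        # True -- the redundancy does its job

# Degraded pool: d1's disk is already dead, and during the resilver a sector of d0
# comes back unreadable. Two of the three pieces are now gone.
d0_readable = False
if not d0_readable:
    # d1 = ??? ^ parity -- there is nothing left to XOR against.
    print("unrecoverable: the error is detected, but there's no redundancy left to correct it")
```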
So now that you're convinced that ECC really is that important, you can build a system with ECC for relatively cheap...