Bad sectors: ZFS versus legacy RAID

CiPHER

Senior member
Mar 5, 2015
226
1
36
This is a reply to Cerb in another thread, but I thought creating a separate topic would be better than disturbing the original thread. So here goes:

With copies, you may as well have RAID
If you set it on the entire filesystem, then yes. But ZFS can set these things on a per-filesystem basis. So you can assign copies=2 or even copies=3 to your important data like documents, and leave the default copies=1 for bulk data like downloads, videos, etc. Generally, documents are only 0.1-2% of total data, so it is 'cheap' to enable the feature.

And metadata gets copies=2 by default, while the crucial ZFS metadata that identifies a disk as belonging to a ZFS pool gets multiple copies (I think 16) spread across the drive.

So unreadable sectors (UREs) should never cause damage to the ZFS filesystem itself, only to specific files - this is true even on a single-disk configuration. The chance that both metadata copies are affected by unreadable sectors is statistically very small. And when using a redundant configuration, you multiply the protection, because the extra copies exist on top of the parity/mirror redundancy.

and you are now using non-standard configs
What do you mean exactly?

Enabling ditto blocks - as the feature is officially called - can be achieved with a zfs set copies=2 pool/documents command. ZFSguru and probably other ZFS platforms can do this easily via the web interface.
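
For those who prefer the command line, here is a minimal sketch of what that looks like - the pool and dataset names ("tank", "tank/documents", "tank/downloads") are just placeholders for your own:

    # dedicated dataset for documents; store every block twice from now on
    zfs create tank/documents
    zfs set copies=2 tank/documents

    # verify: the bulk dataset keeps the default copies=1
    zfs get copies tank/documents tank/downloads

Note that copies only applies to data written after the property is set, so it is best enabled when the dataset is created.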

and with just metadata, not enough of importance is protected. Redundancy is needed to protect against UREs, and ZFS does not change that. It enables better protection, but you have to go out of your way to make that happen. FI, saving off files to another local directory will safeguard your data just as well.
I don't really understand what you mean. If you copy the files to another directory, how do you know they weren't corrupt to begin with? How do you know the file on the destination is the same as the original after the copy? Do you check the checksum? And if you do, doesn't the read I/O get served from the RAM file cache instead of re-reading the data from disk to verify it?

You have to start with one thing: error detection.

Then you start adding layers of protection to correct damage, or reverse it with ZFS snapshots - another very cool feature of ZFS that allows easy incremental backups of your data. This way you also protect against other dangers, like viruses that eat your data or an accident with the delete button. These are dangers that normally only a backup would protect against, not RAID. The feature is not unique to ZFS - many other filesystems have it too - but the way it is implemented is really elegant and very easy for beginners to use and understand.
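
As a rough sketch of how those incremental backups work - the dataset names ("tank/documents", "backup/documents") are placeholders, and the receiving side could just as well be a remote machine reached over ssh:

    # snapshots are instant, read-only points in time
    zfs snapshot tank/documents@monday
    zfs snapshot tank/documents@tuesday

    # send only the blocks that changed between the two snapshots
    # (assumes backup/documents already received @monday in an earlier full send)
    zfs send -i tank/documents@monday tank/documents@tuesday | zfs receive backup/documents

    # an accidental delete is undone by rolling back to the last snapshot
    zfs rollback tank/documents@tuesday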

A URE with plain RAID 5 and a URE with RAID-Z1 are functionally identical.
I would strongly object. ZFS is virtually immune to bad sectors when using redundant configurations. Let us compare how the two behave in reality:


  • A RAID engine of the inferior class will panic on any timeout - including a bad-sector recovery - and throw the disk out of the array. Some will even drop TLER disks, because they drop disks on I/O errors too. Worse, all of these poorly designed RAID engines then update the metadata on the other disks to reflect the new state, so a reboot or power-cycle will not fix the issue. Have one bad sector on each of two disks and your RAID5 is already ruined - even with all disks perfectly healthy, because the occasional bad sector falls within the specced 10^-14 uBER rate of unreadable sectors. At maximum duty cycle that can mean up to one unreadable sector per day; across many drives it averages out to roughly one per half year, because the variation between individual samples is huge.

  • A RAID engine of the superior class will handle bad sectors gracefully: it lets the disk time out, which causes a service interruption, but it does not drop the disk from the array and simply returns an I/O error to the application. The application will most likely report an error to the user as well. All is good. But what if the unreadable data wasn't file contents but crucial filesystem metadata? Oops! The legacy filesystem has no redundancy of its own whatsoever. Now you're in trouble!

  • I have had multiple people on the Dutch forums who lost their array on md-raid and then switched to ZFS. I should note that they did not seek expert help long enough to exhaust all possible ways of recovery - that would have required copying all disks to a backup first, so that every recovery attempt could be retried instead of being performed on the only copy. Simply put: I do not know the cause. The only people with a failed ZFS pool I have encountered were one with failed redundancy (two disks dead in a RAID-Z) and one who used a ZFS pool behind an Areca RAID controller with unsafe write-back caching. That causes write reordering across FLUSH requests, which kills the integrity of ZFS. The only real way to kill ZFS, aside from massive disk failure, is to put something between ZFS and the disks that changes the order of writes. Losing recent writes is not a problem at all, thanks to the transactional filesystem design.

  • I have had no confirmation that any RAID engine ever actually corrects bad sectors from redundancy. I have heard claims, but nothing substantive like an official document. It is theoretically perfectly possible for RAID engines to do this: they can read the data from the mirror/parity, reconstruct the missing data, and write it back to the disk that should have had it but was unreadable. This overwrites the bad sector and fixes the issue. It is similar to what ZFS does (see the scrub sketch after this list), but nobody has convinced me it actually exists in reality - aside from ZFS and its siblings.

  • When using a RAID5 solution, the chance is beyond 99% that you will use a 'legacy' 2nd generation filesystem like NTFS, Ext4 or XFS. A filesystem of this generation blindly accepts metadata - it does not authenticate the metadata the way ZFS does. This causes all kinds of funny things to happen, like entire directory trees disappearing, files and directories with garbled names, and other unforeseen disasters. I have screenshots from someone who encountered this. The creator of Ext4 (Theodore Ts'o) seems to agree, and considers his creation a 'stopgap' until Btrfs - a 3rd generation filesystem - can take over. He has also labeled Ext4 'old technology'. I happen to agree with him. :cool: :awe:

  • When using a RAID5 solution, the filesystem and the disk aggregation layer ('RAID') are separate entities that are unaware of each other's information. This means features that require unification, or at least co-operation, of the two are simply not possible; ZFS and also Btrfs can offer them. One unique feature is the dynamic stripe size, which legacy RAID simply cannot do. It makes RAID-Z behave more like RAID3 than RAID5, fixes the 'RAID5 write hole', and achieves atomicity in a single pass. Since all writes align perfectly with the stripe boundary, no read-modify-write cycle ever occurs with the RAID-Z family - something that is frequent with RAID5 for anything other than sequential contiguous I/O.
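
To make the repair-from-redundancy point concrete (the pool name "tank" is again just a placeholder): a scrub makes ZFS read every allocated block, verify it against its checksum, and rewrite anything that fails from the mirror or parity copy.

    # read and verify every block in the pool, repairing from redundancy where possible
    zpool scrub tank

    # afterwards the status output reports how much data was repaired and lists
    # any files that could not be recovered
    zpool status -v tank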

I would love to know which parts you agree with and which ones you don't.

And by the way, anything I said about ZFS being 'immune' to bad sectors or UREs also applies to other 3rd generation filesystems like Btrfs and ReFS. Simply put: in 2015, protection against bitrot is almost mandatory.

how much space does metadata take up v. data? Hardly any!
Some ZFS systems have 128GiB of RAM so that they can cache about 64-96GiB of metadata, which is roughly 5-50% of their pool's total metadata. And all metadata is compressed by ZFS. So it's not 'hardly any' - though percentage-wise it may be, for a large pool. You have a point, of course, when you assert that a URE will land on data much more often than on metadata. But even that small chance is not low enough to call insignificant: 0.1% means a one-in-a-thousand chance, and that is way too high. And I would assume even NTFS uses more metadata than 0.1% of stored data, depending on file sizes of course.

Caching metadata in RAM is a very hot feature for some ZFS users - myself included! It makes file searching and directory browsing almost instant. It also causes fewer random seeks to the (5400rpm) disks, so they can do more sequential I/O instead. This means the disks won't be stuck at 50MB/s because they are being hammered with random reads, but can do up to 175MB/s with the latest 6TB WD Green - which, by the way, has 1.2TB platters and higher sequential speeds than the 2-5TB Greens.
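
One knob for steering the cache in that direction - sketched with placeholder dataset names - is the primarycache property, which controls what the ARC (the RAM cache) may hold per dataset. Telling a bulk-media dataset to cache only metadata leaves more room in RAM for the metadata of everything else:

    # cache only metadata of the bulk dataset in RAM; file contents stream from disk
    zfs set primarycache=metadata tank/media

    # documents keep the default of caching both data and metadata
    zfs get primarycache tank/media tank/documents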

What ZFS specifically does protect against, that is not protected against in most systems (BTRFS and ReFS being notable exceptions), and which appear to be ever more important as time goes on, are not UREs, but successful reads of bad data, which need OS-level or higher awareness to handle.
You're referring to the end-to-end data security feature of ZFS, I presume. Yes, it is nice, but for home users it is not all that hot. End-to-end security means the whole storage chain is protected by checksums or another error-detection method. So data read from disk is verified against the checksums stored in the ZFS metadata, a good SAS controller can add a layer of protection of its own, ZFS internally keeps the data guaranteed against corruption (*), and finally the data is delivered to the application only if it is known to be good.

The 'application' may be your copy command or another backup/file-sync program. So this prevents corruption from spreading to your backups and, of course, to other data. For home users this is nice but not extremely crucial. For companies it is essential: a bank cannot occasionally transfer a trillion dollars by mistake because the application making the decisions was fed faulty data. They need end-to-end data security.

(*) I should note that ZFS' end-to-end data security stops working when RAM corruption is involved. ZFS can detect, and sometimes correct, corruption that reaches the disk because of bad RAM, but the data handed to the application remains vulnerable.
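
The checksum behind this verification is itself a per-dataset property. A small sketch with a placeholder dataset name, for anyone who wants a stronger hash than the default:

    # fletcher4 is the default checksum algorithm
    zfs get checksum tank/documents

    # have newly written blocks checksummed with SHA-256 instead
    zfs set checksum=sha256 tank/documents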

Some other RAID implementations may or may not stop cold with a failed stripe. Likewise, Linux' RAID handles non-TLER disks just fine, along with degraded arrays.
We agree here. I would add BSD GEOM software RAID, which is among the best and in some areas better than what Linux offers. But popular engines like Windows FakeRAID (Intel/AMD/nVidia/Silicon Image/ASMedia/Promise/JMicron, etc.) and quite a few hardware RAID cards, including the popular Areca series, are doomed. I should add that they may have fixed the worst of these issues in recent firmware I am not aware of.

But this issue was swept under the carpet for a long time. Stupid RAID engines kick the disk out of the array and immediately write updated metadata to the remaining disks to reflect the new (broken/degraded) array status, so a power-cycle or reboot will not fix it. Many home users have lost their data this way, because a single stupid bad sector caused a badly designed RAID engine to panic. It makes me both sad and angry when I consider how many people across the globe this affects. Technology should be smart, sleek and sexy. ZFS is just so much better in so many ways. There will be something better in the future, but right now ZFS is miles ahead of legacy RAID and the 2nd generation filesystem crap that cannot even detect corruption to begin with.
 

thecoolnessrune

Diamond Member
Jun 8, 2005
9,672
578
126
Another point about rebuild times is that traditional RAID only sees a RAID set. Rebuilding a RAID array requires that every stripe on the drive be written to the replacement drive, whether that stripe actually contains data or not. With newer-generation filesystems like ZFS, ReFS, and BtrFS, rebuilds only take as long as needed to copy the data actually on the drive. That means that for arrays following the best practice of keeping at least 20% free space, rebuilds are noticeably faster.
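
A rough illustration of what that looks like on ZFS, with placeholder pool and device names: replacing a failed disk starts a resilver that walks the block tree and rewrites only the blocks that are actually in use.

    # replace the failed disk da3 with the new disk da7
    zpool replace tank da3 da7

    # watch progress; the resilver finishes once the allocated data has been rewritten
    zpool status tank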
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
If you set it on the entire filesystem, then yes. But ZFS can set these things on a per-filesystem basis. So you can assign copies=2 or even copies=3 to your important data like documents, and leave the default copies=1 for bulk data like downloads, videos, etc. Generally, documents are only 0.1-2% of total data, so it is 'cheap' to enable the feature.
And now you have to know where those things are, that the FS can do that, how it can, how to set it, what to set it to...

Once you start doing things that require specific knowledge, you need someone with habits that wouldn't necessarily fit a user of a prebuilt NAS box.

So unreadable sectors (UREs) should never cause damage to the ZFS filesystem itself, only to specific files - this is true even on a single-disk configuration. The chance that both metadata copies are affected by unreadable sectors is statistically very small. And when using a redundant configuration, you multiply the protection, because the extra copies exist on top of the parity/mirror redundancy.
Which, if the URE or otherwise bad data occurs in your Excel file instead, helps you all of...none. It's a hyped-up feature that does basically nothing until combined with RAID for real-time use, and/or backups for point-in-time recovery. Once you do that, it gives better protection for your data than prior systems, but then you also incur the need to actively manage your data on the device - i.e., you have to have a sysadmin hat you can put on (but not a fedora...let that fad pass, and let us have some diversity amongst hats!). Multiple copies of the same metadata help in the case of power loss before a complete flush of the data. Past that, as a user, I really couldn't be made to care, unless my data gets the same treatment, which usually means RAID.

IE, if I have backups, and some metadata went bad, I can use backups to restore what that affected; if I have backups, and file data went bad, I can use backups to restore that file data. If I don't have backups (since, if I do, this won't really matter enough to bother doing anything about), randomly distributed errors are far more likely to occur within a file than within metadata, and the file is not at all protected.

You want RAID for such protection anyway. Copies is nice, but it's also cheap to add disks. ZFS can do it better, but then how much protection is that offering for low-value data to few users, over a simpler storage solution? And, while I'm at it, when is BTRFS going to get ubiquitously used for mirror-based RAIDs in commercial NAS products? :)

What do you mean exactly?
When you start doing things that require dropping to a shell, or knowing small details of the file system, you're no longer dealing with features that are accessible to users who may buy a prebuilt NAS, nor features that should be accessible to them, if they are. Even when they are available in the GUI, what are they actually setting? When tweaks are required for some touted feature to be present, it's not realistically going to be present, unless the whole thing is being implemented by a seasoned or interested *n*x user, who's not going to buy a black box for his data in the first place.

I don't really understand what you mean. If you copy the files to another directory, how do you know they weren't corrupt to begin with? How do you know the file on the destination is the same as the original after the copy? Do you check the checksum? And if you do, doesn't the read I/O get served from the RAM file cache instead of re-reading the data from disk to verify it?
Using BTRFS, ReFS, or ZFS, you get verification of them at the block level. However, even so, you can't know they weren't corrupt to begin with. That was part of my point: you only know that corruption wasn't added after they were written to the computer housing the checksumming FS. If they got corrupted by a NIC, a switch, your client PC, or any other point along the way, all bets are off. That being the case, CRCs done by backup/sync programs are nearly as effective for data that gets updated often (and, if made on the client PC before the data reaches the NAS, possibly more effective).
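
One hedged example of what such a client-side check could look like, assuming rsync and placeholder paths: the --checksum flag makes rsync compare full file checksums instead of just size and modification time, so content differences between the client copy and the NAS copy are caught during the sync.

    # compare every file by checksum rather than size/mtime before transferring
    rsync -av --checksum /home/user/documents/ /mnt/nas/documents/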

With multiple computers, each using such file systems (even if not the same ones), and CRC-enabled software transport of the file data (normally CIFS), you're in much better shape. Today that can even be done with Windows, though you have to have multiple HDDs and use them accordingly. So it's not all doom and gloom (I really hate that ReFS was not designed to fully replace NTFS - IMO, that's as bad from MS as Intel's constant ECC/no-ECC market segmentation), but it is also not automatic.

I have had no confirmation that any RAID engine ever actually corrects bad sectors from redundancy. I have heard claims, but nothing substantive like an official document.
Until very recently, it hasn't been a good thing to do. Let the drive remap it, or kick the disk, and degrade the array. If more sectors may go bad, or that sector was weak, or it's a driver or firmware bug causing it, it's not a good idea to do anything with it, except maybe to get the disk back to being responsive (not a problem with RAID drives $$$). Continuing to use the disk until things get rather bad is a behavior that is only fairly safe for your data due to the checksumming of the FS in combination with the FS' integrated RAID.

But this issue was swept under the carpet for a long time. Stupid RAID engines kick the disk out of the array and immediately write updated metadata to the remaining disks to reflect the new (broken/degraded) array status, so a power-cycle or reboot will not fix it. Many home users have lost their data this way, because a single stupid bad sector caused a badly designed RAID engine to panic. It makes me both sad and angry when I consider how many people across the globe this affects. Technology should be smart, sleek and sexy. ZFS is just so much better in so many ways. There will be something better in the future, but right now ZFS is miles ahead of legacy RAID and the 2nd generation filesystem crap that cannot even detect corruption to begin with.
:thumbsup: I do not at all disagree with any of this. It's just that many of the touted features are not as significant as they're hyped to be in the context of many NASes, and that some of the features rely on a willingness to learn the system, to not be afraid of it, to regularly check up on it, etc. - and failure to do so on the user's part leaves them no better off than with a Synology box, and sometimes worse. I also think it's kind of sad that companies like Buffalo, Drobo, Synology, etc., are not helping to mature BTRFS with part-time work from their own people, or simply funding experts already working on it, since their users could stand to benefit from it (I did read that Netgear is using it, though).
 

Emulex

Diamond Member
Jan 28, 2001
9,759
1
71
Without end-to-end ECC, ZFS seems good, but it is not that hard to build an end-to-end ECC system - from CPU to RAM to PCI bus to NIC/RAID card to SAS storage devices. I think it would be wise to use an end-to-end ECC system regardless of filesystem, since all that is great about ZFS won't help if a bit flip occurs in the NIC (not all Intel server NICs have ECC!) and spits bit rot out to your NFS/iSCSI/SMB client - which defeats the purpose of ZFS!