ZFS: Stripe & Parity, or Stripe & Mirror? Also... SLOG & L2ARC?

destrekor

Lifer
Nov 18, 2005
28,799
359
126
I'm planning to build a NAS for a home server that has a primary focus on media storage for streaming playback, but also as a self-contained cloud storage/sync for access elsewhere. Some local file saves and scheduled backups from other systems will also find a home on the server.

With that in mind, what is the best balance of available storage and performance with data integrity/safety in mind? I hear there are a lot of fans of RAID10, or the equivalent in ZFS land.

I'm thinking of starting with 6 disks; the next two I buy would either be spares on hand or hot-spares, or I might add additional drives to the pool (plus spares).

So with 6, RAIDZ2 would net 4 disks of space, and Stripe+Mirror would net 3 disks of space, striped across 3 mirrored vdevs. It sounds like the RAID10 approach offers greater IOPS, but I'd fear a disk failing, because if the surviving disk in a mirrored pair encounters an error while its data is being copied to the replacement, the whole thing is toast.

The idea is you can lose more drives with that approach, so long as no two failed disks are in the same pair.

But with any parity-based stripe, like in RAIDZ2, the rebuild/resilvering time will be much longer than in a RAID10-type array.

I'm not necessarily against RAID10 or similar just because it takes away from usable space... On the contrary, I'll accept that as the price of good data security. I just want to ensure that the method I choose is a good match for my home use purposes.


Also, from what I gather, a SLOG(ZIL) & L2ARC are likely to be a waste of money for this type of use case, but if anyone has any opinions to the contrary, please do tell!
 
Feb 25, 2011
16,776
1,466
126
I'm planning to build a NAS for a home server that has a primary focus on media storage for streaming playback, but also as a self-contained cloud storage/sync for access elsewhere. Some local file saves and scheduled backups from other systems will also find a home on the server.

With that in mind, what is the best balance of available storage and performance with data integrity/safety in mind? I hear there are a lot of fans of RAID10, or the equivalent in ZFS land.

RAID10 is usually the best performance option, particularly for random IO. For sequential reads/writes, the margin is narrower.

The typical method in ZFS is to create a bunch of two-drive mirrors, and then lay a stripe across those mirrors.

http://www.zfsbuild.com/2010/06/03/howto-create-striped-mirror-vdev-pool/

The inverse (creating a mirror of two larger stripes) is not commonly done, but I've not read anything specific as to why. (I think it's just more fragile in terms of device failures.)
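
For reference, that striped-mirror layout is a one-liner to create - a rough sketch with a made-up pool name and device names:

Code:
# three two-way mirrors; ZFS stripes writes across all three vdevs
zpool create tank mirror da0 da1  mirror da2 da3  mirror da4 da5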

I'm thinking of starting with 6 disks, and either the next two I buy are spares on hand or hot-spares, or also add additional drives to the pool (and also spares).
ZFS doesn't really let you do that. Once you create a vdev with X number of devices, you can only swap/replace those devices, not increase the number of devices in that vdev.

So my 5-drive RAIDZ1 will always be a 5-drive RAIDZ1, although I can replace all the drives with bigger disks if I want.

You could, I suppose, create a mirror of two stripes, and when you wanted to add more drives, take out a stripe, nuke it, add a disk, create a new stripe, mirror it back, and then repeat the process for the other half - but holy crap, that would be a lot of work and leave you with no safety net for way too long.

There are other software RAID types that do let you expand RAID volumes - notably Linux mdadm software RAID. (Which is the RAID implementation used by a lot of COTS NAS appliances - or some derivative of it, anyway.)

http://zackreed.me/articles/48-adding-an-extra-disk-to-an-mdadm-array
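
The short version, assuming a hypothetical RAID5/6 at /dev/md0 with an ext4 filesystem sitting directly on it:

Code:
# add the new disk, then reshape the array to use it as a data member
mdadm --add /dev/md0 /dev/sdf
mdadm --grow /dev/md0 --raid-devices=6
# watch the reshape, then grow the filesystem once it finishes
cat /proc/mdstat
resize2fs /dev/md0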

Not as awesome as ZFS (although you can get the same features by selecting the right file system to put on top of your softRAID device - but then you're one of those file system nerds, and not one of the ZFS-using cool kids).

So with 6, RAIDZ2 would net 4 disks of space, and Stripe+Mirror would net 3 disks of space, striped across 3 mirrored vdevs. It sounds like the RAID10 approach offers greater IOPS, but I'd fear a disk failing, because if the surviving disk in a mirrored pair encounters an error while its data is being copied to the replacement, the whole thing is toast.
That particular bogeyman is more a hardware-parity-RAID issue, and ZFS avoids that anyway - if you have corrupt data, sure, that data's gone, but the pool logs the fact and continues repairing itself.

Other softRAID implementations are also pretty tolerant of that. Although the real way to avoid a problem is to have multiple parity drives (RAIDZ2, RAID6, etc.)

The idea is you can lose more drives with that approach, so long as no two failed disks are in the same pair.
Gambling. RAID10 is a performance solution, it's only "tolerant" of one disk failure and anybody who risks more than that is asking for it.

But with any parity-based stripe, like in RAIDZ2, the rebuild/resilvering time will be much longer than in a RAID10-type array.
Yeah, but it happens in the background anyway.

I'm not necessarily against RAID10 or similar just because it takes away from usable space... On the contrary, I'll accept that as the price of good data security. I just want to ensure that the method I choose is a good match for my home use purposes.

Also, from what I gather, a SLOG(ZIL) & L2ARC are likely to be a waste of money for this type of use case, but if anyone has any opinions to the contrary, please do tell!
My bet is you'll be network bottlenecked 99% of the time anyways. RAID10 and a ZIL or L2ARC are (VERY) useful in the right circumstances, but IMO it's added expense you don't need for your use case.

Also, it's added complexity. If you just feed your NAS OS a crapton of RAM, any system (Windows, FreeNAS, whatever) is smart enough to use as much of it as possible as a read/write cache for storage requests. Most of the time, that's adequate.

But if you do want to use ZFS, it's probably a priority to get all 8-10 HDDs up front and configure your box the way you're going to want it to be. (Even if that means buying a little less RAM* now, or getting a single NIC and adding the second as budget permits, or whatever.)

* The primary consumer of ZFS RAM is dedup tables - the rule of thumb there is 1GB of RAM per TB of deduped data, above and beyond what you should already have. The caching and other stuff I mentioned earlier means that a "good" NAS should have at least 8 and preferably 16GB of RAM (or more if you have a lot of users) anyway. A 70TB ZFS NAS with dedup enabled and dozens of clients should have, then, at least 86GB of RAM (96GB or 128GB is a nice round number). But if you don't enable dedup, you could "get away" with having 4GB of RAM and a sluggish NAS, at least long enough for a couple paychecks to accumulate.
 
Last edited:

destrekor

Lifer
Nov 18, 2005
28,799
359
126
Lots of good information, thanks! I'll respond to a few comments that raised further questions:

RAID10 is usually the best performance option, particularly for random IO. For sequential reads/writes, the margin is narrower.

The typical method in ZFS is to create a bunch of two-drive mirrors, and then lay a stripe across those mirrors.

http://www.zfsbuild.com/2010/06/03/howto-create-striped-mirror-vdev-pool/

The inverse (creating a mirror of two larger stripes) is not commonly done, but I've not read anything specific as to why. (I think it's just more fragile in terms of device failures.)

Yeah, that's what I was getting at, pairs of mirrored drives which are then striped across.

ZFS doesn't really let you do that. Once you create a vdev with X number of devices, you can only swap/replace those devices, not increase the number of devices in that vdev.

So my 5-drive RAIDZ1 will always be a 5-drive RAIDZ1, although I can replace all the drives with bigger disks if I want.

You could, I suppose, create a mirror of two stripes, and when you wanted to add more drives, take out a stripe, nuke it, add a disk, create a new stripe, mirror it back, and then repeat the process for the other half - but holy crap, that would be a lot of work and leave you with no safety net for way too long.

That's what I thought I remembered reading, that you couldn't truly add disks. But what about with the RAID10 approach? Isn't the whole thing set up as multiple vdevs which you put into a striped pool? I hear you can add storage to a pool but not to a vdev. So if you add two drives and put them into a mirrored vdev, could you then add it to the collective striped array?

And that's also why I was considering buying maybe 3TB drives now instead of 4 or 6TB, filling up the bays in the configuration I want, and then replacing those with larger disks down the road.

That particular bogeyman is more a hardware-parity-RAID issue, and ZFS avoids that anyway - if you have corrupt data, sure, that data's gone, but the pool logs the fact and continues repairing itself.

Other softRAID implementations are also pretty tolerant of that. Although the real way to avoid a problem is to have multiple parity drives (RAIDZ2, RAID6, etc.)

Well for data integrity, ZFS does well. But my worry was more about if the mirrored disk also outright fails. Also, are Unrecoverable Read Errors (UREs) "recoverable" with ZFS? I thought the worry about UREs still held true in ZFS, so if you got one on the drive during resilvering, that drive's data was ignored and if there was no other parity then the array was toast. Or can UREs just result in lost data but not the whole array?

Gambling. RAID10 is a performance solution, it's only "tolerant" of one disk failure and anybody who risks more than that is asking for it.

Sorry, didn't mean to imply that. Obviously you try to replace the first failed drive before another one pops up, but RAIDZ2 can tolerate two failures, and RAID10 can tolerate up to one failure per mirrored pair, so 4 disks if you have 4 pairs... but of course that's a gamble and you don't wait for that, any more than you'd wait for two failures in RAIDZ2. That's more for when multiple failures occur at once or you missed the alarms for whatever reason. But in RAID10, if the second failure lands in the same pair as the first, the array's toast.

My bet is you'll be network bottlenecked 99% of the time anyways. RAID10 and a ZIL or L2ARC are (VERY) useful in the right circumstances, but IMO it's added expense you don't need for your use case.

Also, it's added complexity. If you just feed your NAS OS a crapton of RAM, any system (Windows, FreeNAS, whatever) is smart enough to use as much of it as possible as a read/write cache for storage requests. Most of the time, that's adequate.

But if you do want to use ZFS, it's probably a priority to get all 8-10 HDDs up front and configure your box the way you're going to want it to be. (Even if that means buying a little less RAM* now, or getting a single NIC and adding the second as budget permits, or whatever.)

* The primary consumer of ZFS RAM is dedup tables - the rule of thumb there is 1GB of RAM per TB of deduped data, above and beyond what you should already have. The caching and other stuff I mentioned earlier means that a "good" NAS should have at least 8 and preferably 16GB of RAM (or more if you have a lot of users) anyway. A 70TB ZFS NAS with dedup enabled and dozens of clients should have, then, at least 86GB of RAM (96GB or 128GB is a nice round number). But if you don't enable dedup, you could "get away" with having 4GB of RAM and a sluggish NAS, at least long enough for a couple paychecks to accumulate.


So with all that said, are you able to provide a recommendation? It sort of sounds like you might be leaning towards RAIDZ2 for my use case, or have I read that wrong?

I might end up with a 12TB usable volume (6x3TB in RAIDZ2), or perhaps 16TB (6x4TB). I'd probably spring for two more to have as spares on hand, so I'm not sure I'd be able to buy 10 disks of any size in total to make an 8-disk array, and I'm not sure I'd be comfortable riding it out with 8 right away without any spares, but... perhaps? Worst case, if there is a failure I can take the whole thing offline until I can get a warranty replacement. It's not mission critical, so I could take the whole server offline whenever necessary; perhaps I shouldn't need to worry about that while the disks are under warranty. That might spare me a little time to put together the extra money to order two disks to have on hand later. But still, even 8x3TB = 18TB usable in RAIDZ2 vs. 12TB in RAID10. I could probably make that happen, and could likely ride that capacity for a while before I need to grow it, and by then I might even be able to afford a whole set of 6 or 8TB disks to replace the original ones. But then the density at those sizes is also a little scary when it comes to storage arrays and UREs.


And how's the whole compression thing? I didn't realize that was a common thing in ZFS, and perhaps it is not, I just haven't been paying attention. Is it normal to utilize -- what I've seen called -- transparent compression?


edit:

And for what is primarily a home media server, is there any benefit to using a disk like the WD Red Pro or WD RE? Enough increase in speed and latency to justify the added expense at this level of usage? Or is it better to stick to the community-favorite NAS disk, the WD Red series? A little under $900 gets 8x 3TB WD Reds, whereas that would only get me 6x2TB WD Red Pros, and obviously fewer drives for the higher capacities.
 
Last edited:
Feb 25, 2011
16,776
1,466
126
That's what I thought I remembered reading, that you couldn't truly add disks. But what about with the RAID10 approach? Isn't the whole thing set up as multiple vdevs which you put into a striped pool? I hear you can add storage to a pool but not to a vdev. So if you add two drives and put them into a mirrored vdev, could you then add it to the collective striped array?

You can add storage to a pool, yes - but that's not the same thing.

http://alp-notes.blogspot.com/2011/09/adding-vdev-to-raidz-pool.html

Code:
freebsd# zpool status testpool
  pool: testpool
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        testpool        ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            /tmp/disk1  ONLINE       0     0     0
            /tmp/disk2  ONLINE       0     0     0
            /tmp/disk3  ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            /tmp/disk4  ONLINE       0     0     0
            /tmp/disk5  ONLINE       0     0     0
            /tmp/disk6  ONLINE       0     0     0

errors: No known data errors
freebsd# zpool list testpool
NAME       SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
testpool   572M   210K   572M     0%  ONLINE  -
freebsd# zfs list testpool
NAME       USED  AVAIL  REFER  MOUNTPOINT
testpool   114K   349M  28.0K  /testpool
This part shows how it works - you have two 3-disk RAIDZ's, in the same pool. From a file system standpoint, it looks like one big drive, but any given hunk of data lives on either one RAID array or the other. (Which puts an absolute cap on performance, and also means that even though he's got two drives used for parity - like RAIDZ2 - he's only single-drive-failure-tolerant - like RAIDZ1.)
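
In the striped-mirror case you're describing, "adding storage to the pool" would look something like this (made-up pool/device names):

Code:
# stripes a new two-disk mirror vdev into the existing pool
zpool add tank mirror da6 da7

Existing data stays on the old mirrors, though - only new writes get spread across the new vdev.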

And that's also why I was considering buying maybe 3TB drives now instead of 4 or 6TB, filling up the bays in the configuration I want, and then replacing those with larger disks down the road.
That'd work.
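
For when you get there, the swap-to-bigger-disks routine is roughly this (made-up names; one disk at a time, letting each resilver finish before you start the next):

Code:
# let the vdev grow on its own once every member disk is bigger
zpool set autoexpand=on tank
# swap one disk for its larger replacement, then wait for the resilver
zpool replace tank da0 da6
zpool status tank

The extra capacity only shows up after every disk in the vdev has been replaced.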

Well for data integrity, ZFS does well. But my worry was more about if the mirrored disk also outright fails.
Well yeah, if both drives in a mirror fail, you're jolly well rogered.

Also, are Unrecoverable Read Errors (UREs) "recoverable" with ZFS? I thought the worry about UREs still held true in ZFS, so if you got one on the drive during resilvering, that drive's data was ignored and if there was no other parity then the array was toast. Or can UREs just result in lost data but not the whole array?
Correct. Traditional RAID controllers mark the entire drive dead if it has a URE, cancelling the rebuild and ruining your weekend. But as anybody who's cloned a failing HDD with bad sectors knows, a URE really only applies to a particular sector of the drive - most of the data is still good.

ZFS will keep trying to read the rest of the disk too. It can't reconstitute the completely lost data (and yes, that data may be important) but it will recover what it can. (And honestly, a one-sector glitch in an audio or video file usually translates to a blip that is barely noticeable in the grand scheme of things.)
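
The routine way to catch and repair that kind of rot before a resilver ever happens is a periodic scrub - roughly:

Code:
# read every block and verify it against its checksum,
# repairing from the mirror/parity copy where possible
zpool scrub tank
# afterwards, lists any files with permanent (unrepairable) errors
zpool status -v tank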

So with all that said, are you able to provide a recommendation? It sort of sounds like you might be leaning towards RAIDZ2 for my use case, or have I read that wrong?
That's where I'm leaning, yeah.

I might end up with a 12TB usable volume (6x3TB in RAIDZ2), or perhaps 16TB (6x4TB). I'd probably spring for two more to have as spares on hand, so I'm not sure I'd be able to buy 10 disks of any size in total to make an 8-disk array, and I'm not sure I'd be comfortable riding it out with 8 right away without any spares, but... perhaps? Worst case, if there is a failure I can take the whole thing offline until I can get a warranty replacement. It's not mission critical, so I could take the whole server offline whenever necessary; perhaps I shouldn't need to worry about that while the disks are under warranty. That might spare me a little time to put together the extra money to order two disks to have on hand later. But still, even 8x3TB = 18TB usable in RAIDZ2 vs. 12TB in RAID10. I could probably make that happen, and could likely ride that capacity for a while before I need to grow it, and by then I might even be able to afford a whole set of 6 or 8TB disks to replace the original ones. But then the density at those sizes is also a little scary when it comes to storage arrays and UREs.
Which aren't anywhere near as big a problem for ZFS. Also, that's why you have 2-drive parity.

And how's the whole compression thing? I didn't realize that was a common thing in ZFS, and perhaps it is not, I just haven't been paying attention. Is it normal to utilize -- what I've seen called -- transparent compression?
Also adds overhead and kills speed. It's awesome for cold storage systems, but I wouldn't implement it for anything I'm using day to day.
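
If you decide to play with it anyway, it's just a per-dataset property (dataset name made up):

Code:
# only affects data written after the property is set
zfs set compression=lz4 tank/backups
# see how much you're actually saving
zfs get compressratio tank/backups

Your media files are already compressed, so don't expect the ratio to move much on those.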


edit:

And for what is primarily a home media server, is there any benefit to using a disk like the WD Red Pro or WD RE? Enough increase in speed and latency to justify the added expense at this level of usage? Or is it better to stick to the community-favorite NAS disk, the WD Red series? A little under $900 gets 8x 3TB WD Reds, whereas that would only get me 6x2TB WD Red Pros, and obviously fewer drives for the higher capacities.

I think WD has hedged their bets on qualifying the Reds for larger RAID arrays, but companies like QNAP have validated them in their 6 and 8-bay systems. It _should_ work just fine. Backblaze and companies of that sort use consumer HDDs in very large arrays.

Speed and latency are going to be bottlenecked by the network anyway - I wouldn't worry about it.
 
Last edited:

destrekor

Lifer
Nov 18, 2005
28,799
359
126
You can add storage to a pool, yes - but that's not the same thing.

http://alp-notes.blogspot.com/2011/09/adding-vdev-to-raidz-pool.html

Code:
freebsd# zpool status testpool
  pool: testpool
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        testpool        ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            /tmp/disk1  ONLINE       0     0     0
            /tmp/disk2  ONLINE       0     0     0
            /tmp/disk3  ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            /tmp/disk4  ONLINE       0     0     0
            /tmp/disk5  ONLINE       0     0     0
            /tmp/disk6  ONLINE       0     0     0

errors: No known data errors
freebsd# zpool list testpool
NAME       SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
testpool   572M   210K   572M     0%  ONLINE  -
freebsd# zfs list testpool
NAME       USED  AVAIL  REFER  MOUNTPOINT
testpool   114K   349M  28.0K  /testpool
This part shows how it works - you have two 3-disk RAIDZ's, in the same pool. From a file system standpoint, it looks like one big drive, but any given hunk of data lives on either one RAID array or the other. (Which puts an absolute cap on performance, and also means that even though he's got two drives used for parity - like RAIDZ2 - he's only single-drive-failure-tolerant - like RAIDZ1.)

That'd work.

Well yeah, if both drives in a mirror fail, you're jolly well rogered.

Correct. Traditional RAID controllers mark the entire drive dead if it has a URE, cancelling the rebuild and ruining your weekend. But as anybody who's cloned a failing HDD with bad sectors knows, a URE really only applies to a particular sector of the drive - most of the data is still good.

ZFS will keep trying to read the rest of the disk too. It can't reconstitute the completely lost data (and yes, that data may be important) but it will recover what it can. (And honestly, a one-sector glitch in an audio or video file usually translates to a blip that is barely noticeable in the grand scheme of things.)

That's where I'm leaning, yeah.

Which aren't anywhere near as big a problem for ZFS. Also, that's why you have 2-drive parity.

Also adds overhead and kills speed. It's awesome for cold storage systems, but I wouldn't implement it for anything I'm using day to day.




I think WD has hedged their bets on qualifying the Reds for larger RAID arrays, but companies like QNAP have validated them in their 6 and 8-bay systems. It _should_ work just fine. Backblaze and companies of that sort use consumer HDDs in very large arrays.

Speed and latency are going to be bottlenecked by the network anyway - I wouldn't worry about it.

Awesome, I think you've cleared up all the questions I've been trying to figure out on my own when it comes to ZFS.
 
Feb 25, 2011
16,776
1,466
126
A side note: RAID arrays over 12 drives are usually discouraged regardless. If you were putting, say, 36 drives in a system, you'd probably want to have 3x 12-drive RAIDZ2 arrays, and assemble them into a single pool. (Like the guy I linked to above.)

Maybe even 11-drive RAIDZ2s with a hot spare, depending on your service turnarounds.
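
Something like this, with made-up device names:

Code:
# 36 bays: three 11-wide RAIDZ2 vdevs striped into one pool, plus hot spares
zpool create bigtank \
  raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 \
  raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
  raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
  spare da33 da34 da35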
 

frowertr

Golden Member
Apr 17, 2010
1,371
41
91
Dave took care of you. He knows a ton. The only thing I'll contest is his statement about RAID 10 being just for performance. I don't agree there - it's simply safer than any parity-based RAID solution for spinning rust, in my opinion.

The odds of both drives in a two-drive mirror failing in a RAID 10 array are really low. We are talking about an actual physical failure (not a URE) where both drives are totally dead. It's a pretty small failure percentage.

UREs go unnoticed in a RAID 1 or RAID 10 array. They simply don't matter. This guarantees that a mirrored array can ALWAYS survive, at the minimum, a single drive failure. Parity RAID can't give this guarantee. A single URE in a RAID 5 resilver and you're hosed. Two in a RAID 6 resilver, see ya. The odds of running into a URE in a 6TB array during a resilver are about 50% when using drives with a URE rate of one every 12TB, which is the common rate for consumer-level drives. That's astoundingly high. A 6TB array is nothing today, considering all the crap we keep nowadays (movies, music, pictures, etc...).

Anyway, something else to consider. RAID is about trade-offs like everything else. It just depends on your needs.
 
Last edited:
Feb 25, 2011
16,776
1,466
126
Thanks! :D

UREs go unnoticed in a RAID 1 or RAID 10 array. They simply don't matter. This guarantees that a mirrored array can ALWAYS survive, at the minimum, a single drive failure. Parity RAID can't give this guarantee. A single URE in a RAID 5 resilver and you're hosed. Two in a RAID 6 resilver, see ya. The odds of running into a URE in a 6TB array during a resilver are about 50% when using drives with a URE rate of one every 12TB, which is the common rate for consumer-level drives. That's astoundingly high. A 6TB array is nothing today, considering all the crap we keep nowadays (movies, music, pictures, etc...).

Hmm, sure, but the odds of running into two UREs while rebuilding a RAID6 are pretty slim. (Probably on par with a critical second drive in a mirrored array inconveniently failing during a rebuild.)

Also, I think you're oversimplifying the math a bit, given how few UREs (none) I've run into while rebuilding* multi-TB single-parity-drive arrays (dozens of them?). Either that or the URE rate is horribly, horribly overstated by HD manufacturers as a CYA.

*waiting for the storage controller to rebuild
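
For the curious, the back-of-envelope version, assuming the spec'd consumer rate of 1 URE per 10^14 bits (roughly one per 12.5TB read) is what the drive actually delivers:

Code:
expected UREs while reading 6TB  =  6 / 12.5        ~ 0.48
P(at least one URE)              =  1 - e^(-0.48)   ~ 38%

High, sure, but not the flat 50% you get by treating the spec as a hard quota - and as I said, the drives I've seen beat that spec by a wide margin.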

This is all academic assuming you have backups, of course. (RAID is not a backup! :D ) As long as you have an easily restored, current backup, usable GB / $ is priority #1. (IMHO)
 
Last edited:

frowertr

Golden Member
Apr 17, 2010
1,371
41
91
Yup, I agree with that. It really is all about trade-offs. RAID 6 is still not terrible on mechanical drives: great for getting max capacity out of the fewest drives and allowing much easier expansion when adding disks. Its write speed is bad, but, again, trade-offs.

I have tended to totally move away from spinning rust and am basically only using/recommending/installing flash at this point. RAID 5 is making a comeback with SSDs which is nice. Brings me back to the mid-90s...
 
Last edited:

grimpr

Golden Member
Aug 21, 2007
1,095
7
81
Thanks for all the good info. I fired up a FreeBSD VM and am experimenting with ZFS RAID and mirror configurations - this thing is a beast and well worth the time to learn everything about it.