ZFS and 512n vs 512e vs 4kN drives

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
Guys, I will try to keep it very short as I tend to get into unnecessary details.

My ultimate goal is to choose between 512n, 512e, or pure 4kN drives for a new ZFS storage appliance that will consist of eight 4TB drives, either in four mirror vdevs or possibly in RAIDZ2, the latter to mitigate any unrecoverable read errors that might occur while resilvering a mirror after a disk failure.

I understand the difference between 512n, 512e and 4kN drives, and that 512e drives are basically 4k-sector drives with emulation for backward compatibility with software that only supports 512-byte sectors. I also know that 4k drives are more format-efficient, since there are eight times fewer inter-sector gaps and sync/address markers.

Still, it is quite difficult for me to decide which drives to choose: with HGST (Ultrastar 7K6000 SAS) I can pick between 512n, 512e and 4kN, and with Seagate (Enterprise Capacity 3.5 SAS) between 512e and 4kN.

One thing that bothers me, given that ZFS will use the physical sector size when creating vdevs, is whether there would be any difference between 512e and 4kN drives of the same model when they are used in ashift=12 vdevs. My understanding is that ZFS treats 512e and 4kN drives the same way in such vdevs, and my guess is that purely 4k-aligned reads and writes to a 512e drive would never hit the drive's emulation (read-modify-write) path, so there would be none of the performance degradation you get when a 512e drive is fed 512-byte I/O. Would I be right to assume that?
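
(For context, and in case it matters to the answer, this is roughly how I intend to create the mirror pool and check both what the drives report and what ashift ZFS actually picked. Device names are hypothetical, and I am on ZFS on Linux; on FreeBSD the minimum ashift is normally forced via the vfs.zfs.min_auto_ashift sysctl rather than the -o ashift option.)

Code:
# what the drives report (512e shows 512 logical / 4096 physical, 4kN shows 4096 / 4096)
lsblk -o NAME,LOG-SEC,PHY-SEC
smartctl -i /dev/sda | grep -i 'sector size'

# force 4k alignment at pool creation (hypothetical device names)
zpool create -o ashift=12 tank \
    mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh

# confirm the ashift ZFS actually used
zdb -C tank | grep ashift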

I am trying to figure out whether there is any reason to go for pure 4kN drives, which would add another 15% to my bill, as opposed to 512e drives.

On the other side of the coin, since HGST also offers 512n 4TB drives, I am trying to figure out whether those would be of any benefit beyond reducing allocation overhead for certain record sizes in RAIDZ setups. And I do understand that 512n drives could be a little slower, since more sectors have to be addressed for the same amount of data.

Any opinions are highly appreciated!


~Cheers~
 

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
Can't help with the ZFS ashift stuff, but there are some pretty knowledgeable folks here who know a lot about ZFS. However, I did want to clear one thing up.

You have the URE issue as it pertains to mirrors and parity RAID backwards. During a rebuild, mirrors are immune to UREs in the sense that a corrupted bit in a sector (or a totally unreadable sector) won't affect a resync (remirror). That is because mirrors don't resilver like parity RAID does; they simply remirror. So a corrupted sector or a flipped bit in a sector is just copied over to the new disk. The data is, of course, still lost, but the array is still rebuilt. In single-parity RAID (RAID 5) this would kill the rebuild. RAID 6 can still handle a single URE when it's already in a degraded state.

Mirrors are pretty damn safe. The question of RAID 6 vs RAID 10 usually comes down to capacity and whether you are willing to give some up to take advantage of RAID 10's speed over parity RAID.

Anyway, someone else should come along for your ZFS questions.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
Get 4kN drives.

You're going to need new drives someday, and there is no guarantee that you'll want to, or be able to, acquire 512-sector drives later. You certainly will be able to find 4k drives.

Also, ZFS really doesn't need enterprise SAS drives. It works really well on ordinary SATA drives. Save your money there instead of messing around with the block reporting size.

eight 4TB drives in four mirror VDEVs, or eventually Raidz2
You know that vdevs are immutable, right? You can't start with a mirror vdev and convert it to a raidz1/2/3. You can ADD a raidz2 vdev to a zpool consisting of mirrored vdevs later, but that isn't usually something people want to do.
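
(If you ever did want to, it would be something like the following, with hypothetical device names; and keep in mind that once a top-level vdev is added it can't be removed from the pool again.)

Code:
# 'tank' already consists of mirror vdevs; this bolts a raidz2 vdev onto it
zpool add tank raidz2 da8 da9 da10 da11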
 

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
Get 4kN drives.

You're going to need new drives someday, and there is no guarantee that you'll want to, or be able to, acquire 512-sector drives later. You certainly will be able to find 4k drives.

Also, ZFS really doesn't need enterprise SAS drives. It works really well on ordinary SATA drives. Save your money there instead of messing around with the block reporting size.

You know that vdevs are immutable, right? You can't start with a mirror vdev and convert it to a raidz1/2/3. You can ADD a raidz2 vdev to a zpool consisting of mirrored vdevs later, but that isn't usually something people want to do.

As far as I understand, you can still replace 512n drives with 512e drives, but you cannot replace them with 4kN.
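
(For what it's worth, the replace command itself is the same either way; whether ZFS accepts the new disk seems to come down to the vdev's ashift versus the new drive's logical sector size. Hypothetical device names below.)

Code:
# replacing a failed da3 with a new da9
zpool replace tank da3 da9
# - in an ashift=12 vdev, 512n, 512e and 4kN replacements should all be accepted
# - in an ashift=9 vdev (built on 512n drives), a 4kN drive should be refused,
#   since its 4096-byte logical sectors can't provide 512-byte alignment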

The real question here though is:

1) Is there any difference between 512e and 4kN, from ZFS's point of view and from the drive's point of view when ZFS reads from and writes to it, that would justify paying 15%-20% more for 4kN than I would for 512e?

2) If I am guaranteed I can get 512n replacements over my 5-year warranty, is there any reason besides space efficiency (which would only apply to RAIDZ) to choose those instead?

I understand the differences between different raid setups with ZFS and I've been using ZFS for many years already.

SAS is required by design, as there are two heads in a redundant setup that will be accessing the drives and serving CIFS and iSCSI to clients.

~Cheers~
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,731
155
106
The pure 4kN option does command a high premium.
I'm a bit disappointed in that. You're unlikely to notice any performance difference between them and 512e drives once everything is installed and configured.
I was really hoping 4kN drives would just become the norm for all segments, but it's taking time.
 

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
The pure 4kN option does command a high premium.
Which really should not be the case, given that they have fewer elements (lacking the processing used for the emulation tasks).

I'm a bit disappointed in that. You're unlikely to notice any performance difference between them and 512e drives once everything is installed and configured.
Still, I would like to understand whether the behavior would be the same between the two drives. Theoretically speaking, and speculating of course, I agree that if ZFS treats both drives the same, there should not be any difference in performance.

I was really hoping 4kN drives would just become the norm for all segments, but it's taking time.
I agree. If we want to see these drives take off, they should actually be less expensive, given that they have less hardware on the controller board.

Off-topic, but maybe worth mentioning: I was just about to get the Seagate Enterprise Capacity 3.5 SAS 4TB ST4000NM0034 drives, but I was disappointed to see that in consecutive datasheets posted on Seagate's website, the first one (March 2014) http://www.seagate.com/www-content/...prise-capacity-3-5-hdd-v4-ds1791-3-1403us.pdf lists the drive at 1.4M hours MTBF, while the second one (October 2014) http://www.seagate.com/www-content/...terprise-capacity-3-5-hdd-ds1791-8-1410us.pdf lists the same drive at 2M hours MTBF. I find this inconsistent, and not trustworthy enough to convince me to switch from HGST; my previous reasoning for switching was how quickly Seagate handles warranty issues.
 
Last edited:

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
I think if you look around, the consensus is that MTBF is basically a marketing gimmick. I don't even look at that figure on drive data sheets.

Stay with HGST though. They make spectacular drives.
 

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
I think if you look around, the consensus is that MTBF is basically a marketing gimmick. I don't even look at that figure on drive data sheets.

Stay with HGST though. They make spectacular drives.

Off-topic, but I am very happy with Hitachi and I would like to keep using them; in this particular case, though, I have to pay a 35% premium for the HGST Ultrastar 7K6000 4TB over the Seagate Enterprise Capacity 3.5 4TB, which doesn't really make me happy.
 

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
Have you priced out the 7K4000 line? Those are a generation older but still in production and a solid drive. I purchased two of the 3TB SAS models recently when I rebuilt one of my production servers at work. The difference in speed between the 4000 and 6000, I would imagine, would be negligible for real-world scenarios.

Perhaps I missed it, but you never shared what the intended purpose of this server will be. Though I'm guessing that with 3.5" drives coupled with parity ZFS, speed isn't the driving factor.
 

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
Have you priced out the 7K4000 line? Those are a generation older but still in production and a solid drive. I purchased two of the 3TB SAS models recently when I rebuilt one of my production servers at work. The difference in speed between the 4000 and 6000, I would imagine, would be negligible for real-world scenarios.

Perhaps I missed it, but you never shared what the intended purpose of this server will be. Though I'm guessing that with 3.5" drives coupled with parity ZFS, speed isn't the driving factor.

Since I need SAS drives by design, the only available SAS option in the 7K4000 line is 512n, and the price difference for a new drive compared to a 512e 7K6000 is only $10/drive in favor of the 7K4000.

The storage would be used by two hypervisors with about 30 VMs each, a mixed set of Linux and Windows machines. On top of that, it would serve CIFS, including a few users (4) who regularly read/write large files (50MB-2GB). The hypervisors and the users accessing these large files will have 10Gb/s connectivity to the storage.

As for speed, I would really like to go with mirror vdevs and benefit from the better IO and write speeds, together with much faster resilvering times, but what bothers me is the chance of unrecoverable read errors on the remaining drive while resilvering a degraded mirrored vdev.

The other way would be to create two RAIDZ2 vdevs of 4 drives each so that writes are distributed across them. That would still sacrifice 50% of the drives, as with mirrors, and speed would still not match 8 drives in 4 mirrored vdevs, but I would have better odds in case of a failure.
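
(Roughly this layout, with hypothetical device names:)

Code:
# two 4-disk raidz2 vdevs in one pool; ZFS stripes writes across both vdevs
zpool create -o ashift=12 tank \
    raidz2 da0 da1 da2 da3 \
    raidz2 da4 da5 da6 da7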

Eventually, another option would be to go with mirror vdevs and create a separate RAIDZ2 pool of 8 (cheaper) drives for backups, but that is unfortunately out of the budget right now.
 

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
As for speed, I would really like to go with mirror vdevs and benefit from the better IO and write speeds, together with much faster resilvering times, but what bothers me is the chance of unrecoverable read errors on the remaining drive while resilvering a degraded mirrored vdev.

You have URE's backwards. Go re-read my post #2 at the top. Mirrors don't resilver, they remirror. URE's do not affect a mirrored array's ability to remirror at all. You are actually safer with mirrors vs parity as far as UREs are concerned.

So, you're building out a SAN and using it partly as a NAS as well. What kind of IOPS are you expecting?
 
Last edited:

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
You have URE's backwards. Go re-read my post #2 at the top. Mirrors don't resilver, they remirror. URE's do not affect a mirrored array's ability to remirror at all. You are actually safer with mirrors vs parity as far as UREs are concerned.

Pardon me if I didn't use the right term, I thought these were always called resilvers? I know how mirrors work, but again my point here is: okay, you will be able to re-mirror and keep all of your other data if there are UREs on the remaining drive, but you also end up with data corruption for any data residing on sectors that couldn't be read while re-mirroring.

So, you're building out a SAN and using it partly as a NAS as well. What kind of IOPS are you expecting?

That's correct, it will have to be used as a NAS as well. Unfortunately, I don't have any estimates on IOPS right now, as the old environment is not monitored at all. The current setup that will be replaced has six 1TB drives in RAIDZ2 with only 8GB of RAM and one SSD used for the system, read cache, and write cache (don't know who set it up that way) :D
 

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
Pardon me if I didn't use the right term, I thought these were always called resilvers? I know how mirrors work, but again my point here is: okay, you will be able to re-mirror and keep all of your other data if there are UREs on the remaining drive, but you also end up with data corruption for any data residing on sectors that couldn't be read while re-mirroring.

Nope, mirrors resync/re-mirror and not resilver. But...

A URE on a parity array vs. a URE on a mirrored array results in the same condition: some type of data loss. It could be catastrophic on a parity array, in that the array can't be rebuilt at all, depending on the RAID level. It has absolutely no effect whatsoever on the rebuild of a mirrored array.

ZFS has all kinds of nifty features built in to safeguard against UREs (e.g. checksumming) but, if I'm not mistaken, those safeguards apply to both parity and mirrored vdevs. So you aren't gaining any more protection against UREs using RAIDZ than you would with regular mirrors. You aren't losing any features either; ZFS does its thing regardless of the RAID level. In other words, you aren't safer from UREs choosing a RAIDZ level vs mirrors. You are actually less safe, since the mirror will always rebuild even if a URE happens... Hopefully that makes sense.

That's correct, it will have to be used as a NAS as well. Unfortunately, I don't have any estimates on IOPS right now, as the old environment is not monitored at all. The current setup that will be replaced has six 1TB drives in RAIDZ2 with only 8GB of RAM and one SSD used for the system, read cache, and write cache (don't know who set it up that way)

Goodness. The existing infrastructure runs 60 VMs (twin hosts with thirty VMs each) off of that??? And it's responsive???
 
Last edited:
Feb 25, 2011
16,983
1,616
126
Nope, mirrors resync/re-mirror and not resilver. But...

A URE on a parity array vs. a URE on a mirrored array results in the same condition: some type of data loss. It could be catastrophic on a parity array, in that the array can't be rebuilt at all, depending on the RAID level. It has absolutely no effect whatsoever on the rebuild of a mirrored array.

ZFS has all kinds of nifty features built in to safeguard against UREs (e.g. checksumming) but, if I'm not mistaken, those safeguards apply to both parity and mirrored vdevs. So you aren't gaining any more protection against UREs using RAIDZ than you would with regular mirrors. You aren't losing any features either; ZFS does its thing regardless of the RAID level. In other words, you aren't safer from UREs choosing a RAIDZ level vs mirrors. You are actually less safe, since the mirror will always rebuild even if a URE happens... Hopefully that makes sense.



Goodness. The existing infrastructure runs 60 VMs (twin hosts with thirty VMs each) off of that??? And it's responsive???

A single SSD could push enough IOPS to make that work, if it's lightly loaded and/or total amount of data written every day is fairly small.
 

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
Nope, mirrors resync/re-mirror and not resilver. But...

A URE on a parity array vs. a URE on a mirrored array results in the same condition: some type of data loss. It could be catastrophic on a parity array, in that the array can't be rebuilt at all, depending on the RAID level. It has absolutely no effect whatsoever on the rebuild of a mirrored array.

ZFS has all kinds of nifty features built in to safeguard against UREs (e.g. checksumming) but, if I'm not mistaken, those safeguards apply to both parity and mirrored vdevs. So you aren't gaining any more protection against UREs using RAIDZ than you would with regular mirrors. You aren't losing any features either; ZFS does its thing regardless of the RAID level. In other words, you aren't safer from UREs choosing a RAIDZ level vs mirrors. You are actually less safe, since the mirror will always rebuild even if a URE happens... Hopefully that makes sense.

Well, I will have to disagree with you. If you have a RAIDZ2 with one failed drive and you resilver from all the other drives that hold data and parity, then even if another drive hits a URE or serves bad data, that will be detected because the checksum won't match. In that case the remaining redundancy is used to reconstruct the data. I am not sure what happens to the drive that serves the bad parity. That's the safety I am talking about. With a mirror vdev you will re-mirror and end up with the bad data mirrored, so silent data corruption. That's how I understand it; please correct me if you are sure I am wrong.

Goodness. The existing infrastructure runs 60 VMs (twin hosts with thirty VMs each) off of that??? And it's responsive???
Yep, they've been running that for years, maybe about seven. The only difference is that it doesn't serve as a NAS right now; it was only used for the hypervisors. The Windows servers are already not that responsive, even though they are legacy versions :D:eek::)
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
so silent data corruption

Even if you have an error, you still have the checksum. You will know about the error, you just won't know if the error is in the data or the checksum. This is loud data corruption, not silent.
 

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
Even if you have an error, you still have the checksum. You will know about the error, you just won't know if the error is in the data or the checksum. This is loud data corruption, not silent.

From what I remember, if a checksum cannot be verified in a healthy mirror, ZFS will try to re-read the data, and if it fails the checksum again it will pull the data from the other member of the mirror and try to repair (write back) the data on the bad member.
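
(That is also the path a scrub exercises, as far as I know -- assuming a pool named 'tank':)

Code:
zpool scrub tank
zpool status -v tank   # the CKSUM column counts checksum errors per device; with a healthy second copy they get repaired from it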

If a re-mirror is in progress, would the bad data just be cloned to the new drive? I can only wonder how you would figure out which data has been corrupted if this is used as shared storage with iSCSI LUNs exported to hypervisors.
 
Last edited:

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
Well, I will have to disagree with you. If you have a RAIDZ2 with one failed drive and you resilver from all the other drives that hold data and parity, then even if another drive hits a URE or serves bad data, that will be detected because the checksum won't match. In that case the remaining redundancy is used to reconstruct the data. I am not sure what happens to the drive that serves the bad parity. That's the safety I am talking about. With a mirror vdev you will re-mirror and end up with the bad data mirrored, so silent data corruption. That's how I understand it; please correct me if you are sure I am wrong.

Now I understand what you are saying. Yes, what you are saying technically is true. But there are other factors at play. I'd argue your risk of hitting a URE on a RAID 10 rebuild vs 6 is much much lower. That is because you only have to read the mirror to rebuild the failed drive in RAID 10. In parity, you have to read EVERY drive in the array to rebuild the failed drive. More sectors read means a higher chance of read failure.

RAID 6 is still a good choice depending on its application. I wasn't trying to dissuade you from it, I just wasn't understanding what you were saying.
 
Last edited:

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
A single SSD could push enough IOPS to make that work, if it's lightly loaded and/or total amount of data written every day is fairly small.

Sure, an SSD could do that depending on the application. My surprise was due to having six winchesters (3.5", presumably) running parity RAID and hosting 60 VMs.
 

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
Now I understand what you are saying. Yes, what you are saying technically is true. But there are other factors at play. I'd argue your risk of hitting a URE on a RAID 10 rebuild vs 6 is much much lower. That is because you only have to read the mirror to rebuild the failed drive in RAID 10. In parity, you have to read EVERY drive in the array to rebuild the failed drive. More sectors read means a higher chance of read failure.

RAID 6 is still a good choice depending on its application. I wasn't trying to dissuade you from it, I just wasn't understanding what you were saying.

Sorry if I wasn't clear enough in my statements so far ;(

Don't get me wrong, I desperately want to set this up as a pool of mirrors. I just keep fighting with myself about it, especially given that there would be no backup in place, and I am not the one deciding on that.

Theoretically speaking, don't you have a higher chance of hitting an error reading six times more data from one drive, compared to reading the same amount of data spread across six different drives?
 

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
Sorry if I wasn't clear enough in my statements so far ;(

Don't get me wrong, I desperately want to set this up as a pool of mirrors. I just keep fighting with myself about it, especially given that there would be no backup in place, and I am not the one deciding on that.

Wow. If this is a production system... Wow. Mgmt is dropping the ball hard here. What is their answer to the array failing or someone accidentally erasing some data they didn't mean to?

Theoretically speaking, don't you have a higher chance of hitting an error reading six times more data from one drive, compared to reading the same amount of data spread across six different drives?

Reading all those sectors from all those drives spread across the array increases your chances. It also slows down the entire array. In your case, you would have to read 28TB (seven 4TB drives) to rebuild the single 4TB drive that failed in parity raid. That is one big plus about RAID 10. Rebuilds are fast and are only constrained to one mirror. So if you don't have any data on that mirror you need to access you don't even notice the array is in resync. The rebuilds on RAID 6 can just suck especially if the array is also in heavy use during the rebuild. I think ZFS mitigates parity raid resilvers somewhat by only reading sectors that have data on them. This does shorten the time but, still, 20+TB of data being resilvered will take days and you're degraded the entire time during the resilver.

Wonder who sold them on a SAN to begin with since you only have two hosts. So much cheaper and easier to stuff those servers full of drives and replicate VMs between the two hosts.
 
Last edited:

FlawlessMind

Junior Member
Dec 3, 2015
12
0
0
Wow. If this is a production system... Wow. Mgmt is dropping the ball hard here. What is their answer to the array failing or someone accidentally erasing some data they didn't mean to?
Well, just another reason I hate non-profits and doing projects for such organizations. They understand the consequences, but they just don't want to spend the extra money LOL

Reading all those sectors from all those drives spread across the array increases your chance.
Well, I am not sure about that. The more you read from a drive, the higher the chance of a URE, but somebody with better math memories should calculate the odds, since increasing the number of disks also increases the overall probability of hitting a URE.
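
A very rough attempt, taking the datasheet spec of 1 unrecoverable error per 10^15 bits read at face value and assuming errors are independent (both big assumptions): re-mirroring one side of a 4TB mirror means reading about 4 x 10^12 bytes = 3.2 x 10^13 bits, so the expected number of UREs is about 0.03, i.e. roughly a 3% chance of hitting at least one. Rebuilding a failed drive in an 8-disk RAIDZ2 means reading up to 7 x 4TB = 28TB, about 2.2 x 10^14 bits, for an expectation of about 0.22, i.e. roughly a 20% chance of at least one, before accounting for ZFS only reading allocated blocks and for RAIDZ2's second parity being able to absorb it.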

It also slows down the entire array. That is one big plus about RAID 10. Rebuilds are fast and are only constrained to one mirror. So if you don't have any data on that mirror you need to access you don't even notice the array is in resync. The rebuilds on RAID 6 can just suck especially if the array is also in heavy use during the rebuild.
You are right; my entire reasoning in environments where there are actual backups rests on that: better IOPS, less effect on speed, much faster resync times.

I think ZFS mitigates parity raid resilvers somewhat by only reading sectors that have data on them. This does shorten the time but, still, 20+TB of data being resilvered will take days and you're degraded the entire time during the resilver.
I don't know about parity arrays, but it makes sense to recover only the data that actually has parity information, plus any parity that resided on the failed drive. For mirrors it is clearly stated in the docs that only actual data sectors are synced, not the entire drive.


P.S. This turned into a good talk today :) We should hang out and have some beers if you are in the Bay Area :)
 

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
Well, I am not sure about that. The more you read from a drive, the higher the chance of a URE, but somebody with better math memories should calculate the odds, since increasing the number of disks also increases the overall probability of hitting a URE.

The idea is you have to read EVERY sector of EVERY drive for a parity rebuild (unless ZFS does it differently). You only have to read every sector of ONE drive for a mirror rebuild. So the chances of hitting that infamous URE increase during the rebuild phase with parity RAID. That is what I was taught, at least. Perhaps I'm mistaken, but it's too late to google right now...

P.S. This turned into a good talk today. We should hang out and have some beers if you are in the Bay Area.

I love talking about storage. I started working for an ISP back in the late 90's and was totally enamoured with networks back then. Still am but my love for storage has eclipsed that over the years. I'd take you up on grabbing a beer in a heartbeat. Unfortunately, I'm about 2000 miles away down in Florida!

Good luck on your project. Hopefully you can talk some sense into them and get some type of backup system added to the build!
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
The idea is you have to read EVERY sector of EVERY drive for a parity rebuild (unless ZFS does it differently).

It does. ZFS walks the data's Merkle tree during a rebuild; it doesn't blindly XOR all of the blocks together. For the same reason that 'zpool create' is extremely fast for a raidz1/2/3 and doesn't require an array sync like mdadm or similar do, resilvering doesn't require you to read and XOR blocks that are not in use.
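
(Easy to see in practice, if you want -- hypothetical device names:)

Code:
# ZFS: a new raidz2 pool is usable immediately; empty space is never synced
zpool create tank raidz2 da0 da1 da2 da3 da4 da5
zpool status tank            # reports ONLINE right away

# mdadm: a new RAID 6 array starts a background resync over every sector
mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]
cat /proc/mdstat             # shows the initial resync grinding through the whole array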
 

frowertr

Golden Member
Apr 17, 2010
1,372
41
91
It does. ZFS walks the data's Merkle tree during a rebuild; it doesn't blindly XOR all of the blocks together. For the same reason that 'zpool create' is extremely fast for a raidz1/2/3 and doesn't require an array sync like mdadm or similar do, resilvering doesn't require you to read and XOR blocks that are not in use.

Yeah, this is kinda what I thought, but I wasn't sure how it did that or what it was called.

Thanks! That's good to know.