Does it make sense to run SSDs in RAID1?

mikeymikec · May 31, 2018

@John L

Continuing to argue with you is a waste of my time. When you've had an SSD fail for a reason other than it hitting the 'max host writes' scenario, come back and tell us that you were in fact wrong. It might take a while, but it will eventually happen to you. If however you think you could be convinced of something that should be obvious, perhaps you should start a poll in this sub-forum? A pretty simple question should do the trick:

"Have you had an SSD fail for reasons other than it hitting the maximum number of host writes?" Y/N

Or maybe reading a thread I posted a while ago describing the first SSD failure of a drive I bought might help:
https://forums.anandtech.com/threads/my-first-ssd-failure-sam-850-pro-256gb.2538866

John L · May 31, 2018

SSDs in any number of configurations will have different Annualized Failure Rates, even if they were 100% consistent in P/Es to the single cell. Citing non-RAID 1 experience of drives failing at different times does not indicate that there is a high distribution in failure times. The only production circumstance where two drives are running the exact same number of P/E cycles with exactly the same number of occupied sectors at exactly the same time is RAID 1. I have not seen RAID 1 significantly used in datacenter applications, except for where I've seen it fail. Suggesting that real world workloads would indicate that devices with consistent life cycles would cause all the devices to fail at the same time is illogical. Even in the RAID 5 arrays I have in the lab, the loads on each drive are substantially uneven. This in turn means that the same re-alloc, WAF, wear leveling algorithms are going to have different impacts on every one of them, further diversifying the AFR for each drive. The only real world scenario where your re-allocs, WAF, wear leveling(which are all really only particular reasons for P/E) and write P/Es are going to be substantially the same is RAID 1.

Did NetAPP or EMC sell you a RAID 1 SSD storage solution?! Is anyone?

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf
Short version, these are the top five errors generated before general failure of the drive.
Symptom DataErrors 1
Symptom ReallocSectors 0.943
Device workload TotalNANDWrites 0.526
Device workload HostWrites 0.517
Device workload TotalReads+Writes 0.516

Again, the argument is not against arrays or suggesting that SSDs don't fail. The point is that RAID 1 specifically possesses characteristics that degrade SSDs by doubling host writes to the array, halving the capacity of the array for wear leveling, reducing diversity in the factors which lead to variation in AFR.

If you run a bad firmware update against all drives in production, RAID 1 will not provide you any benefit. I haven't had a capacitor fail yet, but the power loss failure mode those capacitors protect against, RAID 1 won't protect you from. Zombie drives, where the controller has failed to recognize a failure, RAID 1 won't protect you from that either, because if the controller is failing to recognize bit errors, it will continue to attempt to read from the drive with failed writes. It only makes sense to cite circumstances in which RAID 1 would be effective to argue in favor of implementing RAID 1.

dave_the_nerd You left out a much simpler use case. Why boot of off local media at all? If you have a critical uptime service supporting a boat fund, how wouldn't you be negligent to suggest running it off of two drives of local storage? Yes, virtualization, storage clusters, and distributed compute are great, but even at it's simplest level, it's trivial today to PXE boot a conventional OS from remote media.

John L · May 31, 2018

mikeymikec said:
@John L
Or maybe reading a thread I posted a while ago describing the first SSD failure of a drive I bought might help:
https://forums.anandtech.com/threads/my-first-ssd-failure-sam-850-pro-256gb.2538866

It's like you're trying to make my point. Where the drive keeps reporting to the controller that everything is fine, the controller will not degrade the array. RAID 1 will not protect you from the very thing you just described.

Brahmzy · May 31, 2018

WTH is the argument here?
SSDs fail, rarely from endurance reasons, be it DAS, SAN, whatever. If you’re failing from endurance, you bought the wrong media for the workload.
I’ve had SSD failures from many vendors on various platforms.
RAID1 is one application of protection from losing your DAS. Is something otherwise being suggested? I will continue to run some form of RAID on my DAS with any current media that’s out there, despite this nonsense, because it has maintained the uptime of the workload it’s protecting.
Dumb argument is dumb.

mindless1 · May 31, 2018

John L said:
It's like you're trying to make my point. Where the drive keeps reporting to the controller that everything is fine, the controller will not degrade the array. RAID 1 will not protect you from the very thing you just described.

Yet it's still better than not having it for many common fault causes. There is no comprehensive guarantee, just measures one can take. Setting your server up to PXE boot so now one machine is dependent on TWO machines working properly isn't a guarantee either. So what if it booted if there's nothing (no data) to serve?

Brahmzy · May 31, 2018

Makes me wonder how long John L has actually been running real-world workloads in datacenters?

Hail The Brain Slug · May 31, 2018

Brahmzy said:
Makes me wonder how long John L has actually been running real-world workloads in datacenters?

Redacted.

Is it somehow outside the realm of possibility to have a catastrophic SSD failure that does not result in a zombie drive? I get the idea it is not.

corkyg · May 31, 2018

This is a tech forum - personal insults will not be tolerated . . . cool it!

PliotronX · Jun 1, 2018

How we got to that point is our inclination to be right but as Aristotle said, an educated mind is able to entertain a thought without accepting it so lets analyze the logic and apply it. If the MTBF, a predictable estimate by manufacturers is the only cause for failure, why do we need stackable switches? Why include redundancy in RAID-10 or RAID-5 not just run RAID-0?

Aside from that, the naysayer is probably not aware of a recent storage strategy called tiered storage. This allows for the host OS to move block level data that is repetitively used to the "hot" tier (SSD) and move less accessed data to the "cold" tier. As well write back cache can be used for masking the poor write speeds of mechanical drives (particularly with software RAID-5). Does anyone know what happens when integrity of write back cache is lost without dumping contents first?

I have implemented tiered storage with RAID-1 SSDs and RAID-1, 5, and 10 mechanical drives in five production servers so far with great effect. RAID-1 by far allows for redundancy while realizing the performance benefits of SSD without costing exorbitant amounts so it is a good fit for the SMBs that they are running at.

Brahmzy · Jun 1, 2018

^^ Sub-LUN block tiering is cool, but has been around for a LONG time - since Compellent, then 2009 EMCs Flare30. It’s a budget-minded implementation.
All that block relocation is post-process and often stale by the time the blocks get moved. All flash has been where it’s at for a while now. Life’s just better with AFAs

. But yeah, they can be pricey, but honestly, so can the tiered arrays.
I generally don’t play with host-side tiering as it is all software driven, but I could see it being valuable/cost-effective.

nosirrahx · Jun 1, 2018

RIAD 1 saved my ass once. I had a drive vanish from the array while working. Other than my Areca card scaring the hell out of me with its alarm there was no real issue.

When I finished up for the day I swapped out the bad drive and it rebuilt while I was watching TV.

Sure, a SSD is super unlikely to fail like this but the chance is not 0.

Brahmzy · Jun 1, 2018

SSDs fail exactly like that all the time. Less than platter drives, but they do fail.

TimSheetz · Jan 26, 2021

mindless1 said:
This is incorrect. SSDs do NOT consistently fail at exactly the same # of duty cycles. The # of cycles is an average and while there won't be specimens that greatly outlive that average like some HDDs do, it would be insane to think that you'd get years of service from SSDs then they all up and die the same day, let alone the same moment, unless you have an external cause like a power surge from failing PSU, lightning strike, etc.

On the contrary the odds are very high that upon one failing you have time to order and replace it if you didn't have a spare lying around. I would advise having at least one spare.

This doesn't even count what you're trying to protect against which is not wear-out. You can merely schedule a replacement interval to guard against that. It's the unexpected random failures you're wanting to keep from causing downtime.

Search

Does it make sense to run SSDs in RAID1?

mikeymikec

Lifer

John L

Junior Member

John L

Junior Member

Brahmzy

Senior member

mindless1

Diamond Member

Brahmzy

Senior member

Hail The Brain Slug

Diamond Member

corkyg

Elite Member | Peripherals

PliotronX

Diamond Member

Brahmzy

Senior member

nosirrahx

Senior member

Brahmzy

Senior member

TimSheetz

Junior Member

TRENDING THREADS