
Storage area network advice

You do know that SSDs fail into read-only mode on worn blocks, right? It's not like the drive just dies (in the cases where they do, those are firmware/controller bugs and are irrelevant to your topic of unavoidable NAND flash floating-gate dielectric degradation).

You'll get warnings long before your data goes *poof*.

You forget that in enterprise you *always* have backups, failover systems, etc. Even *if* your case for SSDs dying left and right were true, it's more than made up for by how quickly you can rebuild or replicate back once the drive is replaced, not to mention the user productivity during the years it works flawlessly.
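The "warnings long before your data goes poof" point can be sketched as a simple policy over a drive's normalized wear indicator (a sketch only, assuming a SMART-style value that starts near 100 on a new drive and counts down toward 0; `wear_warning` and the thresholds are hypothetical, not any vendor's algorithm):

```python
# Hypothetical sketch: deciding when to pre-emptively warn about SSD wear,
# assuming a normalized wear-leveling value (100 = new, counting down to 0)
# like the wear attributes many SSDs expose via SMART.

def wear_warning(normalized_wear: int, threshold: int = 10) -> str:
    """Classify an SSD's wear level from its normalized SMART-style value."""
    if normalized_wear <= threshold:
        return "replace"   # effectively worn out; expect read-only fallback soon
    if normalized_wear <= threshold * 3:
        return "plan"      # schedule replacement before it degrades further
    return "ok"

print(wear_warning(95))   # ok
print(wear_warning(25))   # plan
print(wear_warning(5))    # replace
```

The drives in a LUN tend to wear together, so a single drive hitting "plan" is usually a hint about its neighbors too.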
 
I was referring to pre-fail monitoring. Yes, most SANs will alert you when a drive, ESM controller, or other part fails, but in the case of SSDs you want to know before that, as it means they are all pretty much worn out and all need to be replaced, or more specifically, all the drives in a given LUN, since each LUN may see different usage levels depending on how it's set up. So it's better to know ahead of time so you can start replacing them before shit hits the fan.

For years, SAN operating systems have had predictive failure algorithms. My experience is limited to NetApp, but by default they do weekly RAID scrubs to check for bad blocks, etc. Also by default, NetApp SANs use dual-parity raid groups. So if you buy two shelves of SSD, you'll have 48 disks. Let's say you make a single aggregate out of them, and let's say you use the default raid group size of 23 (21 data disks and 2 parity disks); that leaves you with two hot spares. That means two disks can fail anywhere in those two shelves and it will start rebuilding immediately, with no loss of data. At minimum, you can then lose two more disks anywhere in those two shelves with zero data loss. At most, you could lose 4 more, assuming the hot spares have finished rebuilding since the original two failures.
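The disk accounting above works out like this (a quick sketch; `layout` is an illustrative helper, not a NetApp API):

```python
# Split a pool of disks into dual-parity raid groups of a fixed size;
# whatever doesn't fill a full group is left as hot spares.

def layout(total_disks: int, raid_group_size: int, parity_per_group: int = 2):
    groups = total_disks // raid_group_size
    spares = total_disks - groups * raid_group_size
    data = groups * (raid_group_size - parity_per_group)
    parity = groups * parity_per_group
    return {"groups": groups, "data": data, "parity": parity, "spares": spares}

# Two 24-disk shelves, default raid group size of 23:
print(layout(48, 23))
# {'groups': 2, 'data': 42, 'parity': 4, 'spares': 2}
```

So of the 48 disks you bought, 42 hold data, 4 hold parity, and 2 sit as spares.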

Where I work, we carry 24/7/4 support on our SANs, so within a few hours you'll have your replacement disks. It's highly unlikely that more than 4 SSDs would fail before you can get replacements.

However, a better use for SSD (in a NetApp SAN, at least) is their new Flash Pool technology in ONTAP 8.1.1: you add SSDs into existing aggregates of SAS or SATA disks and they act as a read AND write cache. If a disk fails, the controller simply stops using that disk.

In addition to all of what I've just said, the majority of disk failures in a SAN are predictive failures, meaning the software has decided that a disk is no longer reliable, removes it from the array, and starts rebuilding onto a hot spare. The chance of a disk failing while writes are actually being laid down on it is very slim.
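The predictive-failure behavior can be sketched as a simple error-count triage (illustrative only; the disk names, `triage` helper, and threshold are made up and bear no relation to ONTAP's actual algorithm):

```python
# Flag disks whose cumulative error count crosses a threshold, worst first,
# so they can be proactively failed out and rebuilt onto hot spares.

def triage(disks: dict, error_threshold: int = 50) -> list:
    """Return disks flagged for predictive removal, highest error count first."""
    flagged = [d for d, errors in disks.items() if errors >= error_threshold]
    return sorted(flagged, key=lambda d: disks[d], reverse=True)

counts = {"0a.00.3": 7, "0a.00.9": 61, "0b.00.2": 120}
print(triage(counts))   # ['0b.00.2', '0a.00.9']
```

The point is that the array acts on trends before an outright failure, so the rebuild starts while the suspect disk is often still readable.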

*EDIT* One other point I'd mention about NetApp (and I can't speak for EMC or any other vendor in this regard) is that they have a very good PowerShell toolkit. If you have a Windows environment, it makes scripting/automation so much easier vs. trying to script SSH commands to perform automated tasks. I've installed several NetApp SANs and administer even more, and the only things I've run into that you can't do with PowerShell and have to revert to the CLI for are OS updates, firmware updates, out-of-band processor firmware updates, etc. ALL provisioning and administration tasks can be performed via PowerShell.
 