Does it make sense to run SSDs in RAID1?

Homerboy

Lifer
Mar 1, 2000
30,856
4,974
126
Spec'ing out a few new servers and in talking to my rep, he brought up a good point:

"Mirroring an SSD doesn’t make a lot of sense. The die from the number of writes and if you are going to write the exact same thing to both drives they should fail at the exact same time 2,000,000 hours later, 120 years.

We sell about 15,000 enterprise SSDs a quarter and we don’t do more than a handful of RMAs on them a year from malfunction."

While I can't find much fault in his logic, I still feel... "weird" not running a server with RAID1/RAID10 drives (I've always done RAID10 HDDs, so this foray into SSDs on server is new for me...)

Thoughts?
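
For what it's worth, here's a quick back-of-the-envelope sketch of what that 2,000,000-hour MTBF figure actually implies for a fleet (plain Python; everything except the rep's two numbers is an illustrative assumption):

```python
# Rough sketch: what a quoted MTBF means for a fleet of drives.
# MTBF is a population statistic, not a per-drive lifespan guarantee.
# Only the MTBF and the fleet size come from the rep; the rest is illustration.

HOURS_PER_YEAR = 8766                 # 365.25 days * 24 hours

mtbf_hours = 2_000_000                # rep's quoted figure
fleet_size = 15_000                   # drives sold per quarter, per the rep

print(f"MTBF expressed in years: {mtbf_hours / HOURS_PER_YEAR:.0f}")   # ~228

# Annualized failure rate under the usual exponential-failure approximation:
afr = HOURS_PER_YEAR / mtbf_hours
print(f"Implied AFR: {afr:.2%} per drive per year")                    # ~0.44%

# Expected random failures per year across one quarter's worth of drives:
print(f"Expected failures in a {fleet_size:,}-drive fleet: "
      f"~{afr * fleet_size:.0f} per year")                             # ~66
```

Even taking the quoted MTBF at face value, a population that size should still see a steady trickle of random failures every year -- which is exactly the event a mirror is there to absorb.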
 

BonzaiDuck

Lifer
Jun 30, 2004
15,709
1,450
126
Spec'ing out a few new servers and in talking to my rep, he brought up a good point:

"Mirroring an SSD doesn’t make a lot of sense. The die from the number of writes and if you are going to write the exact same thing to both drives they should fail at the exact same time 2,000,000 hours later, 120 years.

We sell about 15,000 enterprise SSDs a quarter and we don’t do more than a handful of RMAs on them a year from malfunction."

While I can't find much fault in his logic, I still feel... "weird" not running a server with RAID1/RAID10 drives (I've always done RAID10 HDDs, so this foray into SSDs on server is new for me...)

Thoughts?

Funny. I just finished my project of replacing my old WHS-2011 server with a 2012 R2 Essentials and Ivy Bridge with PCIE 3.0 hardware. I was kicking back flipping my DVR library on the new box, and I was thinking the same thing. I had some lucid thoughts about it.

SSDs are still too expensive for that sort of thing. I don't do RAID anymore, nor any RAID1 or flavors of it. I can have three-disk file and folder duplication in my drive pool. And I could do that with SSDs.

But how much would you spend on a 1TB SSD? $300? $400? I bought a 2TB Crucial MX300 last year for something like $550.

My old WHS-11 drive pool is made of 2TB Seagate NAS disks -- four of them. I can re-deploy those drives, but now the bigger ones are less than what I paid for the Seagates. And -- I'm modest: I only chose 3TB disks for the new box, while I can re-deploy the 2TB units as backup drives.

Think about it. If you wanted 8TB of SSD drive pool, you could spend between $1,600 and $2,000 on -- say -- four 2TB Crucials. On the up side, your server would be sucking less power from the wall -- a saving for 24/7 operation. But the electro-mechanical HDDs are greater in capacity, and they cost a fraction of an SSD. I just bought two 3TB Hitachi Enterprise drives on sale for $50 each. I have a third one, so together that's 9TB for maybe $200. Instead of the four drives I had in the old server, I now only run three.

No -- here's what I do, and I think it happened to coincide with an opinion by Terry Walsh in his server-OS books. Use an SSD for the server OS itself; use hard disks for the drive pool -- connected to a $100 SuperMicro [Marvell] PCIE x8 controller. I'm even testing the trial version of PrimoCache Server, given that I have 16GB of DDR3-1600. I should've finished this project 2 years ago!

Your power savings are not going to cover the cost of those SSDs anytime in a medium to short-range time-horizon. Your power savings can still benefit by using larger but fewer HDDs. You can do your RAID 1, RAID 1/0, RAID 5 or 6 with more expensive hardware if you want the latter two and a hardware RAID controller. But I get file/folder duplication (x3 "triplication" if I want!). Don't need to mirror whole drives to do it.

Ultimately, it's true -- the SSDs are not going to pose a risk like the HDDs. But they're going to cost a lot more. Well -- maybe you were thinking about an OS-boot-system drive for a server using mirrored SSDs. But why not just back up a 250GB OS partition to a 1TB HDD once a day? And of course -- a mirrored array is only redundant -- it isn't "backup."
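
To put rough numbers on that power-vs-price tradeoff (all prices and wattages below are assumptions for illustration, not quotes):

```python
# Rough payback math: SSD drive pool vs. HDD drive pool for ~8-9 TB.
# Every figure below is an assumption you'd swap for your own prices.

ssd_pool_cost = 4 * 450          # four 2 TB SSDs at ~$450 each (assumed)
hdd_pool_cost = 3 * 65           # three 3 TB HDDs at ~$65 each (assumed)

extra_watts = 3 * 6              # assume each HDD draws ~6 W more than an SSD, 24/7
kwh_per_year = extra_watts * 24 * 365 / 1000
electricity_rate = 0.13          # $/kWh (assumed)
power_savings_per_year = kwh_per_year * electricity_rate

price_gap = ssd_pool_cost - hdd_pool_cost
print(f"Up-front price gap:   ${price_gap}")
print(f"Power savings / year: ${power_savings_per_year:.2f}")
print(f"Years to break even:  {price_gap / power_savings_per_year:.0f}")
```

On those assumptions the break-even point lands decades out, which is the point: for a bulk-storage pool, the electricity savings never come close to covering the price gap.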
 

nosirrahx

Senior member
Mar 24, 2018
304
75
101
Modern controllers can do distributed reads in RAID 1. As a result, sequential read speed increases by around 95%. That said, RAID kind of kills 4KQ1T1 speed, so while it looks like it should be faster, what you "feel" is actually slower.

The only real reason I would want RAID 1 anymore is to have the ability to survive a failure and schedule a rebuild at a convenient time instead of dealing with a dead OS drive the instant it happens.
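
On Linux software RAID that "survive now, rebuild later" workflow is easy to keep an eye on. Here's a minimal sketch that just reads /proc/mdstat and flags a degraded mirror (in practice `mdadm --monitor` is the more robust way to get the same alert):

```python
#!/usr/bin/env python3
"""Minimal check for a degraded Linux md RAID1 set by reading /proc/mdstat.

A healthy two-disk mirror reports something like "[2/2] [UU]"; a degraded
one reports "[2/1] [U_]" (or "[_U]"). This is just a sketch -- in practice
`mdadm --monitor` is the more robust way to get the same alert.
"""
import re
import sys


def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return the md arrays whose member-status string contains a '_'."""
    bad = []
    current = None
    for line in mdstat_text.splitlines():
        name = re.match(r"^(md\d+)\s*:", line)
        if name:
            current = name.group(1)
        # Status lines look like: "  976630464 blocks super 1.2 [2/2] [UU]"
        status = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
        if status and current and "_" in status.group(1):
            bad.append(current)
    return bad


if __name__ == "__main__":
    with open("/proc/mdstat") as f:
        failing = degraded_arrays(f.read())
    if failing:
        print("DEGRADED:", ", ".join(failing))
        sys.exit(1)
    print("all md arrays healthy")
```

Once the replacement disk is in, re-adding it with `mdadm --add` kicks off the rebuild whenever it suits you.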
 

Homerboy

Lifer
Mar 1, 2000
30,856
4,974
126
Funny. I just finished my project of replacing my old WHS-2011 server with a 2012 R2 Essentials and Ivy Bridge with PCIE 3.0 hardware. I was kicking back flipping my DVR library on the new box, and I was thinking the same thing. I had some lucid thoughts about it.

SSDs are still too expensive for that sort of thing. I don't do RAID anymore, nor any RAID1 or flavors of it. I can have three-disk file and folder duplication in my drive pool. And I could do that with SSDs.

But how much would you spend on a 1TB SSD? $300? $400? I bought a 2TB Crucial MX300 last year for something like $550.

My old WHS-11 drive pool is made of 2TB Seagate NAS disks -- four of them. I can re-deploy those drives, but now the bigger ones are less than what I paid for the Seagates. And -- I'm modest: I only chose 3TB disks for the new box, while I can re-deploy the 2TB units as backup drives.

Think about it. If you wanted 8TB of SSD drive pool, you could spend between $1,600 and $2,000 on -- say -- four 2TB Crucials. On the up side, your server would be sucking less power from the wall -- a saving for 24/7 operation. But the electro-mechanical HDDs are greater in capacity, and they cost a fraction of an SSD. I just bought two 3TB Hitachi Enterprise drives on sale for $50 each. I have a third one, so together that's 9TB for maybe $200. Instead of the four drives I had in the old server, I now only run three.

No -- here's what I do, and I think it happened to coincide with an opinion by Terry Walsh in his server-OS books. Use an SSD for the server OS itself; use hard disks for the drive pool -- connected to a $100 SuperMicro [Marvell] PCIE x8 controller. I'm even testing the trial version of PrimoCache Server, given that I have 16GB of DDR3-1600. I should've finished this project 2 years ago!

Your power savings are not going to cover the cost of those SSDs anytime in a medium to short-range time-horizon. Your power savings can still benefit by using larger but fewer HDDs. You can do your RAID 1, RAID 1/0, RAID 5 or 6 with more expensive hardware if you want the latter two and a hardware RAID controller. But I get file/folder duplication (x3 "triplication" if I want!). Don't need to mirror whole drives to do it.

Ultimately, it's true -- the SSDs are not going to pose a risk like the HDDs. But they're going to cost a lot more. Well -- maybe you were thinking about an OS-boot-system drive for a server using mirrored SSDs. But why not just back up a 250GB OS partition to a 1TB HDD once a day? And of course -- a mirrored array is only redundant -- it isn't "backup."

Thanks for the reply. I'm talking about a business situation though, with 150+ users accessing it, etc. Power consumption isn't on the radar for that matter. Looking at failure rates of SSDs vs. HDDs, it's tenths of a percent versus 4-6% over a several-year span. Not to mention SSD failure can be monitored and somewhat predicted from read/write values and cycle counts.
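
And that monitoring is easy to script with smartmontools. A rough sketch -- the wear attributes it greps for are common examples, but the names vary by vendor, so treat them as assumptions:

```python
#!/usr/bin/env python3
"""Rough SSD wear check via smartmontools (`smartctl` must be installed).

The attribute names matched below are common examples; they differ between
vendors and between SATA and NVMe drives, so treat the list as illustrative.
"""
import re
import subprocess


def smart_report(device: str) -> str:
    # `smartctl -a` prints the full SMART report; usually needs root.
    result = subprocess.run(["smartctl", "-a", device],
                            capture_output=True, text=True, check=False)
    return result.stdout


def wear_lines(report: str) -> list[str]:
    """Pull out the lines that commonly indicate wear or media health."""
    patterns = [
        r"Percentage Used:.*",           # NVMe health log
        r"Data Units Written:.*",        # NVMe health log
        r".*Media_Wearout_Indicator.*",  # Intel SATA attribute
        r".*Wear_Leveling_Count.*",      # Samsung SATA attribute
        r".*Total_LBAs_Written.*",       # common SATA attribute
    ]
    hits = []
    for pat in patterns:
        m = re.search(pat, report)
        if m:
            hits.append(m.group(0).strip())
    return hits


if __name__ == "__main__":
    for line in wear_lines(smart_report("/dev/nvme0")):   # device name is an example
        print(line)
```

Feed the "Percentage Used" or wearout numbers into whatever monitoring you already run and you get the early warning HDDs never really gave you.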
 

Homerboy

Lifer
Mar 1, 2000
30,856
4,974
126
Modern controllers can do distributed reads in RAID 1. As a result, sequential read speed increases by around 95%. That said, RAID kind of kills 4KQ1T1 speed, so while it looks like it should be faster, what you "feel" is actually slower.

The only real reason I would want RAID 1 anymore is to have the ability to survive a failure and schedule a rebuild at a convenient time instead of dealing with a dead OS drive the instant it happens.

Yes - this has nothing to do with speed per se. It's about fault tolerance. Failure in SSDs seems so rare that you're basically hedging against odds like getting hit by lightning three times in the same day. It's a hard decision for me to make - how insured do you REALLY need to be...
 

mikeymikec

Lifer
May 19, 2011
17,672
9,514
136
Yes - this has nothing to do with speed per se. It's about fault tolerance. Failure in SSDs seems so rare that you're basically hedging against odds like getting hit by lightning three times in the same day. It's a hard decision for me to make - how insured do you REALLY need to be...

As you said, the whole point is fault tolerance. Even with hard drives the likelihood of failure isn't that high, yet we sometimes put them in RAID1 configurations, because one failure scenario is a complete halt to workflow and disaster recovery, and the other is, "oh, a drive failed. I guess I ought to replace that at some point soon".

If you're thinking that uptime is a high enough priority to possibly consider RAID1, then I doubt that deciding not to because "SSDs are just that good" is going to feel very re-assuring should the worst happen.

I've seen two SSDs fail and one inexplicably have the Windows 10 configuration mangled after the power went off. I don't consider SSDs to be bulletproof.
 

PliotronX

Diamond Member
Oct 17, 1999
8,883
107
106
It does, because you can have two identical SSDs and one can outright fail at any time, well before the expected EOL. The probability is very low, but the risk is downtime while you await a replacement and restore from backup once it's in place. This kind of scenario happened last year with pathetically expensive 200GB SAS SSDs in a server. Because the failed drive was in an array, the system never went offline; we replaced the drive, the array rebuilt, and the users were none the wiser. If this is a critical array, you have to weigh the cost of an extra drive that will do nothing but step in should this happen against the downtime of replacing the drive and restoring from backup. Can a whole day or two go by without the array?
 

BonzaiDuck

Lifer
Jun 30, 2004
15,709
1,450
126
Thanks for the reply. I'm talking about a business situation though, with 150+ users accessing it, etc. Power consumption isn't on the radar for that matter. Looking at failure rates of SSDs vs. HDDs, it's tenths of a percent versus 4-6% over a several-year span. Not to mention SSD failure can be monitored and somewhat predicted from read/write values and cycle counts.
Yup -- I thought you were in that sort of situation. My little kludge of a home server still demonstrates the same tradeoffs in microcosm. But that's exactly right: there are no mechanical failures with a wide, random statistical spread. There is only the accumulated TBW -- with a comparatively narrow range around the ultimate wear-out point. So . . . no RAID1, even for an OS-system disk.

Of course, I can also see it the other way: you have the redundancy to avoid even an hour's downtime in RAID1. Even so, it could take at most an hour to restore a dead boot-system disk, I suppose . . .
 

python134r!

Junior Member
Aug 7, 2016
6
0
6
I used to RAID0 two 64GB Kingston drives years ago for the OS, and they were pretty quick. However, current SSDs are quick enough on their own that it's not necessary.
 

ch33zw1z

Lifer
Nov 4, 2004
37,759
18,039
146
Spec'ing out a few new servers and in talking to my rep, he brought up a good point:

"Mirroring an SSD doesn’t make a lot of sense. The die from the number of writes and if you are going to write the exact same thing to both drives they should fail at the exact same time 2,000,000 hours later, 120 years.

We sell about 15,000 enterprise SSDs a quarter and we don’t do more than a handful of RMAs on them a year from malfunction."

While I can't find much fault in his logic, I still feel... "weird" not running a server with RAID1/RAID10 drives (I've always done RAID10 HDDs, so this foray into SSDs on server is new for me...)

Thoughts?
The logic only applies if writes were the only cause of SSD failures. Since they aren't, base your decision on how critical the server is. What's the business impact if the server is down? Etc.

Much of the time, the extra cost of a mirrored set is miniscule compared to the business impact during a server recovery.
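
A crude way to frame that comparison -- every input below is an assumption you'd replace with your own numbers:

```python
# Crude expected-cost framing: price of a mirror drive vs. unplanned recovery.
# All inputs are assumptions for illustration only.

extra_drive_cost = 400           # second SSD for the mirror ($, assumed)
afr = 0.005                      # assumed annual failure rate per drive
service_years = 5

recovery_hours = 8               # reinstall/restore from backup (assumed)
downtime_cost_per_hour = 5_000   # assumed business impact with ~150 users idle

# Chance of at least one drive failure over the service life (small-p approximation):
p_failure = afr * service_years

expected_loss = p_failure * recovery_hours * downtime_cost_per_hour
print(f"Expected unplanned-downtime cost, no mirror: ~${expected_loss:,.0f}")
print(f"Cost of the extra mirror drive:               ${extra_drive_cost}")
```

Even with fairly forgiving inputs the expected loss outruns the price of the second drive, and that's before counting the tail risk of a restore that doesn't go smoothly.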
 

Homerboy

Lifer
Mar 1, 2000
30,856
4,974
126
Just to lay this one to bed, I'm going with RAID1; for the extra money, it's just worth it as an insurance policy.
Thanks for everyone's input.
 

BonzaiDuck

Lifer
Jun 30, 2004
15,709
1,450
126
Just to lay this one to bed, I'm going with RAID1; for the extra money, it's just worth it as an insurance policy.
Thanks for everyone's input.

Well, the cost of insurance -- in premiums over time -- is a function of the expected cost of a damage incident. If the system is taking orders from online customers, then yes, there would be potential losses from the downtime. In other corporate or public-institution contexts -- say, the actuarial division of an insurance company -- an hour's downtime might not have the same impact, since some aspects of the work could carry on. But given the price of SSDs, if this is for a server-OS disk or something of relatively small capacity, you may as well make a mirrored RAID of it.
 

Woomack

Junior Member
May 7, 2018
3
0
1
An SSD can die instantly without any warning, so at least for me RAID1 is a better idea than RAID0 for SSDs.
Random performance doesn't really scale in RAID0, and sequential performance isn't really important past a certain point on home/office computers, so RAID0 is not the best idea and it's usually better to buy one larger drive.
Just some thoughts after long experience with various SSDs.
 

Brahmzy

Senior member
Jul 27, 2004
584
28
91
Find a new rep. I’ve heard the same - they’re not the ones responsible for resurrecting a dead server (always at the wrong time). I’ve had multiple SSD failures in the data center. It has nothing to do with endurance; it has everything to do with controller failure. It does happen. I think SSD controllers are subjected to a lot of heat and are not built the same way that, say, an Intel CPU is (I’ve lost a few of those in the data center too).
Always RAID1 at a minimum. “The cost of doing bitniss.”
 

John L

Junior Member
May 30, 2018
5
0
1
The reason not to use RAID 1 isn't that SSDs don't fail. The reason not to use RAID 1 is that SSDs consistently fail the same way, at the same number of duty cycles. A functional RAID 1 guarantees that you're putting the exact same number of duty cycles on both drives! Congrats, you've borked performance for zero benefit. Our Ops guys actually demonstrated this on an NGINX video cache service. Basically, they were completely rewriting the disk every few hours. They lost multiple servers simultaneously at 2 months.
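
The arithmetic on that kind of workload shows why matched drives can hit their rated limit almost in lockstep (numbers below are illustrative assumptions, not figures from that deployment):

```python
# How fast a full-rewrite workload burns through rated endurance.
# All numbers are illustrative assumptions.

drive_capacity_tb = 1.0          # usable capacity per drive (assumed)
rated_endurance_tbw = 600        # vendor TBW rating (assumed)
rewrites_per_day = 8             # "rewriting the disk every few hours"

tb_written_per_day = drive_capacity_tb * rewrites_per_day
days_to_rated_limit = rated_endurance_tbw / tb_written_per_day

print(f"Host writes per day: {tb_written_per_day:.0f} TB")
print(f"Days to rated TBW:   {days_to_rated_limit:.0f}")   # ~75 days, ~2.5 months
```

Both halves of a mirror see the same host writes, so under a workload like that the pair really can reach the rated limit within days of each other.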

That doesn't mean there are not array solutions for SSDs. There are vendors that make flash based arrays. They are not using RAID 1. See below for an example. If you need to back up your data, back up your data! If you only have 2 drives for business critical data, you're already doing it so wrong that no level of RAID will ever be able to save you.


https://www.solidfire.com/
 

mikeymikec

Lifer
May 19, 2011
17,672
9,514
136
The reason not to use RAID 1 isn't that SSDs don't fail. The reason not to use RAID 1 is that SSDs consistently fail the same way, at the same number of duty cycles.

The OP has already decided what they're going to do, but reading the thread would also have informed you that SSDs do fail in more than one possible way.
 

John L

Junior Member
May 30, 2018
5
0
1
The OP has already decided what they're going to do, but reading the thread would also have informed you that SSDs do fail in more than one possible way.
Sure the poster may have made a decision, but 1. He can change his mind, and 2. Other people will see this thread, because that is how the internet works.

Again, I'm not saying that you shouldn't plan for drive failure, only that RAID 1 does not provide that protection. Some of the failures mentioned in the thread -- like a power loss where the controller had cached a write but hadn't flushed it yet -- are things RAID 1 doesn't protect against either. If you're relying on an array of drives to keep an uptime-critical application running, go to a larger, more sophisticated array or put it in the cloud. RAID 1 impairs garbage collection and, on two drives, doubles duty cycles. It is not a solution for a matched set of SSDs. Arguably, if you staggered your SSD purchases, it would provide some redundancy, but I don't think anyone was suggesting that.
 

mikeymikec

Lifer
May 19, 2011
17,672
9,514
136
Again, I'm not saying that you shouldn't plan for drive failure, only that RAID 1 does not provide that protection.

Of course it does. A simple drive failure is perfectly plausible and RAID1 is likely to provide some protection against it. It's not perfect protection, but then nothing is: dual power supplies or a UPS do not provide perfect protection against power-related problems, dual-homed Internet connections do not perfectly protect against Internet connectivity issues, etc.

I once had RAID1 running on a server and a drive failed in a way that somehow downed the server (so technically a RAID1 failure, as one of the reasons to use RAID1 is to attempt to maximise uptime), but when the server was power-cycled, the second drive allowed the server to continue providing its services. That's still a damn sight better than no RAID1 and the only storage drive failing: no disaster recovery needed and much less downtime.

Arguably, if you staggered your SSD purchases, it would provide some redundancy, but I don't think anyone was suggesting that.

It provides redundancy even with two simultaneously installed drives, as has already been described. Your opening argument was bogus (that SSDs consistently fail the same way). Why on earth would you think that SSDs will only fail in 'max host writes' scenarios? Surely your (presumed) experience of other types of hardware would make you think this is a naive statement to make?

Your only point that I think has some validity is for people to more carefully consider scenarios that involve pushing drives to their limits 24/7 (whether those drives are SSDs or whatever), but anyone who knows the basics of RAID1 would know that RAID1 wouldn't be ideal in such a situation anyway, because RAID1 is for when the priority is redundancy, not maximum performance.
 

John L

Junior Member
May 30, 2018
5
0
1
Was the server with the single failed drive using SSDs? The technology for HDDs and SSDs is fundamentally different. The spread around MTBF for spinning disks is wildly greater than it is for SSDs. Applying HDD experience to SSDs is wrong. Also, what was wrong with your array controller? That is not how RAID 1 should behave. I've had many RAID 1 spinning-disk arrays degrade without a crash. A single drive failure should not induce downtime in a properly configured and functioning RAID 1 array. Was it a zombie drive?

That said, most servers I've run had many many more than 2 spinning disks and were run in at least RAID 5. Again, single drive degradation did not cause downtime.

Actually, in my experience, not a single one of our SSDs has failed for any reason besides hitting their write limits. Again, I have directly observed SSDs simultaneously hitting their write limits in deployments from other system architects at AT&T. My sample is only a few tens of thousands of drives, though. Maybe someone on this thread has observed different behavior at a larger scale, but my experience is consistent with 100% of the literature I've seen on SSD drive failure and MTBF data.

Your hangup on performance belies your understanding of the technology. You realize that at the max write limit it is not a question of how well SSDs perform, but rather whether you can write to them at all, right? Wear leveling is a function of garbage collection. For any given number of cells, doubling the number in active use makes it more likely that the same cell gets re-written. RAID 1 impairs wear leveling in SSDs compared to a configuration that allows more functional space. In certain scenarios, that means you're more than halving the rated lifetime of the drives, which if anything diminishes the reliability of the system.
 

thecoolnessrune

Diamond Member
Jun 8, 2005
9,672
578
126
SSDs can fail for any sort of reason. Enterprise SSDs usually have a form of Power Loss Protection that can add to the list of failure scenarios (if a capacitor array fails tests, the drive will often fail itself).

We're all about SSDs in Infrastructure, and I've certainly seen SSDs fail. I've had Nutanix nodes fail their SSD and have to get replaced, fortunately the CVM was on a MDRAID 1 set so the SSD could be replaced and the array rebuilt. We've had two vSAN Pods arbitrarily start taking a single SSD and shooting latency through the roof (0.2 - 0.5ms of latency to 200-300ms of latency). We traced up and down with VMware and Cisco trying to find the root cause and ultimately it appears to have been a case of delayed write acknowledgement and NAND corruption that started due to a drive firmware bug, but that was over the course of 4 months and over 10 SSDs.

We've had NetApp AFF and FAS, EMC VMAX, and Tegile SSDs drop out and need replacement too.

I think it's fair to say SSDs fail less than hard drives, but it's absolutely ridiculous to claim that SSDs only fail in one way and that a same-batch pair would fail the same way at the same time. You weigh RAID 1 for SSDs just like you do for hard drives: by determining whether downtime / rebuild time is justifiable vs. the cost.
 
Feb 25, 2011
16,788
1,468
126
If a boot drive failing in your server will cause a service disruption, then RAID1 is cheap insurance and I'd recommend it most of the time.

But now we have to have a serious talk about virtualization, fault tolerance, clustered services, etc. Because a boot drive failure doesn't have to cause a service disruption.
 

thecoolnessrune

Diamond Member
Jun 8, 2005
9,672
578
126
If a boot drive failing in your server will cause a service disruption, then RAID1 is cheap insurance and I'd recommend it most of the time.

But now we have to have a serious talk about virtualization, fault tolerance, clustered services, etc. Because a boot drive failure doesn't have to cause a service disruption.

Oh man, so this reminds me of one of our clients. I mentioned Nutanix earlier but this is a separate issue. Now direct Nutanix Blocks are Supermicro servers, but these weren't straight Nutanix Blocks but OEM Blocks from one of the Big 3. So the Hypervisor OS sits on a SATA SSD (SATA DOM to be specific) inside the node. The OEM releases a Firmware Update they say reduces corner case chances of the SSD hanging on I/O and bringing down the boot drive. We start getting outage windows to take each of these Nutanix Nodes down one by one and patch the SATA DOM. First one the SATA DOM fails to update and just outright fails to come back. Resetting it and everything it just will not come back. RMA, replace the SATA DOM, rebuild the Hypervisor, get it back in, let the cluster rebuild. That host is taken care of because we got a SATA DOM with the corrected firmware.

Next SATA DOM, the upgrade works, but SATA DOM is wiped. Gotta rebuild that host too.

OEM takes a break and comes back a couple weeks later. We reboot a third node at their recommendation. That SATA DOM just doesn't boot again.

Node after node these SATA DOMs puke the bed. Turns out the existing SATA DOM Firmware was having issues with the installation of vSphere, causing unnecessary write amplification from re-generating logs and burning them out. Every SATA DOM in every node had to be replaced and the node rebuilt.

Didn't cause any actual outage, but it ate a *ton* of Engineer hours.
 
Feb 25, 2011
16,788
1,468
126
Oh man, so this reminds me of one of our clients. I mentioned Nutanix earlier but this is a separate issue. Now direct Nutanix Blocks are Supermicro servers, but these weren't straight Nutanix Blocks but OEM Blocks from one of the Big 3. So the Hypervisor OS sits on a SATA SSD (SATA DOM to be specific) inside the node. The OEM releases a Firmware Update they say reduces corner case chances of the SSD hanging on I/O and bringing down the boot drive. We start getting outage windows to take each of these Nutanix Nodes down one by one and patch the SATA DOM. First one the SATA DOM fails to update and just outright fails to come back. Resetting it and everything it just will not come back. RMA, replace the SATA DOM, rebuild the Hypervisor, get it back in, let the cluster rebuild. That host is taken care of because we got a SATA DOM with the corrected firmware.

Next SATA DOM, the upgrade works, but SATA DOM is wiped. Gotta rebuild that host too.

OEM takes a break and comes back a couple weeks later. We reboot a third node at their recommendation. That SATA DOM just doesn't boot again.

Node after node these SATA DOMs puke the bed. Turns out the existing SATA DOM Firmware was having issues with the installation of vSphere, causing unnecessary write amplification from re-generating logs and burning them out. Every SATA DOM in every node had to be replaced and the node rebuilt.

Didn't cause any actual outage, but it ate a *ton* of Engineer hours.

That's kinda the point, IMO. Spending a ridiculous number of hours eliminating that last 0.01% of possible DT is what puts money in my Boat Fund™. :D
 

mindless1

Diamond Member
Aug 11, 2001
8,052
1,442
126
The reason not to use RAID 1 isn't that SSDs don't fail. The reason not to use RAID 1 is that SSDs consistently fail the same way, at the same number of duty cycles. A functional RAID 1 guarantees that you're putting the exact same number of duty cycles on both drives! Congrats, you've borked performance for zero benefit.

This is incorrect. SSDs do NOT consistently fail at exactly the same # of duty cycles. The # of cycles is an average, and while there won't be specimens that greatly outlive that average like some HDDs do, it would be insane to think that you'd get years of service from SSDs and then they'd all up and die the same day, let alone the same moment, unless you have an external cause like a power surge from a failing PSU, a lightning strike, etc.

On the contrary the odds are very high that upon one failing you have time to order and replace it if you didn't have a spare lying around. I would advise having at least one spare.

This doesn't even touch what you're actually trying to protect against, which is not wear-out. You can simply schedule a replacement interval to guard against that. It's the unexpected random failures you want to keep from causing downtime.
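
To put a number on "time to order and replace it" -- a sketch assuming independent random failures (not synchronized wear-out), with illustrative inputs:

```python
# Odds the surviving mirror also dies before the first failure is replaced,
# assuming independent *random* failures rather than synchronized wear-out.
# Inputs are illustrative assumptions.

HOURS_PER_YEAR = 8766

afr = 0.005                      # assumed annual failure rate per drive
replacement_window_hours = 72    # time to get a spare installed and rebuilt

hourly_failure_rate = afr / HOURS_PER_YEAR
p_second_failure = hourly_failure_rate * replacement_window_hours  # small-p approximation

print(f"P(second drive fails within {replacement_window_hours} h): "
      f"{p_second_failure:.5%}")   # on the order of thousandths of a percent
```

Against random failures the mirror buys an enormous margin; wear-out is the one failure mode it can't help with, and that one you can schedule around.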
 

Brahmzy

Senior member
Jul 27, 2004
584
28
91
Oh man, so this reminds me of one of our clients. I mentioned Nutanix earlier but this is a separate issue. Now direct Nutanix Blocks are Supermicro servers, but these weren't straight Nutanix Blocks but OEM Blocks from one of the Big 3. So the Hypervisor OS sits on a SATA SSD (SATA DOM to be specific) inside the node. The OEM releases a Firmware Update they say reduces corner case chances of the SSD hanging on I/O and bringing down the boot drive. We start getting outage windows to take each of these Nutanix Nodes down one by one and patch the SATA DOM. First one the SATA DOM fails to update and just outright fails to come back. Resetting it and everything it just will not come back. RMA, replace the SATA DOM, rebuild the Hypervisor, get it back in, let the cluster rebuild. That host is taken care of because we got a SATA DOM with the corrected firmware.

Next SATA DOM, the upgrade works, but SATA DOM is wiped. Gotta rebuild that host too.

OEM takes a break and comes back a couple weeks later. We reboot a third node at their recommendation. That SATA DOM just doesn't boot again.

Node after node these SATA DOMs puke the bed. Turns out the existing SATA DOM Firmware was having issues with the installation of vSphere, causing unnecessary write amplification from re-generating logs and burning them out. Every SATA DOM in every node had to be replaced and the node rebuilt.

Didn't cause any actual outage, but it ate a *ton* of Engineer hours.
Nutanix is banging on my door hard and loud. In fact, every HCI vendor is. I've got more stories of epic outages on some of these HCI clusters -- HyperFlex alone, I know of multiple cases.
I saw this coming when vSAN and the like first hit the scene. Put all the goods in software and you're golden, except when you're not and you hit a bug. Because software has freaking bugs. A lot of them.
I'll stick to traditional tiers for now. Scale-out NAS and next-gen scale-out DP are a different story.