SSD reliability in the real world, according to Google

Elixer

Lifer
May 7, 2002
The FAST 2016 paper Flash Reliability in Production: The Expected and the Unexpected (not available online until Friday), by Professor Bianca Schroeder of the University of Toronto and Raghav Lagisetty and Arif Merchant of Google, covers:

Millions of drive days over 6 years
10 different drive models
3 different flash types: MLC, eMLC and SLC
Enterprise and consumer drives

Key conclusions

Ignore Uncorrectable Bit Error Rate (UBER) specs; it's a meaningless number.
Good news: Raw Bit Error Rate (RBER) increases slower than expected from wearout and is not correlated with UBER or other failures.
High-end SLC drives are no more reliable than MLC drives.
Bad news: SSDs fail at a lower rate than disks, but their UBER is higher (see below for what this means).
SSD age, not usage, affects reliability.
Bad blocks in new SSDs are common, and drives with a large number of bad blocks are much more likely to lose hundreds of other blocks, most likely due to die or chip failure.
30-80 percent of SSDs develop at least one bad block and 2-7 percent develop at least one bad chip in the first four years of deployment.
http://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/
Google's paper https://www.usenix.org/conference/fast16

But it isn't all good news. SSD UBER rates are higher than disk rates, which means that backing up SSDs is even more important than it is with disks. The SSD is less likely to fail during its normal life, but more likely to lose data.

Backup, backup, backup...it all boils down to that, no matter what the device is.
 

Elixer


After reading some of this about SLC vs MLC, it sure seems like TLC would be in an even worse position.

What is most annoying is the file/data loss (corruption) with no good explanation of why it is happening.
Firmware bugs? Design flaws? Bad chips? Who knows.
In particular, some of these errors (final read error, uncorrectable error, meta error) lead to data loss, unless there is redundancy at higher levels in the system, as the drive is not able to deliver data that it had previously stored.
They can detect it because they care about data retention and use a file system that notices when the data doesn't match.

Most consumers running windows with NTFS don't stand a chance at finding corruption until it is too late.

Drives tend to either have less than a handful of bad blocks, or a large number of them, suggesting that impending chip failure could be predicted based on prior number of bad blocks (and maybe other factors). Also, a drive with a large number of factory bad blocks has a higher chance of developing more bad blocks in the field, as well as certain types of errors.
I wonder how they could tell how many bad blocks an SSD has before it gets into their hands? It sounds like the silicon lottery here, which could explain quite a few things.
So, if an SSD does start getting bad blocks, it pretty much means game over for that particular NAND chip, does it not?
 

VirtualLarry

No Lifer
Aug 25, 2001
Did they use any of the Plextor SSDs? Those undergo what I think is the most extensive write/read/burn-in testing of any consumer SSD. Those should not have any bad blocks or bad chips, when received by the customer as "new".
 

JimmiG

Platinum Member
Feb 24, 2005
http://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/
Google's paper https://www.usenix.org/conference/fast16



Backup, backup, backup...it all boils down to that, no matter what the device is.

It's almost harder to protect yourself against "silent" errors like data corruption due to a firmware bug, current leakage or whatever. If a drive crashes, it's dead. That's kind of hard to miss. You get a new drive and restore your backup. If there's a silent error, you might not notice it until it's too late. For example, you might keep daily backups for 30 days, and then after 90 days, you notice that this important file is corrupted and no longer usable due to 3 bytes being unreadable.

It's also interesting that age rather than wear has the greatest impact on reliability. A lot of people are concerned about the number of PE cycles and go to great lengths to reduce the number of writes. In reality, flash storage seems to be extremely unsuitable for cold storage, and refreshing the data might actually be beneficial.
 

Charlie98

Diamond Member
Nov 6, 2011
It's also interesting that age rather than wear has the greatest impact on reliability. A lot of people are concerned about the number of PE cycles and go to great lengths to reduce the number of writes. In reality, flash storage seems to be extremely unsuitable for cold storage, and refreshing the data might actually be beneficial.

That also raises another question... why? Back in the old days, even I was kind of nutty about writes to my tiny 60GB SSD, but now, with bigger SSDs commonly in use and potentially more unused blocks to shift data to (overprovisioning, etc.), you would think they would last much longer... but if old age has more impact...?
 

Cerb

Elite Member
Aug 26, 2000
It's almost harder to protect yourself against "silent" errors like data corruption due to a firmware bug, current leakage or whatever.
While problematic for /boot, I've gone to BTRFS for everything else in Linux. ReFS is OK in Windows, but it can be rather slow, lacks dedup capability, and has some Samba issues (regardless of where the blame should go, in a mixed-OS environment things like files potentially not showing up are a big deal). IMO, MS really dropped the ball by making a bulk-storage-only file system. ZFS has its own issues, much of them based on licensing, of course.

You can't blindly trust the drives, these days.

It's also interesting that age rather than wear has the greatest impact on reliability. A lot of people are concerned about the number of PE cycles and go to great lengths to reduce the number of writes. In reality, flash storage seems to be extremely unsuitable for cold storage, and refreshing the data might actually be beneficial.
Literal cold storage should be good. Warm off/idle, much less so. I don't know which drives now do and don't, but idle scans for worsening errors should probably become standard in SSDs to come.
 

Cerb

Ubuntu 16.04 :) :) :)
Indeed.

Also a few choice snippets from the paper, now that I've done more than skim it:
In summary, we find that the flash drives in our study experience significantly lower replacement rates (within their rated lifetime) than hard disk drives. On the downside, they experience significantly higher rates of uncorrectable errors than hard disk drives.

Based on our observations above, we conclude that SLC drives are not generally more reliable than MLC drives.

On raw error rates:
Finally, there is not one vendor that consistently outperforms the others. Within the group of SLC and eMLC drives, respectively, the same vendor is responsible for one of the worst and the best models in the group.

If used as a metric in the field, UBER will artificially decrease the error rates for drives with high read count and artificially inflate the rates for drives with low read counts, as UEs occur independently of the number of reads.
IoW, just like with HDDs, it's yet another metric to differentiate "consumer" and "enterprise" on a datasheet, and little more.
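That denominator problem is easy to see with toy numbers (all values below are made up for illustration, not from the paper): two drives with the same absolute number of uncorrectable errors get wildly different UBER figures purely because one is read more.

```python
# UBER = uncorrectable errors / total bits read. If UEs occur
# independently of reads (as the paper argues), the metric is
# dominated by the denominator. All numbers here are hypothetical.

def uber(uncorrectable_errors, bits_read):
    """Uncorrectable Bit Error Rate, as quoted on datasheets."""
    return uncorrectable_errors / bits_read

TB = 8e12  # bits per terabyte (10^12 bytes * 8)

busy_drive = uber(uncorrectable_errors=2, bits_read=100 * TB)
idle_drive = uber(uncorrectable_errors=2, bits_read=1 * TB)

# Same number of UEs, but the busy drive "looks" 100x more reliable.
print(f"busy drive UBER: {busy_drive:.1e}")  # 2.5e-15
print(f"idle drive UBER: {idle_drive:.1e}")  # 2.5e-13
```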

Depending on the model, 30-80% of drives develop bad blocks in the field.
So the RAID-like and/or FEC schemes that several, if not all, manufacturers now use are very good for us. Nonetheless, table 5 on p. 77 has results much higher than I would have expected for the MLC drives.

We note that the vendors of all flash chips in these drives guarantee that no more than 2% of blocks on a chip will go bad while the drive is within its PE cycle limit. Therefore, the two thirds of bad chips that saw more than 5% of their blocks fail are chips that violate vendor specs.

It would be interesting to see what effects, if any, various vendors' proprietary data protection schemes may have on uncorrectable errors with newer drives. For MLC and eMLC, using such old drives makes that something they can't really do here, but it would make for an interesting comparison in a future study. For instance, Micron, SanDisk, and SandForce definitely use a RAID 5-like parity implementation. How bad might it be over time with the SF drives that have it turned off? How much does it actually lower uncorrectable error rates in the field? How effective is it compared to what Samsung and Toshiba have been using? Etc.
 

Elixer

MS threw WinFS under the bus, so, unless everyone moves to a dedicated storage device, people will never know about these silent errors. :(

I suppose people could run a hash on their files every day and see if anything has changed, but that is an extremely long, time-consuming process.
 

Madpacket

Platinum Member
Nov 15, 2005
MS threw WinFS under the bus, so, unless everyone moves to a dedicated storage device, people will never know about these silent errors. :(

I suppose people could run a hash on their files every day and see if anything has changed, but that is an extremely long, time-consuming process.

Actually, that's not a terrible idea. Someone should create a specialized md5sum application that periodically checks critical system files and also lets the user select additional folders or files that are important to them.
 

Essence_of_War

Platinum Member
Feb 21, 2013
Also a few choice snippets from the paper, now that I've done more than skim it:

I found the remarks about reliability vs. spinning disks to be pretty interesting also! Much lower field replacement rate (2-9% annually for HDDs, vs. 4-10% over 4 years for SSDs!), which is sort of what you'd hope for if you're removing a mechanical failure mode, but also much higher incidence of uncorrectable errors! They are claiming something like 10x or more bad sectors over a similar period of time, and that is not accounting for the fact that HDDs also have on average like 10x as many sectors! That's pretty wild. :eek:
 

Cerb

I found the remarks about reliability vs. spinning disks to be pretty interesting also! Much lower field replacement rate (2-9% annually for HDDs, vs. 4-10% over 4 years for SSDs!), which is sort of what you'd hope for if you're removing a mechanical failure mode, but also much higher incidence of uncorrectable errors! They are claiming something like 10x or more bad sectors over a similar period of time, and that is not accounting for the fact that HDDs also have on average like 10x as many sectors! That's pretty wild. :eek:
Yeah. Given they found a correlation with both age and write cycles, I wonder how they compare in terms of host data written per UE encountered? Kind of makes RAID 5 on SSD, which should be good from a pure UBER specification standpoint, not look so good. As old as their drives are, I also wonder how many have controller-level error correction as good as current drives, and what the UE rate would be with newer server-grade drives based on them.
 

Essence_of_War

Yeah. Given they found a correlation with both age and write cycles, I wonder how they compare in terms of host data written per UE encountered? Kind of makes RAID 5 on SSD, which should be good from a pure UBER specification standpoint, not look so good. As old as their drives are, I also wonder how many have controller-level error correction as good as current drives, and what the UE rate would be with newer server-grade drives based on them.

Great questions. Skimming over the intro again, it looks like data was collected over 6 years, which is kind of a long time for controller tech. It looks like they have PE cycle data, and it ranges from like 1/5 of the limit to like 1/1000th of the limit, and they have some plots of RBER and UE as functions of PE. SLC-A seems to do very well, but some of the other drives start to go off the rails as they approach a few thousand PE cycles.
 

frowertr

Golden Member
Apr 17, 2010
Yeah. Given they found a correlation with both age and write cycles, I wonder how they compare in terms of host data written per UE encountered? Kind of makes RAID 5 on SSD, which should be good from a pure UBER specification standpoint, not look so good. As old as their drives are, I also wonder how many have controller-level error correction as good as current drives, and what the UE rate would be with newer server-grade drives based on them.

What I take away is that RAID 5 on SSDs is still perfectly acceptable but RAID 6 would be better (still insanely fast even with double parity). However, scrubbing and checksumming are really needed now for utmost data integrity no matter what you do.

Of course, this is for mission critical stuff. For a home NAS, you won't notice if you have a few bits missing from your rip of Transformers during playback...
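The RAID 5 vs RAID 6 trade-off above can be put in rough numbers. A back-of-envelope sketch, using a hypothetical datasheet-style UBER and array size (not figures from the paper):

```python
# Chance of hitting at least one unrecoverable read error while
# rebuilding an array. UBER and drive sizes are hypothetical.
import math

def p_read_error(uber: float, bytes_read: float) -> float:
    """P(at least one unrecoverable error) when reading `bytes_read` bytes."""
    bits = bytes_read * 8
    # 1 - (1 - uber)**bits, computed stably for tiny uber
    return -math.expm1(bits * math.log1p(-uber))

TB = 1e12  # bytes
# Hypothetical RAID 5 rebuild: 4-drive array of 4 TB drives,
# so the rebuild must read the 3 surviving drives end to end.
p5 = p_read_error(uber=1e-15, bytes_read=3 * 4 * TB)
print(f"P(unrecoverable error during RAID 5 rebuild) ~ {p5:.1%}")
# RAID 6 keeps a second parity, so a single such error during the
# rebuild is recoverable rather than fatal to the array.
```

With these made-up numbers a RAID 5 rebuild has a several-percent chance of tripping over an unreadable sector, which is why the double parity plus scrubbing combination frowertr describes looks attractive.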
 

Cerb

What I take away is that RAID 5 on SSDs is still perfectly acceptable but RAID 6 would be better (still insanely fast even with double parity). However, scrubbing and checksumming are really needed now for utmost data integrity no matter what you do.
Right. What I really wonder is, while they have more errors per drive per day, how the error rate compares on a per-IO or per-host-GB-written basis, since SSDs may be getting used where HDDs simply couldn't keep up anyway.

I mean, with bad blocks and writes predicting more uncorrectable errors, if the HDDs are doing, say, 20GB writes/day, and the SSDs 200GB writes/day, then the likely error rates for mere mortal users might look very different.
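That what-if can be sketched directly (the 20GB vs 200GB daily write loads are from the post above; the daily error counts are made-up illustrative numbers):

```python
# Normalizing error rates by write volume: an SSD can have a worse
# per-day error rate yet a better per-GB-written rate. Rates are
# hypothetical, chosen only to illustrate the comparison.

def errors_per_gb(errors_per_day: float, gb_written_per_day: float) -> float:
    """Daily error rate normalized by daily write volume."""
    return errors_per_day / gb_written_per_day

# Say the SSD throws 5x the errors per day, but absorbs 10x the writes.
hdd = errors_per_gb(errors_per_day=1e-4, gb_written_per_day=20)
ssd = errors_per_gb(errors_per_day=5e-4, gb_written_per_day=200)

print(f"HDD: {hdd:.1e} errors/GB, SSD: {ssd:.1e} errors/GB")
print("SSD better per GB written:", ssd < hdd)
```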
 

hojnikb

Senior member
Sep 18, 2014
Did they use any of the Plextor SSDs? Those undergo what I think is the most extensive write/read/burn-in testing of any consumer SSD. Those should not have any bad blocks or bad chips, when received by the customer as "new".


Every flash die has bad blocks. Even Plextor's.
 

Elixer

I wish they would tell us which specific SSDs they are talking about.
A 6-year span means it is most likely Samsung, Intel, SanDisk, Crucial, and Toshiba.
These are data centers as well, so you know everything was temperature controlled.
I also wonder if there was encryption involved; I would think that a 1-bit error there would pretty much fubar it.

I am betting that this is why we see more talk of things like RAIN technology (Crucial) to try to detect and fix the bad cell issue.
 

frowertr

I wish they would tell us which specific SSDs they are talking about.
A 6-year span means it is most likely Samsung, Intel, SanDisk, Crucial, and Toshiba.
These are data centers as well, so you know everything was temperature controlled.
I also wonder if there was encryption involved; I would think that a 1-bit error there would pretty much fubar it.

I am betting that this is why we see more talk of things like RAIN technology (Crucial) to try to detect and fix the bad cell issue.

I'm thinking that if this is Google, then they were using Intel enterprise drives. I wouldn't think they would even be messing with consumer-level drives here.
 

Cerb

I'm thinking that if this is Google, then they were using Intel enterprise drives. I wouldn't think they would even be messing with consumer-level drives here.
Yet, they include regular MLC, and they definitely did just that with HDDs. If you have a software layer handling your data correctness, replication, and backup, to a point where whole racks can go out with no data loss, why not?
 

Hellhammer

AnandTech Emeritus
Apr 25, 2011
These are not stock SSDs; Google has made its own SSDs for years. It clearly says so on the second page:

The drives in our study are custom designed high performance solid state drives, which are based on commodity flash chips, but use a custom PCIe interface, firmware and driver. We focus on two generations of drives, where all drives of the same generation use the same device driver and firmware. That means that they also use the same error correcting codes (ECC) to detect and correct corrupted bits and the same algorithms for wear-leveling.

In other words, Google is only buying NAND and the vendors listed are simply different NAND vendors, not SSD vendors. As a result the conclusions of this study aren't very far-reaching because the ECC implementation is limited to just one version, so anything related to ECC is only applicable to Google's implementation. ECC has improved a lot over the past couple of years with LDPC being the norm today, meaning that most of the ECC results are likely not relevant with today's SSDs.

The only conclusion I would make is that if a NAND die has bad blocks from day one, it's likely to generate more bad blocks sooner than a "perfect" die. Then again, this isn't really anything new because a die with a defect is likelier to produce errors than a die without errors.
 

VirtualLarry

I wonder how they could tell how many bad blocks an SSD has before it gets into their hands? It sounds like the silicon lottery here, which could explain quite a few things.
So, if an SSD does start getting bad blocks, it pretty much means game over for that particular NAND chip, does it not?

These are not stock SSDs; Google has made its own SSDs for years. It clearly says so on the second page:

The only conclusion I would make is that if a NAND die has bad blocks from day one, it's likely to generate more bad blocks sooner than a "perfect" die. Then again, this isn't really anything new because a die with a defect is likelier to produce errors than a die without errors.

Do you reckon that not seeing the bad blocks positively affects reliability?

I did not know, until Hellhammer's post, that these were "custom PCI-E SSD", presumably with a way to access the raw NAND if / when needed. (Or possibly all the time?)

I was initially under the impression that these were standard SATA6G SSDs, and that Google was seeing initial bad blocks that were visible over the host interface. Whereas, given Plextor's extensive burn-in processes for their SATA SSDs, that shouldn't happen.

But I apparently misunderstood the nature of Google's SSDs, which apparently offer a much more "raw" interface to the NAND.
 

hojnikb

Customer-visible ones?

Of course not. That's what lots of spare area is for, so that the controller can map away bad blocks (ones that come dead from the factory) and any additional ones that die over the course of the SSD's life. HDDs do the same thing.
 

coercitiv

Diamond Member
Jan 24, 2014
I was initially under the impression that these were standard SATA6G SSDs, and that Google was seeing initial bad blocks that were visible over the host interface. Whereas, given Plextor's extensive burn-in processes for their SATA SSDs, that shouldn't happen.
You keep mentioning this exclusive burn-in process that Plextor does. Do you have any details about it or can we just safely assume they (stress) test the drives in a manner that ensures a low early failure rate?

Don't get me wrong, if Plextor manages to get 0.5% average failure rate on their drives by using premium components and quality control, I can only commend them for that. But since they are using the same suppliers as many other SSD manufacturers, why would you even entertain the idea that Plextor drives have pristine NAND chips in them?

PS: Why are we even discussing Plextor, when the whole purpose of such studies is to eliminate any brand/tech bias and look at real-world results? Sure, the data cannot tell us what the current state of SSD reliability is, but it's the best baseline we have.
 