SSD reliability in the real world, according to Google

Elixer

Lifer
May 7, 2002
The FAST 2016 paper Flash Reliability in Production: The Expected and the Unexpected (not available online until Friday), by Professor Bianca Schroeder of the University of Toronto and Raghav Lagisetty and Arif Merchant of Google, covers:

Millions of drive days over 6 years
10 different drive models
3 different flash types: MLC, eMLC and SLC
Enterprise and consumer drives

Key conclusions

Ignore Uncorrectable Bit Error Rate (UBER) specs; it's a meaningless number.
Good news: Raw Bit Error Rate (RBER) increases slower than expected from wearout and is not correlated with UBER or other failures.
High-end SLC drives are no more reliable than MLC drives.
Bad news: SSDs fail at a lower rate than disks, but their UBER is higher (see below for what this means).
SSD age, not usage, affects reliability.
Bad blocks in new SSDs are common, and drives with a large number of bad blocks are much more likely to lose hundreds of other blocks, most likely due to die or chip failure.
30-80 percent of SSDs develop at least one bad block and 2-7 percent develop at least one bad chip in the first four years of deployment.
http://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/
Google's paper https://www.usenix.org/conference/fast16

But it isn't all good news. SSD UBER rates are higher than disk rates, which means that backing up SSDs is even more important than it is with disks. The SSD is less likely to fail during its normal life, but more likely to lose data.

Backup, backup, backup...it all boils down to that, no matter what the device is.
 

Elixer


After reading some of this about SLC vs MLC, it sure seems like TLC would be in an even worse position.

What is most annoying is the file/data loss (corruption) with no good explanation of why it is happening.
Firmware bugs? Design flaws? Bad chips? Who knows.
In particular, some of these errors (final read error, uncorrectable error, meta error) lead to data loss, unless there is redundancy at higher levels in the system, as the drive is not able to deliver data that it had previously stored.
They can detect it because they care about data retention and use a file system that notices when the data doesn't match.

Most consumers running windows with NTFS don't stand a chance at finding corruption until it is too late.

Drives tend to either have less than a handful of bad blocks, or a large number of them, suggesting that impending chip failure could be predicted based on prior number of bad blocks (and maybe other factors). Also, a drive with a large number of factory bad blocks has a higher chance of developing more bad blocks in the field, as well as certain types of errors.
I wonder how they could tell how many bad blocks an SSD has before it gets into their hands? It sounds like the silicon lottery here, which could explain quite a few things.
So, if an SSD does start getting bad blocks, it pretty much means game over for that particular NAND chip, does it not?
 

VirtualLarry

No Lifer
Aug 25, 2001
Did they use any of the Plextor SSDs? Those undergo what I think is the most extensive write/read/burn-in testing of any consumer SSD. Those should not have any bad blocks or bad chips, when received by the customer as "new".
 

JimmiG

Platinum Member
Feb 24, 2005
http://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/
Google's paper https://www.usenix.org/conference/fast16



Backup, backup, backup...it all boils down to that, no matter what the device is.

It's almost harder to protect yourself against "silent" errors like data corruption due to a firmware bug, current leakage or whatever. If a drive crashes, it's dead. That's kind of hard to miss. You get a new drive and restore your backup. If there's a silent error, you might not notice it until it's too late. For example, you might keep daily backups for 30 days, and then after 90 days, you notice that this important file is corrupted and no longer usable due to 3 bytes being unreadable.

It's also interesting that age rather than wear has the greatest impact on reliability. A lot of people are concerned about the number of PE cycles and go to great lengths to reduce the number of writes. In reality, flash storage seems to be extremely unsuitable for cold storage, and refreshing the data might actually be beneficial.
 

Charlie98

Diamond Member
Nov 6, 2011
It's also interesting that age rather than wear has the greatest impact on reliability. A lot of people are concerned about the number of PE cycles and go to great lengths to reduce the number of writes. In reality, flash storage seems to be extremely unsuitable for cold storage, and refreshing the data might actually be beneficial.

That also raises another question... why? Back in the old days, even I was kind of nutty about writes to my tiny 60GB SSD, but now, with bigger SSDs commonly in use and potentially more unused blocks to shift data to (overprovisioning, etc.), you would think they would last much longer... but if old age has more impact...?
 

Cerb

Elite Member
Aug 26, 2000
It's almost harder to protect yourself against "silent" errors like data corruption due to a firmware bug, current leakage or whatever.
While problematic for /boot, I've gone to BTRFS for everything else in Linux. ReFS is OK in Windows, but it can be rather slow, lacks dedup capability, and has some Samba issues (regardless of where the blame should go, in a mixed-OS environment things like files potentially not showing up are a big deal). IMO, MS really dropped the ball by making a bulk-storage-only file system. ZFS has its own issues, much of them based on licensing, of course.

You can't blindly trust the drives, these days.

It's also interesting that age rather than wear has the greatest impact on reliability. A lot of people are concerned about the number of PE cycles and go to great lengths to reduce the number of writes. In reality, flash storage seems to be extremely unsuitable for cold storage, and refreshing the data might actually be beneficial.
Literal cold storage should be good. Warm off/idle, much less so. I don't know which drives now do and don't, but idle scans for worsening errors should probably become standard in SSDs to come.
 

Cerb

Ubuntu 16.04 :) :) :)
Indeed.

Also a few choice snippets from the paper, now that I've done more than skim it:
In summary, we find that the flash drives in our study experience significantly lower replacement rates (within their rated lifetime) than hard disk drives. On the downside, they experience significantly higher rates of uncorrectable errors than hard disk drives.

Based on our observations above, we conclude that SLC drives are not generally more reliable than MLC drives.

On raw error rates:
Finally, there is not one vendor that consistently outperforms the others. Within the group of SLC and eMLC drives, respectively, the same vendor is responsible for one of the worst and the best models in the group.

If used as a metric in the field, UBER will artificially decrease the error rates for drives with high read count and artificially inflate the rates for drives with low read counts, as UEs occur independently of the number of reads.
IoW, just like with HDDs, it's yet another metric to differentiate "consumer" and "enterprise" on a datasheet, and little more.
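That denominator problem is easy to see with toy numbers (all values below are made up for illustration, not from the paper): two drives with the same absolute number of uncorrectable errors get wildly different UBER figures purely because one is read more.

```python
# UBER = uncorrectable errors / total bits read. If UEs occur
# independently of reads (as the paper argues), the metric is
# dominated by the denominator. All numbers here are hypothetical.

def uber(uncorrectable_errors, bits_read):
    """Uncorrectable Bit Error Rate, as quoted on datasheets."""
    return uncorrectable_errors / bits_read

TB = 8e12  # bits per terabyte (10^12 bytes * 8)

busy_drive = uber(uncorrectable_errors=2, bits_read=100 * TB)
idle_drive = uber(uncorrectable_errors=2, bits_read=1 * TB)

# Same number of UEs, but the busy drive "looks" 100x more reliable.
print(f"busy drive UBER: {busy_drive:.1e}")  # 2.5e-15
print(f"idle drive UBER: {idle_drive:.1e}")  # 2.5e-13
```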

Depending on the model, 30-80% of drives develop bad blocks in the field.
So the RAID-like and/or FEC schemes that several, if not all, manufacturers now use are very good for us. Nonetheless, table 5 on p. 77 has results much higher than I would have expected for the MLC drives.

We note that the vendors of all flash chips in these drives guarantee that no more than 2% of blocks on a chip will go bad while the drive is within its PE cycle limit. Therefore, the two thirds of bad chips that saw more than 5% of their blocks fail are chips that violate vendor specs.

It would be interesting to see what effects, if any, various vendors' proprietary data protection schemes may have on uncorrectable errors with newer drives. For MLC and eMLC, using such old drives makes that something they can't really do here, but it would make for an interesting comparison in a future study. For instance, Micron, SanDisk, and SandForce definitely use a RAID 5-like parity implementation. How bad might it be over time with the SF drives that have it turned off? How much does it actually lower uncorrectable error rates in the field? How effective is it compared to what Samsung and Toshiba have been using? Etc.
 

Elixer

MS threw WinFS under the bus, so, unless everyone moves to a dedicated storage device, people will never know about these silent errors. :(

I suppose people could run a hash on their files every day and see if anything has changed, but that is an extremely long, time-consuming process.
 

Madpacket

Platinum Member
Nov 15, 2005
MS threw WinFS under the bus, so, unless everyone moves to a dedicated storage device, people will never know about these silent errors. :(

I suppose people could run a hash on their files every day and see if anything has changed, but that is an extremely long, time-consuming process.

Actually, that's not a terrible idea. Someone should create a specialized md5sum application that periodically checks critical system files and also lets the user select additional folders or files that are important to them.
 

Essence_of_War

Platinum Member
Feb 21, 2013
Also a few choice snippets from the paper, now that I've done more than skim it:

I found the remarks about reliability vs. spinning disks to be pretty interesting also! Much lower field replacement rate (2-9% annually for HDDs, vs. 4-10% over 4 years for SSDs!), which is sort of what you'd hope for if you're removing a mechanical failure mode, but also much higher incidence of uncorrectable errors! They are claiming something like 10x or more bad sectors over a similar period of time, and that is not accounting for the fact that HDDs also have on average like 10x as many sectors! That's pretty wild. :eek:
 

Cerb

I found the remarks about reliability vs. spinning disks to be pretty interesting also! Much lower field replacement rate (2-9% annually for HDDs, vs. 4-10% over 4 years for SSDs!), which is sort of what you'd hope for if you're removing a mechanical failure mode, but also much higher incidence of uncorrectable errors! They are claiming something like 10x or more bad sectors over a similar period of time, and that is not accounting for the fact that HDDs also have on average like 10x as many sectors! That's pretty wild. :eek:
Yeah. Given they found a correlation with both age and write cycles, I wonder how they compare in terms of host data written per UE encountered? Kind of makes RAID 5 on SSD, which should be good from a pure UBER specification standpoint, not look so good. As old as their drives are, I also wonder how many have controller-level error correction as good as current drives, and what the UE rate would be with newer server-grade drives based on them.
 

Essence_of_War

Yeah. Given they found a correlation with both age and write cycles, I wonder how they compare in terms of host data written per UE encountered? Kind of makes RAID 5 on SSD, which should be good from a pure UBER specification standpoint, not look so good. As old as their drives are, I also wonder how many have controller-level error correction as good as current drives, and what the UE rate would be with newer server-grade drives based on them.

Great questions. Skimming over the intro again, it looks like data was collected over 6 years, which is kind of a long time for controller tech. It looks like they have PE cycle data, and it ranges from like 1/5 of the limit to like 1/1000th of the limit, and they have some plots of RBER and UE as functions of PE. SLC-A seems to do very well, but some of the other drives start to go off the rails as they approach a few thousand PE cycles.
 

frowertr

Golden Member
Apr 17, 2010
Yeah. Given they found a correlation with both age and write cycles, I wonder how they compare in terms of host data written per UE encountered? Kind of makes RAID 5 on SSD, which should be good from a pure UBER specification standpoint, not look so good. As old as their drives are, I also wonder how many have controller-level error correction as good as current drives, and what the UE rate would be with newer server-grade drives based on them.

What I take away is that RAID 5 on SSDs is still perfectly acceptable but RAID 6 would be better (still insanely fast even with double parity). However, scrubbing and checksumming are really needed now for utmost data integrity no matter what you do.

Of course, this is for mission critical stuff. For a home NAS, you won't notice if you have a few bits missing from your rip of Transformers during playback...
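The RAID 5 vs RAID 6 trade-off above can be put in rough numbers. A back-of-envelope sketch, using a hypothetical datasheet-style UBER and array size (not figures from the paper):

```python
# Chance of hitting at least one unrecoverable read error while
# rebuilding an array. UBER and drive sizes are hypothetical.
import math

def p_read_error(uber: float, bytes_read: float) -> float:
    """P(at least one unrecoverable error) when reading `bytes_read` bytes."""
    bits = bytes_read * 8
    # 1 - (1 - uber)**bits, computed stably for tiny uber
    return -math.expm1(bits * math.log1p(-uber))

TB = 1e12  # bytes
# Hypothetical RAID 5 rebuild: 4-drive array of 4 TB drives,
# so the rebuild must read the 3 surviving drives end to end.
p5 = p_read_error(uber=1e-15, bytes_read=3 * 4 * TB)
print(f"P(unrecoverable error during RAID 5 rebuild) ~ {p5:.1%}")
# RAID 6 keeps a second parity, so a single such error during the
# rebuild is recoverable rather than fatal to the array.
```

With these made-up numbers a RAID 5 rebuild has a several-percent chance of tripping over an unreadable sector, which is why the double parity plus scrubbing combination frowertr describes looks attractive.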
 

Cerb

What I take away is that RAID 5 on SSDs is still perfectly acceptable but RAID 6 would be better (still insanely fast even with double parity). However, scrubbing and checksumming are really needed now for utmost data integrity no matter what you do.
Right. What I really wonder is, while they have more errors per drive per day, how the error rate compares on a per-IO or per-host-GB-written basis, since SSDs may be getting used where HDDs simply couldn't keep up anyway.

I mean, with bad blocks and writes predicting more uncorrectable errors, if the HDDs are doing, say, 20GB writes/day, and the SSDs 200GB writes/day, then the likely error rates for mere mortal users might look very different.
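That what-if can be sketched directly (the 20GB vs 200GB daily write loads are from the post above; the daily error counts are made-up illustrative numbers):

```python
# Normalizing error rates by write volume: an SSD can have a worse
# per-day error rate yet a better per-GB-written rate. Rates are
# hypothetical, chosen only to illustrate the comparison.

def errors_per_gb(errors_per_day: float, gb_written_per_day: float) -> float:
    """Daily error rate normalized by daily write volume."""
    return errors_per_day / gb_written_per_day

# Say the SSD throws 5x the errors per day, but absorbs 10x the writes.
hdd = errors_per_gb(errors_per_day=1e-4, gb_written_per_day=20)
ssd = errors_per_gb(errors_per_day=5e-4, gb_written_per_day=200)

print(f"HDD: {hdd:.1e} errors/GB, SSD: {ssd:.1e} errors/GB")
print("SSD better per GB written:", ssd < hdd)
```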
 

hojnikb

Senior member
Sep 18, 2014
Did they use any of the Plextor SSDs? Those undergo what I think is the most extensive write/read/burn-in testing of any consumer SSD. Those should not have any bad blocks or bad chips, when received by the customer as "new".


Every flash die has bad blocks. Even Plextor's.
 

Elixer

I wish they would tell us which specific SSDs they are talking about.
A 6-year span means it is most likely Samsung, Intel, SanDisk, Crucial, and Toshiba.
These are data centers as well, so you know everything was temperature controlled.
I also wonder if there was encryption involved; I would think that a 1-bit error there would pretty much fubar it.

I am betting that this is why we see more talk of things like RAIN technology (Crucial) to try to detect and fix the bad cell issue.
 

frowertr

I wish they would tell us which specific SSDs they are talking about.
A 6-year span means it is most likely Samsung, Intel, SanDisk, Crucial, and Toshiba.
These are data centers as well, so you know everything was temperature controlled.
I also wonder if there was encryption involved; I would think that a 1-bit error there would pretty much fubar it.

I am betting that this is why we see more talk of things like RAIN technology (Crucial) to try to detect and fix the bad cell issue.

I'm thinking that if this is Google, then they were using Intel enterprise drives. I wouldn't think they would even be messing with consumer-level drives here.
 

Cerb

I'm thinking that if this is Google, then they were using Intel enterprise drives. I wouldn't think they would even be messing with consumer-level drives here.
Yet, they include regular MLC, and they definitely did just that with HDDs. If you have a software layer handling your data correctness, replication, and backup, to a point where whole racks can go out with no data loss, why not?
 

Hellhammer

AnandTech Emeritus
Apr 25, 2011
These are not stock SSDs; Google has made its own SSDs for years. It clearly says so on the second page:

The drives in our study are custom designed high performance solid state drives, which are based on commodity flash chips, but use a custom PCIe interface, firmware and driver. We focus on two generations of drives, where all drives of the same generation use the same device driver and firmware. That means that they also use the same error correcting codes (ECC) to detect and correct corrupted bits and the same algorithms for wear-leveling.

In other words, Google is only buying NAND and the vendors listed are simply different NAND vendors, not SSD vendors. As a result the conclusions of this study aren't very far-reaching because the ECC implementation is limited to just one version, so anything related to ECC is only applicable to Google's implementation. ECC has improved a lot over the past couple of years with LDPC being the norm today, meaning that most of the ECC results are likely not relevant with today's SSDs.

The only conclusion I would make is that if a NAND die has bad blocks from day one, it's likely to generate more bad blocks sooner than a "perfect" die. Then again, this isn't really anything new because a die with a defect is likelier to produce errors than a die without errors.
 

VirtualLarry

I wonder how they could tell how many bad blocks an SSD has before it gets into their hands? It sounds like the silicon lottery here, which could explain quite a few things.
So, if an SSD does start getting bad blocks, it pretty much means game over for that particular NAND chip, does it not?

These are not stock SSDs; Google has made its own SSDs for years. It clearly says so on the second page:

The only conclusion I would make is that if a NAND die has bad blocks from day one, it's likely to generate more bad blocks sooner than a "perfect" die. Then again, this isn't really anything new because a die with a defect is likelier to produce errors than a die without errors.

Do you reckon that not seeing the bad blocks positively affects reliability?

I did not know, until Hellhammer's post, that these were "custom PCI-E SSD", presumably with a way to access the raw NAND if / when needed. (Or possibly all the time?)

I was initially under the impression that these were standard SATA6G SSDs, and that Google was seeing initial bad blocks that were visible over the host interface. Whereas, given Plextor's extensive burn-in processes for their SATA SSDs, that shouldn't happen.

But I apparently misunderstood the nature of Google's SSDs, which apparently offer a much more "raw" interface to the NAND.
 

hojnikb

Customer-visible ones?

Of course not. That's what lots of spare area is for, so that the controller can map away bad blocks (ones that come dead from the factory) and any additional ones that die over the course of the SSD's life. HDDs do the same thing.
 

coercitiv

Diamond Member
Jan 24, 2014
I was initially under the impression that these were standard SATA6G SSDs, and that Google was seeing initial bad blocks that were visible over the host interface. Whereas, given Plextor's extensive burn-in processes for their SATA SSDs, that shouldn't happen.
You keep mentioning this exclusive burn-in process that Plextor does. Do you have any details about it or can we just safely assume they (stress) test the drives in a manner that ensures a low early failure rate?

Don't get me wrong, if Plextor manages to get 0.5% average failure rate on their drives by using premium components and quality control, I can only commend them for that. But since they are using the same suppliers as many other SSD manufacturers, why would you even entertain the idea that Plextor drives have pristine NAND chips in them?

PS: Why are we even discussing Plextor, when the whole purpose of such studies is to eliminate any brand/tech bias and look at real-world results? Sure, the data cannot tell us what the current state of SSD reliability is, but it's the best baseline we have.
 