
3LC flash unfit for use in anything?

VirtualLarry

No Lifer
http://www.recovermyflashdrive.com/articles/29F64G08CFAAA-29F64G08FAMCI-3Bit-MLC-high-failure-rate

Interesting that they feel that MLC flash is unreliable after 1K writes, not 10K as often quoted. Worse yet is the newer 3LC, which they feel can become unreliable after as few as 10 writes!

If you read my other thread about USB flash drives, TRIM, and the lack of secure erase: if you wipe or write-test the drive with vconsole.com's USB flash drive tester, the drive becomes "gridlocked" because the dynamic wear leveller runs out of free blocks. If you then wipe the drive again, it has to re-write the same physical block 8 times or more for every 512-byte logical sector write it does. Thus, two wipe passes and you have effectively FRIED your shiny new 3LC flash drive, like a Sandisk 16GB Cruzer Micro you bought at Wally-world for $40.
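The damage from that gridlock scenario can be put in rough numbers. A back-of-envelope sketch; the erase-block size is my assumption for illustration, and the 8-rewrites-per-sector figure is taken at face value from the claim above:

```python
# Back-of-envelope arithmetic for the "gridlocked" wipe scenario.
# ERASE_BLOCK_BYTES is an assumed, illustrative erase-block size;
# REWRITES_PER_SECTOR is the post's "8 times or more" claim.

ERASE_BLOCK_BYTES = 512 * 1024   # assumed erase-block size (512 KiB)
SECTOR_BYTES = 512               # logical sector size
REWRITES_PER_SECTOR = 8          # physical block rewrites per logical write

def wipe_cycles(passes: int) -> int:
    """P/E cycles a single block eats during full-drive wipe passes,
    if every 512-byte logical write forces block rewrites."""
    sectors_per_block = ERASE_BLOCK_BYTES // SECTOR_BYTES   # 1024
    return passes * sectors_per_block * REWRITES_PER_SECTOR

print(wipe_cycles(2))  # 16384 -- thousands of times past a 10-write 3LC budget
```

Under these assumptions, two wipe passes burn four orders of magnitude more cycles than the 10-write figure quoted for 3LC.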

I agree with the article. Vendors of devices with flash memory products inside should be REQUIRED BY LAW to state the type (SLC/MLC/3LC), as well as the estimated lifetime of the flash memory. (This should also apply to things like motherboards and LCD monitors, as they contain flash memory too.)

Edit: Here's another article on that site, this time they tested an SSD until it failed. Interesting.
http://www.recovermyflashdrive.com/articles/solid-state-drives-ssd-testing-reliability
 
Interesting that they feel that MLC flash is unreliable after 1K writes, not 10K as often quoted. Worse yet is the newer 3LC, which they feel can become unreliable after as few as 10 writes!

That's not what they state, or you are using the wrong words to state what you mean.

They clearly state that you can begin to observe sectors fail in as few as 1k writes, etc. Not entire chips.

This is not news. There is a distribution (a Weibull distribution, to be specific) to the probability of failure for all the cells within a flash chip.

The 10k quoted value is a spec value that means something specific in terms of the percentage of sectors that are dead by the time the average endurance cycle count for all sectors has hit 10k. It doesn't mean the chip is dead or that you've lost your data; pre-sensing and the like goes on with every read.
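The distinction between "some sectors dead" and "chip dead" falls straight out of that distribution. A minimal sketch, assuming made-up Weibull parameters (shape 3, characteristic life 12k cycles) purely for illustration, not vendor data:

```python
# Sample per-sector endurance from a Weibull distribution and see
# what fraction has already failed by the 10k-cycle spec point.
# SHAPE and SCALE are illustrative assumptions, not measured values.
import random

random.seed(0)
SHAPE = 3.0         # >1 means wear-out-dominated failures
SCALE = 12_000      # assumed characteristic life in P/E cycles

# random.weibullvariate(alpha, beta): alpha = scale, beta = shape
endurances = [random.weibullvariate(SCALE, SHAPE) for _ in range(100_000)]

dead_at_10k = sum(e < 10_000 for e in endurances) / len(endurances)
print(f"{dead_at_10k:.1%} of sectors fail before 10k cycles")
```

The point of the sketch: even when a large fraction of individual sectors has worn out by the spec point, the remainder (plus ECC and remapping) keeps the chip usable, which is why "sectors failing at 1k" and "chip unreliable at 1k" are different claims.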

And consider the source... naturally these guys are going to see their labs filled with all the infant-mortality chips that die prematurely in the field (that is why the warranty exists). It's not like their observations span the same distribution that we consumers randomly sample from when taking delivery of our purchases.


This test can't be reproduced on an Intel SSD, since the Intel drives actively monitor the total amount written per 24hr period and will intentionally throttle the write speed to avoid exceeding a threshold value that was set to ensure a minimum 5yr operating lifespan. I don't recall the links, but it is there in the Intel technical documentation, and taltamir and I discussed it on multiple occasions here in the storage forum.

But even in this case...the sample size these guys used is "1". When do I stop laughing? That is what warranties are for. All the data is there in readable form...read your data after 61TB of continuous writes and send the SSD back in for warranty servicing.
 
That's not what they state, or you are using the wrong words to state what you mean.

They clearly state that you can begin to observe sectors fail in as few as 1k writes, etc. Not entire chips.
I didn't use the word fail, I said "become unreliable". If you have failing sectors, then yes, that chip is becoming unreliable. My English was perfectly clear and correct.


But even in this case...the sample size these guys used is "1". When do I stop laughing? That is what warranties are for. All the data is there in readable form...read your data after 61TB of continuous writes and send the SSD back in for warranty servicing.

But what was interesting, and what they tested, was that even though (the mfg) claims the drive would last for 5 years, they tested it and it failed at a point that would suggest a 3.3-year lifespan instead.

IOW, the claimed lifetime on SSDs is bunk, mostly driven by marketing, because in the real world, they simply don't last as long as they are claimed to.

And as far as warranty claims go on a drive that dies that way, does the warranty cover "normal wear and tear"? Most don't, most only cover defects. So a drive that dies prematurely like that, would NOT be returnable for warranty RMA.
 
Sorry for digging up an old post, but it feels neat to have two of your articles debated. I spent two hours writing a response specific enough to be useful but vague enough not to violate any NDAs.

To clarify my personal semi-scientific experiments: 10% of 3-bit MLC sectors develop bit errors after a single write, and after ten writes the majority of sectors (50%+) have bit errors. Saying a sector has failed or is bad is arbitrary, because it depends on the amount of error-correcting code available to repair the damaged bits.

Error-correcting code (ECC) is used to repair x amount of damaged bits. From the NAND controller's perspective, a sector hasn't "failed" until the number of error bits exceeds the amount of error-correcting code available to repair the damaged sector. I stress "the amount of error-correcting code available" because the amount provided varies GREATLY, from 4 to 15 error bits in a 512-byte sector on most controllers. Controllers with weaker ECC engines require fewer logic gates, which brings down the cost of the controller, making it attractive to many OEMs. (Hint: the high-capacity 3-bit MLC drive you bought for $20 which failed after two days.)

The endurance numbers quoted by NAND manufacturers are guesses based upon the controller manufacturer using ALL the available spare area for error-correcting code, which in the real world never happens. In general, never believe ANY endurance number you read.

SSD-reliability-wise you're right, sample size = 1. In most cases SSD drives never fail gracefully; either they aren't detected by the computer or they output erroneous information. For example, when Intel X-25 drives fail they show up as an 8MB drive. If you're lucky enough for it to fail gracefully (ie: lock itself into read-only mode), then image the data to another drive, RMA the drive, and consider yourself lucky.

There is an overall lack of transparency when it comes to NAND products, many companies are afraid to give out information which could hurt their trade secrets. I can't blame them. Sometimes I wonder if it’s because if people knew how the drives really worked they wouldn’t buy them.

Always Backup and if you need your data recovered - Shameless plug removed <-- Shameless plug..

Feel free to ask any questions.. 🙂
 
Last edited by a moderator:
The point is that flash memory is inherently unreliable - even brand new SLC, doesn't promise 100% data fidelity. The manufacturers specify a maximum %age of bit flips as allowable (in much the same way as LCD manufacturers specify a certain number of dead pixels) - most flash is specified so that 'end of life' means less than 99.999% accuracy (error rate 10^-5). When new, the error rate may be much lower (e.g. 10^-12 for brand new SLC). (Note that flash is still in spec with a *lot* of bit errors - e.g. 50% of sectors with bit errors is still within 'spec').

Because of this, the manufacturers include extra space in each sector for ECC. E.g. a modern flash chip with 4k (4096 byte) sectors actually stores 4132 bytes per sector - the extra 36 bytes are available for ECC or other use by the controller. The idea is that under most circumstances the data will be recoverable using the ECC. So, with a good ECC algorithm, a controller could take an error rate of 10^-5 and reduce it by a factor of 1 billion - taking the error rate down to 10^-14. Of course, there is a small chance that one sector could take a series of hits and not be recoverable - in that case the controller should tell the host PC - 'Oops. Bad sector. Data lost'. This should be very rare, as most controllers regularly scan the flash for errors and will reallocate sectors once individual sectors start looking flaky. But a bad controller might not perform these scans, and errors might accumulate without detection until the data is lost.
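A quick sketch of what those 36 spare bytes could buy. The 4096+36-byte layout is from the paragraph above; modelling the ECC as a BCH-style code (a common choice for NAND, but my assumption here) gives a rough correctable-bit budget:

```python
# How many correctable bits 36 spare bytes could buy with a BCH-style
# code. Layout numbers are from the post; the BCH model is an assumption.
DATA_BITS = 4096 * 8       # 4 KiB of data per sector
SPARE_BITS = 36 * 8        # 36 spare bytes per sector

# A t-error-correcting BCH code of length n <= 2**m - 1 needs at most
# m*t parity bits. The codeword spans data+spare = 33056 bits, so m = 16.
m = (DATA_BITS + SPARE_BITS).bit_length()   # 16
t_max = SPARE_BITS // m
print(f"~{t_max} correctable bits per sector")
```

With roughly that many correctable bits per 4 KiB sector, an expected-error count well below one per sector at a 10^-5 raw rate leaves the uncorrectable-sector probability vanishingly small, which is the billion-fold-reduction idea above in concrete terms.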

Some controllers take this even further - sandforce controllers actually store additional ECC data on the other flash chips (they call this 'RAISE') - so even if one flash chip is catastrophically damaged, the data should be recoverable using the additional ECC on the other chips.

The other thing that is interesting is that different types of flash actually need different types of ECC for optimal data integrity. The patterns of corruption that you get in SLC, 2-bit MLC and 3-bit MLC are different. SLC errors are random individual bits. A very simple scheme which provides 1 bit of ECC per 32 bits is likely to be adequate. With 3-bit MLC, the errors involve blocks of 3 bits at a time - the SLC scheme would be inadequate, and a different, much more sophisticated scheme is needed to cope adequately with the 'chunkiness' of errors. Use of a basic SLC ECC technique is likely to give much lower than expected reliability if used on 3-bit MLC. Unfortunately, expensive complex ECC doesn't jibe well with the extreme low-budget devices that 3-bit MLC is aimed at.
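The 'chunkiness' problem can be shown with a toy model. This assumes, per the description above, a simple scheme that can fix at most one bad bit per 32-bit word; the specific error positions are made up for illustration:

```python
# Toy model: a 1-bit-per-32-bit-word ECC scheme versus scattered
# (SLC-style) and burst (3-bit-MLC-style) errors.
WORD_BITS = 32

def correctable(error_bits: list[int]) -> bool:
    """True iff no 32-bit word contains more than one flipped bit."""
    words = [b // WORD_BITS for b in error_bits]
    return len(words) == len(set(words))

# SLC-style wear: isolated single-bit errors far apart
scattered = [5, 700, 20_000]          # three bits in three different words
print(correctable(scattered))         # True -- one fixable error per word

# 3LC-style wear: one burst of 3 adjacent bits
burst = [20_000, 20_001, 20_002]      # all land in word 625
print(correctable(burst))             # False -- >1 error in one word
```

Same number of flipped bits in both cases; only the burst defeats the bit-oriented scheme, which is why 3-bit MLC needs chunk-aware (symbol-based) ECC.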

A similar difference in ECC technique is seen in PC memory. Many servers use ECC RAM, but the highest-end servers often use 'chipkill' ECC - this uses a more sophisticated, but slower and more expensive, algorithm which is capable of correcting 'chunky' errors - e.g. the loss of a complete RAM chip. Note that chipkill doesn't need more ECC spare space - it uses the same amount, but it's just an enhanced algorithm (combined with smaller RAM chips).

The other issue is how effective the controllers are at managing write amplification. Simply filling an SSD 4000 times will cause a lot more than 4000 writes to the flash - potentially as many as 20k-50k writes if it has a cheap controller. The use of spare capacity, improved wear levelling, data deduplication, garbage collection and TRIM support all contribute to reducing the write amplification - and getting it closer to (or even less than) unity (1 MB of flash write per 1 MB of write instructions received).
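The arithmetic behind that 20k-50k range is just fills times write-amplification factor; the WA values below are the illustrative ones implied above, not measurements of any drive:

```python
# Flash P/E cycles consumed per cell when filling a drive repeatedly,
# under different write-amplification (WA) factors. Assumes even wear.

def flash_cycles(host_fills: int, write_amp: float) -> float:
    """Cycles actually consumed per cell = host fills x WA."""
    return host_fills * write_amp

for wa in (1.0, 5.0, 12.5):
    print(f"WA {wa:>4}: {flash_cycles(4000, wa):,.0f} flash write cycles")
```

At WA 5 the 4000 fills become 20k cycles and at WA 12.5 they become 50k, matching the range above; a controller near unity leaves the cell budget almost untouched by overhead.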

There is also a relatively little known feature that affects the lifetime of flash and that is 'rest time'. If a flash cell is written, and then immediately erased and rewritten, that is far more damaging (10s of times as damaging) than writing the cell, allowing it to rest for several minutes to hours, then erasing and rewriting. The longer the rest time, the less wear occurs (up to a point). Unfortunately, no one really knows how the manufacturers come up with the figure of 10k or 100k write cycles - if these were write + immediate erase + immediate rewrite tests, then the field lifetime of the flash would be expected to be much higher than the lab tests suggest. If the flash was allowed to rest for a minute between each write test, the field lifetime would be much closer to the lab estimate.

The other big issue, which jeremy raises in the point above - is what happens when the controller detects more errors than it knows what to do with. There's a lot of variation in behaviour - and while the controller manufacturers are quick to trumpet their sophisticated ECC, garbage collection, etc. - this is something they keep very quiet about - and may well not test very thoroughly.
 
The point is that flash memory is inherently unreliable - even brand new SLC, doesn't promise 100% data fidelity. The manufacturers specify a maximum %age of bit flips as allowable (in much the same way as LCD manufacturers specify a certain number of dead pixels) - most flash is specified so that 'end of life' means less than 99.999% accuracy (error rate 10^-5). When new, the error rate may be much lower (e.g. 10^-12 for brand new SLC). (Note that flash is still in spec with a *lot* of bit errors - e.g. 50% of sectors with bit errors is still within 'spec').

I've always had a problem with the idea that throwing more error-correcting code at a problem is the solution, especially when controller manufacturers don't use all the spare area for error-correcting code. What's the point of having non-volatile storage if the data stored inside will contain errors after one write? What's worse is that it's considered acceptable.

Because of this, the manufacturers include extra space in each sector for ECC. E.g. a modern flash chip with 4k (4096 byte) sectors actually stores 4132 bytes per sector - the extra 36 bytes are available for ECC or other use by the controller. The idea is that under most circumstances the data will be recoverable using the ECC. So, with a good ECC algorithm, a controller could take an error rate of 10^-5 and reduce it by a factor of 1 billion - taking the error rate down to 10^-14. Of course, there is a small chance that one sector could take a series of hits and not be recoverable - in that case the controller should tell the host PC - 'Oops. Bad sector. Data lost'. This should be very rare, as most controllers regularly scan the flash for errors and will reallocate sectors once individual sectors start looking flaky. But a bad controller might not perform these scans, and errors might accumulate without detection until the data is lost.

On an average 3-bit MLC flash drive, 70%-80% of the allocated sectors contain at least a single bit error. This is considered "acceptable" thanks to error-correcting code. I wouldn't call uncorrectable pages rare; in fact they're very common, especially in 3-bit MLC. Perhaps this behavior exists in solid state drives, but that type of maintenance is atypical in your standard flash drive, CF card, or SD card.

The other thing that is interesting is that different types of flash actually need different types of ECC for optimal data integrity. The patterns of corruption that you get in SLC, 2-bit MLC and 3-bit MLC are different. SLC errors are random individual bits. A very simple scheme which provides 1 bit of ECC per 32 bits is likely to be adequate. With 3-bit MLC, the errors involve blocks of 3 bits at a time - the SLC scheme would be inadequate, and a different, much more sophisticated scheme is needed to cope adequately with the 'chunkiness' of errors. Use of a basic SLC ECC technique is likely to give much lower than expected reliability if used on 3-bit MLC. Unfortunately, expensive complex ECC doesn't jibe well with the extreme low-budget devices that 3-bit MLC is aimed at.

Hmm, that’s interesting; in my observations I tend to find bit errors spread out through groups of sectors, generally page-sized. My thought on this is that NAND chips may interleave physical groups of cells throughout the entire page to spread the wear, ie: instead of having 8 bit errors in one sector per page, have 1 bit error per 8 sectors in a page. I don’t think all 3 bits of data are applied as a portion of an 8-bit byte. But then again, some bit errors follow no pattern, so this might not be universal.

There is also a relatively little known feature that affects the lifetime of flash and that is 'rest time'.
Interesting

The other big issue, which jeremy raises in the point above - is what happens when the controller detects more errors than it knows what to do with. There's a lot of variation in behaviour - and while the controller manufacturers are quick to trumpet their sophisticated ECC, garbage collection, etc. - this is something they keep very quiet about - and may well not test very thoroughly.

Despite many USB/CF/SD controller manufacturers' claims, not a lot has changed over the last 10 years from a design perspective; they just rehash the same schemes with more error-correcting code. Many solid state drives are simply extensions of flash drives, carrying over the same antiquated designs conjured up almost a decade ago. However, there are a few companies that do stand out as innovators.
 