Power failure "duty cycle" for SSD, mechanical drives

Syzygies

Senior member
Mar 7, 2008
229
0
0
I am an early adopter of solid state drives, having paid $621 last November for an Intel X25-M 80 GB SSD, and $189 last month for an OCZ Vertex 60 GB SSD, both from Newegg. I made each purchase after closely studying Anandtech's storage articles on solid state drives:

http://www.anandtech.com/storage/

which are required reading for anyone considering such a purchase. My Intel X25-M was used (and will again be used) as the internal drive in my MacBook.

Here is an issue that has received little attention in these articles: How many times can a hard drive survive a power failure while writing?

Over the three decades I have owned computers, I have experienced many power failures with mechanical drives. Sometimes I was lucky and the drive was fine; other times I needed to reformat the drive, and then it was fine. I have never lost a drive this way, although I have read anecdotal reports of others losing drives this way.

On my very first power failure, I bricked my Intel X25-M solid state drive. On an airplane, I put my MacBook to sleep, and then changed the battery. Apparently I didn't wait long enough, and the MacBook was still writing "sleep" data when the power was lost.

I am thrilled with Intel's support, under the circumstances. They offered me a free swap under warranty. I instead elected to pay $25 for a Cross Ship, along with accepting a $279.45 hold on my card to guarantee the return. (Of course I asked if I could have two at this price! They said no.)

I am less thrilled with Intel's engineering and software support. I may be no expert, but to discuss this one needs to be up on how solid state drives work, specifically the existence of "wear-leveling" algorithms and so forth.

There is an abstraction layer in solid state drives that doesn't exist in mechanical drives, allowing the solid state drive to simulate a mechanical drive from the operating system's point of view while handling the bookkeeping of rewriting data in blocks, with wear-leveling movement of data. There is nothing a user can do to penetrate this abstraction layer, at least for an Intel X25-M drive. If the data tables used by this abstraction layer get hosed by a power failure, the effect on the user is indistinguishable from a hardware failure.

There is a notion of "atomic writes" that comes up both in file system design (see the radical ZFS file system and its brethren) and in symmetric multiprocessor parallelism (so other cores don't see inconsistent data). Intel could have specified atomic writes as a design requirement for the abstraction layers hidden from the user. I have no idea if they did or not, but my experience suggests that they sacrificed this for a bit of extra speed. Atomic writes would protect against hosing data through thousands of power failures. (Good luck getting Intel to comment on this...)
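For readers unfamiliar with the idea, here is a minimal sketch of the atomic-update pattern in Python, using an ordinary file as a stand-in for a drive's internal mapping table. The file names and the use of a filesystem are my own illustration; Intel's firmware internals are not public.

```python
import json
import os

def atomic_update(path, table):
    """Write a new copy of a mapping table, then atomically swap it in.

    A power loss before os.replace() leaves the old table intact; a power
    loss after it leaves the new table intact. There is no window in which
    a reader sees a half-written table.
    """
    tmp = path + ".new"                  # hypothetical scratch name
    with open(tmp, "w") as f:
        json.dump(table, f)
        f.flush()
        os.fsync(f.fileno())             # force the new copy to stable storage
    os.replace(tmp, path)                # atomic rename on POSIX filesystems

# Usage: atomic_update("wear_level_table.json", {"block_42": 17})
```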

Likewise, the drive itself is aware of a reformat request, which can be distinguished from daily use. It could elect to do a hard reset as part of handling such requests. Unless the wear-leveling algorithms work really well, one would want to preserve past wear counts; that is another reason to want atomic writes, and one reason a reset could be complicated. As far as I can tell, there is instead no way for a user to reset the Intel X25-M. Tech support confirmed this; my drive simply needs to be replaced.

It is interesting that Intel, and other vendors, offer warranty support under these circumstances. One could instead imagine selling solid state drives with the user taking full responsibility for power failures, perhaps buying insurance, as one does when taking responsibility for crashing a car. This would, however, have a dampening effect on the solid state market.

Here is my challenge to AnandTech:

It would be relatively easy to build a computer that tested hard drives by deliberately causing power failures during writes, then reformatting and repeating the test. (Put the drive under test on a separate power supply that is switched on and off by a digital timer, like the ones used to cycle house lights, and write a clever script; see the sketch below.) Part of any storage article evaluation should be to expose each hard drive to hundreds of power failures while writing, and determine which drives end up bricked and how quickly.
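A rough sketch of the host side of such a rig, in Python. The mount point, file names, and timing are all placeholders; the actual power cutting would be done by the external timer on the drive's own supply:

```python
import os
import time

DEVICE_MOUNT = "/mnt/testdrive"   # hypothetical mount point of the drive under test
PATTERN = b"\xA5" * 4096          # recognizable 4 KiB pattern

def hammer_writes(seconds):
    """Write the pattern continuously so a timer-driven power cut is almost
    certain to land mid-write."""
    end = time.time() + seconds
    i = 0
    while time.time() < end:
        path = os.path.join(DEVICE_MOUNT, "chunk_%03d.bin" % (i % 1000))
        with open(path, "wb") as f:
            f.write(PATTERN)
            f.flush()
            os.fsync(f.fileno())  # make the drive, not the OS cache, do the work
        i += 1

def verify():
    """After power comes back and the drive is remounted (or reformatted),
    count which files survived with the expected contents."""
    ok = bad = 0
    for name in os.listdir(DEVICE_MOUNT):
        with open(os.path.join(DEVICE_MOUNT, name), "rb") as f:
            if f.read() == PATTERN:
                ok += 1
            else:
                bad += 1
    return ok, bad
```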

My guess? Mechanical drives can survive 1000 "duty cycles" of power failures while writing, with loss of data but no physical damage. Solid state drives can reliably be bricked by similar testing, in the first few cycles. If this is true, it deserves to be widely known.

I still swear by the speed of a solid state drive, and my backup strategies expect loss of data as a regularly recurring event. I'd think twice about recommending a solid state drive to a less technically aware friend, given my experience and the above conjecture.
 

magreen

Golden Member
Dec 27, 2006
1,309
1
81
Interesting suggestion. Try emailing Anand about it. If you get nowhere, try PC Perspective. They're willing to go out on a limb to expose things, and they're the ones who found that major degradation issue in the Intel drives a few months ago.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
There is an abstraction layer in solid state drives that doesn't exist in mechanical drives, allowing the solid state drive to simulate a mechanical drive from the operating system's point of view while handling the bookkeeping of rewriting data in blocks, with wear-leveling movement of data. There is nothing a user can do to penetrate this abstraction layer, at least for an Intel X25-M drive. If the data tables used by this abstraction layer get hosed by a power failure, the effect on the user is indistinguishable from a hardware failure.

Indistinguishable from a hardware failure? Not entirely... the data might be hosed at worst, but it shouldn't brick the drive, just require a reformat, or in this case, running the Intel tool to "clear" the drive. Now, it could be a firmware bug that causes the Intel to stop responding to communications when something like that happens, but that would be a strictly firmware issue, fixable fairly easily in firmware... (I am not talking about fixing it by introducing atomic writes... atomic writes will protect against DATA LOSS; the issue is a firmware that allows the drive to "recover" from losing all its wear-leveling data.)

Anyway, none of this makes any sense. If the drive just BRICKS when losing power, that SHOULD have been noticed much earlier; many people do hard shutdowns on their computers... I think you had a fluke... but it's worth investigating.
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Originally posted by: taltamir
Indistinguishable from a hardware failure? Not entirely... the data might be hosed at worst, but it shouldn't brick the drive, just require a reformat, or in this case, running the Intel tool to "clear" the drive.
...
Anyway, none of this makes any sense. If the drive just BRICKS when losing power, that SHOULD have been noticed much earlier; many people do hard shutdowns on their computers... I think you had a fluke... but it's worth investigating.
As far as I know, there is no Intel tool to "clear" the drive. The only tool is a boot disk image that nondestructively updates the firmware if it is out of date. (OCZ only has a destructive Windows tool for firmware updates.) I could find no other tool on their site, and several people on their tech support asserted that no such tool exists. I would be thrilled if you could prove me wrong; I was making your assertion to myself all weekend before tech support opened, to no avail.

I tried reformats; the volumes worked until I noticed I was losing files. Verify disk failed. Writing zeros to erase the disk failed. (I can imagine algorithms where "exercising" every bit would be enough to rebuild all corrupt tables.)

I agree completely with you: Were I head engineer, atomic writes would have been part of the spec from day one, and one design goal would have been to survive a far higher number of power failures while writing than a mechanical drive.

Most people who do hard shutdowns at least semi-consciously try to avoid pulling the plug while the drive is obviously in use. My power failure was worst-case: The MacBook was waiting for the sleep data to be written to disk, before sleeping.

I only know my own experience; I'd like it to be a fluke. But I ask a reasonable question: How many power failure "duty cycles" can SSDs, or mechanical drives, tolerate before bricking?

If an SSD bricks 1 time out of 50, then early adopters like me would only now be starting to report experiences like this. Look how long it took a Concorde to crash. Only one data point, but by the averages, that crash made it look like a rather dangerous plane.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
I wasn't referring to the firmware update tool; I was referring to the wiping tool used to "fix" the slowdown problems by clearing the drive, before the firmware patch was released and made it unnecessary.

It is working but losing data now? Something is definitely odd, and I do not think it has anything to do with a power loss.
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Originally posted by: taltamir
I wasn't referring to the firmware update tool; I was referring to the wiping tool used to "fix" the slowdown problems by clearing the drive, before the firmware patch was released and made it unnecessary.
Here is essentially the only link that uses the phrase "wiping tool" in reference to an Intel X25-M:

Sector remap fragmentation slowing Intel X25-M SSDs

A more complete system involves using low-level IDE commands to completely shred every sector of the drive, including the remap table, and reformat, restoring the drive to a virgin state. However, this is difficult; it requires turning off AHCI, booting in DOS, and using an obsolete, no longer available older version of an obscure drive-wiping tool.

So you're saying that the firmware patch fixes the fragmentation slowdown problem? If so, that's good. It doesn't provide a low level way to wipe the drive.

To reiterate,

[1] There is no consumer tool for resetting an Intel X25-M to a virgin state.

[2] There is an abstraction layer, inaccessible to end-user software, that is not present on mechanical drives and that depends on tables that could get hosed. I doubt that Intel specified atomic writes for this layer, trading reliability for speed. Even if they did, an atomic write ends by flipping a bit to enable the new data; a poorly timed power outage could leave that bit in an unstable state.
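To illustrate the concern, here is a toy Python sketch of one common defense (my own illustration, not anything Intel documents): keep two alternating copies of the table, each stamped with a sequence number and checksum, so a torn update is detected at the next power-up and the firmware falls back to the last good copy.

```python
import hashlib
import struct

def pack_table(seq, payload):
    """Prefix a table image with a sequence number and a checksum so a
    half-written (torn) copy can be recognized after a power loss."""
    header = struct.pack("<Q", seq)
    digest = hashlib.sha256(header + payload).digest()
    return header + digest + payload

def unpack_table(blob):
    """Return (seq, payload) if the copy is intact, else None."""
    if len(blob) < 40:
        return None
    seq = struct.unpack("<Q", blob[:8])[0]
    digest, payload = blob[8:40], blob[40:]
    if hashlib.sha256(blob[:8] + payload).digest() != digest:
        return None  # torn or corrupt copy: ignore it
    return seq, payload

def pick_current(copy_a, copy_b):
    """Power-up logic: of the two alternating table slots, use the intact
    copy with the highest sequence number."""
    candidates = [c for c in (unpack_table(copy_a), unpack_table(copy_b)) if c]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None
```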

My drive is superficially usable, but unstable, after experiencing a power outage during a write. This could coincidentally be a hardware failure, but I find the hypothesis of corrupt data tables in the abstraction layer inaccessible to me to be by far the more likely explanation. I find it frustrating that I can do nothing about this, but heartwarming that Intel is displaying such stellar customer service in response.

It would be good to know how this all plays out for other brands. Is there a hard reset tool? How is customer service on the replacement, after one admits that a power failure bricked the drive?

In any case, we'll never know the extent of this problem until someone subjects both solid state and mechanical drives to endless power failure testing. If Anand doesn't bite, PCPerspective might like this one, as the go-to source for SSD horror stories.
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Here's another interesting link:

Beating the performance bogeys in the Intel SSDs

The problem is that the SSD doesn't know when virtual sectors are "available" again - it only knows which virtual sectors the operating system has ever written to.

If the operating system has EVER written to a virtual sector, the SSD controller studiously preserves the content of that virtual sector ever after, EVEN IF THE OPERATING SYSTEM LATER THINKS IT HAS DELETED FILES AND THAT THEREFORE THE SECTOR IS UNUSED AND AVAILABLE FOR RE-USE.
English-to-English translation: I was a complete friggin' idiot to write zeroes to the drive (which failed, by the way) in an attempt to save it.

UPDATE: If your Intel SSD has hit the doldrums (i.e. has lost its random write performance), check out the HDD Erase tool and accompanying notes.
Chasing links,

Long-term performance analysis of Intel Mainstream SSDs (p6)

I was about to call it quits when I learned that Intel was supplying reviewers with a copy of an older (and more importantly, compatible) version of HDDErase.
*Edit* - Due to the large number of requests for HDDErase 3.3 after this article went live, we have made it available here.
I'll give this a try. If I can recover my drive, the $280 Intel wants if they don't get my old drive back is starting to look like a good price.
 

malventano

Junior Member
May 27, 2009
18
19
76
PCPer.com
Originally posted by: Syzygies
I am an early adopter of solid state drives...
...last November...
...as the internal drive in my MacBook.

Syzygies,

Given you had an early version of the drive, and were using it in a MacBook, I suspect you were hit with the bootloader / timing issue known to affect early X25-Ms when used with MacBooks. This timing issue has nothing to do with how data was written, or the abstraction layer, or atomic writes. Actually, when your MacBook goes to sleep, it does not do so until the drive has reported all data written, so if you were at the slow blink when you yanked the battery, you would not have interrupted any write in progress.

I checked with Ryan (PCPer editor-in-chief). He had one of the early engineering samples (that shared bootloader code with some of the early shipping drives) in his MacBook. It worked just fine until one day it would not come back from sleeping. It seems the bootloader / timing issue sits just at the threshold of failure and somehow flips past it and won't come back after that point.

Keep in mind that it's not really a hard failure - the drive just won't work in a MacBook after that point. It can still be read with a USB / SATA converter or by installing it into a PC.

As far as getting a drive back to a 'new' state, this is actually quite easy:
http://cmrr.ucsd.edu/people/Hughes/SecureErase.shtml

Just use that on it. If you have flashed to the newer 8820 firmware (highly recommended), you will not need to use the older version of the utility. It will wipe all tables as well as perform a block erase on all flash. The drive *will* remember any bad blocks that are present, so no worries there.

Note that the 8820 firmware does not patch the bootloader. This is only possible at the factory, hence the exchange program currently employed by Intel for this very situation.

I have 4 X25-M series drives in the lab (some are my own). Two of them are early retail / engineering sample drives that are known to not work in MacBooks due to the timing / bootloader issue. The 4 drives here have had terabytes of data written to them through repeated torture testing. I did test power failure writes prior to using my own X25-M for OS duties, as I experienced corruption of this sort on a Memoright drive I used to own. Despite multiple tests, the worst I could find was that the X25-M failed to complete whatever write was in progress at the time of the failure, but there was no corruption beyond that expected, and nothing permanent.

Hope this helps.

**edit**

Final note: Unfortunately, the 8820 firmware and/or HDDErase will likely not do anything to correct the timing issue making that drive not work in your MacBook. The bootloader is the very first bit of code the drive executes on power-up, and has nothing to do with any user data / table status present at that time.
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Thanks for your careful reply.

Originally posted by: malventano
Actually, when your MacBook goes to sleep, it does not do so until the drive has reported all data written, so if you were at the slow blink when you yanked the battery, you would not have interrupted any write in progress.
I'll plead guilty here. The instant my MacBook crashed on waking from sleep, I realized that I hadn't confirmed the slow blink before yanking the battery. I'm an idiot, but mechanical hard drives can generally take this, and a correctly-designed SSD should be able to take 1,000,000 cycles of this.

Perhaps I was very unlucky: in all of your tests, your drive was writing user data on power loss, while by chance mine was writing an internal table?

If it were using atomic writes a la the ZFS file system, you'd simply see the old file, with no data corruption indicating an attempt to write the new file. You say you'd see one bad file; that suggests it doesn't use atomic writes.

Originally posted by: malventano
As far as getting a drive back to a 'new' state, this is actually quite easy:
http://cmrr.ucsd.edu/people/Hughes/SecureErase.shtml

Just use that on it. If you have flashed to the newer 8820 firmware (highly recommended), you will not need to use the older version of the utility. It will wipe all tables as well as perform a block erase on all flash.
That's helpful to know; others can look for HDDErase.exe 4.0 once they've performed the firmware update. Alas, the above link is not to an ISO image, which one must boot from to use this utility. The zip file wouldn't open on my Mac; I didn't try further.

Before seeing your post, I managed to wipe my X25-M using the 3.1 version after partially filling a wastebasket with failed DOS boot CDs. Trying both OS X and Ubuntu, I was unable to modify an existing DOS boot image (.iso) by adding HDDErase.EXE, have the image still burn to a bootable CD, and have the new file HDDErase.EXE show up after booting. Any subset, but not the whole works. For example, using isomaster on Ubuntu to add HDDErase.EXE and save the modified .iso file, I could burn a CD which booted, which showed the file HDDErase.EXE as present under any other operating system, but which didn't show the file HDDErase.EXE as present while booting under DOS. Obscure.

Then I read the instructions again, which claimed that HDDErase.EXE came from the Ultimate Boot CD, version 4.1.1. I chased many extinct mirrors before finding a working .iso image. This CD did the trick, securely erasing my X25-M.

I then booted into OS X on the homebuilt EFi-X Intel box that did the above work, and tested. My wiped X25-M reformatted without a hitch. I verified the disk using OS X Disk Utility, cloned a system onto it, and verified again. All signs of instability were gone.

Given your bootloader comments and the fact that if I kept both drives I'd only have one warranty, I'll return the old drive to Intel. (My replacement X25-M arrived in one day!) I did the above in the interest of science, to try to pin down what happened. I stand by my conclusion: The X25-M hosed one of its private tables during the power failure, and secure erase fixed the damage, while an ordinary reformat did not. And Intel offers stellar customer service, taking back the hosed drive without hesitation. It would be better for their bottom line if they instead offered a repair tool.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
I don't doubt your drive BROKE...
I doubt the REASONS you THINK it broke.

A more complete system involves using low-level IDE commands to completely shred every sector of the drive, including the remap table, and reformat, restoring the drive to a virgin state. However, this is difficult; it requires turning off AHCI, booting in DOS, and using an obsolete, no longer available older version of an obscure drive-wiping tool.
Yeppers, that's what I was talking about.

And what is wrong with running an "obsolete" (GASP!) tool from DOS (OH NO!) with AHCI off? Do you have something against taking 1 minute to change the BIOS setting?
1. Obsolete schmobsolete... that's like saying running Windows XP is a bad thing because it is "obsolete".
2. DOS is not a problem; any system can easily boot into it. You can boot it from a CD-R, USB drive, etc...
3. AHCI is off by default to begin with, so it's only an issue if you knew to turn it on.

I believe that returning the drive to a virgin state in the method described should fix any issue caused by data loss. Even loss of the remap table (the only POSSIBLE issue) would be fixed by such a thing, and should not be an issue to begin with (if it's corrupt, the firmware should fix it)...
This seems MOST LIKELY to be a hardware failure (power spike / unclean power due to messing with the PSU while running) or, less likely, a firmware bug... not a deep-seated issue caused by a lack of atomic writes (which, if it were the case, would put the drive at risk of data loss, but no more).
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Originally posted by: taltamir
I do not have a superman complex; for I am God, not superman!
Oh come on! That was a quote, not my words. When my replacement drive came, I could have just shipped the original off to Intel. Instead I spent part of the day doing experiments, to come up with some evidence as to what really happened. Someone a month from now might find this thread gives them information they need. I'm guessing they'll also think your god complex distracts from the conversation.

I never doubted that this was data loss. I was surprised to discover that it was data loss out of reach of the standard reformatting tools in each operating system. I pleaded with everyone at Intel for a reset tool. They denied the existence of such a tool, and instead set up a drive exchange for me. You couldn't be bothered to offer a link; anyone trying to get past such a problem in the future might find it quicker to read through my blow-by-blow experience than to repeat it themselves in real time.

Try putting yourself in someone else's shoes. I taught a friend how to remove her internal drive from a similar MacBook. She doesn't happen to have a homebuilt PC available for fixing drives, and unlike myself isn't at ease toggling AHCI settings in a BIOS. This is a more typical consumer, unlikely to make the fifteen tries it took me to get the DOS repair disk working. They'll be glad that Intel offers such stellar support, but think that we're two lunatics arguing.

According to your profile, you have a ZFS array. Surely you understand that with a better algorithm involving atomic writes, a solid state drive could go through 1,000,000 power failures without experiencing any data loss. Rather, the state of the drive would always be its last coherent state, with no user intervention ever needed to return the drive to service.

This was not my experience. I'm pointing this out because some people might be interested. I'm baffled what you're getting at, ranting at me.


 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Look, I am not trying to ATTACK YOU. I am just saying that you MIGHT be jumping to conclusions as to why your drive failed...
It COULD be as you said (in which case Intel has a serious problem on their hands), but from a mechanical and software standpoint I have explained why I think it is not, and instead had to do with power fluctuations. I could be entirely wrong, and I would like to get more input as to why I might be wrong or right, so I may further my learning.

Your analysis and testing were selfless and generous. I have never cast any doubt on them.

And yes, I do understand that atomic writes are very important... I just don't think data loss from a failed write should cause the drive to stop working, only cause it to lose data. And I have seen no proof beyond a GUESS that the Intel does NOT perform atomic writes.
IF it does not perform atomic writes, then it puts data at risk, and that is a problem... but even so, I do not think this should cause the drive to fail, only to lose data.
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Originally posted by: taltamir
IF it does not perform atomic writes, then it puts data at risk, and that is a problem... but even so, I do not think this should cause the drive to fail, only to lose data.
Data corruption in a layer inaccessible to a typical user is the functional equivalent of failure. A typical user would conclude that if Intel says there's nothing to be done but send the drive back, then the drive was bricked.

No one expects to find data on a drive after a power failure if that data was still scheduled to be written; that would be physically impossible. With atomic writes, one could ensure that any power loss leaves the drive appearing exactly as it did after the last successful write. Namely, in a coherent state. If the operating system doesn't use atomic writes, then the state after the last successful write may be of no use to the user, and a reformat will be necessary. However, the hardware/software system that is the drive itself should remain in a coherent state, with no signs of corruption after a reformat.

There's an easy litmus test for this: If you need to use a special tool to get the drive working again, it was not left in a coherent state.

After my power failure, my drive was not left in a coherent state. That's all I'm saying. Were I lead engineer, one of my primary design specs would have been to make this impossible, through 1,000,000 duty cycles of power failure.

 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Data corruption in a layer inaccessible to a typical user is the functional equivalent of failure.
Unless it is silently repaired by the firmware, it would cause failure rather than loss of data only if it caused the firmware to hang (unlikely).

If the operating system doesn't use atomic writes, then the state after the last successful write may be of no use to the user, and a reformat will be necessary.
That is NOT what atomic writes prevent... at least, the reformat part... you don't need to reformat, you simply have silent data corruption. That is, you ask the drive for the bits between address X and Y, and it will give you a bunch of bits, and those bits are wrong. The only time that would necessitate a reformat is if those bits happened to be OS files... either way, it will still not brick the drive, just require a reformat. (Yours did NOT recover when reformatted, proof positive that this is not the case... unless it was writing the mapping table... but atomic writes on the mapping table and atomic writes on data are two different things.)

There's an easy litmus test for this: If you need to use a special tool to get the drive working again, it was not left in a coherent state.
Pray tell, WHY is that the case? (It isn't.)

Also, you keep on saying 1 million power failure cycles; where did you get this figure? Atomic writes are explained quite well in the ZFS presentation:
http://opensolaris.org/os/comm.../zfs/docs/zfs_last.pdf
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
Originally posted by: taltamir
atomic writes are explained quite well in the ZFS presentation:
http://opensolaris.org/os/comm.../zfs/docs/zfs_last.pdf
They use the word three times, and explain quite well how they use atomic writes to implement their design goals. Wikipedia gives a more general definition of an atomic operation, although that article primarily concerns itself with parallelism.

I believe that we are making apparently contradictory statements because ZFS is concerned with better containing and recovering from actual physical damage to existing data, while I am assuming no actual physical damage. I am asking how robustly an SSD can handle being interrupted, when a clean loss of power does no physical damage, but stops the writing process at an arbitrary time.

To make an analogy, when Claude Shannon proved that one can reliably push data through a noisy line, many people were stunned. Suppose that a solid state drive receives power in 100 irregular bursts per second, with power available on average half the time. The statement analogous to Shannon's information theory would be the assertion that, in theory, one could still reliably write hours of data to the drive; it would just take twice as long.
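Here is a toy Python simulation of that claim (my own illustration, with made-up numbers): each power burst lets the drive commit a little more data atomically, a cut mid-burst costs only the uncommitted blocks, and with power available half the time the job takes roughly twice as many bursts but always finishes.

```python
import random

def simulate(total_blocks, burst_blocks=100, uptime=0.5, seed=0):
    """Simulate writing through intermittent power. Committed blocks are
    never lost; an interrupted burst simply makes no progress."""
    rng = random.Random(seed)
    committed = 0   # blocks safely committed to stable storage
    bursts = 0
    while committed < total_blocks:
        bursts += 1
        if rng.random() < uptime:   # this burst completed and committed
            committed = min(total_blocks, committed + rng.randint(1, burst_blocks))
        # else: power was cut before the commit; nothing gained, nothing damaged
    return bursts

# Example: simulate(10_000) typically returns roughly twice the number of
# bursts that uninterrupted power would need.
```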

A key tool in designing an SSD that could operate under these conditions would be atomic writes. That wouldn't be the whole story, and their use wouldn't resemble the design of ZFS in its entirety.

This model is an exaggeration of the issue that I question: A representative consumer buys an SSD for her sole computer, which experiences regular power failures. (Either she lives in California, where the power is wonky, or she doesn't wait for the "slow blink" on her sleeping MacBook before changing batteries.) Can the SSD be designed so that under this usage pattern, she is never told by the manufacturer that there is unspecified damage, and the unit must be exchanged?

It always feels safer to say no. When it was first proposed that we fly to the moon using two ships that would then dock in Lunar orbit for the return, many people thought this was impossible, but we figured it out. At least one astronaut held a Ph.D. in the study of this problem.

Another litmus test that I like: How would I feel about a problem if half my retirement savings were at stake? Hypothetical in my case, but very real for anyone working a startup company.

Would I rather have my money at stake in an SSD whose manufacturer exchanged after power failures, but for which a technical forum described an elaborate procedure (comfortable for the denizens of that forum, if not for their family members) which fixed the damage? Or would I want my money at stake in an SSD that by design survives every power failure in a state that a typical consumer can fix for themselves, using standard tools and without understanding squat about SSDs?

There's a reason Steve Jobs is very rich. I know how he'd answer this.

Now, there's a "the driver never gets carsick" effect in problem solving. Stop telling me I'm wrong, and ask yourself how you would design an SSD which could handle 1,000,000 random power failures and remain in a state that a typical consumer can repair. Remember, half your retirement is at stake; if you're dogmatic about what that typical consumer should be willing to do, you'll end up eating dog food in your old age, thinking "dogmatic" refers to your electric can opener. Instead, meet the freakin' design spec. At some point you'll naturally find yourself considering atomic writes as a tool. It won't be an identical use to ZFS, but that example will still be helpful.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
They use the word three times, and explain quite well how they use atomic writes to implement their design goals. Wikipedia gives a more general definition of an atomic operation, although that article primarily concerns itself with parallelism.
The wiki focuses on CPU atomic actions, the ZFS presentation on storage atomic actions... and I don't see how you could do it any other way than what the ZFS team is doing...

I believe that we are making apparently contradictory statements because ZFS is concerned with better containing and recovering from actual physical damage to existing data, while I am assuming no actual physical damage. I am asking how robustly an SSD can handle being interrupted, when a clean loss of power does no physical damage, but stops the writing process at an arbitrary time.
Actually ZFS constantly recovers soft errors and silent data corruption.

This model is an exaggeration of the issue that I question: A representative consumer buys an SSD for her sole computer, which experiences regular power failures. (Either she lives in California, where the power is wonky, or she doesn't wait for the "slow blink" on her sleeping MacBook before changing batteries.) Can the SSD be designed so that under this usage pattern, she is never told by the manufacturer that there is unspecified damage, and the unit must be exchanged?
And there you go jumping to conclusions again as to what caused your drive to fail. Also, you are not actually SAYING anything of value; you are making an appeal to emotion instead of actually CONTRADICTING ANY OF THE POINTS I made.

Actually, your entire post is about analogies and catchphrases, both of which are used as appeals to emotion... you have not said one damn thing about the technical situation I have asked about... you have not given an explanation as to HOW it could POSSIBLY cause the problem you are experiencing.
I did give several explanations, two of which relate to data failure due to non-atomic writes, which I find highly unlikely, and concluded it's PROBABLY one of the other two, but not necessarily, and I could be wrong.
 

Syzygies

Senior member
Mar 7, 2008
229
0
0
I think it's obvious from the evidence I collected that a table got hosed by the power loss, one that isn't reset by actions a normal consumer would take. I also believe that it would be easy to design an SSD that would be invulnerable to such corruption.

I'm happy to let what I've said so far stand; I've got a plumbing project that needs my attention.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
1. It is not obvious from the evidence you collected.
2. Yes, a consumer should not have to take such steps, but AFAIK this isn't what happened to you, so consumers don't actually have it happen to them.
3. It is impossible for such a corruption possibility to even exist unless there is a known firmware bug that is EASILY fixable and Intel just REFUSES to fix... non-atomic writes should cause data loss, not drive failure.

As I said before, my best guess is that taking out the battery caused a power spike or who knows what, not a case of "power loss" causing "data loss" due to "non-atomic writes" resulting in a "corrupt table" which magically causes the controller to "write corrupted data after a reformat".
 

malventano

Junior Member
May 27, 2009
18
19
76
PCPer.com
Originally posted by: Syzygies

If it were using atomic writes a la the ZFS file system, you'd simply see the old file, with no data corruption indicating an attempt to write the new file. You say you'd see one bad file; that suggests it doesn't use atomic writes.

X25s could never perform atomic writes at the file level like ZFS does. The drive has no clue about any file system goings-on. The closest it could come is atomic writes at the sector level: updating a sector with a new one, *then* updating the associated pointer in the remap table, which to some extent it actually does. The 'current file' corruption would only occur because half of it was written by a non-atomic-write OS. As a quick note, most journaling file systems only journal metadata changes, not file changes, so that wouldn't help either.
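To picture that sector-level scheme, here is a toy flash-translation-layer sketch in Python (my own illustration of the general idea, not Intel's actual implementation): new data is programmed to a fresh physical page first, and only then is the logical-to-physical pointer flipped, so an interrupted write leaves the old mapping intact.

```python
class ToyFTL:
    """Toy flash translation layer: logical sectors map to physical pages."""

    def __init__(self, num_pages):
        self.pages = [None] * num_pages      # simulated flash pages
        self.remap = {}                      # logical sector -> physical page
        self.free = list(range(num_pages))   # pool of erased pages

    def write_sector(self, logical, data):
        new_page = self.free.pop()           # 1. pick an erased page
        self.pages[new_page] = data          # 2. program the new page first
        old_page = self.remap.get(logical)
        self.remap[logical] = new_page       # 3. only then flip the pointer
        if old_page is not None:
            self.free.append(old_page)       # 4. reclaim the old page later
        # A power loss before step 3 leaves the old mapping (and data) intact.

    def read_sector(self, logical):
        page = self.remap.get(logical)
        return self.pages[page] if page is not None else None
```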
 

malventano

Junior Member
May 27, 2009
18
19
76
PCPer.com
Originally posted by: taltamir
Unless it is silently repaired by the firmware, only if it causes the firmware to hang (unlikely) would it cause failure and not loss of data.

What's funny is I've seen this very scenario in a few SSDs. Basically, the drive gets interrupted by some external event during a write, and this *somehow* results in a page being written with a mismatched CRC/ECC. The next time you attempt a read of that cluster, the drive hangs. Drives *should* just report a read error along with the errored data, but I've seen some drives poorly handle this event and hang until timeout.

**I have NOT seen this in any X25 series drive**
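For what it's worth, the sane behavior described above (report the error, don't hang) is simple to picture; a toy Python sketch, with the page format entirely my own invention:

```python
import zlib

def store_page(data):
    """Append a CRC32 when the page is written."""
    return data + zlib.crc32(data).to_bytes(4, "little")

def read_page(raw):
    """Return the payload if the CRC matches; otherwise report a read error
    instead of hanging until timeout."""
    payload, stored = raw[:-4], raw[-4:]
    if zlib.crc32(payload).to_bytes(4, "little") != stored:
        raise IOError("uncorrectable page: CRC mismatch")
    return payload
```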
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Well, I have actually seen a monitor and a TV CRASH before... I had to reboot them before they started working... their OSD was acting all borked... :)
They were those expensive, really fancy monitors and TVs with lots of extra functions. (Well, both were capable of doing monitor and TV work, what with both having picture-in-picture, picture-by-picture, and 5+ input methods... it's just the shape and size that made me call one a monitor and one a TV.)

Crashed routers too... oh and I think a DVD recorder deck... (and naturally gaming consoles)

The next time you attempt a read of that cluster, the drive hangs. Drives *should* just report a read error along with the errored data, but I've seen some drives poorly handle this event and hang until timeout.

**I have NOT seen this in any X25 series drive**
Absolutely, that is what I was saying: this is a possible mechanism for this to happen, and:
1. AFAIK it is not what happens in the X25.
2. If it DID happen, the nature of the failure would not match the issues he is describing.
 

hackeron

Junior Member
Oct 7, 2011
1
0
0
I got my OCZ Vertex Plus 60GB - my test was simple:

1) I took the OCZ SSD and my trusted Crucial M4 SSD
2) I installed the same OS on both (same system, same everything)
3) I tried various scenarios, such as pressing the reset button when the system was idle, pulling the power cord out, filling up RAM and causing a kernel panic, etc.
4) My Crucial was just fine in all tests; the OCZ Vertex showed severe filesystem corruption, large directories simply missing, etc. in all tests, including the system freeze test!

I am sending my OCZ back on Monday.