And even its usefulness is questionable. The Google study certainly over-represents the actual rate of drive failure through its excessively permissive definition of a failure:
"Definition of Failure. Narrowly defining what constitutes a failure is a difficult task in such a large operation. Manufacturers and end-users often see different statistics when computing failures since they use different definitions for it. While drive manufacturers often quote yearly failure rates below 2% [2], user studies have seen rates as high as 6% [9]. Elerath and Shah [7] report between 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers upon returning the unit. Hughes et al. [11] observe between 20-30% “no problem found” cases after analyzing failed drives from their study of 3477 disks. From an end-user’s perspective, a defective drive is one that misbehaves in a serious or consistent enough manner in the user’s specific deployment scenario that it is no longer suitable for service. Since failures are sometimes the result of a combination of components (i.e., a particular drive with a particular controller or cable, etc), it is no surprise that a good number of drives that fail for a given user could be still considered operational in a different test harness. We have observed that phenomenon ourselves, including situations where a drive tester consistently “green lights” a unit that invariably fails in the field. Therefore, the most accurate definition we can present of a failure event for our study is: a drive is considered to have failed if it was replaced as part of a repairs procedure."
IOW, after [correctly] establishing that a relatively high percentage of drives deemed to have "failed" by the user (or repair technician) are subsequently found to be in perfect working order, and that such drives merely fail to operate properly or "misbehave" in the user's particular configuration due to some interoperability issue between the firmware, BIOS, controller, or driver, or due to bad cabling or a fluctuating/inadequate power source, the authors then adopt a definition that certainly includes those very false positives.
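To put rough numbers on how much that definitional choice can matter, here is a back-of-the-envelope sketch. The 6% replacement rate and the 15-60% "no trouble found" fractions are taken from the ranges quoted above; they are illustrative inputs, not measurements from any particular population:

```python
# Illustrative only: how a "replaced == failed" definition inflates the
# apparent drive-defect rate when a chunk of replaced drives turn out to
# have no actual defect ("no trouble found", NTF).

def genuine_defect_rate(replacement_rate, ntf_fraction):
    """Replacement-based failure rate, discounted by the NTF fraction."""
    return replacement_rate * (1.0 - ntf_fraction)

replacement_rate = 0.06              # 6% yearly rate, the upper end of the user studies cited

for ntf in (0.15, 0.30, 0.60):       # NTF fractions spanning the 15-60% range quoted above
    adjusted = genuine_defect_rate(replacement_rate, ntf)
    print(f"NTF {ntf:.0%}: 6.0% replacement rate implies ~{adjusted:.1%} genuine drive defects")
```

With a 30% NTF fraction, for example, the 6% figure shrinks to roughly 4.2%; at the 60% end it is closer to 2.4%, not far from the sub-2% rates the manufacturers themselves quote.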
Sift through the BIOS, firmware, and driver release notes for any enterprise-class RAID or storage controller. It's not difficult to find change notes like "solved spurious fault/warnings/errors/behavior with [insert particular make and model of hard disk here]" or "when in this specific configuration/under these particular conditions" (e.g., during staggered spin-up, when NCQ is enabled, and so on, sometimes even referencing a particular firmware revision of a particular drive). From time to time, you'll even see a remark along these lines in a motherboard BIOS release note.
In a server environment like Google's, the potential for these kinds of interactions to surface is vastly greater than in any personal computer. Two to four storage controllers driving 16 to 32+ hard drives of mixed makes and models create hundreds, if not thousands, of configuration possibilities that even the most heavily loaded personal computer, gaming rig, or workstation will never encounter (a rough tally is sketched below). The more complex the configuration, the more complex the interactions become between the system BIOS, storage controller/firmware, disk drive firmware/controller, drivers, and even the OS.
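For a sense of scale, here is a rough count of distinct hardware/firmware pairings in such an environment. Every component count below is an assumption made up for illustration, not a figure from the study or from Google's actual fleet:

```python
# Rough combinatorial sketch; all counts here are illustrative assumptions.
# Each controller/drive/firmware/BIOS pairing is a distinct interoperability
# surface that a single desktop machine will never exercise.

controllers        = 3   # storage controllers per server (two to four, per the text above)
controller_fw_revs = 4   # controller firmware revisions in circulation
drive_models       = 6   # distinct drive make/model combinations across 16-32+ bays
drive_fw_revs      = 3   # firmware revisions per drive model
bios_versions      = 4   # system BIOS versions deployed across the fleet

combinations = (controllers * controller_fw_revs *
                drive_models * drive_fw_revs * bios_versions)
print(f"distinct controller/drive/firmware/BIOS pairings: {combinations}")  # 864
```

Even with these modest counts the pairings already run into the hundreds; multiply that across thousands of servers and the odds of tripping over some firmware/driver corner case rise accordingly.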
The study made no attempt whatsoever to narrow the definition of "failure" to problems with the drive itself, or to screen out interoperability issues caused by some BIOS, firmware, or controller glitch/bug. If a drive was replaced because someone subjectively deemed it defective or problematic, it gets counted as a "failed" drive.
Overall, many miss the authors' general conclusion that disk drives are "generally very reliable" and "rarely fail."