Just thought I'd post a quick applications paper I had to do for my probability class. I chose to base it on RAID 5 and 6, looking at the probability of an array failing to rebuild due to unrecoverable read errors (UREs).
Anyway, if I'm completely wrong, feel free to shoot me down.
Cliffs:
For a 20-drive array of 2 TB drives (40 TB raw), these are the probabilities that the array will fail to recover during a rebuild after one drive has failed.
Probability of Restoration Failing
Drive URE | RAID 5 | RAID 6
1 in 10^14 | 0.952049 | 0.806277
1 in 10^15 | 0.26196 | 0.037596
1 in 10^16 | 0.033188 | 0.003796
For performance and reliability purposes, computer hard disk drives may be combined using a
scheme called RAID (Redundant Array of Independent Disks). RAID has quite a few different levels.
Level 0 stripes the data across the array. As an example, two hard drives can be paired
together: one drive stores the odd-numbered bits, while the other drive stores the even-numbered
bits. This has the advantage of increasing the speed of the combined drives; almost double
the read and write performance of a single drive can be extracted. The downside to using RAID 0 is that you
roughly double the chance that you will lose your data. Since the data depends on having both hard drives, if one
drive fails, the data on the other drive is unusable. Figure 1.1, shown below, illustrates how the
data is saved in a two-drive RAID 0 array.
Figure 1.1: RAID 0 Array. [1]
The opposite of RAID 0 is RAID 1, which mirrors the data. Going back to the two-drive example
discussed for RAID 0, in RAID 1 each drive holds exactly the same set of data. No performance gains
are realized (possibly some read gains depending on the controller); however, the data survives the
failure of either drive, so it is lost only if both drives fail. Figure 1.2 shows how the data is arranged in RAID 1.
Figure 1.2: RAID 1 Array. [1]
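To put rough numbers on the RAID 0 versus RAID 1 comparison, here is a minimal MATLAB sketch. It assumes the two drives fail independently with the same probability f over some fixed period; f = 0.03 is a made-up illustrative value, not a measured failure rate, and the variable names are invented for this example.

```matlab
% Illustrative only: f is a hypothetical per-drive failure probability
% over some fixed period, and drive failures are assumed independent.
f = 0.03;

pSingle = f;              % one drive on its own
pRaid0  = 1 - (1 - f)^2;  % RAID 0 loses data if either drive fails
pRaid1  = f^2;            % RAID 1 loses data only if both drives fail

fprintf('Single drive:      %.4f\n', pSingle);
fprintf('RAID 0 (2 drives): %.4f\n', pRaid0);
fprintf('RAID 1 (2 drives): %.4f\n', pRaid1);
```

Under these assumptions the RAID 0 figure comes out just under 2f, while the RAID 1 figure drops to f^2.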
Another set of RAID levels is RAID 5 and RAID 6. These RAID levels essentially combine
the advantages of RAID 0 and RAID 1. RAID 5 requires a minimum of 3 disks, while RAID 6 requires
a minimum of 4 disks. Using the exclusive or (XOR) function, a parity bit is created. So in a three-disk
RAID 5, one bit is written to one disk, the next bit to the next disk, and the XOR of the two, the parity
bit, is written to the third disk. The parity position rotates between the drives so that every drive
holds some parity bits and some actual data. RAID 5 can expand to any number of drives. With RAID 5, one
drive can be lost and all the data can still be recovered. The usable capacity of a RAID 5
array is (n-1) * capacity of each drive, where n is the number of drives in the array. RAID 6 is
very similar to RAID 5; however, two parity bits are generated rather than just one. As such, RAID 6
can lose two disks in the array and still recover all the data. RAID 6’s capacity is (n-2) * capacity
of each drive. Figures 1.3 and 1.4 illustrate RAID 5 and RAID 6 respectively.
Figure 1.3: RAID 5 Array. [2]
Figure 1.4: RAID 6 Array. [1]
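The XOR parity idea can be tried out directly. The sketch below is only an illustration of a single three-disk RAID 5 stripe: the byte values, the variable names, and the 10-drive, 2 TB capacity example are all made up, and the second parity used by RAID 6 (which is typically computed differently) is not shown.

```matlab
% Two data blocks and their XOR parity, as in one 3-disk RAID 5 stripe.
% The byte values are arbitrary examples.
d1 = uint8([ 12 200  55 255]);
d2 = uint8([240  17  55   0]);
parity = bitxor(d1, d2);

% Suppose the disk holding d2 is lost: XOR the survivors to rebuild it.
d2Rebuilt = bitxor(d1, parity);
disp(isequal(d2Rebuilt, d2));   % displays 1 when the rebuilt block matches

% Usable capacity for nDrives drives of driveTB terabytes each.
nDrives = 10;  driveTB = 2;     % example values
raid5TB = (nDrives - 1) * driveTB;
raid6TB = (nDrives - 2) * driveTB;
fprintf('RAID 5: %d TB usable, RAID 6: %d TB usable\n', raid5TB, raid6TB);
```

The rebuild works because XOR is its own inverse: XORing the parity with the surviving block cancels that block and leaves the missing one. It also shows why a rebuild has to read every surviving disk in full.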
The interesting points about RAID 5 and 6 arise when they are in recovery mode. All hard disks
have some probability of experiencing an unrecoverable read error (URE). This probability is around 1 in
10^14 bits read for consumer-level hard drives.[3] During a RAID rebuild, all of the data in the array, apart from the failed
drive’s share, must be read, and new parity or real data bits must be generated on the
replacement drive(s). As hard disk capacities have increased, the number of bits read during a rebuild has reached the order of 10^14, so hitting a URE becomes likely.
With RAID 5, while the array is rebuilding, it no longer has any protection from a read error or
drive failure. Should an unrecoverable read error occur, all of the data will be lost. RAID 6 can tolerate
a second drive failure or a single unrecoverable read error during the rebuild; however, if two or more read errors occur,
all of the data in the array will be lost. Using probability theory, specifically Bernoulli trials, the
probability that a RAID 5 or RAID 6 array will fail during recovery, given that one drive has already failed,
can be calculated. Calculating this probability requires the use of Equation 1.1, where n is the number of bits
being read, k is the number of unrecoverable read errors, p is the probability of a read error on any one bit, and q is
simply 1-p.
Equation 1.1: $P = \frac{n!}{k!\,(n-k)!}\, p^{k}\, q^{\,n-k}$
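Equation 1.1 cannot be evaluated literally for a realistic n, since n! overflows long before n reaches 10^14. One possible workaround, sketched below, is to work in logarithms. The helper name `binoProb` is made up for this example, and it assumes k is a small non-negative integer, because the binomial coefficient is built up term by term.

```matlab
% Equation 1.1 evaluated in log form so that n! is never formed.
% The binomial coefficient is n*(n-1)*...*(n-k+1) / k!, with
% gammaln(k+1) = log(k!). log1p(-p) is log(1-p) computed without
% first rounding 1-p to double precision.
binoProb = @(k, n, p) exp(sum(log(n - (0:k-1))) - gammaln(k+1) ...
                          + k*log(p) + (n-k)*log1p(-p));

% Sanity check on a small case: 2 heads in 10 fair coin flips = 45/1024.
disp(binoProb(2, 10, 0.5));

% Probability of exactly one URE in 304e12 bit reads at p = 1e-14.
disp(binoProb(1, 304e12, 1e-14));
```

For k = 0 the product is empty, so the coefficient is 1 and the expression reduces to (1-p)^n.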
For RAID 5, the rebuild fails if there is at least one read error. It is much
simpler to calculate this probability as one minus the probability of zero errors, as shown below.
Equation 1.2: RAID 5 Restoration Failure Probability.
$P(\text{Rebuild Fail}) = 1 - P(\text{Rebuild}) = 1 - P(0\ \text{Errors}) = 1 - \frac{(\#\text{Bits})!}{0!\,(\#\text{Bits}-0)!}\,(10^{-14})^{0}\,(1-10^{-14})^{\#\text{Bits}}$
Since the number of read errors, k, in Equation 1.1 is zero, the binomial coefficient simplifies to one.
Equation 1.3: $P(\text{Rebuild Fail}) = 1 - 1\cdot(10^{-14})^{0}\,(1-10^{-14})^{\#\text{Bits}} = 1 - (1-10^{-14})^{\#\text{Bits}}$
RAID 6, due to its extra redundancy, fails only if there are at least two read errors. As with RAID 5, the probability
that a rebuild will fail can be found by computing one minus the probability that a rebuild will succeed. In order
for the rebuild to succeed, either no errors or exactly one error must occur. Since no errors and one error are mutually exclusive,
their probabilities can be summed. Thus the probability can be found as shown below:
Equation 1.4: RAID 6 Restoration Failure Probability.
$P(\text{Rebuild Fail}) = 1 - P(\text{Rebuild}) = 1 - P(0\ \text{Errors} \cup 1\ \text{Error}) = 1 - \left( \frac{(\#\text{Bits})!}{0!\,(\#\text{Bits}-0)!}\,(10^{-14})^{0}\,(1-10^{-14})^{\#\text{Bits}} + \frac{(\#\text{Bits})!}{1!\,(\#\text{Bits}-1)!}\,(10^{-14})^{1}\,(1-10^{-14})^{\#\text{Bits}-1} \right)$
The binomial coefficients in Equation 1.4 can be simplified.
Equation 1.5: $P(\text{Rebuild Fail}) = 1 - \left( 1\cdot(10^{-14})^{0}\,(1-10^{-14})^{\#\text{Bits}} + (\#\text{Bits})\,(10^{-14})^{1}\,(1-10^{-14})^{\#\text{Bits}-1} \right)$
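Because #Bits is huge and the URE probability is tiny, Equations 1.3 and 1.5 sit very close to their Poisson limits, which makes the numbers easy to sanity-check by hand. This is a standard approximation added here for convenience, not part of the original derivation. Here p stands for the per-bit URE probability (10^-14 above) and λ, a symbol introduced for this note, is the expected number of read errors during the rebuild.

```latex
\begin{align}
\lambda &= \#\text{Bits}\cdot p, \qquad (1-p)^{\#\text{Bits}} \approx e^{-\lambda} \\
P_{\text{RAID 5}}(\text{Rebuild Fail}) &\approx 1 - e^{-\lambda} \\
P_{\text{RAID 6}}(\text{Rebuild Fail}) &\approx 1 - e^{-\lambda}\,(1+\lambda)
\end{align}
```

So what matters is the product of bits read and URE rate. Once λ reaches the order of 1, a failed rebuild becomes more likely than not for RAID 5.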
The question posed for this application is whether or not consumer-level hard drives, with their URE rate of 1 in 10^14 bits read, are suitable for server use. A typical server array may include as many as twenty drives in each RAID array. Modern
disks come in capacities up to two terabytes (2*10^12 bytes). There are eight bits in one byte, so there are 16*10^12 bits in one hard drive.
Thus, twenty drives will have 320*10^12 bits. During a rebuild with one drive failed, the nineteen remaining drives will need to be read in full,
whether their contents are real data or parity bits. Thus 304*10^12 bits will need to be read. Using Equations 1.3 and 1.5, the
probability of a RAID recovery failure for levels 5 and 6, respectively, can be computed.
RAID 5: $P(\text{Recovery Failure}) = 1 - 1\cdot(10^{-14})^{0}\,(1-10^{-14})^{304\times10^{12}} = 0.952165$
RAID 6: $P(\text{Recovery Failure}) = 1 - \left( 1\cdot(10^{-14})^{0}\,(1-10^{-14})^{304\times10^{12}} + (304\times10^{12})\,(10^{-14})^{1}\,(1-10^{-14})^{304\times10^{12}-1} \right) = 0.806365$
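The arithmetic above can be checked with a short MATLAB sketch. It uses the Poisson forms 1 - e^(-λ) and 1 - e^(-λ)(1+λ), where λ is the expected number of UREs, as a stand-in for the binomial expressions. That is an added approximation rather than the method used in the paper, and the variable names are made up for this example.

```matlab
% Bits read while rebuilding a 20-drive array of 2 TB drives.
nDrives     = 20;
bytesPerDrv = 2e12;                        % 2 TB per drive
bitsPerDrv  = 8 * bytesPerDrv;             % 16e12 bits per drive
bitsRead    = (nDrives - 1) * bitsPerDrv;  % 19 surviving drives, read in full

p      = 1e-14;          % URE probability per bit read
lambda = bitsRead * p;   % expected number of UREs during the rebuild

% Poisson approximation to the binomial expressions
% (assumption: n very large, p very small).
pFail5 = -expm1(-lambda);                  % P(at least 1 URE)
pFail6 = 1 - exp(-lambda) * (1 + lambda);  % P(at least 2 UREs)

fprintf('Bits read: %.3g, expected UREs: %.2f\n', bitsRead, lambda);
fprintf('RAID 5: %.6f   RAID 6: %.6f\n', pFail5, pFail6);
```

With these inputs the sketch gives about 0.952165 for RAID 5, matching the value above, and about 0.806747 for RAID 6. That is slightly higher than the 0.806365 quoted above; the small gap may come from rounding in the original calculation.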
As seen above, RAID 6 does significantly decrease the chance that an array will fail to recover its data. Regardless,
in both cases with a 20-drive array, the chance of a successful recovery is quite low: roughly 5% for RAID 5 and 19% for RAID 6. As
the numbers show, consumer-level drives are definitely inadequate for server use at this scale, or for any application where the data stored on
these drives is critical. For servers, drives with an unrecoverable error rate of 1 in 10^15 or better are necessary, together with
RAID level 6. If RAID level 5 is desired, then drives with a URE rate of 1 in 10^16 are required. Table 1.1 shows
the probability calculations for drives with different URE rates and different RAID levels.
Probability of Restoration Failing
Drive URE | RAID 5 | RAID 6
1 in 10^14 | 0.952049 | 0.806277
1 in 10^15 | 0.26196 | 0.037596
1 in 10^16 | 0.033188 | 0.003796
Table 1.1: Restoration Failure Probabilities for RAID 5 and 6 with Differing UREs.
All calculations were performed using the following MATLAB code:
% Bernoulli Trials Project Program %
% Start with a clean slate in MATLAB %
clc
clear
% Program purpose %
fprintf('Enter the required values for the Bernoulli equation. These values \n')
fprintf('include the number of drives, the capacity of each drive, the probability \n')
fprintf('that a bit read will suffer a URE, and the RAID level. \n\n');
fprintf('Thank you for using our program. \n\n');
% Data entry %
n = input('Please ENTER the number of drives: -> ');
d = input('Please ENTER the capacity of each drive in terabytes: -> ');
p = input('Please ENTER the probability of a URE: -> ');
k = input('Please ENTER the RAID level: -> ');
% Bits read during a rebuild: each of the n-1 surviving drives is read in full %
capac = (n-1)*d;            % terabytes read
capacBytes = capac*10^12;   % bytes read
bits = capacBytes*8;        % bits read
% Calculations %
% Note: (1-p) is formed in double precision, so a very small p is rounded here %
if k == 5
    % RAID 5 fails on one or more UREs: 1 - P(0 errors) %
    Pr = 1 - (1-p)^bits;
elseif k == 6
    % RAID 6 fails on two or more UREs: 1 - P(1 error) - P(0 errors) %
    Pr = 1 - bits*p*(1-p)^(bits-1) - (1-p)^bits;
else
    error('RAID level must be 5 or 6.');
end
fprintf('The probability that the array will fail at rebuilding is: %f\n', Pr);
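One caveat with evaluating (1-p)^bits directly: in double precision, 1-p is rounded to the nearest representable number. For p = 1e-16 this changes the effective p by roughly 11%, and even for p = 1e-14 by a little under 0.1%. The sketch below is a hedged alternative, not the code used for the paper. It evaluates the same two formulas through `log1p` and `expm1` and prints both versions side by side. The function handles `direct5`, `direct6`, `stable5`, and `stable6` are names made up for this example.

```matlab
% Bits read when rebuilding a 20-drive array of 2 TB drives.
bits = 19 * 2e12 * 8;

% Direct forms, as in the program above.
direct5 = @(p) 1 - (1-p)^bits;
direct6 = @(p) 1 - bits*p*(1-p)^(bits-1) - (1-p)^bits;

% Same formulas, but (1-p)^m is computed as exp(m*log1p(-p)) so that
% 1-p is never rounded; -expm1(x) is 1 - exp(x) without cancellation.
stable5 = @(p) -expm1(bits*log1p(-p));
stable6 = @(p) -expm1(bits*log1p(-p)) - bits*p*exp((bits-1)*log1p(-p));

fprintf('URE rate   RAID5 direct  RAID5 stable  RAID6 direct  RAID6 stable\n');
for p = [1e-14 1e-15 1e-16]
    fprintf('%8.0e   %.6f      %.6f      %.6f      %.6f\n', ...
            p, direct5(p), stable5(p), direct6(p), stable6(p));
end
```

On IEEE double precision the direct columns should reproduce Table 1.1. The stable columns come out near 0.952165, 0.262139, and 0.029943 for RAID 5, and near 0.806747, 0.037829, and 0.000453 for RAID 6. The ranking of the options is unchanged, but the RAID 6 figure at 1 in 10^16 is roughly eight times smaller than the direct evaluation suggests.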