FreeNAS/ZFS - When to replace problematic drive(s)

Viper GTS

Lifer
Oct 13, 1999
38,107
433
136
I've got a couple of drives in my ZFS system that are giving me trouble. I have 16 identical HD103SJs from over five years ago that have been flawless up until now. I've been getting emails every few weeks about an unrecoverable read error on one of them, and today I got a message about another drive for the first time.

I pulled SMART data on all 16 drives; these are the only two that show any value other than 0 for RAW_VALUE on Raw_Read_Error_Rate. Both are fine from a value/threshold standpoint. For reference, the worse of the two:

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       5782
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   072   070   025    Pre-fail  Always       -       8782
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       38413
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       93
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   059   053   000    Old_age   Always       -       41 (Min/Max 21/52)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       93
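
For anyone wanting to repeat the check, a rough loop along these lines should pull that attribute from every disk (kern.disks is FreeBSD's disk list; exact device names and whether you need a smartctl device-type flag behind your HBA will vary):

Code:
# Pull Raw_Read_Error_Rate from every disk the kernel knows about.
# 'kern.disks' lists them on FreeBSD/FreeNAS; adjust for your controller.
for d in $(sysctl -n kern.disks); do
  echo "== ${d} =="
  smartctl -A /dev/${d} | grep Raw_Read_Error_Rate
done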

I have a spare drive on hand to replace this one, but much of what I'm finding says to just ignore the raw value until the normalized value actually trips the failure threshold. When I have 14 other drives reporting 0, though, I feel like I'm tempting fate.
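
From what I can tell, the failure that actually gets flagged is the normalized VALUE column dropping to its THRESH, which is what the drive's overall health self-assessment reports; a quick sketch (same device-name assumptions as the loop above) to see where each disk stands on that verdict:

Code:
# smartctl's overall-health result is the VALUE-vs-THRESH verdict,
# independent of the raw error counts.
for d in $(sysctl -n kern.disks); do
  echo -n "${d}: "
  smartctl -H /dev/${d} | grep -i 'overall-health'
done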

Should I order a few more spares and replace both of these, or just hold out until FreeNAS actually fails a drive? I never expected to have all 16 drives still alive after this long; they've been amazingly reliable for a distinctly consumer-class drive.

Viper GTS
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
I'm running Solaris 11.2 and it automatically offlined drives that started generating excessive errors. I would have expected FreeNAS to do the same.

Try replacing cables first.

Do you have disks dropping out of the array?

If memory serves, he's running a Norco case with a backplane, so there aren't individual cables to the drives.
 

Viper GTS

Lifer
Oct 13, 1999
38,107
433
136
I'm running Solaris 11.2 and it automatically offlined drives that started generating excessive errors. I would have expected FreeNAS to do the same.



If memory serves, he's running a Norco case with a backplane, so there aren't individual cables to the drives.

It's ridiculous that you remember that, but you are correct. Norco 4224 with SFF-8087 cables to the storage controllers.

Yes, I expect it would eventually fail the drive if the rate crossed the threshold, and it seems to be well below that at this point.

No major signs of issues other than the emails every few weeks (four from March 1 to July 5 on this particular drive). I switched to FreeNAS (from Nexenta) in December of last year, so it ran without error from late December to March of this year.

Viper GTS
 

aigomorla

CPU, Cases & Cooling Mod | PC Gaming Mod | Elite Member
Super Moderator
Sep 28, 2005
21,137
3,675
126
Sigh...

Code:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       68583328
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       41304756
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       25998
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       57
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   092   092   000    Old_age   Always       -       8
190 Airflow_Temperature_Cel 0x0022   073   052   045    Old_age   Always       -       27 (Min/Max 27/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       41
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       57
194 Temperature_Celsius     0x0022   027   048   000    Old_age   Always       -       27 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Just got a warning from FreeNAS saying to replace this NOW... :\

But from Googling around, I hear IDs 5, 197, and 198 are the important ones for knowing when you need to replace a drive ASAP.
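
If those are the ones to watch, something like this would pull just those three attributes for every disk (same assumption as earlier that the drives enumerate under kern.disks; field $10 is the RAW_VALUE column in the standard smartctl -A layout):

Code:
# Show only reallocated / pending / offline-uncorrectable counts
# (attribute IDs 5, 197, 198) for each disk; non-zero raw values here
# are the usual "replace it" sign.
for d in $(sysctl -n kern.disks); do
  echo "== ${d} =="
  smartctl -A /dev/${d} | awk '$1 == 5 || $1 == 197 || $1 == 198 {print "  "$2" = "$10}'
done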
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
It's ridiculous that you remember that, but you are correct.

For me, it's like somebody posting about their nice car. It sticks with me.

Yes, I expect it would eventually fail the drive if the rate crossed the threshold, and it seems to be well below that at this point.

That's a fair point; I don't know how long the drives were generating errors before it offlined them. IMO, though, once it starts you're on borrowed time. That said, my first drive failure (running a Z2 array) didn't have any impact on performance. I didn't even realize I had a failed drive until I logged into the web UI to check something else. I'm probably going to be replacing mine tonight when I get home.
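
For anyone curious about the mechanics, the swap itself is straightforward from the shell (the FreeNAS GUI has its own volume-status replace workflow as well); a rough sketch, with "tank" and "da5" standing in for whatever the pool and disk are actually called on your box:

Code:
# Placeholder names: 'tank' is the pool, 'da5' the failed disk's slot.
zpool status tank          # confirm which vdev member is faulted
zpool offline tank da5     # optionally offline it before pulling the disk
# ...physically swap the drive, then start the resilver:
zpool replace tank da5
zpool status -v tank       # watch resilver progress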

Edit:

And now I wait:

(attached screenshot)
 