Did I just lose my entire raid array? [STARTED!]

Red Squirrel

No Lifer
My server is on its last legs, which is why I bought a 200AH battery backup system for it, given all the outages we get here. Any time it goes down, something fails on it. I really should just bite the bullet and upgrade. Unfortunately we just had an outage that lasted about an hour too long, and it went down hard. I really should have done a manual shutdown when it hit 11 volts, but I was hoping to squeeze a bit more out of it as it was so close to the ETA.

Power came back, I booted the server, and as I imagined would happen, no RAID array. 🙁

If I try to manually start it, I get this:

[root@borg ~]# mdadm --manage --run /dev/md0
mdadm: failed to run array /dev/md0: Invalid argument
[root@borg ~]#

Am I screwed? All the drives do show up and I can check SMART info on them. They do appear to have some errors though.
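To skim the long dumps that follow for those errors, something like this rough sketch could help. It assumes the attribute rows keep the ten-column layout that smartctl 5.38 prints here. The function name `nonzero_error_counters` and the `WATCHED` list are made up for illustration and are not part of smartctl.

```python
import re

# Attributes whose raw value is an error or event count. This watch list is
# an assumption made for illustration; smartctl does not define it.
WATCHED = {
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Multi_Zone_Error_Rate",
}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (the ten columns seen in the pasted output).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def nonzero_error_counters(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # headers, self-test log rows, blank lines
        name, raw = match.group(2), int(match.group(9))
        if name in WATCHED and raw > 0:
            flagged[name] = raw
    return flagged


if __name__ == "__main__":
    # Four rows copied from the /dev/sde output in this post.
    sample = "\n".join([
        "1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 17",
        "5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0",
        "199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0",
        "200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 9",
    ])
    print(nonzero_error_counters(sample))
```

Run against the /dev/sde rows, this flags Raw_Read_Error_Rate (17) and Multi_Zone_Error_Rate (9). Raw values are vendor-specific, so a non-zero count is a prompt to look closer, not proof of a failing drive.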

[root@borg ~]# smartctl -a /dev/sdb
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Y9A0
Serial Number: WD-WCAW32643966
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:15 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16800) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 173) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 177 175 021 Pre-fail Always - 4133
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 35
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 096 095 000 Old_age Always - 3598
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 23
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 13
194 Temperature_Celsius 0x0022 123 101 000 Old_age Always - 24
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 2562 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@borg ~]# smartctl -a /dev/sdc
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Y9A0
Serial Number: WD-WCAW32668254
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:19 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16800) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 173) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2
3 Spin_Up_Time 0x0027 177 174 021 Pre-fail Always - 4125
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 35
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3576
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 23
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 13
194 Temperature_Celsius 0x0022 123 099 000 Old_age Always - 24
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 2540 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@borg ~]# smartctl -a /dev/sdd
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Y9A0
Serial Number: WD-WCAW32467397
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:22 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16680) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 172) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 5
3 Spin_Up_Time 0x0027 174 172 021 Pre-fail Always - 4275
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 35
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 096 095 000 Old_age Always - 3602
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 23
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 13
194 Temperature_Celsius 0x0022 124 105 000 Old_age Always - 23
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 2566 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@borg ~]# smartctl -a /dev/sde
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Z3A0
Serial Number: WD-WCATRA314980
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:24 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16260) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 189) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 17
3 Spin_Up_Time 0x0027 176 176 021 Pre-fail Always - 4166
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 11
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 4633
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 9
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 8
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2
194 Temperature_Celsius 0x0022 123 116 000 Old_age Always - 24
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 9

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@borg ~]# smartctl -a /dev/sdf
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Y9A0
Serial Number: WD-WCAW32590153
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:25 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16860) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 174) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2
3 Spin_Up_Time 0x0027 177 175 021 Pre-fail Always - 4125
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 26
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2713
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 24
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 18
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 9
194 Temperature_Celsius 0x0022 121 104 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 7

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1676 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@borg ~]# smartctl -a /dev/sdg
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Y9A0
Serial Number: WD-WCAW31551813
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:29 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16680) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 172) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
3 Spin_Up_Time 0x0027 174 173 021 Pre-fail Always - 4258
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 43
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4149
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 41
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 28
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 16
194 Temperature_Celsius 0x0022 122 100 000 Old_age Always - 25
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 3112 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@borg ~]# smartctl -a /dev/sdh
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Y9A0
Serial Number: WD-WCAW31627985
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:32 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16080) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 166) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 175 172 021 Pre-fail Always - 4250
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 089 089 000 Old_age Always - 8325
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 10
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 7
194 Temperature_Celsius 0x0022 120 109 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 6102 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@borg ~]# smartctl -a /dev/sdi
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Z3A0
Serial Number: WD-WCATRA194858
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:34 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (16260) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 189) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 177 177 021 Pre-fail Always - 4108
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 10
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2489
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 8
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 7
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2
194 Temperature_Celsius 0x0022 120 114 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@borg ~]# smartctl -a /dev/sdj
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1002FAEX-00Z3A0
Serial Number: WD-WCATRA191825
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Apr 15 22:33:49 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (17160) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 199) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 177 177 021 Pre-fail Always - 4133
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 10
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2489
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 8
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 7
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2
194 Temperature_Celsius 0x0022 121 114 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 10
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Any chance of me being able to recover this array? What do I do?
 
Something rather interesting I just found on Google:

[root@borg ~]# mdadm --examine /dev/sdc
/dev/sdc:
Magic : a92b4efc
Version : 0.90.00
UUID : 11f961e7:0e37ba39:2c8a1552:76dd72ee
Creation Time : Sat Sep 20 02:15:28 2008
Raid Level : raid5
Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
Raid Devices : 8
Total Devices : 9
Preferred Minor : 0

Update Time : Mon Apr 15 21:03:17 2013
State : clean
Active Devices : 8
Working Devices : 9
Failed Devices : 0
Spare Devices : 1
Checksum : 417e20f6 - correct
Events : 2089486

Layout : left-symmetric
Chunk Size : 64K

Number Major Minor RaidDevice State
this 2 8 32 2 active sync /dev/sdc

0 0 8 96 0 active sync /dev/sdg
1 1 8 16 1 active sync /dev/sdb
2 2 8 32 2 active sync /dev/sdc
3 3 8 48 3 active sync /dev/sdd
4 4 8 112 4 active sync /dev/sdh
5 5 8 80 5 active sync /dev/sdf
6 6 8 128 6 active sync /dev/sdi
7 7 8 160 7 active sync
8 8 8 144 8 spare /dev/sdj


Is there perhaps hope?
 
Ok so turns out you have to specify each device individually at startup. I don't know why, I really don't recall having to do this before, and I can see that being problematic if I switch drives around while the server is off... There has to be a better way, but regardless, it worked!


[root@borg ~]# mdadm --assemble /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sd
sda sda1 sda2 sda3 sdb sdc sdd sde sdf sdg sdh sdi sdj
[root@borg ~]# mdadm --assemble /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
mdadm: /dev/md0 has been started with 8 drives and 1 spare.
[root@borg ~]#
[root@borg ~]#
[root@borg ~]#
[root@borg ~]#
[root@borg ~]# mount /dev/md0
[root@borg ~]#
[root@borg ~]#
[root@borg ~]# dir /raid1/
total 60
4 drwxrwxr-x 6 root smbusers 4096 2013-03-15 02:25 applications
4 drwxrwxr-x 10 root smbusers 4096 2013-03-15 02:20 backups
0 -rw-r--r-- 1 root root 0 2011-10-02 20:47 dosync_args.txt
4 drwxrwxr-x 8 root smbusers 4096 2012-07-08 00:24 intranet
16 drwx------ 4 root root 16384 2008-09-20 02:26 lost+found
4 drwxr-xr-x 5 root root 4096 2011-06-28 23:31 misc
4 drwxrwxr-x 9 p2puser p2puser 4096 2012-10-04 18:58 p2p
4 drwxrwxr-x 14 root smbusers 4096 2013-03-23 01:35 public
4 drwxr-xr-x 6 root root 4096 2013-03-15 02:31 scripts
4 drwxr-xr-x 4 mysql mysql 4096 2009-01-11 16:08 sqldata
4 drwxr-xr-x 3 root root 4096 2012-08-13 19:19 tmp
4 drwxr-xr-x 3 root smbusers 4096 2011-07-23 01:20 userdata
4 drwxrwxrwt 6 root vmusers 4096 2011-09-13 20:32 vms
[root@borg ~]#
[root@borg ~]#
[root@borg ~]#
[root@borg ~]# mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Sat Sep 20 02:15:28 2008
Raid Level : raid5
Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
Raid Devices : 8
Total Devices : 9
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Mon Apr 15 22:52:53 2013
State : clean
Active Devices : 8
Working Devices : 9
Failed Devices : 0
Spare Devices : 1

Layout : left-symmetric
Chunk Size : 64K

UUID : 11f961e7:0e37ba39:2c8a1552:76dd72ee
Events : 0.2089486

Number Major Minor RaidDevice State
0 8 96 0 active sync /dev/sdg
1 8 16 1 active sync /dev/sdb
2 8 32 2 active sync /dev/sdc
3 8 48 3 active sync /dev/sdd
4 8 112 4 active sync /dev/sdh
5 8 80 5 active sync /dev/sdf
6 8 128 6 active sync /dev/sdi
7 8 144 7 active sync /dev/sdj

8 8 64 - spare /dev/sde
[root@borg ~]#


Guess I'll have to add that to my rc.local. I could have sworn I did not have to do this before, though. My mdadm.conf also shows 6 devices when it should be 9. Something is fishy.
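
A stale device list in mdadm.conf may be why the array no longer comes up by itself. One common approach is to identify the array by UUID instead of by member names, so the sdX letters can move around between boots. The sketch below only prints a candidate stanza. The helper name `print_mdadm_conf` is made up for illustration, the UUID is the one shown in the --examine output above, and the result should be compared against what `mdadm --examine --scan` reports before replacing anything.

```shell
#!/bin/sh
# Hypothetical helper: print a UUID-based mdadm.conf stanza.
# "DEVICE partitions" tells mdadm to consider everything listed in
# /proc/partitions, so no sdX names are hard-coded.
print_mdadm_conf() {
    md_dev=$1
    uuid=$2
    printf 'DEVICE partitions\n'
    printf 'ARRAY %s UUID=%s\n' "$md_dev" "$uuid"
}

# UUID taken from the --examine output in this thread.
print_mdadm_conf /dev/md0 11f961e7:0e37ba39:2c8a1552:76dd72ee
```

With a stanza like this in place, `mdadm --assemble --scan` should be able to find the members no matter which sdX name each drive ends up with.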
 
...and I just got a drive failure. Fuck, I have the most terrible luck with this stuff. Rebuilding on the hot spare right now. Time to replace all the drives again; I can't trust them after they've been spun down for over an hour like that. I really should have gone out to Canadian Tire right away to get a couple more marine batteries. $300 is cheaper than the $1000+ it will cost to replace all the drives.
 
Then, make a backup.

Then, ditch the RAID 5. You're not unlucky. You're using RAID 5, with modern hardware. Don't do that.

To be even halfway reliable, RAID 5 needs a machine UPS, configured correctly, to shut the PC down nicely, and/or a hardware RAID controller with its own BBU. Then, you will still have the write hole to worry over, you will still have bit error rates biting you, poor performance when you need it if you have lots of scrubbing to handle that problem, and an array rebuild will take an eternity. Even after that, you might then have a slow as death full-partition FSCK to wait for! Ugh!

With RAID 1 or 10, you'll find the occasional files disappearing, or finding their way to lost+found, when power glitches happen, much like a single drive; the array won't go into a degraded state nearly as often; and when it does get degraded, the rebuild will be far quicker. Use EXT4, of course.

As of today, RAID-Z1 (ZFS) is your best option to get RAID 5's storage efficiency, without having an array that will go south if you look at it funny. Otherwise, stick to 1 and 10, with EXT4 (a stop-gap until ZFS support on Linux is more stable and BTRFS' toolset matures).
 
I have a UPS, but it only lasted about 3-4 hours. Need to add more batteries and probably need to add water to the current ones. Could have sworn they lasted like 5 hours before.

Thankfully I do have backups, still a hassle though. I never heard of all these issues with raid 5. Is it because I'm using such an old version of md raid? Raid 1 is kinda a waste imo, and 10 is a waste too unless performance is the main factor. I can still saturate my gigabit connection with this raid 5 and I only lose 1 drive's worth of space. Though if I was to do it over again I'd probably use raid 6. Not a fan of hardware raid; I have VERY bad luck with any hardware (I'd say 1 out of 5 pieces of hardware I order is DOA), so the card would be an extra point of failure. At least with software I can swap controller cards easily.

I would like to look at ZFS though, I just want to wait till it's more proven for Linux. Though if I build a dedicated storage system I can always just use Solaris for that box and then use NFS. On the other hand, with 3TB drives being under $200 now I suppose raid 10 could be a consideration. 6x 3TB drives would give me around 8.3TB of usable space, so I'd even be upgrading and freeing up a few extra slots. Though I heard some cards don't support 3TB drives, is this the case? This is a very old server using cheap 2 port PCI cards, as I could not find a single card with enough ports at the time.
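
For comparing layouts, the usable-space arithmetic is easy to script. This is a rough sketch that ignores filesystem and metadata overhead; `usable_tib` is a made-up helper, drive sizes are taken as decimal terabytes (1e12 bytes), and the result is printed in binary TiB, which is how most tools report it.

```shell
#!/bin/sh
# Usable capacity in TiB for N equal drives of SIZE_TB decimal terabytes.
# Level 10 keeps half the drives, 5 gives up one drive, 6 gives up two.
usable_tib() {
    awk -v lvl="$1" -v n="$2" -v tb="$3" 'BEGIN {
        if (lvl == 10)     data = n / 2
        else if (lvl == 5) data = n - 1
        else if (lvl == 6) data = n - 2
        else { print "unsupported level"; exit 1 }
        printf "%.2f\n", data * tb * 1e12 / (1024 ^ 4)
    }'
}

usable_tib 10 6 3   # six 3 TB drives in RAID 10, prints 8.19
usable_tib 6 6 3    # the same drives in RAID 6, prints 10.91
usable_tib 5 8 1    # the existing eight-drive RAID 5, prints 6.37
```

The RAID 10 figure comes out a little under the estimate quoted above, and the RAID 5 line roughly matches the 6520.58 GiB that mdadm reports for the current array.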
 
UPSes are meant to last long enough for you to shut the servers down gracefully. Seems like what you need is a generator.

Yeah... without a backup power source, a UPS should only be used to shut down the system gracefully. If your power goes out for hours at a time fairly often, I would definitely consider getting a generator.
 
I have a UPS, but it only lasted about 3-4 hours. Need to add more batteries and probably need to add water to the current ones. Could have sworn they lasted like 5 hours before.
You need to configure it to shut down the system when the batteries get low, no matter how long it may last. A UPS that lasts 15 minutes is plenty, as long as it shuts the system down after several minutes.

I never heard of all these issues with raid 5, is it because I'm using such an old version of md raid? Raid 1 is kinda a waste imo, and 10 is a waste too unless performance is the main factor.
Google something like "CERN RAID 5"...it'll be a start (I wish I'd bookmarked what I'd read a few years ago; the tech news and blog sites have done lots of SEO work, and I can't find some of the other good stuff, anymore).

Here's a good one, though:
http://queue.acm.org/detail.cfm?id=1670144

More generally, if you haven't heard about it, I can only assume you haven't worked much, if at all, in IT. RAID 5 has gradually become worse, bit by bit, over time, in practice.

RAID 10 is not a waste, but a stop-gap, if you don't want to use FreeBSD, or don't want to trust ZFS for Linux (just announced as stable). The data corruption problems of RAID could be solved, and at a device level, but nobody's willing to do that (enterprise PHBs don't want to change, and would prefer solutions like 520B sector drives, over better RAID and file system implementations), it seems. RAID from 1988 does not protect against today's common failures so well (UREs and silent corruption).
 
That's a poor assumption to make. All IT shops I've worked in use nothing but raid 5 for storage, including a hospital storing medical data. Raid 1 for OS is typical, with raid 5 for data. All the SANs I've worked with were also configured as raid 5. I've seen 10 a few times. This thing about raid 5 being bad is news to me and even seems odd, as I've never seen any of the stuff you describe. My main issue is my server being really old: whenever it is shut down I have problems. I need to upgrade. It took a really bad hit years ago during a 5+ hour outage, before I had a good UPS. I shut it down within 15 minutes, it cooled down completely, and it really did not like that. Ended up having to replace the backplanes, all the drives, cables, etc. I still get occasional random errors and lockups.

The purpose of my UPS setup is for keeping the equipment up (much like a telco UPS is meant to do) so for now I just have to add more batteries. I am considering a generator though for very extended run so I can top up the batteries during long outages. Also good to have in case of large scale emergencies. Need to install a transfer switch and inlet though to do it properly and not something I want to do when there's 3-4 feet of snow outside. Maybe in the summer I can consider that.

That said, I am definitely considering ZFS for my next solution as it is superior to raid 5, I can't deny that. At least from what I heard, I have yet to try it. The biggest issue I have with raid 5 is the rebuild times. It takes a good 5 hours to rebuild my server's raid. It's a stressful time, because if another drive fails I lose everything and have to go to backups.
 
Another quick one (main site is really slow): http://webcache.googleusercontent.c...-new-raid-level-recommendations-from-dell+&cd

It's nothing new (it was predicted a very long time ago, even). The good news is that since error rates on drives are actually lower than specified (of course), it's not quite as bad as predicted by the specs and maths. But, that just means the curve isn't quite as steep right now.
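
The "specs and maths" prediction can be reproduced with a back-of-the-envelope calculation. This sketch assumes the commonly quoted consumer-drive spec of one unrecoverable read error per 1e14 bits (an assumption, not a figure from this thread) and treats errors as independent, so the chance of hitting at least one URE while reading n bits is roughly 1 - exp(-n * rate). The name `ure_probability` is made up for illustration.

```shell
#!/bin/sh
# Approximate chance of at least one URE while reading every surviving
# drive in full during a rebuild.
# Arguments: surviving drives, bytes per drive, URE rate per bit (assumed).
ure_probability() {
    awk -v drives="$1" -v bytes="$2" -v rate="$3" 'BEGIN {
        bits = drives * bytes * 8
        printf "%.2f\n", 1 - exp(-bits * rate)
    }'
}

# Seven surviving 1 TB drives of an eight-drive RAID 5: prints 0.43
ure_probability 7 1000000000000 1e-14
# The same layout built from 3 TB drives: prints 0.81
ure_probability 7 3000000000000 1e-14
```

Since real drives tend to do better than the spec sheet, these numbers are on the pessimistic side, but they show how quickly the risk climbs with drive size.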

Fundamentally, the problem is that RAID trusts hardware to be either good or bad, and we need new implementations that behave more like networking, trusting only in data (as of today, the only implementations are tied to an FS, or to specific NAS hardware).
 
The purpose of my UPS setup is for keeping the equipment up (much like a telco UPS is meant to do) so for now I just have to add more batteries. I am considering a generator though for very extended run so I can top up the batteries during long outages. Also good to have in case of large scale emergencies. Need to install a transfer switch and inlet though to do it properly and not something I want to do when there's 3-4 feet of snow outside. Maybe in the summer I can consider that.

No... not really. Their UPSs are designed to keep the systems online until the generators come online, usually rather quickly. If there's a failure in the generators, the UPS systems won't be able to keep the whole datacenter online for very long. The number of batteries that would require is very large.
 
I work at a telco. The very important COs do have generators and ATS, but most of the smaller ones and cell sites rely on batteries. Most of the time they are meant to last like 8 hours, and I've seen some that can last up to 2 days. There are a few that only last like an hour though, so we need to dispatch a generator right away on those. IMO they really should have standby generators at all of them, but they don't. It's an added expense to buy and maintain them.
 
I forget you live in the middle of nowhere Canada... I've never been in a datacenter in the US that didn't have generators. I'm not talking about remote cell towers that maybe serve a few hundred customers.

The fact remains that you are using your UPS incorrectly if you are just waiting for it to power off your servers by running out of power.
 
Nooo that's not the point. It was my mistake for not turning it off myself and waiting too long, and I admitted to that. The point of mine is to last longer than 15 minutes. Though in the telco world, if a generator can't be brought to a site (too many sites down, forest fires, etc.), we shut off CDMA if it's a cell site, as it does not like to be dropped hard, much like a server, but we let everything else drop. DMS stuff is also just left to drop. Some places have a lot of roadside DMS (telephony) cabinets and only so many techs to go around, so during the huge power outages and snow storms they try to get to as many as they can, but mostly have to rely on batteries. It's rare that it gets to a point where stuff is dropped though. Also, telco CO != data center. We don't have many data centers in Canada, and the ones we do have I imagine have a lot of generator power. They use more power than a CO. Our CO uses about 400 amps on each phase and it's a big one.

My main goal is to have enough capacity so I never have to shut down. Though at some point I will probably automate it so it does shut down once it gets to the critical threshold. In this particular case I was erroneously trying to get more power out of it and hoping it would stay up longer. But that's beside the point, since the raid is not assembling itself properly at startup: it does not work through the UUID like I thought it should, but instead wants me to specify each device's sdX name. Before, I was able to just do --assemble --scan and it would load it. I have that line and the mount command in my rc.local so that the array starts up on its own.
 
A RAID 5 rebuild can very well take a day or more, now, and a fsck on it will take much longer than on a RAID 10. A clean shutdown in the middle of a rebuild is going to be much less painful than an unexpected shutdown in the middle of a rebuild; also, an unexpected shutdown could very well cause a full rebuild to occur. One of the several benefits of RAID 10 is that scanning, scrubbing, and rebuilding can be done in just a few hours, even with large-capacity drives. The longer the drives stay under load, the greater the chance something else will also go wrong.
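
For a rough sense of scale, a rebuild cannot finish faster than member capacity divided by sustained rebuild throughput. The sketch below assumes a constant 55 MB/s, a guess back-derived from the roughly 5-hour rebuild of 1 TB members reported earlier in the thread. Real rebuilds on a busy array run slower, so treat the output as a lower bound. `rebuild_hours` is a made-up helper name.

```shell
#!/bin/sh
# Lower-bound rebuild time in hours.
# Arguments: bytes per member drive, sustained rebuild rate in MB/s (assumed).
rebuild_hours() {
    awk -v bytes="$1" -v mbps="$2" 'BEGIN {
        printf "%.1f\n", bytes / (mbps * 1e6) / 3600
    }'
}

rebuild_hours 1000000000000 55   # 1 TB member, prints 5.1
rebuild_hours 3000000000000 55   # 3 TB member, prints 15.2
```

At the same rate a 3 TB member already needs most of a day, before any slowdown from normal traffic on the array.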

No matter how long your UPS lasts, you are just making a time bomb by having a RAID 5 (or 6, for that matter) and not making sure it will nicely shut down. The 15 minute number was to make a point. A UPS is there to either give you time to bring up other power, or perform a clean shutdown, and it is best for the shutdown to occur automatically. The array is not going to be very tolerant of bad shutdowns (and, if any HDDs were writing when it got shut down, who knows what scrambled data could have been written to the drives!).
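
An automatic shutdown can be as simple as comparing the battery voltage against a cutoff from a cron job or loop. A minimal sketch: `should_shutdown` is a made-up helper, the 11.0 V cutoff is only the figure mentioned at the start of the thread, and how the voltage actually gets read (UPS daemon, serial monitor, ADC) is left out because the thread does not say.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the measured voltage is at or below the cutoff.
should_shutdown() {
    awk -v v="$1" -v cutoff="$2" 'BEGIN { exit !(v + 0 <= cutoff + 0) }'
}

# Example values only; a real script would obtain the reading from the UPS.
if should_shutdown 10.9 11.0; then
    echo "battery low: stop the array cleanly, then power off"
    # /sbin/shutdown -h now   # left commented out in this sketch
fi
```

Setting the cutoff a bit above the point where the inverter drops out leaves time for the filesystem to unmount and the array to stop before power is lost.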

Regardless of whether you can keep everything up or not, prepare for not being able to. A bad shutdown has great risks for any standard RAID. More so with parity-type RAID, but even RAID 1 and 10 can suffer from inconsistency, and then need rebuilds, possibly losing data in the process. RAID is weak against several common failure modes, and really needs to be re-done; whether ZFS and BTRFS are the right way to do it or not, that's where most of the work is going, today, instead of generic replacements.

The RAID should either use the UUID, or its own marker, though. Something else might be up with that.
 