What causes these crashes?

Red Squirrel

No Lifer
May 24, 2003
69,693
13,325
126
www.betteroff.ca
I used to have this issue on my main server until I migrated all storage duties to a dedicated storage server. It seems to happen when there's too much going on and the system can't handle it, but why does it actually crash?

Just had my VPN server randomly conk out on me. I was able to access it via an SSH backdoor that I have set up for this purpose. This is the dmesg log:

Code:
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6680
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6d80
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6580
NOHZ: local_softirq_pending 100
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188bc0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1886c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1886c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6b80
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6280
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6280
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1881c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1886c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1886c0
NOHZ: local_softirq_pending 100
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1889c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1888c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1888c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6d80
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188cc0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f188cc0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6480
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1885c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1884c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1884c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188ac0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1888c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1888c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188bc0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188ac0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f188ac0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1880c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1880c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6680
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6680
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1889c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1889c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1881c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1888c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1888c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1889c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1884c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1884c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6c80
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6280
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6280
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6980
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6080
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6080
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1885c0
sd 0:0:0:0: timing out command, waited 180s
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 95 f4 b8 00 00 20 00
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_start_sb: 
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device dm-0) in ext4_dirty_inode: Journal has aborted
Detected aborted journal
INFO: task master:1164 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
master        D 0000000000000001     0  1164      1 0x00000080
 ffff88000bb61948 0000000000000086 0000000000000000 ffffffffa000443c
 ffff88000bb618b8 ffffffff81055678 ffff88000bb618d8 ffffffff8105571d
 ffff88000f1865f8 ffff88000bb61fd8 000000000000fbc8 ffff88000f1865f8
Call Trace:
 [<ffffffffa000443c>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
 [<ffffffff81055678>] ? resched_task+0x68/0x80
 [<ffffffff8105571d>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff811bf120>] ? sync_buffer+0x0/0x50
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffff811bf160>] sync_buffer+0x40/0x50
 [<ffffffff8152893a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811bf120>] ? sync_buffer+0x0/0x50
 [<ffffffff81528a18>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
 [<ffffffff811be6b9>] ? __find_get_block+0xa9/0x200
 [<ffffffff811bf306>] __lock_buffer+0x36/0x40
 [<ffffffffa0089293>] do_get_write_access+0x493/0x520 [jbd2]
 [<ffffffffa0089471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa00d6d98>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa00b0bd3>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa00b0c4c>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa0088495>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa00b0f40>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811b48fb>] __mark_inode_dirty+0x3b/0x160
 [<ffffffff811a5002>] file_update_time+0xf2/0x170
 [<ffffffff81193cf2>] pipe_write+0x302/0x6a0
 [<ffffffff81188c7a>] do_sync_write+0xfa/0x140
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8118e7b4>] ? cp_new_stat+0xe4/0x100
 [<ffffffff810149b9>] ? read_tsc+0x9/0x20
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f78>] vfs_write+0xb8/0x1a0
 [<ffffffff81189871>] sys_write+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task pickup:2163 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pickup        D 0000000000000004     0  2163   1164 0x00000080
 ffff88000c8dd968 0000000000000082 0000000000000000 ffffffffa000443c
 ffff88000c8dda08 ffffffff8112f3a3 ffff880000018900 0000000700000000
 ffff88000f93b058 ffff88000c8ddfd8 000000000000fbc8 ffff88000f93b058
Call Trace:
 [<ffffffffa000443c>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
 [<ffffffff8112f3a3>] ? __alloc_pages_nodemask+0x113/0x8d0
 [<ffffffff811bf120>] ? sync_buffer+0x0/0x50
 [<ffffffff815280a3>] io_schedule+0x73/0xc0
 [<ffffffff811bf160>] sync_buffer+0x40/0x50
 [<ffffffff8152893a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811bf120>] ? sync_buffer+0x0/0x50
 [<ffffffff81528a18>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
 [<ffffffff811be6b9>] ? __find_get_block+0xa9/0x200
 [<ffffffff811bf306>] __lock_buffer+0x36/0x40
 [<ffffffffa0089293>] do_get_write_access+0x493/0x520 [jbd2]
 [<ffffffffa0089471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa00d6d98>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa00b0bd3>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa00b0c4c>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa0088495>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa00b0f40>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811b48fb>] __mark_inode_dirty+0x3b/0x160
 [<ffffffff811a5215>] touch_atime+0x195/0x1a0
 [<ffffffff81194365>] pipe_read+0x2d5/0x4e0
 [<ffffffff81188dba>] do_sync_read+0xfa/0x140
 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff812263c6>] ? security_file_permission+0x16/0x20
 [<ffffffff811896a5>] vfs_read+0xb5/0x1a0
 [<ffffffff811897e1>] sys_read+0x51/0x90
 [<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted

How do I stop this from happening? It's been a while since I've had these types of errors, but it looks like they're starting again. It's always a different app that crashes when it happens. It seems to happen when backup jobs are running.


OS is CentOS 6.5 running in ESXi.
 

Red Squirrel

No Lifer
May 24, 2003
69,693
13,325
126
www.betteroff.ca
Crap hardware again? I have the worst luck with that. Every single system I build ends up with failing hardware. I built this one so I could get off my other one, and now this one has failing hardware too? Would those issues get passed straight through to the VMs? Now that you mention it, I have another VM that keeps crashing like this:



So what should I do? Build another machine and hope for the best? It's always a gamble, and my luck is terrible. I already tested the RAM before I built the machine, so I know it's not that.

What I don't get, though, is why it would only pick a few VMs to crash. Or is Linux more susceptible? All the other VMs are Windows and run fine.
 

Red Squirrel

No Lifer
May 24, 2003
69,693
13,325
126
www.betteroff.ca
Ahhh SOOB. The file server is doing the same thing. I feel like just giving up on computers sometimes. I'm sick of dealing with this crap; I just want stuff that works so I can spend less time troubleshooting and more time actually using it. This could also explain why I'm having so much trouble using dd with this laptop I'm trying to back up. So not only is my VM server failing, but so is my file server. WTF? I thought going server grade with ECC RAM would ensure I get properly working hardware, not stuff that fails this easily.

Also, what is netresolv? It's killing my file server, using all the CPU power. It has like 900 hours of CPU time and is constantly at the top of top.

Code:
top - 18:23:17 up 222 days, 19:18,  1 user,  load average: 0.12, 0.37, 0.52
Tasks: 340 total,   1 running, 339 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.4%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8032040k total,  3761508k used,  4270532k free,  2473108k buffers
Swap:  8175608k total,     1720k used,  8173888k free,   333116k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                         
14448 root      20   0  3924  132   64 S  1.3  0.0 829:57.47 netresolv                                                                                                                                                                       
14513 root      20   0  3924  136   64 S  1.3  0.0 824:22.80 netresolv                                                                                                                                                                       
17012 root      20   0 15168 1420  928 R  0.3  0.0   0:00.03 top                                                                                                                                                                             
    1 root      20   0 19356  872  612 S  0.0  0.0   0:00.84 init                                                                                                                                                                            
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.29 kthreadd                                                                                                                                                                        
    3 root      RT   0     0    0    0 S  0.0  0.0   1:42.32 migration/0                                                                                                                                                                     
    4 root      20   0     0    0    0 S  0.0  0.0  12:25.03 ksoftirqd/0                                                                                                                                                                     
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0                                                                                                                                                                     
    6 root      RT   0     0    0    0 S  0.0  0.0   0:09.13 watchdog/0                                                                                                                                                                      
    7 root      RT   0     0    0    0 S  0.0  0.0   0:20.37 migration/1                                                                                                                                                                     
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1                                                                                                                                                                     
    9 root      20   0     0    0    0 S  0.0  0.0   0:27.86 ksoftirqd/1                                                                                                                                                                     
   10 root      RT   0     0    0    0 S  0.0  0.0   0:06.25 watchdog/1                                                                                                                                                                      
   11 root      RT   0     0    0    0 S  0.0  0.0   0:11.73 migration/2                                                                                                                                                                     
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2


Now that I think of it, could these hardware issues be affected by temperature? My server room is not climate controlled yet; I still need to design the HVAC system before I drywall it, so it's in the open, uninsulated basement right now, and the temperature drops to around 10°C sometimes.

This is the log on the file server (which serves the VM server and everything else around the house).

Code:
INFO: task kjournald:2864 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D 0000000000000004     0  2864      2 0x00000080
 ffff880227fb1d50 0000000000000046 0000000000000000 00000000811b59c7
 ffff88004bdfc540 ffffea00042b7740 0000000000000000 0000000000000000
 ffff880224091098 ffff880227fb1fd8 000000000000fb88 ffff880224091098
Call Trace:
 [<ffffffffa042b6f1>] journal_commit_transaction+0x161/0x1310 [jbd]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81081a3b>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa0431768>] kjournald+0xe8/0x250 [jbd]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0431680>] ? kjournald+0x0/0x250 [jbd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task nfsd:10556 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000000     0 10556      2 0x00000080
 ffff880109073920 0000000000000046 ffff880109073890 ffff880130eb1000
 ffff880016da4380 ffff880109839dc0 ffff880109839df8 0000000000000020
 ffff880228819058 ffff880109073fd8 000000000000fb88 ffff880228819058
Call Trace:
 [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff814413a5>] ? memcpy_toiovec+0x55/0x80
 [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
 [<ffffffff8111c381>] generic_file_aio_write+0x71/0x100
 [<ffffffff8111c310>] ? generic_file_aio_write+0x0/0x100
 [<ffffffff81180b5b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffffa04bd5e8>] ? find_acceptable_alias+0x28/0x100 [exportfs]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181ae6>] do_readv_writev+0xd6/0x1f0
 [<ffffffffa04f4992>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffff81181c46>] vfs_writev+0x46/0x60
 [<ffffffffa04f6085>] nfsd_vfs_write+0x105/0x430 [nfsd]
 [<ffffffff8117e372>] ? dentry_open+0x52/0xc0
 [<ffffffffa04f7cfb>] ? nfsd_open+0x1db/0x230 [nfsd]
 [<ffffffffa04f8117>] nfsd_write+0xe7/0x100 [nfsd]
 [<ffffffffa050067f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
 [<ffffffffa04f143e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa0464524>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa0464b60>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f1b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f1aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task nfsd:10557 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000007     0 10557      2 0x00000080
 ffff8802294dd920 0000000000000046 0000000000000000 ffff8802276b3400
 ffff880228894020 ffff8802119a1ca8 ffff8802293ec740 0000000000000007
 ffff880227433ab8 ffff8802294ddfd8 000000000000fb88 ffff880227433ab8
Call Trace:
 [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
 [<ffffffff8111c381>] generic_file_aio_write+0x71/0x100
 [<ffffffff8111c310>] ? generic_file_aio_write+0x0/0x100
 [<ffffffff81180b5b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffffa04bd5e8>] ? find_acceptable_alias+0x28/0x100 [exportfs]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181ae6>] do_readv_writev+0xd6/0x1f0
 [<ffffffffa04f4992>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffff81181c46>] vfs_writev+0x46/0x60
 [<ffffffffa04f6085>] nfsd_vfs_write+0x105/0x430 [nfsd]
 [<ffffffff8117e372>] ? dentry_open+0x52/0xc0
 [<ffffffffa04f7cfb>] ? nfsd_open+0x1db/0x230 [nfsd]
 [<ffffffffa04f8117>] nfsd_write+0xe7/0x100 [nfsd]
 [<ffffffffa050067f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
 [<ffffffffa04f143e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa0464524>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa0464b60>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f1b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f1aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task nfsd:10558 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000004     0 10558      2 0x00000080
 ffff880105eddc60 0000000000000046 ffff880105eddc00 0007ffffffffffff
 ffff880105eddcc0 ffffffff8111a2fe 0486e00100000007 ffff88022a2a4000
 ffff88022396e5f8 ffff880105eddfd8 000000000000fb88 ffff88022396e5f8
Call Trace:
 [<ffffffff8111a2fe>] ? wait_on_page_writeback_range+0x8e/0x190
 [<ffffffff8112e0c4>] ? generic_writepages+0x24/0x30
 [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
 [<ffffffff811b1a20>] vfs_fsync_range+0x90/0xe0
 [<ffffffff811b1add>] vfs_fsync+0x1d/0x20
 [<ffffffffa04f7ffb>] nfsd_commit+0x6b/0xa0 [nfsd]
 [<ffffffffa04fefdd>] nfsd3_proc_commit+0x9d/0x100 [nfsd]
 [<ffffffffa04f143e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa0464524>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa0464b60>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f1b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f1aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task nfsd:10559 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000004     0 10559      2 0x00000080
 ffff88010907d920 0000000000000046 ffff88010907d870 ffff8802276b3400
 ffff880228894020 ffff880016c2cba8 ffff8802293ec740 0000000000000004
 ffff880227c87098 ffff88010907dfd8 000000000000fb88 ffff880227c87098
Call Trace:
 [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff814413a5>] ? memcpy_toiovec+0x55/0x80
 [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
 [<ffffffff8111c381>] generic_file_aio_write+0x71/0x100
 [<ffffffff8111c310>] ? generic_file_aio_write+0x0/0x100
 [<ffffffff81180b5b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffffa04bd5e8>] ? find_acceptable_alias+0x28/0x100 [exportfs]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181ae6>] do_readv_writev+0xd6/0x1f0
 [<ffffffffa04f4992>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffff81181c46>] vfs_writev+0x46/0x60
 [<ffffffffa04f6085>] nfsd_vfs_write+0x105/0x430 [nfsd]
 [<ffffffff8117e372>] ? dentry_open+0x52/0xc0
 [<ffffffffa04f7cfb>] ? nfsd_open+0x1db/0x230 [nfsd]
 [<ffffffffa04f8117>] nfsd_write+0xe7/0x100 [nfsd]
 [<ffffffffa050067f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
 [<ffffffffa04f143e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa0464524>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa0464b60>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f1b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f1aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task nfsd:10560 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000006     0 10560      2 0x00000080
 ffff880109151720 0000000000000046 ffff880227433058 ffff880109151fd8
 000000000000fb88 ffff880227433058 ffff88022c031500 ffff880227432aa0
 ffff880227433058 ffff880109151fd8 000000000000fb88 ffff880227433058
Call Trace:
 [<ffffffff81096f6e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0429caa>] start_this_handle+0x20a/0x3f0 [jbd]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa042a065>] journal_start+0xb5/0x100 [jbd]
 [<ffffffffa0556481>] ext3_journal_start_sb+0x31/0x60 [ext3]
 [<ffffffffa0543b9d>] ext3_dirty_inode+0x3d/0xa0 [ext3]
 [<ffffffff811ac13b>] __mark_inode_dirty+0x3b/0x160
 [<ffffffff8119c352>] file_update_time+0xf2/0x170
 [<ffffffff8111c0b0>] __generic_file_aio_write+0x230/0x490
 [<ffffffff814413a5>] ? memcpy_toiovec+0x55/0x80
 [<ffffffff8111c398>] generic_file_aio_write+0x88/0x100
 [<ffffffff8111c310>] ? generic_file_aio_write+0x0/0x100
 [<ffffffff81180b5b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffffa04bd5e8>] ? find_acceptable_alias+0x28/0x100 [exportfs]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181ae6>] do_readv_writev+0xd6/0x1f0
 [<ffffffffa04f4992>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffff81181c46>] vfs_writev+0x46/0x60
 [<ffffffffa04f6085>] nfsd_vfs_write+0x105/0x430 [nfsd]
 [<ffffffff8117e372>] ? dentry_open+0x52/0xc0
 [<ffffffffa04f7cfb>] ? nfsd_open+0x1db/0x230 [nfsd]
 [<ffffffffa04f8117>] nfsd_write+0xe7/0x100 [nfsd]
 [<ffffffffa050067f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
 [<ffffffffa04f143e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa0464524>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa0464b60>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f1b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f1aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task nfsd:10561 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000004     0 10561      2 0x00000080
 ffff880109021920 0000000000000046 ffff880109021890 ffff880130eb1000
 ffff880016da4380 ffff8801055ce4c0 ffff8801055ce4f8 0000000000000020
 ffff880228879ab8 ffff880109021fd8 000000000000fb88 ffff880228879ab8
Call Trace:
 [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
 [<ffffffff8111c381>] generic_file_aio_write+0x71/0x100
 [<ffffffff8111c310>] ? generic_file_aio_write+0x0/0x100
 [<ffffffff81180b5b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffffa04bd5e8>] ? find_acceptable_alias+0x28/0x100 [exportfs]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181ae6>] do_readv_writev+0xd6/0x1f0
 [<ffffffffa04f4992>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffff81181c46>] vfs_writev+0x46/0x60
 [<ffffffffa04f6085>] nfsd_vfs_write+0x105/0x430 [nfsd]
 [<ffffffff8117e372>] ? dentry_open+0x52/0xc0
 [<ffffffffa04f7cfb>] ? nfsd_open+0x1db/0x230 [nfsd]
 [<ffffffffa04f8117>] nfsd_write+0xe7/0x100 [nfsd]
 [<ffffffffa050067f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
 [<ffffffffa04f143e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa0464524>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa0464b60>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f1b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f1aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task nfsd:10562 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000002     0 10562      2 0x00000080
 ffff88022900f920 0000000000000046 ffff88022900f8e8 ffff88022900f8e4
 ffff880016da4380 ffff88022fc24500 ffff880028316700 0000000000000353
 ffff880223b21af8 ffff88022900ffd8 000000000000fb88 ffff880223b21af8
Call Trace:
 [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff814413a5>] ? memcpy_toiovec+0x55/0x80
 [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
 [<ffffffff8111c381>] generic_file_aio_write+0x71/0x100
 [<ffffffff8111c310>] ? generic_file_aio_write+0x0/0x100
 [<ffffffff81180b5b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffffa04bd5e8>] ? find_acceptable_alias+0x28/0x100 [exportfs]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181ae6>] do_readv_writev+0xd6/0x1f0
 [<ffffffffa04f4992>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffff81181c46>] vfs_writev+0x46/0x60
 [<ffffffffa04f6085>] nfsd_vfs_write+0x105/0x430 [nfsd]
 [<ffffffff8117e372>] ? dentry_open+0x52/0xc0
 [<ffffffffa04f7cfb>] ? nfsd_open+0x1db/0x230 [nfsd]
 [<ffffffffa04f8117>] nfsd_write+0xe7/0x100 [nfsd]
 [<ffffffffa050067f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
 [<ffffffffa04f143e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa0464524>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa0464b60>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f1b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f1aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task nfsd:10563 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd          D 0000000000000000     0 10563      2 0x00000080
 ffff88010439b720 0000000000000046 0000000000016700 0000000000000000
 0000000000000000 ffff880227f32d40 ffff88010439b750 ffffffff8150d6e0
 ffff8801013d9098 ffff88010439bfd8 000000000000fb88 ffff8801013d9098
Call Trace:
 [<ffffffff8150d6e0>] ? thread_return+0x4e/0x76e
 [<ffffffff81096f6e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0429caa>] start_this_handle+0x20a/0x3f0 [jbd]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa042a065>] journal_start+0xb5/0x100 [jbd]
 [<ffffffffa0556481>] ext3_journal_start_sb+0x31/0x60 [ext3]
 [<ffffffffa0543b9d>] ext3_dirty_inode+0x3d/0xa0 [ext3]
 [<ffffffff811ac13b>] __mark_inode_dirty+0x3b/0x160
 [<ffffffff8119c352>] file_update_time+0xf2/0x170
 [<ffffffff8111c0b0>] __generic_file_aio_write+0x230/0x490
 [<ffffffff814413a5>] ? memcpy_toiovec+0x55/0x80
 [<ffffffff8111c398>] generic_file_aio_write+0x88/0x100
 [<ffffffff8111c310>] ? generic_file_aio_write+0x0/0x100
 [<ffffffff81180b5b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffffa04bd5e8>] ? find_acceptable_alias+0x28/0x100 [exportfs]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
 [<ffffffff81181ae6>] do_readv_writev+0xd6/0x1f0
 [<ffffffffa04f4992>] ? nfsd_setuser_and_check_port+0x62/0xb0 [nfsd]
 [<ffffffff81181c46>] vfs_writev+0x46/0x60
 [<ffffffffa04f6085>] nfsd_vfs_write+0x105/0x430 [nfsd]
 [<ffffffff8117e372>] ? dentry_open+0x52/0xc0
 [<ffffffffa04f7cfb>] ? nfsd_open+0x1db/0x230 [nfsd]
 [<ffffffffa04f8117>] nfsd_write+0xe7/0x100 [nfsd]
 [<ffffffffa050067f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
 [<ffffffffa04f143e>] nfsd_dispatch+0xfe/0x240 [nfsd]
 [<ffffffffa0464524>] svc_process_common+0x344/0x640 [sunrpc]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa0464b60>] svc_process+0x110/0x160 [sunrpc]
 [<ffffffffa04f1b62>] nfsd+0xc2/0x160 [nfsd]
 [<ffffffffa04f1aa0>] ? nfsd+0x0/0x160 [nfsd]
 [<ffffffff81096916>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81096880>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task smbd:6097 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
smbd          D 0000000000000000     0  6097  30667 0x00000084
 ffff880017b41a38 0000000000000082 ffff880017b419b8 0000000000000286
 ffff880017b419c8 ffff880016da53f0 ffffffff81ead540 ffffea00048edd20
 ffff880223a23058 ffff880017b41fd8 000000000000fb88 ffff880223a23058
Call Trace:
 [<ffffffff81096f6e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0429caa>] start_this_handle+0x20a/0x3f0 [jbd]
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa042a065>] journal_start+0xb5/0x100 [jbd]
 [<ffffffffa0556481>] ext3_journal_start_sb+0x31/0x60 [ext3]
 [<ffffffffa05467fd>] ext3_write_begin+0xad/0x2c0 [ext3]
 [<ffffffff8111a673>] generic_file_buffered_write+0x123/0x2e0
 [<ffffffff81075787>] ? current_fs_time+0x27/0x30
 [<ffffffff8111c0e0>] __generic_file_aio_write+0x260/0x490
 [<ffffffff8111c398>] generic_file_aio_write+0x88/0x100
 [<ffffffff81180c9a>] do_sync_write+0xfa/0x140
 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
 [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
 [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
 [<ffffffff81181952>] sys_pwrite64+0x82/0xa0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
nfsd: peername failed (err 107)!
md: md1: data-check done.
md: md3: data-check done.
md: md0: data-check done.

Oddly, my old server stopped doing it. This seems to be directly related to whatever systems are doing the most work.
 

lxskllr

No Lifer
Nov 30, 2004
59,071
9,481
126
Are you sure you don't have anything misconfigured? Has everything been installed via the package manager? Are you running vanilla configs, or have you heavily modified what's installed? Your "bad luck" with hardware is in the realm of possibility, but it is exceptionally bad. I would set up a test network with the basics and see if you can keep it going error-free.
 

Red Squirrel

No Lifer
May 24, 2003
69,693
13,325
126
www.betteroff.ca
Everything is vanilla and installed through yum, other than basic configuration like setting up NFS exports/mounts and that kind of thing. I have enough problems as it is without starting to mess with stuff. :p

The VM server is ESXi, and most of my Linux systems are CentOS 6.5, including the physical file server on which the VMs reside. My old server (still running some stuff; I need to migrate it off) is FC9 and had the same issues until I moved all the hard drives to the file server. It seems all the problems have now migrated to the file server AND the VM server. I basically just moved the entire md array, and it's running off the file server now. Most of the stuff is not running off those drives though, but even so, if it were a failing drive or something, it should not be crashing all the systems. At worst it should be dropping out of the md array.

Problem is I can't really take this offline, especially not the file server, as everything runs off it. Though I guess worst case scenario I'll have to.

Is there some kind of Linux burn-in test I can run to try to find which hardware is failing?
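
(For a scriptable first pass, a rough read-only surface scan like the sketch below could work. badblocks and smartctl's built-in self-tests are the more standard tools; the device names here are just examples and the offset reporting is approximate.)

Code:
#!/usr/bin/env python
# Rough read-only surface scan sketch. Run as root. Not a substitute for
# badblocks or smartctl self-tests; offsets are only approximate because
# the last read before end-of-device can be short.
import sys

CHUNK = 1024 * 1024  # read 1 MiB at a time

def scan(dev):
    errors = 0
    offset = 0
    with open(dev, 'rb') as f:
        while True:
            try:
                data = f.read(CHUNK)
                if not data:              # end of device
                    break
            except IOError as exc:
                errors += 1
                print('%s: read error near byte %d: %s' % (dev, offset, exc))
                f.seek(offset + CHUNK)    # skip past the unreadable region
            offset += CHUNK
    print('%s: finished, %d read error(s)' % (dev, errors))

if __name__ == '__main__':
    for dev in sys.argv[1:] or ['/dev/sdc']:   # example default device
        scan(dev)

Saved as, say, scan.py, it would be run as root against whichever drives are suspect, e.g. python scan.py /dev/sdc /dev/sdm.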
 

Anteaus

Platinum Member
Oct 28, 2010
2,448
4
81
Assuming you are being truthful about every machine having problems, I'd start looking for a least common denominator.

Do you suspect you might have dirty power? Assuming you don't already have them, I would start with a line conditioner and a UPS. Even if power isn't the cause, at least it will help eliminate potential issues.
 

lxskllr

No Lifer
Nov 30, 2004
59,071
9,481
126
How is your array set up? Is the data being spread across all drives? If so, a bad disk might cause a failure of the whole system instead of just dropping out, especially if it isn't dead but intermittently failing. Another possibility is misconfigured VMs.

A good start might be checking the SMART data for the drives and seeing if everything looks reasonable. Other than that and memtest, I'm unaware of hardware testing suites, but a search should pull up some useful tools.

I'm just throwing some ideas out there. I'm unfamiliar with what you're doing, but the approach I'd take is to simplify as much as possible to remove variables, and that may reveal the problem. Instead of RAID, go to single disks. Instead of VMs, run directly on hardware...
 

Red Squirrel

No Lifer
May 24, 2003
69,693
13,325
126
www.betteroff.ca
Assuming you are being truthful about every machine having problems, I'd start looking for a least common denominator.

Do you suspect you might have dirty power? Assuming you don't already have them, I would start with a line conditioner and a UPS. Even if power isn't the cause, at least it will help eliminate potential issues.

I've put an oscilloscope on the power and it's clean, but it's hard to know if some weird stuff gets past the UPS at the very instant it crashes. The UPS is a standby type. The UPS's own output is pretty dirty, but the servers only get power from it when the mains power goes out.

Is there a way to get timestamps on dmesg logs? It would be interesting to know if all the servers act up at the same time.

Currently my RAID is set up as 3 mdadm arrays: two RAID 10s and a RAID 5. Most of the VMs are on the RAID 10s, and so is most of the data. I've been slowly migrating stuff off the RAID 5, as I want to eventually retire it and put in bigger drives to do a RAID 10. I'm contemplating going ZFS as well; would that help in situations where a drive might be on the verge of failing? I hear it's more resilient to stuff like that. Right now the RAID 5 is mostly backups and some old VMs I don't really use anymore.
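
(As an aside, mdadm will kick out a member it considers failed, and that shows up in /proc/mdstat. A rough sketch of checking for that from a script, in case it helps with monitoring; the parsing is only a heuristic.)

Code:
#!/usr/bin/env python
# Flag degraded mdadm arrays by reading /proc/mdstat. A failed or missing
# member shows up as '(F)' next to a device name and as '_' in the
# [UU...] status string.
import re

def check_mdstat(path='/proc/mdstat'):
    current = None
    with open(path) as f:
        for line in f:
            if re.match(r'^md\d+\s*:', line):
                current = line.split()[0]
                if '(F)' in line:
                    print('%s: has a failed member: %s' % (current, line.strip()))
            m = re.search(r'\[([U_]+)\]\s*$', line)
            if m and current and '_' in m.group(1):
                print('%s: degraded (%s)' % (current, m.group(1)))

if __name__ == '__main__':
    check_mdstat()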


For SMART data, I MIGHT be getting somewhere. The two RAID 10s check out fine, but on the RAID 5 array (the oldest one, which came from my old server) I have some drives with issues:

/dev/sdc:
Raw_Read_Error_Rate 4

/dev/sdm:
Raw_Read_Error_Rate 1
Multi_Zone_Error_Rate 2

/dev/sdd:
Raw_Read_Error_Rate 18

etc.

...checked all 8, and it seems all the drives in this array are bad! There's not really much data on this array, but could bad drives still cause the entire system to lock up/crash like it's doing? Those "blocked for more than 120 seconds" messages are usually the result of the entire system locking right up temporarily; I've seen it happen while I was actively using my old system. If the file system is locking up, then it would probably cause the VMs to lock up too if they can't read/write data. I never even considered checking SMART data on all the drives, but I'm wondering if I've found my issue.

How would ZFS handle an issue like this, would it report to me right away that a drive is bad?

I suppose I could put the smart info into my monitoring program and monitor those values though.
 

Red Squirrel

No Lifer
May 24, 2003
69,693
13,325
126
www.betteroff.ca
Just had another VM crash now. It's getting worse. I'm copying data off that failing array so I can retire it now, but I really don't see how it could be causing all my VMs to crash one by one when they're not on that array. Worth a try I guess.

Not sure how NFS handles bad drives, whether it just hangs the entire NFS service or what, so I suppose it's possible that array is the culprit.
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,145
502
126
How would ZFS handle an issue like this, would it report to me right away that a drive is bad?

I suppose I could put the smart info into my monitoring program and monitor those values though.

ZFS would potentially mark the disk as failing and replace it with a hot spare (assuming you have hot spares configured), or it would simply fail the disk (and, assuming it was in a vdev with redundancy like raidz, raidz2, or a mirror, use the other disks to access the data). The real benefit of ZFS would be that it knows where the failed writes occurred and would still have been able to get the written data safely onto the system.

Right now you simply have a RAID 5 device that attempts to write data to one of the members of its RAID set and fails, while the other members may not have failed to write/update their data, so now you have inconsistent data across the set of disks. ZFS would recognize that inconsistency because it keeps a checksum of not only the data but the entire write, and since the filesystem is copy-on-write it still has the old state of the data/file available as well (except when the zpool is close to full).
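
As for the "would it report to me right away" part: zpool status -x prints "all pools are healthy" when nothing is wrong, so a cron job or your monitoring program could just poll it and alert on any other output. A minimal sketch, assuming the ZFS utilities are installed:

Code:
#!/usr/bin/env python
# zpool status -x prints "all pools are healthy" when nothing is wrong,
# so just alert on anything else. Assumes the ZFS tools are installed.
import subprocess

proc = subprocess.Popen(['zpool', 'status', '-x'],
                        stdout=subprocess.PIPE, universal_newlines=True)
out = proc.communicate()[0]
if 'all pools are healthy' not in out:
    print('ZFS reports a problem:')
    print(out)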


Sorry, I just saw this thread now. The moment you said that you transferred this set of disks from the old system to the new one and the crashing problem moved with it, I was going to say the problem is with those disks.

Backblaze has a really good report on SMART errors and values that are good predictors of a failing hard drive: https://www.backblaze.com/blog/hard-drive-smart-stats/

The most important SMART values from Backblaze's experience are as follows:

SMART 5 – Reallocated_Sector_Count.
SMART 187 – Reported_Uncorrectable_Errors.
SMART 188 – Command_Timeout.
SMART 197 – Current_Pending_Sector_Count.
SMART 198 – Offline_Uncorrectable.

Try finding those values for your disks. If you see any disk with SMART 5 values above 70, it is failing. If 187 is greater than 0, it is failing. If 188 hits 13G or higher, it is most likely failing. If 197 is greater than 0, it is failing. If 198 is above 0 it is failing as well...
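
If you want to feed those into your monitoring program, a rough sketch along these lines might do it. It assumes smartmontools is installed, the device names are just examples, and it only flags nonzero raw values for a closer look rather than applying the exact cutoffs above:

Code:
#!/usr/bin/env python
# Pull the SMART attributes listed above out of `smartctl -A` and flag any
# nonzero raw values for a closer look. Assumes smartmontools is installed;
# the device list is just an example.
import subprocess

WATCH = set(['5', '187', '188', '197', '198'])   # Backblaze's key attributes
DEVICES = ['/dev/sdc', '/dev/sdd', '/dev/sdm']   # example devices

def check(dev):
    proc = subprocess.Popen(['smartctl', '-A', dev],
                            stdout=subprocess.PIPE, universal_newlines=True)
    out = proc.communicate()[0]
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] in WATCH:
            name, raw = fields[1], fields[-1]
            status = 'ok' if raw == '0' else 'CHECK THIS DRIVE'
            print('%-9s %3s %-24s raw=%-10s %s' % (dev, fields[0], name, raw, status))

if __name__ == '__main__':
    for dev in DEVICES:
        check(dev)

Dropping something like that into cron or your monitoring program would at least make a newly failing disk visible without having to remember to check by hand.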
 

Red Squirrel

No Lifer
May 24, 2003
69,693
13,325
126
www.betteroff.ca
The only issues I see are in Raw_Read_Error_Rate and Multi_Zone_Error_Rate; the rest seems clear. Most of the drives have errors in one or two of those places though. The drives in the other RAID arrays seem fine. These drives are WD Blacks.

Example report:

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   173   173   021    Pre-fail  Always       -       4308
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   093   000    Old_age   Always       -       4531
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       24
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       15
194 Temperature_Celsius     0x0022   116   101   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2
 

Red Squirrel

No Lifer
May 24, 2003
69,693
13,325
126
www.betteroff.ca
For fun (this is kinda dangerous...), I turned crond off to stop all automatic backups. On my previous server where I had these crashes, they often seemed to happen during backup jobs, at least the few times I managed to catch a crash in the act, since it's hard to tell when a crash occurred just by looking at the dmesg log, especially if it's been a while since I checked or cleared it. I wish they'd put real timestamps on those log entries. One of these days I want to write an app that monitors it and outputs the entries to a log file with timestamps.
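
Something like this minimal sketch might cover that monitor idea. The log path and poll interval are just placeholders, and the timestamps are only as accurate as the polling interval:

Code:
#!/usr/bin/env python
# Poll dmesg, work out which lines are new since the last poll, and append
# them to a log file with a wall-clock timestamp. The log path and interval
# are examples; timestamps are accurate only to the poll interval.
import subprocess
import time

LOG = '/var/log/dmesg-timestamped.log'   # example path
INTERVAL = 30                            # seconds between polls

def snapshot():
    out = subprocess.Popen(['dmesg'], stdout=subprocess.PIPE,
                           universal_newlines=True).communicate()[0]
    return out.splitlines()

def new_lines(prev, cur):
    # The kernel ring buffer scrolls, so the new snapshot starts with some
    # tail of the old one; return only the lines after that overlap.
    # This is a heuristic and can be fooled by long runs of identical lines.
    for k in range(min(len(prev), len(cur)), -1, -1):
        if cur[:k] == prev[len(prev) - k:]:
            return cur[k:]

def main():
    prev = snapshot()
    while True:
        time.sleep(INTERVAL)
        cur = snapshot()
        fresh = new_lines(prev, cur)
        if fresh:
            stamp = time.strftime('%Y-%m-%d %H:%M:%S')
            with open(LOG, 'a') as f:
                for line in fresh:
                    f.write('%s %s\n' % (stamp, line))
        prev = cur

if __name__ == '__main__':
    main()

(rsyslog should already be copying kernel messages into /var/log/messages with wall-clock timestamps, so that may be the quicker way to line up crashes across the servers, but the sketch matches the "app" idea.)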

By turning off backups I'll see if the crashes happen again. I completely removed all active data off the RAID 5 array except for backups and less active data. Well, the data is all still there, but I renamed the folders; it has all been copied to the other RAID arrays and the NFS shares/mounts moved accordingly. I could technically offline this array without much impact. There is still a LUN mapped to my ESXi server though. I'll have to double-check that there are no VMs on it when I'm home, but I really don't think there are. No important or powered-on ones anyway.

I'll still continue to run my manual backups to removable media though, then monitor for crashes while they're running.

It seems to only be high I/O situations that trigger these.
 