I used to have this issue on my main server until I migrated all storage duties to a dedicated storage server. It seems to happen when there's too much going on and the system can't handle it, but why does it actually crash?
Just had my VPN server randomly conk out on me. I was able to access it via an SSH backdoor that I have set up for this purpose. This is the dmesg log:
Code:
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6680
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6d80
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6580
NOHZ: local_softirq_pending 100
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188bc0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1886c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1886c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6b80
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6280
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6280
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1881c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1886c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1886c0
NOHZ: local_softirq_pending 100
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1889c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1888c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1888c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6d80
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188cc0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f188cc0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6480
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1885c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1884c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1884c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188ac0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1888c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1888c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188bc0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f188ac0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f188ac0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1880c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1880c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6680
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6680
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1889c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1889c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1881c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1888c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1888c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1889c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1884c0
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f1884c0
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6c80
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6280
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6280
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6980
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6080
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6080
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1885c0
sd 0:0:0:0: timing out command, waited 180s
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 95 f4 b8 00 00 20 00
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_start_sb:
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device dm-0) in ext4_dirty_inode: Journal has aborted
Detected aborted journal
INFO: task master:1164 blocked for more than 120 seconds.
Not tainted 2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
master D 0000000000000001 0 1164 1 0x00000080
ffff88000bb61948 0000000000000086 0000000000000000 ffffffffa000443c
ffff88000bb618b8 ffffffff81055678 ffff88000bb618d8 ffffffff8105571d
ffff88000f1865f8 ffff88000bb61fd8 000000000000fbc8 ffff88000f1865f8
Call Trace:
[<ffffffffa000443c>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
[<ffffffff81055678>] ? resched_task+0x68/0x80
[<ffffffff8105571d>] ? check_preempt_curr+0x6d/0x90
[<ffffffff811bf120>] ? sync_buffer+0x0/0x50
[<ffffffff815280a3>] io_schedule+0x73/0xc0
[<ffffffff811bf160>] sync_buffer+0x40/0x50
[<ffffffff8152893a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff811bf120>] ? sync_buffer+0x0/0x50
[<ffffffff81528a18>] out_of_line_wait_on_bit_lock+0x78/0x90
[<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
[<ffffffff811be6b9>] ? __find_get_block+0xa9/0x200
[<ffffffff811bf306>] __lock_buffer+0x36/0x40
[<ffffffffa0089293>] do_get_write_access+0x493/0x520 [jbd2]
[<ffffffffa0089471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
[<ffffffffa00d6d98>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
[<ffffffffa00b0bd3>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
[<ffffffffa00b0c4c>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
[<ffffffffa0088495>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
[<ffffffffa00b0f40>] ext4_dirty_inode+0x40/0x60 [ext4]
[<ffffffff811b48fb>] __mark_inode_dirty+0x3b/0x160
[<ffffffff811a5002>] file_update_time+0xf2/0x170
[<ffffffff81193cf2>] pipe_write+0x302/0x6a0
[<ffffffff81188c7a>] do_sync_write+0xfa/0x140
[<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8118e7b4>] ? cp_new_stat+0xe4/0x100
[<ffffffff810149b9>] ? read_tsc+0x9/0x20
[<ffffffff812263c6>] ? security_file_permission+0x16/0x20
[<ffffffff81188f78>] vfs_write+0xb8/0x1a0
[<ffffffff81189871>] sys_write+0x51/0x90
[<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task pickup:2163 blocked for more than 120 seconds.
Not tainted 2.6.32-431.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pickup D 0000000000000004 0 2163 1164 0x00000080
ffff88000c8dd968 0000000000000082 0000000000000000 ffffffffa000443c
ffff88000c8dda08 ffffffff8112f3a3 ffff880000018900 0000000700000000
ffff88000f93b058 ffff88000c8ddfd8 000000000000fbc8 ffff88000f93b058
Call Trace:
[<ffffffffa000443c>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
[<ffffffff8112f3a3>] ? __alloc_pages_nodemask+0x113/0x8d0
[<ffffffff811bf120>] ? sync_buffer+0x0/0x50
[<ffffffff815280a3>] io_schedule+0x73/0xc0
[<ffffffff811bf160>] sync_buffer+0x40/0x50
[<ffffffff8152893a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff811bf120>] ? sync_buffer+0x0/0x50
[<ffffffff81528a18>] out_of_line_wait_on_bit_lock+0x78/0x90
[<ffffffff8109b320>] ? wake_bit_function+0x0/0x50
[<ffffffff811be6b9>] ? __find_get_block+0xa9/0x200
[<ffffffff811bf306>] __lock_buffer+0x36/0x40
[<ffffffffa0089293>] do_get_write_access+0x493/0x520 [jbd2]
[<ffffffffa0089471>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
[<ffffffffa00d6d98>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
[<ffffffffa00b0bd3>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
[<ffffffffa00b0c4c>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
[<ffffffffa0088495>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
[<ffffffffa00b0f40>] ext4_dirty_inode+0x40/0x60 [ext4]
[<ffffffff811b48fb>] __mark_inode_dirty+0x3b/0x160
[<ffffffff811a5215>] touch_atime+0x195/0x1a0
[<ffffffff81194365>] pipe_read+0x2d5/0x4e0
[<ffffffff81188dba>] do_sync_read+0xfa/0x140
[<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff812263c6>] ? security_file_permission+0x16/0x20
[<ffffffff811896a5>] vfs_read+0xb5/0x1a0
[<ffffffff811897e1>] sys_read+0x51/0x90
[<ffffffff810e1e5e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device dm-0) in ext4_reserve_inode_write: Journal has aborted
How do I stop this from happening? It's been a while since I've had this type of error, but it looks like it's starting again. It's always a different app that crashes when it happens, and it seems to coincide with backup jobs running.
The OS is CentOS 6.5 running under ESXi.
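For reference, here's a rough way to tally the abort messages after an incident like this, to see how many of the SCSI task aborts actually failed versus succeeded. This is just a sketch: the inlined sample is a few lines copied from the log above, and in practice you'd pipe in the real `dmesg` output instead.

```shell
# Sample lines taken from the dmesg excerpt above; replace with: dmesg | ...
log='sd 0:0:0:0: [sda] task abort on host 0, ffff88000f2a6680
sd 0:0:0:0: [sda] Failed to abort cmd ffff88000f2a6680
sd 0:0:0:0: [sda] task abort on host 0, ffff88000f1889c0'

# Count abort attempts vs. aborts that failed outright
aborts=$(printf '%s\n' "$log" | grep -c 'task abort on host')
failed=$(printf '%s\n' "$log" | grep -c 'Failed to abort')

echo "aborts=$aborts failed=$failed"
# prints: aborts=2 failed=1
```

A high ratio of failed aborts clustered around the backup window would at least confirm the storage stall lines up with the backup jobs rather than with any one application.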