I'm at a loss, what keeps crashing my system?


Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Those messages indicate that you ran out of memory and the OOM killer should have kicked in.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
The original I/O-wait messages could have been related to trouble cleaning (writing back dirty pages) and defragmenting memory. You are definitely running low on memory. So, when these errors aren't happening, what does your available RAM look like (free+caches)? If I'm reading it right, SSH tried to allocate 128MB (why?) and failed, though you should have had 160+MB that could have been freed. However, if it wanted that 128MB to be contiguous, you could very well be stuck waiting a long time for the kernel to make that happen.

So, all of the below assumes that you have plenty of physical memory, but that it isn't being used effectively, and you end up stuck with too much fragmentation and I/O wait from backlogged writebacks blocking successful allocation. Assuming you do indeed have enough RAM, that would make sense with regard to the earlier errors, too. If you're actually running close to the amount of RAM in the computer, then the solution, sadly, is to fix it at that level first (if you don't have 20%+ free for caching, you may as well have run out already). I can't say I know enough to deduce your actual total RAM from the error (though I'm sure it's quite possible).

Code:
Active:132283
inactive:1508049
dirty:161830
writeback:2625
unstable:0
free:10126
slab:90087
mapped:168019
pagetables:12444
bounce:0
That's quite a bit of dirty memory, and a fair amount of slab, relative to the rest. I'm not used to seeing writeback above a couple hundred pages, but I can't say I know what, if anything, that means (it might just be high because there are lots of dirty pages, and mean nothing on its own).
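If you want to check, a minimal sketch of what I'd look at, assuming the usual /proc layout on a kernel of this vintage (buddyinfo shows free blocks per allocation order, so mostly-empty high-order columns mean fragmentation):
Code:
free -m                            # free vs. buffers/cached
grep -i -E 'dirty|writeback' /proc/meminfo
cat /proc/buddyinfo                # free blocks per order, per zone
cat /proc/sys/vm/min_free_kbytes   # current reserve the kernel keeps free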

I would set extfrag_threshold to 100 (the default is 500, and a lower value makes it more aggressive about defragmenting), and dirty_ratio and dirty_background_ratio both to 10 or less, and then see how the machine fares (note that low dirty ratios can give Linux a jerky feel when a lot of disk writing is going on, and will, in general, increase the amount and frequency of writes to disk). Depending on what your current min free value is, you may wish to increase it to 100+MB.

I've found Virtualbox and WINE can be really bad about causing fragmentation to outpace the background compactor, though I've managed to keep from getting crashes from it.

So, something like maybe this, in /etc/sysctl.conf:
Code:
vm.min_free_kbytes = 262144
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.extfrag_threshold = 100
It's on the extreme side, so if your crashes disappear, start to loosen up the dirty ratios for smoother performance (I've not yet found a lower extfrag_threshold to cause bad behavior, but documentation for it is kind of sparse), and maybe reduce the min free to 128MB (131072), or even 64MB (65536).
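These can be applied without a reboot, too. A minimal sketch, assuming your sysctl supports -p/-w (the values are just the ones from above, and vm.extfrag_threshold only exists if the kernel has memory compaction):
Code:
sysctl -p                                  # load everything from /etc/sysctl.conf
# or set them one at a time while testing:
sysctl -w vm.min_free_kbytes=262144
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.extfrag_threshold=100
cat /proc/sys/vm/min_free_kbytes           # confirm what the kernel is actually using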
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
This is odd: VirtualBox seems to be using over 7GB, and I have 8GB, so this explains things... I only have 1 VM running, and it has 512k assigned to it. Not sure whether this was the original issue or not. I used to have more VMs running before; no way I'd be able to start them now if I tried.
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Ok this really sucks. One of my other servers started doing it too now. WTF?


[root@hal9000 envirostate]# dmesg
e1000e: eth0 NIC Link is Down
CIFS VFS: Server borg.loc has not responded in 300 seconds. Reconnecting...
INFO: task bdi-default:29 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bdi-default D 0000000000000002 0 29 2 0x00000000
ffff88013cdf1a40 0000000000000046 0000000000000000 ffff880139d7d980
ffff88013cdf1a50 ffffffffa02800ac ffff88013cdf1a18 ffff88013b7b1d70
ffff88013cdeb0b8 ffff88013cdf1fd8 000000000000f4e8 ffff88013cdeb0b8
Call Trace:
[<ffffffffa02800ac>] ? smb_sendv+0xfc/0x380 [cifs]
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276dbf>] cifs_writepages+0x63f/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff811a034d>] writeback_single_inode+0xdd/0x2c0
[<ffffffff811a078e>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a08eb>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff811a0c8b>] wb_writeback+0x29b/0x3f0
[<ffffffff814f3911>] ? thread_return+0x4e/0x79d
[<ffffffff811a0f79>] wb_do_writeback+0x199/0x240
[<ffffffff8107c020>] ? process_timeout+0x0/0x10
[<ffffffff811348ba>] bdi_forker_task+0x6a/0x310
[<ffffffff81134850>] ? bdi_forker_task+0x0/0x310
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task FahCore_a3.exe:14391 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
FahCore_a3.ex D 0000000000000001 0 14391 1972 0x00000080
ffff880138eedb58 0000000000000082 0000000000000000 ffff880037930a38
ffffffff81facfa0 ffffea0003fee0e0 ffff880138eedae8 ffffffff812763d6
ffff880037930638 ffff880138eedfd8 000000000000f4e8 ffff880037930638
Call Trace:
[<ffffffff812763d6>] ? __prop_inc_single+0x46/0x60
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f4a8f>] __wait_on_bit+0x5f/0x90
[<ffffffff81110b23>] wait_on_page_bit+0x73/0x80
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276a5d>] cifs_writepages+0x2dd/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff8111108b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111158f>] filemap_fdatawrite+0x1f/0x30
[<ffffffffa025e11c>] cifs_file_aio_write+0x3c/0x50 [cifs]
[<ffffffff811760fa>] do_sync_write+0xfa/0x140
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81060fa0>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff8120c296>] ? security_file_permission+0x16/0x20
[<ffffffff811763f8>] vfs_write+0xb8/0x1a0
[<ffffffff81176e01>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task bdi-default:29 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bdi-default D 0000000000000002 0 29 2 0x00000000
ffff88013cdf1a40 0000000000000046 0000000000000000 ffff880139d7d980
ffff88013cdf1a50 ffffffffa02800ac ffff88013cdf1a18 ffff88013b7b1d70
ffff88013cdeb0b8 ffff88013cdf1fd8 000000000000f4e8 ffff88013cdeb0b8
Call Trace:
[<ffffffffa02800ac>] ? smb_sendv+0xfc/0x380 [cifs]
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276dbf>] cifs_writepages+0x63f/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff811a034d>] writeback_single_inode+0xdd/0x2c0
[<ffffffff811a078e>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a08eb>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff811a0c8b>] wb_writeback+0x29b/0x3f0
[<ffffffff814f3911>] ? thread_return+0x4e/0x79d
[<ffffffff811a0f79>] wb_do_writeback+0x199/0x240
[<ffffffff8107c020>] ? process_timeout+0x0/0x10
[<ffffffff811348ba>] bdi_forker_task+0x6a/0x310
[<ffffffff81134850>] ? bdi_forker_task+0x0/0x310
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task FahCore_a3.exe:14391 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
FahCore_a3.ex D 0000000000000001 0 14391 1972 0x00000080
ffff880138eedb58 0000000000000082 0000000000000000 ffff880037930a38
ffffffff81facfa0 ffffea0003fee0e0 ffff880138eedae8 ffffffff812763d6
ffff880037930638 ffff880138eedfd8 000000000000f4e8 ffff880037930638
Call Trace:
[<ffffffff812763d6>] ? __prop_inc_single+0x46/0x60
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f4a8f>] __wait_on_bit+0x5f/0x90
[<ffffffff81110b23>] wait_on_page_bit+0x73/0x80
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276a5d>] cifs_writepages+0x2dd/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff8111108b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111158f>] filemap_fdatawrite+0x1f/0x30
[<ffffffffa025e11c>] cifs_file_aio_write+0x3c/0x50 [cifs]
[<ffffffff811760fa>] do_sync_write+0xfa/0x140
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81060fa0>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff8120c296>] ? security_file_permission+0x16/0x20
[<ffffffff811763f8>] vfs_write+0xb8/0x1a0
[<ffffffff81176e01>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task kslowd000:14390 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd000 D 0000000000000001 0 14390 2 0x00000080
ffff8801394f7d00 0000000000000046 0000000000000000 0000000000000082
ffff8801394f7cd0 ffffffff8105e5dc 0000000000015f80 0000000000000002
ffff88013b7e4638 ffff8801394f7fd8 000000000000f4e8 ffff88013b7e4638
Call Trace:
[<ffffffff8105e5dc>] ? try_to_wake_up+0x24c/0x3e0
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa026994e>] cifs_writev_complete+0x1ee/0x210 [cifs]
[<ffffffff81104673>] slow_work_execute+0x233/0x310
[<ffffffff811048a7>] slow_work_thread+0x157/0x360
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81104750>] ? slow_work_thread+0x0/0x360
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task bdi-default:29 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bdi-default D 0000000000000002 0 29 2 0x00000000
ffff88013cdf1a40 0000000000000046 0000000000000000 ffff880139d7d980
ffff88013cdf1a50 ffffffffa02800ac ffff88013cdf1a18 ffff88013b7b1d70
ffff88013cdeb0b8 ffff88013cdf1fd8 000000000000f4e8 ffff88013cdeb0b8
Call Trace:
[<ffffffffa02800ac>] ? smb_sendv+0xfc/0x380 [cifs]
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276dbf>] cifs_writepages+0x63f/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff811a034d>] writeback_single_inode+0xdd/0x2c0
[<ffffffff811a078e>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a08eb>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff811a0c8b>] wb_writeback+0x29b/0x3f0
[<ffffffff814f3911>] ? thread_return+0x4e/0x79d
[<ffffffff811a0f79>] wb_do_writeback+0x199/0x240
[<ffffffff8107c020>] ? process_timeout+0x0/0x10
[<ffffffff811348ba>] bdi_forker_task+0x6a/0x310
[<ffffffff81134850>] ? bdi_forker_task+0x0/0x310
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task FahCore_a3.exe:14391 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
FahCore_a3.ex D 0000000000000001 0 14391 1972 0x00000080
ffff880138eedb58 0000000000000082 0000000000000000 ffff880037930a38
ffffffff81facfa0 ffffea0003fee0e0 ffff880138eedae8 ffffffff812763d6
ffff880037930638 ffff880138eedfd8 000000000000f4e8 ffff880037930638
Call Trace:
[<ffffffff812763d6>] ? __prop_inc_single+0x46/0x60
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f4a8f>] __wait_on_bit+0x5f/0x90
[<ffffffff81110b23>] wait_on_page_bit+0x73/0x80
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276a5d>] cifs_writepages+0x2dd/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff8111108b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111158f>] filemap_fdatawrite+0x1f/0x30
[<ffffffffa025e11c>] cifs_file_aio_write+0x3c/0x50 [cifs]
[<ffffffff811760fa>] do_sync_write+0xfa/0x140
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81060fa0>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff8120c296>] ? security_file_permission+0x16/0x20
[<ffffffff811763f8>] vfs_write+0xb8/0x1a0
[<ffffffff81176e01>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task kslowd000:14390 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd000 D 0000000000000001 0 14390 2 0x00000080
ffff8801394f7d00 0000000000000046 0000000000000000 0000000000000082
ffff8801394f7cd0 ffffffff8105e5dc 0000000000015f80 0000000000000002
ffff88013b7e4638 ffff8801394f7fd8 000000000000f4e8 ffff88013b7e4638
Call Trace:
[<ffffffff8105e5dc>] ? try_to_wake_up+0x24c/0x3e0
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa026994e>] cifs_writev_complete+0x1ee/0x210 [cifs]
[<ffffffff81104673>] slow_work_execute+0x233/0x310
[<ffffffff811048a7>] slow_work_thread+0x157/0x360
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81104750>] ? slow_work_thread+0x0/0x360
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task bdi-default:29 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bdi-default D 0000000000000002 0 29 2 0x00000000
ffff88013cdf1a40 0000000000000046 0000000000000000 ffff880139d7d980
ffff88013cdf1a50 ffffffffa02800ac ffff88013cdf1a18 ffff88013b7b1d70
ffff88013cdeb0b8 ffff88013cdf1fd8 000000000000f4e8 ffff88013cdeb0b8
Call Trace:
[<ffffffffa02800ac>] ? smb_sendv+0xfc/0x380 [cifs]
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276dbf>] cifs_writepages+0x63f/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff811a034d>] writeback_single_inode+0xdd/0x2c0
[<ffffffff811a078e>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a08eb>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff811a0c8b>] wb_writeback+0x29b/0x3f0
[<ffffffff814f3911>] ? thread_return+0x4e/0x79d
[<ffffffff811a0f79>] wb_do_writeback+0x199/0x240
[<ffffffff8107c020>] ? process_timeout+0x0/0x10
[<ffffffff811348ba>] bdi_forker_task+0x6a/0x310
[<ffffffff81134850>] ? bdi_forker_task+0x0/0x310
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task FahCore_a3.exe:14391 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
FahCore_a3.ex D 0000000000000001 0 14391 1972 0x00000080
ffff880138eedb58 0000000000000082 0000000000000000 ffff880037930a38
ffffffff81facfa0 ffffea0003fee0e0 ffff880138eedae8 ffffffff812763d6
ffff880037930638 ffff880138eedfd8 000000000000f4e8 ffff880037930638
Call Trace:
[<ffffffff812763d6>] ? __prop_inc_single+0x46/0x60
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f4a8f>] __wait_on_bit+0x5f/0x90
[<ffffffff81110b23>] wait_on_page_bit+0x73/0x80
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276a5d>] cifs_writepages+0x2dd/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff8111108b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111158f>] filemap_fdatawrite+0x1f/0x30
[<ffffffffa025e11c>] cifs_file_aio_write+0x3c/0x50 [cifs]
[<ffffffff811760fa>] do_sync_write+0xfa/0x140
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81060fa0>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff8120c296>] ? security_file_permission+0x16/0x20
[<ffffffff811763f8>] vfs_write+0xb8/0x1a0
[<ffffffff81176e01>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[root@hal9000 envirostate]#
[root@hal9000 envirostate]#


No VM software of any kind on that machine, so I guess I can rule out VirtualBox as the cause of these crashes. My server monitoring app crashed in the process. This server runs my HVAC as well, so this is quite serious. This one is running CentOS (version 5, I think, but I'm not entirely sure how to check).


Edit: seems this may have happened during my switch cutover. Still, a simple network connectivity issue should not crash an entire system.
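(For the version question, this should work on any Red Hat-family install, assuming the release file exists:)
Code:
cat /etc/redhat-release    # e.g. "CentOS release 6.x (Final)" on CentOS
uname -r                   # running kernel version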
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Ok this really sucks. One of my other servers started doing it too now. WTF?





No VM software of any kind on that machine, so I guess I can rule out VirtualBox as the cause of these crashes. My server monitoring app crashed in the process. This server runs my HVAC as well, so this is quite serious. This one is running CentOS (version 5, I think, but I'm not entirely sure how to check).


Edit: seems this may have happened during my switch cutover. Still, a simple network connectivity issue should not crash an entire system.

That's a CIFS mount hanging when you unplugged the cable. The entire system didn't crash, just the processes relying on data from that mount point. Those backtraces are just to let you know where the processes hung so you can fix the real issue.
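If the hang on a dead server bothers you, it's worth looking at the CIFS mount options. This is only a sketch (the share name, mount point, and credentials file are placeholders, and option behavior varies by kernel and mount.cifs version), but soft mounts return errors instead of blocking forever, while hard mounts wait for the server to come back:
Code:
# /etc/fstab entry (placeholders for the share, mount point, and credentials file):
//borg.loc/share  /mnt/borg  cifs  credentials=/root/.smbcred,soft,_netdev  0 0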
 

L0rD_LuCk

Junior Member
Dec 27, 2012
1
0
0
Hello,

I have the same problem (kjournald blocked for more than 120 seconds).
I'm running Debian with the 2.6.32 kernel, and it has run for about 1.5 years without any problems. My RAID 6 on an AMCC 3ware 9650SE-8LPML (8x 2TB) works without errors (logs checked). I run the system once or twice a month for backups; the whole array is encrypted with LUKS and the FS is ext3. The box does nothing but backups, and no fancy stuff is installed or running.

The error occurs when:
the system is freshly booted
the array is mounted
and I try to mkdir a folder on the array

The system freezes up, then a message like yours shows up, then the system goes back to normal, the folder is created, and I can work for hours under heavy load without hitting the problem again.

It's a really strange problem. I don't know if it matters, but the array is running out of space (I have 2TB left); is it possible that the filesystem journal needs something to work properly?
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Happening again on my brand-new file server. Is this a Linux issue? Does Linux/RAID choke up and act up when there's too much going on? It seems I/O related.

Here's some output, same deal, different system:

Code:
ata5: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata5: irq_stat 0x00400040, connection status changed
ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5: limiting SATA link speed to 1.5 Gbps
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 310)
ata5.00: disabled
ata5: EH complete
ata5.00: detaching (SCSI 9:0:0:0)
sd 9:0:0:0: [sdo] Synchronizing SCSI cache
sd 9:0:0:0: [sdo] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 9:0:0:0: [sdo] Stopping disk
sd 9:0:0:0: [sdo] START_STOP FAILED
sd 9:0:0:0: [sdo] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
md: data-check of RAID array md1
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 75000000 KB/sec) for data-check.
md: using 128k window, over a total of 976759936k.
md: data-check of RAID array md0
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 75000000 KB/sec) for data-check.
md: using 128k window, over a total of 5860270080k.
md: md1: data-check done.
md: md0: data-check done.
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff814a5d1f>] ? __udp_queue_rcv_skb+0x4f/0x170
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8143cfd7>] ? __kfree_skb+0x47/0xa0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8109ace2>] ? enqueue_hrtimer+0x82/0xd0
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff814a5d1f>] ? __udp_queue_rcv_skb+0x4f/0x170
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8143cfd7>] ? __kfree_skb+0x47/0xa0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8109ace2>] ? enqueue_hrtimer+0x82/0xd0
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff814a5d1f>] ? __udp_queue_rcv_skb+0x4f/0x170
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff81052201>] ? resched_task+0x11/0x80
 [<ffffffff8106317e>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff8109dd1d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff810522fd>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff814a5d1f>] ? __udp_queue_rcv_skb+0x4f/0x170
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff81052201>] ? resched_task+0x11/0x80
 [<ffffffff8106317e>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff8109dd1d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff810522fd>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff81063322>] ? default_wake_function+0x12/0x20
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8105ad3e>] ? scheduler_tick+0x11e/0x260
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff8109dd1d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff810522fd>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff81063322>] ? default_wake_function+0x12/0x20
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8105ad3e>] ? scheduler_tick+0x11e/0x260
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff8109dd1d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
md: data-check of RAID array md1
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 75000000 KB/sec) for data-check.
md: using 128k window, over a total of 976759936k.
md: data-check of RAID array md0
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 75000000 KB/sec) for data-check.
md: using 128k window, over a total of 5860270080k.
md: md0: data-check done.
md: md1: data-check done.
ata5: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
ata5: irq_stat 0x00400040, connection status changed
ata5: SError: { RecovComm PHYRdyChg CommWake DevExch }
ata5: hard resetting link
ata5: link is slow to respond, please be patient (ready=0)
ata5: COMRESET failed (errno=-16)
ata5: hard resetting link
ata5: link is slow to respond, please be patient (ready=0)
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: ATA-7: Hitachi HDS721010KLA330, GKAOAB0A, max UDMA/133
ata5.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
ata5.00: configured for UDMA/133
ata5: EH complete
scsi 9:0:0:0: Direct-Access     ATA      Hitachi HDS72101 GKAO PQ: 0 ANSI: 5
sd 9:0:0:0: Attached scsi generic sg14 type 0
sd 9:0:0:0: [sdo] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 9:0:0:0: [sdo] Write Protect is off
sd 9:0:0:0: [sdo] Mode Sense: 00 3a 00 00
sd 9:0:0:0: [sdo] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdo: sdo1
sd 9:0:0:0: [sdo] Attached SCSI disk
kjournald starting.  Commit interval 5 seconds
EXT3-fs (sdo1): warning: checktime reached, running e2fsck is recommended
EXT3-fs (sdo1): using internal journal
EXT3-fs (sdo1): mounted filesystem with ordered data mode
rsync: page allocation failure. order:1, mode:0x20
Pid: 19498, comm: rsync Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff8105230c>] ? check_preempt_curr+0x7c/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff81063322>] ? default_wake_function+0x12/0x20
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI> 
rsync: page allocation failure. order:1, mode:0x20
Pid: 19498, comm: rsync Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff8105230c>] ? check_preempt_curr+0x7c/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff81063322>] ? default_wake_function+0x12/0x20
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI> 
ata5: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata5: irq_stat 0x00400040, connection status changed
ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5: limiting SATA link speed to 1.5 Gbps
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 310)
ata5.00: disabled
ata5: EH complete
ata5.00: detaching (SCSI 9:0:0:0)
sd 9:0:0:0: [sdo] Synchronizing SCSI cache
sd 9:0:0:0: [sdo] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 9:0:0:0: [sdo] Stopping disk
sd 9:0:0:0: [sdo] START_STOP FAILED
sd 9:0:0:0: [sdo] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
ata5: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
ata5: irq_stat 0x00400040, connection status changed
ata5: SError: { RecovComm PHYRdyChg CommWake DevExch }
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5.00: link offline, clearing class 1 to NONE
ata5: EH complete
 

theevilsharpie

Platinum Member
Nov 2, 2009
2,322
14
81
Code:
ata5.00: ATA-7: Hitachi HDS721010KLA330, GKAOAB0A, max UDMA/133

This drive is failing. Remove it from the bus or evict it from your raid set.
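A minimal sketch of confirming and evicting it, assuming the disk is still /dev/sdo as in your log and that its partition is a member of one of the md arrays (adjust the device and array names to match):
Code:
smartctl -a /dev/sdo                                 # look at reallocated/pending sector counts
mdadm --detail /dev/md0                              # check which array the disk belongs to
mdadm /dev/md0 --fail /dev/sdo1 --remove /dev/sdo1   # mark it failed and pull it from the array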
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Wow, is that really all there is to it? I've been chasing my tail on this issue forever; I had given up hope, thinking a new file server would fix things. Now to find out which drive that is. HDS721010KLA330 is the serial number, I take it?

Edit: Actually, I think that may be one of my backup drives... it would have been having data written to it at the time, but it's otherwise not all that critical. I'll just have to find it physically so I can remove it from the backup pool and destroy it.
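(Noting for myself: ways to map the device to the physical drive, assuming it's still attached as /dev/sdo; the serial these print is usually on the drive's label:)
Code:
smartctl -i /dev/sdo                  # prints model and serial number
hdparm -I /dev/sdo | grep -i serial   # alternative way to read the serial
ls -l /dev/disk/by-id/ | grep sdo     # by-id symlinks embed model and serial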
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Well, this seems to have started again. I caught it: I got a bunch of alerts that some failed services had recovered (I never got the alerts that they died; the systems were probably frozen trying to send them), so I checked my file server and sure enough found some crash logs:

Code:
rsync: page allocation failure. order:1, mode:0x20
Pid: 18527, comm: rsync Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffffa01d4069>] ? e1000_xmit_frame+0x769/0xdc0 [e1000e]
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff8105230c>] ? check_preempt_curr+0x7c/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8107700f>] ? __do_softirq+0x11f/0x1e0
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>

This happened about 3 times. No other errors, nothing pointing to a disk. Even then, a disk failure should not be crashing the system; that's the whole point of RAID.


This seems to be triggered by any kind of heavy I/O, such as rsync backups or torrents.
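(Following Cerb's earlier suggestion, a sketch of what I could watch while a backup runs, assuming the usual /proc paths; the failures above are order-1 atomic allocations from the network receive path, so raising the min_free_kbytes reserve is the knob that was suggested for exactly this:)
Code:
watch -n5 cat /proc/buddyinfo            # order-1+ columns near zero while rsync runs = fragmented
cat /proc/sys/vm/min_free_kbytes         # current reserve
sysctl -w vm.min_free_kbytes=131072      # temporarily raise the reserve and see if the failures stop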
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Whatever is the latest in CentOS. They're all different systems, so I can't give an exact version since they may differ, but the particular server that had the error last has the following:

Kernel: Linux isengard.loc 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Rsync: version 3.0.6 protocol version 30
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
Thanks for clarifying that, mv2devnull. This kernel isn't quite as old as I thought, but if I'm reading Red Hat/CentOS correctly, it stopped getting updates in March 2015.

Also, rsync is presently on version 3.1.1 (which is up-to-date in Debian-based distros). 3.0.6 is over 6 years old.

I know it's the obvious and not insightful answer, but I would try to get your kernel and rsync up-to-date before I spent too much more time troubleshooting.
 

mv2devnull

Golden Member
Apr 13, 2010
1,526
160
106
The RH version numbers are not directly comparable to upstream, because RH backports fixes into their packages. That way a major release of RHEL (and thus CentOS too) has a stable feature set for its entire lifetime of several years. (A bit similar to LTS releases in Ubuntu.)

Nevertheless, 2.6.32-573.* is the kernel of CentOS 6 now.
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Hmm is "yum update" not sufficient to update properly? What other command should I try? I do that rather regularly on all my systems. This issue has been going on for YEARS though, so unless it's something that got fixed recently I'm suspecting it's something I need to configure differently somewhere, like maybe some kind of time out setting of sorts.
 

mv2devnull

Golden Member
Apr 13, 2010
1,526
160
106
yum update should be sufficient. I guess that there is a trivial explanation.
What do these show?
Code:
cat /etc/{centos,redhat}-release
uptime
uname -rv
rpm -q kernel
rpm -q rsync
df -h /boot
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Here's the output:


Code:
[root@isengard ~]# cat /etc/{centos,redhat}-release
CentOS release 6.7 (Final)
CentOS release 6.7 (Final)
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# uptime
 18:47:25 up 449 days, 18:42,  1 user,  load average: 1.82, 2.68, 2.45
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# uname -rv
2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# rpm -q kernel
kernel-2.6.32-358.el6.x86_64
kernel-2.6.32-504.3.3.el6.x86_64
kernel-2.6.32-504.23.4.el6.x86_64
kernel-2.6.32-504.30.3.el6.x86_64
kernel-2.6.32-573.3.1.el6.x86_64
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# rpm -q rsync
rsync-3.0.6-12.el6.x86_64
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# df -h /boot
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdn1       485M  157M  303M  35% /boot
[root@isengard ~]#


I see that kernel 573 is there, but I guess it's not the one that's actually running?
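(Checking which kernel is actually in play, a sketch assuming the standard CentOS 6 GRUB setup; grubby ships on EL6:)
Code:
uname -r                                          # kernel currently running (still -358 here)
grubby --default-kernel                           # kernel that will be booted next time
grep -E '^(default|title)' /boot/grub/grub.conf   # or read the GRUB menu directly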
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
I can't reboot that machine; it's the file server, which also acts as storage for VMs. Is there another way to load the kernel so it's seamless? How would an enterprise situation handle it? I guess they'd have multiple servers in a shared-storage/HA config? Something I do want to look into at some point, but there's no money for it now.
 

mv2devnull

Golden Member
Apr 13, 2010
1,526
160
106
HA is nice, but even the big HPC clusters do have scheduled maintenance breaks occasionally.

It is that simple; you tell the users that there will be a break and then you have the break.

If your service is so critical that no break is possible, then it should already have a proper HA.


PS. I would do
Code:
yum remove kernel-2.6.32-573.3.1.el6
yum update
before the reboot. Why? The latest kernel came from the 6.7 point update, and that same yum update pulled in many other packages, some of which the kernel installation uses. This extra step ensures the kernel install runs against the latest, already-installed packages.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
I can't reboot that machine; it's the file server, which also acts as storage for VMs. Is there another way to load the kernel so it's seamless? How would an enterprise situation handle it? I guess they'd have multiple servers in a shared-storage/HA config? Something I do want to look into at some point, but there's no money for it now.

The only thing Linux can't do live right now is upgrade the kernel. That may actually change in the not-too-distant future; it's a feature lots of people are interested in, but for now it's reboot or nothing.

To do kernel upgrades, even an enterprise user has to reboot; they would usually have some sort of fail-over or a scheduled maintenance window to bridge the gap.

You really can't reboot this machine? What is your plan if you have a power outage that outlasts your UPS? :eek:
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
You really can't reboot this machine? What is your plan if you have a power outage that outlasts your UPS? :eek:

My UPS is good for about 4 hours. I need to add more batteries, though, as we've had a few outages that pushed dangerously close to that, and I've added more equipment since the last test run.

The downtime is not so much the problem; it's the risk that something does not come back up properly. But I guess I'll have to bite the bullet and find a good time to do it. I have a game server that runs off this environment as well, and it's already been through several maintenance windows, so I hate to do that to it again.

I also did a boo-boo on this server one time. I was doing mv commands to move a bunch of files into the same folder, basically "mv a/* ." repeated for b, c, d, etc. I was typing fast, made a typo, and did "mv / .". It started moving system folders like /usr and /etc. I did move them back, but who knows if the attributes and everything are right; I did my best to compare against another server, but there may be special attributes I can't see. I'm scared the system might not boot back up. A lot of this system stuff needs to be in a specific location on the platters as well; I think /boot is probably the most important.
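(One hedged way to sanity-check the damage from that mv, assuming the affected paths are RPM-managed: rpm can verify ownership, permissions, and checksums against its database, and restorecon can reset SELinux contexts if SELinux is enabled on the box:)
Code:
rpm -Va > /root/rpm-verify.txt        # flag files whose mode/owner/group/checksum differ from the RPM db
less /root/rpm-verify.txt             # look for M (mode), U (user), or G (group) flags on system paths
restorecon -Rv /etc /usr /bin /sbin   # reset SELinux contexts, if SELinux is in use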