I'm at a loss, what keeps crashing my system?


Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Those messages indicate that you ran out of memory and the OOM killer should have kicked in.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
The original I/O-wait messages could have been related to trouble cleaning (writing back dirty pages) and defragmenting memory. You are definitely running low on memory. So, when these errors aren't happening, what does your available RAM look like (free+caches)? If I'm reading it right, SSH tried to allocate 128MB (why?) and failed, though you should have had 160+MB that could have been freed. However, if it wanted that 128MB to be contiguous, you could very well be stuck waiting a long time for the kernel to make that happen.

So, all of the below assumes that you have plenty of physical memory, but that it isn't being used effectively, and you end up stuck with too much fragmentation and I/O wait from backlogged writebacks blocking successful allocation. Assuming you do indeed have enough RAM, that would make sense with regard to the earlier errors, too. If you're actually running close to the amount of RAM in the computer, then the solution, sadly, is to fix it at that level first (if you don't have 20%+ free for caching, you may as well have run out already). I can't say I know enough to deduce your actual total RAM from the error (though I'm sure it's quite possible).

Code:
Active:132283
inactive:1508049
dirty:161830
writeback:2625
unstable:0
free:10126
slab:90087
mapped:168019
pagetables:12444
bounce:0
That's quite a bit of dirty memory, and a fair amount of slab, relative to the rest. I'm not used to seeing writeback above a couple hundred pages, but I can't say I know what, if anything, that means (it might just be high because there are lots of dirty pages, and mean nothing on its own).
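If you want to check, a minimal sketch of what I'd look at, assuming the usual /proc layout on a kernel of this vintage (buddyinfo shows free blocks per allocation order, so mostly-empty high-order columns mean fragmentation):
Code:
free -m                            # free vs. buffers/cached
grep -i -E 'dirty|writeback' /proc/meminfo
cat /proc/buddyinfo                # free blocks per order, per zone
cat /proc/sys/vm/min_free_kbytes   # current reserve the kernel keeps free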

I would set extfrag_threshold to 100 (the default is 500, and a lower value makes it more aggressive about defragmenting), and dirty_ratio and dirty_background_ratio both to 10 or less, and then see how the machine fares (note that low dirty ratios can give Linux a jerky feel when a lot of disk writing is going on, and will, in general, increase the amount and frequency of writes to disk). Depending on what your current min free value is, you may wish to increase it to 100+MB.

I've found Virtualbox and WINE can be really bad about causing fragmentation to outpace the background compactor, though I've managed to keep from getting crashes from it.

So, something like maybe this, in /etc/sysctl.conf:
Code:
vm.min_free_kbytes = 262144
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.extfrag_threshold = 100
It's on the extreme side, so if your crashes disappear, start to loosen up the dirty ratios for smoother performance (I've not yet found a lower extfrag_threshold to cause bad behavior, but documentation for it is kind of sparse), and maybe reduce the min free to 128MB (131072), or even 64MB (65536).
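These can be applied without a reboot, too. A minimal sketch, assuming your sysctl supports -p/-w (the values are just the ones from above, and vm.extfrag_threshold only exists if the kernel has memory compaction):
Code:
sysctl -p                                  # load everything from /etc/sysctl.conf
# or set them one at a time while testing:
sysctl -w vm.min_free_kbytes=262144
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.extfrag_threshold=100
cat /proc/sys/vm/min_free_kbytes           # confirm what the kernel is actually using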
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
This is odd: VirtualBox seems to be using over 7GB, and I have 8GB, so this explains things... I only have 1 VM running, and it has 512k assigned to it. Not sure whether this was the original issue or not. I used to have more VMs running before; no way I'd be able to start them now if I tried.
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Ok this really sucks. One of my other servers started doing it too now. WTF?


[root@hal9000 envirostate]# dmesg
e1000e: eth0 NIC Link is Down
CIFS VFS: Server borg.loc has not responded in 300 seconds. Reconnecting...
INFO: task bdi-default:29 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bdi-default D 0000000000000002 0 29 2 0x00000000
ffff88013cdf1a40 0000000000000046 0000000000000000 ffff880139d7d980
ffff88013cdf1a50 ffffffffa02800ac ffff88013cdf1a18 ffff88013b7b1d70
ffff88013cdeb0b8 ffff88013cdf1fd8 000000000000f4e8 ffff88013cdeb0b8
Call Trace:
[<ffffffffa02800ac>] ? smb_sendv+0xfc/0x380 [cifs]
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276dbf>] cifs_writepages+0x63f/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff811a034d>] writeback_single_inode+0xdd/0x2c0
[<ffffffff811a078e>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a08eb>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff811a0c8b>] wb_writeback+0x29b/0x3f0
[<ffffffff814f3911>] ? thread_return+0x4e/0x79d
[<ffffffff811a0f79>] wb_do_writeback+0x199/0x240
[<ffffffff8107c020>] ? process_timeout+0x0/0x10
[<ffffffff811348ba>] bdi_forker_task+0x6a/0x310
[<ffffffff81134850>] ? bdi_forker_task+0x0/0x310
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task FahCore_a3.exe:14391 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
FahCore_a3.ex D 0000000000000001 0 14391 1972 0x00000080
ffff880138eedb58 0000000000000082 0000000000000000 ffff880037930a38
ffffffff81facfa0 ffffea0003fee0e0 ffff880138eedae8 ffffffff812763d6
ffff880037930638 ffff880138eedfd8 000000000000f4e8 ffff880037930638
Call Trace:
[<ffffffff812763d6>] ? __prop_inc_single+0x46/0x60
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f4a8f>] __wait_on_bit+0x5f/0x90
[<ffffffff81110b23>] wait_on_page_bit+0x73/0x80
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276a5d>] cifs_writepages+0x2dd/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff8111108b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111158f>] filemap_fdatawrite+0x1f/0x30
[<ffffffffa025e11c>] cifs_file_aio_write+0x3c/0x50 [cifs]
[<ffffffff811760fa>] do_sync_write+0xfa/0x140
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81060fa0>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff8120c296>] ? security_file_permission+0x16/0x20
[<ffffffff811763f8>] vfs_write+0xb8/0x1a0
[<ffffffff81176e01>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task bdi-default:29 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bdi-default D 0000000000000002 0 29 2 0x00000000
ffff88013cdf1a40 0000000000000046 0000000000000000 ffff880139d7d980
ffff88013cdf1a50 ffffffffa02800ac ffff88013cdf1a18 ffff88013b7b1d70
ffff88013cdeb0b8 ffff88013cdf1fd8 000000000000f4e8 ffff88013cdeb0b8
Call Trace:
[<ffffffffa02800ac>] ? smb_sendv+0xfc/0x380 [cifs]
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276dbf>] cifs_writepages+0x63f/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff811a034d>] writeback_single_inode+0xdd/0x2c0
[<ffffffff811a078e>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a08eb>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff811a0c8b>] wb_writeback+0x29b/0x3f0
[<ffffffff814f3911>] ? thread_return+0x4e/0x79d
[<ffffffff811a0f79>] wb_do_writeback+0x199/0x240
[<ffffffff8107c020>] ? process_timeout+0x0/0x10
[<ffffffff811348ba>] bdi_forker_task+0x6a/0x310
[<ffffffff81134850>] ? bdi_forker_task+0x0/0x310
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task FahCore_a3.exe:14391 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
FahCore_a3.ex D 0000000000000001 0 14391 1972 0x00000080
ffff880138eedb58 0000000000000082 0000000000000000 ffff880037930a38
ffffffff81facfa0 ffffea0003fee0e0 ffff880138eedae8 ffffffff812763d6
ffff880037930638 ffff880138eedfd8 000000000000f4e8 ffff880037930638
Call Trace:
[<ffffffff812763d6>] ? __prop_inc_single+0x46/0x60
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f4a8f>] __wait_on_bit+0x5f/0x90
[<ffffffff81110b23>] wait_on_page_bit+0x73/0x80
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276a5d>] cifs_writepages+0x2dd/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff8111108b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111158f>] filemap_fdatawrite+0x1f/0x30
[<ffffffffa025e11c>] cifs_file_aio_write+0x3c/0x50 [cifs]
[<ffffffff811760fa>] do_sync_write+0xfa/0x140
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81060fa0>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff8120c296>] ? security_file_permission+0x16/0x20
[<ffffffff811763f8>] vfs_write+0xb8/0x1a0
[<ffffffff81176e01>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task kslowd000:14390 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd000 D 0000000000000001 0 14390 2 0x00000080
ffff8801394f7d00 0000000000000046 0000000000000000 0000000000000082
ffff8801394f7cd0 ffffffff8105e5dc 0000000000015f80 0000000000000002
ffff88013b7e4638 ffff8801394f7fd8 000000000000f4e8 ffff88013b7e4638
Call Trace:
[<ffffffff8105e5dc>] ? try_to_wake_up+0x24c/0x3e0
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa026994e>] cifs_writev_complete+0x1ee/0x210 [cifs]
[<ffffffff81104673>] slow_work_execute+0x233/0x310
[<ffffffff811048a7>] slow_work_thread+0x157/0x360
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81104750>] ? slow_work_thread+0x0/0x360
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task bdi-default:29 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bdi-default D 0000000000000002 0 29 2 0x00000000
ffff88013cdf1a40 0000000000000046 0000000000000000 ffff880139d7d980
ffff88013cdf1a50 ffffffffa02800ac ffff88013cdf1a18 ffff88013b7b1d70
ffff88013cdeb0b8 ffff88013cdf1fd8 000000000000f4e8 ffff88013cdeb0b8
Call Trace:
[<ffffffffa02800ac>] ? smb_sendv+0xfc/0x380 [cifs]
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276dbf>] cifs_writepages+0x63f/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff811a034d>] writeback_single_inode+0xdd/0x2c0
[<ffffffff811a078e>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a08eb>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff811a0c8b>] wb_writeback+0x29b/0x3f0
[<ffffffff814f3911>] ? thread_return+0x4e/0x79d
[<ffffffff811a0f79>] wb_do_writeback+0x199/0x240
[<ffffffff8107c020>] ? process_timeout+0x0/0x10
[<ffffffff811348ba>] bdi_forker_task+0x6a/0x310
[<ffffffff81134850>] ? bdi_forker_task+0x0/0x310
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task FahCore_a3.exe:14391 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
FahCore_a3.ex D 0000000000000001 0 14391 1972 0x00000080
ffff880138eedb58 0000000000000082 0000000000000000 ffff880037930a38
ffffffff81facfa0 ffffea0003fee0e0 ffff880138eedae8 ffffffff812763d6
ffff880037930638 ffff880138eedfd8 000000000000f4e8 ffff880037930638
Call Trace:
[<ffffffff812763d6>] ? __prop_inc_single+0x46/0x60
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f4a8f>] __wait_on_bit+0x5f/0x90
[<ffffffff81110b23>] wait_on_page_bit+0x73/0x80
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276a5d>] cifs_writepages+0x2dd/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff8111108b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111158f>] filemap_fdatawrite+0x1f/0x30
[<ffffffffa025e11c>] cifs_file_aio_write+0x3c/0x50 [cifs]
[<ffffffff811760fa>] do_sync_write+0xfa/0x140
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81060fa0>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff8120c296>] ? security_file_permission+0x16/0x20
[<ffffffff811763f8>] vfs_write+0xb8/0x1a0
[<ffffffff81176e01>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task kslowd000:14390 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd000 D 0000000000000001 0 14390 2 0x00000080
ffff8801394f7d00 0000000000000046 0000000000000000 0000000000000082
ffff8801394f7cd0 ffffffff8105e5dc 0000000000015f80 0000000000000002
ffff88013b7e4638 ffff8801394f7fd8 000000000000f4e8 ffff88013b7e4638
Call Trace:
[<ffffffff8105e5dc>] ? try_to_wake_up+0x24c/0x3e0
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa026994e>] cifs_writev_complete+0x1ee/0x210 [cifs]
[<ffffffff81104673>] slow_work_execute+0x233/0x310
[<ffffffff811048a7>] slow_work_thread+0x157/0x360
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81104750>] ? slow_work_thread+0x0/0x360
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task bdi-default:29 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bdi-default D 0000000000000002 0 29 2 0x00000000
ffff88013cdf1a40 0000000000000046 0000000000000000 ffff880139d7d980
ffff88013cdf1a50 ffffffffa02800ac ffff88013cdf1a18 ffff88013b7b1d70
ffff88013cdeb0b8 ffff88013cdf1fd8 000000000000f4e8 ffff88013cdeb0b8
Call Trace:
[<ffffffffa02800ac>] ? smb_sendv+0xfc/0x380 [cifs]
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f493a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81110907>] __lock_page+0x67/0x70
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276dbf>] cifs_writepages+0x63f/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff811a034d>] writeback_single_inode+0xdd/0x2c0
[<ffffffff811a078e>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a08eb>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff810096f0>] ? __switch_to+0xd0/0x320
[<ffffffff811a0c8b>] wb_writeback+0x29b/0x3f0
[<ffffffff814f3911>] ? thread_return+0x4e/0x79d
[<ffffffff811a0f79>] wb_do_writeback+0x199/0x240
[<ffffffff8107c020>] ? process_timeout+0x0/0x10
[<ffffffff811348ba>] bdi_forker_task+0x6a/0x310
[<ffffffff81134850>] ? bdi_forker_task+0x0/0x310
[<ffffffff810906a6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090610>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
INFO: task FahCore_a3.exe:14391 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
FahCore_a3.ex D 0000000000000001 0 14391 1972 0x00000080
ffff880138eedb58 0000000000000082 0000000000000000 ffff880037930a38
ffffffff81facfa0 ffffea0003fee0e0 ffff880138eedae8 ffffffff812763d6
ffff880037930638 ffff880138eedfd8 000000000000f4e8 ffff880037930638
Call Trace:
[<ffffffff812763d6>] ? __prop_inc_single+0x46/0x60
[<ffffffff81110930>] ? sync_page+0x0/0x50
[<ffffffff814f40d3>] io_schedule+0x73/0xc0
[<ffffffff8111096d>] sync_page+0x3d/0x50
[<ffffffff814f4a8f>] __wait_on_bit+0x5f/0x90
[<ffffffff81110b23>] wait_on_page_bit+0x73/0x80
[<ffffffff81090a50>] ? wake_bit_function+0x0/0x50
[<ffffffffa0276a5d>] cifs_writepages+0x2dd/0x670 [cifs]
[<ffffffff81126171>] do_writepages+0x21/0x40
[<ffffffff8111108b>] __filemap_fdatawrite_range+0x5b/0x60
[<ffffffff8111158f>] filemap_fdatawrite+0x1f/0x30
[<ffffffffa025e11c>] cifs_file_aio_write+0x3c/0x50 [cifs]
[<ffffffff811760fa>] do_sync_write+0xfa/0x140
[<ffffffff81090a10>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81060fa0>] ? pick_next_task_fair+0xd0/0x130
[<ffffffff8120c296>] ? security_file_permission+0x16/0x20
[<ffffffff811763f8>] vfs_write+0xb8/0x1a0
[<ffffffff81176e01>] sys_write+0x51/0x90
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[root@hal9000 envirostate]#
[root@hal9000 envirostate]#


No VM software of any kind on that machine, so I guess I can rule out VirtualBox as the cause of these crashes. My server monitoring app crashed in the process. This server runs my HVAC as well, so this is quite serious. This one is running CentOS (version 5, I think, but I'm not entirely sure how to check).


Edit: seems this may have happened during my switch cutover. Still, a simple network connectivity issue should not crash an entire system.
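(For the version question, this should work on any Red Hat-family install, assuming the release file exists:)
Code:
cat /etc/redhat-release    # e.g. "CentOS release 6.x (Final)" on CentOS
uname -r                   # running kernel version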
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
Ok this really sucks. One of my other servers started doing it too now. WTF?





No VM software of any kind on that machine, so I guess I can rule out VirtualBox as the cause of these crashes. My server monitoring app crashed in the process. This server runs my HVAC as well, so this is quite serious. This one is running CentOS (version 5, I think, but I'm not entirely sure how to check).


Edit: seems this may have happened during my switch cutover. Still, a simple network connectivity issue should not crash an entire system.

That's a CIFS mount hanging when you unplugged the cable. The entire system didn't crash, just the processes relying on data from that mount point. Those backtraces are just to let you know where the processes hung so you can fix the real issue.
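If the hang on a dead server bothers you, it's worth looking at the CIFS mount options. This is only a sketch (the share name, mount point, and credentials file are placeholders, and option behavior varies by kernel and mount.cifs version), but soft mounts return errors instead of blocking forever, while hard mounts wait for the server to come back:
Code:
# /etc/fstab entry (placeholders for the share, mount point, and credentials file):
//borg.loc/share  /mnt/borg  cifs  credentials=/root/.smbcred,soft,_netdev  0 0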
 

L0rD_LuCk

Junior Member
Dec 27, 2012
1
0
0
Hello,

I have the same problem (kjournald blocked for more than 120 seconds).
I'm running Debian with the 2.6.32 kernel, and it has run for about 1.5 years without any problems. My RAID 6 on an AMCC 3ware 9650SE-8LPML (8x 2TB) works without errors (logs checked). I run the system once or twice a month for backups; the whole array is encrypted with LUKS and the FS is ext3. The box does nothing but backups, and no fancy stuff is installed or running.

The error occurs when:
the system is freshly booted
the array is mounted
and I try to mkdir a folder on the array

The system freezes up, then a message like yours shows up, then the system goes back to normal, the folder is created, and I can work for hours under heavy load without hitting the problem again.

It's a really strange problem. I don't know if it matters, but the array is running out of space (I have 2TB left); is it possible that the filesystem journal needs something to work properly?
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Happening again on my brand-new file server. Is this a Linux issue? Does Linux/RAID choke up and act up when there's too much going on? It seems I/O related.

Here's some output, same deal, different system:

Code:
ata5: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata5: irq_stat 0x00400040, connection status changed
ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5: limiting SATA link speed to 1.5 Gbps
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 310)
ata5.00: disabled
ata5: EH complete
ata5.00: detaching (SCSI 9:0:0:0)
sd 9:0:0:0: [sdo] Synchronizing SCSI cache
sd 9:0:0:0: [sdo] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 9:0:0:0: [sdo] Stopping disk
sd 9:0:0:0: [sdo] START_STOP FAILED
sd 9:0:0:0: [sdo] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
md: data-check of RAID array md1
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 75000000 KB/sec) for data-check.
md: using 128k window, over a total of 976759936k.
md: data-check of RAID array md0
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 75000000 KB/sec) for data-check.
md: using 128k window, over a total of 5860270080k.
md: md1: data-check done.
md: md0: data-check done.
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff814a5d1f>] ? __udp_queue_rcv_skb+0x4f/0x170
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8143cfd7>] ? __kfree_skb+0x47/0xa0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8109ace2>] ? enqueue_hrtimer+0x82/0xd0
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff814a5d1f>] ? __udp_queue_rcv_skb+0x4f/0x170
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8143cfd7>] ? __kfree_skb+0x47/0xa0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8109ace2>] ? enqueue_hrtimer+0x82/0xd0
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff814a5d1f>] ? __udp_queue_rcv_skb+0x4f/0x170
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff81052201>] ? resched_task+0x11/0x80
 [<ffffffff8106317e>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff8109dd1d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff810522fd>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff814a5d1f>] ? __udp_queue_rcv_skb+0x4f/0x170
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff81052201>] ? resched_task+0x11/0x80
 [<ffffffff8106317e>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff8109dd1d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff810522fd>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff81063322>] ? default_wake_function+0x12/0x20
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8105ad3e>] ? scheduler_tick+0x11e/0x260
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff8109dd1d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
swapper: page allocation failure. order:1, mode:0x20
Pid: 0, comm: swapper Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff810522fd>] ? check_preempt_curr+0x6d/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff81063322>] ? default_wake_function+0x12/0x20
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8105ad3e>] ? scheduler_tick+0x11e/0x260
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>  [<ffffffff812d37ae>] ? intel_idle+0xde/0x170
 [<ffffffff812d3791>] ? intel_idle+0xc1/0x170
 [<ffffffff8109dd1d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81414ef7>] ? cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814f2f2a>] ? rest_init+0x7a/0x80
 [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
 [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109
md: data-check of RAID array md1
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 75000000 KB/sec) for data-check.
md: using 128k window, over a total of 976759936k.
md: data-check of RAID array md0
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 75000000 KB/sec) for data-check.
md: using 128k window, over a total of 5860270080k.
md: md0: data-check done.
md: md1: data-check done.
ata5: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
ata5: irq_stat 0x00400040, connection status changed
ata5: SError: { RecovComm PHYRdyChg CommWake DevExch }
ata5: hard resetting link
ata5: link is slow to respond, please be patient (ready=0)
ata5: COMRESET failed (errno=-16)
ata5: hard resetting link
ata5: link is slow to respond, please be patient (ready=0)
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: ATA-7: Hitachi HDS721010KLA330, GKAOAB0A, max UDMA/133
ata5.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
ata5.00: configured for UDMA/133
ata5: EH complete
scsi 9:0:0:0: Direct-Access     ATA      Hitachi HDS72101 GKAO PQ: 0 ANSI: 5
sd 9:0:0:0: Attached scsi generic sg14 type 0
sd 9:0:0:0: [sdo] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
sd 9:0:0:0: [sdo] Write Protect is off
sd 9:0:0:0: [sdo] Mode Sense: 00 3a 00 00
sd 9:0:0:0: [sdo] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdo: sdo1
sd 9:0:0:0: [sdo] Attached SCSI disk
kjournald starting.  Commit interval 5 seconds
EXT3-fs (sdo1): warning: checktime reached, running e2fsck is recommended
EXT3-fs (sdo1): using internal journal
EXT3-fs (sdo1): mounted filesystem with ordered data mode
rsync: page allocation failure. order:1, mode:0x20
Pid: 19498, comm: rsync Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff8105230c>] ? check_preempt_curr+0x7c/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff81063322>] ? default_wake_function+0x12/0x20
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI> 
rsync: page allocation failure. order:1, mode:0x20
Pid: 19498, comm: rsync Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff8105230c>] ? check_preempt_curr+0x7c/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff81063322>] ? default_wake_function+0x12/0x20
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI> 
ata5: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata5: irq_stat 0x00400040, connection status changed
ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5: limiting SATA link speed to 1.5 Gbps
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 310)
ata5.00: disabled
ata5: EH complete
ata5.00: detaching (SCSI 9:0:0:0)
sd 9:0:0:0: [sdo] Synchronizing SCSI cache
sd 9:0:0:0: [sdo] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 9:0:0:0: [sdo] Stopping disk
sd 9:0:0:0: [sdo] START_STOP FAILED
sd 9:0:0:0: [sdo] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
ata5: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
ata5: irq_stat 0x00400040, connection status changed
ata5: SError: { RecovComm PHYRdyChg CommWake DevExch }
ata5: hard resetting link
ata5: SATA link down (SStatus 0 SControl 300)
ata5.00: link offline, clearing class 1 to NONE
ata5: EH complete
 

theevilsharpie

Platinum Member
Nov 2, 2009
2,322
14
81
Code:
ata5.00: ATA-7: Hitachi HDS721010KLA330, GKAOAB0A, max UDMA/133

This drive is failing. Remove it from the bus or evict it from your raid set.
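A minimal sketch of confirming and evicting it, assuming the disk is still /dev/sdo as in your log and that its partition is a member of one of the md arrays (adjust the device and array names to match):
Code:
smartctl -a /dev/sdo                                 # look at reallocated/pending sector counts
mdadm --detail /dev/md0                              # check which array the disk belongs to
mdadm /dev/md0 --fail /dev/sdo1 --remove /dev/sdo1   # mark it failed and pull it from the array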
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Wow, is that really all there is to it? I've been chasing my tail on this issue forever; I had given up hope, thinking a new file server would fix things. Now to find out which drive that is. HDS721010KLA330 is the serial number, I take it?

Edit: Actually, I think that may be one of my backup drives... it would have been having data written to it at the time, but it's otherwise not all that critical. I'll just have to find it physically so I can remove it from the backup pool and destroy it.
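(Noting for myself: ways to map the device to the physical drive, assuming it's still attached as /dev/sdo; the serial these print is usually on the drive's label:)
Code:
smartctl -i /dev/sdo                  # prints model and serial number
hdparm -I /dev/sdo | grep -i serial   # alternative way to read the serial
ls -l /dev/disk/by-id/ | grep sdo     # by-id symlinks embed model and serial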
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Well, this seems to have started again. I caught it: I got a bunch of alerts that some failed services had recovered (I never got the alerts that they died; the systems were probably frozen trying to send them), so I checked my file server and sure enough found some crash logs:

Code:
rsync: page allocation failure. order:1, mode:0x20
Pid: 18527, comm: rsync Not tainted 2.6.32-358.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
 [<ffffffffa01d4069>] ? e1000_xmit_frame+0x769/0xdc0 [e1000e]
 [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
 [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff811682eb>] ? kmem_cache_alloc+0x11b/0x190
 [<ffffffff81439b38>] ? sk_prot_alloc+0x48/0x1c0
 [<ffffffff8143ac12>] ? sk_clone+0x22/0x2e0
 [<ffffffff81489ae6>] ? inet_csk_clone+0x16/0xd0
 [<ffffffff814a29f3>] ? tcp_create_openreq_child+0x23/0x450
 [<ffffffff814a01ed>] ? tcp_v4_syn_recv_sock+0x4d/0x310
 [<ffffffff814a2796>] ? tcp_check_req+0x226/0x460
 [<ffffffff814981b6>] ? tcp_rcv_state_process+0x126/0xa10
 [<ffffffff8149fc8b>] ? tcp_v4_do_rcv+0x35b/0x430
 [<ffffffff8105230c>] ? check_preempt_curr+0x7c/0x90
 [<ffffffff814a149e>] ? tcp_v4_rcv+0x4fe/0x8d0
 [<ffffffff8147f28d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff8147f518>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff8147e9dd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff8147ef65>] ? ip_rcv+0x275/0x350
 [<ffffffff8144819b>] ? __netif_receive_skb+0x4ab/0x750
 [<ffffffff8144a578>] ? netif_receive_skb+0x58/0x60
 [<ffffffff8144a680>] ? napi_skb_finish+0x50/0x70
 [<ffffffff8144cc29>] ? napi_gro_receive+0x39/0x50
 [<ffffffffa01d07cb>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
 [<ffffffffa01d33a0>] ? e1000_clean_rx_irq+0x380/0x580 [e1000e]
 [<ffffffffa01daffd>] ? e1000e_poll+0x8d/0x380 [e1000e]
 [<ffffffff8144cd43>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810e1640>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8107700f>] ? __do_softirq+0x11f/0x1e0
 [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
 [<ffffffff81076d95>] ? irq_exit+0x85/0x90
 [<ffffffff81516c95>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
 <EOI>

This happened about 3 times. No other errors, nothing pointing to a disk. Even then, a disk failure should not be crashing the system; that's the whole point of RAID.


This seems to be triggered by any kind of heavy I/O, such as rsync backups or torrents.
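(Following Cerb's earlier suggestion, a sketch of what I could watch while a backup runs, assuming the usual /proc paths; the failures above are order-1 atomic allocations from the network receive path, so raising the min_free_kbytes reserve is the knob that was suggested for exactly this:)
Code:
watch -n5 cat /proc/buddyinfo            # order-1+ columns near zero while rsync runs = fragmented
cat /proc/sys/vm/min_free_kbytes         # current reserve
sysctl -w vm.min_free_kbytes=131072      # temporarily raise the reserve and see if the failures stop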
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Whatever is the latest in CentOS. They're all different systems, so I can't give an exact version since they may differ, but the particular server that had the error last has the following:

Kernel: Linux isengard.loc 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Rsync: version 3.0.6 protocol version 30
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
Thanks for clarifying that, mv2devnull. This kernel isn't quite as old as I thought, but if I'm reading Red Hat/CentOS correctly, it stopped getting updates in March 2015.

Also, rsync is presently on version 3.1.1 (which is up-to-date in Debian-based distros). 3.0.6 is over 6 years old.

I know it's the obvious and not insightful answer, but I would try to get your kernel and rsync up-to-date before I spent too much more time troubleshooting.
 

mv2devnull

Golden Member
Apr 13, 2010
1,526
160
106
The RH version numbers are not directly comparable to upstream, because RH backports fixes into their packages. That way a major release of RHEL (and thus CentOS too) has a stable feature set for its entire lifetime of several years. (A bit similar to LTS releases in Ubuntu.)

Nevertheless, 2.6.32-573.* is the kernel of CentOS 6 now.
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Hmm is "yum update" not sufficient to update properly? What other command should I try? I do that rather regularly on all my systems. This issue has been going on for YEARS though, so unless it's something that got fixed recently I'm suspecting it's something I need to configure differently somewhere, like maybe some kind of time out setting of sorts.
 

mv2devnull

Golden Member
Apr 13, 2010
1,526
160
106
yum update should be sufficient. I guess that there is a trivial explanation.
What do these show?
Code:
cat /etc/{centos,redhat}-release
uptime
uname -rv
rpm -q kernel
rpm -q rsync
df -h /boot
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
Here's the output:


Code:
[root@isengard ~]# cat /etc/{centos,redhat}-release
CentOS release 6.7 (Final)
CentOS release 6.7 (Final)
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# uptime
 18:47:25 up 449 days, 18:42,  1 user,  load average: 1.82, 2.68, 2.45
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# uname -rv
2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# rpm -q kernel
kernel-2.6.32-358.el6.x86_64
kernel-2.6.32-504.3.3.el6.x86_64
kernel-2.6.32-504.23.4.el6.x86_64
kernel-2.6.32-504.30.3.el6.x86_64
kernel-2.6.32-573.3.1.el6.x86_64
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# rpm -q rsync
rsync-3.0.6-12.el6.x86_64
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# 
[root@isengard ~]# df -h /boot
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdn1       485M  157M  303M  35% /boot
[root@isengard ~]#


I see that kernel 573 is there, but I guess it's not the one that's actually running?
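(Checking which kernel is actually in play, a sketch assuming the standard CentOS 6 GRUB setup; grubby ships on EL6:)
Code:
uname -r                                          # kernel currently running (still -358 here)
grubby --default-kernel                           # kernel that will be booted next time
grep -E '^(default|title)' /boot/grub/grub.conf   # or read the GRUB menu directly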
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
I can't reboot that machine; it's the file server, which also acts as storage for VMs. Is there another way to load the kernel so it's seamless? How would an enterprise situation handle it? I guess they'd have multiple servers in a shared-storage/HA config? Something I do want to look into at some point, but there's no money for it now.
 

mv2devnull

Golden Member
Apr 13, 2010
1,526
160
106
HA is nice, but even the big HPC clusters do have scheduled maintenance breaks occasionally.

It is that simple; you tell the users that there will be a break and then you have the break.

If your service is so critical that no break is possible, then it should already have a proper HA.


PS. I would do
Code:
yum remove kernel-2.6.32-573.3.1.el6
yum update
before the reboot. Why? The latest kernel came from the 6.7 point update, and that same yum update pulled in many other packages, some of which the kernel installation uses. This extra step ensures the kernel install runs against the latest, already-installed packages.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
I can't reboot that machine; it's the file server, which also acts as storage for VMs. Is there another way to load the kernel so it's seamless? How would an enterprise situation handle it? I guess they'd have multiple servers in a shared-storage/HA config? Something I do want to look into at some point, but there's no money for it now.

The only thing Linux can't do live right now is upgrade the kernel. That may actually change in the not-too-distant future; it's a feature lots of people are interested in, but for now it's reboot or nothing.

To do kernel upgrades, even an enterprise user has to reboot; they would usually have some sort of fail-over or a scheduled maintenance window to bridge the gap.

You really can't reboot this machine? What is your plan if you have a power outage that outlasts your UPS? :eek:
 

Red Squirrel

No Lifer
May 24, 2003
70,180
13,576
126
www.anyf.ca
You really can't reboot this machine? What is your plan if you have a power outage that outlasts your UPS? :eek:

My UPS is good for about 4 hours. I need to add more batteries, though, as we've had a few outages that pushed dangerously close to that, and I've added more equipment since the last test run.

The downtime is not so much the problem; it's the risk that something does not come back up properly. But I guess I'll have to bite the bullet and find a good time to do it. I have a game server that runs off this environment as well, and it's already been through several maintenance windows, so I hate to do that to it again.

I also did a boo-boo on this server one time. I was doing mv commands to move a bunch of files into the same folder, basically "mv a/* ." repeated for b, c, d, etc. I was typing fast, made a typo, and did "mv / .". It started moving system folders like /usr and /etc. I did move them back, but who knows if the attributes and everything are right; I did my best to compare against another server, but there may be special attributes I can't see. I'm scared the system might not boot back up. A lot of this system stuff needs to be in a specific location on the platters as well; I think /boot is probably the most important.
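(One hedged way to sanity-check the damage from that mv, assuming the affected paths are RPM-managed: rpm can verify ownership, permissions, and checksums against its database, and restorecon can reset SELinux contexts if SELinux is enabled on the box:)
Code:
rpm -Va > /root/rpm-verify.txt        # flag files whose mode/owner/group/checksum differ from the RPM db
less /root/rpm-verify.txt             # look for M (mode), U (user), or G (group) flags on system paths
restorecon -Rv /etc /usr /bin /sbin   # reset SELinux contexts, if SELinux is in use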