
Extremely high server load, nothing in top

Red Squirrel

No Lifer
My server load is very high, yet I can't figure out what is hogging the CPU that badly, since nothing shows up in top. Anywhere else I should look?

I started a VM in VMware Server and stopped it; that's when I noticed this load. I stopped the VMware service and it's still very high.

Here is the output of top:


top - 19:27:03 up 21:58, 2 users, load average: 1.00, 1.03, 1.79
Tasks: 174 total, 1 running, 173 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 75.3%id, 24.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 7927988k total, 7677104k used, 250884k free, 17152k buffers
Swap: 15358132k total, 76k used, 15358056k free, 7199404k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9753 root 20 0 14840 1188 852 R 0.3 0.0 0:00.68 top
1 root 20 0 4052 904 648 S 0.0 0.0 0:00.56 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 15 -5 0 0 0 S 0.0 0.0 0:01.38 ksoftirqd/0
5 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
6 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/1
7 root 15 -5 0 0 0 S 0.0 0.0 0:02.33 ksoftirqd/1
8 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
9 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/2
10 root 15 -5 0 0 0 S 0.0 0.0 1:55.90 ksoftirqd/2
11 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
12 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/3
13 root 15 -5 0 0 0 S 0.0 0.0 0:10.44 ksoftirqd/3
14 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
15 root 15 -5 0 0 0 S 0.0 0.0 0:00.68 events/0
16 root 15 -5 0 0 0 S 0.0 0.0 0:00.40 events/1
17 root 15 -5 0 0 0 S 0.0 0.0 0:00.59 events/2
18 root 15 -5 0 0 0 S 0.0 0.0 0:01.51 events/3
19 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
76 root 15 -5 0 0 0 S 0.0 0.0 0:05.72 kblockd/0
77 root 15 -5 0 0 0 S 0.0 0.0 0:10.14 kblockd/1
78 root 15 -5 0 0 0 S 0.0 0.0 0:01.34 kblockd/2
79 root 15 -5 0 0 0 S 0.0 0.0 0:05.12 kblockd/3
81 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kacpid
82 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kacpi_notify
158 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue
160 root 15 -5 0 0 0 S 0.0 0.0 0:00.10 ksuspend_usbd
165 root 15 -5 0 0 0 S 0.0 0.0 0:00.12 khubd
168 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
238 root 15 -5 0 0 0 S 0.0 0.0 9:42.21 kswapd0
284 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0
285 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 aio/1
286 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 aio/2
287 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 aio/3
465 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
521 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 ata/0
522 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 ata/1
523 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 ata/2
524 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 ata/3
525 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 ata_aux
531 root 15 -5 0 0 0 S 0.0 0.0 0:00.88 scsi_eh_0
532 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_1
533 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_2
534 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_3
535 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_4
536 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_5
551 root 15 -5 0 0 0 S 0.0 0.0 0:10.51 kjournald
601 root 16 -4 12880 972 400 S 0.0 0.0 0:00.56 udevd
1254 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_7
1256 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_8
1405 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kauditd
1496 root 15 -5 0 0 0 S 0.0 0.0 15:54.74 md0_raid5
1549 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kstriped
1563 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kmpathd/0

 
What would cause it to be that high, though? Is there a way I can find which program is doing this? It's not normal for load to be this high when the system is not really doing anything. What's odd is that normally top would show which program is using the most resources.
 
Originally posted by: RedSquirrel
What would cause it to be that high, though? Is there a way I can find which program is doing this? It's not normal for load to be this high when the system is not really doing anything. What's odd is that normally top would show which program is using the most resources.

Did you even read the link I posted? It explains what wait is, and why you won't see the process in top.
 
I know what it is and did read it, but I'm asking how I can find out WHAT program is eating up resources. There's got to be something making it that high. On any other server that's idle, the wait is 0 or near 0. For example:

top - 20:21:20 up 1 day, 2:05, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 111 total, 2 running, 109 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1% us, 0.1% sy, 0.0% ni, 99.5% id, 0.3% wa, 0.0% hi, 0.0% si
Mem: 3636552k total, 3529048k used, 107504k free, 389624k buffers
Swap: 2096472k total, 0k used, 2096472k free, 2879868k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 16 0 1988 668 572 S 0.0 0.0 0:00.56 init
2 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
4 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 events/0
5 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
6 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kthread
8 root 10 -5 0 0 0 S 0.0 0.0 0:00.03 kblockd/0
9 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 kacpid
104 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
159 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pdflush
160 root 15 0 0 0 0 S 0.0 0.0 0:00.06 pdflush
162 root 18 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0
161 root 15 0 0 0 0 S 0.0 0.0 0:00.23 kswapd0
249 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
331 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
342 root 15 0 0 0 0 S 0.0 0.0 0:00.06 kjournald
382 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 kauditd
[root@oldborg ~]#
 
Run 'ps ax' and look for the processes in the 'D' (uninterruptible sleep) state. See the ps man page about process state codes. Whichever processes they are, they're not making the 'top' list, though I think you can sort top by status. See '?' for help.
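
For example, something along these lines should list just the D-state processes (a rough sketch off the top of my head; pid, stat, wchan and comm are standard ps output fields):

ps -eo pid,stat,wchan:20,comm | awk '$2 ~ /^D/'
# a STAT beginning with D = uninterruptible sleep, usually a process stuck waiting on disk I/O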
 
I wouldn't call it extremely high load judging from that snapshot, but what does iostat -x 5 show when your load and %wa are high?

How many VMs are you running on this machine? What type of machine is running VMware Server?
 
iostat does not work - invalid command. Is there a package I can download to get that command? I'm using Fedora 9. The load seems to have stabilized now, even with 4 VMs, so that's good.
 
iostat is part of sysstat; yum install sysstat.

Four VMs running off how many spindles? I assume one, and starting all four VMs at once will cause queueing.
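
For reference, on Fedora that is roughly the following (assuming the standard repos):

yum install sysstat
iostat -x 5 12    # extended stats every 5 seconds, 12 reports; the first report is the average since boot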
 
How does this look?


Linux 2.6.26.3-29.fc9.x86_64 (borg.loc) 09/26/2008

avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 1.73 2.20 0.00 95.57

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 2.09 7.77 5.05 23.37 843.23 249.88 38.46 0.13 4.71 0.61 1.74
sda1 0.08 0.00 0.01 0.00 0.20 0.00 31.00 0.00 7.58 6.64 0.00
sda2 0.02 0.00 0.00 0.00 0.16 0.01 52.44 0.00 14.42 13.64 0.00
sda3 1.99 7.77 5.04 23.37 842.81 249.86 38.46 0.13 4.71 0.61 1.73
sdb 4.89 189.44 7.37 13.12 264.95 1624.64 92.22 0.19 9.39 2.80 5.73
sdb1 4.89 189.44 7.37 13.12 264.90 1624.64 92.23 0.19 9.39 2.80 5.73
sdc 4.97 191.75 7.66 10.13 268.08 1619.25 106.12 0.50 28.29 4.20 7.47
sdc1 4.96 191.75 7.66 10.13 268.03 1619.25 106.13 0.50 28.29 4.20 7.47
sdd 4.99 190.15 7.65 11.60 268.16 1618.15 97.99 0.24 12.31 2.56 4.93
md0 0.00 0.00 19.57 396.81 657.42 3174.51 9.20 0.00 0.00 0.00 0.00


 
Is this the first summary iteration or a snapshot of one of the intervals?

You said this is RAID 5, 2+1; are you doing this with software RAID or on a controller? Is md0 the volume of sdb/c/d?

sdc has a relatively high access time compared to the other disks. There's not much you can do with this alone, but it would be better to run it for some time when the load is high. You'll see which disk has a high await, which in turn means a long iowait.
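
For example, something like this while the load is spiking, limited to the array members (device-list syntax as per the iostat man page):

iostat -dx sdb sdc sdd 5
# watch the await column; whichever disk sits well above its peers is the one the array keeps waiting on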

 
Yeah, b/c/d is RAID 5 (md RAID), while a is a single standalone drive for the OS. Odd that one disk would have a higher load, though it may be that the disk itself is slower or something. I have two Seagates and one Hitachi in there. Can't recall which is which, though; I think d is one of the Seagates, if I recall.


I'll have to recheck if I catch the load skyrocketing again. But so far it has not done it. The only time it did was when I powered on the Vista VM, but I pretty much expected that.
 
OK, caught it at high load again. The load totally skyrockets when I run a custom app (a server application that listens for data - no connections at that point). This app did not even touch the load on my old server, so this is not right at all.


-bash-3.2$ iostat -x
Linux 2.6.26.3-29.fc9.x86_64 (borg.loc) 09/27/2008

avg-cpu: %user %nice %system %iowait %steal %idle
0.94 0.00 3.09 1.22 0.00 94.74

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.39 7.49 1.15 8.84 156.22 131.58 28.79 0.04 4.26 0.51 0.51
sda1 0.01 0.00 0.00 0.00 0.03 0.00 28.43 0.00 7.59 6.73 0.00
sda2 0.00 0.00 0.00 0.00 0.03 0.00 52.44 0.00 14.42 13.64 0.00
sda3 0.37 7.49 1.15 8.84 156.15 131.57 28.79 0.04 4.25 0.51 0.51
sdb 3.90 66.47 6.46 9.23 236.02 610.46 53.94 0.16 10.40 2.88 4.52
sdb1 3.90 66.47 6.46 9.23 236.01 610.46 53.94 0.16 10.40 2.88 4.52
sdc 3.79 67.41 6.69 7.77 237.11 606.34 58.32 0.51 35.52 3.72 5.38
sdc1 3.79 67.41 6.69 7.77 237.10 606.34 58.32 0.51 35.53 3.72 5.38
sdd 3.96 66.58 6.57 8.19 237.31 602.96 56.94 0.24 16.28 2.51 3.71
md0 0.00 0.00 15.91 143.04 586.69 1144.34 10.89 0.00 0.00 0.00 0.00

-bash-3.2$ uptime
12:10:30 up 17:57, 4 users, load average: 6.87, 3.47, 2.21
-bash-3.2$



(can't wait for vB, we'll actually get code tags and stuff)


My old server was a single-core AMD64. I ran all the VMs I'm running now and did pretty much the same thing. This is a Core 2 Quad. It seems just about anything is making the load super high. I figured Intel was ahead of AMD, which is why I went that way, but so far my AMD64 is outperforming this server.
 
Wow, this is really not good... a backup job kicked in and the server is not even usable right now. Load is 8, but everything froze now so it's probably higher. Would I be better off with RAID 10? The only issue is the OS drive is non-RAID and is taking up a slot, so I can't put in 4 drives unless I sacrifice the other 4 slots, which I want to use for a future RAID 10. But if software RAID bogs down that much, I won't really want another RAID anyway... I had software RAID 1 on my old server and it did not bog down at all.
 
Reboot wins again! Seems OK after a reboot. Thought that was a Windows thing, so it never even occurred to me to try it with Linux.

Though I have not tried it during that backup job yet... so it may still not be a solved problem. Time will tell, I guess. I rescheduled that backup for nighttime; not sure why I had it during the day.
 
Your RAID 5 array is not performing well. sdc is performing slower than the others.

sdb 3.90 66.47 6.46 9.23 236.02 610.46 53.94 0.16 10.40 2.88 4.52
sdb1 3.90 66.47 6.46 9.23 236.01 610.46 53.94 0.16 10.40 2.88 4.52
sdc 3.79 67.41 6.69 7.77 237.11 606.34 58.32 0.51 35.52 3.72 5.38 ***
sdc1 3.79 67.41 6.69 7.77 237.10 606.34 58.32 0.51 35.53 3.72 5.38 ***
sdd 3.96 66.58 6.57 8.19 237.31 602.96 56.94 0.24 16.28 2.51 3.71

I would create a mirror out of sdb and sdd.
 
You keep posting new items; you never explain the situation in its entirety.

High load, then RAID 5, then software RAID, then using different types of disks, then a backup running during the day. All of these extra issues contribute to load. Do you know what is going on with the computer at the time, or are you just looking at the load numbers?
 
I was just posting stuff as I noticed it. But none of those things should be causing the load to be this high; at least, they did not on the other server. Just wondering what you see in those readings that says sdc is doing badly? I just want to know more about how to interpret those.

I'll look at replacing sdc with another drive if that's the case. That's one of my newer drives too... Anything I should look for in smartctl that would indicate it may be failing? If I can warranty it, that would be the best bet; otherwise I'll have to order another 1TB drive and just use this particular drive for backups or whatnot.
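
(For what it's worth, a rough way to check, assuming smartmontools is installed; the attribute names are standard SMART fields and the device name is just the drive's current one:)

smartctl -a /dev/sdc | egrep -i 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC'
smartctl -t short /dev/sdc    # run a short self-test, then read the result with: smartctl -l selftest /dev/sdc

Non-zero reallocated or pending sectors, or a failed self-test, would suggest the drive really is on its way out.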

Guess I should really be using drives from the same manufacturer and same batch in a RAID for best performance. Think I'll eventually buy new drives and just cycle these ones out as backup drives. I got myself one of those drive docks, so I can slowly phase out my backup server and just use the drive dock.


edit: the problem *seems* gone though.

I got 200 local socket connections to my test app and it's writing about 20 MB per sec - 1 file per connection. Load is at 4.59, so still rather higher than the other server, but at least I'm not getting booted off SSH or having DNS timeouts like before my reboot.
 
All of those things can contribute to load with a misbehaving RAID set. Your old system is a mirror; this is a RAID 5 - a completely different system. All the issues you're experiencing are related to I/O, and the I/O is different between systems.

The await column is 35 ms, which is 2x sdd and 3x slower than sdb, while the amount of data and writes is the same. The man page for iostat explains it all.

I don't feel that sdc is going bad; it just doesn't play nice with the RAID group. Software RAID 5 shouldn't suffer like this on a C2D system, but that clearly isn't the case here.

I would just make a mirror from sdb and sdd. I assume these are the two Seagates.
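
(Roughly something like the following, if you did go that route - a sketch only, the member names are illustrative, and it would wipe whatever is on them:)

mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdd1    # build a 2-disk RAID 1 from those two members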
 
I don't want a mirror, though; then I'm stuck with a non-RAID drive. The whole goal behind this upgrade was to have redundancy. The only non-RAID drive is the OS drive, as it would be hard to do software RAID with the OS drive, so I did not want to even go there. It's a smaller drive.

That backup kicked in last night, and it is STILL going. I had to stop it since it's not even moving; I'd be surprised if even 1 file got copied. With a load of 6, it pretty much means nothing is actually getting done, as it's so backlogged.

If I buy a new drive, would it fix this issue? If yes, I'll order one now.


This is my current config:


/dev/sdb: SAMSUNG HD103UJ
/dev/sdc: Seagate ST31000340AS
/dev/sdd: Hitachi HDS721010KLA330


If I buy another, should I avoid Seagate, or is this drive just a bad one? The Samsung seems to be the one performing the best, and it's also 100 bucks more than the Seagate. The only difference I see is the Samsung has a seek time of 8.9 ms while the Seagate has 9.3 ms, but that's not enough of a difference to cause issues, is it?

Got the info from here:
http://www.tigerdirect.ca/appl...&body=MAIN#detailspecs
http://www.tigerdirect.ca/appl...&body=MAIN#detailspecs

 
Yeah, I think the drive is failing... check this out:


Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.56 19.25 1.73 234.04 245.36 2027.74 9.64 0.47 2.00 0.38 9.05
sda1 0.01 0.00 0.00 0.00 0.02 0.00 27.95 0.00 7.53 6.80 0.00
sda2 0.00 0.00 0.00 0.00 0.02 0.00 51.08 0.00 16.18 15.26 0.00
sda3 0.55 19.25 1.73 234.04 245.31 2027.74 9.64 0.47 2.00 0.38 9.05
sdb 11.63 1313.87 19.30 86.49 388.59 11209.15 109.64 1.19 11.21 1.58 16.72
sdb1 11.63 1313.87 19.30 86.49 388.58 11209.15 109.64 1.19 11.21 1.58 16.72
sdd 11.93 1318.71 19.45 80.49 392.25 11200.03 115.99 1.64 16.35 1.72 17.15
md0 0.00 0.00 15.09 2772.06 540.72 22176.46 8.15 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.01 0.02 32.27 0.05 40388.90 2776.72 0.25
sde1 0.00 0.00 0.00 0.00 0.01 0.02 37.13 0.05 53046.28 3645.34 0.25


It's sde now since I pulled it out to make the RAID mark it failed, then put it in another slot to rebuild the array, and yeah, I don't think it's going to rebuild.


Also what does this indicate?




[root@borg sysconfig]# uptime
20:48:00 up 1 day, 3:19, 6 users, load average: 4.61, 2.05, 0.80
[root@borg sysconfig]# mdadm --misc --detail /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Sat Sep 20 02:15:28 2008
Raid Level : raid5
Array Size : 1953519872 (1863.02 GiB 2000.40 GB)
Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
Raid Devices : 3
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Sep 28 20:51:47 2008
State : clean, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 2
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

UUID : 11f961e7:0e37ba39:2c8a1552:76dd72ee
Events : 0.70712

Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 0 0 1 removed
2 8 48 2 active sync /dev/sdd

3 8 65 - faulty spare /dev/sde1
4 8 33 - faulty spare
[root@borg sysconfig]#
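
(For reference, if I'm reading that right, the two "faulty spare" lines are members md has marked failed; State is "clean, degraded" with only sdb1 and sdd still active. Assuming the disk itself checks out, the usual cleanup-and-retry sequence is roughly this sketch, using the device name from the output above:)

mdadm /dev/md0 --remove /dev/sde1      # drop the failed member from the array
mdadm --zero-superblock /dev/sde1      # wipe its stale md superblock
mdadm /dev/md0 --add /dev/sde1         # re-add it and let the resync run
cat /proc/mdstat                       # watch the rebuild progress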

 
This is getting more messed up by the day.

Now it's a different drive that's dragging along.


Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.26 5.14 0.58 4.10 30.73 74.69 22.53 0.01 2.47 0.62 0.29
sda1 0.04 0.00 0.00 0.00 0.11 0.00 33.58 0.00 8.64 7.32 0.00
sda2 0.01 0.00 0.00 0.00 0.08 0.00 50.82 0.00 11.06 9.09 0.00
sda3 0.20 5.14 0.58 4.10 30.51 74.69 22.51 0.01 2.46 0.62 0.29
sdb 10902.93 21.72 735.99 6.59 93164.06 230.36 125.77 1.24 1.67 0.44 32.31
sdb1 10902.93 21.72 735.99 6.59 93164.03 230.36 125.77 1.24 1.67 0.44 32.31
sdc 0.35 11106.56 3.79 537.42 33.14 93159.67 172.19 1.86 3.44 0.99 53.62
sdd 10993.58 21.53 646.08 5.90 93170.03 223.48 143.25 6.93 10.63 1.14 74.61
md0 0.00 0.00 15.40 49.72 280.35 397.79 10.41 0.00 0.00 0.00 0.00



Last night I decided to try an experiment. I pulled sdc out (the drive that was previously slow), put it in another slot, and tried to rebuild the RAID. It ended up not being called sdc anymore since the old sdc was still there. Had some trouble but eventually got it to rebuild. I rebooted the server, but it never caught on, as the drive letters probably reset, so md did not find the drive. So I made it rebuild again, and I put it back in the same slot as before. At this point I was scared to lose my entire RAID array, so I just said screw it.

Now it's sdd that's slow. sdd is the Hitachi one. So it's not the drive, and not the backplane. For some weird reason also, the RAID partition would not mount, so I had to set the drive as raw. Can I just do that with all the drives, or should I have a RAID partition set up first?

Could this slowdown be caused by the drives just not syncing up properly? Would getting all same-brand drives fix the issue?
 