Oddly, the issue seems to be gone; maybe it just took a while to clear after I set UseDNS no in the SSH config. For some reason DNS on that particular VM is slow: nslookup takes a long time (though not AS long). SSHing from that server to other servers is still bloody slow too, but SSHing to that server is now fast, where it was slow before I changed that setting. Maybe I need to turn that setting off on all the servers.
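For reference, this is the server-side setting I changed (sshd needs a restart or reload to pick it up):

```
# /etc/ssh/sshd_config
# Skip the reverse-DNS lookup of connecting clients, which is what stalls
# logins when DNS is slow or broken.
UseDNS no
```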
I went ahead and ordered another 24-port gigabit switch off eBay and managed to find one for under $100. I've always wanted to split the storage and regular network traffic anyway, so I'll see if that helps the performance issues.
Are there any commands I can use to monitor Linux network usage? I could maybe add monitoring points for that. I may as well do await too; what's considered a bad value?
I wrote my own monitoring software because Nagios and friends are too complex to set up (too many dependencies), and I wanted something centrally managed where all the config files live on one server and get pushed to the agents. I just have to install the agent on every server and the monitoring server sends out the monitor points. Because there are no dependencies, the software will run on practically any OS/setup with minimal effort. I've neglected it for years now and need to improve a lot of things in it; I could add graphing capabilities, which might make it easier to monitor this storage subsystem issue. Are there any specific alarm points I could add, like await? Anything else? And how do I measure network throughput? I can't seem to find an answer on Google that doesn't involve some crazy hacks. There's got to be some kind of value in /proc or something I can just grab.
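There does seem to be a plain counter file for this: /proc/net/dev keeps cumulative per-interface RX/TX byte counters, so sampling it twice and diffing gives bytes per second with no extra tools. A minimal sketch of what the agent could run (the interface name is an assumption; "lo" is only a safe default here, pass the real NIC, e.g. eth0, as the first argument):

```shell
#!/bin/sh
# Sample /proc/net/dev twice, one second apart, and print RX/TX bytes/sec
# for one interface. Cumulative counters only ever grow, so the difference
# over the interval is the throughput.
IFACE="${1:-lo}"

read_bytes() {
    # After stripping the "iface:" prefix, field 1 is rx_bytes and
    # field 9 is tx_bytes in /proc/net/dev's column layout.
    awk -v ifc="$IFACE" '$0 ~ "^ *" ifc ":" { sub(/^[^:]*:/, ""); print $1, $9 }' /proc/net/dev
}

set -- $(read_bytes); RX1=$1 TX1=$2
sleep 1
set -- $(read_bytes); RX2=$1 TX2=$2

echo "rx_Bps=$((RX2 - RX1)) tx_Bps=$((TX2 - TX1))"
```

The same counters are also exposed per-NIC as /sys/class/net/&lt;iface&gt;/statistics/rx_bytes and tx_bytes, which may be even easier to grab with cat.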
The disk I/O issue has been going on for as long as I can remember. It used to be worse, but every now and then it still hiccups, and it seems to be really bad during backup jobs, when everything just grinds to a halt.
Here's what things look like on the storage server:
Code:
iostat -xn 1 2
Linux 2.6.32-696.3.1.el6.x86_64 (isengard.loc) 02/04/18 _x86_64_ (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdv 0.21 0.91 0.56 0.51 39.04 10.65 46.11 0.00 1.36 0.37 2.45 1.02 0.11
sdd 1.17 352.71 10.07 331.92 5627.24 5202.22 31.67 0.29 0.85 17.48 0.34 0.80 27.30
sdb 0.37 349.86 9.03 334.76 5342.40 5202.22 30.67 0.21 0.61 13.57 0.26 0.72 24.59
sdg 0.22 345.35 9.44 329.12 5512.65 5120.90 31.41 0.11 0.32 12.28 0.94 0.66 22.41
sdf 0.36 347.70 11.02 328.59 5618.76 5135.48 31.67 0.17 0.51 13.52 0.08 0.72 24.45
sda 0.27 348.66 9.20 331.81 5447.67 5168.88 31.13 0.13 0.37 12.86 0.03 0.67 22.80
sdc 0.55 351.36 10.64 329.10 5434.32 5168.88 31.21 0.18 0.53 13.95 0.10 0.72 24.58
sdh 1.65 347.95 9.52 326.51 5316.25 5120.90 31.06 0.21 0.64 20.12 0.07 0.73 24.63
sde 0.83 346.25 8.33 330.04 5124.57 5135.48 30.32 0.21 0.62 21.64 0.09 0.68 23.11
sdj 20.94 15.47 8.15 4.43 664.27 153.17 64.99 0.16 12.59 15.34 7.55 3.77 4.74
sdp 0.00 0.00 0.00 0.00 0.00 0.00 8.07 0.00 0.52 0.52 0.00 0.52 0.00
sdo 20.22 15.01 5.04 3.88 633.54 144.61 87.24 0.09 9.73 8.28 11.60 2.65 2.36
sdk 20.41 15.02 4.92 3.78 634.16 143.84 89.45 0.10 11.01 9.94 12.39 2.67 2.33
sdm 20.58 15.14 4.63 3.98 633.08 146.70 90.56 0.13 14.75 13.40 16.31 3.69 3.18
sdl 20.40 15.06 5.16 3.79 635.99 144.43 87.17 0.11 12.04 9.61 15.34 3.22 2.88
sdn 20.46 15.05 4.77 3.72 633.12 143.78 91.50 0.11 13.47 11.31 16.25 3.29 2.79
sdr 1.83 46.11 2.48 16.33 879.14 452.77 70.78 0.13 7.11 5.20 7.41 2.19 4.13
sdq 0.94 45.97 1.08 16.04 749.50 449.28 70.01 0.13 7.50 9.01 7.40 1.93 3.31
sds 20.51 15.42 5.49 3.34 639.63 144.41 88.82 0.16 18.39 10.85 30.80 2.61 2.30
sdt 0.52 46.25 0.95 16.19 710.44 452.77 67.84 0.13 7.85 10.35 7.71 2.28 3.92
sdu 1.33 45.84 2.49 16.17 810.38 449.28 67.50 0.14 7.60 5.36 7.94 1.96 3.66
dm-0 0.00 0.00 0.77 1.33 39.02 10.65 23.60 0.03 13.49 0.44 21.08 0.52 0.11
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.75 0.36 1.84 0.19 0.00
md3 0.00 0.00 64.61 2611.29 41111.85 20623.90 23.07 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 50.51 118.80 3855.28 941.05 28.33 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 11.61 117.48 3149.16 900.52 31.37 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.01 0.00 8.00 0.00 0.23 0.17 0.83 0.18 0.00
sdi 0.44 20.47 0.88 0.95 92.26 169.99 143.42 0.02 9.23 4.23 13.89 2.71 0.49
Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdv 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdn 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdr 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sds 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdt 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdu 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Not sure why everything is zeros in the second report.
Here's one while backups are happening: (just fired off a bunch of jobs manually, both local and in VMs)
Code:
iostat -xn 1 2
Linux 2.6.32-696.3.1.el6.x86_64 (isengard.loc) 02/04/18 _x86_64_ (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdv 0.21 0.91 0.56 0.51 39.14 10.65 46.17 0.00 1.36 0.37 2.45 1.02 0.11
sdd 1.17 352.72 10.07 331.93 5627.08 5202.36 31.67 0.29 0.85 17.48 0.34 0.80 27.31
sdb 0.37 349.87 9.03 334.77 5342.20 5202.36 30.67 0.21 0.61 13.57 0.26 0.72 24.59
sdg 0.22 345.36 9.44 329.12 5512.44 5121.04 31.41 0.11 0.32 12.28 0.94 0.66 22.41
sdf 0.36 347.71 11.02 328.60 5618.61 5135.62 31.67 0.17 0.51 13.52 0.08 0.72 24.45
sda 0.27 348.67 9.20 331.82 5447.47 5169.02 31.13 0.13 0.37 12.86 0.03 0.67 22.80
sdc 0.55 351.37 10.64 329.11 5434.17 5169.02 31.21 0.18 0.53 13.95 0.10 0.72 24.58
sdh 1.65 347.96 9.52 326.52 5316.10 5121.04 31.06 0.21 0.64 20.12 0.07 0.73 24.63
sde 0.83 346.25 8.33 330.05 5124.38 5135.62 30.32 0.21 0.62 21.64 0.09 0.68 23.11
sdj 20.94 15.47 8.15 4.43 664.26 153.16 64.98 0.16 12.59 15.33 7.55 3.77 4.74
sdp 0.00 0.00 0.00 0.00 0.00 0.00 8.07 0.00 0.52 0.52 0.00 0.52 0.00
sdo 20.22 15.01 5.04 3.88 633.52 144.60 87.24 0.09 9.73 8.28 11.60 2.65 2.36
sdk 20.40 15.02 4.92 3.78 634.15 143.83 89.44 0.10 11.01 9.94 12.39 2.67 2.33
sdm 20.58 15.14 4.63 3.98 633.07 146.69 90.55 0.13 14.75 13.40 16.31 3.69 3.18
sdl 20.40 15.06 5.16 3.79 635.97 144.43 87.16 0.11 12.04 9.61 15.34 3.22 2.88
sdn 20.46 15.05 4.77 3.72 633.11 143.78 91.49 0.11 13.47 11.31 16.25 3.29 2.79
sdr 1.83 46.12 2.49 16.34 880.63 452.88 70.84 0.13 7.11 5.21 7.40 2.19 4.13
sdq 0.94 45.98 1.09 16.04 750.90 449.39 70.07 0.13 7.50 9.02 7.40 1.93 3.31
sds 20.51 15.42 5.49 3.34 639.62 144.41 88.82 0.16 18.39 10.85 30.80 2.61 2.30
sdt 0.52 46.26 0.96 16.20 711.82 452.88 67.90 0.13 7.85 10.36 7.71 2.28 3.92
sdu 1.33 45.85 2.50 16.17 811.79 449.39 67.56 0.14 7.60 5.36 7.94 1.96 3.66
dm-0 0.00 0.00 0.78 1.33 39.13 10.65 23.64 0.03 13.48 0.44 21.08 0.52 0.11
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.75 0.36 1.84 0.19 0.00
md3 0.00 0.00 64.61 2611.36 41110.56 20624.45 23.07 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 50.51 118.80 3855.22 941.01 28.33 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 11.62 117.51 3154.83 900.74 31.41 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.01 0.00 8.00 0.00 0.23 0.17 0.83 0.18 0.00
sdi 0.44 20.47 0.88 0.95 92.27 169.99 143.41 0.02 9.23 4.23 13.89 2.71 0.50
Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdv 0.00 0.00 1.00 0.00 16.00 0.00 16.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 2.00 372.00 10.00 638.00 2680.00 7608.00 15.88 1.02 1.57 14.40 1.37 0.57 37.20
sdb 0.00 383.00 3.00 627.00 1168.00 7608.00 13.93 0.86 1.36 11.67 1.31 0.52 32.70
sdg 0.00 503.00 7.00 637.00 2856.00 8648.00 17.86 0.66 1.03 9.00 0.94 0.47 30.00
sdf 0.00 728.00 28.00 647.00 5232.00 10520.00 23.34 1.52 2.35 13.86 1.85 0.77 51.80
sda 0.00 259.00 17.00 465.00 3608.00 5016.00 17.89 0.75 1.54 6.41 1.36 0.64 30.80
sdc 0.00 319.00 17.00 405.00 2168.00 5000.00 16.99 0.77 1.80 7.47 1.56 0.94 39.50
sdh 0.00 577.00 21.00 563.00 3552.00 8648.00 20.89 1.00 1.72 11.52 1.35 0.85 49.40
sde 0.00 729.00 6.00 645.00 824.00 10520.00 17.43 1.18 1.81 15.83 1.68 0.68 44.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdn 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdr 0.00 1010.00 42.00 166.00 25792.00 9220.00 168.33 1.33 6.29 15.33 4.01 3.02 62.80
sdq 0.00 1058.00 36.00 118.00 24080.00 9220.00 216.23 1.95 12.66 24.75 8.97 4.31 66.40
sds 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdt 1.00 1081.00 50.00 97.00 31376.00 11268.00 290.10 3.50 24.68 32.12 20.85 6.05 89.00
sdu 0.00 1017.00 40.00 157.00 20760.00 9219.00 152.18 1.28 5.86 13.70 3.87 3.23 63.60
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 109.00 4099.00 21504.00 32320.00 12.79 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 173.00 2324.00 105192.00 18432.00 49.51 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
These stats are useless to me, though. Every time I run the command the numbers are completely different, which is what makes this so hard to troubleshoot; I really don't know what to look for.
One of the backup jobs is also hung on a 0-byte file. That seems to happen a lot with rsync: it hangs on random files, often small ones, for a good 5-10 minutes before it continues. When it goes, it goes fast. This job runs locally on the storage server.
ESX info:
Storage server specs:
Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz (4 cores / 8 threads)
8 GB ECC RAM (would more RAM help?)
3x LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 controllers, flashed with IT firmware so they act as plain HBAs
3x mdadm RAID arrays. The two main ones are RAID 10; the other is RAID 5, but no VMs run off that one, it's mostly for backups.
Are there any commands/values I could add as alarm points to monitor? My monitoring software will accept any value generated by a command (e.g. parsed out with sed and friends).
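For the await alarm point, something like this could be the command the agent runs. Assumptions to verify: await is column 10 of iostat -dx output on this sysstat build (check against your own header line), the first since-boot report should be skipped, and the worst_await helper name plus the rough 20-30 ms rule-of-thumb threshold for spinning disks are mine, not gospel:

```shell
#!/bin/sh
# Candidate alarm-point command: report the worst per-device await (ms) from
# a 1-second "iostat -dx 1 2" sample. Only the second report (the live
# 1-second interval) is parsed; the first is since-boot averages.
worst_await() {
    awk '/^Device/ { n++ }
         n == 2 && $1 ~ /^(sd|dm-|md)/ { v = $10 + 0; if (v > max) max = v }
         END { printf "%.2f\n", max + 0 }'
}

# Real usage: iostat -dx 1 2 | worst_await
# Demo on a captured sample so the sketch runs without sysstat installed:
sample='Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.27 348.67 9.20 331.82 5447.47 5169.02 31.13 0.13 0.37 12.86 0.03 0.67 22.80
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdt 1.00 1081.00 50.00 97.00 31376.00 11268.00 290.10 3.50 24.68 32.12 20.85 6.05 89.00'
result=$(printf '%s\n' "$sample" | worst_await)
echo "$result"
```

The same pattern works for %util (last column) or avgqu-sz, which would also be reasonable alarm points for this kind of backup-window stall.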