Oddly, the issue seems to be gone; maybe it just took a while to clear after I set UseDNS no in the SSH config. For some reason DNS on that particular VM is slow: nslookup takes a long time (though not AS long). SSHing from that server to other servers is still bloody slow too, but SSHing to that server is now fast, where it was slow before I changed that setting. Maybe I need to turn that setting off on all the servers.
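For reference, this is the server-side setting I changed (sshd needs a restart or reload to pick it up):

```
# /etc/ssh/sshd_config
# Skip the reverse-DNS lookup of connecting clients, which is what stalls
# logins when DNS is slow or broken.
UseDNS no
```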
I went ahead and ordered another 24-port gigabit switch off eBay and managed to find one for under $100. I've always wanted to split the storage and regular network traffic anyway, so I'll see if that helps the performance issues.
Are there any commands I can use to monitor Linux network usage? I could maybe add monitoring points for that. I may as well do await too; what's considered a bad value?
I wrote my own monitoring software because Nagios and friends are too complex to set up (too many dependencies), and I wanted something centrally managed where all the config files live on one server and get pushed to the agents. I just have to install the agent on every server and the monitoring server sends out the monitor points. Because there are no dependencies, the software will run on practically any OS/setup with minimal effort. I've neglected it for years now and need to improve a lot of things in it; I could add graphing capabilities, which might make it easier to monitor this storage subsystem issue. Are there any specific alarm points I could add, like await? Anything else? And how do I measure network throughput? I can't seem to find an answer on Google that doesn't involve some crazy hacks. There's got to be some kind of value in /proc or something I can just grab.
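There does seem to be a plain counter file for this: /proc/net/dev keeps cumulative per-interface RX/TX byte counters, so sampling it twice and diffing gives bytes per second with no extra tools. A minimal sketch of what the agent could run (the interface name is an assumption; "lo" is only a safe default here, pass the real NIC, e.g. eth0, as the first argument):

```shell
#!/bin/sh
# Sample /proc/net/dev twice, one second apart, and print RX/TX bytes/sec
# for one interface. Cumulative counters only ever grow, so the difference
# over the interval is the throughput.
IFACE="${1:-lo}"

read_bytes() {
    # After stripping the "iface:" prefix, field 1 is rx_bytes and
    # field 9 is tx_bytes in /proc/net/dev's column layout.
    awk -v ifc="$IFACE" '$0 ~ "^ *" ifc ":" { sub(/^[^:]*:/, ""); print $1, $9 }' /proc/net/dev
}

set -- $(read_bytes); RX1=$1 TX1=$2
sleep 1
set -- $(read_bytes); RX2=$1 TX2=$2

echo "rx_Bps=$((RX2 - RX1)) tx_Bps=$((TX2 - TX1))"
```

The same counters are also exposed per-NIC as /sys/class/net/&lt;iface&gt;/statistics/rx_bytes and tx_bytes, which may be even easier to grab with cat.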
The disk I/O issue has been going on for as long as I can remember. It used to be worse, but every now and then it still hiccups, and it seems to be really bad during backup jobs, when everything just grinds to a halt.
Here's what things look like on the storage server:
Code:
iostat -xn 1 2
Linux 2.6.32-696.3.1.el6.x86_64 (isengard.loc) 02/04/18 _x86_64_ (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdv 0.21 0.91 0.56 0.51 39.04 10.65 46.11 0.00 1.36 0.37 2.45 1.02 0.11
sdd 1.17 352.71 10.07 331.92 5627.24 5202.22 31.67 0.29 0.85 17.48 0.34 0.80 27.30
sdb 0.37 349.86 9.03 334.76 5342.40 5202.22 30.67 0.21 0.61 13.57 0.26 0.72 24.59
sdg 0.22 345.35 9.44 329.12 5512.65 5120.90 31.41 0.11 0.32 12.28 0.94 0.66 22.41
sdf 0.36 347.70 11.02 328.59 5618.76 5135.48 31.67 0.17 0.51 13.52 0.08 0.72 24.45
sda 0.27 348.66 9.20 331.81 5447.67 5168.88 31.13 0.13 0.37 12.86 0.03 0.67 22.80
sdc 0.55 351.36 10.64 329.10 5434.32 5168.88 31.21 0.18 0.53 13.95 0.10 0.72 24.58
sdh 1.65 347.95 9.52 326.51 5316.25 5120.90 31.06 0.21 0.64 20.12 0.07 0.73 24.63
sde 0.83 346.25 8.33 330.04 5124.57 5135.48 30.32 0.21 0.62 21.64 0.09 0.68 23.11
sdj 20.94 15.47 8.15 4.43 664.27 153.17 64.99 0.16 12.59 15.34 7.55 3.77 4.74
sdp 0.00 0.00 0.00 0.00 0.00 0.00 8.07 0.00 0.52 0.52 0.00 0.52 0.00
sdo 20.22 15.01 5.04 3.88 633.54 144.61 87.24 0.09 9.73 8.28 11.60 2.65 2.36
sdk 20.41 15.02 4.92 3.78 634.16 143.84 89.45 0.10 11.01 9.94 12.39 2.67 2.33
sdm 20.58 15.14 4.63 3.98 633.08 146.70 90.56 0.13 14.75 13.40 16.31 3.69 3.18
sdl 20.40 15.06 5.16 3.79 635.99 144.43 87.17 0.11 12.04 9.61 15.34 3.22 2.88
sdn 20.46 15.05 4.77 3.72 633.12 143.78 91.50 0.11 13.47 11.31 16.25 3.29 2.79
sdr 1.83 46.11 2.48 16.33 879.14 452.77 70.78 0.13 7.11 5.20 7.41 2.19 4.13
sdq 0.94 45.97 1.08 16.04 749.50 449.28 70.01 0.13 7.50 9.01 7.40 1.93 3.31
sds 20.51 15.42 5.49 3.34 639.63 144.41 88.82 0.16 18.39 10.85 30.80 2.61 2.30
sdt 0.52 46.25 0.95 16.19 710.44 452.77 67.84 0.13 7.85 10.35 7.71 2.28 3.92
sdu 1.33 45.84 2.49 16.17 810.38 449.28 67.50 0.14 7.60 5.36 7.94 1.96 3.66
dm-0 0.00 0.00 0.77 1.33 39.02 10.65 23.60 0.03 13.49 0.44 21.08 0.52 0.11
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.75 0.36 1.84 0.19 0.00
md3 0.00 0.00 64.61 2611.29 41111.85 20623.90 23.07 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 50.51 118.80 3855.28 941.05 28.33 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 11.61 117.48 3149.16 900.52 31.37 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.01 0.00 8.00 0.00 0.23 0.17 0.83 0.18 0.00
sdi 0.44 20.47 0.88 0.95 92.26 169.99 143.42 0.02 9.23 4.23 13.89 2.71 0.49
Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdv 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdn 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdr 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdq 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sds 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdt 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdu 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Not sure why everything is zeros in the second report.
Here's one while backups are happening: (just fired off a bunch of jobs manually, both local and in VMs)
Code:
iostat -xn 1 2
Linux 2.6.32-696.3.1.el6.x86_64 (isengard.loc) 02/04/18 _x86_64_ (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdv 0.21 0.91 0.56 0.51 39.14 10.65 46.17 0.00 1.36 0.37 2.45 1.02 0.11
sdd 1.17 352.72 10.07 331.93 5627.08 5202.36 31.67 0.29 0.85 17.48 0.34 0.80 27.31
sdb 0.37 349.87 9.03 334.77 5342.20 5202.36 30.67 0.21 0.61 13.57 0.26 0.72 24.59
sdg 0.22 345.36 9.44 329.12 5512.44 5121.04 31.41 0.11 0.32 12.28 0.94 0.66 22.41
sdf 0.36 347.71 11.02 328.60 5618.61 5135.62 31.67 0.17 0.51 13.52 0.08 0.72 24.45
sda 0.27 348.67 9.20 331.82 5447.47 5169.02 31.13 0.13 0.37 12.86 0.03 0.67 22.80
sdc 0.55 351.37 10.64 329.11 5434.17 5169.02 31.21 0.18 0.53 13.95 0.10 0.72 24.58
sdh 1.65 347.96 9.52 326.52 5316.10 5121.04 31.06 0.21 0.64 20.12 0.07 0.73 24.63
sde 0.83 346.25 8.33 330.05 5124.38 5135.62 30.32 0.21 0.62 21.64 0.09 0.68 23.11
sdj 20.94 15.47 8.15 4.43 664.26 153.16 64.98 0.16 12.59 15.33 7.55 3.77 4.74
sdp 0.00 0.00 0.00 0.00 0.00 0.00 8.07 0.00 0.52 0.52 0.00 0.52 0.00
sdo 20.22 15.01 5.04 3.88 633.52 144.60 87.24 0.09 9.73 8.28 11.60 2.65 2.36
sdk 20.40 15.02 4.92 3.78 634.15 143.83 89.44 0.10 11.01 9.94 12.39 2.67 2.33
sdm 20.58 15.14 4.63 3.98 633.07 146.69 90.55 0.13 14.75 13.40 16.31 3.69 3.18
sdl 20.40 15.06 5.16 3.79 635.97 144.43 87.16 0.11 12.04 9.61 15.34 3.22 2.88
sdn 20.46 15.05 4.77 3.72 633.11 143.78 91.49 0.11 13.47 11.31 16.25 3.29 2.79
sdr 1.83 46.12 2.49 16.34 880.63 452.88 70.84 0.13 7.11 5.21 7.40 2.19 4.13
sdq 0.94 45.98 1.09 16.04 750.90 449.39 70.07 0.13 7.50 9.02 7.40 1.93 3.31
sds 20.51 15.42 5.49 3.34 639.62 144.41 88.82 0.16 18.39 10.85 30.80 2.61 2.30
sdt 0.52 46.26 0.96 16.20 711.82 452.88 67.90 0.13 7.85 10.36 7.71 2.28 3.92
sdu 1.33 45.85 2.50 16.17 811.79 449.39 67.56 0.14 7.60 5.36 7.94 1.96 3.66
dm-0 0.00 0.00 0.78 1.33 39.13 10.65 23.64 0.03 13.48 0.44 21.08 0.52 0.11
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.75 0.36 1.84 0.19 0.00
md3 0.00 0.00 64.61 2611.36 41110.56 20624.45 23.07 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 50.51 118.80 3855.22 941.01 28.33 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 11.62 117.51 3154.83 900.74 31.41 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.01 0.00 8.00 0.00 0.23 0.17 0.83 0.18 0.00
sdi 0.44 20.47 0.88 0.95 92.27 169.99 143.41 0.02 9.23 4.23 13.89 2.71 0.50
Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdv 0.00 0.00 1.00 0.00 16.00 0.00 16.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 2.00 372.00 10.00 638.00 2680.00 7608.00 15.88 1.02 1.57 14.40 1.37 0.57 37.20
sdb 0.00 383.00 3.00 627.00 1168.00 7608.00 13.93 0.86 1.36 11.67 1.31 0.52 32.70
sdg 0.00 503.00 7.00 637.00 2856.00 8648.00 17.86 0.66 1.03 9.00 0.94 0.47 30.00
sdf 0.00 728.00 28.00 647.00 5232.00 10520.00 23.34 1.52 2.35 13.86 1.85 0.77 51.80
sda 0.00 259.00 17.00 465.00 3608.00 5016.00 17.89 0.75 1.54 6.41 1.36 0.64 30.80
sdc 0.00 319.00 17.00 405.00 2168.00 5000.00 16.99 0.77 1.80 7.47 1.56 0.94 39.50
sdh 0.00 577.00 21.00 563.00 3552.00 8648.00 20.89 1.00 1.72 11.52 1.35 0.85 49.40
sde 0.00 729.00 6.00 645.00 824.00 10520.00 17.43 1.18 1.81 15.83 1.68 0.68 44.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdp 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdm 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdn 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdr 0.00 1010.00 42.00 166.00 25792.00 9220.00 168.33 1.33 6.29 15.33 4.01 3.02 62.80
sdq 0.00 1058.00 36.00 118.00 24080.00 9220.00 216.23 1.95 12.66 24.75 8.97 4.31 66.40
sds 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdt 1.00 1081.00 50.00 97.00 31376.00 11268.00 290.10 3.50 24.68 32.12 20.85 6.05 89.00
sdu 0.00 1017.00 40.00 157.00 20760.00 9219.00 152.18 1.28 5.86 13.70 3.87 3.23 63.60
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 109.00 4099.00 21504.00 32320.00 12.79 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 173.00 2324.00 105192.00 18432.00 49.51 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Filesystem: rBlk_nor/s wBlk_nor/s rBlk_dir/s wBlk_dir/s rBlk_svr/s wBlk_svr/s ops/s rops/s wops/s
These stats are useless to me, though. Every time I run the command the numbers are completely different, which is what makes this so hard to troubleshoot; I really don't know what to look for.
One of the backup jobs is also hung on a 0-byte file. That seems to happen a lot with rsync: it hangs on random files, often small ones, for a good 5-10 minutes before it continues. When it goes, it goes fast. This job runs locally on the storage server.
ESX info:
Storage server specs:
Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz (4 cores / 8 threads)
8 GB ECC RAM (would more RAM help?)
3x LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 controllers, flashed with IT firmware so they act as plain HBAs
3x mdadm RAID arrays. The two main ones are RAID 10; the other is RAID 5, but no VMs run off that one, it's mostly for backups.
Are there any commands/values I could add as alarm points to monitor? My monitoring software will accept any value generated by a command (e.g. parsed out with sed and friends).
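For the await alarm point, something like this could be the command the agent runs. Assumptions to verify: await is column 10 of iostat -dx output on this sysstat build (check against your own header line), the first since-boot report should be skipped, and the worst_await helper name plus the rough 20-30 ms rule-of-thumb threshold for spinning disks are mine, not gospel:

```shell
#!/bin/sh
# Candidate alarm-point command: report the worst per-device await (ms) from
# a 1-second "iostat -dx 1 2" sample. Only the second report (the live
# 1-second interval) is parsed; the first is since-boot averages.
worst_await() {
    awk '/^Device/ { n++ }
         n == 2 && $1 ~ /^(sd|dm-|md)/ { v = $10 + 0; if (v > max) max = v }
         END { printf "%.2f\n", max + 0 }'
}

# Real usage: iostat -dx 1 2 | worst_await
# Demo on a captured sample so the sketch runs without sysstat installed:
sample='Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.27 348.67 9.20 331.82 5447.47 5169.02 31.13 0.13 0.37 12.86 0.03 0.67 22.80
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdt 1.00 1081.00 50.00 97.00 31376.00 11268.00 290.10 3.50 24.68 32.12 20.85 6.05 89.00'
result=$(printf '%s\n' "$sample" | worst_await)
echo "$result"
```

The same pattern works for %util (last column) or avgqu-sz, which would also be reasonable alarm points for this kind of backup-window stall.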