I have one server where SSHing to it is bloody slow, how do I troubleshoot this?

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
Red Squirrel said:
What power management features would there be that I should disable? Though I don't really want to go into the BIOS, as that would require a reboot.

Here are a few screenshots of VMware:

[screenshots attached]

The 3rd screenshot is the VM that seems to have DNS resolution issues (lookups take a long time). The actual SSH login slowness seems to be fixed now by changing the DNS option in the sshd config. The spikes every 5 minutes are probably normal; the game server issues snapshots every 5 minutes, which basically involves copying a bunch of tables. I actually thought of a better way to do that the other day, so I need to play around with that in dev. I doubt that's the cause of the slowdown though, as the slowdown is consistent.
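(For reference, the sshd option being referred to is presumably UseDNS, which controls whether sshd does a reverse-DNS lookup on each incoming connection; a minimal sketch of that change, with the exact file path depending on the distro:)

Code:
# /etc/ssh/sshd_config  (path may vary by distro)
# Skip the reverse-DNS lookup that can stall logins when DNS is slow:
UseDNS no
# then reload sshd, e.g. "systemctl reload sshd" or "service sshd reload"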

Your write latency is definitely high in that first screenshot, especially LUN3. That's going to be noticeable, especially if you're running anything I/O-sensitive off that LUN. What's the read latency like? As far as power management goes, I've seen a lot of people reporting mdadm RAID performance issues because the processor drops down to a lower power state. I'd disable all the power management options first and see if that has an effect. If it does, then you can try to fine-tune which specific settings do the trick. If not, just put it back the way it was.
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
That's odd, since LUN3 is the one with the most drives, so I would figure the performance would be better. It does have the DB server with the slow DNS issues and the application server that talks to it; those are probably the most intensive VMs in terms of resource usage. Would it be worthwhile looking at enterprise drives? I always felt it wasn't worth paying double for what might be a slight boost in performance, but I'm wondering if maybe the drives themselves are the bottleneck, as they are consumer drives. I tend to go with the "Red" or "Black" variants though, and 7200 RPM. I avoid green or lower-RPM drives, especially the ones that go to sleep, as that breaks the RAID array. (I learned that the hard way when I set up a RAID array on the cheap where performance didn't matter.) That particular array has 8 WD Red drives in RAID 10.

I was not aware of Xeons having any kind of throttling. Is there a way I can modify that setting without rebooting? Like maybe something via software that tells it not to throttle? I don't want to reboot the storage server given that pretty much the whole network rides on it. I really need to look at a more redundant setup some day so I can take down an enclosure without affecting availability, but I can't quite afford that. My next big purchase is probably going to be power related, as I want to do a dual-conversion setup and also add more battery capacity.
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
Yes, your Xeon supports power management, and it's generally enabled by default (C-states, and sometimes SpeedStep too). Your CPU usage looked pretty low, so it's entirely possible it's dropping down to a lower C-state.
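(As an aside, you can at least see what the CPU is doing on the Linux side without a reboot. Below is a minimal sketch that reads the cpufreq governor and the cpuidle C-state residency from sysfs; it assumes the standard cpufreq/cpuidle drivers are loaded, and the paths can differ or be missing depending on kernel, idle driver, and BIOS. If deep C-states do turn out to be the issue, the governor files are writable as root, and holding /dev/cpu_dma_latency open with a value of 0 keeps the CPU out of deep idle states, but treat those as temporary experiments rather than a replacement for the BIOS settings.)

Code:
#!/usr/bin/env python3
"""Rough check of CPU power management via sysfs (no reboot needed).

Assumes the standard Linux cpufreq/cpuidle sysfs layout; entries may be
absent depending on the kernel, the idle driver, and BIOS settings.
"""
import glob
import os

# Frequency-scaling governor per core (e.g. "powersave" vs "performance")
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor")):
    cpu = path.split("/")[5]
    with open(path) as f:
        print(f"{cpu}: governor = {f.read().strip()}")

# Idle-state residency for cpu0: how much time is spent in each C-state
for state_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
    def read(name):
        with open(os.path.join(state_dir, name)) as f:
            return f.read().strip()
    print(f"{os.path.basename(state_dir)}: {read('name')} "
          f"usage={read('usage')} time_us={read('time')}")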

I don't think the drives are your problem, as that latency wasn't visible directly on the storage server. Depending on the size of your DB server and application server, it may be worth considering a small SSD array for them. Can you post screenshots of your vSwitch configuration? Also, what does your nfsstat output look like? I'm more used to iSCSI and FC, but we may be able to glean something from that.
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
This is what nfsstat looks like:

Code:
# nfsstat
Server rpc stats:
calls      badcalls   badclnt    badauth    xdrcall
4170489315   0          0          0          0       

Server nfs v3:
null         getattr      setattr      lookup       access       readlink     
4038      0% 7323668   0% 83415     0% 1170283   0% 3641702   0% 58        0% 
read         write        create       mkdir        symlink      mknod        
3577354773 86% 565702867 13% 328275    0% 1333      0% 0         0% 0         0% 
remove       rmdir        rename       link         readdir      readdirplus  
291693    0% 99        0% 5796      0% 0         0% 147       0% 110075    0% 
fsstat       fsinfo       pathconf     commit       
38223     0% 4127      0% 110       0% 261388    0% 

Server nfs v4:
null         compound     
2         0% 14236680 99% 

Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit       
0         0% 0         0% 0         0% 793154    2% 18996     0% 2688      0% 
create       delegpurge   delegreturn  getattr      getfh        link         
18        0% 0         0% 18250     0% 13301488 43% 715117    2% 0         0% 
lock         lockt        locku        lookup       lookup_root  nverify      
0         0% 0         0% 0         0% 697095    2% 0         0% 0         0% 
open         openattr     open_conf    open_dgrd    putfh        putpubfh     
19124     0% 0         0% 4         0% 0         0% 14192837 46% 0         0% 
putrootfh    read         readdir      readlink     remove       rename       
2         0% 886971    2% 762       0% 0         0% 0         0% 0         0% 
renew        restorefh    savefh       secinfo      setattr      setcltid     
43454     0% 0         0% 0         0% 0         0% 637       0% 3         0% 
setcltidconf verify       write        rellockowner bc_ctl       bind_conn    
3         0% 0         0% 67574     0% 0         0% 0         0% 0         0% 
exchange_id  create_ses   destroy_ses  free_stateid getdirdeleg  getdevinfo   
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
getdevlist   layoutcommit layoutget    layoutreturn secinfononam sequence     
0         0% 0         0% 0         0% 0         0% 0         0% 0         0% 
set_ssv      test_stateid want_deleg   destroy_clid reclaim_comp 
0         0% 0         0% 0         0% 0         0% 0         0%


The vSwitch config consists of two NICs: one is dedicated to management, and the other is a trunk port to the switch with all the VLANs assigned to it. This is the properties dialog of that switch:

 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
So effectively all your data and storage traffic is going over a single interface? That's an awful lot of traffic for one interface. How are backups being handled? I thought you'd mentioned it, but looking back through the thread I don't see it. Are you just backing up the NFS mount points directly, or are you doing VM-level backups? Where are the backups being stored and how do they get there? For example:

Say you have an agent/job/whatever on the slow server that backs up the entire file system to another box that is not a VM on the same host. That data has to be read from the NFS share over your lone NIC, then transmitted to your backup server over that same lone NIC. Now keep in mind that NIC is shared with 19 other VMs (per your earlier screenshot) and it's trying to share bandwidth evenly between them. I did see most of them are powered off, but that's still a lot of traffic for a single NIC.
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
Backups are done through rsync. The majority of them are done locally on the file server, e.g. take data from one LUN and put it on the backup one. I also have a manual backup system where I put a drive in a docking station and swap them out (essentially like tapes). I do have some backup jobs that happen over NFS too, though. They are staggered to try to avoid too many running at once.

I don't have any VM-level backups (that needs expensive software), but I do back up the actual data. I don't think I'm pegging the gig link though; most of the time rsync is just slow, period, so it's too slow to actually move enough data to be an issue. It will hold up on random files, etc. I gave up on that issue as it's super sporadic, and since the jobs just run overnight I don't really care all that much how long they take. I may write my own software at some point and see if it works better.
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
I don't really know how to measure network usage in Linux, but I have done some tests with iperf while it's slow, and I can get like 100 MB/sec or even higher.

It's kinda hard to peg a gig link in a home environment. When it's slow, it's not the network that's the bottleneck; I don't know what it is. It's like it randomly hiccups for no reason at times and everything just grinds to a halt. I'll start getting "task blocked for more than 120 seconds" errors in dmesg, though it's been a while since I got that one. That used to haunt me on my last file server before I built this one. It would cause all sorts of crashes.

I've also tried ping -f to see if I get any dropped packets and I don't.
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
Red Squirrel said:
It's kinda hard to peg a gig link in a home environment. When it's slow, it's not the network that's the bottleneck; I don't know what it is. It's like it randomly hiccups for no reason at times and everything just grinds to a halt. I'll start getting "task blocked for more than 120 seconds" errors in dmesg, though it's been a while since I got that one. That used to haunt me on my last file server before I built this one. It would cause all sorts of crashes.

Check it from the vSphere side. I know you love your Linux CLI and whatnot, but vSphere is your friend here since the system you're having issues with is a VM. Unless you want to learn esxcli.
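(For reference, a couple of host-side commands that expose similar information, assuming the ESXi Shell or SSH to the host is enabled; a sketch, not a full esxcli tour:)

Code:
# On the ESXi host itself:
esxtop          # interactive: 'n' = network, 'd' = disk adapter, 'u' = disk device
esxcli network vswitch standard list
esxcli network nic list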

You don't have a home environment. If you can't peg a GbE link with your setup, there's something very wrong. @VirtualLarry can peg a GbE link with his cheap off-the-shelf NAS units. I can peg a 10GbE link if I want to, but I decided the noise/power requirements for 10GbE weren't worth it and sold my 10GbE stuff. I've got 4x 1GbE links per host for iSCSI. I run my SAN replication job over the single 1GbE management link because it's a differential backup, so as long as the backup has been running regularly there's not that much data going over it. Guess what happens, though, if it's been a while? My management interface is basically inaccessible until the job completes, because the replication completely saturates that connection. Have you ever tried to SSH into a router while its interface is saturated? Your SSH connection sucks at best and will frequently drop.

I'm primarily a Windows/VMware guy, but everything you've described so far could easily be explained by storage latency and/or host resource contention. Could it be other issues? Sure. It could be a bunch of combined configuration issues. But that's hard to say without being able to actually see your setup. I know you don't want to reboot your storage server, but I'd start there with the power management settings.
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
Oh, I can peg it if I really try, like multiple wgets with a lot of big files or something; what I mean is that the everyday stuff that goes on does not peg it, as far as I know. This is what VMware looks like:

[screenshot attached]

And the file server (I had forgotten about the iftop command):

[screenshot attached]

There may be a local backup job running on one of the VMs right now as that's more traffic than I'd expect.
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
vSphere shows you hitting as high as 80MBps in the last hour. What does the last day look like?
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
Chart options: change the metrics to show just one data point. That should let you change the time range at that point.
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
I don't see an option like that. I do recall seeing it before when I used to manage a vSphere environment, so maybe it's my version, or the fact that it's the free version and the feature just isn't there.

I did find a place in /proc where I can get an accumulated count of bytes transmitted and received on the file server, though. Not exactly ideal, but I could write a script that gets those values once a second, averages them over a minute, and writes the result to a file; then I can set up a monitoring point that looks at that file to get the data rate. Whatever traffic hits the VM server is going to be hitting the file server too. Most of the non-vmdk traffic is going to either be between VMs (and only go through the vSwitch) or to the storage server via NFS. Traffic going to clients (e.g. the game server or web stuff) is going to be very small and negligible. So if I can get a graph of the file server, it would probably still be decent data to gather. I don't have graphing capabilities in my monitoring software yet, but it's something I've been wanting to add.
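(A minimal sketch of that kind of sampler, reading the per-interface byte counters from /proc/net/dev; the interface name and output path are placeholders:)

Code:
#!/usr/bin/env python3
"""Sample /proc/net/dev once a second and log a one-minute average rate.

IFACE and OUTFILE are placeholders - point them at the real NIC and at
whatever file the monitoring system watches.
"""
import time

IFACE = "eth0"                 # placeholder: NIC to watch
OUTFILE = "/var/tmp/netrate"   # placeholder: file the monitoring point reads
INTERVAL = 1                   # seconds between samples
WINDOW = 60                    # samples per averaging window

def read_bytes(iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if ":" not in line:
                continue
            name, rest = line.split(":", 1)
            if name.strip() == iface:
                fields = rest.split()
                return int(fields[0]), int(fields[8])   # rx bytes, tx bytes
    raise RuntimeError(f"interface {iface} not found in /proc/net/dev")

while True:
    rx_rates, tx_rates = [], []
    prev_rx, prev_tx = read_bytes(IFACE)
    for _ in range(WINDOW):
        time.sleep(INTERVAL)
        rx, tx = read_bytes(IFACE)
        rx_rates.append((rx - prev_rx) / INTERVAL)
        tx_rates.append((tx - prev_tx) / INTERVAL)
        prev_rx, prev_tx = rx, tx
    avg_rx = sum(rx_rates) / len(rx_rates) / 1e6   # MB/sec
    avg_tx = sum(tx_rates) / len(tx_rates) / 1e6
    with open(OUTFILE, "w") as f:
        f.write(f"rx_MBps={avg_rx:.2f} tx_MBps={avg_tx:.2f}\n")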
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
Hmmm. I didn't remember vCenter being necessary for ALL historical data, but that appears to be the case. You could try running the free version of Veeam ONE, as I think it retains historical data itself, but I'm not certain of that.
 

thecoolnessrune

Diamond Member
Jun 8, 2005
9,673
583
126
XavierMace said:
Hmmm. I didn't remember vCenter being necessary for ALL historical data, but that appears to be the case. You could try running the free version of Veeam ONE, as I think it retains historical data itself, but I'm not certain of that.

Yeah, vCenter is required for anything other than real-time. Veeam ONE collects its historical performance metrics from vCenter, so without vCenter you only get real-time there as well. :(
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
My switch came in, though I realized I'm probably going to want a dedicated NIC too, for the storage server at least. The switch will be completely separate from the main network and strictly for the storage back end, so any server that needs access to storage will also connect to that switch. There's a Fibre Channel card in that server I no longer use, so I can take the time to pull it out and put in the Ethernet NIC, and while my whole network is down I can also look in the BIOS for any power-saving settings and disable the power-save states.

For the time being there will still be some file activity over the main network, as a lot of stuff accesses that server directly, but I might change that over time and treat it more like a SAN than a NAS. Though for raw file access it probably still makes more sense to access it directly, versus making a file server VM, as there would be extra overhead. I do eventually want to switch from VMware to something more open like QEMU/KVM and add more redundancy, HA, etc. too.

In the meantime maybe I will just add more alarm/data points, such as await, so when the network starts to slow down I can go back and look at the values. I still need to look at adding graphing capabilities to my monitoring software as well. That will at least give me a better picture of what's going on.
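(A minimal sketch of an await-style data point, computed from two samples of /proc/diskstats; the device name and interval are placeholders, and this is only an approximation of what iostat -x reports:)

Code:
#!/usr/bin/env python3
"""Approximate an iostat-style 'await' (ms per completed I/O) for one block
device by sampling /proc/diskstats twice. DEVICE is a placeholder."""
import time

DEVICE = "sda"      # placeholder: block device to watch
INTERVAL = 10       # seconds between the two samples

def diskstats(device):
    """Return (I/Os completed, ms spent on I/O) totals for device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads, read_ms = int(fields[3]), int(fields[6])
                writes, write_ms = int(fields[7]), int(fields[10])
                return reads + writes, read_ms + write_ms
    raise RuntimeError(f"device {device} not found in /proc/diskstats")

ios1, ms1 = diskstats(DEVICE)
time.sleep(INTERVAL)
ios2, ms2 = diskstats(DEVICE)

delta_ios = ios2 - ios1
await_ms = (ms2 - ms1) / delta_ios if delta_ios else 0.0
print(f"{DEVICE}: {delta_ios} I/Os in {INTERVAL}s, average await {await_ms:.1f} ms")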
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,176
516
126
I think you really need to look at tuning your NFS based on the nfsstat you posted. I highly suggest increasing the rsize and wsize values on the client side (mount options), as well as adding "noatime,nodiratime". I would look into using "actimeo=<some value>" as well, but that depends on how you use the storage (i.e. do you have multiple systems reading/writing the same files/directories at the same time, etc.).

The reason I say this is that 86% of the NFSv3 operations are reads, which means you have either a LOT of reads or, more likely, a lot of SMALL reads, where a larger rsize tuned to your filesystem's performance would greatly help (i.e. I would make it at least 65536, higher if your kernel supports it, but you will need to enable jumbo frames to really take full advantage). And on the NFSv4 side, 43% of the operations are getattrs. That means almost half of what it is doing is updating clients with file attributes, something you can have your clients cache for very long periods if your storage is set up such that the various clients are not all operating on the same files at the same time.
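(As a hedged illustration of where those options go, a client-side /etc/fstab entry might look like the line below; the server name, export path, and values are examples only, and actimeo should only be raised if clients aren't sharing the same files:)

Code:
# /etc/fstab on an NFS client -- hostname, paths, and values are examples
fileserver:/export/vmstore  /mnt/vmstore  nfs  rsize=65536,wsize=65536,noatime,nodiratime,actimeo=60  0 0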
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
Where would I put that setting, and is there a way to do it server-side instead of having to do it on every single client? I'd rather have the same settings across the board so it's more consistent, instead of having to set them manually for each client.
 

sourceninja

Diamond Member
Mar 8, 2005
8,805
65
91
The easiest way to deal with multiple clients is to use a config management tool like SaltStack.
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,176
516
126
Well, there are a few ways to do it:
1) Set up an LDAP domain and configure automount maps, in which case you set the values in the map (see the example map below).
2) Set up a NIS domain (not very secure, but if you use it only for automount that's not an issue; just make sure you actually disable the passwd and shadow maps if you do set this up, since NIS does not use an encryption layer on the network, so the password hashes and salts are exposed across the network and to local users on the systems, which is why it is not secure), and set up the automount maps there.
3) Use a configuration management tool (Chef, Puppet and/or Ansible are probably the industry standards, i.e. skills worth having, as you can get paid to know them).
4) Set up distributed shell (DSH) across all your systems, keep the files in a common storage location, and copy them to all systems with a single command.
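(For options 1 and 2, the mount options live in the automount map itself; a minimal sketch, where the map names, server, and paths are placeholders:)

Code:
# /etc/auto.master (or the LDAP/NIS-served equivalent)
/mnt/nas   auto.nas

# auto.nas map: key, mount options, location -- names and values are examples
vmstore   -rw,rsize=65536,wsize=65536,noatime   fileserver:/export/vmstore
data      -rw,noatime,actimeo=60                fileserver:/export/data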
 

Red Squirrel

No Lifer
May 24, 2003
70,158
13,568
126
www.anyf.ca
That sounds way more complicated and involved per machine than I want to get into. Is there not a parameter I can set in the exports file or something, to just tell all the clients to use those settings? I'm not sure why they even make those settings client-side; it seems like they should be server-side.
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,176
516
126
Nope, it's all client-side. They did it that way to be compatible with a wider range of hardware, software, and networks. For instance, you may get a 10x increase in performance by using the "async" option for clients that are on a highly reliable network with a server on stable power, but a client that sits in an outbuilding off a WAN link that uses IPoAC (go look that one up on Google) might need a completely different set of parameters in order to work at all. Doing it client-side means each client can be tuned for its best potential, whereas a server-side setting would have to cater to the worst case, downgrading everyone to it.

Setting up NIS only takes about 30-40 minutes... but it is the least useful skill in the market. LDAP and the configuration management tools would be the more marketable skills to learn.