Server and Network device monitoring and alerting

dphantom

Diamond Member
Jan 14, 2005
4,763
326
126
My organization, a medium-sized hospital with around 2,000 servers and 1,000+ switches/routers, has been and still is struggling with process issues around monitoring and alerting for these devices.

A little background to set the stage.

We currently use SolarWinds for all of our network devices. All network devices are configured to be monitored by SolarWinds. Critical systems are also configured to alert on certain events, though this is not consistent, and IT staff can be overwhelmed with irrelevant alerts and miss something important.

We use Data Center Real User Monitoring (DCRUM) for critical application systems, though it was badly out of date until a few months ago, when I started updating the monitored applications with current servers and services.

For servers (Windows/Linux) we can use Big Brother and/or System Center Operations Manager. We "usually" configure the servers for one or both. There is rarely any alerting set up for any server. The practice is typically to wait for a user to call and then go find out what is broken. 80% of our servers are running on VMware hosts.

We recognize the need to get better but are meeting resistance, for fear that IT staff would be overwhelmed with pages/emails from the several thousand devices on our network.

My questions for this group are as follows:

What is your corporate practice for device monitoring? That is, what level of criticality would you need to see to monitor a device (server or network)?

Who is/are responsible for determining such criticality?

What is your corporate practice for determining when an alert should be sent?

Do you have a standard for configuring alerting for applications residing on the network? We use DCRUM for our most business-critical systems but do not send alerts to application analysts, mainly because DCRUM is still being brought up to date after 3 years of disuse and because we have no organizational process that defines what we should alert on and to whom.

Any feedback is appreciated.
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
We use Solarwinds (NPM+SAM) to monitor servers, infrastructure and applications. About 10k devices in total currently. Assuming you're running NPM, if you're not running 12, I'd get on that boat. The alerting system and its flexibility are vastly improved in 12. We've still got a lot of tweaking to do in our environment, but we've gone from 12k email alerts a day (a year ago) to around 50-100 web alerts.
 

dphantom

Very good. That is our teams' concern: the amount of time it might take to narrow down what needs to be alerted on. So nothing ever gets done. No one wants to get 10k pages/emails a day, so you have evidently put a lot of time and effort into that.

We would need to do the same, but except for our network folks, none of the other teams want to. Finding a way to convince them that proactive alerting is beneficial is my challenge.
 

XavierMace

It takes a fair bit of work on your part (or on the part of whoever is responsible for Solarwinds), but once you get it set up, it does make life much easier. Again, assuming you're using Orion/NPM and not one of their free tools...

Our monitoring is broken up by geographical region. Each team has their own dashboard, and since we use AD authentication, they are taken directly to their dashboard when they log in. So they are only seeing the nodes/alerts they care about. You can set severity levels on the alerts themselves and set what action each alert needs to take accordingly.

For example, when a server hits 80% disk space used that triggers a "warning" which shows up in the alert list. If somebody gets it back under 80% the alert goes away. Or you can acknowledge the alert which hides it from the list. At 90% it generates a new "serious" alert that again shows up in the list. At 95% it generates a "critical" alert, adds an alert to the list AND emails the assigned people as well.
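To make the tiers concrete, here is a minimal Python sketch of the same escalation logic. It is illustration only, not SolarWinds configuration: the `Alert` class and `evaluate_disk` function are hypothetical names, and the sketch simplifies by returning only the most severe tier that matches, whereas the setup described above raises a separate alert at each level.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds from the post: 80% warning, 90% serious, 95% critical.
# Ordered highest first so the most severe matching tier wins.
TIERS = [
    (95.0, "critical"),
    (90.0, "serious"),
    (80.0, "warning"),
]


@dataclass
class Alert:  # hypothetical container, not a SolarWinds object
    node: str
    severity: str
    used_pct: float
    send_email: bool  # per the post, only "critical" also emails people


def evaluate_disk(node: str, used_pct: float) -> Optional[Alert]:
    """Return the alert for one disk reading, or None when usage is under 80%."""
    for threshold, severity in TIERS:
        if used_pct >= threshold:
            return Alert(node, severity, used_pct, send_email=(severity == "critical"))
    return None  # back under 80%: the alert clears
```

Because the thresholds live in one list, every server is evaluated against the same triggers.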

By default Solarwinds does alert on basically everything. So the first thing I did was just hit the "off" toggle on everything except for down nodes, hardware failures, interface utilization (monitoring WAN circuits), and down applications. Then I started creating custom alerts for things like disk usage where we wanted more granular triggers.
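The "turn everything off, then allow a short list" approach amounts to an allowlist. A rough Python sketch, assuming alerts arrive as dictionaries with a `category` field (the field and the category names below are made up for illustration and are not SolarWinds identifiers):

```python
# Hypothetical category names mirroring the four alert types left enabled.
ENABLED_CATEGORIES = {
    "node_down",
    "hardware_failure",
    "interface_utilization",
    "application_down",
}


def filter_alerts(alerts):
    """Keep only alerts whose category is on the allowlist."""
    return [a for a in alerts if a["category"] in ENABLED_CATEGORIES]
```

Custom alerts such as disk usage would then be added to the set deliberately, one category at a time.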

The biggest fight for me is keeping alerts standardized across the board. We do managed IT, so I constantly get requests for one-offs, and I tell them no and explain that if everything is standardized, I can have one set of disk alert triggers. When a new server is set up, I don't have to do anything besides make sure it's added to Solarwinds. All the alerts are already set up. Regarding the responsibilities....

In our organization the Help Desk (phones) people are responsible for monitoring Solarwinds and doing basic fixes such as cleaning up a disk drive or calling an ISP if a connection is down. They then escalate to the engineers as needed so the engineers for the most part don't have to watch Solarwinds.

I'd pitch it to the teams that Solarwinds isn't just a tool for monitoring; if used properly, it can do a lot of the troubleshooting for you. For example, we export NetFlow data to Solarwinds as well as monitor interface utilization. So when a branch calls in and says they're slow, you just click on the router for that location. Oh, the WAN interface is at 95% receive. Check NetFlow. Oh, a user is watching internet videos. Call the branch manager and tell them to make their people stop. Done. I didn't have to log into any devices or anything.
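The two checks in that workflow (is the link saturated, and who is responsible) can be sketched in a few lines of Python. This is an assumption-laden illustration, not how SolarWinds processes NetFlow: flow records are modeled as plain dictionaries with hypothetical `src`, `dst`, and `bytes` keys, and `is_saturated` and `top_talkers` are invented helper names.

```python
from collections import Counter


def is_saturated(rx_bps, link_bps, threshold=0.95):
    """True when receive utilization is at or above the threshold (95% here)."""
    return rx_bps / link_bps >= threshold


def top_talkers(flows, n=3, key="dst"):
    """Sum bytes per address and return the n heaviest as (address, bytes).

    For a saturated *receive* direction, the branch host pulling the traffic
    is the flow destination, so the default grouping key is "dst".
    """
    totals = Counter()
    for flow in flows:
        totals[flow[key]] += flow["bytes"]
    return totals.most_common(n)
```

For example, given a handful of flow records for the branch, `top_talkers(flows, n=1)` would point at the single host receiving the most data, which is the person to ask about the videos.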
 

dphantom


That has been one of my main pitches to the teams. My role is technical architect, which in our organization is more about process and forward thinking, along with being the point person on impossible-to-solve problems. :)

Good stuff Xavier. I am trying to do the same. Our teams complain of overwork, of constantly putting out fires, and of sometimes taking hours to identify that a server is down or a link is flapping. We have the tools; we just need to get them set up and configured properly. It will save time spent tracking down problems as well as avert trouble before it becomes a problem.
 

XavierMace

Yep, I feel your pain. If you've got any Solarwinds-specific questions, feel free to hit me up. Sadly that's a good portion of my life right now. LOL.
 

dphantom

I appreciate the offer and may do so. We'll see how the next few weeks go. At least my Director is in my corner so there is that.