- Jan 14, 2005
My organization, a medium-sized hospital with around 2,000 servers and 1,000+ switches/routers, has been struggling with process issues around monitoring and alerting for these devices.
A little background to set the stage.
We currently use SolarWinds to monitor all of our network devices. Critical systems are also configured to alert on certain events, though this is not consistent, and IT staff can be overwhelmed with irrelevant alerts and miss something important.
We use Data Center Real User Monitoring (DCRUM) for critical application systems, though it was badly out of date until a few months ago, when I started updating the monitored applications with their current servers and services.
For servers (Windows/Linux) we can use Big Brother and/or System Center Operations Manager. We "usually" configure servers for one or both, but alerting is rarely set up on any server. The typical practice is to wait for a user to call and then go find out what is broken. 80% of our servers run on VMware hosts.
We recognize the need to improve but are meeting resistance, out of fear that IT staff would be overwhelmed with pages/emails from the several thousand devices on our network.
My questions for this group are as follows:
What is your corporate practice for device monitoring? That is, what level of criticality would you need to see to monitor a device (server or network)?
Who is/are responsible for determining such criticality?
What is your corporate practice for determining when an alert should be sent?
Do you have a standard for configuring alerting for applications residing on the network? We use DCRUM for our most business-critical systems but do not send alerts to application analysts, mainly because DCRUM is still being brought up to date after three years of disuse and because we have no organizational process that defines what we should alert on and whom we should alert.
Any feedback is appreciated.