Monitor for “NT Service Groups” and Resources

I’ve been working on a monitor that watches groups of services as a single object, instead of using the service monitor wizard (which creates a lot more in the background and in the database than I need). As I came up with features, I thought it would be great to hear ideas from the general SCOM community, who have “heard it all”.

In short, it monitors groups of services (NT services or cluster resource services). You can mix clustered services in with standard NT services because the monitor knows the difference (and which node owns each resource) and handles each appropriately.

The group of services and its parent are both object instances, so they can be shown in views and included in distributed applications.

A few scenarios (there are overrides for the features):

#1 – NT Services Only – Restart Down Services

If a service in the group is down, the monitor restarts it (or just displays a CRITICAL state with a no-restart override). With an override, it can be configured to monitor only services that are set to Automatic.

If the WarnDown override is enabled, the state changes to WARNING when all down services have been restarted, so you can send a notification that services were restarted successfully; it then returns to HEALTHY on the next pass. If the monitor could not restart the services, it sets the state to CRITICAL.
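The state logic described above can be sketched roughly as follows. This is only an illustration in Python, not the MP's actual script: the `Service` record, the `try_start` callback, and the parameter names are all hypothetical stand-ins for whatever the monitor really uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Health(Enum):
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


@dataclass
class Service:
    # Hypothetical record for one service in the group.
    name: str
    running: bool
    start_mode: str = "Automatic"  # assumed values: "Automatic" or "Manual"


def evaluate_group(
    services: List[Service],
    try_start: Callable[[Service], bool],
    auto_only: bool = False,
    restart_services: bool = True,
    warn_down: bool = False,
) -> Health:
    """Return the group's health after one polling pass."""
    # AutoOnly: skip services whose startup type is not Automatic.
    monitored = [s for s in services if not auto_only or s.start_mode == "Automatic"]
    down = [s for s in monitored if not s.running]
    if not down:
        return Health.HEALTHY
    if not restart_services:
        # No-restart override: just report the outage.
        return Health.CRITICAL
    # Attempt every restart (the list is built fully before checking results).
    results = [try_start(s) for s in down]
    if not all(results):
        return Health.CRITICAL
    # Everything came back; optionally flag this pass so a notification can fire.
    return Health.WARNING if warn_down else Health.HEALTHY
```

Because a successful restart leaves nothing down, the next pass naturally returns HEALTHY, which matches the "WARNING for one pass" behaviour without any extra bookkeeping.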

#2 – Clustered Resource Services – Restart Offline Resource Service

The monitor checks whether the resource service in question is hosted on the node where the monitor is running. If it is, and a clustered resource service in the group is found offline, the monitor brings the resource back online (or just displays a CRITICAL state with a no-restart override).

If the WarnDown override is enabled, the state changes to WARNING when all offline cluster resource services have been brought back online, so you can send a notification that resource services were restarted successfully; it then returns to HEALTHY on the next pass. If the monitor could not bring the resources online, it sets the state to CRITICAL.

The health state will only be reflected on the node hosting the resource. If the resource is hosted on another node, the clustered resource service is ignored.
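The ownership check can be pictured as a simple filter. The sketch below is Python for illustration only; `ClusterResource`, `owner_node`, and both function names are hypothetical, and how the real monitor discovers the owning node is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterResource:
    # Hypothetical record for one clustered resource service.
    name: str
    owner_node: str  # node currently hosting the resource
    online: bool


def resources_to_check(resources: List[ClusterResource], local_node: str) -> List[ClusterResource]:
    """Keep only resources hosted on this node; the rest are ignored."""
    # Assumption: node names are compared case-insensitively.
    return [r for r in resources if r.owner_node.lower() == local_node.lower()]


def offline_here(resources: List[ClusterResource], local_node: str) -> List[str]:
    """Names of locally hosted resources that are currently offline."""
    return [r.name for r in resources_to_check(resources, local_node) if not r.online]
```

A resource that is offline but owned by another node never shows up in `offline_here`, so it cannot change the health state on this node.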

#3 – Mixed NT Service and Cluster Resource Service

This functions identically to #1 and #2, except that the group contains a mixture of NT services and resource services.

#4 – NT Service and Cluster Resource Service with Restart Order

With the “StopServices” override set to true, when one service is found down, the monitor will first stop all of the services in the group. Next, it will start the services (or bring the cluster resource services online) in the order in which they are listed.

Of course, this holds true for #1, #2, and #3 as well.
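As a rough sketch of the stop-then-start-in-order behaviour (Python for illustration, not the MP's script), the callbacks `is_up`, `stop`, and `start` below are hypothetical placeholders for the real service or cluster resource calls. Stopping in listed order is my assumption here; the only ordering the description above guarantees is the start order.

```python
from typing import Callable, List


def restart_in_order(
    names: List[str],
    is_up: Callable[[str], bool],
    stop: Callable[[str], None],
    start: Callable[[str], bool],
) -> bool:
    """Stop the whole group if any member is down, then start it in listed order.

    Returns True when nothing needed doing or every start succeeded.
    """
    if all(is_up(n) for n in names):
        return True
    # Stop whatever is still running (assumed: in listed order).
    for n in names:
        if is_up(n):
            stop(n)
    # Start every member in the order it is listed, even if an earlier start failed.
    ok = True
    for n in names:
        ok = start(n) and ok
    return ok
```

Listing a dependency before the services that need it (for example a database service ahead of its application service) is what makes the listed order meaningful.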

My interest is in hearing ideas that I might not have thought of, or real-world scenarios for monitoring groups of services that have unique requirements.

————————————

Multiple service groups can be monitored on a single server this way.

The overrides are as follows:

AutoOnly: (true/false)
True=only monitor services that are set to Automatic
StopServices: (true/false)
True=stop all services and restart them in order
RestartServices: (true/false)
True=restart services that are down
WarnDown: (true/false)
True=change to WARNING state if services were restarted, then return to healthy on the next pass
WarnNoService: (true/false)
True=set state to WARNING when one or more of the listed services do not exist
IntervalSeconds: (Integer)
The standard polling interval, in seconds.
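
For anyone who prefers to see the override set as a data structure, here is a hedged Python sketch. The override names come from the list above, but the default values are purely my assumptions for illustration (the list does not state defaults), and `Overrides` and `from_mapping` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping


def _to_bool(text: str) -> bool:
    """Interpret an override string; anything other than 'true' counts as false."""
    return text.strip().lower() == "true"


@dataclass(frozen=True)
class Overrides:
    # All defaults below are assumed values, not taken from the MP.
    auto_only: bool = False
    stop_services: bool = False
    restart_services: bool = True
    warn_down: bool = False
    warn_no_service: bool = False
    interval_seconds: int = 300

    @classmethod
    def from_mapping(cls, raw: Mapping[str, str]) -> "Overrides":
        """Build an Overrides value from name/value strings, using defaults for gaps."""
        d = cls()
        return cls(
            auto_only=_to_bool(raw.get("AutoOnly", str(d.auto_only))),
            stop_services=_to_bool(raw.get("StopServices", str(d.stop_services))),
            restart_services=_to_bool(raw.get("RestartServices", str(d.restart_services))),
            warn_down=_to_bool(raw.get("WarnDown", str(d.warn_down))),
            warn_no_service=_to_bool(raw.get("WarnNoService", str(d.warn_no_service))),
            interval_seconds=int(raw.get("IntervalSeconds", d.interval_seconds)),
        )
```

Calling `Overrides.from_mapping({"AutoOnly": "true", "IntervalSeconds": "60"})` changes only those two settings and leaves the rest at their (assumed) defaults.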

My reason for coming up with this is a unique situation: I have 3 servers with 4 services installed on them. On one server all 4 are running, and on the other two only 2 of them are running. Which services run where can be changed manually at any time. The two that are not running are set to Manual, which removes them from monitoring. When they are set back, they are monitored again.

An unrelated situation (but similar) is where a service “could” be a standard NT Service or “could” be a clustered resource service.  This type of monitor covers me either way.

So I’d love to hear thoughts and comments. My first-phase test of this monitor was successful and I have been quite happy with it. I have only tested with Server 2008 and 2012. The monitor MP is not specific to a SCOM version.

Craig

Click Here to Download the MP

8 thoughts on “Monitor for “NT Service Groups” and Resources”

  1. TVK

    What a GREAT idea, Craig! Just what I’ve been looking for and hoped was implemented in OpsMgr 2012/R2. Care to share? Thanks a lot!

  2. Craig Pero Post author

    I’m more than willing to share… The only requirements I have are: #1 (hold harmless) you use it at your own risk and test in your lab, and #2 you share improvements with me.

    I have to fix one item first. I realized that my code logic for stopping services does not check whether the restart option is set. In essence, it would stop all services when a service (or resource) was down but would not restart them if the restart option was set to false.

    I think I’ll code it such that if StopServices is set to true, RestartServices is assumed to be true even if it is set to false. I believe stopping services without restarting them is a pretty bad configuration choice that could easily happen by accident.

  3. Craig Pero Post author

    I have updated the monitor to assume a restart when StopServices is set to true. I also added an option, WaitForStartServiceSeconds, which is used to wait for the service to start or the resource to come online. The goal is to avoid the state being reported as Critical when a start simply takes longer than usual.
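
A minimal Python sketch of those two changes, for illustration only. The function names, the one-second poll interval, and the injectable `clock`/`sleep` parameters are all assumptions of mine and not how the MP's script is actually written.

```python
import time
from typing import Callable


def effective_restart(stop_services: bool, restart_services: bool) -> bool:
    """StopServices=true implies a restart, whatever RestartServices says."""
    return stop_services or restart_services


def wait_until_running(
    is_running: Callable[[], bool],
    wait_seconds: float,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the service reports running or wait_seconds has elapsed."""
    deadline = clock() + wait_seconds
    while True:
        if is_running():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Timed out: the caller would report Critical at this point.
            return False
        # Never sleep past the deadline.
        sleep(min(poll_seconds, remaining))
```

The status is checked once more after the final sleep, so a service that comes up right at the deadline is still counted as started.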

  4. Craig Pero Post author

    I should mention: if you create registry keys manually, make sure the default value exists, even if it is a blank value. If you don’t, discovery will not work for the management pack.

  5. Pete Zerger

    Craig, I think the cluster resource work you’re doing here most resonates with me. To keep things simple, I think maybe even the cluster-focused work should be separated from the simple grouping logic.

    What are your thoughts?

  6. Tommy Gunn

    I think I agree with what Pete is saying here. For us, stopping the cluster noise is something of a nightmare. Is your work here intended to augment the existing Failover Cluster MP?

  7. Craig Pero Post author

    Thanks for your comments.  It’s important to think of the impact and I know cluster monitoring is VERY noisy. At a previous company, we disabled all the cluster rules for the groups and incorporated them into 1 rule.  If 1 fires, you know the others are going to be marching in behind it so why not just have one alert for it all. First to fire the alert gets the glory.

    My goal here was to monitor a group of services and have their overall health reflected by a single class instance that can be included in a view for an Ops Team to monitor (as well as in distributed application views). Adding the option to include a service that was clustered was only to expand the functionality of the monitor, not really to augment cluster monitoring. I have a case where a service could be one or the other, depending on whether it is on a cluster, so it gets picked up regardless.

    Using the NT Service Template creates a lot more objects and monitors, which increases the state change information written to the database as the scope grows. Since we don’t need to see the state of each service… on a box where I have 15 services to watch, I have 2 class instances and 1 monitor instead of 15 service instances, 1 aggregate, and I think around 30 monitors, which would be required by the standard service template, although half of those monitors are not enabled by default. (The service monitor template creates a lot of value-adding components when you need to track more information on a service than just whether it is running or not.)

    Is this something (even if it was for services alone and excluded “generic service” resources) you see others looking for?  Are there ideas to make this a better solution that could help me further or others?

    Ultimately I am thinking in terms of being database friendly, keeping administration simple, and reducing alerts by watching services in groups when nothing more than up/down is needed but seeing the state in a console view is required. Monitoring cluster resource services was just a plus thrown in that might never get used by someone else. I’d rather have it and not use it than need it and not have it.

  8. Craig Pero Post author

    I was thinking about the comments and thought I might add a clarification. The monitor can restart a stopped service or, with an override, restart ALL of the services in order. Restarting them all would be the admin’s choice, and I think that is actually a fairly rare case that I just happened to stumble upon where I am currently contracted.
