System Center Operations Manager 2007 provides a comprehensive solution for monitoring your entire Windows (and now non-Windows) environment, and it surfaces issues which are occurring or may soon occur. The management packs contain detailed product knowledge which assists with resolving alerts. When OpsMgr identifies an issue it raises an alert based upon the importance of that issue. Each alert has both a priority level and a severity level, which are explained below.
Operations Manager provides three levels of alert priority:
High (2)
Medium (1)
Low (0)
Operations Manager provides three different levels of Severity:
Critical (2) - Red
Warning (1) - Yellow
Information (0) - (Not actually a color, but lack of a critical or warning indicates a Green state)
Each of the severity levels corresponds to a color, like a stoplight. Out of the box, Operations Manager includes alerts at each level, based upon the state of the component or distributed application which Operations Manager is monitoring. As an example, a drive which is low on disk space would first hit a warning level (Yellow), and then as the free disk space decreased it would eventually reach a critical level (Red). When sufficient free disk space is available on the drive it returns to a Green state. The challenge here, however, is that Operations Manager uses these colors to describe the state not only of a single system but also of the larger entities which contain it. What OpsMgr sees as critical may not be what your environment truly considers to be critical.
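To make the stoplight idea concrete, here is a minimal Python sketch of the severity and priority values listed above, applied to the disk space example. It is only a model for illustration, not OpsMgr code: the function name `disk_severity` and the 10% and 5% thresholds are assumptions of mine, and the real thresholds come from the management pack monitoring the disk.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Alert severity levels, using the numeric values listed above."""
    INFORMATION = 0
    WARNING = 1
    CRITICAL = 2


class Priority(IntEnum):
    """Alert priority levels, using the numeric values listed above."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Stoplight color shown for each severity.
STATE_COLOR = {
    Severity.INFORMATION: "Green",
    Severity.WARNING: "Yellow",
    Severity.CRITICAL: "Red",
}


def disk_severity(free_percent, warning_below=10.0, critical_below=5.0):
    """Map free disk space to a severity.

    The two thresholds are hypothetical defaults chosen for this sketch.
    """
    if free_percent < critical_below:
        return Severity.CRITICAL
    if free_percent < warning_below:
        return Severity.WARNING
    return Severity.INFORMATION


# Walk a drive from healthy, to warning, to critical.
for free in (40.0, 8.0, 3.0):
    sev = disk_severity(free)
    print(f"{free:5.1f}% free -> {sev.name} ({STATE_COLOR[sev]})")
```

Once free space rises back above the warning threshold the function returns `INFORMATION` again, which matches the drive returning to a Green state.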
As an example, if a system is critical on disk space and it’s in the Frisco site then the site itself appears as critical. Management packs define what is critical based upon the functionality of the application itself, not on the overall impact to your environment. So as another example, if I have an IIS site down on a server that is in the Frisco location then the Frisco location itself is critical since the IIS site is critical. The challenge with all of this is that not everything can be fixed, and as a result systems tend to stay in a critical state (Red).
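The way one red component turns a whole site red can be sketched as a worst-state rollup. This is a deliberately simplified model, assuming each parent simply takes the worst state of its children; the server names and the `rollup` helper are hypothetical, and this is not how OpsMgr itself is implemented.

```python
from enum import IntEnum


class State(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2


def rollup(states):
    """Return the worst state among the children (Green if there are none)."""
    return max(states, default=State.GREEN)


# Hypothetical Frisco site: monitor states per server.
frisco = {
    "web01": [State.GREEN, State.RED],     # one IIS site is down
    "file01": [State.GREEN, State.GREEN],  # everything healthy
}

# Each server takes the worst state of its monitors ...
server_states = {name: rollup(monitors) for name, monitors in frisco.items()}
# ... and the site takes the worst state of its servers.
site_state = rollup(server_states.values())

print(server_states)
print("Frisco:", site_state.name)
```

A single red monitor on one server is enough to make the whole site red in this model, which is exactly why lowering the severity of alerts that are not truly critical matters so much.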
To paraphrase some feedback about OpsMgr I recently heard from a colleague: “Critical errors are not really critical for the customer. The end result is customers selecting lots of alerts and closing them weekly.”
All of this is to set background so that we can put this blog entry into context for an area where OpsMgr can be extremely challenging: getting the Red out of your environment. My recommendation is to use the following three-step process to address this challenge: Tuning, Defining, Overriding.
Step 1: Tuning and Alert Resolution
Proper tuning of OpsMgr to your environment is crucial. Alerts which occur should either be fixed (which is optimal and should always be the approach if possible), closed (if the situation is known and has been resolved), or potentially overridden depending upon the specific requirements of your system. As an example, if you have a development web server which you are monitoring you may not care when the website is down. In that situation you would either override the alerts for the website to lower levels of severity/priority or disable the alerts.
Proper tuning and alert resolution takes time as each environment is different. The product knowledge on the alerts is useful for debugging and resolving issues. There is also a free management pack available at System Center Central which provides a repository of alerts and community-based resolutions for the alerts (http://www.systemcentercentral.com/Details/tabid/147/indexId/21716/Default.aspx).
Step 2: Defining Critical, Warning and Informational
Before you can determine what should be critical, warning or informational in your environment you need to define these terms for your environment. As an example, this is how we have defined them:
Critical: A situation has occurred which needs to be addressed immediately. These alerts must be actionable and must require immediate attention to resolve the situation. Examples of these include:
- Network link is down
- Production website is down
- Production server is down
- Database out of space
- Disk almost out of disk space
Warning: These are issues which need to be addressed but they are not important enough to drop whatever is currently being worked on in order to address them. Examples of these include:
- High processor utilization
- One network link is down to a remote location but other network links are still online
- The Auto Shrink flag has been set
- Long running jobs are occurring in SQL Server
- Mail flow latency has been exceeded
Informational: Good information to be aware of in the environment but most likely no specific action is required related to these items. Examples of these include:
- Exchange information stores are now online
- A domain controller was rebooted
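The three definitions above boil down to two questions: does someone need to act on this, and does it need to happen right now? A minimal sketch of that decision follows. The `classify` function and its two parameters are my own illustration of the definitions, not anything provided by OpsMgr.

```python
def classify(actionable: bool, needs_immediate_attention: bool) -> str:
    """Apply the definitions above to a single alert.

    Critical:      actionable and needs immediate attention.
    Warning:       actionable, but not worth dropping current work for.
    Informational: good to know, most likely no action required.
    """
    if actionable and needs_immediate_attention:
        return "Critical"
    if actionable:
        return "Warning"
    return "Informational"


# A few of the examples from the lists above.
examples = {
    "Production website is down": (True, True),
    "High processor utilization": (True, False),
    "A domain controller was rebooted": (False, False),
}

for alert, answers in examples.items():
    print(f"{classify(*answers):13} {alert}")
```

Your own answers to the two questions may differ per alert, which is the point: the definitions belong to your environment, not to the management pack.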
Step 3: Visine through the use of Overrides
Now that we have defined what each of these levels means in our environment we can start customizing our OpsMgr environment to meet the definitions which we have created.
Some may call this OpsMgr OCD, but I call it Getting-The-Red-Out.
Some may call this redefining success, but I call it project Visine.
The first step for this is to create a management pack where we will store our overrides (I like “get-the-red-out MP” but that’s just my opinion). Once we have created the management pack we can begin assessing the critical alerts in our environment based upon our new definitions. There are several ways that a critical alert can be assessed:
- The first one is the default assumption that it is really critical, in which case it stays exactly how it is. We need to fix the issue and move on.
- Often a critical alert is critical for a specific system but it may not be critical for other systems. In this case we use groups and overrides to keep the alert at critical for the specific systems, while lowering the severity of the alert for the groups of systems where it is not critical.
- If an alert is never critical for the environment, an override can be created to change the severity of the alert for all systems to the appropriate level for your environment (warning or informational).
- For details on how to alter severity and priority via overrides see Marius’ blog entry at: http://blogs.msdn.com/mariussutara/archive/2007/12/17/alert-severity-and-priority-use-with-override.aspx
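The three outcomes above can be modelled as a small lookup: keep the default, override for one group, or override for all systems. The sketch below is a simplified model with an assumed rule that a group-targeted override wins over an all-systems override. The rule names, group name, and server names are hypothetical, and real override precedence in OpsMgr has more rules than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Override:
    rule: str
    severity: str
    group: Optional[str] = None  # None means "all systems"


# Hypothetical group membership and default severities from the management pack.
GROUPS = {"Dev Web Servers": {"dev-web01"}}
DEFAULTS = {
    "Web Site Unavailable": "Critical",
    "Auto Shrink Flag Set": "Critical",
    "Database Out Of Space": "Critical",
}

# Contents of our hypothetical "get-the-red-out" overrides management pack.
OVERRIDES = [
    # Critical for production, but only a warning for development servers.
    Override("Web Site Unavailable", "Warning", group="Dev Web Servers"),
    # Never critical in this environment, so lowered for all systems.
    Override("Auto Shrink Flag Set", "Warning"),
]


def effective_severity(rule: str, server: str) -> str:
    """Resolve severity: group override, then all-systems override, then default."""
    group_match = None
    global_match = None
    for ov in OVERRIDES:
        if ov.rule != rule:
            continue
        if ov.group is None:
            global_match = ov
        elif server in GROUPS.get(ov.group, set()):
            group_match = ov
    if group_match is not None:
        return group_match.severity
    if global_match is not None:
        return global_match.severity
    return DEFAULTS[rule]


for rule, server in [
    ("Web Site Unavailable", "prod-web01"),
    ("Web Site Unavailable", "dev-web01"),
    ("Auto Shrink Flag Set", "sql01"),
    ("Database Out Of Space", "sql01"),
]:
    print(f"{server:11} {rule:22} -> {effective_severity(rule, server)}")
```

Keeping every override in one dedicated management pack, as suggested above, makes this list easy to review when new management packs or alerts arrive.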
This is also an ongoing process of evaluating alerts when they occur. When new management packs are added or new alerts occur in the environment, they need to be assessed through the same process that we have defined for the environment so that going forward alerts will be critical only in the situations where we define them as critical for our environment.
Summary: You do not have to accept that your environment will always be red in certain areas when you are using Operations Manager. Through effective tuning and customization to match your business definitions you can have an environment which is green when all is well, yellow when there are issues, and red when it is truly critical.