By Pete Zerger on 2/3/2010 4:19:14 AM

With the noisy unit monitors we discussed in part 3, we were looking at state changes. That makes sense for monitors, because all unit monitors have state, but not all generate alerts. Rules, on the other hand, do not understand state, but they do have an element not seen with unit monitors: a repeat count. Since alerts generated by rules are never automatically resolved (closed), as is possible with unit monitors, each rule-generated alert has a RepeatCount property that is incremented once for each recurrence of the alert condition for that rule while the alert remains unresolved.

In this installment, we'll take a quick look at identifying the rules generating the most alerts, as well as the monitored objects triggering each alert rule most often. After all, there could be a widespread need to tune the rule, or it could simply be one or two agents where the rule is being triggered. Knowing which helps you decide how to target your override: to an entire class, a group of objects, or simply a couple of problem instances.

Options for Identifying "Noisy" Rules

When trying to find the rules generating the most alerts in your environment, you can look at it a few different ways. The first two queries focus on the rules themselves, while the third also groups by the object being monitored.

  • Noisiest Rules by Alert Count
  • Noisiest Rules by Repeat Count
  • Noisiest Rules by Repeat Count AND Instance (the object being monitored)

The first couple of queries appear on a couple of different blogs (including that of the great and powerful Kevin Holman), and the rest are variations I've derived to get a slightly deeper understanding of exactly where the alert noise is coming from (useful in larger environments, to be sure).

Top 10 Rule-Generated Alerts in an Operational Database (by Alert Count)

-- Count alert records per rule (rule-generated alerts only)
SELECT TOP 10 COUNT(*) AS AlertCount, AlertStringName, AlertStringDescription, MonitoringRuleId, Name
FROM AlertView WITH (NOLOCK)
WHERE TimeRaised IS NOT NULL AND IsMonitorAlert = 0
GROUP BY AlertStringName, AlertStringDescription, MonitoringRuleId, Name
ORDER BY AlertCount DESC

Top 10 Rule-Generated Alerts in the OperationsManager Database (by Repeat Count)

-- Total occurrences per rule: each alert counts once, plus once per repeat
SELECT TOP 10 SUM(RepeatCount + 1) AS RepeatCount, AlertStringName, AlertStringDescription, MonitoringRuleId, Name
FROM AlertView WITH (NOLOCK)
WHERE TimeRaised IS NOT NULL AND IsMonitorAlert = 0
GROUP BY AlertStringName, AlertStringDescription, MonitoringRuleId, Name
ORDER BY RepeatCount DESC

In the results, you'll notice the rule generating the most alerts (in my small lab) by RepeatCount shows a total of 3,826 occurrences. If this is spread

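To see where a total like this comes from, it can help to list the individual alerts with the highest repeat counts. This is only a sketch built from the same AlertView columns used in the queries in this post; the Occurrences alias is my own name, and the +1 assumes (as the queries here appear to) that an alert with a RepeatCount of 0 still represents one occurrence.

```sql
-- Sketch: the individual rule-generated alerts with the highest repeat counts.
-- Occurrences is an illustrative alias, not an AlertView column.
SELECT TOP 10
    AlertStringName,
    MonitoringObjectFullName,
    TimeRaised,
    RepeatCount,
    RepeatCount + 1 AS Occurrences
FROM AlertView WITH (NOLOCK)
WHERE TimeRaised IS NOT NULL AND IsMonitorAlert = 0
ORDER BY RepeatCount DESC
```

If one or two rows account for most of a rule's total, the noise is likely coming from a handful of long-lived, frequently repeating alerts rather than from many separate ones.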

Top 10 Rule-Generated Alerts in the Operational Database (by Repeat Count AND Target)

This query adds the MonitoringObjectFullName field to the grouping, which shows us the object or computer from which the alert is most often generated. This allows us to make a more informed targeting decision when creating our overrides.

-- Total occurrences per rule AND per monitored object
SELECT TOP 10 SUM(RepeatCount + 1) AS RepeatCount, AlertStringName, AlertStringDescription,
MonitoringRuleId, Name, MonitoringObjectFullName
FROM AlertView WITH (NOLOCK)
WHERE TimeRaised IS NOT NULL AND IsMonitorAlert = 0
GROUP BY AlertStringName, AlertStringDescription, MonitoringRuleId, Name, MonitoringObjectFullName
ORDER BY RepeatCount DESC
 
In the results, you'll see the same total count for the top rule broken down by the computer/object from which the alert was generated. In this case, we can see only two servers are generating the bulk of the alerts. Time for some tuning and troubleshooting on these two servers.
 
 
Had the "Service Check Data Source Module Failed Execution" rule disappeared from my top alerts altogether, that would have been a good sign the alerts were spread across a much larger number of instances in the class. To be sure, I could modify the query slightly to present only results for that AlertStringName, as shown here. Then I could see exactly how widespread the alerting was, which again may affect my tuning decision. Based on this last query, I can then decide whether to troubleshoot or tune for a group of servers or for an entire object class, depending on what I see.
 
-- Same per-object breakdown, limited to a single alert name
SELECT TOP 10 SUM(RepeatCount + 1) AS RepeatCount, AlertStringName, AlertStringDescription, MonitoringRuleId, Name,
MonitoringObjectFullName
FROM AlertView WITH (NOLOCK)
WHERE TimeRaised IS NOT NULL AND IsMonitorAlert = 0
AND AlertStringName = 'Service Check Data Source Module Failed Execution'
GROUP BY AlertStringName, AlertStringDescription, MonitoringRuleId, Name,
MonitoringObjectFullName
ORDER BY RepeatCount DESC
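
Another way to gauge how widespread a rule's alerts are, without filtering on one alert name at a time, is to count the distinct monitored objects behind each rule. The query below is a rough sketch along those lines, not something from the original set of queries; it uses only the AlertView columns already seen in this post, and the InstanceCount and TotalOccurrences aliases are names I made up for illustration.

```sql
-- Sketch: for each rule, how many distinct monitored objects raised its alerts
-- and how many total occurrences those alerts represent.
-- InstanceCount and TotalOccurrences are illustrative aliases, not AlertView columns.
SELECT TOP 10
    AlertStringName,
    MonitoringRuleId,
    COUNT(DISTINCT MonitoringObjectFullName) AS InstanceCount,
    SUM(RepeatCount + 1) AS TotalOccurrences
FROM AlertView WITH (NOLOCK)
WHERE TimeRaised IS NOT NULL AND IsMonitorAlert = 0
GROUP BY AlertStringName, MonitoringRuleId
ORDER BY TotalOccurrences DESC
```

A rule with a high TotalOccurrences but an InstanceCount of one or two points toward troubleshooting or a targeted override on those instances, while a high InstanceCount suggests looking at a group or the entire class.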

Remaining Installments

Hopefully you've found some of the info in this post useful. Below I've mapped out what's left in the series, which we should wrap up over the next couple of weeks.

  • In the next installment, we will look at some additional queries useful to daily operations.
  • Following that, we'll look at a troubleshooting flowchart to help define a process for leveraging this information.
  • We'll then wrap up the series with a couple of sample reports to demonstrate how to collect and present this information more conveniently.