Written By: Cameron Fuller [MVP], Kerrie Meyler [MVP], John Joyner [MVP], and Andy Dominey
One of the most comprehensive ways to monitor VMware ESX hosts in an Operations Manager environment is with the nworks management pack (MP) from Veeam. Version 5.0.0 of the Veeam nworks MP is now available and contains some solid new functionality. Of key interest is the new ability to provide redundancy and load balancing among different Virtual Infrastructure Collectors (VICs) in the management group. VICs are like gateways for communication with ESX hosts and vCenter servers.
How to acquire and integrate the nworks management pack into your OpsMgr environment:
1. Request a trial version of the management pack from http://www.veeam.com/vmware-microsoft-esx-monitoring.html.
2. Veeam provides online access to their installation, deployment, and operations guides. These are available at http://www.veeam.com/vmware-microsoft-esx-monitoring.html.
3. Read the installation guide prior to importing any of the management packs. It includes details for installing the components including the Management Center Server, the Management Center UI, the Virtual Infrastructure Collector(s), and importing the management packs.
4. Create an nworks_Overrides management pack to contain any overrides required for the MP. (An overrides.xml file is included with the MP for your use, but you may want to create your own to maintain your own naming standards.)
Benefits of Monitoring VMware in OpsMgr
VMware has its own monitoring product that does a decent job of monitoring the VM servers and the VM guest systems, so the question is often asked: Why do we want to integrate this with Operations Manager? This was best answered for us when we ran into a strange client situation:
As background, this particular site had Operations Manager 2007 R2, VMware, and the nworks management pack installed. We got the call that one of our important business applications had just stopped responding. It was a large distributed application, so losing a single server or a piece of the application would not have been shocking. However, having the entire application fail at the same time was extremely surprising. We tried to debug what was happening in real time, but by the time we started digging into the situation the application was functional again. About a week later the same thing happened, and once again the application recovered within a short time.
To debug the root cause, we went back to the Operations Manager alerts that occurred during that timeframe. We found that about 3-5 minutes prior to our application outage there was a set of alerts indicating the VMware servers had lost the ability to communicate with their LUNs. Our distributed application is heavily virtualized; we realized that when we lost the ability to communicate with our LUNs, each of the virtual machines on those LUNs was also impacted. We had a tech validate our findings on the VMware side; they confirmed our findings. We also found that our SAN vendor had been on-site in our data center both times this occurred, so a root cause of the issue was identified.
While the "single-pane-of-glass" concept may be overused in many cases, this time it really hits the mark. We were able to identify a root cause of a non-repeatable outage only because we had a single view into what was occurring within our environment, and we could use this view to do things that we otherwise could not do. This becomes an extremely powerful concept as you bring together all these components within Operations Manager: Windows Servers, Unix/Linux Servers, VMware servers, distributed applications, network devices, and server hardware.
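The correlation step in this story, looking back a few minutes from the start of an outage to see what else alerted, can be sketched in a few lines. This is only an illustration under stated assumptions: the alert records, their field names, and the timestamps are hypothetical sample data, not an export format produced by Operations Manager or the nworks MP.

```python
from datetime import datetime, timedelta


def alerts_before(alerts, outage_start, window_minutes=5):
    """Return alerts raised in the window just before outage_start, oldest first."""
    window_start = outage_start - timedelta(minutes=window_minutes)
    hits = [a for a in alerts if window_start <= a["raised"] < outage_start]
    return sorted(hits, key=lambda a: a["raised"])


# Hypothetical sample data: names and times are made up for illustration.
outage = datetime(2010, 6, 1, 14, 30)
alerts = [
    {"name": "Host lost access to LUN", "raised": datetime(2010, 6, 1, 14, 26)},
    {"name": "Low disk space", "raised": datetime(2010, 6, 1, 9, 0)},
    {"name": "Application not responding", "raised": datetime(2010, 6, 1, 14, 31)},
]

preceding = alerts_before(alerts, outage)
for alert in preceding:
    print(alert["raised"], alert["name"])
```

Only the storage alert falls inside the five-minute window, which mirrors how the LUN alerts stood out once all sources were visible in one place.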
There are also some unexpected side-benefits of having this management pack in place. The “VMware: VM GuestOS vDisk is low on free space” alert is a great one to leave in place because it provides you with low disk space information even if there isn’t an OpsMgr agent on the system! This was also very useful as we started to pilot Virtual Desktop Infrastructure (VDI), as we could then determine what VDI images were low on disk space as well.
Another installation had repeated issues of slow performance on several of their VMware servers. The VMware Disk Utilization and CPU Utilization alerts made them aware that the root cause of the problem was that the images were overusing resources at specific peak times. They used this information to reorganize their images across the hosts.
Ultimately, using the VMware management pack alongside the application and hardware vendor management packs extends monitoring from the physical hardware, through the virtual layer and the OS, down to the applications on the virtual guest, providing a truly holistic monitoring picture.
Challenges/Gotchas to be aware of
1. If you have an earlier version of the nworks management pack, do NOT upgrade the management pack without first removing the previous version. The Veeam documentation states this, and having gone down this path ourselves, we recommend you avoid removing, editing, and re-adding management pack dependencies. Completely remove the current version of the management pack and re-add it for any major version update, such as from 4.x to 5.x. Minor updates, such as from 5.0.x to 5.5.x, should not require deleting the previous version of the management pack.
2. In a large environment, this management pack can create a very large number of alerts. The values nworks alerts on are the same as or similar to the values that a healthy VMware environment uses to keep itself tuned for health and reliability. If you have a VMware environment that uses HA and DRS (vMotion), the conditions behind these alerts are automatically resolved by VMware. This is not to say that the nworks management pack is not relevant in this situation, but it does call for some specific tuning, and our recommendations are included here:
For our VMware environment with HA and DRS configured, we disabled the following alerts because they were host specific:
- Alert: nworks VMware: ESX Host Hardware Sensor has breached threshold from Source: AvgPwrIns1 for Power Distribution 1
- Alert: nworks VMware: VM CPU Ready value has exceeded threshold from Source: _Total
- Alert: nworks VMware: ESX Host Hardware Sensor has breached threshold
- Alert: nworks VMware: ESX Cluster CPU Used has exceeded threshold
- Alert: nworks vCenter: ESX Host Memory Alarm changed to Yellow
For our VMware environment with HA and DRS configured, we disabled the following alerts because they are addressed in Virtual Center. (We specifically disregard any processor or memory notifications on either the guest or host systems, as they are often auto-resolved through vMotion.)
- Alert: nworks VMware: VMware Tools Service Status changed to Yellow
- Alert: nworks VMware: VMware Tools Service Status changed to Red
- Alert: nworks VMware: VM CPU Usage has exceeded threshold
- Alert: nworks VMware: ESX Host Available Memory has dropped below threshold
- Alert: nworks Collector: ESX Host overall status is 'Yellow'
- Alert: nworks vCenter: ESX Host CPU Alarm changed to Yellow
- Alert: nworks vCenter: Virtual Machine Memory Alarm changed to Yellow
- Alert: nworks vCenter: VM CPU Alarm changed to Yellow
- Alert: nworks VMware: ESX Host has exceeded threshold for CPU _Total Usage
- Alert: nworks VMware: ESX Host CPU has exceeded usage threshold
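Before deciding which alerts to disable or retune, it can help to rank them by volume. The sketch below assumes you have some export of raised alert names, one entry per alert; the list here is hypothetical sample data, and the counts are invented for illustration.

```python
from collections import Counter


def noisiest(alert_names, top=3):
    """Rank alert names by how often they were raised, most frequent first."""
    return Counter(alert_names).most_common(top)


# Hypothetical export: one entry per alert raised during the review period.
raised = (
    ["nworks VMware: VM CPU Ready value has exceeded threshold"] * 4
    + ["nworks VMware: VM CPU Usage has exceeded threshold"] * 2
    + ["nworks vCenter: ESX Host Memory Alarm changed to Yellow"]
)

ranking = noisiest(raised, top=2)
for name, count in ranking:
    print(count, name)
```

The alerts at the top of such a ranking are the first candidates for an override, a threshold change, or a conversation with the VMware administrators.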
Additional Tuning/Alerts to Look for in the nworks MP
Alert: nworks VirtualCenter: Failed to login user
Issue: Failed attempt to log into the VirtualCenter application.
Resolution: Disregarded, as the failed login appears to have been the result of a typo. Closed the alert, but left the rule active to catch conditions where multiple failed login attempts occur.
Alert: nworks VMware: ESX Host Memory Swap Rate has exceeded threshold
Issue: Other alerts in the nworks management pack cover this condition and are more relevant. This alert measures how often a host is swapping memory to disk, which is not as relevant as the nworks VirtualCenter: ESX Host Memory Alarm. These alerts were being generated in significant numbers and were not actionable.
Resolution: Disabled the alerts via an override stored in the nworks_Overrides management pack. Veeam changes these to a Consecutive Sample in v5.5, which should reduce how often this alert fires. If these alerts are due to spikes, the situation should go away in v5.5.
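To see why a consecutive-sample condition quiets spike-driven alerts, compare it with alerting on every sample over the threshold. This is a generic sketch of the idea, not Veeam's implementation: the function names, the sample values, the threshold of 90, and the requirement of three samples in a row are all assumptions chosen for illustration.

```python
def single_sample_alerts(samples, threshold):
    """Count alerts when every sample over the threshold raises one."""
    return sum(1 for s in samples if s > threshold)


def consecutive_sample_alerts(samples, threshold, required=3):
    """Count alerts when `required` samples in a row must exceed the threshold."""
    alerts = 0
    streak = 0
    for s in samples:
        if s > threshold:
            streak += 1
            if streak == required:  # alert once per sustained breach
                alerts += 1
        else:
            streak = 0
    return alerts


# Hypothetical counter values: two isolated spikes, then one sustained breach.
samples = [10, 95, 12, 11, 96, 10, 91, 92, 93, 94, 15]

print(single_sample_alerts(samples, threshold=90))
print(consecutive_sample_alerts(samples, threshold=90, required=3))
```

With this data the single-sample rule fires six times while the consecutive-sample rule fires once, for the sustained breach only.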
Alert: nworks Virtual Infrastructure Collector
Issue: Just after installation of the nworks management pack this alert appeared.
Resolution: Used the nworks VIC Configuration tool to specify the VirtualCenter server, and started the collector service within the same user interface. Closed the alert in the OpsMgr console.
Alert: nworks VMware: VM CPU Usage has exceeded threshold
Issue: The default thresholds in this management pack are 35% for a warning and 38% for a critical level. These VMware systems are highly utilized on a daily basis, making these thresholds too low for daily monitoring. Alerts were being generated every few minutes in this environment.
Resolution: Created an override for all objects of type: vCPU to be 70% for a warning and 75% for a critical level. To do this, set Threshold2 to 75 and Threshold1 to 70, and store it in an nworks_Overrides management pack. In version 5.5 of the management pack, Veeam disables this alert and replaces it with a new one based on a consecutive sample with 90% generating a critical error.
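The effect of this override can be illustrated with a small two-threshold classifier. Threshold1 (warning) and Threshold2 (critical) and their values come from the text above; the function itself, the "at or above" comparison, and the sample readings are assumptions for illustration, since the exact comparison the monitor uses is not stated here.

```python
DEFAULTS = {"Threshold1": 35, "Threshold2": 38}   # warning, critical (MP defaults)
OVERRIDES = {"Threshold1": 70, "Threshold2": 75}  # values stored in nworks_Overrides


def classify(cpu_percent, thresholds):
    """Map a CPU usage reading to a state using warning/critical thresholds."""
    if cpu_percent >= thresholds["Threshold2"]:
        return "critical"
    if cpu_percent >= thresholds["Threshold1"]:
        return "warning"
    return "healthy"


# Hypothetical readings from a busy but normally operating VM.
readings = [36, 40, 55, 72, 80]

for value in readings:
    print(value, classify(value, DEFAULTS), classify(value, OVERRIDES))
```

Under the defaults every one of these readings raises an alert; with the override only the 72% and 80% readings do.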
Alert: Failed Accessing Windows Event Log
Issue: This alert appeared just after installing the nworks management pack.
Resolution: Used the nworks VIC Configuration tool to specify the VirtualCenter server, and started the collector service within the same user interface. Closed the alert in the OpsMgr console.
Alert: nworks VMware: ESX Host Available Memory has dropped below threshold
Issue: Per my VMware SME, “All of these memory issues are false alarms. I'm watching them, and not a single one is even close. We might want to just disable this alarm for now.”
Resolution: Created an override to disable this alert and put it into the nworks_Overrides management pack. Veeam states this is fixed in Version 5.5 of the management pack.
Alert: nworks VMware: VM CPU Ready value has exceeded threshold
Issue: Warning and critical level alerts were being generated; this was the #1 most common alert in the environment. The alert means that the virtual machine is in need of resources and the host does not have any available. We received a value of 3 for warning and a value of 5 for critical. In a production environment, this would indicate that the cluster is overwhelmed, that is, out of resources. These values make sense in a production environment that is not overwhelmed. In this case the environment is overloaded, so they need to be adjusted, starting with an investigation of what a median value for this counter is. We exported the performance data for this counter into Excel as an XML data file and found, among other spikes, one that reached 126,185.
Resolution: Short term, we set the warning level to 10, as this represents a situation where resources are not sufficient, and the error level to 1000, as this represents a situation that should be mathematically impossible based upon the maximum of 100% (but we are still seeing it). Closed the warning and critical level alerts. Veeam states this issue is resolved in version 5.5 of the nworks management pack, where the alert will have a single new threshold of 20% for warning and be based on a Consecutive Sample.
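A median is the right summary to look at here because a handful of extreme spikes barely moves it, whereas they dominate an average. The sketch below shows the difference; only the 126,185 spike is taken from the text, and the remaining sample values are hypothetical.

```python
from statistics import mean, median

# Hypothetical CPU Ready samples; only the 126185 spike comes from the exported data.
ready_values = [2, 3, 4, 3, 5, 126185, 4, 2, 3]

typical = median(ready_values)
average = mean(ready_values)

print("median:", typical)
print("mean:", round(average, 2))
```

Here the median stays at a typical value while the mean is pulled into the thousands by a single sample, so a threshold chosen from the mean would be meaningless.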
Alert: nworks VMware: ESX Host Hardware Sensor has breached threshold
Issue: Temperature sensors are reporting values in a range that raises an alert in nworks but does not appear to raise one in either IBM Director or VMware Virtual Center. Tracking these down, we found that all of them were reporting from a single chassis (chassis 1). We investigated these errors, but they seem to have no basis in reality on the systems reporting them. We checked both VMware Virtual Center and the IBM Blade Chassis software to validate this.
Resolution: Created an override to disable these alerts and put them into the nworks_Overrides management pack.
Alert: nworks VirtualCenter: ESX Host CPU Alarm changed to Yellow
Issue: This is linked to other alerts such as high CPU, low memory, or other host bottlenecks. Since there is no way to tie together the alerts for CPU/Memory to the yellow/red state alerts, the best option is to disable the CPU alerting and use the green/yellow/red state monitors for the host systems.
Resolution: Disabled the nworks VMware: ESX Host CPU has exceeded usage threshold alert based upon this item. We do not want alerts for the individual CPUs, but they are beneficial for the entire host (the _Total counter). This did not disable either the cluster or the host total CPU monitoring. We have been informed that version 5.5 will only use the _Total monitor.
Alert: nworks VirtualCenter: Enabling HA agent on Host
Issue: Multiple hosts in a VMware cluster are set up through HA to communicate with one another. Should one of them go down and stop responding to ‘hello’, the other hosts check with their neighbors, identify which host is isolated, and perform a defined action. This is seen whenever a system goes into or out of maintenance mode, or when it is rebooted. This is part of the normal functionality of VMware.
Resolution: Create an override to disable this alert, as the HA agent is enabled as a result of manual actions that we are performing, such as putting the system into maintenance mode or rebooting the system. This alert is disabled out of the box in version 5.5 of the management pack.
Alert: nworks VirtualCenter: HA Agent Disabled
Issue: Multiple hosts in a VMware cluster are set up through HA to communicate with one another. Should one of them go down and stop responding to "hello," the other hosts check with their neighbors, identify which host is isolated, and perform a defined action. This is seen whenever a system goes into or out of maintenance mode, or when it is rebooted. This is part of the normal functionality of VMware.
Resolution: Create an override to disable this alert, as the HA agent is disabled as a result of manual actions that we are performing, such as putting the system into maintenance mode or rebooting the system. These informational alerts indicate that someone is disabling the HA agent on a system, but most likely this is a result of going through maintenance mode or reboots. If a host really is lost, the ESX overall host status would go red from an alert generated by nworks VirtualCenter: Host Connection Lost. This alert is disabled out of the box in version 5.5 of the management pack.
Alert: nworks VMware: VM CPU Usage has exceeded threshold
Issue: The nworks management pack detects high CPU usage levels on guest operating systems. This system was experiencing high CPU utilization but the guest OS is also monitored by OpsMgr, so this was double-alerting.
Resolution: Our first attempt was to create overrides to disable these notifications for the All Windows Computer Group, as notifications were provided by the OpsMgr server operating system management pack. We applied this to both the warning and critical level alerts, but it did not have an impact.
We then individually disabled the following alerts for the systems that were generating the most alerts, as we were being double-notified (both from the OpsMgr MPs and from the nworks MP).
Alert: nworks VirtualCenter: VM Moved from Resource Pool
Issue: The alert indicates the move of a guest from one resource pool to another.
Resolution: For our environment, this alert was identified as not relevant enough to justify notification. Disabled the alert. This alert is disabled out of the box in version 5.5 of the management pack.
Alert: nworks VirtualCenter: Virtual Machine Memory Alarm changed to Yellow
Issue: The memory state for the virtual machine has reached a warning state.
Resolution: Disabled this alert, as we are monitoring memory via the in-guest agents deployed to the individual operating systems.
Alert: nworks VMware: VM CPU Usage has exceeded threshold
Issue: An individual virtual machine is going beyond the threshold set for allocated CPU resources.
Resolution: Disabled this alert, as we are monitoring processor usage via the in-guest agents deployed to the individual operating systems.
Alert: nworks VirtualCenter: Virtual Machine CPU Alarm changed to Yellow
Issue: An individual virtual machine is going beyond the threshold set for allocated CPU resources.
Resolution: Disabled this alert, as we are monitoring processor usage via the in-guest agents deployed to the individual operating systems.
Alert: nworks VirtualCenter: VM Disconnected in VC
Issue: This occurs when the host connection is lost, but another alert (nworks VirtualCenter: Host Connection Lost) also provides this notification, so the individual per-VM notifications are not relevant.
Resolution: Disabled these alerts through an override and manually closed the alerts. Our understanding is that this alert is disabled out of the box in version 5.5 of the management pack.
Alert: nworks VMware: VM GuestOS vDisk is low on free space
Issue: The drive on the system was at 4% free disk space.
Resolution: Freed up additional disk space by disabling hibernation, which removes the hibernation file (powercfg -h off). This monitor did not automatically reset itself, so we manually closed the alert.
Views
All the alert views have a criterion of Resolution State = New, rather than Resolution State does not equal 255 (Closed). You can build custom views to get around this. There is a feature request open with Veeam to change this, and our understanding is that it is changed in version 5.5.
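The practical difference between the two criteria shows up as soon as an alert is moved to any state other than New. In the sketch below, 0 and 255 are the built-in Operations Manager resolution state values for New and Closed; the custom state 85 labeled "Acknowledged" and the alert records themselves are hypothetical.

```python
RESOLUTION_STATE_NEW = 0       # built-in OpsMgr value for New
RESOLUTION_STATE_CLOSED = 255  # built-in OpsMgr value for Closed

# Hypothetical alerts; 85 stands in for a custom "Acknowledged" state.
alerts = [
    {"name": "VM GuestOS vDisk is low on free space", "resolution_state": 0},
    {"name": "Host Connection Lost", "resolution_state": 85},
    {"name": "Failed to login user", "resolution_state": 255},
]

# Criterion used by the shipped views: only alerts still in the New state.
new_only = [a for a in alerts if a["resolution_state"] == RESOLUTION_STATE_NEW]

# Criterion for a custom view: everything that has not been closed.
not_closed = [a for a in alerts if a["resolution_state"] != RESOLUTION_STATE_CLOSED]

print([a["name"] for a in new_only])
print([a["name"] for a in not_closed])
```

An alert that someone has acknowledged but not yet closed drops out of the first list while it is still an open problem, which is why a custom view with the second criterion is worth building.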
Acknowledgements
Special thanks go to the following members of the System Center community for their contributions to this ByExample guide: Robert Burleson, Graham Davies, Raymond Chou, and Kwan Thean.