
April 11 2010 10:39 AM

Written By: Cameron Fuller [MVP], Kerrie Meyler [MVP], John Joyner [MVP], and Andy Dominey

One of the most comprehensive ways to monitor VMware ESX hosts in an Operations Manager environment is with the nworks management pack (MP) from Veeam. Version 5.0.0 of the Veeam nworks MP is now available and contains some solid new functionality. Of key interest is the new ability to provide redundancy and load balancing among different Virtual Infrastructure Collectors (VICs) in the management group. VICs are like gateways for communication with ESX hosts and vCenter servers.

How to acquire and integrate the nworks management pack into your OpsMgr environment:

1.       Request a trial version of the management pack from http://www.veeam.com/vmware-microsoft-esx-monitoring.html.

2.       Veeam provides online access to their installation, deployment, and operations guides. These are available at http://www.veeam.com/vmware-microsoft-esx-monitoring.html.

3.       Read the installation guide prior to importing any of the management packs. It includes details for installing the components including the Management Center Server, the Management Center UI, the Virtual Infrastructure Collector(s), and importing the management packs.

4.       Create an nworks_Overrides management pack to contain any overrides required for the MP. (An overrides.xml file is included with the MP for your use, but you may want to create your own to maintain your own naming standards.)

Benefits of Monitoring VMware in OpsMgr

VMware has its own monitoring product that does a decent job of monitoring the VM servers and the VM guest systems, so the question is often asked: Why do we want to integrate this with Operations Manager? This was best answered for us when we ran into a strange client situation:

As background, this particular site had Operations Manager 2007 R2, VMware, and the nworks management pack installed. We got the call that one of our important business applications had just stopped responding. It was a large distributed application, so losing a server or a piece of the application would not have been shocking. However, having the entire application fail at the same time was extremely surprising. We tried to debug what was happening in real time, but by the time we started digging into the situation the application was functional again. About a week later the same thing happened, and again the application recovered within a short time.

To debug the root cause, we went back to the Operations Manager alerts that occurred during that timeframe. We found that about 3-5 minutes prior to our application outage there was a set of alerts indicating the VMware servers had lost the ability to communicate with their LUNs. Our distributed application is heavily virtualized; we realized that when we lost the ability to communicate with our LUNs, each of the virtual machines on those LUNs was also impacted. We had a tech validate our findings on the VMware side; they confirmed our findings. We also found that our SAN vendor had been on-site in our data center both times this occurred, so a root cause of the issue was identified.

While the "single-pane-of-glass" concept may be overused in many cases, this time it really hits the mark. We were able to identify a root cause of a non-repeatable outage only because we had a single view into what was occurring within our environment, and we could use this view to do things that we otherwise could not do. This becomes an extremely powerful concept as you bring together all these components within Operations Manager: Windows Servers, Unix/Linux Servers, VMware servers, distributed applications, network devices, and server hardware.

There are also some unexpected side-benefits of having this management pack in place. The “VMware: VM GuestOS vDisk is low on free space” alert is a great one to leave in place because it provides you with low disk space information even if there isn’t an OpsMgr agent on the system! This was also very useful as we started to pilot Virtual Desktop Infrastructure (VDI), as we could then determine what VDI images were low on disk space as well.

Another installation had repeated issues of slow performance on several of their VMware servers. The VMware Disk Utilization and CPU Utilization alerts made them aware that the root cause of the problem was that the images were overusing resources at specific peak times. They used this information to reorganize their images across the hosts.

Ultimately, using the VMware management pack in addition to the application and hardware vendor management packs takes monitoring from the physical hardware, through the virtual layer, through the OS, and down to the applications on the virtual guest, providing a truly holistic monitoring picture.

Challenges/Gotchas to Be Aware Of

1.       If you have an earlier version of the nworks management pack, do NOT upgrade the management pack without first removing the previous version of the management pack. The Veeam documentation states this, but having gone down this path, we recommend you try to avoid removing/editing/re-adding management pack dependencies. Completely remove the current version of the management pack and re-add it for any major version update - such as from 4.x to 5.x. Minor updates to the management pack - such as from 5.0.x to 5.5.x - should not require deleting the previous version of the management pack.

2.       In a large environment, this management pack can create a very large number of alerts. The values nworks uses to alert on are the same or similar values that a healthy VMware environment uses to keep itself tuned for health and reliability. If you have a VMware environment where you use HA and DRS (vmotion), the alerts are automatically resolved by VMware. This is not to say that the nworks management pack is not relevant in this situation, but it does involve some specific tuning recommendations included here:

For our VMware environment with HA and DRS configured, we disabled the following alerts because they were host specific:

  • Alert: nworks VMware: ESX Host Hardware Sensor has breached threshold from Source: AvgPwrIns1 for Power Distribution 1
  • Alert: nworks VMware: VM CPU Ready value has exceeded threshold from Source: _Total
  • Alert: nworks VMware: ESX Host Hardware Sensor has breached threshold
  • Alert: nworks VMware: ESX Cluster CPU Used has exceeded threshold
  • Alert: nworks vCenter: ESX Host Memory Alarm changed to Yellow

For our VMware environment with HA and DRS configured, we disabled the following alerts because they are addressed in Virtual Center: (we specifically disregard any processor or memory notifications on either the guest or host systems as they are often auto-resolved through vmotion)

  • Alert: nworks VMware: VMware Tools Service Status changed to Yellow
  • Alert: nworks VMware: VMware Tools Service Status changed to Red
  • Alert: nworks VMware: VM CPU Usage has exceeded threshold
  • Alert: nworks VMware: ESX Host Available Memory has dropped below threshold
  • Alert: nworks Collector: ESX Host overall status is 'Yellow'
  • Alert: nworks vCenter: ESX Host CPU Alarm changed to Yellow
  • Alert: nworks vCenter: Virtual Machine Memory Alarm changed to Yellow
  • Alert: nworks vCenter: VM CPU Alarm changed to Yellow
  • Alert: nworks VMware: ESX Host has exceeded threshold for CPU _Total Usage
  • Alert: nworks VMware: ESX Host CPU has exceeded usage threshold

Additional Tuning/Alerts to Look for in the nworks MP

Alert: nworks VirtualCenter: Failed to login user

Issue: Failed attempt to log into the VirtualCenter application.

Resolution: Disregarded, as the account used appears to have been a typo situation. Closed the alert, but left the rule active to track for conditions where multiple failed log attempts may occur.

Alert: nworks VMware: ESX Host Memory Swap Rate has exceeded threshold

Issue: Other alerts in the nworks management pack cover this condition and are more relevant. This alert measures how often a host is swapping memory, which is not as relevant as the nworks VirtualCenter: ESX Host Memory Alarm. These were generating a significant number of alerts in this management pack and were not actionable.

Resolution: Disabled the alerts via an override to the nworks_Overrides management pack. Veeam changes these to a Consecutive Sample in v5.5, which should change the frequency of this alert. If these alerts are due to spikes, the situation should go away in v5.5.
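To illustrate why a Consecutive Sample approach quiets spike-driven alerts, here is a minimal Python sketch. It is not Veeam's implementation: the function name, the threshold of 5, the required count of 3, and the sample data are all hypothetical values chosen for illustration.

```python
from typing import Iterable, List


def consecutive_breaches(samples: Iterable[float], threshold: float, required: int) -> List[int]:
    """Return the sample indexes at which an alert would be raised.

    An alert is raised once, at the moment `required` consecutive samples
    have exceeded `threshold`. Hypothetical illustration only; this is not
    the nworks MP logic.
    """
    alert_indexes = []
    streak = 0
    for index, value in enumerate(samples):
        if value > threshold:
            streak += 1
            if streak == required:
                alert_indexes.append(index)
        else:
            streak = 0  # any sample back under the threshold resets the count
    return alert_indexes


# Assumed data: two isolated spikes followed by one sustained breach.
swap_rate = [1, 9, 1, 1, 8, 1, 7, 8, 9, 9, 1]

single_sample_alerts = consecutive_breaches(swap_rate, threshold=5, required=1)
three_sample_alerts = consecutive_breaches(swap_rate, threshold=5, required=3)

print(single_sample_alerts)  # alerts on every spike
print(three_sample_alerts)   # alerts only on the sustained breach
```

With a single-sample check, each isolated spike raises its own alert. Requiring three samples in a row leaves only the sustained breach.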

Alert: nworks Virtual Infrastructure Collector

Issue: Just after installation of the nworks management pack this alert appeared.

Resolution: Used the nworks VIC Configuration tool to specify the VirtualCenter server, and started the collector service within the same user interface. Closed the alert in the OpsMgr console.

Alert: nworks VMware: VM CPU Usage has exceeded threshold

Issue: The default threshold in this management pack alerted as 35% for a warning and 38% for a critical level. These VMware systems are highly utilized on a daily basis making these thresholds too low for daily monitoring. These alerts were generating every few minutes in this environment.

Resolution: Created an override for all objects of type: vCPU to be 70% for a warning and 75% for a critical level. To do this, set Threshold2 to 75 and Threshold1 to 70, and store it in an nworks_Overrides management pack. In version 5.5 of the management pack, Veeam disables this alert and replaces it with a new one based on a consecutive sample with 90% generating a critical error.
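As a rough illustration of how the two override values interact, the following Python sketch maps a CPU usage sample to a state using a warning level (Threshold1) and a critical level (Threshold2). The function and the "greater than or equal" comparison are assumptions made for illustration; the management pack's actual comparison logic is not documented here.

```python
def classify_cpu_usage(usage_percent: float, threshold1: float = 70.0, threshold2: float = 75.0) -> str:
    """Map a CPU usage sample to a health state using two thresholds.

    threshold1 is the warning level and threshold2 the critical level,
    mirroring the Threshold1/Threshold2 override values. The function is a
    hypothetical illustration, not part of the nworks MP.
    """
    if threshold1 >= threshold2:
        raise ValueError("warning threshold must be lower than critical threshold")
    if usage_percent >= threshold2:
        return "critical"
    if usage_percent >= threshold1:
        return "warning"
    return "healthy"


# A VM running at 50% CPU: critical under the defaults (35/38),
# healthy under the overridden values (70/75).
with_defaults = classify_cpu_usage(50, threshold1=35, threshold2=38)
with_override = classify_cpu_usage(50)
print(with_defaults, with_override)
```

The same 50% sample moves from critical to healthy once the override is applied, which is why the default values produced an alert every few minutes on heavily used hosts.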

Alert: Failed Accessing Windows Event Log

Issue: This alert appeared just after installing the nworks management pack.

Resolution: Used the nworks VIC Configuration tool to specify the VirtualCenter server, and started the collector service within the same user interface. Closed the alert in the OpsMgr console.

Alert: nworks VMware: ESX Host Available Memory has dropped below threshold

Issue: Per my VMware SME, “All of these memory issues are false alarms. I'm watching them, and not a single one is even close. We might want to just disable this alarm for now.”

Resolution: Created an override to disable this alert and put it into the nworks_Overrides management pack. Veeam states this is fixed in Version 5.5 of the management pack.

Alert: nworks VMware: VM CPU Ready value has exceeded threshold

Issue: Warning and critical level alerts were being generated; this was the most common alert in the environment. The alert means that the virtual machine needs resources and the host has none available. Alerts fired at a value of 3 for warning and 5 for critical. In a production environment, this would indicate that the cluster is overwhelmed or out of resources. These values make sense in a production environment that is not overloaded. In this case the environment is overloaded, so the thresholds need to be adjusted, and the first step is to investigate what a median value for this counter is. We exported the performance data for this counter into Excel as an XML data file and found a spike to 126,185, among other spikes.

Resolution: Short term, we set the warning level to 10, as this represents a situation where resources are not sufficient, and the error level to 1000, as this represents a situation that should be mathematically impossible based upon the maximum of 100% (but we are still seeing it). Closed the warning and critical level alerts. Veeam states this issue is resolved in version 5.5 of the nworks management pack. The alert will have a new single threshold of 20% for warning and be based on a Consecutive Sample.
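The value of looking at a median rather than an average when the data contains erroneous spikes can be shown with a short Python sketch. Only the 126,185 spike comes from the environment described above; every other sample value is made up for illustration.

```python
import statistics

# Assumed CPU Ready samples. Only the 126185 spike is taken from the text;
# the remaining values are hypothetical.
cpu_ready = [2, 3, 4, 3, 5, 126185, 4, 3]

median_value = statistics.median(cpu_ready)
mean_value = statistics.mean(cpu_ready)

# Count how many samples would breach a warning level of 10.
over_warning = sum(1 for value in cpu_ready if value > 10)

print(median_value, mean_value, over_warning)
```

A single bad sample drags the mean into the thousands while the median stays close to the typical value, so the median is the safer basis for choosing a threshold.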

Alert: nworks VMware: ESX Host Hardware Sensor has breached threshold

Issue: Temperature sensors were reporting values in a range that raises an alert in nworks but not in either IBM Director or VMware Virtual Center. Tracking these down, we found that all of them came from a single chassis (chassis 1). We investigated the errors, but they appear to have no basis in reality on the systems reporting them. We checked through both VMware Virtual Center and the IBM Blade Chassis software to validate this.

Resolution: Created an override to disable these alerts and put them into the nworks_Overrides management pack.

Alert: nworks VirtualCenter: ESX Host CPU Alarm changed to Yellow

Issue: This is linked to other alerts such as high CPU, low memory, or other host bottlenecks. Since there is no way to tie together the alerts for CPU/Memory to the yellow/red state alerts, the best option is to disable the CPU alerting and use the green/yellow/red state monitors for the host systems.

Resolution: Disabled the nworks VMware: ESX Host CPU has exceeded usage threshold alert based upon this item. We do not want the alerts for the individual CPUs, but they are beneficial for the entire host (_Total counter). This did not disable either the cluster or the host total CPU monitoring. We have been informed that Version 5.5 will only use the _Total monitor.

Alert: nworks VirtualCenter: Enabling HA agent on Host

Issue: Multiple hosts in a VMware Cluster are set up through HA to talk between the multiple systems. Should one of them go down and stop responding to ‘hello’ it checks with its other neighbors, who identify who is isolated and perform a defined action. This is seen whenever a system goes into or out of maintenance mode, or when it is rebooted. This is part of the normal functionality of VMware.

Resolution: Create an override to disable this alert, as the HA agent is enabled as a result of manual actions that we perform, such as putting the system into maintenance mode or rebooting the system. This alert is disabled out of the box in version 5.5 of the management pack.

Alert: nworks VirtualCenter: HA Agent Disabled

Issue: Multiple hosts in a VMware Cluster are set up through HA to talk between the multiple systems. Should one of them go down and stop responding to "hello," it checks with its other neighbors, who identify who is isolated and perform a defined action. This is seen whenever a system goes into or out of maintenance mode, or when it is rebooted. This is part of the normal functionality of VMware.

Resolution: Create an override to disable this alert, as the HA agent is disabled as a result of manual actions that we perform, such as putting the system into maintenance mode or rebooting the system. These informational alerts indicate that someone is disabling the HA agent on a system, but most likely this is a result of going through maintenance mode or reboots. If a host were genuinely lost, the ESX overall host status would go red from an alert generated by nworks VirtualCenter: Host Connection Lost. This alert is disabled out of the box in version 5.5 of the management pack.

Alert: nworks VMware: VM CPU Usage has exceeded threshold

Issue: The nworks management pack detects high CPU usage levels on guest operating systems. This system was experiencing high CPU utilization but the guest OS is also monitored by OpsMgr, so this was double-alerting.

Resolution: Our first attempt was to create overrides to disable these notifications for the All Windows Computer Group, as notifications were already provided by the OpsMgr server operating system management pack. We did this for both the warning and critical level alerts, but it had no impact.

We then individually disabled these alerts for the systems generating the most alerts, as we were being double-notified (both from the OpsMgr MPs and from the nworks MP).

Alert: nworks VirtualCenter: VM Moved from Resource Pool

Issue: The alert indicates the move of a guest from one resource pool to another.

Resolution: For our environment, this alert was identified as not relevant enough to justify notification. Disabled the alert. This alert is disabled out of the box in version 5.5 of the management pack.

Alert: nworks VirtualCenter: Virtual Machine Memory Alarm changed to Yellow

Issue: The memory state for the virtual machine has reached a warning state.

Resolution: Disabled this alert, as we are monitoring memory via the in-guest agents deployed to the individual operating systems.

Alert: nworks VMware: VM CPU Usage has exceeded threshold

Issue: An individual virtual machine is going beyond the threshold set for allocated CPU resources.

Resolution: Disabled this alert, as we are monitoring processor usage via the in-guest agents deployed to the individual operating systems.

Alert: nworks VirtualCenter: Virtual Machine CPU Alarm changed to Yellow

Issue: An individual virtual machine is going beyond the threshold set for allocated CPU resources.

Resolution: Disabled this alert, as we are monitoring processor usage via the in-guest agents deployed to the individual operating systems.

Alert: nworks VirtualCenter: VM Disconnected in VC

Issue: This occurs when the host connection is lost, but another alert (nworks VirtualCenter: Host Connection Lost) also provides this notification, so the individual per-VM notifications are not relevant.

Resolution: Disabled these alerts through an override and manually closed the alerts. Our understanding is this alert is disabled out of the box in version 5.5 of the management pack.

Alert: nworks VMware: VM GuestOS vDisk is low on free space

Issue: The drive on the system was at 4% free disk space.

Resolution: Freed up additional disk space by removing the hibernation file (powercfg -h off). This monitor did not automatically reset itself, so we manually closed the alert.

Views

All the alert views have a criterion of Resolution State = New, rather than Resolution State Does not equal 255 (Resolved). You can build custom views to get around this; there is a feature request open with Veeam to change this, and our understanding is this is changed in version 5.5.
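The difference between the two view criteria can be sketched in Python. The alert records and the custom "Acknowledged" state with ID 85 are hypothetical; only the values 0 (New) and 255 are built-in resolution state IDs.

```python
from typing import Dict, List

NEW = 0            # built-in resolution state "New"
CLOSED = 255       # the state the text above refers to as 255
ACKNOWLEDGED = 85  # hypothetical custom resolution state defined by an administrator

# Hypothetical alert records for illustration.
alerts: List[Dict[str, object]] = [
    {"name": "VM GuestOS vDisk is low on free space", "state": NEW},
    {"name": "ESX Host Memory Swap Rate has exceeded threshold", "state": ACKNOWLEDGED},
    {"name": "Failed to login user", "state": CLOSED},
]


def names_where_new(items: List[Dict[str, object]]) -> List[object]:
    """Criterion used by the shipped views: Resolution State = New."""
    return [alert["name"] for alert in items if alert["state"] == NEW]


def names_where_not_closed(items: List[Dict[str, object]]) -> List[object]:
    """Criterion for a custom view: Resolution State does not equal 255."""
    return [alert["name"] for alert in items if alert["state"] != CLOSED]


default_view = names_where_new(alerts)
custom_view = names_where_not_closed(alerts)
print(default_view)
print(custom_view)
```

An alert moved to a custom state drops out of the "= New" view even though it is still open, while the "does not equal 255" view keeps it visible until it is closed.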

Acknowledgements

Special thanks go to the following members of the System Center community for their contributions to this ByExample guide: Robert Burleson, Graham Davies, Raymond Chou, and Kwan Thean.


on 4/11/2010 6:24:44 PM
Veeam/Nworks is a good product but sells much more to HP OpenView customers than to Microsoft System Center customers... this results in less specific functionality for the System Center user. Did you know that Veeam builds their MP so that it will fail over from one collector to another, but requires redundancies to avoid overloading SCOM?



This is why, in a short period of time, the BridgeWays VMware MP has come to be used by some of the largest System Center users: because of the depth of monitoring and seamless integration with Microsoft, the product is easier to install and learn. In the failover situation above, BridgeWays will generate an alert if the collector becomes unavailable and provides a task in the SCOM console for restarting it... just one of many examples of how BridgeWays was designed exclusively for the System Center user.



BridgeWays KB works self-contained in SCOM... Veeam relies on 3 external sites.



BridgeWays provides a much deeper look at CPU and memory usage (active vs consumed) from a given pool... etc



BridgeWays allows you to use SCOM and SCCM to trigger tickets and manage beyond the virtual layer and into the complete SLA service stack by providing views into the application, database, and Hypervisors such as VMware and HyperV.



For those MVP authors or Microsoft users that would like to learn more about BridgeWays MPs, look at our website or drop by at the 2010 MMS... we would enjoy meeting you.



http://www.bridgeways.ca/

on 4/12/2010 10:29:03 AM
We appreciate the feedback on the BridgeWays VMware MP.

This series focuses on tuning tips for selected management packs; this particular blog posting is specifically related to the Veeam nworks management pack, which we had the opportunity to deploy into production.

We are aware of BridgeWays and their management packs and often recommend them for clients.

on 4/12/2010 10:55:41 AM
I find Veeam to be a great product, but as you point out, like most deep MPs it takes some tuning effort. We purchased back when they were still nWorks. I guess I disagree with some of Tom's remarks. In particular, I guess I don't see how selling to the HP OpenView market matters. Both Veeam and Quest have been selling VMware monitoring solutions since MOM 2000 / 2005. I also see the online knowledge base as a strength, and in fact it is something we see in lots of good packs out there, including Microsoft's.



I have not used BridgeWays VMware MP (it may also be great for all I know), but not really looking to change.

on 4/12/2010 12:10:24 PM
As the Product Manager for nworks and the author of the nworks MP, let me answer Tom@Bridgeways points.



Re: how nworks has 'less functionality for the System Center user'...

Not true, actually. I wrote the MP from the ground up for Ops Mgr 2007, and there is a lot of key functionality that leverages specific Ops Mgr features - such as our integration of data from a VMware VM with the Ops Mgr agent running inside that VM.

Pretty specific to Ops Mgr, that one :-)



Re: failover - not sure what you mean about "overloading SCOM". We have a scalable centrally managed distributed architecture - that's how we avoid 'overloading SCOM'. And we have fault-tolerant HA monitoring capability, because this is a requirement for enterprise customers.



Re: Knowledge Base - yes indeed, our KB includes links to external articles. As most good KBs should. However we also have plenty of built-in knowledge contained there, leveraged from our years in the market and our status as VMware partners with 'VMware Ready' certification.



Re: a 'deep view' of CPU and memory - we have 80+ performance metrics in the MP. Fifteen specifically on memory for VMs.

And more on Resource Pools, on Hosts, on Clusters, and we watch for 150+ events from VC....I think we have some depth :-)



The final point on 'managing beyond the virtual layer' - I already addressed in my first point I believe. The nworks MP will integrate VMware monitoring data with the apps and services running inside a VM. Allowing true 'end-to-end' monitoring perspective.



We also will be at MMS, previewing our new nworks 5.5 - more enhancements to our core Ops Mgr MP for VMware, and now featuring our PRO Pack integration - for automation of VMware management in System Center Virtual Machine Manager. See you there! ;-)



Alec King

Senior Product Manager

Veeam Software

on 4/13/2010 3:45:14 PM
We have been using Veeam Nworks since January 2010, and have been not at all happy with this product or their customer support.

Since we deployed, we are still struggling to tune the alerts. Our VM admins have configured Outlook to send the alerts to Deleted Items.

Thanks for this article, I am going to give it a thorough read again and work on tuning the alerts accordingly.

on 4/14/2010 1:22:50 AM
Hi Sameer,

I think we may have communicated on the Veeam forums before? If so, I know you have encountered two issues with the nworks version you have now:

1. Certain metrics sometimes report very high values.

2. Certain events sometimes are repeated

Both of these can cause duplicate/false alerts.



I can tell you that both of these issues are fixed in v5.5. We will have a Release Candidate for 5.5 available next week, and full release planned very soon after.



I'm sorry you haven't had a 100% satisfactory experience with nworks so far - but please engage with our support again, and we can get you the 5.5 build (RC or GA) as soon as possible.

Please feel free to CC me on your communications - alec dot king at veeam dot com.



Regards

Alec

on 4/15/2010 4:54:00 AM
Great article. I also disabled most of the VM monitors, because a SCOM Agent is installed on all servers, physical and virtual.

Too bad, disabling VM monitoring through the nWorks management Console didn't do the trick.



I also blogged about this: http://michielw.blogspot.com/2010/04/nworks-vmware-management-pack-donts.html

Read it before you go tweak the VM discovery ;)



I wonder how much space is eaten by the nWorks VMware MP in your environments. I saw occasions where this MP used about 15% of used DB space. You can view this easily by using the latest SCOM R2 Management Pack. This contains a new report, Data Volume by Management Pack. When you follow Cameron's advice and disable some VM monitors, DB instance usage for this MP will likely decrease.



Best regards,

Michiel Wouters

on 4/15/2010 5:26:02 AM
Hi Michiel,



Thanks for the comments, and the blog post too - and it shows that some of the new features I'm introducing in nworks 5.5 will be welcome! :-)



I'd say you would still get some interesting info, with both nworks VM monitoring, and SCOM Agent inside the VM guest OS. For example, nworks would tell you about hypervisor metrics such as balloon memory usage, cpu Ready times, and swapfile I/O - the SCOM Agent can't tell you these things.



However - I do know that there can be some duplication of data, and for various reasons some customers want to disable some or all VM monitoring, and use nworks for just the ESX Hosts, their hardware, vCenter events etc.



In nworks 5.5, you have complete control over the discoveries - you can globally disable VM discovery in the nworks UI. And you can also disable VM discovery globally, or on a per-Host basis, using overrides in the MP (it would be discovery rule 'SV102 Stage 3' BTW - these details are in the new 5.5 documentation :-))

You can even disable just the discovery of VM vNICs, or VM storage links (vmhba) and so on. It is completely configurable.



You also now have complete control over performance data collection intervals - this means that core metrics (such as cpu and memory) can be gathered on the most frequent interval, e.g. 5 minutes - but less critical metrics such as network traffic can be averaged over 4, or 5, or more intervals. This means you still have the deep-dive metrics - but you get better scalability and performance, and use less space in SCOM DB as well.



And finally - all our performance collection rules now use Ops Mgr's Optimised Providers, which only deliver a new data point if it exceeds a certain deviation from the previous data point. Again - this means that you really save on DB space, while still having all the relevant metrics for graphs and reports.



With nworks 5.5 we will be introducing two new Deployment Toolkit items - a calculator to predict SCOM Database usage, and a calculator to predict number of Collectors required. These will both be available online at veeam.com.



I hope you can give nworks 5.5 a try - if you'd like to see it just let me know.



Thanks!

Alec

on 4/16/2010 5:15:52 PM
Michiel,



I absolutely agree with Alec! Great blog!



Please note that I have added a comment on your blog giving some direction on how to remove VMs from the initial discovery. However, as Alec said, there are some compelling reasons to monitor those VMs. If you ever have any questions regarding the nworks MP for VMware, feel free to contact us at any time on the Veeam forums!



www.veeam.com



Thanks!



Brian Pavnick

Veeam Software

Solutions Architect - MP

brian.pavnick@veeam.com

twitter: vbpav

veeam forums: vbpav

on 4/22/2010 8:10:23 PM
Monitor VMware for free via vCenter/vSphere with System Center Central gold sponsor, Quest Software's QMX - Operations Manager Edition:

http://www.management-extensions.org/entry.jspa?externalID=100338&categoryID=252
