Tuning the Veeam V7 management pack for VMware alerts

This is not a full “by example” type article, but I’ve been spending some time with the V7 management pack and have a few key points that I thought would be worth sharing. This blog post covers a quick method to identify the most common alerts, walks through three alerts that we tuned, and introduces one important Veeam reporting trick.

Identifying alerts for tuning:

As a starting point, we used the Most Common Alerts report from the Microsoft Generic Report Library to quickly identify what is generating the most alerts. By scoping the report to the Veeam VMware management packs, you can see only the alerts from those specific packs. An example of the top 5 alerts (not restricted to the Veeam management packs) is shown below.
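If you would rather get a rough count outside the reporting console, here is a minimal Python sketch. It assumes you have exported an alert view to CSV; the file name alerts.csv and the Name column are assumptions, so adjust them to match whatever your export actually contains.

```python
import csv
from collections import Counter

def top_alerts(path="alerts.csv", top_n=5):
    """Count alerts by name from a CSV export of an alert view.

    The file name and the "Name" column are assumptions -- adjust them
    to match the columns present in your exported view.
    """
    with open(path, newline="", encoding="utf-8") as f:
        names = [row["Name"] for row in csv.DictReader(f)]
    return Counter(names).most_common(top_n)

if __name__ == "__main__":
    for name, count in top_alerts():
        print(f"{count:6d}  {name}")
```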

 

Alert tuning in the Veeam V7 management pack:

During our testing, three of the most common alerts were all related to Virtual Machines in the environment. This blog post covers the following alerts:

 

  • Veeam VMware: Virtual Machine Compute Latency Analysis
  • Veeam VMware: Virtual Machine CPU Usage Analysis
  • Veeam VMware: Virtual Machine Storage Latency Analysis

 

The first alert we worked on from this management pack was the “Veeam VMware: Virtual Machine Storage Latency Analysis” alert.

Alert: Veeam VMware: Virtual Machine Storage Latency Analysis

Management Pack Name: Veeam VMware Monitoring

Management Pack Version: 7.0.0.1862

Rule or Monitor? Monitor

Rule or Monitor Name: Veeam VMware: Virtual Machine Storage Latency Analysis

Rule or Monitor Notes: This monitor checks the VMGuest-disk object, diskLatency counter, for the InstanceName of _Total. The warning threshold is 40, and the error threshold is 80. The counter is checked over a total of 3 samples (NumSamples).

Issue: This monitor is generating thousands of alerts, making it impractical to sort through other issues in the environment. Our attempt to tune it by increasing the sample count was not successful, because the monitor appears to evaluate the average value of the counter rather than the latest value. As a result, when a huge spike occurs it skews the average for a significant period of time, even after the counter has returned to normal.

Causes: It has not yet been determined where the root cause of this condition lies (it may be in the VMs, the virtualization layer, or the storage layer, for example). What has been determined is that this is currently normal for the client: it occurs about half a dozen times a day in this environment, but only for a single sample when data is gathered (implying that each occurrence lasts less than 5 minutes).

 

This was occurring across a large number of virtual machines in the same timeframe. A performance chart for these counters is shown below:

[Performance chart: diskLatency spike occurring across multiple virtual machines at the same time]

Zooming in on a single spike from the chart above shows the data points that were collected (12:17, 12:22, 12:27). Only one of these data points (12:22) shows the spike.

 

What we found was that a single spike pushed the average for this value outside the boundaries even though the condition was no longer occurring and did not persist for the full 3 samples (we tested tuning this by increasing the sample count, with no success). A short sketch contrasting average-based and last-value evaluation follows at the end of this section.

Resolution: Our temporary workaround to limit the number of alerts on this condition was to override the default thresholds to a warning of 75 and an error of 150, matching how this is monitored in the client’s VMware environment.

Recommendation for enhancement: This monitor should evaluate the last value rather than the average value. Using the last value would allow the number of samples to be tuned so that conditions like this could easily be suppressed if required.
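To make the average-versus-last-value point concrete, here is a minimal sketch. The latency values are invented (only the 12:22 collection spikes), and the 40/80 thresholds and 3-sample count come from the monitor notes above; this illustrates the behaviour we observed, not the management pack’s actual code.

```python
# Hypothetical diskLatency (_Total) samples around the spike we captured.
samples = [("12:17", 5), ("12:22", 250), ("12:27", 5)]
values = [v for _, v in samples]

WARNING, ERROR, NUM_SAMPLES = 40, 80, 3

# Average-based evaluation (the behaviour we observed): a single spike drags
# the 3-sample average over both thresholds even though two of the three
# samples are healthy.
avg = sum(values[-NUM_SAMPLES:]) / NUM_SAMPLES
state = "error" if avg >= ERROR else "warning" if avg >= WARNING else "healthy"
print(f"average of last {NUM_SAMPLES} samples = {avg:.0f} -> {state}")

# Last-value evaluation (the recommendation above): the monitor would only
# trip if every one of the last NUM_SAMPLES collections breached the threshold.
sustained = all(v >= WARNING for v in values[-NUM_SAMPLES:])
print(f"all of the last {NUM_SAMPLES} samples breached the threshold: {sustained}")
```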

 

The next most common alert was related to CPU Usage on the Virtual Machines.

 

Alert: Veeam VMware: Virtual Machine CPU Usage Analysis

Management Pack Name: Veeam VMware Monitoring

Management Pack Version: 7.0.0.1862

Rule or Monitor? Monitor

Rule or Monitor Name: Veeam VMware: Virtual Machine CPU Usage Analysis

Rule or Monitor Notes: This monitor checks the VMGuest-Cpu object, cpuUsedPct counter, for the InstanceName of _Total. The warning threshold is 80%, and the error threshold is 90%.

Issue: These alerts indicate an issue with the CPU configuration of the virtual machines on which the errors are occurring. We generated a dashboard showing the performance counters over a 24-hour and a 72-hour timeframe to determine which systems were consistently above the 80% or 90% thresholds (a small sketch of this check follows at the end of this section).

Causes: N/A

Resolution: We used the Veeam Virtual Machines: Right-sizing – VMs Undersized for Memory and CPU report to confirm that each of the top 3 VMs shown in the CPU dashboards above was in fact undersized for its CPU configuration. Our current plan is to resize these based on the report recommendations shown below.
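For the dashboard check described in the Issue section above, the sketch below shows the underlying idea in Python: given a series of cpuUsedPct samples (invented here), work out what fraction of a window a VM spent at or above the 80% and 90% thresholds, which separates consistently undersized VMs from ones that only spike briefly.

```python
import random

def pct_of_time_above(samples, threshold):
    """Percentage of samples at or above a threshold."""
    return 100 * sum(s >= threshold for s in samples) / len(samples)

# Invented cpuUsedPct (_Total) histories, one sample every 5 minutes for 24 hours.
random.seed(1)
consistently_hot = [random.uniform(82, 98) for _ in range(288)]      # undersized VM
briefly_spiking = [random.uniform(20, 60) for _ in range(280)] + \
                  [random.uniform(85, 95) for _ in range(8)]         # short spike only

for name, history in [("consistently_hot", consistently_hot),
                      ("briefly_spiking", briefly_spiking)]:
    print(f"{name}: {pct_of_time_above(history, 80):.0f}% of the day >= 80%, "
          f"{pct_of_time_above(history, 90):.0f}% of the day >= 90%")
```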

The third alert that we worked with was related to virtual machine compute latency. Details are below:

Alert: Veeam VMware: Virtual Machine Compute Latency Analysis

Management Pack Name: Veeam VMware Monitoring

Management Pack Version: 7.0.0.1862

Rule or Monitor? Monitor

Rule or Monitor Name: Veeam VMware: Virtual Machine Compute Latency Analysis

Rule or Monitor Notes: This monitor watches two thresholds: MemoryLatencyPctThreshold (the threshold is 10, the object is VMGuest-Memory, the counter is memoryLatencyPct, and the instance is _Total) and CPULatencyPctThreshold (the threshold is 10, the object is VMGuest-CPU, the counter is cpuLatencyPct, and the instance is _Total). This is an “or” condition, so the alert occurs if either threshold is crossed. [NOTE: You cannot use sample count for tuning on this monitor.] A short sketch of this evaluation follows at the end of this section.

Causes: Using the Veeam management pack, we were able to identify that the hosts the VMs are running on were at 75%+ memory utilization, and that these are currently the only hosts in that state. We next created a baseline to determine what is normal so we could tune accordingly. Without a dashboard, you can go to Health Explorer and track the alerts, or use a dashboard performance widget, to see what is normal for these VMs; we found that 10-15% is normal for our environment. Finally, we created a custom dashboard using the Veeam widgets to see the actual averages for the environment over the past 24 hours and 72 hours, as shown below.

Resolution: We configured an override for all objects to change the CPULatencyPctThreshold from 10 to 20 for this environment.
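Putting the pieces above together, here is a minimal sketch of the “or” evaluation and of how the baseline informed the override. The latency history is invented to sit in the 10-15% band we observed; the thresholds (10 by default, 20 after the override) come from the monitor notes and resolution above.

```python
import random
import statistics

def compute_latency_unhealthy(memory_latency_pct, cpu_latency_pct,
                              memory_threshold=10, cpu_threshold=10):
    """'Or' condition: either counter crossing its threshold trips the monitor."""
    return memory_latency_pct >= memory_threshold or cpu_latency_pct >= cpu_threshold

# Invented cpuLatencyPct history sitting in the 10-15% band we found to be normal.
random.seed(7)
history = [random.uniform(9, 16) for _ in range(864)]   # 5-minute samples over 72 hours

p95 = statistics.quantiles(history, n=20)[-1]           # 95th percentile of "normal"
typical = statistics.median(history)
print(f"cpuLatencyPct over 72h: median {typical:.1f}%, 95th percentile {p95:.1f}%")

# Memory latency is assumed healthy (0) for this illustration. With the default
# CPU threshold of 10 a typical sample already trips the monitor; the override
# to 20 sits above the observed normal band.
print("unhealthy at default threshold (10):", compute_latency_unhealthy(0, typical))
print("unhealthy at override threshold (20):",
      compute_latency_unhealthy(0, typical, cpu_threshold=20))
```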

 

A quick reporting note:

The “Storage Performance Profile” report is extremely interesting and provides a lot of great information. It tracks the IOPS on shared storage and the latency recorded at the time of the highest IOPS. This is an extremely important piece of information because the report effectively shows the actual upper bound on IOPS for the storage: not by stress testing the storage, but by reporting when the storage is under load and starts to have latency issues.
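As a rough illustration of that idea, the sketch below pairs invented hourly IOPS and latency samples for a single datastore and reads off the latency recorded at the highest observed IOPS; the 25 ms ceiling is an assumption for the example, not a figure from the report.

```python
# Invented hourly samples for one shared datastore: (IOPS, latency in ms).
samples = [
    (1200, 4), (1800, 5), (2600, 6), (3400, 9),
    (4100, 14), (4600, 27), (4750, 41), (4300, 18),
]

# The same pairing the report makes: the highest observed IOPS and the
# latency recorded at that moment.
peak_iops, latency_at_peak = max(samples, key=lambda s: s[0])
print(f"peak observed IOPS: {peak_iops}, latency at that point: {latency_at_peak} ms")

# If latency at peak load is already past what the environment tolerates,
# the observed peak is effectively the practical IOPS ceiling for this storage.
LATENCY_CEILING_MS = 25   # assumption: tolerable latency for this environment
print("storage stressed at peak load:", latency_at_peak >= LATENCY_CEILING_MS)
```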

I ran into some problems getting this report to run in my larger client environments, where it would time out. My recommendation is to run it with an aggregation of Daily instead of Hourly; that change allowed the report to complete successfully in my environment. I’ve also heard from good sources that a change was made to the SQL query for hourly data, so it should run about 3x faster, and that the default aggregation for this report will be Daily in the next version.

Additional reference: https://www.veeam.com/system-center-management-pack-vmware-hyperv.html

Summary: The Veeam management pack has highlighted some very important issues in the environments where I have worked, and it continues to do so. I hope this blog post gives you some ideas on tuning and approaches you can use to identify resolutions for the alerts you find from this management pack. Thank you to the great crew at Veeam; I look forward to seeing several of you next week at SCU!

 
