Performance Issues with the Veeam v7 Management Pack

Disclaimer: OpsMgr management group environments are all different, and this is my experience with two management groups. I am currently working with Veeam and Microsoft Tier 3 support.

Note: A lot of troubleshooting detail is omitted because it would require deep dives into each troubleshooting element. Churn-like symptoms can be tricky to troubleshoot; you have to peel away layers until you reach the root cause.

The Issue:

In August I updated our existing Veeam 6.5 environment to v7 in two management groups:

  • Staging (approx. 800 Windows servers)
  • Production (approx. 2000 Windows servers)

Things seemed fine, and I went about my business. It was during production agent updates post UR3 that things slowed to a crawl. Management servers were full of 2115 events for the CollectDiscoveryData workflow. SQL blocking was extensive. The Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Health Service State\Completed File Uploads directory was full of files on each MS.

Not good.

Our DBAs use tools to automatically collect profiler traces under extended blocking conditions, and one of them explained that the cause was a session performing updates/inserts on the Relationship table in the OperationsManager database. A corresponding trigger, triu_Relationship, fires and causes the blocking. MSFT’s tier 3 SQL support confirmed this after analyzing PSSDIAG results. What caused the change? Everything had seemed to be running fine.

Analysis led to checking the size of the production RecursiveMembership table; it had 4 M rows. MSFT felt this was an extremely high value. I took a look at staging and was surprised to find it was also experiencing 2115’s for the same CollectDiscoveryData workflow. You would not have known there was a problem unless you happened to be looking for trouble in the event logs. Staging’s RecursiveMembership table had 1.3 M rows. OK, maybe row count was the issue, or perhaps the content of RecursiveMembership mattered most. I pieced together some queries and deduced that at least 25% of RecursiveMembership pertained to relationships involving elements discovered by Veeam. That seemed disproportionate. Time to test that hypothesis.

I removed the Veeam MPs from staging and let things settle. The 2115’s stopped. The RecursiveMembership row count dropped to 1 M rows. This supported the hypothesis, so I took a closer look at the discovery XML and developed another one. Veeam v7 builds a very detailed VMware topology in OpsMgr with multiple relationships between objects. Perhaps with > 2,000 VMs spread across 60+ ESX hosts, this built a relationship structure in which any convergence of discovery data started the blocking. I then cross-checked the 2115 storm timestamps against modified property report timestamps. IIS discoveries in particular lined up nicely, as they tend to have larger discovery payloads.
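The timestamp cross-check can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the function name, the 10-minute window, and the sample data are all made up, and the real timestamps would come from the event logs and the modified property reports.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_overlaps(event_times, discovery_changes, window=timedelta(minutes=10)):
    """Count 2115 events that fall within `window` after each discovery change.

    event_times: datetimes of 2115 events.
    discovery_changes: (datetime, discovery_name) pairs.
    Returns a Counter keyed by discovery name.
    """
    hits = Counter()
    for changed_at, discovery in discovery_changes:
        for event_at in event_times:
            if timedelta(0) <= event_at - changed_at <= window:
                hits[discovery] += 1
    return hits


# Made-up sample data, for illustration only.
events = [
    datetime(2000, 1, 1, 9, 2),
    datetime(2000, 1, 1, 9, 5),
    datetime(2000, 1, 1, 14, 30),
]
changes = [
    (datetime(2000, 1, 1, 9, 0), "IIS discovery"),
    (datetime(2000, 1, 1, 12, 0), "Some other discovery"),
]
print(count_overlaps(events, changes).most_common())
```

Discoveries whose changes are repeatedly followed by 2115 events float to the top of the list, which is the pattern the IIS discoveries showed here.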

I did not apply this analysis to staging as the Veeam MPs had already been removed, but it’s safe to say the same situation occurred, just on a smaller scale.

The Proposed Solution:

Rolling back to 6.5 was an option, but I like to fix things and move forward. My next task was to determine whether I could eliminate unneeded elements of topology discovery, stop OpsMgr from being hammered, and still meet our requirements:

  • Host discovery and some monitoring/data collection
  • VM discovery and nothing else

This required some experimentation in staging; fortunately not too much. Veeam is still reviewing our data, but importing the MPs and disabling these discoveries is working:

  • VMGuest to OpsMgr Agent Relationship
  • Virtual Switches
  • Resource Pools
  • Datastores
  • Populate VMWare VMs that run OpsMgr agents

The staging MG RecursiveMembership table increased to 1.1 M rows. There were no CollectDiscoveryData 2115’s after two days. Our requirements for monitoring and inventory were met. Good news!

I followed the same procedure in production. After removing the MPs, RecursiveMembership dropped to 2.1 M rows. The 2115’s stopped. I waited a few hours. After reimporting with the new changes, RecursiveMembership is at 2.4 M rows. No CollectDiscoveryData 2115’s. Result!
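The row counts quoted in this post give a rough estimate of how much of RecursiveMembership was tied to the Veeam MPs. The sketch below is only arithmetic on the rounded figures above. The function and key names are made up for illustration, and any unrelated changes in the management groups between measurements are ignored.

```python
# RecursiveMembership row counts from this post, in thousands of rows (rounded).
ROW_COUNTS = {
    "staging": {"v7_defaults": 1300, "mps_removed": 1000, "trimmed": 1100},
    "production": {"v7_defaults": 4000, "mps_removed": 2100, "trimmed": 2400},
}


def veeam_share(v7_defaults, mps_removed):
    """Fraction of rows that disappeared when the Veeam MPs were removed."""
    return (v7_defaults - mps_removed) / v7_defaults


def rows_added_back(trimmed, mps_removed):
    """Rows (in thousands) added by re-importing with the discoveries disabled."""
    return trimmed - mps_removed


for mg, c in ROW_COUNTS.items():
    share = veeam_share(c["v7_defaults"], c["mps_removed"])
    added = rows_added_back(c["trimmed"], c["mps_removed"])
    print(f"{mg}: {share:.1%} of rows tied to Veeam MPs; "
          f"trimmed import added back {added}k rows")
```

On these figures, roughly 23% of the staging rows and 47.5% of the production rows went away with the MPs, while the trimmed import added back only about 100k and 300k rows respectively.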

There is one side effect resulting from limiting discovery. I’m certain Veeam will address and correct it. All the management servers report this every twelve minutes:

Note: None of our management servers are Veeam collectors

The Windows Event Log Provider is still unable to open the Veeam VMware event log on computer ‘MSSERVER.SOMEDOMAIN.com’. The Provider has been unable to open the Veeam VMware event log for 22320 seconds.

Most recent error details: The specified channel could not be found. Check channel configuration.

One or more workflows were affected by this.

Workflow name: many

Instance name: many

Instance ID: many

Management group: OURMG

Conclusion:

If you have a medium-to-large OpsMgr management group, and your vCenter environment has > 2,000 VMs, you may run into discovery performance issues after discovering the full VMware topology using default settings.
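Veeam’s explanation in the comments below suggests why scale matters: each VM-to-agent relationship adds a RecursiveMembership row for every child object under the Windows Computer, multiplied by the topology depth. Here is a toy model of that growth. It is a simplification, not the actual OpsMgr algorithm, and the inputs (200 child objects per computer, a depth of 4) are assumptions rather than measured values.

```python
def extra_membership_rows(vms_with_agents, children_per_computer, topology_depth):
    """Toy estimate of rows added by VM-to-agent containment relationships.

    Assumes one row per child object under each Windows Computer, repeated
    once for every topology level above the VM.
    """
    return vms_with_agents * children_per_computer * topology_depth


# Hypothetical inputs: 200 child objects per computer and 4 topology levels
# (vCenter, Datacenter, Cluster, Host).
for vms in (500, 1000, 2000):
    print(vms, extra_membership_rows(vms, 200, 4))
```

Growth is linear in each factor, so a deep topology and servers with many discovered child objects compound each other as the VM count rises.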

I have fixed my share of churn over the years, and this was a great puzzle to solve. I need to learn more about the RecursiveMembership and Relationship structures since they play a big part in management group performance.

4 thoughts on “Performance Issues with the Veeam v7 Management Pack”

  1. Alec King

    Hi Drew,

    Alec King of Veeam here, AKA “The MP King” 😉 as I run our Management Pack R&D group.

    Thanks for the very interesting and detailed post!

    Discovery churn is something we always seek to minimize in the Veeam MP. I’m diving right now with my team into the root cause of the churn you were seeing, and I’ll post more detail back here soon….

  2. Alec King

    Hi Drew,

    A quick update from Veeam R&D as promised!

    Most of the additional rows in the RecursiveMembership table are generated when we create the Containment relationship between a VM (Veeam MP object) and the Ops Mgr agent (Windows Computer object) running inside the VM, if present.

    That relationship is a very useful one, as you can imagine – allowing a link to be shown between the virtual infrastructure and the applications and services that depend on it. We use it in dashboards, reports, groups…etc.

    However we’ve established that the RecursiveMembership table is not populated with just one entry, when we create that single VM-to-Agent relationship. In fact that table populates with an entry referring to each child object under the Windows Computer, and then is multiplied by a factor of the topology depth. On the Veeam (VMware) side this can be pretty deep, as our topo starts at vCenter, then through Datacenter, Cluster, Host….

    As you saw this generates a huge amount of additional rows in this table and you experienced performance issues with the additional relationships.

    We believe Microsoft implemented this table to optimize calculation/rendering of topology diagrams – however in a large/deep topology it creates bottlenecks.

    And your issue was exacerbated by some problems in communicating directly with vSphere Hosts to gather CIM (hardware) data – this caused our topology to ‘flap’ and led to repeated discovery update triggers, which made things worse.

    So, we continue to dive in. I’ll probably reach out to you directly via our Support org, so we can discuss in detail. Apologies that you experienced this issue – but thanks for your patience and detailed research, and we can already see here at Veeam how we will solve this!

  3. Drew Post author

    This issue has been fixed, going to test ASAP. Below is from the readme for the just released R2.

    Topology views optimized and new diagram dashboards added

    Veeam MP 7.0 for System Center built a very detailed and deep VMware and/or Hyper-V topology which extended from clusters, through the physical host servers, to the virtual machines, and even included the Ops Mgr agents running inside the virtual machines (if present). In large environments, the depth of this topology could be an issue for Ops Mgr to maintain and could cause SQL performance issues (specifically in the RecursiveMembership table), including problems with insertion/update of discovery data.

    In Veeam Management Pack 7.0 R2, the relationship between a VM and the Ops Mgr Agent (Windows Computer object), was replaced by discovering a relationship between a VM and the specific Veeam MP object “Ops Mgr Agent in VM” which is discovered inside each Windows OS. Because this object (unlike Windows Computer object) does not have any child topology objects, the overall Veeam topology depth and total number of contained objects is greatly reduced, which addresses the SQL performance issue.

     

     

  4. Alec King

    Thanks Drew! I was planning to post a notification here for our R2 release – but you beat me to it 😉

    I believe we have addressed the performance issue you found – and without losing any functionality. In fact, the new in-context Diagram Dashboards have added new capabilities – you can now browse from the VM “down” the hierarchy (into the OM agent) and also “up” the hierarchy (into the Host for this specific VM).

    Thanks again for your initial research into this issue – looking forward to your feedback!

    Cheers,

    Alec
