Disclaimer: OpsMgr management group environments are all different, and this is my experience with two management groups. I am currently working with Veeam and Microsoft Tier 3 support.
Note: A lot of troubleshooting detail is omitted because it would require deep dives into each troubleshooting element. Churn-like symptoms can be tricky to troubleshoot, requiring you to peel away layers until you reach the root cause.
In August I updated our existing Veeam 6.5 environment to v7 in two management groups:
- Staging (approx. 800 Windows servers)
- Production (approx. 2000 Windows servers)
Things seemed fine, and I went about my business. It was during production agent updates, after applying UR3, that things slowed to a crawl. The management servers were full of 2115 events for the CollectDiscoveryData workflow, and SQL blocking was extensive. The Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Health Service State\Completed File Uploads directory was full of files on each management server.
Our DBAs use tools that automatically collect profiler traces under extended blocking conditions, and one of them explained that the cause was a session performing updates/inserts on the Relationship table in the OperationsManager database. A corresponding trigger, triu_Relationship, fires and causes the blocking. Microsoft's Tier 3 SQL team confirmed this as well after analyzing PSSDIAG results. What caused the change? Everything had seemed to be running fine.
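If you want to spot-check blocking yourself rather than wait for a profiler trace, the standard SQL Server DMVs will show the blocked sessions and the statement the blocker is running. This is a generic sketch, not the exact query our DBAs used:

```sql
-- Show blocked sessions and any session named as a blocker,
-- along with the SQL text each one is currently running.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    t.text AS running_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
   OR r.session_id IN (SELECT blocking_session_id
                       FROM sys.dm_exec_requests
                       WHERE blocking_session_id <> 0);
```

In our case the head blocker's text pointed at the Relationship table inserts/updates, with the triu_Relationship trigger doing the damage underneath.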
Analysis led to checking the size of the production RecursiveMembership table; it had 4 M rows, which Microsoft felt was extremely high. I took a look at staging and was surprised to find it was also experiencing 2115's for the same CollectDiscoveryData workflow. You would not have known there was a problem unless you happened to be looking for trouble in the event logs. Staging's RecursiveMembership table had 1.3 M rows. OK, maybe rowcount is the issue, or perhaps it's the content of RecursiveMembership that matters most. I pieced together some queries and deduced that at least 25% of RecursiveMembership pertained to relationships involving elements discovered by Veeam. That seemed disproportionate. Time to test that hypothesis.
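For anyone wanting to do the same sizing exercise, something along these lines works. The total rowcount is straightforward; the per-class breakdown joins through BaseManagedEntity to ManagedType, a join path I am assuming from the 2012 R2 schema, so verify the column names against your own database before relying on it:

```sql
-- Total RecursiveMembership rows.
SELECT COUNT(*) AS TotalRows
FROM dbo.RecursiveMembership;

-- Rough breakdown of contained entities by class name.
-- The ContainedEntityId -> BaseManagedEntity -> ManagedType join path
-- is an assumption based on the 2012 R2 schema; verify before use.
SELECT mt.TypeName, COUNT(*) AS Rows
FROM dbo.RecursiveMembership AS rm
JOIN dbo.BaseManagedEntity AS bme
    ON bme.BaseManagedEntityId = rm.ContainedEntityId
JOIN dbo.ManagedType AS mt
    ON mt.ManagedTypeId = bme.BaseManagedTypeId
GROUP BY mt.TypeName
ORDER BY COUNT(*) DESC;
```

Adding a `WHERE mt.TypeName LIKE 'Veeam%'` filter (or whatever prefix your MP classes use) makes the "what fraction is Veeam" question easy to answer.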
I removed the Veeam MPs from staging and let things settle. The 2115's stopped, and the RecursiveMembership rowcount dropped to 1 M rows. That supported the hypothesis, so I took a closer look at the discovery XML and developed another one. Veeam v7 builds a very detailed VMware topology in OpsMgr, with multiple relationships between objects. Perhaps, with > 2,000 VMs spread across 60+ ESX hosts, the resulting relationship structure caused any convergence of discovery data to trigger the blocking. I then cross-checked the 2115 storm timestamps with modified property report timestamps. IIS discoveries in particular lined up nicely, as they tend to have larger discovery payloads.
I did not apply this analysis to staging as the Veeam MPs had already been removed, but it’s safe to say the same situation occurred, just on a smaller scale.
The Proposed Solution:
Rolling back to 6.5 was an option, but I like to fix things and move forward. My next task was to determine whether I could eliminate unneeded elements of topology discovery, stop OpsMgr from being hammered, and still meet our requirements:
- Host discovery and some monitoring/data collection
- VM discovery and nothing else
This required some experimentation in staging; fortunately not too much. Veeam is still reviewing our data, but importing the MPs and disabling these discoveries is working:
- VMGuest to OpsMgr Agent Relationship
- Virtual Switches
- Resource Pools
- Populate VMWare VMs that run OpsMgr agents
The staging MG's RecursiveMembership table increased to 1.1 M rows, and there were no CollectDiscoveryData 2115's after two days. Our requirements for monitoring and inventory were met. Good news!
I followed the same procedure in production. After removing the MPs, RecursiveMembership dropped to 2.1 M rows and the 2115's stopped. I waited a few hours. After reimporting with the new changes, RecursiveMembership sits at 2.4 M rows, with no CollectDiscoveryData 2115's. Result!
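To confirm the reimport was creating fewer relationships, a before/after snapshot of the Relationship table by type is a useful sanity check. As above, the table and column names here are assumptions from the 2012 R2 schema, so verify them in your own environment first:

```sql
-- Non-deleted relationships grouped by relationship type;
-- compare snapshots taken before and after the MP reimport.
-- Table/column names assumed from the 2012 R2 schema; verify first.
SELECT rt.RelationshipTypeName, COUNT(*) AS Rows
FROM dbo.Relationship AS r
JOIN dbo.RelationshipType AS rt
    ON rt.RelationshipTypeId = r.RelationshipTypeId
WHERE r.IsDeleted = 0
GROUP BY rt.RelationshipTypeName
ORDER BY COUNT(*) DESC;
```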
There is one side effect of limiting discovery, which I'm certain Veeam will address and correct. All the management servers report this every twelve minutes:
Note: None of our management servers are Veeam collectors
The Windows Event Log Provider is still unable to open the Veeam VMware event log on computer ‘MSSERVER.SOMEDOMAIN.com’. The Provider has been unable to open the Veeam VMware event log for 22320 seconds.
Most recent error details: The specified channel could not be found. Check channel configuration.
One or more workflows were affected by this.
Workflow name: many
Instance name: many
Instance ID: many
Management group: OURMG
If you have a medium-to-large OpsMgr management group and your vCenter environment has > 2,000 VMs, you may run into discovery performance issues after discovering the full VMware topology with default settings.
I have fixed my share of churn over the years, and this was a great puzzle to solve. I need to learn more about the RecursiveMembership and Relationship structures since they play a big part in management group performance.