Issue with data aggregation

Forum: Operations Manager4
Viewing 3 posts - 1 through 3 (of 3 total)
  • #229193
    Profile photo of Gordon
    Gordon
    Participant

    I am having issues with the data aggregation in my 2012 environment. Dataset aggregation run by the maintenance rule doesn’t seem to be working and I can’t run my availability and health reports.

    Here’s what happened: I’ve finally decided to bring my production environment up to speed. Was running 2012 UR3 and wanted to upgrade it to the current version one step at a time. SP1 was the obvious choice so I installed it in the test environment, no issues. Off we go with the production, followed the MS article and it all seemed to be going smoothly. Then some time later I discovered events 31552, 31553, 31551, 31565 – all with SQLException Timout Expired – (sometimes 2115 and 33333, too) on the management servers. I’ve checked and followed all sorts of instructions, blogs, articles I have found out there and mainly followed those to manually run dataset aggregation for perf and state datasets as they were falling behind. Steps I followed were:

    1. OverrideStandard data warehouse data set maintenance rule for perf and state
    2. Stop Healthservice on each management server
    3. wait for 10 minutes
    4. run the 500 loop of manual aggregation sql query with a 5 sec WAITFOR delay (one dataset at a time)
    5. start healthservice on each management server

    This seems to have worked for perf dataset  as I would get perf data outstanding aggregations down to acceptable levels although DirtyInd remains high. Tried the same approach for state dataset and it just doesn’t seem to do anything. outstanding hourly aggregations remain high (over 300) and DirtyInd remains high  too (over 400).

     

    When I follow what’s going on on the SQL server I can see that as soon as I start healthservice services on management servers, CREATE INDEX stored procedure starts and everything gets blocked by it. This CREATE INDEX seems to be running all the time and it’s been almost 3 weeks now. I would have expected that any indexing would have been completed by now. I checked my lab and no such SP running there. A peculiar thing is that if I follow the steps for manual aggregation for perf dataset CREATE INDEX SP does not affect it as it completes successfully. If I try the same for state dataset it runs forever (longest I left it to run was 16 hours). The EXEC from my query is always blocked by CREATE INDEX so it gets nowhere.

    So my questions are: What starts this CREATE INDEX? How can I disable it and should I? To me the issue is that even if I manually catch up with all of aggregations (by having SCOM healthservice stopped for days – if I really have to) and I’m up to date, if CREATE INDEX keeps running it will keep blocking automatic aggregations run by the rule and I will be in the same position again. What can I do?

    Appreciate any help.

    #229258
    Profile photo of Gordon
    Gordon
    Participant

    This has been resolved after going through a few weeks of troubleshooting with MS support. (thought I’d save some poor soul hard work I had to go through – even if one person finds it helpful I’ll be a happy man)

    The issue was with the upgrade itself where everything was upgraded apart from State dataset. To check this you need to run select * from dataset on OperationsManagerDW database. Then check the LastUpgradeCompletedInd column. If any of the values are 0 there’s your problem. Also, next to it should be LastUpgradedDateTime column. All the dates/times should have the value of the time when SCOM was upgraded. What I found was that for State dataset, the value of LastUpgradeCompletedInd was 0 and the time was very recent meaning that the state dataset wasn’t upgraded and that last attempt was recent (but obviously failed).

     

    Another thing I noticed in troubleshooting was that all the tables had clustered indexes ok. Also, all tables apart from State.statexxxxx had non-clustered indexes ok. Which is why Create Index command kept running (and failing).

     

    The way this was resolved was by following these steps:

    1. Create a new registry key: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse
    2. Create new DWORD in that key: Deployment Command Timeout Seconds
    3. Assign decimal value of 259200 (3 days) – this is fairly arbitrary as we wanted to make sure the upgrade process finished. What seemed to have happened was that the State dataset upgrade process kept timing out so increasing this timeout helped.
    4. Restart  SCOM services.

    this needs to be performed on each management server.

     

    Once that was done, 4 hours later I noticed event in the event log on one of the management servers saying that deployment was complete. From here I kept running following queries against OperationsMAnagerDW DB:

    To check outstanding aggregations:

    USE OperationsManagerDW;
    WITH AggregationInfo AS (
    SELECT
    AggregationType = CASE
    WHEN AggregationTypeId = 0 THEN ‘Raw’
    WHEN AggregationTypeId = 20 THEN ‘Hourly’
    WHEN AggregationTypeId = 30 THEN ‘Daily’
    ELSE NULL
    END
    ,AggregationTypeId
    ,MIN(AggregationDateTime) as ‘TimeUTC_NextToAggregate’
    ,COUNT(AggregationDateTime) as ‘Count_OutstandingAggregations’
    ,DatasetId
    FROM StandardDatasetAggregationHistory
    WHERE LastAggregationDurationSeconds IS NULL
    GROUP BY DatasetId, AggregationTypeId
    )
    SELECT
    SDS.SchemaName
    ,AI.AggregationType
    ,AI.TimeUTC_NextToAggregate
    ,Count_OutstandingAggregations
    ,SDA.MaxDataAgeDays
    ,SDA.LastGroomingDateTime
    ,SDS.DebugLevel
    ,AI.DataSetId
    FROM StandardDataSet AS SDS WITH(NOLOCK)
    JOIN AggregationInfo AS AI WITH(NOLOCK) ON SDS.DatasetId = AI.DatasetId
    JOIN dbo.StandardDatasetAggregation AS SDA WITH(NOLOCK) ON SDA.DatasetId = SDS.DatasetId AND SDA.AggregationTypeID = AI.AggregationTypeID
    ORDER BY SchemaName DESC

     

    All values need to be 1 or maybe 2 or 3.

    To check DirtyInd (run this for ‘Perf’, ‘State’, ‘Alert’ and ‘Event’ datasets) and make sure no more than a few are there.

    DECLARE @DatasetId uniqueidentifier
    SELECT
    @DatasetId = DatasetId
    FROM StandardDataset d
    WHERE (d.SchemaName = ‘perf’)
    Select AggregationDateTime, AggregationTypeId
    From StandardDatasetAggregationHistory
    Where DatasetId = @DatasetId
    And
    DirtyInd = 1

     

    when I ran these I noticed that with every minute these numbers were reducing. Took about a day and a half and SCOM sorted it out and brought those numbers so that all values were 1 (in both queries).

     

    I’m sure same troubleshooting could be applied for any SCOM upgrade.

    #230996
    Profile photo of OdgeUK
    OdgeUK
    Participant

    Thanks so much for posting this. I had this in my environment (after an upgrade from 2012 R2 to 1801) and being able to pinpoint the LastUpgradeCompletedInd as having a value of Zero for State Data Set really helped us go to the right solution with MS.

    For us, the Create index job simply wouldn’t finish before the Transaction Logs filled up. The transaction would then take about an hour to rollback and the process would repeat. We had to add quite a significant amount of Disk space to the AppLog DB, to allow the Transaction to complete (300GB T-Log Disk for a 700Gb Datawarehouse). We did change the Deployment Command Timeout setting (it already existed so we may have had similar issues a long time ago) but the issue for us was not timeout, simply that the SQL could not process the Create index proc to upgrade the State Data Set without running out of Disk space.

    We had a sort of early warning too. On upgrade of the first MS, the upgrade actually failed configuring the Data Warehouse. Advice from MS at the time was to continue with the upgrade, running a repair on the SCOM 1801 install from source media (which appeared to be successful but it is likely that it simply skipped the issue). On reflection, this could have set up the issue we saw later. The error in setup was:

    Error: :ImportDataItemTransforms failed: Threw Exception.Type: System.Data.SqlClient.SqlException, Exception Error Code: 0x80131904, Exception.Message: Connection Timeout Expired. The timeout period elapsed while attempting to consume the pre-login handshake acknowledgement. This could be because the pre-login handshake failed or the server was unable to respond back in time. The duration spent while attempting to connect to this server was – [Pre-Login] initialization=60308; handshake=7;

    Successful upgrade was indicated on Management Server via a 31566: “Data Warehouse component successfully Deployed” event

    We are now mopping up the backed up aggregations and health checking.

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic.