Unstable Behavior from OpsMgr Health Service

    Symptoms

    You observe unstable behavior from the OpsMgr agent, such as:

    • Heartbeat failure or other availability alerts
    • Agent restarts
    • Gaps in collected VMware data in reports or charts
    • Agent and/or VMware objects appear ‘grayed out’ in the OpsMgr console

    You may receive alerts in OpsMgr that the Health Service (agent) on the Veeam Collector server is not heartbeating, as shown below:

    Alert name: Health Service Heartbeat Failure

    Alert description:  The Health Service on computer [server name] failed to heartbeat.

    Additional errors and warnings in the Operations Manager event log on the Collector server may appear, including those shown below:

    Veeam Collector Health Service version store error

    Source:        Health Service ESE Store

    Event ID:      623

    Level:         Error

    Description:

    HealthService (1080) Health Service Store: The version store for this instance (0) has reached its maximum size of [nn]Mb. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.

    Note: The maximum size of [nn] MB reported in this event should match the value set by the Veeam Configure Health Service task. This alert may also arrive late, depending on whether the Health Service was able to deliver the event for processing.

    -----------------------------------

    Veeam Collector Health Service agent restart error

    Event Type: Warning

    Event Source: Health Service Script

    Event ID: 6024

    Description:

    LaunchRestartHealthService.js : Launching Restart Health Service. Health Service exceeded Process\Handle Count or Private Bytes threshold.

    Note: This restart action should be disabled by the nworks_overrides MP. Check that the overrides MP has been imported.

    -----------------------------------

    Veeam Collector: Health Service delay in event processing

    Note: This rule should match only events where the log name contains Veeam.

    Source:        Health Service Modules

    Event ID:      26017

    Level:         Warning

    Description:

    The Windows Event Log Provider monitoring the [Veeam log name] is [n] minutes behind in processing events.  This can occur when the provider is restarted after being offline for some time, or there are too many events to be handled by the workflow.

    -----------------------------------

    Veeam Collector: Health Service cannot load configuration

    Source:        HealthService

    Event ID:      1230

    Level:         Error

    Description:

    New configuration cannot be loaded, the error is 0x80FF0004(0x80FF0004). Management group "[MG name]".

    -----------------------------------

    Veeam Collector: Health Service state change flow stalled

    Source:        HealthService

    Event ID:      5300 OR 5302 OR 5304

    Level:         Error

    Description:

    Local health service is not healthy. Monitor state change flow is stalled with pending acknowledgement.

    -----------------------------------

    Veeam Collector: Health Service send queue exceeded

    Source:        HealthService

    Event ID:      2023

    Level:         Error

    Description:

    The health service has removed some items from the send queue for management group "[MG name]" since it exceeded the maximum allowed size of [N] megabytes.

    Cause

    The OpsMgr Health Service is overloaded when trying to process the incoming VMware data.

    Either the required configuration has not been applied, or there are too many monitoring jobs (vSphere hosts) assigned to this Collector.

    Some of the above events have specific solution steps. Review the solutions below for details.
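
    As a rough triage aid, the events listed under Symptoms can be mapped to the solutions described below. The following Python sketch is only an illustration and is not part of the Veeam MP: the function and variable names are hypothetical, and it assumes that events without a dedicated solution are handled by Solution 1 first and Solution 4 if instability continues.

```python
# Hypothetical triage helper (not part of the Veeam MP): maps the event IDs
# listed in this article to the solution sections that address them.

# Events with a dedicated solution section.
SPECIFIC_SOLUTIONS = {
    623: "Solution 2",   # version store reached its maximum size
    6024: "Solution 3",  # agent restart on Handle Count / Private Bytes
}

# Every event ID mentioned under Symptoms.
KNOWN_EVENT_IDS = {623, 6024, 26017, 1230, 5300, 5302, 5304, 2023}


def suggest_solutions(event_id: int) -> list[str]:
    """Return the solution sections to review, in order, for an event ID."""
    if event_id not in KNOWN_EVENT_IDS:
        return []  # not an event covered by this article
    steps = ["Solution 1"]  # clearing the Health Service State comes first
    if event_id in SPECIFIC_SOLUTIONS:
        steps.append(SPECIFIC_SOLUTIONS[event_id])
    steps.append("Solution 4")  # only if instability continues afterwards
    return steps
```

    For example, suggest_solutions(623) lists Solution 1, Solution 2, and Solution 4, while an event ID that does not appear in this article yields an empty list.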

    Solution 1. Clear the Agent Health Service State

    When the agent exhibits unstable behavior, reports heartbeat failures, or logs the events above, the first step is usually to clear the Health Service State, which is the local agent cache. Clearing this data restores communication with the agent and allows further troubleshooting steps to take effect.

    Note

    You should first remove monitoring jobs from the Collector; otherwise, the OpsMgr agent may become overloaded during troubleshooting. Re-apply monitoring jobs to the Collector only when the agent is stable again.

    To clear the agent Health Service State:

    1. Stop the System Center Management service on the Collector server.
    2. Delete the Health Service State folder, typically:
    • For a Management Server, %ProgramFiles%\Microsoft System Center 2012\Operations Manager\Server\Health Service State or %ProgramFiles%\Microsoft System Center 2012 R2\Operations Manager\Server\Health Service State.
    • For an OpsMgr agent, %ProgramFiles%\System Center Operations Manager\Agent\Health Service State or %ProgramFiles%\Microsoft Monitoring Agent\Agent\Health Service State.
    3. Start the System Center Management service.
    4. Observe the Operations Manager event log on the server. You should see configuration being loaded, management packs being deployed, and so on. When the agent is running stably, continue with troubleshooting.
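
    The stop, delete, and start sequence above could be scripted along the following lines. This is a minimal Python sketch under stated assumptions, not a Veeam-supplied tool: it assumes the System Center Management service is registered under the Windows service name HealthService (verify this on your server), the function names are hypothetical, and the service command runner is passed in as a parameter so the logic can be tried without touching a real service.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: Windows service name behind "System Center Management".
SERVICE_NAME = "HealthService"
STATE_FOLDER_NAME = "Health Service State"


def run_net(action: str, service: str) -> None:
    """Run 'net stop' or 'net start'; raises CalledProcessError on failure."""
    subprocess.run(["net", action, service], check=True)


def clear_health_service_state(state_dir: Path, run=run_net) -> bool:
    """Stop the service, delete the state folder, then start the service.

    Returns True if the folder existed and was deleted, False otherwise.
    """
    # Guard against deleting anything other than the state folder.
    if state_dir.name != STATE_FOLDER_NAME:
        raise ValueError(f"refusing to delete {state_dir}")
    run("stop", SERVICE_NAME)
    try:
        existed = state_dir.is_dir()
        if existed:
            shutil.rmtree(state_dir)
    finally:
        # Always try to bring the service back, even if deletion failed.
        run("start", SERVICE_NAME)
    return existed
```

    On a Collector, the function would be called from an elevated session with one of the folder paths listed in step 2. The event log should still be checked afterwards as described in step 4.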

    Solution 2. Health Service version store error

    Some events indicate that the required configuration has not been applied. In particular, Event 623 (shown below) includes the size of the Version Store (the agent's local processing cache). If the size shown in this event is less than [nn] MB, the Configure Health Service task has not been run against this Collector server.

    Source: Health Service ESE Store

    Event ID: 623

    Level: Error

    Description: HealthService (1080) Health Service Store:

    The version store for this instance (0) has reached its maximum size of [nn]Mb. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.
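
    The size check on Event 623 can be automated by extracting the reported maximum from the event description. The Python sketch below is illustrative only: the function names are hypothetical, and the expected size is left as a parameter (expected_mb) that you must supply, because the value set by the Configure Health Service task is not stated here.

```python
import re

# Matches the size reported in the Event 623 description,
# i.e. the number in "... maximum size of <number>Mb".
_SIZE_PATTERN = re.compile(r"maximum size of (\d+)\s*Mb", re.IGNORECASE)


def version_store_size_mb(description: str):
    """Return the reported version store maximum in MB, or None if absent."""
    match = _SIZE_PATTERN.search(description)
    return int(match.group(1)) if match else None


def task_probably_not_run(description: str, expected_mb: int) -> bool:
    """True if the reported maximum is below the expected configured value.

    expected_mb is supplied by the caller; it is not defined by this sketch.
    """
    size = version_store_size_mb(description)
    if size is None:
        raise ValueError("no version store size found in description")
    return size < expected_mb
```

    A result of True suggests that the Configure Health Service task still needs to be run against the Collector.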

    Solution: Run the Configure Health Service task as follows:

    In the Monitoring view, open the Veeam for VMware > Veeam Collectors > _Veeam Collectors node.

    After discovery of Collector(s) applications is complete in OpsMgr, you will see the _Veeam Collectors view populated with the Collectors.

    Note

    If the view is empty, you should wait for discovery to complete.
    Alternatively, restarting the System Center Management service on the Collector server can be used to trigger discovery.

     

    Figure: Configure Health Service Task

    Then do the following:

    1. In the Actions pane, expand Veeam VMware Collector Service Tasks and click the Configure Health Service task (outlined in red above).
    2. Run this task against the Collector. If you have several Collectors, the task can be run against them simultaneously. The task causes the System Center Management service to restart.

    Successful task output should look like the following:

    Output

    Setting Persistence Version Store Maximum...done

    Setting Persistence Cache Maximum...done

    Setting State Queue Items...done

    Enumerating Management Groups...done

    Setting MaximumQueueSizeKb for [Mgt Group name]...done

    Error

    None

    Exit Code:

    0
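
    If the task output is captured as text, the success criteria shown above (every step reported as done and an exit code of 0) can be checked programmatically. The following Python sketch is an illustration with hypothetical function names. It assumes each step line has the form "step name...status", as in the sample output. The wording of a failed step is not documented here, so any status other than "done" is treated as a failure.

```python
def parse_steps(output: str) -> dict[str, str]:
    """Map each 'step name...status' line to its status text."""
    steps = {}
    for line in output.splitlines():
        # Split on the last '...' so the status is whatever follows it.
        name, sep, status = line.strip().rpartition("...")
        if sep and name:
            steps[name] = status
    return steps


def task_succeeded(output: str, exit_code: int) -> bool:
    """True if exit code is 0 and at least one step ran and all are 'done'."""
    steps = parse_steps(output)
    return (
        exit_code == 0
        and bool(steps)
        and all(status == "done" for status in steps.values())
    )
```

    Requiring at least one parsed step means that empty output is not reported as a success, even when the exit code is 0.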

    Solution 3. Health Service agent restart error

    The OpsMgr agent restarts automatically if certain thresholds are breached. This is default OpsMgr behavior that must be modified with overrides when an agent is acting as a Veeam Collector.

    Event 6024 shows that the out-of-box monitors for Private Bytes and/or Handle Count have fired and are restarting the agent:

    Source: Health Service Script

    Event ID: 6024

    Level: Warning

    Description: LaunchRestartHealthService.js : Launching Restart Health Service. Health Service exceeded Process\Handle Count or Private Bytes threshold.

    Note that the Required Health Service Overrides MP, which is part of the standard Veeam MP for VMware set, disables this Health Service restart action when the Veeam Collector is installed. Check that the overrides MP has been imported.

    Solution: Use the pre-built ‘Overrides MP’ supplied on the Veeam MP for VMware installation ISO to apply the required thresholds. For details, see the Installation Guide.

    Solution 4. Miscellaneous Agent Errors and Continued Instability

    If errors and unstable behavior continue even after the correct configuration has been applied as described above, consider the following solutions:

    • Relocate the Veeam Collector role(s) to OpsMgr Management Servers. A Management Server will scale higher in terms of monitored vSphere hosts and VMs per Collector. Note that in large environments you may need to dedicate Management Server(s) to the role of VMware data processing.
    • Relocate monitoring jobs (vSphere clusters/hosts) to other Veeam Collectors to distribute the monitoring load.