Avoiding HP Server Resets due to IPMI Watchdog Timeout

Asset ID:	1-71-2379256.1
Update Date:	2018-04-12
Keywords:

Solution Type: Technical Instruction Sure

Solution 2379256.1 : Avoiding HP Server Resets due to IPMI Watchdog Timeout

Applies to:

BNS Platform Hardware - Version UDR 10.2 and later
Oracle Communications Performance Intelligence Center (PIC) Hardware - Version 10.2.0 and later
Information in this document applies to any platform.

Goal

Explain to customers and support teams how to increase the IPMI watchdog timeout value on HP Gen8 and Gen9 servers, so that system resets associated with iLO congestion can be avoided.

Solution

TPD implements a single hardware watchdog, which relies on two timers:

A hardware timer in the BMC (Baseboard Management Controller) with a timeout of 120 seconds
A software timer (watchdog daemon) which will write to the /dev/watchdog device every 5 seconds

The TKLCwatchdog service initializes these timers by loading the ipmi_watchdog kernel module with a timeout of 120 seconds, and then starts the software watchdog daemon, which writes to the /dev/watchdog device every 5 seconds.
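
The relationship between the two timers can be illustrated with a short shell sketch. This is not the TPD watchdog daemon; the function names pet_watchdog and keepalives_per_timeout are hypothetical, and the sketch only assumes that any write to the watchdog device restarts the hardware countdown.

```shell
#!/bin/sh
# Illustrative sketch only; the function names are hypothetical and this is
# not the TPD watchdog daemon.

# keepalives_per_timeout TIMEOUT INTERVAL
# Prints how many keepalive writes fit into one hardware timeout period,
# i.e. roughly how many consecutive writes must go unserviced before a reset.
keepalives_per_timeout() {
    echo $(( $1 / $2 ))
}

# pet_watchdog DEVICE INTERVAL COUNT
#   DEVICE   - path to write to (/dev/watchdog on a real system)
#   INTERVAL - seconds between writes (5 in the description above)
#   COUNT    - number of writes before returning; a real daemon never stops,
#              because the hardware timer expires once the writes end
pet_watchdog() {
    device=$1
    interval=$2
    count=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            # Each write restarts the hardware countdown.
            printf '.' >&3 || return 1
            i=$((i + 1))
            if [ "$i" -lt "$count" ]; then
                sleep "$interval"
            fi
        done
    } 3>"$device"   # keep the device open for the whole loop
}
```

With the values above, keepalives_per_timeout 120 5 prints 24, so about 24 consecutive writes have to go unserviced before the hardware timer expires. With a 300 second timeout the same calculation gives 60.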

On HP servers, the IPMI BMC functionality is emulated by the iLO. It is the iLO that resets the server when the 120-second timer expires.

Each write to /dev/watchdog results in IPMI command 22h (Reset Watchdog Timer) being issued to the iLO, as described in the IPMI specification.

HP Server Resets due to IPMI watchdog timeout

There have been reports of HP servers resetting at random due to watchdog timeouts. It is believed that the IPMI Reset Watchdog Timer commands are not being processed because of congestion on the iLO, leading to expiration of the IPMI timer and a subsequent reboot of the server.

Logs from the servers show the following messages that point to the IPMI watchdog reset commands not being serviced:

kernel: IPMI Watchdog: response: Error c0 on cmd 22 (Completion code c0 indicates 'Node Busy. Command could not be processed because command processing resources are temporarily unavailable')
kernel: IPMI Watchdog: response: Error ff on cmd 22 (Completion code ff indicates 'Unspecified error.')
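
To check whether a server is affected, its logs can be searched for these messages. The following is a minimal sketch; the function name count_watchdog_errors is hypothetical, and the log file location in the usage note is an assumption that may differ per system.

```shell
#!/bin/sh
# Illustrative sketch; count_watchdog_errors is a hypothetical helper name.

# count_watchdog_errors [FILE...]
# Reads syslog-style text from the given files (or stdin when none are given)
# and prints one line per completion code in the form "<code> <occurrences>".
count_watchdog_errors() {
    grep -h 'IPMI Watchdog: response: Error .* on cmd 22' "$@" |
        sed 's/.*response: Error \([0-9a-fA-F]*\) on cmd 22.*/\1/' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example, assuming kernel messages are written to /var/log/messages, run count_watchdog_errors /var/log/messages. Lines for c0 or ff indicate watchdog reset commands that were not serviced.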

Investigations point to temporary congestion on the iLO, caused by PCI devices (NICs, for example) sending an excessive amount of sensor data. This slows the iLO down to the point where the watchdog timer resets are not processed.

HP recommends a longer IPMI timeout which will allow the congestion to be cleared so that the watchdog resets will be serviced. Alternatively, the HP Advanced Service Recovery Daemon can be used in place of the IPMI hardware watchdog, as this mechanism does not use IPMI. However, the use of this Recovery Daemon is not compatible with TPD versions below 7.6.

Steps to extend IPMI watchdog timeout

1. Determine the hardware ID of the server by executing the following command

# hardwareInfo | grep ID

2. Create a new watchdogSetup file for this hardware ID by copying the G7 watchdogSetup file, which defines a longer timeout value, using the following command

# cp /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL685cG7 /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-<hardwareID>

Example for hardware ID ProLiantBL460cGen8: cp /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL685cG7 /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL460cGen8

3. Restart the TKLCwatchdog service by executing the following command

# service TKLCwatchdog restart

4. Validate watchdog timer value after boot by executing the following command

# ipmitool mc watchdog get

Verify that the output indicates 'Initial Countdown: 300 sec'.

Example output:
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x00
Initial Countdown: 300 sec
Present Countdown: 298 sec
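
The check in step 4 can also be scripted. The following is a minimal sketch that assumes the output format shown in the example above; the function name check_initial_countdown is hypothetical.

```shell
#!/bin/sh
# Illustrative sketch; check_initial_countdown is a hypothetical helper name.

# check_initial_countdown EXPECTED_SECONDS
# Reads the output of 'ipmitool mc watchdog get' on stdin and compares the
# 'Initial Countdown' value with the expected number of seconds.
# Returns 0 on a match and 1 otherwise.
check_initial_countdown() {
    expected=$1
    actual=$(sed -n 's/^[[:space:]]*Initial Countdown:[[:space:]]*\([0-9][0-9]*\) sec.*/\1/p' | head -n 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: initial countdown is ${actual} sec"
    else
        echo "MISMATCH: expected ${expected} sec, found '${actual}'"
        return 1
    fi
}
```

Usage: ipmitool mc watchdog get | check_initial_countdown 300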

Notes:

1. The default watchdog timer for ProLiantBL685cG7 is already set to 300 seconds, so no change is needed on that hardware.

2. This new file will survive upgrades.

3. If a server is replaced due to a system fault, this procedure must be run again after the new system is restored.

4. This procedure applies to TPD releases 6.5.1-82.28.0 through 7.5.0.0.0.0-88.46.0.
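
Because the procedure has to be repeated after a server replacement, the copy in step 2 could be wrapped in a small helper. The following is a sketch only; the function name create_watchdog_setup and its optional directory argument are hypothetical, and the hardware ID still has to be taken from the output of step 1.

```shell
#!/bin/sh
# Illustrative sketch; create_watchdog_setup is a hypothetical helper name.

# create_watchdog_setup HARDWARE_ID [DIR]
# Copies the G7 watchdogSetup file to a file named for HARDWARE_ID.
# DIR defaults to the directory used in step 2 and exists as a parameter
# only so the helper can be tried out against a scratch directory.
create_watchdog_setup() {
    hardware_id=$1
    dir=${2:-/usr/TKLC/plat/etc/hardware/watchdog}
    src="$dir/watchdogSetup-ProLiantBL685cG7"
    dst="$dir/watchdogSetup-$hardware_id"

    if [ -z "$hardware_id" ]; then
        echo "usage: create_watchdog_setup HARDWARE_ID [DIR]" >&2
        return 2
    fi
    if [ ! -f "$src" ]; then
        echo "template not found: $src" >&2
        return 1
    fi
    # The G7 file already carries the longer timeout (see note 1).
    if [ "$src" = "$dst" ]; then
        echo "nothing to do for $hardware_id"
        return 0
    fi
    cp "$src" "$dst" && echo "created $dst"
}
```

Usage: create_watchdog_setup ProLiantBL460cGen8 && service TKLCwatchdog restart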

Attachments

This solution has no attachment