Oracle Communications Products: Avoid HP Server Resets due to IPMI Watchdog Timeout

Asset ID:	1-77-2398577.1
Update Date:	2018-05-17
Keywords:

Solution Type Sun Alert Sure

Solution 2398577.1 : Oracle Communications Products: Avoid HP Server Resets due to IPMI Watchdog Timeout

Applies to:

BNS Platform Hardware - Version DSR 5.0 to DSR 8.2.0 [Release DSR 5.0 to DSR 8.0]
BNS Platform Hardware - Version POLICY 8.0 to POLICY 12.4.0 [Release POLICY 8.0 to POLICY 12.0]
BNS Platform Hardware - Version UDR 10.0 to UDR 12.4 [Release UDR 10.0 to UDR 12.0]
Oracle Communications Performance Intelligence Center (PIC) Hardware - Version 9.0.2 to 10.2.1 [Release 9.0 to 10.0]
Tekelec

Description

Applies to HP BL460c or DL380p Gen8 or Gen9 servers with iLO4 that run Tekelec Platform Distribution (TPD) operating system software prior to the TPD 7.6 release.
TPD is part of Oracle Communications software applications including DSR, Policy PCRF, and others. Refer to the application software Release Notes for the TPD release in use.

TPD implements a hardware watchdog by setting up the two timers listed below. Only a hardware watchdog is running; the software timer's role is to write to the watchdog device.

a hardware timer in the BMC (Baseboard Management Controller) with a timeout of 120 seconds
a software timer (watchdog daemon) which will write to the watchdog device every 5 seconds

The TKLCwatchdog service will initialize these timers by loading a kernel module with the timeout of 120 seconds and then start the software watchdog daemon to write to the device every 5 seconds.
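
As a rough illustration of the margin these two values give, the sketch below divides the hardware timeout by the write interval. The function name missed_writes_before_reset is hypothetical and not part of TPD; it only performs the arithmetic.

```shell
#!/bin/sh
# Rough estimate of how many consecutive watchdog writes can go
# unserviced before the BMC hardware timer expires.
# Args: $1 = hardware timeout in seconds, $2 = daemon write interval in seconds.
# (Hypothetical helper for illustration; not a TPD command.)
missed_writes_before_reset() {
    timeout_s=$1
    interval_s=$2
    echo $(( timeout_s / interval_s ))
}

# Values taken from the description above: 120 second timeout, 5 second writes.
missed_writes_before_reset 120 5   # prints 24
```

With these values, roughly 24 consecutive writes would have to go unserviced before the hardware timer expires.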

For HP servers, the IPMI BMC functionality is emulated by the iLO, so it is the iLO that resets the server when the 120 second timer expires.

Some reports have arisen where an HP server is reset unexpectedly due to watchdog timeout issues. The server may recover automatically (reboot) or may require manual intervention to boot and complete its recovery. Investigations point to temporary congestion on the iLO, caused by PCI devices (NICs, for example) sending an overload of sensor data, which slows the iLO down to the point that the watchdog timer resets are not processed.

To reduce exposure to, or prevent, server resets due to this condition, the IPMI timeout interval needs to be extended or the IPMI watchdog avoided altogether. Extending the IPMI timeout interval improves the chances that the congestion clears in time for the watchdog resets to be serviced. Alternatively, the HP Advanced Server Recovery Daemon (hp-asrd) can be used in place of the IPMI hardware watchdog; this mechanism does not use the IPMI bus and therefore avoids timeout issues due to bus congestion.

Either mitigation strategy requires a similar level of effort. If no mitigation strategy has yet been applied, Oracle recommends using the hp-asrd service instead of extending the timeout interval and continuing to use TKLCwatchdog, as this is a more comprehensive solution.

Occurrence

This condition has rarely been observed, but depending on the role of the affected server and its path to recovery, the impact can be significant.

Symptoms

Logs from the servers (typically syslog - /var/log/messages files) show the following messages that point to the IPMI watchdog reset commands not being serviced:

kernel: IPMI Watchdog: response: Error c0 on cmd 22 (Completion code c0 indicates 'Node Busy. Command could not be processed because command processing resources are temporarily unavailable')
kernel: IPMI Watchdog: response: Error ff on cmd 22 (Completion code ff indicates 'Unspecified error.')
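
One way to check a server for these messages is to search its syslog file. The following is a minimal sketch, assuming the messages appear in the form shown above; the function names are hypothetical and the log path is passed in by the caller.

```shell
#!/bin/sh
# Count kernel messages showing that an IPMI watchdog reset (cmd 22) failed.
# Arg: $1 = path to a syslog file, for example /var/log/messages.
# (Hypothetical helper for illustration.)
count_ipmi_wd_errors() {
    grep -c 'IPMI Watchdog: response: Error .* on cmd 22' "$1"
}

# List the distinct completion codes (for example c0 or ff) seen in the file.
ipmi_wd_error_codes() {
    sed -n 's/.*IPMI Watchdog: response: Error \([0-9a-f]*\) on cmd 22.*/\1/p' "$1" | sort -u
}

# Usage on a server (not executed here):
#   count_ipmi_wd_errors /var/log/messages
#   ipmi_wd_error_codes /var/log/messages
```

A count of zero does not rule out the condition on other servers or in rotated log files, so older logs may need to be checked the same way.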

Workaround

The following steps must be applied to each affected server. The user must be logged in to each server as an admin-level user to execute the commands.

Mitigation Option 1 (recommended). HP Automatic Server Recovery (ASR)

HP has a default mechanism for restarting the system in the event of a software hang: the HP Advanced Server Recovery Daemon. This daemon is started by the hp-asrd service.

Steps to set up the server to use hp-asrd instead of the IPMI watchdog

Disable TKLCwatchdog service
1. command: service_conf del TKLCwatchdog
Enable hp-asrd service
1. command: service_conf add hp-asrd
Reconfig services
1. command: service_conf reconfig
Change the asr timer from 10 minutes (default) to 5 minutes
1. command: hpasmcli
2. hpasmcli prompt: show asr (should display ASR timeout is 10 minutes)
3. hpasmcli prompt: set asr 5 (successfully set ASR timeout to 5 minutes)
4. hpasmcli prompt: quit
Stop the TKLCwatchdog service and start hp-asrd
1. command: service TKLCwatchdog stop
2. command: service hp-asrd start
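
The steps above can be collected into a single function, sketched below. The run wrapper and the function name switch_to_hp_asrd are hypothetical, and the use of hpasmcli -s to pass the set asr command non-interactively is an assumption that should be confirmed on the target server. By default the sketch only prints the commands; nothing is executed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN is set to 0.
# (Hypothetical wrapper for illustration.)
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Switch from the IPMI watchdog to hp-asrd, following the steps above.
switch_to_hp_asrd() {
    run service_conf del TKLCwatchdog &&
    run service_conf add hp-asrd &&
    run service_conf reconfig &&
    # Assumption: -s runs a single hpasmcli command without the prompt.
    run hpasmcli -s "set asr 5" &&
    run service TKLCwatchdog stop &&
    run service hp-asrd start
}

# Usage on an affected server (not executed here):
#   switch_to_hp_asrd              # dry run, prints the commands only
#   DRY_RUN=0 switch_to_hp_asrd    # executes the commands
```

Chaining the commands with && stops the sequence at the first failure, so TKLCwatchdog is not stopped if an earlier configuration step did not succeed.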

Notes:

If the TKLCwatchdog rpm is upgraded (through a TPD upgrade), the TKLCwatchdog service may be re-enabled. See the "Steps to disable TKLCwatchdog service" section below.
The default value of the ASR timer is 10 minutes. It is changed with the hpasmcli command.
Disabling the TKLCwatchdog service also disables the FS monitoring functionality set up via the watchdogMgr command.

Steps to disable TKLCwatchdog service (see the first note above)

You will only need to do this if the service is enabled.

Check if the TKLCwatchdog service is enabled.
1. command: service_conf query TKLCwatchdog
2. If the command outputs nothing, there is no need to continue.
Disable TKLCwatchdog service
1. command: service_conf del TKLCwatchdog
Reconfig services
1. command: service_conf reconfig
Stop the TKLCwatchdog service
1. command: service TKLCwatchdog stop
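
The check in the first step can be expressed as a small test on the captured command output. This is a minimal sketch; the function name service_is_configured is hypothetical, and it assumes only what the steps state, namely that empty output from service_conf query means the service is not enabled.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the captured `service_conf query` output
# contains anything other than whitespace, which the steps above treat as
# "the service is enabled". (Hypothetical helper for illustration.)
service_is_configured() {
    [ -n "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

# Usage on an affected server (not executed here):
#   if service_is_configured "$(service_conf query TKLCwatchdog)"; then
#       service_conf del TKLCwatchdog
#       service_conf reconfig
#       service TKLCwatchdog stop
#   fi
```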

Mitigation Option 2. Extending the IPMI Watchdog Timeout

Steps to extend IPMI watchdog timeout

Determine hardware ID of the server
1. command: hardwareInfo | grep ID
Create a new watchdogSetup file for this hardware ID by copying a watchdogSetup file that has a longer timeout value
1. command: cp /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL685cG7 /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-<hardwareID>
2. example for hardware ID ProLiantBL460cGen8: cp /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL685cG7 /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL460cGen8
Restart the TKLCwatchdog service
1. command: service TKLCwatchdog restart
Validate watchdog timer value after boot
1. command: ipmitool mc watchdog get (verify output indicates 'Initial Countdown: 300 sec')
  
  example output:
  Watchdog Timer Use: SMS/OS (0x44)
  Watchdog Timer Is: Started/Running
  Watchdog Timer Actions: Hard Reset (0x01)
  Pre-timeout interval: 0 seconds
  Timer Expiration Flags: 0x00
  Initial Countdown: 300 sec
  Present Countdown: 298 sec
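
The validation step can also be scripted by reading the ipmitool output and comparing the Initial Countdown value against the expected timeout. The sketch below assumes the output format shown in the example above; the function names initial_countdown and check_watchdog_timeout are hypothetical.

```shell
#!/bin/sh
# Extract the "Initial Countdown" value in seconds from
# `ipmitool mc watchdog get` output supplied on stdin.
# (Hypothetical helper; assumes the output format shown above.)
initial_countdown() {
    sed -n 's/^[[:space:]]*Initial Countdown:[[:space:]]*\([0-9][0-9]*\) sec.*/\1/p'
}

# Succeed only if the initial countdown read from stdin equals the
# expected value in seconds (default 300).
check_watchdog_timeout() {
    expected=${1:-300}
    actual=$(initial_countdown)
    [ "$actual" = "$expected" ]
}

# Usage on an affected server (not executed here):
#   ipmitool mc watchdog get | check_watchdog_timeout 300 && echo "timeout extended"
```

The check compares against Initial Countdown rather than Present Countdown because the present value changes continuously as the timer runs.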

Notes:

The watchdog timer default in the ProLiantBL685cG7 file is already set to 300 seconds, so the timer value in the copied file does not need to be changed.
The new file will survive upgrades.
In case of disaster recovery (DR), the timer will need to be extended again.
These steps assume the server is running TPD 6.5.1-82.28.0 or later.

Patches

A permanent resolution using hp-asrd will be available in application software releases that use TPD 7.6 or later.

History

14-MAY-2018 - Initial Publication

Attachments

This solution has no attachment