Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure

Solution 2379256.1: Avoiding HP Server Resets due to IPMI Watchdog Timeout
Applies to:
BNS Platform Hardware - Version UDR 10.2 and later
Oracle Communications Performance Intelligence Center (PIC) Hardware - Version 10.2.0 and later
Information in this document applies to any platform.

Goal
Help customers and support teams increase the IPMI watchdog timeout value on HP Gen8 and Gen9 servers so that they can avoid system resets associated with iLO congestion.

Solution
TPD implements a hardware watchdog by setting up two timers. There is only a hardware watchdog running.
The TKLCwatchdog service initializes these timers by loading the ipmi_watch kernel module with a timeout of 120 seconds. It then starts the software watchdog daemon, which writes to the /dev/watchdog device every 5 seconds. On HP servers the IPMI BMC functionality is emulated by the iLO, and it is the iLO that resets the server when the 120-second timer expires. Each write to /dev/watchdog results in IPMI command 22h (Reset Watchdog Timer) being issued to the iLO. (Information collected from the IPMI spec.)

HP Server Resets due to IPMI watchdog timeout

There have been reports of HP servers randomly resetting due to watchdog timeouts. It is believed that the IPMI Reset Watchdog Timer commands are not being processed because of congestion on the iLO, which leads to expiration of the IPMI timer and a subsequent reboot of the server. Logs from the servers show the following messages that point to the IPMI watchdog reset commands not being serviced:
Investigations point to temporary congestion on the iLO caused by PCI devices (NICs, for example) sending an overload of sensor data, which slows the iLO down to the point that watchdog timer resets are not processed. HP recommends a longer IPMI timeout, which gives the congestion time to clear so that the watchdog resets are serviced again. Alternatively, the HP Advanced Service Recovery Daemon can be used in place of the IPMI hardware watchdog, as this mechanism does not use IPMI. However, this Recovery Daemon is not compatible with TPD versions below 7.6.
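To make the effect of a longer timeout concrete, the following is a minimal sketch of the arithmetic, assuming a simplified model in which the timer expires as soon as the iLO has serviced no reset for the full timeout period. The function names missed_resets_tolerated and survives_stall are hypothetical and are not part of TPD.

```shell
#!/bin/sh
# Simplified model (assumption): the countdown restarts at every serviced
# reset and the server is reset once no reset is serviced for timeout_s.

# Number of consecutive resets (one every interval_s seconds) that can go
# unserviced before the countdown reaches zero. Hypothetical helper.
missed_resets_tolerated() {
    timeout_s="$1"
    interval_s="$2"
    echo $(( timeout_s / interval_s ))
}

# Return 0 if an iLO stall of stall_s seconds is shorter than the timeout,
# i.e. the server would not be reset in this model. Hypothetical helper.
survives_stall() {
    timeout_s="$1"
    stall_s="$2"
    [ "$stall_s" -lt "$timeout_s" ]
}

missed_resets_tolerated 120 5   # prints 24
missed_resets_tolerated 300 5   # prints 60
```

With the 5-second write interval described above, a 300-second timeout tolerates 60 consecutive unserviced resets instead of 24, so in this model a congestion period of, say, 200 seconds would reset a server at 120 seconds but not at 300.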
Steps to extend IPMI watchdog timeout
1. Determine the hardware ID of the server by executing the following command

   # hardwareInfo | grep ID

2. Create a new watchdogSetup file for this hardware ID from the G7 watchdogSetup file, which has a longer timeout value defined, by executing the following command

   # cp /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL685cG7 /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-<hardwareID>

   Example for hardware ID ProLiantBL460cGen8:

   # cp /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL685cG7 /usr/TKLC/plat/etc/hardware/watchdog/watchdogSetup-ProLiantBL460cGen8

3. Restart the TKLCwatchdog service by executing the following command

   # service TKLCwatchdog restart

4. Validate the watchdog timer value after boot by executing the following command

   # ipmitool mc watchdog get

   (verify that the output indicates 'Initial Countdown: 300 sec')

Notes:
1. The watchdog timer default for ProLiantBL685cG7 is already set to 300 seconds, so the timer does not need to be changed on that hardware.
2. This new file will survive upgrades.
3. If a server is replaced due to a system fault, this procedure must be run again after the new system is restored.
4. This procedure applies to TPD releases 6.5.1-82.28.0 through 7.5.0.0.0.0-88.46.0.
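Steps 2 and 4 can be sketched as small shell helpers. This is an illustrative sketch only: the helper names watchdog_copy_command, initial_countdown and watchdog_timeout_ok are hypothetical, the hardware ID is passed in by hand because the exact format of the hardwareInfo output is not shown here, and the parser assumes that the 'Initial Countdown: ... sec' line starts at the beginning of a line in the ipmitool output.

```shell
#!/bin/sh
# Directory holding the per-hardware watchdogSetup files (from step 2).
WD_DIR=/usr/TKLC/plat/etc/hardware/watchdog

# Print (do not run) the copy command from step 2 for a given hardware ID,
# so it can be reviewed before execution. Hypothetical helper.
watchdog_copy_command() {
    hw_id="$1"
    echo "cp $WD_DIR/watchdogSetup-ProLiantBL685cG7 $WD_DIR/watchdogSetup-$hw_id"
}

# Read `ipmitool mc watchdog get` output on stdin and print the
# Initial Countdown value in seconds. Hypothetical helper; assumes the
# field appears as "Initial Countdown: <n> sec" at the start of a line.
initial_countdown() {
    sed -n 's/^Initial Countdown:[[:space:]]*\([0-9][0-9]*\) sec.*$/\1/p'
}

# Return 0 only if the Initial Countdown read from stdin equals $1.
watchdog_timeout_ok() {
    expected="$1"
    actual="$(initial_countdown)"
    [ -n "$actual" ] && [ "$actual" -eq "$expected" ]
}
```

A possible use on the server, after the service restart in step 3, would be `ipmitool mc watchdog get | watchdog_timeout_ok 300 && echo "Initial Countdown is 300 sec"`. Printing the copy command rather than executing it is a deliberate choice in this sketch, so that the target file name can be checked against the hardware ID first.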
Attachments
This solution has no attachment