Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution

2398577.1 : Oracle Communications Products: Avoid HP Server Resets due to IPMI Watchdog Timeout

Oracle Communications Products operating on HP Generation 8 or Generation 9 Hardware
Applies to:
BNS Platform Hardware - Version DSR 5.0 to DSR 8.2.0 [Release DSR 5.0 to DSR 8.0]
BNS Platform Hardware - Version POLICY 8.0 to POLICY 12.4.0 [Release POLICY 8.0 to POLICY 12.0]
BNS Platform Hardware - Version UDR 10.0 to UDR 12.4 [Release UDR 10.0 to UDR 12.0]
Oracle Communications Performance Intelligence Center (PIC) Hardware - Version 9.0.2 to 10.2.1 [Release 9.0 to 10.0]
Tekelec

Description
Applies to HP bl460c or dl380p Gen8 or Gen9 Servers with iLO4 and running Tekelec Platform Distribution (TPD) operating system software prior to TPD 7.6 release.
TPD implements a hardware watchdog by setting up two timers. Only a hardware watchdog is running.

The TKLCwatchdog service initializes these timers by loading a kernel module with a timeout of 120 seconds and then starting the software watchdog daemon, which writes to the device every 5 seconds. On HP servers the IPMI BMC functionality is emulated by the iLO, so it is the iLO that resets the server when the 120-second timer expires.

There have been reports of HP servers being reset unexpectedly because of watchdog timeouts. The server may recover automatically (reboot), or it may require manual intervention to boot and complete its recovery. Investigation points to temporary congestion on the iLO: PCI devices (NICs, for example) send an overload of sensor data that slows the iLO to the point where the watchdog timer resets are not processed.

To reduce exposure to, or prevent, a server reset due to this condition, the IPMI timeout interval needs to be extended or the IPMI watchdog avoided. Extending the IPMI timeout interval improves the chances that the congestion clears in time for the watchdog resets to be serviced. Alternatively, the HP Advanced Server Recovery Daemon (hp-asrd) can be used in place of the IPMI hardware watchdog; this mechanism does not use the IPMI bus and therefore avoids timeout issues caused by bus congestion.

Similar effort is involved in invoking either mitigation strategy. If no mitigation strategy has yet been applied, Oracle recommends using the hp-asrd service instead of extending the timeout interval and using TKLCwatchdog, as this is the more comprehensive solution.
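The timing relationship described above can be illustrated with a small model. The following is a minimal sketch only, not TPD or iLO code: the function name `resets_during_stall` and the 300-second extended timeout are hypothetical values chosen for illustration, and the model assumes the stall begins immediately after a successfully serviced write. Only the 120-second timeout and the 5-second write interval come from the text.

```python
def resets_during_stall(timeout_s: int, pet_interval_s: int, stall_s: int) -> bool:
    """Toy model: does the watchdog countdown expire during a servicing stall?

    Assumes the last serviced write happened at t=0 and that every write
    attempted before `stall_s` seconds have elapsed is silently dropped.
    Returns True if the countdown reaches zero before a write is serviced.
    """
    if pet_interval_s <= 0:
        raise ValueError("pet_interval_s must be positive")
    t = 0
    while True:
        t += pet_interval_s          # the daemon attempts its next write
        if t >= timeout_s:
            return True              # countdown expired first: server is reset
        if t >= stall_s:
            return False             # write is serviced: countdown restarts


if __name__ == "__main__":
    # 120 s and 5 s are taken from the text; 300 s is a hypothetical
    # extended timeout used only to show the effect of a longer interval.
    for timeout in (120, 300):
        for stall in (60, 116, 200):
            outcome = "reset" if resets_during_stall(timeout, 5, stall) else "survives"
            print(f"timeout={timeout}s stall={stall}s -> {outcome}")
```

Under these assumptions, a longer timeout simply widens the window in which congestion can clear before the countdown expires, which is the reasoning behind extending the interval. It does not remove the dependency on the IPMI path, which is why the text treats hp-asrd as the more comprehensive option.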
Occurrence
Observed occurrences of this condition have been rare, but depending on the role of the affected server and the path to recovery, the impact can be significant.

Symptoms
Logs from the servers (typically syslog, the /var/log/messages files) show the following messages, which indicate that the IPMI watchdog reset commands are not being serviced:
Workaround
The following must be applied to each affected server. The user must be logged in to each server as an admin-level user to execute the commands.

Mitigation Option 1 (recommended). HP Automatic Server Recovery (ASR)
HP has a default mechanism for restarting the system in the event of a software hang: the HP Advanced Server Recovery Daemon. This daemon is started by the hp-asrd service.

Steps to set up the server to use hp-asrd instead of IPMI-watchdog
Notes:
Steps to disable the TKLCwatchdog service (see Note #1 above). This is only needed if the service is enabled.
Mitigation Option 2. Extending the IPMI Watchdog Timeout
Steps to extend the IPMI watchdog timeout
Notes:
Patches
Permanent resolution utilizing hp-asrd will be available in application software releases utilizing TPD 7.6 or later.

History
14-MAY-2018 - Initial Publication

Attachments
This solution has no attachment