Resolving Condition of "ECC Correctable Memory Error" Raised on all DIMMs in DSR HP Proliant BL460 G8 or G9 Servers

Asset ID:	1-72-2270203.1
Update Date:	2017-09-19
Keywords:

Solution Type Problem Resolution Sure

Solution 2270203.1 : Resolving Condition of "ECC Correctable Memory Error" Raised on all DIMMs in DSR HP Proliant BL460 G8 or G9 Servers

Applies to:

Oracle Communications Diameter Signaling Router (DSR) - Version DSR 5.0 and later
Oracle Communications Performance Intelligence Center (PIC) Software - Version 10.1.5 and later
Oracle Communications Performance Intelligence Center (PIC) Hardware - Version 10.1.5 and later
Information in this document applies to any platform.

Symptoms

This condition may raise platform alarms on the Diameter Signaling Router GUI, including Event 32321 (ECC Memory Correctable Error) or Event 32300 (Server Fan Failure).

Syscheck may report some or all modules in class hardware failing.

Server is Hewlett Packard (HP) Generation 8 or 9, running iLO4 firmware versions lower than 2.54.
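
The firmware condition above amounts to a simple version comparison. The following is a minimal sketch and not part of the original procedure: the function name ilo4_needs_upgrade is hypothetical, the version string is assumed to be supplied by the operator (for example, as read from the iLO web interface), and GNU sort with the -V option is assumed to be available on the host.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) when the given iLO4 firmware
# version is lower than 2.54, the release named in this document.
ilo4_needs_upgrade() {
    local current="$1" fixed="2.54" lowest
    # An exact match with the fixed release needs no upgrade.
    [ "$current" = "$fixed" ] && return 1
    # sort -V orders version strings; the first line is the lower one.
    lowest="$(printf '%s\n%s\n' "$current" "$fixed" | sort -V | head -n 1)"
    [ "$lowest" = "$current" ]
}
```

For example, "ilo4_needs_upgrade 2.53 && echo affected" prints "affected", while the same call with 2.54 prints nothing.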

Changes

This condition, generally rare, may appear without any noted event.

Cause

The problem is caused by a bug in the HP iLO4 firmware involving the iLO NAND flash device. The condition can cause a variety of problems, including erroneous reports of hardware issues through the hp-health and syscheck facilities.

Solution

Purpose

This document provides steps to remedy a Server reporting a platform health check failure matching the symptoms described above.

More specifically, this document applies to HP ProLiant BL460c Gen8 or Gen9 models running HP Integrated Lights-Out 4 (iLO 4) version 2.53 or earlier.

When the alarm is investigated, all DIMMs show a "Correctable ECC Memory Error" among other possible hardware module failures reported by syscheck. If this is observed, then this troubleshooting procedure applies.

Impact

iLO on HP servers can be rebooted or reset without disrupting the host server OS.

Restarting the hp-health daemon and the syscheck service should have no impact on the operation of the application.

Oracle recommends all maintenance activity be conducted in a scheduled maintenance window, even if no impact is expected.

This condition may be progressive and should be addressed permanently, via firmware upgrade and NAND reformat, as soon as possible to prevent iLO4 degradation. The troubleshooting steps offered here may resolve the condition temporarily, but if the cause is the NAND issue they will not arrest the degradation.

Troubleshooting Steps

Using an SSH client (e.g., PuTTY or SecureCRT), log in as admusr to the affected server and execute the following command:

[admusr@server ~]$ sudo syscheck -v hardware

Example errors returned (note output may differ somewhat from what follows, but will be similar in content):

Running modules in class hardware...
cmosbattery: This hardware does not support monitoring the CMOS battery.
cmosbattery: The test will not be ran.
ecc: Checking ECC hardware.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m12: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m4: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m9: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m9: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m12: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m1: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m1: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m4: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
fan: Checking Status of Server Fans.
* fan: FAILURE:: MAJOR::3000000000000001 -- Server Fan Failure. This test uses the leaky bucket algorithm.
* fan: FAILURE:: Fan RPM is too low, hpasm,generic 2: ERROR: "SHOW" FAN command failed discrete, CHIP: hpasm
fancontrol: ProLiant DL 380p Gen8 does not support Fan Controls
fancontrol: Will not run the test.
oemHW: Only Oracle servers support hwmgmt.
psu: This hardware does not support power feed monitoring.
psu: Will not run test.
psu: Checking status of Server Supplies.
* psu: FAILURE:: MINOR::5000000000000004 -- Server Hardware Configuration Error
* psu: FAILURE:: Insufficient number of PSU sensors found. 1
serial: Running serial port configuration test
* serial: FAILURE:: MINOR::5000000000040000 -- Platform Health Check Failure
* serial: FAILURE:: Cannot determine embedded serial port value using command (/sbin/hpasmcli -s show serial embedded)
* serial: FAILURE:: MINOR::5000000000040000 -- Platform Health Check Failure
* serial: FAILURE:: Cannot determine virtual serial port value using command (/sbin/hpasmcli -s show serial virtual)
temp: Checking server temperature.
* temp: FAILURE:: MINOR::5000000000040000 -- Platform Health Check Failure
* temp: FAILURE:: There is no high temperature threshold! hpasm,generic 4: ERROR: "SHOW TEMP" command failed. discrete, CHIP: hpasm
voltage: ProLiant DL 380p Gen8 does not support voltage monitoring
voltage: Will not run test.
One or more module in class "hardware" FAILED
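
The signature in the example above (many DIMMs flagged together, alongside hpasm queries that fail outright) can be summarized with a small filter. This is a sketch only: the function name summarize_syscheck is hypothetical, and the patterns are taken from the example output above, so they may need adjusting if the output on a given server differs.

```shell
#!/bin/bash
# Hypothetical helper: reads syscheck output on stdin and reports how many
# distinct DIMMs show a correctable ECC error and how many hpasm queries
# failed. Patterns are assumed from the example output in this document.
summarize_syscheck() {
    local output dimms hpasm_errors
    output="$(cat)"
    # Distinct DIMM identifiers (e.g. p2m12) named in ECC failure lines.
    dimms="$(printf '%s\n' "$output" \
        | grep 'ECC Correctable Memory Error detected' \
        | sed -n 's/.*DIMM \([^:]*\):.*/\1/p' \
        | sort -u | grep -c .)"
    # Lines where an hpasm query itself failed (fan, temperature, ...).
    hpasm_errors="$(printf '%s\n' "$output" | grep -c 'command failed')"
    echo "DIMMs reporting ECC errors: $dimms"
    echo "Failed hpasm commands: $hpasm_errors"
}
```

A possible invocation is "sudo syscheck -v hardware | summarize_syscheck". A DIMM count equal to the number of installed DIMMs, together with failed hpasm commands, is consistent with the condition described in this document rather than with genuine memory faults.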

Workaround / Solution

The signature matches a known bug in HP's firmware, currently resolved with Oracle's Firmware Upgrade Pack (FUP) 2.2.11 procedures. The permanent solution is to upgrade to firmware containing the fix and to reformat the NAND as instructed in the FUP procedures; in the interim, a workaround can be applied that may clear the condition temporarily. Reference the HP Advisory at http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04996097

IF ENCOUNTERED, OPERATORS ARE ENCOURAGED TO UPGRADE FIRMWARE AS SOON AS POSSIBLE.

If the symptoms match the description and example above, first attempt to clear the condition by restarting the hp-health service and syscheck:

[admusr@server ~]$ sudo service hp-health restart
           Note: If the above command hangs, press Ctrl-C to cancel it, then run:
                  [admusr@server ~]$ sudo pkill -9 hpasmlited
                  [admusr@server ~]$ sudo service hp-health start
Then restart syscheck:
[admusr@server ~]$ sudo restart syscheck
If this fails to clear the condition, continue.
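
The restart sequence above, including the fallback for a hung restart, can be sketched as a script. This is an illustration under stated assumptions, not an Oracle-supplied tool: the function names, the DRY_RUN switch, and the 60-second limit are all hypothetical choices, and timeout(1) from GNU coreutils is assumed to be installed.

```shell
#!/bin/bash
# Sketch: restart hp-health with a time limit, falling back to the
# documented kill-and-start sequence if the restart hangs.
# Set DRY_RUN=1 to print the commands instead of executing them.

HP_HEALTH_TIMEOUT="${HP_HEALTH_TIMEOUT:-60}"   # seconds; an assumed value

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_hp_health() {
    local rc=0
    # timeout(1) exits with status 124 when the time limit is reached.
    run timeout "$HP_HEALTH_TIMEOUT" sudo service hp-health restart || rc=$?
    if [ "$rc" -eq 124 ]; then
        # Restart hung: kill the daemon and start the service again.
        run sudo pkill -9 hpasmlited
        run sudo service hp-health start
    fi
    run sudo restart syscheck
}
```

Running "DRY_RUN=1 restart_hp_health" first shows the commands that would be issued without touching the services.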

Example output of hp-health restart:

Using Proliant Standard
IPMI based System Health Monitor
Already stopped Proliant Standard
IPMI based System Health Monitor (hpasmlited): [ OK ]
Starting Proliant Standard
IPMI based System Health Monitor (hpasmlited): [ OK ]
If the above fails, reset iLO via the GUI:
1. Log in to the iLO GUI via a web browser (IE or Firefox) as root or Administrator.
2. Navigate to Information, then the Diagnostics screen; in the 'Reset iLO' section, select [Reset].
3. When prompted, answer 'Yes'. The GUI connection will be lost while the iLO processor reboots.
4. After a few minutes, you should be able to log into iLO again.
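
Rather than retrying the login by hand, the wait in step 4 can be scripted. The following is a rough sketch under several assumptions: the function names are hypothetical, the iLO web interface is assumed to answer HTTPS requests at the given address, and certificate verification is skipped (--insecure) on the assumption that the iLO uses a self-signed certificate. The default of 30 attempts at 10-second intervals is an arbitrary choice.

```shell
#!/bin/bash
# Probe: succeeds when the iLO web interface answers an HTTPS request.
# --insecure is an assumption (self-signed iLO certificate).
ilo_probe() {
    curl --insecure --silent --output /dev/null --max-time 5 "https://$1/"
}

# Poll the iLO until it responds or the attempts run out.
# Usage: wait_for_ilo HOST [TRIES] [DELAY_SECONDS]
wait_for_ilo() {
    local host="$1" tries="${2:-30}" delay="${3:-10}" i
    for ((i = 1; i <= tries; i++)); do
        if ilo_probe "$host"; then
            echo "iLO at $host responded after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "iLO at $host did not respond after $tries attempt(s)" >&2
    return 1
}
```

For example, "wait_for_ilo ilo.example.com" (a placeholder hostname) polls for up to about five minutes before giving up.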
Finally, restart the syscheck service:

[admusr@server ~]$ sudo restart syscheck
Example output:

syscheck start/running, process 38288
Execute syscheck again; the failures in the hardware module should clear. Alarms on the application GUI should likewise clear after a few minutes.

If the attempt to clear fails, contact Oracle DSR Technical Support via SR.

For the permanent solution, operators are encouraged to upgrade immediately to Oracle's Firmware Upgrade Pack release 2.2.11 along with HP's iLO4 v2.54 release. Operators who sourced hardware through Oracle should open an SR to request a firmware upgrade through the normal upgrade process. Customers who purchased hardware from other sources should follow their local practice to have the firmware upgraded.

Warning / Note: Per the CGBU cotsfw team:

"Until [iLO4 firmware release 2.50 or later is] applied the iLO is definitely still in a bad state. Every time an
iLO is reset while running an affected firmware version (less than 2.50) there is a possibility that it can have
this problem. However, once the problem has already occurred there is now corruption of the iLO4 NAND
device which can continue to cause the iLO4 to malfunction even if the iLO4 reset allowed the NAND device
to initialize and mount properly.
If the problem gets bad enough we have seen a couple of instances where this issue causes the iLO to get
into such a bad state that it reboots the server unexpectedly."

6 Sept 2017 addendum: iLO4 release 2.54 supersedes 2.50 per HP's latest Advisory revision. iLO4 v2.54
must be sourced directly from HP's website per latest FUP 2.2.11 revision; it will be included with FUP
2.2.12, available soon.

Steps for applying a final solution - which includes a discrete iLO4 FW update - are available in the following thread:

http://mailfinder.us.oracle.com/thread/8938480

HP's advisory is available on the following link:

http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04996097

References

<BUG:25305878> - ILO4 NAND FLASH DEVICE MAY NOT INITIALIZE OR MOUNT PROPERLY

Attachments

This solution has no attachment