Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 2270203.1: Resolving Condition of "ECC Correctable Memory Error" Raised on all DIMMs in DSR HP Proliant BL460 G8 or G9 Servers

Created from <SR 3-14852139731>

Applies to:
Oracle Communications Diameter Signaling Router (DSR) - Version DSR 5.0 and later
Oracle Communications Performance Intelligence Center (PIC) Software - Version 10.1.5 and later
Oracle Communications Performance Intelligence Center (PIC) Hardware - Version 10.1.5 and later
Information in this document applies to any platform.

Symptoms
The problem condition may raise platform alarms on the Diameter Signaling Router GUI, including Event 32321 ECC Memory Correctable Error or 32300 Server Fan Failure. Syscheck may report some or all modules in class hardware failing. The server is a Hewlett Packard (HP) Generation 8 or 9 model running an iLO4 firmware version lower than 2.54.

Changes
This condition, generally rare, may appear without any noted event.

Cause
The problem is due to a bug identified in the HP iLO4 firmware involving the iLO NAND. The condition can cause varied problems, including errant reports of hardware issues through the hp-health/syscheck facilities.

Solution

Purpose
This document provides steps to remedy a server reporting a platform health check failure matching the symptoms described above. More specifically, it applies to HP ProLiant BL460c Gen8 or Gen9 models running HP Integrated Lights-Out 4 (iLO 4) version 2.53 or earlier. When the alarm is investigated, all DIMMs show a "Correctable ECC Memory Error", among other possible hardware module failures reported by syscheck. If this is observed, this troubleshooting procedure applies.

Impact
iLO on HP servers can be rebooted or reset without disrupting the host server OS. Restarting the hp-health daemon and the syscheck service should have no impact on operation of the application. Oracle recommends that all maintenance activity be conducted in a scheduled maintenance window, even if no impact is expected. This condition may be progressive, and should be addressed permanently via firmware upgrade and NAND reformat as soon as possible to prevent iLO4 degradation.
The troubleshooting steps offered here may resolve the condition temporarily, but if the condition is due to the NAND issue they will not arrest the degradation.

Troubleshooting Steps
Using an SSH client (e.g. PuTTY, SecureCRT), log in as admusr to the server exhibiting the issue and execute the following command:

[admusr@server ~]$ sudo syscheck -v hardware

Example errors returned (note that output may differ somewhat from what follows, but will be similar in content):

Running modules in class hardware...
cmosbattery: This hardware does not support monitoring the CMOS battery.
cmosbattery: The test will not be ran.
ecc: Checking ECC hardware.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m12: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m4: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m9: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m9: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m12: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m1: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m1: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m4: N/A discrete, CHIP: hpasm Time: 05/08/2017 18:48:34.
fan: Checking Status of Server Fans.
* fan: FAILURE:: MAJOR::3000000000000001 -- Server Fan Failure. This test uses the leaky bucket algorithm.
* fan: FAILURE:: Fan RPM is too low, hpasm,generic 2: ERROR: "SHOW" FAN command failed discrete, CHIP: hpasm
fancontrol: ProLiant DL 380p Gen8 does not support Fan Controls
fancontrol: Will not run the test.
oemHW: Only Oracle servers support hwmgmt.
psu: This hardware does not support power feed monitoring.
psu: Will not run test.
psu: Checking status of Server Supplies.
* psu: FAILURE:: MINOR::5000000000000004 -- Server Hardware Configuration Error
* psu: FAILURE:: Insufficient number of PSU sensors found. 1
serial: Running serial port configuration test
* serial: FAILURE:: MINOR::5000000000040000 -- Platform Health Check Failure
* serial: FAILURE:: Cannot determine embedded serial port value using command (/sbin/hpasmcli -s show serial embedded)
* serial: FAILURE:: MINOR::5000000000040000 -- Platform Health Check Failure
* serial: FAILURE:: Cannot determine virtual serial port value using command (/sbin/hpasmcli -s show serial virtual)
temp: Checking server temperature.
* temp: FAILURE:: MINOR::5000000000040000 -- Platform Health Check Failure
* temp: FAILURE:: There is no high temperature threshold! hpasm,generic 4: ERROR: "SHOW TEMP" command failed. discrete, CHIP: hpasm
voltage: ProLiant DL 380p Gen8 does not support voltage monitoring
voltage: Will not run test.
One or more module in class "hardware" FAILED

Workaround / Solution
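Before applying any workaround, it can help to confirm that captured syscheck output really shows the signature described above: correctable ECC errors reported against many DIMMs at once, possibly alongside hpasm command failures. The following is a minimal Python sketch of such a check, not an Oracle or HP tool. The function names and the `min_dimms` threshold are hypothetical, and the patterns are derived only from the example output in this document, so they may need adjusting for other syscheck versions.

```python
import re

# Pattern taken from the per-DIMM detail lines in the example output, e.g.
# "* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m12: ..."
DIMM_RE = re.compile(
    r"ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM (\w+):"
)

# Pattern for hpasm command failures seen in the example output, e.g.
# ERROR: "SHOW" FAN command failed   /   ERROR: "SHOW TEMP" command failed
HPASM_FAIL_RE = re.compile(r'ERROR: "SHOW[^"]*"(?: \w+)? command failed')


def summarize_syscheck(text):
    """Return the DIMMs flagged with ECC errors and the hpasm failure count."""
    dimms = sorted(set(DIMM_RE.findall(text)))
    hpasm_failures = len(HPASM_FAIL_RE.findall(text))
    return {"ecc_dimms": dimms, "hpasm_command_failures": hpasm_failures}


def matches_signature(text, min_dimms=2):
    """Heuristic (assumption): ECC errors on several DIMMs at once.

    The output does not state how many DIMMs are installed, so "all DIMMs"
    cannot be verified here; min_dimms is an illustrative threshold only.
    hpasm command failures are reported as supporting evidence, not required.
    """
    return len(summarize_syscheck(text)["ecc_dimms"]) >= min_dimms
```

A single DIMM reporting a correctable error would not satisfy this heuristic, which is intentional: that case more plausibly points at the DIMM itself than at the iLO condition this document covers.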
The signature matches a known bug in HP's firmware, currently resolved with Oracle's Firmware Upgrade Pack (FUP) 2.2.11 procedures. The permanent solution is to upgrade to firmware with the corrective content and to follow the steps to reformat the NAND as instructed in the FUP procedures. In the interim, a workaround can be applied which may clear the condition temporarily. Reference the HP Advisory at http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04996097
IF ENCOUNTERED, OPERATORS ARE ENCOURAGED TO UPGRADE FIRMWARE AS SOON AS POSSIBLE.
If the attempt to clear fails, contact Oracle DSR Technical Support via SR. For the permanent solution, operators are encouraged to upgrade to Oracle's Firmware Upgrade Pack release 2.2.11 along with HP's iLO4 v2.54 release immediately. Operators who sourced hardware through Oracle should open an SR to request a firmware upgrade through the normal upgrade process. Customers who purchased hardware from other sources should follow their local practice to have the firmware upgraded.

Warning / Note: Per the CGBU cotsfw team: "Until [iLO4 firmware release 2.50 or later is] applied the iLO is definitely still in a bad state. Every time an
6 Sept 2017 addendum: iLO4 release 2.54 supersedes 2.50 per HP's latest Advisory revision.

iLO4 v2.54
Steps for applying a final solution, which includes a discrete iLO4 FW update, are available in the following thread: http://mailfinder.us.oracle.com/thread/8938480
HP's advisory is available at the following link: http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04996097
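
When auditing several servers, the version threshold stated in this document (iLO4 releases lower than 2.54 are affected) can be applied programmatically. The following is a small Python sketch under stated assumptions: the function names are hypothetical, and how the version string is obtained from each iLO is outside the scope of this document and is not shown.

```python
# First iLO4 release named in this document as carrying the fix.
FIXED_ILO4 = (2, 54)


def parse_version(version):
    """Turn a string such as "2.53" or "v2.54" into a tuple of integers."""
    return tuple(int(part) for part in version.strip().lstrip("vV").split("."))


def ilo4_needs_upgrade(version):
    """True when the given iLO4 version is lower than 2.54."""
    return parse_version(version) < FIXED_ILO4


# Hypothetical inventory for illustration; host names are placeholders.
inventory = {"server-a": "2.53", "server-b": "2.54", "server-c": "v2.50"}
for host, version in sorted(inventory.items()):
    status = "upgrade required" if ilo4_needs_upgrade(version) else "ok"
    print(f"{host}: iLO4 {version} -> {status}")
```

Versions are compared as integer tuples instead of floating-point numbers so that each component is compared numerically, which avoids misordering releases whose minor numbers differ in digit count.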

References
<BUG:25305878> - ILO4 NAND FLASH DEVICE MAY NOT INITIALIZE OR MOUNT PROPERLY

Attachments
This solution has no attachment