Asset ID: 1-75-2214816.1
Update Date: 2017-02-27
Solution Type: Troubleshooting Sure
Solution 2214816.1: Troubleshooting "Correctable ECC Memory Alarm" on all DIMMs of OCUDR HP Proliant BL460 G8 or G9 Servers
Related Items
- Oracle Communications User Data Repository
Related Categories
- PLA-Support>Sun Systems>CommsGBU>Broadband Network Solutions>SN-SND: Tekelec UDR
Applies to:
Oracle Communications User Data Repository - Version UDR 10.0 to UDR 12.1 [Release UDR 10.0 to UDR 12.0]
Tekelec
Purpose
This document provides steps to remedy a server on the UDR system that reports a platform health check failure. The failure can be seen on one of the following elements: TVOE, NOAM, SOAM, or MP.
More specifically, this document applies to HP ProLiant BL460c Gen8 or Gen9 servers running HP Integrated Lights-Out 4 (iLO 4) firmware version 2.42 or earlier.
When the alarm is investigated, all DIMMs show a "Correctable ECC Memory Error". If this is observed, this troubleshooting procedure applies.
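The firmware condition above can be checked mechanically once the iLO 4 firmware version is known. The following is a minimal sketch, not an Oracle or HP tool: the function names are hypothetical, and it assumes the version is reported as a dotted string with a two-digit minor part (for example "2.42"). How the version string is obtained (iLO GUI, CLI, or inventory) is left to the reader.

```python
# Last iLO 4 firmware version that this note lists as affected.
LAST_AFFECTED = (2, 42)


def parse_version(text):
    """Turn a dotted version string such as '2.42' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def ilo_firmware_affected(version):
    """Return True if the iLO 4 firmware version is 2.42 or earlier.

    Assumption: the minor part is written with two digits ('2.03', '2.42'),
    so comparing the parts numerically gives the intended ordering.
    """
    return parse_version(version) <= LAST_AFFECTED


if __name__ == "__main__":
    for candidate in ("2.30", "2.42", "2.44"):
        state = "affected" if ilo_firmware_affected(candidate) else "not affected"
        print(candidate, state)
```

Comparing integer tuples rather than raw strings avoids mistakes such as treating "2.9" as later than "2.42" through plain string ordering.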
Troubleshooting Steps
Impact
iLO on HP servers can be rebooted or reset without disrupting the host server OS.
Action Plan
1. Using an SSH client such as PuTTY, log in to the affected server and execute the following command:
$ sudo syscheck -v hardware
Running modules in class hardware...
cmosbattery: This hardware does not support monitoring the CMOS battery.
cmosbattery: The test will not be ran.
ecc: Checking ECC hardware.
Discarding cache...
ERROR: /sbin/hpasmcli Command error: (1, 0, 0).
ERROR: --------- hpasmcli Output ---------------
ERROR:
ERROR: ERROR: Could not open /dev/cpqhealth/cdt.
ERROR: Please make sure the Health Monitor is started.
ERROR: --------- End hpasmcli Output -----------
ERROR: CMD: /sbin/hpasmcli -s "show temp; show dimm"
ERROR: Could not rescan the sensor chip list!
ERROR: Failed scanning for sensors of type Sensors::Driver::Hpasm!
ERROR: Failure processing sensors output!
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m4: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m5: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m3: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m6: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m8: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m2: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m7: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m1: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m1: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m7: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m2: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m3: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m5: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p1m6: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.
* ecc: FAILURE:: MAJOR::3000000000200000 -- Correctable ECC Memory Error
* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m8: N/A discrete, CHIP: hpasm Time: 08/03/2016 21:21:45.
ERROR: /sbin/hpasmcli Command error: (1, 0, 0).
ERROR: --------- hpasmcli Output ---------------
ERROR:
ERROR: ERROR: Could not open /dev/cpqhealth/cdt.
ERROR: Please make sure the Health Monitor is started.
ERROR: --------- End hpasmcli Output -----------
ERROR: CMD: /sbin/hpasmcli -s "show temp; show dimm"
ERROR: Could not rescan the sensor chip list!
ERROR: Failed scanning for sensors of type Sensors::Driver::Hpasm!
ERROR: Failure processing sensors output!
ERROR: /sbin/hpasmcli Command error: (1, 0, 0).
ERROR: --------- hpasmcli Output ---------------
ERROR:
ERROR: ERROR: Could not open /dev/cpqhealth/cdt.
ERROR: Please make sure the Health Monitor is started.
ERROR: --------- End hpasmcli Output -----------
ERROR: CMD: /sbin/hpasmcli -s "show temp; show dimm"
ERROR: Could not rescan the sensor chip list!
ERROR: Failed scanning for sensors of type Sensors::Driver::Hpasm!
ERROR: Failure processing sensors output!
* temp: FAILURE:: MINOR::5000000000040000 -- Platform Health Check Failure
* temp: FAILURE:: There is no high temperature threshold! hpasm,generic 2: ERROR: "SHOW TEMP" command failed. discrete, CHIP: hpasm
2. If the command output is similar to the output shown in step 1, restart the iLO using one of the following methods:
1) Restart iLO from the iLO GUI
- Access the iLO GUI.
- Expand 'Information' and select 'Diagnostics'.
- Click the Reset button.
OR
2) Restart iLO from the CLI
- SSH to the affected blade's iLO address as root or Administrator and enter the following commands:
</>hpiLO-> cd /map1
</map1>hpiLO-> reset
Note: Please wait a minimum of 5 minutes for the iLO to reinitialize.
3. Log in to the affected server's shell as the root user and execute the following command:
[root@<hostname> ~]# service hp-health start
------- End of Procedure -------
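To decide whether saved "syscheck -v hardware" output matches the symptom this procedure targets, the ECC failure lines can be tallied by DIMM label. The sketch below is illustrative only and makes several assumptions: the function names and the installed_dimms parameter are hypothetical, the regular expression is derived from the sample output in step 1, and the number of installed DIMMs must be supplied by the operator because this note does not state it.

```python
import re

# Captures the DIMM label from lines such as:
# "* ecc: FAILURE:: ECC Correctable Memory Error detected, hpasm,DIMM p2m4: N/A discrete, ..."
DIMM_RE = re.compile(r"ECC Correctable Memory Error detected, hpasm,DIMM (\S+?):")

# hpasmcli error text that appears in the sample output of step 1.
HEALTH_MONITOR_DOWN = "Could not open /dev/cpqhealth/cdt"


def summarize_syscheck(output):
    """Collect the distinct DIMM labels flagged with correctable ECC errors."""
    return {
        "dimms": sorted(set(DIMM_RE.findall(output))),
        "health_monitor_down": HEALTH_MONITOR_DOWN in output,
    }


def all_dimms_flagged(output, installed_dimms):
    """True when at least installed_dimms distinct DIMMs are flagged.

    installed_dimms is a hypothetical, operator-supplied count.
    """
    return len(summarize_syscheck(output)["dimms"]) >= installed_dimms


if __name__ == "__main__":
    sample = (
        "* ecc: FAILURE:: ECC Correctable Memory Error detected, "
        "hpasm,DIMM p2m4: N/A discrete, CHIP: hpasm Time: 07/11/2016 04:20:49.\n"
        "ERROR: ERROR: Could not open /dev/cpqhealth/cdt.\n"
    )
    print(summarize_syscheck(sample))
```

If only one or two DIMMs are flagged, the cause may be a genuine memory fault rather than the iLO condition described here. Running the same check on fresh syscheck output after step 3 is one way to confirm that the failures no longer appear.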
Long Term Solution
The long-term fix is to upgrade the blade server to Oracle Firmware Upgrade Package (FUP) 2.2.10, which includes iLO firmware 2.44; that firmware version fixes this issue.
- If Oracle furnished your hardware and you would like to upgrade your firmware, please open a new SR to request a firmware upgrade.
- If you purchased your hardware from another source, follow your local practice to have the firmware upgraded.
Attachments
This solution has no attachment