Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1956549.1
Update Date:2017-01-05

Solution Type  Problem Resolution Sure

Solution  1956549.1 :   The BDA Utility bdacheckhw Reports a Memory Failure: "WARNING: Hardware errors reported by ILOM : fault.memory.intel.sb.dimm_ce"


Related Items
  • Big Data Appliance X3-2 Hardware
  • Big Data Appliance X4-2 Hardware
  • Big Data Appliance Hardware
Related Categories
  • PLA-Support>Eng Systems>BDA>Big Data Appliance>DB: BDA_EST




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-9966605076>

Applies to:

Big Data Appliance X3-2 Hardware - Version All Versions and later
Big Data Appliance X4-2 Hardware - Version All Versions and later
Big Data Appliance Hardware - Version All Versions and later
x86_64

Symptoms

1. Running the BDA utility bdacheckhw on a node reports a memory failure: fault.memory.intel.sb.dimm_ce.

The complete message looks like:

WARNING: Hardware errors reported by ILOM : fault.memory.intel.sb.dimm_ce
INFO: Run 'ipmitool sunoem cli "show faulty"' to see the full error
...
WARNING: Big Data Appliance warnings during hardware validation checks


2. Running 'ipmitool sunoem cli "show faulty"' reports the same fault:

# ipmitool sunoem cli "show faulty"
  
Connected. Use ^D to exit.
-> show faulty
  
Target              | Property               | Value
--------------------+------------------------+---------------------------------
/SP/faultmgmt/0     | fru                    | /SYS/MB/P0/D7
/SP/faultmgmt/0/    | class                  | fault.memory.intel.sb.dimm_ce
faults/0            |                        |
/SP/faultmgmt/0/    | sunw-msg-id            | SPX86-8004-CE
faults/0
...                             
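
For scripted checks, the table printed by "show faulty" can be turned into records. The following is a minimal Python sketch, not an Oracle-supplied tool. The function names parse_show_faulty and faulted_frus are hypothetical, and the parser assumes the layout matches the excerpt above: a dashed separator under the header, pipe-separated columns, and long targets wrapped onto the following line.

```python
def parse_show_faulty(text):
    """Parse 'show faulty' table text into (target, property, value) tuples.

    Assumes the layout shown in the excerpt: a dashed separator line under
    the header, pipe-separated columns, and long targets wrapped onto the
    next line (for example '/SP/faultmgmt/0/' followed by 'faults/0').
    """
    rows = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            in_table = True              # data rows start after the separator
            continue
        if not in_table or not stripped:
            continue
        if stripped.startswith("->") or stripped.startswith("..."):
            continue                     # prompt lines or elided output
        cells = [c.strip() for c in line.split("|", 2)]
        cells += [""] * (3 - len(cells))
        target, prop, value = cells
        if not prop and not value and rows:
            # Continuation line: append the fragment to the previous target.
            prev_target, prev_prop, prev_value = rows[-1]
            rows[-1] = (prev_target + target, prev_prop, prev_value)
        else:
            rows.append((target, prop, value))
    return rows


def faulted_frus(rows):
    """Return the FRU paths listed in the parsed rows."""
    return [value for _, prop, value in rows if prop == "fru"]
```

An empty list from faulted_frus corresponds to the empty table seen after the reboot later in this document.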


3. Rebooting the server clears the memory fault.

4. The ILOM snapshot taken after the reboot confirms that the fault is cleared and that no DIMM faults are present.

a) In ./ilom/@usr@local@bin@spshexec_show_-script_@X@logs@event@list.out, the event log shows the fault and shows that it was cleared after the reboot:

1060   Wed Dec  3 15:08:08 2014  Fault     Repair    minor
      Fault fault.memory.intel.sb.dimm_ce on component /SYS/MB/P0/D7 cleared
...
1058   Tue Dec  2 11:51:11 2014  Fault     Fault     critical
      Fault detected at time = Tue Dec  2 11:51:11 2014. The suspect component:
       /SYS/MB/P0/D7 has fault.memory.intel.sb.dimm_ce with probability=100. Re
      fer to http://www.sun.com/msg/SPX86-8004-CE for details.
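
The fault and repair entries in this event list can be paired programmatically to see how long a component stayed faulted. The sketch below is illustrative only. The names parse_events and repair_times are hypothetical, and it assumes that every event starts with a header line in the format shown above (ID, timestamp, class, type, severity), that continuation lines are indented, that event IDs increase with time, and that English day and month names are used.

```python
import re
from datetime import datetime

# Header line of an event: id, timestamp, class, type, severity.
_EVENT_HEADER = re.compile(
    r"^(\d+)\s+(\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)
_COMPONENT = re.compile(r"/SYS/\S+")


def parse_events(text):
    """Return a list of event dicts parsed from event-list text."""
    events = []
    for line in text.splitlines():
        match = _EVENT_HEADER.match(line)
        if match:
            event_id, stamp, _cls, kind, severity = match.groups()
            events.append({
                "id": int(event_id),
                # Collapse repeated spaces ("Dec  3") before parsing.
                "time": datetime.strptime(" ".join(stamp.split()),
                                          "%a %b %d %H:%M:%S %Y"),
                "type": kind,
                "severity": severity,
                "message": "",
            })
        elif events and line.strip() and line.strip() != "...":
            # Indented continuation line: append to the current message.
            # Wrapped words are joined with a space, so a word split by the
            # log's line wrap stays split.
            events[-1]["message"] = (
                events[-1]["message"] + " " + line.strip()
            ).strip()
    return events


def repair_times(events):
    """Map component path -> (fault_time, repair_time) for repaired faults."""
    open_faults, repaired = {}, {}
    for event in sorted(events, key=lambda e: e["id"]):
        found = _COMPONENT.search(event["message"])
        if not found:
            continue
        fru = found.group(0)
        if event["type"] == "Fault":
            open_faults[fru] = event["time"]
        elif event["type"] == "Repair" and fru in open_faults:
            repaired[fru] = (open_faults.pop(fru), event["time"])
    return repaired
```

For the two events shown above, repair_times pairs event 1058 with event 1060 for /SYS/MB/P0/D7.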


b) The output of "show faulty" after the reboot shows that the fault is cleared:

Target              | Property               | Value                         
--------------------+------------------------+---------------------------------

-> Session closed

The fault LEDs reported in ipmiint_sunoem_led_get.out are also off:

P0/SERVICE       | OFF
...
P0/D6/SERV       | OFF
P0/D7/SERV       | OFF   <<< Not faulted
P1/SERVICE       | OFF
...
P1/D7/SERV       | OFF

c) The fault history in the ILOM snapshot also shows the fault as "Repaired" and "Resolved" after the reboot:

...
2014-12-02/11:51:11  ef3f77c5-d16b-6a08-fca1-dbce2c725eee   SPX86-8004-CE 
       FRU       = /SYS/MB/P0/D7

2014-12-03/15:08:08  ef3f77c5-d16b-6a08-fca1-dbce2c725eee SPX86-8004-CE Repaired

2014-12-03/15:08:08  ef3f77c5-d16b-6a08-fca1-dbce2c725eee SPX86-8004-CE Resolved
...
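
To check quickly whether every fault in such a history has reached "Resolved", the last recorded state per fault UUID can be extracted. This is a rough Python sketch under the assumption that each history line has the form shown above (timestamp, UUID, message ID, optional state). The names last_state_by_uuid and unresolved are hypothetical, and the label "Open" for lines without a state is an invented placeholder, not an ILOM term.

```python
import re

# timestamp, fault UUID, message id, optional state (e.g. Repaired, Resolved)
_HISTORY_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+([0-9a-fA-F-]{36})"
    r"\s+(\S+)(?:\s+(\S+))?\s*$"
)


def last_state_by_uuid(text):
    """Return {uuid: last state seen}, reading the lines in file order."""
    states = {}
    for line in text.splitlines():
        match = _HISTORY_LINE.match(line.strip())
        if match:
            _stamp, uuid, _msg_id, state = match.groups()
            # "Open" is a placeholder label for entries with no state column.
            states[uuid] = state or "Open"
    return states


def unresolved(states):
    """Return the UUIDs whose last recorded state is not 'Resolved'."""
    return [uuid for uuid, state in states.items() if state != "Resolved"]
```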




Cause

The bdacheckhw warnings and the ILOM snapshot taken before the reboot confirm a memory fault, fault.memory.intel.sb.dimm_ce, caused by excessive correctable memory errors.

Note: The "ce" suffix in fault.memory.intel.sb.dimm_ce indicates correctable memory errors.

Excessive correctable or uncorrectable memory errors can have several causes:

  •  Electrical or magnetic interference
  •  DIMM poorly seated in its slot
  •  High temperatures
  •  Data compare

Memory is tested at every POST (power-on self-test), so if a server is rebooted and the error does not recur, the memory is assumed to be OK.  MOS document 1155200.1 - PSH Procedural Article for ILOM-Based Diagnosis states that you should check FMA errors.

Solution

For a memory fault caused by excessive correctable memory errors, the solution is as follows:

1. First reboot the server or clear the faults in the ILOM, then check whether these steps allow the Fault Manager to clear the memory faults.

You can enable x86 diagnostics (Enabled or Extended) to run at boot to further confirm that no memory faults remain.  To do so, see Running Enabled and Extended Diagnostics to Confirm Hardware Errors on the BDA (Doc ID 1956552.1).

2. Replace the faulted DIMM only if the fault recurs on the same DIMM.  See MOS document 1155200.1 - PSH Procedural Article for ILOM-Based Diagnosis.

Also see the MOS document: 1438864.1 - SPX86-8004-CE - Fault due to excessive memory correctable errors (CE's)
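
The replacement rule in step 2 can be expressed as a small check over the FRUs named in past fault events. This is a sketch only. The name dimms_to_replace is hypothetical, and treating two or more faults on the same DIMM as "re-occurrence" (threshold=2) is an assumption rather than a value stated in the referenced documents.

```python
from collections import Counter


def dimms_to_replace(fault_frus, threshold=2):
    """Return the FRUs that faulted at least `threshold` times.

    fault_frus holds one FRU path per recorded fault event, for example
    "/SYS/MB/P0/D7". threshold=2 models "the fault recurred on the same
    DIMM" and is an assumption.
    """
    counts = Counter(fault_frus)
    return sorted(fru for fru, seen in counts.items() if seen >= threshold)
```

With a single fault on /SYS/MB/P0/D7, as in this case, the function returns an empty list, which matches the guidance to reboot or clear the fault first instead of replacing the DIMM.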


References

<NOTE:1438864.1> - SPX86-8004-CE - Fault due to excessive memory correctable errors (CE's)
<NOTE:1155200.1> - PSH Procedural Article for ILOM-Based Diagnosis
<NOTE:1956552.1> - Running Enabled and Extended Diagnostics to Confirm Hardware Errors on the BDA

  Copyright © 2018 Oracle, Inc.  All rights reserved.