Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1615285.1
Update Date:2018-05-30
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  1615285.1 :   SPX86A-8002-XM - Memory Correctable ECC  


Related Items
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun PSH




In this Document
Purpose
Details


Applies to:

Sun Microsystems > Servers > x64 Servers
Exadata Database Machine V2
Information in this document applies to any platform.

Purpose

 This document provides additional information for message ID: SPX86A-8002-XM

Details

Memory Correctable ECC

Type

Fault
  fault.memory.intel.dimm_ce

Severity

Minor

Description

Message ID: SPX86A-8002-XM indicates that the ILOM fault manager has applied diagnosis to the error reports
it received and has determined that multiple correctable ECC errors have occurred on a memory DIMM.

ILOM has determined that one or more memory DIMMs have exceeded the threshold limit for memory correctable errors.

Oracle is changing the memory correctable error (CE) threshold limit from 240 CEs in a 72-hour period
to 1024 CEs in a 1-hour period. The change is based on extensive discussions with memory vendors, whose
understanding and experience is that other system vendors have used CE thresholds in this new range for years.
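The difference between the two limits can be illustrated with a minimal sliding-window sketch. This is not ILOM's actual implementation, which this document does not describe; the class CeWindow, the helper first_fault_time, and the choice of a sliding (rather than fixed) window are assumptions made only for illustration. "Exceeded" is read here as strictly more than the limit.

```python
from collections import deque


class CeWindow:
    """Hypothetical sliding-window CE counter for a single DIMM."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.events = deque()  # timestamps (seconds) of CEs still inside the window

    def record(self, timestamp):
        """Record one CE; return True if the count now exceeds the limit."""
        self.events.append(timestamp)
        # Discard events that have aged out of the window.
        while self.events and self.events[0] <= timestamp - self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.limit


# Threshold values taken from this document.
OLD_POLICY = {"limit": 240, "window_seconds": 72 * 3600}
NEW_POLICY = {"limit": 1024, "window_seconds": 1 * 3600}


def first_fault_time(policy, timestamps):
    """Return the timestamp at which the limit is first exceeded, or None."""
    window = CeWindow(**policy)
    for t in timestamps:
        if window.record(t):
            return t
    return None


# Illustrative input: one CE every 10 minutes for 72 hours (432 events).
steady = [i * 600 for i in range(432)]
print("old policy faults at:", first_fault_time(OLD_POLICY, steady))
print("new policy faults at:", first_fault_time(NEW_POLICY, steady))
```

Under these assumptions, a slow, steady trickle of CEs trips the old limit (an average of more than about 3.3 CEs per hour sustained over 72 hours) but never approaches the new limit, which requires more than 1024 CEs inside a single hour.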

A. If your ILOM firmware is below 3.2.10.22 on X3-2(L), X4-2(L), X5-2(L), or X6-2(L) platforms:

   A1. Please upgrade your system to ILOM firmware 3.2.10.22, which includes the new threshold limit.
   A2. Clear all memory correctable DIMM faults present in the system and reboot.
   A3. If the problem persists, then replace the DIMM(s) identified on the suspect list.

B. If your ILOM firmware is below 3.2.10.21 on X4-4, X4-8, X5-4, X5-8, or X6-8 platforms:

   B1. Please upgrade your system to ILOM firmware 3.2.10.21, which includes the new threshold limit.
   B2. Clear all memory correctable DIMM faults present in the system and reboot.
   B3. If the problem persists, then replace the DIMM(s) identified on the suspect list.

If the system firmware cannot be upgraded, replace the DIMM(s) per the current threshold limit.
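The firmware check in steps A and B can be sketched as follows. The table MIN_ILOM and the functions parse_version and needs_upgrade are hypothetical names introduced for this example; the sketch also assumes the installed version is reported as a purely numeric dotted string (for example "3.2.9.30"), which may not hold for every ILOM build string.

```python
# Minimum ILOM firmware carrying the new CE threshold, per this document.
MIN_ILOM = {
    "X3-2": "3.2.10.22", "X3-2L": "3.2.10.22",
    "X4-2": "3.2.10.22", "X4-2L": "3.2.10.22",
    "X5-2": "3.2.10.22", "X5-2L": "3.2.10.22",
    "X6-2": "3.2.10.22", "X6-2L": "3.2.10.22",
    "X4-4": "3.2.10.21", "X4-8": "3.2.10.21",
    "X5-4": "3.2.10.21", "X5-8": "3.2.10.21",
    "X6-8": "3.2.10.21",
}


def parse_version(text):
    """Convert a dotted numeric version into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_upgrade(platform, installed):
    """Return True if the installed firmware is below the listed minimum."""
    if platform not in MIN_ILOM:
        raise KeyError(f"platform not covered by this document: {platform}")
    return parse_version(installed) < parse_version(MIN_ILOM[platform])


print(needs_upgrade("X5-2", "3.2.9.30"))
print(needs_upgrade("X6-8", "3.2.10.21"))
```

Comparing integer tuples rather than raw strings matters here: a plain string comparison would rank "3.2.10.3" above "3.2.10.21".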

 

New Policy Allows You to Prevent System from Disabling DIMMs with Correctable Errors

 

Oracle is confident in the higher memory correctable error (CE) thresholds for the following reasons:

- Handling memory correctable errors at the higher thresholds does not introduce any system performance concerns.

- Memory CE events are handled almost exclusively in hardware (not software), so correcting CE events is an efficient process.

- Repeated memory correctable errors (CE) do not turn into uncorrectable errors (UE).

 

 

ILOM, Solaris, and Oracle Linux handle correctable memory error events during runtime.
Solaris and Oracle Linux retire only the affected page(s) of memory, and these page retirements do not persist across a reboot.
BIOS disables a faulty DIMM only at the next reboot, and only if the DIMM has not been replaced.

Error and Fault Handling Actions Taken by Solaris and Oracle Linux for Memory CE Error Events

  Memory controller on processor detects a correctable memory error.

  Operating system machine check handler reads, logs, and then clears all MCA banks.

  Operating system "FMD" daemon consumes and logs ereport.

  Operating system "FMD" daemon diagnoses fault and creates fault event.

  Operating system retires affected page(s) of memory.

  Operating system DOES NOT produce a fault message or SNMP Trap for notification.

  Operating system forwards error telemetry to ILOM.


Error and Fault Handling Actions Taken by ILOM for Memory CE Error Events

  Memory controller on processor detects a correctable memory error.
 
  IIO controller updates correctable error event status in CSR register.

  Processor signals Severity_0 error number (N0) to Complex Programmable Logic Device (CPLD).

  CPLD receives severity error number from processor.

  CPLD correlates error event to processor on SP using GPIO.
 
  CPLD maps CPU on CPLD to processor on SP using GPIO.

  RAS runtime error handling code uses PECI to read error counters and clears each memory controller channel.

  RAS runtime error handling code logs content of memory error counters and maps to physical DIMM.

  RAS runtime error handling code creates ereport for error event (payload includes the value of the counter).

  ILOM "FDD" daemon consumes and logs ereport.

  ILOM "FDD" daemon discards error telemetry received from the operating system.

  ILOM "FDD" daemon generates a fault event for any DIMM that exceeds 240 CEs on the same DIMM within a 72-hour period.

  ILOM illuminates the service-required LED for those DIMMs identified as faulty.

  ILOM generates fault message and SNMP trap.

  BIOS disables faulty DIMMs after next system reset.

Automated Response

The affected page(s) of memory associated with the faulty memory module may be immediately retired by the operating system to avoid subsequent errors.
The memory DIMM and chassis-wide service-required LEDs are illuminated.

Impact

The system will continue to operate in the presence of this fault.
The memory DIMM is still in use and is not disabled until the system is rebooted.
The memory DIMM is disabled on the next system reboot and remains unavailable until repaired.
System performance may be impacted slightly due to retired memory pages.

Suggested Action for System Administrator

Replace the faulty memory DIMM at the earliest possible convenience.

Refer to the following document for the latest procedures for displaying event content
in preparation for submitting a service request and applying any post-repair actions that may be required.

PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1)

 


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.