Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2316670.1
Update Date:2017-12-13

Solution Type  Problem Resolution Sure

Solution 2316670.1 : All DIMMs reported as failed with fault "SPX86-8001-QX fault.memory.intel.dimm.tempsensor-failed" during Exadata image and/or firmware upgrade.


Related Items
  • Exadata Database Machine X2-2 Hardware
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: EXADATA




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-15915627461>

Applies to:

Exadata Database Machine X2-2 Hardware - Version All Versions to All Versions [Release All Releases]
x86

Symptoms

On certain x86 systems (primarily Exadata V2/X2, Exalogic X2, X4170/X4270, and X4170 M2/X4270 M2), in rare circumstances a firmware upgrade and/or an Engineered Systems (Exadata, Exalogic, etc.) image upgrade may produce error messages reporting that some or (typically) all DIMMs have failed due to a faulty temperature sensor, similar to the following:

2017-02-24/21:20:56 89ad837b-d10d-663f-f7d1-8cf391c74018 SPX86-8001-QX Critical

Fault class : fault.memory.intel.dimm.tempsensor-failed

FRU : /SYS/MB/P0/D1
(Part Number: 001-0003-01,M393B1K70CH0-YH9)
(Serial Number: 00CE02122633F7A200)

Description : A Memory DIMM's temperature sensor has failed.

Response : None.

Impact : DIMM will be used and enabled, but will no longer be
protected by closed loop thermal throttling (CLTT).

Action : Please refer to the associated reference document at
http://support.oracle.com/msg/SPX86-8001-QX for the latest
service procedures and policies regarding this diagnosis.
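
When many DIMMs are reported at once, it can help to confirm that every open fault belongs to the same class before treating the event as the transient condition described in this document. The following is a minimal Python sketch, assuming the fault listing has been saved as text in the layout shown above. The function names and the sample text are hypothetical, and the pairing of "Fault class" and "FRU" lines assumes every fault entry contains both fields.

```python
import re

# Fault class taken from the example output in this document.
TEMPSENSOR_CLASS = "fault.memory.intel.dimm.tempsensor-failed"

# Abbreviated sample in the same layout as the listing above.
SAMPLE = """\
2017-02-24/21:20:56 89ad837b-d10d-663f-f7d1-8cf391c74018 SPX86-8001-QX Critical

Fault class : fault.memory.intel.dimm.tempsensor-failed

FRU : /SYS/MB/P0/D1
"""


def parse_faults(text):
    """Return a list of (fault_class, fru) pairs found in the text.

    Assumes each fault entry has exactly one 'Fault class' line and
    one 'FRU' line, in that order.
    """
    classes = re.findall(r"^\s*Fault class\s*:\s*(\S+)", text, re.MULTILINE)
    frus = re.findall(r"^\s*FRU\s*:\s*(\S+)", text, re.MULTILINE)
    return list(zip(classes, frus))


def all_tempsensor(faults):
    """True only if there is at least one fault and all share the tempsensor class."""
    return bool(faults) and all(cls == TEMPSENSOR_CLASS for cls, _ in faults)


faults = parse_faults(SAMPLE)
for fault_class, fru in faults:
    print(fru, fault_class)
print("all tempsensor faults:", all_tempsensor(faults))
```

If any fault in the listing has a different class, it should be investigated on its own merits rather than cleared as part of this procedure.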

 

Changes

 Software image and/or firmware upgrade

Cause


Bug 15758839 : SUNBT7117637 fault.memory.intel.dimm.tempsensor-failed seen during exadata fw up
Bug 17263114 : All dimms are faulted - fault.memory.intel.dimm.tempsensor-failed

Several known bugs have been logged for this behavior. In general, it is a transient issue that occurs when the ILOM is unable to read DIMM temperature sensor data for a short period during or after a firmware upgrade. This can lead the ILOM Fault Manager to incorrectly diagnose one or more (typically all) DIMMs as having faulty temperature sensors.

Solution

Clear the faults and ignore them. No further action is required unless a specific fault recurs, which may indicate a genuine hardware fault.

For background on the Fault Management Shell, see:
ILOM 3.0 - "Using the Oracle ILOM Fault Management Shell"
ILOM 3.1 - "Oracle ILOM Fault Management Shell", "Launch a Fault Management Shell Session (CLI)", "Using fmadm to Administer Active Oracle Hardware Faults"
How to use the Oracle ILOM 3.x Fault Management Shell (Doc ID 1309092.1)

1) Log in to the ILOM CLI and run the following command:

-> set /SYS/MB/P0/D0 clear_fault_action=true

Repeat this for each faulted DIMM (typically all of them)
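
Because the command has to be repeated for every faulted DIMM, it may be convenient to generate the command lines from a list of FRU paths and paste them into the ILOM CLI session. The sketch below only builds the text of the commands; it does not connect to the ILOM. The function name and the example FRU list are hypothetical, and the actual paths should be taken from the fault listing on the affected system.

```python
def clear_fault_commands(frus):
    """Return one ILOM CLI 'set ... clear_fault_action=true' line per unique FRU.

    Duplicate FRU paths are dropped while preserving the original order.
    """
    unique_frus = list(dict.fromkeys(frus))
    return [f"set {fru} clear_fault_action=true" for fru in unique_frus]


# Hypothetical list of faulted DIMMs; use the FRU paths reported on your system.
faulted = ["/SYS/MB/P0/D0", "/SYS/MB/P0/D1", "/SYS/MB/P0/D0"]

for line in clear_fault_commands(faulted):
    print(line)
```

Printing the commands rather than sending them keeps a person in the loop, which is appropriate here because each FRU should be checked against the fault listing before it is cleared.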

Alternatively, use the ILOM Fault Management Shell. Either method works; use whichever you prefer.
NOTE ** The fault management prompt may vary by system model and ILOM version, but the shell otherwise functions the same way.

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> fmadm faulty

(lists all open fault events/components)

faultmgmtsp> fmadm repaired [UUID OR COMPONENT]

Repeat for each faulted DIMM. Note that you can use either "repair" or "repaired"; both serve the same purpose, which is to inform the ILOM Fault Manager that the issue was fixed without replacing hardware.

faultmgmtsp> exit

The system's Service LED should turn off once all faults are cleared. Confirm with:
-> show /SYS/SERVICE value
or
-> show /System health
(If the value is "Service Required", the Service LED is ON; if it is "OK", the LED is OFF.)
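
The mapping between the reported health value and the Service LED can be written down as a small check. This is a minimal sketch, assuming the command output has been captured as text and contains a line of the form "health = <value>"; that output layout, the sample text, and the function name are assumptions rather than something stated in this document. Any value other than the two named above is reported as "unknown".

```python
import re


def service_led_state(show_output):
    """Map a captured 'health = ...' line to the expected Service LED state.

    Returns "on" for "Service Required", "off" for "OK", and "unknown"
    for anything else, including output with no health line.
    """
    match = re.search(r"^\s*health\s*=\s*(.*\S)\s*$", show_output, re.MULTILINE)
    if match is None:
        return "unknown"
    value = match.group(1)
    if value == "OK":
        return "off"
    if value == "Service Required":
        return "on"
    return "unknown"


# Assumed output layout, for illustration only.
sample = """/System
    Properties:
        health = Service Required
"""

print(service_led_state(sample))
```

If the result does not match what the LED on the chassis is showing, the warm reset described below is the next step.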


OPTIONAL (but recommended): Warm reset the ILOM. This does NOT impact the running host and helps ensure correct ILOM operation. A warm reset may also be necessary if the Service LED does not turn off, or if its state does not match the indicator output.

-> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing reset on /SP

NOTE - The ILOM will be temporarily unavailable while it reboots (approximately 1-3 minutes). There is NO impact on the host.

References

<NOTE:1966568.1> - Platinum Service Delivery - Password Management for Customers starting OASG version 4.0
<BUG:15758839> - SUNBT7117637 FAULT.MEMORY.INTEL.DIMM.TEMPSENSOR-FAILED SEEN DURING EXADATA FW UP
<BUG:17263114> - ALL DIMMS ARE FAULTED - FAULT.MEMORY.INTEL.DIMM.TEMPSENSOR-FAILED
https://docs.oracle.com/cd/E19860-01/E21549/z400015e1400653.html
<NOTE:1309092.1> - How to use the Oracle ILOM 3.x Fault Management Shell

  Copyright © 2018 Oracle, Inc.  All rights reserved.