Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2209101.1
Update Date:2017-10-05
Keywords:

Solution Type  Problem Resolution Sure

Solution  2209101.1 :   Fujitsu M10-4/M10-4S: Two DIMMs marked Faulted, replacing DIMMs does not fix the issue  


Related Items
  • Fujitsu M10-1
  • Fujitsu M10-4S
  • Fujitsu M10-4
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Fujitsu M10




In this Document
Symptoms
Changes
Cause
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: internal only info
Created from <SR 3-13561092971>

Applies to:

Fujitsu M10-4S - Version All Versions and later
Fujitsu M10-4 - Version All Versions and later
Fujitsu M10-1 - Version All Versions and later
Information in this document applies to any platform.

Symptoms

This issue has been seen once in the product's lifetime. Fault data and relevant information from various locations are summarized in this Internal Use Only document.

After a firmware update on an M10-4 system, the subsequent POST of the domain marked two DIMMs Faulted.

Replacing the two DIMMs marked Faulted does not solve the issue.
Replacing the CMU (CMUL in this case) does not solve the issue.
Moving groups of DIMMs around narrowed the issue down to a group of at most 4 DIMMs, marked Deconfigured, one of which is the possible cause.

Changes

 

Cause

Bug:25043220 - FUNCTIONAL DIMMS FAULTED DURING PPAR POST, FIXED BY REPLACING DECONFIGURED DIMMS:


FJ confirmed the same phenomenon with the DIMM returned for RCA (*); it is caused by a break in the DIMM address lines.
(*) 32GB (4Rank), P/N: CA07361-D432, S/N: PP14160KB

Log analysis confirmed that the break is at a DIMM register.
The overall quality of the DIMMs for M10 is good and stable. FJ has shipped 150,000 DIMMs for M10 so far, and this is the first report of this error. FJ concluded that this DIMM break is an isolated case and considers the possibility of this type of error recurring in the field to be small. If the same phenomenon is reported, please follow the procedure that FJ sent to Oracle on November 22, 2016 (documented by Oracle as 2209101.1).

 

 

A rare type of DIMM failure in a group of 4 DIMMs behind a Memory Access Controller (MAC) causes the lowest-numbered DIMM in both group A and group B controlled by that MAC to be marked Faulted. The DIMM that is actually failing is marked Deconfigured as a result of the memory configuration rules.

First, some information on the memory mounting rules. The Service Manual shows the following memory mounting configuration rules:

CMUL-CPU#0-A
CMUL-CPU#0-A CMUL-CPU#1-A
CMUL-CPU#0-A CMUL-CPU#1-A CMUL-CPU#0-B
CMUL-CPU#0-A CMUL-CPU#1-A CMUL-CPU#0-B CMUL-CPU#1-B

This translates to the following memory locations:

00A-07A
00A-07A 10A-17A
00A-07A 10A-17A 00B-07B
00A-07A 10A-17A 00B-07B 10B-17B
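
As a reading aid, the range notation above can be expanded into individual slot names. The following is a minimal Python sketch, assuming that a range such as 00A-07A denotes the eight consecutive slots MEM#00A through MEM#07A (consistent with the MAC overview in this document). The names MOUNT_STEPS, expand_range and slots_for_step are hypothetical and are not part of any Oracle or Fujitsu tool.

```python
# The four mounting steps listed above, written as lists of slot ranges.
MOUNT_STEPS = [
    ["00A-07A"],
    ["00A-07A", "10A-17A"],
    ["00A-07A", "10A-17A", "00B-07B"],
    ["00A-07A", "10A-17A", "00B-07B", "10B-17B"],
]


def expand_range(rng):
    """Expand a range such as '00A-07A' into ['MEM#00A', ..., 'MEM#07A'].

    Assumption: both ends have the form <cpu digit><slot digit><group letter>
    and share the same CPU digit and group letter.
    """
    start, end = rng.split("-")
    if len(start) != 3 or len(end) != 3:
        raise ValueError("unexpected range format: " + rng)
    cpu, group = start[0], start[2]
    if end[0] != cpu or end[2] != group:
        raise ValueError("range must stay within one CPU and group: " + rng)
    first, last = int(start[1]), int(end[1])
    return ["MEM#%s%d%s" % (cpu, n, group) for n in range(first, last + 1)]


def slots_for_step(step):
    """Return every slot populated in one mounting step, in listed order."""
    return [slot for rng in step for slot in expand_range(rng)]
```

Under this assumption the four steps populate 8, 16, 24 and 32 slots respectively.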

In addition, the rule "CPU#0 Group A ≦ CPU#0 Group B ≦ CPU#1 Group A ≦ … ≦ CPU#3 Group B" applies, from Fujitsu M10-4S DR Considerations (Doc ID 2123775.2), section 2.1 PPAR DR PREREQUISITES.
This translates to: 00A-07A ≦ 00B-07B ≦ 10A-17A ≦ 10B-17B.

This is an overview of which MAC controls which DIMMs:
- MAC00: MEM#00A MEM#00B MEM#01A MEM#01B
- MAC01: MEM#02A MEM#02B MEM#03A MEM#03B
- MAC02: MEM#04A MEM#04B MEM#05A MEM#05B
- MAC03: MEM#06A MEM#06B MEM#07A MEM#07B
- MAC10: MEM#10A MEM#10B MEM#11A MEM#11B
- MAC11: MEM#12A MEM#12B MEM#13A MEM#13B
- MAC12: MEM#14A MEM#14B MEM#15A MEM#15B
- MAC13: MEM#16A MEM#16B MEM#17A MEM#17B
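
The overview above follows a regular pattern: each MAC controls two consecutive slot numbers on one CPU, in both group A and group B. The following Python sketch encodes that pattern as a lookup. It is an illustration inferred from the eight MACs listed here only; the function names mac_for and dimms_behind are hypothetical, and nothing is assumed about slots outside this list.

```python
import re

# Slot names covered by the overview: CPU digit 0-1, slot digit 0-7, group A/B.
_SLOT = re.compile(r"^MEM#([01])([0-7])([AB])$")
_MAC = re.compile(r"^MAC([01])([0-3])$")


def mac_for(dimm):
    """Return the MAC that controls a DIMM, e.g. 'MEM#03B' -> 'MAC01'."""
    m = _SLOT.match(dimm)
    if m is None:
        raise ValueError("slot not covered by the overview: " + dimm)
    cpu, slot = m.group(1), int(m.group(2))
    # Two consecutive slot numbers share one MAC, so halve the slot number.
    return "MAC%s%d" % (cpu, slot // 2)


def dimms_behind(mac):
    """Return the four DIMMs behind a MAC, in the order used in the overview."""
    m = _MAC.match(mac)
    if m is None:
        raise ValueError("MAC not covered by the overview: " + mac)
    cpu, idx = m.group(1), int(m.group(2))
    return ["MEM#%s%d%s" % (cpu, slot, group)
            for slot in (2 * idx, 2 * idx + 1)
            for group in "AB"]
```

For example, mac_for("MEM#14B") returns "MAC12", and dimms_behind("MAC01") returns MEM#02A, MEM#02B, MEM#03A and MEM#03B.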

For reference, although memory mirroring is off in this case, the MAC mirror pairs are:

- MAC00 gets mirrored to MAC02
- MAC01 gets mirrored to MAC03
- MAC10 gets mirrored to MAC12
- MAC11 gets mirrored to MAC13

Solution

If you face the scenario described above, determine whether the two DIMMs marked Faulted are the lowest-numbered DIMMs behind their MAC.
Example: DIMMs MEM#02A and MEM#02B are marked Faulted.
If replacing those two DIMMs does not fix the issue, it is believed that one of the other two DIMMs behind MAC01, MEM#03A or MEM#03B, is at fault.

In such cases, ordering 4 replacement DIMMs is warranted and should be sufficient to fix the issue.
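
The check described in this section can be sketched in Python as follows. This is a minimal illustration under the assumptions of this document (slot names of the form MEM#<cpu><slot><group>, two consecutive slot numbers per MAC, and the eight MACs listed in the overview). The function name triage is hypothetical; it is not a diagnostic tool and does not replace analysis of the fault data.

```python
import re

_SLOT = re.compile(r"^MEM#([01])([0-7])([AB])$")


def triage(faulted):
    """Check two Faulted DIMMs against the pattern described in this document.

    Returns (mac, suspects) when the pair is the lowest-numbered group A and
    group B DIMM behind one MAC; suspects are the other two DIMMs behind that
    MAC. Returns None when the pair does not match the pattern.
    """
    parsed = []
    for name in faulted:
        m = _SLOT.match(name)
        if m is None:
            raise ValueError("slot not covered by the overview: " + name)
        parsed.append((m.group(1), int(m.group(2)), m.group(3)))
    if len(parsed) != 2:
        return None
    # Order the pair by group letter so the group A DIMM comes first.
    (cpu_a, slot_a, grp_a), (cpu_b, slot_b, grp_b) = sorted(
        parsed, key=lambda p: p[2])
    same_slot = cpu_a == cpu_b and slot_a == slot_b
    lowest_behind_mac = slot_a % 2 == 0  # even slot number is the lower one
    if not (same_slot and lowest_behind_mac and grp_a == "A" and grp_b == "B"):
        return None
    mac = "MAC%s%d" % (cpu_a, slot_a // 2)
    suspects = ["MEM#%s%d%s" % (cpu_a, slot_a + 1, g) for g in "AB"]
    return mac, suspects
```

With the example above, triage(["MEM#02A", "MEM#02B"]) returns ("MAC01", ["MEM#03A", "MEM#03B"]), while a pair such as MEM#03A and MEM#03B returns None because those are not the lowest-numbered DIMMs behind MAC01.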

 

References

<NOTE:2123775.2> - Fujitsu M10-4S DR Considerations
<BUG:25043220> - FUNCTIONAL DIMMS FAULTED DURING PPAR POST, FIXED BY REPLACING DECONFIGURED DIMMS
<NOTE:1531454.1> - M10-memory.se - A serious error has been detected at a DIMM

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.