Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-2066498.1
Update Date: 2017-10-11

Solution Type  Troubleshooting Sure

Solution  2066498.1 :   SPARC M7 Series Servers : Recognizing and Troubleshooting fault.cpu.generic-sparc.c2c and ereport.hc.link-init-threshold  


Related Items
  • SPARC M7-8
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: M7




In this Document
Purpose
Troubleshooting Steps
References


Applies to:

SPARC M7-8 - Version All Versions and later
Information in this document applies to any platform.

Purpose

Please contact Oracle Service Personnel for assistance if fault.cpu.generic-sparc.c2c and ereport.hc.link-init-threshold are observed in FMA.

SYSFW 9.4.3 contains a bug in which FMA reports only one end of the CLINK fault for fault.cpu.generic-sparc.c2c.

For ereport.hc.link-init-threshold events, FMA should report both ends of the CLINK fault, weighted 80:20.

In some cases the end that FMA reported (the 80% suspect) was the wrong one: after the CMIOU reported by FMA was replaced, the CMIOU on the other end of the CLINK (the 20% suspect) proved to be the bad one.

This document helps service engineers identify the other_side of the CLINK.



A fix is available in 9.5.2.g.

Troubleshooting Steps

Start the fault management shell and run 'fmdump -V'.

If the output matches both ereport.hc.link-init-threshold and fault.cpu.generic-sparc.c2c, please contact Oracle Service Personnel for assistance.

-> start -script /SP/faultmgmt/shell

faultmgmtsp> fmdump -V

2015-09-01/00:15:24 3d4b91b7-b44a-e2c2-f32c-d6e12a6b45f5 SPSUN4V-8000-2S
 
timestamp           ereports
2015-09-01/00:15:22 ereport.hc.link-init-threshold/sys/cmiou2/cm/cmp/clx0/clink1/lane0
 
     fault = fault.cpu.generic-sparc.c2c/SYS/CMIOU2/CM/CMP/CLX0/CLINK1/LANE0
         certainty = 100.0 %
         FRU       = /SYS/CMIOU2
         ASRU      = /SYS/CMIOU2/CM/CMP/CLX0/CLINK1/LANE0
         resource  = /SYS/CMIOU2/CM/CMP/CLX0/CLINK1/LANE0
         _list_sz     = 1
         _list_idx    = 0
         retire       = reboot
         _diagnosis_engine_version = 1.0
         _diagnosis_engine_name = fdd
         system_serial_number = AK00316846
         system_part_number = 7092780
         system_name  = SPARC M7-8
         system_manufacturer = Oracle Corporation
         chassis_serial_number = AK00317993
         chassis_part_number = 33682633+1+1
         chassis_name = SPARC M7-8
         chassis_manufacturer = Oracle Corporation
         system_component_serial_number = AK00317993
         system_component_part_number = 33682633+1+1
         system_component_name = SPARC M7-8
         system_component_manufacturer = Oracle Corporation
         fru_name     = CMIOU Module
         fru_manufacturer = Oracle Corporation
         fru_serial_number = 465769T+15286C02VF
         fru_rev_level = 02
         fru_part_number = 7312807

 

SYS FW 9.4.3 contains an FMA bug: when a fault is diagnosed after repeated (SERD) ereport.hc.link-init-threshold events, FMA indicts only one end of the link, at 100%, when the indictment should be weighted 80:20 between the two ends of the link.
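
As a rough aid for spotting this signature in saved 'fmdump -V' output, the sketch below scans the text for a fault.cpu.generic-sparc.c2c fault with 100% certainty and a single suspect, alongside an ereport.hc.link-init-threshold event. It assumes the output has been captured to a string and follows the "key = value" layout shown in the example above; the function name and the SAMPLE text are illustrative only and not part of any Oracle tool.

```python
EREPORT_CLASS = "ereport.hc.link-init-threshold"
FAULT_CLASS = "fault.cpu.generic-sparc.c2c"


def find_single_ended_c2c(fmdump_text):
    """Return resources of c2c faults indicted at 100% with a single suspect."""
    if EREPORT_CLASS not in fmdump_text:
        return []
    records = []
    for line in fmdump_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # header, timestamp, or blank line
        key, value = key.strip(), value.strip()
        if key == "fault":
            records.append({"fault": value})
        elif records and key in ("certainty", "_list_sz"):
            records[-1][key] = value
    hits = []
    for rec in records:
        fault_class, _, resource = rec["fault"].partition("/")
        certainty = float(rec.get("certainty", "0").rstrip("% "))
        suspects = int(rec.get("_list_sz", "0"))
        if fault_class == FAULT_CLASS and certainty == 100.0 and suspects == 1:
            hits.append("/" + resource)
    return hits


# Abbreviated copy of the example output above (assumed to be saved as text).
SAMPLE = """\
timestamp           ereports
2015-09-01/00:15:22 ereport.hc.link-init-threshold/sys/cmiou2/cm/cmp/clx0/clink1/lane0

     fault = fault.cpu.generic-sparc.c2c/SYS/CMIOU2/CM/CMP/CLX0/CLINK1/LANE0
         certainty = 100.0 %
         FRU       = /SYS/CMIOU2
         _list_sz     = 1
"""

print(find_single_ended_c2c(SAMPLE))
# prints ['/SYS/CMIOU2/CM/CMP/CLX0/CLINK1/LANE0']
```

A match only flags the output for review; it does not replace the diagnosis by Oracle Service Personnel.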

The CLINK for the ereports is seen in the form of: ereport.hc.link-init-threshold/sys/cmiou/cm/cmp/clx/clink and ereport.hc.link-init-threshold/sys/cmiou/cm/cmp/clx/clink/lane
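
To make the two path forms concrete, here is a minimal sketch that splits such a path into its CMIOU, CLX, CLINK and optional LANE numbers. It assumes each component carries a numeric suffix, as in the example output (cmiou2, clx0, clink1, lane0); the pattern and the function name are illustrative.

```python
import re

# Matches both the ereport form (lower case) and the fault resource form (upper case).
CLINK_PATH = re.compile(
    r"/sys/cmiou(?P<cmiou>\d+)/cm/cmp/clx(?P<clx>\d+)/clink(?P<clink>\d+)"
    r"(?:/lane(?P<lane>\d+))?$",
    re.IGNORECASE,
)


def parse_clink(path):
    """Return the numeric path components, or None if the path is not a CLINK."""
    match = CLINK_PATH.search(path)
    if match is None:
        return None
    return {
        name: (int(value) if value is not None else None)
        for name, value in match.groupdict().items()
    }


print(parse_clink("ereport.hc.link-init-threshold/sys/cmiou2/cm/cmp/clx0/clink1/lane0"))
# prints {'cmiou': 2, 'clx': 0, 'clink': 1, 'lane': 0}
```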
                    
The other end of the CLINK (20%), which might be the cause of the error, is not diagnosed by FMA, so the diagnosis lands 100% on one end of the link.

For this bug, treat the 100% indictment as a definitive 80% fault and treat the other_side as the 20% suspect.
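
The reinterpretation can be written down as a tiny helper that turns the single 100% indictment into the intended two-entry suspect list. This is only an illustration of the weighting rule; the function name is hypothetical, and the /SYS/CMIOU6 peer in the usage line assumes the single-PDOM pairing from the example in this document.

```python
def effective_suspects(reported_fru, other_side_fru):
    """Re-express a single 100% indictment as the intended 80:20 suspect list."""
    return [(reported_fru, 80), (other_side_fru, 20)]


print(effective_suspects("/SYS/CMIOU2", "/SYS/CMIOU6"))
# prints [('/SYS/CMIOU2', 80), ('/SYS/CMIOU6', 20)]
```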

Actions to be taken by Service Engineer:

1. Identify the other end of the CLINK CMIOU (20%) using one of the three methods described in the following Document:

See Document: SPARC M7 Series Servers : Coherency Link (CLINK) Pair information (Doc ID 2064591.1)

Note: The system must first be identified as an M7-8 One PDOM or an M7-8 Two PDOM, because the "other side" depends on the type (1x8 vs 2x4).
See Document: SPARC M7 Series Servers : How to differentiate M7-8 with one Pdomain and M7-8 with two Pdomains (Doc ID 2060018.1)
  • For the example above:
    • For a Single PDOM M7-8, the other_side of CMIOU2/CLINK1 is CMIOU6/CLINK1, whereas,
    • For a Dual   PDOM M7-8, the other_side of CMIOU2/CLINK1 is CMIOU0/CLINK1


2. Replace the CMIOU with the 100% indictment (which should be weighted 80%). In this example, it is CMIOU2.

3. If FMA continues to fault the replaced CMIOU (80%) for the same event, the service team needs to replace the CMIOU on the other_side of the CLINK (20%). In this example, that is CMIOU6 on a Single PDOM M7-8, or CMIOU0 on a Dual PDOM M7-8.

Note: This should not be seen in the field; if it is, please report it to Engineering (new bug).
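
The replacement order in steps 1 to 3 can be sketched as a small lookup plus a decision function. The table below holds only the two pairings quoted in the example in this document; every other pairing is deliberately left out and must be taken from Doc ID 2064591.1. The "single"/"dual" labels and both function names are assumptions made for illustration.

```python
# (PDOM layout, CMIOU, CLINK) -> (other-side CMIOU, other-side CLINK).
# Only the two pairings from the example above are filled in.
OTHER_SIDE = {
    ("single", 2, 1): (6, 1),
    ("dual", 2, 1): (0, 1),
}


def other_side(layout, cmiou, clink):
    """Look up the far end of a CLINK; fail loudly if it is not in the table."""
    key = (layout, cmiou, clink)
    if key not in OTHER_SIDE:
        raise LookupError(f"no pairing recorded for {key}; see Doc ID 2064591.1")
    return OTHER_SIDE[key]


def next_cmiou_to_replace(layout, cmiou, clink, already_replaced):
    """Return the CMIOU number to replace next, or None if both ends are done."""
    if cmiou not in already_replaced:
        return cmiou  # step 2: the indicted end (treated as 80%)
    peer_cmiou, _ = other_side(layout, cmiou, clink)
    if peer_cmiou not in already_replaced:
        return peer_cmiou  # step 3: the other end (treated as 20%)
    return None  # both ends already replaced; the steps list nothing further


print(next_cmiou_to_replace("single", 2, 1, set()))
# prints 2
print(next_cmiou_to_replace("single", 2, 1, {2}))
# prints 6
```

Raising on an unknown pairing, rather than guessing, keeps the sketch from suggesting a CMIOU that the pairing document has not confirmed.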

References

<BUG:21883123> - FMA FAULTED ONLY 1 END OF THE CLINK FAULT, SOMETIME THE WRONG END
<BUG:21770571> - FMA FAULTED ONLY 1 END OF THE CLINK FAULT, SOMETIME THE WRONG END

  Copyright © 2018 Oracle, Inc.  All rights reserved.