Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1321335.1
Update Date:2015-12-02

Solution Type  Troubleshooting Sure

Solution  1321335.1 :   Sun Enterprise[TM] 10000: Troubleshooting Recordstop Dumps  


Related Items
  • Sun Enterprise 10000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk
  • _Old GCS Categories>Sun Microsystems>Servers>High-End Servers




In this Document
Purpose
Troubleshooting Steps
 Bogus Uncorrectable Error reported on bit 32, syndrome 13
 Correctable ECC Error (CE) Processor X Dtags


Applies to:

Sun Enterprise 10000 Server - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Purpose

This document provides troubleshooting information for various recordstop dump events.

Troubleshooting Steps

Bogus Uncorrectable Error reported on bit 32, syndrome 13

There is a problem in the Starfire's XDB algorithm that checks the syndrome to identify the bad bit and to determine whether the error is a single-bit or a multiple-bit error. The XDB is coded to expect a syndrome of 12 for bit 32, but the syndrome for bit 32 is actually 13. As a result, the XDB requests a Recordstop but records a multiple-bit error (UE) instead of a single-bit error (CE).

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
XDB   7.2   EccErrFlags[11:0] = 220
        EccFlg[5]: Uncorrectable error in ldat bus lo half, bits [71:0]
        EccFlg[11:8]: Error count = 2
ldat[ 71: 0]= D3 00FD0ECD 00000004 (xmux_par[5:0]= 02)  syn= 13: bit 32 [06]
(output omitted)

Bear in mind that the UE is misreported by the XDB only; Solaris detects and reports this error correctly. As a result, only the XDB output in the Recordstop dump file shows a UE with bit 32 in error. The flip side of this problem is that the XDB reports syndrome 12 as a Correctable Error without identifying which bit was corrected. In reality, syndrome 12 maps to an Uncorrectable Error (UE) and cannot be mapped to a single bit.

 

  • Only a recordstop is generated.
  • Solaris detects, handles, and reports the error properly. Only the XDB output in the Recordstop file is in error.
  • This error is in XDB code and has no relation to system board hardware.
  • This problem will not be fixed.
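
The reading rule above can be sketched as a small check over an XDB excerpt from wfail output. This is only an illustration: the function name `classify_xdb_excerpt` and the regular expression are hypothetical and are modelled solely on the sample lines shown above, so real output may need a different pattern. The syndrome is compared as text because the document does not state its number base.

```python
import re

# Hypothetical pattern, modelled only on the sample "syn= 13: bit 32" text.
# The "bit N" part is optional because a syndrome 12 report names no bit.
SYN_RE = re.compile(r"syn=\s*([0-9A-Fa-f]+)(?::\s*bit\s+(\d+))?")


def classify_xdb_excerpt(text):
    """Return a note on how to read the XDB section of a wfail excerpt."""
    match = SYN_RE.search(text)
    if match is None:
        return "no syndrome found"
    syndrome, bit = match.group(1), match.group(2)
    # Assumption: a UE is flagged with the words "Uncorrectable error",
    # as in the EccFlg[5] line of the sample output.
    reported_ue = "Uncorrectable error" in text
    if reported_ue and syndrome == "13" and bit == "32":
        return "misreported UE: really a single-bit CE on bit 32"
    if not reported_ue and syndrome == "12":
        return "misreported CE: syndrome 12 is really a UE"
    return "no known XDB misreport"


sample = """\
XDB   7.2   EccErrFlags[11:0] = 220
        EccFlg[5]: Uncorrectable error in ldat bus lo half, bits [71:0]
ldat[ 71: 0]= D3 00FD0ECD 00000004 (xmux_par[5:0]= 02)  syn= 13: bit 32 [06]
"""
print(classify_xdb_excerpt(sample))
```

A check like this only helps interpret the Recordstop dump file; the Solaris error report remains the authoritative record.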

 

Correctable ECC Error (CE) Processor X Dtags

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
CIC   7.2   ErrFlags[61:0] = 00000001 00000002   (after mask)
         ErrFlag[1]: Correctable ECC Error   (CE)  Processor 1 Dtags
         ErrFlag[32]: Repeated Error
     Proc 1 Dtag ECCSyn[13: 8] = 23:  CE: bit 00  Dtag SRAM 7.2.0
  FAIL Proc 7.1 in all  configs using CIC2: : Arbstop/Recordstop detected by cic
         (*** NOTE: Implicated FRU is sysboard 7)
(output omitted)

The above error should be analyzed in the same way as other Correctable Error recordstops. The first event for any given error against a particular DTag SRAM (in this example, CIC 7.2, DTag SRAM 0) should be diagnosed as a soft error, and no action should be taken against it.

Swap the "Implicated FRU" (SB7 in the example) when the third failure occurs on any one CIC.

NOTE: Blacklisting the affected CPU (proc 7.1 in the example) can be used as a short-term workaround.
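
The three-strikes rule above can be tracked with a short script that tallies Dtag CE events per CIC across saved wfail outputs. This is a sketch under assumptions: the names `count_dtag_ce` and `recommend` are hypothetical, the pattern is modelled only on the sample "Dtag SRAM 7.2.0" line (read as board.CIC.SRAM), and each matching line is counted as one event. The document does not say how to handle a second event, so the sketch only notes that it is below the swap threshold.

```python
import re
from collections import Counter

# Hypothetical pattern, modelled only on the sample line
# "CE: bit 00  Dtag SRAM 7.2.0" (assumed to mean board.CIC.SRAM).
DTAG_RE = re.compile(r"CE:\s*bit\s+\d+\s+Dtag SRAM\s+(\d+)\.(\d+)\.(\d+)")


def count_dtag_ce(wfail_outputs):
    """Count Dtag CE lines per (board, CIC) across several wfail outputs."""
    per_cic = Counter()
    for text in wfail_outputs:
        for board, cic, _sram in DTAG_RE.findall(text):
            per_cic[(int(board), int(cic))] += 1
    return per_cic


def recommend(count):
    """Map an event count for one CIC to the action described above."""
    if count >= 3:
        return "swap the implicated FRU"
    if count == 1:
        return "soft error, no action"
    # Not specified by this document; labelled here as an assumption.
    return "below the swap threshold"


line = "     Proc 1 Dtag ECCSyn[13: 8] = 23:  CE: bit 00  Dtag SRAM 7.2.0"
counts = count_dtag_ce([line, line, line])
for (board, cic), n in counts.items():
    print(f"CIC {board}.{cic}: {n} event(s): {recommend(n)}")
```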



Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.