Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution
1321335.1: Sun Enterprise[TM] 10000: Troubleshooting Recordstop Dumps
Applies to:
Sun Enterprise 10000 Server - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Purpose
This document provides troubleshooting information for various recordstop dump events.

Troubleshooting Steps

Bogus Uncorrectable Error reported on bit 32, syndrome 13

There is a problem in the Starfire's XDB algorithm that checks the syndrome to identify the bad bit and to determine whether the error is a single-bit or a multiple-bit error. The XDB is coded to expect a syndrome of 12 for bit 32, but the syndrome for bit 32 is actually 13. As a result, the XDB requests a Recordstop but, instead of recording a single-bit error (CE), it records a multiple-bit error (UE).

From the wfail output, we see something like the following:

    redxl> wfail
    (output omitted)
    XDB 7.2 EccErrFlags[11:0] = 220
    EccFlg[5]: Uncorrectable error in ldat bus lo half, bits [71:0]
    EccFlg[11:8]: Error count = 2
    ldat[ 71: 0]= D3 00FD0ECD 00000004 (xmux_par[5:0]= 02) syn= 13: bit 32 [06]
    (output omitted)

Bear in mind that the UE is misreported by the XDB only. Solaris detects and reports this error properly. As a result, only the Recordstop dump file will show a UE with bit 32 in error, and only in its XDB output.

The flip side of this problem is that the XDB reports a syndrome 12 Correctable Error but does not identify which bit was corrected. In reality, syndrome 12 maps to an Uncorrectable Error (UE) and cannot be mapped to a single bit.
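The check described above (a UE reported by the XDB against bit 32 with syndrome 13 is most likely a misreported CE) can be sketched as a small text filter over saved wfail output. This is only an illustration: the function name `classify_xdb_ue` and the regular expression are assumptions based on the sample output shown in this document, not part of any SSP tool, and the syndrome 12 flip-side case is not handled because its output format is not shown here.

```python
import re

# Matches the syndrome report on an XDB ldat line, e.g. "syn= 13: bit 32".
# The pattern is an assumption derived from the sample wfail output only.
SYN_RE = re.compile(r"syn=\s*([0-9A-Fa-f]+):\s*bit\s+(\d+)")

MISREPORTED = "likely misreported CE (XDB bit 32 / syndrome 13 issue)"


def classify_xdb_ue(wfail_text):
    """Return a list of (syndrome, bit, verdict) tuples, one per syndrome report."""
    # The XDB flags the event as uncorrectable in its EccFlg lines.
    saw_ue = "Uncorrectable error" in wfail_text
    findings = []
    for match in SYN_RE.finditer(wfail_text):
        syndrome = match.group(1)  # kept as text; the radix is not assumed
        bit = int(match.group(2))
        if saw_ue and syndrome == "13" and bit == 32:
            verdict = MISREPORTED
        elif saw_ue:
            verdict = "UE as reported"
        else:
            verdict = "no UE flagged"
        findings.append((syndrome, bit, verdict))
    return findings
```

Passing the XDB portion of a saved wfail listing to `classify_xdb_ue` returns one tuple per syndrome report; any verdict other than "UE as reported" should still be cross-checked against what Solaris logged for the same event.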

Correctable ECC Error (CE) Processor X Dtags

From the wfail output, we see something like the following:

    redxl> wfail
    (output omitted)
    CIC 7.2 ErrFlags[61:0] = 00000001 00000002 (after mask)
    ErrFlag[1]: Correctable ECC Error (CE) Processor 1 Dtags
    ErrFlag[32]: Repeated Error Proc 1 Dtag
    ECCSyn[13: 8] = 23: CE: bit 00 Dtag SRAM 7.2.0
    FAIL Proc 7.1 in all configs using CIC2: : Arbstop/Recordstop detected by cic
    (*** NOTE: Implicated FRU is sysboard 7)
    (output omitted)

The above error should be analyzed in the same way as other Correctable Error recordstops. The first event for any given error against a particular Dtag SRAM (in this example, CIC 7.2, Dtag SRAM 0) should be diagnosed as a soft error, and no action should be taken against it. Swap the "Implicated FRU" (SB7 in the example) when the third failure occurs on any one CIC.

NOTE: Blacklisting the affected CPU (proc 7.1 in the example) can be used as a short-term workaround.
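
The counting rule for Dtag CE recordstops described above (no action on the first event, swap the implicated FRU on the third failure against any one CIC) can be sketched as a tally over saved wfail listings. This is a minimal sketch under stated assumptions: the function name `dtag_ce_actions`, the regular expression, and the action strings are hypothetical; each list element is assumed to be the wfail text of one recordstop; and any count below three is simply reported as "no swap yet", since the text does not say how to treat a second event.

```python
import re
from collections import Counter

# Matches the CIC header line in wfail output, e.g. "CIC 7.2 ErrFlags[61:0] = ...".
# The pattern is an assumption derived from the sample output only.
CIC_RE = re.compile(r"CIC\s+(\d+\.\d+)\s+ErrFlags")

SWAP_THRESHOLD = 3  # from the text: swap on the third failure on any one CIC


def dtag_ce_actions(wfail_dumps):
    """Map each CIC seen with a Dtag CE to a suggested action.

    wfail_dumps: list of strings, one wfail listing per recordstop (assumed).
    """
    counts = Counter()
    for text in wfail_dumps:
        # Only consider dumps that report a correctable Dtag error.
        if "Correctable ECC Error (CE)" not in text or "Dtag" not in text:
            continue
        # Count each CIC at most once per recordstop dump.
        for cic in set(CIC_RE.findall(text)):
            counts[cic] += 1
    return {
        cic: "swap implicated FRU" if n >= SWAP_THRESHOLD else "no swap yet"
        for cic, n in counts.items()
    }
```

Keeping the tally per CIC rather than per processor follows the wording "on any one CIC"; a site that tracks individual Dtag SRAMs could key the counter on the SRAM identifier instead.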