Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1012214.1: Troubleshooting Red State Exception Memory Errors
Previously Published As: 216842

Applies to:
Sun Fire V490 Server - Version Not Applicable and later
Sun Fire V880 Server - Version All Versions and later
Sun Fire V480 Server - Version Not Applicable and later
Sun Fire V880z Visualization Server - Version All Versions and later
Sun Fire V890 Server - Version Not Applicable and later
All Platforms

Purpose
When scanning error messages in any of the message files, you will usually see the failing DIMM(s) printed out for you. A Red State Exception gives no such direct indication, so the failing DIMM(s) must be decoded from the register dump.

Troubleshooting Steps
Red State Exceptions (RSE):
ERROR:
CPU3 RED State Exception
** CPU3 called a Red State Exception; further investigation is needed. **

System State (CPU3 reporting)
** CPU3 is ONLY reporting the error. Any CPU in the system can report the error, so this does not mean CPU3 is the problem. **

CPU0 Config/Control/Status registers:
  CPUVersion:  003e.0014.5400.0507
  SafConfig:   0caa.01bc.0000.8002
  SafBaseAdr:  0000.0400.0000.0000
  DCacheCtl:   0000.0200.0000.0000
  ECacheCtl:   0000.0000.0009.4400
  ECErrEnable: 0000.0000.0000.000b
  AFAR:        0000.0000.0000.0000
  AFSR:        0000.0000.0000.0000 (no errors set)
** Important: This Red State example is from a 750MHz CPU. 900MHz and faster CPUs also print AFAR2/AFSR2 register lines below the AFAR/AFSR lines; AFAR2/AFSR2 hold the first error captured, while AFAR/AFSR always hold the most recent error that occurred on the system. **
  DMMU SFAR:   0000.0000.fff7.8ec8
  DMMU SFSR:   0000.0000.0080.8008 TM PR
  IMMU SFSR:   0000.0000.0000.0000 (no status set)

CPU0 Trap registers: Trap Level = 1
 *TL=1 TT: 0000.0000.0000.0003 TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
       TPC: 0000.0000.f004.9700 TnPC: 0000.0000.f004.9704
  TL=2 TT: 0000.0000.0000.0068 TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
       TPC: 0000.0000.f004.4b68 TnPC: 0000.0000.f004.4b6c
  TL=3 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0000.3333.3330 TnPC: 0000.0000.3333.3330
  TL=4 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0000.4444.4444 TnPC: 0000.0000.4444.4444
  TL=5 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0000.5555.5554 TnPC: 0000.0000.5555.5554

CPU0 General registers:
  %PIL: 15
  %PC: 0000.0000.f004.9700 %nPC: 0000.0000.f004.9704
  %PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
  %CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
  %FPRS: 0000.0000.0000.0005 FEF DL
  %v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000 %v3: 0000.0000.fff7.8000
  %v4: 0000.0000.0000.0ef8 %v5: 0caa.01bc.0000.8002 %v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680
  . . . (text deleted) . . .
  %i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004 %i3: 0000.0000.f000.00e0
  %i4: 0000.0000.0000.001f %i5: 0000.0000.0000.0000 %i6: f000.0000.0001.c981 %i7: 0000.0000.f000.d680

CPU1 Config/Control/Status registers:
  CPUVersion:  003e.0014.5400.0507
  SafConfig:   0caa.01bc.0002.8002
  SafBaseAdr:  0000.0400.0080.0000
  DCacheCtl:   0000.0200.0000.0000
  ECacheCtl:   0000.0000.0009.4400
  ECErrEnable: 0000.0000.0000.000b
  AFAR:        0000.0000.0000.0000
  AFSR:        0000.0000.0000.0000 (no errors set)
  DMMU SFAR:   0000.0000.fff7.8ec8
  DMMU SFSR:   0000.0000.0080.8008 TM PR
  IMMU SFSR:   0000.0000.0000.0000 (no status set)

CPU1 Trap registers: Trap Level = 1
 *TL=1 TT: 0000.0000.0000.0003 TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
       TPC: 0000.0000.f004.9700 TnPC: 0000.0000.f004.9704
  TL=2 TT: 0000.0000.0000.0068 TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
       TPC: 0000.0000.f004.4b68 TnPC: 0000.0000.f004.4b6c
  TL=3 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0001.3333.3330 TnPC: 0000.0001.3333.3330
  TL=4 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0001.4444.4444 TnPC: 0000.0001.4444.4444
  TL=5 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0001.5555.5554 TnPC: 0000.0001.5555.5554

CPU1 General registers:
  %PIL: 15
  %PC: 0000.0000.f004.9700 %nPC: 0000.0000.f004.9704
  %PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
  %CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
  %FPRS: 0000.0000.0000.0005 FEF DL
  %v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000 %v3: 0000.0000.fff7.8000
  %v4: 0000.0000.0000.0ef8 %v5: 0caa.01bc.0002.8002 %v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680
  . . . (text deleted) . . .
  %i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004 %i3: 0000.0000.f000.00e0
  %i4: 0000.0000.0000.001f %i5: 0000.0001.0000.0000 %i6: f000.0000.0001.c981 %i7: 0000.0000.f000.d680

CPU2 Config/Control/Status registers:
  CPUVersion:  003e.0014.5400.0507
  SafConfig:   1534.01bc.0004.8002
  SafBaseAdr:  0000.0400.0100.0000
  DCacheCtl:   0000.0200.0000.0000
  ECacheCtl:   0000.0000.0009.4400
  ECErrEnable: 0000.0000.0000.000b
  AFAR:        0000.0000.0000.0000
  AFSR:        0000.0000.0000.0000 (no errors set)
  DMMU SFAR:   0000.0000.fff7.8ec8
  DMMU SFSR:   0000.0000.0080.8008 TM PR
  IMMU SFSR:   0000.0000.0000.0000 (no status set)

CPU2 Trap registers: Trap Level = 1
 *TL=1 TT: 0000.0000.0000.0003 TSTATE: 0000.0099.1500.1600 XCC:NC ICC:NC MM=TSO PEF PRIV IE
       TPC: 0000.0000.f004.9700 TnPC: 0000.0000.f004.9704
  TL=2 TT: 0000.0000.0000.0068 TSTATE: 0000.0099.5804.1400 XCC:NC ICC:NC MM=TSO PEF PRIV
       TPC: 0000.0000.f004.4b68 TnPC: 0000.0000.f004.4b6c
  TL=3 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0002.3333.3330 TnPC: 0000.0002.3333.3330
  TL=4 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0002.4444.4444 TnPC: 0000.0002.4444.4444
  TL=5 TT: 0000.0000.0000.0000 TSTATE: 0000.0000.0000.0000 XCC:(clear) ICC:(clear) MM=TSO
       TPC: 0000.0002.5555.5554 TnPC: 0000.0002.5555.5554

CPU2 General registers:
  %PIL: 15
  %PC: 0000.0000.f004.9700 %nPC: 0000.0000.f004.9704
  %PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
  %CCR: 0000.0000.0000.0099 XCC:NC ICC:NC
  %FPRS: 0000.0000.0000.0005 FEF DL
  %v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.004a %v2: 0000.0000.0000.0000 %v3: 0000.0000.fff7.8000
  %v4: 0000.0000.0000.0ef8 %v5: 1534.01bc.0004.8002 %v6: 0000.0000.0000.007f %v7: 0000.0000.0000.0680
  . . . (text deleted) . . .
  %i0: 0000.0000.f000.00e0 %i1: 0000.0000.0000.0005 %i2: 0000.0000.0000.0004 %i3: 0000.0000.f000.00e0
  %i4: 0000.0000.0000.001f %i5: 0000.0002.0000.0000 %i6: f000.0000.0002.b381 %i7: 0000.0000.f000.d680

CPU3 Config/Control/Status registers:
  CPUVersion:  003e.0014.5400.0507
  SafConfig:   1534.01bc.0006.8002
  SafBaseAdr:  0000.0400.0180.0000
  DCacheCtl:   0000.0000.0000.0000
  ECacheCtl:   0000.0000.0009.4400
  ECErrEnable: 0000.0000.0000.000b
  AFAR:        0000.00b0.ece1.0450
  AFSR:        0010.0006.0000.015b PRIV UE CE

** 'UE' and 'CE' indicate that Uncorrectable and Correctable memory errors occurred and caused this Red State Exception. **

** 'b0' in the AFAR indicates that the error occurred on the CPU/Memory board in Slot 'B'. See 'Step #3 (Calculating the Physical Memory bank location)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this. **

** '15b' (Bits 8-0 of the AFSR) is the ECC Syndrome. In this example it decodes to M2, a probable double-bit error within a nibble. See 'Step #1 (Find bit(s) in error using ECC Syndromes)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this. **

** '450' (Bits 9-6 of the AFAR) identifies the logical bank in use. In this example it is 'CPU3 Bank0' (Bank0 located on the CPU/Memory board in Slot 'B'). See 'Step #3 (Calculating the Physical Memory bank location)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this. **

** The failing DIMMs in 'CPU3 Bank0' are J7900, J7901, J8001, and J8000. See 'Step #4 (Finding the 4 DIMMs (Jxxxx's) Related to this Physical Bank)' in <Document 1359373.1> V480/V880 Manual Decoding of DIMM(s) in Memory Error for how to calculate this. **

Resolution: All DIMMs in the above bank need to be changed. A multibit error cannot be isolated to a single DIMM, because the bits in error could be spread across multiple DIMMs in the faulty memory bank.
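The reading of the AFSR flags and ECC Syndrome described above can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_reg` and `decode_afsr` are hypothetical, and the flag bit positions (PRIV at bit 52, UE at bit 34, CE at bit 33) are assumptions chosen to agree with the annotated AFSR value in this example. Only the syndrome position (Bits 8-0) comes from the text. Mapping the syndrome to the bit(s) in error, and the AFAR to a memory bank, still requires the steps in <Document 1359373.1>.

```python
# Assumed flag positions; they reproduce the "PRIV UE CE" printed next to
# the example AFSR value, but are not taken from this document.
AFSR_FLAGS = {52: "PRIV", 34: "UE", 33: "CE"}


def parse_reg(text):
    """Convert a dotted hex register such as '0010.0006.0000.015b' to an int."""
    return int(text.replace(".", ""), 16)


def decode_afsr(text):
    """Return (list of flag names, ECC syndrome) for a dotted hex AFSR."""
    value = parse_reg(text)
    flags = [name
             for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
             if (value >> bit) & 1]
    syndrome = value & 0x1FF  # Bits 8-0 of the AFSR
    return flags, syndrome


flags, syndrome = decode_afsr("0010.0006.0000.015b")
print(flags, format(syndrome, "x"))
```

An all-zero AFSR, as printed for CPU0 through CPU2 in this example, decodes to no flags and a zero syndrome.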
Keep in mind that a DIMM that logged CE errors earlier in the Explorer message files (0, 1, 2, 3, ...) is very likely the bad DIMM if a UE later crashes the system and that DIMM is in the bank of DIMMs named in the error message. A DIMM that repeatedly causes CEs is more likely to hit a double-bit error or UE. Always review the message files from the oldest to the most recent to build the DIMM history.

  DMMU SFAR:   0000.0000.fff5.2000
  DMMU SFSR:   0000.0000.0004.8028 TM CT1 PR
  IMMU SFSR:   0000.0000.0080.8008 TM PR

CPU3 Trap registers: Trap Level = 5
** CPU3 is the suspect because it reached Trap Level 5 (the Red State level). **
  TL=1 TT: 0000.0000.0000.0063 (Corrected ECC Error) TSTATE: 0000.0099.8000.1603 XCC:NC ICC:NC MM=TSO PEF PRIV IE
       TPC: 0000.0000.0102.96f0 TnPC: 0000.0000.0102.96dc
  TL=2 TT: 0000.0000.0000.0068 (Fast Data Access MMU miss) TSTATE: 0000.0099.8000.1503 XCC:NC ICC:NC MM=TSO PEF PRIV AG
       TPC: 0000.0000.f004.2c24 TnPC: 0000.0000.f004.2c28
  TL=3 TT: 0000.0000.0000.0032 (Data Access Error) TSTATE: 0000.0088.5804.1403 XCC:N ICC:N MM=TSO PEF PRIV
       TPC: 0000.0000.f004.4c64 TnPC: 0000.0000.f004.4c68
  TL=4 TT: 0000.0000.0000.0010 (Illegal Instruction) TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
       TPC: 0000.0000.f000.4640 TnPC: 0000.0000.f000.4644
 *TL=5 TT: 0000.0000.0000.0010 (Illegal Instruction) TSTATE: 0000.0088.5800.1503 XCC:N ICC:N MM=TSO PEF PRIV AG
       TPC: 0000.0000.f000.4200 TnPC: 0000.0000.f000.4204

CPU3 General registers:
  %PIL: 13
  %PC: 0000.0000.f000.4200 %nPC: 0000.0000.f000.4204
  %PSTATE: 0000.0000.0000.0035 TLE MM=TSO PEF
  %CCR: 0000.0000.0000.0091 XCC:NC ICC:C
  %FPRS: 0000.0000.0000.0000
  %v0: 0000.0000.0000.0000 %v1: 0000.0000.0000.0000 %v2: 0000.0000.0000.0000 %v3: 0000.0000.0000.0000
  %v4: ffff.ffff.0000.0000 %v5: 0000.0000.0000.0000 %v6: 0000.0000.0000.0000 %v7: 00ca.02a8.0840.0005
  . . . (text deleted) . . .
  %i0: 0000.0000.0008.0000 %i1: 0000.0000.0508.0000 %i2: 0000.0000.0000.0000 %i3: 0000.0700.0536.0c20
  %i4: 0000.0700.0536.0c30 %i5: 0000.0000.0007.3c00 %i6: 0000.0000.0140.8fe1 %i7: 0000.0000.0102.96b8

IO-Bridge 8 at 0000.0400.0400.0000
  Device ID 0000.0400.0400.0000: see values below
  Device ID fc00.0000.0011.a954   Ctl/Stat  0255.5554.0080.7e02
  Error Ctl fc00.0000.0000.03e0   Int Ctl   8000.0000.0000.0017
  Error Log 0000.0000.0000.0000   ECC Ctl   e000.0000.0000.0000
  EStar Ctl 0000.0000.0000.0001   Queue Ctl 0000.0000.0000.0000

             Address Match        Address Mask
  PCIA Mem   8000.07fd.0000.0000  0000.07ff.0000.0000
  PCIA C/IO  8000.07ff.ec00.0000  0000.07ff.fe00.0000
  PCIB Mem   8000.07fe.0000.0000  0000.07ff.0000.0000
  PCIB C/IO  8000.07ff.ee00.0000  0000.07ff.fe00.0000

             AFAR                 AFSR
  UE         0000.0000.0000.0000  0000.0000.0000.0000
  CE         0000.0100.0000.0000  0000.0000.0000.0000
  PCI A      0000.0000.0000.0000  0000.0000.0000.0000
  PCI B      0000.0000.0000.0000  0000.0000.0000.0000

  Control/Status Idle Check Diag Diagnostic
  PCI A      0000.0002.010e.003f  0000.0000.0000.8000  0000.0000.0000.0000
  PCI B      0000.0000.010e.003f  0000.0000.0000.8000  0000.0000.0000.0000

IO-Bridge 9 at 0000.0400.0480.0000
  Device ID fc00.0000.0013.a954   Ctl/Stat  0255.59a8.0090.7e02
  Error Ctl fc00.0000.0000.03e0   Int Ctl   8000.0000.0000.0017
  Error Log 0000.0000.0000.0000   ECC Ctl   e000.0000.0000.0000
  EStar Ctl 0000.0000.0000.0001   Queue Ctl 0000.0000.0000.0000

             Address Match        Address Mask
  PCIA Mem   8000.07fb.0000.0000  0000.07ff.0000.0000
  PCIA C/IO  8000.07ff.e800.0000  0000.07ff.fe00.0000
  PCIB Mem   8000.07fc.0000.0000  0000.07ff.0000.0000
  PCIB C/IO  8000.07ff.ea00.0000  0000.07ff.fe00.0000

             AFAR                 AFSR
  UE         0000.0000.0000.0000  0000.0000.0000.0000
  CE         0000.0000.0000.0000  0000.0000.0000.0000
  PCI A      0000.0000.0000.0000  0000.0000.0000.0000
  PCI B      0000.0000.0000.0000  0000.0000.0000.0000

  Control/Status Idle Check Diag Diagnostic
  PCI A      0000.0002.010e.003f  0000.0000.0000.8000  0000.0000.0000.0000
  PCI B      0000.0000.010e.003f  0000.0000.0000.8000  0000.0000.0000.0000
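Because any CPU can report the exception, a useful first pass over a saved dump is to list the CPUs that show a non-zero AFSR or that reached Trap Level 5. The following Python sketch assumes the dump text uses the same "CPUn Config/Control/Status registers:" and "CPUn Trap registers: Trap Level = N" headings as the example above. The function name `find_suspect_cpus` and the shortened `sample` text are hypothetical and only serve the illustration.

```python
import re


def find_suspect_cpus(dump):
    """Return {cpu: info} for CPUs with a non-zero AFSR or Trap Level >= 5."""
    cpus = {}
    # First "AFSR:" after each CPU header (does not match "AFSR2:").
    afsr_re = re.compile(
        r"CPU(\d+) Config/Control/Status registers:.*?\bAFSR:\s*([0-9a-fA-F.]+)",
        re.S,
    )
    for m in afsr_re.finditer(dump):
        value = int(m.group(2).replace(".", ""), 16)
        cpus.setdefault(int(m.group(1)), {})["afsr"] = value
    # Trap level printed on each "CPUn Trap registers" header line.
    trap_re = re.compile(r"CPU(\d+) Trap registers:\s*Trap Level = (\d+)")
    for m in trap_re.finditer(dump):
        cpus.setdefault(int(m.group(1)), {})["trap_level"] = int(m.group(2))
    return {
        cpu: info
        for cpu, info in cpus.items()
        if info.get("afsr", 0) != 0 or info.get("trap_level", 0) >= 5
    }


# Shortened sample built from the values in the example dump.
sample = """
CPU0 Config/Control/Status registers:
  AFAR: 0000.0000.0000.0000
  AFSR: 0000.0000.0000.0000 (no errors set)
CPU0 Trap registers: Trap Level = 1
CPU3 Config/Control/Status registers:
  AFAR: 0000.00b0.ece1.0450
  AFSR: 0010.0006.0000.015b PRIV UE CE
CPU3 Trap registers: Trap Level = 5
"""
suspects = find_suspect_cpus(sample)
for cpu, info in sorted(suspects.items()):
    print(f"CPU{cpu}: AFSR={info['afsr']:016x} Trap Level={info['trap_level']}")
```

A CPU flagged by this scan is only a starting point; the DIMM bank still has to be decoded from the AFAR and AFSR as described in the annotations.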
Attachments
This solution has no attachment