Asset ID: 1-71-1008360.1
Update Date: 2018-01-16
Solution Type: Technical Instruction Sure
Solution 1008360.1: UltraSPARC[R] IIIi CPU Error Handling and Messages
Related Items
- Sun Fire V215 Server
- Sun Fire V240 Server
- Sun Fire V245 Server
- Sun Ultra 45 Workstation
- Sun Netra 240 (DC) Server
- Sun Fire V445 Server
- Sun Netra 210 Server
- Sun Fire V125 Server
- Sun Fire V250 Server
- Sun Fire V440 Server
- Sun Netra 440 Server
- Sun Fire V210 Server
- Sun Blade 2500 Workstation
- Sun Ultra 25 Workstation
- Sun Blade 1500 Workstation
- Sun Netra CP3010 Blade Server
- Sun Netra 240 (AC) Server
- Sun Netra CP2500 Blade Server
Previously Published As: 211433
Applies to:
Sun Fire V125 Server - Version Not Applicable and later
Sun Fire V210 Server - Version Not Applicable and later
Sun Fire V250 Server - Version Not Applicable and later
Sun Fire V240 Server - Version Not Applicable and later
Sun Fire V215 Server - Version Not Applicable and later
Sun SPARC Sun OS
Goal
Explain how the UltraSPARC IIIi CPU handles errors and what the error types are.
Fix
1. HARDWARE FEATURES
UltraSPARC[R] IIIi checks for errors in data used from the E$, DRAM, JBus, and most of the internal cache data and tag arrays. This data is protected as follows:
DATA/TAG           PROTECTION
==================================
E$ data            ECC
E$ tag             Parity
I$ data, tag       Parity
D$ data, tag       Parity
Memory data        ECC
JBus addr, data    Parity
- Parity provides for single bit detection but no correction.
- ECC provides single bit detection (w/ correction) and multi-bit detection (w/o correction).
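The difference between detection and correction can be illustrated with a single parity bit. The following is a minimal sketch, not a model of the UltraSPARC IIIi hardware: this document does not state the parity scheme (even or odd, or the width it covers), so even parity over one byte is assumed here, and the function names are hypothetical.

```python
def even_parity(word: int) -> int:
    """Return the even-parity bit: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """Check a word against the parity bit stored alongside it."""
    return even_parity(word) == stored_parity


data = 0b10110010
stored = even_parity(data)

single_flip = data ^ 0b00000100  # one bit corrupted
double_flip = data ^ 0b00010100  # two bits corrupted

# A single flipped bit changes the parity, so it is detected,
# but parity alone does not say which bit flipped (no correction).
single_detected = not parity_ok(single_flip, stored)

# Two flipped bits cancel out in the parity, so they go unnoticed.
double_detected = not parity_ok(double_flip, stored)
```

ECC stores more check bits per data word, which is what makes it possible to locate and correct a single-bit error and to detect multi-bit errors.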
Some ECC/parity errors are corrected automatically by the hardware, while others require software intervention. Some errors cannot be corrected by either hardware or software. These errors are logged in the AFSR (Asynchronous Fault Status Register) and AFAR (Asynchronous Fault Address Register). The AFSR contains the following information (see section 4 on how to decode the AFSR):
- Individual error bits
- MEM/E$ ECC syndrome (Esynd: indicates the specific bits in error)
- JBus Agent (J_AID: Requester for certain data)
- Active JBus request signal (J_REQ: JBus driver at time of error)
- JBus addr/data parity syndrome (Bsynd)
AFAR contains the physical address of the faulty component/data. This address corresponds to the first occurrence of the highest priority error.
2. SOLARIS ERROR REPORTING
Each error message contains an AFT tag and an errID. The errID is the time (in nanoseconds) of the error report, taken from the high-resolution clock on the CPU. The AFT tags are classified as follows:
AFT0/AFT1 messages:
- Identify the type and source of an error
- AFT0 tag is used for correctable and recoverable errors
- AFT1 tag is used for uncorrectable errors
- Fault_PC is not always precise, and may be within 4 instructions of the PC at the time of the event
- Esynd indicates the number of bits in error
AFT2 messages:
- Cache diagnostic info for debug purposes
- E$ data dump
- Any ECC mismatch in E$ data dump
- I$/D$ data dump
AFT3 messages:
- Printed by kernel recovery code
- Indicates the action the kernel took to handle an error
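When scanning logs, it can help to pull the AFT tag and errID out of each message. The sketch below assumes the message text contains "[AFTn] errID 0xHHHHHHHH.LLLLLLLL", with the errID printed as two 32-bit hexadecimal halves. That layout is an assumption, not something this document specifies, and the sample line and function name are hypothetical.

```python
import re

# Assumed layout: "[AFTn] errID 0xHHHHHHHH.LLLLLLLL"
AFT_RE = re.compile(
    r"\[AFT(?P<tag>[0-3])\]\s+errID\s+"
    r"0x(?P<hi>[0-9a-fA-F]{8})\.(?P<lo>[0-9a-fA-F]{8})"
)

# Tag meanings as described in this section.
AFT_MEANING = {
    0: "correctable or recoverable error",
    1: "uncorrectable error",
    2: "cache diagnostic information",
    3: "kernel recovery action",
}


def parse_aft(line: str):
    """Return the AFT tag and errID of a message line, or None if absent."""
    m = AFT_RE.search(line)
    if m is None:
        return None
    tag = int(m.group("tag"))
    # Join the two 32-bit halves into one 64-bit nanosecond count.
    err_id_ns = (int(m.group("hi"), 16) << 32) | int(m.group("lo"), 16)
    return {"tag": tag, "meaning": AFT_MEANING[tag], "err_id_ns": err_id_ns}


# Hypothetical sample line, for illustration only.
sample = "[AFT0] errID 0x00000001.00000000 sample text"
info = parse_aft(sample)
```

Because the errID comes from the high-resolution clock, messages that share the same errID can be grouped as belonging to one error event.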
Error Logging
Set the ce_verbose_memory and ce_verbose_others variables in /etc/system to change the logging level:
0    No error messages are logged in /var/adm/messages or on the console
1    Error messages are logged ONLY in /var/adm/messages
2    Error messages are logged BOTH in /var/adm/messages and on the console
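As an illustration, an /etc/system fragment that selects level 1 for both variables might look like the following. The values shown are only an example; choose the level from the table above. Changes to /etc/system take effect at the next reboot.

```
* Log correctable error messages to /var/adm/messages only
set ce_verbose_memory=1
set ce_verbose_others=1
```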
Memory Layout and DIMM Labels
Each US-IIIi supports 43-bit physical address space and a 64 GB cacheable address space for memory:
PHYSICAL ADDRESS   DESCRIPTION
=================================================================
PA[42:41]          Must be 0
PA[40:36]          AGENT_ID (0 through 3 for CPU0 through 3)
PA[35:0]           64 GB cacheable address space for memory
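The field layout above can be applied to an address (for example, one reported in the AFAR) to find which CPU's memory it belongs to. This is a minimal sketch based only on the table; the function name and the example address are hypothetical.

```python
def decode_physical_address(pa: int) -> dict:
    """Split a 43-bit physical address into AGENT_ID and memory offset."""
    if pa < 0 or pa >> 43:
        raise ValueError("address does not fit in 43 bits")
    if (pa >> 41) & 0x3:
        raise ValueError("PA[42:41] must be 0")
    return {
        "agent_id": (pa >> 36) & 0x1F,     # PA[40:36]
        "offset": pa & ((1 << 36) - 1),    # PA[35:0], 64 GB per agent
    }


# Hypothetical address: agent 1 (CPU1), offset 0x20000000 into its memory.
example = decode_physical_address(0x1020000000)
```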
Memory layout and DIMM module labels are specific to each platform and are provided by the OBP to Solaris for error reporting purposes. This memory-layout property is associated with each memory controller device node in the OBP device tree and contains the following:
BYTE#    DESCRIPTION
=======================================================================
00-07    DIMM label for lower numbered DIMM within DIMM pair 1
08-0F    DIMM label for higher numbered DIMM within DIMM pair 1
10-17    DIMM label for lower numbered DIMM within DIMM pair 2
18-1F    DIMM label for higher numbered DIMM within DIMM pair 2
20       Table width/type
21-32    DIMM Table (144-bit long)
         - 1 bit used to store DIMM number
         - Info stored in big endian:
           Data[127:0], ECC[8:0], Unused[6:0]
32-C2    Pin Table (144-bit long)
         - 1 byte used to store DIMM pin number
         - Info stored in little endian:
           Unused[0:6], ECC[0:8], DATA[0:127]
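The fixed-offset part of the property (the four 8-byte labels and the table width/type byte) can be read as sketched below. The sketch assumes the labels are ASCII padded with NUL bytes, which this document does not state; the sample bytes and the function name are hypothetical. The DIMM and pin tables are left out.

```python
def parse_memory_layout_labels(prop: bytes) -> dict:
    """Read the DIMM labels (bytes 00-1F) and the table type byte (20)."""
    if len(prop) < 0x21:
        raise ValueError("property is too short to hold labels and type")

    def label(start: int) -> str:
        # Assumption: each 8-byte label is NUL-padded ASCII.
        return prop[start:start + 8].rstrip(b"\x00").decode("ascii")

    return {
        "pair1": (label(0x00), label(0x08)),  # lower, higher numbered DIMM
        "pair2": (label(0x10), label(0x18)),
        "table_width_type": prop[0x20],
    }


# Hypothetical sample property contents, for illustration only.
names = (b"B0/D0", b"B0/D1", b"B1/D0", b"B1/D1")
sample = b"".join(n.ljust(8, b"\x00") for n in names) + bytes([0x12])
parsed = parse_memory_layout_labels(sample)
```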
Summary of DIMM labels for each platform:
PLATFORM   DIMM             DIMM LABEL
===================================================================
SB 1500    4 DIMM/CPU       CPU0: DIMM0 ... DIMM3
           Total 4 DIMMs
SB 2500    4 DIMM/CPU       CPU0: DIMM0 ... DIMM3
           Total 8 DIMMs    CPU1: DIMM4 ... DIMM7
SF V210    4 DIMM/CPU       CPU0: MB/P0/{B0,B1}/{D0,D1}
           Total 8 DIMMs    CPU1: MB/P1/{B0,B1}/{D0,D1}
SF V240    4 DIMM/CPU       CPU0: MB/P0/{B0,B1}/{D0,D1}
           Total 8 DIMMs    CPU1: MB/P1/{B0,B1}/{D0,D1}
SF V250    4 DIMM/CPU       CPU0: DIMM0 ... DIMM3
           Total 8 DIMMs    CPU1: DIMM4 ... DIMM7
SF V440    4 DIMM/CPU       CPU0: C0/P0/{B0,B1}/{D0,D1}
           Total 16 DIMMs   CPU1: C1/P0/{B0,B1}/{D0,D1}
                            CPU2: C2/P0/{B0,B1}/{D0,D1}
                            CPU3: C3/P0/{B0,B1}/{D0,D1}
3. Errors
The errors can be classified as follows:
Memory ECC Error
- Correctable
  - CE    Hardware corrected ECC error from local memory
  - FRC   Hardware corrected ECC error on foreign read from memory
  - RCE   Hardware corrected ECC error from remote memory/cache
- Uncorrectable
  - UE    Uncorrectable ECC error from local memory
  - FRU   Uncorrectable ECC error on foreign read from memory
  - RUE   Uncorrectable ECC error from remote memory/cache
E$ ECC Error
- Correctable
  - UCC       Software correctable L2 cache ECC error
  - EDC       Hardware corrected L2 cache ECC event on W-cache merge or block load or prefetch
  - WDC       Hardware corrected L2 cache ECC event on writeback
  - CPC       Hardware corrected L2 cache ECC event on copyout
  - RCE       Hardware corrected ECC error from remote memory/cache
- Uncorrectable
  - UCU       Uncorrectable L2 cache error
  - EDU:ST    Uncorrectable ECC error from L2 cache on W-cache merge
  - EDU:BLD   Uncorrectable ECC error from L2 cache on block load
  - WDU       Uncorrectable L2 cache ECC event on writeback
  - CPU       Uncorrectable L2 cache ECC event on copyout
  - RUE       Uncorrectable L2 cache error from remote memory/cache
CPU I$/D$ Parity Error
- DPE D$ parity error
- DDSPE D$ data parity error
- DTSPE D$ physical tag parity error
- IPE I$ parity error
- IDSPE I$ data parity error
- ITSPE I$ physical tag parity error
System/JBus Error
- BP     JBus parity error on returned read data
- WBP    JBus parity error on data for writeback or block store
- IVPE   Interrupt vector parity error
- BERR   Bus error received on JBus
- TO     Unmapped error on JBus read, from JBus device
- UMS    Unsupported memory store
- OM     JBus transaction error due to out of range address
Fatal Errors
- JEIC   System interface protocol error, illegal command
- JEIT   System interface protocol error, illegal ADTYPE detected
- JEIS   System interface protocol error, illegal install state
- JETO   System interface protocol error, hardware timeout
- SCE    JBus parity error on J_PACK or J_REQ signals
- IERR   CPU internal error
- ISAP   JBus address packet parity error
- ETP    Parity error in L2 cache tag
4. Decode AFSR
AFSR:
BIT     FIELD      DESCRIPTION
========================================================================
63:58   Reserved   Reserved
57      JETO       System interface protocol error, hardware timeout
56      SCE        JBus parity error on J_PACK or J_REQ signals
55      JEIC       System interface protocol error, illegal command
54      JEIT       System interface protocol error, illegal ADTYPE detected
53      ME         Multiple error of same type occurred
52      PRIV       Privileged code access error has occurred
51      JEIS       System interface protocol error, illegal install state
50      IERR       CPU internal error
49      ISAP       JBus address packet parity error
48      ETP        Parity error in L2 cache tag SRAM
47      OM         JBus transaction error due to out of range address
46      UMS        Unsupported memory store
45      IVPE       Interrupt vector parity error
44      TO         Unmapped error on JBus read, from JBus device
43      BERR       Bus error received on JBus
42      UCC        Software correctable L2 cache ECC error
41      UCU        Uncorrectable L2 cache error
40      CPC        Hardware corrected L2 cache ECC event on copyout
39      CPU        Uncorrectable L2 cache ECC event on copyout
38      WDC        Hardware corrected L2 cache ECC event on writeback
37      WDU        Uncorrectable L2 cache ECC event on writeback
36      EDC        Hardware corrected L2 cache ECC event on W-cache merge or block load or prefetch
35      EDU        Uncorrectable L2 cache ECC event on W-cache merge or block load or prefetch
34      UE         Uncorrectable ECC error from local memory
33      CE         Hardware corrected ECC error from local memory
32      RUE        Uncorrectable L2 cache error from remote memory/cache
31      RCE        Hardware corrected ECC error from remote memory/cache
30      BP         JBus parity error on returned read data
29      WBP        JBus parity error on data for writeback or block store
28      FRC        Foreign read to DRAM incurring correctable ECC error
27      FRU        Foreign read to DRAM incurring uncorrectable ECC error
26:24   J_REQ      Active JBus request signal when error occurred
23:22   ETW        L2 cache way information
21:20   Reserved   Reserved
19:16   Bsynd      JBus address/data parity error syndrome
15:14   Reserved   Reserved
13:9    J_AID      JBus agent ID of device
8:0     Esynd      Data ECC syndrome
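The bit table above can be applied mechanically to a raw AFSR value. The following is a minimal sketch that lists the error bits that are set and extracts the multi-bit fields; the function names and the example value are hypothetical.

```python
# Single-bit error flags, bit position -> field name, from the AFSR table.
AFSR_ERROR_BITS = {
    57: "JETO", 56: "SCE", 55: "JEIC", 54: "JEIT", 53: "ME", 52: "PRIV",
    51: "JEIS", 50: "IERR", 49: "ISAP", 48: "ETP", 47: "OM", 46: "UMS",
    45: "IVPE", 44: "TO", 43: "BERR", 42: "UCC", 41: "UCU", 40: "CPC",
    39: "CPU", 38: "WDC", 37: "WDU", 36: "EDC", 35: "EDU", 34: "UE",
    33: "CE", 32: "RUE", 31: "RCE", 30: "BP", 29: "WBP", 28: "FRC",
    27: "FRU",
}


def bit_field(value: int, hi: int, lo: int) -> int:
    """Extract bits hi:lo (inclusive) from value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_afsr(afsr: int) -> dict:
    """Decode a 64-bit AFSR value into error flags and fields."""
    flags = [
        name
        for bit, name in sorted(AFSR_ERROR_BITS.items(), reverse=True)
        if (afsr >> bit) & 1
    ]
    return {
        "flags": flags,
        "J_REQ": bit_field(afsr, 26, 24),
        "ETW": bit_field(afsr, 23, 22),
        "Bsynd": bit_field(afsr, 19, 16),
        "J_AID": bit_field(afsr, 13, 9),
        "Esynd": bit_field(afsr, 8, 0),
    }


# Hypothetical value: CE set, J_REQ = 001, Esynd = 0x036.
example = decode_afsr((1 << 33) | (0b001 << 24) | 0x036)
```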
J_REQ:
J_REQ    DESCRIPTION
========================================
000      CPU0
001      CPU1
010      CPU2
011      CPU3
100      Slave Tomatillo or CPU4
101      Master Tomatillo
110      Zulu or CPU5
111      Reserved
J_AID:
J_AID    DESCRIPTION
========================================
00000    CPU0
00001    CPU1
00010    CPU2
00011    CPU3
0110x    Slave Tomatillo
0111x    Master Tomatillo
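The J_REQ and J_AID values extracted from the AFSR can be turned into device names with simple lookups based on the two tables above. In this sketch the function names are hypothetical, and J_AID values not listed in the table are reported as "Unknown", which is a choice made here rather than something this document defines.

```python
# J_REQ (AFSR bits 26:24) -> JBus driver at the time of the error.
J_REQ_NAMES = {
    0b000: "CPU0",
    0b001: "CPU1",
    0b010: "CPU2",
    0b011: "CPU3",
    0b100: "Slave Tomatillo or CPU4",
    0b101: "Master Tomatillo",
    0b110: "Zulu or CPU5",
    0b111: "Reserved",
}


def j_req_name(j_req: int) -> str:
    """Name the JBus request signal for a 3-bit J_REQ value."""
    return J_REQ_NAMES[j_req & 0b111]


def j_aid_name(j_aid: int) -> str:
    """Name the JBus agent for a 5-bit J_AID value (AFSR bits 13:9)."""
    if 0 <= j_aid <= 3:
        return f"CPU{j_aid}"
    # The low bit is a don't-care ("x") for the Tomatillo entries.
    if j_aid >> 1 == 0b0110:
        return "Slave Tomatillo"
    if j_aid >> 1 == 0b0111:
        return "Master Tomatillo"
    return "Unknown"
```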
5. References:
UltraSPARC-IIIi Programmer's Reference Manual: Additions to US-III
<Document 1004729.1> Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages
The Event Types (in section 3. Errors) are explained in more detail in:
<Document 1004903.1> Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], UltraSPARC-IV[R] and UltraSPARC-IV+[R] CPU Modules