Asset ID: 1-71-1002430.1
Update Date: 2017-10-04
Keywords:
Solution Type: Technical Instruction Sure Solution
1002430.1: Diagnose Sun Fire[TM] Uncorrectable Error(s) from Memory on Solaris[TM] 8 and 9
Related Items
- Sun Fire 15K Server
- Sun Netra 1280 Server
- Sun Fire V880z Visualization Server
- Sun Fire E20K Server
- Sun Fire 6800 Server
- Sun Fire V880 Server
- Sun Fire E25K Server
- Sun Fire E2900 Server
- Sun Fire V890 Server
- Sun Fire 3800 Server
- Sun Fire 4810 Server
- Sun Blade 1000 Workstation
- Sun Fire 280R Server
- Sun Fire 12K Server
- Sun Fire V1280 Server
- Sun Blade 2000 Workstation
- Sun Fire 4800 Server
- Sun Fire V480 Server
- Sun Fire V490 Server
- Sun Fire E4900 Server
- Sun Fire E6900 Server
- Sun Netra 1290 Server
Related Categories
- PLA-Support>Sun Systems>Sun_Other>Sun Generic Product>SN-OTH: Gen_Prod
- _Old GCS Categories>Sun Microsystems>Desktops>Workstations
- _Old GCS Categories>Sun Microsystems>Servers>NEBS-Certified Servers
- _Old GCS Categories>Sun Microsystems>Servers>Midrange Servers
- _Old GCS Categories>Sun Microsystems>Servers>Entry-Level Servers
- _Old GCS Categories>Sun Microsystems>Boards>Memory Module
- _Old GCS Categories>Sun Microsystems>Servers>High-End Servers
PreviouslyPublishedAs
203402
Applies to:
Sun Fire V890 Server - Version All Versions and later
Sun Fire 3800 Server - Version All Versions and later
Sun Fire E20K Server - Version All Versions and later
Sun Fire 6800 Server - Version All Versions and later
Sun Fire V490 Server - Version All Versions and later
All Platforms
Goal
This document is Step 2. in the overall resolution path for:
<Document: 1006517.1>
Troubleshooting Sun Fire[TM] Uncorrectable CPU and Memory Error(s) on Solaris[TM] 8 and 9
Solution
Steps to Follow
You have reached this document because you have seen UE errors with non-signalling syndromes and have identified the error message(s) as memory uncorrectable error(s). You must now verify these message(s) and diagnose them to the correct FRU:
Tools to Diagnose Memory UEs
- cediag, Customer Ready
- <Document: 1003867.1> Memory DIMM Replacement Management Tool - cediag FAQ and download
- Usage: cediag -e unpacked_explorer_dir
- Rule#4 can identify individual UE DIMMs before or after they cause an outage.
cediag: findings: 1 DIMMs with a failure pattern matching Rule#4
cediag: findings: DIMM 'Slot A: J8101' matched Rule#4 failure pattern
cediag: advice:HIGH: replace DIMM 'Slot A: J8101' [A]s [S]oon [A]s [P]ossible
- If only UEs are found, cediag reports nothing more useful than the message below.
- UEs are often seen as a result of single DIMM Rule#4 failures.
cediag: findings: 1 UE(s) found - potential Rule#3 match
cediag: advice:HIGH: refer UE(s) to Sun Support [A]s [S]oon [A]s [P]ossible
- UE Datapath faults are rare but do happen - See <Document: 1010642.1> Diagnosis of bad writers and datapath faults from Solaris messages
cediag: findings: 4 datapath fault message(s) found
cediag: findings: 8 DIMM(s) having CEs with Esynd of 0x0010 found
cediag: advice:HIGH: possible datapath fault - refer to Sun Support [A]s [S]oon [A]s [P]ossible
- Whenever more than one DIMM fails Rule#4, #5, or #6 you will get this message. Make sure you really do have multiple failures before replacing any DIMMs.
cediag: advice:MEDIUM: consult Sun Support to rule out other causes of CEs before replacing any DIMMs
- Runnable from cores3 - /cores_data/local/bin/cediag
- findaft, Internal Only
- Usage: findaft messages
- Findaft identifies banks of memory that have reported UE errors with non-signalling syndromes. To help narrow the fault down to a single DIMM, it also shows a summary of all CEs reported from within that bank.
- Findaft also looks for multiple UE banks within a single slot to help identify FCO A0253 PLL failures.
- <Document: 1010934.1> Findaft - an AFT, CPU, Memory and PCI ECC error message summary script.
- Runnable from cores3 - /cores_local/data/bin/findaft
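When several explorers need to be triaged, the cediag findings and advice lines quoted above can be collected mechanically. The Python below is a minimal sketch and is not part of cediag: the function name summarize_cediag and both regular expressions are assumptions modelled only on the sample lines shown in this document, so any other cediag output is simply skipped.

```python
import re

# Assumption: patterns are modelled only on the sample cediag lines quoted
# in this document. Lines of any other shape are ignored.
DIMM_RULE = re.compile(
    r"^cediag: findings: DIMM '(?P<dimm>[^']+)' matched (?P<rule>Rule#\d+) failure pattern"
)
ADVICE = re.compile(r"^cediag: advice:(?P<level>[A-Z]+): (?P<text>.+)$")


def summarize_cediag(output: str) -> dict:
    """Collect per-DIMM rule matches and advice lines from cediag output text."""
    summary = {"dimms": [], "advice": []}
    for raw in output.splitlines():
        line = raw.strip()
        match = DIMM_RULE.match(line)
        if match:
            summary["dimms"].append((match.group("dimm"), match.group("rule")))
            continue
        match = ADVICE.match(line)
        if match:
            summary["advice"].append((match.group("level"), match.group("text")))
    return summary


if __name__ == "__main__":
    sample = (
        "cediag: findings: 1 DIMMs with a failure pattern matching Rule#4\n"
        "cediag: findings: DIMM 'Slot A: J8101' matched Rule#4 failure pattern\n"
        "cediag: advice:HIGH: replace DIMM 'Slot A: J8101' [A]s [S]oon [A]s [P]ossible\n"
    )
    print(summarize_cediag(sample))
```

An empty "dimms" list alongside a HIGH advice line corresponds to the UE-only and datapath cases described above, where cediag cannot name a single DIMM and the case should be referred rather than a part replaced.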
Diagnosis of Memory UEs
Once you know you have a UE memory problem, the first step is to try to narrow the fault down to a single DIMM within that bank.
- Narrowing a UE fault down to a single DIMM is not possible in all cases. In some cases the whole bank will need to be replaced.
- Recent firmware on the Uniboard-based systems (1280 through E25K) autodiagnoses memory UE errors and should identify the suspect DIMMs using CHS status.
- <Document: 1010056.1> Troubleshooting offline or disabled components on Sun Fire [TM] Serengeti or LightWeight8 systems
- <Document: 1010905.1> Sun Enhanced Memory DIMM Replacement Policy.
- <Document: 1007056.1> Understanding the AVL firmware ECC diagnosis engine
- CEs reported from a bank of memory that is also reporting UEs are usually a reliable indicator that a single DIMM is the cause of the UE. Use cediag to look for DIMMs that have failed Rule#4; findaft also generates a summary of CEs from each identified UE bank.
- Further options are available. Ask for assistance or confirmation before replacing multiple FRUs: use IM, escalate, or email the appropriate support alias, depending on severity.
- CEs from multiple DIMMs within a UE bank might indicate a PLL fault.
- FCO A0253 PLL DIMMs - Failing to identify that you are dealing with a PLL DIMM fault will result in further outages. Any one of the 16 DIMMs installed on a single 480/880 CPU memory board could have caused the errors you are looking at.
- FCO A0253 style failures are always fatal, but you may see both CEs and UEs before the system finally panics.
- FCO A0253 style failures can also cause Fatal Resets. If you have a combination of UE panics and Fatal Resets, a PLL failure may well be the cause.
- FCO A0223 B-Die DIMMs (only affects DIMMs from 2001 and earlier) will report CEs prior to the panic, and the individual faulty DIMM will often be identified by DIMM Policy Rule#4.
- FCO A0223-1 On Sun systems, a small number of 256MB DIMMs may experience Uncorrectable Memory Errors (UE).
- B-Die scanner - use showfru. DIMMs newer than 2001 will not be affected by this FCO.
- Solaris 8 prior to patch 108528 rev 16 and Solaris 9 FCS have a number of significant CE reporting bugs; patching is required for reliable diagnosis.
- At these patch levels Solaris only reports a summary of CEs detected by the CPUs after every 256th CE event. In the event of a UE error, the preceding CEs that would allow identification of the individual faulty DIMM are not seen. To make the problem worse, all the PCI detected CEs are reported, and the locations reported are incorrect.
- See <Document: 1011997.1> Solaris[TM] Operating System: Tuning for improved diagnosis of memory or Central Processing Unit(CPU) errors, for a workaround. Patching is the better option.
- The Schizo detected CEs report the wrong DIMMs as at fault.
- 4682258 PCI ECC correctable errors can report wrong dimm or whole bank of dimms
- 4491362 CE error reporting in Daktari and Cherrystone is ambiguous
- Solaris 8 prior to 108528 rev 17 and Solaris 9 112233 rev 03 can panic as a result of correctable memory errors.
- 4462509 Recursive CE errors cause kernel stack overflow on Cheetah processors
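The Solaris 8 patch thresholds above can be checked against a saved patch listing before trusting the CE history. The following Python is a minimal sketch under stated assumptions: the input is taken to be `showrev -p` output from a Solaris 8 host, each installed patch is assumed to appear on a line beginning with "Patch: ", and the function names are hypothetical. The revision thresholds (16 and 17 for patch 108528) come from the text above; Solaris 9 is not covered.

```python
import re

# Thresholds taken from the text above (Solaris 8 kernel patch 108528).
CE_REPORTING_FIXED_REV = 16  # earlier revisions have the CE reporting bugs
CE_PANIC_FIXED_REV = 17      # earlier revisions can panic on correctable errors

# Assumption: installed patches are listed as "Patch: <id>-<rev> ..." at the
# start of a line. Mentions in Requires/Obsoletes fields are not counted.
INSTALLED_108528 = re.compile(r"^Patch:\s*108528-(\d+)\b", re.MULTILINE)


def installed_108528_rev(showrev_output: str):
    """Return the highest installed 108528 revision, or None if not listed."""
    revs = [int(rev) for rev in INSTALLED_108528.findall(showrev_output)]
    return max(revs) if revs else None


def assess_solaris8(showrev_output: str) -> list:
    """Return warnings for a Solaris 8 host based on its 108528 revision."""
    rev = installed_108528_rev(showrev_output)
    warnings = []
    if rev is None or rev < CE_REPORTING_FIXED_REV:
        warnings.append("CE reporting unreliable: patch to 108528-16 or later")
    if rev is None or rev < CE_PANIC_FIXED_REV:
        warnings.append("panic from correctable errors possible: patch to 108528-17 or later")
    return warnings


if __name__ == "__main__":
    listing = "Patch: 108528-15 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr\n"
    for warning in assess_solaris8(listing):
        print(warning)
```

If either warning is returned, treat the absence of CEs preceding a UE as inconclusive rather than as evidence that no single DIMM is at fault.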
Next Step
Raise an SR to have the issue investigated further or the FRU replaced.
If FRU replacement does not clear the error messages, proceed to:
Step #4: <Document: 1012314.1>
Not sure what is at fault
It is much better to raise an escalation for diagnostic assistance before changing parts. Never simply re-enable parts and re-test FRUs without first understanding why the components were disabled in the first place.
- For quick questions ask in the sparc-enterprise IM chat room.
- Raise a collaboration to the proper PLA team depending on what platform you are dealing with.
Attachments
This solution has no attachment