Diagnose Sun Fire[TM] Uncorrectable Error(s) from Memory on Solaris[TM] 8 and 9

Asset ID:	1-71-1002430.1
Update Date:	2017-10-04
Keywords:

Solution Type Technical Instruction Sure

Solution 1002430.1 : Diagnose Sun Fire[TM] Uncorrectable Error(s) from Memory on Solaris[TM] 8 and 9

Applies to:

Sun Fire V890 Server - Version All Versions and later
Sun Fire 3800 Server - Version All Versions and later
Sun Fire E20K Server - Version All Versions and later
Sun Fire 6800 Server - Version All Versions and later
Sun Fire V490 Server - Version All Versions and later
All Platforms

Goal

This document is Step 2 in the overall resolution path for:

<Document: 1006517.1>

Troubleshooting Sun Fire[TM] Uncorrectable CPU and Memory Error(s) on Solaris[TM] 8 and 9

Solution

Steps to Follow
To have reached this document, you will have seen UE errors with non-signalling syndromes and identified the error message(s) as memory uncorrectable error(s). You must now verify these message(s) and diagnose them to the correct FRU:

Tools to Diagnose Memory UEs
Diagnosis of Memory UEs

Tools to Diagnose Memory UEs

cediag, Customer Ready
- <Document: 1003867.1> Memory DIMM Replacement Management Tool - cediag FAQ and download

- Usage: cediag -e unpacked_explorer_dir
- Rule#4 can identify individual UE DIMMs before or after they cause an outage.
  cediag: findings: 1 DIMMs with a failure pattern matching Rule#4
  cediag: findings: DIMM 'Slot A: J8101' matched Rule#4 failure pattern
  cediag: advice:HIGH: replace DIMM 'Slot A: J8101' [A]s [S]oon [A]s [P]ossible
- cediag will report no further useful info than the following message if only UEs are found.
  - Uncorrectable UE errors are often seen as a result of single DIMM Rule#4 failures.
    cediag: findings: 1 UE(s) found - potential Rule#3 match
    cediag: advice:HIGH: refer UE(s) to Sun Support [A]s [S]oon [A]s [P]ossible
  - UE Datapath faults are rare but do happen - See <Document: 1010642.1> Diagnosis of bad writers and datapath faults from Solaris messages
    cediag: findings: 4 datapath fault message(s) found
    cediag: findings: 8 DIMM(s) having CEs with Esynd of 0x0010 found
    cediag: advice:HIGH: possible datapath fault - refer to Sun Support [A]s [S]oon [A]s [P]ossible
  - Whenever more than one DIMM fails Rule#4, Rule#5, or Rule#6 you will get this message. Make sure you really do have multiple failures before replacing any DIMMs.
    cediag: advice:MEDIUM: consult Sun Support to rule out other causes of CEs before replacing any DIMMs
- Runnable from cores3 - /cores_data/local/bin/cediag
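The cediag messages quoted above follow a regular "cediag: findings:" / "cediag: advice:LEVEL:" shape, so a saved copy of the output can be summarized mechanically. The following is a minimal sketch in Python, assuming the output lines look exactly like the samples in this document; the function name summarize_cediag and the dictionary keys are illustrative and are not part of cediag.

```python
import re

# Patterns are derived only from the sample cediag lines quoted in this
# document; other cediag output shapes are not handled by this sketch.
RULE_DIMM_RE = re.compile(
    r"^\s*cediag: findings: DIMM '(?P<dimm>[^']+)' "
    r"matched Rule#(?P<rule>\d+) failure pattern"
)
UE_RE = re.compile(r"^\s*cediag: findings: (?P<count>\d+) UE\(s\) found")
DATAPATH_RE = re.compile(
    r"^\s*cediag: findings: (?P<count>\d+) datapath fault message\(s\) found"
)
ADVICE_RE = re.compile(r"^\s*cediag: advice:(?P<level>[A-Z]+): (?P<text>.*)$")


def summarize_cediag(output):
    """Collect rule-matched DIMMs, UE/datapath counts and advice lines."""
    summary = {"rule_dimms": [], "ue_count": 0, "datapath_count": 0, "advice": []}
    for line in output.splitlines():
        m = RULE_DIMM_RE.match(line)
        if m:
            summary["rule_dimms"].append((m.group("dimm"), int(m.group("rule"))))
            continue
        m = UE_RE.match(line)
        if m:
            summary["ue_count"] += int(m.group("count"))
            continue
        m = DATAPATH_RE.match(line)
        if m:
            summary["datapath_count"] += int(m.group("count"))
            continue
        m = ADVICE_RE.match(line)
        if m:
            summary["advice"].append((m.group("level"), m.group("text")))
    return summary
```

A summary with a non-zero ue_count and an empty rule_dimms list corresponds to the "only UEs are found" case described above, where cediag alone cannot name a DIMM.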
findaft, Internal Only
- Usage: findaft messages
- Findaft identifies banks of memory that have reported UE errors with non-signalling syndromes. To help narrow the fault down to a single DIMM, it also shows a summary of all CEs reported from within that bank.
  - Example findaft output V880 with a single DIMM causing CEs and UEs to be reported from a single bank.
- Findaft also looks for multiple UE banks within a single slot to help identify the FCO A0253 PLL failures.
  - Example findaft output V880 with a PLL fault somewhere on Slot B
- <Document: 1010934.1> Findaft - an AFT, CPU, Memory and PCI ECC error message summary script.
  - Runnable from cores3 - /cores_local/data/bin/findaft

Diagnosis of Memory UEs

Once you know you have a UE memory problem, the first step is to try to narrow the fault down to a single DIMM within that bank.

Narrowing a UE fault down to a single DIMM is not possible in all cases; in some cases the whole bank will need to be replaced.

Recent firmware on the Uniboard-based systems (1280 through E25K) will auto-diagnose memory UE errors and should identify the suspect DIMMs using CHS status.
- <Document: 1010056.1> Troubleshooting offline or disabled components on Sun Fire [TM] Serengeti or LightWeight8 systems
- <Document: 1010905.1> Sun Enhanced Memory DIMM Replacement Policy.

- <Document: 1007056.1> Understanding the AVL firmware ECC diagnosis engine

CEs reported from a bank of memory reporting UEs are usually a reliable indicator that a single DIMM is the cause of the UE. Use cediag to look for DIMMs that have failed Rule#4; findaft also generates a summary of CEs from each identified UE bank.
- Further options are available. Ask for assistance or confirmation before replacing multiple FRUs: use IM, escalate, or email the appropriate support alias, depending on severity.
- CEs from multiple DIMMs within a UE bank might indicate a PLL fault.
- FCO A0285 check script
- FCO A0253 PLL DIMMs - Failing to identify that you are dealing with a PLL DIMM fault will result in further outages. Any one of the 16 DIMMs installed on a single 480/880 CPU memory board can have caused the errors you are looking at.
- FCO A0253 style failures are always fatal, but you may see both CEs and UEs before the system finally panics.
- FCO A0253 style failures can also cause Fatal Resets. If you have a combination of UE panics and Fatal Resets, a PLL failure may well be the cause.
  - PLL scanner - showfru
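The reasoning in the list above (CEs from one DIMM in a UE bank point at that DIMM, CEs from several DIMMs or several UE banks in one slot point towards a PLL fault) can be sketched as a small triage helper. This is only an illustration in Python: the input structure, the function name assess_ue_slot and the bank names are hypothetical, the CE counts would have to be taken by hand from cediag or findaft summaries, and the result is a prompt for further diagnosis, not a replacement decision.

```python
def assess_ue_slot(ue_banks):
    """Return triage notes for the UE banks of one slot.

    ue_banks maps a bank name to {dimm_label: ce_count} (hypothetical shape).
    """
    notes = []
    # Several UE banks within a single slot is the pattern findaft looks for
    # when helping to identify FCO A0253 PLL failures.
    if len(ue_banks) > 1:
        notes.append("multiple UE banks in one slot, check for an FCO A0253 PLL fault")
    for bank, ce_by_dimm in sorted(ue_banks.items()):
        dimms_with_ces = sorted(d for d, n in ce_by_dimm.items() if n > 0)
        if len(dimms_with_ces) == 1:
            notes.append(
                f"{bank}: CEs only from {dimms_with_ces[0]}, likely the single faulty DIMM"
            )
        elif len(dimms_with_ces) > 1:
            notes.append(
                f"{bank}: CEs from multiple DIMMs, possible PLL fault, seek confirmation"
            )
        else:
            notes.append(
                f"{bank}: no CEs seen, narrowing to a single DIMM may not be possible"
            )
    return notes
```

Calling assess_ue_slot with one dictionary per slot returns a list of plain-text notes that can be pasted into the SR when asking for confirmation before replacing multiple FRUs.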

FCO A0223 B-Die DIMMs (only affects DIMMs from 2001 and earlier) will report CEs prior to the panic, and the individual faulty DIMM will often be identified by DIMM Policy Rule#4.
FCO A0223-1 On Sun systems, a small number of 256MB DIMMs may experience Uncorrectable Memory Errors (UE).
B-Die scanner - use showfru. DIMMs newer than 2001 will not be affected by this FCO.

Solaris 8 prior to patch 108528 rev 16 and Solaris 9 FCS have a number of significant CE reporting bugs; patching is required for reliable diagnosis.
- At these patch levels Solaris only reports a summary of CEs detected by the CPUs after every 256th CE event. In the event of a UE error, the preceding CEs that would allow identification of the individual faulty DIMM are not seen. To make the problem worse, all the PCI detected CEs are reported, and the locations reported are incorrect.
  - See <Document: 1011997.1> Solaris[TM] Operating System: Tuning for improved diagnosis of memory or Central Processing Unit(CPU) errors for a workaround; patching is a better option.
- The Schizo detected CEs report the wrong DIMMs as at fault.
  - 4682258 PCI ECC correctable errors can report wrong dimm or whole bank of dimms
  - 4491362 CE error reporting in Daktari and Cherrystone is ambiguous
Solaris 8 prior to 108528 rev 17 and Solaris 9 112233 rev 03 can panic as a result of correctable memory errors.
- 4462509 Recursive CE errors cause kernel stack overflow on Cheetah processors
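The patch thresholds above can be checked against saved `showrev -p` output, for example from an explorer. The following is a minimal Python sketch, assuming each installed patch appears on a line beginning "Patch: NNNNNN-RR"; the function names are illustrative. It does not follow Obsoletes entries, so a system whose kernel patch has been superseded by a differently numbered patch will show as missing and needs a manual check.

```python
import re

# Matches the leading "Patch: 108528-29" field of a showrev -p line.
PATCH_RE = re.compile(r"^Patch:\s*(\d{6})-(\d{2})\b", re.MULTILINE)


def installed_revision(showrev_output, patch_id):
    """Return the highest installed revision of patch_id, or None if absent."""
    revs = [
        int(rev)
        for pid, rev in PATCH_RE.findall(showrev_output)
        if pid == patch_id
    ]
    return max(revs) if revs else None


def meets_minimum(showrev_output, patch_id, min_rev):
    """True if patch_id is installed at revision min_rev or later."""
    rev = installed_revision(showrev_output, patch_id)
    return rev is not None and rev >= min_rev
```

For a Solaris 8 system, meets_minimum(text, "108528", 17) covers both the CE reporting fixes (rev 16) and the CE panic fix (rev 17) named above.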

Next Step

Raise an SR to have the issue investigated further or the FRU replaced.

If FRU replacement does not clear the error messages, proceed to:

Step #4: <Document: 1012314.1>

Not sure of what is at fault

It is much better to raise an escalation for diagnostic assistance before changing parts. Never simply re-enable parts and re-test FRUs without first understanding why the components were disabled in the first place.

For quick questions ask in the sparc-enterprise IM chat room.
Raise a collaboration to the proper PLA team depending on what platform you are dealing with.

Attachments

This solution has no attachment