Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1008263.1
Update Date: 2017-10-18
Keywords:

Solution Type: Technical Instruction (Sure Solution)

Solution  1008263.1 :   How to Troubleshoot CPU/Memory faults with Solaris[TM] FMA  


Related Items
  • Sun Fire 15K Server
  • Sun Fire E20K Server
  • Sun Fire E2900 Server
  • Sun Fire 3800 Server
  • Sun Fire E25K Server
  • Sun Fire 6800 Server
  • Sun Fire 4810 Server
  • Sun Fire V1280 Server
  • Sun Fire 12K Server
  • Sun Fire 4800 Server
  • Sun Fire E6900 Server
  • Sun Fire V445 Server
  • Sun Fire E4900 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Exx00
  • _Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers
  • _Old GCS Categories>Sun Microsystems>Servers>High-End Servers
  • _Old GCS Categories>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
211323


Applies to:

Sun Fire E20K Server - Version All Versions and later
Sun Fire E25K Server - Version All Versions and later
Sun Fire 4810 Server - Version All Versions and later
Sun Fire 6800 Server - Version All Versions and later
Sun Fire E2900 Server - Version All Versions and later
All Platforms

Goal

Certain hardware failures can result in multiple components being identified as faulty. In addition, FMA bugs exist that can cause innocent components to be disabled in error.

Solution

Step 1. Confirm what FMA currently believes to be at fault from the output of fmadm faulty

If the fmadm faulty output looks like the following example, the system is patched at a very old level at which multiple bugs exist.

At this level memory fault diagnosis is still very reliable, but CPU faults are very likely to be false.

$ pwd
/explorer.../fma
$ cat fmadm-faulty.out
STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
faulted cpu:///cpuid=482/serial=80011A28C751C469
6edb6157-d195-e1ba-eed2-a021b432b99f
-------- ----------------------------------------------------------------------
faulted cpu:///cpuid=486/serial=80011A28C751C469
6edb6157-d195-e1ba-eed2-a021b432b99f
-------- ----------------------------------------------------------------------
degraded mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
709b49ae-c2c9-ea93-93a2-ed44337adf5d
-------- ----------------------------------------------------------------------


In this example, two CPU cores from the same socket were identified as a result of a single fault diagnosis, UUID 6edb6157-d195-e1ba-eed2-a021b432b99f. A single bank of memory, comprising four DIMMs, was identified as a result of a second diagnosis, UUID 709b49ae-c2c9-ea93-93a2-ed44337adf5d.
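To list every suspect resource covered by a single diagnosis, the fault log can be queried by UUID with fmdump. A minimal sketch, run from the explorer fma directory used in the examples below (on a live system, omit the log path to use the default fault log):

$ fmdump -v -u 6edb6157-d195-e1ba-eed2-a021b432b99f var/fm/fmd/fltlog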

Exceptions: The output of fmadm faulty will only contain faults on components currently presented to Solaris. There are a number of reasons why components might not be presented to Solaris (a quick check is shown after the list below).

  • The component is CHS disabled and no longer presented to Solaris
  • The component failed POST and is no longer presented to Solaris 
  • An admin manually removed components from the domain/system config.
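A quick way to confirm which CPU IDs are actually presented to Solaris is psrinfo, either on the live system or from the psrinfo output collected by explorer. A minimal sketch, reusing the CPU IDs from the example above; if an ID does not appear in the output, that CPU is not currently presented to Solaris:

$ psrinfo | egrep '^(482|486)'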

 

Note: On Solaris 10 u5 (05/08) or later, if you specify the -a option (fmadm-faulty-a.out in the explorer), all resource information cached by the Fault Manager is listed.

 

Example fmadm faulty output from Solaris 10 u5 or above:

 

--------------- ------------------------------------  -------------- ---------
 TIME            EVENT-ID                              MSG-ID         SEVERITY
 --------------- ------------------------------------  -------------- ---------
 Aug 28 15:51:07 fde0f532-04f1-ed2a-dd72-d371fe56972a  SUN4U-8000-35  Critical

 Host        : cls011a
 Platform    : SUNW,Sun-Fire-15000  Chassis_id  :
 Product_sn  :

 Fault class : fault.memory.bank 95%
 Affects     : mem:///unum=SB3/P1/B1,J14301,J14401,J14501,J14601
                   out of service, but associated components no longer faulty
 FRU         : mem:///unum=SB3/P1/B1,J14301,J14401,J14501,J14601 95%
                   not present
 Serial ID.  : 5015401A1EF22
               5015401414014
               5015401A1EF00
               5015401407058

 Description : The number of errors associated with this memory module has
               exceeded acceptable levels.  Refer to
               http://sun.com/msg/SUN4U-8000-35 for more information.

 Response    : Pages of memory associated with this memory module are being
               removed from service as errors are reported.

 Impact      : Total system memory capacity will be reduced as pages are
               retired.

 Action      : Schedule a repair procedure to replace the affected memory
               module. Use fmdump -v -u <EVENT_ID> to identify the module.

 


In certain cases the FMA service might no longer be running, or components might have been removed from the configuration. To see recent faults it can be useful to look at the raw fltlog. When looking at the raw fault and error logs, be aware that they will contain information for resources that might no longer be present in the system. The fmdump command can be used to view the raw fltlog; in this example we see both faults and can also confirm that the DIMMs were identified first.

$ pwd
/explorer.../fma
$ fmdump -v var/fm/fmd/fltlog

TIME UUID SUNW-MSG-ID
Nov 13 20:59:26.2683 709b49ae-c2c9-ea93-93a2-ed44337adf5d SUN4U-8000-35
   95% fault.memory.bank
      Problem in: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
         Affects: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
            FRU: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
Nov 13 20:59:26.8858 6edb6157-d195-e1ba-eed2-a021b432b99f SUN4U-8000-XJ
   100% fault.cpu.ultraSPARC-IVplus.l2cachedata
      Problem in: cpu:///cpuid=486/serial=80011A28C751C469
         Affects: cpu:///cpuid=486/serial=80011A28C751C469
            FRU: hc:///component=SB15
   100% fault.cpu.ultraSPARC-IVplus.l2cachedata
      Problem in: cpu:///cpuid=482/serial=80011A28C751C469
         Affects: cpu:///cpuid=482/serial=80011A28C751C469
            FRU: hc:///component=SB15

 

Note: When running fmdump commands against fltlog/errlog files captured by explorer, the date stamps will be displayed using your timezone, not the customer's timezone. The customer's TZ setting can be found in the explorer in sysconfig/env.out.

 

Step 2. Confirm exactly which error/errors resulted in each fault diagnosis

 

For each active fault, now look for the detail of which errors led to the fault diagnosis. There are two options: either run commands against the raw fltlog/errlog, or use the files collected by explorer itself.

 

In this example we have explorer version 5.4+ so all the outputs we need have been collected.

$ pwd
/explorer.../fma
$ cat fmdump-eu_6edb6157-d195-e1ba-eed2-a021b432b99f.out

TIME CLASS
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.ucu
$ cat fmdump-eu_709b49ae-c2c9-ea93-93a2-ed44337adf5d.out

TIME CLASS
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due


From the above output we can see that both diagnoses took place in the same second, with the CPU identified roughly 16.5 milliseconds (21:10:00.1295 versus 21:10:00.1130) after the DIMMs were identified as at fault. The chances of two distinct failures occurring on a single system in a single day are very low; to have them happen within the same second is almost impossible.

 

If the fmdump-eu_ outputs have not been collected, the same commands can be run manually as follows:

 

$ grep TZ sysconfig/env.out

TZ=US/Pacific

$ TZ=US/Pacific ; export TZ

$ fmdump -eu 6edb6157-d195-e1ba-eed2-a021b432b99f fma/var/fm/fmd/errlog

TIME CLASS
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.ucu

 

 

Step 3. Confirm whether those errors were actually the first significant errors, or whether they resulted from earlier significant errors.

Looking at the raw errors around the time of the failure, we see a CE followed by two DUEs, then the UCU which resulted in the CPU being disabled.

$ pwd
/explorer.../fma
$ grep "Nov 13 21" fmdump-e.out

Nov 13 21:09:40.8350 ereport.cpu.ultraSPARC-IVplus.ce
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.ucu
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.l3-wdu
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.wdu


Looking in more detail at the errors, we see that the CE and DUE come from the same bank of DIMMs, and all the AFAR/syndrome data indicates a clear DIMM fault. In this example we can be 100% confident that the subsequent CPU fault is false (Bug ID 6606049 was hit in this case).

$ cat fmdump-eV.out
Nov 13 2007 21:09:40.835062186 ereport.cpu.ultraSPARC-IVplus.ce
nvlist version: 0
    class = ereport.cpu.ultraSPARC-IVplus.ce
    ena = 0x594d9d20ea280001
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x1
        scheme = cpu
        cpuid = 0x200
        cpumask = 0x24
        serial = 80010208F76C9017
    (end detector)
    afsr = 0x1000020000010c
    afsr-ext = 0x0
    afar-status = 0x1
    afar = 0x22202c22780
    pc = 0x124037c
    tl = 0x0
    tt = 0x63
    privileged = 1
    multiple = 0
    syndrome-status = 0x1
    syndrome = 0x10c
    error-type = P
    error-disposition = 0x221008000a9
    l3-cache-ways = 0x4
    l3-cache-data = ... removed excessive output
    l2-cache-ways = 0x1
    l2-cache-data = 0xec0106f1a6 ...
    dcache-ways = 0x1
    dcache-data = 0xdc0106f1a6 ...
    icache-ways = 0x0
    resource = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = mem
        unum = SB17/P2/B1/D2 J15501
        serial = 5017386234369
        offset = 0x2d0c69a2
    (end resource)
    __ttl = 0x1
    __tod = 0x473a0484 0x31c609aa

Nov 13 2007 21:10:00.113082196 ereport.cpu.ultraSPARC-IVplus.due
nvlist version: 0
    class = ereport.cpu.ultraSPARC-IVplus.due
    ena = 0x59957926b9178801
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x1
        scheme = cpu
        cpuid = 0x1e2
        cpumask = 0x22
        serial = 80011A28C751C469
    (end detector)
    afsr = 0x500000000001bc
    afsr-ext = 0x0
    afar-status = 0x1       <- valid AFAR
    afar = 0x227e242f780
    pc = 0x10ef908
    tl = 0x0
    tt = 0x63
    privileged = 1
    multiple = 0
    syndrome-status = 0x1   <- valid Syndrome
    syndrome = 0x1bc        <- not a signalling syndrome
    error-type = U
    error-disposition = 0x0
    l3-cache-ways = 0x4
    l3-cache-data = ... removed excessive output
    l2-cache-ways = 0x1
    l2-cache-data = 0xec0106f1a6 ...
    dcache-ways = 0x0
    icache-ways = 0x0
    resource = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = mem
        unum = SB17/P2/B1 J15301 J15401 J15501 J15601
    (end resource)
    __ttl = 0x1
    __tod = 0x473a0498 0x6bd7f54
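A quick way to compare the key fields (event class, AFAR, syndrome, unum) across all the ereports, without reading each full nvlist, is to grep them out of the verbose output. A minimal sketch, assuming the explorer file name used above; this makes it easy to see that the CE and DUE resolve to the same bank of DIMMs:

$ egrep 'ereport\.|afar =|syndrome =|unum =' fmdump-eV.out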

 

Step 4. What to do if a misdiagnosis is suspected

Patching Solaris is the first thing to be done.

Open an SR and have the data reviewed.

TSC Engineer recommendation:

Recommend patching if a known bug has been identified as the cause of the misdiagnosis.

Re-enable the innocent components if required (a sketch follows below). Alternatively, collaborate with the next level of support so that your customer can be added to an existing open CR, or a new CR can be logged if required.
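On a live system, the patch check and the re-enabling steps typically look like the following. This is only a minimal sketch (some commands require root), reusing the fault UUID and CPU IDs from the example above and the patch IDs from the Known bugs list below; the exact steps depend on the platform and Solaris level, so confirm them through the SR before clearing anything.

Check whether the relevant fixes are already installed (Solaris 10):

$ showrev -p | egrep '137111|125369|123839'

After patching, clear the false diagnosis by its UUID (fmadm repair accepts a fault UUID), then bring the wrongly offlined CPUs back online:

$ fmadm repair 6edb6157-d195-e1ba-eed2-a021b432b99f
$ psradm -n 482 486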


Additional info in case of panic


From the core files, FMA obtains the ereports which occurred during the panic. These are replayed and standard faults are generated.

  • The ereport time stamps are correct
  • The fault time stamps reflect when the diagnosis was carried out (compare the two as shown below)
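To compare the two sets of time stamps for a given diagnosis, pull the fault from the fault log and its contributing ereports from the error log by UUID. A minimal sketch, reusing the UUID and explorer paths from the examples above:

$ fmdump -u 6edb6157-d195-e1ba-eed2-a021b432b99f fma/var/fm/fmd/fltlog
$ fmdump -e -u 6edb6157-d195-e1ba-eed2-a021b432b99f fma/var/fm/fmd/errlog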

 

Known bugs:

  • CPUs disabled due to memory UE errors
    • <BUG 15424025>  USIV+ CPUs disabled due to a memory UE fault
      • Fix in Solaris 10 137111-02
    • <BUG 15323597>  On Panther, on an L2 or L3 xxU, cpumem DE needs to check two UE caches
      • Fix in Solaris 10 123839-06
    • <BUG 15328432>  UE and DUE and WDU cause processor to be offlined
      • Fix in Solaris 10 125369-11
    • Sun Fire Systems Equipped With UltraSPARC IV+ Processor Modules Running Solaris 9 or Solaris 10 may Exhibit Unnecessary CPU Offlining and Solaris Panics <Document 1000495.1>
    • <BUG 15296134>  sparc cpumem-diagnosis needs to implement rules 4A and 4B
      • Fix in Solaris 10 125369-02

 

NOTE: Not a comprehensive list.

 

Tips, Tricks and Tools


Use fmdump to look at errors before or after a certain date

  • Display errors after Dec 1st 2007:   fmdump -t 12/01/07 -e errlog
  • Display errors before Dec 1st 2007:  fmdump -T 12/01/07 -e errlog

A faulty CPU or memory component can generate thousands of errors; when looking into these faults, a summary of the errors is far more useful than trying to examine each individual error.

  • Findfma is a command line summary script for fmdump -eV outputs.
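If Findfma is not available, a rough summary of the error classes can be produced with standard tools. A minimal sketch, assuming the explorer errlog path used in the examples above:

$ fmdump -e fma/var/fm/fmd/errlog | grep ereport | awk '{ print $4 }' | sort | uniq -c | sort -rn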


The fmdump -V output captures the serial numbers of the DIMMs identified. This is useful for confirming that the correct DIMMs were actually replaced if a fault returns.
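For example, to pull the recorded DIMM serial numbers for a specific diagnosis out of the fault log (a minimal sketch, reusing the memory fault UUID and explorer paths from the examples above):

$ fmdump -V -u 709b49ae-c2c9-ea93-93a2-ed44337adf5d fma/var/fm/fmd/fltlog | grep -i serial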

 

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in an appropriate
My Oracle Support Community - Oracle Sun Technologies Community.

References

<BUG:15328432> - SUNBT6420932-SOLARIS_11 UE AND DUE AND WDU CAUSE PROCESSOR TO BE OFFLINED
<BUG:15424025> - SUNBT6606049-SOLARIS_11 USIV+ CPUS DISABLED DUE TO A MEMORY UE FAULT
<NOTE:1000495.1> - Sun Fire Systems Equipped With UltraSPARC IV+ Processor Modules Running Solaris 9 or Solaris 10 may Exhibit Unnecessary CPU Offlining and Solaris Panics
<BUG:15323597> - SUNBT6408988-SOLARIS_11 ON PANTHER, ON AN L2 OR L3 XXU, CPUMEM DE NEEDS TO CHECK TWO UE CACHES

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.