Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1008263.1 : How to Troubleshoot CPU/Memory faults with Solaris[TM] FMA
Previously Published As 211323

Applies to:
Sun Fire E20K Server - Version All Versions and later
Sun Fire E25K Server - Version All Versions and later
Sun Fire 4810 Server - Version All Versions and later
Sun Fire 6800 Server - Version All Versions and later
Sun Fire E2900 Server - Version All Versions and later
All Platforms

Goal
Certain hardware failures can result in multiple components being identified as at fault. In addition, FMA bugs exist which can cause innocent components to be disabled in error.

Solution
Step 1. Confirm what FMA believes to be currently at fault from the output of fmadm faulty
If the fmadm faulty output looks like the following example, the system is patched at a very old level where multiple bugs exist. At this level memory fault diagnosis is still very reliable, but CPU faults are very likely to be false.

$ pwd
/explorer.../fma
$ cat fmadm-faulty.out
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=482/serial=80011A28C751C469
         6edb6157-d195-e1ba-eed2-a021b432b99f
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=486/serial=80011A28C751C469
         6edb6157-d195-e1ba-eed2-a021b432b99f
-------- ----------------------------------------------------------------------
degraded mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
         709b49ae-c2c9-ea93-93a2-ed44337adf5d
-------- ----------------------------------------------------------------------
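When several resources are listed in this older fmadm faulty format, grouping them by UUID makes it easier to see which entries belong to the same diagnosis. The following is a minimal Python sketch, not part of FMA or explorer. It assumes each entry consists of a state, a resource FMRI and a UUID separated by whitespace, and the function name `group_faulty_by_uuid` is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: each entry is "<state> <resource FMRI> <UUID>", separated by any
# whitespace (the UUID normally sits on the line below the resource).
ENTRY = re.compile(
    r"\b(\w+)\s+"                                      # state, e.g. faulted
    r"(\w+:///\S+)\s+"                                 # resource FMRI
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"   # diagnosis UUID
)


def group_faulty_by_uuid(text):
    """Map each diagnosis UUID to the (state, resource) pairs listed under it."""
    groups = defaultdict(list)
    for state, resource, uuid in ENTRY.findall(text):
        groups[uuid].append((state, resource))
    return dict(groups)


# Sample taken from the fmadm-faulty.out shown in this article.
SAMPLE = """\
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=482/serial=80011A28C751C469
         6edb6157-d195-e1ba-eed2-a021b432b99f
-------- ----------------------------------------------------------------------
 faulted cpu:///cpuid=486/serial=80011A28C751C469
         6edb6157-d195-e1ba-eed2-a021b432b99f
-------- ----------------------------------------------------------------------
degraded mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
         709b49ae-c2c9-ea93-93a2-ed44337adf5d
-------- ----------------------------------------------------------------------
"""

for uuid, entries in group_faulty_by_uuid(SAMPLE).items():
    print(uuid)
    for state, resource in entries:
        print("   ", state, resource)
```

For the sample, both CPU entries fall under one UUID and the memory bank under another, which matches the two diagnoses examined in the later steps.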
Note: From Solaris 10 u5 (05/08) onward, specifying the -a option (fmadm-faulty-a.out in the explorer) lists all resource information cached by the Fault Manager.
Example fmadm faulty output from Solaris 10 u5 or above:
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 28 15:51:07 fde0f532-04f1-ed2a-dd72-d371fe56972a SUN4U-8000-35  Critical

Host        : cls011a
Platform    : SUNW,Sun-Fire-15000
Chassis_id  :
Product_sn  :

Fault class : fault.memory.bank 95%
Affects     : mem:///unum=SB3/P1/B1,J14301,J14401,J14501,J14601
                  out of service, but associated components no longer faulty
FRU         : mem:///unum=SB3/P1/B1,J14301,J14401,J14501,J14601 95%
                  not present
Serial ID.  : 5015401A1EF22 5015401414014 5015401A1EF00 5015401407058

Description : The number of errors associated with this memory module has
              exceeded acceptable levels. Refer to
              http://sun.com/msg/SUN4U-8000-35 for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module. Use fmdump -v -u <EVENT_ID> to identify the module.
$ pwd
/explorer.../fma
$ fmdump -v var/fm/fmd/fltlog
TIME                 UUID                                 SUNW-MSG-ID
Nov 13 20:59:26.2683 709b49ae-c2c9-ea93-93a2-ed44337adf5d SUN4U-8000-35
   95%  fault.memory.bank

        Problem in: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
           Affects: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
               FRU: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601

Nov 13 20:59:26.8858 6edb6157-d195-e1ba-eed2-a021b432b99f SUN4U-8000-XJ
  100%  fault.cpu.ultraSPARC-IVplus.l2cachedata

        Problem in: cpu:///cpuid=486/serial=80011A28C751C469
           Affects: cpu:///cpuid=486/serial=80011A28C751C469
               FRU: hc:///component=SB15

  100%  fault.cpu.ultraSPARC-IVplus.l2cachedata

        Problem in: cpu:///cpuid=482/serial=80011A28C751C469
           Affects: cpu:///cpuid=482/serial=80011A28C751C469
               FRU: hc:///component=SB15
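To see at a glance which FRUs each diagnosis points at, the fmdump -v text can be reduced to one list of suspects per UUID. This is a rough Python sketch under the assumption that every suspect appears as a percentage, a fault class, then "Problem in:", "Affects:" and "FRU:" fields, as in the output above. The name `suspects_by_uuid` is hypothetical.

```python
import re

UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"

# Either a fault-log header (UUID followed by the message ID) or one suspect.
TOKEN = re.compile(
    r"(?P<uuid>" + UUID + r")\s+(?P<msgid>\S+)"
    r"|(?P<pct>\d+)%\s+(?P<cls>fault\.\S+)"
    r"\s+Problem in:\s+(?P<problem>\S+)"
    r"\s+Affects:\s+(?P<affects>\S+)"
    r"\s+FRU:\s+(?P<fru>\S+)"
)


def suspects_by_uuid(text):
    """Return {uuid: [(certainty %, fault class, problem resource, FRU), ...]}."""
    result = {}
    current = None
    for match in TOKEN.finditer(text):
        if match.group("uuid"):
            # New diagnosis header: later suspects belong to this UUID.
            current = match.group("uuid")
            result.setdefault(current, [])
        elif current is not None:
            result[current].append((
                int(match.group("pct")),
                match.group("cls"),
                match.group("problem"),
                match.group("fru"),
            ))
    return result


# Sample taken from the fmdump -v output shown in this article.
SAMPLE = """\
TIME                 UUID                                 SUNW-MSG-ID
Nov 13 20:59:26.2683 709b49ae-c2c9-ea93-93a2-ed44337adf5d SUN4U-8000-35
   95%  fault.memory.bank

        Problem in: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
           Affects: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601
               FRU: mem:///unum=SB17/P2/B1,J15301,J15401,J15501,J15601

Nov 13 20:59:26.8858 6edb6157-d195-e1ba-eed2-a021b432b99f SUN4U-8000-XJ
  100%  fault.cpu.ultraSPARC-IVplus.l2cachedata

        Problem in: cpu:///cpuid=486/serial=80011A28C751C469
           Affects: cpu:///cpuid=486/serial=80011A28C751C469
               FRU: hc:///component=SB15

  100%  fault.cpu.ultraSPARC-IVplus.l2cachedata

        Problem in: cpu:///cpuid=482/serial=80011A28C751C469
           Affects: cpu:///cpuid=482/serial=80011A28C751C469
               FRU: hc:///component=SB15
"""

for uuid, suspects in suspects_by_uuid(SAMPLE).items():
    frus = sorted({fru for _, _, _, fru in suspects})
    print(uuid, len(suspects), "suspect(s), FRU:", ", ".join(frus))
```

In the sample, the CPU diagnosis has two suspects that share the FRU SB15, while the memory diagnosis names a bank on SB17.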
Note: When running fmdump commands on the fltlog/errlog files captured by explorer, the timestamps are displayed in your timezone, not the customer's timezone. The customer TZ setting can be found in the explorer in sysconfig/env.out
Step 2. Confirm exactly which error/errors resulted in each fault diagnosis
For each active fault, now look for the detail of which errors led to the fault diagnosis. There are two options: either run commands against the raw fltlog/errlog, or use the files collected by explorer itself.
In this example we have explorer version 5.4+, so all the outputs we need have been collected.

$ pwd
/explorer.../fma
$ cat fmdump-eu_6edb6157-d195-e1ba-eed2-a021b432b99f.out
TIME                 CLASS
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.ucu
$ cat fmdump-eu_709b49ae-c2c9-ea93-93a2-ed44337adf5d.out
TIME                 CLASS
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
If the fmdump-eu_ outputs have not been collected, run the same commands manually as follows:
$ grep TZ sysconfig/env.out
TZ=US/Pacific
$ TZ=US/Pacific ; export TZ
$ fmdump -eu 6edb6157-d195-e1ba-eed2-a021b432b99f fma/var/fm/fmd/fltlog
TIME                 CLASS
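The manual steps above (read the customer TZ from env.out, then run fmdump with that TZ) could be wrapped in a small helper. The Python below is only a sketch. It assumes env.out holds one NAME=value pair per line, and the helper names `customer_tz` and `fmdump_invocation` are hypothetical. It builds the command and environment but does not run fmdump itself.

```python
import re


def customer_tz(env_text):
    """Return the TZ value found in the text of sysconfig/env.out, or None."""
    match = re.search(r"^TZ=(\S+)\s*$", env_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def fmdump_invocation(uuid, log_path, tz):
    """Build the argument list and TZ override for the manual fmdump command."""
    argv = ["fmdump", "-eu", uuid, log_path]
    env_override = {"TZ": tz} if tz else {}
    return argv, env_override


# Hypothetical env.out content; only the TZ line comes from this article.
ENV_OUT = "HOME=/\nTZ=US/Pacific\nTERM=vt100\n"

argv, env_override = fmdump_invocation(
    "6edb6157-d195-e1ba-eed2-a021b432b99f",
    "fma/var/fm/fmd/fltlog",
    customer_tz(ENV_OUT),
)
print(env_override, " ".join(argv))
```

On a host where fmdump is installed, the result could be passed to `subprocess.run(argv, env={**os.environ, **env_override})`, which sets TZ for that one command instead of exporting it for the whole shell session.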
Step 3. Confirm whether those errors were actually the first significant errors or were the result of earlier significant errors
Looking at the raw errors around the time of the failure, we see a CE followed by two DUEs, then the UCU which resulted in the CPU being disabled.

$ pwd
/explorer.../fma
$ grep "Nov 13 21" fmdump-e.out
Nov 13 21:09:40.8350 ereport.cpu.ultraSPARC-IVplus.ce
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.ucu
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.l3-wdu
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.wdu
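The check in this step, finding what was logged shortly before the ereport cited in the diagnosis, can be sketched in Python as follows. The sketch assumes fmdump -e style lines (month, day, time, class) and takes the year as a parameter because the short format omits it. The function names and the 60 second window are illustrative choices, not FMA defaults.

```python
import re
from datetime import datetime, timedelta

LINE = re.compile(
    r"([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d+)\s+(ereport\.\S+)"
)


def parse_events(text, year):
    """Return (timestamp, class) tuples from fmdump -e output, oldest first."""
    events = []
    for stamp, cls in LINE.findall(text):
        normalized = " ".join(stamp.split())  # collapse padding such as "Nov  3"
        when = datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S.%f")
        events.append((when, cls))
    return sorted(events, key=lambda event: event[0])


def preceding_events(events, cited_class, window_seconds=60):
    """Events logged within window_seconds before the first cited ereport."""
    cited_times = [when for when, cls in events if cls == cited_class]
    if not cited_times:
        return []
    first_cited = min(cited_times)
    start = first_cited - timedelta(seconds=window_seconds)
    return [(when, cls) for when, cls in events if start <= when < first_cited]


# Sample taken from the fmdump-e.out lines shown in this article.
SAMPLE = """\
Nov 13 21:09:40.8350 ereport.cpu.ultraSPARC-IVplus.ce
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.ucu
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.l3-wdu
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.wdu
"""

events = parse_events(SAMPLE, 2007)
for when, cls in preceding_events(events, "ereport.cpu.ultraSPARC-IVplus.ucu"):
    print(when.time(), cls)
```

For the sample, the events preceding the UCU are the CE and the two DUEs, which is the sequence described above.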
$ cat fmdump-eV.out
Nov 13 2007 21:09:40.835062186 ereport.cpu.ultraSPARC-IVplus.ce
nvlist version: 0
        class = ereport.cpu.ultraSPARC-IVplus.ce
        ena = 0x594d9d20ea280001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x1
                scheme = cpu
                cpuid = 0x200
                cpumask = 0x24
                serial = 80010208F76C9017
        (end detector)

        afsr = 0x1000020000010c
        afsr-ext = 0x0
        afar-status = 0x1
        afar = 0x22202c22780
        pc = 0x124037c
        tl = 0x0
        tt = 0x63
        privileged = 1
        multiple = 0
        syndrome-status = 0x1
        syndrome = 0x10c
        error-type = P
        error-disposition = 0x221008000a9
        l3-cache-ways = 0x4
        l3-cache-data = ... removed excessive output
        l2-cache-ways = 0x1
        l2-cache-data = 0xec0106f1a6 ...
        dcache-ways = 0x1
        dcache-data = 0xdc0106f1a6 ...
        icache-ways = 0x0
        resource = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = mem
                unum = SB17/P2/B1/D2 J15501
                serial = 5017386234369
                offset = 0x2d0c69a2
        (end resource)

        __ttl = 0x1
        __tod = 0x473a0484 0x31c609aa

Nov 13 2007 21:10:00.113082196 ereport.cpu.ultraSPARC-IVplus.due
nvlist version: 0
        class = ereport.cpu.ultraSPARC-IVplus.due
        ena = 0x59957926b9178801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x1
                scheme = cpu
                cpuid = 0x1e2
                cpumask = 0x22
                serial = 80011A28C751C469
        (end detector)

        afsr = 0x500000000001bc
        afsr-ext = 0x0
        afar-status = 0x1      <- valid AFAR
        afar = 0x227e242f780
        pc = 0x10ef908
        tl = 0x0
        tt = 0x63
        privileged = 1
        multiple = 0
        syndrome-status = 0x1  <- valid Syndrome
        syndrome = 0x1bc       <- not a signalling syndrome
        error-type = U
        error-disposition = 0x0
        l3-cache-ways = 0x4
        l3-cache-data = ... removed excessive output
        l2-cache-ways = 0x1
        l2-cache-data = 0xec0106f1a6 ...
        dcache-ways = 0x0
        icache-ways = 0x0
        resource = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = mem
                unum = SB17/P2/B1 J15301 J15401 J15501 J15601
                (end resource)

        __ttl = 0x1
        __tod = 0x473a0498 0x6bd7f54
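The detector cpuid in fmdump -eV output is hexadecimal, while fmadm faulty reports it in decimal, so it helps to convert it when matching an ereport to a faulted CPU. The Python sketch below assumes the usual one "name = value" pair per line layout of fmdump -eV. It keeps only the class, detector cpuid and resource unum of each ereport. The function name `summarize_ereports` is hypothetical.

```python
import re

HEADER = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{4}\s+[\d:.]+\s+(ereport\.\S+)\s*$")
PAIR = re.compile(r"^\s*([\w-]+) = (.*?)\s*$")


def summarize_ereports(text):
    """Return one dict per ereport with its class, decimal cpuid and unum."""
    summaries = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"class": header.group(1)}
            summaries.append(current)
            continue
        pair = PAIR.match(line)
        if pair and current is not None:
            name, value = pair.groups()
            if name == "cpuid":
                current["cpuid"] = int(value, 16)  # 0x1e2 becomes 482
            elif name == "unum":
                current["unum"] = value
    return summaries


# Trimmed sample based on the fmdump-eV.out shown in this article.
SAMPLE = """\
Nov 13 2007 21:09:40.835062186 ereport.cpu.ultraSPARC-IVplus.ce
nvlist version: 0
        class = ereport.cpu.ultraSPARC-IVplus.ce
        detector = (embedded nvlist)
        nvlist version: 0
                scheme = cpu
                cpuid = 0x200
        (end detector)
        resource = (embedded nvlist)
        nvlist version: 0
                scheme = mem
                unum = SB17/P2/B1/D2 J15501
        (end resource)
Nov 13 2007 21:10:00.113082196 ereport.cpu.ultraSPARC-IVplus.due
nvlist version: 0
        class = ereport.cpu.ultraSPARC-IVplus.due
        detector = (embedded nvlist)
        nvlist version: 0
                scheme = cpu
                cpuid = 0x1e2
        (end detector)
        resource = (embedded nvlist)
        nvlist version: 0
                scheme = mem
                unum = SB17/P2/B1 J15301 J15401 J15501 J15601
        (end resource)
"""

for item in summarize_ereports(SAMPLE):
    print(item["class"], "cpuid", item["cpuid"], "unum", item["unum"])
```

In the sample, the DUE was detected by cpuid 482, one of the CPUs listed as faulted, and its resource is memory bank SB17/P2/B1, the same bank that fmadm faulty shows as degraded.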
Step 4. What to do if a misdiagnosis is suspected
Patching Solaris is the first thing to be done. Open an SR and have the data reviewed.
TSC Engineer recommendation: If a known bug has been identified as the cause of the misdiagnosis, recommend patching and re-enable the innocent components if required. Alternatively, collaborate with the next level of support so that your customer can be added to an existing open CR, or a new CR can be logged if required.
Known bugs:
NOTE: Not a comprehensive list.
Tips, Tricks and Tools
CPU and memory faults can generate thousands of error reports. When looking into these faults, a summary of the errors is far more useful than trying to look at individual errors.
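One simple form of summary is a count of ereports per class. The Python below is a minimal sketch that assumes fmdump -e style text, where each event line ends with an `ereport.` class name. The function name `summarize_by_class` is hypothetical.

```python
import re
from collections import Counter

EREPORT = re.compile(r"\bereport\.\S+")


def summarize_by_class(text):
    """Return (class, count) pairs, most frequent class first."""
    return Counter(EREPORT.findall(text)).most_common()


# Sample taken from the fmdump-e.out lines shown in this article.
SAMPLE = """\
TIME                 CLASS
Nov 13 21:09:40.8350 ereport.cpu.ultraSPARC-IVplus.ce
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
Nov 13 21:10:00.1130 ereport.cpu.ultraSPARC-IVplus.due
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.ucu
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.l3-wdu
Nov 13 21:10:00.1295 ereport.cpu.ultraSPARC-IVplus.wdu
"""

for cls, count in summarize_by_class(SAMPLE):
    print(f"{count:6d}  {cls}")
```

On a real errlog with thousands of entries, the same count shows quickly whether correctable or uncorrectable errors dominate before any individual ereport is read.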
References
<BUG:15328432> - SUNBT6420932-SOLARIS_11 UE AND DUE AND WDU CAUSE PROCESSOR TO BE OFFLINED
<BUG:15424025> - SUNBT6606049-SOLARIS_11 USIV+ CPUS DISABLED DUE TO A MEMORY UE FAULT
<NOTE:1000495.1> - Sun Fire Systems Equipped With UltraSPARC IV+ Processor Modules Running Solaris 9 or Solaris 10 may Exhibit Unnecessary CPU Offlining and Solaris Panics
<BUG:15323597> - SUNBT6408988-SOLARIS_11 ON PANTHER, ON AN L2 OR L3 XXU, CPUMEM DE NEEDS TO CHECK