Asset ID: 1-79-1470580.1
Update Date: 2018-03-06
Keywords:
Solution Type: Predictive Self-Healing Sure Solution
1470580.1: Troubleshooting data needed for T3-x, T4-x, T5-x, T7-x, S7-x, T8-x servers
Related Items:
- SPARC T3-1
- SPARC T3-4
- SPARC S7-2
- SPARC T3-1B
- SPARC T7-1
- Netra SPARC T4-2 Server
- Netra T3-1BA
- SPARC T4-2
- SPARC T8-1
- Netra SPARC S7-2
- SPARC T5-8
- Netra SPARC T4-1 Server
- SPARC T8-2
- Netra SPARC T4-1B
- SPARC T7-2
- SPARC T5-2
- SPARC T5-4
- SPARC T8-4
- SPARC T4-1
- SPARC T3-2
- SPARC T4-1B
- SPARC T5-1B
- SPARC T4-4
- SPARC S7-2L
- Netra T3-1
Related Categories:
- PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4
Applies to:
SPARC T5-2 - Version All Versions and later
Netra SPARC T4-2 Server - Version All Versions and later
SPARC T8-1 - Version All Versions and later
SPARC T5-4 - Version All Versions and later
Netra SPARC S7-2 - Version All Versions and later
Information in this document applies to any platform.
Purpose
This document provides a high-level guide for hardware specialists covering what data to collect & how to analyze major server problems. A similar document written from the software perspective is doc 1012913.1.
Scope
Details
If a T7-x or S7-x server has a fatal reset due to a power glitch (Voltage Regulator failure), then please provide both a Snapshot & an Explorer!
Please upload the required data to the SR unless other arrangements have been made with the SR owner.
Reasons for SR Creation & data to gather
A system becomes unresponsive for one of four reasons: an admin reboot, a fatal reset, an OS panic, or a hang. Each is covered in its own section below.
In most cases, we recommend that both explorers & ILOM snapshots be collected if possible. The absolutely required information relates to how the system crashed. If the server is operational, see doc 1010911.1: "What to send to Oracle after a system panic and/or unexpected reboot" for the necessary data to gather. That document lists the basic questions that must be asked to determine what data to gather. Please note that an ILOM Snapshot will contain the console output & most other SP related data for the T3 & newer servers.
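For reference, an ILOM snapshot can also be started from the ILOM CLI. The following is a sketch only; the dataset value and the transfer URI (the user, host & directory shown are placeholders) depend on the ILOM version & site setup:
- -> set /SP/diag/snapshot dataset=normal (the normal dataset is sufficient for most cases)
- -> set /SP/diag/snapshot dump_uri=sftp://user:password@host/directory (starts the collection & transfer)
- -> show /SP/diag/snapshot (check the result property until the snapshot completes)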
An SR can also be opened for a boot failure or for non-fatal hardware failures where the system continues to operate because some components are redundant or because performance is merely reduced when one fails:
- Boot failure (The Ultimate Fatal Reset)
- Redundant Components: Fans & PSUs
- Performance Limited: DIMMs
LDOM based problems can temporarily be worked around by booting the system with the Trusted Platform Module (TPM) state purged (LDOM Manager) on the next boot (sketched below):
- -> set /HOST/tpm forceclear=true
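A minimal sketch of the full sequence, assuming the host can be stopped so the setting takes effect on the next boot:
- -> stop /SYS (gracefully stop the host; confirm when prompted)
- -> set /HOST/tpm forceclear=true (the TPM state is purged on the next boot)
- -> start /SYS (boot the host)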
An Explorer should be gathered if the system is bootable. Features & problems with versions of Explorer are as follows:
- 6.10 is the minimum version for these platforms since it collects ILOM data via ipmitool to gather FRU data. System FW 8.2.2.c will also collect DIMM vendor part numbers.
- During ipmi & ilom collection via the net, the ILOM's IP address & root password must be entered twice.
- 7.03 can collect ILOM information via the vbsc with no prompts to the user. See doc 1518044.1.
- 7.03 has added collection timeouts, so some information may not be gathered. See Explorer Timeout FAQs in doc 1287574.1
- 8.00 may take 5 minutes to install on an S11 based system.
- 8.00 collects LDOM information by default
- 8.00 extended collection timeouts for ipmitool data.
- 8.00 looks for ipmitool first in /opt/ipmitool/sbin. Earlier versions must have the ipmitool executable copied to /opt/ipmitool/bin!
- 8.03 removed the timeouts introduced in 7.03. Please run the ptree command & see Explorer FAQ doc 1287574.1 for data collection & timeout configuration information.
- See doc 1612918.1 to determine why Explorer collection runs too long
- 8.11 Resolves lack of FMA data collection introduced in 8.10
- 8.13 Allows usage on Linux systems if called "explorer -w default,sundiag"
Explorers should be run as follows to gather the proper data:
- # explorer -w ipmi,ipmiextended,ilomextended,ldom,default
- # explorer -w sundiag,default :For Linux. Also note when extracting the explorer that the directories sosreport & sundiag must be unpacked on a Linux system with the command "tar -xif". (A typical run is sketched after this list.)
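A typical collection run looks roughly like the following sketch; the output directory assumes a default Explorer installation:
- # explorer -w ipmi,ipmiextended,ilomextended,ldom,default (prompts for the ILOM IP & root password unless the internal interconnect is used)
- # ls /opt/SUNWexplo/output/ (the resulting explorer*.tar.gz is the file to upload to the SR)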
ILOM data should typically be gathered as follows. Please note that these servers allow a sideband connection to the network if the NetMgmt port cannot be used.
- Snapshot - Contains console history, FRU config, event logs, sensor information
- ILOMs typically require the Net Management / Sideband port configured, but can use the internal ILOM Interconnect or LDC ports if Oracle Hardware Management Pack (OHMP) is installed (1518044.1)
- The Data Set: "Collect Only Log Files" should ONLY be used if the snapshot hangs the ILOM. Dataset: "FRUID" will collect showpsnc
- Have the customer perform "set /HOST/console timestamp=yes" to place the ILOM time in the console log.
- or ILOM Data: "show -l all /", "show /SP/console/history", "show /SP/logs/event/list", "show faulty". See doc 1619420.1 for examples & helpful commands. (see ShowOff summary tool doc 1454001.1)
- or ipmitool Data (requires the Net Management port to be configured unless the server's internal ILOM Interconnect is used per doc 1518044.1). A small collection loop is sketched after this list.
- ipmitool -V
- ipmitool -I lanplus -H "SP ipaddress" -U root fru
- ipmitool -I lanplus -H "SP ipaddress" -U root sel elist
- ipmitool -I lanplus -H "SP ipaddress" -U root -v sdr
- ipmitool -I lanplus -H "SP ipaddress" -U root sdr elist
- ipmitool -I lanplus -H "SP ipaddress" -U root sdr list
- ipmitool -I lanplus -H "SP ipaddress" -U root chassis status
- ipmitool -I lanplus -H "SP ipaddress" -U root sunoem led get (requires OHMP or ipmitool versions per doc 1516567.1)
- ipmitool -I lanplus -H "SP ipaddress" -U root sensor
- Please note that the ILOM Interconnect is the preferred method when OHMP is installed!
- ipmitool -I lanplus -H 169.254.182.76 -U root sunoem led get
- Please note that the LDC interface can also be used.
- ipmitool -I bmc -U root fru (on SPARC systems)
- ipmitool -I open -U root fru (on Linux systems)
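If the data must be gathered by hand over the network, a small shell loop along these lines saves each output to its own file for upload. This is a sketch only; 192.0.2.10 is a placeholder SP address & the -a flag prompts for the SP root password:
#!/bin/sh
# Sketch: capture the ipmitool outputs listed above into separate files for the SR.
SP=192.0.2.10   # placeholder - replace with the real SP address
for CMD in fru "sel elist" "sdr elist" "sdr list" "chassis status" sensor
do
    OUT="ipmitool_`echo $CMD | tr ' ' '_'`.out"
    ipmitool -I lanplus -H "$SP" -U root -a $CMD > "$OUT"
done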
If the ILOM network is unresponsive, then reset the ILOM via the serial port, or power cycle the system.
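For example, from a serial console session on the SP (a sketch; the command prompts for confirmation & does not affect the running host):
- -> reset /SP (reboots the service processor only)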
Admin Reboot
Sometimes the admin accidentally or purposely reboots a server via Solaris commands (like init 0), via the SP's break, reset, or poweroff commands, or by removing power. Please note that the admin may do any of these to stop a hang condition instead of the recommended method below, which attempts to generate a core file. This is easy to detect (see the quick checks sketched after this list) by:
- Check messages for signal 15 prior to the reboot, which indicates the admin performed an "init 6" or some other method to reboot the host,
- Check SP events to determine if the admin reset the host via the SP or power button,
- Check for power events followed by a reboot.
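A quick set of checks along these lines (a sketch, assuming the default messages file location) covers all three:
- # last reboot | head -20 (times of recent boots & shutdowns)
- # grep "signal 15" /var/adm/messages (daemons exiting on signal 15 indicate an admin shutdown or reboot)
- # ipmitool -I lanplus -H "SP ipaddress" -U root sel elist | tail -20 (SP events around the time of the reboot)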
Last 20 Reboots
First review the reboots during the time of the incident as shown in the explorer.
##### sysconfig/last-20-reboot (last reboot) #####
reboot system boot Tue Feb 20 04:39
reboot system down Tue Feb 20 04:37
reboot system boot Thu Feb 8 16:24
reboot system down Thu Feb 8 16:20
reboot system boot Mon Feb 4 04:39
reboot system down Mon Feb 4 04:39
Then check the messages file for signs of signal 15 to determine if the admin did an init 6.
##### messages/messages (/var/adm/messages) #####
Feb 20 04:37:50 kcgams7 xntpd[337]: [ID 866926 daemon.notice] xntpd exiting on signal 15
Feb 20 04:37:50 kcgams7 syslogd: going down on signal 15
...
Feb 20 04:39:23 kcgams7 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_144488-17 64-bit
Also check the ILOM for reasons for a possible restart. Log entry 5 is a result of an ILOM /SYS stop & /SYS start. Log entries c & d prior to a reboot are a possible indication that power was lost, so determine if the SP was reset.
##### ilom/10.133.109.209/ipmitool_sel_elist.out (show /SP/logs/event/list or ipmitool -H "SP IP" -U root sel elist) #####
5 | 02/08/2012 | 21:22:29 | System Boot Initiated | System Restart | Asserted
c | 01/14/2000 | 09:06:19 | Voltage PS1/V_OUT | Lower Non-critical going low | Reading 0.80 < Threshold 10 Volts
d | 01/14/2000 | 09:06:19 | Power Supply PS1/PWROK | State Deasserted
Please remember, if power is removed from & restored to the system, the ILOM messages file (from the snapshot) is restarted & the restart message indicates the time of the restart.
##### spos_logs/@var@log@messages #####
Jul 15 15:36:05 netra-t4-2-sca11-a-sp syslogd 1.5.0: restart.
Fatal Reset
Fatal Resets are hardware-detected problems and are caused when the central processing unit (CPU) takes a trap that immediately drops the system to the OBP or worse. No Solaris messages are logged since the disk image does not get updated following these events, to prevent corruption. The system may or may not be operational following a fatal reset. One reason for a fatal reset is a watchdog reset, which is caused when the operating system fails to access the watchdog circuitry within its timeout period. This is really an operating system hang detected by the watchdog timer, so see the hang section below for techniques to diagnose it. Other reasons for fatal resets are hardware failures like loss of input voltage (see 1558027.1), or other major hardware related issues. No core file is saved and the messages file shows normal operation followed by an abrupt system restart (no done or dump succeeded message). The most important diagnosis information to retrieve is the following, which is mostly gained through the service processor (SP), the ILOM, so a snapshot should be gathered!
- LAST Output: From the explorer contains the times of reboots.
- Messages do not contain useful information, just a boot following an unrelated message.
- Console Output - This typically contains a reason for the reset for example critical component failure (or nothing for total power loss). An ILOM Snapshot contains the SP info! Output from ILOM command "show /HOST/console/history" will provide it.
- SP Faults
- SP events - This could contain sensor related events like under voltage conditions on one rail or OEM specific events like 0x12's or an ILOM reboot if power loss.
- SP ereports
- SP sensor data - This contains information if a sensor has a consistent problem like a voltage regulator or fan failure.
- SP field replaceable unit (FRU) data - This describes the hardware inventory configuration to assist with hardware replacement. Collect this to determine if the system has the proper configuration (e.g. a partially installed memory bank). A good item to check is the system board page in the Sun System Handbook.
- Is the system operational? If not, see Boot Failure after data analysis.
If the cause of the reboot or crash cannot be quickly determined given the information above, it's important to perform hardware diagnostics such as a full power on self test (POST) or SunVTS to determine if the hardware is stable. Typically the field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one.
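On these ILOM-based servers, maximum POST can typically be requested from the SP before the next power-on. The following is a sketch; property names & values can vary slightly between ILOM/firmware versions:
- -> set /HOST/diag mode=normal
- -> set /HOST/diag level=max
- -> set /HOST/diag verbosity=max
- -> stop /SYS
- -> start /SYS (then watch POST output on the console with: start /HOST/console)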
Last 20 Reboots
First review the reboots during the time of the incident.
##### sysconfig/last-20-reboot (last reboot) #####
reboot system boot Tue Mar 6 01:13
reboot system down Tue Mar 6 00:29
reboot system boot Tue Mar 6 00:24
reboot system down Tue Mar 6 00:13
reboot system boot Mon Mar 5 21:40
reboot system down Mon Mar 5 15:01
Messages
Then review the explorer's messages for each reboot to determine when the fatal reset occurred (no preceding done or dump succeeded message).
##### messages/messages (/var/adm/messages) #####
Mar 5 15:00:52 osdldom50 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(2): Link ONLINE
Mar 5 21:40:04 osdldom50 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-01 64-bit
--------------------------------------
Jul 3 17:03:22 ctrstapp01 Corrupt label; wrong magic number
Jul 5 13:19:53 ctrstapp01 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_142900-02 64-bit
Console Output
For the moment, console output is only collected by a snapshot or ILOM command:
##### ilom/@persist@hostconsole.log (show /SP/console/history) #####
OpenBoot v. 4.33.6. @(#)OpenBoot 4.33.6.a
2012/03/29 11:22 May 24 11:31:10 t4-4-bur09-a reboot: rebooted by root
OpenBoot 4.33.6.a, 523776 MB memory available, Serial #97968252.
SunOS Release 5.10 Version Generic_147440-01 64-bit
May 24 11:33:09 t4-4-bur09-a scsi: WARNING: /pci@400/pci@1/pci@0/pci@0/LSI,sas@0/iport@v0/disk@w3461186e1b9925ce,0 (sd9):
SP Events
The ILOM event log can be obtained from an explorer (below), from a snapshot, or by ILOM commands:
##### ipmi/ipmitool_sel_elist.out (show /SP/logs/event) #####
1 | 06/29/2012 | 06:28:20 | System Boot Initiated | System Restart | Asserted
2 | 07/18/2012 | 05:24:44 | System Boot Initiated | System Restart | Asserted
1c | 07/28/2012 | 14:58:46 | Power Supply PS2/V_IN_ERR | State Asserted
1d | 07/28/2012 | 14:58:46 | Power Supply PS3/V_IN_ERR | State Asserted
In this case, 2 PSUs of a 4-PSU system indicate input voltage problems just prior to the server reset. These servers require 2 operational PSUs, so this is a good indication that a third PSU had input problems that couldn't be logged because the server went down due to total loss of power.
ILOM FMA ereports
The ILOM ereports contain voltage related problems like the power glitch below.
##### fma/@persist@faultdiags@ereports.log #####
2017-09-13/05:47:55 ereport.cpu.generic-sparc.pio-error@/SYS/PM0/CM0/CMP/IOS0
2017-09-13/05:48:12 ereport.chassis.power.glitch-fatal@/SYS/MB
REG_0x2d20010 = 0x4 (MB Standby)
REG_0x2d20011 = 0x0
2017-09-13/06:41:14 ereport.chassis.pok.fail-asserted@/SYS/MB /SYS/MB/SW_DC_POK_FLT
Notice in the following case, the SP rebooted at the same time that the host rebooted. This indicates that the server lost input power.
##### fma/@persist@faultdiags@ereports.log #####
2018-02-03/19:50:58 ereport.psu.input.ac-none-asserted@/SYS/PS3 /SYS/PS3/V_IN_ERR
2018-02-03/19:57:46 ereport.chassis.tli.ok@/SYS
2018-02-03/19:58:05 ereport.sp.boot-cold@/SYS/MB/SP
2018-02-03/19:58:32 ereport.chassis.sp.restart@/SYS/MB/SP
2018-02-26/12:45:51 ereport.chassis.tli.ok@/SYS
2018-02-26/12:46:10 ereport.sp.boot-cold@/SYS/MB/SP
SP Faults
This is only obtained via a snapshot or ILOM command, as follows:
##### fma/@usr@local@bin@fmadm_faulty.out (show faulty) #####
------------------- ------------------------------------ -------------- --------
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------
2012-06-21/17:03:09 8be7d7d8-f047-efb9-ba98-e4f69ac676cd SUN4V-8000-E2 Critical
Fault class : fault.memory.bank
FRU : /SYS/PM0/CMP0/BOB0/CH1/D0
(Part Number: 07014672)
(Serial Number: 00CE0211378799D517)
Description : A fault has been diagnosed by the Host Operating System.
2012-06-21/17:03:09 8be7d7d8-f047-efb9-ba98-e4f69ac676cd SUN4V-8000-E2 Critical
Fault class : fault.memory.bank
FRU : /SYS/PM0/CMP0/BOB1/CH1/D0
(Part Number: 07014672)
(Serial Number: 00CE0211378799D59C)
Description : A fault has been diagnosed by the Host Operating System.
ILOM snapshots typically contain verbose ereport data in files: fma/fmdump-eV.out, elogs/elogs-v.out, or elogs/@usr@local@bin@elogs_-eV.out. An explorer will also contain a list of existing or repaired faults in file: fma/fmdump.out.
SP Sensor Data
Sensor data can be obtained by the explorer (as below), by a snapshot, or by ILOM command:
##### ipmi/ipmitool_sdr_list_all_info.out (show -l all /) #####
200 |MB/C0_V_VCORE | .98 Volts | failed
OS Panic
OS Panics are software-detected problems and are caused when the operating system detects that the integrity of data is suspect or in danger of being corrupted. The panic routine will typically place a panic message into the messages file (captured by explorer) & console output (captured by snapshot) & create a core dump if properly configured in dumpadm. Panics can be caused either by operating system coding errors, which are typically fixed by patches, or by hardware related problems. Uncorrectable Hardware Errors are typically related to DIMM UE's, but problems with firmware are another possibility. PCI fabric panics are also typically associated with hardware problems, but driver issues must be checked. If the fabric panic is HBA related, then also collaborate with the storage group to determine if HBA firmware or SAN drivers are involved.
If the panic is software related, collect the core dump for analysis. If hardware related, then collect the following data so the problem can be isolated (a collection sketch follows the list below):
- Panic Message as found in the explorer messages file (but not always) or snapshot console output. This describes the type of panic. The reboot is typically preceded by a dump succeeded message.
- FMA Data collected by explorer or a snapshot which may isolate a failed DIMM or PCI path. The snapshot may contain the fatal ereport that doesn't make it to Solaris FMA.
- Prtdiag data collected by explorer for PCI related panics. Lists the PCI paths with associated card type & lists CPU faults. Use doc 1373995.1 to determine the associated Oracle PCI part number.
- SP FRU data. This describes the hardware configuration to assist with hardware replacement.
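A minimal Solaris-side collection sketch, assuming the default dump configuration, is:
- # dumpadm (confirm the savecore directory & that savecore is enabled)
- # ls -l /var/crash/`uname -n` (vmdump.N or unix.N/vmcore.N files to upload for software related panics)
- # fmdump -eV > fmdump-eV.out (verbose ereports around the panic time)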
If the cause of the panic/reboot cannot be quickly determined given the information above, it's important to perform hardware diagnostics such as a full power on self test (POST) or SunVTS to determine if the hardware is stable. If the isolation information is not helpful, then the field engineer may need to bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one. This is not typically needed, but a remote possibility.
Bug 6983432 (Repaired FMA Fault reports sent to ILOM after reboot) could affect ILOM based systems by faulting a component on the ILOM which failed in the past & was already repaired or replaced. In one case, it faulted a component that had been removed. The resolution is to install FMD patch 147790-01 & the workaround is to perform the Solaris command: fmadm flush "component", which removes the FMA repair records so they don't get resent on subsequent reboots.
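A sketch of the workaround, using a DIMM resource like the ones shown earlier as an example:
- # fmadm faulty (list the stale fault records & the affected resource/FRU)
- # fmadm flush /SYS/PM0/CMP0/BOB0/CH1/D0 (remove the repair records for that resource so they are not resent on reboot)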
LDoms
The admin must indicate which LDom panic'd. In some cases the LDom manager version will be needed:
##### sysconfig/ldm_-V.out #####
Logical Domains Manager (v 3.1.1.1.7)
##### sysconfig/virtinfo-a.out #####
Domain role: LDoms control I/O service root
Domain name: primary
Control domain: sopcspsc01-ldm05
Or
Domain role: LDoms guest I/O root
Domain name: ssccn2-dom1
Control domain: sopcspsc01-ldm05
The control domain should be checked to determine how system resources are configured:
##### sysconfig/ldm_list_-l.out #####
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 128 260864M 0.8% 0.8% 6d 22h 33m
Proc: 0 16 Cores: 0 to 15
pci@300/pci@1/pci@0/pci@6 /SYS/RCSA/PCIE1
pci@300/pci@1/pci@0/pci@c /SYS/RCSA/PCIE2
pci@340/pci@1/pci@0/pci@6 /SYS/RCSA/PCIE3
pci@340/pci@1/pci@0/pci@c /SYS/RCSA/PCIE4
ssccn2-dom1 active -n---- 5001 128 256G 0.4% 0.4% 2d 5h 6m
Proc: 1 16 Cores: 16 to 31
pci@380/pci@1/pci@0/pci@a /SYS/RCSA/PCIE9
pci@380/pci@1/pci@0/pci@4 /SYS/RCSA/PCIE10
pci@3c0/pci@1/pci@0/pci@e /SYS/RCSA/PCIE11
pci@3c0/pci@1/pci@0/pci@8 /SYS/RCSA/PCIE12
Messages
The following panics are typically related to hardware or firmware problems (but not always).
##### messages/messages (/var/adm/messages) #####
May 2 05:53:47 Sun04535 ^Mpanic[cpu53]/thread=2a102c49ca0:
May 2 05:53:47 Sun04535 unix: Fatal error has occurred in: PCIe fabric.(0x1)(0x43)
May 2 05:53:47 Sun04535 unix:
May 2 05:53:47 Sun04535 unix: 000002a102c496f0 px:px_err_panic+1ac (19db400, 7be43800, 43, 2a102c497a0, 1, 0)
...
May 2 06:06:26 nk11p04mm-mail04535 unix: dump succeeded
May 2 06:08:08 nk11p04mm-mail04535 unix: ^MSunOS Release 5.10 Version Generic_147440-12 64-bit
-------------------------
May 8 22:50:05 Sungams3 ^Mpanic[cpu64]/thread=3004519c680:
May 8 22:50:05 Sungams3 unix: [ID 400509 kern.notice] Unrecoverable hardware error
May 8 22:50:05 Sungams3 unix: [ID 100000 kern.notice]
May 8 22:50:05 Sungams3 genunix: [ID 723222 kern.notice] 000002a1076b8a90 unix:process_nonresumable_error+298 (2a1076b8c80, 0, 1, 40, 0, 0)
FMA Data
FMA data can typically isolate the hardware problem to a specific DIMM or PCI card.
##### fma/fmdump-eV (fmdump -eV) #####
The first fmdump-eV entry is from Feb 21 2012 22:36:02.
---- FIRST DATE ---- ---- LAST DATE ---- COUNT DEVICE
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 4978 /pci@400
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 2780 /pci@400/pci@1
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 121 /pci@400/pci@2
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 2495 /pci@400/pci@1/pci@0
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 2497 /pci@400/pci@1/pci@0/pci@8
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 2828 /pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0
##### fma/fmadm-faulty (fmadm faulty) #####
May 02 06:18:38 da17c43c-66da-c866-82f7-9f257b792011 SUNOS-8000-FU Major
Host : Sun04535
Platform : ORCL,SPARC-T3-2 Chassis_id :
Product_sn :
Fault class : defect.sunos.eft.undiag.fme
FRU : None faulty
Description : The diagnosis engine encountered telemetry for which it was unable to perform a diagnosis.
-------------------------
##### fma/fmdump-eV (fmdump -eV) #####
The first fmdump-eV entry is from Apr 13 2011 03:09:52.
---- FIRST DATE ---- ---- LAST DATE ---- COUNT DEVICE
Apr 13 2011 03:09:52 thru May 17 2011 15:06:18 64214 MB/CMP0/BR0/CH0
Apr 13 2011 03:13:01 thru May 17 2011 15:06:56 54833 MB/CMP0/BR0: CH0/D1/J0600
Apr 13 2011 03:13:13 thru May 17 2011 15:06:18 5426 MB/CMP0/BR0: CH0/D0/J0500
Apr 13 2011 04:07:22 thru May 16 2011 12:30:29 22 MB/CMP0/BR0: CH0/D0/J0500 CH1/
Apr 13 2011 04:07:22 thru May 17 2011 06:26:46 137 MB/CMP0/BR0
##### fma/fmadm-faulty (fmadm faulty) #####
Apr 04 09:57:49 21c312ab-4c48-ee92-c762-e2680ae35b74 FMD-8000-0W Minor
Host : fwgams3
Platform : SUNW,T5240 Chassis_id :
Fault class : defect.sunos.fmd.nosub
Description : The Solaris Fault Manager received an event from a component to which no automated diagnosis software is currently subscribed.
Apr 04 15:54:22 b3895ac1-e2ad-c58f-f189-f2bf8fb0db53 SUN4V-8002-42 Critical
Host : fwgams3
Platform : SUNW,T5240 Chassis_id :
Fault class : fault.memory.dimm-ue-imminent 95%
Affects : mem:///unum=MB/CMP0/BR0/CH0/D1/J0600 faulted but still in service
FRU : "MB/CMP0/BR0/CH0/D1/J0600" (hc://:serial=00AD01101110A1A65B:part=511-1151-01-Rev-05/motherboard=0/chip=0/branch=0/dram-channel=0/dimm=1) 95% faulty
Description : A pattern of correctable errors has been observed suggesting the potential exists that an uncorrectable error may occur.
Use the prtdiag data or FRU data to determine the part number of the component.
Prtdiag Data
##### sysconfig/prtdiag: #####
System Configuration: Oracle Corporation sun4v SPARC T3-2
Memory size: 130560 Megabytes
CPU ID Frequency Implementation Status
0 1649 MHz SPARC-T3 on-line
...
255 1649 MHz SPARC-T3 on-line
...
/SYS/MB/PCIE6 PCIE SUNW,qlc-pciex1077,2532 QLE2562 <--- To obtain Oracle part # see doc 1373995.1 or 1282491.1 for HBAs.
/pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0
/SYS/MB/PCIE6 PCIE SUNW,qlc-pciex1077,2532 QLE2562
/pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0,1
/SYS/MB/PCIE0 PCIE SUNW,qlc-pciex1077,2532 QLE2562
/pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0
/SYS/MB/PCIE0 PCIE SUNW,qlc-pciex1077,2532 QLE2562
/pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0,1
FRU Data
##### ipmi/ipmitool_fru.out (show -l all /) See doc 1411086.1 #####
Part Manufacturer Part # Ser #
Builtin FRU Dev Oracle Corporatio
P0/M0 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10EU
P0/M1 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10E6
P1/M0 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10WD
P1/M1 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10E4
/SYS Oracle Corporatio 1219BDYA39
MB/SP 5030 CELESTICA CO 542-0442 0111APO-1220S11
P0/M0/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231A40E05
P0/M0/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DF1
P0/M0/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231940DF3
P0/M0/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DFD
P0/M1/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231140DEF
P0/M1/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DF4
P0/M1/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231940DF0
P0/M1/B1/C1/D0 Hynix Semiconduct 07014642 00AD011202177317AA
P1/M0/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231340DF5
P1/M0/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231640DEB
P1/M0/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231240DF2
P1/M0/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231B40DF6
P1/M1/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231640DEE
P1/M1/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231440E04
P1/M1/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231B40DED
P1/M1/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231C40DF3
FB 9615 HON HAI PREC 541-3535 0226LHF-1213HH0055
SASBP 9615 HON HAI PREC 511-1246 0226LHF-1210A90FFT
PS0 10465 ASTEC INTER 300-2344 476856F+1212CD00UR
PS1 10465 ASTEC INTER 300-2344 476856F+1213CD002B
Hang
A Hang is when some applications or OS functions may operate properly while others appear dead. The hardware and operating system do not detect a problem unless the SP watchdog detects it. Hangs are caused by resource deadlocks due to operating system race conditions, or by resource deprivation due to one or more applications that are too needy. Sometimes console messages may indicate the source of the hang, but typically a live core should be forced so that Sun's kernel group can analyze the data. There is a small possibility that hangs can be caused by hardware, so please check for hardware problems then contact the kernel group for isolation prior to transferring the SR. They may want output from the GUDs tool to determine OS resource statistics. Some Solaris documents that discuss procedures and configuration to isolate Solaris panics and hangs are as follows:
- DocID: 1004530.1 KERNEL: How to enable deadman kernel code
- DocID: 1012913.1 Troubleshooting Panics, dumps, hangs or crashes in the Solaris Operating System
- DocID: 1001950.1 Troubleshooting Suspected Solaris Operating System Hangs
- DocID: 1004506.1 How to force a crash when my machine is hung
- DocID: 1001950.1 When to Force a Solaris System Core File
When SPARC CMT host systems hang because of a software, firmware, or hardware bug, obtaining information is difficult as one is often unable to break into the system with any of the above procedure docs. SPARC systems do have the ability to initiate a break via the Service Processor (SP) by setting the property send_break_action under /HOST (in the procedure docs above), but this relies on all software being functional and responsive at the time of the failure, which is often not the case when certain situations arise. Thus, in many cases when this situation occurs, diagnosability is not possible or is difficult, since no/limited information can be obtained from the host, especially from the customer perspective/front line service organization.
With the exception of the T1000/T2000, all legacy Sun4u CPUs and Sun4v CPUs supported an eXternally Initiated Reset (XIR) from the restricted shell. XIR is a specific pin on the CPU that, when toggled, generates a SPARC h/w trap (Trap 3) which changes the flow of execution by entering Reset Error Debug (RED) mode. XIR can be utilized to gather information, generate a crash dump, drop into a debugger, or simply initiate a reset of the host. 1530801.1 indicates that "xir dumpcore" starts the reset & dumps a corefile.
It is recommended that the server have the keyswitch in the lock position & that the customer not cause a host NMI so that it doesn't place the system at the OBP prompt.
- -> set /HOST keyswitch_state=Locked
- -> set /HOST/generate_host_nmi=true (avoid since sends NMI to host)
If a corefile was not collected, the user will want to configure the ILOM to cause a reset & dump a corefile for future hangs per ILOM command:
- -> set /HOST autorestart=dumpcore
The data needed to attempt isolation of hardware related hangs is similar to Fatal Resets so mainly SP data is required:
- LAST Output: From the explorer contains the times of reboots.
- Messages do not contain useful information, just a boot following an unrelated message.
- Console Output - This typically contains a reason for the reset for example critical component failure (or nothing for total power loss). An ILOM Snapshot contains the SP info! Output from ILOM command "show /HOST/console/history" will also provide it.
- SP events - This could contain sensor related events like under voltage conditions on one rail or OEM specific events like 0x12's or an ILOM reboot if power loss.
- SP sensor data - This contains information if a sensor has a consistent problem like a voltage regulator or fan failure.
- FMA Fault data - Check for a history of DIMM or disk problems.
- SP field replaceable unit (FRU) data - This describes the hardware inventory configuration to assist with hardware replacement. Collect this to determine if the system has the proper configuration (e.g. a partially installed memory bank). A good item to check is the system board page in the Sun System Handbook.
- Are boot disks internal or external?
The LDom section above should be reviewed to isolate the LDom which hung.
LAST
Determine when the admin forced a reboot to stop the hang.
##### sysconfig/last-20-reboot.out #####
reboot system boot Wed Jul 4 18:21
reboot system down Wed Jul 4 17:06
Messages
Check whether the message prior to the reboot is unrelated to the reboot, or whether the user caused a panic via the ILOM break.
##### messages/messages (/var/adm/messages) #####
Jul 4 04:53:09 prod-db iscsi: [ID 632887 kern.warning] WARNING: iscsi connection(19) login failed - authentication failed with target
Jul 4 18:21:27 prod-db genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_142900-15 64-bit
Use the data below to look for possible hardware problems:
SP Events
Notice that the event log has an indication of system shutdown via the power button. Another example shows that the SP watchdog triggered a reboot.
##### ipmi/ipmitool_sel_elist.out #####
Jul 04 18:10:56: Chassis |major : "System shutdown has been requested via power button."
Jul 04 18:11:00: Chassis |major : "System power off has been requested via power button."
Jul 04 18:11:01: Chassis |critical : "Host has been powered off"
Jul 04 18:11:15: Chassis |major : "System power on has been requested via power button."
Jul 04 18:11:16: Chassis |major : "Host has been powered on"
Jul 04 18:15:13: Chassis |major : "Host is running"
-------------------------
Did an admin recently log into the ILOM prior to the reset to do "cd /HOST; set send_break_action=break" during this time?
Oct 20 16:47:15 tesp07328 SC Alert: [ID 113266 daemon.notice] Audit | minor: admin : Open Session : object = "/SP/session/type" : value = "shell" : success
Oct 20 16:48:13 tesp07328 SC Alert: [ID 354481 daemon.notice] Audit | minor: admin : Close Session : object = "/SP/session/type" : value = "shell" : success
Oct 20 16:48:19 tesp07328 SC Alert: [ID 113266 daemon.notice] Audit | minor: admin : Open Session : object = "/SP/session/type" : value = "shell" : success
-------------------------
Dec 29 06:58:37: Chassis |critical: "SP Request to Reset Host due to Watchdog"
Dec 29 06:58:37: Chassis |major : "Host is running"
Jan 03 20:24:36: Chassis |critical: "SP Request to Reset Host due to Watchdog"
Jan 03 20:24:36: Chassis |major : "Host is running"
VBSC data from the ILOM snapshot can also be useful:
##### ilom/@persist@vbsc@vbsc.log #####
DEBUG: check_poweron_button: pwr_status = ?
NOTICE: System shutdown has been requested.
NOTICE: System power off has been requested via power button.
FMA Data
Check FMA data for a history of DIMM or disk problems.
##### fma/fmdump-eV: #####
The first fmdump-eV entry is from Sep 14 2010 19:44:23.
---- FIRST DATE ---- ---- LAST DATE ---- COUNT DEVICE
Sep 14 2010 19:44:23 thru May 17 2012 23:19:23 842 MB/CMP0/BR0: CH0/D0/J1001
Sep 20 2010 06:03:28 thru May 18 2012 00:32:03 493 MB/CMP0/BR1: CH1/D0/J1601
##### fma/fmadm-faulty: #####
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 26 22:07:23 3c4922d5-5c1c-e4f9-e4c2-c86ac566231d SUN4V-8002-42 Critical
Fault class : fault.memory.dimm-ue-imminent 95%
Affects : mem:///unum=MB/CMP0/BR1:CH1/D0/J1601 faulted but still in service
Serial ID. : 0
Description : A pattern of correctable errors has been observed suggesting the potential exists that an uncorrectable error may occur.
FRU Data
Check FRU data for faulted components or uncertified DIMMs.
##### ipmi/ipmitool_fru.out (show -l all /) See doc 1411086.1 #####
Part Manufacturer Part # Ser #
Builtin FRU Dev Oracle Corporatio
P0/M0 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10EU
P0/M1 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10E6
P1/M0 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10WD
P1/M1 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10E4
/SYS Oracle Corporatio 1219BDYA39
MB/SP 5030 CELESTICA CO 542-0442 0111APO-1220S11
P0/M0/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231A40E05
P0/M0/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DF1
P0/M0/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231940DF3
P0/M0/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DFD
P0/M1/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231140DEF
P0/M1/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DF4
P0/M1/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231940DF0
P0/M1/B1/C1/D0 Hynix Semiconduct 07014642 00AD011202177317AA
P1/M0/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231340DF5
P1/M0/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231640DEB
P1/M0/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231240DF2
P1/M0/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231B40DF6
P1/M1/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231640DEE
P1/M1/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231440E04
P1/M1/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231B40DED
P1/M1/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231C40DF3
FB 9615 HON HAI PREC 541-3535 0226LHF-1213HH0055
SASBP 9615 HON HAI PREC 511-1246 0226LHF-1210A90FFT
PS0 10465 ASTEC INTER 300-2344 476856F+1212CD00UR
PS1 10465 ASTEC INTER 300-2344 476856F+1213CD002B
Boot Failure
If the system cannot boot, clearly we should not ask for an explorer since only Service Processor data is obtainable.
First determine if power is present. If so, this could be an indication of disk failure or failure of another critical component.
These systems use the ILOM, and sometimes a memory leak can make the ILOM partially operational. Please remove system power to reboot the ILOM which rules out memory leak problems.
If the ILOM is dead/fails to boot, either power or ILOM hardware problems exist so power LEDs must be checked first. The ILOM is powered by the 3.3V Standby Voltage from the PSU/PDB via a ribbon cable to the system board. Have a field engineer remove either PSU to isolate a problem with it, then replace ribbon cables, PDB & then system board if not PSU related. Always review the server's wiring diagram which is linked to the Sun System Handbook's Full Component List page.
If the ILOM is operational, the data we should obtain is:
- Data from the ILOM as indicated in the overview,
- Did the admin perform any OS / firmware / hardware upgrades/changes prior to the problem? Please note that it's best to power cycle a failing system prior to a firmware upgrade in case ILOM resources were lost or in a corrupt state!
- States of the LEDs
If the system fails to get to the OK prompt, analyze the SP data requested above to isolate a problem. The TSE should check the front & rear system views in the Sun System Handbook to determine which LEDs exist on that platform & request their status. The sensor data is most important since it contains voltage related info. The FRU & fault related data should then be checked for component failure. Have the customer provide console output just after power on to view POST output. If the data is not helpful, then a field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one. The following is an indication of a possible counterfeit DIMM installed that also failed POST, but also look for system board regulator problems:
##### ipmi/@usr@local@bin@ipmiint_fru_print.out #####
Jul 11 20:10:39: Chassis |minor : "DIMMS at MB/CMP0/BR1/CH1/D1 and MB/CMP1/BR0/CH0/D1 have duplicate serial and part-dash-rev numbers"
Jul 11 20:10:41: Fault |critical: "SP detected fault at time Wed Jul 11 20:10:41 2012. /SYS/MB/CMP0/BR1/CH1/D1 Forced Fail (POST)"
If the system is able to get to the OK prompt, then the boot problem is most likely disk/controller related, ZFS related, or ILOM related. Have the admin attempt to boot the system via DVD (check for the minimal supported OS for the platform!!!) or via the network. If the system boots from an external boot source, then either the boot disk/controller has failed or a ZFS filesystem may need repair. Determine if something was done to change the boot parameters, or if the boot image is corrupt. Please attempt a "reset-all" prior to subsequent boot or probe-scsi-all commands to see if that resolves the problem. Also, if ZFS is in use, boot from CDROM & perform a bootadm update after the rpool is imported & mounted (a minimal repair sequence is sketched after the list below). If partial disk information is accessed prior to the failure, or the issue is ZFS related, then open an SR for Solaris OS assistance. Output from the following commands may be helpful:
- ok> printenv :Determine boot device (internal or external!!!). If external & an Oracle storage array, then open an SR into the Storage group
- ok> probe-scsi-all :Determine if boot device seen
- ok> devalias :Lists aliases for device paths
- ok> show-disks :Lists disks & has options to add to dev alias
- ok> boot -aV -s :Boot in ask me mode (respond with default settings) with the verbose option. Determine if boot device correct (such as internal RAID controller path).
- ok> boot -m verbose -s or -m debug :(Solaris 10 & up) Boot which shows the services starting (see doc 1006328.1)
- ok> boot -F failsafe :(Solaris 10 U6 & up) Contact the OS group for usage (see doc 1340586.1)
- Boot from cdrom or from net, then do format; select a disk; analyze; read; which will test if the disk is properly accessed by the controller.
- Obtain an ILOM snapshot to rule out ILOM problems.
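For the ZFS boot-archive repair mentioned above, a minimal sketch follows. It assumes a bootable Solaris media, a root pool named rpool, & a boot environment name (s10_be) that is purely illustrative - use zfs list to find the real one:
- ok> boot cdrom -s (single-user boot from supported media)
- # zpool import -f -R /a rpool (import the root pool under the temporary mount point /a)
- # zfs mount rpool/ROOT/s10_be (mount the boot environment)
- # bootadm update-archive -R /a (rebuild the boot archive)
- # zfs umount rpool/ROOT/s10_be; zpool export rpool; init 6 (clean up & reboot from disk)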
Attachments
This solution has no attachment