Asset ID: 1-79-1470580.1
Update Date: 2018-03-06
Keywords:
Solution Type: Predictive Self-Healing Sure Solution
1470580.1: Troubleshooting data needed for T3-x, T4-x, T5-x, T7-x, S7-x, T8-x servers
Related Items:
- SPARC T3-1
- SPARC T3-4
- SPARC S7-2
- SPARC T3-1B
- SPARC T7-1
- Netra SPARC T4-2 Server
- Netra T3-1BA
- SPARC T4-2
- SPARC T8-1
- Netra SPARC S7-2
- SPARC T5-8
- Netra SPARC T4-1 Server
- SPARC T8-2
- Netra SPARC T4-1B
- SPARC T7-2
- SPARC T5-2
- SPARC T5-4
- SPARC T8-4
- SPARC T4-1
- SPARC T3-2
- SPARC T4-1B
- SPARC T5-1B
- SPARC T4-4
- SPARC S7-2L
- Netra T3-1
Related Categories:
- PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4
Applies to:
SPARC T5-2 - Version All Versions and later
Netra SPARC T4-2 Server - Version All Versions and later
SPARC T8-1 - Version All Versions and later
SPARC T5-4 - Version All Versions and later
Netra SPARC S7-2 - Version All Versions and later
Information in this document applies to any platform.
Purpose
This document provides a high-level guide for hardware specialists covering what data to collect & how to analyze major server problems. A similar document written from the software perspective is doc 1012913.1.
Scope
Details
If a T7-x or S7-x server has a fatal reset due to a power glitch (Voltage Regulator failure), then please provide both a Snapshot & an Explorer!
Please upload the required data to the SR unless other arrangements have been made with the SR owner.
Reasons for SR Creation & data to gather
A system becomes unresponsive for one of four reasons: an admin reboot, a fatal reset, an OS panic, or a hang. Each is covered in its own section below.
In most cases, we recommend that both explorers & ILOM snapshots be collected if possible. The absolutely required information relates to how the system crashed. If the server is operational, see doc 1010911.1: "What to send to Oracle after a system panic and/or unexpected reboot" for the necessary data to gather. That document lists the basic questions that must be asked to determine what data to gather. Please note that an ILOM Snapshot will contain the console output & most other SP related data for the T3 & newer servers.
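For reference, an ILOM snapshot can also be started from the ILOM CLI. The following is a sketch only; the dataset value and the transfer URI (the user, host & directory shown are placeholders) depend on the ILOM version & site setup:
- -> set /SP/diag/snapshot dataset=normal (the normal dataset is sufficient for most cases)
- -> set /SP/diag/snapshot dump_uri=sftp://user:password@host/directory (starts the collection & transfer)
- -> show /SP/diag/snapshot (check the result property until the snapshot completes)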
An SR can also be opened for a boot failure or for non-fatal hardware failures where the system continues to operate because some components are redundant or because performance is merely reduced when one fails:
- Boot failure (The Ultimate Fatal Reset)
- Redundant Components: Fans & PSUs
- Performance Limited: DIMMs
LDOM based problems can temporarily be worked around by booting the system with the Trusted Platform Module (TPM) state purged (LDOM Manager) on the next boot (sketched below):
- -> set /HOST/tpm forceclear=true
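A minimal sketch of the full sequence, assuming the host can be stopped so the setting takes effect on the next boot:
- -> stop /SYS (gracefully stop the host; confirm when prompted)
- -> set /HOST/tpm forceclear=true (the TPM state is purged on the next boot)
- -> start /SYS (boot the host)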
An Explorer should be gathered if the system is bootable. Features & problems with versions of Explorer are as follows:
- 6.10 is the minimum version for these platforms since it collects ILOM data via ipmitool to gather FRU data. System FW 8.2.2.c will also collect DIMM vendor part numbers.
- During ipmi & ilom collection via the net, the ILOM's IP address & root password must be entered twice.
- 7.03 can collect ILOM information via the vbsc with no prompts to the user. See doc 1518044.1.
- 7.03 has added collection timeouts, so some information may not be gathered. See Explorer Timeout FAQs in doc 1287574.1
- 8.00 may take 5 minutes to install on an S11 based system.
- 8.00 collects LDOM information by default
- 8.00 extended collection timeouts for ipmitool data.
- 8.00 looks for ipmitool first in /opt/ipmitool/sbin. Earlier versions must have the ipmitool executable copied to /opt/ipmitool/bin!
- 8.03 removed the timeouts introduced in 7.03. Please run the ptree command & see Explorer FAQ doc 1287574.1 for data collection & timeout configuration information.
- See doc 1612918.1 to determine why Explorer collection runs too long
- 8.11 Resolves lack of FMA data collection introduced in 8.10
- 8.13 Allows usage on Linux systems if called "explorer -w default,sundiag"
Explorers should be run as follows to gather the proper data:
- # explorer -w ipmi,ipmiextended,ilomextended,ldom,default
- # explorer -w sundiag,default :For Linux. Also note when extracting the explorer that the directories sosreport & sundiag must be unpacked on a Linux system with the command "tar -xif". (A typical run is sketched after this list.)
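A typical collection run looks roughly like the following sketch; the output directory assumes a default Explorer installation:
- # explorer -w ipmi,ipmiextended,ilomextended,ldom,default (prompts for the ILOM IP & root password unless the internal interconnect is used)
- # ls /opt/SUNWexplo/output/ (the resulting explorer*.tar.gz is the file to upload to the SR)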
ILOM data should typically be gathered as follows. Please note that these servers allow a sideband connection to the network if the NetMgmt port cannot be used.
- Snapshot - Contains console history, FRU config, event logs, sensor information
- ILOMs typically require the Net Management / Sideband port configured, but can use the internal ILOM Interconnect or LDC ports if Oracle Hardware Management Pack (OHMP) is installed (1518044.1)
- The Data Set: "Collect Only Log Files" should ONLY be used if the snapshot hangs the ILOM. Dataset: "FRUID" will collect showpsnc
- Have the customer perform "set /HOST/console timestamp=yes" to place the ILOM time in the console log.
- or ILOM Data: "show -l all /", "show /SP/console/history", "show /SP/logs/event/list", "show faulty". See doc 1619420.1 for examples & helpful commands. (see ShowOff summary tool doc 1454001.1)
- or ipmitool Data (requires the Net Management port to be configured unless the server's internal ILOM Interconnect is used per doc 1518044.1). A small collection loop is sketched after this list.
- ipmitool -V
- ipmitool -I lanplus -H "SP ipaddress" -U root fru
- ipmitool -I lanplus -H "SP ipaddress" -U root sel elist
- ipmitool -I lanplus -H "SP ipaddress" -U root -v sdr
- ipmitool -I lanplus -H "SP ipaddress" -U root sdr elist
- ipmitool -I lanplus -H "SP ipaddress" -U root sdr list
- ipmitool -I lanplus -H "SP ipaddress" -U root chassis status
- ipmitool -I lanplus -H "SP ipaddress" -U root sunoem led get (requires OHMP or ipmitool versions per doc 1516567.1)
- ipmitool -I lanplus -H "SP ipaddress" -U root sensor
- Please note that the ILOM Interconnect is the preferred method when OHMP is installed!
- ipmitool -I lanplus -H 169.254.182.76 -U root sunoem led get
- Please note that the LDC interface can also be used.
- ipmitool -I bmc -U root fru (on SPARC systems)
- ipmitool -I open -U root fru (on Linux systems)
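If the data must be gathered by hand over the network, a small shell loop along these lines saves each output to its own file for upload. This is a sketch only; 192.0.2.10 is a placeholder SP address & the -a flag prompts for the SP root password:
#!/bin/sh
# Sketch: capture the ipmitool outputs listed above into separate files for the SR.
SP=192.0.2.10   # placeholder - replace with the real SP address
for CMD in fru "sel elist" "sdr elist" "sdr list" "chassis status" sensor
do
    OUT="ipmitool_`echo $CMD | tr ' ' '_'`.out"
    ipmitool -I lanplus -H "$SP" -U root -a $CMD > "$OUT"
done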
If the ILOM network is unresponsive, then reset the ILOM via the serial port, or power cycle the system.
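For example, from a serial console session on the SP (a sketch; the command prompts for confirmation & does not affect the running host):
- -> reset /SP (reboots the service processor only)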
Admin Reboot
Sometimes the admin accidentally or purposely reboots a server via Solaris commands (like init 0), via the SP's break, reset, or poweroff commands, or by removing power. Please note that the admin may do any of these to stop a hang condition instead of the recommended method below, which attempts to generate a core file. This is easy to detect (see the quick checks sketched after this list) by:
- Check messages for signal 15 prior to the reboot, which indicates the admin performed an "init 6" or some other method to reboot the host,
- Check SP events to determine if the admin reset the host via the SP or power button,
- Check for power events followed by a reboot.
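A quick set of checks along these lines (a sketch, assuming the default messages file location) covers all three:
- # last reboot | head -20 (times of recent boots & shutdowns)
- # grep "signal 15" /var/adm/messages (daemons exiting on signal 15 indicate an admin shutdown or reboot)
- # ipmitool -I lanplus -H "SP ipaddress" -U root sel elist | tail -20 (SP events around the time of the reboot)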
Last 20 Reboots
First review the reboots during the time of the incident as shown in the explorer.
##### sysconfig/last-20-reboot (last reboot) #####
reboot system boot Tue Feb 20 04:39
reboot system down Tue Feb 20 04:37
reboot system boot Thu Feb 8 16:24
reboot system down Thu Feb 8 16:20
reboot system boot Mon Feb 4 04:39
reboot system down Mon Feb 4 04:39
Then check the messages file for signs of signal 15 to determine if the admin did an init 6.
##### messages/messages (/var/adm/messages) #####
Feb 20 04:37:50 kcgams7 xntpd[337]: [ID 866926 daemon.notice] xntpd exiting on signal 15
Feb 20 04:37:50 kcgams7 syslogd: going down on signal 15
...
Feb 20 04:39:23 kcgams7 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_144488-17 64-bit
Also check the ILOM for reasons for a possible restart. Log entry 5 is a result of an ILOM /SYS stop & /SYS start. Log entries c & d prior to a reboot are a possible indication that power was lost, so determine if the SP was reset.
##### ilom/10.133.109.209/ipmitool_sel_elist.out (show /SP/logs/event/list or ipmitool -H "SP IP" -U root sel elist) #####
5 | 02/08/2012 | 21:22:29 | System Boot Initiated | System Restart | Asserted
c | 01/14/2000 | 09:06:19 | Voltage PS1/V_OUT | Lower Non-critical going low | Reading 0.80 < Threshold 10 Volts
d | 01/14/2000 | 09:06:19 | Power Supply PS1/PWROK | State Deasserted
Please remember, if power is removed from & restored to the system, the ILOM messages file (from the snapshot) is restarted & the restart message indicates the time of the restart.
##### spos_logs/@var@log@messages #####
Jul 15 15:36:05 netra-t4-2-sca11-a-sp syslogd 1.5.0: restart.
Fatal Reset
Fatal Resets are hardware-detected problems and are caused when the central processing unit (CPU) takes a trap that immediately drops the system to the OBP or worse. No Solaris messages are logged since the disk image does not get updated following these events, to prevent corruption. The system may or may not be operational following a fatal reset. One reason for a fatal reset is a watchdog reset, which is caused when the operating system fails to access the watchdog circuitry within its timeout period. This is really an operating system hang detected by the watchdog timer, so see the hang section below for techniques to diagnose it. Other reasons for fatal resets are hardware failures like loss of input voltage (see 1558027.1), or other major hardware related issues. No core file is saved and the messages file shows normal operation followed by an abrupt system restart (no done or dump succeeded message). The most important diagnosis information to retrieve is the following, which is mostly gained through the service processor (SP), the ILOM, so a snapshot should be gathered!
- LAST Output: From the explorer contains the times of reboots.
- Messages do not contain useful information, just a boot following an unrelated message.
- Console Output - This typically contains a reason for the reset for example critical component failure (or nothing for total power loss). An ILOM Snapshot contains the SP info! Output from ILOM command "show /HOST/console/history" will provide it.
- SP Faults
- SP events - This could contain sensor related events like under voltage conditions on one rail or OEM specific events like 0x12's or an ILOM reboot if power loss.
- SP ereports
- SP sensor data - This contains information if a sensor has a consistent problem like a voltage regulator or fan failure.
- SP field replaceable unit (FRU) data - This describes the hardware inventory configuration to assist with hardware replacement. Collect this to determine if the system has the proper configuration (e.g. a partially installed memory bank). A good item to check is the system board page in the Sun System Handbook.
- Is the system operational? If not, see Boot Failure after data analysis.
If the cause of the reboot or crash cannot be quickly determined given the information above, it's important to perform hardware diagnostics such as a full power on self test (POST) or SunVTS to determine if the hardware is stable. Typically the field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one.
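On these ILOM-based servers, maximum POST can typically be requested from the SP before the next power-on. The following is a sketch; property names & values can vary slightly between ILOM/firmware versions:
- -> set /HOST/diag mode=normal
- -> set /HOST/diag level=max
- -> set /HOST/diag verbosity=max
- -> stop /SYS
- -> start /SYS (then watch POST output on the console with: start /HOST/console)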
Last 20 Reboots
First review the reboots during the time of the incident.
##### sysconfig/last-20-reboot (last reboot) #####
reboot system boot Tue Mar 6 01:13
reboot system down Tue Mar 6 00:29
reboot system boot Tue Mar 6 00:24
reboot system down Tue Mar 6 00:13
reboot system boot Mon Mar 5 21:40
reboot system down Mon Mar 5 15:01
Messages
Then review the explorer's messages for each reboot to determine when the fatal reset occurred (no preceding done or dump succeeded message).
##### messages/messages (/var/adm/messages) #####
Mar 5 15:00:52 osdldom50 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(2): Link ONLINE
Mar 5 21:40:04 osdldom50 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-01 64-bit
--------------------------------------
Jul 3 17:03:22 ctrstapp01 Corrupt label; wrong magic number
Jul 5 13:19:53 ctrstapp01 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_142900-02 64-bit
Console Output
For the moment, console output is only collected by a snapshot or ILOM command:
##### ilom/@persist@hostconsole.log (show /SP/console/history) #####
OpenBoot v. 4.33.6. @(#)OpenBoot 4.33.6.a
2012/03/29 11:22 May 24 11:31:10 t4-4-bur09-a reboot: rebooted by root
OpenBoot 4.33.6.a, 523776 MB memory available, Serial #97968252.
SunOS Release 5.10 Version Generic_147440-01 64-bit
May 24 11:33:09 t4-4-bur09-a scsi: WARNING: /pci@400/pci@1/pci@0/pci@0/LSI,sas@0/iport@v0/disk@w3461186e1b9925ce,0 (sd9):
SP Events
The ILOM event log can be obtained from an explorer (below), from a snapshot, or by ILOM commands:
##### ipmi/ipmitool_sel_elist.out (show /SP/logs/event) #####
1 | 06/29/2012 | 06:28:20 | System Boot Initiated | System Restart | Asserted
2 | 07/18/2012 | 05:24:44 | System Boot Initiated | System Restart | Asserted
1c | 07/28/2012 | 14:58:46 | Power Supply PS2/V_IN_ERR | State Asserted
1d | 07/28/2012 | 14:58:46 | Power Supply PS3/V_IN_ERR | State Asserted
In this case, 2 PSUs of a 4-PSU system indicate input voltage problems just prior to the server reset. These servers require 2 operational PSUs, so this is a good indication that a third PSU had input problems that couldn't be logged because the server went down due to total loss of power.
ILOM FMA ereports
The ILOM ereports contain voltage related problems like the power glitch below.
##### fma/@persist@faultdiags@ereports.log #####
2017-09-13/05:47:55 ereport.cpu.generic-sparc.pio-error@/SYS/PM0/CM0/CMP/IOS0
2017-09-13/05:48:12 ereport.chassis.power.glitch-fatal@/SYS/MB
REG_0x2d20010 = 0x4 (MB Standby)
REG_0x2d20011 = 0x0
2017-09-13/06:41:14 ereport.chassis.pok.fail-asserted@/SYS/MB /SYS/MB/SW_DC_POK_FLT
Notice in the following case, the SP rebooted at the same time that the host rebooted. This indicates that the server lost input power.
##### fma/@persist@faultdiags@ereports.log #####
2018-02-03/19:50:58 ereport.psu.input.ac-none-asserted@/SYS/PS3 /SYS/PS3/V_IN_ERR
2018-02-03/19:57:46 ereport.chassis.tli.ok@/SYS
2018-02-03/19:58:05 ereport.sp.boot-cold@/SYS/MB/SP
2018-02-03/19:58:32 ereport.chassis.sp.restart@/SYS/MB/SP
2018-02-26/12:45:51 ereport.chassis.tli.ok@/SYS
2018-02-26/12:46:10 ereport.sp.boot-cold@/SYS/MB/SP
SP Faults
This is only obtained via a snapshot or ILOM command, as follows:
##### fma/@usr@local@bin@fmadm_faulty.out (show faulty) #####
------------------- ------------------------------------ -------------- --------
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------
2012-06-21/17:03:09 8be7d7d8-f047-efb9-ba98-e4f69ac676cd SUN4V-8000-E2 Critical
Fault class : fault.memory.bank
FRU : /SYS/PM0/CMP0/BOB0/CH1/D0
(Part Number: 07014672)
(Serial Number: 00CE0211378799D517)
Description : A fault has been diagnosed by the Host Operating System.
2012-06-21/17:03:09 8be7d7d8-f047-efb9-ba98-e4f69ac676cd SUN4V-8000-E2 Critical
Fault class : fault.memory.bank
FRU : /SYS/PM0/CMP0/BOB1/CH1/D0
(Part Number: 07014672)
(Serial Number: 00CE0211378799D59C)
Description : A fault has been diagnosed by the Host Operating System.
ILOM snapshots typically contain verbose ereport data in files: fma/fmdump-eV.out, elogs/elogs-v.out, or elogs/@usr@local@bin@elogs_-eV.out. An explorer will also contain a list of existing or repaired faults in file: fma/fmdump.out.
SP Sensor Data
Sensor data can be obtained by the explorer (as below), by a snapshot, or by ILOM command:
##### ipmi/ipmitool_sdr_list_all_info.out (show -l all /) #####
200 |MB/C0_V_VCORE | .98 Volts | failed
OS Panic
OS Panics are software-detected problems and are caused when the operating system detects that the integrity of data is suspect or in danger of being corrupted. The panic routine will typically place a panic message into the messages file (captured by explorer) & console output (captured by snapshot) & create a core dump if properly configured in dumpadm. Panics can be caused either by operating system coding errors, which are typically fixed by patches, or by hardware related problems. Uncorrectable Hardware Errors are typically related to DIMM UE's, but problems with firmware are another possibility. PCI fabric panics are also typically associated with hardware problems, but driver issues must be checked. If the fabric panic is HBA related, then also collaborate with the storage group to determine if HBA firmware or SAN drivers are involved.
If the panic is software related, collect the core dump for analysis. If hardware related, then collect the following data so the problem can be isolated (a collection sketch follows the list below):
- Panic Message as found in the explorer messages file (but not always) or snapshot console output. This describes the type of panic. The reboot is typically preceded by a dump succeeded message.
- FMA Data collected by explorer or a snapshot which may isolate a failed DIMM or PCI path. The snapshot may contain the fatal ereport that doesn't make it to Solaris FMA.
- Prtdiag data collected by explorer for PCI related panics. Lists the PCI paths with associated card type & lists CPU faults. Use doc 1373995.1 to determine the associated Oracle PCI part number.
- SP FRU data. This describes the hardware configuration to assist with hardware replacement.
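A minimal Solaris-side collection sketch, assuming the default dump configuration, is:
- # dumpadm (confirm the savecore directory & that savecore is enabled)
- # ls -l /var/crash/`uname -n` (vmdump.N or unix.N/vmcore.N files to upload for software related panics)
- # fmdump -eV > fmdump-eV.out (verbose ereports around the panic time)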
If the cause of the panic/reboot cannot be quickly determined given the information above, it's important to perform hardware diagnostics such as a full power on self test (POST) or SunVTS to determine if the hardware is stable. If the isolation information is not helpful, then the field engineer may need to bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one. This is not typically needed, but a remote possibility.
Bug 6983432 (Repaired FMA Fault reports sent to ILOM after reboot) could affect ILOM based systems by faulting a component on the ILOM which failed in the past & was already repaired or replaced. In one case, it faulted a component that had been removed. The resolution is to install FMD patch 147790-01 & the workaround is to perform the Solaris command: fmadm flush "component", which removes the FMA repair records so they don't get resent on subsequent reboots.
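A sketch of the workaround, using a DIMM resource like the ones shown earlier as an example:
- # fmadm faulty (list the stale fault records & the affected resource/FRU)
- # fmadm flush /SYS/PM0/CMP0/BOB0/CH1/D0 (remove the repair records for that resource so they are not resent on reboot)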
LDoms
The admin must indicate which LDom panic'd. In some cases the LDom manager version will be needed:
##### sysconfig/ldm_-V.out #####
Logical Domains Manager (v 3.1.1.1.7)
##### sysconfig/virtinfo-a.out #####
Domain role: LDoms control I/O service root
Domain name: primary
Control domain: sopcspsc01-ldm05
Or
Domain role: LDoms guest I/O root
Domain name: ssccn2-dom1
Control domain: sopcspsc01-ldm05
The control domain should be checked to determine how system resources are configured:
##### sysconfig/ldm_list_-l.out #####
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 128 260864M 0.8% 0.8% 6d 22h 33m
Proc: 0 16 Cores: 0 to 15
pci@300/pci@1/pci@0/pci@6 /SYS/RCSA/PCIE1
pci@300/pci@1/pci@0/pci@c /SYS/RCSA/PCIE2
pci@340/pci@1/pci@0/pci@6 /SYS/RCSA/PCIE3
pci@340/pci@1/pci@0/pci@c /SYS/RCSA/PCIE4
ssccn2-dom1 active -n---- 5001 128 256G 0.4% 0.4% 2d 5h 6m
Proc: 1 16 Cores: 16 to 31
pci@380/pci@1/pci@0/pci@a /SYS/RCSA/PCIE9
pci@380/pci@1/pci@0/pci@4 /SYS/RCSA/PCIE10
pci@3c0/pci@1/pci@0/pci@e /SYS/RCSA/PCIE11
pci@3c0/pci@1/pci@0/pci@8 /SYS/RCSA/PCIE12
Messages
The following panics are typically related to hardware or firmware problems (but not always).
##### messages/messages (/var/adm/messages) #####
May 2 05:53:47 Sun04535 ^Mpanic[cpu53]/thread=2a102c49ca0:
May 2 05:53:47 Sun04535 unix: Fatal error has occurred in: PCIe fabric.(0x1)(0x43)
May 2 05:53:47 Sun04535 unix:
May 2 05:53:47 Sun04535 unix: 000002a102c496f0 px:px_err_panic+1ac (19db400, 7be43800, 43, 2a102c497a0, 1, 0)
...
May 2 06:06:26 nk11p04mm-mail04535 unix: dump succeeded
May 2 06:08:08 nk11p04mm-mail04535 unix: ^MSunOS Release 5.10 Version Generic_147440-12 64-bit
-------------------------
May 8 22:50:05 Sungams3 ^Mpanic[cpu64]/thread=3004519c680:
May 8 22:50:05 Sungams3 unix: [ID 400509 kern.notice] Unrecoverable hardware error
May 8 22:50:05 Sungams3 unix: [ID 100000 kern.notice]
May 8 22:50:05 Sungams3 genunix: [ID 723222 kern.notice] 000002a1076b8a90 unix:process_nonresumable_error+298 (2a1076b8c80, 0, 1, 40, 0, 0)
FMA Data
FMA data can typically isolate the hardware problem to a specific DIMM or PCI card.
##### fma/fmdump-eV (fmdump -eV) #####
The first fmdump-eV entry is from Feb 21 2012 22:36:02.
---- FIRST DATE ---- ---- LAST DATE ---- COUNT DEVICE
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 4978 /pci@400
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 2780 /pci@400/pci@1
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 121 /pci@400/pci@2
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 2495 /pci@400/pci@1/pci@0
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 2497 /pci@400/pci@1/pci@0/pci@8
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47 2828 /pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0
##### fma/fmadm-faulty (fmadm faulty) #####
May 02 06:18:38 da17c43c-66da-c866-82f7-9f257b792011 SUNOS-8000-FU Major
Host : Sun04535
Platform : ORCL,SPARC-T3-2 Chassis_id :
Product_sn :
Fault class : defect.sunos.eft.undiag.fme
FRU : None faulty
Description : The diagnosis engine encountered telemetry for which it was unable to perform a diagnosis.
-------------------------
##### fma/fmdump-eV (fmdump -eV) #####
The first fmdump-eV entry is from Apr 13 2011 03:09:52.
---- FIRST DATE ---- ---- LAST DATE ---- COUNT DEVICE
Apr 13 2011 03:09:52 thru May 17 2011 15:06:18 64214 MB/CMP0/BR0/CH0
Apr 13 2011 03:13:01 thru May 17 2011 15:06:56 54833 MB/CMP0/BR0: CH0/D1/J0600
Apr 13 2011 03:13:13 thru May 17 2011 15:06:18 5426 MB/CMP0/BR0: CH0/D0/J0500
Apr 13 2011 04:07:22 thru May 16 2011 12:30:29 22 MB/CMP0/BR0: CH0/D0/J0500 CH1/
Apr 13 2011 04:07:22 thru May 17 2011 06:26:46 137 MB/CMP0/BR0
##### fma/fmadm-faulty (fmadm faulty) #####
Apr 04 09:57:49 21c312ab-4c48-ee92-c762-e2680ae35b74 FMD-8000-0W Minor
Host : fwgams3
Platform : SUNW,T5240 Chassis_id :
Fault class : defect.sunos.fmd.nosub
Description : The Solaris Fault Manager received an event from a component to which no automated diagnosis software is currently subscribed.
Apr 04 15:54:22 b3895ac1-e2ad-c58f-f189-f2bf8fb0db53 SUN4V-8002-42 Critical
Host : fwgams3
Platform : SUNW,T5240 Chassis_id :
Fault class : fault.memory.dimm-ue-imminent 95%
Affects : mem:///unum=MB/CMP0/BR0/CH0/D1/J0600 faulted but still in service
FRU : "MB/CMP0/BR0/CH0/D1/J0600" (hc://:serial=00AD01101110A1A65B:part=511-1151-01-Rev-05/motherboard=0/chip=0/branch=0/dram-channel=0/dimm=1) 95% faulty
Description : A pattern of correctable errors has been observed suggesting the potential exists that an uncorrectable error may occur.
Use the prtdiag data or FRU data to determine the part number of the component.
Prtdiag Data
##### sysconfig/prtdiag: #####
System Configuration: Oracle Corporation sun4v SPARC T3-2
Memory size: 130560 Megabytes
CPU ID Frequency Implementation Status
0 1649 MHz SPARC-T3 on-line
...
255 1649 MHz SPARC-T3 on-line
...
/SYS/MB/PCIE6 PCIE SUNW,qlc-pciex1077,2532 QLE2562 <--- To obtain Oracle part # see doc 1373995.1 or 1282491.1 for HBAs.
/pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0
/SYS/MB/PCIE6 PCIE SUNW,qlc-pciex1077,2532 QLE2562
/pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0,1
/SYS/MB/PCIE0 PCIE SUNW,qlc-pciex1077,2532 QLE2562
/pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0
/SYS/MB/PCIE0 PCIE SUNW,qlc-pciex1077,2532 QLE2562
/pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0,1
FRU Data
##### ipmi/ipmitool_fru.out (show -l all /) See doc 1411086.1 #####
Part Manufacturer Part # Ser #
Builtin FRU Dev Oracle Corporatio
P0/M0 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10EU
P0/M1 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10E6
P1/M0 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10WD
P1/M1 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10E4
/SYS Oracle Corporatio 1219BDYA39
MB/SP 5030 CELESTICA CO 542-0442 0111APO-1220S11
P0/M0/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231A40E05
P0/M0/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DF1
P0/M0/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231940DF3
P0/M0/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DFD
P0/M1/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231140DEF
P0/M1/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DF4
P0/M1/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231940DF0
P0/M1/B1/C1/D0 Hynix Semiconduct 07014642 00AD011202177317AA
P1/M0/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231340DF5
P1/M0/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231640DEB
P1/M0/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231240DF2
P1/M0/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231B40DF6
P1/M1/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231640DEE
P1/M1/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231440E04
P1/M1/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231B40DED
P1/M1/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231C40DF3
FB 9615 HON HAI PREC 541-3535 0226LHF-1213HH0055
SASBP 9615 HON HAI PREC 511-1246 0226LHF-1210A90FFT
PS0 10465 ASTEC INTER 300-2344 476856F+1212CD00UR
PS1 10465 ASTEC INTER 300-2344 476856F+1213CD002B
Hang
A Hang is when some applications or OS functions may operate properly while others appear dead. The hardware and operating system do not detect a problem unless the SP watchdog detects it. Hangs are caused by resource deadlocks due to operating system race conditions, or by resource deprivation due to one or more applications that are too needy. Sometimes console messages may indicate the source of the hang, but typically a live core should be forced so that Sun's kernel group can analyze the data. There is a small possibility that hangs can be caused by hardware, so please check for hardware problems then contact the kernel group for isolation prior to transferring the SR. They may want output from the GUDs tool to determine OS resource statistics. Some Solaris documents that discuss procedures and configuration to isolate Solaris panics and hangs are as follows:
- DocID: 1004530.1 KERNEL: How to enable deadman kernel code
- DocID: 1012913.1 Troubleshooting Panics, dumps, hangs or crashes in the Solaris Operating System
- DocID: 1001950.1 Troubleshooting Suspected Solaris Operating System Hangs
- DocID: 1004506.1 How to force a crash when my machine is hung
- DocID: 1001950.1 When to Force a Solaris System Core File
When SPARC CMT host systems hang because of a software, firmware, or hardware bug, obtaining information is difficult as one is often unable to break into the system with any of the above procedure docs. SPARC systems do have the ability to initiate a break via the Service Processor (SP) by setting the property send_break_action under /HOST (in the procedure docs above), but this relies on all software being functional and responsive at the time of the failure, which is often not the case when certain situations arise. Thus, in many cases when this situation occurs, diagnosability is not possible or is difficult, since no/limited information can be obtained from the host, especially from the customer perspective/front line service organization.
With the exception of the T1000/T2000, all legacy Sun4u CPUs and Sun4v CPUs supported an eXternally Initiated Reset (XIR) from the restricted shell. XIR is a specific pin on the CPU that, when toggled, generates a SPARC h/w trap (Trap 3) which changes the flow of execution by entering Reset Error Debug (RED) mode. XIR can be utilized to gather information, generate a crash dump, drop into a debugger, or simply initiate a reset of the host. 1530801.1 indicates that "xir dumpcore" starts the reset & dumps a corefile.
It is recommended that the server have the keyswitch in the lock position & that the customer not cause a host NMI so that it doesn't place the system at the OBP prompt.
- -> set /HOST keyswitch_state=Locked
- -> set /HOST/generate_host_nmi=true (avoid since sends NMI to host)
If a corefile was not collected, the user will want to configure the ILOM to cause a reset & dump a corefile for future hangs per ILOM command:
- -> set /HOST autorestart=dumpcore
The data needed to attempt isolation of hardware related hangs is similar to Fatal Resets so mainly SP data is required:
- LAST Output: From the explorer contains the times of reboots.
- Messages do not contain useful information, just a boot following an unrelated message.
- Console Output - This typically contains a reason for the reset for example critical component failure (or nothing for total power loss). An ILOM Snapshot contains the SP info! Output from ILOM command "show /HOST/console/history" will also provide it.
- SP events - This could contain sensor related events like under voltage conditions on one rail or OEM specific events like 0x12's or an ILOM reboot if power loss.
- SP sensor data - This contains information if a sensor has a consistent problem like a voltage regulator or fan failure.
- FMA Fault data - Check for a history of DIMM or disk problems.
- SP field replaceable unit (FRU) data - This describes the hardware inventory configuration to assist with hardware replacement. Collect this to determine if the system has the proper configuration (e.g. a partially installed memory bank). A good item to check is the system board page in the Sun System Handbook.
- Are boot disks internal or external?
The LDom section above should be reviewed to isolate the LDom which hung.
LAST
Determine when the admin forced a reboot to stop the hang.
##### sysconfig/last-20-reboot.out #####
reboot system boot Wed Jul 4 18:21
reboot system down Wed Jul 4 17:06
Messages
Check whether the message prior to the reboot is unrelated to the reboot, or whether the user caused a panic via the ILOM break.
##### messages/messages (/var/adm/messages) #####
Jul 4 04:53:09 prod-db iscsi: [ID 632887 kern.warning] WARNING: iscsi connection(19) login failed - authentication failed with target
Jul 4 18:21:27 prod-db genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_142900-15 64-bit
Use the data below to look for possible hardware problems:
SP Events
Notice that the event log has an indication of system shutdown via the power button. Another example shows that the SP watchdog triggered a reboot.
##### ipmi/ipmitool_sel_elist.out #####
Jul 04 18:10:56: Chassis |major : "System shutdown has been requested via power button."
Jul 04 18:11:00: Chassis |major : "System power off has been requested via power button."
Jul 04 18:11:01: Chassis |critical : "Host has been powered off"
Jul 04 18:11:15: Chassis |major : "System power on has been requested via power button."
Jul 04 18:11:16: Chassis |major : "Host has been powered on"
Jul 04 18:15:13: Chassis |major : "Host is running"
-------------------------
Did an admin recently log into the ILOM prior to the reset to do "cd /HOST; set send_break_action=break" during this time?
Oct 20 16:47:15 tesp07328 SC Alert: [ID 113266 daemon.notice] Audit | minor: admin : Open Session : object = "/SP/session/type" : value = "shell" : success
Oct 20 16:48:13 tesp07328 SC Alert: [ID 354481 daemon.notice] Audit | minor: admin : Close Session : object = "/SP/session/type" : value = "shell" : success
Oct 20 16:48:19 tesp07328 SC Alert: [ID 113266 daemon.notice] Audit | minor: admin : Open Session : object = "/SP/session/type" : value = "shell" : success
-------------------------
Dec 29 06:58:37: Chassis |critical: "SP Request to Reset Host due to Watchdog"
Dec 29 06:58:37: Chassis |major : "Host is running"
Jan 03 20:24:36: Chassis |critical: "SP Request to Reset Host due to Watchdog"
Jan 03 20:24:36: Chassis |major : "Host is running"
VBSC data from the ILOM snapshot can also be useful:
##### ilom/@persist@vbsc@vbsc.log #####
DEBUG: check_poweron_button: pwr_status = ?
NOTICE: System shutdown has been requested.
NOTICE: System power off has been requested via power button.
FMA Data
Check FMA data for a history of DIMM or disk problems.
##### fma/fmdump-eV: #####
The first fmdump-eV entry is from Sep 14 2010 19:44:23.
---- FIRST DATE ---- ---- LAST DATE ---- COUNT DEVICE
Sep 14 2010 19:44:23 thru May 17 2012 23:19:23 842 MB/CMP0/BR0: CH0/D0/J1001
Sep 20 2010 06:03:28 thru May 18 2012 00:32:03 493 MB/CMP0/BR1: CH1/D0/J1601
##### fma/fmadm-faulty: #####
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 26 22:07:23 3c4922d5-5c1c-e4f9-e4c2-c86ac566231d SUN4V-8002-42 Critical
Fault class : fault.memory.dimm-ue-imminent 95%
Affects : mem:///unum=MB/CMP0/BR1:CH1/D0/J1601 faulted but still in service
Serial ID. : 0
Description : A pattern of correctable errors has been observed suggesting the potential exists that an uncorrectable error may occur.
FRU Data
Check FRU data for faulted components or uncertified DIMMs.
##### ipmi/ipmitool_fru.out (show -l all /) See doc 1411086.1 #####
Part Manufacturer Part # Ser #
Builtin FRU Dev Oracle Corporatio
P0/M0 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10EU
P0/M1 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10E6
P1/M0 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10WD
P1/M1 7696 MITAC COMPUT 541-4438 0328MSL-1213TA10E4
/SYS Oracle Corporatio 1219BDYA39
MB/SP 5030 CELESTICA CO 542-0442 0111APO-1220S11
P0/M0/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231A40E05
P0/M0/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DF1
P0/M0/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231940DF3
P0/M0/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DFD
P0/M1/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231140DEF
P0/M1/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231540DF4
P0/M1/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231940DF0
P0/M1/B1/C1/D0 Hynix Semiconduct 07014642 00AD011202177317AA
P1/M0/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231340DF5
P1/M0/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231640DEB
P1/M0/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231240DF2
P1/M0/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231B40DF6
P1/M1/B0/C0/D0 Hynix Semiconduct 07014642 00AD01120231640DEE
P1/M1/B0/C1/D0 Hynix Semiconduct 07014642 00AD01120231440E04
P1/M1/B1/C0/D0 Hynix Semiconduct 07014642 00AD01120231B40DED
P1/M1/B1/C1/D0 Hynix Semiconduct 07014642 00AD01120231C40DF3
FB 9615 HON HAI PREC 541-3535 0226LHF-1213HH0055
SASBP 9615 HON HAI PREC 511-1246 0226LHF-1210A90FFT
PS0 10465 ASTEC INTER 300-2344 476856F+1212CD00UR
PS1 10465 ASTEC INTER 300-2344 476856F+1213CD002B
Boot Failure
If the system cannot boot, clearly we should not ask for an explorer since only Service Processor data is obtainable.
First determine if power is present. If so, this could be an indication of disk failure or failure of another critical component.
These systems use the ILOM, and sometimes a memory leak can make the ILOM partially operational. Please remove system power to reboot the ILOM which rules out memory leak problems.
If the ILOM is dead/fails to boot, either power or ILOM hardware problems exist so power LEDs must be checked first. The ILOM is powered by the 3.3V Standby Voltage from the PSU/PDB via a ribbon cable to the system board. Have a field engineer remove either PSU to isolate a problem with it, then replace ribbon cables, PDB & then system board if not PSU related. Always review the server's wiring diagram which is linked to the Sun System Handbook's Full Component List page.
If the ILOM is operational, the data we should obtain is:
- Data from the ILOM as indicated in the overview,
- Did the admin perform any OS / firmware / hardware upgrades/changes prior to the problem? Please note that it's best to power cycle a failing system prior to a firmware upgrade in case ILOM resources were lost or in a corrupt state!
- States of the LEDs
If the system fails to get to the OK prompt, analyze the SP data requested above to isolate a problem. The TSE should check the front & rear system views in the Sun System Handbook to determine which LEDs exist on that platform & request their status. The sensor data is most important since it contains voltage related info. The FRU & fault related data should then be checked for component failure. Have the customer provide console output just after power on to view POST output. If the data is not helpful, then a field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one. The following is an indication of a possible counterfeit DIMM installed that also failed POST, but also look for system board regulator problems:
##### ipmi/@usr@local@bin@ipmiint_fru_print.out #####
Jul 11 20:10:39: Chassis |minor : "DIMMS at MB/CMP0/BR1/CH1/D1 and MB/CMP1/BR0/CH0/D1 have duplicate serial and part-dash-rev numbers"
Jul 11 20:10:41: Fault |critical: "SP detected fault at time Wed Jul 11 20:10:41 2012. /SYS/MB/CMP0/BR1/CH1/D1 Forced Fail (POST)"
If the system is able to get to the OK prompt, then the boot problem is most likely disk/controller related, ZFS related, or ILOM related. Have the admin attempt to boot the system via DVD (check for the minimal supported OS for the platform!!!) or via the network. If the system boots from an external boot source, then either the boot disk/controller has failed or a ZFS filesystem may need repair. Determine if something was done to change the boot parameters, or if the boot image is corrupt. Please attempt a "reset-all" prior to subsequent boot or probe-scsi-all commands to see if that resolves the problem. Also, if ZFS is in use, boot from CDROM & perform a bootadm update after the rpool is imported & mounted (a minimal repair sequence is sketched after the list below). If partial disk information is accessed prior to the failure, or the issue is ZFS related, then open an SR for Solaris OS assistance. Output from the following commands may be helpful:
- ok> printenv :Determine boot device (internal or external!!!). If external & an Oracle storage array, then open an SR into the Storage group
- ok> probe-scsi-all :Determine if boot device seen
- ok> devalias :Lists aliases for device paths
- ok> show-disks :Lists disks & has options to add to dev alias
- ok> boot -aV -s :Boot in ask me mode (respond with default settings) with the verbose option. Determine if boot device correct (such as internal RAID controller path).
- ok> boot -m verbose -s or -m debug :(Solaris 10 & up) Boot which shows the services starting (see doc 1006328.1)
- ok> boot -F failsafe :(Solaris 10 U6 & up) Contact the OS group for usage (see doc 1340586.1)
- Boot from cdrom or from net, then do format; select a disk; analyze; read; which will test if the disk is properly accessed by the controller.
- Obtain an ILOM snapshot to rule out ILOM problems.
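For the ZFS boot-archive repair mentioned above, a minimal sketch follows. It assumes a bootable Solaris media, a root pool named rpool, & a boot environment name (s10_be) that is purely illustrative - use zfs list to find the real one:
- ok> boot cdrom -s (single-user boot from supported media)
- # zpool import -f -R /a rpool (import the root pool under the temporary mount point /a)
- # zfs mount rpool/ROOT/s10_be (mount the boot environment)
- # bootadm update-archive -R /a (rebuild the boot archive)
- # zfs umount rpool/ROOT/s10_be; zpool export rpool; init 6 (clean up & reboot from disk)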
Attachments
This solution has no attachment