Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1518205.1
Update Date:2017-06-14
Keywords:

Solution Type: Predictive Self-Healing

Sure Solution 1518205.1: Troubleshooting data needed for TX000 servers


Related Items
  • Sun Blade T6300 Server Module
  • Sun Fire T2000 Server
  • Sun SPARC Enterprise T2000 Server
  • Sun Fire T1000 Server
  • Sun Netra T2000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: Tx000




In this Document
Purpose
Details
 Admin Reboot
 Fatal Reset
 OS Panic
 Hang
 Boot Failure
References


Applies to:

Sun Blade T6300 Server Module
Sun Fire T2000 Server
Sun Netra T2000 Server
Sun SPARC Enterprise T2000 Server
Sun Fire T1000 Server
Information in this document applies to any platform.

Purpose

This document provides a high level guide for hardware specialists about what data to collect & how to analyze major server problems.  A similar document written from the software perspective is: 1012913.1

Details

Reasons for SR Creation

A system becomes unresponsive for one of four reasons, each covered in its own section below:

  • Admin Reboot
  • Fatal Reset
  • OS Panic
  • Hang

In most cases, we recommend that both explorers & ALOM data (ALOM Data) be collected if possible.  The absolutely required information relates to how it crashed.  If the system is operational, obtain the necessary data per doc 1010911.1: What to send to Oracle after a system panic and/or unexpected reboot.  That document indicates the basic questions that must be asked to determine what data to gather.  Please note that ALOM Data will contain the console output & most other SP related data for the CMT based servers.

An SR can also be opened for a boot failure or for non-fatal reasons like hardware failure where the system continues to operate, either because some components are redundant or because performance is merely reduced when one fails:

  • Boot failure (The Ultimate Fatal Reset)
  • Redundant Components: Fans & PSUs
  • Performance Limited: DIMMs

 

Please note that an explorer will not collect ALOM data if the ALOM's net management port is not configured.  An Explorer should be gathered if the system is bootable. Features & problems with versions of Explorer are as follows:

5.2 is the minimum version for these platforms.
5.5 collects ALOM data by default.
7.03 can collect ILOM information via the vbsc with no prompts to the user.  See doc 1518044.1.
7.03 added collection timeouts, so some information may not be gathered.  See Explorer Timeout FAQs in doc 1287574.1.
8.00 may take 5 minutes to install on an S11 based system.
8.00 collects LDOM information by default.
8.00 extended collection timeouts for ipmitool data.
8.00 ipmitool first choice set to /opt/ipmitool/sbin.  Earlier versions must have the ipmitool executable copied to /opt/ipmitool/bin!!!
8.03 removed the timeouts introduced in 7.03.  If a collection runs long, check it with the ptree command & see Explorer FAQ doc 1287574.1 for data collection & timeout configuration information.  See doc 1612918.1 to determine why an Explorer collection runs too long.
8.11 resolves the lack of FMA data collection introduced in 8.10.

Explorer versions 5.2 through 5.4 should be run as follows to gather the proper ALOM data:

# explorer -w default,Tx000

 

ALOM data is typically gathered with the following commands (one way to capture them is sketched below the list):

  • ALOM Data: consolehistory -v, showenvironment, showfaults -v, showfru and showlogs -v, showplatform, showsc
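
One way to capture these into a single file for the SR is to wrap an interactive session with script(1).  This is only a sketch; the SP hostname "alom-sc" is hypothetical, and the commands are the ALOM commands listed above:

# script alom_data.txt
# ssh admin@alom-sc
sc> showdate
sc> consolehistory -v
sc> showenvironment
sc> showfaults -v
sc> showfru
sc> showlogs -v
sc> showplatform
sc> showsc
sc> logout
# exit          (ends script; the session transcript is in alom_data.txt)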

 


Admin Reboot

Sometimes the admin accidentally or purposely reboots a server via Solaris commands (like init 0), via the SP's break, reset, or poweroff commands, or via removing power.  Please note that the admin may do any of these to stop a hang condition instead of the recommended method below which attempts to generate a core file.  This is easy to detect by:

  • Check messages for a signal 15 prior to the reboot, which indicates the admin performed an "init 6" or some other method to reboot the host,
  • Check SP events to determine if the admin reset the host via the SP,
  • Check for power events followed by a reboot, as in the checks & examples below.
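
These checks can be run directly against an unpacked explorer; the paths below follow the "#####" headings used in the examples that follow:

# grep -n "signal 15" messages/messages                  (admin initiated shutdown/reboot from Solaris)
# grep -niE "reset of /SYS|power" Tx000/showlogs_-v      (SP initiated resets & power actions)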

 

##### messages/messages (/var/adm/messages) #####
Feb 20 04:37:50 kcgams7 xntpd[337]: [ID 866926 daemon.notice] xntpd exiting on signal 15
Feb 20 04:37:50 kcgams7 syslogd: going down on signal 15
...
Feb 20 04:39:23 kcgams7 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_144488-17 64-bit

 
##### Tx000/showlogs_-v  #####
Feb 08 21:22:29: Reset   |major   : "Reset of /SYS initiated by root."
Feb 08 21:22:29: Reset   |major   : "Reset of /SYS by root succeeded."

Feb 08 21:24:31: Chassis |critical: "Host has been reset"
Feb 08 21:52:31: Chassis |major   : "System shutdown has been requested via power button."
Feb 08 21:52:35: Chassis |major   : "System power off has been requested via power button."
Feb 08 21:53:03: Chassis |major   : "System power on has been requested via power button."

 

##### Tx000/showlogs_-v  #####

Feb 04 20:37:07: IPMI    |critical: "ID =   17 : 02/04/2012 : 20:37:07 : Voltage : /PS0/AC_POK : State Deasserted"
Feb 04 20:37:27: Fault   |critical: "SP detected fault at time Sat Feb  4 20:37:27 2012. Input power unavailable for PSU at PS0"
Feb 04 20:41:07: IPMI    |critical: "ID =   19 : 02/04/2012 : 20:41:07 : Voltage : /PS0/AC_POK : State Deasserted"
Feb 04 20:48:46: Chassis |major   : "Host has been powered on"


 


Fatal Reset

Fatal Resets are hardware detected problems, caused when the central processing unit (CPU) takes a trap which immediately drops the system to the OBP or worse!  No messages are logged, since the disk image is not updated following these events in order to prevent corruption.  The system may or may not be operational following a fatal reset.  One reason for a fatal reset is a watchdog reset, which occurs when the operating system fails to access the watchdog circuitry within its timeout period.  This is really an operating system hang detected by the watchdog timer, so see the hang section below for techniques to diagnose it.  Other reasons for fatal resets are hardware failures like loss of input voltage, or other major hardware related issues.  No core file is saved and the messages file shows normal operation followed by an abrupt system restart (no done or dump succeeded message).  The most important diagnosis information is the following, which is mostly gained through the service processor (SP), so the ALOM information must be gathered.

  • LAST Output: From the explorer; contains the times of reboots.
  • Messages do not contain useful information, just a boot following an unrelated message.
  • Console Output - This typically contains a reason for the reset, for example critical component failure (or nothing for total power loss).  If ALOM based, the ALOM Data contains the SP info!  Output from the ILOM command "show /HOST/console/history" or the ALOM command "consolehistory -v" will provide it.
  • SP Faults
  • SP events - This could contain sensor related events like under voltage conditions on one rail, OEM specific events like 0x12's, or an ALOM reboot if power was lost.
  • SP sensor data - This contains information if a sensor has a consistent problem like a voltage regulator or fan failure.
  • SP field replaceable unit (FRU) data - This describes the hardware inventory configuration to assist with hardware replacement.  Collect this to determine if the system has the proper configuration (e.g. a partially installed memory bank).  A good item to check is the system board page in the Sun System Handbook.
  • Is the system operational?  If not, see Boot Failure after data analysis.

If the cause of the reboot or crash cannot be quickly determined given the information above, it's important to perform hardware diagnostics such as a full power on self test (POST) or SunVTS to determine if the hardware is stable.  Typically the field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one.
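
One common way to arrange a full POST run from an ALOM CMT prompt is sketched below.  Treat the variable values as examples & confirm them against the platform's administration guide (and restore the originals afterwards):

sc> setsc diag_mode normal
sc> setsc diag_level max
sc> setsc diag_trigger power-on-reset
sc> poweroff -y
sc> poweron
sc> console -f          (watch POST progress & any component failures)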

 

Last 20 Reboots

First review the reboots during the time of the incident.

##### sysconfig/last-20-reboot: (last reboot) #####
reboot    system boot                   Tue Mar  6 01:13
reboot    system down                   Tue Mar  6 00:29
reboot    system boot                   Tue Mar  6 00:24
reboot    system down                   Tue Mar  6 00:13
reboot    system boot                   Mon Mar  5 21:40
reboot    system down                   Mon Mar  5 15:01

 

Messages

Then review the messages for each reboot to determine when the fatal reset occurred (no preceding done or dump succeeded message).
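
A quick way to list the boot banners & any surrounding dump messages from the explorer copy of messages is:

# grep -nE "SunOS Release|dump succeeded|syncing file systems" messages/messages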

##### messages/messages (/var/adm/messages) #####

Mar  5 15:00:52 osdldom50 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(2): Link ONLINE
Mar  5 21:40:04 osdldom50 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-01 64-bit

--------------------------------------

Jul  3 17:03:22 ctrstapp01      Corrupt label; wrong magic number
Jul  5 13:19:53 ctrstapp01 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_142900-02 64-bit

 

Console Output

S10 & FMA do a good job of logging problems, so little useful console information is found, as shown below.

##### Tx000/consolehistory_-v  (ALOM compat - consolehistory -v )   #####

Mar  5 05:00:19 osdldom50       Corrupt label; wrong magic number^M^M
|^H/^H-^H\^H|^H/^H-^H\^H|^H
0:0:0>SPARC-Enterprise[TM] T5120/T5220 POST 4.30.4 2009/08/19 07:49 ^M
...
Mar  6 00:13:25 vxvm:vxconfigd: V-5-1-554 Disk EMC0_1F3E names group rootdg, but group ID differs^M

 

SP Events

The ALOM clock needs to be correlated with the host clock so the SP events can be aligned with the messages above, so also note each one's timezone.

##### Tx000/showdate  (Host - date  +  ALOM - showdate) #####  (SC normally in UTC!)
SC Date:   Tue Mar  6 14:13:46 2012
Host Date: Tue Mar  6 08:37:17 2012

 

Notice that the ALOM time is roughly 6 hours ahead of the host time since the ALOM is using UTC instead of the local timezone.  These times are derived from the creation time of file Tx000/showdate & the ALOM time within it.
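
If GNU date is available (it is not part of base Solaris), the SC's UTC timestamp can be rendered in the host's timezone; US/Central is only an assumption for illustration here:

$ TZ=US/Central date -d "2012-03-06 14:13:46 UTC"
Tue Mar  6 08:13:46 CST 2012

Any residual difference after the timezone offset is removed (roughly 24 minutes against the host date above, if that timezone assumption holds) is real clock skew & should be applied when lining SP events up against /var/adm/messages.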

##### Tx000/showlogs_-v  #####
Mar 06 04:30:20: Chassis |major   : "Host detected fault, MSGID: FMD-8000-11"
Mar 06 04:30:20: Chassis |major   : "Host detected fault, MSGID: FMD-8000-0W"
Mar 06 04:30:20: Chassis |major   : "Host detected fault, MSGID: FMD-8000-0W"
Mar 06 04:30:20: Chassis |major   : "Host detected fault, MSGID: FMD-8000-0W"
Mar 06 04:30:39: Chassis |major   : "Host has been powered on"
Mar 06 04:40:16: Chassis |major   : "Host detected fault, MSGID: FMD-8000-11"
Mar 06 04:40:16: Chassis |major   : "Host detected fault, MSGID: FMD-8000-0W"
Mar 06 04:40:16: Chassis |major   : "Host detected fault, MSGID: FMD-8000-0W"
Mar 06 04:40:16: Chassis |major   : "Host detected fault, MSGID: FMD-8000-0W"
Mar 06 04:40:35: Chassis |major   : "Host has been powered on"

    or

##### Tx000/showlogs_-v #####
Jul 21 00:07:20: Chassis |major   : "Jul 21 00:07:20 ERROR: [CMP0] Received Fatal Error"
Jul 21 12:15:57: Reset   |major   : "Reset of /SYS initiated by radmin."
Jul 21 12:15:57: Reset   |major   : "Reset of /SYS by radmin succeeded."
Jul 21 12:17:59: Chassis |critical: "Host has been reset"
Jul 21 12:25:39: Reset   |major   : "Reset of /SP initiated by radmin.  Success unless failure noted."
Jul 21 12:41:49: Chassis |critical: "Host has been powered off"
Jul 21 12:42:03: Chassis |major   : "Host has been powered on"

    or


##### Tx000/showlogs_-v   #####
May 21 22:10:07: Chassis |critical: "Critical temperature value : host is being powered off"
May 21 22:10:07: Chassis |critical: "Host has been powered off"
May 21 22:18:42: IPMI    |major   : "ID =   eb : 05/21/2008 : 22:18:38 : Temperature : /MB/T_BUS_BAR0 : Lower Critical going low  : reading -128 < threshold -8 degrees C"
May 21 22:18:43: IPMI    |critical: "ID =   ec : 05/21/2008 : 22:18:43 : Temperature : /MB/T_BUS_BAR0 : Lower Non-recoverable going low  : reading -128 < threshold -10 degrees C"
May 21 22:19:58: Fault   |critical: "SP detected fault at time Wed May 21 22:19:58 2008. T_BUS_BAR0 at /SYS/MB has exceeded low non-recoverable threshold."
May 21 22:24:06: Chassis |major   : "Host has been powered on"
May 21 22:24:24: Chassis |critical: "Critical temperature value : host is being powered off"
May 21 22:24:25: Chassis |critical: "Host has been powered off"
May 21 22:24:26: Fault   |critical: "SP detected fault at time Wed May 21 22:24:26 2008. Required FAN at FANBD0/FM0 is not present."
May 21 22:24:29: Fault   |critical: "SP detected fault at time Wed May 21 22:24:29 2008. Required FAN at FANBD0/FM1 is not present."
May 21 22:24:31: Fault   |critical: "SP detected fault at time Wed May 21 22:24:31 2008. Required FAN at FANBD0/FM2 is not present."
May 21 22:24:34: Chassis |major   : "Hot removal of /SYS/FANBD0/FM0"
May 21 22:24:34: Chassis |major   : "Hot removal of /SYS/FANBD0/FM1"
May 21 22:24:35: Chassis |major   : "Hot removal of /SYS/FANBD0/FM2"
May 21 22:24:36: Chassis |major   : "Hot insertion of /SYS/FANBD1/FM1"
May 21 22:24:37: Chassis |major   : "Hot insertion of /SYS/FANBD1/FM2"

 

SP Faults

##### Tx000/showfaults_-v  #####
Last POST Run: Tue Mar  6 05:48:23 2012

Post Status: Passed all devices
  ID Time                           FRU               Class             Fault
   1 Mar 06 05:43:39                /SYS/MB                             Host detected fault MSGID: FMD-8000-11  UUID: 331899ef-47a0-ea0c-cf7d-9327080fd179

 

Additional fault information can be obtained from the ALOM user account via the following commands:

   sc> setsc sc_servicemode true
   Warning: misuse of this mode may invalidate your warranty.
   sc> showfmerptlog1 -v

    ...

   sc> setsc sc_servicemode false

 

An explorer will also contain a list of existing or repaired faults in file: fma/fmdump.out.

 

SP Sensor Data

Sensor data is good for locating a failed voltage regulator (VRM) or other failed components like fans.

##### Tx000/showenvironment  #####
/SYS/LOCATE                    /SYS/SERVICE                   /SYS/ACT
OFF                            ON                             ON




OS Panic

OS Panics are software detected problems and are caused when the operating system detects that the integrity of data is suspect or in danger of being corrupted.  The panic routine will place a panic message into the messages file (captured by explorer) & console output, & then create a core dump if properly configured in dumpadm.  Panics can be caused either by operating system, driver, or firmware coding errors, which are typically fixed by patches, or by hardware related problems.  Uncorrectable Hardware Errors are typically related to DIMM UE's.  PCI fabric panics are also typically associated with hardware problems, but driver issues must be checked.  If the fabric panic is HBA related, then also collaborate with the storage group to determine if HBA firmware or SAN drivers are involved.
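
A minimal check that a panic will actually leave a dump to collect (Solaris 10 defaults assumed) is:

# dumpadm                          (shows the dump device, savecore directory & whether savecore is enabled)
# ls -l /var/crash/`uname -n`      (default savecore directory; look for unix.N / vmcore.N pairs)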

If the panic is software related, collect the core dump for the kernel group's analysis.  If hardware related, then collect the following data so the problem can be isolated:

  • Panic Message as found in the explorer messages file (but not always) or the ALOM's console output.  This describes the type of panic.  The reboot is typically preceded by a "dump succeeded" message.
  • FMA Data collected by explorer or ALOM data, which may isolate a failed DIMM or PCI path.  The ALOM data may contain the fatal ereport that doesn't make it to Solaris FMA.
  • Prtdiag data collected by explorer for PCI related panics.  Lists the PCI paths with the associated card type & lists CPU faults.  Use doc 1373995.1 to determine the associated Oracle PCI part number.
  • SP FRU data. This describes the hardware configuration to assist with hardware replacement.

If the cause of the panic/reboot cannot be quickly determined given the information above, it's important to perform hardware diagnostics such as a full power on self test (POST) or SunVTS to determine if the hardware is stable.  Typically the field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one. 

 

Messages

The following panics are typically related to hardware or firmware problems (but not always).

##### messages/messages  (/var/adm/messages) #####
May  2 05:53:47 Sun04535 ^Mpanic[cpu53]/thread=2a102c49ca0:
May  2 05:53:47 Sun04535 unix: Fatal error has occured in: PCIe fabric.(0x1)(0x43)
May  2 05:53:47 Sun04535 unix:
May  2 05:53:47 Sun04535 unix: 000002a102c496f0 px:px_err_panic+1ac (19db400, 7be43800, 43, 2a102c497a0, 1, 0)

...

May  2 06:06:26 nk11p04mm-mail04535 unix: dump succeeded
May  2 06:08:08 nk11p04mm-mail04535 unix: ^MSunOS Release 5.10 Version Generic_147440-12 64-bit

-------------------------

##### messages/messages  (/var/adm/messages) #####
May  8 22:50:05 Sungams3 ^Mpanic[cpu64]/thread=3004519c680:
May  8 22:50:05 Sungams3 unix: [ID 400509 kern.notice] Unrecoverable hardware error
May  8 22:50:05 Sungams3 unix: [ID 100000 kern.notice]
May  8 22:50:05 Sungams3 genunix: [ID 723222 kern.notice] 000002a1076b8a90 unix:process_nonresumable_error+298 (2a1076b8c80, 0, 1, 40, 0, 0)

FMA Data

FMA data can typically isolate the hardware problem to a specific DIMM or PCI card.
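
On a live system, a rough count of error reports by class can be produced as below; this is only a triage aid & does not replace the per-device summaries that explorer produces from fmdump -eV:

# fmdump -e | awk 'NR>1 {print $NF}' | sort | uniq -c | sort -rn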

##### fma/fmdump-eV  (fmdump -eV) #####
The first fmdump-eV entry is from Feb 21 2012 22:36:02.
---- FIRST DATE ----       ---- LAST DATE ----  COUNT  DEVICE
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   4978  /pci@400
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   2780  /pci@400/pci@1
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47    121  /pci@400/pci@2
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   2495  /pci@400/pci@1/pci@0
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   2497  /pci@400/pci@1/pci@0/pci@8
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   2828  /pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0

##### fma/fmadm-faulty  (fmadm faulty) #####
May 02 06:18:38 da17c43c-66da-c866-82f7-9f257b792011  SUNOS-8000-FU  Major
Host        : Sun04535
Platform    : ORCL,SPARC-T3-2   Chassis_id  :
Product_sn  :
Fault class : defect.sunos.eft.undiag.fme
FRU         : None   faulty
Description : The diagnosis engine encountered telemetry for which it was unable to perform a diagnosis.

-------------------------

##### fma/fmdump-eV  (fmdump -eV) #####
The first fmdump-eV entry is from Apr 13 2011 03:09:52.

---- FIRST DATE ----       ---- LAST DATE ----  COUNT  DEVICE
Apr 13 2011 03:09:52 thru May 17 2011 15:06:18  64214  MB/CMP0/BR0/CH0
Apr 13 2011 03:13:01 thru May 17 2011 15:06:56  54833  MB/CMP0/BR0: CH0/D1/J0600
Apr 13 2011 03:13:13 thru May 17 2011 15:06:18   5426  MB/CMP0/BR0: CH0/D0/J0500
Apr 13 2011 04:07:22 thru May 16 2011 12:30:29     22  MB/CMP0/BR0: CH0/D0/J0500 CH1/
Apr 13 2011 04:07:22 thru May 17 2011 06:26:46    137  MB/CMP0/BR0

##### fma/fmadm-faulty  (fmadm faulty) #####
Apr 04 09:57:49 21c312ab-4c48-ee92-c762-e2680ae35b74  FMD-8000-0W    Minor
Host        : fwgams3
Platform    : SUNW,T5240        Chassis_id  :
Fault class : defect.sunos.fmd.nosub
Description : The Solaris Fault Manager received an event from a component to which no automated diagnosis software is currently subscribed.

Apr 04 15:54:22 b3895ac1-e2ad-c58f-f189-f2bf8fb0db53  SUN4V-8002-42  Critical
Host        : fwgams3
Platform    : SUNW,T5240        Chassis_id  :
Fault class : fault.memory.dimm-ue-imminent 95%
Affects     : mem:///unum=MB/CMP0/BR0/CH0/D1/J0600                  faulted but still in service
FRU         : "MB/CMP0/BR0/CH0/D1/J0600" (hc://:serial=00AD01101110A1A65B:part=511-1151-01-Rev-05/motherboard=0/chip=0/branch=0/dram-channel=0/dimm=1) 95%                  faulty
Description : A pattern of correctable errors has been observed suggesting the potential exists that an uncorrectable error may occur.

 

 

Use the prtdiag data or FRU data to determine the part number of the component.

Prtdiag Data

##### sysconfig/prtdiag (prtdiag -v) #####
System Configuration:  Oracle Corporation  sun4v SPARC T3-2
Memory size: 130560 Megabytes
CPU ID Frequency Implementation         Status
0      1649 MHz  SPARC-T3               on-line
...
255    1649 MHz  SPARC-T3               on-line

...
/SYS/MB/PCIE6     PCIE  SUNW,qlc-pciex1077,2532           QLE2562  <--- To obtain Oracle part # see doc 1373995.1
                        /pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0
/SYS/MB/PCIE6     PCIE  SUNW,qlc-pciex1077,2532           QLE2562
                        /pci@400/pci@1/pci@0/pci@8/SUNW,qlc@0,1
/SYS/MB/PCIE0     PCIE  SUNW,qlc-pciex1077,2532           QLE2562
                        /pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0
/SYS/MB/PCIE0     PCIE  SUNW,qlc-pciex1077,2532           QLE2562
                        /pci@400/pci@2/pci@0/pci@8/SUNW,qlc@0,1

 

FRU Data

 ##### Tx000/showfru  #####
             Part       Manufacturer        Part #         Ser #               Max Temp         Status
             /SYS/MB  Mitac Internat     5111392-02  AU01UL               101 (28 degrees C)  0x64 (MAINTENANCE REQUIRED, SUSPECT, DE
            /SYS/PDB  FOXCONN            5017697-09  G05KFH               101 (28 degrees C)  0x00 (OK)
         /SYS/PADCRD  FOXCONN            5111255-03  A10YC9               101 (28 degrees C)  0x00 (OK)
          /SYS/SASBP  FOXCONN            5111256-01  A20TLN               101 (28 degrees C)  0x00 (OK)
         /SYS/FANBD0  FOXCONN            5017695-04  E07T59               101 (28 degrees C)  0x00 (OK)
         /SYS/FANBD1  FOXCONN            5017695-04  E07T99               101 (28 degrees C)  0x00 (OK)
            /SYS/PS0  Power-One          3002138-03  A718CU
            /SYS/PS1  Power-One          3002138-03  A718CZ
              DIMM               Manufacturer       Vendor Part #      Part #     Ser #              Status
       /SYS/MB/CMP0/BR0/CH0/D0  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1091A63A            0x64 (MAINTENANCE REQUIRED, SUSPECT, DE
       /SYS/MB/CMP0/BR0/CH0/D1  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  10A1A65B            0x64 (MAINTENANCE REQUIRED, SUSPECT, DE
       /SYS/MB/CMP0/BR0/CH1/D0  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  10C1A63A            0x64 (MAINTENANCE REQUIRED, SUSPECT, DE
       /SYS/MB/CMP0/BR0/CH1/D1  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1031A66B            0x64 (MAINTENANCE REQUIRED, SUSPECT, DE
       /SYS/MB/CMP0/BR1/CH0/D0  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1041A635            0x00 (OK)
       /SYS/MB/CMP0/BR1/CH0/D1  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1051A65D            0x00 (OK)
       /SYS/MB/CMP0/BR1/CH1/D0  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  10B1A673            0x00 (OK)
       /SYS/MB/CMP0/BR1/CH1/D1  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  10C1A637            0x00 (OK)
       /SYS/MB/CMP1/BR0/CH0/D0  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  10B1A634            0x00 (OK)
       /SYS/MB/CMP1/BR0/CH0/D1  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1041A65C            0x00 (OK)
       /SYS/MB/CMP1/BR0/CH1/D0  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1051A65A            0x00 (OK)
       /SYS/MB/CMP1/BR0/CH1/D1  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  10C1A65B            0x00 (OK)
       /SYS/MB/CMP1/BR1/CH0/D0  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  10B1A65B            0x00 (OK)
       /SYS/MB/CMP1/BR1/CH0/D1  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1061A65D            0x00 (OK)
       /SYS/MB/CMP1/BR1/CH1/D0  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1031A65C            0x00 (OK)
       /SYS/MB/CMP1/BR1/CH1/D1  Hynix Semicond  HYMP125L72CP8D5-Y5    511-1151  1081A65D            0x00 (OK)

Output from "show -l all /" will also provide similar information if the admin account is not configured.

 


Hang

A hang is when some applications or OS functions operate properly while others appear dead.  The hardware and operating system do not detect a problem unless the SP watchdog detects it.  Hangs are caused by resource deadlocks due to operating system race conditions, or by resource deprivation due to one or more applications that are too needy.  Sometimes console messages may indicate the source of the hang, but typically a live core should be forced (see the sketch after the document list below) so that Sun's kernel group can analyze the data.  There is a small possibility that hangs can be caused by hardware, so please check for hardware problems & then contact the kernel group for isolation prior to transferring the SR.  Some Solaris documents that discuss procedures and configuration to isolate Solaris panics and hangs are as follows:

   DocID: 1004530.1 KERNEL: How to enable deadman kernel code
   DocID: 1012913.1 Troubleshooting Panics, dumps, hangs or crashes in the Solaris[TM] Operating System
   DocID: 1001950.1 Troubleshooting Suspected Solaris Operating System Hangs
   DocID: 1004506.1 How to force a crash when my machine is hung
   DocID: 1001950.1 When to Force a Solaris System Core File
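
When a hung host still responds to a break, a crash dump can usually be forced from the ALOM as sketched below.  This assumes the virtual keyswitch is not locked & that the OS still honors a break; see doc 1004506.1 above for the full procedure:

sc> break -y            (drops the host to the ok prompt if it responds)
sc> console -f
ok sync                 (forces a panic so savecore can write a crash dump during the next boot)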

The data needed to attempt isolation of hardware related hangs is similar to that for Fatal Resets:

  • LAST Output: From the explorer; contains the times of reboots.
  • Messages do not contain useful information, just a boot following an unrelated message.
  • Console Output - This typically contains a reason for the reset, for example critical component failure (or nothing for total power loss).  Output from ALOM command "consolehistory -v" will provide it.
  • SP events - This could contain sensor related events like under voltage conditions on one rail, OEM specific events like 0x12's, or an ALOM reboot if power was lost.
  • SP sensor data - This contains information if a sensor has a consistent problem like a voltage regulator or fan failure.
  • FMA Fault data - Check for a history of DIMM or disk problems.
  • SP field replaceable unit (FRU) data - This describes the hardware inventory configuration to assist with hardware replacement.  Collect this to determine if the system has the proper configuration (e.g. a partially installed memory bank).  A good item to check is the system board page in the Sun System Handbook.
  • Are boot disks internal or external?

 

LAST
Determine when the admin forced a reboot to stop the hang.

##### sysconfig/last-20-reboot.out  (last reboot) #####
reboot system boot Wed Jul 4 18:21
reboot system down Wed Jul 4 17:06

 

Messages

The message prior to the reboot is unrelated to the reboot.

##### messages/messages  (/var/adm/messages) #####
Jul  4 04:53:09 prod-db iscsi: [ID 632887 kern.warning] WARNING: iscsi connection(19) login failed - authentication failed with target
Jul  4 18:21:27 prod-db genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_142900-15 64-bit

 

SP Events

Notice that the event log has an indication of system shutdown via the power button.  Another example shows that the SP watchdog triggered a reboot.

##### Tx000/showlogs_-v  #####

Jul 04 18:10:56: Chassis |major : "System shutdown has been requested via power button."
Jul 04 18:11:00: Chassis |major : "System power off has been requested via power button."
Jul 04 18:11:01: Chassis |critical : "Host has been powered off"
Jul 04 18:11:15: Chassis |major : "System power on has been requested via power button."
Jul 04 18:11:16: Chassis |major : "Host has been powered on"
Jul 04 18:15:13: Chassis |major : "Host is running"

-------------------------

##### Tx000/showlogs_-v   #####
Dec 29 06:58:37: Chassis |critical: "SP Request to Reset Host due to Watchdog"
Dec 29 06:58:37: Chassis |major   : "Host is running"
Jan 03 20:24:36: Chassis |critical: "SP Request to Reset Host due to Watchdog"
Jan 03 20:24:36: Chassis |major   : "Host is running"

 

FMA Data

Check FMA data for a history of DIMM or disk problems.

##### fma/fmdump-eV  (fmdump -eV) #####
The first fmdump-eV entry is from Sep 14 2010 19:44:23.
---- FIRST DATE ----       ---- LAST DATE ----  COUNT  DEVICE
Sep 14 2010 19:44:23 thru May 17 2012 23:19:23    842  MB/CMP0/BR0: CH0/D0/J1001
Sep 20 2010 06:03:28 thru May 18 2012 00:32:03    493  MB/CMP0/BR1: CH1/D0/J1601

 

##### fma/fmadm-faulty  (fmadm faulty) #####
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 26 22:07:23 3c4922d5-5c1c-e4f9-e4c2-c86ac566231d  SUN4V-8002-42  Critical
Fault class : fault.memory.dimm-ue-imminent 95%
Affects     : mem:///unum=MB/CMP0/BR1:CH1/D0/J1601  faulted but still in service
Serial ID.  :        0
Description : A pattern of correctable errors has been observed suggesting the potential exists that an uncorrectable error may occur.

 

FRU Data

Check FRU data for faulted components or uncertified DIMMs.
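
In an unpacked explorer this check can be reduced to a quick search; the strings match the Status column shown below:

# grep -nE "MAINTENANCE REQUIRED|not certified" Tx000/showfru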

##### Tx000/showfru  (ALOM compat - showfru  ) #####
            Part       Manufacturer        Part #         Ser #               Max Temp         Status
             /SYS/SP  Celestica          5017822-06  5D014P               117 (44 degrees C)  0x00 (OK)
             /SYS/MB  Celestica          5407370-04  5N00GU               117 (44 degrees C)  0x00 (OK)
         /SYS/MB/REM  Celestica          5017821-06  5C01P0               117 (44 degrees C)  0x00 (OK)
              DIMM               Manufacturer       Vendor Part #      Part #     Ser #              Status
       /SYS/MB/CMP0/BR0/CH0/D0  Micron Technol  GR2DF4GBX8MT667Q2R    0000000   00000000            0x00 (OK)  DIMM possibly not certified!!!
       /SYS/MB/CMP0/BR0/CH1/D0  Micron Technol  GR2DF4GBX8MT667Q2R    0000000   00000000            0x00 (OK)  DIMM possibly not certified!!!
       /SYS/MB/CMP0/BR1/CH0/D0  Micron Technol  GR2DF4GBX8MT667Q2R    0000000   00000000            0x00 (OK)  DIMM possibly not certified!!!
       /SYS/MB/CMP0/BR1/CH1/D0  Micron Technol  GR2DF4GBX8MT667Q2R    0000000   00000000            0x00 (OK)  DIMM possibly not certified!!!
       /SYS/MB/CMP0/BR2/CH0/D0  Micron Technol  GR2DF4GBX8MT667Q2R    0000000   00000000            0x00 (OK)  DIMM possibly not certified!!!
       /SYS/MB/CMP0/BR2/CH1/D0  Micron Technol  GR2DF4GBX8MT667Q2R    0000000   00000000            0x00 (OK)  DIMM possibly not certified!!!
       /SYS/MB/CMP0/BR3/CH0/D0  Micron Technol  GR2DF4GBX8MT667Q2R    0000000   00000000            0x00 (OK)  DIMM possibly not certified!!!
       /SYS/MB/CMP0/BR3/CH1/D0  Micron Technol  GR2DF4GBX8MT667Q2R    0000000   00000000            0x00 (OK)  DIMM possibly not certified!!!

 


Boot Failure

If the system cannot boot, clearly we should not ask for an explorer since only Service Processor data is obtainable.  Please first determine if power is present.  If so, this could be an indication of disk failure or failure of another critical component.  Please remove power to reboot the ALOM to rule out memory leak problems.  If the ALOM is dead or fails to boot, either a power or an ALOM hardware problem exists, so the power LEDs must be checked first.  The ALOM typically obtains the 3.3V standby voltage from the PSU/PDB via a ribbon cable to the system board.  Have a field engineer remove either PSU to isolate a problem with it, then replace ribbon cables, the PDB & then the system board if not PSU related.  Always review the server's wiring diagram, which is linked to the Sun System Handbook's Full Component List page.

If the ALOM is operational, the data we should obtain is:

  • Did the admin perform any upgrades/changes prior to the problem?
  • Obtain states of the LEDs

 

If the system fails to get to the OK prompt, analyze the SP data requested above to isolate a problem.  The TSE should check the front & rear system views in the Sun System Handbook to determine which LEDs exist on that platform & request their status.  The sensor data is most important since it contains voltage related info.  The FRU & fault related data should then be checked for component failure.  If the data is not helpful, then a field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one.  The following is an indication of a possible counterfeit DIMM installed that also failed POST, but also look for system board regulator problems:

##### ipmi/@usr@local@bin@ipmiint_fru_print.out #####
  Jul 11 20:10:39: Chassis |minor   : "DIMMS at MB/CMP0/BR1/CH1/D1 and MB/CMP1/BR0/CH0/D1 have duplicate serial and part-dash-rev numbers"
  Jul 11 20:10:41: Fault   |critical: "SP detected fault at time Wed Jul 11 20:10:41 2012. /SYS/MB/CMP0/BR1/CH1/D1 Forced Fail (POST)"

 

If the system is able to get to the OK prompt, then the boot problem is most likely disk/controller related.  Have the admin attempt to boot the system via DVD (first check the minimal OS version for the platform!!!) or via the network.  If the system boots from an external boot source, then either the boot disk/controller has failed, something was done to change the boot parameters, or the boot image is corrupt.  Open an SR with the OS group for assistance.  Output from the following commands may be helpful:

  • ok> printenv           :Determine boot device  (internal or external!!!).  If external & an Oracle storage array, then the Oracle Storage group should be contacted
  • ok> devalias          :Lists nvaliases to relate the boot device with a PCI path
  • ok> probe-scsi-all   :Determine if boot device seen
  • ok> show-disks       :Lists disks & has options to add to dev alias
  • ok> boot -aV -s       :(See attached) Boot in ask me mode (respond with default settings) with the verbose option.  Determine if boot device correct (such as internal RAID controller path).
  • ok> boot -m verbose -s   or -m debug  :(See attached) (Solaris 10 & up) Boot showing the services as they start (see doc 1006328.1)
  • ok> boot -F failsafe   :(Solaris 10 U6 & up) Contact the OS group for usage  (see doc 1340586.1)
  • Boot from cdrom or from the net, then run format; select a disk; analyze; read.  This tests whether the disk is properly accessed by the controller (a sample session is sketched below).
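
A sample format session for that last check is sketched below; disk numbering & the exact menus vary by system, and the disk shown is hypothetical:

# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
Specify disk (enter its number): 0
format> analyze
analyze> read
analyze> quit
format> quit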

 



Attachments
This solution has no attachment