
Asset ID: 1-79-1518023.1
Update Date:2017-12-05
Keywords:

Solution Type: Predictive Self-Healing Sure

Solution  1518023.1 :   Troubleshooting data needed for SPARC based blades  


Related Items
  • SPARC T4-1B
  • Sun Blade T6340 Server Module
  • Sun Blade T6320 Server Module
  • SPARC T3-1B
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4




In this Document
Purpose
Details
 SP Hang
 Admin Reboot
 Fatal Reset
 OS Panic
 Hang
 Boot Failure
References


Applies to:

Sun Blade T6340 Server Module - Version Not Applicable and later
SPARC T3-1B - Version Not Applicable and later
Sun Blade T6320 Server Module - Version Not Applicable and later
SPARC T4-1B - Version Not Applicable and later
Information in this document applies to any platform.

Purpose

This document provides a high-level guide for hardware specialists on what data to collect & how to analyze major blade problems.  A similar document written from the software perspective is 1012913.1.

Details

Reasons for SR Creation & data to gather

A blade becomes unresponsive for one of four reasons, each covered in its own section below:

  • Admin Reboot
  • Fatal Reset
  • OS Panic
  • Hang

In most cases, we recommend that both explorers & ILOM snapshots be collected if possible.  The absolutely required information relates to how the blade crashed.  If the system is operational, see doc 1010911.1: "What to send to Oracle after a system panic and/or unexpected reboot" for the necessary data to gather.  That document lists the basic questions that must be asked to determine what data to gather.  Please note that an ILOM snapshot will contain the console output & most other SP related data for the T3-1B & T4-1B based blades.

An SR can also be opened for boot failure or for non-fatal reasons such as hardware failure where the blade continues to operate, either because some components are redundant or because performance is merely reduced when one fails:

  • Boot failure (The Ultimate Fatal Reset)
  • Redundant Components: Fans & PSUs
  • Performance Limited: DIMMs

An Explorer should be gathered if the system is bootable.  Features & problems with Explorer versions are as follows (a quick check of the installed version is sketched after this list):

  • 6.5 is required to collect Adaptec RAID controller card data.  See doc  1518433.1 for earlier versions.
  • 6.10 is the minimum version for T3-x & newer platforms since it collects ILOM data via ipmitool to gather FRU data.  With system FW 8.2.2.c it will also collect DIMM vendor part numbers.
    • During ipmi & ilom collection via the net, the ILOM's IP address & root password must be entered twice.
  • 7.03 can collect ILOM information via the vbsc with no prompts to the user.  See doc 1518044.1.
  • 7.03 has added collection timeouts, so some information may not be gathered.  See Explorer Timeout FAQs in doc 1287574.1
  • 8.00 may take 5 minutes to install on an S11 based system.
  • 8.00 collects LDOM information by default
  • 8.00 extended collection timeouts for ipmitool data.
  • 8.00 looks for ipmitool in /opt/ipmitool/sbin first.  Earlier versions must have the ipmitool executable copied to /opt/ipmitool/bin.
  • 8.03 Timeouts introduced in 7.03 were removed.  See doc 1612918.1 to determine why Explorer collection runs too long
  • 8.11 Resolves lack of FMA data collection introduced in 8.10
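
Before running a collection, it can help to confirm which Explorer version is installed.  A minimal check (assuming the standard SUNWexplo package) is:

   # pkginfo -l SUNWexplo | grep VERSION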

Explorers for T3-1B & newer blades must be collected as follows to gather ILOM data:

   # explorer -w ipmi,ipmiextended,ilomextended,default

 

ILOM data should typically be gathered by one of the following methods:

  • Snapshot (preferred) - Contains console history, FRU config, event logs, sensor information (a CLI sketch follows this list)
  • or ILOM Data: "show -l all /", "show /SP/console/history", "show /SP/logs/event/list", "show faulty"
  • or ipmitool Data: S10U11_17 contains some needed ipmitool fixes; otherwise ipmitool should be upgraded to a newer version (1516567.1):
    • ipmitool -V
    • ipmitool -I lanplus -H "SP ipaddress" -U root fru       (or use:  ipmitool -I bmc -U root fru)
    • ipmitool -I lanplus -H "SP ipaddress" -U root sel elist
    • ipmitool -I lanplus -H "SP ipaddress" -U root -v sdr
    • ipmitool -I lanplus -H "SP ipaddress" -U root sdr elist
    • ipmitool -I lanplus -H "SP ipaddress" -U root sdr list
    • ipmitool -I lanplus -H "SP ipaddress" -U root chassis status
    • ipmitool -I lanplus -H "SP ipaddress" -U root sunoem led get  ("sunoem sbled get"  to be used on ipmitool version 1.8.8)
    • ipmitool -I lanplus -H "SP ipaddress" -U root sensor
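
For reference, a snapshot can also be started from the ILOM CLI roughly as follows (the dataset & dump_uri values shown are placeholders; the browser interface may be used instead):

   -> set /SP/diag/snapshot dataset=normal
   -> set /SP/diag/snapshot dump_uri=sftp://user:password@host/directory
   -> show /SP/diag/snapshot result        (poll until the snapshot reports it is complete)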

 

Note: If the Net Management port is not configured, see doc 1473359.1 for an alternate procedure.

Note: In most cases, we recommend that both explorers & ILOM snapshots be collected if possible. If the issue is related to power or cooling, or affects more than one blade, a CMM snapshot will be required (see doc 1019322.1 on how to gather a CMM snapshot).

Note: Most NEMs that can be assigned an IP address can also gather a snapshot just like the CMM (see doc 1019322.1 on how to gather a CMM snapshot). All you need to do is browse to "https://NEM_ip_address" and follow the same instructions as you would with a CMM.

 


SP Hang

If the SP is not responding over the network, use a serial connection to check for faults and to verify that the network settings are correct. Resetting the SP (reset /SP) may restore functionality. If the SP does not respond on the serial connection either, it may be hung due to an SP failure or an ILOM memory leak. Check the service manual for the location of the NMI button; pressing this button will reset the SP.
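
As a quick sketch, a health check over the serial management connection (ILOM 3.x CLI syntax assumed) before resorting to the NMI button could be:

   -> show faulty
   -> show /SP/network        (verify ipaddress, netmask & gateway are correct)
   -> reset /SP               (resets only the SP; the running host is not affected)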


 

Admin Reboot

Sometimes the admin accidentally or purposely reboots a server via Solaris commands (like init 0), via the SP's break, reset, or poweroff commands, or by removing power.  Please note that the admin may do any of these to stop a hang condition instead of the recommended method below, which attempts to generate a core file.  This is easy to detect by the checks below (a grep sketch follows the list):

  • Check messages for signal 15 prior to the reboot, which indicates the admin performed an "init 6" or used some other method to reboot the host,
  • Check SP events to determine if the admin reset the host via the SP or the power button,
  • Check for power events followed by a reboot.
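
A minimal sketch of these checks against an explorer & snapshot (paths follow the example outputs shown below):

   # grep "signal 15" messages/messages
   # grep -i "rebooted by" ilom/@persist@hostconsole.log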

 

Last 20 Reboots

First review the reboots during the time of the incident as shown in the explorer.

##### sysconfig/last-20-reboot  (last reboot)  #####
reboot    system boot                   Tue Feb 20 04:39
reboot    system down                   Tue Feb 20 04:37
reboot    system boot                   Thu Feb 8 16:24
reboot    system down                   Thu Feb 8 16:20
reboot    system boot                   Mon Feb 4 04:39
reboot    system down                   Mon Feb 4 04:39

 

Then check the messages file for signs of signal 15 to determine if the admin did an init 6.

##### messages/messages (/var/adm/messages)  #####
Aug 14 20:23:55 sv62919 xntpd[841]: [ID 866926 daemon.notice] xntpd exiting on signal 15
Aug 14 20:23:55 sv62919 rpc.metad: [ID 702911 daemon.error] Terminated
Aug 14 20:24:02 sv62919 syslogd: going down on signal 15
...
Aug 14 21:39:36 sv62919 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-07 64-bit

 

Also check the ILOM for reasons for a possible restart.  Log entry 1 is the result of an ILOM /SYS stop & /SYS start.  Log entries c, d, & e prior to a reboot are a possible indication that one or more blade voltages are incorrect, so determine whether the SP was reset.
##### ilom/10.133.109.209/ipmitool_sel_elist.out   (show /SP/logs/event/list  or  ipmitool -H "SP IP" -U root sel elist)  #####

   1 | 01/04/2013 | 16:19:58 | System Boot Initiated | System Restart | Asserted
  
   c | 01/04/2013 | 16:23:11 | Voltage MB/V_+3V3_MAIN | Upper Critical going high | Reading 3.59 > Threshold 3.53 Volts
   d | 01/04/2013 | 16:23:12 | Voltage MB/V_+3V3_SLOT | Upper Critical going high | Reading 3.83 > Threshold 3.53 Volts
   e | 01/04/2013 | 16:23:15 | Voltage MB/V_+12V0 | Upper Non-recoverable going high | Reading 14.40 > Threshold 13.22 Volts

 

Please remember that if power is removed from and restored to the system, the ILOM messages file (seen in the snapshot) is restarted & the restart message indicates the time of the restart.

##### spos_logs/@var@log@messages #####

Jul 15 15:36:05 sparc-t4-1b-sca11-a-sp syslogd 1.5.0: restart.

 


Fatal Reset

Fatal Resets are hardware-detected problems, caused when the central processing unit (CPU) takes a trap that immediately drops the system to the OBP or worse.  No messages are logged, since the disk image is not updated following these events in order to prevent corruption.  The system may or may not be operational following a fatal reset.  One cause of a fatal reset is a watchdog reset, which occurs when the operating system fails to access the watchdog circuitry within its timeout period; this is really an operating system hang detected by the watchdog timer, so see the Hang section below for techniques to diagnose it.  Other causes of fatal resets are hardware failures such as loss of input voltage or other major hardware related issues.  No core file is saved, and the messages file shows normal operation followed by an abrupt system restart (no 'done' or 'dump succeeded' message).  The most important diagnosis information to retrieve is the following, which is mostly obtained through the service processor (SP), the ILOM, so a snapshot should be gathered!

  • LAST output - From the explorer; contains the times of reboots.
  • Messages - These do not contain useful information, just a boot following an unrelated message.
  • Console output - This typically contains a reason for the reset, for example a critical component failure (or nothing at all for total power loss).  An ILOM snapshot contains the SP info; output from the ILOM command "show /HOST/console/history" will also provide it.
  • SP faults
  • SP events - These could contain sensor related events like under-voltage conditions on one rail, OEM specific events like 0x12's, or an ILOM reboot if power was lost.
  • SP sensor data - This contains information if a sensor shows a consistent problem like a voltage regulator or fan failure.
  • SP field replaceable unit (FRU) data - This describes the hardware inventory configuration to assist with hardware replacement.  Collect this to determine if the system has the proper configuration (e.g. a partially installed memory bank).  A good item to check is the system board page in the Sun System Handbook.
  • Is the system operational?  If not, see Boot Failure after data analysis.

If the cause of the reboot or crash cannot be quickly determined from the information above, it is important to perform hardware diagnostics such as a full power on self test (POST) or SunVTS to determine whether the hardware is stable.  Typically the field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one.
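
For example, on T3/T4-era ILOM firmware, maximum POST can typically be requested with settings along these lines (property values may differ slightly by firmware release, so verify against the platform documentation):

   -> set /HOST/diag mode=normal
   -> set /HOST/diag level=max
   -> set /HOST/diag verbosity=max
   -> set /HOST/diag trigger=power-on-reset
   -> reset /SYS              (or stop /SYS followed by start /SYS)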

 

Last 20 Reboots

First review the reboots during the time of the incident.

##### sysconfig/last-20-reboot  (last reboot)  #####
reboot    system boot                   Tue Mar  6 01:13
reboot    system down                   Tue Mar  6 00:29
reboot    system boot                   Tue Mar  6 00:24
reboot    system down                   Tue Mar  6 00:13
reboot    system boot                   Mon Mar  5 21:40
reboot    system down                   Mon Mar  5 15:01

 

Messages

Then review the explorer's messages for each reboot to determine when the fatal reset occurred (no preceding 'done' or 'dump succeeded' message).

##### messages/messages  (/var/adm/messages) #####

Mar  5 15:00:52 osdldom50 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(2): Link ONLINE
Mar  5 21:40:04 osdldom50 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-01 64-bit

--------------------------------------

Jul  3 17:03:22 ctrstapp01      Corrupt label; wrong magic number
Jul  5 13:19:53 ctrstapp01 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_142900-02 64-bit

 

Console Output

For the moment, console output is only collected by a snapshot or ILOM command:

##### ilom/@persist@hostconsole.log (show /SP/console/history) ###
OpenBoot v. 4.33.6. @(#)OpenBoot 4.33.6.a
2012/03/29 11:22 May 24 11:31:10 t4-4-bur09-a reboot: rebooted by root
OpenBoot 4.33.6.a, 523776 MB memory available, Serial #97968252.
SunOS Release 5.10 Version Generic_147440-01 64-bit
May 24 11:33:09 t4-4-bur09-a scsi: WARNING: /pci@400/pci@1/pci@0/pci@0/LSI,sas@0/iport@v0/disk@w3461186e1b9925ce,0 (sd9): 

 

SP Events

The ILOM event log can be obtained by an explorer, by a snapshot, or by ILOM commands:

 ##### ipmi/ipmitool_sel_elist.out  (show /SP/logs/event)  #####
   1 | 01/04/2013 | 16:19:58 | System Boot Initiated | System Restart | Asserted
  
   c | 01/04/2013 | 16:23:11 | Voltage MB/V_+3V3_MAIN | Upper Critical going high | Reading 3.59 > Threshold 3.53 Volts
   d | 01/04/2013 | 16:23:12 | Voltage MB/V_+3V3_SLOT | Upper Critical going high | Reading 3.83 > Threshold 3.53 Volts
   e | 01/04/2013 | 16:23:15 | Voltage MB/V_+12V0 | Upper Non-recoverable going high | Reading 14.40 > Threshold 13.22 Volts

 

SP Faults

This is only obtained via snapshot or ILOM command, as follows:

##### fma/@usr@local@bin@fmadm_faulty.out  (show faulty)  #####
------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2012-06-21/17:03:09 8be7d7d8-f047-efb9-ba98-e4f69ac676cd SUN4V-8000-E2  Critical
Fault class : fault.memory.bank
FRU         : /SYS/MB/CMP0/BOB1/CH0/D1
              (Part Number: 07020577)
              (Serial Number: 00CE02123486CFC6F5)


Description : A fault has been diagnosed by the Host Operating System.

2012-06-21/17:03:09 8be7d7d8-f047-efb9-ba98-e4f69ac676cd SUN4V-8000-E2  Critical
Fault class : fault.memory.bank
FRU         : /SYS/MB/CMP0/BOB1/CH1/D1
              (Part Number: 07020577)
              (Serial Number: 00CE02123486CFC6F4)
Description : A fault has been diagnosed by the Host Operating System.

 

ILOM snapshots typically contain verbose ereport data in files: fma/fmdump-eV.out, elogs/elogs-v.out, or elogs/@usr@local@bin@elogs_-eV.out.  An explorer will also contain a list of existing or repaired faults in file: fma/fmdump.out.

 

SP Sensor Data

Sensor data can be obtained by the explorer (as below), by a snapshot, or by ILOM command:

##### ipmi/ipmitool_sdr_list_all_info.out  (show -l all /)  #####
   200 |MB/CMP0/V_VCORE    | .98 Volts        | failed



OS Panic

OS panics are software-detected problems, caused when the operating system detects that the integrity of data is suspect or in danger of being corrupted.  The panic routine will typically place a panic message into the messages file (captured by explorer) & console output (captured by snapshot) & create a core dump if properly configured in dumpadm.  Panics can be caused either by operating system coding errors, which are typically fixed by patches, or by hardware related problems.  Uncorrectable hardware errors are typically related to DIMM UEs, but problems with firmware are another possibility.  PCI fabric panics are also typically associated with hardware problems, but driver issues must be checked.  If the fabric panic is HBA related, then also collaborate with the storage group to determine if HBA firmware or SAN drivers are involved.
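
A quick way to confirm that the host is configured to save a crash dump (Solaris 10/11):

   # dumpadm                  (shows the dump device, savecore directory & whether savecore runs on reboot)
   # dumpadm -y               (enable savecore on reboot if it was disabled)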

If the panic is software related, collect the core dump for analysis.  If hardware related then collect the following data so the problem can be isolated:

  • Panic message as found in the explorer messages file (but not always) or snapshot console output.  This describes the type of panic.  The reboot is typically preceded by a 'dump succeeded' message.
  • FMA Data collected by explorer or a snapshot which may isolate a failed DIMM or PCI path.  The snapshot may contain the fatal ereport that doesn't make it to Solaris FMA.
  • Prtdiag data collected by explorer, for PCI related panics.  It lists the PCI paths with the associated card type & lists CPU faults.  Use doc 1373995.1 to determine the associated Oracle PCI part number.
  • SP FRU data. This describes the hardware configuration to assist with hardware replacement.

If the cause of the panic/reboot cannot be quickly determined from the information above, it is important to perform hardware diagnostics such as a full power on self test (POST) or SunVTS to determine whether the hardware is stable.  Typically the field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one.

Bug 6983432 (Repaired FMA fault reports sent to ILOM after reboot) can affect ILOM based systems by faulting a component on the ILOM which failed in the past & has already been repaired or replaced.  In one case it faulted a component that had been removed.  The resolution is to install FMD patch 147790-01; the workaround is to run the Solaris command fmadm flush "component", which removes the FMA repair records so they do not get resent on subsequent reboots.
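
A sketch of the workaround (the resource shown is illustrative; use the one reported by fmadm faulty):

   # fmadm faulty
   # fmadm flush mem:///unum=MB/CMP0/BR0/CH0/D1/J0601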

 

Messages

The following panics are typically related to hardware or firmware problems (but not always).

##### messages/messages  (/var/adm/messages) #####
May  2 05:53:47 Sun04535 ^Mpanic[cpu53]/thread=2a102c49ca0:
May  2 05:53:47 Sun04535 unix: Fatal error has occured in: PCIe fabric.(0x1)(0x43)
May  2 05:53:47 Sun04535 unix:
May  2 05:53:47 Sun04535 unix: 000002a102c496f0 px:px_err_panic+1ac (19db400, 7be43800, 43, 2a102c497a0, 1, 0)
...
May  2 06:06:26 nk11p04mm-mail04535 unix: dump succeeded
May  2 06:08:08 nk11p04mm-mail04535 unix: ^MSunOS Release 5.10 Version Generic_147440-12 64-bit

-------------------------


May  8 22:50:05 Sungams3 ^Mpanic[cpu64]/thread=3004519c680:
May  8 22:50:05 Sungams3 unix: [ID 400509 kern.notice] Unrecoverable hardware error
May  8 22:50:05 Sungams3 unix: [ID 100000 kern.notice]
May  8 22:50:05 Sungams3 genunix: [ID 723222 kern.notice] 000002a1076b8a90 unix:process_nonresumable_error+298 (2a1076b8c80, 0, 1, 40, 0, 0)

FMA Data

FMA data can typically isolate the hardware problem to a specific DIMM or PCI card.

##### fma/fmdump-eV  (fmdump -eV) #####
The first fmdump-eV entry is from Feb 21 2012 22:36:02.
---- FIRST DATE ----       ---- LAST DATE ----  COUNT  DEVICE
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   4978  /pci@400
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   2780  /pci@400/pci@1
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   2495  /pci@400/pci@1/pci@0
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   2497  /pci@400/pci@1/pci@0/pci@4
Feb 21 2012 22:36:02 thru May 02 2012 05:53:47   2828  /pci@400/pci@1/pci@0/pci@4/SUNW,qlc@0

-------------------------

##### fma/fmadm-faulty  (fmadm faulty) #####
Dec 03 12:23:10 162ab54b-4918-438e-ae4f-83a9280f4ab7 SUNOS-8000-FU Major

Host : trafioracp01
Platform : ORCL,SPARC-T4-1B Chassis_id :
Product_sn :

Fault class : defect.sunos.eft.undiag.fme
FRU : None faulty

Description : The diagnosis engine encountered telemetry for which it was unable to perform a diagnosis.

-------------------------

##### fma/fmdump-eV  (fmdump -eV) #####
The first fmdump-eV entry is from Apr 13 2011 03:09:52.

---- FIRST DATE ----       ---- LAST DATE ----  COUNT  DEVICE
Apr 13 2011 03:09:52 thru May 17 2011 15:06:18  64214  MB/CMP1/BR1/CH0
Apr 13 2011 03:13:13 thru May 17 2011 15:06:18   5426  MB/CMP1/BR1: CH0/D0/J3201
Apr 13 2011 04:07:22 thru May 16 2011 12:30:29     22  MB/CMP1/BR1: CH0/D0/J3201 CH1/D0/J3601


##### fma/fmadm-faulty  (fmadm faulty) #####
Nov 14 18:25:45 21c312ab-4c48-ee92-c762-e2680ae35b74  FMD-8000-0W    Minor
Host        : av2s728p
Platform    : SUNW,Sun-Blade-T6340        Chassis_id  :
Fault class : defect.sunos.fmd.nosub
Description : The Solaris Fault Manager received an event from a component to which no automated diagnosis software is currently subscribed.

Nov 14 18:25:45 59226e87-abce-4add-e27d-def4d6469faa SUN4V-8002-42 Critical
Host : av2s728p
Platform : SUNW,Sun-Blade-T6340 Chassis_id :
Fault class : fault.memory.dimm-ue-imminent 95%
Affects : mem:///unum=MB/CMP0/BR0/CH0/D1/J0601                     faulted but still in service
FRU : "MB/CMP0/BR0/CH0/D1/J0601" (hc://:serial=00CE010829050C2E88:part=511-1151-01-Rev-50/motherboard=0/chip=0/branch=0/dram-channel=0/dimm=1) 95%
faulty

Description : A pattern of correctable errors has been observed suggesting the potential exists that an uncorrectable error may occur.

 

Use the prtdiag data or FRU data to determine the part number of the component.

Prtdiag Data

##### sysconfig/prtdiag: #####
System Configuration:  Oracle Corporation  sun4v SPARC T3-1B
Memory size: 130560 Megabytes
CPU ID Frequency Implementation         Status
0      1649 MHz  SPARC-T3               on-line
...
128    1649 MHz  SPARC-T3               on-line

...
/SYS/MB/PCI-EM1     PCIE  SUNW,qlc-pciex1077,2532           QLE2562  <--- To obtain Oracle part # see doc 1373995.1
                        /pci@400/pci@1/pci@0/pci@4/SUNW,qlc@0
/SYS/MB/PCI-EM1     PCIE  SUNW,qlc-pciex1077,2532           QLE2562
                        /pci@400/pci@1/pci@0/pci@4/SUNW,qlc@0,1
/SYS/MB/PCI-EM0     PCIE  SUNW,qlc-pciex1077,2532           QLE2562
                        /pci@400/pci@2/pci@0/pci@4/SUNW,qlc@0
/SYS/MB/PCI-EM0     PCIE  SUNW,qlc-pciex1077,2532           QLE2562
                        /pci@400/pci@2/pci@0/pci@4/SUNW,qlc@0,1

 

FRU Data

##### ipmi/ipmitool_fru.out  (show -l all /)  See doc 1411086.1 #####

      Part        Manufacturer        Part #        Ser #
 Builtin FRU Dev  Oracle Corporatio                                   
           SYS/MB 5030 CELESTICA CO 7015272 465769T+1236Y60316  
            /SYS  Oracle Corporatio               1236NN14J8          
           MB/SP  5030 CELESTICA CO  7019998     465769T+1236Y702MX
           /SYS/MB/REM 10080 LSI CO 375-3643     0291IPT-1229001739 
           /SYS/MB/FEM0 Intel Malaysia 375-3648    02089D <<<<---------<< (optional)
 /SYS/MB/CMP0/BOB0/CH0/D0   Samsung  07020577     00CE02123486CFC76D
 /SYS/MB/CMP0/BOB0/CH0/D1   Samsung  07020577     00CE02123486CFC76A
 /SYS/MB/CMP0/BOB0/CH1/D0   Samsung  07020577     00CE02123486CFC6F0
 /SYS/MB/CMP0/BOB0/CH1/D1   Samsung  07020577     00CE02123486CFC6F8
 /SYS/MB/CMP0/BOB1/CH0/D0   Samsung  07020577     00CE02123486CFC773 
 /SYS/MB/CMP0/BOB1/CH0/D1   Samsung  07020577     00CE02123486CFC6F5
 /SYS/MB/CMP0/BOB1/CH1/D0   Samsung  07020577     00CE02123486CFC790
 /SYS/MB/CMP0/BOB1/CH1/D1   Samsung  07020577     00CE02123486CFC78C
 /SYS/MB/CMP0/BOB2/CH0/D0   Samsung  07020577     00CE02123486CFC775
 /SYS/MB/CMP0/BOB2/CH0/D1   Samsung  07020577     00CE02123486CFC770
 /SYS/MB/CMP0/BOB2/CH1/D0   Samsung  07020577     00CE02123486CFC774
 /SYS/MB/CMP0/BOB2/CH1/D1   Samsung  07020577     00CE02123486CFC72A
 /SYS/MB/CMP0/BOB3/CH0/D0   Samsung  07020577     00CE02123486CFC6F6
 /SYS/MB/CMP0/BOB3/CH0/D1   Samsung  07020577     00CE02123486CFC766
 /SYS/MB/CMP0/BOB3/CH1/D0   Samsung  07020577     00CE02123486CFC6F2
 /SYS/MB/CMP0/BOB3/CH1/D1   Samsung  07020577     00CE02123486CFC76F
             /SYS/NEM0 <no S/N>   541-3770-03     1005LCB-1229RW025N
             /SYS/NEM1 <no S/N>   541-3770-03     1005LCB-1230RW0270
             /SYS/PS0   <no S/N>   300-2259-03     465824T+1222B80013
             /SYS/PS1   <no S/N>   300-2259-03     465824T+1121B80157

 


Hang

A hang is when some applications or OS functions operate properly while others appear dead.  The hardware and operating system do not detect a problem unless the SP watchdog does.  Hangs are caused by resource deadlocks due to operating system race conditions, or by resource starvation due to one or more applications that are too needy.  Sometimes console messages indicate the source of the hang, but typically a live core should be forced so that the kernel group can analyze the data (a sketch of forcing one from the ILOM follows the document list below).  There is a small possibility that hangs can be caused by hardware, so please check for hardware problems, then contact the kernel group for isolation prior to transferring the SR.  They may wish to have output from the GUDS tool to determine OS resource statistics.  Some Solaris documents that discuss procedures and configuration to isolate Solaris panics and hangs are as follows:

   DocID: 1004530.1 KERNEL: How to enable deadman kernel code
   DocID: 1012913.1 Troubleshooting Panics, dumps, hangs or crashes in the Solaris[TM] Operating System
   DocID: 1001950.1 Troubleshooting Suspected Solaris Operating System Hangs
   DocID: 1004506.1 How to force a crash when my machine is hung
   DocID: 1001950.1 When to Force a Solaris System Core File
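
If a live crash dump must be forced from a hung host (per 1004506.1), one approach from the ILOM on these blades is sketched below; confirm the dump configuration & consult the kernel group before using it on a production system:

   -> cd /HOST
   -> set send_break_action=dumpcore        (requests that the hung Solaris host panic & write a crash dump)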

The data needed to attempt isolation of hardware related hangs is similar to Fatal Resets, so mainly SP data is required:

  • LAST output - From the explorer; contains the times of reboots.
  • Messages - These do not contain useful information, just a boot following an unrelated message.
  • Console output - This typically contains a reason for the reset, for example a critical component failure (or nothing at all for total power loss).  An ILOM snapshot contains the SP info; output from the ILOM command "show /HOST/console/history" will also provide it.
  • SP events - These could contain sensor related events like under-voltage conditions on one rail, OEM specific events like 0x12's, or an ILOM reboot if power was lost.
  • SP sensor data - This contains information if a sensor shows a consistent problem like a voltage regulator or fan failure.
  • FMA fault data - Check for a history of DIMM or disk problems.
  • SP field replaceable unit (FRU) data - This describes the hardware inventory configuration to assist with hardware replacement.  Collect this to determine if the system has the proper configuration (e.g. a partially installed memory bank).  A good item to check is the system board page in the Sun System Handbook.
  • Are the boot disks internal or external?

 

LAST
Determine when the admin forced a reboot to stop the hang.

##### sysconfig/last-20-reboot.out #####
reboot system boot Wed Jul 4 18:21
reboot system down Wed Jul 4 17:06

 

Messages

The message prior to the reboot is unrelated to the reboot; also check whether the user caused a panic via the ILOM break.

##### messages/messages  (/var/adm/messages) #####
Jul  4 04:53:09 prod-db iscsi: [ID 632887 kern.warning] WARNING: iscsi connection(19) login failed - authentication failed with target
Jul  4 18:21:27 prod-db genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_142900-15 64-bit

 

Use the data below to look for possible hardware problems:

SP Events

Notice that the event log has an indication of system shutdown via the power button.  Another example shows that the SP watchdog triggered a reboot.

#####  ipmi/ipmitool_sel_elist.out   #####

Jul 04 18:10:56: Chassis |major : "System shutdown has been requested via power button."
Jul 04 18:11:00: Chassis |major : "System power off has been requested via power button."
Jul 04 18:11:01: Chassis |critical : "Host has been powered off"
Jul 04 18:11:15: Chassis |major : "System power on has been requested via power button."
Jul 04 18:11:16: Chassis |major : "Host has been powered on"
Jul 04 18:15:13: Chassis |major : "Host is running"

-------------------------

Did an admin recently log into the ILOM prior to the reset to do "cd /HOST; set send_break_action=break" during this time?


Oct 20 16:47:15 tesp07328 SC Alert: [ID 113266 daemon.notice] Audit | minor: admin : Open Session : object = "/SP/session/type" : value = "shell" : success
Oct 20 16:48:13 tesp07328 SC Alert: [ID 354481 daemon.notice] Audit | minor: admin : Close Session : object = "/SP/session/type" : value = "shell" : success
Oct 20 16:48:19 tesp07328 SC Alert: [ID 113266 daemon.notice] Audit | minor: admin : Open Session : object = "/SP/session/type" : value = "shell" : success

-------------------------


Dec 29 06:58:37: Chassis |critical: "SP Request to Reset Host due to Watchdog"
Dec 29 06:58:37: Chassis |major   : "Host is running"
Jan 03 20:24:36: Chassis |critical: "SP Request to Reset Host due to Watchdog"
Jan 03 20:24:36: Chassis |major   : "Host is running"

 

VBSC data from the ILOM snapshot can also be useful:

##### ilom/@persist@vbsc@vbsc.log  #####
DEBUG: check_poweron_button: pwr_status = ?
NOTICE: System shutdown has been requested.
NOTICE: System power off has been requested via power button.

 

FMA Data

Check FMA data for a history of DIMM or disk problems.

##### fma/fmdump-eV: #####
The first fmdump-eV entry is from Sep 14 2010 19:44:23.
---- FIRST DATE ----       ---- LAST DATE ----  COUNT  DEVICE
Apr 13 2011 03:09:52 thru May 17 2011 15:06:18  64214  MB/CMP1/BR1/CH0
Apr 13 2011 03:13:13 thru May 17 2011 15:06:18   5426  MB/CMP1/BR1: CH0/D0/J3201
Apr 13 2011 04:07:22 thru May 16 2011 12:30:29     22  MB/CMP1/BR1: CH0/D0/J3201 CH1/D0/J3601

 

##### fma/fmadm-faulty: #####
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 26 22:07:23 3c4922d5-5c1c-e4f9-e4c2-c86ac566231d  SUN4V-8002-42  Critical
Fault class : fault.memory.dimm-ue-imminent 95%
Affects     : mem:///unum=MB/CMP1/BR1:CH0/D0/J3201  faulted but still in service
Serial ID.  :        0
Description : A pattern of correctable errors has been observed suggesting the potential exists that an uncorrectable error may occur.

 

FRU Data

Check FRU data for faulted components or uncertified DIMMs.

##### ipmi/ipmitool_fru.out (show -l all /)   See doc 1411086.1 #####
      Part        Manufacturer        Part #        Ser #
 Builtin FRU Dev  Oracle Corporatio                                   
           /SYS/MB 5030 CELESTICA CO 7015272 465769T+1236Y60316  
            /SYS  Oracle Corporatio               1236NN14J8          
           /SYS/MB/SP  5030 CELESTICA CO  7019998     465769T+1236Y702MX
 /SYS/MB/REM 10080 LSI CO 375-3643 0291IPT-1229001739
/SYS/MB/FEM0 Intel Malaysia 375-3648 02089D <<<<---------<< (optional)
 /SYS/MB/CMP0/BOB0/CH0/D0   Samsung  07020577     00CE02123486CFC76D
 /SYS/MB/CMP0/BOB0/CH0/D1   Samsung  07020577     00CE02123486CFC76A
 /SYS/MB/CMP0/BOB0/CH1/D0   Samsung  07020577     00CE02123486CFC6F0
 /SYS/MB/CMP0/BOB0/CH1/D1   Samsung  07020577     00CE02123486CFC6F8
 /SYS/MB/CMP0/BOB1/CH0/D0   Samsung  07020577     00CE02123486CFC773
 /SYS/MB/CMP0/BOB1/CH0/D1   Samsung  07020577     00CE02123486CFC6F5
 /SYS/MB/CMP0/BOB1/CH1/D0   Samsung  07020577     00CE02123486CFC790
 /SYS/MB/CMP0/BOB1/CH1/D1   Samsung  07020577     00CE02123486CFC78C
 /SYS/MB/CMP0/BOB2/CH0/D0   Samsung  07020577     00CE02123486CFC775
 /SYS/MB/CMP0/BOB2/CH0/D1   Samsung  07020577     00CE02123486CFC770
 /SYS/MB/CMP0/BOB2/CH1/D0   Samsung  07020577     00CE02123486CFC774
 /SYS/MB/CMP0/BOB2/CH1/D1   Samsung  07020577     00CE02123486CFC72A
 /SYS/MB/CMP0/BOB3/CH0/D0   Samsung  07020577     00CE02123486CFC6F6
 /SYS/MB/CMP0/BOB3/CH0/D1   Samsung  07020577     00CE02123486CFC766
 /SYS/MB/CMP0/BOB3/CH1/D0   Samsung  07020577     00CE02123486CFC6F2
 /SYS/MB/CMP0/BOB3/CH1/D1   Samsung  07020577     00CE02123486CFC76F
 /SYS/NEM0 <no S/N>   541-3770-03     1005LCB-1229RW025N
             /SYS/NEM1 <no S/N>  541-3770-03     1005LCB-1230RW0270
             /SYS/PS0 <no S/N>   300-2259-03     465824T+1222B80013
             /SYS/PS1 <no S/N>  300-2259-03     465824T+1121B80157  

 


Boot Failure

If the entire chassis is dead, there is a good chance that a lightning strike may have severely damaged the system.  An action plan would be to:

  • Remove all FRUs,
  • Determine if the CMM & one PSU are operational, else insert a replacement PSU & CMM,
  • Determine if the CMM communicates on its serial port, then configure the IP information for network access (see the sketch after this list),
  • Insert FRUs one by one & monitor the CMM for faults to determine whether a component has failed.
  • Obtain a snapshot once complete
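
A minimal sketch of configuring the CMM network settings from its serial port (addresses are placeholders; property names assume CMM ILOM 3.x):

   -> set /CMM/network pendingipdiscovery=static
   -> set /CMM/network pendingipaddress=192.0.2.10 pendingipnetmask=255.255.255.0 pendingipgateway=192.0.2.1
   -> set /CMM/network commitpending=true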

 

If the blade cannot boot cleanly, we should not ask for an explorer since only Service Processor data is obtainable. Please first determine whether power is present.  If so, this could be an indication of disk failure or failure of another critical component.  These blades use the ILOM, and sometimes a memory leak can make the ILOM only partially operational.  Please remove the blade (then plug it back in) to reboot the ILOM and rule out memory leak problems.  If the ILOM is dead or fails to boot, either blade or ILOM hardware problems exist. The ILOM typically obtains the 3.3V standby voltage from the chassis PSUs via the midplane.  Have a field engineer replace the ILOM first and then the blade (in that order) to fix.

If the ILOM is operational, the data we should obtain is:

  • Data from the ILOM as indicated in the overview,
  • Did the admin perform any OS / firmware / hardware upgrades/changes prior to the problem?
  • States of the LEDs

 

If the system fails to get to the OK prompt, analyze the SP data requested above to isolate a problem.  The TSE should check the front & rear system views in the Sun System Handbook to determine which LEDs exist on that platform & request their status.  The sensor data is most important since it contains voltage related information.  The FRU & fault related data should then be checked for component failure.  If the data is not helpful, then a field engineer should bring the system to a minimal configuration (minimal DIMMs, no PCI cards, minimal PSUs, ...) to isolate these types of problems & then add or replace components to isolate the failed one.  The following is an indication of a possible counterfeit DIMM installed that also failed POST, but also look for system board regulator problems:

##### ipmi/@usr@local@bin@ipmiint_fru_print.out #####
  Jul 11 20:10:39: Chassis |minor   : "DIMMS at MB/CMP0/BR1/CH1/D1 and MB/CMP1/BR0/CH0/D1 have duplicate serial and part-dash-rev numbers"
  Jul 11 20:10:41: Fault   |critical: "SP detected fault at time Wed Jul 11 20:10:41 2012. /SYS/MB/CMP0/BR1/CH1/D1 Forced Fail (POST)"

 

If the system is able to get to the OK prompt, then the boot problem is most likely disk/controller related.  Have the admin attempt to boot the system via DVD (check the minimum supported OS for the platform!) or via the network.  If the system boots from an external boot source, then either the boot disk/controller has failed, something was done to change the boot parameters, or the boot image is corrupt.  Open an SR into the OS group for assistance.  Output from the following commands may be helpful:

  • ok> printenv           :Determine the boot device (internal or external).  If external & an Oracle storage array, then open an SR into the Storage group
  • ok> probe-scsi-all   :Determine if the boot device is seen
  • ok> devalias          :Lists the currently defined device aliases
  • ok> show-disks       :Lists disks & has options to add to a devalias
  • ok> boot -aV -s       :(See attachment) Boot in "ask me" mode (respond with default settings) with the verbose option.  Determine if the boot device is correct (such as the internal RAID controller path).
  • ok> boot -m verbose -s   or -m debug  :(See attachment)  (Solaris 10 & up) Boot showing the services as they start (see doc 1006328.1)
  • ok> boot -F failsafe   :(Solaris 10 U6 & up) Contact the OS group for usage (see doc 1340586.1)
  • Boot from cdrom or from the net, then run format, select a disk, then analyze and read, which will test whether the disk is properly accessed by the controller (a rough transcript follows this list).
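
A rough transcript of that disk read test after booting from alternate media (menu selections abbreviated):

   # format
   ... select the suspect disk from the menu ...
   format> analyze
   analyze> read              (non-destructive surface read test)
   analyze> quit
   format> quit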

 



Attachments
This solution has no attachment