Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1540394.1
Update Date:2017-10-11

Solution Type  Predictive Self-Healing Sure

Solution  1540394.1 :   SPARC M5-32 and M6-32 Servers: How to deal with a hung or unresponsive Physical Domain  


Related Items
  • SPARC M5-32
  • SPARC M6-32
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx-32




In this Document
Purpose
Scope
Details
 Responding to a hung PDomain
References


Applies to:

SPARC M5-32 - Version All Versions and later
SPARC M6-32 - Version All Versions and later
Information in this document applies to any platform.

Purpose

This document describes how to deal with a hung or unresponsive M5-32/M6-32 Physical Domain (PDomain) and explains how to send a break to the domain for recovery.

Scope

This document explains how to recognize a hung PDomain and describes the recovery options available. It also explains how to enable or disable the break signal in ILOM.

For the procedure to deal with a hung or unresponsive Logical Domain (LDom, also known as a Guest domain), see the following document:

How to Collect a Forced Crash Dump of a Hanging Solaris Guest LDom (Doc ID 1020884.1)

Details

Responding to a hung PDomain

If you suspect that an M5-32/M6-32 PDomain is hung or unresponsive, first confirm the hang before attempting any recovery.

In the commands below, replace HOSTx with HOST0, HOST1, HOST2 or HOST3.

HOSTx is a soft link to /Servers/PDomains/PDomain_x/HOST, so either path can be used.

Verify that the PDomain is hung


Step#1 Verify the status of the HOSTx

The status should be "Solaris running". The host must not be powered off, at OBP, or still starting up.

 

-> show /HOSTx status

  /HOSTx
    Properties:
        status = Solaris running
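
When the check in Step#1 is scripted, the status value can be pulled out of the captured command output. The following is a minimal sketch only: it assumes the output of 'show /HOSTx status' has already been captured as text by some other means, and the function name parse_host_status is hypothetical, not part of ILOM.

```python
import re
from typing import Optional

def parse_host_status(show_output: str) -> Optional[str]:
    """Return the value of the 'status = ...' line, or None if it is absent.

    Hypothetical helper: expects text in the layout printed by
    'show /HOSTx status' as shown in this document.
    """
    match = re.search(r"^\s*status\s*=\s*(.+?)\s*$", show_output, re.MULTILINE)
    return match.group(1) if match else None

# Sample text copied from the layout shown above (HOST0 used as an example).
SAMPLE_SHOW = """
  /HOST0
    Properties:
        status = Solaris running
"""

if __name__ == "__main__":
    print(parse_host_status(SAMPLE_SHOW))
```

The hang procedure only applies when the returned value is "Solaris running"; any other value points to a host that is powered off, at OBP, or still booting.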


Step#2 Verify that ping and ssh are not working for the host

# ping <host IP>
# ssh <host IP>

 

Sometimes ping works but ssh does not. This indicates a soft hang or an underlying issue with the network stack.

 

Step#3 Verify that console is hung

-> start -f /HOSTx/console
Are you sure you want to start /HOSTx/console (y/n)? y

Serial console started.  To stop, type #.

 

Note that -f (or -force) is used to connect to the console. If the console is already in use by another session or user, a session started without -f is read-only and can appear to be hung.

Using -f disconnects the other user, so an unresponsive console cannot be mistaken for a read-only session.

If the console connects but shows no output and does not respond to any input, this qualifies as a hang and rules out a network issue.

Once all of the above is verified, the domain is deemed hung. Proceed to the next step for recovery.
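
The three verification steps above can be summarized as a small decision function. This is a sketch under stated assumptions: the function name and the result labels ("hung", "possible_soft_hang", "not_confirmed", "not_applicable") are hypothetical and chosen for illustration only, and the four inputs are expected to come from the manual checks in Step#1 to Step#3.

```python
def classify_pdomain(status: str, ping_ok: bool, ssh_ok: bool,
                     console_responsive: bool) -> str:
    """Combine the results of Step#1 to Step#3 into one verdict.

    Labels are illustrative only:
      not_applicable      host is not in 'Solaris running' state
      not_confirmed       ssh or the console still responds
      possible_soft_hang  ping answers, but ssh and the console do not
      hung                ping, ssh and the console all fail
    """
    if status != "Solaris running":
        # Powered off, at OBP or still booting: this procedure does not apply.
        return "not_applicable"
    if ssh_ok or console_responsive:
        # At least one access path works, so the hang is not confirmed.
        return "not_confirmed"
    if ping_ok:
        # Ping without ssh points to a soft hang or a network stack issue.
        return "possible_soft_hang"
    return "hung"

if __name__ == "__main__":
    print(classify_pdomain("Solaris running", False, False, False))
```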

 

Recovery of Hung PDomain

Step#4 Decide whether to reset the OS or to force a core dump (dumpcore)

dumpcore is the better option, because it produces a core dump for root cause analysis (RCA) of the hang.

 

**WARNING **WARNING **WARNING **

If guest ldoms are configured with IO dependencies on the control domain, there is a risk that the guest domains will not survive a control domain reset or XIR.

If guest domains are configured with dependencies on root domains or IO domains only, they will survive a control domain reset.

If you are unsure, it is safer to shut down the guest domains via an ssh connection before trying to recover a hung control domain.

 **WARNING **WARNING **WARNING **

 


Step#5 Executing the recovery

Recovery Option#1 dumpcore

dumpcore forces the hung OS to panic, write a core dump, and then reboot. This core dump is essential for the RCA of the hang.

 

-> set /HOSTx send_break_action=dumpcore


Step#6 Start the /HOSTx/console to view what is happening

-> start /HOSTx/console
Are you sure you want to start /HOSTx/console (y/n)?y

Serial console started.  To stop, type #.
     done
    dumping to /dev/dsk/c0t5000CCA00AB4C674d0s1, offset 107544576, content: kernel
     0:50  100% done
     ...

    rebooting...


Recovery Option#2 break

send_break_action=break provides the user with three options: continue, sync, or reset.

Note that there is no direct action that drops the system from the OS to the OBP prompt {ok}.
The only way to reach the OBP prompt {ok} is to select sync or reset,
which panics or resets the system respectively and then drops it to the OBP prompt {ok}.
Whether the system stops at {ok} or boots again depends on the value of the auto-boot? variable, which can be set as follows:

-> set /HOSTx/bootmode state=reset_nvram script="setenv auto-boot? false"

c) continue = same as typing "go" on a legacy system: cancels the break and returns to the OS
s) sync = panics the system, then drops to the OBP prompt or boots
r) reset = resets the system, then drops to the OBP prompt or boots

  

-> set /HOSTx send_break_action=break
-> start /HOSTx/console
Are you sure you want to start /HOSTx/console (y/n)?y

Serial console started.  To stop, type #.
c)ontinue, s)ync, r)eset? s

panic[cpu53]/thread=2a1014b1ca0: sync initiated
    sched: trap type = 0x0
    pid=0, pc=0x0, sp=0x0, tstate=0x0, context=0x0

 

 ***IMPORTANT: If the hung host control domain fails to respond to the send_break_action settings above, you must decide how to proceed based on the following two considerations:

1. If RCA is more important, for example because hangs are recurring, do NOT reset or restart the HOST immediately.
First open a new SR with Oracle Support so that TSC can investigate the issue further.
TSC will guide you through collecting the critical information and data needed to identify the possible cause.
Without this data, proper problem analysis will not be possible.

2. If immediate recovery is more important to you, proceed to Recovery Option#3.

 

If the hung host control domain fails to respond to send_break_action, it is possible to send an XIR to the CPUs composing the Pdom before resetting the host (and losing any opportunity to collect data).

Sending an XIR can be very useful in order to:

  • collect state dump information,
  • force a host reset or core dump as the post-XIR action.


Note: XIR dump files can be particularly useful for further analysis of a hang caused by a Hypervisor issue.

The xir command is available from sunservice or Esc mode, and also from the restricted shell.

See How to use Restricted Shell on ILOM 3.0.10 and later platforms (Doc ID 1302296.1)

On M5-32/M6-32, the xir command must be initiated from the Active SP and the xir is sent to the CPUs composing the Pdom.

-> set SESSION mode=restricted

WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.

[(restricted_shell) m5-32-sca11-a-sp1:~]$ xir -?
domain_id is required on this platform
usage:    xir -d domain_id resume|guest_reset|guest_core|guest_debugger
    xir display [-p <physcpu>] [-s <strand>] [-t] <filename>
    xir display  -t   # display only trap PCs
    xir list          # list XIR files
    xir delete <filename>


The resulting XIR dump files (xir_ser.N) are available on the respective SPP in the /coredump/xir directory.
They are also collected automatically by snapshot, in the SPP snapshots under spp_snapshot/SPPx/ilom/traces/coredump/xir/

When an XIR is issued from the restricted shell, it is reported in the SP audit logs:

-> show /SP/logs/audit/list Class=="Restricted Shell"

Audit
ID     Date/Time                 Class     Type      Severity
-----  ------------------------  --------  --------  --------
...
79072  Thu Oct  2 04:38:44 2014  Restricted Shell  Command Executed  minor   
       root: RShell Executed: xir -d 3 guest_reset
79043  Thu Oct  2 04:26:52 2014  Restricted Shell  Command Executed  minor   
       root: RShell Executed: xir -d 3 guest_core
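
If the audit listing has been saved to a file, the xir invocations can be extracted for a timeline. The sketch below assumes the two-line entry layout shown above; the function name find_xir_commands and the dictionary keys are hypothetical, and the sample text is copied from the listing in this document.

```python
import re

# One audit entry spans two lines: a header line, then the executed command.
AUDIT_ENTRY = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<when>\w{3} \w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4})\s+"
    r"Restricted Shell.*\n"
    r"\s+(?P<user>\w+): RShell Executed: (?P<command>xir\b.*?)\s*$",
    re.MULTILINE,
)

def find_xir_commands(audit_text: str) -> list:
    """Return one dict per audit entry that records an xir command."""
    return [m.groupdict() for m in AUDIT_ENTRY.finditer(audit_text)]

SAMPLE_AUDIT = """\
79072  Thu Oct  2 04:38:44 2014  Restricted Shell  Command Executed  minor
       root: RShell Executed: xir -d 3 guest_reset
79043  Thu Oct  2 04:26:52 2014  Restricted Shell  Command Executed  minor
       root: RShell Executed: xir -d 3 guest_core
"""

if __name__ == "__main__":
    for entry in find_xir_commands(SAMPLE_AUDIT):
        print(entry["id"], entry["when"], entry["command"])
```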


Sending an XIR to a Pdom consists of two parts:

  • XIR signal sent to all cpus, and associated state dump is collected
  • post-XIR action, which is one of: resume | guest_reset | guest_core | guest_debugger
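
Because a mistyped post-XIR action on a production Pdom is costly, it can help to validate the command line before pasting it into the restricted shell. The following is only a sketch: build_xir_command is a hypothetical helper, the list of actions is taken from the xir usage text above, and the accepted domain range 0 to 3 is an assumption based on HOST0 to HOST3.

```python
# Post-XIR actions listed in the 'xir -?' usage output shown in this document.
POST_XIR_ACTIONS = ("resume", "guest_reset", "guest_core", "guest_debugger")

def build_xir_command(domain_id: int, action: str) -> str:
    """Return the 'xir -d <domain_id> <action>' command line as a string.

    Nothing is executed; the string is meant to be reviewed and then typed
    in the restricted shell on the Active SP.
    """
    # Assumption: domain IDs follow HOST0..HOST3, so only 0-3 are accepted.
    if isinstance(domain_id, bool) or not isinstance(domain_id, int) \
            or domain_id not in range(4):
        raise ValueError("domain_id must be an integer from 0 to 3")
    if action not in POST_XIR_ACTIONS:
        raise ValueError("unknown post-XIR action: %r" % (action,))
    return "xir -d %d %s" % (domain_id, action)

if __name__ == "__main__":
    print(build_xir_command(3, "guest_core"))
```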


For the XIR state dump, the following state is collected:

  • Per-strand status, per physical cpu
    • strand_available, strand_enable, strand_running
  • Per-strand XIR data, per physical cpu
    • per-TL: tpc, tnpc, tstate, tt, htstate
    • 4 global register windows
    • 8 local register windows
    • Trap PCs

Command : resume
Action  : Resume normal operation.
Comment : 'xir resume' can be used several times in a row to capture data over time and monitor any progress. This may be followed by send_break_action=dumpcore if the host is still responsive enough.
Impact  : All guest ldoms stay up.

Command : guest_reset
Action  : Request that the Hypervisor reset the HOST or all LDoms that reside on the Pdom.
Comment : Works for 9.2.0.b and later (16450696).
Impact  : Guest ldoms do NOT stay up.

Command : guest_core
Action  : Request that the Hypervisor generate a crash dump on the HOST or all LDoms that reside on the Pdom.
Comment : Works for 9.2.0.b and later (16450696). It appears that only the control domain produces a core dump.
Impact  : Guest ldoms do NOT stay up.

 

Example for 'xir guest_core' :

[(restricted_shell) m5-32-sca11-a-sp1:~]# xir -d 3 guest_core
preparing to send XIR (guest_core) to domain 3

cognizant SP for /SYS/DCU3: /SYS/SPP3
starting XIR: guest_core
get XIR filename: GM busy (5009, 1): sleeping 5
get XIR filename: GM busy (5009, 1): sleeping 5
XIR filename: xir_ser.8
[(restricted_shell) m5-32-sca11-a-sp1:~]#


m5-32-sca11-a-pdom03 console login:
panic[cpu2305]/thread=2a100ae5c40: Panic - Generated at user request

000002a100ae5600 unix:process_nonresumable_error+4f8 (10e2000, 0, 0, 2a100ae5710, 2a100ae5768, 100000000)
%l0-3: ffffffffffffff7f 0000000000000040 0000000000000100 00000300000007f8
%l4-7: 0000000000000000 00000000000000ff 0000000000000000 0000000003000000
000002a100ae57a0 unix:ktl0+64 (0, c40030acc080, 0, 0, 3, 12)
%l0-3: 0000030000000000 0000000000004808 0000004400001404 000000000102bfe0
%l4-7: 0000000000000001 0000c40030acc000 0000000000000000 000002a100ae5850
000002a100ae58f0 unix:cpu_halt+13c (1000, 1, 1041fb58, 1041fa20, 30000000000, 1)
%l0-3: 0000000000000001 0000c4003f595524 0000000000000016 00000000000010fc
%l4-7: 0000000000000013 0000000000000000 00000000100dbc00 0000000000000000
000002a100ae59a0 unix:idle+12c (100dbc00, 14, 30000000000, c4003f595524, 100dbf98, ffffffffffffffff)
%l0-3: 00000000000bb8bd 00000000000bb8bc 0000c4003f595500 ffffffffffffffff
%l4-7: 000000001041fa20 000000000104a85c fffffffffffffffe 0000000010652c00

syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel sections:
0:03 100% done (kernel)
100% done: 415913 (kernel) pages dumped, dump succeeded
rebooting...
Resetting...

20141002 04:18:17: Start Host completed successfully
20141002 04:27:10: status='Solaris panicking'
20141002 04:27:15: status='Solaris rebooting'
20141002 04:27:43: status='Solaris rebooting'
20141002 04:27:44: status='OpenBoot initializing'
20141002 04:28:25: status='OpenBoot Primary Boot Loader'
20141002 04:30:43: status='OpenBoot Primary Boot Loader'
20141002 04:30:54: status='OpenBoot Running OS Boot'
20141002 04:31:20: status='Solaris running'
20141002 04:31:20: Start Host completed successfully
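
The status lines above also show how long the host took to come back after the forced panic. The sketch below parses lines in that format and measures the time from the first 'Solaris panicking' to the next 'Solaris running'. The function names are hypothetical, and the sample is copied from the example above.

```python
import re
from datetime import datetime

STATUS_LINE = re.compile(
    r"^(\d{8} \d{2}:\d{2}:\d{2}): status='([^']+)'", re.MULTILINE
)

def status_timeline(log_text: str) -> list:
    """Return (timestamp, status) tuples for every status='...' line."""
    return [
        (datetime.strptime(stamp, "%Y%m%d %H:%M:%S"), status)
        for stamp, status in STATUS_LINE.findall(log_text)
    ]

def recovery_seconds(log_text: str):
    """Seconds from the first 'Solaris panicking' to the next 'Solaris running'.

    Returns None when either event is missing from the text.
    """
    events = status_timeline(log_text)
    start = next((t for t, s in events if s == "Solaris panicking"), None)
    if start is None:
        return None
    end = next((t for t, s in events if s == "Solaris running" and t > start), None)
    return None if end is None else (end - start).total_seconds()

SAMPLE_LOG = """\
20141002 04:18:17: Start Host completed successfully
20141002 04:27:10: status='Solaris panicking'
20141002 04:27:15: status='Solaris rebooting'
20141002 04:27:44: status='OpenBoot initializing'
20141002 04:28:25: status='OpenBoot Primary Boot Loader'
20141002 04:30:54: status='OpenBoot Running OS Boot'
20141002 04:31:20: status='Solaris running'
"""

if __name__ == "__main__":
    print(recovery_seconds(SAMPLE_LOG))  # 04:27:10 to 04:31:20 is 250.0 seconds
```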

 



 

Recovery Option#3 reset

 

Reset is used only as a last resort, when Recovery Option#1 and Option#2 have failed. Such a failure can mean that the domain has a hard hang and is not responding to the break signal.
The downside of reset is that it serves only as recovery and does not produce a core dump for RCA.

 

-> reset /HOSTx
Are you sure you want to reset /HOSTx (y/n)?
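
The escalation order described in this section can be written down as a short function. This is one reading of the procedure and not an official decision tree: the function name and the returned labels are hypothetical, and the optional XIR step is deliberately not modeled because it is normally performed under TSC guidance.

```python
def next_recovery_step(failed_actions, rca_more_important: bool) -> str:
    """Suggest the next step, given the send_break_action values that failed.

    failed_actions: iterable containing 'dumpcore' and/or 'break'.
    Returned labels are illustrative only:
      dumpcore  try Recovery Option#1
      break     try Recovery Option#2
      open_sr   do not reset; open an SR so TSC can guide data collection
      reset     Recovery Option#3, last resort, no core dump
    """
    failed = set(failed_actions)
    if "dumpcore" not in failed:
        return "dumpcore"
    if "break" not in failed:
        return "break"
    # Both break actions failed: the choice depends on RCA versus recovery.
    return "open_sr" if rca_more_important else "reset"

if __name__ == "__main__":
    print(next_recovery_step(["dumpcore", "break"], rca_more_important=True))
```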

 

Preventing or Allowing the break signal from ILOM

 

To prevent an accidental or unauthorized break from being sent to a running OS, lock the ILOM keyswitch state.

 

-> help keyswitch_state
    Properties:
    keyswitch_state : Keyswitch State.
    keyswitch_state : Possible values = Normal, Standby, Diag, Locked <====
    keyswitch_state : User role required for set = a

 
How to Prevent/Suppress break

-> set /HOSTx keyswitch_state=Locked
Set 'keyswitch_state' to 'Locked'

-> set /HOSTx send_break_action=break
set: Cannot send break action because keyswitch is in LOCKED position. <========

 
How to Allow break      

-> set /HOSTx keyswitch_state=Normal
Set 'keyswitch_state' to 'Normal'

 


  Copyright © 2018 Oracle, Inc.  All rights reserved.