Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Predictive Self-Healing Sure
Solution 1540394.1 : SPARC M5-32 and M6-32 Servers: How to deal with a hung or unresponsive Physical Domain
Applies to:
SPARC M5-32 - Version All Versions and later
SPARC M6-32 - Version All Versions and later
Information in this document applies to any platform.

Purpose
This document provides details on how to deal with a hung or unresponsive M5-32/M6-32 Physical Domain (PDomain). It also explains how to send a break to the domain for recovery.

Scope
Explains how to recognize a hung PDomain and the recovery options available. It also explains how to enable or disable the break signal in ILOM.

For the procedure to deal with a hung or unresponsive Logical Domain (LDom, also known as a guest domain), see How to Collect a Forced Crash Dump of a Hanging Solaris Guest LDom (Doc ID 1020884.1).

Details

Responding to a hung PDomain
If you suspect that an M5-32 PDomain is hung or unresponsive, first confirm the hang before attempting recovery.

HOSTx is a soft link to /Servers/PDomains/PDomain_x/HOST, so either path can be used.

Verify that the PDomain is hung
Step#1 Check the host status
The status should be "Solaris running", not powered off, at OBP, or still starting up.
-> show /HOSTx status
/HOSTx
   Properties:
      status = Solaris running
Step#2 Check network access to the host
# ping <host IP>
# ssh <host IP>
Sometimes ping works but ssh does not. This indicates a soft hang or an underlying issue with the network stack.
Step#3 Verify that the console is hung
-> start -f /HOSTx/console
Are you sure you want to start /HOSTx/console (y/n)? y
Serial console started. To stop, type #.
Note that -f (or -force) is used to connect to the console. Without -f, if the console is already in use by another session or user, the new session is read-only and can appear to be hung. Using -f disconnects the other user and rules out that false impression.

If the console connects but shows no output and does not respond to any input, this qualifies as a hang and rules out a network-only problem.

Once all of the above is verified, the domain is deemed hung. Proceed to the next step for recovery.
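The three verification steps above amount to a small decision procedure. The following is a minimal sketch in Python, assuming the operator has already collected the observations by hand. The function name `classify_pdomain` and its return strings are hypothetical and are not part of ILOM or Solaris. Ping is left out of the decision on purpose, since the text notes that ping can succeed while ssh fails.

```python
def classify_pdomain(status: str, ssh_ok: bool, console_responds: bool) -> str:
    """Classify a PDomain from checks an operator has already made by hand.

    Hypothetical helper: it does not contact ILOM or the host.
    """
    # Step 1: anything other than "Solaris running" (powered off, at OBP,
    # still booting) is a different situation, not a hang.
    if status.strip().lower() != "solaris running":
        return "not-running"
    # Step 2: a working ssh login means the OS is responsive.
    if ssh_ok:
        return "responsive"
    # Step 3: a console that answers means the OS is alive, so the ssh
    # failure points at the network stack or a service problem instead.
    if console_responds:
        return "network-or-service-issue"
    # Solaris reports running, ssh fails, and the forced console is silent.
    return "hung"


if __name__ == "__main__":
    print(classify_pdomain("Solaris running", ssh_ok=False, console_responds=False))
```

Only the "hung" result would justify moving on to the recovery options; the other results call for ordinary troubleshooting first.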
Recovery of Hung PDomain
Recovery Option#1 dumpcore
dumpcore is the preferred option because it produces a core dump for root cause analysis (RCA) of the hang.
** WARNING ** WARNING ** WARNING **
If guest LDoms are configured with IO dependencies on the control domain, there is a risk that the guest domains will not survive a control domain reset or XIR.
If guest domains are configured with dependencies on root domains or IO domains, they will survive a control domain reset.
If you are unsure, it is safer to shut down the guest domains over an ssh connection before trying to recover a hung control domain.
** WARNING ** WARNING ** WARNING **
dumpcore forces the hung OS to panic and write a core dump before it reboots. This core dump is essential for the RCA of the hang.
-> set /HOSTx send_break_action=dumpcore
-> start /HOSTx/console
Are you sure you want to start /HOSTx/console (y/n)? y
Serial console started. To stop, type #.
done
dumping to /dev/dsk/c0t5000CCA00AB4C674d0s1, offset 107544576, content: kernel
0:50 100% done
...
rebooting...
Recovery Option#2 break
send_break_action=break gives the user three options at the console: continue, sync, or reset.
Note that there is no direct action to drop the system from the OS to the OBP prompt {ok}. The only way to reach the OBP prompt {ok} is to select sync or reset, which performs a panic or a reset respectively and then drops the system to OBP {ok}. Whether the system stops at OBP or boots again depends on the eeprom value of auto-boot?, which can be set from ILOM:
-> set /HOSTx/bootmode state=reset_nvram script="setenv auto-boot? false"
c) continue = same as typing "go" on legacy systems: cancels the break and returns to the OS
s) sync = panics the system, then drops to the OBP prompt or boots
r) reset = performs a reset, then drops to the OBP prompt or boots
-> set /HOSTx send_break_action=break
-> start /HOSTx/console
Are you sure you want to start /HOSTx/console (y/n)? y
Serial console started. To stop, type #.
c)ontinue, s)ync, r)eset? s
panic[cpu53]/thread=2a1014b1ca0: sync initiated
sched: trap type = 0x0
pid=0, pc=0x0, sp=0x0, tstate=0x0, context=0x0
***IMPORTANT: If the hung host control domain failed to respond to the send_break_action(s), you then have to make a decision based on these two facts:
If the hung control domain fails to respond to send_break_action, it is possible to send an XIR to the CPUs composing the PDomain before resetting the host (and losing any opportunity to collect data). Sending an XIR may be very useful in order to :
-> set SESSION mode=restricted
WARNING: The "Restricted Shell" account is provided solely to allow Services to perform diagnostic tasks.
[(restricted_shell) m5-32-sca11-a-sp1:~]$ xir -?
domain_id is required on this platform
usage: xir -d domain_id resume|guest_reset|guest_core|guest_debugger
       xir display [-p <physcpu>] [-s <strand>] [-t] <filename>
       xir display -t          # display only trap PCs
       xir list                # list XIR files
       xir delete <filename>
-> show /SP/logs/audit/list Class=="Restricted Shell"
Audit
ID     Date/Time                 Class     Type      Severity
-----  ------------------------  --------  --------  --------
...
79072  Thu Oct 2 04:38:44 2014   Restricted Shell  Command Executed  minor
       root: RShell Executed: xir -d 3 guest_reset
79043  Thu Oct 2 04:26:52 2014   Restricted Shell  Command Executed  minor
       root: RShell Executed: xir -d 3 guest_core
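When reviewing who sent an XIR and when, it can help to pull just the executed commands out of a saved copy of the audit listing. The following is a minimal sketch in Python. It assumes the listing was saved as text with each "RShell Executed:" command at the end of its own line; the function name `executed_commands` and the sample variable are hypothetical.

```python
import re

# Sample entries copied from the audit listing above. The line layout
# (command at the end of its own line) is an assumption of this sketch.
AUDIT_TEXT = """\
79072 Thu Oct 2 04:38:44 2014 Restricted Shell Command Executed minor
      root: RShell Executed: xir -d 3 guest_reset
79043 Thu Oct 2 04:26:52 2014 Restricted Shell Command Executed minor
      root: RShell Executed: xir -d 3 guest_core
"""

# Capture everything after "RShell Executed:" up to the end of the line.
_EXECUTED = re.compile(r"RShell Executed:\s*(.+?)\s*$")


def executed_commands(audit_text):
    """Return the restricted-shell commands in the order they are listed."""
    commands = []
    for line in audit_text.splitlines():
        match = _EXECUTED.search(line)
        if match:
            commands.append(match.group(1))
    return commands


if __name__ == "__main__":
    for command in executed_commands(AUDIT_TEXT):
        print(command)
```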
[(restricted_shell) m5-32-sca11-a-sp1:~]# xir -d 3 guest_core
preparing to send XIR (guest_core) to domain 3
cognizant SP for /SYS/DCU3: /SYS/SPP3
starting XIR: guest_core
get XIR filename: GM busy (5009, 1): sleeping 5
get XIR filename: GM busy (5009, 1): sleeping 5
XIR filename: xir_ser.8
[(restricted_shell) m5-32-sca11-a-sp1:~]#

m5-32-sca11-a-pdom03 console login: panic[cpu2305]/thread=2a100ae5c40: Panic - Generated at user request

000002a100ae5600 unix:process_nonresumable_error+4f8 (10e2000, 0, 0, 2a100ae5710, 2a100ae5768, 100000000)
  %l0-3: ffffffffffffff7f 0000000000000040 0000000000000100 00000300000007f8
  %l4-7: 0000000000000000 00000000000000ff 0000000000000000 0000000003000000
000002a100ae57a0 unix:ktl0+64 (0, c40030acc080, 0, 0, 3, 12)
  %l0-3: 0000030000000000 0000000000004808 0000004400001404 000000000102bfe0
  %l4-7: 0000000000000001 0000c40030acc000 0000000000000000 000002a100ae5850
000002a100ae58f0 unix:cpu_halt+13c (1000, 1, 1041fb58, 1041fa20, 30000000000, 1)
  %l0-3: 0000000000000001 0000c4003f595524 0000000000000016 00000000000010fc
  %l4-7: 0000000000000013 0000000000000000 00000000100dbc00 0000000000000000
000002a100ae59a0 unix:idle+12c (100dbc00, 14, 30000000000, c4003f595524, 100dbf98, ffffffffffffffff)
  %l0-3: 00000000000bb8bd 00000000000bb8bc 0000c4003f595500 ffffffffffffffff
  %l4-7: 000000001041fa20 000000000104a85c fffffffffffffffe 0000000010652c00

syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel sections:
0:03 100% done (kernel)
100% done: 415913 (kernel) pages dumped, dump succeeded
rebooting...
Resetting...
20141002 04:18:17: Start Host completed successfully
20141002 04:27:10: status='Solaris panicking'
20141002 04:27:15: status='Solaris rebooting'
20141002 04:27:43: status='Solaris rebooting'
20141002 04:27:44: status='OpenBoot initializing'
20141002 04:28:25: status='OpenBoot Primary Boot Loader'
20141002 04:30:43: status='OpenBoot Primary Boot Loader'
20141002 04:30:54: status='OpenBoot Running OS Boot'
20141002 04:31:20: status='Solaris running'
20141002 04:31:20: Start Host completed successfully
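The timestamped status lines above show how long the domain took to come back after the XIR-induced panic. The following is a minimal sketch in Python that parses lines in this "YYYYMMDD HH:MM:SS: message" form and measures the time from "Solaris panicking" to the next "Solaris running". The function names are hypothetical, and the sketch assumes the lines have been saved as plain text, one entry per line.

```python
from datetime import datetime

# Status lines copied from the transcript above.
SAMPLE_LOG = """\
20141002 04:18:17: Start Host completed successfully
20141002 04:27:10: status='Solaris panicking'
20141002 04:27:15: status='Solaris rebooting'
20141002 04:27:43: status='Solaris rebooting'
20141002 04:27:44: status='OpenBoot initializing'
20141002 04:28:25: status='OpenBoot Primary Boot Loader'
20141002 04:30:43: status='OpenBoot Primary Boot Loader'
20141002 04:30:54: status='OpenBoot Running OS Boot'
20141002 04:31:20: status='Solaris running'
20141002 04:31:20: Start Host completed successfully
"""


def parse_status_lines(text):
    """Return a list of (datetime, message) pairs, skipping malformed lines."""
    events = []
    for line in text.splitlines():
        # The first ": " separates the timestamp from the message; the colons
        # inside HH:MM:SS are not followed by a space.
        stamp, sep, message = line.strip().partition(": ")
        if not sep:
            continue
        try:
            when = datetime.strptime(stamp, "%Y%m%d %H:%M:%S")
        except ValueError:
            continue
        events.append((when, message))
    return events


def panic_to_running_seconds(events):
    """Seconds from the first panic to the next 'Solaris running', or None."""
    start = next((w for w, m in events if "Solaris panicking" in m), None)
    if start is None:
        return None
    end = next((w for w, m in events if w >= start and "Solaris running" in m), None)
    if end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    print(panic_to_running_seconds(parse_status_lines(SAMPLE_LOG)))
```

For the sample above, the interval runs from 04:27:10 to 04:31:20.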
Recovery Option#3 reset
Reset is used only as a last resort, when recovery option#1 and option#2 fail. This could mean that the domain has a hard hang and is not responding to the break signal.
-> reset /HOSTx
Are you sure you want to reset /HOSTx (y/n)?
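The recovery options in this document form an escalation order: dumpcore first, then break, then XIR if the break actions get no response, and reset only as a last resort. The following is a minimal sketch of that ordering in Python. The command strings are copied from this document; the names `RECOVERY_LADDER` and `next_recovery_step` are hypothetical, and the placement of XIR between break and reset is a reading of the text above, not an Oracle-defined sequence.

```python
# Escalation order as read from this document. The XIR command must be run
# from the ILOM restricted shell, and <domain_id> is a placeholder.
RECOVERY_LADDER = [
    ("dumpcore", "set /HOSTx send_break_action=dumpcore"),
    ("break", "set /HOSTx send_break_action=break"),
    ("xir", "xir -d <domain_id> guest_core"),
    ("reset", "reset /HOSTx"),
]


def next_recovery_step(failed):
    """Return (name, command) for the first option not yet tried and failed.

    `failed` is a collection of option names that got no response.
    Returns None when every option has been exhausted.
    """
    for name, command in RECOVERY_LADDER:
        if name not in failed:
            return name, command
    return None


if __name__ == "__main__":
    print(next_recovery_step({"dumpcore", "break"}))
```

Keeping reset last preserves the chance to collect a core dump or XIR data before the host state is lost.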
The Following Section is for Preventing or Allowing the break signal from the ILOM
Sometimes it is desirable to prevent an accidental or unauthorized break to a running OS. This can be done by locking the ILOM keyswitch state.
-> help keyswitch_state
Properties:
   keyswitch_state : Keyswitch State.
   keyswitch_state : Possible values = Normal, Standby, Diag, Locked <====
   keyswitch_state : User role required for set = a
-> set /HOSTx keyswitch_state=Locked
Set 'keyswitch_state' to 'Locked'
-> set /HOSTx send_break_action=break
set: Cannot send break action because keyswitch is in LOCKED position. <========
-> set /HOSTx keyswitch_state=Normal
Set 'keyswitch_state' to 'Normal'
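The keyswitch behaviour shown above can be summarised as a simple gate in front of the break action. The following toy model in Python only illustrates that logic. The class `HostKeyswitchModel` is hypothetical and is not an ILOM API. It also assumes every send_break_action value is gated the same way, whereas the transcript only demonstrates this for break.

```python
class HostKeyswitchModel:
    """Toy model of the keyswitch gate; it does not talk to ILOM."""

    # Possible values as listed by "help keyswitch_state" above.
    VALID_STATES = ("Normal", "Standby", "Diag", "Locked")

    def __init__(self):
        self.keyswitch_state = "Normal"
        self.breaks_sent = []

    def set_keyswitch_state(self, value):
        if value not in self.VALID_STATES:
            raise ValueError(f"invalid keyswitch_state: {value}")
        self.keyswitch_state = value

    def send_break(self, action):
        # Assumption: any break action is refused while the keyswitch is Locked.
        if self.keyswitch_state == "Locked":
            raise PermissionError(
                "Cannot send break action because keyswitch is in LOCKED position."
            )
        self.breaks_sent.append(action)


if __name__ == "__main__":
    host = HostKeyswitchModel()
    host.set_keyswitch_state("Locked")
    try:
        host.send_break("break")
    except PermissionError as err:
        print(err)
    host.set_keyswitch_state("Normal")
    host.send_break("break")
    print(host.breaks_sent)
```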