Asset ID: 1-79-2241115.1
Update Date: 2017-10-11
Keywords:
Solution Type: Predictive Self-Healing Sure
Solution 2241115.1: How to deal/recover from a hung/unresponsive Physical Partition (PPAR) or Guest Domain
Related Items
- Fujitsu M10-1
- Fujitsu M10-4S
- Fujitsu SPARC M12-2S
- Fujitsu SPARC M12-1
- Fujitsu SPARC M12-2
- Fujitsu M10-4
Related Categories
- PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Fujitsu M10
Applies to:
Fujitsu M10-1 - Version All Versions and later
Fujitsu M10-4 - Version All Versions and later
Fujitsu M10-4S - Version All Versions and later
Fujitsu SPARC M12-1 - Version All Versions and later
Fujitsu SPARC M12-2 - Version All Versions and later
Information in this document applies to any platform.
Purpose
This document helps customers, and Oracle engineers assisting customers, recover a hung/unresponsive Physical Partition (PPAR) on an M10 series machine.
Scope
This document explains how to recognize a hung/unresponsive Physical Partition (PPAR) and describes the recovery options available.
If a Guest Domain within a PPAR becomes hung/unresponsive, it can also be recovered through the XSCF.
Details
Goal
This document provides details on how to recover a hung or unresponsive Physical Partition or Guest Domain on an M10 series machine.
Solution
How to respond to a suspected hung/unresponsive Physical Partition (PPAR)
If you suspect that a Physical Partition (PPAR) is hung, verify it first.
(A PPAR is equivalent to a Physical Domain (PDOM) on the Mx-32 series / M7 series.)
Step #1 Verify the status of the PPAR or its Guest Domains
Log into the M10 machine's XSCF and run:
XSCF> showdomainstatus -p ppar_id
where ppar_id = 0, ... 16
Example
XSCF> showdomainstatus
usage : showdomainstatus -p ppar_id [-v] [-M] [-g domainname]
showdomainstatus -h
XSCF>
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 Solaris running
XSCF>
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary OpenBoot Running
ldg0 OpenBoot Running
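When the status check has to be repeated across several partitions, the output can be read programmatically. The following is a minimal sketch, assuming the `showdomainstatus -p` output has already been captured as text and has the two-column layout shown above; the function names `parse_domain_status` and `domains_not_running_solaris` are hypothetical helpers, not part of any XSCF tooling.

```python
def parse_domain_status(output: str) -> dict:
    """Map each logical domain name to its status string.

    Assumes the layout shown in this document: a header line
    'Logical Domain Name Status' followed by '<name> <status>' rows.
    """
    statuses = {}
    for raw in output.splitlines():
        line = raw.strip()
        # Skip blank lines, XSCF prompts, and the column header.
        if not line or line.startswith("XSCF>") or line.startswith("Logical Domain Name"):
            continue
        parts = line.split(None, 1)  # domain names contain no spaces
        if len(parts) == 2:
            statuses[parts[0]] = parts[1].strip()
    return statuses


def domains_not_running_solaris(statuses: dict) -> list:
    """Return the domains whose status is anything other than 'Solaris running'."""
    return [name for name, status in statuses.items()
            if status.lower() != "solaris running"]


SAMPLE = """\
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 OpenBoot Running
XSCF>
"""

if __name__ == "__main__":
    print(domains_not_running_solaris(parse_domain_status(SAMPLE)))
```

A status of "Solaris running" does not by itself rule out a hang, which is why the network and console checks in the following steps are still needed.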
Step #2 Verify whether the PPAR or its Guest Domain responds to a ping or ssh command
# ping <IP address of PPAR/Guest Domain>
# ssh <IP address of PPAR/Guest Domain>
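The network check can also be scripted from an administration host. The sketch below is an illustration only: it tests whether a TCP connection to the ssh port can be opened, which is a narrower test than an interactive ssh login. The function name `ssh_port_reachable`, the default port 22, and the timeout are assumptions.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address; substitute the
    # IP address of the PPAR or Guest Domain being checked.
    print(ssh_port_reachable("192.0.2.10", timeout=2.0))
```

A successful connection only shows that the port accepts connections; a failed one is a hint that should be confirmed with the console check in the next step.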
Step #3 Verify whether the console of the PPAR or Guest Domain is accessible
For the PPAR's console, from the M10 series machine's XSCF:
usage : console [[-q] -{y|n}] -p ppar_id [-f|-r] [-s escapeChar]
console -h
where ppar_id = 0, ... 16
For the console of a Guest/Logical Domain, log into the PPAR and issue the following:
# ldm list
For example, to log into Guest/Logical Domain ldg0, use the telnet command against the port shown in the CONS column:
jack@m10-1:~$ ldm list
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 16 32G 0.0% 0.0% 23h 57m
ldg0 active -n---- 5000 8 32G 0.0% 0.0% 23h 48m
jack@m10-1:~$ telnet localhost 5000
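The console port used in the telnet command comes from the CONS column of `ldm list`. As a rough sketch, assuming the column layout shown above and that the output has been captured as text, the port for each guest domain could be extracted as follows; `console_ports` is a hypothetical helper name.

```python
def console_ports(ldm_list_output: str) -> dict:
    """Map each domain with a numeric CONS value to its console port.

    Domains whose CONS column is not numeric (for example 'UART'
    on the primary domain) are skipped.
    """
    lines = [ln for ln in ldm_list_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_idx = header.index("NAME")
    cons_idx = header.index("CONS")
    ports = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(name_idx, cons_idx):
            continue  # not a data row
        cons = fields[cons_idx]
        if cons.isdigit():
            ports[fields[name_idx]] = int(cons)
    return ports


SAMPLE = """\
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 16 32G 0.0% 0.0% 23h 57m
ldg0 active -n---- 5000 8 32G 0.0% 0.0% 23h 48m
"""

if __name__ == "__main__":
    for domain, port in console_ports(SAMPLE).items():
        print(f"{domain}: telnet localhost {port}")
```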
XSCF> showdomainstatus
Invalid parameter.
usage : showdomainstatus -p ppar_id [-v] [-M] [-g domainname]
showdomainstatus -h
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 Solaris running
XSCF> console -p 0
Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y
If the console is in use by another user, use the -f option to force
a takeover of the console.
If the console connects but shows no output and does not respond to any input,
there is a very high likelihood that the PPAR is indeed hung, since it cannot
be reached over the network or through its console.
Once all of the above has been verified and the PPAR is confirmed unresponsive,
proceed to the next step to recover it.
Step #4 Proceed to recover the hung/unresponsive PPAR
Two choices are available when a PPAR is hung: reset or sync.
A reset recovers the PPAR without collecting a coredump.
A sync attempts to generate a coredump and then resets
to recover the PPAR.
If a root cause analysis (RCA) is needed to determine why the PPAR hung, select
the sync option.
If an RCA is not required, select reset to recover the PPAR.
Open two terminal sessions to the M10 machine's XSCF
Terminal session #1 is to connect to the PPAR's console
Terminal session #2 is to run the sendbreak command to the PPAR
Terminal Session #1
XSCF> console -p 0
Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y
Terminal session #2
XSCF> sendbreak -p 0
Send break signal to PPAR-ID 0?[y|n] :y
After the sendbreak command is issued, the following can be seen on Terminal Session #1
XSCF> console -p 0
Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y
c)ontinue, s)ync, r)eset?
c) continue = cancels the break and returns to the OS (the same as entering "go" on legacy systems)
s) sync = panics the system, then drops to the OBP prompt or boots
r) reset = resets the system, then drops to the OBP prompt or boots
Example of the sync option after the sendbreak command has been issued on a PPAR
In this scenario, the hung/unresponsive PPAR responded to the sync command and generated a coredump:
m10-1 console login:
m10-1 console login: Debugging requested; hardware watchdog suspended.
c)ontinue, s)ync, r)eset? s
panic[cpu13]/thread=2a101ef3b80: sync initiated
sched: trap type = 0x0
pid=0, pc=0x0, sp=0x0, tstate=0x0, context=0x0
o0-o7: 0, 0, 0, 0, 0, 0, 0, 0
g1-g7: 0, 0, 0, 0, 0, 0, 0
000002a101ef28d0 unix:sync_handler+13c (0, 8000000000000000, 1010d000, 2a101ef3b80, 1, 20122c00)
%l0-3: 0000000020833c00 0000000000000000 0000000020511400 0000000000000000
%l4-7: 00000000204ccc00 000003000001c300 000000002011c400 0000030000018100
...
000002a101ef3080 unix:dispatch_handler+224 (30000018100, 2a101ef3b80, 0, 7008c900, 0, f)
%l0-3: 000000000000000a 00000300000181a0 0000030000018150 0000030000018148
%l4-7: 0000000000000000 0000000000000000 0000000010024254 0000000000000000
syncing file systems... done
Preserving kernel image in RAM, content: kernel sections: zfs
0:12 96% done (kernel)
0:12 100% done (zfs)
100% done: 299673 (kernel) + 9559 (zfs) pages dumped, dump succeeded
rebooting...
Resetting...
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.
SPARC M10-1, No Keyboard
Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.38.5, 32.0000 GB memory available, Serial #268895952.
[ 2.21.0 ]
Ethernet address b0:99:28:a0:5d:d0, Host ID: 900706d0.
Here is an example where a coredump is not needed for analysis, so the reset option is selected.
root@m10-1:~#
root@m10-1:~# Debugging requested; hardware watchdog suspended.
c)ontinue, s)ync, r)eset? r
Resetting...
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.
SPARC M10-1, No Keyboard
Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.38.5, 32.0000 GB memory available, Serial #268895952.
[ 2.21.0 ]
Ethernet address b0:99:28:a0:5d:d0, Host ID: 900706d0.
{0} ok boot
Boot device: disk File and args:
SunOS Release 5.11 Version 11.3 64-bit
Copyright (c) 1983, 2015, Oracle and/or its affiliates. All rights reserved.
...
Step #5 The PPAR may be hard hung and may not respond to a sendbreak command from the XSCF.
In that case, use the reset command from the XSCF
to attempt to force the PPAR or LDOM to respond.
XSCF> reset [ [-q] -{y|n}] -p ppar_id por
XSCF> reset [ [-q] -{y|n}] -p ppar_id xir
The available options are:
por (power-on reset) - Resets the PPAR.
xir (externally initiated reset) - Resets all CPUs in the PPAR.
Example reset command with POR option
XSCF> reset -y -p 0 por
PPAR-ID to reset :00
Continue? [y|n] :y
00 : Resetting
*Note*
This command only issues the instruction to reset.
The result of the instruction can be checked by the "showpparprogress".
XSCF> showpparprogress -p 0
PPAR reset PPAR#0 [ 1/13]
CPU Stop PPAR#0 [ 2/13]
PSU Off PPAR#0 [ 3/13]
XBBOX Reset PPAR#0 [ 4/13]
PSU On PPAR#0 [ 5/13]
CMU Reset Start PPAR#0 [ 6/13]
XB Reset 1 PPAR#0 [ 7/13]
XB Reset 2 PPAR#0 [ 8/13]
XB Reset 3 PPAR#0 [ 9/13]
CPU Reset 1 PPAR#0 [10/13]
CPU Reset 2 PPAR#0 [11/13]
Reset released PPAR#0 [12/13]
CPU Start PPAR#0 [13/13]
The sequence of power control is completed.
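Because the reset command only issues the instruction, the `showpparprogress` output is what confirms that the sequence has finished. The following is a minimal sketch, assuming the output has been captured as text in the format shown above; `parse_progress` and the regular expression are illustrative and not part of the XSCF firmware.

```python
import re

# Matches lines such as 'CPU Reset 1 PPAR#0 [10/13]' or 'PPAR reset PPAR#0 [ 1/13]'.
STEP_RE = re.compile(
    r"^(?P<name>.+?)\s+PPAR#(?P<ppar>\d+)\s+\[\s*(?P<done>\d+)/(?P<total>\d+)\]\s*$"
)
COMPLETED_MSG = "The sequence of power control is completed."


def parse_progress(output: str):
    """Return (last_step, completed).

    last_step is a (name, done, total) tuple for the last step line seen,
    or None if no step line was found. completed is True only when the
    completion message is present.
    """
    last_step = None
    completed = False
    for raw in output.splitlines():
        line = raw.strip()
        if line == COMPLETED_MSG:
            completed = True
            continue
        match = STEP_RE.match(line)
        if match:
            last_step = (match["name"], int(match["done"]), int(match["total"]))
    return last_step, completed


PARTIAL = """\
PPAR reset PPAR#0 [ 1/13]
CPU Stop PPAR#0 [ 2/13]
PSU Off PPAR#0 [ 3/13]
"""

if __name__ == "__main__":
    print(parse_progress(PARTIAL))
```

A result with `completed` set to False and a step count below the total suggests the sequence is still running or has stalled at the named step.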
XSCF> console -f -p 0
m10-1 console login:
POST Sequence 01 Banner
LSB#00: POST 3.13.0 (2016/10/14 09:37)
POST Sequence 02 CPU Check
POST Sequence 03 CPU Register
POST Sequence 04 STICK Increment
POST Sequence 05 Extended Instruction
POST Sequence 06 MMU
POST Sequence 07 Memory Initialize
POST Sequence 08 Memory Address Line
POST Sequence 09 MSCAN
POST Sequence 0A Cache
POST Sequence 0B Floating Point Unit
POST Sequence 0C Encryption
POST Sequence 0D Cacheable Instruction
POST Sequence 0E Softint
POST Sequence 0F CPU Cross Call
POST Sequence 10 CMU-CH
POST Sequence 11 PCI-CH
POST Sequence 12 TOD
POST Sequence 13 MBC Check Before STICK Diag
POST Sequence 14 STICK Stop
POST Sequence 15 STICK Start
POST Sequence 16 Barrier Blade
POST Sequence 17 Single Barrier Bank
POST Sequence 18 Sector Cache
POST Sequence 19 SX
POST Sequence 1A RT
POST Sequence 1B RT/SX NC
POST Sequence 1C RT/SX Interrupt
POST Sequence 1D RT/SX Barrier
POST Sequence 1E Error CPU Check
POST Sequence 1F System Configuration
POST Sequence 20 System Status Check
POST Sequence 21 Start Hypervisor
POST Sequence Complete.
Hypervisor version: @(#)Hypervisor 1.4.12 2016/12/15 09:54 1.4.11+3
Configuring System Board.... .Completed.
Starting Logical Domains....
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.
SPARC M10-1, No Keyboard
Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.38.5, 32.0000 GB memory available, Serial #268895952.
[ 2.21.0 ]
Ethernet address b0:99:28:a0:5d:d0, Host ID: 900706d0.
{0} ok
Example of reset command with XIR option
### issuing XIR to a PPAR
XSCF> reset -y -p 0 xir
PPAR-ID to reset :00
Continue? [y|n] :y
00 : Resetting
*Note*
This command only issues the instruction to reset.
The result of the instruction can be checked by the "showpparprogress".
XSCF>
XSCF>
XSCF> showpparprogress -p 0
PPAR reset PPAR#0 [ 1/13]
CPU Stop PPAR#0 [ 2/13]
PSU Off PPAR#0 [ 3/13]
XBBOX Reset PPAR#0 [ 4/13]
PSU On PPAR#0 [ 5/13]
CMU Reset Start PPAR#0 [ 6/13]
XB Reset 1 PPAR#0 [ 7/13]
XB Reset 2 PPAR#0 [ 8/13]
XB Reset 3 PPAR#0 [ 9/13]
CPU Reset 1 PPAR#0 [10/13]
CPU Reset 2 PPAR#0 [11/13]
Reset released PPAR#0 [12/13]
CPU Start PPAR#0 [13/13]
The sequence of power control is completed.
...
XSCF> console -p 0
Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y
POST Sequence 01 Banner
LSB#00: POST 3.13.0 (2016/10/14 09:37)
POST Sequence 02 CPU Check
POST Sequence 03 CPU Register
POST Sequence 04 STICK Increment
POST Sequence 05 Extended Instruction
POST Sequence 06 MMU
POST Sequence 07 Memory Initialize
POST Sequence 08 Memory Address Line
POST Sequence 09 MSCAN
...
POST Sequence 1F System Configuration
POST Sequence 20 System Status Check
POST Sequence 21 Start Hypervisor
POST Sequence Complete.
Hypervisor version: @(#)Hypervisor 1.4.12 2016/12/15 09:54 1.4.11+3
Configuring System Board.... .Completed.
Starting Logical Domains....
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.
SPARC M10-1, No Keyboard
Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.38.5, 125.0000 GB memory available, Serial #268895952.
[ 2.21.0 ]
Ethernet address b0:99:28:a0:5d:d0, Host ID: 900706d0.
{0} ok boot
If the PPAR is functioning but a Guest / Logical Domain has an issue, the PPAR can be left running
and only the affected Guest or Logical Domain reset by specifying it with the -g option.
Logical Domain Name Status
primary Solaris running
ldg0 Solaris Running <<<--- Guest Domain
XSCF>
To reset the Guest or logical domain using reset command:
reset [ [-q] -{y|n}] -p ppar_id -g domainname sir
reset [ [-q] -{y|n}] -p ppar_id -g domainname panic
where
sir Resets the logical domain.
panic Orders panic to the Oracle Solaris of the logical domain.
It is ignored during shut-down processing or under suspension.
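The four reset forms used in this document pair a reset level with a target: por and xir act on the whole PPAR, while sir and panic are given together with -g and a domain name. The sketch below only assembles the command text for those four forms so that a mismatched combination is caught before anything is typed at the XSCF prompt; `build_reset_command` is a hypothetical helper and it does not cover any other options the reset command may accept.

```python
from typing import Optional

PPAR_LEVELS = {"por", "xir"}      # forms shown without -g in this document
DOMAIN_LEVELS = {"sir", "panic"}  # forms shown with -g domainname


def build_reset_command(ppar_id: int, level: str, domain: Optional[str] = None) -> str:
    """Build the XSCF reset command string for the forms used in this document."""
    if level in PPAR_LEVELS:
        if domain is not None:
            raise ValueError(f"'{level}' is shown here without -g; omit the domain")
        return f"reset -p {ppar_id} {level}"
    if level in DOMAIN_LEVELS:
        if not domain:
            raise ValueError(f"'{level}' needs a domain name for -g")
        return f"reset -p {ppar_id} -g {domain} {level}"
    raise ValueError(f"unknown reset level: {level!r}")


if __name__ == "__main__":
    print(build_reset_command(0, "xir"))
    print(build_reset_command(0, "sir", domain="ldg0"))
```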
XSCF> reset -p 0 -g ldg0 sir
PPAR-ID :00
GuestDomain to sir : ldg0
Be sure to execute "ldm add-spconfig" before using this command when you have changed the ldm configuration.
Otherwise, an unexpected domain might be reset.
Continue? [y|n] :y
00 ldg0 : Resetting
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 OpenBoot Running
XSCF>
XSCF> reset -p 0 -g ldg0 panic
PPAR-ID :00
GuestDomain to panic : ldg0
Be sure to execute "ldm add-spconfig" before using this command when you have changed the ldm configuration.
Otherwise, an unexpected domain might be reset.
Continue? [y|n] :y
00 ldg0 : Resetting
*Note*
This command only issues the instruction to reset.
The result of the instruction can be checked by the "showdomainstatus".
{0} ok boot
Boot device: disk File and args:
SunOS Release 5.10 Version Generic_147147-26 64-bit
Copyright (c) 1983, 2013, Oracle and/or its affiliates. All rights reserved.
Hostname: bookable-10-187-57-211
ldg0 console login: Mar 6 13:27:30 bookable-10-187-57-211 sendmail[607]: My unqualified host name
ldg0 unknown; sleeping for retry
ldg0 console login:
ldg0 console login:
panic[cpu0]/thread=2a10001fc80: Panic - Generated at user request
000002a10001f6a0 unix:process_nonresumable_error+2d8 (2a10001f890, 0, ff, 40, 5, 40)
%l0-3: 0000000000000100 0000000003000000 0000000000000001 000000000180c600
%l4-7: 0000000000000000 0000000100000000 00000000ffffffff 0000000000000000
000002a10001f7e0 unix:ktl0+64 (0, 1, 0, 100, 101010101010101, 12)
%l0-3: 000000000180c000 0000000000000000 0000000011001406 0000000001029aa4
%l4-7: 0000000000000000 0000000000000000 0000000000000000 000002a10001f890
000002a10001f930 unix:cpu_halt+f4 (180c000, 0, 19dd9b8, 19dd888, 180c000, 0)
%l0-3: 00000000019d73a4 0000000000000001 0000000000000016 0000000000000000
%l4-7: 0000000000000000 0000000000000002 000000000180c178 0000000000000001
000002a10001f9e0 unix:idle+128 (1866c00, 8, 180c000, ffffffffffffffff, 1, 1865800)
%l0-3: 00000000019d7380 000000000000001b 0000000000000000 ffffffffffffffff
%l4-7: 0000000000000001 0000000001a6ec00 000000000180c178 0000000001045bac
syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kerne
Step #6 If a guest domain within the PPAR is unresponsive, first attempt to send a
break to the guest domain from the control domain
(via telnet and send break).
ldg0 console login:
ldg0 console login:
ldg0 console login: root
Password:
Mar 6 14:08:32 ldg0 login: ROOT LOGIN /dev/console
Last login: Mon Mar 6 13:24:28 on console
Oracle Corporation SunOS 5.10 Generic Patch January 2005
#
#
#
#
#
Summary of various scenarios where the PPAR or Guest/Logical Domain is hung/unresponsive
Scenario #1: Physical Partition (PPAR) hung/unresponsive
Action on the Physical Partition (PPAR):
XSCF> showdomainstatus -a
XSCF> sendbreak -p <PPAR#>
where PPAR# = 0, ... 16
XSCF> sendbreak -p 0
Send break signal to PPAR-ID 0?[y|n] :y
XSCF> console -p 0
Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y
c)ontinue, s)ync, r)eset? c

Scenario #2: Physical Partition hard hang (not responsive to sendbreak)
Action on the Physical Partition (PPAR):
The PPAR may be hard hung and may not respond to a sendbreak command from the XSCF. In that case, the reset command issued from the XSCF will attempt to force the PPAR or LDOM to respond.
XSCF> reset [ [-q] -{y|n}] -p ppar_id por
XSCF> reset [ [-q] -{y|n}] -p ppar_id xir

Scenario #3: Guest / Logical Domain has an issue; issue a send break to the Guest Domain (panic)
Action on the Guest or Logical Domain within the PPAR:
jack:~$ ldm list
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 16 32G 0.2% 0.2% 4m
ldg0 active -t---- 5000 8 32G 12% 12% 4m
jack:~$
jack:~$ telnet localhost 5000
ldg0 console login: root
Password:
Mar 6 14:08:32 bookable-10-187-57-211 login: ROOT LOGIN /dev/console
Last login: Mon Mar 6 13:24:28 on console
Oracle Corporation SunOS 5.10 Generic Patch January 2005
#
#
telnet> send break
Debugging requested; hardware watchdog suspended.
c)ontinue, s)ync, r)eset? s
panic[cpu6]/thread=2a10057dc80: sync initiated
sched: trap type = 0x0
pid=0, pc=0x0, sp=0x0, tstate=0x0, context=0x0
o0-o7: 0, 0, 0, 0, 0, 0, 0, 0
g1-g7: 0, 0, 0, 0, 0, 0, 0

Scenario #4: Guest domain does not respond to send break
If a guest domain within the PPAR is unresponsive, first attempt to send a break to the
guest domain from the control domain (via telnet and send break).
If the Guest Domain does not respond to a send break command from the telnet session,
then attempt to reset the Guest Domain from the M10's XSCF.
Action on the Guest or Logical Domain within the PPAR:
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 Solaris Running
XSCF>
XSCF> reset -p 0 -g ldg0 sir
PPAR-ID :00
GuestDomain to sir : ldg0
Be sure to execute "ldm add-spconfig" before using this command when you have changed the ldm configuration.
Otherwise, an unexpected domain might be reset.
Continue? [y|n] :y
00 ldg0 : Resetting
or
XSCF> reset -p 0 -g ldg0 panic
PPAR-ID :00
GuestDomain to panic : ldg0
Be sure to execute "ldm add-spconfig" before using this command when you have changed the ldm configuration.
Otherwise, an unexpected domain might be reset.
Continue? [y|n] :y
00 ldg0 : Resetting
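The scenarios summarized above follow a simple escalation: try a break first, and fall back to an XSCF reset only when the break gets no response. The sketch below expresses that order as a small function. It is an illustration under stated assumptions, not an official decision procedure: `recommend_action` is a hypothetical name, and mapping "RCA needed" to the panic level for a guest domain is an assumption based on the panic example in this document producing a dump.

```python
def recommend_action(target: str, responds_to_break: bool, rca_needed: bool) -> str:
    """Suggest the next recovery action for a hung PPAR or guest domain.

    target: 'ppar' or 'guest'
    responds_to_break: True if the c)ontinue, s)ync, r)eset? prompt appeared
    rca_needed: True if a coredump is wanted for root cause analysis
    """
    if target not in ("ppar", "guest"):
        raise ValueError("target must be 'ppar' or 'guest'")
    if responds_to_break:
        # Scenarios #1 and #3: answer the break prompt on the console.
        return "answer 's' (sync)" if rca_needed else "answer 'r' (reset)"
    if target == "ppar":
        # Scenario #2: the document lists both levels without ranking them.
        return "XSCF: reset -p <ppar_id> por  (or xir)"
    # Scenario #4: assumption -- panic when a dump is wanted, sir otherwise.
    level = "panic" if rca_needed else "sir"
    return f"XSCF: reset -p <ppar_id> -g <domainname> {level}"


if __name__ == "__main__":
    print(recommend_action("ppar", responds_to_break=True, rca_needed=True))
    print(recommend_action("guest", responds_to_break=False, rca_needed=False))
```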
Attachments
This solution has no attachment