Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-2241115.1
Update Date:2017-10-11
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  2241115.1 :   How to deal with/recover from a hung/unresponsive Physical Partition (PPAR) or Guest Domain


Related Items
  • Fujitsu M10-1
  • Fujitsu M10-4S
  • Fujitsu SPARC M12-2S
  • Fujitsu SPARC M12-1
  • Fujitsu SPARC M12-2
  • Fujitsu M10-4
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Fujitsu M10




In this Document
Purpose
Scope
Details
 Goal
 Solution


Applies to:

Fujitsu M10-1 - Version All Versions and later
Fujitsu M10-4 - Version All Versions and later
Fujitsu M10-4S - Version All Versions and later
Fujitsu SPARC M12-1 - Version All Versions and later
Fujitsu SPARC M12-2 - Version All Versions and later
Information in this document applies to any platform.

Purpose

The purpose of this document is to help customers, or Oracle engineers assisting customers, recover a hung/unresponsive Physical Partition (PPAR) on an M10 series machine.

Scope

This document explains how to recognize a hung/unresponsive Physical Partition (PPAR) and describes the recovery options available. If a Guest Domain within a PPAR becomes hung/unresponsive, it can be recovered through the XSCF.

 

 

Details

Goal

 

This document provides details on how to recover a hung or unresponsive Physical Partition or Guest Domain on an M10 series machine.

 

Solution

 

How to respond to a suspected hung/unresponsive Physical Partition (PPAR)


If you suspect that a Physical Partition (PPAR) is hung, verify it first using the steps below.
(A PPAR is equivalent to a Physical Domain (PDOM) on the Mx-32 series / M7 series.)

 

Step #1 Verify the status of the PPAR or its Guest Domains

Log into the XSCF of the M10 machine and run:

XSCF> showdomainstatus -p ppar_id    where ppar_id = 0 ... 15

Example
XSCF> showdomainstatus

usage : showdomainstatus -p ppar_id [-v] [-M] [-g domainname]
showdomainstatus -h

 

XSCF>
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 Solaris running
XSCF>

XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary OpenBoot Running
ldg0 OpenBoot Running
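
When the output above is captured to a file or a terminal log, it can be summarized with a small script. The following is a minimal sketch, assuming the output has the same shape as the examples in this document (a header line, then one line per domain with the name first and the status after it). The function names are illustrative and are not part of any XSCF tooling. Note that a domain's status alone may not reveal a hang, which is why the checks in the following steps are still needed.

```python
def parse_domain_status(text):
    """Return {domain_name: status} from captured `showdomainstatus -p N` output."""
    statuses = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the column header, and XSCF prompt lines.
        if (not line
                or line.startswith("Logical Domain Name")
                or line.startswith("XSCF>")):
            continue
        # First token is the domain name; the rest is the status text.
        name, _, status = line.partition(" ")
        statuses[name] = status.strip()
    return statuses


def domains_not_in_solaris(statuses):
    """List domains whose status is anything other than 'Solaris running'."""
    return [name for name, status in statuses.items()
            if status.lower() != "solaris running"]


sample = """\
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 OpenBoot Running
"""

result = parse_domain_status(sample)
print(result)
print(domains_not_in_solaris(result))
```

The comparison is case-insensitive because the examples in this document show both "running" and "Running".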

 

Step #2 Verify whether the PPAR or its Guest Domain responds to a ping or ssh command

 

# ping <IP address of PPAR/Guest Domain>
# ssh <IP address of PPAR/Guest Domain>
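
If the check needs to be repeated from an administration host, the ssh test can be approximated by attempting a TCP connection to the SSH port. This is only a sketch under assumptions: port 22 is assumed to be the SSH port, the function name is illustrative, and a successful connection only shows that something accepted the connection, not that a login would succeed.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (192.0.2.10 is a placeholder documentation address):
#   ssh_port_open("192.0.2.10")
```

A False result here, combined with a failed ping, is consistent with a hang but should still be confirmed on the console as described in the next step.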

 

Step #3 Verify if the console of the PPAR  or  Guest Domain is accessible


For the PPAR's console, use the console command from the M10 series machine's XSCF:

usage : console [[-q] -{y|n}] -p ppar_id [-f|-r] [-s escapeChar]
console -h

where ppar_id = 0 ... 15

For the console of a Guest/Logical Domain, log into the control domain of the PPAR and issue the following:

# ldm list

For example, to log into Guest/Logical Domain ldg0, telnet to the port shown in its CONS column:

jack@m10-1:~$ ldm list
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 16 32G 0.0% 0.0% 23h 57m
ldg0 active -n---- 5000 8 32G 0.0% 0.0% 23h 48m

jack@m10-1:~$ telnet localhost 5000
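
The console port used in the telnet command comes from the CONS column of the ldm list output. As a minimal sketch, assuming whitespace-separated columns in the order shown above (NAME, STATE, FLAGS, CONS, ...), the port for a given domain can be looked up as follows. The function name is illustrative.

```python
def console_port(ldm_list_output, domain):
    """Return the numeric CONS port for `domain`, or None if there is none."""
    for line in ldm_list_output.splitlines():
        fields = line.split()
        # Columns: NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
        if len(fields) >= 4 and fields[0] == domain:
            cons = fields[3]
            # The primary domain shows UART rather than a port number.
            return int(cons) if cons.isdigit() else None
    return None


sample = """\
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 16 32G 0.0% 0.0% 23h 57m
ldg0 active -n---- 5000 8 32G 0.0% 0.0% 23h 48m
"""

print(console_port(sample, "ldg0"))
print(console_port(sample, "primary"))
```

For the sample above, the lookup for ldg0 yields the port used in the telnet example, while the primary domain has no telnet port because its console is reached through the XSCF.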

 

XSCF> showdomainstatus
Invalid parameter.
usage : showdomainstatus -p ppar_id [-v] [-M] [-g domainname]
showdomainstatus -h
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 Solaris running
XSCF> console -p 0

Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y

 

 


If the console is in use by another user, use the -f option to force
takeover of the console.

If the console connects but shows no output and does not respond to input,
there is a very high likelihood that the PPAR is indeed hung, since it cannot
be reached over the network or through its console.

Once all of the above has been verified and the PPAR is confirmed to be
unresponsive, proceed to the next step to recover it.

 

 Step #4  Proceed to recover the hung/unresponsive PPAR

 

Two choices are available when a PPAR is hung: reset or sync.

A reset will recover the PPAR.
A sync will attempt to generate a core dump and then reset
to recover the PPAR.

If a root cause analysis (RCA) is needed to determine why the PPAR hung,
select the sync option.

If an RCA is not required, select reset to recover the PPAR.

 

Open two terminal sessions to the M10 machine's XSCF

Terminal session #1 is to connect to the PPAR's console

Terminal session #2 is to run the sendbreak command to the PPAR

 

Terminal Session #1

XSCF> console -p 0

Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y

 

Terminal session #2

XSCF> sendbreak -p 0
Send break signal to PPAR-ID 0?[y|n] :y

 After the sendbreak command is issued, the following can be seen on Terminal Session #1

XSCF> console -p 0

Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y

c)ontinue, s)ync, r)eset?

 

 

c) continue = same as typing "go" on a legacy system; cancels the break and returns to the OS
s) sync = panics the system, then drops to the OBP prompt or boots
r) reset = performs a reset, then drops the system to the OBP prompt or boots

Example of the sync option after the sendbreak command has been issued on a PPAR

In this scenario, the hung/unresponsive PPAR responded to the sync command and generated a core dump:

m10-1 console login:
m10-1 console login: Debugging requested; hardware watchdog suspended.
c)ontinue, s)ync, r)eset? s

panic[cpu13]/thread=2a101ef3b80: sync initiated

sched: trap type = 0x0
pid=0, pc=0x0, sp=0x0, tstate=0x0, context=0x0
o0-o7: 0, 0, 0, 0, 0, 0, 0, 0
g1-g7: 0, 0, 0, 0, 0, 0, 0

000002a101ef28d0 unix:sync_handler+13c (0, 8000000000000000, 1010d000, 2a101ef3b80, 1, 20122c00)
%l0-3: 0000000020833c00 0000000000000000 0000000020511400 0000000000000000
%l4-7: 00000000204ccc00 000003000001c300 000000002011c400 0000030000018100

 

 


...

000002a101ef3080 unix:dispatch_handler+224 (30000018100, 2a101ef3b80, 0, 7008c900, 0, f)
%l0-3: 000000000000000a 00000300000181a0 0000030000018150 0000030000018148
%l4-7: 0000000000000000 0000000000000000 0000000010024254 0000000000000000

syncing file systems... done
Preserving kernel image in RAM, content: kernel sections: zfs
0:12 96% done (kernel)
0:12 100% done (zfs)
100% done: 299673 (kernel) + 9559 (zfs) pages dumped, dump succeeded
rebooting...
Resetting...
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.

SPARC M10-1, No Keyboard
Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.38.5, 32.0000 GB memory available, Serial #268895952.
[ 2.21.0 ]
Ethernet address b0:99:28:a0:5d:d0, Host ID: 900706d0.

 

Here is an example where a core dump is not needed for analysis, so the reset option is selected.

root@m10-1:~#
root@m10-1:~# Debugging requested; hardware watchdog suspended.
c)ontinue, s)ync, r)eset? r

Resetting...
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.

SPARC M10-1, No Keyboard
Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.38.5, 32.0000 GB memory available, Serial #268895952.
[ 2.21.0 ]
Ethernet address b0:99:28:a0:5d:d0, Host ID: 900706d0.

{0} ok boot
Boot device: disk File and args:
SunOS Release 5.11 Version 11.3 64-bit
Copyright (c) 1983, 2015, Oracle and/or its affiliates. All rights reserved.

...

 

Step #5 If the PPAR is hard hung, it may not respond to a sendbreak command from the XSCF.

In that case, use the reset command. Issued from the XSCF, it will attempt
to force the PPAR or LDOM to respond.

XSCF> reset [ [-q] -{y|n}] -p ppar_id por

XSCF> reset [ [-q] -{y|n}] -p ppar_id xir

The available options are:

por (power-on reset) - Resets the PPAR.
xir (externally initiated reset) - Resets all CPUs in the PPAR.


 

 

Example reset command with POR option

 

XSCF> reset -y -p 0 por
PPAR-ID to reset :00
Continue? [y|n] :y
00 : Resetting

*Note*
This command only issues the instruction to reset.
The result of the instruction can be checked by the "showpparprogress".
XSCF> showpparprogress -p 0
PPAR reset PPAR#0 [ 1/13]
CPU Stop PPAR#0 [ 2/13]
PSU Off PPAR#0 [ 3/13]
XBBOX Reset PPAR#0 [ 4/13]
PSU On PPAR#0 [ 5/13]
CMU Reset Start PPAR#0 [ 6/13]
XB Reset 1 PPAR#0 [ 7/13]
XB Reset 2 PPAR#0 [ 8/13]
XB Reset 3 PPAR#0 [ 9/13]
CPU Reset 1 PPAR#0 [10/13]
CPU Reset 2 PPAR#0 [11/13]
Reset released PPAR#0 [12/13]
CPU Start PPAR#0 [13/13]
The sequence of power control is completed.
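
If the showpparprogress output is captured to a log, the step counters can be extracted to see how far the reset sequence got. The following is a minimal sketch, assuming lines of the form shown above (a step name, PPAR#n, and a [current/total] counter) and the completion message exactly as printed in this example. The names used are illustrative.

```python
import re

# Matches lines such as "CPU Reset 1 PPAR#0 [10/13]".
STEP_RE = re.compile(
    r"^(?P<name>.+?)\s+PPAR#(?P<ppar>\d+)\s+\[\s*(?P<done>\d+)/(?P<total>\d+)\]$"
)
DONE_LINE = "The sequence of power control is completed."


def parse_progress(text):
    """Return (steps, completed) from captured showpparprogress output.

    steps is a list of (step_name, current, total) tuples in output order.
    """
    steps = []
    completed = False
    for line in text.splitlines():
        line = line.strip()
        match = STEP_RE.match(line)
        if match:
            steps.append((match.group("name"),
                          int(match.group("done")),
                          int(match.group("total"))))
        elif line == DONE_LINE:
            completed = True
    return steps, completed


sample = """\
PPAR reset PPAR#0 [ 1/13]
CPU Reset 1 PPAR#0 [10/13]
CPU Start PPAR#0 [13/13]
The sequence of power control is completed.
"""

steps, completed = parse_progress(sample)
print(steps[-1], completed)
```

A log in which the last counter is below the total and the completion message is absent suggests the sequence was still running, or stalled, when the output was captured.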

 

XSCF>console -f -p 0

XSCF> m10-1aconsole login: POST Sequence 01 Banner
LSB#00: POST 3.13.0 (2016/10/14 09:37)
POST Sequence 02 CPU Check
POST Sequence 03 CPU Register
POST Sequence 04 STICK Increment
POST Sequence 05 Extended Instruction
POST Sequence 06 MMU
POST Sequence 07 Memory Initialize
POST Sequence 08 Memory Address Line
POST Sequence 09 MSCAN
POST Sequence 0A Cache
POST Sequence 0B Floating Point Unit
POST Sequence 0C Encryption
POST Sequence 0D Cacheable Instruction
POST Sequence 0E Softint
POST Sequence 0F CPU Cross Call
POST Sequence 10 CMU-CH
POST Sequence 11 PCI-CH
POST Sequence 12 TOD
POST Sequence 13 MBC Check Before STICK Diag
POST Sequence 14 STICK Stop
POST Sequence 15 STICK Start
POST Sequence 16 Barrier Blade
POST Sequence 17 Single Barrier Bank
POST Sequence 18 Sector Cache
POST Sequence 19 SX
POST Sequence 1A RT
POST Sequence 1B RT/SX NC
POST Sequence 1C RT/SX Interrupt
POST Sequence 1D RT/SX Barrier
POST Sequence 1E Error CPU Check
POST Sequence 1F System Configuration
POST Sequence 20 System Status Check
POST Sequence 21 Start Hypervisor
POST Sequence Complete.

Hypervisor version: @(#)Hypervisor 1.4.12 2016/12/15 09:54 1.4.11+3

Configuring System Board.... .Completed.

Starting Logical Domains....

NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.

SPARC M10-1, No Keyboard
Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.38.5, 32.0000 GB memory available, Serial #268895952.
[ 2.21.0 ]
Ethernet address b0:99:28:a0:5d:d0, Host ID: 900706d0.

{0} ok

 

Example of reset command with XIR option


### issuing XIR to a PPAR

XSCF> reset -y -p 0 xir
PPAR-ID to reset :00
Continue? [y|n] :y
00 : Resetting

*Note*
This command only issues the instruction to reset.
The result of the instruction can be checked by the "showpparprogress".
XSCF>
XSCF>
XSCF> showpparprogress -p 0
PPAR reset PPAR#0 [ 1/13]
CPU Stop PPAR#0 [ 2/13]
PSU Off PPAR#0 [ 3/13]
XBBOX Reset PPAR#0 [ 4/13]
PSU On PPAR#0 [ 5/13]
CMU Reset Start PPAR#0 [ 6/13]
XB Reset 1 PPAR#0 [ 7/13]
XB Reset 2 PPAR#0 [ 8/13]
XB Reset 3 PPAR#0 [ 9/13]
CPU Reset 1 PPAR#0 [10/13]
CPU Reset 2 PPAR#0 [11/13]
Reset released PPAR#0 [12/13]
CPU Start PPAR#0 [13/13]
The sequence of power control is completed.

...

XSCF> console -p 0

Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y
POST Sequence 01 Banner
LSB#00: POST 3.13.0 (2016/10/14 09:37)
POST Sequence 02 CPU Check
POST Sequence 03 CPU Register
POST Sequence 04 STICK Increment
POST Sequence 05 Extended Instruction
POST Sequence 06 MMU
POST Sequence 07 Memory Initialize
POST Sequence 08 Memory Address Line
POST Sequence 09 MSCAN
...

POST Sequence 1F System Configuration
POST Sequence 20 System Status Check
POST Sequence 21 Start Hypervisor
POST Sequence Complete.

Hypervisor version: @(#)Hypervisor 1.4.12 2016/12/15 09:54 1.4.11+3

Configuring System Board.... .Completed.

Starting Logical Domains....

NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.

SPARC M10-1, No Keyboard
Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.38.5, 125.0000 GB memory available, Serial #268895952.
[ 2.21.0 ]
Ethernet address b0:99:28:a0:5d:d0, Host ID: 900706d0.

{0} ok boot

 

 

If the PPAR is functioning but a Guest/Logical Domain has an issue, you can leave the PPAR running and reset only the affected Guest or Logical Domain by specifying it by name.

 


XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 Solaris Running <<<--- Guest Domain
XSCF>

 

 

To reset the Guest or Logical Domain, use the reset command with the -g option:

reset [ [-q] -{y|n}] -p ppar_id -g domainname sir

reset [ [-q] -{y|n}] -p ppar_id -g domainname panic

where

sir - Resets the logical domain.

panic - Instructs Oracle Solaris in the logical domain to panic. The instruction is ignored while the domain is shutting down or suspended.
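
Because the reset type depends on whether a whole PPAR or a single logical domain is targeted, it is easy to mix the keywords up. The sketch below only builds the command string for review before it is typed at the XSCF prompt. It uses the four reset types named in this document (por and xir for a PPAR, sir and panic for a logical domain), and the function name is illustrative rather than part of any Oracle or Fujitsu tool.

```python
PPAR_RESET_TYPES = {"por", "xir"}      # valid without -g
DOMAIN_RESET_TYPES = {"sir", "panic"}  # valid only with -g domainname


def build_reset_command(ppar_id, reset_type, domain=None):
    """Return the XSCF reset command line as a string, or raise ValueError."""
    if isinstance(ppar_id, bool) or not isinstance(ppar_id, int) or ppar_id < 0:
        raise ValueError("ppar_id must be a non-negative integer")
    if domain is None:
        if reset_type not in PPAR_RESET_TYPES:
            raise ValueError("a PPAR reset requires por or xir")
        return f"reset -p {ppar_id} {reset_type}"
    if reset_type not in DOMAIN_RESET_TYPES:
        raise ValueError("a logical domain reset requires sir or panic")
    return f"reset -p {ppar_id} -g {domain} {reset_type}"


print(build_reset_command(0, "sir", domain="ldg0"))
print(build_reset_command(0, "xir"))
```

Keeping the two sets separate makes the mismatch explicit: asking for por together with a domain name raises an error instead of producing a command the XSCF would reject.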

  

 


XSCF> reset -p 0 -g ldg0 sir
PPAR-ID :00
GuestDomain to sir : ldg0
Be sure to execute "ldm add-spconfig" before using this command when you have changed the ldm configuration.
Otherwise, an unexpected domain might be reset.
Continue? [y|n] :y
00 ldg0 : Resetting

XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 OpenBoot Running
XSCF>

 


XSCF> reset -p 0 -g ldg0 panic
PPAR-ID :00
GuestDomain to panic : ldg0
Be sure to execute "ldm add-spconfig" before using this command when you have changed the ldm configuration.
Otherwise, an unexpected domain might be reset.
Continue? [y|n] :y
00 ldg0 : Resetting

*Note*
This command only issues the instruction to reset.
The result of the instruction can be checked by the "showdomainstatus".

{0} ok boot
Boot device: disk File and args:
SunOS Release 5.10 Version Generic_147147-26 64-bit
Copyright (c) 1983, 2013, Oracle and/or its affiliates. All rights reserved.
Hostname: bookable-10-187-57-211

ldg0 console login: Mar 6 13:27:30 bookable-10-187-57-211 sendmail[607]: My unqualified host name
ldg0 unknown; sleeping for retry

ldg0 console login:
ldg0 console login:
panic[cpu0]/thread=2a10001fc80: Panic - Generated at user request

000002a10001f6a0 unix:process_nonresumable_error+2d8 (2a10001f890, 0, ff, 40, 5, 40)
%l0-3: 0000000000000100 0000000003000000 0000000000000001 000000000180c600
%l4-7: 0000000000000000 0000000100000000 00000000ffffffff 0000000000000000
000002a10001f7e0 unix:ktl0+64 (0, 1, 0, 100, 101010101010101, 12)
%l0-3: 000000000180c000 0000000000000000 0000000011001406 0000000001029aa4
%l4-7: 0000000000000000 0000000000000000 0000000000000000 000002a10001f890
000002a10001f930 unix:cpu_halt+f4 (180c000, 0, 19dd9b8, 19dd888, 180c000, 0)
%l0-3: 00000000019d73a4 0000000000000001 0000000000000016 0000000000000000
%l4-7: 0000000000000000 0000000000000002 000000000180c178 0000000000000001
000002a10001f9e0 unix:idle+128 (1866c00, 8, 180c000, ffffffffffffffff, 1, 1865800)
%l0-3: 00000000019d7380 000000000000001b 0000000000000000 ffffffffffffffff
%l4-7: 0000000000000001 0000000001a6ec00 000000000180c178 0000000001045bac

syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kerne

Step #6 If a guest domain within the PPAR is unresponsive, first attempt to send a
break to the guest domain from the control domain
(via telnet and send break).

ldg0 console login:
ldg0 console login:
ldg0 console login: root
Password:
Mar 6 14:08:32 ldg0 login: ROOT LOGIN /dev/console
Last login: Mon Mar 6 13:24:28 on console
Oracle Corporation SunOS 5.10 Generic Patch January 2005
#
#
#
#
#

 

 

 Summary of various scenarios where the PPAR or Guest/Logical Domain is hung/unresponsive

The scenarios below cover the Physical Partition (PPAR) and the Guest or Logical Domains within a PPAR.

Scenario #1: Physical Partition (PPAR) hang/unresponsive

XSCF> showdomainstatus -p <PPAR#>
XSCF> sendbreak -p <PPAR#>     where PPAR# = 0 ... 15

XSCF> sendbreak -p 0
Send break signal to PPAR-ID 0?[y|n] :y
XSCF> console -p 0

Console contents may be logged.
Connect to PPAR-ID 0?[y|n] :y

c)ontinue, s)ync, r)eset? c

 
Scenario #2: Physical Partition hard hang (no response to sendbreak)

The PPAR may be hard hung and may not respond to a sendbreak command
from the XSCF.
In that case, the reset command issued from the XSCF
will attempt to force the PPAR or LDOM to respond.

XSCF> reset [ [-q] -{y|n}] -p ppar_id por

XSCF> reset [ [-q] -{y|n}] -p ppar_id xir

 
Scenario #3: Guest/Logical Domain has an issue

Issue a send break to the Guest Domain from its console, then select sync to panic the domain.
 

 

jack:~$ ldm list
NAME STATE FLAGS CONS VCPU MEMORY UTIL NORM UPTIME
primary active -n-cv- UART 16 32G 0.2% 0.2% 4m
ldg0 active -t---- 5000 8 32G 12% 12% 4m
jack:~$
jack:~$ telnet localhost 5000

ldg0 console login: root
Password:
Mar 6 14:08:32 bookable-10-187-57-211 login: ROOT LOGIN /dev/console
Last login: Mon Mar 6 13:24:28 on console
Oracle Corporation SunOS 5.10 Generic Patch January 2005
#
#

telnet> send break
Debugging requested; hardware watchdog suspended.
c)ontinue, s)ync, r)eset? s

panic[cpu6]/thread=2a10057dc80: sync initiated

sched: trap type = 0x0
pid=0, pc=0x0, sp=0x0, tstate=0x0, context=0x0
o0-o7: 0, 0, 0, 0, 0, 0, 0, 0
g1-g7: 0, 0, 0, 0, 0, 0, 0


 

Scenario #4: Guest/Logical Domain does not respond to send break

If a guest domain within the PPAR is unresponsive, first attempt to send a break to the
guest domain from the control domain (via telnet and send break).

If the guest domain does not respond to the send break issued through telnet,
reset the guest domain from the M10's XSCF:

 
XSCF> showdomainstatus -p 0
Logical Domain Name Status
primary Solaris running
ldg0 Solaris Running
XSCF>

XSCF> reset -p 0 -g ldg0 sir
PPAR-ID :00
GuestDomain to sir : ldg0
Be sure to execute "ldm add-spconfig" before using this command when you have changed the ldm configuration.
Otherwise, an unexpected domain might be reset.
Continue? [y|n] :y
00 ldg0 : Resetting

or

XSCF> reset -p 0 -g ldg0 panic
PPAR-ID :00
GuestDomain to panic : ldg0
Be sure to execute "ldm add-spconfig" before using this command when you have changed the ldm configuration.
Otherwise, an unexpected domain might be reset.
Continue? [y|n] :y
00 ldg0 : Resetting
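
The four scenarios can be read as a short escalation ladder. The sketch below encodes that reading as a lookup function. It is an interpretation of this document rather than an official procedure. The returned labels are made-up shorthand, and pairing the panic reset type with the need for a core dump is an assumption based on the panic example above, which shows a dump being written.

```python
def recovery_action(target, answers_break, need_core_dump):
    """Suggest the next recovery action.

    target         -- "ppar" or "guest"
    answers_break  -- True if the break prompt (c/s/r) appeared on the console
    need_core_dump -- True if a root cause analysis is planned
    """
    if target not in ("ppar", "guest"):
        raise ValueError("target must be 'ppar' or 'guest'")
    if answers_break:
        # Scenarios #1 and #3: answer the c)ontinue, s)ync, r)eset? prompt.
        return "prompt: s (sync)" if need_core_dump else "prompt: r (reset)"
    if target == "ppar":
        # Scenario #2: hard hang, reset the whole PPAR from the XSCF.
        return "XSCF: reset -p <ppar_id> por|xir"
    # Scenario #4: reset only the guest domain from the XSCF.
    if need_core_dump:
        return "XSCF: reset -p <ppar_id> -g <domain> panic"
    return "XSCF: reset -p <ppar_id> -g <domain> sir"


print(recovery_action("ppar", answers_break=True, need_core_dump=True))
print(recovery_action("guest", answers_break=False, need_core_dump=False))
```

The function deliberately leaves the choice between por and xir open, since this document does not state when one should be preferred over the other.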

 

 

 

 

 

 

 


  Copyright © 2018 Oracle, Inc.  All rights reserved.