Asset ID: 1-72-1010335.1
Update Date: 2018-05-01
Solution Type: Problem Resolution Sure
Solution 1010335.1: Sun Fire[TM] 12K/15K/E20K/E25K: Identifying and recovering from a domain hang
Related Items
- Sun Fire E25K Server
- Sun Fire 12K Server
- Sun Fire 15K Server
- Sun Fire E20K Server
Related Categories
- PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: SF-Exxk
- _Old GCS Categories>Sun Microsystems>Servers>High-End Servers
Previously Published As 214177
Identifying and recovering from a hung domain
Applies to:
Sun Fire 12K Server - Version Not Applicable and later
Sun Fire 15K Server - Version Not Applicable and later
Sun Fire E25K Server - Version Not Applicable and later
Sun Fire E20K Server - Version Not Applicable and later
All Platforms
***Checked for relevance on 14-Jul-2014***
Symptoms
Domain is no longer responding
Changes
No specific change is required for a domain to hang.
Cause
The root cause can be determined only after the data collection procedure described below has been completed successfully.
Solution
There are several tools to use when trying to determine if a domain is hung. If a domain does not respond to the following commands from the System Controller, that is a good indication that the domain is hung.
sc0:sms-svc:1> ping <domain_name>
sc0:sms-svc:2> telnet <domain_name>
sc0:sms-svc:3> console -d <domain_id>
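The network check above could be wrapped in a small script for repeated use. This is a minimal sketch, assuming the SC runs Solaris, where `ping <host> <timeout>` exits 0 when the host answers. The function names, the `PING` override variable, and the 5-second default timeout are illustrative choices and not part of SMS.

```shell
#!/bin/sh
# Hypothetical helpers; not part of SMS.
# PING may be overridden; Solaris syntax "ping host timeout" is assumed.
PING=${PING:-/usr/sbin/ping}

# domain_responds <domain_name> [timeout_seconds]
# Returns 0 if the domain answers ping within the timeout, non-zero otherwise.
domain_responds() {
    $PING "$1" "${2:-5}" >/dev/null 2>&1
}

# check_domain <domain_name>
# Prints a one-line verdict based on the ping result alone.
check_domain() {
    if domain_responds "$1"; then
        echo "$1: responds to ping"
    else
        echo "$1: no ping response, possibly hung"
    fi
}
```

A failed ping only suggests a hang; the telnet and console checks are still needed to confirm it.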
Now that you have established the likelihood that the domain is hung, the following steps can be used to return to a 'Running Solaris' state.
1. Connect to the console of the hung domain and send a break sequence to drop the OS to the OK prompt. Once at the OK prompt, enter the 'sync' command to generate a system dump file and reboot the domain.
From the System Controller, connect to the domain through the console command.
sc0:sms-svc:1> console -d <domain_id>
Even though there won't be any response or activity, we can still send a break sequence (~#) that will drop the OS to the OK prompt, effectively a Stop-A. Once at the OK prompt the sync command will try to generate a system dump file and reboot the domain.
~#
Type 'go' to resume
{0} ok sync
If you are connected to the System Controller via SSH (Secure Shell), SSH will intercept the break sequence and the system will not drop to the OK prompt. To avoid this, either tell SSH not to intercept the sequence by preceding it with another tilde (i.e., ~~#), or change the SSH escape character to something other than ~ with the -e option when starting ssh.
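The second workaround might look like the sketch below. The wrapper name `sc_login`, the choice of '%' as the replacement escape character, and the host name in the usage line are assumptions made for illustration; only the -e option itself comes from the text above.

```shell
#!/bin/sh
# Hypothetical wrapper; user and host names are placeholders.
# -e sets the SSH escape character. With '%' in place of '~',
# a "~#" typed in the console session is passed through to the console.
sc_login() {
    ssh -e '%' "sms-svc@$1"
}
```

For example, `sc_login sc0` would open the session, after which `console -d <domain_id>` and `~#` can be used as described.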
2. If a break sequence at the console is insufficient to regain control of the domain, try the 'reset -x' command from the System Controller, then enter 'sync' manually at the OK prompt. This is less desirable than the first option (send break, sync) because an XIR reset may clear some valuable information in the CPU stats. However, if this was a software hang, the core file is still quite useful.
sc0:sms-svc:2> reset -x -d <domain_id>
{0} ok sync
Note: Ensure that you use the '-x' option with reset. Without it, no crash dump will be possible.
It may take several seconds for the OK prompt to appear after issuing this command. Once at the OK prompt, be sure to type the sync command so that a core file can be generated.
3. Finally, if neither of the above works, cycle the domain with setkeyswitch off and on.
sc0:sms-svc:1> setkeyswitch -d <domain_id> off
sc0:sms-svc:2> setkeyswitch -d <domain_id> on
This should recover the domain; however, no core file will be generated.
Performing setkeyswitch off/on is the least desirable method and should be used only as a final option for recovery.
Advanced SF15k Hang Debugging Procedure By Daniel Ellison, PTS-HSG-Americas
Sometimes it is useful to investigate the cause of the hang. The following advanced procedure describes information that can be extracted from the hardware state of a Sun Fire[TM] 12K/15K/E20K/E25K system using both OBP and redx. In this procedure, the redx commands 'xir' and 'bbxir' take the place of the SMS command 'reset(1M)'.
Prerequisites
- /etc/system on the domain must contain the line: set nopanicdebug = 1
- You must have a console window open on the domain.
- Domain must have a defined dump device (dumpadm).
- Domain must have savecore enabled. It is enabled by default for the Solaris[TM] 8 Operating Environment. Check dumpadm and/or /etc/rc2.d/S75savecore.
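Two of the prerequisites above can be checked on the domain ahead of time. The following is a minimal sketch with hypothetical function names; it assumes that dumpadm, run without arguments, prints a line of the form "Savecore enabled: yes".

```shell
#!/bin/sh
# Hypothetical helpers for checking the prerequisites on the domain.

# has_nopanicdebug <system_file>
# Returns 0 if the file contains an active "set nopanicdebug = 1" line.
# Whitespace is stripped before matching; lines starting with '*'
# are comments in /etc/system and therefore do not match.
has_nopanicdebug() {
    tr -d ' \t' < "$1" | grep '^setnopanicdebug=1$' >/dev/null
}

# savecore_enabled
# Reads dumpadm output on stdin.
# Assumes the output contains a line like "Savecore enabled: yes".
savecore_enabled() {
    grep 'Savecore enabled: *yes' >/dev/null
}
```

On the domain this could be used as `has_nopanicdebug /etc/system && dumpadm | savecore_enabled`, with a non-zero exit status indicating that a prerequisite is missing.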
When the domain hangs, perform the following on the SC:
1. cd /var/opt/SUNWSMS/adm/<domain_letter>/post.
Run the following script as user sms-svc. The script prompts for the processor to which the XIR should be sent, so run it once per processor.
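#! /bin/ksh
# Prompt for the location of the processor that should receive the XIR.
print "Enter expander number [0-17]:\t\c"
read E
print "Enter slot number [0 or 1]:\t\c"
read S
print "Enter proc number [0-3]:\t\c"
read P
# Timestamp (yymmdd.HHMMSS) used to name the log file.
RDATE=`/bin/date +%y%m%d.%H%M%S`
# Feed the commands below to redx and capture its output in the log file.
redx -c <<EOF >xirdump.$RDATE.log
port $E $S $P
shproc
xir
bbxir
shproc
lo
EOF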
#####
#END#
#####
NOTE:
This is the basic script.
A much fancier one could be generated using this as a basis.
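One possible refinement is to validate the three values before they are handed to redx. The sketch below is illustrative only: the function names are hypothetical, and the ranges are taken from the prompts in the basic script (expander 0-17, slot 0 or 1, proc 0-3).

```shell
#!/bin/sh
# Hypothetical input check for the expander/slot/proc prompts.

# in_range <value> <min> <max>
# Returns 0 only if value is a non-negative decimal integer within [min, max].
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# valid_port <expander> <slot> <proc>
# Ranges follow the prompts in the basic script: 0-17, 0-1, 0-3.
valid_port() {
    in_range "$1" 0 17 && in_range "$2" 0 1 && in_range "$3" 0 3
}
```

A script could call `valid_port "$E" "$S" "$P"` after the three reads and exit with a message if it fails, so that no XIR is sent to an unintended target.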
2. In the console window, the domain should notice the XIR and drop to OBP.
DSMD should also report XIR being detected in the domain messages file.
Example for proc 2:
# ERROR: error-reset-cleanup: Externally Initiated Reset has occured. Externally Initiated Reset
{2} ok go
NOTE:
The number in the OBP prompt tells you, in hex, which proc dropped to OBP due to the XIR. If this does NOT happen, pick a different proc and run the above script again. Otherwise, issue the following commands at OBP and save all data.
<ok> .locals
<ok> .registers
<ok> .cpu-afsr
<ok> cpu-afar@ .
<ok> .trap-registers
<ok> ctrace
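Because the number in the OBP prompt is hexadecimal, it can help to convert it to decimal when comparing it with other tools' output. A minimal sketch follows; the helper name `prompt_cpu` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert the hex CPU number in an OBP prompt to decimal.

# prompt_cpu <prompt_text>   e.g. prompt_cpu '{1f} ok'
prompt_cpu() {
    # Pull the hex digits from between the braces.
    hex=`echo "$1" | sed -n 's/^{\([0-9a-fA-F][0-9a-fA-F]*\)}.*/\1/p'`
    [ -n "$hex" ] || return 1
    # printf accepts a 0x-prefixed constant for %d.
    printf '%d\n' "0x$hex"
}
```

For the example prompt above, `prompt_cpu '{2} ok'` prints 2; a prompt of `{1f} ok` would give 31.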
3. From the redx shproc output, you can find what address the proc stopped at; for example, PC[63:6],2b'0 = 00000000 1003E84_.
You can display where exactly this is using the 'dis' Forth word.
What you type here will be recorded in the domain's console log file on the system controller.
Examples:
1003E840 dis
1003E848 dis
10029FE8 dis
1003E898 dis
You can also look up the PC value from the 'bbxir' output that is listed for trap type 0x03 in the tl list and disassemble that address as well. For example, IF YOU SEE THIS:
tl: 1
tt tstate tpc tnpc
0x03 0x0080001600 00000000.101C1740 00000000.101C1744
THEN YOU TYPE THIS:
101C1740 dis
This will tell you where the CPU was sitting when it received the XIR (XIR trap type = 3).
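The lookup of the trap type 0x03 entry can be scripted against a saved copy of the bbxir output. This is a sketch under assumptions: the function name is hypothetical, and the four-column layout (tt, tstate, tpc, tnpc, with tpc written as two dot-separated 32-bit halves) is taken from the example above and may differ in other output.

```shell
#!/bin/sh
# Hypothetical helper: read saved bbxir output on stdin and print a
# "<address> dis" line for each trap-type 0x03 (XIR) entry.
# Assumes the four-column layout: tt tstate tpc tnpc,
# with tpc written as <high32>.<low32>.
xir_dis_cmds() {
    awk '$1 == "0x03" && NF >= 4 {
        split($3, pc, ".")
        # Drop the high word only when it is all zeros.
        if (pc[1] ~ /^0+$/) print pc[2] " dis"
        else print pc[1] pc[2] " dis"
    }'
}
```

Feeding it the example lines above yields `101C1740 dis`, which can then be typed at the OBP prompt.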
4. Type 'sync' at OBP to attempt to get a core.
5. setkeyswitch off
6. setkeyswitch on - recover domain
7. Run Sun[TM] Explorer on the system controller and domain.
Together with the core file, the Explorer outputs provide all of the relevant data for analysis. Make sure the core file is sent along with the Sun Explorer data collections.
hung, hang, recover, 12K, 15K, 20K, 25K
Previously Published As 48138