Asset ID: 1-71-1343158.1
Update Date: 2017-03-29
Solution Type: Technical Instruction Sure
Solution 1343158.1: How to Perform On Site Diagnosis for a Down System for Sun Fire 280R, V480, V490, V880, V880z and V890 Systems
Related Items
- Sun Fire 280R Server
- Sun Fire V480 Server
- Sun Fire V890 Server
- Sun Fire V880z Visualization Server
- Sun Fire V880 Server
- Sun Fire V490 Server
Related Categories
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: SPARC-CAP VCAP
Applies to:
Sun Fire 280R Server - Version All Versions to All Versions [Release All Releases]
Sun Fire V480 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire V490 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire V880z Visualization Server - Version All Versions to All Versions [Release All Releases]
Sun Fire V880 Server - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Goal
******************************************************************************
To report errors or request improvements on this procedure,
please go to https://support.us.oracle.com and put a comment on Doc ID: 1343158.1
******************************************************************************
To aid Field Engineers in on-site diagnosis of Down Hard systems.
Solution
DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE ENGINEER NEED:(IS A SITE ENGINEER AVAILABLE?)
System Controller Application, LOM, Intermediate Solaris Skills
TIME ESTIMATE: 120 minutes
TASK COMPLEXITY: 4
FIELD ENGINEER INSTRUCTIONS
PROBLEM OVERVIEW:
Down System
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? :
Down Hard, unknown reason.
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
1. Validate whether the system is powered on or not (or if board power issues are present).
# Are the LEDs lit, are the fans spinning? If nothing is powered on, then the issue is external to the server.
# Confirm power to the cabinet.
# Confirm power is being provided to the power inlets on the 280R/V480/V490, or to the power supplies on the V880/V890.
# Investigate the system's power source, power cords, etc. for a potential issue.
2. Once power is confirmed, connect to the RSC or serial port.
See this document for directing output from the RSC to the serial console and back: <Document 1013110.1>
This document will also be helpful for connecting to the console:
Sun Fire 280R V480 V490 V880 and V890 Reference for Improving Remote Diagnosibility (Doc ID 1541598.1)
As a temporary method to direct output to the serial port, put the system keyswitch in the maintenance position and power cycle the system. The console is redirected to the ttya serial port until the keyswitch is returned to normal and the system is reset with reset-all. (4.15.X or higher)
Once the console is established, Solaris is often at the 'fsck' prompt asking for confirmation to fsck root or another filesystem. Answer 'y' to proceed with the fsck.
3. Validate that the system is not currently executing POST, or that the system has enough hardware to run POST.
# Console output will be scrolling "LPOST" messages if POST is running.
# POST needs to complete before the system starts OBP ("ok" prompt) and then can boot.
# If POST has completed, it may be useful to look through the POST output for devices that have failed. A simple search of the output for "fail" should find parts failed by POST. If the RSC is configured, `consolehistory boot` can be useful for looking at POST output. It is best to capture consolehistory early, since the buffer space is limited and can be overrun by additional activity.
On smaller configurations with limited hardware, a single CPU failure can prevent POST from running. Process-of-elimination troubleshooting may be necessary to root cause these situations.
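The "search for fail" step above can be sketched in Python for a console capture that has been saved as text. This is only an illustration: the function name find_post_failures and the sample log are hypothetical, and real POST output differs by platform and firmware level.

```python
def find_post_failures(console_text):
    """Return (line_number, line) pairs for lines containing 'fail', ignoring case."""
    hits = []
    for number, line in enumerate(console_text.splitlines(), start=1):
        if "fail" in line.lower():
            hits.append((number, line.strip()))
    return hits


# Invented excerpt for illustration only; it is NOT real POST output.
SAMPLE_LOG = """\
0>Test one ..... passed
1>Test two ..... FAILED
1>ERROR: device reported failure
0>POST complete
"""

if __name__ == "__main__":
    for number, line in find_post_failures(SAMPLE_LOG):
        print(f"{number}: {line}")
```

The match is deliberately broad (any line containing "fail"), so the hits are a starting point for reading the surrounding POST output, not a diagnosis.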
4. If the system is able to get to OBP, it may or may not "auto boot" depending on configuration. If it stops at the ok prompt, type boot and observe what happens.
#Setting auto-boot? at the ok prompt:
{2} ok printenv
Variable Value (Default Value)
------------------------------------------------------------------------------
--- Some output removed---
auto-boot? false (true)
--- Some output removed---
diag-device net (disk diskifp diskglm diskc net)
boot-device disk (disk diskifp diskglm diskc net)
--- Some output removed---
diag-switch? false (false)
{2} ok
In the above example auto-boot? is set to false. Use 'setenv auto-boot? true' to turn auto-boot? on.
Other settings noted above may affect booting behavior as well. diag-switch? should be set to its default, which is false. If it is true, the system will attempt to boot off the diag-device, which is usually the network. boot-device settings may vary. See Step 5 for a more complete discussion of boot-device.
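When a printenv capture is long, the two settings discussed above can be picked out with a short script. The following is a minimal sketch, assuming the capture uses the "name value (default)" layout shown in the example; the function names parse_printenv and boot_setting_notes are hypothetical and not part of any Sun or Oracle tool.

```python
import re

# Matches lines of the form: name value (default value)
ENV_LINE = re.compile(r"^(\S+)\s+(.*?)\s+\((.*)\)\s*$")


def parse_printenv(text):
    """Map each OBP variable name to its current value."""
    settings = {}
    for line in text.splitlines():
        match = ENV_LINE.match(line.strip())
        # Skip the "Variable Value (Default Value)" header row.
        if match and match.group(1) != "Variable":
            settings[match.group(1)] = match.group(2)
    return settings


def boot_setting_notes(settings):
    """Return notes for the settings that this procedure calls out."""
    notes = []
    if settings.get("auto-boot?") == "false":
        notes.append("auto-boot? is false: the system stops at the ok prompt")
    if settings.get("diag-switch?") == "true":
        notes.append("diag-switch? is true: the system boots from diag-device "
                     + settings.get("diag-device", "(unknown)"))
    return notes


# The printenv excerpt from this document.
SAMPLE_PRINTENV = """\
Variable Value (Default Value)
auto-boot? false (true)
diag-device net (disk diskifp diskglm diskc net)
boot-device disk (disk diskifp diskglm diskc net)
diag-switch? false (false)
"""

if __name__ == "__main__":
    for note in boot_setting_notes(parse_printenv(SAMPLE_PRINTENV)):
        print(note)
```

Lines that do not fit the assumed layout (prompts, separators, "output removed" markers) are ignored rather than guessed at.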
The RSC has a bootmode command that affects a single boot (the next one). It expires after 10 minutes if the system is not booted, and the system then returns to normal boot mode. Valid options to bootmode are:
bootmode [-u] [normal|forth|reset_nvram|diag|skip_diag]
5. Boot device issues are a common cause of failure to boot.
Trace the validity of the boot device. If the device being booted is an alias defined in devalias at the OBP, the device that the alias references must appear in the probe-scsi-all output.
{2} ok printenv boot-device
boot-device disk (disk diskifp diskglm diskc net)
{2} ok devalias
...
disk /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@0,0
disk0 /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@0,0
disk1 /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@1,0
disk2 /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@2,
...
{2} ok probe-scsi-all
--- Look for the presence of the boot device. -----
{2} ok
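The alias lookup part of this check can be sketched in Python for saved console text. This is an illustration only: the function names parse_devalias and resolve_boot_devices are hypothetical, and the sketch only maps boot-device entries to device paths. Confirming that each path is actually present still means reading the probe-scsi-all output, which is not parsed here.

```python
def parse_devalias(text):
    """Map alias names to device paths from 'alias /path' lines."""
    aliases = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].startswith("/"):
            aliases[parts[0]] = parts[1]
    return aliases


def resolve_boot_devices(boot_device_value, aliases):
    """Return (entry, path) pairs; path is None when the alias is not in the capture."""
    resolved = []
    for entry in boot_device_value.split():
        if entry.startswith("/"):
            resolved.append((entry, entry))  # already a full device path
        else:
            resolved.append((entry, aliases.get(entry)))
    return resolved


# Complete alias lines taken from the devalias excerpt in this document.
SAMPLE_DEVALIAS = """\
disk /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@0,0
disk0 /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@0,0
disk1 /pci@8,600000/SUNW,qlc@2/fp@0,0/disk@1,0
"""

if __name__ == "__main__":
    aliases = parse_devalias(SAMPLE_DEVALIAS)
    for entry, path in resolve_boot_devices("disk net", aliases):
        print(entry, "->", path if path else "not found in captured devalias output")
```

A result of None only means the alias was missing from the captured text; the devalias listing on the system itself remains the authority.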
# Alternate Boot device
It is often useful to boot off an alternate boot device to test whether the OS on the primary device is corrupt. It is also common to boot off the OS mirror disk when the primary mirror is experiencing hardware issues. An alternate device might be the DVD or CD-ROM, the network, or a root mirror or alternate disk image.
The alternate boot devices are usually listed in the output of devalias. Alias names can be created, so there is no way to list all known aliases, but vx-rootdisk and vx-rootmirror are common in Veritas Volume Manager environments. Any alias containing the word mirror should also be investigated as a possible boot device.
6. Other aids in troubleshooting boot issues.
# Verify Devices in POST.
It is often useful to display the results of POST to see components that have failed, or to verify that enough hardware is present to boot.
{2} ok show-post-results
--- Some output removed---
--- Verify sufficient hardware is present to boot.
ok
# Boot and POST hangs.
When booting hangs it is often useful to collect additional data. See <Document 1003108.1> for collecting additional data. This data along with a Solaris core can be used to diagnose the problem further.
# Verbose booting options for boot hanging.
When booting hangs after the SunOS banner appears, it is often helpful to gather additional data by booting Solaris verbosely with boot -v at the ok prompt. auto-boot? must be set to false so that the system stops at the ok prompt and accepts manual boot commands. See Step 4 for information on setting auto-boot?. If the boot operation appears to hang in the middle of disk probing, the verbose output can give additional insight into the cause of the boot failure.
# If you suspect that an SB, CPU, or DIMM is passing POST but still has problems, it can be manually disabled. asr-disable and asr-enable can be used for this. See <Document 1004897.1> for a more complete discussion of ASR.
{3} ok asr-disable cpu1 (V480)
{3} ok asr-disable cpu3 (V480)
{3} ok asr-disable cmp1 (V490)
{3} ok asr-disable cmp3 (V490)
{3} ok .asr (to check ASR Disablement Status)
Component: Status
Valid device labels include (note that cmp could be cpu on certain platforms):
cmp3-bank3 cmp3-bank2 cmp3-bank1 cmp3-bank0
cmp2-bank3 cmp2-bank2 cmp2-bank1 cmp2-bank0
cmp1-bank3 cmp1-bank2 cmp1-bank1 cmp1-bank0
cmp0-bank3 cmp0-bank2 cmp0-bank1 cmp0-bank0
pci-slot5 pci-slot4 pci-slot3 pci-slot2
pci-slot1 pci-slot0 gptwo-slotc gptwo-slotb
gptwo-slota ob-ide ob-net0 ob-net1
ob-fcal io-bridge9 io-bridge8 io-bridge5
cmp3 cmp2 cmp1 cmp0
* cmp3-bank* cmp2-bank* cmp1-bank*
cmp0-bank* pci* pci-slot* gptwo-slot*
io-bridge* cmp*
# Higher POST can be useful - Sometimes.
Note:
Use max POST only when searching for I/O controller issues, Fatal Resets, and Unrecoverable Errors (UEs) that Solaris is unable to properly diagnose. If diag-level is set to max at OBP, confirm with the customer that it can be moved to the factory default of min.
See: Guidance for POST Diagnostic Level Setting on Sun Fire[TM] 280R, V480, V490, V880, V890 and V880z servers (Doc ID 1582330.1) for details.
Often increasing POST level can catch marginal hardware issues. Use setenv at the OBP to control diag-level. Valid levels are off|min|max.
Additional testing on IO devices and peripherals can be performed with obdiag from the OBP.
ok obdiag
For a more complete discussion of obdiag see:
<Document 1383378.1> Sun Fire[TM] Servers (V280R, V480, V490, V880, V890):OBDiag Troubleshooting
Remember to set the parameters back to their original values after testing.
Final Words on boot issues:
If unsure how to proceed, or unable to perform the above process, collect as much information pertaining to the boot failure as possible (console logs, error messages, etc.). Attach the information to the SR, or have the system administrator attach it, then call back in and request the next available engineer.
OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
Customer should verify system is stable for return to production.
PARTS NOTE:
No parts are required for this Action Plan. Parts may end up being required, but they are not part of this Action Plan. Another Action Plan may be necessary.
REFERENCE INFORMATION:
Service Manuals, Admin Manuals, and RSC Release Notes:
http://download.oracle.com/docs/cd/E19095-01/sfv480.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sfv490.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sfv880.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sfv890.srvr/index.html
http://download.oracle.com/docs/cd/E19088-01/280r.srvr/index.html
References
<NOTE:1013110.1> - Establishing connection to the Remote System Control (RSC) SunFire[TM] V280r, V480, V490, V880z, V880, and V890
<NOTE:1003108.1> - How to gather information at the OK prompt
<NOTE:1383378.1> - Sun Fire[TM] Servers (V280R, V480, V490, V880, V890):OBDiag Troubleshooting
<NOTE:1004897.1> - Using ASR commands to manually enable/disable CPUs on a Sun Fire[TM] V480/V880 V490/v890