Asset ID: 1-71-1349558.1
Update Date: 2017-10-18
Keywords:
Solution Type: Technical Instruction Sure Solution
1349558.1: How to Perform On Site Diagnosis for a Down System for Enterprise Server Exx00 Systems:ATR:1349558.1:4
Related Items
- Sun Enterprise 5500 Server
- Sun Enterprise 5000 Server
- Sun Enterprise 4000 Server
- Sun Enterprise 3000 Server
- Sun Enterprise 3500 Server
- Sun Enterprise 6000 Server
- Sun Enterprise 6500 Server
- Sun Enterprise 4500 Server
Related Categories
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: SPARC-CAP VCAP
In this Document
Goal
Solution
Oracle Confidential (INTERNAL). Do not distribute to customers
Reason: FRU CAP
Applies to:
Sun Enterprise 3000 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Enterprise 3500 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Enterprise 5000 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Enterprise 4500 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Enterprise 4000 Server - Version: Not Applicable and later [Release: N/A and later]
Information in this document applies to any platform.
Goal
How to perform on-site diagnosis of a down system for Enterprise Server Exx00 systems.
The goal is to aid Field Engineers in the on-site diagnosis of systems that are down hard.
Solution
DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE ENGINEER NEED:(IS A SITE ENGINEER AVAILABLE?)
Enterprise Server Troubleshooting, Intermediate Solaris Skills
Time Estimate: 120 minutes
TASK COMPLEXITY: 4
FIELD ENGINEER INSTRUCTIONS
PROBLEM OVERVIEW: Down System
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? :
Down Hard, unknown reason.
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
1. Validate whether the system is powered on or not (or if board power issues are present).
- Are the LEDs lit, are the fans spinning? If nothing is powered on, then the issue is external to the server.
- Investigate the system's power source, power cords, etc. for a potential issue.
- Is the keyswitch in a position other than "0"?
2. Validate the customer can connect to the console.
For console help, see Console Logging Options to capture Fatal Reset output for Sun systems (Doc ID 1008702.1).
Capture all output that appears on the serial console connection.
It is needed to proceed with troubleshooting.
3. Once the serial connection is established, run extended POST.
Turning the key from position "0" to the "diagnostic" position automatically starts an extended POST, which tests the hardware and reports any faults.
NOTE: The original keyswitch position may not be "0". Perform this step
only if the previous two steps did not help.
4. If the system is able to reach OBP, it may or may not auto-boot, depending on its configuration.
If it stops at the ok prompt, type boot and see what happens.
Auto-boot is controlled by the auto-boot? variable, which can be checked as follows.
- Checking auto-boot? at the ok prompt:
{2} ok printenv
Variable Value (Default Value)
------------------------------------------------------------------------------
--- Some output removed---
auto-boot? false (true)
--- Some output removed---
diag-device net (disk diskifp diskglm diskc net)
boot-device disk (disk diskifp diskglm diskc net)
--- Some output removed---
diag-switch? false (false)
{2} ok
In the above example auto-boot? is set to false. Use setenv auto-boot? true to turn auto-boot? on.
Other settings noted above may affect booting behavior as well.
diag-switch? should be set to the default which is false.
If it is true, the system will attempt to boot off the diag-device which is usually the network.
boot-device settings may vary.
See Step 5 for a more complete discussion of boot-device.
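When the console session has been logged, the printenv capture can be scanned offline for variables that differ from their defaults. The following is a minimal Python sketch, not an Oracle tool. It assumes the capture uses the "name value (default)" row layout shown above; the function name non_default_settings and the SAMPLE text are illustrative only.

```python
import re

# Assumed row layout, taken from the printenv capture above: name value (default)
ROW = re.compile(r"^(\S+)\s+(.*?)\s+\((.*)\)$")

def non_default_settings(capture):
    """Return {variable: (value, default)} for rows whose value differs from the default."""
    changed = {}
    for line in capture.splitlines():
        match = ROW.match(line.strip())
        # Skip prompts, separator lines, and the "Variable Value (Default Value)" header.
        if not match or match.group(1) == "Variable":
            continue
        name, value, default = match.groups()
        if value != default:
            changed[name] = (value, default)
    return changed

# Sample rows copied from the example in this document.
SAMPLE = """\
{2} ok printenv
Variable Value (Default Value)
auto-boot? false (true)
diag-device net (disk diskifp diskglm diskc net)
boot-device disk (disk diskifp diskglm diskc net)
diag-switch? false (false)
{2} ok
"""

for name, (value, default) in non_default_settings(SAMPLE).items():
    print(f"{name}: {value} (default: {default})")
```

A non-default value is not necessarily wrong (boot-device settings vary by site), so the result is only a list of settings worth reviewing.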
5. Boot device issues are a common cause of failure to boot.
Verify that the boot device is valid.
If the device being booted is an alias defined in devalias at the OBP,
the device that the alias references must appear in the output of probe-scsi-all.
{2} ok printenv boot-device
boot-device disk (disk diskifp diskglm diskc net)
{2} ok devalias disk
disk /sbus@3,0/SUNW,fas@3,8800000/sd@0,0
{2} ok probe-scsi-all
--- Look for the presence of the boot device. -----
{2} ok
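To know what to look for in the probe-scsi-all output, it can help to split the alias target into its controller path and its SCSI address. The sketch below is an illustration only. It assumes the usual sd@target,lun unit-address form and that the numbers are hexadecimal; the function name split_scsi_path is hypothetical.

```python
def split_scsi_path(device_path):
    """Split '/.../sd@T,L' into (controller_path, target, lun)."""
    controller, _, leaf = device_path.rpartition("/")
    name, _, address = leaf.partition("@")
    if name != "sd" or not address:
        raise ValueError(f"not an sd@target,lun path: {device_path}")
    address = address.split(":")[0]  # drop an optional ':slice' suffix
    target, _, lun = address.partition(",")
    # Assumption: unit-address numbers are hexadecimal.
    return controller, int(target, 16), int(lun or "0", 16)

# Alias target taken from the devalias output above.
controller, target, lun = split_scsi_path("/sbus@3,0/SUNW,fas@3,8800000/sd@0,0")
print(f"controller={controller} target={target} lun={lun}")
```

The controller path and target number are then the items to find in the probe-scsi-all listing.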
Locating an internal disk (E3x00 only)
Use the "probe-fcal-all" command and the disk World-Wide Number (WWN)
to determine which array (top or bottom) your disk is on.
The device path has the form: /sbus@A,0/SUNW,socal@B,10000/sf@C,0/ssd@w2DEEEEEEEEEEEEEE,F:G
A - convert to decimal: divide by 2 to get the I/O slot#
B - 0=Sbus slot0, 1=Sbus slot1, 2=Sbus slot2, d=Sbus slotd (onboard fcal)
C - identifies which gbic: 0=bottom gbic, 1 = top gbic
D - identifies which port (A or B) on the IB board: w21=UA/LA, w22=UB/LB
E - disk WWN
F - drive SSD number
G - LUN (always "0") - slice
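The field legend above can be applied mechanically to a path. The following Python sketch does so under stated assumptions: A is treated as hexadecimal and "divide by 2" as integer division (remainder dropped), and fields F and G are returned as raw text without interpretation. The function name decode_fcal_path and the WWN digits in the example path are hypothetical.

```python
import re

# Path template from this document:
# /sbus@A,0/SUNW,socal@B,10000/sf@C,0/ssd@w2DEEEEEEEEEEEEEE,F:G
FCAL = re.compile(
    r"^/sbus@(?P<a>[0-9a-f]+),0/SUNW,socal@(?P<b>[0-9a-f]),10000"
    r"/sf@(?P<c>[01]),0/ssd@w2(?P<d>[12])(?P<e>[0-9a-f]{14}),"
    r"(?P<f>[0-9a-f]+):(?P<g>\w+)$",
    re.IGNORECASE,
)
SBUS_SLOT = {"0": "Sbus slot0", "1": "Sbus slot1", "2": "Sbus slot2",
             "d": "Sbus slotd (onboard fcal)"}
GBIC = {"0": "bottom gbic", "1": "top gbic"}
PORT = {"1": "UA/LA", "2": "UB/LB"}

def decode_fcal_path(path):
    """Decode an internal FC-AL disk path using the legend in this document."""
    m = FCAL.match(path)
    if m is None:
        raise ValueError(f"path does not match the documented template: {path}")
    return {
        # Assumption: A is hexadecimal and the division drops the remainder.
        "io_slot": int(m["a"], 16) // 2,
        "sbus_slot": SBUS_SLOT.get(m["b"].lower(), "unknown"),
        "gbic": GBIC[m["c"]],
        "ib_port": PORT[m["d"]],
        "disk_wwn": m["e"].lower(),
        "field_f": m["f"],  # returned raw; see the legend for F
        "field_g": m["g"],  # returned raw; see the legend for G
    }

# Hypothetical path for illustration; the WWN digits are made up.
example = "/sbus@3,0/SUNW,socal@d,10000/sf@0,0/ssd@w2100002037123456,0:a"
for key, value in decode_fcal_path(example).items():
    print(f"{key}: {value}")
```

Paths that do not fit the documented template raise an error rather than being guessed at.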
It is often useful to boot from an alternate boot device to test whether the OS on the primary
device is corrupt.
It is also common to boot from the OS mirror disk when the primary mirror is experiencing hardware issues.
An alternate device might be the DVD or CD-ROM, the network, a root mirror, or an alternate
disk image.
The alternate boot devices are usually listed in the output of devalias.
Alias names can be created freely, so there is no way to list all known aliases, but vx-rootdisk
and vx-rootmirror are common in Veritas Volume Manager environments.
Any alias containing the word mirror should also be investigated as a possible boot device.
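A logged devalias listing can be filtered for likely alternate boot devices. The sketch below is illustrative only. It assumes each alias is printed as "name path" on one line, as in the devalias example in Step 5. The keyword list, both function names, and every sample path except the one for disk are hypothetical.

```python
def parse_devalias(capture):
    """Return {alias: device_path} from 'name path' lines; other lines are ignored."""
    aliases = {}
    for line in capture.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].startswith("/"):
            aliases[parts[0]] = parts[1]
    return aliases

def alternate_boot_candidates(aliases, primary="disk"):
    """List aliases whose names suggest an alternate boot device."""
    keywords = ("mirror", "vx-root", "cdrom", "net")  # illustrative, not exhaustive
    return sorted(
        name for name in aliases
        if name != primary and any(word in name for word in keywords)
    )

# Hypothetical capture; only the 'disk' entry comes from this document.
SAMPLE = """\
{2} ok devalias
vx-rootmirror /sbus@3,0/SUNW,fas@3,8800000/sd@1,0
vx-rootdisk /sbus@3,0/SUNW,fas@3,8800000/sd@0,0
net /sbus@3,0/SUNW,hme@3,8c00000
cdrom /sbus@3,0/SUNW,fas@3,8800000/sd@6,0:f
disk /sbus@3,0/SUNW,fas@3,8800000/sd@0,0
{2} ok
"""

for name in alternate_boot_candidates(parse_devalias(SAMPLE)):
    print(name)
```

Each candidate can then be tried at the ok prompt with boot followed by the alias name.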
6. Other aids in troubleshooting boot issues.
It is often useful to display the results of POST to see which components have failed,
or to verify that enough hardware is present to boot.
{2} ok show-post-results
--- Some output removed---
--- Verify sufficient hardware is present to boot.
ok
- Verbose boot options for a hanging boot.
When booting hangs after the SunOS banner appears, it is often helpful to gather additional data.
In such cases, put Solaris into a verbose boot by typing boot -v at the ok prompt.
auto-boot? must be set to false so that the system stops at the ok prompt and manual boot commands can be entered.
See Step 4 for information on setting auto-boot?.
If the boot operation appears to hang in the middle of disk probing,
the verbose output can give additional insight into the cause of the boot failure.
7. If you suspect that a system board, CPU, or DIMM is passing POST but still has problems, it can be manually disabled.
The following commands can be used:
ok setenv disabled-board-list --> sets board to be disabled on reboot
See Ultra Enterprise[TM] Workstation: OBP Variable Definition for disabled-board-list
(Doc ID 1001859.1) for details
ok setenv disabled-memory-list --> sets memory to be disabled on reboot
Example:
ok> setenv disabled-memory-list 4 <<<! this would shut off all memory on bd4
ok> reset-all <<<! this will actually turn off the memory!
ok> set-default disabled-memory-list <<<! turn mem on bd4 back on
ok> reset-all <<<! this will turn memory back on again.
- Final word on boot issues:
If you are unsure how to proceed, or unable to perform the above process, collect as much information
about the boot failure as possible (console logs, error messages, etc.), then call back
in and request the next available engineer.
OBTAIN CUSTOMER ACCEPTANCE
WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO
AN OPERATIONAL STATE:
After booting, the customer will need to verify that the system meets production requirements.
PARTS NOTE:
Parts may end up being required, but they are not part of this Action Plan.
Another Action Plan may be necessary.
REFERENCE INFORMATION:
Service Manuals, Admin Manuals, and other manuals:
http://download.oracle.com/docs/cd/E19095-01/ent3k.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent35.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent4k.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent45.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent5k.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent55.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent6k.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent6500.srvr/index.html
Attachments
This solution has no attachment