Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1342756.1
Update Date:2017-03-29
Keywords:

Solution Type  Technical Instruction Sure

Solution  1342756.1 :   How to Perform On Site Diagnosis for a Down System for Sun Fire 1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900 and Netra 1280/1290 Systems:ATR:1342756.1:4  


Related Items
  • Sun Fire 4810 Server
  • Sun Fire 3800 Server
  • Sun Netra 1290 Server
  • Sun Fire 6800 Server
  • Sun Fire E6900 Server
  • Sun Fire 4800 Server
  • Sun Fire E2900 Server
  • Sun Fire V1280 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: SPARC-CAP VCAP




In this Document
Goal
Solution
References


Oracle Confidential INTERNAL - Do not distribute to customer (OracleConfidential).
Reason: FRU CAP

Applies to:

Sun Netra 1280 Server - Version Not Applicable and later
Sun Fire 4810 Server - Version Not Applicable and later
Sun Fire 6800 Server - Version Not Applicable and later
Sun Fire E2900 Server - Version Not Applicable and later
Sun Fire E4900 Server - Version Not Applicable and later
Information in this document applies to any platform.

Goal

To aid Field Engineers in on-site diagnosis of systems that are down hard.

Solution

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE ENGINEER NEED: (IS A SITE ENGINEER AVAILABLE?)
System Controller Application (ScApp), LOM, intermediate Solaris skills

Time Estimate: 120 minutes

TASK COMPLEXITY:  4

FIELD ENGINEER INSTRUCTIONS

PROBLEM OVERVIEW:
Down System

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? :

Down Hard, unknown reason.

WHAT ACTION DOES THE ENGINEER NEED TO TAKE:

1. Validate whether the system is powered on (and whether any board power issues are present).
     # Are the LEDs lit? Are the fans spinning? If nothing is powered on, the issue is external to the server.
     # Confirm power to the RTS and RTU in the cabinet.
     # Confirm power is being provided to the Power Inlet Box on lw8 systems and to the AC Input Box on serengeti servers.
     # Investigate the system's power source, power cords, etc. for a potential issue.
     # Also check the status of all of the domain's System and I/O Boards. Make sure they can all power on, or the domain may fail to boot. (showboards, poweron)

e4900-sca11-a-sc0:A> showboards

Slot     Pwr  Component Type   State   Status  Domain
----     ---  --------------   -----   ------  ------
/N0/SB0  On   CPU Board V3     Active  OK      A
/N0/SB2  On   CPU Board V3     Active  OK      A
/N0/IB6  On   PCI-X I/O Board  Active  Passed  A
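
When a long showboards listing has to be checked quickly, a saved copy of the output can be filtered offline. The following is a minimal Python sketch, not an Oracle or ScApp tool. The function name boards_needing_attention is hypothetical, and the sketch assumes the capture is plain text whose last two columns are Status and Domain, as in the sample above. The SB4 line in the capture is invented for illustration.

```python
import re

# Hypothetical helper, not part of ScApp: scan captured `showboards` text and
# return boards that are not powered on or whose Status is not OK or Passed.
SLOT_LINE = re.compile(r"^(/N\d+/\S+)\s+(\S+)\s+(.+)$")

def boards_needing_attention(showboards_text):
    flagged = []
    for line in showboards_text.splitlines():
        match = SLOT_LINE.match(line.strip())
        if match is None:
            continue  # header, separator, prompt, or blank line
        slot, power, rest = match.groups()
        fields = rest.split()
        if len(fields) < 3:
            continue  # too short to hold State, Status and Domain
        # Component Type may contain spaces, so read Status from the right:
        # the last field is Domain and the one before it is Status.
        status = fields[-2]
        if power != "On" or status not in ("OK", "Passed"):
            flagged.append((slot, power, status))
    return flagged

# Three lines are taken from the sample output; the SB4 line is made up.
capture = """\
/N0/SB0 On CPU Board V3 Active OK A
/N0/SB2 On CPU Board V3 Active OK A
/N0/SB4 Off CPU Board V3 Assigned Failed A
/N0/IB6 On PCI-X I/O Board Active Passed A
"""

for slot, power, status in boards_needing_attention(capture):
    print(slot, "power:", power, "status:", status)
```

Any board the sketch reports should still be confirmed on the SC itself; slots whose Status or Domain column contains more than one word are flagged conservatively rather than parsed exactly.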


2. Validate that the customer can connect to the domain console. For console access, refer to:


Sun Fire[TM]/Netra[TM] 1280/1290/E2900: Reference for Improving Remote Diagnosibility (Doc ID 1540855.1)
Sun Fire [TM] SF3800/SF4800/SF4810/SF6800 - E4900/E6900: Reference for Improving Remote Diagnosibility (Doc ID 1540858.1)


3. Validate that the domain in question is not currently executing POST.
     # Console output will be scrolling "LPOST" messages if POST is running.
     # POST needs to complete before the domain reaches OBP ("ok" prompt); only then can the domain be booted.
     # showplatform and the platform shell are not available on lw8; there, showboards or POST messages scrolling on the console are the only ways to see testing status. On the other systems, display the state of the domain from the platform shell:

e4900-sca11-a-sc0:SC> showplatform -p status

Domain   Solaris Nodename   Domain Status           Keyswitch
-------- ------------------ ----------------------- -------------
A        -                  Active - OpenBoot PROM  on
B        -                  Powered Off             off
C        -                  Standby                 standby
D        -                  Powered Off             off

# POST completes with lines that look similar to this:

{/N0/SB2/P2/C1} DCB_ENTER_OBP command succeeded
{/N0/SB2/P3/C1} DCB_ENTER_OBP command succeeded
Entering OBP ...


Sun Fire E4900
OpenFirmware version 5.20.16 (08/24/10 02:33)
Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
SmartFirmware, Copyright (C) 1996-2001. All rights reserved.
28672 MB memory installed, Serial #52731163.
Ethernet address 0:3:ba:24:9d:1b, Host ID: 83249d1b.


{2} ok
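
If the console session is being logged to a file, the markers described in this step can be checked mechanically. This is a rough Python sketch, assuming the log is plain text; the function name console_state is hypothetical, and the assumption that the number inside the braces of the ok prompt is written in hexadecimal digits is mine, not the document's.

```python
import re

# Matches an OBP prompt such as "{2} ok" (CPU id assumed to be hex digits).
OK_PROMPT = re.compile(r"^\{[0-9a-f]+\} ok\b")

def console_state(console_text):
    """Return 'post', 'obp', or 'unknown' based on the last marker seen."""
    state = "unknown"
    for line in console_text.splitlines():
        stripped = line.strip()
        if "LPOST" in stripped:
            state = "post"   # POST messages are still scrolling
        elif "Entering OBP" in stripped or OK_PROMPT.match(stripped):
            state = "obp"    # POST finished and OBP was entered
    return state

# Tail of the console output shown above.
tail = """\
{/N0/SB2/P3/C1} DCB_ENTER_OBP command succeeded
Entering OBP ...
Sun Fire E4900
{2} ok
"""
print(console_state(tail))
```

The result only reflects the last marker in the capture, so a stale log can mislead; the live console remains the authority.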

  • Occasionally, showplatform -p status will report a domain running POST while the console shows no output. If that is the case, first break out of the console with the escape sequence (usually #. or control-]), then send a break with the break command. If the break completes, initiate POST again with another poweron or setkeyswitch operation.

 

  • Rebooting the SC (resetsc on lw8, reboot on serengeti) may also fix this issue. An SC reboot with resetsc or reboot will not affect running domains. Avoid pushing the reset button on the serengeti SC, as it can reset other running domains. Unfortunately, on very rare occasions a platform power cycle is necessary.

 

  • If POST has completed, it may be useful to look through the POST output for devices that have failed. A simple case-insensitive search of the output for "fail" should find parts failed by POST. Also look in the POST output for parts that failed POST in the past and are marked faulty by CHS; these are listed at the beginning of the POST output. Refer to Step 6, Verify Devices in POST, for additional information and examples.
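
The case-insensitive search for "fail" can be done with any text tool. As one possible illustration, here is a small Python sketch that reports matching lines with their line numbers. The function name find_failures is hypothetical, and the sample lines are invented placeholders, not real POST output.

```python
import re

# Case-insensitive match, so "fail", "Failed" and "FAILED" are all found.
FAIL = re.compile(r"fail", re.IGNORECASE)

def find_failures(post_text):
    """Return (line_number, line) pairs for lines that mention a failure."""
    hits = []
    for number, line in enumerate(post_text.splitlines(), start=1):
        if FAIL.search(line):
            hits.append((number, line.strip()))
    return hits

# Invented placeholder lines standing in for a captured POST log.
post_capture = """\
placeholder line: test passed
placeholder line: component X FAILED
placeholder line: test passed
placeholder line: component Y failed earlier
"""

for number, line in find_failures(post_capture):
    print(number, line)
```

Line numbers make it easier to go back to the full capture and read the context around each hit.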



4. If the domain reaches OBP, it may or may not boot automatically, depending on configuration. If it stops at the ok prompt, type boot and see what happens. "Auto-boot" can be configured in two places.

     # Setting auto-boot? at the ok prompt:

{2} ok printenv
Variable Value (Default Value)
------------------------------------------------------------------------------
---  Some output removed---
auto-boot? false (true)
---  Some output removed---
diag-device net (disk diskifp diskglm diskc net)
boot-device disk (disk diskifp diskglm diskc net)
---  Some output removed---
diag-switch? false (false)
{2} ok



In the above example auto-boot? is set to false. Use setenv auto-boot? true to turn auto-boot? on.

Other settings shown above may affect boot behavior as well. diag-switch? should be set to its default, which is false. If it is true, the system will attempt to boot from the diag-device, which is usually the network. boot-device settings may vary. See Step 5 for a more complete discussion of boot-device.
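
The checks described above can be expressed against a saved printenv capture. The following Python sketch is illustrative only; the function names parse_printenv and boot_warnings are hypothetical, and it assumes each variable line has the form "name value (default)" as in the example output.

```python
import re

# name, current value, and the default value in trailing parentheses
SETTING = re.compile(r"^(\S+)\s+(.*?)\s*\((.*)\)\s*$")

def parse_printenv(printenv_text):
    """Map each variable name to a (value, default) tuple."""
    settings = {}
    for line in printenv_text.splitlines():
        match = SETTING.match(line.strip())
        if match:
            name, value, default = match.groups()
            settings[name] = (value, default)
    return settings

def boot_warnings(settings):
    """List the settings from this step that can keep a domain from booting."""
    warnings = []
    if settings.get("auto-boot?", ("true", ""))[0] == "false":
        warnings.append("auto-boot? is false: the domain stops at the ok prompt")
    if settings.get("diag-switch?", ("false", ""))[0] == "true":
        warnings.append("diag-switch? is true: the domain will attempt to boot from diag-device")
    return warnings

# Lines taken from the printenv example above.
printenv_capture = """\
auto-boot? false (true)
diag-device net (disk diskifp diskglm diskc net)
boot-device disk (disk diskifp diskglm diskc net)
diag-switch? false (false)
"""

settings = parse_printenv(printenv_capture)
for warning in boot_warnings(settings):
    print(warning)
```

For the example capture this reports only the auto-boot? setting, which matches the discussion above.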

     # Setting auto-boot? on the SC/LOM:

e4900-sca11-a-sc0:A> showdomain -p bootparams

---  Some output removed---
OBP.auto-boot? = false
---  Some output removed---


The LOM on lw8 systems has a bootmode command that affects only the next boot. It expires after 10 minutes if the system is not booted, and the system returns to normal boot mode. lw8 systems do not have a showdomain command.



5. Boot device issues are a common cause of boot failure.

Verify that the boot device is valid. If the device being booted is an alias defined in devalias at the OBP, the device that the alias references must appear in the output of probe-scsi-all.

{2} ok printenv boot-device
boot-device           disk                        (disk diskifp diskglm diskc net)
{2} ok devalias disk
disk                  /ssm@0,0/pci@19,700000/pci@1/scsi@4/sd@0,0:a
{2} ok probe-scsi-all
 --- Look for the presence of the boot device. ---
{2} ok
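
The alias tracing above can be sketched in Python for use against saved console captures. This is only an illustration under stated assumptions: the function names are hypothetical, and the probe-scsi-all text is an invented stand-in that assumes the command prints each controller's device path on its own line. The sketch checks the controller only; the target itself must still be confirmed by eye in the real output.

```python
def parse_devalias(devalias_text):
    """Map alias names to device paths from captured devalias output."""
    aliases = {}
    for line in devalias_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].startswith("/"):
            aliases[parts[0]] = parts[1]
    return aliases

def controller_of(device_path):
    """Drop the final node (for example sd@0,0:a) to get the parent controller."""
    return device_path.rsplit("/", 1)[0]

def check_boot_devices(boot_device_value, aliases, probe_text):
    """Report, per boot-device entry, whether its controller was probed."""
    probed = {line.strip() for line in probe_text.splitlines()}
    report = {}
    for name in boot_device_value.split():
        path = name if name.startswith("/") else aliases.get(name)
        if path is None:
            report[name] = "alias not found in the devalias capture"
        elif controller_of(path) in probed:
            report[name] = "controller seen in probe output"
        else:
            report[name] = "controller NOT seen in probe output"
    return report

# The devalias lines come from the example above.
devalias_capture = """\
{2} ok devalias disk
disk                  /ssm@0,0/pci@19,700000/pci@1/scsi@4/sd@0,0:a
"""
# Invented stand-in for probe-scsi-all output (format assumed).
probe_capture = """\
/ssm@0,0/pci@19,700000/pci@1/scsi@4
Target 0
"""

aliases = parse_devalias(devalias_capture)
print(check_boot_devices("disk", aliases, probe_capture))
```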



     # Alternate Boot device

It is often useful to boot from an alternate boot device to test whether the OS on the primary device is corrupt. It is also common to boot from the OS mirror disk when the primary mirror is experiencing hardware issues. An alternate device might be the DVD or CD-ROM, the network, or a root mirror or alternate disk image.

The alternate boot devices are usually listed in the output of devalias. Because alias names can be created freely, there is no way to list all known aliases, but vx-rootdisk and vx-rootmirror are common in Veritas Volume Manager environments. Any alias containing the word mirror should also be investigated as a possible boot device.

6. Other aids in troubleshooting boot issues.
    # Verify Devices in POST.

Often it is useful to display the results of POST to see components that have failed, or to verify that enough hardware is present to boot.

{2} ok show-post-results
---  Some output removed---
---  Verify sufficient hardware is present to boot.
 ok


This can be compared to the showchs -b output from the platform shell.

e4900-sca11-a-sc0:SC> showchs -b
Component           Status
---------------     --------
/N0/SB0/P1          Faulty
/N0/SB2/P0          Faulty
/N0/SB2/P1/B1/D0/L0 Faulty
/N0/SB2/P1/B1/D0/L1 Faulty
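
To make the comparison with the POST results easier, the faulty components can be grouped by board. Below is a small Python sketch run against the showchs -b example above. The function name faulty_by_board is hypothetical, and the sketch assumes each data line holds a component path followed by the word Faulty.

```python
from collections import defaultdict

def faulty_by_board(showchs_text):
    """Group components reported as Faulty under their board path."""
    boards = defaultdict(list)
    for line in showchs_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].startswith("/") and fields[1] == "Faulty":
            component = fields[0]
            # The board is the first two path elements, for example /N0/SB2.
            board = "/".join(component.split("/")[:3])
            boards[board].append(component)
    return dict(boards)

# Lines taken from the showchs -b example above.
showchs_capture = """\
Component           Status
---------------     --------
/N0/SB0/P1          Faulty
/N0/SB2/P0          Faulty
/N0/SB2/P1/B1/D0/L0 Faulty
/N0/SB2/P1/B1/D0/L1 Faulty
"""

for board, components in faulty_by_board(showchs_capture).items():
    print(board, "->", ", ".join(components))
```

Grouping by board shows at a glance how much of each System Board is unavailable to the domain.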

During a POST run, components marked faulty by CHS are also listed at the beginning of the POST output.

Search the POST output for "failed" using a case-insensitive match.

     # Verbose booting options for boot hanging.

When the boot hangs after SunOS is seen starting, it is often helpful to gather additional data. In such cases, put Solaris into a verbose boot with boot -v at the ok prompt. auto-boot? must be set to false so that the domain stops at the ok prompt and accepts manual boot commands. See Step 4 for information on setting auto-boot? to false. If the boot appears to hang in the middle of disk probing, the verbose output could give additional insight into the cause of the boot failure.

  • If you suspect that an SB, CPU, or DIMM is passing POST but still has problems, it can be manually disabled.


Note: setls takes effect at the next reboot.

e4900-sca11-a-sc0:A> setls -s disable -l sb0
sb0: will be disabled at the next domain reboot, board power cycle, or post execution.

  • Higher POST levels can sometimes be useful.


Increasing the POST level can often catch marginal hardware issues. If memory is suspect, it is often better to first try turning off mpr-support-enable and tolerate_mem_ce. This forces POST to disable any DIMMs with correctable errors (CEs). Normally, POST ignores CEs up to a threshold before disabling the DIMM. See <Document 1005247.1> for a more complete discussion of these parameters.


See below for an example of setting bootparams on serengeti. On lw8 the settings are in setupsc.

e4900-sca11-a-sc0:A> setupdomain -p bootparams

Domain Boot Parameters
----------------------
diag-level [default]: <>
post-tolerate-ce [false]:
mpr-support-enable [true]:  << valid settings: true, false >>
---  Some output removed---

Remember to set the parameters back to their original values after testing.

Final Word on boot issues:
If unsure how to proceed, or unable to perform the above process, collect as much information pertaining to the boot failure as possible (console logs, error messages, etc.), then call back in and request the next available engineer.


OBTAIN CUSTOMER ACCEPTANCE
WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO
AN OPERATIONAL STATE:
After booting, the customer will need to verify that the system meets production requirements.

PARTS NOTE:
Parts may end up being required, but they are not part of this Action Plan. Another Action Plan may be necessary.

REFERENCE INFORMATION:
Service Manuals, Admin Manuals, and ScApp Command reference manuals:
http://download.oracle.com/docs/cd/E19095-01/sf3800.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sf4800.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sf4810.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sf6800.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sfe6900.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sfe4900.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sfv1280.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/sfe2900.srvr/index.html


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.