Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1356790.1
Update Date:2016-01-20
Keywords:

Solution Type  Technical Instruction Sure

Solution  1356790.1 :   How to Perform On Site Diagnosis for a Down Starcat System:ATR:1356790.1:4  


Related Items
  • Sun Fire E25K Server
  • Sun Fire 15K Server
  • Sun Fire 12K Server
  • Sun Fire E20K Server
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: SPARC-CAP VCAP
  • _Old GCS Categories>Sun Microsystems>Servers>High-End Servers




In this Document
  Goal
  Solution


Oracle Confidential (INTERNAL). Do not distribute to customers
Reason: FE Action Plan

Applies to:

Sun Fire E20K Server - Version: Not Applicable and later   [Release: N/A and later ]
Sun Fire E25K Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire 12K Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire 15K Server - Version: Not Applicable and later    [Release: N/A and later]
Information in this document applies to any platform.

Goal

To aid Field Engineers in the on-site diagnosis of systems that are down hard.
********************************************************************************
To report errors or request improvements on this procedure,
please go to http://support.us.oracle.com and put a comment on Doc ID: 1356790.1
********************************************************************************

Solution

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE ENGINEER NEED:(IS A SITE ENGINEER AVAILABLE?)
System Management Services (SMS), Intermediate Solaris Skills

Time Estimate: 120 minutes

TASK COMPLEXITY: 4

FIELD ENGINEER INSTRUCTIONS

PROBLEM OVERVIEW:
Down System

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? :

Down Hard, unknown reason.


WHAT ACTION DOES THE ENGINEER NEED TO TAKE:

1. Validate whether the system is powered on (and whether any board power issues are present).
     # Are the LEDs lit and the fans spinning? If nothing is powered on, the issue is external to the server.
     # Confirm power to all the AC PSUs in the cabinet.
     # Investigate the system's power source, power cords, etc. for a potential issue.

2. Validate that the customer can log in to the system controllers.
     # Check the status of all the domain's System and I/O Boards with showboards. Make sure they can all be powered on (poweron); otherwise the domain may fail to boot.

sms-svc:1> showboards
Retrieving board information. Please wait.
.....
Location    Pwr    Type of Board   Board Status  Test Status   Domain
--------    ---    -------------   ------------  -----------   ------
SB0          -     Empty Slot      Available         -         Isolated
SB1         On     CPU             Assigned      Unknown       C
SB2         Off    CPU             Assigned      Unknown       D
.....
IO0         On     HPCI+           Assigned      Unknown       C
IO1          -     Empty Slot      Available         -         Isolated
IO2         Off    WPCI            Assigned      Unknown       D
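
If the showboards output has been captured to a text file, the check in this step can be scripted. The following is a minimal sketch, assuming the column layout shown in the excerpt above; the function names are illustrative and are not part of SMS.

```python
import re

# Sample copied from the showboards excerpt above (dotted lines omitted).
SHOWBOARDS_SAMPLE = """\
Location    Pwr    Type of Board   Board Status  Test Status   Domain
--------    ---    -------------   ------------  -----------   ------
SB0          -     Empty Slot      Available         -         Isolated
SB1         On     CPU             Assigned      Unknown       C
SB2         Off    CPU             Assigned      Unknown       D
IO0         On     HPCI+           Assigned      Unknown       C
IO1          -     Empty Slot      Available         -         Isolated
IO2         Off    WPCI            Assigned      Unknown       D
"""

def parse_showboards(text):
    """Return one dict per SB/IO row found in showboards output."""
    keys = ("location", "power", "type", "status", "test", "domain")
    boards = []
    for line in text.splitlines():
        # Split on runs of 2+ spaces so "Empty Slot" stays a single field.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 6 and re.fullmatch(r"(SB|IO)\d+", fields[0]):
            boards.append(dict(zip(keys, fields)))
    return boards

def powered_off_boards(boards, domain):
    """Locations assigned to `domain` whose Pwr column reads Off."""
    return [b["location"] for b in boards
            if b["domain"] == domain and b["power"] == "Off"]

boards = parse_showboards(SHOWBOARDS_SAMPLE)
print(powered_off_boards(boards, "D"))  # ['SB2', 'IO2']
```

In the sample, both boards assigned to domain D are powered off, which would have to be resolved before that domain can boot.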

3. Validate that the domain in question is not currently executing POST.
sms-svc:1> ps -ef | grep post
sms-svc  1463   235  0 11:27:38 console  0:04 /opt/SUNWSMS/SMS1.5/bin/hpost -d C

     POST must complete before the domain reaches OBP (the "ok" prompt); only then can the domain be booted.
     To display the state of the domain from the system controller, run:
sms-svc:1> showplatform
PLATFORM:
=========
Platform Type: Sun Fire E25K
......<chop>....
Domain configurations:
======================
Domain ID   Domain Tag        Solaris Nodename       Domain Status
A           -                 -                      Running OBP
B           -                 -                      Running OBP
C           -                 -                      Running Domain POST
......<chop>....
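
The Domain Status column can also be pulled out of captured showplatform output. This is a rough sketch that assumes the four-column layout shown above and single-letter domain IDs; the helper names are hypothetical.

```python
import re

# Sample copied from the showplatform excerpt above.
SHOWPLATFORM_SAMPLE = """\
Domain configurations:
======================
Domain ID   Domain Tag        Solaris Nodename       Domain Status
A           -                 -                      Running OBP
B           -                 -                      Running OBP
C           -                 -                      Running Domain POST
"""

def domain_statuses(text):
    """Map each domain ID to the text in its Domain Status column."""
    statuses = {}
    for line in text.splitlines():
        # Columns are separated by 2+ spaces; status text may contain spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 4 and re.fullmatch(r"[A-Z]", fields[0]):
            statuses[fields[0]] = fields[3]
    return statuses

def domains_in_post(statuses):
    """Domain IDs whose status text mentions POST."""
    return sorted(d for d, s in statuses.items() if "POST" in s)

statuses = domain_statuses(SHOWPLATFORM_SAMPLE)
print(domains_in_post(statuses))  # ['C']
```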

# If POST is running, wait for it to complete, then check the most recent POST log file ( $SMSVAR/SMS/adm/<domain letter>/post/ ) to ensure it ran to completion and the hardware passed POST.

CPU_Brds:  Proc  Mem P/B: 3/1 3/0  2/1 2/0  1/1 1/0  0/1 0/0
Slot  Gen  3210        /L: 10  10   10  10   10  10   10  10     CDC
SB10:  P   PPPP            PP  PP   PP  PP   PP  PP   PP  PP      P
SB11:  P   PPPP            PP  PP   PP  PP   PP  PP   PP  PP      P
SB12:  P   PPPP            PP  PP   PP  PP   PP  PP   PP  PP      P

I/O_Brds:         IOC  P1/Bus/Adapt   IOC  P0/Bus/Adapt
Slot  Gen  Type   P1   B1/10 B0/10    P0   B1/eb10 B0/10  (e=ENet, b=BBC)
IO10:  P   hsPCI   P    p _p  p _p     P    p PP_p  p _p
IO11:  c   hsPCI   c    c _c  c _c     b    c cc_c  c _c
IO12:  P   hsPCI   P    p _p  p _p     P    p PP_p  p _p

Configured in 333 with 12 procs, 48.000 GBytes, 8 IO adapters.
Interconnect frequency is 149.978 MHz, Measured.
Golden sram is on Slot IO10.
POST (level=16, verbose=20) execution time 8:40
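
When reviewing a long POST log, it can help to list only the boards whose Gen column is not "P". The sketch below assumes that "P" marks a passing board, as the excerpt above suggests; this document does not define the other status letters, so the script only reports them and does not interpret them.

```python
import re

# Board rows copied from the POST log excerpt above.
POST_LOG_SAMPLE = """\
Slot  Gen  3210        /L: 10  10   10  10   10  10   10  10     CDC
SB10:  P   PPPP            PP  PP   PP  PP   PP  PP   PP  PP      P
SB11:  P   PPPP            PP  PP   PP  PP   PP  PP   PP  PP      P
SB12:  P   PPPP            PP  PP   PP  PP   PP  PP   PP  PP      P
Slot  Gen  Type   P1   B1/10 B0/10    P0   B1/eb10 B0/10  (e=ENet, b=BBC)
IO10:  P   hsPCI   P    p _p  p _p     P    p PP_p  p _p
IO11:  c   hsPCI   c    c _c  c _c     b    c cc_c  c _c
IO12:  P   hsPCI   P    p _p  p _p     P    p PP_p  p _p
"""

def non_passing_boards(text):
    """Map board slot to its Gen status for every row where Gen is not 'P'.

    Assumption: 'P' in the Gen column means the board passed POST.
    """
    flagged = {}
    for line in text.splitlines():
        # A board row starts with a slot name and colon, then the Gen status.
        m = re.match(r"\s*((?:SB|IO)\d+):\s+(\S+)", line)
        if m and m.group(2) != "P":
            flagged[m.group(1)] = m.group(2)
    return flagged

print(non_passing_boards(POST_LOG_SAMPLE))  # {'IO11': 'c'}
```

Here IO11 is flagged for a closer look in the full log.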

4. Validate the customer can connect to the domain console.
sms-svc:1> console -d a
console -da
Trying to connect...
Connected to Domain Server.
Your console is in exclusive mode now.
{0} ok

5. If the domain reaches OBP, it may or may not auto-boot, depending on its configuration. If it stops at the ok prompt, type boot and observe the result. The auto-boot? setting can be configured in two places.

     #Setting auto-boot? at the ok prompt:

{0} ok printenv
Variable Name           Value                          Default Value
------------------------------------------------------------------------------ 
---  Some output removed---
diag-device             net                            net
boot-device             /pci@15d,700000/pci@1/sc ...   disk net
---  Some output removed---
auto-boot?              false                          false
diag-switch?            false                          false
{2} ok
In the above example, auto-boot? is set to false. Use setenv auto-boot? true to turn auto-boot? on.

Other settings noted above may affect booting behavior as well. diag-switch? should be set to its default, which is false. If it is true, the system will attempt to boot from the diag-device, which is usually the network. boot-device settings may vary. See Step 6 for a more complete discussion of boot-device.

     # Viewing the auto-boot? setting from the system controller (showobpparams displays the values; setobpparams changes them):
sms-svc:1> showobpparams -da
auto-boot?=false
diag-switch?=false
.....
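
The two settings discussed in this step can be checked from captured showobpparams output. This is a minimal sketch, assuming the name=value format shown above; the note wording and function names are illustrative only.

```python
# Sample copied from the showobpparams excerpt above.
OBPPARAMS_SAMPLE = """\
auto-boot?=false
diag-switch?=false
.....
"""

def parse_obpparams(text):
    """Return a dict of OBP parameter names to values from name=value lines."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '=' such as the dotted filler
            params[key.strip()] = value.strip()
    return params

def boot_notes(params):
    """Describe settings that change boot behavior, per the text above."""
    notes = []
    if params.get("auto-boot?") == "false":
        notes.append("auto-boot? is false: domain stops at the ok prompt")
    if params.get("diag-switch?") == "true":
        notes.append("diag-switch? is true: domain boots from diag-device")
    return notes

for note in boot_notes(parse_obpparams(OBPPARAMS_SAMPLE)):
    print(note)
```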

6. Boot device issues are a common cause of failure to boot.

Trace the validity of the boot device. If the device being booted is an alias defined in devalias at the OBP, the device that the alias references must appear in the probe-scsi-all output.

{2} ok printenv boot-device
boot-device  =         disk (disk diskifp diskglm diskc net)
{2} ok devalias disk
disk                  /ssm@0,0/pci@19,700000/pci@1/scsi@4/sd@0,0:a
{2} ok probe-scsi-all
 ---  Look for the presence of the boot device.  ---
{2} ok


     # Alternate Boot device

It is often useful to boot from an alternate boot device to test whether the OS on the primary device is corrupt. It is also common to boot from the OS mirror disk when the primary mirror is experiencing hardware issues. An alternate device might be the network, a root mirror or an alternate disk image.

The alternate boot devices are usually listed in the output of devalias. Because alias names can be created freely, there is no way to list all known aliases, but vx-rootdisk and vx-rootmirror are common in Veritas Volume Manager environments. Any alias containing the word mirror should also be investigated as a possible alternate.
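
The alias tracing described in this step can be sketched against captured devalias output. In the sample below, the disk line is taken from the transcript above; the vx-rootmirror path is a hypothetical value added only for illustration, and the sample deliberately omits a net alias to show an unresolved entry. The function names are not OBP commands.

```python
DEVALIAS_SAMPLE = """\
disk                  /ssm@0,0/pci@19,700000/pci@1/scsi@4/sd@0,0:a
vx-rootmirror         /ssm@0,0/pci@19,700000/pci@1/scsi@4/sd@1,0:a
"""  # second line is hypothetical, not from a real system

def parse_devalias(text):
    """Return a dict of alias name to device path from devalias output."""
    aliases = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].startswith("/"):
            aliases[parts[0]] = parts[1]
    return aliases

def resolve_boot_devices(boot_device, aliases):
    """Map each boot-device entry to a path, or None if no alias is defined."""
    resolved = {}
    for entry in boot_device.split():
        # Entries that are already full device paths are kept as-is.
        resolved[entry] = entry if entry.startswith("/") else aliases.get(entry)
    return resolved

def mirror_candidates(aliases):
    """Alias names containing 'mirror', as possible alternate boot devices."""
    return sorted(name for name in aliases if "mirror" in name.lower())

aliases = parse_devalias(DEVALIAS_SAMPLE)
print(resolve_boot_devices("disk net", aliases))
print(mirror_candidates(aliases))  # ['vx-rootmirror']
```

Each resolved path still has to be compared by hand with the probe-scsi-all output, whose format is not shown in this document.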

7. Other aids in troubleshooting boot issues.
    # Verify devices in POST.
See the information on POST logs in Step 3 above.

Components that Component Health Status (CHS) has marked faulty are also listed at the beginning of the POST log.
Compare that list with the showchs -b output.
sms-svc:39> showchs -b
Component           Status
---------------     --------
SB1/P1              Faulty
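
To compare against the POST log, the faulty components can be extracted from captured showchs -b output. This is a small sketch, assuming the two-column format shown above; the function names are illustrative.

```python
# Sample copied from the showchs -b excerpt above.
SHOWCHS_SAMPLE = """\
Component           Status
---------------     --------
SB1/P1              Faulty
"""

def faulty_components(text):
    """Return component names whose Status column reads Faulty."""
    faulty = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "Faulty":
            faulty.append(parts[0])
    return faulty

def affected_boards(components):
    """Board slots (text before the first '/') that hold a faulty component."""
    return sorted({c.split("/")[0] for c in components})

components = faulty_components(SHOWCHS_SAMPLE)
print(components)                   # ['SB1/P1']
print(affected_boards(components))  # ['SB1']
```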

     # Verbose booting options for boot hanging.

When booting hangs after the SunOS banner appears, it is often helpful to gather additional data by booting Solaris verbosely with boot -v at the ok prompt. The auto-boot? setting must be false so that the domain stops at the ok prompt and accepts manual boot commands. See Step 5 for information on setting auto-boot? to false. If the boot operation appears to hang in the middle of disk probing, the verbose output can give additional insight into the cause of the boot failure.

  • If you suspect that an SB, CPU or DIMM has problems even though it passes POST, it can be manually disabled.

Note: disablecomponent takes effect at the next POST execution.
sms-svc:40> disablecomponent SB9 -i "Suspected Faulty"

sb9: will be disabled at the next  post execution.
  • A higher POST level can sometimes be useful.
Increasing the POST level can often catch marginal hardware issues. If memory is suspect, it is often better to first try turning off mpr-support-enable and tolerate_mem_ce. This forces POST to disable any DIMMs with correctable errors (CEs); normally, POST ignores CEs up to a threshold before disabling a DIMM. See <Document 1005247.1> for a more complete discussion of these parameters.

Remember to set the parameters back to their original values after testing.

Final Word on boot issues:
If you are unsure how to proceed, or are unable to perform the above process, collect as much information about the boot failure as possible (console logs, error messages, etc.), then call back in and request the next available engineer.


OBTAIN CUSTOMER ACCEPTANCE
WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO
AN OPERATIONAL STATE:
After booting, the customer will need to verify that the system meets production requirements.

PARTS NOTE:
Parts may end up being required, but they are not part of this Action plan. Another Action Plan may be necessary.

REFERENCE INFORMATION:
Service Manuals, Admin Manuals, and SMS Command reference manuals:
http://download.oracle.com/docs/cd/E19065-01/index.html

KEYWORDS:
ERRORS:

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.