Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1349558.1
Update Date:2017-10-18
Keywords:

Solution Type  Technical Instruction Sure

Solution  1349558.1 :   How to Perform On Site Diagnosis for a Down System for Enterprise Server Exx00 Systems:ATR:1349558.1:4  


Related Items
  • Sun Enterprise 5500 Server
  • Sun Enterprise 5000 Server
  • Sun Enterprise 4000 Server
  • Sun Enterprise 3000 Server
  • Sun Enterprise 3500 Server
  • Sun Enterprise 6000 Server
  • Sun Enterprise 6500 Server
  • Sun Enterprise 4500 Server
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: SPARC-CAP VCAP




In this Document
  Goal
  Solution


Oracle Confidential (INTERNAL). Do not distribute to customers
Reason: FRU CAP

Applies to:

Sun Enterprise 3000 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Enterprise 3500 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Enterprise 5000 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Enterprise 4500 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Enterprise 4000 Server - Version: Not Applicable and later [Release: N/A and later]
Information in this document applies to any platform.

Goal

How to perform on-site diagnosis of a down system on Enterprise Server Exx00 systems.
This document aids Field Engineers in the on-site diagnosis of systems that are down hard.

Solution

DISPATCH INSTRUCTIONS


WHAT SKILLS DOES THE ENGINEER NEED:(IS A SITE ENGINEER AVAILABLE?)

Enterprise Server Troubleshooting, Intermediate Solaris Skills


Time Estimate: 120 minutes


TASK COMPLEXITY:  4


FIELD ENGINEER INSTRUCTIONS


PROBLEM OVERVIEW: Down System


WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? :

Down Hard, unknown reason.


WHAT ACTION DOES THE ENGINEER NEED TO TAKE:


1. Validate whether the system is powered on or not (or if board power issues are present).

  • Are the LEDs lit? Are the fans spinning? If nothing is powered on, the issue is external to the server.
  • Investigate the system's power source, power cords, etc. for a potential issue.
  • Is the keyswitch in a position other than "0"?


2. Validate the customer can connect to the console.

        See Console Logging Options to capture Fatal Reset output for Sun systems (Doc ID 1008702.1) for console help.

        Collect all output visible on the serial console connection;
        it is needed to proceed with troubleshooting.

3. Once the serial connection is established, run extended POST.

        Turning the keyswitch from position "0" to the "diagnostic" position automatically starts an extended POST, which tests the hardware and reports faults.

        NOTE: The original keyswitch position may not be "0", so perform this step
        only if the previous two steps did not help.


4. If the system reaches OBP, it may or may not auto-boot, depending on its configuration.
    If it stops at the ok prompt, type boot and observe the result.

    Auto-boot can be configured as follows.

  • Setting auto-boot? at the ok prompt:

        {2} ok printenv

        Variable Value (Default Value)
        ------------------------------------------------------------------------------
        ---  Some output removed---
        auto-boot? false (true)
        ---  Some output removed---
        diag-device net (disk diskifp diskglm diskc net)
        boot-device disk (disk diskifp diskglm diskc net)
        ---  Some output removed---
        diag-switch? false (false)
        {2} ok


          In the above example, auto-boot? is set to false. Use setenv auto-boot? true to turn auto-boot? on.
          Other settings shown above may affect boot behavior as well.
          diag-switch? should be set to its default, which is false.
          If it is true, the system attempts to boot from the diag-device, which is usually the network.
          boot-device settings may vary.
          See Step 5 for a more complete discussion of boot-device.
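
          The two checks described above can be sketched in Python against captured console text. This is only an illustration: the function name boot_blockers is hypothetical, and it assumes each printenv line has the "name value (default)" layout shown in the example output.

```python
def boot_blockers(printenv_text):
    """Return notes about OBP variables that change boot behavior.

    Assumes each relevant line looks like "name value (default)",
    as in the printenv example above.
    """
    values = {}
    for line in printenv_text.splitlines():
        parts = line.split()
        # Keep only the two variables discussed in this step.
        if len(parts) >= 2 and parts[0] in ("auto-boot?", "diag-switch?"):
            values[parts[0]] = parts[1]

    notes = []
    if values.get("auto-boot?") == "false":
        notes.append("auto-boot? is false: system stops at the ok prompt")
    if values.get("diag-switch?") == "true":
        notes.append("diag-switch? is true: system boots from diag-device")
    return notes


sample = """\
auto-boot? false (true)
diag-device net (disk diskifp diskglm diskc net)
boot-device disk (disk diskifp diskglm diskc net)
diag-switch? false (false)
"""
print(boot_blockers(sample))
```

          An empty result means neither setting explains a failure to boot automatically, so the boot device itself should be examined next.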


5. Boot device issues are a common cause of failure to boot.

        Trace the validity of the boot device.
        If the device being booted is an alias defined in devalias at the OBP,
        the device that the alias references must appear in the probe-scsi-all output.


       {2} ok printenv boot-device
       boot-device           disk                        (disk diskifp diskglm diskc net)
       {2} ok devalias disk
       disk                     /sbus@3,0/SUNW,fas@3,8800000/sd@0,0
       {2} ok probe-scsi-all
        ---  Look for the presence of the boot device.  ---
       {2} ok
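
       To know what to look for in the probe-scsi-all output, it can help to split the alias target into its controller path and its target and LUN. The Python sketch below does this for the sd@target,lun form shown above; the function name split_scsi_path is hypothetical, and other device types (for example ssd@w... paths) are not handled.

```python
import re


def split_scsi_path(path):
    """Split an OBP disk path into (controller path, target, lun).

    Handles only the "<controller>/sd@target,lun" form, with an
    optional trailing ":slice" letter that is ignored.
    """
    match = re.fullmatch(r"(.+)/sd@([0-9a-f]+),([0-9a-f]+)(?::\w)?", path)
    if match is None:
        raise ValueError("not an sd@target,lun path: " + path)
    controller, target, lun = match.groups()
    # OBP unit addresses are written in hexadecimal.
    return controller, int(target, 16), int(lun, 16)


print(split_scsi_path("/sbus@3,0/SUNW,fas@3,8800000/sd@0,0"))
```

       For the alias in the example, the engineer would then look for target 0 under the controller /sbus@3,0/SUNW,fas@3,8800000 in the probe output.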


       Locating internal disks (E3x00 only)

       Use the "probe-fcal-all" command and the disk World Wide Name (WWN)
       to determine which array (top or bottom) your disk is on.

       Path format: /sbus@A,0/SUNW,socal@B,10000/sf@C,0/ssd@w2DEEEEEEEEEEEEEE,F:G


          A  - convert to decimal: divide by 2 to get the I/O slot#
          B  - 0=Sbus slot0, 1=Sbus slot1, 2=Sbus slot2, d=Sbus slotd (onboard fcal)
          C  - identifies which gbic: 0=bottom gbic, 1 = top gbic
          D  - identifies which port (A or B) on the IB board:  w21=UA/LA, w22=UB/LB
          E  - disk WWN
          F  - drive SSD number
          G  - LUN (always "0") - slice
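
       The field legend above can be applied mechanically. The following Python sketch is an illustration only: the function name decode_fcal_path and the sample path are hypothetical, the "divide by 2" for field A is assumed to be integer division, and fields F and G are returned as raw text without interpretation.

```python
import re

# Pattern follows the path format given above; letters name the legend fields.
PATH_RE = re.compile(
    r"/sbus@(?P<A>[0-9a-f]+),0/SUNW,socal@(?P<B>[0-9a-f]+),10000"
    r"/sf@(?P<C>[0-9a-f]+),0/ssd@w2(?P<D>[0-9a-f])(?P<E>[0-9a-f]+)"
    r",(?P<F>[0-9a-f]+):(?P<G>\w+)"
)

SBUS_SLOT = {"0": "Sbus slot0", "1": "Sbus slot1", "2": "Sbus slot2",
             "d": "Sbus slotd (onboard fcal)"}
GBIC = {"0": "bottom gbic", "1": "top gbic"}
PORT = {"1": "UA/LA", "2": "UB/LB"}  # from w21 and w22 in the legend


def decode_fcal_path(path):
    """Decode a socal/sf/ssd device path using the legend above."""
    m = PATH_RE.fullmatch(path)
    if m is None:
        raise ValueError("path does not match the expected format: " + path)
    return {
        # Assumption: hex to decimal, then integer division by 2.
        "io_slot": int(m["A"], 16) // 2,
        "sbus_slot": SBUS_SLOT.get(m["B"], "unknown"),
        "gbic": GBIC.get(m["C"], "unknown"),
        "ib_port": PORT.get(m["D"], "unknown"),
        "disk_wwn_digits": m["E"],
        "field_F": m["F"],
        "field_G": m["G"],
    }


# Hypothetical path, made up for illustration only.
print(decode_fcal_path(
    "/sbus@2,0/SUNW,socal@d,10000/sf@1,0/ssd@w2100002037123456,0:a"))
```

       Values that do not appear in the legend are reported as "unknown" rather than guessed.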


  •  Alternate Boot device

           It is often useful to boot from an alternate boot device to test whether the OS on the primary
           device is corrupt.
           It is also common to boot from the OS mirror disk when the primary mirror is experiencing hardware issues.
           An alternate device might be the DVD or CD-ROM, the network, or a root mirror or alternate
           disk image.

           The alternate boot devices are usually listed in the output of devalias.
           Alias names can be user-defined, so there is no way to list all known aliases, but vx-rootdisk
           and vx-rootmirror are common in Veritas Volume Manager environments.
           Any alias containing the word mirror should also be investigated as a possible boot device.


6. Other aids in troubleshooting boot issues.

  • Verify Devices in POST.

           It is often useful to display the results of POST to see which components have failed,
           or to verify that enough hardware is present to boot.

           {2} ok show-post-results

            ---  Some output removed---
            ---  Verify sufficient hardware is present to boot.
            ok   



  • Verbose booting options for boot hanging.

           When booting hangs after the SunOS banner appears, it is often helpful to gather additional data.
           In such cases, put Solaris into a verbose boot by typing boot -v at the ok prompt.
           The auto-boot? setting must be false so that the system stops at the ok prompt and accepts manual boot commands.
           See Step 4 for information on setting auto-boot? to false.
           If the boot operation appears to hang in the middle of disk probing,
           the verbose output can give additional insight into the cause of the boot failure.


7. If you suspect that a system board (SB), CPU, or DIMM is passing POST but is still faulty, it can be manually disabled.

     The following commands can be used:

     ok setenv disabled-board-list  --> sets board to be disabled on reboot
     See Ultra Enterprise[TM] Workstation: OBP Variable Definition for disabled-board-list
     (Doc ID 1001859.1) for details

     ok setenv disabled-memory-list  --> sets memory to be disabled on reboot

        Example:
        ok> setenv disabled-memory-list 4     <<<! this would shut off all memory on bd4
        ok> reset-all                                       <<<! this will actually turn off the memory!
        ok> set-default disabled-memory-list  <<<! turn mem on bd4 back on
        ok> reset-all                                       <<<! this will turn memory back on again.

  • Final Word on boot issues:

            If unsure how to proceed, or unable to perform the above process, collect as much information
            pertaining to the boot failure as possible (console logs, error messages, etc.), then call back
            in and request the next available engineer.

OBTAIN CUSTOMER ACCEPTANCE

WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

After booting, the customer must verify that the system meets production requirements.


PARTS NOTE:

Parts may be required, but they are not part of this Action Plan.

Another Action Plan may be necessary.


REFERENCE INFORMATION:

Service Manuals, Admin Manuals, and other manuals:

http://download.oracle.com/docs/cd/E19095-01/ent3k.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent35.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent4k.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent45.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent5k.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent55.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent6k.srvr/index.html
http://download.oracle.com/docs/cd/E19095-01/ent6500.srvr/index.html


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.