Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1418253.1
Update Date:2014-10-17
Keywords:

Solution Type  Technical Instruction Sure

Solution  1418253.1 :   How to Perform Onsite Diagnosis for a Down x86 Intel System  


Related Items
  • Sun Fire X4470 Server
  • Sun Fire X4170 Server
  • Sun Fire X4250 Server
  • Sun Netra X4450 Server
  • Sun Fire X4150 Server
  • Sun Netra X4270 Server
  • Sun Fire X4450 Server
  • Sun Blade X6270 M2 Server Module
  • Sun Fire X4270 Server
  • Sun Fire X4275 Server
  • Sun Blade X6450 Server Module
  • Sun Fire X4800 Server
  • Sun Blade X6275 Server Module
  • Sun Netra X4250 Server
  • Sun Fire X2270 Server
  • Sun Blade X6270 Server Module
  • Sun Fire X4170 M2 Server
  • Sun Blade X6250 Server Module
  • Sun Fire X4270 M2 Server
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
 1. Investigate system power source
 2. Validate the customer can log into the Service Processor (ILOM)
 Verifying system power status via ipmitool
 Verifying system power status via the Service Processor CLI
 Verifying system power status via the Service Processor Web GUI
 3. Troubleshoot power issues
 4. Perform internal and external visual inspection
 5. Collect basic server information regarding the outage using the Service Processor
 IPMITOOL:
 ILOM
 CMM (Blade specific)
 6. Hardware best practices
 BIOS POST
 Boot device
 Disks
 Blades
 Fans
 Memory modules
 7. Run platform diagnostics
 PcCheck
 Oracle VTS
 8. Check platform health at Operating System level
 Generate an Oracle Explorer Data Collector utility output with the command:
 Execute SUSE Linux Enterprise Server supportconfig utility
 Execute Red Hat Enterprise Linux sosreport utility
 Execute VMware vm-support utility
 Windows MPS Report utility
 Core dump generation
 9. Collect diagnostic information for Oracle support
 Collect ILOM Service Snapshot utility from ILOM Web GUI
 Appendix
 Links of interest:
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: FRU CAP (please leave Internal)

Applies to:

Sun Blade X6270 M2 Server Module - Version All Versions to All Versions [Release All Releases]
Sun Fire X4170 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire X4270 Server - Version All Versions to All Versions [Release All Releases]
Sun Netra X4270 Server - Version All Versions to All Versions [Release All Releases]
Sun Blade X6270 Server Module - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Goal

This document describes how to perform onsite diagnosis for a down x86 Intel system. It applies to Intel processor-based servers and Blade servers.

Note: Please leave this Document "Internal"

Solution

How to Perform On Site Diagnosis for a Down x86 Intel System

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE ENGINEER NEED:(IS A SITE ENGINEER AVAILABLE?)
ILOM, Intermediate Linux/Unix Skills

Time Estimate: 120 minutes

TASK COMPLEXITY: 4

FIELD ENGINEER INSTRUCTIONS

PROBLEM OVERVIEW: System Down

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? : Down Hard, unknown reason

WHAT ACTION DOES THE ENGINEER NEED TO TAKE:

It's very important to document the server settings before any hardware or software changes are made.

1. Investigate system power source

# Are LEDs lit?
# Are fans spinning?
# Confirm power to all the AC Power Supplies.
# In collaboration with the customer, investigate the system's power source, power cords, etc. for a potential issue.

2. Validate the customer can log into the Service Processor (ILOM)

Verifying system power status via ipmitool

Run ipmitool from a remote system to the Service Processor with the command shown in the examples below. The resulting output will indicate whether power is on or off.

# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : unknown
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false

 


# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis power status
Chassis Power is on
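
When the power check has to be scripted (for example, to poll several Service Processors), the one-line answer can be reduced to a single word. The following is a minimal sketch, not an Oracle-provided tool: the function name parse_power_state and the ILOM_HOST variable are hypothetical, and it assumes the output has the "Chassis Power is on" form shown above.

```shell
# Hypothetical helper: reduce "Chassis Power is on|off" to a single word.
# It reads ipmitool output on stdin, so it can be exercised without a live ILOM.
parse_power_state() {
    # Print the last word of the first matching line, or "unknown" if none matches.
    awk '/^Chassis Power is / { print $NF; found = 1; exit }
         END { if (!found) print "unknown" }'
}

# Example use against a live Service Processor. ILOM_HOST is a placeholder;
# -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
#   ipmitool -I lanplus -U root -E -H "$ILOM_HOST" chassis power status | parse_power_state
```

Anything other than "on" or "off" (for example a session error) is reported as "unknown", so a failed connection is not mistaken for a powered-off host.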

 

Verifying system power status via the Service Processor CLI

Log in to the Service Processor via SSH:

# ssh -l <USERNAME> <ILOM IP Address>



For systems that have ILOM installed, use the following commands to
determine the platform power status:

  • ILOM
-> show /SYS
...
   /SYS
   Properties:
   type = Host System
   ipmi_name = /SYS
   product_name = SUN FIRE X4440
   product_part_number = 602-4057-01
   product_serial_number = 0812ZYX001
   product_manufacturer = SUN MICROSYSTEMS
   power_state = On
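
If the ILOM session output has been captured to a file (for example from a terminal log), a single property such as power_state can be pulled out of it. This is only a sketch under that assumption: ilom_property is a hypothetical helper name, and it relies on the "name = value" layout shown in the example output above.

```shell
# Hypothetical helper: print the value of one property from captured
# "show" output supplied on stdin.
#   $1 = property name, for example power_state
ilom_property() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)       # drop leading indentation
            sub(/[ \t\r]+$/, "", line)     # drop trailing blanks and carriage returns
            prefix = key " = "
            if (index(line, prefix) == 1) {   # line starts with "key = "
                print substr(line, length(prefix) + 1)
                exit
            }
        }'
}

# Example use, assuming the session was saved to a file named ilom_session.log:
#   ilom_property power_state < ilom_session.log
```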

 

Verifying system power status via the Service Processor Web GUI

Integrated Lights Out Manager (ILOM) based Service Processors provide an easy-to-use web interface for managing the platform. Point your web browser to the Service Processor IP address or resolving DNS hostname, and enter your login credentials when prompted.

After you have logged in to the Service Processor, click the "Remote Control" tab, then click the "Remote Power Control" tab.

This page shows the power status of the platform, for example:   

Host is currently on


Alternatively, click the "System Monitoring" tab, then "Summary" tab where 'Power Status' will be shown.

If the power status is OFF and you expect it to be ON, refer to How to check why the system powered off, on Sun X64 servers. (Doc ID 1002941.1)

Refer to the ILOM Administration Guide for your platform and firmware version. Also see the ILOM Administration Guide Supplement for your platform:
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Related ILOM documentation:
Integrated Lights Out Manager (ILOM) 2.0 documentation: http://docs.oracle.com/cd/E19720-01/
Integrated Lights Out Manager (ILOM) 3.0 and CMM documentation: http://docs.oracle.com/cd/E19860-01/

3. Troubleshoot power issues

Verify the state of the Power OK LED from the front or rear of the server. LED states may vary slightly between platforms, but generally:

  • STEADY GREEN ON - System is powered on.
  • SLOW BLINK GREEN - System is powered OFF, but standby power is present.
  • NOT ILLUMINATED (OFF) - Server main power and standby power are off (no AC power, not plugged in, defective power cord).


Investigate the system's power source, power cords, and power supplies for a potential issue.

Refer to the following Oracle documents for help on diagnosing power issues on x64 platforms:

How to check if a Sun X64 server is powered on (Doc ID 1002926.1)
How to check why the system powered off, on Sun X64 servers. (Doc ID 1002941.1)

4. Perform internal and external visual inspection

- Confirm whether the General Service Fault LED is lit or any Component Fault LED is ON, either of which would indicate a hardware failure.


- A system shutdown can be initiated by a request from either of the following:

  • Baseboard management controller (BMC). The conditions that trigger the BMC to issue a shutdown request are:
    • An over-temperature condition for more than 1 second.
    • Multiple fan failures.
  • Fault condition. The fault conditions that trigger a shutdown are:
    • All power supplies have failed or have been removed.
    • A power supply has been out of spec for more than 100 ms.
    • The hot-swap circuit has faulted.
    • An over-temperature condition has occurred.

  1. Inspect the external status indicator LEDs, which can indicate a defective component.
  2. Verify that nothing in the server environment is blocking air flow or making a contact that could short out power.
  3. If the server does not power on, check the power source and power cords for a potential issue.
  4. Disconnect the power cords for a few minutes to discharge the capacitors.
  5. Reconnect the power cords and check whether the power issue remains.
  6. If no power is distributed, refer to the Sun System Handbook (https://support.oracle.com/handbook_private/) wiring diagram to identify the components that could trigger this power issue.
  7. Inspect the cables, cards and pins for any visible defect.
  8. Reseat processors, riser cards, PCI cards, power supplies, memory modules, fans, cables, and disks.
  9. Disconnect any external storage array to verify whether the same symptoms remain.

 

5. Collect basic server information regarding the outage using the Service Processor


Log in to the Service Processor using ssh (requires the Service Processor IP address or resolvable DNS hostname):

# ssh -l <USERNAME> <ILOM IP Address>


Display System Event Logs, sensor & fault indicator information:

IPMITOOL:

# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr list all info
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sdr


Be sure to use the latest Oracle-compiled version of ipmitool to collect this information.

ipmitool is part of the Oracle Hardware Management Pack. For more information, see http://download.oracle.com/docs/cd/E19960-01/index.html

Refer to Doc ID 1009698.1 for detailed information on the use of ipmitool for collection of data from the platform.
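
A System Event Log can be long, so it may help to narrow the "sel elist" output to entries that look related to an outage. The sketch below is illustrative only: sel_filter is a hypothetical helper, and its default keyword list (power, temperature, fan, fault, failure, critical) is an assumption chosen for this example, not an Oracle-defined set. Always keep the unfiltered log as well.

```shell
# Hypothetical helper: keep only SEL lines that match outage-related keywords.
# Reads "sel elist" output on stdin.
#   $1 = optional extended regular expression that replaces the default keywords
sel_filter() {
    # The default keyword list is an assumption made for illustration.
    pattern="${1:-power|temp|fan|fault|fail|critical}"
    # grep exits with 1 when nothing matches; treat that as "no events",
    # but still pass on real errors (exit status 2 or higher).
    grep -Ei -- "$pattern" || [ $? -eq 1 ]
}

# Example use (ILOM_HOST and ILOM_USER are placeholders; -E reads the
# password from the IPMI_PASSWORD environment variable):
#   ipmitool -I lan -H "$ILOM_HOST" -U "$ILOM_USER" -E sel elist | sel_filter
```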

ILOM

Log in to the Server's ILOM and execute the commands:

-> show /SP/logs/event/list
-> show -d properties -level all /SYS
-> show -o table -level all /SP/faultmgmt (Only available with latest ILOM versions)

 

CMM (Blade specific)

Log in to the CMM of the chassis in which the faulty Blade is inserted and execute the commands:
-> show /CMM/logs/event/list
-> show -d properties -level all /

 

6. Hardware best practices

Best practices for isolating a hardware issue when an Oracle x86 Intel processor-based server is down:

- Power off the platform and disconnect the power cords for a few minutes
- Update the platform firmware to the latest version (ILOM/BIOS/HW RAID/PCI cards)
- Review the ILOM logs and sensors along with the OS boot sequence to verify whether any hardware or software issue is reported
-- Start the SP console to monitor the boot process
-- Start the Java Remote Console to monitor OS errors
-- View component information to determine component status
-- View the ILOM system event log
- Run Oracle VTS to verify whether any hardware error is reported
- Disconnect any external storage array
- If a component is reported faulty, replace it
- If unable to boot the OS, reduce the system to a minimum CPU/memory configuration to isolate the faulty component
- Remove any additional PCI cards
- If there is no evidence of a hardware issue and the OS boots, consider gathering Operating System information
- Update platform-related OS drivers
- Engage OS/software support to assist with a possible software issue

 

BIOS POST

From the point that the host subsystem is powered on and begins executing code, BIOS code is executed. The sequence that BIOS goes through, from the first point where code is executed to the point that the operating system booting begins, is referred to as POST (power-on self-test).

If a hardware issue is detected during POST, the boot process stops and a four-digit error code may be displayed at the console. Refer to your platform Service Manual or Diagnostic Guide to translate the POST code.

Boot device

Verify the boot device is correct from the BIOS Boot tab:

Main    Advanced    PCIPnP    Boot    Security    Chipset    Exit
********************************************************************************
* Boot Settings                                       * Configure Settings     *
* *************************************************** * during System Boot.    *
* * Boot Settings Configuration                       *                        *
*                                                     *                        *
* * Boot Device Priority                              *                        *
* * Hard Disk Drives                                  *                        *
* * Removable Drives                                  *                        *
* * CD/DVD Drives                                     *                        *
*                                                     *                        *
*                                                     *                        *
*                                                     *                        *
*                                                     *                        *
*                                                     * **    Select Screen    *
*                                                     * **    Select Item      *
*                                                     * Enter Go to Sub Screen *
*                                                     * F1    General Help     *
*                                                     * F10   Save and Exit    *
*                                                     * ESC   Exit             *
*                                                     *                        *
*                                                     *                        *
********************************************************************************


The BIOS boot device output is also available as a text file attached to this document: BIOS.TXT

Disks

To troubleshoot a disk issue, identify your HW RAID controller and follow the instructions in the document below:
How to Identify BIOS and Solaris[TM] Hardware RAID Status (Doc ID 1013107.1)

Blades

When troubleshooting a Blade issue, swap the Blade module to another known working slot to isolate the root cause.

  • If the problem follows the Blade then the failure is located on the Blade
  • If the Blade works in another slot then the problem could be related to the slot

Fans

A faulty fan or fan board can prevent an x64 server from booting because of the potential for system over-temperature and component damage.

Verify the Fans and Fan Board status from the ILOM Monitoring tool.

Memory modules

When investigating a memory issue:

  • Verify that only supported memory modules are inserted
  • Verify the population rules are respected
  • Press the DIMM Fault Remind button if available for your platform to turn ON the slot LED for the faulty DIMM

Additionally, when memory errors are logged in Windows or Linux log files, install HERD to translate the memory error addresses into a CPU slot/memory slot:
How to analyze Memory Errors on x64 Servers running Linux using HERD (Doc ID 1019683.1)

CAUTION: After replacing an Oracle server motherboard it is necessary to update the platform serial number which is the reference used to log Service Requests

7. Run platform diagnostics

Oracle provides comprehensive diagnostic tools that test and validate Oracle hardware by verifying the connectivity and functionality of most hardware controllers and devices on Oracle hardware platforms.

The diagnostic tools can usually be executed by booting from:

  • the Tools and Drivers CD/DVD
  • the ILOM
  • an external drive
  • PXE (network)
  • the running Operating System


Prefer a standalone method and avoid executing diagnostics from a running operating system, because doing so could generate false I/O access errors during the tests.

PcCheck

Since PcCheck is fairly easy to obtain (it is on most of the Tools and Drivers DVDs), is easy to run, and performs a decent low-level health check of the system, we recommend using PcCheck first (before VTS).

PcCheck is diagnostic software that thoroughly checks the hardware components, including memory modules, floppy drives, hard disk drives, CD-ROM/DVD drives, I/O ports, and the graphics controller.

To run the PcCheck diagnostics follow the steps below:

  1. Boot the system with the Supplemental CD (or Tools and Drivers DVD)
  2. At the main menu select "Run Hardware Diagnostics"
  3. At the PcCheck main menu select "Advanced Diagnostic Tests"
  4. At the Advanced Diagnostic Tests menu select "Memory"
  5. Then select "Test System Memory"


Oracle VTS

SunVTS software has a graphical user interface (GUI) that provides test configuration and status monitoring. The user interface can be run on one system to display the SunVTS testing of another system on the network. SunVTS software also provides a TTY-mode interface for situations in which running a GUI is not possible.

The following tests are available in SunVTS: Processor/Memory/Disk/Graphics/Media/Ioports/Interconnects/Network/Environment/HBA

For more information refer to Oracle VTS 7.0 Software User's guide: http://docs.oracle.com/cd/E19719-01/E21664/index.html

8. Check platform health at Operating System level

When the system is able to boot and you can log in to the Operating System, it is important to also verify whether any hardware or software issue has been reported at OS level and whether platform-related patches are up to date.

The following OS-specific tools collect information about each operating system, such as which kernel is running, currently loaded drivers, configuration files, log files, etc. Each of these tools must be run as root.

Generate an Oracle Explorer Data Collector utility output with the command:

# /opt/SUNWexplo/bin/explorer


Oracle Explorer software is a support tool used to collect pertinent data from a system running the Solaris(TM) Operating System. Oracle engineers use Explorer to describe a system's configuration or to troubleshoot a problem.

Oracle Explorer is part of the STB (Services Tools Bundle) that can be downloaded from My Oracle Support (MOS):

http://support.oracle.com -> Patches & Updates -> Advanced Search and Select "Services Tools Bundle" as the Product


More details about Oracle Explorer and available options:
Oracle Explorer Data Collector - Product Information Center (Doc ID 1312847.1)

Execute SUSE Linux Enterprise Server supportconfig utility

# supportconfig
# man supportconfig

 

Execute Red Hat Enterprise Linux sosreport utility

# sosreport
# man sosreport


For Red Hat Enterprise Linux 4.5 and earlier use sysreport instead.

Execute VMware vm-support utility

# /usr/bin/vm-support


For more information refer to VMware knowledge document:
Collecting diagnostic information for VMware Server
kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1967

Windows MPS Report utility

For instructions to install and run Microsoft MPS Report refer to Microsoft Knowledge document:
Microsoft Product Support Reports:
http://www.microsoft.com/download/en/details.aspx?id=24745
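
For the Unix-like operating systems above, choosing the collector can be scripted by checking which tool is present. The following is a minimal sketch under stated assumptions: pick_collector and the EXPLORER_PATH override are hypothetical names introduced here, the Explorer path is the one given earlier in this section, and the order in which the tools are probed is arbitrary. It only reports which tool to run; it does not run it.

```shell
# Hypothetical helper: print the first data collection tool found on this host.
# Returns 1 if none of the tools named in this section is available.
pick_collector() {
    # Explorer lives at a fixed path rather than on PATH. EXPLORER_PATH is an
    # illustrative override, mainly useful for testing.
    explorer_path="${EXPLORER_PATH:-/opt/SUNWexplo/bin/explorer}"
    if [ -x "$explorer_path" ]; then
        echo "$explorer_path"
        return 0
    fi
    # The remaining tools are looked up on PATH, in an arbitrary order.
    for tool in supportconfig sosreport sysreport vm-support; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "no known data collection tool found" >&2
    return 1
}

# Example use (run as root):
#   collector=$(pick_collector) && echo "Run: $collector"
```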

Core dump generation

Some Oracle platforms have an NMI switch at the back of the system that generates an interrupt to stop the Operating System and force a core dump. The NMI can be generated as follows:

  • Pressing the NMI physical button at the back of the server
  • Using the ILOM web GUI "generate NMI" button located under the diagnostic tab
  • Executing the command below at ILOM prompt:
-> set /HOST generate_host_nmi=true



Note that the Operating System must be configured to generate a core dump when it receives an NMI. Refer to your Operating System documentation if required.

To help determine whether the system is actually hung, refer to the following document:
How to check if your x64 platform "system hang" actually is a system hang (DocID 1012991.1)

Caution: Before reviewing OS related data it is also recommended to verify if the Operating System installed is supported or certified for this particular server to avoid any driver or compatibility issue.

9. Collect diagnostic information for Oracle support

Collect ILOM Service Snapshot utility from ILOM Web GUI

The purpose of the ILOM Service Snapshot utility is to collect data for use by Oracle Services personnel to diagnose system problems.

An ILOM snapshot output can be generated from the ILOM GUI -> Maintenance tab -> Snapshot tab.

Select the desired Data Set:

  • Normal: Specifies that ILOM, operating system, and hardware information is collected.
  • FRUID: Available as of ILOM 3.0.3. Specifies that information about the FRUs currently configured on your server is collected in addition to the data collected by the Normal option.
  • Full: Specifies that all data is to be collected. Selecting Full might reset the system on an AMD processor-based platform if a HyperTransport bus failure is detected when running HDT low-level diagnostics.
  • Custom: Allows you to choose one or more data sets


Caution: Customers should not run this utility unless requested to do so by Oracle Services.

For more information about the ILOM Service Snapshot utility please refer to the Oracle Integrated Lights Out Manager (ILOM) 3.0 Web Interface Procedures Guide:
http://docs.oracle.com/cd/E19860-01/


If an ILOM Snapshot cannot be collected, it is recommended to collect one of the following sets of output:

# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> chassis status
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> fru
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr list all info
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sdr

 

# ssh -l <USERNAME> <ILOM IP Address>
-> show / -l all -o table
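
The ipmitool commands listed above can be run in one pass, with each output saved to its own file for upload. This is a sketch rather than a supported tool: collect_ipmi_outputs, the output file names, and the IPMITOOL override variable are all hypothetical, and it assumes the password is supplied through the IPMI_PASSWORD environment variable (ipmitool -E).

```shell
# Hypothetical helper: run the ipmitool commands from this section and save
# each output to <outdir>/<name>.txt.
#   $1 = ILOM IP address   $2 = ILOM username   $3 = output directory
collect_ipmi_outputs() {
    host=$1
    user=$2
    outdir=$3
    # IPMITOOL is an illustrative override for the binary to run.
    ipmitool_bin="${IPMITOOL:-ipmitool}"
    mkdir -p "$outdir" || return 1
    # Each line below is "file name|ipmitool arguments".
    while IFS='|' read -r name args; do
        # $args is left unquoted on purpose so it splits into separate words.
        # stdin is redirected so the tool cannot consume the command list.
        "$ipmitool_bin" -I lan -H "$host" -U "$user" -E $args \
            > "$outdir/$name.txt" 2>&1 < /dev/null \
            || echo "warning: 'ipmitool $args' failed" >&2
    done <<'EOF'
chassis_status|chassis status
sel_elist|sel elist
sel_elist_verbose|-v sel elist
fru|fru
sensor|sensor
sbled|sunoem sbled get all
sdr_info|sdr list all info
sdr_verbose|-v sdr
EOF
}

# Example use (the address and directory are placeholders):
#   IPMI_PASSWORD='...' ; export IPMI_PASSWORD
#   collect_ipmi_outputs 192.0.2.10 root ./ilom_outputs
```

A failed command only produces a warning, so one unsupported subcommand (for example sunoem on a platform that lacks it) does not stop the rest of the collection.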

 

Appendix

Links of interest:

Oracle x86 Servers Documentation
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html#hic

Firmware Downloads and Release History for Sun Systems
http://www.oracle.com/technetwork/systems/patches/firmware/release-history-jsp-138416.html

Sun x86 and x64 Platforms: Matrix of expansion cards (Doc ID 1374659.1)

Sun System Handbook
https://support.oracle.com/handbook_private/

Oracle VTS 7.0
http://docs.oracle.com/cd/E19719-01/

Systems Management and Diagnostics
http://www.oracle.com/us/products/applications/crmondemand/login/sys-mgmt-networking-190072.html

Oracle Integrated Lights Out Manager (ILOM) 3.0 Documentation
http://docs.oracle.com/cd/E19860-01/index.html

Sun Integrated Lights Out Manager (ILOM) 2.0 Documentation
http://docs.oracle.com/cd/E19720-01/index.html

Sun Installation Assistant for x64 Servers Documentation
http://docs.oracle.com/cd/E19593-01/index.html

How to update the Serial Number on Oracle x64 platforms (Doc ID 1364359.1)

RAID Management Software Documentation
http://docs.oracle.com/cd/E23383_01/index.html

If you are unsure how to proceed, or unable to perform the above process, collect as much information pertaining to the boot failure as possible (console logs, error messages, etc.), then call back in and request the next available engineer.

References

<NOTE:1019683.1> - How to analyze Memory Errors on x86_64 Servers Using HERD
<NOTE:1312847.1> - Oracle Explorer Data Collector Resource Center
<NOTE:1330254.1> - X86 Product Home
<NOTE:1364359.1> - How to update the Serial Number on Oracle x64 platforms
<NOTE:1374659.1> - Sun x86 Platforms: Matrix of Expansion Cards
<NOTE:1002926.1> - How to check if a Sun X86 server is powered on
<NOTE:1009698.1> - How to perform platform configuration, management, and data collection tasks with ipmitool on Sun X64 servers. [Video]
<NOTE:1012991.1> - How to check if your x86 platform "system hang" actually is a system hang
<NOTE:1002941.1> - How to check why the system powered off, on Sun X64 servers.
<NOTE:1013107.1> - How to Identify BIOS and Solaris[TM] Hardware RAID Status

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.