Asset ID: 1-71-1418253.1
Update Date: 2014-10-17
Solution Type: Technical Instruction Sure Solution
1418253.1: How to Perform Onsite Diagnosis for a Down x86 Intel System
Related Items
- Sun Fire X4470 Server
- Sun Fire X4170 Server
- Sun Fire X4250 Server
- Sun Netra X4450 Server
- Sun Fire X4150 Server
- Sun Netra X4270 Server
- Sun Fire X4450 Server
- Sun Blade X6270 M2 Server Module
- Sun Fire X4270 Server
- Sun Fire X4275 Server
- Sun Blade X6450 Server Module
- Sun Fire X4800 Server
- Sun Blade X6275 Server Module
- Sun Netra X4250 Server
- Sun Fire X2270 Server
- Sun Blade X6270 Server Module
- Sun Fire X4170 M2 Server
- Sun Blade X6250 Server Module
- Sun Fire X4270 M2 Server
Related Categories
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP
Oracle Confidential PARTNER - Available to partners (SUN).
Reason: FRU CAP (please leave Internal)
Applies to:
Sun Blade X6270 M2 Server Module - Version All Versions to All Versions [Release All Releases]
Sun Fire X4170 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire X4270 Server - Version All Versions to All Versions [Release All Releases]
Sun Netra X4270 Server - Version All Versions to All Versions [Release All Releases]
Sun Blade X6270 Server Module - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Goal
This document describes how to perform onsite diagnosis for a down x86 Intel system. It applies to Intel processor-based servers and blade servers.
Note: Please leave this Document "Internal"
Solution
DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE ENGINEER NEED:(IS A SITE ENGINEER AVAILABLE?)
ILOM, Intermediate Linux/Unix Skills
Time Estimate: 120 minutes
TASK COMPLEXITY: 4
FIELD ENGINEER INSTRUCTIONS
PROBLEM OVERVIEW: System Down
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY? : Down Hard, unknown reason
WHAT ACTION DOES THE ENGINEER NEED TO TAKE:
It's very important to document the server settings before any hardware or software changes are made.
1. Investigate system power source
# Are LEDs lit?
# Are fans spinning?
# Confirm power to all the AC Power Supplies.
# In collaboration with the customer, investigate the system's power source, power cords, etc., for a potential issue.
2. Validate the customer can log into the Service Processor (ILOM)
Verifying system power status via ipmitool
Run ipmitool from a remote system to the Service Processor with the command shown in the examples below. The resulting output will indicate whether power is on or off.
# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis status
System Power : on
Power Overload : false
Power Interlock : inactive
Main Power Fault : false
Power Control Fault : false
Power Restore Policy : unknown
Last Power Event :
Chassis Intrusion : inactive
Front-Panel Lockout : inactive
Drive Fault : false
Cooling/Fan Fault : false
# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis power status
Chassis Power is on
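The chassis status output above is a list of "field : value" lines, so it can be checked programmatically when many systems need to be triaged. The following is a minimal sketch, not an Oracle tool: the function names and the choice of which fields count as a problem are assumptions made for illustration, and the sample text is copied from the output shown above.

```python
def parse_chassis_status(text):
    """Turn 'Field : value' lines from `ipmitool chassis status` into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            status[key.strip()] = value.strip()
    return status


def power_problems(status):
    """Return the names of fields that suggest a power or cooling problem."""
    problems = []
    if status.get("System Power") != "on":
        problems.append("System Power")
    # Assumption: any of these fields reporting 'true' deserves a closer look.
    for field in ("Power Overload", "Main Power Fault", "Power Control Fault",
                  "Drive Fault", "Cooling/Fan Fault"):
        if status.get(field) == "true":
            problems.append(field)
    return problems


SAMPLE = """\
System Power : on
Power Overload : false
Power Interlock : inactive
Main Power Fault : false
Power Control Fault : false
Power Restore Policy : unknown
Last Power Event :
Chassis Intrusion : inactive
Front-Panel Lockout : inactive
Drive Fault : false
Cooling/Fan Fault : false
"""

if __name__ == "__main__":
    print(power_problems(parse_chassis_status(SAMPLE)))
```

An empty list means none of the checked fields indicates a problem; the ILOM event log should still be reviewed.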
Verifying system power status via the Service Processor CLI
Log in to the Service Processor via SSH:
# ssh -l <USERNAME> <ILOM IP Address>
For systems that have ILOM installed, use the following commands to
determine the platform power status:
-> show /SYS
...
/SYS
Properties:
type = Host System
ipmi_name = /SYS
product_name = SUN FIRE X4440
product_part_number = 602-4057-01
product_serial_number = 0812ZYX001
product_manufacturer = SUN MICROSYSTEMS
power_state = On
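If the `show /SYS` output has been captured to a text file (for example from an SSH session log), the properties can be read back with a few lines of code. This is only a sketch under that assumption; the helper names are hypothetical, and the sample text is taken from the output above.

```python
def parse_ilom_properties(text):
    """Collect 'name = value' lines from captured ILOM `show` output."""
    props = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # ignore headings such as '/SYS' and 'Properties:'
            props[name.strip()] = value.strip()
    return props


def host_is_powered_on(text):
    """True when the captured output reports power_state = On."""
    return parse_ilom_properties(text).get("power_state", "").lower() == "on"


SAMPLE = """\
/SYS
Properties:
type = Host System
ipmi_name = /SYS
product_name = SUN FIRE X4440
product_part_number = 602-4057-01
product_serial_number = 0812ZYX001
product_manufacturer = SUN MICROSYSTEMS
power_state = On
"""

if __name__ == "__main__":
    props = parse_ilom_properties(SAMPLE)
    print(props["product_serial_number"], host_is_powered_on(SAMPLE))
```

Recording the product name and serial number at this point also helps with the requirement to document the server settings before any change is made.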
Verifying system power status via the Service Processor Web GUI
Integrated Lights Out Manager (ILOM) based Service Processors provide an easy-to-use web interface for managing the platform. Point your web browser to the Service Processor IP address or resolving DNS hostname, and enter your login credentials when prompted.
After you have logged in to the Service Processor, click the "Remote Control" tab, then click the "Remote Power Control" tab.
This contains the status of the platform, for example:
Host is currently on
Alternatively, click the "System Monitoring" tab, then "Summary" tab where 'Power Status' will be shown.
If the power status is OFF and you expect it to be ON, refer to How to check why the system powered off, on Sun X64 servers. (Doc ID 1002941.1)
Refer to the ILOM Administration Guide for your platform and firmware version. Also see the ILOM Administration Guide Supplement for your platform:
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html
Related ILOM documentation:
Integrated Lights Out Manager (ILOM) 2.0 documentation: http://docs.oracle.com/cd/E19720-01/
Integrated Lights Out Manager (ILOM) 3.0 and CMM documentation: http://docs.oracle.com/cd/E19860-01/
3. Troubleshoot power issues
Verify the state of the Power OK LED from the front or rear of the server. LED states may vary slightly between platforms, but generally:
- STEADY GREEN ON - System is powered on.
- SLOW BLINK GREEN - System is powered OFF, but standby power is present.
- NOT ILLUMINATED (OFF) - Server main power and standby power are off (no AC power, not plugged in, defective power cord).
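The three LED states listed above can be kept as a small lookup table, for example in a site checklist script. This is an illustrative sketch only: the key strings and the function name are assumptions, and because LED behavior varies between platforms the platform Service Manual remains the reference.

```python
# Meanings taken from the list above; the dictionary keys are assumed labels.
POWER_OK_LED = {
    "steady green": "System is powered on.",
    "slow blink green": "System is powered off, but standby power is present.",
    "off": "Main power and standby power are off "
           "(no AC power, not plugged in, defective power cord).",
}


def interpret_power_ok_led(observed):
    """Map an observed Power OK LED state to its general meaning."""
    key = observed.strip().lower()
    return POWER_OK_LED.get(
        key, "Unrecognized state; check the platform Service Manual.")


if __name__ == "__main__":
    print(interpret_power_ok_led("Slow Blink Green"))
```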
Investigate the system's power source, power cords, power supplies for a potential issue.
Refer to the following Oracle documents for help on diagnosing power issues on x64 platforms:
How to check if a Sun X64 server is powered on (Doc ID 1002926.1)
How to check why the system powered off, on Sun X64 servers. (Doc ID 1002941.1)
4. Perform internal and external visual inspection
- Check whether the General Service Fault LED is lit or any Component Fault LED is on, which would indicate a hardware failure.
- A system shutdown can be initiated by a request from either of the following:
- Board management controller (BMC). The conditions that trigger the BMC to issue a shutdown request are:
- An over-temperature condition for more than 1 second.
- Multiple fan failures.
- Fault condition. The fault conditions that trigger a shutdown are:
- All power supplies have failed or have been removed.
- A power supply has been out of spec for more than 100 ms.
- The hot-swap circuit has faulted.
- An over-temperature condition has occurred.
- Inspect the external status indicator LEDs, which can indicate a defective component.
- Verify that nothing in the server environment is blocking air flow or making a contact that could short out power.
- If the server does not power on, check the power source and power cords for a potential issue.
- Disconnect the power cords for a few minutes to discharge the capacitors.
- Reconnect the power cords and check whether the power issue remains.
- If no power is distributed, refer to the Sun System Handbook (https://support.oracle.com/handbook_private/) wiring diagram to identify the components that could trigger this power issue.
- Inspect the cables, cards and pins for any visible defect.
- Reseat processors, riser cards, PCI cards, power supplies, memory modules, fans, cables, and disks.
- Disconnect any external storage array to verify if the same symptoms remain.
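The shutdown triggers listed in this step (BMC requests and fault conditions) can be expressed as a simple rule check, which may help when matching sensor readings against the reason a system went down. The sketch below is hypothetical: the `Readings` fields are invented names for values an engineer would gather from the ILOM logs and sensors, and only the thresholds (1 second, 100 ms, "multiple" fans read as two or more) come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Readings:
    """Hypothetical summary of what the logs and sensors report."""
    over_temp_seconds: float = 0.0   # duration of the over-temperature condition
    failed_fans: int = 0             # number of fans reported as failed
    psus_working: int = 2            # power supplies present and healthy
    psu_out_of_spec_ms: float = 0.0  # longest out-of-spec interval of any PSU
    hot_swap_fault: bool = False     # hot-swap circuit fault reported


def shutdown_reasons(r):
    """List which documented shutdown triggers the readings would satisfy."""
    reasons = []
    if r.over_temp_seconds > 1:
        reasons.append("BMC: over-temperature for more than 1 second")
    if r.failed_fans >= 2:  # assumption: 'multiple' means two or more
        reasons.append("BMC: multiple fan failures")
    if r.psus_working == 0:
        reasons.append("Fault: all power supplies failed or removed")
    if r.psu_out_of_spec_ms > 100:
        reasons.append("Fault: power supply out of spec for more than 100 ms")
    if r.hot_swap_fault:
        reasons.append("Fault: hot-swap circuit faulted")
    return reasons


if __name__ == "__main__":
    for reason in shutdown_reasons(Readings(over_temp_seconds=2.5, failed_fans=2)):
        print(reason)
```

An empty result does not rule out a hardware cause; it only means none of these documented triggers matches the values entered.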
5. Collect basic server information regarding the outage using the Service Processor
Log in to the Service Processor using ssh (requires the Service Processor IP address or resolvable DNS hostname):
# ssh -l <USERNAME> <ILOM IP Address>
Display System Event Logs, sensor & fault indicator information:
IPMITOOL:
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr list all info
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sdr
Be sure to use the latest Oracle-compiled version of ipmitool to collect this information.
ipmitool is part of the Oracle Hardware Management Pack. For more information, see http://download.oracle.com/docs/cd/E19960-01/index.html
Refer to Doc ID 1009698.1 for detailed information on the use of ipmitool for collection of data from the platform.
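Output from `sel elist` is normally one event per line with fields separated by `|`, so a saved log can be filtered for power, temperature and fan events before a closer read. The sketch below assumes that layout (record ID, date, time, sensor, event description); the sample records and sensor names are invented for illustration and do not come from a real system.

```python
def parse_sel_elist(text):
    """Split saved `ipmitool sel elist` output into one dict per event."""
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not an event record
        entries.append({
            "id": fields[0],
            "date": fields[1],
            "time": fields[2],
            "sensor": fields[3],
            "event": " | ".join(fields[4:]),
        })
    return entries


def filter_entries(entries, keywords=("power", "temp", "fan")):
    """Keep events whose sensor or description mentions one of the keywords."""
    selected = []
    for entry in entries:
        haystack = (entry["sensor"] + " " + entry["event"]).lower()
        if any(word in haystack for word in keywords):
            selected.append(entry)
    return selected


# Hypothetical sample; real sensor names differ by platform.
SAMPLE = """\
   1 | 10/17/2014 | 09:12:41 | Fan FB0/FM1/F0/TACH | Lower Critical going low | Asserted
   2 | 10/17/2014 | 09:12:44 | Temperature MB/T_AMB | Upper Critical going high | Asserted
   3 | 10/17/2014 | 09:13:02 | System Firmware Progress | Memory initialization | Asserted
"""

if __name__ == "__main__":
    for e in filter_entries(parse_sel_elist(SAMPLE)):
        print(e["date"], e["time"], e["sensor"], "-", e["event"])
```

The keyword list is only a starting point; the full log should still be attached to the Service Request.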
ILOM
Log in to the Server's ILOM and execute the commands:
-> show /SP/logs/event/list
-> show -d properties -level all /SYS
-> show -o table -level all /SP/faultmgmt (Only available with latest ILOM versions)
CMM (Blade specific)
Log in to the CMM of the chassis in which the faulty Blade is inserted and execute the commands:
-> show /CMM/logs/event/list
-> show -d properties -level all /
6. Hardware best practices
Best practices for isolating a hardware issue on a down Oracle x86 Intel processor-based server:
- Power off the platform and disconnect the power cords for a few minutes
- Update the platform firmware to the latest versions (ILOM/BIOS/HW RAID/PCI Cards)
- Review ILOM logs and sensors along with OS boot sequence to verify if any hardware or software issue is reported
-- Start the SP console to monitor the boot process
-- Start the Java Remote console to monitor OS errors
-- View component information to determine component status.
-- View the ILOM system event log.
- Run Oracle VTS to verify if any hardware error is reported
- Disconnect any external storage array
- If a component is reported faulty, replace it
- If unable to boot the OS, reduce the system to a minimum CPU/Memory configuration to isolate the faulty component.
- Remove any additional PCI cards
- If there is no evidence of a hardware issue and the OS boots, consider gathering Operating System information
- Update platform related OS drivers
- Engage the OS/software support to assist with a possible software issue
BIOS POST
From the point that the host subsystem is powered on and begins executing code, BIOS code is executed. The sequence that BIOS goes through, from the first point where code is executed to the point that the operating system booting begins, is referred to as POST (power-on self-test).
If a hardware issue is detected during POST, the boot process stops and a four-digit error code may be displayed on the console. Refer to your platform Service Manual or Diagnostics Guide to translate the POST code.
Boot device
Verify the boot device is correct from the BIOS Boot tab:
  Main    Advanced    PCIPnP    Boot    Security    Chipset    Exit
********************************************************************************
* Boot Settings                                     * Configure Settings       *
* ************************************************* * during System Boot.      *
* * Boot Settings Configuration                     *                          *
*                                                   *                          *
* * Boot Device Priority                            *                          *
* * Hard Disk Drives                                *                          *
* * Removable Drives                                *                          *
* * CD/DVD Drives                                   *                          *
*                                                   *                          *
*                                                   *                          *
*                                                   *                          *
*                                                   *                          *
*                                                   * ←→    Select Screen      *
*                                                   * ↑↓    Select Item        *
*                                                   * Enter Go to Sub Screen   *
*                                                   * F1    General Help       *
*                                                   * F10   Save and Exit      *
*                                                   * ESC   Exit               *
*                                                   *                          *
*                                                   *                          *
********************************************************************************
BIOS boot device output is also available as a text file attached to this document: BIOS.TXT
Disks
To troubleshoot a disk issue, identify your HW RAID Controller and follow the instructions in the document below:
How to Identify BIOS and Solaris[TM] Hardware RAID Status (Doc ID 1013107.1)
Blades
When troubleshooting a Blade issue, swap the Blade module to another known working slot to isolate the root cause.
- If the problem follows the Blade then the failure is located on the Blade
- If the Blade works in another slot then the problem could be related to the slot
Fans
A faulty fan or fan board can prevent an x64 Server from booting because of the potential for system over-temperature and component damage.
Verify the Fans and Fan Board status from the ILOM Monitoring tool.
Memory modules
When investigating a memory issue
- Verify that only supported memory modules are inserted
- Verify the population rules are respected
- Press the DIMM Fault Remind button if available for your platform to turn ON the slot LED for the faulty DIMM
Additionally, when memory errors are logged in Windows or Linux log files, install HERD to translate the reported memory addresses into the CPU slot and memory slot.
How to analyze Memory Errors on x64 Servers running Linux using HERD (Doc ID 1019683.1)
CAUTION: After replacing an Oracle server motherboard, it is necessary to update the platform serial number, which is the reference used to log Service Requests.
7. Run platform diagnostics
Oracle provides comprehensive diagnostic tools that test and validate Oracle hardware by verifying the connectivity and functionality of most hardware controllers and devices on Oracle hardware platforms.
The diagnostic tools can usually be executed booting from:
- the Tools and Drivers CD/DVD
- the ILOM
- an external drive
- PXE (network)
- the running Operating System
Prefer a standalone method and avoid running diagnostics from a running operating system, because doing so can generate false I/O access errors during the tests.
PcCheck
Since PcCheck is fairly easy to obtain (it is on most of the Tools and Drivers DVDs), is easy to run, and performs a decent low-level health check of the system, we recommend using PcCheck first (before VTS).
PcCheck is diagnostic software that thoroughly checks the hardware components, including memory modules, floppy drives, hard disk drives, CD-ROM/DVD drives, I/O ports, and the graphics controller.
To run the PcCheck diagnostics follow the steps below:
- Boot the system with the Supplemental CD (or Tools and Drivers DVD)
- At the main menu select "Run Hardware Diagnostics"
- At the PcCheck main menu select "Advanced Diagnostic Tests"
- At the Advanced Diagnostic Tests menu select "Memory"
- Then select "Test System Memory"
Oracle VTS
SunVTS software has a graphical user interface (GUI) that provides test configuration and status monitoring. The user interface can be run on one system to display the SunVTS testing of another system on the network. SunVTS software also provides a TTY-mode interface for situations in which running a GUI is not possible.
The following tests are available in SunVTS: Processor/Memory/Disk/Graphics/Media/Ioports/Interconnects/Network/Environment/HBA
For more information refer to Oracle VTS 7.0 Software User's guide: http://docs.oracle.com/cd/E19719-01/E21664/index.html
8. Check platform health at Operating System level
When able to boot and log into the Operating System it is important to also verify if any hardware or software issue has been reported at OS level and if platform related patches are up-to-date.
The following OS-specific tools collect information about each operating system, such as the running kernel, currently loaded drivers, configuration files, log files, etc. Each of these tools must be run as root.
Generate an Oracle Explorer Data Collector utility output with the command:
# /opt/SUNWexplo/bin/explorer
Oracle Explorer software is a support tool used to collect pertinent data from a system running the Solaris(TM) Operating System. Oracle engineers use Explorer to describe a system's configuration or to troubleshoot a problem.
Oracle Explorer is part of the STB (Services Tools Bundle) that can be downloaded from My Oracle Support (MOS):
http://support.oracle.com -> Patches & Updates -> Advanced Search and Select "Services Tools Bundle" as the Product
More details about Oracle Explorer and available options:
Oracle Explorer Data Collector - Product Information Center (Doc ID 1312847.1)
Execute SUSE Linux Enterprise Server supportconfig utility
# supportconfig
# man supportconfig
Execute Red Hat Enterprise Linux sosreport utility
# sosreport
# man sosreport
For Red Hat Enterprise Linux 4.5 and earlier use sysreport instead.
Execute VMware vm-support utility
# /usr/bin/vm-support
For more information refer to VMware knowledge document:
Collecting diagnostic information for VMware Server
kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1967
Windows MPS Report utility
For instructions to install and run Microsoft MPS Report refer to Microsoft Knowledge document:
Microsoft Product Support Reports:
http://www.microsoft.com/download/en/details.aspx?id=24745
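The choice of collection utility above depends only on the operating system and, for Red Hat, on the release. The following sketch captures that choice as a lookup; the OS labels ("solaris", "sles", "rhel", "vmware", "windows") and the function name are assumptions for illustration, while the commands are the ones listed above.

```python
# Commands as given in this document; the dictionary keys are assumed labels.
COLLECTORS = {
    "solaris": "/opt/SUNWexplo/bin/explorer",
    "sles": "supportconfig",
    "rhel": "sosreport",
    "vmware": "/usr/bin/vm-support",
    "windows": "Microsoft MPS Report utility",
}


def collector_for(os_name, version=None):
    """Return the data collection utility to run as root for the given OS.

    version is an optional (major, minor) tuple, used only for RHEL.
    """
    key = os_name.strip().lower()
    if key not in COLLECTORS:
        raise ValueError("no collection utility listed for %r" % os_name)
    # RHEL 4.5 and earlier use sysreport instead of sosreport.
    if key == "rhel" and version is not None and tuple(version) <= (4, 5):
        return "sysreport"
    return COLLECTORS[key]


if __name__ == "__main__":
    print(collector_for("RHEL", (4, 5)))
    print(collector_for("Solaris"))
```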
Core dump generation
Some Oracle platforms have an NMI switch at the back of the system that generates an interrupt to stop the Operating System and force a core dump. The NMI can be generated as follows:
- Pressing the NMI physical button at the back of the server
- Using the ILOM web GUI "generate NMI" button located under the diagnostic tab
- Executing the command below at ILOM prompt:
-> set /HOST generate_host_nmi=true
Note that the Operating System must be configured to collect a core dump when an NMI is received. Refer to your Operating System documentation if required.
For help with a possible system hang, refer to the following document:
How to check if your x64 platform "system hang" actually is a system hang (DocID 1012991.1)
Caution: Before reviewing OS related data it is also recommended to verify if the Operating System installed is supported or certified for this particular server to avoid any driver or compatibility issue.
9. Collect diagnostic information for Oracle support
Collect ILOM Service Snapshot output from the ILOM Web GUI
The purpose of the ILOM Service Snapshot utility is to collect data for use by Oracle Services personnel to diagnose system problems.
An ILOM snapshot output can be generated from the ILOM GUI -> Maintenance tab -> Snapshot tab.
Select the desired Data Set:
- Normal: Specifies that ILOM, operating system, and hardware information is collected.
- FRUID: Available as of ILOM 3.0.3. Specifies that information about the FRUs currently configured on your server is collected in addition to the data collected by the Normal option.
- Full: Specifies that all data is to be collected. Selecting Full might reset the system on an AMD Processor-based platform if a HyperTransport bus failure is detected when running HDT low-level diagnostics.
- Custom: Allows you to choose one or more data sets
Caution: Customers should not run this utility unless requested to do so by Oracle Services.
For more information about the ILOM Service Snapshot utility please refer to the Oracle Integrated Lights Out Manager (ILOM) 3.0 Web Interface Procedures Guide:
http://docs.oracle.com/cd/E19860-01/
If an ILOM Snapshot cannot be collected, it is recommended to collect the following outputs:
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> chassis status
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> fru
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr list all info
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sdr
# ssh -l <USERNAME> <ILOM IP Address>
-> show / -l all -o table
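When the fallback ipmitool outputs have to be gathered from several systems, generating the command lines from one list avoids typing mistakes. The sketch below only builds the argument lists for the ipmitool commands shown above and does not contact any Service Processor; the function name is hypothetical and the address `192.0.2.10` is a documentation placeholder, not a real ILOM.

```python
# (global options, subcommand) pairs matching the ipmitool commands above.
FALLBACK_COMMANDS = [
    ([], ["chassis", "status"]),
    ([], ["sel", "elist"]),
    (["-v"], ["sel", "elist"]),
    ([], ["fru"]),
    ([], ["sensor"]),
    ([], ["sunoem", "sbled", "get", "all"]),
    ([], ["sdr", "list", "all", "info"]),
    (["-v"], ["sdr"]),
]


def build_commands(host, user):
    """Return one argument list per ipmitool command for the given ILOM."""
    base = ["ipmitool", "-I", "lan", "-H", host, "-U", user]
    return [base + options + subcommand
            for options, subcommand in FALLBACK_COMMANDS]


if __name__ == "__main__":
    for argv in build_commands("192.0.2.10", "root"):
        print(" ".join(argv))
```

Each argument list can be run by hand or passed to a process launcher such as `subprocess.run`, with the output saved to one file per command; how the ILOM password is supplied is left to the operator.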
Appendix
Links of interest:
Oracle x86 Servers Documentation
http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html#hic
Firmware Downloads and Release History for Sun Systems
http://www.oracle.com/technetwork/systems/patches/firmware/release-history-jsp-138416.html
Sun x86 and x64 Platforms: Matrix of expansion cards (Doc ID 1374659.1)
Sun System Handbook
https://support.oracle.com/handbook_private/
Oracle VTS 7.0
http://docs.oracle.com/cd/E19719-01/
Systems Management and Diagnostics
http://www.oracle.com/us/products/applications/crmondemand/login/sys-mgmt-networking-190072.html
Oracle Integrated Lights Out Manager (ILOM) 3.0 Documentation
http://docs.oracle.com/cd/E19860-01/index.html
Sun Integrated Lights Out Manager (ILOM) 2.0 Documentation
http://docs.oracle.com/cd/E19720-01/index.html
Sun Installation Assistant for x64 Servers Documentation
http://docs.oracle.com/cd/E19593-01/index.html
How to update the Serial Number on Oracle x64 platforms (Doc ID 1364359.1)
RAID Management Software Documentation
http://docs.oracle.com/cd/E23383_01/index.html
If unsure how to proceed, or unable to perform the above process, collect as much information pertaining to the boot failure as possible (console logs, error messages, etc.), then call back in and request the next available engineer.
References
<NOTE:1019683.1> - How to analyze Memory Errors on x86_64 Servers Using HERD
<NOTE:1312847.1> - Oracle Explorer Data Collector Resource Center
<NOTE:1330254.1> - X86 Product Home
<NOTE:1364359.1> - How to update the Serial Number on Oracle x64 platforms
<NOTE:1374659.1> - Sun x86 Platforms: Matrix of Expansion Cards
<NOTE:1002926.1> - How to check if a Sun X86 server is powered on
<NOTE:1009698.1> - How to perform platform configuration, management, and data collection tasks with ipmitool on Sun X64 servers. [Video]
<NOTE:1012991.1> - How to check if your x86 platform "system hang" actually is a system hang
<NOTE:1002941.1> - How to check why the system powered off, on Sun X64 servers.
<NOTE:1013107.1> - How to Identify BIOS and Solaris[TM] Hardware RAID Status
Attachments
This solution has no attachment