Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1383349.1 : How to Perform Onsite Diagnosis for an Oracle AMD-based x86 System that is Down
Oracle Confidential PARTNER - Available to partners (SUN).

Applies to:
Sun Fire V20z Server - Version Not Applicable and later
Sun Fire X4140 Server - Version Not Applicable and later
Sun Fire X4500 Server - Version Not Applicable and later
Sun Fire X4200 Server - Version Not Applicable and later
Sun Blade X6440 Server Module - Version Not Applicable to Not Applicable [Release N/A]
Information in this document applies to any platform.

Goal
How to perform onsite diagnosis for a down x64 AMD system. This applies to AMD processor-based servers and blade servers.

Solution
How to perform onsite diagnosis for a down x64 AMD system

1. Investigate system power source
- Are the LEDs lit?

2. Validate the customer can log into the Service Processor (SP/ELOM/ILOM)
Depending on the system, the monitoring interface can be a Service Processor (SP), ELOM, or ILOM.
Refer to the Oracle x86 Servers documentation to identify the monitoring interface for your system.

Verifying system power status via ipmitool
Run ipmitool from a remote system against the Service Processor with the commands shown in the examples below. The resulting output indicates whether power is on or off.

# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : unknown
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false

# ipmitool -I lanplus -U root -H <ILOM IP Address> chassis power status
Chassis Power is on
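When the power check has to be repeated across several systems, the one-line answer from "chassis power status" can be reduced to a single word. The following is a minimal sketch, assuming the output has the "Chassis Power is on" form shown in the example above; the function name power_state_from_status is hypothetical and is not part of ipmitool.

```shell
#!/bin/sh
# Hypothetical helper (not part of ipmitool): reads the text printed by
# "ipmitool ... chassis power status" on stdin and prints "on", "off",
# or "unknown" when no recognizable status line is found.
power_state_from_status() {
    # Expected input line, as in the example above: "Chassis Power is on"
    pss_state=$(sed -n 's/^Chassis Power is \([a-z]*\).*$/\1/p' | head -n 1)
    case "$pss_state" in
        on|off) echo "$pss_state" ;;
        *)      echo "unknown" ;;
    esac
}
```

A possible use is to pipe the command from the example into the helper: ipmitool -I lanplus -U root -H <ILOM IP Address> chassis power status | power_state_from_status. Any other text, such as a connection error, is reported as "unknown" rather than guessed.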
Verifying system power status via the Service Processor CLI
Log in to the Service Processor via SSH:

# ssh -l <USERNAME> <ILOM IP Address>

On a V20z/V40z Service Processor:

$ platform get power state
On

On ELOM:

-> show /SP/SystemInfo/CtrlInfo
/SP/SystemInfo/CtrlInfo
    Targets:
    Properties:
        PowerStatus = on

On ILOM:

-> show /SYS
...
/SYS
    Properties:
        type = Host System
        ipmi_name = /SYS
        product_name = SUN FIRE X4440
        product_part_number = 602-4057-01
        product_serial_number = 0812ZYX001
        product_manufacturer = SUN MICROSYSTEMS
        power_state = On
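The "show" output above is a list of "name = value" properties, so a single property such as power_state or PowerStatus can be picked out of a saved session log. The following is a minimal sketch, assuming the output was captured to a file or pipe in the layout shown above; the function name ilom_property is hypothetical and is not an ILOM or ELOM command.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of one "name = value" property from
# ILOM/ELOM "show" output supplied on stdin. Only the first match is printed.
ilom_property() {
    awk -v key="$1" '
        {
            # Split each line at the first " = " separator.
            pos = index($0, " = ")
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            value = substr($0, pos + 3)
            # Trim leading and trailing blanks from both parts.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }'
}
```

For example, with the ILOM output saved to a file named sys_output.txt (an assumed name), ilom_property power_state < sys_output.txt prints the power state. Nothing is printed when the property is absent.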
Verifying system power status via the Service Processor Web GUI
Integrated Lights Out Manager (ILOM) and Embedded Lights Out Manager (ELOM) based Service Processors provide an easy-to-use web interface for managing the platform. Point your web browser to the Service Processor IP address or to a DNS hostname that resolves to it, and enter your login credentials when prompted. The power state is displayed, for example:

Host is currently on

3. Troubleshoot power issues
Verify the state of the Power OK LED from the front or rear of the server. LED states may vary slightly between platforms, but generally:

4. Perform internal and external visual inspection
- Confirm whether the General Service Fault LED is lit or any Component Fault LED is on, which would indicate a hardware failure.
5. Collect basic server information regarding the outage using the Service Processor
# ssh -l <USERNAME> <ILOM IP Address>
IPMITOOL:
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr list all info
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sdr
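Because the same six ipmitool calls are repeated with only the address and user name changing, they can be generated rather than typed. The following is a minimal sketch, assuming the subcommand list above is the complete set to collect; the function name ipmi_collect_cmds is hypothetical, and the function only prints command lines so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: prints one ipmitool command line per data set listed
# above, for the given SP address ($1) and user name ($2). It only prints;
# nothing is executed against the Service Processor.
ipmi_collect_cmds() {
    icc_host=$1
    icc_user=$2
    for icc_sub in "sel elist" "-v sel elist" "sensor" \
                   "sunoem sbled get all" "sdr list all info" "-v sdr"; do
        echo "ipmitool -I lan -H $icc_host -U $icc_user $icc_sub"
    done
}
```

For example, ipmi_collect_cmds 192.0.2.10 root prints the six commands for an SP at the placeholder address 192.0.2.10. When the printed commands are run, ipmitool may prompt for the password on each call, since no password option is included here.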
ILOM:
Log in to the Server's ILOM and execute the commands:
-> show /SP/logs/event/list
-> show -d properties -level all /SYS
-> show -o table -level all /SP/faultmgmt (Only available with latest ILOM versions)

ELOM:
Log in to the Server's ELOM and execute the commands:
-> show /SP/AgentInfo/SEL
-> show -d properties -level all /SP/SystemInfo
CMM (Blade specific):
Log in to the CMM of the chassis in which the faulty Blade is inserted and execute the commands:
-> show /CMM/logs/event/list
-> show -d properties -level all /
V20z & V40z specific Service Processor commands:
Log in to the Server's SP IP address and execute the following commands:
# sp get events -v
# sensor get --verbose
# inventory get all -v
# sp get tdulog -f stdout
6. Hardware best practices
Best-practice steps to isolate a hardware issue when an Oracle x64 AMD processor-based server is down:
- Power off the platform and disconnect the power cords for a few minutes
- Update the platform firmware to the latest versions (ILOM/BIOS/HW RAID/PCI cards)
- Review the ILOM logs and sensors along with the OS boot sequence to verify whether any hardware or software issue is reported
-- Start the SP console to monitor the boot process
-- Start the Java Remote Console to monitor OS errors
-- View component information to determine component status
-- View the ILOM system event log
- Run Oracle VTS to verify whether any hardware error is reported
- Disconnect any external storage array
- If a component is reported faulty, replace it
- If unable to boot the OS, reduce to a minimum CPU/memory configuration to isolate the faulty component
- Remove any additional PCI cards
- If there is no evidence of a hardware issue and the OS is booting, consider gathering operating system information
- Update platform-related OS drivers
- Engage OS/software support to assist with a possible software issue
BIOS POST
From the point that the host subsystem is powered on and begins executing code, BIOS code is executed. The sequence that the BIOS goes through, from the first point where code is executed to the point where operating system booting begins, is referred to as POST (power-on self-test).

Boot device
Verify that the boot device is correct from the BIOS Boot tab:

Main  Advanced  PCIPnP  Boot  Security  Chipset  Exit

Boot Settings
    Boot Settings Configuration
    Boot Device Priority
    Hard Disk Drives
    Removable Drives
    CD/DVD Drives

Help text: Configure Settings during System Boot.
Keys: ←→ Select Screen, ↑↓ Select Item, Enter Go to Sub Screen, F1 General Help, F10 Save and Exit, ESC Exit
Disks
To troubleshoot a disk issue, identify your HW RAID controller and follow the instructions in the corresponding document (see ANNEX: Links of interest).

Blades
When troubleshooting a Blade issue, move the Blade module to another known-working slot to isolate the root cause.

Fans
A faulty fan or fan board can prevent an x64 server from booting because of the potential for system over-temperature and component damage.

Memory modules
When investigating a memory issue
Additionally, when memory errors are logged in Windows or Linux log files, install HERD to translate the memory error addresses into the CPU slot/memory slot.

7. Run platform diagnostics
Oracle provides comprehensive diagnostic tools that test and validate Oracle hardware by verifying the connectivity and functionality of most hardware controllers and devices on Oracle hardware platforms.
Oracle VTS
SunVTS software has a sophisticated graphical user interface (GUI) that provides test configuration and status monitoring. The user interface can be run on one system to display the SunVTS testing of another system on the network. SunVTS software also provides a TTY-mode interface for situations in which running a GUI is not possible.

PcCheck
PcCheck is diagnostic software that thoroughly checks the hardware components, including memory modules, floppy drives, hard disk drives, CD-ROM/DVD drives, I/O ports, and the graphics controller.

HDT Tool (AMD Processor-Based specific)
The Hardware Debug Tool (HDT) is a low-level diagnostic tool that tests access to the system bus, memory spaces, and CPU registers of the AMD processor-based platform.
# hdtl -q
8. Collect POST codes during the boot
If there is no video display when attempting to power on a platform, run the hdt command below from the ILOM to catch the last POST code:

# hdtl -bp8
0156 port80: 08c6
waiting for next POST code ............
waiting for next POST code ............

Capture the last POST codes from a Sun Fire V20z/V40z Service Processor with the command:

$ sp get port80 -m
0x97

Refer to the AMD Platform Service Manual to translate the POST code, which can help diagnose the point at which the boot is failing during initialization. To translate Sun Fire V20z/V40z POST codes, refer to Cheatsheet for V20z and V40z Post Codes (Doc ID 1006320.1).

9. Collect diagnostic information for Oracle support
Collect ILOM Service Snapshot utility from ILOM Web GUI
The purpose of the ILOM Service Snapshot utility is to collect data for use by Oracle Services personnel to diagnose system problems.
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> chassis status
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sel elist
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> fru
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr list all info
# ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> -v sdr

# ssh -l <USERNAME> <ILOM IP Address>
-> show / -l all -o table

For V20z or V40z platforms, the TDU logs must also be collected:
# sp get tdulog -f stdout
ANNEX: Links of interest
Oracle x86 Servers Documentation

References
<NOTE:1002926.1> - How To Check If an Oracle x86 Server Is Powered On
<NOTE:1002941.1> - How To Check Why the System Powered Off, on Oracle x86 Servers
<NOTE:1006320.1> - Cheatsheet for V20z and V40z Post Codes
<NOTE:1009698.1> - How to Perform Platform Configuration, Management, and Data Collection Tasks with ipmitool on Oracle X86 Systems [Video]
<NOTE:1013107.1> - How to Identify BIOS and Solaris[TM] Hardware RAID Status
<NOTE:1018266.1> - How to Collect Data from the TDULOGs on Sun Fire V20z/V40z
<NOTE:1019683.1> - How to analyze Memory Errors on x86_64 Servers Using HERD
<NOTE:1330254.1> - X86 Product Home
<NOTE:1364359.1> - How to update the Serial Number on Oracle x64 platforms
<NOTE:1374659.1> - Sun x86 Platforms: Matrix of Expansion Cards
<NOTE:1418253.1> - How to Perform Onsite Diagnosis for a Down x86 Intel System
<NOTE:1431330.1> - How to Collect Operating System Data to Troubleshoot Oracle X86 Platforms

Attachments
This solution has no attachment