Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1008409.1
Update Date:2018-03-26
Keywords:

Solution Type  Technical Instruction Sure

Solution  1008409.1 :   How To Verify Platform Health on an Oracle x86 System  


Related Items
  • Sun Fire X2200 M2 Server
  • Sun Fire X4150 Server
  • Sun Blade 6000 System
  • Sun Fire X4440 Server
  • Sun Fire V20z Server
  • Sun Fire X4540 Server
  • Sun Netra X4200 M2 Server
  • Sun Blade 8000 System
  • Sun Netra X4270 Server
  • Sun Fire X4275 Server
  • Sun Fire X4250 Server
  • Sun Fire X4200 Server
  • Sun Netra X4450 Server
  • Sun Fire X4240 Server
  • Sun Fire X4600 M2 Server
  • Sun Fire X4200 M2 Server
  • Sun Fire X2270 Server
  • Sun Fire X4470 Server
  • Sun Fire X4140 Server
  • Sun Fire X2250 Server
  • Sun Fire X4100 M2 Server
  • Sun Fire X2100 M2 Server
  • Sun Blade 6048 System
  • Sun Fire X4170 Server
  • Sun Fire V40z Server
  • Sun Fire X4270 M2 Server
  • Sun Fire X4600 Server
  • Sun Fire X4640 Server
  • Sun Netra X4250 Server
  • Sun Fire X4100 Server
  • Sun Fire X4270 Server
  • Sun Fire X2270 M2 Server
  • Sun Fire X4800 Server
  • Sun Fire X4500 Server
  • Sun Fire X4450 Server
  • Sun Fire X4170 M2 Server
Related Categories
  • PLA-Support>Sun Systems>x86>Server>SN-x64: MISC-SERVER
  • _Old GCS Categories>Sun Microsystems>Servers>x64 Servers

PreviouslyPublishedAs
211493


Applies to:

Sun Fire X4450 Server - Version Not Applicable and later
Sun Fire X4200 Server - Version Not Applicable and later
Sun Fire V40z Server - Version Not Applicable and later
Sun Fire X2200 M2 Server - Version Not Applicable and later
Sun Fire X4140 Server - Version Not Applicable and later
All Platforms

Goal

Description

The purpose of this document is to outline the various ways in which you can check a Sun x64 server for error conditions.

Symptoms

  • Data gathering for troubleshooting

 

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Sun x86 Systems

Solution

Steps to Follow

This document explains how to examine the system LEDs, status indicators, and event logs via ipmitool, Service Processor Web GUI, and Service Processor CLI, as well as what to check if you are local to the server.

If any potential problems are identified, further troubleshooting will be required.

Checking LEDs and indicators with ipmitool

# ipmitool -I lan -H <SP IP Address> -U <SP username> sunoem led get


Look for LEDs with a value of ON, which could indicate a problem.

Example: Processor 1 DIMM 0 LED is ON (Processor socket, not CPU core number):

p0.d2.led | OFF
p0.d3.led | OFF
p1.led    | OFF
p1.d0.led | ON <--
p1.d1.led | OFF
p1.d2.led | OFF
p1.d3.led | OFF


Example: A fan fault on module FB1/FM0 causes other related LEDs to turn on and off:

OK              | ON
SERVICE         | ON <-- System service LED turns ON
LOCATE          | OFF
PS_FAULT        | OFF
FAN_FAULT       | ON <-- Fan fault LED turns ON
TEMP_FAULT      | OFF
...
FB1/FM0/SERVICE | ON <-- Faulted fan module service indicator requiring repair
FB1/FM1/SERVICE | OFF
FB1/FM2/SERVICE | OFF
FB1/FM0/OK      | OFF <-- Fan module is not ok
FB1/FM1/OK      | ON
FB1/FM2/OK      | ON
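
The check described above can be sketched as a small filter over saved "sunoem led get" output. This is a minimal illustration, not an Oracle-supplied tool: the function name suspect_leds and the rule set (an OK LED that is OFF and any other LED except LOCATE that is ON are both worth a look) are assumptions based only on the examples shown here. An OK LED can also be OFF simply because the host is powered down, so treat the result as a starting point.

```python
# Assumption: LEDs named OK are normally ON; LOCATE is informational only.
NORMAL_WHEN_ON = ("OK",)
IGNORED = ("LOCATE",)


def suspect_leds(output):
    """Return LED names from 'sunoem led get' output that deserve a closer look."""
    suspects = []
    for line in output.splitlines():
        # Drop any trailing '<--' annotation, then split 'name | state'.
        line = line.split("<--")[0]
        if "|" not in line:
            continue
        name, state = (part.strip() for part in line.split("|", 1))
        state = state.upper()
        leaf = name.split("/")[-1].upper()
        if leaf in IGNORED:
            continue
        if leaf in NORMAL_WHEN_ON:
            # An OK indicator that is OFF suggests the component is not healthy.
            if state == "OFF":
                suspects.append(name)
        elif state == "ON":
            # Any other lit indicator (SERVICE, *_FAULT, DIMM LEDs) is suspect.
            suspects.append(name)
    return suspects


sample = """\
OK              | ON
SERVICE         | ON <-- System service LED turns ON
LOCATE          | OFF
PS_FAULT        | OFF
FAN_FAULT       | ON <-- Fan fault LED turns ON
TEMP_FAULT      | OFF
FB1/FM0/SERVICE | ON
FB1/FM1/SERVICE | OFF
FB1/FM0/OK      | OFF
FB1/FM1/OK      | ON
p1.d0.led       | ON
"""

for led in suspect_leds(sample):
    print(led)
```

Save the ipmitool output to a file and pass its contents to suspect_leds() to get a short list of indicators to investigate.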



Output similar to the following may indicate a newer version of ipmitool is required:

Sun OEM Get LED command failed: Parameter out of range
Sun OEM Get LED command failed: Destination unavailable


Download the latest Oracle-supplied version of ipmitool from the Oracle product downloads web page. See:

How to perform platform configuration, management, and data collection tasks with ipmitool on Sun X64 servers <Document: 1009698.1>

Checking fma data in the ILOM snapshot

ILOM 3.x versions may include an fma directory, populated with fault data, in the snapshot file. As far as faults are concerned, the ILOM fma data should match the Solaris fma data.

Example of a failed fan:

------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2011-03-22/14:22:09 f8dde8af-d369-e31f-c9a6-b159683a286f SPX86-8000-33  Major

Fault class : fault.chassis.device.fan.fail

FRU : /SYS/FB/FAN1 (Part Number: unknown) (Serial Number: unknown)

Response : The service-required LED may be illuminated on the affected
FRU and chassis. System will be powered down when the High Temperature
threshold is reached.

Action : The administrator should review the ILOM event log for
additional information pertaining to this diagnosis. Please refer to the
Details section of the Knowledge Article for additional information.
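
When a snapshot contains many fault records, it can help to pull out the key fields programmatically. The following is a minimal sketch that assumes the record layout matches the failed-fan example above (a header row with time, UUID, msgid, and severity, followed by "Label : text" fields that may wrap onto continuation lines). The function name parse_fault is hypothetical and is not part of ILOM.

```python
import re

# Labelled fields as they appear in the example fault record.
FIELD = re.compile(r"^(Fault class|FRU|Response|Action)\s*:\s*(.*)$")
# Header row: timestamp, 36-character UUID, message ID, severity.
HEADER = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+"
    r"(?P<uuid>[0-9a-f-]{36})\s+(?P<msgid>\S+)\s+(?P<severity>\S+)\s*$"
)


def parse_fault(text):
    """Parse a single fault record into a dictionary of its fields."""
    fault = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line ends any wrapped field
            continue
        header = HEADER.match(line)
        if header:
            fault.update(header.groupdict())
            current = None
            continue
        field = FIELD.match(line)
        if field:
            current = field.group(1)
            fault[current] = field.group(2)
        elif current:
            # Continuation of a wrapped Response or Action paragraph.
            fault[current] += " " + line
    return fault


sample = """\
2011-03-22/14:22:09 f8dde8af-d369-e31f-c9a6-b159683a286f SPX86-8000-33 Major

Fault class : fault.chassis.device.fan.fail

FRU : /SYS/FB/FAN1 (Part Number: unknown) (Serial Number: unknown)

Response : The service-required LED may be illuminated on the affected
FRU and chassis. System will be powered down when the High Temperature
threshold is reached.
"""

fault = parse_fault(sample)
print(fault["severity"], fault["Fault class"], fault["FRU"])
```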


In addition, certain ILOM 3.x versions may include a fault management shell. Refer to:

 How to use the Oracle ILOM 3.x Fault Management Shell <Document: 1309092.1>

Checking LEDs and indicators in the Service Processor Web GUI

Integrated Lights Out Manager (ILOM) and Embedded Lights Out Manager (ELOM) based Service Processors provide an easy-to-use web interface for managing the platform.
Point your web browser to the Service Processor IP address or resolving DNS hostname, and enter your login credentials when prompted.

Then, when logged in (exact display will differ between ILOM and ELOM):

  • Click the "System Monitoring" tab, then click the "Sensor readings" tab. Newer ILOM versions also have an "Indicators" tab, which should be checked as well.
  • Using the drop-down menu, select "All Sensors" (or "All Indicators" when in the Indicators tab).
  • Browse the resulting output for fault LEDs and indicators:
      • Check the 'Name' column for names ending in 'fail', 'FAULT', and 'SERVICE'.
      • Then look along to the 'Status' or 'Reading' column for its status:
          • "Predictive Failure Asserted" means the fault LED is ON.
          • "Predictive Failure Deasserted" means the fault LED is OFF.


Example: CPU1 DIMM0 fault (as displayed by an older version of ILOM)

Status                         Name            Reading

Predictive Failure Asserted    p1.d0.fail      2         - Processor One DIMM 0 Fault LED ON
Predictive Failure Deasserted  p1.d1.fail      1         - Processor One DIMM 1 Fault LED OFF
Predictive Failure Deasserted  p1.d2.fail      1         - Processor One DIMM 2 Fault LED OFF


Example: Fan fault on module FB1/FM0 shown in the Indicators tab on a newer ILOM version.

Name                       Status
FB1/FM0/SERVICE            On      <-- Faulty fan module
FB1/FM1/SERVICE            Off
FB1/FM2/SERVICE            Off
FB1/FM0/OK                 Off     <-- Fan module no longer OK
FB1/FM1/OK                 On
FB1/FM2/OK                 On
...
/SYS/FAN_FAULT             On      <-- Fan fault indicator is On
/SYS/LOCATE                Off
/SYS/OK                    On
/SYS/PS_FAULT              Off
/SYS/SERVICE               On      <-- System service LED is On
/SYS/TEMP_FAULT            Off
...


For more information, refer to the ELOM or ILOM Administration Guide for your platform:

http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Checking LEDs and indicators using the Service Processor CLI

ILOM:

-> show -d properties -level all /SYS


Example: Chassis 'Service' LED ON

/SYS/SERVICE
Properties:
type = Indicator
value = On



Example: Processor Zero DIMM 2 LED is ON (Processor 0 socket, not CPU core)

/SYS/P0/D2/SERVICE
Properties:
type = Indicator
value = On
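
The output of "show -d properties -level all /SYS" can be long, so a small filter helps to find lit fault indicators. The sketch below assumes the output consists of a target path line followed by "key = value" property lines, as in the examples above. The function name fault_indicators_on and the rule for what counts as a fault LED (last path component is SERVICE or ends in FAULT) are illustrative assumptions.

```python
def fault_indicators_on(output):
    """Return SERVICE/FAULT indicator targets whose value is On."""
    lit = []
    target, props = None, {}

    def flush():
        # Evaluate the target collected so far before moving to the next one.
        leaf = target.rsplit("/", 1)[-1] if target else ""
        is_fault_led = leaf == "SERVICE" or leaf.endswith("FAULT")
        if (is_fault_led and props.get("type") == "Indicator"
                and props.get("value") == "On"):
            lit.append(target)

    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            flush()
            target, props = line, {}
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            props[key] = value
    flush()  # do not forget the final target
    return lit


# Sample built from the targets shown in this document.
sample = """\
/SYS/SERVICE
    Properties:
        type = Indicator
        value = On

/SYS/OK
    Properties:
        type = Indicator
        value = On

/SYS/LOCATE
    Properties:
        type = Indicator
        value = Off

/SYS/P0/D2/SERVICE
    Properties:
        type = Indicator
        value = On
"""

print(fault_indicators_on(sample))
```

The /SYS/OK indicator is deliberately excluded because it is expected to be On on a healthy system.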



Example: Processor Zero DIMM 2 fault from the Fault Management Architecture logic (FMA/FDD)

-> show /SP/faultmgmt

/SP/faultmgmt
Targets:
0 (/SYS/MB/P0/D2)

Properties:

Commands:
cd
show
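
The faulted components can be read from the Targets section of the "show /SP/faultmgmt" output. A minimal sketch, assuming each target is listed as an index followed by the component path in parentheses, as in the example above (the function name faulted_components is hypothetical):

```python
import re

# Matches lines such as '0 (/SYS/MB/P0/D2)'.
TARGET = re.compile(r"^\s*(\d+)\s+\((/\S+)\)\s*$")


def faulted_components(output):
    """Return the component paths listed as targets under /SP/faultmgmt."""
    matches = (TARGET.match(line) for line in output.splitlines())
    return [m.group(2) for m in matches if m]


sample = """\
/SP/faultmgmt
Targets:
0 (/SYS/MB/P0/D2)

Properties:

Commands:
cd
show
"""

print(faulted_components(sample))
```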



ELOM:

-> show -level all /SP
-> show -level all /SYS


Example: CPU1 disabled due to a fault

/SP/SystemInfo/CPU/CPU1
 Properties:
  Designation = CPU 1
  Manufacturer = AMD
  Name = Opetron
  Speed = 2800MHz
  Status = disabled



V20z/V40z:

$ sensor get --type led



Example - CPU0 DIMM3 and System Fault LEDs are ON

Identifier        Value
cd.lp 0.00        On/Off
cpu0.lp 0.00      On/Off
cpu0.mem0.lp 0.00 On/Off
cpu0.mem1.lp 0.00 On/Off
cpu0.mem2.lp 0.00 On/Off
cpu0.mem3.lp 1.00 On/Off <-- 1 means ON
...
cpuplanar.lp 0.00 On/Off
faultswitch 1.00  On/Off <-- 1 means ON
floppy.lp 0.00    On/Off
...
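
To list only the LEDs that are ON from a saved copy of this output, a filter along the following lines can be used. It is a sketch that assumes the layout shown above: identifier first, numeric value second, where 1.00 means ON. The function name leds_on is hypothetical.

```python
def leds_on(output):
    """Return identifiers whose numeric value is 1 in 'sensor get --type led' output."""
    on = []
    for line in output.splitlines():
        # Remove '<--' annotations, then split into whitespace-separated fields.
        fields = line.split("<--")[0].split()
        if len(fields) < 2:
            continue
        try:
            value = float(fields[1])
        except ValueError:
            continue  # header row or other non-data line
        if value == 1.0:
            on.append(fields[0])
    return on


sample = """\
Identifier        Value
cd.lp 0.00        On/Off
cpu0.lp 0.00      On/Off
cpu0.mem2.lp 0.00 On/Off
cpu0.mem3.lp 1.00 On/Off <-- 1 means ON
...
cpuplanar.lp 0.00 On/Off
faultswitch 1.00  On/Off <-- 1 means ON
floppy.lp 0.00    On/Off
"""

print(leds_on(sample))
```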


Physically checking LEDs if you are local to the server

Physically examine both the back and front of the server for illuminated LEDs. For further information about LED states, refer to the appropriate Server Service Manual or Server Diagnostics Guide:

http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Checking platform events and sensors with ipmitool

Use the following common ipmitool commands to gather further data on the possible reasons for the platform state. The output is also useful if you need to log a support call.

ipmitool -I lan -H <SP IP Address> -U <SP username> sel elist
ipmitool -I lan -H <SP IP Address> -U <SP username> sel info
ipmitool -I lan -H <SP IP Address> -U <SP username> sdr list all info
ipmitool -I lan -H <SP IP Address> -U <SP username> fru print
ipmitool -I lan -H <SP IP Address> -U <SP username> sensor
ipmitool -I lan -H <SP IP Address> -U <SP username> sunoem led get


See <Document: 1009698.1> for more information on using ipmitool to collect system event, state, and LED information.
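
If you collect this data regularly, the commands above can be wrapped in a short script. The following is a sketch under stated assumptions: the function names build_commands and collect, the output file naming, and the example address 192.0.2.10 are all hypothetical. It uses the ipmitool "-f <password_file>" option so that the commands do not stop to prompt for the SP password.

```python
import shlex
import subprocess
from pathlib import Path

# The ipmitool subcommands listed in this document.
SUBCOMMANDS = [
    "sel elist",
    "sel info",
    "sdr list all info",
    "fru print",
    "sensor",
    "sunoem led get",
]


def build_commands(sp_host, sp_user, password_file=None):
    """Return one argv list per subcommand for the given Service Processor."""
    base = ["ipmitool", "-I", "lan", "-H", sp_host, "-U", sp_user]
    if password_file:
        base += ["-f", password_file]  # read the SP password from a file
    return [base + sub.split() for sub in SUBCOMMANDS]


def collect(sp_host, sp_user, password_file, out_dir):
    """Run every command and save its output under out_dir (one file each)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    commands = build_commands(sp_host, sp_user, password_file)
    for sub, argv in zip(SUBCOMMANDS, commands):
        result = subprocess.run(argv, capture_output=True, text=True)
        target = out / (sub.replace(" ", "_") + ".txt")
        target.write_text(result.stdout + result.stderr)


# Print the commands without running them (hypothetical SP address and user).
for argv in build_commands("192.0.2.10", "root"):
    print(shlex.join(argv))
```

Calling collect() requires ipmitool to be installed and a reachable Service Processor; build_commands() on its own only constructs the command lines.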

Recent versions of ILOM include a Snapshot feature, which automates collection of ipmitool outputs and other relevant diagnostic information from the Service Processor needed for troubleshooting platform problems. A 'normal' level ILOM snapshot is appropriate in most cases.

For more information, see <Document: 1020204.1>

Gathering information on system issues using the Service Processor web GUI

Point your web browser to the Service Processor IP address or resolving DNS hostname, and enter your login credentials when prompted.

Checking the System Event Log (SEL)

After you have logged into the Service Processor, click "System Monitoring" tab then click the "Event Logs" tab. Select an event log category that you want to view from the drop-down list. You can select from the following types of events:

  • Sensor-specific events - Events generated by sensors.
  • BIOS-generated events - Error messages generated in the BIOS.
  • System management software events - Events that occur within the ILOM software.


After you have selected a category of event, the Event Log table displays the specified events. Depending on the ILOM/ELOM version, you can instead use the Display drop-down list to show all events or a set number of events.

Checking ILOM Fault Management

To display a list of active system faults, click the "System Information" tab, then the "Fault Management" tab (not available on all platforms).

If a fault is present, click on the fault in the "ID" column to display more details.

Refer to the Integrated Lights Out Manager (ILOM) Administration Guide for your platform and ILOM version. Also see the ILOM Administration Guide Supplement for Sun Fire, if available for your platform, at http://www.oracle.com/technetwork/documentation/oracle-x86-servers-190077.html

Gathering information on system issues using Service Processor CLI

SSH into the Service Processor, then use the following commands to view the system event and fault logs:

ILOM:

-> show /SP/logs/event/list
-> show -d properties -level all /SP/faultmgmt


NOTE: /SP/faultmgmt is not available on all platforms

ELOM:

-> show /SP/AgentInfo/SEL


V20z/V40z:

$ sp get events -v

 


Using your service processor's built-in diagnostic

You may be able to troubleshoot your platform's issue using the built-in hardware diagnostic of the service processor.

1. Log in to the ILOM CLI of the affected server (e.g. ssh root@x5-2l-sp).
2. Once logged in, enter the restricted shell:

-> set SESSION mode=restricted

WARNING: The "Restricted Shell" account is provided solely
to allow Services to perform diagnostic tasks.

[(restricted_shell) x5-2l-sp:~]#

3. Collect the output from the following hwdiag commands (shortened example output is shown for each):

[(restricted_shell) x5-2l-sp:~]# hwdiag i2c test all
HWdiag (Restricted Mode) - Build Number 104652 (Nov 05 2015, 08:00:50)
Current Date/Time: May 05 2016, 11:06:59
Note: Turn off host to access DIMMs over i2c.
I2C DEVICE                     CHIP    BUS/MUX/CH/ADDR  RESULT
----------------------------------------------------------------------
Power Control                  FPGA    1/FF/FF/4E       OK
PCI Slot 0 Inlet Temp Sensor   NCT214  1/FF/FF/30       OK
PCI Slot 0 Exit Temp Sensor    NCT214  1/FF/FF/98       OK
PCI Slot 1 Exit Temp Sensor    NCT214  1/FF/FF/9A       OK
...

[(restricted_shell) x5-2l-sp:~]# hwdiag led info all
HWdiag (Restricted Mode) - Build Number 104652 (Nov 05 2015, 08:00:50)
Current Date/Time: May 05 2016, 11:06:35
Dumping Registers for REAR_CAT9552:
REGISTER ADDR VALUE
-------------------------------------------------------------------
Input0 (0x00) : 0x06
...

[(restricted_shell) x5-2l-sp:~]# hwdiag led get all
HWdiag (Restricted Mode) - Build Number 104652 (Nov 05 2015, 08:00:50)
Current Date/Time: May 05 2016, 11:06:29
LED VALUE
------------------------------------------
/SYS/DBP/HDD0/GREEN : Host-driving/On
/SYS/DBP/HDD0/OK2RM : Host-driving/Off
/SYS/DBP/HDD0/SERVICE : Host-driving/Off
...

[(restricted_shell) x5-2l-sp:~]# hwdiag system summary
HWdiag (Restricted Mode) - Build Number 104652 (Nov 05 2015, 08:00:50)
Current Date/Time: May 05 2016, 11:04:50

Platform ORACLE SERVER X5-2L

ILOM Firmware Date: Thu Nov 5 16:06:43 CST 2015, Version: 3.2.4.62(r104652)

...

4. Once completed, simply type "exit" to leave the restricted shell:

[(restricted_shell) x5-2l-sp:~]# exit

See resource https://docs.oracle.com/cd/E23161_01/html/E23099/gmcfi.html#scrolltoc for more information
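
Once the hwdiag output has been saved, the "hwdiag led get all" section can be scanned for lit service LEDs. This is a sketch only: it assumes each line has the form "<LED path> : <driver>/<state>" as in the shortened example above, and the function name service_leds_on and the HDD1 line in the sample are hypothetical additions for illustration.

```python
def service_leds_on(output):
    """Return SERVICE LED paths whose state is On in 'hwdiag led get all' output."""
    flagged = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("/") or " : " not in line:
            continue  # banner, header, or separator line
        name, value = (part.strip() for part in line.split(" : ", 1))
        # Assumed value format is '<driver>/<state>', e.g. 'Host-driving/Off'.
        state = value.rsplit("/", 1)[-1]
        if name.endswith("/SERVICE") and state == "On":
            flagged.append(name)
    return flagged


sample = """\
LED VALUE
------------------------------------------
/SYS/DBP/HDD0/GREEN : Host-driving/On
/SYS/DBP/HDD0/OK2RM : Host-driving/Off
/SYS/DBP/HDD0/SERVICE : Host-driving/Off
/SYS/DBP/HDD1/SERVICE : Host-driving/On
"""

print(service_leds_on(sample))
```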

 

Gathering information about possible issues if you are local to the server

To check for issues locally, the platform must be down because you need to enter the BIOS. If the server is up, use one of the other methods described above (ipmitool, SP web GUI, or SP CLI).

  • Power on the platform by pressing the power button.
  • Press F2 when prompted to enter the BIOS. Note any events that are reported.
  • Once in the BIOS, use the cursor keys to navigate to the tab labeled Advanced.
  • Navigate down to Event Log Configuration and press Enter.
  • Select View Event Log and examine it for possible reasons for the outage. Use Esc to exit.
  • Once back at the 'Advanced' tab, navigate to 'IPMI 2.0 Configuration', press Enter, then select 'View BMC System Event Log'.


NOTE: These events are displayed in raw format. Unless you are familiar with them, use the ipmitool commands above, which decode the events automatically. Because some events are logged as part of the normal power-on process, the events must be decoded before you can look for issues.

The messages can also be decoded manually by accessing the following document:

http://download.intel.com/design/servers/ipmi/IPMI2_0E4_Markup_061209.pdf

NOTE: It is beyond the scope of this document to cover this manual process of decoding.

Previously Published As 91593


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.