Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1309092.1
Update Date:2017-12-05
Keywords:

Solution Type  Technical Instruction Sure

Solution  1309092.1 :   How to use the Oracle ILOM 3.x Fault Management Shell  


Related Items
  • SPARC T4-2
  • Oracle Virtual Compute Appliance X4-2 Hardware
  • Sun Blade 6000 System
  • Sun Fire X2200 M2 Server
  • SPARC M5-32
  • SPARC T3-2
  • SPARC T3-4
  • Sun Fire X4140 Server
  • SPARC M8-8
  • Sun SPARC Enterprise T5240 Server
  • SPARC T5-2
  • SPARC T3-1
  • SPARC M7-8
  • Exadata X3-2 Hardware
  • Sun Fire X4170 M2 Server

Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: M7
  • _Old GCS Categories>Sun Microsystems>Servers>CMT Servers
  • _Old GCS Categories>Sun Microsystems>Servers>x64 Servers
  • _Old GCS Categories>Sun Microsystems>Servers>Blade Servers




In this Document
Goal
Solution


Applies to:

Sun SPARC Enterprise T5240 Server - Version Not Applicable and later
SPARC T4-2 - Version Not Applicable and later
SPARC T5-2 - Version All Versions and later
SPARC M5-32 - Version All Versions and later
SPARC T3-2 - Version Not Applicable and later
Information in this document applies to any platform.

Goal

This document describes how to use the Fault Management Shell on ILOM 3.x and later platforms.

Caution - The purpose of the Oracle ILOM Fault Management Shell is to help Oracle Services personnel diagnose system problems. Customers should not launch this shell or run fault management commands in the shell unless requested to do so by Oracle Services.


The following fault management terms are defined:

  • Fault : A detected error condition in the hardware or software. A fault can be logged to the ILOM system event log.
  • FMRI : Fault Management Resource Identifier. This could be either the FRU name or UUID.
  • FRU : Field Replaceable Unit (such as a drive, memory DIMM, or printed circuit board).
  • Proactive Self-Healing : An architecture and methodology for automatically diagnosing, reporting, and handling software and hardware fault conditions. This reduces the time required to debug a hardware or software problem and provides the administrator or Oracle support with detailed data about each fault. The architecture consists of an event management protocol, the fault manager, and the fault-handling software.
  • Universal Unique Identifier (UUID) : Used to uniquely identify a problem across any set of systems.


Note: when /X is used to navigate the ILOM targets, the system automatically resolves it to /SP or /CMM as appropriate. Example from a T3-4 system:

-> help /X/diag/snapshot

 /SP/diag/snapshot : Take snapshot of system for diagnostic purposes

Solution

How to start/stop the Fault Management Shell.

The Fault Management shell is launched as a separate shell through the Oracle ILOM CLI. Only fault management commands can be run from this shell. To run standard Oracle ILOM commands, you must first exit the Fault Management shell.

To launch the shell, enter the following command when logged into the command line interface of the system's Oracle ILOM service processor:

-> start /X/faultmgmt/shell
Are you sure you want to start /CMM/faultmgmt/shell (y/n)? y

faultmgmtsp> help

Built-in commands:
  echo   - Display information to user.
           Typical use: echo $?
  help   - Produces this help.
           Use 'help <command>' for more information about an external command.
  exit   - Exit this shell.

External commands:
  fmadm  - Administers the fault management service
  fmdump - Displays contents of the fault and ereport/error logs
  fmstat - Displays statistics on fault management operations


The Fault Management shell includes the following commands.

  • fmadm : Administers the fault management service.
  • fmdump : Displays contents of the fault and ereport/error logs.
  • fmstat : Displays statistics on fault management operations.
  • echo : Displays information to the user. Typically used as echo $? to display the exit code of the last command executed.
  • help : Displays a list of the fault management commands that can run after entering the shell.
  • exit : Exits the Fault Management shell.
  • etcd : Injects specified events in order to test the Fault Manager.

Note: Starting with ILOM 3.2.1, etcd is available only from Service mode and no longer from the Fault Management shell.

 


   
- To exit the shell, enter the following command from the prompt:

faultmgmtsp> exit
->




How to track usage of the Fault Management Shell.

An audit log is saved to the SP event log at: /X/logs/event/

Example :

-> show /X/logs/event/list

  /SP/logs/event/list
    Targets:

    Properties:

    Commands:
        cd
        show

ID     Date/Time                 Class     Type      Severity
-----  ------------------------  --------  --------  --------
668    Wed Mar 30 13:25:57 2011  Captive Shell  Command Entered  minor  
       Fault Management Shell Command Executed: exit
667    Wed Mar 30 13:25:54 2011  Captive Shell  Command Entered  minor  
       Fault Management Shell Command Executed: fmstat
666    Wed Mar 30 13:25:50 2011  Captive Shell  Command Entered  minor  
       Fault Management Shell Command Executed: fmdump
665    Wed Mar 30 13:25:41 2011  Captive Shell  Command Entered  minor  
       Fault Management Shell Command Executed: fmadm
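
The audit entries above follow a regular two-line pattern, so they can be extracted from a saved copy of the event log. The following Python sketch is illustrative only and is not an Oracle tool: the function name shell_commands and the regular expressions are assumptions based solely on the layout of the excerpt above, and other ILOM versions may format the log differently.

```python
import re

# Assumed layout (taken from the excerpt above): an entry header line with an
# ID and a date, followed by an indented line naming the executed command.
ENTRY_RE = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<date>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\s+(?P<rest>.*)$"
)
COMMAND_RE = re.compile(r"Fault Management Shell Command Executed:\s*(?P<cmd>.+)$")


def shell_commands(log_text):
    """Return (event_id, date, command) tuples for captive-shell audit entries."""
    results = []
    current = None  # (id, date) of the most recent entry header seen
    for line in log_text.splitlines():
        header = ENTRY_RE.match(line.strip())
        if header:
            current = (int(header.group("id")), header.group("date"))
            continue
        command = COMMAND_RE.search(line)
        if command and current:
            results.append((current[0], current[1], command.group("cmd").strip()))
            current = None
    return results


SAMPLE = """\
668    Wed Mar 30 13:25:57 2011  Captive Shell  Command Entered  minor
       Fault Management Shell Command Executed: exit
667    Wed Mar 30 13:25:54 2011  Captive Shell  Command Entered  minor
       Fault Management Shell Command Executed: fmstat
"""

if __name__ == "__main__":
    for event_id, date, cmd in shell_commands(SAMPLE):
        print(event_id, date, cmd)
```

The sketch works on text that has already been captured, for example from a snapshot or a saved CLI session. It does not connect to the service processor.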




Fault Management Shell Command Reference

fmadm - Fault Management Administration Tool

The fmadm utility can be used by administrators and service personnel to view and modify system fault management configuration parameters maintained by ILOM.
Use fmadm to:

  • View the list of system components that have been diagnosed as faulty.
  • Perform administrative tasks related to these entities.
faultmgmtsp> fmadm
Usage: fmadm <action>
  where <action> is one of the following:
    faulty           : list all faults
    faulty -s        : list all faults (summary)
    faulty -u <UUID> : list faults for <UUID>
    faulty -f        : list all faulty FRUs
    faulty -r        : list all faulty FRUs (summary)
    acquit <FMRI>    : acquit/clear faults for a FRU or UUID
    repaired <FMRI>  : repair/clear faults for a FRU or UUID
    replaced <FMRI>  : clear faults for a FRU or UUID
    repair <FMRI>    : equivalent to "repaired"
    rotate errlog    : rotate error log
    rotate fltlog    : rotate fault log


Syntax
fmadm [subcommand [arguments]]

Subcommands
The fmadm utility accepts the following subcommands. Some of the subcommands accept or require additional options and operands.

acquit fru

Notify the Fault Manager that the specified fru is not to be considered a suspect in the fault event identified by uuid or, if no uuid is specified, in any fault or faults that have been detected. The fmadm acquit subcommand should be used only at the direction of a documented Oracle repair procedure. Administrators might need to apply additional commands to re-enable a previously faulted resource.

faultmgmtsp> fmadm acquit /SYS/hdd1

acquit uuid

Notify Oracle ILOM that the fault event identified by uuid can be safely ignored. The fmadm acquit subcommand should be used only at the direction of a documented Oracle repair procedure. Administrators might need to apply additional commands to re-enable any previously faulted resources.

faultmgmtsp> fmadm acquit 6d76a0f4-b5f5-623c-af8b-9d7b53812ea1

faulty [-afrs] [-u uuid]

Display status information for resources that Oracle ILOM has detected as faulty.
The following arguments are supported:
  • -a Display all faults. (Default.)
  • -f Display faulty FRUs (Field Replaceable Units).
  • -r Display faulty FRUs and their fault management state (states are described below).
  • -s Display one line fault summary for each fault event.
  • -u uuid Only display faults for a given uuid.
Oracle ILOM associates the following management states with every resource for which telemetry information has been received:
  • ok : The resource is present and in use and has no known problems detected by Oracle ILOM.
  • unknown : The resource is not present or not usable but has no known problems. This might indicate the resource has been disabled or deconfigured by an administrator. Consult the appropriate management tools for more information.
  • faulted : The resource is present but is not usable because one or more problems have been diagnosed by Oracle ILOM. The resource has been disabled to prevent further damage to the system.
  • degraded : The resource is present and usable, but one or more problems have been diagnosed in the resource by Oracle ILOM. If all affected resources are in the same state, this is reflected in the message at the end of the list. Otherwise the state is given after each affected resource.
faultmgmtsp> fmadm faulty -a
------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2011-03-25/13:01:56 f46c20c7-5552-e50f-9c3d-ea65bdaadffc SPT-8000-5X    Major

Fault class : fault.chassis.env.power.loss

FRU         : /SYS/PS0
              (Part Number: 300-2344-01)
              (Serial Number: 19080GM-1041B101EW)

Description : A power supply AC input voltage failure has occurred.

Response    : The service-required LED on the affected power supply and
              chassis will be illuminated.

Impact      : Server will be powered down when there are insufficient
              operational power supplies.

Action      : The administrator should review the ILOM event log for
              additional information pertaining to this diagnosis.  Please
              refer to the Details section of the Knowledge Article for
              additional information.
repaired fru | uuid

Notify Oracle ILOM that a repair procedure has been carried out on the specified fru or uuid. The fmadm repaired subcommand should be used only at the direction of a documented Oracle repair procedure. Administrators might need to apply additional commands to re-enable a previously faulted resource.
An equivalent to this command is fmadm repair fru.

faultmgmtsp> fmadm repaired /SYS/PS0                          

replaced fru | uuid

Notify Oracle ILOM that the specified fru or uuid resource has been replaced. This command should be used in those cases where Oracle ILOM is unable to automatically detect the replacement. The fmadm replaced subcommand should be used only at the direction of a documented Oracle repair procedure. Administrators might need to apply additional commands to re-enable a previously faulted resource.

faultmgmtsp> fmadm replaced /SYS/PS0                

rotate errlog | fltlog

The rotate subcommand causes the specified log file (the error log or fault log file) to be rotated. Up to ten files are maintained in the rotation, with the most recent version ending with .0.
The archived files are collected by snapshot and are available in its fma directory.

faultmgmtsp> fmdump            
TIMESTAMP            UUID                                   MSGID         
2011-03-25/13:01:56  f46c20c7-5552-e50f-9c3d-ea65bdaadffc   SPT-8000-5X   
faultmgmtsp> fmdump -e
TIMESTAMP            EREPORT
2011-03-25/13:01:46  ereport.psu.input.ac-asserted@/sys/ps0
2011-03-30/10:06:36  ereport.fault.chassis.device.fan.fail@/sys/fanbd/f0

faultmgmtsp> fmadm rotate errlog              
faultmgmtsp> fmdump -e
no ereports found
faultmgmtsp> fmadm rotate fltlog
faultmgmtsp> fmdump            
no faults found

In the snapshot :
./fma/@persist@faultdiags@ereports.log
./fma/@persist@faultdiags@faults.log
./fma/@persist@faultdiags@ereports.log.0
./fma/@persist@faultdiags@faults.log.0
./fma/@persist@faultdiags@ereports.log.1
./fma/@persist@faultdiags@faults.log.1

 

Note: when a fault exists, information equivalent to 'fmadm faulty' is also available from the standard ILOM CLI (spsh) under /SP/faultmgmt.

Example:

 

-> start /SP/faultmgmt/shell/
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> fmadm faulty
------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2012-05-24/01:26:27 5fb35dc6-b0f9-e1e9-fff1-9d39a5e44de0 SPX86-8001-NJ  Critical

Fault class : fault.memory.intel.dimm.population-invalid

FRU         : /SYS/MB/P1/D1

Description : One or more memory DIMM's have been improperly populated or
              have mixed DIMM types present.

Response    : BIOS forwards error telemetry to the SP for diagnosis and
              logging in ILOM event log.

Impact      : Entire memory subsystem may become unusable and there may be
              no video output.

Action      : The administrator should review the ILOM event log for
              additional information pertaining to this diagnosis.  Please
              refer to the Details section of the Knowledge Article for
              additional information.

-> ls /SP/faultmgmt/

 /SP/faultmgmt
    Targets:
    shell
    0 (/SYS/MB/P1/D1)

    Properties:

    Commands:
    cd
    show

-> show -l all -o table /SP/faultmgmt/     
Target              | Property               | Value                           
--------------------+------------------------+---------------------------------
/SP/faultmgmt/0     | fru                    | /SYS/MB/P1/D1                   
/SP/faultmgmt/0/    | class                  | fault.memory.intel.dimm.populati
 faults/0           |                        | on-invalid                      
/SP/faultmgmt/0/    | sunw-msg-id            | SPX86-8001-NJ                   
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | uuid                   | 5fb35dc6-b0f9-e1e9-fff1-9d39a5e4
 faults/0           |                        | 4de0                            
/SP/faultmgmt/0/    | timestamp              | 2012-05-24/01:26:27             
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | fru_part_number        | M393B5170EH1-CH9                
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | fru_serial_number      | 851D2AD0                        
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | product_serial_number  | 1019FMN003                      
 faults/0           |                        |                                 
/SP/faultmgmt/0/    | chassis_serial_number  | 0000000-0000000000              
 faults/0           |                        |                     
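
The 'fmadm faulty' listing shown above has a regular shape: one summary line (time, UUID, msgid, severity) followed by "Name : value" fields whose continuation lines are indented. The following Python sketch turns captured output into dictionaries. It is an illustration only: parse_faults and both regular expressions are hypothetical names, and the layout is assumed from the excerpts in this document rather than from any published format specification.

```python
import re

# Summary line: timestamp, UUID, message ID, severity (layout assumed from
# the excerpts in this document).
SUMMARY_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msgid>\S+)\s+(?P<severity>\S+)\s*$"
)
# Field line: starts in column 0, e.g. "FRU         : /SYS/PS0".
FIELD_RE = re.compile(r"^(?P<key>[A-Za-z][A-Za-z ]*?)\s*:\s(?P<value>.*)$")


def parse_faults(text):
    """Parse captured 'fmadm faulty' output into a list of dictionaries."""
    faults = []
    fault = None
    last_key = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or set(line) <= {"-", " "}:
            continue  # skip blank lines and dashed rulers
        summary = SUMMARY_RE.match(line)
        if summary:
            fault = summary.groupdict()
            faults.append(fault)
            last_key = None
            continue
        if fault is None:
            continue  # column headings before the first fault
        field = FIELD_RE.match(line)
        if field:
            last_key = field.group("key").strip()
            fault[last_key] = field.group("value").strip()
        elif last_key and line[0].isspace():
            # Indented line: continuation of the previous field value.
            fault[last_key] += " " + line.strip()
    return faults


SAMPLE = """\
------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2012-05-24/01:26:27 5fb35dc6-b0f9-e1e9-fff1-9d39a5e44de0 SPX86-8001-NJ  Critical

Fault class : fault.memory.intel.dimm.population-invalid

FRU         : /SYS/MB/P1/D1

Description : One or more memory DIMM's have been improperly populated or
              have mixed DIMM types present.
"""

if __name__ == "__main__":
    for entry in parse_faults(SAMPLE):
        print(entry["msgid"], entry["severity"], entry["FRU"])
```

Keeping the field names exactly as printed ("Fault class", "FRU", and so on) avoids guessing at a canonical naming scheme.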



fmdump - Fault Management Log Viewer

The fmdump utility can be used to display the contents of any of the log files associated with Oracle ILOM. Oracle ILOM receives telemetry information relating to problems detected by the system software, diagnoses these problems, and initiates proactive self-healing activities such as disabling faulty components. Oracle ILOM maintains two sets of log files for use by administrators and service personnel:

  • error log : A log which records error telemetry, the symptoms of problems detected by the system.
  • fault log : A log which records fault diagnosis information, the problems possibly related to the symptoms.

By default, fmdump displays the contents of the fault log, which records the result of each diagnosis made by the fault manager or one of its component modules.
Here is an example of a default fmdump display:

faultmgmtsp> fmdump
TIMESTAMP           UUID                                 MSGID
2010-02-25/06:05:38 6d76a0f4-b5f5-623c-af8b-9d7b53812ea1 SPX86-8001-TS



Each problem recorded in the fault log is identified by:

  • The time of its diagnosis.
  • A Universal Unique Identifier (UUID) that can be used to uniquely identify a particular problem across any set of systems.
  • A message identifier that can be used to access a corresponding knowledge article located on Oracle's support web site.
  • If a problem requires action by a human administrator or service technician or affects system behavior, ILOM also issues a human-readable message to its Event Log. This message provides a summary of the problem and a reference to the knowledge article on the Oracle web site.
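
The three identifiers listed above (diagnosis time, UUID, message identifier) map directly onto the three columns of the default fmdump display. As a hedged illustration, the Python sketch below reads captured fmdump output into typed records. FaultRecord and parse_fmdump are hypothetical names, and the timestamp format string is inferred from the example output in this document.

```python
from datetime import datetime
from typing import List, NamedTuple


class FaultRecord(NamedTuple):
    diagnosed: datetime  # time of diagnosis
    uuid: str            # identifies the problem across systems
    msgid: str           # key for the corresponding knowledge article


def parse_fmdump(text: str) -> List[FaultRecord]:
    """Parse the default (fault log) fmdump display into records."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] == "TIMESTAMP":
            continue  # skip the heading and anything that is not a 3-column row
        try:
            # Timestamp layout assumed from the example: YYYY-MM-DD/HH:MM:SS
            when = datetime.strptime(parts[0], "%Y-%m-%d/%H:%M:%S")
        except ValueError:
            continue  # e.g. the "no faults found" message
        records.append(FaultRecord(when, parts[1], parts[2]))
    return records


SAMPLE = """\
TIMESTAMP           UUID                                 MSGID
2010-02-25/06:05:38 6d76a0f4-b5f5-623c-af8b-9d7b53812ea1 SPX86-8001-TS
"""

if __name__ == "__main__":
    for record in parse_fmdump(SAMPLE):
        print(record.diagnosed.isoformat(), record.uuid, record.msgid)
```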

You can use the -v and -V options to expand the display from a single-line summary to increased levels of detail for each event recorded in the log. The -u option can be used to filter the output by selecting only those events that match the specified uuid.

Syntax
fmdump [options [argument]]

Options
The following options are supported:

  • -e Display events from the fault management error log instead of the fault log. This option is shorthand for specifying the pathname of the error log file. The error log file contains Private telemetry information used by Oracle's automated diagnosis software. This information is recorded to facilitate post-mortem analysis of problems and event replay, and should not be parsed or relied upon for the development of scripts and other tools.
  • -u uuid Select fault diagnosis events that exactly match the specified argument (uuid). Each diagnosis is associated with a Universal Unique Identifier (UUID) for identification purposes. The -u option can be combined with other options such as -v to show all of the details associated with a particular diagnosis. If the -e option and -u option are both present, the error events that are cross-referenced by the specified diagnosis are displayed.
  • -v Display verbose event detail. The event display is enlarged to show additional common members of the selected events.
  • -V Display very verbose event detail. The event display is enlarged to show every member of the name-value pair list associated with each event. In addition, for fault logs, the event display includes a list of cross-references to the corresponding errors that were associated with the diagnosis.



Example
This example dumps, in very verbose form, the fault log entry for the designated UUID.

faultmgmtsp> fmdump -V -u edddce14-bf6f-eca7-aff8-dd84e9be27dc
2010-10-05/12:02:18  edddce14-bf6f-eca7-aff8-dd84e9be27dc   SPX86-8000-33 

    fault = fault.chassis.device.fan.fail@/sys/fm1
        certainty = 100.0 %
        FRU       = /sys/fm1
        ASRU      = /sys/fm1
        chassis_serial_number = 0000000-0000000000
        product_serial_number = 1234567890
        detector     = /SYS/FM1/ERR
        [skipped fruid update]




fmstat - Statistical Module Report Generator

Syntax
fmstat

The fmstat utility can be used by administrators and service personnel to report statistics associated with the Oracle ILOM Fault Manager and its associated set of modules. The Fault Manager runs in the background on each Oracle ILOM system. It receives telemetry information relating to problems detected by the system software, diagnoses these problems, and initiates proactive self-healing activities such as disabling faulty components.
You can use fmstat to view statistics for diagnosis engines that are currently participating in fault management.

The fmstat utility reports the following statistics for each of the diagnosis engines:

  • engine    The name of the diagnosis engine. The engines execute rules for the fault diagnosis daemon based on ereport input. Oracle ILOM Fault Management engines include:
    • repair - Rule that indicates a fault should be considered repaired if a specified ereport is logged. For example, the fault "fault.chassis.power.inadequate@/sys" would be considered repaired if "ereport.chassis.boot.power-off-requested@/sys" was logged.
    • hysteresis - Rule to diagnose a fault if ereport A (initiation) is logged and ereport B (cancelation) is not logged within some specified time afterwards. For example, ereport A is "ereport.fan.speed-low-asserted" and ereport B is "ereport.fan.speed-low-deasserted". The time limit between the initiation/cancelation can be no greater than 10 seconds.
    • SERD - Soft Error Rate Discrimination (SERD) is used in tracking multiple occurrences of an ereport. If more than N ereports show up within time period T, the fault is diagnosed. For example, if too many correctable memory error ereports are logged within a specific time frame, a DIMM fault will be diagnosed.
    • simple - Rule to allow one ereport to result in the diagnosis of multiple faults. For example, an ereport for an uncorrectable memory error can be diagnosed to the faults for two DIMMs in a DIMM pair.
  • status    The status of the engine, either uninit, empty, enqueued, busy, or exiting.
  • evts_in    The number of events received by the engine as relevant to a diagnosis.
  • evts_out    The number of events sent by the engine.
  • errors    The number of errors detected by the engine.
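
The SERD and hysteresis rules described above can be sketched in a few lines of Python. This is a conceptual illustration only, not the ILOM implementation: the names SerdEngine and hysteresis_fault, the use of plain numeric timestamps in seconds, and the sliding-window bookkeeping are all assumptions made for the example. Timestamps are assumed to arrive in increasing order.

```python
from collections import deque


class SerdEngine:
    """Diagnose a fault when more than n ereports arrive within window seconds."""

    def __init__(self, n, window):
        self.n = n
        self.window = window
        self.times = deque()  # timestamps of ereports still inside the window

    def record(self, timestamp):
        """Record one ereport; return True if the N-in-T threshold is exceeded."""
        self.times.append(timestamp)
        # Drop ereports that have fallen out of the sliding window.
        while self.times and timestamp - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) > self.n


def hysteresis_fault(assert_time, deassert_times, limit):
    """Return True if no cancelation ereport arrives within limit seconds
    after the initiation ereport logged at assert_time."""
    return not any(0 <= t - assert_time <= limit for t in deassert_times)


if __name__ == "__main__":
    serd = SerdEngine(n=3, window=10)
    print([serd.record(t) for t in (0, 2, 4, 6)])   # the fourth ereport trips the rule
    print(hysteresis_fault(100, [105], limit=10))   # canceled in time, no fault
    print(hysteresis_fault(100, [115], limit=10))   # cancelation too late, fault
```

In this sketch a SERD engine with n=3 diagnoses on the fourth ereport inside the window, matching the "more than N" wording. The hysteresis helper diagnoses only when the deasserted ereport is missing or late.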

Example

faultmgmtsp> fmstat
fdd statistics    2011-03-31/20:13:40

engine               status    evts_in  evts_out  errors
platform             empty          0       0       0
repair               empty          2       0       0
hysteresis           empty          1       1       0
simple               empty          0       0       0
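
When fmstat output has been captured to a file, the per-engine counters shown above can be read into a dictionary for comparison over time. The Python sketch below is an illustration under stated assumptions: parse_fmstat is a hypothetical helper, and the five-column layout (engine, status, evts_in, evts_out, errors) is taken from the example above.

```python
def parse_fmstat(text):
    """Parse captured fmstat output into {engine: {status, evts_in, evts_out, errors}}."""
    engines = {}
    in_table = False
    for line in text.splitlines():
        parts = line.split()
        if parts[:2] == ["engine", "status"]:
            in_table = True  # the column heading marks the start of the table
            continue
        if not in_table or len(parts) != 5:
            continue
        name, status, evts_in, evts_out, errors = parts
        engines[name] = {
            "status": status,
            "evts_in": int(evts_in),
            "evts_out": int(evts_out),
            "errors": int(errors),
        }
    return engines


SAMPLE = """\
fdd statistics    2011-03-31/20:13:40

engine               status    evts_in  evts_out  errors
platform             empty          0       0       0
repair               empty          2       0       0
hysteresis           empty          1       1       0
simple               empty          0       0       0
"""

if __name__ == "__main__":
    for name, stats in parse_fmstat(SAMPLE).items():
        print(name, stats)
```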

 


etcd - Fault Event Injection Utility

The etcd (error telemetry collection daemon) utility can be used by administrators and service personnel to inject events in order to test the ILOM Fault Manager. The utility allows either ereports or sensor events with an offset to be entered.

Note: Starting with ILOM 3.2.1, etcd is available only from Service mode and no longer from the Fault Management Shell.

Syntax
etcd [options [argument]]

Options
The following options are supported:

  • -d Dump the list of possible ereports and exit.
  • -i <ereport>[,detector=devicename] Inject as an ereport the entire "ereport" string. In addition, an optional device name string (sensor or fru) can be given as a detector.
    For example, the following command:
    etcd -i ereport.fan.fail-pred-deasserted@/sys/fm3,detector=/SYS/FM3
    produces an ereport that can be viewed with fmdump:
    2010-11-03/20:29:04  ereport.fan.fail-pred-deasserted@/sys/fm3
  • -s <sensorname>:<offset> Generate an event from the specified sensor name string and offset (represented by either 00 or 01).
    For example:
    etcd -s /SYS/PS0/S0/V_IN_ERR:01
    produces an event that can be viewed from /SP/logs/event:
    1488   Mon Feb  7 14:49:02 2011  Fault     Fault     critical
    Fault detected at time = Mon Feb  7 14:49:02 2011. The suspect component:
    /SYS/PS0 has fault.chassis.env.power.loss with probability=100. Refer to http://www.sun.com/msg/SPX86-8000-55 for details.

    Note - To display a complete list of sensor names for your system, you can use the IPMItool available with your server. Enter the following command from a system with IPMItool and network access to the server's service processor: ipmitool -H SP-IPaddress -U root sdr elist
  • -f Force the event to be injected even if the specified component is absent.



Example
This example injects an ereport into the Fault Management error log.

faultmgmtsp> etcd -i ereport.fault.chassis.device.fan.fail@/sys/fm5,detector=/SYS/FM5

You can confirm that the ereport was added by running fmdump.
faultmgmtsp> fmdump -e
TIMESTAMP            EREPORT
2011-02-04/19:09:51  ereport.fault.chassis.device.fan.fail@/sys/fm5

 

For ILOM 3.2.1 and later:

-> set SESSION mode=service
Short Form Password:**   **** ***
Currently in service mode.

-> etcd
Usage: etcd [-d] [-f] -i ereport
      -d   : dump list of possible ereports and exit
      -i <ereport>[,detector=NAC] [:<payload_name>=<payload_string>]* :  inject event <ereport>
      -i <ereport> [-s <payload_name>=<payload_string>] [-u <payload_name>=<payload_val_int>] :  inject event <ereport>
      -s <sensor_nac>:<offset>    :  inject sensor/offset event
      -f                          :  inject event even if component is absent

-> etcd -i ereport.chassis.device.psu.fail@/SYS/PSU11
2014-11-12/01:47:07: injected ereport.chassis.device.psu.fail@/SYS/PSU11




How to identify an error injected using etcd


Usage of etcd is tracked via an entry in the SP event log.

-> show /X/logs/event/list

  /SP/logs/event/list
    Targets:

    Properties:

    Commands:
        cd
        show

ID     Date/Time                 Class     Type      Severity
-----  ------------------------  --------  --------  --------
670    Wed Mar 30 13:26:53 2011  Captive Shell  Command Entered  minor  
       Fault Management Shell Command Executed: etcd

 

Starting with ILOM 3.2.1, etcd is run from Service mode, so its use is no longer recorded in the event log. An injected ereport can still be recognized, because 'fmdump -eV' does not report any detail about the detector for it. For example (fmdump -eV):

Genuine error

2014-11-11/08:11:02  ereport.chassis.device.psu.fail@/SYS/PSU1
                         detector = /SYS/PSU1/PWR_MGR/POWER_OFF
                         hidden   = true

Injected error

2014-11-12/01:47:07  ereport.chassis.device.psu.fail@/SYS/PSU11
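
The difference between the two entries above suggests a simple screening heuristic for captured 'fmdump -eV' output: flag ereports that carry no detector field. The Python sketch below shows the idea. It is a hedged illustration, not a definitive test: ereports_without_detector is a hypothetical helper, the "name = value" layout is assumed from the excerpt above, and an ereport injected with an explicit detector= argument may not be caught by this check.

```python
import re

# Entry header: timestamp followed by the ereport name (layout assumed from
# the excerpt above).
ENTRY_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+(ereport\.\S+)$")


def ereports_without_detector(text):
    """Return the ereport entries that have no 'detector' field."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = ENTRY_RE.match(line)
        if header:
            current = {"time": header.group(1), "ereport": header.group(2), "fields": {}}
            entries.append(current)
        elif current is not None and "=" in line:
            # Indented "name = value" detail line belonging to the current entry.
            key, _, value = line.partition("=")
            current["fields"][key.strip()] = value.strip()
    return [entry for entry in entries if "detector" not in entry["fields"]]


SAMPLE = """\
2014-11-11/08:11:02  ereport.chassis.device.psu.fail@/SYS/PSU1
                         detector = /SYS/PSU1/PWR_MGR/POWER_OFF
                         hidden   = true
2014-11-12/01:47:07  ereport.chassis.device.psu.fail@/SYS/PSU11
"""

if __name__ == "__main__":
    for entry in ereports_without_detector(SAMPLE):
        print("possibly injected:", entry["time"], entry["ereport"])
```

Treat a match as a prompt for further checking rather than as proof that an event was injected.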

 

 

 


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.