Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-2216293.1
Update Date:2018-05-01
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  2216293.1 :   Commands To Clear FMA faults on the T5-x, T7-x, S7-x Servers  


Related Items
  • SPARC T5-1
  • SPARC T5-1B
  • SPARC T7-4
  • MiniCluster S7-2 Hardware
  • SPARC S7-2L
  • SPARC S7-2
  • SPARC T5-2
  • SPARC T7-2
  • SPARC T5-4
  • SPARC T7-1
  • SPARC T5-8
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4


Quick Reference for CLI commands to run to fully clear ILOM/SP, faultmgmt shell, and FMA faults on the T5-x, T7-x, and S7-x Servers

In this Document
Purpose
Scope
Details
References


Applies to:

SPARC S7-2
SPARC T5-1B
SPARC S7-2L
SPARC T5-1
SPARC T5-4
Information in this document applies to any platform.

Purpose

Quick Reference for CLI commands to run to fully clear FMA faults on T5-x, T7-x, and S7-x Servers.

Scope

This document is meant for anyone responsible for clearing faults on T5-x, T7-x, and S7-x Servers. It includes a limited description of the fault handling process. For more complete information, please see:

Managing Faults, Defects, and Alerts in Oracle® Solaris 11.3

and

Oracle® ILOM User's Guide for System Monitoring and Diagnostics Firmware Release 3.2.x

Details

What's New Starting With T5-x Systems


Starting with T5-x systems and continuing with T7-x and S7-x systems, fault reporting and clearing have changed significantly.

CPU and memory diagnosis is now done mostly by ILOM FMA (FDD) instead of Solaris FMA.

An example MSG ID for such a diagnosis would be:

SPSUN4V-8000-EJ Memory Uncorrectable Error

Note the SPSUN4V prefix on the T5, versus the SUN4V prefix used on the T4.
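
As a rough illustration of that prefix difference, the following Python sketch classifies a MSG ID by its prefix. It is not part of any Oracle tool: the function name diagnosing_engine and the returned labels are hypothetical, and it maps only the two prefixes mentioned above, reading SUN4V as a Solaris FMA diagnosis. Any other prefix is reported as unknown rather than guessed.

```python
def diagnosing_engine(msg_id: str) -> str:
    """Guess which engine produced a diagnosis from the MSG ID prefix.

    Assumption: only the two prefixes named in this document are mapped.
    """
    # The prefix is everything before the first hyphen, e.g. "SPSUN4V".
    prefix = msg_id.split("-", 1)[0].upper()
    if prefix == "SPSUN4V":
        return "ILOM FMA"      # diagnosed on the service processor (T5-x and later)
    if prefix == "SUN4V":
        return "Solaris FMA"   # host-side diagnosis, as seen on the T4
    return "unknown"           # prefix not covered by this sketch


if __name__ == "__main__":
    print(diagnosing_engine("SPSUN4V-8000-EJ"))
```

Because the fault database is shared on these platforms, the prefix only indicates where the fault was diagnosed, not where it has to be cleared.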

There is now a shared fault database.

A fault can be cleared from either Solaris or the ILOM FMA shell, regardless of which diagnosis engine (Solaris or ILOM FMA FDD) diagnosed it, and it is then cleared everywhere (in both Solaris and the ILOM FMA shell).
This behavior differs from the T4, where Solaris-diagnosed faults cleared via Solaris also had to be cleared in the ILOM FMA shell, and ILOM-diagnosed faults such as power or environmental faults could not be cleared via Solaris.

Power and environmental faults are now visible not only through ILOM “show faulty” and the ILOM FMA shell commands, but also through Solaris “fmadm faulty”. These faults can also be cleared from Solaris.

IO errors are still diagnosed by Solaris FMA. Some of the more common IO errors are PCIe related. Unlike on the T4, an IO error is propagated down to the ILOM FMA shell, where it can also be cleared; clearing it there clears it from Solaris as well.

 

Types of Fault Repair

When a component in your system has faulted, the Fault Manager can repair the component implicitly or you can repair the component explicitly.


Implicit repair


An implicit repair can occur when the faulty component is replaced if that component has serial number information that the Fault Manager daemon (fmd) can track. On many systems, serial number information is included in the FMRIs so that fmd can determine when components have been replaced. When fmd determines that a component has been replaced and the replacement has been successfully brought into service, then the Fault Manager no longer displays that component in fmadm list output. The component is maintained in the Fault Manager internal resource cache until the fault event is 30 days old.

When fmd faults a piece of hardware, that hardware might be taken out of service so that it does not adversely affect the system. Hardware removal from service can occur whether Solaris or ILOM diagnosed the problem. Hardware removal from service is usually reported in the Response section of the diagnosis message.

 

Explicit repair

Sometimes no FRU serial number information is available even though the FMRI includes a chassis identifier. In this case, fmd cannot detect an FRU replacement, and you must perform an explicit repair by using the fmadm command with the replaced, repaired, or acquit subcommand as shown in the following sections.

Other corner cases may exist in which a fault needs to be explicitly repaired.

1) Clearing Faults from Solaris


These fmadm commands take the following operands:

The UUID, also shown as the EVENT-ID in Fault Manager output, identifies the fault event. The UUID can only be used with the fmadm acquit command. You can specify that the entire event can be safely ignored, or you can specify that a particular resource is not a suspect in this event.


The FMRI and the Label identify the suspect faulted resource. Typically, the label is easier to use than the FMRI.

 

a) fmadm replaced command

Use the fmadm replaced command to indicate that the suspect FRU has been replaced. If multiple faults are currently reported against one FRU, the FRU shows as replaced in all cases.

example: fmadm replaced /SYS/MB

When an FRU is replaced, the serial number of the FRU changes. If fmd automatically detects that the serial number of an FRU has changed, the Fault Manager behaves in the same way as if you had entered the fmadm replaced command. If fmd cannot detect whether the serial number of the FRU has changed, then you must enter the fmadm replaced command if you have replaced the FRU. If fmd detects that the serial number of the FRU has not changed, then the fmadm replaced command exits with an error.
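
The three serial number cases described above can be summarized in a small Python sketch. It is purely illustrative: the SerialCheck enum and the replaced_outcome function are hypothetical names, and the returned strings paraphrase this document rather than reproduce actual fmadm output.

```python
from enum import Enum


class SerialCheck(Enum):
    """What fmd was able to determine about the FRU serial number."""
    CHANGED = "changed"            # fmd detected a new serial number
    UNCHANGED = "unchanged"        # fmd detected the same serial number
    UNDETECTABLE = "undetectable"  # fmd cannot tell whether it changed


def replaced_outcome(check: SerialCheck) -> str:
    """Describe what happens, or what the administrator must do, after a swap."""
    if check is SerialCheck.CHANGED:
        # Same effect as if 'fmadm replaced' had been entered by hand.
        return "implicit repair; no command needed"
    if check is SerialCheck.UNDETECTABLE:
        return "run 'fmadm replaced' after replacing the FRU"
    # Serial number detected as unchanged.
    return "'fmadm replaced' exits with an error"


if __name__ == "__main__":
    for case in SerialCheck:
        print(case.value, "->", replaced_outcome(case))
```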

b) fmadm repaired command

Use the fmadm repaired command when you have performed a physical repair other than replacement of the FRU to resolve the problem. Examples of such repairs include reseating a card or straightening a bent pin. If multiple faults are currently reported against one FRU, the FRU shows as repaired in all cases.

example: fmadm repaired /SYS/MB


c) fmadm acquit command

Use the acquit subcommand if you determine that the indicated resource is not the cause of the fault. Usually the Fault Manager automatically acquits some suspects in a multi-element suspect list. Acquittal can occur implicitly as the Fault Manager refines the diagnosis, for example if additional error events occur. Sometimes Support Services gives you instructions to perform a manual acquittal.


Replacement takes precedence over repair, and both replacement and repair take precedence over acquittal. Thus, you can acquit a component and then subsequently repair the component, but you cannot acquit a component that has already been repaired.
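
One way to picture this precedence rule is as a simple ranking, sketched below in Python. The PRECEDENCE table and the may_follow function are hypothetical names used only for illustration. The sketch assumes that repeating an action of the same rank is allowed, which this document does not state, and it does not model how fmd enforces the rule internally.

```python
# Higher number = higher precedence, per the rule stated above.
PRECEDENCE = {"acquit": 0, "repaired": 1, "replaced": 2}


def may_follow(previous: str, requested: str) -> bool:
    """Return True if `requested` is not outranked by the `previous` action.

    Assumption: an action of equal rank is treated as allowed.
    """
    return PRECEDENCE[requested] >= PRECEDENCE[previous]


if __name__ == "__main__":
    # Acquit first, then repair: allowed.
    print(may_follow("acquit", "repaired"))
    # Repair first, then acquit: not allowed.
    print(may_follow("repaired", "acquit"))
```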

If you do not specify any FMRI or label with the UUID, then the entire event is marked as safe to ignore. A case is considered repaired when the fault event UUID is acquitted.

example: fmadm acquit ad64afc4-0f28-67d2-ddf7-c0cf90c28d42

Acquit by FMRI or label with no UUID only if you determine that the resource is not a factor in any current cases in which that resource is a suspect. If multiple faults are currently reported against one FRU, the FRU shows as acquitted in all cases.

example: fmadm acquit /SYS/MB

To acquit a resource in one case and keep that resource as a suspect in other cases, specify both the fault event UUID and the resource FMRI or both the UUID and the resource label, as shown in the following example:

example: fmadm acquit /SYS/MB ad64afc4-0f28-67d2-ddf7-c0cf90c28d42
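
The three acquit forms shown above (UUID only, resource only, resource plus UUID) can be assembled with a small helper, sketched here in Python for scripting purposes. The function name acquit_command is hypothetical; it only builds the argument list in the order used in the examples above and does not run anything.

```python
from typing import List, Optional


def acquit_command(resource: Optional[str] = None,
                   uuid: Optional[str] = None) -> List[str]:
    """Build an 'fmadm acquit' argument list from a resource and/or a UUID.

    `resource` is an FMRI or label such as /SYS/MB; `uuid` is the EVENT-ID.
    """
    if resource is None and uuid is None:
        raise ValueError("need a resource (FMRI or label), a UUID, or both")
    args = ["fmadm", "acquit"]
    if resource is not None:
        args.append(resource)  # resource comes first, as in the example above
    if uuid is not None:
        args.append(uuid)
    return args


if __name__ == "__main__":
    print(" ".join(acquit_command("/SYS/MB",
                                  "ad64afc4-0f28-67d2-ddf7-c0cf90c28d42")))
```

Returning a list rather than a single string keeps the arguments separate, so the result can be passed to a process launcher without shell quoting concerns.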

 


2) Clearing Faults from the ILOM Fault Management Shell

i) While logged in to the Oracle ILOM/SP, start the Fault Management Shell:

          -> start /SP/faultmgmt/shell

ii) View the faulted components:

          faultmgmtsp> fmadm faulty

iii) For each fault listed, type one of the following fmadm commands to manually clear the fault:


a)  fmadm replaced [ fru|cru ]

A suspect component has been replaced or removed.

example: fmadm replaced /SYS/MB


b)  fmadm repaired [ fru|cru ]

A suspect component has been physically repaired to resolve the reported problem. For example, a component has been reseated or a bent pin has been fixed.

example: fmadm repaired /SYS/MB


c)  fmadm acquit [ fru|cru ] [ uuid ]

The suspect component or the resource identified by the uuid is not the cause of the problem. Where [ fru|cru ] [ uuid ] appears, type the system path to the suspect chassis FRU or CRU, or type the associated universal unique identifier ( uuid ) for the resource reported in the problem.

example: fmadm acquit ad64afc4-0f28-67d2-ddf7-c0cf90c28d42

Acquit by fru/cru or label with no UUID only if you determine that the resource is not a factor in any current cases in which that resource is a suspect. If multiple faults are currently reported against one FRU, the FRU shows as acquitted in all cases.

example: fmadm acquit /SYS/MB


               

NOTE: Do not use 'fmadm faulty -a' to determine whether there are any currently active faults. When you specify the -a option, all resource information cached by the Fault Manager is listed, including faults that have already been corrected or for which no recovery action is needed (see the 'fmadm' man page). The listing also includes information for resources that may no longer be present in the system.

    

 

 

 

References

<NOTE:1643464.1> - [SPARC T3/T4/T5 and T7] OBP reports "One or more resources have been retired, please run 'show faulty' on the SP" on console
<NOTE:1004229.1> - How To Clear FMA faults from Solaris[TM] and SC (System Controller) on T1000/T2000 T5120/T5220/T5140/T5240/T5440,T6320,T6340, T3-1/T3-2/T3-4, T4-1/T4-2/T4-4

  Copyright © 2018 Oracle, Inc.  All rights reserved.