Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Predictive Self-Healing Sure Solution
2216293.1: Commands To Clear FMA Faults on the T5-x, T7-x, S7-x Servers
Quick reference for the CLI commands to run to fully clear ILOM/SP, faultmgmt shell, and FMA faults on the T5-x, T7-x, and S7-x Servers.
Applies to:
SPARC S7-2
SPARC T5-1B
SPARC S7-2L
SPARC T5-1
SPARC T5-4
Information in this document applies to any platform.

Purpose
Quick reference for the CLI commands to run to fully clear FMA faults on T5-x, T7-x, and S7-x Servers.

Scope
This document is meant for anyone responsible for clearing faults on T5-x, T7-x, and S7-x Servers. It includes a limited description of the fault handling process. For more complete information, see: Managing Faults, Defects, and Alerts in Oracle® Solaris 11.3 and Oracle® ILOM User's Guide for System Monitoring and Diagnostics Firmware Release 3.2.x.

Details
What's New Starting With T5-x Systems
CPU and memory diagnosis is now done mostly by ILOM FMA (FDD) instead of Solaris FMA. An example MSG ID for such a diagnosis is: SPSUN4V-8000-EJ Memory Uncorrectable Error. Note the SPSUN4V prefix for T5 versus the SUN4V prefix used for T4.

There is now a shared fault database. A fault can be cleared from either Solaris or the ILOM FMA shell, no matter which diagnosis engine (Solaris or ILOM FMA FDD) diagnosed it, and it is then cleared universally (in both Solaris and the ILOM FMA shell).

Power and environmental faults are now seen not only by ILOM “show faulty” and the ILOM FMA shell commands, but also by Solaris “fmadm faulty”. These faults can also be cleared from Solaris.

IO errors are still diagnosed by Solaris FMA. Some of the more common IO errors are PCIe related. Unlike on the T4, an IO error is propagated down to the ILOM FMA shell; it can also be cleared from there, which clears it from Solaris as well.
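The MSG ID prefix noted above can serve as a quick triage hint when reading fault output. The following is a minimal sketch, not an Oracle-provided tool: the function name fma_diag_source is hypothetical, only the two prefixes named in this document are recognized, and mapping SUN4V to Solaris-side diagnosis is inferred from the T4 comparison above.

```shell
# Classify an FMA message ID by its prefix.
# Assumption: only the two prefixes named in this document are handled;
# anything else is reported as unknown rather than guessed.
fma_diag_source() {
    case "$1" in
        SPSUN4V-*) echo "ILOM FMA (FDD)" ;;   # SP-side diagnosis, T5-x and later
        SUN4V-*)   echo "Solaris FMA" ;;      # prefix used on T4 (inferred mapping)
        *)         echo "unknown" ;;
    esac
}

fma_diag_source SPSUN4V-8000-EJ
```

Because the fault database is shared, the result only indicates which engine diagnosed the fault; it does not restrict where the fault can be cleared.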
Types of Fault Repair
When a component in your system has faulted, the Fault Manager can repair the component implicitly or you can repair the component explicitly.
Explicit repair
Sometimes no FRU serial number information is available even though the FMRI includes a chassis identifier. In this case, fmd cannot detect an FRU replacement, and you must perform an explicit repair by using the fmadm command with the replaced, repaired, or acquit subcommand as shown in the following sections. Other corner-case situations may exist where a fault needs to be explicitly repaired.

1) Clearing Faults from Solaris
The UUID, also shown as the EVENT-ID in Fault Manager output, identifies the fault event. The UUID can only be used with the fmadm acquit command. You can specify that the entire event can be safely ignored, or you can specify that a particular resource is not a suspect in this event.
a) fmadm replaced Command
Use the fmadm replaced command to indicate that the suspect FRU has been replaced. If multiple faults are currently reported against one FRU, the FRU shows as replaced in all cases.
example: fmadm replaced /SYS/MB
When an FRU is replaced, the serial number of the FRU changes. If fmd automatically detects that the serial number of an FRU has changed, the Fault Manager behaves in the same way as if you had entered the fmadm replaced command. If fmd cannot detect whether the serial number of the FRU has changed, then you must enter the fmadm replaced command if you have replaced the FRU. If fmd detects that the serial number of the FRU has not changed, then the fmadm replaced command exits with an error.

b) fmadm repaired Command
Use the fmadm repaired command when you have performed a physical repair other than replacement of the FRU to resolve the problem. Examples of such repairs include reseating a card or straightening a bent pin. If multiple faults are currently reported against one FRU, the FRU shows as repaired in all cases.
example: fmadm repaired /SYS/MB
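The choice between the two subcommands above depends only on the physical action taken. The following is a minimal dry-run sketch that prints the command rather than running it, so it can be reviewed before use on a live system. The function name fma_repair_cmd is hypothetical; the fmadm replaced and fmadm repaired forms are the ones shown in this document.

```shell
# Print (do not run) the fmadm command matching the physical action taken.
# Usage: fma_repair_cmd <replaced|repaired> <fru-label>
fma_repair_cmd() {
    action=$1
    fru=$2
    case "$action" in
        replaced|repaired) ;;
        *)
            echo "usage: fma_repair_cmd replaced|repaired <fru-label>" >&2
            return 2
            ;;
    esac
    if [ -z "$fru" ]; then
        echo "missing FRU label" >&2
        return 2
    fi
    echo "fmadm $action $fru"
}

fma_repair_cmd replaced /SYS/MB   # the FRU was swapped for a new one
fma_repair_cmd repaired /SYS/MB   # the FRU was reseated or fixed in place
```

Printing the command first leaves the decision to run it with the operator, which matters because fmadm replaced exits with an error when fmd sees that the FRU serial number has not changed.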
c) fmadm acquit Command
Use the acquit subcommand if you determine that the indicated resource is not the cause of the fault. Usually the Fault Manager automatically acquits some suspects in a multi-element suspect list. Acquittal can occur implicitly as the Fault Manager refines the diagnosis, for example if additional error events occur. Sometimes Support Services gives you instructions to perform a manual acquittal.
If you do not specify any FMRI or label with the UUID, then the entire event is identified as able to be ignored. A case is considered repaired when the fault event UUID is acquitted.
example: fmadm acquit ad64afc4-0f28-67d2-ddf7-c0cf90c28d42
Acquit by FMRI or label with no UUID only if you determine that the resource is not a factor in any current cases in which that resource is a suspect. If multiple faults are currently reported against one FRU, the FRU shows as acquitted in all cases.
example: fmadm acquit /SYS/MB
To acquit a resource in one case and keep that resource as a suspect in other cases, specify both the fault event UUID and the resource FMRI, or both the UUID and the resource label.
example: fmadm acquit /SYS/MB ad64afc4-0f28-67d2-ddf7-c0cf90c28d42
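The three acquit forms above differ only in which arguments are supplied. The following is a minimal dry-run sketch that builds the command line for each form and rejects a malformed UUID before anything is run. The function names is_uuid and fma_acquit_cmd are hypothetical, and the UUID pattern is assumed from the 8-4-4-4-12 hexadecimal layout of the example in this document.

```shell
# Return success if the argument looks like an 8-4-4-4-12 hex UUID.
is_uuid() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

# Print (do not run) the fmadm acquit command.
# Usage: fma_acquit_cmd <fmri-or-label-or-empty> <uuid-or-empty>
fma_acquit_cmd() {
    resource=$1
    uuid=$2
    if [ -n "$uuid" ] && ! is_uuid "$uuid"; then
        echo "not a UUID: $uuid" >&2
        return 2
    fi
    if [ -n "$resource" ] && [ -n "$uuid" ]; then
        echo "fmadm acquit $resource $uuid"   # acquit the resource in this case only
    elif [ -n "$uuid" ]; then
        echo "fmadm acquit $uuid"             # entire event can be ignored
    elif [ -n "$resource" ]; then
        echo "fmadm acquit $resource"         # acquit the resource in all cases
    else
        echo "need a resource, a UUID, or both" >&2
        return 2
    fi
}

fma_acquit_cmd /SYS/MB ad64afc4-0f28-67d2-ddf7-c0cf90c28d42
```

Passing an empty string for the first argument selects the UUID-only form, and omitting the second argument selects the resource-only form.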
A suspect component has been replaced or removed.
example: fmadm replaced /SYS/MB

A suspect component has been physically repaired to resolve the reported problem. For example, a component has been reseated or a bent pin has been fixed.
example: fmadm repaired /SYS/MB

A suspect component or UUID resource is not the cause of the problem. Where [ fru|cru ] [ uuid ] appears, type the system path to the suspect chassis FRU or CRU.
example: fmadm acquit ad64afc4-0f28-67d2-ddf7-c0cf90c28d42
Acquit by FRU/CRU or label with no UUID only if you determine that the resource is not a factor in any current cases in which that resource is a suspect. If multiple faults are currently reported against one FRU, the FRU shows as acquitted in all cases.
example: fmadm acquit /SYS/MB
NOTE: Do not use 'fmadm faulty -a' to determine whether there are any currently active faults. When you specify the -a option, all resource information cached by the Fault Manager is listed, including faults that have already been corrected or where no recovery action is needed (see the 'fmadm' man page). The listings also include information for resources that may no longer be present in the system.
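The check described in the note can be sketched as a small filter over the output of plain 'fmadm faulty' (without -a). This is a minimal sketch under one assumption: that 'fmadm faulty' prints nothing when no faults are active. The function name has_active_faults is hypothetical, and it reads the listing on standard input so the logic can be tried without access to fmadm.

```shell
# Succeed if the fault listing on stdin contains any non-blank line.
has_active_faults() {
    grep -q '[^[:space:]]'
}

# On a live system (assumption: plain "fmadm faulty" is silent when
# no faults are active):
#   fmadm faulty | has_active_faults && echo "active faults present"

# Demonstration with an empty listing:
printf '' | has_active_faults || echo "no active faults"
```

If the assumption about empty output does not hold on a given release, inspect the output of 'fmadm faulty' directly instead of relying on this filter.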