Tx000/T5x20/T5x40: How To Clear recurring faults on Solaris FMA and System Controller after hardware replacement

Asset ID:	1-71-1571240.1
Update Date:	2018-05-17
Keywords:

Solution Type: Technical Instruction

Solution 1571240.1 : Tx000/T5x20/T5x40: How To Clear recurring faults on Solaris FMA and System Controller after hardware replacement

Applies to:

Sun SPARC Enterprise T2000 Server - Version All Versions to All Versions [Release All Releases]
Sun Netra T5440 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise T5140 Server - Version All Versions to All Versions [Release All Releases]
Sun Fire T1000 Server - Version All Versions to All Versions [Release All Releases]
Sun SPARC Enterprise T5120 Server - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (32-bit)
Oracle Solaris on SPARC (64-bit)

Goal

This document explains what to do if the same fault (a different timestamp, but the same FMA event ID, also called the UUID) occurs again after a hardware component has been replaced.

If you simply wish to know how to clear faults within Solaris FMA or on the System Controller, please refer to:

How To Clear FMA faults from Solaris[TM] and SC (System Controller) on T1000/T2000 T5120/T5220/T5140/T5240/T5440, T3-1/T3-2/T3-4, T4-1/T4-2/T4-4 (Doc ID 1004229.1)

Solution

Symptoms

As an example, let us assume you have encountered the following fault on the ALOM of your server:

sc> showfaults -v
Last POST run: MON JUL 01 10:30:26 2013
POST status: Passed all devices

ID Time FRU Fault
3 JUL 15 08:48:55 MB/CMP0/CH0/R1/D1 Host detected fault, MSGID: SUN4V-8002-42 UUID: f258876c-0c10-edef-a7a5-d2283828fe09
sc>

The Host FMA shows:

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 15 11:11:20 f258876c-0c10-edef-a7a5-d2283828fe09 SUN4V-8002-42 Critical

Host : somehostname
Platform : SUNW,Sun-Fire-T200 Chassis_id :
Product_sn :

Fault class : fault.memory.dimm-ue-imminent 95%
Affects : mem:///unum=MB/CMP0/CH0:R1/D1/J0901
faulted but still in service
FRU : "MB/CMP0/CH0/R1/D1" (mem:///unum=MB/CMP0/CH0:R1/D1/J0901) 95%
faulty
Serial ID. : 7100719e

Description : A pattern of correctable errors has been observed suggesting the
potential exists that an uncorrectable error may occur.
Refer to http://sun.com/msg/SUN4V-8002-42 for more information.

Response : None at this time.

Impact : None at this time. However, the potential uncorrectable error
warrants proactive service action to avoid any unplanned system
outages.

Action : Schedule a repair procedure to replace the DIMM. Use fmadm faulty
to identify the DIMM to replace.

Let us further assume that you have correctly replaced the problematic DIMM MB/CMP0/CH0/R1/D1.

However, the replacement DIMM is reported as faulty again:

sc> showfaults -v
Last POST run: WED JUL 17 10:29:57 2013
POST status: Passed all devices

ID Time FRU Fault
1 JUL 17 10:32:37 MB/CMP0/CH0/R1/D1 Host detected fault, MSGID: SUN4V-8002-42 UUID: f258876c-0c10-edef-a7a5-d2283828fe09
sc>

At this point, it is likely that Solaris FMA won't report any faults:

# fmadm faulty
#

Solution Steps

Please note that the UUIDs of the event before and the event after the DIMM replacement are the same: f258876c-0c10-edef-a7a5-d2283828fe09

This means that the original fault has not been cleared correctly, either by the host FMA or by the System Controller.
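The UUID comparison can be scripted if the `showfaults -v` output from before and after the replacement has been saved to text files (for example, copied from a console log). The following is a minimal sketch under that assumption. The function names `extract_uuids` and `same_fault` and the file names in the usage note are hypothetical and are not part of ALOM or Solaris.

```shell
#!/bin/sh
# extract_uuids: read saved "showfaults -v" output on stdin and print
# each fault UUID found, one per line, sorted and de-duplicated.
extract_uuids() {
    sed -n 's/.*UUID: \([0-9a-f-]*\).*/\1/p' | sort -u
}

# same_fault BEFORE_FILE AFTER_FILE
# Succeeds (exit status 0) if both files contain at least one UUID and
# the sets of UUIDs are identical, which indicates a stale fault record
# rather than a newly diagnosed fault.
same_fault() {
    before=$(extract_uuids < "$1")
    after=$(extract_uuids < "$2")
    [ -n "$before" ] && [ "$before" = "$after" ]
}
```

For example, with the two listings saved as `before.txt` and `after.txt`, `same_fault before.txt after.txt && echo "same UUID, stale fault record"` prints the message only when the UUIDs match.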

To solve the problem:

1. if the operating system is down, boot it. Then execute the following as the root user in Solaris:

# svcadm disable -s svc:/system/fmd:default
# cd /var/fm/fmd
# find /var/fm/fmd -type f -exec ls {} \;
# find /var/fm/fmd -type f -exec rm {} \;
# svcadm enable svc:/system/fmd:default

Important:

The procedure shown above clears the entire fault history. That history is necessary to analyze problems that might occur in the future.
Please follow these steps only if you encounter the problem described in this document.
There is no need to clear FMA caches this way during normal work with Solaris FMA.
Instead, you should use the normal "fmadm repair <uuid>" command.
See: PSH Procedural Article for Solaris FMA-Based Diagnosis (Doc ID 1173733.1)
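Because the procedure in step 1 discards the fault history, it may be worth keeping a copy of the files before they are removed. The following sketch is not part of the original procedure. It assumes that a tar archive of /var/fm/fmd is an acceptable record and that it is run after fmd has been disabled and before the `rm` command. The function name `archive_fmd_state` and the destination path in the usage note are hypothetical.

```shell
#!/bin/sh
# archive_fmd_state SRC_DIR ARCHIVE
# Write a tar archive of everything under SRC_DIR to ARCHIVE.
# Returns non-zero if SRC_DIR does not exist or tar fails.
archive_fmd_state() {
    src=$1
    archive=$2
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    # Archive from inside the directory so member paths are relative.
    ( cd "$src" && tar cf - . ) > "$archive"
}
```

A possible invocation is `archive_fmd_state /var/fm/fmd /var/tmp/fmd-history.tar`. The archive only preserves the files for later inspection; it does not change what the clearing procedure does.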

2. shut down the OS, but do not boot it up yet. Connect to the ALOM and perform the following steps:

2a. if POST disabled the DIMM when the original fault occurred, execute:

sc> enablecomponent MB/CMP0/CH0/R1/D1

2b. clear the ASR database

sc> clearasrdb

2c. clear the ereports

sc> setsc sc_servicemode true
sc> clearereports -y
sc> setsc sc_servicemode false

2d. reset the SC

sc> resetsc

2e. set the keyswitch virtually to diag mode

sc> setkeyswitch diag

2f. power cycle the server

sc> poweroff

sc> poweron

To power on the server and automatically start the host console, execute instead:
sc> poweron -c

Note: If you are in the ILOM shell, you may need to execute:

-> set /SYS/<COMPONENT_PATH> clear_fault_action=true

and, if there are any disabled components,

-> set /SYS/<COMPONENT_PATH> component_state=enabled

See "Section C.2 Using the ILOM Command Line Interface to Clear the Fault" in the "PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1)"

3. allow the OS to boot up and check whether the fault has recurred:

On the Host:

# fmadm faulty

On the System Controller:

sc> showfaults -v

4. if the fault has not been successfully cleared, contact the owner of the
Service Request if the SR is still open. Otherwise, request that the SR be reopened.
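The check in step 3 can be sketched as a small filter that looks for the original UUID in command output. This is an illustration only: the function name `fault_recurred` is hypothetical, and on the System Controller side it assumes the `showfaults -v` output has been saved to a file, since the ALOM shell cannot run scripts.

```shell
#!/bin/sh
# fault_recurred UUID
# Read command output on stdin. Succeeds (exit status 0) if UUID appears
# anywhere in it, fails otherwise. A UUID contains only hexadecimal
# digits and hyphens, so it is safe to use as a grep pattern.
fault_recurred() {
    grep "$1" > /dev/null
}
```

On the host, this could be used as `fmadm faulty | fault_recurred f258876c-0c10-edef-a7a5-d2283828fe09 && echo "fault still present"`. No message means that the UUID was not found in the output.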

References

<NOTE:1004229.1> - How To Clear FMA faults from Solaris[TM] and SC (System Controller) on T1000/T2000 T5120/T5220/T5140/T5240/T5440, T3-1/T3-2/T3-4, T4-1/T4-2/T4-4
<NOTE:1173733.1> - PSH Procedural Article for Solaris FMA-Based Diagnosis

Attachments

This solution has no attachment