Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1516048.1
Update Date: 2018-03-07

Solution Type: Problem Resolution Sure Solution

1516048.1: Sun Fire T1000/T2000, Sun Netra T2000 DIMMs with CEs (correctable errors) are being unnecessarily flagged by POST as faulty


Related Items
  • Sun Fire T1000 Server
  • Sun SPARC Enterprise T1000 Server
  • Sun Netra T2000 Server
  • Sun SPARC Enterprise T2000 Server
  • Sun Fire T2000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: Tx000




In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

Sun Fire T1000 Server - Version All Versions and later
Sun SPARC Enterprise T1000 Server - Version All Versions and later
Sun Fire T2000 Server - Version All Versions and later
Sun SPARC Enterprise T2000 Server - Version All Versions and later
Sun Netra T2000 Server - Version All Versions and later
Information in this document applies to any platform.



Symptoms

DIMMs with CEs (correctable errors) are being unnecessarily flagged by POST as faulty.

Below is an example error message for a POST fault on a single DIMM.  If this error message occurred when POST
was run with diag_level set to max, the DIMM was probably flagged by POST unnecessarily.


Example #1:  POST fault for a single DIMM
sc> showfaults -v
ID Time           FRU               Fault
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled

Example #2:  POST fault for multiple DIMMs (this example shows two DIMMs on the same channel/rank,
which in most cases indicates an uncorrectable error, or UE)
sc> showfaults -v
ID Time           FRU               Fault
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled
2 OCT 13 12:47:27 MB/CMP0/CH0/R0/D1 MB/CMP0/CH0/R0/D1 deemed faulty and disabled

Example #3:  FMA fault and a POST fault on the same DIMM (the DIMM should be replaced because it
has exceeded the FMA page retire threshold)
sc> showfaults -v
ID Time           FRU               Fault
0 SEP 09 11:09:26 MB/CMP0/CH0/R0/D0 Host detected fault,
MSGID:SUN4V-8000-DX UUID: 7ee0e46b-ea64-6565-e684-e996963f7b86
1 OCT 13 12:47:27 MB/CMP0/CH0/R0/D0 MB/CMP0/CH0/R0/D0 deemed faulty and disabled
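
The three examples above can be told apart mechanically. The following Python sketch parses captured "showfaults -v" text and labels each entry as a POST or FMA fault. It is only an illustration: the line pattern and the two marker strings ("deemed faulty and disabled", "Host detected fault") are inferred from the examples in this document, and the names Fault and parse_showfaults are hypothetical helpers, not part of ALOM.

```python
import re
from dataclasses import dataclass
from typing import List

# Pattern inferred from the examples above: ID, "MON DD HH:MM:SS", FRU, fault text.
_FAULT_LINE = re.compile(
    r"^(\d+)\s+([A-Z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$"
)


@dataclass
class Fault:
    fault_id: int
    time: str
    fru: str
    text: str

    @property
    def source(self) -> str:
        # Assumption: these marker strings are taken from the sample output only.
        if "Host detected fault" in self.text:
            return "FMA"
        if "deemed faulty and disabled" in self.text:
            return "POST"
        return "unknown"


def parse_showfaults(output: str) -> List[Fault]:
    """Parse captured 'showfaults -v' text into Fault records."""
    faults: List[Fault] = []
    for raw in output.splitlines():
        line = raw.strip()
        # Skip blank lines, the echoed prompt and the column header.
        if not line or line.startswith("sc>") or line.startswith("ID "):
            continue
        match = _FAULT_LINE.match(line)
        if match:
            faults.append(
                Fault(int(match.group(1)), match.group(2), match.group(3), match.group(4))
            )
        elif faults:
            # Wrapped line (e.g. "MSGID:... UUID: ...") belongs to the previous entry.
            faults[-1].text += " " + line
    return faults
```

Fed the text of example #3, this yields two records for MB/CMP0/CH0/R0/D0, one labelled FMA and one labelled POST, which is the combination that calls for replacing the DIMM.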

 

Changes

Contributing Factors

On Sun Fire T1000 and T2000 systems with firmware prior to 6.3.0, the default setting of diag_level is "max".

Cause

On the T1000/T2000, when POST encounters a single CE, the associated DIMM is declared faulty and half of
the system's memory is de-configured and unavailable to Solaris. Because PSH (Predictive Self-Healing) is the
primary means for detecting errors and diagnosing faults on the Tx000 platforms, this policy is too aggressive.

Solution

Workaround

The following procedures are recommended as a workaround to this issue:

1. Normal Operation

For normal operation, set diag_level to "min". POST min mode provides a sanity check to ensure the system
will boot. Once Solaris is up, PSH provides run-time diagnosis of faults. Normal operation applies to any
boot of the system except hardware upgrades or repairs (as described in section 2 below).

a) Use the ALOM command "setsc diag_level min" to set POST to min mode. Also, make sure that diag_mode is
in the normal state with the ALOM command "setsc diag_mode normal". Note that with FW prior to release 6.3.0,
the default setting of diag_level is "max". Therefore, the ALOM "setdefaults" command will return POST to max
mode.

b) With FW release 6.3.0 or later, the default setting of diag_level is "min".

Note: When upgrading an existing system with FW prior to version 6.3.0, the existing POST settings will not
be changed by the firmware upgrade. Therefore, if the settings have not yet been changed on the system, follow
the procedure in 1.a above to change to the recommended POST settings.

c) For systems shipped with FW release 6.3.0 or later, the default setting of POST is min so no action is
required, as long as there have been no changes to the default POST settings.

Note: Any faulty FRU reported by POST in min mode should be replaced. Once the FRU is replaced, follow the
procedure in section 2 below "Hardware Upgrades or Repairs".
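
The firmware-dependent default and the two "setsc" commands from section 1 can be summarized in a short Python sketch. The function names are hypothetical, the code assumes firmware versions are plain dotted numbers such as 6.3.0, and it only restates the rules above; it does not communicate with ALOM.

```python
from typing import List, Tuple


def parse_fw_version(version: str) -> Tuple[int, ...]:
    """Turn '6.3.0' into (6, 3, 0); shorter versions are zero-padded to three parts."""
    parts = [int(p) for p in version.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def default_diag_level(fw_version: str) -> str:
    """Default diag_level for a firmware release: 'max' before 6.3.0, 'min' from 6.3.0 on."""
    return "min" if parse_fw_version(fw_version) >= (6, 3, 0) else "max"


def normal_operation_commands(diag_level: str, diag_mode: str) -> List[str]:
    """ALOM commands (as quoted in section 1) needed to reach the recommended settings."""
    commands = []
    if diag_level != "min":
        commands.append("setsc diag_level min")
    if diag_mode != "normal":
        commands.append("setsc diag_mode normal")
    return commands
```

Because a firmware upgrade keeps the existing POST settings, the check should be made against the values currently configured on the system, not against the default of the newly installed firmware.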

2. Hardware Upgrades or Repairs

It is recommended that POST max mode (diag_level=max) be used to validate hardware upgrades or repairs. After
completing the upgrade or repair and prior to booting the system, set POST to max mode using the ALOM command
"setsc diag_level max".

a) If the validation completes successfully, return POST to min mode. Use the ALOM command "setsc diag_level min".

b) If the validation does not complete successfully and POST faults a SINGLE DIMM (see example #1) that was not
part of the hardware upgrade or repair, it is likely that POST has encountered a CE on the DIMM that will be
handled by PSH (this can be validated by examining POST output). In this case, re-enable the DIMM and re-run POST
in min mode as described below:

- Re-enable the DIMM with the ALOM command "enablecomponent <name of DIMM>".
- Set POST to min mode with the ALOM command "setsc diag_level min", then re-run POST.
- If POST continues to fault the DIMM, it should be replaced.

In any other case (for example, faults on multiple DIMMs as in example #2, or a faulty DIMM that was part of the
hardware upgrade/repair), the faulty FRU(s) identified by POST should be replaced.
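
The outcome of a max-mode validation run in section 2 can be sketched as a small Python function that returns a verdict together with the ALOM commands quoted above. This is an illustrative sketch, not an Oracle tool: the function names are hypothetical, and the DIMM test assumes that DIMM FRU names end in /D<n>, as in the MB/CMP0/CH0/R0/D0 examples in this document.

```python
import re
from typing import Iterable, List, Set, Tuple


def is_dimm(fru: str) -> bool:
    # Assumption: DIMM FRU names end in /D<n>, as in MB/CMP0/CH0/R0/D0.
    return re.search(r"/D\d+$", fru) is not None


def after_max_validation(
    post_faults: Iterable[str], changed_frus: Set[str]
) -> Tuple[str, List[str]]:
    """Return (verdict, ALOM commands) following a POST run with diag_level=max."""
    faults = sorted(set(post_faults))
    if not faults:
        # 2.a: validation passed, go back to min mode.
        return "passed", ["setsc diag_level min"]
    if len(faults) == 1 and is_dimm(faults[0]) and faults[0] not in changed_frus:
        # 2.b: probably a CE that PSH would handle; re-enable and re-run POST in min mode.
        return "retry-in-min", [f"enablecomponent {faults[0]}", "setsc diag_level min"]
    # Any other case: replace the FRU(s) identified by POST.
    return "replace", []
```

If the retry in min mode faults the same DIMM again, the DIMM should be replaced, as stated in the last step of 2.b.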

Note: The above procedure is not recommended following a software or firmware upgrade or any other reboot of the
system that is not intended to validate a hardware change or debug a hardware problem.  These boots/reboots should
have POST diag_level set to min as described above under "Normal Operation".

3. POST faults reported with diag_level at max

For systems booted with diag_level at max, where it was not intended to validate a hardware upgrade or repair as
described in section 2 above, any fault reported by POST should be examined to ensure that it would not have been
transparently handled by Solaris PSH.

Use the following procedure to examine the fault:

a) If the FRU(s) reported by POST is not a DIMM or is more than a single DIMM, then replace the FRU(s).

b) If the FRU reported by POST is a single DIMM and the same DIMM had also been reported faulty by FMA/PSH, then
replace the DIMM (see example #3).

c) If the FRU reported by POST is a single DIMM and the same DIMM had not been reported by FMA/PSH, then follow the
steps in 2.b above to determine whether to replace the DIMM.
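
Steps a) through c) form a simple decision procedure, sketched here in Python. The sketch assumes the caller already has the lists of FRUs faulted by POST and by FMA/PSH (for example from "showfaults -v" output); the names Action and triage are hypothetical, and the DIMM test relies on the assumption that DIMM FRU names end in /D<n>, as in the examples in this document.

```python
import re
from enum import Enum
from typing import Iterable


class Action(Enum):
    NOTHING_TO_EXAMINE = "no POST fault reported"
    REPLACE = "replace the FRU(s)"
    RETRY_IN_MIN = "follow 2.b: re-enable the DIMM and re-run POST in min mode"


def is_dimm(fru: str) -> bool:
    # Assumption: DIMM FRU names end in /D<n>, as in MB/CMP0/CH0/R0/D0.
    return re.search(r"/D\d+$", fru) is not None


def triage(post_faults: Iterable[str], fma_faults: Iterable[str]) -> Action:
    """Classify POST faults seen with diag_level=max outside a hardware validation."""
    post = sorted(set(post_faults))
    if not post:
        return Action.NOTHING_TO_EXAMINE
    # Step a): not a DIMM, or more than a single DIMM.
    if len(post) > 1 or not is_dimm(post[0]):
        return Action.REPLACE
    # Step b): the same DIMM was also reported faulty by FMA/PSH.
    if post[0] in set(fma_faults):
        return Action.REPLACE
    # Step c): single DIMM, no FMA/PSH fault on it.
    return Action.RETRY_IN_MIN
```

For example #3 the function returns Action.REPLACE, while for example #1 with no matching FMA fault it returns Action.RETRY_IN_MIN.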

After completing this procedure, it is recommended that diag_level be set to min and diag_mode be set to normal,
as described in section 1 for "Normal Operation".

 


Impact

Sun Fire T1000 and T2000 DIMMs with CEs (correctable errors) are being unnecessarily flagged by POST
as faulty. Because of this POST policy, CEs encountered during the extended POST memory tests cause a high
fallout of DIMMs in the field and excessive DIMM returns.

The procedures in this document minimize the opportunity for POST to report memory faults that are fully and
transparently handled by the PSH (Predictive Self-Healing) features of Solaris, also known as FMA (Fault Management Architecture).
 

Previously Published As
102671
Internal Contributor/submitter
Arnold.Epstein@Sun.COM, Dencho.Kojucharov@Sun.COM, Robert.Balfour@Sun.COM

Internal Eng Business Unit Group
SSG WGS (Workgroup Systems)

Internal Eng Responsible Engineer
Steve.Trullo@Sun.COM

Internal Services Knowledge Engineer
Joe.Davis@Sun.COM

Internal Kasp FAB Legacy ID
102671

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date: 2006-10-20
Avoidance: Service Procedure
Responsible Manager: Steve.Doherty@Sun.COM
Original Admin Info: WF - Initial draft done on Oct/16
WF - published on Oct/20
WF - updated Solution section per Dencho and republished on Oct/23

Product_uuid
41b7bc41-2581-11da-99bc-080020a9ed93|Sun Fire T2000 Server
79ad78b9-961d-11d9-9adf-080020a9ed93|Sun Fire T1000 Server

References

<NOTE:1001026.1> - ARCHIVED: Refer to Problem docID 1516048.1: Sun Fire T1000 and T2000 DIMMs with CEs (correctable errors) are being unnecessarily flagged by POST as faulty.
<BUG:15291071> - SUNBT6334560 ONTARIO ASR POLICY FOR MEMORY CE'S IS TOO AGGRESSIVE

  Copyright © 2018 Oracle, Inc.  All rights reserved.