Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-77-1920585.1
Update Date: 2016-07-29
Keywords:

Solution Type  Sun Alert Sure

Solution  1920585.1 :   Solaris 10 and Solaris 11 Fault Management Architecture (FMA) may Erroneously Report Xeon 7500 Series and E7-x800 Series Intel CPUs as Faulty After Memory Controller Events are Logged  


Related Items
  • Sun ZFS Storage 7420
  • Sun Software - Generic
  • Solaris Operating System
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun Alert
  • _Old GCS Categories>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • _Old GCS Categories>Sun Microsystems>Sun Alert>Release Phase>Resolved




In this Document
Description
Occurrence
Symptoms
Workaround
Patches
History
References


Applies to:

Sun Software - Generic
Solaris Operating System
Sun ZFS Storage 7420
x86
Information in this document applies to any platform.
__________________________________________



Date of Resolved Release: 26-Aug-2014
__________________________________________

Description

Solaris 10 and Solaris 11 Fault Management Architecture (FMA) may erroneously report 'Xeon 7500 series' and 'Xeon E7-x800 series' Intel CPUs as faulty. The fault is reported as SUNW-MSG-ID: INTEL-8001-ND in response to logged 'FB-DIMM Memory Controller' events, and may lead to unnecessary hardware replacement.

Note: INTEL-8001-ND errors that are triggered by events other than 'FB-DIMM Memory Controller' events should not be ignored.

Occurrence

This issue can occur in the following releases:

x86 Platform

  • Solaris 10 with patch 142901-09 through 142901-15, or with patch 142910-17, and in either case without patch 150126-01
  • Solaris 11.0 through 11.1.3.5.1

Note 1: Solaris 8, Solaris 9, and Solaris on SPARC platforms are not affected by this issue.

Note 2: This issue only impacts 'Xeon 7500 series' and 'Xeon E7-x800 series' Intel CPUs. To determine if a system has either of these CPUs, the following command can be used:

    $ psrinfo -vp | egrep '(E7-\ 8|X75)' 

Systems that produce any output from this command are affected by this issue. The output looks similar to the following, depending on which CPU in the affected series is present:

    Intel(r) Xeon(r) CPU E7- 8870  @ 2.40GHz
Or
    Intel(r) Xeon(r) CPU           X7550  @ 2.00GHz
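
The check above can also be wrapped in a small filter. The following is a minimal sketch, not the article's command: the function name affected_cpu_lines is hypothetical, and the wider pattern is an assumption that the affected series use the model numbers E7-2800, E7-4800 and E7-8800, plus E75xx, L75xx and X75xx. It is widened because the E7- 4870 CPU that appears in the fault example under Symptoms would not match a pattern limited to 'E7- 8'.

```shell
#!/bin/sh
# Sketch only: print the brand-string lines that look like an affected CPU.
# Reads `psrinfo -vp` style text on stdin.
# ASSUMPTION: the model-number pattern below is broader than the one in the
# article and has not been validated against every CPU in the two series.
affected_cpu_lines() {
    grep -E '(E7- *[248]8[0-9][0-9]|[ELX]75[0-9][0-9])'
}
```

A possible invocation would be `psrinfo -vp | affected_cpu_lines`; any output suggests that the system has one of the affected CPUs.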

Note 3: To determine the SRU level, the following command can be used:

    $ pkg info entire | grep Summary
    Summary: entire incorporation including Support Repository Update (Oracle Solaris 11.1.4.6.0)
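
Whether a reported SRU level is at or above the fixed level (11.1.4.6.0) can be checked with a numeric, field-by-field comparison. The following is a rough sketch under assumptions: the function names sru_from_summary and version_ge are hypothetical, and the Summary line is assumed to have the format shown above.

```shell
#!/bin/sh
# Sketch only. On Solaris, nawk or /usr/xpg4/bin/awk may be needed in place
# of awk for the -v option (assumption; adjust to the local environment).

# sru_from_summary: extract the dotted version from a "Summary:" line in the
# format shown above. Reads stdin, prints the version (or nothing).
sru_from_summary() {
    sed -n 's/.*(Oracle Solaris \([0-9.]*\)).*/\1/p'
}

# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Fields are compared numerically from left to right; missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}
```

For example, `version_ge "$(pkg info entire | sru_from_summary)" 11.1.4.6.0` would succeed on a system at or above the fixed SRU level.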

Symptoms

The erroneous CPU faults occur only when the INTEL-8001-ND errors are triggered by Memory Controller North Bound or South Bound FB-DIMM link events.

The FMA INTEL-8001-ND errors are written to the console and logged in the '/var/adm/messages' file, as shown in the following example:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jun 15 14:26:14 47a421d2-b0bd-c641-f793-123456e12db9  INTEL-8001-ND  Major

Host        : abcde
Platform    : SUN-FIRE-X4470-M2-SERVER Chassis_id  : 1234FMF011
Product_sn  : 1234FMF011

Fault class : fault.cpu.intel.quickpath.mem_link_ce
Problem in  : "MB/CPU 0" (hc://:product-id=SUN-FIRE-X4470-M2-SERVER:product-sn=1234FMF011:server-id=abcde:chassis-id=1234FMF011/chassis=0/motherboard=0/chip=0/memory-controller=0) faulted but still in service
FRU         : "MB/CPU 0" (hc://:product-id=SUN-FIRE-X4470-M2-SERVER:product-sn=1234FMF011:server-id=abcde:chassis-id=1234FMF011:serial=To-Be-Filled-By-O.E.M.:part=To-Be-Filled-By-O.E.M.:revision=Intel(R)-Xeon(R)-CPU-E7--4870--@-2.40GHz/chassis=0/motherboard=0/chip=0) faulty

The Memory Controller North Bound or South Bound FB-DIMM link event 'ereports' are recorded in the FMA error log and can be displayed with the command 'fmdump -e':

      For the South Bound FB-DIMM link errors:

        Jun 15 03:12:13.2625 ereport.cpu.intel.quickpath.mem_sbfbdlinkerr

      For the North Bound FB-DIMM link errors:

        Jun 15 03:12:13.2625 ereport.cpu.intel.quickpath.mem_nbfbdlnkerr
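
To see how many of these link ereports are present in the error log, the 'fmdump -e' output can be filtered on the two class names shown above. This is a minimal sketch: the function name count_fbd_link_ereports is hypothetical, and the pattern covers only the two classes listed in this document.

```shell
#!/bin/sh
# Sketch only: count FB-DIMM link ereports in `fmdump -e` style text on stdin.
# Matches the two class names shown in this document:
#   ereport.cpu.intel.quickpath.mem_sbfbdlinkerr (South Bound)
#   ereport.cpu.intel.quickpath.mem_nbfbdlnkerr  (North Bound)
count_fbd_link_ereports() {
    awk '/ereport\.cpu\.intel\.quickpath\.mem_(sbfbdlinkerr|nbfbdlnkerr)/ { n++ }
         END { print n + 0 }'
}
```

A possible invocation would be `fmdump -e | count_fbd_link_ereports`. A non-zero count alone does not prove that a given INTEL-8001-ND fault was caused by these events.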

Workaround

There is no workaround for this issue. INTEL-8001-ND faults that were triggered by these Memory Controller events can be ignored, or cleared with the "fmadm repair" command and the UUID (Universally Unique Identifier) shown in the EVENT-ID column of the fault report, for example:

      Jun 15 14:26:14   47a421d2-b0bd-c641-f793-123456e12db9        INTEL-8001-ND     Major

Pass the UUID to fmadm(1M) repair to clear the fault:

      # fmadm repair  47a421d2-b0bd-c641-f793-123456e12db9  
      fmadm: recorded repair to 47a421d2-b0bd-c641-f793-123456e12db9
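
When several such faults are outstanding, the UUIDs can be collected from the summary lines before repair. The following is a cautious sketch: the function names are hypothetical, the input is assumed to contain summary lines in the TIME / EVENT-ID / MSG-ID / SEVERITY layout shown above, and the script only prints the repair commands rather than running them. Each fault should first be confirmed as triggered by FB-DIMM Memory Controller events, because other INTEL-8001-ND faults should not be ignored.

```shell
#!/bin/sh
# Sketch only: list UUIDs of INTEL-8001-ND faults from summary lines on stdin.
# A field is treated as a UUID if it has five hyphen-separated hex groups.
intel_8001_nd_uuids() {
    awk '/INTEL-8001-ND/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/)
                print $i
    }'
}

# Print (do not execute) one repair command per UUID read from stdin,
# so that the list can be reviewed before anything is cleared.
print_repair_commands() {
    while read -r uuid; do
        printf 'fmadm repair %s\n' "$uuid"
    done
}
```

A possible invocation would be `fmadm faulty | intel_8001_nd_uuids | print_repair_commands`, followed by a manual review of the printed commands.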

This issue is addressed in the following releases:

x86 Platform:

  • Solaris 10 with patch 150126-01 or later
  • Solaris 11.1.4.6.0 or later

Patches

 150126-01

History

26-Aug-2014: Document released, status Resolved
28-Aug-2014: Additional edits for clarification

Internal Section: Comments:

The issue can only occur on systems with Nehalem-EX or Westmere-EX processors, due to an erratum. See the following hardware bug for details:
15799863 - SUNBT7178832 Solaris FMA showing all CPUs as failed with INTEL-8001-ND Major. IL

Note: This issue only occurs once FMA support for the affected CPUs has been added to Solaris under bug 15547215.

Before the fix for 15807381, FMA processed these FB-DIMM ereports through a SERD engine,
which faulted the CPU once more than 500 such ereports were logged within one week.
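
The threshold behaviour described above can be illustrated as a sliding-window counter. This is only a sketch under assumptions: the function name serd_fires, its arguments, and the input format (one event time in seconds per line, in ascending order) are hypothetical, and the actual fmd SERD engine may treat window boundaries differently. The values 500 and one week (604800 seconds) come from the text above.

```shell
#!/bin/sh
# Sketch only. serd_fires N T: read ascending event times (seconds) on stdin
# and succeed if more than N events fall within any window of T seconds.
serd_fires() {
    awk -v n="$1" -v t="$2" '
        {
            ts[NR] = $1 + 0
            # Advance past events that are older than T seconds
            # relative to the current event.
            while (ts[NR] - ts[head + 1] > t) head++
            # NR - head is the number of events still inside the window.
            if (NR - head > n) { fired = 1; exit }
        }
        END { exit (fired ? 0 : 1) }'
}
```

With N=500 and T=604800, 501 ereports inside one week would make the function succeed, while the same number spread over a longer period might not.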

The FMA support for these two families of CPUs was delivered to Solaris 10 as follows:

Intel Xeon 7500 series (family 6 model 46, Nehalem-EX): 142910-17
Intel Xeon E7-x800 series (family 6 model 47, Westmere-EX):
Family 6 model 46, Nehalem-EX (7500 series): FMA support in S10U9 KU 142901-09 (and greater)
Family 6 model 47, Westmere-EX (E7-x800 series): FMA support in S10U10 KU 147441-01 or sustaining KU 144489-05 (and greater)

Address questions regarding this document to
sunalertpublication_us_grp@oracle.com and copy the
responsible engineer listed below.

Internal Contributor/Submitter: mary.beale@oracle.com
Internal Eng Responsible Engineer: mary.beale@oracle.com
Internal Services Knowledge Engineer: jeff.folla@oracle.com
Internal Eng Business Unit Group: Systems RPE
Internal Associated SR IDs: 3-5763704161, 3-5871625671, 3-5997888511,
3-6623272976, 3-6951366821, 3-8185098301, 3-9309627381, 3-9310156281,
3-9364166621
Internal Resolution Patches:  150126-01, 11.1.4.6.0

References




  Copyright © 2018 Oracle, Inc.  All rights reserved.