Information in this document applies to any platform.
Date of Resolved Release: 26-Aug-2014
__________________________________________
Description
On Solaris 10 and Solaris 11, the Fault Management Architecture (FMA) may erroneously report 'Xeon 7500 series' and 'Xeon E7-x800 series' Intel CPUs as faulty. The fault is reported with SUNW-MSG-ID: INTEL-8001-ND in response to 'FB-DIMM Memory Controller' events being logged, and may lead to unnecessary hardware replacement.
Note: INTEL-8001-ND errors that are triggered by events other than 'FB-DIMM Memory Controller' events should not be ignored.
Occurrence
This issue can occur in the following releases:
x86 Platform
- Solaris 10 with patch 142901-09 through 142901-15 OR with patch 142910-17 and without patch 150126-01
- Solaris 11.0 through 11.1.3.5.1
Note 1: Solaris 8, Solaris 9, and Solaris on the SPARC platforms are not affected by this issue.
Note 2: This issue only impacts 'Xeon 7500 series' and 'Xeon E7-x800 series' Intel CPUs. To determine if a system has either of these CPUs, the following command can be used:
$ psrinfo -vp | egrep '(E7-\ 8|X75)'
Systems for which this command produces any output are vulnerable to this issue. The output is similar to the following, depending on which CPU in the affected series is present:
Intel(r) Xeon(r) CPU E7- 8870 @ 2.40GHz
Or
Intel(r) Xeon(r) CPU X7550 @ 2.00GHz
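When many hosts need to be checked, the command above can be wrapped in a small shell function. The following is a minimal sketch only: the function name has_affected_cpu is hypothetical, and the pattern is the one from the command above, written with a plain space. It reads 'psrinfo -vp' output on standard input.

```shell
# Hypothetical helper (not part of Solaris): succeeds if the psrinfo -vp
# output read from stdin contains a CPU model string matched by the
# pattern used in this document.
has_affected_cpu() {
    egrep '(E7- 8|X75)' >/dev/null
}

# Example use on a live system (shown as a comment, not executed here):
#   psrinfo -vp | has_affected_cpu && echo "affected CPU series present"
```

The function returns the exit status of egrep, so it can be used directly in an 'if' statement or with '&&'.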
Note 3: To determine the SRU level on Solaris 11, the following command can be used:
$ pkg info entire | grep Summary
Summary: entire incorporation including Support Repository Update (Oracle Solaris 11.1.4.6.0)
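The version string in the Summary line can be compared against 11.1.4.6.0, the first Solaris 11 release listed in this document as containing the fix. The sketch below is illustrative only: the function names sru_version and version_ge are hypothetical, and it assumes the Summary line has the form shown above. On Solaris, nawk or /usr/xpg4/bin/awk may be needed in place of awk for the -v option.

```shell
# Hypothetical helper: print the version found inside
# "(Oracle Solaris x.y.z...)" in a Summary line read from stdin.
sru_version() {
    sed -n 's/.*(Oracle Solaris \([0-9.][0-9.]*\)).*/\1/p'
}

# Hypothetical helper: succeed if dotted version $1 is greater than or
# equal to dotted version $2, comparing each field numerically.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Example use on a live system (shown as a comment, not executed here):
#   v=$(pkg info entire | grep Summary | sru_version)
#   version_ge "$v" 11.1.4.6.0 && echo "fix present" || echo "fix not present"
```

Fields are compared as numbers rather than strings, so that a field value of 10 sorts after 4.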
Symptoms
CPU faults are erroneously reported only when the INTEL-8001-ND CPU errors are triggered by Memory Controller North Bound or South Bound FB-DIMM link events.
The FMA INTEL-8001-ND errors go to the console and are logged in the '/var/adm/messages' file as shown in the following example:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 15 14:26:14 47a421d2-b0bd-c641-f793-123456e12db9 INTEL-8001-ND Major
Host : abcde
Platform : SUN-FIRE-X4470-M2-SERVER Chassis_id : 1234FMF011
Product_sn : 1234FMF011
Fault class : fault.cpu.intel.quickpath.mem_link_ce
Problem in : "MB/CPU 0" (hc://:product-id=SUN-FIRE-X4470-M2-SERVER:product-sn=1234FMF011:server-id=abcde:chassis-id=1234FMF011/chassis=0/motherboard=0/chip=0/memory-controller=0) faulted but still in service
FRU : "MB/CPU 0" (hc://:product-id=SUN-FIRE-X4470-M2-SERVER:product-sn=1234FMF011:server-id=abcde:chassis-id=1234FMF011:serial=To-Be-Filled-By-O.E.M.:part=To-Be-Filled-By-O.E.M.:revision=Intel(R)-Xeon(R)-CPU-E7--4870--@-2.40GHz/chassis=0/motherboard=0/chip=0) faulty
The Memory Controller North Bound or South Bound FB-DIMM link event 'ereports' are recorded in the FMA error log and can be viewed with the command 'fmdump -e':
For South Bound FB-DIMM link errors:
Jun 15 03:12:13.2625 ereport.cpu.intel.quickpath.mem_sbfbdlinkerr
For North Bound FB-DIMM link errors:
Jun 15 03:12:13.2625 ereport.cpu.intel.quickpath.mem_nbfbdlnkerr
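To confirm that FB-DIMM link ereports are present in the error log, the 'fmdump -e' output can be filtered for the two classes shown above. The following is a minimal sketch: the function name count_fbdimm_link_ereports is hypothetical, and the class names are copied from the examples in this document.

```shell
# Hypothetical helper: count South Bound and North Bound FB-DIMM link
# ereports in 'fmdump -e' output read from stdin. Always prints a number.
count_fbdimm_link_ereports() {
    awk '/ereport\.cpu\.intel\.quickpath\.mem_(sbfbdlinkerr|nbfbdlnkerr)/ { n++ }
         END { print n + 0 }'
}

# Example use on a live system (shown as a comment, not executed here):
#   fmdump -e | count_fbdimm_link_ereports
```

A count of zero suggests that an INTEL-8001-ND fault was triggered by some other event and should not be ignored.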
Workaround
There is no workaround for this issue. INTEL-8001-ND faults that have been triggered by these Memory Controller events can be ignored, or cleared with the "fmadm repair" command and the UUID (Universally Unique Identifier) shown in the EVENT-ID column of the fault message, for example:
Jun 15 14:26:14 47a421d2-b0bd-c641-f793-123456e12db9 INTEL-8001-ND Major
Use fmadm(1M) repair with this UUID to clear the fault:
# fmadm repair 47a421d2-b0bd-c641-f793-123456e12db9
fmadm: recorded repair to 47a421d2-b0bd-c641-f793-123456e12db9
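Where several such faults have been logged, the repair commands can be generated from the fault summary lines and reviewed before being run. The sketch below is illustrative only: the function name print_repair_commands is hypothetical, and it assumes the UUID is the field immediately before the MSG-ID, as in the example above. It prints the commands and does not execute them, because each fault must first be confirmed as triggered by FB-DIMM Memory Controller events.

```shell
# Hypothetical helper: read fault summary lines on stdin and print (not run)
# an 'fmadm repair' command for each INTEL-8001-ND entry.
# Assumes the EVENT-ID (UUID) is the field just before the MSG-ID.
print_repair_commands() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "INTEL-8001-ND" &&
                $(i - 1) ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/)
                print "fmadm repair " $(i - 1)
    }'
}

# Example use (shown as a comment, not executed here):
#   grep INTEL-8001-ND /var/adm/messages | print_repair_commands
```

Printing rather than executing keeps the final decision with the administrator, which matters because INTEL-8001-ND faults triggered by other events should not be cleared.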
This issue is addressed in the following releases:
x86 Platform:
- Solaris 10 with patch 150126-01 or later
- Solaris 11.1.4.6.0 or later
Patches
<SUNPATCH:150126-01>
History
26-Aug-2014: Document released, status Resolved
28-Aug-2014: Additional edits for clarification
Internal Section: Comments:
The issue can occur only on systems with Nehalem-EX or Westmere-EX processors, due to an erratum. See the following hardware bug for details:
15799863 - SUNBT7178832 Solaris FMA showing all CPUs as failed with INTEL-8001-ND Major. IL
Note: This issue occurs only on releases that include the FMA support for the affected CPUs, which was added to Solaris under bug 15547215.
Prior to the fix for 15807381, FMA processed these FB-DIMM ereports through a SERD engine,
faulting the CPU once more than 500 such ereports were logged within one week.
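The threshold behaviour described above can be illustrated with a short sliding-window sketch. This is not the fmd implementation: the function name serd_fires and its input format (one event time in seconds per line, in ascending order) are assumptions made for illustration. With N=500 and T=604800 (one week in seconds) it models the rule stated above.

```shell
# Illustrative only, not fmd's implementation. Reads event times (seconds,
# ascending, one per line) on stdin and prints the time of the event at
# which more than N events fall within a T-second window. Prints nothing
# if that never happens.
# Usage: serd_fires N T
serd_fires() {
    awk -v n="$1" -v t="$2" '
        {
            ts[NR] = $1 + 0
            # Drop events that are more than T seconds older than this one.
            while (ts[NR] - ts[head + 1] > t) head++
            # NR - head is the number of events still inside the window.
            if (NR - head > n) { print ts[NR]; exit }
        }'
}

# Example use (shown as a comment, not executed here):
#   serd_fires 500 604800 < event_times.txt
```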
The FMA support for these two families of CPUs was delivered to Solaris 10 as follows:
Intel Xeon 7500 series (family 6 model 46, Nehalem EX): 142910-17
Intel Xeon E7-x800 series (family 6 model 47, Westmere EX):
family 6 model 46 is Nehalem EX (7500 series) FMA support in S10U9 KU 142901-09 (and greater)
family 6 model 47 is Westmere EX (E7-x800 series) FMA support in S10U10 KU 147441-01 or sustaining KU 144489-05 (and greater).
Questions regarding this document should be addressed to
sunalertpublication_us_grp@oracle.com, with a copy to the
responsible engineer listed below.
Internal Contributor/Submitter: mary.beale@oracle.com
Internal Eng Responsible Engineer: mary.beale@oracle.com
Internal Services Knowledge Engineer: jeff.folla@oracle.com
Internal Eng Business Unit Group: Systems RPE
Internal Associated SR IDs: 3-5763704161, 3-5871625671, 3-5997888511,
3-6623272976, 3-6951366821, 3-8185098301, 3-9309627381, 3-9310156281,
3-9364166621
Internal Resolution Patches: 150126-01, 11.1.4.6.0