Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2369194.1
Update Date:2018-03-07
Keywords:

Solution Type  Problem Resolution Sure

Solution  2369194.1 :   Oracle ZFS Storage Appliance: Many false persistent link errors which lead to FMA fault - INTEL-8001-ND


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS5-2
  • Oracle ZFS Storage ZS3-2
  • Oracle ZFS Storage ZS4-4
  • Oracle ZFS Storage ZS5-4
  • Sun ZFS Storage 7120
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-BA
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-16600912681>

Applies to:

Sun ZFS Storage 7420 - Version All Versions and later
Sun ZFS Storage 7320 - Version All Versions and later
Sun ZFS Storage 7120 - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

CPU faults are reported erroneously only when the INTEL-8001-ND CPU errors are triggered by Memory Controller North Bound or South Bound FB-DIMM link events.

The INTEL-8001-ND errors are logged in fmadm.out, as shown in the following example:

------------------- ------------------------------------ -------------- --------
TIME                EVENT-ID                             MSG-ID         SEVERITY
------------------- ------------------------------------ -------------- --------
Jan 14 19:15:03     88b430c3-b48e-42e1-d73d-c35d688a5a08 INTEL-8001-ND  Major

Problem Status : open
Diag Engine : eft / 1.16
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown

System Component
Manufacturer : Oracle-Corporation
Name : SUN-FIRE-X4470-M2-SERVER
Part_Number : 31842620+27+1
Serial_Number : 1325FMJ00K
Host_ID : 00000000
Server_Name : bwnas022

----------------------------------------
Suspect 1 of 1 :
Fault class : fault.cpu.intel.quickpath.mem_link_ce
Certainty : 100%

FRU
Name : "hc://:chassis-mfg=Oracle-Corporation:chassis-name=SUN-FIRE-X4470-M2-SERVER:chassis-part=Not-applicable:chassis-serial=1325FMJ00K/motherboard=0/chip=1"
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Revision : unknown
Serial_Number : unknown
Chassis
Manufacturer : Oracle-Corporation
Name : SUN-FIRE-X4470-M2-SERVER
Part_Number : Not-applicable
Serial_Number : 1325FMJ00K
Status : faulty
Resource
Status : faulted but still in service

Description : A quickpath memory link correctable error was detected.

Response : Processor is not off-lined, as the memory controller on the CPU chip is still accessible by other processors
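When reviewing a saved fmadm.out, it can help to pull out just the fault classes so that fault.cpu.intel.quickpath.mem_link_ce entries stand out. The following is a minimal sketch, not an Oracle-supplied tool: the function name list_fault_classes is hypothetical, and it assumes the "Fault class : ..." line layout shown in the example above.

```shell
# Print each distinct fault class found in fmadm output.
# Reads the files given as arguments, or stdin when none are given.
# Assumes lines of the form "Fault class : <class>" as in the example above.
list_fault_classes() {
    awk -F' : ' '$1 ~ /^ *Fault class$/ { print $2 }' "$@" | sort -u
}
```

For example, `list_fault_classes fmadm.out` run against the output above would print fault.cpu.intel.quickpath.mem_link_ce.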

 

Cause

This issue is related to Bug 15807381.

To confirm that this problem is being hit, run fmdump on the errlog with the -u option and the UUID from the fault, and check whether the underlying error reports are of class ereport.cpu.intel.quickpath.mem_lnkpers. For example:

 

From the fltlog:

Jan 14 19:15:03.7607 88b430c3-b48e-42e1-d73d-c35d688a5a08 INTEL-8001-ND

 

/usr/sbin/fmdump -eu 88b430c3-b48e-42e1-d73d-c35d688a5a08 fltlog

TIME CLASS
Jan 14 19:14:32.7134 ereport.cpu.intel.quickpath.mem_lnkpers
Jan 14 19:14:22.6365 ereport.cpu.intel.quickpath.mem_lnkpers
Jan 14 19:14:07.6437 ereport.cpu.intel.quickpath.mem_lnkpers
Jan 14 19:13:42.2797 ereport.cpu.intel.quickpath.mem_lnkpers
Jan 14 19:13:05.4707 ereport.cpu.intel.quickpath.mem_lnkpers
........
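When the list of error reports is long, a per-class count makes it easier to see whether mem_lnkpers is the only class behind the fault. The following is a minimal sketch under stated assumptions: the function name count_ereport_classes is hypothetical, and it assumes the two-column TIME/CLASS layout shown above, with the class as the last field on each line.

```shell
# Summarise ereport classes from `fmdump -e` style output read on stdin.
# Data lines end with the class name; the header line and any
# "........" continuation lines are skipped because they do not
# end in a field starting with "ereport.".
count_ereport_classes() {
    awk '$NF ~ /^ereport\./ { n[$NF]++ }
         END { for (c in n) print n[c], c }'
}
```

It could be used by piping the command above into it, for example `/usr/sbin/fmdump -eu 88b430c3-b48e-42e1-d73d-c35d688a5a08 fltlog | count_ereport_classes`. If the only class printed is ereport.cpu.intel.quickpath.mem_lnkpers, the fault matches the pattern described in this note.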

 

Solution

There is no workaround for this issue.

INTEL-8001-ND faults triggered by these Memory Controller events can either be ignored or be cleared using the "mark repaired" command.

 

This issue is addressed in AK 2013.1.6.15 (OS 8.6.15) and later releases.
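To check whether a given release is at or above the fixed level, the dotted version numbers can be compared field by field. The following is a minimal sketch, not an appliance command: the function name version_at_least is hypothetical, it assumes a POSIX awk that supports -v, and it assumes the installed version has already been written in the same dotted form used in this note (the appliance may display its version in a different format).

```shell
# Succeed (exit status 0) when dotted version $1 is >= dotted version $2.
# Fields are compared numerically from left to right; missing fields
# are treated as 0.
version_at_least() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}
```

For example, `version_at_least 2013.1.6.9 2013.1.6.15` fails because 9 is lower than 15 in the last field, while `version_at_least 2013.1.7.0 2013.1.6.15` succeeds.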

 

References

<NOTE:1920585.1> - Solaris 10 and Solaris 11 Fault Management Architecture (FMA) may Erroneously Report Xeon 7500 Series and E7-x800 Series Intel CPUs as Faulty After Memory Controller Events are Logged
<BUG:24711310> - FMA FOR NHM-EX SHOULD NOT SERD CORRECTABLE PERSISTENT LINK ERROR

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.