Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1999137.1
Update Date:2018-03-20
Keywords:

Solution Type  Problem Resolution Sure

Solution  1999137.1 :   Oracle ZFS Storage Appliance: CPU failing continuously - even after replacement.  


Related Items
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7210 Unified Storage System
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7120
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-BA
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-10230898762>

Applies to:

Sun Storage 7210 Unified Storage System - Version All Versions and later
Sun Storage 7110 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
Oracle ZFS Storage ZS3-BA - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

The following alert is received:

Fault event description: The diagnosis engine encountered telemetry from the listed devices for which it was unable to perform a diagnosis.
The diagnosis engine encountered telemetry from the listed devices for which it was unable to perform a diagnosis - all hypotheses were disproved.
Event-ID: 2861e502-1ea0-4175-f67e-f2b373191c9b
Auto-Response: None
Impact: None
Rec-Action: None

 

In the 'hardware status' view, CPU 1 appears as faulted:

chassis-000  zfs01    faulted   Oracle    Sun ZFS Storage 7320                           

cpu-000      CPU 0    ok        Intel     Intel(r) Xeon(r) CPU    E5620  @ 2.40GHz    unknown
cpu-001      CPU 1    faulted   Intel     Intel(r) Xeon(r) CPU    E5620  @ 2.40GHz    unknown
disk-000     HDD 0    ok        WDC       WD500BLHXSUN500G                            WD-WX61EC1LUJ16    10000
disk-001     HDD 1    ok        WDC       WD500BLHXSUN500G                            WD-WXD1E61EDE33    10000

 

If the CPU has already been replaced, the fault reappears some days after the replacement, because AKD is reporting the failure against the wrong component.

To determine the actual error, look in the 'alert.ak' log to see which error the FMA telemetry reports:

nvlist version: 0
  version = 0x0
  class = fault.sunos.eft.unexpected_telemetry
  certainty = 0x32
  resource = (embedded nvlist)
  nvlist version: 0
  version = 0x1
  scheme = hc
  hc-root =
  authority = (embedded nvlist)
  nvlist version: 0
  chassis-mfg = Oracle-Corporation
  chassis-name = SUN-FIRE-X4170-M2-SERVER
  chassis-part = unknown
  chassis-serial = 1239FMM03T
  (end authority)
  hc-list-sz = 0x3
  hc-list = (array of embedded nvlists)
  (start hc-list[0])
  nvlist version: 0
  hc-name = motherboard
  hc-id = 0
  (end hc-list[0])
  (start hc-list[1])
  nvlist version: 0
  hc-name = chip
  hc-id = 1
  (end hc-list[1])
  (start hc-list[2])
  nvlist version: 0
  hc-name = memory-controller
  hc-id = 0
  (end hc-list[2])
  (end resource)

  reason = all hypotheses were disproved
  retire = 0
  response = 0
  fru = (embedded nvlist)
  nvlist version: 0
  version = 0x1
  scheme = hc
  hc-root =
  authority = (embedded nvlist)
  nvlist version: 0
  chassis-mfg = Oracle-Corporation
  chassis-name = SUN-FIRE-X4170-M2-SERVER
  chassis-part = unknown
  chassis-serial = 1239FMM03T
  (end authority)

  hc-list = (array of embedded nvlists)
  (start hc-list[0])
  nvlist version: 0
  hc-name = motherboard
  hc-id = 0
  (end hc-list[0])
  (start hc-list[1])
  nvlist version: 0
  hc-name = chip
  hc-id = 1
  (end hc-list[1])

  hc-list-sz = 0x2
  (end fru)

  location = zfs01/CPU 1
  (end fault-list[1])

  fault-status = 0x1 0x1
  severity = Major
  source = appliance/kit/akd:default
  uuid = 2861e502-1ea0-4175-f67e-f2b373191c9b

----------------------------------------

Suspect 1 of 2 :
  Fault class : defect.sunos.eft.unexpected_telemetry
  Certainty : 50%

  Resource
  Name : "hc://:chassis-mfg=Oracle-Corporation:chassis-name=SUN-FIRE-X4170-M2-SERVER:chassis-part=unknown:chassis-serial=1239FMM03T/motherboard=0/chip=1/memory-controller=0"
  Manufacturer : unknown
  Name : unknown
  Part_Number : unknown
  Revision : unknown
  Serial_Number : unknown
  Chassis
  Manufacturer : Oracle-Corporation
  Name : SUN-FIRE-X4170-M2-SERVER
  Part_Number : unknown
  Serial_Number : 1239FMM03T
  Status : faulted but still in service
----------------------------------------
Suspect 2 of 2 :
  Fault class : fault.sunos.eft.unexpected_telemetry
  Certainty : 50%
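
The resource name in the suspect listing above is an hc-scheme FMRI, and reading it component by component shows that the suspect sits below the CPU (a memory controller under chip 1), while the FRU list stops at the chip. The following is a minimal Python sketch for splitting such a string when working through a log offline. The function name parse_hc_fmri is hypothetical and is not an appliance tool; the input string is copied from the log above.

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI into authority fields and a component path.

    Hypothetical helper for offline log reading; not part of the appliance.
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI")
    rest = fmri[len(prefix):]

    # Everything before the first "/" is the authority, the rest is the path.
    authority_part, _, path_part = rest.partition("/")

    authority = {}
    for field in authority_part.split(":"):
        if field:  # skip the empty field before the leading ":"
            key, _, value = field.partition("=")
            authority[key] = value

    path = []
    for component in path_part.split("/"):
        if component:
            name, _, instance = component.partition("=")
            path.append((name, int(instance)))
    return authority, path


# Resource name taken verbatim from the alert.ak excerpt.
resource = (
    "hc://:chassis-mfg=Oracle-Corporation"
    ":chassis-name=SUN-FIRE-X4170-M2-SERVER"
    ":chassis-part=unknown"
    ":chassis-serial=1239FMM03T"
    "/motherboard=0/chip=1/memory-controller=0"
)

authority, path = parse_hc_fmri(resource)
for name, instance in path:
    print(name, instance)
```

The last path element is the memory controller, which is consistent with the fault originating on the memory side rather than in the CPU itself.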

 

FMA has concluded that a CPU is the probable cause of the error, but the FMA (fm) logs contain multiple errors that point to a DIMM, for example:

ereport.cpu.intel.quickpath.mem_ce

together with an unexpected-telemetry defect that FMA was unable to diagnose:

defect.sunos.eft.unexpected_telemetry
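
To see how often each error class appears in the FMA error log, the events can be tallied by class. The following is a minimal Python sketch, assuming a text listing with one event per line and the class name as the last whitespace-separated field. The function name count_ereport_classes is hypothetical, and the sample lines (including their timestamps) are placeholders for illustration only, not data from this case.

```python
from collections import Counter


def count_ereport_classes(lines):
    """Count events per ereport class.

    Assumes the class name is the last whitespace-separated token on each
    line; header lines and anything else not starting with "ereport." are
    ignored.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if tokens and tokens[-1].startswith("ereport."):
            counts[tokens[-1]] += 1
    return counts


# Placeholder input: timestamps are invented for illustration.
sample = [
    "TIME                 CLASS",
    "Jan 01 00:00:01.0000 ereport.cpu.intel.quickpath.mem_ce",
    "Jan 01 00:05:12.0000 ereport.cpu.intel.quickpath.mem_ce",
    "Jan 02 03:10:44.0000 ereport.cpu.intel.quickpath.mem_ce",
]

counts = count_ereport_classes(sample)
for cls, n in counts.most_common():
    print(f"{n:5d}  {cls}")
```

A listing dominated by mem_ce events, as in this case, points to memory rather than to the CPU named in the fault.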

 

Cause

The FMA diagnosis is not able to handle ereport.cpu.intel.quickpath.mem_ce errors. The DIMM is not reported as faulted because the errors do not reach the fault threshold.

Looking for the closest cause, the telemetry reports the CPU as faulted.

As a result, the CPU is marked as faulted.

Marking the fault as repaired clears the problem, but it happens again after some days.
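
The note does not state how the DIMM threshold is evaluated. As an illustration only, the sketch below assumes a rule of the form "N events within T seconds", which would explain how a DIMM that produces correctable errors slowly over several days never gets faulted. The function name threshold_reached and the values of n and window are hypothetical, as are the event times.

```python
def threshold_reached(event_times, n, window):
    """Return True if any n events fall within `window` seconds.

    Assumed N-in-T rule for illustration; the actual threshold logic and
    values used by the appliance are not given in this note.
    """
    times = sorted(event_times)
    for i in range(len(times) - n + 1):
        if times[i + n - 1] - times[i] <= window:
            return True
    return False


# Hypothetical event times in seconds.
sparse = [0, 86400, 172800, 259200]  # one correctable error per day
burst = [0, 600, 1200]               # three errors within 20 minutes

print(threshold_reached(sparse, n=3, window=3600))  # False
print(threshold_reached(burst, n=3, window=3600))   # True
```

Under this assumption, the sparse pattern never trips the threshold even though the errors keep arriving, which matches the behaviour described above.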

 

Solution

Replace the DIMM causing the FMA events and upgrade to the latest Appliance Firmware Release.
 

References

<BUG:19321750> - 7320 - CPU 0 FAULT WITHOUT ADEQUATE REASON

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.