Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1999137.1 : Oracle ZFS Storage Appliance: CPU failing continuously - even after replacement.
Created from <SR 3-10230898762>

Applies to:

Sun Storage 7210 Unified Storage System - Version All Versions and later
Sun Storage 7110 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
Oracle ZFS Storage ZS3-BA - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

Alarm received:

Fault event description: The diagnosis engine encountered telemetry from the listed devices for which it was unable to perform a diagnosis.
The diagnosis engine encountered telemetry from the listed devices for which it was unable to perform a diagnosis - all hypotheses were disproved.
Event-ID: 2861e502-1ea0-4175-f67e-f2b373191c9b
Auto-Response: None
Impact: None
Rec-Action: None
Looking at the 'hardware status', CPU 1 appears faulted:

chassis-000  zfs01  faulted  Oracle  Sun ZFS Storage 7320
cpu-000      CPU 0  ok       Intel   Intel(r) Xeon(r) CPU E5620 @ 2.40GHz  unknown
cpu-001      CPU 1  faulted  Intel   Intel(r) Xeon(r) CPU E5620 @ 2.40GHz  unknown
disk-000     HDD 0  ok       WDC     WD500BLHXSUN500G  WD-WX61EC1LUJ16  10000
disk-001     HDD 1  ok       WDC     WD500BLHXSUN500G  WD-WXD1E61EDE33  10000
If this issue appears after the CPU has been replaced, the replacement CPU will be reported as faulted again some days later, because AKD is attributing the failure to the wrong component. To identify the actual error, look in the 'alert.ak' log to see which error the FMA telemetry reports:

nvlist version: 0
    version = 0x0
    class = fault.sunos.eft.unexpected_telemetry
    certainty = 0x32
    resource = (embedded nvlist)
    nvlist version: 0
        version = 0x1
        scheme = hc
        hc-root =
        authority = (embedded nvlist)
        nvlist version: 0
            chassis-mfg = Oracle-Corporation
            chassis-name = SUN-FIRE-X4170-M2-SERVER
            chassis-part = unknown
            chassis-serial = 1239FMM03T
        (end authority)
        hc-list-sz = 0x3
        hc-list = (array of embedded nvlists)
        (start hc-list[0])
        nvlist version: 0
            hc-name = motherboard
            hc-id = 0
        (end hc-list[0])
        (start hc-list[1])
        nvlist version: 0
            hc-name = chip
            hc-id = 1
        (end hc-list[1])
        (start hc-list[2])
        nvlist version: 0
            hc-name = memory-controller
            hc-id = 0
        (end hc-list[2])
    (end resource)
    reason = all hypotheses were disproved
    retire = 0
    response = 0
    fru = (embedded nvlist)
    nvlist version: 0
        version = 0x1
        scheme = hc
        hc-root =
        authority = (embedded nvlist)
        nvlist version: 0
            chassis-mfg = Oracle-Corporation
            chassis-name = SUN-FIRE-X4170-M2-SERVER
            chassis-part = unknown
            chassis-serial = 1239FMM03T
        (end authority)
        hc-list = (array of embedded nvlists)
        (start hc-list[0])
        nvlist version: 0
            hc-name = motherboard
            hc-id = 0
        (end hc-list[0])
        (start hc-list[1])
        nvlist version: 0
            hc-name = chip
            hc-id = 1
        (end hc-list[1])
        hc-list-sz = 0x2
    (end fru)
    location = zfs01/CPU 1
(end fault-list[1])
fault-status = 0x1 0x1
severity = Major
source = appliance/kit/akd:default
uuid = 2861e502-1ea0-4175-f67e-f2b373191c9b

----------------------------------------
Suspect 1 of 2 :
   Fault class : defect.sunos.eft.unexpected_telemetry
   Certainty : 50%
   Resource
      Name : "hc://:chassis-mfg=Oracle-Corporation:chassis-name=SUN-FIRE-X4170-M2-SERVER:chassis-part=unknown:chassis-serial=1239FMM03T/motherboard=0/chip=1/memory-controller=0"
      Manufacturer : unknown
      Name : unknown
      Part_Number : unknown
      Revision : unknown
      Serial_Number : unknown
      Chassis
         Manufacturer : Oracle-Corporation
         Name : SUN-FIRE-X4170-M2-SERVER
         Part_Number : unknown
         Serial_Number : 1239FMM03T
      Status : faulted but still in service

----------------------------------------
Suspect 2 of 2 :
   Fault class : fault.sunos.eft.unexpected_telemetry
   Certainty : 50%
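The resource name reported for Suspect 1 already hints at where the telemetry originated: its last path component is a memory controller, not the CPU as a whole. As a minimal sketch, the FMRI string from the log above can be split into its authority and component path. The helper name `parse_hc_fmri` is hypothetical and not part of any appliance tooling, and the parsing assumes the simple `hc://:key=value.../name=id/...` layout seen in this log.

```python
def parse_hc_fmri(fmri: str) -> tuple[dict[str, str], list[tuple[str, str]]]:
    """Split an hc-scheme FMRI into (authority fields, component path).

    Assumes the layout seen in the log above:
    hc://:key=value:key=value/name=id/name=id/...
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI")
    # Everything before the first "/" is the authority, the rest is the path.
    authority_part, _, path_part = fmri[len(prefix):].partition("/")
    authority = dict(
        item.split("=", 1) for item in authority_part.split(":") if item
    )
    path = [tuple(seg.split("=", 1)) for seg in path_part.split("/") if seg]
    return authority, path


# Resource name copied from Suspect 1 in the alert.ak excerpt.
fmri = (
    "hc://:chassis-mfg=Oracle-Corporation"
    ":chassis-name=SUN-FIRE-X4170-M2-SERVER"
    ":chassis-part=unknown:chassis-serial=1239FMM03T"
    "/motherboard=0/chip=1/memory-controller=0"
)
authority, path = parse_hc_fmri(fmri)
print(path)
# [('motherboard', '0'), ('chip', '1'), ('memory-controller', '0')]
```

The deepest component, `memory-controller=0` on `chip=1`, is consistent with a memory-side problem, even though the FRU and location fields stop at `CPU 1`.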
The FMA diagnosis has concluded that a CPU is the probable cause of the error, but the FMA (fm) logs contain multiple errors pointing to a DIMM, for example ereport.cpu.intel.quickpath.mem_ce. Alongside these, FMA reports an unexpected telemetry error that it cannot identify: defect.sunos.eft.unexpected_telemetry
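One way to confirm that memory errors dominate the FMA logs is to tally the ereport classes that appear in a text export of the error log. The following is a rough sketch only: the function name `count_ereport_classes` is hypothetical, the input is assumed to be plain text lines that contain ereport class names somewhere in them, and the sample lines are illustrative rather than taken from a real system (`ereport.example.other` is a placeholder class).

```python
import re
from collections import Counter

# Matches dotted class names such as ereport.cpu.intel.quickpath.mem_ce.
# Each dot must be followed by name characters, so a trailing "." is ignored.
EREPORT_RE = re.compile(r"\bereport(?:\.[\w-]+)+")


def count_ereport_classes(lines):
    """Count how often each ereport class appears in the given text lines."""
    counts = Counter()
    for line in lines:
        counts.update(EREPORT_RE.findall(line))
    return counts


# Illustrative input only; real log lines carry more fields.
sample = [
    "ereport.cpu.intel.quickpath.mem_ce",
    "ereport.cpu.intel.quickpath.mem_ce",
    "ereport.example.other",
    "ereport.cpu.intel.quickpath.mem_ce",
]
counts = count_ereport_classes(sample)
for name, n in counts.most_common():
    print(n, name)
```

A log in which `ereport.cpu.intel.quickpath.mem_ce` accounts for most entries matches the pattern described here, where the DIMM rather than the CPU is the component generating the events.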
Cause

The FMA telemetry is not able to handle ereport.cpu.intel.quickpath.mem_ce errors. The DIMM is not reported as faulted because its errors do not reach the fault threshold. Looking for the closest plausible cause, the telemetry reports the CPU instead, and the CPU ends up marked as faulted. Marking the fault as repaired clears the problem only temporarily: after some days it happens again.
Solution

Replace the DIMM causing the FMA events and upgrade to the latest Appliance Firmware Release.

References

<BUG:19321750> - 7320 - CPU 0 FAULT WITHOUT ADEQUATE REASON

Attachments

This solution has no attachment