I/O SERD threshold values are set too low and may result in PCIEX-8000-J5, PCIEX-8000-YJ and PCIEX-8000-KP faults.

Asset ID:	1-72-1617956.1
Update Date:	2017-08-22

Solution Type Problem Resolution Sure

Solution 1617956.1 : I/O SERD threshold values are set too low and may result in PCIEX-8000-J5, PCIEX-8000-YJ and PCIEX-8000-KP faults.

Applies to:

Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
SPARC M6-32 - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
SPARC T5-4 - Version All Versions and later
Information in this document applies to any platform.

Symptoms

PCIEX-8000-J5 and/or PCIEX-8000-KP FMA faults similar to those below will be reported. Systems that use an InfiniBand fabric, e.g. T5-8 SSC, appear to be more susceptible to these faults.

The fmdump -e output will contain ereport.io.pciex.dl.btlp and/or ereport.io.pciex.dl.bdllp events. Examples are below the FMA examples.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 26 23:29:14 0db03f13-3b8e-6c2c-e4d7-99da84c97f63 PCIEX-8000-J5 Major

Problem Status : isolated
Diag Engine : eft / 1.16
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown
Host_ID : 84fbc856

----------------------------------------
Suspect 1 of 1 :
Fault class : fault.io.pciex.device-interr-corr
Certainty : 100%
Affects : dev:////pci@540/pci@1
Status : faulted but still in service

FRU
Location : "/SYS/PM2"
Manufacturer : unknown
Name : unknown
Part_Number : 7056873
Revision : 08
Serial_Number : 465769T+1321880308
Chassis
Manufacturer : Oracle Corporation
Name : SPARC T5-8
Part_Number : 7068108
Serial_Number : AK00119551
Status : faulty

Description : Too many recovered internal errors have been detected within the
specified PCIEX device. This may degrade into a non-recoverable
fault.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 26 23:56:38 dd72c9ae-890f-e479-8860-c850fffef4e9 PCIEX-8000-KP Major

Problem Status : solved
Diag Engine : eft / 1.16
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown
Host_ID : 84fbc856

----------------------------------------
Suspect 1 of 2 :
Fault class : fault.io.pciex.device-interr-corr
Certainty : 100%
Affects : dev:////pci@5c0/pci@1/pci@0
Status : faulted but still in service

FRU
Location : "/SYS/MB"
Manufacturer : unknown
Name : unknown
Part_Number : 7070931
Revision : 03
Serial_Number : 465769T+13245203VU
Chassis
Manufacturer : Oracle Corporation
Name : SPARC T5-8
Part_Number : 7068108
Serial_Number : AK00119551
Status : faulty
----------------------------------------
Suspect 2 of 2 :
Fault class : fault.io.pciex.bus-linkerr-corr
Certainty : 100%
Affects : dev:////pci@5c0/pci@1/pci@0
Status : faulted but still in service

FRU
Location : "/SYS/MB"
Manufacturer : unknown
Name : unknown
Part_Number : 7070931
Revision : 03
Serial_Number : 465769T+13245203VU
Chassis
Manufacturer : Oracle Corporation
Name : SPARC T5-8
Part_Number : 7068108
Serial_Number : AK00119551
Status : faulty

Description : Too many recovered bus errors have been detected, which indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an unrecoverable
fault.

Verify the type and number of PCIe errors that triggered the fault event:

fmdump -eVu {UUID}

Cause

It has been determined that a certain set of I/O SERD threshold values is set too low, which could lead to unnecessary FRU replacements. These values are to be adjusted in future OS releases.

Solution

Purpose

Threshold levels for ereport.io.pciex.dl.btlp and ereport.io.pciex.dl.bdllp correctable errors are too restrictive. A fix has been released for Solaris 11.1 and Solaris 10 that brings these thresholds in line with physical link receiver errors (180 per hour). If a fault event occurs, first determine whether the FMA faults are attributable to these incorrect settings.

Solaris 11.1 : SRU 18.5
Solaris 10 SPARC : PatchID 149279-03
Solaris 10 i386 : PatchID 150913-02

If the customer is unable to schedule downtime to apply the permanent fix, they can implement a temporary solution by manually overriding the SERD thresholds as outlined below in the "Scope" section.

The current threshold for both ereport.io.pciex.dl.btlp and ereport.io.pciex.dl.bdllp events is 18 events per hour. The corrected threshold is 180 events per hour, in line with ereport.io.pciex.pl.re events.

To determine whether the faults are related to the incorrect settings, review the ereports associated with the PCIEX-8000-KP, PCIEX-8000-YJ or PCIEX-8000-J5 faults in the explorer fmadm-faulty.out data.

egrep "PCIEX-8000-KP|PCIEX-8000-J5|PCIEX-8000-YJ" fmadm-faulty.out

Jan 26 23:29:14 0db03f13-3b8e-6c2c-e4d7-99da84c97f63 PCIEX-8000-J5 Major

PCIEX-8000-J5 for the latest

Apr 11 22:35:28 ad6018ae-a1bd-ea79-95df-a35df783a985 PCIEX-8000-YJ Major

PCIEX-8000-YJ for the latest

Jan 26 23:56:38 dd72c9ae-890f-e479-8860-c850fffef4e9 PCIEX-8000-KP Major

PCIEX-8000-KP for the latest
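
The UUIDs needed for the next step can also be pulled out of the same file directly. The following is a minimal sketch, not an Oracle-supplied tool: the function name list_pciex_faults is hypothetical, and it assumes the summary-line layout shown above (month, day, time, UUID, MSG-ID, severity). On Solaris 10, nawk or /usr/xpg4/bin/awk may be needed in place of awk.

```shell
# Hypothetical helper (not part of Solaris): print "UUID MSG-ID" for each
# PCIEX-8000-KP, -J5 or -YJ summary line read from stdin.
# Assumes the layout shown above: month day time UUID MSG-ID severity.
list_pciex_faults() {
    awk '$5 ~ /^PCIEX-8000-(KP|J5|YJ)$/ { print $4, $5 }'
}

# Usage with explorer data:
#   list_pciex_faults < fmadm-faulty.out
```

Lines that only mention a message ID in passing (such as the "for the latest" lines) are skipped, because the MSG-ID is not in the fifth field.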

Using the UUID provided by fmadm faulty, run the following command to determine the ereport count and check whether it exceeds the new threshold:

fmdump -eVu UUID /var/fm/fmd/fltlog | grep class | sort | uniq -c

Example using explorer data:

fmdump -eVu dd72c9ae-890f-e479-8860-c850fffef4e9 fma/var/fm/fmd/fltlog | grep class | sort | uniq -c
19 class = ereport.io.pciex.dl.btlp

If the count reported (19 in the output above) is less than 181, implement the temporary fix listed below. If the count is 181 or greater, raise a Service Request with Oracle Support for further analysis.
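
The count-and-compare step can be sketched as a small shell function. This is an illustration only: the name check_ereport_counts and the "within"/"above" labels are hypothetical, and it assumes each ereport in the fmdump -eV output carries a line of the form "class = <name>". On Solaris 10, nawk may be needed in place of awk.

```shell
# Hypothetical helper (not part of Solaris): read "fmdump -eVu UUID ..." output
# on stdin, count ereports per class, and label each count against a threshold
# (default 180, the corrected value described in this document).
check_ereport_counts() {
    threshold=${1:-180}
    awk -v t="$threshold" '
        # Match lines of the form:   class = ereport.io.pciex.dl.btlp
        $1 == "class" && $2 == "=" { n[$3]++ }
        END {
            for (c in n) {
                verdict = (n[c] > t) ? "above" : "within"
                print n[c], c, verdict
            }
        }' | sort -k2
}

# Usage with explorer data:
#   fmdump -eVu UUID fma/var/fm/fmd/fltlog | check_ereport_counts
```

A class reported as "within" (180 or fewer events) points to the temporary fix; a class reported as "above" (181 or more) points to a Service Request.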

If the fault was triggered by Replay Timeouts (ereport.io.pciex.dl.rto), please refer to KM 1966605.1 for more information on possible resolution.

Note : There could be a corner case where the above syntax displays no output even though the issue has occurred. The fmdump syntax relies on the fltlog and errlog files. If a lot of time has elapsed, or a flood of ereports has occurred since the diagnosis, the errlog files may have rotated so many times that no errlog file still contains the ereports associated with the fault diagnosis in question. In other words, even the oldest errlog file does not cover the date on which the fault diagnosis took place. To check for such a condition, run the following:

Example Diagnosis:

May 10 04:45:43 8b980ed8-f8aa-cb05-c876-f257888eb2ff PCIEX-8000-KP Major

# more fmdump-eV.out | grep "May 10 04:45:43"

If there is no output, you may have run into the corner case described above. Manual examination of the most recent data may be required to further diagnose the problem.

If you encounter this corner case please contact a product lead.
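
The corner-case check above can be wrapped so that an empty result is reported explicitly. This is a minimal sketch under stated assumptions: the function name ereports_present is hypothetical, and the timestamp is passed to grep as a plain pattern, exactly as in the example command.

```shell
# Hypothetical helper (not part of Solaris): report whether a diagnosis
# timestamp appears in a saved fmdump -eV output file.
# Arguments: $1 = timestamp string, $2 = file name.
ereports_present() {
    if grep "$1" "$2" >/dev/null 2>&1; then
        echo "found"
    else
        # No match: the errlog files may have rotated past the diagnosis date.
        echo "missing"
        return 1
    fi
}

# Usage with explorer data:
#   ereports_present "May 10 04:45:43" fmdump-eV.out
```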

Note : Prior to implementing the manual workaround please ensure existing FMA faults are repaired/cleared.

Scope

Details

A formal fix with updated I/O SERD threshold values is available in the following Solaris updates:

Solaris 11.1 : SRU 18.5
Solaris 10 SPARC : PatchID 149279-03
Solaris 10 i386 : PatchID 150913-02

- If the customer is unable to schedule downtime to apply the permanent fix, they may instead implement a manual workaround in which the I/O SERD threshold values are updated by setting the serd_override property in the /usr/lib/fm/fmd/plugins/eft.conf file.

1) Add the following line to the bottom of /usr/lib/fm/fmd/plugins/eft.conf

setprop serd_override "serd.io.device.nonfatal_btlp,180,1h serd.io.pciex.corrlink-bus_btlp,180,1h serd.io.device.nonfatal_bdllp,180,1h serd.io.pciex.corrlink-bus_bdllp,180,1h"

2) Restart FMD

# svcadm restart fmd

3) Next, clear the existing SERD counters

# fmadm reset eft

fmadm: eft module has been reset

- Manual workaround changes will persist across AC power cycles and Solaris reboots, unless the OS is reinstalled or a patch is installed that updates the above configuration file.
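
Step 1 of the workaround can be sketched as a function that adds the line only when it is not already present, so that repeating the procedure does not duplicate the setting. This is an illustration, not an Oracle-supplied script: the function name add_serd_override and the ".orig" backup copy are assumptions that go beyond the documented steps.

```shell
# Hypothetical helper (not part of Solaris): append the serd_override line
# from step 1 to the given eft.conf file unless a serd_override setting exists.
# Assumption: a backup copy named <file>.orig is made before the file changes.
add_serd_override() {
    conf=$1
    line='setprop serd_override "serd.io.device.nonfatal_btlp,180,1h serd.io.pciex.corrlink-bus_btlp,180,1h serd.io.device.nonfatal_bdllp,180,1h serd.io.pciex.corrlink-bus_bdllp,180,1h"'
    if grep '^setprop serd_override' "$conf" >/dev/null 2>&1; then
        echo "serd_override already set in $conf; file left unchanged"
    else
        # Append only if the backup copy succeeded.
        cp "$conf" "$conf.orig" && printf '%s\n' "$line" >> "$conf"
    fi
}

# Usage (as root), followed by steps 2 and 3 from above:
#   add_serd_override /usr/lib/fm/fmd/plugins/eft.conf
#   svcadm restart fmd
#   fmadm reset eft
```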

Bug 18145114 : CONSOLIDATE DATA LINK RECEIVER ERRORS TO USE DYNAMIC SERD THRESHOLDS
