Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1617956.1
Update Date:2017-08-22

Solution Type  Problem Resolution Sure

Solution  1617956.1 :   I/O SERD threshold values are set too low and may result in PCIEX-8000-J5, PCIEX-8000-YJ and PCIEX-8000-KP faults.  


Related Items
  • Oracle SuperCluster T5-8 Full Rack
  • SPARC M5-32
  • Fujitsu M10-1
  • SPARC M6-32
  • Oracle SuperCluster T5-8 Half Rack
  • SPARC T5-8
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M3000 Server
  • SPARC T5-4
  • SPARC T5-2
  • Sun SPARC Enterprise M9000-32 Server
  • Solaris Operating System
  • Oracle SuperCluster M6-32 Hardware
  • Oracle Exalytics T5-8
  • Oracle SuperCluster T5-8 Hardware
  • Sun SPARC Enterprise M5000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T5


It has been determined that certain I/O SERD threshold values are set too low,
which can lead to false-positive FMA faults and possibly unnecessary FRU replacements.

In this Document
Symptoms
Cause
Solution
 Purpose
 Scope
 Details
References


Applies to:

Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
SPARC M6-32 - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
SPARC T5-4 - Version All Versions and later
Information in this document applies to any platform.

Symptoms

PCIEX-8000-J5, PCIEX-8000-YJ and/or PCIEX-8000-KP FMA faults are reported. J5 and KP examples are shown below. Systems that use an InfiniBand fabric, such as the T5-8 SuperCluster (SSC), appear to be more susceptible to these faults.

The fmdump -e output will contain ereport.io.pciex.dl.btlp and/or ereport.io.pciex.dl.bdllp events. An example count of these ereports appears in the Solution section.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 26 23:29:14 0db03f13-3b8e-6c2c-e4d7-99da84c97f63  PCIEX-8000-J5  Major

Problem Status    : isolated
Diag Engine       : eft / 1.16
System
   Manufacturer  : unknown
   Name          : unknown
   Part_Number   : unknown
   Serial_Number : unknown
   Host_ID       : 84fbc856

----------------------------------------
Suspect 1 of 1 :
  Fault class : fault.io.pciex.device-interr-corr
  Certainty   : 100%
  Affects     : dev:////pci@540/pci@1
  Status      : faulted but still in service

  FRU
    Location         : "/SYS/PM2"
    Manufacturer     : unknown
    Name             : unknown
    Part_Number      : 7056873
    Revision         : 08
    Serial_Number    : 465769T+1321880308
    Chassis
       Manufacturer  : Oracle Corporation
       Name          : SPARC T5-8
       Part_Number   : 7068108
       Serial_Number : AK00119551
       Status        : faulty

Description : Too many recovered internal errors have been detected within the
             specified PCIEX device. This may degrade into a non-recoverable
             fault.

 


--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 26 23:56:38 dd72c9ae-890f-e479-8860-c850fffef4e9  PCIEX-8000-KP  Major

Problem Status    : solved
Diag Engine       : eft / 1.16
System
   Manufacturer  : unknown
   Name          : unknown
   Part_Number   : unknown
   Serial_Number : unknown
   Host_ID       : 84fbc856

----------------------------------------
Suspect 1 of 2 :
  Fault class : fault.io.pciex.device-interr-corr
  Certainty   : 100%
  Affects     : dev:////pci@5c0/pci@1/pci@0
  Status      : faulted but still in service

  FRU
    Location         : "/SYS/MB"
    Manufacturer     : unknown
    Name             : unknown
    Part_Number      : 7070931
    Revision         : 03
    Serial_Number    : 465769T+13245203VU
    Chassis
       Manufacturer  : Oracle Corporation
       Name          : SPARC T5-8
       Part_Number   : 7068108
       Serial_Number : AK00119551
       Status        : faulty
----------------------------------------
Suspect 2 of 2 :
  Fault class : fault.io.pciex.bus-linkerr-corr
  Certainty   : 100%
  Affects     : dev:////pci@5c0/pci@1/pci@0
  Status      : faulted but still in service

  FRU
    Location         : "/SYS/MB"
    Manufacturer     : unknown
    Name             : unknown
    Part_Number      : 7070931
    Revision         : 03
    Serial_Number    : 465769T+13245203VU
    Chassis
       Manufacturer  : Oracle Corporation
       Name          : SPARC T5-8
       Part_Number   : 7068108
       Serial_Number : AK00119551
       Status        : faulty

Description : Too many recovered bus errors have been detected, which indicates
             a problem with the specified bus or with the specified
             transmitting device. This may degrade into an unrecoverable
             fault.


Verify the type and number of PCIe errors that triggered the fault event:

 fmdump -eVu {UUID}

Cause

It has been determined that certain I/O SERD threshold values are set too low, which could lead to unnecessary FRU replacements. Corrected values are delivered in the Solaris updates listed in the Solution section.

Solution

Purpose

The SERD thresholds for ereport.io.pciex.dl.btlp and ereport.io.pciex.dl.bdllp correctable errors are too restrictive. A fix has been released for Solaris 11.1 and Solaris 10 that brings these thresholds in line with the threshold for physical link receiver errors (180 events per hour). If a fault event occurs, first determine whether the FMA faults are attributable to the incorrect settings.

Solaris 11.1 : SRU 18.5
Solaris 10 SPARC : PatchID 149279-03
Solaris 10 i386 : PatchID 150913-02

If the customer is unable to schedule downtime to apply the permanent fix, they can implement a temporary solution by manually overriding the SERD thresholds as outlined below in the "Details" section.

The current threshold for both ereport.io.pciex.dl.btlp and ereport.io.pciex.dl.bdllp events is 18 events per hour. The corrected threshold is 180 events per hour, in line with ereport.io.pciex.pl.re events.

To determine whether the faults are related to the incorrect settings, review the ereports associated with the PCIEX-8000-KP, PCIEX-8000-YJ or PCIEX-8000-J5 faults. Start by locating those faults in the explorer fmadm-faulty.out data:

 egrep "PCIEX-8000-KP|PCIEX-8000-J5|PCIEX-8000-YJ" fmadm-faulty.out


Jan 26 23:29:14 0db03f13-3b8e-6c2c-e4d7-99da84c97f63  PCIEX-8000-J5  Major
             PCIEX-8000-J5 for the latest
Apr 11 22:35:28 ad6018ae-a1bd-ea79-95df-a35df783a985  PCIEX-8000-YJ  Major
             PCIEX-8000-YJ for the latest
Jan 26 23:56:38 dd72c9ae-890f-e479-8860-c850fffef4e9  PCIEX-8000-KP  Major
             PCIEX-8000-KP for the latest
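
As a rough sketch, the summary lines matched above can be reduced to UUID and MSG-ID pairs, ready to pass to fmdump -eVu. The function name list_pciex_faults and the FAULTY variable are hypothetical, and the field positions assume the summary-line layout shown in this document (month, day, time, UUID, MSG-ID, severity). On Solaris, nawk may be needed in place of awk.

```shell
#!/bin/sh
# list_pciex_faults FILE   (hypothetical helper, not part of Solaris)
# Prints "UUID MSG-ID" for every fault summary line in FILE whose MSG-ID is
# PCIEX-8000-J5, PCIEX-8000-YJ or PCIEX-8000-KP.
# Assumed layout: $1=month $2=day $3=time $4=UUID $5=MSG-ID $6=severity.
list_pciex_faults() {
    awk '$5 ~ /^PCIEX-8000-(J5|YJ|KP)$/ &&
         length($4) == 36 && $4 ~ /^[0-9a-f-]+$/ { print $4, $5 }' "$1"
}

# Usage: run only if the explorer file is present in the current directory.
FAULTY=${FAULTY:-fmadm-faulty.out}
if [ -f "$FAULTY" ]; then
    list_pciex_faults "$FAULTY"
fi
```

Lines such as "PCIEX-8000-J5 for the latest" are skipped because the MSG-ID is not in the fifth field.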

 

Using the UUID provided by fmadm faulty, run the following command to determine the ereport count and check whether it exceeds the new threshold:
 

fmdump -eVu UUID /var/fm/fmd/fltlog | grep class | sort | uniq -c
 

Example using explorer data:


fmdump -eVu dd72c9ae-890f-e479-8860-c850fffef4e9 fma/var/fm/fmd/fltlog | grep class | sort | uniq -c
 19    class = ereport.io.pciex.dl.btlp

If the count reported (19 in the output above) is less than 181, implement the temporary fix listed below. If the count is 181 or greater, raise a Service Request with Oracle Support for further analysis.
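
The comparison above can be sketched as a small filter over the uniq -c output. This is only an illustration: the function name classify_ereport_counts, the NEW_THRESHOLD variable and the verdict strings are hypothetical. It applies the rule stated in this document (a count below 181 points to the workaround, 181 or more to a Service Request) and looks only at the btlp and bdllp classes.

```shell
#!/bin/sh
# Corrected SERD threshold from this document: 180 events per hour.
NEW_THRESHOLD=180

# classify_ereport_counts   (hypothetical helper)
# Reads `uniq -c` style lines on stdin, for example
#    19    class = ereport.io.pciex.dl.btlp
# and prints "<class> <count> <verdict>" for each btlp/bdllp class.
classify_ereport_counts() {
    awk -v limit="$NEW_THRESHOLD" '
        $2 == "class" && $3 == "=" &&
        $4 ~ /^ereport\.io\.pciex\.dl\.(btlp|bdllp)$/ {
            if ($1 + 0 > limit + 0)
                verdict = "raise-SR"
            else
                verdict = "apply-workaround"
            print $4, $1, verdict
        }'
}

# Usage with the sample count shown above.
printf ' 19    class = ereport.io.pciex.dl.btlp\n' | classify_ereport_counts
```

In practice the input would come from the fmdump pipeline shown above rather than from printf.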

 

 

If the fault was triggered by Replay Timeouts (ereport.io.pciex.dl.rto), please refer to KM 1966605.1 for more information on possible resolution.

 

Note : There could be a corner case where the above syntax displays no output even though the issue has occurred. The fmdump syntax relies on the fltlog and errlog files. If a lot of time has elapsed, or a flood of ereports has occurred since the fault was diagnosed, the errlog files may have rotated so many times that no remaining errlog file contains the ereports associated with the fault diagnosis in question. In other words, even the oldest errlog file does not reach back to the date the fault diagnosis took place. To check for such a condition run the following:

 

Example Diagnosis:

May 10 04:45:43 8b980ed8-f8aa-cb05-c876-f257888eb2ff  PCIEX-8000-KP  Major

# more fmdump-eV.out | grep "May 10 04:45:43"

 

If there is no output you may have run into the corner case described above. Manual examination of data that is most recent may be required to further diagnose the problem.

If you encounter this corner case please contact a product lead.
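
A minimal sketch of the same check in script form, for use when several faults need to be tested. The function name ereports_cover_fault and the ERRDUMP variable are hypothetical. Like the grep above, it assumes the timestamp string appears in the file in exactly the form used in the fault summary line.

```shell
#!/bin/sh
# ereports_cover_fault FILE TIMESTAMP   (hypothetical helper)
# Exit status 0 if TIMESTAMP occurs in FILE, non-zero otherwise.
# A non-zero status suggests the corner case described above.
ereports_cover_fault() {
    grep "$2" "$1" >/dev/null 2>&1
}

# Usage: run only if the explorer file is present in the current directory.
ERRDUMP=${ERRDUMP:-fmdump-eV.out}
if [ -f "$ERRDUMP" ]; then
    if ereports_cover_fault "$ERRDUMP" "May 10 04:45:43"; then
        echo "ereports found for this diagnosis time"
    else
        echo "no ereports found: possible errlog rotation corner case"
    fi
fi
```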

 

 

Note : Prior to implementing the manual workaround please ensure existing FMA faults are repaired/cleared.

Scope

 

Details

A formal fix with updated I/O SERD threshold values is available in the following Solaris updates:

Solaris 11.1 : SRU 18.5
Solaris 10 SPARC : PatchID 149279-03
Solaris 10 i386 : PatchID 150913-02

- If the customer is unable to schedule downtime to apply the permanent fix, they may instead implement a manual workaround: override the I/O SERD threshold values by setting the serd_override property in the /usr/lib/fm/fmd/plugins/eft.conf file.

1) Add the following line to the bottom of /usr/lib/fm/fmd/plugins/eft.conf

setprop serd_override "serd.io.device.nonfatal_btlp,180,1h serd.io.pciex.corrlink-bus_btlp,180,1h serd.io.device.nonfatal_bdllp,180,1h serd.io.pciex.corrlink-bus_bdllp,180,1h"

2) Restart FMD

# svcadm restart fmd

3) Next, clear the existing SERD counters

# fmadm reset eft

fmadm: eft module has been reset

- Manual workaround changes persist across AC power cycles and Solaris reboots, unless the OS is reinstalled or a patch is installed that updates the above configuration file.
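
Step 1 above can be sketched as an idempotent append, so that running it twice does not add a second serd_override line. The function name ensure_serd_override is hypothetical, and the override string is copied verbatim from step 1. The usage below works on a scratch file only. Pointing the function at /usr/lib/fm/fmd/plugins/eft.conf, followed by steps 2 and 3, is left to the administrator.

```shell
#!/bin/sh
# Override line copied verbatim from step 1 of this document.
SERD_LINE='setprop serd_override "serd.io.device.nonfatal_btlp,180,1h serd.io.pciex.corrlink-bus_btlp,180,1h serd.io.device.nonfatal_bdllp,180,1h serd.io.pciex.corrlink-bus_bdllp,180,1h"'

# ensure_serd_override FILE   (hypothetical helper)
# Appends SERD_LINE to FILE unless a serd_override property is already set.
# An existing line is left untouched so it can be reviewed by hand.
ensure_serd_override() {
    conf=$1
    if grep '^setprop serd_override' "$conf" >/dev/null 2>&1; then
        echo "serd_override already present in $conf, not modified"
        return 0
    fi
    printf '%s\n' "$SERD_LINE" >> "$conf" &&
        echo "serd_override appended to $conf"
}

# Usage on a scratch copy: the second call leaves the file unchanged.
demo=$(mktemp)
echo '# placeholder for existing eft.conf contents' > "$demo"
ensure_serd_override "$demo"
ensure_serd_override "$demo"
rm -f "$demo"
```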

 

Bug 18145114 : CONSOLIDATE DATA LINK RECEIVER ERRORS TO USE DYNAMIC SERD THRESHOLDS

 

 

 

 


  Copyright © 2018 Oracle, Inc.  All rights reserved.