Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1617956.1 : I/O SERD threshold values are set too low and may result in PCIEX-8000-J5, PCIEX-8000-YJ and PCIEX-8000-KP faults.
It has been determined that a certain set of I/O SERD threshold values are set too low, which could lead to false positive FMA faults and possibly unnecessary FRU replacements.
Applies to:
Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
SPARC M6-32 - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
SPARC T5-4 - Version All Versions and later
Information in this document applies to any platform.

Symptoms
PCIEX-8000-J5 and/or PCIEX-8000-KP FMA faults similar to those below will be reported. Systems utilizing InfiniBand fabric (i.e., T5-8 SSC) are seen to be more susceptible to these faults. The fmdump -e output will contain ereport.io.pciex.dl.btlp and/or ereport.io.pciex.dl.bdllp events. Examples are below the FMA examples.
fmdump -eVu {UUID}

Cause
It has been determined that a certain set of I/O SERD threshold values are set too low, which could lead to unnecessary FRU replacements. These values are to be adjusted in future OS releases.

Solution

Purpose
Threshold levels for ereport.io.pciex.dl.btlp and ereport.io.pciex.dl.bdllp correctable errors are too restrictive, and a fix has been released for Solaris 11.1 and Solaris 10 which brings the thresholds in line with physical link receiver errors (180/1hr). If a fault event occurs, first determine whether the FMA faults are attributable to these incorrect settings.

Solaris 11.1 : SRU 18.5

If the customer is unable to schedule downtime to apply the permanent fix, they can implement a temporary solution by manually overriding SERD thresholds as outlined below in the "Scope" section.

Current threshold settings for both ereport.io.pciex.dl.btlp and ereport.io.pciex.dl.bdllp events are 18 events per hour; the corrected settings will be 180 events per hour, in line with ereport.io.pciex.pl.re events.

To determine whether the faults are related to the incorrect settings, review the ereports associated with the PCIEX-8000-KP, PCIEX-8000-YJ or PCIEX-8000-J5 faults in the explorer fmadm-faulty.out data.

egrep "PCIEX-8000-KP|PCIEX-8000-J5|PCIEX-8000-YJ" fmadm-faulty.out
Apr 11 22:35:28 ad6018ae-a1bd-ea79-95df-a35df783a985 PCIEX-8000-YJ Major
PCIEX-8000-YJ for the latest
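The UUID needed for the next step is the fourth whitespace-separated field of each matching fault line. As a minimal sketch, assuming fault lines have the layout shown in the output above (month, day, time, UUID, message ID, severity), the helper below prints just the UUID and message ID. The function name `pciex_fault_uuids` is hypothetical and is not part of any Oracle tool.

```shell
#!/bin/sh
# Hypothetical helper: print "UUID MSG-ID" for each PCIEX-8000-KP/J5/YJ
# fault line found in fmadm-faulty.out style data (file argument or stdin).
# Assumes the line layout: Mon DD HH:MM:SS UUID MSG-ID SEVERITY
pciex_fault_uuids() {
    awk '
        # Field 5 must be one of the three message IDs, and field 4 must
        # look like a 36-character UUID; other matching text is skipped.
        $5 ~ /^PCIEX-8000-(KP|J5|YJ)$/ && length($4) == 36 && $4 ~ /^[0-9a-f-]+$/ {
            print $4, $5
        }
    ' "$@"
}

# Usage (not executed here):
#   pciex_fault_uuids fmadm-faulty.out
```

Lines that merely mention a message ID, such as the truncated reference line in the egrep output, are ignored because they lack a UUID in the fourth field.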
Using the UUID provided by fmadm faulty, use the following command to determine the ereport count and check whether it exceeds the new threshold:

fmdump -eVu UUID /var/fm/fmd/fltlog | grep class | sort | uniq -c

Example using explorer data:
If the count reported (19 in the output above) is less than 181, the temporary fix listed below should be implemented. If the count is 181 or greater, please raise a Service Request with Oracle Support for further analysis.
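The decision rule above can be sketched as a small filter over the `uniq -c` output. This is only an illustration: the function name `classify_ereports` and the verdict strings are hypothetical, and the input is assumed to consist of lines of the form `COUNT class = EREPORT-CLASS`, as produced by the fmdump pipeline in this document.

```shell
#!/bin/sh
# Hypothetical helper: read "uniq -c" style lines on stdin, for example
#     19 class = ereport.io.pciex.dl.btlp
# and, for the two data-link ereport classes covered by this document,
# print the class, its count, and a verdict:
#   workaround       count is less than 181
#   service-request  count is 181 or greater
classify_ereports() {
    awk '
        $NF ~ /^ereport\.io\.pciex\.dl\.(btlp|bdllp)$/ {
            verdict = ($1 < 181) ? "workaround" : "service-request"
            print $NF, $1, verdict
        }
    '
}

# Usage (not executed here):
#   fmdump -eVu UUID /var/fm/fmd/fltlog | grep class | sort | uniq -c | classify_ereports
```

Other ereport classes, such as ereport.io.pciex.dl.rto, are deliberately left out of the verdict because the document handles them separately.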
If the fault was triggered by Replay Timeouts (ereport.io.pciex.dl.rto), please refer to KM 1966605.1 for more information on possible resolution.
Note : There could be a corner case where the above syntax displays no output even though the issue has occurred. The fmdump syntax relies on the fltlog and errlog files. If a lot of time has elapsed, or a flood of ereports has taken place since the fault was diagnosed, the errlog files may have rotated so many times that no errlog file still contains the ereports associated with the fault diagnosis in question. In other words, even the oldest errlog file does not cover the date on which the fault diagnosis took place. To check for such a condition, run the following:
Example Diagnosis:
May 10 04:45:43 8b980ed8-f8aa-cb05-c876-f257888eb2ff PCIEX-8000-KP Major

# more fmdump-eV.out | grep "May 10 04:45:43"
If there is no output, you may have run into the corner case described above. Manual examination of the most recent data may be required to further diagnose the problem. If you encounter this corner case, please contact a product lead.
Note : Prior to implementing the manual workaround, please ensure existing FMA faults are repaired/cleared.

Scope
Details
A formal fix with updated I/O SERD threshold values is available in the following Solaris updates:

Solaris 11.1 : SRU 18.5

If the customer is unable to schedule downtime to apply the permanent fix, they may instead implement a manual workaround whereby I/O SERD threshold values can be updated by setting the serd_override property in the /usr/lib/fm/fmd/plugins/eft.conf file.

1) Add the following line to the bottom of /usr/lib/fm/fmd/plugins/eft.conf

setprop serd_override "serd.io.device.nonfatal_btlp,180,1h serd.io.pciex.corrlink-bus_btlp,180,1h serd.io.device.nonfatal_bdllp,180,1h serd.io.pciex.corrlink-bus_bdllp,180,1h"

2) Restart FMD

# svcadm restart fmd

3) Next, clear the existing serd counters

# fmadm reset eft
fmadm: eft module has been reset

Manual workaround changes will persist across AC power cycles and Solaris reboots, unless the OS is reinstalled or a patch is installed that updates the above configuration file.
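Step 1 of the workaround can be made repeatable with a small helper that appends the override line only when no serd_override setting is already present. This is a minimal sketch, not an Oracle-supplied script: the function name `add_serd_override` and the variable `SERD_LINE` are hypothetical, and the file path is passed as an argument so that the helper can be tried on a copy before it is used on /usr/lib/fm/fmd/plugins/eft.conf.

```shell
#!/bin/sh
# The override line from step 1, kept in one place (hypothetical variable).
SERD_LINE='setprop serd_override "serd.io.device.nonfatal_btlp,180,1h serd.io.pciex.corrlink-bus_btlp,180,1h serd.io.device.nonfatal_bdllp,180,1h serd.io.pciex.corrlink-bus_bdllp,180,1h"'

# Hypothetical helper: append SERD_LINE to the given eft.conf-style file
# unless a "setprop serd_override" line already exists in it.
add_serd_override() {
    conf=$1
    # Output is redirected instead of using grep -q, which not every
    # grep implementation accepts.
    if grep '^setprop serd_override' "$conf" >/dev/null 2>&1; then
        echo "serd_override already set in $conf; file not modified"
        return 0
    fi
    printf '%s\n' "$SERD_LINE" >> "$conf" || return 1
    echo "serd_override appended to $conf"
}

# Usage (not executed here), followed by steps 2 and 3 from the document:
#   add_serd_override /usr/lib/fm/fmd/plugins/eft.conf
#   svcadm restart fmd
#   fmadm reset eft
```

Checking for an existing setting first avoids duplicate serd_override lines if the workaround is applied more than once.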
Bug 18145114 : CONSOLIDATE DATA LINK RECEIVER ERRORS TO USE DYNAMIC SERD THRESHOLDS