Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1362005.1 : Sun SPARC Enterprise[TM] M3000/M4000/M5000/M8000/M9000 (OPL) Servers: Troubleshooting PCIEX-8000-KP and SUNOS-8000-FU fault codes produced by Solaris FMA
Oracle Confidential PARTNER - Available to partners (SUN).

Applies to:
Sun SPARC Enterprise M8000 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M9000-64 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M4000 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M3000 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M5000 Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on SPARC (32-bit)

Purpose
This document is an overall guide to troubleshooting OPL systems on which Solaris FMA, running on a System Domain, reports the SUNOS-8000-FU and/or PCIEX-8000-KP faults in "fmadm faulty" output.

Troubleshooting Steps

Issue Verification
Verify that the faults exist by running "fmadm faulty" on the Solaris Domain as the root user (or confirm from the explorer outputs in the /fma directory).

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Aug 17 08:21:39 8f503397-4226-eba6-9d1f-ebf8f2ac6df2  PCIEX-8000-KP  Major

Host        : <system host name>
Platform    : SUNW,SPARC-Enterprise  Chassis_id  : <system serial number>

Fault class : fault.io.pciex.device-interr-corr max 25%
              fault.io.pciex.bus-linkerr-corr max 13%
Affects     : dev:////pci@10,600000/pci@0/pci@9
                  faulted but still in service
              dev:////pci@10,600000/pci@0/pci@9/SUNW,emlxs@0,1
                  ok and in service
              dev:////pci@10,600000/pci@0/pci@9/SUNW,emlxs@0
                  ok and in service
FRU         : "iou#1" (hc:///component=iou#1) 25%
                  faulty
              "iou#1-pci#1" (hc://:product-id=SUNW,SPARC-Enterprise:chassis-id=B
EF08506C0:server-id=cdb1.dc1.prod/chassis=0/ioboard=1/hostbridge=0/pciexrc=0/pci
exbus=2/pciexdev=0/pciexfn=0/pciexbus=3/pciexdev=9/pciexfn=0/pciexbus=119/pciexd
ev=0) max 25%
                  repair attempted

Description : Too many recovered bus errors have been detected, which indicates
              a problem with the specified bus or with the specified
              transmitting device. This may degrade into an unrecoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device. Use fmadm faulty to identify the device or
              contact Sun for support.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Aug 17 08:21:25 ac02f2f7-d203-6661-acb9-b3bb9c070d92  SUNOS-8000-FU  Major

Host        : <system host name>
Platform    : SUNW,SPARC-Enterprise  Chassis_id  : <system serial number>

Fault class : defect.sunos.eft.undiag.fme

Description : The diagnosis engine encountered telemetry for which it was
              unable to perform a diagnosis.
              Refer to http://sun.com/msg/SUNOS-8000-FU for more information.

Response    : Error reports have been logged for examination by Sun.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing
              (PSH) patches are installed.
Resolution

SUNOS-8000-FU

Sep 02 2011 03:13:48.955933600 ereport.io.pci.fabric
nvlist version: 0
        class = ereport.io.pci.fabric
        ena = 0xfc4a6a8a88404801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@10,600000/pci@0/pci@9/SUNW,emlxs@0
        (end detector)

        bdf = 0x7700
        device_id = 0xfc20
        vendor_id = 0x10df
        rev_id = 0x2
        dev_type = 0x0
        pcie_off = 0x44
        pcix_off = 0x0
        aer_off = 0x100
        ecc_ver = 0x0
        pci_status = 0x10
        pci_command = 0x147
        pcie_status = 0x1
        pcie_command = 0x203f
        pcie_dev_cap = 0x6409a4
        pcie_adv_ctl = 0x1f4
        pcie_ue_status = 0x0
        pcie_ue_mask = 0x0
        pcie_ue_sev = 0x62011
        pcie_ue_hdr0 = 0x4008001
        pcie_ue_hdr1 = 0x1000703
        pcie_ue_hdr2 = 0x22020000
        pcie_ue_hdr3 = 0x220200
        pcie_ce_status = 0x1
        pcie_ce_mask = 0x0
        remainder = 0x1
        severity = 0x3
        __ttl = 0x1
        __tod = 0x4e60ac5c 0x38fa63a0

Sep 02 2011 03:13:48.955933200 ereport.io.pci.sec-rserr
nvlist version: 0
        ena = 0xfc4a6a8a88404801
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@10,600000/pci@0/pci@9
        (end detector)

        class = ereport.io.pci.sec-rserr
        pci-sec-status = 0x4000
        pci-bdg-ctrl = 0x3
        __ttl = 0x1
        __tod = 0x4e60ac5c 0x38fa6210
3a. If the error came from a different PCI bus than the one named in the PCIEX-8000-KP fault, check the number of errors on that bus manually in the "fmdump -eV" output. If the count is low (<= 36 per hour), clear the fault and, again, recommend the patch to the customer.
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.sec-rserr
Sep 02 03:13:48.9559 ereport.io.pciex.pl.re
Sep 02 03:13:48.9559 ereport.io.pciex.rc.ce-msg
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.sec-rserr
Sep 02 03:17:18.6640 ereport.io.pciex.pl.re
Sep 02 03:17:18.6640 ereport.io.pciex.rc.ce-msg
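The manual count described in step 3a can be sketched as a small shell function. This is a minimal sketch and not part of the original procedure: count_per_hour is a hypothetical helper name, and it assumes the "fmdump -e" line layout shown above (month, day, time, ereport class). It relies on awk -v, which on Solaris 10 generally means using nawk or /usr/xpg4/bin/awk in place of the legacy /usr/bin/awk.

```shell
# Hypothetical helper (not from the original document): reads "fmdump -e"
# text on stdin and prints one "Mon DD HH:00,count" line per hour.
# An optional first argument restricts the count to one ereport class.
# Assumed input layout: "Sep 02 03:13:48.9559 ereport.io.pci.fabric"
count_per_hour() {
    awk -v want="${1:-}" '
        $1 == "TIME" { next }                   # skip the fmdump header line
        NF < 4 { next }                         # skip lines that are not ereport entries
        want != "" && $4 != want { next }       # optional class filter
        {
            key = $1 " " $2 " " substr($3, 1, 2) ":00"
            if (!(key in n)) order[++k] = key   # remember hours in first-seen order
            n[key]++
        }
        END { for (i = 1; i <= k; i++) print order[i] "," n[order[i]] }
    '
}

# Possible usage on the affected domain:
#   fmdump -e | count_per_hour ereport.io.pciex.pl.re
```

Each output line can then be compared by eye against the per-hour figure quoted in the step above.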
NOTE - The above patch has been obsoleted by <SunPatch:149279-02>, which implements Enhancement Request 15755488 to allow FMA to dynamically set the PCI correctable error SERD rates based on the PCI bus generation and bus width.
Please note that this SERD error rate issue is documented in the following SunAlert, which includes further commands (fmstat) that can be run to investigate the SERD decision engines:
Solaris 10 SPARC Kernel Patch 137137-09 May Cause Erroneous PCIEX-8000-KP Reports During PCIE Correctable Events <Document 1369835.1>

2a. Note: If 147705-01 is installed, an adapter may still fault with diagnosis code PCIEX-8000-KP at an ereport frequency of <= 36 per hour. Not all correctable events were raised to the 36 per hour threshold. SunBug 7051331 (fixed in Solaris Patch 147705-01) made these changes to the SERD engine thresholds for these PCI fault types -
> 36 / hour : for Receiver Error (ereport.io.pciex.pl.re)
Check the frequency of the specific ereport that triggered the fault. If it is a Bad DLLP or Bad TLP type, Solaris FMA requires only >18 errors per hour of those ereports to mark the adapter faulted. If this is the case for the system you are troubleshooting, continue with this document using the >18 per hour SERD threshold.
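The threshold choice in the note above can be written down as a lookup. This is a sketch under stated assumptions, not FMA's actual implementation: serd_threshold and exceeds_serd are hypothetical helper names, and the values are only the per-hour figures quoted in this document for systems with 147705-01 installed. Any ereport class this document does not cover is reported as unknown instead of being guessed.

```shell
# Hypothetical helper (not from the original document): per-hour threshold
# for an ereport class, using only the values quoted in this document.
serd_threshold() {
    case "$1" in
        ereport.io.pciex.pl.re)    echo 36 ;;   # Receiver Error
        ereport.io.pciex.dl.bdllp) echo 18 ;;   # Bad DLLP
        ereport.io.pciex.dl.btlp)  echo 18 ;;   # Bad TLP
        *)                         echo unknown; return 1 ;;
    esac
}

# exceeds_serd CLASS COUNT
# Succeeds (exit 0) when COUNT errors in one hour is above the threshold,
# fails with 1 when it is not, and with 2 when the class is not covered.
exceeds_serd() {
    limit=$(serd_threshold "$1") || return 2
    [ "$2" -gt "$limit" ]
}

# Possible usage: exceeds_serd ereport.io.pciex.dl.btlp 20 && echo "above threshold"
```

The comparison is strict (greater than), matching the "> 36 / hour" and ">18" wording used in this document.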
3a. If the fault returns after PCI card re-seat and pin cleaning, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.
4a. If the fault returns after PCI card replacement, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.
5. If you have reached this step, then the next action is to replace the IOU that the PCI card is installed in with a new FRU stock unit.
5a. If the fault returns after IOU replacement, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.
6. If you have reached this step, then the card model must be checked. If the card being faulted is an 8-Port 3Gbps SAS/SATA HBA PN: 375-3487, then there is a known rare bug with this card (known internally as Pandora). If this card model is the one continually being faulted, then it must be replaced with another card model, specifically the PCI Express 8-Port 6Gbps SAS HBA PN: 375-3641 (known internally as Erie). As this is a rare fault, an FCO has not been deemed justified, and Field Services should treat this as a CIC of the Pandora card(s) to get the Erie replacement(s) for the customer.
% fmdump -e -c 'ereport.io.pciex.rc.ce-msg' -n "detector.device-path=/pci@2,600000*" -t 01/01/13 errlog | cut -b1-9 | uniq -c | awk '{print $2,$3,"2013",$4":00,"$1}' | egrep -v TIME | sort -t, -rn +1 | head

Example output (date/time hour, CE rate):

% fmdump -e -c 'ereport.io.pciex.rc.ce-msg' -n "detector.device-path=/pci@2,600000*" -t 01/01/13 errlog | cut -b1-9 | uniq -c | awk '{print $2,$3,"2013",$4":00,"$1}' | egrep -v TIME

Be aware that neither of the above methods checks for the changes made in SunBug 7051331 (fixed in Solaris Patch 147705-01). SunBug 7051331 made these changes to the SERD engine for these PCI fault types -
> 36 / hour : for Receiver Error (ereport.io.pciex.pl.re)
These numbers were further raised to 180 events per hour in:
Solaris 11.1 : SRU 18.5
See this document for details:
I/O SERD threshold values are set too low and may result in PCIEX-8000-J5, PCIEX-8000-YJ and PCIEX-8000-KP faults. (Doc ID 1617956.1)
Therefore the fault type itself should be checked for its frequency. The ereport 'ereport.io.pciex.rc.ce-msg' is a generic message saying that a correctable fault has occurred; it does not provide fault details. If the actual fault is one of the less common Bad DLLP or Bad TLP types, Solaris FMA requires only >18 errors per hour of those fault types to mark the adapter faulted. If your fault types are Bad DLLP (ereport.io.pciex.dl.bdllp) or Bad TLP (ereport.io.pciex.dl.btlp), you can substitute the exact ereport for 'ereport.io.pciex.rc.ce-msg' in the above example fmdump command to get a count per hour of the specific fault.

References
<NOTE:1369869.1> - Healthy Solaris 10 SPARC Systems May Incorrectly Report Hardware Errors (SUNOS-8000-FU) During PCIE Correctable Events
<NOTE:1369835.1> - Solaris 10 SPARC Kernel Patch 137137-09 May Cause Erroneous PCIEX-8000-KP/-J5 Reports During PCIE Correctable Events

Attachments
This solution has no attachment