Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1362005.1
Update Date:2018-03-01
Keywords:

Solution Type  Troubleshooting Sure

Solution  1362005.1 :   Sun SPARC Enterprise[TM] M3000/M4000/M5000/M8000/M9000 (OPL) Servers: Troubleshooting PCIEX-8000-KP and SUNOS-8000-FU fault codes produced by Solaris FMA  


Related Items
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M8000 Server
  • Sun SPARC Enterprise M9000-64 Server
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M3000 Server
  • Sun SPARC Enterprise M5000 Server
Related Categories
  • PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Mx000
  • _Old GCS Categories>Sun Microsystems>Servers>OPL Servers




In this Document
Purpose
Troubleshooting Steps
 Issue Verification
 Resolution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Troubleshooting steps require actions not allowed for customers to do

Applies to:

Sun SPARC Enterprise M8000 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M9000-64 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M4000 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M3000 Server - Version Not Applicable to Not Applicable [Release N/A]
Sun SPARC Enterprise M5000 Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on SPARC (32-bit)

Purpose

This document is intended as an overall guide to troubleshooting OPL systems on which Solaris FMA, running on a System Domain, reports the SUNOS-8000-FU and/or the PCIEX-8000-KP faults in "fmadm faulty" output.

Please note that on M3000/M4000/M5000/M8000/M9000 series systems, Solaris FMA is responsible for the majority of PCI fault troubleshooting. When these types of issues occur, the XSCF will only produce errors of the type FMD-8000-11, which are not useful in diagnosing the true problem.

Troubleshooting Steps

Issue Verification

Verify the faults exist by running "fmadm faulty" on the Solaris Domain as the root user (or confirm with the outputs from explorer in the /fma directory).

Faults reported will be similar to this example:

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 17 08:21:39 8f503397-4226-eba6-9d1f-ebf8f2ac6df2 PCIEX-8000-KP Major

Host : <system host name>
Platform : SUNW,SPARC-Enterprise Chassis_id : <system serial number>

Fault class : fault.io.pciex.device-interr-corr max 25%
fault.io.pciex.bus-linkerr-corr max 13%
Affects : dev:////pci@10,600000/pci@0/pci@9
faulted but still in service
dev:////pci@10,600000/pci@0/pci@9/SUNW,emlxs@0,1
ok and in service
dev:////pci@10,600000/pci@0/pci@9/SUNW,emlxs@0
ok and in service
FRU : "iou#1" (hc:///component=iou#1) 25%
faulty
"iou#1-pci#1" (hc://:product-id=SUNW,SPARC-Enterprise:chassis-id=BEF08506C0:server-id=cdb1.dc1.prod/chassis=0/ioboard=1/hostbridge=0/pciexrc=0/pciexbus=2/pciexdev=0/pciexfn=0/pciexbus=3/pciexdev=9/pciexfn=0/pciexbus=119/pciexdev=0) max 25%
repair attempted

Description : Too many recovered bus errors have been detected, which indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an unrecoverable
fault.
Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with
this fault

Action : If a plug-in card is involved check for badly-seated cards or
bent pins. Otherwise schedule a repair procedure to replace the
affected device. Use fmadm faulty to identify the device or
contact Sun for support.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 17 08:21:25 ac02f2f7-d203-6661-acb9-b3bb9c070d92 SUNOS-8000-FU Major

Host : <system host name>
Platform : SUNW,SPARC-Enterprise Chassis_id : <system serial number>

Fault class : defect.sunos.eft.undiag.fme

Description : The diagnosis engine encountered telemetry for which it was
unable to perform a diagnosis. Refer to
http://sun.com/msg/SUNOS-8000-FU for more information.

Response : Error reports have been logged for examination by Sun.

Impact : Automated diagnosis and response for these events will not occur.

Action : Ensure that the latest Solaris Kernel and Predictive Self-Healing
(PSH) patches are installed.


*Note - both faults are usually reported together, but at times only one of the two is reported.
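As a quick way to pull just the two fault codes covered here out of a long "fmadm faulty" listing, the following is a minimal sketch. The function name list_pci_faults is invented for illustration, and it assumes the summary lines have the same field layout as the example above (month, day, time, EVENT-ID, MSG-ID, severity).

```shell
# Print "EVENT-ID MSG-ID" for each PCIEX-8000-KP or SUNOS-8000-FU summary
# line found in "fmadm faulty" output read from stdin.
# Assumption: summary lines look like
#   Aug 17 08:21:39 <uuid> PCIEX-8000-KP Major
list_pci_faults() {
    awk '$5 == "PCIEX-8000-KP" || $5 == "SUNOS-8000-FU" { print $4, $5 }'
}

# Typical use on the domain, as root:
#   fmadm faulty | list_pci_faults
```

The EVENT-ID printed in the first column is the <uuid> that "fmadm repair <uuid>" expects later in this document.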

Resolution

SUNOS-8000-FU

If only the SUNOS-8000-FU is reported, without the associated PCIEX-8000-KP, then manual inspection of the "fmdump -eV" outputs will be required to determine the device path of the fault.

Example:

Sep 02 2011 03:13:48.955933600 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0xfc4a6a8a88404801
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@10,600000/pci@0/pci@9/SUNW,emlxs@0
(end detector)

bdf = 0x7700
device_id = 0xfc20
vendor_id = 0x10df
rev_id = 0x2
dev_type = 0x0
pcie_off = 0x44
pcix_off = 0x0
aer_off = 0x100
ecc_ver = 0x0
pci_status = 0x10
pci_command = 0x147
pcie_status = 0x1
pcie_command = 0x203f
pcie_dev_cap = 0x6409a4
pcie_adv_ctl = 0x1f4
pcie_ue_status = 0x0
pcie_ue_mask = 0x0
pcie_ue_sev = 0x62011
pcie_ue_hdr0 = 0x4008001
pcie_ue_hdr1 = 0x1000703
pcie_ue_hdr2 = 0x22020000
pcie_ue_hdr3 = 0x220200
pcie_ce_status = 0x1
pcie_ce_mask = 0x0
remainder = 0x1
severity = 0x3
__ttl = 0x1
__tod = 0x4e60ac5c 0x38fa63a0

Sep 02 2011 03:13:48.955933200 ereport.io.pci.sec-rserr
nvlist version: 0
ena = 0xfc4a6a8a88404801
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@10,600000/pci@0/pci@9
(end detector)

class = ereport.io.pci.sec-rserr
pci-sec-status = 0x4000
pci-bdg-ctrl = 0x3
__ttl = 0x1
__tod = 0x4e60ac5c 0x38fa6210
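Reading device paths out of long "fmdump -eV" output by eye is error-prone. The sketch below pairs each ereport class with its detector device path. The function name ereport_paths is hypothetical, and the parsing assumes the record layout shown in the example above (a five-field header line ending in the class, followed by a "device-path = ..." line). It assumes a POSIX awk; on Solaris 10 that likely means nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk.

```shell
# Print "<ereport class> <device-path>" for every ereport found in
# "fmdump -eV" output read from stdin.
ereport_paths() {
    awk '
        # Record header, e.g.
        #   Sep 02 2011 03:13:48.955933200 ereport.io.pci.sec-rserr
        NF == 5 && $5 ~ /^ereport\./ { class = $5; next }
        # Detector line, e.g.
        #   device-path = /pci@10,600000/pci@0/pci@9
        $1 == "device-path" && $2 == "=" { print class, $3 }
    '
}

# Typical use:
#   fmdump -eV | ereport_paths | sort | uniq -c
```

Piping the result through "sort | uniq -c" gives a count per class and device path, which shows at a glance where the "ereport.io.pci.sec-rserr" events originate.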


1. The SUNOS-8000-FU fault is caused by a bug in the FMA diagnosis engine, which is unable to diagnose the "ereport.io.pci.sec-rserr" ereport coming from the PCI bus.  This bug has been resolved in <SunPatch:146855-01> and is documented in this Sun Alert:

Healthy Solaris 10 SPARC Systems May Incorrectly Report Hardware Errors During PCIE Correctable Events <Document 1369869.1>

2. If this is the only fault being reported, very few PCI events are expected in the "fmdump -eV" output.  The fault can then be safely cleared with "fmadm repair <uuid>" and the customer given the recommendation to apply the patch.

3. If the PCIEX-8000-KP fault is being reported in conjunction with SUNOS-8000-FU, verify that the "ereport.io.pci.sec-rserr" ereport is coming from the same device path as the card blamed in the "fmadm faulty" output for the PCIEX-8000-KP.  If they are coming from the same device path, then the fault can simply be cleared with "fmadm repair <uuid>" and the patch recommended to the customer.

3a. If the error has come from a different PCI bus than the PCIEX-8000-KP fault, then the number of errors on that bus should be checked manually from the "fmdump -eV" output. If the number of errors is low (<= 36 per hour), the fault should be cleared and, again, the patch recommended to the customer.



Sun Bug Reference: 6960665 SUNOS-8000-FU reported on ereport.io.pci.sec-rserr on OPL during PCI CE events
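The device path comparison in step 3 can be sketched as a small shell helper. The name same_device_path is invented here. Treating an ereport path beneath the affected path (for example a function under the same bridge) as a match is an assumption of this sketch, not a rule stated in this document. It assumes a POSIX shell such as ksh or bash.

```shell
# same_device_path AFFECTED EREPORT_PATH
#   AFFECTED      an "Affects" entry from "fmadm faulty", e.g.
#                 dev:////pci@10,600000/pci@0/pci@9
#   EREPORT_PATH  a device-path value from "fmdump -eV", e.g.
#                 /pci@10,600000/pci@0/pci@9
# Returns 0 when the ereport path is the affected device itself or,
# by assumption, a device beneath it; returns 1 otherwise.
same_device_path() {
    affected=${1#dev:///}     # strip the scheme: "dev:////pci@..." -> "/pci@..."
    case "$2" in
        "$affected"|"$affected"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   same_device_path 'dev:////pci@10,600000/pci@0/pci@9' \
#                    '/pci@10,600000/pci@0/pci@9' && echo "same path"
```

When the helper returns non-zero, the ereport comes from a different bus and the manual rate check in step 3a applies.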



PCIEX-8000-KP

For this event, the "fmadm faulty" output already identifies the affected device that is taking the correctable errors.

1. The first step is to check the frequency of the errors being reported; "fmdump -e" output is the simplest way to check for this:

Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.fabric
Sep 02 03:13:48.9559 ereport.io.pci.sec-rserr
Sep 02 03:13:48.9559 ereport.io.pciex.pl.re
Sep 02 03:13:48.9559 ereport.io.pciex.rc.ce-msg
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.fabric
Sep 02 03:17:18.6640 ereport.io.pci.sec-rserr
Sep 02 03:17:18.6640 ereport.io.pciex.pl.re
Sep 02 03:17:18.6640 ereport.io.pciex.rc.ce-msg


The frequency of the errors is critical to the next step in the resolution path. 

2. If the fault has occurred <= 36 times in an hour, then it should simply be cleared with "fmadm repair <uuid>".  This is a Solaris FMA bug: the SERD error rates were changed in Solaris 10 Update 6 to produce the PCIEX-8000-KP fault with only 6 errors in a 2 hour time frame.  This bug has been resolved in <SunPatch:147705-01>.  Solaris Patch 147705-01 must be installed either on Solaris 10 Update 10, or after first applying the Solaris 10 Update 10 Feature Kernel Update patch <SunPatch:144500-19>.  *Only if the card is re-faulted after the Solaris upgrade and patch application should you continue to the next step.*

NOTE - the above patch has been obsoleted by <SunPatch:149279-02>, which implements Enhancement Request 15755488 to allow FMA to dynamically set the PCI correctable error SERD rates based on the PCI bus generation and bus width.

Please note that this issue regarding the SERD error rates has been documented in this SunAlert, and it includes further commands that can be run to investigate the SERD decision engines (fmstat):

Solaris 10 SPARC Kernel Patch 137137-09 May Cause Erroneous PCIEX-8000-KP Reports During PCIE Correctable Events <Document 1369835.1>

2a. Note: Even with 147705-01 installed, an adapter may still fault with diagnosis code PCIEX-8000-KP at an ereport frequency of <= 36 per hour, because not all correctable event types were raised to the 36 per hour threshold.

SunBug 7051331 (fixed in Solaris Patch 147705-01) made these changes to the SERD engine thresholds for these PCI fault types -

   > 36 / hour : for Receiver Error (ereport.io.pciex.pl.re)
   > 18 / hour : for Bad DLLP (ereport.io.pciex.dl.bdllp)
   > 18 / hour : for Bad TLP (ereport.io.pciex.dl.btlp)

The specific ereport which triggered the fault should be checked for its frequency.  If it is of the Bad DLLP or Bad TLP type, Solaris FMA requires only >18 errors per hour of those ereports to mark the adapter faulted.  If this is the case for the system you are troubleshooting, continue on with this document using the >18 per hour SERD threshold.


Sun Bug Reference: 7051331 SPARC Solaris IO FMA s10u6 and later causing false IO hardware faults


Note: if the device that reports the PCI correctable errors is an Aura F20 card (PCI Express Flash Accelerator F20 SAS HBA), please check the following bug before proceeding with troubleshooting steps below:

Sun Bug Reference  6997490: PIC fabric errors seen during OPL M4/5000 production testing of Aura F20

The instructions to fix this issue on OPL Systems for this scenario are available here.


3. If the fault has occurred >36 times in an hour, then the card is suspected of having a seating issue or a problem with its PCI pins. The PCI card should be removed from the system and from its cassette, and the PCI pins on both sides carefully inspected for damage or contaminants.  If there is no damage other than the normal scuffing parallel to the pins caused by insertion into the PCI slot, the pins should be carefully wiped with an isopropyl alcohol wipe, taking care not to touch the pins afterwards.  The card should then be carefully installed in its cassette, verifying that it is properly mounted and engages fully and evenly when the lever on the cassette is in the locked position.  The card can now be inserted back into the system in the same slot and brought back online.  Solaris FMA should then be checked again to see if the faults are still occurring (it is simplest to check "fmdump -e" every couple of minutes for 15 minutes with the card in full operation).

3a. If the fault returns after PCI card re-seat and pin cleaning, then the error rates must be checked again.  If the error rate is <= 36 per hour, then the resolution is in Step 2.
3b. If the fault returns after PCI card re-seat and pin cleaning, and the error rates are >36 per hour, only then would you move to Step 4.


Sun Bug Reference: 6907573 Repeated PCIEX-8000-KP errors on iou#0-pci#2 even after replacing the card.


4.  If you have reached this step, then the next action is to replace the PCI card with a new FRU stock unit.

4a. If the fault returns after PCI card replacement, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.
4b. If the fault returns after PCI card replacement, and the error rates are  >36 per hour, only then would you move to Step 5.

5. If you have reached this step, then the next action is to replace the IOU that the PCI card is installed in with a new FRU stock unit.

5a. If the fault returns after IOU replacement, then the error rates must be checked again. If the error rate is <= 36 per hour, then the resolution is in Step 2.
5b. If the fault returns after IOU replacement, and the error rates are >36 per hour, only then would you move to Step 6.

6. If you have reached this step, then the card model must be checked.  If the card being faulted is an 8-Port 3Gbps SAS/SATA HBA PN:375-3487, then there is a known rare bug with this card (known internally as Pandora).  If this card model is the one continually being faulted, then it must be replaced with another model card, specifically the PCI Express 8-Port 6Gbps SAS HBA PN: 375-3641 (known internally as Erie).  As this is a rare fault, an FCO has not been deemed justified, and Field Services should treat this as a CIC of the Pandora card(s) to get the Erie replacement(s) for the customer.


Sun Bug Reference: 7002517 Repeated PCIEX-8000-KP errors reopen of 6907573


7. If you have reached this step, engagement of a Senior Level Domain Engineer within the TSC SPARC OPL team is required.
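The escalation order in steps 2 through 7 above can be summarized as a small decision helper. This is only a condensed sketch of the steps, not a replacement for them. The function name next_action and the LAST_ACTION labels are invented for illustration. It assumes a POSIX shell such as ksh or bash.

```shell
# next_action RATE LAST_ACTION [THRESHOLD]
#   RATE         highest ereport count seen in any one hour
#   LAST_ACTION  what has already been tried: none | reseat | card | iou | model
#                (labels invented for this sketch)
#   THRESHOLD    defaults to 36; pass 18 for Bad DLLP / Bad TLP ereports
next_action() {
    rate=$1
    last=$2
    threshold=${3:-36}
    if [ "$rate" -le "$threshold" ]; then
        echo "step 2: clear with fmadm repair <uuid> and recommend the patch"
        return 0
    fi
    case "$last" in
        none)   echo "step 3: remove, inspect, clean and re-seat the PCI card" ;;
        reseat) echo "step 4: replace the PCI card" ;;
        card)   echo "step 5: replace the IOU" ;;
        iou)    echo "step 6: check the card model (375-3487 -> 375-3641)" ;;
        *)      echo "step 7: engage a senior TSC SPARC OPL engineer" ;;
    esac
}

# Example: 50 errors in the worst hour, card already re-seated
#   next_action 50 reseat
```

The point of the sketch is that the error rate is re-checked at every stage: whenever it falls back to the threshold or below, the resolution returns to step 2 regardless of how far the hardware actions have progressed.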


Product
M3000 M4000 M5000 M8000 M9000



Addendum

In order to determine the correctable error rate for PCIEX-8000-KP events on a particular system, you may use one of the following methods:

1. Use the findfma tool available on cores2 (/cores_data/local/bin/findfma); this needs to be run against the affected system's "fmdump -eV" output (note that the fmdump-eV.out file is included in the fma directory of the explorer output).

2. Use a single line counter on the errlog collected in explorer (fma/var/fm/fmd), for example:

% fmdump -e -c 'ereport.io.pciex.rc.ce-msg' -n "detector.device-path=/pci@2,600000*" -t 01/01/13 errlog  |  cut -b1-9 | uniq -c | awk '{print $2,$3,"2013",$4":00,"$1}' | egrep -v TIME | sort -t, -rn +1 | head

Example output (date/time hour, CE rate):
Nov 15 2013 23:00,537
Nov 01 2013 17:00,322
Oct 28 2013 16:00,235

NOTE: substitute the device path for whichever adapter is listed faulty by FMA.
This will sort the CE rate and show you the highest correctable error rate per hour. If it is higher than 36, then the patch won't help.
You can also see how the CE rate varies through time by dropping the sorting. This may be useful to see if the rate was influenced by a recent insertion or other cassette manipulation:

% fmdump -e -c 'ereport.io.pciex.rc.ce-msg' -n "detector.device-path=/pci@2,600000*" -t 01/01/13 errlog  |  cut -b1-9 | uniq -c | awk '{print $2,$3,"2013",$4":00,"$1}' | egrep -v TIME

Be aware that neither of the above methods checks for the changes made in SunBug 7051331 (fixed in Solaris Patch 147705-01), which set these SERD engine thresholds for these PCI fault types -

   > 36 / hour : for Receiver Error (ereport.io.pciex.pl.re)
   > 18 / hour : for Bad DLLP (ereport.io.pciex.dl.bdllp)
   > 18 / hour : for Bad TLP (ereport.io.pciex.dl.btlp)

These thresholds were subsequently raised to 180 events per hour in:

Solaris 11.1 : SRU 18.5
Solaris 10 SPARC : PatchID 149279-03
Solaris 10 i386 : PatchID 150913-02

See this document for details:

I/O SERD threshold values are set too low and may result in PCIEX-8000-J5, PCIEX-8000-YJ and PCIEX-8000-KP faults. (Doc ID 1617956.1)

 

Therefore the specific fault type itself should be checked for its frequency.  The ereport 'ereport.io.pciex.rc.ce-msg' is a generic message saying that a correctable fault has occurred; it does not provide fault details.
If the actual fault is one of the less common Bad DLLP or Bad TLP types, Solaris FMA requires only >18 errors per hour of those fault types to mark the adapter faulted.  If your fault types are Bad DLLP (ereport.io.pciex.dl.bdllp) or Bad TLP (ereport.io.pciex.dl.btlp), you can substitute the exact ereport for 'ereport.io.pciex.rc.ce-msg' in the above example fmdump command and get a count per hour of the specific fault.
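
As an alternative to running the counter once per ereport class, the sketch below counts the three specific classes per hour in one pass and prints only the hours that exceed the 36 or 18 per hour thresholds listed above. The function name over_threshold is invented. It assumes "fmdump -e" short output (month, day, time, class) on stdin, the 147705-01 threshold values rather than the later 180 per hour values, and a POSIX awk (on Solaris 10, likely nawk or /usr/xpg4/bin/awk).

```shell
# Read "fmdump -e" output on stdin and print
#   "<Mon> <DD> <HH>:00 <class> <count>"
# for every hour in which a class exceeded its SERD threshold.
# Thresholds are the 147705-01 values quoted in this document.
over_threshold() {
    awk '
        $4 == "ereport.io.pciex.pl.re"    { limit[$4] = 36 }
        $4 == "ereport.io.pciex.dl.bdllp" { limit[$4] = 18 }
        $4 == "ereport.io.pciex.dl.btlp"  { limit[$4] = 18 }
        $4 in limit {
            # Bucket by month, day and hour, keeping the class in the key.
            key = $1 " " $2 " " substr($3, 1, 2) ":00 " $4
            count[key]++
        }
        END {
            for (key in count) {
                n = split(key, part, " ")      # part[n] is the class
                if (count[key] > limit[part[n]]) print key, count[key]
            }
        }
    ' | LC_ALL=C sort
}

# Typical use, with the device path of the adapter listed faulty by FMA:
#   fmdump -e -n "detector.device-path=/pci@2,600000*" errlog | over_threshold
```

No output means none of the three classes crossed its threshold in any single hour for the selected device path.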

References

<NOTE:1369869.1> - Healthy Solaris 10 SPARC Systems May Incorrectly Report Hardware Errors (SUNOS-8000-FU) During PCIE Correctable Events
<NOTE:1369835.1> - Solaris 10 SPARC Kernel Patch 137137-09 May Cause Erroneous PCIEX-8000-KP/-J5 Reports During PCIE Correctable Events

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.