SuperCluster - Reboot of SuperCluster IO domains can result in PCIE errors on the Infiniband HCA

Asset ID:	1-72-2150184.1
Update Date:	2016-06-15
Solution Type:	Problem Resolution Sure

Solution 2150184.1 : SuperCluster - Reboot of SuperCluster IO domains can result in PCIE errors on the Infiniband HCA

Applies to:

Solaris SPARC Operating System - Version 11.1 to 11.3 [Release 11.0]
Oracle SuperCluster M7 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Oracle Solaris on SPARC (64-bit)

Symptoms

Rebooting a SuperCluster IO domain can lead to FMA errors such as the following:

fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 06 14:00:15 fe15e83c-aa78-4c7e-a845-c3528fb5a80d PCIEX-8000-8R Major

Problem Status : isolated
Diag Engine : eft / 1.16
System
Manufacturer : Oracle Corporation
Name : SuperCluster M7
Part_Number : SuperCluster M7
Serial_Number : AK00350094

System Component
Manufacturer : Oracle Corporation
Name : SPARC M7-8
Part_Number : 7309340
Serial_Number : AK00349170
Host_ID : 8647c299

----------------------------------------
Suspect 1 of 1 :
Problem class : fault.io.pciex.device-invreq
Certainty : 100%
Affects : dev:////pci@31a/pci@1/pciex15b3,1004@0,2
Status : faulted and taken out of service

FRU
Status : faulty
Location : "/SYS/CMIOU5/PCIE3"
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Revision : unknown
Serial_Number : unknown
Chassis
Manufacturer : Oracle Corporation
Name : SPARC M7-8
Part_Number : 7309340
Serial_Number : AK00349170

Description : The transmitting device sent an invalid request.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with
this fault
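
When several faults are listed, it can help to reduce a report like the one above to the fields that identify the faulted device. The following is a minimal Python sketch, not an Oracle tool. The function names are hypothetical, and the check for `pciex15b3` assumes the HCA function is identified by the Mellanox PCI vendor ID 15b3, as in the sample device path.

```python
import re

# Excerpt of the sample "fmadm faulty" suspect block from this note.
SAMPLE = """\
Suspect 1 of 1 :
Problem class : fault.io.pciex.device-invreq
Certainty : 100%
Affects : dev:////pci@31a/pci@1/pciex15b3,1004@0,2
Status : faulted and taken out of service

FRU
Status : faulty
Location : "/SYS/CMIOU5/PCIE3"
"""

SUSPECT_RE = re.compile(r"^\s*Suspect\s+\d+\s+of\s+\d+\s*:")
FIELD_RE = re.compile(r"^\s*(Problem class|Affects|Location)\s*:\s*(.+?)\s*$")


def parse_suspects(report):
    """Return one dict per 'Suspect N of M' block (hypothetical helper)."""
    suspects = []
    current = None
    for line in report.splitlines():
        if SUSPECT_RE.match(line):
            current = {}
            suspects.append(current)
            continue
        match = FIELD_RE.match(line)
        if match and current is not None:
            key = match.group(1).lower().replace(" ", "_")
            # Keep the first value seen for each field within a suspect.
            current.setdefault(key, match.group(2).strip('"'))
    return suspects


def is_mellanox_function(dev_path):
    """Assumption: the HCA shows up as a pciex15b3,... node in the path."""
    return "pciex15b3," in dev_path


if __name__ == "__main__":
    for suspect in parse_suspects(SAMPLE):
        print(suspect, is_mellanox_function(suspect.get("affects", "")))
```
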

In turn, these errors can prevent the IO domain, or the zones contained in it, from booting because of iSCSI errors such as the following:

NOTICE: Configuring iSCSI to access the root filesystem...
Hostname: orlm7client0111
May 12 12:20:46 auditd[456]: getaddrinfo(orlm7client0111) failed[temporary name resolution failure].
cannot open 'sc30zadmclient1011': I/O error

SUNW-MSG-ID: ZFS-8000-LR, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu May 12 12:21:04 BST 2016
PLATFORM: unknown, CSN: unknown, HOSTNAME: orlm7client01
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 1c540fd5-92f3-4c79-869c-c2c2fe412f46
DESC: ZFS device 'id1,ssd@n600144f0fb59ad34000057237656000e/a' in pool 'orlm7client01' failed to open.
AUTO-RESPONSE: An attempt will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION:

Changes

A normal or unexpected reboot of an Oracle SuperCluster IO Domain can trigger this problem.

Cause

Unpublished Bug 22241559

Solution

The immediate workaround is to clear all related fmadm faults in the SP, the primary domain, and the root domain; the affected domain will then boot. Depending on what is running in the domain, these can be iSCSI faults, ZFS faults, and/or PCIE faults.
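
As an aid for the workaround above, the following minimal Python sketch collects the event IDs from the summary rows of saved `fmadm faulty` output in a Solaris domain and prints one candidate command per fault for review. It is a sketch under stated assumptions, not a supported procedure. The function names and the default `PCIEX`/`ZFS` prefix filter are hypothetical, and the use of `fmadm acquit <EVENT-ID>` as the clearing command is an assumption, since this note does not say which command to use. Nothing is executed, and faults held by the SP must be cleared separately.

```python
import re

# Summary portion of the sample "fmadm faulty" output from this note.
SAMPLE = """\
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 06 14:00:15 fe15e83c-aa78-4c7e-a845-c3528fb5a80d PCIEX-8000-8R Major
"""

# One summary row: timestamp, UUID event ID, message ID, severity.
ROW_RE = re.compile(
    r"^\s*(?P<time>[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<event_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msg_id>\S+)\s+(?P<severity>\S+)\s*$"
)


def fault_rows(report):
    """Return the summary rows as dicts; header and ruler lines are skipped."""
    matches = (ROW_RE.match(line) for line in report.splitlines())
    return [m.groupdict() for m in matches if m]


def clear_commands(rows, prefixes=("PCIEX", "ZFS")):
    """Build candidate commands for faults whose MSG-ID prefix is listed.

    Assumption: 'fmadm acquit <uuid>' is the chosen way to clear the fault.
    The prefix list is illustrative; extend it for other related faults.
    """
    return [
        f"fmadm acquit {row['event_id']}"
        for row in rows
        if row["msg_id"].split("-")[0] in prefixes
    ]


if __name__ == "__main__":
    for command in clear_commands(fault_rows(SAMPLE)):
        print(command)  # review before running anything
```

Printing the commands rather than running them keeps the decision about which faults are related to the reboot with the administrator.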

The solution is to apply the latest SuperCluster IDR for your QFSDP. Refer to <Note 2086278.1> SuperCluster Recommended IDRs and CVEs Addressed for the latest IDR for your QFSDP level. Note that this fix is only available retroactively for the JAN 2016 and APR 2016 QFSDP.

References

<NOTE:1424503.2> - Information Center: SuperCluster
<NOTE:2088923.1> - Oracle SuperCluster Application Domain and Zones Best Practices
<NOTE:2004702.1> - Oracle SuperCluster Best Practices
<NOTE:1625975.1> - On-proc TRANSIENT Threads Can Delay Runnable Threads Leading to Cluster Node Evictions
<BUG:17697871> - SUNBT7199390 RUNNABLE THREAD OCCASIONALLY STAYS IN RUN QUEUE FOR TOO LONG
