Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2150184.1
Update Date: 2016-06-15
Keywords:

Solution Type  Problem Resolution Sure

Solution  2150184.1 :   SuperCluster - Reboot of SuperCluster IO domains can result in PCIE errors on the Infiniband HCA  


Related Items
  • Solaris Operating System
  • Oracle SuperCluster M7 Hardware
  • Oracle SuperCluster T5-8 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST


This document describes a SuperCluster critical issue. If classified as a critical issue, the item specified as the solution is considered mandatory.

In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

Solaris SPARC Operating System - Version 11.1 to 11.3 [Release 11.0]
Oracle SuperCluster M7 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Oracle Solaris on SPARC (64-bit)

Symptoms

Rebooting a SuperCluster IO domain can lead to FMA errors such as the following, or similar:

fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 06 14:00:15 fe15e83c-aa78-4c7e-a845-c3528fb5a80d PCIEX-8000-8R Major

Problem Status : isolated
Diag Engine : eft / 1.16
System
Manufacturer : Oracle Corporation
Name : SuperCluster M7
Part_Number : SuperCluster M7
Serial_Number : AK00350094

 

System Component
Manufacturer : Oracle Corporation
Name : SPARC M7-8
Part_Number : 7309340
Serial_Number : AK00349170
Host_ID : 8647c299

 

----------------------------------------
Suspect 1 of 1 :
Problem class : fault.io.pciex.device-invreq
Certainty : 100%
Affects : dev:////pci@31a/pci@1/pciex15b3,1004@0,2
Status : faulted and taken out of service

 

FRU
Status : faulty
Location : "/SYS/CMIOU5/PCIE3"
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Revision : unknown
Serial_Number : unknown
Chassis
Manufacturer : Oracle Corporation
Name : SPARC M7-8
Part_Number : 7309340
Serial_Number : AK00349170

 

Description : The transmitting device sent an invalid request.

 

Response : One or more device instances may be disabled

 

Impact : Loss of services provided by the device instances associated with
this fault
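
When several domains have to be checked, it can help to pull the key fields out of captured `fmadm faulty` output. The following is a minimal Python sketch, not an Oracle tool: the function names are hypothetical, the regular expressions assume the layout shown above, and treating `pciex15b3` (15b3 is the Mellanox PCI vendor ID) in the "Affects" path as a sign of the Infiniband HCA is a heuristic assumption.

```python
import re

# Event row of `fmadm faulty`: "Apr 06 14:00:15 <uuid> <MSG-ID> <Severity>"
EVENT_ROW = re.compile(
    r"^(?P<time>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<event_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}) "
    r"(?P<msg_id>\S+) (?P<severity>\S+)\s*$"
)
# Detail rows look like "Problem class : fault.io.pciex.device-invreq"
DETAIL_ROW = re.compile(r"^\s*(?P<key>[A-Za-z_ ]+?)\s*:\s*(?P<value>\S.*?)\s*$")

# Detail keys kept by this sketch, mapped to dict keys (hypothetical names).
WANTED = {
    "Problem class": "problem_class",
    "Affects": "affects",
    "Location": "location",
}


def parse_fmadm_faulty(text):
    """Return one dict per fault event found in captured `fmadm faulty` output."""
    events = []
    for line in text.splitlines():
        row = EVENT_ROW.match(line.strip())
        if row:
            events.append(row.groupdict())
            continue
        detail = DETAIL_ROW.match(line)
        if detail and events and detail["key"] in WANTED:
            # Keep the first occurrence of each field for the current event.
            events[-1].setdefault(WANTED[detail["key"]], detail["value"].strip('"'))
    return events


def looks_like_hca_invreq(event):
    """Heuristic (assumption): does the event resemble the fault shown in this note?"""
    return (
        event.get("problem_class") == "fault.io.pciex.device-invreq"
        and "pciex15b3" in event.get("affects", "")
    )
```

Feeding it the output above would yield one event with MSG-ID PCIEX-8000-8R, for which `looks_like_hca_invreq` returns true. A match only suggests that the fault resembles the one in this note; it does not confirm the root cause.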

 


In turn, these errors can prevent the IO domains, or the zones they contain, from booting because of iSCSI errors such as the following, or similar:

 

NOTICE: Configuring iSCSI to access the root filesystem...
Hostname: orlm7client0111
May 12 12:20:46 auditd[456]: getaddrinfo(orlm7client0111) failed[temporary name resolution failure].
cannot open 'sc30zadmclient1011': I/O error

SUNW-MSG-ID: ZFS-8000-LR, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu May 12 12:21:04 BST 2016
PLATFORM: unknown, CSN: unknown, HOSTNAME: orlm7client01
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 1c540fd5-92f3-4c79-869c-c2c2fe412f46
DESC: ZFS device 'id1,ssd@n600144f0fb59ad34000057237656000e/a' in pool 'orlm7client01' failed to open.
AUTO-RESPONSE: An attempt will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION:
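
To sort captured console messages like the one above by fault type, the "KEY: value" fields can be split out and the MSG-ID prefix used as a rough family label. This is a hedged Python sketch with hypothetical function names; it assumes the message layout shown above and that the prefix before the first hyphen (for example ZFS or PCIEX) is a useful grouping, which this note does not state.

```python
import re

# An upper-case key, a colon, then an optional value ("REC-ACTION:" has none).
FIELD = re.compile(r"^([A-Z][A-Z-]*):\s*(.*)$")
# Lines that pack several comma-separated "KEY: value" pairs.
MULTI_PAIR_KEYS = {"SUNW-MSG-ID", "PLATFORM"}


def parse_fma_console_message(text):
    """Collect the fields of the first FMA console message found in `text`."""
    fields = {}
    started = False
    for raw in text.splitlines():
        line = raw.strip()
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "SUNW-MSG-ID":
            started = True
        if not started:
            continue  # ignore console lines that precede the message header
        if key in MULTI_PAIR_KEYS:
            for part in line.split(", "):
                sub_key, _, sub_value = part.partition(": ")
                fields[sub_key] = sub_value
        else:
            fields[key] = value
    return fields


def fault_family(fields):
    """Prefix of the message ID, e.g. 'ZFS' for ZFS-8000-LR (grouping is an assumption)."""
    return fields.get("SUNW-MSG-ID", "").split("-", 1)[0]
```

The resulting EVENT-ID and family label make it easier to see which kinds of faults are present in each domain when reviewing them.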

 

 

 

Changes

A normal or unexpected reboot of an Oracle SuperCluster IO domain can trigger this issue.

Cause

 Unpublished Bug 22241559

Solution

The immediate workaround is to clear all related fmadm faults in the SP, the primary domain, and the root domain; the affected domain will then boot. Depending on what is running in the domain, these can be iSCSI faults, ZFS faults, and/or PCIE faults.
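
Because the faults have to be cleared in three places, a dry-run list of commands per location can reduce the chance of missing one. The Python sketch below only builds command strings for an operator to review; it executes nothing. This note does not name the exact subcommand used to clear a fault, so `fmadm acquit <uuid>` is an assumption that should be checked against the fmadm documentation for your Solaris and ILOM release. The function names and location labels are hypothetical.

```python
import re

# Event UUIDs as printed by `fmadm faulty` and in FMA console messages.
UUID = re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b")


def event_ids(captured_output):
    """Unique event UUIDs in order of first appearance."""
    seen = []
    for uuid in UUID.findall(captured_output):
        if uuid not in seen:
            seen.append(uuid)
    return seen


def clear_commands(outputs_by_location, subcommand="acquit"):
    """Map each location to the command strings to review and run there.

    `subcommand` is an assumption: the note only says to "clear" the faults.
    """
    return {
        location: [f"fmadm {subcommand} {uuid}" for uuid in event_ids(output)]
        for location, output in outputs_by_location.items()
    }
```

A typical use would pass the captured `fmadm faulty` output from the SP, the primary domain, and the root domain as a dictionary keyed by location, then print the resulting lists for review before anything is run by hand.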

The solution is to apply the latest SuperCluster IDR for your QFSDP. Refer to <Note 2086278.1> SuperCluster Recommended IDRs and CVEs Addressed for the latest IDR for your QFSDP level. Note that this fix is available retroactively only for the JAN 2016 and APR 2016 QFSDP.

 

 

References

<NOTE:1424503.2> - Information Center: SuperCluster
<NOTE:2088923.1> - Oracle SuperCluster Application Domain and Zones Best Practices
<NOTE:2004702.1> - Oracle SuperCluster Best Practices
<NOTE:1625975.1> - On-proc TRANSIENT Threads Can Delay Runnable Threads Leading to Cluster Node Evictions
<BUG:17697871> - SUNBT7199390 RUNNABLE THREAD OCCASIONALLY STAYS IN RUN QUEUE FOR TOO LONG

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.