Information in this document applies to any platform.
Date of Workaround Release: 18-Aug-2015
Date of Resolved Release: 26-Jun-2017
_________________________________________
Description
If a large number of Reliable Datagram Socket (RDS) connections are established when a SuperCluster InfiniBand switch is rebooted, single-instance Oracle 12c databases may crash due to 'diskmon' timeouts, and Solaris domains or zones running Oracle RAC database instances can reboot or hang. This can affect many domains or zones in a cluster.
Occurrence
This issue can occur in the following releases:
SPARC Platform
- Solaris 11.2.1.5.0 through 11.3.6.5.0
for the following platforms:
- Oracle SuperCluster T4-4, T5-8, M6-32, and M7
Notes:
1. Since 22nd June 2015, Solaris 11.2.5.5.0 has been installed by default on all new SuperClusters.
2. This issue does not occur on SuperCluster systems running any version of Solaris 11.1 SRU.
To determine the Solaris 11 SRU level, enter the following command:
# pkg info entire | grep Version
Version: 0.5.11 (Oracle Solaris 11.2.5.5)
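The SRU check above can be scripted. The following is a minimal sketch, assuming the Version line has the form shown above and taking the affected range (11.2.1.5.0 through 11.3.6.5.0) literally from this document. The function names sru_from_pkg_info and sru_is_affected are hypothetical and are not part of any Oracle tool.

```shell
# Hypothetical helpers for illustration; not part of any Oracle tool.

# Extract the SRU string (for example 11.2.5.5) from "pkg info entire"
# output supplied on stdin. Assumes the "(Oracle Solaris X.Y...)" form.
sru_from_pkg_info() {
    sed -n 's/.*Version:.*(Oracle Solaris \([0-9.]*\)).*/\1/p' | head -n 1
}

# Print "affected" if the SRU given as $1 lies within
# 11.2.1.5.0 .. 11.3.6.5.0 (inclusive), otherwise "not affected".
# Missing trailing components are treated as 0.
sru_is_affected() {
    echo "$1" | awk -F. '{
        # Build a fixed-width key so a string comparison orders versions.
        v = sprintf("%03d%03d%03d%03d%03d", $1, $2, $3, $4, $5)
        if (v >= "011002001005000" && v <= "011003006005000")
            print "affected"
        else
            print "not affected"
    }'
}
```

On a database domain, this might be used as follows: sru=$(pkg info entire | sru_from_pkg_info); sru_is_affected "$sru"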
Symptoms
Immediately following an InfiniBand switch reboot, multiple Solaris zones hosting Oracle RAC databases unexpectedly hang and/or transition to the state "shutting_down". This can be observed using the zoneadm(1M) command:
# zoneadm list -civ
  ID NAME             STATUS        PATH                        BRAND    IP
   0 global           running       /                           solaris  shared
   5 etc3-exa4dbadm01 shutting_down /zoneHome/etc3-exa4dbadm01  solaris  excl
  15 etc3-exa5dbadm01 shutting_down /zoneHome/etc3-exa5dbadm01  solaris  excl
  16 etc3-exa1dbadm01 shutting_down /zoneHome/etc3-exa1dbadm01  solaris  excl
  17 etc3-exa2dbadm01 running       /zoneHome/etc3-exa2dbadm01  solaris  excl
  18 etc3-exa3dbadm01 running       /zoneHome/etc3-exa3dbadm01  solaris  excl
  19 etc3-exa7dbadm01 running       /zoneHome/etc3-exa7dbadm01  solaris  excl
To fully recover from a hung zone, reboot the database domain that contains the zone.
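The check for zones stuck in the "shutting_down" state can be scripted. The following is a minimal sketch, assuming the zoneadm output uses whitespace-separated columns in the order shown above (ID, NAME, STATUS, ...). The function name hung_zones is hypothetical.

```shell
# Hypothetical helper for illustration.
# Reads "zoneadm list -civ" output on stdin and prints the NAME of every
# zone whose STATUS column is "shutting_down". The header line is skipped.
hung_zones() {
    awk 'NR > 1 && $3 == "shutting_down" { print $2 }'
}
```

For example: zoneadm list -civ | hung_zones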
Domains running Oracle RAC databases in the global zone will experience a node eviction panic and reboot. This is indicated by the following message in the /var/adm/messages file:
Nov 14 14:02:11 sbcdbadm02 unix: [ID 156897 kern.notice] forced crash dump initiated at user request
In addition, messages similar to the following appear in the diskmon.trc file after the rebooted switch comes back online:
ssnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.
ssnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.
ssnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.
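Both log indications described above can be searched for with standard text tools. The following is a minimal sketch; the function names are hypothetical, and the location of diskmon.trc is not stated in this document, so the trace is read from stdin.

```shell
# Hypothetical helpers for illustration.

# Count diskmon retry-limit messages in a trace supplied on stdin.
# Prints 0 when there are none.
count_diskmon_giveups() {
    awk '/ssnet_connect_to_box: Giving up on box/ { n++ } END { print n + 0 }'
}

# Print forced crash dump notices from a messages file supplied on stdin.
# Prints nothing (and still returns success) when there are none.
forced_crash_dumps() {
    grep 'forced crash dump initiated at user request' || true
}
```

For example: forced_crash_dumps < /var/adm/messages, and count_diskmon_giveups < diskmon.trc, using the path to diskmon.trc on your system.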
Workaround
To work around this issue on systems running 11.2.1.5.0 or later, see <Document:2044825.1> "How-To Update the OSC-Exawatcher Package on All Database Domains."
To mitigate the issue on SuperCluster systems running 11.2.5.5.0:
1. Ensure that the correct version of osc-exawatcher is installed. See <Document:2044825.1> "How-To Update the OSC-Exawatcher Package on All Database Domains."
2. Apply IDR2000.3 or later, which is available from MOS as patch 21683427. See the critical issues document <Document:1452277.1> for further details.
Note: SuperCluster M7 running Solaris 11.3 already has both of the above workarounds applied. See <Document:2044825.1> "How-To Update the OSC-Exawatcher Package on All Database Domains" for details on how to count total connections and determine whether your SuperCluster M7 is at risk.
Resolution
This issue is addressed in the following release:
SPARC Platform
- Solaris 11.3.7.6.0 or later
History
18-Aug-2015: Document released, status is Workaround
23-Sep-2015: Added note to Workaround for Cluster 11.2.5.5
26-Jun-2017: Updated for Solaris SRU fix with corrected bug ID
This bug was introduced by the putback for BugID 16024464 - QoS - Segregate RDS traffic based on SL (PSARC/2013/237 - IB QOS for RDSv3).
BugID 21417505, which was originally listed in the Alert for this issue, was closed as a duplicate of 22380320, which is fixed in Solaris 11.3.7.6.0.
NOTE: If a customer system is running Solaris 11.2 and has already exceeded the limit of 200 RDS connections, there are two choices:
1) Avoid InfiniBand switch reboots
2) Disable the NRM / QOS feature on all 12c RAC nodes and cells
See also <Document:2044825.1> "How-To Update the OSC-Exawatcher Package on All Database Domains" for more information.
Questions regarding this document should be addressed to
sunalertpublication_us_grp@oracle.com, with a copy to the
submitter/responsible engineer listed below.
Internal Contributor/Submitter: james.gates@oracle.com
Internal Eng Responsible Engineer: sherman.pun@oracle.com
Oracle Knowledge Analyst: david.mariotto@oracle.com
Internal Eng Business Unit Group: Systems RPE
Internal Associated SRs:
Internal Resolution Patches: TBD
References
<BUG:21417505> - SSC : REBOOTING OF IB SWITCH CAUSING EVICTIONS OF DB IN ZONES
<NOTE:2044825.1> - How-To Update the OSC-Exawatcher Package on All Database Domains
Attachments
This solution has no attachment