Infiniband Switch Reboot May Cause Database Crash or Node Eviction on SuperCluster

Asset ID:	1-77-2044894.1
Update Date:	2017-06-26
Keywords:

Solution Type Sun Alert Sure

Solution 2044894.1 : Infiniband Switch Reboot May Cause Database Crash or Node Eviction on SuperCluster

Applies to:

Oracle SuperCluster T5-8 Full Rack
SPARC M6-32
Sun Infiniband HCA
Oracle SuperCluster Specific Software
SPARC T4-4
Information in this document applies to any platform.
SPARC
_________________________________________

Date of Workaround Release: 18-Aug-2015

Date of Resolved Release: 26-Jun-2017
_________________________________________

Description

If a large number of Reliable Datagram Socket (RDS) connections are established when a SuperCluster Infiniband switch is rebooted, single-instance Oracle 12c databases may crash due to 'diskmon' timeouts, and Solaris domains or zones running Oracle RAC database instances can reboot or hang. This can affect many domains or zones in a cluster.

Occurrence

This issue can occur in the following releases:

SPARC Platform

Solaris 11.2.1.5.0 through 11.3.6.5.0

for the following platforms:

Oracle SuperCluster T4-4, T5-8, M6-32 and M7

Notes:

    1. Since 22nd June 2015, Solaris 11.2.5.5.0 has been installed by default on all new SuperClusters.

    2. This issue does not occur on SuperCluster systems running any version of Solaris 11.1 SRU.

To determine the Solaris 11 SRU level, enter the following command:

      # pkg info entire | grep Version
      Version: 0.5.11 (Oracle Solaris 11.2.5.5)
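
The range check described above can be sketched as a small shell helper. This is only an illustration, not an Oracle-supplied tool: the function names (sru_from_pkg_info, is_affected_sru, sru_status) are hypothetical, and it assumes that a four-field version such as 11.2.5.5 can be treated as 11.2.5.5.0 by padding missing fields with zero.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of Solaris.

# Extract the dotted SRU string from "pkg info entire | grep Version" output
# read on stdin, e.g. "Version: 0.5.11 (Oracle Solaris 11.2.5.5)".
sru_from_pkg_info() {
    sed -n 's/.*(Oracle Solaris \([0-9.]*\)).*/\1/p' | head -n 1
}

# Succeed (exit 0) when the version lies in the affected range stated in
# this alert: 11.2.1.5.0 through 11.3.6.5.0 inclusive.
# Assumption: missing trailing fields are padded with 0.
is_affected_sru() {
    awk -v v="$1" -v lo="11.2.1.5.0" -v hi="11.3.6.5.0" '
    function key(s,    n, a, i, k) {
        # Build a fixed-width string so that string order equals version order.
        n = split(s, a, ".")
        k = ""
        for (i = 1; i <= 5; i++)
            k = k sprintf("%05d", (i <= n) ? a[i] : 0)
        return k
    }
    BEGIN { k = key(v); exit !((k >= key(lo)) && (k <= key(hi))) }'
}

# Print a one-word verdict for the given SRU string.
sru_status() {
    if is_affected_sru "$1"; then
        echo "affected"
    else
        echo "not-affected"
    fi
}
```

On a live system this might be used as: sru_status "$(pkg info entire | grep Version | sru_from_pkg_info)". A result of "affected" only means the SRU is within the range listed in this alert; the mitigations in the Workaround section still determine the actual exposure.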

Symptoms

Immediately following an Infiniband switch reboot, multiple Solaris zones hosting Oracle RAC databases will unexpectedly hang and/or transition to state "shutting_down". This can be observed using the zoneadm(1M) command:

      # zoneadm list -civ
        ID NAME             STATUS        PATH                         BRAND      IP
         0 global           running       /                            solaris    shared
         5 etc3-exa4dbadm01 shutting_down /zoneHome/etc3-exa4dbadm01   solaris    excl
        15 etc3-exa5dbadm01 shutting_down /zoneHome/etc3-exa5dbadm01   solaris    excl
        16 etc3-exa1dbadm01 shutting_down /zoneHome/etc3-exa1dbadm01   solaris    excl
        17 etc3-exa2dbadm01 running       /zoneHome/etc3-exa2dbadm01   solaris    excl
        18 etc3-exa3dbadm01 running       /zoneHome/etc3-exa3dbadm01   solaris    excl
        19 etc3-exa7dbadm01 running       /zoneHome/etc3-exa7dbadm01   solaris    excl

To fully recover from a hung zone, the database domain containing the zones must be rebooted.
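
When many zones are configured, the check for the "shutting_down" state can be scripted. The following is a minimal sketch, assuming the colon-separated output of the zoneadm parseable listing (zoneadm list -cp), in which the second field is the zone name and the third is the state. The function name stuck_zones is hypothetical.

```shell
#!/bin/sh
# Sketch only: stuck_zones is a hypothetical helper name.

# Read parseable zone records (zoneid:zonename:state:zonepath:...) on stdin
# and print the name of every zone whose state is "shutting_down".
stuck_zones() {
    awk -F: '$3 == "shutting_down" { print $2 }'
}
```

On a database domain this might be run as: zoneadm list -cp | stuck_zones. Any zone name printed corresponds to a zone in the state shown in the listing above.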

Domains running Oracle RAC databases in the global zone will experience a node eviction panic and reboot. This is indicated by the following message in the /var/adm/messages file:

      Nov 14 14:02:11 sbcdbadm02 unix: [ID 156897 kern.notice] forced crash dump initiated at user request

And, indications similar to the following can be found in the diskmon.trc file after the rebooted switch comes online:

      ssnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.
      ssnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.
      ssnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.
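
Both log signatures above can be searched for with standard tools. The sketch below is illustrative only: the function names are hypothetical, and the log file paths are passed in by the caller because the location of diskmon.trc depends on the installation and is not given in this document.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; file paths are supplied by the caller.

# Print how many forced crash dump messages appear in a messages file
# (for example /var/adm/messages).
forced_dump_count() {
    grep -c 'forced crash dump initiated at user request' "$1"
}

# Print "<count> <address>" for each box that diskmon gave up on,
# reading the given diskmon trace file.
diskmon_giveups() {
    awk '/ssnet_connect_to_box: Giving up on box/ {
             # The address is the word that follows "box".
             for (i = 1; i < NF; i++)
                 if ($i == "box") { seen[$(i + 1)]++; break }
         }
         END { for (b in seen) print seen[b], b }' "$1" | sort -k2
}
```

For example, forced_dump_count /var/adm/messages prints the number of matching panic messages, and diskmon_giveups followed by the path to diskmon.trc summarizes the give-up lines per box address.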

Workaround

To work around this issue for systems running 11.2.1.5.0 or later, see "How-To" <Document:2044825.1> "How-To Update the OSC-Exawatcher Package on All Database Domains."

To mitigate the issue for SuperCluster systems running 11.2.5.5.0:

1. Ensure the correct version of osc-exawatcher is installed. See "How-To" <Document:2044825.1> "How-To Update the OSC-Exawatcher Package on All Database Domains."

2. Apply IDR2000.3 or later, which is available from MOS as patch 21683427. See the critical issues <Document:1452277.1> for further details.

Note: SuperCluster M7 running Solaris 11.3 already has the above two workarounds applied. See "How-To" <Document:2044825.1> "How-To Update the OSC-Exawatcher Package on All Database Domains." for details on how to count total connections and determine if your SuperCluster M7 is at risk.

Resolution

This issue is addressed in the following release:

SPARC Platform

Solaris 11.3.7.6.0 or later

History

18-Aug-2015: Document released, status is Workaround
23-Sep-2015: Add note to Workaround for Cluster 11.2.5.5
26-Jun-2017: Updated for Solaris SRU fix with corrected Bug

This bug was introduced by the putback for BugID: 16024464 - QoS - Segregate RDS traffic based on SL (PSARC/2013/237 - IB QOS for RDSv3)

BugID 21417505, which was originally listed in the Alert for this issue, was closed as a duplicate of BugID 22380320, which is fixed in Solaris 11.3.7.6.0.

NOTE: If a customer system is running Solaris 11.2 and has already exceeded the limit of 200 RDS connections, there are two choices:

1) Avoid Infiniband switch reboots
2) Disable the NRM / QOS feature on all 12c RAC nodes and cells

See also: "How-To" <Document:2044825.1> "How-To Update the OSC-Exawatcher Package on All Database Domains" for more information.

Questions regarding this document should be addressed to
sunalertpublication_us_grp@oracle.com, with a copy to the
submitter/responsible engineer listed below.

Internal Contributor/Submitter: james.gates@oracle.com
Internal Eng Responsible Engineer: sherman.pun@oracle.com
Oracle Knowledge Analyst: david.mariotto@oracle.com
Internal Eng Business Unit Group: Systems RPE
Internal Associated SRs:
Internal Resolution Patches:TBD

References

<BUG:21417505> - SSC : REBOOTING OF IB SWITCH CAUSING EVICTIONS OF DB IN ZONES
<NOTE:2044825.1> - How-To Update the OSC-Exawatcher Package on All Database Domains
