
Asset ID: 1-72-2043654.1
Update Date: 2017-02-07
Keywords:

Solution Type: Problem Resolution (Sure)

Solution 2043654.1: SuperCluster - Infiniband switch reboot may cause database evictions if there are a large number of RDS connections


Related Items
  • Oracle SuperCluster M6-32 Hardware
  • SPARC SuperCluster T4-4 Half Rack
  • Oracle SuperCluster T5-8 Half Rack
  • Oracle SuperCluster T5-8 Full Rack
  • SPARC SuperCluster T4-4 Full Rack
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST


When a SuperCluster Infiniband switch is rebooted, database nodes may evict on domains with a large number of RDS connections. The problem is caused by a timeout threshold being reached before all the RDS connections can be re-established. A number of factors affect the number of RDS connections. This document explains how to determine the current number of RDS connections and provides prediction formulae for estimating the likely number of RDS connections after applying a QFSDP. If the prediction formulae indicate that applying the QFSDP would exceed the current safe threshold of RDS connections, file an SR and await further advice. The prediction formulae should also be used before deploying additional databases on a system.

In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

SPARC SuperCluster T4-4 Full Rack - Version All Versions to All Versions [Release All Releases]
SPARC SuperCluster T4-4 Half Rack - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster T5-8 Full Rack - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster T5-8 Half Rack - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster M6-32 Hardware - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on SPARC (64-bit)

Symptoms

One or more database nodes may evict due to a diskmon split brain for RAC, or diskmon may fence off cells for non-RAC.

If this has occurred, indications similar to the following can be found in the diskmon.trc file after the rebooted switch comes online:

ossnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.
ossnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.
ossnet_connect_to_box: Giving up on box <IPADDR> as retry limit (7) reached.

If the impacted database is in a zone, the zone may transition to the "shutting_down" state and possibly hang.

This may be observed using the zoneadm command:

# zoneadm list -civ
ID NAME STATUS PATH BRAND IP
0 global running / solaris shared
5 etc3-exa4dbadm01 shutting_down /zoneHome/etc3-exa4dbadm01 solaris excl
15 etc3-exa5dbadm01 shutting_down /zoneHome/etc3-exa5dbadm01 solaris excl
16 etc3-exa1dbadm01 shutting_down /zoneHome/etc3-exa1dbadm01 solaris excl
17 etc3-exa2dbadm01 running /zoneHome/etc3-exa2dbadm01 solaris excl
18 etc3-exa3dbadm01 running /zoneHome/etc3-exa3dbadm01 solaris excl
19 etc3-exa7dbadm01 running /zoneHome/etc3-exa7dbadm01 solaris excl

To fully recover from a hung zone, the domain containing the database zones needs to be rebooted.

Changes

When a SuperCluster Infiniband switch is rebooted, database nodes may evict on domains with a large number of RDS connections. The figures stated in this document are on a per-LDom (global zone) basis, not cumulative across all LDoms (global zones).

There are a number of factors which impact the number of RDS connections, in particular Network Resource Management (a.k.a. Exadata QOS / TOS), which is enabled in Database 12.1.0.2, Solaris 11.2 SRU1, and 12.1.2.x.x Exadata Storage Cells. These, or later, versions are included in the SuperCluster QFSDP from July 2015 onwards. Other factors include the number of Databases, the number of Exadata Storage Cells they use, etc.

This document provides information for users to determine their current number of RDS connections and prediction formulae to determine the likely number of RDS connections after applying the July 2015 QFSDP or later.

The July 2015 QFSDP or later will be suitable for most SuperCluster customers to install as-is. In particular, the enhancements contained in the July 2016 QFSDP and the further enhancements contained in the Oct 2016 QFSDP make these QFSDPs suitable for the vast majority of customers.

If the prediction formulae indicate that applying this QFSDP would exceed the current safe threshold of RDS connections, then file an SR and await further advice.

The prediction formulae should also be used before deploying additional Databases on a system.

Cause

The problem is due to a timeout threshold being reached before all the RDS connections can be re-established on SuperCluster domains with a large number of RDS connections.  

Solution

Incremental improvements have been made in Solaris 11.3 SRU7 (in July 2016 QFSDP) and SRU11 (in Oct 2016 QFSDP).

The safe RDS connection limit is:

  • 200 for systems below Solaris 11.3 SRU7
  • 700 for systems running Solaris 11.3 SRU7 (July 2016 QFSDP level)
  • 1,800 for systems running Solaris 11.3 SRU11 (Oct 2016 QFSDP level)
  • 3,400 for systems running Solaris 11.3 SRU14 (Jan 2017 QFSDP level)
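To determine which limit applies, check the Solaris SRU level of each database domain. One way to do this (output details vary by release) is to inspect the 'entire' incorporation, whose version string embeds the SRU number, for example 11.3.7.x.x indicating Solaris 11.3 SRU7:

# pkg info entire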

1. To determine the number of RDS connections currently on a System, run the shell script below (conn-cnt.sh) in the global zone of all database domains. If the reported number of RDS connections is below the safe RDS connection limit in each domain, then your system is not currently at risk. Proceed to step 2.

If the number of RDS connections is above the safe RDS connection limit in any domain, log a service request for further assistance.

$ cat conn-cnt.sh

#!/bin/sh
#
# conn-cnt.sh - count the RDS connections in the global zone and in all
# non-global zones of this database domain.
#
# Gather RDS connections from the global zone.
rds-info -n > conn_rds.txt
#
# Gather RDS connections from all non-global zones.
zones=`zoneadm list | grep -v global`
for i in $zones
do
    zlogin $i "rds-info -n" >> conn_rds.txt
done
#
# Strip the column headers, loopback entries and blank lines so that
# one line remains per RDS connection.
grep -v "RDS Connections" conn_rds.txt | grep -v "LocalAddr" | \
    grep -v "127.0.0.1" | sed '/^$/d' > totalcount
#
# Record the total count (number only, without the file name).
tc=`wc -l totalcount | awk '{print $1}'`
rm totalcount conn_rds.txt
#
# Create a log for offline reference.
uname -a > conn_rds.txt
date >> conn_rds.txt
rds-info -n >> conn_rds.txt
for i in $zones
do
    echo $i >> conn_rds.txt
    zlogin $i "rds-info -n" >> conn_rds.txt
    echo " " >> conn_rds.txt
done
#
echo "Total number of rds connections detected : $tc" >> conn_rds.txt
echo "Total number of rds connections detected : $tc"
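The script should be run as root from the global zone of each database domain (it uses zlogin to collect the per-zone counts), for example:

# sh conn-cnt.sh
Total number of rds connections detected : <count>

The full rds-info output for the global zone and each non-global zone is also saved in conn_rds.txt for offline reference.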

2. Before applying the July 2015 QFSDP or later, creating more database zones, adding storage expansion or migrating database versions - for example, to 12.1.0.2 - use the following formulae to predict the likely resultant number of RDS connections. If the prediction is above the safe RDS connection limit, log a service request for further assistance.

On a database domain in every RAC cluster:

B is the number of IP addresses specified in the file /etc/oracle/cell/network-config/cellinit.ora, which can be retrieved as follows:

# grep "^ipaddress" /etc/oracle/cell/network-config/cellinit.ora | wc -l

C is the number of IP addresses specified in the file /etc/oracle/cell/network-config/cellip.ora. Note: there may be more than one IP address per line, and each should be counted.
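As a convenience, the IP addresses in cellip.ora can be counted with a one-liner similar to the following (a sketch only, assuming the cell entries are dotted-quad IPv4 addresses; it counts every address regardless of how many appear on a line):

# nawk '{ n += gsub(/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/, "&") } END { print n }' /etc/oracle/cell/network-config/cellip.ora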

N is the number of nodes in the cluster. Run 'olsnodes' as the root or grid user and count the nodes:

# $GRID_HOME/bin/olsnodes | wc -l

For 12.1.0.2 databases, the formula is:

  B(B*N*2 + C*7)

For 11g and 12.1.0.1 database versions, the formula is:

  B(B*N*2 + C*2)


Sum the above for all database zones under a single domain. If there are no zones, it is simply the sum over all databases running in the domain itself.

For example, rds_count = Zone_1_12.1.0.2[B(B*N*2 + C*7)] + Zone_2_11.2.0.4[B(B*N*2 + C*2)] + Zone_3_12.1.0.1[B(B*N*2 + C*2)].
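As a worked example, a zone with B=2, C=7 and N=2 running a 12.1.0.2 database contributes 2*(2*2*2 + 7*7) = 114 RDS connections. The following is a minimal shell sketch (not an official tool) of the per-zone calculation; the values of B, C, N and DBVER are placeholders to be replaced with the figures gathered above:

#!/bin/sh
# Sketch: predict the RDS connection count for one database zone.
B=2              # number of IP addresses in cellinit.ora
C=7              # number of IP addresses in cellip.ora
N=2              # number of nodes reported by olsnodes
DBVER=12.1.0.2   # database version running in this zone

case $DBVER in
  12.1.0.2) rds=$(( B * (B * N * 2 + C * 7) )) ;;   # 12.1.0.2 formula
  *)        rds=$(( B * (B * N * 2 + C * 2) )) ;;   # 11g / 12.1.0.1 formula
esac
echo "Predicted RDS connections for this zone: $rds"
# Sum this figure across all database zones in the domain and compare the
# total against the safe RDS connection limit for the domain's SRU level.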

This bug was introduced by the putback for Bug 16024464 - QoS - Segregate RDS traffic based on SL (PSARC/2013/237 - IB QOS for RDSv3).
<Bug 22380320> - Multiple worker threads are required for connection scaling (formerly <Bug 21417505> - rebooting of ib switch causing local zones going to shutting down state) was addressed in Solaris 11.3 SRU7 (in the July 2016 QFSDP). The safe RDS connection limit was raised further in Solaris 11.3 SRU11, with additional enhancements in Solaris 11.3 SRU14.

Safe RDS Connection Limits:
  • 200 for systems below Solaris 11.3 SRU7
  • 700 for systems running Solaris 11.3 SRU7 (July 2016 QFSDP)
  • 1,800 for systems running Solaris 11.3 SRU11 (Oct 2016 QFSDP)
  • 3,400 for systems running Solaris 11.3 SRU14 (Jan 2017 QFSDP)
NOTE: 

If a customer system has already exceeded the safe RDS connection limit, there are two choices (in the following order of preference):

1) Avoid Infiniband switch reboots
2) Disable the NRM / QOS feature on all 12c RAC nodes

Option 1 is preferable since it does not involve any disruptive changes that would later have to be undone once the bug is fixed (a step that might be forgotten or overlooked). However, it still leaves the customer vulnerable should an Infiniband switch fail. If the customer has experienced leaf switch failures or node evictions in the recent past, then consider implementing option 2.

Disabling NRM / QOS is the last choice and is not generally recommended, since it requires setting a hidden parameter. It also requires that the compute nodes and the Exadata storage cells be rebooted at the same time to remove existing RDS QOS connections; otherwise they will persist. Allow at least 20 minutes between IB switch reboots.

1) Disable NRM / QOS on all 12c RAC nodes as follows:

- As root on one RAC node stop the cluster:

# crsctl stop cluster -all

- As root on all RAC nodes disable restart and stop OHAS:

# crsctl disable crs
# crsctl stop crs [ -f to force if needed ]

* CRS should be stopped on all nodes that talk to the cells.

- As root on all RAC nodes edit /etc/oracle/*/*/cellinit.ora, adding:

_skgxp_ctx_flags1=8388608
_skgxp_ctx_flags1mask=8388608

- As root on all RAC nodes shutdown Solaris:

# init 0

2) Reboot all cells to purge previously established NRM / QOS connections.

* Take proper precautions to ensure other RAC clusters sharing the cells are prepared to lose access to all the cells.

- As root on all cells reboot Linux:

# reboot

- As root verify celld service post reboot:

# service celld status

3) Boot RAC nodes and then restart CRS as follows:

- On all RAC nodes boot Solaris:

> boot

* For a zone RAC node use 'zoneadm -z <zonename> boot' from GZ.

- As root on all RAC nodes start the cluster:

# crsctl start crs (will take some time)

- As root on all RAC nodes enable restart:

# crsctl enable crs

- As root verify all resources are up:

# crsctl stat res -t

4) Verify RDS connections:

- As root on all RAC nodes, verify that only ToS/SL values 0 and 4 are in use:

# rds-info -n

RDS IB Connections:
LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
192.168.10.1 192.168.10.1 0 0 fe80::10:e000:14a:a731 fe80::10:e000:14a:a731
192.168.10.1 192.168.10.1 4 4 fe80::10:e000:14a:a731 fe80::10:e000:14a:a731
192.168.10.1 192.168.10.57 0 0 fe80::10:e000:14a:a731 fe80::10:e000:14a:a631
192.168.10.1 192.168.10.57 4 4 fe80::10:e000:14a:a731 fe80::10:e000:14a:a631
192.168.10.1 192.168.10.58 0 0 fe80::10:e000:14a:a731 fe80::10:e000:14a:a632
192.168.10.1 192.168.10.59 0 0 fe80::10:e000:14a:a731 fe80::10:e000:159:3589
192.168.10.1 192.168.10.59 4 4 fe80::10:e000:14a:a731 fe80::10:e000:159:3589
192.168.10.1 192.168.10.60 0 0 fe80::10:e000:14a:a731 fe80::10:e000:159:358a
192.168.10.1 192.168.10.60 4 4 fe80::10:e000:14a:a731 fe80::10:e000:159:358a
192.168.10.1 192.168.10.61 0 0 fe80::10:e000:14a:a731 fe80::10:e000:159:7909
192.168.10.1 192.168.10.61 4 4 fe80::10:e000:14a:a731 fe80::10:e000:159:7909
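
As a quick sanity check (a sketch only, relying on the column layout shown above where ToS is the third field), the following prints any connection whose ToS is not 0 or 4; no output means only ToS 0 and 4 are in use:

# rds-info -n | nawk '$3 ~ /^[0-9]+$/ && $3 != 0 && $3 != 4'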

 

References

<NOTE:1452277.1> - SuperCluster Critical Issues

Attachments
This solution has no attachment