SuperCluster - RDSinfoExaWatcher.sh in Exawatcher does not collect rds-ping information

Asset ID:	1-72-2119546.1
Update Date:	2016-03-24
Keywords:

Solution Type Problem Resolution Sure

Solution 2119546.1 : SuperCluster - RDSinfoExaWatcher.sh in Exawatcher does not collect rds-ping information

Applies to:

Oracle SuperCluster T5-8 Hardware - Version All Versions and later
SPARC SuperCluster T4-4 Full Rack - Version All Versions and later
SPARC SuperCluster T4-4 Half Rack - Version All Versions and later
Solaris SPARC Operating System - Version 10 1/13 U11 to 11.3 [Release 10.0 to 11.0]
Oracle SuperCluster M7 Hardware - Version All Versions and later
Oracle Solaris on SPARC (64-bit)

Symptoms

ExaWatcher collections on SuperCluster are missing rds-ping output, a key diagnostic data point. Without it, it can be hard to narrow down the exact time at which RDS activity started experiencing issues. This applies to all LDoms (global zones) and local zones.

Changes

Cause

The script was modified years ago because of RDS defects that have long since been fixed.

Solution

#cd /opt/oracle.ExaWatcher
#mv RDSinfoExaWatcher.sh

#./StopExaWatcher.sh

#vi RDSinfoExaWatcher.sh

#svcadm enable ExaWatcher

Insert the following code. Note that although the header comments show a changed script name (SuperClusterRDSinfoExaWatcher.sh), the file you create must keep the old name (RDSinfoExaWatcher.sh). The name in the header only differentiates this version from the Exadata version.

#!/bin/ksh
#
# SuperClusterRDSinfoExaWatcher.sh
#
# Copyright (c) 2013, 2016, Oracle and/or its affiliates. All rights reserved.
#
#    NAME
#      SuperClusterRDSinfoExaWatcher.sh
#
#    DESCRIPTION
#      This is a script to collect information related to RDS on SuperCluster
#      DB domains and zones. Useful for diagnosis following RAC node evictions
#
#    NOTES
#      This script is heavily modified from the original script designed to run
#      on Linux Exadata DB & cells nodes. Because of the highly virtualized
#      nature of SuperCluster, you can't easily determine cluster members from
#      things such as ibhosts command or cellip.ora file. So it uses Solaris 11
#      networking commands plus the output of rds-info to identify the RDS
#      connections established by RAC/grid, and rds-pings those addresses.
#      This script is designed only to run in Solaris 11 (or later) DB domains &
#      zones. It is not meant to run in Solaris 10 application domains. If this
#      script is run in Solaris 11 app or root domains it will do nothing since
#      no RDS connections will have been established by RAC/grid.
#
#    MODIFIED   (MM/DD/YY)
#    jamgates    01/27/16 - re-write for SuperCluster
#    jamgates    01/29/16 - minor fixes
#

umask 0037

echodo() { echo "# $@" ; "$@" ; }

check_os()
{
   # If this is an S10 domain or zone then exit
   if [[ `uname -r` == "5.10" ]]; then
       echo "[ERROR:`date +'%F %T %Z'`] This domain is running Solaris 10. RDS info is not relevant in a domain running anything less than Solaris 11. Exiting ...."
       exit
   fi
}

do_dlstat_on_ib_links()
{
   # This function gets all links on the Exadata (FFFF) IB partition
   # and runs dlstat 4 times with an interval of 1 second. The first
   # row output shows the total numbers since the creation of the link.
   # The subsequent rows show the normalized (per second) statistics.

   for LINK in `dladm show-part -p -o LINK,PKEY | grep ":FFFF$" | cut -d: -f1`
   do
       echodo dlstat -Z $LINK 1 4
   done
}

get_my_local_exadata_ib_ip_addresses()
{
   # This function grabs local IP addresses on all IPMP groups on the
   # Exadata (FFFF) IB partition. These will have connectivity to all
   # cells and should be specified as the "cluster_interconnects"
   # parameter in all DB init.ora files
   IPMPSTAT=`ipmpstat -o INTERFACE,GROUP,ACTIVE -P -i`

   for LINK in `dladm show-part -p -o LINK,PKEY | grep ":FFFF$" | cut -d: -f1`
   do
       for GROUP in `echo "$IPMPSTAT" | grep "^$LINK:" | grep ":yes$" | cut -d: -f2`
       do
           LOCAL_ADDR="$LOCAL_ADDR "`ipadm show-addr -p -o ADDR $GROUP | cut -d/ -f1`
           # Simultaneous calls to ipmpstat can overload in.mpathd.
           # Sleep between each call to
       done
   done

   if [[ "$LOCAL_ADDR" == "" ]]; then
       echo "[WARNING:`date +'%F %T %Z'`] No local IB IP addresses are configured on the Exadata (FFFF) partition. Either this domain/zone is mis-configured or this isn't a DB domain"
       echo ""
   else
       echo "[INFO:`date +'%F %T %Z'`] My (`hostname`) local IB IP addresses:"
       echo ""
       echo $LOCAL_ADDR
       echo ""
   fi
}

get_remote_exadata_ib_ip_addresses()
{
   # This function uses rds-info to identify remote addresses of all
   # current RDS connections. These will correspond to the other DB
   # nodes & cells in the same RAC clusters as this domain or zone
   # and will have been established by the RAC DBs on this domain or
   # zone

   rds-info -n | while read LOC REM TOS NEXTTX NEXTRX FLGS
   do
       if [[ "$FLGS" == *"C-" ]]; then
           # Flags containing --C- means the remote host is
           # successfully connected, so add it to the list
           REMOTE_ADDR="$REMOTE_ADDR "$REM
       fi
   done

   if [[ "$REMOTE_ADDR" == "" ]]; then
       echo "[WARNING:`date +'%F %T %Z'`] No established RDS connections (is RAC running?)"
       echo ""
   else
       # Sort the remote address list into unique addresses
       REMOTE_ADDR=`echo $REMOTE_ADDR | tr " " "\n" | sort -u`
   fi
}

split_remote_ib_ip_addresses()
{
   # This function splits the list of remote IP addresses into cell
   # nodes and DB nodes. This can be deduced a number of ways, not all
   # reliable though. Probably the simplest & most reliable is to check
   # for the address in the cellip.ora file. If the address isn't in the
   # file, we assume it's a DB node.

   CELLIP=/etc/oracle/cell/network-config/cellip.ora

   if [[ -r $CELLIP && -s $CELLIP ]]; then
       for ADDR in $REMOTE_ADDR
       do
           grep -q "cell=\"$ADDR\"" $CELLIP
           if [[ $? -eq 0 ]]; then
               CELLS="$CELLS "$ADDR
           else
               DBNODES="$DBNODES "$ADDR
           fi
       done
   else
       # cellip.ora is empty or doesn't exist? Plan B is to check the
       # output of 'ibhosts' which identifies storage cells with
       # "hostname C IP address[,IP address...] HCA-#" in the node
       # descriptor field.

       IBHOSTS=`ibhosts`
       for ADDR in $REMOTE_ADDR
       do
           echo "$IBHOSTS" | grep -q " C.*[ ,]$ADDR[ ,].*HCA-"
           if [[ $? -eq 0 ]]; then
               CELLS="$CELLS "$ADDR
           else
               DBNODES="$DBNODES "$ADDR
           fi
       done
   fi

   echo "[INFO:`date +'%F %T %Z'`] Connected remote IB IP addresses:"
   echo ""
   echo "Cells: "$CELLS
   echo "DB Nodes: "$DBNODES
   echo ""
}

do_rds_ping()
{
   # This function gets all local & remote IB IP addresses and rds-pings
   # each remote address from each local address. The ping is performed
   # 4 times with a (default) 1 second timeout. This is so we see a
   # reasonable sample of response times (since the first rds-ping can
   # often take a lot longer than subsequent). Note we don't ping local
   # addresses from local addresses because a) That doesn't really tell
   # us much about the health of the IB transport and b) RAC doesn't
   # establish loopback connections to itself anyway.

   get_my_local_exadata_ib_ip_addresses
   get_remote_exadata_ib_ip_addresses
   split_remote_ib_ip_addresses

   echo ""

   echo "[INFO:`date +'%F %T %Z'`] rds-ping to cells"
   for I_ADDR in $LOCAL_ADDR
   do
       for R_ADDR in $CELLS
       do
           echodo rds-ping -c 4 -I $I_ADDR $R_ADDR
           if [[ $? != 0 ]]; then
               echo "[WARNING:`date +'%F %T %Z'`] rds-ping to $R_ADDR failed"
               echodo ibdiagnet
           fi
       done
   done

   echo "[INFO:`date +'%F %T %Z'`] rds-ping to DB nodes"
   for I_ADDR in $LOCAL_ADDR
   do
       for R_ADDR in $DBNODES
       do
           echodo rds-ping -c 4 -I $I_ADDR $R_ADDR
           if [[ $? != 0 ]]; then
               echo "[WARNING:`date +'%F %T %Z'`] rds-ping to $R_ADDR failed"
               echodo ibdiagnet
           fi
       done
   done
}

########======Main=====#######

CounterLimit=6
ExaWatcherDir="/opt/oracle.ExaWatcher"
RDSinfoCounterFile="$ExaWatcherDir/tmp/RDSinfoCounter"

check_os

DATE=`date "+%F %T %Z"`
echo "     <$DATE>"
echo "     ==========================="
echo "     This is zone - `zonename`"
echo ""

# Check if an rds-info command is already running. This might indicate
# another ExaWatcher is already running and/or wedged. Running multiple
# rds-info commands can burden the system.

pgrep -f "rds-info"
if [[ $? -ne 0 ]]; then
   # rds-info (with no arguments) prints all data, which includes
   # socket & queue information, which can be large. kstat produces
   # a lot of output too. So these commands are only run once every
   # six times.

   if [[ ! -f $RDSinfoCounterFile ]]; then
       RDSinfoCounter=1
   else
       RDSinfoCounter=`cat $RDSinfoCounterFile`
   fi

   if [[ $RDSinfoCounter == 1 ]]; then
       # Full rds-info & kstats
       echo "===/usr/bin/rds-info==="
       echodo rds-info
   else
       # Just rds connections & counters
       echo "===/usr/bin/rds-info -Icn==="
       echodo rds-info -Icn
   fi

   let RDSinfoCounter=$RDSinfoCounter+1
   if [[ $RDSinfoCounter -gt $CounterLimit ]]; then
       RDSinfoCounter=1
   fi
   echo $RDSinfoCounter > $RDSinfoCounterFile

   echo "===/bin/netstat -rpn==="
   echodo netstat -rpn

   echo "===All nodes rds-ping==="
   do_rds_ping

   echo "===dlstat==="
   do_dlstat_on_ib_links
else
   echo "[WARNING:`date +'%F %T %Z'`] ExaWatcher has found another rds-info process running. This turn of collection will be skipped."
fi

exit 0

Save the file

#svcadm enable ExaWatcher

#./ExaWatcher

You may have to hit enter twice to get back to the prompt.
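
Once ExaWatcher has completed at least one RDS collection with the new script, one way to confirm the fix is to look for the rds-ping command lines that echodo writes to the output. The following is a minimal sketch, not part of ExaWatcher: the function name count_rds_pings is hypothetical, the path of the collected file is passed in by the caller (this note does not state where ExaWatcher stores it), and the file is assumed to be uncompressed text.

```shell
# Sketch only: count the rds-ping commands logged in one RDS collection file.
# count_rds_pings is a hypothetical helper name, not an ExaWatcher command.
count_rds_pings()
{
    # echodo in the script above logs each command as "# <command line>",
    # so every rds-ping attempt leaves a line starting with "# rds-ping ".
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c '^# rds-ping ' "$1"
}

# Usage (the path is supplied by the caller):
#   count_rds_pings /path/to/collected/RDSinfo/output
```

A count of 0 on a DB domain or zone where RAC is running suggests the collection still comes from the original script. A count of 0 is also expected when no RDS connections are established, because the script then has no remote addresses to ping.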

Please note that if you run a pkg fix on osc-exawatcher or apply any QFSDP prior to APR 2016, you will have to repeat these steps, as the repair or upgrade activity will put the original file back.
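
After a package repair or QFSDP upgrade, a quick check of the script header can show whether the original file has been restored. This is a minimal sketch under two assumptions: the helper name is_supercluster_rdsinfo is hypothetical, and the original script is assumed not to contain the string SuperClusterRDSinfoExaWatcher, which appears in the header of the replacement script.

```shell
# Sketch only: succeed if the given file carries the SuperCluster header.
# is_supercluster_rdsinfo is a hypothetical helper name.
is_supercluster_rdsinfo()
{
    # The replacement script names itself SuperClusterRDSinfoExaWatcher.sh
    # in its header comments. Assumption: the original file lacks that string.
    grep -q 'SuperClusterRDSinfoExaWatcher' "$1"
}

# Example call, using the directory from the steps above:
#   if is_supercluster_rdsinfo /opt/oracle.ExaWatcher/RDSinfoExaWatcher.sh; then
#       echo "SuperCluster version in place"
#   else
#       echo "original file restored, repeat the steps"
#   fi
```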

This will be corrected permanently in the APR 2016 QFSDP.
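
When reading the collected output, it helps to know that the script runs the full rds-info only once in every six collections and rds-info -Icn on the other five, driven by a counter file. The rotation can be tried in isolation with the sketch below. The helper name next_counter is hypothetical, and the loop only prints which variant would run. It does not call rds-info or touch the counter file.

```shell
# Sketch only: the 1..limit counter rotation from the script's main section.
# next_counter is a hypothetical helper, not part of ExaWatcher.
next_counter()
{
    # $1 = current counter value, $2 = limit (CounterLimit=6 in the script)
    n=$(( $1 + 1 ))
    if [ "$n" -gt "$2" ]; then
        n=1
    fi
    echo "$n"
}

# Walk through seven consecutive collections starting from a fresh counter.
c=1
run=1
while [ "$run" -le 7 ]; do
    if [ "$c" -eq 1 ]; then
        echo "run $run: full rds-info"
    else
        echo "run $run: rds-info -Icn"
    fi
    c=$(next_counter "$c" 6)
    run=$(( run + 1 ))
done
```

With a limit of 6, runs 1 and 7 print the full rds-info line and runs 2 through 6 print the rds-info -Icn line.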

Attachments

This solution has no attachment