SuperCluster - GI and DB homes linked with UDP instead of RDS lead to CSSD reporting "has a disk HB, but no network HB" and "CSSD aborting from thread GMClientListener"

Asset ID:	1-72-1916992.1
Update Date:	2016-07-06
Keywords:

Solution Type Problem Resolution Sure

Solution 1916992.1 : SuperCluster - GI and DB homes linked with UDP instead of RDS lead to CSSD reporting "has a disk HB, but no network HB" and "CSSD aborting from thread GMClientListener"

Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.3 and later
Solaris SPARC Operating System - Version 11.1 to 11.2 [Release 11.0]
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
SPARC SuperCluster T4-4 - Version All Versions and later
Oracle Solaris on SPARC (64-bit)
Oracle SuperCluster, any version, where Grid Infrastructure and/or Database Homes were installed without using Java One Command (JOC)

Symptoms

RAC CRS services on one or more nodes shut down intermittently and cannot be restarted.

OCSSD Log

[ CSSD][28]clssnmvDHBValidateNcopy: node 1, rmb-zpr-db-fin1, has a disk HB, but no network HB, DHB has rcfg 300902083, wrtcnt, 4912592, LATS 1807588635, lastSeqNo 4912589, uniqueness 1405754628, timestamp 1405999042/1899489711
[ CSSD][37]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
[ CSSD][28]clssnmvDHBValidateNcopy: node 1, rmb-zpr-db-fin1, has a disk HB, but no network HB, DHB has rcfg 300902083, wrtcnt, 4912595, LATS 1807589637, lastSeqNo 4912592, uniqueness 1405754628, timestamp 1405999043/1899490711
[ CSSD][37]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
[ CSSD][28]clssnmvDHBValidateNcopy: node 1, rmb-zpr-db-fin1, has a disk HB, but no network HB, DHB has rcfg 300902083, wrtcnt, 4912598, LATS 1807590638, lastSeqNo 4912595, uniqueness 1405754628, timestamp 1405999044/1899491712
[ CSSD][5]clssgmExecuteClientRequest: MAINT recvd from proc 2 (100e55210)

[ CSSD][5]clssgmShutDown: Received abortive shutdown request from client.
[ CSSD][5]###################################
[ CSSD][5]clssscExit: CSSD aborting from thread GMClientListener
[ CSSD][5]###################################
[ CSSD][5](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

One node will usually remain up, typically the master node. On that node, the following command will show rapidly accumulating Idle connections on the private interconnect. Typically, the other RAC nodes start to be evicted when the count reported by the command below reaches around 2200 idle connections.

Please note that if the GI/DB in question runs in Oracle Solaris Zones, you must run the netstat command below from within the local (non-global) zone. If the GI/DB in question is at the LDom level, run it from the global zone.

netstat -an | grep Idle | grep 192 | wc -l
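To watch the count over time rather than sampling it by hand, the check can be wrapped in a small helper. The following is a minimal sketch, not part of the original procedure: the function names `count_idle` and `report_idle` and the `PREFIX` and `THRESHOLD` variables are hypothetical, the `192` pattern is taken from the command above, and the 2200 figure is only the approximate eviction point quoted in this note.

```shell
#!/bin/sh
# Hypothetical helpers for tracking Idle UDP sockets on the private interconnect.
# PREFIX is the same pattern used in the netstat pipeline above; adjust it to
# match the private interconnect addresses of the environment.
PREFIX=${PREFIX:-192}
# Approximate point at which evictions were observed (assumption from this note).
THRESHOLD=${THRESHOLD:-2200}

# Reads `netstat -an` output on stdin and prints the number of lines that
# contain both "Idle" and the given pattern.
count_idle() {
    grep Idle | grep -c "$1"
}

# Prints a one-line status for a given count ($1) and threshold ($2).
report_idle() {
    if [ "$1" -ge "$2" ]; then
        echo "WARNING: $1 idle sockets (threshold $2)"
    else
        echo "OK: $1 idle sockets (threshold $2)"
    fi
}

# Example loop (run inside the zone or domain that hosts the GI/DB):
#   while :; do
#       report_idle "$(netstat -an | count_idle "$PREFIX")" "$THRESHOLD"
#       sleep 5
#   done
```

A steadily rising count between samples is the signal of interest; the threshold message is only a convenience.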

While Idle connections are accumulating, run the following dtrace probe on udp_bind, the call that allocates UDP sockets, for a few seconds and then stop it with Ctrl+C.

dtrace -n 'udp_bind:entry{@x[pid,execname,ustack()] = count();}'

If you see multiple entries for skgxp functions, that is indicative of the problem. A few every now and again is not a concern, but more than 10 or so within a few seconds is bad. In this case, over 100 matching calls were returned in 5 seconds.

43317 oracle
    libc.so.1`_so_bind+0x4
    libskgxp11.so`sskgxp_createport+0x2fc
    libskgxp11.so`_$o1cexiH0.skgxpicini+0x770
    libskgxp11.so`skgxpcini_with_stats+0x174
    oracle`ksxposdcini+0x32e0
    oracle`ksxppluginosd+0x1308
    oracle`ksxp_open+0x58c
    oracle`ksucrp+0x9f0
    oracle`opiino+0x5b4
    oracle`opiodr+0x48c
    oracle`opidrv+0x408
    oracle`sou2o+0x58
    oracle`opimai_real+0x1f8
    oracle`ssthrdmain+0x13c
    oracle`main+0x13c
    oracle`_start+0x17c
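If the dtrace output is saved to a file, the number of aggregated stacks that pass through the skgxp port-creation frame can be counted instead of read by eye. This is a rough sketch under assumptions: the function name `count_skgxp_stacks` and the output path are hypothetical, and the frame name `sskgxp_createport` is simply taken from the sample stack above, so it may differ in other library versions.

```shell
#!/bin/sh
# Reads saved dtrace aggregation output on stdin and prints how many stack
# entries contain the sskgxp_createport frame (one match per aggregated stack).
count_skgxp_stacks() {
    grep -c 'sskgxp_createport'
}

# Example capture that stops by itself after 5 seconds instead of Ctrl+C
# (tick-5s fires once 5 seconds have elapsed; /tmp/udp_bind.out is a
# hypothetical output path):
#   dtrace -n 'udp_bind:entry{@x[pid,execname,ustack()] = count();} tick-5s{exit(0);}' \
#       -o /tmp/udp_bind.out
#   count_skgxp_stacks < /tmp/udp_bind.out
```

Note that this counts distinct aggregated stacks, not the per-stack call totals that dtrace prints after each stack, so it gives a lower bound on the number of matching calls.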

Changes

The environment has Grid Infrastructure and Database Homes that were installed without using Java One Command (JOC).

Cause

Homes were installed without Java One Command (JOC) and were then not relinked with RDS. By default, RAC is installed linked with UDP.

Solution

Verify the condition by setting ORACLE_HOME to the home you are investigating and running the skgxpinfo command to see whether it reports rds or udp.

$ORACLE_HOME/bin/skgxpinfo
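When several homes need checking, the result of skgxpinfo can be turned into a verdict per home. The sketch below assumes skgxpinfo prints the bare protocol name (rds or udp) as described above; the function name `check_ipc` and the message wording are hypothetical.

```shell
#!/bin/sh
# Maps the protocol string reported by skgxpinfo ($1) to a short verdict.
check_ipc() {
    case "$1" in
        rds) echo "OK: linked with RDS" ;;
        udp) echo "RELINK NEEDED: linked with UDP" ;;
        *)   echo "UNKNOWN: $1" ;;
    esac
}

# Example, run as the owner of each home (replace the placeholders with the
# real GI and database home paths):
#   for home in <GI_HOME> <DB_HOME>; do
#       ORACLE_HOME=$home; export ORACLE_HOME
#       echo "$home: $(check_ipc "$("$ORACLE_HOME/bin/skgxpinfo")")"
#   done
```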

Alternatively you can find the information in the ASM and/or Database alert logs

grep 'cluster interconnect IPC version' /<path_to_oracle_base>/diag/rdbms/<sid_name>/<instance_name>/trace/alert*.log

The supported homes will show "cluster interconnect IPC version:Oracle RDS/IP (generic)"

The unsupported homes will show "cluster interconnect IPC version:Oracle UDP/IP (generic)"
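The alert log check can be scripted in the same spirit. This is a minimal sketch that classifies a single "cluster interconnect IPC version" line using the two strings quoted above; the function name `classify_ipc_line` is hypothetical, and the usage comment keeps the alert log path as the placeholder shown earlier.

```shell
#!/bin/sh
# Classifies one alert log line ($1) as rds, udp, or unknown, based on the
# "cluster interconnect IPC version" strings quoted in this note.
classify_ipc_line() {
    case "$1" in
        *"Oracle RDS/IP"*) echo rds ;;
        *"Oracle UDP/IP"*) echo udp ;;
        *)                 echo unknown ;;
    esac
}

# Example: classify the most recent matching line in an alert log, since the
# line is written at every instance startup and older entries may be stale:
#   line=$(grep 'cluster interconnect IPC version' \
#       /<path_to_oracle_base>/diag/rdbms/<sid_name>/<instance_name>/trace/alert*.log | tail -1)
#   classify_ipc_line "$line"
```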

If either the Grid Infrastructure or Database Homes are showing UDP, you will need to relink them with RDS.

Please note that this is an offline operation.

1) As the ORACLE_HOME/GI_HOME owner, stop all resources (database, listener, ASM, etc.) that are running from the home. When stopping a database, use the NORMAL or IMMEDIATE option.
2) If relinking the Grid Infrastructure (GI) home, as root, unlock the GI home: <GI_HOME>/crs/install/rootcrs.pl -unlock
3) As the ORACLE_HOME/GI_HOME owner, go to ORACLE_HOME/GI_HOME and cd to rdbms/lib
4) As the ORACLE_HOME/GI_HOME owner, issue "make -f ins_rdbms.mk ipc_rds ioracle" (repeat steps 3 and 4 for each Oracle Home, both GI and RDBMS)
5) If relinking the Grid Infrastructure (GI) home, as root, lock the GI home: <GI_HOME>/crs/install/rootcrs.pl -patch
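To review the order of operations before touching a home, the steps above can be printed as a dry-run plan. The sketch below only echoes the commands and the user who should run them; it does not stop resources or relink anything. The function name `relink_plan` and the `gi`/`db` argument are hypothetical conventions for this example.

```shell
#!/bin/sh
# Prints (does not execute) the relink commands for one home.
#   $1 = path to the home
#   $2 = "gi" for a Grid Infrastructure home, anything else for a database home
relink_plan() {
    home=$1
    kind=$2
    echo "owner: stop all resources running from $home"
    if [ "$kind" = "gi" ]; then
        echo "root: $home/crs/install/rootcrs.pl -unlock"
    fi
    echo "owner: cd $home/rdbms/lib && make -f ins_rdbms.mk ipc_rds ioracle"
    if [ "$kind" = "gi" ]; then
        echo "root: $home/crs/install/rootcrs.pl -patch"
    fi
}

# Example (placeholders stand for the real home paths):
#   relink_plan <GI_HOME> gi
#   relink_plan <DB_HOME> db
```

Printing the plan first makes it easier to confirm that the unlock and lock steps are applied only to the GI home.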

Please note that exachk will catch this condition for all Oracle Homes known to the OCR. If a home is not known to the OCR, then the databases running out of that home need to be indicated by passing the dbnames flag to exachk. This is documented in the Exachk User's Guide, which is included with the software download. Also note that, as a safety net, an enhancement is being added to SSCTUNER to check for this condition as well.

References

<BUG:19375096> - SSCTUNER SHOULD CHECK THAT DB HOMES ARE LINKED AGAINST RDS, INCLUDING DB ZONES.
<BUG:19341923> - ASM_XDMG_+ASM2 PROCESS HANGING IN MUNMAP
<BUG:19362035> - CSS ABORTS ON SECOND NODE AS ABORTING FROM THREAD GMCLIENTLISTENER
<BUG:17997507> - 11.2.0.4: XDMG PROCESS EXITS WITHOUT CLOSING SKGXP CONTEXT WHEN ORA-15311 IS SEE
<NOTE:1374110.1> - Top 5 issues for Instance Eviction
<NOTE:1676719.1> - Clusterware do not start on ALL nodes after reboot
<NOTE:330358.1> - Oracle Clusterware 10gR2/ 11gR1/ 11gR2/ 12cR1 Diagnostic Collection Guide

Attachments

This solution has no attachment