Tuning for SPARC SuperCluster and Solaris X86-64 Exadata (X2-2) RDS issues contributing to RDS Latency, RAC Node Evictions, Intermittent spikes in cluster waits and ORA-27300 MTU errors

Asset ID:	1-72-1498896.1
Update Date:	2014-10-18
Keywords:

Solution Type:	Problem Resolution Sure

Solution 1498896.1 : Tuning for SPARC SuperCluster and Solaris X86-64 Exadata (X2-2) RDS issues contributing to RDS Latency, RAC Node Evictions, Intermittent spikes in cluster waits and ORA-27300 MTU errors

Applies to:

Exadata Database Machine X2-2 Full Rack - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine X2-2 Half Rack - Version All Versions to All Versions [Release All Releases]
SPARC SuperCluster T4-4 - Version All Versions to All Versions [Release All Releases]
Oracle Database - Enterprise Edition - Version 11.2.0.3 to 11.2.0.3 [Release 11.2]
Oracle SuperCluster T5-8 Hardware - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on SPARC (64-bit)
This tuning is only appropriate for NUMA systems such as the T4-4 compute node in SPARC SuperCluster and Solaris x86-64 on X2-2. The tuning is not appropriate for general purpose LDoms within the SPARC SuperCluster.

Symptoms

Scope

This tuning is only appropriate for NUMA systems such as the T4-4 compute node in SPARC SuperCluster and Solaris x86-64 on X2-2. The tuning is not appropriate for general purpose LDoms within the SPARC SuperCluster.

Problem

There have been instances of split-brain and node evictions on some customer SuperCluster and Solaris X86-64 on X2-2 systems. These issues appear to be workload related and can affect some customers more than others. To determine whether your customer is hitting this issue, review the symptoms below.

Symptoms

diskmon.log

2012-10-14 00:30:58.012: [ DISKMON][11154:12] SKGXP:[100f97ba0.1139]{0}: SKGXP_DO_HEART_BEAT_RESP: NO HB PENDING source: 0 (max 2) in response from 192.168.20.9 mhbr 192.168.20.0/9
2012-10-14 00:30:58.013: [ DISKMON][11154:12] SKGXP:[100f97ba0.1140]{0}:        SSKGXPT 101105b60 flags 0x2 { WRITE } sockno 122 IP 192.168.20.9 RDS 20143 lerr 0
2012-10-14 00:30:58.013: [ DISKMON][11154:12] SKGXP:[100f97ba0.1141]{0}:        SSKGXPT 101105b90 flags 0x2 { WRITE } sockno 123 IP 192.168.20.9 RDS 20143 lerr 0
2012-10-14 00:30:58.013: [ DISKMON][11154:12] SKGXP:[100f97ba0.1142]{0}: SKGXPID 1105b14 vers 0 conproto 1 flags 8 magic 4c89
.
.
.
2012-10-14 00:30:58.037: [ DISKMON][11154:12] dskm_ant_rsc_monitor_start: rscnam: o/192.168.20.9 rsc: 1010b83e0 state: UNREACHABLE reconn_attempts: 7 last_reconn_ts: 1350199850
2012-10-14 00:30:58.037: [ DISKMON][11154:12] dskm_queue_tcpmon_request: posting
2012-10-14 00:30:58.037: [ DISKMON][11154:12] dskm_post_tcpmon_thrd
2012-10-14 00:30:58.037: [ DISKMON][11154:3] dskm_tcpmon_thrd_main: posted, poll returned with retcode = 45
2012-10-14 00:30:58.037: [ DISKMON][11154:3] dskm_tcpmon_thrd_main: Got a request with type 2, cellname = o/192.168.20.9, cellname length 15, cell incarnation = 0
2012-10-14 00:30:58.048: [ DISKMON][11154:12] dskm_health_check_ssb2: Checking if Cell o/192.168.20.9 is UNREACHABLE from all the nodes
.
.
.
2012-10-14 00:31:02.271: [ DISKMON][11154:4] dskm_get_evt_mbr: member 2 signaled the event
2012-10-14 00:31:02.280: [ DISKMON][11154:4] dskm_cell_health_resp1: Encounter a split-brain with node 2, suicide self....
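
To check a diskmon.log for the signature shown in the excerpt above, the log can be searched for the split-brain and UNREACHABLE messages. The following is a minimal sketch only: the function name find_split_brain is hypothetical, and it assumes the message text matches the excerpt exactly.

```shell
# find_split_brain (hypothetical helper): print diskmon.log lines that report
# a split-brain or a cell in the UNREACHABLE state.
# The log is read from standard input.
find_split_brain() {
    grep -E 'Encounter a split-brain|state: UNREACHABLE'
}
```

Usage: find_split_brain < diskmon.log, where diskmon.log is the log file from the affected node.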

Panic string / System corefile

The system corefile generated from the resultant panic should be examined to determine if other factors are at play. The panic string typically generated is:

panic[cpu64]/thread=30167a04700: 
forced crash dump initiated at user request

 
000002a11f863930 genunix:kadmin+5a0 (0, 0, 10, 125c400, 5, 1)
%l0-3: 000000000125c420 000000000125c400 0000000000000004 0000000000000004
%l4-7: 0000000000000208 0000000000000010 0000000000000004 0000000000000004
000002a11f863a00 genunix:uadmin+1c0 (1, 604e7775a98, 0, 1, 5, 5)
%l0-3: 00000000fd4a0000 000000000000fd4a 0000000000000004 000003002a5b6000
%l4-7: 00005a2cf7153e2e 0000000000000000 0000000000000000 0000000000000000

rds-ping latencies and ib_tx_ring_full

Large rds-ping latencies (thousands of usec) are observed on the system around the time of the event. These are recorded in OSWatcher data. Normal response time between database nodes is around 100 usec.

$ rds-ping  -c 10 -I 192.168.10.10 192.168.10.8
      1: 7701 usec
      2: 6634 usec
      3: 8448 usec
      4: 5395 usec
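
Slow samples can be picked out of rds-ping output with a small filter. This is a sketch under stated assumptions: the function name slow_rds_pings is hypothetical, the input is assumed to have the "N: value usec" layout shown above, and the default threshold of 1000 usec is an illustrative choice based on the "thousands of usec" symptom, not an official limit.

```shell
# slow_rds_pings (hypothetical helper): read rds-ping output on stdin and
# print the samples whose round-trip time exceeds a threshold (in usec).
# The 1000 usec default is an assumption, not a documented limit.
slow_rds_pings() {
    threshold="${1:-1000}"
    awk -v limit="$threshold" '
        # Sample lines look like "      1: 7701 usec"
        $3 == "usec" && ($2 + 0) > (limit + 0) { print $1, $2, "usec" }
    '
}
```

Usage: rds-ping -c 10 -I 192.168.10.10 192.168.10.8 | slow_rds_pings 1000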

ib_tx_ring_full will increment rapidly over a short period prior to the event. Note: ib_tx_ring_full is cumulative since boot, so a large value alone does not necessarily indicate a problem. A steady, rapid increase over a short period is what indicates a problem.

# date; rds-info -c | egrep 'tx_ring_full'
Mon Oct 15 07:44:26 PDT 2012
          ib_tx_ring_full        102047725
# date; rds-info -c | egrep 'tx_ring_full'
Mon Oct 15 07:44:26 PDT 2012
          ib_tx_ring_full        102049072
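
Because the counter is cumulative, what matters is the difference between two readings. The sketch below extracts the value from rds-info -c output and computes that difference. Both function names are hypothetical, and the parsing assumes the two-column layout shown above.

```shell
# extract_tx_ring_full (hypothetical helper): print the ib_tx_ring_full
# value from `rds-info -c` output supplied on stdin.
extract_tx_ring_full() {
    awk '$1 == "ib_tx_ring_full" { print $2 }'
}

# tx_ring_full_delta (hypothetical helper): given an earlier and a later
# cumulative reading, print the increase between them.
tx_ring_full_delta() {
    echo $(( $2 - $1 ))
}
```

For the two samples above, tx_ring_full_delta 102047725 102049072 prints 1347. What counts as a problematic rate is workload dependent, and this note does not define a threshold.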

ORA-27300 errors

The instance alert log shows errors such as:

ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:mtu select abnormal return 
failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvfymtu
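
To check whether an alert log contains this error stack, search for the failure location reported in the ORA-27302 line. This is a minimal sketch: the function name count_mtu_failures is hypothetical, and it assumes the message text matches the excerpt above.

```shell
# count_mtu_failures (hypothetical helper): count the lines in an alert log
# (read from stdin) that report a failure at skgxpvfymtu.
count_mtu_failures() {
    # grep -c prints 0 and returns non-zero when nothing matches;
    # "|| true" keeps the helper usable under "set -e".
    grep -c 'ORA-27302: failure occurred at: skgxpvfymtu' || true
}
```

Usage: count_mtu_failures < alert.log, where alert.log stands for the instance alert log.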

Changes

Cause

These events are caused by latencies in RDSv3 communication between nodes of the SuperCluster. It has been determined that these latencies can be caused by Oracle RT processes starving the rdsv3 worker threads of CPU time. The current remediation is to disable NUMA object binding within the kernel, which allows the rdsv3 worker threads to be scheduled on an alternate CPU in the system.

Solution

On SuperCluster, make sure the ssctuner service is installed and running. It sets this and all other /etc/system best practices. The manual tuning approach applies to Solaris Exadata as well as to other Solaris 11.0 and 11.1 RAC systems outside of engineered systems.

Tuning

Please note that even though this text is marked as internal only, the tuning steps may be delivered to customers via a Service Request.

A reboot is required after making these changes.

SuperCluster

To effect this change, /etc/system should be updated on all 'Exa' domains in the SuperCluster, i.e., all domains running the 11gR2 database as part of the Exadata stack. After /etc/system has been updated, a reboot is required.

exclude:nxge

set numaio_bind_objects=0

It's also recommended at this time to disable intrd. The following is persistent across reboots.

# svcadm disable intrd

Please note that numaio_bind_objects=0 is no longer a valid parameter in Solaris 11.2. When SuperCluster moves to this version, ssctuner will remove this setting.
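
When the setting is applied manually, a guard on the Solaris release can help avoid adding it where it is no longer valid. The sketch below is illustrative only: the function name numaio_setting_applies is hypothetical, and the rule it encodes (11.0 and 11.1 qualify, everything else does not) is an assumption drawn from the releases named in this note. Obtaining the release string is left to the caller.

```shell
# numaio_setting_applies (hypothetical helper): succeed if the given Solaris
# release string is one for which this note describes the manual tuning.
# Assumption: only 11.0 and 11.1 qualify; 11.2 and anything else do not.
numaio_setting_applies() {
    case "$1" in
        11.0|11.0.*|11.1|11.1.*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: numaio_setting_applies 11.1 && echo "setting applies"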

Please note that for databases running in zones, these settings are all made in the global zone, except for the FX-60 thread priority. All work is still accomplished with ssctuner. FX-60 is also the default thread priority for LMS and LGWR.

Solaris Exadata

To effect this change, /etc/system should be updated on all compute nodes in the Solaris Exadata.

set numaio_bind_objects=0

It's also recommended at this time to disable intrd. The following is persistent across reboots.

# svcadm disable intrd
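
After editing, it can be useful to confirm that the file really contains an active copy of the setting rather than a commented-out one. The following is a minimal sketch: the function name has_numaio_setting is hypothetical. It relies on the /etc/system convention that lines beginning with "*" are comments, so they do not count as a match.

```shell
# has_numaio_setting (hypothetical helper): succeed if the given file holds
# an active "set numaio_bind_objects=0" line. Comment lines (starting with
# "*") do not match, because the pattern requires the line to begin with "set".
has_numaio_setting() {
    grep -Eq '^[[:space:]]*set[[:space:]]+numaio_bind_objects[[:space:]]*=[[:space:]]*0[[:space:]]*$' "$1"
}
```

Usage: has_numaio_setting /etc/system && echo "numaio_bind_objects=0 is set". The check only inspects the file; the setting takes effect after the reboot described above.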

References

<BUG:15821624> - SUNBT7203790 RDS STALLS ON SPARC SUPER CLUSTER
<BUG:15748320> - SUNBT7100788 MULTI-CPU BINDING FOR NUMA I/O
<BUG:12951619> - DATABASE TO USE CRITICAL THREADS FEATURE IN SOLARIS
<NOTE:1903388.1> - SuperCluster - ssctuner reference document
<NOTE:1424503.2> - Information Center: SuperCluster

Attachments

This solution has no attachment