Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2149887.1
Update Date:2016-06-24

Solution Type  Problem Resolution Sure

Solution  2149887.1 :   SuperCluster: Transient Threads Can Lead to Instance Crashes, Node Evictions and Random Database or Application Performance Issues  


Related Items
  • Oracle SuperCluster T5-8 Full Rack
  • Solaris Operating System
  • Oracle SuperCluster M7 Hardware
  • Oracle SuperCluster T5-8 Half Rack
  • SPARC SuperCluster T4-4 Half Rack
  • Oracle SuperCluster T5-8 Hardware
  • SPARC SuperCluster T4-4
  • Oracle SuperCluster M6-32 Hardware

Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST
  • Tools>Primary Use>Configuration


This document describes a SuperCluster critical issue. Because it is classified as a critical issue, the item specified as the solution is considered mandatory.

Applies to:

Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
SPARC SuperCluster T4-4 - Version All Versions and later
SPARC SuperCluster T4-4 Half Rack - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle Solaris on SPARC (64-bit)

Symptoms

  • Node evictions due to heartbeat timeouts with no apparent problem on the IB fabric.
  • Instance evictions due to delayed lms or lmd ping acknowledgements with no apparent interconnect problems.
  • Unexplained delays in Java or shell script code on very busy application nodes.

Changes

The issue seems to be more prevalent in Solaris 11.2 and 11.3 and is frequently triggered by CPU saturation.

Cause

 Unpublished Bug 17697871

With the introduction of workload characterization optimization for threads, a thread can be marked as TRANSIENT if its CPU utilization does not exceed a certain threshold. The intent is to identify threads that use little CPU.

The transience counter (the t_transience field of kthread_t) is incremented each time a thread consumes less than thread_transience_pct (0.02% by default) of a CPU's resources. If that counter reaches 10, the thread is flagged as TRANSIENT.
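
The counting rule described above can be sketched as a small model. This is an illustration only, not Solaris kernel source: the Thread class, the account() function, and the idea of discrete utilization samples are assumptions made for the example. Only the field name t_transience, the 0.02% threshold, and the limit of 10 come from the text.

```python
# Illustrative model only; not Solaris kernel source.
from dataclasses import dataclass

THREAD_TRANSIENCE_PCT = 0.02  # default threshold from the text, in percent of one CPU
TRANSIENCE_LIMIT = 10         # counter value at which the thread is flagged


@dataclass
class Thread:
    t_transience: int = 0     # mirrors the kthread_t field named in the text
    transient: bool = False   # stands in for the TRANSIENT flag


def account(thread: Thread, cpu_pct: float) -> None:
    """Record one utilization sample (hypothetical helper)."""
    if cpu_pct < THREAD_TRANSIENCE_PCT:
        thread.t_transience += 1
    # Samples at or above the threshold are ignored here: the text does not
    # say how the counter is lowered or reset, so this sketch does not guess.
    if thread.t_transience >= TRANSIENCE_LIMIT:
        thread.transient = True


t = Thread()
for _ in range(9):
    account(t, 0.01)
flag_after_9 = t.transient    # still not flagged after nine low samples
account(t, 0.01)
flag_after_10 = t.transient   # flagged once the counter reaches 10
```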

CPUs running TRANSIENT threads are also flagged as CPU_DISP_TRANSIENT, which triggers some optimizations within the scheduler. Normally, idle CPUs steal threads from the dispatch queues of busy CPUs. The dispatch queues of a CPU flagged as CPU_DISP_TRANSIENT, however, are not considered for stealing even when idle CPUs are available, because Solaris assumes that the transient thread will leave the CPU soon. The problem appears when a transient thread running on a CPU starts to behave as non-transient: threads can then stay too long in that CPU's dispatch queues, causing latency bubbles and possible scheduling issues.
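
The effect on runnable threads can be shown with a toy model of the stealing step. It is a sketch under stated assumptions, not the Solaris dispatcher: the Cpu class, steal_for_idle_cpu(), and the thread names "lms0" and "worker" are hypothetical, and the real scheduler has many more inputs (priorities, locality, and so on).

```python
# Toy model of the behavior described above; not the Solaris dispatcher.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cpu:
    cpu_id: int
    disp_transient: bool = False  # stands in for CPU_DISP_TRANSIENT
    run_queue: List[str] = field(default_factory=list)


def steal_for_idle_cpu(cpus: List[Cpu]) -> Optional[str]:
    """Return a queued thread an idle CPU would take, or None (hypothetical helper)."""
    for cpu in cpus:
        if cpu.disp_transient:
            # Queue is skipped: the on-CPU thread is assumed to leave soon.
            continue
        if cpu.run_queue:
            return cpu.run_queue.pop(0)
    return None


cpus = [
    Cpu(0, disp_transient=True, run_queue=["lms0"]),  # hypothetical waiting thread
    Cpu(1, run_queue=["worker"]),
]
first = steal_for_idle_cpu(cpus)   # takes "worker" from CPU 1
second = steal_for_idle_cpu(cpus)  # nothing eligible, although "lms0" is runnable
```

In this model "lms0" stays queued behind CPU 0 for as long as the flag is set, which corresponds to the delayed acknowledgements listed under Symptoms if the on-CPU thread stops behaving as transient.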

Solution

Edit the /etc/system file in every global zone to add the following settings, then reboot:

set thread_transience_kernel=0
set thread_transience_user=0
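
As a convenience, a short script can report whether both settings are present in the text of an /etc/system file. This is a minimal sketch, not an Oracle-supplied tool: missing_settings() and the sample text are hypothetical, and the parser only understands simple "set name=value" lines and "*" comment lines.

```python
# Minimal checker sketch; not an Oracle-supplied tool.
import re

REQUIRED = {"thread_transience_kernel": "0", "thread_transience_user": "0"}


def missing_settings(etc_system_text: str) -> list:
    """Return the required tunables that are absent or set to another value."""
    found = {}
    for line in etc_system_text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):  # '*' starts a comment in /etc/system
            continue
        m = re.match(r"set\s+(\w+)\s*=\s*(\S+)$", line)
        if m:
            found[m.group(1)] = m.group(2)
    return sorted(name for name, value in REQUIRED.items() if found.get(name) != value)


sample = """* workaround from this document
set thread_transience_kernel=0
"""
result = missing_settings(sample)  # only the user setting is missing in this sample
```

To check a real system, the contents of /etc/system in each global zone could be read and passed to missing_settings(). The settings take effect only after the reboot described above.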

 

When a future QFSDP that contains the fix is released, this document will be updated to reflect it.

References

<NOTE:1424503.2> - Information Center: SuperCluster
<NOTE:2088923.1> - Oracle SuperCluster Application Domain and Zones Best Practices
<NOTE:2004702.1> - Oracle SuperCluster Best Practices
<NOTE:1625975.1> - On-proc TRANSIENT Threads Can Delay Runnable Threads Leading to Cluster Node Evictions
<BUG:17697871> - SUNBT7199390 RUNNABLE THREAD OCCASIONALLY STAYS IN RUN QUEUE FOR TOO LONG

  Copyright © 2018 Oracle, Inc.  All rights reserved.