SuperCluster: Transient Threads Can Lead to Instance Crashes, Node Evictions and Random Database or Application Performance Issues

Asset ID:	1-72-2149887.1
Update Date:	2016-06-24

Solution Type Problem Resolution Sure

Solution 2149887.1 : SuperCluster: Transient Threads Can Lead to Instance Crashes, Node Evictions and Random Database or Application Performance Issues

Applies to:

Oracle SuperCluster T5-8 Half Rack - Version All Versions and later
Oracle SuperCluster T5-8 Full Rack - Version All Versions and later
SPARC SuperCluster T4-4 - Version All Versions and later
SPARC SuperCluster T4-4 Half Rack - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Oracle Solaris on SPARC (64-bit)

Symptoms

Node evictions due to heartbeat timeouts with no apparent problem on the IB fabric. Instance evictions due to delayed lms or lmd ping acknowledgements with no apparent interconnect problems. Unexplained delays in Java or shell script code on very busy application nodes.

Changes

The issue seems to be more prevalent in Solaris 11.2 and 11.3 and is frequently triggered by excessive CPU saturation.

Cause

Unpublished Bug 17697871

With the introduction of the workload characterization optimization for threads, a thread can be marked as TRANSIENT if its CPU utilization does not exceed a certain threshold. The intent is to identify threads that use few CPU resources.

The transience counter (the t_transience field of kthread_t) is incremented each time a thread consumes less than thread_transience_pct (0.02% by default) of a CPU's resources. If that counter reaches 10, the thread is flagged as TRANSIENT.
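
The counting rule above can be illustrated with a minimal Python sketch. It is a toy model, not kernel code: ThreadModel, account_sample and samples_until_transient are hypothetical names, and the behaviour on a sample at or above the threshold (counter and flag are cleared) is an assumption, since this note does not describe that case.

```python
from dataclasses import dataclass

THREAD_TRANSIENCE_PCT = 0.02  # percent of one CPU; default quoted in this note
TRANSIENCE_LIMIT = 10         # counter value at which the thread is flagged


@dataclass
class ThreadModel:
    """Toy stand-in for the relevant kthread_t state (not the real structure)."""
    t_transience: int = 0
    transient: bool = False


def account_sample(thread: ThreadModel, cpu_pct: float) -> None:
    """Apply one CPU utilization sample (in percent) to the toy thread."""
    if cpu_pct < THREAD_TRANSIENCE_PCT:
        thread.t_transience += 1
        if thread.t_transience >= TRANSIENCE_LIMIT:
            thread.transient = True
    else:
        # ASSUMPTION: the note does not say what happens on a busy sample;
        # this sketch clears both the counter and the flag.
        thread.t_transience = 0
        thread.transient = False


def samples_until_transient(samples):
    """Return the 1-based sample index at which the thread is flagged, or None."""
    thread = ThreadModel()
    for index, pct in enumerate(samples, start=1):
        account_sample(thread, pct)
        if thread.transient:
            return index
    return None
```

Under these assumptions, ten consecutive low-utilization samples flag the thread, while a single busy sample in between restarts the count.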

CPUs running TRANSIENT threads are also flagged as CPU_DISP_TRANSIENT, which triggers some optimizations within the scheduler. Normally, idle CPUs steal threads from the dispatch queues of other, busy CPUs. In this case, however, the dispatch queues of a CPU flagged as CPU_DISP_TRANSIENT are not considered for processing, even if there are idle CPUs, because Solaris assumes that the transient thread will leave the CPU soon. The problem appears when a transient thread running on a CPU starts to behave as non-transient. Threads can then stay too long in the dispatch queues for that CPU, causing latency bubbles and possible scheduling issues.
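
The stealing behaviour described above can be sketched as a toy model in Python. This is an illustration only and does not reflect the actual Solaris dispatcher code: CpuModel, steal_for_idle_cpu and the honor_transient switch are hypothetical names introduced here for comparison.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CpuModel:
    """Toy CPU: a flag modelling CPU_DISP_TRANSIENT plus a dispatch queue."""
    cpu_id: int
    disp_transient: bool = False
    queue: deque = field(default_factory=deque)  # runnable threads waiting


def steal_for_idle_cpu(cpus, honor_transient=True):
    """Return (victim cpu_id, thread) that an idle CPU would steal, or None."""
    for cpu in cpus:
        if not cpu.queue:
            continue  # nothing waiting on this CPU
        if honor_transient and cpu.disp_transient:
            # Queue is skipped: the on-CPU thread is expected to leave soon.
            continue
        return cpu.cpu_id, cpu.queue.popleft()
    return None


if __name__ == "__main__":
    busy = CpuModel(0, disp_transient=True, queue=deque(["lms0"]))
    print(steal_for_idle_cpu([busy]))                         # None
    print(steal_for_idle_cpu([busy], honor_transient=False))  # (0, 'lms0')
```

In the first call the waiting thread stays queued even though an idle CPU is looking for work, which is the latency bubble described above if the on-CPU thread does not actually leave soon.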

Solution

Edit the /etc/system file in every global zone to add the following settings, then reboot:

set thread_transience_kernel=0
set thread_transience_user=0
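
As a convenience, the following Python sketch checks whether the text of an /etc/system file already contains both settings. It is not an Oracle-provided tool: missing_settings and REQUIRED are hypothetical names, and the sketch assumes that the last assignment of a variable is the one that counts. It only inspects the file, so it does not confirm that the system has been rebooted with the new values.

```python
import re

# The two settings given in this note.
REQUIRED = {"thread_transience_kernel": "0", "thread_transience_user": "0"}

# Matches "set name=value" and "set module:name=value".
_SET_RE = re.compile(r"^\s*set\s+(?:\w+:)?(\w+)\s*=\s*(\S+)")


def missing_settings(etc_system_text: str) -> list:
    """Return the REQUIRED names that are absent or set to another value."""
    seen = {}
    for line in etc_system_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("*", "#")):
            continue  # blank line or /etc/system comment
        match = _SET_RE.match(stripped)
        if match:
            # ASSUMPTION: a later assignment overrides an earlier one.
            seen[match.group(1)] = match.group(2)
    return [name for name, want in REQUIRED.items() if seen.get(name) != want]


if __name__ == "__main__":
    with open("/etc/system") as handle:
        print(missing_settings(handle.read()))
```

An empty list means both settings are present in the file with the value 0.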

This document will be updated when a QFSDP that contains the fix is released.

References

<NOTE:1424503.2> - Information Center: SuperCluster
<NOTE:2088923.1> - Oracle SuperCluster Application Domain and Zones Best Practices
<NOTE:2004702.1> - Oracle SuperCluster Best Practices
<NOTE:1625975.1> - On-proc TRANSIENT Threads Can Delay Runnable Threads Leading to Cluster Node Evictions
<BUG:17697871> - SUNBT7199390 RUNNABLE THREAD OCCASIONALLY STAYS IN RUN QUEUE FOR TOO LONG

Attachments

This solution has no attachment