Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2211410.1
Update Date:2017-02-27
Keywords:

Solution Type  Problem Resolution Sure

Solution  2211410.1 :   SuperCluster T5-8 Primary Domain Intermittent Performance Issue: ssh Connections Time Out, Console Connection Very Slow  


Related Items
  • Oracle SuperCluster M6-32 Hardware
  • Oracle SuperCluster T5-8 Hardware
  • SPARC SuperCluster T4-4 Half Rack
  • Oracle SuperCluster T5-8 Half Rack
  • SPARC SuperCluster T4-4
  • Oracle SuperCluster T5-8 Full Rack
  • SPARC SuperCluster T4-4 Full Rack
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST


Shell commands and ssh login connections are intermittently very slow and time out.

Created from <SR 3-13769367001>

Applies to:

Oracle SuperCluster T5-8 Hardware - Version All Versions to All Versions [Release All Releases]
SPARC SuperCluster T4-4 - Version All Versions to All Versions [Release All Releases]
SPARC SuperCluster T4-4 Full Rack - Version All Versions to All Versions [Release All Releases]
SPARC SuperCluster T4-4 Half Rack - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster M6-32 Hardware - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

The node initially hung and ssh stopped working. The system was rebooted without CRS/RAC/RDBMS running, yet the problem recurs at the same time each day.
Subsequent console login sessions via ILOM intermittently show very poor response times.
Starting any data collection process (e.g. Exachk, Explorer, GUDS) exacerbates the problem.

Changes

In this case there were no changes. The system was working, then "suddenly" started having problems.

Cause

The root cause was a slow, poorly performing LUN in the ldom's boot rpool.

This was found by first running GUDS with its output in /var/tmp, then killing it and restarting it with -D /tmp. It ran much better writing to /tmp (memory) than to /var/tmp (root rpool).

From iostat -xcnz:

                  extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   37.0    0.0 1284.9  0.0  0.7    0.0   18.2   0  13 c0t5000CCA01672EECCd0
    0.0   24.0    0.0  123.4  0.0  4.2    0.0  173.3   0  99 c0t5000CCA0167440E8d0  <<<<
    0.0  109.9    0.0 5180.6  0.0  2.2    0.0   20.4   0  36 c0t5000CCA0166BE1FCd0
     cpu
 us sy wt id
  0  0  0 100
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  5.0    0.0    0.0   0 100 c0t5000CCA0167440E8d0   <<<<
    0.0   16.0    0.0   64.0  0.0  0.1    0.0    4.3   0   7 c0t5000CCA0166BE1FCd0
     cpu
 us sy wt id
  0  0  0 100
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   12.0    0.0  861.8  0.0  4.6    0.0  387.1   0 100 c0t5000CCA0167440E8d0   <<<
    0.0    5.0    0.0   20.0  0.0  0.0    0.0    5.8   0   3 c0t5000CCA0166BE1FCd0



o. The kernel team confirmed this in core file analysis: they saw ZFS I/Os timing out to the same device.

o. Note that %b and asvc_t are very high for this LUN even though there is almost no I/O going on. 12 writes per second is no real load at all, yet this drive cannot handle it.
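
The pattern noted above (high %b with high asvc_t, or outstanding I/O with nothing completing) can be picked out of iostat output with a small filter. The following is only a sketch: the function name flag_slow_disks and the default thresholds (asvc_t of 100 ms, %b of 90) are assumptions chosen for illustration, not values from the service request. It assumes the 11-column layout of iostat -xn shown above, and an awk that accepts -v (on Solaris this may need AWK=nawk or AWK=/usr/xpg4/bin/awk).

```shell
# flag_slow_disks [MAX_ASVC_MS] [MAX_BUSY_PCT]
# Reads iostat -xn style output on stdin and prints devices that look
# saturated. Name and default thresholds are illustrative assumptions.
flag_slow_disks() {
    "${AWK:-awk}" -v max_asvc="${1:-100}" -v max_busy="${2:-90}" '
        # Data rows start with a numeric r/s value and have at least 11
        # columns: r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
        NF >= 11 && $1 ~ /^[0-9.]+$/ {
            busy = $10 + 0
            asvc = $8 + 0
            iops = $1 + $2
            actv = $6 + 0
            # Stalled: I/O is queued on the device but none completed
            # during the interval (asvc_t then reads 0.0).
            stalled = (iops == 0 && actv > 0)
            if (busy >= max_busy && (asvc >= max_asvc || stalled))
                printf "%s asvc_t=%s %%b=%s\n", $11, $8, $10
        }'
}

# Example use (interval of 5 seconds is arbitrary):
#   iostat -xcnz 5 | flag_slow_disks 100 90
```

The stalled check is there because an interval in which no I/O completes reports asvc_t as 0.0, which a plain asvc_t threshold would miss.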

Solution

Detach the bad disk from rpool, replace it with a new one, and attach the new disk to the mirror.


Steps:

o. c0t5000CCA01672EECCd0 is the primary mirror
o. c0t5000CCA0167440E8d0 is the secondary mirror (the BAD/slow performing disk)
o. c0t5000CCA057A0A9BCd0 is the new disk that replaced the secondary mirror

1. Detach the 'bad' mirror:

# zpool detach rpool c0t5000CCA0167440E8d0s0

2. Replace the disk / wait until the disk has been replaced
-> the process calls for shutting down the SSC node, replacing the disk, and rebooting the node
-> hot swap is not supported on SuperCluster

3. Confirm the new disk is visible in /dev/rdsk

# ls -lah /dev/rdsk/c0t5000CCA057A0A9BCd0s2

4. Use format, select disk and 'format' it.

# format -e

5. Copy vtoc from primary to secondary mirror - makes partition tables match.

# prtvtoc /dev/rdsk/c0t5000CCA01672EECCd0s2 | fmthard -s - /dev/rdsk/c0t5000CCA057A0A9BCd0s2

6. Attach new disk to rpool

# zpool attach rpool c0t5000CCA01672EECCd0s0 c0t5000CCA057A0A9BCd0s0
                     PRIMARY RPOOL MIRROR    SECONDARY RPOOL MIRROR

7. Wait for resilvering to complete.

# zpool status rpool
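
Rather than re-running zpool status by hand, the wait in step 7 could be scripted along the following lines. This is a hedged sketch, not part of the original procedure: the function names and the 60-second default interval are made up for illustration, and it assumes that zpool status reports an active resilver with the phrase "resilver in progress", so check the wording on the installed release before relying on it.

```shell
# resilver_in_progress: reads zpool status output on stdin and succeeds
# while a resilver is still running.
# Assumption: the status text contains "resilver in progress".
resilver_in_progress() {
    grep 'resilver in progress' >/dev/null
}

# wait_for_resilver POOL [INTERVAL_SECONDS]
# Polls the pool until the resilver is no longer reported as running,
# then prints the final status for review. Hypothetical helper name.
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while zpool status "$pool" | resilver_in_progress; do
        sleep "$interval"
    done
    zpool status "$pool"
}

# Example use:
#   wait_for_resilver rpool 60
```

The final zpool status output should still be read by a person: the loop only detects that the resilver has stopped, not that it finished without errors.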

 

References

<BUG:25198355> - SSH CONNECTIONS SLOW AND EVENTUALLY TIME OUT INTERMITTENTLY ON NODE 1
<BUG:15654938> - SUNBT6967781 TXG_SYNC_THREAD IS BLOCKING, EVEN THOUGH THERE IS NO I/O ERROR.
<NOTE:2185936.1> - ldm commands on Control Domain hanging on ZFS, customer unable to run Explorer

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.