Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1611984.1
Update Date:2018-01-08
Keywords:

Solution Type  Troubleshooting Sure

Solution  1611984.1 :   Sun Storage 7000 Unified Storage System: How to make sure failover times are acceptable  


Related Items
  • Sun ZFS Storage 7320
  • Sun Storage 7720 Unified Storage System
  • Sun ZFS Storage 7420
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS


If long failover times are observed on a clustered 7000 system, TSC should analyze them further to determine whether they exceed expectations.

In this Document
Purpose
Troubleshooting Steps
References


Applies to:

Sun ZFS Storage 7420 - Version Not Applicable and later
Sun ZFS Storage 7320 - Version Not Applicable and later
Sun Storage 7720 Unified Storage System - Version Not Applicable and later
7000 Appliance OS (Fishworks)
Long failover times may be observed on clustered 7000 systems.
In that case, it must be determined whether those times exceed expectations.

Purpose

The purpose of this document is to help customers decide whether failover times are too long.

Troubleshooting Steps

Appliance cluster failover times have decreased substantially with the 2013.1.6 software release.

Failover times should now be less than 5 minutes.

If a time greater than 5 minutes is observed during either a failback or a takeover, customers can take the following steps:

1. Make sure the cluster is upgraded to at least ak.2013.1.6.

If this has not already been done, a maintenance window must be scheduled for the upgrade.

note 1447284.1 : Sun Storage 7000 Unified Storage System: How to upgrade a clustered

note 2021771.1 : Oracle ZFS Storage Appliance: Software Updates

 

2. Disable analytics before the upgrade: on firmware earlier than 2013.1.6, collecting certain Analytics datasets can significantly increase takeover and failback times, so disabling those datasets can shorten them.

note 1988278.1 : Oracle ZFS Storage Appliance: Analytics Dataset collection can increase Cluster takeover/failback times

 

3. Open a Service Request with TSC and send one support bundle for each node to the engineer in charge of your Service Request.

Also provide the exact times at which the long failback and/or takeover was observed.

Support bundles contain logs that record the most recent failbacks and takeovers performed on the cluster.

The TSC engineer will therefore be able to tell whether the failback or takeover time exceeded the 5-minute expectation.

If so, the issue will be escalated to engineering, and additional traces might be requested at a later stage.

 

Logs can be found in /var/ak/logs on the live system and in the /logs directory in the bundle.

Failover times can be found in the rm.ak* logs, using the aklog macro. The MOS note below may help you use aklog properly.

note  1427053.1 : Sun Storage 7000 Unified Storage System: Using aklog to read extended accounting log files
Here is an example showing how long a takeover lasted:
 
$ aklog ./rm.ak | egrep -e 'import|export|takeover|failback'
... output cut
Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24P_20-10k-2wa-SP succeeded in 0.027s 
Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24C_20-7.2k-1wa-SP succeeded in 0.030s
Mon Oct 21 20:09:30 2013: takeover completed in 92.469s 

 

The takeover lasted about 92 seconds, that is, roughly one and a half minutes.
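
The same check can be scripted. The following is only a minimal sketch, not an Oracle-supplied tool: it assumes the rm.ak log has already been decoded to text (for example by the aklog pipeline above) and that completion lines look like the one in the example. The function name check_failover_times and the default limit of 300 seconds (the 5-minute expectation) are illustrative choices.

```shell
# Hypothetical helper (not part of the appliance software).
# Reads decoded rm.ak text on stdin and reports every
# "takeover completed" / "failback completed" line, flagging
# those that exceed the limit given as first argument (seconds).
check_failover_times() {
    limit="${1:-300}"   # assumption: 300 s = the 5-minute expectation
    awk -v limit="$limit" '
        /(takeover|failback) completed in [0-9.]+s/ {
            secs = $NF                 # last field, e.g. "92.469s"
            sub(/s$/, "", secs)        # drop the trailing unit
            verdict = (secs + 0 > limit + 0) ? "TOO LONG" : "ok"
            # $(NF-3) is the operation name: takeover or failback
            printf "%s %s %s\n", $(NF-3), secs, verdict
        }'
}
```

Piping the example output above through check_failover_times prints "takeover 92.469 ok".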


Failback times can also be found in the rm.ak* logs, but on both cluster peers: one peer exports the resources and the other peer imports them.

On node 1

Mon Oct 21 20:29:06 2013: ak_rm_fail_back phase 1 complete in 3.789
Mon Oct 21 20:29:23 2013: ak_rm_fail_back phase 2 complete in 16.874
 

On node 2

Mon Oct 21 20:28:18 2013: failback completed in 16.809s 
For a failback, the times on node 1 and node 2 must be added together to get the total failback time. Here it lasted about 37 seconds in total (3.789 + 16.874 + 16.809 = 37.472 s).
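
The addition can also be done with a short script. This is a sketch under stated assumptions: the function name total_failback_time is hypothetical, the input is decoded rm.ak text from both peers concatenated on stdin, and it contains the lines of a single failback event only (otherwise several events would be summed together).

```shell
# Hypothetical helper (not part of the appliance software).
# Sums the ak_rm_fail_back phase durations from the exporting
# peer and the "failback completed" duration from the importing
# peer, and prints the total in seconds.
total_failback_time() {
    awk '
        /ak_rm_fail_back phase [0-9]+ complete in/ {
            total += $NF               # e.g. "3.789" (no unit suffix)
        }
        /failback completed in [0-9.]+s/ {
            secs = $NF                 # e.g. "16.809s"
            sub(/s$/, "", secs)        # drop the trailing unit
            total += secs
        }
        END { printf "%.3f\n", total }'
}
```

Fed the three failback lines shown above, the function prints 37.472.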
 
These times are reliable in ak.2013 firmware, and they show how long the customer actually waited for data services to be transferred from one node to the other.

References

<NOTE:1408475.1> - Sun Storage 7000 Unified Storage System: How to troubleshoot long cluster take-over and fail-back times

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.