Sun Storage 7000 Unified Storage System: How to make sure failover times are acceptable

Asset ID:	1-75-1611984.1
Update Date:	2018-01-08
Keywords:

Solution Type Troubleshooting Sure

Solution 1611984.1 : Sun Storage 7000 Unified Storage System: How to make sure failover times are acceptable

Applies to:

Sun ZFS Storage 7420 - Version Not Applicable and later
Sun ZFS Storage 7320 - Version Not Applicable and later
Sun Storage 7720 Unified Storage System - Version Not Applicable and later
7000 Appliance OS (Fishworks)
Long failover times may be observed on 7000 clustered systems.
In that case, check whether those times exceed what is expected.

Purpose

The purpose of this document is to help customers decide whether failover times are too long.

Troubleshooting Steps

Appliance cluster failover times have decreased substantially with the 2013.1.6 software release.

Failover times should now be less than 5 minutes.

If a takeover or failback takes longer than 5 minutes, customers can take the following actions:

1. Make sure the cluster is upgraded to at least ak.2013.1.6.

If this has not been done yet, schedule a maintenance activity for the upgrade.

note 1447284.1 : Sun Storage 7000 Unified Storage System: How to upgrade a clustered

note 2021771.1 : Oracle ZFS Storage Appliance: Software Updates

2. Disable analytics before the upgrade: disabling certain analytics datasets while running firmware older than 2013.1.6 can significantly reduce takeover and failback times.

note 1988278.1 : Oracle ZFS Storage Appliance: Analytics Dataset collection can increase Cluster takeover/failback times

3. Open a Service Request with TSC, and send one support bundle for each node to the engineer in charge of your Service Request.

Also provide the exact times at which the long failback and/or takeover was observed.

Support bundles contain logs that record the latest failbacks and takeovers issued on the cluster.

The TSC engineer will therefore be able to tell whether the failback or takeover took longer than the expected 5 minutes.

If so, the issue will be escalated to engineering, and additional traces might be requested at a later stage.

Logs can be found in /var/ak/logs on the live system and in the /logs directory of the bundle.

Failover times can be found in the rm.ak* logs, which are read with the aklog macro. The MOS note below explains how to use aklog.

note 1427053.1 : Sun Storage 7000 Unified Storage System: Using aklog to read extended accounting log files

Here is an example showing how long a takeover lasted:

 
$ aklog ./rm.ak | egrep -e 'import|export|takeover|failback'

... output cut

Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24P_20-10k-2wa-SP succeeded in 0.027s

Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24C_20-7.2k-1wa-SP succeeded in 0.030s

Mon Oct 21 20:09:30 2013: takeover completed in 92.469s

The takeover lasted about 92 seconds, that is, roughly a minute and a half.
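The check against the 5 minute expectation can be scripted. The following is a minimal sketch, not an Oracle-provided tool: it assumes the aklog text output has the same line format as the example above, and the function name check_failover_times and the variable LIMIT_SECONDS are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Expectation stated in this document: failover should take less than 5 minutes.
LIMIT_SECONDS=300

# Reads aklog text output on stdin and prints every "takeover completed" or
# "failback completed" line, prefixed with OK or SLOW.
# Assumption: the duration is the last field of the line, e.g. "92.469s".
check_failover_times() {
    LC_ALL=C awk -v limit="$LIMIT_SECONDS" '
        /(takeover|failback) completed in [0-9.]+s/ {
            secs = $NF              # last field, e.g. "92.469s"
            sub(/s$/, "", secs)     # drop the trailing unit
            status = (secs + 0 > limit) ? "SLOW" : "OK"
            print status ": " $0
        }
    '
}
```

It could be used as `aklog ./rm.ak | check_failover_times`. Lines reporting individual imports or exports are ignored, since only the completed takeover or failback time is compared with the threshold.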

Failback times can also be found in the rm.ak* logs, but they are split across both cluster peers: one peer exports the resources and the other peer imports them.

On node 1

Mon Oct 21 20:29:06 2013: ak_rm_fail_back phase 1 complete in 3.789

Mon Oct 21 20:29:23 2013: ak_rm_fail_back phase 2 complete in 16.874

On node 2

Mon Oct 21 20:28:18 2013: failback completed in 16.809s

For failback, the times logged on node 1 and on node 2 must be added to get the total failback time. Here, 3.789 + 16.874 + 16.809 = 37.472, so the failback lasted about 37 seconds in total.

These times are reliable in ak.2013 firmware, and they show how long the customer actually waited for data services to be transferred from one node to the other.
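The addition of the node 1 and node 2 times can be scripted as well. This is a minimal sketch under the same assumptions as the log excerpts above: the input is aklog text output, it contains the lines of a single failback event only, and the duration is the last field of each line. The function name total_failback_seconds is a hypothetical name chosen for illustration.

```shell
#!/bin/sh
# Reads aklog text lines from both cluster peers on stdin and prints the
# sum, in seconds, of the failback phase times and the failback completion time.
# Assumption: stdin holds the lines of one failback event only.
total_failback_seconds() {
    LC_ALL=C awk '
        /ak_rm_fail_back phase [0-9]+ complete in/ || /failback completed in/ {
            secs = $NF              # last field, e.g. "3.789" or "16.809s"
            sub(/s$/, "", secs)     # drop the unit when present
            total += secs
        }
        END { printf "%.3f\n", total }
    '
}
```

With the three lines of the example saved from each node, `cat node1.txt node2.txt | total_failback_seconds` prints 37.472. The file names node1.txt and node2.txt are placeholders.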

References

<NOTE:1408475.1> - Sun Storage 7000 Unified Storage System: How to troubleshoot long cluster take-over and fail-back times
