Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1408475.1
Update Date: 2018-05-10

Solution Type: Troubleshooting Sure

Solution 1408475.1 : Sun Storage 7000 Unified Storage System: How to troubleshoot long cluster take-over and fail-back times


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS5-2
  • Oracle ZFS Storage ZS3-2
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS5-4
  • Sun Storage 7310 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Oracle ZFS Storage Appliance Racked System ZS4-4
  • Sun ZFS Storage 7320

Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




Applies to:

Sun ZFS Storage 7420 - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
7000 Appliance OS (Fishworks)

Purpose

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance Community

The appliance supports both fail-back and takeover. There is an important distinction between them: fail-back requires resources to be exported by the relinquishing head before they can be imported by the other head, whereas takeover simply imports them.

In the case of a slow fail-back, it is worth determining whether the relinquishing head is slow to export the resources, or the claiming head is slow to import them.

 

NOTE: To confirm that the cluster 'links' cabling is correctly configured, see Document ID 2081179.1.

 

Troubleshooting Steps

Takeover and fail-back times depend on the number of objects that must be iterated over during the resource import phase.

On the 7x20 and 7x10 series systems, these objects include shares, LUNs, datalinks, VLANs, network interfaces, IPMP/LACP setups, and iSCSI/FC targets, initiators, and groups. Simple configurations fail over faster than complex ones.

Other considerations:

  • If there is I/O to the pool at the time of takeover/failback, the time will increase, especially with writes, since writes leave more dirty data in the Logzillas to be "replayed" during takeover. Reads should not make a difference.
  • When iSCSI/FC LUNs are in use, in the case of a takeover the contents of the Logzillas must be replayed before the zpool can be imported.
  • If many CIFS clients are authorized by an Active Directory server, more time will be needed to re-authorize them on the peer cluster head after a takeover or failback.
  • If a destroy is in progress, it must complete before the zpool can be imported. This has been seen especially with snapshots, where the head being taken over was in the process of destroying a snapshot, which the other head then had to complete before the pool could be imported. This situation is remedied in appliance software 2011.1.1.0.
  • Finally, if present, the Readzilla L2ARC caches will need to warm up after their associated pool is imported. Note that this does not apply to the Logzillas: they are imported along with the rest of the pool, whereas the Readzillas are specific to each cluster head.

 

NOTE: An analysis of over 2000 ZFS Storage Appliances over the past year (2014-2015) showed that the average failover/takeover time was ~30 seconds.
      Outliers above 2 minutes occurred less than 3% of the time across appliances running a mixture of workloads/configurations, both block and file.
      The 2013.1.4.5 release introduced a set of changes that reduce the sequential nature of LUN enumeration during block failover, greatly improving cluster failover times.

Takeover/failback time can be influenced by:

  • The total number of projects and shares configured
  • The complexity of the network configuration
  • The amount of Analytics being continuously collected


Identifying the problem

For reference, the expected takeover time (for 7x10 SAS-1 systems) is:

    Time in seconds = (20 * D) + (0.03 * S)

where D is the number of disksets (half JBODs) and S is the number of shares (filesystems).

NOTE: For 7x20/ZS3-x SAS-2 systems, see below.
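
As a worked example of the formula above (with assumed values: D = 4 disksets, i.e. 2 full JBODs, and S = 400 shares):

$ echo "(20 * 4) + (0.03 * 400)" | bc
92.00

So roughly 92 seconds would be an expected takeover time for such a configuration.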

 

How to make sure failover times are acceptable

Appliance cluster failover times have decreased substantially with the 2013 software release. Given the considerations above, which partially explain expected failover times, failover times should now be less than 5 minutes on a 2013 release.

If a time greater than 5 minutes (3 minutes on 2013.1.4.x code) is observed when either failing back or taking over, the following actions are recommended:

1. Make sure the cluster is upgraded to the latest AK 2013.1.x release

        If this has not already been done, a cluster upgrade activity must be scheduled.

2. Limit LUNs/shares to 500 or fewer. This is the single most impactful factor for failover times.

3. Limit Analytics, collecting datasets only when needed. The biggest 'offenders' are generally ARC-related DTrace probes (see the CLI sketch after this list).

4. For analysis, open a Service Request with TSC, and send one support bundle for each node to the engineer in charge of your Service Request.

        Also provide the exact times at which the long failback and/or takeover was observed.

Support bundles contain logs recording the latest failbacks and takeovers issued on the cluster.

The TSC engineer will therefore be able to tell whether the failback time exceeded the 5-minute (3 minutes on 2013.1.4.x code) expectation.

If it did, the issue will be escalated to engineering, and additional traces may be requested at a later stage.
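
For item 3 above, the Analytics datasets currently being collected can be reviewed from the appliance CLI. This is a minimal sketch: 'hostname' is a placeholder, and the exact columns shown vary by release.

hostname:> analytics datasets
hostname:analytics datasets> show

Datasets listed in the active state are collected continuously; the more of them there are, particularly ARC-related DTrace probes, the more Analytics state there is to handle around a failover.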

 

Logs can be found in /var/ak/logs on the live system and in the /logs directory in the bundle.

Failover times can be found in the rm.ak* logs, using the aklog macro. The MOS document below may help with using aklog properly.

MOS Document ID 1427053.1 : Sun Storage 7000 Unified Storage System: Using aklog to read extended accounting log files

Here is an example showing how long a takeover lasted:
 
$ aklog ./rm.ak | egrep -e 'import|export|takeover|failback'
... output cut
Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24P_20-10k-2wa-SP succeeded in 0.027s 
Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24C_20-7.2k-1wa-SP succeeded in 0.030s
Mon Oct 21 20:09:30 2013: takeover completed in 92.469s 

 

The takeover lasted 92 seconds, that is to say about a minute and a half.
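
If only the overall takeover duration is of interest, a narrower filter on the same log works (the output line is the same one shown in the example above):

$ aklog ./rm.ak | grep 'takeover completed'
Mon Oct 21 20:09:30 2013: takeover completed in 92.469s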


Failback times can be found in the rm.ak* log as well, but on both cluster peers: one peer exports the resources, the other peer imports them.

On node 1

Mon Oct 21 20:29:06 2013: ak_rm_fail_back phase 1 complete in 3.789
Mon Oct 21 20:29:23 2013: ak_rm_fail_back phase 2 complete in 16.874


On node 2

Mon Oct 21 20:28:18 2013: failback completed in 16.809s 
For failback, the time on node 1 and the time on node 2 must be added to get the total failback time; here it took about 37 seconds in total (see the sketch below).
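
A minimal sketch of that addition (assuming the aklog output from each node has been saved to the hypothetical files node1_rm.txt and node2_rm.txt):

$ grep -h -E 'fail_back phase|failback completed' node1_rm.txt node2_rm.txt \
    | grep -oE '[0-9]+\.[0-9]+' \
    | awk '{ total += $1 } END { printf "total failback: %.1fs\n", total }'
total failback: 37.5s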
 
These times are reliable on ak.2013 firmware, and they tell how long the customer effectively waited for data services to be transferred from one node to the other.

 

NOTE: For later Appliance Firmware Release versions, the 'akrmzfs.txt' file is included in the support bundle - it prints EVERYTHING about a takeover/failback.

  • It reports how long every routine took to run (in microseconds), how long the datasets took, as well as the times to unmount, export, import and mount.
  • It even draws a little bar graph to review.

 

1402545.1 - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems

 

How to gather key data and information for Oracle Disk array products, to minimise problem diagnosis and resolution times (Doc ID 1346234.1)

Note: For any fail-over issues that are not addressed by this document, please contact Oracle Support for assistance in diagnosing the issue, and be prepared that remote access may be required.
Ref: Oracle Shared Shell Document 1194226.1


Sun ZFS Storage Appliances Troubleshooting Resource Center (Doc ID 1416406.1)

 

Customers are not permitted to run commands at the emergency shell.

 

New Data gathering script (01-2013)

Here is a link to a new data gathering script called failback_monitor.akwf, supplied by RPE, to be used to diagnose fail-over issues lasting more than 5 minutes. Privileges are required to access the wiki/AmberRoadSupport space.

Ask Sailesh Thanki for this.

Link to the workflow:  https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Long+failback+-+Data+collection

 

Checking takeover/failback times from a support bundle:

Refer to the support bundle log file rm.ak, for example:

bash-3.2$ cd /cores/sr-id/supportbundle
bash-3.2$ find . -type f -exec grep "in 0." {} /dev/null \;

This gives you an overview of how long the export and import of individual items took.
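
To surface only the slow operations, exclude the sub-second entries instead (a variation on the command above; "succeeded in" matches the import/export log lines shown in the examples in this document):

bash-3.2$ find . -type f -exec grep -H "succeeded in" {} \; | grep -v "in 0."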

Example from bug 15772458 (a 6.5-minute failback):

adc26stor08:configuration cluster> date
2012-2-11 08:59:46
adc26stor08:configuration cluster> failback
Continuing will immediately fail back the resources assigned to the cluster
peer. This may result in clients experiencing a slight delay in service.

Are you sure? (Y/N)
date
adc26stor08:configuration cluster> date
2012-2-11 09:06:24

On the exporting node (08) we see that the pools take the longest:
adc26stor08# aklog rm | grep -i export | grep "Sat Feb 11 09:0" | grep -v "in 0." | tail -20
Sat Feb 11 09:01:26 2012: export of ak:/nas/pool07a succeeded in 95.727s
Sat Feb 11 09:04:10 2012: export of ak:/zfs/pool07a succeeded in 164.224s
adc26stor08#

On the importing node (07), the pools are the biggest hitters as well:
adc26stor07# aklog rm | grep -i import | grep "Sat Feb 11 09:0" | grep -v "in 0." | tail -20
Sat Feb 11 09:05:14 2012: [zfs import] zpool_import_props() succeeded in 61.090s
Sat Feb 11 09:05:14 2012: import of ak:/zfs/pool07a succeeded in 61.129s
Sat Feb 11 09:06:03 2012: [nas import] discovery completed in 48.400s
Sat Feb 11 09:06:16 2012: [nas import] mounted 673 datasets in 6.989s
Sat Feb 11 09:06:17 2012: import of ak:/nas/pool07a succeeded in 62.531s
Sat Feb 11 09:06:20 2012: import of ak:/net/ixgbe93003 succeeded in 1.649s
adc26stor07#


Here is a very useful DTrace script for checking which operation takes the most time.

The DTrace script import.d helps to troubleshoot long cluster takeover and fail-back times: it measures the time taken to import each resource.

Output:
The first table is the aggregate time spent importing each resource; the second is the number of times each resource was imported. The special "resource" SAS LOCK is just the time taken to grab all the zone locks in the expanders. These two activities are essentially all there is to a takeover, so together they should capture everything that consumes time.

 

 

Checked for relevancy - 10-May-2018

References

<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems
<BUG:15772458> - SUNBT7144862 6.5 MINUTE FAILBACK ON Q3.4.3 - NEED RCA
<NOTE:1194226.1> - Oracle Shared Shell
<NOTE:1346234.1> - How to gather key data and information for Oracle Disk array products, to minimise problem diagnosis and resolution times

Attachments
This solution has no attachment