Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1611984.1 : Sun Storage 7000 Unified Storage System: How to make sure failover times are acceptable

If long failover times are observed on a 7000 clustered system, TSC should carry out further analysis to check whether those times exceed expectations.
Applies to:
Sun ZFS Storage 7420 - Version Not Applicable and later
Sun ZFS Storage 7320 - Version Not Applicable and later
Sun Storage 7720 Unified Storage System - Version Not Applicable and later
7000 Appliance OS (Fishworks)

Long failover times may be observed on 7000 clustered systems. In that case, we need to determine whether those times exceed expectations.

Purpose
The purpose of this document is to help customers decide whether failover times are too long.

Troubleshooting Steps
Appliance cluster failover times have substantially decreased with the 2013.1.6 software release. Failover times should now be less than 5 minutes. If a time greater than 5 minutes is observed, either when failing back or when taking over, customers can take the following steps:

1. Make sure the cluster is upgraded to at least ak.2013.1.6. If this has not already been done, a maintenance activity must be scheduled for the upgrade.
note 1447284.1 : Sun Storage 7000 Unified Storage System: How to upgrade a clustered
note 2021771.1 : Oracle ZFS Storage Appliance: Software Updates
2. Disable analytics before the upgrade: while running firmware older than 2013.1.6, disabling certain analytics datasets can significantly reduce takeover and failback times.
note 1988278.1 : Oracle ZFS Storage Appliance: Analytics Dataset collection can increase Cluster takeover/failback times
3. Open a Service Request with TSC and send one support bundle for each node to the engineer in charge of your Service Request. Also provide the precise times at which the long failback and/or takeover was observed. Support bundles contain logs recording the latest failbacks and takeovers issued on the cluster, so the TSC engineer will be able to tell whether the failback or takeover took longer than the expected 5 minutes. If so, the issue will be escalated to engineering, and additional traces might be requested at a later stage.
Logs can be found in /var/ak/logs on the live system and in the /logs directory in the bundle. Failover times can be found in the rm.ak* logs, using the aklog macro. The MOS documentation below may help with using aklog properly.
note 1427053.1 : Sun Storage 7000 Unified Storage System: Using aklog to read extended accounting log files
Here is an example showing how long a takeover lasted:

$aklog ./rm.ak | egrep -e 'import|export|takeover|failback'
... output cut
Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24P_20-10k-2wa-SP succeeded in 0.027s
Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24C_20-7.2k-1wa-SP succeeded in 0.030s
Mon Oct 21 20:09:30 2013: takeover completed in 92.469s

The takeover lasted about 92 seconds, that is, roughly a minute and a half.
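The comparison against the 5-minute expectation can also be scripted. The following is a minimal sketch, not part of the appliance tooling: it assumes the aklog output has been saved as plain text lines in the format shown in this document, and the names `DURATION_RE`, `parse_durations` and `check_failover` are hypothetical helpers introduced only for illustration.

```python
import re

# Expectation stated in this document for ak.2013.1.6 and later: under 5 minutes.
FAILOVER_LIMIT_SECONDS = 5 * 60

# Assumed line format, taken from the examples in this document, e.g.
#   "... takeover completed in 92.469s"
#   "... ak_rm_fail_back phase 1 complete in 3.789"
DURATION_RE = re.compile(
    r"(takeover|failback|ak_rm_fail_back phase \d+) completed? in "
    r"(\d+(?:\.\d+)?)s?\s*$"
)


def parse_durations(lines):
    """Return (event, seconds) pairs for takeover/failback completion lines."""
    results = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match:
            results.append((match.group(1), float(match.group(2))))
    return results


def check_failover(lines, limit=FAILOVER_LIMIT_SECONDS):
    """Sum the matched durations and report whether the total exceeds the limit."""
    total = sum(seconds for _, seconds in parse_durations(lines))
    return total, total > limit


takeover_log = [
    "Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24P_20-10k-2wa-SP succeeded in 0.027s",
    "Mon Oct 21 20:09:30 2013: takeover completed in 92.469s",
]
total, too_long = check_failover(takeover_log)
print(total, too_long)  # prints "92.469 False"
```

Import and export lines are deliberately ignored, since only the completion lines carry the overall duration. For a failback, where the times recorded on both nodes must be added, the same helper can be given the completion lines collected from each node's log.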
Here is an example showing how long a failback lasted:

On node 1
Mon Oct 21 20:29:06 2013: ak_rm_fail_back phase 1 complete in 3.789
Mon Oct 21 20:29:23 2013: ak_rm_fail_back phase 2 complete in 16.874

On node 2
Mon Oct 21 20:28:18 2013: failback completed in 16.809s

For a failback, the times on node 1 and node 2 must be added to get the total failback time. Here, 3.789 + 16.874 + 16.809 gives about 37 seconds in total. These times are reliable in ak.2013 firmware and show how long the customer actually waited for data services to be transferred from one node to the other.

References
note 1408475.1 : Sun Storage 7000 Unified Storage System: How to troubleshoot long cluster take-over and fail-back times