Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1408475.1 : Sun Storage 7000 Unified Storage System: How to troubleshoot long cluster take-over and fail-back times
Applies to:
Sun ZFS Storage 7420 - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
7000 Appliance OS (Fishworks)
Purpose
To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance Community.
The appliance performs both fail-back and takeover, and the distinction between them matters: fail-back requires resources to be exported by one head before they are imported by the other head, whereas takeover only imports them. When a fail-back is slow, it is worth determining whether the relinquishing head is slow to export the resources or the claiming head is slow to import them.
NOTE: To confirm that the cluster 'links' cabling is correctly configured - See Document ID 2081179.1
Troubleshooting Steps
Takeover and fail-back times depend on the number of objects that must be iterated during the resource import phase. On 7x20 and 7x10 series systems those objects include shares, LUNs, data links, VLANs, network interfaces, IPMP/LACP setup, iSCSI/FC targets, initiators, and groups. Simple configurations fail over faster than complex ones.
NOTE: An analysis of over 2000 ZFS Storage Appliances over the past year (2014-2015) showed that the average failover/takeover time was ~30 seconds.
NOTE: For 7x20/ZS3-x SAS-2 systems, see below.
How to make sure failover times are acceptable
The considerations above partly explain expected failover times. Appliance cluster failover times decreased substantially with the 2013 software release and should now be less than 5 minutes. If a fail-back or takeover takes longer than 5 minutes (3 minutes in 2013.1.4.x code), customers can take the following steps:
1. Make sure the cluster is upgraded to the latest AK 2013.1.x release. If this has not been done, a cluster upgrade must be scheduled.
2. Limit LUNs/shares to 500 or fewer. This has the greatest impact on failover times.
3. Limit the use of Analytics, enabling them only when needed. The biggest 'offenders' are generally ARC-related DTrace probes.
4. For analysis, open a Service Request with TSC and send one support bundle for each node to the engineer in charge of the Service Request. Also provide the accurate times at which the long failback and/or takeover was observed. Support bundles contain logs recording the latest failbacks and takeovers issued on the cluster, so the TSC engineer can tell whether the time exceeded the expected 5 minutes (3 minutes in 2013.1.4.x code). If it did, the issue will be raised with engineering, and additional traces may be requested afterwards.
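The thresholds above can be expressed as a small check. This is a minimal Python sketch, not an appliance tool: the function names are hypothetical, and the way the release string is matched (a `2013.1.4` prefix) is an assumption about how the version is written. Any other release is treated as having the 5-minute expectation.

```python
def expected_limit_seconds(release: str) -> int:
    """Return the expected upper bound for a takeover or fail-back, in seconds."""
    # From the text: 3 minutes in 2013.1.4.x code, 5 minutes otherwise.
    # Assumption: the release is given as a dotted string such as "2013.1.4.5".
    if release == "2013.1.4" or release.startswith("2013.1.4."):
        return 180
    return 300


def needs_investigation(duration_s: float, release: str) -> bool:
    """True when the observed time exceeds the expectation for that release."""
    return duration_s > expected_limit_seconds(release)


if __name__ == "__main__":
    # A 200-second takeover is within expectations on most 2013 code,
    # but above the 3-minute expectation on 2013.1.4.x.
    print(needs_investigation(200.0, "2013.1.0.1"))
    print(needs_investigation(200.0, "2013.1.4.5"))
```

A result of `True` only indicates that the time is above the stated expectation, which is the point at which the text suggests opening a Service Request.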
Logs can be found in /var/ak/logs on the live system and in the /logs directory in the bundle. Failover times can be found in the rm.ak* logs, using the aklog macro. The following MOS document explains how to use aklog: MOS Document ID 1427053.1 : Sun Storage 7000 Unified Storage System: Using aklog to read extended accounting log files
Here is an example showing how long a takeover lasted:
$ aklog ./rm.ak | egrep -e 'import|export|takeover|failback'
... output cut
Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24P_20-10k-2wa-SP succeeded in 0.027s
Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24C_20-7.2k-1wa-SP succeeded in 0.030s
Mon Oct 21 20:09:30 2013: takeover completed in 92.469s
The takeover lasted 92 seconds, that is, about a minute and a half.
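When the aklog output is long, it can help to rank the per-resource import times and pull out the takeover total. The following Python sketch assumes the aklog output has been saved as text and that the lines have exactly the shape shown in the example above; the function name and the sample constant are hypothetical.

```python
import re

# Patterns modelled on the example output above; other message shapes are ignored.
IMPORT_RE = re.compile(r"import of (\S+) succeeded in (\d+(?:\.\d+)?)s")
TAKEOVER_RE = re.compile(r"takeover completed in (\d+(?:\.\d+)?)s")

SAMPLE_TAKEOVER = [
    "Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24P_20-10k-2wa-SP succeeded in 0.027s",
    "Mon Oct 21 20:09:29 2013: import of ak:/fct/DE24C_20-7.2k-1wa-SP succeeded in 0.030s",
    "Mon Oct 21 20:09:30 2013: takeover completed in 92.469s",
]


def summarize_takeover(lines):
    """Return (takeover_seconds, imports sorted slowest first)."""
    imports = []
    takeover = None
    for line in lines:
        m = IMPORT_RE.search(line)
        if m:
            imports.append((m.group(1), float(m.group(2))))
            continue
        m = TAKEOVER_RE.search(line)
        if m:
            takeover = float(m.group(1))
    imports.sort(key=lambda item: item[1], reverse=True)
    return takeover, imports


if __name__ == "__main__":
    total, slowest = summarize_takeover(SAMPLE_TAKEOVER)
    print("takeover seconds:", total)
    for resource, secs in slowest:
        print(f"{secs:8.3f}s  {resource}")
```

In this sample the listed imports account for only a small fraction of the total, so the ranking is mainly useful on the full, uncut output.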
The following lines show a failback:
Mon Oct 21 20:29:06 2013: ak_rm_fail_back phase 1 complete in 3.789
Mon Oct 21 20:29:23 2013: ak_rm_fail_back phase 2 complete in 16.874
Mon Oct 21 20:28:18 2013: failback completed in 16.809s
For a failback, the times recorded on node 1 and on node 2 must be added to get the total failback time. Here it lasted about 37 seconds in total (3.789 + 16.874 + 16.809). These times are reliable in ak.2013 firmware, and they show how long the customer actually waited for data services to be transferred from one node to the other.
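The addition described above can be sketched in Python. This assumes the relevant lines from both nodes' rm.ak logs have already been read with aklog and collected into one list; the function name and sample constant are hypothetical, and the pattern only covers the three message shapes shown in the example (note that the phase lines carry no trailing `s`).

```python
import re

# Matches the phase-completion lines and the "failback completed" line.
FAILBACK_RE = re.compile(
    r"(?:ak_rm_fail_back phase \d+ complete|failback completed)"
    r" in (\d+(?:\.\d+)?)s?\s*$"
)

SAMPLE_FAILBACK = [
    "Mon Oct 21 20:29:06 2013: ak_rm_fail_back phase 1 complete in 3.789",
    "Mon Oct 21 20:29:23 2013: ak_rm_fail_back phase 2 complete in 16.874",
    "Mon Oct 21 20:28:18 2013: failback completed in 16.809s",
]


def total_failback_seconds(lines):
    """Sum the failback durations reported across both nodes."""
    total = 0.0
    for line in lines:
        m = FAILBACK_RE.search(line)
        if m:
            total += float(m.group(1))
    return total


if __name__ == "__main__":
    print(f"total failback: {total_failback_seconds(SAMPLE_FAILBACK):.3f}s")
```

The sketch adds every matching line it is given, so the input should contain the lines for a single failback event only.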
NOTE: For later Appliance Firmware Release versions, the 'akrmzfs.txt' file is included in the supportbundle - it prints EVERYTHING about a takeover/failback.
1402545.1 - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems
How to gather key data and information for Oracle Disk array products, to minimise problem diagnosis and resolution times (Doc ID 1346234.1)
Note: For any fail-over issues that are not addressed by this document, please contact Oracle Support for assistance in diagnosing the issue, and be prepared for remote access to be required. Ref: Oracle Shared Shell Document 1194226.1
Sun ZFS Storage Appliances Troubleshooting Resource Center (Doc ID 1416406.1)
Customers are not permitted to run commands at the emergency shell.
New data gathering script (01-2013): ask Sailesh Thanki for this.
Link to the workflow: https://stbeehive.oracle.com/teamcollab/wiki/AmberRoadSupport:Long+failback+-+Data+collection
Checking takeover/failback times from a support bundle:
The DTrace script import.d helps to troubleshoot long cluster takeover and fail-back times. The script measures the time taken to import each resource.
Checked for relevancy - 10-May-2018
References
<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems
<BUG:15772458> - SUNBT7144862 6.5 MINUTE FAILBACK ON Q3.4.3 - NEED RCA
<NOTE:1194226.1> - Oracle Shared Shell
<NOTE:1346234.1> - How to gather key data and information for Oracle Disk array products, to minimise problem diagnosis and resolution times
Attachments
This solution has no attachment