Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure
Solution 1403503.1 : Sun Storage 7000 Unified Storage System: A cluster node fails to rejoin the cluster
Applies to:
Oracle ZFS Storage ZS3-2 - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-BA - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
Oracle ZFS Storage Appliance Racked System ZS4-4 - Version All Versions and later
7000 Appliance OS (Fishworks)

NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
JBODs Model : [not dependent]
CLUSTER related : [yes]

Purpose
This document is provided to assist in troubleshooting cluster join issues where one node of a cluster, following a reboot, fails to rejoin the cluster.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance Community.
Troubleshooting Steps
NOTE: To confirm that the cluster 'links' cabling is correctly configured, see Document ID 2081179.1.
When one cluster node fails to join the cluster, the cause is often heavy load on the working cluster node, which slows communication between the nodes, or sometimes a cluster-wide locking issue on the working node. In the former case, simply leaving the system to retry the rejoin operation may be sufficient, and the join may eventually complete successfully. In the latter case, the second node is unlikely to rejoin the cluster, and this document provides a workaround for that particular issue.
Note: If you wish to know why the node failed to rejoin the cluster, please contact Oracle Support so they can collect additional diagnostic information to determine the underlying cause of the failure.
If the node, which is failing to rejoin the cluster, had a Factory Reset performed on it and you are observing the 'cluster join' counter increasing rapidly then it is possible the clustering was not 'unconfigured' on the peer node as part of the Factory Reset procedure. In this case, please engage Oracle Support for assistance.
Step 1. Power off the cluster node that is failing to rejoin the cluster
The node must be powered off to ensure the cluster interconnect is offline. Simply shutting down the node is not sufficient in this case.
Step 2. Restart the management service (called akd) on the working node
PLEASE NOTE: A watchdog feature was added in the AK 8.6.0 (2013.1.6.0) release for cluster systems. If the management software on Node-A (the Appliance Kit Daemon, or AKD) is stopped or restarted while AKD on Node-B is in an unknown state or down, or while the Node-B head is powered off, Node-A WILL PANIC to prevent a situation that might corrupt the data in the pool. See Doc ID 2174141.1 (Restart of the Appliance Kit Daemon (akd) May Panic a ZFS Cluster Node).
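The watchdog condition in the note above can be written down as a small pre-flight check before anyone restarts akd. The following is only a sketch of that rule, not an Oracle tool: the `PeerState` labels, the function names, and the short dotted release format are hypothetical, and it assumes that releases later than 2013.1.6.0 keep the watchdog.

```python
from enum import Enum


class PeerState(Enum):
    """Hypothetical labels for the peer node's condition, taken from the note."""
    RUNNING = "running"
    UNKNOWN = "unknown"
    DOWN = "down"
    POWERED_OFF = "powered_off"


# First release that includes the watchdog, as stated in the note (2013.1.6.0).
WATCHDOG_RELEASE = (2013, 1, 6, 0)


def parse_release(text):
    """Turn a short dotted release string such as '2013.1.6.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def akd_restart_may_panic(release, peer_state):
    """Return True if restarting akd on this node risks a watchdog panic.

    Assumption: every release at or after 2013.1.6.0 has the watchdog.
    """
    has_watchdog = parse_release(release) >= WATCHDOG_RELEASE
    return has_watchdog and peer_state is not PeerState.RUNNING
```

For example, `akd_restart_may_panic("2013.1.6.0", PeerState.POWERED_OFF)` returns `True`, which is the situation the note warns about; in that case consult Doc ID 2174141.1 or Oracle Support before proceeding.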
Connect to the working node and issue the following CLI command:
Step 3. Wait for the system to restart the management interfaces and resume normal operation
It may take several minutes for the management services to fully initialize. Once you have regained access to the Admin BUI or CLI, check that the system is working correctly.
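Because the management services may take several minutes to come back, a scripted wait can be more reliable than retrying by hand. The helper below is a generic polling sketch rather than anything supplied with the appliance: `wait_until` and its parameters are hypothetical, and `probe` stands for whatever check you already trust, such as a function that tries to open the Admin BUI or CLI and returns `True` on success.

```python
import time


def wait_until(probe, timeout_s=600.0, interval_s=15.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly until it returns True or timeout_s elapses.

    probe is a caller-supplied function (hypothetical here) that returns True
    once the management interface answers. Connection errors raised by probe
    are treated as "not ready yet". Returns True when ready, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused or reset while the services restart
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The default of ten minutes with a 15-second interval is an arbitrary choice for illustration; adjust both to suit the system. A `False` result means only that the probe never succeeded within the window, not that the node has failed.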
Step 4. Power on the second cluster node
At this point the system should be working correctly with all resources available from the single working node. Now power on the remaining node; this time it should rejoin the cluster successfully.
Step 5. Check the system is working as a cluster
From the Admin BUI you can check the status from the Configuration -> Cluster page.
Note: If the cluster node still fails to join the cluster then further investigation will be required. Please contact Oracle Support so they can collect additional diagnostic information in order to determine the underlying cause of the failure.

References
<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems