Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1403503.1
Update Date: 2018-05-21
Keywords:

Solution Type: Troubleshooting

Solution 1403503.1: Sun Storage 7000 Unified Storage System: A cluster node fails to rejoin the cluster


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS3-2
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun Storage 7310 Unified Storage System
  • Oracle ZFS Storage Appliance Racked System ZS4-4
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-BA
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage



In this Document
Purpose
Troubleshooting Steps
References


Applies to:

Oracle ZFS Storage ZS3-2 - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-BA - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
Oracle ZFS Storage Appliance Racked System ZS4-4 - Version All Versions and later
7000 Appliance OS (Fishworks)
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
JBODs Model : [not dependent]
CLUSTER related : [yes]


Purpose

This document is provided to assist in troubleshooting cluster join issues where one node of a cluster, following a reboot, fails to rejoin the cluster.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance Community

Troubleshooting Steps

NOTE: To confirm that the cluster 'links' cabling is correctly configured, see Document ID 2081179.1.

 

When one cluster node fails to join the cluster, the problem is often due to load on the working cluster node resulting in slow communication between the nodes, or sometimes due to cluster-wide locking issues on the working node.

In the former case, simply leaving the system to attempt the rejoin operation may be sufficient, and the join operation may eventually complete successfully.

In the latter case, it is unlikely the second node will manage to rejoin the cluster, and this document provides a workaround for this particular issue.

Note:  If you wish to know the cause of the node's failure to rejoin the cluster, please contact Oracle Support so they can collect additional diagnostic information to determine the underlying cause of the failure.

 

If the node that is failing to rejoin the cluster had a Factory Reset performed on it, and you observe the 'cluster join' counter increasing rapidly, it is possible that clustering was not 'unconfigured' on the peer node as part of the Factory Reset procedure.  In this case, please engage Oracle Support for assistance.


If you wish to try to resolve the issue yourself, please follow these steps.

Step 1   Power down the node that is failing to join the cluster

The node must be powered off to ensure the cluster interconnect is offline. Simply shutting down the node is not sufficient in this case.

From the console where the system is reporting the cluster join failure message, press the following key sequence to Halt the system:


 <ESC>-3 - Halt system


If possible, connect to the SP (Service Processor, sometimes called the ILOM) of the system and log in as the root user. Check the system status by issuing the following command:

-> show /SYS

Towards the end of the output, under the section entitled 'Properties', the power state will be displayed:

    Properties:
        type = Host System
        chassis_name = SUN FIRE X4240
        chassis_part_number = 540-7618-XX
        .   .   .
        product_manufacturer = SUN MICROSYSTEMS
        power_state = On

If the system is powered-on then issue the following SP command to power-off the system:

-> stop /SYS

At this point the system will be powered off.  You can check the status by reissuing the 'show /SYS' command as used earlier.
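
For example, an SP session to power off the node and confirm the result might look like the following (an illustrative sketch only; the exact confirmation prompts and output vary by ILOM version):

-> stop /SYS
Are you sure you want to stop /SYS (y/n)? y
Stopping /SYS

-> show /SYS
    .   .   .
    power_state = Off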

If you have access to the node itself then you can simply depress the Power button on the front panel.

 

Step 2   Restart the management service (called akd) on the working node

PLEASE NOTE: A watchdog feature was added to the AK 8.6.0 (2013.1.6.0) release for cluster systems.

If the management software on Node-A (aka the Appliance Kit Daemon, or AKD) is stopped (or restarted) while AKD on Node-B is in an unknown state, down, or the head is powered off, Node-A WILL PANIC to prevent a situation that might corrupt the data in the pool.

See Doc ID 2174141.1 (Restart of the Appliance Kit Daemon (akd) May Panic a ZFS Cluster Node)

 

Connect to the working node and issue the following CLI command:

 >  maintenance system restart

This may affect the data services (see Document 1543359.1). It will restart the Admin interfaces and, as a result, you will be logged out of the CLI session.
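
For reference, a minimal CLI session on the working node might look like this (an illustrative sketch; 'working-node' is a hypothetical hostname and the exact messages shown before the session is disconnected may differ between releases):

$ ssh root@working-node
Password:
working-node:> maintenance system restart

The SSH session is then disconnected while the management service restarts.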

 

Step 3   Wait for the system to restart the management interfaces and resume normal operation

It may take several minutes for the management services to fully initialize.  Once you have regained access to the Admin BUI or CLI, check that the system is working correctly.
You may wish to wait one or two minutes more to ensure the system has fully recovered before proceeding.
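
One quick check, for example, is to confirm that the working node still reports itself as the cluster owner before you power the other node back on (an illustrative sketch; the exact description text may vary by software release):

working-node:> configuration cluster show
    state = AKCS_OWNER
    description = Active (takeover completed)
    .   .   .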

 

Step 4   Power on the second cluster node

At this point the system should be working correctly with all resources available from the single working node.  We can power on the remaining node and this time it should rejoin the cluster successfully.

If you have access to the node itself then you can simply depress the Power button on the front panel to power-on the node.

If you have access to the SP then issue the following SP command:

-> start /SYS

This will power on the node.  You can check the power state by issuing the 'show /SYS' command, as before.
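
For example (an illustrative sketch; prompts and output vary by ILOM version):

-> start /SYS
Are you sure you want to start /SYS (y/n)? y
Starting /SYS

-> show /SYS
    .   .   .
    power_state = On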

Once the system has completed its power-on self-tests, it will load the operating system and appliance firmware and start operation.

 

Step 5   Check that the system is working as a cluster

From the Admin BUI you can check the status from the Configuration -> Cluster page.

From the CLI you can issue the following command:

   > configuration cluster show


The cluster will probably show one node as Owner and the other as Stripped, indicating the cluster is operational and ready for the cluster failback operation.
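
For example, the output might resemble the following (an illustrative sketch; 'node-a' and 'node-b' are hypothetical hostnames and the exact property names and descriptions may vary by software release):

node-a:> configuration cluster show
    state = AKCS_OWNER
    description = Active (takeover completed)
    peer_hostname = node-b
    peer_state = AKCS_STRIPPED
    peer_description = Ready (waiting for failback)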



Note:  If the cluster node still fails to join the cluster, further investigation will be required.

Please contact Oracle Support so they can collect additional diagnostic information in order to determine the underlying cause of the failure.

References

<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.