Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1402545.1
Update Date:2017-10-09
Keywords:

Solution Type  Troubleshooting Sure

Solution  1402545.1 :   Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems  


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS3-2
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun Storage 7310 Unified Storage System
  • Oracle ZFS Storage Appliance Racked System ZS4-4
  • Sun ZFS Storage 7320
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage


This document is provided to assist in troubleshooting cluster issues on the ZFS Storage Appliance.

In this Document
Purpose
Troubleshooting Steps
 Identifying the problem
 Setting-up the Cluster
 Problems during normal cluster operation
 Removing a node from a cluster
 Configuration Guidelines
 Other considerations
 Terms & Definitions 
 References
References


Applies to:

Oracle ZFS Storage ZS4-4 - Version All Versions and later
Oracle ZFS Storage Appliance Racked System ZS4-4 - Version All Versions and later
Sun ZFS Storage 7320 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-2 - Version All Versions and later
7000 Appliance OS (Fishworks)
NAS head revision : [not dependent]
BIOS revision : [not dependent]
ILOM revision : [not dependent]
JBODs Model : [not dependent]
CLUSTER related : [yes]


Purpose

This document is provided to assist in troubleshooting cluster issues.

It helps to frame the problem, identifies some known issues, and provides guidelines for obtaining a stable clustered system.

This document is written as a resolution path; each step links to other, more specific documents.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance Community

Troubleshooting Steps

 

Note: For any cluster issue that is not addressed by this document, please contact Oracle Support for assistance in diagnosing the issue, and be prepared for remote access to be required.
Ref: Oracle Shared Shell Document 1194226.1

 

Identifying the problem

The following sections address cluster issues according to when they are observed during the clustering life-cycle (that is, when creating the initial cluster, operating a cluster, or removing nodes from a cluster).

If you are experiencing a cluster problem:
 - during the initial configuration of the cluster then see the section Setting-up the Cluster
 - during normal cluster operations then see Problems during normal cluster operation
 - while removing clustering then see Removing a node from a cluster
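
The triage above amounts to a lookup from life-cycle phase to section. As a minimal sketch only (the phase keys and the function name are illustrative assumptions and are not appliance terminology):

```python
# Hypothetical helper: the phase keys are illustrative labels chosen for this
# sketch; the section titles are the ones used in this document.
SECTION_BY_PHASE = {
    "setup": "Setting-up the Cluster",
    "operation": "Problems during normal cluster operation",
    "removal": "Removing a node from a cluster",
}


def section_for(phase: str) -> str:
    """Return the section of this document that covers the given phase."""
    try:
        return SECTION_BY_PHASE[phase]
    except KeyError:
        # Issues outside these three phases are not covered here; this
        # document directs such cases to Oracle Support.
        return "Contact Oracle Support"
```

For example, section_for("operation") returns the title of the section on problems during normal cluster operation.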

Setting-up the Cluster

For the initial configuration steps see:

    Document 1329307.1 - Sun Storage 7000 Unified Storage System: How to set up NAS clustering

NOTE: To confirm that the cluster 'links' cabling is correctly configured - See Document ID 2081179.1

 

Problems during normal cluster operation

This section describes some common cluster issues that may be observed during normal operations.

1.  A cluster node fails to join the cluster.  See Document 1403503.1

2.  A node reboots following a take-over or fail-back operation

This indicates a resource problem that has been detected by the cluster node that is attempting to acquire its resources from the main node.
For example, if network interfaces are not operational on the second node, that node would be unable to provide a data service following the cluster operation. In this case the node automatically reboots itself, thereby forcing the cluster resources to remain on the working node. Following an automatic reboot such as this, be sure to check the network cables connecting the node to the network switches, and the SAS cables connecting the node to the shelves.

3.  The Admin BUI does not respond when the Configuration:Clustering page is selected

This can be caused by load on the management service (the akd service). If the system is busy performing a lengthy operation, it may not respond to some menu selections until the operation has completed. For some deletion operations this may take several minutes; for large snapshot deletions it may take several hours. This is not necessarily a cluster issue but a management interface issue.


Note:  For any other cluster issues please contact Oracle Support who will work with you in resolving the issue.

Removing a node from a cluster

Before removing a node from a cluster or unconfiguring clustering, be aware that if you run both heads without clustering enabled, they will both try to import the pools and the data will be destroyed. It is critical that you never have two heads accessing the same storage while they are not configured in a cluster. Proceed as follows:

1.  Power off the node ("B") to be removed from the cluster and disconnect it from any storage trays that the other head ("A") will continue to use.

2.  Disconnect the surviving head ("A") from the storage that will go with the head ("B") you are removing from the cluster.

3.  From the remaining node ("A"), in the Admin BUI navigate to the Configuration -> Cluster page and press the <Unconfig> button to remove the cluster configuration.

4.  Detach the cluster interconnect cables. (This step should never be performed on heads that will continue to operate in a cluster.)

5.  Perform a factory reset on head "B" from GRUB when it next boots. Head "B" is then ready for re-deployment.

At this point both of the ZFS SA nodes will operate independently.
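
The ordering of the steps above matters: the cluster must not be unconfigured (step 3) until the two heads can no longer reach the same storage (steps 1 and 2). The following is a minimal sketch of that precondition as an operator checklist. It is not an appliance API; the class, field and function names are assumptions made for illustration, and the values must be confirmed by physical inspection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RemovalChecklist:
    """Hypothetical record of what the operator has physically verified."""
    head_b_powered_off: bool              # step 1
    head_b_detached_from_a_storage: bool  # step 1
    head_a_detached_from_b_storage: bool  # step 2


def blockers_before_unconfigure(check: RemovalChecklist) -> List[str]:
    """List the actions still required before step 3 (unconfigure) is safe.

    An empty list means that the two heads can no longer access the same
    storage, which is the condition this document requires before
    clustering is unconfigured.
    """
    blockers = []
    if not check.head_b_powered_off:
        blockers.append("power off head B")
    if not check.head_b_detached_from_a_storage:
        blockers.append("disconnect head B from the storage head A keeps")
    if not check.head_a_detached_from_b_storage:
        blockers.append("disconnect head A from the storage that goes with head B")
    return blockers
```

If blockers_before_unconfigure() returns a non-empty list, complete the listed actions before pressing <Unconfig> on head "A".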

INTERNAL: FOR TSC USE
If the Admin BUI is inoperative then it is possible to unconfigure clustering from the CLI using the raw command:
> raw cluster.unconfigure();

See also:  Document 1174473.1 - Sun Storage 7000 Unified Storage System: How to factory reset a cluster node without downtime

         

Configuration Guidelines

There are additional items to consider when configuring nodes to form a clustered system. For example, consider how to distribute the data pools and network interfaces between the nodes so that the load is balanced across both nodes.

Oracle recommends that one network interface be dedicated on each node for use as a management interface. In this case the interface is marked as a private resource for that single node.

For more information on clustering, see the online Help pages available from the Admin BUI. Navigate to the Configuration -> Cluster page and click Help in the top right-hand corner of the page; this opens the help pages in the cluster context.
Alternatively, click Help in the top right-hand corner of the page to display the main help page, and then navigate to Configuration and Cluster.

Other considerations

Some cluster-wide resources need special attention when transitioning from one node to the other.  For example, SCSI & FC LUN resources need support from the clients themselves: the clients will need to support ALUA for their FC LUNs.

Some client systems require additional configuration if they are themselves members of a cluster. For example, for some notes on configuring Solaris Cluster see:
    Document 1380870.1 - Sun Storage 7000 Unified Storage System: Configuring the ZFS Storage Appliance to work in Oracle Solaris Cluster

Terms & Definitions 

Cluster : With the ZFS Storage Appliance the term cluster is used to denote a system comprising two identical ZFS SA nodes accessing shared storage and with access to a common network infrastructure.
In the event of a node failure the resources and services of the failed node will be taken by the remaining working node and the services will continue to be provided to clients and users by that node.

  • Cluster types
    • active-active  : a cluster in which the resources are shared between the two nodes and each provides services to clients.
    • active-passive : a cluster in which one node performs most of the work while the second node remains idle until the active node fails, at which point the passive node takes over operation as the now-active node.
  • Cluster States
    • AKCS_CLUSTERED  : Both nodes are running in normal condition sharing resources.
    • AKCS_OWNER      : One node in the cluster owns all of the shared cluster resources.
    • AKCS_STRIPPED   : One node has joined the cluster but does not own any cluster resources (the node is waiting for the administrator to perform a fail-back operation)
  • Cluster operations
    • Take over      : following a node failure the remaining node takes over the resources from the failed node.
    • Fail back      : once a failed node has been repaired and has rejoined the cluster, the node waits for the Administrator to fail back the node's resources from the main node (which owns all of the cluster resources).  On completion of the fail-back operation both nodes will be operating in a fully clustered mode (active-active).
    • Shutdown       : see: Document 1379117.1 - Sun Storage 7000 Unified Storage System: How To Shutdown ZFSSA Cluster
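
The relationship between the cluster states and the take-over and fail-back operations defined above can be sketched as a small state model. This is an illustrative sketch only: the state names come from this document, but the Python class and function names are assumptions and do not correspond to any appliance interface.

```python
from enum import Enum
from typing import Tuple


class ClusterState(Enum):
    AKCS_CLUSTERED = "AKCS_CLUSTERED"  # both nodes running normally, sharing resources
    AKCS_OWNER = "AKCS_OWNER"          # this node owns all shared cluster resources
    AKCS_STRIPPED = "AKCS_STRIPPED"    # this node has joined but owns no resources


def take_over() -> ClusterState:
    """State of the surviving node after it takes over the failed node's resources."""
    return ClusterState.AKCS_OWNER


def rejoin() -> ClusterState:
    """State of a repaired node after it rejoins and waits for a fail-back."""
    return ClusterState.AKCS_STRIPPED


def fail_back(owner: ClusterState,
              rejoined: ClusterState) -> Tuple[ClusterState, ClusterState]:
    """Return both nodes' states after the administrator performs a fail-back.

    In this model a fail-back is only meaningful when one node owns all of
    the resources and the other node has rejoined without any.
    """
    if owner is not ClusterState.AKCS_OWNER or rejoined is not ClusterState.AKCS_STRIPPED:
        raise ValueError("fail-back expects an AKCS_OWNER node and an AKCS_STRIPPED node")
    return (ClusterState.AKCS_CLUSTERED, ClusterState.AKCS_CLUSTERED)
```

In this model, a take-over followed by a rejoin and a fail-back returns both nodes to AKCS_CLUSTERED, which matches the active-active outcome described under Fail back.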

References

Collecting diagnostic data : Document 1019887.1 - Sun Storage 7000 Unified Storage System: How to collect support bundle using the BUI or CLI

Online Help is available in the Admin BUI under the section: Configuration:Cluster

Sun ZFS Storage 7000 System Administration Guide
       http://download.oracle.com/docs/cd/E22471_01/html/820-4167 - see the section on Clustering.

References

<NOTE:1543385.1> - Sun Storage 7000 Unified Storage System: How To identify and avoid a cluster split brain situation
<NOTE:1377062.1> - How To Replace Clustron Card In A Oracle ZFS Storage ZS3-4 Or Sun Storage 7000 Series( Not ZS3-2 )
<NOTE:1194226.1> - Oracle Shared Shell
<NOTE:1329307.1> - Sun Storage 7000 Unified Storage System: How to set up NAS clustering
<NOTE:1379117.1> - Sun Storage 7000 Unified Storage System: How To Shutdown a ZFSSA Cluster
<NOTE:1542550.1> - Sun Storage 7000 Unified Storage System: Communication with the cluster peer via a cluster interconnect link has been lost
<NOTE:1403503.1> - Sun Storage 7000 Unified Storage System: A cluster node fails to rejoin the cluster
<NOTE:1447284.1> - Sun Storage 7000 Unified Storage System: How to upgrade a clustered system
<NOTE:1380870.1> - Sun Storage 7000 Unified Storage System: Configuring the ZFS Storage Appliance to work in Oracle Solaris Cluster

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.