Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1543385.1
Update Date:2018-01-05
Keywords:

Solution Type  Technical Instruction Sure

Solution  1543385.1 :   Sun Storage 7000 Unified Storage System: How To identify and avoid a cluster split brain situation  


Related Items
  • Sun ZFS Storage 7420
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7320
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS


This document is intended to help resolve cluster split brain situations.

In this Document
Goal
Solution
 Symptoms
 Changes
 Cause
References


Created from <SR 3-6973271013>

Applies to:

Sun ZFS Storage 7420 - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7320 - Version All Versions and later
7000 Appliance OS (Fishworks)

Goal


This document explains how to resolve a cluster split brain situation and how to avoid one.

Solution

A split brain occurs only when the cluster cables have been removed and the nodes were booted while those links were disconnected.
If you face this situation, apply the following action plan:

  1. Reconnect the 3 Clustron cables
  2. Reboot both heads immediately after having replugged the cables

It is crucial to perform step 2 immediately after step 1. Otherwise, one of the nodes will panic.
Once the cables are reconnected, the nodes appear to have returned to production: each node has its own pool and is serving data from that pool.
However, in a cluster 'split brain' situation, an attempt to acquire a 'sas lock' will occur after a while, because the akd of one head tries to take ownership of all the disks.
This is how akd normally reacts when it performs a takeover.
The attempt to acquire a 'sas lock' is therefore simply akd trying to put a lock on the disks.
In a split brain, the disks of the pool owned by the peer cannot be taken over, so the attempt to acquire a 'sas lock' causes a (ZFS) panic.
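The reasoning above can be illustrated with a small conceptual model. This is only a sketch of the logic described in this document, not the akd implementation: the function `take_over_all_disks`, the `LockConflict` exception (standing in for the panic) and the disk names are all hypothetical.

```python
class LockConflict(Exception):
    """Illustrative stand-in for the panic described in the text."""


def take_over_all_disks(head, disk_owner, live_heads):
    """Try to lock every disk for `head`.

    disk_owner: dict mapping disk name -> head currently holding it.
    live_heads: set of heads that are running and serving data.
    """
    # First pass: refuse if any disk is held by a peer that is still running.
    for disk, owner in disk_owner.items():
        if owner != head and owner in live_heads:
            raise LockConflict(f"{disk} is still held by running head {owner}")
    # Second pass: no conflict was found, so `head` takes every disk.
    for disk in disk_owner:
        disk_owner[disk] = head
    return disk_owner


if __name__ == "__main__":
    # Normal takeover: h2 is down, so h1 can lock all disks.
    disks = {"disk0": "h1", "disk1": "h2"}
    print(take_over_all_disks("h1", disks, live_heads={"h1"}))

    # Split brain: h2 is up and serving its pool, so the attempt fails.
    disks = {"disk0": "h1", "disk1": "h2"}
    try:
        take_over_all_disks("h1", disks, live_heads={"h1", "h2"})
    except LockConflict as exc:
        print("conflict:", exc)
```

In this model the only difference between a normal takeover and a split brain is whether the peer is still running and holding its disks, which is why rebooting both heads right after reconnecting the cables avoids the conflict.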

Symptoms

In a split brain situation, the clustered NAS nodes 'ignore' each other.

Changes

This situation is created when the 3 cables normally linking the cluster interconnect Clustron cards are physically unplugged.

NOTE: To confirm that the cluster 'links' cabling is correctly configured - See Document ID 2081179.1

 

Cause

A split brain situation should only happen when someone has manually unplugged the cluster cables and rebooted the head(s) while the cables were disconnected.
In that situation, "cli> configuration cluster show" reports the same state on both heads: each head is AKCS_OWNER and cannot see its peer.
On one head, the peer is shown as 'disconnected or restarting':

h1-mgmt:> configuration cluster show

Properties:
                        state = AKCS_OWNER
                   description = Active (takeover completed)
                      peer_asn = 1f3ef929-734a-ee61-922c-a18e5def6a52
                 peer_hostname = h2
                    peer_state =
              peer_description = Unknown (disconnected or restarting)

Children:
                        resources => Configure resources

And the partner (peer) head will also show the peer as 'disconnected or restarting':

h2-mgmt:> configuration cluster show

Properties:
                        state = AKCS_OWNER
                   description = Active (takeover completed)
                      peer_asn = 2f5ef629-712a-ef61-922c-a18e5de12345
                 peer_hostname = h1
                    peer_state =
              peer_description = Unknown (disconnected or restarting)

Children:
                        resources => Configure resources

Again, this kind of situation is caused by human intervention and never happens on its own.
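The comparison of the two outputs can be sketched in Python. This is a minimal sketch, assuming the text of "configuration cluster show" has already been captured from each head (how it is collected is out of scope here). The function names `parse_cluster_show` and `looks_like_split_brain` are hypothetical and are not part of the appliance software.

```python
def parse_cluster_show(text):
    """Parse the 'Properties:' block of the CLI output into a dict."""
    props = {}
    in_props = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Properties:":
            in_props = True
            continue
        if stripped == "Children:":
            break  # nothing after this point is a property
        if in_props and "=" in stripped:
            key, _, value = stripped.partition("=")
            props[key.strip()] = value.strip()
    return props


def looks_like_split_brain(props_a, props_b):
    """True when BOTH heads are owner and both report the peer as unknown."""
    def owner_without_peer(p):
        return (p.get("state") == "AKCS_OWNER"
                and p.get("peer_state", "") == ""
                and "disconnected or restarting" in p.get("peer_description", ""))
    return owner_without_peer(props_a) and owner_without_peer(props_b)


# Sample text copied from the outputs shown in this document.
H1_OUTPUT = """Properties:
    state = AKCS_OWNER
    description = Active (takeover completed)
    peer_hostname = h2
    peer_state =
    peer_description = Unknown (disconnected or restarting)

Children:
    resources => Configure resources
"""
H2_OUTPUT = H1_OUTPUT.replace("peer_hostname = h2", "peer_hostname = h1")

if __name__ == "__main__":
    h1 = parse_cluster_show(H1_OUTPUT)
    h2 = parse_cluster_show(H2_OUTPUT)
    print(looks_like_split_brain(h1, h2))
```

The check requires the pattern on both heads: a single head showing its peer as 'disconnected or restarting' is also what a normal takeover looks like while the peer is really down.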

References

<NOTE:1402545.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Cluster Problems

  Copyright © 2018 Oracle, Inc.  All rights reserved.