
Asset ID: 1-72-1003887.1
Update Date: 2017-10-04
Keywords:

Solution Type: Problem Resolution

Solution 1003887.1: Sun Storage 3510 FW 4.11I : Recover Cluster : Storage failed initialization with an Unrecoverable Error


Related Items
  • Sun Storage 3511 SATA Array
  • Solaris Cluster
  • Sun Storage 3510 FC Array
Related Categories
  • PLA-Support>Sun Systems>DISK>Arrays>SN-DK: SE31xx_33xx_35xx
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Modular Disk - 3xxx Arrays

Previously Published As
205454


Applies to:

Sun Storage 3510 FC Array - Version Not Applicable and later
Solaris Cluster - Version 3.0 and later
Sun Storage 3511 SATA Array - Version Not Applicable and later
All Platforms

Symptoms

A Sun Storage 3510/3511 Array, running Firmware 4.11I in a Cluster configuration, is rebooted (or power cycled).  There is a Cluster Quorum device on the array. The reboot of the array fails to complete. The following messages appear on the console.

Restoring saved persistent reservations.
Preparing to restore saved persistent reservations. Type 'skip' to skip: 
 Unrecoverable Controller Error Encountered !
Resetting Controller !
 Unrecoverable Controller Error Encountered !
Resetting Controller !

Changes

There have been no changes to the configuration or hardware, other than the reboot / power cycle of the array.

Cause

During the initialization of the 3510 array, the controller reads the saved reservations, or keys, from the disks and loads them into memory. This is how it restores persistent reservations in a cluster environment. If the buffer is not available in time, the computed offset to the memory buffer may be wrong, and the SE3510 fails to complete the restoration of the reservation keys. This usually happens when there are a large number of reservations to be restored.

Solution

This problem is due to a firmware issue in the 4.11I code of the 3510. See Sun Alert <Document 1000382.1> Sun StorEdge 3510/3511 Arrays May Fail to Boot Upon Reset/Power Cycle When Connected in a Sun Cluster 3.x Environment With 3+ Nodes. The best solution is to upgrade to a version of firmware later than 4.11I: download <Patch 9605238> for the 3510 or <Patch 9605249> for the 3511. The problem does not occur on 3.27R firmware.

If the 3510 array is stuck in its boot sequence, the following procedure may be used to work around the problem and bring the array back online.

1. If the cluster is running, shut down the cluster gracefully using "scshutdown".
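For example, from one cluster node (a minimal sketch assuming Sun Cluster 3.x command syntax; adjust the grace period and broadcast message for your site):

# /usr/cluster/bin/scshutdown -y -g0 "Shutting down cluster for SE3510 recovery"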

2. Bring all the servers to the "ok" prompt to quiesce all I/O.
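If a node is still running Solaris (for example, it was booted outside the cluster), one way to reach the OpenBoot "ok" prompt on a SPARC server is:

# init 0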

3. Power off the 3510 array.

4. Establish a serial console (tip) connection to either 3510 controller.
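For example, on a Solaris host cabled to the controller's serial (COM) port, the tip utility can be used. This is a sketch only; it assumes the hardwire entry in /etc/remote points at the serial port you cabled and has been set to 38400 bps, the 3510 console default:

# tip hardwire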

5. Power on the storage array.

6. When the following message appears on the console, type 'skip':

Restoring saved persistent reservations.
Preparing to restore saved persistent reservations. Type 'skip' to skip: skip 

7. The array will now pass initialization and come up without persistent reservations. The usual console messages are displayed, along with confirmation that the persistent reservations were skipped.

Restoration of saved persistent reservations skipped.

8. Boot all cluster nodes into non-cluster, single-user mode (boot -sx). See <Document 1018806.1> Oracle Solaris Cluster 3.x: Recovering from Amnesia.
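At the OpenBoot prompt on each node, for example:

ok boot -sx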

A. Copy the existing infrastructure file.

# /usr/bin/cp /etc/cluster/ccr/infrastructure /etc/cluster/ccr/infrastructure.bk

B. Remove all the quorum-related information from the infrastructure file and save it.

For example, assume a four-node RAC cluster with d100 as the only configured quorum device. In that case, remove all of the following lines (a quick check with grep is shown after the list):

cluster.quorum_devices.1.name d100
cluster.quorum_devices.1.state enabled
cluster.quorum_devices.1.properties.votecount 3
cluster.quorum_devices.1.properties.gdevname /dev/did/rdsk/d100s2
cluster.quorum_devices.1.properties.path_1 enabled
cluster.quorum_devices.1.properties.path_2 enabled
cluster.quorum_devices.1.properties.path_3 enabled
cluster.quorum_devices.1.properties.path_4 enabled
cluster.quorum_devices.1.properties.access_mode scsi3
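To confirm which quorum entries are present before editing them out (a read-only check; the path matches the file referenced in step C):

# grep quorum_devices /etc/cluster/ccr/infrastructure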

C. Regenerate the checksum of the infrastructure file by running:

For SC 3.2u3 or later:

# /usr/cluster/lib/sc/ccradm recover -o /etc/cluster/ccr/infrastructure

For SC before 3.2u3:

# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/infrastructure -o

D. Repeat steps A, B, and C on all the nodes in the cluster.

9. Boot all the cluster nodes; the cluster should now form. Because the quorum configuration was removed in step 8, the cluster is currently running without a quorum device. Configure a new quorum device on the desired SE3510 LUN. Once this is complete, cluster high availability is restored.
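For example, from one cluster node (the device name d100 follows the earlier example and is an assumption; substitute the DID device you intend to use as the quorum device):

For SC 3.2 or later:

# /usr/cluster/bin/clquorum add d100

For SC 3.0/3.1:

# /usr/cluster/bin/scconf -a -q globaldev=d100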


Additional Information

The cluster software registers keys from each host on each path to every shared device. It also takes a reservation from one of the hosts on every device. That is a fencing
operation for the global devices. There appears to be a limit on the size of the saved persistent reservation file for the 4.11 firmware.

Suppose there are 64 partitions created on the SE3510 in a four-node cluster (with two paths to each LUN) connected in a SAN configuration. The number of reservation keys to be loaded for persistent reservation would be: 64 devices * 4 hosts * 2 paths = 512 keys.

In this case, each host has a key registered on each of two paths to each of the 64 devices; four hosts means there are 512 registered keys. One host fences the storage, so there are also 128 existing reservations (64 devices * 2 paths). The firmware fails when attempting to load these values into memory.


Attachments
This solution has no attachment