
Asset ID: 1-72-1489376.1
Update Date: 2015-01-03
Keywords:

Solution Type: Problem Resolution (Sure)

Solution 1489376.1: Exalogic ZFS Storage Pool Failure or Faulted


Related Items
  • Sun ZFS Storage 7320
  • Oracle Exalogic Elastic Cloud Software
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-5989624601>

Applies to:

Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
Oracle Exalogic Elastic Cloud Software - Version 1.0.0.0.0 and later
Sun ZFS Storage 7320
7000 Appliance OS (Fishworks)

Symptoms

When trying to view shares on the storage node, the following error is returned:

XXXXXXsnXX:> shares

error: The action could not be completed because the target 'exalogic/local' no
      longer exists on the system. It may have been destroyed or renamed by
      another user, or the current bookmark may be stale

 

Changes

Logzillas were swapped and memory was upgraded on the storage node where the issue is reported.

Cause

After the cluster takeover, the logzillas were removed but were not added back.
 

Solution

Please contact Oracle Support for assistance with the manual steps required to recover from this situation.

For Exalogic Support Engineers:

The following INTERNAL ONLY section of this note describes the steps that need to be performed under the supervision of Exalogic Support and a ZFS engineer.

Perform the following steps to resolve the issue:

 1. Check zpool list

XXXXXXsnXX:> confirm shell zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
exalogic 16.3T 12.2T 4.09T 74% 1.00x DEGRADED -
system 464G 167G 297G 35% 1.00x ONLINE -

 For more information on zpool, refer to the ZFS documentation.
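
As a quick supplement (a minimal sketch, assuming the degraded pool is named exalogic as in the output above), the per-device state that steps 6 and 10 below refer to can be viewed with zpool status from the same appliance shell:

XXXXXXsnXX:> confirm shell zpool status exalogic

The logs section of the output should show the REMOVED/UNAVAIL logzilla devices that this procedure cleans up.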

 2. Generate an akd core (gcore `pgrep -xo akd`) and look at the ::nas_cache output:

# echo ::nas_cache | mdb core.<PID>

In this faulted state, the ::nas_cache dcmd returns no output.
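
Put together, step 2 looks roughly like this (a minimal sketch, assuming gcore writes the core file with its default core.<PID> name in the current working directory):

XXXXXXsnXX# PID=`pgrep -xo akd`
XXXXXXsnXX# gcore $PID
XXXXXXsnXX# echo ::nas_cache | mdb core.$PID

If the cache is in the broken state described in this note, the last command prints nothing.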

3. Verify whether zfs/exalogic is faulted in akd, as shown below:

> ::ak_rm_elem !grep exalo
93ea348 SINGLETON FAULTED ak:/zfs/exalogic
17848b48 SYMBIOTE IMPORTED ak:/nas/exalogic
178488c8 SYMBIOTE FAULTED ak:/replication/exalogic
17848648 SYMBIOTE FAULTED ak:/shadow/exalogic
178483c8 SYMBIOTE IMPORTED ak:/fct/exalogic
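
The same check can also be driven non-interactively against the running akd process (a sketch that reuses the mdb -p invocation from step 5; the !grep in the interactive session above simply pipes the dcmd output to the shell's grep):

XXXXXXsnXX# echo ::ak_rm_elem | mdb -p `pgrep -ox akd` | grep exalo

A FAULTED state for ak:/zfs/exalogic confirms the condition described in this note.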

 4. Disable and re-enable akd

Note:

In this step, akd will be restarted. Before and after disabling and enabling akd, check the peer cluster state and verify that the state reported by the peer is settled and not in transition (for example, rebooting or joining); it should be owner or stripped.
Otherwise akd may fail to restart, or the node may refuse to rejoin the cluster.

Disable akd:

svcadm disable -t akd

then enable it:

svcadm enable akd

It will take a couple of minutes, but once akd has restarted the shares will be accessible again.
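
For reference, the restart sequence with a basic state check between the commands might look like this (a sketch, assuming akd is the SMF service name used above; svcs is used only to confirm the service state, and should report "disabled" after the first command and "online" again after the last):

XXXXXXsnXX# svcadm disable -t akd
XXXXXXsnXX# svcs akd
XXXXXXsnXX# svcadm enable akd
XXXXXXsnXX# svcs akd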

NOTE:

Instead of disabling and enabling akd (which restarts it), you can alternatively run the following command:

raw nas.discover({pool:"<pool>"})

where <pool> is the name of the pool (for example, raw nas.discover({pool:"my pool"})).

5. After disabling/enabling akd, check the nas_cache again; you should now see Entries and Mountpoints.

 

# echo ::nas_cache | mdb -p `pgrep -ox akd`

nas cache at 0x910b6c8

Entries:

ADDR DATASET STATE FLAGS
16c3c208 exalogic CLEAN PENDING
19558608 exalogic/nas-rr-1f282e13-bc81-49e4-bd97-80bd0d24479c CLEAN PENDING|REPL_PKG
...
Mountpoints:

19557208 NONE /export/binaries/mw_home1
19557a08 NONE /export/config/admin
1976d008 NONE /export/config/domains
18ff0808 NONE /export/config/jmsjta
...

 

6. Once the nas_cache looks fine, fix the logzillas that are UNAVAIL (note: in this case the logzillas had been replaced with larger logzillas):

 

> logs
replacing-9 DEGRADED 0 2 0
c0t5000A72030022B90d0 REMOVED 0 0 0
c0t5000A7203004D200d0 UNAVAIL 0 0 0 corrupted data
replacing-10 DEGRADED 0 2 0
c0t5000A72030022B98d0 REMOVED 0 0 0
c0t5000A7203004D24Fd0 UNAVAIL 0 0 0 corrupted data
replacing-11 UNAVAIL 0 0 0 insufficient replicas
c0t5000A72030022315d0 REMOVED 0 0 0
c0t5000A7203004D21Ed0 UNAVAIL 0 0 0 corrupted data
c0t5000A7203004D221d0 ONLINE 0 0 0

 

7. Check for the GUID that needs to be cleaned up using ::spa -v...

> ::spa -v
...
ffffff826238f2c0 DEGRADED - replacing
ffffff826238a080 REMOVED - /dev/dsk/c0t5000A72030022B98d0s0
ffffff82623a40c0 CANT_OPEN BAD_LABEL /dev/dsk/c0t5000A7203004D24Fd0s0
...

8. To get the GUID, do the following:

 

> ffffff826238f2c0::print vdev_t vdev_guid |=E

1769560619761342386

 

 9. Now remove the GUID from the pool

XXXXXXsnXX# zpool remove exalogic 1769560619761342386 &
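
Because the removal is started in the background (&), wait for it to complete before moving on (a minimal sketch using the shell's own job control; the zpool status check in the next step then reflects the final state):

XXXXXXsnXX# wait
XXXXXXsnXX# zpool status exalogic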

 10. Now check the zpool status again:

logs
replacing-9 DEGRADED 0 2 0
c0t5000A72030022B90d0 REMOVED 0 0 0
c0t5000A7203004D200d0 UNAVAIL 0 0 0 corrupted data
replacing-11 DEGRADED 0 2 0
c0t5000A72030022315d0 REMOVED 0 0 0
c0t5000A7203004D21Ed0 UNAVAIL 0 0 0 corrupted data
c0t5000A7203004D221d0 ONLINE 0 0 0

 11. Once all of the stale devices have been removed, add the logzillas back:

XXXXXXsnXX# zpool add exalogic log c0t5000A7203004D21Ed0 &
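
As a final check (a sketch, assuming the same pool and device name as above), confirm that the log device was added and is healthy:

XXXXXXsnXX# zpool status exalogic

The logs section should now list c0t5000A7203004D21Ed0 as ONLINE, with no remaining REMOVED or UNAVAIL entries.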

 

References

<NOTE:1476998.1> - List Of Hardware and Storage Related Notes That Exalogic SRs Can Be Linked To/Referenced While Working Exalogic SRs

Attachments
This solution has no attachment