
Asset ID: 1-72-2174141.1
Update Date: 2017-07-04

Solution Type: Problem Resolution

Solution 2174141.1: Oracle ZFS Storage Appliance: Restart of the Appliance Kit Daemon (akd) May Panic a ZFS Cluster Node


Related Items
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-2
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS4-4
  • Sun ZFS Storage 7120
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Changes
Cause
Solution


Created from <SR 3-12810845751>

Applies to:

Oracle ZFS Storage ZS4-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-2 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7420 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)

Symptoms

An Oracle ZFSSA cluster has transitioned into a degraded state. In this particular issue, three conditions will be true:

  • Node-A is reporting as the AKCS_OWNER 
  • Node-B is reporting as Unknown (disconnected or restarting)
  • The cluster links are reporting as AKCIOS_TIMEDOUT

The reason for the degraded state may be known (for example, Node-B is turned off), or it may be unknown. You can see these cluster states from the CLI with the following commands. 

ZFSSA:> configuration cluster show
  Properties:
             state = AKCS_OWNER
       description = Active (takeover completed)
          peer_asn = 0e4ab067-942c-6b9b-da8a-d7a7985e7d93
     peer_hostname = s7320-bur09-a-head-1
        peer_state =
  peer_description = Unknown (disconnected or restarting)

ZFSSA:> configuration cluster links

   clustron2:0/clustron_uart:0 = AKCIOS_TIMEDOUT
   clustron2:0/clustron_uart:1 = AKCIOS_TIMEDOUT
   clustron2:0/dlpi:0 = AKCIOS_TIMEDOUT

While the cluster is in this state, the management software on Node-A (also known as the Appliance Kit Daemon, or akd) is restarted. If akd on Node-A is stopped or restarted while akd on Node-B is in an unknown state, down, or the head is powered off, the watchdog WILL PANIC Node-A to prevent a situation that might corrupt the data in the pool. In summary, if akd is stopped while akd is not running correctly on the other head, the node panics to prevent data corruption. For example, restarting the management software from the CLI in this state will trigger the panic:

ZFSSA:> confirm maintenance system restart

If there is access to the console, you will see the following panic string in the console log.

   panic[cpu11]/thread=ffffff005d0eec20: akd_failed:pools_imported;no_working_uarts

   ffffff005d0eeb10 clustron:clustron_akd_watchdog+d2 ()
   ffffff005d0eeb90 clustron:clustron_softintr+282 ()
   ffffff005d0eebd0 unix:av_dispatch_softvect+62 ()
   ffffff005d0eec00 apix:apix_dispatch_softint+33 ()
   ffffff005e393820 unix:switch_sp_and_call+13 ()

   syncing file systems... 1 done
   ireport.io.akd_failed:pools_imported;no_working_uarts ena=513473afd1602c01
   detector=[ version=0 scheme="dev" device-path=
   "/pci@0,0/pci8086,340e@7/pci111d,8039@0/pci111d,8039@2/pci108e,7b07@0,1" ]
   pri="medium"

 

Changes

This watchdog feature was added for clustered systems in AK 8.6.0 (2013.05.06.6.0, also known as 2013.1.6).

This feature is also present in AK 8.7.0 and will likely remain in future code releases.

Cause

Prior to the release of 2013.1.6, a serious data integrity issue could occur if the Appliance Kit Daemon stopped while the cluster links were down: both heads could attempt to write to the data pool(s), causing corruption.

To avoid corrupting data, the watchdog panics the head when akd goes down while akd on the other head is not communicating, for whatever reason.

This is expected behavior in AK 8.6 and future releases.
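
To confirm whether the watchdog is armed on a running head, the kernel tunable used in the Solution below can be read (without writing) from mdb. A minimal sketch, assuming the watchdog_warn_only symbol referenced later in this note, where a value of 0 means the watchdog is armed to panic:

   # echo "watchdog_warn_only/X" | mdb -k
   watchdog_warn_only:             0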

 

Solution

The solution is to fix the underlying cluster link issue.

Whether the fix is as simple as powering up the other node, or something more complicated that requires Oracle Support, you must verify the cluster links before restarting the Appliance Kit software.

 

From the CLI, the links should report as follows:

ZFSSA :> configuration cluster links

   clustron2:0/clustron_uart:0 = AKCIOS_ACTIVE
   clustron2:0/clustron_uart:1 = AKCIOS_ACTIVE
   clustron2:0/dlpi:0 = AKCIOS_ACTIVE
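
If this check needs to be scripted, a minimal sketch is shown below; it assumes the standard ZFSSA CLI scripting environment, where run() returns the command output as a string:

   ZFSSA:> script
   ("." to run)> var out = run('configuration cluster links');
   ("." to run)> if (out.indexOf('AKCIOS_TIMEDOUT') != -1) {
   ("." to run)>         printf("Cluster links are down; fix them before restarting akd.\n");
   ("." to run)> } else {
   ("." to run)>         printf("Cluster links look healthy.\n");
   ("." to run)> }
   ("." to run)> .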

 

Note: this problem does NOT apply to standalone appliances (there will be no cluster commands) or to appliances where clustering is not configured.

ZFSSA :> configuration cluster show
   Properties:
                state = AKCS_UNCONFIGURED
         description = Clustering is not configured

 

There are circumstances in which Oracle Support personnel must deliberately put a cluster into this state; the most obvious is when a cleanup of cluster objects (also known as stash objects) is required.

With the watchdog in place, the steps should now be:

  • Power down node-A (All resources move to node-B)
  • Disable the watchdog on node-B
    # echo "watchdog_warn_only/W 1" | mdb -kw
  • Once node-A is down, stop akd on node-B
    # svcadm disable -t akd
  • Proceed with removal and cleanup of stash objects from node-B
  • Restart akd on node-B
    # svcadm enable akd
  • Once akd comes up on node-B, power on and boot up node-A
  • Enable the watchdog on node-B
    # echo "watchdog_warn_only/W 0" | mdb -kw

Without first disabling the watchdog, node-B will panic as soon as akd is disabled. A consolidated sketch of the node-B commands is shown below.
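
For convenience, the node-B portion of the procedure is consolidated here. This is a sketch only: the svcadm and mdb commands are taken from the steps above, svcs -l is added purely to verify the service state, and the stash cleanup itself is left as a placeholder to be performed under Oracle Support direction.

   # echo "watchdog_warn_only/W 1" | mdb -kw    (disable the watchdog)
   # svcadm disable -t akd                      (temporarily stop akd)
   # svcs -l akd                                (verify akd is disabled)
   ... removal and cleanup of stash objects, per Oracle Support ...
   # svcadm enable akd                          (restart akd)
   # echo "watchdog_warn_only/W 0" | mdb -kw    (re-enable the watchdog)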

 

Ref: Bug 24484064 ZFS appliance panics during akd restart if cluster links are down

 


Attachments
This solution has no attachment