
Asset ID: 1-72-2174141.1
Update Date: 2017-07-04

Solution Type: Problem Resolution

Solution 2174141.1: Oracle ZFS Storage Appliance: Restart of the Appliance Kit Daemon (akd) May Panic a ZFS Cluster Node


Related Items
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-2
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS4-4
  • Sun ZFS Storage 7120
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Changes
Cause
Solution


Created from <SR 3-12810845751>

Applies to:

Oracle ZFS Storage ZS4-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-2 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7420 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)

Symptoms

An Oracle ZFSSA cluster has transitioned into a degraded state. In this particular issue, three conditions will be true:

  • Node-A is reporting as the AKCS_OWNER 
  • Node-B is reporting as Unknown (disconnected or restarting)
  • The cluster links are reporting as AKCIOS_TIMEDOUT

The reason for the degraded state may be known (for example, Node-B is turned off), or it may be unknown. You can see these cluster states from the CLI with the following commands. 

ZFSSA:> configuration cluster show
  Properties:
             state = AKCS_OWNER
       description = Active (takeover completed)
          peer_asn = 0e4ab067-942c-6b9b-da8a-d7a7985e7d93
     peer_hostname = s7320-bur09-a-head-1
        peer_state =
  peer_description = Unknown (disconnected or restarting)

ZFSSA:> configuration cluster links

   clustron2:0/clustron_uart:0 = AKCIOS_TIMEDOUT
   clustron2:0/clustron_uart:1 = AKCIOS_TIMEDOUT
   clustron2:0/dlpi:0 = AKCIOS_TIMEDOUT

While the cluster is in this state, the management software on Node-A (also known as the Appliance Kit Daemon, or akd) is restarted. If akd on Node-A is stopped or restarted while akd on Node-B is in an unknown state, down, or the head is powered off, the watchdog WILL PANIC Node-A to prevent a situation that might corrupt the data in the pool. In summary, if akd is stopped while akd is not running correctly on the other head, the node panics to prevent data corruption. For example, restarting the management software from the CLI in this state will trigger the panic:

ZFSSA:> confirm maintenance system restart

If there is access to the console, you will see the following panic string in the console log.

   panic[cpu11]/thread=ffffff005d0eec20: akd_failed:pools_imported;no_working_uarts

   ffffff005d0eeb10 clustron:clustron_akd_watchdog+d2 ()
   ffffff005d0eeb90 clustron:clustron_softintr+282 ()
   ffffff005d0eebd0 unix:av_dispatch_softvect+62 ()
   ffffff005d0eec00 apix:apix_dispatch_softint+33 ()
   ffffff005e393820 unix:switch_sp_and_call+13 ()

   syncing file systems... 1 done
   ireport.io.akd_failed:pools_imported;no_working_uarts ena=513473afd1602c01
   detector=[ version=0 scheme="dev" device-path=
   "/pci@0,0/pci8086,340e@7/pci111d,8039@0/pci111d,8039@2/pci108e,7b07@0,1" ]
   pri="medium"

 

Changes

This watchdog feature was added for clustered systems in AK 8.6.0 (2013.05.06.6.0, also known as 2013.1.6).

This feature is also present in AK 8.7.0 and will likely remain in future code releases.

Cause

Prior to the release of 2013.1.6, a serious data integrity issue could occur if the Appliance Kit Daemon stopped while the cluster links were down: both heads could attempt to write to the data pool(s), causing corruption.

To avoid corrupting data, the watchdog panics the head when akd goes down while akd on the other head is not communicating, for whatever reason.

This is expected behavior in AK 8.6 and future releases.
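
To confirm whether the watchdog is armed on a running head, the kernel tunable used in the Solution below can be read (without writing) from mdb. A minimal sketch, assuming the watchdog_warn_only symbol referenced later in this note, where a value of 0 means the watchdog is armed to panic:

   # echo "watchdog_warn_only/X" | mdb -k
   watchdog_warn_only:             0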

 

Solution

The solution is to fix the underlying cluster link issue.

Whether the fix is as simple as powering up the other node, or something more complicated that requires Oracle Support, you must verify the cluster links before restarting the Appliance Kit software.

 

From the CLI, the links should report as follows:

ZFSSA :> configuration cluster links

   clustron2:0/clustron_uart:0 = AKCIOS_ACTIVE
   clustron2:0/clustron_uart:1 = AKCIOS_ACTIVE
   clustron2:0/dlpi:0 = AKCIOS_ACTIVE
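
If this check needs to be scripted, a minimal sketch is shown below; it assumes the standard ZFSSA CLI scripting environment, where run() returns the command output as a string:

   ZFSSA:> script
   ("." to run)> var out = run('configuration cluster links');
   ("." to run)> if (out.indexOf('AKCIOS_TIMEDOUT') != -1) {
   ("." to run)>         printf("Cluster links are down; fix them before restarting akd.\n");
   ("." to run)> } else {
   ("." to run)>         printf("Cluster links look healthy.\n");
   ("." to run)> }
   ("." to run)> .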

 

Note: this problem does NOT apply to standalone appliances (there will be no cluster commands) or to appliances where clustering is not configured.

ZFSSA :> configuration cluster show
   Properties:
                state = AKCS_UNCONFIGURED
         description = Clustering is not configured

 

There are circumstances in which Oracle Support personnel must deliberately put a cluster into this state; the most obvious is when a cleanup of cluster objects (also known as stash objects) is required.

With the watchdog in place, the steps should now be:

  • Power down node-A (All resources move to node-B)
  • Disable the watchdog on node-B
    # echo "watchdog_warn_only/W 1" | mdb -kw
  • Once node-A is down, stop akd on node-B
    # svcadm disable -t akd
  • Proceed with removal and cleanup of stash objects from node-B
  • Restart akd on node-B
    # svcadm enable akd
  • Once akd comes up on node-B, power on and boot up node-A
  • Enable the watchdog on node-B
    # echo "watchdog_warn_only/W 0" | mdb -kw

Without first disabling the watchdog, node-B will panic as soon as akd is disabled. A consolidated sketch of the node-B commands is shown below.
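
For convenience, the node-B portion of the procedure is consolidated here. This is a sketch only: the svcadm and mdb commands are taken from the steps above, svcs -l is added purely to verify the service state, and the stash cleanup itself is left as a placeholder to be performed under Oracle Support direction.

   # echo "watchdog_warn_only/W 1" | mdb -kw    (disable the watchdog)
   # svcadm disable -t akd                      (temporarily stop akd)
   # svcs -l akd                                (verify akd is disabled)
   ... removal and cleanup of stash objects, per Oracle Support ...
   # svcadm enable akd                          (restart akd)
   # echo "watchdog_warn_only/W 0" | mdb -kw    (re-enable the watchdog)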

 

Ref: Bug 24484064 ZFS appliance panics during akd restart if cluster links are down

 


Attachments
This solution has no attachment