
Asset ID: 1-72-1913790.1
Update Date: 2018-05-25
Keywords:

Solution Type: Problem Resolution (Sure Solution)

Solution 1913790.1: Oracle ZFS Storage Appliance: Appliance Kit Daemon (akd) Panic in zfs_send_impl()


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS5-2
  • Sun Storage 7110 Unified Storage System
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7210 Unified Storage System
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS5-4
  • Sun ZFS Storage 7120
  • Sun Storage 7310 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-BA

Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-6969876821>

Applies to:

Sun ZFS Storage 7420 - Version All Versions and later
Sun ZFS Storage 7320 - Version All Versions and later
Sun ZFS Storage 7120 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

The Appliance Kit Daemon (akd) is suspected to have restarted unexpectedly, as indicated by the following fault management alert:

  SUNW-MSG-ID: AK-8001-RK, TYPE: alert, VER: 1, SEVERITY: Minor
  EVENT-TIME: Sat Mar 23 15:00:39 2013
  PLATFORM: i86pc, CSN: 1101FMJ01D, HOSTNAME: adc08stor07
  SOURCE: svc:/appliance/kit/akd:default, REV: 1.0
  EVENT-ID: 0ba484c7-e754-cb67-f453-9e61bcd9fc78
  DESC: Communication with the cluster peer via a cluster interconnect link has been lost.
  AUTO-RESPONSE: None.
  IMPACT: Cluster reliability is impaired. If the cluster peer is functioning normally but no cluster interconnects remain active, arbitrary and unwanted cluster takeover may occur.
  REC-ACTION: Check the cluster interconnect cables and the state of the cluster peer. Contact your vendor for support if an interconnect link remains inexplicably down.

 

The following alerts were noticed in the BUI:

Description A cluster interconnect link has been restored.
Type Minor alert
Impact Cluster reliability has improved.
Automated response None.
Recommended action None.
Event time 2013-3-23 20:36:56
Unique Identifier f685cc2f-69e0-e623-d62b-ce225994a8eb
Status This alert is not associated with a problem.

Description A cluster interconnect link has been restored.
Type Minor alert
Impact Cluster reliability has improved.
Automated response None.
Recommended action None.
Event time 2013-3-23 20:36:56
Unique Identifier 9c903ddd-9cd3-6414-f574-d5e56aa365cb
Status This alert is not associated with a problem.

Description A cluster interconnect link has been restored.
Type Minor alert
Impact Cluster reliability has improved.
Automated response None.
Recommended action None.
Event time 2013-3-23 20:37:01
Unique Identifier f7e3244d-511e-c014-c3d7-b8f86724ce5f
Status This alert is not associated with a problem.

Description The appliance has rejoined the cluster.
Type Minor alert
Impact Cluster failover is now available.
Automated response None.
Recommended action None.
Event time 2013-3-23 20:39:07
Unique Identifier ab7114a5-9ca6-c347-fd53-fe7c7d2dd8d7
Status This alert is not associated with a problem.

 

 

Cause

AKD was restarted on adc08stor08.

AKD service log:

Assertion failed: 0 == close(sdd.cleanup_fd), file ../common/libzfs_sendrecv.c, line 1545, function zfs_send
[ Mar 23 15:05:28 Stopping because process dumped core. ]
[ Mar 23 15:05:28 Executing stop method (:kill). ]
[ Mar 23 15:05:39 Executing start method ("exec /usr/lib/ak/akd"). ]
[ Mar 23 15:07:12 Method "start" exited with status 0. ]
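
The service log shows the failure sequence: an assert/VERIFY-style check in zfs_send() treats a nonzero return from close(sdd.cleanup_fd) as fatal and calls abort(), the resulting SIGABRT makes akd dump core, and SMF then restarts the service. Below is a minimal, self-contained C sketch of that pattern, for illustration only; the VERIFY_ZERO macro and the deliberately invalid file descriptor are invented for this example and are not the actual libzfs_sendrecv.c code.

/*
 * Illustration only (not the actual libzfs_sendrecv.c source): an
 * assert/VERIFY-style check on the return value of close() calls abort()
 * when close() fails, terminating the process with SIGABRT and producing
 * a core dump -- the same sequence seen in the akd service log above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the VERIFY/assert-style macro used by libzfs. */
#define VERIFY_ZERO(expr)                                               \
        do {                                                            \
                if ((expr) != 0) {                                      \
                        (void) fprintf(stderr,                          \
                            "Assertion failed: 0 == %s, "               \
                            "file %s, line %d\n",                       \
                            #expr, __FILE__, __LINE__);                 \
                        abort();   /* raises SIGABRT, dumps core */     \
                }                                                       \
        } while (0)

int
main(void)
{
        int cleanup_fd = -1;    /* invalid descriptor so close() fails */

        /* close(-1) fails with EBADF, so the check fires and we abort. */
        VERIFY_ZERO(close(cleanup_fd));

        return (0);
}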

 

AKD application coredump:

> ::status
debugging core file of akd (32-bit) from adc08stor08
initial argv: /usr/lib/ak/akd
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=1326 uid=0 code=-1
panic message: Assertion failed: 0 == close(sdd.cleanup_fd), file ../common/libzfs_sendrecv.c, line 1545, function zfs_send
>

> $C
f2aec6c8 libc_hwcap1.so.1`_lwp_kill+0x15(98, 6, f2aec6e8, fee608b1)
f2aec6e8 libc_hwcap1.so.1`raise+0x25(6, 0, f2aec738, fee3869d)
f2aec738 libc_hwcap1.so.1`abort+0xf5(65737341, 6f697472, 6166206e, 64656c69, 2030203a, 63203d3d)
f2aec948 0xfee38ad0(fe69c64c, fe69d500, 6d6, fe685040)
f2aed2c8 libzfs.so.1`zfs_send_impl+0xc9b(689402c8, 6f3c2dcb, 67ced313, 6, 152, fd364f70)
f2aed308 libzfs.so.1`zfs_send+0x2e(689402c8, 6f3c2dcb, 67ced313, 6, 152, fd364f70)
f2aed588 nas.so`nas_repl_send_stream_send+0x262(f2aed5c0, 65cc0ac8, fd3b9478, fd367966)
f2aedf28 nas.so`nas_repl_eng_send+0xca(80a1c08)
f2aedf98 libak.so.1`ak_engine_worker+0x170(88aa7f8, 0, 0, f013cde9)
f2aedfc8 libak.so.1`ak_thread_start+0x6a(8cf0408, fef51000, f2aedfe8, feeb36d9)
f2aedfe8 libc_hwcap1.so.1`_thrp_setup+0x9d(f4a11140)
f2aedff8 libc_hwcap1.so.1`_lwp_start(f4a11140, 0, 0, 0, 0, 0)
>

> ::vmem
ADDR NAME INUSE TOTAL SUCCEED FAIL
fe99dc28 sbrk_top 1275564032 3451412480 857568324 4135 <<<<<<<<<
fe99e09c sbrk_heap 1275564032 1275564032 857568324 731
fe99e510 vmem_internal 43978752 43978752 24581317 0
fe99e984 vmem_seg 41811968 41811968 10208 0
fe99edf8 vmem_hash 2150656 2154496 26 0
fe99f26c vmem_vmem 17100 19128 24571101 0
08062000 umem_internal 17249280 17252352 81924 0
08062474 umem_cache 402320 577536 51 0
080628e8 umem_hash 1239040 1245184 224 0
08063000 umem_log 0 0 0 0
08063474 umem_firewall_va 0 0 0 0
080638e8 umem_firewall 0 0 0 0
08064000 umem_oversize 147805517 151138304 831194045 731 <<<<<<<<<
08064474 umem_memalign 4456464 4464640 714301 0
080648e8 umem_default 1058729984 1058729984 996737 0
>

 

The highlighted nonzero FAIL counts for sbrk_top and umem_oversize indicate failed memory allocations within the 32-bit akd address space. The crash is an instance of bug CR 15826181 - akd crashed in libzfs with Assertion failed: 0 == close(sdd.cleanup_fd).

 

Solution

Upgrade to Appliance Firmware Release 2011.1.8.0 (or later), or to Appliance Firmware Release 2013.1.0.1 (or later).

 

 

 

***Checked for relevance on 25-MAY-2018***

References

<NOTE:1494369.1> - Sun Storage 7000 Unified Storage System: BUI unavailable and seeing errors like "failed to update kstat chain: Not enough space"
<NOTE:1019887.1> - Sun Storage 7000 Unified Storage System: How to Collect a Support Bundle using the BUI or CLI
<NOTE:1325025.1> - Sun Storage 7000 Unified Storage System: aksh fatal error: no memory
<NOTE:1401282.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)
<NOTE:1401288.1> - Sun Storage 7000 Unified Storage System: Data collection for akd hang issues
<BUG:15826181> - SUNBT7207252 AKD CRASHED IN LIBZFS WITH ASSERTION FAILED: 0 == CLOSE(SDD.CLEANUP_FD)

Attachments
This solution has no attachment