Sun Storage 7000 Unified Storage System: BUI unavailable and seeing errors like "failed to update kstat chain: Not enough space"

Asset ID:	1-72-1494369.1
Update Date:	2018-05-24
Keywords:

Solution Type Problem Resolution Sure

Solution 1494369.1 : Sun Storage 7000 Unified Storage System: BUI unavailable and seeing errors like "failed to update kstat chain: Not enough space"

Applies to:

Sun ZFS Storage 7320 - Version All Versions and later
Sun ZFS Storage 7120 - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

BUI access hangs or generates log errors of the form:

Thu Mar 22 23:27:37 2012: asynchronous error on statistics module 'mem': failed to update kstat chain: Not enough space
Fri Mar 23 02:05:53 2012: failed to update chassis data: failed to update kstat chain: Not enough space

These errors are also recorded in the akd.ak error log.
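
As a rough illustration, the error signature above can be matched in a saved copy of the log. This is a minimal sketch in Python: the function name `find_kstat_errors` and the third sample line are hypothetical, and the pattern is taken only from the two sample messages shown above, so other modules may log variants it does not cover.

```python
import re

# Signature copied from the two sample log lines in this note.
KSTAT_ENOMEM = re.compile(r"failed to update kstat chain: Not enough space")

def find_kstat_errors(lines):
    """Return the log lines that contain the kstat-chain allocation failure."""
    return [line.rstrip("\n") for line in lines if KSTAT_ENOMEM.search(line)]

sample = [
    "Thu Mar 22 23:27:37 2012: asynchronous error on statistics module 'mem': "
    "failed to update kstat chain: Not enough space",
    "Fri Mar 23 02:05:53 2012: failed to update chassis data: "
    "failed to update kstat chain: Not enough space",
    # Hypothetical unrelated line, included only for contrast.
    "Fri Mar 23 02:06:10 2012: some unrelated message",
]

hits = find_kstat_errors(sample)
for hit in hits:
    print(hit)
```

A match only shows that the symptom is present; the vmem check described in this note is still needed to confirm fragmentation.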

CLI access may also be lost, although it sometimes remains available.

The management interface (akd) has usually been running for months without a restart.

To confirm, take a core dump of akd and check the vmem arenas with mdb:

# gcore -o akd.core `pgrep -ox akd`

# mdb akd.core.<PID>

>::vmem

ADDR     NAME                  INUSE      TOTAL    SUCCEED    FAIL
fe9bdd98 sbrk_top         2454306816 3592060928 1040654983 8468408 << TOTAL is over 3 GB, but INUSE is much lower. This indicates memory fragmentation.
fe9be20c sbrk_heap        2454306816 2454306816 1040654983 8466433
fe9be680 vmem_internal      89620480   89620480  126069236       0
fe9beaf4 vmem_seg           86294528   86294528      21068       0
fe9bef68 vmem_hash           3309824    3313664         35       0
fe9bf3dc vmem_vmem             17100      19128  126048153       0
08062000 umem_internal      22701056   22704128      79028       0
08062474 umem_cache           402320     626688         51       0
080628e8 umem_hash           2239488    2244608         54       0
08063000 umem_log                  0          0          0       0
08063474 umem_firewall_va          0          0          0       0
080638e8 umem_firewall             0          0          0       0
08064000 umem_oversize     158445402  166821888  913453146 8466433 << The umem_oversize line shows a very large number of allocations (over 900 million succeeded, over 8 million failed).
08064474 umem_memalign       4431888   10883072      64834       0
080648e8 umem_default     2164277248 2164277248    1053568       0
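
The two annotated rows above can be checked mechanically once the ::vmem output has been saved as text. The following is a minimal sketch in Python, not an Oracle tool: the helper names `parse_vmem` and `fragmentation_hints` are hypothetical, and no threshold is implied by this note, so the numbers are only a hint to be read alongside the raw output.

```python
def parse_vmem(text):
    """Parse saved `::vmem` output into {arena_name: {inuse, total, succeed, fail}}."""
    arenas = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: ADDR NAME INUSE TOTAL SUCCEED FAIL [annotation...]
        if len(fields) < 6 or fields[0] == "ADDR":
            continue
        try:
            inuse, total, succeed, fail = (int(f) for f in fields[2:6])
        except ValueError:
            continue  # not a data row
        arenas[fields[1]] = {
            "inuse": inuse, "total": total, "succeed": succeed, "fail": fail,
        }
    return arenas

def fragmentation_hints(arenas):
    """Summarise the two rows this note singles out: sbrk_top and umem_oversize."""
    top = arenas["sbrk_top"]
    oversize = arenas["umem_oversize"]
    return {
        "sbrk_top_inuse_ratio": top["inuse"] / top["total"],
        "sbrk_top_unused_bytes": top["total"] - top["inuse"],
        "oversize_failed_allocs": oversize["fail"],
    }

# Rows copied from the example output in this note.
sample = """\
ADDR NAME INUSE TOTAL SUCCEED FAIL
fe9bdd98 sbrk_top 2454306816 3592060928 1040654983 8468408
08064000 umem_oversize 158445402 166821888 913453146 8466433
"""

hints = fragmentation_hints(parse_vmem(sample))
print(hints)
```

For the example output, roughly a third of the sbrk_top arena is not in use even though allocations are failing, which matches the fragmentation pattern described above.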

Cause

This is most likely a known problem in which the akd process that controls the management interface runs out of memory because its heap has been fragmented by a large number of oversize allocations.

If unsure, raise a service request with Oracle Support, who can verify whether you are hitting this issue.

The likely cause is Bug 15781962 - Repeated analytics graphs and drilldowns fragment memory until akd runs out of memory.

Solution

The workaround for this is to restart the management interface (akd) to alleviate the heap fragmentation.

If the CLI is still available, it is possible to restart the management interface from there:

S7000:> maintenance system restart

Note: on a clustered system, verify that the cluster is in a healthy state before restarting the management interface on either head, to prevent a takeover.

To do this, check the cluster configuration:

S7000:> configuration cluster show

Properties:
                         state = AKCS_CLUSTERED
                   description = Active
                      peer_asn = 7adaa852-e2da-e6d6-e0ad-d22330278cb3
                 peer_hostname = zs7420-tvp540-b-h1
                    peer_state = AKCS_CLUSTERED
              peer_description = Active

Children:
                        resources => Configure resources

Valid states for the cluster head and its peer are AKCS_CLUSTERED, AKCS_OWNER and AKCS_STRIPPED. In these states, restarting the management interface will not cause a takeover by the other head.
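
The state check described above can be sketched in Python against a saved copy of the `configuration cluster show` output. This is only an illustration under stated assumptions: the helper names `parse_properties` and `safe_to_restart` are hypothetical, the set of safe states is exactly the three listed in this note, and `AKCS_OTHER` is a made-up placeholder for any state outside that list.

```python
import re

# States this note lists as safe for restarting the management interface.
SAFE_STATES = {"AKCS_CLUSTERED", "AKCS_OWNER", "AKCS_STRIPPED"}

def parse_properties(text):
    """Collect 'name = value' lines from saved CLI output into a dict."""
    props = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\s+=\s+(.*\S)\s*$", line)
        if m:
            props[m.group(1)] = m.group(2)
    return props

def safe_to_restart(text):
    """True only if both this head and its peer report a listed safe state."""
    props = parse_properties(text)
    return (props.get("state") in SAFE_STATES
            and props.get("peer_state") in SAFE_STATES)

# Output copied from the example in this note.
sample = """\
Properties:
                         state = AKCS_CLUSTERED
                   description = Active
                      peer_asn = 7adaa852-e2da-e6d6-e0ad-d22330278cb3
                 peer_hostname = zs7420-tvp540-b-h1
                    peer_state = AKCS_CLUSTERED
              peer_description = Active

Children:
                        resources => Configure resources
"""

print(safe_to_restart(sample))
# AKCS_OTHER is a hypothetical placeholder, not a documented state.
print(safe_to_restart(sample.replace("peer_state = AKCS_CLUSTERED",
                                     "peer_state = AKCS_OTHER")))
```

The check is deliberately conservative: a missing or unrecognised state is treated as unsafe.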

Restarting the management interface will not have any effect on access to the shares.

If the CLI is not available, please raise a service request with Oracle Support to restart the management interface.

The fix is to upgrade to Appliance Firmware Release 2011.1.9.0 (or later) or 2013.1.1.1 (or later).

Some customers still see memory fragmentation in akd after installing 2011.04.24.5.0 (2011.1.5.0). This was tracked by Bug 16187433, which has been closed as a duplicate of Bug 15685321.

The workaround of restarting akd remains the same.

The total number of analytics datasets that must be continuously updated is a contributing factor in the fragmentation, so destroying unnecessary datasets will help.

    16187433 - datasets with lots of breakdowns causes memory fragmentation in akd
    15685321 - UMEM_MAXBUF too small for modern applications

For 2011.1.x releases: Fixed in 2011.1.9.0 (15685321-17778426 Backport 15685321 to AK-2011.04.24 - UMEM_MAXBUF too small for modern applications)

For 2013.1.x releases: Fixed in 2013.1.1.1 (15685321-17531615 Backport 15685321 to ak-2013-rel)
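
The two fixed-release thresholds above can be expressed as a simple comparison. This is a minimal sketch in Python: the helper names `parse_release` and `has_fix` are hypothetical, the release is assumed to be written in the short dotted form used in this note (for example 2011.1.5.0, not 2011.04.24.5.0), and release trains not mentioned here return None rather than a guess.

```python
# Minimum fixed release per train, as stated in this note.
FIXED_IN = {
    (2011, 1): (2011, 1, 9, 0),
    (2013, 1): (2013, 1, 1, 1),
}

def parse_release(release):
    """Turn a short-form release such as '2013.1.2.12' into a tuple of ints."""
    return tuple(int(part) for part in release.split("."))

def has_fix(release):
    """True/False for the 2011.1.x and 2013.1.x trains, None for any other train."""
    version = parse_release(release)
    minimum = FIXED_IN.get(version[:2])
    if minimum is None:
        return None  # train not covered by this note
    return version >= minimum

for rel in ("2011.1.5.0", "2011.1.9.0", "2013.1.1.1", "2013.1.2.12"):
    print(rel, has_fix(rel))
```

Comparing tuples of integers avoids the string-comparison trap where "2013.1.2.12" would sort before "2013.1.2.9".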

Also be aware of Bug 20751907 (datasets with lots of breakdowns caused akd to bloat), reported on 2013.1.2.12.

In that case, datasets with lots of breakdowns (nfs3.ops[file] and arc.accesses[file]) appear to have resulted in over 20 million ak_dataspan_datum_t objects being allocated from the umem_alloc_48 cache.
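
To give a sense of scale, the figure above can be turned into a lower bound on memory use. This is a back-of-the-envelope sketch in Python that assumes each object occupies one 48-byte buffer, as the cache name umem_alloc_48 suggests, and ignores any per-buffer or slab overhead, so the real footprint would be somewhat higher.

```python
OBJECTS = 20_000_000   # lower bound from this note ("over 20 million")
BUFFER_BYTES = 48      # assumed buffer size, taken from the cache name umem_alloc_48

min_bytes = OBJECTS * BUFFER_BYTES
min_mib = min_bytes / 2**20

print(f"at least {min_bytes} bytes (about {min_mib:.0f} MiB)")
```

Under these assumptions the objects alone account for at least 960,000,000 bytes, which is a substantial share of the address space seen in the ::vmem example in this note.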

***Checked for relevance on 24-MAY-2018***

References

<BUG:15781962> - SUNBT7157268-AK-8 REPEATED ANALYTICS GRAPHS AND DRILLDOWNS FRAGMENT MEMORY UNTIL
<BUG:16187433> - DATASETS WITH LOTS OF BREAKDOWNS CAUSES MEMORY FRAGMENTATION IN AKD
<BUG:15685321> - SUNBT7004788 UMEM_MAXBUF TOO SMALL FOR MODERN APPLICATIONS
