Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1401282.1
Update Date:2018-05-24

Solution Type  Troubleshooting Sure

Solution  1401282.1 :   Sun Storage 7000 Unified Storage System: How to Troubleshoot Unresponsive Administrative Interface (BUI/CLI hang)  


Related Items
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Oracle ZFS Storage ZS3-2
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7120
  • Oracle ZFS Storage Appliance Racked System ZS4-4
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-BA
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage


To assist a user in resolving management BUI/CLI connectivity/responsiveness issues.

In this Document
Purpose
Troubleshooting Steps
 Symptoms:
 Causes and Resolutions:
 Excessive kernel virtual memory (exceeding the 32-bit VM limit)
 Excessive amount of 'old' analytics
 Excessive amount of 'old' log files
 Excessive use of contracts
 Excessive amount of AKD process (memory) heap fragmentation
 System pool is 'full'
 Faulty (?) hardware issue
 Issue with supportbundle upload or 'phone home'
 Changing the replication target IP address
 Issues with Replication updates
 Further Assistance Required:
 Other useful information:
References


Applies to:

Sun Storage 7410 Unified Storage System - Version All Versions and later
Oracle ZFS Storage ZS3-4 - Version All Versions and later
Oracle ZFS Storage ZS3-BA - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
Oracle ZFS Storage Appliance Racked System ZS4-4 - Version All Versions and later
7000 Appliance OS (Fishworks)

Purpose

The purpose of this document is to assist a user in resolving management BUI/CLI connectivity/responsiveness issues.

If ssh to the appliance drops the user into the emergency shell, the end user must open a support session to allow the Oracle System Support team remote access to the system to troubleshoot and fix this issue.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance Community.

 

Customers are not permitted to run commands at the emergency shell.

Troubleshooting Steps

Please validate that each troubleshooting step below is true for the affected environment.  The steps will provide instructions or a link to a document, for validating the step and taking corrective action as necessary.  The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution.  Please do not skip a step.

Symptoms:

The usual symptoms of an Unresponsive Administrative Interface issue are:

  • Cannot login to BUI/CLI (on one, or both, nodes of a cluster)
  • Slow startup of system/management interfaces

See the following Internal-Only Documents for collecting useful data.

  • Document 1401288.1 - Unified Storage System: Data collection for akd hang issues
  • How to analyse akd core files (Wiki Doc TBD)

Causes and Resolutions:

The initial step is to check the basic configuration and operation of the appliance management connectivity. Please see:
Document 1392845.1 - Sun Storage 7000 Unified Storage System: How to Troubleshoot Loss of Network Connection to the Management Interface.
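
As a first check from an administration client, it can help to confirm whether the management ports accept TCP connections at all. The following is a minimal sketch, not a procedure from the referenced document. It assumes port 215 for the BUI (the port used in the wiki URL later in this document) and the standard ssh port 22 for the CLI; the function names are hypothetical.

```python
import socket

# Port 215 is the BUI port shown in this document's wiki URL.
# Port 22 (standard ssh) is ASSUMED here to be the CLI port.
MANAGEMENT_PORTS = {"BUI": 215, "CLI (ssh)": 22}


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_management_ports(host, ports=MANAGEMENT_PORTS, timeout=5.0):
    """Map each management service name to whether its TCP port accepts connections."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}
```

For example, `probe_management_ports("<appliance-ip-address>")` returns a dictionary with one True/False entry per service. A port that accepts connections while logins still hang points away from basic network connectivity and towards the causes listed below; a closed or unreachable port points back to Document 1392845.1.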

This is a general, non-exhaustive list of possible causes of 'BUI/CLI hang' conditions:

  •   AKD exceeding the 32-bit kernel virtual memory limit (~3.5 GB+)
  •   Excessive amount of 'old' analytics
  •   Excessive amount of 'old' log files
  •   Excessive use of contracts
  •   Excessive amount of AKD process (memory) heap fragmentation
  •   'Temporary' hang when ZFS dataset destroy is running
  •   Cluster 'peer' locking issue
  •   Replication 'peer' locking issue
  •   Faulty (?) hardware issue [disk, SIM/IOM, clustron card, cables]
  •   Intermittent (or permanent) networking issues

Still to be added to the non-exhaustive list of possible causes of 'BUI/CLI hang' conditions:

  • SMF system hanging when svc.configd overflows its heap
  • All 'other' known issues


For each of these causes, the known issues are detailed below, along with their specific symptoms and the recommended actions for resolution.

PLEASE NOTE:  While the Administrative Interfaces (BUI or CLI) are inaccessible, the customer may be unable to view the standard 'error/fault' reporting mechanisms that would otherwise provide further diagnostic/context data to assist in isolating the cause of the issue:

  • FMA events  -> (BUI) Maintenance > Problems > Active Problems
  • Alert log  -> (BUI) Maintenance > Logs > Alerts
  • Fault log  -> (BUI) Maintenance > Logs > Faults
  • System log  -> (BUI) Maintenance > Logs > System
  • Audit log  -> (BUI) Maintenance > Logs > Audit
  • Phone Home log  -> (BUI) Maintenance > Logs > Phone Home

In the vast majority of cases where a BUI/CLI hang is observed, you will need to engage Oracle System Support by opening a Service Request to assist in determining the root cause of the problem.

 

Excessive kernel virtual memory (exceeding the 32-bit VM limit)

For any system running Appliance Release versions earlier than 2010.Q3.4 or 2011.1, running (many) aksh scripts can exhaust the Appliance management daemon kernel memory.

See Document 1334777.1 - Sun Storage 7000 Unified Storage System: System hang - aksh scripts can exhaust memory

Attempting to log in to the CLI generates a 'fatal error: no memory' message.

See Document 1325025.1 - Sun Storage 7000 Unified Storage System: aksh fatal error: no memory

For any system running Appliance Release versions earlier than 2011.1, the creation and deletion of a large number of VDI LUNs can cause a BUI/CLI hang condition.

See Document 1408593.1 - Sun Storage 7000 Unified Storage System: Creation/deletion of large amount of VDI LUNs can cause BUI/CLI hang

To monitor the memory used by akd, a workflow can be used.

See Document 1391232.1 - Sun Storage 7000 Unified Storage System: The workflow to check memory usage of the akd.
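
Several causes in this document apply only to systems running a release "earlier than" a given version. The sketch below shows one way to compare release strings such as 2010.Q3.4.2 and 2011.1.6.0. It is an illustration only: the function names are hypothetical, and it assumes that a leading 'Q' in a field is just a label (so 'Q3' compares as 3) and that missing trailing fields count as 0.

```python
def parse_release(release):
    """Turn a release string such as '2010.Q3.4.2' into a comparable tuple.

    Assumptions (not taken from Oracle documentation): a leading 'Q' is only
    a label, so 'Q3' compares as 3, and missing trailing fields are 0.
    """
    fields = [int(part.lstrip("Qq")) for part in release.strip().split(".")]
    fields += [0] * (4 - len(fields))  # pad to four fields for fair comparison
    return tuple(fields)


def is_earlier_than(release, threshold):
    """Return True if 'release' sorts strictly before 'threshold'."""
    return parse_release(release) < parse_release(threshold)


if __name__ == "__main__":
    # A 2010.Q3.3.1 system predates the 2010.Q3.4 threshold.
    print(is_earlier_than("2010.Q3.3.1", "2010.Q3.4"))
    # A 2011.1.1.0 system is not earlier than 2011.1.
    print(is_earlier_than("2011.1.1.0", "2011.1"))
```

Always confirm the installed release and the applicability of a fix against the referenced document rather than relying on a string comparison alone.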

Excessive amount of 'old' analytics

Analytics collect very detailed information, and the default set of analytics is 'always on'. Over time, the accumulated analytics data can become 'excessive' and eventually cause a 'hang' condition for the Appliance management interfaces (BUI/CLI).

See Document 1401595.1 - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' analytics collected

A 'hang' condition for the BUI/CLI may also result from the 'total' amount of analytics currently being collected.

See Document 1572205.1 - Sun Storage 7000 Unified Storage System: BUI/CLI hangs when accessing the 'status' or 'analytics' page

A 'hang' condition for the Appliance management interfaces (BUI/CLI) may also result from a known analytics compilation bug.

See Document 1468128.1 - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to analytics compilation (CCP) bug

Excessive amount of 'old' log files

For any system running Appliance Release versions earlier than 2010.Q1.0, system libraries used by akd can exceed a 256 file descriptor limit if many (old) logfiles are present. This can cause a 'hang' condition for the Appliance management interfaces (BUI/CLI).

See Document 1408493.1 - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' amount of 'old' log files

Excessive use of contracts

Whenever a workflow terminates abnormally, it leaves an unused 'contract id'. Eventually, the contract limit may be exceeded, and processes are then unable to start. Error messages may include "Resource temporarily unavailable".

See Document 1410873.1 - Sun Storage 7000 Unified Storage System: SMF unable to spawn processes due to contract exhaustion

Excessive amount of AKD process (memory) heap fragmentation

For any system running Appliance Release version 2011.1.6.0 and earlier, the akd process controlling the management interface can run out of memory because of heap fragmentation caused by a large number of oversize allocations.

See Document 1494369.1 - Sun Storage 7000 Unified Storage System: BUI unavailable and seeing errors like "failed to update kstat chain: Not enough space"

System pool is 'full'

The system may experience 'hang' conditions when the system zpool is nearing 100% capacity.  In this situation, try to reduce the system pool usage to below 80%.

See Document 1392082.1 - Sun Storage 7000 Unified Storage System: How to free some space on system pool
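
To judge how much space needs to be freed, the 80% target can be turned into a byte count. The following is a small worked sketch of that arithmetic only; the function names are hypothetical, and the used and total sizes must be obtained by the means described in the referenced document.

```python
def usage_percent(used_bytes, total_bytes):
    """Return pool usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes * 100 / total_bytes


def space_to_free(used_bytes, total_bytes, target_percent=80):
    """Return how many bytes must be freed to bring usage down to target_percent.

    Returns 0 if usage is already at or below the target. Integer arithmetic
    is used to avoid floating-point rounding on large byte counts.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    target_used = total_bytes * target_percent // 100
    return max(0, used_bytes - target_used)


if __name__ == "__main__":
    GIB = 2**30
    # Example: a 100 GiB system pool with 95 GiB used.
    print(usage_percent(95 * GIB, 100 * GIB))
    print(space_to_free(95 * GIB, 100 * GIB) // GIB, "GiB to free")
```

Because the guidance is to get below 80%, treat the returned value as a minimum and free somewhat more.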

Faulty (?) hardware issue

For example, a 'flaky' disk which isn't getting faulted can cause akd to be mostly but not completely unresponsive.

Check the FMA events (in the BUI : Maintenance > Problems > Active Problems) for a bad disk not getting faulted.

If such a disk is preventing completion of zfs operations within a normal timeframe, the nas lock may be getting held up together with the zfs lock, causing akd to be unresponsive.

See Document 2055701.1 - Oracle ZFS Storage Appliance : Identifying Bad Disk Drives Causing Performance and Other Problems

Issue with supportbundle upload or 'phone home'

For any system running Appliance Release versions earlier than 2011.1.6.0, the scrk/curl thread within the akd daemon can hang.

See Document 1553935.1 - Sun Storage 7000 Unified Storage System: BUI/CLI hang when attempting to 'phone home' or upload supportbundle

Changing the replication target IP address

AKD can hang when changing the replication target IP address if the 'old' IP address is unavailable.

See bug 18827266 (Updating target IP can hang up the NAS class if the old IP is unavailable)

Fixed in Appliance Firmware Release 2013.1.6.0

 

Issues with Replication updates

Replication issue:  22120225 - BUI/CLI Inaccessible (Cloud Infrastructure)

The following replication bugs are potentially the underlying issue:

    22259667 - akd is slow waiting for zfs property update due to large number of datasets (Closed/Could Not Reproduce)
    22649736 - target akd is slow waiting for zfs property update during replication (Duplicate of 21116328)

and in turn:

    21116328 - nas_cache needs more scalable locking for property handling code (work in progress)

 

====================================================================


Additional topics for content creation ...

Excessive kernel virtual memory (exceeding the 32-bit VM limit)
15685248 - ak_stream_buffer allocation is conducive to heap fragmentation (Fixed in 2010.Q3.4.2)
15727240 - 7410 cluster (strl1) for abut 6 min not serving NFS (dup of 15685248) (Fixed in 2010.Q3.4.2)
15762369 - akd hang on ak_job_cancel (workflow related) (Closed/Could Not Reproduce)

'Temporary' hang when ZFS dataset destroy running
15632295 - akd spinning in tight loop destroying zombie snapshots with holds (Fixed in 2010.Q1.0.2)

Cluster 'peer' locking issue
15520920 - Mr. Freeze and ak_cio_disable wreak havoc with rm lock (Fixed in 2009.Q2.5.1)
15562408 - akd can fail to learn it has taken the rm lock (Fixed in 2009.Q3.5.0)
15615056 - RM lock tied in knots while making XDR-RPC call (Fixed in 2009.Q3.5.0)
15646361 - rm locking deadlock due to the race between two peer server threads (Fixed in 2011.1.1.0)
15705482 - Need focused fix for 6956503 (Fixed in 2010.Q3.4.0)
15717253 - BUI/CLI are not accessable. I tried to restart akd and this did not help (dup of 15646361) (Fixed in 2011.1.1.0)
15724088 - deadlock between ak_peer_server and its assassins during replication (Fixed in 2010.Q3.4.2)

Replication 'peer' locking issue
15617192 - rm lock deadlock creating replication target (Fixed in 2010.Q1.0.0)
15727902 - rm deadlock in nas_repl_createTarget (Fixed in 2010.Q3.4.2)

Faulty hardware issue [clustron card, cables]
15693479 - akd went into maintenance because of 3520/3524 mixed SIM code on 7420 new install (Closed/Not Feasible To Fix)
15722104 - another case of discovery loop in pmcs (Fixed in 2010.Q3.4.2)

SMF framework issue
15519098 - svc.configd leaked two million nodes (Fixed in 2011.1.1.0)
15635512 - svc.configd having some difficulty with memory consumption (dup of 15519098) (Fixed in 2011.1.1.0)
15733749 - SMF services down [svcs: Could not bind to repository server: repository server unavailable] (Closed/Could Not Reproduce; may be related to 15519098)

All 'other' known issues
15378956 - vdev fullness can degrade performance, should cause zpool to become degraded (RFE; Cause Not Known/Uncommitted)
15614367 - snapshot related activity causes akd to hang (Fixed in 2010.Q3.0.0)
15621496 - destroying a dedup-enabled dataset bricks system (Fixed in 2010.Q1.2.0)
15648057 - lack of accounting for DDT in dedup'd frees can oversubs (dup of 15621496) (Fixed in 2010.Q1.2.0)
15661489 - changing shadow migration threads or cancel migration can lead to deadlock (Fixed in 2010.Q3.3.1)
15742356 - akd slow and storage add taking long time (dup of 15378956) (RFE; Cause Not Known/Uncommitted)

=====================================================================

Further Assistance Required:

At this point, if you have validated that each troubleshooting step above is true for your environment and the problem still exists, further troubleshooting is required.
You will need to engage Oracle Support by opening a Service Request to assist you further.

Please include all the relevant details and information - including examples of any errors that you see - along with an accurate problem description in the SR notes.

If possible, a current supportbundle (from both heads, if this is a cluster system) should also be obtained and uploaded to Oracle.

The following links will provide more information:

  • Document 1019887.1 - Sun Storage 7000 Unified Storage System: How to collect a supportbundle using the BUI or CLI
  • Document 1345655.1 - Sun Storage 7000 Unified Storage System: How to provide the correct Serial Number when opening an Oracle Service Request on a ZFS Storage Appliance or S7000 series NAS

It may be necessary for the Oracle Support Engineer to remotely run some 'emergency shell' commands. To accomplish this, the Oracle Support Engineer may request that you initiate an Oracle Shared Shell session. It would be useful if you are already familiar with this remote access tool - please see:

https://www.oracle.com/us/support/systems/premier/shared-shell-sun-systems-163755.html

 

Other useful information:

1. The Online Appliance Wiki documentation can be found at:

https://<appliance-ip-address>:215/wiki/index.php

2. To upgrade to the latest Appliance Firmware Release:

There are many improvements in later Appliance Firmware releases. Please check the current Appliance Firmware revision and, if required, upgrade to the latest release:

See MOS Document ID 2021771.1 - Oracle ZFS Storage Appliance: Software Updates

3. If the BUI and CLI are completely hung and you are unable to access the console via the Service Processor, you can still gather some useful diagnostic information while resetting the system by issuing an NMI reset.  This causes the system to gather a kernel crash dump. The procedure is documented in:

Document 1173064.1 - Sun Storage 7000 Unified Storage System: How to generate NMI to collect a system core dump

 

 

Back to Document 1416406.1  ZFS Storage Appliances Troubleshooting Resource Center.

 

 

***Checked for relevance on 24-MAY-2018***

References

<NOTE:1391232.1> - Sun Storage 7000 Unified Storage System: The work flow to check memory usage of the akd.
<NOTE:1416406.1> - Sun ZFS Storage Appliances Troubleshooting Resource Center
<NOTE:1392082.1> - Sun Storage 7000 Unified Storage System: How to free some space in the 'system' pool
<NOTE:1494369.1> - Sun Storage 7000 Unified Storage System: BUI unavailable and seeing errors like "failed to update kstat chain: Not enough space"
<NOTE:1334777.1> - Sun Storage 7000 Unified Storage System: System hang - aksh scripts can exhaust memory
<NOTE:1392845.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Loss of Network Connection to the Management Interface
<NOTE:1325025.1> - Sun Storage 7000 Unified Storage System: aksh fatal error: no memory
<NOTE:1408593.1> - Sun Storage 7000 Unified Storage System: Creation/deletion of large amount of VDI LUNs can cause BUI/CLI hang
<NOTE:1401595.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' analytics collected
<NOTE:1468128.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to analytics compilation (CCP) bug
<NOTE:1410873.1> - Sun Storage 7000 Unified Storage System: SMF unable to spawn processes due to contract exhaustion
<NOTE:1572205.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hangs when accessing the 'status' or 'analytics' page
<NOTE:1642216.1> - Exalogic: SSH fails to connect to ZFS node with message "aksh-wrapper: No Such file or directory"
<NOTE:1553935.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang when attempting to 'phone home' or upload supportbundle
<NOTE:1019887.1> - Sun Storage 7000 Unified Storage System: How to Collect a Support Bundle using the BUI or CLI
<NOTE:1345655.1> - How to Identify the Serial Number of a ZFS Storage Appliance or 7000 Series Unified Storage System
<NOTE:1173064.1> - Oracle ZFS Storage Appliance: How to generate a system core dump in case of system hang (BUI and CLI fails to respond) using NMI when directed to do so by an Oracle Support Engineer
<NOTE:1401288.1> - Sun Storage 7000 Unified Storage System: Data collection for akd hang issues
<NOTE:1408493.1> - Sun Storage 7000 Unified Storage System: BUI/CLI hang due to 'excessive' amount of 'old' log files
<NOTE:1543359.1> - Sun Storage 7000 Unified Storage System: Restarting the Appliance Kit Management Daemon (AKD) may impact production data services

  Copyright © 2018 Oracle, Inc.  All rights reserved.