Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1610490.1
Update Date:2018-01-08

Solution Type  Problem Resolution Sure

Solution  1610490.1 :   Sun Storage 7000 Unified Storage System: How to troubleshoot a (kernel) memory shortage  


Related Items
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7120
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7320
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-8078739601>

Applies to:

Sun Storage 7110 Unified Storage System - Version All Versions and later
Sun Storage 7210 Unified Storage System - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

The following symptoms indicate a kernel memory shortage. When one occurs, the system typically fails in this way:

- Data services become unavailable.

- The BUI and CLI are also unavailable.

- Connecting to the ZFSSA through SSH is impossible.

- Connecting to the ILOM of the ZFSSA through SSH and then starting the console shows no prompt at all.

Changes

This can happen even when no recent change has been made to the system.

Cause

The following troubleshooting steps can be performed before calling the support center:

 

1/ Confirm that  CLI > maintenance hardware show  still reports the full amount of memory the system was installed with.

Start by checking the error and alert messages in the BUI.

However, DIMMs can either fail with an error message or be silently blacklisted by Solaris. This silent failure is called "Operating System DIMM blacklisting" and does not show any error message in the BUI.

In that specific case, when the system reboots and finds DIMMs that must be removed from the configuration, it prints error messages once on the console saying that those DIMMs have been blacklisted by the Operating System.

Those messages may not be logged anywhere other than the console.

Once Solaris has printed those messages on the console at the first reboot, it prints nothing further at any later reboot and logs no error in the system logs. Those DIMMs are simply treated as being out of the configuration.

To troubleshoot those silent or cleared DIMM failures, customers have to check that the total amount of memory configured at installation is still present.
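
The comparison described in step 1 can be sketched as a small calculation. This is only an illustration: the function name, the DIMM count and the DIMM sizes below are hypothetical, and the values have to be read manually from the installation records and from  CLI > maintenance hardware show.

```python
def missing_memory_gib(expected_dimms, dimm_size_gib, reported_dimm_sizes_gib):
    """Return how many GiB are missing compared with the installed configuration.

    expected_dimms / dimm_size_gib: what was installed (from site records).
    reported_dimm_sizes_gib: sizes of the DIMMs the appliance lists today.
    """
    expected_total = expected_dimms * dimm_size_gib
    reported_total = sum(reported_dimm_sizes_gib)
    return expected_total - reported_total


# Hypothetical example: 16 x 8 GiB DIMMs were installed, only 14 are listed now.
reported = [8] * 14
shortfall = missing_memory_gib(16, 8, reported)
if shortfall > 0:
    print(f"{shortfall} GiB missing: DIMMs may have been blacklisted silently")
else:
    print("All installed memory is present")
```

A non-zero shortfall with no matching BUI alert is consistent with the silent blacklisting case described above.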

 

2/ If there are Readzillas (read cache SSDs) on the ZFSSA, confirm that the L2ARC headers are not using too much memory.

 

A system can run out of memory because the L2ARC header buffers fill up the ARC and prevent it from growing.

This can happen when new trays are added to an existing configuration, which results in bigger pools. It can also happen when older 100G SSDs are replaced with larger 500G SSDs.

It might also happen when small block sizes are configured for shares or LUNs.
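
A rough calculation shows why larger read cache devices and smaller block sizes both increase the memory taken by L2ARC headers: each block cached on the SSD needs one header in the ARC. The sketch below is an approximation under stated assumptions, not the procedure from the note referenced in this step. The function name is hypothetical, and the per-header size of 200 bytes is a placeholder chosen for illustration only; the real value depends on the software release.

```python
def l2arc_header_bytes(l2arc_bytes, block_size_bytes, header_bytes):
    """Estimate the ARC memory consumed by L2ARC headers.

    Assumes the read cache is full and every cached block has the same size.
    header_bytes is the per-block header size (an assumption, release dependent).
    """
    cached_blocks = l2arc_bytes // block_size_bytes
    return cached_blocks * header_bytes


GIB = 2**30
KIB = 2**10
ASSUMED_HEADER_BYTES = 200  # placeholder value, not taken from the product

for ssd_gib in (100, 500):
    for block_kib in (128, 8):
        used = l2arc_header_bytes(ssd_gib * GIB, block_kib * KIB, ASSUMED_HEADER_BYTES)
        print(f"{ssd_gib}G SSD, {block_kib}K blocks: {used / GIB:.2f} GiB of headers")
```

Whatever the actual header size is, the estimate scales linearly with the read cache capacity and inversely with the block size, so moving from 100G to 500G SSDs multiplies it by 5 and moving from 128K to 8K blocks multiplies it by 16.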

 

Note 1573028.1 :  Sun Storage 7000 Unified Storage System: How to check how much memory the L2ARC headers are occupying in the ARC.

This document explains how to check whether the system is hitting this configuration issue now, and also whether it might hit it in the future, given the current configuration of the ZFSSA.

 

 

Solution

Once you have checked the symptoms above, call your support representative and ask for help with the memory shortage.

Do not reboot the appliance.

 

Your support representative will collect a crash dump using NMI, and possibly engage the next level of support to have the crash dump analyzed.

Note  1173064.1 :  Sun Storage 7000 Unified Storage System: How to generate NMI to collect a system core dump.

 

If a remote connection is possible, TSC will check how memory behaves.

TSC may need to engage the next level of support to connect remotely to the appliance to do so.

 

As a first step, the next level of support will check whether there is evidence of a memory leak.

As a second step, the focus should be on the ARC: the Adaptive Replacement Cache is the first level of cache in ZFS.

It grows under heavy use. When the operating system is under memory pressure, the ARC tries to react by shrinking.

In some specific circumstances, the ARC may fail to react to that memory pressure and may exceed its target.

In that case, we must figure out what caused the memory pressure, and check whether a memory leak exists on any buffers.

When all pathological behaviour has been ruled out, limiting the ARC to leave more memory to the kernel may be required.
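
The check of the ARC against its target can be sketched as follows. This assumes that support has shell access and can capture the ZFS arcstats kernel statistics in the  kstat -p  text format, where  size  is the current ARC size and  c  is its target. The function names, the 5% tolerance and the sample values are hypothetical and only serve the illustration.

```python
def parse_arcstats(text):
    """Parse 'module:instance:name:statistic<whitespace>value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        statistic = parts[0].rsplit(":", 1)[-1]
        try:
            stats[statistic] = int(parts[1])
        except ValueError:
            continue  # skip non-integer statistics
    return stats


def arc_over_target(stats, tolerance=0.05):
    """Return True when the ARC size exceeds its target by more than tolerance."""
    return stats["size"] > stats["c"] * (1 + tolerance)


# Made-up sample in kstat -p format; the values are not from a real system.
SAMPLE = """\
zfs:0:arcstats:c\t50000000000
zfs:0:arcstats:c_max\t100000000000
zfs:0:arcstats:size\t60000000000
"""

stats = parse_arcstats(SAMPLE)
if arc_over_target(stats):
    print("ARC is above its target: it may not be reacting to memory pressure")
else:
    print("ARC is within its target")
```

A single sample is not conclusive. The ARC can be briefly above its target, so the comparison is more meaningful when repeated over time.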

 

TSC should check this document :

Note 1602108.1 :  Sun Storage 7000 Unified Storage System: Tuning the ARC in case of memory pressure 

References

<NOTE:1194226.1> - Oracle Shared Shell

  Copyright © 2018 Oracle, Inc.  All rights reserved.