Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1173064.1
Update Date:2018-05-14
Keywords:

Solution Type  Technical Instruction Sure

Solution  1173064.1 :   Oracle ZFS Storage Appliance: How to generate a system core dump in case of system hang (BUI and CLI fails to respond) using NMI when directed to do so by an Oracle Support Engineer  


Related Items
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7120
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7320
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




In this Document
Goal
Solution
References


Applies to:

Sun ZFS Storage 7420 - Version All Versions and later
Sun ZFS Storage 7120 - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun Storage 7110 Unified Storage System - Version All Versions and later
7000 Appliance OS (Fishworks)

Goal

How to generate a system core dump using NMI when the system hangs (BUI and CLI fail to respond).

Before performing the NMI, open a Service Request with Oracle Support so that an Engineer can verify the system status and collect any additional information that may be necessary for the analysis of the core dump. That information will be lost once the NMI is performed.

The Oracle Engineer will confirm whether the NMI is necessary and, once no further data collection is required, when it can be performed.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Community - Disk Storage ZFS Storage Appliance Community

Solution

Before collecting a system crash dump, try to retrieve some akd information as per Doc 1401288.1 : Storage 7000 Unified Storage System: Data collection for akd hang issue

Also, please collect a gcore of the 'fmd' (FMA) daemon from the shell:

            # gcore -o /var/ak/dropbox/core.fmd `pgrep fmd` &
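
The gcore step can be wrapped so that it refuses to run when the daemon is not found. The following is a minimal sketch, assuming a bash shell on the appliance. The function name 'collect_gcore' is hypothetical; the gcore and pgrep options and the dropbox path are the ones already used in this document.

```shell
#!/bin/bash
# Hypothetical helper: take a gcore of one daemon into the dropbox directory.
# Usage: collect_gcore <daemon-name> [output-dir]
collect_gcore() {
    local name="$1"
    local outdir="${2:-/var/ak/dropbox}"
    local pid

    # -o: oldest matching process, -x: exact name match
    # (the same pgrep options this document uses for akd)
    pid=$(pgrep -ox "$name") || pid=""
    if [ -z "$pid" ]; then
        echo "collect_gcore: no process named '$name' found" >&2
        return 1
    fi

    # gcore appends the PID to the -o prefix, e.g. core.fmd.<pid>
    gcore -o "$outdir/core.$name" "$pid"
}
```

For example, 'collect_gcore fmd &' corresponds to the command above, and 'collect_gcore akd' to the akd collection described later in this document.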

The following will stop a hung system by generating a Non-Maskable Interrupt (NMI). It should force a core dump and reboot the node.

  • Soft NMI


Have 2 ssh sessions running, one to the console and the other to the Service Processor (SP).

On ILOM 2.x:

-> cd /SP/diag
-> set generate_host_nmi=true

On ILOM 3.x:

-> cd /HOST/
-> set generate_host_nmi=true
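
The choice between the two command sequences can be written down as a small lookup. This is only an illustrative sketch, assuming bash. The function name 'ilom_nmi_commands' is hypothetical, and only the 2.x and 3.x cases come from this document; any other version is rejected rather than guessed.

```shell
#!/bin/bash
# Hypothetical helper: print the ILOM CLI commands for a given ILOM version string.
# Only the 2.x and 3.x cases are documented here; anything else is rejected.
ilom_nmi_commands() {
    case "$1" in
        2.*) printf '%s\n' 'cd /SP/diag' 'set generate_host_nmi=true' ;;
        3.*) printf '%s\n' 'cd /HOST/' 'set generate_host_nmi=true' ;;
        *)   echo "ilom_nmi_commands: no documented procedure for ILOM '$1'" >&2
             return 1 ;;
    esac
}
```

The printed lines are meant to be typed at the ILOM '->' prompt in the SP session. The helper does not connect to the SP itself.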

The console session should report something similar to the following:

panic[cpu2]/thread=ffffff001eccbc60: NMI received
 ffffff001eccbac0 pcplusmp:apic_nmi_intr+7c ()
 ffffff001eccbaf0 unix:av_dispatch_nmivect+30 ()
 ffffff001eccbb00 unix:nmiint+154 ()
 ffffff001eccbbf0 unix:mach_cpu_idle+b ()
 ffffff001eccbc20 unix:cpu_idle+c2 ()
 ffffff001eccbc40 unix:idle+114 ()
 ffffff001eccbc50 unix:thread_start+8 ()
 syncing file systems... done
 dumping to /dev/zvol/dsk/system/dump, offset 65536, content: kernel + curproc
 100% done: 356267 pages dumped, compression ratio 3.84, dump succeeded
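
If the console session is being logged to a file, the two key markers in the output above can be checked mechanically. This is a minimal sketch, assuming bash and a saved console transcript. The function name 'console_dump_ok' and the transcript file are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: check a saved console transcript for a completed dump.
# Usage: console_dump_ok <transcript-file>
console_dump_ok() {
    local log="$1"
    # Both markers appear in the sample console output in this document.
    grep -q 'NMI received' "$log" && grep -q 'dump succeeded' "$log"
}
```

A zero exit status only indicates that the dump was written to the dump device. It says nothing about whether savecore has finished afterwards.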

 

PLEASE NOTE: The 'savecore' process, which copies the corefile from the dump device into the root filesystem, must have completed before the supportbundle is collected.

The customer has no supported method to verify that savecore has finished, and the time it takes can vary widely.

A tentative suggestion is to wait a minimum of one hour, or to contact Oracle Support to confirm the next steps.

If the supportbundle is generated too soon, it may contain an incomplete core, and the supportbundle is deleted from the system after upload.

In that case there is no core dump from the NMI and no possibility of root cause analysis (RCA), so it is important that the bundle is not generated until savecore is complete.

 

Generate a bundle after the reboot and the core should be in the cores section of the bundle.

 

Oracle engineers can drop to the shell, check 'debug.sys', and wait for a message similar to:

        Jan  9 17:29:44 hostname savecore: [ID 165606 auth.error] Decompress the crash dump with
        Jan  9 17:29:44 hostname 'savecore -vf /var/ak/core/vmdump.2'

 and, if needed, check whether the 'savecore' process is still running.
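
For engineers with shell access, the two checks above can be combined into a polling loop. This is a minimal sketch, assuming bash. The function names are hypothetical, the location of 'debug.sys' is not given in this document and must be passed in, and the 60-second interval is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper: report whether savecore appears to have finished.
# Usage: savecore_finished <path-to-debug.sys>
# Returns 0 when no savecore process is running and the log contains the
# "Decompress the crash dump" message, 1 otherwise.
savecore_finished() {
    local log="$1"
    if pgrep -x savecore >/dev/null; then
        return 1            # savecore is still copying the dump
    fi
    grep -q 'savecore: .*Decompress the crash dump with' "$log"
}

# Poll until savecore is done, sleeping <interval> seconds between checks.
wait_for_savecore() {
    local log="$1" interval="${2:-60}"
    until savecore_finished "$log"; do
        sleep "$interval"
    done
    echo "savecore finished"
}
```

The log may also contain the same message from an earlier crash dump, so the timestamp of the matching line should still be compared with the time of the NMI before the bundle is generated.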

 

Refer to Sun Storage 7000 Unified Storage System: How to collect supportfile bundle using the BUI or CLI (Doc ID 1019887.1)


NOTE: If the corefile has not been successfully copied to the system core directory (possibly due to a 'quota' issue), the copy can be retried using:

            # savecore  -vd  [directory]

This will save the crash dump files to the specified directory (alternate filesystem location). If 'directory' is not specified, savecore saves the crash dump files to the default savecore directory, configured by dumpadm(1M).

( -v = Verbose.  Enables verbose error messages from savecore.  )
( -d = Disregard dump header valid flag. Force savecore to attempt to save a crash dump even if the header information stored on the dump device indicates the dump has already been saved.  )
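
Before retrying, it can help to confirm that the alternate directory is usable. The following is a minimal sketch, assuming bash. The function name 'savecore_retry_cmd' is hypothetical, and it only prints the savecore command shown above rather than running it.

```shell
#!/bin/bash
# Hypothetical helper: print the savecore retry command for a target directory
# after checking that the directory exists and is writable.
# Usage: savecore_retry_cmd [directory]
savecore_retry_cmd() {
    local dir="$1"
    if [ -z "$dir" ]; then
        # No directory given: savecore uses the default configured by dumpadm(1M).
        echo "savecore -vd"
        return 0
    fi
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "savecore_retry_cmd: '$dir' is not a writable directory" >&2
        return 1
    fi
    echo "savecore -vd $dir"
}
```

Running dumpadm with no arguments prints the current dump configuration, including the default savecore directory, and df on the target directory shows whether enough space is free for the dump.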

 

  • Hardware NMI


If a Soft NMI is not possible from the Service Processor (SP), you can press the NMI button located on the SP, as described below.

NMI switch location for 7110, 7310 and 7410:
The Reset switch on the motherboard sends a reset order to the CPUs, resetting the main system, but not the service processor. The button for this switch is one of the 3 hidden (recessed) buttons on the back of the motherboard located between the NET MGT and NET0 connectors and closest to NET0. It can be pushed by sticking a paper clip or similar object through the hole provided on the rear of the chassis.
[Image: NMI switch location on 7110, 7310 and 7410]

NMI switch location for 7210:
As labeled on the rear side of the 7210 (the button in the middle).

[Image: NMI switch location on 7210]

NMI switch location for 7120 and 7320:
Face the rear of the 7120 or 7320. There are three recessed switch buttons (holes) between the "NET MGT" port to the left and the "NET 0" port of the 1GB NIC to the right. The middle button is the NMI reset switch. This button can be depressed using a straightened out paper clip.

[Image: NMI switch location on 7120 and 7320]

NMI switch location for 7420, BA, ZS3-4 and ZS3-BA:
Face the rear of the 7420. There are three recessed switch buttons (holes) between the "CLUSTER CARD" slot to the left and the "NET 0" port of the 1GB NIC to the right. The rightmost button nearest "NET 0" is the NMI reset switch.  This button can be depressed using a straightened out paper clip.

[Image: NMI switch location on 7420]

 

 

NMI switch location for ZS3-2: NMI is the pinhole between the VGA connector and the SER MGT port. Please see https://support.oracle.com/handbook_private/Systems/ZS3_2/component.rear_zoom.html

NMI switch location for ZS3-ES: The NMI switch is between the LEDs and the NET MGT port. Please see https://mosemp.us.oracle.com/handbook_internal/Systems/ZS3_ES/component.rear_zoom.html

NMI switch location for ZS4-4: Please see https://docs.oracle.com/cd/E38212_01/html/E38213/xffsm.gnjil.html

Tip for Oracle Storage-TSC Support Engineer:

To keep the akd userland threads consistent with the kthread stacks in the crash dump, the gcore of akd must be generated as close as possible in time to the NMI.

In addition, a single ipmitool command can send the NMI, which avoids having to log in to the SP.

The best way to reduce the requirement for typing is to save this script to the dropbox and then to launch it.

Copy the following script into /var/ak/dropbox (as filename script.sh):

#!/bin/bash
# Take a gcore of akd (oldest exact match) into the dropbox
cd /var/ak/dropbox; gcore `pgrep -ox akd`
# Send the NMI immediately afterwards
ipmitool chassis power diag


Then type  chmod +x script.sh  and run it with :

./script.sh

 

NOTE:  ipmitool chassis power diag  may not work. If it does not,  ipmitool power diag  can be used instead.

Example:

ss7120-sin06-a# ipmitool power diag
Chassis Power Control: Diag

panic[cpu6]/thread=ffffff002eb05c40: NMI received


ffffff002eb05a70 pcplusmp:apic_nmi_intr+7c ()
ffffff002eb05aa0 unix:av_dispatch_nmivect+30 ()
ffffff002eb05ab0 unix:nmiint+152 ()
ffffff002eb05ba0 unix:i86_mwait+d ()
ffffff002eb05bf0 unix:cpu_idle_mwait+158 ()
ffffff002eb05c20 unix:idle+112 ()
ffffff002eb05c30 unix:thread_start+8 ()

syncing file systems... done
dumping to /dev/zvol/dsk/system/dump, offset 65536, content: kernel + curproc
 0:15  13% done


Another variant selects the bmc interface explicitly:  ipmitool -I bmc chassis power diag
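
The three ipmitool variants mentioned above can be tried in order, stopping at the first one that succeeds. This is a minimal sketch, assuming bash and that a failing variant returns a non-zero exit status. The function name 'send_nmi' is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: try the ipmitool variants from this document in order
# and stop at the first one that succeeds.
# WARNING: on a real system this panics the host. Run it only when directed
# to do so by Oracle Support.
send_nmi() {
    ipmitool chassis power diag && return 0
    ipmitool power diag && return 0
    ipmitool -I bmc chassis power diag && return 0
    echo "send_nmi: all ipmitool variants failed" >&2
    return 1
}
```

In script.sh, a call to send_nmi could take the place of the single ipmitool line, so that the gcore of akd is still immediately followed by the NMI.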

 

Checked for Currency - 18-FEB-2017


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.