Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1533606.1
Update Date:2018-05-25
Keywords:

Solution Type  Problem Resolution Sure

Solution  1533606.1 :   Sun Storage 7000 Unified Storage System: Alert 'defect.sunos.kernel.panic' may NOT actually mean that the system has had a kernel panic  


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS5-2
  • Sun Storage 7110 Unified Storage System
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7210 Unified Storage System
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7410 Unified Storage System
  • Oracle ZFS Storage ZS5-4
  • Sun ZFS Storage 7120
  • Sun Storage 7310 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-BA
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Cause
Solution


Created from <SR 3-6870703278>

Applies to:

Sun Storage 7110 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
Sun ZFS Storage 7320 - Version All Versions and later
Sun ZFS Storage 7120 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

The customer believes that the system suffered a kernel panic during an appliance upgrade, because the following alert was raised:

The system has rebooted after a kernel panic.
Event-ID: 37a88282-e5eb-4912-d212-aec865f9a1c8
Auto-Response: None
Impact: None
Rec-Action: None
Software Component:
Name:sw:///:path=/var/ak/core/.37a88282-e5eb-4912-d212-aec865f9a1c8
Description:

Additional Information: There may be some performance impact while the panic is copied to the savecore directory. Disk space usage by panics can be substantial.

Additional Information: The system has rebooted after a kernel panic.



On system '7310b', an alert was flagged on 27th Feb:

#### alert.ak.txt

Wed Feb 27 19:01:28 2013
        code = SUNOS-8000-KL
        version = 0x0
        product-id = Sun-Fire-X4140
        chassis-id = 0929QBP003
        server-id = 7310b
        mod-name = software-diagnosis
        mod-version = 0.1
        class = defect.sunos.kernel.panic
        path = /var/ak/core/.37a88282-e5eb-4912-d212-aec865f9a1c8
        savecore-succcess = 1
        dump-dir = /var/ak/core
        dump-files = vmdump.0
        os-instance-uuid = 37a88282-e5eb-4912-d212-aec865f9a1c8
        crashtime = 1360831100
        panic-time = Thu Feb 14 08:38:20 2013 UTC
        fault-status = 0x1
        severity = Major
        source = appliance/kit/akd:default
        uuid = 37a88282-e5eb-4912-d212-aec865f9a1c8

  => NOTE: panic-time = Thu Feb 14 08:38:20 2013 UTC

The coredump was actually collected on 14th Feb 2013, almost two weeks before the alert was raised on 27th Feb.
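
The date mismatch noted above can be checked mechanically. The following is a minimal Python sketch, not an appliance tool: it assumes the alert record has been saved as text in the 'key = value' layout shown in alert.ak.txt, and that the alert timestamp on the first line is in UTC (the alert itself does not state a timezone). The function names parse_alert and dump_age are hypothetical.

```python
from datetime import datetime, timezone

# Excerpt of the alert record quoted in this document.
ALERT_TEXT = """\
Wed Feb 27 19:01:28 2013
        class = defect.sunos.kernel.panic
        crashtime = 1360831100
        panic-time = Thu Feb 14 08:38:20 2013 UTC
"""


def parse_alert(text):
    """Return (alert_time, fields) from an alert record in 'key = value' form."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # The first line is the alert timestamp; it is ASSUMED to be UTC here.
    alert_time = datetime.strptime(lines[0], "%a %b %d %H:%M:%S %Y")
    alert_time = alert_time.replace(tzinfo=timezone.utc)
    fields = {}
    for ln in lines[1:]:
        key, sep, value = ln.partition(" = ")
        if sep:
            fields[key.strip()] = value.strip()
    return alert_time, fields


def dump_age(text):
    """Return how much older the dump ('crashtime') is than the alert."""
    alert_time, fields = parse_alert(text)
    # 'crashtime' is seconds since the Unix epoch.
    dump_time = datetime.fromtimestamp(int(fields["crashtime"]), tz=timezone.utc)
    return alert_time - dump_time


if __name__ == "__main__":
    print(f"dump is older than the alert by: {dump_age(ALERT_TEXT)}")
```

With the values from this alert, the dump is a little over 13 days older than the alert. A gap of that size suggests the dump sat on the dump device for some time rather than being written by a panic immediately before the reboot.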


Looking in the debug.sys log:

Feb 14 08:30:22 7310b idmapd[7526]: [ID 280452 daemon.error] Error:  smb_lookup_sid failed.
Feb 14 08:30:22 7310b idmapd[7526]: [ID 455671 daemon.error] Check SMB service (svc:/network/smb/server).
Feb 14 08:30:22 7310b idmapd[7526]: [ID 174421 daemon.error] Check connectivity to Active Directory.
Feb 14 08:30:22 7310b idmapd[7526]: [ID 280452 daemon.error] Error:  smb_lookup_sid failed.
Feb 14 08:30:22 7310b idmapd[7526]: [ID 455671 daemon.error] Check SMB service (svc:/network/smb/server).
Feb 14 08:30:22 7310b idmapd[7526]: [ID 174421 daemon.error] Check connectivity to Active Directory.
Feb 14 08:30:34 7310b smbd[29169]: [ID 702911 daemon.notice] service initialized
Feb 14 08:30:37 7310b smbd[29169]: [ID 702911 daemon.notice] domain axiell.local: domain controller tyfon.axiell.local
Feb 14 08:31:23 7310b smbsrv: [ID 421734 kern.notice] NOTICE: [AXIELL\dn]: . share not found
Feb 14 08:38:20 7310b genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/system/dump, offset 65536, content: kernel + curproc
Feb 14 08:39:27 7310b genunix: [ID 100000 kern.notice]
Feb 14 08:39:27 7310b genunix: [ID 665016 kern.notice] ^M100% done: 2056714 pages dumped,
Feb 14 08:39:27 7310b genunix: [ID 851671 kern.notice] dump succeeded
Feb 14 08:50:58 7310b smbd[29169]: [ID 702911 daemon.notice] domain axiell.local: domain controller tyfon.axiell.local
Feb 14 09:11:18 7310b smbd[29169]: [ID 702911 daemon.notice] domain axiell.local: domain controller tyfon.axiell.local

  => There appears to be some SMB/Active Directory issue at that time.


Analysing the coredump:

$ pwd
/cores/3-6870703278/bundles/ak.7310b-0929QBP003-maguro-2013-02-28.07.44.07/core


CAT(vmcore.0/11X)> coreinfo

core file:      /cores/3-6870703278/bundles/ak.7310b-0929QBP003-maguro-2013-02-28.07.44.07/core/vmcore.0
user:           Robert Kelly - PTS EMEA (robkelly:126854)
release:        5.11 (64-bit)
version:        ak/generic@2011.04.24.3.0,1-1.19
machine:        i86pc
node name:      7310b
system type:    i86pc
hostid:         0
dump_conflags:  0x40000 (DUMP_CURPROC) on /dev/zvol/dsk/system/dump(16G)
moddebug:       0x10 (NOAUTOUNLOAD)
dump_uuid:      37a88282-e5eb-4912-d212-aec865f9a1c8
time in kernel: Thu Feb 14 08:38:22 UTC 2013 (core is 14 days old)        <<<<<<<<
age of system:  0 seconds
CPUs:           8 (31.9G memory)

  => Confirmed - coredump is from 14th Feb 2013

CAT(vmcore.0/11X)> panic
  corefile is from live system...

  => Coredump collected from a 'live' (running) system


Looking at the process list (what was the system doing when the dump was taken?):

CAT(vmcore.0/11X)> proc
       addr         PID    PPID   RUID/UID     size      RSS     swresv   time  command
================== ====== ====== ========== ========== ======== ======== ====== =========
0xfffff60035f10060  29202  29171          0    2400256   946176   233472    279 savecore -Lvd -f akd /var/ak/dropbox/
0xfffff600143c00d0  29194  15417         80   40673280  9424896  9818112      2 /usr/apache2/current/bin/httpd -f /var/run/ak/httpd.conf -k start
0xfffff60039b87010  29171  29161          0    3899392  2674688   557056     17 -bash
0xfffff6005f1a1038  29169      1          0   57393152 17948672  5611520    168 /usr/lib/smbsrv/smbd start
0xfffff600598ad048  29167  15417         80   40706048  9519104  9850880     16 /usr/apache2/current/bin/httpd -f /var/run/ak/httpd.conf -k start
0xfffff600143330b0  29161  29138          0   29769728 17412096  9879552    133 aksh
0xfffff60039a7d0d0  29157  15417         80   40787968  9883648  9932800      7 /usr/apache2/current/bin/httpd -f /var/run/ak/httpd.conf -k start
0xfffff60013155030  29138  29132          0    3899392  2641920   557056     14 -bash
0xfffff6001316b028  29132  29131          0   29175808 16719872  9285632    144 /usr/lib/ak/tools/aksh -l -N4
0xfffff600488b3028  29131  29127          0    2494464  1667072   442368      0 -aksh-wrapper
0xfffff60012d88000  29127  29126          0   21966848 11644928  4022272     47 /usr/lib/ssh/sshd
........

  => NOTE: the first entry in the process list is  "savecore -Lvd -f akd /var/ak/dropbox/"

  => The coredump was specifically collected by 'operator activity'.

  => I believe that the coredump was collected on 14th February by an Oracle Storage-TSC NAS Engineer while investigating another issue/SR.
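
The process-list check can be sketched in a few lines of Python. This is an illustration only: it assumes the 'proc' output has been saved as text with the eight numeric columns shown above followed by the command, and the helper name find_live_dump_commands is hypothetical. It relies on the fact that the 'savecore -L' option saves a crash dump of the live, running system without rebooting it.

```python
# Two rows copied from the 'proc' output quoted in this document.
PROC_LINES = [
    "0xfffff60035f10060  29202  29171          0    2400256   946176   233472    279 savecore -Lvd -f akd /var/ak/dropbox/",
    "0xfffff60039b87010  29171  29161          0    3899392  2674688   557056     17 -bash",
]


def find_live_dump_commands(lines):
    """Return the savecore command lines that include the -L (live dump) flag."""
    hits = []
    for line in lines:
        # Columns: addr, PID, PPID, UID, size, RSS, swresv, time, then the command.
        parts = line.strip().split(None, 8)
        if len(parts) < 9:
            continue  # header, separator or truncated row
        command = parts[8]
        argv = command.split()
        if not argv[0].endswith("savecore"):
            continue
        # Single-letter options may be bundled, e.g. "-Lvd".
        if any(arg.startswith("-") and "L" in arg for arg in argv[1:]):
            hits.append(command)
    return hits


if __name__ == "__main__":
    for cmd in find_live_dump_commands(PROC_LINES):
        print("live dump in progress:", cmd)
```

A match here, together with the 'corefile is from live system' message, indicates an operator-initiated dump rather than a panic.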



SUMMARY

The coredump was collected on 14th Feb:

Feb 14 08:38:20 7310b genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/system/dump, offset 65536, content: kernel + curproc
Feb 14 08:39:27 7310b genunix: [ID 100000 kern.notice]
Feb 14 08:39:27 7310b genunix: [ID 665016 kern.notice] ^M100% done: 2056714 pages dumped,
Feb 14 08:39:27 7310b genunix: [ID 851671 kern.notice] dump succeeded


   ... but the system was NOT rebooted, so the coredump stayed on the 'dump device' and was NOT copied to the root filesystem.


The 'next' reboot was done during the 'upgrade process':

Feb 27 18:55:51 7310b genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version ak/generic@2011.04.24.3.0,1-1.19 64-bit
Feb 27 18:55:51 7310b genunix: [ID 877030 kern.notice] Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.
Feb 27 18:55:51 7310b acpica: [ID 749588 kern.notice] ACPI: RSDP faf20 00014 (v0 ACPIAM)
Feb 27 18:55:51 7310b acpica: [ID 263073 kern.notice] ACPI: RSDT d7fb0000 0005C (v1 SUN    X4x40    00000080 MSFT 00000097)
........


It was during this reboot that the system detected that the dump device contained a coredump - and so copied it to the root filesystem:

Feb 27 18:57:55 7310b savecore: [ID 207219 auth.warning] System dump time: Thu Feb 14 08:38:20 2013
Feb 27 18:57:55 7310b savecore: [ID 500639 auth.error] Saving compressed system crash dump in /var/ak/core/vmdump.0
Feb 27 19:01:28 7310b savecore: [ID 824469 auth.error] Decompress the crash dump with
Feb 27 19:01:28 7310b 'savecore -vf /var/ak/core/vmdump.0'


This is the underlying reason why the alert was generated on 27th Feb, and why the crashdump file in the filesystem is dated 27th Feb even though the dump itself was taken on 14th Feb.
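
The sequence in this summary can be reconstructed from the log itself. The Python sketch below is a hedged illustration: it assumes syslog lines in the format quoted in this document, and the marker strings, labels and function names (classify_events, kept_running_after_dump) are choices made for this example, not part of any Oracle tool. The second function is only a heuristic: if ordinary messages were logged after 'dump succeeded' and before the next boot banner, the system kept running after the dump.

```python
import re

# Lines copied from the log excerpts quoted in this document.
LOG_LINES = [
    "Feb 14 08:38:20 7310b genunix: [ID 111219 kern.notice] dumping to /dev/zvol/dsk/system/dump, offset 65536, content: kernel + curproc",
    "Feb 14 08:39:27 7310b genunix: [ID 851671 kern.notice] dump succeeded",
    "Feb 14 08:50:58 7310b smbd[29169]: [ID 702911 daemon.notice] domain axiell.local: domain controller tyfon.axiell.local",
    "Feb 27 18:55:51 7310b genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version ak/generic@2011.04.24.3.0,1-1.19 64-bit",
    "Feb 27 18:57:55 7310b savecore: [ID 207219 auth.warning] System dump time: Thu Feb 14 08:38:20 2013",
    "Feb 27 18:57:55 7310b savecore: [ID 500639 auth.error] Saving compressed system crash dump in /var/ak/core/vmdump.0",
]

# (label, substring) pairs; the substrings are taken from the quoted messages.
MARKERS = [
    ("dump written to dump device", "dumping to "),
    ("dump completed", "dump succeeded"),
    ("system booted", "SunOS Release "),
    ("savecore reports dump time", "System dump time: "),
    ("dump copied to filesystem", "Saving compressed system crash dump"),
]

STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s")


def classify_events(lines):
    """Return (timestamp, label) for each line that matches a known marker."""
    events = []
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue
        for label, needle in MARKERS:
            if needle in line:
                events.append((match.group(1), label))
                break
    return events


def kept_running_after_dump(lines):
    """Heuristic: True if the first line after 'dump succeeded' is not a boot banner."""
    nonblank = [ln for ln in lines if ln.strip()]
    for idx, line in enumerate(nonblank):
        if "dump succeeded" in line:
            following = nonblank[idx + 1:]
            return bool(following) and "SunOS Release " not in following[0]
    return False  # no completed dump found in these lines


if __name__ == "__main__":
    for stamp, label in classify_events(LOG_LINES):
        print(stamp, "-", label)
    print("system kept running after dump:", kept_running_after_dump(LOG_LINES))
```

Matching on message text rather than on syslog message IDs keeps the sketch readable; a stricter check could key on the IDs shown in the log lines instead.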

 

Cause

When a 'defect.sunos.kernel.panic' alert is generated, this MAY NOT actually indicate that the system has had a kernel panic.

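The 'defect.sunos.kernel.panic' alert appears to be generated whenever the system, during boot, detects that the dump device contains a coredump and copies it to the root filesystem. This happens regardless of whether the coredump was written by a real panic or collected earlier from a live system.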

The log files show the definitive sequence of events when collecting a coredump, rebooting and then generating the alert.

 

 

Solution

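Be aware: When a 'defect.sunos.kernel.panic' alert is generated, this MAY NOT actually indicate that the system has had a kernel panic. Before concluding that a panic occurred, compare the 'panic-time' in the alert with the time the alert was raised, and check whether the coredump was collected from a live system.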

 


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.