SUNOS-8000-KL - Kernel Panic

Asset ID:	1-79-1173706.1
Update Date:	2018-01-09
Keywords:

Solution Type Predictive Self-Healing Sure

Solution 1173706.1 : SUNOS-8000-KL - Kernel Panic

Applies to:

Sun Microsystems > Operating Systems > Solaris Operating System
SPARC T7-1
SPARC T7-2
SPARC T7-4
SPARC T8-1
Information in this document applies to any platform.

Purpose

This document provides additional information for Message ID: SUNOS-8000-KL

Scope

Details

Type

Defect

defect.sunos.kernel.panic

Severity

Major

Description

The system has rebooted after a kernel panic.

Automated Response

The failed system image was dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the savecore directory.

Impact

There may be some performance impact while the panic is copied to the savecore directory. Disk space usage by panics can be substantial.

Suggested Action for System Administrator

If savecore is not enabled then please take steps to preserve the crash image. Use 'fmdump -Vp -u <EVENT-ID>' to view more panic detail. Please refer to the knowledge article for additional information.

Details

Summary

The operating system has panicked and the system has rebooted. A crash dump (an image of the operating system at the time of failure) has been produced for post-mortem analysis.

An operating system panic (sometimes called a "crash") occurs when the operating system encounters conditions that prohibit it from continuing, such as a critical hardware error, an erroneous data access from kernel or driver software, or a violation of some critical invariant condition. If the integrity of the operating system is compromised then it is in no state to continue to provide service (doing so may lead to data corruption, for instance) and so it chooses to panic and restart.

What To Do?

Check the integrity of the applications hosted on this system. As described below, a panic involves an unclean shutdown in which in-flight data is synced to disk but no formal shutdown scripts/methods are executed.
Preserve the panic information for post-mortem analysis:
- Run fmadm faulty and look for an entry with a MSG-ID of SUNOS-8000-KL (there can be more than one if there have been multiple panics).
- Run the fmdump command listed in the Action field of the fmadm output. If the fmdump output lists savecore-success = 1 and includes a dump-dir and dump-files then the crash dump has been successfully extracted to the indicated directory and files.
- If the crash has not been extracted there may have been an error (e.g., savecore directory filesystem full), or dumpadm -n is in effect (see example below). In both cases you may try to run savecore manually using /usr/bin/savecore <dest-dir>; typically the destination directory you'd choose should be that listed as the "Savecore directory" in dumpadm output.
Panic analysis is a very specialized discipline and typically requires expert source-level knowledge of the affected subsystem. Depending on the nature of the panic, your support organization or vendor may also be interested in any changelogs that are available, system history and that of other like systems, observations of circumstances at the time of the panic, etc. The fmdump command above will include the "panic string" (panicstr) and panic stack (panicstack), and these will also be of interest to your support organization or vendor.
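
The checks in the steps above can be sketched in a few lines of Python that read saved fmdump -Vp text and pull out the members your support vendor will ask about. This is only an illustration: the function names are hypothetical, the sample text is copied from the first example in this article, and the parser accepts both "savecore-success" and the "savecore-succcess" spelling that appears in the example output.

```python
import re

def parse_panic_fields(fmdump_text):
    """Collect 'name = value' members from fmdump -Vp text into a dict."""
    fields = {}
    for line in fmdump_text.splitlines():
        match = re.match(r"\s*([A-Za-z_][\w-]*)\s*=\s*(.*\S)\s*$", line)
        if match:
            # Keep the first occurrence of a repeated name such as 'version'.
            fields.setdefault(match.group(1), match.group(2))
    return fields

def summarize_panic(fields):
    """Reduce the parsed members to the ones needed for triage."""
    # Assumption: either spelling of the savecore status member may appear.
    status = fields.get("savecore-succcess", fields.get("savecore-success"))
    stack = fields.get("panicstack", "")
    frames = [frame.strip() for frame in stack.split("|") if frame.strip()]
    return {
        "extracted": status == "1",
        "dump_dir": fields.get("dump-dir"),
        "dump_files": fields.get("dump-files"),
        "panicstr": fields.get("panicstr"),
        "frames": frames,
    }

# Sample members copied from the first example in this article.
SAMPLE = """\
        class = defect.sunos.kernel.panic
        savecore-succcess = 1
        dump-dir = /var/crash/parity
        dump-files = vmdump.6
        panicstr = mutex_enter: bad mutex, lp=fffffffffbc836b8 owner=fffffffffbc836b8 thread=ffffff096190f7c0
        panicstack = unix:mutex_panic+73 () | unix:mutex_vector_enter+4a3 () | pset:pset_getloadavg+40 () | pset:pset+86 () | unix:brand_sys_syscall32+272 () |
"""

summary = summarize_panic(parse_panic_fields(SAMPLE))
print(summary["extracted"], summary["dump_dir"], summary["dump_files"])
for frame in summary["frames"]:
    print("  ", frame)
```

If "extracted" is false, or dump_dir and dump_files are missing, the crash dump has probably not been saved and the manual savecore step described above applies.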

Example 1

We'll assume a system that is configured with a suitably-sized dump device and for which compressed savecore is enabled (dumpadm -y -z on); this is the usual configuration. We sabotage some kernel state to simulate a panic. If you happened to be watching the console at the time the "corruption" is encountered (if not then simply run fmadm faulty after the reboot) you'd see something like:

panic[cpu2]/thread=ffffff096190f7c0: mutex_enter: bad mutex, lp=fffffffffbc836b8 owner=fffffffffbc836b8 thread=ffffff096190f7c0

ffffff003d00cdb0 unix:mutex_panic+73 ()
ffffff003d00ce20 unix:mutex_vector_enter+4a3 ()
ffffff003d00ce70 pset:pset_getloadavg+40 ()
ffffff003d00ceb0 pset:pset+86 ()
ffffff003d00cf00 unix:brand_sys_syscall32+272 ()

syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
 0:12  75% done
100% done: 807692 pages dumped, dump succeeded
rebooting...
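
The page count in the console output gives a rough idea of how much space the dump needs. The sketch below is illustrative only: the helper name is hypothetical, and the 4 KiB page size is an assumption for this x86 example (check the actual value with pagesize(1); it differs between platforms). The figure is the uncompressed size; a compressed vmdump file in the savecore directory will normally be smaller.

```python
import re

PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB base pages; verify with pagesize(1)

def dumped_pages(console_text):
    """Return the page count from a 'pages dumped' console line, or None."""
    match = re.search(r"(\d+) pages dumped", console_text)
    return int(match.group(1)) if match else None

line = "100% done: 807692 pages dumped, dump succeeded"
pages = dumped_pages(line)
size_bytes = pages * PAGE_SIZE_BYTES
print(f"{pages} pages is about {size_bytes / 2**30:.2f} GiB uncompressed")
```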

Then during the subsequent reboot:

May 24 22:31:49 parity savecore: [ID 803649 auth.error] Saving compressed
system crash dump in /var/crash/parity/vmdump.6

When the crash dump has been saved you see an instruction on how to uncompress the dump (only required if you plan to perform crash analysis - otherwise leave it compressed and provide it to your support vendor in that form):

May 24 22:32:55 parity savecore: Decompress the crash dump with 
May 24 22:32:55 parity 'savecore -vf /var/crash/parity/vmdump.6'

Once the fault management software sees that a new crash dump is available and has been extracted to the filesystem, it produces the following on the console:

SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Mon May 24 22:32:55 PDT 2010
PLATFORM: Sun-Fire-V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: f421b96e-84a6-6922-bb85-a4b15e3411a4
DESC: The system has rebooted after a kernel panic.  Refer to
http://sun.com/msg/SUNOS-8000-KL for more information.
AUTO-RESPONSE: The failed system image was dumped to the dump device.  If
savecore is enabled (see dumpadm(1M)) a copy of the dump will be written
to the savecore directory /var/crash/parity.
IMPACT: There may be some performance impact while the panic is copied
to the savecore directory.  Disk space usage by panics can be substantial.
REC-ACTION: Please log a call with you support vendor and provide them
with this information.  If savecore is not enabled then please take steps
to preserve the crash image.
Use 'fmdump -Vp -u f421b96e-84a6-6922-bb85-a4b15e3411a4' to view more panic detail.

If we run the command suggested in the message we see a little more detail (the more interesting parts are the savecore status, dump-dir, dump-files, panicstr, panicstack and panic-time members):

# fmdump -Vp -u f421b96e-84a6-6922-bb85-a4b15e3411a4
TIME                           UUID                                 SUNW-MSG-ID
May 24 2010 22:32:55.087356000 f421b96e-84a6-6922-bb85-a4b15e3411a4 SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  May 24 22:32:55.0314 ireport.os.sunos.panic.dump_available 0x0000000000000000
  May 24 22:31:49.3979 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = f421b96e-84a6-6922-bb85-a4b15e3411a4
        code = SUNOS-8000-KL
        diag-time = 1274765575 47199
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = sw:///:path=/var/crash/parity/.f421b96e-84a6-6922-bb85-a4b15e3411a4
                resource = sw:///:path=/var/crash/parity/.f421b96e-84a6-6922-bb85-a4b15e3411a4
                savecore-succcess = 1
                dump-dir = /var/crash/parity
                dump-files = vmdump.6
                os-instance-uuid = f421b96e-84a6-6922-bb85-a4b15e3411a4
                panicstr = mutex_enter: bad mutex, lp=fffffffffbc836b8 owner=fffffffffbc836b8 thread=ffffff096190f7c0
                panicstack = unix:mutex_panic+73 () | unix:mutex_vector_enter+4a3 () | pset:pset_getloadavg+40 () | pset:pset+86 () | unix:brand_sys_syscall32+272 () |
                crashtime = 1274765443
                panic-time = Mon May 24 22:30:43 2010 PDT
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x4bfb6107 0x534f260
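
The numeric time members in this output are seconds since the Unix epoch, and converting them is a quick way to cross-check the timeline. The sketch below assumes a fixed UTC-7 offset to match the PDT times printed in this example; the helper name is hypothetical. It uses crashtime, the first number of diag-time, and the first word of __tod, which in this output holds the same value as diag-time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-7 offset, matching the PDT times in this example.
PDT = timezone(timedelta(hours=-7), "PDT")

def epoch_to_local(seconds, tz=PDT):
    """Convert seconds since the Unix epoch to a timezone-aware datetime."""
    return datetime.fromtimestamp(seconds, tz)

crashtime = 1274765443             # crashtime member
diag_time = 1274765575             # first number of diag-time
tod_seconds = int("4bfb6107", 16)  # first word of __tod

panic_at = epoch_to_local(crashtime)
diagnosed_at = epoch_to_local(diag_time)
print("panic:    ", panic_at.strftime("%Y-%m-%d %H:%M:%S %Z"))
print("diagnosis:", diagnosed_at.strftime("%Y-%m-%d %H:%M:%S %Z"))
print(diag_time - crashtime, "seconds from panic to diagnosis")
```

The converted values agree with the panic-time member (22:30:43) and the EVENT-TIME of the console message (22:32:55), so the interval covers the dump, the reboot and the savecore extraction.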

Example 2

If the system does not have savecore enabled (dumpadm -n in effect):

# dumpadm
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/parity
  Savecore enabled: no
   Save compressed: on

then no crash dump will be extracted during the reboot, but a diagnosis will still be made. We'll illustrate this with a forced panic via reboot -d:

# reboot -d
May 24 23:06:03 parity reboot: initiated by root on /dev/console

...
May 24 23:12:58 parity savecore: System dump time: Mon May 24 23:06:22 2010     
May 24 23:12:58 parity savecore: Panic crashdump pending on dump device but dumpadm -n in effect; run savecore(1M) manually to extract. Image UUID 0dd0d7ac-fda4-e07b-b73b-fc47c8016853.

...

SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major               
EVENT-TIME: Mon May 24 23:13:10 PDT 2010                                        
PLATFORM: Sun-Fire-V40z, CSN: XG051535088, HOSTNAME: parity                     
SOURCE: software-diagnosis, REV: 0.1                                            
EVENT-ID: 0dd0d7ac-fda4-e07b-b73b-fc47c8016853                                  
DESC: The system has rebooted after a kernel panic.
Refer to http://sun.com/msg/SUNOS-8000-KL for more information.                                            
AUTO-RESPONSE: The failed system image was dumped to the dump
device.  If savecore is enabled (see dumpadm(1M)) a copy of the
dump will be written to the savecore directory .
IMPACT: There may be some performance impact while the panic is
copied to the savecore directory.  Disk space usage by panics 
can be substantial.
REC-ACTION: Please log a call with you support vendor and provide
them with this information.  If savecore is not enabled then please
take steps to preserve the crash image.
Use fmdump -Vp -u 0dd0d7ac-fda4-e07b-b73b-fc47c8016853 to view more panic detail.

Running fmdump as suggested shows additional information but not the dump-dir and dump-files this time since the dump was not extracted:

# fmdump -Vp -u 0dd0d7ac-fda4-e07b-b73b-fc47c8016853
TIME                           UUID                                 SUNW-MSG-ID
May 24 2010 23:13:10.425447000 0dd0d7ac-fda4-e07b-b73b-fc47c8016853 SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  May 24 23:12:58.7045 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 0dd0d7ac-fda4-e07b-b73b-fc47c8016853
        code = SUNOS-8000-KL
        diag-time = 1274767990 334100
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = sw:///:path=/var/crash/parity/.0dd0d7ac-fda4-e07b-b73b-fc47c8016853
                resource = sw:///:path=/var/crash/parity/.0dd0d7ac-fda4-e07b-b73b-fc47c8016853
                savecore-succcess = 0
                os-instance-uuid = 0dd0d7ac-fda4-e07b-b73b-fc47c8016853
                panicstr = forced crash dump initiated at user request
                panicstack = genunix:kadmin+16e () | genunix:uadmin+168 () | unix:brand_sys_syscall32+272 () | 
                crashtime = 1274767582
                panic-time = Mon May 24 23:06:22 2010 PDT
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x4bfb6a76 0x195bce58
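
When the dump has not been extracted, as here, the manual step is to run savecore against the directory that dumpadm reports. The following sketch parses saved dumpadm output and builds that command line. The function names are hypothetical, and the "extracted" flag is assumed to come from the savecore status member of the fmdump output; nothing is executed, the command is only printed for review.

```python
def parse_dumpadm(dumpadm_text):
    """Map each 'Label: value' line of dumpadm output to a dict entry."""
    settings = {}
    for line in dumpadm_text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            settings[label.strip()] = value.strip()
    return settings

def manual_savecore_command(settings, extracted):
    """Return the savecore command to try by hand, or None if not needed."""
    if extracted:
        return None
    return "/usr/bin/savecore " + settings["Savecore directory"]

# dumpadm output copied from this example.
DUMPADM_OUTPUT = """\
      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/parity
  Savecore enabled: no
   Save compressed: on
"""

settings = parse_dumpadm(DUMPADM_OUTPUT)
command = manual_savecore_command(settings, extracted=False)
print("Savecore enabled:", settings["Savecore enabled"])
print("Suggested command:", command)
```

Run the suggested command promptly: until the dump is extracted it exists only on the dump device and can be overwritten by a later panic.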

Panic Sequence

The following is the sequence of events when a panic condition is detected:

The operating system kernel aborts normal execution immediately. No shutdown scripts are run; in fact all process execution is stopped and only the kernel thread supervising the panic procedure will execute until that thread resets the system.
An attempt is made to sync data to disk (see sync(2)). This means that in-flight data written to filesystems should make its way to stable storage. For some failure causes, such as those involving broken CPUs or panic causes in the filesystem or I/O stack itself, this sync attempt may fail or time out.
If a dump device is configured (see dumpadm(1M)) then an image of the operating system instance that contains all the kernel state at the time of panic is written to the dump device. The dump device also transports messages and error reports that were emitted during the panic procedure, so that they can be logged and diagnosed in the subsequent reboot.
Once the crash dump is written out (or aborted if unsuccessful) the system is reset and begins to boot a new operating system instance from scratch.
When a newly-booted operating system instance sees a crash dump present on the dump device it extracts the dump into the filesystem, ready for analysis. If dumpadm -n is in effect (not recommended for enterprise installations) then the dump is left unextracted on the dump device; it may be overwritten should the system panic again, in which case post-mortem analysis options are limited.
When fault management software starts in the newly-booted instance it will diagnose a defect to track this panic.
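
The last two steps leave a trace in the event list printed by fmdump -Vp. In the two examples in this article, a dump that was extracted shows both a dump_pending_on_device and a dump_available event, while a dump left on the dump device shows only dump_pending_on_device. The sketch below classifies a panic on that basis; this is an inference from those two outputs rather than a documented contract, and the function names are hypothetical.

```python
import re

PENDING = "ireport.os.sunos.panic.dump_pending_on_device"
AVAILABLE = "ireport.os.sunos.panic.dump_available"

def panic_events(fmdump_text):
    """Return the ireport.os.sunos.panic.* class names found in the text."""
    return re.findall(r"ireport\.os\.sunos\.panic\.\w+", fmdump_text)

def dump_state(events):
    """Describe where the crash dump is, judging by the events seen."""
    if AVAILABLE in events:
        return "extracted to the savecore directory"
    if PENDING in events:
        return "still on the dump device"
    return "unknown"

# Event lines copied from the two examples in this article.
EXAMPLE_1 = """\
  May 24 22:32:55.0314 ireport.os.sunos.panic.dump_available 0x0000000000000000
  May 24 22:31:49.3979 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
"""
EXAMPLE_2 = """\
  May 24 23:12:58.7045 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
"""

print("Example 1:", dump_state(panic_events(EXAMPLE_1)))
print("Example 2:", dump_state(panic_events(EXAMPLE_2)))
```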
