| Asset ID: | 1-79-1173706.1 |
| Update Date: | 2018-01-09 |
| Keywords: | |
Solution Type
Predictive Self-Healing Sure
Solution 1173706.1: SUNOS-8000-KL - Kernel Panic
| Related Items |
- SPARC T8-1
- SPARC T8-4
- Oracle SuperCluster M7 Hardware
- SPARC M8-8
- SPARC M7-8
- SPARC T7-4
- SPARC T8-2
- Oracle SuperCluster M8 Hardware
- SPARC T7-2
- SPARC M7-16
- SPARC T7-1
|
| Related Categories |
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun PSH
|
Applies to:
Sun Microsystems > Operating Systems > Solaris Operating System
SPARC T7-1
SPARC T7-2
SPARC T7-4
SPARC T8-1
Information in this document applies to any platform.
Purpose
This document provides additional information for Message ID: SUNOS-8000-KL
Scope
Details
Type
Defect
defect.sunos.kernel.panic
Severity
Major
Description
The system has rebooted after a kernel panic.
Automated Response
The failed system image was dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the savecore directory.
Impact
There may be some performance impact while the panic is copied to the savecore directory. Disk space usage by panics can be substantial.
Suggested Action for System Administrator
If savecore is not enabled then please take steps to preserve the crash image. Use 'fmdump -Vp -u <EVENT-ID>' to view more panic detail. Please refer to the knowledge article for additional information.
Details
Summary
The operating system has panicked and the system has rebooted. A crash dump (an image of the operating system at the time of failure) has been produced for post-mortem analysis.
An operating system panic (sometimes called a "crash") occurs when the operating system encounters conditions that prohibit it from continuing, such as a critical hardware error, an erroneous data access from kernel or driver software, or a violation of some critical invariant condition. If the integrity of the operating system is compromised then it is in no state to continue to provide service (doing so may lead to data corruption, for instance) and so it chooses to panic and restart.
What To Do?
- Check the integrity of the applications hosted on this system. As described below, a panic involves an unclean shutdown in which in-flight data is synced to disk but no formal shutdown scripts/methods are executed.
- Preserve the panic information for post-mortem analysis:
- Run fmadm faulty and look for an entry with a MSG-ID of SUNOS-8000-KL (there can be more than one if there have been multiple panics).
- Run the fmdump command listed in the Action field of the fmadm output. If the fmdump output lists savecore-success = 1 (the example output below spells this member savecore-succcess) and includes a dump-dir and dump-files then the crash dump has been successfully extracted to the indicated directory and files.
- If the crash dump has not been extracted there may have been an error (e.g., savecore directory filesystem full), or dumpadm -n is in effect (see Example 2 below). In both cases you may try to run savecore manually using /usr/bin/savecore <dest-dir>; typically the destination directory you'd choose should be that listed as the "Savecore directory" in dumpadm output.
- Panic analysis is a very specialized discipline and typically requires expert source-level knowledge of the affected subsystem. Depending on the nature of the panic, your support organization or vendor may also be interested in any changelogs that are available, system history and that of other like systems, observations of circumstances at the time of the panic, etc. The fmdump command above will include the "panic string" (panicstr) and panic stack (panicstack), and these will also be of interest to your support organization or vendor.
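The check described in the steps above can be sketched in a few lines of Python. This is only an illustration, not part of any Solaris tooling: the function name dump_status is hypothetical, and the sketch assumes the fmdump -Vp output has already been captured as text and that payload members appear as "name = value" lines, as in the examples in this document.

```python
import re


def dump_status(fmdump_text):
    """Classify a captured 'fmdump -Vp -u <EVENT-ID>' report.

    Hypothetical helper: returns ("extracted", dir, files) when the report
    says the dump is already in the filesystem, else ("pending", None, None).
    """
    fields = {}
    for line in fmdump_text.splitlines():
        # Payload members are assumed to look like "name = value".
        match = re.match(r"\s*([\w-]+) = (.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()

    # Accept "savecore-success" as well as the "savecore-succcess"
    # spelling that appears in the example output.
    ok = next(
        (value for name, value in fields.items()
         if re.fullmatch(r"savecore-suc+ess", name)),
        "0",
    )
    if ok == "1" and "dump-dir" in fields and "dump-files" in fields:
        return ("extracted", fields["dump-dir"], fields["dump-files"].split())
    return ("pending", None, None)
```

A "pending" result corresponds to the case where you would try running savecore manually, as described above.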
Example 1
We'll assume a system that is configured with a suitably-sized dump device and for which compressed savecore is enabled (dumpadm -y -z on); this is the usual configuration. We sabotage some kernel state to simulate a panic. If you happened to be watching the console at the time the "corruption" is encountered (if not then simply run fmadm faulty afterwards) you'd see something like:
panic[cpu2]/thread=ffffff096190f7c0: mutex_enter: bad mutex, lp=fffffffffbc836b8 owner=fffffffffbc836b8 thread=ffffff096190f7c0
ffffff003d00cdb0 unix:mutex_panic+73 ()
ffffff003d00ce20 unix:mutex_vector_enter+4a3 ()
ffffff003d00ce70 pset:pset_getloadavg+40 ()
ffffff003d00ceb0 pset:pset+86 ()
ffffff003d00cf00 unix:brand_sys_syscall32+272 ()
syncing file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
0:12 75% done
100% done: 807692 pages dumped, dump succeeded
rebooting...
Then during the subsequent reboot:
May 24 22:31:49 parity savecore: [ID 803649 auth.error] Saving compressed
system crash dump in /var/crash/parity/vmdump.6
When the crash dump has been saved you see an instruction on how to uncompress the dump (only required if you plan to perform crash analysis - otherwise leave it compressed and provide it to your support vendor in that form):
May 24 22:32:55 parity savecore: Decompress the crash dump with
May 24 22:32:55 parity 'savecore -vf /var/crash/parity/vmdump.6'
Once the fault management software sees that a new crash dump is available and has been extracted to the filesystem, it produces the following on the console:
SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Mon May 24 22:32:55 PDT 2010
PLATFORM: Sun-Fire-V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: f421b96e-84a6-6922-bb85-a4b15e3411a4
DESC: The system has rebooted after a kernel panic. Refer to
http://sun.com/msg/SUNOS-8000-KL for more information.
AUTO-RESPONSE: The failed system image was dumped to the dump device. If
savecore is enabled (see dumpadm(1M)) a copy of the dump will be written
to the savecore directory /var/crash/parity.
IMPACT: There may be some performance impact while the panic is copied
to the savecore directory. Disk space usage by panics can be substantial.
REC-ACTION: Please log a call with you support vendor and provide them
with this information. If savecore is not enabled then please take steps
to preserve the crash image.
Use 'fmdump -Vp -u f421b96e-84a6-6922-bb85-a4b15e3411a4' to view more panic detail.
If we run the command suggested in the message we see a little more detail:
# fmdump -Vp -u f421b96e-84a6-6922-bb85-a4b15e3411a4
TIME UUID SUNW-MSG-ID
May 24 2010 22:32:55.087356000 f421b96e-84a6-6922-bb85-a4b15e3411a4 SUNOS-8000-KL
TIME CLASS ENA
May 24 22:32:55.0314 ireport.os.sunos.panic.dump_available 0x0000000000000000
May 24 22:31:49.3979 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = f421b96e-84a6-6922-bb85-a4b15e3411a4
code = SUNOS-8000-KL
diag-time = 1274765575 47199
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru = sw:///:path=/var/crash/parity/.f421b96e-84a6-6922-bb85-a4b15e3411a4
resource = sw:///:path=/var/crash/parity/.f421b96e-84a6-6922-bb85-a4b15e3411a4
savecore-succcess = 1
dump-dir = /var/crash/parity
dump-files = vmdump.6
os-instance-uuid = f421b96e-84a6-6922-bb85-a4b15e3411a4
panicstr = mutex_enter: bad mutex, lp=fffffffffbc836b8 owner=fffffffffbc836b8 thread=ffffff096190f7c0
panicstack = unix:mutex_panic+73 () | unix:mutex_vector_enter+4a3 () | pset:pset_getloadavg+40 () | pset:pset+86 () | unix:brand_sys_syscall32+272 () |
crashtime = 1274765443
panic-time = Mon May 24 22:30:43 2010 PDT
(end fault-list[0])
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x4bfb6107 0x534f260
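If you want to pull the more useful members out of a report like the one above for a support call, a small script can do it. The following Python is a minimal sketch under stated assumptions: parse_members and summarize are hypothetical names, the report is assumed to be captured as text, and the first number of diag-time is treated as epoch seconds (which agrees with the EVENT-TIME shown in this example).

```python
from datetime import datetime, timezone


def parse_members(text):
    """Collect 'name = value' lines into a dict (later duplicates win)."""
    members = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(" = ")
        if sep:
            members[name] = value
    return members


def summarize(members):
    """Summarize panic timing and stack from parsed report members."""
    crash = int(members["crashtime"])
    # Assumption: the first field of diag-time is seconds since the epoch.
    diagnosed = int(members["diag-time"].split()[0])
    # panicstack frames are separated by '|' with a trailing separator.
    frames = [f.strip() for f in members["panicstack"].split("|") if f.strip()]
    return {
        "crash_utc": datetime.fromtimestamp(crash, tz=timezone.utc).isoformat(),
        "seconds_to_diagnosis": diagnosed - crash,
        "top_frame": frames[0],
        "depth": len(frames),
    }
```

For the values in Example 1 (crashtime 1274765443, diag-time 1274765575) this gives a crash time of 2010-05-25T05:30:43+00:00, which is 22:30:43 PDT, and 132 seconds between the panic and the diagnosis.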
Example 2
If the system does not have savecore enabled (dumpadm -n in effect):
# dumpadm
Dump content: kernel pages
Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/parity
Savecore enabled: no
Save compressed: on
then no crash dump will be extracted during the reboot, but a diagnosis will still be made. We'll illustrate this with a forced panic via reboot -d:
# reboot -d
May 24 23:06:03 parity reboot: initiated by root on /dev/console
...
May 24 23:12:58 parity savecore: System dump time: Mon May 24 23:06:22 2010
May 24 23:12:58 parity savecore: Panic crashdump pending on dump device but dumpadm -n in effect; run savecore(1M) manually to extract. Image UUID 0dd0d7ac-fda4-e07b-b73b-fc47c8016853.
...
SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Mon May 24 23:13:10 PDT 2010
PLATFORM: Sun-Fire-V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: 0dd0d7ac-fda4-e07b-b73b-fc47c8016853
DESC: The system has rebooted after a kernel panic.
Refer to http://sun.com/msg/SUNOS-8000-KL for more information.
AUTO-RESPONSE: The failed system image was dumped to the dump
device. If savecore is enabled (see dumpadm(1M)) a copy of the
dump will be written to the savecore directory .
IMPACT: There may be some performance impact while the panic is
copied to the savecore directory. Disk space usage by panics
can be substantial.
REC-ACTION: Please log a call with you support vendor and provide
them with this information. If savecore is not enabled then please
take steps to preserve the crash image.
Use fmdump -Vp -u 0dd0d7ac-fda4-e07b-b73b-fc47c8016853 to view more panic detail.
Running fmdump as suggested shows additional information but not the dump-dir and dump-files this time since the dump was not extracted:
# fmdump -Vp -u 0dd0d7ac-fda4-e07b-b73b-fc47c8016853
TIME UUID SUNW-MSG-ID
May 24 2010 23:13:10.425447000 0dd0d7ac-fda4-e07b-b73b-fc47c8016853 SUNOS-8000-KL
TIME CLASS ENA
May 24 23:12:58.7045 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000
nvlist version: 0
version = 0x0
class = list.suspect
uuid = 0dd0d7ac-fda4-e07b-b73b-fc47c8016853
code = SUNOS-8000-KL
diag-time = 1274767990 334100
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list[0])
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru = sw:///:path=/var/crash/parity/.0dd0d7ac-fda4-e07b-b73b-fc47c8016853
resource = sw:///:path=/var/crash/parity/.0dd0d7ac-fda4-e07b-b73b-fc47c8016853
savecore-succcess = 0
os-instance-uuid = 0dd0d7ac-fda4-e07b-b73b-fc47c8016853
panicstr = forced crash dump initiated at user request
panicstack = genunix:kadmin+16e () | genunix:uadmin+168 () | unix:brand_sys_syscall32+272 () |
crashtime = 1274767582
panic-time = Mon May 24 23:06:22 2010 PDT
(end fault-list[0])
fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x4bfb6a76 0x195bce58
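On a system configured like the one in this example, the dumpadm output tells you both that savecore is disabled and where a manual savecore would normally write. The Python below is a rough sketch of reading those two settings from captured dumpadm output. The function names are hypothetical, and the label text ("Savecore enabled", "Savecore directory") is taken from the dumpadm output shown at the start of this example.

```python
def parse_dumpadm(text):
    """Turn 'Label: value' lines of captured dumpadm output into a dict."""
    settings = {}
    for line in text.splitlines():
        # Split at the first colon only, since values may contain colons.
        label, sep, value = line.partition(":")
        if sep:
            settings[label.strip()] = value.strip()
    return settings


def manual_savecore_command(text):
    """Suggest a manual savecore command when savecore is disabled.

    Returns None when savecore is not reported as disabled or when no
    savecore directory is listed.
    """
    settings = parse_dumpadm(text)
    directory = settings.get("Savecore directory")
    if settings.get("Savecore enabled") != "no" or not directory:
        return None
    return "/usr/bin/savecore " + directory
```

With the dumpadm output shown in this example the sketch suggests /usr/bin/savecore /var/crash/parity. Check that the destination filesystem has enough free space before running it, since crash dumps can be large.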
Panic Sequence
The following is the sequence of events when a panic condition is detected:
- The operating system kernel aborts normal execution immediately. No shutdown scripts are run - in fact all process execution is stopped and only the kernel thread supervising the panic procedure will execute until that thread resets the system.
- An attempt is made to sync data to disk (see sync(2)). This means that in-flight data written to filesystems should make its way to stable storage. For some failure causes, such as those involving broken CPUs or panic causes in the filesystem or I/O stack itself, this sync attempt may fail or time out.
- If a dump device is configured (see dumpadm(1M)) then an image of the operating system instance that contains all the kernel state at the time of panic is written to the dump device. The dump device also transports messages and error reports that were emitted during the panic procedure, so that they can be logged and diagnosed in the subsequent reboot.
- Once the crash dump is written out (or aborted if unsuccessful) the system is reset, and begins to boot a new operating system instance from scratch.
- When a newly-booted operating system instance sees a crash dump present on the dump device it extracts the dump into the filesystem, ready for analysis. If dumpadm -n is in effect (not recommended for enterprise installations) then the dump is left unextracted on the dump device - it may be over-written should the system panic again, in which case post-mortem analysis options are limited.
- When fault management software starts in the newly-booted instance it will diagnose a defect to track this panic.