
Asset ID: 1-79-1992871.1
Update Date: 2016-08-05

Solution Type: Predictive Self-Healing Sure

Solution 1992871.1: Installing and Configuring Kdump on Exalogic Compute Nodes Running Oracle Linux 5 (Linux Physical 2.0.3.x.x & 2.0.6.x.x)


Related Items
  • Oracle Exalogic Elastic Cloud Software
  • Exalogic Elastic Cloud X3-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Purpose
Details
 Pre-requisites
 Crashkernel Configuration
 Starting Kdump Service
 Kdump vmcore Generation Test
References


Applies to:

Oracle Exalogic Elastic Cloud Software - Version 2.0.3.0.0 and later
Exalogic Elastic Cloud X3-2 Hardware - Version X3 to X3 [Release X3]
Linux x86-64
Exalogic Linux Physical Release 2.0.3.0.x (running OL 5)
Exalogic Linux Physical Release 2.0.6.0.x (running OL 5)
Exalogic Linux Physical Release 2.0.6.1.x (running OL 5)
Exalogic Linux Physical Release 2.0.6.2.x (running OL 5)

Purpose

The procedure to configure Kdump described in this document applies to the following EECS releases running Oracle Linux 5 on Exalogic in a Physical configuration:

    -    2.0.3.0.x
    -    2.0.6.0.x
    -    2.0.6.1.x
    -    2.0.6.2.x

For MOS notes covering Kdump configuration on other Exalogic releases, refer to the Kdump configuration Master Note for Exalogic, <Note 1996649.1>.

Details

Pre-requisites

Perform the following pre-requisite steps to configure Kdump on the Exalogic compute nodes:

  1. Verify that the required RPM packages are installed.

    The following packages (attached to this Note) are required for configuring Kdump on Exalogic:

            -    busybox-1.2.0-7.el5 (or higher)
            -    kexec-tools-2.0.3-4.0.9.el5 (or higher)

    Ensure that the packages are installed by running the following command:
    [root@compute-node ~]# rpm -qa | egrep 'kexec-tools|busybox-1'
    busybox-1.2.0-14.el5
    kexec-tools-2.0.3-4.0.9.el5 

    Depending on the EECS version running on the compute node, older versions of these two RPMs may be installed. The required updated RPMs are included in the attachment to this note. Download the attachment and update the RPMs as shown below:
    [root@compute-node ~]# rpm -Uvh <location-of-rpms>/busybox-1.2.0-14.el5.x86_64.rpm

    [root@compute-node ~]# rpm -Uvh <location-of-rpms>/kexec-tools-2.0.3-4.0.9.el5.x86_64.rpm
     
    Run the following commands to update software configuration metadata:
    [root@compute-node ~]# find /usr/lib/init-exalogic-node/ -name "supported_software*.conf" | xargs sed -i 's/^busybox-.*rpm$/busybox-1.2.0-14.el5.x86_64.rpm/g'

    [root@compute-node ~]# find /usr/lib/init-exalogic-node/ -name "supported_software*.conf" | xargs sed -i 's/^kexec-tools-.*rpm$/kexec-tools-2.0.3-4.0.9.el5.x86_64.rpm/g'
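
    To confirm that the substitutions took effect, the same find pattern can be reused to print the updated entries (a quick check, not part of the original procedure; expect one line per matching configuration file):
    [root@compute-node ~]# find /usr/lib/init-exalogic-node/ -name "supported_software*.conf" | xargs grep -h -e '^busybox-' -e '^kexec-tools-'
    busybox-1.2.0-14.el5.x86_64.rpm
    kexec-tools-2.0.3-4.0.9.el5.x86_64.rpm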
     

  2. Back up the /boot/grub/grub.conf file

    Create a backup copy of the /boot/grub/grub.conf file for reference and later restore (if needed), as shown below:
    [root@compute-node ~]# DATESTAMP=`date +%d-%b-%Y_%H-%M-%S`; cp /boot/grub/grub.conf /boot/grub/grub.conf.${DATESTAMP}


  3. Ensure that there is sufficient free disk space for the vmcore file to be written. The default path for generated vmcore files is /var/crash. Ideally, the free space available on the disk should be equal to or greater than the amount of RAM on the compute node. If the available free space is less than the RAM, the vmcore can still be created as long as the memory actually in use is smaller than the available disk space. A quick way to compare the two is shown below.
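
    A minimal comparison, assuming the default /var/crash dump path on the local filesystem (compare the Avail column from df against the total memory reported by free):
    [root@compute-node ~]# df -h /var/crash
    [root@compute-node ~]# free -g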

Crashkernel Configuration

  1. Modify the /boot/grub/grub.conf file to add a crashkernel parameter to the kernel line of the active boot entry.

    The active boot entry is determined by the value of the “default” parameter in the grub.conf file. A value of 0 indicates the first kernel entry; a value of 1 indicates the second kernel entry, and so on.

    Add "crashkernel=128M" to the active Oracle VM Server entry in grub.conf at the end of the kernel command line. The following snippet of grub.conf shows the updated kernel entry with the crashkernel parameter:
    title Oracle Linux Server (2.6.39-400.247.1.el5uek)
            root (hd0,0)
            kernel /vmlinuz-2.6.39-400.247.1.el5uek ro root=/dev/VolGroup00/LogVol00 rhgb quiet console=tty0 console=ttyS0,9600n8 crashkernel=128M
            initrd /initrd-2.6.39-400.247.1.el5uek.img

     
    Note: The kernel in the examples shown in this note is 2.6.39-400.247.1.el5uek.  The actual kernel is dependent on the EECS version running on the compute node.
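
    Before rebooting, you can verify the edit with a quick sanity check (the default value of 0 shown here is illustrative; it must point at the entry you modified):
    [root@compute-node ~]# grep "^default" /boot/grub/grub.conf
    default=0
    [root@compute-node ~]# grep crashkernel /boot/grub/grub.conf
            kernel /vmlinuz-2.6.39-400.247.1.el5uek ro root=/dev/VolGroup00/LogVol00 rhgb quiet console=tty0 console=ttyS0,9600n8 crashkernel=128M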


  2. After making the above change, reboot the compute node for the crash kernel memory reservation to take effect. Once the node comes back up, validate the crashkernel setting as shown below:
    [root@compute-node ~]# dmesg | grep crashkernel
    Command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet console=ttyS0,9600n8 crashkernel=128M
    Reserving 128MB of memory at 752MB for crashkernel (System RAM: 264192MB)
    Kernel command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet console=ttyS0,9600n8 crashkernel=128M


    [root@compute-node ~]# cat /proc/iomem | grep -i "Crash kernel"
    04000000-3bffffff : Crash kernel

    NOTE: If you see the following in the output or in /var/log/messages, it indicates that the crashkernel memory reservation failed:

    kernel: crashkernel reservation failed - No suitable area found.
    kernel: crashkernel reservation failed - memory is in use. 

    Contact Oracle Support if you run into this issue. 
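
    To check the system log for these messages (a simple grep against the standard OL 5 log location):
    [root@compute-node ~]# grep -i "crashkernel reservation failed" /var/log/messages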


  3. Configure the Kdump service to start automatically following node reboots. Run the following command to enable the service in its default run levels:
    [root@compute-node ~]# chkconfig kdump on 

    The successful operation of the above command can be confirmed as follows:  
    [root@compute-node ~]# chkconfig --list | grep kdump
    kdump           0:off   1:off   2:on    3:on    4:on    5:on    6:off


  4. Back up /etc/kdump.conf for reference and later restore (if needed), as shown below:
    [root@compute-node ~]# DATESTAMP=`date +%d-%b-%Y_%H-%M-%S`; cp /etc/kdump.conf /etc/kdump.conf.${DATESTAMP}
     

  5. Append the following sample lines to the end of the existing /etc/kdump.conf:
    ext3 /dev/mapper/VolGroup00-LogVol00
    path /var/crash
    extra_bins /bin/cp
    core_collector makedumpfile -c --message-level 1 -d 31
    default shell 
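
    One way to append these lines non-interactively is shown below (a sketch; you can equally edit the file with vi, and should verify the result before restarting kdump):
    [root@compute-node ~]# { echo 'ext3 /dev/mapper/VolGroup00-LogVol00'; echo 'path /var/crash'; echo 'extra_bins /bin/cp'; echo 'core_collector makedumpfile -c --message-level 1 -d 31'; echo 'default shell'; } >> /etc/kdump.conf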

    An example of the updated kdump.conf file is shown below:
    # Configures where to put the kdump /proc/vmcore files
    #
    # This file contains a series of commands to perform (in order) when a
    # kernel crash has happened and the kdump kernel has been loaded.  Directives in
    # this file are only applicable to the kdump initramfs, and have no effect if
    # the root filesystem is mounted and the normal init scripts are processed
    #
     # Currently only one dump target and path may be configured at once;
     # if the configured dump target fails, the default action will be performed.
     # The default action may be configured with the default directive below.
    #
    # See the kdump.conf(5) man page for details of configuration directives


    #raw /dev/sda5
    #ext3 /dev/sda3
    #ext3 LABEL=/boot
    #ext3 UUID=03138356-5e61-4ab3-b58e-27507ac41937
    #net my.server.com:/export/tmp
    #net user@my.server.com
    #core_collector makedumpfile -c --message-level 1
    #link_delay 60
    #kdump_post /var/crash/scripts/kdump-post.sh
    #extra_bins /usr/bin/lftp
    #disk_timeout 30
    #extra_modules gfs2
    #options modulename options
    #default shell
    #sshkey /root/.ssh/kdump_id_rsa
    #core_collector /bin/cp --sparse=always

    ext3 /dev/mapper/VolGroup00-LogVol00
    path /var/crash
    extra_bins /bin/cp
    core_collector makedumpfile -c --message-level 1 -d 31
    default shell
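
    In this configuration, makedumpfile -c compresses the dump and -d 31 excludes zero, cache, user, and free pages, which keeps the vmcore much smaller than physical RAM. After editing, the active (non-comment) directives can be listed as a quick verification:
    [root@compute-node ~]# grep -v '^#' /etc/kdump.conf | grep -v '^$'
    ext3 /dev/mapper/VolGroup00-LogVol00
    path /var/crash
    extra_bins /bin/cp
    core_collector makedumpfile -c --message-level 1 -d 31
    default shell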


Starting Kdump Service

  1. Start the kdump service and confirm that a new ramdisk is successfully generated.

    Start the kdump service as shown below. Starting the service builds the kdump initrd image if it is not already present:
    [root@compute-node ~]# /etc/init.d/kdump start

    Rebuilding /boot/initrd-2.6.39-400.247.1.el5uekkdump.img                         
    [root@compute-node ~]# tail /var/log/messages
    ...
    Jan 18 09:29:34 host kdump: kexec: loaded kdump kernel
    Jan 18 09:29:34 host kdump: started up


  2. Check the status of the kdump service to ensure that it is operational.
    [root@compute-node ~]# /etc/init.d/kdump status
    Kdump is operational
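
    Optionally, confirm that the crash kernel has been loaded into the reserved region (a value of 1 indicates loaded):
    [root@compute-node ~]# cat /sys/kernel/kexec_crash_loaded
    1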
      

  3. Disable auto-negotiation and transmission flow control on the compute node by referring to the following MOS note.

    <Note 1990121.1>: ILOM Hangs When Compute Node Hangs In Exalogic X3-2 and X4-2 Racks
      

Kdump vmcore Generation Test

  1. Test the Kdump configuration by forcing a system crash:
    [root@compute-node ~]# echo 1 > /proc/sys/kernel/sysrq
    [root@compute-node ~]# echo c > /proc/sysrq-trigger
      
    This will take several minutes depending on the amount of memory on the compute node. You can monitor the progress of Kdump as it writes the vmcore by connecting to the ILOM console, as shown below.
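
    A typical way to reach the console (the ILOM hostname below is a placeholder; run the ssh command from a host with access to the management network):
    $ ssh root@<compute-node-ilom-hostname>
    -> start /SP/console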

  2. After the server boots back into the active kernel, check that the vmcore was created properly under /var/crash, as shown in the example below:
    [root@compute-node ~]# ls -lsh /var/crash/127.0.0.1-2013-01-17-12\:55\:08/
    total 440M
    440M -r-------- 1 root root 7.9G Jan 17 12:56 vmcore

    [root@compute-node ~]# file /var/crash/127.0.0.1-2013-01-17-12\:55\:08/vmcore

    /ssd/var/crash/127.0.0.1-2013-01-17-12:55:08/vmcore: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style 
  3. Compress the vmcore (e.g. with tar cvfz, as shown below) and upload the resulting archive to Oracle Support as needed.
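
    A sample compression command, using the vmcore directory from the example above (the archive name is illustrative):
    [root@compute-node ~]# cd /var/crash
    [root@compute-node crash]# tar cvfz vmcore-127.0.0.1-2013-01-17.tar.gz 127.0.0.1-2013-01-17-12\:55\:08/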

 

IMPORTANT NOTE:

Due to the limited amount of local disk space set aside for use by Kdump, it is highly recommended that you monitor the /var/crash directory on a regular basis to detect new vmcore files, and archive them to a central share on the ZFS storage appliance so that space remains available for any vmcore files created in the future. A sample check is shown below.
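
For example, a periodic check such as the following lists any vmcore files created within the last week (the seven-day window is illustrative; adjust it to your monitoring interval):

[root@compute-node ~]# find /var/crash -name vmcore -mtime -7 -ls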

 

References

<NOTE:1924843.1> - Installing and Configuring Kdump on Exalogic Virtual 2.0.6.x.x Compute Nodes (Dom0)
<NOTE:1585178.1> - How to configure KDump for Compute Nodes (Dom0) In Exalogic Elastic Cloud Software Version 2.0.4.0.2 Virtual Releases
<NOTE:1423490.1> - How To Configure and Initiate A Linux Kernel Dump On Exalogic Linux Physical Releases 1.x.x.x.x And 2.0.0.x.x
<NOTE:1996649.1> - Master Note: Kdump Configuration For Exalogic Elastic Cloud Software
