
Asset ID: 1-72-1664436.1
Update Date: 2014-10-31
Keywords:

Solution Type: Problem Resolution Sure

Solution 1664436.1: Storage Server in continuous reboot due to kernel panic and steps to reimage


Related Items
  • Oracle Exadata Storage Server Software
  • Exadata X3-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-8773210011>

Applies to:

Oracle Exadata Storage Server Software - Version 11.1.0.3.0 to 12.1.1.1.0 [Release 11.1 to 12.1]
Exadata X3-2 Hardware - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

The battery was replaced for the LSI disk controller on one of the storage cells. After the replacement, the storage cell is continuously rebooting. The same symptoms may also occur without any hardware maintenance having previously been performed.

Cause

From the ILOM snapshot, or via the /SP/console or the ILOM Remote Console, the following messages are seen before the server reboots, indicating a kernel panic:

md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: Scanned 0 and added 0 devices.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: Scanned 0 and added 0 devices.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: Scanned 0 and added 0 devices.
md: autorun ...
md: ... autorun DONE.
EXT3-fs (md4): error: unable to read superblock   
EXT3-fs (md6): error: unable to read superblock
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.39-400.126.1.el5uek #1
Call Trace:
 [] panic+0xbf/0x1f0
 [] ? free_vfsmnt+0x3a/0x50
 [] ? account_entity_enqueue+0x8f/0xa0
 [] find_new_reaper+0xc9/0xd0
 [] forget_original_parent+0x45/0x1b0
 [] ? sched_move_task+0xb2/0x150
 [] exit_notify+0x16/0x170
 [] do_exit+0x283/0x460
 [] do_group_exit+0x41/0xb0
 [] sys_exit_group+0x17/0x20
 [] system_call_fastpath+0x16/0x1b
Rebooting in 60 seconds..
ACPI MEMORY or I/O RESET_REG.

This suggests that the RAID controller is either not present or has not been able to find any disk drives. The system boots from the internal USB drive but fails to mount the root partition, which resides on the physical disks in slots 0 and 1. Because the root partition cannot be mounted, /sbin/init is not found and the kernel panics.
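For comparison, on a healthy storage cell the root filesystem lives on a software RAID (md) mirror built from the disks in slots 0 and 1. A minimal check of that layout (a sketch only; the active md device, for example /dev/md5 or /dev/md6, varies with the image and active system partition) looks like:

# Run on a healthy cell for comparison
df -h /              # root should be mounted from an md device such as /dev/md5 or /dev/md6
cat /proc/mdstat     # md mirrors should be assembled from partitions of the slot 0 and slot 1 disks
imageinfo            # shows the active system partition and the cell image version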

The ilom/@persist@hostconsole.log, as collected via the ILOM snapshot, will contain lines similar to the following when the LSI HBA controller is properly detected along with its 12 physical disks. Note that the disk manufacturer may differ from the example below:

 

megasas: 06.505.02.00 Wed. Nov. 14 17:00:00 PDT 2012     <<<<<<<<<<<<<<<< megasas kernel module loaded
megasas: 0x1000:0x0079:0x1000:0x9263: bus 80:slot 0:func 0 <<<<<<<<<<<<<< LSI HBA controller detected on the PCI bus
megaraid_sas 0000:50:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
megaraid_sas 0000:50:00.0: setting latency timer to 64
megasas: FW now in Ready state                             <<<<<<<<<<<<<< HBA firmware initialized
megaraid_sas 0000:50:00.0: irq 102 for MSI/MSI-X
megasas: cpx is not supported.
usb 2-1: new high speed USB device number 2 using ehci_hcd
megasas: INIT adapter done                                  <<<<<<<<<<<<< Controller successfully initialized
scsi0 : LSI SAS based MegaRAID driver

...

scsi 0:0:8:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5  <<<<<<<< Disks detected
scsi 0:0:9:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:10:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:11:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:12:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:13:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:14:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
...
scsi 0:0:15:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:16:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:17:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:18:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5
scsi 0:0:19:0: Direct-Access SEAGATE ST32000SSSUN2.0T 061A PQ: 0 ANSI: 5

 

If entries similar to the above do show up, we are likely dealing with corruption of the root filesystem. This generally requires that the cell be reimaged from the internal USB device.
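If an ILOM snapshot has already been collected, a quick way to confirm whether these detection messages are present is to search the host console log inside the unpacked snapshot (the path matches the file named above; adjust it to wherever the snapshot was extracted):

# Search the unpacked ILOM snapshot for controller and disk detection messages
grep -E 'megasas|MegaRAID|Direct-Access' ilom/@persist@hostconsole.log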

 

Solution

This may be caused by an improperly performed battery replacement, by the controller card being improperly seated or damaged during the replacement, or by corruption of the root filesystem.

1: Gather an ILOM snapshot to assist Oracle Support (an example of collecting one from the ILOM CLI is shown below).
2: Open an SR with Oracle Support to determine the cause and assist with resolving the issue. If there is an existing SR, re-engage the owner to assist.
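For reference, the snapshot can be gathered from the ILOM command line roughly as follows (the SFTP user, password, host and directory below are placeholders; see the referenced note for the full procedure and the web interface method):

-> set /SP/diag/snapshot dataset=normal
-> set /SP/diag/snapshot dump_uri=sftp://sftpuser:sftppassword@192.0.2.10/upload
-> show /SP/diag/snapshot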

If it is determined that the storage cell needs to be reimaged, the following steps can be followed via the Java-based ILOM Remote Console. They assume that the internal USB device is healthy.

1. Select the last entry in the GRUB menu. It reads: CELL_USB_BOOT_CELLBOOT_usb_in_rescue_mode

2. When prompted, select "(r)einstall or try to recover damaged system". Confirm your decision when asked "Are you sure?"

3. When prompted whether to erase the data partitions and disks, choose "no".

4. Follow the remaining prompts.

5. Once the cell is back up, the celldisks need to be imported. Run the following, then verify with the check shown after it:

cellcli -e import celldisk all force
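After the import completes, a quick sanity check is to confirm that all celldisks are present and reported as normal, for example:

cellcli -e list celldisk attributes name,status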

6. Check that the flashlog and flashcache have been created:

cellcli -e list flashcache detail
cellcli -e list flashlog detail
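If either object is missing after the reimage, it can usually be recreated with default sizing. Treat the following as a sketch and confirm with Oracle Support before running it on a production cell:

cellcli -e create flashlog all
cellcli -e create flashcache all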

7. Run the following and check that flashCacheMode matches that of a healthy cell (a dcli comparison is sketched below):

cellcli -e list cell detail
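One way to compare the reimaged cell against the healthy cells is to query the attribute across the whole cell group from a database server with dcli (this assumes the usual cell_group file listing all storage cells):

dcli -g cell_group -l root "cellcli -e list cell attributes name,flashCacheMode"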

8. Manually add the griddisks back to ASM:

a. Check the status of the griddisks

sqlplus / as sysasm
col path format a59
set pagesize 200
set linesize 200
select path, name, header_status, mode_status, mount_status, state from v$asm_disk order by path;

- The griddisks belonging to the reimaged cell should show up with a header_status of CANDIDATE

b. For each diskgroup (DATA, RECO, DBFS_DG, etc.), run:

alter diskgroup <diskgroup name> add disk '<cell path prefix from the query above>/<diskgroup name>*<cell name>';

 e.g. for the DATA diskgroup, given a cell named chsmsck00203 whose InfiniBand is configured active/active, hence two IP addresses:

alter diskgroup DATA add disk 'o/10.111.249.15;10.111.249.16/DATA*chsmsck00203';
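Adding the disks starts an ASM rebalance. A minimal way to watch it finish from a database server, and to confirm that the cell now reports its griddisks as ONLINE to ASM, is sketched below (run the query as the Grid Infrastructure owner; the cellcli command runs on the reimaged cell):

sqlplus -s / as sysasm <<'EOF'
select operation, state, sofar, est_work, est_minutes from v$asm_operation;
EOF

cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome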

 



References

<NOTE:1448069.1> - How to run an ILOM Snapshot on a Sun/Oracle X86 System

Attachments
This solution has no attachment