Exadata Compute Node / Exalogic RAID Controller Failed

Asset ID:	1-72-1362174.1
Update Date:	2015-09-01

Solution Type:	Problem Resolution Sure

Solution 1362174.1 : Exadata Compute Node / Exalogic RAID Controller Failed

Applies to:

Linux OS - Version Oracle Linux 5.0 to Oracle Linux 5.0 [Release OL5]
Exadata Database Machine V2 - Version All Versions and later
Exalogic Elastic Cloud X3-2 Hardware - Version X3 and later
Information in this document applies to any platform.

Symptoms

These symptoms can occur on any Exadata or Exalogic system.

- On the affected compute node, the filesystems become read-only.
- They cannot be remounted read/write:

# mount -o remount,rw /
mount: block device /dev/sda1 is write-protected, mounting read-only

- MegaCli64 commands fail because the controller is no longer visible:

# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0

User specified controller is not present.
Failed to get CpController object.

Exit Code: 0x01

- The console log reports messages such as:

ADP_RESET_GEN2: retry time=3e8, hostdiag=a4
megaraid_sas: FW was restarted successfully, initiating next stage...
megaraid_sas: HBA recovery state machine, state 2 starting...
printk: 9 messages suppressed.
printk: 9 messages suppressed.
megaraid_sas: out: controller is not in ready state
megasas: waiting_for_outstanding: after issue OCR.
megasas: waiting_for_outstanding: before issue OCR. FW state = f0000000
megaraid_sas: pending commands remain even state = f0000000
megaraid_sas: pending commands remain even after reset handling.

megasas[0]: Dumping Frame Phys Address of all pending cmds in FW
megasas[0]: Total OS Pending cmds : 0

megasas[0]: 64 bit SGLs were sent to FW
megasas[0]: Pending OS cmds in FW :
megasas[0]: Frame addr :0x37f22800 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0x167727f, lba_hi : 0x0, sense_buf addr : 0x37f20500,sge count : 0x1

.....

0x7f77f400 : <3>megasas[0]: Dumping Done.

megasas: failed to do reset
sd 0:2:0:0: megasas: RESET -1140663 cmd=2a retries=0
megasas: cannot recover from previous reset failures
sd 0:2:0:0: megasas: RESET -1140663 cmd=2a retries=0
megasas: cannot recover from previous reset failures
sd 0:2:0:0: timing out command, waited 360s
end_request: I/O error, dev sda, sector 23119751
printk: 8 messages suppressed.
Buffer I/O error on device sda1, logical block 2889961
lost page write due to I/O error on sda1
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device

...

__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
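
The symptoms above can be checked quickly with a small script. The following is a minimal sketch only, not an Oracle-supplied tool: the function names list_readonly_mounts and count_raid_failure_signatures are hypothetical, the log path is whatever saved copy of the console output you have available, and the signature strings are simply the messages quoted in this note. A non-zero count is a hint that the node matches this note, not a definitive diagnosis.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative, not part of any Oracle tooling).

# Print the mount point of every /dev-backed filesystem that is mounted
# read-only. Input is a file in /proc/mounts format:
#   device mountpoint fstype options dump pass
list_readonly_mounts() {
    local mounts_file="${1:-/proc/mounts}"
    awk '$1 ~ /^\/dev\// {
             n = split($4, opts, ",")
             for (i = 1; i <= n; i++)
                 if (opts[i] == "ro") { print $2; break }
         }' "$mounts_file"
}

# Count the lines in a saved console log that contain any of the failure
# messages quoted in this note. Fixed-string matching (-F) is used so that
# characters such as "." and "/" are taken literally.
count_raid_failure_signatures() {
    local log_file="$1"
    grep -c -F \
        -e 'megasas: failed to do reset' \
        -e 'megasas: cannot recover from previous reset failures' \
        -e 'megaraid_sas: pending commands remain' \
        -e 'rejecting I/O to offline device' \
        -e 'Remounting filesystem read-only' \
        "$log_file" || true   # grep exits 1 when the count is 0; not an error here
}
```

For example, after sourcing the script, run list_readonly_mounts with no argument to inspect the live /proc/mounts, and run count_raid_failure_signatures against a saved copy of the console output. Because the local filesystems are read-only once the fault occurs, the log copy usually has to come from somewhere other than the node's own disk.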

Cause

These symptoms most likely indicate a failure of the LSI RAID controller in the compute node.

Solution

Open a hardware Service Request (SR) to have the LSI RAID controller replaced on the affected compute node (6GIGABIT SAS RAID PCI EXPRESS HBA, B4 ASIC), then restart the compute node.
