Exadata: Cell Rebooted SCSI error: return code & kernel: end_request: I/O error

Asset ID:	1-72-1447496.1
Update Date:	2013-10-15

Solution Type Problem Resolution Sure

Solution 1447496.1: Exadata: Cell Rebooted SCSI error: return code & kernel: end_request: I/O error

Applies to:

Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
***Checked for relevance on 15-Oct-2013***

Symptoms

An Exadata storage cell may reboot after the system log reports "SCSI error: return code" and "kernel: end_request: I/O error" messages.

Configuration:-

cellVersion: OSS_11.2.2.4.2_LINUX.X64_111221
kernelVersion: 2.6.18-238.12.2.0.2.el5

Errors Similar To:-

Alert History:
info "IO hang detected on CD_01_cell06. Power cycle forced."

ASM Log:-

Wed Mar 28 14:03:20 2012
WARNING: Disk in group 2 mode 0x7f is now being offlined
ORA-27603: Cell storage I/O error, I/O failed on disk at offset 8392704 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:5 disk:52 AU:2 offset:4096 size:4096
WARNING: cache failed reading from group fn=4 blk=1 count=1 from
disk= kfkist=0x20 status=0x02 file=kfc.c line=11366

System Log:-

kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
cell06 kernel: end_request: I/O error, dev sdg, sector 2006351888
kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
disk LSI MR9261-8i 2.12 /dev/sdac
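
The return code in the system log lines above can be decoded by hand. The following is a minimal sketch, not an Oracle tool: it assumes the usual Linux SCSI layout of the 32-bit result (status, message, host and driver bytes, from low to high) and the host-byte names from the kernel's SCSI headers. The function names and regular expressions are illustrative only.

```python
import re

# Host-byte names as defined by the Linux SCSI midlayer headers.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}

# Patterns matching the two message shapes quoted in this note.
SCSI_ERROR_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): SCSI error: return code = (0x[0-9a-fA-F]+)"
)
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def decode_scsi_result(code):
    """Split a 32-bit SCSI result into its four component bytes."""
    host = (code >> 16) & 0xFF
    return {
        "status": code & 0xFF,
        "msg": (code >> 8) & 0xFF,
        "host": host,
        "driver": (code >> 24) & 0xFF,
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


def scan_syslog(lines):
    """Collect SCSI-error and end_request events from syslog lines."""
    events = []
    for line in lines:
        match = SCSI_ERROR_RE.search(line)
        if match:
            code = int(match.group(2), 16)
            events.append(("scsi_error", match.group(1), decode_scsi_result(code)))
            continue
        match = IO_ERROR_RE.search(line)
        if match:
            events.append(("io_error", match.group(1), int(match.group(2))))
    return events
```

Under these assumptions, 0x00040000 carries host byte 0x04 (DID_BAD_TARGET) with zero status, message and driver bytes, so the error was reported by the host adapter layer rather than as a SCSI status from the disk.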

Cause

The sequence of events is:
A disk in one slot failed. I/O to a disk in another slot then timed out.

This caused the power cycle, because I/Os to other devices should never hang for more than 30 seconds while the cell is dealing with a single bad disk.

If an I/O remains outstanding on a disk for more than 95 seconds, the cell software forces a reboot of the storage server.

Before image 11.2.3.1.0, there was no mechanism to cancel an I/O on a griddisk other than rebooting the server. To avoid the risk of hanging the entire database, the cell software reboots just the one storage cell.

Usually, the reboot provides quiet time for the background disk media scan to start on the offending disk and repair the bad sectors.
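
The 95-second rule described above can be pictured with a small sketch. This is not the cell software's implementation: the function names and the shape of the input (a mapping from disk name to the submit time of its oldest pending I/O) are hypothetical. Only the 95-second threshold comes from this note.

```python
# Threshold quoted in this note; everything else here is hypothetical.
IO_HANG_LIMIT_SECONDS = 95


def find_hung_disks(outstanding, now, limit=IO_HANG_LIMIT_SECONDS):
    """Return disks whose oldest pending I/O is older than the limit.

    outstanding maps a disk name to the submit time, in seconds, of its
    oldest I/O that has not yet completed.
    """
    return sorted(
        disk for disk, submitted in outstanding.items()
        if now - submitted > limit
    )


def should_power_cycle(outstanding, now, limit=IO_HANG_LIMIT_SECONDS):
    """Decide whether any I/O has hung long enough to force a reboot."""
    return bool(find_hung_disks(outstanding, now, limit))
```

For example, with one I/O submitted at t=0 and another at t=90, a check at t=100 flags only the first disk, because only that I/O has been outstanding for more than 95 seconds.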

Solution

The fix is included in 11.2.3.1.0 <Patch: 13536739>
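
To check whether a cell already runs an image at or above 11.2.3.1.0, the cellVersion string shown under Configuration can be compared against the fixed release. The sketch below assumes the string embeds a dotted version number, as in OSS_11.2.2.4.2_LINUX.X64_111221. The helper names are illustrative and not part of any Oracle utility.

```python
import re

# Image release that contains the fix, per this note.
FIXED_IMAGE = (11, 2, 3, 1, 0)


def parse_cell_version(cell_version):
    """Extract the first dotted version number as a tuple of integers."""
    match = re.search(r"\d+(?:\.\d+)+", cell_version)
    if match is None:
        raise ValueError("no dotted version found in %r" % cell_version)
    return tuple(int(part) for part in match.group(0).split("."))


def has_fix(cell_version):
    """True if the cell image is at or above the fixed release."""
    return parse_cell_version(cell_version) >= FIXED_IMAGE
```

Applied to the configuration in this note, has_fix("OSS_11.2.2.4.2_LINUX.X64_111221") evaluates to False, since 11.2.2.4.2 is older than 11.2.3.1.0.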

References

<BUG:12592457> - FENCEMASTER: OSS_IOCTL_FENCE_ENTITY
<BUG:13922277> - CELL NODE REBOOTED - WITH ERRORS IN ASM LOG, MESSAGE & CELL LOGS

Attachments

This solution has no attachment