Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1447496.1
Update Date:2013-10-15

Solution Type  Problem Resolution Sure

Solution 1447496.1: Exadata: Cell Rebooted SCSI error: return code & kernel: end_request: I/O error


Related Items
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: EXADATA
  • _Old GCS Categories>Sun Microsystems>Specialized Systems>Database Systems




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-5525182881>

Applies to:

Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
***Checked for relevance on 15-Oct-2013***

Symptoms

An Exadata storage cell may reboot after logging the messages "SCSI error: return code" and "kernel: end_request: I/O error":

 

Configuration:-

cellVersion: OSS_11.2.2.4.2_LINUX.X64_111221
kernelVersion: 2.6.18-238.12.2.0.2.el5


Errors Similar To:-

Alert History:
info "IO hang detected on CD_01_cell06. Power cycle forced."


ASM Log:-

Wed Mar 28 14:03:20 2012
WARNING: Disk  in group 2 mode 0x7f is now being offlined
ORA-27603: Cell storage I/O error, I/O failed on disk  at offset 8392704 for data length 4096
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:5 disk:52 AU:2 offset:4096 size:4096
WARNING: cache failed reading from group fn=4 blk=1 count=1 from
disk=  kfkist=0x20 status=0x02 file=kfc.c line=11366


System Log:- 

kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
cell06 kernel: end_request: I/O error, dev sdg, sector 2006351888
kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000
disk LSI MR9261-8i 2.12 /dev/sdac
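
The return code in the system log lines above can be split into its component bytes. The following is a minimal sketch, assuming the standard Linux SCSI midlayer packing (driver, host, message and status bytes, from high to low) and the host-byte names from the kernel's scsi.h. The function names are illustrative and are not part of any Exadata tool, and this note itself does not interpret the code.

```python
import re

# Host-byte codes as defined by the Linux SCSI midlayer (include/scsi/scsi.h).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}

# Matches syslog lines of the form shown in this note, e.g.
# "kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000"
SCSI_ERROR_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): SCSI error: return code = (0x[0-9a-fA-F]+)"
)


def decode_scsi_result(code):
    """Split a 32-bit SCSI result word into its four bytes."""
    return {
        "driver_byte": (code >> 24) & 0xFF,
        "host_byte": (code >> 16) & 0xFF,
        "msg_byte": (code >> 8) & 0xFF,
        "status_byte": code & 0xFF,
    }


def parse_scsi_error(line):
    """Return decoded fields for a SCSI error line, or None if no match."""
    match = SCSI_ERROR_RE.search(line)
    if match is None:
        return None
    fields = decode_scsi_result(int(match.group(2), 16))
    fields["device"] = match.group(1)  # host:channel:target:lun
    fields["host_name"] = HOST_BYTE_NAMES.get(fields["host_byte"], "UNKNOWN")
    return fields


if __name__ == "__main__":
    sample = "kernel: sd 0:2:6:0: SCSI error: return code = 0x00040000"
    print(parse_scsi_error(sample))
```

Under that assumption, 0x00040000 carries a host byte of 0x04 (DID_BAD_TARGET) with all other bytes zero, meaning the error was reported at the host adapter level rather than as a SCSI status from the disk.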

Cause

The sequence of events is as follows:
A disk in one slot failed. I/O to a disk in another slot then timed out.

This caused the power cycle, because I/O to other devices should never hang for more than 30 seconds while the cell is having trouble with a single bad disk.

If an I/O remains outstanding on a disk for more than 95 seconds, the storage server is rebooted.

Prior to image 11.2.3.1.0, there was no mechanism to cancel an I/O on a grid disk other than rebooting the server. To avoid the risk of hanging the entire database, only the affected storage cell is rebooted.

Usually, the reboot provides quiet time for the background disk media scan to start on the offending disk and repair the bad sectors.
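
The two thresholds described above can be illustrated with a small sketch. This is not the cell software's actual logic. It only encodes the 30-second and 95-second figures quoted in this note, and the function name and the "ok" / "abnormal" / "reboot" labels are hypothetical.

```python
# Thresholds quoted in this note; everything else here is hypothetical.
EXPECTED_MAX_HANG_SECONDS = 30   # I/O to other disks should not hang longer
REBOOT_THRESHOLD_SECONDS = 95    # beyond this, the storage server is rebooted


def classify_outstanding_io(seconds_outstanding):
    """Classify how long an I/O has been outstanding on a disk."""
    if seconds_outstanding > REBOOT_THRESHOLD_SECONDS:
        return "reboot"      # hang exceeds the forced power-cycle threshold
    if seconds_outstanding > EXPECTED_MAX_HANG_SECONDS:
        return "abnormal"    # longer than expected, but below the reboot limit
    return "ok"


if __name__ == "__main__":
    for seconds in (10, 45, 120):
        print(seconds, classify_outstanding_io(seconds))
```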

Solution

The fix is included in image 11.2.3.1.0 <Patch: 13536739>.
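
To check whether a cell is already at or beyond the fixed image, the cellVersion string can be compared against 11.2.3.1.0. The sketch below assumes the string always follows the OSS_<dotted version>_... layout shown in the Configuration section. The function names are illustrative, and the second sample string in the usage is made up for demonstration.

```python
import re

# First image that contains the fix, per this note.
FIXED_IMAGE = (11, 2, 3, 1, 0)


def parse_cell_version(cell_version):
    """Extract the dotted version from a string like
    'OSS_11.2.2.4.2_LINUX.X64_111221' as a tuple of integers."""
    match = re.match(r"OSS_(\d+(?:\.\d+)*)_", cell_version)
    if match is None:
        raise ValueError("unrecognized cellVersion: %r" % cell_version)
    return tuple(int(part) for part in match.group(1).split("."))


def has_fix(cell_version):
    """Return True if the image version is 11.2.3.1.0 or later."""
    return parse_cell_version(cell_version) >= FIXED_IMAGE


if __name__ == "__main__":
    # Version from the Configuration section of this note.
    print(has_fix("OSS_11.2.2.4.2_LINUX.X64_111221"))
    # Hypothetical string for an image at the fixed level.
    print(has_fix("OSS_11.2.3.1.0_LINUX.X64_000000"))
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexically, where "11.2.10" would sort before "11.2.3".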

References

<BUG:12592457> - FENCEMASTER: OSS_IOCTL_FENCE_ENTITY
<BUG:13922277> - CELL NODE REBOOTED - WITH ERRORS IN ASM LOG, MESSAGE & CELL LOGS

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.