Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2199450.1
Update Date:2017-08-02

Solution Type  Problem Resolution Sure

Solution  2199450.1 :   ODA Disk Replacement When ASM Disk is Offline  


Related Items
  • Oracle Database Appliance X5-2
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Database Appliance>DB: ODA_EST


ASM may take a disk offline because of a write I/O error, with I/O timeouts logged in the OS message file. Even if oakcli and smartctl report the disk as good, the disk still needs to be replaced.

Created from <SR 3-13173148281>

Applies to:

Oracle Database Appliance X5-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

ASM shows two disks offline or dropped. Both are partitions (p1 and p2) of the same physical disk.

In the ASM metadata:

0 0 /dev/mapper/HDD_E0_S03_1001930264p2 NORMAL MEMBER ONLINE <<< CLOSED
0 1 /dev/mapper/HDD_E0_S03_1001930264p1 NORMAL MEMBER ONLINE <<< CLOSED

HDD_E0_S03_1001930264P1 UNKNOWN 1 <<<====
HDD_E0_S03_1001930264P2 UNKNOWN 1 <<<====

 

In the ASM alert log:

Thu Aug 11 22:02:06 2016
WARNING: Write Failed. group:3 disk:2 AU:1 offset:4190208 size:4096
path:/dev/mapper/HDD_E0_S03_1001930264p2
incarnation:0xe968a916 asynchronous result:'I/O error'
subsys:System krq:0x7f10b1d8b0a8 bufp:0x7f10b1da3000 osderr1:0x69b5 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: Write Failed. group:1 disk:2 AU:1 offset:4190208 size:4096
path:/dev/mapper/HDD_E0_S03_1001930264p1
incarnation:0xe968a901 asynchronous result:'I/O error'
subsys:System krq:0x7f10b1d4c9d8 bufp:0x7f10b1d71000 osderr1:0x69b5 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec
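
When several disk groups report failures at once, it can help to pull the group, disk number, and device path out of each warning so they can be compared against the OS message file. The following is a minimal sketch, not an Oracle tool: the function name `parse_io_failures` and the sample constant are hypothetical, and the pattern assumes that read failures use the same layout as the write failures shown above, with "Read" in place of "Write".

```python
import re

# First line of an ASM I/O failure warning, as it appears in the excerpt above.
# Assumption: read failures use the same layout with "Read" instead of "Write".
WARNING_RE = re.compile(
    r"WARNING: (?P<op>Write|Read) Failed\. "
    r"group:(?P<group>\d+) disk:(?P<disk>\d+) "
    r"AU:(?P<au>\d+) offset:(?P<offset>\d+) size:(?P<size>\d+)"
)
# Continuation line that carries the device path.
PATH_RE = re.compile(r"path:(?P<path>\S+)")


def parse_io_failures(lines):
    """Return one dict per failure warning, with its device path if present."""
    failures = []
    current = None
    for raw in lines:
        line = raw.strip()
        warning = WARNING_RE.search(line)
        if warning:
            current = {
                "op": warning.group("op"),
                "group": int(warning.group("group")),
                "disk": int(warning.group("disk")),
                "path": None,
            }
            failures.append(current)
            continue
        path = PATH_RE.match(line)
        # Attach only the first path line that follows a warning.
        if path and current is not None and current["path"] is None:
            current["path"] = path.group("path")
    return failures


# Lines copied from the alert log excerpt above.
SAMPLE_ALERT_LOG = """\
WARNING: Write Failed. group:3 disk:2 AU:1 offset:4190208 size:4096
path:/dev/mapper/HDD_E0_S03_1001930264p2
incarnation:0xe968a916 asynchronous result:'I/O error'
WARNING: Write Failed. group:1 disk:2 AU:1 offset:4190208 size:4096
path:/dev/mapper/HDD_E0_S03_1001930264p1
incarnation:0xe968a901 asynchronous result:'I/O error'
"""

if __name__ == "__main__":
    for failure in parse_io_failures(SAMPLE_ALERT_LOG.splitlines()):
        print(failure["op"], failure["group"], failure["disk"], failure["path"])
```

In this excerpt both failures point at partitions of the same multipath device, which is what ties the two ASM disks to one physical drive.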

 

In the OS message file:

Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: device_offlined, handle(0x000d)
Aug 11 22:02:04 prddtoda01 multipathd: 8:48: mark as failed
Aug 11 22:02:04 prddtoda01 multipathd: HDD_E0_S03_1001930264: remaining active paths: 1
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: rejecting I/O to offline device
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] killing request
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: rejecting I/O to offline device
Aug 11 22:02:04 prddtoda01 kernel: device-mapper: multipath: Failing path 8:48.
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] Unhandled error code
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] CDB: Write(10): 2a 20 00 00 4f f8 00 00 08 00
Aug 11 22:02:04 prddtoda01 kernel: end_request: I/O error, dev sdd, sector 20472
Aug 11 22:02:05 prddtoda01 kernel: sd 2:0:3:0: device_blocked, handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: mpt3sas1: target reset completed: handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: mpt3sas1: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:6829/_scsih_start_unit()!
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: device_offlined, handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: rejecting I/O to offline device
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] killing request
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: rejecting I/O to offline device
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] killing request
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Unhandled error code
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Unhandled error code
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] CDB:
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] CDB: Write(16)Write(10): 2a 20 00 02 10 11 00 00 01 00
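
The excerpt above shows both paths (sdd and sdab) to the same multipath device failing with DID_NO_CONNECT. A short script can summarize this from a longer message file. This is only an illustrative sketch: `summarize_messages` is a hypothetical helper, and the regular expressions are written against the line formats seen in this excerpt, so other kernel or multipathd versions may log differently.

```python
import re
from collections import defaultdict

# multipathd line reporting how many paths are still active for a device.
REMAINING_RE = re.compile(
    r"multipathd: (?P<name>\S+): remaining active paths: (?P<count>\d+)"
)
# Kernel line reporting the SCSI result for a failed request on an sd device.
RESULT_RE = re.compile(
    r"kernel: sd \d+:\d+:\d+:\d+: \[(?P<dev>sd[a-z]+)\] "
    r"Result: hostbyte=(?P<hostbyte>\w+)"
)


def summarize_messages(lines):
    """Return (remaining paths per multipath device, hostbyte codes per sd device)."""
    remaining_paths = {}
    host_errors = defaultdict(set)
    for line in lines:
        remaining = REMAINING_RE.search(line)
        if remaining:
            # Keep the most recent count seen for each multipath device.
            remaining_paths[remaining.group("name")] = int(remaining.group("count"))
            continue
        result = RESULT_RE.search(line)
        if result:
            host_errors[result.group("dev")].add(result.group("hostbyte"))
    return remaining_paths, dict(host_errors)


# Lines copied from the OS message file excerpt above.
SAMPLE_MESSAGES = [
    "Aug 11 22:02:04 prddtoda01 multipathd: 8:48: mark as failed",
    "Aug 11 22:02:04 prddtoda01 multipathd: HDD_E0_S03_1001930264: remaining active paths: 1",
    "Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK",
    "Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK",
]

if __name__ == "__main__":
    paths, errors = summarize_messages(SAMPLE_MESSAGES)
    print(paths)
    print(errors)
```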

 

However, oakcli and smartctl both report the disk as good.

Changes

 

Cause

There are situations where ASM has taken the disk offline, but smartctl and oakcli report the disk as good.

Usually the disk is not replaced in this situation. However, the CPAS result showed that the disk was bad even though smartctl reported it as good. The disk should therefore be replaced in this situation.

Solution

Physically replace the disk. Whenever ASM has taken the disk offline with a write or read error and, at the same time, the OS message file shows the following errors, the disk should be replaced:

device_offlined
rejecting I/O to offline device
mpt3sas0: _scsi_send_scsi_io: timeout
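
The rule above can be sketched as a small check. This is a hedged illustration rather than an official procedure: the helper names are hypothetical, the "rejecting I/O to offline device" string is taken from the OS message file excerpt earlier in this note, and the sketch assumes that any one of the listed OS signatures is enough (the note does not say whether all of them must appear together).

```python
# OS message signatures named in the Solution section.
# "rejecting I/O to offline device" is spelled as it appears in the log excerpt.
OS_SIGNATURES = (
    "device_offlined",
    "rejecting I/O to offline device",
    "_scsi_send_scsi_io: timeout",
)


def os_signatures_found(message_lines):
    """Return the set of known signatures that occur in the given log lines."""
    return {
        signature
        for signature in OS_SIGNATURES
        if any(signature in line for line in message_lines)
    }


def should_replace_disk(asm_offlined_with_io_error, message_lines):
    """Apply the rule: ASM offlined the disk on an I/O error AND the OS logged
    at least one of the signatures (assumption: one match is sufficient)."""
    return bool(asm_offlined_with_io_error) and bool(os_signatures_found(message_lines))


# Lines copied from the OS message file excerpt in the Symptoms section.
SAMPLE_MESSAGES = [
    "Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: device_offlined, handle(0x000d)",
    "Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: rejecting I/O to offline device",
]

if __name__ == "__main__":
    print(sorted(os_signatures_found(SAMPLE_MESSAGES)))
    print(should_replace_disk(True, SAMPLE_MESSAGES))
```

The check deliberately ignores oakcli and smartctl status, since this note describes cases where both report the disk as good.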

 

References

<BUG:24594757> - DISK DROP FROM ASM BECAUSE OF MPT3SAS & IO ERROR
<BUG:24455442> - DISK DROP FROM ASM BECAUSE OF IO ERR IN OS BUT FROM HW SIDE NO ISSUE

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.