ODA Disk Replacement When ASM Disk is Offline

Asset ID:	1-72-2199450.1
Update Date:	2017-08-02
Keywords:

Solution Type Problem Resolution Sure

Solution 2199450.1 : ODA Disk Replacement When ASM Disk is Offline

Applies to:

Oracle Database Appliance X5-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

ASM shows two disks as offline or dropped.

In the ASM metadata:

0 0 /dev/mapper/HDD_E0_S03_1001930264p2 NORMAL MEMBER ONLINE <<< CLOSED
0 1 /dev/mapper/HDD_E0_S03_1001930264p1 NORMAL MEMBER ONLINE <<< CLOSED

HDD_E0_S03_1001930264P1 UNKNOWN 1 <<<====
HDD_E0_S03_1001930264P2 UNKNOWN 1 <<<====

In the ASM alert log:

Thu Aug 11 22:02:06 2016
WARNING: Write Failed. group:3 disk:2 AU:1 offset:4190208 size:4096
path:/dev/mapper/HDD_E0_S03_1001930264p2
incarnation:0xe968a916 asynchronous result:'I/O error'
subsys:System krq:0x7f10b1d8b0a8 bufp:0x7f10b1da3000 osderr1:0x69b5 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: Write Failed. group:1 disk:2 AU:1 offset:4190208 size:4096
path:/dev/mapper/HDD_E0_S03_1001930264p1
incarnation:0xe968a901 asynchronous result:'I/O error'
subsys:System krq:0x7f10b1d4c9d8 bufp:0x7f10b1d71000 osderr1:0x69b5 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec

In the OS messages file:

Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: device_offlined, handle(0x000d)
Aug 11 22:02:04 prddtoda01 multipathd: 8:48: mark as failed
Aug 11 22:02:04 prddtoda01 multipathd: HDD_E0_S03_1001930264: remaining active paths: 1
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: rejecting I/O to offline device
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] killing request
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: rejecting I/O to offline device
Aug 11 22:02:04 prddtoda01 kernel: device-mapper: multipath: Failing path 8:48.
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] Unhandled error code
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] CDB: Write(10): 2a 20 00 00 4f f8 00 00 08 00
Aug 11 22:02:04 prddtoda01 kernel: end_request: I/O error, dev sdd, sector 20472
Aug 11 22:02:05 prddtoda01 kernel: sd 2:0:3:0: device_blocked, handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: mpt3sas1: target reset completed: handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: mpt3sas1: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:6829/_scsih_start_unit()!
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: device_offlined, handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: rejecting I/O to offline device
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] killing request
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: rejecting I/O to offline device
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] killing request
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Unhandled error code
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Unhandled error code
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] CDB:
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] CDB: Write(16)Write(10): 2a 20 00 02 10 11 00 00 01 00

However, oakcli and smartctl both report the disk as good.

Changes

Cause

There are situations where ASM has offlined the disk, but smartctl and oakcli still report the disk as good.

Usually we do not replace the disk in this situation. However, the CPAS result shows that the disk is bad even when smartctl reports it as good, so the disk should be replaced in this situation.

Solution

Physically replace the disk. Whenever ASM has offlined the disk because of a write or read error, and the OS log shows the following errors at the same time, the disk should be replaced:

device_offlined / rejecting I/O to offline device / mpt3sas0: _scsi_send_scsi_io: timeout
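
The rule above can be sketched as a small log check. This is only an illustration, not an Oracle tool: the function names are hypothetical, the OS signatures are matched as plain substrings using the wording from the OS log excerpt in this note ("rejecting I/O to offline device"), and it is an assumption that ASM read failures are logged in the same "WARNING: ... Failed." form as the write failures shown above. The sketch treats any one OS signature as sufficient, and it does not compare timestamps or confirm that both logs refer to the same disk, so its result should be taken as a hint for manual review only.

```python
import re

# OS-side signatures named in this note, matched as plain substrings.
OS_SIGNATURES = (
    "device_offlined",
    "rejecting I/O to offline device",
    "_scsi_send_scsi_io: timeout",
)

# ASM alert log entry, e.g. "WARNING: Write Failed. group:3 disk:2 ...".
# Assumption: read failures use the same layout with "Read Failed.".
ASM_IO_FAILURE = re.compile(r"WARNING: (?:Write|Read) Failed\. group:\d+ disk:\d+")
ASM_PATH = re.compile(r"path:(\S+)")


def asm_failed_paths(asm_alert_text):
    """Return the device paths that follow an ASM read/write failure warning."""
    paths = set()
    pending = False  # True once a failure warning is seen and its path is awaited
    for line in asm_alert_text.splitlines():
        if ASM_IO_FAILURE.search(line):
            pending = True
            continue
        match = ASM_PATH.search(line)
        if pending and match:
            paths.add(match.group(1))
            pending = False
    return paths


def os_signatures_found(os_messages_text):
    """Return the OS-side signatures that appear anywhere in the messages text."""
    return [sig for sig in OS_SIGNATURES if sig in os_messages_text]


def should_replace_disk(asm_alert_text, os_messages_text):
    """Apply the rule: ASM I/O failure plus at least one OS-side signature."""
    has_asm_failure = bool(asm_failed_paths(asm_alert_text))
    has_os_error = bool(os_signatures_found(os_messages_text))
    return has_asm_failure and has_os_error
```

To use it, read the ASM alert log and the OS messages file into strings and pass them to should_replace_disk(); asm_failed_paths() lists the affected /dev/mapper paths so they can be matched against the slot reported by the appliance.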

References

<BUG:24594757> - DISK DROP FROM ASM BECAUSE OF MPT3SAS & IO ERROR
<BUG:24455442> - DISK DROP FROM ASM BECAUSE OF IO ERR IN OS BUT FROM HW SIDE NO ISSUE

Attachments

This solution has no attachment