Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 2199450.1: ODA Disk Replacement When ASM Disk is Offline
ASM offlines the disk because of a write I/O error, and I/O timeouts can be found in the OS messages file. Even when oakcli and smartctl report the disk as good, the affected disk still needs to be replaced.

Created from <SR 3-13173148281>

Applies to:
Oracle Database Appliance X5-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms
ASM shows two disks offline/dropped.

In the ASM metadata:
0 0 /dev/mapper/HDD_E0_S03_1001930264p2 NORMAL MEMBER ONLINE <<< CLOSED HDD_E0_S03_1001930264P1 UNKNOWN 1 <<<====

In the ASM alert log:
Thu Aug 11 22:02:06 2016
WARNING: Write Failed. group:3 disk:2 AU:1 offset:4190208 size:4096
path:/dev/mapper/HDD_E0_S03_1001930264p2
incarnation:0xe968a916 asynchronous result:'I/O error'
subsys:System krq:0x7f10b1d8b0a8 bufp:0x7f10b1da3000 osderr1:0x69b5 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: Write Failed. group:1 disk:2 AU:1 offset:4190208 size:4096
path:/dev/mapper/HDD_E0_S03_1001930264p1
incarnation:0xe968a901 asynchronous result:'I/O error'
subsys:System krq:0x7f10b1d4c9d8 bufp:0x7f10b1d71000 osderr1:0x69b5 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec
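
To see quickly which disk group, disk number, and device path the "Write Failed" warnings refer to, the alert log text can be scanned with a small script. The following is only a minimal sketch: the names `WriteFailure` and `find_write_failures` are hypothetical, and the regular expression assumes the field order shown in the excerpt above (group, disk, AU, offset, size, then path).

```python
import re
from typing import List, NamedTuple


class WriteFailure(NamedTuple):
    """One 'WARNING: Write Failed.' entry from an ASM alert log."""
    group: int
    disk: int
    path: str


# Assumes the field order seen in this note's excerpt. \s+ before "path:"
# accepts either a space or a line break between "size:" and "path:".
_WRITE_FAILED = re.compile(
    r"WARNING: Write Failed\. group:(\d+) disk:(\d+) "
    r"AU:\d+ offset:\d+ size:\d+\s+path:(\S+)"
)


def find_write_failures(alert_text: str) -> List[WriteFailure]:
    """Return every write failure found in the given alert log text."""
    return [
        WriteFailure(int(group), int(disk), path)
        for group, disk, path in _WRITE_FAILED.findall(alert_text)
    ]
```

For the excerpt in this note, the two entries map to partitions p2 and p1 of the same multipath device, which points at a single physical disk behind both ASM disks.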

In the OS messages file:
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: device_offlined, handle(0x000d)
Aug 11 22:02:04 prddtoda01 multipathd: 8:48: mark as failed
Aug 11 22:02:04 prddtoda01 multipathd: HDD_E0_S03_1001930264: remaining active paths: 1
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: rejecting I/O to offline device
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] killing request
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: rejecting I/O to offline device
Aug 11 22:02:04 prddtoda01 kernel: device-mapper: multipath: Failing path 8:48.
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] Unhandled error code
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:04 prddtoda01 kernel: sd 2:0:3:0: [sdd] CDB: Write(10): 2a 20 00 00 4f f8 00 00 08 00
Aug 11 22:02:04 prddtoda01 kernel: end_request: I/O error, dev sdd, sector 20472
Aug 11 22:02:05 prddtoda01 kernel: sd 2:0:3:0: device_blocked, handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: mpt3sas1: target reset completed: handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: mpt3sas1: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:6829/_scsih_start_unit()!
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: device_offlined, handle(0x000d)
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: rejecting I/O to offline device
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] killing request
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: rejecting I/O to offline device
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] killing request
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Unhandled error code
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Unhandled error code
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] CDB:
Aug 11 22:02:05 prddtoda01 kernel: sd 3:0:3:0: [sdab] CDB: Write(16)Write(10): 2a 20 00 02 10 11 00 00 01 00

However, oakcli and smartctl report the disk as good.

Changes

Cause
In some situations ASM offlines a disk while smartctl and oakcli still report the disk as good. Normally a disk is not replaced in that situation. However, the CPAS result showed that the disk was bad even though smartctl reported it as good, so the disk should be replaced in this situation.

Solution
Physically replace the disk.

Whenever ASM has offlined the disk with a write/read error and, at the same time, the OS messages show the following errors, replace the disk:
device_offlined / rejecting I/O to offline device / mpt3sas0: _scsi_send_scsi_io: timeout
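
The OS-side check described in the Solution can be sketched as a small log scan. This is only an illustration, not an Oracle-provided tool: the names `OS_ERROR_PATTERNS` and `count_os_disk_errors` are hypothetical, and matching `mpt3sas` followed by any controller number (rather than only `mpt3sas0`) is an assumption based on the `mpt3sas1` entries in the Symptoms excerpt. The note does not say whether one or all of the messages must be present, so the sketch only counts what it finds and leaves the decision to the engineer.

```python
import re
from collections import Counter
from typing import Iterable

# Message fragments taken from the Solution text of this note.
# Assumption: the controller index after "mpt3sas" may vary, so any
# number is accepted instead of only "mpt3sas0".
OS_ERROR_PATTERNS = {
    "device_offlined": re.compile(r"device_offlined"),
    "rejecting_io": re.compile(r"rejecting I/O to offline device"),
    "mpt3sas_timeout": re.compile(r"mpt3sas\d+: _scsi_send_scsi_io: timeout"),
}


def count_os_disk_errors(lines: Iterable[str]) -> Counter:
    """Count how many lines match each OS error pattern.

    A line is counted once per pattern it matches; patterns that never
    match simply report a count of 0 when looked up.
    """
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in OS_ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

A typical use would be to pass an open file handle for the OS messages file (assumed here to be /var/log/messages) to `count_os_disk_errors` and compare the timestamps of the matching lines with the "Write Failed" entries in the ASM alert log.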

References
<BUG:24594757> - DISK DROP FROM ASM BECAUSE OF MPT3SAS & IO ERROR
<BUG:24455442> - DISK DROP FROM ASM BECAUSE OF IO ERR IN OS BUT FROM HW SIDE NO ISSUE

Attachments
This solution has no attachment