
Asset ID: 1-75-1497610.1
Update Date: 2018-01-10
Keywords:

Solution Type: Troubleshooting Sure

Solution 1497610.1: Determining when to Replace Disks on Oracle Database Appliance


Related Items
  • Oracle Database Appliance X4-2
  • Oracle Database Appliance
  • Oracle Database Appliance X3-2
  • Oracle Database Appliance X5-2
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: ORA-DATA-APP




In this Document
Purpose
Troubleshooting Steps
 The standard data required for troubleshooting disk issues is odasundiag output.
 Errors for which Disk Replacement is Recommended:
 1) oakcli reports the disk as FAILED
 2) oakcli reports the disk as PredictiveFail status
 3) There are many predictive fault errors in messages log as reported by smartd.
 4) Disk is missing from oakcli.
 5) iostat output shows high utilization on a particular disk drive.
 6)  There are "predictive fault" errors in the messages log.
 7) ASM has dropped the disk
References


Applies to:

Oracle Database Appliance X3-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance X4-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance X5-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance - Version All Versions to All Versions [Release All Releases]
Oracle Enterprise Linux 4.0

Purpose

This document explains the I/O errors seen in /var/log/messages and how to determine whether a disk is faulted and needs to be replaced.

Troubleshooting Steps

The standard data required for troubleshooting disk issues is odasundiag output.

In software release 2.2, this script is already in /root/Extras/odasundiag.sh

If not, it can be downloaded from Document 1390058.1, Oracle Database Appliance Diagnostic Information required for Disk Failures (see References below).
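
To collect the diagnostics, the script can simply be run as root on each node. A minimal sketch, assuming the default installed location noted above:

# Assumption: script is at its default location for release 2.2 and later; run as root on each node
sh /root/Extras/odasundiag.sh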

In the messages log, you will see errors such as the following:

Oct  3 13:04:13 x4370m2-a kernel: sd 7:0:14:0: [sdan] Unhandled sense code
Oct  3 13:04:13 x4370m2-a kernel: sd 7:0:14:0: [sdan] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct  3 13:04:13 x4370m2-a kernel: sd 7:0:14:0: [sdan] Sense Key : Medium Error [current]
Oct  3 13:04:13 x4370m2-a kernel: Info fld=0x5677
Oct  3 13:04:13 x4370m2-a kernel: sd 7:0:14:0: [sdan] Add. Sense: Unrecovered read error
Oct  3 13:04:13 x4370m2-a kernel: sd 7:0:14:0: [sdan] CDB: Read(10): 28 00 00 00 54 00 00 04 00 00
Oct  3 13:04:13 x4370m2-a kernel: end_request: I/O error, dev sdan, sector 22135
Oct  3 13:04:13 x4370m2-a kernel: device-mapper: multipath: Failing path 66:112.
Oct  3 13:04:13 x4370m2-a kernel: end_request: I/O error, dev dm-7, sector 21504

 

The errors above do not, on their own, indicate a failed disk drive.  A failing read operation will, in most cases, only throw an I/O error like the one shown above. When a subsequent write to that sector fails, the bad sector is relocated.  This is normal operation for a disk: drives are designed to internally reallocate/remap bad sectors to a reserved area on the drive.  The drive has a set threshold for the number of bad sectors it will tolerate, and marks itself as failed or predictive-failed when that threshold is reached.
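
To see how close a drive is to its internal threshold, the SMART data can be inspected directly. A minimal sketch using smartmontools (the device name is taken from the example above; the exact attribute names differ between SAS and SATA drives):

# Overall health plus defect/reallocation counters for the suspect drive
smartctl -a /dev/sdan | grep -iE "health|defect|reallocat"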

In the messages file, you may also see the same I/O error reported on a different device:

Sep 28 21:22:16 x4370m2-a kernel: sd 6:0:1:0: [sdd] Sense Key : Medium Error [current]
Sep 28 21:22:16 x4370m2-a kernel: Info fld=0x5677
Sep 28 21:22:16 x4370m2-a kernel: sd 6:0:1:0: [sdd] Add. Sense: Unrecovered read error
Sep 28 21:22:16 x4370m2-a kernel: sd 6:0:1:0: [sdd] CDB: Read(10): 28 00 00 00 54 00 00 04 00 00
Sep 28 21:22:16 x4370m2-a kernel: end_request: I/O error, dev sdd, sector 22135

 

Oakcli actually shows that /dev/sdd and /dev/sdan are the same physical disk:

[root@x4370m2-a log]# oakcli show disk pd_04
Disk: pd_04
        ActionTimeout   :       600
        ActivePath      :       /dev/sdan
        AsmDiskList     :       |data_04||reco_04|
        AutoDiscovery   :       1
        AutoDiscoveryHi :       |data:43:HDD||reco:57:HDD||redo:100
        CheckInterval   :       100
        ColNum          :       0
        DiskId          :       35000c5003a218bcf
        DiskType        :       HDD
        Enabled         :       0
        ExpNum          :       0
        IState          :       0
        Initialized     :       0
        MonitorFlag     :       0
        MultiPathList   :       |/dev/sdan||/dev/sdd|  <<<<<<<<<<<<<<<<<<<< same disk
        Name            :       pd_04
        NewPartAddr     :       0
        PrevState       :       0
        PrevUsrDevName  :
        SectorSize      :       512
        SerialNum       :       001116E0T3DC
        Size            :       600127266304
        SlotNum         :       4
        State           :       Online
        StateChangeTs   :       1349294530
        StateDetails    :       Good
        TotalSectors    :       1172123567
        TypeName        :       0
        UsrDevName      :       HDD_E0_S04_975277007
        gid             :       0
        mode            :       660
        uid             :       0

 

 Oakcli will also show this disk in Good status since it has not yet reached the set threshold for a failed/failing disk:

[root@x4370m2-a log]# oakcli show disk
        NAME            PATH            TYPE            STATE           STATE_DETAILS

        pd_00           /dev/sdam       HDD             ONLINE          Good
        pd_01           /dev/sdaw       HDD             ONLINE          Good
        pd_02           /dev/sdaa       HDD             ONLINE          Good
        pd_03           /dev/sdak       HDD             ONLINE          Good
        pd_04           /dev/sdan       HDD             ONLINE          Good  <<< still in Good status

A useful tool for determining disk failure is oakcli stordiag.  During execution you will be prompted for the root password of the second node several times, since the command collects status from both nodes.  For example:

 

[root@x4370m2-a-n0 ~]# oakcli stordiag pd_04
 Node Name : x4370m2-a-n0
 Test : Diagnostic Test Description

   1  : OAK Check
        NAME            PATH            TYPE            STATE           STATE_DETAILS
        pd_04           /dev/sdd        HDD             ONLINE          Good

   2  : ASM Check
        ASM Disk Status                        :  group_number   state   mode_s  mount_s header_s
        /dev/mapper/HDD_E0_S04_975284803p2     :         2       NORMAL  ONLINE  CACHED  MEMBER
        /dev/mapper/HDD_E0_S04_975284803p1     :         1       NORMAL  ONLINE  CACHED  MEMBER

   3  : Smartctl Health Check
        SMART Health Status: OK

   4  : Multipathd Status
        multipathd running on system

   5  : Multipath Status
        Device List : /dev/sdd   /dev/sdan
        Info:
             HDD_E0_S04_975284803 (35000c5003a21aa43) dm-14 SEAGATE,ST360057SSUN600G
             size=559G features='0' hwhandler='0' wp=rw
             |-+- policy='round-robin 0' prio=1 status=active
             | `- 6:0:1:0  sdd  8:48   active ready running
             `-+- policy='round-robin 0' prio=1 status=enabled
               `- 7:0:14:0 sdan 66:112 active ready running

   6  : Check Partition using fdisk
        Check using active device path: /dev/sdd
        Partition check on device /dev/sdd  :  PASS
        Partition list found by fdisk for active device path: /dev/sdd
                Device Boot      Start         End      Blocks   Id  System
             /dev/sdd1               1       31331   251658240   83  Linux
             /dev/sdd2           31331       72962   334401496   83  Linux
        Check using passive device path: /dev/sdan
        Partition check on device /dev/sdan  :  PASS
        Partition list found by fdisk for passive device path: /dev/sdan
                 Device Boot      Start         End      Blocks   Id  System
             /dev/sdan1               1       31331   251658240   83  Linux
             /dev/sdan2           31331       72962   334401496   83  Linux

   7  : Device Mapper Diagnostics
        Mapper Device : dm-14
        Partition List: HDD_E0_S04_975284803p2 HDD_E0_S04_975284803p1
        Permissions :
                      /dev/mapper/HDD_E0_S04_975284803p2 : brw-rw----  grid  asmadmin
                      /dev/mapper/HDD_E0_S04_975284803p1 : brw-rw----  grid  asmadmin
        Open Ref Count:
        HDD_E0_S04_975284803p2       :  5
                                     :  grid 34378 asm_rbal_+ASM1
                                     :  grid 34382 asm_gmon_+ASM1
                                     :  grid 35879 apx_rbal_+APX1
                                     :  grid 38800 oracle+ASM1_vbg0_+apx1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
        HDD_E0_S04_975284803p1       :  4
                                     :  grid 34378 asm_rbal_+ASM1
                                     :  grid 34382 asm_gmon_+ASM1
                                     :  grid 35879 apx_rbal_+APX1

   8  : asmappl.config and multipath.conf consistency check
The authenticity of host '192.168.16.25 (192.168.16.25)' can't be established.
RSA key fingerprint is 0d:23:97:11:ec:c8:ff:69:71:9e:b4:fd:ff:80:bc:6e.
Are you sure you want to continue connecting (yes/no)? yes
root@192.168.16.25's password:
          /opt/oracle/extapi/asmappl.config file is in sync between nodes
          /etc/multipath.conf file is in sync between nodes

   9  : fwupdate
        ID        Manufacturer   Model               Chassis Slot   Type   Media   Size (GB) FW Version XML Support
        c1d4      SEAGATE        ST360057SSUN600G    0       4      sas    HDD     600       0B25       N/A
        c2d4      SEAGATE        ST360057SSUN600G    0       4      sas    HDD     600       0B25       N/A

  10  : Fishwrap
          Controller "mpt2sas:0d:00.0"
                Disk  /dev/sdd: SEAGATE ST360057SSUN600G (s/n "001116E0T3NQ        6SL0T3NQ"), bay 4
          Controller "mpt2sas:1f:00.0"
                Disk  /dev/sdan: SEAGATE ST360057SSUN600G (s/n "001116E0T3NQ        6SL0T3NQ"), bay 4

  11  : SCSI INQUIRY
        Active multipath device /dev/sdd  :  PASS
        Passive multipath device /dev/sdan  :  PASS

  12  : Multipath Conf for device
           multipath {
             wwid 35000c5003a21aa43
             alias HDD_E0_S04_975284803
       }

  13  : Last five LSI Events Received for slot 4
          [INFO]: No LSI events are recorded in OAKD logs

  14  : Version Information
          OAK              :  12.1.2.1.0
          kernel           :  2.6.39-400.214.3.el5uek
          mpt2sas          :  17.00.06.00
          Multipath        :  0.4.9
          Disk Firmware    :  0B25

  15  : OAK Conf Parms
        Device : queue_depth     Timeout         max_sectors_kb  nr_requests     read_ahead_kb   scheduler
      /dev/sdd :     32              32               1024           4096           128            noop [deadline] cfq
     /dev/sdan :     32              32               1024           4096           128            noop [deadline] cfq

          ******************************
          ********** 2nd NODE **********
          ******************************

root@192.168.16.25's password:
 Node Name : x4370m2-a-n1
 Test : Diagnostic Test Description

   1  : OAK Check
        NAME            PATH            TYPE            STATE           STATE_DETAILS
        pd_04           /dev/sdd        HDD             ONLINE          Good

   2  : ASM Check
        ASM Disk Status                        :  group_number   state   mode_s  mount_s header_s
        /dev/mapper/HDD_E0_S04_975284803p2     :         2       NORMAL  ONLINE  CACHED  MEMBER
        /dev/mapper/HDD_E0_S04_975284803p1     :         1       NORMAL  ONLINE  CACHED  MEMBER

   3  : Smartctl Health Check
        SMART Health Status: OK

   4  : Multipathd Status
        multipathd running on system

   5  : Multipath Status
        Device List : /dev/sdd   /dev/sdan
        Info:
             HDD_E0_S04_975284803 (35000c5003a21aa43) dm-27 SEAGATE,ST360057SSUN600G
             size=559G features='0' hwhandler='0' wp=rw
             |-+- policy='round-robin 0' prio=1 status=active
             | `- 6:0:1:0  sdd  8:48   active ready running
             `-+- policy='round-robin 0' prio=1 status=enabled
               `- 7:0:14:0 sdan 66:112 active ready running

   6  : Check Partition using fdisk
        Check using active device path: /dev/sdd
        Partition check on device /dev/sdd  :  PASS
        Partition list found by fdisk for active device path: /dev/sdd
                Device Boot      Start         End      Blocks   Id  System
             /dev/sdd1               1       31331   251658240   83  Linux
             /dev/sdd2           31331       72962   334401496   83  Linux
        Check using passive device path: /dev/sdan
        Partition check on device /dev/sdan  :  PASS
        Partition list found by fdisk for passive device path: /dev/sdan
                 Device Boot      Start         End      Blocks   Id  System
             /dev/sdan1               1       31331   251658240   83  Linux
             /dev/sdan2           31331       72962   334401496   83  Linux

   7  : Device Mapper Diagnostics
        Mapper Device : dm-27
        Partition List: HDD_E0_S04_975284803p2 HDD_E0_S04_975284803p1
        Permissions :
                      /dev/mapper/HDD_E0_S04_975284803p2 : brw-rw----  grid  asmadmin
                      /dev/mapper/HDD_E0_S04_975284803p1 : brw-rw----  grid  asmadmin
        Open Ref Count:
        HDD_E0_S04_975284803p2       :  6
                                     :  grid 24346 asm_rbal_+ASM2
                                     :  grid 24350 asm_gmon_+ASM2
                                     :  grid 27374 apx_rbal_+APX2
                                     :  grid 32065 oracle+ASM2_vbg0_+apx2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
                                     :  grid 42993 mdb_rbal_-MGMTDB
        HDD_E0_S04_975284803p1       :  16
                                     :  grid 24317 asm_dbw0_+ASM2
                                     :  grid 24346 asm_rbal_+ASM2
                                     :  grid 24350 asm_gmon_+ASM2
                                     :  grid 27374 apx_rbal_+APX2
                                     :  grid 41509 mdb_dbw0_-MGMTDB

   8  : asmappl.config and multipath.conf consistency check
The authenticity of host '192.168.16.24 (192.168.16.24)' can't be established.
RSA key fingerprint is bb:97:4b:c9:1f:81:7a:d3:58:39:8a:da:89:d9:6d:89.
Are you sure you want to continue connecting (yes/no)? yes
root@192.168.16.24's password:
          /opt/oracle/extapi/asmappl.config file is in sync between nodes
          /etc/multipath.conf file is in sync between nodes

   9  : fwupdate
        ID        Manufacturer   Model               Chassis Slot   Type   Media   Size (GB) FW Version XML Support
        c1d4      SEAGATE        ST360057SSUN600G    0       4      sas    HDD     600       0B25       N/A
        c2d4      SEAGATE        ST360057SSUN600G    0       4      sas    HDD     600       0B25       N/A

  10  : Fishwrap
          Controller "mpt2sas:0d:00.0"
                Disk  /dev/sdd: SEAGATE ST360057SSUN600G (s/n "001116E0T3NQ        6SL0T3NQ"), bay 4
          Controller "mpt2sas:1f:00.0"
                Disk  /dev/sdan: SEAGATE ST360057SSUN600G (s/n "001116E0T3NQ        6SL0T3NQ"), bay 4

  11  : SCSI INQUIRY
        Active multipath device /dev/sdd  :  PASS
        Passive multipath device /dev/sdan  :  PASS

  12  : Multipath Conf for device
           multipath {
             wwid 35000c5003a21aa43
             alias HDD_E0_S04_975284803
       }

  13  : Last five LSI Events Received for slot 4
          [INFO]: No LSI events are recorded in OAKD logs

  14  : Version Information
          OAK              :  12.1.2.1.0
          kernel           :  2.6.39-400.214.3.el5uek
          mpt2sas          :  17.00.06.00
          Multipath        :  0.4.9
          Disk Firmware    :  0B25

  15  : OAK Conf Parms
        Device : queue_depth     Timeout         max_sectors_kb  nr_requests     read_ahead_kb   scheduler
      /dev/sdd :     32              32               1024           4096           128            noop [deadline] cfq
     /dev/sdan :     32              32               1024           4096           128            noop [deadline] cfq

        Above details can also be found in log file=/opt/oracle/oak/log/x4370m2-a-n0/stordiag/stordiag_pd_04-2014-12-08-09:41:18.log
[root@x4370m2-a-n0 ~]#

 

Errors for which Disk Replacement is Recommended:

 

1) oakcli reports the disk as FAILED

 

# oakcli show disk
        NAME            PATH            TYPE            STATE           STATE_DETAILS

        e0_pd_03        /dev/sda        HDD             ONLINE          Good
        e0_pd_04        /dev/sdb        HDD             FAILED          DiskRemoved


=> In the ASM alert log, you will see that this disk has been taken offline:


Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gmon_19431.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Write Failed. group:2 disk:4 AU:1 offset:4190208 size:4096   
WARNING: Hbeat write to PST disk 4.3916010383 (HDD_E0_S04_705772328P2) in
group 2 failed.
Thu Aug 30 15:59:21 2012
NOTE: process _b000_+asm1 (20748) initiating offline of disk 4.3916010383  <-------OFFLINED
(HDD_E0_S04_705772328P2) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
GMON checking disk modes for group 2 at 13 for pid 39, osid 20748
NOTE: group RECO: updated PST location: disk 0000 (PST copy 0)
NOTE: group RECO: updated PST location: disk 0001 (PST copy 1)
NOTE: group RECO: updated PST location: disk 0002 (PST copy 2)
NOTE: group RECO: updated PST location: disk 0003 (PST copy 3)
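
A quick way to spot these offline events is to search the ASM alert log. A hedged sketch, assuming the standard ADR layout shown in the trace file path above (the instance name +ASM1 will differ per node):

# Look for the offline initiation and the associated heartbeat write failures
grep -iE "initiating offline of disk|Hbeat write to PST disk" /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log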

2) oakcli reports the disk as PredictiveFail status

 

# oakcli show disk
        NAME            PATH            TYPE            STATE           STATE_DETAILS

        e0_pd_19        /dev/sda        HDD             ONLINE          Good
        e0_pd_20        /dev/sdb        HDD             PARTIAL         PredictiveFail


=> In the ASM alert log, you will see that this disk will be dropped:


WARNING: Disk 20 (SSD_E0_S20_805706842P1) in group 3 will be dropped in: (12960) secs on ASM inst 2
Thu Jun 04 17:25:57 2015
WARNING: Disk 20 (SSD_E0_S20_805706842P1) in group 3 will be dropped in: (12777) secs on ASM inst 2
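
The remaining time before the drop can be tracked in the same ASM alert log. A hedged sketch (same assumed path and instance name as in the example for case 1):

# The countdown decreases toward zero; once it expires ASM drops the disk from the group
grep "will be dropped in" /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log | tail -5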

 

In some instances a disk needs to be replaced even though it has not been marked FAILED or PredictiveFail.
The following are examples of when a disk should be replaced while it is still in ONLINE status:

3) There are many predictive fault errors in messages log as reported by smartd.

This can happen on older OAK releases (2.9 and lower), where a bug prevented oakd from reporting predictive faults.
In the example below, it is OK to replace disk sdax.

Jul 20 07:17:02 dom0 smartd[19156]: Device: /dev/sdax, SMART Failure: FIRMWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH
Jul 20 07:22:01 dom0 kernel: md: data-check of RAID array md0
Jul 20 07:22:01 dom0 kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Jul 20 07:22:01 dom0 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jul 20 07:22:01 dom0 kernel: md: using 128k window, over a total of 513984k.
Jul 20 07:22:01 dom0 kernel: md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Jul 20 07:22:06 dom0 kernel: md: md0: data-check done.
Jul 20 07:22:06 dom0 kernel: md: data-check of RAID array md2
Jul 20 07:22:06 dom0 kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Jul 20 07:22:06 dom0 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jul 20 07:22:06 dom0 kernel: md: using 128k window, over a total of 4096448k.
Jul 20 07:22:55 dom0 kernel: md: md2: data-check done.
Jul 20 07:47:02 dom0 smartd[19156]: Device: /dev/sdax, SMART Failure: FIRMWARE IMPENDING FAILURE DATA ERROR RATE TOO HIGH
Jul 21 22:56:07 dom0 kernel: sd 2:0:1:0: [sdax]  Sense Key : Recovered Error [current] [descriptor]
Jul 21 22:56:07 dom0 kernel: Descriptor sense data with sense descriptors (in hex):
Jul 21 22:56:07 dom0 kernel:         72 01 17 01 00 00 00 34 00 0a 80 00 00 00 00 00
Jul 21 22:56:07 dom0 kernel:         09 67 e6 09 01 0a 00 00 00 00 00 00 00 00 00 00
Jul 21 22:56:07 dom0 kernel:         02 06 00 00 80 00 18 00 03 02 00 2c 05 02 00 00
Jul 21 22:56:07 dom0 kernel:         80 02 17 2c 81 06 00 41 e7 03 06 9d
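
To check whether smartd has been logging these predictive-fault messages, and for which devices, the messages log can be searched directly. A hedged sketch using standard Linux tools:

# Show smartd SMART Failure entries, then count them per device
grep "smartd" /var/log/messages | grep -i "SMART Failure"
grep "smartd" /var/log/messages | grep -i "SMART Failure" | awk '{print $7}' | sort | uniq -c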

 

4) Disk is missing from oakcli.

The following is output from oakcli show storage:

/dev/sdc HGST HDD 7865gb slot: 0 exp: 0
/dev/sdaa HGST HDD 7865gb slot: 1 exp: 0
/dev/sdab HGST HDD 7865gb slot: 2 exp: 0
/dev/sdac HGST HDD 7865gb slot: 3 exp: 0
/dev/sdad HGST HDD 7865gb slot: 4 exp: 0
/dev/sdae HGST HDD 7865gb slot: 5 exp: 0
/dev/sdaf HGST HDD 7865gb slot: 6 exp: 0
/dev/sdag HGST HDD 7865gb slot: 7 exp: 0
/dev/sdah HGST HDD 7865gb slot: 8 exp: 0
/dev/sdai HGST HDD 7865gb slot: 9 exp: 0
Disk is missing in the slot: 10      <<<<<<<<<<<<<<<< disk 10 is missing
/dev/sdaj HGST HDD 7865gb slot: 11 exp: 0
/dev/sdak HGST HDD 7865gb slot: 12 exp: 0
/dev/sdal HGST HDD 7865gb slot: 13 exp: 0
/dev/sdam HGST HDD 7865gb slot: 14 exp: 0
/dev/sdan HGST HDD 7865gb slot: 15 exp: 0
/dev/sdao HGST SSD 400gb slot: 16 exp: 0
/dev/sdap HGST SSD 400gb slot: 17 exp: 0
/dev/sdaq HGST SSD 400gb slot: 18 exp: 0
/dev/sdar HGST SSD 400gb slot: 19 exp: 0
/dev/sdas HGST SSD 200gb slot: 20 exp: 0
/dev/sdat HGST SSD 200gb slot: 21 exp: 0
/dev/sdau HGST SSD 200gb slot: 22 exp: 0
/dev/sdav HGST SSD 200gb slot: 23 exp: 0
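
A hedged cross-check from the OS side (standard Linux commands; on a healthy system the number of sd devices should account for all populated slots plus the internal boot disks):

# Whole-disk devices currently visible to the kernel
ls -d /sys/block/sd*
# Re-confirm the empty slot as reported by oakcli
oakcli show storage | grep -i "missing"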

 

5) iostat output shows high utilization on a particular disk drive.

In this case it is OK to proactively replace the disk drive, but only if there is performance degradation. If there is no performance issue, there is no need to replace the disk.  If you are seeing high utilization on the disk but no other symptoms, refer to BUG 16036958 - TWO SYSTEM DISKS ON DB NODE REPORTING >95% BUSY ACCORDING TO IOSTAT.

In the example below, the disk in question is sda; replace it only if you are also seeing a performance issue:

 

# iostat -x /dev/sd*
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  1.07  0.52     9.44     1.09 6.62     0.01    3.23   3.14   99.50   <<<< you will see that it is very high compared to other disks
sda1              0.00     0.00  1.07  0.52     9.44     0.94 6.53     0.01    3.23   3.14   0.50
sda2              0.00     0.00  0.00  0.00     0.00     0.14 607.54     0.00   11.95   5.64   0.00
sdb               0.00     0.00  1.08  0.51     9.86     1.07 6.86     0.00    3.03   2.95   0.47
sdb1              0.00     0.00  1.08  0.51     9.86     0.92 6.77     0.00    3.03   2.95   0.47
sdb2              0.00     0.00  0.00  0.00     0.00     0.15 660.48     0.00   16.87   5.51   0.00
sdc               0.00     0.00  1.07  0.51     9.52     1.01 6.65     0.00    3.15   3.07   0.49
sdc1              0.00     0.00  1.07  0.51     9.52     0.86 6.55     0.00    3.15   3.07   0.49
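
Note that iostat run without an interval reports averages since boot, which can exaggerate or hide a problem. A hedged sketch that samples live utilization instead:

# Three 5-second samples of extended statistics for all devices; compare %util across disks
iostat -x 5 3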

6)  There are "predictive fault" errors in the messages log.

In the example below, replace disk sdc.

Nov  7 04:37:02 oda1 kernel: scsi target0:0:2: predicted fault
Nov  7 04:37:02 oda1 kernel: sd 0:0:2:0: [sdc]  Sense Key : Recovered Error [current] [descriptor]
Nov  7 04:37:02 oda1 kernel: Descriptor sense data with sense descriptors (in hex):
Nov  7 04:37:02 oda1 kernel:         72 01 5d 62 00 00 00 34 00 0a 80 00 00 00 00 00
Nov  7 04:37:02 oda1 kernel:         00 00 00 ff 01 0a 00 00 00 00 00 00 00 00 00 00
Nov  7 04:37:02 oda1 kernel:         02 06 00 00 00 00 00 00 03 02 00 32 05 02 00 00
Nov  7 04:37:02 oda1 kernel:         80 02 2a 32 81 06 00 00 00 00 00 00
Nov  7 04:37:02 oda1 kernel: sd 0:0:2:0: [sdc]  Add. Sense: Firmware impending failure data error rate too high   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< this is the line to look for, in this example, sdc needs to be replaced
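
To find these entries quickly across all disks, the messages log can be searched for the kernel's predicted-fault and impending-failure text. A hedged sketch:

# Any matching device is a candidate for replacement per the criteria in this section
grep -iE "predicted fault|impending failure" /var/log/messages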

Since the disk is still online, you will need to take special precautions before replacing the disk drive.

Please follow the instructions in Doc ID 2063028.1, How to Replace an online ODA (Oracle Database Appliance) Shared Storage Disk.

7) ASM has dropped the disk

 In this scenario, ASM has dropped the disk, but smartctl and oakcli still show the disk in Good status.  Do not confuse this with the case where disks are dropped from ASM during a rebalance.
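
Before engaging the replacement procedure, ASM's own view of the disks can be confirmed. A hedged sketch, run as the grid user with the ASM environment set (column formatting is illustrative):

# Dropped or offlined disks show up with a non-ONLINE mode_status / non-NORMAL state
sqlplus -s / as sysasm <<'EOF'
set linesize 200 pagesize 100
column path format a45
select group_number, disk_number, path, mode_status, state, repair_timer
from v$asm_disk
order by group_number, disk_number;
EOF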

 Follow the guidelines in Doc ID 2199450.1 ODA-Disk replacement when asm offline disk

References

<BUG:14558447> - ODA - RECEIVING DISK ERRORS FROM OS BUT NOT FROM OAKCLI
<NOTE:1390058.1> - Oracle Database Appliance Diagnostic Information required for Disk Failures
<BUG:16036958> - TWO SYSTEM DISKS ON DB NODE REPORTING >95% BUSY ACCORDING TO IOSTAT
