
Asset ID: 1-75-1452325.1
Update Date: 2017-12-22
Keywords:

Solution Type  Troubleshooting Sure

Solution  1452325.1 :   Determining when Disks should be replaced on Oracle Exadata Database Machine  


Related Items
  • Exadata X3-2 Hardware
  • Exadata X4-2 Hardware
  • Exadata Database Machine X2-2 Qtr Rack
  • Exadata X3-2 Half Rack
  • Exadata X4-2 Quarter Rack
  • Oracle Exadata Hardware
  • Exadata Database Machine X2-8
  • Exadata X4-2 Half Rack
  • Exadata X4-8 Hardware
  • Exadata X3-2 Full Rack
  • Exadata Database Machine X2-2 Full Rack
  • Exadata X3-8 Hardware
  • Exadata Database Machine X2-2 Half Rack
  • Exadata X4-2 Full Rack
  • Exadata Database Machine X2-2 Hardware
  • Exadata X3-2 Quarter Rack
  • Exadata X4-2 Eighth Rack
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: EXADATA


This document explains which I/O errors require disk replacement, which do not, and which should be investigated further. I/O errors can be reported in different places for different reasons, and not every I/O error is due to a physical hard disk problem that requires replacement.

In this Document
Purpose
Troubleshooting Steps
 About Disk Error Handling:
 Errors for which Disk Replacement is Recommended:
 Case R1. Cell's alerthistory reports the drive has changed its S.M.A.R.T. status to "Predictive Failure":
 Case R2. Cell's alerthistory reports the drive lun has experienced a critical error from which it cannot recover:
 Case R3. DB nodes where the MegaCli status is shown as "Firmware state: (Unconfigured Bad)" or "Firmware state: Failed", preceded by logged errors indicating the drive was Failed or Predictive Failed:
 Case R4. DB nodes where the "Predictive Failure Count" is >0, even if the drive status shows as "Online".
 Case R5. Storage Cells where the drive cell status is "Warning" and the MegaCli status is "Firmware State: (Unconfigured Bad)". The Cell's alerthistory may report the drive with a "not present" alert.
 Case R6. Storage Cells where the drive cell status is "Warning - Poor Performance", even though the MegaCli status is "Firmware State: Online" and no error counts appear.
 Errors for which Disk Replacement is NOT Recommended:
 Case N1. The Media Error counters reported by MegaCli in PdList or LdPdInfo outputs in a sundiag. On Storage Servers, these are also reported by Cellsrv in the physical disk view:
 Case N2. The Other Error counters reported by MegaCli in PdList or LdPdInfo outputs in a sundiag. On Storage Servers, these are also reported by Cellsrv in the physical disk view:
 Case N3. ASM logs on the DB node show I/O error messages in *.trc files similar to:
 Case N4. Oracle Enterprise Manager users of the Exadata plug-ins may see alerts marked "Critical" for all I/O errors.
 Case N5.  A disk with Firmware status "Unconfigured(good)".
 Case N6.  A Storage Server disk reported with an alert as "Status: Warning - Confined Offline" followed by a 2nd alert of "Status: Normal" 
 Conclusion:
References


Applies to:

Exadata Database Machine X2-2 Qtr Rack - Version All Versions and later
Exadata Database Machine X2-8 - Version All Versions and later
Exadata X3-8 Hardware - Version All Versions and later
Exadata X3-2 Quarter Rack - Version All Versions and later
Exadata X3-2 Half Rack - Version All Versions and later
Information in this document applies to any platform.

Purpose

This document explains which I/O errors require disk replacement, which do not, and which should be investigated further. I/O errors can be reported in different places for different reasons, and not every I/O error is due to a physical hard disk problem that requires replacement.

Troubleshooting Steps

About Disk Error Handling:

The inability to read some sectors is not always an indication that a drive is about to fail. Even if the physical disk is damaged at one location, such that a certain sector is unreadable, the disk may be able to use spare space to remap the bad area so that the sector can be overwritten. Physical hard disks are complex mechanical devices with spinning media, so media errors and other mechanical problems are a fact of life, which is why redundancy was designed into Exadata to protect data against such errors.

It is important to stay up to date on disk vendors' firmware, which resolves known issues with internal drive mechanical control, media usage and re-allocation algorithms; these are problems that can lead to premature failure if not attended to in a timely manner. The most recent Exadata patch image releases contain the latest disk firmware for each supported drive, as well as continuous improvements in ASM and Cellsrv management and handling of disk-related I/O errors and failures. Refer to Note 888828.1 for the latest patch releases. The Exadata Critical Issues note also documents specific instances of bugs which cause disk failures, as well as situations where a disk should be faulted but is not. Refer to Note 1270094.1 for the latest critical issues.

Physical hard disks used on Exadata support S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) and will report their own S.M.A.R.T. status to the RAID HBA in the event of a problem. S.M.A.R.T. events are based on vendor-defined thresholds for monitoring various internal disk mechanisms, and the thresholds differ for different types of errors. By definition S.M.A.R.T. has only 2 external states – predicted failure or not failed (OK). The S.M.A.R.T. status does not necessarily indicate the drive's past or present reliability. Exadata reports disk status in 2 places – on the cluster by ASM, and on the individual cell server by Cellsrv, which accesses disks through LSI's MegaCli RAID HBA utility.
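For example, the same disk can be checked from both places. The following is a minimal sketch (run as root on the Storage Cell and from an ASM instance on the cluster; output is abbreviated and may vary by Exadata software version):

# cellcli -e list physicaldisk attributes name, status, slotNumber
(run on the Storage Cell; "status" is the cell's view of each physical disk)

SQL> select name, mode_status, state from v$asm_disk;
(run in an ASM instance; shows the cluster's view of each grid disk presented to ASM)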

When a disk has failed and is not replaced within the timeout period, ASM will forcibly rebalance the data off the failed disk to restore redundancy. Performance may be reduced during a rebalance operation. The rebalance operation status should be checked to ensure that the rebalance has completed successfully, because if the rebalance fails, redundancy stays reduced until this is rectified or the disk is replaced. On-site spare disks are provided so that failed disks can be replaced rapidly by the customer, if they choose, ensuring that maximum performance and full redundancy are maintained before the timeout expires and forces a rebalance.
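A quick way to check whether a rebalance is still running is to query the ASM instance; a sketch (output abbreviated):

SQL> select group_number, operation, state, power, est_minutes from v$asm_operation;
(no rows returned means no rebalance or other ASM operation is currently in progress)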

In a normal redundancy configuration for the disk groups, one disk failure can be survived before ASM rebalance re-establishes data redundancy for the whole cluster; if a 2nd failure occurs before the ASM rebalance has completed, then the DB may lose data and crash. In a high redundancy configuration for the disk groups, two disk failures can be survived before ASM rebalance re-establishes redundancy for the whole cluster; if a 3rd failure occurs before then, the DB may lose data and crash. While the statistical chance of a 2nd disk failure is very low, the consequences are severe in normal redundancy mode. Redundancy configuration is a trade-off between higher availability for mission-critical and business-critical systems and higher disk group capacity available for data storage, and should be chosen according to individual customer need.
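The redundancy type currently configured for each disk group can be confirmed from an ASM instance, for example (output abbreviated):

SQL> select name, type from v$asm_diskgroup;
(TYPE shows EXTERN, NORMAL or HIGH for each disk group)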

/opt/oracle.SupportTools/sundiag.sh is a utility used to collect data for Exadata service requests, and in particular contains data specific to diagnosing disk failures. The version in the Exadata software image may not be the latest; for more details and the latest version, refer to Note 761868.1. Each of the examples below is from outputs collected by sundiag from systems (cell or DB node) suspected of having a disk problem.
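Basic usage is simply to run the script as root on the suspect server; a sketch (options and the output directory vary with the sundiag version):

# /opt/oracle.SupportTools/sundiag.sh
(the script reports the location of the compressed tar file it creates, typically under /tmp or /var/log/exadatatmp depending on version; attach that file to the SR)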

 

If there is ever a situation where 2 or more disks report critical failure within seconds of each other, in particular from more than 1 server at the same time, then a sundiag output should be collected from each server and a SR opened for further analysis.

 

Errors for which Disk Replacement is Recommended:

Case R1. Cell's alerthistory reports the drive has changed its S.M.A.R.T. status to "Predictive Failure":

20_1 2012-03-18T02:22:43+00:00 critical "Hard disk status changed to predictive failure. Status : PREDICTIVE FAILURE Manufacturer : SEAGATE Model Number : ST32000SSSUN2.0T Size : 2.0TB Serial Number : L1A2B3 Firmware : 0514 Slot Number : 11 Cell Disk : CD_11_exd1cel01 Grid Disk DATA_EXD1_CD_11_exd1cel01, RECO_EXD1_CD_11_exd1cel01, DBFS_DG_CD_11_exd1cel01"

This indicates the drive has determined via S.M.A.R.T. that it is predicting it will fail and a SR should be opened for a replacement as soon as is convenient, and a sundiag attached for data analysis purposes.

If the system is connected to Oracle via Automatic Service Request, then a SR will automatically be opened for this event.

Storage cell disks can be replaced by the customer using the onsite spare provided, if the customer chooses, or Oracle will send out an engineer with the disk. 
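Before the disk is pulled, the affected disk and the state of its grid disks can be confirmed from the cell; a sketch (output abbreviated, see Note 1390836.1 for the full replacement procedure):

# cellcli -e list physicaldisk where status='predictive failure' detail
# cellcli -e list griddisk attributes name, asmmodestatus, asmdeactivationoutcome
(asmDeactivationOutcome should report "Yes" for the affected grid disks before the disk is removed)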

 

Case R2. Cell's alerthistory reports the drive lun has experienced a critical error from which it cannot recover:

21 2012-03-24T10:45:41+08:00 warning "Logical drive status changed. Lun : 0_11 Status : critical Physical Hard disk : 20:11 Slot Number : 11 Serial Number : L1C4D5 Cell Disk : CD_11_edx1cel01 Grid Disks : RECO_EDX1_CD_11_edx1cel01, DBFS_DG_CD_11_edx1cel01, DATA_EDX1_CD_11_edx1cel01."

This indicates the drive has experienced a critical error during an I/O transaction, causing the RAID HBA to mark the volume as critical. This often occurs concurrently with Predictive Failure in Storage Cells, where each volume is a single-disk RAID0, but may occur by itself in the event of a problem writing. A SR should be opened for a replacement as soon as is convenient, and a sundiag attached for data analysis purposes.

If the system is connected to Oracle via Automatic Service Request, then a SR will automatically be opened for this event.

Storage cell disks can be replaced by the customer using the onsite spare provided, if the customer chooses, or Oracle will send out an engineer with the disk.

Internal Note: Review sundiags for all disk failures, and look for symptoms of the various issues with specific disks as noted in Note 1360343.1 (V2/X2-2/X2-8), Note 1501450.1 (X3-2/X4-2/X3-8/X4-8) and Notes 2010837.1 (Storage X5-2/X6-2/X5-8/X6-8) and 2010838.1 (DB X5-2/X6-2/X5-8/X6-8).  Where applicable, specific Critical Issues notes are linked there.  Some disk failure modes may require CPAS; refer to the Issues notes for details.

 

Case R3. DB nodes where the MegaCli status is shown as "Firmware state: (Unconfigured Bad)" or "Firmware state: Failed", preceded by logged errors indicating the drive was Failed or Predictive Failed:

=> cat exa1db01_megacli64-PdList_short_2012_03_30_01_23.out
...
Slot 03 Device 08 (HITACHI H103030SCSUN300GA2A81026A1B2C3 ) status is: Unconfigured(bad)

=> cat exa1db01_megacli64-status_2012_03_30_01_23.out
Checking RAID status on exa1db01.oracle.com
Controller a0: LSI MegaRAID SAS 9261-8i
No of Physical disks online : 3
Degraded : 0
Failed Disks : 1

=> cat exa1db01_megacli64-PdList_short_2014_10_15_01_23.out
...
Slot 00 Device 11 (SEAGATE ST90003SUN300G0B701050A1B2C3 ) status is: Failed

=> cat exa1db01_megacli64-status_2014_10_15_01_23.out
Checking RAID status on exa1db01.oracle.com
Controller a0: LSI MegaRAID SAS 9261-8i
No of Physical disks online : 3
Degraded : 1
Failed Disks : 1

The above command output files are gathered by a sundiag.

DB nodes have their disks configured into a RAID5 volume, with or without a hotspare depending on the image version. If the configuration does not have a hotspare, the volume status will show as "Degraded". If the configuration has a hotspare, then when a disk fails the hotspare will turn on and the RAID will be rebuilt, so the volume status will be "Degraded" temporarily and then return to "Optimal". This will be evident in the MegaCli logs, but may not be obvious to an operator without analysis. A failed disk can be verified by collecting a sundiag output, and a SR should be opened for analysis.
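The same states can also be checked live on the DB node with MegaCli; a sketch (use /opt/MegaRAID/MegaCli/MegaCli on Solaris DB nodes; output abbreviated):

# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | grep "^State"
(shows the RAID5 virtual drive state, e.g. Optimal or Degraded)
# /opt/MegaRAID/MegaCli/MegaCli64 -PdList -aALL | grep -E "Slot Number|Firmware state"
(shows the firmware state of each physical disk, including the hotspare if present)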

 

Case R4. DB nodes where the "Predictive Failure Count" is >0, even if the drive status shows as "Online".

# cat exa1db01_megacli64-PdList_long_2012_03_30_01_23.out
...
Slot Number: 2
...
Predictive Failure Count: 14
...

# cat exa1db01_megacli64-PdList_short_2012_03_30_01_23.out
...
Slot 02 Device 16 (HITACHI H103030SCSUN300GA2A81026A1B2C3 ) status is: Online,
...

The above command output files are gathered by a sundiag.
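To see whether any slot is accumulating predictive failure counts on a live system, the controller can also be queried directly; a sketch (use /opt/MegaRAID/MegaCli/MegaCli on Solaris DB nodes):

# /opt/MegaRAID/MegaCli/MegaCli64 -PdList -aALL | grep -E "Slot Number|Predictive Failure Count|Firmware state"
(any slot with a non-zero Predictive Failure Count should be treated as described in this case)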

In this case, the hotspare has not turned on due to an incorrect MegaRAID setting. To force the hotspare to turn on at the next failure, do the following:

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp -SMARTCpyBkEnbl -1 -a0
(use "/opt/MegaRAID/MegaCli" on Solaris DB nodes)

If replacing the drive before the next failure, then hot-plug remove it and wait for the controller to start copyback to the hotspare due to the missing disk.

 

Case R5. Storage Cells where the drive cell status is "Warning" and the MegaCli status is "Firmware State: (Unconfigured Bad)". The Cell's alerthistory may report the drive with a "not present" alert.

=> cat exa1cel01_alerthistory_2012_08_23_18_24.out
...
     31       2012-08-07T20:50:12-04:00     warning      "Logical drive status changed.  Lun                  : 0_2  Status               : not present  Physical Hard disk         : 20:2  Slot Number          : 2  Serial Number        : L5YH9W  Cell Disk            : CD_02_xsd1cel06  Grid Disks           : DBFS_DG_CD_02_xsd1cel06, FRAFILE_GRP1_CD_02_xsd1cel06, DBFILE_GRP1_CD_02_xsd1cel06."

=> cat exa1cel01_megacli64-status_2012_08_23_18_24.out
Checking RAID status on exa1cel01.oracle.com
Controller a0:  LSI MegaRAID SAS 9261-8i
No of Physical disks online : 11
Degraded : 0
Failed Disks : 1


=> cat exa1cel01_megacli64-PdList_short_2012_08_23_18_24.out
...
Slot 02 Device 17 (SEAGATE ST32000SSSUN2.0T061A1120L5YH9W  ) status is: Unconfigured(bad)
...



=> cat exa1cel01_physicaldisk-fail_2012_08_23_18_24.out
     20:2     L5YH9W     warning

This case may occur when the drive fails during a boot cycle, before the Cell's management services are running, so the Cell does not see it go offline, only that it is no longer present in the OS configuration. This will be evident in the MegaCli logs, but may not be obvious to an operator without analysis. A failed disk can be verified by collecting a sundiag output, and a SR should be opened for analysis.
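The mismatch between the cell view and the controller view can be confirmed on the live cell; a sketch (output abbreviated):

# cellcli -e list physicaldisk attributes name, status, slotNumber
# /opt/MegaRAID/MegaCli/MegaCli64 -PdList -aALL | grep -E "Slot Number|Firmware state"
(a disk showing "warning" in cellcli and "Unconfigured(bad)" in MegaCli matches this case)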

Storage cell disks can be replaced by the customer using the onsite spare provided, if the customer chooses, or Oracle will send out an engineer with the disk. In this case, follow the procedure for Storage Cell disks in Predictive Failure status.

 

Case R6. Storage Cells where the drive cell status is "Warning - Poor Performance", even though the MegaCli status is "Firmware State: Online" and no error counts appear.

=> cat exa1cel01_alerthistory_2013_01_21_10_17.out
...
2_1 2013-01-18T04:40:28-05:00 critical "Hard disk entered poor performance status. Status : WARNING - POOR PERFORMANCE Manufacturer : HITACHI Model Number : HUS1560SCSUN600G Size : 600G Serial Number : 1150KB123A Firmware : A6C0 Slot Number : 6 Cell Disk : CD_06_exa1cel01 Grid Disk : DBFS_DG_CD_06_exa1cel01, RECO_EXA1_CD_06_exa1cel01, DATA_EXA1_CD_06_exa1cel01 Reason for poor performance : threshold for service time exceeded"
...

The Storage Cell's Management Service monitors and periodically tests disk response time to I/O requests. If a disk is not running at the expected performance threshold compared to the other disks in the cell, it may degrade the response of the rest of the Storage Cell to ASM I/O requests. Such a disk is determined to be not performing sufficiently for Exadata, is marked in this state, and is removed from the ASM configuration. A SR should be opened for a replacement as soon as is convenient, and a sundiag attached for data analysis purposes.
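Disks currently flagged for poor performance can be listed on the cell; a sketch (output abbreviated):

# cellcli -e list physicaldisk attributes name, status, slotNumber | grep -i "poor performance"
# cellcli -e list griddisk attributes name, status, asmmodestatus
(the grid disks on the affected cell disk should show as no longer active to ASM)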

If the system is connected to Oracle via Automatic Service Request, then a SR will automatically be opened for this event.

Storage cell disks can be replaced by the customer using the onsite spare provided, if the customer chooses, or Oracle will send out an engineer with the disk. The replacement procedure should follow that of Predictive Failed disks, since the OS still will see the disk as being in a normal state.


 

Errors for which Disk Replacement is NOT Recommended:

Case N1. The Media Error counters reported by MegaCli in PdList or LdPdInfo outputs in a sundiag. On Storage Servers, these are also reported by Cellsrv in the physical disk view:

# cat exa1db01_megacli64-PdList_long_2012_03_30_01_23.out
...
Enclosure Device ID: 252
Slot Number: 3
Device Id: 8
Sequence Number: 4
Media Error Count: 109
Other Error Count: 0
Predictive Failure Count: 0
...


# cellcli -e list physicaldisk detail
...
name: 20:3
deviceId: 16
diskType: HardDisk
enclosureDeviceId: 20
errMediaCount: 402
errOtherCount: 2
...
slotNumber: 3
status: normal

The above command output files are gathered by a sundiag.

These are counters of how many times a single disk I/O transaction has experienced an error; they are not indicative of the health of a disk or its ability to keep operating. They are not S.M.A.R.T. thresholds and do not have any specific threshold that can be used to determine whether a disk needs replacement. On earlier Exadata image versions, some of these errors may have been generated during a Patrol Scrub operation, which verifies all the blocks on the disk, including those which have not yet been used by ASM. These may or may not cause a problem in the future, so they should be left until ASM does use them and can manage any data and errors on them.
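The counters for all disks in a cell can be summarised in one command; a sketch (output abbreviated):

# cellcli -e list physicaldisk attributes name, status, errMediaCount, errOtherCount
(only the "status" field, not the counter values, determines whether replacement is required)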

Errors counted here should be ignored until the disk's S.M.A.R.T. mechanism asserts a critical or predictive failure, at which point the RAID HBA will offline the disk, send an alert and change the "status" field accordingly.

If multiple disks in different cells are having errors counted, there is a possibility of multiple disks going to failure at the same time, as described above. Greater diligence should be taken to monitor for and replace each predicted or critical failure in a timely manner.

Case N2. The Other Error counters reported by MegaCli in PdList or LdPdInfo outputs in a sundiag. On Storage Servers, these are also reported by Cellsrv in the physical disk view:

# cat exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out
...
Enclosure Device ID: 252
Slot Number: 3
Device Id: 16
Sequence Number: 4
Media Error Count: 0
Other Error Count: 190

Predictive Failure Count: 0
...

# grep "Error Count" exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 184
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 62
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 220
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 211
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 183
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 19
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 184
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 342
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 225
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 146
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Media Error Count: 0
exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out:Other Error Count: 121

# cellcli -e list physicaldisk detail
...
name: 20:2
deviceId: 17
diskType: HardDisk
enclosureDeviceId: 20
errMediaCount: 62
errOtherCount: 220

...
slotNumber: 2
status: normal
...
name: 20:8
deviceId: 23
diskType: HardDisk
enclosureDeviceId: 20
errMediaCount: 0
errOtherCount: 342

...
slotNumber: 8
status: critical

...

The above command output files are gathered by a sundiag.

These are counters of how many times a single disk I/O transaction has experienced a SCSI transaction error, and they are most likely caused by a data path problem. On Exadata, this could be due to the RAID HBA, the SAS cables, the SAS expander, the disk backplane or the disk itself. On occasion they may also cause a disk to report as 'critical' due to a timeout responding to an I/O transaction or some other unexpected sense being returned; replacing the disk in this case most likely will not resolve the problem. In many cases these data path problems appear on multiple disks, which indicates that something other than the disk is faulted.
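To correlate the counters with slot numbers in the sundiag output, the slot lines can be included in the grep; a sketch (the file name is the example from this sundiag):

# grep -E "Slot Number|Media Error Count|Other Error Count" exa1cel01_megacli64-PdList_long_2012_03_30_01_23.out
(this groups the two counters under each slot so the affected disks can be identified)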

In the example shown, all the disks have had data path errors. The disk in slot 2 had some corrected read errors as a side-effect of the data path errors; these are not critical, so its status is normal and it does not match any of the criteria outlined above that require replacement. One of those data path errors triggered the disk in slot 8 to change to critical, although it has not shown any media errors. Replacing the slot 8 disk did not resolve this problem; data analysis of the full history of the errors and their types identified the problem component to be the SAS expander.

A sundiag output should be collected and a SR opened for further analysis to determine the source of the fault.

 

Case N3. ASM logs on the DB node show I/O error messages in *.trc files similar to:

ORA-27603: Cell storage I/O error, I/O failed on disk o/192.168.10.09/DATA_CD_01_exa1cel01 at offset 212417384 for data length 1048576
ORA-27626: Exadata error: 201 (Generic I/O error)
WARNING: Read Failed. group:1 disk:75 AU:52 offset:1048576 size:1048576
path:o/192.168.10.09/DATA_CD_01_exa1cel01
incarnation:0xe96e1227 asynchronous result:'I/O error'
subsys:OSS iop:0x2b7a8ff34160 bufp:0x2b7a90ff9e00 osderr:0xc9 osderr1:0x0
Exadata error:'Generic I/O error'

That may also be accompanied by ASM recovery messages such as this:

WARNING: failed to read mirror side 1 of virtual extent 1251 logical extent 0 of file 73 in group [1.1721532102] from disk DATA_EXA1_CD_01_EXA1CEL01  allocation unit 52 reason error; if possible, will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 1251 logical extent 1 of file 293 in group [1.1721532102 ] from disk DATA_EXA1_CD_03_EXA1CEL12 allocation unit 191

This is a single I/O Error on a read, which ASM has recovered and corrected. There are other similar ASM messages for different types of read errors.

This will likely also have a matching entry in the Storage Server's alert.log such as this:

IO Error on dev=/dev/sdb cdisk=CD_01_exa1cel01 [op=RD offset=132200642 (in sectors) sz=1048576 bytes] (errno: Input/output error [5])

This may also generate a Storage Server entry in /var/log/messages such as this:

Mar 30 15:37:08 td01cel06 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002
Mar 30 15:37:08 td01cel06 kernel: end_request: I/O error, dev sdb, sector 132200642

and will probably match an entry in the RAID HBA logs gathered by sundiag.

These single errors are recoverable using the built-in redundancy of ASM: ASM will initiate a re-write of the block that had the error using the mirrored copy, which allows the disk to re-allocate data around any bad blocks in the physical disk media. The disk should not be replaced until the failures are such that they trigger predictive failure or critical cell alerts.
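ASM also keeps per-disk read/write error counters that can be reviewed from an ASM instance; a sketch (output abbreviated):

SQL> select name, read_errs, write_errs from v$asm_disk where read_errs > 0 or write_errs > 0;
(non-zero counts here reflect errors ASM has already handled and are not by themselves a reason to replace the disk)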

 

Case N4. Oracle Enterprise Manager users of the Exadata plug-ins may see alerts marked "Critical" for all I/O errors.

From: EnterpriseManager <Exadata-OracleSupport @ oracle.com>
Date: Mar 30, 2012 1:16:35 PM PDT
To: <Exadata-OracleSupport @ oracle.com>
Subject: EM Alert: Critical:+ASM1_exa1db01.oracle.com - Disk DATA.DATA_CD_09_EXA1CEL02 has 3 Read/Write errors.

Target Name=+ASM1_exa1db01.oracle.com
Target type=Automatic Storage Management
Host=exa1db01.oracle.com
Occurred At=Mar 30, 2012 1:15:21 PM PDT
Message=Disk DATA.DATA_CD_09_EXA1CEL02 has 3 Read/Write errors.
Metric=Read Write Errors
Metric value=3
Instance ID=3
Disk Group Name=DATA
Disk Name=DATA_CD_09_EXA1CEL02
Severity=Critical
Acknowledged=No
Notification Rule Name=EXADATA ASM Alerts
Notification Rule Owner=SYSMAN

FileName
----------------
EM Alert

Since read errors are correctable and not truly critical, this may be a false report. A sundiag output should be collected and a SR opened for further analysis to determine whether the fault is critical and requires replacement.

A request for enhancement has been filed for EM to separate non-critical and critical write errors (ER 13739260).

 

Case N5.  A disk with Firmware status "Unconfigured(good)".

This is an indication the disk is good but not configured into a RAID volume. This is not an expected status in Exadata, except during transition periods after a replacement or if something did not work correctly during a replacement.

In Storage Servers in particular, this may be an indication that the disk was replaced but the Management Service (MS) daemon did not function properly and did not create the RAID volume and subsequent cell and grid disks. Refer to Note 1281395.1 and Note 1312266.1 for more details.
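Whether the cell recreated the objects after a replacement can be checked directly; a sketch (output abbreviated):

# cellcli -e list lun detail
# cellcli -e list celldisk
# cellcli -e list griddisk
(a replaced disk should reappear with a LUN, a cell disk and its grid disks; if any are missing, follow Note 1281395.1)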
 

Case N6.  A Storage Server disk reported with an alert as "Status: Warning - Confined Offline" followed by a 2nd alert of "Status: Normal" 

43_1  2013-07-11T07:32:41+01:00  warning   "Hard disk entered confinement offline status. The LUN x_x changed status to warning - confinedOffline. CellDisk changed status to normal - confinedOffline. All subsequent I/Os on this disk are failed immediately. Confinement tests will be run on the disk to determine if the disk should be dropped.Status                      : WARNING - CONFINEDOFFLINE  Manufacturer                : SEAGATE  Model Number                : ST360057SSUN600G  Size                        : 600G  Serial Number               : 1110E0D2FZ  Firmware                    : 0B25  Slot Number                 : X  Cell Disk                   : CD_cel06  Grid Disk                   : DATA_cel06, DBFS_cel06, RECO_cel06  Reason for confinement      : threshold for service time exceeded"

43_2  2013-07-11T07:36:17+01:00  clear     "Hard disk status changed to normal.  Status        : NORMAL  Manufacturer  : SEAGATE  Model Number  : ST360057SSUN600G  Size          : 600GB  Serial Number : 1110E0D2FZ  Firmware      : 0B25  Slot Number   : X CD_cel06  Grid Disk                   : DATA_cel06, DBFS_cel06, RECO_cel06

This is an indication the disk was slow in responding to an I/O transaction and has been taken offline for further testing. In this example, the further testing did not reveal a continuing performance problem, so the disk was returned to the configuration as good. For more details on the Disk Confinement feature, refer to Note 1509105.1.
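Past confinement events can be reviewed from the cell's alert history; a sketch (output abbreviated):

# cellcli -e list alerthistory | grep -i confine
(multiple disks appearing here around the same time points at a shared component rather than one disk)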

Often this event is seen due to problems with another disk, or with multiple disks being confined, indicating a possible problem with the disk cables or controller rather than the specific disk mentioned. In that case, a sundiag output should be collected and a SR opened for further analysis to determine the source of the fault.

Conclusion:

Any other disk or I/O errors for which a disk may be suspect, such as device not present, device missing or timeouts, should have a sundiag output collected and a SR opened for further analysis to determine whether the fault is critical, such as causing a hang or performance issue. If there are any doubts or concerns about the disk or I/O errors listed above, a sundiag output should be collected and a SR opened for further analysis to determine whether action is necessary.

 

References

<NOTE:1501450.1> - INTERNAL Exadata Database Machine Hardware Current Product Issues (X3-2, X4-2, X3-8, X4-8 w/X4-2L)
<NOTE:888828.1> - Exadata Database Machine and Exadata Storage Server Supported Versions
<NOTE:2010837.1> - INTERNAL Exadata Database Machine Hardware Current Product Issues - Storage Cells (X5 and Later)
<NOTE:1281395.1> - Steps to manually create cell/grid disks on Exadata if auto-create fails during disk replacement
<NOTE:1312266.1> - Exadata: After disk replacement, celldisk and Griddisk is not created automatically
<NOTE:1390836.1> - How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure)
<NOTE:1479736.1> - How to Replace an Exadata Compute (Database) node hard disk drive (Predictive or Hard Failure) (X4-2 and earlier)
<NOTE:761868.1> - Oracle Exadata Diagnostic Information required for Disk Failures and some other Hardware issues
<NOTE:1360343.1> - INTERNAL Exadata Database Machine Hardware Current Product Issues (V2, X2-2, X2-8)
<NOTE:1386147.1> - How to Replace a Hard Drive in an Exadata Storage Server (Hard Failure)
<NOTE:1270094.1> - Exadata Critical Issues
<NOTE:2010838.1> - INTERNAL Exadata Database Machine Hardware Current Product Issues - DB Nodes (X5 and Later)
<NOTE:1509105.1> - Identification of Underperforming Disks feature in Exadata

Attachments
This solution has no attachment