Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-77-2242320.1
Update Date: 2018-01-11
Keywords:

Solution Type: Sun Alert

Solution 2242320.1: (EX37) X6 storage server flash disk predictive failure may lead to corruption in primary and/or secondary ASM mirror copies due to flash firmware issue


Related Items
  • Exadata X6-8 Hardware
  • Oracle Exadata Storage Server Software
  • Exadata X6-2 Hardware
  • Oracle SuperCluster Specific Software
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST




In this Document
Description
Occurrence
 Pre-requisite Conditions for Bug 25595250
 Actions to Take If a Flash Disk Predictive Failure Has Already Occurred
Symptoms
Workaround
Patches
History
References


Applies to:

Oracle Exadata Storage Server Software - Version 12.1.2.3.1 to 12.1.2.3.3 [Release 12.1]
Oracle SuperCluster Specific Software
Exadata X6-2 Hardware
Exadata X6-8 Hardware
Information in this document applies to any platform.

Description

Due to bug 25595250, a flash disk predictive failure on an Exadata X6 storage server may lead to corruption in primary and/or secondary ASM mirror copies, and may propagate to other storage servers during certain ASM rebalance operations.

Occurrence

Pre-requisite Conditions for Bug 25595250

The following conditions must exist for this issue to occur:

  1. Storage server must be X6 hardware (EF or HC).  Earlier hardware generations are not affected.
  2. Exadata software version on the storage servers is lower than 12.1.2.3.4.
  3. Write-back flash cache is configured (i.e. flashCacheMode=writeback). A verification sketch for conditions 2 and 3 follows this list.
  4. A flash disk predictive failure occurs.  Other types of flash disk failure do not trigger this issue.
  5. After the flash disk predictive failure, database block corruption may be encountered by the database and reported in the database alert log. See the Symptoms section below.
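
Conditions 2 and 3 can be verified directly on each storage server. A minimal sketch, assuming root access to the cell (run on each cell, or via dcli with a site-specific cell group file):

[root@cell01 ~]# imageinfo -ver                                    # active Exadata storage server software version
[root@cell01 ~]# cellcli -e "list cell attributes flashCacheMode"  # writeback = write-back flash cache configured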

 

Actions to Take If a Flash Disk Predictive Failure Has Already Occurred

If a flash disk predictive failure has already occurred, then perform the following steps:

Step 1: Avoid actions that initiate ASM rebalance until the scope of the corruption is determined.

Avoid running SQL statements for ASM that will initiate a disk group rebalance that can propagate the corruption.  Some examples include ALTER DISKGROUP REBALANCE, ALTER DISKGROUP DROP DISK (without FORCE option), and ALTER DISKGROUP DROP DISKS IN FAILGROUP (without FORCE option).  Refer to the Automatic Storage Management Administrator's Guide for details of which operations initiate rebalance.

 

Step 2: Identify the time of the flash disk predictive failure.

This information can be obtained from the storage server alert history.  For example:

CellCLI> list alerthistory
4_1 2017-03-18T04:12:38+01:00 critical "Flash disk entered predictive failure status. Status : WARNING - PREDICTIVE FAILURE Manufacturer : Oracle Model Number : Flash Accelerator F320 PCIe Card Size : 2981GB Serial Number : XXXXXXXXXXXXXX Firmware : KPYABR3Q Slot Number : PCI Slot: 5; FDOM: 1 Cell Disk : FD_03_dm01cel01 Grid Disk : Not configured Flash Cache : Present Flash Log : Present"

This information will be used in a later step to compare against the most recent ASM rebalance for each affected disk group.

This issue applies only to flash drive predictive failure.  A critical alert that reports "Flash disk failed. Status : FAILED" indicates a real flash drive failure, not a predictive failure.
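
To locate flash disk predictive failure alerts specifically, the alert history can be filtered on the alert message. A minimal sketch (the attribute list and regular expression are illustrative):

[root@cell01 ~]# cellcli -e "list alerthistory attributes name, beginTime, alertMessage where alertMessage like '.*predictive failure.*'"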

 

Step 3: Identify the list of grid disks that were cached by the failed flash drive at the time of the failure.

This information can be obtained from the storage server alert.log file by using the following command:

[root@cell01 ~]# awk '
# split fields on "(", ")", "=" and space
BEGIN { FS="[)(= ]"; startFound=0 }
# note that a cellsrv startup marker has been seen
/Cellsrv Incarnation is set/ { startFound=1 }
# record the grid disk -> flash cache part mapping ($3 = grid disk name, $NF = part id)
/cached by FlashCache/ { griddisk[$3]=$NF }
# when a flash cache part is disabled, print the grid disks it cached
/Disabling the FlashCache part/ {
  if (startFound==0) { print "No cellsrv startup marker - cached by mapping undetermined"; exit }
  fcpart=$6; fcname=$11
  for (gd in griddisk)
    if (griddisk[gd]==fcpart)
      printf "Grid disk %s cached by disabled flash %s (%s)\n",gd,fcname,fcpart
}
' $CELLTRACE/alert.log | sort

This command will produce output similar to the following:

Grid disk DATA_CD_03_cell01 cached by disabled flash FD_03_cell01 (2158794172)
Grid disk DATA_CD_04_cell01 cached by disabled flash FD_03_cell01 (2158794172)
Grid disk DBFS_CD_09_cell01 cached by disabled flash FD_03_cell01 (2158794172)

If the output is "No cellsrv startup marker - cached by mapping undetermined" then contact Oracle Support.

If no output is produced, confirm that the alert indicates flash drive predictive failure.  This issue applies only to flash drive predictive failure.

 

If this information must be gathered manually: when a flash drive fails, a message similar to the following is reported in the storage server alert.log:

CDHS: Received cd health state change with newState HEALTH_BAD_DROP_NORMAL guid 3abeec14-3c66-4413-9fe5-f0a1ca2dbcf7
CDHS: Do cd health state change FD_03_cell01 from HEALTH_GOOD to newState HEALTH_BAD_DROP_NORMAL
Disabling the FlashCache part (2158794172) located on CellDisk FD_03_cell01 because it is in status HEALTH_BAD_DROP_NORMAL
FlashLog FL-FD_03_cell01 guid=4bb7631d-9130-47ef-9dcc-2e2e9c375648 (1238914796) cdisk=FD_03_cell01 is inactive due to inactive flash disk
Done cd health state change FD_03_cell01 from HEALTH_GOOD to newState HEALTH_BAD_DROP_NORMAL

Using the FlashCache part identifier reported in the message (e.g. 2158794172), search backward in alert.log, starting from the location where the error was reported, for the string "cached by FlashCache: 2158794172". This will report the grid disks cached by the failed flash drive.  The "cached by FlashCache" messages are reported when the cellsrv process starts and when there is a change to the cached-by mapping; hence you must search backward to the last cellsrv restart prior to the flash drive failure.

GridDisk name=DATA_CD_03_cell01 guid=e67f7860-b1dc-49cb-bf4a-f8593496d9d4 (1741906740) status=GDISK_ACTIVE cached by FlashCache: 2158794172
...
GridDisk name=DATA_CD_04_cell01 guid=4dbbf6db-8c9e-42e5-aa67-8e98a5d0fbcc ( 776922516) status=GDISK_ACTIVE cached by FlashCache: 2158794172
...
GridDisk name=DBFS_CD_09_cell01 guid=28c42b5c-421b-4196-beb6-202c9c9fe209 (2923504996) status=GDISK_ACTIVE cached by FlashCache: 2158794172

In this example, the flash drive in question was used to cache grid disks DATA_CD_03_cell01, DATA_CD_04_cell01, and DBFS_CD_09_cell01.
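
The same backward search can be assisted with grep, using the FlashCache part identifier from the example above; review the "cached by FlashCache" matches that fall between the last cellsrv startup prior to the failure and the failure itself. A minimal sketch:

[root@cell01 ~]# grep -n "Cellsrv Incarnation is set" $CELLTRACE/alert.log        # cellsrv startup markers with line numbers
[root@cell01 ~]# grep -n "cached by FlashCache: 2158794172" $CELLTRACE/alert.log  # cached-by messages for the failed flash part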

It is extremely important to get the complete list of grid disks cached by the failed flash drive at the time of the failure.  The number of grid disks cached by a flash disk depends on the number of grid disks defined on the storage server.

Note: You cannot obtain the above information with CellCLI using grid disk attribute cachedBy because this information may have changed as a result of the flash drive failure.  You must obtain the list of grid disks that were cached by the flash drive at the time of the failure, which can only be done by reviewing the storage server alert.log.

This information is used in the next step to identify affected ASM disk groups.

 

Step 4: Identify the list of affected disk groups.

Using the list of grid disks obtained from the previous step, connect to any ASM instance to identify the affected disk groups using a query similar to the following:

SQL> select distinct(dg.name)
     from v$asm_diskgroup dg, v$asm_disk dsk
     where upper(dsk.name)
       in (upper('DATA_CD_03_cell01'),
           upper('DATA_CD_04_cell01'),
           upper('DBFS_CD_09_cell01'))
       and dg.group_number = dsk.group_number;

NAME
------------------------------
DATA
DBFS

In this example, the affected disk groups are DATA and DBFS.  Note that RECO is not a disk group that is cached by flash cache unless changes have been made to the default best practice configuration.

 

Step 5: Identify the start time of the most recent ASM rebalance operation for the affected disk groups.

For each of the affected disk groups, obtain the start time of the most recent ASM rebalance operation from all ASM alert.log files.  Since an ASM rebalance operation may run from any ASM instance, the alert.log for all ASM instances must be reviewed.

The following bash shell commands can be run on each node from the trace directory where the ASM alert log exists.

Replace 'DATA' and 'DBFS' from the example with your affected disk groups identified in the previous step.

[oracle@node1 ~]$ cd /u01/app/grid/diag/asm/+asm/+ASM*/trace

[oracle@node1 trace]$ \
for dg in DATA DBFS; do
   if [ ! -r alert_+ASM*.log ]; then
      echo 'Cannot find alert_+ASM*.log. Run commands from directory where ASM alert log exists.'
      break
   fi
   # line number of the most recent rebalance start for this disk group
   linenum=$(grep -n "NOTE: starting rebalance of group.*($dg)" alert_+ASM*.log | tail -1 | awk -F: '{print $1}')
   if [ -z "$linenum" ]; then
      echo "No rebalance found in alert log for $dg"
   else
      # print the timestamp line and the rebalance message together on one line
      tail -n+$(($linenum-1)) alert_+ASM*.log | head -n 2 | paste -d ' ' - -
   fi
done

Example output from node1:

Mon Mar 13 11:54:47 2017 NOTE: starting rebalance of group 1/0x29a4ccd (DATA) at power 4
Mon Mar 09 11:38:35 2017 NOTE: starting rebalance of group 3/0x2ac5f92 (DBFS) at power 4

Example output from node2:

Mon Mar 12 10:54:35 2017 NOTE: starting rebalance of group 1/0x2985f51 (DATA) at power 4
Mon Mar 13 11:53:35 2017 NOTE: starting rebalance of group 3/0x2aa4cce (DBFS) at power 4

In this example:

    • The most recent ASM rebalance for DATA disk group occurred on the first node at Mon Mar 13 11:54:47 2017.
    • The most recent ASM rebalance for DBFS disk group occurred on the second node at Mon Mar 13 11:53:35 2017.

 

Step 6: Take repair actions for each affected disk group.

For each affected disk group, perform one of the following repair actions. If there are two affected disk groups, as in the examples above, then two repair actions are required, one for each disk group (e.g. DATA and DBFS).

Compare the time of the flash disk predictive failure obtained in step 2 with the time of the most recent ASM rebalance for the disk group obtained in step 5.

  • Repair scenario 1 - ASM rebalance occurred before flash disk predictive failure or there has been no rebalance
  • Repair scenario 2 - ASM rebalance occurred after flash disk predictive failure

Repair scenario 1 - ASM rebalance occurred before flash disk predictive failure or there has been no rebalance

If the most recent ASM rebalance occurred before the flash disk predictive failure, then repair the corruption with ASM disk scrubbing for individual grid disks.

Note: The Grid Infrastructure home requires fixes for bug 22446455, bug 25417056, and bug 25733479 to use ASM disk scrubbing.  Without these fixes, ASM disk scrubbing will not repair the corruption properly or efficiently.
Note: ASM 12.2 supports asynchronous scrubbing to increase scrub performance.  See 12.2 Oracle Automatic Storage Management Administrator's Guide for details of disk group attribute scrub_async_limit.
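
One way to confirm that the Grid Infrastructure home contains the fixes listed in the note above is to check the bugs fixed by its installed patches. A minimal sketch; the Grid Infrastructure home path is an example and must be adjusted for the environment:

[oracle@node1 ~]$ /u01/app/12.1.0.2/grid/OPatch/opatch lsinventory -bugs_fixed | grep -E "22446455|25417056|25733479"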

Using the list of grid disks obtained from step 3, run scrub repair on each disk within the disk group being repaired.  For example:

SQL> alter diskgroup DATA scrub disk DATA_CD_03_cell01 repair wait;
SQL> alter diskgroup DATA scrub disk DATA_CD_04_cell01 repair wait;

 

An alternate action for this scenario is to force drop the grid disks cached by the failed flash drive instead of using ASM disk scrubbing.  Note the use of FORCE for each disk.

Note: It is essential to use the FORCE keyword. Running DROP DISK without FORCE will cause the corruption to propagate to other storage servers.
SQL> alter diskgroup DATA drop disk DATA_CD_03_cell01 FORCE, DATA_CD_04_cell01 FORCE rebalance power 64;
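
The rebalance started by the force drop can be monitored from the ASM instance before proceeding. A minimal sketch, assuming the environment is set for the local ASM instance; no rows returned means no rebalance is running:

[oracle@node1 ~]$ sqlplus -s / as sysasm <<'EOF'
select group_number, operation, state, power, est_minutes
  from v$asm_operation;
EOF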

 

Repair scenario 2 - ASM rebalance occurred after flash disk predictive failure

If the most recent ASM rebalance occurred after the flash disk predictive failure, then it is highly likely the corruption spread to other storage servers during the rebalance.  Repair the corruption with ASM disk scrubbing for the disk group.

Note: The Grid Infrastructure home requires fixes for bug 22446455, bug 25417056, and bug 25733479 to use ASM disk scrubbing.  Without these fixes, ASM disk scrubbing will not repair the corruption properly or efficiently.
Note: ASM 12.2 supports asynchronous scrubbing to increase scrub performance. See 12.2 Oracle Automatic Storage Management Administrator's Guide for details of disk group attribute scrub_async_limit.

Run scrub repair on the disk group.  For example:

SQL> alter diskgroup DATA scrub repair wait; 

  

Step 7: Replace the failed flash drive

Have a trained Oracle field technician replace the failed flash drive using the documented procedure in <Document 1993842.1> "How to Replace an Exadata X5-2/X6-2 Storage Server Flash F160/F320 Card."  Note that, per that document, dropping the ASM disks belonging to the storage server that contains the failed flash drive is not a proper action to perform during the flash drive replacement procedure.

If grid disks were force dropped in the previous step, then after the failed flash drive is replaced, verify that the dropped disks are automatically added back into the disk group.
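
A minimal sketch to verify this from any ASM instance, using the example grid disks from Step 3 (ASM stores disk names in upper case); once the disks are back and the rebalance completes, MOUNT_STATUS should show CACHED and MODE_STATUS should show ONLINE:

[oracle@node1 ~]$ sqlplus -s / as sysasm <<'EOF'
select name, mount_status, mode_status, state
  from v$asm_disk
 where name in ('DATA_CD_03_CELL01','DATA_CD_04_CELL01','DBFS_CD_09_CELL01');
EOF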

 

Step 8: Update storage servers according to the Patches section to prevent this issue from occurring again.

 

Symptoms

Symptoms may include the following:

Note: It is possible for corruption to have occurred as a result of the flash disk predictive failure even though none of the following symptoms is present.  Also, this issue may affect both primary and secondary extents. RMAN validate checks only primary extents; therefore, RMAN cannot be used to reliably detect corruption caused by this issue.

 

  1. After a flash drive predictive failure, database block corruption may be reported for data contained in the storage server with the failed flash drive.  The following is an example of what the database alert.log may contain:
    Corrupt block relative dba: 0x01ebfd41 (file 9, block 32243009)
    Fractured block found during buffer read
    Data in bad block:
     type: 6 format: 2 rdba: 0x01ebfd41
     last change scn: 0x09c3.361fc6a0 seq: 0x1 flg: 0x04
     spare1: 0x0 spare2: 0x0 spare3: 0x0
     consistency value in tail: 0x8e237e58
     check value in block header: 0x15f5
     computed block checksum: 0x32a6
    Reading datafile '+DATA/dbm/datafile/dbfile.123.123456789' for corruption at rdba: 0x01ebfd41 (file 9, block 32243009)
    Read datafile mirror 'DATA_CD_04_DBM01CELADM01' (file 9, block 32243009) found same corrupt data (no logical check)
    Read datafile mirror 'DATA_CD_08_DBM01CELADM01' (file 9, block 32243009) found valid data
    Hex dump of (file 9, block 32243009) in trace file /u01/app/oracle/diag/rdbms/dbm/DBM01/trace/DBM01_ora_12345.trc
    Repaired corruption at (file 9, block 32243009)
     
  2. After a flash drive predictive failure, a message that contains "Diagnostics: Block mismatch found for data read from flash/disk" may be reported in the storage server alert.log (a search sketch for both symptoms follows these examples). For example:
    [57] Diagnostics: Block mismatch found for data read from flash/disk: tsn: 2147483647 rdba: 30031054. trace file: /opt/oracle/cell/log/diag/asm/cell/cell01/trace/svtrc_40104_57.trc
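
Both symptoms can be scanned for with grep. A minimal sketch; the database alert.log path follows the example in symptom 1 and must be adjusted for the environment, and the second command is run on each storage server:

[oracle@dbnode1 ~]$ grep -n "Fractured block found during buffer read" /u01/app/oracle/diag/rdbms/dbm/DBM01/trace/alert_DBM01.log
[root@cell01 ~]# grep -n "Block mismatch found for data read from flash/disk" $CELLTRACE/alert.log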

     

 

Workaround

None

Patches

To prevent this issue, update flash firmware by updating Exadata X6 storage servers to Exadata 12.1.2.3.4 or higher.

Important: Updating the Exadata version and flash firmware only prevents this issue from occurring again in the future.  The update will neither detect nor repair existing corruption caused previously by this issue.  The steps in the Occurrence section of this document must be followed to address existing corruption.
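
After the update, the active image version and flash disk firmware can be confirmed on each storage server. A minimal sketch (exact version and firmware strings will vary):

[root@cell01 ~]# imageinfo -ver
[root@cell01 ~]# cellcli -e "list physicaldisk attributes name, physicalFirmware, status where diskType='FlashDisk'"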

 

History

11-Jan-2018 - Add reference to ASM 12.2 scrub_async_limit which can increase scrub performance
19-May-2017 - Alter steps because corruption messages may not be reported in all circumstances
11-Apr-2017 - Published


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.