2242320.1 - (EX37) X6 storage server flash disk predictive failure may lead to corruption in primary and/or secondary ASM mirror copies due to flash firmware issue
Solution Type: Sun Alert Sure Solution
Applies to:
Oracle Exadata Storage Server Software - Version 12.1.2.3.1 to 12.1.2.3.3 [Release 12.1]
Oracle SuperCluster Specific Software
Exadata X6-2 Hardware
Exadata X6-8 Hardware
Information in this document applies to any platform.

Description
Due to bug 25595250, a flash disk predictive failure on an Exadata X6 storage server may lead to corruption in primary and/or secondary ASM mirror copies, and may propagate to other storage servers during certain ASM rebalance operations.

Occurrence
Pre-requisite Conditions for Bug 25595250
The following conditions must exist for this issue to occur:
Actions to Take If a Flash Disk Predictive Failure Has Already Occurred
If a flash drive predictive failure has already occurred, then perform the following steps:

Step 1: Avoid actions that initiate ASM rebalance until the scope of the corruption is determined.
Avoid running SQL statements for ASM that initiate a disk group rebalance, because a rebalance can propagate the corruption. Examples include ALTER DISKGROUP REBALANCE, ALTER DISKGROUP DROP DISK (without the FORCE option), and ALTER DISKGROUP DROP DISKS IN FAILGROUP (without the FORCE option). Refer to the Automatic Storage Management Administrator's Guide for details of which operations initiate a rebalance.
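Before proceeding, it can be useful to confirm from any ASM instance that no rebalance is currently in progress. The following query is a minimal check using the standard V$ASM_OPERATION view; if no rows are returned, no rebalance or other long-running ASM operation is active:

SQL> -- no rows returned means no rebalance (or other ASM operation) is in progress
SQL> select group_number, operation, state, power from v$asm_operation;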
Step 2: Identify the time of the flash disk predictive failure.
This information can be obtained from the storage server alert history. For example:

CellCLI> list alerthistory
4_1   2017-03-18T04:12:38+01:00   critical   "Flash disk entered predictive failure status.
      Status                      : WARNING - PREDICTIVE FAILURE
      Manufacturer                : Oracle
      Model Number                : Flash Accelerator F320 PCIe Card
      Size                        : 2981GB
      Serial Number               : XXXXXXXXXXXXXX
      Firmware                    : KPYABR3Q
      Slot Number                 : PCI Slot: 5; FDOM: 1
      Cell Disk                   : FD_03_dm01cel01
      Grid Disk                   : Not configured
      Flash Cache                 : Present
      Flash Log                   : Present"

This information will be used in a later step to compare against the most recent ASM rebalance for each affected disk group. This issue applies only to flash drive predictive failure. A critical alert that reports "Flash disk failed. Status : FAILED" indicates an actual flash drive failure, not a predictive failure.
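If the alert history contains many entries, the listing can be narrowed to the relevant alerts. The following filter is a sketch and assumes the default alert message text shown above; adjust the pattern if needed:

CellCLI> list alerthistory where alertMessage like 'Flash disk entered predictive failure.*' detail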
Step 3: Identify the list of grid disks that were cached by the failed flash drive at the time of the failure.
This information can be obtained from the storage server alert.log file by using the following command:

[root@cell01 ~]# awk '
BEGIN { FS="[)(= ]"; startFound=0 }
/Cellsrv Incarnation is set/ { startFound=1 }
/cached by FlashCache/ { griddisk[$3]=$NF }
/Disabling the FlashCache part/ {
  if (startFound==0) { print "No cellsrv startup marker - cached by mapping undetermined"; exit }
  fcpart=$6; fcname=$11
  for (gd in griddisk)
    if (griddisk[gd]==fcpart)
      printf "Grid disk %s cached by disabled flash %s (%s)\n", gd, fcname, fcpart
}' $CELLTRACE/alert.log | sort

This command will produce output similar to the following:

Grid disk DATA_CD_03_cell01 cached by disabled flash FD_03_cell01 (2158794172)
Grid disk DATA_CD_04_cell01 cached by disabled flash FD_03_cell01 (2158794172)
Grid disk DBFS_CD_09_cell01 cached by disabled flash FD_03_cell01 (2158794172)

If the output is "No cellsrv startup marker - cached by mapping undetermined", then contact Oracle Support. If no output is produced, confirm that the alert indicates flash drive predictive failure. This issue applies only to flash drive predictive failure.
If this information must be gathered manually:

When a flash drive fails, a message similar to the following is reported in the storage server alert.log:

CDHS: Received cd health state change with newState HEALTH_BAD_DROP_NORMAL guid 3abeec14-3c66-4413-9fe5-f0a1ca2dbcf7
CDHS: Do cd health state change FD_03_cell01 from HEALTH_GOOD to newState HEALTH_BAD_DROP_NORMAL
Disabling the FlashCache part (2158794172) located on CellDisk FD_03_cell01 because it is in status HEALTH_BAD_DROP_NORMAL
FlashLog FL-FD_03_cell01 guid=4bb7631d-9130-47ef-9dcc-2e2e9c375648 (1238914796) cdisk=FD_03_cell01 is inactive due to inactive flash disk
Done cd health state change FD_03_cell01 from HEALTH_GOOD to newState HEALTH_BAD_DROP_NORMAL

Using the FlashCache part identifier reported in the message (e.g. 2158794172), search backward in alert.log, starting from the location where the error was reported, for the string "cached by FlashCache: 2158794172". This will report the grid disks cached by the failed flash drive. The "cached by FlashCache" messages are reported when the cellsrv process starts and when there is a change to the cached-by mapping; hence you must search backward to the last cellsrv restart prior to the flash drive failure.

GridDisk name=DATA_CD_09_cell01 guid=e67f7860-b1dc-49cb-bf4a-f8593496d9d4 (1741906740) status=GDISK_ACTIVE cached by FlashCache: 2158794172
...
GridDisk name=DATA_CD_04_cell01 guid=4dbbf6db-8c9e-42e5-aa67-8e98a5d0fbcc ( 776922516) status=GDISK_ACTIVE cached by FlashCache: 2158794172
...
GridDisk name=DBFS_CD_03_cell01 guid=28c42b5c-421b-4196-beb6-202c9c9fe209 (2923504996) status=GDISK_ACTIVE cached by FlashCache: 2158794172

In this example, the flash drive in question was used to cache grid disks DATA_CD_03_cell01, DATA_CD_04_cell01, and DBFS_CD_09_cell01. It is extremely important to get the complete list of grid disks cached by the failed flash drive at the time of the failure. The number of grid disks cached by a flash disk depends on the number of grid disks defined on the storage server.

Note: You cannot obtain this information with CellCLI using the grid disk attribute cachedBy, because the mapping may have changed as a result of the flash drive failure. You must obtain the list of grid disks that were cached by the flash drive at the time of the failure, which can only be done by reviewing the storage server alert.log.
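If you prefer to perform that backward search from the shell rather than paging through the file manually, a simple approach is sketched below. Replace 2158794172 with the FlashCache part identifier from your own alert.log; the last group of matches before the failure time reflects the cached-by mapping in effect when the flash drive failed:

[root@cell01 ~]# grep -n "cached by FlashCache: 2158794172" $CELLTRACE/alert.log | tail -20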
This information is used in the next step to identify affected ASM disk groups.
Step 4: Identify the list of affected disk groups.
Using the list of grid disks obtained from the previous step, connect to any ASM instance and identify the affected disk groups using a query similar to the following:

SQL> select distinct(dg.name)
       from v$asm_diskgroup dg, v$asm_disk dsk
      where upper(dsk.name) in (upper('DATA_CD_03_cell01'), upper('DATA_CD_04_cell01'), upper('DBFS_CD_09_cell01'))
        and dg.group_number = dsk.group_number;

NAME
------------------------------
DATA
DBFS

In this example, the affected disk groups are DATA and DBFS. Note that RECO is not a disk group that is cached by flash cache unless changes have been made to the default best practice configuration.
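To confirm which grid disks on a storage server are eligible for flash caching (for example, to verify that RECO grid disks are excluded per the default best practice), the caching policy can be listed on the cell. This is a sketch and assumes the cachingPolicy grid disk attribute available in current Exadata releases:

CellCLI> list griddisk attributes name, cachingPolicy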
Step 5: Identify the start time of the most recent ASM rebalance operation for the affected disk groups.
For each of the affected disk groups, obtain the start time of the most recent ASM rebalance operation from all ASM alert.log files. Since an ASM rebalance operation may run from any ASM instance, the alert.log for all ASM instances must be reviewed. The following bash shell commands can be run on each node from the trace directory where the ASM alert log exists. Replace 'DATA' and 'DBFS' in the example with the affected disk groups identified in the previous step.

[oracle@node1 ~]$ cd /u01/app/grid/diag/asm/+asm/+ASM*/trace
[oracle@node1 trace]$ for dg in DATA DBFS; do
  if [ ! -r alert_+ASM*.log ]; then
    echo 'Cannot find alert_+ASM*.log. Run commands from directory where ASM alert log exists.'
    break
  fi
  linenum=$(grep -n "NOTE: starting rebalance of group.*($dg)" alert_+ASM*.log | tail -1 | awk -F: '{print $1}')
  if [ -z "$linenum" ]; then
    echo "No rebalance found in alert log for $dg"
  else
    tail -n+$(($linenum-1)) alert_+ASM*.log | head -n 2 | paste -d ' ' - -
  fi
done

Example output from node1:

Mon Mar 13 11:54:47 2017 NOTE: starting rebalance of group 1/0x29a4ccd (DATA) at power 4
Mon Mar 09 11:38:35 2017 NOTE: starting rebalance of group 3/0x2ac5f92 (DBFS) at power 4

Example output from node2:

Mon Mar 12 10:54:35 2017 NOTE: starting rebalance of group 1/0x2985f51 (DATA) at power 4
Mon Mar 13 11:53:35 2017 NOTE: starting rebalance of group 3/0x2aa4cce (DBFS) at power 4

In this example, the most recent rebalance of DATA started at Mon Mar 13 11:54:47 2017 (node1) and the most recent rebalance of DBFS started at Mon Mar 13 11:53:35 2017 (node2). For each disk group, use the latest rebalance start time reported across all ASM instances when comparing against the flash disk failure time in the next step.
Step 6: Take repair actions for each affected disk group.
For each affected disk group, perform one of the following repair actions. If there are two affected disk groups, as shown in the examples above, then two repair actions are required, one for each disk group (e.g. DATA and DBFS). Compare the time of the flash disk predictive failure obtained in step 2 with the time of the most recent ASM rebalance for the disk group obtained in step 5.
Repair scenario 1 - ASM rebalance occurred before flash disk predictive failure, or there has been no rebalance
If the most recent ASM rebalance occurred before the flash disk predictive failure, then repair the corruption with ASM disk scrubbing for the individual grid disks.

Note: The Grid Infrastructure home requires fixes for bug 22446455, bug 25417056, and bug 25733479 to use ASM disk scrubbing. Without these fixes, ASM disk scrubbing will not repair the corruption properly/efficiently.
Note: ASM 12.2 supports asynchronous scrubbing to increase scrub performance. See 12.2 Oracle Automatic Storage Management Administrator's Guide for details of disk group attribute scrub_async_limit.
Using the list of grid disks obtained from step 3, run scrub repair on each disk within the disk group being repaired. For example:

SQL> alter diskgroup DATA scrub disk DATA_CD_03_cell01 repair wait;
SQL> alter diskgroup DATA scrub disk DATA_CD_04_cell01 repair wait;
An alternate action for this scenario is to force drop the grid disks cached by the failed flash drive instead of using ASM disk scrubbing. Note the use of FORCE for each disk.

Note: It is essential to use the FORCE keyword. Running DROP DISK without FORCE will cause the corruption to propagate to other storage servers.
SQL> alter diskgroup DATA drop disk DATA_CD_03_cell01 FORCE, DATA_CD_04_cell01 FORCE rebalance power 64;
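The force drop initiates a rebalance of the disk group. Before continuing, confirm that the rebalance has completed; a minimal check from any ASM instance using the standard V$ASM_OPERATION view is shown below. When the disk group no longer appears in the output, the rebalance has finished:

SQL> select group_number, operation, state, power, est_minutes from v$asm_operation;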
Repair scenario 2 - ASM rebalance occurred after flash disk predictive failure
If the most recent ASM rebalance occurred after the flash disk predictive failure, then it is highly likely the corruption spread to other storage servers during the rebalance. Repair the corruption with ASM disk scrubbing for the entire disk group.

Note: The Grid Infrastructure home requires fixes for bug 22446455, bug 25417056, and bug 25733479 to use ASM disk scrubbing. Without these fixes, ASM disk scrubbing will not repair the corruption properly/efficiently.
Note: ASM 12.2 supports asynchronous scrubbing to increase scrub performance. See 12.2 Oracle Automatic Storage Management Administrator's Guide for details of disk group attribute scrub_async_limit.
Run scrub repair on the disk group. For example:

SQL> alter diskgroup DATA scrub repair wait;
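Scrubbing activity is normally recorded in the ASM alert.log. As a rough way to review what the scrub reported after it completes, the log can be searched from the trace directory used in step 5; the search term below is an assumption and may need adjusting for your release:

[oracle@node1 trace]$ grep -i "scrub" alert_+ASM*.log | tail -50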
Step 7: Replace the failed flash drive.
Replace the failed flash drive using the documented procedure, performed by a trained Oracle field technician. Note that dropping ASM disks that reside on the storage server containing the failed flash drive is not a proper action to perform during the flash drive replacement procedure, per <Document 1993842.1> "How to Replace an Exadata X5-2/X6-2 Storage Server Flash F160/F320 Card." If grid disks were force dropped in the previous step, then after the failed flash drive is replaced, verify that the dropped disks are automatically added back into the disk group.
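One way to verify that force-dropped disks have been added back is to check their state from an ASM instance. The query below is a sketch using the disk names from the earlier examples (ASM stores disk names in upper case); the disks should show MOUNT_STATUS of CACHED and MODE_STATUS of ONLINE once they have rejoined the disk group and the subsequent rebalance has completed:

SQL> select name, mount_status, mode_status, state
       from v$asm_disk
      where name in ('DATA_CD_03_CELL01', 'DATA_CD_04_CELL01');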
Step 8: Update storage servers according to the Patches section to prevent this issue from occurring again.
Symptoms
Symptoms may include the following:

Note: It is possible for corruption to occur as a result of the flash disk predictive failure while none of the following symptoms are present. Also, this issue may affect both primary and secondary extents. RMAN validate checks only primary extents; therefore, RMAN cannot be used to reliably detect corruption caused by this issue.
Workaround
None

Patches
To prevent this issue, update the flash firmware by updating Exadata X6 storage servers to Exadata 12.1.2.3.4 or higher.

Important: Updating the Exadata version and flash firmware only prevents this issue from occurring again in the future. The update will not detect nor repair existing corruption caused previously by this issue. The steps in the Occurrence section of this document must be followed to address existing corruption.
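After updating, the active Exadata software version on each storage server can be confirmed, along with the flash firmware reported for the flash disks. These checks are a sketch and assume the standard imageinfo utility and the physicalFirmware attribute in CellCLI:

[root@cell01 ~]# imageinfo -ver        # should report 12.1.2.3.4 or higher after the update
[root@cell01 ~]# cellcli -e list physicaldisk attributes name, physicalFirmware where diskType=FlashDisk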
History
11-Jan-2018 - Add reference to ASM 12.2 scrub_async_limit which can increase scrub performance

Attachments
This solution has no attachment