Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution

2115005.1: (EX28) High risk of data loss on X3 or X4 Exadata storage when flash compression is enabled and running late 2015 or early 2016 software release
Applies to:
Oracle Exadata Storage Server Software - Version 12.1.2.2.0 to 12.1.2.3.0 [Release 12.1]
Exadata X4-2 Hardware
Exadata X3-2 Hardware
Oracle SuperCluster
Exadata X3-8 Hardware
Information in this document applies to any platform.

Description
Due to bug 22909764, on X4 and X3 storage servers (in X4-2, X4-8, X3-2, and X3-8 Exadata Database Machines) running Exadata 12.1.2.2.0, 12.1.2.2.1, or 12.1.2.3.0 with Exadata Smart Flash Cache Compression enabled, one or more flash drives may fail on multiple storage servers. This can lead to data loss if flash cache is configured write-back, or reduced performance if flash cache is configured write-through.

Flash Cache Compression is disabled by default. Only storage servers where this feature has been explicitly enabled are affected. An Oracle Advanced Compression option license is required in order to use Flash Cache Compression.

Occurrence
Pre-requisite Conditions for Bug 22909764
Bug 22909764 may occur when all of the following conditions are met:

- The storage server is an X4 or X3 model (Exadata X4-2, X4-8, X3-2, or X3-8).
- The storage server is running Exadata 12.1.2.2.0, 12.1.2.2.1, or 12.1.2.3.0.
- Flash Cache Compression is enabled.
How to check if your storage servers are susceptible to bug 22909764
Run the following checks on all storage servers. If one or more storage servers meet all of the conditions, then review the Workaround and Patches sections below.
1. Run the following imageinfo command to determine the current Exadata software version:

   # imageinfo -ver
   12.1.2.2.1.160119

If the current version is 12.1.2.2.0, 12.1.2.2.1, or 12.1.2.3.0, then proceed to the next check.

Note: If the current version is lower than 12.1.2.2.0 and Flash Cache Compression is enabled, then special consideration must be taken when upgrading. See the Patches section below for further details.
2. Run the following CellCLI command to determine whether Flash Cache Compression is enabled:

   CellCLI> LIST CELL attributes flashCacheCompress
   TRUE

A value of TRUE indicates Flash Cache Compression is enabled; proceed to the next check. A value of FALSE, or no value at all, indicates Flash Cache Compression is disabled. Note that Flash Cache Compression requires licensing the Oracle Database Advanced Compression option.
3. As the root user, run the following command to check the flash controller firmware version:

   [root@dm01cel01 ~]# /opt/oracle.SupportTools/CheckHWnFWProfile -action list -component Flash | grep -i 'cardfw' | uniq

If the flash controller firmware version is 13.05.11.00 (on X4 systems) or 13.05.10.00 (on X3 systems), then proceed to the Workaround and Patches sections for further action.
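To sweep all three checks across every storage server at once, they can be driven from a database server with dcli. This is a minimal sketch, assuming a cell_group file listing all storage servers and passwordless SSH equivalence as root (the host name and group file name are placeholders):

   # Exadata software version on all cells
   [root@dm01db01 ~]# dcli -g cell_group -l root 'imageinfo -ver'

   # Flash Cache Compression setting on all cells
   [root@dm01db01 ~]# dcli -g cell_group -l root "cellcli -e 'list cell attributes flashCacheCompress'"

   # Flash controller firmware version on all cells
   [root@dm01db01 ~]# dcli -g cell_group -l root "/opt/oracle.SupportTools/CheckHWnFWProfile -action list -component Flash | grep -i 'cardfw' | uniq"

Any cell that reports an affected software version, flashCacheCompress TRUE, and an affected firmware version meets all of the conditions above.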
Symptoms
Multiple flash drives on more than one flash card, potentially across multiple storage servers, fail at the same time.
Multiple failed flash drives may lead to data loss if flash cache is configured write-back, or reduced performance if flash cache is configured write-through. If you believe that your system is currently exhibiting the symptoms described, please contact Oracle Support. Note that installing the software fix as described in the Patches section, or disabling Flash Cache Compression as described in the Workaround section, does not correct a system that already has failed flash drives.
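To check whether any flash disks have already failed on a cell, the physical disk status can be inspected with CellCLI. A minimal sketch using standard attributes; any flash disk with a status other than normal is a reason to contact Oracle Support as described above:

   CellCLI> LIST PHYSICALDISK ATTRIBUTES name, status WHERE diskType = FlashDisk AND status != normal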
Assessing the Risk of Hitting Critical Issue EX28
X4 and X3 storage servers running Exadata version 12.1.2.2.0, 12.1.2.2.1, or 12.1.2.3.0 with Flash Cache Compression enabled are exposed to bug 22909764. Storage servers with this configuration whose flash devices then run too low on free space may experience many flash cards failing in a short period of time across multiple cells. If the flash cache mode is set to write-back and multiple flash cards fail across multiple cells, then it is likely that ASM disk group(s) will dismount and data will be lost. If the flash cache mode is set to write-through, then there is no data loss, but there will likely be a performance impact because flash cache content will have been lost.

This section applies to systems that currently have no flash disk failures due to this issue. Flash failure due to this issue is caused by flash disks running out of free space. Because flash cache free space is affected by workload and working set size, some storage servers may hit the issue shortly after upgrade, while others may never hit it.

The following command may be used on each storage server to determine the amount of free space currently available on the 16 flash disks that are operating normally (ensure all 16 flash drives are reported - drives must not already be failed):

[root@cell ~]# for dev in $(cellcli -e 'list physicaldisk attributes devicename where disktype=FlashDisk')
do
  free=$(smartctl -a $dev | grep '243 Unknown_Attribute' | awk '{print and($NF,0xffffffff)*8/1024/1024}')
  printf "%02d. flash disk %s free space %.2f GiB\n" $((++i)) $dev $free
done
01. flash disk /dev/sdi free space 67.90 GiB
02. flash disk /dev/sdj free space 68.01 GiB
03. flash disk /dev/sdk free space 67.82 GiB
04. flash disk /dev/sdl free space 67.76 GiB
05. flash disk /dev/sdm free space 67.73 GiB
06. flash disk /dev/sdn free space 67.72 GiB
07. flash disk /dev/sdo free space 67.67 GiB
08. flash disk /dev/sdp free space 67.80 GiB
09. flash disk /dev/sde free space 67.65 GiB
10. flash disk /dev/sdf free space 67.90 GiB
11. flash disk /dev/sdg free space 67.56 GiB
12. flash disk /dev/sdh free space 67.98 GiB
13. flash disk /dev/sda free space 67.70 GiB
14. flash disk /dev/sdb free space 67.55 GiB
15. flash disk /dev/sdc free space 67.50 GiB
16. flash disk /dev/sdd free space 67.75 GiB

Depending on workload and working set size, the amount of free space on flash drives may continue to decrease over time, potentially over a short period of time. As free space decreases, the likelihood of hitting this issue increases. The free space threshold at which failure occurs depends on the hardware model (X4 or X3).
With the software fix in place and flash cache compression operating normally, the amount of free space should stabilize at ~68 GiB on X4 storage servers, and ~34 GiB on X3 storage servers.
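For ongoing monitoring, the free-space loop above can be wrapped with a warning threshold. The sketch below uses the same smartctl attribute decoding as the loop above; the default threshold of 40 GiB is a hypothetical placeholder, not an Oracle-published limit - substitute a value appropriate for your hardware model and risk tolerance:

   #!/bin/bash
   # Warn when any flash disk's free space falls below a threshold (GiB).
   # The default of 40 GiB is a placeholder, not an Oracle-published limit.
   THRESHOLD=${1:-40}
   for dev in $(cellcli -e 'list physicaldisk attributes devicename where disktype=FlashDisk')
   do
     free=$(smartctl -a $dev | grep '243 Unknown_Attribute' | awk '{print and($NF,0xffffffff)*8/1024/1024}')
     if awk -v f="$free" -v t="$THRESHOLD" 'BEGIN {exit !(f < t)}'; then
       echo "WARNING: flash disk $dev free space $free GiB is below $THRESHOLD GiB"
     fi
   done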
Failure and Recovery Scenarios
When this issue occurs, content on the failed flash drives is lost. Impact and recovery steps depend on the flash cache mode (write-through or write-back).

Flash cache mode write-through
If the flash cache mode is write-through, then there will likely be an impact to database read performance, since the read-only flash cache content will be lost. There should be no availability impact to ASM disk groups or databases.
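To confirm which mode, and therefore which of the following scenarios, applies to a given cell, the flash cache mode can be checked with a standard CellCLI attribute:

   CellCLI> LIST CELL ATTRIBUTES flashCacheMode
            WriteBack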
Flash cache mode write-back
If the flash cache mode is write-back, then there are writes acknowledged by the flash cache that are not yet persisted on disk; hence there is data loss. There are multiple possible scenarios.

Scenario 1 - One or more flash disks have failed, but the database is still operational.
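The detailed step list for this scenario is available from Oracle Support and is not reproduced here. As a general illustration only, scoping the impact while the database remains up typically starts with identifying the failed flash disks and confirming grid disk and ASM redundancy state on the affected cells, using standard CellCLI attributes:

   CellCLI> LIST PHYSICALDISK WHERE diskType = FlashDisk AND status != normal
   CellCLI> LIST GRIDDISK ATTRIBUTES name, status, asmModeStatus, asmDeactivationOutcome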
Scenario 2 - One or more flash disks have failed and the database is not operational. The DATA disk group has dismounted and, possibly, clusterware has crashed due to loss of the OCR/voting files.
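Again, the detailed recovery steps are not reproduced here. As a general illustration only, clusterware and disk group state can be confirmed from a database server; the Grid Infrastructure home path below is a placeholder for your environment:

   # Check clusterware status (as root on a database server)
   [root@dm01db01 ~]# /u01/app/12.1.0.2/grid/bin/crsctl check crs

   # If ASM is up, check which disk groups are mounted (as the grid user)
   [grid@dm01db01 ~]$ asmcmd lsdg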
Steps to re-enable flash drives that failed due to bug 22909764 (bug 22848220)
To re-enable failed flash cards where any flash drive on a flash card has failed, follow these steps:
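The step list itself is available from Oracle Support and is not reproduced here. As a hypothetical illustration only: re-enabling a single flash disk generally uses the CellCLI REENABLE action, where the disk name FLASH_1_0 below is a placeholder. Do not attempt this on a system with failed drives without Oracle Support guidance.

   CellCLI> ALTER PHYSICALDISK FLASH_1_0 REENABLE FORCE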
Steps to handle upgrade failure due to latent effects of critical issue EX17
An X3 cell may be exposed to a similar flash disk failure issue previously published as Exadata critical issue EX17 in Document 1968234.1. In summary: an X3 cell upgraded from 11.2.3.3.0/12.1.1.1.0 to 11.2.3.3.1/12.1.1.1.1/12.1.2.1.0 while Flash Cache Compression is enabled will have a mismatch on each flash disk between the compression setting and the flash disk size. The result is that Exadata software behaves as if compression is enabled when, in fact, it is not, which can lead to flash disk failure if free space is exhausted. However, in cases where workload and working set size are small enough not to exhaust free space, there will be no failure after upgrade even though the mismatch is still present.

A new check was added in Exadata 12.1.2.2.0 to validate that each flash disk has a correctly matching compression setting and size. If there is a mismatch and the cell is upgraded from a release affected by EX17 to 12.1.2.2.0 or later, then cellsrv will fail to start with the following error:

ORA-00600: internal error code, arguments: [ossflc_create - DFF and Falcon out of sync regarding DLC state.]

This scenario can occur when upgrading to 12.1.2.2.0 or later using one of the full release patches listed in the Patches section below. To resolve this condition, perform the internal-only resolution steps in Document 1968234.1.
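To check whether a cell has already hit this startup failure, the cell alert log can be searched for the error signature. A minimal sketch, assuming the standard CELLTRACE environment variable that points at the cell's trace directory:

   [root@dm01cel01 ~]# grep 'ossflc_create - DFF and Falcon out of sync' $CELLTRACE/alert.log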
Workaround
If you believe that your system is currently exhibiting the symptoms described above, please contact Oracle Support.

Recommended Action
Apply the software fix as described in the Patches section. If your system is running an earlier Exadata version (12.1.2.1.3 or lower), Flash Cache Compression is enabled, and you will upgrade to 12.1.2.3.0, 12.1.2.2.1, or 12.1.2.2.0, then upgrade using the full release component of the target release as described in the Patches section.

Alternate Action
A workaround to avoid this issue is to disable Flash Cache Compression by following the steps in the Oracle Exadata Database Machine Maintenance Guide. Disable Flash Cache Compression as a workaround only if there are no failed flash drives. If there are failed flash drives, contact Oracle Support before attempting to disable Flash Cache Compression. Grid disks created on flash drives must be moved to hard drives before disabling Flash Cache Compression. The workaround may be performed on systems already running 12.1.2.2.0, 12.1.2.2.1, or 12.1.2.3.0, or on systems running an earlier version prior to upgrade. Re-enable Flash Cache Compression only after the software fix is installed.
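The authoritative procedure is in the Oracle Exadata Database Machine Maintenance Guide and varies by release and model; the outline below is only a hedged sketch of the typical per-cell CellCLI sequence, to be run only when no flash drives have failed and after any grid disks on flash have been moved to hard drives:

   # Flush dirty cache lines to disk first (required when flash cache is write-back)
   cellcli -e 'alter flashcache all flush'
   # Drop the flash cache, flash log, and flash cell disks
   cellcli -e 'drop flashcache all'
   cellcli -e 'drop flashlog all'
   cellcli -e 'drop celldisk all flashdisk'
   # Disable compression (flash disks are resized by this step)
   cellcli -e 'alter cell flashCacheCompress=FALSE'
   # Recreate the flash cell disks, flash log, and flash cache
   cellcli -e 'create celldisk all flashdisk'
   cellcli -e 'create flashlog all'
   cellcli -e 'create flashcache all'

X3 systems additionally use the flashCacheCompX3Support cell attribute; see the Maintenance Guide for the exact sequence on your release.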
Patches
Recommended Action
The recommended action is to update to Exadata 12.1.2.2.2, 12.1.2.2.3, 12.1.2.3.1, or a higher release.

INTERNAL: Due to EX29, these patches have been revoked. The recommended action for all customers who are still exposed to this issue is to update to the revised 12.1.2.2.2/12.1.2.3.1. The revoked patches will not be re-released. Customers who previously updated using one of the revoked patches are considered safe - no further action is required.

Alternatively, patches are available for the following releases:
Note: Before updating, ensure storage servers are not currently exposed to Exadata critical issue EX17 by reviewing Document 1968234.1.
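Storage server updates are applied with patchmgr from a driving node that is not itself being patched. A minimal sketch, assuming the patch has already been downloaded and unzipped and that cell_group lists all storage servers; the patch README is the authoritative reference:

   # From the unzipped storage server patch directory on the driving node
   [root@dm01db01 patch]# ./patchmgr -cells cell_group -patch_check_prereq -rolling
   [root@dm01db01 patch]# ./patchmgr -cells cell_group -patch -rolling
   # After a successful update, clean up patch artifacts on the cells
   [root@dm01db01 patch]# ./patchmgr -cells cell_group -cleanup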
The patch contains the following two components:
The following table summarizes the recommended actions based on the currently installed Exadata version:
Footnotes
1. Patches 22917774 and 22928622, for Exadata versions 12.1.2.2.1 and 12.1.2.2.0 respectively, were re-released on 01-Apr-2016. No action is required for storage servers that were updated using the previously released patches.
History
29-Jul-2016 - Added 12.1.2.2.3 as a recommended release

Attachments
This solution has no attachment.