Date of Resolved Release: 03-Sept-2015
________________________________________
Description
FS1-2 Flash Storage System drive failures may result in Data Loss if controllers warm start or boot during a rebuild or copyback.
Occurrence
This issue can occur on Oracle FS1-2 Flash Storage Systems with the following configurations and conditions:
FS1-2 version 6.1.11 or earlier with one of the following failed Hard Disk Drive (HDD) part numbers:
- 7044283 (300GB)
- 7044376 (900GB)
- 7066831 (4TB)
This issue only occurs when all of the following conditions are met:
- The drive has been replaced, and a copyback task is in progress
- The Controller performing the copyback warm starts or reboots
- Another drive fails in the same drive group while the copyback is in progress
FS1-2 version 6.1.17 or earlier with any failed drive type:
This issue only occurs when all of the following conditions are met:
- Either a rebuild or copyback task is in progress
- SAN Hosts are actively writing data to the drive group with the rebuild or copyback
- BOTH Controllers warm start or reboot at the same time
Note: To verify the version of software currently installed on the FS1-2 system, do one of the following:
1. Using the Oracle FS System Manager GUI, from the System tab, select System Information in the navigation tree. The Software version will be displayed on the left side at the bottom of the System Information list.
2. Using the Flash Systems Command Line Interface (FSCLI), run the following command:
fscli version -list
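For repeated or scripted checks, the command above can be wrapped in a small script. The following Python sketch is illustrative only: it assumes fscli is available in the PATH and that the command output contains a version string in x.y.z form (the exact output format is an assumption), and it simply flags releases earlier than the 6.1.18 fix.

#!/usr/bin/env python3
# Illustrative helper: wrap the FSCLI command above and flag releases
# earlier than 6.1.18. The output format of "fscli version -list" is an
# assumption; adjust the pattern to match your system's actual output.
import re
import subprocess

FIXED_RELEASE = (6, 1, 18)

def installed_version():
    out = subprocess.run(["fscli", "version", "-list"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", out)  # first x.y.z token found
    if not match:
        raise RuntimeError("No version string found in FSCLI output")
    return tuple(int(part) for part in match.groups())

if __name__ == "__main__":
    version = installed_version()
    if version >= FIXED_RELEASE:
        print("Installed release %d.%d.%d contains the fix." % version)
    else:
        print("Installed release %d.%d.%d is exposed; plan an upgrade to 6.1.18 or later." % version)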
Symptoms
Symptoms for systems using 6.1.11 or earlier releases when a copyback task is in progress:
1. When a controller warm starts, the following will be shown in the event log:
ENCLOSURE_HARDWARE_REMOVED
ENCLOSURE_HARDWARE_INSERTED
ENCLOSURE_MAINTENANCE_STARTED
PCP_EVT_CONTROLLER_WARMSTART_BEGIN
PCP_EVT_CONTROLLER_WARMSTART_COMPLETE
2. When a controller reboots, the event log will have multiple 'CM_EVT_BOOT_STATE_CHANGED' events.
3. When a drive fails, the event log will show an 'ENCLOSURE_COMPONENT_STATE_CHANGE' event for another drive in the same drive group as the drive where the copyback is taking place.
Data loss will occur if events '1' and '3', or events '2' and '3', occur during the copyback process.
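As an aid to reviewing the events above, the Python sketch below (illustration only) scans an exported copy of the event log for the warm start, reboot, and drive state change events. The file name event_log_export.txt and the assumption that event names appear verbatim, one event per line, are hypothetical.

#!/usr/bin/env python3
# Minimal sketch, assuming the event log has been exported to a plain-text
# file with the event names appearing verbatim. The file name is hypothetical.

WARMSTART_EVENTS = ("PCP_EVT_CONTROLLER_WARMSTART_BEGIN",
                    "PCP_EVT_CONTROLLER_WARMSTART_COMPLETE")
REBOOT_EVENT = "CM_EVT_BOOT_STATE_CHANGED"
DRIVE_STATE_EVENT = "ENCLOSURE_COMPONENT_STATE_CHANGE"

def scan(path="event_log_export.txt"):
    warmstart = reboot = drive_change = False
    with open(path) as log:
        for line in log:
            warmstart = warmstart or any(evt in line for evt in WARMSTART_EVENTS)
            reboot = reboot or REBOOT_EVENT in line
            drive_change = drive_change or DRIVE_STATE_EVENT in line
    if (warmstart or reboot) and drive_change:
        print("Possible exposure: confirm whether these events fell within a copyback window.")
    else:
        print("The event combination described above was not found.")

if __name__ == "__main__":
    scan()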
Symptoms for systems using 6.1.17 or earlier releases when a rebuild or copyback task is in progress:
1. When a controller warm starts, the following will be shown in the event log:
ENCLOSURE_COMPONENT_STATE_CHANGE
ENCLOSURE_MAINTENANCE_STARTED
PCP_EVT_CONTROLLER_WARMSTART_BEGIN on one Controller
PCP_EVT_CONTROLLER_WARMSTART_BEGIN on the other Controller
PCP_EVT_CONTROLLER_WARMSTART_COMPLETE
PCP_EVT_CONTROLLER_WARMSTART_COMPLETE
ENCLOSURE_MAINTENANCE_COMPLETED
2. If both controllers reboot at the same time, the system will initiate a full system restart. A 'PCP_EVT_SYSTEM_STATE_CHANGED' event will be followed by multiple 'CM_EVT_BOOT_STATE_CHANGED' events.
3. When a drive fails, the event log will show an 'ENCLOSURE_COMPONENT_STATE_CHANGE' event whose detail indicates a CRU_FAULT on a drive of any type and that a maintenance process has begun, followed by warm starts or reboots on both controllers before the maintenance process has completed.
If a rebuild or copyback task is in progress on any type of drive, and both controllers warm start or reboot at the same time, data loss may result if hosts are writing to the drive group with maintenance in progress.
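A similar sketch (again illustrative only, with the same assumptions about an exported plain-text event log and a hypothetical file name) can check whether two controller warm starts occurred inside a single maintenance window, which is the dual warm start condition described above.

#!/usr/bin/env python3
# Rough illustration only: count controller warm start events that fall
# between ENCLOSURE_MAINTENANCE_STARTED and ENCLOSURE_MAINTENANCE_COMPLETED
# in an exported event log. Two WARMSTART_BEGIN events inside one maintenance
# window suggest both Controllers warm started while a rebuild or copyback
# was in progress. The file name and log format are assumptions.

def dual_warmstart_in_maintenance(path="event_log_export.txt"):
    in_window = False
    warmstarts = 0
    with open(path) as log:
        for line in log:
            if "ENCLOSURE_MAINTENANCE_STARTED" in line:
                in_window, warmstarts = True, 0
            elif "ENCLOSURE_MAINTENANCE_COMPLETED" in line:
                if warmstarts >= 2:
                    return True
                in_window = False
            elif in_window and "PCP_EVT_CONTROLLER_WARMSTART_BEGIN" in line:
                warmstarts += 1
    # Window never closed: the warm starts may have interrupted the maintenance task.
    return in_window and warmstarts >= 2

if __name__ == "__main__":
    if dual_warmstart_in_maintenance():
        print("Both Controllers warm started during a maintenance window - verify data integrity.")
    else:
        print("No dual warm start found inside a maintenance window.")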
If either of these situations occurs, SAN Hosts may begin logging Media Errors for volumes on the affected drive group. Exposure to either issue requires a very specific set of circumstances and should be detected shortly after it occurs.
If you replaced an HDD on a system running 6.1.11 or earlier, or if both controllers warm started or rebooted while a rebuild or copyback was in progress on any drive type on a system running 6.1.17 or earlier, Oracle recommends verifying data integrity from your application(s).
Workaround
This issue is addressed in the following releases:
- FS1-2 version 6.1.18 or later
Important Notes:
Please refer to the 6.1.18 Patch Readme for critical information before you attempt to install this release.
Systems currently installed with 6.1.11 or earlier:
If the installed release is 6.1.11 or earlier, and an HDD fails, do not replace the drive. Allow all rebuild tasks to complete, then schedule a maintenance window and perform the upgrade to 6.1.18 or later disruptively.
- Check running tasks to make sure there are no rebuilds or copybacks in progress.
- Check System Alerts and make sure there are no alerts other than the alert for the failed HDD.
- Perform a disruptive upgrade. Use the Software options to Restart and update software, and select the options to Shutdown Controller, Ignore Hardware Status, and Ignore System Alerts.
- After the upgrade completes, verify data access from Host application(s).
Systems currently installed with 6.1.12 through 6.1.17:
If the installed release is 6.1.12 through 6.1.17, the upgrade to 6.1.18 can be performed non-disruptively ONLY if the drive is replaced and all rebuild or copyback tasks have completed.
- Check running tasks to make sure there are no rebuilds or copybacks in progress.
- Check System Alerts and make sure no alerts exist.
- Start the upgrade with the "Update software without restarting system" radio button.
- After the upgrade completes, verify data access from Host application(s).
If the installed release is 6.1.12 through 6.1.17, and you do not wish to wait for the tasks to complete, you must schedule a maintenance window and perform the upgrade disruptively.
- Check running tasks to make sure there are no rebuilds or copybacks in progress.
- Check System Alerts and make sure there are no alerts other than the alert for the failed HDD.
- Perform a disruptive upgrade. Use the Software options to Restart and update software, and select the options to Shutdown Controller, Ignore Hardware Status, and Ignore System Alerts.
- After the upgrade completes, verify data access from Host application(s).
Note: Please refer to <Document:1968129.1> for general upgrade instructions.
Patches
<Patch:21770913>
History
03-Sept-2015: Document released, status Resolved
Internal Section: Comments:
REF: Bug 21187085 fixed in 6.1.12.
If events indicate an HDD replacement, be safe and assume that RAID6 volumes may have been affected unless you can prove otherwise. Although it is possible to create non-RAID6 volumes on Capacity HDD, unless you are absolutely certain that no RAID6 volumes exist, treat the situation as if they do.
Volumes on Capacity HDD Enclosures default to RAID6. Volumes on Performance HDD may also be RAID6. The GUI will show "Double Parity" for the RAID Parity option. The chsh.xml file will indicate RAID_6. The event log detail will indicate that a maintenance process of type copyback was initiated on the Enclosure.
The issue occurs because copyback does not apply to RAID6, only to RAID10 and RAID5. The RAID software module maintains a flag that indicates the maintenance is in progress, and if RAID6, that the copyback operation must be converted to a RAID6 rebuild for affected volumes. The flag required to perform the internal conversion of copyback to RAID6 rebuild is not preserved across a controller warm start or reboot, or if another drive in the same Drive Group fails while the copyback is in progress.
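The following conceptual Python sketch is not Oracle's implementation; it only illustrates why a flag that is held exclusively in Controller memory does not survive a warm start, which is the failure mode described above. All names in it are hypothetical.

# Conceptual sketch only - not Oracle's implementation. It illustrates why an
# in-memory-only flag is lost across a warm start: the controller state is
# rebuilt from persisted data, and anything held only in RAM disappears.

class Controller:
    def __init__(self, persisted_state):
        # State reloaded after a warm start or reboot.
        self.persisted_state = dict(persisted_state)
        # Held only in memory; NOT written to persistent storage.
        self.convert_copyback_to_raid6_rebuild = False

    def start_copyback(self, drive_group_is_raid6):
        if drive_group_is_raid6:
            # Copyback does not apply to RAID6, so remember to convert it.
            self.convert_copyback_to_raid6_rebuild = True

    def warm_start(self):
        # Simulate a warm start: only persisted state survives.
        return Controller(self.persisted_state)

ctrl = Controller({"maintenance": "copyback in progress"})
ctrl.start_copyback(drive_group_is_raid6=True)
ctrl = ctrl.warm_start()
print(ctrl.convert_copyback_to_raid6_rebuild)   # False - the conversion flag was lost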
Due to Bug 21540944, an NDU (Non-Disruptive Upgrade) would expose the customer to data loss, because the dual warm starts of the NDU are sufficient to cause that loss.
To determine if the failed drive is utilized in a drive group that has RAID6 LUNs:
1. Copy the log bundle onto an ISDE server
2. Process the log bundle using the scanlog6 utility
3. Run FSInfo.pl -dg to generate RAID information for each Drive Group
4. Search for Drive Groups with RAID6 LUNs
For Example:
Drive Group Enclosure RAID Information:
Name | Raid Level | Raid State | Has Allocation
----------------------------------------------------------------------
/DRIVE_GROUP-002 | RAID_5 | NORMAL | true
/DRIVE_GROUP-002 | RAID_10 | NORMAL | true
/DRIVE_GROUP-002 | RAID_6 | NORMAL | true
/DRIVE_GROUP-001 | RAID_5 | NORMAL | true
/DRIVE_GROUP-001 | RAID_10 | NORMAL | true
/DRIVE_GROUP-001 | RAID_6 | NORMAL | true
/DRIVE_GROUP-000 | RAID_5 | NORMAL | true
/DRIVE_GROUP-000 | RAID_10 | NORMAL | true
The above output would have hit this issue if a drive had failed in either DRIVE_GROUP-001 or DRIVE_GROUP-002. You are not affected by this alert if these Drive Groups report "Has Allocation" as false.
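If the FSInfo.pl -dg output has been saved to a text file, a small script can list the exposed Drive Groups. The sketch below is illustrative only: the file name fsinfo_dg.txt is hypothetical, and the column layout is taken from the example output above.

#!/usr/bin/env python3
# Minimal sketch that post-processes the "Drive Group Enclosure RAID
# Information" table shown above (saved to a text file) and lists Drive
# Groups exposed to this issue: RAID_6 rows with "Has Allocation" true.

def raid6_allocated_groups(path="fsinfo_dg.txt"):
    exposed = []
    with open(path) as report:
        for line in report:
            if "|" not in line:
                continue  # skip headers and separator lines without columns
            fields = [col.strip() for col in line.split("|")]
            if len(fields) >= 4 and fields[1] == "RAID_6" and fields[3].lower() == "true":
                exposed.append(fields[0])
    return exposed

if __name__ == "__main__":
    groups = raid6_allocated_groups()
    if groups:
        print("Drive Groups with allocated RAID_6 LUNs:", ", ".join(groups))
    else:
        print("No allocated RAID_6 Drive Groups found - this alert does not apply.")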
REF: Bug 21540944
Data loss occurs when there is an impaired Drive Group. All host writes and rebuild data are merged and should be written to FBM (FailBack Memory) on the buddy Controller. If this is done, the host write and rebuild data can be recovered after anything other than a clean shutdown and reboot of the Controller or a PO (Power On). Prior to R6.1.18, this data was not written to FBM on the buddy Controller, but to ordinary memory. The two warm starts must occur before either Controller is able to recover. This is most likely to occur on an NDU, but some software faults also trigger simultaneous warm starts.
An NDU to 6.1.18 is safe only if all rebuild and copyback tasks have completed.
This regression was introduced by a putback for BugID: Bug number not known.
Questions regarding this document should be addressed to sunalertpublication_us-grp@oracle.com, copying the responsible engineer listed below.
Internal Contributor/Submitter: bob.deguc@oracle.com, john.rafferty@oracle.com
Internal Eng Responsible Engineer: Lon Stowell
Oracle Knowledge Analyst: jeff.folla@oracle.com
Internal Eng Business Unit Group: Flash Storage
Internal Associated SRs: 3-10858357071, 3-10927543381
Internal Pending Patches:
Internal Resolution Patches: 21770913: FS1-2 6.1.18
References
<BUG:21540944> - PERFORMANCE ISSUE AFTER MOVING FROM STORAGE DOMAIN2 TO STORAGE DOMAIN1
<BUG:21187085> - SYSTEM RESTART DURING COPYBACK RESULTS IN POTENTIAL CORRUPTION
Attachments
This solution has no attachment