
Asset ID: 1-72-1988602.1
Update Date: 2016-12-01
Keywords:

Solution Type: Problem Resolution Sure

Solution 1988602.1: FS System: Explanation of CM_EVT_QOS_REBALANCE_FAILED events


Related Items
  • Oracle FS1-2 Flash Storage System
Related Categories
  • PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-10364325251>

Applies to:

Oracle FS1-2 Flash Storage System - Version 6.1.0 and later
Information in this document applies to any platform.

Symptoms

QoS Rebalance Failed events (CM_EVT_QOS_REBALANCE_FAILED) may be observed when adding Drive Enclosures or freeing up space while the system is reclaiming, or zeroing, storage. If the "Enable Automatic QoS Rebalancing" option is enabled, the system performs a daily scan and rebalances volumes as required. If there is Unavailable Storage, the event may be generated because the free storage is less than that required to perform the rebalance.

 

NOTE: The "Automatic QoS Rebalancing" feature must be disabled on all Storage Domains if the current software version is 06.01.xx.
See Document 2044829.1, FS System: Automatic QoS (Quality of Service) Rebalancing Recommended Setting for FS1-2, for more details.

 

If the system is zeroing storage, whether because new DEs were added or LUNs were deleted, and it initiates a rebalance, the software does not currently check for sufficient free (non-reclaiming) space before it starts the migration. This causes that specific migration to fail and generate the title events.

Changes

This issue affects FS1 code release 06.01.xx.

Cause

The problem when space is unavailable due to zeroing in progress (from a system reset, re-use of a Drive Group, a bug, or a LUN deletion) is that although there is free space, there is not enough for the attempted migration.

If the migration of a large overlapping LUN fails, the system will still check whether smaller migrations are possible and will perform them.

All migrations are throttled and performed in small increments rather than as one continuous operation, in the same way QoS migrations are done.

Every 24 hours, the software on the Controllers scans the allocations for migration opportunities if Auto QoS migration is enabled. To check whether Auto QoS migration is enabled, go to the FS1 GUI, System -> Storage Domains, double-click the Storage Domain, and verify that "Enable Automatic QoS Rebalancing" is ticked.

QoS Rebalancing happens on LUNs whose extents are not equally distributed across the defined number of Drive Groups (the stripe width). It happens at the tier level (Storage Class within the Storage Domain) and must not be confused with the Auto-Tiering (QoS+) feature.



TSE can verify whether Auto QoS migration is enabled by searching for the <AutoQrMigrationDisabled> tag (one per Storage Domain) in the A1…chsh.xml file:
<AutoQrMigrationDisabled>false</AutoQrMigrationDisabled>          <-- in this example, Auto QoS migration is enabled.

Alternatively, run 'FSInfo.pl -sd'.
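The same check can also be scripted. Below is a minimal Python sketch, assuming the chsh.xml file has been copied locally under the placeholder name chsh.xml (the real file name is truncated above and varies per array); it simply reports each <AutoQrMigrationDisabled> tag found:

    import xml.etree.ElementTree as ET

    # Hypothetical local copy of the chsh.xml file; the actual file name
    # on the array differs (shown truncated as "A1…chsh.xml" above).
    CHSH_FILE = "chsh.xml"

    root = ET.parse(CHSH_FILE).getroot()

    # One <AutoQrMigrationDisabled> tag is expected per Storage Domain.
    for i, tag in enumerate(root.iter("AutoQrMigrationDisabled"), start=1):
        disabled = (tag.text or "").strip().lower() == "true"
        state = "Disabled" if disabled else "Enabled"
        print("Storage Domain #%d: Auto QoS rebalancing %s" % (i, state))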

 

Also, per Bug 20638408, QoS rebalancing can fail before starting if the available free space on the Storage Class in the Storage Domain is insufficient (for example, around 60%). This is caused by the algorithm checking whether there is enough space for all the LUNs that need QoS rebalancing at once, instead of handling one LUN at a time. The issue is not that simple and is under investigation.
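For illustration only, the difference between the aggregate check described in the bug and per-LUN handling can be sketched as follows (hypothetical numbers; this is not the actual Controller algorithm):

    # Illustrative sketch of the behavior described in Bug 20638408.
    # All values are hypothetical; this is not the actual Controller code.

    free_space_gb = 500                     # free space in the Storage Class
    rebalance_needs_gb = [300, 250, 200]    # space each pending migration needs

    # Current algorithm: checks space for ALL pending rebalances at once,
    # so the scan fails even though each single migration would fit.
    total_needed = sum(rebalance_needs_gb)  # 750 GB
    print("Aggregate check passes:", total_needed <= free_space_gb)  # False

    # Handling one LUN at a time would let each migration proceed,
    # re-evaluating free space after each move.
    for need in rebalance_needs_gb:
        print("Migration of %d GB fits individually: %s"
              % (need, need <= free_space_gb))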



If Automatic QoS Rebalancing had been off and has just been enabled, the system will not perform an immediate migration; it waits until the next scan, which could occur anywhere from almost immediately to 24 hours later. There is no memory of prior scans, since the storage layout may have changed since the last scan.

 

Solution

The issue will be fixed in code release 06.02.00.

Migrations are performed under two conditions:

1. A LUN is physically on fewer than the optimal number of Drive Groups for its current Priority Level (for example, only 2 Drive Groups for a Premium LUN).
If Automatic QoS Rebalancing is enabled, those migrations should happen as the Drive Groups are added. With this bug, you may see the errors if zeroing was in progress on the Drive Groups.
If Automatic QoS Rebalancing is off and is then turned on, migration opportunities are checked for and initiated on the next daily scan.

2. If all LUNs are on their optimal number of Drive Groups but new storage is added (or reclaimed), the FS1 checks whether the existing Drive Groups are highly allocated and may move LUNs even though they are already optimal, balancing the LUNs over the Drive Groups to prevent future contention on the newest Drive Groups (see the conceptual sketch below).
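Conceptually, these two triggers can be summarized in the short Python sketch below; all names and parameters are illustrative placeholders for the conditions above, not the FS1 implementation:

    # Conceptual sketch of the two rebalance triggers described above.
    # Names and parameters are illustrative, not actual FS1 code.

    def needs_rebalance(lun_drive_groups, optimal_drive_groups,
                        new_storage_added, existing_groups_highly_allocated):
        # Condition 1: the LUN sits on fewer Drive Groups than the optimal
        # stripe width for its Priority Level (e.g. 2 of 3 for Premium).
        if lun_drive_groups < optimal_drive_groups:
            return True
        # Condition 2: the LUN is already optimal, but new or reclaimed
        # storage was added and the existing Drive Groups are highly
        # allocated, so the LUN may still be moved to avoid contention.
        if new_storage_added and existing_groups_highly_allocated:
            return True
        return False

    # Example: a Premium LUN striped over 2 of 3 optimal Drive Groups.
    print(needs_rebalance(2, 3, False, False))   # True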

There is currently no way to manually initiate the scan.  It happens once a day. 

There are situations where leaving Automatic QoS Rebalancing disabled during a Drive Enclosure add may be desirable; for example, when you have critical LUNs that need to be migrated first.
Turn off Automatic QoS Rebalancing and perform a regular migration of those LUNs onto the new DEs, or onto newly vacated Drive Groups (after zeroing).
Then turn Automatic QoS Rebalancing back on and let the system handle the rest of the LUNs.

References

<BUG:20638408> - CM_EVT_QOS_REBALANCE_FAILED EVENTS AFTER ADDING 3 DES

Attachments
This solution has no attachment