Asset ID: 1-77-1613534.1
Update Date: 2014-02-04
Solution Type: Sun Alert Sure
Solution 1613534.1: Pillar/Axiom: Multiple Combined LUN Copy and Migration Operations on a SAN LUN May Cause System Reboot to Fail or CUs to be Disabled
Related Items
- Pillar Axiom 300 Storage System
- Pillar Axiom 500 Storage System
- Pillar Axiom 600 Storage System
- Sun Hardware - Generic
Related Categories
- PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: Sun Alert
- _Old GCS Categories>Sun Microsystems>Sun Alert>Criteria Category>Availability
- _Old GCS Categories>Sun Microsystems>Sun Alert>Release Phase>Resolved
Applies to:
Pillar Axiom 500 Storage System
Pillar Axiom 600 Storage System
Sun Hardware - Generic
Pillar Axiom 300 Storage System
Information in this document applies to any platform.
_________________________________________
BUG:17995745, BUG:17987802, BUG:18049993, BUG:18070459
Date of Workaround Release (for this document): 10-Jan-2014
Date of Resolved Release (for this document): 04-Feb-2014
Date of R5.x Workaround Release: 10-Jan-2014
Date of R5.x Recovery Release: 31-Jan-2014
Date of R4.x Workaround Release: 31-Jan-2014
_________________________________________
Description
Copying a LUN and then changing the Quality of Service (QoS) for that LUN as a separate operation while the copy is still in progress, or changing the QoS for a LUN and then copying that LUN as a separate operation while the migration is still in progress, may leave a pointer to the configuration records for the LUN copy incorrectly set. The bad pointer causes repeated software panics on multiple Slammer Controller Units (CUs). Once all Slammer CUs have experienced these panics, the system performs an automatic restart. That restart fails with a COLD START FAILED status, leaving all data offline.
When an Axiom LUN is copied, the copy is created immediately and the GUI task to copy the LUN completes in a few seconds. The new LUN is available for use at that time, but all existing data on the source LUN must still be copied to the new LUN. This copy runs as a background task, “BackgroundProgressVolumesCmTask”, which may take several hours to complete on larger LUNs. When the background data copy completes, the references used for the background task must be removed. If the QoS is changed, or a restore operation is attempted, while this background task is still in progress, it may trigger a race condition as the background data copy completes and the references to the parent LUN are removed.
When the QoS of an Axiom LUN is changed, the GUI task completes immediately and the LUN remains available during the operation. Internally, the Axiom creates a virtual copy of the LUN in the new QoS and migrates the data from the original LUN as a background task, “BackgroundProgressVolumesCmTask”, which may take several hours to complete on larger LUNs. Once the data migration is complete, the references used for the migration are removed and the storage used in the original QoS for the LUN is released. If a LUN copy or restore operation is attempted while this background task is still in progress, it may trigger a race condition as the background task completes.
The race condition may result in an invalid reference being stored in the system Configuration On Disk (COD). If this occurs, the next time that reference is used, the Slammer CUs will begin experiencing multiple software panics. As each CU is disabled after multiple panics, the next CU will panic until all CUs are offline. At that point, the Axiom will initiate a restart, and that restart will fail with COLD START FAILED status.
Occurrence
This issue can occur on the following platforms:
- Pillar/Axiom Ax500 storage systems on patch levels from 05.03.07 up to but not including release 05.04.09
- Pillar/Axiom Ax600 storage systems on patch levels from 05.03.07 up to but not including release 05.04.09
- Pillar/Axiom Ax300, Ax500, or Ax600 storage systems on patch levels from 04.06.06 up to but not including release 04.06.16
Notes:
1. No other Pillar/Axiom systems or software revisions are affected by this issue.
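The ranges listed above can be checked mechanically. The following is a minimal sketch, assuming the release is available as a dotted string such as "05.04.08"; the function and table names are illustrative only and are not part of any Axiom tool.

```python
# Affected ranges as stated in the Occurrence list:
# inclusive first affected release, exclusive fixed release.
AFFECTED_RANGES = [
    {"models": {"Ax500", "Ax600"}, "first": (5, 3, 7), "fixed": (5, 4, 9)},
    {"models": {"Ax300", "Ax500", "Ax600"}, "first": (4, 6, 6), "fixed": (4, 6, 16)},
]


def parse_release(text):
    """Turn '05.04.08' or '5.4.8' into a comparable tuple such as (5, 4, 8)."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {text!r}")
    return tuple(int(part) for part in parts)


def is_affected(model, release):
    """Return True if the model/release pair falls inside a listed range."""
    version = parse_release(release)
    for entry in AFFECTED_RANGES:
        if model in entry["models"] and entry["first"] <= version < entry["fixed"]:
            return True
    return False
```

For example, `is_affected("Ax600", "05.04.08")` is true, while `is_affected("Ax600", "05.04.09")` is false because 05.04.09 is the fixed release.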
Symptoms
If a combination of LUN Copy and QoS change is done, the Axiom Slammer CU that controls the system configuration may panic and warm start. This will continue until that CU fails over. Another CU will be nominated to control the system configuration, and that CU will panic and warm start until it fails over. This will repeat until all Slammer CUs have failed, at which time the Axiom will attempt to restart to recover. The restart will fail, and the system status will either be COLD START FAILED or the system will be hung in SYSTEM COLD START state.
The result is repeated Slammer CU SEGV panics in ConMan. These begin on the Master node and repeat until it fails over. A new Master is then elected, which panics until it fails over in turn. This repeats until all Slammer CUs have failed. The PacMan pilotcfg process then declares “all nodes dead” and initiates a system restart. That restart fails, leaving the system in a COLD START FAILED or HUNG COLD START IN PROGRESS condition.
As the background task “BackgroundProgressVolumesCmTask” completes, ConMan attempts to detach the LUN copy from its parent, or the LUN from the migration parent used for recovery. During this detach, the race condition may create a bad record or an invalid record pointer in COD for the Failover/Failback sequence that describes the Slammer CUs the LUN may be assigned to.
The repeated panics will typically have entries similar to these in the Slammer TDS log:
DEBUG CM DetachBsAssociationTask.c 416 "Hit state FIX_STANDALONE_VLUNS"
NOTICE CM DetachBsAssociationTask.c 433 "FixFailoverSequence after detach, sourceVlunId 0x78, targetVlunId"0x0.
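When reviewing a saved copy of a Slammer TDS log, a simple text scan for these entries can help confirm the signature. The sketch below assumes the log is plain text with one entry per line; beyond the two lines quoted above, nothing about the log layout is known here, so it matches only on the strings those lines contain. The function name is hypothetical.

```python
import re

# Strings taken from the two TDS log lines quoted above.
DETACH_SOURCE = "DetachBsAssociationTask.c"
SIGNATURES = ("FIX_STANDALONE_VLUNS", "FixFailoverSequence after detach")
SOURCE_VLUN = re.compile(r"sourceVlunId\s+(0x[0-9A-Fa-f]+)")


def find_detach_signatures(lines):
    """Yield (line_number, source_vlun_id_or_None, text) for matching lines."""
    for number, line in enumerate(lines, start=1):
        if DETACH_SOURCE not in line:
            continue
        if not any(signature in line for signature in SIGNATURES):
            continue
        match = SOURCE_VLUN.search(line)
        yield number, (match.group(1) if match else None), line.rstrip()
```

Passing an open file object (or any list of strings) to `find_detach_signatures` yields one tuple per matching line. A match suggests this issue but does not by itself confirm it.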
If an Axiom without patch 5.4.9 or 4.6.16 installed encounters this condition:
- R5.x systems can install patch 5.4.10 to recover after the issue has been confirmed. Patch 5.4.10 must be installed disruptively only when it is used to recover from this condition; otherwise it can be installed as a normal non-disruptive update.
- Do not attempt recovery of R4.x systems by installing patch 4.6.16. Contact Oracle Pillar Engineering for assistance.
Workaround
To avoid this issue:
1. Install the appropriate patch for your system as soon as possible:
- R5.x systems: Install Patch 05.04.09 (listed below).
- R4.x systems: Install Patch 04.06.16 (listed below).
2. If you copy a LUN and want to make a QoS change for the copy, set the new QoS in the same GUI operation, so the LUN copy is created in the new QoS category. Wait until the “BackgroundProgressVolumesCmTask – Copy_Name” task in the task list is completed before making any other changes to the copy.
3. If you copy a LUN, look in the Task List for a “BackgroundProgressVolumesCmTask – Copy_Name” task. Do not attempt to modify the QoS as long as that task exists. Once that task no longer appears in the task list, you can safely make the desired QoS change.
NOTE: If you change the QoS of a LUN, look in the task list for a “BackgroundProgressVolumesCmTask – LUN_Name” task. Do not attempt to copy that LUN_Name as long as that task exists. Once that task no longer appears in the task list, you can safely copy the LUN.
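The task-list check in the steps above can be expressed as a small guard. This is only a sketch: it assumes the task names have already been collected as text (for example, copied from the GUI task list), and it does not call any Axiom interface. The function names are hypothetical.

```python
import re

# Matches "BackgroundProgressVolumesCmTask - <name>", accepting either a
# hyphen or an en dash (U+2013) as the separator before the LUN name.
TASK_PATTERN = re.compile(r"^BackgroundProgressVolumesCmTask\s*[-\u2013]\s*(.+?)\s*$")


def background_task_pending(task_names, lun_name):
    """Return True if a background copy or migration task is listed for lun_name."""
    for name in task_names:
        match = TASK_PATTERN.match(name.strip())
        if match and match.group(1) == lun_name:
            return True
    return False


def safe_to_modify(task_names, lun_name):
    """A QoS change or copy should wait while the background task is listed."""
    return not background_task_pending(task_names, lun_name)
```

With a task list containing a "BackgroundProgressVolumesCmTask" entry for `LUN_A`, `safe_to_modify(tasks, "LUN_A")` is false and `safe_to_modify(tasks, "LUN_B")` is true.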
This issue is addressed on the following platforms:
- R5.x Pillar/Axiom Ax500 storage systems with software revision 05.04.09 (patch 18019845) or later
- R5.x Pillar/Axiom Ax600 storage systems with software revision 05.04.09 (patch 18019844) or later
- R4.x Pillar/Axiom Ax300 storage systems with software revision 04.06.16 (patch 18169228) or later
- R4.x Pillar/Axiom Ax500 storage systems with software revision 04.06.16 (patch 18169226) or later
- R4.x Pillar/Axiom Ax600 storage systems with software revision 04.06.16 (patch 18169224) or later
NOTE: On R5.x systems, if software 5.4.9 or later is not installed and the system encounters this condition, the system may be recovered by installing patch 5.4.10, but the upgrade must be performed as a disruptive upgrade with selected overrides for the system condition. On R4.x systems, if software 4.6.16 or later is not installed, do not attempt to recover by installing patch 4.6.16. Contact Oracle Customer Support for assistance.
For links to the above software patches, release notes, and more complete information on this and other issues, see:
<Document:1611335.1>: Pillar Axiom: Ax600 & Ax500 Patch 05.04.09 for SAN
<Document:1618292.1>: Pillar Axiom: Ax600 & Ax500 Patch 05.04.10 for SAN
<Document:1618332.1>: Pillar Axiom: Ax600, Ax500, Ax300 Patch 04.06.16 for NAS and SAN
Patches
Please download and install the correct patch for your Axiom model and software revision as follows:
To prevent: R5.x Axiom systems:
<SUNPATCH:18019845> Ax500 5.4.9
<SUNPATCH:18019844> Ax600 5.4.9
To recover: R5.x Axiom systems:
<SUNPATCH:18169222> Ax500 5.4.10
<SUNPATCH:18169221> Ax600 5.4.10
To prevent: R4.x Axiom systems:
<SUNPATCH:18169228> Ax300 4.6.16
<SUNPATCH:18169226> Ax500 4.6.16
<SUNPATCH:18169224> Ax600 4.6.16
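For reference, the patch list and recovery guidance above can be captured as a lookup. The patch numbers are copied from this document; the function name and the "R5"/"R4" keys are illustrative assumptions, and this sketch is not a substitute for the release notes.

```python
# (release line, model) -> (patch number, software revision), from the list above.
PREVENT_PATCHES = {
    ("R5", "Ax500"): ("18019845", "5.4.9"),
    ("R5", "Ax600"): ("18019844", "5.4.9"),
    ("R4", "Ax300"): ("18169228", "4.6.16"),
    ("R4", "Ax500"): ("18169226", "4.6.16"),
    ("R4", "Ax600"): ("18169224", "4.6.16"),
}
RECOVER_PATCHES = {
    ("R5", "Ax500"): ("18169222", "5.4.10"),
    ("R5", "Ax600"): ("18169221", "5.4.10"),
}


def recommend_action(release_line, model, condition_hit=False):
    """Return a one-line recommendation for a (release line, model) pair."""
    key = (release_line, model)
    if key not in PREVENT_PATCHES:
        raise ValueError(f"no patch listed for {model} on {release_line}")
    if not condition_hit:
        patch, version = PREVENT_PATCHES[key]
        return f"Install patch {patch} ({model} {version}) to prevent the issue."
    if key in RECOVER_PATCHES:
        patch, version = RECOVER_PATCHES[key]
        return f"Install patch {patch} ({model} {version}) as a disruptive upgrade to recover."
    # Only R4.x pairs reach this point: no recovery patch is listed for them.
    return "Do not install 4.6.16 to recover; contact Oracle support."
```

Calling `recommend_action("R5", "Ax600")` names patch 18019844, while `recommend_action("R4", "Ax300", condition_hit=True)` returns the instruction to contact support rather than a patch.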
History
10-Jan-2014: Document released; status Workaround
04-Feb-2014: Fix patches released; All sections updated for Resolution information
This issue is a regression caused by a fix for an older bug in which LUN copies could not be assigned separately from their parents.
That change was made in 05.03.07 and affects all R5 versions from 05.03.07 through 05.04.08. The same change was made in 04.06.06 and affects all R4 versions from 04.06.06 through 04.06.15.
Internal Contributor/Submitter: lon.stowell@oracle.com
Internal Eng Responsible Engineer: lon.stowell@oracle.com
Internal Services Knowledge Engineer: david.mariotto@oracle.com
Internal Eng Business Unit Group: Pillar/Axiom Storage
Internal Resolution Patches: 05.04.09