Asset ID: 1-71-2291209.1
Update Date: 2017-08-02
Solution Type: Technical Instruction Sure Solution
2291209.1: Oracle ZFS Storage Appliance (ZFSSA): Remote Replication Action May Be Destroyed After Replication Reversal
Related Items
- Sun ZFS Storage 7320
- Oracle ZFS Storage ZS3-BA
- Oracle ZFS Storage ZS5-4
- Oracle ZFS Storage ZS3-4
- Sun ZFS Storage 7420
- Oracle ZFS Storage ZS5-2
- Oracle ZFS Storage ZS4-4
- Sun ZFS Storage 7120
Related Categories
- PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
Applies to:
Oracle ZFS Storage ZS4-4 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-BA - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7420 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-4 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Goal
This document describes in more detail the conditions and corrective action related to Service Alert
2286584.1 - ZFS Storage Appliance (ZFSSA) Replication Actions May Be Missing After Package
Reversal.
Conditions
For software versions OS8.7.0 through OS8.7.7, reversing a replication package may create a damaged
replication action that will be destroyed during the next software update, takeover, failback, reboot,
powerup, or restart of the management interface (that is, restart of the appliance kit daemon - akd).
Once the replication action is destroyed, the replication package that was being maintained by the lost
replication action will no longer receive updates.
This unintended destruction of the action can occur when all of the following conditions are met:
• The source and/or target were updated to OS8.7.0 or higher from a software release prior to
OS8.7.0.
• The replication action is in a project/package replication pair that existed while the source
and/or target were at a software release prior to OS8.7.0.
• While running any of the software versions OS8.7.0 through OS8.7.7 the project/package
replication pair was reversed via one of the following methods:
◦ BUI (Browser User Interface) reverse operation
◦ CLI command pkgreverse
◦ RESTful API pkgreverse resource
This circumstance creates a damaged replication action in the project created by the reversal operation.
This damage will not become apparent until a subsequent pool import operation takes place. The
operations listed above: software update, takeover, failback, etc., perform a pool import operation that
destroys all damaged replication actions in the pool being imported.
Note in particular that once a replication action has been damaged due to the circumstance described
above, the act of installing a software update of any software version, older or newer, can destroy the
damaged replication action. Preventing the destruction of the replication action requires determining
that damaged action(s) have been created and carrying out an action repair process before proceeding
with the software update, takeover, failback, etc.
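The conditions above can be summarized as a small predicate. The following is only an illustrative sketch, not part of any Oracle tooling: the function names, the method labels ("bui", "cli-pkgreverse", and so on) and the single `pair_predates_os87` flag are hypothetical, and the sketch assumes the user interface patch described later in this document has not yet been applied.

```python
import re

# Affected range taken from the text above: OS8.7.0 through OS8.7.7.
AFFECTED_MIN = (8, 7, 0)
AFFECTED_MAX = (8, 7, 7)

# Hypothetical labels for the reverse methods the text identifies as
# creating damaged actions (BUI reverse, CLI pkgreverse, REST pkgreverse).
DAMAGING_METHODS = {"bui", "cli-pkgreverse", "rest-pkgreverse"}


def parse_os_version(text):
    """Parse a string such as 'OS8.7.3' into a tuple like (8, 7, 3)."""
    match = re.fullmatch(r"OS(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def reversal_may_damage_action(version_at_reversal, pair_predates_os87, method):
    """Return True when a reversal matches all of the listed conditions.

    pair_predates_os87: the replication pair existed while the source and/or
    target ran a release prior to OS8.7.0 (which implies a later update).
    """
    version = parse_os_version(version_at_reversal)
    in_range = AFFECTED_MIN <= version <= AFFECTED_MAX
    return in_range and pair_predates_os87 and method in DAMAGING_METHODS
```

For example, `reversal_may_damage_action("OS8.7.3", True, "bui")` evaluates to True, while the same call with the hypothetical label "cli-reverse" evaluates to False. A True result only means the reversal fits the described conditions; the workflow described in the Solution section remains the way to confirm actual damage.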
Solution
The ultimate goal is to update the storage appliance software to a future version that resolves bug
26367379. However, on an appliance that is currently running any of the software versions OS8.7.0
through OS8.7.7, the act of installing the software update may itself destroy damaged replication
actions. These damaged replication actions must therefore be repaired before installing the future
software version that eliminates the generation of damaged replication actions.
For appliances running OS8.7.0 through OS8.7.7, before applying any software update or performing a
takeover, failback, reboot, powerup, or restart of the management interface, download and run the
Report Damaged Actions and Patch User Interface workflow. It is best to run this workflow
immediately to apply the user interface patch that prevents further creation of damaged replication
actions and then run this workflow again prior to a software update, takeover, failback, etc. to verify
that there are no remaining damaged replication actions.
The Report Damaged Actions and Patch User Interface workflow can be downloaded by clicking on
this link:
report_damaged_actions_patch_ui.akwf
Instructions for loading and executing workflows are available in the on-line help facility within the
storage appliance and also in the Oracle ZFS Storage Appliance Administration Guide. Refer to the
“Maintenance Workflows” help topic.
Following a code update that installs any of the software versions OS8.7.0 through OS8.7.7, run this
workflow (again) so that it can patch the newly installed user interface code to prevent further damage
to replication actions.
The Report Damaged Actions and Patch User Interface workflow performs two operations:
1. Examine each replication action and report the number of damaged actions.
2. Apply a patch to the storage appliance user interface code to prevent replication reverse
operations from creating damaged replication actions. This patch changes the BUI reverse
operation and the CLI and RESTful API pkgreverse commands so that they no longer preserve
replication action properties. Only future software versions that resolve bug 26367379 are
capable of correctly preserving replication action properties.
When there are no damaged replication actions, the workflow reports:
No damaged actions found due to 26367379.
If there are damaged replication actions, the workflow reports, for example:
Found 1 actions affected by 26367379. Please contact support to repair these action properties.
If this message is reported, please contact support before proceeding with operations that will destroy
damaged replication actions: software update, takeover, failback, reboot, powerup, or restart of the
management interface. The damage must be repaired to prevent the unintended destruction of the
replication action(s).
After reporting the number of damaged replication actions, the workflow determines whether the currently
installed appliance software version is capable of creating damaged replication actions during reverse.
If so, the workflow patches the user interface code to disable the preservation of replication action properties
when reversing a replication package. It thus backs out the replication reverse enhancement that addresses:
16570437 - Replication action properties are not preserved after reversal
The creation of damaged replication actions is due to an issue in the implementation of this
enhancement.
If the currently installed appliance software contains the fix for 26367379, the workflow reports:
No UI patch needed.
Otherwise, the first time this workflow is executed, it applies the patch to the user interface code and reports:
UI files have been patched. Please exit any open CLI sessions, log out
of the BUI, and reload the BUI page for the changes to take effect.
When this workflow is run again, it reports:
UI files already patched.
Finally, the workflow reports:
Additional logs are in the dropbox.
The next time a support bundle is generated for this storage appliance, the logs generated by this
workflow will be placed into the support bundle.
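When the workflow is run as part of a maintenance checklist, its report can be checked mechanically. The sketch below assumes the workflow output has already been captured as text by some means not covered here; it matches only the message strings quoted above, and the function names and status labels are hypothetical.

```python
import re

# Message strings quoted from the workflow report described above.
FOUND_RE = re.compile(r"Found (\d+) actions? affected by 26367379\.")
NONE_MSG = "No damaged actions found due to 26367379."


def summarize_workflow_output(output):
    """Return (damaged_count, patch_status) from the workflow's report text.

    damaged_count is None when neither damage message is present.
    """
    damaged = None
    found = FOUND_RE.search(output)
    if found:
        damaged = int(found.group(1))
    elif NONE_MSG in output:
        damaged = 0

    if "No UI patch needed." in output:
        patch = "not-needed"
    elif "UI files already patched." in output:
        patch = "already-patched"
    elif "UI files have been patched." in output:
        patch = "patched-now"
    else:
        patch = "unknown"
    return damaged, patch


def safe_to_proceed(output):
    """True only when the report explicitly states zero damaged actions."""
    damaged, _ = summarize_workflow_output(output)
    return damaged == 0
```

The check is deliberately conservative: if the report cannot be recognized, `safe_to_proceed` returns False rather than assuming there is no damage.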
Avoiding Damage to Replication Actions Prior to Running the Report and Patch Workflow
It is best to run the Report Damaged Actions and Patch User Interface workflow immediately, because it
patches the user interface code so that replication reversal no longer creates a damaged replication
action. However, if running the workflow must be delayed, use only the reverse command
forms that do not create damaged replication actions.
Prior to the running of the workflow, avoid using the following methods to reverse a replication
package:
• BUI (Browser User Interface)
• CLI command pkgreverse
• RESTful API pkgreverse resource
Instead, use one of the following methods to reverse a replication package:
• CLI command reverse
• RESTful API reverse resource
These two reverse methods are deprecated in OS8.7 and later software, but they are still supported.
Because the reverse command and the reverse resource are deprecated, they are not documented
in the on-line help or in the OS8.7 Oracle ZFS Storage Appliance Administration Guide.
The CLI reverse command can be issued from within the context of a package within a replication
source. It accepts one optional parameter, the name of the project into which the package's contents will
be put. By default, the new project will retain the name of the project within the package. For example:
mynode:shares replication source-000 package-001> reverse
or
mynode:shares replication source-000 package-001> reverse new_proj
Similarly, in the RESTful API, the reverse resource is available within a package within a replication
source.
Example Request:
PUT /api/storage/v1/replication/sources/zfssa-repl/packages/8373d331-de60-e590-90e8-9ad69fcb4aec/reverse HTTP/1.1
Host: zfs-storage.example.com
Authorization: Basic ab6rt4psMWE=

{"projname":"restrev"}
Success Response:
HTTP/1.1 202 Accepted
X-Zfssa-Replication-Api: 1.0
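The request above could be assembled with the Python standard library as in the following sketch. The base URL scheme, the credentials, the Content-Type header and the function name are assumptions made for illustration and do not come from this document; the path, method and body follow the example request. The sketch only builds the request object and does not send it.

```python
import base64
import json
import urllib.request


def build_reverse_request(base_url, source, package_id, projname, user, password):
    """Build (but do not send) a PUT request for the deprecated reverse resource."""
    url = (
        f"{base_url}/api/storage/v1/replication/sources/"
        f"{source}/packages/{package_id}/reverse"
    )
    body = json.dumps({"projname": projname}).encode("utf-8")
    # HTTP Basic authentication, as in the example request above.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            # Assumption: the example request does not show a content type.
            "Content-Type": "application/json",
        },
    )


request = build_reverse_request(
    "https://zfs-storage.example.com",       # assumed scheme; host from the example
    "zfssa-repl",
    "8373d331-de60-e590-90e8-9ad69fcb4aec",
    "restrev",
    "admin_user",                            # hypothetical credentials
    "example_password",
)
```

Passing the resulting object to `urllib.request.urlopen` would submit it; per the example above, a successful call is answered with 202 Accepted.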
PRIVATE PORTION OF DOCUMENT FOR ORACLE INTERNAL USE ONLY
NOTE: Please contact Jeremy Uejio for any issue concerning this section.
When the Report Damaged Actions and Patch User Interface workflow reports the existence of one or
more damaged replication actions, the Bug 26367379 Repair script can be used to repair the damaged
replication actions so that they are not destroyed by a pool import operation (which includes software
update, takeover, failback, etc.).
The Bug 26367379 Repair script must be executed on each node that is the current owner of one or
more pools containing damaged replication actions. Do not run this script on a cluster node with no
pools.
The Bug 26367379 Repair script is available here:
26367379_repair_v2.1.sh
This is a bash script that must be copied into a writable directory within the storage appliance and
executed from the Solaris shell. For example, you can create a directory under /var/tmp and write the
script into that directory.
From the comments at the top of the repair script:
# Usage: ./26367379_repair_v2.1.sh [-t] [-d]
#
# Options:
# -t test and report only (do not disable akd or fix actions)
# -d print debug messages to log file
This script must disable akd while it repairs the replication actions and then re-enable akd when the
repair is complete. Disabling and enabling akd will disrupt all active BUI sessions and CLI sessions.
Requests issued through the RESTful API will fail during the time that akd is disabled.
In a clustered system, it is crucial that this script only be run on one cluster node at a time and be run
only when akd has joined the cluster on both cluster nodes and the cluster links are active. The cluster
state can be verified by the CLI command "configuration cluster show". The state and peer_state
should show one of three cluster states: AKCS_STRIPPED, AKCS_OWNER or AKCS_CLUSTERED.
If the state is AKCS_BOOT_RX_JOIN, please wait for a few minutes. If the state is any other state,
then an error has occurred and will need to be investigated before running the repair script. The cluster
link state can be verified via the CLI by running "configuration cluster links" on both cluster nodes.
The link status should be AKCIOS_ACTIVE.
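The cluster checks described above can be expressed as a simple decision function. This is a sketch only: it assumes the state, peer_state and link status values have already been read from the CLI output by hand or by other means, and the function name and return labels are hypothetical.

```python
# State names quoted from the text above.
VALID_STATES = {"AKCS_STRIPPED", "AKCS_OWNER", "AKCS_CLUSTERED"}
JOINING_STATE = "AKCS_BOOT_RX_JOIN"
ACTIVE_LINK = "AKCIOS_ACTIVE"


def cluster_readiness(state, peer_state, link_states):
    """Classify cluster readiness before running the repair script.

    Returns one of: "ready", "wait", "investigate", "links-not-active".
    """
    states = (state, peer_state)
    # Any state other than the three valid ones or the joining state is an error.
    if any(s not in VALID_STATES and s != JOINING_STATE for s in states):
        return "investigate"
    # A node that is still joining should be given a few minutes.
    if JOINING_STATE in states:
        return "wait"
    # Every cluster link must report the active status.
    if not link_states or any(link != ACTIVE_LINK for link in link_states):
        return "links-not-active"
    return "ready"
```

For instance, `cluster_readiness("AKCS_CLUSTERED", "AKCS_BOOT_RX_JOIN", ["AKCIOS_ACTIVE"])` returns "wait", matching the guidance to wait a few minutes in that state.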
After running this script on one node of a cluster, ensure that the akd restart has completed, the node
has rejoined the cluster, both nodes show a valid cluster state and the cluster links are active before
running this script on the other cluster node.
Running the repair script in test mode (-t) is useful to verify that there are damaged replication actions
to be repaired. In test mode, the script does not disable and re-enable akd. Thus, in test mode, the script
does not interfere with the appliance's administrative interface. Here is an example showing the use of
the repair script in test mode.
zs_node1# ./26367379_repair_v2.1.sh -t
Writing log file to: /var/ak/logs/26367379_repair.txt
Looking for damaged replication action properties.
Finished processing action properties. 1 were affected and 0 were fixed.
Test complete, exiting.
This shows that one action was affected by bug 26367379. This action must be repaired so that it will
not be unintentionally destroyed.
After ensuring that the customer can tolerate a brief outage of the appliance's administrative interface
and, for a clustered system, that both nodes show a valid cluster state and the cluster links are active,
run the repair script in repair mode (that is, without -t).
zs_node1# ./26367379_repair_v2.1.sh
Writing log file to: /var/ak/logs/26367379_repair.txt
Disabling akd.
Looking for damaged replication action properties.
Finished processing action properties. 1 were affected and 1 were fixed.
Enabling akd.
Workaround complete, exiting.
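The summary line printed by the repair script can be checked to confirm that every affected action was fixed. The sketch below assumes the script output has been captured as text and that the summary line has the form shown in the examples above; the function names are hypothetical.

```python
import re

# Summary line format taken from the example script output above.
SUMMARY_RE = re.compile(
    r"Finished processing action properties\. "
    r"(\d+) were affected and (\d+) were fixed\."
)


def parse_repair_summary(output):
    """Return (affected, fixed) from the repair script's summary line."""
    match = SUMMARY_RE.search(output)
    if match is None:
        raise ValueError("summary line not found in script output")
    return int(match.group(1)), int(match.group(2))


def repair_complete(output):
    """True when the number of fixed actions equals the number affected."""
    affected, fixed = parse_repair_summary(output)
    return affected == fixed
```

With the test-mode output shown earlier (1 affected, 0 fixed), `repair_complete` returns False; with the repair-mode output (1 affected, 1 fixed), it returns True.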
References
<BUG:26367379> - REPLICATION ACTION LOST AFTER BUI REVERSE OR PKGREVERSE AND TAKEOVER + FAILBACK