Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1377069.1
Update Date:2016-02-08
Keywords:

Solution Type: Problem Resolution

Solution  1377069.1 :   Sun Storage 7000 Unified Storage System: Shadow Migration Copy Performance Is Slow  


Related Items
  • Sun ZFS Storage 7420
  • Sun Storage 7110 Unified Storage System
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7210 Unified Storage System
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun ZFS Storage 7120
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage Appliance Racked System ZS4-4
  • Oracle ZFS Storage ZS3-BA
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS
  • _Old GCS Categories>Sun Microsystems>Storage - Disk>Unified Storage




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-4904244611>

Applies to:

Sun Storage 7210 Unified Storage System - Version All Versions and later
Sun Storage 7310 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7320 - Version All Versions and later
Sun ZFS Storage 7120 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

Sun Storage 7000 Unified Storage System array Shadow Migration copy jobs have been observed via the BUI to be running for many days, even weeks in some extreme cases.

The copy is still going on in the background for the share in question and the operation is taking longer than expected.

Shadow Migration supports NFS filesystems only at this time; use NFSv4 for best results.



Cause

As long as Shadow Migration is making progress, even if it is slow, there isn't a lot that can be done to speed it up.

If a share to be migrated contains many small files (thousands or millions) and/or many subdirectories, Shadow Migration is probably not the right tool, as it will take a long time to complete. Consider other options such as rsync.

Shadow Migration was not built for speed or performance. It was built for completeness and to run seamlessly in the background.

Monitoring progress of a Shadow Migration is difficult given the context in which the operation runs.

A single filesystem can shadow all or part of a filesystem, or multiple filesystems with nested mountpoints. As such, there is no way to request statistics about the source and have any confidence in them being 100% accurate.

In addition, even when migrating a single filesystem, the methods used to calculate the available size are not consistent across systems.

For example, the remote filesystem may use compression, or it may or may not include metadata overhead. For these reasons, it is impossible to display an accurate progress bar for any particular migration.

The appliance provides the following information that is guaranteed to be accurate:

  • Local size of the local filesystem so far
  • Logical size of the data copied so far
  • Time spent migrating data so far

These values are made available in the BUI and CLI through both the standard filesystem properties as well as properties of the Shadow Migration node (or UI panel).

If you know the size of the remote filesystem, you can use it to estimate progress. The size of the data copied consists only of plain-file contents that needed to be migrated from the source; directories, metadata, and extended attributes are not included in this calculation. While the size of the data migrated so far includes only remotely migrated data, resuming background migration may traverse parts of the filesystem that have already been migrated. This can cause it to run fairly quickly while processing these initial directories, then slow down once it reaches portions of the filesystem that have not yet been migrated.

While there is no accurate measurement of progress, the appliance does attempt to make an estimation of remaining data based on the assumption of a relatively uniform directory tree.

This estimate can range from fairly accurate to completely worthless depending on the set of data, and is for information purposes only.

For example, one could have a relatively shallow filesystem tree but have large amounts of data in a single directory that is visited last.

In this scenario, the migration will appear almost complete, and then rapidly drop to a very small percentage as this new tree is discovered.

Conversely, if that large directory was processed first, then the estimate may assume that all other directories have a similarly large amount of data, and when it finds them mostly empty the estimate quickly rises from a small percentage to nearly complete.

The best way to measure progress is to set up a test migration, let it run to completion, and use that value to estimate progress for filesystems of similar layout and size.
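If the logical size of the source data is known (measured on the source system itself), the "transferred" value reported by the appliance can be turned into a rough percentage. The sketch below uses illustrative numbers; the real values must come from the source server and from the appliance's shadow properties:

```shell
# Rough progress estimate. Both numbers here are illustrative assumptions:
# the source's logical data size must be measured on the source system,
# and the transferred value read from the appliance.
transferred_tib=12.3   # data copied so far, per the appliance
source_tib=20          # logical size of source data, measured on the source
awk -v t="$transferred_tib" -v s="$source_tib" \
    'BEGIN { printf "approx. %.1f%% of plain-file data copied\n", 100 * t / s }'
```

Remember that the transferred figure counts only plain-file contents, not directories or metadata, so this estimate is approximate at best.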

 

Solution

As long as the shadow migration job is making progress, even if it is slow, there isn't a lot that can be done.

To monitor for any possible shadow migration errors via the command line:

s7000:> shares select nas_project01 select data01 shadow show
Properties:
                        source = nfs://10.235.00.00/data01
                   transferred = 12.3T
                     remaining = 8E
                       elapsed = 86h54m
                        errors = 935
                      complete = false
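The fields of this output can be pulled apart with standard text tools once it has been captured to a file (for example, via an ssh session to the appliance CLI). A small sketch, using sample data that mirrors the output above:

```shell
# Captured 'shadow show' output (sample data mirroring the example above).
cat > /tmp/shadow_show.txt <<'EOF'
                        source = nfs://10.235.00.00/data01
                   transferred = 12.3T
                     remaining = 8E
                       elapsed = 86h54m
                        errors = 935
                      complete = false
EOF

# Extract the error count and completion flag; each line is a
# whitespace-separated "name = value" triple.
awk '$1 == "errors" || $1 == "complete" { print $1, $3 }' /tmp/shadow_show.txt
```

A growing errors count is worth investigating before the migration completes.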

 

Shadow migration just wasn't built for speed. It was built for completeness and to be seamless.

Increasing Shadow Migration Performance:

  1. Reduce the number of Shadow Migration filesystems being transferred at one time.
  2. Be aware that large numbers of small files within a share to be migrated cause increased transfer latency and increase the time to completion.
  3. Non-UTF-8 filenames on the source can cause the Shadow Migration job to fail to complete; review the share's 'Reject non UTF-8' normalization setting against the source data before migrating.
  4. One major bug in this area has now been fixed and will be released in the next major update, 2011.1:
    CR 15661489 - changing shadow migration threads or cancelling a migration can lead to a kernel deadlock and may require a restart of the akd appliance process.
  5. One possible option is to increase the number of threads available to Shadow Migration:


EXAMPLE procedure:

CLI>:configuration services shadow> show
Properties:
<status> = online
threads = 8

CLI>:configuration services shadow> set threads=16
threads = 16 (uncommitted)
CLI>:configuration services shadow> commit
CLI>:configuration services shadow> show
Properties:
<status> = online
threads = 16

 

The advice here is to increase this thread value in stages, gauging the impact on other services and array functionality before increasing it again.

 

However, see the major bug above (point 4): increasing the number of threads gives greater resources to Shadow Migration, but it also takes resources away from potentially more critical work, and can lead to deadlocks and hangs if the appliance is not running firmware 2011.1.0 or later.

Review of Support Bundle data supplied by customers who have reported this type of situation has confirmed no problems, errors, alerts, failures, or FM events that would account for slow Shadow Migration progress.

The arrays are functioning correctly, just very slowly in terms of Shadow Migration progress.

If a Shadow migration job has been started and is taking a long time, you need to be patient and just let it complete.

Depending on multiple factors, such as incoming client load and the amount and kind of data to copy, it could take up to several weeks.

Shadow Migration is a background function and will always be given lower priority in the kernel than serving new I/O for client requests.



The following section contains internal information, do not share with customers.


Useful shell commands:

S7000# df -h | grep shadow            # list filesystems with active shadow mounts
S7000# df -h | grep shadow | wc -l    # count shares still under shadow migration
S7000# iostat -xcnz                   # check disk and CPU utilization on the appliance

 

To see what files have not yet been shadow migrated:

S7000# cd /export/new_shadow_share
S7000# find . -type f -exec runat {} ls SUNWshadow 2>/dev/null '&&' echo {} \; | grep -v ^SUNW | tee /var/ak/dropbox/not_migrated_yet

 

To see what files have already been migrated:

S7000# cd /export/new_shadow_share
S7000# find . -type f -exec runat {} ls SUNWshadow 2>/dev/null '||' echo {} \; | grep -v ^SUNW | tee /var/ak/dropbox/already_migrated
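The two lists above can be combined into a rough per-file progress figure by counting lines. The sketch below creates small sample lists locally so the arithmetic is self-contained; on a real system you would count the lists written to /var/ak/dropbox:

```shell
# Sample lists standing in for the find output above (illustrative paths).
mkdir -p /tmp/shadow_lists
printf '%s\n' ./a.txt ./b.txt ./c.txt > /tmp/shadow_lists/already_migrated
printf '%s\n' ./d.txt                 > /tmp/shadow_lists/not_migrated_yet

done_count=$(wc -l < /tmp/shadow_lists/already_migrated)
todo_count=$(wc -l < /tmp/shadow_lists/not_migrated_yet)
awk -v d="$done_count" -v t="$todo_count" \
    'BEGIN { printf "%d of %d files migrated (%.0f%%)\n", d, d + t, 100 * d / (d + t) }'
```

Note that a file count is not a byte count; a few very large files can dominate transfer time, so this figure is only indicative.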

 

From the CLI>

s7000:> raw nas.shadowErrors({pool: "pool-0", project: "data_migration", share: "data01", collection: "local"})

Make sure to use the appropriate pool, project, and share names.


Possible influence of open bugs:

CR 15669804 - Shadow migration goes single-threaded and never tries to go back
CR 15651868 - shadow migration from netapp -> 7310 drops off to trickle
CR 15654495 - migrating fs having large number of smaller files cause appliance to hang
CR 15671951 - Need a summary for all shadow migration volume

References

<BUG:15669804> - SUNBT6985747 SHADOW MIGRATION GOES SINGLE-THREADED AND NEVER TRIES TO GO BACK
<BUG:15651868> - SUNBT6963751 SHADOW MIGRATION FROM NETAPP -> 7310 DROPS OFF TO TRICKLE
<BUG:15581607> - SUNBT6870256 SHADOW MIGRATION: CLI SUPPORT FOR VIEWING INDIVIDUAL ERRORS
<BUG:15654495> - SUNBT6967206 MIGRATING FS HAVING LARGE NUMBER OF SMALLER FILES CAUSE APPLIANCE TO HANG
<BUG:15671951> - SUNBT6988343 NEED A SUMMARY FOR ALL SHADOW MIGRATION VOLUME
<BUG:16203014> - PANIC AT VFS_SHADOW_PENDING_ADD() DURING SHADOW MIGRATION
<NOTE:1213705.1> - Sun Storage 7000 Unified Storage System: Performance issues - Framing the problem
<NOTE:1213714.1> - Sun ZFS Storage Appliance: Performance clues and considerations

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.