
Asset ID: 1-72-1584045.1
Update Date: 2016-03-08
Keywords:

Solution Type: Problem Resolution Sure

Solution 1584045.1: Sun Storage 7000 Unified Storage System: Multiple NDMP job streams continue to run without moving any data


Related Items
  • Sun ZFS Storage 7420
  • Oracle ZFS Storage ZS3-2
  • Sun Storage 7110 Unified Storage System
  • Oracle ZFS Storage ZS4-4
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7410 Unified Storage System
  • Sun ZFS Storage 7120
  • Sun Storage 7310 Unified Storage System
  • Oracle ZFS Storage ZS3-4
  • Oracle ZFS Storage Appliance Racked System ZS4-4
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-BA
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Changes
Cause
Solution


Applies to:

Oracle ZFS Storage ZS3-BA - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
Oracle ZFS Storage Appliance Racked System ZS4-4 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms

Periodically, multiple NDMP job streams on the appliance can become 'runaways', i.e. the job streams continue to run without moving any data.

This state persists until the NDMP service is restarted.
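
As a workaround, restarting the NDMP service clears the runaway job streams. A minimal example from the appliance CLI is shown below; the prompt is a placeholder and the exact syntax may differ slightly between software releases (the service can also be disabled and re-enabled from the BUI under Configuration > Services > NDMP):

 hostname:> configuration services ndmp
 hostname:configuration services ndmp> disable
 hostname:configuration services ndmp> enable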

 

Changes

None

Cause

This problem occurs when an exchange of tapes during a backup takes a long time (more than roughly 10 minutes): the backup job hangs because ndmpd reports to OSB that the job is still active, even though the job has already failed with an internal error.

 

Solution

Check the ndmp log from the support bundle; you may see messages similar to those below.

The recent backups report as finished, but there is no confirmation of whether the data movement actually progressed:

5/28 19:50:05 Added ZFS_BACKUP_SIZE env: 285473794256
5/28 19:50:05 Backing up "pool-1/local/vte-svl-cluster6/vte-svl-cluster6" finished.
5/28 19:50:06 data bytes_total(including header):285473794256
5/28 19:50:06 HIST is not set.  No file history will be generated.
5/28 20:05:16 Added ZFS_BACKUP_SIZE env: 308243469924
5/28 20:05:16 Backing up "pool-1/local/vte-svl-clx/vte-svl-clx" finished.
5/28 20:05:17 data bytes_total(including header):308243469924
5/28 20:05:17 HIST is not set.  No file history will be generated.
5/28 20:14:49 Added ZFS_BACKUP_SIZE env: 354268620552
5/28 20:14:49 Backing up "pool-1/local/un-svl-sram5/un-svl-sram5" finished.
5/28 20:14:49 data bytes_total(including header):354268620552
5/28 20:14:49 HIST is not set.  No file history will be generated.
5/28 20:35:45 Added ZFS_BACKUP_SIZE env: 371113201288
5/28 20:35:45 Backing up "pool-1/local/vte-svl-verif6/vte-svl-verif6" finished.
5/28 20:35:46 data bytes_total(including header):371113201288
5/28 20:35:46 HIST is not set.  No file history will be generated.
5/28 20:37:09 Added ZFS_BACKUP_SIZE env: 446473523648
5/28 20:37:09 Backing up "pool-1/local/vte-svl-mcu4/vte-svl-mcu4" finished.
5/28 20:37:10 data bytes_total(including header):446473523648
5/28 20:37:10 HIST is not set.  No file history will be generated.

 

Further along in the log you may see entries similar to the following:

5/17  3:21:53 env(ZFS_MODE) not specified, defaulting to recursive
5/17  3:21:53 env(ZFS_FORCE) not specified, defaulting to FALSE
5/17  3:21:53 env(UPDATE): "n"
5/17  3:21:53 env(DMP_NAME) not specified, defaulting to 'level'
5/17  3:21:53 *** E R R O R *** [ndmpd_zfs_pre_backup:1334]:Pre-backup plug-in: Error 0
5/17  3:21:53 *** E R R O R *** [ndmpd_zfs_start_op:1724]:pre_backup error
5/17  3:21:53 *** E R R O R *** [ndmpd_data_get_env_v4:938]:Backup operation not active.
5/17  4:24:14 *** E R R O R *** [ndmpd_remote_write:1555]:Socket write error: Broken pipe.
5/17  4:24:14 *** E R R O R *** [ndmpd_zfs_backup_tape_write:935]:MOD_WRITE error
5/17  4:24:14 *** E R R O R *** [ndmpd_zfs_abort:1298]:ndmpd_zfs_abort() called...aborting backup operation
5/17  4:24:14 *** E R R O R *** [ndmpd_remote_write:1555]:Socket write error: Broken pipe.
5/17  4:24:14 *** E R R O R *** [ndmpd_zfs_backup_tape_write:935]:MOD_WRITE error
5/17  4:24:14 *** E R R O R *** [ndmpd_zfs_abort:1298]:ndmpd_zfs_abort() called...aborting backup operation
5/17  4:24:14 *** E R R O R *** [ndmpd_zfs_backup_send_read:800]:zfs_send: -1
5/17  4:24:14 *** E R R O R *** [ndmpd_zfs_backup_send_read:800]:cannot send 'pool-1/local/vte-serc/vte-serc'--Bad file number

 

This matches Bug 16432233 - NDMP: BAD FILE NUMBER ERROR ABORTING BACKUPS (a duplicate of Bug 15813688).


While the jobs are hung, collect an ndmpd core file to confirm that the OSB backup is hanging because a tape exchange took too long.

Check for the following stack signature:
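
(The thread listing below is in the format produced by the Solaris pstack utility; once the core file has been collected, an equivalent listing can be generated with, for example, "pstack /path/to/ndmpd-core" - the path shown here is only a placeholder.)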

-----------------  lwp# 7 / thread# 7  --------------------
 feec83d5 lwp_wait (9, fe19ee4c)
 feebfaff _thrp_join (9, 0, fe19eedc, 1) + 65
 feebfc9b pthread_join (9) + 31
 0808f6d5 ndmpd_zfs_reader_writer (fe82c9e0, fe19eee4, fe19eee8, fe19eeec) + f1
 0808e929 ndmpd_zfs_backup (fe82c9e0, 3) + 99
 0808e7e4 ndmpd_zfs_backup_starter (fe82c9e0, fef61000, fe19efe8, feec3519) + 140
 feec356c _thrp_setup (fed93240) + 9d
 feec3810 _lwp_start (fed93240, 0, 0, 0, 0, 0)
-----------------  lwp# 9 / thread# 9  --------------------
 feec3869 lwp_park (0, 0, 0)
 feebd203 cond_wait_queue (fe82ee8c, fe82ee5c, 0, feebd720) + 63
 feebd798 __cond_wait (fe82ee8c, fe82ee5c, fdf7fe98, feebd7e1) + 89
 feebd7ef cond_wait (fe82ee8c, fe82ee5c, 80603dc, feebd827) + 27
 feebd83c pthread_cond_wait (fe82ee8c, fe82ee5c, bb8c, feebc07c) + 24
 0808d844 ndmpd_get_free_buf (fe82ee50, fc00, fdf7ff2c, fdf7ff30) + 54
 0808edeb ndmpd_zfs_backup_buf_write (fe82c9e0, fef61000, fdf7ffe8, feec3519) + 12b
 feec356c _thrp_setup (fed94240) + 9d
 feec3810 _lwp_start (fed94240, 0, 0, 0, 0, 0)
-----------------  lwp# 10 / thread# 10  --------------------
 0808dadc indirect_start_routine(), exit value = 0xffffffff
        ** zombie (exited, not detached, not yet joined) **

 

Here the main backup thread (#7) is waiting to join the buffering thread (#9), which is stuck waiting for free buffers. Free buffers can never arrive, because thread #10, which returns buffers to the free list as data are sent over the wire, has already exited.

All of the buffers therefore remain dirty after the socket write failed, and the buffering thread waits forever for a free buffer: a deadlock.

The fix is to call ndmpd_release_free_buf() when the MOD_WRITE in ndmpd_zfs_backup_tape_write() fails.
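
The following is a minimal, self-contained C sketch of the pattern described above; it is not the ndmpd source, and the buf_pool_t type and the get_free_buf()/release_free_buf()/tape_write() helpers are illustrative stand-ins for ndmpd_get_free_buf(), ndmpd_release_free_buf() and ndmpd_zfs_backup_tape_write(). It shows why a thread blocked in pthread_cond_wait() for a free buffer never wakes up unless the error path hands its buffer back, which is exactly what the fix adds.

 /* Illustrative sketch only - not the ndmpd implementation. */
 #include <pthread.h>
 #include <stdio.h>

 typedef struct {
         pthread_mutex_t lock;
         pthread_cond_t  have_free;   /* signalled when a buffer is handed back */
         int             free_bufs;   /* buffers available for reuse            */
 } buf_pool_t;

 /* Consumer side: blocks until a buffer is free.  This is where
  * lwp #9 (ndmpd_get_free_buf) is parked in the stack above.      */
 static void
 get_free_buf(buf_pool_t *p)
 {
         pthread_mutex_lock(&p->lock);
         while (p->free_bufs == 0)
                 pthread_cond_wait(&p->have_free, &p->lock);
         p->free_bufs--;
         pthread_mutex_unlock(&p->lock);
 }

 /* Producer side: hands a buffer back and wakes the waiter. */
 static void
 release_free_buf(buf_pool_t *p)
 {
         pthread_mutex_lock(&p->lock);
         p->free_bufs++;
         pthread_cond_signal(&p->have_free);
         pthread_mutex_unlock(&p->lock);
 }

 /* Sketch of the tape-write path.  Before the fix, a failed write simply
  * returned and the writer thread exited (lwp #10 became a zombie) without
  * returning its buffer, so get_free_buf() waited forever.  The fix is the
  * release on the error path.                                              */
 static int
 tape_write(buf_pool_t *p, int write_ok)
 {
         if (!write_ok) {             /* e.g. "Socket write error: Broken pipe" */
                 release_free_buf(p); /* the call the fix adds                  */
                 return (-1);
         }
         /* ... write the data, then recycle the buffer as usual ... */
         release_free_buf(p);
         return (0);
 }

 int
 main(void)
 {
         buf_pool_t pool = { PTHREAD_MUTEX_INITIALIZER,
             PTHREAD_COND_INITIALIZER, 0 };

         (void) tape_write(&pool, 0); /* simulate the failed MOD_WRITE          */
         get_free_buf(&pool);         /* returns instead of hanging             */
         (void) printf("buffer released on the error path; no deadlock\n");
         return (0);
 }

Built with something like "cc demo.c -o demo -lpthread" (file name is a placeholder), the program exits immediately; if the release_free_buf() call on the error path is removed, the final get_free_buf() blocks forever, reproducing the hang seen on the appliance.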


As described in the Cause section, the hang occurs when a tape exchange during backup takes longer than roughly 10 minutes and ndmpd continues to report the job as active to OSB even though it has failed with an internal error. This is Bug 15813688, which is fixed in the 2011.1.6.0 software release.

Upgrade to the 2011.1.6.0 Appliance Firmware Release (or later) to fix this issue.

 


Attachments
This solution has no attachment