Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution - Sure Solution
1584045.1: Sun Storage 7000 Unified Storage System: Multiple NDMP job streams continue to run without moving any data
Applies to:
Oracle ZFS Storage ZS3-BA - Version All Versions and later
Oracle ZFS Storage ZS4-4 - Version All Versions and later
Oracle ZFS Storage Appliance Racked System ZS4-4 - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7420 - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms
Periodically, multiple NDMP job streams on the appliance can become 'runaways', that is, the job streams continue to run without moving any data. This state persists until the NDMP service is restarted.
Changes
None

Cause
The problem occurs when a tape exchange during a backup takes a long time (more than about 10 minutes): the backup job hangs because ndmpd reports to OSB (Oracle Secure Backup) that the job is still active, even though it has already failed with an internal error.
Solution
Check the ndmp log in the support bundle; you may see messages similar to those below:

5/28 19:50:05 Added ZFS_BACKUP_SIZE env: 285473794256
5/28 19:50:05 Backing up "pool-1/local/vte-svl-cluster6/vte-svl-cluster6" finished.
5/28 19:50:06 data bytes_total(including header):285473794256
5/28 19:50:06 HIST is not set. No file history will be generated.
5/28 20:05:16 Added ZFS_BACKUP_SIZE env: 308243469924
5/28 20:05:16 Backing up "pool-1/local/vte-svl-clx/vte-svl-clx" finished.
5/28 20:05:17 data bytes_total(including header):308243469924
5/28 20:05:17 HIST is not set. No file history will be generated.
5/28 20:14:49 Added ZFS_BACKUP_SIZE env: 354268620552
5/28 20:14:49 Backing up "pool-1/local/un-svl-sram5/un-svl-sram5" finished.
5/28 20:14:49 data bytes_total(including header):354268620552
5/28 20:14:49 HIST is not set. No file history will be generated.
5/28 20:35:45 Added ZFS_BACKUP_SIZE env: 371113201288
5/28 20:35:45 Backing up "pool-1/local/vte-svl-verif6/vte-svl-verif6" finished.
5/28 20:35:46 data bytes_total(including header):371113201288
5/28 20:35:46 HIST is not set. No file history will be generated.
5/28 20:37:09 Added ZFS_BACKUP_SIZE env: 446473523648
5/28 20:37:09 Backing up "pool-1/local/vte-svl-mcu4/vte-svl-mcu4" finished.
5/28 20:37:10 data bytes_total(including header):446473523648
5/28 20:37:10 HIST is not set. No file history will be generated.
Further on in the log, you may see entries similar to the following:

5/17 3:21:53 env(ZFS_MODE) not specified, defaulting to recursive
5/17 3:21:53 env(ZFS_FORCE) not specified, defaulting to FALSE
5/17 3:21:53 env(UPDATE): "n"
5/17 3:21:53 env(DMP_NAME) not specified, defaulting to 'level'
5/17 3:21:53 *** E R R O R *** [ndmpd_zfs_pre_backup:1334]:Pre-backup plug-in: Error 0
5/17 3:21:53 *** E R R O R *** [ndmpd_zfs_start_op:1724]:pre_backup error
5/17 3:21:53 *** E R R O R *** [ndmpd_data_get_env_v4:938]:Backup operation not active.
5/17 4:24:14 *** E R R O R *** [ndmpd_remote_write:1555]:Socket write error: Broken pipe.
5/17 4:24:14 *** E R R O R *** [ndmpd_zfs_backup_tape_write:935]:MOD_WRITE error
5/17 4:24:14 *** E R R O R *** [ndmpd_zfs_abort:1298]:ndmpd_zfs_abort() called...aborting backup operation
5/17 4:24:14 *** E R R O R *** [ndmpd_remote_write:1555]:Socket write error: Broken pipe.
5/17 4:24:14 *** E R R O R *** [ndmpd_zfs_backup_tape_write:935]:MOD_WRITE error
5/17 4:24:14 *** E R R O R *** [ndmpd_zfs_abort:1298]:ndmpd_zfs_abort() called...aborting backup operation
5/17 4:24:14 *** E R R O R *** [ndmpd_zfs_backup_send_read:800]:zfs_send: -1
5/17 4:24:14 *** E R R O R *** [ndmpd_zfs_backup_send_read:800]:cannot send 'pool-1/local/vte-serc/vte-serc'--Bad file number
This matches Bug 16432233 NDMP: BAD FILE NUMBER ERROR ABORTING BACKUPS (a duplicate of Bug 15813688).

While the jobs are hung, collect the ndmp core to confirm that the OSB backup hangs when the tape exchange takes a long time:

----------------- lwp# 7 / thread# 7 --------------------
feec83d5 lwp_wait (9, fe19ee4c)
feebfaff _thrp_join (9, 0, fe19eedc, 1) + 65
feebfc9b pthread_join (9) + 31
0808f6d5 ndmpd_zfs_reader_writer (fe82c9e0, fe19eee4, fe19eee8, fe19eeec) + f1
0808e929 ndmpd_zfs_backup (fe82c9e0, 3) + 99
0808e7e4 ndmpd_zfs_backup_starter (fe82c9e0, fef61000, fe19efe8, feec3519) + 140
feec356c _thrp_setup (fed93240) + 9d
feec3810 _lwp_start (fed93240, 0, 0, 0, 0, 0)
----------------- lwp# 9 / thread# 9 --------------------
feec3869 lwp_park (0, 0, 0)
feebd203 cond_wait_queue (fe82ee8c, fe82ee5c, 0, feebd720) + 63
feebd798 __cond_wait (fe82ee8c, fe82ee5c, fdf7fe98, feebd7e1) + 89
feebd7ef cond_wait (fe82ee8c, fe82ee5c, 80603dc, feebd827) + 27
feebd83c pthread_cond_wait (fe82ee8c, fe82ee5c, bb8c, feebc07c) + 24
0808d844 ndmpd_get_free_buf (fe82ee50, fc00, fdf7ff2c, fdf7ff30) + 54
0808edeb ndmpd_zfs_backup_buf_write (fe82c9e0, fef61000, fdf7ffe8, feec3519) + 12b
feec356c _thrp_setup (fed94240) + 9d
feec3810 _lwp_start (fed94240, 0, 0, 0, 0, 0)
----------------- lwp# 10 / thread# 10 --------------------
0808dadc indirect_start_routine(), exit value = 0xffffffff
        ** zombie (exited, not detached, not yet joined) **
The main backup thread (#7) is waiting on the buffering thread (#9), which is stuck waiting for free buffers. Free buffers can never arrive, because thread #10, which returns buffers to the free list as data is sent over the wire, has already exited. All buffers remain dirty because the socket write failed, so thread #9 waits for free buffers forever: a deadlock.

The fix is to call ndmpd_release_free_buf() when MOD_WRITE fails in ndmpd_zfs_backup_tape_write().
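For illustration only, the following minimal C sketch shows this buffer-pool deadlock pattern and why releasing the buffer on the write-error path breaks it. It is not the actual ndmpd source: buf_pool_t, get_free_buf(), release_free_buf() and tape_write() are hypothetical stand-ins for ndmpd_get_free_buf(), ndmpd_release_free_buf() and ndmpd_zfs_backup_tape_write().

/*
 * Minimal sketch of the buffer-pool deadlock, for illustration only.
 * Build with: cc -o bufpool bufpool.c -lpthread
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NBUFS 4

typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cv;
        int             nfree;  /* buffers available to the backup (producer) side */
} buf_pool_t;

static buf_pool_t pool = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NBUFS
};

/* Producer side: block until a free buffer exists (cf. ndmpd_get_free_buf). */
static void
get_free_buf(buf_pool_t *p)
{
        pthread_mutex_lock(&p->lock);
        while (p->nfree == 0)
                pthread_cond_wait(&p->cv, &p->lock);  /* thread #9 blocks here in the hang */
        p->nfree--;
        pthread_mutex_unlock(&p->lock);
}

/* Consumer side: return a buffer to the pool (cf. ndmpd_release_free_buf). */
static void
release_free_buf(buf_pool_t *p)
{
        pthread_mutex_lock(&p->lock);
        p->nfree++;
        pthread_cond_signal(&p->cv);
        pthread_mutex_unlock(&p->lock);
}

/* Stand-in for the tape/socket write path (cf. ndmpd_zfs_backup_tape_write). */
static int
tape_write(buf_pool_t *p, bool write_fails)
{
        if (write_fails) {
                /*
                 * Without the release below, the buffer stays dirty on every
                 * write failure; once all buffers are dirty the producer
                 * waits forever in get_free_buf() -> deadlock.  The fix
                 * described above is to release the buffer on the error
                 * path as well.
                 */
                release_free_buf(p);
                return (-1);
        }
        release_free_buf(p);
        return (0);
}

int
main(void)
{
        /* Drain the pool the way the backup does when every write fails. */
        for (int i = 0; i < NBUFS; i++) {
                get_free_buf(&pool);
                (void) tape_write(&pool, true);  /* simulate "Broken pipe" on each write */
        }
        /* With the error-path release in place, this call no longer hangs. */
        get_free_buf(&pool);
        (void) printf("no deadlock: buffers were released on the error path\n");
        return (0);
}

If the release on the error path were omitted, the pool would be empty after the loop and the final get_free_buf() would block indefinitely, just as thread #9 does in the pstack output above.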
Bug 15813688 is fixed in the 2011.1.6.0 code. Upgrade to the 2011.1.6.0 Appliance Firmware release (or later) to fix this issue.
Attachments
This solution has no attachment