Oracle ZFS Storage Appliance: Is a ZFS Rewind possible if some bad data has been created by mistake on a pool?

Asset ID:	1-72-1546706.1
Update Date:	2018-03-07
Keywords:

Solution Type Problem Resolution Sure

Solution 1546706.1: Oracle ZFS Storage Appliance: Is a ZFS Rewind possible if some bad data has been created by mistake on a pool?

Applies to:

Sun ZFS Storage 7120 - Version All Versions and later
Sun Storage 7210 Unified Storage System - Version All Versions and later
Sun Storage 7110 Unified Storage System - Version All Versions and later
Sun Storage 7410 Unified Storage System - Version All Versions and later
Sun ZFS Storage 7320 - Version All Versions and later
7000 Appliance OS (Fishworks)

Symptoms


Here are some examples of situations in which you might want a transaction group rollback.

A mistake was made when resizing a LUN: for example, an iSCSI LUN that was meant to be resized from 1.5TB to 1.1TB was resized to 1.1GB instead.

A backup software purge process removed a critical backup that it considered expired.

Cause

Human error

Solution

1. Take the pool offline as soon as the mistake is made

2. Contact Oracle support to see whether there is any possibility that this error can be reversed

The customer had found this blog post: http://billroth.sys-con.com/node/2466887

ZFS writes changes to a pool in groups - transaction groups (txgs). As ZFS is a copy-on-write file system, data, including meta-data, is not overwritten.

Instead, new blocks are allocated for the new data and new meta-data blocks point to the new data all the way up to the top-level “uberblock”.
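The copy-on-write behaviour described above can be illustrated with a small Python model. This is only a sketch and not ZFS code: the names DataBlock, Uberblock and cow_write are hypothetical, and a flat tuple of pointers stands in for the whole tree of meta-data blocks.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataBlock:
    """Immutable block: once written, it is never modified in place."""
    contents: str


@dataclass(frozen=True)
class Uberblock:
    """Hypothetical top-level block; 'pointers' stands in for the meta-data tree."""
    txg: int
    pointers: Tuple[DataBlock, ...]


def cow_write(root: Uberblock, index: int, contents: str) -> Uberblock:
    """Allocate a new block and a new root that points to it; the old root is untouched."""
    new_block = DataBlock(contents)
    pointers = root.pointers[:index] + (new_block,) + root.pointers[index + 1:]
    return Uberblock(txg=root.txg + 1, pointers=pointers)


old_root = Uberblock(100, (DataBlock("version A"), DataBlock("unrelated data")))
new_root = cow_write(old_root, 0, "version B")

print(old_root.pointers[0].contents)                 # the old root still sees the old data
print(new_root.pointers[0].contents)                 # the new root sees the new data
print(new_root.pointers[1] is old_root.pointers[1])  # unchanged blocks are shared
```

As long as the old top-level block is still available, the old state of the data can still be reached through it, which is what makes a rewind conceivable at all.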

ZFS maintains a list of the last 127 "uberblocks" and the current one. Each time a transaction group (txg) is committed to the pool, the oldest entry is replaced.
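The way the oldest entry is replaced can be sketched as a fixed-size ring. The Python model below is a simplification for illustration only: the UberblockRing class is hypothetical, and picking the slot as the txg number modulo the slot count is an assumption made to keep the sketch short.

```python
SLOTS = 128  # 127 historical uberblocks plus the current one, as described above


class UberblockRing:
    """Hypothetical fixed-size ring of uberblock slots (simplified model)."""

    def __init__(self, slots: int = SLOTS) -> None:
        self.slots = [None] * slots

    def commit(self, txg: int) -> None:
        # Assumption: the slot is txg modulo the slot count, so each commit
        # overwrites whatever was stored in that slot 'slots' commits earlier.
        self.slots[txg % len(self.slots)] = txg

    def available_txgs(self) -> list:
        """Return the txgs that could still be rolled back to, oldest first."""
        return sorted(t for t in self.slots if t is not None)


ring = UberblockRing()
for txg in range(1000, 1200):  # 200 commits, arbitrary starting txg
    ring.commit(txg)

txgs = ring.available_txgs()
print(len(txgs), txgs[0], txgs[-1])  # 128 1072 1199
```

After 200 commits, only the most recent 128 txgs remain; txgs 1000 to 1071 can no longer be reached through an uberblock.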

These historical uberblocks provide a kind of temporary or transient snapshot, providing a very short window in which you can roll back to a specific state of a pool.

Many kinds of activity will update a pool and cause new txgs to be committed, which limits how far back in time you can recover. That's why it is important to export the pool as soon as possible.

An engineer can re-import your damaged pool and review the recent history, perhaps finding a transaction group before "the event" that caused the damage.

The engineer can also pull information about the transaction groups from the disks in the pool, and maybe, just maybe, find a point in time to recover to.

If a rollback is performed, all changes made to the pool after the point of recovery will be lost.

If the pool that was impacted by "the event" was taken offline quickly, that may not be an issue, but are you sure?

The pool must be taken offline before all 127 historical "uberblocks" are overwritten. On a busy pool, that means seconds, not minutes.

As it turns out, 24 hours had passed between the time this problem occurred and the time the pool was taken offline, meaning that the 127 uberblocks had been overwritten many times by then. As a result, NONE of the retained transaction groups (txgs) still contained the old pool state.
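A rough calculation shows why a 24-hour delay leaves no chance of recovery. The Python sketch below assumes 128 uberblock slots (127 historical plus the current one) and one slot consumed per committed txg. The commit intervals are illustrative assumptions, not measured values; how often a pool actually commits depends on the system and its load.

```python
SLOTS = 128  # 127 historical uberblocks plus the current one


def rollback_window_seconds(txg_interval_s: float, slots: int = SLOTS) -> float:
    """Age of the oldest retained uberblock: it is (slots - 1) commits behind the current one."""
    return (slots - 1) * txg_interval_s


def times_ring_overwritten(elapsed_s: float, txg_interval_s: float, slots: int = SLOTS) -> float:
    """How many times every slot has been rewritten during elapsed_s."""
    return elapsed_s / (txg_interval_s * slots)


DAY_S = 24 * 3600

# Hypothetical seconds between txg commits (assumed values for illustration)
for interval in (0.1, 1.0, 5.0):
    window = rollback_window_seconds(interval)
    wraps = times_ring_overwritten(DAY_S, interval)
    print(f"{interval} s per txg: window ~{window:.0f} s, "
          f"ring overwritten ~{wraps:.0f} times in 24 h")
```

Even with the slowest interval assumed here, every slot would have been rewritten more than a hundred times within 24 hours.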

Attachments

This solution has no attachment