
Asset ID: 1-72-2137870.1
Update Date: 2016-08-16

Solution Type: Problem Resolution Sure

Solution 2137870.1: ZFS Share Decreases To Current Filesystem Size During NDMP Backup Filling It Up


Related Items
  • Exalogic Elastic Cloud X3-2 Hardware

Related Categories
  • PLA-Support>Infrastructure>Operating Systems and Virtualization>Virtualization>Oracle PCA




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-12189966232>

Applies to:

Exalogic Elastic Cloud X3-2 Hardware - Version X3 and later
Information in this document applies to any platform.

Symptoms

The customer experienced a serious problem with the backup of an Exalogic system.

The backup is taken every day at the same time using Veritas NetBackup and NDMP with snapshots.

The problem occurred during the backup of /exports/romatg/domain (97GB).

After a while, the backup suddenly reserved all available disk space in the zpool (300GB), causing serious unavailability of the application.

Stopping the backup freed the disk space again.



Cause

During onsite tests the problem could not be reproduced, even when the NDMP backup was hung.

Further investigation showed that the problem was NOT that the filesystem filled up, but that the REPORTED filesystem capacity WAS REDUCED to match the filesystem's current USED space.
Thus the filesystem usage went from about 24% to 100%.

See the df -h output from the guest vServer before and after the problem (total capacity 400GB, usage 24%):

2.0.0.5:/export/<name>/domain
  400G 95G 306G 24% /<name>/domain
See the df -k output below from the guest vServer DURING the problem (total reported capacity REDUCED to 95GB, the same as the used space; usage went to 100%):

[root@<name> ~]# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
2.0.0.5:/export/<name>/domain
  94818304 94818304 0 100% /<name>/domain
After the backup job was killed, the filesystem capacity seen from the guest was again reported correctly as 400GB.

The reported filesystem capacity is actually determined by the NFS share quota that is set on the ZFS storage.
The customer checked the ZFS BUI and saw that the quota was NOT changed during the problem.
This is why a bug is suspected, possibly triggered by the hang of the NDMP backup.
The difficulty is that this behavior has not been reproduced since.
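
For reference, the share quota can also be checked from the ZFSSA CLI instead of the BUI. The following is a minimal sketch only; <project> stands in for the actual project name, which is not confirmed in this document:

zfssa:> shares select <project>
zfssa:shares <project>> select domain
zfssa:shares <project>/domain> get quota
                        quota = 400G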

As you may see, the other collaborating engineers have not found any issues on the guest or on the InfiniBand.

The problem has not been reproduced after several weeks of tests.
The suspicion remains a bug, triggered by the hang of the NDMP backup, that causes the REPORTED capacity of the NFS filesystem to be REDUCED to match the filesystem's current USED space.
 

Solution


A. From the affected guest vServer - NFS client

Enable debugging on all layers of NFS. Warning: this will generate a large amount of log output.

When the issue re-occurs, we need the following:

- Enable rpcdebug on the affected guest vServer to monitor all components:

# rpcdebug -m nfs -s all

Additional log messages should start to appear in /var/log/messages.
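
Optionally, the RPC transport layer can be traced as well and the new messages followed live. This is an additional sketch, not part of the plan above:

# rpcdebug -m rpc -s all
# tail -f /var/log/messages

(If you enable the rpc module, remember to clear it later with "# rpcdebug -m rpc -c all", together with the nfs module.)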

Now run the commands below to verify what is going on:

# mount
# df -ak
# showmount -e 2.0.0.5 (IP of ZFSSA)

Try to mount the exact same share at a different path, for example /mnt/test1, and verify the filesystem space reported there, as in the sketch below.
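
A minimal sketch of that test mount, reusing the ZFSSA IP and share path from the examples above (the share path under /export/<name> is a placeholder):

# mkdir -p /mnt/test1
# mount -t nfs 2.0.0.5:/export/<name>/domain /mnt/test1
# df -k /mnt/test1
# umount /mnt/test1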

After that, stop the backup process or kill it. rpcdebug should now capture all the RPC-level information and provide it via /var/log/messages.

Then obtain again:

# mount
# df -ak
# showmount -e 2.0.0.5
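
To make the before/after comparison easy to attach to the SR, the same outputs can be captured into a timestamped file. This is illustrative only; the /tmp location and filename are arbitrary:

# (date; mount; df -ak; showmount -e 2.0.0.5) > /tmp/nfs_state_$(date +%s).out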

Finally, stop the NFS debugging:

# rpcdebug -m nfs -c all

B. From the ZFS storage - NFS server (Warning: risky)

The ZFS team also suggested that you may create a coredump during the problem, as per the procedure below.
(Note: personally I would not suggest it, because this will fail over all ZFS shares to the other storage head, which might have an impact on all the guests!)

"The only other thing you could do if this should happen again is to collect a core file from the ZFSSA while the problem is occuring,

You simply ssh into the ilom of the system that owns the pools (primary ZFS head) and type

-> cd /HOST/
-> set generate_host_nmi=true
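
For illustration only, the full session could look like this; the ILOM hostname is a placeholder, and setting the NMI flag will panic the head and trigger the failover described above:

# ssh root@<primary-head-ilom>
-> cd /HOST/
-> set generate_host_nmi=true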



References

<NOTE:1580215.1> - Your root Filesystem is shown as 100% in Use
<NOTE:1666661.1> - How to Check and Repair EXT3/EXT4 Filesystem on Oracle Linux
<NOTE:2058684.1> - Oracle ZFS Storage Appliance: Symantec Backup Exec NDMP Backups Fail
<NOTE:452067.1> - How to Configure "kdump" on Oracle Linux 5
<NOTE:457444.1> - 'du' and 'df' tools report different space utilization
<NOTE:1568410.1> - ALERT OSuggests: The Share Percentage Used is XX and has crossed warning (97) or critical (99) threshold. Share ID: pool03a:prodosug_03/psr_timesten_admin
<NOTE:1531223.1> - OS Watcher User's Guide
<NOTE:580513.1> - How To Start OSWatcher Black Box (OSWBB) Every System Boot Using RPM oswbb-service
<NOTE:228203.1> - Alt SysRq Keys Utility on Oracle Linux
<NOTE:2046923.1> - Oracle Linux: All the Sub Folders and Files are not Getting Listed in NFS Client When the Parent and its Sub-Folders are Exported from NFS Server

Attachments
This solution has no attachment