
Asset ID: 1-72-1618570.1
Update Date: 2018-03-04

Solution Type: Problem Resolution Sure

Solution 1618570.1: When an HDFS Disk Goes Bad on an Oracle Big Data Appliance Node, the Root (/) Filesystem Is Filled up with HDFS Data


Related Items
  • Big Data Appliance X3-2 Hardware
  • Big Data Appliance Integrated Software
Related Categories
  • PLA-Support>Eng Systems>BDA>Big Data Appliance>DB: BDA_EST




In this Document
Symptoms
Changes
Cause
Solution


Created from <SR 3-8342648851>

Applies to:

Big Data Appliance X3-2 Hardware - Version All Versions and later
Big Data Appliance Integrated Software - Version 2.1.0 and later
Linux x86-64

Symptoms

When an HDFS disk goes bad and is unmounted (for example, by a reboot) on one of the Oracle Big Data Appliance (BDA) nodes, the root (/) filesystem fills up with HDFS data. The failed disk shows up as follows in the MegaCli physical disk listing:

Slot Number: 11
Firmware state: Unconfigured(bad)
Foreign State: Foreign
Foreign Secure: Drive is not secured by a foreign lock key
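
To confirm the symptom from the OS, checks along the following lines can be used (a minimal sketch; the MegaCli64 path below is its usual location on BDA, and /u12 is the mount point from this example):

# List slot numbers and firmware states of all physical disks
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | egrep "Slot Number|Firmware state"

# When /u12 is unmounted it is just a directory on the root filesystem,
# so df reports "/" for it and du shows the HDFS data accumulating there
df -h /u12
du -sh /u12/hadoop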

 

Changes

The disk went bad, but ASR did not raise an SR due to internal Bug 18139689. A reboot was then performed on the node, and because the disk was bad it was left unmounted during startup.

Bug 18139689 - BDA DOES NOT RAISE ASR EVENTS FOR ALL UNHEALTHY DISK STATES
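
Whether the data directory is still mounted can be verified directly on the node; a minimal check, assuming the slot 11 disk corresponds to the /u12 mount point as in this example:

# No output from mount means /u12 is not mounted
mount | grep u12

# Compare against the expected entry for /u12
grep u12 /etc/fstab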

Cause

As per Cloudera, data is written to the root filesystem by the DataNode (DN) under the following conditions.

Say the disk in slot 11 (mounted at /u12) fails or turns bad:

- The /dev/sdl1 disk fails.
- The DN marks the /u12/hadoop/dfs volume as failed and stops writing to it.
- The administrator stops the DN.
- The administrator unmounts /u12.
- The administrator starts the DN from Cloudera Manager (CM). (Note that the previous few steps may also have been performed automatically by a reboot of the host.)
- On BDA, /u12 has 755 permissions and is owned by root. The CM agent finds /u12 empty, creates the /u12/hadoop/dfs directory, and recursively changes the ownership of /u12/hadoop to the hdfs user.
- The DN is started by the CM agent, sees that /u12/hadoop/dfs is empty, or "unformatted", and formats the "volume". It then starts normal operations and uses the newly formatted volume, thus writing HDFS data onto the root (/) filesystem. The sketch after this list illustrates the mechanism.
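
A minimal illustration of the mechanism, assuming the /u12 mount point from this example (read-only checks; they simply show that an unmounted mount point is an ordinary directory on the root filesystem):

# With the disk failed and /u12 unmounted, /u12 is a plain directory on "/"
df -h /u12           # reports the root (/) filesystem, not /dev/sdl1
ls -ld /u12          # drwxr-xr-x (755), owned by root

# After the CM agent runs, the recreated data directory belongs to hdfs
ls -ld /u12/hadoop

# Any blocks the DN writes under /u12/hadoop/dfs now consume root filesystem space
du -sh /u12/hadoop/dfs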

 

Solution

Internal Bug 18139715 has been filed to change the permissions on unmounted data directories so that the hdfs user cannot write to the root (/) filesystem in case of a disk failure. Bug 18139715 is targeted to be fixed in the 2.4.1 release.

Bug 18139715 - BDA SHOULD MAKE UNMOUNTED DATA DIRECTORIES UNREADABLE (PERMISSIONS 700)
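
Until that fix is available, the same idea can be applied manually. A hedged sketch, assuming the DN role on the affected node has already been stopped in CM and that /u12 is the failed, unmounted mount point:

# Confirm /u12 is NOT mounted (df must report "/") before touching anything
df -h /u12

# Remove HDFS data already written to the root filesystem under /u12
rm -rf /u12/hadoop

# Make the unmounted mount point inaccessible to the hdfs user,
# mirroring what Bug 18139715 implements (permissions 700, owner root)
chmod 700 /u12

Once the failed disk is replaced and a new filesystem is mounted on /u12, the permissions of the mounted filesystem apply again.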

ASR Bug 18139689, filed so that ASR reports all failed disk states, is also targeted to be fixed in the 2.4.1 release.

Meanwhile, bad disk status can be checked in Cloudera Manager (CM). CM shows the DataNode with the failed disk in "Bad health", with the message: "The DataNode has 1 volume failure(s). Critical threshold: any."

Also, executing bdacheckcluster will report failed disks.
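
For example, run as root (typically from the first node of the cluster; bdacheckcluster is part of the standard BDA software install):

# Runs the cluster-wide health checks, including HDFS data disk checks
bdacheckcluster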


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.