Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1581583.1
Update Date: 2017-11-12
Keywords:

Solution Type: Predictive Self-Healing Sure Solution

Solution 1581583.1: How to Configure a Server Disk After Replacement as an HDFS Disk or Oracle NoSQL Database Disk on Oracle Big Data Appliance V2.2.*/V2.3.1/V2.4.0/V2.5.0/V3.x/V4.x


Related Items
  • Big Data Appliance X3-2 In-Rack Expansion
  • Big Data Appliance X6-2 Hardware
  • Big Data Appliance X5-2 Starter Rack
  • Big Data Appliance X5-2 Hardware
  • Big Data Appliance X4-2 Starter Rack
  • Big Data Appliance Hardware
  • Big Data Appliance X5-2 In-Rack Expansion
  • Big Data Appliance X3-2 Full Rack
  • Big Data Appliance X3-2 Hardware
  • Big Data Appliance X4-2 In-Rack Expansion
  • Big Data Appliance X4-2 Hardware
  • Big Data Appliance X4-2 Full Rack
  • Big Data Appliance X5-2 Full Rack
  • Big Data Appliance X3-2 Starter Rack
Related Categories
  • PLA-Support>Eng Systems>BDA>Big Data Appliance>DB: BDA_EST
  • _Old GCS Categories>ST>Server>Engineered Systems>Big Data Appliance>Misc




In this Document
Purpose
Scope
Details
 Overview
 About Disk Drive Identifiers
 Disk Drive Identifiers
 Standard Mount Points
 Configuring an HDFS or Oracle NoSQL Database Disk
 Unmounting an HDFS or Oracle NoSQL Database Partition
 Partitioning a Disk for HDFS or Oracle NoSQL Database
 Mount HDFS or Oracle NoSQL Database Partition
 Oracle NoSQL Database Disk Configuration
 Verifying the Disk Configuration
 What If Firmware Warnings or Errors occur?
 What If a Server Fails to Restart?
References


Applies to:

Big Data Appliance X5-2 Full Rack - Version All Versions to All Versions [Release All Releases]
Big Data Appliance X3-2 Starter Rack - Version All Versions to All Versions [Release All Releases]
Big Data Appliance X3-2 In-Rack Expansion - Version All Versions to All Versions [Release All Releases]
Big Data Appliance X4-2 In-Rack Expansion - Version All Versions to All Versions [Release All Releases]
Big Data Appliance X5-2 In-Rack Expansion - Version All Versions to All Versions [Release All Releases]
Linux x86-64

Purpose

This document describes the steps for configuring a server disk as an HDFS disk or an Oracle NoSQL Database disk after disk drive replacement on Oracle Big Data Appliance V2.2.*/V2.3.1/V2.4.0/V2.5.0/V3.x/V4.x.

Scope

This document is intended for anyone configuring a replacement disk. If you attempt these steps and need further assistance, log a service request to contact Oracle Support.

Details

Overview

Failure of a disk is never catastrophic on Oracle Big Data Appliance. No user data should be lost. Data stored in HDFS or Oracle NoSQL Database is automatically replicated.

The following are the basic steps for replacing a server disk drive and configuring it as an HDFS or Oracle NoSQL Database Disk:

1. Replace the failed disk drive.

2. Perform the basic configuration steps for the new disk. If multiple disks are unconfigured, then configure them in order from the lowest to the highest slot number. Finish all the steps for one disk and then start with all the steps for the next.

3. Identify the dedicated function of the replaced disk drive: an operating system disk, an HDFS disk, or an Oracle NoSQL Database disk.

The steps for 1, 2, and 3 are listed in "Steps for Replacing a Disk Drive and Determining its Function on the Oracle Big Data Appliance V2.2.*/V2.3.1/V2.4.0/V2.5.0/V3.x/V4.x (Doc ID 1581331.1)."

4. Configure the disk for its dedicated function, in this case for HDFS or Oracle NoSQL Database.

5. Verify that the configuration is correct.

The steps for 4 and 5 are listed here in this document.

About Disk Drive Identifiers

The Oracle Big Data Appliance server includes a disk enclosure cage that holds 12 disk drives and is controlled by the HBA (Host Bus Adapter). The drives in this enclosure are identified by slot numbers (0 to 11) and can have different purposes; for example, the drives in slots 0 and 1 carry the RAID 1 operating system and boot partitions. The drives can be dedicated to specific functions, as shown in Table 1.
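If you want to cross-check the slot numbering directly from the HBA, MegaCli64 can list the physical drives together with their slot numbers and state. This is an optional, minimal check (the binary path /opt/MegaRAID/MegaCli/MegaCli64 is assumed here and may differ; run as root):

# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | egrep -i 'slot number|firmware state'

Each drive should report a slot number between 0 and 11 and a firmware state such as "Online, Spun Up".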

Version 2 of the image introduces device symbolic links in /dev/disk/by-hba-slot/. The links refer to the physical location (slot number) of a disk inside the disk enclosure cage. The links are of the form s<n>p<m>, where n is the slot number and m is the partition number. For example, on an unaltered system /dev/disk/by-hba-slot/s0p1 corresponds to /dev/sda1, s0p4 to /dev/sda4, s1p1 to /dev/sdb1, and so on; the disk /dev/sda itself corresponds to /dev/disk/by-hba-slot/s0, /dev/sdb to s1, and so on.

When a disk is hot swapped, the operating system may not reuse the kernel device name and may allocate a new one instead. For example, if /dev/sda is hot swapped, the link /dev/disk/by-hba-slot/s0 may then point to /dev/sdn instead of /dev/sda. The links in /dev/disk/by-hba-slot/ are updated automatically (by udev rules) when devices are added or removed, which is why the BDA configuration and recovery procedures use the symbolic device links in /dev/disk/by-hba-slot rather than the kernel device names.
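As a quick illustration, you can list the links for a given slot and see which kernel device they currently resolve to. A representative example for slot 4 (the device names will differ depending on the system state):

# ls -l /dev/disk/by-hba-slot/s4*
lrwxrwxrwx 1 root root  9 Aug 28 09:45 /dev/disk/by-hba-slot/s4 -> ../../sde
lrwxrwxrwx 1 root root 10 Aug 28 09:45 /dev/disk/by-hba-slot/s4p1 -> ../../sde1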

See the following tables to identify the function of the drive, the slot number, and the mount point, which will be used later in the procedure.

Disk Drive Identifiers

The following table (Table 1) shows the mappings between the RAID logical drives, the probable initial kernel device names, and the dedicated function of each drive in an Oracle Big Data Appliance server. The server with the failed drive is part of either a CDH cluster (HDFS) or an Oracle NoSQL Database cluster. This information is used later when partitioning the disk for its appropriate function, so note which mapping applies to the disk drive being replaced.

Table 1 - Disk Drive Identifiers

Physical Slot    Symbolic Link to Physical Slot    Probable Initial Kernel Device Name    Dedicated Function
0 /dev/disk/by-hba-slot/s0 /dev/sda Operating system
1 /dev/disk/by-hba-slot/s1 /dev/sdb Operating system
2 /dev/disk/by-hba-slot/s2 /dev/sdc HDFS or Oracle NoSQL Database
3 /dev/disk/by-hba-slot/s3 /dev/sdd HDFS or Oracle NoSQL Database
4 /dev/disk/by-hba-slot/s4 /dev/sde HDFS or Oracle NoSQL Database
5 /dev/disk/by-hba-slot/s5 /dev/sdf HDFS or Oracle NoSQL Database
6 /dev/disk/by-hba-slot/s6 /dev/sdg HDFS or Oracle NoSQL Database
7 /dev/disk/by-hba-slot/s7 /dev/sdh HDFS or Oracle NoSQL Database
8 /dev/disk/by-hba-slot/s8 /dev/sdi HDFS or Oracle NoSQL Database
9 /dev/disk/by-hba-slot/s9 /dev/sdj HDFS or Oracle NoSQL Database
10 /dev/disk/by-hba-slot/s10 /dev/sdk HDFS or Oracle NoSQL Database
11 /dev/disk/by-hba-slot/s11 /dev/sdl HDFS or Oracle NoSQL Database

Standard Mount Points

The following table (Table 2) shows the mappings between HDFS partitions and mount points. This information will be used in the later procedure so please note which mapping is applicable for the disk drive that is being replaced.

Table 2 - Mount Points

Physical Slot    Symbolic Link to Physical Slot and Partition    Probable Name for HDFS Partition    Mount Point
0 /dev/disk/by-hba-slot/s0p4 /dev/sda4 /u01
1 /dev/disk/by-hba-slot/s1p4 /dev/sdb4 /u02
2 /dev/disk/by-hba-slot/s2p1 /dev/sdc1 /u03
3 /dev/disk/by-hba-slot/s3p1 /dev/sdd1 /u04
4 /dev/disk/by-hba-slot/s4p1 /dev/sde1 /u05
5 /dev/disk/by-hba-slot/s5p1 /dev/sdf1 /u06
6 /dev/disk/by-hba-slot/s6p1 /dev/sdg1 /u07
7 /dev/disk/by-hba-slot/s7p1 /dev/sdh1 /u08
8 /dev/disk/by-hba-slot/s8p1 /dev/sdi1 /u09
9 /dev/disk/by-hba-slot/s9p1 /dev/sdj1 /u10
10 /dev/disk/by-hba-slot/s10p1 /dev/sdk1 /u11
11 /dev/disk/by-hba-slot/s11p1 /dev/sdl1 /u12

 

Note: MegaCli64, mount, umount, and many of the other commands require root privileges, so the recommendation is to run the entire procedure as root.

 

Note: The code examples provided here are based on replacing /dev/disk/by-hba-slot/s4 == /dev/sde == /dev/disk/by-hba-slot/s4p1 == /dev/sde1 == /u05. These four mappings are an easy way to set up the information that will be needed throughout the procedure. It is best to determine the mapping and write it down before starting; for example, the slot number is one less than the mount point number. Every disk replacement will vary, so replace the examples with the information that applies to the disk replacement being done.

Helpful Tips: You can re-confirm the relationship among the disk slot number, the current kernel device name, and the mount point as follows:

1. Re-confirm the relationship between the slot number and the current kernel device name using the "lsscsi" command.
The lsscsi command shows slot number X as "[0:2:X:0]". For example, [0:2:4:0] means slot number 4 and [0:2:11:0] means slot number 11.

# lsscsi
[0:0:20:0]   enclosu ORACLE   CONCORD14        0d03  -       
[0:2:0:0]    disk    LSI      MR9261-8i        2.13  /dev/sda
[0:2:1:0]    disk    LSI      MR9261-8i        2.13  /dev/sdb
[0:2:2:0]    disk    LSI      MR9261-8i        2.13  /dev/sdc
[0:2:3:0]    disk    LSI      MR9261-8i        2.13  /dev/sdd
[0:2:4:0]    disk    LSI      MR9261-8i        2.13  /dev/sdn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[0:2:5:0]    disk    LSI      MR9261-8i        2.13  /dev/sdf
[0:2:6:0]    disk    LSI      MR9261-8i        2.13  /dev/sdg
[0:2:7:0]    disk    LSI      MR9261-8i        2.13  /dev/sdh
[0:2:8:0]    disk    LSI      MR9261-8i        2.13  /dev/sdi
[0:2:9:0]    disk    LSI      MR9261-8i        2.13  /dev/sdj
[0:2:10:0]   disk    LSI      MR9261-8i       2.13  /dev/sdk
[0:2:11:0]   disk    LSI      MR9261-8i       2.13  /dev/sdl
[7:0:0:0]    disk    ORACLE   UNIGEN-UFD       PMAP  /dev/sdm

In this case, you can re-confirm the following:

    slot 0:   /dev/sda
    slot 1:   /dev/sdb
    slot 2:   /dev/sdc
    slot 3:   /dev/sdd
    slot 4:   /dev/sdn
    slot 5:   /dev/sdf
    slot 6:   /dev/sdg
    slot 7:   /dev/sdh
    slot 8:   /dev/sdi
    slot 9:   /dev/sdj
    slot 10: /dev/sdk
    slot 11: /dev/sdl


2. Re-confirm the relationship between the current kernel device name and the mount point using the "mount" command.

# mount -l | grep /u
/dev/sda4 on /u01 type ext4 (rw,nodev,noatime) [/u01]
/dev/sdb4 on /u02 type ext4 (rw,nodev,noatime) [/u02]
/dev/sdc1 on /u03 type ext4 (rw,nodev,noatime) [/u03]
/dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04]
/dev/sdn1 on /u05 type ext4 (rw,nodev,noatime) [/u05]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/dev/sdf1 on /u06 type ext4 (rw,nodev,noatime) [/u06]
/dev/sdg1 on /u07 type ext4 (rw,nodev,noatime) [/u07]
/dev/sdh1 on /u08 type ext4 (rw,nodev,noatime) [/u08]
/dev/sdi1 on /u09 type ext4 (rw,nodev,noatime) [/u09]
/dev/sdj1 on /u10 type ext4 (rw,nodev,noatime) [/u10]
/dev/sdk1 on /u11 type ext4 (rw,nodev,noatime) [/u11]
/dev/sdl1 on /u12 type ext4 (rw,nodev,noatime) [/u12]

In this case, you can re-confirm the following:

    /dev/sda:   /u01
    /dev/sdb:   /u02
    /dev/sdc:    /u03
    /dev/sdd:   /u04
    /dev/sdn:   /u05
    /dev/sdf:    /u06
    /dev/sdg:   /u07
    /dev/sdh:   /u08
    /dev/sdi:    /u09
    /dev/sdj:    /u10
    /dev/sdk:   /u11
    /dev/sdl:    /u12

3. From the outputs above, you can re-confirm the relationship among them as follows:

    slot 0:   /dev/sda: /u01
    slot 1:   /dev/sdb: /u02
    slot 2:   /dev/sdc: /u03
    slot 3:   /dev/sdd: /u04
    slot 4:   /dev/sdn: /u05
    ^^^^^^^^^^^^^^^^^^^^^^
    slot 5:   /dev/sdf: /u06
    slot 6:   /dev/sdg: /u07
    slot 7:   /dev/sdh: /u08
    slot 8:   /dev/sdi: /u09
    slot 9:   /dev/sdj: /u10
    slot 10:  /dev/sdk: /u11
    slot 11:  /dev/sdl: /u12
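As an optional shortcut, the checks above can be combined in a small root shell loop that prints each slot number, the kernel device its by-hba-slot link currently resolves to, and the mount point. This is only a sketch and assumes the standard 12-slot layout:

# for n in $(seq 0 11); do
>   dev=$(readlink -f /dev/disk/by-hba-slot/s$n)
>   mp=$(mount -l | grep "^${dev}[0-9] " | awk '{print $3}' | head -1)
>   echo "slot $n -> $dev -> ${mp:-not mounted}"
> done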

Configuring an HDFS or Oracle NoSQL Database Disk

Complete the following steps for any disk not used by the operating system. See Table 1 to determine how the disk is configured. Most disks are used for HDFS or Oracle NoSQL Database, as shown in Table 1. If multiple disks are unconfigured, then configure them in order from the lowest to the highest slot number. Finish all the steps for one disk and then start with all the steps for the next.

Verify that the failed disk was not used by the operating system before configuring it for a particular function. See the following note to determine the function of the disk drive: "Steps for Replacing a Disk Drive and Determining its Function on the Oracle Big Data Appliance V2.2.*/V2.3.1/V2.4.0/V2.5.0/V3.x/V4.x (Doc ID 1581331.1)."

1. Unmounting an HDFS or Oracle NoSQL Database Partition

2. Partitioning a Disk for HDFS or Oracle NoSQL Database

3. Mount HDFS or Oracle NoSQL Database Partition

4. Verifying the Disk Configuration

Unmounting an HDFS or Oracle NoSQL Database Partition

To dismount an HDFS or Oracle NoSQL Database partition:

1.  Log in as root to the server with the failing drive.

2.  List the mounted HDFS partitions:  

# mount -l

Sample output:

# mount -l
/dev/md2 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md0 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda4 on /u01 type ext4 (rw,nodev,noatime) [/u01]
/dev/sdb4 on /u02 type ext4 (rw,nodev,noatime) [/u02]
/dev/sdc1 on /u03 type ext4 (rw,nodev,noatime) [/u03]
/dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04]
/dev/sde1 on /u05 type ext4 (rw,nodev,noatime) [/u05]
/dev/sdf1 on /u06 type ext4 (rw,nodev,noatime) [/u06]
/dev/sdg1 on /u07 type ext4 (rw,nodev,noatime) [/u07]
/dev/sdh1 on /u08 type ext4 (rw,nodev,noatime) [/u08]
/dev/sdi1 on /u09 type ext4 (rw,nodev,noatime) [/u09]
/dev/sdj1 on /u10 type ext4 (rw,nodev,noatime) [/u10]
/dev/sdk1 on /u11 type ext4 (rw,nodev,noatime) [/u11]
/dev/sdl1 on /u12 type ext4 (rw,nodev,noatime) [/u12]
fuse_dfs on /mnt/hdfs-nnmount type fuse.fuse_dfs (rw,nosuid,nodev,allow_other,allow_other,default_permissions)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

3. Dismount the HDFS mount point for the failed disk as the root user. Replace mountpoint below with the mount point obtained earlier, as shown in the Standard Mount Points table (Table 2) above:

# umount mountpoint

For example, dismounting /u05 with umount /u05 removes the mount point for disk /dev/sde:

# umount /u05

If the umount command succeeds, verify that the partition is no longer listed by listing the mounted HDFS partitions:

# mount -l

Sample output shows that /u05 has been dismounted:

# mount -l
/dev/md2 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md0 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda4 on /u01 type ext4 (rw,nodev,noatime) [/u01]
/dev/sdb4 on /u02 type ext4 (rw,nodev,noatime) [/u02]
/dev/sdc1 on /u03 type ext4 (rw,nodev,noatime) [/u03]
/dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04]
/dev/sdf1 on /u06 type ext4 (rw,nodev,noatime) [/u06]
/dev/sdg1 on /u07 type ext4 (rw,nodev,noatime) [/u07]
/dev/sdh1 on /u08 type ext4 (rw,nodev,noatime) [/u08]
/dev/sdi1 on /u09 type ext4 (rw,nodev,noatime) [/u09]
/dev/sdj1 on /u10 type ext4 (rw,nodev,noatime) [/u10]
/dev/sdk1 on /u11 type ext4 (rw,nodev,noatime) [/u11]
/dev/sdl1 on /u12 type ext4 (rw,nodev,noatime) [/u12]
fuse_dfs on /mnt/hdfs-nnmount type fuse.fuse_dfs (rw,nosuid,nodev,allow_other,allow_other,default_permissions)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

If a umount command fails with a "device is busy" message, then the partition is still in use. For example, an HDFS partition could be in use by the DataNode service. Continue to the next step.

Example:

# umount /u05
umount: /u05: device is busy
umount: /u05: device is busy
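Before going to Cloudera Manager, you can optionally confirm which processes are holding the partition open. A minimal check, assuming the fuser utility (psmisc package) is installed:

# fuser -vm /u05

On a CDH cluster this typically shows the DataNode java process; it releases the mount point once the role is stopped or its data directory list no longer references the mount point.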

4. Open a browser window to Cloudera Manager. For example:  

http://bda1node03.example.com:7180 

5. Complete these steps in Cloudera Manager:

    a.  Log in as admin.

    b.  On the Services page, click hdfs.

    c.  Click the Instances subtab.

    d.  In the Host column, locate the server with the failed disk. Then click the service in the Name column, such as datanode (...), to open its page.

    e.  Click the Configuration subtab, then click View and Edit.

    f.   Remove the mount point from the DataNode Data Directory (dfs.data.dir, dfs.datanode.data.dir) field. Click the minus (-) sign next to the entry to remove it.
         The following screenshot shows the field from which the mount point must be removed.

[Screenshot: DataNode Data Directory field before removal]

         In this example /u05/hadoop/dfs has been removed.

[Screenshot: DataNode Data Directory field after removal]

    g.  Click Save Changes.

[Screenshot: Save Changes button]
    h.  From the Actions list, choose Restart this DataNode.

[Screenshot: Restart this DataNode action]
         You will see a pop up that says something similar to the following:

         "Restart this DataNode

          Are you sure you want to Restart the role instance datanode (...)?"

    i.   Click on the button that says "Restart this DataNode."

Note: If you removed the mount point in Cloudera Manager, then you must restore the mount point in Cloudera Manager after finishing all other configuration procedures.

6.  Return to your session on the server with the failed drive.

7.  Reissue the umount command:  

# umount mountpoint

Example dismounting /u05, which now succeeds:

# umount /u05

Partitioning a Disk for HDFS or Oracle NoSQL Database

To configure a disk, you must partition and format it. Having verified that the drive is not an operating system disk, proceed to partition the disk.

To format a disk for use by HDFS or Oracle NoSQL Database:

Note: Replace sn or snp1 in the following commands with the appropriate operating system location name that was determined from Table 1, such as s4 or s4p1.

1. Complete the steps in the document "Steps for Replacing a Disk Drive and Determining its Function on the Oracle Big Data Appliance V2.2.*/V2.3.1/V2.4.0/V2.5.0/V3.x/V4.x (Doc ID 1581331.1)."

2. Partition the drive as root. Replace /dev/disk/by-hba-slot/sn with the operating system location name that was determined from Table 1:

On OL6:

# parted /dev/disk/by-hba-slot/sn -s mklabel gpt mkpart primary ext4 0% 100%

 

On OL5: 

# parted /dev/disk/by-hba-slot/s4 -s mklabel gpt mkpart primary ext3 0% 100%

 

Optional Sanity check:

Confirm that the partition was fully created. Replace sn with the appropriate slot name for the disk:

# parted /dev/disk/by-hba-slot/sn

This is sample output which indicates the partition was fully created: 

# parted /dev/disk/by-hba-slot/s4
GNU Parted 1.8.1
Using /dev/sde
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print

Model: LSI MR9261-8i (scsi)
Disk /dev/sde: 3000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      17.4kB  3000GB  3000GB  ext3         primary

(parted) quit
Information: Don't forget to update /etc/fstab, if necessary.

Note that there is nothing to update in /etc/fstab.
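Alternatively, the same sanity check can be run non-interactively. A minimal equivalent using parted in script mode:

# parted -s /dev/disk/by-hba-slot/s4 print

The output should show a gpt partition table with a single primary partition spanning the disk.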

3. Format the partition for an ext4 file system as the root user. Replace /dev/disk/by-hba-slot/snp1 with the proper Symbolic Link to Physical Slot and Partition name determined from Table 2 above:

# mkfs -t ext4 /dev/disk/by-hba-slot/snp1

Example using /dev/disk/by-hba-slot/s4p1:

# mkfs -t ext4 /dev/disk/by-hba-slot/s4p1
mkfs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
122011648 inodes, 488036855 blocks
24401842 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14894 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune4fs -c or -i to override.

4. Verify that the device is missing:

# ls -l /dev/disk/by-label

In this example output, u05 and ../../sde1 are missing:

#  ls -l /dev/disk/by-label
total 0
lrwxrwxrwx 1 root root 10 Aug 28 09:45 BDAUSB -> ../../sdm1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u01 -> ../../sda4
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u02 -> ../../sdb4
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u03 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u04 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u06 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u07 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u08 -> ../../sdh1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u09 -> ../../sdi1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u10 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u11 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u12 -> ../../sdl1

It is possible that the device will not show as missing, as seen in the following output:

#  ls -l /dev/disk/by-label
total 0
lrwxrwxrwx 1 root root 10 Aug 28 09:45 BDAUSB -> ../../sdm1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u01 -> ../../sda4
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u02 -> ../../sdb4
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u03 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u04 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u05 -> ../../sde1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u06 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u07 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u08 -> ../../sdh1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u09 -> ../../sdi1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u10 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u11 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u12 -> ../../sdl1

5. Reset the appropriate partition label, reserved space, and file system check options on the missing device as root. Replace /unn with the correct mount point, and replace /dev/disk/by-hba-slot/snp1 with the proper Symbolic Link to Physical Slot and Partition name determined from Table 2 above:

Note: For OL6 use tune2fs.  For OL5 use tune4fs.

On OL6 use tune2fs:

# tune2fs -c -1 -i 0 -m 0.2 -L /unn /dev/disk/by-hba-slot/snp1

On OL5 use tune4fs:

# tune4fs -c -1 -i 0 -m 0.2 -L /unn /dev/disk/by-hba-slot/snp1

For example, this command resets the label for /dev/disk/by-hba-slot/s4p1 to /u05:

# tune4fs -c -1 -i 0 -m 0.2 -L /u05 /dev/disk/by-hba-slot/s4p1
tune4fs 1.41.12 (17-May-2010)
Setting maximal mount count to -1
Setting interval between checks to 0 seconds
Setting reserved blocks percentage to 0.2% (976073 blocks)

Note: If an incorrect tune2fs command is run, correct it by running the correct tune2fs command afterward. For example, if you incorrectly run:
# tune2fs -c -1 -i 0 -m 0.2 -L /u05 /dev/disk/by-hba-slot/s5p1
instead of correctly running:
# tune2fs -c -1 -i 0 -m 0.2 -L /u06 /dev/disk/by-hba-slot/s5p1

then simply run the correct command "tune2fs -c -1 -i 0 -m 0.2 -L /u06 /dev/disk/by-hba-slot/s5p1" and proceed. tune2fs only sets the label on the partition it is given, so the incorrect command does not affect the other slot/mount. In this example it does not affect slot 4 / /u05, which still has its original label set; having two partitions with the same label temporarily is acceptable as long as you do not attempt a mount while the duplicate labels exist.
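To confirm that the intended label was applied before mounting, you can optionally read the label back from the partition. A minimal check using blkid for the example disk used here:

# blkid /dev/disk/by-hba-slot/s4p1

The output should include LABEL="/u05" and TYPE="ext4".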

6. Verify that the replaced disk is listed in the 'ls -l /dev/disk/by-label' output. If the replaced disk is listed, skip to the 'Mount HDFS or Oracle NoSQL Database Partition' section. If the replaced disk is NOT listed, continue with the next step (7).

# ls -l /dev/disk/by-label

For example, if the disk in slot 4 (/u05) was replaced, the following output shows that the replaced disk is still missing:

#  ls -l /dev/disk/by-label
total 0
lrwxrwxrwx 1 root root 10 Aug 28 09:45 BDAUSB -> ../../sdm1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u01 -> ../../sda4
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u02 -> ../../sdb4
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u03 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u04 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u06 -> ../../sde1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u07 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u08 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u09 -> ../../sdh1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u10 -> ../../sdi1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u11 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Aug 28 09:45 u12 -> ../../sdk1

7. Trigger kernel device uevents to replay missing events at system coldplug.

a) For Oracle Linux 5, execute the command below:

udevtrigger

b) For Oracle Linux 6, execute the command below:

udevadm trigger

Note: With both commands the --verbose option can be used to check which events are triggered.
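For example, on Oracle Linux 6 the replayed block-device events can be limited and inspected as follows (a sketch only; the --subsystem-match filter is optional):

# udevadm trigger --verbose --subsystem-match=block | grep -i sd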

8. Verify that the replaced disk is now listed in the 'ls -l /dev/disk/by-label' output:

# ls -l /dev/disk/by-label

Mount HDFS or Oracle NoSQL Database Partition

1. Mount the HDFS or Oracle NoSQL Database partition as root, entering the appropriate mount point by replacing /unn below with the correct mount point from Table 2 above:

# mount /unn

For example:

# mount /u05
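The short form mount /unn works because /etc/fstab on the BDA references these partitions by label, which is why resetting the label with tune2fs/tune4fs in the previous section matters. You can optionally confirm the entry before mounting (the exact mount options in the output may differ on your system):

# grep -w /u05 /etc/fstab

The matching line should reference the partition as LABEL=/u05 rather than by kernel device name.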

 Optional sanity checks on an HDFS cluster only:

1. As an optional sanity check, run the following, replacing /unn with the correct mount point:

# ls -la /unn

Example:

# ls -la /u05
total 28
drwxr-xr-x  4 root root  4096 Jul 29 21:58 .
drwxr-xr-x 39 root root  4096 Sep 26 06:49 ..
drwxr-xr-x  4 root root  4096 Jul 29 21:58 hadoop
drwx------  2 root root 16384 Jul 29 18:43 lost+found

2. Run the following, replacing /unn/hadoop with the correct mount point:

# ls -la /unn/hadoop

Example:

# ls -la /u05/hadoop
total <X>
drwxr-xr-x 4 root   root   4096 Jul 29 21:58 .
drwxr-xr-x 4 root   root   4096 Jul 29 21:58 ..
drwx------ 3 hdfs   hadoop 4096 Sep 26 06:50 dfs
drwxr-xr-x 7 mapred hadoop 4096 Sep 26 15:54 mapred

In the case where the mount point still needs to be added back into Cloudera Manager, no dfs subdirectory will be seen yet.

Example:

# ls -la /u05/hadoop
total <X>
drwxr-xr-x 4 root   root   4096 Jul 29 21:58 .
drwxr-xr-x 4 root   root   4096 Jul 29 21:58 ..
drwxr-xr-x 7 mapred hadoop 4096 Sep 26 15:54 mapred

2. If you are configuring multiple drives, then repeat the previous steps.

3. If you previously removed a mount point in Cloudera Manager, then restore it to the list.

A restart of the DataNode is needed after the disk is replaced and configured. If the DataNode is not restarted, the replaced disk will not be recognized by HDFS.

a. Open a browser window to Cloudera Manager. For example:

http://bda1node03.example.com:7180

b. Log in to Cloudera Manager as admin.

c. On the Services page, click hdfs.

d. Click the Instances subtab.

e. In the Host column, locate the server with the replaced disk. Then click the service in the Name column, such as "datanode (...)", to open its page.

f. Click the Configuration subtab, then click View and Edit.

g. If the mount point is missing from the DataNode Data Directory dfs.data.dir field, then add it to the list. Do this only if it was removed previously or is missing.

Before: /u10/hadoop/dfs,/u09/hadoop/dfs,/u08/hadoop/dfs,/u07/hadoop/dfs,/u06/hadoop/dfs,/u04/hadoop/dfs,/u03/hadoop/dfs

After:  /u10/hadoop/dfs,/u09/hadoop/dfs,/u08/hadoop/dfs,/u07/hadoop/dfs,/u06/hadoop/dfs,/u05/hadoop/dfs,/u04/hadoop/dfs,/u03/hadoop/dfs

1. To add the entry, click the plus (+) sign on the row above the position where the new entry will be added.

2. An empty box will be added that you can fill in.

[Screenshot: DataNode Data Directory field with an empty entry added]


3. Fill in the mount point that you want to add. In this case /u05/hadoop/dfs was added in the empty box.

[Screenshot: DataNode Data Directory field with /u05/hadoop/dfs added]

h. Click Save Changes.

[Screenshot: Save Changes button]

i. From the Actions list, choose "Restart this DataNode." 

[Screenshot: Restart this DataNode action]

4. If you previously removed a mount point from NodeManager Local Directories, then also restore it to the list using Cloudera Manager. (On BDA V3.* and higher).

a. On the Services page, click Yarn.

b. In the Status Summary, click NodeManager.

c. From the list, click to select the NodeManager that is on the host with the failed disk.

d. Click the Configuration sub-tab.

e. If the mount point is missing from the NodeManager Local Directories field, then add it to the list.

f. Click Save Changes.

g. From the Actions list, choose Restart this NodeManager.

 

 Optional Sanity Checks on an HDFS cluster only:

1. After adding the mount point back into Cloudera Manager (in the cases where this was done in the previous step), the dfs directory should appear, and the following sanity check can be done to confirm that this succeeded. Replace /unn with the appropriate mount point:

# ls -la /unn

Example:

# ls -la /u05
total 28
drwxr-xr-x  4 root root  4096 Jul 29 21:58 .
drwxr-xr-x 39 root root  4096 Sep 26 06:49 ..
drwxr-xr-x  4 root root  4096 Jul 29 21:58 hadoop
drwx------  2 root root 16384 Jul 29 18:43 lost+found

2. Replace /unn/hadoop with the appropriate mount point to show the dfs directory: 

# ls -la /unn/hadoop

Example: 

# ls -la /u05/hadoop
total <X>
drwxr-xr-x 4 root   root   4096 Jul 29 21:58 .
drwxr-xr-x 4 root   root   4096 Jul 29 21:58 ..
drwx------ 3 hdfs   hadoop 4096 Sep 26 06:50 dfs
drwxr-xr-x 7 mapred hadoop 4096 Sep 26 15:54 mapred

3. For the new disk, after adding it into Cloudera Manager, you will observe almost nothing under /unn/hadoop/dfs compared with the other disks. This will change over time as the disk is used. Compare with a disk that has not been replaced; replace /unn/hadoop/dfs with a mount point that has not been replaced:

# du -ms /unn/hadoop/dfs

Example:

# du -ms /u04/hadoop/dfs
118841  /u04/hadoop/dfs

4. Replace /unn/hadoop/dfs with the mount point that has been replaced. 

# du -ms /unn/hadoop/dfs

Example:

# du -ms /u05/hadoop/dfs
1       /u05/hadoop/dfs
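As a final optional check on an HDFS cluster, you can confirm that HDFS reports the DataNode and its volumes after the restart. A minimal sketch, assuming the hdfs client is configured on the node and, on secured clusters, that a valid Kerberos ticket is held (on older CDH releases the equivalent command is hadoop dfsadmin -report):

# sudo -u hdfs hdfs dfsadmin -report | grep -A 8 "$(hostname)"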

Oracle NoSQL Database Disk Configuration

The following steps apply only to an Oracle NoSQL Database disk in an Oracle NoSQL Database cluster. This does not apply to a CDH cluster (HDFS).

1. Re-create the storage directories. Replace /unn with the appropriate mount point for the disk you have replaced:

# mkdir -p /unn/kvdata
# chmod 755 /unn/kvdata
# chown oracle:oinstall /unn/kvdata

For example, if it is disk /u04 that was replaced, run the following commands:

# mkdir -p /u04/kvdata
# chmod 755 /u04/kvdata
# chown oracle:oinstall /u04/kvdata
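Optionally verify the ownership and permissions before restarting the service. A quick check for the example above:

# ls -ld /u04/kvdata

The directory should be owned by oracle:oinstall with mode 755 (drwxr-xr-x).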


4. Start the NoSQL DB service

# service nsdbservice start

Verifying the Disk Configuration

To verify the disk configuration:

1. Check the software configuration as root user:  

# bdachecksw

Example successful output from running bdachecksw:

# bdachecksw
SUCCESS: Correct OS disk sda partition info : 1 ext3 raid 2 ext3 raid 3 linux-swap 4 ext3 primary
SUCCESS: Correct OS disk sdb partition info : 1 ext3 raid 2 ext3 raid 3 linux-swap 4 ext3 primary
SUCCESS: Correct data disk sdc partition info : 1 ext3 primary
SUCCESS: Correct data disk sdd partition info : 1 ext3 primary
SUCCESS: Correct data disk sde partition info : 1 ext3 primary
SUCCESS: Correct data disk sdf partition info : 1 ext3 primary
SUCCESS: Correct data disk sdg partition info : 1 ext3 primary
SUCCESS: Correct data disk sdh partition info : 1 ext3 primary
SUCCESS: Correct data disk sdi partition info : 1 ext3 primary
SUCCESS: Correct data disk sdj partition info : 1 ext3 primary
SUCCESS: Correct data disk sdk partition info : 1 ext3 primary
SUCCESS: Correct data disk sdl partition info : 1 ext3 primary
SUCCESS: Correct software RAID info : /dev/md2 level=raid1 num-devices=2 /dev/md0 level=raid1 num-devices=2
SUCCESS: Correct mounted partitions : /dev/md0 /boot ext3 /dev/md2 / ext3 /dev/sda4 /u01 ext4 /dev/sdb4 /u02 ext4 /dev/sdc1 /u03 ext4 /dev/sdd1 /u04 ext4 /dev/sde1 /u05 ext4 /dev/sdf1 /u06 ext4 /dev/sdg1 /u07 ext4 /dev/sdh1 /u08 ext4 /dev/sdi1 /u09 ext4 /dev/sdj1 /u10 ext4 /dev/sdk1 /u11 ext4 /dev/sdl1 /u12 ext4
SUCCESS: Correct swap partitions : /dev/sdb3 partition /dev/sda3 partition
SUCCESS: Correct internal USB device (sdm) : 1
SUCCESS: Correct internal USB partitions : 1 primary ext3
SUCCESS: Correct internal USB ext3 partition check : clean
SUCCESS: Correct Linux kernel version : Linux 2.6.32-200.21.1.el5uek
SUCCESS: Correct Java Virtual Machine version : HotSpot(TM) 64-Bit Server 1.6.0_29
SUCCESS: Correct puppet version : 2.6.11
SUCCESS: Correct MySQL version : 5.5.17
SUCCESS: All required programs are accessible in $PATH
SUCCESS: All required RPMs are installed and valid
SUCCESS: Correct bda-monitor status : bda monitor is running
SUCCESS: Big Data Appliance software validation checks succeeded

2. If there are errors, then redo the configuration steps as necessary to correct the problem.

a) If an error like the one below occurs, i.e., the replaced disk partition is listed at the end and all partitions are recognized, then the error can be ignored; it is caused by Bug 17899101 in the bdachecksw script.

ERROR: Wrong mounted partitions : /dev/md0 /boot ext3 /dev/md2 / ext3 /dev/sda4 /u01 ext4 /dev/sdb4 /u02 ext4 /dev/sdc1 /u03 ext4 /dev/sdd1 /u04 ext4 /dev/sdf1 /u06 ext4 /dev/sdg1 /u07 ext4 /dev/sdh1 /u08 ext4 /dev/sdi1 /u09 ext4 /dev/sdj1 /u10 ext4 /dev/sdk1 /u11 ext4 /dev/sdl1 /u12 ext4 /dev/sde1 /u05 ext4
INFO: Expected mounted partitions : 12 data partitions, /boot and /

Bug 17899101 is fixed in V2.4 release of BDA.

Patch 17924936 contains one-off patch for BUG 17899101 to V2.3.1 release of BDA.

Patch 17924887 contains one-off patch for BUG 17899101 to V2.2.1 release of BDA.

Refer to the Readme file for instructions on how to apply the patch. The Readme file also contains uninstall instructions if needed.

b) If an incorrect tune2fs command was entered above and then corrected by re-running the correct command, but you find that although bdachecksw and bdacheckhw are successful, mount -l shows output like the following:

/dev/sdl1 on /u05 type ext4 (rw,nodev,noatime) [/u06]
/dev/sdl1 on /u06 type ext4 (rw,nodev,noatime) [/u06]

instead of:

/dev/sdl1 on /u05 type ext4 (rw,nodev,noatime) [/u05]
/dev/sdl1 on /u06 type ext4 (rw,nodev,noatime) [/u06]

then the list of mounts contains stale information.


Try to umount and then remount, e.g. in the example here:

# umount /dev/sdl1
# umount /u05
# umount /u06

# mount /u05
# mount /u06

If this does not resolve the issue, then as long as bdachecksw and bdacheckhw are completely successful, you can try a reboot to clear any stale information.

What If Firmware Warnings or Errors occur?

If the bdacheckhw utility reports errors or warnings regarding the HDD (Hard Disk Drive) Firmware Information, indicating that the HDD firmware needs to be updated, follow the instructions in "Firmware Usage and Upgrade Information for BDA Software Managed Components on Oracle Big Data Appliance V2 (Doc ID 1542871.1)".

What If a Server Fails to Restart?

The server may restart during the disk replacement procedures, either because you issued a reboot command or made an error in a MegaCli64 command. In most cases, the server restarts successfully, and you can continue working. However, in other cases, an error occurs so that you cannot reconnect using ssh. In this case, you must complete the reboot using Oracle ILOM.

To restart a server using Oracle ILOM:

1.  Use your browser to open a connection to the server using Oracle ILOM. For example:

http://bda1node12-c.example.com

 Note: Your browser must have a JDK plug-in installed. If you do not see the Java coffee cup on the log-in page, then you must install the plug-in before continuing.

2.  Log in using your Oracle ILOM credentials.

3.  Select the Remote Control tab.

4.  Click the Launch Remote Console button.

5.  Enter Ctrl+d to continue rebooting.

6.  If the reboot fails, then enter the server root password at the prompt and attempt to fix the problem.

7.  After the server restarts successfully, open the Redirection menu and choose Quit to close the console window.

See the following documentation for more information: Oracle Integrated Lights Out Manager (ILOM) 3.0 documentation at http://docs.oracle.com/cd/E19860-01/

 

References

<NOTE:1542871.1> - Firmware Usage and Upgrade Information for BDA Software Managed Components on Oracle Big Data Appliance
<NOTE:1581331.1> - Steps for Replacing a Disk Drive and Determining its Function on the Oracle Big Data Appliance V2.2.*/V2.3.1/V2.4.0/V2.5.0/V3.x/V4.x

Attachments
This solution has no attachment