Asset ID: 1-71-1382300.1
Update Date: 2018-03-06
Solution Type: Technical Instruction

1382300.1: ODA (Oracle Database Appliance) : How to replace a FAILED SYSTEM BOOT DISK
Related Items:
- Oracle Database Appliance Software
- Linux OS
- Oracle Database Appliance X3-2
- Oracle Database Appliance
Related Categories:
- PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Database Appliance>DB: ODA_EST
- _Old GCS Categories>ST>Server>Engineered Systems>Oracle Database Appliance>Hardware
Applies to:
Oracle Database Appliance Software - Version 2.1.0.1 and later
Oracle Database Appliance - Version All Versions and later
Linux OS - Version Oracle Linux 5.0 to Oracle Linux 5.0 [Release OL5]
Oracle Database Appliance X3-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Goal
This article provides the steps to replace a failed or faulty system boot disk on Oracle Database Appliance servers.
Solution
For X5-2
How to Replace a boot drive in an Oracle Database Appliance X5-2 Server node (Doc ID 1991446.1) 
For X4-2 and X3-2
NOTE: "oakcli add disk -local" feature was introduced starting in 12.1.2.3.0. for Bare Metal Installations (Only)
There is a new option to adding an OS boot drive that simplifies the whole process: 
Once disk has been replaced, and it is confirmed that lsscsi command sees the new disk, simply use the oakcli command to sync up the partitions, for example:
# oakcli add disk -local
Step 1: Identifying the new device name created for the replaced disk...
new device is /dev/sdax
old device is /dev/sdaw
Step 2: Backing up the partition table from properly running disk...
Step 3: Checking and ensure that there is no partition table on replaced disk...
Step 4: Partitioning the replaced disk i.e /dev/sdc using partition table backed up in step 2...
Checking that both disks have similar partition boundaries...
Step 5: Adding the replaced disk to the raid group...
Step 6: Updating grub MBR on the replaced disk...
Step 7: Updating the multipath.conf...
Step 8: Flushing multipath...
Step 9: Reloading multipath...
Step 10: Waiting for the RAID sync to complete...(this can take around 2 hours)
[>....................] recovery = 4.9% (29048640/585545088) finish=221.5min speed=41866K/sec
[===>.................] recovery = 18.2% (106743040/585545088) finish=106.8min speed=74665K/sec
Then confirm again that the sync has completed, for example:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdax1[1] sdaw1[0]
513984 blocks [2/2] [UU]
md1 : active raid1 sdax2[1] sdaw2[0]
585545088 blocks [2/2] [UU]
unused devices: <none>
** The above example was taken from a Bare Metal install on an X3-2.
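The RAID sync can take a couple of hours. Instead of re-running "cat /proc/mdstat" by hand, the rebuild can be monitored continuously (a minimal sketch; the 60-second interval is arbitrary):
# watch -n 60 cat /proc/mdstat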
There are two system boot disks on each Database Appliance server.
For ODA V1
- Use "oakcli add disk -local" to add the system disk. However, this command does not work on many older V1 ODA versions.
- You cannot replace a disk which has only partially failed using the provided commands; you may need to manually fail the problem disk first.
- See the steps below to collect more information about the system / boot disks before replacement.
# lsscsi -v | grep ATA
[0:0:0:0] disk ATA SEAGATE ST95000N n/a /dev/sda
[1:0:0:0] disk ATA SEAGATE ST95000N n/a /dev/sdb
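Optionally, if the smartmontools package is installed on the node (an assumption; it is not present on every ODA image), the SMART health of each boot disk can also be checked before deciding to fail it:
# smartctl -H /dev/sda
# smartctl -H /dev/sdb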
ODA V1 - continued
# fdisk -l /dev/sda
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104391 fd Linux raid autodetect
/dev/sda2 14 60801 488279610 fd Linux raid autodetect
# fdisk -l /dev/sdb
Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 13 104391 fd Linux raid autodetect
/dev/sdb2 14 60801 488279610 fd Linux raid autodetect
Partitions on these two System Boot Disks are protected by RAID1 (mirroring) at the OS level.
/dev/md0 created from /dev/sda1 and /dev/sdb1
/dev/md1 created from /dev/sda2 and /dev/sdb2
The following commands can be used to find the details about each RAID device:
# mdadm --detail /dev/md0
# mdadm --detail /dev/md1
The following file shows the current sync status of the RAID devices on the system:
/proc/mdstat
Properly working / synchronized RAID devices will have the following status:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
488279488 blocks [2/2] [UU]
unused devices: <none>
If a system boot disk (for example /dev/sdb) has failed, the contents of /proc/mdstat will look like this:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0]
104320 blocks [2/1] [U_]
md1 : active raid1 sda2[0]
488279488 blocks [2/1] [U_]
unused devices: <none>
The following steps can be used to replace the failed system boot disk.
Note: in this EXAMPLE scenario
- It is assumed that system boot disk /dev/sdb has failed, and that the disk is on an ODA V1.
- In your environment, confirm which system boot disk has failed by observing the contents of /proc/mdstat.
- /proc/mdstat shows the disk which is still functioning ok.
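As a quick check, you can grep /proc/mdstat for a degraded two-member array (a minimal sketch; a [U_] or [_U] status indicates a missing member):
# grep -E '\[(U_|_U)\]' /proc/mdstat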
Important Note: There are two common scenarios for needing to replace a Boot Disk
1) When a disk has failed and/or is already physically removed.
2) When a disk is receiving I/O error messages but has not yet completely failed.
1. Remove the failed/faulty system boot disk partitions from the RAID group
# mdadm -r /dev/md0 /dev/sdb1
mdadm: hot removed /dev/sdb1
# mdadm -r /dev/md1 /dev/sdb2
mdadm: hot removed /dev/sdb2
IF the disk has not completely failed, use the "mdadm --manage --fail" command on the problem disk before removal.
Comment:
This example shows the failing and removal of BOTH partitions of the problem system disk, to confirm the commands and outputs.
(IF you lose the OS on both boot disks, the node will require redeployment/re-image.)
# mdadm --manage /dev/md0 --fail /dev/sdb1
# mdadm --manage /dev/md1 --fail /dev/sdb2
Now run "cat /proc/mdstat"; it should look something like this:
[root@oda ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]   <<< notice the (F) flag
104320 blocks [2/1] [U_]
md1 : active raid1 sdb2[1](F) sda2[0]
488279488 blocks [2/1] [U_]
Now, you can use the remove command:
# mdadm --manage /dev/md0 --remove /dev/sdb1
# mdadm --manage /dev/md1 --remove /dev/sdb2
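For reference, the fail-then-remove sequence for both partitions can be wrapped in a small loop (a minimal sketch, assuming the problem disk is /dev/sdb as in this example; substitute your own device names):
for pair in md0:sdb1 md1:sdb2; do
    md=${pair%%:*}; part=${pair##*:}
    mdadm --manage /dev/$md --fail /dev/$part
    mdadm --manage /dev/$md --remove /dev/$part
done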
2. Replace the failed/faulty system boot disk
As the system boot disks are hot pluggable, identify and replace the FAILED system boot disk, located at the back panel of the Database Appliance server. Generally, the amber Service Required LED might be lit for the failed/faulty disk drive.
3. Identify the new device name created for the replaced disk using the following command*
# lsscsi -v | grep ATA
[0:0:0:0] disk ATA SEAGATE ST95000N n/a /dev/sda
[1:0:0:0] disk ATA SEAGATE ST95000N n/a /dev/sdc <<<<<<<<
* For V2, use: lsscsi -v | grep 600G
In this output, /dev/sdc is the new device name created for the replaced disk.
4. Partition the new device (to match the disk that is running ok) using the following steps
Back up the partition table from the properly running disk
# sfdisk -d /dev/sda > /tmp/sda_part
Check and ensure that there is no partition table on the replaced disk* (i.e. /dev/sdc)
# fdisk -l /dev/sdc
Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdc doesn't contain a valid partition table
* V2 topology will be different
Partition the replaced disk (i.e. /dev/sdc) using the partition table backed up to /tmp/sda_part
# sfdisk /dev/sdc --force < /tmp/sda_part
Check and ensure that both disks have the same partition boundaries
# fdisk -l /dev/sdc
# fdisk -l /dev/sda
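As an additional check, the two partition tables can be diffed directly (a hedged sketch; sed normalizes the device names, so no diff output means an identical layout):
# sfdisk -d /dev/sda | sed 's,/dev/sda,DISK,g' > /tmp/sda.chk
# sfdisk -d /dev/sdc | sed 's,/dev/sdc,DISK,g' > /tmp/sdc.chk
# diff /tmp/sda.chk /tmp/sdc.chk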
UPDATE [5/2015] - PERFORM THIS STEP BEFORE STEP 5
Update the multipath.conf
Ensure that device mapper devices are not created for the System Boot Disks, by adding them to the blacklist section of the /etc/multipath.conf file.
Get the scsi id for the replaced disk
# /sbin/scsi_id -g -u -s /block/sdc
SATA_SEAGATE_ST950009SP19KZT_
Add to the blacklist section of the multipath.conf
blacklist {
devnode "^asm/*"
devnode "ofsctl"
wwid SATA_SEAGATE_ST950009SP122XT_
wwid SATA_SEAGATE_ST950009SP19KZT_
}
Refresh the config
# multipath -F
# multipath -r
# multipath -v2
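To confirm that the blacklist took effect after the reload, check that multipath no longer reports a path for the replaced disk (a hedged check, using /dev/sdc as in this example; no output is expected once the disk is blacklisted):
# multipath -ll | grep sdc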
5. Add replaced disk to the raid group
# mdadm -a /dev/md0 /dev/sdc1
mdadm: added /dev/sdc1
# mdadm -a /dev/md1 /dev/sdc2
mdadm: added /dev/sdc2
6. Check and ensure that sync is in progress
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sda1[0]
104320 blocks [2/2] [UU]
md1 : active raid1 sdc2[2] sda2[0]
488279488 blocks [2/1] [U_]
[>....................] recovery = 0.0% (362112/488279488) finish=157.1min speed=51730K/sec
unused devices: <none>
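If you prefer to block until the resync finishes rather than polling, mdadm offers a wait option (an assumption that the mdadm build shipped on the appliance supports it; check "man mdadm"):
# mdadm --wait /dev/md0 /dev/md1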
7. Update Grub MBR
On the replaced disk, update the MBR stamp so that it is identified as a bootable disk. Run the commands below to generate the /tmp/mbr.sh script.
GRUB_SCRIPT="/tmp/mbr.sh"
OS_DISK="sdc"
echo "grub <<EOF" >>$GRUB_SCRIPT
echo "device (hd0) /dev/$OS_DISK" >>$GRUB_SCRIPT
echo "root (hd0,0)" >>$GRUB_SCRIPT
echo "setup (hd0)" >>$GRUB_SCRIPT
echo "EOF" >>$GRUB_SCRIPT
NOTE: In the above, OS_DISK must be set to the replaced disk.
Run the script:
# sh /tmp/mbr.sh
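To verify that the MBR was stamped, the boot signature bytes (55 aa) at the end of sector 0 can be inspected (a minimal sketch using dd and od, both standard on Linux):
# dd if=/dev/sdc bs=1 count=2 skip=510 2>/dev/null | od -An -tx1
 55 aa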
8. Run a validation check
# oakcli validate -v -c OSDiskStorage
--- "validate" will show errors on the disk until the rebuild is 100% complete.
Once the rebuild is complete, the system boot disks are back to their normal state.
9. Update the multipath.conf
This is the same procedure described in the UPDATE before Step 5: ensure that device mapper devices are not created for the System Boot Disks by adding them to the blacklist section of the /etc/multipath.conf file.
If you encounter the error "mdadm: Cannot open /dev/sdXX: Device or resource busy",
refer to the pre-Step 5 update on setting the blacklist section, and to Document 1483141.1 - mdadm: Cannot open /dev/sda1: Device or resource busy.
For the latest X5-2 systems, if scsi_id reports a wwid such as SATA_WDC_WD500BLHXSUWD-WXD1E71LUZT9_, add entries like the following to the blacklist section:
blacklist {
devnode "^asm/*"
devnode "ofsctl"
wwid *WD500BLHXSUWD-WXD1E71LUZT9*
wwid *WD500BLHXSUWD-WX71E71WY554*
}
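To collect the wwid strings for both boot disks in one pass, a small loop can be used (a hedged sketch; the scsi_id syntax shown matches the older OL5-style invocation used earlier in this note, and differs on later OS releases):
# for d in sda sdb; do /sbin/scsi_id -g -u -s /block/$d; done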


References
<NOTE:1483141.1> - mdadm: Cannot open /dev/sda1: Device or resource busy
<NOTE:1991446.1> - How to Replace a boot drive in an Oracle Database Appliance X5-2/X6-2HA/X6-2S/X6-2M Server
Attachments
This solution has no attachment