Asset ID: 1-71-1382300.1
Update Date: 2018-03-06
Solution Type: Technical Instruction

1382300.1: ODA (Oracle Database Appliance) : How to replace a FAILED SYSTEM BOOT DISK
Related Items:
- Oracle Database Appliance Software
- Linux OS
- Oracle Database Appliance X3-2
- Oracle Database Appliance
Related Categories:
- PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Database Appliance>DB: ODA_EST
- _Old GCS Categories>ST>Server>Engineered Systems>Oracle Database Appliance>Hardware
Applies to:
Oracle Database Appliance Software - Version 2.1.0.1 and later
Oracle Database Appliance - Version All Versions and later
Linux OS - Version Oracle Linux 5.0 to Oracle Linux 5.0 [Release OL5]
Oracle Database Appliance X3-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Goal
This article provides the steps to replace a failed or faulty system boot disk on Oracle Database Appliance servers.
Solution
For X5-2
How to Replace a boot drive in an Oracle Database Appliance X5-2 Server node (Doc ID 1991446.1) 
For X4-2 and X3-2
NOTE: "oakcli add disk -local" feature was introduced starting in 12.1.2.3.0. for Bare Metal Installations (Only)
There is a new option to adding an OS boot drive that simplifies the whole process: 
Once disk has been replaced, and it is confirmed that lsscsi command sees the new disk, simply use the oakcli command to sync up the partitions, for example:
# oakcli add disk -local
Step 1: Identifying the new device name created for the replaced disk...
new device is /dev/sdax
old device is /dev/sdaw
Step 2: Backing up the partition table from properly running disk...
Step 3: Checking and ensure that there is no partition table on replaced disk...
Step 4: Partitioning the replaced disk i.e /dev/sdc using partition table backed up in step 2...
Checking that both disks have similar partition boundaries...
Step 5: Adding the replaced disk to the raid group...
Step 6: Updating grub MBR on the replaced disk...
Step 7: Updating the multipath.conf...
Step 8: Flushing multipath...
Step 9: Reloading multipath...
Step 10: Waiting for the RAID sync to complete...(this can take around 2 hours)
[>....................] recovery = 4.9% (29048640/585545088) finish=221.5min speed=41866K/sec
[===>.................] recovery = 18.2% (106743040/585545088) finish=106.8min speed=74665K/sec
Then confirm again that the sync has completed, for example:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdax1[1] sdaw1[0]
513984 blocks [2/2] [UU]
md1 : active raid1 sdax2[1] sdaw2[0]
585545088 blocks [2/2] [UU]
unused devices: <none>
** The above example was taken from a Bare Metal install on an X3-2.
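The RAID sync can take a couple of hours. Instead of re-running "cat /proc/mdstat" by hand, the rebuild can be monitored continuously (a minimal sketch; the 60-second interval is arbitrary):
# watch -n 60 cat /proc/mdstat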
There are two system boot disks on each Database Appliance server.
For ODA V1
- Use "oakcli add disk -local" to add the system disk. However, this command does not work on many older V1 ODA versions.
- You cannot replace a disk which has only partially failed using the provided commands; you may need to manually fail the problem disk first.
- See the steps below to collect more information about the system / boot disks before replacement.
# lsscsi -v | grep ATA
[0:0:0:0] disk ATA SEAGATE ST95000N n/a /dev/sda
[1:0:0:0] disk ATA SEAGATE ST95000N n/a /dev/sdb
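Optionally, if the smartmontools package is installed on the node (an assumption; it is not present on every ODA image), the SMART health of each boot disk can also be checked before deciding to fail it:
# smartctl -H /dev/sda
# smartctl -H /dev/sdb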
ODA V1 - continued
# fdisk -l /dev/sda
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104391 fd Linux raid autodetect
/dev/sda2 14 60801 488279610 fd Linux raid autodetect
# fdisk -l /dev/sdb
Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 13 104391 fd Linux raid autodetect
/dev/sdb2 14 60801 488279610 fd Linux raid autodetect
Partitions on these two System Boot Disks are protected by RAID1 (mirroring) at the OS level.
/dev/md0 created from /dev/sda1 and /dev/sdb1
/dev/md1 created from /dev/sda2 and /dev/sdb2
The following commands can be used to find the details about each RAID device:
# mdadm --detail /dev/md0
# mdadm --detail /dev/md1
The following file shows the current sync status of the RAID devices on the system:
/proc/mdstat
Properly working / synchronized RAID devices will have the following status:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
488279488 blocks [2/2] [UU]
unused devices: <none>
If a system boot disk (for example /dev/sdb) has failed, the contents of /proc/mdstat will look like this:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0]
104320 blocks [2/1] [U_]
md1 : active raid1 sda2[0]
488279488 blocks [2/1] [U_]
unused devices: <none>
The following steps can be used to replace the failed system boot disk.
Note: in this EXAMPLE scenario
- It is assumed that system boot disk /dev/sdb has failed, and that the disk is on an ODA V1.
- In your environment, confirm which system boot disk has failed by observing the contents of /proc/mdstat.
- /proc/mdstat shows the disk which is still functioning ok.
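As a quick check, you can grep /proc/mdstat for a degraded two-member array (a minimal sketch; a [U_] or [_U] status indicates a missing member):
# grep -E '\[(U_|_U)\]' /proc/mdstat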
Important Note: There are two common scenarios for needing to replace a Boot Disk
1) When a disk has failed and/or is already physically removed.
2) When a disk is receiving I/O error messages but has not yet completely failed.
1. Remove the failed/faulty system boot disk partitions from the RAID group
# mdadm -r /dev/md0 /dev/sdb1
mdadm: hot removed /dev/sdb1
# mdadm -r /dev/md1 /dev/sdb2
mdadm: hot removed /dev/sdb2
IF the disk has not completely failed, use the "mdadm --manage --fail" command on the problem disk before removal.
Comment:
This example shows the failing and removal of BOTH partitions of the problem system disk, to confirm the commands and outputs.
(IF you lose the OS on both boot disks, the node will require redeployment/re-image.)
# mdadm --manage /dev/md0 --fail /dev/sdb1
# mdadm --manage /dev/md1 --fail /dev/sdb2
Now run "cat /proc/mdstat"; it should look something like this:
[root@oda ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]   <<< notice the (F) flag
104320 blocks [2/1] [U_]
md1 : active raid1 sdb2[1](F) sda2[0]
488279488 blocks [2/1] [U_]
Now, you can use the remove command:
# mdadm --manage /dev/md0 --remove /dev/sdb1
# mdadm --manage /dev/md1 --remove /dev/sdb2
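For reference, the fail-then-remove sequence for both partitions can be wrapped in a small loop (a minimal sketch, assuming the problem disk is /dev/sdb as in this example; substitute your own device names):
for pair in md0:sdb1 md1:sdb2; do
    md=${pair%%:*}; part=${pair##*:}
    mdadm --manage /dev/$md --fail /dev/$part
    mdadm --manage /dev/$md --remove /dev/$part
done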
2. Replace the failed/faulty system boot disk
As the system boot disks are hot pluggable, identify and replace the FAILED system boot disk, located at the back panel of the Database Appliance server. Generally, the amber Service Required LED might be lit for the failed/faulty disk drive.
3. Identify the new device name created for the replaced disk using the following command*
# lsscsi -v | grep ATA
[0:0:0:0] disk ATA SEAGATE ST95000N n/a /dev/sda
[1:0:0:0] disk ATA SEAGATE ST95000N n/a /dev/sdc <<<<<<<<
* For V2, use: lsscsi -v | grep 600G
In this output, /dev/sdc is the new device name created for the replaced disk.
4. Partition the new device (to match the disk that is running ok) using the following steps
Back up the partition table from the properly running disk
# sfdisk -d /dev/sda > /tmp/sda_part
Check and ensure that there is no partition table on the replaced disk* (i.e. /dev/sdc)
# fdisk -l /dev/sdc
Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdc doesn't contain a valid partition table
* V2 topology will be different
Partition the replaced disk (i.e. /dev/sdc) using the partition table backed up to /tmp/sda_part
# sfdisk /dev/sdc --force < /tmp/sda_part
Check and ensure that both disks have the same partition boundaries
# fdisk -l /dev/sdc
# fdisk -l /dev/sda
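As an additional check, the two partition tables can be diffed directly (a hedged sketch; sed normalizes the device names, so no diff output means an identical layout):
# sfdisk -d /dev/sda | sed 's,/dev/sda,DISK,g' > /tmp/sda.chk
# sfdisk -d /dev/sdc | sed 's,/dev/sdc,DISK,g' > /tmp/sdc.chk
# diff /tmp/sda.chk /tmp/sdc.chk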
UPDATE [5/2015] - PERFORM THIS STEP BEFORE STEP 5
Update the multipath.conf
Ensure that device mapper devices are not created for the System Boot Disks, by adding them to the blacklist section of the /etc/multipath.conf file.
Get the scsi id for the replaced disk
# /sbin/scsi_id -g -u -s /block/sdc
SATA_SEAGATE_ST950009SP19KZT_
Add to the blacklist section of the multipath.conf
blacklist {
devnode "^asm/*"
devnode "ofsctl"
wwid SATA_SEAGATE_ST950009SP122XT_
wwid SATA_SEAGATE_ST950009SP19KZT_
}
Refresh the config
# multipath -F
# multipath -r
# multipath -v2
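To confirm that the blacklist took effect after the reload, check that multipath no longer reports a path for the replaced disk (a hedged check, using /dev/sdc as in this example; no output is expected once the disk is blacklisted):
# multipath -ll | grep sdc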
5. Add replaced disk to the raid group
# mdadm -a /dev/md0 /dev/sdc1
mdadm: added /dev/sdc1
# mdadm -a /dev/md1 /dev/sdc2
mdadm: added /dev/sdc2
6. Check and ensure that sync is in progress
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sda1[0]
104320 blocks [2/2] [UU]
md1 : active raid1 sdc2[2] sda2[0]
488279488 blocks [2/1] [U_]
[>....................] recovery = 0.0% (362112/488279488) finish=157.1min speed=51730K/sec
unused devices: <none>
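If you prefer to block until the resync finishes rather than polling, mdadm offers a wait option (an assumption that the mdadm build shipped on the appliance supports it; check "man mdadm"):
# mdadm --wait /dev/md0 /dev/md1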
7. Update Grub MBR
On the replaced disk, update the MBR stamp so that it is identified as a bootable disk. Run the commands below to generate the /tmp/mbr.sh script.
GRUB_SCRIPT="/tmp/mbr.sh"
OS_DISK="sdc"
echo "grub <<EOF" >>$GRUB_SCRIPT
echo "device (hd0) /dev/$OS_DISK" >>$GRUB_SCRIPT
echo "root (hd0,0)" >>$GRUB_SCRIPT
echo "setup (hd0)" >>$GRUB_SCRIPT
echo "EOF" >>$GRUB_SCRIPT
NOTE: In the above, OS_DISK must be set to the replaced disk.
Run the script:
# sh /tmp/mbr.sh
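To verify that the MBR was stamped, the boot signature bytes (55 aa) at the end of sector 0 can be inspected (a minimal sketch using dd and od, both standard on Linux):
# dd if=/dev/sdc bs=1 count=2 skip=510 2>/dev/null | od -An -tx1
 55 aa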
8. Run a validation check
# oakcli validate -v -c OSDiskStorage
--- "validate" will show errors on the disk until the rebuild is 100% complete.
Once the rebuild is complete, the system boot disks are back to their normal state.
9. Update the multipath.conf
This is the same procedure described in the UPDATE before Step 5: ensure that device mapper devices are not created for the System Boot Disks by adding them to the blacklist section of the /etc/multipath.conf file.
If you encounter the error "mdadm: Cannot open /dev/sdXX: Device or resource busy",
refer to the pre-Step 5 update on setting the blacklist section, and to Document 1483141.1 - mdadm: Cannot open /dev/sda1: Device or resource busy.
For the latest X5-2 systems, if scsi_id reports a wwid such as SATA_WDC_WD500BLHXSUWD-WXD1E71LUZT9_, add entries like the following to the blacklist section:
blacklist {
devnode "^asm/*"
devnode "ofsctl"
wwid *WD500BLHXSUWD-WXD1E71LUZT9*
wwid *WD500BLHXSUWD-WX71E71WY554*
}
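To collect the wwid strings for both boot disks in one pass, a small loop can be used (a hedged sketch; the scsi_id syntax shown matches the older OL5-style invocation used earlier in this note, and differs on later OS releases):
# for d in sda sdb; do /sbin/scsi_id -g -u -s /block/$d; done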


References
<NOTE:1483141.1> - mdadm: Cannot open /dev/sda1: Device or resource busy
<NOTE:1991446.1> - How to Replace a boot drive in an Oracle Database Appliance X5-2/X6-2HA/X6-2S/X6-2M Server
Attachments
This solution has no attachment