Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1909127.1
Update Date: 2014-07-21
Keywords:

Solution Type: Problem Resolution (Sure)

Solution 1909127.1: reclaimdisks.sh -reclaim interrupted, affecting execution of dbnodeupdate


Related Items
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST




In this Document
Symptoms
Cause
Solution


Applies to:

Exadata Database Machine V2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

  • dbnodeupdate was executed on a system configured for dual boot, or with disks left unreclaimed.
  • reclaimdisks.sh failed during execution, leaving the RAID configuration in an incorrect state.
  • dbnodeupdate was then executed without fixing the RAID configuration, leaving the image status in FAILURE.

 

Note: This is not a problem with the dbnodeupdate script itself; it is only affected by the incomplete execution of reclaimdisks.sh.

Cause

When dual boot is configured on the system, free disks must be reclaimed before any deployment. If the system is deployed without reclaiming the disks, any further execution of dbnodeupdate will not run, and will instruct you to reclaim the unused disks first.
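As a quick pre-flight check (a minimal sketch, not part of the original note; it assumes reclaimdisks.sh -check exits non-zero when the configuration is invalid), the reclaim state can be verified before invoking dbnodeupdate:

    # Hypothetical pre-flight check before running dbnodeupdate;
    # assumes a non-zero exit status from -check on an invalid layout.
    if /opt/oracle.SupportTools/reclaimdisks.sh -check; then
        echo "Disk layout valid - safe to run dbnodeupdate"
    else
        echo "Disks not reclaimed - run reclaimdisks.sh -free -reclaim first"
    fi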

Output of reclaimdisks.sh -check when the disks have not been reclaimed:

 

   

    2014-07-11 11:00:14 +0530  [INFO] This is SUN FIRE X4170 M2 SERVER machine
    2014-07-11 11:00:14 +0530  [INFO] Number of LSI controllers: 1
    2014-07-11 11:00:14 +0530  [INFO] Physical disks found: 4 (252:0 252:1 252:2 252:3)
    2014-07-11 11:00:14 +0530  [INFO] Logical drives found: 3
    2014-07-11 11:00:14 +0530  [INFO] Linux logical drive: 0
    2014-07-11 11:00:14 +0530  [INFO] RAID Level for the Linux logical drive: 1
    2014-07-11 11:00:14 +0530  [INFO] Dual boot installation: yes
    2014-07-11 11:00:14 +0530  [INFO] LVM based installation: yes
    File descriptor 5 (/var/log/cellos/dbnodeupdate.log) leaked on lvm invocation. Parent PID 28741: /bin/bash
    2014-07-11 11:00:16 +0530  [INFO] Physical disks in the Linux logical drive: 2 (252:0 252:1)
    2014-07-11 11:00:16 +0530  [INFO] Dedicated Hot Spares for the Linux logical drive: 0
    2014-07-11 11:00:17 +0530  [INFO] Global Hot Spares: 0
    2014-07-11 11:00:17 +0530  [INFO] Valid dual boot configuration found for Linux: RAID1 from 2 disks

 

Output in dbnodeupdate.log when it detects the unreclaimed disks:

 

  

[1405062718][2014-07-11 12:43:54 +0530][FILE][/u01/dbnodeupdate/dbnodeupdate.sh][CheckSolReclaimed][]  [FILE:/tmp/.yum_update.110714124112]
    2014-07-11 12:43:51 +0530  [INFO] This is SUN FIRE X4170 M2 SERVER machine
    2014-07-11 12:43:51 +0530  [INFO] Number of LSI controllers: 1
    2014-07-11 12:43:52 +0530  [INFO] Physical disks found: 4 (252:0 252:1 252:2 252:3)
    2014-07-11 12:43:52 +0530  [INFO] Logical drives found: 3
    2014-07-11 12:43:52 +0530  [INFO] Linux logical drive: 0
    2014-07-11 12:43:52 +0530  [INFO] RAID Level for the Linux logical drive: 1
    2014-07-11 12:43:52 +0530  [INFO] Dual boot installation: yes
    2014-07-11 12:43:52 +0530  [INFO] LVM based installation: yes
    File descriptor 5 (/var/log/cellos/dbnodeupdate.log) leaked on lvm invocation. Parent PID 13670: /bin/bash
    2014-07-11 12:43:53 +0530  [INFO] Physical disks in the Linux logical drive: 2 (252:0 252:1)
    2014-07-11 12:43:53 +0530  [INFO] Dedicated Hot Spares for the Linux logical drive: 0
    2014-07-11 12:43:54 +0530  [INFO] Global Hot Spares: 0
    2014-07-11 12:43:54 +0530  [INFO] Valid dual boot configuration found for Linux: RAID1 from 2 disks
[1405062718][2014-07-11 12:43:54 +0530][INFO][/u01/dbnodeupdate/dbnodeupdate.sh][PrintGenError][]  Entering PrintGenError Solaris disks are not reclaimed. This needs to be done before the upgrade. See the Exadata Database Machine documentation to claim the Solaris disks
[1405062718][2014-07-11 12:43:54 +0530][ERROR][/u01/dbnodeupdate/dbnodeupdate.sh][PrintGenError][]  Solaris disks are not reclaimed. This needs to be done before the upgrade. See the Exadata Database Machine documentation to claim the Solaris disks
[1405062718][2014-07-11 12:43:54 +0530][INFO][/u01/dbnodeupdate/dbnodeupdate.sh][UpdateDbnodeupdateStatFile][]  Entering UpdateDbnodeupdateStatFile failed

  

  • The Linux logical group is configured as RAID 1 with two disks (slots 0 and 1).
  • The reclaim procedure creates a new logical group as RAID 5 with three physical disks (slots 1, 2, and 3). This requires the disk in slot 1, which is part of logical group 0.
  • Because reclaimdisks.sh was interrupted while creating the new RAID 5 logical group, the RAID configuration is left incorrect:
      • a logical group in RAID 1 with only one disk
      • a logical group in RAID 5 with three disks but no content
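The broken layout can also be seen directly (an illustrative check reusing the MegaCli command from step 6 of the solution below); in this state two disk groups are reported instead of the expected one:

# /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -a0 | grep 'DISK GROUP'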

From dbnodeupdate.log, or from an execution of reclaimdisks.sh -check, the check fails with:
2014-07-16 17:13:57 +0530  [INFO] Physical disks found: 4 (252:0 252:1 252:2 252:3)
2014-07-16 17:13:57 +0530  [INFO] Logical drives found: 2
2014-07-16 17:13:57 +0530  [INFO] Linux logical drive: 0
2014-07-16 17:13:57 +0530  [INFO] RAID Level for the Linux logical drive: 1
2014-07-16 17:13:57 +0530  [INFO] Dual boot installation: yes
2014-07-16 17:13:57 +0530  [INFO] LVM based installation: yes
File descriptor 4 (/var/log/cellos/dbnodeupdate.log) leaked on lvm invocation. Parent PID 16754: /bin/bash
2014-07-16 17:13:58 +0530  [INFO] Physical disks in the Linux logical drive: 1 (252:0)
2014-07-16 17:13:58 +0530  [INFO] Dedicated Hot Spares for the Linux logical drive: 0
2014-07-16 17:13:59 +0530  [INFO] Global Hot Spares: 0
2014-07-16 17:13:59 +0530  [ERROR] Expected RAID 1 from 2 physical disks with no dedicated and global hot spares

The next execution of dbnodeupdate -u will run, but during validation the image status will be marked as FAILURE because of the incorrect RAID configuration.
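The resulting state can be confirmed with the standard Exadata imageinfo utility (a hedged example; the exact output format varies by release):

# imageinfo | grep -i 'image status'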

  

 

 

Solution


To run reclaimdisks.sh again, the system must first be restored to the expected configuration:

  • one logical group in RAID 1 with two disks
  • the disks in slots 2 and 3 unused

 

STEPS

1. Remove logical group 1, which was created by the interrupted execution of reclaimdisks.sh.

Make sure that the physical disk in slot 1 is released.

# /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -Force -a0
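To verify the deletion (an illustrative check; the -LDInfo option is standard MegaCli, but is not used elsewhere in this note), list the remaining logical drives and confirm only the Linux RAID 1 drive (target ID 0) is left:

# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0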



2. Logical group 0 (Linux) now has a missing disk, because the interrupted run executed the -PdMarkMissing command against it. Retrieve the missing-disk information by running:

#  /opt/MegaRAID/MegaCli/MegaCli64 -pdgetmissing -a0


Validate that the output of this command shows Array 0, Row 1. If that is correct, continue with the next step.

3. Replace the missing disk by running:

#  /opt/MegaRAID/MegaCli/MegaCli64 -pdreplacemissing -physdrv [252:1] -array0 -row1 -a0


The arguments are obtained from the output of the command in step 2, where:

array0 : logical group 0

row1   : the physical disk in slot 1

a0     : the disk array controller (adapter 0)



4. Start the rebuild of the disk within the Linux logical group:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdrbld -start -physdrv [252:1] -a0


5. Check the progress of the rebuild:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdrbld -showprog -physdrv [252:1] -a0

  

This process must complete before moving to the next step. It can take around two hours or more.
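Rather than re-running the command by hand, the progress can be polled periodically (a minimal sketch built only from the MegaCli invocation above; the loop itself is not part of the original note):

    # Poll rebuild progress every 5 minutes; interrupt with Ctrl-C once
    # MegaCli reports the rebuild has completed.
    while true; do
        /opt/MegaRAID/MegaCli/MegaCli64 -pdrbld -showprog -physdrv [252:1] -a0
        sleep 300
    done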

 
6. Validate that there is only one logical group by running:

# /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -a0|grep 'DISK GROUP'


It should report only one disk group:

Number of DISK GROUPS:1
DISK GROUP: 0


7. Validate that the disks can be reclaimed by running:

# /opt/oracle.SupportTools/reclaimdisks.sh -check


Below is an example of the output of the command after removing the logical groups with the Solaris partitions and rebuilding the Linux RAID. Key values to verify:

  • Logical drives found: 1
  • RAID Level for the Linux logical drive: 1
  • Physical disks in the Linux logical drive: 2 (252:0 252:1)
  • Dual boot installation: yes
2014-07-17 10:40:19 -0400  [INFO] This is SUN SERVER X4-2 machine
2014-07-17 10:40:19 -0400  [INFO] Number of LSI controllers: 1
2014-07-17 10:40:19 -0400  [INFO] Physical disks found: 4 (252:0 252:1 252:2 252:3)
2014-07-17 10:40:19 -0400  [INFO] Logical drives found: 1
2014-07-17 10:40:19 -0400  [INFO] Linux logical drive: 0
2014-07-17 10:40:19 -0400  [INFO] RAID Level for the Linux logical drive: 1
2014-07-17 10:40:19 -0400  [INFO] Dual boot installation: yes
2014-07-17 10:40:19 -0400  [INFO] LVM based installation: yes
2014-07-17 10:40:19 -0400  [INFO] Physical disks in the Linux logical drive: 2 (252:0 252:1)
2014-07-17 10:40:19 -0400  [INFO] Dedicated Hot Spares for the Linux logical drive: 0
2014-07-17 10:40:19 -0400  [INFO] Global Hot Spares: 0
2014-07-17 10:40:19 -0400  [INFO] Valid dual boot configuration found for Linux: RAID1 from 2 disks



8. Disable CRS autostart:

 

# $GI_HOME/bin/crsctl disable crs
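To confirm the change (an optional check, not part of the original note; crsctl config crs reports the autostart setting):

# $GI_HOME/bin/crsctl config crs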

 
9. Modify reclaimdisks.sh (backing it up first; see the sketch below). Either:

  • Make the changes to the file under /opt/oracle.SupportTools, OR
  • Use the newer version of reclaimdisks.sh shipped in the file dbupdate-helpers.zip included with patch 16486998 (dbnodeupdate). Make the changes to that file, but do not copy it to the /opt/oracle.SupportTools directory, as this version uses other files extracted as part of patch 16486998.
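Before editing, it is prudent to keep a copy of the original script (a suggested precaution, not part of the original note; the .orig name is arbitrary):

# cp -p /opt/oracle.SupportTools/reclaimdisks.sh /opt/oracle.SupportTools/reclaimdisks.sh.orig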

 

Find the following lines:

if [ $? -ne 0 ]; then
    out_logfile "[ERROR] Unable to create parition table on the block device $sysdisk"
    return 1
  fi
  sync

and replace them with

sync
sleep 10
sync
partprobe $sysdisk
if [ $? -ne 0 ]; then
   out_logfile "$FUNCNAME: [ERROR] Unable to re-load of partition tables of $sysdisk"
   return 1
fi
stabilize_block_device $sysdisk
if [ $? -ne 0 ]; then
    out_logfile "[ERROR] $FUNCNAME: One or more block devices are missing for $sysdisk after one minute stabilization timeout"
    return 1
fi
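As indicated by its own log messages, the replacement forces the kernel to re-read the partition table of $sysdisk (partprobe after a double sync) and then waits up to a minute for the resulting block devices to appear (stabilize_block_device) before continuing, instead of proceeding immediately after partitioning.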


Save the file and copy it to /opt/oracle.SupportTools (only when editing under the first option; do not copy the patch 16486998 version, as noted above).

10. Execute reclaimdisks.sh:

# /opt/oracle.SupportTools/reclaimdisks.sh -free -reclaim

 

  • The machine will be rebooted, and once it is back online, the rebuild will continue.
  • Messages reporting the progress of the rebuild are written to the remote console:

 

2014-07-17 15:57:33 -0400  [INFO] Reconstruction of the logical drive 0 is in progress: Completed 6%, Taken 10 min.
2014-07-17 15:58:34 -0400  [INFO] Reconstruction of the logical drive 0 is in progress: Completed 7%, Taken 11 min.
2014-07-17 15:59:34 -0400  [INFO] Reconstruction of the logical drive 0 is in progress: Completed 8%, Taken 12 min.
2014-07-17 16:00:34 -0400  [INFO] Reconstruction of the logical drive 0 is in progress: Completed 9%, Taken 13 min.

  

  • This could take from two to ten hours, with an average of around two hours.
  • Leave the Oracle stack down to avoid increasing the time needed to finish the rebuild.
  • Once the rebuild reaches 100%, the server is rebooted one more time.
  • Enable CRS for autostart (see the commands below).
  • Start CRS.
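A minimal sketch of these final two actions, using the same crsctl binary as in step 8 (assuming $GI_HOME still points to the Grid Infrastructure home):

# $GI_HOME/bin/crsctl enable crs
# $GI_HOME/bin/crsctl start crs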

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.