
Asset ID: 1-72-1935983.1
Update Date: 2017-12-08
Keywords:

Solution Type: Problem Resolution Sure

Solution 1935983.1: Node Constantly Booting into the Grub Menu After Reboot Step of Oracle BDA Mammoth V4.0.0 Upgrade


Related Items
  • Big Data Appliance X4-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>BDA>Big Data Appliance>DB: BDA_EST




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-9733467061>

Applies to:

Big Data Appliance X4-2 Hardware - Version All Versions and later
x86_64

Symptoms

After the 'reboot' step of an Oracle Big Data Appliance upgrade to V4.0.0, one node continuously boots into the Grub menu.

Additional troubleshooting of the three potential update problems shows no indication of errors. The three things updated during the 'reboot' step are:

  • A new kernel is installed.
  • Both copies of grub.conf are updated.
  • imageinfo kernel information is updated. 

In this case, all three appear to have been updated correctly, because (see the example after this list):

1. Checking both copies of grub.conf, i.e. /boot/grub/grub.conf and /usbdisk/boot/grub/grub.conf, confirms that both were updated on all nodes of the cluster and that all copies are identical. This rules out the case where one copy was not updated when the kernel was upgraded. (Note that if /usbdisk is not mounted, mount it with "mount usbdisk" before checking /usbdisk/boot/grub/grub.conf.)

2. The output from "rpm -qa|grep kernel-uek|sort" is the same on all nodes of the cluster.

3. The output from 'imageinfo' is correct on all nodes of the cluster.
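
For illustration, all three checks can also be run cluster-wide as 'root' from Node 1 using the same "dcli -C" invocation as in the verification steps later in this document (the md5sum comparison is simply one convenient way to confirm that the grub.conf copies are identical; it assumes /usbdisk is mounted on all nodes, as noted in item 1):

# dcli -C "md5sum /boot/grub/grub.conf /usbdisk/boot/grub/grub.conf"
# dcli -C "rpm -qa | grep kernel-uek | sort"
# dcli -C imageinfo | grep KERNEL_VERSION

The md5sum checksums should match across all nodes as well as between the two copies on each node, and the kernel RPM list and KERNEL_VERSION should be identical on every node.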

Cause

The root cause is that one member partition was removed from both the /dev/md2 and /dev/md0 RAID1 arrays on one node of the cluster, i.e. the node continuously booting to the Grub menu.


The output from 'mdadm --detail --test' for the / (/dev/md2) and /boot (/dev/md0) arrays shows this, as in the output below:

# mdadm --detail --test /dev/md2
  
/dev/md2:
       Version : 1.1
 Creation Time : Mon Jun 16 17:26:45 2014
    Raid Level : raid1
    Array Size : 488149824 (465.54 GiB 499.87 GB)
 Used Dev Size : 488149824 (465.54 GiB 499.87 GB)
  Raid Devices : 2
 Total Devices : 1
   Persistence : Superblock is persistent

 Intent Bitmap : Internal

   Update Time : Wed Oct 15 12:47:44 2014
         State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
 Spare Devices : 0

          Name : localhost.localdomain:2
          UUID : 99f3d098:a655aefc:72a9322c:f004d4fa
        Events : 1390874

   Number   Major   Minor   RaidDevice State
      0       8       18        0      active sync   /dev/sdb2
      1       0        0        1      removed

# mdadm --detail --test /dev/md0
  
/dev/md0:
       Version : 1.0
 Creation Time : Mon Jun 16 17:26:44 2014
    Raid Level : raid1
    Array Size : 194496 (189.97 MiB 199.16 MB)
 Used Dev Size : 194496 (189.97 MiB 199.16 MB)
  Raid Devices : 2
 Total Devices : 1
   Persistence : Superblock is persistent

   Update Time : Tue Oct 14 17:30:17 2014
         State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
 Spare Devices : 0

          Name : localhost.localdomain:0
          UUID : be320f94:05f780e6:9d4150a0:7e041533
        Events : 258

   Number   Major   Minor   RaidDevice State
      0       8       17        0      active sync   /dev/sdb1
      1       0        0        1      removed
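
A degraded mirror can also be spotted quickly on the affected node from /proc/mdstat, where a missing raid1 member appears as something like "[2/1] [U_]" rather than "[2/2] [UU]". Because '--test' is used above, the mdadm exit status can be checked as well; it is non-zero when the array is not fully healthy:

# cat /proc/mdstat
# mdadm --detail --test /dev/md2; echo $?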

  

Solution

For the long term, Bug 19824921 - BDACHECKCLUSTER NOT CHECKING IF UNDERLYING RAID PARTITIONS ARE FULLY FUNCTIONAL was filed so that bdacheckcluster checks for this condition.

Note: Currently bdacheckcluster runs bdacheckhw and bdachecksw on all nodes.
bdacheckhw: checks the physical health of all disks, so it does not report an error here because all disks are healthy.
bdachecksw: checks that all partitions are fully functional. It does not report an error either, because / (/dev/md2) and /boot (/dev/md0) remain fully functional even with one disk missing from the RAID partition. Only partition functionality is tested, not the health of the underlying RAID arrays, so a degraded array (as here) does not raise an error.
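
Until that check is in place, one possible interim check (a sketch only, not part of any BDA tool) is to review the md arrays on all nodes as 'root' from Node 1 and look for any raid1 array reporting a missing member:

# dcli -C "cat /proc/mdstat"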


Work around the problem by adding the removed devices back into the arrays, as 'root' on the node that is rebooting to the Grub menu:

# mdadm --add /dev/md0 /dev/sda1
# mdadm --add /dev/md2 /dev/sda2
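
Once the devices are re-added, the mirrors rebuild in the background; the recovery progress can be watched on the affected node before rebooting, for example with:

# cat /proc/mdstat
# mdadm --detail /dev/md2 | grep -iE "state|rebuild"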

After re-adding the devices verify:

1. That reboot is successful.

2. That the output of "dcli -C imageinfo | grep KERNEL_VERSION", run as 'root' from Node 1, is the same and correct on all nodes.

# dcli -C imageinfo | grep KERNEL_VERSION

3. That the output of "dcli -C uname -a", run as 'root' from Node 1, is the same and correct on all nodes.

# dcli -C uname -a

  


References

<BUG:19824921> - BDACHECKCLUSTER NOT CHECKING IF UNDERLYING RAID PARTITIONS ARE FULLY FUNCTIONAL

Attachments
This solution has no attachment