
Asset ID: 1-72-1935983.1
Update Date: 2017-12-08
Keywords:

Solution Type: Problem Resolution Sure

Solution 1935983.1: Node Constantly Booting into the Grub Menu After Reboot Step of Oracle BDA Mammoth V4.0.0 Upgrade


Related Items
  • Big Data Appliance X4-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>BDA>Big Data Appliance>DB: BDA_EST




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-9733467061>

Applies to:

Big Data Appliance X4-2 Hardware - Version All Versions and later
x86_64

Symptoms

After the 'reboot' step of an Oracle Big Data Appliance upgrade to V4.0.0, one node continuously boots into the Grub menu.

Additional troubleshooting of the three potential update problems shows no indication of errors. The three things updated during the 'reboot' step are:

  • A new kernel is installed.
  • Both copies of grub.conf are updated.
  • imageinfo kernel information is updated. 

In this case, all three appear to have been updated correctly, because (see the example after this list):

1. Checking both copies of grub.conf, i.e. /boot/grub/grub.conf and /usbdisk/boot/grub/grub.conf, confirms that both were updated on all nodes of the cluster and that all copies are identical. This rules out the case where one copy was not updated when the kernel was upgraded. (Note that if /usbdisk is not mounted, mount it with "mount usbdisk" before checking /usbdisk/boot/grub/grub.conf.)

2. The output from "rpm -qa|grep kernel-uek|sort" is the same on all nodes of the cluster.

3. The output from 'imageinfo' is correct on all nodes of the cluster.
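
For illustration, all three checks can also be run cluster-wide as 'root' from Node 1 using the same "dcli -C" invocation as in the verification steps later in this document (the md5sum comparison is simply one convenient way to confirm that the grub.conf copies are identical; it assumes /usbdisk is mounted on all nodes, as noted in item 1):

# dcli -C "md5sum /boot/grub/grub.conf /usbdisk/boot/grub/grub.conf"
# dcli -C "rpm -qa | grep kernel-uek | sort"
# dcli -C imageinfo | grep KERNEL_VERSION

The md5sum checksums should match across all nodes as well as between the two copies on each node, and the kernel RPM list and KERNEL_VERSION should be identical on every node.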

Cause

The root cause is that one member partition was removed from both the /dev/md2 and /dev/md0 RAID1 arrays on one node of the cluster, i.e. the node continuously booting to the Grub menu.


The output from 'mdadm --detail --test' for the / (/dev/md2) and /boot (/dev/md0) arrays shows this, as in the output below:

# mdadm --detail --test /dev/md2
  
/dev/md2:
       Version : 1.1
 Creation Time : Mon Jun 16 17:26:45 2014
    Raid Level : raid1
    Array Size : 488149824 (465.54 GiB 499.87 GB)
 Used Dev Size : 488149824 (465.54 GiB 499.87 GB)
  Raid Devices : 2
 Total Devices : 1
   Persistence : Superblock is persistent

 Intent Bitmap : Internal

   Update Time : Wed Oct 15 12:47:44 2014
         State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
 Spare Devices : 0

          Name : localhost.localdomain:2
          UUID : 99f3d098:a655aefc:72a9322c:f004d4fa
        Events : 1390874

   Number   Major   Minor   RaidDevice State
      0       8       18        0      active sync   /dev/sdb2
      1       0        0        1      removed

# mdadm --detail --test /dev/md0
  
/dev/md0:
       Version : 1.0
 Creation Time : Mon Jun 16 17:26:44 2014
    Raid Level : raid1
    Array Size : 194496 (189.97 MiB 199.16 MB)
 Used Dev Size : 194496 (189.97 MiB 199.16 MB)
  Raid Devices : 2
 Total Devices : 1
   Persistence : Superblock is persistent

   Update Time : Tue Oct 14 17:30:17 2014
         State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
 Spare Devices : 0

          Name : localhost.localdomain:0
          UUID : be320f94:05f780e6:9d4150a0:7e041533
        Events : 258

   Number   Major   Minor   RaidDevice State
      0       8       17        0      active sync   /dev/sdb1
      1       0        0        1      removed
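
A degraded mirror can also be spotted quickly on the affected node from /proc/mdstat, where a missing raid1 member appears as something like "[2/1] [U_]" rather than "[2/2] [UU]". Because '--test' is used above, the mdadm exit status can be checked as well; it is non-zero when the array is not fully healthy:

# cat /proc/mdstat
# mdadm --detail --test /dev/md2; echo $?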

  

Solution

For the long term, Bug 19824921 - BDACHECKCLUSTER NOT CHECKING IF UNDERLYING RAID PARTITIONS ARE FULLY FUNCTIONAL was filed so that bdacheckcluster checks for this condition.

Note: Currently bdacheckcluster runs bdacheckhw and bdachecksw on all nodes.
bdacheckhw: checks the physical health of all disks, so it does not report an error here because all disks are healthy.
bdachecksw: checks that all partitions are fully functional. It does not report an error either, because / (/dev/md2) and /boot (/dev/md0) remain fully functional even with one disk missing from the RAID partition. Only partition functionality is tested, not the health of the underlying RAID arrays, so a degraded array (as here) does not raise an error.
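
Until that check is in place, one possible interim check (a sketch only, not part of any BDA tool) is to review the md arrays on all nodes as 'root' from Node 1 and look for any raid1 array reporting a missing member:

# dcli -C "cat /proc/mdstat"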


Work around the problem by adding the removed devices back into the arrays, as 'root' on the node that is rebooting to the Grub menu:

# mdadm --add /dev/md0 /dev/sda1
# mdadm --add /dev/md2 /dev/sda2
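
Once the devices are re-added, the mirrors rebuild in the background; the recovery progress can be watched on the affected node before rebooting, for example with:

# cat /proc/mdstat
# mdadm --detail /dev/md2 | grep -iE "state|rebuild"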

After re-adding the devices verify:

1. That reboot is successful.

2. That the output of "dcli -C imageinfo | grep KERNEL_VERSION", run as 'root' from Node 1, is the same and correct on all nodes.

# dcli -C imageinfo | grep KERNEL_VERSION

3. That the output of "dcli -C uname -a", run as 'root' from Node 1, is the same and correct on all nodes.

# dcli -C uname -a

  


References

<BUG:19824921> - BDACHECKCLUSTER NOT CHECKING IF UNDERLYING RAID PARTITIONS ARE FULLY FUNCTIONAL

Attachments
This solution has no attachment