Reboot Hangs Running dbnodeupdate.sh While Upgrading Exadata Db Server

Asset ID:	1-72-1620826.1
Update Date:	2014-08-29

Solution Type Problem Resolution Sure

Solution 1620826.1 : Reboot Hangs Running dbnodeupdate.sh While Upgrading Exadata Db Server

Applies to:

Exadata Database Machine X2-2 Hardware - Version All Versions and later
Information in this document applies to any platform.
The process being followed is:
1. dbnodeupdate.sh is called, which starts the yum update.
2. The yum update process causes the node to reboot.
3. The reboot does not complete because of a bug. This bug can be encountered on any reboot, regardless of whether dbnodeupdate.sh was called.

Symptoms

dbnodeupdate.sh hangs while attempting to upgrade the Exadata software on a database node.

The hang most likely occurs at the reboot step of the patching process.

It can also occur during any reboot or shutdown, including one that is not part of a patching activity.

The console typically shows a hung-task message and call trace similar to the following:

INFO: task rmmod:17665 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...
Call Trace:
[] rds_ib_remove_one+0xf0/0x110 [rds_rdma]
[] ? autoremove_wake_function+0x0/0x3d
[] ? _cond_resched+0xe/0x22
[] ib_unregister_device+0x36/0x103 [ib_core]
[] mlx4_ib_remove+0x3f/0xfa [mlx4_ib]
[] mlx4_remove_device+0x78/0xa0 [mlx4_core]
[] mlx4_unregister_interface+0x2f/0x99 [mlx4_core]
[] mlx4_ib_cleanup+0x15/0x23 [mlx4_ib]
[] sys_delete_module+0x1c3/0x244
[] ? audit_syscall_entry+0x103/0x12f
[] system_call_fastpath+0x16/0x1b
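
As a rough aid for recognizing this symptom, the sketch below searches a saved copy of the console output for the two most distinctive lines of the trace above: the hung rmmod task and the rds_ib_remove_one frame. The function name has_rmmod_hang and the idea of saving the console output to a file are assumptions made for illustration; they are not part of dbnodeupdate.sh or any Exadata tooling.

```shell
#!/bin/bash
# has_rmmod_hang FILE (hypothetical helper)
# Succeeds when FILE contains both the hung-task line for rmmod and the
# rds_ib_remove_one frame seen in the call trace in this note.
has_rmmod_hang() {
    local log="$1"
    grep -Eq 'task rmmod:[0-9]+ blocked for more than [0-9]+ seconds' "$log" &&
        grep -q 'rds_ib_remove_one' "$log"
}

# Usage: pass the path of a saved console capture as the first argument.
if [ -n "${1:-}" ]; then
    if has_rmmod_hang "$1"; then
        echo "Console output matches the rmmod hang described in this note"
    else
        echo "No rmmod hang signature found in $1"
    fi
fi
```

A match only indicates that the trace looks like the one in this note; other frames in the trace may differ from system to system.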

Changes

Applying a new version of the Exadata software (upgrade).

The hang can also occur on a plain reboot or shutdown that is unrelated to patching activities.

Cause

The issue is related to <bug 17580227>. Per an internal-only bug, the fix is included in later kernels (** see the fixed version below).

Solution

Wait at least 5 minutes to give the system a chance to recover on its own. If the reboot has not progressed within 30 minutes:

1. Log in to the ILOM and run:

reset /SYS

2. This power cycles the machine, which should start rebooting within 3-4 minutes.

3. Because the node was hung during shutdown, the reset does not affect the upgrade and you should be able to resume it.

** The issue is fixed in kernel 2.6.39-400.110.1.EL6UEKX86_64
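
To see whether a node already runs a kernel at or above the fixed version, the running release can be compared with the version quoted above. The following is a minimal sketch only: the helper name kernel_has_fix is made up for this example, it assumes GNU sort with the -V (version sort) option is available, and it compares only the leading numeric part of the release string, ignoring the distribution and architecture suffix.

```shell
#!/bin/bash
# Fixed kernel version as quoted in this note, without the suffix.
FIXED_KERNEL="2.6.39-400.110.1"

# kernel_has_fix RELEASE (hypothetical helper)
# Succeeds when RELEASE is at or above FIXED_KERNEL.
kernel_has_fix() {
    local release="$1"
    local lowest
    # Keep only the leading numeric portion,
    # e.g. 2.6.39-400.17.1 from 2.6.39-400.17.1.el6uek.x86_64
    release=$(printf '%s\n' "$release" |
        sed -E 's/^([0-9]+(\.[0-9]+)*-[0-9]+(\.[0-9]+)*).*/\1/')
    # Version-sort the two strings; the fix is present when the
    # fixed version sorts first (or both are equal).
    lowest=$(printf '%s\n%s\n' "$FIXED_KERNEL" "$release" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED_KERNEL" ]
}

if kernel_has_fix "$(uname -r)"; then
    echo "Running kernel is at or above $FIXED_KERNEL"
else
    echo "Running kernel predates $FIXED_KERNEL"
fi
```

The comparison is purely numeric, so it says nothing about kernels from a different release stream that may or may not carry the fix.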

The reset may not be sufficient, in particular if you then encounter:

# ./dbnodeupdate.sh -c
(*) 2014-02-09 06:45:38: Unzipping helpers (/u01/patches/YUM/dbupdate-helpers.zip) to /opt/oracle.SupportTools/dbnodeupdate_helpers
(*) 2014-02-09 06:45:38: Initializing logfile /var/log/cellos/dbnodeupdate.log
(*) 2014-02-09 06:45:38: Collecting system configuration details, this may take some time...

ERROR: Unable to determine hardware type, reset ILOM and retry, exiting

This means that not all of the firmware was fully updated. To resolve this, the system must be powered off completely.
First, try:

stop /SYS

Only if that does not work will you need to use the ILOM to force a shutdown and then start the system:

1) Force a shutdown, since the normal stop /SYS did not work. From an ssh session on the ILOM, run:
stop -force -script /SYS

2) Power needs to stay off for about 5 minutes. Verify it is off:
show /SYS

3) Manually start the system back up (after 5 minutes):
start /SYS

If the plain stop /SYS worked instead, confirm that power is indeed off by running:

show /SYS

Power needs to remain off for about 5 minutes, then the system can be started back up with:

start /SYS
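
When verifying that power is off, it can help to pull just the power state out of the show /SYS output. The sketch below assumes that the output contains a line of the form "power_state = Off" (or "On") and that the output has been captured to a file or is piped in; both the helper name sys_power_state and the capture file name are hypothetical.

```shell
#!/bin/bash
# sys_power_state (hypothetical helper)
# Reads "show /SYS" output on stdin and prints the power_state value.
# ASSUMPTION: the output has a line such as "    power_state = Off".
sys_power_state() {
    tr -d '\r' | awk -F' *= *' '
        !found && $1 ~ /^[[:space:]]*power_state$/ {
            value = $2
            sub(/[[:space:]]+$/, "", value)   # drop trailing whitespace
            print value
            found = 1
        }'
}

# Usage: pass a file holding captured "show /SYS" output as the first argument.
if [ -n "${1:-}" ]; then
    state=$(sys_power_state < "$1")
    echo "power_state reported by ILOM: ${state:-unknown}"
fi
```

If the helper prints nothing, the assumed line format does not match the ILOM output and the state should be read from the show /SYS output directly.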

Once the system is fully up, check whether the issue is resolved by running:

dmidecode -s system-product-name

If it does not show an error, check the imageinfo output and re-run dbnodeupdate.sh:

imageinfo
./dbnodeupdate.sh -c
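
The check above can be wrapped in a small guard so that the follow-up steps are suggested only when dmidecode returns a usable product name. This is a sketch under stated assumptions: the helper product_name_looks_valid is hypothetical, and treating an empty string or any text containing "error" as a failure is a guess about what a bad result looks like, not documented dmidecode behavior.

```shell
#!/bin/bash
# product_name_looks_valid NAME (hypothetical helper)
# ASSUMPTION: a usable result is non-empty and does not contain "error".
product_name_looks_valid() {
    local name="$1"
    [ -n "$name" ] && ! printf '%s' "$name" | grep -qi 'error'
}

# Run the check only where dmidecode is installed (root is normally required).
if command -v dmidecode >/dev/null 2>&1; then
    if name=$(dmidecode -s system-product-name 2>&1) &&
        product_name_looks_valid "$name"; then
        echo "Product name: $name"
        echo "Next: run imageinfo, then ./dbnodeupdate.sh -c"
    else
        echo "dmidecode did not return a usable product name: $name"
    fi
fi
```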

In rare cases the ILOM itself hangs and needs to be restarted. Because the node is inaccessible, ipmitool cannot be run on it to reset the ILOM, so the reset must be issued from a remote node, for example:

# ipmitool -H <ip address of problematic db node> -U root -P mypassword1 mc reset cold

References

<NOTE:1570371.1> - DO_IRQ: NO IRQ HANDLER FOR VECTOR (IRQ -1)
<NOTE:1553103.1> - dbnodeupdate.sh: Exadata Database Server Patching using the DB Node Update Utility
<BUG:17580227> - WHILE SYSTEM IS DOING SHUTDOWN, RMMOD INTERMITTENTLY HANGS.
<BUG:16605377> - KERNEL PANIC WHEN RDMA SERVICE RESTARTED WHILE RDS-STRESS RUNS ON SERVER/CLIENT
<NOTE:1009715.1> - Integrated Lights Out Manager (ILOM) CLI Quick Reference
