Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2199474.1
Update Date: 2017-09-03
Keywords:

Solution Type  Problem Resolution Sure

Solution  2199474.1 :   Exalogic Virtual: Compute Node Hang Issue with "rcu_sched_state detected stalls on CPUs/tasks" Error Message  


Related Items
  • Exalogic Elastic Cloud X5-2 Hardware
  • Oracle Exalogic Elastic Cloud Software

Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core

In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-13352866891>

Applies to:

Oracle Exalogic Elastic Cloud Software - Version 2.0.6.2.160419 to 2.0.6.2.161018
Exalogic Elastic Cloud X5-2 Hardware
Linux x86-64
Oracle Virtual Server x86-64

Symptoms

On Exalogic Virtual racks running the April, July, or October 2016 PSUs, Compute Nodes may hang or reboot. A hung node does not respond to ping and does not accept logins while the issue is occurring. To recover, the node must be force-stopped from the ILOM and then started again from the ILOM.

The following error messages appear in the /var/log/messages system log at the time of the issue:

Sep 19 07:34:19 testcomputenode kernel: INFO: rcu_sched_state detected stalls on CPUs/tasks: { 10 11} (detected by 9, t=60002 jiffies)
Sep 19 07:34:19 testcomputenode kernel: sending NMI to all CPUs:
Sep 19 07:34:19 testcomputenode kernel: NMI backtrace for cpu 4
Sep 19 07:34:19 testcomputenode kernel: CPU 4
Sep 19 07:34:19 testcomputenode kernel: Modules linked in: xen_pciback dm_nfs nfs fscache auth_rpcgss nfs_acl xen_blkback xen_netback xen_gntdev xen_evtchn ipmi_devintf ipmi_si ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm lockd sunrpc bridge stp llc bonding be2iscsi iscsi_boot_sysfs iscsi_tcp bnx2i cnic uio cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) mlx4_core(U) xenfs xen_privcmd ocfs2 jbd2 ocfs2_nodemanager configfs ocfs2_stackglue video sbs sbshc acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport cdc_ether usbnet mii igb hwmon i2c_algo_bit snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm iTCO_wdt i2c_i801 iTCO_vendor_support i2c_core snd_timer snd ioatdma soundcore snd_page_alloc i7core_edac pcspkr dca edac_core ghes hed dm_snapshot dm_zero dm_mirror dm_regio
Sep 19 07:34:19 testcomputenode kernel: _hash dm_log dm_mod ahci libahci sg shpchp megaraid_sas sd_mod crc_t10dif ext3 jbd mbcache
Sep 19 07:34:19 testcomputenode kernel:
Sep 19 07:34:19 testcomputenode kernel: Pid: 0, comm: swapper Not tainted 2.6.39-400.278.1.el5uek #1 Oracle Corporation SUN FIRE X4170 M2 SERVER /ASSY,MOTHERBOARD,X4170
Sep 19 07:34:19 testcomputenode kernel: RIP: e030:[<ffffffff810013aa>] [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
Sep 19 07:34:19 testcomputenode kernel: RSP: e02b:ffff88013cd1fed8 EFLAGS: 00000246
Sep 19 07:34:19 testcomputenode kernel: RAX: 0000000000000000 RBX: 0000000000000004 RCX: ffffffff810013aa
Sep 19 07:34:19 testcomputenode kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
Sep 19 07:34:19 testcomputenode kernel: RBP: ffff88013cd1fef0 R08: 0000000000000001 R09: ffff88015348d3c0
Sep 19 07:34:19 testcomputenode kernel: R10: ffff88015348d410 R11: 0000000000000246 R12: ffffffff819b1840
Sep 19 07:34:19 testcomputenode kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Sep 19 07:34:19 testcomputenode kernel: FS: 00007f31ef22e6e0(0000) GS:ffff880153480000(0000) knlGS:0000000000000000
Sep 19 07:34:19 testcomputenode kernel: CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
Sep 19 07:34:19 testcomputenode kernel: CR2: 0000000002293120 CR3: 000000007dc95000 CR4: 0000000000002660
Sep 19 07:34:19 testcomputenode kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 19 07:34:19 testcomputenode kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 19 07:34:19 testcomputenode kernel: Process swapper (pid: 0, threadinfo ffff88013cd1c000, task ffff88013cd18180) 
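
To check a saved copy of the system log for this signature, a small script along the following lines may help. This is a minimal sketch, not an Oracle-supplied tool: the names STALL_RE, Stall and find_stalls are hypothetical, and the pattern is derived only from the single message format shown in the excerpt above, so other kernel versions may word the line differently.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple

# Assumption: the stall message has exactly the shape seen in the excerpt above:
#   "<Mon DD HH:MM:SS> <host> kernel: INFO: rcu_sched_state detected stalls on
#    CPUs/tasks: { 10 11} (detected by 9, t=60002 jiffies)"
STALL_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"INFO: rcu_sched_state detected stalls on CPUs/tasks:\s*"
    r"\{(?P<cpus>[^}]*)\}\s*"
    r"\(detected by (?P<by>\d+), t=(?P<jiffies>\d+) jiffies\)"
)


class Stall(NamedTuple):
    """One parsed rcu_sched_state stall report (hypothetical helper type)."""
    stamp: str               # syslog timestamp, kept as text
    host: str                # host name as logged
    cpus: Tuple[str, ...]    # tokens listed between the braces, kept as text
    detected_by: int         # CPU that reported the stall
    jiffies: int             # stall duration reported by the kernel


def find_stalls(lines: Iterable[str]) -> Iterator[Stall]:
    """Yield a Stall for every log line that matches the stall signature."""
    for line in lines:
        match = STALL_RE.match(line)
        if match:
            yield Stall(
                stamp=match.group("stamp"),
                host=match.group("host"),
                cpus=tuple(match.group("cpus").split()),
                detected_by=int(match.group("by")),
                jiffies=int(match.group("jiffies")),
            )
```

The function accepts any iterable of lines, so it can be fed an open file object for a copy of /var/log/messages. A match only indicates that the stall message is present; confirming the known issue still requires the analysis described in the rest of this note.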

Cause

This issue is caused by known OVM Bug 24712997.

The following are the OVM and Exalogic bug numbers for this known issue:

BUG 24713073 - TRACKING BUG FOR LINUX BUG 24712997 ON EECS 2.0.6.2.160719 <==== for July 2016 PSU
BUG 25093658 - TRACKING BUG FOR LINUX BUG 24712997 ON EECS 2.0.6.2.161018 <==== for October 2016 PSU
BUG 25535718 - TRACKING BUG FOR LINUX BUG 24712997 ON EECS 2.0.6.2.160419 <==== for April 2016 PSU

Solution

The following one-off patches are available for this issue:

  • Patch 24713073 is available for the July 2016 Virtual PSU.
  • Patch 25093658 is available for the October 2016 Virtual PSU.
  • Patch 25535718 is available for the April 2016 Virtual PSU.
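
The PSU-to-patch mapping above can be captured as a simple lookup, for example when triaging several racks. This is only an illustrative sketch: ONE_OFF_PATCHES and one_off_patch_for are hypothetical names, the version strings are taken from the tracking bug titles in this note, and how the installed EECS version is obtained on a given rack is left as an assumption.

```python
from typing import Optional

# EECS version string -> one-off patch number, as listed in this note.
ONE_OFF_PATCHES = {
    "2.0.6.2.160419": "25535718",  # April 2016 Virtual PSU
    "2.0.6.2.160719": "24713073",  # July 2016 Virtual PSU
    "2.0.6.2.161018": "25093658",  # October 2016 Virtual PSU
}


def one_off_patch_for(eecs_version: str) -> Optional[str]:
    """Return the one-off patch number for an affected EECS version.

    Returns None for any version not listed in this note; that does not by
    itself mean the version is unaffected or already fixed.
    """
    return ONE_OFF_PATCHES.get(eecs_version.strip())
```

For instance, one_off_patch_for("2.0.6.2.161018") returns "25093658". A None result only means the version is outside the three PSUs listed here.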

Please contact Exalogic Support by opening a service request if you run into this issue.

This issue is fixed in the January 2017 PSU and later PSU versions.

INTERNAL NOTE FOR SUPPORT

If analysis confirms that the customer is running into the issue described in this note, please contact the Exalogic Development team to get their approval before providing the applicable patch to the customer.

References

<BUG:24713073> - TRACKING BUG FOR LINUX BUG 24712997 ON EECS 2.0.6.2.160719
<NOTE:1512139.1> - Oracle Exalogic Elastic Cloud Known Issues - Virtualization Release
<BUG:25079287> - REQUEST FOR THE KERNEL-RPMS WITH FIX FOR 24712997 ON 2.6.39-400.283.1.EL5UEK
<BUG:24712997> - DOM0 HUNG WITH RCU_SCHED_STATE DETECTED STALLS ON MLX4_IB_TUNNEL_COMP_HANDLER
<BUG:25093658> - TRACKING BUG FOR LINUX BUG 24712997 ON EECS 2.0.6.2.161018

  Copyright © 2018 Oracle, Inc.  All rights reserved.