Exadata: CPU Stalls Causing Node to Panic and iLOM can become Non-Responsive

Asset ID:	1-72-1904612.1
Update Date:	2014-07-21
Keywords:

Solution Type Problem Resolution Sure

Solution 1904612.1 : Exadata: CPU Stalls Causing Node to Panic and iLOM can become Non-Responsive

Applies to:

Oracle Exadata Storage Server Software - Version 11.2.3.1.1 to 12.1.1.1.0 [Release 11.2 to 12.1]
Exadata Database Machine X2-2 Hardware - Version All Versions and later
Information in this document applies to any platform.
Cell Machine X4270 M2 becomes non-responsive, eventually resulting in a node panic, and access to the ILOM is also lost.
The power cables then have to be disconnected and reconnected to restore stability.

Symptoms

Two symptoms are experienced as a result of this hardware problem.

1). The ILOM becomes non-responsive, and an engineer has to disconnect and reconnect the power cables to restore stability.
2). The node panics due to stalls detected on a particular CPU. The panics do not necessarily occur at regular intervals.

Console Dump -->
~~~~~~~~~~~~~~~~~
May 7 04:05:09 exacelmel01 kernel: Call Trace:
May 7 04:05:09 exacelmel01 kernel: [<ffffffff81014ac6>] cpu_idle+0xc6/0xf0
May 7 04:05:09 exacelmel01 kernel: [<ffffffff814fca20>] start_secondary+0xf0/0x100
May 7 04:07:20 exacelmel01 kernel: INFO: rcu_sched_state detected stalls on CPUs/tasks: { 7} (detected by 14, t=1860326 jiffies)
May 7 04:07:20 exacelmel01 kernel: sending NMI to all CPUs:
May 7 04:07:20 exacelmel01 kernel: NMI backtrace for cpu 0
May 7 04:07:20 exacelmel01 kernel: CPU 0
..
..
May 7 04:07:21 exacelmel01 kernel: Call Trace:
May 7 04:07:21 exacelmel01 kernel: <IRQ> [<ffffffff8109c6f8>] ktime_get+0x68/0xf0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff810a2fc0>] ? tick_clock_notify+0x60/0x60
May 7 04:07:21 exacelmel01 kernel: [<ffffffff810a2fea>] tick_sched_timer+0x2a/0xd0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff810a2fc0>] ? tick_clock_notify+0x60/0x60
May 7 04:07:21 exacelmel01 kernel: [<ffffffff81095ab3>] __run_hrtimer+0x83/0x1e0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff81095dc6>] hrtimer_interrupt+0xe6/0x240
May 7 04:07:21 exacelmel01 kernel: [<ffffffff81033e6b>] local_apic_timer_interrupt+0x3b/0x70
May 7 04:07:21 exacelmel01 kernel: [<ffffffff815108a5>] smp_apic_timer_interrupt+0x45/0x5a
May 7 04:07:21 exacelmel01 kernel: [<ffffffff8150f733>] apic_timer_interrupt+0x13/0x20
May 7 04:07:21 exacelmel01 kernel: <EOI> [<ffffffff81098351>] ? sched_clock_idle_sleep_event+0x11/0x20
May 7 04:07:21 exacelmel01 kernel: [<ffffffff8101de79>] ? mwait_idle+0x99/0x1c0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff81014ac6>] cpu_idle+0xc6/0xf0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff814fca20>] start_secondary+0xf0/0x100
May 7 04:08:09 exacelmel01 kernel: INFO: rcu_bh_state detected stalls on CPUs/tasks: { 7} (detected by 18, t=1860326 jiffies)
May 7 04:08:09 exacelmel01 kernel: sending NMI to all CPUs:
May 7 04:08:09 exacelmel01 kernel: NMI backtrace for cpu 0
May 7 04:08:09 exacelmel01 kernel: CPU 0

Note - System panics can occur for many reasons, including application software, OS drivers, and hardware. The above signature is particular to this symptom of a bad CPU, but in general do not assume that a panic handled by CPU 0 indicates a CPU fault.
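
The stall lines in the console dump above follow a regular format, so a log can be scanned for them mechanically. The following is a minimal sketch only, not an Oracle-supplied tool: the function names find_rcu_stalls and count_stalled_cpus are hypothetical, and the pattern is derived solely from the two stall lines shown in this note, so other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern assumed from the stall lines in the console dump in this note, e.g.
# "INFO: rcu_sched_state detected stalls on CPUs/tasks: { 7} (detected by 14, t=1860326 jiffies)"
STALL_RE = re.compile(
    r"INFO: (?P<flavor>rcu_\w+) detected stalls on CPUs/tasks: "
    r"\{(?P<cpus>[^}]*)\} "
    r"\(detected by (?P<detector>\d+), t=(?P<jiffies>\d+) jiffies\)"
)


def find_rcu_stalls(lines):
    """Return one dict per RCU stall message found in the given log lines."""
    stalls = []
    for line in lines:
        match = STALL_RE.search(line)
        if match is None:
            continue
        stalls.append({
            "flavor": match.group("flavor"),
            # Keep only numeric tokens (CPU numbers); skip anything else in the braces.
            "cpus": [int(tok) for tok in match.group("cpus").split() if tok.isdigit()],
            "detector": int(match.group("detector")),
            "jiffies": int(match.group("jiffies")),
        })
    return stalls


def count_stalled_cpus(stalls):
    """Count how many stall messages name each CPU as stalled."""
    counts = Counter()
    for stall in stalls:
        counts.update(stall["cpus"])
    return counts
```

Passing the lines of a saved console or system log to find_rcu_stalls and then to count_stalled_cpus shows whether the same CPU is reported repeatedly. A repeated CPU number is only a hint to collect an ILOM snapshot; as the note above says, it does not by itself prove a CPU fault.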

Cause

A CPU 0 stall that is detected by the kernel and causes a hard hang, including an unresponsive ILOM, may indicate a faulty CPU.

Solution

This signature has been identified as likely to be a hardware problem. An ILOM snapshot will help to confirm the problem.

In this case CPU#0 was replaced to resolve the problem.

References

<BUG:18480030> - SYSTEM HUNG & PANICED AFTER SEVERAL "ILOM HAS STOPPED RESPONDING" MESSAGES

Attachments

This solution has no attachment