Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 1904612.1: Exadata: CPU Stalls Causing Node to Panic and iLOM can become Non-Responsive
Applies to:
Oracle Exadata Storage Server Software - Version 11.2.3.1.1 to 12.1.1.1.0 [Release 11.2 to 12.1]
Exadata Database Machine X2-2 Hardware - Version All Versions and later
Information in this document applies to any platform.

A Cell Machine X4270 M2 becomes non-responsive, eventually resulting in a node panic, and access to iLOM is also lost. Stability was restored only by disconnecting and reconnecting the power cables.

Symptoms
Two symptoms are experienced as a result of this hardware problem.

Console Dump -->
~~~~~~~~~~~~~~~~~
May 7 04:05:09 exacelmel01 kernel: Call Trace:
May 7 04:05:09 exacelmel01 kernel: [<ffffffff81014ac6>] cpu_idle+0xc6/0xf0
May 7 04:05:09 exacelmel01 kernel: [<ffffffff814fca20>] start_secondary+0xf0/0x100
May 7 04:07:20 exacelmel01 kernel: INFO: rcu_sched_state detected stalls on CPUs/tasks: { 7} (detected by 14, t=1860326 jiffies)
May 7 04:07:20 exacelmel01 kernel: sending NMI to all CPUs:
May 7 04:07:20 exacelmel01 kernel: NMI backtrace for cpu 0
May 7 04:07:20 exacelmel01 kernel: CPU 0
..
..
May 7 04:07:21 exacelmel01 kernel: Call Trace:
May 7 04:07:21 exacelmel01 kernel: <IRQ> [<ffffffff8109c6f8>] ktime_get+0x68/0xf0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff810a2fc0>] ? tick_clock_notify+0x60/0x60
May 7 04:07:21 exacelmel01 kernel: [<ffffffff810a2fea>] tick_sched_timer+0x2a/0xd0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff810a2fc0>] ? tick_clock_notify+0x60/0x60
May 7 04:07:21 exacelmel01 kernel: [<ffffffff81095ab3>] __run_hrtimer+0x83/0x1e0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff81095dc6>] hrtimer_interrupt+0xe6/0x240
May 7 04:07:21 exacelmel01 kernel: [<ffffffff81033e6b>] local_apic_timer_interrupt+0x3b/0x70
May 7 04:07:21 exacelmel01 kernel: [<ffffffff815108a5>] smp_apic_timer_interrupt+0x45/0x5a
May 7 04:07:21 exacelmel01 kernel: [<ffffffff8150f733>] apic_timer_interrupt+0x13/0x20
May 7 04:07:21 exacelmel01 kernel: <EOI> [<ffffffff81098351>] ? sched_clock_idle_sleep_event+0x11/0x20
May 7 04:07:21 exacelmel01 kernel: [<ffffffff8101de79>] ? mwait_idle+0x99/0x1c0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff81014ac6>] cpu_idle+0xc6/0xf0
May 7 04:07:21 exacelmel01 kernel: [<ffffffff814fca20>] start_secondary+0xf0/0x100
May 7 04:08:09 exacelmel01 kernel: INFO: rcu_bh_state detected stalls on CPUs/tasks: { 7} (detected by 18, t=1860326 jiffies)
May 7 04:08:09 exacelmel01 kernel: sending NMI to all CPUs:
May 7 04:08:09 exacelmel01 kernel: NMI backtrace for cpu 0
May 7 04:08:09 exacelmel01 kernel: CPU 0
Note - System panics can have many causes, including application software, OS drivers, and hardware. The above signature is particular to this symptom of a bad CPU, but in general do not assume that a panic handled by CPU0 is a CPU fault.
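The stall signature in the console dump can be picked out of a saved kernel log with a short script. The following is a minimal sketch only, assuming the log lines follow the exact `detected stalls on CPUs/tasks` wording shown in the dump; the names `STALL_RE` and `find_rcu_stalls` are hypothetical and not part of any Oracle tool. It reports what the kernel logged and does not by itself identify a faulty part.

```python
import re

# Assumption: stall messages look exactly like the ones in the console dump, e.g.
# "INFO: rcu_sched_state detected stalls on CPUs/tasks: { 7} (detected by 14, t=1860326 jiffies)"
STALL_RE = re.compile(
    r"INFO: (?P<flavor>rcu_\w+) detected stalls on CPUs/tasks: "
    r"\{(?P<cpus>[^}]*)\} "
    r"\(detected by (?P<detector>\d+), t=(?P<jiffies>\d+) jiffies\)"
)


def find_rcu_stalls(lines):
    """Return one dict per RCU stall message found in an iterable of log lines."""
    stalls = []
    for line in lines:
        match = STALL_RE.search(line)
        if match is None:
            continue
        stalls.append({
            "flavor": match.group("flavor"),
            # Keep only purely numeric entries inside the braces (CPU numbers).
            "cpus": [int(tok) for tok in match.group("cpus").split() if tok.isdigit()],
            "detected_by": int(match.group("detector")),
            "jiffies": int(match.group("jiffies")),
        })
    return stalls


# Sample lines copied from the console dump in this document.
SAMPLE = [
    "May 7 04:07:20 exacelmel01 kernel: INFO: rcu_sched_state detected stalls "
    "on CPUs/tasks: { 7} (detected by 14, t=1860326 jiffies)",
    "May 7 04:07:20 exacelmel01 kernel: sending NMI to all CPUs:",
    "May 7 04:08:09 exacelmel01 kernel: INFO: rcu_bh_state detected stalls "
    "on CPUs/tasks: { 7} (detected by 18, t=1860326 jiffies)",
]

if __name__ == "__main__":
    for stall in find_rcu_stalls(SAMPLE):
        print(stall)
```

To scan a real log, pass an open file object (for example a saved copy of the messages file) to `find_rcu_stalls` in place of `SAMPLE`. Repeated hits on the same CPU set are a prompt to collect an ILOM snapshot, not a diagnosis.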
Cause
CPU 0 stalls detected by the kernel, followed by a hard hang that includes ILOM non-response, may indicate a faulty CPU.

Solution
This signature has been identified as likely to be a hardware problem. An ILOM snapshot will help confirm the problem. In this case CPU#0 was replaced to resolve the problem.

References
<BUG:18480030> - SYSTEM HUNG & PANICED AFTER SEVERAL "ILOM HAS STOPPED RESPONDING" MESSAGES

Attachments
This solution has no attachment