Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1990121.1
Update Date:2017-06-28

Solution Type  Problem Resolution Sure

Solution  1990121.1 :   ILOM Hangs When Compute Node Hangs In Exalogic X3-2, X4-2 and X5-2 Racks  


Related Items
  • Exalogic Elastic Cloud X5-2 Hardware
  • Exalogic Elastic Cloud X4-2 Hardware
  • Exalogic Elastic Cloud X3-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Symptoms
Cause
Solution


Applies to:

Exalogic Elastic Cloud X5-2 Hardware - Version X5 to X5 [Release X5]
Exalogic Elastic Cloud X4-2 Hardware - Version X4 to X4 [Release X4]
Exalogic Elastic Cloud X3-2 Hardware - Version X3 to X3 [Release X3]
Linux x86-64
Oracle Virtual Server x86-64
Applies to all Exalogic Physical Linux configurations running releases earlier than the July 2015 PSU (EECS 2.0.6.2.2) on Exalogic X3-2, X4-2 and X5-2 racks.
Applies to all Exalogic Virtual configurations running releases earlier than the Oct 2015 PSU (EECS 2.0.6.2.3) on Exalogic X3-2, X4-2 and X5-2 racks.

Symptoms

When an application or the operating system causes a compute node to hang, the node’s ILOM sometimes becomes unresponsive as well, making it impossible to reset the node remotely. Recovery from this state requires a manual power cycle of the compute node in the data center. This extends downtime, especially if the data center is remote and unmanned.

Cause

Exalogic compute node ILOMs are configured to use sideband management. In this configuration, traffic to and from the node’s eth0 interface and traffic to and from its ILOM flow through the same network port.

When the operating system hangs, the receive (rx) buffer of the X540 Ethernet Controller fills up, which causes the X540 flow control to transmit PAUSE frames to the Cisco switch. This in turn prevents the switch from sending any more packets to the Ethernet controller. Because the ILOM shares that network port, traffic destined for the ILOM is held back as well, and the ILOM becomes unreachable. This issue is tracked in the following bug:

Bug 19530512 - Compute node hung causes connectivity loss to ILOM

Solution

Implement the following workaround to fix the problem on Exalogic.

NOTE: The workaround provided below is integrated into July 2015 PSU (EECS 2.0.6.2.2) for Exalogic Physical Linux, and into Oct 2015 PSU (EECS 2.0.6.2.3) for Exalogic Virtual. It needs to be performed on Exalogic racks running earlier versions (Exalogic Physical racks running EECS 2.0.6.2.1 and older, and Exalogic Virtual racks running EECS 2.0.6.2.2 and older).
  1. Disable pause autonegotiation and transmit (TX) flow control on the eth0 interface of the compute node:
    [root@x4-compute-node ~]# ethtool -A eth0 autoneg off tx off 
  2. Verify the changed settings:
    [root@x4-compute-node ~]# ethtool -a eth0
    Pause parameters for eth0:
    Autonegotiate:  off
    RX:             on
    TX:             off 
  3. Because the setting does not persist across reboots, add the following line to /etc/rc.local so that it is reapplied each time the node boots:
    ethtool -A eth0 autoneg off tx off 
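
If step 3 is scripted, for example as part of a rollout to several compute nodes, it helps to avoid appending the same line twice when the script is rerun. The following is a minimal sketch rather than part of the documented procedure. The helper name ensure_line is hypothetical, and the sketch assumes that /etc/rc.local ends with a newline and contains no earlier exit statement that would skip the appended line.

```shell
#!/bin/bash
# Hypothetical helper (not part of the Oracle procedure):
# append LINE to FILE only if an identical line is not already present.
ensure_line() {
    local file="$1" line="$2"
    # -q: quiet, -x: match the whole line, -F: fixed string, not a regex.
    # If the file is missing or the line is absent, grep fails and we append.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

As root on the compute node, it could be called as: ensure_line /etc/rc.local 'ethtool -A eth0 autoneg off tx off'. Running it a second time leaves the file unchanged.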

This completes the procedure to implement the workaround. A reboot of the compute node is not required for this configuration change to take effect.
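
The check in step 2 can also be automated, for instance to confirm after a later reboot that the rc.local entry took effect. The sketch below is illustrative only. The function name pause_workaround_applied is hypothetical, and it assumes that the output of ethtool -a has the same "Autonegotiate:" and "TX:" fields shown in step 2.

```shell
#!/bin/bash
# Hypothetical check (not part of the Oracle procedure):
# reads `ethtool -a <interface>` output on stdin and succeeds only when
# both pause autonegotiation and TX pause are reported as "off".
pause_workaround_applied() {
    awk -F':[ \t]*' '
        {
            key = $1; val = $2
            # Trim leading and trailing blanks from the field name and value.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == "Autonegotiate") autoneg = val
            if (key == "TX") tx = val
        }
        # Exit status 0 when both are off, 1 otherwise (including missing fields).
        END { exit !(autoneg == "off" && tx == "off") }
    '
}
```

On a compute node it could be used as: ethtool -a eth0 | pause_workaround_applied && echo "workaround active". The RX setting is deliberately not checked, since the workaround leaves it unchanged.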

There are no known performance issues or side effects on Exalogic due to this configuration change. Note also that the configuration change is made to the management network (eth0) interface. Application traffic on internal IPoIB networks or client access EoIB networks is unaffected by this change.


  Copyright © 2018 Oracle, Inc.  All rights reserved.