How to Check if a Reboot is Due to a Node being Fenced out of an OCFS2 / o2cb Cluster

Asset ID:	1-71-2096087.1
Update Date:	2017-10-16
Keywords:

Solution Type Technical Instruction Sure

Solution 2096087.1 : How to Check if a Reboot is Due to a Node being Fenced out of an OCFS2 / o2cb Cluster

Applies to:

Oracle VM - Version 3.0.1 and later
Linux OS - Version Enterprise Linux 3.0 and later
Private Cloud Appliance - Version 1.0.1 and later
Oracle Exalogic Elastic Cloud Software - Version 2.0.6.2.2 to 2.0.6.2.2
Private Cloud Appliance X5-2 Hardware
Linux x86-64

Goal

On Oracle Linux, Oracle VM and Oracle Private Cloud Appliance (PCA) systems that use Oracle Cluster File System version 2 (OCFS2), it can be difficult to determine whether a server that rebooted was fenced out of the o2cb cluster because it did not write its heartbeat in time, or whether the cause lies outside the o2cb cluster.

Solution

In many fencing situations, the log files of the node that rebooted unexpectedly contain no trace of the root cause (for example, the fenced node losing access to its clustered file system): the syslog ends abruptly and then a new server boot is recorded.

A node is fenced out of the o2cb cluster when it fails to write its heartbeat to the shared file system for (O2CB_HEARTBEAT_THRESHOLD - 1) x 2 seconds. This parameter is defined on each node in the /etc/sysconfig/o2cb configuration file.
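To see what timeout a given node is actually configured with, the threshold can be read from the configuration file and converted to seconds. The following is a minimal sketch, not an official tool: the function name o2cb_fence_timeout is hypothetical, it assumes the file contains a plain O2CB_HEARTBEAT_THRESHOLD=<number> assignment, and it uses the formula from the OCFS2 documentation, (threshold - 1) x 2 seconds.

```shell
# Sketch: print the disk heartbeat timeout implied by an o2cb configuration file.
# Assumption: the file holds a line of the form O2CB_HEARTBEAT_THRESHOLD=<number>.
# o2cb_fence_timeout is a hypothetical helper name, not part of the o2cb tools.
o2cb_fence_timeout() {
    conf="${1:-/etc/sysconfig/o2cb}"

    # Take the last uncommented assignment and keep only its digits
    # (this also drops optional quotes around the value).
    threshold=$(grep -E '^[ ]*O2CB_HEARTBEAT_THRESHOLD=' "$conf" 2>/dev/null \
        | tail -n 1 | cut -d= -f2 | tr -cd '0-9')

    if [ -z "$threshold" ]; then
        echo "O2CB_HEARTBEAT_THRESHOLD not found in $conf" >&2
        return 1
    fi

    # Timeout in seconds as given by the OCFS2 documentation.
    echo "threshold=$threshold timeout_seconds=$(( (threshold - 1) * 2 ))"
}
```

Calling o2cb_fence_timeout with no argument reads /etc/sysconfig/o2cb. A threshold of 31 yields a timeout of 60 seconds. Comparing the output across all nodes is a quick way to confirm that they share the same setting.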

However, on the surviving nodes of the o2cb cluster (for Oracle VM, the other nodes of the server pool; for PCA, the other compute nodes), the following message is reliably logged on at least one member of the cluster / server pool:

ovs1 kernel: o2cb: o2dlm has evicted node X from domain ovm

Therefore, when checking whether a node has been evicted, the surviving nodes often provide more leads about a possible fence than the fenced node itself.
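The message identifies the evicted node only by its o2cb node number (X above). This number can be mapped back to a host name. The sketch below assumes the cluster layout is stored in /etc/ocfs2/cluster.conf in the usual stanza format ("node:" followed by indented "number = N" and "name = HOST" lines); that file is not mentioned elsewhere in this note, and the function name o2cb_node_name is hypothetical.

```shell
# Sketch: print the host name registered for a given o2cb node number.
# Assumptions: stanza-style cluster.conf ("node:" header, then "key = value"
# lines); o2cb_node_name is a hypothetical helper name.
o2cb_node_name() {
    number="$1"
    conf="${2:-/etc/ocfs2/cluster.conf}"

    awk -v want="$number" '
        # Print the name of the stanza just finished if its number matches.
        function emit() { if (in_node && num != "" && num == want) print name }

        # A line holding a single "word:" starts a new stanza.
        NF == 1 && $1 ~ /^[a-z_]+:$/ {
            emit()
            in_node = ($1 == "node:")
            num = ""; name = ""
            next
        }
        in_node && $1 == "number" && $2 == "=" { num = $3 }
        in_node && $1 == "name"   && $2 == "=" { name = $3 }
        END { emit() }
    ' "$conf"
}
```

For example, after seeing "has evicted node 2" in the logs, running o2cb_node_name 2 on any cluster member prints the name recorded for node number 2, provided the file follows the assumed layout.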

Running a command along the lines of:

# grep "has evicted node" /var/log/messages*

on the surviving nodes of the cluster/pool often gives an initial indication of whether the reboot was caused by a cluster eviction.
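On systems where rotated logs are compressed, a plain grep may miss evictions recorded in the older files. The following sketch searches both plain and gzip-compressed messages files. It assumes rotated files keep the "messages" prefix and that compressed ones end in .gz; the function name find_o2cb_evictions is hypothetical.

```shell
# Sketch: list o2dlm eviction messages from plain and gzip-compressed logs.
# Assumptions: log files are named messages* and compressed ones end in .gz;
# find_o2cb_evictions is a hypothetical helper name.
find_o2cb_evictions() {
    logdir="${1:-/var/log}"

    for f in "$logdir"/messages*; do
        [ -f "$f" ] || continue

        # Decompress on the fly when needed, then prefix each matching
        # line with the file it came from.
        case "$f" in
            *.gz) gzip -dc "$f" ;;
            *)    cat "$f" ;;
        esac | awk -v f="$f" '/has evicted node/ { print f ": " $0 }'
    done
}
```

Run find_o2cb_evictions as root on each surviving node. The timestamps of any matching lines can then be compared with the time of the unexpected reboot.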
