
Asset ID: 1-72-2277296.1
Update Date: 2017-07-19

Solution Type: Problem Resolution Sure

Solution 2277296.1: MaxRep: Intermittent HA failover issues and failover not working properly


Related Items
  • Pillar Axiom Replication Engine (MaxRep)
Related Categories
  • PLA-Support>Sun Systems>DISK>Flash Storage>SN-EStor: FSx




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-14709840661>

Applies to:

Pillar Axiom Replication Engine (MaxRep) - Version 3.0 to 3.0 [Release 3.0]
Information in this document applies to any platform.

Symptoms

Intermittent MaxRep HA failovers occur in which the Active Engine in the MaxRep cluster fails over its resources to the Passive Engine.

Confirmation that a MaxRep HA failover has occurred can be found on the Control Service Engine Support page: open http://<MaxRep Control Service Engine IP>/support, navigate to the "Dashboard" page, select the Cluster IP, and look at the "Role:" field. If it shows Active, that Engine is the Active Engine in the cluster and the other Engine is the Passive Engine. If a failover occurred, the Engine that was previously Passive will show as Active.

There may also be issues where the failover process did not work properly. A sign of this is that the retention LUN that was mounted on the previously Active Engine is not automatically mounted on the new Active Engine in the cluster. To check, log into the UI at http://<MaxRep Control Service Engine IP>/ui, navigate to the "Protect" tab, and from the left menu click "Manage Protection Plan". On this page, under the Servers column, the Source Engine and Target Engine should reflect the current Active Engines in each cluster; if they do not, the failover did not work properly.

A failover will only occur if the Protection Plans were Activated at the time of the HA event.

Cause

An HA failover is typically caused by the Active Engine in the cluster having problems communicating with the Ping Node. The Ping Node is an HA setting that identifies a common IP address that both Replication Engines in the HA cluster ping; if the Active node cannot ping the Ping Node, the Passive Replication Engine pings it and initiates the failover. It is very important to use a reliable Ping Node that is always available; a local switch or router is usually a good Ping Node candidate.

Indications that the Active Engine is having problems reaching the Ping Node can be seen in /var/log/messages as "Late heartbeat" warnings:

May 30 07:03:48 MAXREP02 heartbeat: [20041]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10020 ms
May 30 07:08:33 MAXREP02 heartbeat: [20041]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10020 ms
May 30 07:09:23 MAXREP02 heartbeat: [20041]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10040 ms
May 30 07:23:38 MAXREP02 heartbeat: [20041]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10050 ms
May 30 07:51:38 MAXREP02 heartbeat: [6296]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10020 ms

If these messages repeat continually, it indicates that the MaxRep Engine is seeing high response times reaching the Ping Node IP address. In the example above the heartbeat responses involving the Ping Node are late by about 10 seconds (> 10000 ms).
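
For example, a quick way to gauge how often this is happening is to search the same log directly (grep against /var/log/messages as shown above):

[root@MAXREP02 ~]# grep -c "Late heartbeat" /var/log/messages
[root@MAXREP02 ~]# grep "Late heartbeat" /var/log/messages | tail -5
# The first command counts the warnings, the second shows the most recent ones
# so the timestamps and reported intervals can be reviewed.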

It was also noticed that the /var/log/messages file was continually repeating these ERROR messages:

May 30 12:24:52 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:53 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:53 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:54 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:55 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:56 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:57 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)

Attempting to stop the heartbeat service with command "service heartbeat stop" would hang and the service would continue to run.

The "Error Message hist queue filling up" messages indicate the heartbeat service is having problems communicating with the other MaxRep Engine in the cluster.

Solution

First review how the MaxRep Engine HA cluster is configured. This can be checked from the Support page http://<MaxRep Control Service Engine IP>/support: under "Management Tasks" click "Configure MaxRep HA", select the Primary Node having issues, and note what the Multicast group IP address and the Ping Node IP are set to.

If MaxRep HA was not configured from the Support page, log into the MaxRep Engine console using the root user account. The default password can be found in Document 2046703.1, FS System: Passwords Associated with the Oracle FS1-2 Flash Storage System.

Cat the file /etc/ha.d/ha.cf to view the HA configuration:

[root@MAXREP02 ~]# cat /etc/ha.d/ha.cf
keepalive 5
deadtime 120
warntime 10
initdead 130 # depend on your hardware
udpport 694
ping <Ping Node IP>
mcast MgtBond <mcast group IP> 694 1 0
auto_failback off
node MAXREP01
node MAXREP02
respawn root /usr/lib64/heartbeat/ipfail
use_logd yes

Both Engines in the cluster should have the same HA configuration.
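
One simple way to confirm this, assuming the standard Linux md5sum utility is present on the Engines, is to compare the checksum of /etc/ha.d/ha.cf on each Engine:

[root@MAXREP01 ~]# md5sum /etc/ha.d/ha.cf
[root@MAXREP02 ~]# md5sum /etc/ha.d/ha.cf
# The two checksums should be identical; if they differ, compare the two files
# to find the mismatched settings.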

For HA to function properly, verify the following:

  • Both MaxRep Engines in the cluster need a reliable connection to the Ping Node. The Ping Node is used by the MaxRep Engines to check for network connectivity, so the customer network must allow Internet Control Message Protocol (ICMP) traffic in order for both MaxRep Engines to reach the Ping Node.
  • Both MaxRep Engines in the cluster communicate with one another using the multicast group. The heartbeat service on each MaxRep Engine sends heartbeats roughly every second over the multicast group, so the customer network must allow UDP port 694 traffic and have a multicast group configured.

To test Ping Node connectivity from the MaxRep Engine, issue a ping to the Ping Node IP address:

[root@MAXREP02 ~]# ping <Ping Node IP>
PING <Ping Node IP> (<Ping Node IP>) 56(84) bytes of data.
64 bytes from <Ping Node IP>: icmp_seq=1 ttl=64 time=0.037 ms
64 bytes from <Ping Node IP>: icmp_seq=2 ttl=64 time=0.016 ms
64 bytes from <Ping Node IP>: icmp_seq=3 ttl=64 time=0.016 ms
64 bytes from <Ping Node IP>: icmp_seq=4 ttl=64 time=0.017 ms
64 bytes from <Ping Node IP>: icmp_seq=5 ttl=64 time=0.022 ms

In this example the MaxRep Engine can successfully reach the Ping Node IP and the ping replies are under 1 ms.  If there are intermittent network issues reaching the Ping Node, or the MaxRep Engine sees much higher reply times, for example 8-9 ms of latency, consider using a different Ping Node.
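
To catch intermittent problems, it can also help to run a longer ping and review the loss and round-trip summary printed at the end, for example:

[root@MAXREP02 ~]# ping -c 100 <Ping Node IP>
# Review the final summary lines; any packet loss or unusually high maximum
# round-trip times suggest the Ping Node, or the path to it, is not reliable.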

To confirm that both MaxRep Engines in the cluster are communicating with each other properly, check the following:

  • On both MaxRep Engines in the cluster execute "service heartbeat status" to make sure the heartbeat service is running. If it is not running, execute "service heartbeat start" to start the service, then execute "service heartbeat status" again to confirm.
  • On both MaxRep Engines type "ps -AHfww | grep heartbeat" and make sure the output looks like the following:
    [root@MAXREP02 ~]# ps -AHfww | grep heartbeat
    root 6037 3173 0 15:53 pts/0 00:00:00 grep heartbeat
    root 6288 1 9 May26 ? 06:36:51 heartbeat: master control process
    nobody 6378 6288 0 May26 ? 00:00:05 heartbeat: FIFO reader
    nobody 6379 6288 0 May26 ? 00:00:26 heartbeat: write: ping <Ping Node IP>
    nobody 6380 6288 0 May26 ? 00:00:04 heartbeat: read: ping <Ping Node IP>
    nobody 6381 6288 0 May26 ? 00:00:23 heartbeat: write: mcast MgtBond
    nobody 6382 6288 0 May26 ? 00:00:03 heartbeat: read: mcast MgtBond

    These are the required processes from the heartbeat service that must be running on both MaxRep Engines in the cluster. The heartbeat service creates read and write multicast sockets on each cluster node and spawns core processes to read and write via the sockets for heartbeat communication.
  • The customer network must allow UDP port 694 network traffic for both MaxRep Engines in the cluster to communicate with each other. To verify that both MaxRep Engines are communicating with each other, use the "netstat -su" command. Here is a sample of the output from both MaxRep Engines in the cluster:
    [root@MAXREP01 ~]# netstat -su
    IcmpMsg:
            InType0: 94149
            InType3: 13828
            OutType3: 16912
            OutType8: 135811
    Udp:
        336 packets received
        0 packets to unknown port received.
        0 packet receive errors
        61950 packets sent
    UdpLite:
    IpExt:
        OutMcastPkts: 58038
        InBcastPkts: 16721
        InOctets: 420992648293
        OutOctets: 35357140207
        OutMcastOctets: 12525045
        InBcastOctets: 2890299

    [root@MAXREP02 ~]# netstat -su
    IcmpMsg:
            InType0: 52590
            InType8: 2
            OutType0: 2
            OutType3: 16918
            OutType8: 52658
    Udp:
        58385 packets received
        0 packets to unknown port received.
        0 packet receive errors
        349836 packets sent
    UdpLite:
    IpExt:
        InMcastPkts: 58043
        OutMcastPkts: 349496
        InBcastPkts: 16723
        InOctets: 11124925798
        OutOctets: 345068201
        InMcastOctets: 12526573
        OutMcastOctets: 82759893
        InBcastOctets: 2891140



    NOTE: To continuously monitor the UDP statistics, use the command "netstat -suc", which prints the output every second.
     

    In the above example we can see that host MAXREP02 sent 349836 UDP packets but host MAXREP01 only received 336 UDP packets. In the other direction host MAXREP01 sent 61950 UDP packets and host MAXREP02 received 58385 UDP packets. The UDP packets sent and received should increment every 1 second or so.

    This example shows a problem with the UDP communication from host MAXREP02 to MAXREP01: MAXREP02 sent 349836 UDP packets but MAXREP01 received only 336, the discrepancy is significant, and the received packet count never increments. Something is therefore wrong with the communication between the two MaxRep Engines in the cluster. The customer will need to review their network configuration to make sure the network is accepting UDP traffic and that the multicast group is configured properly.
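
    Two quick checks can help narrow this down. These are optional sketches that assume standard Linux tools on the Engines and use the MgtBond interface named in /etc/ha.d/ha.cf:

    # Confirm whether the UDP receive counters are incrementing on the Engine that
    # does not appear to be receiving heartbeats.
    [root@MAXREP01 ~]# netstat -su | grep "packets received"; sleep 10; netstat -su | grep "packets received"
    # If the count is unchanged after 10 seconds, no heartbeat packets are arriving.

    # If tcpdump is installed on the Engine, watch for the peer's multicast heartbeats on UDP port 694.
    [root@MAXREP01 ~]# tcpdump -i MgtBond -n udp port 694
    # Seeing no packets from the peer Engine indicates the traffic is being dropped in the network.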

References

<BUG:26189907> - MAXREP DOCUMENTS NEED TO BE UPDATED

Attachments
This solution has no attachment