Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution (Sure Solution)
2277296.1: MaxRep: Intermittent HA failover issues and failover not working properly
In this Document
  Symptoms
  Cause
  Solution
  References
Created from <SR 3-14709840661>

Applies to:

Pillar Axiom Replication Engine (MaxRep) - Version 3.0 to 3.0 [Release 3.0]
Information in this document applies to any platform.

Symptoms

Intermittent MaxRep HA failover issues, where the Active Engine in the MaxRep cluster has failed over its resources to the Passive Engine.

Confirmation that a MaxRep HA failover has occurred can be found on the Control Service Engine Support page at http://<MaxRep Control Service Engine IP>/support: navigate to the "Dashboard" page, select the Cluster IP, and check the "Role:" field. If it shows Active, that Engine is the Active Engine in the cluster and the other Engine is Passive. If a failover occurred, the Engine that was previously Passive will show as Active.

There may also be issues if the failover process did not complete properly. This shows up as the retention LUN that was mounted on the previously Active Engine not being automatically mounted on the new Active Engine in the cluster. Log into the UI at http://<MaxRep Control Service Engine IP>/ui, navigate to the "Protect" tab, and from the left menu click "Manage Protection Plan". On this page, under the Servers column, the Source Engine and Target Engine should reflect the current Active Engines in each cluster; if they do not, the failover did not work properly.

A failover will only occur if the Protection Plans were Activated at the time the HA failover occurred.

Cause

The cause of an HA failover is typically the Active Engine in the cluster having issues communicating with the Ping Node. (The Ping Node is an HA setting that identifies a common IP address that both Replication Engines in the HA cluster ping. If the active node cannot ping the Ping Node, the passive Replication Engine pings the node and initiates failover.) It is very important to use a reliable Ping Node that is always available; a local switch or router is usually a good Ping Node candidate.

Indications of the Active Engine having problems reaching the Ping Node can be seen in the file /var/log/messages as "Late heartbeat" warnings:

May 30 07:03:48 MAXREP02 heartbeat: [20041]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10020 ms
May 30 07:08:33 MAXREP02 heartbeat: [20041]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10020 ms
May 30 07:09:23 MAXREP02 heartbeat: [20041]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10040 ms
May 30 07:23:38 MAXREP02 heartbeat: [20041]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10050 ms
May 30 07:51:38 MAXREP02 heartbeat: [6296]: WARN: Late heartbeat: Node <Ping Node IP>: interval 10020 ms

If these messages continually repeat, it indicates that the MaxRep Engine is seeing high response times reaching the Ping Node IP address; in the example above the responses to the Ping Node are late by about 10 seconds (> 10000 ms).
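As a quick check of how frequently these warnings are occurring, the messages file can be searched directly. This is only a minimal sketch; the log file path is the same one shown in the example above:

[root@MAXREP02 ~]# grep "Late heartbeat" /var/log/messages | tail -5
[root@MAXREP02 ~]# grep -c "Late heartbeat" /var/log/messages

A count that keeps climbing between checks indicates the Active Engine is still struggling to reach the Ping Node.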
It was also noticed that the /var/log/messages file was continually repeating these ERROR messages:

May 30 12:24:52 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:53 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:53 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:54 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:55 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:56 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)
May 30 12:24:57 MAXREP02 heartbeat: [6401]: ERROR: Message hist queue is filling up (500 messages in queue)

Attempting to stop the heartbeat service with the command "service heartbeat stop" would hang, and the service would continue to run. The "Message hist queue is filling up" errors indicate that the heartbeat service is having problems communicating with the other MaxRep Engine in the cluster.
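If the heartbeat service hangs on shutdown, its state can be checked with standard service and process commands. This is only a minimal sketch; the exact process names on a given MaxRep release may differ:

[root@MAXREP02 ~]# service heartbeat status
[root@MAXREP02 ~]# ps -ef | grep "[h]eartbeat"

If heartbeat processes are still listed after the stop attempt, the service did not shut down cleanly.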
Solution

First review how the MaxRep Engine HA cluster is configured. This can be checked from the Support page http://<MaxRep Control Service Engine IP>/support: under "Management Tasks" click "Configure MaxRep HA", select the Primary Node having issues, and note what the Multicast group IP address and the Ping Node IP address are set to.

If the MaxRep HA was not configured from the Support page, log into the MaxRep Engine console using the root user account. The default password is found in Document 2046703.1 FS System: Passwords Associated with the Oracle FS1-2 Flash Storage System.

Cat the file /etc/ha.d/ha.cf to view the HA configuration:

[root@MAXREP02 ~]# cat /etc/ha.d/ha.cf
keepalive 5
deadtime 120
warntime 10
initdead 130 # depend on your hardware
udpport 694
ping <Ping Node IP>
mcast MgtBond <mcast group IP> 694 1 0
auto_failback off
node MAXREP01
node MAXREP02
respawn root /usr/lib64/heartbeat/ipfail
use_logd yes

Both Engines in the cluster should have the same HA configuration.
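One way to confirm that the configuration matches on both Engines is to compare the file directly. This is only a sketch and assumes root ssh access to the peer node (MAXREP01 in the example configuration above) is available:

[root@MAXREP02 ~]# ssh MAXREP01 cat /etc/ha.d/ha.cf | diff /etc/ha.d/ha.cf -

No output from diff means the two files are identical; any differences should be reconciled before testing failover again.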
For HA to be functioning properly, verify the following:

To test Ping Node connectivity, from the MaxRep Engine issue a ping to the Ping Node IP address:

[root@MAXREP02 ~]# ping <Ping Node IP>
PING <Ping Node IP> (<Ping Node IP>) 56(84) bytes of data.
64 bytes from <Ping Node IP>: icmp_seq=1 ttl=64 time=0.037 ms
64 bytes from <Ping Node IP>: icmp_seq=2 ttl=64 time=0.016 ms
64 bytes from <Ping Node IP>: icmp_seq=3 ttl=64 time=0.016 ms
64 bytes from <Ping Node IP>: icmp_seq=4 ttl=64 time=0.017 ms
64 bytes from <Ping Node IP>: icmp_seq=5 ttl=64 time=0.022 ms

In this example the MaxRep Engine can successfully reach the Ping Node IP and the ping replies are under 1 ms. If there are intermittent network issues reaching the Ping Node, or if the MaxRep Engine receives much higher ping replies (for example, 8-9 ms of latency), consider using a different Ping Node.
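A short interactive ping may not catch intermittent problems. As a minimal sketch, a larger sample can be collected so the summary statistics show whether latency spikes or packet loss occur over time (the count of 100 is only an example):

[root@MAXREP02 ~]# ping -c 100 <Ping Node IP>

The packet loss percentage and the rtt min/avg/max/mdev summary printed at the end give a better picture of the path to the Ping Node than a handful of replies.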
Also verify that both MaxRep Engines in the cluster are communicating with each other properly.
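One way to observe this communication is to capture the heartbeat multicast traffic on UDP port 694 (the port and the MgtBond interface are taken from the ha.cf example above). This is only a sketch and assumes tcpdump is available on the Engine:

[root@MAXREP02 ~]# tcpdump -ni MgtBond udp port 694

Heartbeat packets sourced from both MAXREP01 and MAXREP02 should appear. If packets from only the local Engine are seen, the Engines are not exchanging heartbeats and the multicast path between them needs to be investigated.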
References

<BUG:26189907> - MAXREP DOCUMENTS NEED TO BE UPDATED