Standby Node Becomes Out of Service After Memory Peak Alarm In Active Node

Asset ID:	1-71-2291969.1
Update Date:	2018-03-29
Keywords:

Solution Type Technical Instruction Sure

Solution 2291969.1 : Standby Node Becomes Out of Service After Memory Peak Alarm In Active Node

Applies to:

Acme Packet 6300 - Version S-Cz7.3.0 and later
Information in this document applies to any platform.

Goal

Explain why the standby node of an SBC HA pair goes Out of Service (OOS) after a memory peak alarm on the active node.

Solution

This issue relates to the HA Media Interface Keepalive feature, described on page 913 of the Oracle Communications Session Border Controller Configuration Guide.

In many cases, because much information is replicated between the active and standby nodes, memory that is allocated on the active and never freed also shows up as allocated on the standby.
In those cases, the only way out is to reboot both nodes at the same time to free the memory that some process is not releasing.

The system log on the standby shows:
Jul 5 15:10:48.282 [MAJOR] (0) Peer <ActiveNodename> timed out in state Active, my state is Standby
Jul 5 15:10:48.282 [WARNING] (0) BERPProcess::setPeerAddress() - old = XXX.YY.Z.W, new = XXX.YY.Z.W
Jul 5 15:10:48.282 [WARNING] (0) BERPProcess::decisionStandby() - active peer <ActiveNodename> has unacceptable health (100) or has timed out
Jul 5 15:10:48.282 [WARNING] (0) BerpProcess::decisionStandby() - taking 500 ms to check for peer over media i/f
Jul 5 15:10:48.287 [WARNING] (0) BERPProcess::decisionStandby() - received arp reply from active peer, going out of service
Jul 5 15:10:48.287 [CRITICAL] (0) Switchover, Standby to OutOfService, active peer <ActiveNodeName> has timed out, but active replied to arp within 500ms
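
When searching a longer log for this signature, a small script can help. The following is a minimal sketch, not an Oracle tool: it assumes the log lines follow the layout of the excerpt above (month, day, time, bracketed severity, message), and the function name find_oos_switchovers is hypothetical.

```python
import re

# Pattern for lines shaped like the excerpt above:
# "Jul 5 15:10:48.287 [CRITICAL] (0) <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"\[(?P<sev>[A-Z]+)\]\s+\(\d+\)\s+(?P<msg>.*)$"
)


def find_oos_switchovers(lines):
    """Return (timestamp, message) for each Standby-to-OutOfService switchover."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not match the assumed layout
        if m.group("sev") == "CRITICAL" and "Standby to OutOfService" in m.group("msg"):
            hits.append((m.group("ts"), m.group("msg")))
    return hits


sample = [
    "Jul 5 15:10:48.282 [MAJOR] (0) Peer <ActiveNodename> timed out in state Active, my state is Standby",
    "Jul 5 15:10:48.287 [WARNING] (0) BERPProcess::decisionStandby() - received arp reply from active peer, going out of service",
    "Jul 5 15:10:48.287 [CRITICAL] (0) Switchover, Standby to OutOfService, active peer <ActiveNodeName> has timed out, but active replied to arp within 500ms",
]

for ts, msg in find_oos_switchovers(sample):
    print(ts, "->", msg)
```

Only the CRITICAL switchover line is reported; the preceding MAJOR and WARNING lines give the context for why the switchover happened.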

The configuration file shows the following:

redundancy-config
  state enabled
  log-level INFO
  health-threshold 75
  emergency-threshold 50
  port XXXX
:

<snip>

:
  gateway-heartbeat-timeout 1
  gateway-heartbeat-health 0
  media-if-peercheck-time 500    <--- media interface peer check is enabled (500 ms)
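
To check a saved configuration for this setting, a short script along these lines may help. It is a sketch under assumptions: the configuration is available as plain text in the "name value" layout shown above, the helper name peercheck_time_ms is hypothetical, and a value of 0 (or a missing parameter) is assumed to mean the peer check is off.

```python
def peercheck_time_ms(config_text):
    """Return media-if-peercheck-time in ms, or None if the parameter is absent."""
    for raw in config_text.splitlines():
        parts = raw.split()
        if len(parts) >= 2 and parts[0] == "media-if-peercheck-time":
            return int(parts[1])
    return None


sample_config = """\
redundancy-config
  state enabled
  health-threshold 75
  emergency-threshold 50
  gateway-heartbeat-timeout 1
  gateway-heartbeat-health 0
  media-if-peercheck-time 500
"""

value = peercheck_time_ms(sample_config)
# Assumption: 0 or a missing parameter means the media interface peer check is off.
enabled = value is not None and value > 0
print("media-if-peercheck-time:", value, "enabled:", enabled)
```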

In this case, the standby stops receiving responses to the checkpoint messages it sends to the active SBC, so it declares that the active has timed out. However, the media interface peer check is enabled on this HA pair.
With this feature enabled, once the standby stops receiving responses from the active through the HA ports, it sends an ARP request to the media interfaces of the active SBC. Because it received a response to that ARP request, it took itself Out of Service (OOS). This is how the feature is designed and is normal behavior.
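
The decision described above can be summarised as a small state function. This is an illustrative sketch only, not the SBC's actual implementation: the function and parameter names are hypothetical, and every branch other than "ARP reply received, go Out of Service" (in particular, taking over as active when no ARP reply arrives or when the peer check is disabled) is an assumption about the intended failover behavior.

```python
def standby_decision(peer_timed_out, peercheck_time_ms, arp_reply_received):
    """Return the next state of the standby node (hypothetical model)."""
    if not peer_timed_out:
        # Active peer is still responding on the HA ports.
        return "Standby"
    if peercheck_time_ms > 0 and arp_reply_received:
        # Active still answers on its media interfaces: do not take over.
        return "OutOfService"
    # Assumption: with no ARP reply, or with the check disabled, the standby takes over.
    return "Active"


# The case in this document: peer timed out, 500 ms check, ARP reply received.
print(standby_decision(True, 500, True))
```

The point of the check is to avoid two nodes acting as active at once: if the active still answers ARP on its media interfaces, the standby steps aside instead of taking over.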

Action to be taken:
1. Reboot the active node to clear the memory.
2. Reboot both nodes to clear the memory and restore HA.
Either way, both the active node and the OOS standby node need to be rebooted.
Both steps can be performed in the same maintenance window.

References

<NOTE:1591900.1> - High CPU Spiking Checklist
