
Asset ID: 1-71-2288315.1
Update Date: 2018-04-05
Keywords:

Solution Type: Technical Instruction

Solution 2288315.1: DSR Multihoming SCTP Association Terminates Unexpectedly


Related Items
  • Oracle Communications Diameter Signaling Router (DSR)
Related Categories
  • PLA-Support>Sun Systems>CommsGBU>Global Signaling Solutions>SN-SND: Tekelec DSR




In this Document
Goal
Solution
 Possible Reason for Wrong Dispatch of the HB Message
 Explanation
 If the issue reappears, the following steps should be taken
 To Recover the Condition
References


Created from <SR 3-15265217391>

Applies to:

Oracle Communications Diameter Signaling Router (DSR) - Version DSR 7.0.1 and later
Tekelec

Goal

A multihoming SCTP association is established between DSR and HSS. 

The HSS sends a cross-path HEARTBEAT packet.

The IPFE mistakenly load-balances this packet to MP2.

MP2 responds with an SCTP ABORT.

The HSS considers the SCTP association terminated and responds with an ABORT to subsequent messages from the DSR.
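
If needed, the exchange described above can be confirmed on the wire. The following is a minimal capture sketch, assuming the peer (HSS) address that sends the cross-path heartbeat and the signaling interface name are known; both values below are placeholders, not real configuration:

    # Placeholder values -- substitute the real peer (HSS) address that sends
    # the cross-path heartbeat and the signaling interface carrying that path.
    PEER_IP=aa.aa.aaa.aa
    IFACE=eth3

    # Capture only SCTP traffic exchanged with that peer address; both the
    # cross-path HEARTBEAT and the ABORT returned by the MP match this filter.
    sudo tcpdump -i "$IFACE" -s 0 -w /tmp/sctp_cross_path.pcap \
        "sctp and host $PEER_IP"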

Solution

Various configuration changes had been made on the DSR. In addition, the following alarms are seen:

0704:095902.385 STK-V sync error, did not receive ping for 1000 milliseconds [8824/IpfeStateSync.C:1292]
0704:095904.005 STK-V timeout trying to connect to 10.a.b.24 for sync [8824/IpfeStateSync.C:1433]
mm/dd/yyyy 12:59:02 (IPFE-1: IPFE-A: data read error) ipfe#5003{IPFE state sync run error}
mm/dd/yyyy 12:59:04 (IPFE-1: IPFE-A: connect error) ipfe#5003{IPFE state sync run error}
mm/dd/yyyy 12:59:04 (IPFE-1: eth2) ipfe#5012{Signaling interface heartbeat timeout}
mm/dd/yyyy 12:59:04 (IPFE-1: eth3) ipfe#5012{Signaling interface heartbeat timeout}
mm/dd/yyyy 12:59:02 (IPFE-1: IPFE-A: data read error) ipfe#5003{IPFE state sync run error}
*C GN_DOWN/WRN
^^ [8824:IpfeStateSync.C:1293]


 mm/dd/yyyy 12:59:03 (IPFE-1: 10.255.94.13) ipfe#5001{IPFE Backend Unavailable}
-* GN_DOWN/WRN
^^ [8824:IpfeBackendMonitor.C:194]


 mm/dd/yyyy 12:59:03 (IPFE-1: 10.255.94.77) ipfe#5001{IPFE Backend Unavailable}
-* GN_DOWN/WRN
^^ [8824:IpfeBackendMonitor.C:194]
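
To check whether the same state-sync and backend-monitor messages were raised on a given IPFE, they can be searched for in the relevant IPFE log. The log path below is a placeholder, not a documented location; adjust it to the system at hand:

    # Hypothetical log location -- substitute the actual IPFE process log file.
    LOGFILE=/path/to/ipfe.log

    # Look for the state-sync and backend-monitor messages quoted above.
    grep -E "IpfeStateSync|IpfeBackendMonitor|IPFE state sync run error|IPFE Backend Unavailable" "$LOGFILE"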

Possible Reason for Wrong Dispatch of the HB Message

The heartbeat is sent from IP Address 1 (aa.aa.aaa.aa) to TSA IP Address 2 (bb.bbb.bb.bb).

TSA IP Address 2 (bb.bbb.bb.bb) does not have the correct corresponding association information, and hence the heartbeat packet is dispatched to the wrong destination MP.

1. The peer sent an SCTP INIT (src port 3868) via the secondary path to DSR port 49135 (DSR acting as responder); a new responder association to MPx was created.
2. Within 600 seconds (the delete age time) of enabling the connection, the DSR acted as initiator with SCTP multihoming (src port 49135, dst port 3868) toward the peer; a new initiator association to MPy was created for the primary path.
 -- No initiator association was created for the secondary path, because the previous responder association with the peer secondary IP already existed.
3. Upon receiving the SCTP HEARTBEAT ACK from the peer via the secondary path, the IPFE therefore routed the ACK to MPx instead of the expected MPy.
4. The DSR initiates the SCTP connection periodically (configurable, default 30 seconds), so this unexpected xt_recent record keeps being refreshed (a quick check for such a stale record is sketched below).
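
One quick way to confirm that such a stale responder association record exists is to look for the peer's secondary-path address in the xt_recent tables on both IPFEs. The address below is the placeholder used in this note:

    # Placeholder: the peer's secondary-path address (IP Address 1 above).
    PEER_SEC_IP=aa.aa.aaa.aa

    # Run on each IPFE: any hit shows in which xt_recent list the peer's
    # secondary address is recorded, and hence where the IPFE will keep
    # dispatching secondary-path traffic.
    grep -H "$PEER_SEC_IP" /proc/net/xt_recent/* /proc/net/xt_recentvtag/* 2>/dev/null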

Explanation

When the IPFE creates a new SCTP association with a peer, the new association is added to the association list, which is stored under the xt_recent directory.

The IPFE syncs the association list between the mate IPFEs.

If the heartbeat is sent to an IPFE that does not have the corresponding association information, it may process the heartbeat (from IP Address 1 (aa.aa.aaa.aa) to TSA IP Address 2 (bb.bbb.bb.bb)) as a new incoming message and dispatch this heartbeat packet to the wrong destination MP.

This leads us to believe that IPFE1 and IPFE2 may not have been fully synchronized when the error occurred. This is supported by the IPFE sync alarms shown above, which were raised at the same time.
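
A hedged way to check whether the mate IPFEs are in sync is to collect the xt_recent listings from both and compare them. The hostnames below are placeholders:

    # Hypothetical IPFE hostnames -- replace with the real ones.
    for IPFE in ipfe-1 ipfe-2; do
        # Keep only the first field (list file name plus the src= address) so
        # that timestamp fields do not make every comparison noisy.
        ssh "$IPFE" 'grep . /proc/net/xt_recent/* /proc/net/xt_recentvtag/*' \
            | awk '{print $1}' | sort > "/tmp/xt_recent_${IPFE}.txt"
    done

    # Entries present on one IPFE but not the other suggest the mates are out of sync.
    diff /tmp/xt_recent_ipfe-1.txt /tmp/xt_recent_ipfe-2.txt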

This issue happened when the customer's VMs were shut down unexpectedly.

If a similar situation occurs, start the SCTP PCAP capture and the IPFE association dump script at least 600 seconds (the delete age time) before any recovery procedures (such as enabling the connections).

Note that once this unexpected association record already exists, there is little value in analyzing a PCAP taken only while the DSR initiates the SCTP connection and sends out the SCTP ABORT.
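
As a practical way to honor that 600-second lead time, the sketch below starts a background SCTP capture and an xt_recent dump loop (mirroring the script in step 2 below), then waits out the delete age time before any recovery action. The interface name and output directory are assumptions:

    # Hypothetical interface and output directory -- adjust to the deployment.
    IFACE=eth2
    OUTDIR=/var/tmp/dsr_case
    mkdir -p "$OUTDIR"

    # Background SCTP capture on the IPFE signaling interface.
    sudo nohup tcpdump -i "$IFACE" -s 0 -w "$OUTDIR/sctp_$(hostname).pcap" sctp \
        > "$OUTDIR/tcpdump.log" 2>&1 &

    # Background xt_recent dump loop (same commands as the step-2 script).
    nohup bash -c 'while true; do
        date
        grep . /proc/net/xt_recent/*
        grep . /proc/net/xt_recentvtag/*
        cat /proc/net/xt_recentinfo
        sleep 1
    done' >> "$OUTDIR/xt_recent_$(hostname)" 2>&1 &

    # Wait longer than the delete age time before any recovery action.
    sleep 600
    echo "600 s elapsed -- recovery procedures (e.g. enabling the connection) may start now."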

If the issue reappears, the following steps should be taken

Normally, the peer side should reconnect to the DSR listening port (DSR acting as responder) instead of the DSR initiating a connection.

  1. Capture the PCAP trace on both IPFE1 and IPFE2 (covering the SCTP INIT through the SCTP ABORT). The trace should cover the time the issue occurred.
  2. Dump the xt_recent content from both IPFE1 and IPFE2 multiple times while the issue is present.
      date >> xt_recent_`hostname`
      grep . /proc/net/xt_recent/* >> xt_recent_`hostname`
      grep . /proc/net/xt_recentvtag/* >> xt_recent_`hostname`
      cat  /proc/net/xt_recentinfo >> xt_recent_`hostname`
    A typical script for gathering the xt_recent data is shown below. Oracle TAC can use it to gather the contents and then remove it from the customer's system:
    #!/bin/bash
    # Collect per-second snapshots of the IPFE association state until stopped (Ctrl-C).
    rm -f xt_recent_`hostname`
    while true
    do
      # Timestamp each snapshot.
      date >> xt_recent_`hostname`
      # Per-list association entries and the matching verification-tag entries.
      grep . /proc/net/xt_recent/* >> xt_recent_`hostname`
      grep . /proc/net/xt_recentvtag/* >> xt_recent_`hostname`
      # Summary information from xt_recentinfo.
      cat /proc/net/xt_recentinfo >> xt_recent_`hostname`
      sleep 1
    done

  3. Gather the following information:
    ====================================================
    RETRIEVING TSA/IPFE CONFIGURATION AND SCTP STATE
    ====================================================
    Active SOAM:
    $ iqt -E IpListTsa
    $ iqt -E IpfeOption

    MPs (DA-MP 7 & 9 should be enough):
    $ sudo grep "" /proc/net/sctp/*
    $ netstat
    $ ss


    ====================================================
    RETRIEVING FULL CONFIG EXPORT FROM SOAM
    ====================================================
    1)  Access the SOAM GUI / Main Menu / Diameter / Configuration / Export
        [Note: In recent releases, this has moved from the 'Diameter' folder to the 'Diameter Common' folder]
    2)  In the Export window, use pulldown to select Export Application: ALL
    3)  For Export Directory, select radio button for  Export Server Directory
    4)  Select <Ok> at the bottom of the screen.
    5)  From your local PC, use WinSCP or similar utility and log in as root to the SOAM XMI address
    6)  Pull the entire contents (including subdirectories) of /var/TKLC/db/filemgmt/export/<SOAM_Hostname>/DSR/
    7)  If you have WinZip or 7-Zip, it may be useful to compress the data (a command-line alternative for steps 5-7 is sketched after this list). Attach it to this SR when available.
  4. Share the details with Oracle TAC for further troubleshooting and analysis.
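
For steps 5 to 7, the export tree can also be pulled and compressed from a Linux workstation instead of a GUI SCP client. The address and hostname below are placeholders matching the path in step 6, not real values:

    # Placeholder values -- use the real SOAM XMI address and SOAM hostname.
    SOAM_XMI=10.0.0.1        # SOAM XMI address from step 5
    SOAM_HOST=soam-hostname  # <SOAM_Hostname> from step 6

    # Copy the whole export tree (including subdirectories) and compress it
    # into a single archive suitable for attaching to the SR.
    mkdir -p dsr_export
    scp -r "root@${SOAM_XMI}:/var/TKLC/db/filemgmt/export/${SOAM_HOST}/DSR/" ./dsr_export/
    tar czf "dsr_export_$(date +%Y%m%d).tar.gz" dsr_export/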

To Recover the Condition

Disable the problematic (fluctuating) connection, wait more than 600 seconds (the default delete age time) to make sure all stale records have been deleted, and then re-enable the connection.
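
Before re-enabling, it can be worth verifying on both IPFEs that the stale record for the peer has actually aged out. A minimal check, using the same placeholder peer address as above:

    # Placeholder: the peer address of the disabled connection.
    PEER_IP=aa.aa.aaa.aa

    # Poll every 30 s; once no xt_recent entry matches, the stale record is gone
    # and the connection can be re-enabled.
    while grep -q "$PEER_IP" /proc/net/xt_recent/* /proc/net/xt_recentvtag/* 2>/dev/null; do
        echo "$(date): stale entry still present, waiting..."
        sleep 30
    done
    echo "$(date): no stale entry found -- safe to re-enable the connection."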

References

<NOTE:2106578.1> - Connection Between Diameter Routing Agent (DRA) and Client is Going Down Frequently, DRA is Sending Transmission Control protocol (TCP) Reset Message to Bring Down the Connection

Attachments
This solution has no attachment