Asset ID: 1-71-2281488.1
Update Date: 2017-06-29
Solution Type: Technical Instruction Sure
Solution 2281488.1: DSR Several Peer Down Alarms
Related Items
- Oracle Communications Diameter Signaling Router (DSR)
Related Categories
- PLA-Support>Sun Systems>CommsGBU>Global Signaling Solutions>SN-SND: Tekelec DSR
Created from <SR 3-15155772601>
Applies to:
Oracle Communications Diameter Signaling Router (DSR) - Version DSR 7.0.1 and later
Tekelec
Goal
Several peers lost their connections, but no specific MP was associated with the failures and no problem was reported on the routers or switches. The peers remained out of service for about 30 seconds before traffic returned.
Solution
Summary Analysis
The basic conclusion from the available data is that these were network impairment events. Consider the following:
- Typically, a problem at the DSR affects either a single connection or many unrelated connections on a single DA-MP. Here, whole peers are affected across multiple DA-MPs concurrently, while other peers are unaffected.
- A spontaneous problem is rare; usually there is some kind of trigger (a new connection, etc.) that exposes a DSR problem.
- No accompanying errors (usually seen as INFO events) or alarms were raised in the context of the event to suggest anything unexpected or unusual.
- Several sets of connections/peers go unavailable during very short, specific intervals during the day, e.g. 17 June 2017 (16:29-16:30, 16:55-16:56, 19:28, 20:05, etc.).
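The clustering argument above can be checked mechanically. The following is a minimal sketch, assuming the events have already been exported and parsed into (timestamp, DA-MP, peer) tuples. The record layout, names, and sample values are hypothetical and do not reflect the actual DSR export format. It groups events into bursts and lists the DA-MPs and peers involved in each burst.

```python
from datetime import datetime, timedelta

# Hypothetical records: (timestamp, DA-MP name, peer name).
# Names and values are illustrative only, not real DSR export data.
events = [
    (datetime(2017, 6, 17, 16, 29, 5), "DA-MP-1", "peerA"),
    (datetime(2017, 6, 17, 16, 29, 7), "DA-MP-2", "peerA"),
    (datetime(2017, 6, 17, 16, 29, 40), "DA-MP-3", "peerB"),
    (datetime(2017, 6, 17, 16, 55, 12), "DA-MP-1", "peerC"),
    (datetime(2017, 6, 17, 19, 28, 30), "DA-MP-2", "peerA"),
]


def cluster_events(records, gap=timedelta(seconds=60)):
    """Group records into bursts; a new burst starts whenever the time
    between two consecutive events exceeds `gap` (an assumed threshold)."""
    bursts = []
    for rec in sorted(records):
        if bursts and rec[0] - bursts[-1][-1][0] <= gap:
            bursts[-1].append(rec)
        else:
            bursts.append([rec])
    return bursts


def summarize(burst):
    """Report the time span of a burst and the distinct DA-MPs and peers in it."""
    return {
        "start": burst[0][0],
        "end": burst[-1][0],
        "mps": sorted({mp for _, mp, _ in burst}),
        "peers": sorted({peer for _, _, peer in burst}),
    }


summaries = [summarize(b) for b in cluster_events(events)]
for s in summaries:
    print(s["start"].time(), s["end"].time(), s["mps"], s["peers"])
```

A burst that involves several DA-MPs but only a subset of peers is consistent with a cause outside the DSR, whereas a burst confined to one DA-MP would point back at a local fault.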
Detailed Analysis
The Events data contains about 80,000 events spanning 48 hours.
Interesting events for troubleshooting the situation are:
Events 22051, 22055, 22101, 22102, 22303, 22313, 22322, 22326, 22329, and 22345
Events taking place during this time are:
- 22101-Connection Unavailable
- 22103-SCTP Connection Impaired
- 22303-Connection Unavailable: Peer closed connection
- 22313-Connection Unavailable: Transport failure
- 22322-Connection Proving Success
- 22326-Connection Established
- 22329-SCTP Connection Impaired: A path has become unreachable
- 22345-Connection Priority Level changed
All of these alarms point towards an issue at the SCTP layer or the physical layer impacting the link.
Other such events are found to be:
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 59 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 08 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 05 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 31 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 30 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 90 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 2 Minutes 40 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 58 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 45 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 60 Seconds Approx
YYYY-MM-DD hh:mm:sec.msec GMT-YYYY-MM-DD hh:mm:sec.msec GMT duration: 19 Seconds Approx
All of these network-related events are of short duration, so they may be due to a network latency issue or a brief glitch somewhere along the network path.
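Durations like those listed above can be derived by pairing each 22101 (Connection Unavailable) event with the next 22326 (Connection Established) event on the same connection. The following is a minimal sketch under that assumption; the record layout, connection names, and timestamps are hypothetical.

```python
from datetime import datetime

DOWN = 22101  # Connection Unavailable
UP = 22326    # Connection Established

# Hypothetical records: (timestamp, connection name, event ID).
log = [
    (datetime(2017, 6, 17, 16, 29, 5), "conn1", DOWN),
    (datetime(2017, 6, 17, 16, 29, 6), "conn2", DOWN),
    (datetime(2017, 6, 17, 16, 29, 35), "conn1", UP),
    (datetime(2017, 6, 17, 16, 30, 4), "conn2", UP),
]


def outage_durations(records):
    """Return (connection, seconds) for each DOWN event followed by an UP
    event on the same connection. Unmatched events are ignored."""
    open_outages = {}
    result = []
    for ts, conn, event_id in sorted(records):
        if event_id == DOWN:
            # Keep the earliest DOWN if the event repeats before recovery.
            open_outages.setdefault(conn, ts)
        elif event_id == UP and conn in open_outages:
            start = open_outages.pop(conn)
            result.append((conn, (ts - start).total_seconds()))
    return result


durations = outage_durations(log)
for conn, seconds in durations:
    print(conn, seconds)
```

Consistently short durations across many connections fit a transient network event better than a server fault, which would normally persist until someone intervenes.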
Also, a very large number of the following messages are seen on the system:
"22008 - Orphan Answer Response Received" for multiple far-end locations.
Conclusions from observation
Both of these observations confirm multiple drops and recoveries, each of short duration, which clearly points towards a network impairment issue.
I believe that the delay in Answer messages is a good starting point for investigating response delays and a possible network issue.
The 22345 events indicating socket send failure and the 22303 events showing Peer Closed Connection (without DPR), together with the clustered nature of the events, support the conclusion that network events outside the DSR are causing peers to become unavailable momentarily, after which the recovery mechanisms bring the connections back up and peer availability recovers.
Connections span multiple MPs, so there is no local server fault, and I see no evidence of processing errors among the events.
Recommended Actions
- None at the DSR; no evidence of mishandling or local error.
- Looking outward, determine where the paths might converge such that a momentary or short-term event (failure, buffer overrun, etc.) may impact connectivity for the affected peers.
- Review each peer node to identify whether any further clues are available to isolate the location of the cause.
- For Alarm "22008 - Orphan Answer Response Received", I would suggest:
  - Contact the peer nodes to identify the reason for the delay in Answer messages.
  - Investigate latency in the IP media, which may result in delay and cause orphan messages on the DRA.
  - On the DRA, if the Pending Answer Timer (PAT) is set to a very low value, modify the timer value (SOAM GUI > Diameter > Configuration > System Options) as required.
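When deciding whether the PAT is too low, it can help to compare candidate timer values against measured Answer latencies. The sketch below uses a simplified model in which any Answer arriving after the PAT expires counts as an orphan. The latency samples and candidate PAT values are hypothetical and are not DSR defaults.

```python
def orphan_rate(latencies_ms, pat_ms):
    """Fraction of Answers that would arrive after the Pending Answer Timer
    expires, under the simplified assumption that each late Answer is
    treated as an orphan."""
    if not latencies_ms:
        return 0.0
    late = sum(1 for latency in latencies_ms if latency > pat_ms)
    return late / len(latencies_ms)


# Hypothetical Answer latencies in milliseconds, e.g. taken from a capture.
latencies = [120, 180, 950, 1400, 2100, 300, 2600, 90]

# Hypothetical candidate PAT values in milliseconds.
for pat in (1000, 2000, 3000):
    print(pat, orphan_rate(latencies, pat))
```

Raising the timer hides late Answers but also delays rerouting of genuinely lost requests, so the chosen value should follow the peers' real response times rather than simply being set high.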
References
<NOTE:2160456.1> - SRDC - Collect Data for Diameter Signaling Router
Attachments
This solution has no attachment