
Asset ID: 1-71-2310291.1
Update Date: 2017-10-09

Solution Type: Technical Instruction

Solution 2310291.1: DSR Multiple Diameter Links down with Alarms 5001, 31226 and 8000


Related Items
  • Oracle Communications Diameter Signaling Router (DSR)
Related Categories
  • PLA-Support>Sun Systems>CommsGBU>Global Signaling Solutions>SN-SND: Tekelec DSR




In this Document
Goal
Solution
 Event IDs: 5001 & 31226 On Servers: DRA1, DRA2, DRA3 and DRA4
 Conclusion
 Event ID: 8000 DA-MP connection FSM exception


Created from <SR 3-15484840831>

Applies to:

Oracle Communications Diameter Signaling Router (DSR) - Version DSR 7.2.0 and later
Tekelec

Goal

Multiple Diameter links were observed down on the Diameter Routing Agent (DRA) servers for approximately 20 minutes. The reason for this outage needs to be determined.

Solution

Event IDs: 5001 & 31226 On Servers: DRA1, DRA2, DRA3 and DRA4

From the DRA5 perspective, at the onset there is difficulty communicating with its peers and with the active SOAM.

In /var/log/messages, at 18:46:31 an EXGSTACK_Process thread watchdog failure appears:

08/04/2017 18:46:31 (EXGSTACK_Process) dsr#31003{Thread Watchdog Failure}

Along with eth21 going down:

Aug 4 18:46:19 DRA5 kernel: NETDEV WATCHDOG: eth21 (bnx2x): transmit queue 2 timed out
Aug 4 18:46:19 DRA5 kernel: bond1: link status down for active interface eth21, disabling it in 200 ms
Aug 4 18:48:58 DRA5 kernel: bond1: link status definitely down for interface eth21, disabling it
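
The same messages can be pulled directly from /var/log/messages on the affected MP; a minimal sketch (the interface name and incident window are the ones from this case and will differ on other systems):

# NIC transmit-queue watchdog hits and bond1 link-state changes
grep -E "NETDEV WATCHDOG|bond1: link status" /var/log/messages

# Optionally narrow to the incident window
grep -E "^Aug +4 18:4" /var/log/messages | grep -E "eth21|bond1"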

The syscheck fail_log in the platform log reports a failure on eth21:

1501852688(30673)> ARGS: syscheck
1501852736(30847)> START TIME: Fri Aug 4 18:48:56 2017
1501852736(30847)> ARGS: syscheck
1501852738(27810)> * ipbond: FAILURE:: MINOR::5000000000002000 -- Device Interface Warning
1501852738(27810)> * ipbond: FAILURE:: Enslaved device eth21 is going down
1501852860(32074)> START TIME: Fri Aug 4 18:51:00 2017
1501852860(32074)> ARGS: syscheck
1501852861(32074)> STOP TIME: Fri Aug 4 18:51:01 2017
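
To re-run this check manually, syscheck can be executed and the rolling fail_log reviewed; a minimal sketch, assuming the default TPD log location /var/TKLC/log/syscheck/fail_log on this release:

# Run the platform health check as root
sudo syscheck

# Review recent syscheck failures, such as the ipbond/eth21 entries above
sudo tail -50 /var/TKLC/log/syscheck/fail_log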

bond1 then makes eth22 its new active slave (/var/log/messages):

Aug 4 18:48:58 DRA5 kernel: bond1: making interface eth22 the new active one
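
The currently active slave and the link state of each enslaved interface can be confirmed from the kernel bonding status; a minimal sketch using standard Linux tooling:

# bond1 status: active slave, per-slave MII status and link failure counts
cat /proc/net/bonding/bond1

# Administrative and operational state of the enslaved interface
ip link show eth21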

At that point, the EXGSTACK_Process recovers and communication is restored.

S pid procTag $1 stat spawnTime N cmd
A 30869 EXGSTACK_Process 7 Up 08/04 18:48:59 2 dsr
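
The process listing above can be re-checked on the MP; a minimal sketch, assuming the COMCOL process listing utility pl is available on this release:

# Confirm EXGSTACK_Process is back in the Up state
pl | grep EXGSTACK_Process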

Interface eth21 also comes back up (/var/log/messages):

Aug 4 18:48:58 DRA5 kernel: bnx2x 0000:21:00.0: eth21: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
Aug 4 18:48:58 DRA5 kernel: bond1: link status up for interface eth21, enabling it in 200 ms
Aug 4 18:48:59 DRA5 kernel: MGMT: IN=bond0 OUT= MAC=XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX SRC=XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX DST=YYYY:YYYY:YYYY:YYYY:YYYY:YYYY LEN=96 TC=0 HOPLIMIT=1 FLOWLBL=0 PROTO=UDP SPT=55060 DPT=5220 LEN=56
Aug 4 18:48:59 DRA5 kernel: bond1: link status definitely up for interface eth21, 10000 Mbps full duplex
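
If the NIC is suspected, the link state and driver-level error counters on eth21 are worth capturing for comparison against any later recurrence; a minimal sketch using standard ethtool queries:

# Link speed, duplex and link-detected state reported by the bnx2x driver
ethtool eth21

# Driver/NIC statistics; rising transmit timeout or error counters would support a hardware cause
ethtool -S eth21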

The impact on the other servers is attributable to the interface failure, which appears to have triggered COMCOL HA and replication instability, as seen in the logs during the period when the interfaces changed roles.

From cmsysdiag log:

Line 4420: ^^ 08/04/2017 18:49:03.473 11358706 inetrep DB Replication From Master Failure DRA6-DSR_SLDB_Policy
Line 4426: .. 08/04/2017 18:49:03.395 11358705 inetrep S/W Status
Line 4432: -* 08/04/2017 18:46:25.937 11358437 inetrep DB Replication From Master Failure DRA7
Line 4442: ^^ 08/04/2017 18:49:01.758 11358694 inetrep DB Replication To Slave Failure DRA8-DSR_SLDB_Policy
Line 4454: .. 08/04/2017 18:49:01.619 11358674 inetrep S/W Status
Line 4460: .. 08/04/2017 18:49:01.758 11358693 inetrep DB Replication Audit Complete DRA8
Line 4485: *C 08/04/2017 18:48:58.604 11358588 inetrep DB Replication To Slave Failure DRA9-DSR_SLDB_Policy
Line 4494: *C 08/04/2017 18:48:58.606 11358595 inetrep DB Replication To Slave Failure DRA6-DSR_SLDB_Policy
Line 4584: -* 08/04/2017 18:48:58.605 11358594 inetrep DB Replication From Master Failure DRA7
Line 4618: ^^ 08/04/2017 18:49:01.572 11358671 inetrep DB Replication From Master Failure DRA10
Line 4661: -* 08/04/2017 18:48:58.607 11358596 inetrep HA Remote Subscriber Heartbeat Warning DRA11
Line 4667: -* 08/04/2017 18:48:58.607 11358597 inetrep DB Replication From Master Failure DRA11
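
Once the interface is stable, it can be confirmed that replication and HA have recovered on each affected server; a minimal sketch, assuming the standard COMCOL utilities irepstat and ha.mystate are present on this DSR release:

# Replication status for each peer link on this server
irepstat

# Local HA resource state for this server
ha.mystate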

Conclusion

The sequence of events points towards a possible NIC issue, but since the interface has come back up and the problem has not recurred, no hardware change is recommended at this time. The Onboard Administrator (OA) logs for the enclosure may confirm whether this was caused by a hardware issue on the enclosure switch or only by the server NIC.
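
To rule out the enclosure side, the enclosure management logs can be reviewed; a hedged sketch, assuming an HP BladeSystem c-Class enclosure managed by an Onboard Administrator (the SHOW SYSLOG commands are the assumed OA CLI syntax and the bay numbers below are placeholders):

# From the active OA CLI session
SHOW SYSLOG OA
SHOW SYSLOG INTERCONNECT 1
SHOW SYSLOG SERVER 5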

Event ID: 8000 DA-MP connection FSM exception

This is an ongoing situation and not related to the incident reported above.

Open an SR with TAC and provide the below-mentioned outputs, collected from the SOAM and the affected MP, for troubleshooting:

iqt -E ConnectionStatus.1 > ConnectionStatus_$HOSTNAME
iqt -E ConnectionAdmin > ConnectionAdmin_$HOSTNAME
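
The two commands above write one file per host; a minimal sketch showing how they might be gathered into a single directory before attaching them to the SR (the directory name is only an example):

# Run on the active SOAM and on each affected DA-MP
mkdir -p /var/tmp/sr_logs && cd /var/tmp/sr_logs
iqt -E ConnectionStatus.1 > ConnectionStatus_$HOSTNAME
iqt -E ConnectionAdmin > ConnectionAdmin_$HOSTNAME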

Attachments
This solution has no attachment