Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
DSR Multiple Diameter Links Down with Alarms 5001, 31226 and 8000 (Doc ID 2310291.1)
Solution Type: Technical Instruction (Sure Solution)
Created from <SR 3-15484840831>

Applies to:
Oracle Communications Diameter Signaling Router (DSR) - Version DSR 7.2.0 and later
Tekelec

Goal
Multiple Diameter links from the Diameter Routing Agent (DRA) were observed down for around 20 minutes. The reason for the outage needs to be identified.

Solution

Event IDs 5001 & 31226 on servers DRA1, DRA2, DRA3 and DRA4

From the DRA5 perspective, at the onset of the event the server has difficulty communicating with its peers and with the active SOAM. In /var/log/messages, an EXGSTACK_Process thread watchdog failure appears at 18:46:31:

08/04/2017 18:46:31 (EXGSTACK_Process) dsr#31003{Thread Watchdog Failure}
This occurs along with eth21 going down:

Aug 4 18:46:19 DRA5 kernel: NETDEV WATCHDOG: eth21 (bnx2x): transmit queue 2 timed out
Aug 4 18:46:19 DRA5 kernel: bond1: link status down for active interface eth21, disabling it in 200 ms
Aug 4 18:48:58 DRA5 kernel: bond1: link status definitely down for interface eth21, disabling it

The syscheck fail_log in the platform log also reports the failure on eth21:

1501852688(30673)> ARGS: syscheck
1501852736(30847)> START TIME: Fri Aug 4 18:48:56 2017
1501852736(30847)> ARGS: syscheck
1501852738(27810)> * ipbond: FAILURE:: MINOR::5000000000002000 -- Device Interface Warning
1501852738(27810)> * ipbond: FAILURE:: Enslaved device eth21 is going down
1501852860(32074)> START TIME: Fri Aug 4 18:51:00 2017
1501852860(32074)> ARGS: syscheck
1501852861(32074)> STOP TIME: Fri Aug 4 18:51:01 2017

bond1 then changes its active slave to eth22 (/var/log/messages):

Aug 4 18:48:58 DRA5 kernel: bond1: making interface eth22 the new active one
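As a quick verification step (not part of the original data capture; it assumes shell access to the affected server), the Linux bonding driver exposes the currently active slave and the per-slave link state through /proc:

# Show which slave is active on bond1 and the MII status of each enslaved interface
cat /proc/net/bonding/bond1

# Current administrative/link state of the individual interfaces
ip link show eth21
ip link show eth22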
At that moment the EXGSTACK_Process recovers and communication returns:

S pid procTag $1 stat spawnTime N cmd
A 30869 EXGSTACK_Process 7 Up 08/04 18:48:59 2 dsr

Ethernet interface eth21 also comes back up (/var/log/messages):

Aug 4 18:48:58 DRA5 kernel: bnx2x 0000:21:00.0: eth21: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
Aug 4 18:48:58 DRA5 kernel: bond1: link status up for interface eth21, enabling it in 200 ms
Aug 4 18:48:59 DRA5 kernel: MGMT: IN=bond0 OUT= MAC=XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX SRC=XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX DST=YYYY:YYYY:YYYY:YYYY:YYYY:YYYY LEN=96 TC=0 HOPLIMIT=1 FLOWLBL=0 PROTO=UDP SPT=55060 DPT=5220 LEN=56
Aug 4 18:48:59 DRA5 kernel: bond1: link status definitely up for interface eth21, 10000 Mbps full duplex
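For reference, the kernel-side timeline above can be pulled out of the messages file in a single pass. This is only a sketch using standard grep; the patterns simply match the log lines quoted in this section:

# NIC/bonding events on DRA5 (transmit timeout, bond slave changes, link up/down)
grep -E "NETDEV WATCHDOG|bond1: link status|NIC Link is" /var/log/messages

# Corresponding application-side watchdog events
grep "Thread Watchdog Failure" /var/log/messages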
The impact seen on the other servers is attributable to this interface failure, which appears to have triggered COMCOL HA/replication instability during the period when the interfaces changed roles. From the cmsysdiag log:

Line 4420: ^^ 08/04/2017 18:49:03.473 11358706 inetrep DB Replication From Master Failure DRA6-DSR_SLDB_Policy
Line 4426: .. 08/04/2017 18:49:03.395 11358705 inetrep S/W Status
Line 4432: -* 08/04/2017 18:46:25.937 11358437 inetrep DB Replication From Master Failure DRA7
Line 4442: ^^ 08/04/2017 18:49:01.758 11358694 inetrep DB Replication To Slave Failure DRA8-DSR_SLDB_Policy
Line 4454: .. 08/04/2017 18:49:01.619 11358674 inetrep S/W Status
Line 4460: .. 08/04/2017 18:49:01.758 11358693 inetrep DB Replication Audit Complete DRA8
Line 4485: *C 08/04/2017 18:48:58.604 11358588 inetrep DB Replication To Slave Failure DRA9-DSR_SLDB_Policy
Line 4494: *C 08/04/2017 18:48:58.606 11358595 inetrep DB Replication To Slave Failure DRA6-DSR_SLDB_Policy
Line 4584: -* 08/04/2017 18:48:58.605 11358594 inetrep DB Replication From Master Failure DRA7
Line 4618: ^^ 08/04/2017 18:49:01.572 11358671 inetrep DB Replication From Master Failure DRA10
Line 4661: -* 08/04/2017 18:48:58.607 11358596 inetrep HA Remote Subscriber Heartbeat Warning DRA11
Line 4667: -* 08/04/2017 18:48:58.607 11358597 inetrep DB Replication From Master Failure DRA11
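The replication alarms above cleared on their own once bond1 recovered. If similar inetrep events recur, the current replication state can be spot-checked from the server shell. This is a sketch only, assuming the standard COMCOL replication utility shipped with DSR is present; options and output format vary by release:

# Display replication status for this node (COMCOL utility; exit with 'q' if it runs in monitor mode)
irepstat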
Conclusion

The sequence of events points towards a possible NIC card issue, but since the interface recovered and the problem has not recurred, no hardware change is recommended at this time. The OA logs for the enclosure may confirm whether this was caused by a hardware issue on the enclosure switch or only by the server NIC card.

Event ID: 8000 - DA-MP connection FSM exception

This is an ongoing condition and is not related to the incident analyzed above. Open an SR with TAC and provide the following outputs, collected from the SO and the affected MP, for troubleshooting:

iqt -E ConnectionStatus.1 > ConnectionStatus_$HOSTNAME
iqt -E ConnectionAdmin > ConnectionAdmin_$HOSTNAME
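If convenient, the two exports can be bundled into a single archive before attaching them to the SR. This is a hypothetical convenience step, not part of the documented procedure; the filenames simply match the redirects above:

# Bundle the two exports from this host into one archive for the SR
tar czf connection_data_$HOSTNAME.tgz ConnectionStatus_$HOSTNAME ConnectionAdmin_$HOSTNAME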