NSP GUI Unreachable With "Failure of server APACHE bridge" Due to the Big Flow of Received Alarms

Asset ID:	1-72-1910778.1
Update Date:	2015-12-10
Keywords:

Solution Type Problem Resolution Sure

Solution 1910778.1 : NSP GUI Unreachable With "Failure of server APACHE bridge" Due to the Big Flow of Received Alarms

Applies to:

Oracle Communications Performance Intelligence Center (PIC) Software - Version 4.1 and later
Information in this document applies to any platform.

Symptoms

The NSP GUI is no longer accessible and displays the error "Failure of server APACHE bridge". This happens when all WebLogic instances become unreachable, which can have several causes.

The status of each instance can be checked in the WebLogic console.

Cause

The NSP is receiving a large number of alarms from remote servers. This overloads the JMS queues, and as a result the WebLogic instances hang.
To confirm, check the JMS queues of the WebLogic instances and verify that they are overloaded, paying particular attention to the "Messages Pending" column. The JMS queue load can be checked in the WebLogic console under the following menu:
Domain Structure -> tekelec -> Services -> Messaging -> JMS Servers -> NSPJMSServerxa -> Monitoring -> Active Destinations

Solution

Identify the object sending the largest number of alarms in the ProAlarm viewer, using the list of all alarms.

Each alarm with a high number of occurrences must be addressed.

Alternate method using SQL

Identify the object sending the largest number of alarms. On the NSP Oracle server, as the oracle user:
1. Connect to the NSP database:
  # sqlplus login/password
2. Count the raised alarms per object ID (MO_ID), ordered by count:
  
  SQL> select count(*),MO_ID from COR_ALARM group by MO_ID order by count(*);
  The result should look like the following:
  COUNT(*)      MO_ID
  ---------- ----------
  13244     331048
  14497      27909
  16642     376377
  44439     275531
3. Identify the name of the impacted object (the biggest counts are at the bottom of the list):
  
  SQL> select NAME from COR_MANAGED_OBJECT where MO_ID=xxxx;
  where xxxx is the MO_ID obtained in step 2
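
Steps 2 and 3 can be combined into one query. The following is a minimal sketch, not part of the original procedure: the function name top_alarm_sources_sql is hypothetical, and the join assumes that MO_ID in COR_ALARM refers to MO_ID in COR_MANAGED_OBJECT, as the lookup in step 3 suggests. The function only prints the SQL text, so it can be reviewed before it is run.

```shell
# Hypothetical helper: print a query listing the objects that raise the
# most alarms, with their names, biggest counts first.
# $1: number of rows to return (defaults to 10).
top_alarm_sources_sql() {
  limit="${1:-10}"
  # Reject anything that is not a plain positive integer.
  case "$limit" in
    ''|*[!0-9]*) echo "limit must be a number" >&2; return 1 ;;
  esac
  # Assumption: COR_ALARM.MO_ID matches COR_MANAGED_OBJECT.MO_ID.
  cat <<EOF
select * from (
  select a.MO_ID, o.NAME, count(*) as ALARM_COUNT
  from COR_ALARM a
  join COR_MANAGED_OBJECT o on o.MO_ID = a.MO_ID
  group by a.MO_ID, o.NAME
  order by count(*) desc
) where rownum <= ${limit};
EOF
}
```

As the oracle user, the output could then be piped to SQL*Plus, for example: top_alarm_sources_sql 5 | sqlplus -s login/password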
Identify the alarm causing the overload. Connect to the impacted server identified in the previous step (or to the master server of its subsystem) and display the JMX agent log. If the impacted object is a link or linkset, connect to the master server of the xMF subsystem where the link or linkset is defined. As cfguser:

$ cd $PROC
$ cf.follow -20 jmx_agent.log
In the following example, alarms are caused by a Network Interface Board Error:
0701:093710.587 TR-V alarm 'Ethernet - Network Interface Board Error' activated (devName=eth31; moOid=.1.3.6.1.4.1.4404.20.0.181952565.3.1) [14571/MonMgr.C:549]
0701:093712.090 TR-V alarm 'Ethernet - Network Interface Board Error' cleared (devName=eth34; moOid=.1.3.6.1.4.1.4404.20.0.181952565.3.4) [14571/MonMgr.C:559]
Take the actions needed to stop the identified alarms from being raised. In the above example, the Network Interface Board Error was caused by frames with an incorrect MTU size.
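
When the log contains many different alarms, a tally per alarm name helps to spot the one that dominates. The following is a minimal sketch using standard tools; the function name count_activated_alarms is hypothetical, and it assumes that the log is a plain text file whose lines follow the format of the two sample lines above (alarm name in single quotes, followed by "activated" or "cleared").

```shell
# Hypothetical helper: count "activated" events per alarm name in a log
# file and print the most frequent alarm first.
# $1: path to the log file (or to a saved copy of it).
count_activated_alarms() {
  # Keep only activation lines, then take the text between the first
  # pair of single quotes, which is the alarm name in the sample lines.
  grep "' activated " "$1" \
    | awk -F"'" '{ print $2 }' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

For example, as cfguser: count_activated_alarms "$PROC/jmx_agent.log". Each output line shows the number of activations followed by the alarm name.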

Attachments

This solution has no attachment