High CPU Spiking Checklist

Asset ID:	1-71-1591900.1
Update Date:	2017-06-19
Keywords:

Solution Type Technical Instruction Sure

Solution 1591900.1 : High CPU Spiking Checklist

Applies to:

Net-Net 3810 - Version S-Cx6.3.0 to S-Cx6.4.0 [Release S-Cx6.0]
Acme Packet 3820 - Version S-Cx6.3.0 to S-Cx6.4.0 [Release S-Cx6.0]
Acme Packet 4500 - Version S-Cx6.3.0 to S-Cx6.4.0 [Release S-Cx6.0]
Acme Packet OS

Goal

How to check for high CPU spiking.

Solution

The two most common reasons for high CPU usage or CPU spikes are high traffic volume and high log levels.

High Logs

Ensure that your logs are not set at detailed levels. Issue the command

# show loglevel all

Ensure that all logs are at level 'notice' or lower.

From the ACLI Reference Guide, the following log levels are available in order of least messages logged to most:

-emergency (1)
-critical (2)
-major (3)
-minor (4)
-warning (5)
-notice (6)
-info (7)
-trace (8)
-debug (9)
-detail (10)

In production networks, do not use any log level above notice except during a maintenance window for troubleshooting.
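The check above can be sketched in a few lines of Python. This is only an illustration: the level names and numbers come from the list above, but the input dictionary (process name mapped to level name) is a hypothetical structure that you would fill in by hand from the output of show loglevel all; the script does not talk to the SBC.

```python
# Log levels as listed above, from least to most messages logged.
LOG_LEVELS = {
    "emergency": 1,
    "critical": 2,
    "major": 3,
    "minor": 4,
    "warning": 5,
    "notice": 6,
    "info": 7,
    "trace": 8,
    "debug": 9,
    "detail": 10,
}

# Highest level recommended for production use.
PRODUCTION_MAX = LOG_LEVELS["notice"]


def too_verbose(levels):
    """Return the sorted names of processes logging above 'notice'.

    `levels` is a hypothetical dict of {process_name: level_name},
    transcribed manually from the SBC output.
    An unknown level name raises KeyError.
    """
    offenders = []
    for process, level in levels.items():
        if LOG_LEVELS[level.strip().lower()] > PRODUCTION_MAX:
            offenders.append(process)
    return sorted(offenders)


if __name__ == "__main__":
    sample = {"sipd": "DEBUG", "mbcd": "notice", "algd": "warning"}
    print(too_verbose(sample))
```

Any process returned by too_verbose is a candidate for lowering its log level before investigating the CPU spike further.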

Also ensure that the sipmsg.log is not enabled.

Issue the command

# notify sipd nosiplog

to turn this log off.

Ensure that the system-log-level and process-log-level parameters are set accordingly in the system config.

# conf t
(configure)# system
(system)# system-config
(system-config)# sel
(system-config)# show

A traffic increase can be seen by issuing the command

show sipd agents

High Traffic

High traffic can be register floods, call floods, or various other high message rates caused by abnormal network events.

Use the show sipd (or H323) agents command

# show sipd agents

This command shows the rate at which each of your configured session agents is processing calls.
Use the commands show sipd invite and show sipd register

# show sipd invite
# show sipd register

These commands will show the rate at which invites and register messages are hitting the SBC. It is very common for a register flood to cause a sudden and extended CPU spike.
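As a rough illustration of what "rate" means here, the following Python sketch computes a per-second message rate from two counter readings and compares it with a baseline. Everything in it is an assumption for illustration: the readings are cumulative message counts that you note down yourself at two points in time, the baseline is whatever you consider normal for your network, and the factor of 3 is an arbitrary example value, not a vendor threshold.

```python
def message_rate(count_start, count_end, interval_seconds):
    """Messages per second between two manually recorded counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if count_end < count_start:
        raise ValueError("counter went backwards (reset or wrap?)")
    return (count_end - count_start) / interval_seconds


def looks_like_flood(rate, baseline_rate, factor=3.0):
    """True if the rate exceeds the baseline by the given factor.

    `factor` is a hypothetical example value; choose one that
    fits your own traffic profile.
    """
    return rate > baseline_rate * factor


if __name__ == "__main__":
    # Hypothetical readings: 1000 REGISTERs, then 4000 one minute later.
    rate = message_rate(1000, 4000, 60)
    print(rate, looks_like_flood(rate, baseline_rate=10.0))
```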

Some CLI commands

Running a command such as the following on a system with many registrations (such as 50k) can raise the CPU usage to around 80% for a few seconds causing a CPU spike:

show registration sipd by-realm <realm_id> brief

This is normal. A CLI command runs as fast as possible and therefore uses whatever CPU is available, causing such short spikes. The load shedding algorithms that determine whether calls should be rejected because of CPU overload do not use the overall CPU usage (reported in "show plat cpu-load"); they look at the CPU being used by individual processes, for example SIPD. An overall CPU spike caused by running such a CLI command therefore cannot be the reason for any call rejections, and it is not considered an issue. See also https://bug.oraclecorp.com/pls/bug/webbug_edit.edit_info_top?rptno=24797772
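The distinction between overall CPU and per-process CPU can be illustrated with a small Python sketch. This is not the SBC's load shedding algorithm, which is not described here; the function name, the input dictionary and the 90% threshold are all hypothetical and exist only to show why a busy CLI task does not trigger call rejection.

```python
def overload_by_process(cpu_by_process, process="sipd", threshold=90.0):
    """Illustrative only: decide on one process's CPU, ignoring the total.

    `cpu_by_process` maps process names to CPU percentages.
    `threshold` is a hypothetical value, not a vendor setting.
    """
    return cpu_by_process.get(process, 0.0) >= threshold


if __name__ == "__main__":
    # A CLI command pushes the overall load to about 82%, but sipd stays low.
    during_cli_command = {"sipd": 12.0, "cli": 70.0}
    print(sum(during_cli_command.values()), overload_by_process(during_cli_command))
```

In this example the overall figure is high, yet the decision based on sipd alone is "not overloaded", which matches the behaviour described above.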

==============================================================

Recovering from a CPU spike

==============================================================

Description:

In the course of operation, a system's host processor usage may spike to high levels. Before taking steps to resolve this, it is important to determine the cause. High CPU can be caused either by external factors (higher than usual signaling traffic volume being sent to the system) or by an issue internal to the SBC.

This document explains the procedure to analyze and recover a Net-Net 3000 or 4000-series system from high CPU utilization.

Check the process(es) consuming the CPU:

Determine the spiking process by typing the show proc cpu command on the system experiencing the spike.

show proc cpu

This command displays the processes consuming the most CPU. The current usage per process is in the far-right column, Now.

TAC-4500-1# show proc cpu

Task Name      Task Id  Pri Status     Total CPU      Avg   Now
-------------- -------- --- ---------- -------------- ----- -----
tSipd          2914218c 80  PEND+T     1.006          0.0   0.0
tBgfd          2935731c 80  PEND+T     0.848          0.0   0.0
tCli           2a8b10d4 1   PEND+T     0.687          0.0   0.0
tCliTnet1      2a8b1cb0 1   READY      0.424          0.0   0.0
tAtcpd         03d86d34 75  PEND+T     0.348          0.0   0.0
tAlgd          2907669c 80  PEND+T     0.316          0.0   0.0
tLemd          03d34d34 99  PEND+T     0.303          0.0   0.0
tIked          2a04acd8 80  PEND+T     0.281          0.0   0.0
tMbcd          03cfa8c4 78  PEND+T     0.258          0.0   0.0
-------------- -------- --- ---------- -------------- ----- -----
Applications 6:28.323 3.3
System 3:11:28 2.6

TAC-4500-1#

In this example the system is idle, but in normal operation usage may be seen from a signaling process such as tsipd or h323d, and from a media flow process such as tmbcd. If one of these processes is using more resources than normal, check the commands for the associated process (for example 'show sipd invite', 'show sipd register', or the associated 'show h323d' command) to determine where the signaling spike is coming from. Then drill down using 'show sipd agents' to determine the source of the signaling spike (looking at max-burst, etc.), and work on that external equipment to reduce the amount of signaling hitting the Net-Net system.
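If the show proc cpu output has been saved to a text file, the Now column can be scanned with a short script. The following Python sketch assumes the seven-column layout shown in the example output above (task name, task id, priority, status, total CPU, Avg, Now); the function names are hypothetical and the parser simply skips any line that does not fit that shape, such as the header, the separators, and the Applications and System summary lines.

```python
def parse_show_proc_cpu(text):
    """Parse task rows from saved 'show proc cpu' output.

    Assumes each task row has exactly seven whitespace-separated
    fields, as in the example output in this document.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Task rows have 7 fields and a numeric priority in field 3.
        if len(parts) != 7 or not parts[2].isdigit():
            continue
        try:
            avg, now = float(parts[5]), float(parts[6])
        except ValueError:
            continue
        rows.append({
            "name": parts[0],
            "task_id": parts[1],
            "pri": int(parts[2]),
            "status": parts[3],
            "total": parts[4],
            "avg": avg,
            "now": now,
        })
    return rows


def busiest(rows):
    """Return the row with the highest Now value, or None if empty."""
    return max(rows, key=lambda row: row["now"]) if rows else None


if __name__ == "__main__":
    # Hypothetical capture: the tAlgd values are invented for illustration.
    capture = """Task Name Task Id Pri Status Total CPU Avg Now
-------------- -------- --- ---------- -------------- ----- -----
tSipd 2914218c 80 PEND+T 1.006 0.0 0.0
tAlgd 2907669c 80 READY 0.316 0.5 97.2
Applications 6:28.323 3.3
"""
    print(busiest(parse_show_proc_cpu(capture)))
```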

If the cause of the spike is a registration cascade or flood, it may be helpful to apply SROP (SIP Registration Overload Protection) to help mitigate the traffic. This is detailed in the Security Guide: https://docs.oracle.com/cd/E55742_01/doc/sbc_security.pdf

If the source of the spike is a signaling or traffic-related issue, and the spiking system is part of an HA pair, DO NOT switch over the systems' Active / Standby High Availability roles. Doing so simply moves the CPU spike to the other system in the pair; it does not resolve the issue and could cause other, more serious issues.

However, if the source of the spike is not a signaling or media-related process such as sipd, h323d, or mbcd, but rather an internal process, this may indicate a runaway loop or other bug. To resolve this, follow the steps outlined below.

Runaway Internal Process:

Check the output of show proc cpu in the Now column for a process consuming close to 100% of the resources. If this is the case, the first step is to lower the priority of this task to restore proper operation of the SBC. For the following steps, log the command output and save it for later analysis.
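The decision described so far (traffic problem versus runaway internal process) can be summarised in a small Python sketch. It is only an aid to reading the procedure: the set of task names is taken from the processes mentioned in this document, while the 90% figure standing in for "close to 100%" and the returned labels are hypothetical.

```python
# Signaling and media tasks named in this document, with and
# without the leading "t" seen in the show proc cpu output.
SIGNALING_MEDIA_TASKS = {"sipd", "tsipd", "h323d", "th323d", "mbcd", "tmbcd"}


def triage(task_name, now_percent, busy_threshold=90.0):
    """Classify a task from its name and its Now value.

    `busy_threshold` is a hypothetical stand-in for "close to 100%".
    """
    if now_percent < busy_threshold:
        return "no-action"
    if task_name.lower() in SIGNALING_MEDIA_TASKS:
        # Investigate the traffic source; do not switch over HA roles.
        return "check-traffic"
    # Internal task near 100%: candidate for the recovery procedure.
    return "possible-runaway"


if __name__ == "__main__":
    for name, now in [("tSipd", 95.0), ("tAlgd", 99.0), ("tAlgd", 0.0)]:
        print(name, now, triage(name, now))
```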

Recovery Procedure:

1) Determine the task_id of the runaway process by typing the show process command on the system experiencing the spike. Example shown below.

NAME       ENTRY      TID        PRI STATUS     PC         ERRNO     PD ID
---------- ---------- ---------- --- ---------- ---------- --------- ----------
tAlgd      _Z11algd > 0x3df60bd0 80  PEND+T     0x10c130   0x3d0002  0x233553c

2) Enter vxworks shell mode on the system by typing shell and entering password vxworks.

3) Reduce the problem task's priority on the system from its previous value (in this case 80) to 254 using the command taskPrioritySet <TID> <NEW PRI>, where the hex number is the task ID (TID), which you get from show process cpu, and 254 is the lowest priority. Example shown below.

taskPrioritySet 0x3df60bd0 254

4) While still in the operating system shell, run the commands ti tsipd and tt tsipd 3-4 times, 1 second apart (replacing tsipd with the name of the process spiking the CPU). (These commands do not work on 720m6.) The idea is to get a stack dump of the problem process for later engineering analysis.

5) Exit the shell by typing exit.

6) Verify that the process's priority has changed by typing show proc cpu or show proc. The PRI column should have changed from its previous value (for example 60 or 80) to 254 (lowest system priority).

NAME       ENTRY      TID        PRI STATUS     PC         ERRNO     PD ID
---------- ---------- ---------- --- ---------- ---------- --------- ----------
tAlgd      _Z11algd > 0x3df60bd0 254 PEND+T     0x10c130   0x3d0002  0x233553c

7) Check the output of show health to confirm that the problem system is in Standby state. If it is still in Active state, switch over to the other system by running the command notify berpd force. Then check the states and health using show health, show sessions, etc., to verify proper operation of the newly active system.

8) Now gather the information needed to diagnose the issue on the problem system. Run the show support-info command on the problem system, then run dump-nvram, and archive the logs using either the package-logfiles command or the archives command followed by the create LOGS command.

See also: How to Capture all Log Files on Acme Packet systems including 3820, 3900, 4500, 4600, 6100, 6300 (Doc ID 1589489.1)

9) Last but not least, run dump-tasks to force taskcheckdump.dat to be written to flash. You can now safely reboot the problem system to bring it back into a normal operating state.

10) Finally, once the system has booted up, FTP to the management IP address, gather the files from /code, and provide them to Acme Packet support. These, along with the captured command outputs, will aid in their analysis.
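Steps 1, 3 and 6 of the recovery procedure lend themselves to a small helper that reduces typing errors when copying the task ID. The Python sketch below is an assumption-laden illustration: it parses a process row in the layout shown in the examples above, builds the taskPrioritySet command text for pasting into the shell, and checks a later row for priority 254. The function names are hypothetical, and the script does not connect to or run anything on the SBC.

```python
import re

LOWEST_PRIORITY = 254  # lowest system priority, as stated in step 3

# NAME, ENTRY (may contain spaces, e.g. "_Z11algd >"), TID, PRI, STATUS
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<entry>.+?)\s+(?P<tid>0x[0-9a-fA-F]+)"
    r"\s+(?P<pri>\d+)\s+(?P<status>\S+)"
)
TID = re.compile(r"^0x[0-9a-fA-F]+$")


def parse_process_row(line):
    """Extract name, TID and priority from one process row."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError("line does not look like a process row")
    return {
        "name": match.group("name"),
        "tid": match.group("tid"),
        "pri": int(match.group("pri")),
    }


def build_priority_command(tid, priority=LOWEST_PRIORITY):
    """Return the shell command text from step 3 for the given TID."""
    if TID.match(tid) is None:
        raise ValueError("TID must be a hex number such as 0x3df60bd0")
    return "taskPrioritySet {} {}".format(tid, priority)


def priority_lowered(row):
    """True if the parsed row shows the lowest priority (step 6)."""
    return row["pri"] == LOWEST_PRIORITY


if __name__ == "__main__":
    before = parse_process_row(
        "tAlgd _Z11algd > 0x3df60bd0 80 PEND+T 0x10c130 0x3d0002 0x233553c"
    )
    print(build_priority_command(before["tid"]))
```

Keeping the command as text to paste, rather than automating the shell session, leaves the operator in control of each step on a system that is already under stress.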

References

<BUG:25868689> - HIGH CPU WHEN CONNECTED SO SDM

Attachments

This solution has no attachment