Eagle5 STP - E5ENETB IPSG Card rebooted with Obit: Module ath

Asset ID:	1-72-2384922.1
Update Date:	2018-04-12
Keywords:

Solution Type:	Problem Resolution Sure

Solution 2384922.1 : Eagle5 STP - E5ENETB IPSG Card rebooted with Obit: Module ath_vxw.c Line 3314 Class 0001

Applies to:

Oracle Communications EAGLE (Hardware) - Version EAGLE 46.5 and later
Information in this document applies to any platform.

Symptoms

E5-ENETB IPSG Card rebooted. After the reload the card was fully functional.

****18-03-25 02:28:37****

0223.0096 CARD 1203 IPSG Card has been reloaded

****18-03-25 02:27:31****

0218.0014 CARD 1203 IPSG Card is present

ASSY SN: 10212325147

****18-03-25 02:22:30****

0131.0013 ** CARD 1203 IPSG Card is isolated from the system

ASSY SN: 10212325147
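
The three alarm timestamps above bound the outage. As a minimal sketch, assuming the YY-MM-DD HH:MM:SS format shown in the excerpt (the variable and function names are illustrative and are not EAGLE identifiers), the time the card spent out of service can be computed as follows:

```python
from datetime import datetime

# Timestamp format used in the alarm excerpt above (YY-MM-DD HH:MM:SS).
FMT = "%y-%m-%d %H:%M:%S"

# Values copied from the excerpt; the names are illustrative only.
isolated = datetime.strptime("18-03-25 02:22:30", FMT)  # 0131.0013, card isolated
present = datetime.strptime("18-03-25 02:27:31", FMT)   # 0218.0014, card present
reloaded = datetime.strptime("18-03-25 02:28:37", FMT)  # 0223.0096, card reloaded


def outage_seconds(start, end):
    """Return the whole seconds elapsed between two alarm timestamps."""
    return int((end - start).total_seconds())


print(outage_seconds(isolated, present))   # isolation until the card is seen again
print(outage_seconds(isolated, reloaded))  # isolation until the reload completes
```

For this event the card was isolated for roughly five minutes before it was detected again, and the reload completed about one minute later.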

Upon reload the card generated Obit ath_vxw.c Line 3314 Class 0001 on the active MASP:

STH: Received a BOOT APPL-Obituary reply for restart

Card 1203 Module ath_vxw.c Line 3314 Class 0001

EFL=00000000 CS =0000 EIP=00000000 SS =0000

EAX=00000000 ECX=00000000 EDX=00000000 EBX=00000000

ESP=00000000 EBP=00000000 ESI=00000000 EDI=00000000

DS =0000 ES =0000 FS =0000 GS =0000

Stack Dump :

[SP+1E]=0000 [SP+16]=0000 [SP+0E]=0000 [SP+06]=0000

[SP+1C]=0000 [SP+14]=0000 [SP+0C]=0000 [SP+04]=0000

[SP+1A]=0000 [SP+12]=0000 [SP+0A]=0000 [SP+02]=0000

[SP+18]=0000 [SP+10]=0000 [SP+08]=0000 [SP+00]=0000

User Data Dump :

30 78 66 66 66 66 66 66 66 66 20 41 50 50 4c 20 0xffffffff.APPL.

57 61 74 63 68 64 6f 67 20 74 69 6d 65 6f 75 74 Watchdog.timeout

20 72 65 73 65 74 .reset

Report Date:18-03-25 Time:02:27:31

Changes

Cause

Start by analyzing the logs and searching for other symptoms on the node.

A single malfunction can have several causes. They may be internal (for example, bouncing DPCs) or external (for example, a fault on a switch port that caused the card to reboot as a recovery mechanism, or a router that is causing heavy retransmissions).

OBIT ath_vxw.c Class 0001 is related to the Application Trouble Handler. If it keeps recurring, it indicates a hardware fault. The User Data Dump section of the obit reads 0xffffffff.APPL.Watchdog.timeout.reset.
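
The text in the User Data Dump can be recovered directly from the hex bytes, which is useful when the printable column is truncated or missing. The following is a minimal sketch; the helper name decode_user_data is hypothetical and the byte values are copied from the obit above. The dump prints spaces as dots, so the decoded string contains spaces where the report shows ".".

```python
# Hex bytes copied from the three User Data Dump lines of the obit.
DUMP_LINES = [
    "30 78 66 66 66 66 66 66 66 66 20 41 50 50 4c 20",
    "57 61 74 63 68 64 6f 67 20 74 69 6d 65 6f 75 74",
    "20 72 65 73 65 74",
]


def decode_user_data(lines):
    """Join space-separated hex bytes and decode them as ASCII text."""
    raw = bytes.fromhex(" ".join(lines))
    return raw.decode("ascii")


message = decode_user_data(DUMP_LINES)
print(message)  # 0xffffffff APPL Watchdog timeout reset
```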

E5 cards use three types of watchdog mechanism: hardware watchdogs, low-priority starvation, and sanity.

In this case the obit points to the hardware watchdog.

Because a hardware watchdog resets the card without any software involvement, no post-failure data is available. Obits of this kind are typically difficult to debug because there is no post-mortem data. Proceed by gathering more data.

To check whether the card is discarding any messages, and to collect the related logs and status reports, run:

rept-stat-mfc:mode=stats:service=vsccp:sample=tot24h

rept-stat-mfc:mode=stats:service=mtp3:sample=tot24h

rtrv-trbl:loc=<active MASP 1113 or 1115>

rtrv-obit:loc=<active MASP 1113 or 1115>

rtrv-log:mode=full:dir=bkwd:num=500:outgrp=sys:slog=act

rtrv-log:mode=full:dir=bkwd:num=500:outgrp=card:slog=act

rept-stat-rtd

rept-stat-imt:mode=full

rept-imt-lvl1:sloc=1201:eloc=1115:r=summary

rept-stat-mux

rept-stat-db:display=all

rept-stat-ddb:display=all

rept-stat-card:loc=<card location>:mode=full

rtrv-card:loc=<card location>
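
To avoid typing errors when the same data has to be collected again after a later reboot, the list above can be generated from the two site-specific values. This is only a sketch: build_commands is a hypothetical helper, it produces the command text and does not send anything to the EAGLE terminal, and the rept-imt-lvl1 locations are kept exactly as given above.

```python
# Command templates copied from the list above. {masp} and {card} stand in
# for the <active MASP 1113 or 1115> and <card location> placeholders.
TEMPLATES = [
    "rept-stat-mfc:mode=stats:service=vsccp:sample=tot24h",
    "rept-stat-mfc:mode=stats:service=mtp3:sample=tot24h",
    "rtrv-trbl:loc={masp}",
    "rtrv-obit:loc={masp}",
    "rtrv-log:mode=full:dir=bkwd:num=500:outgrp=sys:slog=act",
    "rtrv-log:mode=full:dir=bkwd:num=500:outgrp=card:slog=act",
    "rept-stat-rtd",
    "rept-stat-imt:mode=full",
    "rept-imt-lvl1:sloc=1201:eloc=1115:r=summary",
    "rept-stat-mux",
    "rept-stat-db:display=all",
    "rept-stat-ddb:display=all",
    "rept-stat-card:loc={card}:mode=full",
    "rtrv-card:loc={card}",
]


def build_commands(masp, card):
    """Fill in the active MASP and card location for every template."""
    if masp not in ("1113", "1115"):
        raise ValueError("active MASP must be 1113 or 1115")
    return [t.format(masp=masp, card=card) for t in TEMPLATES]


# Example for the card in this note, assuming 1113 is the active MASP.
for command in build_commands("1113", "1203"):
    print(command)
```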

Solution

While gathering the logs, continue to monitor the card. If all the other logs are clean, keep monitoring for an extended period agreed with the customer.

If a second reboot occurs, consider a hard reset by re-seating the card.

If a third reboot occurs, replace the board with a spare as soon as possible. Monitor the card after the replacement to confirm normal operation.
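
The escalation steps above can be summarized as a small decision function, for example inside a monitoring script that counts reboots of the same card. This is a sketch only: next_action and the returned action labels are hypothetical, and how reboots are counted is left to the operator.

```python
def next_action(reboot_count):
    """Map the number of reboots seen on one card to the step in this note."""
    if reboot_count < 1:
        raise ValueError("reboot_count must be at least 1")
    if reboot_count == 1:
        # First reboot: gather the logs and keep monitoring.
        return "monitor"
    if reboot_count == 2:
        # Second reboot: consider a hard reset by re-seating the card.
        return "reseat"
    # Third reboot or later: replace the board with a spare.
    return "replace"


for count in (1, 2, 3):
    print(count, next_action(count))
```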

Attachments

This solution has no attachment