Brocade CP Set to Faulty Because CP ERROR Asserted - WARNING, SilkWorm48000, Detected termination of process fwd

Asset ID:	1-72-1990394.1
Update Date:	2015-03-18
Keywords:

Solution Type: Problem Resolution Sure

Solution 1990394.1 : Brocade CP Set to Faulty Because CP ERROR Asserted - WARNING, SilkWorm48000, Detected termination of process fwd

Applies to:

Brocade 48000 Director - Version All Versions and later
Brocade SAN Switch Hardware - Version All Versions and later
Information in this document applies to any platform.

Symptoms

The switch is a Brocade 48000 (48K) running FOS v6.4.3b. CP1 in slot 6 is currently active and CP0 in slot 5 is standby; both are working normally.

firmwareshow -v :
Slot Name Appl Primary/Secondary Versions Status
--------------------------------------------------------------------------
  5 CP0 FOS v6.4.3b STANDBY
  v6.4.3b
  6 CP1 FOS v6.4.3b ACTIVE *
  v6.4.3b

*** On 11 Feb, CP0 in slot 5 reported a CP error and was set to faulty because of a "Software Fault:Kernel Panic". After the reboot it became the standby CP.

2015/02/11-03:56:09, [FSSM-1003], 18787, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, HA State out of sync.
2015/02/11-03:56:42, [ISNS-1011], 18788, SLOT 6 | FID 128, INFO, WRO_CORE_48K_BLUE, iSNS Client Service is disabled.
2015/02/11-03:57:08, [EM-1033], 18789, SLOT 6 | CHASSIS, ERROR, SilkWorm48000, CP in Slot 5 set to faulty because CP ERROR asserted.
2015/02/11-03:57:46, [HAMK-1004], 18790, SLOT 6 | CHASSIS, INFO, SilkWorm48000, Resetting standby CP (double reset may occur)
2015/02/11-03:57:50, [EM-1047], 18791, SLOT 6 | CHASSIS, INFO, SilkWorm48000, CP in slot 5 not faulty, CP ERROR deasserted.
2015/02/11-03:58:05, [FW-1424], 18792, SLOT 6 | FID 128, WARNING, WRO_CORE_48K_BLUE, Switch status changed from HEALTHY to MARGINAL.
2015/02/11-03:58:05, [FW-1433], 18793, SLOT 6 | FID 128, WARNING, WRO_CORE_48K_BLUE, Switch status change contributing factor CP: CP non-redundant (Slot5/CP0) faulty.
2015/02/11-03:58:53, [HAM-1004], 18794, SLOT 5 | CHASSIS, INFO, SilkWorm48000, Processor rebooted - Software Fault:Kernel Panic
2015/02/11-03:59:01, [TRCE-1001], 18795, SLOT 5 | CHASSIS, WARNING, SilkWorm48000, Trace dump available (Slot 5)! (reason: PANIC)
2015/02/11-03:59:01, [TRCE-1004], 18796, SLOT 5 | CHASSIS, WARNING, SilkWorm48000, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.
2015/02/11-03:59:02, [TRCE-1001], 18797, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, Trace dump available (Slot 5)! (reason: PANIC)
2015/02/11-03:59:02, [TRCE-1004], 18798, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.
2015/02/11-03:59:38, [FSSM-1002], 18799, SLOT 6 | CHASSIS, INFO, SilkWorm48000, HA State is in sync.
2015/02/11-03:59:38, [FSSM-1002], 18800, SLOT 5 | CHASSIS, INFO, SilkWorm48000, HA State is in sync.
2015/02/11-03:59:39, [FW-1425], 18801, SLOT 6 | FID 128, INFO, WRO_CORE_48K_BLUE, Switch status changed from MARGINAL to HEALTHY.
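
When triaging an event like this, it can help to split the RASLog lines into fields so they can be filtered by message ID, slot or severity. The following is a minimal Python sketch. It assumes the comma-separated layout seen in the excerpt above (timestamp, message ID, sequence number, origin, severity, switch name, message text). The names RasEvent and parse_line are illustrative only and are not part of FOS.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class RasEvent(NamedTuple):
    when: datetime
    msg_id: str
    seq: int
    slot: Optional[int]
    severity: str
    switch: str
    text: str


# Layout assumed from the log excerpt in this article, not from a formal spec.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}), "
    r"\[(?P<id>[A-Z0-9]+-\d+)\], (?P<seq>\d+), "
    r"(?P<origin>[^,]*), (?P<sev>[A-Z]+), (?P<sw>[^,]+), (?P<text>.*)$"
)


def parse_line(line: str) -> Optional[RasEvent]:
    """Parse one RASLog line; return None for lines that do not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The origin field looks like "SLOT 6 | CHASSIS" or "SLOT 5 | FFDC | FID 128".
    slot_m = re.search(r"SLOT (\d+)", m["origin"])
    return RasEvent(
        when=datetime.strptime(m["ts"], "%Y/%m/%d-%H:%M:%S"),
        msg_id=m["id"],
        seq=int(m["seq"]),
        slot=int(slot_m.group(1)) if slot_m else None,
        severity=m["sev"],
        switch=m["sw"],
        text=m["text"],
    )


if __name__ == "__main__":
    sample = [
        "2015/02/11-03:57:08, [EM-1033], 18789, SLOT 6 | CHASSIS, ERROR, "
        "SilkWorm48000, CP in Slot 5 set to faulty because CP ERROR asserted.",
        "2015/02/11-03:59:38, [FSSM-1002], 18799, SLOT 6 | CHASSIS, INFO, "
        "SilkWorm48000, HA State is in sync.",
    ]
    events = [e for e in map(parse_line, sample) if e is not None]
    # Show only events more severe than INFO.
    for e in events:
        if e.severity != "INFO":
            print(e.when, e.msg_id, e.text)
```

Lines that do not follow this layout, such as the process table in the supportsave output, are skipped because parse_line returns None for them.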

Currently all blades are enabled:

slotshow -m :

Slot Blade Type ID Model Name Status
--------------------------------------------------
  1 SW BLADE 18 FC4-32 ENABLED
  2 SW BLADE 18 FC4-32 ENABLED
  3 SW BLADE 18 FC4-32 ENABLED
  4 UNKNOWN VACANT
  5 CP BLADE 16 CP256 ENABLED
  6 CP BLADE 16 CP256 ENABLED
  7 UNKNOWN VACANT
  8 SW BLADE 18 FC4-32 ENABLED
  9 SW BLADE 17 FC4-16 ENABLED
10 SW BLADE 18 FC4-32 ENABLED

The supportsave data collected from CP0 (slot 5) shows the error in more detail:

*** CORE FILES WARNING (02/11/15 - 03:00:18 ) ***
5376 KBytes in 1 file(s)
use "supportsave" command to upload

ASSERT - Failed expression: size == sizeof (fwDump_t), file = thresh_agent.c, line = 2422, user mode Call backtrace:
/fabos/lib/libutils.so.1.0(do_assert+0x250) [0xfed47dc]
fwd(fwDumpCB+0xa8) [0x100238f4]
/fabos/lib/libipc.so.1.0 [0xf3defc4]
/fabos/lib/libipc.so.1.0 [0xf3df140]
/fabos/lib/libgiot.so.1.0 [0xfe33524]
/lib/libpthread.so.0 [0xfe02470]
/lib/libc.so.6(clone+0x84) [0xf19a610]
do_assert: forcing segv to get core file

2015/02/11-03:56:09, [RAS-1005], 28090, SLOT 5 | FFDC | FID 128, WARNING, WRO_CORE_48K_BLUE, Software 'assert' error detected.
2015/02/11-03:56:09, [RAS-1001], 28091, SLOT 5 | CHASSIS, INFO, SilkWorm48000, First failure data capture (FFDC) event occurred.
2015/02/11-03:56:11, [TRCE-1001], 28092, SLOT 5 | CHASSIS, WARNING, SilkWorm48000, Trace dump available (Slot 5)! (reason: FFDC)
Detected termination of fwd:1205 (1)
exit code:11, exit sig:17, parent sig:0
2015/02/11-03:56:11, [TRCE-1004], 28093, SLOT 5 | CHASSIS, WARNING, SilkWorm48000, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.
== Dumping debug information ==
  PID VSZ RSS COMMAND
  1 1696 592 init
  2 0 0 ksoftirqd/0
  3 0 0 events/0
  4 0 0 khelper
  5 0 0 kthread
  27 0 0 kblockd/0
  56 0 0 pdflush
  59 0 0 aio/0
  58 0 0 kswapd0
  66 0 0 kseriod
  243 0 0 kjournald
  263 1676 412 wdtd
  335 0 0 kjournald
  508 2116 652 inetd
  521 2556 1092 kmsghandler
  535 1700 384 klogd
  536 1944 688 syslogd
  537 1808 620 crond
  566 0 0 RASLOGK_TH
  583 0 0 krscmon
  689 0 0 kwt_nb_thread
  770 0 0 module-99-th
  773 0 0 module-107-th
  776 0 0 module-146-th
  779 0 0 module-126-th
  782 0 0 module-162-th
  801 0 0 kmtracer
  932 20692 2708 ipadmd
  935 11488 1852 telnetmond
  936 47480 4908 hasmd
1000 0 0 FSSK_TH
1043 4988 1076 sshd
1044 1720 560 getty
1045 1720 560 getty
1054 29328 3676 pdmd
1049 0 0 ISCK_TH
1050 0 0 XCP_TX
1051 0 0 XCP_RX
1052 0 0 XCP_TX
1053 0 0 XCP_RX
1057 12384 1240 proxy
1058 73472 6200 raslogd
1059 33380 7708 traced
1060 46016 3664 bmd
1061 12300 2932 diagd
1056 0 0 RTEK_TH
1065 88852 4880 emd
1067 12724 3116 porttestd
1081 0 0 porttestd
1195 80924 6356 webd
1196 29472 3592 arrd
1197 124032 10696 cald
1198 73292 5760 essd
1199 73668 5704 evmd
1200 67176 7256 fabricd
1201 57088 5960 fcpd
1202 88184 4068 fdmid
1203 47784 4892 ficud
1204 65976 6640 fspfd
1206 72684 4740 rcsd
1207 63768 3708 ipsd
1208 107696 45792 iswitchd
1209 98316 6420 msd
1210 81348 13684 nsd
1211 30124 4164 pdmd
1214 85476 6932 psd
1215 43092 6552 rpcd
1216 80780 6824 secd
1217 73012 3864 authd
1236 116788 12460 snmpd
1237 97664 21836 trafd
1238 39940 3692 tsd
1239 99648 9048 zoned
1278 8332 2168 httpd.0
1281 105884 26444 0.weblinker.fcg
1294 68980 4784 icpd
1295 46528 3760 isnscd
1320 38744 3628 scpd
13663 0 0 pdflush
18033 2564 1080 sh
18034 2228 784 ps
2015/02/11-03:56:11, [KSWD-1002], 28094, SLOT 5 | FFDC | CHASSIS, WARNING, SilkWorm48000, Detected termination of process fwd:1205 <<<----!!!
2015/02/11-03:56:11, [HAM-1014], 28095, SLOT 5 | CHASSIS, CRITICAL, SilkWorm48000, Non restartable component (fw (pid=1205)) died.

2015/02/11-03:56:11, [FSSM-1003], 28096, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, HA State out of sync.
Time=2:56:14-716194 Total:0KB Used:0KB Free:0KB Buffers:0KB Cached:0KB
Time=2:56:14-716194 Total:0KB Used:0KB Free:0KB Buffers:0KB Cached:0KB

Cause

The panic occurred because the Fabric Watch daemon (fwd) terminated. FOS has a number of daemons that are termed 'non restartable': when one of them dies, the only way to recover it is to panic and reboot the operating system on that CP, hence the CP failover in this case.
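
The signature described above can be searched for in a saved log. The following is a minimal Python sketch, assuming the message wording shown in this article: a KSWD-1002 "Detected termination of process" line followed by a HAM-1014 "Non restartable component" line with the same pid. The function name find_fatal_daemon_exits is hypothetical, and the regular expressions are derived only from the two log lines in the Symptoms section, not from the FOS message reference.

```python
import re

# Patterns derived from the log lines in this article (an assumption, not a spec).
KSWD_RE = re.compile(
    r"\[KSWD-1002\].*Detected termination of process (\w+):(\d+)"
)
HAM_RE = re.compile(
    r"\[HAM-1014\].*Non restartable component \((\w+) \(pid=(\d+)\)\) died"
)


def find_fatal_daemon_exits(lines):
    """Return (process, pid) pairs for daemons whose termination was
    reported by KSWD-1002 and then flagged as non-restartable by HAM-1014."""
    terminated = {}  # pid (str) -> process name from the KSWD-1002 line
    fatal = []
    for line in lines:
        m = KSWD_RE.search(line)
        if m:
            terminated[m.group(2)] = m.group(1)
            continue
        m = HAM_RE.search(line)
        if m and m.group(2) in terminated:
            fatal.append((terminated[m.group(2)], int(m.group(2))))
    return fatal


if __name__ == "__main__":
    log = [
        "2015/02/11-03:56:11, [KSWD-1002], 28094, SLOT 5 | FFDC | CHASSIS, "
        "WARNING, SilkWorm48000, Detected termination of process fwd:1205",
        "2015/02/11-03:56:11, [HAM-1014], 28095, SLOT 5 | CHASSIS, CRITICAL, "
        "SilkWorm48000, Non restartable component (fw (pid=1205)) died.",
    ]
    for name, pid in find_fatal_daemon_exits(log):
        print(f"non-restartable daemon {name} (pid {pid}) died")
```

Matching on the pid rather than the name is deliberate: in the log above the two messages name the same process differently (fwd and fw), while the pid is identical.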

This is a known Brocade defect that is fixed in the 7.x code, but the fix is not being backported to 6.4.3x.
CP0 was reset because of this software bug and CP1 became the primary CP. I/O through the switch is not affected by CP reboots.

According to the Brocade escalation, the defect number for this problem is 538046, although it is an internal defect and may not appear in any release notes.

Solution

Brocade has identified this as a known issue that will not be fixed in FOS 6.4.3x, the latest FOS release supported on the Brocade 48000.

This defect is fixed in the 7.x code:

Upgrade to FOS 7.3.x.
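
To check a list of switches against this guidance, the version string reported by firmwareshow can be compared with the 7.x boundary. The following is a minimal Python sketch. It assumes version strings of the form v6.4.3b, and it treats every release below 7.0 as potentially exposed, which is a simplification: this article confirms the defect only for 6.4.3x. The function names fos_major and may_be_affected are hypothetical.

```python
import re

# Matches strings such as "v6.4.3b", "6.4.3b" or "v7.3.1" (assumed format).
VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)([a-z]\d*)?$")


def fos_major(version: str) -> int:
    """Return the major release number of a FOS version string."""
    m = VERSION_RE.match(version.strip())
    if m is None:
        raise ValueError(f"unrecognized FOS version: {version!r}")
    return int(m.group(1))


def may_be_affected(version: str) -> bool:
    """True if the release predates 7.x, where the fix is reported to be."""
    return fos_major(version) < 7


if __name__ == "__main__":
    for v in ("v6.4.3b", "v7.3.1"):
        state = "may be affected" if may_be_affected(v) else "contains the fix"
        print(v, state)
```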

Note: Oracle no longer sells the Brocade DCX (the next generation of directors after the Brocade 48K), only the 16 Gb FC switches, so if you are interested in a new director switch, you would need to talk to a Brocade sales representative.

Recommendation for Brocade 48000: leave the switch as it is. Although there is no fix for FOS 6.4.3 on the Brocade 48K, this problem is a corner case with a low probability of occurring, and its impact is transparent to applications.

Attachments

This solution has no attachment