Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1990394.1
Update Date:2015-03-18
Keywords:

Solution Type  Problem Resolution Sure

Solution  1990394.1 :   Brocade CP Set to Faulty Because CP ERROR Asserted - WARNING, SilkWorm48000, Detected termination of process fwd  


Related Items
  • Brocade 48000 Director
  • Brocade SAN Switch Hardware
Related Categories
  • PLA-Support>Sun Systems>DISK>Switch>SN-DK: Brocade Switch




In this Document
Symptoms
Cause
Solution


Created from <SR 3-10281225391>

Applies to:

Brocade 48000 Director - Version All Versions and later
Brocade SAN Switch Hardware - Version All Versions and later
Information in this document applies to any platform.

Symptoms

This is a Brocade 48000 (48K) running FOS v6.4.3b. CP1 in Slot 6 is currently active and CP0 in Slot 5 is standby; the switch is working normally.


firmwareshow -v :
Slot  Name  Appl  Primary/Secondary Versions  Status
--------------------------------------------------------------------------
  5   CP0   FOS   v6.4.3b                     STANDBY
                  v6.4.3b
  6   CP1   FOS   v6.4.3b                     ACTIVE *
                  v6.4.3b



*** On 11 Feb, CP0 in Slot 5 reported a CP error and was set to faulty because of a "Software Fault:Kernel Panic". After the reboot it became the standby CP.

2015/02/11-03:56:09, [FSSM-1003], 18787, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, HA State out of sync.
2015/02/11-03:56:42, [ISNS-1011], 18788, SLOT 6 | FID 128, INFO, WRO_CORE_48K_BLUE, iSNS Client Service is disabled.
2015/02/11-03:57:08, [EM-1033], 18789, SLOT 6 | CHASSIS, ERROR, SilkWorm48000, CP in Slot 5 set to faulty because CP ERROR asserted.
2015/02/11-03:57:46, [HAMK-1004], 18790, SLOT 6 | CHASSIS, INFO, SilkWorm48000, Resetting standby CP (double reset may occur)
2015/02/11-03:57:50, [EM-1047], 18791, SLOT 6 | CHASSIS, INFO, SilkWorm48000, CP in slot 5 not faulty, CP ERROR deasserted.
2015/02/11-03:58:05, [FW-1424], 18792, SLOT 6 | FID 128, WARNING, WRO_CORE_48K_BLUE, Switch status changed from HEALTHY to MARGINAL.
2015/02/11-03:58:05, [FW-1433], 18793, SLOT 6 | FID 128, WARNING, WRO_CORE_48K_BLUE, Switch status change contributing factor CP: CP non-redundant (Slot5/CP0) faulty.
2015/02/11-03:58:53, [HAM-1004], 18794, SLOT 5 | CHASSIS, INFO, SilkWorm48000, Processor rebooted - Software Fault:Kernel Panic
2015/02/11-03:59:01, [TRCE-1001], 18795, SLOT 5 | CHASSIS, WARNING, SilkWorm48000, Trace dump available (Slot 5)! (reason: PANIC)
2015/02/11-03:59:01, [TRCE-1004], 18796, SLOT 5 | CHASSIS, WARNING, SilkWorm48000, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.
2015/02/11-03:59:02, [TRCE-1001], 18797, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, Trace dump available (Slot 5)! (reason: PANIC)
2015/02/11-03:59:02, [TRCE-1004], 18798, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.
2015/02/11-03:59:38, [FSSM-1002], 18799, SLOT 6 | CHASSIS, INFO, SilkWorm48000, HA State is in sync.
2015/02/11-03:59:38, [FSSM-1002], 18800, SLOT 5 | CHASSIS, INFO, SilkWorm48000, HA State is in sync.
2015/02/11-03:59:39, [FW-1425], 18801, SLOT 6 | FID 128, INFO, WRO_CORE_48K_BLUE, Switch status changed from MARGINAL to HEALTHY.
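
When working through an excerpt like the one above, it can help to split each RASlog line into fields and compare timestamps. The following is a minimal sketch, assuming the comma-separated layout seen in this excerpt (timestamp, message ID, sequence number, slot/flags, severity, switch name, message text). The layout is inferred from these lines only, and the names `RaslogEntry` and `parse_raslog_line` are hypothetical helpers, not part of any Brocade tool.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Pattern inferred from the log excerpt in this article, not from a Brocade specification.
RASLOG_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}), "
    r"\[(?P<msg_id>[A-Z0-9]+-\d+)\], "
    r"(?P<seq>\d+), "
    r"(?P<flags>[^,]+), "
    r"(?P<severity>[A-Z]+), "
    r"(?P<switch>[^,]+), "
    r"(?P<text>.*)$"
)


class RaslogEntry(NamedTuple):
    timestamp: datetime
    msg_id: str
    seq: int
    slot: Optional[int]
    severity: str
    switch: str
    text: str


def parse_raslog_line(line: str) -> Optional[RaslogEntry]:
    """Parse one RASlog line; return None for lines that do not match the assumed layout."""
    m = RASLOG_RE.match(line.strip())
    if m is None:
        return None
    slot_match = re.search(r"SLOT (\d+)", m["flags"])
    return RaslogEntry(
        timestamp=datetime.strptime(m["ts"], "%Y/%m/%d-%H:%M:%S"),
        msg_id=m["msg_id"],
        seq=int(m["seq"]),
        slot=int(slot_match.group(1)) if slot_match else None,
        severity=m["severity"],
        switch=m["switch"],
        text=m["text"],  # may itself contain commas
    )


# Three lines copied from the excerpt above.
sample = [
    "2015/02/11-03:56:09, [FSSM-1003], 18787, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, HA State out of sync.",
    "2015/02/11-03:57:50, [EM-1047], 18791, SLOT 6 | CHASSIS, INFO, SilkWorm48000, CP in slot 5 not faulty, CP ERROR deasserted.",
    "2015/02/11-03:59:38, [FSSM-1002], 18799, SLOT 6 | CHASSIS, INFO, SilkWorm48000, HA State is in sync.",
]
entries = [e for e in map(parse_raslog_line, sample) if e is not None]
by_id = {e.msg_id: e for e in entries}

# Time between "HA State out of sync" and "HA State is in sync" as logged by Slot 6.
out_of_sync = by_id["FSSM-1002"].timestamp - by_id["FSSM-1003"].timestamp
print(out_of_sync.total_seconds())  # prints 209.0
```

For this incident the two CPs were out of sync for about three and a half minutes according to the Slot 6 log.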



Currently all blades are enabled:

slotshow -m :

Slot  Blade Type  ID  Model Name  Status
--------------------------------------------------
  1   SW BLADE    18  FC4-32      ENABLED
  2   SW BLADE    18  FC4-32      ENABLED
  3   SW BLADE    18  FC4-32      ENABLED
  4   UNKNOWN                     VACANT
  5   CP BLADE    16  CP256       ENABLED
  6   CP BLADE    16  CP256       ENABLED
  7   UNKNOWN                     VACANT
  8   SW BLADE    18  FC4-32      ENABLED
  9   SW BLADE    17  FC4-16      ENABLED
 10   SW BLADE    18  FC4-32      ENABLED




The supportsave data collected from CP0 (Slot 5) shows the error in more detail:

*** CORE FILES WARNING (02/11/15 - 03:00:18 ) ***
5376 KBytes in 1 file(s)
use "supportsave" command to upload

ASSERT - Failed expression: size == sizeof (fwDump_t), file = thresh_agent.c, line = 2422, user mode Call backtrace:
/fabos/lib/libutils.so.1.0(do_assert+0x250) [0xfed47dc]
fwd(fwDumpCB+0xa8) [0x100238f4]
/fabos/lib/libipc.so.1.0 [0xf3defc4]
/fabos/lib/libipc.so.1.0 [0xf3df140]
/fabos/lib/libgiot.so.1.0 [0xfe33524]
/lib/libpthread.so.0 [0xfe02470]
/lib/libc.so.6(clone+0x84) [0xf19a610]
do_assert: forcing segv to get core file

2015/02/11-03:56:09, [RAS-1005], 28090, SLOT 5 | FFDC | FID 128, WARNING, WRO_CORE_48K_BLUE, Software 'assert' error detected.
2015/02/11-03:56:09, [RAS-1001], 28091, SLOT 5 | CHASSIS, INFO, SilkWorm48000, First failure data capture (FFDC) event occurred.
2015/02/11-03:56:11, [TRCE-1001], 28092, SLOT 5 | CHASSIS, WARNING, SilkWorm48000, Trace dump available (Slot 5)! (reason: FFDC)
Detected termination of fwd:1205 (1)
exit code:11, exit sig:17, parent sig:0
2015/02/11-03:56:11, [TRCE-1004], 28093, SLOT 5 | CHASSIS, WARNING, SilkWorm48000, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.
== Dumping debug information ==
  PID VSZ RSS COMMAND
  1 1696 592 init
  2 0 0 ksoftirqd/0
  3 0 0 events/0
  4 0 0 khelper
  5 0 0 kthread
  27 0 0 kblockd/0
  56 0 0 pdflush
  59 0 0 aio/0
  58 0 0 kswapd0
  66 0 0 kseriod
  243 0 0 kjournald
  263 1676 412 wdtd
  335 0 0 kjournald
  508 2116 652 inetd
  521 2556 1092 kmsghandler
  535 1700 384 klogd
  536 1944 688 syslogd
  537 1808 620 crond
  566 0 0 RASLOGK_TH
  583 0 0 krscmon
  689 0 0 kwt_nb_thread
  770 0 0 module-99-th
  773 0 0 module-107-th
  776 0 0 module-146-th
  779 0 0 module-126-th
  782 0 0 module-162-th
  801 0 0 kmtracer
  932 20692 2708 ipadmd
  935 11488 1852 telnetmond
  936 47480 4908 hasmd
 1000 0 0 FSSK_TH
 1043 4988 1076 sshd
 1044 1720 560 getty
 1045 1720 560 getty
 1054 29328 3676 pdmd
 1049 0 0 ISCK_TH
 1050 0 0 XCP_TX
 1051 0 0 XCP_RX
 1052 0 0 XCP_TX
 1053 0 0 XCP_RX
 1057 12384 1240 proxy
 1058 73472 6200 raslogd
 1059 33380 7708 traced
 1060 46016 3664 bmd
 1061 12300 2932 diagd
 1056 0 0 RTEK_TH
 1065 88852 4880 emd
 1067 12724 3116 porttestd
 1081 0 0 porttestd
 1195 80924 6356 webd
 1196 29472 3592 arrd
 1197 124032 10696 cald
 1198 73292 5760 essd
 1199 73668 5704 evmd
 1200 67176 7256 fabricd
 1201 57088 5960 fcpd
 1202 88184 4068 fdmid
 1203 47784 4892 ficud
 1204 65976 6640 fspfd
 1206 72684 4740 rcsd
 1207 63768 3708 ipsd
 1208 107696 45792 iswitchd
 1209 98316 6420 msd
 1210 81348 13684 nsd
 1211 30124 4164 pdmd
 1214 85476 6932 psd
 1215 43092 6552 rpcd
 1216 80780 6824 secd
 1217 73012 3864 authd
 1236 116788 12460 snmpd
 1237 97664 21836 trafd
 1238 39940 3692 tsd
 1239 99648 9048 zoned
 1278 8332 2168 httpd.0
 1281 105884 26444 0.weblinker.fcg
 1294 68980 4784 icpd
 1295 46528 3760 isnscd
 1320 38744 3628 scpd
13663 0 0 pdflush
18033 2564 1080 sh
18034 2228 784 ps
2015/02/11-03:56:11, [KSWD-1002], 28094, SLOT 5 | FFDC | CHASSIS, WARNING, SilkWorm48000, Detected termination of process fwd:1205  <<<----!!!
2015/02/11-03:56:11, [HAM-1014], 28095, SLOT 5 | CHASSIS, CRITICAL, SilkWorm48000, Non restartable component (fw (pid=1205)) died.

2015/02/11-03:56:11, [FSSM-1003], 28096, SLOT 6 | CHASSIS, WARNING, SilkWorm48000, HA State out of sync.
Time=2:56:14-716194 Total:0KB Used:0KB Free:0KB Buffers:0KB Cached:0KB
Time=2:56:14-716194 Total:0KB Used:0KB Free:0KB Buffers:0KB Cached:0KB

 

Cause

The panic occurred because the Fabric Watch daemon (fwd) terminated. FOS has a number of daemons that are termed 'non restartable': when one of them dies, the only way to recover it is to panic and reboot the operating system on that CP, hence the CP failover in this case.
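
The signature of this failure in the log is a KSWD-1002 "Detected termination of process" message followed by a HAM-1014 "Non restartable component ... died" message for the same PID. A minimal sketch of how one might scan saved log text for that pairing is shown below. The regular expressions are derived only from the two messages quoted in this article, and `find_fatal_daemon_deaths` is a hypothetical helper name.

```python
import re

# Both patterns are inferred from the messages quoted in this article.
KSWD_RE = re.compile(r"\[KSWD-1002\].*Detected termination of process (\w+):(\d+)")
HAM_RE = re.compile(r"\[HAM-1014\].*Non restartable component \((\w+) \(pid=(\d+)\)\) died")


def find_fatal_daemon_deaths(lines):
    """Return (process, pid) pairs whose termination was followed by HAM-1014 for the same pid."""
    terminated = {}  # pid -> process name reported by KSWD-1002
    fatal = []
    for line in lines:
        m = KSWD_RE.search(line)
        if m:
            terminated[int(m.group(2))] = m.group(1)
            continue
        m = HAM_RE.search(line)
        if m:
            pid = int(m.group(2))
            if pid in terminated:
                fatal.append((terminated[pid], pid))
    return fatal


# Lines copied from the supportsave output in the Symptoms section.
log_lines = [
    "2015/02/11-03:56:11, [KSWD-1002], 28094, SLOT 5 | FFDC | CHASSIS, WARNING, SilkWorm48000, Detected termination of process fwd:1205",
    "2015/02/11-03:56:11, [HAM-1014], 28095, SLOT 5 | CHASSIS, CRITICAL, SilkWorm48000, Non restartable component (fw (pid=1205)) died.",
]
print(find_fatal_daemon_deaths(log_lines))  # prints [('fwd', 1205)]
```

Matching on the PID rather than the name is deliberate: in this log KSWD-1002 reports the process as `fwd` while HAM-1014 reports the component as `fw`.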

This is a known Brocade defect. It is fixed in the 7.x code but is not being backported to 6.4.3x.
CP0 was reset because of this software bug and CP1 became primary. I/O through the switch is not affected by CP reboots.
 
According to the Brocade escalation, this problem is tracked as defect 538046. It is an internal defect and may not appear in any release notes.
 

Solution

Brocade has identified this as a known issue that will not be fixed in FOS 6.4.3x, the latest FOS version supported on the Brocade 48000.

This defect is fixed in the 7.x code, so the fix is only available by upgrading to FOS 7.3.x on hardware that supports it.
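
To check a fleet of switches against this statement, the version string reported by `firmwareshow` can be compared with the 7.x boundary. The sketch below assumes version strings of the form `v6.4.3b` as shown in the Symptoms section. This article does not say which 7.x release first contains the fix, so treating every release with major version 7 or higher as fixed is an assumption, and `parse_fos_version` and `has_fwd_fix` are hypothetical helper names.

```python
import re


def parse_fos_version(version: str):
    """Split a FOS version string such as 'v6.4.3b' into (6, 4, 3, 'b')."""
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)([a-z]\d*)?", version.strip())
    if m is None:
        raise ValueError(f"unrecognised FOS version: {version!r}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    return major, minor, patch, m.group(4) or ""


def has_fwd_fix(version: str) -> bool:
    # Assumption: any release with major version >= 7 carries the fix.
    # The article only says "fixed in the 7.x code" and recommends 7.3.x.
    return parse_fos_version(version)[0] >= 7


print(has_fwd_fix("v6.4.3b"))  # prints False
print(has_fwd_fix("v7.3.0"))   # prints True
```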


Note: Oracle no longer sells the Brocade DCX (the next generation of directors after the Brocade 48K), only the 16 Gb FC switches. If you are interested in a new director switch, you will need to contact a Brocade sales representative.

Recommendation for the Brocade 48000: leave the switch as it is. Although there is no fix for FOS 6.4.3 on the Brocade 48K,
this problem is a corner case with a low probability of occurring, and its impact is transparent to applications.
 


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.