Brocade Switch reboot, reason: Unknown.

Asset ID:	1-72-1018712.1
Update Date:	2015-06-15
Keywords:

Solution Type Problem Resolution Sure

Solution 1018712.1 : Brocade Switch reboot, reason: Unknown.

Applies to:

Brocade 200E Switch - Version All Versions to All Versions [Release All Releases]
Brocade 4100 Switch - Version All Versions to All Versions [Release All Releases]
Brocade 48000 Director - Version All Versions to All Versions [Release All Releases]
All Platforms

Symptoms

Brocade Switch reboot, reason: Unknown.

Cause

Brocade Switch reboot, reason: Unknown.

Solution

Symptoms
Error message(s) in log indicating a switch reboot but with little diagnostic assistance.

Resolution
Depending on the cause, the reboot may or may not be intentional.

Work through the following troubleshooting steps to identify the issue, isolate the cause if possible, and assess whether a firmware (FW) upgrade is needed.

Additional Information
Brocade Switch reboot, reason: Unknown.

Brocade switches running FabOS 4.x and 5.x can log this message as a result of a switch reboot. These messages can be viewed/displayed by issuing an 'errshow' or 'errdump' at the CLI.

An example recorded on a 12K switch :-

 Error 02
--------
0x26e (fabos): Sep 28 03:36:56
Switch: 0, Info HAM-REBOOT_REASON, 4, Switch reboot, reason: Unknown
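When a large errdump capture has to be searched, the entry above can be picked out with a short script. The following is a minimal sketch, assuming the older message layout shown in the example ("Switch: N, Severity MODULE-CODE, level, text"); the function name and the regular expression's field names are illustrative and are not part of FabOS.

```python
import re

# Pattern for the message line layout shown in the example above.
# The group names (switch, severity, msg_id, level, text) are labels
# chosen for this sketch, not official field names.
ENTRY_RE = re.compile(
    r"^Switch:\s*(?P<switch>\d+),\s*(?P<severity>\w+)\s+"
    r"(?P<msg_id>[A-Z0-9_]+-[A-Z0-9_]+),\s*(?P<level>\d+),\s*(?P<text>.*)$"
)


def find_reboot_reasons(errdump_text):
    """Return the reason string of every HAM-REBOOT_REASON entry found."""
    reasons = []
    for line in errdump_text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match and match.group("msg_id") == "HAM-REBOOT_REASON":
            # Message text looks like "Switch reboot, reason: Unknown".
            _, _, reason = match.group("text").partition("reason:")
            reasons.append(reason.strip())
    return reasons


# The entry quoted in this document.
SAMPLE = """\
 Error 02
--------
0x26e (fabos): Sep 28 03:36:56
Switch: 0, Info HAM-REBOOT_REASON, 4, Switch reboot, reason: Unknown
"""

print(find_reboot_reasons(SAMPLE))
```

Any reason other than 'Unknown' returned by such a search would already narrow the investigation; 'Unknown' means the steps below still apply.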

Because the reason is 'Unknown', further investigation is required. There can be a number of causes, including :-

Intentional reboots such as those occurring during a Firmware Upgrade or 'reboot' command.
Loss of power.
Software panics.
Hardware Issues.

Collecting information is important. The more data available the better the chance of identifying the cause.

On FabOS 4.4.x/5.x.x and above, collect a 'supportsave'
On FabOS prior to 4.4.x, collect a 'supportshow' and a 'pdshow'.
NB FabOS 3.x and 2.x are outside the scope of this document.
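The firmware rule above can be expressed as a small lookup, which may be handy in a data collection checklist or script. This is only a sketch of the rule as stated in this document; the function name and the version string format (for example 'v4.4.0b', as reported in pdshow output) are assumptions.

```python
import re


def commands_for(fabos_version):
    """Map a FabOS version string such as 'v4.4.0b' to the commands to collect."""
    match = re.match(r"v?(\d+)\.(\d+)", fabos_version.strip())
    if not match:
        raise ValueError(f"unrecognised version string: {fabos_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    if major < 4:
        # FabOS 2.x and 3.x are not covered by this document.
        raise ValueError("FabOS 2.x and 3.x are outside the scope of this document")
    if (major, minor) >= (4, 4):
        return ["supportsave"]
    return ["supportshow", "pdshow"]


print(commands_for("v4.4.0b"))
print(commands_for("v4.2.0"))
```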

On blade switches (such as the 12K, 24K and 48K), there are 2 CP processor cards. 'High Availability' is achieved by 1 card being active and the other standby. Each CP has a Linux kernel and an associated flash memory filesystem. Information about the cause of the reboot can be located on either CP, particularly if the active/standby roles swapped between the 2 cards during, or as a result of, the switch reboot.

Obtain data for BOTH CP cards. It is important to collect the data from the CP cards and not from the logical switches, because collecting from the 2 logical switches yields only 1 set of CP data. To do this, telnet to each CP in turn.

NOTE: On a standby CP, 'supportshow' will not run. Obtain 'supportsave', 'errdump' and 'pdshow' on the standby CP, and 'supportsave' and 'pdshow' on the active CP.

The console output is also invaluable. If the serial console port is connected to a device then collect the log from that device. Depending on the switch firmware level, the console output may also be collected by a supportsave.

Analyse the data

1: Review the errdump output for other events that day.
Compare against the Brocade "System Error Message Reference Manual". Certain errors are well documented and may give an indication of what the switch was doing.

For example, a Firmware download is an intentional activity which by design will cause the switch to reboot with this type of error log.

   Switch: 0, Warning SULIB-FWDL_START, 3, Firmwaredownload command has started.

2: Did the switch lose power ?
This sounds obvious, but if the switch loses power, it will reboot. The switch will not log a simultaneous loss of all supplies, but there may be evidence in the errlog of previous power issues on a single source.

Examples:-

2005/10/03-02:38:04, [EM-1034], 2,, ERROR, Silkworm3900, PS 1 set to faulty, rc=2000e
2005/10/03-02:38:30, [FW-1010], 3,, WARNING, swd77, Env Power Supply 1, is below low boundary(High=0, Low=1). Current value is 0 (1 OK/0 FAULTY).

Ensure the switch has 2 independent power paths and that the customer has no record of power events for the date and time of the reboot.
Ensure that all the supplies are currently up with 'psshow'.
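To check a long errlog for earlier power events, entries in the comma-separated layout shown in the examples can be filtered by message ID and timestamp. The sketch below assumes that layout; the set of message IDs contains only the two IDs quoted above and is not an exhaustive list of power-related messages, and the reboot time passed in is a hypothetical value for illustration.

```python
import re
from datetime import datetime

# Layout taken from the two example entries above:
#   timestamp, [ID], sequence, <field>, severity, switch name, text
LINE_RE = re.compile(
    r"^(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}),\s*\[(?P<msg_id>[A-Z]+-\d+)\],"
    r"\s*(?P<seq>\d+),[^,]*,\s*(?P<severity>\w+),\s*(?P<switch>[^,]+),\s*(?P<text>.*)$"
)

# Only the IDs quoted in this document; NOT an exhaustive list.
POWER_IDS = {"EM-1034", "FW-1010"}


def power_events_before(errlog_text, reboot_time):
    """Return (timestamp, message ID, text) for power entries logged up to reboot_time."""
    events = []
    for line in errlog_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("msg_id") not in POWER_IDS:
            continue
        stamp = datetime.strptime(match.group("stamp"), "%Y/%m/%d-%H:%M:%S")
        if stamp <= reboot_time:
            events.append((stamp, match.group("msg_id"), match.group("text")))
    return events


SAMPLE = """\
2005/10/03-02:38:04, [EM-1034], 2,, ERROR, Silkworm3900, PS 1 set to faulty, rc=2000e
2005/10/03-02:38:30, [FW-1010], 3,, WARNING, swd77, Env Power Supply 1, is below low boundary(High=0, Low=1). Current value is 0 (1 OK/0 FAULTY).
"""

# Hypothetical reboot time, chosen only to exercise the filter.
for stamp, msg_id, text in power_events_before(SAMPLE, datetime(2005, 10, 3, 2, 40, 0)):
    print(stamp, msg_id, text)
```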

3: Was the switch intentionally rebooted at the CLI/WebTools ?
Review the console log (if available) for console/serial port activity. Telnet CLI activity is not audited; however, if this is considered a possible cause, 'trackChangesSet' will log an entry on each successful CLI login. See 'help trackChangesSet'.

4: Is there evidence of a panic ?

The existence of a panic dump file indicates a kernel panic at some previous time. If a new panic dump (pd) file is generated, an error log entry should alert you to this following the reboot.

[PDTR-1001], 12,, INFO, ?, pdcheck: info: found new pd

If the date of this log entry matches the reboot date then it's likely the switch did panic.

Run the 'pdshow' command. A supportshow from 4.4.x onwards will include 'pdshow'. Without any arguments, this will use the latest panic dump file. The output contains valuable information which can be cross-matched to bugs and resolved issues. Sections of interest include PD_MISC and CONSOLE_LOG.

example pdshow sections:-

_______________________********________________________
*   File   :/core_files/panic/core.pd1127141983       *
*   SECTION:PD_MISC                                   *
-----------------------********------------------------
Section=Startup time: Mon Sep 19 13:59:54 GMT 2005
Kernel=     2.4.19
Fabric OS=  v4.4.0b

*   SECTION:CONSOLE_LOG                               *
-----------------------********------------------------
Out of Memory: Killed process 854 (snmpd). VM size = 84276 KB, Runtime = 58527 .
pid: 684, process: snmpd
flags: 7428, pending sigs: 0
exit_code: 9, exit_sig: 17
parent_sig: 0
kSWD: Prepare to reboot/failover in a moment..

*   SECTION:CONSOLE_LOG                               *
-----------------------********------------------------
pid: 688, process: cald0
flags: 1348, pending sigs: 0
exit_code: 11, exit_sig: 17
parent_sig: 0
kSWD: Prepare to reboot/failover in a moment...

The process names in the above examples (snmpd, cald) are a good indication of the source of the problem in each case.

In some instances, the text following kSWD (kernel software watchdog) will detail the event that caused a panic. The kSWD function alerts the High Availability Manager (HAM) to a hung or crashed daemon, which in turn is likely to cause a CP failover or reboot.

   "kSWD: Detected Unexpected Termination of: nsd"
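The process names can be pulled out of a CONSOLE_LOG section automatically, which gives the search strings to use against bug databases and release notes. The sketch below assumes the two line shapes quoted in this document ("pid: N, process: NAME" and "kSWD: Detected Unexpected Termination of: NAME"); other firmware levels may print different text, and the function name is illustrative.

```python
import re

# Line shapes taken from the CONSOLE_LOG examples in this document.
PROCESS_RE = re.compile(r"^pid:\s*(?P<pid>\d+),\s*process:\s*(?P<name>\S+)")
KSWD_TERM_RE = re.compile(
    r"kSWD: Detected Unexpected Termination of:\s*(?P<name>[^\s\"]+)"
)


def suspect_processes(console_log):
    """Return the process names implicated by a CONSOLE_LOG section, in order, without duplicates."""
    names = []
    for line in console_log.splitlines():
        line = line.strip()
        match = PROCESS_RE.match(line) or KSWD_TERM_RE.search(line)
        if match and match.group("name") not in names:
            names.append(match.group("name"))
    return names


# Lines from two separate examples above, combined here for illustration only.
SAMPLE = """\
pid: 688, process: cald0
flags: 1348, pending sigs: 0
exit_code: 11, exit_sig: 17
parent_sig: 0
kSWD: Prepare to reboot/failover in a moment...
kSWD: Detected Unexpected Termination of: nsd
"""

print(suspect_processes(SAMPLE))
```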

The supportshow output may have a portlogdump (software port event log) for the time of the panic. If there are entries at or around the panic time, then these can be further decoded using the Brocade 'supporttool' - this can be obtained from http://partner.brocade.com/content/For%20Technical%20Professionals/Downloads/Unsupported%20Tools%20Library%20Partner/browseScripts.jsp?cat_id=66669465

Search SunSolve for Brocade bugs whose panic text matches YOUR text (not the examples above!). Search Brocade FW release notes for resolved or identified issues that were not recorded as Sun bugs. Review SunAlerts for symptoms that match the issue.

5: Hardware issues.

Review the errdump output for HW issues. Compare against the Brocade "System Error Message Reference Manual".
Review the current status of the HW as reported via supportshow/supportsave.
Look for any faulty blades (slotshow, frushow).
Review the potential to run offline diagnostics on the switch. POST (following a power cycle or "reboot") will run a set of tests for a small number of iterations. Several tests exist for exercising individual parts of the Brocade switch; for more information see the "Brocade Fabric OS Command Reference Manual". Remember that these are offline tests requiring downtime.

If there is no match in the above, then further investigation by engineering may be required.
In addition to the information already collected, the panic dump image itself may be required. Use 'savecore' to ftp the image from the switch.

Good sources of reference once you have obtained information are:-

SunAlerts for matching on alert issues.
help pdshow (CLI online help/man pages)
Brocade Diagnostic/Reference manuals
Brocade Fabric OS Procedures Guide
Brocade Fabric OS Command Reference Manual
Brocade release notes for matching text strings from pdshow outputs.
Sun bugs for matching text strings from pdshow outputs.

Note: In a situation where the FC switch has only one power supply and the only error logged is like this:
[HAM-1004], 478, CHASSIS, INFO, Brocade300, Processor rebooted - Reset

The reason code given indicates that the power has been removed. This can be caused manually (by a person) or by a HW defect. It is not a SW (FOS) issue; otherwise there would be RAS events leading up to those indicating the power-up and boot, and there are none, which again reinforces that power was removed and then re-applied. Since this type of switch has only one power supply, removing and then re-applying power results in a cold boot.

There are no diagnostics that indicate how the power was removed (mainly because the switch cannot perform diagnostics without any power), but since the switch is back up again, a permanent hardware defect seems unlikely. However, if a similar event happens again, consider replacing the unit.

Product
Brocade 48000 Director
Brocade 4100 Switch
Brocade SilkWorm 3900 Switch
Brocade SilkWorm 3850 Fabric Switch
Brocade SilkWorm 3250 Fabric Switch
Brocade SilkWorm 24000 Director
Brocade 200E Switch
Brocade 12000 2 GB Switch

Internal Comments
Brocade Switch

With a dual processor switch/director (12K, 24K, 48K), _IF_ the director had a panic then the panic may have occurred on just one CP board or on both. In the case of a 12K/24K/48K panic that only occurred on a single CP, the director will 'failover' to the standby CP, so the panic data is now on the new 'standby' CP. When a 'supportsave' is run on the standby CP as part of the data collection exercise, it will not collect a 'supportshow' from the standby CP, as a supportshow cannot be run on a standby CP. However, the panic dump images will be collected, and possibly a CONSOLE log, which can be used to investigate the cause.

Note: The actual file and directory names within a supportsave collection as well as UPPER or lower case selection are firmware dependent.

Any panic dump images within the supportsave collection will likely be located under '*-core/panic/' on the FTP server. Should a pdshow output be missing from the supportsave/supportshow collection, it is possible to ftp these to a lab switch of the same type and firmware level for analysis, and run pdshow from the lab switch.

There is a spreadsheet that can help identify the approximate date and time of a panic from a panic dump file. This can help confirm that you are looking at the correct panic dump file. The spreadsheet requires the 'System Startup Time' and the 'Epoch Date'. The 'System Startup' time is contained within the core.pdNNNNNNN file, where NNNNNNN is the 'Epoch Date'.

For example, the following panic dump file has a System Startup of 05/30/04 10:04:28 PM and the 'Epoch' date/time is 1128905472

my_host:->strings core.pd1128905472 | grep Startup

Section=Startup time: Sun May 30 22:04:28 UTC 2004

The spreadsheet calculates that the reboot occurred on 10/10/05 at around 12:51:12 AM.
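If the spreadsheet is not to hand, the date in this example can be reproduced by reading the digits in the file name as seconds since the Unix epoch in UTC, which gives the same 10/10/05 12:51:12 AM result. The sketch below does that and also reports the uptime between system startup and the panic. It assumes the startup line is in UTC/GMT as in the examples in this document; the function names are illustrative, and the 'Estimated Jiffie Reboot Date' projection is not reproduced because its formula is not given here.

```python
import re
from datetime import datetime, timezone

# Month abbreviations as they appear in the 'Startup time' line.
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def panic_time_from_filename(filename):
    """Read the digits after 'core.pd' as seconds since the Unix epoch (UTC)."""
    match = re.search(r"core\.pd(\d+)$", filename)
    if not match:
        raise ValueError(f"not a panic dump file name: {filename!r}")
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


def parse_startup_line(line):
    """Parse e.g. 'Section=Startup time: Sun May 30 22:04:28 UTC 2004' (zone assumed UTC/GMT)."""
    match = re.search(
        r"Startup time:\s*\w{3}\s+(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+\w+\s+(\d{4})",
        line)
    if not match:
        raise ValueError(f"no startup time in: {line!r}")
    month, day, hour, minute, second, year = match.groups()
    return datetime(int(year), MONTHS[month], int(day),
                    int(hour), int(minute), int(second), tzinfo=timezone.utc)


panic = panic_time_from_filename("core.pd1128905472")
startup = parse_startup_line("Section=Startup time: Sun May 30 22:04:28 UTC 2004")

print(panic.strftime("%Y-%m-%d %H:%M:%S UTC"))  # 2005-10-10 00:51:12 UTC
print("uptime before panic:", panic - startup)
```

Comparing the computed panic time with the reboot time in the error log is a quick way to confirm that the panic dump file belongs to the reboot under investigation.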

There is also a projected 'Estimated Jiffie Reboot Date' if the panic occurred due to SunAlert 101607 <Document: 1000167.1>

The spreadsheet can be obtained from http://service.uk/~briana/jiffie_calc.xls

Brocade FW release notes for Sun-qualified versions are contained in the patch release notes (e.g. patch 115360) or are available via the http://partner.brocade.com website.

Newer FW release notes may exist but may not be freely available.

brocade, reboot, reason, unknown, panic
Previously Published As
82788

Change History
Date: 2007-09-25
User Name: 97961
Action: Approved
Comment: Publishing. No further edits required.
Version: 9
Date: 2007-09-25
User Name: 97961
Action: Accept
Comment:
Version: 0
Date: 2007-09-25
User Name: 100761
Action: Approved
Comment: Fixed a typo and the hyper link, sun alert was mentioned as 10167 instead of 101607.
changes made to line :
There is also a projected 'Estimated Jiffie Reboot Date' if the panic occurred due to SunAlert 101607
Rest of the doc looks fine. Publish it.
Version: 0
Date: 2007-09-24
User Name: 100761
Action: Accept
Comment:
Version: 0