Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1018712.1 : Brocade Switch reboot, reason: Unknown.

Previously Published As: 230424

Applies to:
Brocade 200E Switch - Version All Versions to All Versions [Release All Releases]
Brocade 4100 Switch - Version All Versions to All Versions [Release All Releases]
Brocade 48000 Director - Version All Versions to All Versions [Release All Releases]
All Platforms

Symptoms
Brocade Switch reboot, reason: Unknown.

Cause
Brocade Switch reboot, reason: Unknown.

Solution
Brocade switches running FabOS 4.x and 5.x can log this message as a result of a switch reboot. These messages can be displayed by issuing 'errshow' or 'errdump' at the CLI.

An example recorded on a 12K switch:

    Error 02
    --------
    0x26e (fabos): Sep 28 03:36:56
    Switch: 0, Info HAM-REBOOT_REASON, 4, Switch reboot, reason: Unknown

Because the reason is 'Unknown', further investigation is required. There can be a number of causes, including:
Collecting information is important: the more data available, the better the chance of identifying the cause.

On FabOS 4.4.x/5.x.x and above, collect a 'supportsave'.

On blade switches (such as the 12K, 24K and 48K) there are 2 CP processor cards. 'High Availability' is achieved by 1 card being active and the other standby. Each CP has a Linux kernel and an associated flash memory filesystem. Information pertaining to the cause of the reboot can be located on either CP, particularly if the active/standby relationship swapped between the 2 during, or as a result of, the switch reboot.

Obtain data for BOTH CP cards. Collect the data from the CP cards and not from the logical switches, because collecting from the 2 logical switches yields only 1 set of CP data. To do this, telnet to each CP in turn.

NOTE: On a standby CP, 'supportshow' will not run. Obtain 'supportsave', 'errdump' and 'pdshow' on the standby CP, and 'supportsave' and 'pdshow' on the active CP.

The console output is also invaluable. If the serial console port is connected to a device, collect the log from that device. Depending on the switch firmware level, the console output may also be collected by a supportsave.

Analyse the data

1: Review the errdump output for other events that day. For example, a firmware download is an intentional activity which by design will cause the switch to reboot with this type of error log.

    Switch: 0, Warning SULIB-FWDL_START, 3, Firmwaredownload command has started.

2: Did the switch lose power? Examples:

    2005/10/03-02:38:04, [EM-1034], 2,, ERROR, Silkworm3900, PS 1 set to faulty, rc=2000e
    2005/10/03-02:38:30, [FW-1010], 3,, WARNING, swd77, Env Power Supply 1, is below low boundary(High=0, Low=1). Current value is 0 (1 OK/0 FAULTY).

Ensure the switch has 2 independent power paths and that the customer has no record of power events for the date and time of the reboot.

3: Was the switch intentionally rebooted at the CLI/WebTools?
4: Is there evidence of a panic?

The existence of a panic dump (pd) file indicates a kernel panic at some previous time. If a new pd file is generated, an error log entry should alert you to this following the reboot:

    [PDTR-1001], 12,, INFO, ?, pdcheck: info: found new pd

If the date of this log entry matches the reboot date, then it is likely that the switch did panic.

Run the 'pdshow' command. A supportshow from 4.4.x onwards will include 'pdshow'. Without any arguments, pdshow uses the latest panic dump file. Its output contains valuable information which can be cross-matched to bugs and resolved issues. There are a number of sections of interest, for example PD_MISC and CONSOLE_LOG.

Example pdshow sections:

    _______________________********________________________
    * File :/core_files/panic/core.pd1127141983 *
    * SECTION:PD_MISC *
    -----------------------********------------------------
    Section=Startup time: Mon Sep 19 13:59:54 GMT 2005
    Kernel= 2.4.19
    Fabric OS= v4.4.0b

    * SECTION:CONSOLE_LOG *
    -----------------------********------------------------
    Out of Memory: Killed process 854 (snmpd).
    VM size = 84276 KB, Runtime = 58527 .
    pid: 684, process: snmpd
    flags: 7428, pending sigs: 0
    exit_code: 9, exit_sig: 17
    parent_sig: 0
    kSWD: Prepare to reboot/failover in a moment..

    * SECTION:CONSOLE_LOG *
    -----------------------********------------------------
    pid: 688, process: cald0
    flags: 1348, pending sigs: 0
    exit_code: 11, exit_sig: 17
    parent_sig: 0
    kSWD: Prepare to reboot/failover in a moment...

The process names in the above examples (snmpd, cald) are a good indication of the source of the problem in each case. In some instances, the text following kSWD (kernel software watchdog) will detail the event that caused the panic. The kSWD function alerts the high availability manager (HAM) to a hung or crashed daemon, which in turn is likely to cause a CP failover or reboot. For example:

    "kSWD: Detected Unexpected Termination of: nsd"

The supportshow output may have a portlogdump (software port event log) for the time of the panic. If there are entries at or around the panic time, these can be further decoded using the Brocade 'supporttool', which can be obtained from http://partner.brocade.com/content/For%20Technical%20Professionals/Downloads/Unsupported%20Tools%20Library%20Partner/browseScripts.jsp?cat_id=66669465

Search SunSolve for Brocade bugs which panicked with YOUR text (not the above!).
Search Brocade FW release notes for resolved or identified issues that were not recorded as Sun bugs.
Review SunAlerts for symptoms of the issue as per the SunAlert detail.

5: Hardware issues.
If there is no match in the above, then further investigation by engineering may be required. Good sources of reference once you have obtained information are:-
Note: In a situation where the FC switch has only one power supply and the only error logged is like this:

    [HAM-1004], 478, CHASSIS, INFO, Brocade300, Processor rebooted - Reset

The reason code given indicates that the power has been removed. This can be caused manually (by a person) or by a HW defect. It is not a SW (FOS) issue; otherwise some events would lead up to the RAS entries that record the power-up and boot, and there are none, which again reinforces that power was removed and then re-applied. Since there is only one power supply in this type of switch, removing and then re-applying power results in a cold boot. There are no diagnostics that indicate how the power was removed (mainly because the switch cannot perform diagnostics without any power), but since the switch is back up again, a permanent hardware defect seems unlikely. However, if something similar happens again, consider replacing the unit.
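The log markers quoted in the analysis steps and in the note above lend themselves to a quick first-pass scan of saved 'errdump' or 'pdshow' text. The following is a minimal sketch only: the function name `triage` and the marker table are hypothetical, the markers are copied from the example entries in this document, and real message text may differ by firmware level.

```python
import re

# Markers copied from the example log entries in this document.
# Assumption: message IDs look like this on the firmware in question.
REBOOT_HINTS = {
    "SULIB-FWDL_START": "firmware download started (intentional reboot)",
    "EM-1034": "power supply set to faulty",
    "FW-1010": "power supply reading below low boundary",
    "PDTR-1001": "new panic dump found",
    "HAM-1004": "processor rebooted",
}

# Console-log shapes taken from the pdshow examples in this document.
KSWD_TERMINATION = re.compile(r"kSWD: Detected Unexpected Termination of: (\w+)")
PROCESS_LINE = re.compile(r"pid: (\d+), process: (\w+)")


def triage(log_text):
    """Return (hints, processes) found in saved errdump/pdshow text."""
    hints = []
    processes = []
    for line in log_text.splitlines():
        # Record every known marker that appears on this line.
        for marker, meaning in REBOOT_HINTS.items():
            if marker in line:
                hints.append((marker, meaning))
        # Record process names named by the watchdog or the panic console log.
        for pattern in (KSWD_TERMINATION, PROCESS_LINE):
            match = pattern.search(line)
            if match:
                # The process name is always the last capture group.
                processes.append(match.group(match.lastindex))
    return hints, processes
```

A scan like this only points at where to look; the dates of the matching entries still need to be compared with the reboot date by hand, and the process names are only search terms for bug databases and release notes.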
Brocade Switch
With a dual processor switch/director (12K, 24K, 48K), IF the director had a panic, then the panic may have occurred on just one CP board or on both. In the case of a 12K/24K/48K panic that occurred on a single CP only, the director will 'failover' to the standby CP, so the panic data is now on the new 'standby' CP.

When a 'supportsave' is run on the standby CP as part of the data collection exercise, it will not collect a 'supportshow' from the standby CP, as a supportshow cannot be run on a standby CP. However, the panic dump images will be collected, and possibly a CONSOLE log, which can be used to investigate the cause.

Note: The actual file and directory names within a supportsave collection, as well as the use of UPPER or lower case, are firmware dependent. Any panic dump images within the supportsave collection will likely be located under '*-core/panic/' on the FTP server.

Should a pdshow output be missing from the supportsave/supportshow collection, it is possible to ftp the panic dump files to a lab switch of the same type and firmware level and run pdshow there for analysis.

There is a spreadsheet that can help identify the approximate date and time of a panic from a panic dump file. This can help confirm that you are looking at the correct panic dump file. The spreadsheet requires the 'System Startup Time' and the 'Epoch Date'. The 'System Startup' time is contained within the core.pdNNNNNNN file, where NNNNNNN is the 'Epoch Date'.
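As a rough cross-check when the spreadsheet is not to hand, the numeric suffix of a panic dump file name can be converted directly. This is a sketch under one assumption: that the suffix is Unix epoch seconds in UTC, which is consistent with the core.pd1128905472 example in this document but is not confirmed to be how the spreadsheet works. The function name `panic_time_from_name` is hypothetical.

```python
import re
from datetime import datetime, timezone

# File names in this document look like core.pd<digits>.
PD_NAME = re.compile(r"core\.pd(\d+)$")


def panic_time_from_name(file_name):
    """Read the numeric suffix of a core.pd file name as Unix epoch seconds (UTC)."""
    match = PD_NAME.search(file_name)
    if match is None:
        raise ValueError(f"not a core.pd file name: {file_name!r}")
    # Assumption: the suffix counts seconds since 1970-01-01 00:00:00 UTC.
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


print(panic_time_from_name("core.pd1128905472").isoformat())
# prints 2005-10-10T00:51:12+00:00
```

The sketch does not reproduce the spreadsheet's 'Estimated Jiffie Reboot Date' projection, which also needs the System Startup time.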
For example, the following panic dump file has a System Startup of 05/30/04 10:04:28 PM and the 'Epoch' date/time is 1128905472:

    my_host:->strings core.pd1128905472 | grep Startup

The spreadsheet calculates that the reboot occurred on 10/10/05 at around 12:51:12 AM. There is also a projected 'Estimated Jiffie Reboot Date' if the panic occurred due to SunAlert 101607 <Document: 1000167.1>.

The spreadsheet can be obtained from http://service.uk/~briana/jiffie_calc.xls

Brocade FW release notes for Sun qualified versions are contained in the patch release notes (e.g. patch 115360) or are available via the http://partner.brocade.com website. Newer FW release notes may exist but may not be freely available.

Keywords: brocade, reboot, reason, unknown, panic

Previously Published As: 82788

Change History
Date: 2007-09-25, User Name: 97961, Action: Approved, Comment: Publishing. No further edits required. Version: 9
Date: 2007-09-25, User Name: 97961, Action: Accept, Comment: (none), Version: 0
Date: 2007-09-25, User Name: 100761, Action: Approved, Comment: Fixed a typo and the hyper link, sun alert was mentioned as 10167 instead of 101607. changes made to line : There is also a projected 'Estimated Jiffie Reboot Date' if the panic occurred due to SunAlert 101607 Rest of the doc looks fine. Publish it. Version: 0
Date: 2007-09-24, User Name: 100761, Action: Accept, Comment: (none), Version: 0

Attachments: This solution has no attachment