Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1018903.1
Update Date: 2013-11-06
Keywords:

Solution Type: Problem Resolution

Solution 1018903.1: Brocade Silkworm Switches: Host Reboots Can Cause Brocade Marginal/Warning/DOWN Healthy/OK Errors


Related Items
  • Brocade 48000 Director
  • Brocade 200E Switch
  • Brocade 4100 Switch
Related Categories
  • PLA-Support>Sun Systems>DISK>Switch>SN-DK: Brocade Switch
  • _Old GCS Categories>Sun Microsystems>Switches>Brocade

PreviouslyPublishedAs
230744


Symptoms
Brocade Silkworm switches can log "Marginal/Warning/DOWN" and "HEALTHY/OK"
status-change error messages when a host reboots.


Resolution
Modify the switch status policy for "FaultyPorts" either at the command line
interface (CLI) or in the Web graphical user interface (GUI). You can
increase the "Down" and "Marginal" thresholds, or disable a policy
entirely by entering "0":

 # help switchstatuspolicyset
 # switchstatuspolicyset
 The minimum number of
   FaultyPorts contributing to DOWN status: (0..64) [2] 3
   FaultyPorts contributing to MARGINAL status: (0..64) [1] 2
   MissingSFPs contributing to DOWN status: (0..64) [0]  <press Return through the remaining prompts>
 Policy parameter set has been changed

The changes take place immediately. There is no need to reboot the switch.



Additional Information
During the "non-healthy" state, the switch will appear orange when viewed
in the Web GUI and may send an SNMP trap if configured to do so.

Brocade Silkworm switches have an error policy mechanism that logs error
messages when an element transitions above or below definable thresholds.
One such threshold is "FaultyPorts." When a device reboots, it is possible
for the switch to treat the port as having faulted. When the counter goes
beyond the threshold (which defaults to 1 faulty port), an error is logged,
such as:

 >> Switch: 1, Warning FW-STATUS_SWITCH, 3, Switch status changed from
>> HEALTHY/OK to Marginal/Warning (  --- 1 faulty port;)

When the port recovers and is deemed non-faulty, the counter is lowered
and the lower threshold can be crossed and another error message is logged
as follows:

 >> Switch: 1, Warning FW-STATUS_SWITCH, 3, Switch status changed from
>> Marginal/Warning to HEALTHY/OK
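The counter-and-threshold behaviour described above can be sketched in a few
lines of Python. This is illustrative only, not Brocade code; the class name,
method names, and message wording are invented for the example, and a
threshold of 0 disables the transition, as on the real switch:

```python
# Minimal sketch of the switch-status threshold mechanism described above.
# Illustrative only -- not Brocade code; names and messages are invented.

class SwitchStatusPolicy:
    def __init__(self, marginal_threshold=1, down_threshold=2):
        # A threshold of 0 disables that transition, as on the switch.
        self.marginal_threshold = marginal_threshold
        self.down_threshold = down_threshold
        self.faulty_ports = 0
        self.status = "HEALTHY/OK"

    def _evaluate(self):
        """Recompute status; return a log message if it changed."""
        old = self.status
        if self.down_threshold and self.faulty_ports >= self.down_threshold:
            self.status = "DOWN"
        elif self.marginal_threshold and self.faulty_ports >= self.marginal_threshold:
            self.status = "Marginal/Warning"
        else:
            self.status = "HEALTHY/OK"
        if self.status != old:
            return "Switch status changed from %s to %s (%d faulty port(s))" % (
                old, self.status, self.faulty_ports)
        return None

    def port_fault(self):
        self.faulty_ports += 1
        return self._evaluate()

    def port_recovered(self):
        self.faulty_ports -= 1
        return self._evaluate()
```

With the default thresholds, a single transient port fault (one rebooting
host) is enough to emit the Marginal/Warning message and then the
HEALTHY/OK message on recovery; raising the marginal threshold to 2
suppresses both, which is exactly the effect of the resolution above.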

Online documentation within the CLI details how to change the threshold or
disable such policies:

 # help switchstatuspolicyset

The customer can increase the default values to avoid this scenario; this
is, of course, at the customer's discretion. The switchstatuspolicyset
changes take effect immediately. There is no need to reboot the switch.

Obviously, it should not be taken for granted that a rebooting host caused
a port "fault" that in turn caused the error message. There are several
other policies relating to power supplies, fans, and so on.

For example, the following output is from a 12000 switch.

The current overall switch status policy parameters:

                 Down    Marginal
  ----------------------------------
  FaultyPorts      2        1
  MissingSFPs      0        0
  PowerSupplies    2        1
  Temperatures     2        1
  Fans             2        1
  PortStatus       0        0
  ISLStatus        0        0
  CP               0        1
  WWN              0        1
  Blade            0        1

During the "non-healthy" state, it is possible to identify the cause of the
policy error. Issue "switchstatusshow." If the switch is not HEALTHY/OK,
additional information will be provided.

One complication is that the marginal/healthy transitions can be brief, so
the issue may be investigated only after it has passed, making confirmation
and identification of the port or device that caused the error difficult.

To identify the cause of the problem, you can cross-match the dates of the
error entries against host messages files or logs to determine whether a
host rebooted at that time. Be aware of any clock/time difference between
the switch and the host; the "date" command on the switch displays its
current date and time.
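This cross-match can be rough-sketched mechanically. The sketch below is an
illustration under stated assumptions: the switch timestamp format is the
errdump style shown in this article, the host format is a generic
ISO-style log timestamp, and the skew value must be measured against the
switch "date" command; the function name and window are invented:

```python
# Illustrative sketch: match a switch error timestamp against a host event
# timestamp, correcting for a known clock offset between switch and host.
# Timestamp formats, skew, and window are assumptions for the example.
from datetime import datetime, timedelta

def matches(switch_ts, host_ts, skew_seconds=0, window_seconds=120):
    """True if the host event falls within window_seconds of the switch
    event, after subtracting skew_seconds from the switch clock."""
    sw = datetime.strptime(switch_ts, "%Y/%m/%d-%H:%M:%S")   # errdump style
    ho = datetime.strptime(host_ts, "%Y-%m-%d %H:%M:%S")     # host log style
    corrected = sw - timedelta(seconds=skew_seconds)
    return abs((corrected - ho).total_seconds()) <= window_seconds
```

For example, a host reboot logged about a minute before the switch status
change, with the switch clock running 30 seconds fast, would be flagged as
a likely match.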

Depending on the condition, there may be other messages recorded such as BL-nnnn or PORT-nnnn which will identify the port in question.

Additionally, you can use 'fabstateshow' and match fabric changes to the
date and time of the errdump entry.

Sample errdump output:
2007/05/03-14:47:21, [FW-1424], 149,, WARNING, A, Switch status changed from HEALTHY to DOWN.

2007/05/03-14:47:21, [FW-1437], 150,, WARNING, A, Switch status change contributing factor Faulty ports: 1 faulty ports.

Corresponding date and time entries in sample 'fabstateshow' output:
Thu May 3 08:01:07 2007
....
14:47:21.266 SCN Port Offline;g=0x0 D2,P0 D2,P0 39 NA
14:47:21.280 *Removing all nodes from port D2,P0 D2,P0 39 NA

Here you can see that port 39 is likely to be a contributory cause of the status change and error log entry.
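The correlation of the two outputs above can also be sketched in code. This
is a minimal illustration, not a supported tool: the regular expressions
are assumptions derived from the sample lines in this article and may need
adjusting for other Fabric OS versions:

```python
# Illustrative sketch: extract the time of day from an errdump entry and a
# fabstateshow entry, and return the port number when they coincide.
# The regular expressions are assumptions based on the sample output above.
import re

ERRDUMP_RE = re.compile(r"^(\d{4}/\d{2}/\d{2})-(\d{2}:\d{2}:\d{2})")
FABSTATE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\.\d+\s+\*?(.*?)\s+D\d+,P\d+\s+D\d+,P\d+\s+(\d+)")

def correlate(errdump_line, fabstate_line):
    """Return the port number if both lines share the same time of day."""
    e = ERRDUMP_RE.match(errdump_line)
    f = FABSTATE_RE.match(fabstate_line)
    if e and f and e.group(2) == f.group(1):
        return int(f.group(3))   # port number column
    return None
```

Run against the sample lines above, this would report port 39 as the
coinciding event.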

Additionally, you can use Fabric Watch, which provides finer-grained error
logging. Fabric Watch is a licensable option and may not be present on the
switch in question; "licenseshow" will indicate whether it is installed.
If it is, Fabric Watch can be configured as follows.

1) Identify if Fabric Watch is licensed on this switch:

  # licenseshow
Fabric Watch License

2) Use fwshow or the GUI to identify current thresholds:

  # fwshow
1 : Show class thresholds
3 : Port class

3) Use fwconfigure to modify port class to custom link loss settings:

  # fwconfigure
3 : Port class
1 : Link loss
4 : Advanced configuration
6 : change custom low                [1]
7 : change custom high               [0]
3  : change threshold boundary level [2] custom
9  : apply threshold boundary changes
11 : change threshold alarm level    [2] custom
14 : change below alarm              [1]
15 : change above alarm              [1]
16 : change inBetween alarm          [1]
17 : apply threshold alarm changes
^C

When the next "Marginal/Warning HEALTHY/OK" error entry occurs,
there should be an accompanying Fabric Watch error showing the port
number involved.

For example:
  WARNING FW-ABOVE2, 3, portLink006, Port #006 Link Failures is above
high boundary. current value : 1 Error(s)/minute. (faulty)
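The port number can be pulled out of such an entry mechanically. The sketch
below is illustrative only; the regular expression is an assumption based
on the sample message above:

```python
# Illustrative sketch: extract the port number from a Fabric Watch
# "above high boundary" log entry like the sample shown above.
# The pattern is an assumption based on that sample message.
import re

FW_PORT_RE = re.compile(r"Port #(\d+) Link Failures is above")

def fw_port(line):
    """Return the port number from a Fabric Watch entry, or None."""
    m = FW_PORT_RE.search(line)
    return int(m.group(1)) if m else None
```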

If there is no such accompanying error, this suggests that a
rebooting host is not the cause and Sun Service should investigate
further (for example, by decoding the portlogdump output). If the
preceding technique is used, remember to reverse the procedure (setting
it back to the default, if required) unless you want Fabric Watch to
continue logging link events.

  # fwconfigure
3 : Port class
1 : Link loss
4 : Advanced configuration
3  : change threshold boundary level [1]   default
11 : change threshold alarm level    [1]   default
9  : apply threshold boundary changes
^C

Refer to Brocade documentation for guidelines for Fabric Watch settings.

Setting switch status policy values and Fabric Watch definitions should be
viewed as part of good storage area network (SAN) monitoring practice.

For additional information refer to:

  • CLI online help for switchstatuspolicyshow and switchstatuspolicyset
  • Brocade Fabric OS Procedures Guide
  • Fabric Watch Guidelines
  • Brocade Silkworm Design, Deployment and Management Guide


Product
SAN Brocade 3800 2 GB 16-Port Switch
Brocade SilkWorm 3250 Fabric Switch
Brocade SilkWorm 3850 Fabric Switch
Brocade SilkWorm 3250 Switch
Brocade SilkWorm 24000 Director

Internal Comments
Brocade Silkworm Switches: Host Reboots Can Cause Brocade Marginal/Warning/DOWN Healthy/OK Errors

The behaviour of some Qlogic HBAs during boot/reboot is alluded to in:


FAB < Solution: 201068 > 

See the root cause paragraph in that document.


brocade, warning, healthy, down, policy
Previously Published As
76978

Change History
Date: 2009-12-01
User Name: 84789
Action: Reviewed
Comment: Reviewed

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.