Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1006856.1
Update Date: 2017-12-11
Keywords:

Solution Type  Troubleshooting Sure

Solution  1006856.1 :   Sun Storage 3510 and 3511 Arrays:  Troubleshooting Redundant Loop Failures  


Related Items
  • Sun Storage 3511 SATA Array
  • Sun Storage 3510 FC Array
Related Categories
  • PLA-Support>Sun Systems>DISK>Arrays>SN-DK: SE31xx_33xx_35xx

Previously Published As
209520
Purpose/scope: This document is a subset of "Troubleshooting Sun StorEdge[TM] 33x0/351x Hardware". The steps below help verify and resolve Fibre Channel redundant path problems.

Applies to:

Sun Storage 3511 SATA Array - Version Not Applicable and later
Sun Storage 3510 FC Array - Version Not Applicable and later
All Platforms

Purpose

 This document describes a troubleshooting approach for Sun Storage 3510 and 3511 arrays that exhibit the following symptoms:

  • Redundant loop or path failures
  • Incorrect channel speed
  • Incrementing invalid transmission word and/or CRC errors
  • Failed or missing disks, controller, or IOM
  • Logical drive or multiple drive failures
  • Hung logical drive rebuilds or initialization

 

Troubleshooting Steps

The following troubleshooting steps can help resolve redundant loop failures on Sun Storage 3510 and 3511 arrays.

Step 1: Check the event log or the persistent event log for redundant loop failures by executing sccli> show events or sccli> show persistent-events. Redundant loop failures may or may not be accompanied by multiple drive failures on the same loop.

Example from a 3510 array:

sccli> show events

[113f] #9: StorEdge Array SN#1234567 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)
[113f] #10: StorEdge Array SN#1234567 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)
[113f] #11: StorEdge Array SN#1234567 CH2: NOTICE: fibre channel loop connection restored
[113f] #12: StorEdge Array SN#1234567 CH2: NOTICE: fibre channel loop connection restored
[113f] #13: StorEdge Array SN#1234567 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)
...

[2101] #19: LD-ID 436CE267 on StorEdge Array SN#1234567: ALERT: SCSI drive failure (CH2 ID42)
[2101] #20: LD-ID 72BE7D18 on StorEdge Array SN#1234567: ALERT: SCSI drive failure (CH2 ID22)
[2101] #21: LD-ID 00000000 on StorEdge Array SN#1234567: ALERT: SCSI drive failure (CH2 ID5)
[2101] #22: LD-ID 72BE7D18 on StorEdge Array SN#1234567: ALERT: SCSI drive failure (CH2 ID25)
[2101] #23: LD-ID 436CE267 on StorEdge Array SN#1234567: ALERT: SCSI drive failure (CH2 ID43)
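
When the event log is long, it can help to tally the alerts per channel. The following is a minimal Python sketch, not part of sccli, that assumes the output of show events has been saved as text in the format shown above. The names SAMPLE_EVENTS and summarize_events are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

# Lines copied from the sample event log above; the variable name is hypothetical.
SAMPLE_EVENTS = """\
[113f] #10: StorEdge Array SN#1234567 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)
[113f] #11: StorEdge Array SN#1234567 CH2: NOTICE: fibre channel loop connection restored
[113f] #13: StorEdge Array SN#1234567 CH2: ALERT: redundant loop failure detected (ALT Surviving CH3)
[2101] #19: LD-ID 436CE267 on StorEdge Array SN#1234567: ALERT: SCSI drive failure (CH2 ID42)
[2101] #20: LD-ID 72BE7D18 on StorEdge Array SN#1234567: ALERT: SCSI drive failure (CH2 ID22)
"""

# Patterns are assumptions based only on the message wording in the sample.
LOOP_FAIL = re.compile(r"CH(\d+): ALERT: redundant loop failure detected")
DRIVE_FAIL = re.compile(r"ALERT: SCSI drive failure \(CH(\d+) ID(\d+)\)")


def summarize_events(text):
    """Return (loop failure count per channel, failed drive IDs per channel)."""
    loop_failures = Counter()
    drive_failures = {}
    for line in text.splitlines():
        match = LOOP_FAIL.search(line)
        if match:
            loop_failures[int(match.group(1))] += 1
            continue
        match = DRIVE_FAIL.search(line)
        if match:
            channel, drive_id = int(match.group(1)), int(match.group(2))
            drive_failures.setdefault(channel, []).append(drive_id)
    return loop_failures, drive_failures


if __name__ == "__main__":
    loops, drives = summarize_events(SAMPLE_EVENTS)
    for channel, count in sorted(loops.items()):
        print(f"CH{channel}: {count} redundant loop failure alert(s)")
    for channel, ids in sorted(drives.items()):
        print(f"CH{channel}: drive failures on IDs {ids}")
```

Several drive failures reported on the same channel as the loop failures suggest a loop problem rather than independent disk faults.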


Step 2: Execute the command sccli> show disks to determine whether multiple drives on the same loop are BAD. Below is sample output from a 3510 array that shows BAD drives.

sccli> show disks
Ch     Id      Size   Speed  LD     Status     IDs                   Rev  
(3)  34       N/A   N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                                  S/N3HX1F0M400007412
                                                  WWNN2000000C505F89FA
(3)  35       N/A   N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                                  S/N3HX1F09X00007412
                                                  WWNN2000000C505F8AAD
(3)  36       N/A   N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                                  S/N3HX1F26800007412
                                                  WWNN2000000C505F8715
(3)  37       N/A   N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                                  S/N3HX1EYJY00007412
                                                  WWNN2000000C505F8A28
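
To see at a glance whether the BAD drives share a loop, the saved output can be grouped by channel. This is a rough Python sketch that assumes the column layout shown above (channel, ID, size, speed, LD, status). The names SAMPLE_DISKS and bad_disks_by_channel are hypothetical.

```python
import re

# Two entries copied from the sample output above; the variable name is hypothetical.
SAMPLE_DISKS = """\
Ch     Id      Size   Speed  LD     Status     IDs                   Rev
(3)  34       N/A   N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                                  S/N3HX1F0M400007412
                                                  WWNN2000000C505F89FA
(3)  35       N/A   N/A    NONE   BAD        SEAGATE ST336753FSUN36G  0349
                                                  S/N3HX1F09X00007412
                                                  WWNN2000000C505F8AAD
"""

# Assumed row layout: channel (optionally in parentheses), ID, size, speed, LD, status.
DISK_ROW = re.compile(r"^\s*\(?(\d+)\)?\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)")


def bad_disks_by_channel(text):
    """Return {channel: [disk IDs]} for every row whose status is BAD."""
    bad = {}
    for line in text.splitlines():
        match = DISK_ROW.match(line)
        if match and match.group(6).upper() == "BAD":
            bad.setdefault(int(match.group(1)), []).append(int(match.group(2)))
    return bad


if __name__ == "__main__":
    for channel, ids in sorted(bad_disks_by_channel(SAMPLE_DISKS).items()):
        print(f"Channel {channel}: BAD disk IDs {ids}")
```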


Step 3: Ensure that the ID switch settings are unique for each enclosure, and that the disk IDs are identified correctly as described in <Document 1007692.1> Sun Storage 3510 and 3511 Arrays: How to Identify Switch IDs, Disk IDs and Correct Backend Cabling.

Step 4: Verify that the diagnostic Invalid Transmission Word counters for the RAID devices are not increasing by comparing the counters at different times while I/O is taking place.

sccli> show diag error channel 2
sccli> show diag error channel 3


If sccli is not installed on the host, see <Document 1004352.1> How to Download and Install Sun Storage Configuration Service Software (SUNWsscs). An alternate method to capture data during I/O activity is given below.


1. Check the Fibre Channel Error Statistics using the firmware interface, as described in the "Fibre Channel Error Statistics (FC and SATA Only)" section of the Sun StorEdge 3000 Family RAID Firmware 4.2x User's Guide.

2. During I/O activity, monitor the following values for sharp increases on the RAID devices:

InvalTXWord - Total number of instances of invalid transmission words. This error indicates either an invalid transmit word or disparity error.
InvalCRC - Total number of instances of invalid CRC, or the number of times a frame was received and the CRC was not as expected.

 
For example, the following 3510 RAID device has high invalid transmission word counts for the channel 2 controllers (device IDs 14 and 15):

sccli> show diag error channel 2

CH  ID  TYPE  LIP   LinkFail LossOfSy LossOfSi PrimErr  InvalTxW InvalCRC
------------------------------------------------------------------------
2   0  DISK  59    0        3        0        0        450311   0
2   1  DISK  59    0        1        0        0        476834   0
2   2  DISK  59    0        5        0        0        456602   0
2   3  DISK  59    0        1        0        0        450818   0
2   4  DISK  59    0        1        0        0        450556   0
....
2  39  DISK  59    0        1        0        0        448454   0
2  40  DISK  59    0        1        0        0        451082   0
2  41  DISK  59    0        3        0        0        448987   0
2  42  DISK  59    0        5        0        0        450288   0
2  43  DISK  59    0        1        0        0        448025   0
2  44  SES   59    0        0        0        0        0        0
2  14  RAID  59    0        0        0        0        20863    0
2  15  RAID  59    0        0        0        0        20840    0


If counters are increasing:

  • Determine the back-end loop device order, then investigate the device just BEFORE any device reporting high error counts.
  • If there are invalid transmission counts or CRC errors for the RAID devices 14 and 15, then this may be an indication of an improperly seated or faulty component.
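
The comparison of counters at two points in time can be sketched in Python as follows. This is an illustration only: it assumes the ten-column layout of show diag error shown above, and the second snapshot (SNAPSHOT_AFTER) is invented data, not output from a real array. All names here are hypothetical.

```python
# First snapshot: rows copied from the sample output above.
SNAPSHOT_BEFORE = """\
CH  ID  TYPE  LIP   LinkFail LossOfSy LossOfSi PrimErr  InvalTxW InvalCRC
2   0  DISK  59    0        3        0        0        450311   0
2  44  SES   59    0        0        0        0        0        0
2  14  RAID  59    0        0        0        0        20863    0
2  15  RAID  59    0        0        0        0        20840    0
"""

# Second snapshot: HYPOTHETICAL values, made up to demonstrate the comparison.
SNAPSHOT_AFTER = """\
CH  ID  TYPE  LIP   LinkFail LossOfSy LossOfSi PrimErr  InvalTxW InvalCRC
2   0  DISK  59    0        3        0        0        450311   0
2  44  SES   59    0        0        0        0        0        0
2  14  RAID  59    0        0        0        0        20901    0
2  15  RAID  59    0        0        0        0        20840    0
"""


def parse_diag_errors(text):
    """Return {(channel, device ID): (type, InvalTxW, InvalCRC)}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have ten columns and begin with the numeric channel.
        if len(fields) == 10 and fields[0].isdigit():
            key = (int(fields[0]), int(fields[1]))
            counters[key] = (fields[2], int(fields[8]), int(fields[9]))
    return counters


def increasing_counters(before_text, after_text):
    """List (channel, ID, type, InvalTxW delta, InvalCRC delta) for counters that grew."""
    before = parse_diag_errors(before_text)
    after = parse_diag_errors(after_text)
    grown = []
    for key, (dev_type, txw, crc) in sorted(after.items()):
        if key not in before:
            continue
        _, old_txw, old_crc = before[key]
        if txw > old_txw or crc > old_crc:
            grown.append((key[0], key[1], dev_type, txw - old_txw, crc - old_crc))
    return grown


if __name__ == "__main__":
    for row in increasing_counters(SNAPSHOT_BEFORE, SNAPSHOT_AFTER):
        print("CH%d ID%d %s: InvalTxW +%d, InvalCRC +%d" % row)
```

A large absolute count that stays constant between snapshots is historical. Only counters that keep growing under I/O point to a current problem.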


Step 5: Execute the command sccli> show channels to ensure that all the configured ports are running at the correct speed. Below is sample output showing loop B with an incorrect speed:

sccli> show channels

Ch  Type    Media   Speed   Width  PID / SID
0 Host    FC(L)   2G      Serial  40 / NA
1 Host    FC(L)   2G      Serial  43 / NA
2 Drive   FC(L)   2G      Serial  14 / 15
3 Drive   FC(L)   ASYNC   Serial  14 / 15  <----Loop B is async
4 Host    FC(L)   2G      Serial  44 / NA
5 Host    FC(L)   2G      Serial  47 / NA


Step 6: Execute the command sccli> show enclosure-status to ensure that both loop A and loop B are visible. In the sample output below, both loops are visible.

sccli> show enclosure-status

Ch Id  Chassis Vendor/Product ID    Rev  PLD  WWNN             WWPN
-------------------------------------------------------------------------------
2 12 0xxxx9  SUN StorEdge 3510F A 1080 1000 204000C0FF0859A9 214000C0FF0859A9   Topology: loop(a)  Status: OK
3 12 0xxxx9  SUN StorEdge 3510F A 1080 1000 204000C0FF0859A9 224000C0FF0859A9   Topology: loop(b)  Status: OK


In the sample output below, only loop A is visible, which indicates a problem.

sccli> show enclosure-status
sccli: selected device /dev/rdsk/c3t44d0s2 [SUN StorEdge 3510 SN#1234567]

Ch  Id Chassis Vendor/Product ID        Rev  PLD  WWNN             WWPN
-------------------------------------------------------------------------------
2 12 0xxxxB  SUN StorEdge 3510F A     1080 1000 204000C0FF0xxxxB 214000C0FF0xxxxB Topology: loop(a)
2 28 0xxxx9  SUN StorEdge 3510F D     1080 1000 205000C0FF0xxxx9 215000C0FF0xxxx9 Topology: loop(a)
2 44 0xxxxA  SUN StorEdge 3510F D     1080 1000 205000C0FF0xxxxA 215000C0FF0xxxxA Topology: loop(a)
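
The loop visibility check can be sketched in Python as follows. This is a minimal illustration that assumes each enclosure row carries its "Topology: loop(x)" tag on the same line, as in the samples above. Output that places the tag on a separate line would need a different parser. The names SAMPLE_BOTH_LOOPS, SAMPLE_LOOP_A_ONLY and missing_loops are hypothetical.

```python
import re

# Rows copied from the two samples above; the variable names are hypothetical.
SAMPLE_BOTH_LOOPS = """\
2 12 0xxxx9  SUN StorEdge 3510F A 1080 1000 204000C0FF0859A9 214000C0FF0859A9   Topology: loop(a)  Status: OK
3 12 0xxxx9  SUN StorEdge 3510F A 1080 1000 204000C0FF0859A9 224000C0FF0859A9   Topology: loop(b)  Status: OK
"""

SAMPLE_LOOP_A_ONLY = """\
2 12 0xxxxB  SUN StorEdge 3510F A     1080 1000 204000C0FF0xxxxB 214000C0FF0xxxxB Topology: loop(a)
2 28 0xxxx9  SUN StorEdge 3510F D     1080 1000 205000C0FF0xxxx9 215000C0FF0xxxx9 Topology: loop(a)
2 44 0xxxxA  SUN StorEdge 3510F D     1080 1000 205000C0FF0xxxxA 215000C0FF0xxxxA Topology: loop(a)
"""

# Assumed row layout: channel, ID, chassis, ..., "Topology: loop(<letter>)".
ENCLOSURE_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s.*Topology:\s*loop\((\w)\)")


def missing_loops(text, expected=("a", "b")):
    """Return {chassis: [loops not seen]} for chassis missing an expected loop."""
    seen = {}
    for line in text.splitlines():
        match = ENCLOSURE_ROW.match(line)
        if match:
            seen.setdefault(match.group(3), set()).add(match.group(4).lower())
    wanted = set(expected)
    return {
        chassis: sorted(wanted - loops)
        for chassis, loops in seen.items()
        if wanted - loops
    }


if __name__ == "__main__":
    for chassis, loops in missing_loops(SAMPLE_LOOP_A_ONLY).items():
        print(f"Chassis {chassis}: loop(s) not visible: {', '.join(loops)}")
```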


Step 7: Execute the commands sccli> show bypass sfp and sccli> show bypass raid to verify that both controllers and all devices are visible on each loop, and then check for correct device IDs.

sccli> show bypass sfp

PORT    ENCL-ID ENCL-TYPE       LOOP    BYP-STATUS      ATTRIBUTES
----    ------- ---------       ----    ----------      ----------
0       0       RAID            LOOP-B  Not-Installed   --
1       0       RAID            LOOP-B  Unbypassed      --
L       0       RAID            LOOP-B  Not-Installed   --
R       0       RAID            LOOP-B  Unbypassed      --
4       0       RAID            LOOP-B  Not-Installed   --
5       0       RAID            LOOP-B  Unbypassed      --
L       1       JBOD            LOOP-B  Unbypassed      --
R       1       JBOD            LOOP-B  Not-Installed   --
L       2       JBOD            LOOP-B  Unbypassed      --
R       2       JBOD            LOOP-B  Unbypassed      --

sccli> show bypass raid

SLOT    LOOP    BYP-STATUS
----    ----    ----------
TOP     LOOP-A  Unbypassed
TOP     LOOP-B  Unbypassed
BOTTOM  LOOP-A  Unbypassed
BOTTOM  LOOP-B  Unbypassed


Refer to the Sun StorEdge 3000 Family RAID Firmware 4.2x User's Guide for more details.


Step 8: Check the output of sccli> show fru to confirm that no components on the loop, especially the IOM or controller, are reported as N/A or Absent. If expansion trays are attached, check that they are visible.

The sample output shown below from a 3510 array indicates that the lower IOM (CH3) is not available:

 Name: FC_RAID_IOM
 Description: N/A
 Part Number: N/A
 Serial Number: N/A
 Revision: N/A
 Initial Hardware Dash Level: N/A
 Manufacturing Date: N/A
 Manufacturing Location: N/A
 Manufacturer JEDEC ID: N/A
 FRU Location: LOWER FC RAID IOM SLOT
 Chassis Serial Number: 00331B
 FRU Status: Absent


When the IOM is present and healthy, the output should appear as follows:


 Name: FC_RAID_IOM
 Description: SE3510 I/O w/SES + RAID Cont 1GB
 Part Number: 370-5537
 Serial Number: 0029xx
 Revision: 02
 Initial Hardware Dash Level: 02
 FRU Shortname: 370-5537-02
 Manufacturing Date: Fri Jun 27 20:52:58 2003
 Manufacturing Location: Milpitas,CA,USA
 Manufacturer JEDEC ID: 0x0301
 FRU Location: LOWER FC RAID IOM SLOT
 Chassis Serial Number: 00331B
 FRU Status: OK
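
For arrays with many FRUs, the saved show fru output can be scanned for any record whose status is not OK. The Python sketch below assumes the "Key: Value" record layout shown above, with each record starting at a Name line. Treating every status other than OK as needing attention is an assumption of this example. The names ABSENT_SAMPLE, OK_SAMPLE, parse_fru_records and unhealthy_frus are hypothetical.

```python
# Abbreviated records copied from the samples above; variable names are hypothetical.
ABSENT_SAMPLE = """\
 Name: FC_RAID_IOM
 Description: N/A
 Part Number: N/A
 FRU Location: LOWER FC RAID IOM SLOT
 Chassis Serial Number: 00331B
 FRU Status: Absent
"""

OK_SAMPLE = """\
 Name: FC_RAID_IOM
 Description: SE3510 I/O w/SES + RAID Cont 1GB
 Part Number: 370-5537
 Manufacturing Date: Fri Jun 27 20:52:58 2003
 FRU Location: LOWER FC RAID IOM SLOT
 Chassis Serial Number: 00331B
 FRU Status: OK
"""


def parse_fru_records(text):
    """Split 'Key: Value' lines into one dict per FRU; a 'Name' key starts a record."""
    records, current = [], None
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so timestamps in the value stay intact.
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Name":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def unhealthy_frus(text):
    """Return (FRU location, FRU status) for every record whose status is not OK."""
    return [
        (record.get("FRU Location", "unknown"), record.get("FRU Status", "unknown"))
        for record in parse_fru_records(text)
        if record.get("FRU Status") != "OK"
    ]


if __name__ == "__main__":
    for location, status in unhealthy_frus(ABSENT_SAMPLE + "\n" + OK_SAMPLE):
        print(f"{location}: {status}")
```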


Step 9: Use the following steps to troubleshoot the faults described above.

  • Verify whether the IOM or controller is improperly seated or faulty. Refer to <Document 1002641.1> Sun Storage 3000 Arrays: Troubleshooting Controller Problems.
  • Verify cabling is correct. Refer to <Document 1008193.1> Sun Storage 3510 and 3511 Arrays: Troubleshooting the Cabling.
  • Verify that there are no unused SFPs in drive channels 2 and 3 on each controller.
  • Verify that firmware levels for the controller, PLD, and SES are at the latest revisions.
  • Remove and reinsert the hardware components that may need reseating: SFPs, cables, disks, and the controller/IOM.


Step 10: If the hardware fault persists, then contact Oracle for further support.

Step 11: If no problems were found while following this document, refer to <Document 1011431.1> Troubleshooting Sun Storage 3000 Array Series Hardware.



Previously Published As
89049



Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.