Sun StorEdge A5000/A5100/A5200 Arrays: I/O becomes unresponsive or hangs on disks

Asset ID:	1-72-1008074.1
Update Date:	2016-04-01
Keywords:

Solution Type Problem Resolution Sure

Solution 1008074.1 : Sun StorEdge A5000/A5100/A5200 Arrays: I/O becomes unresponsive or hangs on disks

Applies to:

Sun Storage A5000 Array - Version All Versions and later
Sun Storage A5200 Array - Version All Versions and later
Sun Storage A5100 Array - Version All Versions and later
All Platforms

Symptoms

I/O to one or more drives in a Sun StorEdge A5x00 Array subsystem becomes unresponsive or incurs retryable SCSI errors on a Solaris host.
In most cases, this is a hardware fault. The design of FC-AL (Fibre Channel Arbitrated Loop) allows any participant in the loop to corrupt, or fail to pass on, packets destined for the target of the I/O. This is much like a token ring network, in that if one station fails, the rest of the ring cannot communicate.

The command "luxadm probe" shows individual drives instead of the SES devices for the Interface Boards (IBs):

Found Fibre Channel device(s):
Node WWN:20000004cf6be6b7 Device Type:Disk device
Logical Path:/dev/rdsk/c9t0d0s2
Node WWN:20000004cf7009a2 Device Type:Disk device
Logical Path:/dev/rdsk/c9t1d0s2
Node WWN:20000004cf6bfd68 Device Type:Disk device
Logical Path:/dev/rdsk/c9t4d0s2

NOTE: This is normal for some FC Multipacks and JBODs, but not for the A5x00.
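
The check described above can be sketched in Python. This is a minimal illustration that parses only the two line shapes visible in the sample output ("Node WWN:... Device Type:..." and "Logical Path:..."). The function names are hypothetical, and the format of a healthy A5x00 probe is not shown in this document, so it is deliberately not parsed here.

```python
import re

# Line shapes taken from the sample "luxadm probe" output above.
ENTRY_RE = re.compile(
    r"Node WWN:(?P<wwn>[0-9a-fA-F]{16})\s+Device Type:(?P<dtype>.+)"
)
PATH_RE = re.compile(r"Logical Path:(?P<path>\S+)")


def parse_probe(text):
    """Return a list of dicts with keys wwn, device_type, and path."""
    devices = []
    for raw in text.splitlines():
        line = raw.strip()
        entry = ENTRY_RE.search(line)
        if entry:
            devices.append({
                "wwn": entry.group("wwn").lower(),
                "device_type": entry.group("dtype").strip(),
                "path": None,
            })
            continue
        path = PATH_RE.search(line)
        # Attach the path to the most recent entry that has none yet.
        if path and devices and devices[-1]["path"] is None:
            devices[-1]["path"] = path.group("path")
    return devices


def only_disks_reported(devices):
    """True when devices were found and every one is a plain disk device."""
    return bool(devices) and all(
        d["device_type"] == "Disk device" for d in devices
    )
```

If only_disks_reported() returns True for an A5x00, the probe matches the symptom described in this section.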

The command "format" shows drive type unknown:

12. c9t1d0 <drive type unknown>
/sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf7009a2,0
13. c9t5d0 <drive type unknown>
/sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf6bde10,0
14. c9t10d0 <drive type unknown>

The commands "luxadm -e dump_map" or "luxadm display enclosure_name" show all zeros for the Port WWN and Node WWN of a drive:

luxadm -e dump_map:

Pos AL_PA ID Hard_Addr Port WWN         Node WWN         Type
0   1     7d 1         2007020000122498 5020020000122498 0x3  (Processor device, Host Bus Adapter)
1   d2    d  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
2   ef    0  41        0000000000000000 0000000000000000 0x1f (Unknown Type)
3   e8    1  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
4   e1    4  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
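
As an illustration, loop positions that report all-zero WWNs can be picked out of saved dump_map output with a short Python sketch. It assumes the column layout shown in the sample above (Pos, AL_PA, ID, Hard_Addr, Port WWN, Node WWN, Type). The function name is hypothetical.

```python
import re

# One data row of the dump_map table, following the sample layout above.
ROW_RE = re.compile(
    r"^\s*(?P<pos>\d+)\s+(?P<al_pa>[0-9a-fA-F]+)\s+(?P<id>[0-9a-fA-F]+)\s+"
    r"(?P<hard>[0-9a-fA-F]+)\s+(?P<port_wwn>[0-9a-fA-F]{16})\s+"
    r"(?P<node_wwn>[0-9a-fA-F]{16})\s+(?P<type>0x[0-9a-fA-F]+)"
)


def zero_wwn_entries(text):
    """Return (position, al_pa) for rows whose Port and Node WWN are all zeros."""
    suspects = []
    for line in text.splitlines():
        row = ROW_RE.match(line)
        if not row:
            continue  # header, wrapped text, or unrelated line
        if set(row.group("port_wwn")) == {"0"} and set(row.group("node_wwn")) == {"0"}:
            suspects.append((int(row.group("pos")), row.group("al_pa").lower()))
    return suspects
```

Run against the sample above, this flags positions 1 through 4 and leaves the HBA entry at position 0 alone.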

Cause

Component failure in the Fibre-Channel Loop

Solution

The difficulty with isolating faults in this FC-AL architecture is that any one participant can fail marginally: it still accepts I/O and causes no overt failure. This typically presents as a host hang, but can also appear as strange, partial, or corrupted output from the Solaris luxadm(1M) and format(1M) commands.

Methodical fault isolation of the components is the most reliable way to resolve this issue.

Start this isolation by running:

format(1M)
luxadm probe
luxadm display enclosure_name
luxadm -e dump_map enclosure_name

If any of the aforementioned symptoms are observed, or if one or more of these commands hang, contact Oracle Support immediately.
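
Because one of these commands may hang, it can help to run each one under a time limit. The Python sketch below is one possible wrapper, not a supported tool. The helper names and the 60 second default are assumptions. Standard input is set to /dev/null on the assumption that an interactive command such as format then exits after listing the disks. A process blocked in uninterruptible I/O may not respond to the kill, in which case the wrapper itself can block.

```python
import subprocess


def isolation_commands(enclosure_name):
    """The four commands listed above, as argument lists."""
    return [
        ["format"],
        ["luxadm", "probe"],
        ["luxadm", "display", enclosure_name],
        ["luxadm", "-e", "dump_map", enclosure_name],
    ]


def run_with_timeout(argv, timeout_s=60):
    """Run argv and return (status, output).

    status is 'ok', 'failed' (non-zero exit), 'hung' (time limit hit),
    or 'missing' (command not found).
    """
    try:
        proc = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung", ""
    except FileNotFoundError:
        return "missing", ""
    return ("ok" if proc.returncode == 0 else "failed"), proc.stdout
```

A caller would loop over isolation_commands("A"), where "A" is only an example enclosure name, and treat any "hung" status as a reason to contact Oracle Support.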

Additionally, it may be necessary to collect Read Link Status (RLS) data from the array on both HBA channels. RLS data is most useful when viewed as a delta, or change, between two points in time. If you suspect a problem, collect RLS data by running:

luxadm -e rdls enclosure_name

Example:

# luxadm -e rdls A

Link Error Status information for loop:/devices/sbus@6,0/SUNW,socal@1,0:0
al_pa   lnk fail    sync loss   signal loss   sequence err   invalid word   CRC
5a      0           13          16            0              0              0
72      4           1186        0             0              0              0
71      0           16          0             0              0              0
6e      0           15          0             0              0              0
6d      0           28          0             0              0              0
6c      0           26          0             0              0              0
6b      0           26          0             0              0              0
6a      0           21          0             0              0              0
45      0           0           11            0              0              0
55      4           930         0             0              0              0
54      0           19          0             0              0              0
53      0           19          0             0              0              0
52      0           19          0             0              0              0
4e      0           19          0             0              0              0
4d      0           18          0             0              0              0
1       720896      0           0             0              0              0

NOTE: Remember, these outputs are cumulative since the last power cycle of the
array, so it is worthwhile to collect two samples of data on each path to the
array. The Explorer also collects this information.
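
The delta between two samples can be computed with a short Python sketch like the one below. It assumes each sample is the saved text of "luxadm -e rdls" for the same path, in the column order shown in the example. The function and counter names are hypothetical.

```python
# Counter columns in the order they appear in the rdls output above.
COUNTERS = ("lnk_fail", "sync_loss", "signal_loss",
            "sequence_err", "invalid_word", "crc")


def parse_rls(text):
    """Map each al_pa to its six counters; non-data lines are skipped."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue
        try:
            int(parts[0], 16)                      # al_pa is hexadecimal
            values = [int(p) for p in parts[1:]]   # counters are decimal
        except ValueError:
            continue
        rows[parts[0].lower()] = dict(zip(COUNTERS, values))
    return rows


def rls_delta(earlier, later):
    """Return per-al_pa counter changes, omitting ports with no change."""
    deltas = {}
    for al_pa, after in later.items():
        before = earlier.get(al_pa)
        if before is None:
            continue  # port not present in the first sample
        change = {name: after[name] - before[name] for name in COUNTERS}
        if any(change.values()):
            deltas[al_pa] = change
    return deltas
```

A negative change would suggest the counters were reset between samples, for example by a power cycle of the array, so that pair of samples should be discarded.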

Additional Information
The Sun StorEdge A5x00 Array has the following loop architecture:

1) Each drive participates on 2 channels, A and B.
2) Each Interface Board (IB) participates on either the A or B drive channel.
3) Each IB has a participant FC port that accesses the front and rear drive backplanes.

To illustrate the basic loop, here is an A or B drive channel in full loop mode (front and rear backplanes are joined):

HBA in front port -> IB front port (SES chip) -> Drive slots 0-NN in the front -> IB rear port (SES chip) -> Drive slots 0-NN in the rear

Any component that connects these devices, or passes information along between them, can cause data transmission problems that present as a hang, an I/O timeout, or incorrect or incomplete output.
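
To make the point concrete, the loop order above can be modelled as a simple list in Python. This is a toy model only: the number of slots per backplane is a parameter because it is not stated here, the participant labels are illustrative, and it assumes traffic travels in one direction around the loop with no component bypassed.

```python
def build_loop(slots_per_backplane):
    """Participants of one drive channel in full loop mode, in loop order."""
    loop = ["HBA", "IB front port (SES)"]
    loop += ["front slot %d" % n for n in range(slots_per_backplane)]
    loop.append("IB rear port (SES)")
    loop += ["rear slot %d" % n for n in range(slots_per_backplane)]
    return loop


def round_trip(loop, target):
    """Participants touched by a request to target and its reply to the HBA."""
    i = loop.index(target)
    outbound = loop[1:i + 1]   # HBA -> ... -> target
    inbound = loop[i + 1:]     # target -> ... -> wraps back to the HBA
    return outbound + inbound


def affected_targets(loop, faulty):
    """Targets whose round trip passes through the faulty participant."""
    return [t for t in loop[1:] if faulty in round_trip(loop, t)]
```

In this model every round trip passes through every participant, so affected_targets() returns the whole loop for any single faulty component, which is why one marginal device can disturb I/O to all the others.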

Depending on the drive manufacturer, the disk loop clock/timing may come from one or two clock chips on the disk. If the disk has only one clock chip and it fails, both loops are affected.

The storage array front panel can bypass many components within the array. This can be used to help isolate components during troubleshooting.

Internal Comments For internal Sun use only.

Fault Isolation: Service Engineers should look for:

1) hangs on one path but not another

2) increasing RLS counters on a particular path

3) increasing RLS counters on a particular component

It is a drive/backplane/midplane issue if the increase is occurring on both the A and B channels.

Replace the first drive in the dump_map output that shows the RLS increase, but only if there are multiple increases over a period of time.

Increases of less than 10 Invalid Word per hour are acceptable.

Increases in CRC errors are never acceptable.

Increases on all drives in a single channel indicate an IB fault.

We recommend replacing the HBA, cable, GBIC, and IB in this case.

NOTE: RLS data MUST be used in a delta fashion (collect at least two samples of data).

4) After parts replacement, engineers should have the customer monitor the array daily and provide new RLS and /var/adm/messages output, or an Explorer, so that the RLS counters can be reviewed for any increases.
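
The rules of thumb above can be written down as a small Python sketch. It is an interpretation, not an official diagnostic. The function names, the use of slot labels to identify drives, and the "review" outcome for rates of 10 or more Invalid Word per hour are assumptions, since the text states only which increases are acceptable.

```python
def assess_counters(invalid_word_delta, crc_delta, hours):
    """Apply the two thresholds above to one component's RLS delta."""
    if hours <= 0:
        raise ValueError("hours between samples must be positive")
    if crc_delta > 0:
        return "unacceptable"   # CRC increases are never acceptable
    if invalid_word_delta / hours < 10:
        return "acceptable"     # fewer than 10 Invalid Word per hour
    return "review"             # assumption: 10 or more per hour needs a closer look


def classify_scope(increasing_a, increasing_b, all_slots):
    """Suggest where to look, given sets of drive slots with increasing counters.

    increasing_a / increasing_b: slots with increases on channel A / B.
    all_slots: every populated drive slot in the array.
    Returns a list of (subject, suspected area) tuples.
    """
    findings = []
    # Increase on both channels for the same drive.
    for slot in sorted(increasing_a & increasing_b):
        findings.append((slot, "drive/backplane/midplane"))
    # Increase on every drive of one channel only.
    for channel, mine, other in (("A", increasing_a, increasing_b),
                                 ("B", increasing_b, increasing_a)):
        if all_slots and mine >= all_slots and not other:
            findings.append((channel, "IB path: HBA, cable, GBIC, IB"))
    return findings
```

For example, classify_scope({"f0"}, {"f0"}, {"f0", "f1"}) points at the drive in slot f0, while increases on every slot of channel A alone point at the channel A path.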

For a troubleshooting guide, refer to the attachment A5000 Troubleshooting Guide (attached to this KM document).

Attachments

This solution has no attachment