Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

1008074.1: Sun StorEdge A5000/A5100/A5200 Arrays: I/O becomes unresponsive or hangs on disks
Previously Published As: 211117

Applies to:
Sun Storage A5000 Array - Version All Versions and later
Sun Storage A5200 Array - Version All Versions and later
Sun Storage A5100 Array - Version All Versions and later
All Platforms

Symptoms
I/O to one or more drives in a Sun StorEdge A5x00 Array subsystem becomes unresponsive or incurs retryable SCSI errors on a Solaris host.
The command "luxadm probe" shows individual drives instead of the SES devices for the IBs:

    Found Fibre Channel device(s):
      Node WWN:20000004cf6be6b7  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t0d0s2
      Node WWN:20000004cf7009a2  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t1d0s2
      Node WWN:20000004cf6bfd68  Device Type:Disk device
        Logical Path:/dev/rdsk/c9t4d0s2

NOTE: This is normal for some FC Multipacks and JBODs, but not for the A5x00.
The command "format" shows drive type unknown:

    12. c9t1d0 <drive type unknown>
        /sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf7009a2,0
    13. c9t5d0 <drive type unknown>
        /sbus@6,0/SUNW,socal@1,0/sf@1,0/ssd@w21000004cf6bde10,0
    14. c9t10d0 <drive type unknown>

The commands "luxadm -e dump_map" or "luxadm display enclosure_name" show all zeros for the WWN and WWPN of a drive:

luxadm -e dump_map:

    Pos AL_PA ID Hard_Addr Port WWN         Node WWN         Type
    0   1     7d 1         2007020000122498 5020020000122498 0x3  (Processor device, Host Bus Adapter)
    1   d2    d  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
    2   ef    0  41        0000000000000000 0000000000000000 0x1f (Unknown Type)
    3   e8    1  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
    4   e1    4  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
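The all-zero WWN symptom can be checked mechanically once the dump_map output has been saved as text. The following is a minimal Python sketch, not an Oracle-supplied tool. It assumes the seven-column layout of the sample output in this document, and the names ROW, SAMPLE, and zero_wwn_entries are hypothetical.

```python
import re

# One dump_map data row: Pos, AL_PA, ID, Hard_Addr, Port WWN, Node WWN, Type.
# The column layout is an assumption taken from the sample output in this document.
ROW = re.compile(
    r"^\s*([0-9a-f]+)\s+([0-9a-f]+)\s+([0-9a-f]+)\s+([0-9a-f]+)"
    r"\s+([0-9a-f]{16})\s+([0-9a-f]{16})\s+(0x[0-9a-f]+)",
    re.IGNORECASE,
)


def zero_wwn_entries(dump_map_text):
    """Return (pos, al_pa) pairs whose Port WWN and Node WWN are both all zeros."""
    suspects = []
    for line in dump_map_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or any other non-data line
        pos, al_pa, _id, _hard_addr, port_wwn, node_wwn, _type = match.groups()
        if int(port_wwn, 16) == 0 and int(node_wwn, 16) == 0:
            suspects.append((pos, al_pa.lower()))
    return suspects


# Rows copied from the sample output in this document.
SAMPLE = """\
Pos AL_PA ID Hard_Addr Port WWN         Node WWN         Type
0   1     7d 1         2007020000122498 5020020000122498 0x3  (Processor device, Host Bus Adapter)
1   d2    d  0         0000000000000000 0000000000000000 0x1f (Unknown Type)
2   ef    0  41        0000000000000000 0000000000000000 0x1f (Unknown Type)
"""

if __name__ == "__main__":
    for pos, al_pa in zero_wwn_entries(SAMPLE):
        print(f"Pos {pos} (AL_PA {al_pa}): Port WWN and Node WWN are all zeros")
```

Any entry the sketch reports corresponds to a loop participant that responded without a valid WWN, which is the condition described in the symptoms.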
Cause
Component failure in the Fibre Channel loop.

Solution
The difficulty in isolating faults in this FC-AL architecture is that any one participant can fail in a marginal fashion, still accepting I/O without causing an overt failure. This typically presents as a host hang, but can also appear as strange, partial, or corrupted output from the Solaris luxadm(1M) and format(1M) commands:

    format(1M)
    luxadm probe
    luxadm display enclosure_name
    luxadm -e dump_map enclosure_name

If any of the aforementioned symptoms are observed, or if one or more of these commands hang, contact Oracle Support immediately.

Additionally, it may be necessary to collect Read Link Status (RLS) data from the array on both HBA channels. RLS data is useful when viewed as a delta, or change, between two points in time. If you suspect a problem, RLS data can be collected by running:

    luxadm -e rdls enclosure_name

Example:

    # luxadm -e rdls A

    Link Error Status information for loop:/devices/sbus@6,0/SUNW,socal@1,0:0
    al_pa   lnk fail   sync loss   signal loss   sequence err   invalid word   CRC
    5a      0          13          16            0              0              0
    72      4          1186        0             0              0              0
    71      0          16          0             0              0              0
    6e      0          15          0             0              0              0
    6d      0          28          0             0              0              0
    6c      0          26          0             0              0              0
    6b      0          26          0             0              0              0
    6a      0          21          0             0              0              0
    45      0          0           11            0              0              0
    55      4          930         0             0              0              0
    54      0          19          0             0              0              0
    53      0          19          0             0              0              0
    52      0          19          0             0              0              0
    4e      0          19          0             0              0              0
    4d      0          18          0             0              0              0
    1       720896     0           0             0              0              0

NOTE: Remember, these outputs are cumulative since the last power cycle of the
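Because the RLS counters are cumulative, a single sample says little; what matters is the change between two samples. The following is a minimal Python sketch of that comparison, not an Oracle-supplied tool. It assumes each data row of saved "luxadm -e rdls" output is one AL_PA followed by six counters, as in the example above. The names COUNTERS, parse_rls, and rls_delta are hypothetical, and the values in SECOND are invented for illustration.

```python
# Counter names follow the column headings of the rdls example in this document.
COUNTERS = ("lnk_fail", "sync_loss", "signal_loss",
            "sequence_err", "invalid_word", "crc")


def parse_rls(text):
    """Return {al_pa: {counter: value}} for every data row in saved rdls output."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # not a data row (banner, heading, or blank line)
        try:
            int(fields[0], 16)                      # AL_PA is hexadecimal
            values = [int(f) for f in fields[1:]]   # counters are decimal
        except ValueError:
            continue
        rows[fields[0].lower()] = dict(zip(COUNTERS, values))
    return rows


def rls_delta(before, after):
    """Return {al_pa: {counter: change}} for counters that changed between samples.

    A negative change suggests the counters were reset between the samples.
    """
    delta = {}
    for al_pa, counters in after.items():
        if al_pa not in before:
            continue  # device absent from the first sample; nothing to compare
        changes = {name: counters[name] - before[al_pa][name]
                   for name in COUNTERS
                   if counters[name] != before[al_pa][name]}
        if changes:
            delta[al_pa] = changes
    return delta


# FIRST uses two rows from the example above; SECOND is an invented later sample.
FIRST = """\
al_pa   lnk fail   sync loss   signal loss   sequence err   invalid word   CRC
5a      0          13          16            0              0              0
72      4          1186        0             0              0              0
"""
SECOND = """\
al_pa   lnk fail   sync loss   signal loss   sequence err   invalid word   CRC
5a      0          13          16            0              0              0
72      4          1190        0             0              25             0
"""

if __name__ == "__main__":
    print(rls_delta(parse_rls(FIRST), parse_rls(SECOND)))
```

Recording the time of each sample alongside the output makes it possible to express the changes as a rate per hour.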
1) Each drive participates on two channels, A and B. To illustrate the basic loop, here is an A or B drive channel in full loop mode (front and rear backplanes joined):

    HBA in front port -> IB front port (SES chip) -> Drives slots 0-NN in the front -> IB rear port (SES chip) -> Drives slots 0-NN in rear

Any component that connects these devices, or passes information between them, can cause data transmission problems that present as a hang, an I/O timeout, or incorrect or incomplete output.

Depending on the drive manufacturer, the disk loop clock/timing may come from one or two clock chips on the disk. If the disk has one clock chip and it fails, both loops will be affected.

The storage array front panel can bypass many components within the array; this can be used to help isolate components during troubleshooting.
Fault Isolation:
Service Engineers should look for:

1) hangs on one path but not another
2) increasing RLS counters on a particular path
3) increasing RLS counters on a particular component

   It is a drive/backplane/midplane issue if the increase is occurring on both the A and B channels.
   The first drive in the dump_map output showing the RLS increase should be replaced, IF THERE ARE MULTIPLE INCREASES OVER A PERIOD OF TIME.
   Increases of fewer than 10 Invalid Word errors per hour are acceptable.
   Increases in CRC errors are never acceptable.
   Increases on all drives in a single channel indicate an IB fault. We recommend replacing the HBA, cable, GBIC, and IB board in this case.

   NOTE: RLS data MUST be used in a delta fashion (collect at least two samples of data).
4) After parts replacement, engineers should have the customer monitor the array daily and provide new RLS and /var/adm/messages output, or an Explorer, so that the RLS counters can be reviewed for any increases.
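The thresholds in the fault isolation list can be applied to a set of counter increases in a few lines. The following is a minimal Python sketch of those rules only, not an Oracle-supplied tool and not a substitute for engineering judgment. It assumes the increases for one channel have already been computed between two RLS samples and are supplied as a dictionary keyed by AL_PA. The function name assess_channel, the counter keys, and the sample values are hypothetical.

```python
# From the guidance above: fewer than 10 Invalid Word errors per hour is acceptable.
INVALID_WORD_LIMIT_PER_HOUR = 10


def assess_channel(delta, hours, drive_al_pas):
    """Apply the fault isolation thresholds to one channel.

    delta        -- {al_pa: {counter: increase}} between two RLS samples
    hours        -- time elapsed between the two samples, in hours
    drive_al_pas -- AL_PAs of the drives on this channel (supplied by the caller)

    Returns (findings, whole_channel): findings maps an AL_PA to the reasons it
    exceeds the thresholds; whole_channel is True when every listed drive shows
    an increase, the pattern the guidance associates with an IB fault.
    """
    findings = {}
    for al_pa, changes in delta.items():
        reasons = []
        if changes.get("crc", 0) > 0:
            reasons.append("CRC errors increased")
        if changes.get("invalid_word", 0) / hours >= INVALID_WORD_LIMIT_PER_HOUR:
            reasons.append("invalid word rate at or above 10 per hour")
        if reasons:
            findings[al_pa] = reasons

    increasing = {al_pa for al_pa, changes in delta.items()
                  if any(value > 0 for value in changes.values())}
    whole_channel = bool(drive_al_pas) and set(drive_al_pas) <= increasing
    return findings, whole_channel


if __name__ == "__main__":
    # Invented increases observed on one channel over a two-hour interval.
    sample_delta = {"72": {"sync_loss": 4, "invalid_word": 25},
                    "5a": {"crc": 1}}
    print(assess_channel(sample_delta, hours=2, drive_al_pas=["5a", "71", "72"]))
```

Running the same assessment on the A and B channels and comparing the results by drive is left to the reader, since it depends on how drives are identified on each loop.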
For a troubleshooting guide, refer to the attachment "A5000 Troubleshooting Guide" (attached to this KM document).
Attachments
This solution has no attachment