Solaris Storage Driver Troubleshooting SCSI Transport Errors - Command Failed to Complete

Asset ID:	1-71-1001444.1
Update Date:	2017-12-30
Keywords:

Solution Type Technical Instruction Sure

Solution 1001444.1 : Solaris Storage Driver Troubleshooting SCSI Transport Errors - Command Failed to Complete - Command Timeout

Applies to:

Solaris Operating System - Version 10 3/05 and later
Sun Blade T6320 Server Module - Version All Versions to All Versions [Release All Releases]
Qlogic FC HBA - Version All Versions to All Versions [Release All Releases]
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on SPARC (32-bit)

Goal

This article addresses the SCSI transport errors and how they can be identified and investigated.

Solution

1. Introduction

In the Solaris Operating System, the architecture of the SCSI subsystem can be summarized by a stack of three layer:

upper/target layer: target device driver like
- sd: driver for disk device
- st: driver for tape device
- sg: generic driver
mid layer: MPxIO multipathing driver or other 3rd party multipathing driver such as Symantec's VxDMP or EMC's PowerPath
lower/transport layer: transport layer implementing various transport protocol (non exhaustive list of host bus adapters drivers included in Solaris Operating System)
- scsi: isp, esp, fas, glm, mpt drivers
- sas: mpt_sas, pmc drivers
- fc: fc/fcp and qlc or emlxs driver
- fcoe: fc/fcp and qlge
- iscsi: iscsi driver

Upon completion of an SCSI command the host adapter calls the completion routine provided by the target driver. The scsi_pkt structure member pkt_reason contains a completion code that indicates why the SCSI command completion function (pkt_comp) was called. The most often seen reason in case of transport error are:

CMD_INCOMPLETE: Transport stopped with abnormal state.
CMD_TRAN_ERR: Unspecified transport error.
CMD_RESET: SCSI bus reset destroyed command.
CMD_ABORTED: Command transport aborted on request.
CMD_TIMEOUT: Command timed out.
CMD_DEV_GONE: The device has been removed.

During the completion of the SCSI command, transport, multipathing and target driver could produce various error message reporting the error. The decision to retry the SCSI command or fail it will be taken by the target device driver with the exception of some retries done automatically by multipathing driver.
For example the sd target driver will retry each command after waiting 100ms (default sd restart timeout value) and up to the default count of three retries. After three retries if the command is still in error following message will be printed. Some command may also not be retried at all if the application specified to failfast them:

Aug 6 07:49:59 paris unix: WARNING: /pci@1f,0/pci@1/scsi@2/sd@2,0 (sd2):
Aug 6 07:49:59 paris unix: Error for Command: write ErrorLevel: Fatal

When investigating SCSI transport error it is needed to look at different source of information:

console and /var/adm/dmesg or /var/adm/messages* files
"iostat -iE" command
kstat command

Server performance can be degraded severely by scsi transport error depending on the type and amount of error being reported.

2. Transport error with reason 'reset'

The 'reset' reason indicates that after the SCSI packet was sent a reset occured on the bus and therefore prevented the completion of the packet.
This kind of error can happen on any type of transport protocol (scsi, sas, fc, etc...) and may be a indication of faulty component on the channel (target device, controller, cable, switch, etc...)

Reset can also happened during recovery decided by the hba or target driver issuing target reset (therefore affecting only one target device) or bus reset (affecting all target device).

a) parallel SCSI bus example

Aug 3 14:02:57 paris unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Aug 3 14:02:57 paris unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Aug 3 14:02:57 paris unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0):
Aug 3 14:02:57 paris unix: SCSI transport failed: reason 'reset': retrying command

In this case the tran_err error is due to a reset of the target device initiated by the controller in order to clear it's fault. At the end disk sd0 was replaced.

3. Transport error with reason 'tran_err'

This reason is by definition more vague and not necessarily implies and hardware issue. The error tells that the i/o chipset on the hba was not able to send the SCSI command to the target device or that the target or any other equipement on the channel returned it with this pkt_reason value.

a) parallel SCSI bus transport errors

This example shows transport error on parallel SCSI bus caused by parity error:

Mar 24 06:58:10 paris unix: warning: /pci@6,4000/scsi@4 (glm4):
Mar 24 06:58:10 paris unix: SCSI bus DATA IN phase parity error
Mar 24 06:58:10 paris unix: warning: ID[SUNWpd.glm.parity_check.6010]
Mar 24 06:58:10 paris unix: warning: /pci@6,4000/scsi@4 (glm4):
Mar 24 06:58:10 paris unix: Target 0 reducing sync. transfer rate
Mar 24 06:58:10 paris unix: warning: ID[SUNWpd.glm.sync_wide_backoff.6014]
Mar 24 06:58:10 paris unix: warning: /pci@6,4000/scsi@4/sd@0,0 (sd60):
Mar 24 06:58:10 paris unix: SCSI transport failed: reason 'tran_err': retrying command

In this the controller got transfer issue with the target 0 and decided to reduce the transfer rate and recover the fault using a target reset which generated the tran_err error.

Usually SCSI bus DATA phase parity errors are caused by bad cable and/or termination. It is also recommended to check for patches of hba driver and disk firmware.

The message "reducing sync. transfer rate" indicates that the hba driver negotiated a lower channel transfer rate with the specified target. As the transfer speed is reduced, it is expected to have performance issue. The transfer rate can only be raised by a reconfiguration of the target device with "cfgadm -c unconfigure/configure" or a host reboot.

b) Fiber channel transport errors

In this example the mid layer multipathing driver VxDMP also reported the error and took action (disabling a path), One can see the error was not transient and forced the target driver to fail the command (giving up):

Nov 27 12:45:28 paris scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w50060e8004599921,14 (ssd30):
Nov 27 12:45:28 paris SCSI transport failed: reason 'tran_err': retrying command
Nov 27 12:45:29 paris scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w50060e8004599921,14 (ssd30):
Nov 27 12:45:29 paris SCSI transport failed: reason 'tran_err': giving up
Nov 27 12:45:29 paris vxdmp: [ID 480808 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0xf0 belonging to the dmpnode 310/0x78 due to path failure
Nov 27 12:45:58 paris vxdmp: [ID 736771 kern.notice] NOTICE: VxVM vxdmp V-5-0-148 enabled path 118/0xf0 belonging to the dmpnode 310/0x78
Nov 27 13:01:58 paris scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w50060e8004599921,14 (ssd30):
Nov 27 13:01:58 paris SCSI transport failed: reason 'tran_err': giving up

The disk ssd30 was later replaced.

In recent version of MPxIO, the driver will print the path on which the transport error occured:

Jan 13 22:29:53 paris scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jan 13 22:29:53 paris /scsi_vhci/disk@g60060e8005478d000000478d0000401a (sd15): Command failed to complete (3) on path fp6/disk@w50060e8005478d5d,5
Jan 13 22:29:53 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g60060e8005478d000000478d0000401a (sd15):
Jan 13 22:29:53 paris SCSI transport failed: reason 'tran_err': retrying command

Transport error can be caused by a target device port going offline. Notice how MPxIO triggered a path failover. As no more error were reported by the kernel the retry was successful:

Sep 4 11:37:25 paris scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Sep 4 11:37:25 paris        /scsi_vhci/ssd@g600a0b8000170e2b000020c04ecf6c96 (ssd1): Command failed to complete (3) on path fp2/ssd@w200600a0b8170e2c,0
Sep 4 11:37:25 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600a0b8000170e2b000020c04ecf6c96 (ssd1):
Sep 4 11:37:25 paris        SCSI transport failed: reason 'tran_err': retrying command
Sep 4 11:37:26 paris fctl: [ID 517869 kern.warning] WARNING: fp(2)::GPN_ID for D_ID=10300 failed
Sep 4 11:37:26 paris fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=10300, PWWN=200600a0b8170e2c disappeared from fabric
Sep 4 11:37:45 paris scsi: [ID 243001 kern.info] /pci@1d,700000/SUNW,qlc@1/fp@0,0 (fcp2):
Sep 4 11:37:45 paris        offlining lun=0 (trace=0), target=10300 (trace=2800004)
Sep 4 11:37:46 paris genunix: [ID 483743 kern.info] /scsi_vhci/ssd@g600a0b8000170e2b000020c04ecf6c96 (ssd1) multipath status: degraded: path 1 fp2/ssd@w200600a0b8170e2c,0 is offline
Sep 4 11:37:46 paris scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Sep 4 11:37:46 paris        Initiating failover for device ssd (GUID 600a0b8000170e2b000020c04ecf6c96)
Sep 4 11:37:46 paris scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Sep 4 11:37:46 paris        Failover operation completed successfully for device ssd (GUID 600a0b8000170e2b000020c04ecf6c96): failed over from primary to secondary

At this point one need to investigate why the target device port gone offline.

Typical complex to troubleshoot issue are when you have a mix of transport error which occur intermittently and does not trigger command to fail:

Dec 1 10:11:26 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8):
Dec 1 10:11:26 paris SCSI transport failed: reason 'timeout': retrying command
Dec 1 13:49:47 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8):
Dec 1 13:49:47 paris SCSI transport failed: reason 'tran_err': retrying command
Dec 1 18:56:14 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d54712f345a616a33346e (ssd16):
Dec 1 18:56:14 paris SCSI transport failed: reason 'timeout': retrying command
Dec 2 00:28:28 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b3436 (ssd10):
Dec 2 00:28:28 paris SCSI transport failed: reason 'tran_err': retrying command
Dec 2 15:20:39 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8):
Dec 2 15:20:39 paris SCSI transport failed: reason 'timeout': retrying command
Dec 2 17:51:59 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8):
Dec 2 17:51:59 paris SCSI transport failed: reason 'tran_err': retrying command
Dec 2 22:13:13 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8):
Dec 2 22:13:13 paris SCSI transport failed: reason 'timeout': retrying command

c) Conclusion

Transport error with reason 'tran_err' are not always easy to investigate. Troubleshooting these errors request to carefully check the various error counter on hba, switch port, target device and compare with different systems which can access the same target device. As these error can also be caused by bug in multipathing or hba driver, it is strongly recommended to check patch level.

4. Transport timeout

Transport timeout error occurs when the SCSI command has been processed by the hba driver and the SCSI packet completion did not happened after the default target device timeout (for the sd/ssd driver the default packet timeout value is 60 sec). This can be caused by:

non responding target device or FC switches.
hba firmware bug.
hba / transport drive bug.

These are theorically possible but unlikely to be seen:

systemic issue (HW interrupt not processed). Imply that you have a specific issue with HW interrupt _and_ the cyclic/callout monitoring hba state and queue is able to run
drift in system clock (hba driver look frequently at outstanding request and check if the (current time - stored packet start time) is not superior to the packet timeout, when the clock drift or is adjusted it changes the value of the current time)

a) Examples

The following example t shows intermittent issue on a private loop affecting all target device. This can be cause by faulty channel or spurious error generated by one of the target. Looking at counter statistic of the hba using "luxadm -e rdls /dev/cfg/cN" may indicates where the issue is:

Jun 6 19:16:34 paris unix: ID[SUNWssa.socal.link.5010] socal1: port 0:Fibre Channel is OFFLINE
Jun 6 19:16:34 paris unix: ID[SUNWssa.socal.link.6010] socal1: port 0:Fibre Channel Loop is ONLINE
Jun 6 19:17:49 paris unix: WARNING: /sbus@2,0/SUNW,socal@2,0/sf@0, 0/ssd@w22000020370ed7ff,0 (ssd16):
Jun 6 19:17:49 paris unix: SCSI transport failed: reason 'timeout':retrying command
Jun 6 19:19:14 paris unix: WARNING: /sbus@2,0/SUNW,socal@2,0/sf@0, 0/ssd@w22000020370edd68,0 (ssd12):
Jun 6 19:19:14 paris unix: SCSI transport failed:reason 'timeout':retrying command
Jun 6 19:20:44 paris unix: WARNING: /sbus@2,0/SUNW,socal@2,0/sf@0, 0/ssd@w22000020370edd69,0 (ssd13):
Jun 6 19:20:44 paris unix: SCSI transport failed: reason 'timeout':retrying command

The above errors we generated by faulty controller which was replaced.

This example is typical of a failing internal SCSI hard disk drive which will need to be replaced. This can be confirmed by the number of error reported for that device in "iostat -iE" output and the fact the the error is isolated to that specific device :

Mar 10 01:36:05 paris scsi: [ID 107833 kern.warning] WARNING: /pci@0,600000/pci@0/pci@8/pci@0/scsi@1 (mpt0):
Mar 10 01:36:05 paris Disconnected command timeout for Target 1
Mar 10 01:36:05 paris scsi: [ID 365881 kern.info] /pci@0,600000/pci@0/pci@8/pci@0/scsi@1 (mpt0):
Mar 10 01:36:05 paris Log info 31140000 received for target 1.
Mar 10 01:36:05 paris scsi_status=0, ioc_status=8048, scsi_state=c
Mar 10 01:37:15 paris scsi: [ID 107833 kern.warning] WARNING: /pci@0,600000/pci@0/pci@8/pci@0/scsi@1 (mpt0):
Mar 10 01:37:15 paris Disconnected command timeout for Target 1
Mar 10 01:37:16 paris scsi: [ID 365881 kern.info] /pci@0,600000/pci@0/pci@8/pci@0/scsi@1 (mpt0):
Mar 10 01:37:16 paris Log info 31140000 received for target 1.
Mar 10 01:37:16 paris scsi_status=0, ioc_status=8048, scsi_state=c

The hard disk driver on target 1 was replaced.

In this example you can see how a faulty hard disk drive ssd11 can impact other due to bus reset initiated to recover from a failing target device on a private FC loop:

The sd instance ssd11 is having intermittent transport issues:
Apr 23 06:15:56 paris scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Apr 23 06:15:56 paris /scsi_vhci/ssd@g200000008793253e (ssd11): Command Timeout on path /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fp1)
Apr 23 06:15:56 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g200000008793253e (ssd11):
Apr 23 06:15:56 paris SCSI transport failed: reason 'timeout': retrying command
Apr 23 06:15:56 paris scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Apr 23 06:15:56 paris /scsi_vhci/ssd@g200000008793253e (ssd11): Command Timeout on path /pci@8,600000/SUNW,qlc@2/fp@0,0 (fp0)
Apr 23 06:16:07 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g200000008793253e (ssd11):
Apr 23 06:16:07 paris SCSI transport failed: reason 'tran_err': retrying command
...
Apr 24 19:52:43 paris /scsi_vhci/ssd@g200000008793253e (ssd11): Command Timeout on path /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fp1)
Apr 24 19:52:43 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g200000008793253e (ssd11):
Apr 24 19:52:43 paris SCSI transport failed: reason 'timeout': retrying command
...
Apr 25 00:05:30 paris qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(1): Loop OFFLINE
Apr 25 00:05:30 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g200000008793253e (ssd11):
Apr 25 00:05:30 paris SCSI transport failed: reason 'tran_err': retrying command <-- this transport error is a consequence of the loop offline/online events
Apr 25 00:06:06 paris qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(1): Loop ONLINE
Apr 25 00:06:10 paris fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to ef failed state=Packet Transport error, reason=No Connection
Apr 25 00:06:10 paris fctl: [ID 517869 kern.warning] WARNING: fp(1)::PLOGI to ef failed. state=e reason=5.
Apr 25 00:06:10 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g500000e014b56cf0 (ssd22):
Apr 25 00:06:10 paris Error for Command: write(10) Error Level: Retryable
Apr 25 00:06:10 paris scsi: [ID 107833 kern.notice] Requested Block: 98719853 Error Block: 98719853
Apr 25 00:06:10 paris scsi: [ID 107833 kern.notice] Vendor: FUJITSU Serial Number: 0709C0BECH
Apr 25 00:06:10 paris scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Apr 25 00:06:10 paris scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0 <-- this packet was in the qlc(1) driver at the time the loop reseated and returned in error to the target device so that it can trigger retry
...
Apr 25 00:06:30 paris scsi: [ID 243001 kern.info] /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fcp1):
Apr 25 00:06:30 paris offlining lun=0 (trace=0), target=ef (trace=2800004)
Apr 25 00:06:30 paris genunix: [ID 834635 kern.info] /scsi_vhci/ssd@g200000008793253e (ssd11) multipath status: degraded, path /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fp1) to target address: w220000008793253e,0 is offline Load balancing: round-robin
Apr 25 00:43:49 paris scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Apr 25 00:43:49 paris /scsi_vhci/ssd@g2000000087482d40 (ssd10): Command Timeout on path /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fp1)
...
Apr 25 06:12:03 paris fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to ef failed state=Packet Transport error, reason=No Connection

The disk ssd11 was replaced.

b) Conclusion

Timeout transport on a parallel SCSI bus, SAS bus or FC private loop (directly connected no FC Switch) are more likely generated by misbehaving component.

On FC switched fabric topology, "Transport timeout" may indicate an issue on fc switches (loss FC frame) or Storage array (lack of performance to process all request i/o, reboot of controller, etc...). Especially if when "Command Timeout on path ..." type errors are not accompanied with scsi transport errors.

In any case, on the server side, one should have a close look at all the error counter and check if patch or firmware level are not at a too low revision. When FC Switches and external storage arrays are involved, they should also be checked for issues as well.

5. Summary of action to take

review carefully console and messages files. Look for any occurence.
check error counters to determine if the error affects particular device:
- use "iostat -iE" to check disk errors.
- use "luxadm -e rdls /dev/cfg/cN" or Solaris 10 or above "fcinfo hba-port -l" command to check Fiber Channel port statistics.
if no persistent hardware error can be found, check for driver and firmware level of all component.

6. References:

man page:

buf(9S)
strategy(9E)
scsi(4)
scsi_pkt(9S)
scsi_transport(9F)

List of scsi key code: http://www.t10.org/lists/asc-num.htm

References

<NOTE:1012309.1> - Sun Storage 3310 SCSI Array: Resolving SCSI Bus Resets and Transport Errors When Using the Ultra320 Host Bus Adapter
<NOTE:1283727.1> - Troubleshooting Fibre Channel HBA Performance
<NOTE:1004694.1> - SCSI Troubleshooting on Sun Systems
<NOTE:1346234.1> - How to gather key data and information for Oracle Disk array products, to minimise problem diagnosis and resolution times
<NOTE:1502843.1> - SAN Fibre Channel (FC) Storage Connectivity Issues

Attachments

This solution has no attachment