![]() | Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition | ||
|
|
![]() |
||||||||||||
Solution Type Technical Instruction Sure Solution 1001444.1 : Solaris Storage Driver Troubleshooting SCSI Transport Errors - Command Failed to Complete - Command Timeout
PreviouslyPublishedAs 201952 Applies to:Solaris Operating System - Version 10 3/05 and laterSun Blade T6320 Server Module - Version All Versions to All Versions [Release All Releases] Qlogic FC HBA - Version All Versions to All Versions [Release All Releases] Oracle Solaris on x86-64 (64-bit) Oracle Solaris on SPARC (64-bit) Oracle Solaris on SPARC (32-bit) GoalThis article addresses the SCSI transport errors and how they can be identified and investigated. Solution1. IntroductionIn the Solaris Operating System, the architecture of the SCSI subsystem can be summarized by a stack of three layer:
Upon completion of an SCSI command the host adapter calls the completion routine provided by the target driver. The scsi_pkt structure member pkt_reason contains a completion code that indicates why the SCSI command completion function (pkt_comp) was called. The most often seen reason in case of transport error are:
During the completion of the SCSI command, transport, multipathing and target driver could produce various error message reporting the error. The decision to retry the SCSI command or fail it will be taken by the target device driver with the exception of some retries done automatically by multipathing driver. Aug 6 07:49:59 paris unix: WARNING: /pci@1f,0/pci@1/scsi@2/sd@2,0 (sd2):
Aug 6 07:49:59 paris unix: Error for Command: write ErrorLevel: Fatal When investigating SCSI transport error it is needed to look at different source of information:
Server performance can be degraded severely by scsi transport error depending on the type and amount of error being reported.
2. Transport error with reason 'reset'The 'reset' reason indicates that after the SCSI packet was sent a reset occured on the bus and therefore prevented the completion of the packet. Reset can also happened during recovery decided by the hba or target driver issuing target reset (therefore affecting only one target device) or bus reset (affecting all target device).
a) parallel SCSI bus example
Aug 3 14:02:57 paris unix: WARNING: /pci@1f,4000/scsi@3 (glm0):
Aug 3 14:02:57 paris unix: WARNING: /pci@1f,4000/scsi@3 (glm0): Aug 3 14:02:57 paris unix: WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd0): Aug 3 14:02:57 paris unix: SCSI transport failed: reason 'reset': retrying command In this case the tran_err error is due to a reset of the target device initiated by the controller in order to clear it's fault. At the end disk sd0 was replaced.
3. Transport error with reason 'tran_err'This reason is by definition more vague and not necessarily implies and hardware issue. The error tells that the i/o chipset on the hba was not able to send the SCSI command to the target device or that the target or any other equipement on the channel returned it with this pkt_reason value. a) parallel SCSI bus transport errorsThis example shows transport error on parallel SCSI bus caused by parity error: Mar 24 06:58:10 paris unix: warning: /pci@6,4000/scsi@4 (glm4):
Mar 24 06:58:10 paris unix: SCSI bus DATA IN phase parity error Mar 24 06:58:10 paris unix: warning: ID[SUNWpd.glm.parity_check.6010] Mar 24 06:58:10 paris unix: warning: /pci@6,4000/scsi@4 (glm4): Mar 24 06:58:10 paris unix: Target 0 reducing sync. transfer rate Mar 24 06:58:10 paris unix: warning: ID[SUNWpd.glm.sync_wide_backoff.6014] Mar 24 06:58:10 paris unix: warning: /pci@6,4000/scsi@4/sd@0,0 (sd60): Mar 24 06:58:10 paris unix: SCSI transport failed: reason 'tran_err': retrying command In this the controller got transfer issue with the target 0 and decided to reduce the transfer rate and recover the fault using a target reset which generated the tran_err error. Usually SCSI bus DATA phase parity errors are caused by bad cable and/or termination. It is also recommended to check for patches of hba driver and disk firmware. The message "reducing sync. transfer rate" indicates that the hba driver negotiated a lower channel transfer rate with the specified target. As the transfer speed is reduced, it is expected to have performance issue. The transfer rate can only be raised by a reconfiguration of the target device with "cfgadm -c unconfigure/configure" or a host reboot.
b) Fiber channel transport errors
In this example the mid layer multipathing driver VxDMP also reported the error and took action (disabling a path), One can see the error was not transient and forced the target driver to fail the command (giving up): Nov 27 12:45:28 paris scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w50060e8004599921,14 (ssd30):
Nov 27 12:45:28 paris SCSI transport failed: reason 'tran_err': retrying command Nov 27 12:45:29 paris scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w50060e8004599921,14 (ssd30): Nov 27 12:45:29 paris SCSI transport failed: reason 'tran_err': giving up Nov 27 12:45:29 paris vxdmp: [ID 480808 kern.notice] NOTICE: VxVM vxdmp V-5-0-112 disabled path 118/0xf0 belonging to the dmpnode 310/0x78 due to path failure Nov 27 12:45:58 paris vxdmp: [ID 736771 kern.notice] NOTICE: VxVM vxdmp V-5-0-148 enabled path 118/0xf0 belonging to the dmpnode 310/0x78 Nov 27 13:01:58 paris scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w50060e8004599921,14 (ssd30): Nov 27 13:01:58 paris SCSI transport failed: reason 'tran_err': giving up The disk ssd30 was later replaced.
Jan 13 22:29:53 paris scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jan 13 22:29:53 paris /scsi_vhci/disk@g60060e8005478d000000478d0000401a (sd15): Command failed to complete (3) on path fp6/disk@w50060e8005478d5d,5 Jan 13 22:29:53 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g60060e8005478d000000478d0000401a (sd15): Jan 13 22:29:53 paris SCSI transport failed: reason 'tran_err': retrying command
Sep 4 11:37:25 paris scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Sep 4 11:37:25 paris /scsi_vhci/ssd@g600a0b8000170e2b000020c04ecf6c96 (ssd1): Command failed to complete (3) on path fp2/ssd@w200600a0b8170e2c,0 Sep 4 11:37:25 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g600a0b8000170e2b000020c04ecf6c96 (ssd1): Sep 4 11:37:25 paris SCSI transport failed: reason 'tran_err': retrying command Sep 4 11:37:26 paris fctl: [ID 517869 kern.warning] WARNING: fp(2)::GPN_ID for D_ID=10300 failed Sep 4 11:37:26 paris fctl: [ID 517869 kern.warning] WARNING: fp(2)::N_x Port with D_ID=10300, PWWN=200600a0b8170e2c disappeared from fabric Sep 4 11:37:45 paris scsi: [ID 243001 kern.info] /pci@1d,700000/SUNW,qlc@1/fp@0,0 (fcp2): Sep 4 11:37:45 paris offlining lun=0 (trace=0), target=10300 (trace=2800004) Sep 4 11:37:46 paris genunix: [ID 483743 kern.info] /scsi_vhci/ssd@g600a0b8000170e2b000020c04ecf6c96 (ssd1) multipath status: degraded: path 1 fp2/ssd@w200600a0b8170e2c,0 is offline Sep 4 11:37:46 paris scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0): Sep 4 11:37:46 paris Initiating failover for device ssd (GUID 600a0b8000170e2b000020c04ecf6c96) Sep 4 11:37:46 paris scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0): Sep 4 11:37:46 paris Failover operation completed successfully for device ssd (GUID 600a0b8000170e2b000020c04ecf6c96): failed over from primary to secondary At this point one need to investigate why the target device port gone offline.
Dec 1 10:11:26 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8):
Dec 1 10:11:26 paris SCSI transport failed: reason 'timeout': retrying command Dec 1 13:49:47 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8): Dec 1 13:49:47 paris SCSI transport failed: reason 'tran_err': retrying command Dec 1 18:56:14 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d54712f345a616a33346e (ssd16): Dec 1 18:56:14 paris SCSI transport failed: reason 'timeout': retrying command Dec 2 00:28:28 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b3436 (ssd10): Dec 2 00:28:28 paris SCSI transport failed: reason 'tran_err': retrying command Dec 2 15:20:39 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8): Dec 2 15:20:39 paris SCSI transport failed: reason 'timeout': retrying command Dec 2 17:51:59 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8): Dec 2 17:51:59 paris SCSI transport failed: reason 'tran_err': retrying command Dec 2 22:13:13 paris scsi: WARNING: /scsi_vhci/ssd@g60a98000572d5471534a5a424d4b345a (ssd8): Dec 2 22:13:13 paris SCSI transport failed: reason 'timeout': retrying command c) ConclusionTransport error with reason 'tran_err' are not always easy to investigate. Troubleshooting these errors request to carefully check the various error counter on hba, switch port, target device and compare with different systems which can access the same target device. As these error can also be caused by bug in multipathing or hba driver, it is strongly recommended to check patch level.
4. Transport timeoutTransport timeout error occurs when the SCSI command has been processed by the hba driver and the SCSI packet completion did not happened after the default target device timeout (for the sd/ssd driver the default packet timeout value is 60 sec). This can be caused by:
These are theorically possible but unlikely to be seen:
a) Examples
The following example t shows intermittent issue on a private loop affecting all target device. This can be cause by faulty channel or spurious error generated by one of the target. Looking at counter statistic of the hba using "luxadm -e rdls /dev/cfg/cN" may indicates where the issue is: Jun 6 19:16:34 paris unix: ID[SUNWssa.socal.link.5010] socal1: port 0:Fibre Channel is OFFLINE
Jun 6 19:16:34 paris unix: ID[SUNWssa.socal.link.6010] socal1: port 0:Fibre Channel Loop is ONLINE Jun 6 19:17:49 paris unix: WARNING: /sbus@2,0/SUNW,socal@2,0/sf@0, 0/ssd@w22000020370ed7ff,0 (ssd16): Jun 6 19:17:49 paris unix: SCSI transport failed: reason 'timeout':retrying command Jun 6 19:19:14 paris unix: WARNING: /sbus@2,0/SUNW,socal@2,0/sf@0, 0/ssd@w22000020370edd68,0 (ssd12): Jun 6 19:19:14 paris unix: SCSI transport failed:reason 'timeout':retrying command Jun 6 19:20:44 paris unix: WARNING: /sbus@2,0/SUNW,socal@2,0/sf@0, 0/ssd@w22000020370edd69,0 (ssd13): Jun 6 19:20:44 paris unix: SCSI transport failed: reason 'timeout':retrying command The above errors we generated by faulty controller which was replaced.
Mar 10 01:36:05 paris scsi: [ID 107833 kern.warning] WARNING: /pci@0,600000/pci@0/pci@8/pci@0/scsi@1 (mpt0):
Mar 10 01:36:05 paris Disconnected command timeout for Target 1 Mar 10 01:36:05 paris scsi: [ID 365881 kern.info] /pci@0,600000/pci@0/pci@8/pci@0/scsi@1 (mpt0): Mar 10 01:36:05 paris Log info 31140000 received for target 1. Mar 10 01:36:05 paris scsi_status=0, ioc_status=8048, scsi_state=c Mar 10 01:37:15 paris scsi: [ID 107833 kern.warning] WARNING: /pci@0,600000/pci@0/pci@8/pci@0/scsi@1 (mpt0): Mar 10 01:37:15 paris Disconnected command timeout for Target 1 Mar 10 01:37:16 paris scsi: [ID 365881 kern.info] /pci@0,600000/pci@0/pci@8/pci@0/scsi@1 (mpt0): Mar 10 01:37:16 paris Log info 31140000 received for target 1. Mar 10 01:37:16 paris scsi_status=0, ioc_status=8048, scsi_state=c The hard disk driver on target 1 was replaced. In this example you can see how a faulty hard disk drive ssd11 can impact other due to bus reset initiated to recover from a failing target device on a private FC loop: The sd instance ssd11 is having intermittent transport issues:
Apr 23 06:15:56 paris scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0): Apr 23 06:15:56 paris /scsi_vhci/ssd@g200000008793253e (ssd11): Command Timeout on path /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fp1) Apr 23 06:15:56 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g200000008793253e (ssd11): Apr 23 06:15:56 paris SCSI transport failed: reason 'timeout': retrying command Apr 23 06:15:56 paris scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0): Apr 23 06:15:56 paris /scsi_vhci/ssd@g200000008793253e (ssd11): Command Timeout on path /pci@8,600000/SUNW,qlc@2/fp@0,0 (fp0) Apr 23 06:16:07 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g200000008793253e (ssd11): Apr 23 06:16:07 paris SCSI transport failed: reason 'tran_err': retrying command ... Apr 24 19:52:43 paris /scsi_vhci/ssd@g200000008793253e (ssd11): Command Timeout on path /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fp1) Apr 24 19:52:43 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g200000008793253e (ssd11): Apr 24 19:52:43 paris SCSI transport failed: reason 'timeout': retrying command ... Apr 25 00:05:30 paris qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(1): Loop OFFLINE Apr 25 00:05:30 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g200000008793253e (ssd11): Apr 25 00:05:30 paris SCSI transport failed: reason 'tran_err': retrying command <-- this transport error is a consequence of the loop offline/online events Apr 25 00:06:06 paris qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(1): Loop ONLINE Apr 25 00:06:10 paris fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to ef failed state=Packet Transport error, reason=No Connection Apr 25 00:06:10 paris fctl: [ID 517869 kern.warning] WARNING: fp(1)::PLOGI to ef failed. state=e reason=5. Apr 25 00:06:10 paris scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g500000e014b56cf0 (ssd22): Apr 25 00:06:10 paris Error for Command: write(10) Error Level: Retryable Apr 25 00:06:10 paris scsi: [ID 107833 kern.notice] Requested Block: 98719853 Error Block: 98719853 Apr 25 00:06:10 paris scsi: [ID 107833 kern.notice] Vendor: FUJITSU Serial Number: 0709C0BECH Apr 25 00:06:10 paris scsi: [ID 107833 kern.notice] Sense Key: Unit Attention Apr 25 00:06:10 paris scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0 <-- this packet was in the qlc(1) driver at the time the loop reseated and returned in error to the target device so that it can trigger retry ... Apr 25 00:06:30 paris scsi: [ID 243001 kern.info] /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fcp1): Apr 25 00:06:30 paris offlining lun=0 (trace=0), target=ef (trace=2800004) Apr 25 00:06:30 paris genunix: [ID 834635 kern.info] /scsi_vhci/ssd@g200000008793253e (ssd11) multipath status: degraded, path /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fp1) to target address: w220000008793253e,0 is offline Load balancing: round-robin Apr 25 00:43:49 paris scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0): Apr 25 00:43:49 paris /scsi_vhci/ssd@g2000000087482d40 (ssd10): Command Timeout on path /pci@9,700000/pci@2/SUNW,qlc@4/fp@0,0 (fp1) ... Apr 25 06:12:03 paris fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to ef failed state=Packet Transport error, reason=No Connection The disk ssd11 was replaced.
b) ConclusionTimeout transport on a parallel SCSI bus, SAS bus or FC private loop (directly connected no FC Switch) are more likely generated by misbehaving component.
5. Summary of action to take
6. References:man page:
List of scsi key code: http://www.t10.org/lists/asc-num.htm
References<NOTE:1012309.1> - Sun Storage 3310 SCSI Array: Resolving SCSI Bus Resets and Transport Errors When Using the Ultra320 Host Bus Adapter<NOTE:1283727.1> - Troubleshooting Fibre Channel HBA Performance <NOTE:1004694.1> - SCSI Troubleshooting on Sun Systems <NOTE:1346234.1> - How to gather key data and information for Oracle Disk array products, to minimise problem diagnosis and resolution times <NOTE:1502843.1> - SAN Fibre Channel (FC) Storage Connectivity Issues Attachments This solution has no attachment |
||||||||||||
|