Solaris 10 Hung While Booting With Errors mpt_cmd_timeout: Restarting HBA - Failed Internal Disk Controlled by Mpt Driver

Asset ID:	1-72-1998049.1
Update Date:	2017-10-06

Solution Type Problem Resolution Sure

Solution 1998049.1 : Solaris 10 Hung While Booting With Errors mpt_cmd_timeout: Restarting HBA - Failed Internal Disk Controlled by Mpt Driver

Applies to:

Sun SPARC Enterprise M9000-32 Server - Version All Versions and later
Solaris Operating System - Version 10 9/10 U9 and later
Sun SPARC Enterprise M8000 Server - Version All Versions and later
Sun SPARC Enterprise M5000 Server - Version All Versions and later
Sun SPARC Enterprise M4000 Server - Version All Versions and later
Information in this document applies to any platform.

Symptoms

The Solaris 10 server continuously logs the disk error "Disconnected command timeout for Target 1", and the format command takes a very long time to complete. These are some of the error messages observed:

Mar 27 13:37:57 server01 scsi: [ID 107833 kern.warning] WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 27 13:37:57 server01 passthrough command timeout
Mar 27 13:37:57 server01 scsi: [ID 365881 kern.info] /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 27 13:37:57 server01 Rev. 2 LSI, Inc. 1064 found.
Mar 27 13:37:57 server01 scsi: [ID 365881 kern.info] /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 27 13:37:57 server01 mpt2 supports power management.
Mar 27 13:37:58 server01 scsi: [ID 365881 kern.info] /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 27 13:37:58 server01 mpt2: IOC Operational.
Mar 27 13:39:15 server01 scsi: [ID 107833 kern.warning] WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 27 13:39:15 server01 passthrough command timeout

Mar 31 11:06:09 server01 scsi: WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 31 11:06:09 server01        Disconnected command timeout for Target 1
Mar 31 11:07:19 server01 scsi: WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 31 11:07:19 server01        Disconnected command timeout for Target 1
Mar 31 11:08:30 server01 scsi: WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 31 11:08:30 server01        Disconnected command timeout for Target 1

Mar 31 11:16:53 server01 scsi: [ID 107833 kern.warning] WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 31 11:16:53 server01 Disconnected command timeout for Target 1
Mar 31 11:16:54 server01 scsi: [ID 243001 kern.info] /pci@10,600000/pci@0/scsi@1 (mpt2):
Mar 31 11:16:54 server01 mpt_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
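
The two-line mpt warnings above can be tallied per controller and target to see which device is timing out. The following is a minimal Python sketch and was not part of the original case: the name count_timeouts and the embedded SAMPLE lines are illustrative, and the regular expressions assume the syslog layout shown in the excerpts above.

```python
import re
from collections import Counter

# Header line of an mpt warning: captures the device path and driver instance.
HEADER = re.compile(r"WARNING: (\S+) \((mpt\d+)\):")
# Continuation line carrying the timeout text and the SCSI target number.
TIMEOUT = re.compile(r"Disconnected command timeout for Target (\d+)")

def count_timeouts(lines):
    """Count 'Disconnected command timeout' events per (path, instance, target)."""
    counts = Counter()
    current = None  # device named by the immediately preceding WARNING header
    for line in lines:
        header = HEADER.search(line)
        if header:
            current = header.groups()
            continue
        timeout = TIMEOUT.search(line)
        if timeout and current:
            counts[current + (int(timeout.group(1)),)] += 1
        current = None  # a message body only belongs to the header right before it
    return counts

# Sample lines copied from the excerpts in this document.
SAMPLE = [
    "Mar 27 13:39:15 server01 scsi: [ID 107833 kern.warning] WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):",
    "Mar 27 13:39:15 server01 passthrough command timeout",
    "Mar 31 11:06:09 server01 scsi: WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):",
    "Mar 31 11:06:09 server01        Disconnected command timeout for Target 1",
    "Mar 31 11:07:19 server01 scsi: WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):",
    "Mar 31 11:07:19 server01        Disconnected command timeout for Target 1",
    "Mar 31 11:08:30 server01 scsi: WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):",
    "Mar 31 11:08:30 server01        Disconnected command timeout for Target 1",
    "Mar 31 11:16:53 server01 scsi: [ID 107833 kern.warning] WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):",
    "Mar 31 11:16:53 server01 Disconnected command timeout for Target 1",
    "Mar 31 11:16:54 server01 scsi: [ID 243001 kern.info] /pci@10,600000/pci@0/scsi@1 (mpt2):",
    "Mar 31 11:16:54 server01 mpt_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000",
]

for (path, instance, target), n in count_timeouts(SAMPLE).items():
    print(f"{instance} {path} target {target}: {n} timeouts")
```

On a live system the same function could be given the lines of the system log (typically /var/adm/messages on Solaris) instead of SAMPLE.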

Disk c2t1d0 is not visible in format:

AVAILABLE DISK SELECTIONS:
      0. c0t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@0,600000/pci@0/scsi@1/sd@0,0
      1. c0t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@0,600000/pci@0/scsi@1/sd@1,0

      2. c2t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@10,600000/pci@0/scsi@1/sd@0,0

      3. c3t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@14,600000/pci@0/scsi@1/sd@0,0
      4. c3t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@14,600000/pci@0/scsi@1/sd@1,0
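
The warnings name a device path and a target number rather than a disk name, so it helps to map them onto the cXtYdZ names that format prints. The sketch below is a hedged illustration, not a supported tool: the helper names controllers_by_path and resolve are hypothetical, and it assumes LUN 0 (d0) and that the target number in the warning equals the t field of the disk name.

```python
import re

# "      0. c0t0d0 <SUN146G ...>" -> disk name and controller number
DISK = re.compile(r"^\s*\d+\.\s+(c(\d+)t\d+d\d+)\s")
# "         /pci@0,600000/pci@0/scsi@1/sd@0,0" -> HBA path without the sd node
PATH = re.compile(r"^\s*(/\S+)/sd@")

def controllers_by_path(format_output):
    """Map each HBA device path to (controller number, set of visible disks)."""
    result = {}
    disk = None  # (name, controller number) waiting for its path line
    for line in format_output.splitlines():
        m = DISK.match(line)
        if m:
            disk = (m.group(1), int(m.group(2)))
            continue
        p = PATH.match(line)
        if p and disk:
            _, names = result.setdefault(p.group(1), (disk[1], set()))
            names.add(disk[0])
            disk = None
    return result

def resolve(format_output, hba_path, target):
    """Return (expected disk name, visible?) for a path and target from a warning."""
    entry = controllers_by_path(format_output).get(hba_path)
    if entry is None:
        return None, False
    ctrl, names = entry
    name = f"c{ctrl}t{target}d0"  # assumption: LUN 0
    return name, name in names

# Disk list copied from the format output in this document.
SAMPLE = """\
      0. c0t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@0,600000/pci@0/scsi@1/sd@0,0
      1. c0t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@0,600000/pci@0/scsi@1/sd@1,0

      2. c2t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@10,600000/pci@0/scsi@1/sd@0,0

      3. c3t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@14,600000/pci@0/scsi@1/sd@0,0
      4. c3t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
         /pci@14,600000/pci@0/scsi@1/sd@1,0
"""

name, visible = resolve(SAMPLE, "/pci@10,600000/pci@0/scsi@1", 1)
print(name, "visible" if visible else "missing from format output")
```

For the path and target reported in the mpt2 warnings, this resolves to c2t1d0 and reports it as missing, which matches the listing above.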

The raidctl command also takes a very long time to complete. The output below was returned only after about 90 minutes:

# raidctl -l
Controller: 0
      Disk: 0.0.0
      Disk: 0.1.0
Controller: 2
      Disk: 0.0.0
Controller: 3
      Disk: 0.0.0
      Disk: 0.1.0
Controller: 7

There are known mpt driver problems involving "passthrough command timeout" errors and hangs. They are fixed in kernel patch 147147-26, mpt patch 150309-02, and finally kernel patch 150400-22. Because these patches were not installed, we recommended installing the latest recommended patches.

In particular, 150400-22 and later revisions contain fixes for these relevant bugs:

Bug 17594186 mpt_accept_tx_waitq: failed to accept cmd on queue and hang
Bug 19350198 mpt driver panic in mpt_do_passthru when system memory is low

After the Solaris 10 recommended patch cluster was installed, including kernel patch 150400-17, the system hung while booting with these errors on the console:

Apr 04 18:40:50 CEST 2015 WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Apr 04 18:40:50 CEST 2015 mpt_cmd_timeout: Restarting HBA
Apr 04 18:41:00 CEST 2015 WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Apr 04 18:41:00 CEST 2015 mpt_cmd_timeout: Restarting HBA

However, the system could still boot successfully with the previous kernel patch and mpt driver, although with the original errors for the failed disk:

SunOS server01 5.10 Generic_147440-19 sun4u sparc SUNW,SPARC-Enterprise
modinfo.out:163 7b284000 3a1c0 213 1 mpt (MPT HBA Driver v1.113)

Apr 04 19:52:00 CEST 2015     Apr 4 19:52:00 server01 scsi: WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Apr 04 19:52:01 CEST 2015     Apr 4 19:52:00 server01        Disconnected command timeout for Target 1
Apr 04 19:53:11 CEST 2015     Apr 4 19:53:11 server01 scsi: WARNING: /pci@10,600000/pci@0/scsi@1 (mpt2):
Apr 04 19:53:11 CEST 2015     Apr 4 19:53:11 server01        Disconnected command timeout for Target 1

Cause

This server boots from the internal disks c0t0d0 and c2t0d0 (mirrored under SVM), but c2t1d0 is defective.
The boot hang itself should not be caused by the hardware fault. We could be facing this bug:

Bug ID 20475363 mpt_cmd_timeout: Restarting HBA causing domain hang --> closed as duplicate of Bug ID 21348068 - stuck command following mpt_do_scsi_reset --> same as Bug 20237135 - stuck command following mpt_do_scsi_reset --> fixed by kernel patch 150400-28

or this one:

Bug ID 17594186 mpt_accept_tx_waitq: failed to accept cmd on queue and hang --> fixed by kernel patch 150400-22, not installed by customer --> Closed as duplicate of 20237135

Solution

Replace the internal disk c2t1d0 (on this M9000 server it was located at IOU#1/HDD#1). Replacing the disk solved the issue.

If the system cannot boot because of the failed disk, physically remove the disk. The system should then be able to boot.

Bug ID 20475363/21348068/20237135 has been observed with kernel patch 150400 at revisions 9 and 17, both lower than 19.

Several mpt bugs are fixed in kernel patches 150400-19 and 150400-22, and finally in 150400-28.

To avoid this issue when installing any new kernel patch on a system with disks under the mpt driver, make sure to install:

Solaris 10:

Sparc:
<SunPatch:150400-28> (or above) SunOS 5.10 Sparc : Kernel Patch
<SunPatch:150309-02> SunOS 5.10 Sparc : mpt.so patch

x86:
<SunPatch:150401-28> (or above) SunOS 5.10 x86 : Kernel Patch
<SunPatch:148877-04> SunOS 5.10 x86 : mpt.so patch

Solaris 11:

The fix is provided by Bug ID 20237135 - stuck command following mpt_do_scsi_reset, fixed in:
Solaris 11.2 SRU 11.2.14.5.0 (or greater)

The SRU is listed in:

Oracle Solaris 11.2 Support Repository Updates (SRU) Index (Doc ID 1672221.1)
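
For Solaris 10, the minimum kernel patch revisions listed above can be checked against the running kernel string (as printed by uname -v or shown in a uname -a line). This is a minimal sketch under stated assumptions: the function name check_kernel and its return labels are illustrative, only the 150400 (SPARC) and 150401 (x86) patch IDs named in this document are evaluated, and any other kernel patch ID is reported separately rather than ranked.

```python
import re

# Minimum revisions taken from this document: SPARC 150400-28, x86 150401-28.
MIN_REVISION = {"150400": 28, "150401": 28}

def check_kernel(version):
    """Classify a Solaris 10 kernel string such as 'Generic_150400-17'."""
    m = re.search(r"Generic_(\d{6})-(\d{2})", version)
    if not m:
        return "unrecognized"
    patch, revision = m.group(1), int(m.group(2))
    if patch not in MIN_REVISION:
        # A different kernel patch ID; this sketch does not try to rank it.
        return "other-patch"
    return "ok" if revision >= MIN_REVISION[patch] else "below-minimum"

for kernel in (
    "Generic_150400-17",
    "Generic_150400-28",
    "SunOS server01 5.10 Generic_147440-19 sun4u sparc SUNW,SPARC-Enterprise",
):
    print(kernel, "->", check_kernel(kernel))
```

A "below-minimum" result corresponds to the situation described in this document, where 150400-17 was installed and the boot hang appeared. The mpt patch level (150309-02 or 148877-04) still has to be verified separately.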

References

<BUG:17594186> - MPT_ACCEPT_TX_WAITQ: FAILED TO ACCEPT CMD ON QUEUE AND HANG
<BUG:20475363> - MPT_CMD_TIMEOUT: RESTARTING HBA CAUSING DOMAIN HANG
<BUG:21348068> - STUCK COMMAND FOLLOWING MPT_DO_SCSI_RESET
<BUG:20237135> - STUCK COMMAND FOLLOWING MPT_DO_SCSI_RESET
