Asset ID: 1-72-1627528.1
Update Date: 2018-03-06
Solution Type: Problem Resolution Sure
Solution 1627528.1: T-Series servers with LSI106x(e) (mpt driver) / LSI2x08 (mpt_sas driver) controllers report "Command Timeout" for all internal disks (HDD/SSD), caused by an mpt/mpt_sas driver bug
Related Items
- SPARC T4-2
- Sun SPARC Enterprise T5440 Server
- SPARC T4-4
- Solaris Operating System
- Sun SPARC Enterprise T5140 Server
- Sun SPARC Enterprise T5120 Server
- Sun SPARC Enterprise T5220 Server
- SPARC T3-4
- SPARC T5-4
- SPARC T5-2
- SPARC T4-1
- Sun SPARC Enterprise T5240 Server
- SPARC T3-1
- SPARC T3-2
- SPARC T5-8
Related Categories
- PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4
Created from <SR 3-8231098081>
Applies to:
SPARC T5-2 - Version All Versions to All Versions [Release All Releases]
SPARC T3-4 - Version Not Applicable to Not Applicable [Release N/A]
SPARC T5-4 - Version All Versions to All Versions [Release All Releases]
SPARC T5-8 - Version All Versions to All Versions [Release All Releases]
Solaris Operating System - Version 10 3/05 to 11.1 [Release 10.0 to 11.0]
Oracle Solaris on SPARC (64-bit)
Symptoms
The system reported fault event PCIEX-8000-0A, and SCSI command timeouts were reported against all internal disks, in this case Solid State Disks (SSDs) in a T4-2 system.
The server responded very slowly (primary and alternate domains as well as zones). After a reboot the system worked again without any problems, but the issue reappeared.
Cause
- All six internal disks in the T4-2 system were affected:
// messages:
...
Dec 12 20:21:14 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:21:14 xxxxx Disconnected command timeout for Target 11
Dec 12 20:21:14 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:21:14 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 20:21:14 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Dec 12 20:21:14 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 20:22:24 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:22:24 xxxxx Disconnected command timeout for Target 12
Dec 12 20:22:24 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:22:24 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 20:22:24 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
...
Dec 12 21:14:05 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 21:14:25 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:14:25 xxxxx Disconnected command timeout for Target 9
Dec 12 21:14:25 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:14:25 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
...
Dec 12 21:15:15 xxxxx /scsi_vhci/disk@g5001517bb2a00ccf (sd8): Command Timeout on path mpt_sas5/disk@w5001517bb2a00ccf,0
Dec 12 21:15:35 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:15:35 xxxxx Disconnected command timeout for Target 10
Dec 12 21:15:35 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:15:35 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
...
Dec 12 21:16:25 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 21:16:25 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:16:25 xxxxx Disconnected command timeout for Target 13
Dec 12 21:16:25 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:16:25 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 21:16:25 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Dec 12 21:16:25 xxxxx /scsi_vhci/disk@g5001517bb2a63b66 (sd6): Command Timeout on path mpt_sas3/disk@w5001517bb2a63b66,0
Dec 12 21:17:25 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:17:25 xxxxx Disconnected command timeout for Target 12
Dec 12 21:17:25 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:17:25 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 21:17:25 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Dec 12 21:17:25 xxxxx /scsi_vhci/disk@g5001517bb2a00ccf (sd8): Command Timeout on path mpt_sas5/disk@w5001517bb2a00ccf,0
Dec 12 21:17:35 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:17:35 xxxxx Disconnected command timeout for Target 11
Dec 12 21:17:35 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:17:35 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 21:17:35 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Dec 12 21:17:35 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 21:18:25 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:18:25 xxxxx Disconnected command timeout for Target 14
Dec 12 21:18:25 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:18:25 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 21:18:25 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
...
- In this case the disks are SSDs in a T4-2 system:
// From iostat-En.out:
Disk                  Size      Soft  Hard  Trans  Media  Ready  NoDev  Recov  Illeg  PFlAn  Product
c0t5001517BB2AEA51Dd0 300.07GB     0   217    228      0      0    217      0      0      0  ATA INTEL SSDSA2BZ30
c0t5001517BB29E8C83d0 300.07GB     0    30    104      0      0     30      0      0      0  ATA INTEL SSDSA2BZ30
c2t0d0                0.00GB       0     4      0      0      4      0      0      2      0  AMI Virtual CDROM
c0t50015179596FDEB7d0 300.07GB     0    15     15      0      0     15      0      0      0  ATA INTEL SSDSA2BZ30
c0t5001517BB2A63B66d0 300.07GB     0    15     15      0      0     15      0      0      0  ATA INTEL SSDSA2BZ30
c0t5001517BB2A00CCFd0 300.07GB     0   313    429      0      0    313      0      0      0  ATA INTEL SSDSA2BZ30
c0t5001517BB2A01004d0 300.07GB     0    30    131      0      0     30      0      0      0  ATA INTEL SSDSA2BZ30
c3t7d0                0.00GB       0   242     16      0    242      0      0      0      0  TEAC DV-W28SS-V
- The ereports that led to FMA event PCIEX-8000-0A show that the onboard SAS controller no longer responds:
// fmadm faulty:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Dec 09 15:46:07 5aa6bbad-89a9-4857-bb9a-d89378a5d7db PCIEX-8000-0A Critical
Problem Status : solved
Diag Engine : eft / 1.16
System
Manufacturer : unknown
Name : ORCL,SPARC-T4-2
Part_Number : unknown
Serial_Number : xxxxxx
Host_ID : xxxxxx
----------------------------------------
Suspect 1 of 1 :
Fault class : fault.io.pciex.device-interr
Certainty : 100%
Affects : dev:////pci@400/pci@2/pci@0/pci@e/scsi@0
Status : faulted but still in service
FRU
Location : "/SYS/MB"
Manufacturer : unknown
Name : unknown
Part_Number : 7049060
Revision : 01
Serial_Number : xxxxxxx
Chassis
Manufacturer : unknown
Name : ORCL,SPARC-T4-2
Part_Number : 31414590+1+1
Serial_Number : xxxxxxx
Status : faulty
Description : A problem was detected for a PCIEX device.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with this fault
Action : Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at
http://support.oracle.com/msg/PCIEX-8000-0A for the latest service procedures and policies regarding this diagnosis.
// fmdump-eVu_2204ab2b-5e61-4254-c645-b88233bd4f28.out
TIME CLASS
Dec 26 2013 23:10:05.279330795 ereport.io.service.lost
nvlist version: 0
class = ereport.io.service.lost
ena = 0xdd65b701a4f03401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
cna_dev = 0x52b059bf00000075
device-path = /pci@400/pci@2/pci@0/pci@e/scsi@0
(end detector)
__ttl = 0x1
__tod = 0x52bca93d 0x10a63feb
Dec 26 2013 23:10:05.279300360 ereport.io.device.no_response
nvlist version: 0
class = ereport.io.device.no_response
ena = 0xdd65b6fa49103401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
cna_dev = 0x52b059bf00000075
device-path = /pci@400/pci@2/pci@0/pci@e/scsi@0
(end detector)
__ttl = 0x1
__tod = 0x52bca93d 0x10a5c908
- Two SSDs had already been replaced, which did not improve the situation.
- A SAS controller fault was suspected because the controller no longer responded, so the motherboard was replaced. This did not help either.
- The non-responding controller appears to have been an aftereffect of another issue, the one causing the command timeouts.
- The system ran Oracle Solaris 11.1 SRU 2.5.
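When every internal disk reports timeouts, it can help to tally the warnings per target and per disk to confirm that the problem is not confined to one device. The following is a minimal Python sketch, assuming the two message formats shown in the messages excerpt above; the function name tally_timeouts and the SAMPLE text are illustrative only and are not part of any Oracle tool.

```python
import re
from collections import Counter

# A few lines copied from the messages excerpt in this document
# (hostname already masked as "xxxxx").
SAMPLE = """\
Dec 12 20:21:14 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:21:14 xxxxx Disconnected command timeout for Target 11
Dec 12 20:21:14 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 20:22:24 xxxxx Disconnected command timeout for Target 12
Dec 12 21:17:35 xxxxx Disconnected command timeout for Target 11
"""

# Patterns are assumptions based only on the two message formats quoted above.
TARGET_RE = re.compile(r"Disconnected command timeout for Target (\d+)")
PATH_RE = re.compile(r"\((sd\d+)\): Command Timeout on path (\S+)")


def tally_timeouts(lines):
    """Count timeout warnings per controller target and per sd instance."""
    targets = Counter()
    disks = Counter()
    for line in lines:
        match = TARGET_RE.search(line)
        if match:
            targets[int(match.group(1))] += 1
            continue
        match = PATH_RE.search(line)
        if match:
            disks[match.group(1)] += 1
    return targets, disks


if __name__ == "__main__":
    targets, disks = tally_timeouts(SAMPLE.splitlines())
    for target, count in sorted(targets.items()):
        print(f"Target {target}: {count} disconnected command timeout(s)")
    for disk, count in sorted(disks.items()):
        print(f"{disk}: {count} command timeout(s)")
```

To use it on a real system, pass the lines of a saved messages file to tally_timeouts instead of SAMPLE. Timeouts spread across all targets, as in this case, point away from a single failing disk.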
Solution
- The command timeouts were caused by Bug 16245585: "Disconnected command timeout for Target" scsi warning.
- According to the bug report, high load appears to trigger these errors, but they were also seen on idle systems.
- There is another important bug fix: Bug 15875298: mpt does not recover from "Disconnected command timeout" with failing disks.
- Both Bug 16245585 and Bug 15875298 are fixed in:
  - Solaris 11.1 SRU 12.5.0 or higher
  - Solaris 10 with Kernel Patch 150400-09 or higher
- The customer upgraded to the latest available Solaris release (11.1 SRU 14.5) and the SCSI timeouts disappeared (monitored for two weeks).
Note:
- This bug has also been seen with hard disk drives.
- In other cases the PCIEX-8000-0A fault ID might not occur.
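The fix levels listed in the Solution can be compared against an installed level as sketched below. This is a minimal Python sketch, not an Oracle utility: the helper names are hypothetical, the SRU string is assumed to be a dotted number such as "2.5" or "12.5.0", and the patch string is assumed to have the form "150400-NN". How those strings are obtained from the system is outside the scope of the sketch.

```python
import re

# Fix levels quoted in this document.
FIXED_S11_1_SRU = (12, 5, 0)      # Solaris 11.1 SRU 12.5.0
FIXED_S10_PATCH = ("150400", 9)   # Solaris 10 Kernel Patch 150400-09


def parse_sru(text):
    """Turn an SRU string such as '2.5' or '12.5.0' into a 3-tuple of ints.

    Missing trailing components are treated as 0 (an assumption).
    """
    parts = [int(p) for p in text.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def s11_1_has_fix(sru):
    """True if a Solaris 11.1 SRU level is at or above SRU 12.5.0."""
    return parse_sru(sru) >= FIXED_S11_1_SRU


def s10_has_fix(patch):
    """True if the string names Kernel Patch 150400 at revision 09 or higher.

    Only revisions of patch 150400 are compared; a different patch ID
    returns False, even if it is a later patch that supersedes 150400.
    """
    match = re.fullmatch(r"(\d{6})-(\d{2})", patch.strip())
    if not match:
        raise ValueError(f"unexpected patch format: {patch!r}")
    base, rev = match.group(1), int(match.group(2))
    return base == FIXED_S10_PATCH[0] and rev >= FIXED_S10_PATCH[1]


if __name__ == "__main__":
    # SRU 2.5 is the level the affected system ran; 14.5 is the level
    # the customer upgraded to.
    for sru in ("2.5", "14.5"):
        print(f"Solaris 11.1 SRU {sru}: fix present = {s11_1_has_fix(sru)}")
```

Tuple comparison keeps the check simple: each component is compared numerically from left to right, so "14.5" ranks above "12.5.0" without any string handling.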
References
<NOTE:1501435.1> - Oracle Solaris 11.1 Support Repository Updates (SRU) Index
<BUG:16245585> - "DISCONNECTED COMMAND TIMEOUT FOR TARGET" SCSI WARNING.
<BUG:15875298> - MPT DOES NOT RECOVER FROM "DISCONNECTED COMMAND TIMEOUT" WITH FAILING DISKS