Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1627528.1
Update Date:2018-03-06
Keywords:

Solution Type  Problem Resolution Sure

Solution  1627528.1 :   T-Series servers with LSI106x(e) (mpt driver) / LSI2x08 (mpt_sas driver) controllers report "Command Timeout" for all internal disks (HDD/SSD), caused by an mpt/mpt_sas driver bug


Related Items
  • SPARC T4-2
  • Sun SPARC Enterprise T5440 Server
  • SPARC T4-4
  • Solaris Operating System
  • Sun SPARC Enterprise T5140 Server
  • Sun SPARC Enterprise T5120 Server
  • Sun SPARC Enterprise T5220 Server
  • SPARC T3-4
  • SPARC T5-4
  • SPARC T5-2
  • SPARC T4-1
  • Sun SPARC Enterprise T5240 Server
  • SPARC T3-1
  • SPARC T3-2
  • SPARC T5-8
Related Categories
  • PLA-Support>Sun Systems>SPARC>CMT>SN-SPARC: T4




In this Document
Symptoms
Cause
Solution
References


Created from <SR 3-8231098081>

Applies to:

SPARC T5-2 - Version All Versions to All Versions [Release All Releases]
SPARC T3-4 - Version Not Applicable to Not Applicable [Release N/A]
SPARC T5-4 - Version All Versions to All Versions [Release All Releases]
SPARC T5-8 - Version All Versions to All Versions [Release All Releases]
Solaris Operating System - Version 10 3/05 to 11.1 [Release 10.0 to 11.0]
Oracle Solaris on SPARC (64-bit)

Symptoms

The system reported fault event PCIEX-8000-0A, and SCSI command timeouts were logged against all internal disks, in this case Solid State Disks (SSDs) in a T4-2 system.
The server responded very slowly (primary and alternate domains as well as zones). After a reboot the system initially ran without problems, but the issue later reappeared.

Cause

  • All 6 internal disks were affected in a T4-2 system:

// messages:
...
Dec 12 20:21:14 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:21:14 xxxxx Disconnected command timeout for Target 11
Dec 12 20:21:14 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:21:14 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 20:21:14 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Dec 12 20:21:14 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 20:22:24 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:22:24 xxxxx Disconnected command timeout for Target 12
Dec 12 20:22:24 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 20:22:24 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 20:22:24 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
...
Dec 12 21:14:05 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 21:14:25 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:14:25 xxxxx Disconnected command timeout for Target 9
Dec 12 21:14:25 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:14:25 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
...
Dec 12 21:15:15 xxxxx /scsi_vhci/disk@g5001517bb2a00ccf (sd8): Command Timeout on path mpt_sas5/disk@w5001517bb2a00ccf,0
Dec 12 21:15:35 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:15:35 xxxxx Disconnected command timeout for Target 10
Dec 12 21:15:35 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:15:35 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
...
Dec 12 21:16:25 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 21:16:25 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:16:25 xxxxx Disconnected command timeout for Target 13
Dec 12 21:16:25 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:16:25 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 21:16:25 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Dec 12 21:16:25 xxxxx /scsi_vhci/disk@g5001517bb2a63b66 (sd6): Command Timeout on path mpt_sas3/disk@w5001517bb2a63b66,0
Dec 12 21:17:25 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:17:25 xxxxx Disconnected command timeout for Target 12
Dec 12 21:17:25 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:17:25 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 21:17:25 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Dec 12 21:17:25 xxxxx /scsi_vhci/disk@g5001517bb2a00ccf (sd8): Command Timeout on path mpt_sas5/disk@w5001517bb2a00ccf,0
Dec 12 21:17:35 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:17:35 xxxxx Disconnected command timeout for Target 11
Dec 12 21:17:35 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:17:35 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 21:17:35 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Dec 12 21:17:35 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 21:18:25 xxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:18:25 xxxxx Disconnected command timeout for Target 14
Dec 12 21:18:25 xxxxx scsi: [ID 243001 kern.info] /pci@400/pci@2/pci@0/pci@e/scsi@0 (mpt_sas0):
Dec 12 21:18:25 xxxxx mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Dec 12 21:18:25 xxxxx scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
...
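
The excerpt above shows timeouts against six distinct targets (9 through 14) on the same controller, mpt_sas0. A minimal Python sketch for tallying such entries is shown here; the function name `summarize_timeouts` and the sample lines are illustrative assumptions, and the regular expressions only cover the two message formats visible in this excerpt.

```python
import re
from collections import Counter

# Patterns match the two message formats shown in the excerpt above.
TARGET_RE = re.compile(r"Disconnected command timeout for Target (\d+)")
PATH_RE = re.compile(r"\((sd\d+)\): Command Timeout on path (\S+)")

def summarize_timeouts(lines):
    """Count timeouts per controller target and per sd instance."""
    targets = Counter()
    disks = Counter()
    for line in lines:
        m = TARGET_RE.search(line)
        if m:
            targets[int(m.group(1))] += 1
        m = PATH_RE.search(line)
        if m:
            disks[m.group(1)] += 1
    return targets, disks

# A few lines copied from the excerpt (hostname already masked as xxxxx).
SAMPLE = """\
Dec 12 20:21:14 xxxxx Disconnected command timeout for Target 11
Dec 12 20:21:14 xxxxx /scsi_vhci/disk@g5001517bb2a0d272 (sd7): Command Timeout on path mpt_sas4/disk@w5001517bb2a0d272,0
Dec 12 20:22:24 xxxxx Disconnected command timeout for Target 12
Dec 12 21:14:25 xxxxx Disconnected command timeout for Target 9
Dec 12 21:17:35 xxxxx Disconnected command timeout for Target 11
""".splitlines()

targets, disks = summarize_timeouts(SAMPLE)
print(sorted(targets.items()))
print(sorted(disks.items()))
```

On a live system the same function could be fed the lines of the system log (typically /var/adm/messages on Solaris). Many different targets timing out behind one controller points away from a single failing disk.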

 

  • In this case the disks are SSDs in a T4-2 system:
// From iostat-En.out: 
  Disk                      Size Soft Hard Trans Media Ready NoDev Recov Illeg PFlAn Product
  c0t5001517BB2AEA51Dd0 300.07GB    0  217   228     0     0   217     0     0     0 ATA INTEL SSDSA2BZ30
  c0t5001517BB29E8C83d0 300.07GB    0   30   104     0     0    30     0     0     0 ATA INTEL SSDSA2BZ30
  c2t0d0                  0.00GB    0    4     0     0     4     0     0     2     0 AMI Virtual CDROM
  c0t50015179596FDEB7d0 300.07GB    0   15    15     0     0    15     0     0     0 ATA INTEL SSDSA2BZ30
  c0t5001517BB2A63B66d0 300.07GB    0   15    15     0     0    15     0     0     0 ATA INTEL SSDSA2BZ30
  c0t5001517BB2A00CCFd0 300.07GB    0  313   429     0     0   313     0     0     0 ATA INTEL SSDSA2BZ30
  c0t5001517BB2A01004d0 300.07GB    0   30   131     0     0    30     0     0     0 ATA INTEL SSDSA2BZ30
  c3t7d0                  0.00GB    0  242    16     0   242     0     0     0     0 TEAC DV-W28SS-V
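
In the table above, every SSD shows a Hard error count equal to its NoDev count and zero Media errors, while the two optical devices do not follow that pattern. The Python sketch below checks rows for this pattern. It assumes the whitespace-separated summary layout shown here (not raw `iostat -En` output), and the helper names `parse_row` and `no_device_only` are hypothetical. Reading the pattern as "device unreachable rather than media defect" is an interpretation, not part of the original analysis.

```python
# Counter columns in the order used by the summary table above.
FIELDS = ["soft", "hard", "trans", "media", "ready",
          "nodev", "recov", "illeg", "pflan"]

def parse_row(row):
    """Split one summary row into name, size, counters and product."""
    parts = row.split(None, 11)  # 2 leading fields + 9 counters + product
    name, size = parts[0], parts[1]
    counters = dict(zip(FIELDS, map(int, parts[2:11])))
    product = parts[11].strip() if len(parts) > 11 else ""
    return name, size, counters, product

def no_device_only(counters):
    """True when all hard errors are 'no device' errors and none are media errors."""
    return (counters["hard"] > 0
            and counters["hard"] == counters["nodev"]
            and counters["media"] == 0)

# Three rows copied from the table above.
ROWS = [
    "c0t5001517BB2AEA51Dd0 300.07GB 0 217 228 0 0 217 0 0 0 ATA INTEL SSDSA2BZ30",
    "c2t0d0 0.00GB 0 4 0 0 4 0 0 2 0 AMI Virtual CDROM",
    "c0t5001517BB2A00CCFd0 300.07GB 0 313 429 0 0 313 0 0 0 ATA INTEL SSDSA2BZ30",
]

flagged = [name for name, _, counters, _ in map(parse_row, ROWS)
           if no_device_only(counters)]
print(flagged)
```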

 

  • The ereports that led to FMA event PCIEX-8000-0A show that the onboard SAS controller no longer responded:
// fmadm faulty:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Dec 09 15:46:07 5aa6bbad-89a9-4857-bb9a-d89378a5d7db PCIEX-8000-0A Critical

Problem Status  : solved
Diag Engine     : eft / 1.16
System
  Manufacturer  : unknown
  Name          : ORCL,SPARC-T4-2
  Part_Number   : unknown
  Serial_Number : xxxxxx
  Host_ID       : xxxxxx

----------------------------------------
Suspect 1 of 1  :
  Fault class   : fault.io.pciex.device-interr
  Certainty     : 100%
  Affects       : dev:////pci@400/pci@2/pci@0/pci@e/scsi@0
  Status        : faulted but still in service

  FRU
  Location      : "/SYS/MB"
  Manufacturer  : unknown
  Name          : unknown
  Part_Number   : 7049060
  Revision      : 01
  Serial_Number : xxxxxxx
  Chassis
  Manufacturer  : unknown
  Name          : ORCL,SPARC-T4-2
  Part_Number   : 31414590+1+1
  Serial_Number : xxxxxxx
  Status        : faulty

Description     : A problem was detected for a PCIEX device.
Response        : One or more device instances may be disabled
Impact          : Loss of services provided by the device instances associated with this fault
Action          : Use 'fmadm faulty' to provide a more detailed view of this event.  Please refer to the associated reference document at http://support.oracle.com/msg/PCIEX-8000-0A for the latest service procedures and policies regarding this diagnosis.

 

// fmdump-eVu_2204ab2b-5e61-4254-c645-b88233bd4f28.out
TIME CLASS
Dec 26 2013 23:10:05.279330795 ereport.io.service.lost
nvlist version: 0
class         = ereport.io.service.lost
ena           = 0xdd65b701a4f03401
detector      = (embedded nvlist)
nvlist version: 0
version       = 0x0
scheme        = dev
cna_dev       = 0x52b059bf00000075
device-path   = /pci@400/pci@2/pci@0/pci@e/scsi@0
(end detector)
__ttl         = 0x1
__tod         = 0x52bca93d 0x10a63feb

Dec 26 2013 23:10:05.279300360 ereport.io.device.no_response
nvlist version: 0
class         = ereport.io.device.no_response
ena           = 0xdd65b6fa49103401
detector      = (embedded nvlist)
nvlist version: 0
version       = 0x0
scheme        = dev
cna_dev       = 0x52b059bf00000075
device-path   = /pci@400/pci@2/pci@0/pci@e/scsi@0
(end detector)
__ttl         = 0x1
__tod         = 0x52bca93d 0x10a5c908
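
The `__tod` member of each ereport appears to hold two hexadecimal values: seconds since the Unix epoch and a nanosecond fraction. The Python sketch below decodes the pair from the first ereport; the function name `decode_tod` is hypothetical. The nanosecond part matches the fraction printed by fmdump exactly (279330795), and the decoded time is one hour behind the printed 23:10:05, which would be consistent with the host running in a UTC+1 time zone (an assumption; the time zone is not stated in the output).

```python
from datetime import datetime, timezone

def decode_tod(sec_hex, nsec_hex):
    """Convert a __tod pair (hex seconds, hex nanoseconds) to a UTC string."""
    seconds = int(sec_hex, 16)
    nanos = int(nsec_hex, 16)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return f"{stamp:%Y-%m-%d %H:%M:%S}.{nanos:09d} UTC"

# Values taken from the ereport.io.service.lost event above.
print(decode_tod("0x52bca93d", "0x10a63feb"))
# prints: 2013-12-26 22:10:05.279330795 UTC
```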

 

  • Two SSDs had already been replaced, which did not improve the situation.
  • A SAS controller fault was suspected because the controller no longer responded, so the motherboard was replaced. This did not help either.
  • The unresponsive controller therefore appears to have been an aftereffect of another issue, the one causing the command timeouts.
  • The system ran Oracle Solaris 11.1 SRU 2.5.

 

Solution

  • The command timeouts were caused by Bug 16245585: "Disconnected command timeout for Target" scsi warning.
  • According to the bug report, high load appears to trigger these errors, but they have also been seen on idle systems.
  • A second, related fix is also important: Bug 15875298: mpt does not recover from "Disconnected command timeout" with failing disks.
  • Both Bug 16245585 and Bug 15875298 are fixed in:
    • Solaris 11.1 SRU 12.5.0 or higher
    • Solaris 10 with Kernel Patch 150400-09 or higher

  • The customer upgraded to the latest Solaris release available at the time (11.1 SRU 14.5), and the SCSI timeouts disappeared (monitored for two weeks).
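
The minimum fix levels listed above can be checked with a simple comparison, sketched below in Python. The function names are hypothetical, the version strings are assumed to look like "11.1 SRU 2.5" and "150400-09", and the sketch only covers Solaris 11.1 SRUs and this one kernel patch ID; it does not know about later releases or superseding patches.

```python
import re

MIN_SRU = (12, 5, 0)   # Solaris 11.1 SRU 12.5.0, from the list above
MIN_PATCH_REV = 9      # Kernel Patch 150400-09, from the list above

def parse_sru(text):
    """Parse a string such as '11.1 SRU 2.5' into a tuple such as (2, 5, 0)."""
    m = re.search(r"SRU\s+(\d+(?:\.\d+)*)", text)
    if not m:
        raise ValueError(f"no SRU level found in {text!r}")
    parts = [int(p) for p in m.group(1).split(".")]
    parts += [0] * (3 - len(parts))  # pad missing components with zeros
    return tuple(parts)

def sru_has_fix(text):
    """True if a Solaris 11.1 SRU string is at or above the fixed level."""
    return parse_sru(text) >= MIN_SRU

def patch_has_fix(patch_id):
    """True if patch_id is revision 09 or later of patch 150400."""
    base, _, rev = patch_id.partition("-")
    return base == "150400" and int(rev) >= MIN_PATCH_REV

print(sru_has_fix("11.1 SRU 2.5"))    # level the affected system ran
print(sru_has_fix("11.1 SRU 14.5"))   # level after the customer's upgrade
print(patch_has_fix("150400-09"))
```

Comparing the components as integer tuples avoids the pitfall of string comparison, where "2.5" would sort after "12.5.0".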

 

 Note:
  • This bug has also been seen with hard disk drives.
  • The PCIEX-8000-0A fault event does not occur in every case.



References

<NOTE:1501435.1> - Oracle Solaris 11.1 Support Repository Updates (SRU) Index
<BUG:16245585> - "DISCONNECTED COMMAND TIMEOUT FOR TARGET" SCSI WARNING.
<BUG:15875298> - MPT DOES NOT RECOVER FROM "DISCONNECTED COMMAND TIMEOUT" WITH FAILING DISKS

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.