Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1545687.1
Update Date:2016-12-22
Keywords:

Solution Type  Problem Resolution Sure

Solution  1545687.1 :   In certain conditions, the LSI 1068E firmware may hang  


Related Items
  • Sun Fire X4540 Server
Related Categories
  • PLA-Support>Sun Systems>x86>Server>SN-x64: AMD-STOR-SERVER




In this Document
Symptoms
Cause
Solution
References


Applies to:

Sun Fire X4540 Server - Version Not Applicable to Not Applicable [Release N/A]
Oracle Solaris on x86-64 (64-bit)

Symptoms

 I/O appears to hang with the following entries in the messages file: 

Oct 26 09:09:33 xxxx scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0/sd@3,0 (sd29):
Oct 26 09:09:33 xxxx    Error for Command: read(10)                Error Level: Retryable
Oct 26 09:09:33 xxxx scsi: [ID 107833 kern.notice]      Requested Block: 330359991                 Error Block: 330360071
Oct 26 09:09:33 xxxx scsi: [ID 107833 kern.notice]      Vendor: ATA                                Serial Number: 9SF13ATP
Oct 26 09:09:33 xxxx scsi: [ID 107833 kern.notice]      Sense Key: Media Error
Oct 26 09:09:33 xxxx scsi: [ID 107833 kern.notice]      ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
Oct 26 09:09:46 xxxx scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0/sd@3,0 (sd29):
Oct 26 09:09:46 xxxx    Error for Command: read(10)                Error Level: Retryable
Oct 26 09:09:46 xxxx scsi: [ID 107833 kern.notice]      Requested Block: 327290721                 Error Block: 327290862
Oct 26 09:09:46 xxxx scsi: [ID 107833 kern.notice]      Vendor: ATA                                Serial Number: 9SF13ATP
Oct 26 09:09:46 xxxx scsi: [ID 107833 kern.notice]      Sense Key: Media Error
Oct 26 09:09:46 xxxx scsi: [ID 107833 kern.notice]      ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
Oct 26 09:09:50 xxxx scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0/sd@3,0 (sd29):
Oct 26 09:09:50 xxxx    Error for Command: read(10)                Error Level: Retryable
Oct 26 09:09:50 xxxx scsi: [ID 107833 kern.notice]      Requested Block: 327290721                 Error Block: 327290862
Oct 26 09:09:50 xxxx scsi: [ID 107833 kern.notice]      Vendor: ATA                                Serial Number: 9SF13ATP
Oct 26 09:09:50 xxxx scsi: [ID 107833 kern.notice]      Sense Key: Media Error
Oct 26 09:09:50 xxxx scsi: [ID 107833 kern.notice]      ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0
    :
Oct 26 09:47:22 xxxx scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0 (mpt2):
Oct 26 09:47:22 xxxx    SAS Discovery Error on port 3. DiscoveryStatus is DiscoveryStatus is |Unaddressable device found|
Oct 26 09:48:26 xxxx scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0 (mpt2):
Oct 26 09:48:26 xxxx    Disconnected command timeout for Target 3
    :
Oct 26 12:04:34 xxxx SOURCE: zfs-diagnosis, REV: 1.0
Oct 26 12:04:34 xxxx EVENT-ID: 3b0bf893-a701-e1e7-80f9-8b04fe02d8bb
Oct 26 12:04:34 xxxx DESC: The number of I/O errors associated with a ZFS device exceeded
Oct 26 12:04:34 xxxx         acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Oct 26 12:04:34 xxxx AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
Oct 26 12:04:34 xxxx         will be made to activate a hot spare if available.
Oct 26 12:04:34 xxxx IMPACT: Fault tolerance of the pool may be compromised.


Sep 19 05:09:14 yyyy scsi: [ID 243001 kern.info] /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Sep 19 05:09:14 yyyy    mpt_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31123000
Sep 19 05:09:35 yyyy scsi: [ID 107833 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@6,0 (sd32):
Sep 19 05:09:35 yyyy    Error for Command: write(10)               Error Level: Retryable
Sep 19 05:09:35 yyyy scsi: [ID 107833 kern.notice]      Requested Block: 328770879                 Error Block: 328770879
Sep 19 05:09:35 yyyy scsi: [ID 107833 kern.notice]      Vendor: ATA                                Serial Number: 9SF15RB5
Sep 19 05:09:35 yyyy scsi: [ID 107833 kern.notice]      Sense Key: Unit Attention
Sep 19 05:09:35 yyyy scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Sep 19 05:13:28 yyyy scsi: [ID 243001 kern.info] /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Sep 19 05:13:28 yyyy    mpt_handle_event_sync: IOCLogInfo=0x31123000
Sep 19 05:13:28 yyyy scsi: [ID 243001 kern.info] /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Sep 19 05:13:28 yyyy    mpt_handle_event: IOCLogInfo=0x31123000
      :
Sep 19 05:13:37 yyyy scsi: [ID 107833 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@6,0 (sd32):
Sep 19 05:13:37 yyyy    Error for Command: write(10)               Error Level: Retryable
Sep 19 05:13:37 yyyy scsi: [ID 107833 kern.notice]      Requested Block: 329686969                 Error Block: 329686969
Sep 19 05:13:37 yyyy scsi: [ID 107833 kern.notice]      Vendor: ATA                                Serial Number: 9SF15RB5
Sep 19 05:13:37 yyyy scsi: [ID 107833 kern.notice]      Sense Key: Unit Attention
Sep 19 05:13:37 yyyy scsi: [ID 107833 kern.notice]      ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Sep 19 05:20:26 yyyy scsi: [ID 107833 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Sep 19 05:20:26 yyyy    Disconnected command timeout for Target 6
Sep 19 05:20:28 yyyy scsi: [ID 243001 kern.info] /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Sep 19 05:20:28 yyyy    mpt_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31140000
Sep 19 05:20:28 yyyy scsi: [ID 107833 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@6,0 (sd32):
Sep 19 05:20:28 yyyy    SCSI transport failed: reason 'timeout': retrying command
    :
Sep 19 05:46:35 yyyy scsi: [ID 243001 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Sep 19 05:46:35 yyyy    SAS Discovery Error on port 6. DiscoveryStatus is DiscoveryStatus is |Unaddressable device found|
Sep 19 05:47:41 yyyy scsi: [ID 107833 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Sep 19 05:47:41 yyyy    Disconnected command timeout for Target 6
    :
Sep 19 06:10:26 yyyy fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-FD, TYPE: Fault, VER: 1, SEVERITY: Major
Sep 19 06:10:26 yyyy EVENT-TIME: Wed Sep 19 06:10:26 EDT 2012
Sep 19 06:10:26 yyyy PLATFORM: Sun-Fire-X4540, CSN: 0000000000, HOSTNAME: yyyy
Sep 19 06:10:26 yyyy SOURCE: zfs-diagnosis, REV: 1.0
Sep 19 06:10:26 yyyy EVENT-ID: 0836161b-3b9e-6d50-da39-9783f831c4bb
Sep 19 06:10:26 yyyy DESC: The number of I/O errors associated with a ZFS device exceeded
Sep 19 06:10:26 yyyy         acceptable levels.  Refer to http://sun.com/msg/ZFS-8000-FD for more information.
Sep 19 06:10:26 yyyy AUTO-RESPONSE: The device has been offlined and marked as faulted.  An attempt
Sep 19 06:10:26 yyyy         will be made to activate a hot spare if available.
Sep 19 06:10:26 yyyy IMPACT: Fault tolerance of the pool may be compromised.
Sep 19 06:10:26 yyyy REC-ACTION: Run 'zpool status -x' and replace the bad device.


In both examples above, I/O hung from the time the "SAS Discovery Error" message was logged until ZFS faulted the drive (the zfs-diagnosis event).
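The hang window described above can be estimated from a messages file with a short script. The following is a minimal sketch, not a supported tool: the function name `hang_windows` and the `year` parameter are hypothetical (syslog timestamps carry no year, so one has to be supplied), and the match strings are taken only from the log excerpts quoted in this document.

```python
import re
from datetime import datetime

# Syslog prefix as seen in the excerpts: "Oct 26 09:47:22 host text..."
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")

# Match strings copied from the messages quoted in this document.
DISCOVERY = "SAS Discovery Error on port"
ZFS_FAULT = "SOURCE: zfs-diagnosis"


def hang_windows(lines, year=2012):
    """Return (host, start, end) tuples, one per span from the first
    'SAS Discovery Error' on a host to the next zfs-diagnosis fault
    logged by that host. 'year' is an assumption supplied by the caller."""
    start = {}    # host -> time of first discovery error not yet closed
    windows = []
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue  # skip blank lines and elision markers such as ":"
        stamp, host, text = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if DISCOVERY in text:
            start.setdefault(host, when)
        elif ZFS_FAULT in text and host in start:
            windows.append((host, start.pop(host), when))
    return windows
```

Passing the lines of the first example to `hang_windows` gives a window from 09:47:22 to 12:04:34. Only the first discovery error per host opens a window, so repeated discovery errors before the ZFS fault do not shorten the reported span.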

Cause

The hang results from issues in both the LSI 1068E firmware and the mpt driver.

Solution

Upgrade the LSI firmware to 1.27.92 (011b5c00):

<Patch: 16044285> X4540 SW 2.3.2 - HIA 2.4.10.5

Apply the following patch, or a later revision:

<Patch: 150401-09>/<Patch: 150400-09>

The patch addresses the following bugs:

<Bug: 15706409> SUNBT7032847 MPT SHOULD HANDLE FAILING DISKS MORE INTELLIGENTLY

<Bug: 15875298> MPT DRIVER DOES NOT RECOVER FROM "DISCONNECTED COMMAND TIMEOUT" WITH FAILING DISK


  Copyright © 2018 Oracle, Inc.  All rights reserved.