Sun Storage J4000 JBOD Array: Troubleshooting Disk Failures

Asset ID:	1-75-1353887.1
Update Date:	2017-05-12
Keywords:

Solution Type Troubleshooting Sure

Solution 1353887.1 : Sun Storage J4000 JBOD Array: Troubleshooting Disk Failures

Applies to:

Sun Storage J4400 Array - Version Not Applicable and later
Sun Storage J4500 Array - Version Not Applicable and later
Sun Storage J4200 Array - Version Not Applicable and later
Information in this document applies to any platform.

Purpose

The purpose of this document is to help troubleshoot disk failure symptoms in Sun Storage J4000 JBOD arrays.

Symptoms may include:

An amber LED is lit on one or more drives in the array.
The host connected to the array reports SCSI driver errors for one or more drives.
One or more drives in the array are not seen by the host.
The SAS RAID HBA reports Failed/Degraded status for a volume.

This document mainly deals with the Solaris Operating System environment. The instructions may vary for other OS environments. This document does not cover J4000 JBOD arrays connected to Sun Storage 7000 Unified Storage Systems; for those, refer to the Sun Storage 7000 Unified Storage System documentation.

Troubleshooting Steps

1. Check the host logs to identify the fault(s) and the details of each fault

Reference <Document 1005530.1> How to Check for Solaris[TM] x64 Disk Errors and Online/Offline Status
Reference <Document 1007706.1> Troubleshooting Tips for SCSI Disk Errors On Linux Systems (This document includes notes about 'smartctl' output)

2. Verify whether the drive(s) is/are configured under RAID HBA

Reference <Document 1017961.1> How to Identify if a Solaris[TM] Operating Environment is Installed on a Hardware RAID Controller

If the drive(s) is/are configured under a SAS RAID HBA, refer to <Document 1013107.1> How to Identify BIOS and Solaris[TM] Hardware RAID Status. If one or more J4000 drives are identified as faulty, proceed to Step 7.

For more information about SAS RAID HBAs, refer to the documentation located here

If the drive(s) is/are NOT configured under a SAS RAID HBA, proceed to Step 3.

3. Verify '/var/adm/messages*' file(s) for any SCSI errors

Look in the /var/adm/messages* file(s) for SCSI errors similar to the following:

Apr 22 04:39:58 host01 scsi: [ID 107833 kern.warning] WARNING: /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:39:58 host01 scsi: [ID 107833 kern.warning] WARNING: /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:39:58 host01 Disconnected command timeout for Target 17
Apr 22 04:39:58 host01 Disconnected command timeout for Target 17
Apr 22 04:40:00 host01 scsi: [ID 107833 kern.warning] WARNING: /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:40:00 host01 scsi: [ID 365881 kern.info] /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:40:00 host01 mpt_check_task_mgt: Task 3 failed. ioc status = 4a target= 17
Apr 22 04:40:00 host01 Log info 31140000 received for target 17.
Apr 22 04:40:00 host01 scsi_status=0, ioc_status=8048, scsi_state=c
Apr 22 04:40:00 host01 scsi: [ID 107833 kern.warning] WARNING: /pci@7c,0/pci10de,378@b/pci1000,3150@0 (mpt0):
Apr 22 04:40:00 host01 mpt_check_task_mgt: Task 3 failed. ioc status = 4a target= 17

(or)

Mar 12 10:01:10 host02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3150@0/sd@b,0 (sd6):
Mar 12 10:01:10 host02 Error for Command: read(10)                Error Level: Retryable
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Requested Block: 55060475                  Error Block: 55060539
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: 01234XXXXX
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

If such errors are found, proceed to Step 4.
If no such errors are found, proceed to Step 7.
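
As a rough aid for this step, the following Python sketch scans log lines for the two error shapes shown above. The regular expressions are derived only from the sample messages in this document and are an assumption; real logs may be worded differently, and the function names are hypothetical.

```python
import re

# Patterns derived from the sample messages in this step (assumption:
# other driver versions may word these messages differently).
MPT_TIMEOUT = re.compile(r"Disconnected command timeout for Target (\d+)")
SENSE_MEDIA = re.compile(r"Sense Key:\s*Media Error")


def classify_line(line):
    """Return a short label for a log line, or None if it matches neither pattern."""
    match = MPT_TIMEOUT.search(line)
    if match:
        return "mpt_timeout target=" + match.group(1)
    if SENSE_MEDIA.search(line):
        return "media_error"
    return None


def scan(lines):
    """Collect the labels of every line that matches one of the patterns."""
    return [label for label in map(classify_line, lines) if label]


sample = [
    "Apr 22 04:39:58 host01 Disconnected command timeout for Target 17",
    "Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Sense Key: Media Error",
    "Mar 12 10:01:11 host02 some unrelated line",
]
print(scan(sample))
```

An empty result corresponds to the "no such errors" branch (Step 7); any match corresponds to Step 4. To scan a real file, pass an open file object, for example scan(open("/var/adm/messages")).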

4. Verify whether the Common Array Manager (CAM) application is installed on the host

SCSI errors can be reported on hosts that have Common Array Manager (CAM) installed and are connected to J4000 JBOD array(s) through a Pandora HBA. This is caused by the following bug:
<BUG 15638598> - mpt Disconnected timeouts - Pandora HBA connected to two J4500 continually reset.

Pandora is an 8-port 3Gbps SAS/SATA HBA - External. Model Number : SG-XPCIE8SAS-E-Z.

Verify whether the CAM application is installed on the host by using the pkginfo command, as shown below:

# pkginfo -l SUNWsefms
PKGINST: SUNWsefms
NAME: Sun Storage Common Array Manager Fault Management Services
CATEGORY: application
ARCH: all
VERSION: 6.8.0,REV=2011.06.04.08.08.24
BASEDIR: /opt
VENDOR: Oracle Corporation
DESC: The Sun Storage Common Array Manager Fault Management Services

If CAM is installed and the version is 6.6 or above, Bug 15638598 is applicable; proceed to Step 5.
If CAM is NOT installed, or the CAM version is 6.5 or lower, the bug is not applicable; proceed to Step 6.
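
The version check above can be sketched in Python as follows. This is a minimal illustration that assumes the VERSION line of the SUNWsefms package reflects the CAM version, as the sample output suggests; the function names are hypothetical, and an output without a VERSION line is treated as "CAM not installed".

```python
import re


def cam_version(pkginfo_output):
    """Extract (major, minor) from the VERSION line, or None if there is none."""
    match = re.search(r"^\s*VERSION:\s*(\d+)\.(\d+)", pkginfo_output, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def step_after_cam_check(pkginfo_output):
    """Return 5 if CAM 6.6 or above is detected, otherwise 6."""
    version = cam_version(pkginfo_output)
    if version is not None and version >= (6, 6):
        return 5
    return 6


sample = "PKGINST: SUNWsefms\nVERSION: 6.8.0,REV=2011.06.04.08.08.24\n"
print(cam_version(sample), step_after_cam_check(sample))
```

Comparing (major, minor) tuples avoids the string-comparison trap where "6.10" would sort below "6.6".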

5. Implement the fix for Bug 15638598

Bug 15638598 is fixed in HBA firmware 01.33.03.00, located here.
Until the HBA firmware can be upgraded to the fixed version, you may apply the temporary workaround of disabling fmservice as follows:

Verify the status of 'fmservice':

# svcs fmservice
STATE STIME FMRI
online Aug_24 svc:/system/fmservice:default

If fmservice is reported online, disable fmservice and reboot the host so that the drive(s) come online, then proceed to Step 12.

# svcadm disable fmservice
# svcs fmservice
STATE STIME FMRI
disabled 20:16:06 svc:/system/fmservice:default

Oracle HBA Engineering Guidance:

We recommend that the fmservice daemon of CAM not be used with topologies that include SATA disks, or, if it must be used to perform maintenance activity, that the daemon be disabled after the maintenance.

While the daemon is running, SATA PASSTHRU commands are issued to each attached device. Due to the nature of the SATA protocol, this disrupts pending read/write I/O activity and leads to a drop in performance.
Additionally, you may encounter messages of the following form in '/var/adm/messages'. These are to be expected while CAM is running the fmservice daemon and will *not* be fixed:

    scsi: [ID 243001 kern.info] /pci@78,0/pci8086,e08@3/pci1000,3150@0 (mpt0):
        mpt_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31112000
    scsi: [ID 107833 kern.warning] WARNING: /pci@78,0/pci8086,e08@3/pci1000,3150@0/sd@13,0 (sd25):
         Error for Command: read                    Error Level: Retryable
    scsi: [ID 107833 kern.notice]     Requested Block: 280333                    Error Block: 280333
    scsi: [ID 107833 kern.notice]     Vendor: ATA                                Serial Number: ABCD1234
    scsi: [ID 107833 kern.notice]     Sense Key: Unit_Attention
    scsi: [ID 107833 kern.notice]     ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

If fmservice is reported disabled, or no fmservice is found (because CAM is not installed), this bug is not applicable; proceed to Step 6.
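
The fmservice branch of this step can be sketched in Python as shown below. The parser assumes the three-column STATE/STIME/FMRI layout from the sample output above; the function names and returned strings are hypothetical, and states other than online or disabled are not covered by this document.

```python
def fmservice_state(svcs_output):
    """Return the STATE column of the fmservice line, or None if it is absent."""
    for line in svcs_output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2].startswith("svc:/system/fmservice"):
            return fields[0]
    return None


def step5_action(svcs_output):
    """Map the reported fmservice state to the action described in Step 5."""
    state = fmservice_state(svcs_output)
    if state == "online":
        return "disable fmservice, reboot, then go to Step 12"
    if state in ("disabled", None):
        return "go to Step 6"
    # Any other state is outside what this document describes.
    return "state not covered by this document"


sample = "STATE STIME FMRI\nonline Aug_24 svc:/system/fmservice:default\n"
print(step5_action(sample))
```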

6. Check the SCSI errors for any media errors

If there are media errors as shown below, and fewer than 30 of them are reported for the same drive in a two-week period, the errors were relocated successfully and no further action is required.

Mar 12 10:01:10 host02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3150@0/sd@b,0 (sd6):
Mar 12 10:01:10 host02 Error for Command: read(10)                Error Level: Retryable
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Requested Block: 55060475                  Error Block: 55060539
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: 01234XXXXX
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    Sense Key: Media Error
Mar 12 10:01:10 host02 scsi: [ID 107833 kern.notice]    ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0x0

If there are 30 or more media errors for the same drive within the two-week period, the drive needs to be replaced; contact Oracle Support for drive replacement.
If there is no uniform pattern, or if the errors are reported for multiple drives, proceed to Step 7.
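
Counting media errors per drive by hand is error prone, so the threshold rule above can be sketched in Python. This is an illustration under stated assumptions: each "Sense Key: Media Error" line is attributed to the device named in the most recent WARNING line (as in the sample), the log covers a single calendar year supplied by the caller because syslog timestamps carry no year, and all names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

WARNING = re.compile(r"WARNING: \S+ \((\w+)\):")
MEDIA = re.compile(r"Sense Key:\s*Media Error")
WINDOW = timedelta(days=14)   # the two-week period from this step
THRESHOLD = 30                # 30 or more errors means replacement


def media_error_times(lines, year):
    """Map each sd device to the timestamps of its media errors."""
    events = defaultdict(list)
    current = None
    for line in lines:
        match = WARNING.search(line)
        if match:
            name = match.group(1)
            # Only sd devices are tracked; other WARNING lines reset the context.
            current = name if name.startswith("sd") else None
        elif current and MEDIA.search(line):
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
            events[current].append(stamp)
    return events


def max_in_window(stamps):
    """Largest number of events that fall inside any 14-day span."""
    stamps = sorted(stamps)
    best = start = 0
    for end, stamp in enumerate(stamps):
        while stamp - stamps[start] > WINDOW:
            start += 1
        best = max(best, end - start + 1)
    return best


def drives_to_replace(lines, year):
    """Devices with at least THRESHOLD media errors within one window."""
    events = media_error_times(lines, year)
    return sorted(dev for dev, stamps in events.items()
                  if max_in_window(stamps) >= THRESHOLD)
```

A sliding window is used rather than fixed calendar buckets so that a burst straddling two buckets is not undercounted. Logs that span a year boundary would need extra handling.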

7. Verify whether the fault(s) is/are observed for multiple drives

If the issue is seen with a single drive, proceed to Step 8.
If the issue is seen with multiple drives, proceed to Step 9.

8. Verify the physical LED indications of the drive

If the Amber Fault LED is ON, the drive is faulty; contact Oracle Support for drive replacement.

9. Check the cable connectivity

Reference Cabling configuration for J4500
Reference Cabling configuration for J4200/J4400 - Single path
Reference Cabling configuration for J4200/J4400 - Multipath

If the cabling is as per the documentation, proceed to Step 11.
If not, proceed to Step 10.

10. Adjust the cabling as per the documentation and verify whether the host can access the drives properly

If the host can access the drives properly, proceed to Step 12.
If the host sees one of the symptoms again, proceed to Step 11.

Note: Cabling cannot be adjusted while the host is online and accessing other enclosure drives. You need to plan a maintenance window to correct the cabling.

11. Verify SIM board LEDs and back panel indicators

Reference Back Panel Indications J4500
Reference Back Panel Indications J4200/J4400

Capture any Amber LED indications seen and proceed to Step 13.

12. Monitor the system for any errors for two days

If the symptoms repeat, proceed to Step 13.
If no further symptoms are seen, the issue is considered to be resolved.
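
The branching in Steps 7 through 12 can be summarized as a lookup table. The Python sketch below is only a restatement of the step text; the observation labels are hypothetical names chosen for illustration, and combinations this document does not describe raise an error rather than guessing.

```python
# (step, observation) -> next step number, or a terminal outcome string.
# Observation labels are hypothetical; the transitions follow the step text.
FLOW = {
    (7, "single_drive"): 8,
    (7, "multiple_drives"): 9,
    (8, "amber_led_on"): "contact Oracle Support for drive replacement",
    (9, "cabling_matches_docs"): 11,
    (9, "cabling_differs"): 10,
    (10, "drives_accessible"): 12,
    (10, "symptoms_return"): 11,
    (11, "leds_captured"): 13,
    (12, "symptoms_repeat"): 13,
    (12, "no_symptoms"): "resolved",
}


def next_step(step, observation):
    """Look up where to go next; unknown combinations are rejected."""
    try:
        return FLOW[(step, observation)]
    except KeyError:
        raise ValueError(f"step {step} has no branch for {observation!r}") from None


def walk(observations, start=7):
    """Follow a sequence of observations and return the steps visited."""
    path = [start]
    for observation in observations:
        outcome = next_step(path[-1], observation)
        path.append(outcome)
        if not isinstance(outcome, int):
            break  # terminal outcome reached
    return path


print(walk(["multiple_drives", "cabling_differs", "symptoms_return", "leds_captured"]))
```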

13. Open a call for further analysis

At this point, if you have validated that each troubleshooting step above is true for your environment and the issue still exists, further troubleshooting is required. Please contact Oracle Support and supply:

Critical Faults
Support Data Collection (if applicable) Reference <Document 1002514.1> Collecting Support Data for Arrays Using Sun StorageTek[TM] Common Array Manager
Detailed LED indications
Cabling configuration
Explorer output
Support Archive (if JBOD is configured under RAID HBA) Reference Creating a Support Archive
