Solaris Scsi Error - Sense Key: Aborted_Command - ASC: 0x44 (internal target failure), ASCQ: 0x0, FRU: 0x0

Asset ID:	1-72-2179850.1
Update Date:	2018-03-07
Keywords:

Solution Type Problem Resolution Sure

Solution 2179850.1 : Solaris Scsi Error - Sense Key: Aborted_Command - ASC: 0x44 (internal target failure), ASCQ: 0x0, FRU: 0x0

Applies to:

Sun SPARC Enterprise T5220 Server - Version All Versions and later
Solaris Operating System - Version 8.0 and later
Information in this document applies to any platform.

Symptoms

This is a Solaris 10 T4-2 server with two Oracle FC HBAs connected to the SAN to access an EMC disk storage array.

There are no errors on the FC HBAs, and the EMC LUNs are under MPxIO multipathing.

From time to time, single SCSI write errors are logged against different storage LUNs:

Aug 26 00:57:57 server01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60000970000xxxxxxxxxxxx030324441 (ssd1041):
Aug 29 22:12:38 server01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60000970000xxxxxxxxxxxx030313944 (ssd1218):
Sep 2 04:24:07 server01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60000970000xxxxxxxxxxxx030324441 (ssd1041):
Sep 4 10:27:03 server01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60000970000xxxxxxxxxxxx030314138 (ssd1207):

All the errors are like this one:

Aug 26 00:57:57 server01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60000970000xxxxxxxxxxxx030324441 (ssd1041):
Aug 26 00:57:57 server01 Error for Command: write(10) Error Level: Retryable
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] Requested Block: 12160 Error Block: 12160
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] Vendor: EMC Serial Number: 41XXXXXXX
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] ASC: 0x44 (internal target failure), ASCQ: 0x0, FRU: 0x0
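
The fields of such a record can be pulled out programmatically when many messages files have to be reviewed. The following is a minimal sketch, assuming the record looks like the excerpt above; the function name parse_scsi_record and the dictionary keys are illustrative choices, not part of any Solaris tool.

```python
import re

# Sample record copied from the messages excerpt in this document.
SAMPLE = """\
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60000970000xxxxxxxxxxxx030324441 (ssd1041):
Aug 26 00:57:57 server01 Error for Command: write(10) Error Level: Retryable
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] Requested Block: 12160 Error Block: 12160
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] Vendor: EMC Serial Number: 41XXXXXXX
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command
Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] ASC: 0x44 (internal target failure), ASCQ: 0x0, FRU: 0x0
"""


def parse_scsi_record(text):
    """Extract the main fields of one ssd error record (hypothetical helper)."""
    fields = {}

    # Device path and ssd instance from the WARNING line.
    m = re.search(r"WARNING:\s+(\S+)\s+\((ssd\d+)\):", text)
    if m:
        fields["path"] = m.group(1)
        fields["instance"] = m.group(2)

    # SCSI command and error level (Retryable, Fatal, ...).
    m = re.search(r"Error for Command:\s+(\S+)\s+Error Level:\s+(\w+)", text)
    if m:
        fields["command"] = m.group(1)
        fields["level"] = m.group(2)

    # Block requested by the host and block where the error was reported.
    m = re.search(r"Requested Block:\s+(\d+)\s+Error Block:\s+(\d+)", text)
    if m:
        fields["requested_block"] = int(m.group(1))
        fields["error_block"] = int(m.group(2))

    m = re.search(r"Sense Key:\s+(\S+)", text)
    if m:
        fields["sense_key"] = m.group(1)

    # Additional sense code, its text, the qualifier and the FRU code.
    m = re.search(
        r"ASC:\s+(0x[0-9a-fA-F]+)\s+\(([^)]*)\),\s+"
        r"ASCQ:\s+(0x[0-9a-fA-F]+),\s+FRU:\s+(0x[0-9a-fA-F]+)",
        text,
    )
    if m:
        fields["asc"] = int(m.group(1), 16)
        fields["asc_text"] = m.group(2)
        fields["ascq"] = int(m.group(3), 16)
        fields["fru"] = int(m.group(4), 16)

    return fields


if __name__ == "__main__":
    for key, value in parse_scsi_record(SAMPLE).items():
        print(key, "=", value)
```

Real messages files may use tabs or different spacing between fields, which is why the patterns match any whitespace rather than single spaces.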

Cause

The EMC disk storage array is reporting "ASC: 0x44 (internal target failure)" for these particular LUNs at those particular moments in time.

What does this single error mean?

Applications can generate hundreds of IOs per second to the storage disks,
and the Solaris SCSI disk driver (ssd) sends these IOs through the FC HBAs to the EMC storage array.

One IO was failed by the EMC storage at "Aug 26 00:57:57". It was a write IO to disk "/scsi_vhci/ssd@g60000970000xxxxxxxxxxxx030324441 (ssd1041)" (ssd instance number 1041),
requesting access to block number 12160 of that disk.
The EMC storage could not complete that single write IO and aborted it with this error: "ASC: 0x44 (internal target failure), ASCQ: 0x0, FRU: 0x0".
The error level was Retryable, so the ssd driver retried the IO; this retry is transparent to the user.

Notice this was a single error; no other errors were reported at that time. That means all other IOs (reads and writes) were completing successfully, except this single IO, which had to be retried by the ssd driver.
As there are no more Retryable or Fatal errors around that time, the retried write IO was also completed successfully by the EMC storage array.
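
The reasoning above (isolated Retryable errors, no Fatal errors) can be checked across a whole messages file by counting error levels per ssd instance. The following is a minimal sketch under the assumption that each "Error Level" line follows the WARNING line of the same record, as in the excerpt in this document; the function names count_error_levels and has_fatal are hypothetical.

```python
import re
from collections import Counter, defaultdict

WARNING_RE = re.compile(r"WARNING:\s+\S+\s+\((ssd\d+)\):")
LEVEL_RE = re.compile(r"Error Level:\s+(\w+)")

# Sample lines copied from the messages excerpt in this document.
SAMPLE_LINES = [
    "Aug 26 00:57:57 server01 scsi: [ID 107833 kern.warning] WARNING: "
    "/scsi_vhci/ssd@g60000970000xxxxxxxxxxxx030324441 (ssd1041):",
    "Aug 26 00:57:57 server01 Error for Command: write(10) Error Level: Retryable",
    "Aug 26 00:57:57 server01 scsi: [ID 107833 kern.notice] Sense Key: Aborted_Command",
]


def count_error_levels(lines):
    """Return {ssd instance: Counter of error levels} for the given log lines."""
    counts = defaultdict(Counter)
    current = None  # ssd instance named by the most recent WARNING line
    for line in lines:
        warning = WARNING_RE.search(line)
        if warning:
            current = warning.group(1)
            continue
        level = LEVEL_RE.search(line)
        if level and current is not None:
            counts[current][level.group(1)] += 1
    return counts


def has_fatal(counts):
    """True if any instance logged at least one Fatal error."""
    return any(levels["Fatal"] > 0 for levels in counts.values())


if __name__ == "__main__":
    summary = count_error_levels(SAMPLE_LINES)
    for instance, levels in sorted(summary.items()):
        print(instance, dict(levels))
    print("Fatal errors present:", has_fatal(summary))
```

To run it against a real file, the lines of /var/adm/messages can be passed in place of SAMPLE_LINES, for example with open("/var/adm/messages") as the iterable.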

Solution

This is a problem external to the Solaris server.
The disk storage vendor (EMC in this case) is responsible for explaining why the storage is reporting this error and which situations can cause the storage to report it.

RCA Example

Note: Different scenarios on disk storage arrays may lead to this type of error. The following is an example of an RCA and solution provided by EMC:

The EMC box is replicating these disks/LUNs to another EMC box synchronously, and there are FC errors on the communication link between the EMC boxes.

That means any IO from the host has to be written on both boxes before it is completed.
Due to these FC communication errors between the EMC boxes (or possibly some other problem on the secondary box),
the primary EMC box may report an "ASC: 0x44 (internal target failure)" to the server, aborting IOs that cannot be written on the second box.
The aborted IO is then retried by Solaris (ssd driver), and the retry completed, as no Fatal errors are observed in the messages files.

There is another replication mode called "Adaptive Copy" which, unlike Synchronous mode, would avoid this type of failure from the EMC storage side.
In this mode, the EMC box completes the IO to Solaris as soon as the data is written on the first EMC box. See the "modes of operation" section of the "EMC Symmetrix Remote Data Facility (SRDF) for VMAX Product Guide".
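
The difference between the two modes, as described in this RCA, can be illustrated with a toy model. This is only a sketch of the description above, not of actual SRDF behavior; the function name host_write_result, the mode strings, and the boolean inputs are all assumptions made for illustration.

```python
def host_write_result(mode, primary_ok, secondary_ok):
    """Toy model: what the host sees for one write IO.

    mode         -- "synchronous" or "adaptive_copy" (illustrative names)
    primary_ok   -- the primary box could write the data
    secondary_ok -- the data could also be written on the secondary box
    """
    if mode not in ("synchronous", "adaptive_copy"):
        raise ValueError("unknown replication mode: %r" % mode)

    if not primary_ok:
        # In either mode the write cannot complete without the primary copy.
        return "aborted"

    if mode == "synchronous":
        # The IO completes only once both boxes hold the data.
        return "completed" if secondary_ok else "aborted"

    # Adaptive copy: completion depends on the primary box only.
    return "completed"


if __name__ == "__main__":
    # Replication link failure: the secondary write does not succeed.
    for mode in ("synchronous", "adaptive_copy"):
        print(mode, "->", host_write_result(mode, True, False))
```

In this model, a failure on the replication link surfaces to the host only in synchronous mode, which matches the single aborted and retried writes seen on the Solaris side.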

EMC found a problem with an FC port used for replication between the two boxes. That port was disconnected, and no more errors were observed afterwards.

References

<NOTE:1285485.1> - GUDS - A Script for Gathering Solaris Performance Data
<NOTE:1010680.1> - Troubleshooting Disk Performance
