Check Condition, Unit Attention, ASC: 0x3f for Read 10 Messages Produces IO Performance Issue

Asset ID:	1-72-1628717.1
Update Date:	2016-04-04
Keywords:

Solution Type Problem Resolution Sure

Solution 1628717.1 : Check Condition, Unit Attention, ASC: 0x3f for Read 10 Messages Produces IO Performance Issue

Applies to:

Sun Storage 9980 System - Version Not Applicable and later
Solaris (SPARC) - Version 10 and later
Sun Storage 9970 System - Version All Versions and later
Sun Storage 9985 System - Version All Versions and later
Sun Storage 9990 System - Version All Versions and later
Information in this document applies to any platform.

Symptoms

There is an I/O performance issue on a server connected to an HDS array.
The degradation is intermittent: it lasts sometimes 5 minutes, sometimes 20 minutes, after which performance returns to normal.

Below are our observations:

• Each time performance degrades, more than 600 scsi: [ID 107833 kern.warning] Retryable errors are logged on the hosts.
• The errors are generated on all active LUNs across all 4 paths, toward 4 different storage ports.
• Each time new LUNs are added to this cluster, a rescan is triggered on the host, and the rescan appears to generate these Retryable errors.
• These Retryable [ID 107833 kern.warning] errors cause I/O to drop by 30 to 50%.
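
To confirm that the Retryable errors are spread across all storage ports and LUNs, the log can be tallied per target port and per LUN. The following is a minimal sketch, assuming the message layout shown in the log excerpt in the Cause section (a WARNING line carrying the device path, followed by an "Error for Command" line). The function name summarize_retryable is hypothetical and not part of any Solaris tool.

```python
import re
from collections import Counter

# Device line, e.g. ".../fp@0,0/disk@w50060e80164ef929,11a (sd1627):"
# group 1 = target port WWN, group 2 = LUN (hex), group 3 = sd instance
DEVICE_RE = re.compile(
    r"WARNING: \S*/disk@w([0-9a-f]{16}),([0-9a-f]+) \((sd\d+)\):"
)
# Follow-up line, e.g. "Error for Command: read(10) Error Level: Retryable"
RETRYABLE_RE = re.compile(r"Error for Command: \S+\s+Error Level: Retryable")


def summarize_retryable(lines):
    """Count Retryable errors per target port WWN and per (WWN, LUN) pair.

    Assumes each Retryable line follows the WARNING line that names
    the device, as in the excerpt from this document.
    """
    by_port = Counter()
    by_lun = Counter()
    current = None  # (wwn, lun) of the most recent WARNING line
    for line in lines:
        device = DEVICE_RE.search(line)
        if device:
            current = (device.group(1), device.group(2))
            continue
        if current and RETRYABLE_RE.search(line):
            by_port[current[0]] += 1
            by_lun[current] += 1
            current = None  # wait for the next WARNING line
    return by_port, by_lun
```

The function accepts any iterable of lines, for example an open file handle on a copy of /var/adm/messages. Four port WWNs with similar counts would match the observation that all 4 paths are affected.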

Cause

Analysis of the collected data showed a large number of messages that indicate the possible source of the problem:

scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci103c,3263@0,1/fp@0,0/disk@w50060e80164ef929,11a (sd1627):
Error for Command: read(10) Error Level: Retryable
scsi: [ID 107833 kern.notice] Requested Block: 103559536 Error Block: 103559536
scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
scsi: [ID 107833 kern.notice] ASC: 0x3f (reported LUNs data has changed), ASCQ: 0xe, FRU: 0x0
scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci103c,3263@0,1/fp@0,0/disk@w50060e80164ef929,323 (sd11):

The message block above is typical of a RAID array on which some array-side administrative action has been taken; the Solaris
host must be made aware of the change so that it can re-query the current configuration.

In the case above, sd instance 1627, via a QLC HBA, is reporting a CHECK CONDITION, UNIT ATTENTION, ASC: 0x3f for a READ 10 issued to an HP logical unit;
no data was transferred. Here, ASC: 0x3f "reported LUNs data has changed" likely reflects a change in target-port-level information,
typically the array-side mapping or unmapping of a logical unit for this I_T nexus.
Solaris responds by issuing a REPORT LUNS, updating the kernel information for this I_T nexus, and retrying the READ 10, which is expected to succeed.

Like most UNIT ATTENTION conditions, this does not reflect a failure of any kind, merely a notification of a change in configuration.
The READ 10 will be retried and under normal conditions will complete normally.
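
The ASC/ASCQ pair on the sense line can be decoded mechanically. The sketch below assumes the "ASC: 0x.., ASCQ: 0x.." layout of the excerpt above and covers only a small subset of the ASC 0x3f codes defined in the SCSI Primary Commands (SPC) standard. The table and function names are illustrative.

```python
import re

# Subset of ASC 0x3f additional sense code qualifiers (from SPC).
ASC_3F = {
    0x00: "TARGET OPERATING CONDITIONS HAVE CHANGED",
    0x01: "MICROCODE HAS BEEN CHANGED",
    0x02: "CHANGED OPERATING DEFINITION",
    0x03: "INQUIRY DATA HAS CHANGED",
    0x0E: "REPORTED LUNS DATA HAS CHANGED",
}

SENSE_RE = re.compile(r"ASC: (0x[0-9a-fA-F]+).*?ASCQ: (0x[0-9a-fA-F]+)")


def decode_sense(line):
    """Return (asc, ascq, description), or None if the line has no ASC/ASCQ."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    asc = int(match.group(1), 16)
    ascq = int(match.group(2), 16)
    if asc != 0x3F:
        return asc, ascq, "not an ASC 0x3f code"
    return asc, ascq, ASC_3F.get(ascq, "ASCQ not in this table")


line = ("scsi: [ID 107833 kern.notice] ASC: 0x3f "
        "(reported LUNs data has changed), ASCQ: 0xe, FRU: 0x0")
print(decode_sense(line))
```

For the line from the excerpt, this yields ASC 0x3f, ASCQ 0x0e, REPORTED LUNS DATA HAS CHANGED, which is a notification and not a media or transport failure.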

However, even though these retryable messages are informational (related to a change in configuration),
the server still experiences the performance issue.

Based on the available information, nothing points to the driver side as the root cause of the poor performance:

-- The I/O errors occur at different block locations.
-- The Error Level is Retryable
-- The ASC and ASCQ codes are always the same: ASC: 0x3f (reported LUNs data has changed), ASCQ: 0xe
-- No other types of errors are being logged (bus resets, fatal read/write errors, timeouts, etc.)
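
The indicators listed above can be checked against a log extract with a short script. This is a sketch only: the function name check_indicators is hypothetical, and the keyword pattern for "other" error types is an assumption that should be adapted to the messages the drivers in use actually log.

```python
import re

# Illustrative pattern for "other" error types (an assumption, not an
# exhaustive list of Solaris driver messages).
OTHER_ERROR_RE = re.compile(r"Error Level: Fatal|timeout|reset", re.IGNORECASE)


def check_indicators(log_text):
    """Summarize error levels, error blocks, and sense codes in raw log text."""
    levels = set(re.findall(r"Error Level:\s*(\w+)", log_text))
    blocks = set(re.findall(r"Error Block:\s*(\d+)", log_text))
    # ASC and ASCQ appear on the same line, so a non-greedy match suffices.
    codes = set(
        re.findall(r"ASC: (0x[0-9a-f]+).*?ASCQ: (0x[0-9a-f]+)", log_text)
    )
    other = [ln for ln in log_text.splitlines() if OTHER_ERROR_RE.search(ln)]
    return {
        "only_retryable": levels == {"Retryable"},
        "distinct_error_blocks": len(blocks),
        "sense_codes": codes,
        "single_sense_code": len(codes) == 1,
        "other_errors": other,
    }
```

If only_retryable and single_sense_code are both true, other_errors is empty, and distinct_error_blocks is large, the log matches the pattern described in the list.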

The READ 10 retry messages are not in themselves the cause of the poor performance, as long as they are not repeated frequently.

Below is a note from HDS, the original storage manufacturer:

If the Unit Attention report occurs frequently and the load on the host side becomes high, the data transfer cannot be
started on the host side and timeout may occur.

Indeed, on the hosts we observed that once there are more than 1000 READ 10 messages, host I/O drops from 15000 IO/s to 7000 IO/s, with a noticeable performance impact.
If the READ 10 messages keep arriving for more than 30 minutes, I/O can drop to nearly 0.
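
As a quick worked figure, the relative size of the observed drop can be computed directly from the two throughput values quoted above. The helper name percent_drop is illustrative.

```python
def percent_drop(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Throughput values observed on the hosts (IO/s).
print(round(percent_drop(15000, 7000), 1))  # 53.3
```

A fall from 15000 to 7000 IO/s is therefore a drop of roughly 53%, slightly above the upper end of the 30 to 50% range noted under Symptoms.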

Solution

On the HDS side, Host Mode Option (HMO) 7 can be disabled to prevent automatic LUN recognition:

From

http://www.hds.com/assets/pdf/configuration-guide-for-sun-solaris-host-attachment.pdf

page 2-5

--- start ---

Host mode option 7 (models: HUS VM, VSP, USP V/VM)

Changes the setting of whether to return the Unit Attention response when adding a LUN.

ON:
Unit Attention response is returned.

OFF (default):
Unit Attention response is not returned.
Sense code: REPORTED LUNS DATA HAS CHANGED

Notes:

1. Set host mode option 07 to ON when you expect the REPORTED LUNS DATA HAS CHANGED UA at SCSI path change.

2. If the Unit Attention report occurs frequently and the load on the host side becomes high,
the data transfer cannot be started on the host side and timeout may occur.

3. If both HMO 07 and HMO 69 are set to ON, the UA of HMO 69 is returned to the host.

--- end ---

References

<NOTE:1004933.1> - StorageTek[TM] 9900/9900V/9990: Unable to Dynamic Add/Discover New Luns addition.
