Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1575884.1
Update Date: 2016-12-05

Solution Type: Problem Resolution Sure

Solution 1575884.1: VSM5 to VLE Replication or Migration Timeouts CC=14, RC=93


Related Items
  • Sun Virtual Library Extension (VLE)
  • Sun StorageTek VSM5 System
Related Categories
  • PLA-Support>Sun Systems>TAPE>Virtual Tape>SN-TP: VSM




In this Document
Symptoms
Cause
Solution


Created from <SR 3-7026363957>

Applies to:

Sun StorageTek VSM5 System - Version All Versions to All Versions [Release All Releases]
Sun Virtual Library Extension (VLE) - Version 1.0 to 1.3 [Release 1.0]
Information in this document applies to any platform.

Symptoms

 

Timeout problems can be encountered when performing VSM5 to VLE migrations or recalls.  These errors are reported when a migrate or recall does not complete within the time the VLE allots for the operation.

Here is an example of a timeout from a customer log:

MVS1 13190 19:39:07.72 
 :SLS6684I RTD VRTD0007 on VTSS STKVSM1 Returned UUIREQ error CC=14 RC=93
 S STKDVSS,PRM=0900 
 F SMC0,ROUTE STKVLE0 SEND_ASR

The VSM5 logs do not show much for this timeout because the condition is detected and reported by the VLE.

Here is an example of what that timeout looked like in the VLE vlelog:

 2013-07-09 16:40:29,045 [UuiReqHdlr-10.8.32.37:59449] INFO 
 vle.manager.VleManagerClient - Queuing request: RecallVtvVleRequest: 
 REQUEST_ID: R5497566, DEVICE_ID: 5670002003510202, VTV: B60442, TIMESTAMP: 51D751E80007DCCB, MOUNT_DEV: -1, MOUNT_TIME_STAMP: -1, MOUNT_THUMB_WHEEL: -1, AFFINITY_NODE: VLENODE1.PTXVLE0
 2013-07-09 16:40:29,047 [VleRequestHandler-R5497566] INFO vle.manager.VleRequestHandler - Handling: RecallVtvVleRequest: REQUEST_ID: 
 R5497566, DEVICE_ID: 5670002003510202, VTV: B60442, TIMESTAMP: 
 51D751E80007DCCB, MOUNT_DEV: -1, MOUNT_TIME_STAMP: -1, MOUNT_THUMB_WHEEL: -1, AFFINITY_NODE: VLENODE1.PTXVLE0
 2013-07-09 16:40:29,074 [Primary-393-R5497566] INFO 
 vle.replication.VtvPrimaryReplicator - Handle Replicate work item for VTV MV5684/B60442
 ~~~
 2013-07-09 17:40:29,050 [UuiReqHdlr-10.8.32.37:59449] ERROR 
 vle.manager.VleManagerClient - Timed out while waiting for response to 
 VleRequest: RecallVtvVleRequest: REQUEST_ID: R5497566, DEVICE_ID: 
 5670002003510202, VTV: B60442, TIMESTAMP: 51D751E80007DCCB, MOUNT_DEV: -1, MOUNT_TIME_STAMP: -1, MOUNT_THUMB_WHEEL: -1, AFFINITY_NODE: VLENODE1.PTXVLE0 
 (ID = R5497566)
 2013-07-09 17:40:29,053 [UuiReqHdlr-10.8.32.37:59449] ERROR 
 messaging.uui.UuiRequestHandler - VLE Request RECALL_VTV failed. Response: 
 RecallVtvVleResponse: REQUEST_ID: R5497566, COMPLETION_CODE: FAILURE, 
 REASON_CODE: TIMED_OUT_WAITING_FOR_RESPONSE, VTV_INFO: null
 2013-07-09 17:40:29,055 [UuiReqHdlr-10.8.32.37:59449] INFO 
 messaging.uui.UuiMessageFactory - UUI Sending: RECALL_VTV_RESPONSE, 
 DeviceId=5670002003510202, VTV=B60442, RC=14/93-RECALL_VMVC_COMMUNICATION_TIMEOUT.
 2013-07-09 17:40:29,270 [UuiReqHdlr-10.8.32.37:59941] INFO 
 messaging.uui.UuiMessageFactory - UUI Received: SEND_ASR.
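
If needed, the vlelog can be scanned for this signature directly. Below is a minimal sketch of such a scan (this is not a VLE utility; the marker strings are taken from the example log above, and the default file name "vlelog" is an assumption that may differ between VLE releases):

  #!/usr/bin/env python
  # Minimal sketch: scan a VLE vlelog for migrate/recall timeout entries.
  # The marker strings come from the example log above; point the script
  # at the actual vlelog file on the system being examined.
  import re
  import sys

  TIMEOUT_MARKERS = (
      "Timed out while waiting for response",
      "TIMED_OUT_WAITING_FOR_RESPONSE",
      "RECALL_VMVC_COMMUNICATION_TIMEOUT",
  )
  REQUEST_ID = re.compile(r"REQUEST_ID:\s*(R\d+)")
  VTV_ID = re.compile(r"VTV[:=]\s*([A-Z0-9]+)")

  def scan(path):
      # Print every log line that contains one of the timeout markers,
      # along with the request ID and VTV volser when they are present.
      with open(path, errors="replace") as log:
          for lineno, line in enumerate(log, 1):
              if any(marker in line for marker in TIMEOUT_MARKERS):
                  req = REQUEST_ID.search(line)
                  vtv = VTV_ID.search(line)
                  print("line %d: request=%s vtv=%s" % (
                      lineno,
                      req.group(1) if req else "?",
                      vtv.group(1) if vtv else "?"))
                  print("  " + line.rstrip())

  if __name__ == "__main__":
      scan(sys.argv[1] if len(sys.argv) > 1 else "vlelog")

Running the script against a copy of the vlelog lists each request ID and VTV that hit the timeout.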

Cause

These timeout conditions are most frequently encountered when transferring large VTVs (4 GB) across a long-distance network during high-activity periods.
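
As a rough sense of scale (the link speed used here is an assumption for illustration, not a measured value): a 4 GB VTV is about 32,768 Mbit, so over an effective 20 Mbit/s replication path it needs roughly 32,768 / 20 ≈ 1,638 seconds, or about 27 minutes, which already exceeds the 20-minute limit listed below for VLE 1.1.14; contention during high-activity periods stretches that further.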

VLE microcode has the following timeout values set for completion of a VTV migrate or recall:

  • VLE 1.0.43 code = No timeout specified.
  • VLE 1.1.14 code = The migrate or recall times out if it has not completed within 20 minutes of starting.
  • VLE 1.2.16 code = The migrate or recall times out if it has not completed within 60 minutes of starting.
  • VLE 1.3.12 code = Migrates and recalls use a ‘watchdog’ timer: the operation continues as long as some data has been transferred within the past 10 minutes; if no data moves in that window, the request times out (see the watchdog sketch after this list).
  • VLE 1.4.2 code with patch A3 = The migrate or recall times out if it has not completed within 12 hours of starting.
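
The difference between the fixed deadlines used through VLE 1.2.16 and the watchdog introduced in VLE 1.3.12 can be summarized with the following sketch (illustrative only; the class names are invented for this note and are not VLE code):

  import time

  class FixedDeadline:
      # Fixed-deadline behavior (VLE 1.1.14 / 1.2.16 style): the whole
      # migrate or recall must finish within one window measured from the start.
      def __init__(self, limit_seconds):
          self.deadline = time.monotonic() + limit_seconds

      def data_transferred(self, nbytes):
          pass  # progress does not extend the deadline

      def timed_out(self):
          return time.monotonic() > self.deadline

  class Watchdog:
      # Watchdog behavior (VLE 1.3.12 style): the request only times out
      # if no data at all has moved within the most recent window.
      def __init__(self, window_seconds):
          self.window = window_seconds
          self.last_progress = time.monotonic()

      def data_transferred(self, nbytes):
          if nbytes > 0:
              self.last_progress = time.monotonic()  # progress resets the timer

      def timed_out(self):
          return time.monotonic() - self.last_progress > self.window

Under the watchdog model a slow but steadily moving transfer never times out, whereas under a fixed 20- or 60-minute deadline a large VTV over a slow link eventually will.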

Current VSM microcode contributes to these timeouts because of the way it processes IFF IP I/O: the VSM services IFF I/O in device number order, not round robin.  Each IFF card has 4 potential targets (T0 – T3).  The following sequence shows how this ordering can starve a target and thereby contribute to the timeout problem (a small simulation illustrating the effect follows the list).

  • In this example we have an IFF card with 3 targets (T0, T1 and T2) configured and in use.
  • Data for T0 is processed until it runs out of pages and blocks.
  • Data for T1 is then processed until it runs out of pages and blocks.
  • At this point we may frequently have both T0 and T2 waiting to process work.
  • Because work is processed in device number order, the IFF card will next process work for T0.
  • At this point T2 still has not gotten any work done.  T2 will not be serviced until one of T0 or T1 is idle and the remaining busy target then runs out of pages and blocks to process.
  • The problem sequence above is only made worse if all 4 IFF card targets are configured and trying to do work.
  • Timeout events that fit this scenario will typically take place on T2 or T3.
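
The starvation described above can be illustrated with a toy scheduler model (a simplified sketch written for this note, not VSM5 microcode):

  def pick_device_order(waiting):
      # VSM5-style: always service the lowest-numbered waiting target.
      return min(waiting)

  def pick_round_robin(waiting, last):
      # Round-robin alternative: service the next waiting target after
      # the one just served, wrapping around.
      for i in range(1, 4):
          candidate = (last + i) % 3
          if candidate in waiting:
              return candidate

  def simulate(policy, slots=12):
      # Toy model: 3 busy targets (0=T0, 1=T1, 2=T2) on one IFF card.
      # The target just serviced has run out of pages and blocks, so it is
      # not waiting for the very next slot; every other target has work queued.
      served = [0, 0, 0]
      last = -1
      for _ in range(slots):
          waiting = {t for t in range(3) if t != last}
          if policy == "device_number":
              t = pick_device_order(waiting)
          else:
              t = pick_round_robin(waiting, last)
          served[t] += 1
          last = t
      return served

  if __name__ == "__main__":
      print("device number order:", simulate("device_number"))  # [6, 6, 0] - T2 starved
      print("round robin:        ", simulate("round_robin"))    # [4, 4, 4] - even service

With the device-number policy, T2 never gets a service slot while T0 and T1 stay busy, which matches the observation that these timeouts typically occur on T2 or T3.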

The above IFF card IP I/O processing algorithm is in place in all levels of VSM5 microcode up to and including D/H02.18 codes.

Solution

The timeout problem can be mitigated by upgrading to VLE 1.3.12 or higher code; with the watchdog timer (and the 12-hour limit in later code) it should be exceedingly rare for any target to fail to transfer data within those limits.

There is no plan to change the way in which the VSM5 services the targets (in device number order, as described above).


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.