Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2168424.1
Update Date:2018-04-25

Solution Type  Problem Resolution Sure

Solution  2168424.1 :   MaxRep: Protection Plan in Resync Required Due to Source LUN Using VMware VAAI  


Related Items
  • Pillar Axiom Replication Engine (MaxRep)
Related Categories
  • PLA-Support>Sun Systems>DISK>Axiom>SN-DK: MaxRep-2x




In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

Pillar Axiom Replication Engine (MaxRep) - Version 3.0 to 3.0 [Release 3.0]
Information in this document applies to any platform.

Symptoms

A volume pair goes into Resync Required when at least one ESX host mapped to the source LUN has VAAI enabled and the MaxRep software is at version 3.0.4 or earlier.

In the MaxRep GUI, go to Monitor -> Volume Protection. The status of one or more volume pairs is Resync Required.

Click the Summary link (on the right of the screen), then click Primary Log: Details.

Multiple MirrorFailedReason=MAX_RETRY events are displayed.

Changes

FS1 Release 06.02.02 supports VAAI (VMware vSphere Storage APIs Array Integration).

The issue can occur whenever a VAAI function is used on a VMFS volume (datastore), for example when a VM is deployed, blocks are reclaimed, or new data is created or locked.

Note: FS1 must be upgraded to R6.2.12 or later due to a known issue with VMware VAAI (see the description for Bug 22868808 in the Release 6.2.7 README and Bug 24381286 in the R6.2.12 README).

 

Cause

To determine why the volume pair went into Resync Required, check the /var/log/messages file soon after the failure:

  • Open an SSH session to the management IP address of the MaxRep Engine that accesses the source LUN (see Document 2046703.1, FS System: Passwords Associated with the Oracle FS1-2 Flash Storage System, to obtain the password).
  • Open /var/log/messages with vi or the less command.
  • Go to the timestamp matching the alert:

Jul 18 17:16:27 MAXREPBA kernel: [0]: scst: scst_parse_cmd:780:Warning: expected transfer length 512 for opcode 0x93 (handler InMageEMD, target qla2x00tgt) doesn't match decoded value 1048576
Jul 18 17:16:27 MAXREPBA kernel: [0]: scst_parse_cmd:782:Suspicious CDB:
Jul 18 17:16:27 MAXREPBA kernel: (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
Jul 18 17:16:27 MAXREPBA kernel: 0: 93 00 00 00 00 00 b5 2d 28 00 00 00 08 00 00 00 .......-(.......
Jul 18 17:16:27 MAXREPBA kernel: Invalid opcode = 0x93
Jul 18 17:16:27 MAXREPBA kernel: [0]: scst: scst_parse_cmd:780:Warning: expected transfer length 512 for opcode 0x93 (handler InMageEMD, target qla2x00tgt) doesn't match decoded value 1048576
Jul 18 17:16:27 MAXREPBA kernel: [0]: scst_parse_cmd:782:Suspicious CDB:
Jul 18 17:16:27 MAXREPBA kernel: (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
Jul 18 17:16:27 MAXREPBA kernel: 0: 93 00 00 00 00 00 b5 2d 28 00 00 00 08 00 00 00 .......-(.......
Jul 18 17:16:27 MAXREPBA kernel: Invalid opcode = 0x93

0x93 is the opcode of the SCSI WRITE SAME(16) command, one of the commands used by VAAI.
These commands are not supported by the SCST (generic SCSI target subsystem for Linux) version used by MaxRep 3.0.4 and earlier.
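
The "Suspicious CDB" bytes in the log can be decoded by hand to confirm which command was rejected. The following is a minimal sketch, not a MaxRep tool: the function name decode_write_same16 is hypothetical, the field layout is the standard WRITE SAME(16) CDB (opcode in byte 0, 8-byte LBA in bytes 2-9, 4-byte block count in bytes 10-13), and a 512-byte logical block size is assumed.

```shell
# Decode a WRITE SAME(16) CDB given as 16 space-separated hex bytes.
# Hypothetical helper for illustration; assumes 512-byte logical blocks
# unless a different size is passed as the second argument.
decode_write_same16() {
    block_size=${2:-512}
    # Split the byte string into positional parameters $1..${16}.
    set -- $1
    [ "$#" -eq 16 ] || { echo "expected 16 CDB bytes, got $#" >&2; return 1; }
    opcode=$1
    lba=$3$4$5$6$7$8$9${10}            # bytes 2-9: logical block address
    nblocks=$((0x${11}${12}${13}${14}))  # bytes 10-13: number of blocks
    echo "opcode=0x$opcode lba=0x$lba blocks=$nblocks bytes=$((nblocks * block_size))"
}

decode_write_same16 "93 00 00 00 00 00 b5 2d 28 00 00 00 08 00 00 00"
# prints: opcode=0x93 lba=0x00000000b52d2800 blocks=2048 bytes=1048576
```

Under the 512-byte assumption, 2048 blocks correspond to 1048576 bytes, which is consistent with the "decoded value 1048576" reported in the SCST warning.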

The same file can be found in the logs: ./home/svsystems/var/.miscellaneous_log/messages*
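
Rather than paging through the file, the relevant entries could be filtered directly. This is a sketch only: the function name find_write_same_warnings is hypothetical, and the search patterns are copied from the log excerpt above, so they may need adjusting if the message wording differs on another release.

```shell
# Print SCST warnings about opcode 0x93 (WRITE SAME(16)) from one or more
# messages files. Hypothetical helper; patterns come from the excerpt above.
find_write_same_warnings() {
    grep -E 'scst_parse_cmd.*opcode 0x93|Invalid opcode = 0x93' "$@"
}

# Usage on the MaxRep Engine, or against the copy of the file in the logs:
#   find_write_same_warnings /var/log/messages
#   find_write_same_warnings ./home/svsystems/var/.miscellaneous_log/messages*
```

If nothing is printed, the file contains no matching lines for the period it covers.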

The issue can also be found in the controller logs (they must be collected on the same day as the synchronization failure):

  • Go to the controller folder of the node that owns the source LUN
  • Run the tracesan script
  • Run the following command to filter the traces for the MaxRep solution:
-bash-4.1$ egrep Inm sanlog | less

Mirror is created:

* 2016-07-18T13:46:41.537997623Z 526 scc_PmiInmMirrorCreate2Execute 176 "source slunGuid 0x6000b08414b3030313336313634000d2 mirrorLunGuid 0x50023830000002850000000000000000"
* 2016-07-18T13:46:41.537997796Z 526 scc_PmiInmMirrorCreate2Execute 179 "revision 0 mirrorLunNumber 0x3 ioTimeout 30"
* 2016-07-18T13:46:41.537997852Z 526 scc_PmiInmMirrorCreate2Execute 182 "channelFailoverSequence[0] 0xa2300d5bee960000 channelFailoverSequence[1] 0xa2300d5bee970000"
* 2016-07-18T13:46:41.538006316Z 526 scc_SlunUpdateInmMirrorIndex 2550 "slunGuid 0x6000b08414b3030313336313634000d2 (slunTid 0x800d2) mirrorIndex 0x0"
* 2016-07-18T13:46:41.538044646Z 526 san_EventInmMirrorActive 205 "MirrorActive: slunGuid: 0x6000b08414b3030313336313634000d2 mirrorStatus 1 addtnlInfo 0x0"

Mirror is failing (trace timestamps are in UTC; in this case the time zone set on MaxRep is 2 hours ahead of UTC):

* 2016-07-18T15:14:49.061700619Z 512 scc_InmChildTaskCheck 586 "jobHandle 0x1525 tgtNexus=0x10c status 0x02 sense 0x052000"
* 2016-07-18T15:14:49.061700806Z 512 scc_InmChildTaskCheck 681 "jobHandle 0x1525 tgtNexus 0x10c retry #17"
* 2016-07-18T15:14:49.061944991Z 512 scc_InmChildTaskCheck 586 "jobHandle 0x1525 tgtNexus=0x10c status 0x02 sense 0x052000"
* 2016-07-18T15:14:49.061945125Z 512 scc_InmChildTaskCheck 681 "jobHandle 0x1525 tgtNexus 0x10c retry #18"
* 2016-07-18T15:14:49.062219146Z 512 scc_InmChildTaskCheck 586 "jobHandle 0x1525 tgtNexus=0x10c status 0x02 sense 0x052000"
* 2016-07-18T15:14:49.062219428Z 512 scc_InmChildTaskCheck 681 "jobHandle 0x1525 tgtNexus 0x10c retry #19"
* 2016-07-18T15:14:49.062545570Z 512 scc_InmChildTaskCheck 586 "jobHandle 0x1525 tgtNexus=0x10c status 0x02 sense 0x052000"
* 2016-07-18T15:14:49.062545713Z 512 scc_InmChildTaskCheck 681 "jobHandle 0x1525 tgtNexus 0x10c retry #20"
* 2016-07-18T15:14:49.062789344Z 512 scc_InmChildTaskCheck 586 "jobHandle 0x1525 tgtNexus=0x10c status 0x02 sense 0x052000"
* 2016-07-18T15:14:49.062789511Z 512 scc_InmChildTaskCheck 654 "jobHandle 0x1525 tgtNexus 0x10c retries exceeded"
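
To line up the UTC trace timestamps above with the local-time entries in /var/log/messages, the offset can be applied as follows. This is a minimal sketch assuming GNU date (as found on typical Linux systems); the function name trace_to_local is hypothetical, and the 2-hour default only reflects the offset seen in this case.

```shell
# Convert a UTC trace timestamp to the syslog-style local time of the
# MaxRep Engine. Hypothetical helper; requires GNU date.
#   $1 = trace timestamp, e.g. 2016-07-18T15:14:49.061700619Z
#   $2 = hours the MaxRep time zone is ahead of UTC (default 2, as in this case)
trace_to_local() {
    offset_hours=${2:-2}
    # Reduce "2016-07-18T15:14:49.061700619Z" to "2016-07-18 15:14:49".
    clean=$(printf '%s' "$1" | sed -e 's/T/ /' -e 's/\.[0-9]*Z$//' -e 's/Z$//')
    epoch=$(date -u -d "$clean" +%s) || return 1
    LC_ALL=C date -u -d "@$((epoch + offset_hours * 3600))" '+%b %d %H:%M:%S'
}

trace_to_local 2016-07-18T15:14:49.061700619Z 2
# with GNU date, prints: Jul 18 17:14:49
```

The converted time falls within about two minutes of the 17:16:27 entries in the MaxRep messages file shown earlier.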

sense 0x052000 -> 5-20-00 Illegal Request - invalid/unsupported command code

Note: The SAN traces do not include the CDB (Command Descriptor Block), so the opcode 0x93 itself is visible only in /var/log/messages on the MaxRep Engine.
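
The sense value in the trace can be split into its three parts in the same way as shown above (sense key, ASC, ASCQ). The following sketch uses a hypothetical function name, decode_sense, and a lookup table that contains only the one code seen in this document; other codes would need to be looked up in the SCSI specifications.

```shell
# Split a 3-byte sense value from the SAN trace into key / ASC / ASCQ.
# Hypothetical helper; the table covers only the code seen in this case.
decode_sense() {
    hex=${1#0x}
    [ "${#hex}" -eq 6 ] || { echo "expected 6 hex digits" >&2; return 1; }
    key=$(printf '%s' "$hex" | cut -c1-2)
    asc=$(printf '%s' "$hex" | cut -c3-4)
    ascq=$(printf '%s' "$hex" | cut -c5-6)
    case "$key/$asc/$ascq" in
        05/20/00) text="ILLEGAL REQUEST - INVALID COMMAND OPERATION CODE" ;;
        *)        text="not in this sketch's table" ;;
    esac
    echo "key=0x$key asc=0x$asc ascq=0x$ascq: $text"
}

decode_sense 0x052000
# prints: key=0x05 asc=0x20 ascq=0x00: ILLEGAL REQUEST - INVALID COMMAND OPERATION CODE
```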

Mirror failed:

* 2016-07-18T15:14:49.062796706Z 512 scc_InmMirrorStatusUpdateFailed 1272 "mirror FAILED: slunGuid 0x6000b08414b3030313336313634000d2 slunTid 0x800d2 reason 0"
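
The key events shown above (mirror created, mirror active, retries exceeded, mirror failed) can be pulled out of the trace in one pass. This is a sketch under the assumption that the tracesan output is in a file named sanlog, as in the egrep command earlier; the function name mirror_timeline is hypothetical and the patterns are taken from the trace lines quoted in this document.

```shell
# Show only the mirror lifecycle events from a SAN trace file.
# Hypothetical helper; patterns come from the trace lines quoted above.
mirror_timeline() {
    grep -E 'MirrorCreate2Execute|MirrorActive|retries exceeded|mirror FAILED' "$1"
}

# Usage, from the controller folder after running the tracesan script:
#   mirror_timeline sanlog | less
```

The individual "retry #N" lines are left out on purpose so that the creation and failure times stand out.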

Solution

 There are two solutions:

  • Upgrade the MaxRep software to release 3.0.8 or later (the new Oracle MaxRep for SAN software supports the additional SCSI commands from VMware) and upgrade the FS1-2 to R6.2.12 or later.
    Note: Applying MaxRep 3.0.8 requires a reboot of the Engine(s).  All the Protection Plans will have to be resynchronized.

    Note: See Document 2192488.1, MaxRep: MaxRep Engine status is Unknown After upgrading FS1-2 to Version 6.2.11 or Higher.

     
  • Disable VAAI on all the ESX hosts that access the source LUN(s) causing the issue (see VMware KB 1033665).
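
For the second option, the sketch below prints the esxcli commands that set the three VAAI advanced settings to 0 (disabled). It only prints them so they can be reviewed first; they would then be run on each ESXi host. The option names are the ones commonly documented for ESXi 5.x and later and should be checked against VMware KB 1033665 for the ESXi version in use; the function name vaai_disable_commands is hypothetical.

```shell
# Print (do not run) the esxcli commands that disable the three VAAI
# primitives. Assumption: option names as documented for ESXi 5.x and later;
# verify against VMware KB 1033665 before running them on a host.
vaai_disable_commands() {
    for opt in /DataMover/HardwareAcceleratedMove \
               /DataMover/HardwareAcceleratedInit \
               /VMFS3/HardwareAcceleratedLocking; do
        echo "esxcli system settings advanced set --int-value 0 --option $opt"
    done
}

vaai_disable_commands
```

Setting the same options back to 1 would re-enable VAAI once MaxRep and the FS1-2 have been upgraded.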

 

References

<BUG:23642817> - MAXREP LUN IS CONTINUOUSLY IN RESYNC REQUIRED MODE.
<BUG:24367943> - MIRROR FAILED: SLUNGUID 0X6000B08414B303033313137343300010
<BUG:23635204> - MAXREP 3.0.4: PLAN GOES TO "RESYNC REQUIRED" WHEN NEW VM IS CREATED.

  Copyright © 2018 Oracle, Inc.  All rights reserved.