Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2111203.1
Update Date: 2018-04-25
Keywords:

Solution Type: Problem Resolution Sure Solution

Solution  2111203.1 :   MaxRep: No data throughput on all Protection Plans causes the Recovery Point Objective (RPO) to exceed the default time limit threshold  


Related Items
  • Pillar Axiom Replication Engine (MaxRep)
Related Categories
  • PLA-Support>Sun Systems>DISK>Axiom>SN-DK: MaxRep-2x




In this Document
Symptoms
Changes
Cause
Solution
References


Applies to:

Pillar Axiom Replication Engine (MaxRep) - Version 3.0 to 3.0 [Release 3.0]
Information in this document applies to any platform.

Symptoms

The default threshold for the MaxRep Recovery Point Objective (RPO) is 30 minutes, and an alert is sent if the RPO exceeds this limit. The threshold can be increased or decreased under Protect -> Manage Protection Plan: click Modify on the Protection Plan and select Modify Replication Options. Under normal operation the RPO should stay well below the default 30-minute threshold, but due to a known issue this value may start increasing and continue to climb until a workaround is applied by Oracle Support.

Below is an example of the symptoms as seen in the MaxRep Graphical User Interface (GUI) under Monitor -> Volume Protection:

 

[Screenshot: Pairs with high and increasing RPO]

 

Symptoms of the issue can also be determined in the following way:

a) Open an SSH session to the source Engine IP address where the affected plans are running.

b) Check the source and target cache folders to determine whether the differentials are draining correctly.
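
A minimal shell sketch of this check is below; the cache folder path is a placeholder (the actual location is not given in this document) and must be replaced with the cache directory used by the affected Protection Plan:

# Placeholder path - substitute the real cache folder for the affected plan
CACHE_DIR=/path/to/cache/folder
# Count the differential files twice, a few minutes apart; a count that
# only grows indicates the differentials are not draining
ls "${CACHE_DIR}" | wc -l
sleep 300
ls "${CACHE_DIR}" | wc -l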

c) Under /home/svsystems/transport/log/, review the file cxps.err.log for timeout errors like these: 

2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67965.7fddfd5320a0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67966.7fde3ee100b0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67967.7fddc127d090) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67968.7fddf80556e0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67969.7fde3f43cc30) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67970.7fddc00cc0c0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67971.7fddc007e380) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:33 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67972.7fddfd9845f0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:33 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67973.7fddb826eb20) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:34 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67974.7fde3f43b9d0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:34 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67975.7fde3f869d90) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:34 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67976.7fde29104290) 10.1.20.113 asyncReadSome timed out (300)
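
These errors can also be located quickly with grep; a minimal sketch using the log file from step c):

# Count asyncReadSome timeout errors in the transport log
grep -c "asyncReadSome timed out" /home/svsystems/transport/log/cxps.err.log
# Show the most recent occurrences
grep "asyncReadSome timed out" /home/svsystems/transport/log/cxps.err.log | tail -20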

The expected log entries during normal operation look like this:

2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec000920
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec001420
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec000be0
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec001a20
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec0010f0
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec002070

d) Next, open an SSH session to the target Engine IP address.

e) Check the status of the currently running cachemgr session threads. If cachemgr is hung, the output looks like this:

[root@DRSANREP-01 ~]# netstat -apn| grep cachemgr
tcp 1 0 10.1.20.113:54666 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54693 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54737 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54674 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54768 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54683 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54745 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54637 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 0 0 10.1.20.113:47960 10.1.20.13:9443 ESTABLISHED 2209/cachemgr
tcp 1 0 10.1.20.113:54768 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54683 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54745 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54637 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 0 0 10.1.20.113:47939 10.1.20.13:9443 ESTABLISHED 2209/cachemgr
tcp 1 0 10.1.20.113:54743 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54739 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 1 0 10.1.20.113:54736 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr
tcp 0 0 10.1.20.113:47961 10.1.20.13:9443 ESTABLISHED 2209/cachemgr
tcp 1 0 10.1.20.113:54641 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr

The expected output during normal operation would look like this:

[root@DRSANREP-01 ~]# netstat -apn| grep cachemgr
tcp 0 0 10.1.20.113:47960 10.1.20.13:9443 ESTABLISHED 2209/cachemgr
tcp 0 0 10.1.20.113:47960 10.1.20.13:9443 ESTABLISHED 2209/cachemgr
tcp 0 0 10.1.20.113:47960 10.1.20.13:9443 ESTABLISHED 2209/cachemgr
tcp 0 0 10.1.20.113:47960 10.1.20.13:9443 ESTABLISHED 2209/cachemgr

The above example shows that cachemgr was not draining the differentials from the source/target cache because its sessions were hung in the "CLOSE_WAIT" state.
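
The hung condition can also be quantified by counting the cachemgr sockets stuck in CLOSE_WAIT; a minimal sketch reusing the same netstat command:

# A healthy cachemgr shows only ESTABLISHED sessions to the Control Service;
# a growing CLOSE_WAIT count points to the hung condition described above
netstat -apn | grep cachemgr | grep -c CLOSE_WAIT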

Changes

 

Cause

The cause is an issue found in the Cache Manager module of the MaxRep software.

NOTE: According to Bug 22826866 and Bug 22507536 (the former is linked as a duplicate of Base Bug 22507536), this Cache Manager issue is not feasible to fix. Apply the workaround supplied in the Solution section of this document.

Solution

a) On the Target Engine, restart the vxagent:

[root@DRSANREP-01 ~]# service vxagent restart
Volume Agent daemon is running...
appservice daemon is running...
Stopping VolumeAgent daemon...
All VolumeAgent-related processes are successfully terminated!

Volume Agent daemon is not running!
appservice daemon is not running!
Starting Volume Agent daemon...
kernel.hotplug =
Starting appservice
Starting fabricagent
Starting vacps
Starting svagents

Volume Agent daemon is running...
appservice daemon is running...
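
Optionally, confirm that the agent processes came back up after the restart; a minimal sketch (the exact process names may differ slightly from the daemon names printed above):

# Check for the replication agent processes after the restart
ps -ef | grep -E 'cachemgr|svagents|appservice' | grep -v grep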

b) Then take one of the Source LUN Globally Unique Identifiers (GUIDs) shown in the MaxRep Graphical User Interface (GUI).

NOTE: This example uses the second pair from the screenshot above.

c) Next run the following command on the Control Service Engine to obtain the name of the virtual LUN. 

[root@DRSANREP-01 ~]# mysql -u root -psvsHillview svsdb1 -e "select * from applianceTargetLunMapping where sharedDeviceId='36000b08414b30303330333735320000f'";

+-----------------------------+-----------------------------------+---------------------------------+--------------------------------------+
| applianceTargetLunMappingId |           sharedDeviceId          | applianceTargetLunMappingNumber |        processServerId               |
+-----------------------------+-----------------------------------+---------------------------------+--------------------------------------+
|      inmage0000000033       | 36000b08414b303031333631363400001 |                1                | d39dc123-44b5-4003-a003-9e71bb09a25f |
+-----------------------------+-----------------------------------+---------------------------------+--------------------------------------+
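
To run the same lookup for a different pair, substitute its Source LUN GUID into the query; a minimal sketch reusing the command from step c):

# Replace the value below with the Source LUN GUID copied from the MaxRep GUI
GUID='36000b08414b30303330333735320000f'
mysql -u root -psvsHillview svsdb1 -e "select * from applianceTargetLunMapping where sharedDeviceId='${GUID}'"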

d) Then execute the following command on the Source Engine to pull statistics from the Virtual LUN and check the number of committed changes and bytes:

[root@PRODSANREP-01 ~]# /usr/local/InMage/Vx/bin/inm_dmit --get_volume_stat inmage0000000033

Volume State : Filtering Enabled , Read-Write
Filtering Mode/Write Order State : Data/Bitmap
Time spent in Curr mode/state (sec) : 756/32514
Pending Changes/bytes : 16455/1144128512
Pending Changes in metadata : 16455
Commited Changes/bytes : 1414583325/20582941190656                    <-- The values to monitor
Total Write IOs received : 23015129497600 (1669596027 bytes)
Total Write IO Cancels received : 0 (0 bytes)
Data Files Created/Pending : 1355/0
Data File Disk Space (Alloc/Used) : 268435456/0
Bitmap Changes Queued/bytes : 0/0
Bitmap Changes Read/bytes : 226176/2786160558080 (1337 times)
Bitmap Changes Written/bytes : 225212432/5216370718720 (74 times)
Pending Changes in each state : Data = 0 | Meta = 0 | Bitmap = 16455
Pages Allocated : 0
No. of Pages Reserved : 4096
No. of change-node pages : 1371294
Threshold for DF in pages : 1638
Pages in DF Queue : 10601996
Changes Lost : No
DB Notify Threshold : 33554432
Tags Dropped : 0
Metadata transition due to delay Data Pool allocation : 43
Total Mode Transistions : Data = 184 | Meta = 182
Total Time spent in each Mode(sec) : Data = 8459484 | Meta = 160454
Total State Transistions : Data = 86 | Meta = 85 | Bitmap = 7
Total Time spent in each State(sec) : Data = 7617205 | Meta = 7639 | Bitmap = 963282
IO Pattern
OP 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1M 2M 4M 8M 8M+
W 30040888 43399007 20604762 440012935 23549633 1034855535 41383966 10171515 7672848 9501502 2210587 3285865 2906984 0 0 0

History
Start filtering issued at (sec) : 1447781306
Clear stats issued at (sec) : 0
Clear diffs issued : 1
Last Clear diffs issued at (sec) : 1447781466
Times Resync marked : 0
Last Resync marked at(sec) : 0
Last Resync Error : 0 

e) Repeat the same command a few minutes later to ensure the values are increasing:

[root@PRODSANREP-01 ~]# /usr/local/InMage/Vx/bin/inm_dmit --get_volume_stat inmage0000000033

Volume State : Filtering Enabled , Read-Write
Filtering Mode/Write Order State : Data/Bitmap
Time spent in Curr mode/state (sec) : 1367/33125
Pending Changes/bytes : 11456/31971426304
Pending Changes in metadata : 9896
Commited Changes/bytes : 1414659000/20586602525184                  <-- NOTE: both values have increased
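
Instead of re-running the full command by hand, the committed counters can be sampled in a small loop; a minimal sketch (the grep string matches the "Commited Changes/bytes" line exactly as the tool prints it; stop the loop with Ctrl+C):

# Sample the committed changes/bytes counters every 5 minutes;
# both values should keep increasing once cachemgr is draining again
while true; do
    date
    /usr/local/InMage/Vx/bin/inm_dmit --get_volume_stat inmage0000000033 | grep "Commited Changes"
    sleep 300
done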

f) To conclude, monitor the MaxRep Graphical User Interface to ensure that the Recovery Point Objective (RPO) decreases, eventually falling below the default threshold.

References

<BUG:22826866> - MAXREP HIGH RPO ON ALL PROTECTION PLANS WITH NO DATA THROUGHPUT
<BUG:22507536> - SUSPECTED MEMORY LEAK WITH MAXREP CACHE MANAGER

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.