Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution

2111203.1: MaxRep: No data throughput on all Protection Plans causes the Recovery Point Objective (RPO) to exceed the default time limit threshold
Applies to:

Pillar Axiom Replication Engine (MaxRep) - Version 3.0 to 3.0 [Release 3.0]
Information in this document applies to any platform.

Symptoms

The default threshold for the MaxRep Recovery Point Objective (RPO) value is 30 minutes. An alert is sent if the RPO increases beyond this limit. The RPO threshold can be increased or decreased under Protect -> Manage Protection Plan: click Modify to change the Protection Plan, then select Modify Replication Options.

Under normal operation the RPO should be well below the default threshold of 30 minutes, but due to a known issue this value may start increasing and continue to increase until a workaround is applied by Oracle Support. Below is an example of the symptoms as seen in the MaxRep Graphical User Interface (GUI) under Monitor -> Volume Protection:
Symptoms of the issue can also be confirmed in the following way:

a) Open an SSH session to the source Engine IP address where the affected plans are running.

b) Check the source and target cache folders to determine whether the differentials are draining correctly.

c) Under /home/svsystems/transport/log/, review the file cxps.err.log for timeout errors like these:

2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67965.7fddfd5320a0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67966.7fde3ee100b0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67967.7fddc127d090) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67968.7fddf80556e0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67969.7fde3f43cc30) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67970.7fddc00cc0c0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67971.7fddc007e380) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:33 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67972.7fddfd9845f0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:33 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67973.7fddb826eb20) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:34 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67974.7fde3f43b9d0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:34 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67975.7fde3f869d90) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:34 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67976.7fde29104290) 10.1.20.113 asyncReadSome timed out (300)

The expected log entries during normal operation look like this:

2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec000920
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec001420
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec000be0
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec001a20
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec0010f0
2015-Oct-16 22:19:43 INFO REQUEST HANDLER WORKER THREAD STARTED: 0x7fdeec002070

d) Next, open an SSH session to the target Engine IP address.

e) Check the status of the cachemgr session threads that are currently running. If cachemgr is hung, the output looks like this:

[root@DRSANREP-01 ~]# netstat -apn | grep cachemgr
tcp    1   0  10.1.20.113:54666   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54693   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54737   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54674   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54768   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54683   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54745   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54637   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    0   0  10.1.20.113:47960   10.1.20.13:9443   ESTABLISHED  2209/cachemgr
tcp    1   0  10.1.20.113:54768   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54683   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54745   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54637   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    0   0  10.1.20.113:47939   10.1.20.13:9443   ESTABLISHED  2209/cachemgr
tcp    1   0  10.1.20.113:54743   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54739   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    1   0  10.1.20.113:54736   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr
tcp    0   0  10.1.20.113:47961   10.1.20.13:9443   ESTABLISHED  2209/cachemgr
tcp    1   0  10.1.20.113:54641   10.1.20.13:9443   CLOSE_WAIT   2209/cachemgr

The expected output during normal operation would look like this:

[root@DRSANREP-01 ~]# netstat -apn | grep cachemgr
tcp    0   0  10.1.20.113:47960   10.1.20.13:9443   ESTABLISHED  2209/cachemgr
tcp    0   0  10.1.20.113:47960   10.1.20.13:9443   ESTABLISHED  2209/cachemgr
tcp    0   0  10.1.20.113:47960   10.1.20.13:9443   ESTABLISHED  2209/cachemgr
tcp    0   0  10.1.20.113:47960   10.1.20.13:9443   ESTABLISHED  2209/cachemgr

The analysis above shows that cachemgr was not draining the differentials from the source/target cache: it was hung, with its connections stuck in the CLOSE_WAIT state.

Changes
Cause

The cause is an issue found in the Cache Manager module of the MaxRep software.

NOTE: According to Bug 22826866 and Bug 22507536 (the former is linked as a duplicate of base Bug 22507536), this Cache Manager issue is not feasible to fix. Please apply the workaround supplied in the Solution section of this document.
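The symptom checks described above (asyncReadSome timeouts in cxps.err.log, and cachemgr sockets stuck in CLOSE_WAIT) can be consolidated into one small script. This is an illustrative sketch, not part of the official Oracle procedure; the log path comes from this document, and the embedded sample data simply mirrors the outputs shown in the Symptoms section.

```shell
#!/bin/sh
# Illustrative sketch only -- not part of the official Oracle procedure.
# The log path and addresses below are examples taken from this document.

# Count asyncReadSome timeout errors recorded in cxps.err.log.
count_timeouts() {
    grep -c 'asyncReadSome timed out' "$1"
}

# Count cachemgr sockets stuck in CLOSE_WAIT (reads `netstat -apn` output on stdin).
count_close_wait() {
    grep cachemgr | grep -c 'CLOSE_WAIT'
}

# --- demo with sample data modelled on the outputs shown above ---
tmplog=$(mktemp)
cat > "$tmplog" <<'EOF'
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67965.7fddfd5320a0) 10.1.20.113 asyncReadSome timed out (300)
2016-Feb-24 12:35:32 ERROR [at cxpslib/session.cpp:handleTimeout:245] (sid: 67966.7fde3ee100b0) 10.1.20.113 asyncReadSome timed out (300)
EOF

count_timeouts "$tmplog"          # prints 2 for this sample

printf '%s\n' \
  'tcp 1 0 10.1.20.113:54666 10.1.20.13:9443 CLOSE_WAIT 2209/cachemgr' \
  'tcp 0 0 10.1.20.113:47960 10.1.20.13:9443 ESTABLISHED 2209/cachemgr' \
  | count_close_wait              # prints 1 for this sample

rm -f "$tmplog"
```

On a live Engine one would point count_timeouts at /home/svsystems/transport/log/cxps.err.log and pipe real `netstat -apn` output into count_close_wait; a nonzero CLOSE_WAIT count alongside climbing timeouts matches the hung state described above.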
Solution

a) On the Target Engine, restart the vxagent:

[root@DRSANREP-01 ~]# service vxagent restart
Volume Agent daemon is not running!
Volume Agent daemon is running...

b) Then take one of the Source LUN Globally Unique Identifiers (GUID) from the MaxRep Graphical User Interface (GUI).

NOTE: In this example, the second pair from the screenshot above is used.
c) Next, run the following command on the Control Service Engine to obtain the name of the virtual LUN:

[root@DRSANREP-01 ~]# mysql -u root -psvsHillview svsdb1 -e "select * from applianceTargetLunMapping where sharedDeviceId='36000b08414b30303330333735320000f'";
+-----------------------------+-----------------------------------+---------------------------------+--------------------------------------+

d) Then execute the following command on the Source Engine to pull statistics from the virtual LUN and check the number of committed changes and bytes:

[root@PRODSANREP-01 ~]# /usr/local/InMage/Vx/bin/inm_dmit --get_volume_stat inmage0000000033
Volume State : Filtering Enabled , Read-Write
History

e) Repeat the same command a few minutes later to confirm the values are increasing:

[root@PRODSANREP-01 ~]# /usr/local/InMage/Vx/bin/inm_dmit --get_volume_stat inmage0000000033
Volume State : Filtering Enabled , Read-Write

f) Finally, monitor the MaxRep Graphical User Interface to confirm that the Recovery Point Objective (RPO) decreases, eventually falling below the default threshold.

References

<BUG:22826866> - MAXREP HIGH RPO ON ALL PROTECTION PLANS WITH NO DATA THROUGHPUT
<BUG:22507536> - SUSPECTED MEMORY LEAK WITH MAXREP CACHE MANAGER

Attachments

This solution has no attachment
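As a supplement to steps (d) and (e) of the Solution, the two inm_dmit snapshots can be compared automatically to confirm the counters are increasing. This is an illustrative sketch under stated assumptions: the "Bytes Committed" label and the snapshot contents below are hypothetical examples, since the exact fields reported by inm_dmit vary; adjust the label to match the counter shown on your system.

```shell
#!/bin/sh
# Illustrative sketch only: compares two snapshots of a numeric counter to
# confirm it is increasing, as in steps (d)/(e). The "Bytes Committed" label
# is an assumption -- adjust it to the counter your inm_dmit output reports.

# Print the first number found on the line matching the given label
# in a saved inm_dmit snapshot file ($1 = file, $2 = label).
get_counter() {
    awk -v label="$2" '$0 ~ label { for (i = 1; i <= NF; i++) if ($i + 0 == $i) { print $i; exit } }' "$1"
}

# --- demo with two hypothetical snapshots taken a few minutes apart ---
s1=$(mktemp); s2=$(mktemp)
printf 'Volume State : Filtering Enabled , Read-Write\nBytes Committed : 1048576\n' > "$s1"
printf 'Volume State : Filtering Enabled , Read-Write\nBytes Committed : 2097152\n' > "$s2"

a=$(get_counter "$s1" 'Bytes Committed')
b=$(get_counter "$s2" 'Bytes Committed')

if [ "$b" -gt "$a" ]; then
    echo "OK: counter increasing ($a -> $b), differentials appear to be draining"
else
    echo "WARNING: counter not increasing ($a -> $b), cachemgr may still be hung"
fi

rm -f "$s1" "$s2"
```

On a live Source Engine the two snapshot files would be produced by redirecting `inm_dmit --get_volume_stat <virtual LUN>` to a file twice, a few minutes apart.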