Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 2243804.1: Oracle ZFS Storage Appliance: Infiniband mcxnex WARNINGS - retry counter exceeded - filling up debug.sys logs on ZFS-SA arrays inside Exalogic running Oracle VM Servers
Created from <SR 3-14372461221>

Applies to:
Exalogic Elastic Cloud X5-2 Hardware - Version X5 and later
Oracle Exalogic Elastic Cloud Software - Version 2.0.6.2.170418 to 2.0.6.2.170418
7000 Appliance OS (Fishworks)

Symptoms
InfiniBand mcxnex "retry counter exceeded" WARNING messages are filling up the debug.sys logs on ZFSSA arrays inside Exalogic racks running Oracle VM Servers:
Mar 30 01:32:07 el01sn01 genunix: [ID 549104 kern.warning] WARNING: mcxnex0: CQE transport retry counter exceeded
Changes
So far, the reported issue has been observed only on Exalogic X5-2 racks, and only while vServers are running, specifically in Oracle VM (OVM) environments managed by Oracle VM Manager.
Cause
Oracle VM Server kernel version: 2.6.39-400.283.1.el5uek
Oracle VM Manager version: BUILDID=3.2.9.746
Engineered system: Exalogic 2.0.6.2.4 (OVM 3.2.9)
ANALYSIS AND RESEARCH
The following explains the pairs of error messages printed on the ZFSSA, which have the general form:
WARNING: mcxnex0: CQE ERR: cqe fffff61bcf02f280 QPN 4000f7 indx 14 status 0x15 vendor syndrome 81
When the local HCA sends a message on the fabric and receives no reply from the peer hardware within <retry timeout>, the local HCA hardware retries the message. This happens up to <retry count> times, after which the HCA hardware/firmware issues an error completion to the upper layer protocol (client) that owns the QP, one for each message that is in flight and unacknowledged by the peer HCA hardware. Each error completion is also logged by the mcxnex driver to the system messages file (this is the message pair you see). Each pair of messages in the messages file therefore represents one in-flight message that the peer hardware did not acknowledge within (retry timeout * retry count).

Usually this happens because the peer rebooted, causing its HCA hardware to be reset. Alternatively, it might be the result of a very bad cable or transceiver on the peer node (not this node). If there is no impact on the customer's applications and no errors are observed from the operating system or IB switches, these messages can be ignored.
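As a rough illustration of the (retry timeout * retry count) window described above, the sketch below converts an InfiniBand encoded Local ACK Timeout into seconds (4.096 microseconds * 2^n) and multiplies it by the retry count. The encoded timeout of 14 and retry count of 7 are assumed example values, not settings read from an Exalogic rack or a ZFSSA, and the function names are hypothetical. Real HCAs may wait somewhat longer than the nominal timeout, so treat the result as an approximation.

```python
# Sketch: approximate time an unacknowledged message is retried before the
# HCA reports an error completion, using the note's (retry timeout * retry count).
# The example inputs are illustrative assumptions, not values from a real system.

IB_ACK_TIMEOUT_BASE_US = 4.096  # InfiniBand Local ACK Timeout unit, in microseconds


def ack_timeout_seconds(encoded_timeout: int) -> float:
    """Convert the 5-bit encoded Local ACK Timeout to seconds (4.096 us * 2**n)."""
    if not 0 < encoded_timeout <= 31:
        raise ValueError("encoded timeout must be in 1..31 (0 means no timeout)")
    return IB_ACK_TIMEOUT_BASE_US * (2 ** encoded_timeout) / 1_000_000


def retry_window_seconds(encoded_timeout: int, retry_count: int) -> float:
    """Approximate window before the error completion: retry timeout * retry count."""
    if not 0 <= retry_count <= 7:
        raise ValueError("retry count is a 3-bit field (0..7)")
    return ack_timeout_seconds(encoded_timeout) * retry_count


if __name__ == "__main__":
    # Hypothetical QP settings: encoded timeout 14, retry count 7.
    print(f"{retry_window_seconds(14, 7):.3f} s")
```

With these assumed values the window is well under a second, which is consistent with a rebooting peer producing a burst of message pairs, one per in-flight message.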
Solution
Ask the customer whether there is an access issue with NFS or iSCSI over InfiniBand. If there is no impact on customer applications and no errors are observed from the operating system or IB switches, these messages can be ignored.

If there is an issue with access to shares or LUNs, continue the investigation. Ask the customer whether they made any changes, or shut down or rebooted any nodes connected to InfiniBand. Remember that the cause can lie anywhere on the IB fabric, from the client through the IB switch to the ZFSSA.
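When checking whether the warnings line up with reboots or other changes on the fabric, it can help to tally them by time and by queue pair. The following is a minimal sketch that assumes the log lines look exactly like the two messages quoted in this note. The regular expressions and the `summarize` helper are illustrative, not part of any Oracle tool, and other message variants are not handled.

```python
import re
from collections import Counter

# Patterns follow the two message shapes quoted in this note; any other
# variants that a real debug.sys log may contain are not covered.
RETRY_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+(?P<host>\S+)\s+.*"
    r"WARNING: (?P<hca>mcxnex\d+): CQE transport retry counter exceeded"
)
CQE_ERR_RE = re.compile(
    r"WARNING: (?P<hca>mcxnex\d+): CQE ERR: .*?QPN (?P<qpn>[0-9a-fA-F]+)"
)


def summarize(lines):
    """Count retry-exceeded warnings per (host, minute) and CQE errors per (HCA, QPN)."""
    per_minute = Counter()
    per_qpn = Counter()
    for line in lines:
        m = RETRY_RE.search(line)
        if m:
            per_minute[(m["host"], m["stamp"])] += 1
            continue
        m = CQE_ERR_RE.search(line)
        if m:
            per_qpn[(m["hca"], m["qpn"])] += 1
    return per_minute, per_qpn


if __name__ == "__main__":
    # Sample lines copied from this note; in practice, read them from the log file.
    sample = [
        "Mar 30 01:32:07 el01sn01 genunix: [ID 549104 kern.warning] "
        "WARNING: mcxnex0: CQE transport retry counter exceeded",
        "WARNING: mcxnex0: CQE ERR: cqe fffff61bcf02f280 QPN 4000f7 "
        "indx 14 status 0x15 vendor syndrome 81",
    ]
    per_minute, per_qpn = summarize(sample)
    for key, count in sorted(per_minute.items()):
        print("retry exceeded", key, count)
    for key, count in sorted(per_qpn.items()):
        print("CQE ERR", key, count)
```

Bursts concentrated in a few minutes and on a few QPNs would fit the peer-reboot explanation, while a steady stream would suggest looking at cabling or transceivers on the peer node.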
The related bugs are:
Bug 25090540 - Packets getting retransmitted sent by ZFSSA, OVS server doesn't send ACK packet
Bug 23248069 - retry counter exceeded warning messages filling up the logs
Attachments
This solution has no attachment