Oracle ZFS Storage Appliance: Infiniband mcxnex WARNINGS - retry counter exceeded - filling up debug.sys logs on ZFS-SA arrays inside Exalogic running Oracle VM Servers

Asset ID:	1-72-2243804.1
Update Date:	2017-12-07
Keywords:

Solution Type Problem Resolution Sure

Solution 2243804.1 : Oracle ZFS Storage Appliance: Infiniband mcxnex WARNINGS - retry counter exceeded - filling up debug.sys logs on ZFS-SA arrays inside Exalogic running Oracle VM Servers

Applies to:

Exalogic Elastic Cloud X5-2 Hardware - Version X5 and later
Oracle Exalogic Elastic Cloud Software - Version 2.0.6.2.170418 to 2.0.6.2.170418
7000 Appliance OS (Fishworks)

Symptoms

Infiniband mcxnex "retry counter exceeded" WARNING messages are filling up the debug.sys logs on ZFSSA arrays inside Exalogic racks running Oracle VM Servers.

PROBLEM DESCRIPTION & SYMPTOMS:
-------------------------------
On Exalogic 2.0.6.x.x virtual configurations with OVM 3.2.9, the customer sees many "retry counter exceeded" messages filling up the ZFSSA debug.sys log files.

No clients are affected. The messages do not identify the client for which the warning is printed.

Mar 30 01:32:07 el01sn01 genunix: [ID 549104 kern.warning] WARNING: mcxnex0: CQE transport retry counter exceeded
Mar 30 01:33:55 el01sn01 genunix: [ID 801312 kern.warning] WARNING: mcxnex0: CQE ERR: cqe fffff6103509e040 QPN 58 indx 2 status 0x15 vendor syndrome 81
Mar 30 01:33:55 el01sn01 genunix: [ID 549104 kern.warning] WARNING: mcxnex0: CQE transport retry counter exceeded
Mar 30 01:39:57 el01sn01 genunix: [ID 801312 kern.warning] WARNING: mcxnex0: CQE ERR: cqe fffff60fa7ca55a0 QPN 300085 indx 12d status 0x15 vendor syndrome 81
Mar 30 01:39:57 el01sn01 genunix: [ID 549104 kern.warning] WARNING: mcxnex0: CQE transport retry counter exceeded
Mar 30 01:53:13 el01sn01 genunix: [ID 801312 kern.warning] WARNING: mcxnex0: CQE ERR: cqe fffff60fa7ca32e0 QPN 14007d indx 17 status 0x15 vendor syndrome 81
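
Because the messages do not name a client, it can help to see whether the errors cluster on a few queue pairs. The following is a minimal sketch, not a supported tool: it parses "CQE ERR" lines of the form shown above and counts them per device and QPN. Treating the QPN, indx and syndrome fields as hexadecimal text is an assumption based on sample values such as "14007d" and "12d"; the names CQE_ERR, SAMPLE and count_cqe_errors are hypothetical.

```python
import re
from collections import Counter

# Pattern for the "CQE ERR" line as it appears in the excerpt above.
# Assumption: QPN, indx and vendor syndrome are printed as bare hex digits.
CQE_ERR = re.compile(
    r"WARNING: (?P<dev>mcxnex\d+): CQE ERR: cqe (?P<cqe>[0-9a-f]+) "
    r"QPN (?P<qpn>[0-9a-f]+) indx (?P<indx>[0-9a-f]+) "
    r"status (?P<status>0x[0-9a-f]+) vendor syndrome (?P<syndrome>[0-9a-f]+)"
)

# Three lines copied from the excerpt above (two CQE ERR lines, one summary line).
SAMPLE = [
    "Mar 30 01:33:55 el01sn01 genunix: [ID 801312 kern.warning] WARNING: mcxnex0: "
    "CQE ERR: cqe fffff6103509e040 QPN 58 indx 2 status 0x15 vendor syndrome 81",
    "Mar 30 01:33:55 el01sn01 genunix: [ID 549104 kern.warning] WARNING: mcxnex0: "
    "CQE transport retry counter exceeded",
    "Mar 30 01:39:57 el01sn01 genunix: [ID 801312 kern.warning] WARNING: mcxnex0: "
    "CQE ERR: cqe fffff60fa7ca55a0 QPN 300085 indx 12d status 0x15 vendor syndrome 81",
]


def count_cqe_errors(lines):
    """Return a Counter keyed by (device, QPN) with one count per CQE ERR line."""
    counts = Counter()
    for line in lines:
        match = CQE_ERR.search(line)
        if match:
            counts[(match.group("dev"), match.group("qpn"))] += 1
    return counts


if __name__ == "__main__":
    # Print the busiest (device, QPN) pairs first.
    for (dev, qpn), n in count_cqe_errors(SAMPLE).most_common():
        print(dev, "QPN", qpn, "errors:", n)
```

A QPN that dominates the counts points at one connection (and so one peer) rather than a fabric-wide problem; mapping a QPN back to a client is outside what the log lines themselves provide.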

Changes

The reported issue has been seen on Exalogic X5-2 racks. That it is specific to the X5-2 is only an observation so far.

It has only been seen with vServers running, specifically under OVM (“Oracle VM Manager”).

Cause

Oracle VM Server Kernel version: 2.6.39-400.283.1.el5uek

Oracle VM Manager Version: BUILDID=3.2.9.746

Engineered system - Exalogic 2.0.6.2.4 (OVM 3.2.9)

ANALYSIS AND RESEARCH

The following explains the pairs of error messages printed on the ZFSSA, which take the general form:

WARNING: mcxnex0: CQE ERR: cqe fffff61bcf02f280 QPN 4000f7 indx 14 status 0x15 vendor syndrome 81
WARNING: mcxnex0: CQE transport retry counter exceeded

When an Infiniband stack client (a ULP, or upper layer protocol) creates a queue pair (QP), it specifies a retry timeout and a retry count to the hardware. These dictate what the HCA hardware/firmware does once a message is sent via that QP onto the fabric.

If the local HCA, having sent the message once on the fabric, receives no reply from the peer hardware within <retry timeout>, the local HCA hardware retries the message.

This happens up to <retry count> times, at which point the HCA hardware/firmware issues an error completion to the upper layer protocol (client) that owns the QP: one for each message that is in flight and not yet acknowledged by the peer HCA hardware.

Each error completion is also logged by the mcxnex driver (the message pair you see) to the system messages file. Each pair of messages in the messages file represents one in-flight, unacknowledged message that the peer hardware has not responded to within (retry timeout * retry count).
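
To give a feel for the size of the (retry timeout * retry count) window, here is a small worked sketch. It relies on the InfiniBand encoding of the QP Local ACK Timeout field, where a value n means 4.096 microseconds * 2^n. The values 14 and 7 used below are illustrative assumptions only; this note does not state what the ULPs on the ZFSSA actually configure, and the function names are hypothetical.

```python
# Base unit of the InfiniBand Local ACK Timeout encoding, in seconds.
IB_ACK_TIMEOUT_BASE = 4.096e-6


def ack_timeout_seconds(timeout_exponent):
    """Convert an encoded Local ACK Timeout value n to seconds (4.096 us * 2**n)."""
    if not 1 <= timeout_exponent <= 31:
        # The field is 5 bits wide and 0 is not a finite timeout.
        raise ValueError("timeout exponent must be in the range 1..31")
    return IB_ACK_TIMEOUT_BASE * (2 ** timeout_exponent)


def retry_window_seconds(timeout_exponent, retry_count):
    """Approximate window before an error completion: retry timeout * retry count."""
    return ack_timeout_seconds(timeout_exponent) * retry_count


if __name__ == "__main__":
    # Illustrative values only (assumption): exponent 14, retry count 7.
    print(f"per-attempt timeout: {ack_timeout_seconds(14):.6f} s")
    print(f"window before error: {retry_window_seconds(14, 7):.6f} s")
```

With these assumed values the per-attempt timeout is about 67 ms and the whole window is under half a second, so a peer that is rebooting or briefly unreachable can easily outlast it and produce one message pair per unacknowledged message.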

Usually this is because the peer rebooted, causing its HCA hardware to be reset. Alternatively, it might be the result of a very bad cable or transceiver on the peer node (not this node).

If there is no impact on the customer's applications and no errors are observed from the operating system or IB switches, these messages can be ignored.

Solution

Ask the customer whether there is an access issue with NFS or iSCSI over Infiniband.

If there is "NO IMPACT" on customer applications and no errors are observed from the operating system or IB switches, these messages can be ignored.

But if there is an issue with access to shares or LUNs, continue the investigation.

Ask the customer whether they made any changes, or shut down or rebooted any nodes connected to Infiniband. Remember that it can happen anywhere on the IB fabric, from the client through the IB switch out to the ZFSSA.
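
Once the customer supplies times for any changes or reboots, they can be lined up against the warning timestamps. The sketch below is one possible way to do that and is not part of any Oracle tooling: the event name, the event time, the year and the 10-minute window are all hypothetical inputs, and the syslog timestamp is assumed to use English month abbreviations.

```python
from datetime import datetime, timedelta

RETRY_MSG = "transport retry counter exceeded"


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; syslog lines omit the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def warnings_near_events(log_lines, events, year, window=timedelta(minutes=10)):
    """Map each event name to the warning times logged within `window` after it."""
    hits = {name: [] for name in events}
    for line in log_lines:
        if RETRY_MSG not in line:
            continue
        stamp = parse_syslog_time(line, year)
        for name, when in events.items():
            if timedelta(0) <= stamp - when <= window:
                hits[name].append(stamp)
    return hits


if __name__ == "__main__":
    lines = [
        "Mar 30 01:32:07 el01sn01 genunix: [ID 549104 kern.warning] WARNING: "
        "mcxnex0: CQE transport retry counter exceeded",
        "Mar 30 01:39:57 el01sn01 genunix: [ID 549104 kern.warning] WARNING: "
        "mcxnex0: CQE transport retry counter exceeded",
    ]
    # Hypothetical event reported by the customer; the year is an assumption.
    events = {"compute node reboot": datetime(2017, 3, 30, 1, 30, 0)}
    for name, stamps in warnings_near_events(lines, events, 2017).items():
        print(name, "->", len(stamps), "warning(s) within the window")
```

Warnings that fall shortly after a known reboot or maintenance action fit the peer-reset explanation given above; warnings with no matching event are the ones worth pursuing with cable, transceiver and switch checks.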

The related bugs are below:

Bug 25090540 - Packets getting retransmitted sent by ZFSSA, OVS server doesn't send ACK packet

Bug 23248069 - retry counter exceeded warning messages filling up the logs

Attachments

This solution has no attachment