What Does "Hermon0: CQE Transport Retry Counter Exceeded" in /var/log/messages File Mean?

Asset ID:	1-72-1614576.1
Update Date:	2016-04-21
Keywords:

Solution Type:	Problem Resolution Sure

Solution 1614576.1 : What Does "Hermon0: CQE Transport Retry Counter Exceeded" in /var/log/messages File Mean?

Applies to:

Oracle Exadata Hardware - Version 11.2.3.1.1 and later
SPARC SuperCluster T4-4 Half Rack - Version All Versions and later
Solaris SPARC Operating System - Version 11.1 to 11.1 [Release 11.0]
Information in this document applies to any platform.

Symptoms

The following entries appear in the /var/log/messages file. What do they mean?

hermon0: CQE transport retry counter exceeded

WARNING: mcxnex0: CQE ERR: cqe fffff61bcf02f280 QPN 4000f7 indx 14 status 0x15 vendor syndrome 81
WARNING: mcxnex0: CQE transport retry counter exceeded

Cause

The message "hermon0: CQE transport retry counter exceeded" (or its mcxnex0 equivalent) comes from the InfiniBand (IB) HCA driver and simply means that the IB connection has gone down.

Solution

When an InfiniBand stack client (a ULP, or upper layer protocol) creates a queue pair (QP), it specifies a retry timeout and a retry count to the hardware. These values dictate what the HCA hardware/firmware does once a message is sent via that QP onto the fabric.

If the local HCA, having sent the message once on the fabric, receives no reply from the peer hardware within <retry timeout>, the local HCA hardware retries the message.

This happens up to <retry count> times, after which the HCA hardware/firmware issues an error completion to the upper layer protocol (client) that owns the QP: one for each message in flight and not yet acknowledged by the peer HCA hardware.
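The retry behaviour described above can be sketched as a small simulation. This is only an illustrative model, not driver or firmware code: the names QueuePairConfig, send_with_retries and peer_acks are hypothetical, and the model assumes one initial send followed by up to <retry count> retries, each waiting one <retry timeout>.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class QueuePairConfig:
    """Hypothetical stand-in for the values a ULP gives the HCA for a QP."""
    retry_timeout_s: float  # assumed: seconds to wait for a reply per attempt
    retry_count: int        # assumed: number of retries after the first send


def send_with_retries(
    cfg: QueuePairConfig,
    peer_acks: Callable[[int], bool],
) -> Tuple[str, int, float]:
    """Model one message: send it, retry on silence, then give up.

    peer_acks(attempt) returns True if the peer replies to that attempt
    (attempt 0 is the initial send). Returns (status, attempts, waited_s).
    """
    waited_s = 0.0
    attempts = 0
    for attempt in range(cfg.retry_count + 1):  # initial send plus retries
        attempts += 1
        if peer_acks(attempt):
            return ("success", attempts, waited_s)
        # No reply within the timeout: the time is spent, try again.
        waited_s += cfg.retry_timeout_s
    # All retries used up: this is the point where an error completion
    # would be handed to the client that owns the QP.
    return ("transport retry counter exceeded", attempts, waited_s)


if __name__ == "__main__":
    cfg = QueuePairConfig(retry_timeout_s=0.5, retry_count=3)
    # Peer never answers, e.g. because it has rebooted.
    print(send_with_retries(cfg, lambda attempt: False))
    # Peer answers the second retry.
    print(send_with_retries(cfg, lambda attempt: attempt == 2))
```

In this model a peer that stays silent produces exactly one failure per in-flight message, which corresponds to one error completion per unacknowledged message.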

Each error completion is also logged by the mcxnex driver to the system messages file as the pair of messages shown above. Each pair of messages in the file represents one in-flight, unacknowledged message to which the peer hardware has not responded within (retry timeout * retry count).
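Because each message pair corresponds to one unacknowledged message, counting the pairs gives a rough idea of how many messages were in flight when the connection went down. The following is a minimal sketch that assumes the log lines have exactly the format shown in the Symptoms section and that the two lines of a pair are adjacent; the function name count_retry_failures and the regular expressions are illustrative, not part of any Oracle tool.

```python
import re
from typing import Dict, Iterable

# Patterns derived from the sample lines in this document (assumed format).
CQE_ERR = re.compile(
    r"(?P<dev>\w+): CQE ERR: cqe (?P<cqe>[0-9a-f]+) QPN (?P<qpn>[0-9a-f]+) "
    r"indx (?P<indx>[0-9a-f]+) status (?P<status>0x[0-9a-f]+) "
    r"vendor syndrome (?P<syndrome>[0-9a-f]+)"
)
RETRY_EXCEEDED = re.compile(
    r"(?P<dev>\w+): CQE transport retry counter exceeded"
)


def count_retry_failures(lines: Iterable[str]) -> Dict[str, int]:
    """Count 'CQE ERR' + 'retry counter exceeded' pairs per QPN.

    Assumes the second line of a pair directly follows the first and
    names the same device instance.
    """
    counts: Dict[str, int] = {}
    pending = None  # most recent CQE ERR match still waiting for its partner
    for line in lines:
        err = CQE_ERR.search(line)
        if err:
            pending = err
            continue
        exceeded = RETRY_EXCEEDED.search(line)
        if exceeded and pending and pending["dev"] == exceeded["dev"]:
            qpn = pending["qpn"]
            counts[qpn] = counts.get(qpn, 0) + 1
        pending = None  # any other line breaks the pair
    return counts


if __name__ == "__main__":
    with open("/var/log/messages", errors="replace") as log:
        for qpn, n in sorted(count_retry_failures(log).items()):
            print(f"QPN {qpn}: {n} unacknowledged message(s)")
```

Grouping by QPN is a design choice for this sketch: it shows whether the failures are concentrated on one queue pair or spread across many.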

Usually this is because the peer rebooted, causing its HCA hardware to be reset. Alternatively, it might be the result of a very bad cable or transceiver on the peer node (not this node).

In other words, if you are in the midst of any maintenance operations where the connection may be restarted due to a remote machine reboot, then this is expected.

This message is purely informational as long as this doesn't occur frequently and unexpectedly.

Attachments

This solution has no attachment