Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1614576.1
Update Date: 2016-04-21
Keywords:

Solution Type  Problem Resolution Sure

Solution  1614576.1 :   What Does "Hermon0: CQE Transport Retry Counter Exceeded" in /var/log/messages File Mean?  


Related Items
  • Oracle Exadata Hardware
  • SPARC SuperCluster T4-4 Half Rack
  • Solaris Operating System
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>SPARC SuperCluster>DB: SuperCluster_EST




Created from <SR 3-8280566671>

Applies to:

Oracle Exadata Hardware - Version 11.2.3.1.1 and later
SPARC SuperCluster T4-4 Half Rack - Version All Versions and later
Solaris SPARC Operating System - Version 11.1 to 11.1 [Release 11.0]
Information in this document applies to any platform.

Symptoms

The following entry appears in the /var/log/messages file. What does it mean?

hermon0: CQE transport retry counter exceeded

or

WARNING: mcxnex0: CQE ERR: cqe fffff61bcf02f280 QPN 4000f7 indx 14 status 0x15  vendor syndrome 81
WARNING: mcxnex0: CQE transport retry counter exceeded

 

Cause

The message "hermon0: CQE transport retry counter exceeded" comes from the InfiniBand HCA driver and means that the IB connection to a peer has gone down.

Solution

When an InfiniBand stack client (a ULP, or upper layer protocol) creates a queue pair (QP), it specifies a retry timeout and a retry count to the hardware. These two values dictate what the HCA hardware/firmware does once a message is sent via that QP onto the fabric.

If the local HCA, having sent the message once on the fabric, receives no reply from the peer hardware within <retry timeout>, the local HCA hardware retries the message.

This happens up to <retry count> times, after which the HCA hardware/firmware issues an error completion to the upper layer protocol (client) that owns the QP, one for each message that is in flight and unacknowledged by the peer HCA hardware.
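
The retry behaviour described above can be sketched as a toy model. This is not driver or firmware code: the names `Completion`, `send_with_retries`, `complete_in_flight` and the `peer_acks` callback are hypothetical, and the callback merely stands in for "the peer HCA acknowledged within <retry timeout>". The model assumes one initial transmission followed by up to <retry count> retries.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Completion:
    message_id: int
    ok: bool        # False models an error completion handed to the ULP
    attempts: int   # total transmissions, including the first send


def send_with_retries(message_id: int,
                      retry_count: int,
                      peer_acks: Callable[[int, int], bool]) -> Completion:
    """Model one message: an initial send plus up to retry_count retries.

    peer_acks(message_id, attempt) is a hypothetical stand-in for
    "an ack arrived from the peer within the retry timeout".
    """
    attempts = 0
    for attempt in range(1 + retry_count):
        attempts += 1
        if peer_acks(message_id, attempt):
            return Completion(message_id, True, attempts)
    # Retry counter exceeded: the owner of the QP gets an error completion.
    return Completion(message_id, False, attempts)


def complete_in_flight(message_ids: Iterable[int],
                       retry_count: int,
                       peer_acks: Callable[[int, int], bool]) -> List[Completion]:
    """One completion per in-flight message, as the text describes."""
    return [send_with_retries(m, retry_count, peer_acks) for m in message_ids]
```

With a peer that never answers (for example, one that has rebooted), every in-flight message in this model ends in an error completion, which matches the one-log-entry-per-message behaviour described in this section.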

Each error completion is also logged by the mcxnex driver to the system messages file (the message pair shown above). Each pair of messages in the messages file represents one in-flight, unacknowledged message that the peer hardware did not respond to within (retry timeout * retry count).
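
Because each logged entry corresponds to one unacknowledged in-flight message, counting the entries gives a rough idea of how many messages were lost and on which QPs. The following is a minimal sketch that assumes log lines shaped exactly like the samples in the Symptoms section; the regular expressions and the `summarize` helper are illustrative, not part of any Oracle tool, and other driver versions may format the lines differently.

```python
import re
from collections import Counter

# Patterns derived only from the sample lines in this document.
RETRY_EXCEEDED = re.compile(
    r"\b(?P<dev>(?:hermon|mcxnex)\d+): CQE transport retry counter exceeded"
)
CQE_ERR = re.compile(
    r"\b(?P<dev>mcxnex\d+): CQE ERR: cqe (?P<cqe>[0-9a-f]+) "
    r"QPN (?P<qpn>[0-9a-f]+) indx (?P<indx>[0-9a-f]+) "
    r"status (?P<status>0x[0-9a-f]+)\s+vendor syndrome (?P<syndrome>[0-9a-f]+)"
)


def summarize(lines):
    """Count retry-exceeded entries per device and CQE errors per (device, QPN)."""
    per_device = Counter()
    per_qpn = Counter()
    for line in lines:
        match = RETRY_EXCEEDED.search(line)
        if match:
            per_device[match.group("dev")] += 1
            continue
        match = CQE_ERR.search(line)
        if match:
            per_qpn[(match.group("dev"), match.group("qpn"))] += 1
    return per_device, per_qpn
```

To use it, pass any iterable of lines, for example an open file handle on /var/log/messages. A handful of entries clustered around a known peer reboot is the expected case; a count that keeps growing outside maintenance windows is the one worth investigating.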

Usually this is because the peer rebooted, causing its HCA hardware to be reset. Alternatively, it might be the result of a very bad cable or transceiver on the peer node (not this node).

 

In other words, if you are in the midst of any maintenance operations where the connection may be restarted due to a remote machine reboot, then this is expected. 

This message is purely informational as long as it does not occur frequently and unexpectedly.

 


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.