Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2276427.1
Update Date: 2017-06-19
Keywords:

Solution Type  Problem Resolution Sure

Solution  2276427.1 :   Infiniband - Port HCA-#:# Is Showing Non-Zero Error Counts  


Related Items
  • Exadata X6-2 Hardware
  • Exadata X4-2 Hardware
  • Exadata X5-2 Hardware
  • Exadata Database Machine V2
  • Oracle SuperCluster T5-8 Hardware
  • Oracle SuperCluster M7 Hardware
  • SPARC SuperCluster T4-4
  • Exadata X3-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST


Resolve scenarios in which InfiniBand HCA cards in the compute or cell nodes of engineered systems show error counts greater than zero.

Created from <SR 3-15042034141>

Applies to:

SPARC SuperCluster T4-4 - Version All Versions to All Versions [Release All Releases]
Exadata X6-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X4-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X5-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X3-2 Hardware - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

A warning is raised that the installed InfiniBand HCA card has an error count higher than zero. The warning can come from various sources: an email (for example, an Enterprise Manager alert), an Exachk result, the results of a patch precheck, or the status or details output when checking IB interfaces.

In this example, we see the rcvErrs count is 1 and the symbolErrs count is 2 when listing ibport details via cellcli on a cell.

     CellCLI> LIST IBPORT DETAIL
     name: HCA-1:1
     dataRate: "40 Gbps"
     hcaFWVersion: 2.11.1280
     id: 0x0010e000018664f1
     lid: 9
     linkDowned: 0
     linkIntegrityErrs: 0
     linkRecovers: 0
     physLinkState: LinkUp
     portNumber: 1
     rcvConstraintErrs: 0
     rcvData: 49243317551099
     rcvErrs: 1
     rcvRemotePhysErrs: 0
     status: Active
     symbolErrs: 2
     vl15Dropped: 2
     xmtConstraintErrs: 0
     xmtData: 57606927160604
     xmtDiscards: 0

Apart from these counters, running ibstat shows both ports on the interface with "State: Active" and "Physical state: LinkUp", indicating they are functioning normally.
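The check on the CellCLI listing can be automated. The following is a minimal Python sketch, not an Oracle-supplied tool, that scans LIST IBPORT DETAIL text for error attributes with a non-zero value. The sample text is an abridged copy of the example above. The function name nonzero_errors and the choice of which attributes count as errors (every counter in the listing except the rcvData and xmtData traffic totals) are assumptions made for illustration.

```python
# Attributes treated as error counters. This selection is an assumption of
# the sketch: rcvData and xmtData are traffic totals and are left out.
ERROR_FIELDS = {
    "linkDowned", "linkIntegrityErrs", "linkRecovers", "rcvConstraintErrs",
    "rcvErrs", "rcvRemotePhysErrs", "symbolErrs", "vl15Dropped",
    "xmtConstraintErrs", "xmtDiscards",
}

# Abridged copy of the LIST IBPORT DETAIL example shown above.
SAMPLE = """\
     name: HCA-1:1
     linkDowned: 0
     linkIntegrityErrs: 0
     linkRecovers: 0
     rcvConstraintErrs: 0
     rcvData: 49243317551099
     rcvErrs: 1
     rcvRemotePhysErrs: 0
     status: Active
     symbolErrs: 2
     vl15Dropped: 2
     xmtConstraintErrs: 0
     xmtData: 57606927160604
     xmtDiscards: 0
"""


def nonzero_errors(detail_text):
    """Return {attribute: count} for every error attribute above zero."""
    found = {}
    for line in detail_text.splitlines():
        # Split on the first colon only, so values such as HCA-1:1 stay whole.
        key, _, value = line.strip().partition(":")
        if key in ERROR_FIELDS and int(value) > 0:
            found[key] = int(value)
    return found


print(nonzero_errors(SAMPLE))
```

For the sample, the function reports rcvErrs, symbolErrs, and vl15Dropped as the attributes to investigate.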

 

Cause

The cause depends on the error count and whether it is static or increasing.

If the error count is static, then the errors are old and occurred some time in the past.

If the error count is increasing, then the cause may be network or hardware related.

The Solution section contains more detailed explanations of these causes and how to address them.

 

Solution

Perform the following steps as root:

1. Log in to the node that is identified as having a non-zero error count.

2. Issue the ibstat command and note the base LID for each port on the HCA card. An example is shown here.

     [root@thx1138cel04 ~]# ibstat
     CA 'mlx4_0'
     CA type: MT4099
     Number of ports: 2
     Firmware version: 2.11.1280
     Hardware version: 0
     Node GUID: 0x0010e000018664f0
     System image GUID: 0x0010e000018664f3
     Port 1:
          State: Active
          Physical state: LinkUp
          Rate: 40
          Base lid: 9 LMC: 0
          SM lid: 1
          Capability mask: 0x02514868
          Port GUID: 0x0010e000018664f1
          Link layer: IB
     Port 2:
          State: Active
          Physical state: LinkUp
          Rate: 40
          Base lid: 10 LMC: 0
          SM lid: 1
          Capability mask: 0x02514868
          Port GUID: 0x0010e000018664f2
          Link layer: IB


3. Run the perfquery command for each port to get a full list of its counters, passing the base LID and port number found in step 2. For the example above, port 1 is queried with perfquery 9 1 and port 2 with perfquery 10 2. The output of perfquery will look like this example:

     [root@thx1138cel04 ~]# perfquery 9 1
     # Port counters: Lid 9 port 1 (CapMask: 0x1400)
     PortSelect:......................1
     CounterSelect:...................0x0000
     SymbolErrorCounter:..............2
     LinkErrorRecoveryCounter:........0
     LinkDownedCounter:...............0
     PortRcvErrors:...................1
     PortRcvRemotePhysicalErrors:.....0
     PortRcvSwitchRelayErrors:........0
     PortXmitDiscards:................0
     PortXmitConstraintErrors:........0
     PortRcvConstraintErrors:.........0
     CounterSelect2:..................0x00
     LocalLinkIntegrityErrors:........0
     ExcessiveBufferOverrunErrors:....0
     VL15Dropped:.....................2
     PortXmitData:....................4294967295
     PortRcvData:.....................4294967295
     PortXmitPkts:....................4294967295
     PortRcvPkts:.....................4294967295
     PortXmitWait:....................4294967295


4. In the perfquery output for each port, check all of the error counters (PortRcvErrors, PortRcvRemotePhysicalErrors, etc.). Note any that are not 0.

5. Log in to each IB switch as root and execute the following commands:

     ibclearerrors

     ibclearcounters


6. Wait at least 2 hours.

7. Repeat steps 2 and 3 on the node.
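Steps 2 through 4 lend themselves to scripting. The sketch below is a hedged illustration in Python, not part of any Oracle tooling: it derives the perfquery command for each port from ibstat text and then lists the non-zero error counters in perfquery text. It works on abridged copies of the example outputs above rather than calling the commands itself. The function names and the set of fields treated as error counters are assumptions of this sketch.

```python
import re

# Fields treated as error counters. Assumption of this sketch: the Select
# fields and the data, packet, and wait counters are not errors.
ERROR_COUNTERS = {
    "SymbolErrorCounter", "LinkErrorRecoveryCounter", "LinkDownedCounter",
    "PortRcvErrors", "PortRcvRemotePhysicalErrors", "PortRcvSwitchRelayErrors",
    "PortXmitDiscards", "PortXmitConstraintErrors", "PortRcvConstraintErrors",
    "LocalLinkIntegrityErrors", "ExcessiveBufferOverrunErrors", "VL15Dropped",
}

IBSTAT_SAMPLE = """\
CA 'mlx4_0'
Number of ports: 2
Port 1:
     State: Active
     Base lid: 9 LMC: 0
     Port GUID: 0x0010e000018664f1
Port 2:
     State: Active
     Base lid: 10 LMC: 0
     Port GUID: 0x0010e000018664f2
"""

PERFQUERY_SAMPLE = """\
# Port counters: Lid 9 port 1 (CapMask: 0x1400)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............2
LinkDownedCounter:...............0
PortRcvErrors:...................1
VL15Dropped:.....................2
PortXmitData:....................4294967295
"""


def perfquery_commands(ibstat_text):
    """Build one 'perfquery <base LID> <port>' command per port (step 3)."""
    commands, port = [], None
    for line in ibstat_text.splitlines():
        header = re.match(r"\s*Port (\d+):", line)
        if header:
            port = int(header.group(1))
            continue
        lid = re.match(r"\s*Base lid:\s*(\d+)", line)
        if lid and port is not None:
            commands.append(f"perfquery {int(lid.group(1))} {port}")
    return commands


def nonzero_error_counters(perfquery_text):
    """Return the error counters that are not 0 (step 4)."""
    found = {}
    for line in perfquery_text.splitlines():
        field = re.match(r"\s*(\w+):\.+(\S+)", line)
        if field and field.group(1) in ERROR_COUNTERS:
            value = int(field.group(2), 0)  # base 0 accepts decimal and 0x hex
            if value > 0:
                found[field.group(1)] = value
    return found


print(perfquery_commands(IBSTAT_SAMPLE))
print(nonzero_error_counters(PERFQUERY_SAMPLE))
```

Saving the dictionary returned for each port, together with the time it was taken, gives a baseline to compare against when the steps are repeated.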

Error Counters At 0 After Clearing:
There is no issue.  The InfiniBand error counters are cumulative over time and the errors seen before clearing the counters occurred at some time in the past.

If the error count prior to clearing was very high, this can indicate a problem that happened in the past but has since been resolved. Have your network team review network-related activity to see whether the problem occurred in the recent past. If the problem occurred too far in the past, there is most likely no way to determine what caused it, because the log and trace data needed to investigate it will have been aged out and removed.

Error Counters Not At 0 After Clearing, But Count Is Low (Under 10):
Wait another 2 hours, then repeat steps 2 and 3. If the error counts are the same and not increasing, there is no issue. Some network errors are expected over time and are not necessarily an indication of a problem. They should be few and infrequent.

If the error count has increased when you check again after the two hours, then there may be a network or hardware problem. Go to the section "Resolving Network And Hardware Issues" below.

Error Counters Not At 0 After Clearing AND Count Is At 10 or Above:
Wait another 2 hours then repeat steps 2 and 3.  If the error count does not change, then there is likely no issue.  Some network errors are expected over time and are not necessarily an indication of a problem.  However, checking for network or hardware problems may be warranted.  If checking for network or hardware problems is desired, go to the section "Resolving Network And Hardware Issues" below.

If the error count is increasing, and especially if the increase is substantial, then there is most likely a network or hardware problem.  Go to the section "Resolving Network And Hardware Issues".
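The three outcomes above amount to a small decision table, which the following Python sketch encodes for illustration. It assumes two snapshots of the error counters as dictionaries: one taken 2 hours after clearing and one taken 2 hours later. The function name triage, the return strings, and the use of the largest single counter as "the count" are assumptions of this sketch. The threshold of 10 comes from the headings above.

```python
def triage(first_check, second_check=None):
    """Classify a port from its error counters after clearing.

    first_check:  {counter: value} read 2 hours after clearing.
    second_check: {counter: value} read another 2 hours later, if available.
    """
    # Assumption: "the count" is the largest single error counter.
    peak = max(first_check.values(), default=0)
    if peak == 0:
        return "no issue: earlier errors occurred in the past"
    if second_check is None:
        return "wait another 2 hours and repeat steps 2 and 3"

    # A counter is increasing if it is higher in the second snapshot.
    names = set(first_check) | set(second_check)
    increasing = any(
        second_check.get(name, 0) > first_check.get(name, 0) for name in names
    )
    if peak < 10:
        if increasing:
            return "possible network or hardware problem"
        return "no issue: count is low and static"
    if increasing:
        return "likely network or hardware problem"
    return "likely no issue: checking network and hardware is optional"


# Example: a low count that keeps growing between the two checks.
print(triage({"PortRcvErrors": 1}, {"PortRcvErrors": 4}))
```

Any result that mentions a network or hardware problem corresponds to continuing with the section "Resolving Network And Hardware Issues".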


Resolving Network And Hardware Issues

If the above steps have indicated a possible problem, perform the following steps.

1. Work with your network and/or system administration teams to ensure:

  • the IB cables for the affected node are seated securely at both ends.
  • there are no communications issues over the IB fabric.   Refer to Doc ID 2016560.1 for details.


2. Repeat steps 3 through 7 of the Solution section above.

    If there are no errors, the issue is resolved.

    If the errors reappear or problems were found in step 1, collect all data detailed in Doc ID 1683910.1 and open a Service Request with Oracle support.



References

3-13566741281
<NOTE:1922592.1> - Patchmgr precheck fails "InfiniBand Port HCA-1:1 is showing non-zero error counts"

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.