Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure Solution
2276427.1: Infiniband - Port HCA-#:# Is Showing Non-Zero Error Counts
Resolve scenarios in which error counts greater than zero are shown for InfiniBand HCA cards in compute or cell nodes of engineered systems.

Created from <SR 3-15042034141>

Applies to:
SPARC SuperCluster T4-4 - Version All Versions to All Versions [Release All Releases]
Exadata X6-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X4-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X5-2 Hardware - Version All Versions to All Versions [Release All Releases]
Exadata X3-2 Hardware - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms
A warning is raised that the installed InfiniBand HCA card has an error count higher than zero. This warning can come from various sources: an email such as an Enterprise Manager alert, an Exachk result, the results of a patch precheck, or the status or details output when checking IB interfaces.

In this example, the rcvErrs count is 1 and the symbolErrs count is 2 when listing ibport details via cellcli on a cell.

CellCLI> LIST IBPORT DETAIL
    name:               HCA-1:1
    dataRate:           "40 Gbps"
    hcaFWVersion:       2.11.1280
    id:                 0x0010e000018664f1
    lid:                9
    linkDowned:         0
    linkIntegrityErrs:  0
    linkRecovers:       0
    physLinkState:      LinkUp
    portNumber:         1
    rcvConstraintErrs:  0
    rcvData:            49243317551099
    rcvErrs:            1
    rcvRemotePhysErrs:  0
    status:             Active
    symbolErrs:         2
    vl15Dropped:        2
    xmtConstraintErrs:  0
    xmtData:            57606927160604
    xmtDiscards:        0

Apart from the error counts, running ibstat shows both ports on the interface with "State: Active" and "Physical state: LinkUp", indicating they are functioning normally.
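The check described above can be sketched in a few lines of Python. This is a minimal illustration, not an Oracle tool: the function name `nonzero_ibport_errors` is hypothetical, and treating every attribute whose name ends in "Errs" as an error counter is an assumption made for the example. It expects the LIST IBPORT DETAIL output as text with one attribute per line.

```python
def nonzero_ibport_errors(detail_text):
    """Return {attribute: count} for '...Errs' attributes greater than zero.

    Hypothetical helper. Selecting attributes by the 'Errs' suffix is an
    assumption for illustration, not a rule taken from CellCLI.
    """
    found = {}
    for line in detail_text.splitlines():
        # Split on the first colon only, so values such as HCA-1:1 stay intact.
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key.endswith("Errs") and value.isdigit() and int(value) > 0:
            found[key] = int(value)
    return found


# Excerpt of the attribute values shown in this document.
SAMPLE = """\
name: HCA-1:1
dataRate: "40 Gbps"
linkIntegrityErrs: 0
rcvConstraintErrs: 0
rcvErrs: 1
rcvRemotePhysErrs: 0
symbolErrs: 2
xmtConstraintErrs: 0
"""

print(nonzero_ibport_errors(SAMPLE))  # {'rcvErrs': 1, 'symbolErrs': 2}
```

An empty result would mean none of the watched attributes is above zero for that port.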
Cause
The cause depends on the error count and whether it is static or increasing. If the error count is static, then the errors are old and occurred some time in the past. If the error count is increasing, then the cause may be network or hardware related. The Solution section contains more detailed explanations of these causes and how to address them.
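The static-versus-increasing distinction can be made concrete by comparing two snapshots of the counters taken some time apart. The sketch below is only an illustration: `classify_counters` is a hypothetical name, the labels are chosen for this example, and the second snapshot contains made-up values rather than data from a real system.

```python
def classify_counters(before, after):
    """Label each counter by comparing an earlier and a later snapshot.

    Both arguments are {counter_name: value} dicts. Hypothetical helper;
    the labels are illustrative and not produced by any Oracle utility.
    """
    labels = {}
    for name, new in after.items():
        old = before.get(name, 0)
        if new > old:
            labels[name] = "increasing"  # errors are still occurring
        elif new < old:
            labels[name] = "reset"       # counter was cleared in between
        elif new > 0:
            labels[name] = "static"      # old errors, no new ones
        else:
            labels[name] = "clean"
    return labels


# First snapshot uses the counts shown in this document; the second
# snapshot is invented for the example.
earlier = {"symbolErrs": 2, "rcvErrs": 1, "linkDowned": 0}
later = {"symbolErrs": 2, "rcvErrs": 5, "linkDowned": 0}

for name, label in classify_counters(earlier, later).items():
    print(name, label)
```

How long to wait between snapshots is a judgment call and is not specified here.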
Solution
Accomplish the following actions as root:

[root@thx1138cel04 ~]# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.11.1280
    Hardware version: 0
    Node GUID: 0x0010e000018664f0
    System image GUID: 0x0010e000018664f3
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 9
        LMC: 0
        SM lid: 1
        Capability mask: 0x02514868
        Port GUID: 0x0010e000018664f1
        Link layer: IB
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 10
        LMC: 0
        SM lid: 1
        Capability mask: 0x02514868
        Port GUID: 0x0010e000018664f2
        Link layer: IB

[root@thx1138cel04 ~]# perfquery 9 1
# Port counters: Lid 9 port 1 (CapMask: 0x1400)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............2
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................1
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................2
PortXmitData:....................4294967295
PortRcvData:.....................4294967295
PortXmitPkts:....................4294967295
PortRcvPkts:.....................4294967295
PortXmitWait:....................4294967295

ibclearerrors
ibclearcounters
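Because clearing discards the old values, it may be useful to record the perfquery counters first. The following Python sketch parses saved perfquery output of the form shown in this document. The names `parse_perfquery` and `summarize` are hypothetical, and selecting counters whose name contains "Error" is an assumption made for illustration. The sketch also lists counters that read 4294967295, which equals 2^32 - 1, the largest value a 32-bit counter can hold.

```python
import re

# Matches lines such as "SymbolErrorCounter:..............2"
LINE_RE = re.compile(r"^(\w+):\.*(\S+)$")
U32_MAX = 2**32 - 1  # 4294967295


def parse_perfquery(text):
    """Parse saved perfquery output into {counter_name: int}."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# Port counters" header
        match = LINE_RE.match(line)
        if match:
            # Base 0 accepts both decimal ("2") and hex ("0x0000") values.
            counters[match.group(1)] = int(match.group(2), 0)
    return counters


def summarize(counters):
    """Hypothetical summary: non-zero '*Error*' counters and maxed-out ones."""
    nonzero_errors = {k: v for k, v in counters.items() if "Error" in k and v > 0}
    at_max = [k for k, v in counters.items() if v == U32_MAX]
    return nonzero_errors, at_max


# Excerpt of the perfquery output shown in this document.
SAMPLE = """\
# Port counters: Lid 9 port 1 (CapMask: 0x1400)
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrorCounter:..............2
LinkDownedCounter:...............0
PortRcvErrors:...................1
VL15Dropped:.....................2
PortXmitData:....................4294967295
PortRcvData:.....................4294967295
"""

errors, maxed = summarize(parse_perfquery(SAMPLE))
print("non-zero error counters:", errors)
print("counters at 32-bit maximum:", maxed)
```

Saving this summary alongside a timestamp gives a baseline to compare against if the errors reappear after clearing.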
If the error count prior to clearing was very high, this can indicate a problem that happened in the past but has since been resolved. Have your network team review network-related activity to determine whether the problem occurred in the recent past. If the problem occurred too far in the past, there is most likely no way to determine what caused it, because the log and trace data needed to investigate it will have been aged out and removed.
If there are no errors, the issue is resolved. If the errors reappear or problems were found in step 1, collect all data detailed in Doc ID 1683910.1 and open a Service Request with Oracle Support.

References
3-13566741281
<NOTE:1922592.1> - Patchmgr precheck fails "InfiniBand Port HCA-1:1 is showing non-zero error counts"