Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-2195792.1
Update Date: 2017-07-11
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  2195792.1 :   Infiniband Fabric errors and counters  


Related Items
  • Sun Network QDR InfiniBand Gateway Switch
  • Sun Datacenter InfiniBand Switch 36
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Sun Network Infiniband




In this Document
Purpose
Scope
Details


Applies to:

Sun Network QDR InfiniBand Gateway Switch
Sun Datacenter InfiniBand Switch 36
Information in this document applies to any platform.

Purpose

 Explain the meaning of certain errors and counters visible at the InfiniBand fabric level.

Scope

 Help determine which errors and counters indicate a real problem and which can be safely ignored.

Details

InfiniBand fabric errors and counters can be divided into two groups: 1) harmful, meaning they indicate a real problem; 2) harmless, meaning they can be safely ignored.

Harmful:

Error / Counter | Meaning
SymbolErrors | Total number of minor link errors detected on one or more physical lanes.
LinkErrorRecovery | Total number of times the Port Training state machine has successfully completed the link error recovery process.
PortRcvErrors | Total number of packets containing an error that were received on the port.
PortXmitConstraintErrors | Total number of packets not transmitted from the port for either of the following reasons: (a) FilterRawOutbound is true and the packet is raw; (b) PartitionEnforcementOutbound is true and the packet fails the partition key check or the IP version check.
PortRcvConstraintErrors | Total number of packets received on the port that are discarded for either of the following reasons: (a) FilterRawInbound is true and the packet is raw; (b) PartitionEnforcementInbound is true and the packet fails the partition key check or the IP version check.
LocalLinkIntegrityErrors | The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors.
ExcessiveBufferOverrunErrors | The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error.
PortRcvRemotePhysicalErrors | Total number of packets marked with the EBP (End Bad Packet) delimiter received on the port.
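
As a minimal sketch of how the harmful list above might be applied when reviewing a single port, the Python snippet below keeps only the harmful counters that have a non-zero value. The function name `harmful_nonzero` and the input dictionary are hypothetical. The keys are the counter names as written in this document; real tools may label the same counters slightly differently, so the names may need adjusting, and collecting the values is not shown.

```python
# Harmful counters, taken from the list above. A non-zero value in any of
# them is treated here as a sign of a real fabric problem.
HARMFUL_COUNTERS = frozenset({
    "SymbolErrors",
    "LinkErrorRecovery",
    "PortRcvErrors",
    "PortXmitConstraintErrors",
    "PortRcvConstraintErrors",
    "LocalLinkIntegrityErrors",
    "ExcessiveBufferOverrunErrors",
    "PortRcvRemotePhysicalErrors",
})


def harmful_nonzero(counters: dict[str, int]) -> dict[str, int]:
    """Return the harmful counters of one port that have a non-zero value.

    `counters` is a hypothetical name -> value mapping for a single port.
    Counters not in HARMFUL_COUNTERS (for example traffic counters such as
    PortXmitData) are ignored.
    """
    return {
        name: value
        for name, value in counters.items()
        if name in HARMFUL_COUNTERS and value > 0
    }


if __name__ == "__main__":
    port = {"SymbolErrors": 3, "LinkErrorRecovery": 0, "PortXmitData": 123456}
    print(harmful_nonzero(port))  # {'SymbolErrors': 3}
```

Using a fixed set of names keeps the harmful/harmless split in one place, so it can be reviewed against this document if the classification changes.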

Harmless: these counters can have non-zero values in our environments, and such values can be safely ignored.

Error / Counter | Meaning
LinkDown | Total number of times the Port Training state machine has failed the link error recovery process and downed the link.
PortXmitDiscards | Total number of outbound packets discarded by the port because the port is down or congested.
VL15Dropped | Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port.
PortXmitData | Total number of data octets, divided by 4, transmitted on all VLs from the port. This includes all octets between (but not including) the start-of-packet delimiter and the VCRC, and may include packets containing errors. It excludes all link packets.
PortRcvData | Total number of data octets, divided by 4, received on all VLs at the port. This includes all octets between (but not including) the start-of-packet delimiter and the VCRC, and may include packets containing errors. It excludes all link packets.
PortXmitPkts | Total number of packets transmitted on all VLs from the port. This may include packets with errors and excludes link packets.
PortRcvPkts | Total number of packets, including packets containing errors and excluding link packets, received from all VLs on the port.
PortRcvSwitchRelayErrors | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay.
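
Because PortXmitData and PortRcvData count data octets divided by 4, the byte count is the counter value multiplied by 4. The Python sketch below shows that conversion and an average data rate between two readings of the same counter on the same port. The function names are hypothetical, and the sketch does not detect a counter that wrapped or stopped at its maximum value between readings.

```python
def data_counter_to_bytes(counter_value: int) -> int:
    """Convert a PortXmitData or PortRcvData value to bytes.

    Both counters count data octets divided by 4, so multiply by 4.
    """
    return counter_value * 4


def average_rate_bytes_per_s(first: int, second: int, interval_s: float) -> float:
    """Average data rate between two readings of the same data counter.

    A decrease is rejected as a likely counter reset. A counter that wrapped
    or stopped at its maximum value between readings is not detected.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if second < first:
        raise ValueError("counter decreased; it was probably reset")
    return data_counter_to_bytes(second - first) / interval_s


if __name__ == "__main__":
    # Two hypothetical PortXmitData readings taken 10 seconds apart.
    print(average_rate_bytes_per_s(1_000_000, 3_500_000, 10.0))  # 1000000.0
```

A non-zero and growing PortXmitData or PortRcvData value simply reflects traffic, which is consistent with these counters being listed as harmless.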


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.