Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Problem Resolution Sure
Solution 2227109.1: Exalogic VM lost network communication through both vnics when one IB gateway switch was rebooted
Applies to:
Sun Network QDR InfiniBand Gateway Switch - Version All Versions and later
Oracle Exalogic Elastic Cloud Software - Version 1.0.0.0.0 and later
Information in this document applies to any platform.

Symptoms
A VM in an Exalogic system lost communication through all of its vnics when one of the InfiniBand gateway switches went down or was rebooted.

Cause
The root cause was a misconfiguration of the vnics. When a vnic is created on a gateway switch, the port GUID used to create it is expected to belong to the IB HCA port of the server node that is directly connected to that switch. For example, if the server has one HCA with two IB ports, with port 1 connected to gateway switch gw01 and port 2 connected to gw02, then a vnic created on gw01 must use the port GUID of HCA port 1, and a vnic created on gw02 must use the port GUID of HCA port 2.

If these are swapped, the packets from the host through these vnics pass through both IB switches, so both switches must be up and active for either vnic to be operational. Failure of either switch breaks that path and causes both vnics to fail. That is what happened in this case. Only VMs whose port GUIDs are swapped show the problem; all other VMs are unaffected when either gateway switch is brought down.

Here is an example. The output of ibstat in the VM:
# ibstat
The output of ibstat in the corresponding compute node is identical. The cabling of both HCA ports can be seen in the ibnetdiscover output as follows:
vendid=0x2c9
This shows that port 1 (mlx4_0:1) of the HCA is cabled to port 11 of the IB gateway switch dpxlp01agw01, and port 2 (mlx4_0:2) of the HCA is cabled to port 11 of the switch dpxlp01agw02. So the direct path from mlx4_0:1 is to gw01, and the direct path from mlx4_0:2 is to gw02.
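The path dependency described under Cause can be illustrated with a small Python sketch. This is a simplified model rather than an Oracle tool: it assumes that a vnic needs both the gateway switch it was created on and the switch cabled to the HCA port whose GUID it uses. The names CABLING, required_switches, surviving_vnics, and the vnic labels vnic_a and vnic_b are hypothetical and chosen only for illustration.

```python
# Simplified model (an assumption, not Oracle code): a vnic works only while
# every switch on its path is up. The path includes the switch hosting the
# vnic and the switch directly cabled to the HCA port whose GUID was used.

# HCA port -> directly cabled gateway switch (short names from the example).
CABLING = {"mlx4_0:1": "gw01", "mlx4_0:2": "gw02"}


def required_switches(vnic_switch, hca_port, cabling=CABLING):
    """Return the set of switches that must be up for this vnic."""
    return {vnic_switch, cabling[hca_port]}


def surviving_vnics(vnics, failed_switch, cabling=CABLING):
    """Return the names of vnics that still work after one switch fails."""
    return [
        name
        for name, (switch, port) in vnics.items()
        if failed_switch not in required_switches(switch, port, cabling)
    ]


# vnic name -> (switch the vnic was created on, HCA port whose GUID was used)
correct = {"vnic_a": ("gw01", "mlx4_0:1"), "vnic_b": ("gw02", "mlx4_0:2")}
swapped = {"vnic_a": ("gw01", "mlx4_0:2"), "vnic_b": ("gw02", "mlx4_0:1")}

print(surviving_vnics(correct, "gw01"))  # one vnic survives
print(surviving_vnics(swapped, "gw01"))  # no vnic survives
```

In this model, the correct configuration loses only the vnic on the failed switch, while the swapped configuration loses both vnics whichever switch fails.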
In a VM that lost connectivity when one of these two switches was shut down, the vnics were found to have been created as follows:
# mlx4_vnic_info -s
This output shows that vnic eth651_2.1190 was created on the switch dpxlp01agw01 using the port GUID of port 2 (mlx4_0:2) of the HCA, and vnic eth650_2.1818 was created on the switch dpxlp01agw02 using the GUID of port 1 (mlx4_0:1). Normally, when EMOC creates these vnics, it creates them correctly, without this swap. However, the swap sometimes occurs when the VM is rebooted or restarted manually or through xm.
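The comparison made above, matching each vnic's gateway switch against the switch cabled to the HCA port it uses, can be sketched in Python. This is a minimal illustration under stated assumptions: the cabling map and vnic list are transcribed by hand from the values quoted in this document, not parsed from ibnetdiscover or mlx4_vnic_info output, and the names Vnic, CABLING, VNICS, and find_swapped_vnics are hypothetical.

```python
from typing import NamedTuple


class Vnic(NamedTuple):
    name: str      # vnic interface name
    gateway: str   # gateway switch the vnic was created on
    hca_port: str  # HCA port whose GUID was used to create the vnic


# HCA port -> directly cabled gateway switch, as quoted in this document.
CABLING = {
    "mlx4_0:1": "dpxlp01agw01",
    "mlx4_0:2": "dpxlp01agw02",
}

# vnics of the affected VM, as quoted in this document.
VNICS = [
    Vnic("eth651_2.1190", "dpxlp01agw01", "mlx4_0:2"),
    Vnic("eth650_2.1818", "dpxlp01agw02", "mlx4_0:1"),
]


def find_swapped_vnics(vnics, cabling):
    """Return names of vnics whose HCA port is not cabled to their gateway."""
    return [v.name for v in vnics if cabling.get(v.hca_port) != v.gateway]


for name in find_swapped_vnics(VNICS, CABLING):
    print(f"{name}: port GUID does not match the directly cabled switch")
```

With the values from this case, both vnics are reported, which is consistent with the VM losing all connectivity when either switch went down.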
Solution
To resolve this problem, re-create the affected vnics. In an Exalogic system, restarting the VM through EMOC restores the correct configuration, so the simplest solution is to restart the VM through EMOC.
Attachments
This solution has no attachment