Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1518833.1
Update Date: 2017-05-01
Keywords:

Solution Type: Troubleshooting Sure

Solution 1518833.1: Troubleshooting VMotion Failures and VMotion Configuration Best Practices


Related Items
  • Oracle Fabric Interconnect F1-15
  • Oracle Fabric Interconnect F1-4
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Oracle Virtual Networking




Applies to:

Oracle Fabric Interconnect F1-15 - Version All Versions to All Versions [Release All Releases]
Oracle Fabric Interconnect F1-4 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.
Checked for relevance on 06/26/2014

Purpose

How to troubleshoot VMotion failures and apply VMotion configuration best practices.

Troubleshooting Steps

VMotion failures occur for a number of reasons:

1) Underlying shared storage issues - Check the host logs under /var/log/vmkernel (ESX Classic), /var/log/vmkernel.log (ESXi 5.0), or /var/log/messages (ESXi 4.1). Resolve all storage-related issues first.
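As a quick first pass, the host log can be scanned for common storage error strings from the ESX/ESXi shell. This is a minimal sketch; the grep patterns are only examples and can be adjusted.

For ESXi 5.0:

# grep -iE "scsi|naa|timeout|failed" /var/log/vmkernel.log | tail -n 50

For ESX Classic or ESXi 4.1, run the same grep against /var/log/vmkernel or /var/log/messages.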

2) Not enough free memory on the destination server - Make sure there is enough free memory on the destination server; there needs to be free memory greater than the VM NVRAM file size, i.e. the amount of memory assigned to the virtual machine. Right-click the VM and choose Edit Settings to see how much memory is assigned to the VM.
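A quick way to check from the ESXi shell is sketched below (the VM ID comes from the getallvms output and is only an example); compare the host's physical memory with the VM's configured memory, and watch free memory in esxtop's memory view:

# esxcli hardware memory get
# vim-cmd vmsvc/getallvms
# vim-cmd vmsvc/get.summary <vmid> | grep -i memorySizeMB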

3) LACP-configured LAG issues - If the VMotion vNICs are terminated on an LACP LAG group, log in to the upstream Ethernet switch and check that all connected, applicable switch ports are up and not showing inactive or down. If ports are down or inactive on the upstream switch (this is especially common after a Fabric Director XgOS upgrade, I/O card reset, or Fabric Director reboot), change the LAG configuration to 'Static' on the upstream Ethernet switch ports and change to a static LAG on the Fabric Directors using:

# set lag <#.#> -lacp-enable=false
 
The above LACP LAG issue is most likely to be encountered when using 10-port 1GE I/O cards; 4-port 10GE I/O cards do not exhibit this issue.

The above case applies when the VMotion vNICs in the vSwitch are in an active/active state, where VMotion traffic traverses the external network rather than staying within the Xsigo Fabric Director. Setting the VMotion vNICs to active/active instead of active/standby is generally used when VMotions need to go out over the network, for example when VMotioning between physical hosts that are not Xsigo-connected and Xsigo-connected hosts, or vice versa. This is not the only use case, however; many customers simply prefer the VMotion vNICs to be set to active/active instead of active/standby.

To test whether VMotion works when its traffic is confined within the Fabric Director, set the vNICs to active/standby, then reverse the active/standby order to check both Fabric Director vNICs. If VMotion works with the VMotion vNICs set to active/standby in both directions, and the VMotion vNICs are terminated on LACP-enabled LAGs, check the upstream Ethernet switch ports to make sure none are marked 'inactive' or 'down' (see the example below).
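For example, on a Cisco upstream switch (an assumption; the exact command depends on the switch vendor and OS), the LAG member state can be checked with:

# show etherchannel summary        (Cisco IOS)
# show port-channel summary        (Cisco NX-OS)

Any member ports not shown as bundled in the port-channel are not actively participating in the LAG.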

NOTE: Xsigo's implementation of LACP is broken in XgOS versions below 4.0.0. Since Xsigo is rewriting LACP, please check with Dennis Rivkin on the targeted release of the reworked/rewritten Xsigo LACP.

4) Make sure that the VMotion vmkernel interface has a unique IP address and that there are no duplicate VMotion IPs on the network.

Make sure that the VMotion vmkernel portgroup is on its own vSwitch and has a unique VLAN ID if VLANs are being used. VMotion needs to be on an isolated network, as does iSCSI, per the relevant VMware KB; please note the VMware networking documents also state that the VMotion and iSCSI vmkernel interfaces need to be on isolated networks - vmkernel portgroups in separate vSwitches, and on separate networks or VLANs.
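To verify this layout from the ESX/ESXi shell (a sketch; these commands only list the existing configuration and do not change anything), check which vSwitch, portgroup, and VLAN ID each vmkernel interface uses:

# esxcfg-vswitch -l
# esxcfg-vmknic -l

Confirm that the VMotion vmkernel portgroup appears in its own vSwitch with a VLAN ID that is not shared with the iSCSI or management portgroups.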

 
5) Make sure that VMotion is NOT enabled on any portgroup other than the vmkernel interface intended for VMotion. In one customer case (Equinix), the VMotion box was checked for the VMotion vmkernel interface in one vSwitch, and the MGMT (Service Console) portgroup in another vSwitch also had the VMotion box checked. Unchecking the VMotion box for the MGMT (Service Console) portgroup allowed VMotion to work again.

6) Check the time settings and enable NTP on the ESX/ESXi servers in the cluster.
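A minimal sketch of doing this from the ESXi shell is below (the NTP server name is a placeholder; this can equally be configured from the vSphere Client under Time Configuration):

# date
# echo "server ntp.example.com" >> /etc/ntp.conf
# /etc/init.d/ntpd restart

On ESX Classic the equivalent is to edit /etc/ntp.conf in the Service Console and restart the ntpd service.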

7) VMware suggests, where possible, not routing VMotion traffic, to limit the number of hops VMotion has to take, and only one vmkernel interface is permitted per vSwitch. Every hop that VMotion traverses adds to VMotion latency. Again, this is not always possible, but it is a VMware suggestion. (A quick connectivity check is sketched after the links below.)

http://kb.vmware.com/kb/1006989

http://kb.vmware.com/kb/1002662
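To confirm basic vmkernel connectivity between the source and destination hosts (and to get a rough feel for latency along the path), a vmkping from the source host to the destination host's VMotion IP can be used; the address below is a placeholder:

# vmkping <destination-vmotion-ip>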

8) Please note that Xsigo Support has seen issues at customer sites when the MGMT network and VMotion vmkernel portgroups are in the same vSwitch. ESX Classic is not affected because the Service Console (MGMT console) is a 'vswif' interface, not a 'vmkernel' interface.
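On ESX Classic the distinction can be confirmed from the Service Console (a sketch; both commands only list the current configuration): the Service Console interface appears under esxcfg-vswif, while vmkernel interfaces appear under esxcfg-vmknic.

# esxcfg-vswif -l
# esxcfg-vmknic -l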

9) If VMotion times out between 10% and 44%, try disabling and re-enabling Migrate.Enabled. NOTE: Perform this on both the source *and* destination VMotion servers and wait at least 4 minutes between the disable and the enable.

See this Xsigo KB:

http://xsigo.force.com/articles/Knowledge_Article/1172/

Specifically, verify in the VMotion source and destination host logs that the timeout errors are similar to those in the article above. The commands to disable and re-enable Migrate are:

From ESX CLI:


# esxcfg-advcfg -s 0 /Migrate/Enabled
# esxcfg-advcfg -s 1 /Migrate/Enabled
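The current value can be read back with the -g flag to confirm each step took effect (remember to run the disable/enable pair on both the source and destination hosts, waiting at least 4 minutes in between):

# esxcfg-advcfg -g /Migrate/Enabled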

 
For other VMware VMotion troubleshooting KBs, see these links:

http://kb.vmware.com/kb/1030267

http://kb.vmware.com/kb/1003734 - The most in-depth VMware VMotion KB; it also contains instructions for disabling and then re-enabling Migrate.

Snipped from the last VMware KB above (please note this works even if you do not find "Broken Pipe" messages):

Determine if resetting the Migrate.Enabled setting on both the source and destination ESX or ESXi hosts addresses the vMotion failure. For more information on this issue, see VMotion fails at 10% with the error: A general system error occurred: Migration failed while copying data, Broken Pipe (1013150).

The information in this VMware KB should show customers that VMware has also suggested using the Migrate.Enabled workaround, along with increasing the Migrate.NetTimeout value, for customers using physical Ethernet devices. This is not just a Xsigo issue.

 
This link leads to the KB mentioned in the snippet above:

http://kb.vmware.com/kb/1013150

10) If VMotion still fails at 10%, increase Migrate.NetTimeout from 20 to 120.
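From the ESX CLI this can be checked and changed with the same tool used above (a sketch; the setting can also be changed under Advanced Settings in the vSphere Client):

# esxcfg-advcfg -g /Migrate/NetTimeout
# esxcfg-advcfg -s 120 /Migrate/NetTimeout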


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.