Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1615444.1
Update Date:2016-07-22
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  1615444.1 :   In EECS 2.0.6.0.0 guest vServers intermittently not able to communicate (ping) each other over EoIB VLAN network  


Related Items
  • Oracle Exalogic Elastic Cloud Software
  • Exalogic Elastic Cloud X3-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Purpose
 Symptoms
 Cause
 Solution
Details
 Interim Patch for the Bug 18175326
 Workaround Procedure
References


Applies to:

Oracle Exalogic Elastic Cloud Software - Version 2.0.6.0.0 to 2.0.6.0.0
Exalogic Elastic Cloud X3-2 Hardware - Version X3 to X3 [Release X3]
Linux x86-64
Oracle Virtual Server(x86-64)

Purpose

Symptoms

In a fresh or upgraded EECS v2.0.6.0.0 installation on an Exalogic rack, guest vServers intermittently cannot communicate with (ping) each other over EoIB/Ethernet interfaces.

Restarting the affected vServers may restore connectivity to some vServers, but connectivity is then lost to other vServers that were previously reachable.

Cause

The issue occurs when multiple gateway ports are up and VNICs with the same GUID/MAC are up on more than one of those ports. There should be only one VNIC per switch for each GUID/MAC combination.

The underlying cause is a bug introduced in EECS v2.0.6.0.0. To enable vServers to be started on other server pools, EMOC was modified to create the VNICs for EoIB interfaces at vServer start-up. The change also implemented round-robin use of connectors when VNICs are created. Unfortunately, the change did not include a check for whether the VNIC already existed. When only one gateway port is up, the VNIC creation at vServer start-up silently fails because the VNIC already exists. When multiple gateway ports are up, there is a good probability that a new VNIC will be created on a different gateway port. The resulting duplicate MACs intermittently block inter-vServer traffic on the VLAN. Everything else works.
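
The duplicate condition described above can be looked for on a gateway switch. The following is a minimal sketch, not an Oracle-supplied tool. It assumes that each VNIC row of the shownics output ends with the four columns MAC, VLAN, PKEY and connector, which is the layout implied by the sed expression used in the workaround procedure of this note. The function name find_duplicate_macs is hypothetical.

```shell
# Sketch: report MAC addresses that appear on more than one gateway connector.
# ASSUMPTION: every VNIC row ends with the columns: MAC VLAN PKEY CONNECTOR
# Intended use on a switch (not run here): shownics | find_duplicate_macs
find_duplicate_macs() {
    awk '
        /ETH/ && NF >= 4 {
            mac = $(NF-3)          # fourth column from the end
            gw  = $NF              # last column is the connector
            if (!((mac, gw) in seen)) {
                seen[mac, gw] = 1
                count[mac]++
                where[mac] = where[mac] " " gw
            }
        }
        END {
            # Print only MACs that were seen on two or more connectors
            for (m in count)
                if (count[m] > 1) print m where[m]
        }
    ' | sort
}
```

An empty result would suggest that no MAC is present on more than one connector of that switch. Each output line lists a MAC followed by the connectors on which it was found.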

Solution

The solution is to apply the patch from the unpublished bug:

Bug 18175326 - INTERIM PATCH FOR BUGS 17294107, 17596353, 17934988, 18120474 FOR EL 2.0.6.0.0 

The base bug fixes included in this patch are:

  • Bug 17294107: E2E: EMOC GENERATED HOSTNAMES CONTAIN UNDERSCORE -- AN ILLEGAL CHARACTER
  • Bug 17596353: el 1/2 rack with 2 pools. need to set distribution group to have all 16 nodes
  • Bug 17934988: problem with eoib between vservers in exalogic 2.0.6.0.0
  • Bug 18120474: eoib network issues and not able to ping vserver from another one

Details

Please file a Service Request with Oracle Customer Support to obtain the interim patch for Bug 18175326.

Note to EEST Support:

This EM Ops Center (EMOC) patch conflicts with the following patch that provides support for NM2-GW FW v2.1.3-4:

BUG 18034691 - CUMULATIVE PATCH FOR EXALOGIC CONTROL SERVICE (EL VIRTUAL )

Since the two patches are not compatible, customers requiring fixes included in both patches need to wait for the Exalogic April 2014 Patch Set Update (PSU). The following workaround may be implemented until the April 2014 PSU is available:

  • For deployments with Exalogic connected to Exadata, where Exadata is running NM2 FW v2.1.3-4, apply patch 18034691 and implement the workaround provided below in this MOS note for the duplicate vNIC issue.
  • For deployments with Exalogic connected to Exadata, where Exadata is to be upgraded to 11.2.3.3.0, perform the Exadata upgrade without upgrading the NM2 GW FW to v2.1.3-4 and apply this duplicate vNIC patch. The NM2 FW upgrade in the Exadata upgrade to 11.2.3.3.0 is optional.

Interim Patch for the Bug 18175326

The interim one-off patch for bug 18175326 can be downloaded from ARU 17312922.

Note:

Distribution of ARU 17312922 is 'By Support' only: it is provided to customers through a customer-filed Service Request. The patch should be distributed to a customer only with approval from the Exalogic Development team (please contact Dev Prabhu (dev.prabhu@oracle.com) for approval, providing the Customer Name, Region and SR number).

Customers require a password to download this patch (log in to the ARU site with your SSO to see the password). Do not supply the password to customers without first ensuring that:

  • Exalogic Development has approved distribution of the patch.
  • This patch is appropriate for the customer.
  • The conflict with the patch from BUG 18034691, which provides support for NM2-GW FW v2.1.3-4, has been verified.

When supplying the password:

  • Remind customers that this patch is not subject to the same rigorous level of testing as done for PSUs.
  • Note that the password is valid for 7 days.
  • Ask customers to review the readme (included in the patch p18175326_20600_Linux-x86-64.zip) before applying the patch.

Workaround Procedure

The following INTERNAL ONLY section of this note describes the steps to implement a workaround if the customer is unable to apply patch 18175326 because it conflicts with EMOC cumulative patch 18034691, which may be critical for the customer. This workaround procedure must be performed under support supervision:

1. Enter Maintenance Period

Plan a maintenance window in order to perform this procedure.

2. Shut Down All Guest vServers

Shut the vServers down in parallel from the EMOC console to reduce the time required.

3. Generate list of ALL guest vServer VNICs on each switch

Ensure the list does NOT contain the VNICs used by the ELControl vServer.

For example, determine the ELControl VNICs from any Compute Node:

$> (cd /OVS/Repositories/0004fb0000030000f8d9bef44e1586b8/VirtualMachines/;grep -c Control */* | grep -v :0 | cut -d: -f1 | xargs cat | egrep "simple_name|exalogic_vnic|uuid")

OVM_simple_name = 'ExalogicControlOpsCenterPC1'
uuid = '0004fb0000060000302c15fd67d21624'
expose_host_uuid = 1
exalogic_vnic = [{'pkey': [], 'guid': '0x88e22c1fdb58c20a', 'port': '1'}, {'pkey': [], 'guid': '0x88e22c1fdb58c20b', 'port': '2'}]

OVM_simple_name = 'ExalogicControlOpsCenterPC2'
uuid = '0004fb000006000095cffcd18478c782'
expose_host_uuid = 1
exalogic_vnic = [{'pkey': [], 'guid': '0xe3a9e9709fd2425c', 'port': '1'}, {'pkey': [], 'guid': '0xe3a9e9709fd2425d', 'port': '2'}]

OVM_simple_name = 'ExalogicControl'
uuid = '0004fb0000060000c3637b689f90c079'
expose_host_uuid = 1
exalogic_vnic = [{'pkey': ['0x8006'], 'guid': '0x9013963aa357ef63', 'port': '1'}, {'pkey': ['0x8006'], 'guid': '0x9013963aa357ef64', 'port': '2'}]

  

Note:

The ELControl vServer VNICs should NOT be deleted. However, the user must ensure that they exist on the single connector that will eventually remain available. In this case, it is assumed that this connector will be '0A-ETH-1'.
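
The GUIDs to protect have to be copied from the exalogic_vnic lines into an exclusion pattern. As a minimal sketch (the function name control_guid_pattern is hypothetical, and grep -o is assumed to be available, as it is in GNU grep), the following helper reads configuration lines on standard input and prints the GUIDs, without the 0x prefix, joined by '|' so that the result can be passed to egrep -iv:

```shell
# Sketch: build an egrep pattern from the 'guid' values of exalogic_vnic lines.
# Reads configuration text on stdin; prints e.g. "guidA|guidB".
# ASSUMPTION: GUIDs appear as 'guid': '0x<hex>' exactly as in the output above.
control_guid_pattern() {
    grep 'exalogic_vnic' |
        grep -o "'guid': '0x[0-9a-fA-F]*'" |
        sed "s/.*'0x\([0-9a-fA-F]*\)'/\1/" |
        paste -s -d '|' -
}
```

Only feed it the lines that belong to the vServer whose VNICs must be kept. The pkey values are ignored because the pattern matches the 'guid' key only.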

  

Generate list of VNICs to delete, without the ELControl VNICs 

[root@xxxib01 ~]# shownics | grep ETH | egrep -iv "9013963aa357ef63|9013963aa357ef64" | sed 's/^ *\([0-9]*\) .*:[0-9a-fA-F]* [0-9]* [0-9a-fA-F]* /\1/' > /tmp/vnics_to_delete

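Sample output of the above is:

...
180  0A-ETH-1
409  0A-ETH-1
601  0A-ETH-1
156  0A-ETH-1
771  0A-ETH-1
128  0A-ETH-1
317  0A-ETH-1
77  0A-ETH-1
449  0A-ETH-1
49  0A-ETH-1
93  0A-ETH-1
...

Each line holds a VNIC ID followed by its connector. These are the two values that the deletevnic tool takes as arguments (its usage, shown here, lists the connector first):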

[root@enxl01sib001 ~]# deletevnic
Usage deletevnic connector vNIC_Id
Example: deletevnic 0a-eth-1 1
Legal values for connector is:
0A-ETH-1, 0A-ETH-2, 0A-ETH-3, 0A-ETH-4, 0A-ETH,
1A-ETH-1, 1A-ETH-2, 1A-ETH-3, 1A-ETH-4, 1A-ETH,


 

4. Fix the Switch Configuration

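  • Remove ALL the VNICs in the list generated above, on each switch

Each line of /tmp/vnics_to_delete holds the VNIC ID followed by the connector, while deletevnic expects the connector first, so the loops below swap the two fields.

Show the VNIC delete commands without running them:

cat /tmp/vnics_to_delete | while read id connector;do echo "deleting: \"deletevnic $connector $id\""; done

Actually delete them:

cat /tmp/vnics_to_delete | while read id connector;do echo "deleting: \"deletevnic $connector $id\"";deletevnic $connector $id; done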

 

NOTE: This will also remove any dead/orphaned vnics
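
Because the deletion is destructive, it can help to generate the exact commands first and review them before anything runs. The sketch below is one way to do that. The function name build_deletevnic_cmds is hypothetical, and the argument order (connector, then VNIC ID) follows the deletevnic usage text shown earlier. It reads "ID CONNECTOR" lines and skips anything that does not look like a numeric ID followed by a legal single-port connector name.

```shell
# Sketch: turn "ID CONNECTOR" lines into deletevnic commands without running them.
# Lines that are not "<number> <0A|1A>-ETH-<1..4>" are ignored.
build_deletevnic_cmds() {
    while read -r id connector rest; do
        # VNIC ID must be non-empty and purely numeric
        case "$id" in
            ''|*[!0-9]*) continue ;;
        esac
        # Connector must be one of the single-port names from the usage text
        case "$connector" in
            [01]A-ETH-[1-4]) ;;
            *) continue ;;
        esac
        # Reject lines with unexpected trailing fields
        [ -z "$rest" ] || continue
        echo "deletevnic $connector $id"
    done
}

# Possible use on a switch (not run here):
#   build_deletevnic_cmds < /tmp/vnics_to_delete > /tmp/deletevnic_cmds
#   (review /tmp/deletevnic_cmds, then run it with: sh /tmp/deletevnic_cmds)
```

Reviewing the generated file also gives a record of exactly which VNICs were removed during the maintenance window.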

 

  • If the ELControl vServer has a VNIC on any connector other than 0A-ETH-1, perform the following steps:
    • Stop the control stack using ExaBR (for the ExaBR tool, please refer to Document ID 1586312.1 - Exalogic Lifecycle Toolkit: ExaLogs and ExaBR).
    • Manually delete and recreate the ELControl VNICs on each switch so that they are on connector 0A-ETH-1
    • Start the control stack and confirm its health
    • Confirm only 1 VNIC exists on each switch, on the 0A-ETH-1 connector
Note:

The procedure in this document assumes that all but the connector 0A-ETH-1 will be disabled.
If a different connector will become the sole active connection, any commands below should be modified appropriately.

 

  • Delete any VLANs on the connectors that are going to be disabled

This uses the same kind of command as above, except that it parses and deletes VLANs instead of VNICs:

showvlan | grep ETH | grep -v "0A-ETH-1" | awk '{print $1" "$2}' | sed 's/ / -vlan /' | while read line;do echo "deleting: \"deletevlan $line\"";deletevlan $line;done

 

  • Disable 3 of the 4 connectors whose link is up on the switch (keep 0A-ETH-1 only)

For example, the gateway configuration on each switch is:

[root@enxl01sib001 ~]# showgwports

INTERNAL PORTS:
---------------

Device   Port Portname  PeerPort PortGUID           LID    IBState  GWState
---------------------------------------------------------------------------
Bridge-0  1   Bridge-0-1    4    0x0010e0300cfcc001 0x0006 Active   Up
Bridge-0  2   Bridge-0-2    3    0x0010e0300cfcc002 0x0007 Active   Up
Bridge-1  1   Bridge-1-1    2    0x0010e0300cfcc041 0x0008 Active   Up
Bridge-1  2   Bridge-1-2    1    0x0010e0300cfcc042 0x0009 Active   Up

CONNECTOR 0A-ETH:
-----------------

Port      Bridge      Adminstate Link  State       Linkmode       Speed
------------------------------------------------------------------------
0A-ETH-1  Bridge-0-2  Enabled    Up    Up          XFI            10Gb/s
0A-ETH-2  Bridge-0-2  Enabled    Up    Up          XFI            10Gb/s
0A-ETH-3  Bridge-0-1  Enabled    Down  Reset       XFI            10Gb/s
0A-ETH-4  Bridge-0-1  Enabled    Down  Reset       XFI            10Gb/s

CONNECTOR 1A-ETH:
-----------------

Port      Bridge      Adminstate Link  State       Linkmode       Speed
------------------------------------------------------------------------
1A-ETH-1  Bridge-1-2  Enabled    Up    Up          XFI            10Gb/s
1A-ETH-2  Bridge-1-2  Enabled    Up    Up          XFI            10Gb/s
1A-ETH-3  Bridge-1-1  Enabled    Down  Reset       XFI            10Gb/s
1A-ETH-4  Bridge-1-1  Enabled    Down  Reset       XFI            10Gb/s

Use the disablegwport command to disable the following connectors:

0A-ETH-2
1A-ETH-1
1A-ETH-2

The command usage is:

[root@enxl01sib001 ~]# disablegwport
Usage disablegwport connector
Legal values for connector is:
0A-ETH-1, 0A-ETH-2, 0A-ETH-3, 0A-ETH-4, 0A-ETH,
1A-ETH-1, 1A-ETH-2, 1A-ETH-3, 1A-ETH-4, 1A-ETH,

Run the command once for each of the three connectors:

[root@enxl01sib001 ~]# disablegwport 0A-ETH-2
[root@enxl01sib001 ~]# disablegwport 1A-ETH-1
[root@enxl01sib001 ~]# disablegwport 1A-ETH-2
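
The list of connectors to disable can also be derived from the showgwports output instead of being read off by eye. This is a minimal sketch under the assumption that the connector rows keep the column order shown in this note (Port, Bridge, Adminstate, Link, ...). The function name connectors_to_disable is hypothetical. It prints every connector whose Link column is Up, except the one that is to be kept.

```shell
# Sketch: list connectors with Link "Up", excluding the connector to keep.
# ASSUMPTION: connector rows look like
#   0A-ETH-2  Bridge-0-2  Enabled    Up    Up   XFI   10Gb/s
# so the port name is column 1 and the Link state is column 4.
connectors_to_disable() {
    keep=$1
    awk -v keep="$keep" '
        $1 ~ /^[01]A-ETH-[1-4]$/ && $4 == "Up" && $1 != keep { print $1 }
    '
}

# Possible use on a switch (not run here):
#   showgwports | connectors_to_disable 0A-ETH-1
```

With the sample showgwports output in this note, the result is the same three connectors that are disabled manually in the commands shown in this step.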

 

  • Reboot the active switch, causing a failover. When the rebooted switch is back up, reboot the other switch too

 

5. Select any 3 vServers to test the new configuration

  • Start test vServers
  • Confirm vServer connectivity works
    • For example, test a ping to 3 guest vServers (IP addresses 10.141.135.85, 10.141.135.87, and 10.141.135.92):
      root@vGuest01 ~# for ip in 10.141.135.85 10.141.135.87 10.141.135.92; do ping -c 1 -W 1 $ip 1>/dev/null; if [ $? -eq 0 ]; then echo "${ip} --> OK"; else echo "${ip} --> BAD"; fi; done
      10.141.135.85 --> OK
      10.141.135.87 --> OK
      10.141.135.92 --> OK

      An "OK" result means that connectivity to that vServer works.

  • Reboot the vServers and confirm connectivity still works
  • Confirm no extra VNICs are created for the vServer/mac/guid combinations
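
The connectivity test in this step is repeated several times during the procedure (after the first start, after the reboot, and again in steps 6 and 7), so it may be convenient to wrap it in a function. The following is a minimal sketch. The names check_hosts and ping_once are hypothetical, and the -W option is assumed to be the reply timeout of the Linux iputils ping. The probe command is passed in as the first argument so that the reporting logic can be exercised without network access.

```shell
# Sketch: probe a list of addresses and report OK/BAD for each.
# $1 is the probe command (called with one address); the rest are addresses.
# Returns 0 only if every address answered.
check_hosts() {
    probe=$1
    shift
    failures=0
    for ip in "$@"; do
        if $probe "$ip" >/dev/null 2>&1; then
            echo "$ip --> OK"
        else
            echo "$ip --> BAD"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}

# One ICMP echo request with a 1-second reply timeout (Linux iputils ping assumed)
ping_once() {
    ping -c 1 -W 1 "$1"
}

# Possible use from a guest vServer (not run here):
#   check_hosts ping_once 10.141.135.85 10.141.135.87 10.141.135.92
```

The non-zero return status when any address fails makes the function usable in a larger validation script.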

 

6. Create 2 new vServers

  • Create one guest vServer using the 2.0.6.0.0 template and another using any current 2.0.4.x template (the latter in case 2.0.6.0.0 was upgraded from 2.0.4.x and old vServers were migrated).
  • Run the same validation actions as step 5 above.

 

7. Start all guest vServers

  • Perform further connectivity validations
  • Confirm all switch VNICs are on single connector
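
The final check, that all switch VNICs are on a single connector, can be sketched as follows. This is an illustration only. It assumes that the connector is the last column of each VNIC row in the shownics output, as the sample list in step 3 suggests, and the function names vnic_connectors and all_on_single_connector are hypothetical.

```shell
# Sketch: list the distinct connectors that carry VNICs.
# ASSUMPTION: the connector name is the last column of each VNIC row.
vnic_connectors() {
    awk '/ETH/ { print $NF }' | sort -u
}

# Succeeds only if the VNIC rows on stdin all use the expected connector.
all_on_single_connector() {
    expected=$1
    found=$(vnic_connectors)
    [ "$found" = "$expected" ]
}

# Possible use on a switch (not run here):
#   shownics | all_on_single_connector 0A-ETH-1 && echo "single connector: OK"
```

If the check fails, running shownics through vnic_connectors alone shows which other connectors still carry VNICs.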

8. Leave Maintenance Window

 

References

<BUG:18175326> - INTERIM PATCH FOR BUGS 17294107, 17596353, 17934988, 18120474 FOR EL 2.0.6.0.0
<BUG:17934988> - PROBLEM WITH EOIB BETWEEN VSERVERS IN EXALOGIC 2.0.6.0.0
<BUG:17294107> - E2E: EMOC GENERATED HOSTNAMES CONTAIN UNDERSCORE -- AN ILLEGAL CHARACTER
<BUG:17596353> - EL 1/2 RACK WITH 2 POOLS. NEED TO SET DISTRIBUTION GROUP TO HAVE ALL 16 NODES
<BUG:18120474> - EOIB NETWORK ISSUES AND NOT ABLE TO PING VSERVER FROM ANOTHER ONE.

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.