Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-2218689.1
Update Date:2018-05-23

Solution Type  Predictive Self-Healing Sure

Solution  2218689.1 :   Exalogic Infiniband Switch Replacement - Follow-up Actions (Restoration)  


Related Items
  • Oracle Exalogic Elastic Cloud Software
  • Exalogic Elastic Cloud X3-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Purpose
Details
 1. Validate the Firmware version on the Newly Replaced Switch
 2. Validate whether Subnet Manager is disabled on newly replaced Switch
 3. Validate SM controlled_handover on current switch running the Master in the Fabric
 4. Validate that the physical installation of the new switch into the fabric was completed successfully. Run the "ibnetdiscover" and "ibswitches" command.
 5. Change the passwords of the root and ilom-admin users on the replacement switch to their previous values.
 Steps for changing the password for "root" user
 Steps for changing the password for "ilom-admin" user
 6. Update the smnodes list on the replaced switch with the IP addresses of all the Exalogic switches running the subnet manager.
 7. Set SM Priority To Recommended Values on the replaced switch
 8. Validate if ocadmin SNMP community exists and create it if it does not exist
 9. Copy the /conf/partitions.current file from the Switch running the SM Master to the newly replaced Switch under /conf directory
 10. Restore the Switch configuration from Exabr backups.
 a. View a list of backups by running ExaBR list command on replacement Infiniband Switch
 b. Use ExaBR restore command on newly replaced InfiniBand switch to restore the configuration from exabr backups:
 c. Validate if the exabr restore command restored the Switch configuration successfully.
 d. Run "smpartition start && smpartition commit" command on the Switch running the SM Master to propagate the partitions from SM Master to newly replaced Switch
 11. Validate whether the VNICs and VLANs are seen after Exabr restore on the newly replaced Switch.
 12. Register the Port GUIDs of Newly Replaced Switch with EoIB Partitions using "exabr ib-register" command.
 13. Validate the status of VNICs and VLANs
 14. Additional restoration steps for Virtual & Hybrid racks with EMOC
 Final checkup and verification
 a. Check/set firewall rule settings on port 623
 b. Check the opensm status and smpriorities on all switches in the IB fabric
 c. Check network/fabric is operating normally
 d. Take a fresh Exalogic Control vServers backup using Exabr.
 e. Collect fresh full exalogs from the rack after Switch replacement.
 KNOWN ISSUES WHICH CAN BE ENCOUNTERED DURING SWITCH REPLACEMENT
 Exalogic Exabr ib-register Command To Register New Replaced Infiniband Switch Port GUIDs Fails With "Unable to get rpc version on some nodes in the fabric" Error
 Exalogic: "smpartition start" & "exabr ib-register" Commands Failing With "cli commit is in progress" Error
References


Applies to:

Oracle Exalogic Elastic Cloud Software - Version 2.0.0.0.0 and later
Exalogic Elastic Cloud X3-2 Hardware - Version X3 to X3 [Release X3]
Linux x86-64
Oracle Solaris on x86-64 (64-bit)
Oracle Virtual Server x86-64

Purpose

This Note provides follow-up reconfiguration steps for restoring the Exalogic Infiniband (IB) switch configuration after replacement.

This document is intended to be used immediately after physical replacement steps have been completed. 

Prior to a customer running these procedures, an Oracle Field Engineer should have completed replacing the switch by following the two Canned Action Plan documents below.

<Note 1383773.1>: How to Replace a Failed Sun Network QDR InfiniBand Gateway Switch 
<Note 1341658.1>: How to Replace a Failed Sun Datacenter InfiniBand Switch 36 

Details

After an IB switch replacement, the following steps restore the customer-specific configuration of smnodes, IB partitions, and vNICs (if any), using previously taken backups if available. These steps must be performed by the customer administrator, or under the close supervision of the customer administrator. Some of these steps require root access to the IB switch running the SM Master. When the replacement is done in a live production environment, even a minor error in these steps can cause an outage of nodes or servers, so exercise caution at all times.

1. Validate the Firmware version on the Newly Replaced Switch

The replaced switch must run the same firmware version as the other working Exalogic IB switches on the fabric. 

Check the firmware version using the "version" command.

2. Validate whether Subnet Manager is disabled on newly replaced Switch

On the new replacement switch, make sure that the subnet manager is disabled by running the command below:

service opensmd status

If the subnet manager is running, disable it by running the "disablesm" command on the switch.

3. Validate SM controlled_handover on current switch running the Master in the Fabric

Check the setting of controlled_handover on the switch running as the current Master. Log in to that switch and run the following command to check the controlled_handover setting, which should be TRUE.

setsmpriority list 

4. Validate that the physical installation of the new switch into the fabric was completed successfully. Run the "ibnetdiscover" and "ibswitches" command.

You should see the newly replaced switch in the "ibswitches" command output. "ibnetdiscover" command output should show the newly replaced switch as connected to all the Compute Nodes and Storage heads in the rack.

5. Change the passwords of the root and ilom-admin users on the replacement switch to their previous values.

Steps for changing the password for "root" user

  1. Log in as root to the IB switch.

  2. Run the passwd command, then type and re-type the new password for the root user.
    # passwd
    Changing password for user root.
    New UNIX password:
    Retype new UNIX password:

Steps for changing the password for "ilom-admin" user

  1. Log in to the IB switch as the "ilom-admin" user to reach the ILOM prompt.

  2. At the ILOM prompt, change the password for the ilom-admin user by running the following command. Enter and re-enter the new password for the ilom-admin user.
    -> set /SP/users/ilom-admin password
    Enter new password: ***********
    Enter new password again: ***********

6. Update the smnodes list on the replaced switch with the IP addresses of all the Exalogic switches running the subnet manager.

To do this use the following command:

smnodes add IP_Address_of_Switch 
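
When several switches run the subnet manager, the command has to be repeated once per IP address. The loop below is a minimal sketch of that repetition. The function name add_smnodes and the SMNODES_CMD variable are illustrative, and the addresses in the example are placeholders.

```shell
# Sketch only: run "smnodes add" once for each IP address given.
# SMNODES_CMD defaults to the switch's smnodes command; set it to
# "echo smnodes" to preview the commands without changing anything.
add_smnodes() (
    cmd=${SMNODES_CMD:-smnodes}
    for ip in "$@"; do
        # Skip empty arguments and anything containing characters other
        # than digits and dots.
        case $ip in
            "" | *[!0-9.]*) echo "skipping invalid address: $ip" >&2; continue ;;
        esac
        $cmd add "$ip" || exit 1
    done
)

# Example (placeholder addresses):
#   add_smnodes 192.0.2.11 192.0.2.12 192.0.2.13
```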

7. Set SM Priority To Recommended Values on the replaced switch

Set SM Priorities and controlled handover settings on the newly replaced Switch to recommended values as described in following MOS Note.

IMPORTANT NOTE: Do not enable the SM using "enablesm" command after configuring the SM settings. SM has to be enabled after the Switch restoration is completed.

<Note 1682501.1>: Setting up the Subnet Manager in a multi-rack cabling configuration containing Exalogic/Big Data Appliance and Exadata/SuperCluster

8. Validate if ocadmin SNMP community exists and create it if it does not exist

Validate whether the ocadmin SNMP community exists using the following command:

spsh show /SP/services/snmp/communities/ | grep ocadmin

 

NOTE:

In some cases the above command to check the ocadmin community can fail. If that happens, follow the steps below to check the ocadmin community string.

  1. On the IB Switch run "spsh" command to login to ILOM prompt.

  2. Run below command from ILOM prompt. You should see ocadmin listed under targets if it exists.
    show /SP/services/snmp/communities/

If you do not see the ocadmin community in the above command output, create it using the following procedure.

  1. Log in to the ILOM prompt of the switch by running the "spsh" command.

  2. Manually create the ocadmin community from the ILOM prompt as follows:
    create /SP/services/snmp/communities/ocadmin
     

NOTE:

If the ocadmin community does not exist, the ExaBR restore command (./exabr restore hostname) run in step 10 may fail with the following error:

ERROR: set /SP/config load_uri=http://10.10.XX.XX/switch.backup
set: Load partially successful, please view the event log

INTERNAL NOTE: Issue happens due to known Bug 16926597 : SIQ: EXABR IB GW RESTORE RESULTS IN ERROR IF PARTIALLY SUCCESSFULY LOAD_URI 

9. Copy the /conf/partitions.current file from the Switch running the SM Master to the newly replaced Switch under /conf directory

Determine which switch is running the SM Master by running the command below, then copy its /conf/partitions.current file into the /conf directory on the newly replaced switch.

getmaster
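
The copy itself could be done in several ways. The following is a minimal sketch, assuming scp is available on the switch and root SSH login from the SM Master to the replacement switch is permitted; neither is confirmed by this note. The function name copy_partitions and the RUN variable are illustrative only.

```shell
# Sketch only: run on the switch that "getmaster" reports as the SM Master.
# Assumption: scp exists on the switch and root SSH to the new switch is allowed.
# Set RUN=echo to print the command instead of executing it.
copy_partitions() (
    new_switch=$1
    if [ -z "$new_switch" ]; then
        echo "usage: copy_partitions NEW_SWITCH_HOSTNAME" >&2
        exit 2
    fi
    # -p keeps the modification time and mode of the source file.
    ${RUN:-} scp -p /conf/partitions.current "root@${new_switch}:/conf/partitions.current"
)

# Example (dry run):
#   RUN=echo; copy_partitions ib01.example.com
```

Running it first with RUN=echo shows the exact command before anything is written to the replacement switch.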

10. Restore the Switch configuration from Exabr backups.

Restore the Switch configuration from Exabr backups using below steps:

a. View a list of backups by running ExaBR list command on replacement Infiniband Switch


./exabr list <Switch hostname> [options]

Example:

./exabr list ib01.example.com -v

In this example, ExaBR lists the backups for ib01.example.com in detail, because the -v option is used.

b. Use ExaBR restore command on newly replaced InfiniBand switch to restore the configuration from exabr backups:

Use ExaBR restore command on newly replaced InfiniBand switch as follows to restore the configuration from exabr backups:

./exabr restore <replacement Switch hostname> -b <exabr backup directory>

In the command above, ExaBR restores an InfiniBand switch. The data is restored from the specified backup directory because the -b option is used.

For example, if the switch's exabr backup directory name is 201308230428 (taken from the "exabr list <switch hostname>" output in step a above) and the switch hostname is ib01.example.com, the exabr restore command looks as follows:

./exabr restore ib01.example.com -b 201308230428
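
When many backups exist, the directory name to pass to -b can be picked out with a small helper. This is a sketch under assumptions: it supposes that the backups for one switch sit together in a single directory and that each backup is named with a 12-digit timestamp, as in the example name 201308230428. The function name latest_backup is illustrative. The newest backup is not necessarily the right one to restore, so confirm the choice against the "exabr list" output.

```shell
# Sketch only: print the newest 12-digit backup name found in a directory.
# Assumption: backup names are YYYYMMDDHHMM timestamps, so sorting them as
# text also sorts them by time.
latest_backup() (
    backup_root=$1
    # Keep only entries whose name is exactly 12 digits, then take the last.
    ls -1 "$backup_root" 2>/dev/null | grep -E '^[0-9]{12}$' | sort | tail -n 1
)

# Example (the directory is a placeholder for wherever the backups are stored):
#   ./exabr restore ib01.example.com -b "$(latest_backup /path/to/backups/ib01.example.com)"
```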

 

IMPORTANT NOTE

If the console output of the "exabr restore" command in step 10 (b) above shows the warning message below during the restore process, contact Support via a Service Request. When this warning appears, the switch configuration has to be restored manually.

WARNING: ILOM reports this message: 'Load partially successful. Please view the event log'
Switch configuration restored, with warnings.
To see the ILOM event-log navigate to 'System Monitoring-->Event Logs' in the ILOM Web user interface

c. Validate if the exabr restore command restored the Switch configuration successfully.

Validate that the exabr restore command restored the switch configuration successfully by comparing the contents of the /conf/bx.conf, /conf/bxm.conf and /etc/opensm/opensm.conf files on the replacement switch with the bx.conf, bxm.conf and opensm.conf files in the exabr backup folder for that switch. If the contents do not match, the exabr restore command did not restore the configuration. Do not proceed with further steps. Contact Support for steps to restore the switch configuration manually.
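
The comparison can be scripted. The following is a minimal sketch, assuming both sets of files can be read from one host, for example after copying the three live files off the switch into a scratch directory that keeps their original paths. The function name compare_switch_configs and its two arguments are illustrative.

```shell
# Sketch only: compare the three live configuration files with their backups.
#   $1 = exabr backup folder containing bx.conf, bxm.conf and opensm.conf
#   $2 = optional prefix under which the live files were copied
#        (leave empty when the live paths are directly readable)
compare_switch_configs() (
    backup_dir=$1
    live_root=${2:-}
    status=0
    for pair in "/conf/bx.conf:bx.conf" \
                "/conf/bxm.conf:bxm.conf" \
                "/etc/opensm/opensm.conf:opensm.conf"; do
        live=${live_root}${pair%%:*}
        saved=${backup_dir}/${pair##*:}
        # cmp -s is silent; a missing file also counts as a difference.
        if cmp -s "$live" "$saved"; then
            echo "MATCH    $live"
        else
            echo "DIFFERS  $live"
            status=1
        fi
    done
    exit $status
)
```

The function exits non-zero if any file differs or is missing, which is the condition under which the procedure should stop and Support should be contacted.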

d. Run "smpartition start && smpartition commit" command on the Switch running the SM Master to propagate the partitions from SM Master to newly replaced Switch

Run the command below on the switch running the SM Master to propagate the partitions from the SM Master to the newly replaced switch.

smpartition start && smpartition commit 

Reference Documentation for Exabr restore procedure:

http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR118

IMPORTANT INTERNAL NOTE TO SUPPORT 

Manual Switch Restoration on IB Switches With 2.2.2.X Firmware in case Exabr Restore Command Fails with Decryption Errors (or) Does Not restore the Switch Configuration Correctly

In some cases, running the Exabr restore command (Step 4 under section "Using ExaBR - See section 3.3.2 - Recovering InfiniBand Switches" of the http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR118 documentation) to restore the configuration of an IB switch with firmware 2.2.2.X does not succeed.

The following symptoms confirm that the switch restoration did not succeed.

  1. The "Load partially successful" warning and the prompt to review the ILOM event log appear when running the exabr restore command to restore the switch configuration, as shown below.

    This switch is not SM master

    smnodes list for this switch:
    xx.xx248.215 (eltest3gw02.testing.com)
    xx.xx249.55 (exa7sw-iba01.testing.com)
    xx.xx249.56 (exa7sw-ibb01.testing.com)
    xx.xx248.214 (eltest3gw01.testing.com)
    It matches the one from the backup
    Please review it before continuing.
    eltest3gw01.testing.com is member of the current smnodes list. After the restore, we will enable SM.
    Continue?
    [y/n] y
    Continuing with restore
    Checking firmware version
    Restoring switch configuration: /exalogic-lcdata/backups/ib_gw_switches/eltest3gw01.testing.com/201702200400/switch.backup
    (takes a couple of minutes)
    WARNING: ILOM reports this message: 'Load partially successful. Please view the event log' <<<<<<<<<<<<
    Switch configuration restored, with warnings.
    To see the ILOM event-log navigate to 'System Monitoring-->Event Logs' in the ILOM Web user interface
    SM already running, no need to enable it
    Master switch is: eltest3gw02.testing.com
    eltest3gw02.testing.com authentication successful
    Propagating partitions from the master switch
    ---------------------------------------------------
    IB partitions are backed up but not automatically restored by exabr.
    In case you want to restore partitions, please manually copy partitions.current file
    to /conf/partitions.current on the master switch, and run 'smpartition start && smpartition commit'
    on the master switch
    ---------------------------------------------------
    OK: eltest3gw01.testing.com Infiniband Gateway switch restore successful (took: 1 minutes)

  2. Running the "spsh show /SP/logs/event/list/" command to list the events on the replacement switch shows a "decryption failed." message, as shown below.
    51 Wed Mar 1 22:08:56 2017 Restore Log minor
    Config restore complete.
    50 Wed Mar 1 22:08:04 2017 Restore Log major
    Config restore: Unable to restore property '/SP/clients/dns/nameserver' (Invalid IP address).
    49 Wed Mar 1 22:07:54 2017 Restore Log major
    Config restore: Unable to restore property 'platform', decryption failed.
    48 Wed Mar 1 22:07:53 2017 Restore Log minor
  3. The showvnics and showvlan commands do not show any VNICs or VLANs. The /conf/bx.conf and /conf/bxm.conf files on the replacement switch do not contain the restored configuration (the bx.conf file should contain createvnic and createvlan command lines).

If all 3 symptoms above are seen, you are running into the Exabr switch restoration issue with decryption errors. The following bug has been opened for this issue and is being investigated by Development.

BUG 25661713 - EXABR NOT RESTORING CONFIGURATION OF IB SWITCHES WITH FIRMWARE 2.2.2.X

Until a fix for the above bug is available, follow the workaround below if you run into this issue:

IMPORTANT NOTE TO SUPPORT: The following workaround steps should be executed by the customer with supervision from Exalogic Support over a WebEx and should not be executed by Exalogic Support themselves on the WebEx or via Platinum access for Platinum Customers.

WORKAROUND

  1. Disable the subnet manager on the replacement switch by running the command below.
    disablesm

    Verify the opensmd service is stopped using below command
    service opensmd status
  2. Copy below files from exabr backup folder of replacement Switch to newly replaced Switch.

    IMPORTANT NOTE: Ensure that you are copying the correct configuration files from the exabr backup folder of the switch being replaced to the newly replaced switch. Copying configuration files from another running switch or from an incorrect backup will cause network outages.

    /etc/opensm/opensm.conf
    /conf/bx.conf
    /conf/bxm.conf
    /conf/smnodes
  3. Copy the /conf/partitions.current file from the Switch running the SM Master to the newly replaced Switch under /conf directory
  4. Validate that the MAC addresses in the bx.conf file on the replacement switch are different from the MAC addresses in bx.conf on the other running IB switches, using the command below. MAC addresses should be unique on each of the NM2-GW switches.

    cat /conf/bx.conf | grep createvnic

    In the above command output, the MAC addresses on this switch should differ from those on the other NM2-GW switches.
  5. Validate that the GWInstance ID on the replacement switch is different from that of the other NM2-GW switches in the fabric using the command below.

    showgwconfig

    In the above command output, the configured value for GWInstance on the replacement switch should be different from that on the other NM2-GW switches. GWInstance should be unique on each of the NM2-GW switches.

  6. Enable the subnet manager on the replacement switch by running the command below.
    enablesm

    Verify the opensmd service is started using below command
    service opensmd status
  7. Run the commands below on the switch running the Master. You can find the switch running the Master by running the "getmaster" command. These commands start and commit the partition configuration on the Master switch, so that the partition information is propagated from the Master switch to the other standby switches running opensm.
    smpartition start
    smpartition commit
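
The MAC uniqueness check in workaround step 4 can be automated once the bx.conf files of two switches have been copied to one host. This is a sketch under assumptions: it supposes that each MAC address on a createvnic line appears as its own whitespace-separated word made of six colon-separated hex octets. Adjust the pattern if the real file writes MAC addresses differently. The function name shared_vnic_macs is illustrative.

```shell
# Sketch only: print MAC addresses that appear on createvnic lines in BOTH files.
#   $1, $2 = copies of /conf/bx.conf taken from two different switches
# Assumption: a MAC is a separate word of six colon-separated hex octets.
shared_vnic_macs() (
    extract() {
        # Split createvnic lines into words, keep words that are exactly a
        # 6-octet MAC, normalise to upper case, and sort for comm.
        awk '/createvnic/ { for (i = 1; i <= NF; i++) print $i }' "$1" |
            grep -ixE '([0-9a-f]{2}:){5}[0-9a-f]{2}' |
            tr 'a-f' 'A-F' | sort -u
    }
    a=$(mktemp)
    b=$(mktemp)
    extract "$1" > "$a"
    extract "$2" > "$b"
    # comm -12 prints only the lines common to both sorted lists.
    comm -12 "$a" "$b"
    rm -f "$a" "$b"
)

# Example: no output means no MAC address is shared between the two files.
#   shared_vnic_macs ./new-switch/bx.conf ./other-switch/bx.conf
```

Any address printed is present on both switches and needs attention before the subnet manager is enabled. Note that empty output is also what the function prints if the MAC format assumption is wrong, so check the raw grep output as well.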

Proceed with rest of the steps starting with step 11 in the MOS Note to complete Switch restoration.

11. Validate whether the VNICs and VLANs are seen after Exabr restore on the newly replaced Switch.

Validate whether the VNICs and VLANs are seen after Exabr restore on the newly replaced Switch by running below commands.

showvnics
showvlan

You should see the VLANs and the VNICs (in WAIT-VHUB state). VNICs show as WAIT-VHUB on the switch because the port GUIDs of the new switch have not yet been added to the partitions. This is corrected by running the exabr ib-register command as listed in step 12.

If you do not see the VNICs and VLANs when running above commands on replaced switch, reboot the replaced Switch using below "reboot" command and wait for 5 minutes for the Switch to reboot. Log back in and check the VNICs and VLANs again.

reboot

12. Register the Port GUIDs of Newly Replaced Switch with EoIB Partitions using "exabr ib-register" command.

Once the configuration on the newly replaced switch is restored, register the port GUIDs of the newly replaced switch with the EoIB partitions using the Exabr ib-register command. Follow Step 5 in section "3.3.3 Replacing InfiniBand Switches in a Virtual Environment" of the documentation at http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR173

For registering the gateway port GUIDs with the EoIB partitions we can run exabr command with ib-register option as follows:

./exabr ib-register hostname_of_IB_switch 

Example:

./exabr ib-register ib02.example.com --dry-run 
./exabr ib-register ib02.example.com

In the first example, ExaBR displays what operations will be run without saving the changes because the ib-register command is run with the --dry-run option.
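
The two commands above can be chained so that the real registration only runs after the dry-run output has been reviewed. This is a minimal sketch: the function name register_switch_guids, the EXABR variable and the confirmation prompt are illustrative, and it assumes that the dry run exits with status 0 on success.

```shell
# Sketch only: show the dry run first, then register after confirmation.
# EXABR may point at the exabr executable; it defaults to ./exabr.
register_switch_guids() (
    host=$1
    exabr=${EXABR:-./exabr}
    # Stop here if the dry run itself reports a failure.
    "$exabr" ib-register "$host" --dry-run || exit 1
    printf 'Apply these changes to %s? [y/N] ' "$host"
    read -r answer
    case $answer in
        y|Y) "$exabr" ib-register "$host" ;;
        *)   echo "Nothing was changed."; exit 0 ;;
    esac
)

# Example:
#   register_switch_guids ib02.example.com
```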

13. Validate the status of VNICs and VLANs

Once the exabr ib-register command has been executed and the port GUIDs of the newly replaced switch have been added to the EoIB partitions, validate the status of the VNICs and VLANs to make sure they are up, using the commands below:

showvnics 
showvlan 
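
A quick way to confirm that no VNIC is still waiting is to count the WAIT-VHUB entries in the showvnics output. The helper below is a sketch that only assumes the state name WAIT-VHUB appears literally on the line of each affected VNIC; the function name count_waiting_vnics is illustrative.

```shell
# Sketch only: read showvnics output on stdin and print how many lines
# mention the WAIT-VHUB state.
count_waiting_vnics() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the function's exit status at 0 in that case.
    grep -c 'WAIT-VHUB' || true
}

# Example, on the replacement switch:
#   showvnics | count_waiting_vnics
```

A result of 0 after step 12 is consistent with the port GUIDs having been added to the partitions.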

14. Additional restoration steps for Virtual & Hybrid racks with EMOC

For Virtual & Hybrid racks, follow the additional restoration steps 6 and 7 listed in section "3.3.3 Replacing InfiniBand Switches in a Virtual Environment" of the documentation below. Step 6 of that section can be skipped if the passwords for the replacement switch were set to their previous values as described in step 5 above.

http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR173 

Final checkup and verification

a. Check/set firewall rule settings on port 623

If a firewall rule on port 623 was previously present, reinstate it. Refer to the procedure in the following Note:

<Note 2023539.1>: IB Switch Messages Wrapping with "Possible SYN Flooding On Port 623"

b. Check the opensm status and smpriorities on all switches in the IB fabric

Check the opensm status and smpriorities on all switches in the IB fabric. Refer to following MOS Note which has information on recommended opensm status and sm priorities for the Switches.

<Note 1682501.1>: Setting up the subnet manager in a multirack configuration containing Exalogic/BDA and Exadata/SSC/Expansion Rack 

c. Check network/fabric is operating normally

Validate if everything is working normally including status of the vnics in the host Compute Nodes/vServers, IB networks on Compute Nodes/vServers, interface/bonding status in Compute Nodes/vServers.

NOTE: Stopping and starting an existing vServer from EMOC is a good test to make sure that the IB switch replacement was successful. When a vServer is stopped and started from EMOC, EMOC deletes and recreates its VNICs on the switches. This test confirms that VNICs are recreated on the new replacement switch as expected by EMOC and that EMOC has discovered the new switch properly.

d. Take a fresh Exalogic Control vServers backup using Exabr.

Immediately take a fresh Backup of Exalogic Control vServers using exabr. Exabr backup of the Control vServers takes backup of the IB Switches as well.

e. Collect fresh full exalogs from the rack after Switch replacement.

Upon completion of all the steps above, collect a fresh set of full exalogs from the rack. This data is useful for investigating the root cause of any problem that may occur as a result of the switch replacement.

_____________________________________________________________________________________________________

KNOWN ISSUES WHICH CAN BE ENCOUNTERED DURING SWITCH REPLACEMENT

Exalogic Exabr ib-register Command To Register New Replaced Infiniband Switch Port GUIDs Fails With "Unable to get rpc version on some nodes in the fabric" Error

Refer to <Note 2308204.1> for details on this known issue.

Exalogic: "smpartition start" & "exabr ib-register" Commands Failing With "cli commit is in progress" Error

Refer to <Note 2356168.1> for details on this known issue.

 

References

<NOTE:2223662.1> - Master Note For Exalogic Infiniband Switch Replacement – Overview and guide to key articles
<BUG:16926597> - SIQ: EXABR IB GW RESTORE RESULTS IN ERROR IF PARTIALLY SUCCESSFULY LOAD_URI
<NOTE:2218443.1> - How to Prepare an Exalogic Infiniband Switch for Replacement (Pre-checks & Backup)
<NOTE:2308204.1> - Exalogic Exabr ib-register Command To Register New Replaced Infiniband Switch Port GUIDs Fails With "Unable to get rpc version on some nodes in the fabric" Error
<NOTE:2356168.1> - Exalogic: "smpartition start" & "exabr ib-register" Commands Failing With "cli commit is in progress" Error
<NOTE:2211261.1> - How to Prepare an Exalogic Infiniband (IB) Fabric for Planned Outage of an IB Switch
<BUG:25661713> - EXABR NOT RESTORING CONFIGURATION OF IB SWITCHES WITH FIRMWARE 2.2.2.X

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.