Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2125203.1
Update Date:2018-04-08
Keywords:

Solution Type  Technical Instruction Sure

Solution  2125203.1 :   Infiniband Switch Replacement - Follow-up Actions  


Related Items
  • Sun Datacenter InfiniBand Switch 36
  • Exadata Database Machine V2
  • Sun Network QDR InfiniBand Gateway Switch
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Sun Network Infiniband




In this Document
Goal
Solution
 Introduction
 A. State that the Switch must be in
 B. For Exalogic Systems, refer to Doc ID 2218689.1
 C. Restoration using ILOM backup or where no valid backup is available
 1. Restore configuration using ILOM backup (where available)
 2. Check/update smnodes list on this replacement IB switch
 3. Set smpriority  and enablesm on this switch
 4. For NM2-GW replacement switch only, list the GUIDs of the four bridges
 5.  Propagate IB Partitions from the running SM master (and set GUIDs if not already done)
 6.  Check/propagate secret M-Key policy from the running SM master.
 D.  Additional actions when no valid backup is available
 E.  Final checkup and verification
 1. Check/set firewall rule settings on port 623
 2. Check the opensm status and smpriorities on all switches in the IB fabric
 3. Check network/fabric is operating normally
 4. Take a fresh switch configuration backup
 5. Take a snapshot of key diags;  engage Oracle support if any problems
References


Applies to:

Exadata Database Machine V2 - Version All Versions and later
Sun Datacenter InfiniBand Switch 36 - Version All Versions to All Versions [Release All Releases]
Sun Network QDR InfiniBand Gateway Switch - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Goal

Provide follow-up reconfiguration steps to be performed after Infiniband (IB) switch replacement.

This document is intended to be used immediately after the physical replacement steps have been completed, as detailed in the How to Replace documents.  The How to Prepare document must also have been followed beforehand, to prepare the plan for this stage.  Refer to:  Infiniband Switch Replacement – Overview and guide to key articles (Doc ID 2125242.1)

 

Note: For IB switches within an Exalogic system, use Doc ID 2218689.1 instead of this document.

 

Solution

Introduction

After an IB switch replacement, the following steps restore the customer-specific configuration of smnodes, IB partitions, and vNICs (if any), using previously taken backup images where available.   These steps must be performed by the Customer-admin, or under the close supervision of the Customer-admin.  In the case of an Engineered System, these steps may be performed with the assistance of your Engineered Systems Support Engineer.

Some of these steps require root access to the IB switch running the SM Master.  If the replacement is being done in a live production environment, even minor errors in these steps can cause an outage of nodes/servers, so exercise caution and care at all times.

A. State that the Switch must be in

1. After an IB Switch replacement following the Oracle Support "How to Replace" documents, the Switch will have had Subnet Manager disabled (# disablesm ).  If there is any doubt about this, then before proceeding to the next steps, run the following command on the replaced switch:

         # disablesm  

 2. Check the setting of controlled_handover on the switch running as the current Master.

     Login to the switch running as the Master and run the following command to check the setting of controlled_handover.

           #setsmpriority list

     If it is not TRUE, it is recommended to perform the remaining steps in this document during a downtime window, to avoid any problems that could occur if the Master moves to another switch while the steps are being carried out.
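
     The check above can be scripted. The following is a minimal sketch, not an Oracle-supplied tool: it reads the text of the setsmpriority list output on standard input and succeeds only if a line naming controlled_handover also contains TRUE. The function name handover_is_true is hypothetical, and the assumed line layout (setting name followed by its value on one line) should be confirmed against the output of your firmware.

```shell
# Sketch only: succeed if the `setsmpriority list` text on stdin reports
# controlled_handover as TRUE. The "name value" line layout is an assumption.
handover_is_true() {
    # First grep keeps lines naming the setting (with or without underscore);
    # second grep checks that such a line also says TRUE.
    grep -i 'controlled_*handover' | grep -qi 'true'
}

# Hypothetical usage on the switch currently running as Master:
#   setsmpriority list | handover_is_true \
#       && echo "controlled_handover is TRUE" \
#       || echo "controlled_handover is not TRUE: plan a downtime window"
```

     The function only reads text, so it can also be run against a saved copy of the command output.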

 

B. For Exalogic Systems, refer to Doc ID 2218689.1

   When the replaced IB Switch is part of an Exalogic system,  follow "Exalogic Infiniband Switch Replacement - Follow-up Actions (Restoration) (Doc ID 2218689.1)"

 

C. Restoration using ILOM backup or where no valid backup is available

   For systems other than Exalogic, or for Exalogic where no valid ExaBR backup image is available:

1. Restore configuration using ILOM backup (where available)

   If you have a valid IB Switch (ILOM) Backup, restore the backup onto the replacement Switch as follows:

Refer to the relevant IB Switch Firmware Product Guide sections (the following links are for firmware 2.1):

  For Infiniband Switch 36 (nm2-36P):

   "Restore the Configuration (CLI)"

or

  "Restore the Configuration (Web)"

 

For Infiniband Gateway Switch (nm2-GW):

   "Restore the Configuration (CLI)"

or

   "Restore the Configuration (Web)".

   Whether or not you have successfully restored using ILOM Backup, proceed with the following steps in this section:

 

2. Check/update smnodes list on this replacement IB switch

        Run the following command on this switch and compare its output with the output of the same command on the current SM Master.

             #smnodes list

        If the list is empty, or does not match the output on the current Master, make it identical using the smnodes command as follows:

            #smnodes add <ip_address> ... <ip_address>

        Or, you can delete an IP address using the following command:

             #smnodes delete <ip_address> ...
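
        To make the comparison less error-prone, the two lists can be saved to files and compared mechanically. The sketch below is illustrative only: the function name smnodes_diff and the file names are hypothetical, and it assumes each saved file contains one IP address per line. It prints which addresses would need to be added to, or deleted from, the replacement switch; it does not run smnodes itself.

```shell
# Sketch only: compare a saved `smnodes list` from the Master (arg 1)
# with one from the replacement switch (arg 2).
# Assumes one IP address per line; blank lines and stray spaces are ignored.
smnodes_diff() {
    master_file=$1
    replacement_file=$2
    tmp_m=$(mktemp) && tmp_r=$(mktemp) || return 2
    tr -d ' \t\r' < "$master_file" | sed '/^$/d' | sort -u > "$tmp_m"
    tr -d ' \t\r' < "$replacement_file" | sed '/^$/d' | sort -u > "$tmp_r"
    # comm -23: entries only on the Master (candidates for `smnodes add`)
    comm -23 "$tmp_m" "$tmp_r" | sed 's/^/add /'
    # comm -13: entries only on the replacement (candidates for `smnodes delete`)
    comm -13 "$tmp_m" "$tmp_r" | sed 's/^/delete /'
    rm -f "$tmp_m" "$tmp_r"
}

# Hypothetical usage, after saving each switch's output to a file:
#   smnodes_diff master_smnodes.txt replacement_smnodes.txt
```

        Empty output means the two lists already match.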

 

3. Set smpriority  and enablesm on this switch

     If controlled_handover is TRUE on the current Master, set the smpriority of this replacement IB switch to the value suggested in the install documentation.   If this is Exalogic or a multirack containing Exalogic, refer to Doc ID 1682501.1; otherwise refer to "Understanding the Network Subnet Manager Master".   Set smpriority on this switch as follows:

             #setsmpriority <priority>

     If controlled_handover on the current Master is FALSE, it is recommended to set the smpriority of this switch to a lower value so that the Master will not move while this switch is being configured.  The intended value will have to be restored later, in step E.2.

            #setsmpriority 1

     Now, enable SM on this replacement switch:

             #enablesm

 

4. For NM2-GW replacement switch only, list the GUIDs of the four bridges

     Run the following command on the replacement switch to find the GUIDs of the four bridges (you will need this information in the next step):

            #showgwports

 

5.  Propagate IB Partitions from the running SM master (and set GUIDs if not already done)

      Login to the IB switch currently running the Master Subnet Manager, and do the following:

           # smpartition start

           If the replacement switch is an nm2-GW and either is not part of an Exalogic system or was not successfully restored using ExaBR, then manually add the GUIDs from the previous step:

                  # smpartition add -pkey <PKey> -port <port GUID> <port GUID> <port GUID> <port GUID> -m full

                          Note: <port GUID> are the four GUIDs of the bridges that you see in the output of showgwports on the new switch (as found out in previous step)

                      Repeat the above command for all the pkeys other than the default and 0x0001

                  These steps (performed as noted on the switch currently running the Master Subnet Manager) ensure that the GUIDs of all the bridges of the new switch are added to all the partitions in this IB network.

                  Note, the above manual addition of PKey/GUID for replacement nm2-GW is not needed if this is an Exalogic and the replacement switch has been restored successfully using ExaBR.


           #smpartition commit


           The above steps ensure that IB partitions are propagated to all IB switches running opensm.
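
           Where several partitions are involved, typing one smpartition add line per PKey by hand invites mistakes. The sketch below only prints the command lines for review; it does not execute them. The function name print_partition_adds is hypothetical, and the GUIDs and PKeys are whatever the caller supplies (the four bridge GUIDs from showgwports, and every PKey other than the default and 0x0001). The printed lines would still have to be run on the Master between smpartition start and smpartition commit.

```shell
# Sketch only: print (do not run) one `smpartition add` line per PKey.
# Arguments: the four bridge GUIDs, followed by one or more PKeys.
print_partition_adds() {
    if [ "$#" -lt 5 ]; then
        echo "usage: print_partition_adds GUID1 GUID2 GUID3 GUID4 PKEY..." >&2
        return 1
    fi
    guids="$1 $2 $3 $4"
    shift 4
    for pkey in "$@"; do
        echo "smpartition add -pkey $pkey -port $guids -m full"
    done
}

# Hypothetical usage (placeholders, not real values):
#   print_partition_adds <guid1> <guid2> <guid3> <guid4> <pkey1> <pkey2>
```

           Printing rather than executing lets the Customer-admin check every line before it is entered on the Master.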

 

6.  Check/propagate secret M-Key policy from the running SM master.

      On the switch running as the current Master, check if secret M-Key policy is in use.  To check that, run the following command on the current Master switch:

           #smsubnetprotection list active

         Only if the output above shows secret M-keys, run the following commands on this Master switch:

                  #smsubnetprotection start
                  #smsubnetprotection commit

          This will make sure that secret M-Keys policy (if used) is propagated to all switches listed in the smnodes list.

          Before the commit, ensure that all IB switches participating in this secret M-Key replication have an identical replication password in /conf/mkey_password.

 

D.  Additional actions when no valid backup is available

       If no valid backup was available, the Customer-admin will need to configure the replacement switch manually at this time.  This section is only required if the configuration backup was incomplete or out of date, if the restore was unsuccessful, and/or if the configuration is being copied from another Gateway switch in the same rack.  In these situations, further work by the Customer-admin is required if vNICs or any other customized configuration are set up on these switches:

  1. By following section C above (from step C.2 onward), the customer may already have recovered the basic Infiniband fabric configuration, such as the smnodes list and smpartition, by copying and/or propagating it from a nearby running Master switch or from local install documentation.

  2. Additional configurations may need to be manually restored based on your local install documentation, including Domain Name Service (DNS), SNMP and gwinstance (for NM2-GW)

E.  Final checkup and verification

 

1. Check/set firewall rule settings on port 623

     If a firewall rule on port 623 was previously present, reinstate it.  Refer to the procedure in this document:  IB Switch Messages Wrapping with "Possible SYN Flooding On Port 623" (Doc ID 2023539.1).

 

2. Check the opensm status and smpriorities on all switches in the IB fabric

Run the following command to check whether opensm is running:


#service opensmd status


The following command will display smpriority and ControlledHandover:


#setsmpriority list

Make sure that the smpriorities and controlledhandover of all the switches running opensm in this IB Fabric are as per the standard configuration of your engineered system, and that opensm is running on the switches as per the standard configurations:

- If this is a rack or multirack containing Exalogic, refer to "Setting up the subnet manager in a multirack configuration containing Exalogic/BDA and Exadata/SSC/Expansion Rack (Doc ID 1682501.1)"

- If this is a rack or multirack consisting of Exadata and/or SuperCluster only, refer to "Understanding the Network Subnet Manager Master" in Oracle Exadata Database Machine Owner's Guide.

- If this is a custom multi-IB-switch configuration, check your ISV's install documentation.

 

3. Check network/fabric is operating normally

        Customer-admin should now check if everything is working normally including status of the vnics in the host nodes, interface/bonding status in nodes, vservers, LDOMs or other VMs.

 

4. Take a fresh switch configuration backup

        Immediately take a fresh backup of the replaced switch with its restored configuration, using at minimum the ILOM backup (all platforms) and, for Exalogic, optionally the ExaBR backup as well.

 

 

5. Take a snapshot of key diags;  engage Oracle support if any problems

Upon completion of all the steps above, collect the following set of data and upload it to the Service Request (SR). This data will be useful for investigating the root cause of any problem that may occur as a result of a planned outage.

       a). Collect the following data from all IB switches in this IB fabric (if multirack, all switches in the entire multirack)

             #version
             #listlinkup
             #service opensmd status
             #setsmpriority list
             #smnodes list
             #ifconfig eth0
             #md5sum /conf/partitions.current

             #spsh
                 -> ls /SP/network
             #exit

       b). Copy the following file from all IB switches running opensm

             /conf/partitions.current

       c). Copy the following files from the switch currently running the Master Subnet Manager

             /var/log/whereismaster.log
             /var/log/opensm.log

       d). Collect the following data from the switch currently running the Master Subnet Manager

             #smpartition list active

       e). Collect the following data from any one of the IB leaf switches

             #ibnetdiscover
             #sminfo
             #getmaster -l
             #ibdiagnet -skip dup_guids -pm -P all=1

                 After running this command, collect all the files it creates matching /tmp/ibdiagnet*
                 Example:
                      cd /tmp
                      tar cvf ibdiagnet.tar ibdiagnet*

       f). If there are IB-Gateway switches in this IB fabric (for example Exalogic), collect the following data from all IB-Gateway switches.

             #showgwports
             #showvlan
             #showvnics
             #showioadapters

       g). Collect ILOM snapshot of this switch

                 For Infiniband Switch 36 (nm2-36P):

                        "Create a Snapshot of the Switch State (CLI)"
                      or
                        "Create a Snapshot of the Switch State (Web)"

                 For Infiniband Gateway Switch (nm2-GW):

                        "Create a Snapshot of the Gateway State (CLI)"
                      or
                        "Create a Snapshot of the Gateway State (Web)".

 

       h).  If Enterprise Manager is configured, please refer to:
               Steps to Perform in Enterprise Manager When Replacing an InfiniBand Switch (Doc ID 2055236.1)
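
Before uploading, one plausible use of the partitions.current copies gathered in b) is a quick consistency check, on the assumption that the partition configuration is expected to be identical on every switch running opensm once it has been propagated. The sketch below is illustrative only: the function name all_same_md5 and the local file names are hypothetical. It succeeds only when every file given has the same MD5 checksum.

```shell
# Sketch only: succeed if all the given files have the same MD5 checksum.
# Intended for local copies of /conf/partitions.current, one per switch.
all_same_md5() {
    [ "$#" -ge 1 ] || return 2
    # md5sum prints "<hash>  <file>"; keep the hash and count distinct values.
    distinct=$(md5sum "$@" | awk '{print $1}' | sort -u | wc -l)
    [ "$distinct" -eq 1 ]
}

# Hypothetical usage with copies renamed per switch:
#   all_same_md5 sw1_partitions.current sw2_partitions.current \
#       && echo "partition files match" \
#       || echo "partition files differ: review before closing the SR"
```

A mismatch is not necessarily a fault, but it is worth noting in the Service Request.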

 

References

<NOTE:2125242.1> - Infiniband Switch Replacement – Overview and guide to key articles
<NOTE:1341658.1> - How to Replace a Failed Sun Datacenter InfiniBand Switch 36
<NOTE:1383773.1> - How to Replace a Failed Sun Network QDR InfiniBand Gateway Switch
<NOTE:2140928.1> - How to Prepare an Infiniband (IB) Fabric for Planned Outage of an IB Switch
<NOTE:1636229.1> - How to Prepare an Infiniband Switch for Replacement

  Copyright © 2018 Oracle, Inc.  All rights reserved.