Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

2125203.1 : Infiniband Switch Replacement - Follow-up Actions
Applies to:
Exadata Database Machine V2 - Version All Versions and later
Sun Datacenter InfiniBand Switch 36 - Version All Versions to All Versions [Release All Releases]
Sun Network QDR InfiniBand Gateway Switch - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Goal
Provide follow-up reconfiguration steps to be performed after Infiniband (IB) switch replacement.

This document is intended to be used immediately after the physical replacement steps have been completed, as detailed in the How to Replace documents. The How to Prepare document must also have been followed beforehand, to prepare the plan for this stage.

Refer to: Infiniband Switch Replacement – Overview and guide to key articles (Doc ID 2125242.1)
Note: For IB switches within an Exalogic system, use Doc ID 2218689.1 instead of this document.
Solution

Introduction
After an IB switch replacement, the following steps restore the customer-specific configuration of smnodes, IB partitions, and vNICs (if any), using previously taken backup snapshot images if available.

These steps need to be performed by the Customer-admin, or under the close supervision of the Customer-admin. In the case of an Engineered System, these steps may be performed with the assistance of your Engineered Systems Support Engineer. Some of these steps require root access to the IB switches running SM Master.

Any minor error in these steps can lead to an outage of nodes/servers if the replacement is being done on a live production environment, so caution and care must be exercised at all times.

A. State that the Switch must be in

1. After an IB switch replacement following the Oracle Support "How to Replace" documents, the switch will have had Subnet Manager disabled (# disablesm). If there is any doubt about this, then before proceeding to the next steps, run the following command on the replaced switch:

# disablesm

2. Check the setting of controlled_handover on the switch running as the current Master. Log in to that switch and run the following command:

# setsmpriority list

If it is not TRUE, it is recommended to perform the additional steps in this document during a downtime window, to avoid any problem that may occur as a result of the Master moving while the steps are carried out.
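The controlled_handover check in step A.2 can be sketched as a small helper that reads pasted `setsmpriority list` output. This is only an illustration: the exact output layout is an assumption (a line containing `controlled_handover` followed by `TRUE` or `FALSE`), and the function name is hypothetical. Always confirm against what the switch actually prints.

```python
import re


def controlled_handover_enabled(setsmpriority_output):
    """Return True/False for the controlled_handover setting, or None if not found.

    Assumption: the pasted `setsmpriority list` output contains the word
    `controlled_handover` followed (after spaces, ':' or '=') by TRUE or FALSE.
    """
    match = re.search(
        r"controlled_handover\W*(TRUE|FALSE)",
        setsmpriority_output,
        re.IGNORECASE,
    )
    if match is None:
        return None  # setting not visible in the text; check manually
    return match.group(1).upper() == "TRUE"


if __name__ == "__main__":
    # Sample text is illustrative only, not captured from a real switch.
    sample = "smpriority 5\ncontrolled_handover TRUE\n"
    if controlled_handover_enabled(sample):
        print("controlled_handover is TRUE")
    else:
        print("Not TRUE: plan the remaining steps for a downtime window")
```

A `None` result means the setting could not be read from the text, in which case the safe course is to treat it as not TRUE and verify by hand.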
B. For Exalogic systems, refer to Doc ID 2218689.1

When the replaced IB switch is part of an Exalogic system, follow "Exalogic Infiniband Switch Replacement - Follow-up Actions (Restoration) (Doc ID 2218689.1)".
C. Restoration using ILOM backup or where no valid backup is available

For systems other than Exalogic, or for Exalogic where no valid ExaBR backup image is available:

1. Restore configuration using ILOM backup (where available)

If you have a valid IB Switch (ILOM) backup, restore it onto the replacement switch. Refer to the relevant IB Switch Firmware Product Guide sections (the following sections are for firmware 2.1):

For Infiniband Switch 36 (nm2-36P): "Restore the Configuration (CLI)" or "Restore the Configuration (Web)"
For Infiniband Gateway Switch (nm2-GW): "Restore the Configuration (CLI)"

Whether or not you have successfully restored using the ILOM backup, proceed with the following steps in this section.
2. Check/update smnodes list on this replacement IB switch

Run the following on this switch and compare the output with that of the current SM master:

# smnodes list

If the list is empty, or does not match the output on the current Master, make it identical using the smnodes command:

# smnodes add <ip_address> ... <ip_address>

To delete an IP address, use:

# smnodes delete <ip_address> ...
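The comparison in this step can be sketched offline as follows. The helper takes the pasted `smnodes list` output from the current Master and from the replacement switch, and prints the `smnodes add` / `smnodes delete` commands that would make the replacement match. The function names are hypothetical, and the parsing assumes the output contains plain IPv4 addresses; the sample addresses are made up.

```python
import re

# Assumption: smnodes entries appear as dotted IPv4 addresses in the output.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def parse_smnodes(output):
    """Return the IPv4 addresses found in pasted `smnodes list` output, in order."""
    seen = []
    for ip in IPV4.findall(output):
        if ip not in seen:
            seen.append(ip)
    return seen


def smnodes_fixups(master_output, replacement_output):
    """Build the smnodes commands to run on the replacement switch."""
    master = parse_smnodes(master_output)
    replacement = parse_smnodes(replacement_output)
    to_add = [ip for ip in master if ip not in replacement]
    to_delete = [ip for ip in replacement if ip not in master]
    commands = []
    if to_add:
        commands.append("smnodes add " + " ".join(to_add))
    if to_delete:
        commands.append("smnodes delete " + " ".join(to_delete))
    return commands


if __name__ == "__main__":
    # Illustrative addresses only.
    master_list = "10.0.0.1\n10.0.0.2\n"
    replacement_list = "10.0.0.2\n10.0.0.9\n"
    for command in smnodes_fixups(master_list, replacement_list):
        print("# " + command)
```

An empty result means the two lists already contain the same addresses. Review the generated commands before running them on the switch.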
3. Set smpriority and enablesm on this switch

If controlled_handover is TRUE on the current Master, set the smpriority of this replacement IB switch to the value suggested in the install documentation. If this is Exalogic or a multirack containing Exalogic, refer to Doc ID 1682501.1; otherwise refer to "Understanding the Network Subnet Manager Master". Set smpriority on this switch as follows:

# setsmpriority <priority>

If controlled_handover on the current Master is FALSE, it is recommended to set the smpriority of this switch to a lower value so that the Master will not move while this switch is being configured. The actual value will have to be restored in step E.2 later.

# setsmpriority 1

Now, enable SM on this replacement switch:

# enablesm
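The branching in this step can be summarized in a short sketch. The function name and the example priority value are hypothetical; the real priority must come from the install documentation for your system.

```python
def plan_sm_enable(controlled_handover, documented_priority):
    """Return (commands for the replacement switch, restore_in_step_E2 flag).

    controlled_handover: setting observed on the current Master (True/False).
    documented_priority: value taken from the install documentation.
    """
    if controlled_handover:
        # Master will not move on its own, so the final priority can be set now.
        priority = documented_priority
        restore_later = False
    else:
        # Temporary low priority so the Master does not move during configuration.
        priority = 1
        restore_later = True
    return [f"setsmpriority {priority}", "enablesm"], restore_later


if __name__ == "__main__":
    # 5 is a placeholder, not a recommended value.
    commands, restore_later = plan_sm_enable(False, 5)
    for command in commands:
        print("# " + command)
    if restore_later:
        print("Reminder: restore the documented smpriority in step E.2")
```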
4. For NM2-GW replacement switch only, list the GUIDs of the four bridges

Run the following command on the replacement switch to find the GUIDs of the four bridges (you will need this information in the next step):

# showgwports
5. Propagate IB Partitions from the running SM master (and set GUIDs if not already done)

Log in to the IB switch currently running the Master Subnet Manager, and run the following:

# smpartition start

If the replacement switch is nm2-GW and either is not in Exalogic, or was not successfully restored using ExaBR, then manually add the GUIDs from the previous step:

# smpartition add -pkey <PKey> -port <port GUID> <port GUID> <port GUID> <port GUID> -m full

Note: <port GUID> are the four bridge GUIDs shown in the output of showgwports on the new switch (as found in the previous step).

Repeat the above command for all the pkeys other than the default and 0x0001.

These steps (performed, as noted, on the switch currently running the Master Subnet Manager) ensure that the GUIDs of all the bridges of the new switch are added to all the partitions in this IB network.

Note: The manual addition of PKey/GUID for a replacement nm2-GW is not needed if this is an Exalogic and the replacement switch has been restored successfully using ExaBR.
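Repeating the `smpartition add` command for every partition is easy to get wrong by hand, so the command lines can be generated offline, as in the sketch below. The function name is hypothetical, the GUIDs and pkeys shown are placeholders, and the pkey of the default partition must be supplied by the caller (the value `0x7fff` in the example is an assumption to verify against your own partition list).

```python
def smpartition_add_commands(pkeys, bridge_guids, excluded_pkeys):
    """Build one `smpartition add` line per pkey that is not excluded.

    pkeys: pkeys of the partitions in this fabric, as hex strings.
    bridge_guids: the four bridge GUIDs reported by showgwports.
    excluded_pkeys: the default partition's pkey and 0x0001, as hex strings.
    """
    if len(bridge_guids) != 4:
        raise ValueError("expected the four bridge GUIDs from showgwports")
    excluded = {int(key, 16) for key in excluded_pkeys}
    ports = " ".join(bridge_guids)
    commands = []
    for pkey in pkeys:
        if int(pkey, 16) in excluded:
            continue  # skip the default partition and 0x0001
        commands.append(f"smpartition add -pkey {pkey} -port {ports} -m full")
    return commands


if __name__ == "__main__":
    # Placeholder values for illustration only.
    guids = [
        "0x0000000000000001",
        "0x0000000000000002",
        "0x0000000000000003",
        "0x0000000000000004",
    ]
    fabric_pkeys = ["0x7fff", "0x0001", "0x0005", "0x0006"]
    for command in smpartition_add_commands(fabric_pkeys, guids, ["0x7fff", "0x0001"]):
        print("# " + command)
```

The generated lines are meant to be run on the switch currently running the Master Subnet Manager, inside the session opened by `smpartition start`.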
6. Check/propagate secret M-Key policy from the running SM master

On the switch running as the current Master, check whether the secret M-Key policy is in use by running:

# smsubnetprotection list active

Only if the output shows secret M-Keys, run the following on this Master switch:

# smsubnetprotection start

This makes sure that the secret M-Key policy (if used) is propagated to all switches listed in the smnodes list. Prior to commit, ensure that all IB switches participating in this secret M-Key replication have the identical replication password in /conf/mkey_password.
D. Additional actions when no valid backup is available

If the customer did not have a valid backup at all, the Customer-admin will need to configure the replacement switch manually at this time.

This step is only required if the configuration backup was incomplete or out of date, the restore was unsuccessful, and/or the configuration is being copied from another Gateway switch in the same rack. In these situations, further work by the Customer-admin is required if vNICs or any other customized configuration are set up on these switches:
E. Final checkup and verification
1. Check/set firewall rule settings on port 623

If a firewall rule on port 623 was previously present, reinstate it. Refer to the procedure in this document: IB Switch Messages Wrapping with "Possible SYN Flooding On Port 623" (Doc ID 2023539.1).
2. Check the opensm status and smpriorities on all switches in the IB fabric

Run the following command to check whether opensm is running:

Make sure that the smpriority and controlled_handover settings of all the switches running opensm in this IB fabric match the standard configuration of your engineered system, and that opensm is running on the switches specified by that configuration:

- If this is a rack or multirack containing Exalogic, refer to "Setting up the subnet manager in a multirack configuration containing Exalogic/BDA and Exadata/SSC/Expansion Rack (Doc ID 1682501.1)".
- If this is a rack or multirack consisting of Exadata and/or SuperCluster only, refer to "Understanding the Network Subnet Manager Master" in the Oracle Exadata Database Machine Owner's Guide.
3. Check network/fabric is operating normally

The Customer-admin should now check that everything is working normally, including the status of the vNICs in the host nodes and the interface/bonding status in nodes, vServers, LDOMs, or other VMs.
4. Take a fresh switch configuration backup

Immediately take a fresh backup of the replaced switch with the restored configuration, using at minimum the ILOM backup (all platforms) and, if Exalogic, optionally the ExaBR backup as well.
5. Take a snapshot of key diags; engage Oracle support if any problems

Upon completion of all the steps above, collect the following set of data and upload it to the Service Request (SR). This data will be useful for investigating the root cause of any problem that may occur as a result of any planned outage.

a). Collect the following data from all IB switches in this IB fabric (if multirack, all switches in the entire multirack):
# version
# spsh

b). Copy the following file from all IB switches running opensm:
/conf/partitions.current

c). Copy the following file from the switch currently running the Master Subnet Manager:
/var/log/whereismaster.log

d). Collect the following data from the switch currently running the Master Subnet Manager:
# smpartition list active

e). Collect the following data from any one of the IB leaf switches:
# ibnetdiscover
After running this command, collect all the /tmp/ibdiagnet* files it creates.

f). If there are IB Gateway switches in this IB fabric (for example, Exalogic), collect the following data from all IB Gateway switches:
# showgwports

g). Collect an ILOM snapshot of this switch:
For Infiniband Switch 36 (nm2-36P): "Create a Snapshot of the Switch State (CLI)"
For Infiniband Gateway Switch (nm2-GW): "Create a Snapshot of the Gateway State (CLI)"
h). If Enterprise Manager is configured, please refer to:
References
<NOTE:2125242.1> - Infiniband Switch Replacement – Overview and guide to key articles
<NOTE:1341658.1> - How to Replace a Failed Sun Datacenter InfiniBand Switch 36
<NOTE:1383773.1> - How to Replace a Failed Sun Network QDR InfiniBand Gateway Switch
<NOTE:2140928.1> - How to Prepare an Infiniband (IB) Fabric for Planned Outage of an IB Switch
<NOTE:1636229.1> - How to Prepare an Infiniband Switch for Replacement