Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Predictive Self-Healing Sure
Solution 2218689.1: Exalogic Infiniband Switch Replacement - Follow-up Actions (Restoration)
Applies to:
Oracle Exalogic Elastic Cloud Software - Version 2.0.0.0.0 and later
Exalogic Elastic Cloud X3-2 Hardware - Version X3 to X3 [Release X3]
Linux x86-64
Oracle Solaris on x86-64 (64-bit)
Oracle Virtual Server x86-64

Purpose
This Note provides follow-up reconfiguration steps for restoring the Exalogic Infiniband (IB) switch configuration after replacement. This document is intended to be used immediately after the physical replacement steps have been completed. Before a customer runs these procedures, an Oracle Field Engineer should have completed replacing the switch by following the two Canned Action Plan documents below.

<Note 1383773.1>: How to Replace a Failed Sun Network QDR InfiniBand Gateway Switch
<Note 1341658.1>: How to Replace a Failed Sun Datacenter InfiniBand Switch 36
Details
After an IB switch replacement, the following steps restore the customer-specific configuration of smnodes, IB partitions, and vNICs (if any), using previously taken backups if available. These steps must be performed by the customer administrator, or under the close supervision of the customer administrator. Some of these steps require root access to the IB switch running the SM Master. If the replacement is being done in a live production environment, even a minor error in these steps can cause an outage of nodes or servers, so exercise caution and care at all times.

1. Validate the Firmware version on the Newly Replaced Switch
The replaced switch must run the same firmware version as the other working Exalogic IB switches on the fabric. Check the firmware version using the "version" command.

2. Validate whether Subnet Manager is disabled on the newly replaced Switch
On the new replacement switch, make sure that the subnet manager is disabled by running the following command:

service opensmd status
If the subnet manager is running, disable it by running the "disablesm" command on the switch.

3. Validate SM controlled_handover on the current switch running the Master in the Fabric
Check the setting of controlled_handover on the switch running as the current Master. Log in to that switch and run the following command. controlled_handover should be set to TRUE.

setsmpriority list
4. Validate that the physical installation of the new switch into the fabric was completed successfully
Run the "ibnetdiscover" and "ibswitches" commands. The newly replaced switch should appear in the "ibswitches" command output. The "ibnetdiscover" command output should show the newly replaced switch connected to all the Compute Nodes and Storage heads in the rack.

5. Change the passwords of the root and ilom-admin users on the replacement switch to their previous values
Steps for changing the password for "root" user
Steps for changing the password for "ilom-admin" user
6. Update the smnodes list on the replaced switch with the IP addresses of all the Exalogic switches running the subnet manager
To do this, use the following command:

smnodes add IP_Address_of_Switch
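Where several switches run the subnet manager, the command from step 6 can be repeated in a small loop. The following is only a sketch under stated assumptions: it passes one address per invocation, as the command is shown in this Note; the SMNODES variable is a hypothetical override added here so the loop can be rehearsed against a stub; and it assumes "smnodes add" exits non-zero on failure, which this Note does not confirm.

```shell
#!/bin/sh
# Sketch: add each subnet-manager switch IP to the smnodes list.
# SMNODES is a hypothetical override for the command path (useful for a
# rehearsal with a stub); on the switch it defaults to "smnodes".
# Assumption: "smnodes add" returns a non-zero status when it fails.
add_smnodes() {
    smnodes_cmd=${SMNODES:-smnodes}
    for ip in "$@"; do
        echo "Adding $ip to smnodes list"
        "$smnodes_cmd" add "$ip" || {
            echo "smnodes add failed for $ip" >&2
            return 1
        }
    done
}
```

A call would look like `add_smnodes 192.0.2.11 192.0.2.12`, where the addresses are placeholders for the real switch IP addresses. The loop stops at the first failure so that a partially updated list is noticed before continuing.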
7. Set SM Priority To Recommended Values on the replaced switch
Set the SM priority and controlled handover settings on the newly replaced switch to the recommended values described in the following MOS Note.

IMPORTANT NOTE: Do not enable the SM using the "enablesm" command after configuring the SM settings. The SM must be enabled only after the switch restoration is completed.

<Note 1682501.1>: Setting up the Subnet Manager in a multi-rack cabling configuration containing Exalogic/Big Data Appliance and Exadata/SuperCluster
8. Validate if the ocadmin SNMP community exists and create it if it does not exist
Validate whether the ocadmin SNMP community exists using the following command:

spsh show /SP/services/snmp/communities/ | grep ocadmin
NOTE: In some cases the above command to check the ocadmin community can fail. If that happens, follow the steps below to check the ocadmin community string.
If you do not see the ocadmin community in the above command output, create it using the following procedure.
NOTE: If the ocadmin community does not exist, the ExaBR restore command (./exabr restore hostname) in step 10 may fail with the following error:

ERROR: set /SP/config load_uri=http://10.10.XX.XX/switch.backup
set: Load partially successful, please view the event log

INTERNAL NOTE: The issue happens due to known Bug 16926597 : SIQ: EXABR IB GW RESTORE RESULTS IN ERROR IF PARTIALLY SUCCESSFULY LOAD_URI

9. Copy the /conf/partitions.current file from the Switch running the SM Master to the newly replaced Switch under /conf directory
You can determine the switch running the SM Master by running the following command:

getmaster
10. Restore the Switch configuration from Exabr backups
Restore the switch configuration from ExaBR backups using the steps below.

a. View a list of backups by running the ExaBR list command on the replacement Infiniband Switch

./exabr list <Switch hostname> [options]
Example: ./exabr list ib01.example.com -v
In this example, ExaBR lists the backups for ib01.example.com in detail, because the -v option is used.

b. Use the ExaBR restore command on the newly replaced InfiniBand switch to restore the configuration from exabr backups

./exabr restore <replacement Switch hostname> -b <exabr backup directory>
In the command above, ExaBR restores an InfiniBand switch from the backup directory specified with the -b option. For example, if the switch ExaBR backup directory name is 201308230428 (taken from the exabr list output in step a) and the switch hostname is ib01.example.com, the restore command looks as follows:

./exabr restore ib01.example.com -b 201308230428
IMPORTANT NOTE: If the console output of the "exabr restore" command in step 10 (b) shows the warning message below during the restore process, contact Support via a Service Request. The switch configuration must then be restored manually.

WARNING: ILOM reports this message: 'Load partially successful. Please view the event log'
Switch configuration restored, with warnings. To see the ILOM event-log navigate to 'System Monitoring-->Event Logs' in the ILOM Web user interface

c. Validate if the exabr restore command restored the Switch configuration successfully
Compare the contents of the /conf/bx.conf, /conf/bxm.conf and /etc/opensm/opensm.conf files on the switch with the bx.conf, bxm.conf and opensm.conf files in the exabr backup folder of the replacement switch. If the contents do not match, the exabr restore command did not restore the configuration. Do not proceed with further steps. Contact Support for steps to restore the switch configuration manually.

d. Run the "smpartition start && smpartition commit" command on the Switch running the SM Master to propagate the partitions from the SM Master to the newly replaced Switch

smpartition start && smpartition commit
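The file comparison described in step 10 (c) can be scripted. The sketch below is a hypothetical helper, not part of ExaBR. It assumes the three live files have first been copied from the switch into one staging directory, and that the backup directory holds files with the same names side by side; the actual layout of an ExaBR backup folder may differ.

```shell
#!/bin/sh
# Hypothetical helper: compare staged copies of the switch's live config
# files with the copies held in an ExaBR backup directory.
# Assumption: both directories contain bx.conf, bxm.conf and opensm.conf
# directly; adjust the paths if the real backup layout differs.
compare_switch_conf() {
    live_dir=$1
    backup_dir=$2
    status=0
    for f in bx.conf bxm.conf opensm.conf; do
        if [ ! -f "$live_dir/$f" ] || [ ! -f "$backup_dir/$f" ]; then
            echo "MISSING  $f"
            status=1
        elif cmp -s "$live_dir/$f" "$backup_dir/$f"; then
            echo "MATCH    $f"
        else
            echo "DIFFERS  $f"
            status=1
        fi
    done
    return $status
}
```

The function returns a non-zero status if any file is missing or differs, which corresponds to the "do not proceed, contact Support" outcome in step 10 (c).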
Reference documentation for the ExaBR restore procedure: http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR118

IMPORTANT INTERNAL NOTE TO SUPPORT
Manual Switch Restoration on IB Switches With 2.2.2.X Firmware in case Exabr Restore Command Fails with Decryption Errors (or) Does Not Restore the Switch Configuration Correctly

In some cases, running the ExaBR restore command (Step 4 under section "Using ExaBR - See section 3.3.2 - Recovering InfiniBand Switches" of the http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR118 documentation) to restore the configuration of an IB switch with firmware 2.2.2.X does not succeed. The following symptoms confirm that the switch restoration did not succeed.
If all 3 of the above symptoms are seen, you are running into the ExaBR switch restoration issue with decryption errors. The following bug has been opened for this issue and is being investigated by Development:

BUG 25661713 - EXABR NOT RESTORING CONFIGURATION OF IB SWITCHES WITH FIRMWARE 2.2.2.X

Until a fix for the above bug is available, use the workaround below if you run into this issue.

IMPORTANT NOTE TO SUPPORT: The following workaround steps should be executed by the customer under supervision from Exalogic Support over a WebEx. Exalogic Support should not execute them directly, either on the WebEx or via Platinum access for Platinum Customers.
WORKAROUND
Proceed with the rest of the steps, starting with step 11 in this MOS Note, to complete the switch restoration.

11. Validate whether the VNICs and VLANs are seen after Exabr restore on the newly replaced Switch
Run the commands below on the newly replaced switch.

showvnics
showvlan
The VLANs and VNICs should be listed, with the VNICs in WAIT-VHUB state. The VNICs show WAIT-VHUB on the switch because the port GUIDs of the new switch have not yet been added to the partitions. This is corrected by the exabr ib-register command in step 12.

If you do not see the VNICs and VLANs when running the above commands on the replaced switch, reboot it using the "reboot" command below and wait 5 minutes for the switch to reboot. Log back in and check the VNICs and VLANs again.

reboot
12. Register the Port GUIDs of the Newly Replaced Switch with EoIB Partitions using the "exabr ib-register" command
Once the configuration on the newly replaced switch is restored, register the port GUIDs of the newly replaced switch with the EoIB partitions using the ExaBR ib-register command. Follow Step 5 in section "3.3.3 Replacing InfiniBand Switches in a Virtual Environment" of the documentation at http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR173

To register the gateway port GUIDs with the EoIB partitions, run the exabr command with the ib-register option as follows:

./exabr ib-register hostname_of_IB_switch
Example: ./exabr ib-register ib02.example.com --dry-run
./exabr ib-register ib02.example.com
In the first example, ExaBR displays the operations that would be run without saving any changes, because the ib-register command is run with the --dry-run option.

13. Validate the status of VNICs and VLANs
Once the exabr ib-register command has been executed and the port GUIDs of the newly replaced switch are added to the EoIB partitions, validate that the VNICs and VLANs are up using the commands below:

showvnics
showvlan
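The dry-run-then-apply sequence from step 12 can be wrapped so that the real registration only runs after the dry run succeeds and the operator confirms. This is a minimal sketch, not an ExaBR feature: the function name and the EXABR variable are hypothetical, and it assumes exabr exits with a non-zero status when the dry run fails, which this Note does not state.

```shell
#!/bin/sh
# Sketch: run "exabr ib-register" with --dry-run first, then for real only
# after the operator confirms. EXABR is a hypothetical override for the path
# to the exabr script; it defaults to ./exabr as used in this Note.
# Assumption: exabr exits non-zero when the dry run fails.
ib_register_checked() {
    host=$1
    exabr=${EXABR:-./exabr}
    if ! "$exabr" ib-register "$host" --dry-run; then
        echo "Dry run failed for $host; not registering." >&2
        return 1
    fi
    printf 'Apply these changes to %s? Type yes to continue: ' "$host"
    read -r answer
    if [ "$answer" != "yes" ]; then
        echo "Aborted by operator."
        return 2
    fi
    "$exabr" ib-register "$host"
}
```

For example, `ib_register_checked ib02.example.com` prints the dry-run output, waits for the operator to type yes, and only then registers the port GUIDs. The status checks in step 13 still apply afterwards.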
14. Additional restoration steps for Virtual & Hybrid racks with EMOC
For Virtual & Hybrid racks, follow the additional restoration steps 6 and 7 listed in section "3.3.3 Replacing InfiniBand Switches in a Virtual Environment" of the documentation below. Step 6 in that section can be skipped if the passwords for the replacement switch were set to the previous passwords as described in step 5 above.

http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR173

Final checkup and verification

a. Check/set firewall rule settings on port 623
If the firewall rule on port 623 was previously present, reinstate it. Refer to the procedure in the following Note:

<Note 2023539.1>: IB Switch Messages Wrapping with "Possible SYN Flooding On Port 623"
b. Check the opensm status and smpriorities on all switches in the IB fabric
Refer to the following MOS Note for the recommended opensm status and SM priorities for the switches.

<Note 1682501.1>: Setting up the subnet manager in a multirack configuration containing Exalogic/BDA and Exadata/SSC/Expansion Rack
c. Check network/fabric is operating normally
Validate that everything is working normally, including the status of the vNICs on the host Compute Nodes/vServers, the IB networks on the Compute Nodes/vServers, and the interface/bonding status on the Compute Nodes/vServers.

NOTE: Stopping and starting an existing vServer from EMOC is a good test of whether the IB switch replacement was successful. When a vServer is stopped and started from EMOC, EMOC deletes and recreates its VNICs on the switches. This test confirms that the VNICs are recreated on the new replacement switch as EMOC expects and that EMOC has discovered the new switch properly.
d. Take a fresh Exalogic Control vServers backup using Exabr
Immediately take a fresh backup of the Exalogic Control vServers using exabr. An exabr backup of the Control vServers also backs up the IB switches.

e. Collect fresh full exalogs from the rack after Switch replacement
Upon completion of all the steps above, collect a fresh set of full exalogs from the rack. This data is useful for investigating the root cause of any problem that may occur as a result of the switch replacement.

KNOWN ISSUES WHICH CAN BE ENCOUNTERED DURING SWITCH REPLACEMENT

Exalogic Exabr ib-register Command To Register New Replaced Infiniband Switch Port GUIDs Fails With "Unable to get rpc version on some nodes in the fabric" Error
Refer to <Note 2308204.1> for details on this known issue.

Exalogic: "smpartition start" & "exabr ib-register" Commands Failing With "cli commit is in progress" Error
Refer to <Note 2356168.1> for details on this known issue.
References
<NOTE:2223662.1> - Master Note For Exalogic Infiniband Switch Replacement – Overview and guide to key articles
<BUG:16926597> - SIQ: EXABR IB GW RESTORE RESULTS IN ERROR IF PARTIALLY SUCCESSFULY LOAD_URI
<NOTE:2218443.1> - How to Prepare an Exalogic Infiniband Switch for Replacement (Pre-checks & Backup)
<NOTE:2308204.1> - Exalogic Exabr ib-register Command To Register New Replaced Infiniband Switch Port GUIDs Fails With "Unable to get rpc version on some nodes in the fabric" Error
<NOTE:2356168.1> - Exalogic: "smpartition start" & "exabr ib-register" Commands Failing With "cli commit is in progress" Error
<NOTE:2211261.1> - How to Prepare an Exalogic Infiniband (IB) Fabric for Planned Outage of an IB Switch
<BUG:25661713> - EXABR NOT RESTORING CONFIGURATION OF IB SWITCHES WITH FIRMWARE 2.2.2.X

Attachments
This solution has no attachment