Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Predictive Self-Healing Sure
Solution 2140928.1 : How to Prepare an Infiniband (IB) Fabric for Planned Outage of an IB Switch
Applies to:
Sun Datacenter InfiniBand Switch 36 - Version All Versions to All Versions [Release All Releases]
Oracle SuperCluster Specific Software
Sun Network QDR InfiniBand Gateway Switch - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine V2 - Version All Versions and later
Information in this document applies to any platform.

Purpose
This document describes how to prepare an Infiniband (IB) Fabric for any planned outage of an Infiniband Switch within that IB Fabric. It also contains a checklist to help the Customer-admin determine, based on the results of the checks, whether a full Fabric outage will be required.

Scope
Note: For IB switches within an Exalogic system or a multirack containing Exalogic, use Doc ID 2211261.1 instead of this document.
A Planned Outage could include a Reboot (or boot after a previous shut-down), Patching (firmware upgrade), or Replacement of an IB Switch in the IB Fabric.
The checks and actions in this document are critical to ensuring that production traffic in the Infiniband (IB) Fabric is resilient to the restart of the IB Switch that any of the above operations requires.
Based on the result of these checks, the checklist gives guidance on whether a full downtime of the IB Fabric will be required (full outage of all switches and nodes actively participating in the fabric).
Customers should only take the IB Switch outage within a production IB Fabric when all checks are cleared in the affirmative.
This document is referenced by several other Oracle Support knowledge articles, including:
- How to Prepare an Infiniband Switch for Replacement (Doc ID 1636229.1)
The document distribution is EXTERNAL since it needs to be shared with and used by the Customer-admin, as well as referenced by Partners, Field Engineers, and Oracle Support.
Details
1. Checks for IB fabric with multiple IB Switches
1.1. Confirm Hosts bonding/IPMP/IO-path redundancy
1.2. For a CRS Cluster, confirm that the fix for the issue of nodes rebooting when an IB Switch reboots is in place
1.3. Check firmware version on all the IB switches within the rack
All switches must be running the same firmware version. The output of the following command will give the firmware version number.
#version
1.4. Check the opensm status and smpriorities on all switches
Check the following outputs on all IB switches to determine which switches are running opensm and what their priorities are.
- If this is a custom multi-IB-switch configuration, check your ISV's install documentation.
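The firmware comparison in step 1.3 can be scripted once the version of each switch has been noted. The following is a minimal sketch, not part of the product: `firmware_versions_match` is a hypothetical helper, and it assumes you have already reduced the `version` output of each switch to one line of the form `<switch-name> <version>` (the exact output format of the command is not parsed here).

```shell
# Hypothetical helper (not a switch command).
# Reads "<switch-name> <version>" lines on stdin and succeeds only when
# every line carries the same version string.
firmware_versions_match() {
    awk '
        NF >= 2 { seen[$2] = 1 }          # remember each distinct version
        END {
            n = 0
            for (v in seen) n++
            exit (n == 1 ? 0 : 1)         # exactly one distinct version passes
        }'
}
```

Feed it the collected lines from a file or a pipe and test its exit status; an empty input is treated as a failure because no version was confirmed.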
1.5. Check IB Fabric using “ibswitches” and “getmaster”
#ibswitches
Ensure that all IB switches are seen in its output. If any expected IB switch is missing from the output, check and fix the IB cable connectivity to each missing switch.
Secondly, run the following command on all the IB switches in the network and make sure that all of them report the same master subnet manager in the IB network and that the master is not moving around from switch to switch:
#getmaster
Note: If this is a multirack system consisting of several racks, make sure that the above command is run on all IB switches in all the racks. Any anomaly here could be the result of problems in Infiniband cabling, for example one Switch being isolated incorrectly.
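Spotting a switch that is missing from the `ibswitches` output is easier with a list comparison than by eye, especially in a multirack system. The sketch below is illustrative only: `missing_switches` is a hypothetical helper, and both input files are assumptions, namely one identifier per line (for example a GUID or switch name) that you maintain for the expected switches and that you have extracted yourself from the `ibswitches` output.

```shell
# Hypothetical helper (not a switch command).
# $1: file listing the expected switch identifiers, one per line
# $2: file listing the identifiers actually seen, one per line
# Prints every expected identifier that is absent from the seen list.
# The match is exact and case-sensitive, so write both lists the same way.
missing_switches() {
    # grep exits 1 when it prints nothing, which is the healthy case here.
    grep -F -x -v -f "$2" "$1" || true
}
```

An empty result means every expected switch was seen. Any identifier that is printed points at a switch whose IB cabling should be checked as described above.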
1.6. Check that all IB Switches can ping each other through management interfaces
Make sure that you can (Ethernet) ping every IB switch from every other IB switch through its management interface. If any Switch cannot be reached from any other Switch over their management Ethernet interfaces, fix that first. Ensure that there are no firewalls between the management networks of individual racks within a multirack system.
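Because the check in 1.6 is "every switch from every other switch", the number of pings grows quickly with the number of switches. As a small sketch, the hypothetical helper `ping_pairs` below only enumerates the source and destination pairs to work through; it does not log in to any switch or send any traffic, and the names passed to it are whatever management hostnames or addresses you use.

```shell
# Hypothetical helper (not a switch command).
# Prints one "<source> <destination>" line for every ordered pair of
# distinct switches given as arguments.
ping_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] || printf '%s %s\n' "$src" "$dst"
        done
    done
}
```

For n switches this yields n x (n - 1) lines. For each line, log in to the first switch and ping the management interface of the second, and tick the pair off when it succeeds.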
1.7. Check IB partitions and secret M-Key policy
If IB partitions are configured in this IB fabric or secret M-Key policy is in use, do the following steps.
a) Run the following command on all switches running opensm
b) If there are IB Gateway switches in this IB fabric, make sure that the port GUIDs of all the IB Gateway switches are in all IB partitions.
The following command, run on an IB Gateway switch, shows the GUIDs of the four bridges of that switch:
#showgwports
Run the above command on all IB Gateway switches in the IB fabric.
The following command, run on the switch running as the Master, shows all the IB partitions (first identify the Master with either the #getmaster or the #sminfo command):
#smpartition list active
Check whether all four GUIDs of every IB Gateway switch are in all IB partitions. If not, add the missing GUIDs by running the following on the switch running as Master:
#smpartition start
#smpartition add -pkey <PKey> -port <port GUID> <port GUID> <port GUID> <port GUID> -m full
#smpartition commit
Note: You can skip the next step (c) if step (b) is completed.
#smsubnetprotection list active
If the output shows secret M-keys, run the following commands on this Master switch:
#smsubnetprotection start
#smsubnetprotection commit
This makes sure that the secret M-Key policy is propagated to the other switches. Normally this is done when the secret M-Keys are created; the step here confirms that the policy has been propagated.
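The cross-check in step (b), every Gateway port GUID against every IB partition, can be sketched as follows. This is an illustration under stated assumptions, not a product tool: `missing_gateway_guids` is a hypothetical helper, and both input files are normalized lists you prepare yourself from the `showgwports` and `smpartition list active` outputs (the raw output formats are not parsed here). A partition with no members at all would not appear in the second file and therefore would not be checked.

```shell
# Hypothetical helper (not a switch command).
# $1: file with one Gateway port GUID per line (must not be empty)
# $2: file with one "<pkey> <guid>" line per partition member
# Prints a "<pkey> <guid>" line for every Gateway GUID that is missing
# from a partition listed in the second file.
missing_gateway_guids() {
    awk '
        NR == FNR { if (NF) gw[$1] = 1; next }        # first file: Gateway GUIDs
        NF >= 2   { pkeys[$1] = 1; member[$1, $2] = 1 }
        END {
            for (p in pkeys)
                for (g in gw)
                    if (!((p, g) in member)) print p, g
        }' "$1" "$2" | sort
}
```

Each printed pair names a partition key and a GUID to add with the `smpartition add` sequence shown in step (b). Write the GUIDs identically in both files, since the comparison is a plain string match.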
2. Confirm type/extent of downtime required
If this is a standalone switch (only Switch in this IB Fabric), then you will need a downtime of the whole IB Fabric to replace, reboot or patch the standalone switch.
3. Complete the check-list template – IB Fabric preparation for IB Switch planned outage.
All IB switches within the rack are running the same firmware version? ___yes / no_____
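The rule stated in the Scope, that the outage should only be taken when all checks are cleared in the affirmative, amounts to a simple all-or-nothing test over the checklist answers. A minimal sketch, assuming you record each answer as `yes` or `no`; `checklist_cleared` is a hypothetical helper name:

```shell
# Hypothetical helper (not a switch command).
# Succeeds only when at least one answer is given and every answer is "yes".
checklist_cleared() {
    [ "$#" -gt 0 ] || return 1            # an empty checklist clears nothing
    for answer in "$@"; do
        case "$answer" in
            yes|Yes|YES) ;;               # affirmative, keep checking
            *) return 1 ;;                # any other answer blocks the outage
        esac
    done
    return 0
}
```

Anything other than a clear yes, including a blank or unknown answer, is treated as not cleared.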
4. Data Collection and Upload
Upon completion of all the steps above, collect the following set of data. It will be useful for investigating the root cause of any problem that occurs as a result of the planned outage.
a) Collect the following data from all IB switches in this IB fabric (if multirack, all switches in the entire multirack):
#version
#spsh
b) Copy the following file from all IB switches running opensm:
/conf/partitions.current
c) Collect the following data from the switch currently running the Master Subnet Manager:
#smpartition list active
d) Collect the following data from any one of the IB leaf switches:
#ibnetdiscover
After running this command, collect all the /tmp/ibdiagnet* files it creates.
e) If there are IB-Gateway switches in this IB fabric, collect the following data from all IB-Gateway switches:
#showgwports
5. Proceed to next steps
- If you are only rebooting or patching this IB Switch, then you are ready to go ahead, with the appropriate downtime as confirmed in the steps above. Please refer to the appropriate sections of the Product Guides, for reboot (restart) or firmware upgrade.
Notes / Addendum
2. Note for Patching in multiple-switch environment: It is recommended that, before patching commences, there is a master switch and at least one standby switch. Patch the standby switches first and the master last. This reduces the number of SM failovers: there will be only one failover, from the master switch to one of the standby switches.
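The ordering recommended in the patching note can be written down as a short sketch. `patch_order` is a hypothetical helper; it assumes the first argument is the switch currently reported as master and the remaining arguments are all switches running the Subnet Manager, master included.

```shell
# Hypothetical helper (not a switch command).
# $1: current master switch; remaining arguments: all SM switches.
# Prints the standby switches first and the master last. Fails if no
# standby switch is listed, since the note recommends at least one.
patch_order() {
    master=$1
    shift
    standbys=0
    for sw in "$@"; do
        if [ "$sw" != "$master" ]; then
            printf '%s\n' "$sw"
            standbys=$((standbys + 1))
        fi
    done
    if [ "$standbys" -eq 0 ]; then
        echo "no standby switch listed" >&2
        return 1
    fi
    printf '%s\n' "$master"
}
```

Patching in the printed order keeps the master role on one switch until the very end, so only the final step triggers a failover.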
References
<NOTE:1383773.1> - How to Replace a Failed Sun Network QDR InfiniBand Gateway Switch
<NOTE:2125242.1> - Infiniband Switch Replacement – Overview and guide to key articles
<NOTE:2125203.1> - Infiniband Switch Replacement - Follow-up Actions
<NOTE:1636229.1> - How to Prepare an Infiniband Switch for Replacement
<NOTE:1341658.1> - How to Replace a Failed Sun Datacenter InfiniBand Switch 36