Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1636229.1
Update Date:2018-01-09

Solution Type  Predictive Self-Healing Sure

Solution  1636229.1 :   How to Prepare an Infiniband Switch for Replacement  


Related Items
  • Sun Datacenter InfiniBand Switch 36
  • Oracle SuperCluster Specific Software
  • Sun Network QDR InfiniBand Gateway Switch
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Sun Network Infiniband

In this Document
Purpose
Scope
Details
 Note: For IB switches within an Exalogic system, use Doc ID 2218443.1 instead of this document
 1. Checks needed prior to Dispatch of Part/Onsite
  1.1. Check the IB Fabric to ensure resilience to later booting (refer to companion KB article)
  1.2  Check/update the configuration backup
  1.3 Check the firmware version of the Switch
  1.4 Check for presence of workaround firewall rule on port 623
 2. Complete the check-list template – IB Switch preparation for Replacement
 3. Provide a report on the replacement pre-checks to Oracle Support including outage type
 4. Final pre-Dispatch preparation (IB Switch replacement in production IB Fabric)
  4.1. Disable SM on the switch being replaced
  4.2. Check if running ASR / block alerts if so
  4.3. Update in MOS that step 4 actions are completed
 5. Contact Oracle Support for Part/Onsite Dispatch
 6. Replace the switch and perform Follow-up actions
References


Applies to:

Oracle SuperCluster Specific Software
Sun Network QDR InfiniBand Gateway Switch - Version All Versions to All Versions [Release All Releases]
Sun Datacenter InfiniBand Switch 36 - Version All Versions to All Versions [Release All Releases]
Exadata Database Machine V2 - Version All Versions and later
Information in this document applies to any platform.

Purpose

This document helps the user to prepare an Infiniband Switch for replacement and collect information required by Oracle Support / On-site team prior to Dispatch of a replacement IB Switch part.

Scope

- The document distribution is EXTERNAL since it needs to be shared with and used by the Customer-admin, as well as referenced by Partners, Field Engineers, and Oracle Support.

- Where Oracle Support has confirmed the need for an IB Switch replacement (whether that replacement will be performed by Customer-admin, Partner or Oracle Field Engineer), Customer-admin is requested to confirm by updating in MOS SR-notes that all the pre-check actions in this document have been completed - by copying/pasting the requested checklist(s) into the SR in MOS - prior to Oracle Dispatching the Part for replacement.

- Oracle Support will not Dispatch until these checks have been performed, unless the customer has given a clear statement acknowledging the risks of not performing these checks prior to replacement.

Details

Note: For IB switches within an Exalogic system, use Doc ID 2218443.1 instead of this document.

 

1. Checks needed prior to Dispatch of Part/Onsite

 

 1.1. Check the IB Fabric to ensure resilience to later booting (refer to companion KB article)

Perform the checks/actions in the following document and confirm to Oracle Support that these checks/actions have been done:

     How to Prepare an Infiniband Fabric for Planned Outage of an IB Switch (Doc ID 2140928.1)

Note: These checks are required even if the IB switch being replaced is hung or otherwise inactive, because the IB Fabric must be configured correctly to minimize the risk of an outage when the replacement IB switch comes up.

Then return to this document, keep a note of the check-list (section 3) completed in the above document, and complete the remaining steps below.

 

 1.2  Check/update the configuration backup

 - If the Infiniband Switch is still responsive on the Management Ethernet port:

   Use the current IB Switch Firmware Product Guide to back up the configuration of the Infiniband switch (Switch ILOM “my.config” XML backup). The following links are for Firmware v2.1:

   (Note: Regardless of whether a backup was taken using Exabr in the previous step, it is recommended to take a backup using the ILOM of the switch in all cases.)

       For Infiniband Switch 36 (nm2-36P):

           Back Up the Configuration (CLI)
           or,
           Back Up the Configuration (Web)

       For Infiniband Gateway Switch (nm2-GW):

           Back Up the Configuration (CLI)
           or,
           Back Up the Configuration (Web)

 - If this switch is non-responsive or otherwise unable to boot, check whether a recent configuration backup exists (it must have been taken after the most recent change in the IB Fabric). Refer above for the relevant backup files to look for in this case.

 - If a recent backup is not available, the Customer-admin will need to manually reconfigure the switch after replacement. At a minimum, this requires the IB Switch management-port IP host-name/address information and the switch instance number "gwinstance" (if this is a multi-rack cabling with several Exalogic and/or Big Data Appliance racks).

 

      When confirming in the MOS SR update that this list of pre-checks has been done, the Customer-admin needs to comment specifically on this point to confirm exactly what the configuration restoration strategy will be: the path to the configuration backup file that will be used (if any), or the list of commands that will be used to restore the configuration. This plan must be completed before the Part/FE is Dispatched, to ensure that the Customer-admin will be ready to step in and reconfigure the switch during the replacement intervention. Please contact Oracle Support if you have any questions.

     As with all Oracle products, customers are expected to maintain regular backups.
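
The "recent backup" condition above (a backup taken after the last change in the IB Fabric) can be checked mechanically on the admin host where backups are stored. The following is a minimal sketch, not an Oracle-provided tool: the function name backup_is_newer_than and the idea of keeping a marker file whose modification time records the last fabric change are assumptions made for illustration.

```shell
# Sketch: succeed only if the backup file exists and was modified after
# the reference file (a hypothetical marker touched at the last IB Fabric change).
backup_is_newer_than() {
    backup=$1
    reference=$2
    # Both files must exist, otherwise there is nothing to compare.
    [ -f "$backup" ] && [ -f "$reference" ] || return 1
    # find prints the path only when the backup is newer than the reference.
    [ -n "$(find "$backup" -newer "$reference" 2>/dev/null)" ]
}

# Example call (both paths are hypothetical):
#   backup_is_newer_than /backups/ibsw1/my.config /backups/ibsw1/last_fabric_change.marker \
#       && echo "backup is current" || echo "backup is stale or missing"
```

If the check fails, treat the backup as unusable for this replacement and plan for a fresh backup or a manual reconfiguration as described above.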

 

 

 1.3 Check the firmware version of the Switch

    If this is a replacement, check the firmware version of the switch being replaced and make sure that this firmware version is available to download (since it will need to be applied to the replacement switch). To check the firmware version, log in to the switch being replaced and run the following command:

              # version

 

     Here is a sample output:

             # version

             SUN DCS gw version: 2.1.8-4             <<<<<<<<  firmware version

             Build time: ...

             FPGA version: ...

             SP board info: ...

            

 

   Then, check in MOS (Patches & Updates) whether this firmware is available for download.

   If this firmware is not available to download, it cannot be loaded on the new switch after replacement. In that case, upgrading the firmware to the latest or the next available firmware may be required. Inform the Oracle Support engineer prior to the Dispatch by adding this information to a MOS SR-note, so that it can be confirmed whether or not upgrading the firmware after replacing the switch could have any adverse effect on the IB Fabric.
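
When the output of the version command has been saved to a file or captured over a remote session, the firmware string can be pulled out for the checklist with a small filter. This is a sketch under assumptions: the helper name firmware_version is hypothetical, and it relies on the version line having the form shown in the sample output above ("SUN DCS <model> version: <version>").

```shell
# Sketch: read `version` output on stdin and print only the firmware version.
# Assumes a line of the form "SUN DCS <model> version: <x.y.z-n>".
firmware_version() {
    sed -n 's/^[[:space:]]*SUN DCS .* version:[[:space:]]*\([^[:space:]]*\).*$/\1/p' | head -n 1
}

# Example (version_output.txt is a hypothetical file holding the saved output):
#   firmware_version < version_output.txt
```

The printed string is the value to look up in MOS and to quote in the check-list in section 2.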

 

 1.4 Check for presence of workaround firewall rule on port 623

Check whether a workaround firewall rule to block incoming requests on port 623 is implemented on the IB switch, as described in IB Switch Messages Wrapping with "Possible SYN Flooding On Port 623" (Doc ID 2023539.1):

# iptables -L -n

If you see an entry for port 623, the workaround was implemented. Make a note that this rule must be restored after replacement / re-image / restore.
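
If the iptables listing is long, the check can be reduced to a yes/no answer by filtering saved output. The sketch below assumes the rule was created with a single destination port, which iptables -L -n displays as "dpt:623"; a rule defined with a port range or the multiport match is displayed differently and would not be detected. The helper name has_port_623_rule is hypothetical.

```shell
# Sketch: read `iptables -L -n` output on stdin; succeed if any rule
# names destination port 623 (and not, for example, 6230).
has_port_623_rule() {
    grep -Eq 'dpt:623([^0-9]|$)'
}

# Example (iptables_output.txt is a hypothetical file holding the saved listing):
#   if has_port_623_rule < iptables_output.txt; then
#       echo "port 623 workaround present: restore it after replacement"
#   else
#       echo "no port 623 rule found"
#   fi
```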

 

2. Complete the check-list template – IB Switch preparation for Replacement

Answer yes/no, and/or provide plan/comment:

“IB Fabric preparation for IB Switch planned outage” Checklist (MOS Note 1636229.1, section 1.1; MOS Note 2140928.1, section 3) has been completed and is being provided to Oracle Support for review?   ___ yes/no?___

Configuration backup/restoration strategy Plan – including the path to the configuration backup file that will be used (if any), and/or the list of commands to restore configuration that will be used:   __<insert detail of config restoration plan or tool being used, date of last backup etc>______________________________

Confirm plan/availability of the firmware version of the switch being replaced, so that the replacement switch can be upgraded to the right version after replacement - provide path to the appropriate firmware:   ____<State the FW revision to be used; and confirm that the appropriate firmware version has been downloaded> ________

Workaround firewall rule for port 623 is present in iptables, yes/no?  _____ yes/no?_______
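
For admins who keep their answers in a script or ticket template, the four items above can be rendered as plain text ready to paste into the SR. This is only an illustrative sketch: the function name print_replacement_checklist, the argument order, and the shortened item labels are assumptions, not an Oracle-defined format.

```shell
# Sketch: print the four check-list answers, one per line.
# Arguments: 1) fabric check-list completed (yes/no)
#            2) configuration backup/restoration plan
#            3) firmware version and download status
#            4) port 623 firewall rule present (yes/no)
print_replacement_checklist() {
    printf 'IB Fabric preparation checklist (Doc ID 2140928.1, section 3) completed: %s\n' "$1"
    printf 'Configuration backup/restoration plan: %s\n' "$2"
    printf 'Firmware version and download status: %s\n' "$3"
    printf 'Port 623 firewall rule present in iptables: %s\n' "$4"
}

# Example call (all values are placeholders):
#   print_replacement_checklist yes "ILOM my.config backup, path and date here" "2.1.8-4, downloaded" no
```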

 

3. Provide a report on the replacement pre-checks to Oracle Support including outage type


Inform Oracle Support by copying and pasting the completed checklists into your open MOS SR, confirming that all pre-checks have been completed. Indicate any steps that were skipped and any anomalies or concerns, and provide commentary where required.

Include the check-list template both from the linked IB Fabric outage preparation at step 1.1 and for the replacement itself at step 2 above.
Upload the data collected in step 4 of the linked IB Fabric outage preparation to the Service Request (SR).


If the answer was “No” to any of the checks in the IB Fabric doc, then confirm in writing in an update in the SR in MOS, that a full downtime of the IB Fabric is planned.

If it has been determined that a full downtime of the IB Fabric will NOT be required, then the actions in step 4 below must also be completed, prior to Dispatch. This is to ensure that when the IB Switch is powered down, any resulting problems in the IB Fabric can be rectified by the Admin team with the assistance of Oracle Support, prior to the arrival of On-site team. Wait for confirmation from Oracle Support before proceeding to the next steps.

 

 

4. Final pre-Dispatch preparation (IB Switch replacement in production IB Fabric)

 

If there will NOT be a full outage of the IB Fabric, then the following steps MUST be completed now, before the Part/Onsite is Dispatched by Oracle Support.

 4.1. Disable SM on the switch being replaced


Since you are going to replace this switch, disable SM on this switch:

    # disablesm


Note: If this switch is not accessible through the management port, skip this step.

If this switch is the current Master, the above command causes the Master role to move to another switch. If all configurations are correct and standard, this has no effect on operation. Monitor your system to confirm that everything is working well, and check which switch is the current Master by running the following command on any IB leaf switch:

    # sminfo
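
To compare the Master before and after disabling SM, the LID reported by sminfo can be extracted from saved output. This sketch assumes the one-line output format commonly produced by the sminfo utility ("sminfo: sm lid <n> sm guid <guid>, activity count <n> priority <n> state <n> SMINFO_MASTER"); verify the format on your own switch before relying on it. The helper name master_sm_lid is hypothetical.

```shell
# Sketch: read `sminfo` output on stdin and print the LID of the Master SM.
# Prints nothing if the line does not report the SMINFO_MASTER state.
master_sm_lid() {
    sed -n 's/^sminfo: sm lid \([0-9][0-9]*\) .* state [0-9][0-9]* SMINFO_MASTER.*$/\1/p'
}

# Example (sminfo_before.txt and sminfo_after.txt are hypothetical saved outputs):
#   before=$(master_sm_lid < sminfo_before.txt)
#   after=$(master_sm_lid < sminfo_after.txt)
#   [ "$before" != "$after" ] && echo "Master moved from LID $before to LID $after"
```

If the switch being replaced held the Master role, the LID reported after the change is expected to belong to a different switch.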

 

 4.2. Check if running ASR / block alerts if so


Check whether ASR is currently active for your IB Switch, and include a statement about this in your preparation report to Oracle Support. If ASR is enabled and this is not an Exalogic system, the customer needs to go into ASR Manager and block new alerts for the Serial# of the IB Switch being replaced.

 

 4.3. Update in MOS that step 4 actions are completed

Update in MOS that all actions in Steps 4.1 and 4.2 have been completed successfully and that you are ready for Dispatch.

 

5. Contact Oracle Support for Part/Onsite Dispatch

Once Oracle Support has reviewed and approved the check-lists and plan from step 3, and (where there will not be a full IB Fabric downtime) you have confirmed to Oracle Support that the actions in step 4 are complete, Oracle Support will contact you to confirm and to request the details of the outage window for the change, so that the Part/FE can be Dispatched.

NOTE to Oracle Systems Support TSE: When Dispatching the replacement IB Switch Part, try to order the same part# that the customer is currently using; the part# can be determined from the ILOM Snapshot and/or the showfruinfo command. If ordering the newer part# switches, be sure that the customer will be able to use the later firmware version, as per Infiniband Switch - Firmware Downgrade To 2.1.6 Fails With Error: Cannot proceed with downgrade on this SP. (Doc ID 2187802.1)

The team responsible for replacing the Switch (whether Customer-admin, Partner, or FE) will then follow all the steps in the relevant "How to Replace" document for the particular Switch part involved.

6. Replace the switch and perform Follow-up actions

Once the new Part is received and at the time of the outage window: Ensure that the On-site team follows the relevant How to Replace action-plan for this model of IB Switch (Refer to Infiniband Switch Replacement – Overview and guide to key articles (Doc ID 2125242.1) - click on the How to Replace document relevant to your Switch part#), unless a special action-plan has been provided by Oracle Support.

*After* the replacement, Customer-Admin will need to continue with the Follow-up Actions documented here:  Infiniband Switch Replacement - Follow-up Actions (Doc ID 2125203.1)

References

<NOTE:2125203.1> - Infiniband Switch Replacement - Follow-up Actions
<NOTE:2140928.1> - How to Prepare an Infiniband (IB) Fabric for Planned Outage of an IB Switch
<NOTE:1341658.1> - How to Replace a Failed Sun Datacenter InfiniBand Switch 36
<NOTE:2125242.1> - Infiniband Switch Replacement – Overview and guide to key articles
<NOTE:1383773.1> - How to Replace a Failed Sun Network QDR InfiniBand Gateway Switch

  Copyright © 2018 Oracle, Inc.  All rights reserved.