Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1383773.1
Update Date:2018-05-09

Solution Type  Technical Instruction Sure

Solution  1383773.1 :   How to Replace a Failed Sun Network QDR InfiniBand Gateway Switch  


Related Items
  • Sun Network QDR InfiniBand Gateway Switch
  • Exalogic Elastic Cloud X5-2 Hardware
  • Exalogic Elastic Cloud X3-2 Eighth Rack
  • Big Data Appliance X3-2 Hardware
  • Oracle Exalogic Elastic Cloud X2-2 Qtr Rack
  • Exalogic Elastic Cloud X4-2 Full Rack
  • Exalogic Elastic Cloud X4-2 Hardware
  • Big Data Appliance X4-2 Hardware
  • Exalogic Elastic Cloud X4-2 Quarter Rack
  • Oracle Exalogic Elastic Cloud X2-2 One-Eighth Rack
  • Exalogic Elastic Cloud X3-2 Half Rack
  • Oracle Exalogic Elastic Cloud X2-2 Full Rack
  • Exalogic Elastic Cloud X5-2 Half Rack
  • Exalogic Elastic Cloud X5-2 Eighth Rack
  • Big Data Appliance X5-2 Hardware
  • Exalogic Elastic Cloud X3-2 Hardware
  • Big Data Appliance Hardware
  • Big Data Appliance X7-2 Hardware
  • Exalogic Elastic Cloud X4-2 Half Rack
  • Exalogic Elastic Cloud X3-2 Quarter Rack
  • Exalogic Elastic Cloud X5-2 Quarter Rack
  • Exalogic Elastic Cloud X3-2 Full Rack
  • Exalogic Elastic Cloud X4-2 Eighth Rack
  • Oracle Exalogic Elastic Cloud X2-2 Half Rack
  • Oracle Exalogic Elastic Cloud X2-2 Hardware
  • Exalogic Elastic Cloud X5-2 Full Rack
  • Big Data Appliance X6-2 Hardware
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: SaND-CAP VCAP




In this Document
Goal
Solution
References


Applies to:

Exalogic Elastic Cloud X4-2 Half Rack - Version X4 and later
Exalogic Elastic Cloud X4-2 Full Rack - Version X4 and later
Sun Network QDR InfiniBand Gateway Switch - Version Not Applicable to Not Applicable [Release N/A]
Exalogic Elastic Cloud X3-2 Quarter Rack - Version X3 and later
Exalogic Elastic Cloud X3-2 Hardware - Version X3 and later
Information in this document applies to any platform.

Goal

Replace a Sun Network QDR InfiniBand Gateway Switch (NM2-GW).

The customer must first be referred to the relevant document on preparing for the on-site work:

If this is an Exalogic system, Doc ID 2218443.1 How to Prepare an Exalogic Infiniband Switch for Replacement (Pre-checks & Backup)
or, for all other systems, Document 1636229.1 How to prepare an Infiniband Switch for a Field Engineer Visit for servicing or replacing

Customer is requested to upload in SR-notes a confirmation that the preparation checks have been done.  The preparation steps, including a plan for restoring the Switch (whether using available backups or manual restoration), must be completed and documented in the SR-notes (as stated in MOS Note 1636229.1 Steps 3 and 4.4) ** prior to Dispatching the Part and/or FE if applicable **.

An additional goal of this document is to cover the post-replacement steps. Certain steps are typically performed by the Field Engineer once the replacement has been performed, including the ASR update and the NEW process for updating the Installed Base - refer to step D. below. Refer also to the additional post-replacement steps and links in the Customer Acceptance section, which are typically performed by the Customer-admin or their representatives.

This document has distribution EXTERNAL, since the IB Switch is defined as a Customer-Replaceable-Unit in a limited number of Platforms, for example in custom-built solutions outside Engineered Systems.

 

Solution

DISPATCH INSTRUCTIONS

WHAT SKILLS DOES THE ENGINEER NEED:

If this switch is part of an Exalogic machine, the engineer must be Exalogic trained.
If this switch is not part of an Exalogic machine, then the engineer should be familiar with this type of switch.


TIME ESTIMATE: 150 minutes

TASK COMPLEXITY: 3



FIELD ENGINEER INSTRUCTIONS:
PROBLEM OVERVIEW:

A failed Sun Network QDR InfiniBand Gateway Switch needs to be replaced.


This CAP document for replacing Sun Network QDR InfiniBand Gateway Switch is available live at this link:  Document 1383773.1

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

    Customer has completed the steps given in Doc ID 2218443.1 (Exalogic systems) or 1636229.1 (all other systems), and the checklists (Steps 3 and 4.3) have been confirmed in customer-visible SR-notes; the plan for restoring the configuration information has been documented in customer-visible SR-notes, and the owner of the configuration restoration actions has been clearly identified.  Oracle TSE should have ordered the correct Part# for the IB Switch firmware to be used, as per <Document 2187802.1>.  If the customer has not already powered off the Switch (for example, when the replacement is done on a production IB Fabric), then the IB Switch should be on with the Subnet Manager disabled (# disablesm).   ** If the FE is unable to confirm that the checklist has been followed or is unable to view the SR in SR Viewer, then please phone back in to Support before attending site and ask for this to be confirmed **

 

WHAT ACTION DOES THE ENGINEER NEED TO TAKE:     ( PLEASE READ ALL INSTRUCTIONS BEFORE PROCEEDING )

This procedure is comprised of 4 stages:

A.  Initial physical replacement in the rack
B.  Replacement Switch Firmware Check & Upgrade
C.  Cable up the replacement switch and check basic IB Fabric connectivity
D.  Final housekeeping, documentation, warm-handover and wrap-up

 

A.  Initial physical replacement in the rack (no cabling yet - cables will be connected in later Step C)

1.  Power off the Switch:  The switch needing replacement will need to be powered off (if not already done by customer team).  Power off both power supplies on the switch by removing both the power plugs.

    If customer has not taken a full IB Fabric downtime, then after powering off this switch, check with the customer representative that everything is working normally on the other parts of the IB Fabric.  For example, the SM master may have moved to another switch (if this Switch had been the Master and the admin team were unable to move the master earlier).  If the configuration follows the standard, powering off this switch will have no effect on the operation of the system.  If any problem or anomaly is detected in the running IB Fabric, then work with customer / Support to get the issue resolved prior to proceeding with the replacement.

2.  Now, disconnect the cables from the switch. All InfiniBand cables should have labels at both ends indicating their locations.  If there are any cables that do not have labels, then label them - if needed, refer to cabling tables in customer's build documentation, for example, if Exalogic, then this would be Exalogic Machine Owner's Guide.

     Then, remove the switch being replaced, from the rack.

           Note: Read "Sun Network QDR InfiniBand Gateway Switch Installation Guide for Firmware Version 2.1".  To remove the switch, follow the installation steps in reverse order.

 
3.  Install the replacement switch in the rack.   Do not connect any Infiniband cables yet.

      Refer to "Sun Network QDR InfiniBand Gateway Switch Installation Guide for Firmware Version 2.1" for detailed steps on installing a gateway switch.


4. Connect management Ethernet port of the replacement IB switch to the Cisco switch within the rack (to the same Ethernet port where old IB switch's management port was connected).

     Then follow the steps on "Powering On the Gateway", or  the "Power on the Gateway" section of  the pdf document "Sun Network QDR InfiniBand Gateway Switch Installation Guide for Firmware Version 2.1"

          In the above section you need to complete  1) Attach the Management Cables, 2) Attach the Power Cords, 3) Accessing the Management Controller and 4) Verify the Gateway Status.

                 Do not do the section on "Start the subnet Manager"

                       Note: The default password for root is changeme

     Set the Network Management Parameters (CLI).   The initial setup of the network management parameters may have to be done by accessing the switch through its USB management port.  Make sure that the management IP address assigned to the replacement switch is the same as that of the old switch.   If customer does not know that IP address, or its mask and default gateway, then ask customer to provide any available IP address in the same subnet as the other IB switches in this rack.  This switch will get its correct management IP address when its configuration is restored during the follow-up (restoration) actions listed in the Customer Acceptance section.   A temporary address can be used until then.

     Do not connect any Infiniband cables yet.

     Do not start subnet manager yet.
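
     As a quick sanity check on a temporary management address, the subnet membership test described in step 4 can be sketched as follows. This is a minimal illustration only: the helper name and all addresses are hypothetical (documentation-range addresses), and the real mask and peer address must come from the customer.

```python
import ipaddress


def in_same_subnet(candidate_ip, reference_ip, netmask):
    """Return True if candidate_ip lies in the subnet of reference_ip/netmask.

    reference_ip is the management address of another IB switch in the rack;
    all values passed in are supplied by the customer (hypothetical here).
    """
    # strict=False lets us pass a host address rather than the network address.
    network = ipaddress.ip_network(f"{reference_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(candidate_ip) in network


if __name__ == "__main__":
    # Hypothetical example values, not taken from any real rack.
    print(in_same_subnet("192.0.2.50", "192.0.2.10", "255.255.255.0"))
    print(in_same_subnet("198.51.100.5", "192.0.2.10", "255.255.255.0"))
```

     The check only confirms that the address is in the right subnet; whether it is actually unused still has to be confirmed with the customer.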

 

 

B.  Replacement Switch Firmware Check & Upgrade

1.  Check the firmware of the other switches in the rack to know what firmware version the replacement switch should be running.  If firmware is to be downgraded, for example to match an older Engineered Systems PSU, then the following document must have been referred to by the Oracle TSE when ordering the part:

    <Document 2187802.1> Infiniband Switch - Firmware Downgrade To 2.1.6 Fails With Error: Cannot proceed with downgrade on this SP.

2.  Download that firmware from MOS and upgrade the firmware of the replacement switch to that version:  For Sun Network QDR InfiniBand Gateway Switch (NM2-GW), be sure to follow the detailed firmware upgrade procedures given in the Product Note document for firmware 2.2 "Upgrading the Gateway Firmware (CLI)" (n.b. these procedures are also on pages 20 to 24 of the PDF version of the Product Note).  For the download, use only the protocols supported by the IB Switch firmware, as listed at step 3 of the Update procedure.  Also take care to perform the double-upgrade as at steps 3 and 4.  Ensure that all the steps for upgrading firmware are completed, the switch is restarted (as at step 6) and firmware integrity is checked (as at steps 7, 8 and 9).

Note: When upgrading firmware from 1.3.x to 2.1.7 or above, first upgrade to 2.1.6. (Refer to BUG 26735450 - NM2 GW Product Notes FW 2.1 Upgrade/Downgrade table is incorrect.)
Refer to the product notes for 2.1 (http://docs.oracle.com/cd/E36256_01/pdf/E36258.pdf) regarding upgrade/downgrade paths.

Note: If the firmware upgrade fails when upgrading to firmware version 2.1.8, Oracle employees may refer to the following internal document (please call Oracle Support if you need access to it):  <Document 2109781.1> How to fix broken InfiniBand Switch after upgrade to 2.1.8 firmware
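
The intermediate-version rule in the note on upgrading from 1.3.x can be sketched as a small helper. This is an illustration under stated assumptions: the function names are hypothetical, versions are assumed to be plain dotted numbers, and only the single rule quoted above (1.3.x to 2.1.7 or above goes through 2.1.6) is encoded. The product notes remain the reference for all other upgrade and downgrade paths.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.1.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def upgrade_steps(current, target):
    """Return the ordered list of firmware versions to install.

    Encodes only the rule from this document: going from 1.3.x to 2.1.7
    or above requires an intermediate upgrade to 2.1.6.
    """
    cur, tgt = parse_version(current), parse_version(target)
    steps = []
    if cur[:2] == (1, 3) and tgt >= (2, 1, 7):
        steps.append("2.1.6")
    steps.append(target)
    return steps


if __name__ == "__main__":
    # Hypothetical starting version on a replacement switch.
    print(upgrade_steps("1.3.3", "2.1.8"))
```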

 

3. Disable SM on the replacement switch:

     # disablesm

 

 

C.  Cable up the replacement switch and check basic IB Fabric connectivity

1.  Completely power off the replacement switch now by unplugging both the power supplies.

 

2.  Now connect all the Infiniband cables

      Refer to the "Connecting Data Cables" section of the pdf document "Sun Network QDR InfiniBand Gateway Switch Installation Guide for Firmware Version 2.1"

3. Power on the replacement switch by installing power cords to the InfiniBand switch power supply slots.


4.   On-site team now needs to check basic Infiniband connectivity:   Run the following commands on the replaced switch for verification purposes:

     # listlinkup
            -> Ensure that all cabled ports are in " up (Enabled)" state for all links that are expected to be active with nodes up at the other end of the link.  Otherwise re-seat the cables/transceivers, or check if the cables/transceivers are damaged / need replacement.
     # ibswitches
             ->  Check that all Switches including the replaced Switch are listed
     # getmaster and # sminfo
             -> Ensure that it can see the master
     # service opensmd status
             -> Ensure that opensmd is not running.  If it is still running, disable it using the # disablesm command

   With the above commands, basic IB Fabric connectivity is confirmed and the replaced switch is ready for the follow-up actions described in the subsequent document.
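
   Where the listlinkup output has been captured to a text file, the check for cabled ports that are not in the "up (Enabled)" state can be sketched as below. This is a rough illustration only: the function name and the sample output lines are hypothetical, and the exact listlinkup line layout is an assumption that may differ between firmware versions. The sketch relies only on the connector label appearing as a separate word on the same line as the text "up (Enabled)".

```python
def connectors_not_up(listlinkup_text, expected_connectors):
    """Return the expected connector labels that have no 'up (Enabled)' line.

    listlinkup_text: captured output of the listlinkup command.
    expected_connectors: labels of the connectors that were cabled,
    taken from the cable labels or the customer's cabling tables.
    """
    lines = listlinkup_text.splitlines()
    missing = []
    for connector in expected_connectors:
        # Match the label as a whole word so that '0A' does not match '10A'.
        is_up = any(
            connector in line.split() and "up (Enabled)" in line
            for line in lines
        )
        if not is_up:
            missing.append(connector)
    return missing


if __name__ == "__main__":
    # Hypothetical sample; real output formatting is an assumption.
    sample = (
        "Connector 0A Present <-> I4 Port 20 is up (Enabled)\n"
        "Connector 1A Present <-> I4 Port 22 is down (Enabled)\n"
    )
    print(connectors_not_up(sample, ["0A", "1A"]))
```

   Any connector reported by such a check still needs the manual follow-up described above: re-seat the cable/transceiver or check it for damage.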


5. Check and make sure that you can (Ethernet) ping every IB switch from every other IB switch through its management interface.
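
To avoid missing a pair in the ping check of step 5, the full list of source/destination combinations can be generated up front. A minimal sketch, assuming hypothetical switch names; the pings themselves are still run by hand from each switch's management interface, and the script only produces the checklist.

```python
from itertools import permutations


def ping_checklist(switches):
    """Return every ordered (source, destination) pair of distinct switches."""
    return list(permutations(switches, 2))


if __name__ == "__main__":
    # Hypothetical management hostnames of the IB switches in the rack.
    switches = ["ib-sw-gw01", "ib-sw-gw02", "ib-sw-spine01"]
    for source, destination in ping_checklist(switches):
        print(f"from {source}: ping {destination}")
```

For n switches this yields n * (n - 1) checks, for example 6 for a rack with three IB switches.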

 

D.  Final housekeeping, documentation, warm-handover and wrap-up

1. ASR:  Set the serial number, product level identity and ASR of this replaced switch as per the steps in the following document.

     Refer to: How to configure Datacenter InfiniBand Switch 36 & QDR InfiniBand Gateway Switches for ASR (Doc ID 1902710.1)

 

2. Installed Base:  Update Installed Base, to ensure that the replacement-part serial# will be properly entitled.  Within Oracle, the IB Switch is termed a "SuperFRU" which simply means that it is a whole chassis replacement including both chassis and internal main-board. Therefore, follow the relevant SuperFRU procedure:

    a. If the IB Switch has been replaced by the end customer, i.e. as Parts-Only/CRU, then use the following procedure:  Oracle Support Document 1575977.1 (How can customers update the System Serial Number after a SuperFru Part Replacement)

    b. If the IB Switch has been replaced by an Oracle field engineer, then the Oracle FE should use the *NEW* process in the internal Oracle Global Desk Manual repository, by clicking the ptp.oraclecorp.com link directly here: How to Update Install Base serial number entitlement for InfiniBand Switch FRU replacements

    c. Partners use the process they already use.

 

3.   The Oracle Field Engineer (where an FE has been dispatched) should now document, in a visible note in the Task debrief, whether all of the above steps are complete.  Any steps that were missed or skipped, and any anomalies, must be clearly indicated.   If the customer will be working further with Oracle Support at this time on restoration of the configuration, then a ** warm handover from the Field Engineer to the Oracle Support Engineer is required **.

 


OBTAIN CUSTOMER ACCEPTANCE

- WHAT ACTION DOES THE CUSTOMER NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

Customer-admin is now required to follow the steps given in Document 2218689.1 Exalogic Infiniband Switch Replacement - Follow-up Actions (Restoration), if this is an Exalogic system, or  2125203.1 Infiniband Switch Replacement – Follow-up Actions, for all other systems.

These documents list the critical follow-up actions required to restore the configuration (from config backup where available) along with customer-specific configuration items such as smnodes, partitions and VNICs.

 



Reference Information:

Exalogic Machine Owner's Guide - https://docs.oracle.com/cd/E18476_01/index.htm
Sun Network QDR InfiniBand Gateway Switch - Product Note, Installation Guide, Administration Guide, Command Reference, Service Manual - http://docs.oracle.com/cd/E36256_01/index.html

References

<NOTE:2125242.1> - Infiniband Switch Replacement – Overview and guide to key articles
<NOTE:2140928.1> - How to Prepare an Infiniband (IB) Fabric for Planned Outage of an IB Switch
<NOTE:1341658.1> - How to Replace a Failed Sun Datacenter InfiniBand Switch 36
<NOTE:1636229.1> - How to Prepare an Infiniband Switch for Replacement
<NOTE:2125203.1> - Infiniband Switch Replacement - Follow-up Actions

  Copyright © 2018 Oracle, Inc.  All rights reserved.