Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-2218443.1
Update Date:2017-12-20

Solution Type  Predictive Self-Healing Sure

Solution  2218443.1 :   How to Prepare an Exalogic Infiniband Switch for Replacement (Pre-checks & Backup)  


Related Items
  • Exalogic Elastic Cloud X3-2 Eighth Rack
  • Exalogic Elastic Cloud X4-2 Half Rack
  • Oracle Exalogic Elastic Cloud Software
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Purpose
Scope
Details
 1. Checks needed prior to Dispatch of Part/Onsite
 1.1. Check the IB Fabric to ensure resilience to later booting (refer to companion KB article)
 Prerequisite
 IB Fabric validation
 1.2 Check/update the configuration backup
 1.3 Check the firmware version of the Switch & Validate if it is available for download in MOS
 1.4 Check for presence of workaround firewall rule on port 623
 1.5 For Exalogic Virtual & Exalogic Hybrid racks (mix of physical & virtual), remove the asset from Exalogic Control (This Step Only Applies to Virtual & Hybrid racks)
 2. Complete the check-list template – IB Switch preparation for Replacement
 3. Provide a report on the replacement pre-checks to Oracle Support including outage type
 4. Final pre-Dispatch preparation (IB Switch replacement in production IB Fabric)
 4.1 Disable SM on the switch being replaced
 4.2 Check if running ASR / block alerts if so
 4.3 Update in MOS that step 4 actions are completed
 5. Contact Oracle Support for Part/Onsite Dispatch
 6. Replace the switch and perform Follow-up actions
References


Applies to:

Oracle Exalogic Elastic Cloud Software - Version 2.0.0.0.0 and later
Exalogic Elastic Cloud X3-2 Eighth Rack
Exalogic Elastic Cloud X4-2 Half Rack
Oracle Solaris on x86-64 (64-bit)
Linux x86-64
Oracle Virtual Server x86-64

Purpose

1. Prepare an Infiniband Switch in Exalogic rack for replacement

2. Collect information required by Oracle Support / On-site team prior to Dispatch of a replacement IB Switch part for Exalogic rack.

Scope

This document distribution is EXTERNAL since it needs to be shared with and used by the Customer-admin, as well as referenced by Partners, Field Engineers, and Oracle Support.

Once Oracle Support has confirmed the need for an IB Switch replacement (the replacement will be performed by an Oracle Field Engineer), the Customer-admin is requested to confirm in the MOS SR notes that all the pre-check actions in this document have been completed, by copying and pasting the requested checklist(s) into the SR, before Oracle dispatches the Part for replacement.

Oracle Support will not Dispatch until these checks have been performed, unless the customer has given a clear statement acknowledging the risks of not performing these checks prior to replacement.

Details

1. Checks needed prior to Dispatch of Part/Onsite

1.1. Check the IB Fabric to ensure resilience to later booting (refer to companion KB article)

Prerequisite

The fabric check commands in <Note 2211261.1> are run against the switches. If the lifetime on the problem switch is already very low (or negative), running those commands against it may cause a further failure that leaves the switch inaccessible, at which point backups can no longer be taken from it. It is therefore recommended to first copy the /conf/bx.conf file from the problem switch to a local directory or NFS share on a Compute Node, because /conf/bx.conf has unique values and is the most difficult file to recreate once it is no longer accessible.

For example, from one of the Compute Nodes run the command below to copy the /conf/bx.conf file from the switch to the /tmp directory on the Compute Node:

scp <switch_ip>:/conf/bx.conf /tmp/bx.conf.saved.<switch_ip>

If the problem switch is not healthy enough for the scp command to work, run "cat /conf/bx.conf" on the switch and use copy and paste to create the file on one of the Compute Nodes in the rack.
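The copy-with-fallback step described above could be scripted roughly as follows. This is a minimal sketch, not an Oracle-supplied script: the function names, the `root` login, and the `/tmp` default are assumptions made for illustration, and the `ssh ... cat` fallback is an automated stand-in for the manual copy and paste. Only the /conf/bx.conf path and the bx.conf.saved.<switch_ip> naming come from this note.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of any Oracle tool.

# Print the local file name used for the saved copy of a switch's bx.conf.
bx_conf_dest() {
    switch_ip=$1
    dest_dir=${2:-/tmp}
    printf '%s/bx.conf.saved.%s\n' "$dest_dir" "$switch_ip"
}

# Try scp first; if that fails, fall back to reading the file with cat over ssh.
# ASSUMPTION: root login to the switch is permitted from this compute node.
save_bx_conf() {
    switch_ip=$1
    dest=$(bx_conf_dest "$switch_ip" "${2:-/tmp}")
    if scp "root@${switch_ip}:/conf/bx.conf" "$dest"; then
        echo "saved via scp: $dest"
    elif ssh "root@${switch_ip}" cat /conf/bx.conf > "$dest" && [ -s "$dest" ]; then
        echo "saved via ssh cat: $dest"
    else
        # Remove the empty file so it is not mistaken for a valid backup.
        rm -f "$dest"
        echo "could not copy bx.conf from $switch_ip" >&2
        return 1
    fi
}
```

For example, `save_bx_conf <switch_ip>` would write /tmp/bx.conf.saved.<switch_ip>. The saved file should still be inspected by hand before the fabric checks are run.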

NOTE:

There may be instances where the Switch being replaced is already inaccessible and Fabric check commands in <Note 2211261.1> cannot be executed on the switch that is being replaced. In these situations please proceed with Fabric check commands in <Note 2211261.1> on other Switches which are accessible in the fabric.

IB Fabric validation

Perform the checks/actions in the following document on all the Exalogic IB switches and confirm to Oracle Support that these checks/actions have been done:

<Note 2211261.1>: How to Prepare an Exalogic Infiniband (IB) Fabric for Planned Outage of an IB Switch

NOTE: Regardless of whether the IB switch being replaced is active or not (hung, non-responsive, or down), the checks in <Note 2211261.1> must be performed. This ensures that the IB Fabric is configured correctly, to minimize the risk of any outage when the replacement IB switch comes up.

 IMPORTANT NOTE

For Exalogic Virtual racks running April 2017 PSU 2.0.6.2.170418, refer to the MOS Note below, which describes an important known issue where Compute Nodes lose IB networks when the NM2-GW Switch unexpectedly reboots or becomes unavailable.

<NOTE 2282480.1>: Exalogic April 2017 Virtual PSU (2.0.6.2.170418): Compute Nodes Lose IB Networks When NM2-GW IB Switch Unexpectedly Reboots or Becomes Unavailable

 

1.2 Check/update the configuration backup

Make sure a backup of the switches has been completed before the replacement. Backups can be taken periodically for all the hardware components in the rack. It is highly recommended to do this on a regular schedule, so that the backups can be used for recovery in case of catastrophic failure. The following sections describe how to back up the Exalogic InfiniBand switches.

Taking backups with ExaBR is the recommended approach for backing up the infrastructure components of Exalogic. If a switch cannot be backed up with ExaBR because of problems on the switch (for example, scp commands not working), the backup can be taken manually.

The procedures for backing up the switches with ExaBR and manually follow. First try ExaBR, as described in the section "Backup of Switches using Exabr". If that does not work, take the backups manually using the steps in the section "Manual Backup of Switches".

Backup of Switches using Exabr

For backing up the Switches using ExaBR refer to following documentation, section "3.3.1 Backing Up InfiniBand Switches:"
http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR117

Once the ExaBR backups have been taken for the switches, check that the status of the backup is OK using the command below.

./exabr list <Switch>

You should see the status of latest exabr backup taken as OK if the backup completed successfully.

Manual Backup of Switches

  1. Manually copy the following configuration files from the Exalogic IB switches to a location on one of the compute nodes.
    /etc/opensm/opensm.conf
    /conf/bx.conf
    /conf/bxm.conf
    /conf/partitions.current
    /conf/smnodes
  2. Take an ILOM snapshot backup of the switches by following the steps in the documentation below:

    For Infiniband Switch 36 (nm2-36P):

    Back Up the Configuration (CLI)

    or,

    Back Up the Configuration (Web)

    For Infiniband Gateway Switch (nm2-GW):

    Back Up the Configuration (CLI)

    or,

    Back Up the Configuration (Web)


If the switch being replaced is non-responsive or otherwise unable to boot, check whether a recent ExaBR configuration backup exists (it must have been taken after any previous change in the IB Fabric or prior to the patching). If any recent full exalogs were captured from the rack, the configuration files listed under the manual backup will be inside the full exalogs zip bundle.

1.3 Check the firmware version of the Switch & Validate if it is available for download in MOS

Check the firmware version of the switch that is being replaced, to make sure that this firmware version is available for download (it will need to be applied to the replacement switch). To check the firmware version, log in to the switch that is being replaced and run the following command:

# version

Here is a sample output:

# version

SUN DCS gw version: 2.1.8-4 <<<<<<<< firmware version

Build time: ...

FPGA version: ...

SP board info: ...
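When the version has to be recorded for several switches, the firmware string can be pulled out of the `version` output with a small filter. The sketch below assumes the relevant line has the form "SUN DCS <model> version: <version>", as in the sample output above. The function name is hypothetical, and whether `version` can be run over a non-interactive ssh session on a given switch is also an assumption.

```shell
#!/bin/sh
# Read `version` command output on stdin and print only the firmware version.
# ASSUMPTION: the line of interest looks like "SUN DCS gw version: 2.1.8-4".
extract_fw_version() {
    sed -n 's/^SUN DCS .* version: *\([0-9][0-9A-Za-z.-]*\).*$/\1/p' | head -n 1
}
```

For example, saving the `version` output to a file and running `extract_fw_version < version.out` prints only the version string, which can then be compared with the releases listed in MOS.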

Validate whether the firmware version of the Switch is available for download in MOS using below steps.

  • Log in to MOS and click the "Patches & Updates" tab.
  • Click the "Product or Family" option under Patch Search.
  • Select "InfiniBand Gateway Switch" in the Product dropdown and check whether the switch firmware version is available for download in the "Release" dropdown.

1.4 Check for presence of workaround firewall rule on port 623

Check whether a workaround firewall rule that blocks incoming requests on port 623 is implemented on the IB switch. Refer to the following MOS Note for more details.

<Note 2023539.1>: IB Switch Messages Wrapping with "Possible SYN Flooding On Port 623" 

For this check, run the command below on the IB Switch being replaced.

# iptables -L -n

If the output of the above command contains an entry for port 623 like the following:

DROP udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:623 length 29:31

then the workaround was implemented. Make a note that this rule must be restored after the replacement / re-image / restore.
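The same check can be expressed as a filter over saved `iptables -L -n` output, which is convenient when the result has to be recorded in the checklist. This is a minimal sketch; the function name is hypothetical, and the pattern is based only on the sample rule line shown above.

```shell
#!/bin/sh
# Read `iptables -L -n` output on stdin; exit 0 if a DROP rule for UDP
# destination port 623 is present, non-zero otherwise.
has_port623_drop_rule() {
    grep -Eq '^DROP[[:space:]]+udp[[:space:]].*dpt:623([[:space:]]|$)'
}
```

On the switch, `iptables -L -n | has_port623_drop_rule && echo "port 623 workaround present"` would print the message only when the rule exists.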

1.5 For Exalogic Virtual & Exalogic Hybrid racks (mix of physical & virtual), remove the asset from Exalogic Control (This Step Only Applies to Virtual & Hybrid racks)

If this is an Exalogic Virtual rack, or an Exalogic Hybrid rack that runs a combination of Physical and Virtual stack PSUs, the switch must be removed from Exalogic Control (EMOC) before it is replaced. This can be done either manually or using the ExaBR utility. The instructions are in the documentation sections "Recovering the InfiniBand Switches in a Virtual Environment" (manual) and "Replacing InfiniBand Switches in a Virtual Environment" (using ExaBR):

To do this task using ExaBR, refer to the following documentation, section "3.3.3 Replacing InfiniBand Switches in a Virtual Environment". Follow steps 1 and 2.
http://docs.oracle.com/cd/E18476_01/doc.220/e36329/infra.htm#ELFLR173

To do this task manually, refer to section 3.3.3 of the following documentation. Follow step 1.

http://docs.oracle.com/cd/E18476_01/doc.220/e40226/infra.htm#ELXBR173

2. Complete the check-list template – IB Switch preparation for Replacement

Answer yes/no

  1. IB Fabric checks/actions in the following document have been completed and are being provided to Oracle Support for review? yes/no? _____

    <Note 2211261.1>: How to Prepare an Exalogic Infiniband (IB) Fabric for Planned Outage of an IB Switch
  2. Configuration backup/restoration strategy Plan.

    Has a backup of the switches been taken as described in the above section "1.2 Check/update the configuration backup"? yes/no? _____

  3. Has the switch firmware version been verified as described in the above section "1.3 Check the firmware version of the Switch"? yes/no? _____

  4. Workaround firewall rule for port 623 is present in iptables? yes/no? _____

3. Provide a report on the replacement pre-checks to Oracle Support including outage type

Update the SR that is open for the IB Switch replacement with the following information for Oracle Support:

  1. Checklist information in above section "2. Complete the check-list template – IB Switch preparation for Replacement".

  2. Checklist template information from <Note 2211261.1>, section "3. Complete the check-list template – IB Fabric preparation for IB Switch planned outage."

    If the answer was “No” to any of the checks in the IB Fabric doc, then confirm in writing in an update in the SR in MOS, that a full downtime of the IB Fabric is planned.

  3. Data Collected as part of IB fabric validation in <Note 2211261.1>, section "4. Data Collection"

If it has been determined that a full downtime of the IB Fabric will NOT be required, then the actions in step 4 below must also be completed, prior to Dispatch. This is to ensure that when the IB Switch is powered down, any resulting problems in the IB Fabric can be rectified by the Admin team with the assistance of Oracle Support, prior to the arrival of On-site team. Wait for confirmation from Oracle Support before proceeding to the next steps.

4. Final pre-Dispatch preparation (IB Switch replacement in production IB Fabric)

If there will NOT be a full outage of the IB Fabric, then the following steps 4.1 through 4.3 MUST be completed now, before the Part/Onsite is dispatched by Oracle Support:

4.1 Disable SM on the switch being replaced

Since you are going to replace this switch, disable SM on the switch being replaced:

# disablesm

NOTE: If switch is not accessible through the management port, skip this step.

If this switch is the current Master, the above command causes the Master role to move to another switch. If all configurations are correct and follow the standard, this has no effect on operation. Monitor the system and confirm that everything is working well, then check which switch is the current Master by running the following command on any IB leaf switch:

# sminfo
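To compare the Master before and after disabling SM, the `sminfo` output can be reduced to its LID, GUID, and state. The sketch below is an assumption-heavy illustration: this note does not show `sminfo` output, so the input form in the comment is the one typically printed by the infiniband-diags `sminfo` tool and should be checked against the actual output on the switch. The function name is hypothetical.

```shell
#!/bin/sh
# Read one line of `sminfo` output on stdin and print "<lid> <guid> <state>".
# ASSUMED input form (not quoted from this note):
#   sminfo: sm lid 1 sm guid 0x..., activity count N priority P state 3 SMINFO_MASTER
parse_sminfo() {
    sed -n 's/^sminfo: sm lid \([0-9][0-9]*\) sm guid \(0x[0-9a-fA-F]*\),.* \(SMINFO_[A-Z]*\).*$/\1 \2 \3/p'
}
```

Running `sminfo | parse_sminfo` before and after the change, and confirming that the reported GUID no longer belongs to the switch being replaced, gives a simple record for the SR. If the line does not match the assumed form, the function prints nothing, and the raw `sminfo` output should be used instead.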


NOTE:

If you are using OL 6 Guest vServers, you may run into a known issue where EoIB network bonds stop working when an SM Master failover occurs. Refer to the MOS Note below for details and the fix for this issue.

<Note 2311001.1>: Exalogic Virtual: EoIB Network Bonds Do Not Work On OL6 Guest vServers When SM Master Failover Happens On the IB Switches 

4.2 Check if running ASR / block alerts if so

Check whether ASR is currently active on your IB Switch. State whether ASR is running on the IB Switch in your preparation report to Oracle Support.

4.3 Update in MOS that step 4 actions are completed

Update the MOS SR that you have open for the switch replacement to state that all actions in Steps 4.1 and 4.2 above have been completed successfully and that you are ready for Dispatch.

5. Contact Oracle Support for Part/Onsite Dispatch

Once Oracle Support has reviewed the check-lists and information provided in steps 3 & 4 above, Oracle Support will contact you to confirm, and will request the details of the outage window for the change in order to create the Field task so that the Part/FE can be dispatched.

INTERNAL NOTE to Oracle Systems Support TSE

When Dispatching the replacement IB Switch Part, try to order the same part# that the customer is currently using - part# is determinable from the ILOM Snapshot and/or showfruinfo command. If ordering the newer Part# switches, be sure that customer will be able to use the later firmware version, as per following Note.

<Note 2187802.1>: Infiniband Switch - Firmware Downgrade To 2.1.6 Fails With Error: Cannot proceed with downgrade on this SP

6. Replace the switch and perform Follow-up actions

Once the new Part is received, at the time of the outage the On-site Field team follows the relevant "How to Replace" action-plan documents listed in the MOS Note below, under section "B. Physical Replacement of Switch by On-site Field team", depending on the model of IB Switch.

<Note 2223662.1>: Master Note For Exalogic Infiniband Switch Replacement – Overview and guide to key articles

*After* the replacement, Customer-Admin will need to continue with the Follow-up Switch restoration actions documented in following MOS Note:

<Note 2218689.1>: Exalogic Infiniband Switch Replacement - Follow-up Actions (Restoration)

 

References

<NOTE:2218689.1> - Exalogic Infiniband Switch Replacement - Follow-up Actions (Restoration)
<NOTE:2211261.1> - How to Prepare an Exalogic Infiniband (IB) Fabric for Planned Outage of an IB Switch
<NOTE:2223662.1> - Master Note For Exalogic Infiniband Switch Replacement – Overview and guide to key articles
<NOTE:2282480.1> - Exalogic April 2017 Virtual PSU (2.0.6.2.170418): Compute Nodes Lose IB Networks When NM2-GW IB Switch Unexpectedly Reboots or Becomes Unavailable
<NOTE:2311001.1> - Exalogic Virtual: EoIB Network Bonds Do Not Work On OL6 Guest vServers When SM Master Failover Happens On the IB Switches

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.