Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2230797.1
Update Date:2017-03-15

Solution Type  Technical Instruction Sure

Solution  2230797.1 :   How to Replace a Big Data Appliance Faulty Infiniband HCA  


Related Items
  • Big Data Appliance X5-2 Starter Rack
  • Big Data Appliance X3-2 Hardware
  • Big Data Appliance X3-2 Full Rack
  • Big Data Appliance X3-2 In-Rack Expansion
  • Big Data Appliance X5-2 Full Rack
  • Big Data Appliance X4-2 Hardware
  • Big Data Appliance X4-2 Full Rack
  • Big Data Appliance X5-2 Hardware
  • Big Data Appliance Hardware
  • Big Data Appliance X4-2 Starter Rack
  • Big Data Appliance X5-2 In-Rack Expansion
  • Big Data Appliance X4-2 In-Rack Expansion
  • Big Data Appliance X6-2 Hardware
  • Big Data Appliance X3-2 Starter Rack

Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: partner FRU CAP

Applies to:

Big Data Appliance X3-2 In-Rack Expansion - Version All Versions and later
Big Data Appliance X4-2 Hardware - Version All Versions and later
Big Data Appliance X3-2 Hardware - Version All Versions and later
Big Data Appliance X4-2 Starter Rack - Version All Versions and later
Big Data Appliance X3-2 Full Rack - Version All Versions and later
Linux x86-64

Goal

How to Replace a Big Data Appliance Faulty Infiniband HCA

Solution

WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: BDA Trained

TIME ESTIMATE: 90 Minutes
TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:

PROBLEM OVERVIEW: A faulty InfiniBand card in a Big Data Appliance Server Node has been diagnosed as needing replacement.

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The instructions below assume the Customer's system administrator is available and working with the field engineer (FE) onsite to manage the host OS and BDA services. The steps are included here so that the FE has every required step on hand when onsite. The FE may perform them if the Customer's system administrator requests it, permits it, or needs help with them.

NOTE:
If a system uses custom non-default InfiniBand partitions [e.g., Exalogic (virtual/physical/hybrid), Exadata (virtual/physical), SuperCluster, BDA] then the HCA Port GUIDs might need to be updated in the InfiniBand partition(s) after replacing an HCA. 

Determine the switch running as MASTER.  From it, check for any custom, non-default IP partitions.

[root@bda01node05 ~]# sminfo
sminfo: sm lid 15 sm guid 0x10e0406d5aa0a0, activity count 26263191 priority 14 state 3 SMINFO_MASTER

[root@bda01node05 ~]# ibnetdiscover | grep 10e0406d5aa0a0
switchguid=0x10e0406d5aa0a0(10e0406d5aa0a0)
Switch 36 "S-0010e0406d5aa0a0" # "SUN DCS 36P QDR bdax01sw-ib1 xxx.xxx.171.24" enhanced port 0 lid 15 lmc 0

[root@bda01node05 ~]# ssh root@xxx.xxx.171.24

[root@bda01sw-ib1 ~]# smpartition list active
# Sun DCS IB partition config file
# This file is generated, do not edit
#! version_number : 0
Default=0x7fff, ipoib : ALL_CAS=full, ALL_SWITCHES=full, SELF=full;
SUN_DCS=0x0001, ipoib : ALL_SWITCHES=full;

If there are IB partitions other than default partitions, then refer to MOS note 1985159.1 for additional steps that will need to be taken before the old HCA is removed.
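
The lookup above can be partly scripted. The following is a minimal sketch, not part of the BDA or switch tooling: the function names are hypothetical, and it assumes that Default and SUN_DCS, as shown in the listing above, are the only default partitions and that each partition line starts with name=0x<pkey>.

```shell
# Sketch only: helper names are hypothetical, not BDA or switch commands.

# Print the GUID of the master subnet manager from `sminfo` output on stdin.
master_sm_guid() {
    sed -n 's/.*sm guid \(0x[0-9a-fA-F]*\).*/\1/p'
}

# Print the names of partitions found in `smpartition list active` output
# (stdin) other than Default and SUN_DCS. Assumes those two are the only
# default partitions. Prints nothing when no custom partition is present.
custom_partitions() {
    grep -E '^[A-Za-z0-9_]+=0x[0-9a-fA-F]+' \
        | cut -d= -f1 \
        | grep -v -x -e 'Default' -e 'SUN_DCS' || true
}

# Example use:
#   sminfo | master_sm_guid                     (on a server node)
#   smpartition list active | custom_partitions (on the master switch)
```

Note that ibnetdiscover prints the GUID without the leading 0x, so strip that prefix before using the value with grep. Any name printed by custom_partitions indicates that MOS note 1985159.1 should be consulted before the old HCA is removed.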

The server that contains the faulty InfiniBand card must have its services taken offline and be powered off. The Customer's system administrator should shut down the BDA services and the server node by following the shutdown instructions for Big Data Appliance in MOS Note 2099858.1.

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

1. Slide out the server for maintenance.

Do not remove any cables before sliding the server forward, or the loose cable ends will jam in the cable management arms. Make sure the cables and the Cable Management Arm (CMA) are moving properly. Refer to Note 1444683.1 for CMA handling training.

2. Disconnect the AC power cords.

3. Unlatch and slide off the top cover of the server.

4. Remove the IB HCA card from the server:

BDA (V1) Server Nodes:
These steps are relevant to BDA nodes based on the Sun Fire X4270 M2 server.
   a) Remove the two IB cables from the IB card in slot 3, below the HBA. Note which port each cable is connected to so that it can go back into the same port. The cables are already labeled to assist with this.
   b) Remove the back panel PCI crossbar:
      i) Loosen the two captive Phillips screws on each end of the crossbar.
      ii) Lift the PCI crossbar up and back to remove it from the chassis.
   c) Remove PCIe Riser 2, which contains the IB HCA card to be serviced:
      i) Loosen the captive screw holding the riser to the motherboard.
      ii) Lift up the riser and the PCIe cards that are attached to it as a unit.
   d) Extract the IB HCA card from the PCIe riser assembly and place it on an anti-static mat.

BDA X3-2 and X4-2 Server Nodes:
These steps are relevant to BDA nodes based on Sun Server X3-2L and Sun Server X4-2L.
   a) Remove the two IB cables from the IB HCA card in slot 3. Note which port each cable is connected to so that it can go back into the same port. The cables are already labeled to assist with this.
   b) Rotate the PCIe card slot 3 locking mechanism latch out to disengage the IB HCA card that has failed.
   c) Lift up and remove the IB HCA card from the server.
   d) Place the removed IB HCA card on an anti-static mat.

BDA X5-2 and X6-2 Server Nodes:
These steps are relevant to BDA nodes based on Oracle Server X5-2L and Oracle Server X6-2L.
   a) Swivel the air baffle into the upright position to allow access to the PCIe cards.
   b) Remove the two IB cables from the IB HCA card in slot 3. Note which port each cable is connected to so that it can go back into the same port. The cables are already labeled to assist with this.
   c) Rotate the PCIe card slot 3 locking mechanism latch out to disengage the IB HCA card that has failed.
   d) Lift up and remove the IB HCA card from the server.
   e) Place the removed IB HCA card on an anti-static mat.

5. Insert the new IB HCA card into PCIe slot 3 by reversing the removal instructions in Step 4.

6. Reconnect the cables to the InfiniBand card that you unplugged during the removal procedure, making sure they go back to their original ports. If unsure of the ports, follow the labeling on the cables and card.

7. Install the top cover.

8. Reconnect the AC power cords.

9. Slide the server back into the rack.

OBTAIN CUSTOMER ACCEPTANCE

- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

1. Once the ILOM has booted you will see a slow blink on the green LED for the server. Press the power button on the front of the server to power on the unit.

2. After the OS has booted, log in to the OS with ‘root’ privilege.

3. Verify the new InfiniBand card is seen by the OS and the InfiniBand links are up at 40Gbps:

# /usr/sbin/ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0010:e000:0159:c61d
base lid: 0x9
sm lid: 0x2
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: IB

Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0010:e000:0159:c61e
base lid: 0xa
sm lid: 0x2
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: IB
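
The same check can be expressed as a small filter over the ibstatus output. This is a sketch only: the function name ib_ports_healthy is hypothetical, and it assumes the output uses the field labels shown above.

```shell
# Sketch only: ib_ports_healthy is a hypothetical helper, not a BDA utility.
# Reads `ibstatus` output on stdin and succeeds (exit status 0) only if at
# least one port is listed and every port is ACTIVE, LinkUp, and at 40 Gb/sec.
ib_ports_healthy() {
    awk '
        /^Infiniband device/                   { ports++ }
        /state:/ && !/phys state:/ && /ACTIVE/ { active++ }
        /phys state:/ && /LinkUp/              { linkup++ }
        /rate:/ && /40 Gb\/sec/                { rate40++ }
        END {
            ok = (ports > 0 && active == ports && linkup == ports && rate40 == ports)
            exit (ok ? 0 : 1)
        }
    '
}

# Example use on the serviced node:
#   /usr/sbin/ibstatus | ib_ports_healthy && echo "IB ports OK"
```

A non-zero exit status only means that at least one port did not match the expected values. Read the full ibstatus output to see which port and which field differ.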

4. Verify the new InfiniBand card is seen by the BDA software:

# /opt/oracle/bda/bin/bdacheckhw

The command should return OK.

5. After the InfiniBand card is replaced, the client network VNICs and bondeth0 are no longer valid and must be recreated for the new card:

a) Run the following command:

# /opt/oracle/bda/network/BdaUserConfigEoib

This does not delete the "old" VNICs, because the MAC addresses have changed. However, a new VNIC is created and the client network functions with the new VNIC.

b) Delete the old VNICs that correspond to the replaced HCA from the switches, using the deletevnic command.

i. On the BDA server where the IB HCA card was replaced, find the current MAC addresses. For example:
# ifconfig eth8 |grep HW
HWADDR=*:*:*:*:*:x
# ifconfig eth9 |grep HW
HWADDR=*:*:*:*:*:y

The current MAC addresses are *:*:*:*:*:x and *:*:*:*:*:y, respectively.

ii. Log in to the two gateway switches, for example:

<rack>-sw-ib2
<rack>-sw-ib3

iii. On each switch, run:

# showvnics | grep <hostname>

The output is similar to the following:

From <rack>-sw-ib2:

# showvnics | grep bdanode0x

 139 UP N **** bdanode0x BDA <private-ip bdanode0x> 0000 *:*:*:*:*:y NO 0xffff 0A-ETH-3

From <rack>-sw-ib3:

# showvnics | grep bdanode0x

138 UP N **** bdanode0x BDA <private-ip bdanode0x> 0000 *:*:*:*:*:x NO 0xffff 0A-ETH-3

There should be two VNICs. Any VNIC whose MAC address does not match the current server MAC address for either eth8 or eth9 should be removed with the deletevnic command on the gateway switch.

iv. On the gateway switch, delete each VNIC returned by showvnics whose MAC address does not match the "new" HW address of either eth8 or eth9 on the server (as returned by ifconfig eth8 |grep HW and ifconfig eth9 |grep HW).

Use:

# deletevnic <VNIC Port> <VNIC ID>

For example, if the output from showvnics on the switch is as below (the state of the old VNIC will be WAIT-IOA, not UP):

# showvnics | grep bdanode0x

138 WAIT-IOA N **** bdanode0x BDA <private-ip bdanode0x> 0000 *:*:*:*:*:z NO 0xffff 0A-ETH-3

Because the MAC address above (*:*:*:*:*:z) does not match either address returned on the server (*:*:*:*:*:x or *:*:*:*:*:y), delete the VNIC with:

# deletevnic 0A-ETH-3 138
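
The comparison in steps i through iv can be sketched as a filter that prints, but does not run, the deletevnic command for each stale VNIC. Both function names below are hypothetical. The sketch assumes the showvnics column layout matches the sample above: VNIC ID in the first column, connector in the last column, and the MAC address as the only colon-separated hexadecimal field.

```shell
# Sketch only: first_mac and stale_vnic_commands are hypothetical helpers.

# Print the first MAC address found on stdin (for example, from `ifconfig eth8`).
first_mac() {
    grep -i -o -E '([0-9a-f]{2}:){5}[0-9a-f]{2}' | head -n 1
}

# Usage: stale_vnic_commands <eth8 MAC> <eth9 MAC> < showvnics-output
# Prints a deletevnic command for every VNIC line whose MAC address matches
# neither argument. Refuses to run if either MAC argument is empty, so that
# a failed lookup cannot mark every VNIC as stale.
stale_vnic_commands() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: stale_vnic_commands <eth8 MAC> <eth9 MAC>" >&2
        return 2
    fi
    awk -v m1="$1" -v m2="$2" '
        BEGIN { m1 = toupper(m1); m2 = toupper(m2) }
        {
            mac = ""
            # Locate the field made of colon-separated hex pairs (the MAC).
            for (i = 1; i <= NF; i++)
                if (toupper($i) ~ /^([0-9A-F][0-9A-F]:)+[0-9A-F][0-9A-F]$/)
                    mac = toupper($i)
            if (mac != "" && mac != m1 && mac != m2)
                print "deletevnic", $NF, $1
        }
    '
}
```

Save the output of showvnics | grep <hostname> from each gateway switch and feed it to stale_vnic_commands with the two current MAC addresses from the server. Review each printed command, and confirm the VNIC state is WAIT-IOA, before running it on the gateway switch.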

6. Verify the InfiniBand links are up:

# /opt/oracle/bda/bin/bdacheckib

7. Verify the VNIC and network links are up and operational:

# /opt/oracle/bda/bin/bdachecknet

8. Once the hardware is verified as up and running, the Customer's system administrator must verify that the BDA services are up by following the startup procedures for Big Data Appliance in MOS Note 2099858.1.
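
The hardware, InfiniBand, and network checks in steps 4, 6, and 7 can be run in sequence with a small wrapper. This is a sketch under one assumption that this note does not confirm: that each utility signals failure through a non-zero exit status. The function name run_checks is hypothetical. Read the printed output of each utility regardless of the exit status.

```shell
# Sketch only: run_checks is a hypothetical helper.
# Runs each command given as an argument, in order, and stops at the first
# one that exits with a non-zero status (assumed here to indicate failure).
run_checks() {
    for check in "$@"; do
        echo "Running $check"
        if ! "$check"; then
            echo "FAILED: $check" >&2
            return 1
        fi
    done
    echo "All checks passed"
}

# Example use (as root on the serviced node):
#   run_checks /opt/oracle/bda/bin/bdacheckhw \
#              /opt/oracle/bda/bin/bdacheckib \
#              /opt/oracle/bda/bin/bdachecknet
```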

References

<NOTE:1985159.1> - Updating IB partitions after replacing an Infiniband HCA in any nodes within IB network - steps to do after replacing HCA
<NOTE:2216163.1> - After Replacing an Infiniband HCA Card on Oracle Big Data Appliance bondeth0 Will Not be Up - Post-Configuration Steps are Required.
<NOTE:2099858.1> - Steps to Gracefully Shutdown and Power on a Single Node on Oracle Big Data Appliance Prior to Maintenance

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.