Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution
2230797.1 : How to Replace a Big Data Appliance Faulty Infiniband HCA
Oracle Confidential PARTNER - Available to partners (SUN).

Applies to:
Big Data Appliance X3-2 In-Rack Expansion - Version All Versions and later
Big Data Appliance X4-2 Hardware - Version All Versions and later
Big Data Appliance X3-2 Hardware - Version All Versions and later
Big Data Appliance X4-2 Starter Rack - Version All Versions and later
Big Data Appliance X3-2 Full Rack - Version All Versions and later
Linux x86-64

Goal
How to Replace a Big Data Appliance Faulty Infiniband HCA

Solution
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: BDA Trained
TIME ESTIMATE: 90 Minutes
FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:
NOTE:
If a system uses custom, non-default InfiniBand partitions [e.g., Exalogic (virtual/physical/hybrid), Exadata (virtual/physical), SuperCluster, BDA], then the HCA Port GUIDs might need to be updated in the InfiniBand partition(s) after replacing an HCA. Determine which switch is running as MASTER. From it, check for any custom, non-default IB partitions.
[root@bda01node05 ~]# sminfo
sminfo: sm lid 15 sm guid 0x10e0406d5aa0a0, activity count 26263191 priority 14 state 3 SMINFO_MASTER
[root@bda01node05 ~]# ibnetdiscover | grep 10e0406d5aa0a0
switchguid=0x10e0406d5aa0a0(10e0406d5aa0a0)
Switch 36 "S-0010e0406d5aa0a0" # "SUN DCS 36P QDR bdax01sw-ib1 xxx.xxx.171.24" enhanced port 0 lid 15 lmc 0
[root@bda01node05 ~]# ssh root@xxx.xxx.171.24
[root@bda01sw-ib1 ~]# smpartition list active
If there are IB partitions other than the default partitions, refer to MOS Note 1985159.1 for additional steps that must be taken before the old HCA is removed.

The server that contains the faulty InfiniBand card should have its services offlined and the system powered off. The Customer's system administrator should shut down the server node and BDA services following the shutdown instructions for Big Data Appliance detailed in MOS Note 2099858.1.

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:
Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arms. Take care to ensure the cables and Cable Management Arm are moving properly. Refer to Note 1444683.1 for CMA handling training.
2. Disconnect the AC power cords.
BDA (V1) Server Nodes:
5. Insert the new IB HCA card into PCIe slot 3. Reverse the removal instructions in Step 4.
6. Reconnect the cables to the InfiniBand card that you unplugged during the removal procedure, making sure they go back to their original ports. If unsure of the ports, follow the labeling on the cables and card.
7. Install the top cover.
8. Install the AC power cords.
9. Slide the server back into the rack.

OBTAIN CUSTOMER ACCEPTANCE - WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:
1. Once the ILOM has booted, you will see a slow blink on the green LED for the server. Press the power button on the front of the server to power on the unit.
# /usr/sbin/ibstatus
Infiniband device 'mlx4_0' port 2 status:
4. Verify the new InfiniBand card is seen by the BDA software:
# /opt/oracle/bda/bin/bdacheckhw
Should return OK.
5. After InfiniBand card replacement, the client network VNICs and bondeth0 are no longer valid and must be recreated for the new card:
a) Run the following command:
# /opt/oracle/bda/network/BdaUserConfigEoib
This will not delete the "old" VNICs because the MAC addresses have changed. However, a new VNIC will be created and the client network will function with the new VNIC.
b) Delete the old VNICs, which correspond to the replaced HCA, from the switches. Do so with the deletevnic command.
i. On the BDA server where the IB HCA card was replaced, find the current MAC addresses. For example:
# ifconfig eth8 |grep HW
HWADDR=*:*:*:*:*:x
# ifconfig eth9 |grep HW
HWADDR=*:*:*:*:*:y
The current MAC addresses are *:*:*:*:*:x and *:*:*:*:*:y, respectively.
ii. Log into the 2 gateway switches, e.g.
<rack>-sw-ib2
<rack>-sw-ib3
iii. On each, run:
# showvnics | grep <hostname>
Output is like:
From <rack>-sw-ib2:
# showvnics | grep bdanode0x
139 UP N **** bdanode0x BDA <private-ip bdanode0x> 0000 *:*:*:*:*:y NO 0xffff 0A-ETH-3
From <rack>-sw-ib3:
# showvnics | grep bdanode0x
138 UP N **** bdanode0x BDA <private-ip bdanode0x> 0000 *:*:*:*:*:x NO 0xffff 0A-ETH-3
There should be 2 VNICs. Any VNIC that does not match the current server MAC address for either eth8 or eth9 should be removed with the deletevnic command on the gateway switch.
iv. Based on the details in the above steps, delete the VNIC returned by showvnics on the gateway switch which does not match the "new" HW address for either eth8 or eth9 on the server (as returned by ifconfig eth8 |grep HW and ifconfig eth9 |grep HW). Use:
# deletevnic <VNIC Port> <VNIC ID>
For example, if the output from showvnics on the switch is as below (the state of the old VNIC will be WAIT-IOA, not UP):
# showvnics | grep bdanode0x
138 WAIT-IOA N **** bdanode0x BDA <private-ip bdanode0x> 0000 *:*:*:*:*:z NO 0xffff 0A-ETH-3
Where the MAC address above (*:*:*:*:*:z) does not match the output returned on the server (*:*:*:*:*:x or *:*:*:*:*:y), delete the VNIC with:
# deletevnic 0A-ETH-3 138
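The comparison in step 5 b) can be sketched in Python as a cross-check before any VNIC is deleted by hand. This is a minimal sketch, not an Oracle-supplied tool: the function name stale_vnic_commands, the sample MAC addresses, GUIDs, IP addresses, and the node name bdanode0y are all hypothetical, and it assumes that each showvnics line starts with the VNIC ID, ends with the connector, and contains exactly one colon-separated MAC address, as in the output shown above. It only prints candidate deletevnic commands and does not connect to the switch.

```python
import re

# Six colon-separated hex octets, e.g. CE:4C:23:85:2B:0A
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def stale_vnic_commands(showvnics_text, hostname, current_macs):
    """Return deletevnic commands for this host's VNICs whose MAC
    address is not one of the MACs currently reported by the server."""
    current = {mac.upper() for mac in current_macs}
    commands = []
    for line in showvnics_text.splitlines():
        fields = line.split()
        # Only consider rows for the node whose HCA was replaced.
        if hostname not in fields or not fields[0].isdigit():
            continue
        match = MAC_RE.search(line)
        if match is None:
            continue
        vnic_id, connector = fields[0], fields[-1]
        if match.group(0).upper() not in current:
            commands.append(f"deletevnic {connector} {vnic_id}")
    return commands


# Hypothetical showvnics output captured from one gateway switch.
sample = """\
138 WAIT-IOA N 0021280001CF4C23 bdanode0x BDA 192.168.10.5 0000 CE:4C:23:85:2B:0A NO 0xffff 0A-ETH-3
140 UP N 0021280001EF1234 bdanode0x BDA 192.168.10.5 0000 CE:12:34:85:2B:0B NO 0xffff 0A-ETH-3
141 UP N 0021280001AB5678 bdanode0y BDA 192.168.10.6 0000 CE:AB:56:85:2B:0D NO 0xffff 0A-ETH-3
"""

# Hypothetical MACs for eth8 and eth9 as reported on the server.
commands = stale_vnic_commands(
    sample, "bdanode0x", ["ce:12:34:85:2b:0b", "ce:12:35:85:2b:0c"]
)
for command in commands:
    print(command)
```

With the sample data, only VNIC 138 is reported, because its MAC address is not among the server's current MAC addresses; the row for the other node is ignored. Review each suggested command against the showvnics output before running it on the gateway switch.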
6. Verify the InfiniBand links are up:
# /opt/oracle/bda/bin/bdacheckib
7. Verify the VNIC and network links are up and operational:
# /opt/oracle/bda/bin/bdachecknet
8. Once the hardware is verified as up and running, the Customer's system administrator will need to verify the BDA services are up, following the startup procedures for Big Data Appliance detailed in MOS Note 2099858.1.

References
<NOTE:1985159.1> - Updating IB partitions after replacing an Infiniband HCA in any nodes within IB network - steps to do after replacing HCA
<NOTE:2216163.1> - After Replacing an Infiniband HCA Card on Oracle Big Data Appliance bondeth0 Will Not be Up - Post-Configuration Steps are Required.
<NOTE:2099858.1> - Steps to Gracefully Shutdown and Power on a Single Node on Oracle Big Data Appliance Prior to Maintenance

Attachments
This solution has no attachment