Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1448314.1
Update Date: 2018-04-05
Keywords:

Solution Type  Technical Instruction Sure

Solution  1448314.1 :   How to Replace a Failed InfiniBand (HCA) Card on an Exadata Compute Node (X2-8/X3-8)  


Related Items
  • Exadata Database Machine X2-8
  • Exadata X3-8 Hardware
  • Exadata X3-8b Hardware
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: FRU CAP

Applies to:

Exadata X3-8 Hardware - Version All Versions and later
Exadata Database Machine X2-8 - Version All Versions and later
Exadata X3-8b Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal

How to replace a failed InfiniBand (IB-HCA) card on an Exadata Compute Node (X2-8/X3-8).

Solution

DISPATCH INSTRUCTIONS
- WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED:

The FSE must be Exadata trained.

- TIME ESTIMATE: 60 minutes
- TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
- PROBLEM OVERVIEW: An InfiniBand HCA has failed in an Exadata X2-8 / X3-8 Compute Node. 

For Exadata Storage Server X2-8 refer to Note 1446404.1. For Exadata Storage Server X3-8 refer to Note 1670096.1.


- WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The system administrator should prepare the system for service by performing any application-related tasks required to shut down the compute or storage node. These may include, but are not limited to, taking a system backup, failing over applications or services, and finally shutting down the system.

- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE:

Identify the faulty card:

InfiniBand port layout/location:


EM 3.1:  Empty
EM 3.0:  ib7  ib6
EM 2.1:  Empty
EM 2.0:  ib5  ib4
EM 1.1:  Empty
EM 1.0:  ib3  ib2
EM 0.1:  Empty
EM 0.0:  ib1  ib0
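
When a fault is reported against a particular ibN interface, the layout above identifies the EM slot to service. The following is a minimal sketch of that lookup in Python. The names EM_SLOT_PORTS and em_slot_for are hypothetical and used only for illustration. The table is copied from the layout above and is not queried from the hardware. The mapping between the mlx4_N adapter names shown by ibstat and the EM slots is not stated in this document, so it is deliberately left out.

```python
# Port layout copied from the table above: each populated EM slot carries two ports.
# EM_SLOT_PORTS and em_slot_for are illustrative names, not part of any Oracle tool.
EM_SLOT_PORTS = {
    "EM 3.0": ("ib7", "ib6"),
    "EM 2.0": ("ib5", "ib4"),
    "EM 1.0": ("ib3", "ib2"),
    "EM 0.0": ("ib1", "ib0"),
}


def em_slot_for(interface):
    """Return the EM slot that carries the given ibN interface."""
    for slot, ports in EM_SLOT_PORTS.items():
        if interface in ports:
            return slot
    raise ValueError(f"{interface!r} is not in the documented port layout")


print(em_slot_for("ib5"))  # EM 2.0
```

The lookup raises ValueError for any interface name that does not appear in the layout, which avoids silently pointing the engineer at the wrong slot.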

See "How to Power on & off the Server" and "How to Remove a PCIe EM" in the "Sun Fire X4800 Server Service Manual".

  1. Power off the target node to Standby mode for service.
  2. Disconnect the InfiniBand cables from the IB card at the rear of the server.
  3. Remove the defective IB card (PCIe EM) and install the replacement.
  4. Reconnect any cables disconnected earlier.
  5. Slide the node back into the rack operating position.
  6. Power on the system either via ILOM or via the push button on the front of the server.

OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

The customer is responsible for verifying that the new component is functioning correctly. Some steps the customer may want to use for verification are as follows:


1. On the host where the card was replaced, run:

# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 2
Firmware version: 2.7.8130
Hardware version: b0
Node GUID: 0x0021280001a0e00c
System image GUID: 0x0021280001a0e00f
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 96
LMC: 0
SM lid: 10
Capability mask: 0x02510868
Port GUID: 0x0021280001a0e00d
Port 2:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 97
LMC: 0
SM lid: 10
Capability mask: 0x02510868
Port GUID: 0x0021280001a0e00e
CA 'mlx4_1'
CA type: MT26428
Number of ports: 2
Firmware version: 2.7.8130
Hardware version: b0
Node GUID: 0x0021280001a0e108
System image GUID: 0x0021280001a0e10b
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 94
LMC: 0
SM lid: 10
Capability mask: 0x02510868
Port GUID: 0x0021280001a0e109
Port 2:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 95
LMC: 0
SM lid: 10
Capability mask: 0x02510868
Port GUID: 0x0021280001a0e10a
CA 'mlx4_2'
CA type: MT26428
Number of ports: 2
Firmware version: 2.7.8130
Hardware version: b0
Node GUID: 0x0021280001a0dff8
System image GUID: 0x0021280001a0dffb
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 136
LMC: 0
SM lid: 10
Capability mask: 0x02510868
Port GUID: 0x0021280001a0dff9
Port 2:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 137
LMC: 0
SM lid: 10
Capability mask: 0x02510868
Port GUID: 0x0021280001a0dffa
CA 'mlx4_3'
CA type: MT26428
Number of ports: 2
Firmware version: 2.7.8130
Hardware version: b0
Node GUID: 0x0021280001a0e004
System image GUID: 0x0021280001a0e007
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 92
LMC: 0
SM lid: 10
Capability mask: 0x02510868
Port GUID: 0x0021280001a0e005
Port 2:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 93
LMC: 0
SM lid: 10
Capability mask: 0x02510868
Port GUID: 0x0021280001a0e006


Ensure that ALL ports report:
State: "Active"
Physical state: "LinkUp"
Rate: "40"
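
With four adapters and eight ports, checking the ibstat output by eye is error-prone. The following is a minimal sketch, in Python, of how the three conditions above could be checked against captured ibstat text. The function name find_bad_ports is hypothetical. The sample text is made up for illustration (port 2 is deliberately shown as unhealthy) and is not output from a real system. The parser assumes the "CA '...'", "Port N:" and "Field: value" line layout shown in the example above.

```python
import re


def find_bad_ports(ibstat_text, expected_rate="40"):
    """Return (ca, port, field, value) for every port field that is not healthy."""
    expected = {"State": "Active", "Physical state": "LinkUp", "Rate": expected_rate}
    problems = []
    ca = port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        match = re.match(r"CA '(.+)'$", line)
        if match:  # start of a new adapter section
            ca, port = match.group(1), None
            continue
        match = re.match(r"Port (\d+):$", line)
        if match:  # start of a new port section
            port = int(match.group(1))
            continue
        if port is None or ":" not in line:
            continue  # adapter-level fields are not checked
        field, value = (part.strip() for part in line.split(":", 1))
        if field in expected and value != expected[field]:
            problems.append((ca, port, field, value))
    return problems


# Made-up sample: port 1 is healthy, port 2 is not.
sample = """CA 'mlx4_0'
CA type: MT26428
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Port 2:
State: Down
Physical state: Polling
Rate: 10
"""

for problem in find_bad_ports(sample):
    print(problem)
```

An empty result means every port in the supplied text met all three conditions. On a live node the text could be captured from the ibstat command and passed to the same function.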



2. Run the InfiniBand topology verification tool (example from a fully operational system):


[root@cn01 ~]# /opt/oracle.SupportTools/ibdiagtools/verify-topology
[ DB Machine Infiniband Cabling Topology Verification Tool ]
Is every external switch connected to every internal switch......[SUCCESS]
Are any external switches connected to each other................[SUCCESS]
Are any hosts connected to spine switch..........................[SUCCESS]
Check if all hosts have 2 CAs to different switches..............[SUCCESS]
Leaf switch check: cardinality and even distribution.............[SUCCESS]
Check if each rack has an valid internal ring....................[SUCCESS]
[root@cn01 ibdiagtools]#

For a Quarter Rack or Half Rack, use the "-t" option to specify the topology.
Example: ./verify-topology -t quarterrack
Example: ./verify-topology -t halfrack
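
The verify-topology result can also be screened programmatically. The following is a minimal sketch in Python that lists every check line whose status is anything other than SUCCESS. The function name failed_checks is hypothetical, and the parser assumes the "check text....[STATUS]" line layout shown in the example above. The "[ERROR]" status in the sample is a placeholder chosen for illustration; the exact non-success wording printed by the tool is an assumption and should be confirmed on a real system.

```python
import re


def failed_checks(output):
    """Return (check, status) for every check line whose status is not SUCCESS."""
    # Each check line is "<text>" followed by a run of dots and "[STATUS]".
    results = re.findall(r"^(.*?)\.{2,}\[(\w+)\]\s*$", output, flags=re.MULTILINE)
    return [(name.strip(), status) for name, status in results if status != "SUCCESS"]


# Sample text: the first two lines follow the example above,
# the third uses a placeholder non-success status for illustration.
sample_output = """[ DB Machine Infiniband Cabling Topology Verification Tool ]
Is every external switch connected to every internal switch......[SUCCESS]
Are any hosts connected to spine switch..........................[SUCCESS]
Check if all hosts have 2 CAs to different switches..............[ERROR]
"""

for check, status in failed_checks(sample_output):
    print(f"{status}: {check}")
```

An empty list corresponds to the fully operational example shown above, where every check reports SUCCESS.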




PARTS NOTE:
Oracle Exadata X2-8
Oracle Exadata X3-8


REFERENCE INFORMATION:

Exadata Documentation
Sun Fire X4800 Server Service Manual


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.