Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1446404.1
Update Date: 2018-04-08

Solution Type  Technical Instruction Sure

Solution  1446404.1 :   How to Replace a Failed InfiniBand (HCA) Card on a V2/X2-2/X2-8 Exadata Storage Server or V2/X2-2 Compute Node  


Related Items
  • SPARC SuperCluster T4-4 Full Rack
  • Exadata Database Machine X2-2 Qtr Rack
  • Exadata Database Machine X2-2 Full Rack
  • Exadata Database Machine X2-8
  • Exadata Database Machine X2-2 Half Rack
  • SPARC SuperCluster T4-4 Half Rack
  • Exadata Database Machine X2-2 Hardware
  • SPARC SuperCluster T4-4
  • Exadata Database Machine V2
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: FRU CAP

Applies to:

Exadata Database Machine X2-2 Hardware - Version All Versions and later
Exadata Database Machine X2-2 Qtr Rack - Version All Versions and later
Exadata Database Machine V2 - Version All Versions and later
Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Exadata Database Machine X2-2 Half Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

How to replace a failed InfiniBand (IB-HCA) card on an Exadata Storage Server or V2/X2-2 Compute Node.

Solution

DISPATCH INSTRUCTIONS
- WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED:

The field service engineer (FSE) must be Exadata trained.

- TIME ESTIMATE: 60 minutes
- TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
- PROBLEM OVERVIEW: An InfiniBand HCA has failed in an Exadata Storage Server or V2/X2-2 Compute Node. For X2-8 Compute Nodes, refer to Note 1448314.1.
- WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The system administrator should prepare the system for service by performing any application-related tasks required to shut down the compute or storage node. These may include, but are not limited to, a system backup, failover of applications or services, and finally a system shutdown.

- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE:

See the "Servicing PCIe Cards" section of the "Sun Fire X4170 M2 Server Service Manual".

Refer to Doc ID 1539451.1 for instructions to shut down and restart a database node, or to Doc ID 1188080.1 for instructions to shut down and restart a storage cell.

  1. Extend the rack stabilizing bars before pulling any server out for service.
  2. Power off the target node.
  3. Disconnect the power cord from the node.
  4. Disconnect the InfiniBand cables from the IB card at the rear of the server.
  5. Slide the target node out to the service position.
  6. Remove the top cover.
  7. Locate and remove the PCIe riser that holds the IB card.
  8. Remove the defective IB card and install the replacement.
  9. Reinstall the PCIe riser.
  10. Install the top cover.
  11. Reconnect any cables disconnected earlier.
  12. Slide the node back into the rack operating position.
  13. Retract the stabilizing bars.
  14. Power on the system, either through ILOM or with the power button on the front of the server.

OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

The system administrator should verify that the system is functioning correctly. Suggested verification steps:


1. On the host where the card was replaced, run:

# ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.7.0
    Hardware version: a0
    Node GUID: 0x00212800013e6c22
    System image GUID: 0x00212800013e6c25
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 26
        LMC: 0
        SM lid: 10
        Capability mask: 0x02510868
        Port GUID: 0x00212800013e6c23
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 25
        LMC: 0
        SM lid: 10
        Capability mask: 0x02510868
        Port GUID: 0x00212800013e6c24


Ensure that both Port 1 and Port 2 report:
State: "Active"
Physical state: "LinkUp"
Rate: "40"
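
The three values above can also be checked mechanically. The following is a minimal sketch, not part of the official procedure: it reads ibstat output on standard input and lists any port that is not Active / LinkUp / 40. The function name check_ibstat_ports is hypothetical, and the parsing assumes the field labels shown in the sample output above.

```shell
#!/bin/sh
# Sketch: report every port in `ibstat` output that is not
# State=Active, Physical state=LinkUp, Rate=40.
# check_ibstat_ports is a hypothetical helper name, not an Oracle tool.
check_ibstat_ports() {
    awk '
        BEGIN { n = 0; bad = 0 }
        # Strip leading whitespace so indented and flattened output both match.
        { sub(/^[ \t]+/, "") }
        # "Port 1:" starts a new port; "Port GUID:" does not match this pattern.
        /^Port [0-9]+:/    { n++; port = $2; sub(/:$/, "", port); ports[n] = port }
        /^State:/          { state[port] = $2 }
        /^Physical state:/ { phys[port]  = $3 }
        /^Rate:/           { rate[port]  = $2 }
        END {
            for (i = 1; i <= n; i++) {
                p = ports[i]
                if (state[p] != "Active" || phys[p] != "LinkUp" || rate[p] != "40") {
                    printf "Port %s: State=%s Physical=%s Rate=%s\n", p, state[p], phys[p], rate[p]
                    bad = 1
                }
            }
            if (n == 0) { print "no ports found in input"; bad = 1 }
            exit bad
        }
    '
}
```

Typical use would be "ibstat | check_ibstat_ports && echo all ports OK". The function prints nothing and returns 0 when every port matches, and returns 1 after listing the ports that do not.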



2. Verify the InfiniBand topology (example from a fully operational system):


[root@db01 ~]# /opt/oracle.SupportTools/ibdiagtools/verify-topology
[ DB Machine Infiniband Cabling Topology Verification Tool ]
Is every external switch connected to every internal switch......[SUCCESS]
Are any external switches connected to each other................[SUCCESS]
Are any hosts connected to spine switch..........................[SUCCESS]
Check if all hosts have 2 CAs to different switches..............[SUCCESS]
Leaf switch check: cardinality and even distribution.............[SUCCESS]
Check if each rack has an valid internal ring....................[SUCCESS]
[root@db01 ~]#

For a Quarter Rack or Half Rack, use the "-t" option to specify the topology.
Example: ./verify-topology -t quarterrack
Example: ./verify-topology -t halfrack
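
On a busy console it is easy to miss a single failed check. The sketch below filters verify-topology output down to the checks whose bracketed result is not SUCCESS. It assumes every check line ends with dot leaders and a bracketed status, as in the example above. The wording of non-success statuses is not shown in this note, so the filter treats anything other than [SUCCESS] as a problem. The function name topology_failures is hypothetical.

```shell
#!/bin/sh
# Sketch: print each verify-topology check line whose result is not [SUCCESS].
# topology_failures is a hypothetical helper name, not part of ibdiagtools.
topology_failures() {
    awk '
        BEGIN { checks = 0; bad = 0 }
        # A check line is dot leaders followed by a bracketed status at end of line.
        # The banner line has no dot leaders, so it is ignored.
        /\.\.+\[[^]]*\][ \t]*$/ {
            checks++
            if ($0 !~ /\[SUCCESS\][ \t]*$/) { print; bad = 1 }
        }
        END {
            # Seeing no check lines at all is also treated as a failure.
            if (checks == 0) bad = 1
            exit bad
        }
    '
}
```

For example, "./verify-topology -t halfrack | topology_failures" prints nothing and returns 0 when every check succeeded.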



3. Ping other nodes over the InfiniBand subnet.
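
The ping check can be run as a small loop. This is a sketch only: the function name ping_ib_hosts and the PING_CMD override are hypothetical, and the host names in the usage line are placeholders that must be replaced with the InfiniBand host names or addresses used at the site.

```shell
#!/bin/sh
# Sketch: ping each host given as an argument and report OK or FAIL per host.
# PING_CMD is a hypothetical override so the loop can be tried without a network;
# by default each host is sent three echo requests with "ping -c 3".
ping_ib_hosts() {
    failed=0
    for host in "$@"; do
        # PING_CMD is intentionally unquoted so "ping -c 3" splits into words.
        if ${PING_CMD:-ping -c 3} "$host" >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

Usage with placeholder names: "ping_ib_hosts db02-priv cel01-priv". The function returns 1 if any host did not answer.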



PARTS NOTE:
See the System Handbook entry for Exadata or Exadata X2-2.

REFERENCE INFORMATION:

Exadata Documentation
Sun Fire X4170 M2 Server Service Manual


How to Remove and Replace a X4170/X4270 PCI Card (Doc ID 1347366.1)

How to shutdown the Exadata database nodes and storage cells in a rolling fashion so certain hardware tasks can be performed. (Doc ID 1539451.1)

Steps to shut down or reboot an Exadata storage cell without affecting ASM (Doc ID 1188080.1)


  Copyright © 2018 Oracle, Inc.  All rights reserved.