Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1992981.1
Update Date:2018-04-05

Solution Type  Technical Instruction Sure

Solution  1992981.1 :   How to Replace an Exadata X5-2/X6-2 Storage Cell Infiniband Card  


Related Items
  • Oracle SuperCluster T5-8 Full Rack
  • Oracle SuperCluster M7 Hardware
  • Zero Data Loss Recovery Appliance X6 Hardware
  • Exadata SL6 Hardware
  • Exadata X5-2 Hardware
  • Exadata X5-2 Full Rack
  • Exadata X5-2 Eighth Rack
  • Oracle SuperCluster T5-8 Half Rack
  • Exadata X6-2 Hardware
  • Exadata X6-8 Hardware
  • Exadata X5-2 Quarter Rack
  • Exadata X4-8 Hardware
  • Zero Data Loss Recovery Appliance X5 Hardware
  • Exadata X5-2 Half Rack
  • Oracle SuperCluster T5-8 Hardware
  • Oracle SuperCluster M6-32 Hardware
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: This procedure is for a FRU part so only field engineers need access to this

Applies to:

Exadata X5-2 Hardware - Version All Versions and later
Exadata X5-2 Full Rack - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Exadata X5-2 Quarter Rack - Version All Versions and later
Zero Data Loss Recovery Appliance X5 Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal

 How to Replace an Exadata X5-2/X6-2 Storage Cell Infiniband Card

Solution

DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata Trained


TIME ESTIMATE: 90 Minutes
TASK COMPLEXITY: 3


FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW: A faulty InfiniBand card in an Exadata X5-2/X6-2 Storage Cell has been diagnosed as needing replacement.


WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The server that contains the faulty InfiniBand card should have its services offline and be powered off.


WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

The instructions below assume that the customer DBA is available and working onsite with the field engineer (FE) to manage the host OS and
DB/ASM services. They are provided so that the FE has all the necessary steps available when onsite, and the FE can perform them
if the customer DBA requests, allows, or needs help with these steps.


Step A. Pre-steps to shut down the node for servicing:

1. Determine whether the HCA that needs to be replaced is within an InfiniBand network where IB partitions exist. If it is, follow steps 1 and 2 provided in DOC ID: 1985159.1

 

2. For extended information on this section, see MOS Note:

ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM

This is also documented in the Exadata Owner's Guide in chapter 7 section titled "Maintaining Exadata Storage Servers" subsection "Shutting Down Exadata Storage Server" available on the customer's cell server image in the /opt/oracle/cell/doc directory.

Available to Oracle internally here:  http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/doc/doc.112/e13874/maintenance.htm#DBMOG21129

In the following examples, the SQL commands should be run by the customer's DBA before the hardware replacement. The field engineer should run them only if the customer directs them to, or if the customer is unable to run them. The cellcli commands must be run as root.


3. ASM drops disks after they have been offline for the period set by the DISK_REPAIR_TIME attribute. The default value of 3.6 hours should be adequate for replacing components, but the customer may have changed it. To check this attribute, have the customer log in to ASM and run the following query:
 

SQL> select dg.name,a.value from v$asm_attribute a, v$asm_diskgroup dg where a.name = 'disk_repair_time' and a.group_number = dg.group_number;


As long as the value comfortably covers the time needed to replace the component, there is no need to change it.
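The comparison in this step can be sketched as a small shell helper. This is an illustration only: the function name repair_time_minutes is hypothetical, and it assumes the value returned by the query carries an h or m unit suffix (for example 3.6h, the default).

```shell
# Hypothetical helper: convert a disk_repair_time value such as "3.6h" or
# "216m" to whole minutes. Assumes the value ends in an h/H or m/M suffix;
# returns non-zero for anything else.
repair_time_minutes() {
    printf '%s\n' "$1" | awk '
        /^[0-9.]+[hH]$/ { sub(/[hH]$/, ""); printf "%d\n", $0 * 60 + 0.5; ok = 1 }
        /^[0-9.]+[mM]$/ { sub(/[mM]$/, ""); printf "%d\n", $0 + 0.5; ok = 1 }
        END { if (!ok) exit 1 }'
}
```

For example, against the 90-minute time estimate in the dispatch instructions, the test [ "$(repair_time_minutes 3.6h)" -ge 90 ] succeeds, because 3.6 hours is 216 minutes.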

4. Check if ASM will be OK if the grid disks go OFFLINE.


# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome

 ...sample ...

     CATALOG_CD_09_zdlx5_tvp_a_cel3  ONLINE  Yes
     CATALOG_CD_10_zdlx5_tvp_a_cel3  ONLINE  Yes
     CATALOG_CD_11_zdlx5_tvp_a_cel3  ONLINE  Yes
     DELTA_CD_00_zdlx5_tvp_a_cel3    ONLINE  Yes
     DELTA_CD_01_zdlx5_tvp_a_cel3    ONLINE  Yes
     DELTA_CD_02_zdlx5_tvp_a_cel3    ONLINE  Yes

...repeated for all griddisks....


If one or more disks return asmdeactivationoutcome='No', wait for some time and repeat step 4. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
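On a cell with many grid disks, the check above can be filtered rather than read by eye. The following is a minimal sketch, not part of the documented procedure; the function name griddisks_not_safe is hypothetical, and it assumes the three-column output shown in the sample.

```shell
# Hypothetical helper: read "name asmmodestatus asmdeactivationoutcome" rows
# on stdin and print the name of every grid disk whose outcome is not "Yes".
# Prints nothing when every grid disk is safe to deactivate.
griddisks_not_safe() {
    awk 'NF >= 3 && $3 != "Yes" { print $1 }'
}

# Intended use on the cell, as root (not executed here):
#   cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome | griddisks_not_safe
```

Empty output means all grid disks report 'Yes' and the next step can proceed.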

5. Run the following cellcli command to inactivate all grid disks on the cell that needs to be powered down for maintenance. (This can take 10 minutes or longer.)

# cellcli

CellCLI> ALTER GRIDDISK ALL INACTIVE

...sample ...
GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered

...repeated for all griddisks...


6. Run the command below. Once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
         CATALOG_CD_09_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         CATALOG_CD_10_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         CATALOG_CD_11_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         DELTA_CD_00_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes
         DELTA_CD_01_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes
         DELTA_CD_02_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes

...repeated for all griddisks...
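The condition in step 6 can also be expressed as a filter over the four-column listing. This sketch is an assumption-laden convenience, not an Oracle tool: the function name griddisks_still_in_use is hypothetical, and it expects the output of the non-interactive form of the command (cellcli -e ...), so that no CellCLI> prompt line is present.

```shell
# Hypothetical helper: read "name status asmmodestatus asmdeactivationoutcome"
# rows on stdin and print every grid disk that is NOT yet
# inactive + (OFFLINE or UNUSED) + Yes.
griddisks_still_in_use() {
    awk 'NF >= 4 && !($2 == "inactive" &&
                      ($3 == "OFFLINE" || $3 == "UNUSED") &&
                      $4 == "Yes") { print $1 }'
}

# Intended use on the cell, as root (not executed here):
#   cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome | griddisks_still_in_use
```

If the filter prints nothing, every grid disk is offline and inactive and the cell can be shut down.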

7. Once all disks are offline and inactive, the customer may shut down the cell using the following command:

# shutdown -hP now


8. The field engineer can now slide out the server for maintenance. Do not remove any cables before sliding the server forward, or the loose cable ends will jam in the cable management arms. Take care to ensure that the cables and Cable Management Arm (CMA) are moving properly. Refer to Note 1444683.1 for CMA handling training.

Remember to disconnect the power cords before opening the top cover of the server.

 

Step B. Physical card replacement

Reference links for Service Manual:
X5-2L : ( http://docs.oracle.com/cd/E41033_01/html/E48325/cnpsm.html#scrolltoc )

Remove the Infiniband card

1. Remove the server top cover

2. Swivel the air baffle into the upright position to allow access to PCIe cards

3. Remove the two infiniband cables from the infiniband card in PCIe slot 3, making note of their locations so they are plugged back into the same ports after replacement

4. Rotate the PCIe card locking mechanism, and then lift up on the PCIe card to disengage it from the motherboard connectors


Install the new Infiniband card

1. Insert the new Infiniband card into PCIe slot 3 and rotate the PCIe locking mechanism to secure the PCIe card in place

2. Reconnect the cables to the Infiniband card that you unplugged during the removal procedure, making sure they go back to their original ports

3. Lower the air baffle to the installed position

4. Install the top cover


Step C. Post-Replacement additional steps

1. Once the power cords have been re-attached, slide the server back into the rack.

2. Once the ILOM has booted, the green LED for the server blinks slowly. Power on the server by pressing the power
button on the front of the unit.

 

Step D. Server Services Startup Validation

1. After the OS is up, login as root

 

2. If the HCA is part of an InfiniBand network where IB partitions exist, follow steps 3 and 4 or step 5 of DOC ID: 1985159.1. Otherwise, go to the next step.

 

3. Verify that the InfiniBand links are up at 40 Gbps, because the cables were disconnected during the replacement:

# /usr/sbin/ibstatus

 Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0010:e000:0159:c61d
        base lid:        0x9
        sm lid:          0x2
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      IB

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0010:e000:0159:c61e
        base lid:        0xa
        sm lid:          0x2
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      IB
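The three fields that matter in this output are state, phys state, and rate. As a hedged sketch, the helper below flags any port that does not match the healthy sample above; the function name ib_ports_not_ready is hypothetical, and the parsing assumes the ibstatus layout shown here.

```shell
# Hypothetical helper: read ibstatus output on stdin and print one line per
# port that is not ACTIVE / LinkUp / 40 Gb/sec. Prints nothing if all pass.
ib_ports_not_ready() {
    awk '
        # Report the port collected so far if any of its fields is unhealthy.
        function report() {
            if (port != "" && !(state == "ACTIVE" && phys == "LinkUp" && rate == "40"))
                print port
        }
        /^ *Infiniband device/ { report(); port = $3 " port " $5; state = phys = rate = "" }
        $1 == "state:"                 { state = $3 }
        $1 == "phys" && $2 == "state:" { phys = $4 }
        $1 == "rate:"                  { rate = $2 }
        END { report() }'
}

# Intended use on the cell, as root (not executed here):
#   /usr/sbin/ibstatus | ib_ports_not_ready
```

Any port name printed by the filter should be investigated (cable seating, port location) before continuing.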

 

4. Run the InfiniBand topology verification (example output from a fully operational system):

# /opt/oracle.SupportTools/ibdiagtools/verify-topology
        [ DB Machine Infiniband Cabling Topology Verification Tool ]
Every node is connected to two leaf switches in a single rack.......................................................[SUCCESS]
Every inter-leaf switch link is connected correctly in a single rack................................................[SUCCESS]
Every leaf switch in an interconnected quarter rack is correctly connected to other rack in a multi-rack group......[NOT APPLICABLE]
Every leaf switch is connected to every spine switch in a multi-rack group..........................................[NOT APPLICABLE]
Every rack has balanced inter-leaf-and-spine switch links in a multi-rack group.....................................[NOT APPLICABLE]
No spine switch is connected to another spine switch in a multi-rack group..........................................[NOT APPLICABLE]
Every spine switch is connected to two external spine switches in a multi-rack group................................[NOT APPLICABLE]
No external spine switch is connected to a leaf switch in a multi-rack group........................................[NOT APPLICABLE]
No external spine switch is connected to another external spine switch in a multi-rack group........................[NOT APPLICABLE]
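In the sample above, every check line ends in either [SUCCESS] or [NOT APPLICABLE]. A minimal sketch for spotting anything else follows. The function name topology_failures is hypothetical, and the filter assumes only that each check line ends in dots followed by a bracketed result tag, as in the sample; it makes no assumption about the wording of a failure tag.

```shell
# Hypothetical helper: read verify-topology output on stdin and print every
# check line whose bracketed result is neither SUCCESS nor NOT APPLICABLE.
topology_failures() {
    awk '/\.\.\.\[/ {
        tag = $0
        sub(/^.*\.\.\.\[/, "", tag)   # drop everything up to the opening bracket
        sub(/\].*$/, "", tag)          # drop the closing bracket and the rest
        if (tag != "SUCCESS" && tag != "NOT APPLICABLE") print
    }'
}

# Intended use on the cell, as root (not executed here):
#   /opt/oracle.SupportTools/ibdiagtools/verify-topology | topology_failures
```

No output from the filter corresponds to the fully operational example shown above.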


5. Once the hardware is verified as up and running, the Customer's DBA will need to activate the grid disks:

# cellcli

 CellCLI> alter griddisk all active

GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered

...repeated for all griddisks...


Issue the command below and all disks should show 'active':

CellCLI> list griddisk

         CATALOG_CD_09_zdlx5_tvp_a_cel3  active
         CATALOG_CD_10_zdlx5_tvp_a_cel3  active
         CATALOG_CD_11_zdlx5_tvp_a_cel3  active
         DELTA_CD_00_zdlx5_tvp_a_cel3    active
         DELTA_CD_01_zdlx5_tvp_a_cel3    active
         DELTA_CD_02_zdlx5_tvp_a_cel3    active

...repeated for all griddisks...


6. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks and no longer SYNCING. The following is an example of the output early in the activation process.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome

         CATALOG_CD_09_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         CATALOG_CD_10_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         CATALOG_CD_11_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         DELTA_CD_00_zdlx5_tvp_a_cel3    active  SYNCING         Yes
         DELTA_CD_01_zdlx5_tvp_a_cel3    active  SYNCING         Yes
         DELTA_CD_02_zdlx5_tvp_a_cel3    active  SYNCING         Yes

...repeated for all griddisks...


Notice in the above example that the grid disks are still in the 'SYNCING' state. Oracle ASM synchronization is complete only when ALL grid disks show asmmodestatus=ONLINE. This can take some time, depending on how busy the machine is now and how busy it was while this server was down for repair.
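Because the resynchronization can take a while, the listing is usually repeated until no grid disk reports SYNCING. The sketch below shows one way to poll for that. It is illustrative only: the function names all_griddisks_online and wait_for_resync and the POLL_SECONDS variable are hypothetical, and the parsing assumes the four-column output shown in step 6.

```shell
# Hypothetical helper: succeed only when every grid disk row on stdin
# ("name status asmmodestatus asmdeactivationoutcome") is active and ONLINE.
# Fails if any row is still SYNCING (or otherwise not ONLINE), or if no
# rows were seen at all.
all_griddisks_online() {
    awk 'NF >= 3 { seen = 1; if ($2 != "active" || $3 != "ONLINE") bad = 1 }
         END { if (seen && !bad) exit 0; exit 1 }'
}

# Hypothetical polling loop: run the listing command given as arguments every
# POLL_SECONDS seconds (default 60) until all_griddisks_online succeeds.
wait_for_resync() {
    until "$@" | all_griddisks_online; do
        sleep "${POLL_SECONDS:-60}"
    done
}

# Intended use on the cell, as root (not executed here):
#   wait_for_resync cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
```

The loop has no timeout by design of this sketch; interrupt it manually if the synchronization appears to have stalled, and check the listing directly.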

 

OBTAIN CUSTOMER ACCEPTANCE

- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

- Verify that HW Components and SW Components are returned to properly functioning state with server up and all ASM disks online on Storage Servers.

 

REFERENCE INFORMATION:

1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.

1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM

1985159.1 Updating IB partitions after replacing an Infiniband HCA in any nodes within IB network - steps to do after replacing HCA


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.