Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1970759.1
Update Date:2018-04-10
Keywords:

Solution Type  Technical Instruction Sure

Solution  1970759.1 :   How to Replace an Exadata X5-2/X6-2 Compute Node Infiniband Card  


Related Items
  • Zero Data Loss Recovery Appliance X6 Hardware
  • Exadata X5-2 Hardware
  • Exadata X5-2 Eighth Rack
  • Exadata X5-2 Full Rack
  • Exadata X6-2 Hardware
  • Exadata X5-2 Quarter Rack
  • Zero Data Loss Recovery Appliance X5 Hardware
  • Exadata X5-2 Half Rack
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: FRU replacement on Engineered system

Applies to:

Exadata X5-2 Hardware - Version All Versions and later
Exadata X5-2 Quarter Rack - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Zero Data Loss Recovery Appliance X5 Hardware - Version All Versions and later
Exadata X5-2 Full Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

 How to replace a faulty Infiniband card in an Exadata X5-2/X6-2 compute node.

Solution

 DISPATCH INSTRUCTIONS


WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?:

Exadata Server Training.

 

TIME ESTIMATE: 60 minutes.

TASK COMPLEXITY: 3-FRU


FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

The server that contains the faulty Infiniband HCA card should have its services offline and be powered off.


WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

The instructions below assume the customer DBA is available and working with the field engineer (FE) onsite to manage the host OS and
DB/ASM services. They are provided here so that the FE has all the steps needed when onsite, and they can be performed by the
FE if the customer DBA requests or permits help with their steps.


Step A. Pre-steps to shut down the node for servicing:


1. For extended information on this section, see MOS Note
ID 1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.


For a documentation reference, see the Exadata Maintenance Guide, chapter 1 "General Maintenance Information",
section "Non-Emergency Power Procedures", sub-section "Powering Off Oracle Exadata Rack", topic "Powering off Database Servers". The guide is available on the customer's
cell server image in the /opt/oracle/cell/doc directory, or internal to Oracle here:
http://amomv0115.us.oracle.com/archive/cd_ns/E50790_01/doc/doc.121/e51951/general.htm#DBMMN21014

2. Check if there are any non-default Infiniband partitions.

 - Refer to the document "Updating IB partitions after replacing an Infiniband HCA in any nodes within IB network - steps to do after replacing HCA" (Doc ID 1985159.1) and follow steps 1 and 2 from that document.


If the node is running OVM, go to the section "For Compute Node running OVM". For non-OVM nodes, proceed as follows:


Shutdown crs


i. As the root user, do the following to stop CRS and disable autostart of CRS on reboot:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle
# $ORACLE_HOME/bin/crsctl disable crs
# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs

where the GI_HOME environment variable is typically set to “/u01/app/11.2.0/grid” but will depend on the customer's environment.
In the above output, the “1” of “+ASM1” refers to the DB node number. For example, on DB node #3 the value would be +ASM3.

ii. Validate that CRS is down cleanly. There should be no CSS processes running (the grep command itself may appear in the output and can be ignored).

# ps -ef | grep css
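
The check above can be wrapped in a small helper so that the grep command itself is not mistaken for a running CSS process. This is only a sketch: the function names css_processes and crs_is_down are hypothetical and are not part of any Oracle tooling, and the match is a plain case-insensitive search for "css" in the process list.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints any lines
# that mention "css", ignoring the grep command itself.
css_processes() {
    grep -i 'css' | grep -v 'grep' || true
}

# Returns 0 (success) when no CSS-related process is listed on stdin.
crs_is_down() {
    [ -z "$(css_processes)" ]
}

# Usage on the node (as root):
#   ps -ef | crs_is_down && echo "CRS is down" || echo "CSS processes still running"
```

If the helper reports remaining processes, run "ps -ef | css_processes" to list them before proceeding.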

For Compute Node running OVM proceed as follows:


If there are any concerns, engage an EEST engineer.

The customer should perform the following:

(a) Connect to the management domain (domain zero, or dom0) and see what user domains are running (record the result).
This is an example with just two user domains and the management domain Domain-0:

# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8

Connect to each domain using the command:

# xm console domainname

where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.

Shut down any instances of CRS in all user domains; refer to the example in the previous section "Shutdown crs".

Note: Omit the following command for OVM, as it is not required.
# $ORACLE_HOME/bin/crsctl disable crs

Press CTRL+] to disconnect from the console.

(b) Shut down all user domains from dom0:

# xm shutdown -a -w

(c) See what user domains are running (should be only Domain-0)

(d) Disable user domains from auto starting during dom0 boot after the Infiniband card has been replaced.

# chkconfig xendomains off
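
The record taken in step (a) and the check in step (c) can be scripted by filtering the "xm list" output. The following is a minimal sketch, assuming the column layout shown in the example above; the function name user_domains is hypothetical.

```shell
# Hypothetical helper: reads `xm list` output on stdin and prints the names of
# user domains, skipping the header row and the management domain (Domain-0).
user_domains() {
    awk 'NF > 0 && $1 != "Name" && $1 != "Domain-0" { print $1 }'
}

# Sample input copied from the example output in step (a).
sample='Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8'

# Prints dm01db01vm01 and dm01db01vm02, one per line.
printf '%s\n' "$sample" | user_domains
```

On dom0, "xm list | user_domains" can be redirected to a file before the shutdown to keep the record from step (a). After step (b), the same command should print nothing.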


The customer can now shut down the server operating system:

# shutdown -hP now

 

Reference links for Service Manual:

X5-2 DB’s: ( http://docs.oracle.com/cd/E41059_01/html/E48312/napsm.gnriy.html#scrolltoc )

 The field engineer can now slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the
loose cable ends will jam in the cable management arms (CMA). Ensure all customer-added data network cables are properly dressed
into the CMA. Take care to ensure the cables and CMA are moving properly.
Remember to disconnect the power cords before opening the top of the server.

Locate and Remove the PCIe card.

(a) There are three external PCIe slots in the system. The external PCIe slots are numbered 1, 2, and 3 from left to right when you
view the server from the rear. The Infiniband card is always installed in PCIe slot 3.

(b) Locate the Infiniband card in PCIe slot 3 and unplug the two cables from the PCIe card making note of their locations so that they
can be re-installed in the same configuration (label if needed).

(c) Lift the green-tabbed latch on the rear of the server's chassis next to the PCIe slot to release the PCIe card's rear bracket.

(d) To release the riser from the motherboard connector, lift the green-tabbed release
lever on the PCIe riser to the open position.

(e) Slide the plastic PCIe card retainer, which is mounted on the side of the chassis,
toward the front of the server to release the card(s) installed in the riser.

(f) Grasp the riser with both hands and remove it from the server.

(g) Remove the Infiniband card from the PCIe riser. Hold the riser in one hand and use your other hand to carefully pull the PCIe
card connector out of the riser.

(h) Disconnect the rear bracket that is attached to the Infiniband card from the rear of the PCIe riser.


Replace the PCIe card.

(a) Insert the rear bracket that is attached to the Infiniband card into the PCIe riser.

(c) Hold the riser in one hand and use your other hand to carefully insert the PCIe card connector into the Riser.

(d) Install the PCIe riser with the installed PCIe cards into the server.

(e) Raise the PCIe riser release lever (marked with a green tab) to the open (up) position.
Making sure to install the riser in the same position from which it was removed (PCIe slot 3), gently press the riser into the
motherboard connector until it seats, then press the green-tabbed riser release lever to the closed (down) position.

(f) Close the green-tabbed latch on the rear of the server's chassis next to the applicable PCIe slot to secure the PCIe card's rear
bracket to the server's chassis.

(g) Reconnect the Infiniband cables that were unplugged during the removal procedure to the PCIe card, making sure to connect them in the
same configuration as when they were disconnected.

 

Server Services Startup Validation:


DB Node Startup:

 

OBTAIN CUSTOMER ACCEPTANCE

The system administrator should verify the system is functioning correctly. Some suggested actions they can take to verify are:


1. On the host where the card was replaced, run:

# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.11.1280
        Hardware version: 0
        Node GUID: 0x0010e0000159ee7c
        System image GUID: 0x0010e0000159ee7f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 5
                LMC: 0
                SM lid: 2
                Capability mask: 0x02514868
                Port GUID: 0x0010e0000159ee7d
                Link layer: IB
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 6
                LMC: 0
                SM lid: 2
                Capability mask: 0x02514868
                Port GUID: 0x0010e0000159ee7e
                Link layer: IB

Ensure that both Port 1 and Port 2 show:
State: "Active"
Physical state: "LinkUp"
Rate: "40"
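
The three conditions above can be checked for every port in one pass over the ibstat output. This is a sketch under assumptions, not part of the Exadata tooling: the function name check_ib_ports is hypothetical, and the parsing assumes the "Port N:" section layout shown in the sample output above.

```shell
# Hypothetical helper. Reads `ibstat` output on stdin; the optional first
# argument is the expected rate (defaults to 40, as required by this procedure).
check_ib_ports() {
    awk -v want_rate="${1:-40}" '
        # Print a summary for the port collected so far and count failures.
        function report() {
            if (port == "") return
            ok = (state == "Active" && phys == "LinkUp" && rate == want_rate)
            printf "Port %s: state=%s physical=%s rate=%s -> %s\n", port, state, phys, rate, (ok ? "OK" : "CHECK")
            ports++
            if (!ok) bad++
        }
        # "Port 1:" starts a new port section ("Port GUID:" does not match).
        /^[ \t]*Port [0-9]+:/ {
            report()
            port = $2; sub(/:$/, "", port)
            state = ""; phys = ""; rate = ""
            next
        }
        port != "" && /^[ \t]*State:/          { state = $2 }
        port != "" && /^[ \t]*Physical state:/ { phys = $3 }
        port != "" && /^[ \t]*Rate:/           { rate = $2 }
        # Fail when no port was seen or any port did not match.
        END { report(); exit ((ports == 0 || bad > 0) ? 1 : 0) }
    '
}

# Usage on the host where the card was replaced:
#   ibstat | check_ib_ports 40 && echo "Both ports look good"
```

The helper exits non-zero when any port is not Active/LinkUp at the expected rate, so it can gate later steps in a script.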

ii) Refer to the document "Updating IB partitions after replacing an Infiniband HCA in any nodes within IB network - steps to do after replacing HCA" (Doc ID 1985159.1) and follow steps 3, 4 and 5 from that document.

2. Run the verify-topology tool to check the Infiniband topology (example output from a fully operational system):

Most important at this stage are lines #2 and #3 of the output below (the first two checks after the banner),
both of which should report [SUCCESS].

# /opt/oracle.SupportTools/ibdiagtools/verify-topology

        [ DB Machine Infiniband Cabling Topology Verification Tool ]
Every node is connected to two leaf switches in a single rack.......................................................[SUCCESS]
Every inter-leaf switch link is connected correctly in a single rack................................................[SUCCESS]
Every leaf switch in an interconnected quarter rack is correctly connected to other rack in a multi-rack group......[NOT APPLICABLE]
Every leaf switch is connected to every spine switch in a multi-rack group..........................................[NOT APPLICABLE]
Every rack has balanced inter-leaf-and-spine switch links in a multi-rack group.....................................[NOT APPLICABLE]
No spine switch is connected to another spine switch in a multi-rack group..........................................[NOT APPLICABLE]
Every spine switch is connected to two external spine switches in a multi-rack group................................[NOT APPLICABLE]
No external spine switch is connected to a leaf switch in a multi-rack group........................................[NOT APPLICABLE]
No external spine switch is connected to another external spine switch in a multi-rack group........................[NOT APPLICABLE]
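
To avoid reading the long result lines by eye, the output can be filtered for any check whose status is neither [SUCCESS] nor [NOT APPLICABLE]. This sketch assumes only the line layout shown above (a run of dots followed by a bracketed status); it does not assume any particular failure wording, and the function name topology_problems is hypothetical.

```shell
# Hypothetical helper: reads verify-topology output on stdin and prints every
# result line whose bracketed status is not SUCCESS or NOT APPLICABLE.
topology_problems() {
    # A result line ends in a run of dots followed by "[STATUS]".
    grep -E '\.{3,}\[[^]]*\][[:space:]]*$' \
        | grep -Ev '\[(SUCCESS|NOT APPLICABLE)\][[:space:]]*$' || true
}

# Usage (prints nothing when every check passed or was not applicable):
#   /opt/oracle.SupportTools/ibdiagtools/verify-topology | topology_problems
```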

 

3. Ping other nodes over the Infiniband subnet

 

CRS services should now be started.

"DB Node Startup Verification" - for a compute node NOT running OVM; for OVM, refer to the next section.

Start up CRS and re-enable autostart of CRS. After the OS is up, the customer DBA should validate that CRS is running. As root, execute:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl start crs
# $ORACLE_HOME/bin/crsctl check crs

Now re-enable autostart:

# $ORACLE_HOME/bin/crsctl enable crs

or, using the Grid Infrastructure home path directly:

# <GI_HOME>/bin/crsctl check crs
# <GI_HOME>/bin/crsctl enable crs

where the GI_HOME environment variable is typically set to “/u01/app/11.2.0/grid” but will depend on the customer's environment.
In the above output, the “1” of “+ASM1” refers to the DB node number. For example, on DB node #3 the value would be +ASM3.
Example output when all is online is:

# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
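
The "all online" condition can also be tested in a script by confirming that every CRS- line in the check output ends in "is online". This is a minimal sketch based only on the example output above; the function name crs_all_online is hypothetical.

```shell
# Hypothetical helper: reads `crsctl check crs` output on stdin, prints any
# CRS- line that does not end in "is online", and returns 0 only when at least
# one CRS- line was seen and all of them are online.
crs_all_online() {
    awk '
        /^CRS-[0-9]+:/ {
            total++
            if ($0 ~ /is online[ \t]*$/) online++
            else print "not online: " $0
        }
        END { exit ((total > 0 && online == total) ? 0 : 1) }
    '
}

# Usage (as root, with ORACLE_HOME set by oraenv as shown above):
#   $ORACLE_HOME/bin/crsctl check crs | crs_all_online && echo "CRS stack is online"
```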

2. Validate that instances are running:

# ps -ef | grep pmon

It should return a record for the ASM instance and a record for each database.
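
To list only the instance names rather than full process lines, the pmon entries can be reduced as sketched below. This assumes the usual background process naming of asm_pmon_<SID> for ASM and ora_pmon_<SID> for databases; the function name running_instances is hypothetical.

```shell
# Hypothetical helper: reads `ps -ef` output on stdin and prints the instance
# name (SID) for each pmon background process found. The grep command itself,
# if present in the listing, is not matched.
running_instances() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(asm|ora)_pmon_/) {
                sid = $i
                sub(/^(asm|ora)_pmon_/, "", sid)
                print sid
            }
    }'
}

# Usage:
#   ps -ef | running_instances
```

The list should contain the ASM instance (for example +ASM1 on DB node #1) plus one entry per database instance expected on the node.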

For Compute Node running OVM

If the customer requires assistance, please ask them to contact an EEST engineer or the parent case owner.

Once the compute node has booted, re-enable user domains to autostart during Domain-0 boot:

# chkconfig xendomains on

Start up all user domains that are marked for auto start:

# service xendomains start

See what user domains are running (compare against result from previously collected data)

# xm list

If any user domain did not auto-start, start it individually:

# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg

 Check that CRS has started in the user domains; refer to the previous section "DB Node Startup Verification".



PARTS NOTE: 7092757

REFERENCE INFORMATION:

1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.

Service Manuals:
X5-2 DB’s: ( http://docs.oracle.com/cd/E41059_01/html/E48312/napsm.html#scrolltoc )


  Copyright © 2018 Oracle, Inc.  All rights reserved.