How to Replace an Exadata X5-2/X6-2 Compute Node RAID HBA

Asset ID:	1-71-1968405.1
Update Date:	2018-04-10
Keywords:

Solution Type Technical Instruction Sure

Solution 1968405.1 : How to Replace an Exadata X5-2/X6-2 Compute Node RAID HBA

Applies to:

Exadata X5-2 Eighth Rack - Version All Versions and later
Exadata X5-2 Hardware - Version All Versions and later
Exadata X5-2 Full Rack - Version All Versions and later
Exadata X5-2 Half Rack - Version All Versions and later
Zero Data Loss Recovery Appliance X5 Hardware - Version All Versions and later
Information in this document applies to any platform.

Goal

How to successfully replace a faulty RAID HBA in an Exadata X5-2/X6-2 compute node.

Solution

DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata Trained

TIME ESTIMATE: 90 Minutes
TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW: A faulty RAID HBA in an Exadata X5-2/X6-2 compute node has been diagnosed as needing replacement.

WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

- The server that contains the faulty HBA should have its services offline and system powered off.

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

The instructions below assume the customer DBA is available and working with the field engineer onsite to manage the host OS and
DB/ASM services. They are provided here so that the FE has all the steps needed when onsite, and the FE can perform them
if the customer DBA wants, allows, or needs help with these steps.

Step A. Pre-Steps to shutdown the node for servicing:

1. For extended information on this section, see MOS Note ID 1093890.1, "Steps To Shutdown/Startup The Exadata & RDBMS Services
and Cell/Compute Nodes On An Exadata Configuration".

For a documentation reference, see the Exadata Maintenance Guide, chapter 1 "General Maintenance Information", section
"Non-Emergency Power Procedures", section "Powering Off Oracle Exadata Rack", sub-section "Powering off Database Servers".
The guide is available on the customer's cell server image in the /opt/oracle/cell/doc directory, or internal to Oracle here:
http://amomv0115.us.oracle.com/archive/cd_ns/E50790_01/doc/doc.121/e51951/general.htm#DBMMN21014

It is highly recommended to make and verify a backup of all disk partitions prior to RAID HBA replacement.

If running OVM, go to the section "For Compute Node running OVM". For non-OVM, proceed as follows:

Shutdown crs

i. As root user do the following to stop crs and disable autostart of crs on reboot:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl disable crs

# $ORACLE_HOME/bin/crsctl stop crs
or
# <GI_HOME>/bin/crsctl stop crs

where the GI_HOME environment variable is typically set to “/u01/app/11.2.0/grid” but will depend on the customer's environment.

In the above output, the “1” of “+ASM1” refers to the DB node number. For example, on DB node #3 the value would be +ASM3.

ii. Validate that CRS is down cleanly. There should be no CSS processes running; the only match, if any, should be the grep command itself.

# ps -ef | grep css
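
The check in step ii can be scripted so that the grep command itself is not counted. The following is a minimal sketch and not part of the Oracle procedure: crs_is_down is a hypothetical helper name, and it assumes that any remaining CSS daemon shows "css" in its command line in the ps -ef listing.

```shell
# crs_is_down: hypothetical helper (not an Oracle tool).
# Reads `ps -ef` output on stdin and succeeds (exit 0) only when no line
# mentions "css". The [c]ss pattern still matches "css", but a grep process
# started with that pattern does not match its own command line.
crs_is_down() {
    ! grep '[c]ss' >/dev/null
}

# Intended use on the node, as root:
#   ps -ef | crs_is_down && echo "CRS is down" || echo "CSS processes still running"
```

If the helper reports that CSS processes are still running, review the full ps -ef output before powering off the node.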

For Compute Node running OVM proceed as follows:

If there are any concerns, engage an EEST engineer.

The customer should perform the following:

(a) See what user domains are running (record the result).

Connect to the management domain (domain zero, or dom0).

This is an example with just two user domains and the management domain, Domain-0:

# xm list
Name ID Mem VCPUs State Time(s)
Domain-0 0 8192 4 r----- 409812.7
dm01db01vm01 8 8192 2 -b---- 156610.6
dm01db01vm02 9 8192 2 -b---- 152169.8

Connect to each domain using the command:

# xm console domainname

where domainname would be dm01db01vm01 or dm01db01vm02 if using the above examples.

Shut down any instances of crs in all user domains; refer to the example in the previous section "Shutdown crs".

Note: Omit the following command for OVM, as it is not required.

# $ORACLE_HOME/bin/crsctl disable crs

Press CTRL+] to disconnect from the console.

(b) Shut down all user domains from dom0:

# xm shutdown -a -w

(c) See what user domains are running, using xm list as in step (a). Only Domain-0 should be listed.

(d) Disable user domains from auto-starting during dom0 boot after the HBA has been replaced:

# chkconfig xendomains off
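
Steps (a) and (c) both depend on reading the xm list output. The following is a minimal sketch of how the user domain names could be extracted, assuming the column layout shown in the example above. user_domains is a hypothetical helper name and the /tmp file name is illustrative only.

```shell
# user_domains: hypothetical helper (not part of the xm tool set).
# Reads `xm list` output on stdin and prints the name of every domain
# except the header row and the management domain Domain-0.
user_domains() {
    awk 'NR > 1 && $1 != "Domain-0" { print $1 }'
}

# Intended use in dom0, as root (file name is illustrative):
#   xm list | user_domains > /tmp/user_domains.before       # step (a): record
#   [ -z "$(xm list | user_domains)" ] && echo "only Domain-0 is running"   # step (c)
```

Keeping the recorded list allows it to be compared against the running domains after the HBA has been replaced.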

3. Revert all the RAID disk volumes to WriteThrough mode to ensure all data in the RAID cache memory is flushed to disk and not lost
when replacement of the HBA occurs. Set all logical volumes cache policy to WriteThrough cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify the current cache policy for all logical volumes is now WriteThrough:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
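
The grep above prints both the "Default Cache Policy" and "Current Cache Policy" lines, so the result still has to be read by eye. The following is a minimal sketch of an automated check. cache_policy_is is a hypothetical helper name, and the sketch assumes that a WriteThrough volume is reported in the same line format as the WriteBack example shown later in this document.

```shell
# cache_policy_is: hypothetical helper (not a MegaCli option).
# $1 is the expected mode, WriteThrough or WriteBack. Reads MegaCli64
# -ldpdinfo output on stdin and succeeds only when at least one
# "Current Cache Policy" line is present and every one reports $1.
cache_policy_is() {
    awk -v want="$1" '
        /^Current Cache Policy/ {
            seen++
            if ($0 !~ (": *" want ",")) bad++
        }
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Intended use on the node, as root:
#   /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | cache_policy_is WriteThrough \
#       && echo "all volumes are WriteThrough"
```

The same helper can be called with WriteBack once the cache policy has been restored after the replacement.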

4. The customer can now shut down the server operating system:

# shutdown -hP now

5. The field engineer can now slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the
loose cable ends will jam in the cable management arms (CMA). Ensure all customer-added data network cables are properly dressed
into the CMA. Take care to ensure the cables and CMA are moving properly.
Remember to disconnect the power cords before opening the top of the server.

Step B. Physical RAID Card replacement

Reference links for Service Manual:
X5-2 DB’s: ( http://docs.oracle.com/cd/E41059_01/html/E48312/napsm.html#scrolltoc )

Remove the old HBA PCI Card

1. Remove the IB cables from the IB card in slot 3 above the HBA, making a note of which port each cable goes into so
they can go back into the same port.

2. Remove the PCIe riser from slots 3 and 4.

(a) Open the green-tabbed latch located on the rear of the server chassis next to
PCIe slot 3 to release the rear bracket on the PCIe card.

(b) To release the riser from the motherboard connector, lift the green-tabbed release
lever on the PCIe riser to the open position.

(c) Slide the plastic PCIe card retainer, which is mounted on the side of the chassis,
toward the front of the server to release the card(s) installed in the riser.

(d) Grasp the riser with both hands and remove it from the server.

(e) Disconnect the SAS storage drive (HDD) cables from the internal HBA card
installed in PCIe slot 4.

(f) Disconnect the super capacitor cable from the internal HBA card in slot 4.

(g) Disconnect the rear bracket attached to the PCIe card from the rear of the
PCIe riser.

Install the new HBA PCI Card

Reverse the removal instructions, taking care to re-connect the cables to the same ports they were removed from. If the cables are
swapped, disk slot mappings may be affected.
Take care to also put the IB cables back into their original ports, in the correct orientation. IB cables are factory
labeled with the port identification, where port 2 is the port nearest the PCI connector and port 1 is the port near the top side of the
card. The cables should be inserted with the latch release tab facing down so that they fully seat and latch. If inserted upside down,
they will not fully seat or latch.

Power on:
1. Once the power cords have been re-attached, slide the server back into the rack.
2. Once the ILOM has booted you will see a slow blink on the green LED for the server. Power on the server by pressing the power
button on the front of the unit.

Server Services Startup Validation:

DB Node Startup:

1. As the system boots, the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning"
that the check does not match if the firmware on the HBA is different from what the image expects.
If the check passes, then the firmware is correct; continue to step 2.

If the check fails, then an attempt will be made to automatically update the firmware, and a subsequent reboot will occur. Monitor to ensure this occurs properly.

If the check or update still fails, then:
a) Log in as root at the OS login prompt.
b) Run the following to update the RAID HBA to the correct supported firmware for the image:

# /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits

c) After the firmware updates, the server will reboot again. The disk volumes should remain intact and boot up to the OS again.

2. After the OS is up, log in as root and validate that the physical and logical volumes are seen properly by the OS through the new
RAID HBA, that they match the expected configuration for the DB node, and that the supercap is seen:

# /opt/MegaRAID/MegaCli/MegaCli64 -LdInfo -Lall -a0

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :DBSYS
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 1.633 TB
Physical Sector Size: 512
Logical Sector Size : 512
VD has Emulated PD : No
Parity Size         : 557.861 GB
State               : Optimal
Strip Size          : 1.0 MB
Number Of Drives    : 4
Span Depth          : 1
Creation Date     : 25-12-2014
Creation Time     : 08:32:46 AM
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disabled
Encryption Type     : None
Bad Blocks Exist: No
PI type: No PI

Is VD Cached: No

Exit Code: 0x00

# /opt/MegaRAID/MegaCli/MegaCli64 -PdList -a0 | grep "Slot\|Firmware\|Inq"

Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: A690
Inquiry Data: HITACHI H109060SESUN600GA6901446BZMTTX
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: A690
Inquiry Data: HITACHI H109060SESUN600GA6901446BZMW0X
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: A690
Inquiry Data: HITACHI H109060SESUN600GA6901446B01TBX
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: A690
Inquiry Data: HITACHI H109060SESUN600GA6901446BZN1KX

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0

BBU status for Adapter: 0
BatteryType: CVPM02

...Output truncated...
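
The outputs above can also be summarized rather than read line by line. The following is a minimal sketch, assuming the line formats shown in the example outputs in this step. online_drive_count and volumes_optimal are hypothetical helper names, and the expected drive count of 4 is taken from the example rather than from a specification.

```shell
# Hypothetical helpers (not MegaCli options); both read MegaCli64 output on stdin.

# online_drive_count: prints how many physical drives -PdList reports
# with "Firmware state: Online, Spun Up".
online_drive_count() {
    awk '/^Firmware state: Online, Spun Up/ { n++ } END { print n + 0 }'
}

# volumes_optimal: succeeds only when -LdInfo output contains at least one
# "State" line and every such line reports Optimal.
volumes_optimal() {
    awk '/^State[ \t]*:/ { seen++; if ($NF != "Optimal") bad++ }
         END { if (seen > 0 && bad == 0) exit 0; exit 1 }'
}

# Intended use on the node, as root:
#   /opt/MegaRAID/MegaCli/MegaCli64 -PdList -a0 | online_drive_count
#   /opt/MegaRAID/MegaCli/MegaCli64 -LdInfo -Lall -a0 | volumes_optimal && echo "volumes Optimal"
```

A drive count lower than expected, or a volume that is not Optimal, should be investigated before the cache policy is changed back.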

3. Set all logical drives cache policy to WriteBack cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0

Verify the current cache policy for all logical drives is now using WriteBack cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU

4. CRS services should now be started.

"DB Node Startup Verification" - for a compute node NOT running OVM; for OVM, refer to the next section.

1. Start up CRS and re-enable autostart of crs. After the OS is up, the customer DBA should validate that CRS is running. As root, execute:

# . oraenv
ORACLE_SID = [root] ? +ASM1
The Oracle base for ORACLE_HOME=/u01/app/11.2.0/grid is /u01/app/oracle

# $ORACLE_HOME/bin/crsctl start crs
# $ORACLE_HOME/bin/crsctl check crs

Now re-enable autostart:

# $ORACLE_HOME/bin/crsctl enable crs

or, using the GI_HOME path directly:

# <GI_HOME>/bin/crsctl start crs
# <GI_HOME>/bin/crsctl check crs
# <GI_HOME>/bin/crsctl enable crs

where the GI_HOME environment variable is typically set to “/u01/app/11.2.0/grid” but will depend on the customer's environment.
In the above output, the “1” of “+ASM1” refers to the DB node number. For example, on DB node #3 the value would be +ASM3.
Example output when all services are online:

# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
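
The expected result is that every CRS- line ends in "is online". The following is a minimal sketch of that check, assuming the message format shown in the example output above. crs_all_online is a hypothetical helper name and not a crsctl option.

```shell
# crs_all_online: hypothetical helper (not a crsctl option).
# Reads `crsctl check crs` output on stdin and succeeds only when at least
# one CRS- message is present and every CRS- message ends in "is online".
crs_all_online() {
    awk '/^CRS-/ { seen++; if ($0 !~ /is online$/) bad++ }
         END { if (seen > 0 && bad == 0) exit 0; exit 1 }'
}

# Intended use on the node, as root:
#   $ORACLE_HOME/bin/crsctl check crs | crs_all_online && echo "CRS stack is online"
```

Because the stack can take some time to start, the check may need to be repeated before it succeeds.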

2. Validate that instances are running:

# ps -ef | grep pmon

It should return a record for the ASM instance and a record for each database.
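
To list only the instance names, the process listing can be filtered further. The following is a minimal sketch, assuming the pmon background processes are named asm_pmon_<SID> for ASM and ora_pmon_<SID> for databases. pmon_instances is a hypothetical helper name.

```shell
# pmon_instances: hypothetical helper (not an Oracle tool).
# Reads `ps -ef` output on stdin and prints the instance name taken from
# every asm_pmon_<SID> or ora_pmon_<SID> process name, one per line.
pmon_instances() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(asm|ora)_pmon_/) {
                sub(/^(asm|ora)_pmon_/, "", $i)
                print $i
            }
    }'
}

# Intended use on the node:
#   ps -ef | pmon_instances
```

The grep command itself never appears in this list, because its command line does not contain a pmon process name in that form.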

For Compute Node running OVM

If the customer requires assistance, please ask them to contact an EEST engineer or the parent case owner.

Once the compute node has booted, re-enable user domains to autostart during Domain-0 boot:

# chkconfig xendomains on

Start up all user domains that are marked for auto start:

# service xendomains start

See what user domains are running (compare against the result recorded before shutdown):

# xm list

If any user domain did not auto-start, start it individually:

# xm create -c /EXAVMIMAGES/GuestImages/DomainName/vm.cfg
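
The comparison against the previously recorded domains can be done mechanically. The following is a minimal sketch, assuming the domain names were saved one per line before shutdown and again after startup. missing_domains is a hypothetical helper name and the file names in the usage comment are illustrative only.

```shell
# missing_domains: hypothetical helper (not part of the xm tool set).
# $1 and $2 are files with one domain name per line: $1 recorded before
# shutdown, $2 taken after startup.
# Prints every name in $1 that is absent from $2.
missing_domains() {
    awk -v after="$2" '
        BEGIN { while ((getline name < after) > 0) running[name] = 1 }
        !($0 in running)
    ' "$1"
}

# Intended use in dom0 (file names are illustrative):
#   missing_domains /tmp/user_domains.before /tmp/user_domains.after
```

Each name printed is a user domain that still needs to be started with the xm create command shown above.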

Check that crs has started in the user domains; refer to the previous section "DB Node Startup Verification".

4. Also verify that the InfiniBand links are up at 40 Gb/sec, since the cables were disconnected:

# /usr/sbin/ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0021:2800:013e:70bb
base lid: 0x50
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:0021:2800:013e:70bc
base lid: 0x51
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
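
The three values to confirm per port are the ACTIVE state, the LinkUp physical state, and the 40 Gb/sec rate. The following is a minimal sketch of that check, assuming the ibstatus output format shown in the example above. ib_links_ok is a hypothetical helper name, and the default of 2 ports is taken from the example.

```shell
# ib_links_ok: hypothetical helper (not part of the InfiniBand tools).
# $1 is the number of ports expected (default 2). Reads ibstatus output on
# stdin and succeeds only when exactly that many ports report ACTIVE,
# LinkUp, and a rate of 40 Gb/sec.
ib_links_ok() {
    awk -v ports="${1:-2}" '
        /^[ \t]*state:/      && /ACTIVE/     { active++ }
        /^[ \t]*phys state:/ && /LinkUp/     { linkup++ }
        /^[ \t]*rate:/       && /40 Gb\/sec/ { rate++ }
        END {
            if (active + 0 == ports + 0 && linkup + 0 == ports + 0 && rate + 0 == ports + 0) exit 0
            exit 1
        }
    '
}

# Intended use on the node, as root:
#   /usr/sbin/ibstatus | ib_links_ok 2 && echo "both IB links are up at 40 Gb/sec"
```

If the check fails, reseat the IB cables and confirm that each one is in its original port.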

OBTAIN CUSTOMER ACCEPTANCE
WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE?:

- Verify that HW Components and SW Components are returned to a properly functioning state, with the server up and database services
operating on the DB Servers.

PARTS NOTE:

REFERENCE INFORMATION:

1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.

Service Manuals:
X5-2 DB’s: ( http://docs.oracle.com/cd/E41059_01/html/E48312/napsm.html#scrolltoc )

Attachments

This solution has no attachment