Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1990221.1
Update Date:2018-04-05

Solution Type: Technical Instruction

Solution 1990221.1: How to Replace an Exadata X5-2/X6-2 Storage Cell RAID HBA


Related Items
  • Oracle SuperCluster T5-8 Full Rack
  • Oracle SuperCluster M7 Hardware
  • Zero Data Loss Recovery Appliance X6 Hardware
  • Exadata SL6 Hardware
  • Exadata X5-8 Hardware
  • Exadata X6-8 Hardware
  • Exadata X5-2 Eighth Rack
  • Exadata X5-2 Full Rack
  • Exadata X6-2 Hardware
  • Exadata X5-2 Hardware
  • Oracle SuperCluster T5-8 Half Rack
  • Exadata X5-2 Quarter Rack
  • Exadata X4-8 Hardware
  • Zero Data Loss Recovery Appliance X5 Hardware
  • Exadata X5-2 Half Rack
  • Exadata Cloud at Customer X6-2 Hardware
  • Oracle SuperCluster T5-8 Hardware
  • Oracle SuperCluster M6-32 Hardware
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Action plan to replace the RAID card in an Exadata X5 storage cell

Applies to:

Zero Data Loss Recovery Appliance X6 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Exadata X5-2 Quarter Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

  How to Replace an Exadata X5-2/X6-2 Storage Cell RAID HBA

Solution

 DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata Trained


TIME ESTIMATE: 90 Minutes
TASK COMPLEXITY: 3


FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW: A faulty RAID HBA in an Exadata X5-2/X6-2 Storage Cell has been diagnosed as needing replacement.


WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

- The server that contains the faulty HBA should have its services offline and the system powered off.


WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

The instructions below assume the customer DBA is available and working with the field engineer onsite to manage the host OS and
DB/ASM services. All steps are provided here so the FE has everything needed when onsite; the FE may perform the DBA steps if the
customer requests, permits, or needs help with them.


Step A. Pre-Steps to shutdown the node for servicing:

1. For extended information on this section, see MOS Note:

ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM

 

This is also documented in the Exadata Owner's Guide, Chapter 7, section "Maintaining Exadata Storage Servers", subsection "Shutting Down Exadata Storage Server", available on the customer's cell server image in the /opt/oracle/cell/doc directory.

 

Available to Oracle internally here:  http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/doc/doc.112/e13874/maintenance.htm#DBMOG21129

 

In the following examples, the SQL commands should be run by the customer's DBA prior to the hardware replacement; the field engineer should run them only if the customer directs them to or is unable to run them. The cellcli commands will need to be run as root.

 

2. ASM drops a disk shortly after it is taken offline. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but it may have been changed by the customer. To check this attribute, have the customer log into ASM and run the following query:

 

SQL> select dg.name, a.value from v$asm_attribute a, v$asm_diskgroup dg
     where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

 

As long as the value is large enough to comfortably complete the replacement, there is no need to change it.
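
If the customer DBA decides the window is too short for this maintenance, the attribute can be raised before the disks are taken offline. A minimal sketch, where the diskgroup name DATA and the value '8.5h' are examples only:

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';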

 

3. Check if ASM will be OK if the grid disks go OFFLINE.

 

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome

 ...sample ...

     CATALOG_CD_09_zdlx5_tvp_a_cel3  ONLINE  Yes
     CATALOG_CD_10_zdlx5_tvp_a_cel3  ONLINE  Yes
     CATALOG_CD_11_zdlx5_tvp_a_cel3  ONLINE  Yes
     DELTA_CD_00_zdlx5_tvp_a_cel3    ONLINE  Yes
     DELTA_CD_01_zdlx5_tvp_a_cel3    ONLINE  Yes
     DELTA_CD_02_zdlx5_tvp_a_cel3    ONLINE  Yes


...repeated for all griddisks....

 

If one or more disks return asmdeactivationoutcome='No', wait a few minutes and repeat the check. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
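
Rather than re-running the check by hand, the wait can be scripted. A minimal sketch to run as root on the cell (the 60-second interval and the field test are assumptions, not part of the standard procedure):

# Re-check every 60s until no grid disk reports asmdeactivationoutcome='No'
while cellcli -e "list griddisk attributes name,asmdeactivationoutcome" | awk '$2 == "No"' | grep -q .; do
    echo "$(date): waiting for all grid disks to report asmdeactivationoutcome=Yes"
    sleep 60
done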

 

4. Run the CellCLI command below to inactivate all grid disks on the cell that is to be powered down for maintenance (this may take 10 minutes or longer):

 

# cellcli

CellCLI> ALTER GRIDDISK ALL INACTIVE

 

...sample ...

GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered

...repeated for all griddisks...

 

5. Execute the command below. Once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.

 

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome

         CATALOG_CD_09_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         CATALOG_CD_10_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         CATALOG_CD_11_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         DELTA_CD_00_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes
         DELTA_CD_01_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes
         DELTA_CD_02_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes

...repeated for all griddisks...

 

6. Revert all the RAID disk volumes to WriteThrough cache mode to ensure all data in the RAID cache memory is flushed to disk and not lost when the HBA is replaced. As the 'root' user, set all logical volumes to WriteThrough cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify the current cache policy for all logical volumes is now WriteThrough:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
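
A representative sample of the expected output after the change (the exact flag set may vary with the firmware level; the key item is 'Current Cache Policy' reporting WriteThrough):

 Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

...repeated for all 12 logical drives...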

 

7. Once all disks are offline and inactive, the customer may shut down the cell using the following command:

# shutdown -hP now

 

8. The field engineer can now slide the server out for maintenance. Do not remove any cables before sliding the server forward, or the loose cable ends will jam in the cable management arm. Take care to ensure the cables and the cable management arm (CMA) move properly. Refer to Note 1444683.1 for CMA handling training.

 

Remember to disconnect the power cords before opening the top cover of the server.

 

Step B. Physical RAID Card replacement

Reference links for Service Manual:
X5-2L : ( http://docs.oracle.com/cd/E41033_01/html/E48325/cnpsm.html#scrolltoc )

Remove the old HBA PCI Card

1. Swivel the air baffle into the upright position to allow access to the super capacitor cable and the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card in PCIe slot 6.

2. Rotate the PCIe card locking mechanism, and then lift up on the PCIe HBA card to disengage it from the motherboard connectors.

3. Disconnect the super capacitor cable and the SAS cables from the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card.

4. Lift and remove the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card from the chassis.


Install the new HBA PCI Card
1. Connect the super capacitor cable to the Oracle Storage 12 Gb/s SAS PCIe RAID HBA, and then reconnect the SAS cables that you unplugged during the removal procedure.
2. Insert the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card into PCIe slot 6, and rotate the PCIe locking mechanism to secure the PCIe HBA card in place.


Step C. Post-Replacement RAID Card additional steps

1. Once the power cords have been re-attached, slide the server back into the rack.

2. Once the ILOM has booted, you will see a slow blink on the server's green OK LED. Power on the server by pressing the power button on the front of the unit.

 

Step D. Server Services Startup Validation

1. As the system boots, the hardware/firmware profile is checked. Either a green "Passed" is displayed, or a red "Warning" is shown if the firmware on the HBA differs from what the image expects. If the check passes, the firmware is correct and the boot continues up to the OS login prompt. If the check fails, the firmware is automatically updated and the server reboots again. Monitor the console to ensure this completes properly.
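
If there is any doubt that the automatic update completed, the running HBA firmware package can be checked once the OS is up. A sketch (the version string returned depends on the cell image, so compare it against the level the image expects):

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -a0 | grep -i "FW Package"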

2. After the OS is up, login as root and verify all the expected devices are present:

The following command should show 12 disks:

 

# lsscsi | grep -i LSI

[0:2:0:0]    disk    LSI      MR9361-8i        4.23  /dev/sda
[0:2:1:0]    disk    LSI      MR9361-8i        4.23  /dev/sdb
[0:2:2:0]    disk    LSI      MR9361-8i        4.23  /dev/sdc
[0:2:3:0]    disk    LSI      MR9361-8i        4.23  /dev/sdd
[0:2:4:0]    disk    LSI      MR9361-8i        4.23  /dev/sde
[0:2:5:0]    disk    LSI      MR9361-8i        4.23  /dev/sdf
[0:2:6:0]    disk    LSI      MR9361-8i        4.23  /dev/sdg
[0:2:7:0]    disk    LSI      MR9361-8i        4.23  /dev/sdh
[0:2:8:0]    disk    LSI      MR9361-8i        4.23  /dev/sdi
[0:2:9:0]    disk    LSI      MR9361-8i        4.23  /dev/sdj
[0:2:10:0]   disk    LSI      MR9361-8i        4.23  /dev/sdk
[0:2:11:0]   disk    LSI      MR9361-8i        4.23  /dev/sdl
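
As a quick sanity check, the devices can simply be counted. A sketch; the command below should return 12:

# lsscsi | grep -ci lsi
12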

 

If the device count is not correct, also check that the LSI controller has the correct virtual drives configured and in Optimal state, with the physical disks Online and spun up and no foreign configuration present. There should be Virtual Drives 0 to 11, and each of the physical slots 0 to 11 should be allocated to one virtual drive (not necessarily in the same 0:0, 1:1, etc. mapping).

 

# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep "Virtual Drive\|State\|Slot\|Firmware state"

 Virtual Drive: 0 (Target Id: 0)
State               : Optimal
Slot Number: 0
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 1 (Target Id: 1)
State               : Optimal
Slot Number: 1
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 2 (Target Id: 2)
State               : Optimal
Slot Number: 2
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 3 (Target Id: 3)
State               : Optimal
Slot Number: 3
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 4 (Target Id: 4)
State               : Optimal
Slot Number: 4
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 5 (Target Id: 5)
State               : Optimal
Slot Number: 5
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 6 (Target Id: 6)
State               : Optimal
Slot Number: 6
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 7 (Target Id: 7)
State               : Optimal
Slot Number: 7
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 8 (Target Id: 8)
State               : Optimal
Slot Number: 8
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 9 (Target Id: 9)
State               : Optimal
Slot Number: 9
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 10 (Target Id: 10)
State               : Optimal
Slot Number: 10
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 11 (Target Id: 11)
State               : Optimal
Slot Number: 11
Firmware state: Online, Spun Up
Foreign State: None
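
The virtual drive count and foreign-configuration state can also be queried directly. A sketch (do not clear any foreign configuration that is found without an action plan from support):

# /opt/MegaRAID/MegaCli/MegaCli64 -LDGetNum -a0
# /opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Scan -a0

The first command should report 12 virtual drives configured on adapter 0, and the second should report 0 foreign configurations.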

 

Check the status of the Super Cap:

 

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0

BBU status for Adapter: 0

BatteryType: CVPM02
Voltage: 9450 mV
Current: 0 mA
Temperature: 29 C
Battery State: Optimal
BBU Firmware Status:


...Output truncated...

 

If this is not correct, there is a problem with the disk volumes that may need additional assistance to correct. The server should be re-opened and the device connections and boards checked to ensure they are secure and well seated BEFORE the following commands are issued.

3. Set all logical drives cache policy to WriteBack cache mode:

 

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0

 

Verify the current cache policy for all logical drives is now using WriteBack cache mode:

 

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU

 Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

 

4. Also verify that the InfiniBand links are up at 40 Gb/sec, since cables were disconnected during the service action:

 

# /usr/sbin/ibstatus

 Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0010:e000:0159:c61d
        base lid:        0x9
        sm lid:          0x2
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      IB

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0010:e000:0159:c61e
        base lid:        0xa
        sm lid:          0x2
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      IB
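
A quick filter (a sketch) to confirm both ports at a glance; every line returned should show ACTIVE, LinkUp, or 40 Gb/sec:

# /usr/sbin/ibstatus | egrep "state|rate"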

 

5. Once the hardware is verified as up and running, the Customer's DBA will need to activate the grid disks:

 

# cellcli

 CellCLI> alter griddisk all active

GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered

...repeated for all griddisks...

 

Issue the command below and all disks should show 'active':

 

CellCLI> list griddisk

         CATALOG_CD_09_zdlx5_tvp_a_cel3  active
         CATALOG_CD_10_zdlx5_tvp_a_cel3  active
         CATALOG_CD_11_zdlx5_tvp_a_cel3  active
         DELTA_CD_00_zdlx5_tvp_a_cel3    active
         DELTA_CD_01_zdlx5_tvp_a_cel3    active
         DELTA_CD_02_zdlx5_tvp_a_cel3    active

...repeated for all griddisks...

 

6. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks and no longer SYNCING. The following is an example of the output early in the activation process.

 

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome

         CATALOG_CD_09_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         CATALOG_CD_10_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         CATALOG_CD_11_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         DELTA_CD_00_zdlx5_tvp_a_cel3    active  SYNCING         Yes
         DELTA_CD_01_zdlx5_tvp_a_cel3    active  SYNCING         Yes
         DELTA_CD_02_zdlx5_tvp_a_cel3    active  SYNCING         Yes

...repeated for all griddisks...

 

Notice in the example above that the grid disks are still SYNCING. Oracle ASM synchronization is complete only when ALL grid disks show asmmodestatus=ONLINE. This can take some time, depending on how busy the machine is now and how busy it was while this server was down for repair.
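
The resync can be watched without retyping the command. A minimal sketch to run as root on the cell (the 60-second interval is an arbitrary choice):

# Re-check every 60s until no grid disk still reports SYNCING
while cellcli -e "list griddisk attributes name,asmmodestatus" | grep -q SYNCING; do
    echo "$(date): grid disks still syncing"
    sleep 60
done
cellcli -e "list griddisk attributes name,asmmodestatus"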

 

OBTAIN CUSTOMER ACCEPTANCE

- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

 

- Verify that the HW and SW components are returned to a properly functioning state, with the server up and all ASM disks online on the storage servers.

 

REFERENCE INFORMATION:

1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.

1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM


Attachments
This solution has no attachment