Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1990221.1
Update Date:2018-04-05

Solution Type: Technical Instruction

Solution 1990221.1: How to Replace an Exadata X5-2/X6-2 Storage Cell RAID HBA


Related Items
  • Oracle SuperCluster T5-8 Full Rack
  • Oracle SuperCluster M7 Hardware
  • Zero Data Loss Recovery Appliance X6 Hardware
  • Exadata SL6 Hardware
  • Exadata X5-8 Hardware
  • Exadata X6-8 Hardware
  • Exadata X5-2 Eighth Rack
  • Exadata X5-2 Full Rack
  • Exadata X6-2 Hardware
  • Exadata X5-2 Hardware
  • Oracle SuperCluster T5-8 Half Rack
  • Exadata X5-2 Quarter Rack
  • Exadata X4-8 Hardware
  • Zero Data Loss Recovery Appliance X5 Hardware
  • Exadata X5-2 Half Rack
  • Exadata Cloud at Customer X6-2 Hardware
  • Oracle SuperCluster T5-8 Hardware
  • Oracle SuperCluster M6-32 Hardware
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
References


Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Action plan to replace the RAID card in an Exadata X5 storage cell

Applies to:

Zero Data Loss Recovery Appliance X6 Hardware - Version All Versions and later
Oracle SuperCluster T5-8 Hardware - Version All Versions and later
Oracle SuperCluster M6-32 Hardware - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Exadata X5-2 Quarter Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

  How to Replace an Exadata X5-2/X6-2 Storage Cell RAID HBA

Solution

 DISPATCH INSTRUCTIONS
WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Exadata Trained


TIME ESTIMATE: 90 Minutes
TASK COMPLEXITY: 3


FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW: A faulty RAID HBA in an Exadata X5-2/X6-2 Storage Cell has been diagnosed as needing replacement.


WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:

- The server that contains the faulty HBA should have its services offline and the system powered off.


WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

The instructions below assume the customer DBA is available and working with the field engineer onsite to manage the host OS and
DB/ASM services. All steps are provided here so the FE has everything needed when onsite; the FE may perform the DBA steps if the
customer requests, permits, or needs help with them.


Step A. Pre-Steps to shutdown the node for servicing:

1. For extended information on this section, see MOS Note:

ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM

 

This is also documented in the Exadata Owner's Guide, Chapter 7, section "Maintaining Exadata Storage Servers", subsection "Shutting Down Exadata Storage Server", available on the customer's cell server image in the /opt/oracle/cell/doc directory.

 

Available to Oracle internally here:  http://amomv0115.us.oracle.com/archive/cd_ns/E13877_01/doc/doc.112/e13874/maintenance.htm#DBMOG21129

 

In the following examples, the SQL commands should be run by the customer's DBA prior to the hardware replacement; the field engineer should run them only if the customer directs them to or is unable to run them. The cellcli commands will need to be run as root.

 

2. ASM drops a disk shortly after it is taken offline. The default DISK_REPAIR_TIME attribute value of 3.6 hours should be adequate for replacing components, but it may have been changed by the customer. To check this attribute, have the customer log into ASM and run the following query:

 

SQL> select dg.name, a.value from v$asm_attribute a, v$asm_diskgroup dg
     where a.name = 'disk_repair_time' and a.group_number = dg.group_number;

 

As long as the value is large enough to comfortably complete the replacement, there is no need to change it.
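
If the customer DBA decides the window is too short for this maintenance, the attribute can be raised before the disks are taken offline. A minimal sketch, where the diskgroup name DATA and the value '8.5h' are examples only:

SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';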

 

3. Check if ASM will be OK if the grid disks go OFFLINE.

 

# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome

 ...sample ...

     CATALOG_CD_09_zdlx5_tvp_a_cel3  ONLINE  Yes
     CATALOG_CD_10_zdlx5_tvp_a_cel3  ONLINE  Yes
     CATALOG_CD_11_zdlx5_tvp_a_cel3  ONLINE  Yes
     DELTA_CD_00_zdlx5_tvp_a_cel3    ONLINE  Yes
     DELTA_CD_01_zdlx5_tvp_a_cel3    ONLINE  Yes
     DELTA_CD_02_zdlx5_tvp_a_cel3    ONLINE  Yes


...repeated for all griddisks....

 

If one or more disks return asmdeactivationoutcome='No', wait a few minutes and repeat the check. Once all disks return asmdeactivationoutcome='Yes', proceed to the next step.
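
Rather than re-running the check by hand, the wait can be scripted. A minimal sketch to run as root on the cell (the 60-second interval and the field test are assumptions, not part of the standard procedure):

# Re-check every 60s until no grid disk reports asmdeactivationoutcome='No'
while cellcli -e "list griddisk attributes name,asmdeactivationoutcome" | awk '$2 == "No"' | grep -q .; do
    echo "$(date): waiting for all grid disks to report asmdeactivationoutcome=Yes"
    sleep 60
done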

 

4. Run the CellCLI command below to inactivate all grid disks on the cell that is to be powered down for maintenance (this may take 10 minutes or longer):

 

# cellcli

CellCLI> ALTER GRIDDISK ALL INACTIVE

 

...sample ...

GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered

...repeated for all griddisks...

 

5. Execute the command below. Once the disks are offline and inactive in ASM, the output should show asmmodestatus='UNUSED' or 'OFFLINE' and asmdeactivationoutcome='Yes' for all grid disks.

 

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome

         CATALOG_CD_09_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         CATALOG_CD_10_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         CATALOG_CD_11_zdlx5_tvp_a_cel3  inactive        OFFLINE         Yes
         DELTA_CD_00_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes
         DELTA_CD_01_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes
         DELTA_CD_02_zdlx5_tvp_a_cel3    inactive        OFFLINE         Yes

...repeated for all griddisks...

 

6. Revert all the RAID disk volumes to WriteThrough cache mode to ensure all data in the RAID cache memory is flushed to disk and not lost when the HBA is replaced. As the 'root' user, set all logical volumes to WriteThrough cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify the current cache policy for all logical volumes is now WriteThrough:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU
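
A representative sample of the expected output after the change (the exact flag set may vary with the firmware level; the key item is 'Current Cache Policy' reporting WriteThrough):

 Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU

...repeated for all 12 logical drives...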

 

7. Once all disks are offline and inactive, the customer may shut down the cell using the following command:

# shutdown -hP now

 

8. The field engineer can now slide the server out for maintenance. Do not remove any cables before sliding the server forward, or the loose cable ends will jam in the cable management arm. Take care to ensure the cables and the cable management arm (CMA) move properly. Refer to Note 1444683.1 for CMA handling training.

 

Remember to disconnect the power cords before opening the top cover of the server.

 

Step B. Physical RAID Card replacement

Reference links for Service Manual:
X5-2L : ( http://docs.oracle.com/cd/E41033_01/html/E48325/cnpsm.html#scrolltoc )

Remove the old HBA PCI Card

1. Swivel the air baffle into the upright position to allow access to the super capacitor cable and the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card in PCIe slot 6.

2. Rotate the PCIe card locking mechanism, and then lift up on the PCIe HBA card to disengage it from the motherboard connectors.

3. Disconnect the super capacitor cable and the SAS cables from the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card.

4. Lift and remove the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card from the chassis.


Install the new HBA PCI Card
1. Connect the super capacitor cable to the Oracle Storage 12 Gb/s SAS PCIe RAID HBA, and then reconnect the SAS cables that you unplugged during the removal procedure.
2. Insert the Oracle Storage 12 Gb/s SAS PCIe RAID HBA card into PCIe slot 6, and rotate the PCIe locking mechanism to secure the PCIe HBA card in place.


Step C. Post-Replacement RAID Card additional steps

1. Once the power cords have been re-attached, slide the server back into the rack.

2. Once the ILOM has booted, you will see a slow blink on the server's green OK LED. Power on the server by pressing the power button on the front of the unit.

 

Step D. Server Services Startup Validation

1. As the system boots, the hardware/firmware profile is checked. Either a green "Passed" is displayed, or a red "Warning" is shown if the firmware on the HBA differs from what the image expects. If the check passes, the firmware is correct and the boot continues up to the OS login prompt. If the check fails, the firmware is automatically updated and the server reboots again. Monitor the console to ensure this completes properly.
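
If there is any doubt that the automatic update completed, the running HBA firmware package can be checked once the OS is up. A sketch (the version string returned depends on the cell image, so compare it against the level the image expects):

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -a0 | grep -i "FW Package"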

2. After the OS is up, login as root and verify all the expected devices are present:

The following command should show 12 disks:

 

# lsscsi | grep -i LSI

[0:2:0:0]    disk    LSI      MR9361-8i        4.23  /dev/sda
[0:2:1:0]    disk    LSI      MR9361-8i        4.23  /dev/sdb
[0:2:2:0]    disk    LSI      MR9361-8i        4.23  /dev/sdc
[0:2:3:0]    disk    LSI      MR9361-8i        4.23  /dev/sdd
[0:2:4:0]    disk    LSI      MR9361-8i        4.23  /dev/sde
[0:2:5:0]    disk    LSI      MR9361-8i        4.23  /dev/sdf
[0:2:6:0]    disk    LSI      MR9361-8i        4.23  /dev/sdg
[0:2:7:0]    disk    LSI      MR9361-8i        4.23  /dev/sdh
[0:2:8:0]    disk    LSI      MR9361-8i        4.23  /dev/sdi
[0:2:9:0]    disk    LSI      MR9361-8i        4.23  /dev/sdj
[0:2:10:0]   disk    LSI      MR9361-8i        4.23  /dev/sdk
[0:2:11:0]   disk    LSI      MR9361-8i        4.23  /dev/sdl
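
As a quick sanity check, the devices can simply be counted. A sketch; the command below should return 12:

# lsscsi | grep -ci lsi
12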

 

If the device count is not correct, also check that the LSI controller has the correct virtual drives configured and in Optimal state, with the physical disks Online and spun up and no foreign configuration present. There should be Virtual Drives 0 to 11, and each of the physical slots 0 to 11 should be allocated to one virtual drive (not necessarily in the same 0:0, 1:1, etc. mapping).

 

# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep "Virtual Drive\|State\|Slot\|Firmware state"

 Virtual Drive: 0 (Target Id: 0)
State               : Optimal
Slot Number: 0
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 1 (Target Id: 1)
State               : Optimal
Slot Number: 1
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 2 (Target Id: 2)
State               : Optimal
Slot Number: 2
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 3 (Target Id: 3)
State               : Optimal
Slot Number: 3
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 4 (Target Id: 4)
State               : Optimal
Slot Number: 4
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 5 (Target Id: 5)
State               : Optimal
Slot Number: 5
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 6 (Target Id: 6)
State               : Optimal
Slot Number: 6
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 7 (Target Id: 7)
State               : Optimal
Slot Number: 7
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 8 (Target Id: 8)
State               : Optimal
Slot Number: 8
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 9 (Target Id: 9)
State               : Optimal
Slot Number: 9
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 10 (Target Id: 10)
State               : Optimal
Slot Number: 10
Firmware state: Online, Spun Up
Foreign State: None
Virtual Drive: 11 (Target Id: 11)
State               : Optimal
Slot Number: 11
Firmware state: Online, Spun Up
Foreign State: None
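
The virtual drive count and foreign-configuration state can also be queried directly. A sketch (do not clear any foreign configuration that is found without an action plan from support):

# /opt/MegaRAID/MegaCli/MegaCli64 -LDGetNum -a0
# /opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Scan -a0

The first command should report 12 virtual drives configured on adapter 0, and the second should report 0 foreign configurations.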

 

Check the status of the Super Cap:

 

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0

BBU status for Adapter: 0

BatteryType: CVPM02
Voltage: 9450 mV
Current: 0 mA
Temperature: 29 C
Battery State: Optimal
BBU Firmware Status:


...Output truncated...

 

If this is not correct, there is a problem with the disk volumes that may need additional assistance to correct. The server should be re-opened and the device connections and boards checked to ensure they are secure and well seated BEFORE the following commands are issued.

3. Set all logical drives cache policy to WriteBack cache mode:

 

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0

 

Verify the current cache policy for all logical drives is now using WriteBack cache mode:

 

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU

 Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

 

4. Also verify that the InfiniBand links are up at 40 Gb/sec, since cables were disconnected during the service action:

 

# /usr/sbin/ibstatus

 Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0010:e000:0159:c61d
        base lid:        0x9
        sm lid:          0x2
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      IB

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0010:e000:0159:c61e
        base lid:        0xa
        sm lid:          0x2
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      IB
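
A quick filter (a sketch) to confirm both ports at a glance; every line returned should show ACTIVE, LinkUp, or 40 Gb/sec:

# /usr/sbin/ibstatus | egrep "state|rate"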

 

5. Once the hardware is verified as up and running, the Customer's DBA will need to activate the grid disks:

 

# cellcli

 CellCLI> alter griddisk all active

GridDisk CATALOG_CD_09_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_10_zdlx5_tvp_a_cel3 successfully altered
GridDisk CATALOG_CD_11_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_00_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_01_zdlx5_tvp_a_cel3 successfully altered
GridDisk DELTA_CD_02_zdlx5_tvp_a_cel3 successfully altered

...repeated for all griddisks...

 

Issue the command below and all disks should show 'active':

 

CellCLI> list griddisk

         CATALOG_CD_09_zdlx5_tvp_a_cel3  active
         CATALOG_CD_10_zdlx5_tvp_a_cel3  active
         CATALOG_CD_11_zdlx5_tvp_a_cel3  active
         DELTA_CD_00_zdlx5_tvp_a_cel3    active
         DELTA_CD_01_zdlx5_tvp_a_cel3    active
         DELTA_CD_02_zdlx5_tvp_a_cel3    active

...repeated for all griddisks...

 

6. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks and no longer SYNCING. The following is an example of the output early in the activation process.

 

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome

         CATALOG_CD_09_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         CATALOG_CD_10_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         CATALOG_CD_11_zdlx5_tvp_a_cel3  active  SYNCING         Yes
         DELTA_CD_00_zdlx5_tvp_a_cel3    active  SYNCING         Yes
         DELTA_CD_01_zdlx5_tvp_a_cel3    active  SYNCING         Yes
         DELTA_CD_02_zdlx5_tvp_a_cel3    active  SYNCING         Yes

...repeated for all griddisks...

 

Notice in the example above that the grid disks are still SYNCING. Oracle ASM synchronization is complete only when ALL grid disks show asmmodestatus=ONLINE. This can take some time, depending on how busy the machine is now and how busy it was while this server was down for repair.
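
The resync can be watched without retyping the command. A minimal sketch to run as root on the cell (the 60-second interval is an arbitrary choice):

# Re-check every 60s until no grid disk still reports SYNCING
while cellcli -e "list griddisk attributes name,asmmodestatus" | grep -q SYNCING; do
    echo "$(date): grid disks still syncing"
    sleep 60
done
cellcli -e "list griddisk attributes name,asmmodestatus"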

 

OBTAIN CUSTOMER ACCEPTANCE

- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

 

- Verify that the HW and SW components are returned to a properly functioning state, with the server up and all ASM disks online on the storage servers.

 

REFERENCE INFORMATION:

1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.

1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM


Attachments
This solution has no attachment