Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1388322.1
Update Date: 2017-04-14
Keywords:

Solution Type  Technical Instruction

Solution  1388322.1 :   How to Replace a Faulty RAID HBA on an Exadata server (V2 - X4-2/X4-8)  


Related Items
  • Oracle SuperCluster T5-8 Full Rack
  • Exadata X4-2 Hardware
  • Exadata X3-2 Hardware
  • SPARC SuperCluster T4-4 Full Rack
  • Exadata Database Machine X2-2 Qtr Rack
  • Exadata X3-2 Half Rack
  • Oracle SuperCluster T5-8 Half Rack
  • Exadata X4-2 Quarter Rack
  • Exadata X4-8 Hardware
  • Exadata X3-2 Full Rack
  • Exadata Database Machine X2-8
  • Exadata Database Machine X2-2 Full Rack
  • Exadata X4-2 Half Rack
  • Zero Data Loss Recovery Appliance X4 Hardware
  • Exadata X3-8 Hardware
  • Exadata Database Machine X2-2 Half Rack
  • SPARC SuperCluster T4-4 Half Rack
  • Exadata Database Machine X2-2 Hardware
  • Exadata X3-2 Eighth Rack
  • Exadata X4-2 Full Rack
  • Oracle SuperCluster M6-32 Hardware
  • Oracle SuperCluster T5-8 Hardware
  • Exadata X4-2 Eighth Rack
  • Exadata X3-2 Quarter Rack
  • Exadata X3-8b Hardware
  • SPARC SuperCluster T4-4
  • Exadata Database Machine V2

Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP


This CAP explains how to replace a faulty RAID HBA in an Exadata node (V2 through X4-8).

Oracle Confidential PARTNER - Available to partners (SUN).
Reason: Exadata Premier Support indicates this is a FRU only.

Applies to:

Exadata X4-2 Quarter Rack - Version All Versions and later
Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Exadata X4-2 Half Rack - Version All Versions and later
Exadata X4-2 Full Rack - Version All Versions and later
Exadata Database Machine V2 - Version All Versions and later
Information in this document applies to any platform.

Goal

How to Replace a Faulty RAID HBA on Exadata (V2 through X4-2/X4-8) successfully.

Solution


DISPATCH INSTRUCTIONS:
- WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED: Exadata Trained
- TIME ESTIMATE: 90 Minutes
- TASK COMPLEXITY: 3

FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
- PROBLEM OVERVIEW: A faulty RAID HBA in an Exadata node has been diagnosed as needing replacement.

- WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE
RESOLUTION ACTIVITY?:
- The server that contains the faulty HBA should have its services offline and system powered off.

- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE:

The complete procedure below is also attached to this Note in PDF format with screenshot images embedded.

The instructions below assume the customer DBA is available and working with the field engineer onsite to manage the host OS and DB/ASM services. They are provided here so the FE has all the necessary steps available when onsite; the FE may perform the DBA steps if the customer wants, allows, or needs help with them.

Step A. Pre-Steps to shutdown the node for servicing:

For DB Nodes:

1. To shutdown the compute node check MOS Note:
ID 1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.

Or refer to
How to Shutdown and Startup Exadata X5 compute nodes and storage cells when performing hardware maintenance (includes Supercluster X5 storage cells) (Doc ID 1982342.1)

It is highly recommended to make and verify a backup of all disk partitions prior to RAID HBA replacement.

 

2. Revert all the RAID disk volumes to WriteThrough mode to ensure all data in the RAID cache memory is flushed to disk and not lost when replacement of the HBA occurs. Set all logical volumes cache policy to WriteThrough cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify the current cache policy for all logical volumes is now WriteThrough:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU  
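The set-and-verify pair above can be turned into a small scripted check against the captured MegaCli output. This is a sketch only: the `cache_policy_ok` helper is not part of MegaCli, and the sample "Current Cache Policy" line is an assumption based on typical MegaCli output, so verify the grep pattern against the live node before relying on it.

```shell
# Hypothetical helper: succeed only if every "Current Cache Policy"
# line in the captured output names the expected cache mode.
# Usage: cache_policy_ok <expected-mode> <captured-output>
cache_policy_ok() {
  expected="$1"; output="$2"
  # Fail if any cache-policy line does NOT contain the expected mode.
  ! printf '%s\n' "$output" \
      | grep 'Current Cache Policy' \
      | grep -qv "$expected"
}

# Sample line in the format MegaCli typically prints; check yours.
sample='Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU'

if cache_policy_ok WriteThrough "$sample"; then
  echo "All logical drives are in WriteThrough mode"
else
  echo "WARNING: some logical drives are not in WriteThrough mode"
fi
```

The same helper can be reused after the replacement (Step D) by passing `WriteBack` as the expected mode.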


3. The customer can now shutdown the server operating system:

Linux:

# shutdown -hP now  

Solaris:

# shutdown -y -i 5 -g 0

4. The field engineer can now slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arm (CMA). Ensure all customer-added data network cables are properly dressed into the CMA, and take care to ensure the cables and CMA are moving properly. Refer to Note 1444683.1 for CMA handling training.

Remember to disconnect the power cords before opening the top of the server.

For Storage Cells:


1. To shutdown a cell check MOS Note:
ID 1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM

Or refer to
How to Shutdown and Startup Exadata X5 compute nodes and storage cells when performing hardware maintenance (includes Supercluster X5 storage cells) (Doc ID 1982342.1)


SQL commands should be run by the customer's DBA prior to the hardware replacement; the field engineer should run them only if the customer directs, or if the customer is unable to run them. The CellCLI commands will need to be run as root.

2. Revert all the RAID disk volumes to WriteThrough mode to ensure all data in the RAID cache memory is flushed to disk and not lost when replacement of the HBA occurs. Set all logical volumes cache policy to WriteThrough cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wt -lall -a0

Verify the current cache policy for all logical volumes is now WriteThrough:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU

 

3. Once all disks are offline and inactive, the customer may shutdown the Cell using the following command:

# shutdown -hP now

4. The field engineer can now slide out the server for maintenance. Do not remove any cables prior to sliding the server forward, or the loose cable ends will jam in the cable management arm (CMA). Take care to ensure the cables and CMA are moving properly. Refer to Note 1444683.1 for CMA handling training.

Remember to disconnect the power cords before opening the top of the server.

Step B. Physical RAID Card replacement

Exadata X2-2 Database Machine Compute nodes and all Storage Cells:

These steps are relevant to Exadata nodes based on x4170, x4170M2, x4275, and x4270M2.

Remove the old HBA PCI Card

1. On storage cells, remove the IB cables from the IB card in slot 3 above the HBA, making a note of which port each cable goes into so they can go back into the same ports.

2. Remove the back panel PCI cross bar.

3. Remove the PCI riser containing the PCI card to be serviced.

4. Disconnect the SAS cables from the PCI card, making a note of which port each cable goes into so they can go back into the same port.

5. Extract the RAID HBA card from the PCI riser assembly.

Remove the HBA's battery from the old HBA

1. Use a No. 1 Phillips screwdriver to remove the 3 retaining screws that secure the battery to the HBA from the underside of the card. Do not attempt to remove any screws from the top side of the HBA.

2. Detach the battery pack, including its circuit board, from the HBA by gently lifting it from its circuit board connector on the top side of the HBA.

Reinstall the HBA's battery onto the new HBA

Reverse the removal instructions

Install the new HBA PCI Card

Reverse the removal instructions, taking care to get the cables re-connected to the same ports they were removed from. If reversed, this may affect disk slot mappings.

On storage cells, take care to put the IB cables back into their original ports as well, in the correct orientation. IB cables are factory-labeled with the port identification: port 2 is the port nearest the PCI connector, and port 1 is the port near the top side of the card. The cables should be inserted with the latch release tab on the down side so they fully seat and latch. If inserted upside down, they will not fully seat or latch.

Exadata X2-8 Database Machine Compute nodes:

These steps are relevant to Exadata nodes based on x4800, and x4800M2.

Removing the HBA REM and the battery.

1. Remove CMOD0 from the server setting it on a flat, antistatic surface with ample space and light.

2. Remove the CMOD cover.

3. Lift the REM ejector handle and rotate it to its fully open position.

4. Lift the connector end of the REM and pull the REM away from the retaining clip on the front support bracket.

5. To remove the battery, use a No. 1 Phillips screwdriver to remove the 3 retaining screws that mount the battery to the REM.

6. Detach the battery pack including circuit board from the REM by gently lifting it from its circuit board connector.

Install the new HBA REM

1. Attach the battery pack to the REM by aligning the circuit board connectors and gently pressing together.

2. Secure the original battery to the underside of the new REM using the 3 retaining screws.

3. Ensure that the REM ejector lever is in the closed position. (The lever should be flat with the REM support bracket.)

4. Position the REM so that the battery is facing downward and the connector is aligned with the connector on the motherboard.

5. Slip the opposite end of the REM under the retaining clips on the front support bracket
and ensure that the notch on the edge of the REM is positioned around the alignment
post on the bracket.

6. Carefully lower and position the connector end of the REM until the REM contacts the connector on the motherboard, ensuring that the connectors are aligned. To seat the connector, carefully push the REM downward until it is in a level position.

7. Install the cover on the CMOD and return the CMOD to the CMOD0 slot in the unit.

Step C. Post-Replacement RAID Card additional steps:

Power on for both DB nodes and Storage Cells:

1. Once the power cords have been re-attached, slide the server back into the rack.

2. Once the ILOM has booted you will see a slow blink on the green LED for the server. Power on the server by pressing the power button on the front of the unit.

Accepting the Foreign Configuration for both DB nodes and Storage Cells:

1. During boot, monitor the graphics console through either the ILOM javaconsole or the local KVM. When loading its BIOS ROM, the new RAID controller will detect the RAID configuration on the disks and complain that it has a foreign configuration. This is expected. At the prompt, press "F" or "C" to accept the foreign configuration or enter the controller BIOS utility. If you press any other key to continue, the controller will not import the RAID and will fail to find a bootable disk. If this occurs, it is safe to press Ctrl-Alt-Del to reset and get the "F" or "C" prompt again.

Press F or C - Refer to attached image 1_foreign-config1.png for screenshot.
Press C - Refer to attached image 2_foreign-config2.png for screenshot.
Press Y - Refer to attached image 3_foreign-config3.png for screenshot.

NOTE: FOR SOLARIS-ONLY CONFIG

If the DB node is running Solaris only, the boot drive is not defined when the foreign config is imported. Therefore, for a Solaris-only config, follow these steps.

A Solaris-only config on non-X2-8 compute nodes will have 4 virtual drives: VD0, VD1, VD2, and VD3.

VD1 and VD2 will contain the grub loader.

i) When the "Foreign Config detected" message is displayed, press "C" to load the config utility.

ii) The system will then display a warning that all disks from the previous config are gone. Again press "C" to load the config utility.

iii) Finally, a warning appears that entering the utility will result in drive config changes. Press "Y" to continue into the config utility.

iv) Press the "Start" button, then select the "Preview" button and verify that the configuration is correct. Save the configuration. Refer to step 4 below for the correct configs.

v) Now define the boot device. Tab to the Virtual Drives view, then from this view tab to the box listing the drives. Select VD1 by using the up/down keys until it is highlighted, and hit Enter to select VD1. Now tab to "Set Boot Drives" and hit Enter. Finally, tab to "Go" and hit Enter. The value "Set Boot Drives" should now show (current = 1). Refer to solaris-screen-4.jpg for screenshot.

vi) Return to the home window and proceed to step 5 below.

 

2. When the utility loads, there should only be 1 adapter. Select the "Start" button.
Refer to image 4_niwot-utility1.png for screenshot.

3. The foreign configuration screen is shown. Select "Configuration" from the drop down, and select the "Preview" button.
Refer to attached image 5_niwot-utility2.png for screenshot.

4. Verify the configuration looks correct on the "Virtual Drives" side and select the "Import" button if it is. The correct configuration should be 1 of the following:
a) Storage Cells - 12 RAID0's, 1 per disk.  Refer to attached image 7_niwot-utility4.png for screenshot.
b) X2-2/V2 DB Nodes with Linux only - 1 RAID5 with disks 0,1,2; 1 Global Hotspare disk 3. Refer to attached image 6_niwot-utility3.png for screenshot.
c) X2-8 DB Nodes with Linux only - 1 RAID 5 with disks 0-6; 1 Global Hotspare disk 7
d) X2-2/V2 DB Nodes with Linux/Solaris Dual-boot - 1 RAID1 with disks 0,1; 2 RAID 0's on disks 2 and 3.
e) X2-8 DB Nodes with Linux/Solaris Dual-boot - 1 RAID5 with disks 0,1,2; 1 Global Hotspare disk 3; 4 RAID 0's on disks 4-7
f) X2-2/V2 DB Nodes with Solaris only - RAID 0's on disks 0, 1, 2, and 3. Either disk 1 or disk 2 will be the boot disk.


Verify with the customer which configuration they have if this is a DB node.

NOTE: If the foreign configuration fails to import, as it may if the firmware on the replacement HBA differs from the firmware on the failed HBA, then you may need to recreate the volume. This has been observed on DB nodes. Follow the virtual drive creation screens to create a new disk group with the RAID configuration listed above for the node. Use Stripe Size 1MB, cache type "Writeback with BBU", and the full size available when presented. Save the configuration but do NOT initialize it, as initializing will erase the metadata already on the disks. In the primary home screen, select the Physical Disk view to set disk 3 as a Global Hotspare, if applicable. Once saved, the controller should find the old data and allow booting. If not, you will need to restore from backup per Note 1084360.1.

5. This will bring you back to the Logical View screen where the virtual drives should be listed out on the right side. Select the "Exit" link from the left side menu. Refer to attached image 8_niwot-utility5.png for screenshot.

This will bring you to the "Please Reboot" screen. Press "Ctrl-Alt-Del" to reboot the machine. Refer to attached image 9_niwot-utility6.png for screenshot.

Step D. Server Services Startup Validation:

DB Node Startup:

1. As the system boots, the hardware/firmware profile will be checked: either a green "Passed" will be displayed, or a red "Warning" that the check does not match, if the firmware on the HBA is different from what the image expects.

If the check passes, then the firmware is correct, continue to step 2.

If the check fails, then:
a) Login as root at the OS login prompt.
b) The customer should repeat Step A DB Node step 2 to ensure CRS services are shutdown before doing a firmware update.
c) Run imageinfo to verify what image version is currently installed. If it is 11.2.2.3.2 then refer to Note 1323958.1 for a new MegaCli version and detailed instructions. If it is any other image, then continue.
d) Run the following to update the RAID HBA to the correct supported firmware for the image:

# /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits

e) After the firmware updates, the server will reboot again. The disk volumes should remain intact and boot up to the OS again.
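The decision in steps c) and d) can be sketched as a small shell helper: only image 11.2.2.3.2 needs the separate MegaCli procedure from Note 1323958.1, while any other image uses CheckHWnFWProfile. The helper name and the hard-coded sample version are illustrative assumptions, not Oracle tooling; parse the real version from `imageinfo` on the node.

```shell
# Hypothetical helper: does this image version need Note 1323958.1?
needs_note_1323958() {
  [ "$1" = "11.2.2.3.2" ]
}

ver="11.2.3.1.0"   # example value; parse this from `imageinfo` output
if needs_note_1323958 "$ver"; then
  echo "Follow Note 1323958.1 for the new MegaCli version and instructions"
else
  echo "Run: /opt/oracle.SupportTools/CheckHWnFWProfile -U /opt/oracle.cellos/iso/cellbits"
fi
```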

2. After the OS is up, login as root and validate the physical and logical volumes are seen properly from the new RAID HBA in the OS, for the configuration that it should be for the DB node and OS type (see above "Foreign Configuration" section step 4 b-e), and that the battery is seen:

# /opt/MegaRAID/MegaCli/MegaCli64 -LdInfo -Lall -a0



# /opt/MegaRAID/MegaCli/MegaCli64 -PdList -a0 | grep "Slot\|Firmware\|Inq"



# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0

BBU status for Adapter: 0

BatteryType: iBBU

...Output truncated...
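The three validation commands above can be checked with a couple of small parsers run against their captured output. This is a sketch: the helper names are assumptions, and the sample lines are taken from the truncated output shown in this note, so confirm the patterns against the full output on the live server.

```shell
# Hypothetical check: a healthy battery report names a battery type.
bbu_present() {
  printf '%s\n' "$1" | grep -q 'BatteryType'
}

# Hypothetical check: every virtual drive State line reports Optimal.
all_optimal() {
  ! printf '%s\n' "$1" | grep '^State' | grep -qv 'Optimal'
}

# Sample lines in the format shown in this note.
ld_sample='State               : Optimal'
bbu_sample='BatteryType: iBBU'

all_optimal "$ld_sample"  && echo "Virtual drives: Optimal"
bbu_present "$bbu_sample" && echo "Battery: detected"
```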

3. Set all logical drives cache policy to WriteBack cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0 

Verify the current cache policy for all logical drives is now using WriteBack cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU

4. CRS services should start automatically during the OS boot. After the OS is up, the Customer DBA should validate that CRS is running.

Cell Node Startup:

1. As the system boots the hardware/firmware profile will be checked, and either a green "Passed" will be displayed, or a red "Warning" that the check does not match if the firmware on the HBA is different from what the image expects. If the check passes, then the firmware is correct, and the boot will continue up to the OS login prompt. If the check fails, then the firmware will automatically be updated, and a subsequent reboot will occur. Monitor to ensure this occurs properly.

2. After the OS is up, login as root and verify all the expected devices are present:

The following command should show 12 disks:

# lsscsi | grep -i LSI

[0:0:20:0]   enclosu LSILOGIC SASX28 A.1       502E  -

[0:2:0:0]    disk    LSI      MR9261-8i        2.90  /dev/sda

...Output truncated...

[0:2:11:0]   disk    LSI      MR9261-8i        2.90  /dev/sdl


If the device count is not correct, also check that the LSI controller has the correct virtual drives configured and in Optimal state, physically Online and spun up, with no foreign configuration. There should be Virtual Drives 0 to 11, and physical slots 0 to 11 should each be allocated to one virtual drive (not necessarily the same 0:0, 1:1, etc. mapping).

# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep "Virtual Drive\|State\|Slot\|Firmware state"

Virtual Drive: 0 (Target Id: 0)

State               : Optimal

Slot Number: 0

Firmware state: Online, Spun Up

Foreign State: None
...Output truncated...

Slot Number: 11

Firmware state: Online, Spun Up

Foreign State: None



# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -a0

BBU status for Adapter: 0

BatteryType: iBBU

...Output truncated...


If this is not correct, then there is a problem with the disk volumes that may need additional assistance to correct. Re-open the server and check that the device connections and boards are secure and well seated BEFORE the following CellCLI commands are issued.
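The device counts above can be scripted against captured output rather than eyeballed. This is a sketch only: the helper names and grep patterns are assumptions based on the sample output shown in this note, not Oracle-provided tooling, so verify them against the real `lsscsi` and MegaCli output on the cell.

```shell
# Count "disk" rows in captured `lsscsi | grep -i LSI` output
# (the enclosure row says "enclosu", so it is not counted).
lsi_disk_count() {
  printf '%s\n' "$1" | grep -c ' disk '
}

# Count spun-up drives in captured MegaCli -LdPdInfo output.
online_slot_count() {
  printf '%s\n' "$1" | grep -c 'Online, Spun Up'
}

# Abbreviated sample from this note; a real cell shows 12 disk rows.
sample='[0:0:20:0]   enclosu LSILOGIC SASX28 A.1       502E  -
[0:2:0:0]    disk    LSI      MR9261-8i        2.90  /dev/sda
[0:2:11:0]   disk    LSI      MR9261-8i        2.90  /dev/sdl'

echo "disks seen: $(lsi_disk_count "$sample") (expect 12 on a storage cell)"
```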

3. Set all logical drives cache policy to WriteBack cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldsetprop wb -lall -a0 

Verify the current cache policy for all logical drives is now using WriteBack cache mode:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep BBU

4. Verify also both InfiniBand links are up at 40Gbps as the cables were disconnected:

# /usr/sbin/ibstatus

Infiniband device 'mlx4_0' port 1 status:

...Output truncated...

 state:           4: ACTIVE

 phys state:      5: LinkUp

 rate:            40 Gb/sec (4X QDR)



Infiniband device 'mlx4_0' port 2 status:

...Output truncated...

 state:           4: ACTIVE

 phys state:      5: LinkUp

 rate:            40 Gb/sec (4X QDR)
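The two-port check above can also be scripted from captured `ibstatus` output. A sketch, with the caveat that the helper name and patterns are assumptions based on the sample output shown here; adjust them if your `ibstatus` format differs.

```shell
# Hypothetical check: both IB ports ACTIVE and linked at 40 Gb/sec.
ib_ports_up() {
  out="$1"
  [ "$(printf '%s\n' "$out" | grep -c 'ACTIVE')" -eq 2 ] &&
  [ "$(printf '%s\n' "$out" | grep -c '40 Gb/sec')" -eq 2 ]
}

# Abbreviated two-port sample in the format shown above.
sample='state:           4: ACTIVE
rate:            40 Gb/sec (4X QDR)
state:           4: ACTIVE
rate:            40 Gb/sec (4X QDR)'

ib_ports_up "$sample" && echo "Both IB links up at 40 Gb/sec"
```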

5. Once the hardware is verified as up and running, the Customer's DBA will need to activate the grid disks:

# cellcli

...

CellCLI> alter griddisk all active

GridDisk DATA_CD_00_dmorlx8cel01 successfully altered

...repeated for all griddisks...


Issue the command below and all disks should show 'active':

CellCLI> list griddisk

DATA_CD_00_dmorlx8cel01 		active

...repeated for all griddisks...

6. Verify all grid disks have been successfully put online using the following command. Wait until asmmodestatus is ONLINE for all grid disks and no longer SYNCING. The following is an example of the output early in the activation process.

CellCLI> list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome

DATA_CD_00_dmorlx8cel01 active ONLINE Yes

RECO_CD_00_dmorlx8cel01 active SYNCING Yes

...repeated for all griddisks...

Notice in the above example that RECO_CD_00_dmorlx8cel01 is still in the 'SYNCING' process. Oracle ASM synchronization is only complete when ALL grid disks show asmmodestatus=ONLINE. This process can take some time, depending on how busy the machine is now and was while this individual server was down for repair.
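The "wait until everything is ONLINE" check can be scripted from the captured CellCLI attribute listing: re-run the `list griddisk attributes` command and this check until it reports complete. This is a sketch; the helper name is an assumption, and the sample lines come from the example output above.

```shell
# Hypothetical check: resync is complete only when every griddisk
# line in the captured output shows ONLINE (so no SYNCING remains).
all_griddisks_online() {
  ! printf '%s\n' "$1" | grep -qv 'ONLINE'
}

# Sample from this note: RECO is still syncing.
sample='DATA_CD_00_dmorlx8cel01 active ONLINE Yes
RECO_CD_00_dmorlx8cel01 active SYNCING Yes'

if all_griddisks_online "$sample"; then
  echo "ASM synchronization complete"
else
  echo "Still SYNCING - check again later"
fi
```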


OBTAIN CUSTOMER ACCEPTANCE
- WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO
TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

- Verify that HW Components and SW Components are returned to properly functioning state with server up and database services operating on DB Servers, and all ASM disks online on Storage Servers.

PARTS NOTE:
Migrate the BBU if replacing like-for-like parts. If replacing a B2 HBA 375-3644 with a B4 HBA 375-3701, then if the BBU is BBU07 371-4746 then it must also be replaced with a BBU08 371-4982. BBU07 371-4746 is not supported on B4 HBA 375-3701. Refer to Note 1329989.1 for details.

REFERENCE INFORMATION:
1093890.1 Steps To Shutdown/Startup The Exadata & RDBMS Services and Cell/Compute Nodes On An Exadata Configuration.

1188080.1 Steps to shut down or reboot an Exadata storage cell without affecting ASM


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.