Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-1967510.1
Update Date: 2018-05-10
Keywords:

Solution Type: Technical Instruction

Solution 1967510.1: How to Replace an Exadata X5-2, X4-8, or later Compute (Database) Node HDD (Predictive or Hard Failure)


Related Items
  • Exadata X5-8 Hardware
  • Exadata X7-2 Hardware
  • Zero Data Loss Recovery Appliance X6 Hardware
  • Exadata X6-8 Hardware
  • Exadata X5-2 Eighth Rack
  • Exadata X5-2 Full Rack
  • Exadata X5-2 Hardware
  • Exadata X6-2 Hardware
  • Exadata X5-2 Quarter Rack
  • Exadata X4-8 Hardware
  • Exadata X5-2 Half Rack
  • Zero Data Loss Recovery Appliance X5 Hardware
Related Categories
  • PLA-Support>Sun Systems>Sun_Other>Sun Collections>SN-OTH: x64-CAP VCAP




In this Document
Goal
Solution
References


Applies to:

Exadata X5-2 Half Rack - Version All Versions and later
Exadata X5-2 Quarter Rack - Version All Versions and later
Exadata X5-2 Eighth Rack - Version All Versions and later
Zero Data Loss Recovery Appliance X5 Hardware - Version All Versions and later
Exadata X5-2 Full Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

 How to Replace an Exadata X5-2, X4-8, or later Compute (Database) Node HDD (Predictive or Hard Failure).

Solution

DISPATCH INSTRUCTIONS:

The following information will be required prior to dispatch of a replacement (see the sketch after this list for one way to gather it on the node):
Exadata Model - Confirm it is an X5-2/X6-2/X7-2/X4-8/X5-8/X6-8
Type of database node - X5-2 rack = X5-2 server, X6-2 = X6-2, X7-2 = X7-2, X4-8 = X4-8, X5-8/X6-8 = X5-8
Name/location of database node
Slot number of failed drive
Image Version (output of "/opt/oracle.cellos/imageinfo -all")
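
The following is a minimal sketch of how this information could be collected on the affected DB node (assuming root access; the dmidecode and MegaCli paths shown are the standard ones on Exadata DB node images, but verify them on the release in use):

# hostname
# dmidecode -s system-product-name
# /opt/oracle.cellos/imageinfo -all
# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware state"

hostname gives the node name, dmidecode reports the server model, imageinfo reports the image version, and the pdlist output shows which slot reports a non-Online firmware state.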

WHAT SKILLS DOES THE FIELD ENGINEER/ADMINISTRATOR NEED?: Linux and MegaRAID (MegaCli) familiarity


TIME ESTIMATE: 60 minutes

Total time may depend on disk re-sync time.


TASK COMPLEXITY: 0

CRU-optional; default is FRU with Task Complexity: 2


FIELD ENGINEER/ADMINISTRATOR INSTRUCTIONS:
PROBLEM OVERVIEW: A hard disk in an Exadata X5-2/X4-8 (or later) Compute (Database) node needs replacing. For Exadata X4-2 and earlier, use Note 1479736.1.


WHAT STATE SHOULD THE SYSTEM BE IN TO BE READY TO PERFORM THE RESOLUTION ACTIVITY?:


Hard disks in Exadata DB nodes are configured into RAID volumes and are hot swappable, provided the failed hard disk has been taken offline by the LSI MegaRAID controller that manages the volume. The volume is redundant and should remain online, although in a degraded state. The normal DB node volume arrangement depends on the rack type:

- X5-2 - 4 disk RAID5

- X6-2 - 4 disk RAID5 with option for 8 disk RAID5

- X7-2 - 4 disk RAID5 with option for 8 disk RAID5

- X4-8 - 7 disk RAID5

- X5-8/X6-8 - 8 disk RAID5
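
To confirm the RAID level and drive count actually configured on a given node (a hedged cross-check using a standard MegaCli option; the exact output wording varies by controller firmware), list the virtual drive configuration:

# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0 | grep -iE "raid level|number of drives|^state"

A 4 disk RAID5, for example, typically reports "RAID Level: Primary-5" and "Number Of Drives: 4".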


A critical hard disk failure may be marked "critical" or "failed" depending on release version, and is referred to as a critical hard failure in this document.
A predictive hard disk failure may be marked "predictive failure" or "proactive failure", which may or may not include additional information on the failure mode type. Although details may differ based on the exact failure mode and release version, the process is the same and is referred to as a predictive failure in this document.


For a critical hard failure, the failed hard disk should have its blue "OK to Remove" LED and its amber "Service Action Required" LED illuminated or flashing. This may trigger alert HALRT-02007; refer to Note 1113034.1.


For a predictive failure, the failed hard disk should have its amber "Service Action Required" LED illuminated or flashing. On certain image revisions, a predictively failed disk may not yet have been removed from the volume and may not have a fault LED lit. This may trigger alert HALRT-02008; refer to Note 1113014.1.



WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE?:

1. Back up the volume and be familiar with the bare metal restore procedure before replacing the disk. See Note 1084360.1 for details.


To check the current cache policy, use the following command; the Current Cache Policy should be WriteBack, not WriteThrough:

# /opt/MegaRAID/MegaCli/MegaCli64 -ldpdinfo -a0 | grep -i "cache policy"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy : Disabled
#
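
If the Current Cache Policy shows WriteThrough (as in the example output above), one possible cause is a degraded controller battery/CacheVault, per the "No Write Cache if Bad BBU" policy shown. Its status can be checked with a standard MegaCli option (a hedged example; the output format varies by controller and firmware):

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -a0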


2. Identify the disk using the amber fault and blue OK-to-Remove LED states. The DB node server within the rack can usually be determined from the hostname and the default Exadata server numbering scheme, which counts server numbers up from 1 starting at the lowest DB node in the rack. The server's white Locate LED may be flashing as well.
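
If the server itself needs to be identified physically, its white Locate LED can also be toggled from the host (an optional, hedged step; it assumes ipmitool is installed on the DB node, which is normally the case on Exadata images).

Turn the Locate LED on until it is explicitly turned off:

# ipmitool chassis identify force

Turn the Locate LED back off:

# ipmitool chassis identify 0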


The slot number locations are labelled next to each disk slot.  On X5-2/X6-2/X7-2 DB nodes, the disk slots are in the front of the server. On X4-8/X5-8/X6-8 DB nodes, the disk slots are in the rear of the server.

If still unsure of the slot location, use the following commands to identify the faulted disk:
a. Obtain the enclosure ID for the MegaRAID card:

# /opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a0 | grep ID
Device ID : 252
#


b. Identify the physical disk slot that is failed:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|firmware"

Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: A690
Slot Number: 1
Firmware state: Failed
Device Firmware Level: A690
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: A690
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: A690

 
"Failed" or "Unconfigured(bad)" is the expected state for the faulted disk. In this example, it is located in physical slot 1


If all disks show as Online, then the failed disk may be in a predictive failure state but not yet taken offline. The failed disk can be identified using this additional information:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|predictive|firmware"
Slot Number: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 1
Predictive Failure Count: 290
Last Predictive Failure Event Seq Number: 121022
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 2
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70
Slot Number: 3
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0B70



In this example, the disk in slot 1 has reported predictive failures several times but is still online. This disk should be considered the bad one. For more details, refer to Note 1452325.1.
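
As an optional cross-check (a hedged example; the MegaCli event log options shown are standard, but the exact event text varies by controller firmware), the controller event log can be dumped and searched for predictive failure entries against the suspect slot:

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpEventLog -GetLatest 100 -f /tmp/mr-events.log -a0
# grep -i "predictive" /tmp/mr-events.log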


c. Use the locate function which turns the "Service Action Required" amber LED on flashing:

# /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -start -physdrv[E#:S#] -a0

where E# is the enclosure ID number identified in step a, and S# is the slot number of the disk identified in step b. In the example above, the command
would be:

# /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -start -physdrv[252:1] -a0

3. Verify that the state of the RAID volume is Optimal or Degraded, with the good disk(s) online, before hot-swap removing the failed disk.

# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep -iE "target|state|slot"
Virtual Drive: 0 (Target Id: 0)
State               : Degraded
Slot Number: 0
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 1
Firmware state: Failed
Foreign State: None
Slot Number: 2
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 3
Firmware state: Online, Spun Up
Foreign State: None

#

 

4. On the drive you plan to remove, push the storage drive release button to open the latch.

5. Grasp the latch and pull the drive out of the drive slot (Caution: The latch is not an ejector. Do not bend it too far to the right. Doing so can damage the latch.
Also, whenever you remove a storage drive, you should replace it with another storage drive or a filler panel, otherwise the server might overheat due to improper
airflow.)

6. Wait three minutes for the system to acknowledge the disk has been removed.

7. Slide the new drive into the drive slot until it is fully seated.

8. Close the latch to lock the drive in place.

9. Verify the green "OK/Activity" LED begins to flicker as the system recognizes the new drive. The other two LEDs for the drive should no longer be illuminated.
The server's Locate LED and the disk's "Service Action Required" locate blinking should turn off automatically.
If they do not, the locate function can be turned off manually for the device using:

# /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -stop -physdrv[E#:S#] -a0


where E# is the enclosure ID number identified in step 2a, and S# is the slot number of the disk identified in step 2b. In the example above, the command
would be:

# /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -stop -physdrv[252:1] -a0

 
OBTAIN CUSTOMER ACCEPTANCE

WHAT ACTION DOES THE FIELD ENGINEER/ADMINISTRATOR NEED TO TAKE TO RETURN THE SYSTEM TO AN OPERATIONAL STATE:

1. Verify the disk is brought online into a volume by LSI MegaRAID. Until the disk is added into a volume, the OS will not be able to use the disk.


Use the following to verify that the physical disk is in one of the expected states, Copyback or Online:

# /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -physdrv[E#:Slot#] -a0

where E# is the enclosure ID number identified in step 2a of the replacement steps, and Slot# is the slot number of the disk replaced. In the example above, the command and output would be:

# /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -physdrv[252:1] -a0
Adapter #0
Enclosure Device ID: 252
Slot Number: 1
Device Id: 10
Sequence Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 136.727 GB [0x11174b81 Sectors]
Non Coerced Size: 136.227 GB [0x11074b81 Sectors]
Coerced Size: 136.218 GB [0x11070000 Sectors]
Firmware state: Online, Spun Up
SAS Address(0): 0x5000cca00a1b817d
SAS Address(1): 0x0
Connected Port Number: 2(path0)
Inquiry Data: HITACHI H103014SCSUN146GA1600934FH3Y8E
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified

2. Verify the replacement disk has been added to the expected RAID volume.

Use the following MegaRAID command to verify the status of the RAID:

# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -a0 | grep -iE "target|state|slot"

If the copyback has already completed when checked, the disk may already be in the "Online" state. If it is in a Rebuild or Copyback state, you can use one of the following to verify progress to completion:

# /opt/MegaRAID/MegaCli/MegaCli64 -pdrbld -showprog -physdrv [E#:Slot#] -a0

where E# is the enclosure ID number identified in step 2a of the replacement steps, and S# is the slot number of the disk in Rebuild state.

# /opt/MegaRAID/MegaCli/MegaCli64 -pdrbld -showprog -physdrv [252:1] -a0
Rebuild Progress on Device at Enclosure 252, Slot 1 Completed 9% in 3 Minutes.
Exit Code: 0x00
#

or

# /opt/MegaRAID/MegaCli/MegaCli64 -pdcpybk -showprog -physdrv [E#:Slot#] -a0

where E# is the enclosure ID number identified in step 2a of the replacement steps, and S# is the slot number of the disk in Copyback state. This is typically
the replaced disk slot.

# /opt/MegaRAID/MegaCli/MegaCli64 -pdcpybk -showprog -physdrv [252:0] -a0
Copyback Progress on Device at Enclosure 252, Slot 0 Completed 79% in 29 Minutes.
Exit Code: 0x00
#

3. Optionally update the disk firmware as needed, following the procedure in Note 2088888.1.
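
Before doing so, the current drive firmware levels can be listed for comparison against the levels referenced in that note (this reuses the pdlist output already shown above):

# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -iE "slot|inquiry|firmware level"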

PARTS NOTE:

Refer to the Exadata Database Maintenance Guide Appendix B for part information.

Refer to the Oracle System Handbook for part information. (https://mosemp.us.oracle.com/handbook_internal/index.html)


REFERENCE INFORMATION:


Exadata Database Machine Documentation:
Exadata Database Machine Owner's Guide is available on the Storage Server OS image in /opt/oracle/cell/doc/welcome.html  or https://docs.oracle.com/cd/E80920_01/index.htm

HALRT-02007: Database node hard disk failure (Doc ID 1113034.1)
HALRT-02008: Database node hard disk predictive failure (Doc ID 1113014.1)

Bare Metal Restore Procedure for Compute Nodes on an Exadata Environment (Doc ID 1084360.1)
Determining when Disks should be replaced on Oracle Exadata Database Machine (Doc ID 1452325.1)

Internal Only References:
- INTERNAL Exadata Database Machine Hardware Current Product Issues - DB Nodes (X5-2/X4-8/X5-8/X6-2/X6-8) (Doc ID 2010838.1)
- INTERNAL Exadata Database Machine Hardware Troubleshooting (Doc ID 1360360.1)

References

<NOTE:1113014.1> - HALRT-02008: Database node hard disk predictive failure
<NOTE:2088888.1> - How to Update Disk Drive Firmware on Exadata and Recovery Appliance Compute Nodes
<NOTE:2010838.1> - INTERNAL Exadata Database Machine Hardware Current Product Issues - DB Nodes (X5 and Later)
<NOTE:1479736.1> - How to Replace an Exadata Compute (Database) node hard disk drive (Predictive or Hard Failure) (X4-2 and earlier)
<NOTE:1360360.1> - INTERNAL Exadata Database Machine Hardware Troubleshooting
<NOTE:1452325.1> - Determining when Disks should be replaced on Oracle Exadata Database Machine
<NOTE:1084360.1> - Bare Metal Restore Procedure for Compute Nodes on an Exadata Environment
<NOTE:1113034.1> - HALRT-02007: Database node hard disk failure
