Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Technical Instruction (Sure Solution)

Document 1524329.1: Root Volume of Predictive Failure Boot Hard Drive in an Exadata Storage Server Remains in State of 'active'
Created from <SR 3-6341701201>

Applies to:
Exadata X3-8 Hardware - Version All Versions and later
Exadata X3-2 Full Rack - Version All Versions and later
Exadata X3-2 Half Rack - Version All Versions and later
Exadata Database Machine X2-2 Hardware - Version All Versions and later
SPARC SuperCluster T4-4 Full Rack - Version All Versions and later
Information in this document applies to any platform.

Goal

This document describes the steps to take when the root volume of a Predictive Failure boot hard drive in an Exadata storage (cell) server remains in a state of 'active'. It assumes that <Document 1390836.1> How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure) has been followed through step 5b.

Solution

<Document 1390836.1> How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure) explains that before pulling an OS disk that is in a state of Predictive Failure, the field engineer should verify that the root volume of the disk is in a 'clean' state. If the volume is 'active' and the disk is hot removed, the OS may crash, making recovery more difficult.

Normally, the state changes to 'clean' immediately after (or shortly after) the disk changes to Predictive Failure, and the disk replacement can proceed using the steps in <Document 1390836.1>. If this is not the case, and it is possible to do so, reboot the cell node using <Document 1188080.1> Steps to shut down or reboot an Exadata storage cell without affecting ASM, in an attempt to get the root volume status to change to 'clean'.

If a reboot does not change the root volume status to 'clean', follow the steps below before removing the physical device per <Document 1390836.1>.

The following command assumes the root volume is /dev/md5; its output (not reproduced here) shows that the status is still 'active':

[root@edx2cel03 ~]# mdadm -Q --detail /dev/md5

Use 'mdadm' to set the faulty disk's root volume to a faulty state and to remove the volume from the configuration.

IMPORTANT: Before faulting and removing the root volume of the failed disk, confirm again the slot of the failed boot disk using CellCLI, as was done in step #1 of <Document 1390836.1> How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure). Failure to do so may result in the wrong root volume being faulted, causing the running OS to crash. A boot disk in slot 0 will have a logical device name of '/dev/sda', while a boot disk in slot 1 will have a logical device name of '/dev/sdb'.
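For illustration only, the following is a hedged sketch of that confirmation: a representative (not verbatim) excerpt of 'mdadm -Q --detail' output for a root volume stuck in 'active', followed by an assumed CellCLI query for the failed disk's slot. The device names, slot numbers, and attribute values shown are examples, not output captured from the system above:

[root@edx2cel03 ~]# mdadm -Q --detail /dev/md5
/dev/md5:
     Raid Level : raid1
   Raid Devices : 2
          State : active                 <-- must be 'clean' before hot removal
 Active Devices : 2
 Failed Devices : 0

    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       8       21        1      active sync   /dev/sdb5

[root@edx2cel03 ~]# cellcli -e "list physicaldisk where status='predictive failure' attributes name, slotNumber, status"
         20:1    1      predictive failure

In this hypothetical example, the slotNumber of 1 indicates the boot disk in slot 1, i.e. /dev/sdb, so its root volume member would be /dev/sdb5.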
In the following example, we fault and remove the root volume for the disk in slot 1:
[root@edx2cel03 ~]# mdadm --set-faulty /dev/md5 /dev/sdb5
mdadm: set /dev/sdb5 faulty in /dev/md5
[root@edx2cel03 ~]# mdadm --remove /dev/md5 /dev/sdb5
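As a hedged illustration (representative output, not captured from a live system), the member table in the detail listing should then report the removed slot along the following lines:

[root@edx2cel03 ~]# mdadm -Q --detail /dev/md5
...
    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       0        0        1      removed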
After running the above commands, 'mdadm --detail /dev/md5' should show sdb5 in a state of 'removed'. The disk in slot 1 can then be removed and replaced per the steps for physical disk replacement in <Document 1390836.1> How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure).

After replacement, the mirror re-attach and resync should happen automatically, although it can take several minutes to start. Run 'mdadm --detail /dev/md5' again to confirm; a sketch for watching the resync follows below. Also note that the logical device name of the replaced disk may change (for example, from sdb to sdad). This is normal.
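A minimal sketch for watching the automatic re-attach and resync, assuming the standard md tools on the cell (the 2-second refresh interval is an arbitrary choice for this example):

# Poll the kernel's md status; a rebuilding mirror shows a progress line
# such as "[==>..........]  recovery = 12.4% (...) finish=3.1min".
[root@edx2cel03 ~]# watch -n 2 cat /proc/mdstat

# Once recovery completes, confirm the array state and the new member name,
# which may differ from the old one (e.g. sdad5 instead of sdb5).
[root@edx2cel03 ~]# mdadm -Q --detail /dev/md5 | grep -E 'State :|sync'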
References

<NOTE:1390836.1> - How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure)