
Asset ID: 1-72-2005694.1
Update Date: 2017-01-16

Solution Type: Problem Resolution

Solution 2005694.1: How to Drop and Re-create /u01 on Exadata Without Performing a Full Bare Metal Restore


Related Items
  • Exadata X4-2 Hardware
  • Exadata X3-2 Hardware
  • Exadata X5-2 Hardware
  • Oracle Exadata Hardware
  • Exadata Database Machine X2-2 Hardware

Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST




Created from <SR 3-10544010481>

Applies to:

Exadata X5-2 Hardware - Version All Versions and later
Exadata X3-2 Hardware - Version All Versions and later
Oracle Exadata Hardware - Version 11.2.0.3 and later
Exadata Database Machine X2-2 Hardware - Version All Versions and later
Exadata X4-2 Hardware - Version All Versions and later
Information in this document applies to any platform.

Symptoms

While attempting to resize the /u01 filesystem, the filesystem became corrupted and the clusterware became unusable.

The output of fsck shows corrupted inode tables.
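A read-only check can confirm the damage without modifying the disk. A minimal sketch, using the logical volume name from Step 2 below (adjust it to match your own df output):

[root@failing]# umount /u01
[root@failing]# fsck -n /dev/mapper/VGExaDb-LVDbOra1    # -n answers "no" to every repair prompt, so nothing is written to disk

On a corrupted filesystem this typically reports errors such as bad inode counts or references to deleted inodes.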

Cause

An attempt to resize the /u01 file system corrupted it. The damage is not repairable.

This requires dropping and re-creating the /u01 file system.

Solution

Drop and re-create the /u01 file system per the process below:

Nodes:

Failing node:      dm01db01
Surviving node:    dm01db02

 

If the /u01 file system is damaged but can still be mounted, back up whatever data is recoverable first.
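For example, if the volume can still be mounted read-only, a salvage copy along these lines (the archive path is illustrative) preserves whatever remains readable:

[root@failing]# mount -o ro /dev/mapper/VGExaDb-LVDbOra1 /u01
[root@failing]# tar czf /root/u01_salvage.tar.gz --ignore-failed-read /u01    # skip unreadable files instead of aborting
[root@failing]# umount /u01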

Step 1: Remove the Failed Database Server from the Cluster

1.  Disable the listener that runs on the failed database server:

[oracle@surviving]$ srvctl disable listener -n dm01db01

[oracle@surviving]$ srvctl stop listener -n dm01db01

PRCC-1017 : LISTENER was already stopped on dm01db01

 

2. Delete the Oracle Home from the Oracle inventory:

[oracle@surviving]$ cd ${ORACLE_HOME}/oui/bin

[oracle@surviving]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1 "CLUSTER_NODES=dm01db02"

 

Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 16383 MB    Passed

The inventory pointer is located at /etc/oraInst.loc

The inventory is located at /u01/app/oraInventory

'UpdateNodeList' was successful.

 

3. Verify that the failed database server is unpinned:

[oracle@surviving]$ olsnodes -s -t

dm01db01 Inactive Unpinned

dm01db02 Active Unpinned

 

4. Stop and delete the VIP resource for the failed database server:

[root@surviving]# srvctl stop vip -i dm01db01-vip

PRCC-1016 : dm01db01-vip.acme.com was already stopped

[root@surviving]# srvctl remove vip -i dm01db01-vip

Please confirm that you intend to remove the VIPs dm01db01-vip (y/[n]) y

 

5. Delete the node from the cluster:

[root@surviving]# crsctl delete node -n dm01db01

CRS-4661: Node dm01db01 successfully deleted.

 

6. Update the Oracle Inventory:

[oracle@surviving]$ cd ${ORACLE_HOME}/oui/bin

[oracle@surviving]$ ./runInstaller -updateNodeList ORACLE_HOME=/u01/app/11.2.0/grid "CLUSTER_NODES=dm01db02" CRS=TRUE

Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 16383 MB    Passed

The inventory pointer is located at /etc/oraInst.loc

The inventory is located at /u01/app/oraInventory

'UpdateNodeList' was successful.

 

7. Verify the node deletion is successful:

[oracle@surviving]$ cluvfy stage -post nodedel -n dm01db01 -verbose

Performing post-checks for node removal

Checking CRS integrity...

The Oracle clusterware is healthy on node "dm01db02"

CRS integrity check passed

Result:

Node removal check passed

Post-check for node removal was successful

 

 

Step 2: Drop and Re-create /u01

The /u01 file system resides on the logical volume /dev/mapper/VGExaDb-LVDbOra1.

  1. # umount /u01
  2. Issue df -k to note the file system type, and review /etc/fstab to verify the mount options (see the example after the note below).
  3. Re-format /u01 to clean it out and re-create the inodes, using either command:
    1. # mkfs -t ext3 /dev/mapper/VGExaDb-LVDbOra1
    2. OR
    3. # mkfs.ext3 /dev/mapper/VGExaDb-LVDbOra1

Note: when creating the filesystem, check the filesystem version on a healthy node first to determine whether it is ext3 or ext4, then create the same version. More recent factory images ship with ext4.
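For example, the filesystem type and mount options can be confirmed on the healthy node with standard Linux tools before formatting:

[root@surviving]# df -kT /u01                          # the Type column shows ext3 or ext4
[root@surviving]# grep u01 /etc/fstab                  # note the mount options
[root@surviving]# blkid /dev/mapper/VGExaDb-LVDbOra1   # prints TYPE="ext3" or TYPE="ext4"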

  1. # mount -t ext3 /dev/mapper/VGExaDb-LVDbOra1 /u01   (use ext4 if that is the version created above)
  2. On the failed node, create the directories:
    1. /u01/app
    2. /u01/app/11.2.0.4/grid
  3. Grant correct ownership and permissions on the directories:
    1. [root@replacement]# mkdir -p /u01/app/11.2.0.4/grid/
    2. [root@replacement]# chown oracle /u01/app/11.2.0.4/grid
    3. [root@replacement]# chgrp -R oinstall /u01/app/11.2.0.4/grid
    4. [root@replacement]# chmod -R 775 /u01/
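Before proceeding, a quick sanity check (paths as created above) confirms the mount, ownership, and permissions:

[root@replacement]# df -h /u01
[root@replacement]# ls -ld /u01/app /u01/app/11.2.0.4/grid   # the grid directory should show owner oracle, group oinstall, mode 775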

 

Step 3: Add the Node Back to the Cluster

Clone Oracle Grid Infrastructure to the Replacement Database Server

1. Verify the hardware and operating system installations with the Cluster Verification Utility (CVU):

[oracle@surviving]$ cluvfy stage -post hwos -n dm01db01,dm01db02 -verbose

At the end of the report, you should see the text:  “Post-check for hardware and operating system setup was successful.”

 

2. Verify peer compatibility:

[oracle@surviving]$ cluvfy comp peer -refnode dm01db02 -n dm01db01 -orainv oinstall -osdba dba | grep -B 3 -A 2 mismatched

Compatibility check: Available memory [reference node: dm01db02]

Node Name Status Ref. node status Comment

------------ ----------------------- ----------------------- ----------

dm01db01 31.02GB (3.2527572E7KB) 29.26GB (3.0681252E7KB) mismatched

Available memory check failed

Compatibility check: Free disk space for "/tmp" [reference node: dm01db02]

Node Name Status Ref. node status Comment

------------ ----------------------- ---------------------- ----------

dm01db01 55.52GB (5.8217472E7KB) 51.82GB (5.4340608E7KB) mismatched

Free disk space check failed

If the only components that failed are related to physical memory, swap space, and disk space, then it is safe to continue.

 

3. Perform requisite checks for node addition:

[oracle@surviving]$ cluvfy stage -pre nodeadd -n dm01db01 -fixup -fixupdir /home/oracle/fixup.d

If the only component that fails is related to swap space, then it is safe to continue.

 

4. Add the replacement database server into the cluster: 

NOTE: addnode.sh may error out on files that are readable only by root, giving errors similar to those in MOS Note 1526405.1. Follow the workaround in that note for these files and rerun addnode.sh.

[oracle@surviving]$ cd /u01/app/11.2.0/grid/oui/bin/

[oracle@surviving]$ ./addnode.sh -silent "CLUSTER_NEW_NODES={dm01db01}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={dm01db01-vip}"

This initiates the OUI to copy the clusterware software to the replacement database server.

 

WARNING: A new inventory has been created on one or more nodes in this session.

However, it has not yet been registered as the central inventory of this system.

To register the new inventory please run the script at '/u01/app/oraInventory/orainstRoot.sh' with root privileges on nodes 'dm01db01'.

 

If you do not register the inventory, you may not be able to update or patch the products you installed.

 

The following configuration scripts need to be executed as the "root" user in each cluster node:

/u01/app/oraInventory/orainstRoot.sh  #On nodes dm01db01

/u01/app/11.2.0/grid/root.sh  #On nodes dm01db01

 

To execute the configuration scripts:

a) Open a terminal window.

b) Log in as root.

c) Run the scripts on each cluster node.

After the scripts are finished, you should see the following informational messages:

The Cluster Node Addition of /u01/app/11.2.0/grid was successful.

Please check '/tmp/silentInstall.log' for more details.

 

5. Run the orainstRoot.sh and root.sh scripts for the replacement database server: 

NOTE: orainstRoot.sh does not need to be run if only /u01 was re-created and the / filesystem was unchanged or restored, because the oraInst.loc and oratab files still exist.
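To confirm this on the replacement node before deciding, check that both files survived (a minimal sketch; on a standard build oraInst.loc points at the inventory under /u01):

[root@replacement]# ls -l /etc/oraInst.loc /etc/oratab
[root@replacement]# cat /etc/oraInst.loc              # typically shows inventory_loc=/u01/app/oraInventory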

[root@replacement]# /u01/app/oraInventory/orainstRoot.sh

Creating the Oracle inventory pointer file (/etc/oraInst.loc)

Changing permissions of /u01/app/oraInventory.

Adding read,write permissions for group.

Removing read,write,execute permissions for world.

Changing groupname of /u01/app/oraInventory to oinstall.

The execution of the script is complete.

[root@replacement]# /u01/app/11.2.0/grid/root.sh

Check /u01/app/11.2.0/grid/install/root_dm01db01.acme.com_2010-03-10_17-59-15.log for the output of root script

The output file created above will report that the LISTENER resource on the replaced database server failed to start.

This is the expected output:

PRCR-1013 : Failed to start resource ora.LISTENER.lsnr

PRCR-1064 : Failed to start resource ora.LISTENER.lsnr on node dm01db01

CRS-2662: Resource 'ora.LISTENER.lsnr' is disabled on server 'dm01db01'
start listener on node=dm01db01 ... failed

 

6. Re-enable and start the listener resource that was stopped and disabled in Step 1:

[root@replacement]# /u01/app/11.2.0/grid/bin/srvctl enable listener -l LISTENER -n dm01db01

[root@replacement]# /u01/app/11.2.0/grid/bin/srvctl start listener -l LISTENER -n dm01db01
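To verify, a standard srvctl status check should now show the listener enabled and running, with output along these lines:

[root@replacement]# /u01/app/11.2.0/grid/bin/srvctl status listener -l LISTENER -n dm01db01
Listener LISTENER is enabled on node(s): dm01db01
Listener LISTENER is running on node(s): dm01db01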

 

 

Step 4: Clone Oracle Database Homes to Replacement Database Server

1. Add the RDBMS ORACLE_HOME on the replacement database server:

[oracle@surviving]$ cd /u01/app/oracle/product/11.2.0/dbhome_1/oui/bin/

[oracle@surviving]$ ./addnode.sh -silent "CLUSTER_NEW_NODES={dm01db01}"

This command initiates the OUI (Oracle Universal Installer) to copy the Oracle Database software to the replacement database server. However, to complete the installation, you must run the root scripts on the replacement database server after the command completes.

 

WARNING: The following configuration scripts need to be executed as the “root” user in each cluster node.

/u01/app/oracle/product/11.2.0/dbhome_1/root.sh #On nodes dm01db01

 

To execute the configuration scripts:

Open a terminal window.

Log in as root.

Run the scripts on each cluster node.

After the scripts are finished, you should see the following informational messages:

The Cluster Node Addition of /u01/app/oracle/product/11.2.0/dbhome_1 was successful.

Please check '/tmp/silentInstall.log' for more details.

 

2. Run the following scripts on the replacement database server:

[root@replacement]# /u01/app/oracle/product/11.2.0/dbhome_1/root.sh

Check /u01/app/oracle/product/11.2.0/dbhome_1/install/root_dm01db01.acme.com_2010-03-10_18-27-16.log for the output of root script

 

3. Validate the initialization parameter and password files

Verify that the init<SID>.ora file under $ORACLE_HOME/dbs references the spfile in ASM shared storage.

Review the password file that is copied under $ORACLE_HOME/dbs during addnode; it needs to be renamed to orapw<SID>.
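For example, assuming a hypothetical instance ORCL1 on the replacement node (the SID and file names below are illustrative; substitute your own):

[oracle@replacement]$ cat $ORACLE_HOME/dbs/initORCL1.ora
SPFILE='+DATA/ORCL/spfileORCL.ora'                    # expected content: a single line pointing to the spfile in ASM
[oracle@replacement]$ mv $ORACLE_HOME/dbs/orapwORCL $ORACLE_HOME/dbs/orapwORCL1   # rename the copied password file to orapw<SID>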

 

 

References

<NOTE:1084360.1> - Bare Metal Restore Procedure for Compute Nodes on an Exadata Environment
<NOTE:1664897.1> - EXT3 File system Error "EXT3-fs error (device dm-5): ext3_lookup: deleted inode referenced"

Attachments
This solution has no attachment