ODA (Oracle Database Appliance): The Steps to replace multiple disks failing concurrently

Asset ID:	1-71-1496114.1
Update Date:	2016-08-01
Keywords:

Solution Type Technical Instruction Sure

Solution 1496114.1 : ODA (Oracle Database Appliance): The Steps to replace multiple disks failing concurrently

Applies to:

Oracle Database Appliance - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance Software - Version 2.1.0.1 to 2.9.0.0 [Release 2.1 to 2.9]
Information in this document applies to any platform.

Goal

This article describes the steps to follow when replacing failing disks on an ODA (Oracle Database Appliance).

Solution

If one or more disks need to be replaced, follow these steps:

1. Check the current disk status

Check the current status of your shared storage disks from both the ASM and the OAK perspective. Make sure all other disks are healthy and that ASM redundancy will not be affected by removing the disk you intend to replace:

- From an ASM point of view (log in as user grid), issue the following commands:

export ORACLE_SID=+ASM1
asmcmd lsdsk -p

A correctly working disk shows the status CACHED, MEMBER, ONLINE, NORMAL (columns Mount_Stat, Header_Stat, Mode_Stat and State).

Check for a negative Usable_file_MB value by issuing the command:

asmcmd lsdg

If Usable_file_MB is negative, check the reason:

a) If disks are missing:

- candidate disk for replacement: continue with the procedure
- non-candidate disk: investigate the reason and add it back (involve ODA support to check the details and the required action)

b) If space is overallocated, free some disk space in ASM, e.g. remove archive logs that have already been backed up.

- From an OAK point of view, issue the commands:

oakcli show disk

and

oakcli show diskgroup DATA
oakcli show diskgroup REDO
oakcli show diskgroup RECO
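
The two ASM checks in this step can be scripted. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes that `asmcmd lsdsk -p` and `asmcmd lsdg` print a header line with the column names Mount_Stat, Header_Stat, Mode_Stat, State, Path, Usable_file_MB and Name. Columns are located by name rather than by position, since the layout may differ between versions.

```shell
#!/bin/sh
# Sketch only: both functions read asmcmd output on stdin.
# Assumption: the first line of the output is a header with the column names.

# Print the path of every disk whose status is not CACHED MEMBER ONLINE NORMAL.
unhealthy_disks() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Mount_Stat" in col) || !("Path" in col)) exit 2   # unexpected header
            next
        }
        NF == 0 { next }
        $col["Mount_Stat"] != "CACHED" || $col["Header_Stat"] != "MEMBER" ||
        $col["Mode_Stat"] != "ONLINE" || $col["State"] != "NORMAL" { print $col["Path"] }
    '
}

# Print "<diskgroup> <value>" for every disk group with a negative Usable_file_MB.
negative_usable_groups() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Usable_file_MB" in col) || !("Name" in col)) exit 2   # unexpected header
            next
        }
        NF > 0 && $col["Usable_file_MB"] + 0 < 0 {
            name = $col["Name"]
            sub("/$", "", name)          # lsdg prints names with a trailing slash
            print name, $col["Usable_file_MB"]
        }
    '
}
```

As user grid, a possible use is `asmcmd lsdsk -p | unhealthy_disks` and `asmcmd lsdg | negative_usable_groups`. Empty output from both means neither check found a problem; an exit status of 2 means the header did not look as assumed.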

2. Take a backup of your databases and cloud file systems (ACFS) in case something goes wrong

3. Identify the failed disk

To identify the disk that needs to be replaced, issue the following command to turn on the LED of the disk:

ODA V1:

oakcli locate disk pd_xx on
(where xx is the number in the range of 01 to 23)

ODA X3-2 and higher:

oakcli locate disk eX_pd_xx on
(where X=0 or 1 and xx is the number in the range of 01 to 23)
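
The two naming schemes can be wrapped in a small helper so that the same script works on both hardware generations. This is only a sketch: `oak_disk_name` is a hypothetical function name, and the range check simply follows the 01 to 23 range and the enclosure values 0 and 1 given above.

```shell
#!/bin/sh
# oak_disk_name SLOT [ENCLOSURE]
# Prints pd_NN (ODA V1) or eX_pd_NN (X3-2 and higher) when an enclosure is given.
oak_disk_name() {
    slot=$1
    encl=${2:-}
    case $slot in
        ''|*[!0-9]*) echo "slot must be a number" >&2; return 1 ;;
    esac
    # Strip leading zeros so that 08 or 09 is not treated as an octal number.
    slot=$(printf '%s' "$slot" | sed 's/^0*//')
    slot=${slot:-0}
    if [ "$slot" -lt 1 ] || [ "$slot" -gt 23 ]; then
        echo "slot out of range 01-23" >&2
        return 1
    fi
    case $encl in
        '')  printf 'pd_%02d\n' "$slot" ;;
        0|1) printf 'e%s_pd_%02d\n' "$encl" "$slot" ;;
        *)   echo "enclosure must be 0 or 1" >&2; return 1 ;;
    esac
}
```

For example, `oakcli locate disk "$(oak_disk_name 16 0)" on` would address e0_pd_16, while `oak_disk_name 16` alone yields the V1 form pd_16.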

4. Monitor the disk operations

Oracle recommends monitoring the disk operations by tailing the ASM alert log on both nodes during disk replacement:

<node1> tail -f /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
<node2> tail -f /u01/app/grid/diag/asm/+asm/+ASM2/trace/alert_+ASM2.log

Look for events showing that the disk was removed, that the new disk was added, and that the disk group was rebalanced.
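
To reduce the noise while tailing, the alert log can be filtered for the disk in question. The sketch below is an illustration only: `filter_disk_events` is a hypothetical name, and matching on the word "rebalance" is an assumption about how the alert log words these events, so widen the pattern if your log uses different wording.

```shell
#!/bin/sh
# filter_disk_events PATTERN
# Reads alert log lines on stdin and prints those that mention PATTERN
# (for example a slot label such as S16) or a rebalance, ignoring case.
filter_disk_events() {
    grep -i -E -e "$1" -e 'rebalance'
}
```

A possible use on node 1 is `tail -f /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log | filter_disk_events S16`.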

5. Pull out the bad disk

Wait until the disk has been removed from ASM. You can verify that the disk has been "removed" with the following commands:

fwupdate list disk
oakcli show disk pd_<slotnumber>
oakcli show diskgroup

grid> asmcmd lsdsk -p -t|grep <Slotnumber - Sxx>

6. Insert the new disk

7. Check the status of the new inserted disk

Wait until the disk goes online. Check with:

ODA V1:

oakcli show disk pd_<slotnumber>

e.g.:
oakcli show disk pd_16

ODA X3-2 and higher:

oakcli show disk e<jbod_number>_pd_<slotnumber>

e.g.:
oakcli show disk e0_pd_16

and

grid> asmcmd lsdsk -p -t|grep <Slotnumber - Sxx>
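
Waiting for the disk to come online can be scripted as a polling loop with a timeout, which fits the five-minute limit used in this step. This is a minimal sketch: `wait_until` and `disk_online` are hypothetical helper names, and `disk_online` assumes that a line of `asmcmd lsdsk -p` containing both the slot label and the word ONLINE means the disk is online.

```shell
#!/bin/sh
# wait_until LIMIT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 1 once LIMIT_SECONDS of sleep
# time have accumulated (the run time of COMMAND itself is not counted).
wait_until() {
    limit=$1
    interval=$2
    shift 2
    waited=0
    while ! "$@"; do
        if [ "$waited" -ge "$limit" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# disk_online SLOT_LABEL
# Succeeds when asmcmd lists a line containing SLOT_LABEL and ONLINE.
disk_online() {
    asmcmd lsdsk -p | grep "$1" | grep -q ONLINE
}
```

As user grid, `wait_until 300 15 disk_online S16` would then poll every 15 seconds for up to 5 minutes and return a non-zero status if the disk is still not online.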

If the disk does not go online after 5 minutes, restart oak. Log in as root and run:

oakcli restart oak

Update:
ODA -> Do NOT use FORCE or REINIT for disk replacement

If the disk does not come online, pull it out, insert it again and check the status.
If the disk still fails to come online, a reboot may be required (check the impact of a reboot first).

If the disk is still not coming ONLINE, you may want to try the following:

1- verify that the disk has not been added to ASM
2- find the disk device
3- overwrite the initial disk area with dd (this destroys the data on that device, so double-check the device name), for example:
dd if=/dev/zero of=/dev/mapper/HDD_E1_S19_372682224 bs=8192 count=1000
4- remove the disk
5- wait for 3 minutes
6- reinsert the disk
7- wait for 3 minutes
8- check the oak disk status: "oakcli show disk"
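
Because the dd command is destructive, the first item of the list above can be turned into a guard that is run before anything is overwritten. The following is a sketch under stated assumptions: `device_listed` is a hypothetical name, and it assumes that the path is the last column of the `asmcmd lsdsk -p --member` output.

```shell
#!/bin/sh
# device_listed DEVICE
# Reads "asmcmd lsdsk -p --member" output on stdin and succeeds if any listed
# path starts with DEVICE. A prefix match is used so that partitions of the
# device (names ending in p1, p2) are caught as well.
device_listed() {
    awk -v dev="$1" '
        NR == 1 { next }                       # skip the header line
        index($NF, dev) == 1 { found = 1 }     # path is assumed to be the last column
        END { exit found ? 0 : 1 }
    '
}
```

For example, `asmcmd lsdsk -p --member | device_listed /dev/mapper/HDD_E1_S19_372682224` returns success if the device (or one of its partitions) is an ASM member, in which case the dd step should not be run.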

8. Check the ASM status of the new disk

As the grid OS user, verify that the disk has been accepted by ASM. Its status should be member or initializing:

grid> asmcmd lsdsk -p -t --member|grep <Slotnumber - Sxx>

Compare the disk number (path) with the slot number of the lsdsk output.

Contact Support if these numbers do not match. DO NOT REPLACE MORE DISKS.
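
The comparison between the device path and the slot can be automated. The sketch below is hypothetical helper code, not part of OAK: it assumes device names follow the pattern seen in the examples of this note (such as /dev/mapper/HDD_E0_S04_971463627p1), where the E field is the enclosure and the S field is the slot.

```shell
#!/bin/sh
# slot_label PATH
# Prints the enclosure and slot fields of a device name, e.g. "E0 S04".
# Fails (non-zero status, no output) if the name does not follow the pattern.
slot_label() {
    printf '%s\n' "$1" |
        sed -n 's#^.*/[A-Z]*_\(E[0-9]*\)_\(S[0-9]*\)_.*$#\1 \2#p' |
        grep .
}

# path_matches_slot PATH ENCLOSURE SLOT
# Succeeds if PATH belongs to the given enclosure and two-digit slot.
path_matches_slot() {
    [ "$(slot_label "$1")" = "E$2 S$3" ]
}
```

For a disk replaced in enclosure 0, slot 04, `path_matches_slot /dev/mapper/HDD_E0_S04_971463627p1 0 04` succeeds. A failure would be the mismatch case described above.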

If the disk was not added, add it manually using the disk name reported by the asmcmd output above (the following is only an example):

grid> sqlplus / as sysasm
SQL> alter diskgroup /*+ _OAK_AsmCookie */ DATA add disk '/dev/mapper/HDD_E0_S04_971463627p1' name HDD_E0_S04_971463627p1;
SQL> alter diskgroup /*+ _OAK_AsmCookie */ RECO add disk '/dev/mapper/HDD_E0_S04_971463627p2' name HDD_E0_S04_971463627p2;

If the command does not work properly, contact Oracle Support.

DO NOT CONTINUE WITH THE NEXT DISK UNTIL THE DISK IS ACCEPTED BY ASM!

9. Check ASM rebalance operation

If you need to replace more than one disk, check in ASM that the rebalance has finished before continuing:

grid> asmcmd lsdg (check the Rebal column: Y means a rebalance is still running)

or execute the following query:

SQL> select GROUP_NUMBER, OPERATION, STATE, ACTUAL, SOFAR, EST_MINUTES from gv$asm_operation;

Rebalance the disk groups (optional, only if ASM did not start the rebalance automatically):

grid> asmcmd rebal DATA --power 11 -w (waits and prints rebalance complete when finished)
grid> asmcmd rebal RECO --power 11 -w

When the rebalance has finished, continue with the next disk.
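
The check of the Rebal column can be scripted so that the next disk is only touched once no disk group is rebalancing. This is a minimal sketch: `rebalancing_groups` is a hypothetical name, and it assumes the `asmcmd lsdg` header contains the column names Rebal and Name.

```shell
#!/bin/sh
# rebalancing_groups
# Reads "asmcmd lsdg" output on stdin and prints the name of every disk group
# whose Rebal column is Y. Exits with status 2 if the header looks different.
rebalancing_groups() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Rebal" in col) || !("Name" in col)) exit 2   # unexpected header
            next
        }
        NF > 0 && $col["Rebal"] == "Y" {
            name = $col["Name"]
            sub("/$", "", name)          # lsdg prints names with a trailing slash
            print name
        }
    '
}

# Possible use as user grid (commented out so that sourcing this file has no side effects):
#   while [ -n "$(asmcmd lsdg | rebalancing_groups)" ]; do sleep 60; done
```

The loop in the comment polls once a minute and ends when no disk group reports Y. It is worth confirming the result with the gv$asm_operation query above before pulling the next disk.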

References

<NOTE:1457254.1> - ODA (Oracle Database Appliance): after disk failure some disks are in ASM mount_status 'CLOSED'
<NOTE:1382300.1> - ODA (Oracle Database Appliance) : How to replace FAILED SYSTEM BOOT DISK
<NOTE:1534154.1> - Oracle Database Appliance FCO 0328 Disk Replacement Procedure
