ODA (Oracle Database Appliance): Test Plan Outline (Doc ID 1474273.1)
Update Date: 2017-10-15
Related Items:
- Oracle Database Appliance Software
- Oracle Database Appliance
Applies to:
Oracle Database Appliance Software - Version 2.1.0.1 and later
Oracle Database Appliance - Version All Versions and later
Information in this document applies to any platform.
***Checked for relevance on 23-Jun-2014***
Purpose
Before a new computer/cluster system is deployed in production, it is important to test the system thoroughly to validate that it will perform at a satisfactory level relative to its service level objectives. Testing is also required when introducing major or minor changes to the system. This document provides an outline consisting of basic guidelines and recommendations for how to test a new ODA (Oracle Database Appliance) system. It can be used as a framework for building a system test plan specific to each company's ODA implementation and the associated service level objectives.
Scope
This document provides an outline of basic testing guidelines, in the form of an organized test plan, that can be used to validate core component functionality of an ODA system. Every application exercises the underlying software and hardware infrastructure differently, and must be tested as part of a component testing strategy. Each new system must be tested thoroughly, in an environment that is a realistic representation of the production environment in terms of configuration, capacity, and workload, prior to going live or after implementing significant architectural/system modifications. Without a completed system implementation and functional end-user applications, only core component testing is possible, verifying cluster, RDBMS, and sub-component behaviors for the networking and I/O subsystems and miscellaneous database administrative functions.
In addition to the specific system testing outlined in this document, additional testing needs to be defined and executed for RMAN, backup and recovery, and Data Guard (for disaster recovery). Each component area of testing also requires specific operational procedures to be documented and maintained to address site-specific requirements.
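For example, a minimal RMAN validation check might look like the following (a sketch only, assuming an already configured RMAN environment: BACKUP VALIDATE reads the datafiles and checks for corruption without writing a backup, and RESTORE ... VALIDATE verifies that existing backups are usable for a restore; it does not replace a full, site-specific backup and recovery test plan):
$ rman target /
RMAN> BACKUP VALIDATE DATABASE;
RMAN> RESTORE DATABASE VALIDATE;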
Details
Test Case 1 - Simulate failures to Internal (OS) disks
Test description
Each ODA node has two Seagate 500GB serial ATA hard drives that are used for the operating system. This test shows what happens if one of these disks is damaged or lost for any reason.
Test result
After simulating a corruption or failure of one of the two internal OS disks, the ODA is still working and oakcli shows the failure.
Test Steps
1. Check your initial disk configuration
You can use the OS command mdadm (manage MD devices, aka Linux Software RAID); see the man page for further details (man mdadm):
# mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Thu Dec 8 12:25:33 2011
Raid Level : raid1
Array Size : 104320 (101.89 MiB 106.82 MB)
Used Dev Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Tue Jan 3 04:02:39 2012
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 1751c3b7:1a4d91b5:3cba0f44:a85d2398
Events : 0.14
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
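As a quick complement to mdadm --detail, /proc/mdstat shows the state of all software RAID arrays at once (illustrative output for the md0 array above; [UU] means both mirror members are active):
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]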
2. Check the initial disk status with oakcli
Issuing "oakcli validate -c OSDiskStorage" you can validate the operating system disks and file system information:
# /opt/oracle/oak/bin/oakcli validate -c OSDiskStorage
INFO: Checking Operating System Storage
SUCCESS: The OS disks have the boot stamp
RESULT: Raid device /dev/md0 found clean
RESULT: Raid device /dev/md1 found clean
RESULT: Physical Volume /dev/md1 in VolGroupSys has 270213.84M out of total 499994.59M
RESULT: Volumegroup VolGroupSys consist of 1 physical volumes,contains 4 logical volumes, has 0 volume snaps with total size of 499994.59M and free space of 270213.84M
RESULT: Logical Volume LogVolOpt in VolGroupSys Volume group is of size 60.00G
RESULT: Logical Volume LogVolRoot in VolGroupSys Volume group is of size 30.00G
RESULT: Logical Volume LogVolSwap in VolGroupSys Volume group is of size 24.00G
RESULT: Logical Volume LogVolU01 in VolGroupSys Volume group is of size 100.00G
RESULT: Device /dev/mapper/VolGroupSys-LogVolRoot is mounted on / of type ext3 in (rw)
RESULT: Device /dev/md0 is mounted on /boot of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolOpt is mounted on /opt of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolU01 is mounted on /u01 of type ext3 in (rw)
RESULT: / has 19344 MB free out of total 29758 MB
RESULT: /boot has 74 MB free out of total 99 MB
RESULT: /opt has 31944 MB free out of total 59516 MB
RESULT: /u01 has 62255 MB free out of total 99194 MB
3. Simulate a disk failure
You can use the OS dd (convert and copy a file) command (see man dd for further details); in this case we wipe the first 512 blocks of the disk with zeros:
# dd if=/dev/zero of=<device name> count=512
512+0 records in
512+0 records out
262144 bytes (262 kB) copied, 0.000519 seconds, 505 MB/s
Note: pay close attention to using the right device name; writing to the wrong device will destroy its data.
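If you prefer a simulation that does not overwrite any data, mdadm can mark an array member as failed at the software RAID layer (a hedged alternative sketch; device names are examples, and note this exercises md failover rather than the oakcli boot stamp check used below):
# mdadm --manage /dev/md0 --fail /dev/sdb1
# mdadm --manage /dev/md0 --remove /dev/sdb1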
4. OAK recognizes the failure
Issuing the oakcli validate command again, we see the failure is recognized:
# oakcli validate -c OSDiskStorage
INFO: Checking Operating System Storage
ERROR: OS disk sdb does not have right boot stamp
WARNING: Check MBR stamp on OS disk failed
RESULT: Raid device /dev/md0 found clean
RESULT: Raid device /dev/md1 found clean
RESULT: Physical Volume /dev/md1 in VolGroupSys has 270213.84M out of total 499994.59M
RESULT: Volumegroup VolGroupSys consist of 1 physical volumes,contains 4 logical volumes, has 0 volume snaps with total size of 499994.59M and free space of 270213.84M
RESULT: Logical Volume LogVolOpt in VolGroupSys Volume group is of size 60.00G
RESULT: Logical Volume LogVolRoot in VolGroupSys Volume group is of size 30.00G
RESULT: Logical Volume LogVolSwap in VolGroupSys Volume group is of size 24.00G
RESULT: Logical Volume LogVolU01 in VolGroupSys Volume group is of size 100.00G
RESULT: Device /dev/mapper/VolGroupSys-LogVolRoot is mounted on / of type ext3 in (rw)
RESULT: Device /dev/md0 is mounted on /boot of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolOpt is mounted on /opt of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolU01 is mounted on /u01 of type ext3 in (rw)
RESULT: / has 19344 MB free out of total 29758 MB
RESULT: /boot has 74 MB free out of total 99 MB
RESULT: /opt has 31942 MB free out of total 59516 MB
RESULT: /u01 has 62268 MB free out of total 99194 MB
The ODA server where we have simulated the internal HDD failure/corruption is still working. The mirror configuration (RAID 1) provides the capability to survive an OS disk failure.
5. Restore the disk
Let's suppose now that the failed disk has been restored. In this case we copy the "good" data from disk 1 (sda). Using the OS command 'dd' again, we copy the first 512 blocks (which carry the boot stamp) from the good disk sda to the failed disk sdb:
# dd if=/dev/sda of=/dev/sdb count=512
512+0 records in
512+0 records out
262144 bytes (262 kB) copied, 0.00072 seconds, 364 MB/s
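If during your test the md layer marked a member as failed (for example after the mdadm sketch above), the partition also needs to be re-added to the array so the kernel can resynchronize it; a sketch with assumed device names, watching the resync progress in /proc/mdstat:
# mdadm --manage /dev/md0 --re-add /dev/sdb1
# cat /proc/mdstat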
6. Check the disk status using OAK
As the failed disk has been restored, OAK recognizes its good status. Running 'oakcli validate -c OSDiskStorage' confirms that the failed disk has been restored, with the status "SUCCESS: The OS disks have the boot stamp":
# oakcli validate -c OSDiskStorage
INFO: Checking Operating System Storage
SUCCESS: The OS disks have the boot stamp
RESULT: Raid device /dev/md0 found clean
RESULT: Raid device /dev/md1 found clean
RESULT: Physical Volume /dev/md1 in VolGroupSys has 270213.84M out of total 499994.59M
RESULT: Volumegroup VolGroupSys consist of 1 physical volumes,contains 4 logical volumes, has 0 volume snaps with total size of 499994.59M and free space of 270213.84M
RESULT: Logical Volume LogVolOpt in VolGroupSys Volume group is of size 60.00G
RESULT: Logical Volume LogVolRoot in VolGroupSys Volume group is of size 30.00G
RESULT: Logical Volume LogVolSwap in VolGroupSys Volume group is of size 24.00G
RESULT: Logical Volume LogVolU01 in VolGroupSys Volume group is of size 100.00G
RESULT: Device /dev/mapper/VolGroupSys-LogVolRoot is mounted on / of type ext3 in (rw)
RESULT: Device /dev/md0 is mounted on /boot of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolOpt is mounted on /opt of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolU01 is mounted on /u01 of type ext3 in (rw)
RESULT: / has 19344 MB free out of total 29758 MB
RESULT: /boot has 74 MB free out of total 99 MB
RESULT: /opt has 31939 MB free out of total 59516 MB
RESULT: /u01 has 62266 MB free out of total 99194 MB
Test Case 2 - HDD (Hard Disk Drive) failure
Test description
An Oracle Database Appliance has twenty 600GB 3.5" SAS 15k RPM HDDs, used by ASM for the DATA and RECO disk groups. This test shows what happens in case one HDD of the shared storage is damaged or lost for any reason.
Test result
After simulating a shared storage HDD failure, the ODA database/instance continues working and oakcli shows the failure.
Test Steps
- Startup system and database
- Verify all disks are online (v$asm_disk) and verify DATA and RECO disk group configuration
- Remove a hard disk manually by pulling it out of the slot (from any slot except the top row of disks)
- Verify that an alert is received
- Verify the disk is not available to ASM (v$asm_disk) and verify DATA and RECO disk group configuration
- Reinsert the hard disk (same slot)
- Verify all disks are online (v$asm_disk) and verify DATA and RECO disk group configuration
Details of above steps:
1. Startup system and database
2. Verify initial shared disks status
Verify all disks are online (v$asm_disk) and verify the DATA and RECO disk group configuration by issuing queries on the ASM instance:
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
ORDER BY group_number, disk_number;
Example output:
1 0 HDD_E0_S00_967034331P1 NORMAL ONLINE CACHED
1 1 HDD_E0_S01_965477095P1 NORMAL ONLINE CACHED
1 2 HDD_E1_S02_966582999P1 NORMAL ONLINE CACHED
1 3 HDD_E1_S03_966592943P1 NORMAL ONLINE CACHED
1 4 HDD_E0_S04_969051883P1 NORMAL ONLINE CACHED
1 5 HDD_E0_S05_966535155P1 NORMAL ONLINE CACHED
1 6 HDD_E1_S06_967038139P1 NORMAL ONLINE CACHED
1 7 HDD_E1_S07_966537131P1 NORMAL ONLINE CACHED
1 8 HDD_E0_S08_967043831P1 NORMAL ONLINE CACHED
1 9 HDD_E0_S09_966584211P1 NORMAL ONLINE CACHED
1 10 HDD_E1_S10_967036703P1 NORMAL ONLINE CACHED
1 11 HDD_E1_S11_966589399P1 NORMAL ONLINE CACHED
1 12 HDD_E0_S12_967036523P1 NORMAL ONLINE CACHED
1 13 HDD_E0_S13_966800467P1 NORMAL ONLINE CACHED
1 14 HDD_E1_S14_967038379P1 NORMAL ONLINE CACHED
1 15 HDD_E1_S15_967035195P1 NORMAL ONLINE CACHED
1 16 HDD_E0_S16_966617223P1 NORMAL ONLINE CACHED
1 17 HDD_E0_S17_966520995P1 NORMAL ONLINE CACHED
1 18 HDD_E1_S18_966584379P1 NORMAL ONLINE CACHED
1 19 HDD_E1_S19_966573799P1 NORMAL ONLINE CACHED
2 0 HDD_E0_S00_967034331P2 NORMAL ONLINE CACHED
2 1 HDD_E0_S01_965477095P2 NORMAL ONLINE CACHED
2 2 HDD_E1_S02_966582999P2 NORMAL ONLINE CACHED
2 3 HDD_E1_S03_966592943P2 NORMAL ONLINE CACHED
2 4 HDD_E0_S04_969051883P2 NORMAL ONLINE CACHED
2 5 HDD_E0_S05_966535155P2 NORMAL ONLINE CACHED
2 6 HDD_E1_S06_967038139P2 NORMAL ONLINE CACHED
2 7 HDD_E1_S07_966537131P2 NORMAL ONLINE CACHED
2 8 HDD_E0_S08_967043831P2 NORMAL ONLINE CACHED
2 9 HDD_E0_S09_966584211P2 NORMAL ONLINE CACHED
2 10 HDD_E1_S10_967036703P2 NORMAL ONLINE CACHED
2 11 HDD_E1_S11_966589399P2 NORMAL ONLINE CACHED
2 12 HDD_E0_S12_967036523P2 NORMAL ONLINE CACHED
2 13 HDD_E0_S13_966800467P2 NORMAL ONLINE CACHED
2 14 HDD_E1_S14_967038379P2 NORMAL ONLINE CACHED
2 15 HDD_E1_S15_967035195P2 NORMAL ONLINE CACHED
2 16 HDD_E0_S16_966617223P2 NORMAL ONLINE CACHED
2 17 HDD_E0_S17_966520995P2 NORMAL ONLINE CACHED
2 18 HDD_E1_S18_966584379P2 NORMAL ONLINE CACHED
2 19 HDD_E1_S19_966573799P2 NORMAL ONLINE CACHED
3 20 SSD_E0_S20_805607370P1 NORMAL ONLINE CACHED
3 21 SSD_E0_S21_805607443P1 NORMAL ONLINE CACHED
3 22 SSD_E1_S22_805607458P1 NORMAL ONLINE CACHED
3 23 SSD_E1_S23_805607433P1 NORMAL ONLINE CACHED
44 rows selected.
From v$asm_disk you see all your disks as NORMAL, ONLINE, CACHED.
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP;
Example output:
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
DATA 512 MOUNTED HIGH 4,669,440 4,657,372 1,388,617
RECO 512 MOUNTED HIGH 6,204,640 5,967,132 1,771,337
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
From V$ASM_DISKGROUP you can see that your disk groups are mounted.
From the oakcli point of view:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
3. Remove a shared storage hard disk
Remove a hard disk manually by pulling it out of the slot (from any slot except the top row of disks).
oakcli shows the disk is now removed:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD FAILED DiskRemoved
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
4. Verify an alert is received
The I/O problem is documented in the ASM alert.log (you can inspect it using the adrci utility):
$ adrci
ADRCI: Release 11.2.0.2.0 - Production on Tue Feb 21 14:20:10 2012
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
ADR base = "/u01/app/grid"
adrci> show home
ADR Homes:
diag/asm/+asm/+ASM1
adrci> set home diag/asm/+asm/+ASM1
adrci> show alert -tail -f
Tue Feb 21 12:52:50 2012
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11832.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:2 disk:9 AU:0 offset:0 size:4096
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11832.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:1 disk:9 AU:0 offset:0 size:4096
SQL> alter diskgroup /*+ _OAK_AsmCookie */ DATA offline disk 'HDD_E0_S09_966584211p1'
NOTE: DRTimer CD Create: for disk group 1 disks:
9
NOTE: process _user11832_+asm1 (11832) initiating offline of disk 9.3916349682 (HDD_E0_S09_966584211P1) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 9/0xe96ec0f2, mode = 0x6a, op = 4
(...)
Tue Feb 21 13:05:19 2012
WARNING: Disk (HDD_E0_S09_966584211P1) will be dropped in: (12213) secs on ASM inst: (1)
WARNING: Disk (HDD_E0_S09_966584211P2) will be dropped in: (12213) secs on ASM inst: (1)
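The countdown in the warning comes from ASM's disk repair timer. You can query the remaining time directly on the ASM instance (a minimal sketch; REPAIR_TIMER is reported in seconds):
col name format a25
SELECT name, mode_status, repair_timer
FROM v$asm_disk
WHERE mode_status='OFFLINE';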
The OS signals the I/O problem (dmesg is the OS command to print or control the kernel ring buffer):
# dmesg
mpt2sas1: removing handle(0x0021), sas_addr(0x5000c500399ce791)
mpt2sas0: removing handle(0x0012), sas_addr(0x5000c500399ce791)
scsi 7:0:21:0: rejecting I/O to dead device
device-mapper: multipath: Failing path 66:224.
scsi 7:0:21:0: rejecting I/O to dead device
scsi 6:0:8:0: rejecting I/O to dead device
device-mapper: multipath: Failing path 8:160.
scsi 6:0:8:0: rejecting I/O to dead device
5. Verify the removed disk is not available
Verify the disk is not available to ASM (v$asm_disk) and verify the DATA and RECO disk group configuration:
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP;
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
DATA 512 MOUNTED HIGH 4,669,440 4,657,372 1,388,617
RECO 512 MOUNTED HIGH 6,204,640 5,967,132 1,771,337
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
-----
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
WHERE mode_status='OFFLINE'
ORDER BY group_number, disk_number;
1 9 HDD_E0_S09_966584211P1 NORMAL OFFLINE MISSING
2 9 HDD_E0_S09_966584211P2 NORMAL OFFLINE MISSING
The instances are still running. Checking for the SMON processes, you can see they are running. You can also connect to your instance as usual.
$ ps -ef|grep smon
grid 6030 1 0 Jan30 ? 00:00:00 asm_smon_+ASM1
oracle 13946 1 0 Jan31 ? 00:01:19 ora_smon_simpledb_1
oracle 16169 1 0 Jan30 ? 00:01:48 ora_smon_orcl1
grid 25237 15019 0 13:13 pts/3 00:00:00 grep smon
root 30298 1 0 Jan30 ? 01:29:18 /u01/app/11.2.0/grid/bin/osysmond.bin
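You can also confirm from the clusterware side that all resources are still healthy (a sketch; run from the Grid Infrastructure home):
# /u01/app/11.2.0/grid/bin/crsctl check cluster
# /u01/app/11.2.0/grid/bin/crsctl stat res -t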
6. Reinsert the hard disk (same slot)
oakcli will show the disk as ONLINE Good:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
If at this stage the reinserted disk is not marked as Good by the above oakcli command, you can restart oakd by issuing:
# oakcli restart oak
7. Verify all disks are online
Verify all disks are online (v$asm_disk) and verify the DATA and RECO disk group configuration from the ASM perspective:
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP;
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
DATA 512 MOUNTED HIGH 4,915,200 4,902,532 1,470,337
RECO 512 MOUNTED HIGH 6,531,200 6,281,728 1,876,202
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
ORDER BY group_number, disk_number;
1 0 HDD_E0_S00_967034331P1 NORMAL ONLINE CACHED
1 1 HDD_E0_S01_965477095P1 NORMAL ONLINE CACHED
1 2 HDD_E1_S02_966582999P1 NORMAL ONLINE CACHED
1 3 HDD_E1_S03_966592943P1 NORMAL ONLINE CACHED
1 4 HDD_E0_S04_969051883P1 NORMAL ONLINE CACHED
1 5 HDD_E0_S05_966535155P1 NORMAL ONLINE CACHED
1 6 HDD_E1_S06_967038139P1 NORMAL ONLINE CACHED
1 7 HDD_E1_S07_966537131P1 NORMAL ONLINE CACHED
1 8 HDD_E0_S08_967043831P1 NORMAL ONLINE CACHED
1 9 HDD_E0_S09_966584211P1 NORMAL ONLINE CACHED
1 10 HDD_E1_S10_967036703P1 NORMAL ONLINE CACHED
1 11 HDD_E1_S11_966589399P1 NORMAL ONLINE CACHED
1 12 HDD_E0_S12_967036523P1 NORMAL ONLINE CACHED
1 13 HDD_E0_S13_966800467P1 NORMAL ONLINE CACHED
1 14 HDD_E1_S14_967038379P1 NORMAL ONLINE CACHED
1 15 HDD_E1_S15_967035195P1 NORMAL ONLINE CACHED
1 16 HDD_E0_S16_966617223P1 NORMAL ONLINE CACHED
1 17 HDD_E0_S17_966520995P1 NORMAL ONLINE CACHED
1 18 HDD_E1_S18_966584379P1 NORMAL ONLINE CACHED
1 19 HDD_E1_S19_966573799P1 NORMAL ONLINE CACHED
2 0 HDD_E0_S00_967034331P2 NORMAL ONLINE CACHED
2 1 HDD_E0_S01_965477095P2 NORMAL ONLINE CACHED
2 2 HDD_E1_S02_966582999P2 NORMAL ONLINE CACHED
2 3 HDD_E1_S03_966592943P2 NORMAL ONLINE CACHED
2 4 HDD_E0_S04_969051883P2 NORMAL ONLINE CACHED
2 5 HDD_E0_S05_966535155P2 NORMAL ONLINE CACHED
2 6 HDD_E1_S06_967038139P2 NORMAL ONLINE CACHED
2 7 HDD_E1_S07_966537131P2 NORMAL ONLINE CACHED
2 8 HDD_E0_S08_967043831P2 NORMAL ONLINE CACHED
2 9 HDD_E0_S09_966584211P2 NORMAL ONLINE CACHED
2 10 HDD_E1_S10_967036703P2 NORMAL ONLINE CACHED
2 11 HDD_E1_S11_966589399P2 NORMAL ONLINE CACHED
2 12 HDD_E0_S12_967036523P2 NORMAL ONLINE CACHED
2 13 HDD_E0_S13_966800467P2 NORMAL ONLINE CACHED
2 14 HDD_E1_S14_967038379P2 NORMAL ONLINE CACHED
2 15 HDD_E1_S15_967035195P2 NORMAL ONLINE CACHED
2 16 HDD_E0_S16_966617223P2 NORMAL ONLINE CACHED
2 17 HDD_E0_S17_966520995P2 NORMAL ONLINE CACHED
2 18 HDD_E1_S18_966584379P2 NORMAL ONLINE CACHED
2 19 HDD_E1_S19_966573799P2 NORMAL ONLINE CACHED
3 20 SSD_E0_S20_805607370P1 NORMAL ONLINE CACHED
3 21 SSD_E0_S21_805607443P1 NORMAL ONLINE CACHED
3 22 SSD_E1_S22_805607458P1 NORMAL ONLINE CACHED
3 23 SSD_E1_S23_805607433P1 NORMAL ONLINE CACHED
44 rows selected.
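While the disk is brought back online, ASM resynchronizes the stale extents in the background. You can monitor the progress from the ASM instance (a minimal sketch; the view is empty once the operation completes):
SELECT group_number, operation, state, power, est_minutes
FROM v$asm_operation;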
From the ASM alert.log:
Tue Feb 21 13:38:56 2012
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
NOTE: PST update grp = 1 completed successfully
NOTE: reset timers for disk: 9
NOTE: completed online of disk group 1 disks
HDD_E0_S09_966584211P1 (9)
Tue Feb 21 13:38:58 2012
NOTE: Found /dev/mapper/HDD_E0_S09_966584211p2 for disk HDD_E0_S09_966584211P2
WARNING: ignoring disk in deep discovery
SUCCESS: validated disks for 2/0x83be304f (RECO)
GMON querying group 2 at 47 for pid 46, osid 11274
NOTE: membership refresh pending for group 2/0x83be304f (RECO)
GMON querying group 2 at 48 for pid 18, osid 6032
NOTE: cache opening disk 9 of grp 2: HDD_E0_S09_966584211P2 path:/dev/mapper/HDD_E0_S09_966584211p2
SUCCESS: refreshed membership for 2/0x83be304f (RECO)
NOTE: initiating PST update: grp = 2, dsk = 9/0x0, mode = 0x5d, op = 1
SUCCESS: alter diskgroup /*+ _OAK_AsmCookie */ RECO online disk 'HDD_E0_S09_966584211p2'
GMON updating disk modes for group 2 at 49 for pid 46, osid 11274
NOTE: PST update grp = 2 completed successfully
NOTE: initiating PST update: grp = 2, dsk = 9/0x0, mode = 0x7d, op = 1
GMON updating disk modes for group 2 at 50 for pid 46, osid 11274
NOTE: PST update grp = 2 completed successfully
NOTE: Voting File refresh pending for group 2/0x83be304f (RECO)
NOTE: Attempting voting file refresh on diskgroup RECO
Tue Feb 21 13:40:56 2012
NOTE: initiating PST update: grp = 2, dsk = 9/0x0, mode = 0x7f, op = 1
Tue Feb 21 13:40:56 2012
GMON updating disk modes for group 2 at 51 for pid 46, osid 11274
NOTE: PST update grp = 2 completed successfully
NOTE: reset timers for disk: 9
NOTE: completed online of disk group 2 disks
HDD_E0_S09_966584211P2 (9)
and dmesg shows the disk is recognized:
mpt2sas0: detecting: handle(0x0012), sas_address(0x5000c500399ce791), phy(8)
mpt2sas0: REPORT_LUNS: handle(0x0012), retries(0)
mpt2sas0: TEST_UNIT_READY: handle(0x0012), lun(0)
Vendor: SEAGATE Model: ST360057SSUN600G Rev: 0A25
Type: Direct-Access ANSI SCSI revision: 05
scsi 6:0:26:0: SSP: handle(0x0012), sas_addr(0x5000c500399ce791), phy(8), device_name(0x00c5005091e79c39)
scsi 6:0:26:0: SSP: enclosure_logical_id(0x5080020000b16e00), slot(9)
scsi 6:0:26:0: serial_number(001112E0L4P2 6SL0L4P2)
scsi 6:0:26:0: qdepth(254), tagged(1), simple(1), ordered(0), scsi_level(6), cmd_que(1)
mpt2sas1: detecting: handle(0x0021), sas_address(0x5000c500399ce791), phy(8)
mpt2sas1: REPORT_LUNS: handle(0x0021), retries(0)
mpt2sas1: TEST_UNIT_READY: handle(0x0021), lun(0)
SCSI device sdaz: 1172123568 512-byte hdwr sectors (600127 MB)
Vendor: SEAGATE Model: ST360057SSUN600G Rev: 0A25
Type: Direct-Access ANSI SCSI revision: 05
scsi 7:0:26:0: SSP: handle(0x0021), sas_addr(0x5000c500399ce791), phy(8), device_name(0x00c5005091e79c39)
scsi 7:0:26:0: SSP: enclosure_logical_id(0x5080020000b16e00), slot(9)
scsi 7:0:26:0: serial_number(001112E0L4P2 6SL0L4P2)
scsi 7:0:26:0: qdepth(254), tagged(1), simple(1), ordered(0), scsi_level(6), cmd_que(1)
SCSI device sdba: 1172123568 512-byte hdwr sectors (600127 MB)
sdba: Write Protect is off
sdba: Mode Sense: df 00 10 08
SCSI device sdba: drive cache: write through w/ FUA
SCSI device sdba: 1172123568 512-byte hdwr sectors (600127 MB)
sdba: Write Protect is off
sdba: Mode Sense: df 00 10 08
SCSI device sdba: drive cache: write through w/ FUA
sdba: sdba1 sdba2
sd 7:0:26:0: Attached scsi disk sdba
sdaz: Write Protect is off
sd 7:0:26:0: Attached scsi generic sg10 type 0
sdaz: Mode Sense: df 00 10 08
SCSI device sdaz: drive cache: write through w/ FUA
SCSI device sdaz: 1172123568 512-byte hdwr sectors (600127 MB)
sdaz: Write Protect is off
sdaz: Mode Sense: df 00 10 08
SCSI device sdaz: drive cache: write through w/ FUA
sdaz: sdaz1 sdaz2
sd 6:0:26:0: Attached scsi disk sdaz
sd 6:0:26:0: Attached scsi generic sg49 type 0
Test Case 3 - SSD (Solid State Disk) failure
Test description
An Oracle Database Appliance has four 73GB 3.5" SAS2 SSDs, used by ASM for the REDO disk group. This test shows what happens in case one shared storage SSD is damaged or lost for any reason.
Test result
After simulating a shared storage SSD failure, the ODA database/instance continues working and oakcli shows the failure.
Test Steps
- Startup system and database
- Verify all disks are online (v$asm_disk) and verify REDO disk group configuration
- Remove a solid state disk manually by pulling it out of the slot (from any slot in the top row of 4 disks)
- Verify that an alert is received
- Verify the disk is not available to ASM (v$asm_disk) and verify REDO disk group configuration
- Reinsert the SSD into its slot
- Verify all disks are online (v$asm_disk) and verify REDO disk group configuration
Details of above steps:
1. Startup system and database
2. Verify the initial shared storage disk status
Verify all disks are online (v$asm_disk) and verify the REDO disk group configuration by issuing queries on the ASM instance:
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
ORDER BY group_number, disk_number;
1 0 HDD_E0_S00_967034331P1 NORMAL ONLINE CACHED
1 1 HDD_E0_S01_965477095P1 NORMAL ONLINE CACHED
1 2 HDD_E1_S02_966582999P1 NORMAL ONLINE CACHED
1 3 HDD_E1_S03_966592943P1 NORMAL ONLINE CACHED
1 4 HDD_E0_S04_969051883P1 NORMAL ONLINE CACHED
1 5 HDD_E0_S05_966535155P1 NORMAL ONLINE CACHED
1 6 HDD_E1_S06_967038139P1 NORMAL ONLINE CACHED
1 7 HDD_E1_S07_966537131P1 NORMAL ONLINE CACHED
1 8 HDD_E0_S08_967043831P1 NORMAL ONLINE CACHED
1 9 HDD_E0_S09_966584211P1 NORMAL ONLINE CACHED
1 10 HDD_E1_S10_967036703P1 NORMAL ONLINE CACHED
1 11 HDD_E1_S11_966589399P1 NORMAL ONLINE CACHED
1 12 HDD_E0_S12_967036523P1 NORMAL ONLINE CACHED
1 13 HDD_E0_S13_966800467P1 NORMAL ONLINE CACHED
1 14 HDD_E1_S14_967038379P1 NORMAL ONLINE CACHED
1 15 HDD_E1_S15_967035195P1 NORMAL ONLINE CACHED
1 16 HDD_E0_S16_966617223P1 NORMAL ONLINE CACHED
1 17 HDD_E0_S17_966520995P1 NORMAL ONLINE CACHED
1 18 HDD_E1_S18_966584379P1 NORMAL ONLINE CACHED
1 19 HDD_E1_S19_966573799P1 NORMAL ONLINE CACHED
2 0 HDD_E0_S00_967034331P2 NORMAL ONLINE CACHED
2 1 HDD_E0_S01_965477095P2 NORMAL ONLINE CACHED
2 2 HDD_E1_S02_966582999P2 NORMAL ONLINE CACHED
2 3 HDD_E1_S03_966592943P2 NORMAL ONLINE CACHED
2 4 HDD_E0_S04_969051883P2 NORMAL ONLINE CACHED
2 5 HDD_E0_S05_966535155P2 NORMAL ONLINE CACHED
2 6 HDD_E1_S06_967038139P2 NORMAL ONLINE CACHED
2 7 HDD_E1_S07_966537131P2 NORMAL ONLINE CACHED
2 8 HDD_E0_S08_967043831P2 NORMAL ONLINE CACHED
2 9 HDD_E0_S09_966584211P2 NORMAL ONLINE CACHED
2 10 HDD_E1_S10_967036703P2 NORMAL ONLINE CACHED
2 11 HDD_E1_S11_966589399P2 NORMAL ONLINE CACHED
2 12 HDD_E0_S12_967036523P2 NORMAL ONLINE CACHED
2 13 HDD_E0_S13_966800467P2 NORMAL ONLINE CACHED
2 14 HDD_E1_S14_967038379P2 NORMAL ONLINE CACHED
2 15 HDD_E1_S15_967035195P2 NORMAL ONLINE CACHED
2 16 HDD_E0_S16_966617223P2 NORMAL ONLINE CACHED
2 17 HDD_E0_S17_966520995P2 NORMAL ONLINE CACHED
2 18 HDD_E1_S18_966584379P2 NORMAL ONLINE CACHED
2 19 HDD_E1_S19_966573799P2 NORMAL ONLINE CACHED
3 20 SSD_E0_S20_805607370P1 NORMAL ONLINE CACHED
3 21 SSD_E0_S21_805607443P1 NORMAL ONLINE CACHED
3 22 SSD_E1_S22_805607458P1 NORMAL ONLINE CACHED
3 23 SSD_E1_S23_805607433P1 NORMAL ONLINE CACHED
44 rows selected.
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP
WHERE name='REDO';
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
oakcli shows no FAILED disk:
# oakcli show disk | grep FAILED
#
3. Remove a shared storage SSD
Remove a shared storage SSD manually by pulling it out of the slot (from any slot in the top row)
oakcli shows the disk is now removed:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD FAILED DiskRemoved
4. Verify an alert is received
In the ASM alert.log you see the I/O error due to the missing disk:
2012-02-21 18:36:06.118000 +02:00
SUCCESS: alter diskgroup /*+ _OAK_AsmCookie */ REDO offline disk 'SSD_E1_S23_805607433p1'
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_28122.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:0 disk:40 AU:0 offset:0 size:4096
2012-02-21 18:36:42.493000 +02:00
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28525] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28525] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28667] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28667] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28733] opening OCR file
2012-02-21 18:36:43.517000 +02:00
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28733] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28837] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28847] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28837] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28847] opening OCR file
2012-02-21 18:36:45.043000 +02:00
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28932] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28932] opening OCR file
2012-02-21 18:37:25.992000 +02:00
WARNING: Disk (SSD_E1_S23_805607433P1) will be dropped in: (12960) secs on ASM inst: (1)
2012-02-21 18:38:12.255000 +02:00
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_29564.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:0 disk:40 AU:0 offset:0 size:4096
2012-02-21 18:38:59.040000 +02:00
WARNING: Disk (SSD_E1_S23_805607433P1) will be dropped in: (12867) secs on ASM inst: (1)
2012-02-21 18:40:29.089000 +02:00
WARNING: Disk (SSD_E1_S23_805607433P1) will be dropped in: (12777) secs on ASM inst: (1)
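The number of seconds in the drop warning is governed by the disk group's disk_repair_time attribute, which you can inspect on the ASM instance (a minimal sketch):
SELECT g.name AS diskgroup, a.name AS attribute, a.value
FROM v$asm_attribute a, v$asm_diskgroup g
WHERE a.group_number = g.group_number
AND a.name = 'disk_repair_time';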
5. Verify the removed SSD disk is not available to ASM
Verify the disk is not available to ASM (v$asm_disk) and verify the REDO disk group configuration:
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
WHERE mode_status='OFFLINE'
ORDER BY group_number, disk_number;
3 23 SSD_E1_S23_805607433P1 NORMAL OFFLINE MISSING
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP;
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
REDO 512 MOUNTED HIGH 210,012 181,832 60,610
6. Reinsert the SSD into its slot
oakcli shows the disk is now available as ONLINE Good:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
7. Verify all disks are online
Verify all disks are online (v$asm_disk) and verify the REDO disk group configuration:
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
ORDER BY group_number, disk_number;
1 0 HDD_E0_S00_967034331P1 NORMAL ONLINE CACHED
1 1 HDD_E0_S01_965477095P1 NORMAL ONLINE CACHED
1 2 HDD_E1_S02_966582999P1 NORMAL ONLINE CACHED
1 3 HDD_E1_S03_966592943P1 NORMAL ONLINE CACHED
1 4 HDD_E0_S04_969051883P1 NORMAL ONLINE CACHED
1 5 HDD_E0_S05_966535155P1 NORMAL ONLINE CACHED
1 6 HDD_E1_S06_967038139P1 NORMAL ONLINE CACHED
1 7 HDD_E1_S07_966537131P1 NORMAL ONLINE CACHED
1 8 HDD_E0_S08_967043831P1 NORMAL ONLINE CACHED
1 9 HDD_E0_S09_966584211P1 NORMAL ONLINE CACHED
1 10 HDD_E1_S10_967036703P1 NORMAL ONLINE CACHED
1 11 HDD_E1_S11_966589399P1 NORMAL ONLINE CACHED
1 12 HDD_E0_S12_967036523P1 NORMAL ONLINE CACHED
1 13 HDD_E0_S13_966800467P1 NORMAL ONLINE CACHED
1 14 HDD_E1_S14_967038379P1 NORMAL ONLINE CACHED
1 15 HDD_E1_S15_967035195P1 NORMAL ONLINE CACHED
1 16 HDD_E0_S16_966617223P1 NORMAL ONLINE CACHED
1 17 HDD_E0_S17_966520995P1 NORMAL ONLINE CACHED
1 18 HDD_E1_S18_966584379P1 NORMAL ONLINE CACHED
1 19 HDD_E1_S19_966573799P1 NORMAL ONLINE CACHED
2 0 HDD_E0_S00_967034331P2 NORMAL ONLINE CACHED
2 1 HDD_E0_S01_965477095P2 NORMAL ONLINE CACHED
2 2 HDD_E1_S02_966582999P2 NORMAL ONLINE CACHED
2 3 HDD_E1_S03_966592943P2 NORMAL ONLINE CACHED
2 4 HDD_E0_S04_969051883P2 NORMAL ONLINE CACHED
2 5 HDD_E0_S05_966535155P2 NORMAL ONLINE CACHED
2 6 HDD_E1_S06_967038139P2 NORMAL ONLINE CACHED
2 7 HDD_E1_S07_966537131P2 NORMAL ONLINE CACHED
2 8 HDD_E0_S08_967043831P2 NORMAL ONLINE CACHED
2 9 HDD_E0_S09_966584211P2 NORMAL ONLINE CACHED
2 10 HDD_E1_S10_967036703P2 NORMAL ONLINE CACHED
2 11 HDD_E1_S11_966589399P2 NORMAL ONLINE CACHED
2 12 HDD_E0_S12_967036523P2 NORMAL ONLINE CACHED
2 13 HDD_E0_S13_966800467P2 NORMAL ONLINE CACHED
2 14 HDD_E1_S14_967038379P2 NORMAL ONLINE CACHED
2 15 HDD_E1_S15_967035195P2 NORMAL ONLINE CACHED
2 16 HDD_E0_S16_966617223P2 NORMAL ONLINE CACHED
2 17 HDD_E0_S17_966520995P2 NORMAL ONLINE CACHED
2 18 HDD_E1_S18_966584379P2 NORMAL ONLINE CACHED
2 19 HDD_E1_S19_966573799P2 NORMAL ONLINE CACHED
3 20 SSD_E0_S20_805607370P1 NORMAL ONLINE CACHED
3 21 SSD_E0_S21_805607443P1 NORMAL ONLINE CACHED
3 22 SSD_E1_S22_805607458P1 NORMAL ONLINE CACHED
3 23 SSD_E1_S23_805607433P1 NORMAL ONLINE CACHED
44 rows selected.
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP
WHERE name='REDO';
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
Test Case 4 - Connectivity to Database
1. Local connection - Verify connectivity to an instance
Test description
Check if you can connect to your instance locally on the node
Test result
Locally on an Oracle Database Appliance node you can connect to your instance by setting the ORACLE_HOME and ORACLE_SID environment variables.
Test Steps
In this example we can connect (the expected result) and we create a user (test) that is used in the other connection tests below:
su - oracle
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
export ORACLE_SID=ODAMIG1
export PATH=$PATH:$ORACLE_HOME/bin
sqlplus / as sysdba
SQL> create user test identified by test;
SQL> alter user test account unlock;
SQL> grant resource,connect to test;
SQL> grant select on v_$instance to test;
sqlplus test/test
SQL>
2. External connection - Connect from an application
Test description
Check if you can connect to your instance from a remote client
Test result
You can connect to an Oracle Database Appliance database from a remote client using an appropriate connect string.
Test Steps
Define an appropriate connection entry in your client tnsnames.ora and connect using the previously created user. Note that we are using the SCAN listener (HOST = rc-voda1-scan):
ODAMIG =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = rc-voda1-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = ODAMIG)
)
)
$ sqlplus test/test@ODAMIG
SQL>
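Alternatively, the same connection can be made without a tnsnames.ora entry using EZConnect syntax (a sketch, assuming EZCONNECT is enabled in the client's NAMES.DIRECTORY_PATH):
$ sqlplus test/test@//rc-voda1-scan:1521/ODAMIG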
3. Connect using services - Connect using services and test load balancing
Test description
You are connecting to an Oracle Database Appliance database using a service and verifying load balancing.
Test result
Connect to a database on the Oracle Database Appliance from a remote client using an appropriate connection string. The second connection will be on the second node.
Test Steps
1. Define a service which is running on both nodes:
$ srvctl add service -d ODAMIG -s oltp -r "ODAMIG1,ODAMIG2" -P BASIC -e select
$ srvctl start service -s OLTP -d ODAMIG
$ srvctl status service -d ODAMIG -s OLTP
Service oltp is running on instance(s) ODAMIG1,ODAMIG2
2. Define an appropriate connection entry in your client tnsnames.ora
OLTP =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = rc-voda1-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = OLTP)
)
)
3. Connect with the service name from your remote client
sqlplus test/test@oltp
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG1
4. A second connection from a remote client will connect to the second RAC instance running on the other ODA node:
sqlplus test/test@oltp
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG2
Test Case 5 - Connection failover and continued service availability
Test description
Test node failures and continued database service availability.
Test result
You are connected to the instance on one node; if this node crashes for any reason, repeating the query will automatically connect you to the instance on the surviving node.
Test Steps
1. You are connected to the instance on node 2:
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG2
2. Simulate a database instance failure (instance 'ODAMIG2') by executing a shutdown abort of instance 2 on node 2 from another client:
SQL> shutdown abort;
3. On the client side, run the above query again:
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG1
SQL>
Note:
When instance 'ODAMIG2' crashes, clients are reconnected to instance 'ODAMIG1'.
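On the server side you can confirm where the service is now running; a sketch using the oltp service created in Test Case 4 (expected output while node 2 is down):
$ srvctl status service -d ODAMIG -s oltp
Service oltp is running on instance(s) ODAMIG1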
Test Case 6 - Private network failure
Test description
The nodes in an Oracle Database Appliance are connected through two internal 1GbE connections. This test shows what happens if the interconnect breaks.
Test result
As the interconnect in an Oracle Database Appliance is redundant, if one interface is affected by any issue the other one still works and no side effects are observed by Oracle Clusterware and the database(s). If both interconnect interfaces (eth0, eth1) fail, a node is evicted (the expected result). When the connectivity on the interconnect is restored, the evicted node rejoins the cluster.
Test Steps
1. The initial default status is that both private NICs are up
Using the 'ifconfig' OS command you can check the initial eth networking interfaces (all are up):
# ifconfig
bond0 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.12 Bcast:10.245.55.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fed7:6748/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:168068528 errors:0 dropped:0 overruns:0 frame:0
TX packets:73908628 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:172142891441 (160.3 GiB) TX bytes:25356106080 (23.6 GiB)
bond0:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.28 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:2 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.56 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:3 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.57 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond1 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
inet6 addr: fe80::21b:21ff:feae:fc49/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:54213591 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:4693019668 (4.3 GiB) TX bytes:936 (936.0 b)
bond2 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
inet6 addr: fe80::21b:21ff:feae:fc4b/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:41836342 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11374469913 (10.5 GiB) TX bytes:936 (936.0 b)
eth0 Link encap:Ethernet HWaddr 00:21:28:D7:67:4C
inet addr:192.168.16.24 Bcast:192.168.16.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed7:674c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:18313594 errors:0 dropped:0 overruns:0 frame:0
TX packets:17582725 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:21373231223 (19.9 GiB) TX bytes:20417204203 (19.0 GiB)
Memory:dee80000-deea0000
eth0:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4C
inet addr:169.254.112.206 Bcast:169.254.127.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:dee80000-deea0000
eth1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4D
inet addr:192.168.17.24 Bcast:192.168.17.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed7:674d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:16744784 errors:0 dropped:0 overruns:0 frame:0
TX packets:15887756 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:18212979386 (16.9 GiB) TX bytes:12045330895 (11.2 GiB)
Memory:deee0000-def00000
eth1:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4D
inet addr:169.254.240.172 Bcast:169.254.255.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:deee0000-def00000
eth2 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:167991955 errors:0 dropped:0 overruns:0 frame:0
TX packets:73908628 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:172138293677 (160.3 GiB) TX bytes:25356106332 (23.6 GiB)
Memory:def60000-def80000
eth3 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:76573 errors:0 dropped:0 overruns:0 frame:0
TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4597764 (4.3 MiB) TX bytes:88 (88.0 b)
Memory:defe0000-df000000
eth4 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27112876 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2349160551 (2.1 GiB) TX bytes:936 (936.0 b)
Memory:df1a0000-df1c0000
eth5 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27100715 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2343859117 (2.1 GiB) TX bytes:0 (0.0 b)
Memory:df1e0000-df200000
eth6 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20920394 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5688489549 (5.2 GiB) TX bytes:936 (936.0 b)
Memory:df2a0000-df2c0000
eth7 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20915948 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5685980364 (5.2 GiB) TX bytes:0 (0.0 b)
Memory:df2e0000-df300000
eth8 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6740448 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:411876392 (392.7 MiB) TX bytes:936 (936.0 b)
eth9 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6727602 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:411062894 (392.0 MiB) TX bytes:0 (0.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:39670610 errors:0 dropped:0 overruns:0 frame:0
TX packets:39670610 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:78672574464 (73.2 GiB) TX bytes:78672574464 (73.2 GiB)
xbond0 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
inet6 addr: fe80::21b:21ff:feb6:ae4/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:13468050 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:822939286 (784.8 MiB) TX bytes:936 (936.0 b)
2. Simulate a failure of eth0
With the 'ifdown' OS command you can switch off a network interface:
# ifdown eth0
The ifconfig output no longer shows eth0:
# ifconfig
bond0 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.12 Bcast:10.245.55.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fed7:6748/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:168070270 errors:0 dropped:0 overruns:0 frame:0
TX packets:73910433 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:172143383625 (160.3 GiB) TX bytes:25357203979 (23.6 GiB)
bond0:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.28 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:2 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.56 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:3 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.57 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond1 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
inet6 addr: fe80::21b:21ff:feae:fc49/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:54217767 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:4693393560 (4.3 GiB) TX bytes:936 (936.0 b)
bond2 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
inet6 addr: fe80::21b:21ff:feae:fc4b/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:41838642 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11375143337 (10.5 GiB) TX bytes:936 (936.0 b)
eth1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4D
inet addr:192.168.17.24 Bcast:192.168.17.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed7:674d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:16746499 errors:0 dropped:0 overruns:0 frame:0
TX packets:15889403 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:18215302041 (16.9 GiB) TX bytes:12046528425 (11.2 GiB)
Memory:deee0000-def00000
eth1:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4D
inet addr:169.254.240.172 Bcast:169.254.255.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:deee0000-def00000
eth2 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:167993697 errors:0 dropped:0 overruns:0 frame:0
TX packets:73910433 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:172138785861 (160.3 GiB) TX bytes:25357204231 (23.6 GiB)
Memory:def60000-def80000
eth3 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:76573 errors:0 dropped:0 overruns:0 frame:0
TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4597764 (4.3 MiB) TX bytes:88 (88.0 b)
Memory:defe0000-df000000
eth4 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27114964 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2349347497 (2.1 GiB) TX bytes:936 (936.0 b)
Memory:df1a0000-df1c0000
eth5 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27102803 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2344046063 (2.1 GiB) TX bytes:0 (0.0 b)
Memory:df1e0000-df200000
eth6 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20921544 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5688826261 (5.2 GiB) TX bytes:936 (936.0 b)
Memory:df2a0000-df2c0000
eth7 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20917098 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5686317076 (5.2 GiB) TX bytes:0 (0.0 b)
Memory:df2e0000-df300000
eth8 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6740753 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:411894734 (392.8 MiB) TX bytes:936 (936.0 b)
eth9 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6727906 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:411081176 (392.0 MiB) TX bytes:0 (0.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:39673653 errors:0 dropped:0 overruns:0 frame:0
TX packets:39673653 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:78674031488 (73.2 GiB) TX bytes:78674031488 (73.2 GiB)
xbond0 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
inet6 addr: fe80::21b:21ff:feb6:ae4/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:13468659 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:822975910 (784.8 MiB) TX bytes:936 (936.0 b)
Result Note 1: System status is normal; no failure is seen at the database or clusterware level.
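You can also confirm which interfaces Clusterware uses for the private interconnect with oifcfg (a sketch; eth0 and eth1 should be listed with the cluster_interconnect role):
$ /u01/app/11.2.0/grid/bin/oifcfg getif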
3. Simulate a failure of both private NICs at the same time
Initial status: both private NICs on box1 are up (bring eth0 back up after the previous test):
# ifup eth0
Simulate a failure of private NICs eth0 and eth1 on box1:
# ifdown eth0
# ifdown eth1
Result Note 2:
In the clusterware alert.log you see the heartbeat reported missing after about 15 seconds and, roughly 30 seconds later, node 2 is evicted as expected (database/ASM instances are down, CRS/CSS is down, OHASD is up).
As expected, the instance on node 1 resumes service once the reconfiguration is completed.
In the Clusterware alert.log (/log/<nodename>/alert.log) you can see the node eviction and the cluster reconfiguration:
2012-01-05 06:46:12.833
[cssd(12247)]CRS-1612:Network communication with node slcac457 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.300 seconds
2012-01-05 06:46:19.849
[cssd(12247)]CRS-1611:Network communication with node slcac457 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.280 seconds
2012-01-05 06:46:24.859
[cssd(12247)]CRS-1610:Network communication with node slcac457 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.270 seconds
2012-01-05 06:46:29.132
[cssd(12247)]CRS-1623:The IPMI node kill information of BMC at IP address 10.131.228.195 could not be validated due to invalid authorization information. The BMC username provided is 'root'; details at (:CSSNK00004:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 06:46:29.132
[cssd(12247)]CRS-1620:The node kill information of node slcac456 could not be validated by this node due to invalid authorization information; details at (:CSSNM00003:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 06:46:31.129
[cssd(12247)]CRS-1623:The IPMI node kill information of BMC at IP address 10.131.228.196 could not be validated due to invalid authorization information. The BMC username provided is 'root'; details at (:CSSNK00004:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 06:46:31.129
[cssd(12247)]CRS-1620:The node kill information of node slcac457 could not be validated by this node due to invalid authorization information; details at (:CSSNM00003:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 06:46:31.129
[cssd(12247)]CRS-1607:Node slcac457 is being evicted in cluster incarnation 219716296; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log.
2012-01-05 06:46:34.138
[cssd(12247)]CRS-1625:Node slcac457, number 2, was manually shut down
2012-01-05 06:46:34.144
[cssd(12247)]CRS-1601:CSSD Reconfiguration complete. Active nodes are slcac456 .
2012-01-05 06:46:34.150
[crsd(13460)]CRS-5504:Node down event reported for node 'slcac457'.
2012-01-05 06:46:34.153
[ctssd(12915)]CRS-2407:The new Cluster Time Synchronization Service reference node is host slcac456.
2012-01-05 06:46:50.242
[crsd(13460)]CRS-2773:Server 'slcac457' has been removed from pool 'ora.ODAMIG_oltp'.
2012-01-05 06:46:50.243
[crsd(13460)]CRS-2773:Server 'slcac457' has been removed from pool 'Generic'.
2012-01-05 06:46:50.243
[crsd(13460)]CRS-2773:Server 'slcac457' has been removed from pool 'ora.ODAMIG'.
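The countdown in the CRS-1612/1611/1610 messages above is driven by the CSS misscount setting (30 seconds by default on Linux); you can confirm the value on your own cluster with:
# crsctl get css misscount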
The instance is still running on node 1:
[root@slcac456]# ps -ef | grep smon
oracle 9120 1 0 Jan04 ? 00:00:01 ora_smon_ORAMIG1
root 12193 1 0 2011 ? 01:09:16 /u01/app/11.2.0/grid/bin/osysmond.bin
grid 13374 1 0 2011 ? 00:00:00 asm_smon_+ASM1
root 30439 24554 0 07:00 pts/4 00:00:00 grep smon
On node 2 the instance is no longer running (node evicted):
[root@rc-voda2]# ps -ef | grep smon
root 12148 1 1 2011 ? 04:00:08 /u01/app/11.2.0/grid/bin/osysmond.bin
root 28904 24208 0 06:58 pts/0 00:00:00 grep smon
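On the evicted node you can also confirm that OHASD is still up while the rest of the stack is down (a minimal sketch):
# crsctl check has    (Oracle High Availability Services should report online)
# crsctl check crs    (CRS/CSS should report as not reachable while the node is out of the cluster)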
4. Bring the failed NICs (eth0/eth1) back up
# ifup eth0
# ifup eth1
Result Note 3:
The eth0/eth1 IPs and the HAIPs come back almost immediately; the ASM and database instances on node 2 are started automatically by the Clusterware, and the Clusterware stack itself is restarted on node 2. Validate via the log files, and by other means, that the node rejoins the cluster.
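Besides the alert.log, a sketch of quick checks that the node has rejoined:
# olsnodes -s          (both nodes should be reported as Active)
# crsctl stat res -t   (the ASM and database resources should be ONLINE on both nodes)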
From the Clusterware alert.log (/log/<nodename>/alert.log):
2012-01-05 07:03:54.239
[cssd(12247)]CRS-1623:The IPMI node kill information of BMC at IP address 10.131.228.195 could not be validated due to invalid authorization information. The BMC username provided is 'root'; details at (:CSSNK00004:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 07:03:54.239
[cssd(12247)]CRS-1620:The node kill information of node slcac456 could not be validated by this node due to invalid authorization information; details at (:CSSNM00003:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 07:03:56.243
[cssd(12247)]CRS-1623:The IPMI node kill information of BMC at IP address 10.131.228.196 could not be validated due to invalid authorization information. The BMC username provided is 'root'; details at (:CSSNK00004:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 07:03:56.243
[cssd(12247)]CRS-1620:The node kill information of node slcac457 could not be validated by this node due to invalid authorization information; details at (:CSSNM00003:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 07:03:56.294
[cssd(12247)]CRS-1601:CSSD Reconfiguration complete. Active nodes are slcac456 slcac457
From the Clusterware alert.log on node 2:
2012-01-05 06:59:11.724
[client(29071)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/slcac457/client/crsctl_oracle.log.
2012-01-05 07:02:50.055
[cssd(28166)]CRS-1601:CSSD Reconfiguration complete. Active nodes are slcac456 slcac457 .
2012-01-05 07:02:52.095
[ctssd(30648)]CRS-2403:The Cluster Time Synchronization Service on host slcac457 is in observer mode.
2012-01-05 07:02:52.445
[ctssd(30648)]CRS-2407:The new Cluster Time Synchronization Service reference node is host slcac456.
2012-01-05 07:02:52.447
[ctssd(30648)]CRS-2401:The Cluster Time Synchronization Service started on host slcac457.
2012-01-05 07:03:04.233
[ctssd(30648)]CRS-2412:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /u01/app/11.2.0/grid/log/slcac457/ctssd/octssd.log.
2012-01-05 07:03:04.233
[ctssd(30648)]CRS-2409:The clock on host slcac457 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2012-01-05 07:03:16.741
[crsd(31084)]CRS-1012:The OCR service started on node slcac457.
2012-01-05 07:03:16.773
[evmd(30670)]CRS-1401:EVMD started on node slcac457.
2012-01-05 07:03:18.719
[crsd(31084)]CRS-1201:CRSD started on node slcac457.
2012-01-05 07:03:20.922
[/u01/app/11.2.0/grid/bin/oraagent.bin(31415)]CRS-5011:Check of resource "ODAMIG" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0/grid/log/slcac457/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2012-01-05 07:03:20.962
[/u01/app/11.2.0/grid/bin/oraagent.bin(31398)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/slcac457/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-01-05 07:03:22.454
[/u01/app/11.2.0/grid/bin/oraagent.bin(31398)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/slcac457/agent/crsd/oraagent_grid/oraagent_grid.log"
[client(31529)]CRS-10001:05-Jan-12 07:03 ACFS-9139: Attempting recovery of offline mount point '/cloudfs'
[client(31541)]CRS-10001:05-Jan-12 07:03 ACFS-9111: Offline mount point '/cloudfs' was recovered.
[client(31629)]CRS-10001:GWS: name=RECO, vol=ACFSVOL, state=DISABLED
[client(31632)]CRS-10001:05-Jan-12 07:03 ACFS-9103: Enabling volume 'acfsvol' on diskgroup 'reco'.
[client(31650)]CRS-10001:05-Jan-12 07:03 ACFS-9257: Mounting device '/dev/asm/acfsvol-18' on mount point '/cloudfs'.
On node 2 the instances are running again:
[root@rc-voda2]# ps -ef | grep smon
root 1287 24208 0 07:06 pts/0 00:00:00 grep smon
root 12148 1 1 2011 ? 04:00:13 /u01/app/11.2.0/grid/bin/osysmond.bin
grid 30916 1 0 07:03 ? 00:00:00 asm_smon_+ASM2
oracle 32172 1 0 07:04 ? 00:00:00 ora_smon_ODAMIG2
Test Case 7 - Public Network Failure
Connect from a client (outside the ODA):
sqlplus test/test@ODAMIG
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG1
Shut down bond0 on node 1:
# ifdown bond0
Check the existing client connection:
SQL> /
INSTANCE_NAME
----------------
ODAMIG2
The existing client connection automatically failed over to the other instance.
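Transparent failover of an already-connected session's SELECT, as seen above, assumes the client connects through a TNS alias (or a service) with Transparent Application Failover configured. An illustrative tnsnames.ora entry (the host name is a placeholder, not taken from this note):
ODAMIG =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = oda-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = ODAMIG)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 3))
    )
  )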
Open a new client connection:
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG2
New client connections went to the other instance automatically.
Check the public network, VIPs, and services on the other node:
# ifconfig -a
bond0 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.13 Bcast:10.245.55.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fed6:143a/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:5476567 errors:0 dropped:0 overruns:0 frame:0
TX packets:4699960 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1361587908 (1.2 GiB) TX bytes:1068230705 (1018.7 MiB)
bond0:1 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.56 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:2 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.29 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:3 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.57 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:4 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.58 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:5 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.28 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond1 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B5
inet6 addr: fe80::21b:21ff:feae:fbb5/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:54340199 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:4704239924 (4.3 GiB) TX bytes:468 (468.0 b)
bond2 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B7
inet6 addr: fe80::21b:21ff:feae:fbb7/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:41931080 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11399604342 (10.6 GiB) TX bytes:468 (468.0 b)
eth0 Link encap:Ethernet HWaddr 00:21:28:D6:14:3E
inet addr:192.168.16.25 Bcast:192.168.16.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed6:143e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:17623438 errors:0 dropped:0 overruns:0 frame:0
TX packets:18354950 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:20465535412 (19.0 GiB) TX bytes:21418856097 (19.9 GiB)
Memory:dee80000-deea0000
eth0:1 Link encap:Ethernet HWaddr 00:21:28:D6:14:3E
inet addr:169.254.37.90 Bcast:169.254.127.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:dee80000-deea0000
eth1 Link encap:Ethernet HWaddr 00:21:28:D6:14:3F
inet addr:192.168.17.25 Bcast:192.168.17.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed6:143f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:15933937 errors:0 dropped:0 overruns:0 frame:0
TX packets:16791488 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12093725549 (11.2 GiB) TX bytes:18255805634 (17.0 GiB)
Memory:deee0000-def00000
eth1:1 Link encap:Ethernet HWaddr 00:21:28:D6:14:3F
inet addr:169.254.228.105 Bcast:169.254.255.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:deee0000-def00000
eth2 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:5399866 errors:0 dropped:0 overruns:0 frame:0
TX packets:4699963 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1356982464 (1.2 GiB) TX bytes:1068231207 (1018.7 MiB)
Memory:def60000-def80000
eth3 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:76701 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4605444 (4.3 MiB) TX bytes:0 (0.0 b)
Memory:defe0000-df000000
eth4 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B5
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27176189 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2354775972 (2.1 GiB) TX bytes:468 (468.0 b)
Memory:df1a0000-df1c0000
eth5 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B5
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27164010 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2349463952 (2.1 GiB) TX bytes:0 (0.0 b)
Memory:df1e0000-df200000
eth6 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B7
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20967761 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5701058754 (5.3 GiB) TX bytes:468 (468.0 b)
Memory:df2a0000-df2c0000
eth7 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B7
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20963319 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5698545588 (5.3 GiB) TX bytes:0 (0.0 b)
Memory:df2e0000-df300000
eth8 Link encap:Ethernet HWaddr 00:1B:21:B6:0C:DC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6756411 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:412849763 (393.7 MiB) TX bytes:468 (468.0 b)
eth9 Link encap:Ethernet HWaddr 00:1B:21:B6:0C:DC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6743542 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:412035116 (392.9 MiB) TX bytes:0 (0.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:9263136 errors:0 dropped:0 overruns:0 frame:0
TX packets:9263136 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3692025488 (3.4 GiB) TX bytes:3692025488 (3.4 GiB)
sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
xbond0 Link encap:Ethernet HWaddr 00:1B:21:B6:0C:DC
inet6 addr: fe80::21b:21ff:feb6:cdc/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:13499953 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:824884879 (786.6 MiB) TX bytes:468 (468.0 b)
[root@rc-voda2]# crsctl stat res -t
(...)
ora.rc-voda1.vip
1 ONLINE INTERMEDIATE rc-voda2 FAILED OVER
ora.rc-voda2.vip
1 ONLINE ONLINE rc-voda2
Expected Result:
The failure of both public network NICs made the public network inaccessible on this node.
The GI stack and the database on this node were still running.
The node VIP of this node failed over to the other node quickly.
The SCAN VIP and SCAN listener were all running on the other node.
The existing client connections failed over to the other node.
New client connections went to the other instance automatically.
When the public network recovered, the node VIP for this node failed back automatically, and the database service was brought back online on this node automatically.
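A sketch of the recovery step and a quick check that the VIP fails back (the resource name is taken from the crsctl output above):
# ifup bond0
# crsctl stat res ora.rc-voda1.vip -t    (the VIP should return to ONLINE on its home node)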
Test Case 8 - Database backup and recovery test
Test description
A key operational aspect of deploying the ODA is ensuring that database backups are performed, so that the Oracle databases residing on the ODA can be restored if disaster strikes. This test is only the simplest backup-and-restore exercise you can run with RMAN, and it uses the internal disks as the backup destination; in production, use your preferred backup strategy and destination instead.
Test result
After the restore steps, the database is up and running.
1. Back up the database using RMAN.
- Verify the database is in archive log mode
SQL> archive log list
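If the output shows noarchivelog mode, a minimal sketch to enable it (assumes the whole database can be taken down briefly; adapt to your maintenance window):
$ srvctl stop database -d ODAMIG
$ sqlplus "/ as sysdba"
SQL> startup mount;
SQL> alter database archivelog;
SQL> shutdown immediate;
SQL> exit
$ srvctl start database -d ODAMIG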
- Create a directory to store the backup set
# mkdir -p /u01/bakDB
# chown -R oracle:oinstall /u01/bakDB
# chmod 755 /u01/bakDB
- Configure the controlfile autobackup
$ rman nocatalog target /
RMAN> show all;
RMAN> CONFIGURE CONTROLFILE AUTOBACKUP ON;
- Increase the db_recovery_file_dest_size as necessary (e.g., 10G)
- Create or update a database object that can be validated after recovery (e.g. create a table or insert data into a table)
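A sketch of these two preparation steps (the 10G size is only an example; the test1 table is the object validated in step 3 below):
$ sqlplus "/ as sysdba"
SQL> alter system set db_recovery_file_dest_size=10G;
SQL> exit
$ sqlplus scott/tiger
SQL> create table test1 as select * from all_objects;
SQL> exit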
- Backup the full database and archive log to the above directory
RMAN> backup database plus archivelog format '/u01/bakDB/db_%U';
- Verify that the backup set has been generated
# ls -l /u01/bakDB
-rwxrwxr-x 1 oracle asmadmin 123944960 Aug 25 02:06 db_0gmkre9d_1_1
-rwxrwxr-x 1 oracle asmadmin 69632 Aug 25 02:06 db_0imkrea0_1_1
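The backup can also be cross-checked from within RMAN:
RMAN> list backup summary;
RMAN> crosscheck backup;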
2. Recover the database (optional) - simulate a database loss and perform a database recovery.
$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
$ export ORACLE_SID=ODAMIG1
$ export PATH=$ORACLE_HOME/bin:$PATH
$ srvctl stop database -d ODAMIG
$ sqlplus "/ as sysdba"
SQL> startup nomount;
SQL> exit
$ rman nocatalog target /
RMAN> restore controlfile from autobackup;
RMAN> sql 'alter database mount';
RMAN> restore database;
RMAN> recover database;
RMAN> sql 'alter database open resetlogs';
RMAN> exit
3. Verify database recovery (validate using the database object created or updated previously).
$ sqlplus scott/tiger
SQL> select count(*) from test1;
COUNT(*)
----------
5375
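Since the recovery above was performed on a single instance, finish by restarting the database under Clusterware control so that both instances come back up (a sketch):
$ srvctl stop database -d ODAMIG
$ srvctl start database -d ODAMIG
$ srvctl status database -d ODAMIG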
References
<NOTE:1391655.1> - ODA (Oracle Database Appliance): Simulated Failure tests
<NOTE:810394.1> - RAC and Oracle Clusterware Best Practices and Starter Kit (Platform Independent)