ODA (Oracle Database Appliance): Test Plan Outline (Doc ID 1474273.1)
Update Date: 2017-10-15
Related Items:
- Oracle Database Appliance Software
- Oracle Database Appliance
Applies to:
Oracle Database Appliance Software - Version 2.1.0.1 and later
Oracle Database Appliance - Version All Versions and later
Information in this document applies to any platform.
***Checked for relevance on 23-Jun-2014***
Purpose
Before a new computer/cluster system is deployed in production, it is important to test the system thoroughly to validate that it will perform at a satisfactory level relative to its service level objectives. Testing is also required when introducing major or minor changes to the system. This document provides an outline consisting of basic guidelines and recommendations for how to test a new ODA (Oracle Database Appliance) system. It can be used as a framework for building a system test plan specific to each company's ODA implementation and the associated service level objectives.
Scope
This document provides an outline of basic testing guidelines, in the form of an organized test plan, that can be used to validate core component functionality of an ODA system. Every application exercises the underlying software and hardware infrastructure differently, and must be tested as part of a component testing strategy. Each new system must be tested thoroughly, in an environment that is a realistic representation of the production environment in terms of configuration, capacity, and workload, prior to going live or after implementing significant architectural/system modifications. Without a completed system implementation and functional end-user applications, only core component testing is possible, verifying cluster, RDBMS, and sub-component behaviors for the networking and I/O subsystems and miscellaneous database administrative functions.
In addition to the specific system testing outlined in this document, additional testing needs to be defined and executed for RMAN, backup and recovery, and Data Guard (for disaster recovery). Each component area of testing also requires specific operational procedures to be documented and maintained to address site-specific requirements.
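For example, a minimal RMAN validation check might look like the following (a sketch only, assuming an already configured RMAN environment: BACKUP VALIDATE reads the datafiles and checks for corruption without writing a backup, and RESTORE ... VALIDATE verifies that existing backups are usable for a restore; it does not replace a full, site-specific backup and recovery test plan):
$ rman target /
RMAN> BACKUP VALIDATE DATABASE;
RMAN> RESTORE DATABASE VALIDATE;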
Details
Test Case 1 - Simulate failures to Internal (OS) disks
Test description
Each ODA node has two Seagate 500GB serial ATA hard drives that are used for the operating system. This test shows what happens if one of these disks is damaged or lost for any reason.
Test result
After simulating a corruption or failure of one of the two internal OS disks, the ODA is still working and oakcli shows the failure.
Test Steps
1. Check your initial disk configuration
You can use the OS command mdadm (manage MD devices, aka Linux Software RAID); see the man page for further details (man mdadm):
# mdadm --detail /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Thu Dec 8 12:25:33 2011
Raid Level : raid1
Array Size : 104320 (101.89 MiB 106.82 MB)
Used Dev Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Tue Jan 3 04:02:39 2012
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 1751c3b7:1a4d91b5:3cba0f44:a85d2398
Events : 0.14
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
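As a quick complement to mdadm --detail, /proc/mdstat shows the state of all software RAID arrays at once (illustrative output for the md0 array above; [UU] means both mirror members are active):
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]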
2. Check the initial disk status with oakcli
Issuing "oakcli validate -c OSDiskStorage" you can validate the operating system disks and file system information:
# /opt/oracle/oak/bin/oakcli validate -c OSDiskStorage
INFO: Checking Operating System Storage
SUCCESS: The OS disks have the boot stamp
RESULT: Raid device /dev/md0 found clean
RESULT: Raid device /dev/md1 found clean
RESULT: Physical Volume /dev/md1 in VolGroupSys has 270213.84M out of total 499994.59M
RESULT: Volumegroup VolGroupSys consist of 1 physical volumes,contains 4 logical volumes, has 0 volume snaps with total size of 499994.59M and free space of 270213.84M
RESULT: Logical Volume LogVolOpt in VolGroupSys Volume group is of size 60.00G
RESULT: Logical Volume LogVolRoot in VolGroupSys Volume group is of size 30.00G
RESULT: Logical Volume LogVolSwap in VolGroupSys Volume group is of size 24.00G
RESULT: Logical Volume LogVolU01 in VolGroupSys Volume group is of size 100.00G
RESULT: Device /dev/mapper/VolGroupSys-LogVolRoot is mounted on / of type ext3 in (rw)
RESULT: Device /dev/md0 is mounted on /boot of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolOpt is mounted on /opt of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolU01 is mounted on /u01 of type ext3 in (rw)
RESULT: / has 19344 MB free out of total 29758 MB
RESULT: /boot has 74 MB free out of total 99 MB
RESULT: /opt has 31944 MB free out of total 59516 MB
RESULT: /u01 has 62255 MB free out of total 99194 MB
3. Simulate a disk failure
You can use the OS dd (convert and copy a file) command (see man dd for further details); in this case we wipe the first 512 blocks of the disk with zeros:
# dd if=/dev/zero of=<device name> count=512
512+0 records in
512+0 records out
262144 bytes (262 kB) copied, 0.000519 seconds, 505 MB/s
Note: pay close attention to using the right device name; writing to the wrong device will destroy its data.
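If you prefer a simulation that does not overwrite any data, mdadm can mark an array member as failed at the software RAID layer (a hedged alternative sketch; device names are examples, and note this exercises md failover rather than the oakcli boot stamp check used below):
# mdadm --manage /dev/md0 --fail /dev/sdb1
# mdadm --manage /dev/md0 --remove /dev/sdb1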
4. OAK recognizes the failure
Issuing the oakcli validate command again, we see the failure is recognized:
# oakcli validate -c OSDiskStorage
INFO: Checking Operating System Storage
ERROR: OS disk sdb does not have right boot stamp
WARNING: Check MBR stamp on OS disk failed
RESULT: Raid device /dev/md0 found clean
RESULT: Raid device /dev/md1 found clean
RESULT: Physical Volume /dev/md1 in VolGroupSys has 270213.84M out of total 499994.59M
RESULT: Volumegroup VolGroupSys consist of 1 physical volumes,contains 4 logical volumes, has 0 volume snaps with total size of 499994.59M and free space of 270213.84M
RESULT: Logical Volume LogVolOpt in VolGroupSys Volume group is of size 60.00G
RESULT: Logical Volume LogVolRoot in VolGroupSys Volume group is of size 30.00G
RESULT: Logical Volume LogVolSwap in VolGroupSys Volume group is of size 24.00G
RESULT: Logical Volume LogVolU01 in VolGroupSys Volume group is of size 100.00G
RESULT: Device /dev/mapper/VolGroupSys-LogVolRoot is mounted on / of type ext3 in (rw)
RESULT: Device /dev/md0 is mounted on /boot of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolOpt is mounted on /opt of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolU01 is mounted on /u01 of type ext3 in (rw)
RESULT: / has 19344 MB free out of total 29758 MB
RESULT: /boot has 74 MB free out of total 99 MB
RESULT: /opt has 31942 MB free out of total 59516 MB
RESULT: /u01 has 62268 MB free out of total 99194 MB
The ODA server where we have simulated the internal HDD failure/corruption is still working. The mirror configuration (RAID 1) provides the capability to survive an OS disk failure.
5. Restore the disk
Let's suppose now that the failed disk has been restored. In this case we copy the "good" data from disk 1 (sda). Using the OS command 'dd' again, we copy the first 512 blocks (which carry the boot stamp) from the good disk sda to the failed disk sdb:
# dd if=/dev/sda of=/dev/sdb count=512
512+0 records in
512+0 records out
262144 bytes (262 kB) copied, 0.00072 seconds, 364 MB/s
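If during your test the md layer marked a member as failed (for example after the mdadm sketch above), the partition also needs to be re-added to the array so the kernel can resynchronize it; a sketch with assumed device names, watching the resync progress in /proc/mdstat:
# mdadm --manage /dev/md0 --re-add /dev/sdb1
# cat /proc/mdstat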
6. Check the disk status using OAK
As the failed disk has been restored, OAK recognizes its good status. Running 'oakcli validate -c OSDiskStorage' confirms that the failed disk has been restored, with the status "SUCCESS: The OS disks have the boot stamp":
# oakcli validate -c OSDiskStorage
INFO: Checking Operating System Storage
SUCCESS: The OS disks have the boot stamp
RESULT: Raid device /dev/md0 found clean
RESULT: Raid device /dev/md1 found clean
RESULT: Physical Volume /dev/md1 in VolGroupSys has 270213.84M out of total 499994.59M
RESULT: Volumegroup VolGroupSys consist of 1 physical volumes,contains 4 logical volumes, has 0 volume snaps with total size of 499994.59M and free space of 270213.84M
RESULT: Logical Volume LogVolOpt in VolGroupSys Volume group is of size 60.00G
RESULT: Logical Volume LogVolRoot in VolGroupSys Volume group is of size 30.00G
RESULT: Logical Volume LogVolSwap in VolGroupSys Volume group is of size 24.00G
RESULT: Logical Volume LogVolU01 in VolGroupSys Volume group is of size 100.00G
RESULT: Device /dev/mapper/VolGroupSys-LogVolRoot is mounted on / of type ext3 in (rw)
RESULT: Device /dev/md0 is mounted on /boot of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolOpt is mounted on /opt of type ext3 in (rw)
RESULT: Device /dev/mapper/VolGroupSys-LogVolU01 is mounted on /u01 of type ext3 in (rw)
RESULT: / has 19344 MB free out of total 29758 MB
RESULT: /boot has 74 MB free out of total 99 MB
RESULT: /opt has 31939 MB free out of total 59516 MB
RESULT: /u01 has 62266 MB free out of total 99194 MB
Test Case 2 - HDD (Hard Disk Drive) failure
Test description
An Oracle Database Appliance has twenty 600GB 3.5" SAS 15k RPM HDDs, used by ASM for the DATA and RECO disk groups. This test shows what happens in case one HDD of the shared storage is damaged or lost for any reason.
Test result
After simulating a shared storage HDD failure, the ODA database/instance continues working and oakcli shows the failure.
Test Steps
- Startup system and database
- Verify all disks are online (v$asm_disk) and verify DATA and RECO disk group configuration
- Remove a hard disk manually by pulling it out of the slot (from any slot except the top row of disks)
- Verify that an alert is received
- Verify the disk is not available to ASM (v$asm_disk) and verify DATA and RECO disk group configuration
- Reinsert the hard disk (same slot)
- Verify all disks are online (v$asm_disk) and verify DATA and RECO disk group configuration
Details of above steps:
1. Startup system and database
2. Verify initial shared disks status
Verify all disks are online (v$asm_disk) and verify the DATA and RECO disk group configuration by issuing queries on the ASM instance:
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
ORDER BY group_number, disk_number;
Example output:
1 0 HDD_E0_S00_967034331P1 NORMAL ONLINE CACHED
1 1 HDD_E0_S01_965477095P1 NORMAL ONLINE CACHED
1 2 HDD_E1_S02_966582999P1 NORMAL ONLINE CACHED
1 3 HDD_E1_S03_966592943P1 NORMAL ONLINE CACHED
1 4 HDD_E0_S04_969051883P1 NORMAL ONLINE CACHED
1 5 HDD_E0_S05_966535155P1 NORMAL ONLINE CACHED
1 6 HDD_E1_S06_967038139P1 NORMAL ONLINE CACHED
1 7 HDD_E1_S07_966537131P1 NORMAL ONLINE CACHED
1 8 HDD_E0_S08_967043831P1 NORMAL ONLINE CACHED
1 9 HDD_E0_S09_966584211P1 NORMAL ONLINE CACHED
1 10 HDD_E1_S10_967036703P1 NORMAL ONLINE CACHED
1 11 HDD_E1_S11_966589399P1 NORMAL ONLINE CACHED
1 12 HDD_E0_S12_967036523P1 NORMAL ONLINE CACHED
1 13 HDD_E0_S13_966800467P1 NORMAL ONLINE CACHED
1 14 HDD_E1_S14_967038379P1 NORMAL ONLINE CACHED
1 15 HDD_E1_S15_967035195P1 NORMAL ONLINE CACHED
1 16 HDD_E0_S16_966617223P1 NORMAL ONLINE CACHED
1 17 HDD_E0_S17_966520995P1 NORMAL ONLINE CACHED
1 18 HDD_E1_S18_966584379P1 NORMAL ONLINE CACHED
1 19 HDD_E1_S19_966573799P1 NORMAL ONLINE CACHED
2 0 HDD_E0_S00_967034331P2 NORMAL ONLINE CACHED
2 1 HDD_E0_S01_965477095P2 NORMAL ONLINE CACHED
2 2 HDD_E1_S02_966582999P2 NORMAL ONLINE CACHED
2 3 HDD_E1_S03_966592943P2 NORMAL ONLINE CACHED
2 4 HDD_E0_S04_969051883P2 NORMAL ONLINE CACHED
2 5 HDD_E0_S05_966535155P2 NORMAL ONLINE CACHED
2 6 HDD_E1_S06_967038139P2 NORMAL ONLINE CACHED
2 7 HDD_E1_S07_966537131P2 NORMAL ONLINE CACHED
2 8 HDD_E0_S08_967043831P2 NORMAL ONLINE CACHED
2 9 HDD_E0_S09_966584211P2 NORMAL ONLINE CACHED
2 10 HDD_E1_S10_967036703P2 NORMAL ONLINE CACHED
2 11 HDD_E1_S11_966589399P2 NORMAL ONLINE CACHED
2 12 HDD_E0_S12_967036523P2 NORMAL ONLINE CACHED
2 13 HDD_E0_S13_966800467P2 NORMAL ONLINE CACHED
2 14 HDD_E1_S14_967038379P2 NORMAL ONLINE CACHED
2 15 HDD_E1_S15_967035195P2 NORMAL ONLINE CACHED
2 16 HDD_E0_S16_966617223P2 NORMAL ONLINE CACHED
2 17 HDD_E0_S17_966520995P2 NORMAL ONLINE CACHED
2 18 HDD_E1_S18_966584379P2 NORMAL ONLINE CACHED
2 19 HDD_E1_S19_966573799P2 NORMAL ONLINE CACHED
3 20 SSD_E0_S20_805607370P1 NORMAL ONLINE CACHED
3 21 SSD_E0_S21_805607443P1 NORMAL ONLINE CACHED
3 22 SSD_E1_S22_805607458P1 NORMAL ONLINE CACHED
3 23 SSD_E1_S23_805607433P1 NORMAL ONLINE CACHED
44 rows selected.
From v$asm_disk you see all your disks as NORMAL, ONLINE, CACHED.
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP;
Example output:
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
DATA 512 MOUNTED HIGH 4,669,440 4,657,372 1,388,617
RECO 512 MOUNTED HIGH 6,204,640 5,967,132 1,771,337
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
From V$ASM_DISKGROUP you can see that your disk groups are mounted.
From the oakcli point of view:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
3. Remove a shared storage hard disk
Remove a hard disk manually by pulling it out of the slot (from any slot except the top row of disks).
oakcli shows the disk is now removed:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD FAILED DiskRemoved
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
4. Verify an alert is received
The I/O problem is documented in the ASM alert.log (you can inspect it using the adrci utility):
$ adrci
ADRCI: Release 11.2.0.2.0 - Production on Tue Feb 21 14:20:10 2012
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
ADR base = "/u01/app/grid"
adrci> show home
ADR Homes:
diag/asm/+asm/+ASM1
adrci> set home diag/asm/+asm/+ASM1
adrci> show alert -tail -f
Tue Feb 21 12:52:50 2012
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11832.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:2 disk:9 AU:0 offset:0 size:4096
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_11832.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:1 disk:9 AU:0 offset:0 size:4096
SQL> alter diskgroup /*+ _OAK_AsmCookie */ DATA offline disk 'HDD_E0_S09_966584211p1'
NOTE: DRTimer CD Create: for disk group 1 disks:
9
NOTE: process _user11832_+asm1 (11832) initiating offline of disk 9.3916349682 (HDD_E0_S09_966584211P1) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 9/0xe96ec0f2, mode = 0x6a, op = 4
(...)
Tue Feb 21 13:05:19 2012
WARNING: Disk (HDD_E0_S09_966584211P1) will be dropped in: (12213) secs on ASM inst: (1)
WARNING: Disk (HDD_E0_S09_966584211P2) will be dropped in: (12213) secs on ASM inst: (1)
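The countdown in the warning comes from ASM's disk repair timer. You can query the remaining time directly on the ASM instance (a minimal sketch; REPAIR_TIMER is reported in seconds):
col name format a25
SELECT name, mode_status, repair_timer
FROM v$asm_disk
WHERE mode_status='OFFLINE';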
The OS signals the I/O problem (dmesg is the OS command to print or control the kernel ring buffer):
# dmesg
mpt2sas1: removing handle(0x0021), sas_addr(0x5000c500399ce791)
mpt2sas0: removing handle(0x0012), sas_addr(0x5000c500399ce791)
scsi 7:0:21:0: rejecting I/O to dead device
device-mapper: multipath: Failing path 66:224.
scsi 7:0:21:0: rejecting I/O to dead device
scsi 6:0:8:0: rejecting I/O to dead device
device-mapper: multipath: Failing path 8:160.
scsi 6:0:8:0: rejecting I/O to dead device
5. Verify the removed disk is not available
Verify the disk is not available to ASM (v$asm_disk) and verify the DATA and RECO disk group configuration:
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP;
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
DATA 512 MOUNTED HIGH 4,669,440 4,657,372 1,388,617
RECO 512 MOUNTED HIGH 6,204,640 5,967,132 1,771,337
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
-----
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
WHERE mode_status='OFFLINE'
ORDER BY group_number, disk_number;
1 9 HDD_E0_S09_966584211P1 NORMAL OFFLINE MISSING
2 9 HDD_E0_S09_966584211P2 NORMAL OFFLINE MISSING
The instances are still running. Checking for the SMON processes, you can see they are running. You can also connect to your instance as usual.
$ ps -ef|grep smon
grid 6030 1 0 Jan30 ? 00:00:00 asm_smon_+ASM1
oracle 13946 1 0 Jan31 ? 00:01:19 ora_smon_simpledb_1
oracle 16169 1 0 Jan30 ? 00:01:48 ora_smon_orcl1
grid 25237 15019 0 13:13 pts/3 00:00:00 grep smon
root 30298 1 0 Jan30 ? 01:29:18 /u01/app/11.2.0/grid/bin/osysmond.bin
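You can also confirm from the clusterware side that all resources are still healthy (a sketch; run from the Grid Infrastructure home):
# /u01/app/11.2.0/grid/bin/crsctl check cluster
# /u01/app/11.2.0/grid/bin/crsctl stat res -t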
6. Reinsert the hard disk (same slot)
oakcli will show the disk as ONLINE Good:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
If at this stage the reinserted disk is not marked as Good by the above oakcli command, you can restart oakd by issuing:
# oakcli restart oak
7. Verify all disks are online
Verify all disks are online (v$asm_disk) and verify the DATA and RECO disk group configuration from the ASM perspective:
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP;
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
DATA 512 MOUNTED HIGH 4,915,200 4,902,532 1,470,337
RECO 512 MOUNTED HIGH 6,531,200 6,281,728 1,876,202
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
ORDER BY group_number, disk_number;
1 0 HDD_E0_S00_967034331P1 NORMAL ONLINE CACHED
1 1 HDD_E0_S01_965477095P1 NORMAL ONLINE CACHED
1 2 HDD_E1_S02_966582999P1 NORMAL ONLINE CACHED
1 3 HDD_E1_S03_966592943P1 NORMAL ONLINE CACHED
1 4 HDD_E0_S04_969051883P1 NORMAL ONLINE CACHED
1 5 HDD_E0_S05_966535155P1 NORMAL ONLINE CACHED
1 6 HDD_E1_S06_967038139P1 NORMAL ONLINE CACHED
1 7 HDD_E1_S07_966537131P1 NORMAL ONLINE CACHED
1 8 HDD_E0_S08_967043831P1 NORMAL ONLINE CACHED
1 9 HDD_E0_S09_966584211P1 NORMAL ONLINE CACHED
1 10 HDD_E1_S10_967036703P1 NORMAL ONLINE CACHED
1 11 HDD_E1_S11_966589399P1 NORMAL ONLINE CACHED
1 12 HDD_E0_S12_967036523P1 NORMAL ONLINE CACHED
1 13 HDD_E0_S13_966800467P1 NORMAL ONLINE CACHED
1 14 HDD_E1_S14_967038379P1 NORMAL ONLINE CACHED
1 15 HDD_E1_S15_967035195P1 NORMAL ONLINE CACHED
1 16 HDD_E0_S16_966617223P1 NORMAL ONLINE CACHED
1 17 HDD_E0_S17_966520995P1 NORMAL ONLINE CACHED
1 18 HDD_E1_S18_966584379P1 NORMAL ONLINE CACHED
1 19 HDD_E1_S19_966573799P1 NORMAL ONLINE CACHED
2 0 HDD_E0_S00_967034331P2 NORMAL ONLINE CACHED
2 1 HDD_E0_S01_965477095P2 NORMAL ONLINE CACHED
2 2 HDD_E1_S02_966582999P2 NORMAL ONLINE CACHED
2 3 HDD_E1_S03_966592943P2 NORMAL ONLINE CACHED
2 4 HDD_E0_S04_969051883P2 NORMAL ONLINE CACHED
2 5 HDD_E0_S05_966535155P2 NORMAL ONLINE CACHED
2 6 HDD_E1_S06_967038139P2 NORMAL ONLINE CACHED
2 7 HDD_E1_S07_966537131P2 NORMAL ONLINE CACHED
2 8 HDD_E0_S08_967043831P2 NORMAL ONLINE CACHED
2 9 HDD_E0_S09_966584211P2 NORMAL ONLINE CACHED
2 10 HDD_E1_S10_967036703P2 NORMAL ONLINE CACHED
2 11 HDD_E1_S11_966589399P2 NORMAL ONLINE CACHED
2 12 HDD_E0_S12_967036523P2 NORMAL ONLINE CACHED
2 13 HDD_E0_S13_966800467P2 NORMAL ONLINE CACHED
2 14 HDD_E1_S14_967038379P2 NORMAL ONLINE CACHED
2 15 HDD_E1_S15_967035195P2 NORMAL ONLINE CACHED
2 16 HDD_E0_S16_966617223P2 NORMAL ONLINE CACHED
2 17 HDD_E0_S17_966520995P2 NORMAL ONLINE CACHED
2 18 HDD_E1_S18_966584379P2 NORMAL ONLINE CACHED
2 19 HDD_E1_S19_966573799P2 NORMAL ONLINE CACHED
3 20 SSD_E0_S20_805607370P1 NORMAL ONLINE CACHED
3 21 SSD_E0_S21_805607443P1 NORMAL ONLINE CACHED
3 22 SSD_E1_S22_805607458P1 NORMAL ONLINE CACHED
3 23 SSD_E1_S23_805607433P1 NORMAL ONLINE CACHED
44 rows selected.
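While the disk is brought back online, ASM resynchronizes the stale extents in the background. You can monitor the progress from the ASM instance (a minimal sketch; the view is empty once the operation completes):
SELECT group_number, operation, state, power, est_minutes
FROM v$asm_operation;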
From the ASM alert.log:
Tue Feb 21 13:38:56 2012
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
ASM Volume(VDBG) - Unable to send message 'disk status' to the volume driver.
NOTE: PST update grp = 1 completed successfully
NOTE: reset timers for disk: 9
NOTE: completed online of disk group 1 disks
HDD_E0_S09_966584211P1 (9)
Tue Feb 21 13:38:58 2012
NOTE: Found /dev/mapper/HDD_E0_S09_966584211p2 for disk HDD_E0_S09_966584211P2
WARNING: ignoring disk in deep discovery
SUCCESS: validated disks for 2/0x83be304f (RECO)
GMON querying group 2 at 47 for pid 46, osid 11274
NOTE: membership refresh pending for group 2/0x83be304f (RECO)
GMON querying group 2 at 48 for pid 18, osid 6032
NOTE: cache opening disk 9 of grp 2: HDD_E0_S09_966584211P2 path:/dev/mapper/HDD_E0_S09_966584211p2
SUCCESS: refreshed membership for 2/0x83be304f (RECO)
NOTE: initiating PST update: grp = 2, dsk = 9/0x0, mode = 0x5d, op = 1
SUCCESS: alter diskgroup /*+ _OAK_AsmCookie */ RECO online disk 'HDD_E0_S09_966584211p2'
GMON updating disk modes for group 2 at 49 for pid 46, osid 11274
NOTE: PST update grp = 2 completed successfully
NOTE: initiating PST update: grp = 2, dsk = 9/0x0, mode = 0x7d, op = 1
GMON updating disk modes for group 2 at 50 for pid 46, osid 11274
NOTE: PST update grp = 2 completed successfully
NOTE: Voting File refresh pending for group 2/0x83be304f (RECO)
NOTE: Attempting voting file refresh on diskgroup RECO
Tue Feb 21 13:40:56 2012
NOTE: initiating PST update: grp = 2, dsk = 9/0x0, mode = 0x7f, op = 1
Tue Feb 21 13:40:56 2012
GMON updating disk modes for group 2 at 51 for pid 46, osid 11274
NOTE: PST update grp = 2 completed successfully
NOTE: reset timers for disk: 9
NOTE: completed online of disk group 2 disks
HDD_E0_S09_966584211P2 (9)
and dmesg shows the disk is recognized:
mpt2sas0: detecting: handle(0x0012), sas_address(0x5000c500399ce791), phy(8)
mpt2sas0: REPORT_LUNS: handle(0x0012), retries(0)
mpt2sas0: TEST_UNIT_READY: handle(0x0012), lun(0)
Vendor: SEAGATE Model: ST360057SSUN600G Rev: 0A25
Type: Direct-Access ANSI SCSI revision: 05
scsi 6:0:26:0: SSP: handle(0x0012), sas_addr(0x5000c500399ce791), phy(8), device_name(0x00c5005091e79c39)
scsi 6:0:26:0: SSP: enclosure_logical_id(0x5080020000b16e00), slot(9)
scsi 6:0:26:0: serial_number(001112E0L4P2 6SL0L4P2)
scsi 6:0:26:0: qdepth(254), tagged(1), simple(1), ordered(0), scsi_level(6), cmd_que(1)
mpt2sas1: detecting: handle(0x0021), sas_address(0x5000c500399ce791), phy(8)
mpt2sas1: REPORT_LUNS: handle(0x0021), retries(0)
mpt2sas1: TEST_UNIT_READY: handle(0x0021), lun(0)
SCSI device sdaz: 1172123568 512-byte hdwr sectors (600127 MB)
Vendor: SEAGATE Model: ST360057SSUN600G Rev: 0A25
Type: Direct-Access ANSI SCSI revision: 05
scsi 7:0:26:0: SSP: handle(0x0021), sas_addr(0x5000c500399ce791), phy(8), device_name(0x00c5005091e79c39)
scsi 7:0:26:0: SSP: enclosure_logical_id(0x5080020000b16e00), slot(9)
scsi 7:0:26:0: serial_number(001112E0L4P2 6SL0L4P2)
scsi 7:0:26:0: qdepth(254), tagged(1), simple(1), ordered(0), scsi_level(6), cmd_que(1)
SCSI device sdba: 1172123568 512-byte hdwr sectors (600127 MB)
sdba: Write Protect is off
sdba: Mode Sense: df 00 10 08
SCSI device sdba: drive cache: write through w/ FUA
SCSI device sdba: 1172123568 512-byte hdwr sectors (600127 MB)
sdba: Write Protect is off
sdba: Mode Sense: df 00 10 08
SCSI device sdba: drive cache: write through w/ FUA
sdba: sdba1 sdba2
sd 7:0:26:0: Attached scsi disk sdba
sdaz: Write Protect is off
sd 7:0:26:0: Attached scsi generic sg10 type 0
sdaz: Mode Sense: df 00 10 08
SCSI device sdaz: drive cache: write through w/ FUA
SCSI device sdaz: 1172123568 512-byte hdwr sectors (600127 MB)
sdaz: Write Protect is off
sdaz: Mode Sense: df 00 10 08
SCSI device sdaz: drive cache: write through w/ FUA
sdaz: sdaz1 sdaz2
sd 6:0:26:0: Attached scsi disk sdaz
sd 6:0:26:0: Attached scsi generic sg49 type 0
Test Case 3 - SSD (Solid State Disk) failure
Test description
An Oracle Database Appliance has four 73GB 3.5" SAS2 SSDs, used by ASM for the REDO disk group. This test shows what happens in case one shared storage SSD is damaged or lost for any reason.
Test result
After simulating a shared storage SSD failure, the ODA database/instance continues working and oakcli shows the failure.
Test Steps
- Startup system and database
- Verify all disks are online (v$asm_disk) and verify REDO disk group configuration
- Remove a solid state disk manually by pulling it out of the slot (from any slot in the top row of 4 disks)
- Verify that an alert is received
- Verify the disk is not available to ASM (v$asm_disk) and verify REDO disk group configuration
- Reinsert the SSD into its slot
- Verify all disks are online (v$asm_disk) and verify REDO disk group configuration
Details of above steps:
1. Startup system and database
2. Verify the initial shared storage disk status
Verify all disks are online (v$asm_disk) and verify the REDO disk group configuration by issuing queries on the ASM instance:
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
ORDER BY group_number, disk_number;
1 0 HDD_E0_S00_967034331P1 NORMAL ONLINE CACHED
1 1 HDD_E0_S01_965477095P1 NORMAL ONLINE CACHED
1 2 HDD_E1_S02_966582999P1 NORMAL ONLINE CACHED
1 3 HDD_E1_S03_966592943P1 NORMAL ONLINE CACHED
1 4 HDD_E0_S04_969051883P1 NORMAL ONLINE CACHED
1 5 HDD_E0_S05_966535155P1 NORMAL ONLINE CACHED
1 6 HDD_E1_S06_967038139P1 NORMAL ONLINE CACHED
1 7 HDD_E1_S07_966537131P1 NORMAL ONLINE CACHED
1 8 HDD_E0_S08_967043831P1 NORMAL ONLINE CACHED
1 9 HDD_E0_S09_966584211P1 NORMAL ONLINE CACHED
1 10 HDD_E1_S10_967036703P1 NORMAL ONLINE CACHED
1 11 HDD_E1_S11_966589399P1 NORMAL ONLINE CACHED
1 12 HDD_E0_S12_967036523P1 NORMAL ONLINE CACHED
1 13 HDD_E0_S13_966800467P1 NORMAL ONLINE CACHED
1 14 HDD_E1_S14_967038379P1 NORMAL ONLINE CACHED
1 15 HDD_E1_S15_967035195P1 NORMAL ONLINE CACHED
1 16 HDD_E0_S16_966617223P1 NORMAL ONLINE CACHED
1 17 HDD_E0_S17_966520995P1 NORMAL ONLINE CACHED
1 18 HDD_E1_S18_966584379P1 NORMAL ONLINE CACHED
1 19 HDD_E1_S19_966573799P1 NORMAL ONLINE CACHED
2 0 HDD_E0_S00_967034331P2 NORMAL ONLINE CACHED
2 1 HDD_E0_S01_965477095P2 NORMAL ONLINE CACHED
2 2 HDD_E1_S02_966582999P2 NORMAL ONLINE CACHED
2 3 HDD_E1_S03_966592943P2 NORMAL ONLINE CACHED
2 4 HDD_E0_S04_969051883P2 NORMAL ONLINE CACHED
2 5 HDD_E0_S05_966535155P2 NORMAL ONLINE CACHED
2 6 HDD_E1_S06_967038139P2 NORMAL ONLINE CACHED
2 7 HDD_E1_S07_966537131P2 NORMAL ONLINE CACHED
2 8 HDD_E0_S08_967043831P2 NORMAL ONLINE CACHED
2 9 HDD_E0_S09_966584211P2 NORMAL ONLINE CACHED
2 10 HDD_E1_S10_967036703P2 NORMAL ONLINE CACHED
2 11 HDD_E1_S11_966589399P2 NORMAL ONLINE CACHED
2 12 HDD_E0_S12_967036523P2 NORMAL ONLINE CACHED
2 13 HDD_E0_S13_966800467P2 NORMAL ONLINE CACHED
2 14 HDD_E1_S14_967038379P2 NORMAL ONLINE CACHED
2 15 HDD_E1_S15_967035195P2 NORMAL ONLINE CACHED
2 16 HDD_E0_S16_966617223P2 NORMAL ONLINE CACHED
2 17 HDD_E0_S17_966520995P2 NORMAL ONLINE CACHED
2 18 HDD_E1_S18_966584379P2 NORMAL ONLINE CACHED
2 19 HDD_E1_S19_966573799P2 NORMAL ONLINE CACHED
3 20 SSD_E0_S20_805607370P1 NORMAL ONLINE CACHED
3 21 SSD_E0_S21_805607443P1 NORMAL ONLINE CACHED
3 22 SSD_E1_S22_805607458P1 NORMAL ONLINE CACHED
3 23 SSD_E1_S23_805607433P1 NORMAL ONLINE CACHED
44 rows selected.
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP
WHERE name='REDO';
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
oakcli shows no FAILED disk:
# oakcli show disk | grep FAILED
#
3. Remove a shared storage SSD
Remove a shared storage SSD manually by pulling it out of the slot (from any slot in the top row)
oakcli shows the disk is now removed:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD FAILED DiskRemoved
4. Verify an alert is received
In the ASM alert.log you see the I/O error due to the missing disk:
2012-02-21 18:36:06.118000 +02:00
SUCCESS: alter diskgroup /*+ _OAK_AsmCookie */ REDO offline disk 'SSD_E1_S23_805607433p1'
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_28122.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:0 disk:40 AU:0 offset:0 size:4096
2012-02-21 18:36:42.493000 +02:00
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28525] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28525] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28667] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28667] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28733] opening OCR file
2012-02-21 18:36:43.517000 +02:00
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28733] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28837] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28847] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28837] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28847] opening OCR file
2012-02-21 18:36:45.043000 +02:00
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28932] opening OCR file
NOTE: [crsctl.bin@zaoda-01 (TNS V1-V3) 28932] opening OCR file
2012-02-21 18:37:25.992000 +02:00
WARNING: Disk (SSD_E1_S23_805607433P1) will be dropped in: (12960) secs on ASM inst: (1)
2012-02-21 18:38:12.255000 +02:00
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_29564.trc:
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
WARNING: Read Failed. group:0 disk:40 AU:0 offset:0 size:4096
2012-02-21 18:38:59.040000 +02:00
WARNING: Disk (SSD_E1_S23_805607433P1) will be dropped in: (12867) secs on ASM inst: (1)
2012-02-21 18:40:29.089000 +02:00
WARNING: Disk (SSD_E1_S23_805607433P1) will be dropped in: (12777) secs on ASM inst: (1)
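The number of seconds in the drop warning is governed by the disk group's disk_repair_time attribute, which you can inspect on the ASM instance (a minimal sketch):
SELECT g.name AS diskgroup, a.name AS attribute, a.value
FROM v$asm_attribute a, v$asm_diskgroup g
WHERE a.group_number = g.group_number
AND a.name = 'disk_repair_time';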
5. Verify the removed SSD disk is not available to ASM
Verify the disk is not available to ASM (v$asm_disk) and verify the REDO disk group configuration:
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
WHERE mode_status='OFFLINE'
ORDER BY group_number, disk_number;
3 23 SSD_E1_S23_805607433P1 NORMAL OFFLINE MISSING
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP;
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
REDO 512 MOUNTED HIGH 210,012 181,832 60,610
6. Reinsert the SSD into its slot
oakcli shows the disk is now available as ONLINE Good:
# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS
pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
7. Verify all disks are online
Verify all disks are online (v$asm_disk) and verify the REDO disk group configuration:
col GN format 99
col DN format 99
col NAME format a23
SELECT
group_number GN,disk_number DN,name,state,mode_status,mount_status
FROM v$asm_disk
ORDER BY group_number, disk_number;
1 0 HDD_E0_S00_967034331P1 NORMAL ONLINE CACHED
1 1 HDD_E0_S01_965477095P1 NORMAL ONLINE CACHED
1 2 HDD_E1_S02_966582999P1 NORMAL ONLINE CACHED
1 3 HDD_E1_S03_966592943P1 NORMAL ONLINE CACHED
1 4 HDD_E0_S04_969051883P1 NORMAL ONLINE CACHED
1 5 HDD_E0_S05_966535155P1 NORMAL ONLINE CACHED
1 6 HDD_E1_S06_967038139P1 NORMAL ONLINE CACHED
1 7 HDD_E1_S07_966537131P1 NORMAL ONLINE CACHED
1 8 HDD_E0_S08_967043831P1 NORMAL ONLINE CACHED
1 9 HDD_E0_S09_966584211P1 NORMAL ONLINE CACHED
1 10 HDD_E1_S10_967036703P1 NORMAL ONLINE CACHED
1 11 HDD_E1_S11_966589399P1 NORMAL ONLINE CACHED
1 12 HDD_E0_S12_967036523P1 NORMAL ONLINE CACHED
1 13 HDD_E0_S13_966800467P1 NORMAL ONLINE CACHED
1 14 HDD_E1_S14_967038379P1 NORMAL ONLINE CACHED
1 15 HDD_E1_S15_967035195P1 NORMAL ONLINE CACHED
1 16 HDD_E0_S16_966617223P1 NORMAL ONLINE CACHED
1 17 HDD_E0_S17_966520995P1 NORMAL ONLINE CACHED
1 18 HDD_E1_S18_966584379P1 NORMAL ONLINE CACHED
1 19 HDD_E1_S19_966573799P1 NORMAL ONLINE CACHED
2 0 HDD_E0_S00_967034331P2 NORMAL ONLINE CACHED
2 1 HDD_E0_S01_965477095P2 NORMAL ONLINE CACHED
2 2 HDD_E1_S02_966582999P2 NORMAL ONLINE CACHED
2 3 HDD_E1_S03_966592943P2 NORMAL ONLINE CACHED
2 4 HDD_E0_S04_969051883P2 NORMAL ONLINE CACHED
2 5 HDD_E0_S05_966535155P2 NORMAL ONLINE CACHED
2 6 HDD_E1_S06_967038139P2 NORMAL ONLINE CACHED
2 7 HDD_E1_S07_966537131P2 NORMAL ONLINE CACHED
2 8 HDD_E0_S08_967043831P2 NORMAL ONLINE CACHED
2 9 HDD_E0_S09_966584211P2 NORMAL ONLINE CACHED
2 10 HDD_E1_S10_967036703P2 NORMAL ONLINE CACHED
2 11 HDD_E1_S11_966589399P2 NORMAL ONLINE CACHED
2 12 HDD_E0_S12_967036523P2 NORMAL ONLINE CACHED
2 13 HDD_E0_S13_966800467P2 NORMAL ONLINE CACHED
2 14 HDD_E1_S14_967038379P2 NORMAL ONLINE CACHED
2 15 HDD_E1_S15_967035195P2 NORMAL ONLINE CACHED
2 16 HDD_E0_S16_966617223P2 NORMAL ONLINE CACHED
2 17 HDD_E0_S17_966520995P2 NORMAL ONLINE CACHED
2 18 HDD_E1_S18_966584379P2 NORMAL ONLINE CACHED
2 19 HDD_E1_S19_966573799P2 NORMAL ONLINE CACHED
3 20 SSD_E0_S20_805607370P1 NORMAL ONLINE CACHED
3 21 SSD_E0_S21_805607443P1 NORMAL ONLINE CACHED
3 22 SSD_E1_S22_805607458P1 NORMAL ONLINE CACHED
3 23 SSD_E1_S23_805607433P1 NORMAL ONLINE CACHED
44 rows selected.
col DG format a4
col "Size(MB)" format 9,999,999
col "Free(MB)" format 9,999,999
col "Usable(MB)" format 9,999,999
SELECT name AS "DG",
sector_size AS "Sector Size",
state,
type AS "Redundancy",
total_mb AS "Size(MB)",
free_mb AS "Free(MB)",
usable_file_mb AS "Usable(MB)"
FROM V$ASM_DISKGROUP
WHERE name='REDO';
DG Sector Size STATE Redund Size(MB) Free(MB) Usable(MB)
---- ----------- ----------- ------ ---------- ---------- ----------
REDO 512 MOUNTED HIGH 280,016 242,460 34,150
Test Case 4 - Connectivity to Database
1. Local connection - Verify connectivity to an instance
Test description
Check if you can connect to your instance locally on the node
Test result
Locally on an Oracle Database Appliance node you can connect to your instance by setting the ORACLE_HOME and ORACLE_SID environment variables.
Test Steps
In this example we can connect (the expected result) and we create a user (test) that is used in the other connection tests below:
su - oracle
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
export ORACLE_SID=ODAMIG1
export PATH=$PATH:$ORACLE_HOME/bin
sqlplus / as sysdba
SQL> create user test identified by test;
SQL> alter user test account unlock;
SQL> grant resource,connect to test;
SQL> grant select on v_$instance to test;
sqlplus test/test
SQL>
2. External connection - Connect from an application
Test description
Check if you can connect to your instance from a remote client
Test result
You can connect to an Oracle Database Appliance database from a remote client using an appropriate connect string.
Test Steps
Define an appropriate connection entry in your client tnsnames.ora and connect using the previously created user. Note that we are using the SCAN listener (HOST = rc-voda1-scan):
ODAMIG =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = rc-voda1-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = ODAMIG)
)
)
$ sqlplus test/test@ODAMIG
SQL>
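Alternatively, the same connection can be made without a tnsnames.ora entry using EZConnect syntax (a sketch, assuming EZCONNECT is enabled in the client's NAMES.DIRECTORY_PATH):
$ sqlplus test/test@//rc-voda1-scan:1521/ODAMIG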
3. Connect using services - Connect using services and test load balancing
Test description
You are connecting to an Oracle Database Appliance database using a service and verifying load balancing.
Test result
Connect to a database on the Oracle Database Appliance from a remote client using an appropriate connection string. The second connection will be on the second node.
Test Steps
1. Define a service which is running on both nodes:
$ srvctl add service -d ODAMIG -s oltp -r "ODAMIG1,ODAMIG2" -P BASIC -e select
$ srvctl start service -s OLTP -d ODAMIG
$ srvctl status service -d ODAMIG -s OLTP
Service oltp is running on instance(s) ODAMIG1,ODAMIG2
2. Define an appropriate connection entry in your client tnsnames.ora
OLTP =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = rc-voda1-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = OLTP)
)
)
3. Connect with the service name from your remote client
sqlplus test/test@oltp
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG1
4. A second connection from a remote client will connect to the second RAC instance running on the other ODA node:
sqlplus test/test@oltp
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG2
Test Case 5 - Connection failover and continued service availability
Test description
Test node failures and continued database service availability.
Test result
You are connected to the instance on one node; if this node crashes for any reason, repeating the query will automatically connect you to the instance on the surviving node.
Test Steps
1. You are connected to the instance on node 2:
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG2
2. Simulate a database instance failure (instance 'ODAMIG2') by executing a shutdown abort of instance 2 on node 2 from another client:
SQL> shutdown abort;
3. On the client side, run the above query again:
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG1
SQL>
Note:
When instance 'ODAMIG2' crashes, clients are reconnected to instance 'ODAMIG1'.
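On the server side you can confirm where the service is now running; a sketch using the oltp service created in Test Case 4 (expected output while node 2 is down):
$ srvctl status service -d ODAMIG -s oltp
Service oltp is running on instance(s) ODAMIG1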
Test Case 6 - Private network failure
Test description
The nodes in an Oracle Database Appliance are connected through two internal 1GbE connections. This test shows what happens if the interconnect breaks.
Test result
As the interconnect in an Oracle Database Appliance is redundant, if one interface is affected by any issue the other one still works and no side effects are observed by Oracle Clusterware and the database(s). If both interconnect interfaces (eth0, eth1) fail, a node is evicted (the expected result). When the connectivity on the interconnect is restored, the evicted node rejoins the cluster.
Test Steps
1. The initial default status is that both private NICs are up
Using the 'ifconfig' OS command you can check the initial eth networking interfaces (all are up):
# ifconfig
bond0 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.12 Bcast:10.245.55.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fed7:6748/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:168068528 errors:0 dropped:0 overruns:0 frame:0
TX packets:73908628 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:172142891441 (160.3 GiB) TX bytes:25356106080 (23.6 GiB)
bond0:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.28 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:2 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.56 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:3 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.57 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond1 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
inet6 addr: fe80::21b:21ff:feae:fc49/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:54213591 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:4693019668 (4.3 GiB) TX bytes:936 (936.0 b)
bond2 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
inet6 addr: fe80::21b:21ff:feae:fc4b/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:41836342 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11374469913 (10.5 GiB) TX bytes:936 (936.0 b)
eth0 Link encap:Ethernet HWaddr 00:21:28:D7:67:4C
inet addr:192.168.16.24 Bcast:192.168.16.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed7:674c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:18313594 errors:0 dropped:0 overruns:0 frame:0
TX packets:17582725 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:21373231223 (19.9 GiB) TX bytes:20417204203 (19.0 GiB)
Memory:dee80000-deea0000
eth0:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4C
inet addr:169.254.112.206 Bcast:169.254.127.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:dee80000-deea0000
eth1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4D
inet addr:192.168.17.24 Bcast:192.168.17.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed7:674d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:16744784 errors:0 dropped:0 overruns:0 frame:0
TX packets:15887756 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:18212979386 (16.9 GiB) TX bytes:12045330895 (11.2 GiB)
Memory:deee0000-def00000
eth1:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4D
inet addr:169.254.240.172 Bcast:169.254.255.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:deee0000-def00000
eth2 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:167991955 errors:0 dropped:0 overruns:0 frame:0
TX packets:73908628 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:172138293677 (160.3 GiB) TX bytes:25356106332 (23.6 GiB)
Memory:def60000-def80000
eth3 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:76573 errors:0 dropped:0 overruns:0 frame:0
TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4597764 (4.3 MiB) TX bytes:88 (88.0 b)
Memory:defe0000-df000000
eth4 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27112876 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2349160551 (2.1 GiB) TX bytes:936 (936.0 b)
Memory:df1a0000-df1c0000
eth5 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27100715 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2343859117 (2.1 GiB) TX bytes:0 (0.0 b)
Memory:df1e0000-df200000
eth6 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20920394 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5688489549 (5.2 GiB) TX bytes:936 (936.0 b)
Memory:df2a0000-df2c0000
eth7 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20915948 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5685980364 (5.2 GiB) TX bytes:0 (0.0 b)
Memory:df2e0000-df300000
eth8 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6740448 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:411876392 (392.7 MiB) TX bytes:936 (936.0 b)
eth9 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6727602 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:411062894 (392.0 MiB) TX bytes:0 (0.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:39670610 errors:0 dropped:0 overruns:0 frame:0
TX packets:39670610 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:78672574464 (73.2 GiB) TX bytes:78672574464 (73.2 GiB)
xbond0 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
inet6 addr: fe80::21b:21ff:feb6:ae4/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:13468050 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:822939286 (784.8 MiB) TX bytes:936 (936.0 b)
2. Simulate a failure of eth0
With the 'ifdown' OS command you can switch off a network interface:
# ifdown eth0
The ifconfig output no longer shows eth0:
# ifconfig
bond0 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.12 Bcast:10.245.55.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fed7:6748/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:168070270 errors:0 dropped:0 overruns:0 frame:0
TX packets:73910433 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:172143383625 (160.3 GiB) TX bytes:25357203979 (23.6 GiB)
bond0:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.28 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:2 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.56 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:3 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
inet addr:10.245.48.57 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond1 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
inet6 addr: fe80::21b:21ff:feae:fc49/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:54217767 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:4693393560 (4.3 GiB) TX bytes:936 (936.0 b)
bond2 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
inet6 addr: fe80::21b:21ff:feae:fc4b/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:41838642 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11375143337 (10.5 GiB) TX bytes:936 (936.0 b)
eth1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4D
inet addr:192.168.17.24 Bcast:192.168.17.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed7:674d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:16746499 errors:0 dropped:0 overruns:0 frame:0
TX packets:15889403 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:18215302041 (16.9 GiB) TX bytes:12046528425 (11.2 GiB)
Memory:deee0000-def00000
eth1:1 Link encap:Ethernet HWaddr 00:21:28:D7:67:4D
inet addr:169.254.240.172 Bcast:169.254.255.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:deee0000-def00000
eth2 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:167993697 errors:0 dropped:0 overruns:0 frame:0
TX packets:73910433 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:172138785861 (160.3 GiB) TX bytes:25357204231 (23.6 GiB)
Memory:def60000-def80000
eth3 Link encap:Ethernet HWaddr 00:21:28:D7:67:48
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:76573 errors:0 dropped:0 overruns:0 frame:0
TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4597764 (4.3 MiB) TX bytes:88 (88.0 b)
Memory:defe0000-df000000
eth4 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27114964 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2349347497 (2.1 GiB) TX bytes:936 (936.0 b)
Memory:df1a0000-df1c0000
eth5 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:49
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27102803 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2344046063 (2.1 GiB) TX bytes:0 (0.0 b)
Memory:df1e0000-df200000
eth6 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20921544 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5688826261 (5.2 GiB) TX bytes:936 (936.0 b)
Memory:df2a0000-df2c0000
eth7 Link encap:Ethernet HWaddr 00:1B:21:AE:FC:4B
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20917098 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5686317076 (5.2 GiB) TX bytes:0 (0.0 b)
Memory:df2e0000-df300000
eth8 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6740753 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:411894734 (392.8 MiB) TX bytes:936 (936.0 b)
eth9 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6727906 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:411081176 (392.0 MiB) TX bytes:0 (0.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:39673653 errors:0 dropped:0 overruns:0 frame:0
TX packets:39673653 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:78674031488 (73.2 GiB) TX bytes:78674031488 (73.2 GiB)
xbond0 Link encap:Ethernet HWaddr 00:1B:21:B6:0A:E4
inet6 addr: fe80::21b:21ff:feb6:ae4/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:13468659 errors:0 dropped:0 overruns:0 frame:0
TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:822975910 (784.8 MiB) TX bytes:936 (936.0 b)
Result Note 1: System status is normal; no failure is seen at the database or clusterware level.
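You can also confirm which interfaces Clusterware uses for the private interconnect with oifcfg (a sketch; eth0 and eth1 should be listed with the cluster_interconnect role):
$ /u01/app/11.2.0/grid/bin/oifcfg getif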
3. Simulate a failure of both private NICs at the same time
Initial status: both private NICs on box1 are up (bring eth0 back up after the previous test):
# ifup eth0
Simulate a failure of private NICs eth0 and eth1 on box1:
# ifdown eth0
# ifdown eth1
Result Note 2:
In the clusterware alert.log you see the heartbeat reported missing after about 15 seconds and, roughly 30 seconds later, node 2 is evicted as expected (database/ASM instances are down, CRS/CSS is down, OHASD is up).
As expected, the instance on node 1 resumes service once the reconfiguration is completed.
In the Clusterware alert.log (/log/<nodename>/alert.log) you can see the node eviction and the cluster reconfiguration:
2012-01-05 06:46:12.833
[cssd(12247)]CRS-1612:Network communication with node slcac457 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.300 seconds
2012-01-05 06:46:19.849
[cssd(12247)]CRS-1611:Network communication with node slcac457 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.280 seconds
2012-01-05 06:46:24.859
[cssd(12247)]CRS-1610:Network communication with node slcac457 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.270 seconds
2012-01-05 06:46:29.132
[cssd(12247)]CRS-1623:The IPMI node kill information of BMC at IP address 10.131.228.195 could not be validated due to invalid authorization information. The BMC username provided is 'root'; details at (:CSSNK00004:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 06:46:29.132
[cssd(12247)]CRS-1620:The node kill information of node slcac456 could not be validated by this node due to invalid authorization information; details at (:CSSNM00003:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 06:46:31.129
[cssd(12247)]CRS-1623:The IPMI node kill information of BMC at IP address 10.131.228.196 could not be validated due to invalid authorization information. The BMC username provided is 'root'; details at (:CSSNK00004:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 06:46:31.129
[cssd(12247)]CRS-1620:The node kill information of node slcac457 could not be validated by this node due to invalid authorization information; details at (:CSSNM00003:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 06:46:31.129
[cssd(12247)]CRS-1607:Node slcac457 is being evicted in cluster incarnation 219716296; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log.
2012-01-05 06:46:34.138
[cssd(12247)]CRS-1625:Node slcac457, number 2, was manually shut down
2012-01-05 06:46:34.144
[cssd(12247)]CRS-1601:CSSD Reconfiguration complete. Active nodes are slcac456 .
2012-01-05 06:46:34.150
[crsd(13460)]CRS-5504:Node down event reported for node 'slcac457'.
2012-01-05 06:46:34.153
[ctssd(12915)]CRS-2407:The new Cluster Time Synchronization Service reference node is host slcac456.
2012-01-05 06:46:50.242
[crsd(13460)]CRS-2773:Server 'slcac457' has been removed from pool 'ora.ODAMIG_oltp'.
2012-01-05 06:46:50.243
[crsd(13460)]CRS-2773:Server 'slcac457' has been removed from pool 'Generic'.
2012-01-05 06:46:50.243
[crsd(13460)]CRS-2773:Server 'slcac457' has been removed from pool 'ora.ODAMIG'.
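The countdown in the CRS-1612/1611/1610 messages above is driven by the CSS misscount setting (30 seconds by default on Linux); you can confirm the value on your own cluster with:
# crsctl get css misscount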
The instance is still running on node 1:
[root@slcac456]# ps -ef | grep smon
oracle 9120 1 0 Jan04 ? 00:00:01 ora_smon_ORAMIG1
root 12193 1 0 2011 ? 01:09:16 /u01/app/11.2.0/grid/bin/osysmond.bin
grid 13374 1 0 2011 ? 00:00:00 asm_smon_+ASM1
root 30439 24554 0 07:00 pts/4 00:00:00 grep smon
On node 2 the instance is no longer running (node evicted):
[root@rc-voda2]# ps -ef | grep smon
root 12148 1 1 2011 ? 04:00:08 /u01/app/11.2.0/grid/bin/osysmond.bin
root 28904 24208 0 06:58 pts/0 00:00:00 grep smon
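On the evicted node you can also confirm that OHASD is still up while the rest of the stack is down (a minimal sketch):
# crsctl check has    (Oracle High Availability Services should report online)
# crsctl check crs    (CRS/CSS should report as not reachable while the node is out of the cluster)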
4. Bring the failed NICs (eth0/eth1) back up
# ifup eth0
# ifup eth1
Result Note 3:
The eth0/eth1 IPs and the HAIPs come back almost immediately; the ASM and database instances on node 2 are started automatically by the Clusterware, and the Clusterware stack itself is restarted on node 2. Validate via the log files, and by other means, that the node rejoins the cluster.
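Besides the alert.log, a sketch of quick checks that the node has rejoined:
# olsnodes -s          (both nodes should be reported as Active)
# crsctl stat res -t   (the ASM and database resources should be ONLINE on both nodes)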
From the Clusterware alert.log (/log/<nodename>/alert.log):
2012-01-05 07:03:54.239
[cssd(12247)]CRS-1623:The IPMI node kill information of BMC at IP address 10.131.228.195 could not be validated due to invalid authorization information. The BMC username provided is 'root'; details at (:CSSNK00004:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 07:03:54.239
[cssd(12247)]CRS-1620:The node kill information of node slcac456 could not be validated by this node due to invalid authorization information; details at (:CSSNM00003:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 07:03:56.243
[cssd(12247)]CRS-1623:The IPMI node kill information of BMC at IP address 10.131.228.196 could not be validated due to invalid authorization information. The BMC username provided is 'root'; details at (:CSSNK00004:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 07:03:56.243
[cssd(12247)]CRS-1620:The node kill information of node slcac457 could not be validated by this node due to invalid authorization information; details at (:CSSNM00003:) in /u01/app/11.2.0/grid/log/slcac456/cssd/ocssd.log
2012-01-05 07:03:56.294
[cssd(12247)]CRS-1601:CSSD Reconfiguration complete. Active nodes are slcac456 slcac457
From the Clusterware alert.log on node 2:
2012-01-05 06:59:11.724
[client(29071)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/slcac457/client/crsctl_oracle.log.
2012-01-05 07:02:50.055
[cssd(28166)]CRS-1601:CSSD Reconfiguration complete. Active nodes are slcac456 slcac457 .
2012-01-05 07:02:52.095
[ctssd(30648)]CRS-2403:The Cluster Time Synchronization Service on host slcac457 is in observer mode.
2012-01-05 07:02:52.445
[ctssd(30648)]CRS-2407:The new Cluster Time Synchronization Service reference node is host slcac456.
2012-01-05 07:02:52.447
[ctssd(30648)]CRS-2401:The Cluster Time Synchronization Service started on host slcac457.
2012-01-05 07:03:04.233
[ctssd(30648)]CRS-2412:The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /u01/app/11.2.0/grid/log/slcac457/ctssd/octssd.log.
2012-01-05 07:03:04.233
[ctssd(30648)]CRS-2409:The clock on host slcac457 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2012-01-05 07:03:16.741
[crsd(31084)]CRS-1012:The OCR service started on node slcac457.
2012-01-05 07:03:16.773
[evmd(30670)]CRS-1401:EVMD started on node slcac457.
2012-01-05 07:03:18.719
[crsd(31084)]CRS-1201:CRSD started on node slcac457.
2012-01-05 07:03:20.922
[/u01/app/11.2.0/grid/bin/oraagent.bin(31415)]CRS-5011:Check of resource "ODAMIG" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0/grid/log/slcac457/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2012-01-05 07:03:20.962
[/u01/app/11.2.0/grid/bin/oraagent.bin(31398)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/slcac457/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-01-05 07:03:22.454
[/u01/app/11.2.0/grid/bin/oraagent.bin(31398)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/slcac457/agent/crsd/oraagent_grid/oraagent_grid.log"
[client(31529)]CRS-10001:05-Jan-12 07:03 ACFS-9139: Attempting recovery of offline mount point '/cloudfs'
[client(31541)]CRS-10001:05-Jan-12 07:03 ACFS-9111: Offline mount point '/cloudfs' was recovered.
[client(31629)]CRS-10001:GWS: name=RECO, vol=ACFSVOL, state=DISABLED
[client(31632)]CRS-10001:05-Jan-12 07:03 ACFS-9103: Enabling volume 'acfsvol' on diskgroup 'reco'.
[client(31650)]CRS-10001:05-Jan-12 07:03 ACFS-9257: Mounting device '/dev/asm/acfsvol-18' on mount point '/cloudfs'.
On node 2 the instances are running again:
[root@rc-voda2]# ps -ef | grep smon
root 1287 24208 0 07:06 pts/0 00:00:00 grep smon
root 12148 1 1 2011 ? 04:00:13 /u01/app/11.2.0/grid/bin/osysmond.bin
grid 30916 1 0 07:03 ? 00:00:00 asm_smon_+ASM2
oracle 32172 1 0 07:04 ? 00:00:00 ora_smon_ODAMIG2
Test Case 7 - Public Network Failure
Connect from a client (outside the ODA):
sqlplus test/test@ODAMIG
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG1
Shut down bond0 on node 1:
# ifdown bond0
Check the existing client connection:
SQL> /
INSTANCE_NAME
----------------
ODAMIG2
The existing client connection automatically failed over to the other instance.
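Transparent failover of an already-connected session's SELECT, as seen above, assumes the client connects through a TNS alias (or a service) with Transparent Application Failover configured. An illustrative tnsnames.ora entry (the host name is a placeholder, not taken from this note):
ODAMIG =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = oda-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = ODAMIG)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 3))
    )
  )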
Open a new client connection:
SQL> select instance_name from sys.v_$instance;
INSTANCE_NAME
----------------
ODAMIG2
New client connections went to the other instance automatically.
Check the public network, VIPs, and services on the other node:
# ifconfig -a
bond0 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.13 Bcast:10.245.55.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fed6:143a/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:5476567 errors:0 dropped:0 overruns:0 frame:0
TX packets:4699960 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1361587908 (1.2 GiB) TX bytes:1068230705 (1018.7 MiB)
bond0:1 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.56 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:2 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.29 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:3 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.57 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:4 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.58 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond0:5 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
inet addr:10.245.48.28 Bcast:10.245.55.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bond1 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B5
inet6 addr: fe80::21b:21ff:feae:fbb5/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:54340199 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:4704239924 (4.3 GiB) TX bytes:468 (468.0 b)
bond2 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B7
inet6 addr: fe80::21b:21ff:feae:fbb7/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:41931080 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11399604342 (10.6 GiB) TX bytes:468 (468.0 b)
eth0 Link encap:Ethernet HWaddr 00:21:28:D6:14:3E
inet addr:192.168.16.25 Bcast:192.168.16.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed6:143e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:17623438 errors:0 dropped:0 overruns:0 frame:0
TX packets:18354950 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:20465535412 (19.0 GiB) TX bytes:21418856097 (19.9 GiB)
Memory:dee80000-deea0000
eth0:1 Link encap:Ethernet HWaddr 00:21:28:D6:14:3E
inet addr:169.254.37.90 Bcast:169.254.127.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:dee80000-deea0000
eth1 Link encap:Ethernet HWaddr 00:21:28:D6:14:3F
inet addr:192.168.17.25 Bcast:192.168.17.255 Mask:255.255.255.0
inet6 addr: fe80::221:28ff:fed6:143f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:15933937 errors:0 dropped:0 overruns:0 frame:0
TX packets:16791488 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12093725549 (11.2 GiB) TX bytes:18255805634 (17.0 GiB)
Memory:deee0000-def00000
eth1:1 Link encap:Ethernet HWaddr 00:21:28:D6:14:3F
inet addr:169.254.228.105 Bcast:169.254.255.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
Memory:deee0000-def00000
eth2 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:5399866 errors:0 dropped:0 overruns:0 frame:0
TX packets:4699963 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1356982464 (1.2 GiB) TX bytes:1068231207 (1018.7 MiB)
Memory:def60000-def80000
eth3 Link encap:Ethernet HWaddr 00:21:28:D6:14:3A
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:76701 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4605444 (4.3 MiB) TX bytes:0 (0.0 b)
Memory:defe0000-df000000
eth4 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B5
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27176189 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2354775972 (2.1 GiB) TX bytes:468 (468.0 b)
Memory:df1a0000-df1c0000
eth5 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B5
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:27164010 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2349463952 (2.1 GiB) TX bytes:0 (0.0 b)
Memory:df1e0000-df200000
eth6 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B7
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20967761 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5701058754 (5.3 GiB) TX bytes:468 (468.0 b)
Memory:df2a0000-df2c0000
eth7 Link encap:Ethernet HWaddr 00:1B:21:AE:FB:B7
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:20963319 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5698545588 (5.3 GiB) TX bytes:0 (0.0 b)
Memory:df2e0000-df300000
eth8 Link encap:Ethernet HWaddr 00:1B:21:B6:0C:DC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6756411 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:412849763 (393.7 MiB) TX bytes:468 (468.0 b)
eth9 Link encap:Ethernet HWaddr 00:1B:21:B6:0C:DC
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:6743542 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:412035116 (392.9 MiB) TX bytes:0 (0.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:9263136 errors:0 dropped:0 overruns:0 frame:0
TX packets:9263136 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3692025488 (3.4 GiB) TX bytes:3692025488 (3.4 GiB)
sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
xbond0 Link encap:Ethernet HWaddr 00:1B:21:B6:0C:DC
inet6 addr: fe80::21b:21ff:feb6:cdc/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:13499953 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:824884879 (786.6 MiB) TX bytes:468 (468.0 b)
[root@rc-voda2]# crsctl stat res -t
(...)
ora.rc-voda1.vip
1 ONLINE INTERMEDIATE rc-voda2 FAILED OVER
ora.rc-voda2.vip
1 ONLINE ONLINE rc-voda2
Expected Result:
The failure of both public network NICs made the public network inaccessible on this node.
The GI stack and the database on this node were still running.
The node VIP of this node failed over to the other node quickly.
The SCAN VIP and SCAN listener were all running on the other node.
The existing client connections failed over to the other node.
New client connections went to the other instance automatically.
When the public network recovered, the node VIP for this node failed back automatically, and the database service was brought back online on this node automatically.
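A sketch of the recovery step and a quick check that the VIP fails back (the resource name is taken from the crsctl output above):
# ifup bond0
# crsctl stat res ora.rc-voda1.vip -t    (the VIP should return to ONLINE on its home node)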
Test Case 8 - Database backup and recovery test
Test description
A key operational aspect of deploying the ODA is ensuring that database backups are performed, so that the Oracle databases residing on the ODA can be restored if disaster strikes. This test is only the simplest backup-and-restore exercise you can run with RMAN, and it uses the internal disks as the backup destination; in production, use your preferred backup strategy and destination instead.
Test result
After the restore steps, the database is up and running.
1. Back up the database using RMAN.
- Verify the database is in archive log mode
SQL> archive log list
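If the output shows noarchivelog mode, a minimal sketch to enable it (assumes the whole database can be taken down briefly; adapt to your maintenance window):
$ srvctl stop database -d ODAMIG
$ sqlplus "/ as sysdba"
SQL> startup mount;
SQL> alter database archivelog;
SQL> shutdown immediate;
SQL> exit
$ srvctl start database -d ODAMIG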
- Create a directory to store the backup set
# mkdir -p /u01/bakDB
# chown -R oracle:oinstall /u01/bakDB
# chmod 755 /u01/bakDB
- Configure the controlfile autobackup
$ rman nocatalog target /
RMAN> show all;
RMAN> CONFIGURE CONTROLFILE AUTOBACKUP ON;
- Increase the db_recovery_file_dest_size as necessary (e.g., 10G)
- Create or update a database object that can be validated after recovery (e.g. create a table or insert data into a table)
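A sketch of these two preparation steps (the 10G size is only an example; the test1 table is the object validated in step 3 below):
$ sqlplus "/ as sysdba"
SQL> alter system set db_recovery_file_dest_size=10G;
SQL> exit
$ sqlplus scott/tiger
SQL> create table test1 as select * from all_objects;
SQL> exit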
- Backup the full database and archive log to the above directory
RMAN> backup database plus archivelog format '/u01/bakDB/db_%U';
- Verify that the backup set has been generated
# ls -l /u01/bakDB
-rwxrwxr-x 1 oracle asmadmin 123944960 Aug 25 02:06 db_0gmkre9d_1_1
-rwxrwxr-x 1 oracle asmadmin 69632 Aug 25 02:06 db_0imkrea0_1_1
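The backup can also be cross-checked from within RMAN:
RMAN> list backup summary;
RMAN> crosscheck backup;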
2. Recover the database (optional) - simulate a database loss and perform a database recovery.
$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
$ export ORACLE_SID=ODAMIG1
$ export PATH=$ORACLE_HOME/bin:$PATH
$ srvctl stop database -d ODAMIG
$ sqlplus "/ as sysdba"
SQL> startup nomount;
SQL> exit
$ rman nocatalog target /
RMAN> restore controlfile from autobackup;
RMAN> sql 'alter database mount';
RMAN> restore database;
RMAN> recover database;
RMAN> sql 'alter database open resetlogs';
RMAN> exit
3. Verify database recovery (validate using the database object created or updated previously).
$ sqlplus scott/tiger
SQL> select count(*) from test1;
COUNT(*)
----------
5375
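Since the recovery above was performed on a single instance, finish by restarting the database under Clusterware control so that both instances come back up (a sketch):
$ srvctl stop database -d ODAMIG
$ srvctl start database -d ODAMIG
$ srvctl status database -d ODAMIG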
References
<NOTE:1391655.1> - ODA (Oracle Database Appliance): Simulated Failure tests
<NOTE:810394.1> - RAC and Oracle Clusterware Best Practices and Starter Kit (Platform Independent)