Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1420126.1
Update Date: 2017-05-18
Keywords:

Solution Type  Problem Resolution Sure

Solution  1420126.1 :   ODA (Oracle Database Appliance) Different Disks Randomly Disappear After a Reboot  


Related Items
  • Oracle Database Appliance X4-2
  • Oracle Database Appliance Software
  • Oracle Database Appliance
  • Oracle Database Appliance X3-2

Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: ORA-DATA-APP




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-5334656327>

Applies to:

Oracle Database Appliance - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance X4-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance X3-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance Software - Version 2.1.0.2 to 12.1.2.9 [Release 2.1 to 12.1]
Information in this document applies to any platform.
***Checked for relevance on 21-Oct-2013***

Symptoms

    FOR SUPPORT ONLY - there are many flavors of this problem across several ODA versions. Please review the INTERNAL information at the bottom of this note for further discussion and information.


As a result of randomly missing disks, ASM disks or diskgroups do not come up:

  • ODA and ASM fail to identify disks during startup
  • Different disks randomly show as failed, predictive failure, or missing
  • The problem can lead to an entire node failing to start up

 NOTE: This problem was originally associated with a problem alerted in 2.1.0.0 and fixed by 2.1.0.3.1.
       However, the symptoms are more generic and can occur on almost any ODA version.
       This does not mean you are hitting the same bug stated as fixed in 2.1.0.3.1;
       it does mean that similar or identical symptoms can use the same corrective actions on most versions.

Example:

For 2.1.0.x, the symptoms were as follows:

The problematic node was rebooted several times and came back up with different disks missing each time:

DATA dg - missing disks
---------------
/dev/mapper/HDD_E1_S19_993871319p1
/dev/mapper/HDD_E1_S11_1196820151p1
/dev/mapper/HDD_E0_S13_1196881379p1
/dev/mapper/HDD_E0_S04_1196963151p1


RECO dg - missing disks
---------------------
/dev/mapper/HDD_E1_S19_993871319p2
/dev/mapper/HDD_E1_S11_1196820151p2
/dev/mapper/HDD_E0_S13_1196881379p2
/dev/mapper/HDD_E0_S04_1196963151p2

Missing disks before the reboot included pd_04, pd_11, pd_13, and pd_19 (RECO and DATA).

However, after rebooting the node you can confirm that a different set of disks is missing:

ASMCMD> mount all
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "21" is missing from group number "3"
ORA-15042: ASM disk "20" is missing from group number "3"
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "17" is missing from group number "2"
ORA-15042: ASM disk "16" is missing from group number "2"
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "17" is missing from group number "1"
ORA-15042: ASM disk "16" is missing from group number "1" (DBD ERROR:
OCIStmtExecute)

After the reboot, the missing disks included pd_21 and pd_22 (REDO), and pd_16 and pd_17 (RECO and DATA).

ls -l /dev/mapper/HDD*  is a quick way to confirm the available HDD disks
  

 Note:  Your counts should take the version into account, as the V1, X3-2, and X4-2 ODAs have different disk counts.
        Also, use SSD* for the REDO disks, or *D* to include both the SSD and HDD counts.


Example - the two nodes of the same ODA showing different counts
 
[grid@svp-oda1 ~]$ ls -l /dev/mapper/HDD* |wc -l
57

[root@svp-oda2 ~]# ls -l /dev/mapper/HDD* |wc -l
51
  

Node 1
------------
57 disks

Node 2
----------
51 disks
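To see exactly which devices differ between the two nodes, one approach is to capture each node's `ls /dev/mapper/HDD*` listing to a file and compare them. A minimal sketch, assuming you have copied both captures to one node (file names here are placeholders):

```shell
# Hedged sketch: report devices that appear in only one of two captured,
# sorted listings. Capture each node's view first, e.g.:
#   ls /dev/mapper/HDD* | sort > /tmp/hdd_node1.txt   (on node 1)
#   ls /dev/mapper/HDD* | sort > /tmp/hdd_node2.txt   (on node 2, then copy over)
devices_only_on_one_node() {
    # $1, $2 = sorted device listings from node 1 and node 2.
    # comm -3 suppresses lines common to both files, leaving only the
    # devices that one node sees and the other does not.
    comm -3 "$1" "$2"
}
```

Any output points at the devices one node cannot see; an empty result means both nodes agree.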

As a result of these missing disks, the ASM diskgroups and Grid Infrastructure do not come up.

Commands for determining missing disks:

# oakcli show disk
NAME PATH TYPE STATE STATE_DETAILS

pd_00 /dev/sdam HDD ONLINE Good
pd_01 /dev/sdaw HDD ONLINE Good
pd_02 /dev/sdaa HDD ONLINE Good
pd_03 /dev/sdak HDD ONLINE Good
pd_04 /dev/sdan HDD ONLINE Good
pd_05 /dev/sdax HDD ONLINE Good
pd_06 /dev/sdab HDD ONLINE Good
pd_07 /dev/sdal HDD ONLINE Good
pd_08 /dev/sdao HDD ONLINE Good
pd_09 /dev/sdau HDD ONLINE Good
pd_10 /dev/sdac HDD ONLINE Good
pd_11 /dev/sdai HDD ONLINE Good
pd_12 /dev/sdap HDD ONLINE Good
pd_13 /dev/sdav HDD ONLINE Good
pd_14 /dev/sdad HDD ONLINE Good
pd_15 /dev/sdaj HDD ONLINE Good
pd_16 /dev/sdaq HDD ONLINE Good
pd_17 /dev/sdas HDD ONLINE Good
pd_18 /dev/sdae HDD ONLINE Good
pd_19 /dev/sdag HDD ONLINE Good
pd_20 /dev/sdar SSD ONLINE Good
pd_21 /dev/sdat SSD ONLINE Good
pd_22 /dev/sdaf SSD ONLINE Good
pd_23 /dev/sdah SSD ONLINE Good
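The listing above shows all disks Good; in a failure scenario you usually want only the exceptions. A small filter, sketched here assuming the column layout shown above (NAME PATH TYPE STATE STATE_DETAILS), pulls out any disk that is not ONLINE/Good:

```shell
# Hedged sketch: print only the pd_* rows whose STATE or STATE_DETAILS is
# not ONLINE/Good, from a captured 'oakcli show disk' output file.
bad_disks() {
    # $1 = file containing 'oakcli show disk' output;
    # columns assumed: NAME PATH TYPE STATE STATE_DETAILS
    awk '$1 ~ /^pd_/ && ($4 != "ONLINE" || $5 != "Good")' "$1"
}
# usage (on the ODA): oakcli show disk > /tmp/disks.txt && bad_disks /tmp/disks.txt
```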


# oakcli show diskgroup data
ASM_DISK PATH DISK STATE STATE_DETAILS

data_00 /dev/mapper/HDD_E0_S00_975071251p1 pd_00 ONLINE Good
data_01 /dev/mapper/HDD_E0_S01_973074223p1 pd_01 ONLINE Good
data_02 /dev/mapper/HDD_E1_S02_975283211p1 pd_02 ONLINE Good
data_03 /dev/mapper/HDD_E1_S03_975067947p1 pd_03 ONLINE Good
data_04 /dev/mapper/HDD_E0_S04_975277007p1 pd_04 ONLINE Good
data_05 /dev/mapper/HDD_E0_S05_975080611p1 pd_05 ONLINE Good
data_06 /dev/mapper/HDD_E1_S06_975276063p1 pd_06 ONLINE Good
data_07 /dev/mapper/HDD_E1_S07_975284323p1 pd_07 ONLINE Good
data_08 /dev/mapper/HDD_E0_S08_970712075p1 pd_08 ONLINE Good
data_09 /dev/mapper/HDD_E0_S09_975061523p1 pd_09 ONLINE Good
data_10 /dev/mapper/HDD_E1_S10_975282083p1 pd_10 ONLINE Good
data_11 /dev/mapper/HDD_E1_S11_975281571p1 pd_11 ONLINE Good
data_12 /dev/mapper/HDD_E0_S12_975274931p1 pd_12 ONLINE Good
data_13 /dev/mapper/HDD_E0_S13_977596619p1 pd_13 ONLINE Good
data_14 /dev/mapper/HDD_E1_S14_975053527p1 pd_14 ONLINE Good
data_15 /dev/mapper/HDD_E1_S15_975284719p1 pd_15 ONLINE Good
data_16 /dev/mapper/HDD_E0_S16_975268647p1 pd_16 ONLINE Good
data_17 /dev/mapper/HDD_E0_S17_975283679p1 pd_17 ONLINE Good
data_18 /dev/mapper/HDD_E1_S18_975281159p1 pd_18 ONLINE Good
data_19 /dev/mapper/HDD_E1_S19_975279427p1 pd_19 ONLINE Good

 

Changes

This problem can occur after:

  • One ASM disk is lost
  • Replacing a disk
  • Reboot (of one server)

 

Cause

#1  If you are on ODA 2.1.x, this problem has been identified as a bug:

<Bug: 13728921> - PHYSICAL DISKS DISAPPEAR AFTER REBOOTING NODE
-closed as a duplicate of

<Bug: 13618428> - AFTER LOSING ONE ASM DISK, MULTIPLE DISKS BECAME UNRESPONSIVE

 
#2  There are other similar scenarios in which disks are found offlined that are not the result of a specific bug:

  • Corrupted disk
  • IO errors
  • Disk removed improperly
  • Disk physically replaced but not firmly inserted into the ODA
  • FW problems
  • Certain conditions during patching
    -- usually when a disk was not discovered as offlined before applying a patch
    -- incomplete or failed patching
  • Other

  

 

   CR 7132662 - P1 erie/firmware Cluster outage resulted from a single HDD failure - X4370M2 with Erie

Solution



Resolution for #1:

Apply the ODA 2.1.0.3.0 Patch Bundle (Patch 13622348),
then apply the ODA 2.1.0.3.1 Emergency Patch (Patch 13817532), a single patch applied on top of 2.1.0.3.0.



See Urgent Mandatory OAK Patch 2.1.0.3.1 <Document: 1438089.1>

Workaround
------------
1) Power-cycle both servers


Resolution for #2:

Please refer to ODA (Oracle Database Appliance): The Steps to replace failing disks (Doc ID 1496114.1)

 


Bug 17387042 : LNX64-112-CMT: OAKCLI SHOW DISK SHOULD GIVE A WARNING/ERROR IF A DISK IS MISSING (2.8)
- Add cross reference Bug 18418872 : DIAGNOSABILITY IMPROVEMENTS
Non-ODA
ASM Is Not Discovery/Detecting New Candidate Disks On Only One RAC Node Of Several Nodes In The Cluster (Too many open files). (Doc ID 1548607.1) -- Version 10.2.0.1 to 11.2.0.4 Any Platform
 
ODA (Oracle Database Appliance) Different Disks Randomly Disappear After a Reboot (Doc ID 1420126.1)
ALERT - Urgent Mandatory OAK Patch 2.1.0.3.1 for ODA - (Oracle Database Appliance) (Doc ID 1438089.1)
ODA (Oracle Database Appliance) troubleshooting and solutions for ORA-600 [kfdjoin3] causing ASM startup failure after patching to 2.5 or 2.6 (Doc ID 1557502.1)
Replaced ODA drive lists as “UNKNOWN PARTIAL PathsNotLoaded” (Doc ID 1536486.1)
ODA : Disk space usage on boot mount point shows warnings such as Disk Space Usage on mount point: "/boot is (XX%)" (Doc ID 1537133.1)
Bug 17387042 : LNX64-112-CMT: OAKCLI SHOW DISK SHOULD GIVE A WARNING/ERROR IF A DISK IS MISSING
Bug 16231699 - Adding a new disk on ODA appliance hits ORA-600 [kfgpset3] (Doc ID 16231699.8)
Bug 18740794 : ODA SHOW DISK STATUS IS GOOD HOWEVER DISK IS OFFLINED (2.7)
In the same category: the disk has failed, but oakcli shows it as Good/Online.
Bug 18477093 - OAKCLI SHOW DISK SAYS GOOD ONLINE, IN FACT SMART FAILURE

Bug 18418872 : LNX64-112-CMT: DIAGNOSABILITY IMPROVEMENTS
 Dup of bug 17387042: OAKCLI SHOW DISK SHOULD GIVE A WARNING/ERROR IF A DISK  IS MISSING

ODA HW: After SSD replacement, disk shows STATE as UNINITIALIZED and STATE_DETAILS as NewDiskInserted (Doc ID 1461219.1)
bug 16803770 - unable to get rid of uninitialized newdiskinserted

ODA (Oracle Database Appliance): ORA-00600: [kfdApplianceDiskNum0] adding a disk (Doc ID 1457115.1)
The configuration file /opt/oracle/extapi/asmappl.config is missing the correct disk entries.
Edit /opt/oracle/extapi/asmappl.config

Add the missing entries (example for HDD S13):
    disk  /dev/mapper/HDD_E0_S13_1137808215p1       0  13  1
    disk  /dev/mapper/HDD_E0_S13_1137808215p2       0  13  2

Then   alter diskgroup /*+ _OAK_AsmCookie */ [DATA | RECO | REDO] add disk.....
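Before adding entries, it can help to confirm which /dev/mapper devices lack a line in asmappl.config. A minimal sketch, assuming the 'disk <path> ...' entry format shown in the S13 example above (the capture commands and file names are placeholders):

```shell
# Hedged sketch: list devices visible in /dev/mapper that have no matching
# 'disk' entry in asmappl.config. Capture both inputs to files first, e.g.:
#   grep '^[[:space:]]*disk' /opt/oracle/extapi/asmappl.config | awk '{print $2}' | sort -u > /tmp/cfg.txt
#   ls /dev/mapper/HDD_* /dev/mapper/SSD_* | sort -u > /tmp/os.txt
missing_from_config() {
    # $1 = configured device paths, $2 = OS-visible device paths;
    # print the lines of $2 that have no exact-line match in $1
    grep -F -x -v -f "$1" "$2"
}
```

Each device printed would then need its p1/p2 entries added as in the S13 example above.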

 You can check while ASM is running the rebalance:


SQL> select GROUP_NUMBER, OPERATION, STATE, ACTUAL, SOFAR, EST_MINUTES from v$asm_operation;


2.1.x
Bug 13728921 : PHYSICAL DISKS DISAPPEAR AFTER REBOOTING NODE 2
96 - Closed, Duplicate Bug -- Bug 13618428
Bug 13370690 : LNX64-112-CMT:DISK MISSED AT OS AFTER REBOOT (2.1.0.0.0)

#1 Please run 'oakcli update -patch 2.1.0.0.0' to update the firmware version and eliminate the known issues.
@ For #2, disk 10 missing and reappearing after re-seated.
@ This is Seagate FW problem.  
@ CR 7132662 - P1 erie/firmware Cluster outage resulted from a single HDD
@ failure - X4370M2 with Erie
@ .
@ Before Seagate can provide us a fix,
@ we will provide a workaround to reduce disk queue_depth in 2.1.0.3.1.
@ Workaround by oracle is uploaded to 2.1.0.3.1 place holder Bug 13817532.
@   -- Bug 13812384 - QUEUE DEPTH TO BE SET FOR SHARED DISKS


Bug 13618428 : AFTER LOSING ONE ASM DISK, MULTIPLE DISKS BECAME UNRESPONSIVE

None of the disks are actually bad.
@
@ If you are close to the lab, then you can reseat the disk.
@ If you lost slot 3, issue 'oakcli locate pd_03 on'; the LED lights should be on.
@ Then remove the disks, wait 1-2 minutes, insert the disks.  No downtime.
@ Or if you're not close to the lab, then use the ILOM to power down
@ the machine, wait 1-2 minutes, power up the machine.
@ All the missing disks will re-appear.
@ .
@ This is Seagate FW problem.  Before they can provide us a fix,
@ we will provide a workaround to reduce disk queue_depth in 2.1.0.3.1.
@ Workaround will be uploaded to 2.1.0.3.1 place holder Bug 13817532



 

 

References

<NOTE:1438089.1> - ALERT - Urgent Mandatory OAK Patch 2.1.0.3.1 for ODA - (Oracle Database Appliance)
<BUG:13728921> - PHYSICAL DISKS DISAPPEAR AFTER REBOOTING NODE 2
<BUG:13618428> - AFTER LOSING ONE ASM DISK, MULTIPLE DISKS BECAME UNRESPONSIVE
<NOTE:1496114.1> - ODA (Oracle Database Appliance): The Steps to replace multiple disks failing concurrently

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.