
Asset ID: 1-72-1461219.1
Update Date: 2017-05-18
Keywords:

Solution Type: Problem Resolution (Sure)

Solution 1461219.1 : ODA HW: After SSD replacement, disk shows STATE as UNINITIALIZED and STATE_DETAILS as NewDiskInserted


Related Items
  • Oracle Database Appliance Software
  • Oracle Database Appliance
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Database Appliance>DB: ODA_EST


After disk replacement, disk shows STATE as UNINITIALIZED and STATE_DETAILS as NewDiskInserted

Created from <SR 3-5625470051>

Applies to:

Oracle Database Appliance - All Versions
Oracle Database Appliance Software - Version 2.1.0.1 and later
Linux x86-64

Symptoms
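
The disk listing below matches the column format of "oakcli show disk" (assumed here to be the command that produced the output):

   # oakcli show disk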


NAME      PATH        TYPE    STATE           STATE_DETAILS
------------------------------------------------------------------
pd_00     /dev/sdam   HDD     ONLINE          Good
...
pd_21     /dev/sdat   SSD     ONLINE          Good
pd_22     /dev/sdaf   SSD     UNINITIALIZED   NewDiskInserted   <<<<
pd_23     /dev/sdah   SSD     ONLINE          Good

  

Changes

A defective disk was replaced.

Cause

There are currently several potential causes for a disk showing as UNINITIALIZED after disk replacement.

This note discusses some of the potential causes and corrective actions, but should not be considered comprehensive.
 

Solution

Development has suggested the following methods:

Reason #1.

After pulling or removing the problem disk, not enough time was allowed before re-inserting the NEW disk.

A) Perform the following steps (run the commands on both nodes):

   1) multipath -F
   2) multipath -v2
   3) remove the disk again
   4) wait five minutes
   5) insert the disk back
   6) oakcli restart oak     - on the first node, then
      oakcli restart oak     - on the second node
   7) oakcli show ismaster   < should show ASMASTER on the first node and ASSLAVE on the second node

  

The above procedure does not require downtime. For reference:

   multipath -F    ==> flush all unused multipath device maps
   multipath -v2   ==> print all info: detected paths, coalesced paths (i.e. multipaths), and device maps
 

NOTE:

It is possible that metadata for the OLD disk still exists after a disk replacement fails.
There are a few key areas where this information can persist, including multipath.conf and asmappl.config.
If the old disk information is still present AND references anything other than group 0, more cleanup is required.

Comment:

   Group 0 means the disk is not associated with a working ODA diskgroup.
   Groups 1, 2, 3 (and, on X5-2, diskgroup 4) are working ASM diskgroups.
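
A hedged way to look for stale OLD-disk metadata (the asmappl.config path is assumed; <old_wwid> is a placeholder for the OLD disk's multipath WWID, intentionally not filled in here):

   # grep -i <old_wwid> /etc/multipath.conf            # stale multipath entry for the OLD disk
   # grep -i sdaf /opt/oracle/extapi/asmappl.config    # stale appliance-library entry (path assumed)

If stale entries reference a working group (1-4), engage Oracle Support before editing these files.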

  

Note: Restart the oak process with the following command, and make sure that oak is working on both nodes.

On Node 1:

   # oakcli restart oak

On Node 2:

   # oakcli restart oak

Wait 5 minutes and check the status of the disks again.
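
For example, the replaced disk from the Symptoms output (slot 22 there) should now report ONLINE/Good:

   # oakcli show disk | grep pd_22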

A simple diagnostic to confirm disk status and health is the oakcli STORDIAG command.

See Note 1497610.1 for usage.


 

For more comprehensive information, or for assistance via a Service Request, please provide the following:

 

1) For X3-2, X4-2 or X5-2:

       oakcli stordiag e#_pd_##    < where e# is the enclosure (0 or 1) and ## is the SLOT number (0-23)

   For V1:

       oakcli stordiag pd_##       < where ## is the SLOT number
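
For example, for the UNINITIALIZED SSD in slot 22 from the Symptoms output (enclosure 0 assumed here for illustration):

   # oakcli stordiag e0_pd_22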

 

2) oakcli manage diagcollect -storage

   Logs are collected to: /opt/oracle/oak/log/<nodeName>/oakdiag/oakStorage-<nodeName>-<YearMonthDay>_<HHMI>.tar.gz
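
For example, with a hypothetical node name oda0node1 and a collection run on 2017-05-18 at 14:30 (illustrative values only), the archive would be:

   /opt/oracle/oak/log/oda0node1/oakdiag/oakStorage-oda0node1-20170518_1430.tar.gz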

 

Supplemental

 

3) Diskdiag   < very good for hardware-based diagnosis

4) Manual file collection
   - messages file from both nodes, covering the time the disk was added
   - complete oakd.log file from both nodes
   - output of "fwupdate list disk" from both nodes
   - oakcli stordiag e#_pd_##
         Comment: for X5-2, X4-2, X3-2, where e# is the enclosure (0 or 1) and ## is the SLOT number (0-23);
         for ODA V1 use pd_##, where ## is the SLOT number (0-23)
   - ASM alert.logs from each node
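
A hedged collection sketch for the items above (log locations assumed; sdaf is the device name from the Symptoms output):

   # fwupdate list disk                          # disk/firmware inventory, run on both nodes
   # grep -i sdaf /var/log/messages              # messages around the time the disk was added
   # find /opt/oracle/oak/log -name oakd.log     # locate oakd.log on each node for collection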

 

COMMENT - The following had previously been published but is now INTERNAL, as it might cause CORRUPTION. Do not use it:

   "... addasmdisk ..."   << this has bugs in some older versions, especially if there is a SECOND JBOD
                          -- consult with ODA BDE prior to using, unless on 12.1.2.8 or higher

Comment 1 - Suggestion from an Oracle support analyst:

"... Reboot the nodes before running any action plan, and wait 5-10 minutes after the reboot.
     I have rebooted the nodes and this has resolved the issue every time ..."

-- Not always the case, but often a resolution if downtime is not a problem.

 

Comment 2 - Please note, there is also an issue that can occur in special circumstances:

"... The customer in bug 16803770 - unable to get rid of uninitialized newdiskinserted -
     changed the DATA/REDO allocation ratio, upgraded to 2.2, and somehow the patch
     did not update /opt/oracle/oak/conf/oak_conf.xml correctly to that effect. A typical
     sign of that would be the following in oakd.log:

2013-07-02 16:05:30.103: [pd_19][1082747200] {0:0:166} [resource_initialize] Invalid partition size
..."

Example of Partitions

  1. Check the oak_conf.xml file on both nodes to ensure the data/reco ratios reflect the customer-specific settings and not stale default values. The correct values per backup configuration are:

     1. Local Backup
        # grep -A2 'data:' /opt/oracle/oak/conf/oak_conf.xml
                  <Value>data:43:HDD</Value>
                  <Value>reco:57:HDD</Value>
                  <Value>redo:100:SSD</Value>
     2. External Backup
        # grep -A2 'data:' /opt/oracle/oak/conf/oak_conf.xml
                  <Value>data:86:HDD</Value>
                  <Value>reco:14:HDD</Value>
                  <Value>redo:100:SSD</Value>
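
A quick way to confirm both nodes agree (the node name node2 is a hypothetical placeholder):

   # grep -A2 'data:' /opt/oracle/oak/conf/oak_conf.xml > /tmp/oak_ratios_local
   # ssh node2 "grep -A2 'data:' /opt/oracle/oak/conf/oak_conf.xml" | diff /tmp/oak_ratios_local -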

  

 

Comment 3

This problem should be less likely to occur on newer ODA versions:

 "...ALWAYS check the ASM Alert.logs for both nodes and review to confirm that all diskgroups are available.
     A more serious condition can exist due to diskgroups being offlined beyond the redundancy capacity..."
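
A hedged way to locate and scan the ASM alert logs on each node (diagnostic paths vary by installation; /u01/app is assumed here, and the log path in the second command is a placeholder):

   # find /u01/app -name "alert_+ASM*.log" 2>/dev/null
   # grep -iE "offlin|dismount" /path/to/alert_+ASM1.log    # look for offlined disks or dismounted diskgroups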

  

 

 

 


Attachments
This solution has no attachment