Asset ID: 1-72-1940986.1
Update Date: 2017-07-19
Solution Type: Problem Resolution (Sure)
Document ID: 1940986.1
ODA Outage including diskgroup offlined: ASM alert.log WARNING: Waited 15 secs for write IO to PST disk [0,1...23] in group [ 1 | 2 | 3 ]
Related Items
- Oracle Database Appliance X4-2
- Oracle Database Appliance
- Oracle Database Appliance X3-2
- Oracle Database Appliance Software
Related Categories
- PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Database Appliance>DB: ODA_EST
The ODA can encounter an outage if multiple disks are offlined. There is more than one potential source for this problem. A distinct symptom of one such problem, found in the ASM alert.log, is "WARNING: Waited 15 secs for write IO to PST" disk xx in group 1, 2 or 3. A workaround for this problem is to increase the default value of _asm_hbeatiowait in the ASM spfile to 120 seconds. This note discusses the problem, its symptoms and the solution.
In this Document
Created from <SR 3-9779177991>
Applies to:
Oracle Database Appliance X3-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance X4-2 - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance - Version All Versions to All Versions [Release All Releases]
Oracle Database Appliance Software - Version 2.1.0.1 to 12.1.2.9 [Release 2.1 to 12.1]
Information in this document applies to any platform.
ODA, Node Outage, Crash, ASM, Poor Performance, Diskgroup offline
Symptoms
This problem has a few distinctive symptoms; the most severe is a node crash:
- Diskgroup outage
- Very Slow IO Performance*
- Possible very high CPU
- Timeouts for IO
- Communications to ASM, CRS or CSS failures
More:
- Only one node active, the other one hangs while starting ASM.**
- After an outage the Node restarts, but IO Waits are very high
- Overall very slow performance on one node, but no load or other evidence of why IO wait statistics are so high
* Note:
First confirm whether your issue involves very slow DISKGROUP-level performance.
Then investigate further to confirm whether the problem appears related to a single substandard disk.
If so, the enhancement in UEK4, included in ODA 12.1.2.11.0, may resolve your problem.
** If your symptom is ONLY seen on one node then continue with this note for further information.
Changes
None known to the users or DBA.
Review of disk performance in the alert.log or other sources will usually reveal substandard performance on at least one disk.
Many delayed ASM PST heartbeats are seen on ASM disks in a normal or high redundancy diskgroup.
This results in the ASM instance dismounting the diskgroup.
ASM ALERT.LOG
The disk number can range from 0 up to 23, and the diskgroup can be 1 (DATA), 2 (RECO) or 3 (REDO).
...
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 2
...
...
NOTE: process _b000_+asm1 (12580) initiating offline of disk 0.3915926799 (HDD_E0_S00_576669152P1) with mask 0x7e in group 1 << Can be for any ODA Diskgroup: 1 (DATA), 2 (RECO) or 3 (REDO)
NOTE: process _b000_+asm1 (12580) initiating offline of disk 1.3915926797 (HDD_E0_S01_576659536P1) with mask 0x7e in group 1 << Can be any disk including HDD [0-19] or SSD [20-23]
NOTE: process _b000_+asm1 (12580) initiating offline of disk 2.3915926788 (HDD_E0_S02_576440136P1) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 14 for pid 50, osid 12580
...
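When triaging, it helps to tally the PST write warnings shown above per diskgroup and disk to see whether a single substandard disk dominates. A minimal sketch (the log-line pattern is taken from the excerpts in this note; feeding it lines from your own alert.log is up to you):

```python
import re
from collections import Counter

# Matches the PST write warnings quoted in this note, e.g.
# "WARNING: Waited 15 secs for write IO to PST disk 0 in group 2."
PST_WAIT = re.compile(r"Waited \d+ secs for write IO to PST disk (\d+) in group (\d+)")

def count_pst_waits(lines):
    """Return a Counter keyed by (group, disk) for each PST wait warning seen."""
    hits = Counter()
    for line in lines:
        m = PST_WAIT.search(line)
        if m:
            disk, group = int(m.group(1)), int(m.group(2))
            hits[(group, disk)] += 1
    return hits
```

A disk that accounts for most of the warnings is the first candidate for the substandard-disk investigation described later in this note.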
Symptom - GMON
...
GMON dismounting group 2 at 151 for pid 24, osid 13912 << Can be for any ODA Diskgroup: 1 (DATA), 2 (RECO) or 3 (REDO)
NOTE: Disk SSD_E0_S20_805853057P1 in mode 0x7f marked for de-assignment << Can be any disk including HDD [0-19] or SSD [20-23]
NOTE: Disk SSD_E0_S21_805849551P1 in mode 0x7f marked for de-assignment
NOTE: Disk SSD_E0_S22_805853058P1 in mode 0x7f marked for de-assignment
NOTE: Disk SSD_E0_S23_805852406P1 in mode 0x7f marked for de-assignment
Another very common accompanying error is Read Failures found in the ASM alert.log. The "repairing group", read failure and success messages become more and more frequent.
ASM ALERT.LOG
...
...
NOTE: repairing group 1 file 459 extent 34 ------- The read failures and Repairing warnings can be several hours to days before the outage!
...
WARNING: Read Failed. group:1 disk:2 AU:54900 offset:3145728 size:1048576
...
SUCCESS: extent 34 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline -- Followed by SUCCESS - relocation messages
...
NOTE: repairing group 1 file 459 extent 54 ------- More Repairing Group messages!
NOTE: repairing group 1 file 459 extent 54
...
WARNING: Read Failed. group:1 disk:2 AU:54904 offset:2097152 size:1048576 -------- More Read Failed messages...
NOTE: repairing group 1 file 459 extent 54
WARNING: Read Failed. group:1 disk:2 AU:54904 offset:2097152 size:1048576
...
---- This pattern continues, with the WARNING: Read Failed, NOTE: repairing group and SUCCESS: relocating messages recurring in an ever tighter time loop
...
...
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
SUCCESS: extent 54 of file 459 group 1 repaired by relocating to a different AU on the same disk or the disk is offline
...
NOTE: repairing group 1 file 459 extent 54
SUCCESS: extent 54 of file 459 group 1 repaired - all online mirror sides found readable, no repair required
SUCCESS: extent 54 of file 459 group 1 repaired - all online mirror sides found readable, no repair required
...
--- UNTIL FAILURE with any of several assorted messages:
...
Received dirty detach msg from inst 1 for dom 2
Fri Oct 31 02:02:36...
List of instances:
1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 8)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE
The ASM Alert log may show some or all of the following errors:
ORA-29701
Unexpected return code (6) from the Cluster Synchronization Service (LCK0)
Please check the CSS log file for more detail
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_lck0_14093.trc:
ORA-29701: unable to connect to Cluster Synchronization Service
LCK0 (ospid: 14093): terminating the instance due to error 29701
ORA-600 [kfcDismount01], [0]
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_b002_18363.trc (incident=98337):
ORA-00600: internal error code, arguments: [kfcDismount01], [0], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_98337/+ASM1_b002_18363_i98337.trc
...
NOTE: killing foreground process 56 (1234) for state cleanup
...
WARNING: client [orcl1:oda1] cleanup delayed; waited 306s, pid 1234 mbr 0x1
...
NOTE: timeout waiting for prior umbilicus process to exit; 300s
ORA-15081
WARNING: Read Failed. group:2 disk:0 AU:3 offset:0 size:4096
NOTE: successfully read ACD block gn=2 blk=0 via retry read
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_lgwr_4645.trc:
ORA-15081: failed to submit an I/O operation to a disk
WARNING: Write Failed. group:2 disk:0 AU:3 offset:0 size:4096
WARNING: Write Failed. group:2 disk:0 AU:5 offset:2699264 size:4096
NOTE: cache initiating offline of disk 0 group RECO
Look for multiple _DROPPED messages in the ASM alert.log
Messages in the ASM alert.log will show one disk after another being offlined until either the ASM redundancy limit is reached or the voting disks no longer have a quorum.
ASM high redundancy allows up to two offlined disks; normal redundancy allows only one.
After this the diskgroup is offlined, and the database/instance comes down; these messages can also be found in the ASM alert.logs.
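The tolerance rule above can be sketched as a tiny decision helper. This is a simplified illustration of the offline-disk limits stated in this note, not a model of ASM's actual partnership and quorum logic:

```python
# Offline-disk tolerance as described in this note:
# high redundancy tolerates up to two offlined disks, normal redundancy one.
TOLERANCE = {"NORMAL": 1, "HIGH": 2}

def diskgroup_survives(redundancy, offlined_disks):
    """Return True while the diskgroup can stay mounted (illustrative only)."""
    return offlined_disks <= TOLERANCE[redundancy.upper()]
```

So a third offlined disk in a high redundancy diskgroup (or a second in normal redundancy) is the point at which the diskgroup dismounts.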
Excerpt from an ASM Alert.log with this problem
"...NOTE: repairing group 1 file 270 extent 60731...NOTE: repairing group 1 file 270 extent 61031...NOTE: repairing group 1 file 270 extent 60871...SQL> alter diskgroup DATA drop disk 'HDD_E0_S02_........7P1' force rebalance power 11 wait < at this point we quit trying to repair the disk and proceed to DROP the problem disk ...
Later, with HDD #2 offlined, we see the same problem for another disk, #4. The problem can occur for ANY disk; the example here is for disks 2, 3 and 4.
...
...
SQL> alter diskgroup DATA drop disk 'HDD_E0_S04_........2P1' force rebalance power 11 nowait
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
...
-- Later, with HDD #2 and #4 offline we try to offline a 3rd disk
...
SQL> alter diskgroup DATA drop disk 'HDD_E0_S03_.........4P1' force rebalance power 32 nowait
NOTE: GroupBlock outside rolling migration privileged region
...
...
NOTE: cache closing disk 2 of grp 1: (not open) _DROPPED_0002_DATA
...
...
Nov 08 14:20:18 2014
GMON updating for reconfiguration, group 1 at 64 for pid 45, osid 12361
NOTE: cache closing disk 4 of grp 1: (not open) _DROPPED_0004_DATA
NOTE: group 1 PST updated.
SUCCESS: grp 1 disk _DROPPED_0004_DATA going offline
Cause
Unpublished Bug 18409717: ODA 2.8 : GMON FORCE DISMOUNTED DISKGROUP :WARNING: WAITED 15 SECS FOR WRITE IO
GMON failed to update the mode on the disks, causing the diskgroup to force dismount and this ASM outage.
This bug is fixed in ODA 2.10.
ASM Disks Offline When Few Paths In The Storage Is Lost (Doc ID 1581684.1).
<Bug 18342714> DISKS GO OFFLINE (DG UNMOUNT) AFTER UPGRADE TO 2.9.
<Bug 17043894>
Comment: a useful command for future reference, run from GRID_HOME/bin on the ODA:
/u01/app/11.2.0.4/grid/bin/oclumon dumpnodeview -allnodes -v -last "00:00:05" < "00:00:05" is the duration back from the current point in time, i.e. 5 seconds
Unpublished <Bug 18409717> ODA 2.8 : GMON FORCE DISMOUNTED DISKGROUP :WARNING: WAITED 15 SECS FOR WRITE IO
Unpublished <Bug 17274537> -- Internal bug
In the ASM instance alert.log:
Mon Aug 05 23:54:18 2013
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 2
SUPPORT - UPGRADE TO THE MOST CURRENT ODA VERSION
For this problem to occur, check for a _substandard_ disk.
This is not well documented anywhere: the symptoms require a poorly performing disk, but one that does not go offline, i.e. not a 100% bad disk.
The poor performance is needed to trigger the problem, because the bug sends a message to all other disks in the disk group to go offline when the 15 second timeout is detected.
An interesting potential workaround: if you find the substandard disk and pull it, you actually avoid the problem, versus leaving the substandard-performance disk in the system.
This means that if you do have an outage and also know the disk, pull it as soon as possible, or before a reboot, to keep this cascading problem from occurring.
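The trigger mechanism can be illustrated with a small sketch: a disk whose PST write latency exceeds _asm_hbeatiowait (default 15 seconds) sets off the cascade, which is why raising the timeout to 120 seconds works around the problem. The latency figures below are hypothetical; this is a model of the threshold rule only, not of ASM internals:

```python
# Illustrative model: a disk whose PST write takes longer than
# _asm_hbeatiowait (default 15 seconds) triggers the offline cascade.
DEFAULT_HBEAT_IO_WAIT = 15  # seconds

def slow_disks(pst_write_secs, timeout=DEFAULT_HBEAT_IO_WAIT):
    """Return the disks whose PST write latency exceeded the heartbeat timeout.

    pst_write_secs maps a disk name to its observed PST write time in seconds.
    """
    return sorted(d for d, secs in pst_write_secs.items() if secs > timeout)
```

With the workaround value of 120 seconds, the same substandard disk no longer trips the timeout, so the cascading offlines are avoided.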
Solution
1) On both ASM instances
alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';
e.g.
alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';
2) Then restart the ASM instance on the node with the disk problem(s)
3) Last, restart CRS (or use "oakcli restart oak") on the node with the disk problem(s)
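If you script step 1, a small helper can render the ALTER SYSTEM command with a sanity check on the value. This is purely a string-building sketch (it does not connect to ASM), and the rule that the value should exceed the 15 second default is an assumption drawn from this note's workaround:

```python
# Renders the ALTER SYSTEM statement from step 1 of the Solution.
# Sanity check only: the value should exceed the 15 second default
# (this note's workaround uses 120).
def hbeatiowait_sql(seconds):
    if not (isinstance(seconds, int) and seconds > 15):
        raise ValueError("value should exceed the 15 second default")
    return f'alter system set "_asm_hbeatiowait"={seconds} scope=spfile sid=\'*\';'
```

For example, hbeatiowait_sql(120) reproduces the statement shown in step 1; the restart steps 2 and 3 are still required for the spfile change to take effect.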
References
<BUG:17274537> - ASM DISK GROUP FORCE DISMOUNTED DUE TO SLOW I/OS
<BUG:18409717> - ODA 2.8 : GMON FORCE DISMOUNTED DISKGROUP :WARNING: WAITED 15 SECS FOR WRITE IO
<NOTE:1581684.1> - ASM diskgroup dismount with "Waited 15 secs for write IO to PST"
<BUG:17043894> - DISKGROUP DISMOUNTS IF 2 OUT OF 8 PATHS LOST
<BUG:18342714> - DISKS GO OFFLINE AFTER UPGRADE TO 2.9
Attachments
This solution has no attachment