ODA: Disk Array Has Multiple Disks Offlined with Yellow Lights Blinking and oakd.log Shows "state changed from: ONLINE to: FAILED"

Asset ID:	1-72-1995656.1
Update Date:	2015-12-28
Solution Type: Problem Resolution Sure

Solution 1995656.1 : ODA: Disk Array Has Multiple Disks Offlined with Yellow Lights Blinking and oakd.log Shows "state changed from: ONLINE to: FAILED"

Applies to:

Oracle Database Appliance X3-2 - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

On the ODA, oakcli commands fail because oakd cannot be reached:

[root@orarac1 ~]# oakcli show disk
Failed to connect to oakd.

The oakd log shows:

...

Line 92628: 2015-03-31 07:59:25.586: [ OAKFW][855628096]{0:0:10641} e0_pd_00 state changed from: ONLINE to: FAILED
Line 92635: 2015-03-31 07:59:25.586: [ OAKFW][855628096]{0:0:10641} e0_data_00 state changed from: ONLINE to: FAILED
Line 92638: 2015-03-31 07:59:25.586: [ OAKFW][855628096]{0:0:10641} e0_reco_00 state changed from: ONLINE to: FAILED
Line 92841: 2015-03-31 07:59:29.901: [ OAKFW][855628096]{0:0:10638} e0_pd_01 state changed from: ONLINE to: FAILED
Line 92854: 2015-03-31 07:59:29.901: [ OAKFW][855628096]{0:0:10638} e0_data_01 state changed from: ONLINE to: FAILED
Line 92858: 2015-03-31 07:59:29.901: [ OAKFW][855628096]{0:0:10638} e0_reco_01 state changed from: ONLINE to: FAILED
Line 93348: 2015-03-31 07:59:51.297: [ OAKFW][855628096]{0:0:10642} e0_pd_02 state changed from: ONLINE to: FAILED
...

Line 94563: 2015-03-31 08:01:16.826: [ OAKFW][855628096]{0:0:10686} e0_data_06 state changed from: ONLINE to: FAILED
Line 94566: 2015-03-31 08:01:16.826: [ OAKFW][855628096]{0:0:10686} e0_reco_06 state changed from: ONLINE to: FAILED
Line 94831: 2015-03-31 08:01:38.180: [ OAKFW][855628096]{0:0:10695} e0_pd_07 state changed from: ONLINE to: FAILED
Line 94844: 2015-03-31 08:01:38.180: [ OAKFW][855628096]{0:0:10695} e0_data_07 state changed from: ONLINE to: FAILED
Line 94850: 2015-03-31 08:01:38.180: [ OAKFW][855628096]{0:0:10695} e0_reco_07 state changed from: ONLINE to: FAILED

...

2015-03-31 15:40:03.558: [ OAKFW][347838016] Logging level for Module: ABC 0
2015-03-31 15:40:03.558: [ OAKFW][347838016] Starting /opt/oracle/oak/log/ora1/oak
2015-03-31 15:40:03.558: [ OAKFW][347838016] ORA_CRS_HOME = /u01/app/12.1.0.2/grid/
2015-03-31 15:40:03.558: [ OAKFW][347838016] ORACLE_HOME = /u01/app/12.1.0.2/grid/
2015-03-31 15:40:03.558: [ OAKFW][347838016] Checking/Waiting for CRS ... <<<<<<<<<<<<<<<<<<<<<<<<
2015-03-31 15:41:04.296: [ OAKFW][347838016] Checking/Waiting for CRS ...
2015-03-31 15:42:05.396: [ OAKFW][347838016] Checking/Waiting for CRS ...
2015-03-31 15:43:06.490: [ OAKFW][347838016] Checking/Waiting for CRS ...
2015-03-31 15:44:07.585: [ OAKFW][347838016] Checking/Waiting for CRS ...
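
When the oakd log is long, it can help to pull out only the ONLINE to FAILED transitions and see which resources failed first. The following is a minimal sketch, not an Oracle-supplied tool: the function name `failed_resources` and the regular expression are assumptions modeled only on the excerpt above, and other oakd versions may format these lines differently.

```python
import re

# Pattern modeled on the oakd.log excerpt in this note (assumption:
# other oakd releases may use a different line layout).
STATE_CHANGE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): .*?"
    r"(?P<resource>e\d+_(?:pd|data|reco)_\d+) state changed from: "
    r"(?P<old>\w+) to: (?P<new>\w+)"
)


def failed_resources(lines):
    """Map each resource to the timestamp of its first ONLINE -> FAILED change.

    Insertion order is preserved, so the first key is the first failure seen.
    """
    failed = {}
    for line in lines:
        m = STATE_CHANGE.search(line)
        if m and m.group("old") == "ONLINE" and m.group("new") == "FAILED":
            failed.setdefault(m.group("resource"), m.group("ts"))
    return failed


# Sample lines copied from the excerpt above; in practice, pass an open
# file object for the oakd log instead of this list.
sample = [
    "Line 92628: 2015-03-31 07:59:25.586: [ OAKFW][855628096]{0:0:10641} e0_pd_00 state changed from: ONLINE to: FAILED",
    "Line 92635: 2015-03-31 07:59:25.586: [ OAKFW][855628096]{0:0:10641} e0_data_00 state changed from: ONLINE to: FAILED",
    "Line 92841: 2015-03-31 07:59:29.901: [ OAKFW][855628096]{0:0:10638} e0_pd_01 state changed from: ONLINE to: FAILED",
    "2015-03-31 15:41:04.296: [ OAKFW][347838016] Checking/Waiting for CRS ...",
]

failures = failed_resources(sample)
for resource, ts in failures.items():
    print(ts, resource)
```

Several resources failing within seconds of each other, as in this case, is a hint to look at a shared component before replacing individual disks.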

=== messages log ===

Mar 31 07:58:29 orarac1 kernel: sd 0:0:23:0: device_blocked, handle(0x0021)
Mar 31 07:59:00 orarac1 kernel: sd 1:0:1:0: [sdz] Unhandled error code
Mar 31 07:59:00 orarac1 kernel: sd 1:0:1:0: [sdz] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 31 07:59:00 orarac1 kernel: sd 1:0:1:0: [sdz] CDB: Read(10): 28 20 00 02 10 03 00 00 01 00
Mar 31 07:59:00 orarac1 kernel: end_request: I/O error, dev sdz, sector 135171 <<<<<<<<<<<<<<<<<<<<<<< First instance of I/O error reported here
Mar 31 07:59:00 orarac1 kernel: sd 1:0:1:0: [sdz] Unhandled error code
Mar 31 07:59:00 orarac1 kernel: sd 1:0:1:0: [sdz] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 31 07:59:00 orarac1 kernel: sd 1:0:1:0: [sdz] CDB: Write(10): 2a 20 04 61 50 60 00 00 20 00
Mar 31 07:59:00 orarac1 kernel: end_request: I/O error, dev sdz, sector 73486432
Mar 31 07:59:00 orarac1 kernel: sd 1:0:6:0: [sdae] Unhandled error code
Mar 31 07:59:00 orarac1 kernel: sd 1:0:6:0: [sdae] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 31 07:59:00 orarac1 kernel: sd 1:0:6:0: [sdae] CDB: Write(10): 2a 20 04 6f 90 60 00 00 20 00
Mar 31 07:59:00 orarac1 kernel: end_request: I/O error, dev sdae, sector 74420320
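
To locate the first I/O error per device in a large messages file, the kernel lines can be filtered with a short script. This is a sketch under stated assumptions: the helper name `first_io_errors` is hypothetical, and the pattern covers only the `end_request: I/O error` format shown in this excerpt.

```python
import re

# Matches only the "end_request: I/O error" lines as they appear in the
# excerpt above (assumption: newer kernels may word this differently).
IO_ERROR = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def first_io_errors(lines):
    """Return {device: (timestamp, sector)} for the first I/O error per device."""
    first = {}
    for line in lines:
        m = IO_ERROR.match(line)
        if m:
            first.setdefault(m.group("dev"), (m.group("ts"), int(m.group("sector"))))
    return first


# Sample lines copied from the excerpt above.
sample = [
    "Mar 31 07:59:00 orarac1 kernel: sd 1:0:1:0: [sdz] Unhandled error code",
    "Mar 31 07:59:00 orarac1 kernel: end_request: I/O error, dev sdz, sector 135171",
    "Mar 31 07:59:00 orarac1 kernel: end_request: I/O error, dev sdz, sector 73486432",
    "Mar 31 07:59:00 orarac1 kernel: end_request: I/O error, dev sdae, sector 74420320",
]

errors = first_io_errors(sample)
for dev, (ts, sector) in errors.items():
    print(ts, dev, sector)
```

Comparing these timestamps with the oakd and ASM alert log entries shows whether the operating system saw the disks disappear before oakd and ASM reacted.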

=== alert_ASM1.log ===

NOTE: process _smon_+asm1 (77434) initiating offline of disk 0.3916384594 (HDD_E0_S00_371307544P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: process _smon_+asm1 (77434) initiating offline of disk 1.3916384600 (HDD_E0_S01_371296312P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: process _smon_+asm1 (77434) initiating offline of disk 2.3916384604 (HDD_E0_S02_371290656P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: process _smon_+asm1 (77434) initiating offline of disk 3.3916384597 (HDD_E0_S03_371296152P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: process _smon_+asm1 (77434) initiating offline of disk 4.3916384607 (HDD_E0_S04_371074080P2) with mask 0x7e in group 2 (RECO) without client assisting
...

NOTE: process _smon_+asm1 (77434) initiating offline of disk 15.3916384603 (HDD_E0_S15_371302992P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: process _smon_+asm1 (77434) initiating offline of disk 16.3916384591 (HDD_E0_S16_371129644P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: process _smon_+asm1 (77434) initiating offline of disk 17.3916384606 (HDD_E0_S17_371165504P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: process _smon_+asm1 (77434) initiating offline of disk 18.3916384605 (HDD_E0_S18_371335024P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: process _smon_+asm1 (77434) initiating offline of disk 19.3916384602 (HDD_E0_S19_371139340P2) with mask 0x7e in group 2 (RECO) without client assisting
NOTE: checking PST: grp = 2
Tue Mar 31 07:59:01 2015
GMON checking disk modes for group 2 at 10 for pid 18, osid 77434
Tue Mar 31 07:59:01 2015
ERROR: too many offline disks in PST (grp 2)
Tue Mar 31 07:59:01 2015

=== alert_ASM2.log ===

Tue Mar 31 07:59:01 2015
ERROR: no read quorum in group: required 3, found 0 disks
ERROR: Could not read PST for grp 2. Force dismounting the disk group.
Tue Mar 31 07:59:01 2015
ERROR: no read quorum in group: required 3, found 0 disks
ERROR: Could not read PST for grp 3. Force dismounting the disk group.
Tue Mar 31 07:59:01 2015
Dirty Detach Reconfiguration complete (total time 0.2 secs)
Tue Mar 31 07:59:01 2015
WARNING: Offline of disk 20 (SSD_E0_S20_805849400P1) in group 3 and mode 0x7f failed on ASM inst 2
WARNING: Offline of disk 21 (SSD_E0_S21_805838554P1) in group 3 and mode 0x7f failed on ASM inst 2
WARNING: Offline of disk 22 (SSD_E0_S22_805838515P1) in group 3 and mode 0x7f failed on ASM inst 2
WARNING: Offline of disk 23 (SSD_E0_S23_805838503P1) in group 3 and mode 0x7f failed on ASM inst 2
Tue Mar 31 07:59:01 2015
WARNING: Offline of disk 0 (HDD_E0_S00_371307544P1) in group 1 and mode 0x7f failed on ASM inst 2
WARNING: Offline of disk 1 (HDD_E0_S01_371296312P1) in group 1 and mode 0x7f failed on ASM inst 2
WARNING: Offline of disk 2 (HDD_E0_S02_371290656P1) in group 1 and mode 0x7f failed on ASM inst 2
WARNING: Offline of disk 3 (HDD_E0_S03_371296152P1) in group 1 and mode 0x7f failed on ASM inst 2
WARNING: Offline of disk 4 (HDD_E0_S04_371074080P1) in group 1 and mode 0x7f failed on ASM inst 2
Tue Mar 31 07:59:01 2015
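
The ASM alert logs can be summarized by disk group to show how widespread the offline events were. The sketch below is illustrative only: `offlined_disks_by_group` is a hypothetical helper, and its pattern is an assumption derived from the two message formats quoted above (the `initiating offline of disk` NOTE and the `Offline of disk ... failed` WARNING).

```python
import re
from collections import defaultdict

# Covers both message shapes quoted in this note (assumption: other ASM
# versions may phrase offline events differently).
OFFLINE = re.compile(
    r"[Oo]ffline of disk \d+(?:\.\d+)? \((?P<name>\w+)\).*? in group (?P<group>\d+)"
)


def offlined_disks_by_group(lines):
    """Return {group number: [disk names]} for every offline event found."""
    groups = defaultdict(list)
    for line in lines:
        m = OFFLINE.search(line)
        if not m:
            continue
        group, name = int(m.group("group")), m.group("name")
        if name not in groups[group]:  # ignore repeated events for a disk
            groups[group].append(name)
    return dict(groups)


# Sample lines copied from the alert log excerpts above.
sample = [
    "NOTE: process _smon_+asm1 (77434) initiating offline of disk 0.3916384594 (HDD_E0_S00_371307544P2) with mask 0x7e in group 2 (RECO) without client assisting",
    "NOTE: process _smon_+asm1 (77434) initiating offline of disk 1.3916384600 (HDD_E0_S01_371296312P2) with mask 0x7e in group 2 (RECO) without client assisting",
    "ERROR: too many offline disks in PST (grp 2)",
    "WARNING: Offline of disk 20 (SSD_E0_S20_805849400P1) in group 3 and mode 0x7f failed on ASM inst 2",
    "WARNING: Offline of disk 0 (HDD_E0_S00_371307544P1) in group 1 and mode 0x7f failed on ASM inst 2",
]

by_group = offlined_disks_by_group(sample)
for group, disks in sorted(by_group.items()):
    print(f"group {group}: {len(disks)} disk(s) offlined: {', '.join(disks)}")
```

When every disk group loses disks in the same second, as the excerpts show, the failure is unlikely to be limited to one drive.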

Cause

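The disk array shows blinking yellow lights on disks 0 to 8. Because several disks went offline within minutes of each other, this indicates a problem with the disk controller rather than with the individual disks.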

Solution

Rebooting the storage node resolved the problem.

After the reboot, cluster services came back up and oakcli commands worked again.
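
A quick way to confirm recovery is to rerun the command from the Symptoms section and check that the connection error is gone. The following is a rough sketch, assuming it runs as root on an ODA node with `oakcli` in the PATH. The function name `oakd_reachable` and the injectable `run` parameter are hypothetical, and the check relies only on the error text quoted in this note, not on any documented oakcli exit code.

```python
import subprocess
from types import SimpleNamespace

# Error text taken from the Symptoms section of this note.
OAKD_DOWN_MESSAGE = "Failed to connect to oakd."


def oakd_reachable(run=subprocess.run):
    """Return True if `oakcli show disk` does not report the oakd connection error.

    `run` is injectable so the check can be exercised without an ODA.
    """
    result = run(["oakcli", "show", "disk"], capture_output=True, text=True)
    output = (result.stdout or "") + (result.stderr or "")
    return OAKD_DOWN_MESSAGE not in output


# Offline demonstration with stand-in runners (no real oakcli call is made).
def fake_down(cmd, **kwargs):
    return SimpleNamespace(stdout="Failed to connect to oakd.\n", stderr="", returncode=1)


def fake_up(cmd, **kwargs):
    # Placeholder text only; this is not real oakcli output.
    return SimpleNamespace(stdout="placeholder disk listing\n", stderr="", returncode=0)


print(oakd_reachable(run=fake_down))
print(oakd_reachable(run=fake_up))
```

On a real node, call `oakd_reachable()` with no arguments after the reboot. A `True` result only means oakd answered, so the disk states reported by oakcli still need to be reviewed.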

References

<NOTE:1424493.1> - ODA (Oracle Database Appliance): ILOM Oracle Integrated Lights Out Manager
