ASM Grid disks dropped randomly or missing after reboot

Asset ID:	1-72-2097468.1
Update Date:	2016-03-17
Keywords:

Solution Type Problem Resolution Sure

Solution 2097468.1 : ASM Grid disks dropped randomly or missing after reboot

Applies to:

Oracle Database Appliance X5-2 - Version All Versions and later
Information in this document applies to any platform.

Symptoms

Oracle cluster services go down randomly and some ASM grid disks are dropped and reported missing, or some ASM disks are missing after a reboot.

The following query, run in ASM, lists the missing disks:

select group_number, disk_number, name, label, path, redundancy, mount_status, header_status from v$asm_disk where mount_status='MISSING'

You can add the disks back, but the same or other disks are eventually dropped again for no apparent reason and may show as missing after a reboot.

The ASM alert log shows messages such as the following for the dropped disks:

WARNING: Started Drop Disk Timeout for Disk 2 (SSD_E0_S23_1310812220P1) in group 4 with a value 12960
WARNING: Started Drop Disk Timeout for Disk 3 (SSD_E0_S22_1310812236P1) in group 4 with a value 12960

WARNING: Disk 2 (SSD_E0_S23_1310812220P1) in group 4 will be dropped in: (12042) secs on ASM inst 2
WARNING: Disk 3 (SSD_E0_S22_1310812236P1) in group 4 will be dropped in: (12042) secs on ASM inst 2

A review of the disks and hardware does not indicate any disk issues.
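
The "will be dropped in" warnings shown above can be pulled out of an alert log programmatically to see which disks ASM has scheduled for dropping. The following is a minimal sketch, not an Oracle-supplied tool: the function name pending_drops and the regular expression are assumptions derived only from the two sample lines in this note, so other alert log formats may need a different pattern.

```python
import re

# Pattern built from the sample warnings in this note (an assumption, not an
# official format definition).
DROP_RE = re.compile(
    r"WARNING: Disk (?P<disk>\d+) \((?P<name>[^)]+)\) in group (?P<group>\d+) "
    r"will be dropped in: \((?P<secs>\d+)\) secs on ASM inst (?P<inst>\d+)"
)


def pending_drops(lines):
    """Return one dict per 'will be dropped in' warning found in lines."""
    found = []
    for line in lines:
        m = DROP_RE.search(line)
        if m:
            found.append({
                "disk": int(m.group("disk")),
                "name": m.group("name"),
                "group": int(m.group("group")),
                "secs": int(m.group("secs")),
                "inst": int(m.group("inst")),
            })
    return found


# Sample lines copied from the alert log excerpt in this note.
sample = [
    "WARNING: Disk 2 (SSD_E0_S23_1310812220P1) in group 4 will be dropped in: (12042) secs on ASM inst 2",
    "WARNING: Disk 3 (SSD_E0_S22_1310812236P1) in group 4 will be dropped in: (12042) secs on ASM inst 2",
]

for d in pending_drops(sample):
    print(f"{d['name']}: group {d['group']}, dropped in {d['secs']} secs")
```

To use it on a real system, pass an open file handle for the ASM alert log instead of the sample list; the location of that log is site specific.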

Cause

Ethernet cables are connected to the Ethernet ports on the I/O Modules (IOM) in the storage units.

These Ethernet ports are not used in production, only in Oracle Manufacturing. If Ethernet cables are attached to them, the IOMs can perform a watchdog reset, which causes loss of disk connectivity on the host side.

In the /var/log/messages log after a reboot, notice the SCSI address 8:0:292:0:

Dec 31 18:52:49 MyCompanyODA0 kernel: scsi 8:0:292:0: Direct-Access HGST HSCAC2DA6SUN200G A122 PQ: 0 ANSI: 6
Dec 31 18:52:49 MyCompanyODA0 kernel: scsi 8:0:292:0: SSP: handle(0x001e), sas_addr(0x5000cca04e21534e), phy(20), device_name(0x5000cca04e21534f)
Dec 31 18:52:49 MyCompanyODA0 kernel: scsi 8:0:292:0: SSP: enclosure_logical_id(0x5080020001e293fe), slot(90)
Dec 31 18:52:49 MyCompanyODA0 kernel: scsi 8:0:292:0: serial_number(001526JLA50A 0QVLA50A)
Dec 31 18:52:49 MyCompanyODA0 kernel: scsi 8:0:292:0: qdepth(254), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
Dec 31 18:52:49 MyCompanyODA0 kernel: sd 8:0:292:0: [sdah] Enabling DIF Type 1 protection
Dec 31 18:52:49 MyCompanyODA0 kernel: sd 8:0:292:0: [sdah] 390721968 512-byte logical blocks: (200 GB/186 GiB)
Dec 31 18:52:49 MyCompanyODA0 kernel: sd 8:0:292:0: [sdah] 4096-byte physical blocks
Dec 31 18:52:49 MyCompanyODA0 kernel: sd 8:0:292:0: [sdah] Write Protect is off
Dec 31 18:52:49 MyCompanyODA0 kernel: sd 8:0:292:0: [sdah] Write cache: disabled, read cache: enabled, supports DPO and FUA

Notice that the third number in the address is 292. There are certainly not that many disks connected to an ODA.

An incorrect number is any value higher than 24, or higher than 48 if an additional storage shelf has been added. Such a value indicates this behavior.

This is a symptom of what happens when Ethernet cables are attached to the IOMs in the storage units. They cause the IOMs to reset, and then Linux re-initializes information related to the disks.

In a normal /var/log/messages log, the addresses run from 8:x:0:x through 8:x:23:0 and from 9:x:0:x through 9:x:23:0:

Dec 31 18:59:00 MyCompanyODA0 kernel: scsi 8:0:23:0: Direct-Access HGST HSCAC2DA6SUN200G A122 PQ: 0 ANSI: 6
Dec 31 18:59:00 MyCompanyODA0 kernel: scsi 8:0:23:0: SSP: handle(0x0021), sas_addr(0x5000cca04e21683e), phy(23), device_name(0x5000cca04e21683f)
Dec 31 18:59:00 MyCompanyODA0 kernel: scsi 8:0:23:0: SSP: enclosure_logical_id(0x5080020001e293fe), slot(93)
Dec 31 18:59:00 MyCompanyODA0 kernel: scsi 8:0:23:0: serial_number(001526JLBK7A 0QVLBK7A)
Dec 31 18:59:00 MyCompanyODA0 kernel: scsi 8:0:23:0: qdepth(254), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
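
The comparison between the abnormal and normal log excerpts above can be automated by scanning kernel lines for SCSI addresses whose third field is outside the expected range. This is a hedged sketch only: the function name abnormal_targets, the regular expression, and the default limit of 23 are assumptions based on the excerpts and the 0 through 23 range described in this note, and the limit would need to be raised for a system with an additional storage shelf.

```python
import re

# Matches the four-part SCSI address in kernel "scsi" and "sd" lines,
# e.g. "kernel: scsi 8:0:292:0:" (pattern assumed from the excerpts above).
ADDR_RE = re.compile(r"kernel: (?:scsi|sd) (\d+):(\d+):(\d+):(\d+):")


def abnormal_targets(lines, max_target=23):
    """Return sorted unique addresses whose third field exceeds max_target."""
    bad = set()
    for line in lines:
        m = ADDR_RE.search(line)
        if m and int(m.group(3)) > max_target:
            bad.add(tuple(int(g) for g in m.groups()))
    return sorted(bad)


# Sample lines copied from the /var/log/messages excerpts in this note.
sample = [
    "Dec 31 18:52:49 MyCompanyODA0 kernel: sd 8:0:292:0: [sdah] Write Protect is off",
    "Dec 31 18:59:00 MyCompanyODA0 kernel: scsi 8:0:23:0: serial_number(001526JLBK7A 0QVLBK7A)",
]

for h, c, t, l in abnormal_targets(sample):
    print(f"unexpected SCSI address {h}:{c}:{t}:{l}")
```

On a live node, the same function can be given an open handle on /var/log/messages in place of the sample list. An empty result is consistent with the normal numbering, but it does not replace checking the cabling itself.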

Solution

Disconnect the Ethernet cables from the IOMs on the storage unit.

Then, reboot both nodes.

The normal and healthy way that disk information is recorded in the messages file uses 8:0:0:0 through 8:0:23:0 and 9:0:0:0 through 9:0:23:0.
If that is what you see when you check your current OS /var/log/messages file, there is nothing further to do: the server and OS are operating normally.

Attachments

This solution has no attachment