Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2111921.1
Update Date: 2016-12-02
Keywords:

Solution Type: Technical Instruction

Solution 2111921.1: How to clear AMBER LED on Exadata Compute nodes when Compute Node disks report offline but no fault detected by RAID controller


Related Items
  • Exadata X3-2 Hardware
  • Exadata X4-2 Hardware
  • Exadata X5-2 Hardware
  • Exadata Database Machine X2-2 Full Rack
Related Categories
  • PLA-Support>Sun Systems>x86>Engineered Systems HW>SN-x64: EXADATA




In this Document
Goal
Solution


Created from <SR 3-12226051943>

Applies to:

Exadata Database Machine X2-2 Full Rack - Version All Versions and later
Exadata X3-2 Hardware - Version All Versions and later
Exadata X4-2 Hardware - Version All Versions and later
Exadata X5-2 Hardware - Version All Versions and later
Information in this document applies to any platform.
How to clear AMBER LED on Exadata Compute nodes when Compute Node disks report offline but no fault detected by RAID controller

Goal

All the disks in the database node have their service and OK2RM LEDs turned on:

Issue reported: Amber LED in the ILOM.
=========================================================================================================
HDD0/SVC | ON
HDD0/OK2RM | ON
HDD1/SVC | ON
HDD1/OK2RM | ON
HDD2/SVC | ON
HDD2/OK2RM | ON
HDD3/SVC | ON
HDD3/OK2RM | ON
=========================================================================================================

However, the RAID controller reports that all the disks are online:

Slot 00 Device 08 (HITACHI H109060SESUN600GA6901516BUEMHX ) status is: Online,
Slot 01 Device 09 (HITACHI H109060SESUN600GA6901516BURE4X ) status is: Online,
Slot 02 Device 11 (HITACHI H109060SESUN600GA6901516BUYJAX ) status is: Online,
Slot 03 Device 10 (HITACHI H109060SESUN600GA6901516BURE8X ) status is: Online,
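The controller's view of the disks can be re-checked from the OS if needed. A minimal check, assuming the MegaCLI utility is installed in its usual location on the compute node (newer images ship storcli instead, so adjust the tool and path to your system):

# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -iE "slot number|firmware state"

Disks that the controller considers healthy report a firmware state of "Online, Spun Up".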

 

Solution

A workaround is provided below:

1) Reset the SP first to clear the service and OK2RM LEDs:

Run the following in the ILOM command line:

-> reset /SP
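For reference, a typical session looks like the following (the ILOM hostname is a placeholder). ILOM prompts for confirmation before resetting, and resetting the SP only reboots the service processor; it does not restart the host operating system:

# ssh root@<db-node-ilom-hostname>
-> reset /SP
Are you sure you want to reset /SP (y/n)? y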

In the ILOM snapshot output below, we can see that the service and OK2RM LEDs are all off after the SP is reset:

===================================================
HDD0/SVC | OFF
HDD0/OK2RM | OFF
HDD1/SVC | OFF
HDD1/OK2RM | OFF
HDD2/SVC | OFF
HDD2/OK2RM | OFF
HDD3/SVC | OFF
HDD3/OK2RM | OFF
===================================================
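The same LED states can also be queried directly from the ILOM CLI instead of collecting a full snapshot. This is only a sketch: the target paths below are an assumption and differ between server models, so browse with show /SYS to find the disk backplane targets on your system:

-> show /SYS/DBP/HDD0/SERVICE     (assumed target path; adjust for your platform)
-> show /SYS/DBP/HDD0/OK2RM       (assumed target path; adjust for your platform)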

 

2) However, the dbmcli -e list physicaldisk output still shows the disks as failed:

# dbmcli -e list physicaldisk
===========================================================================
252:0 BUEMHX failed
252:1 BURE4X failed
252:2 BUYJAX failed
252:3 BURE8X failed
===========================================================================

3) To clear the failed status shown in the dbmcli output above, you will need to edit the cellinit.ora file; however, be sure to edit the correct file.

To find the correct cellinit.ora file, issue the following command to identify the image version:

# imageinfo -ver

Then use the imageinfo -ver output to locate the correct file, which has the form:

/opt/oracle/dbserver_<version>/dbms/deploy/config/cellinit.ora

In this example, imageinfo -ver reported:

12.1.2.1.2.150617.1

therefore the path to the file will be as follows:

/opt/oracle/dbserver_12.1.2.1.2.150617.1/dbms/deploy/config/cellinit.ora

Add the following line to cellinit.ora

 "_cell_allow_reenable_predfail=true"

 

4) Restart MS (the Management Server):

# dbmcli -e alter dbserver restart services all
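To confirm that MS came back up after the restart, the dbserver object can be listed; a hedged check, assuming the dbserver object exposes an msStatus attribute on your image:

# dbmcli -e list dbserver attributes name,msStatus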

 

5) Reenable the physical disks that are marked as failed:

# dbmcli -e alter physicaldisk <pdid> reenable force

To identify the <pdid> of the failed disks, use # dbmcli -e list physicaldisk

For example:

# dbmcli -e alter physicaldisk 252:0 reenable force
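If several disks are flagged, they can be reenabled in one pass. A minimal sketch, assuming dbmcli accepts the same WHERE/ATTRIBUTES filter syntax as CellCLI on your image; verify the list output before running the alter:

# for pd in $(dbmcli -e "list physicaldisk where status='failed' attributes name"); do dbmcli -e "alter physicaldisk $pd reenable force"; done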

Once all failed disks are reenabled, check the # dbmcli -e list physicaldisk output to confirm all disks are normal:

# dbmcli -e list physicaldisk
==========================================================================
252:0 BUEMHX normal
252:1 BURE4X normal
252:2 BUYJAX normal
252:3 BURE8X normal
==========================================================================

  

6) Check the dbmcli -e list alerthistory output to make sure the disk statuses have all changed to normal:

# dbmcli -e list alerthistory
9_2 2016-02-26T14:20:38+08:00 clear "Hard disk status changed to normal. Status : NORMAL Manufacturer : HITACHI Model Number :
H109060SESUN600G Size : 600GB Serial Number : 1516BUEMHX Firmware : A690 Slot Number : 0"
10_1 2016-02-02T19:20:43+08:00 critical "Hard disk failed. Status : FAILED Manufacturer : HITACHI Model Number : H109060SESUN600G
Size : 600GB Serial Number : 1516BURE4X Firmware : A690 Slot Number : 1"
10_2 2016-02-26T14:20:58+08:00 clear "Hard disk status changed to normal. Status : NORMAL Manufacturer : HITACHI Model Number :
H109060SESUN600G Size : 600GB Serial Number : 1516BURE4X Firmware : A690 Slot Number : 1"
11_1 2016-02-02T19:20:47+08:00 critical "Hard disk failed. Status : FAILED Manufacturer : HITACHI Model Number : H109060SESUN600G
Size : 600GB Serial Number : 1516BUYJAX Firmware : A690 Slot Number : 2"
11_2 2016-02-26T14:21:09+08:00 clear "Hard disk status changed to normal. Status : NORMAL Manufacturer : HITACHI Model Number :
H109060SESUN600G Size : 600GB Serial Number : 1516BUYJAX Firmware : A690 Slot Number : 2"
12_1 2016-02-02T19:20:51+08:00 critical "Hard disk failed. Status : FAILED Manufacturer : HITACHI Model Number : H109060SESUN600G
Size : 600GB Serial Number : 1516BURE8X Firmware : A690 Slot Number : 3"
12_2 2016-02-26T14:21:19+08:00 clear "Hard disk status changed to normal. Status : NORMAL Manufacturer : HITACHI Model Number :
H109060SESUN600G Size : 600GB Serial Number : 1516BURE8X Firmware : A690 Slot Number : 3"

  

Note: Leaving "_cell_allow_reenable_predfail=true" in cellinit.ora after the disks are reenabled is not a problem. It is just a parameter that allows MS to reenable the disks. By default (without this parameter), MS will throw an error when trying to change a disk that is in CRITICAL/Failed status back to normal.

You can remove "_cell_allow_reenable_predfail=true" from cellinit.ora after running "dbmcli -e alter physicaldisk <pdid> reenable force" and confirming that the disks are back to normal using "dbmcli -e list physicaldisk".
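A minimal sketch of removing the line again, reusing the example path from step 3 (adjust to your image version; sed -i.bak keeps a backup copy of the file):

# sed -i.bak '/_cell_allow_reenable_predfail/d' /opt/oracle/dbserver_12.1.2.1.2.150617.1/dbms/deploy/config/cellinit.ora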

 

References: This problem is due to Bug 22079518.

This is a workaround to clear the amber LED on an Exadata database node when the disks are incorrectly shown as failed.


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.