Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1537798.1
Update Date: 2016-01-07
Keywords:

Solution Type  Troubleshooting Sure

Solution 1537798.1: Troubleshooting Flashing Amber LED on Oracle Big Data Appliance V2.0 for Non-Failure Status


Related Items
  • Big Data Appliance X3-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>BDA>Big Data Appliance>DB: BDA_EST




In this Document
Purpose
Troubleshooting Steps
 What does a flashing amber LED on a disk drive mean?
 ILOM snapshot indication of a service fault
 Indications that a flashing amber LED does not represent disk failure
 Resetting the service LEDs back to "OFF"
 Verify Flashing LEDs are reset
References


Applies to:

Big Data Appliance X3-2 Hardware - Version All Versions and later
Linux x86-64

Purpose

This note provides a general approach for determining whether a flashing amber LED on an Oracle Big Data Appliance (BDA) disk drive indicates a disk failure. If the analysis does not point to a disk failure, steps to reset the flashing amber LED are given.

Troubleshooting Steps

 

This document makes use of an example where the disk in slot 3 for a server on the BDA is observed to have a flashing amber LED.

 

What does a flashing amber LED on a disk drive mean?

On the Oracle Big Data Appliance (BDA) a bda-monitor service runs in the background and turns on a service LED if any of the following is true for a disk:

1. The disk has failed.


2. The disk shows a predictive failure, based on the value of "Predictive Failure Count" returned for the disk by:

MegaCli64 -PDList -a0| grep -E 'Slot|Failure Count'

A value of "Predictive Failure Count: 0" indicates no predictive failures; any non-zero value triggers an alert (a filter that flags non-zero counts is sketched after the sample output below).

Output for 12 healthy disks looks like:

# MegaCli64 -PDList -a0 | grep -E 'Slot|Failure Count'
Slot Number: 0
Predictive Failure Count: 0
Slot Number: 1
Predictive Failure Count: 0
...
Slot Number: 11
Predictive Failure Count: 0
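
A minimal filter (a convenience sketch, not part of the standard BDA tooling) that prints only slots whose Predictive Failure Count is non-zero:

# MegaCli64 -PDList -a0 | awk '/Slot Number/ {slot=$NF} /Predictive Failure Count/ && $NF != 0 {print "Slot " slot ": Predictive Failure Count = " $NF}'

No output means no disk is reporting a predictive failure.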

 
3. The disk has failed a specific performance test 5 times in a row. The test requires:

/sbin/hdparm -t /dev/sd<X> (where <X> is the device letter for the disk) to return a read speed of 100MB/second or more; a lower result counts as a failed test.

Using /dev/sdc as an example, the output of hdparm (run as root or via sudo) 5 times in a row below shows that this disk is performing in the expected range:

# /sbin/hdparm -t /dev/sdc

/dev/sdc:
 Timing buffered disk reads:  442 MB in  3.00 seconds = 147.12 MB/sec
# /sbin/hdparm -t /dev/sdc
 
/dev/sdc:
 Timing buffered disk reads:  442 MB in  3.01 seconds = 146.98 MB/sec
# /sbin/hdparm -t /dev/sdc

/dev/sdc:
 Timing buffered disk reads:  442 MB in  3.00 seconds = 147.29 MB/sec
# /sbin/hdparm -t /dev/sdc
 
/dev/sdc:
 Timing buffered disk reads:  444 MB in  3.01 seconds = 147.66 MB/sec
# /sbin/hdparm -t /dev/sdc

/dev/sdc:
 Timing buffered disk reads:  444 MB in  3.01 seconds = 147.68 MB/sec

 

The performance test can fail if there is a greater than normal load on the system; in that scenario the disk itself is not necessarily showing performance issues. The Hadoop Distributed File System (HDFS) is tolerant of disk failures and of variations in disk performance, so replacing disks that perform below optimum should not be necessary.

Note that once the LED is turned on, it will not turn off again on its own.
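
The check that bda-monitor performs can be approximated with a small script. The sketch below is illustrative only (it assumes the 100 MB/second threshold and 5 consecutive runs described above, and must run as root):

#!/bin/bash
# Illustrative sketch: run hdparm 5 times against one disk and count runs below 100 MB/sec.
DEV=${1:-/dev/sdc}
THRESHOLD=100
fails=0
for i in {1..5} ; do
  rate=$(/sbin/hdparm -t "$DEV" | awk '/Timing buffered disk reads/ {print int($(NF-1))}')
  echo "Run $i: ${rate:-0} MB/sec"
  [ "${rate:-0}" -lt "$THRESHOLD" ] && fails=$((fails+1))
done
if [ "$fails" -eq 5 ] ; then
  echo "All 5 runs were below $THRESHOLD MB/sec -- the condition that would turn on the service LED."
else
  echo "Disk read performance is within the expected range."
fi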

 

ILOM snapshot indication of a service fault

Once a flashing amber LED is observed on a disk on the BDA, check the file ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out from the ILOM snapshot to confirm that a disk service fault is set. Collect the ILOM snapshot on the server with the flashing amber LED(s) in one of two ways:

1. According to the directions in: How to Collect a Snapshot from an x86 Platform Service Processor version (SP or ILOM) (Doc ID 1448069.1)

In this case uncompress the ILOM snapshot and examine ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out

or

2. Via the output from the "bdadiag snapshot" utility as documented in Oracle Big Data Appliance Diagnostic Information Collection with bdadiag V2.0 (Doc ID 1516469.1).

In this case uncompress the "bdadiag snapshot" output, cd to the ilom subdirectory, uncompress the ILOM snapshot, and examine ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out
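
For option 2, the steps look roughly as follows (a sketch; the archive names and extensions are illustrative and should be adjusted to the actual bdadiag and ILOM snapshot file names):

# tar xjf bdadiag_<clustername>_<timestamp>.tar.bz2
# cd bdadiag_<clustername>_<timestamp>/ilom
# unzip <ilom_snapshot>.zip
# grep -i svc ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out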

The ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out file should show a service fault on the disk for which the flashing amber LED is observed on the BDA. For example, for the disk in slot 3 it should show a service fault on DBP/HDD3:

DBP/HDD0/SVC     | OFF
DBP/HDD1/SVC     | OFF
DBP/HDD2/SVC     | OFF
DBP/HDD3/SVC     | ON
DBP/HDD4/SVC     | OFF
DBP/HDD5/SVC     | OFF
DBP/HDD6/SVC     | OFF
DBP/HDD7/SVC     | OFF
DBP/HDD8/SVC     | OFF
DBP/HDD9/SVC     | OFF
DBP/HDD10/SVC    | OFF
DBP/HDD11/SVC    | OFF

 

A quick alternative to collecting the ILOM snapshot to verify a disk service fault is to issue (as root or sudo):

# ipmitool sunoem led get | grep -i svc
 
DBP/HDD0/SVC     | OFF
DBP/HDD1/SVC     | OFF
DBP/HDD2/SVC     | OFF
DBP/HDD3/SVC     | OFF
DBP/HDD4/SVC     | OFF
DBP/HDD5/SVC     | OFF
DBP/HDD6/SVC     | OFF
DBP/HDD7/SVC     | OFF
DBP/HDD8/SVC     | OFF
DBP/HDD9/SVC     | OFF
DBP/HDD10/SVC    | OFF
DBP/HDD11/SVC    | OFF

 
Check for disk service faults on all servers of a rack with:

# dcli 'ipmitool sunoem led get | grep -i svc'
 
<IP Server 1>: DBP/HDD0/SVC     | OFF
<IP Server 1>: DBP/HDD1/SVC     | OFF
...

<IP Server 1>: DBP/HDD11/SVC     | OFF
...
<same for Server2>
...
<same for Server 18>

 

Note that in the case of a flashing amber light the reported service fault can alternate between ON and OFF. Hence the recommendation is either to collect the ILOM snapshot several times in a row, or to run "ipmitool sunoem led get | grep -i svc" several times in a row, and compare the states to verify a blinking LED.
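
A simple way to do this comparison (a sketch using the same ipmitool command, run as root or via sudo) is to capture the SVC states a few times and diff the samples; any difference between samples points to a blinking LED:

# for i in {1..5} ; do ipmitool sunoem led get | grep -i svc > /tmp/svc_sample_$i.out ; sleep 2 ; done
# diff /tmp/svc_sample_1.out /tmp/svc_sample_2.out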

 

Indications that a flashing amber LED does not represent disk failure

Once one or more disks with flashing amber LEDs are observed on the BDA, and the corresponding service fault is confirmed via ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out, there are several additional checks that can be made to determine whether the observed LED represents a hard or predictive disk failure.

The checks listed below can be used to determine that the disk appears to be in a healthy state in spite of the observed flashing amber LED on the BDA and the corresponding service fault in ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out.

1. bdacheckhw utility reports a healthy hardware profile.

A healthy hardware profile with functional disks looks like the output below. For example, in the case of a flashing amber LED for the disk in slot 3, bdacheckhw reports that the disk is healthy:

# bdacheckhw
...
SUCCESS: Correct disk controller model : LSI MegaRAID SAS 9261-8i
SUCCESS: Correct disk controller firmware major version : 12.12.0
SUCCESS: Sufficient disk controller firmware minor version (>=0079): 0079
SUCCESS: Got expected disk controller enclosure ID : 20
SUCCESS: Correct disk controller PCI address : 13:00.0
SUCCESS: Correct disk controller PCI info : 0104: 1000:0079
SUCCESS: Correct disk controller PCIe slot width : x8
SUCCESS: Correct disk controller battery type : iBBU08
SUCCESS: Correct disk controller battery state : Operational
SUCCESS: Correct number of disks : 12
SUCCESS: Correct disk 0 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 0 firmware (>=61A): 61A
SUCCESS: Correct disk 1 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 1 firmware (>=61A): 61A
SUCCESS: Correct disk 2 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 2 firmware (>=61A): 61A
SUCCESS: Correct disk 3 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 3 firmware (>=61A): 61A
SUCCESS: Correct disk 4 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 4 firmware (>=61A): 61A
SUCCESS: Correct disk 5 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 5 firmware (>=61A): 61A
SUCCESS: Correct disk 6 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 6 firmware (>=61A): 61A
SUCCESS: Correct disk 7 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 7 firmware (>=61A): 61A
SUCCESS: Correct disk 8 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 8 firmware (>=61A): 61A
SUCCESS: Correct disk 9 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 9 firmware (>=61A): 61A
SUCCESS: Correct disk 10 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 10 firmware (>=61A): 61A
SUCCESS: Correct disk 11 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 11 firmware (>=61A): 61A
SUCCESS: Correct disk 0 status : Online, Spun Up No alert
SUCCESS: Correct disk 1 status : Online, Spun Up No alert
SUCCESS: Correct disk 2 status : Online, Spun Up No alert
SUCCESS: Correct disk 3 status : Online, Spun Up No alert
SUCCESS: Correct disk 4 status : Online, Spun Up No alert
SUCCESS: Correct disk 5 status : Online, Spun Up No alert
SUCCESS: Correct disk 6 status : Online, Spun Up No alert
SUCCESS: Correct disk 7 status : Online, Spun Up No alert
SUCCESS: Correct disk 8 status : Online, Spun Up No alert
SUCCESS: Correct disk 9 status : Online, Spun Up No alert
SUCCESS: Correct disk 10 status : Online, Spun Up No alert
SUCCESS: Correct disk 11 status : Online, Spun Up No alert
SUCCESS: Correct number of virtual disks : 12
SUCCESS: Correct slot  0 mapping to HBA target : 0
SUCCESS: Correct slot  1 mapping to HBA target : 1
SUCCESS: Correct slot  2 mapping to HBA target : 2
SUCCESS: Correct slot  3 mapping to HBA target : 3
SUCCESS: Correct slot  4 mapping to HBA target : 4
SUCCESS: Correct slot  5 mapping to HBA target : 5
SUCCESS: Correct slot  6 mapping to HBA target : 6
SUCCESS: Correct slot  7 mapping to HBA target : 7
SUCCESS: Correct slot  8 mapping to HBA target : 8
SUCCESS: Correct slot  9 mapping to HBA target : 9
SUCCESS: Correct slot  10 mapping to HBA target : 10
SUCCESS: Correct slot  11 mapping to HBA target : 11
...

SUCCESS: Big Data Appliance hardware validation checks succeeded
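
To spot problems quickly in a long bdacheckhw report, one option (a convenience sketch, not an official check) is to filter out the SUCCESS lines; any remaining check lines deserve attention:

# bdacheckhw | grep -v SUCCESS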

 

2. megacli64-status.out reports healthy disks

Check the megacli64-status.out file from the bdadiag output, found after uncompressing the bdadiag file at bdadiag_<clustername>_<timestamp>/raid/megacli64-status.out. It should report a healthy status. The output looks like:

Checking RAID status on scaj0107.us.oracle.com
Controller a0:  LSI MegaRAID SAS 9261-8i
No of Physical disks online : 12
Degraded : 0
Failed Disks : 0
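
The key counters can also be pulled directly from the file (a sketch; both values should be 0 on a healthy node):

# grep -E 'Degraded|Failed Disks' bdadiag_<clustername>_<timestamp>/raid/megacli64-status.out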

 

3. megacli64-PdList_long.out shows that the disk or disks with the flashing amber light have a firmware state of "Online, Spun Up" and a foreign state of "None".

Check the megacli64-PdList_long.out file from the bdadiag output, found after uncompressing the bdadiag file at bdadiag_<clustername>_<timestamp>/raid/megacli64-PdList_long.out. It should report a healthy status for the specific disk or disks. The output below confirms a healthy status for the disk in slot 3.

Enclosure Device ID: 20
Slot Number: 3
Drive's postion: DiskGroup: 2, Span: 0, Arm: 0
Enclosure position: 0
Device Id: 16
WWN: 5000C5003482F5D4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online, Spun Up
Is Commissioned Spare : NO
Device Firmware Level: 061A
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c5003482f5d5
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST32000SSSUN2.0T061A1126L6LD5N
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :29C (84.20 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's write cache : Disabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No
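
Rather than reading the full entry, the fields that matter for this check can be extracted for a given slot (a sketch for slot 3; the -A value just needs to be large enough to cover one disk's entry):

# grep -A 45 'Slot Number: 3' bdadiag_<clustername>_<timestamp>/raid/megacli64-PdList_long.out | grep -E 'Firmware state|Foreign State|Predictive Failure Count|S.M.A.R.T alert'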

 

4. megacli64-LdPdInfo.out Virtual Drive output for the disk indicates no problems

Check the megacli64-LdPdInfo.out file from the bdadiag output, found after uncompressing the bdadiag file at bdadiag_<clustername>_<timestamp>/raid/megacli64-LdPdInfo.out. Output like the example below shows a healthy state for the disk or disks with a flashing amber LED. The example output below is for the disk in slot 3.

Virtual Drive: 3 (Target Id: 3)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 1.817 TB
Parity Size         : 0
State               : Optimal
Strip Size          : 64 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Access Policy       : Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Number of Spans: 1
Span: 0 - Number of PDs: 1
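
A quick way to review the state of every virtual drive in the file (a sketch; each virtual drive should report "State : Optimal"):

# grep -E 'Virtual Drive:|^State' bdadiag_<clustername>_<timestamp>/raid/megacli64-LdPdInfo.out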

 

5. hdparm -t /dev/sd<X> indicates that hard disk timings for device reads are currently in line with the performance threshold.

Recall that once the LED is turned ON it remains ON; a performance issue could therefore have occurred previously, causing the LED to flash, while this command verifies that performance is now within the acceptable range.

For example, for the disk in slot 3:

# /sbin/hdparm -t /dev/sdd

/dev/sdd:
 Timing buffered disk reads:  442 MB in  3.00 seconds = 147.12 MB/sec

 

6. A file can be written to the disk and read back successfully.

Verifying that a file can be copied onto and off of the filesystem mounted on the disk or disks with the flashing amber light also indicates that the disk is functioning properly. A quick diagnostic test can be done as below. This test is for the disk in slot 3, whose filesystem is mounted at /u04.

## see what is mounted and identify the filesystem on the disk being tested (here /u04)
# mount

## copy a test file (for example a BDA rpm) to the filesystem on that disk
# cp /<path>/<file> /u04/
# sync

## check for differences between the 2 files
# echo Checking copied file.
# diff /<path>/<file> /u04/<file>

## copy the file back and check for differences
# cp /u04/<file> /tmp/<file>
# sync
# echo Checking file after copying back.
# diff /<path>/<file> /tmp/<file>

 

If there is no other indication of a hard or predictive disk failure associated with the flashing amber light, then as a workaround you can reset the service LED(s) back to the OFF state.

Resetting the service LEDs back to "OFF"

Reset the service LED or LEDs back to the "OFF" state by issuing the following commands as root or sudo:

1. Before doing the reset, re-verify that the disk or disks with the flashing LED do not indicate a hard or predictive failure (a compact filter for this check is sketched after the sample output below).

Query the state of all disks on the node with:

# MegaCli64 -PDList -a0| grep -E 'Slot|state|Failure Count|flagged'| sed 's/Slot/\nSlot/'

Disks without a hard or predictive failure show:

Slot Number: <disk slot>
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No


Output for 12 healthy disks looks like:

# MegaCli64 -PDList -a0| grep -E 'Slot|state|Failure Count|flagged'| sed 's/Slot/\nSlot/'


Slot Number: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 1
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 2
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 3
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 4
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 5
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 6
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 7
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 8
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 9
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 10
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No

Slot Number: 11
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No
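
As a convenience, the same three checks can be reduced to a filter that prints only disks reporting a problem. This is a sketch, not part of the standard BDA tooling; no output means all disks pass:

MegaCli64 -PDList -a0 | awk '
  /Slot Number/                           {slot=$NF}
  /Predictive Failure Count/ && $NF != 0  {print "Slot " slot ": Predictive Failure Count = " $NF}
  /^Firmware state/ && !/Online, Spun Up/ {print "Slot " slot ": " $0}
  /S.M.A.R.T alert/ && /Yes/              {print "Slot " slot ": S.M.A.R.T alert flagged"}'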

 

2. If no failure is indicated, reset the service LEDs to "OFF".

On a single server:

# for a in {0..11} ; do MegaCli64 -PDLocate -stop -PhysDrv[20:$a] -a0 -nolog ; done


For all servers in a rack:

# dcli 'for a in {0..11} ; do MegaCli64 -PDLocate -stop -PhysDrv[20:$a] -a0 -nolog ; done'


For all servers in a cluster:

# dcli -C 'for a in {0..11} ; do MegaCli64 -PDLocate -stop -PhysDrv[20:$a] -a0 -nolog ; done'
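
If only one slot is affected (slot 3 in the running example), the same command can also be issued for that slot alone instead of looping over all 12 slots:

# MegaCli64 -PDLocate -stop -PhysDrv[20:3] -a0 -nolog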

 
Output for resetting the service LEDs to "OFF" for all servers in a rack looks like:

# dcli 'for a in {0..11} ; do MegaCli64 -PDLocate -stop -PhysDrv[20:$a] -a0 -nolog ; done'
<IP Server 1>:
<IP Server 1>: Adapter: 0: Device at EnclId-20 SlotId-0  -- PD Locate Stop Command was successfully sent to Firmware
<IP Server 1>:
<IP Server 1>: Exit Code: 0x00
<IP Server 1>:
<IP Server 1>: Adapter: 0: Device at EnclId-20 SlotId-1  -- PD Locate Stop Command was successfully sent to Firmware
<IP Server 1>:
<IP Server 1>: Exit Code: 0x00
<IP Server 1>:
<IP Server 1>: Adapter: 0: Device at EnclId-20 SlotId-2  -- PD Locate Stop Command was successfully sent to Firmware
...
<IP Server 1>:
<IP Server 1>: Adapter: 0: Device at EnclId-20 SlotId-11  -- PD Locate Stop Command was successfully sent to Firmware
<IP Server 1>:
<IP Server 1>: Exit Code: 0x00
...
<same for Server2>
...
<same for Server 18>

Verify Flashing LEDs are reset

Verify that the service LED was reset by checking that the LED that was flashing is now "OFF" on the BDA, i.e. the LED is solid green.

If you cannot immediately check the state on the BDA, verify that the service LED is reset to OFF by:

1. Capturing the ILOM snapshot as described in "ILOM snapshot indication of a service fault". The ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out file should show all service LEDs OFF:

DBP/HDD0/SVC     | OFF
DBP/HDD1/SVC     | OFF
DBP/HDD2/SVC     | OFF
DBP/HDD3/SVC     | OFF
DBP/HDD4/SVC     | OFF
DBP/HDD5/SVC     | OFF
DBP/HDD6/SVC     | OFF
DBP/HDD7/SVC     | OFF
DBP/HDD8/SVC     | OFF
DBP/HDD9/SVC     | OFF
DBP/HDD10/SVC    | OFF
DBP/HDD11/SVC    | OFF

 
2. Querying the ON/OFF state of the service LED for a particular HDD several times with the following command (as root or via sudo):

# for a in {1..6} ; do ipmitool sunoem led get DBP/HDD<X>/SVC  ; done

 

If OFF is returned all 6 times, the LED is OFF and not blinking. Otherwise you will see a mix of ON and OFF states, which indicates a blinking LED.

To verify the disk in slot 3:

# for a in {1..6} ; do ipmitool sunoem led get DBP/HDD3/SVC  ; done
 
Output if the service light is still ON, i.e. a flashing LED:
DBP/HDD3/SVC     | ON
DBP/HDD3/SVC     | ON
DBP/HDD3/SVC     | OFF
DBP/HDD3/SVC     | ON
DBP/HDD3/SVC     | OFF
DBP/HDD3/SVC     | ON

Output if the service light is OFF:
DBP/HDD3/SVC     | OFF
DBP/HDD3/SVC     | OFF
DBP/HDD3/SVC     | OFF
DBP/HDD3/SVC     | OFF
DBP/HDD3/SVC     | OFF
DBP/HDD3/SVC     | OFF

 

If any of these verifications fail, please open a service request (SR) via My Oracle Support for further investigation.


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.