Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure

Solution 1537798.1: Troubleshooting Flashing Amber LED on Oracle Big Data Appliance V2.0 for Non-Failure Status
In this Document
  Purpose
  Troubleshooting Steps

Applies to:

Big Data Appliance X3-2 Hardware - Version All Versions and later
Linux x86-64

Purpose

This note provides a general approach for determining whether a flashing amber LED on an Oracle Big Data Appliance (BDA) disk drive indicates a disk failure. If analysis does not point to a disk failure, steps to reset a flashing amber LED are given.

Troubleshooting Steps
This document makes use of an example where the disk in slot 3 for a server on the BDA is observed to have a flashing amber LED.
What does a flashing amber LED on a disk drive mean?

On the Oracle Big Data Appliance (BDA) a bda-monitor service runs in the background and turns on a service LED if any of the following is true for a disk:

1. The disk shows a non-zero predictive failure count. Check with:

# MegaCli64 -PDList -a0 | grep -E 'Slot|Failure Count'

"Predictive Failure Count: 0" indicates no predictive failures; anything non-zero triggers an alert. Output for 12 healthy disks looks like:

Slot Number: 0
Predictive Failure Count: 0
Slot Number: 1
Predictive Failure Count: 0
...
Slot Number: 11
Predictive Failure Count: 0

2. The disk fails the performance test. /sbin/hdparm -t /dev/sd<X> (where <X> is the disk location) should return 100 MB/second or more on a healthy disk. For example, over five runs:

# /sbin/hdparm -t /dev/sdc
/dev/sdc:
 Timing buffered disk reads: 442 MB in 3.00 seconds = 147.12 MB/sec

# /sbin/hdparm -t /dev/sdc
/dev/sdc:
 Timing buffered disk reads: 442 MB in 3.01 seconds = 146.98 MB/sec

# /sbin/hdparm -t /dev/sdc
/dev/sdc:
 Timing buffered disk reads: 442 MB in 3.00 seconds = 147.29 MB/sec

# /sbin/hdparm -t /dev/sdc
/dev/sdc:
 Timing buffered disk reads: 444 MB in 3.01 seconds = 147.66 MB/sec

# /sbin/hdparm -t /dev/sdc
/dev/sdc:
 Timing buffered disk reads: 444 MB in 3.01 seconds = 147.68 MB/sec
The performance test can fail if there is greater than normal load on the system; in that case the result does not indicate that the disk itself has performance issues. The Hadoop Distributed File System (HDFS) is tolerant of disk failures and degraded disk performance, so replacing disks that perform less than optimally should not be necessary. Note that once the LED is turned on it will not turn off again on its own.
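For convenience, the two checks above can be combined into one sweep of all 12 disks. The sketch below is illustrative only, not part of the standard BDA tooling; it assumes the data disks are /dev/sda through /dev/sdl (verify the slot-to-device mapping on your server) and uses the 100 MB/second threshold described above:

#!/bin/bash
# Illustrative sweep of the two bda-monitor style checks (not standard BDA tooling).
# Assumes 12 data disks at /dev/sda../dev/sdl; verify the device mapping first.

# Check 1: report any slot with a non-zero predictive failure count.
MegaCli64 -PDList -a0 -nolog | grep -E 'Slot|Failure Count' | \
  grep -B1 'Predictive Failure Count: [1-9]'

# Check 2: report any disk reading below 100 MB/second.
for dev in /dev/sd{a..l} ; do
  rate=$(/sbin/hdparm -t "$dev" | awk '/Timing/ {print int($(NF-1))}')
  rate=${rate:-0}   # treat a failed read as 0 so it gets flagged
  [ "$rate" -lt 100 ] && echo "$dev: ${rate} MB/sec is below the 100 MB/sec threshold"
done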
ILOM snapshot indication of a service fault

Once a flashing amber LED is observed on a disk on the BDA, check the file ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out from the ILOM snapshot to confirm that a disk service fault is set. Collect the ILOM snapshot on the server with the flashing amber LED(s) either:

1. According to the directions in How to Collect a Snapshot from an x86 Platform Service Processor version (SP or ILOM) (Doc ID 1448069.1). In this case uncompress the ILOM snapshot and examine ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out.

or

2. Via the output from the "bdadiag snapshot" utility as documented in Oracle Big Data Appliance Diagnostic Information Collection with bdadiag V2.0 (Doc ID 1516469.1). In this case uncompress the "bdadiag snapshot" output, cd to the ilom subdirectory, uncompress the ILOM snapshot, and examine ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out.

The file should show a service fault on the disk for which the flashing amber LED is observed on the BDA. For example, for the disk in slot 3, ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out should show a service fault on DBP/HDD3:

DBP/HDD0/SVC | OFF
DBP/HDD1/SVC | OFF
DBP/HDD2/SVC | OFF
DBP/HDD3/SVC | ON
DBP/HDD4/SVC | OFF
DBP/HDD5/SVC | OFF
DBP/HDD6/SVC | OFF
DBP/HDD7/SVC | OFF
DBP/HDD8/SVC | OFF
DBP/HDD9/SVC | OFF
DBP/HDD10/SVC | OFF
DBP/HDD11/SVC | OFF
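With the snapshot uncompressed, any faulted disks can be pulled out of this file in one step; a minimal sketch, assuming the uncompressed snapshot directory is the current working directory:

# grep -i svc ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out | grep -w ON
DBP/HDD3/SVC | ON

Keep in mind that a blinking LED may happen to read OFF at the moment the snapshot is taken (see the note on alternating states below).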
A quick alternative to collecting the ILOM snapshot to verify a disk service fault is to issue (as root or via sudo):

# ipmitool sunoem led get | grep -i svc
DBP/HDD0/SVC | OFF
DBP/HDD1/SVC | OFF
DBP/HDD2/SVC | OFF
DBP/HDD3/SVC | OFF
DBP/HDD4/SVC | OFF
DBP/HDD5/SVC | OFF
DBP/HDD6/SVC | OFF
DBP/HDD7/SVC | OFF
DBP/HDD8/SVC | OFF
DBP/HDD9/SVC | OFF
DBP/HDD10/SVC | OFF
DBP/HDD11/SVC | OFF

Or, across all servers at once with dcli:

# dcli 'ipmitool sunoem led get | grep -i svc'
<IP Server 1>: DBP/HDD0/SVC | OFF
<IP Server 1>: DBP/HDD1/SVC | OFF
...
<IP Server 1>: DBP/HDD11/SVC | OFF
...
<same for Server 2>
...
<same for Server 18>
Note that in the case of a flashing amber light the reported service fault state alternates between ON and OFF. Hence the recommendation is to either collect the ILOM snapshot several times in a row, or run "ipmitool sunoem led get | grep -i svc" several times in a row, and compare the states to verify a blinking LED.
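One way to make that comparison, sketched here for the disk in slot 3 (adjust the HDD number for other slots), is to sample the LED in a loop and tally the readings:

# Sample the slot 3 service LED ten times, one second apart, and tally the
# readings; a mix of ON and OFF counts indicates a blinking LED.
for i in {1..10} ; do
  ipmitool sunoem led get DBP/HDD3/SVC
  sleep 1
done | sort | uniq -c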
Indications that a flashing amber LED does not represent disk failure

Once one or more disks with flashing amber LEDs are observed on the BDA, and the corresponding service fault is confirmed via ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out, several additional checks can be made to determine whether the observed LED represents a hard or predictive disk failure. The checks listed below can be used to determine that the disk is in a healthy state in spite of the observed flashing amber LED on the BDA and the corresponding service fault in ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out.

A healthy hardware profile with functional disks looks like the output below. For example, in the case of a flashing amber LED for the disk in slot 3, bdacheckhw reports that the disk is healthy:

# bdacheckhw
...
SUCCESS: Correct disk controller model : LSI MegaRAID SAS 9261-8i
SUCCESS: Correct disk controller firmware major version : 12.12.0
SUCCESS: Sufficient disk controller firmware minor version (>=0079): 0079
SUCCESS: Got expected disk controller enclosure ID : 20
SUCCESS: Correct disk controller PCI address : 13:00.0
SUCCESS: Correct disk controller PCI info : 0104: 1000:0079
SUCCESS: Correct disk controller PCIe slot width : x8
SUCCESS: Correct disk controller battery type : iBBU08
SUCCESS: Correct disk controller battery state : Operational
SUCCESS: Correct number of disks : 12
SUCCESS: Correct disk 0 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 0 firmware (>=61A): 61A
SUCCESS: Correct disk 1 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 1 firmware (>=61A): 61A
SUCCESS: Correct disk 2 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 2 firmware (>=61A): 61A
SUCCESS: Correct disk 3 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 3 firmware (>=61A): 61A
SUCCESS: Correct disk 4 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 4 firmware (>=61A): 61A
SUCCESS: Correct disk 5 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 5 firmware (>=61A): 61A
SUCCESS: Correct disk 6 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 6 firmware (>=61A): 61A
SUCCESS: Correct disk 7 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 7 firmware (>=61A): 61A
SUCCESS: Correct disk 8 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 8 firmware (>=61A): 61A
SUCCESS: Correct disk 9 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 9 firmware (>=61A): 61A
SUCCESS: Correct disk 10 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 10 firmware (>=61A): 61A
SUCCESS: Correct disk 11 model : SEAGATE ST32000SSSUN2.0
SUCCESS: Sufficient disk 11 firmware (>=61A): 61A
SUCCESS: Correct disk 0 status : Online, Spun Up No alert
SUCCESS: Correct disk 1 status : Online, Spun Up No alert
SUCCESS: Correct disk 2 status : Online, Spun Up No alert
SUCCESS: Correct disk 3 status : Online, Spun Up No alert
SUCCESS: Correct disk 4 status : Online, Spun Up No alert
SUCCESS: Correct disk 5 status : Online, Spun Up No alert
SUCCESS: Correct disk 6 status : Online, Spun Up No alert
SUCCESS: Correct disk 7 status : Online, Spun Up No alert
SUCCESS: Correct disk 8 status : Online, Spun Up No alert
SUCCESS: Correct disk 9 status : Online, Spun Up No alert
SUCCESS: Correct disk 10 status : Online, Spun Up No alert
SUCCESS: Correct disk 11 status : Online, Spun Up No alert
SUCCESS: Correct number of virtual disks : 12
SUCCESS: Correct slot 0 mapping to HBA target : 0
SUCCESS: Correct slot 1 mapping to HBA target : 1
SUCCESS: Correct slot 2 mapping to HBA target : 2
SUCCESS: Correct slot 3 mapping to HBA target : 3
SUCCESS: Correct slot 4 mapping to HBA target : 4
SUCCESS: Correct slot 5 mapping to HBA target : 5
SUCCESS: Correct slot 6 mapping to HBA target : 6
SUCCESS: Correct slot 7 mapping to HBA target : 7
SUCCESS: Correct slot 8 mapping to HBA target : 8
SUCCESS: Correct slot 9 mapping to HBA target : 9
SUCCESS: Correct slot 10 mapping to HBA target : 10
SUCCESS: Correct slot 11 mapping to HBA target : 11
...
SUCCESS: Big Data Appliance hardware validation checks succeeded

The megacli64-status.out file from the bdadiag output, found after uncompressing the bdadiag file in bdadiag_<clustername>_<timestamp>/raid/megacli64-status.out, also reports healthy status. The output looks like:

Checking RAID status on scaj0107.us.oracle.com
Controller a0: LSI MegaRAID SAS 9261-8i
No of Physical disks online : 12
Degraded : 0
Failed Disks : 0

The megacli64-PdList_long.out file from the bdadiag output, found after uncompressing the bdadiag file in bdadiag_<clustername>_<timestamp>/raid/megacli64-PdList_long.out, reports healthy status on specific disks. The output below confirms a healthy status for the disk in slot 3:

Enclosure Device ID: 20
Slot Number: 3
Drive's postion: DiskGroup: 2, Span: 0, Arm: 0
Enclosure position: 0
Device Id: 16
WWN: 5000C5003482F5D4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Online, Spun Up
Is Commissioned Spare : NO
Device Firmware Level: 061A
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c5003482f5d5
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST32000SSSUN2.0T061A1126L6LD5N
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature : 29C (84.20 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's write cache : Disabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No

The megacli64-LdPdInfo.out file from the bdadiag output, found after uncompressing the bdadiag file in bdadiag_<clustername>_<timestamp>/raid/megacli64-LdPdInfo.out, likewise shows a healthy state for the disk or disks with a flashing amber LED. The example output below is for the disk in slot 3:

Virtual Drive: 3 (Target Id: 3)
Name :
RAID Level : Primary-0, Secondary-0, RAID Level Qualifier-0
Size : 1.817 TB
Parity Size : 0
State : Optimal
Strip Size : 64 KB
Number Of Drives : 1
Span Depth : 1
Default Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Cached, No Write Cache if Bad BBU
Access Policy : Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Number of Spans: 1
Span: 0 - Number of PDs: 1

Recall that once the LED is turned ON it remains ON; a performance issue could have occurred previously, causing the LED to flash. The command below verifies that performance is now within the acceptable standard:

# /sbin/hdparm -t /dev/sdd
/dev/sdd:

Verifying that copying a file onto and off of the filesystem mounted on the disk or disks with the flashing amber lights is successful also indicates that the disk is functioning properly. A quick diagnostic test can be done as below. This test is for the disk in slot 3.

## see what's mounted
# mount

## copy a bda rpm to the filesystem on that disk
# cp /<path>/<file> /u04/
# sync

## check for differences between the 2 files
# echo Checking copied file.
# diff /<path>/<file> /u04/

## copy the file back and check for differences
# cp /u04/<file> /tmp/<file>
# sync
# echo Checking file after copying back.
# diff /<path>/<file> /tmp/<file>
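The same round trip can be scripted with checksums in place of diff. This is a sketch only; /u04 is the mount point under test from the example above, and the source file path is a placeholder for any convenient file:

#!/bin/bash
# Round-trip copy test for the filesystem on the suspect disk (sketch only).
SRC=/tmp/testfile        # placeholder: substitute any convenient file
MNT=/u04                 # mount point on the disk under test
F=$(basename "$SRC")

cp "$SRC" "$MNT/" && sync              # write to the suspect disk
cp "$MNT/$F" "/tmp/$F.back" && sync    # read it back off the disk

# All three checksums should match if the disk reads and writes correctly.
md5sum "$SRC" "$MNT/$F" "/tmp/$F.back"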
If there is no other indication of a hard or predictive disk failure associated with the flashing amber light then, as a workaround, you can reset the service LED(s) back to the OFF state.

Resetting the service LEDs back to "OFF"

Reset the service LED or LEDs back to the "OFF" state by issuing the following commands as root or via sudo.

Query the state of all disks on the node with:

# MegaCli64 -PDList -a0 | grep -E 'Slot|state|Failure Count|flagged' | sed 's/Slot/\nSlot/'

Disks without a hard or predictive failure show:

Slot Number: <disk slot>
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No
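To surface only slots that deviate from this healthy pattern, the query output can be filtered with awk. A minimal sketch (not standard BDA tooling), keyed to the field labels shown in the output above:

# Print only slots that deviate from the healthy pattern (sketch).
MegaCli64 -PDList -a0 -nolog | awk '
  /^Slot Number:/ { slot = $NF }
  /Predictive Failure Count:/ && $NF != 0 { print "Slot " slot ": predictive failure count " $NF }
  /^Firmware state:/ && !/Online, Spun Up/ { print "Slot " slot ": " $0 }
  /S\.M\.A\.R\.T alert/ && /: Yes/ { print "Slot " slot ": S.M.A.R.T alert flagged" }
'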
Then reset the LEDs with the PDLocate -stop command. On the local server:

# for a in {0..11} ; do MegaCli64 -PDLocate -stop -PhysDrv[20:$a] -a0 -nolog ; done

Or on all servers in the rack:

# dcli 'for a in {0..11} ; do MegaCli64 -PDLocate -stop -PhysDrv[20:$a] -a0 -nolog ; done'

Or on all servers in the cluster:

# dcli -C 'for a in {0..11} ; do MegaCli64 -PDLocate -stop -PhysDrv[20:$a] -a0 -nolog ; done'
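If only one disk shows the fault, the same command can target that slot alone rather than looping over all twelve; for example, for the disk in slot 3 on the local server:

# MegaCli64 -PDLocate -stop -PhysDrv[20:3] -a0 -nolog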
Example output from the rack-wide dcli command:

<IP Server 1>:
<IP Server 1>: Adapter: 0: Device at EnclId-20 SlotId-0 -- PD Locate Stop Command was successfully sent to Firmware
<IP Server 1>:
<IP Server 1>: Exit Code: 0x00
<IP Server 1>:
<IP Server 1>: Adapter: 0: Device at EnclId-20 SlotId-1 -- PD Locate Stop Command was successfully sent to Firmware
<IP Server 1>:
<IP Server 1>: Exit Code: 0x00
<IP Server 1>:
<IP Server 1>: Adapter: 0: Device at EnclId-20 SlotId-2 -- PD Locate Stop Command was successfully sent to Firmware
...
<IP Server 1>:
<IP Server 1>: Adapter: 0: Device at EnclId-20 SlotId-11 -- PD Locate Stop Command was successfully sent to Firmware
<IP Server 1>:
<IP Server 1>: Exit Code: 0x00
...
<same for Server 2>
...
<same for Server 18>

Verify Flashing LEDs are Reset

Verify that the service LED was reset by checking that the LED that was flashing is now OFF on the BDA, i.e. the LED is solid green. If you cannot immediately check the state on the BDA, verify that the service LED is reset to OFF by:

1. Capturing the ILOM snapshot as described in "ILOM snapshot indication of a service fault". ipmi/@usr@local@bin@ipmiint_sunoem_led_get.out should show all service LEDs OFF:

DBP/HDD0/SVC | OFF
DBP/HDD1/SVC | OFF
DBP/HDD2/SVC | OFF
DBP/HDD3/SVC | OFF
DBP/HDD4/SVC | OFF
DBP/HDD5/SVC | OFF
DBP/HDD6/SVC | OFF
DBP/HDD7/SVC | OFF
DBP/HDD8/SVC | OFF
DBP/HDD9/SVC | OFF
DBP/HDD10/SVC | OFF
DBP/HDD11/SVC | OFF

or 2. Querying the LED state directly several times in a row:

# for a in {1..6} ; do ipmitool sunoem led get DBP/HDD<X>/SVC ; done
If OFF is returned all 6 times then the LED is OFF and not blinking. Otherwise you might see a mix of ON and OFF states, which indicates a blinking LED. For example, to verify the disk in slot 3:

# for a in {1..6} ; do ipmitool sunoem led get DBP/HDD3/SVC ; done
Output if the service light is still ON, i.e. a flashing LED:

DBP/HDD3/SVC | ON
DBP/HDD3/SVC | ON
DBP/HDD3/SVC | OFF
DBP/HDD3/SVC | ON
DBP/HDD3/SVC | OFF
DBP/HDD3/SVC | ON

Output if the service light is OFF:

DBP/HDD3/SVC | OFF
DBP/HDD3/SVC | OFF
DBP/HDD3/SVC | OFF
DBP/HDD3/SVC | OFF
DBP/HDD3/SVC | OFF
DBP/HDD3/SVC | OFF
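To confirm the reset across every server in one pass, the quick ipmitool check can be run through dcli and filtered for any LED still reporting ON; empty output means all service LEDs read OFF. Repeat the check a few times, since a blinking LED alternates between ON and OFF:

# dcli 'ipmitool sunoem led get | grep -i svc' | grep -w ON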