Running bdacheckcluster/bdacheckhw Fails on Oracle Big Data Appliance: "ERROR: Wrong disk status:Online,Spun Up Yes alert"

Asset ID:	1-72-1580223.1
Update Date:	2017-11-22
Solution Type Problem Resolution Sure

Solution 1580223.1 : Running bdacheckcluster/bdacheckhw Fails on Oracle Big Data Appliance: "ERROR: Wrong disk status:Online,Spun Up Yes alert"

Applies to:

Big Data Appliance X3-2 Starter Rack - Version All Versions and later
Big Data Appliance X3-2 Full Rack - Version All Versions and later
Big Data Appliance Hardware - Version All Versions and later
Big Data Appliance X3-2 In-Rack Expansion - Version All Versions and later
Linux x86-64

Symptoms

1. Running the bdacheckcluster utility to verify the health of the BDA cluster raises: "ERROR: Hardware checks failing" for one or more servers in the cluster:

# bdacheckcluster

...
INFO: Checking hardware on host bdanode0n
...
ERROR: Hardware checks failing on host bdanode0n
...
ERROR: Big Data Appliance failed cluster health checks
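
On a larger cluster, the affected hosts can be pulled out of the bdacheckcluster output with a short filter. The following is a minimal sketch and not part of the BDA tooling: the function name failing_hosts is hypothetical, and the pattern assumes the ERROR line format shown in the sample output above.

```shell
# Print the hosts that bdacheckcluster reports as failing hardware checks.
# Reads bdacheckcluster output on stdin. The line format is an assumption
# taken from the sample output in this note.
failing_hosts() {
    sed -n 's/^ERROR: Hardware checks failing on host \(.*\)$/\1/p'
}
```

For example: bdacheckcluster 2>&1 | failing_hosts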

2. Running the bdacheckhw utility to check the hardware profile of the server(s) with failing hardware checks reports "ERROR: Wrong disk status : Online, Spun Up Yes alert" on the affected node(s), for example:

# bdacheckhw

...
SUCCESS: Correct disk 1 status : Online, Spun Up No alert
SUCCESS: Correct disk 2 status : Online, Spun Up No alert
...
ERROR: Wrong disk 5 status : Online, Spun Up Yes alert
INFO: Expected disk 5 status : Online, Spun Up No alert
SUCCESS: Correct disk 6 status : Online, Spun Up No alert
...
SUCCESS: Correct disk 11 status : Online, Spun Up No alert
INFO: Errors reported on disk 5 : 12 0
SUCCESS: Correct number of virtual disks : 12
...
ERROR: Big Data Appliance failed hardware validation checks
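
To list only the disks that bdacheckhw flags, the output can be filtered for the "Wrong disk N status" lines. This is a sketch under the assumption that the line format matches the sample above; failing_disks is a hypothetical helper name, not a BDA command.

```shell
# Print the disk numbers that bdacheckhw reports with a wrong status.
# Reads bdacheckhw output on stdin. The "ERROR: Wrong disk N status" format
# is taken from the sample output in this note and may vary by version.
failing_disks() {
    sed -n 's/^ERROR: Wrong disk \([0-9][0-9]*\) status.*/\1/p'
}
```

For example: bdacheckhw 2>&1 | failing_disks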

3. Further investigation with "MegaCli64 LdPdInfo a0" and "MegaCli64 pdlist a0" run as root on the node(s) reporting the fault shows that the disk with the errors has predictive failures and a S.M.A.R.T alert.

# MegaCli64 LdPdInfo a0

# MegaCli64 pdlist a0

The output resembles the following. In this example the problem is reported for the disk in slot 5:

Virtual Drive: 5 (Target Id: 5)
... :
Media Error Count: 12
Other Error Count: 0
Predictive Failure Count: 5
Last Predictive Failure Event Seq Number: 15137
...
Firmware state: Online, Spun Up
...
Drive has flagged a S.M.A.R.T alert : Yes

The Media Error Count can be ignored. These are not failures; they represent recoverable read/write errors.

The disk has a firmware state of Online, Spun Up, but also exhibits predictive failures and a SMART alert:

Predictive Failure Count: 5
Drive has flagged a S.M.A.R.T alert : Yes
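
The two indicators above can be picked out of a long MegaCli listing automatically. The sketch below is illustrative only: predictive_failures is a hypothetical function name, and the field labels ("Virtual Drive:", "Predictive Failure Count:", "Drive has flagged a S.M.A.R.T alert") are assumed to appear exactly as in the sample LdPdInfo output above.

```shell
# Report virtual drives whose disk shows predictive failures or a SMART alert.
# Reads "MegaCli64 LdPdInfo a0" output on stdin. Field labels are assumptions
# copied from the sample output in this note.
predictive_failures() {
    awk -F': *' '
        # Remember which virtual drive the following lines belong to.
        /^Virtual Drive:/ { split($2, f, " "); vd = f[1] }
        # Flag any non-zero predictive failure count.
        /^Predictive Failure Count:/ {
            if ($2 + 0 > 0) print "VD " vd ": predictive failures=" ($2 + 0)
        }
        # Flag a raised SMART alert.
        /^Drive has flagged a S.M.A.R.T alert/ {
            if ($2 ~ /^Yes/) print "VD " vd ": SMART alert"
        }
    '
}
```

For example: MegaCli64 LdPdInfo a0 | predictive_failures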

4. However, running lsscsi and mount -l as root shows that the disk with the predictive failure count and SMART alert is otherwise OK.

# lsscsi

[0:0:20:0]   enclosu SUN      HYDE12           0341 -
[0:2:0:0]    disk    LSI      MR9261-8i        2.12 /dev/sda
[0:2:1:0]    disk    LSI      MR9261-8i        2.12 /dev/sdb
[0:2:2:0]    disk    LSI      MR9261-8i        2.12 /dev/sdc
[0:2:3:0]    disk    LSI      MR9261-8i        2.12 /dev/sdd
[0:2:4:0]    disk    LSI      MR9261-8i        2.12 /dev/sde
[0:2:5:0]    disk    LSI      MR9261-8i        2.12 /dev/sdf
[0:2:6:0]    disk    LSI      MR9261-8i        2.12 /dev/sdg
[0:2:7:0]    disk    LSI      MR9261-8i        2.12 /dev/sdh
[0:2:8:0]    disk    LSI      MR9261-8i        2.12 /dev/sdi
...

# mount -l

/dev/md2 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md0 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda4 on /u01 type ext4 (rw,nodev,noatime) [/u01]
/dev/sdb4 on /u02 type ext4 (rw,nodev,noatime) [/u02]
/dev/sdc1 on /u03 type ext4 (rw,nodev,noatime) [/u03]
/dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04]
/dev/sde1 on /u05 type ext4 (rw,nodev,noatime) [/u05]
/dev/sdf1 on /u06 type ext4 (rw,nodev,noatime) [/u06]
/dev/sdg1 on /u07 type ext4 (rw,nodev,noatime) [/u07]
/dev/sdh1 on /u08 type ext4 (rw,nodev,noatime) [/u08]
/dev/sdi1 on /u09 type ext4 (rw,nodev,noatime) [/u09]
...
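
The check in this step can be scripted as a rough sketch. Both helper names (device_for_target, mounts_for_device) are hypothetical. The sketch also assumes that the third field of the lsscsi address matches the Target Id of the virtual drive, which the sample output suggests but this note does not state.

```shell
# Print the block device that lsscsi lists for a given SCSI target id.
# Reads lsscsi output on stdin. Mapping target id to virtual drive is an
# assumption based on the sample output in this note.
device_for_target() {
    awk -v t="$1" '{
        split($1, a, ":")                 # "[0:2:5:0]" -> a[3] is the target id
        if ($2 == "disk" && a[3] == t) print $NF
    }'
}

# Print the mount points of all partitions of a device.
# Reads "mount -l" output on stdin.
mounts_for_device() {
    awk -v d="$1" 'index($1, d) == 1 && $2 == "on" { print $3 }'
}
```

For example: dev=$(lsscsi | device_for_target 5); mount -l | mounts_for_device "$dev". An empty result from either helper would mean the disk is not visible or not mounted.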

5. Collecting diagnostics with "bdadiag snapshot" (see Doc ID 1516469.1: Oracle Big Data Appliance Diagnostic Information Collection with bdadiag V2.0) produces the file:

<bdadiag...>/raid/megacli64-GetEvents-all.out

This file shows when the Predictive Failure was first raised, which gives a timeline of how long the disk has been in this state. In this example, the Predictive Failure for the disk in slot 5 was first reported at:

Time: Sat Aug 28 18:40:30 2013

Code: 0x00000060
Class: 1
Locale: 0x02
Event Description: Predictive failure: PD 15(e0x14/s2)
Event Data:
===========
Device ID: 21
Enclosure Index: 20
Slot Number: 5
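
Finding these events by hand in a long event dump is tedious, so a small filter can help. The following is a sketch only: predictive_failure_times is a hypothetical name, and the "Time:" and "Event Description:" labels are assumed to match the sample event above.

```shell
# Print the timestamp of every "Predictive failure" event in a MegaCli
# event dump read from stdin. Assumes each event lists its "Time:" line
# before its "Event Description:" line, as in the sample in this note.
predictive_failure_times() {
    awk '
        /^Time:/ { t = $0; sub(/^Time: */, "", t) }
        /^Event Description: Predictive failure/ { print t }
    '
}
```

For example: predictive_failure_times < megacli64-GetEvents-all.out | head -n 1. Taking the first line gives the earliest event only if the dump is ordered oldest first, which should be confirmed against the file.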

Cause

A predictive failure due to a SMART alert indicates a failing drive that should be replaced. However, a disk showing predictive failures or SMART alerts is still usable, although it has a high chance of becoming unusable in the near future.

Solution

File a Service Request with Oracle Support to have the disk replaced as soon as possible.

References

<NOTE:1516469.1> - Oracle Big Data Appliance Diagnostic Information Collection with bdadiag V2.*/V3.*/V4.*
