Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution

1585438.1 : TROUBLESHOOTING: Failed or Failing Disks on Oracle Big Data Appliance
Applies to:
Big Data Appliance X3-2 In-Rack Expansion - Version All Versions and later
Big Data Appliance X3-2 Hardware - Version All Versions and later
Big Data Appliance X3-2 Full Rack - Version All Versions and later
Big Data Appliance Hardware - Version All Versions and later
Big Data Appliance X3-2 Starter Rack - Version All Versions and later
Linux x86-64

Purpose

This document provides information on troubleshooting disk errors, warnings, and informational messages, as well as failing and failed disks, on Oracle Big Data Appliance.

Troubleshooting Steps

Running Oracle Big Data Appliance (BDA) utilities such as bdacheckcluster, bdacheckhw, and bdachecksw may produce errors, warnings, and informational messages about BDA disks. The sections below show helpful utility commands, the output that may be seen, and how to determine whether a disk has already failed or is in the process of failing. This also helps determine whether a service request should be opened to have a new disk drive shipped and replaced. The spares kit contains one or two spare disks, depending on the BDA rack shipped. A spare disk may be used, but it is still important to open a service request to have the failed disk replaced.

BDA Utilities to Check Status of Disks

Please note all utilities should be run as the root user.

1. The bdacheckcluster utility should be run on node01 of the BDA cluster as the root user. To run bdacheckcluster, enter the following:

# bdacheckcluster
Sample output showing a hardware check failure on one of the cluster nodes:

# bdacheckcluster
...
INFO: Checking hardware on host bdanode0n
...
ERROR: Hardware checks failing on host bdanode0n
...
ERROR: Big Data Appliance failed cluster health checks

If an error is seen, continue investigating with bdacheckhw on the node found to have the error.

2. If bdacheckcluster reports hardware checks failing, determine whether the failures are related to disks by running the bdacheckhw command on the node/server found to have the error. To run bdacheckhw, enter the following:

# bdacheckhw
If bdacheckhw reports a disk issue, continue investigating with the MegaCli64 commands shown below. Sample output showing an error with wrong disk 5 status:

# bdacheckhw
...
SUCCESS: Correct disk 1 status : Online, Spun Up No alert
SUCCESS: Correct disk 2 status : Online, Spun Up No alert
...
ERROR: Wrong disk 5 status : Online, Spun Up Yes alert
INFO: Expected disk 5 status : Online, Spun Up No alert
SUCCESS: Correct disk 6 status : Online, Spun Up No alert
...
SUCCESS: Correct disk 11 status : Online, Spun Up No alert
INFO: Errors reported on disk 5 : 12 0
SUCCESS: Correct number of virtual disks : 12
...
ERROR: Big Data Appliance failed hardware validation checks
Note that in this case a service request should be logged because of the message: ERROR: Wrong disk 5 status : Online, Spun Up Yes alert. See the section "Errors Reported on Disk Seen in bdacheckhw Output"; the Yes alert indicates the disk is failing and should be replaced. The alert will also be shown in the output of the MegaCli64 LdPdInfo a0 or MegaCli64 pdlist a0 commands. Media errors, by contrast, are not a concern and do not indicate a need for disk replacement, as explained in "Errors Reported on Disk Seen in bdacheckcluster Output."
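If it is useful to run bdacheckhw on several nodes in one pass from node01, a simple loop over the node hostnames can be used. The following is a minimal sketch only: the hostnames bdanode02 through bdanode06 are placeholders for the actual node names in the cluster, and it assumes root SSH equivalence is configured between the nodes (as is typical on a deployed BDA cluster).

# for h in bdanode02 bdanode03 bdanode04 bdanode05 bdanode06; do echo "===== $h ====="; ssh root@$h bdacheckhw; done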
3. To run bdachecksw, enter the following:

# bdachecksw
Sample output showing warnings for wrong data disk partition info and an error for wrong mounted partitions:

# bdachecksw
...
WARNING: Wrong data disk sdk partition info : 1 primary fat16 boot 2 primary ext2
INFO: Expected data disk sdk partition info : "1 ext3 primary" or "1 primary"
WARNING: Wrong data disk sdl partition info :
INFO: Expected data disk sdl partition info : "1 ext3 primary" or "1 primary"
SUCCESS: Correct software RAID info : /dev/md2 level=raid1 num-devices=2 /dev/md0 level=raid1 num-devices=2
ERROR: Wrong mounted partitions : /dev/mapper/lvg1-lv1 /lv1 ext4 /dev/md0 /boot ext3 /dev/md2 / ext3 /dev/sda4 /u01 ext4 /dev/sdb4 /u02 ext4 /dev/sdc1 /u03 ext4 /dev/sdd1 /u06 ext4 /dev/sde1 /u07 ext4 /dev/sdf1 /u08 ext4 /dev/sdg1 /u09 ext4 /dev/sdh1 /u10 ext4 /dev/sdi1 /u11 ext4
...

MegaCli64 Commands to Assist in Troubleshooting

There are a number of MegaCli64 commands which can assist in troubleshooting disk failure. In general, pay attention to disks that have a Firmware state of Unconfigured. Here are some of the MegaCli64 commands that can be used, with sample output.

To check the Firmware state, issue the following command:

# MegaCli64 -PDList -a0 | grep Firmware
Sample output showing Firmware state: Unconfigured(good), Spun Up:

# MegaCli64 -PDList -a0 | grep Firmware
Firmware state: Online, Spun Up
Device Firmware Level: 061A
...
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 061A
Firmware state: Online, Spun Up
Device Firmware Level: 061A

If Firmware state: Unconfigured(good) or Firmware state: Unconfigured(bad) is seen, the disk will need to be replaced. Log a service request to replace the disk. Please note there is no need to be concerned about Firmware state: Online, Spun Up.
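To see which slot each firmware state belongs to, the Slot Number and Firmware state lines can be printed together. This is a minimal sketch that only filters fields already present in the MegaCli64 -PDList output; any slot whose state is not Online, Spun Up can then be investigated further.

# MegaCli64 -PDList -a0 | grep -E 'Slot Number|Firmware state'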
The physical drive information can also be stored in a file for review:

# MegaCli64 pdlist a0 > pdinfo.tmp
Open the pdinfo.tmp file you created in a text editor and search for entries such as the Firmware state and Foreign State of each disk.
The following example shows a Firmware state of Unconfigured(bad) and a Foreign State of Foreign:

...
Firmware state: Unconfigured(bad)
Device Firmware Level: 061A
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c50040a4a8b9
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST32000SSSUN2.0T061A1140L7VSMZ
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: Foreign
...

Here is an example of a disk that shows a Firmware state of Unconfigured(good):

...
Enclosure Device ID: 20
Slot Number: 4
Enclosure position: 0
Device Id: 18
WWN: 5000C500348B72B4
Sequence Number: 4
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Unconfigured(good), Spun Up
Is Emergency Spare : NO
Device Firmware Level: 061A
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c500348b72b5
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST32000SSSUN2.0T061A1127L6LX53
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
...
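Rather than scanning the whole file by eye, the fields of interest can be pulled out of pdinfo.tmp with a filter. This is a minimal sketch; it only extracts lines already present in the MegaCli64 output shown above.

# grep -E 'Slot Number|Firmware state|Foreign State|Media Error Count|Predictive Failure Count' pdinfo.tmp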
Use MegaCli64 LdPdInfo a0 to verify the mapping of logical to physical drive numbers:

# MegaCli64 LdPdInfo a0 | more
Example output:

# MegaCli64 LdPdInfo a0 | more
Adapter #0

Number of Virtual Disks: 12
Virtual Drive: 0 (Target Id: 0)
...
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
...
Firmware state: Online, Spun Up
...
Foreign State: None
...
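To see the logical-to-physical mapping at a glance, the Virtual Drive and Slot Number lines from the same LdPdInfo output can be printed together. This is a minimal sketch relying only on fields normally present in that output:

# MegaCli64 LdPdInfo a0 | grep -E 'Virtual Drive|Slot Number'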
For predictive failures, check the value of "Predictive Failure Count" returned for each disk using:

# MegaCli64 -PDList -a0 | grep -E 'Slot|Failure Count'
Output for 12 healthy disks looks like:
# MegaCli64 -PDList -a0 | grep -E 'Slot|Failure Count'
Slot Number: 0
Predictive Failure Count: 0
Slot Number: 1
Predictive Failure Count: 0
...
Slot Number: 11
Predictive Failure Count: 0
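On a node with many disks it can help to flag only the nonzero counts. The one-liner below is a minimal sketch; it filters the same -PDList output with awk and prints each slot whose Predictive Failure Count is greater than zero.

# MegaCli64 -PDList -a0 | awk '/Slot Number/{slot=$3} /Predictive Failure Count/{if ($4>0) print "Slot " slot ": Predictive Failure Count " $4}'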
Gathering Information about the Disk Drives on the Server

The "lsscsi", "mount -l", and "ls -l /dev/disk/by-label" commands can provide information about disks that are missing from the output or mountpoints that are not found by the system during a disk failure.

Verify whether the disk is recognized using the "lsscsi" command:

# lsscsi
[0:0:20:0] enclosu SUN HYDE12 0341 -
[0:2:0:0]   disk    LSI      MR9261-8i   2.12  /dev/sda
[0:2:1:0]   disk    LSI      MR9261-8i   2.12  /dev/sdb
[0:2:2:0]   disk    LSI      MR9261-8i   2.12  /dev/sdc
[0:2:5:0]   disk    LSI      MR9261-8i   2.12  /dev/sdd   << slots 3 and 4 are not listed since [0:2:3:0] and [0:2:4:0] are not shown
[0:2:6:0]   disk    LSI      MR9261-8i   2.12  /dev/sde
[0:2:7:0]   disk    LSI      MR9261-8i   2.12  /dev/sdf
[0:2:8:0]   disk    LSI      MR9261-8i   2.12  /dev/sdg
[0:2:9:0]   disk    LSI      MR9261-8i   2.12  /dev/sdh
[0:2:10:0]  disk    LSI      MR9261-8i   2.12  /dev/sdi
[0:2:11:0]  disk    LSI      MR9261-8i   2.12  /dev/sdj
[7:0:0:0]   disk    Unigen   PSA4000     1100  /dev/sdk

If similar output is observed where slots are not listed, log a service request to investigate further whether the disk needs to be replaced.
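A healthy BDA server presents 12 virtual disks through the LSI controller (as also reported by bdacheckhw with "Correct number of virtual disks : 12"). A quick, minimal check is to count the controller disk entries in the lsscsi output; fewer than 12 means one or more disks are not visible to the operating system. The controller model string (MR9261-8i here) is taken from the sample above and may differ on other BDA hardware generations.

# lsscsi | grep -c MR9261-8i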
Determine whether the mountpoints are all listed using the "mount -l" command:

# mount -l
Sample output in which the mountpoint /dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04] is missing:

# mount -l
/dev/md2 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md0 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda4 on /u01 type ext4 (rw,nodev,noatime) [/u01]
/dev/sdb4 on /u02 type ext4 (rw,nodev,noatime) [/u02]
/dev/sdc1 on /u03 type ext4 (rw,nodev,noatime) [/u03]   << this shows /dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04] is missing
/dev/sde1 on /u05 type ext4 (rw,nodev,noatime) [/u05]
/dev/sdf1 on /u06 type ext4 (rw,nodev,noatime) [/u06]
/dev/sdg1 on /u07 type ext4 (rw,nodev,noatime) [/u07]
/dev/sdh1 on /u08 type ext4 (rw,nodev,noatime) [/u08]
/dev/sdi1 on /u09 type ext4 (rw,nodev,noatime) [/u09]
/dev/sdj1 on /u10 type ext4 (rw,nodev,noatime) [/u10]
/dev/sdk1 on /u11 type ext4 (rw,nodev,noatime) [/u11]
/dev/sdl1 on /u12 type ext4 (rw,nodev,noatime) [/u12]
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

If similar output with missing mount points is observed, log a service request to investigate whether the disk needs to be replaced.
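To check for missing data mountpoints without reading the whole mount listing, each expected mountpoint can be tested directly. This is a minimal sketch assuming the standard BDA layout of twelve /u01 through /u12 data mountpoints, as shown in the output above.

# for n in $(seq -w 1 12); do mountpoint -q /u$n || echo "/u$n is not mounted"; done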
Use "ls -l /dev/disk/by-label" to determine whether any device is missing. The following sample shows that u03 -> ../../sdc1 is missing:

# ls -l /dev/disk/by-label
total 0
lrwxrwxrwx 1 root root 10 Jul 29 19:30 BDAUSB -> ../../sdm1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 SWAP-sda3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Jul 29 19:30 SWAP-sdb3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u01 -> ../../sda4
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u02 -> ../../sdb4   << shows u03 -> ../../sdc1 is missing
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u04 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u05 -> ../../sde1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u06 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u07 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u08 -> ../../sdh1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u09 -> ../../sdi1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u10 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u11 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u12 -> ../../sdl1

If similar output with missing devices is observed, log a service request to investigate further whether the disk needs to be replaced.

Troubleshooting a Flashing Amber LED on an Oracle Big Data Appliance

If a flashing amber LED is seen on a server in the BDA cluster, please see the troubleshooting tips in the following document: Troubleshooting Flashing Amber LED on Oracle Big Data Appliance V2.0 for Non-Failure Status (Doc ID 1537798.1).
Checking the Disk Service Faults

To check the disk service faults, issue the following command as root:

# ipmitool sunoem led get | grep -i svc
DBP/HDD0/SVC | OFF
DBP/HDD1/SVC | OFF
DBP/HDD2/SVC | OFF
DBP/HDD3/SVC | OFF
DBP/HDD4/SVC | OFF
DBP/HDD5/SVC | OFF
DBP/HDD6/SVC | OFF
DBP/HDD7/SVC | OFF
DBP/HDD8/SVC | OFF
DBP/HDD9/SVC | OFF
DBP/HDD10/SVC | OFF
DBP/HDD11/SVC | OFF
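All twelve disk service LEDs read OFF on a healthy server, as in the sample above. As a minimal convenience, any LED that is not OFF can be isolated by filtering the same output; a disk whose SVC LED is lit typically indicates a fault that should be covered by a service request.

# ipmitool sunoem led get | grep -i svc | grep -vi off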
Errors Reported on Disk Seen in bdacheckcluster Output

If executing bdacheckcluster shows an info message about disk errors on one of the BDA nodes, such as:

INFO: Errors reported on disk 7 : 2 0
And running the "MegaCli64 ldpdinfo a0" command on the node shows the following:

Media Error Count: 2
Other Error Count: 0
Then no disk replacement is needed. The Media Error Count can be ignored; these are not failures but recoverable read/write errors. More information can be found in "Running bdacheckcluster utility on Oracle Big Data Appliance Reports: INFO: Errors Reported On Disk 7 : 8 0 (Doc ID 1568792.1)."
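To see the media and other error counts for every slot at once, the same -PDList output can be filtered; a minimal sketch is shown below. Nonzero media error counts on their own do not call for replacement, as noted above; they matter only alongside predictive failures or a S.M.A.R.T alert.

# MegaCli64 -PDList -a0 | grep -E 'Slot Number|Media Error Count|Other Error Count'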
Errors Reported on Disk Seen in bdacheckhw Output

If running bdacheckhw on a BDA node reports errors such as:

SUCCESS: Correct disk 1 status : Online, Spun Up No alert
ERROR: Wrong disk 2 status : Online, Spun Up Yes alert
INFO: Expected disk 2 status : Online, Spun Up No alert
...
INFO: Errors reported on disk 2 : 12 0
while the disk model and firmware checks are reported as successful:

SUCCESS: Correct disk 1 model : SEAGATE ST33000SSSUN3.0
SUCCESS: Sufficient disk 1 firmware (>=64A): 64A
SUCCESS: Correct disk 2 model : HITACHI H7230AS60SUN3.0
SUCCESS: Sufficient disk 2 firmware (>=A142): A310
SUCCESS: Correct disk 3 model : SEAGATE ST33000SSSUN3.0
SUCCESS: Sufficient disk 3 firmware (>=64A): 64A
and running MegaCli64 pdlist a0 (or MegaCli64 LdPdInfo a0) on the node shows entries for the disk such as:

Media Error Count: 12 (these are not failures; they represent recoverable read/write errors)
Predictive Failure Count: 5 (there are predictive failures)
Last Predictive Failure Event Seq Number: 15137
Drive has flagged a S.M.A.R.T alert : Yes (the S.M.A.R.T alert is showing Yes here)
The full entry for the disk (slot 2) looks like:

Enclosure Device ID: 20
Slot Number: 2
Drive's postion: DiskGroup: 11, Span: 0, Arm: 0
Enclosure position: 0
Device Id: 21
WWN: 5000CCA03E22B2AF
Sequence Number: 2
Media Error Count: 12
Other Error Count: 0
Predictive Failure Count: 5
Last Predictive Failure Event Seq Number: 15137
PD Type: SAS
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.727 TB [0x15d3ef000 Sectors]
Firmware state: Online, Spun Up
Is Commissioned Spare : NO
Device Firmware Level: A310
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000cca03e22b2ad
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: HITACHI H7230AS60SUN3.0TA3101237RM2KED
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :27C (80.60 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's write cache : Disabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : Yes
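To spot any disk on the node that has raised a S.M.A.R.T alert or accumulated predictive failures, the per-slot fields can be filtered from the -PDList output; a minimal sketch using only fields shown in the entry above:

# MegaCli64 -PDList -a0 | grep -E "Slot Number|Predictive Failure Count|S.M.A.R.T alert"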
Also check the output of the following commands, as described earlier, to confirm whether the disk is still visible to the operating system:

# mount -l
# lsscsi
Solution

The disk should be replaced at the soonest opportunity. Log a service request to have a disk drive shipped and replaced on the server. The Media Error Count can be ignored; these are not failures and represent recoverable read/write errors. More information can be found in "Running bdacheckcluster utility on Oracle Big Data Appliance Reports: INFO: Errors Reported On Disk 7 : 8 0 (Doc ID 1568792.1)." The disk, however, is failing, as shown by the Predictive Failure Count: 5 and the S.M.A.R.T alert: Yes. A predictive failure due to a S.M.A.R.T alert indicates a failing drive which should be replaced. A disk showing predictive failures/S.M.A.R.T alerts is still usable, although it has a high chance of becoming unusable in the near future. More information can be found in "bdaclustercheck/bdacheckhw Fail on Oracle Big Data Appliance: ERROR: Wrong disk status:Online,Spun Up Yes alert (Doc ID 1580223.1)."

Summary

There are many BDA utilities and MegaCli64 commands to assist in determining whether a disk is failing, failed, or fine. If a flashing amber LED is found and is determined to be ok per "Troubleshooting Flashing Amber LED on Oracle Big Data Appliance V2.0 for Non-Failure Status (Doc ID 1537798.1)", and/or if bdacheckcluster is run and only a media error count is identified, then the disk is fine.

References

NOTE:1537798.1 - Troubleshooting Flashing Amber LED on Oracle Big Data Appliance V2.0 for Non-Failure Status
NOTE:1568792.1 - Running bdacheckcluster utility on Oracle Big Data Appliance Reports: INFO: Errors Reported On Disk 7 : 8 0
NOTE:1580223.1 - Running bdaclustercheck/bdacheckhw Fail on Oracle Big Data Appliance: ERROR: Wrong disk status:Online,Spun Up Yes alert

Attachments

This solution has no attachment.