Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Troubleshooting Sure Solution

1585438.1 : TROUBLESHOOTING: Failed or Failing Disks on Oracle Big Data Appliance
Applies to:
Big Data Appliance X3-2 In-Rack Expansion - Version All Versions and later
Big Data Appliance X3-2 Hardware - Version All Versions and later
Big Data Appliance X3-2 Full Rack - Version All Versions and later
Big Data Appliance Hardware - Version All Versions and later
Big Data Appliance X3-2 Starter Rack - Version All Versions and later
Linux x86-64

Purpose

This document provides information on troubleshooting disk errors, warnings, and informational messages, as well as failing and failed disks, on Oracle Big Data Appliance.

Troubleshooting Steps

Running Oracle Big Data Appliance (BDA) utilities such as bdacheckcluster, bdacheckhw, and bdachecksw may produce errors, warnings, and informational messages about BDA disks. The sections below show helpful utility commands, the output that may be seen, and how to determine whether a disk has already failed or is in the process of failing. This also helps determine whether a service request should be opened to have a new disk drive shipped and replaced. The spares kit contains one or two spare disks, depending on the BDA rack shipped. A spare disk may be used, but it is still important to open a service request to have the failed disk replaced.

BDA Utilities to Check Status of Disks

Please note all utilities should be run as the root user.

1. The bdacheckcluster utility should be run on node01 of the BDA cluster as the root user. To run bdacheckcluster, enter the following:

# bdacheckcluster
Sample output showing a hardware check failure on one of the cluster nodes:

# bdacheckcluster
...
INFO: Checking hardware on host bdanode0n
...
ERROR: Hardware checks failing on host bdanode0n
...
ERROR: Big Data Appliance failed cluster health checks

If an error is seen, continue investigating with bdacheckhw on the node found to have the error.

2. If bdacheckcluster reports hardware checks failing, determine whether the failures are related to disks by running the bdacheckhw command on the node/server found to have the error. To run bdacheckhw, enter the following:

# bdacheckhw
If bdacheckhw reports a disk issue, continue investigating with the MegaCli64 commands shown below. Sample output showing an error with wrong disk 5 status:

# bdacheckhw
...
SUCCESS: Correct disk 1 status : Online, Spun Up No alert
SUCCESS: Correct disk 2 status : Online, Spun Up No alert
...
ERROR: Wrong disk 5 status : Online, Spun Up Yes alert
INFO: Expected disk 5 status : Online, Spun Up No alert
SUCCESS: Correct disk 6 status : Online, Spun Up No alert
...
SUCCESS: Correct disk 11 status : Online, Spun Up No alert
INFO: Errors reported on disk 5 : 12 0
SUCCESS: Correct number of virtual disks : 12
...
ERROR: Big Data Appliance failed hardware validation checks
Note that in this case a service request should be logged because of the message: ERROR: Wrong disk 5 status : Online, Spun Up Yes alert. See the section "Errors Reported on Disk Seen in bdacheckhw Output"; the Yes alert indicates the disk is failing and should be replaced. The alert will also be shown in the output of the MegaCli64 LdPdInfo a0 or MegaCli64 pdlist a0 commands. Media errors, by contrast, are not a concern and do not indicate a need for disk replacement, as explained in "Errors Reported on Disk Seen in bdacheckcluster Output."
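If it is useful to run bdacheckhw on several nodes in one pass from node01, a simple loop over the node hostnames can be used. The following is a minimal sketch only: the hostnames bdanode02 through bdanode06 are placeholders for the actual node names in the cluster, and it assumes root SSH equivalence is configured between the nodes (as is typical on a deployed BDA cluster).

# for h in bdanode02 bdanode03 bdanode04 bdanode05 bdanode06; do echo "===== $h ====="; ssh root@$h bdacheckhw; done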
3. To run bdachecksw, enter the following:

# bdachecksw
Sample output showing warnings for wrong data disk partition info and an error for wrong mounted partitions:

# bdachecksw
...
WARNING: Wrong data disk sdk partition info : 1 primary fat16 boot 2 primary ext2
INFO: Expected data disk sdk partition info : "1 ext3 primary" or "1 primary"
WARNING: Wrong data disk sdl partition info :
INFO: Expected data disk sdl partition info : "1 ext3 primary" or "1 primary"
SUCCESS: Correct software RAID info : /dev/md2 level=raid1 num-devices=2 /dev/md0 level=raid1 num-devices=2
ERROR: Wrong mounted partitions : /dev/mapper/lvg1-lv1 /lv1 ext4 /dev/md0 /boot ext3 /dev/md2 / ext3 /dev/sda4 /u01 ext4 /dev/sdb4 /u02 ext4 /dev/sdc1 /u03 ext4 /dev/sdd1 /u06 ext4 /dev/sde1 /u07 ext4 /dev/sdf1 /u08 ext4 /dev/sdg1 /u09 ext4 /dev/sdh1 /u10 ext4 /dev/sdi1 /u11 ext4
...

MegaCli64 Commands to Assist in Troubleshooting

There are a number of MegaCli64 commands which can assist in troubleshooting disk failure. In general, pay attention to disks that have a Firmware state of Unconfigured. Here are some of the MegaCli64 commands that can be used, with sample output.

To check the Firmware state, issue the following command:

# MegaCli64 -PDList -a0 | grep Firmware
Sample output showing Firmware state: Unconfigured(good), Spun Up:

# MegaCli64 -PDList -a0 | grep Firmware
Firmware state: Online, Spun Up
Device Firmware Level: 061A
...
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 061A
Firmware state: Online, Spun Up
Device Firmware Level: 061A

If Firmware state: Unconfigured(good) or Firmware state: Unconfigured(bad) is seen, the disk will need to be replaced. Log a service request to replace the disk. Please note there is no need to be concerned about Firmware state: Online, Spun Up.
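To see which slot each firmware state belongs to, the Slot Number and Firmware state lines can be printed together. This is a minimal sketch that only filters fields already present in the MegaCli64 -PDList output; any slot whose state is not Online, Spun Up can then be investigated further.

# MegaCli64 -PDList -a0 | grep -E 'Slot Number|Firmware state'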
The physical drive information can also be stored in a file for review:

# MegaCli64 pdlist a0 > pdinfo.tmp
Open the pdinfo.tmp file you created in a text editor and search for entries such as the Firmware state and Foreign State of each disk.
The following example shows a Firmware state of Unconfigured(bad) and a Foreign State of Foreign:

...
Firmware state: Unconfigured(bad)
Device Firmware Level: 061A
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c50040a4a8b9
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST32000SSSUN2.0T061A1140L7VSMZ
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: Foreign
...

Here is an example of a disk that shows a Firmware state of Unconfigured(good):

...
Enclosure Device ID: 20
Slot Number: 4
Enclosure position: 0
Device Id: 18
WWN: 5000C500348B72B4
Sequence Number: 4
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.817 TB [0xe8b6d000 Sectors]
Firmware state: Unconfigured(good), Spun Up
Is Emergency Spare : NO
Device Firmware Level: 061A
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c500348b72b5
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST32000SSSUN2.0T061A1127L6LX53
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
...
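Rather than scanning the whole file by eye, the fields of interest can be pulled out of pdinfo.tmp with a filter. This is a minimal sketch; it only extracts lines already present in the MegaCli64 output shown above.

# grep -E 'Slot Number|Firmware state|Foreign State|Media Error Count|Predictive Failure Count' pdinfo.tmp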
Use MegaCli64 LdPdInfo a0 to verify the mapping of logical to physical drive numbers:

# MegaCli64 LdPdInfo a0 | more
Example output:

# MegaCli64 LdPdInfo a0 | more
Adapter #0

Number of Virtual Disks: 12
Virtual Drive: 0 (Target Id: 0)
...
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
...
Firmware state: Online, Spun Up
...
Foreign State: None
...
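To see the logical-to-physical mapping at a glance, the Virtual Drive and Slot Number lines from the same LdPdInfo output can be printed together. This is a minimal sketch relying only on fields normally present in that output:

# MegaCli64 LdPdInfo a0 | grep -E 'Virtual Drive|Slot Number'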
For predictive failures, check the value of "Predictive Failure Count" returned for each disk using:

# MegaCli64 -PDList -a0 | grep -E 'Slot|Failure Count'
Output for 12 healthy disks looks like:
# MegaCli64 -PDList -a0 | grep -E 'Slot|Failure Count'
Slot Number: 0
Predictive Failure Count: 0
Slot Number: 1
Predictive Failure Count: 0
...
Slot Number: 11
Predictive Failure Count: 0
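On a node with many disks it can help to flag only the nonzero counts. The one-liner below is a minimal sketch; it filters the same -PDList output with awk and prints each slot whose Predictive Failure Count is greater than zero.

# MegaCli64 -PDList -a0 | awk '/Slot Number/{slot=$3} /Predictive Failure Count/{if ($4>0) print "Slot " slot ": Predictive Failure Count " $4}'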
Gathering Information about the Disk Drives on the Server

The "lsscsi", "mount -l", and "ls -l /dev/disk/by-label" commands can provide information about disks that are missing from the output or mountpoints that are not found by the system during a disk failure.

Verify whether the disk is recognized using the "lsscsi" command:

# lsscsi
[0:0:20:0] enclosu SUN HYDE12 0341 -
[0:2:0:0]   disk    LSI      MR9261-8i   2.12  /dev/sda
[0:2:1:0]   disk    LSI      MR9261-8i   2.12  /dev/sdb
[0:2:2:0]   disk    LSI      MR9261-8i   2.12  /dev/sdc
[0:2:5:0]   disk    LSI      MR9261-8i   2.12  /dev/sdd   << slots 3 and 4 are not listed since [0:2:3:0] and [0:2:4:0] are not shown
[0:2:6:0]   disk    LSI      MR9261-8i   2.12  /dev/sde
[0:2:7:0]   disk    LSI      MR9261-8i   2.12  /dev/sdf
[0:2:8:0]   disk    LSI      MR9261-8i   2.12  /dev/sdg
[0:2:9:0]   disk    LSI      MR9261-8i   2.12  /dev/sdh
[0:2:10:0]  disk    LSI      MR9261-8i   2.12  /dev/sdi
[0:2:11:0]  disk    LSI      MR9261-8i   2.12  /dev/sdj
[7:0:0:0]   disk    Unigen   PSA4000     1100  /dev/sdk

If similar output is observed where slots are not listed, log a service request to investigate further whether the disk needs to be replaced.
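A healthy BDA server presents 12 virtual disks through the LSI controller (as also reported by bdacheckhw with "Correct number of virtual disks : 12"). A quick, minimal check is to count the controller disk entries in the lsscsi output; fewer than 12 means one or more disks are not visible to the operating system. The controller model string (MR9261-8i here) is taken from the sample above and may differ on other BDA hardware generations.

# lsscsi | grep -c MR9261-8i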
Determine whether the mountpoints are all listed using the "mount -l" command:

# mount -l
Sample output in which the mountpoint /dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04] is missing:

# mount -l
/dev/md2 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/md0 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda4 on /u01 type ext4 (rw,nodev,noatime) [/u01]
/dev/sdb4 on /u02 type ext4 (rw,nodev,noatime) [/u02]
/dev/sdc1 on /u03 type ext4 (rw,nodev,noatime) [/u03]   << this shows /dev/sdd1 on /u04 type ext4 (rw,nodev,noatime) [/u04] is missing
/dev/sde1 on /u05 type ext4 (rw,nodev,noatime) [/u05]
/dev/sdf1 on /u06 type ext4 (rw,nodev,noatime) [/u06]
/dev/sdg1 on /u07 type ext4 (rw,nodev,noatime) [/u07]
/dev/sdh1 on /u08 type ext4 (rw,nodev,noatime) [/u08]
/dev/sdi1 on /u09 type ext4 (rw,nodev,noatime) [/u09]
/dev/sdj1 on /u10 type ext4 (rw,nodev,noatime) [/u10]
/dev/sdk1 on /u11 type ext4 (rw,nodev,noatime) [/u11]
/dev/sdl1 on /u12 type ext4 (rw,nodev,noatime) [/u12]
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

If similar output with missing mount points is observed, log a service request to investigate whether the disk needs to be replaced.
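To check for missing data mountpoints without reading the whole mount listing, each expected mountpoint can be tested directly. This is a minimal sketch assuming the standard BDA layout of twelve /u01 through /u12 data mountpoints, as shown in the output above.

# for n in $(seq -w 1 12); do mountpoint -q /u$n || echo "/u$n is not mounted"; done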
Use "ls -l /dev/disk/by-label" to determine whether any device is missing. The following sample shows that u03 -> ../../sdc1 is missing:

# ls -l /dev/disk/by-label
total 0
lrwxrwxrwx 1 root root 10 Jul 29 19:30 BDAUSB -> ../../sdm1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 SWAP-sda3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Jul 29 19:30 SWAP-sdb3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u01 -> ../../sda4
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u02 -> ../../sdb4   << shows u03 -> ../../sdc1 is missing
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u04 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u05 -> ../../sde1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u06 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u07 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u08 -> ../../sdh1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u09 -> ../../sdi1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u10 -> ../../sdj1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u11 -> ../../sdk1
lrwxrwxrwx 1 root root 10 Jul 29 19:30 u12 -> ../../sdl1

If similar output with missing devices is observed, log a service request to investigate further whether the disk needs to be replaced.

Troubleshooting a Flashing Amber LED on an Oracle Big Data Appliance

If a flashing amber LED is seen on a server in the BDA cluster, please see the troubleshooting tips in the following document: Troubleshooting Flashing Amber LED on Oracle Big Data Appliance V2.0 for Non-Failure Status (Doc ID 1537798.1).
Checking the Disk Service Faults

To check the disk service faults, issue the following command as root:

# ipmitool sunoem led get | grep -i svc
DBP/HDD0/SVC | OFF
DBP/HDD1/SVC | OFF
DBP/HDD2/SVC | OFF
DBP/HDD3/SVC | OFF
DBP/HDD4/SVC | OFF
DBP/HDD5/SVC | OFF
DBP/HDD6/SVC | OFF
DBP/HDD7/SVC | OFF
DBP/HDD8/SVC | OFF
DBP/HDD9/SVC | OFF
DBP/HDD10/SVC | OFF
DBP/HDD11/SVC | OFF
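All twelve disk service LEDs read OFF on a healthy server, as in the sample above. As a minimal convenience, any LED that is not OFF can be isolated by filtering the same output; a disk whose SVC LED is lit typically indicates a fault that should be covered by a service request.

# ipmitool sunoem led get | grep -i svc | grep -vi off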
Errors Reported on Disk Seen in bdacheckcluster Output

If executing bdacheckcluster shows an info message about disk errors on one of the BDA nodes, such as:

INFO: Errors reported on disk 7 : 2 0
And running the "MegaCli64 ldpdinfo a0" command on the node shows the following:

Media Error Count: 2
Other Error Count: 0
Then no disk replacement is needed. The Media Error Count can be ignored; these are not failures but recoverable read/write errors. More information can be found in "Running bdacheckcluster utility on Oracle Big Data Appliance Reports: INFO: Errors Reported On Disk 7 : 8 0 (Doc ID 1568792.1)."
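To see the media and other error counts for every slot at once, the same -PDList output can be filtered; a minimal sketch is shown below. Nonzero media error counts on their own do not call for replacement, as noted above; they matter only alongside predictive failures or a S.M.A.R.T alert.

# MegaCli64 -PDList -a0 | grep -E 'Slot Number|Media Error Count|Other Error Count'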
Errors Reported on Disk Seen in bdacheckhw Output

If running bdacheckhw on a BDA node reports errors such as:

SUCCESS: Correct disk 1 status : Online, Spun Up No alert
ERROR: Wrong disk 2 status : Online, Spun Up Yes alert
INFO: Expected disk 2 status : Online, Spun Up No alert
...
INFO: Errors reported on disk 2 : 12 0
while the disk model and firmware checks are reported as successful:

SUCCESS: Correct disk 1 model : SEAGATE ST33000SSSUN3.0
SUCCESS: Sufficient disk 1 firmware (>=64A): 64A
SUCCESS: Correct disk 2 model : HITACHI H7230AS60SUN3.0
SUCCESS: Sufficient disk 2 firmware (>=A142): A310
SUCCESS: Correct disk 3 model : SEAGATE ST33000SSSUN3.0
SUCCESS: Sufficient disk 3 firmware (>=64A): 64A
and running MegaCli64 pdlist a0 (or MegaCli64 LdPdInfo a0) on the node shows entries for the disk such as:

Media Error Count: 12 (these are not failures; they represent recoverable read/write errors)
Predictive Failure Count: 5 (there are predictive failures)
Last Predictive Failure Event Seq Number: 15137
Drive has flagged a S.M.A.R.T alert : Yes (the S.M.A.R.T alert is showing Yes here)
The full entry for the disk (slot 2) looks like:

Enclosure Device ID: 20
Slot Number: 2
Drive's postion: DiskGroup: 11, Span: 0, Arm: 0
Enclosure position: 0
Device Id: 21
WWN: 5000CCA03E22B2AF
Sequence Number: 2
Media Error Count: 12
Other Error Count: 0
Predictive Failure Count: 5
Last Predictive Failure Event Seq Number: 15137
PD Type: SAS
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.727 TB [0x15d3ef000 Sectors]
Firmware state: Online, Spun Up
Is Commissioned Spare : NO
Device Firmware Level: A310
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000cca03e22b2ad
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: HITACHI H7230AS60SUN3.0TA3101237RM2KED
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :27C (80.60 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's write cache : Disabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : Yes
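To spot any disk on the node that has raised a S.M.A.R.T alert or accumulated predictive failures, the per-slot fields can be filtered from the -PDList output; a minimal sketch using only fields shown in the entry above:

# MegaCli64 -PDList -a0 | grep -E "Slot Number|Predictive Failure Count|S.M.A.R.T alert"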
Also check the output of the following commands, as described earlier, to confirm whether the disk is still visible to the operating system:

# mount -l
# lsscsi
Solution

The disk should be replaced at the soonest opportunity. Log a service request to have a disk drive shipped and replaced on the server. The Media Error Count can be ignored; these are not failures and represent recoverable read/write errors. More information can be found in "Running bdacheckcluster utility on Oracle Big Data Appliance Reports: INFO: Errors Reported On Disk 7 : 8 0 (Doc ID 1568792.1)." The disk, however, is failing, as shown by the Predictive Failure Count: 5 and the S.M.A.R.T alert: Yes. A predictive failure due to a S.M.A.R.T alert indicates a failing drive which should be replaced. A disk showing predictive failures/S.M.A.R.T alerts is still usable, although it has a high chance of becoming unusable in the near future. More information can be found in "bdaclustercheck/bdacheckhw Fail on Oracle Big Data Appliance: ERROR: Wrong disk status:Online,Spun Up Yes alert (Doc ID 1580223.1)."

Summary

There are many BDA utilities and MegaCli64 commands to assist in determining whether a disk is failing, failed, or fine. If a flashing amber LED is found and is determined to be ok per "Troubleshooting Flashing Amber LED on Oracle Big Data Appliance V2.0 for Non-Failure Status (Doc ID 1537798.1)", and/or if bdacheckcluster is run and only a media error count is identified, then the disk is fine.

References

NOTE:1537798.1 - Troubleshooting Flashing Amber LED on Oracle Big Data Appliance V2.0 for Non-Failure Status
NOTE:1568792.1 - Running bdacheckcluster utility on Oracle Big Data Appliance Reports: INFO: Errors Reported On Disk 7 : 8 0
NOTE:1580223.1 - Running bdaclustercheck/bdacheckhw Fail on Oracle Big Data Appliance: ERROR: Wrong disk status:Online,Spun Up Yes alert

Attachments

This solution has no attachment.