Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition
Solution Type: Predictive Self-Healing Sure Solution
1643715.1 : Oracle Big Data Appliance Exachk Health-Check Tool
Applies to:
Big Data Appliance Hardware
Big Data Appliance X6-2 Hardware
Big Data Appliance X3-2 Hardware
Big Data Appliance X4-2 Hardware
Big Data Appliance X5-2 Hardware
Linux x86-64

Purpose
Exachk for Oracle Big Data Appliance (BDA) is a health-check tool designed to audit important configuration settings within an Oracle BDA cluster. For each health check performed on a BDA, this reference document describes the benefit of the check, the risk if the check fails, and the steps to resolve a failed check.

Scope
This document is intended for anyone planning to run Exachk on a BDA.

Details
This document outlines the Exachk health check diagnostic information on BDA.

About Exachk
Exachk is a health-check tool for Engineered Systems. It automates auditing of customer systems for known configuration problems and best practices. The Exachk tool consists of data collection, analysis, and reporting stages.

Key Components of the Exachk Kit:
About Exachk on BDA
Exachk for Big Data Appliance supports all BDA versions later than 2.0.1. Running it is considered the standard BDA procedure for performing hardware and software health checks. BDA Exachk can audit important configuration settings within a BDA. Exachk examines the following components:
Exachk Known Issues
1. Exachk may time out with a message like:
    Timed out while checking password on bdanode0x. exachk is exiting.
By default, exachk waits 10 seconds to connect to the target node. If it cannot connect within that time, exachk gives up. In this case, increase RAT_PASSWORDCHECK_TIMEOUT.

Goals for BDA Health Checks
Recommended Validation Frequency
It is recommended that a BDA be validated immediately after initial deployment, before and after any change, and at least once a quarter as part of planned maintenance operations. The runtime duration of Exachk depends on the number of nodes to be checked, CPU load, network latency, etc.
Note: Plan to run exachk during times of least load on the system. This reduces the chance of runtime timeouts during health checks.
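The quarterly recommendation can be checked mechanically. The following is a minimal sketch, not part of Exachk: the function name validation_due, the 90-day default standing in for "a quarter", and the assumption that report zip files are kept in a single directory are all illustrative.

```shell
# Sketch: report whether an Exachk run is due, based on the age of existing report zips.
# Assumptions (not from Exachk itself): reports are kept as exachk_*.zip directly under
# report_dir, and "once a quarter" is approximated as 90 days.
validation_due() {
    local report_dir="$1"
    local max_age_days="${2:-90}"

    # find prints any report zip modified within the last max_age_days days;
    # grep -q . succeeds only if at least one such file was printed.
    if find "$report_dir" -maxdepth 1 -type f -name 'exachk_*.zip' \
            -mtime "-$max_age_days" | grep -q .; then
        return 1    # a recent report exists, no run is due
    fi
    return 0        # no recent report found, a run is due
}
```

For example, `validation_due /root/exachk_home && echo "Exachk run is due"` prints the reminder only when no report from roughly the last quarter is present.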
Exachk New Features
See the readme.txt and UserGuide.txt from the unzipped exachk.zip file for details on, and pointers to, bug fixes and new features.

Exachk Initial Deployment and Installation
The latest Exachk is located on My Oracle Support in Patch 18622611.
1. Download exachk.zip from the patch to a directory of your choice on the BDA. You can do this as the "root" user.
2. Extract the contents of exachk.zip.
a) Unzip exachk.zip:
    $ unzip exachk.zip
    ...
b) Verify the directory contents. Example output:
    $ ls -ltr
    total 91528
    -r-xr-xr-x 1 root root 8218911 Apr 4 2017 Apex5_CollectionManager_App.sql
    -r-xr-xr-x 1 root root 4816355 Sep 15 2016 CollectionManager_App.sql
    -r--r----- 1 root root 49666697 Oct 11 20:41 collections.dat
    drwxr-xr-x 2 root root 4096 Oct 19 06:36 doc
    -r-xr-xr-x 1 root root 2901231 Oct 11 20:05 exachk
    -rw-r--r-- 1 root root 1976299 Jul 19 01:03 EXAchk_Health_Check_Catalog.html
    -rw-r--r-- 1 root root 19691135 Oct 19 06:35 exachk.zip
    drwxr-xr-x 2 root root 4096 Oct 11 19:14 exadiscover
    -r--r--r-- 1 root root 4898 Oct 18 09:41 readme.txt
    -r--r----- 1 root root 6368905 Oct 11 20:41 rules.dat
    -r-xr-xr-x 1 root root 40052 Jul 22 2015 sample_user_defined_checks.xml
    drwxr-xr-x 2 root root 4096 Oct 11 19:14 templates
    -r-xr-xr-x 1 root root 2888 Oct 9 2015 user_defined_checks.xsd
    -r--r--r-- 1 root root 234 Apr 1 2017 UserGuide.txt
3. Check the exachk version:
    ./exachk -v
    EXACHK VERSION: 12.2.0.1.3_20171011
4. Add the location of the exachk executable to the PATH in /root/.bash_profile so that it can be invoked from anywhere. This step is optional but recommended. For example, update /root/.bash_profile as follows:
From:
    # User specific environment and startup programs
    PATH=$PATH:$HOME/bin
To:
    # User specific environment and startup programs
    PATH=$PATH:$HOME/bin:<path to exachk>
If exachk is installed in /root/exachk_home (for example), update /root/.bash_profile with:
    PATH=$PATH:$HOME/bin:/root/exachk_home
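The profile update in step 4 can also be scripted. This is a minimal sketch under stated assumptions: the function name add_exachk_to_path is hypothetical, and instead of editing the existing PATH line in place it appends a separate PATH line, which has the same effect once the profile is re-read.

```shell
# Sketch: append the exachk directory to PATH in a bash profile, at most once.
# add_exachk_to_path is an illustrative helper, not part of the exachk kit.
add_exachk_to_path() {
    local exachk_dir="$1"
    local profile="${2:-$HOME/.bash_profile}"
    # The \$ keeps $PATH literal so it is expanded at login, not now.
    local line="PATH=\$PATH:$exachk_dir"

    # -x matches whole lines, -F treats the pattern as a fixed string,
    # so rerunning the helper does not add a duplicate entry.
    if [ -f "$profile" ] && grep -qxF "$line" "$profile"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$profile"
}
```

For example, `add_exachk_to_path /root/exachk_home /root/.bash_profile` followed by a new login shell (or `. /root/.bash_profile`) makes `exachk` resolvable from any directory.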
Running Exachk
This section gives an overview of the Exachk options for the BDA.
Note: Not all options apply to BDA.
Unless otherwise noted, run exachk as root. Run from Node1 of the BDA cluster.

Prerequisites
1. The password for each Infiniband switch is required for most data collection options. (This is only the case if there is no ssh user equivalency from the running compute node to the switch.) Use
1. To show usage, run the command below as root or as a non-root user:
    ./exachk -h
    Usage : ./exachk [-abvhpfmsuSo:c:t:]
    -a All (Perform best practice check and recommended patch check)
    -b Best Practice check only. No recommended patch check
    -h Show usage
    -v Show version
    ...
2. The list of Exachk options supported for BDA is:
    Usage for BDA Exachk
    -a (Perform best practice check and recommended patch check. This is the default option. If no options are specified exachk runs with -a)
    -diff <Old Report> <New Report> [-outfile <Output HTML>]
    -excludeprofile
    -profile Pass specific profile.
Note that running any profiles other than those listed above will return:
    <profile_name> is not supported component. exachk will run generic checks for components identified from environment
3. For example, to perform all checks, including best practice checks and recommendations, run:
    # ./exachk -a
Note: The -a option is the default. You do not have to specify it. Running ./exachk with no options runs ./exachk -a.
Output looks like:
    # ./exachk -a
    Checking ssh user equivalency settings on all nodes in cluster
    Node <BDANode01> is configured for ssh user equivalency for root user
    ...
    Node <BDANode0n> is configured for ssh user equivalency for root user
    Copying plug-ins
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    9 of the included audit checks require root privileged data collection on INFINIBAND SWITCH .
    1. Enter 1 if you will enter root password for each INFINIBAND SWITCH when prompted
    2. Enter 2 to exit and to arrange for root access and run the exachk later.
    3. Enter 3 to skip checking best practices on INFINIBAND SWITCH
    Please indicate your selection from one of the above options for INFINIBAND SWITCH[1-3][1]:- 1
    Is root password same on all INFINIBAND SWITCH ?[y/n][y]y
    Enter root password for INFINIBAND SWITCH :-
    Verifying root password.
    . . . .
    *** Checking Best Practice Recommendations (PASS/WARNING/FAIL) ***
    Collections and audit checks log file is
    /<dir>/exachk_<BDANode0x>_040414_091246/log/exachk.log
    Starting to run exachk in background on <BDANode01>
    ...
    Starting to run exachk in background on <BDANode0n>
    =============================================================
    Node name - <BDANode01>
    =============================================================
    Collecting - Verify ASR configuration check via ASREXACHECK
    Starting to run root privileged commands in background on INFINIBAND SWITCH <RackName>sw-ib1.
    Starting to run root privileged commands in background on INFINIBAND SWITCH <RackName>sw-ib2.
    Starting to run root privileged commands in background on INFINIBAND SWITCH <RackName>sw-ib3.
    Collections from INFINIBAND SWITCH:
    ------------------------------------
    Collecting - Infiniband Switch NTP configuration
    Collecting - Infiniband switch HOSTNAME configuration
    Data collections completed. Checking best practices on <BDANode01>
    --------------------------------------------------------------------------------------
    ...
    Copying results from <BDANode02> and generating report. This might take a while. Be patient.
    =============================================================
    Node name - <BDANode02>
    =============================================================
    Collecting - Verify ASR configuration check via ASREXACHECK
    Data collections completed. Checking best practices on <BDANode02>
    --------------------------------------------------------------------------------------
    ...
    ---------------------------------------------------------------------------------
    Detailed report (html) - /<dir>/exachk_<BDANode01>_040414_091246/exachk_<BDANode01>_040414_091246.html
    UPLOAD(if required) - /<dir>/exachk_<BDANode01>_040414_091246.zip

Known Issues
1. On BDA V2.4/V2.5 only, running exachk may incorrectly indicate failed software validation checks.
    FAIL => Big Data Appliance failed software validation checks.
In this case further analysis shows that:
a) bdacheckcluster, bdachecksw and bdacheckhw all complete successfully.
    <name>.html.out:ERROR: Wrong mounted partitions :
    /dev/md2 / ext3
    /dev/md0 /boot ext3
    /dev/sd4 /u01 ext4
    /dev/sd4 /u02 ext4
    /dev/sd1 /u03 ext4
    /dev/sd1 /u04 ext4
    /dev/sd1 /u05 ext4
    /dev/sd1 /u06 ext4
    /dev/sd1 /u07 ext4
    /dev/sd1 /u08 ext4
    /dev/sd1 /u09 ext4
    /dev/sd1 /u10 ext4
    /dev/sd1 /u11 ext4
    /dev/sd1 /u12 ext4
    INFO: Expected mounted partitions : 12 data partitions, /boot and /
    ...
    ERROR: Big Data Appliance failed software validation checks
On BDA V2.4/2.5, for such symptoms, ignore the software validation check error.
2. If ssh to a given switch is slow, exachk can fail with the error below. In this case, increase the SSH timeout using an exachk environment variable.
    Starting to run root privileged commands in background on INFINIBAND SWITCH <cluster>sw-ib1.
    Timed out
    Unable to create temp directory on <cluster>sw-ib1
    Skipping root privileged commands on INFINIBAND SWITCH <cluster> sw-ib1 is available but SSH is blocked.
a) Set RAT_PASSWORDCHECK_TIMEOUT to 40. In bash the variable must be exported so that the exachk process inherits it:
    # export RAT_PASSWORDCHECK_TIMEOUT=40
b) Rerun exachk, for example:
    # ./exachk -a
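The timeout workaround can be wrapped so the variable is set only for the rerun. This is a minimal sketch: RAT_PASSWORDCHECK_TIMEOUT is the variable named in this note, while the wrapper name run_with_timeout is hypothetical. It relies on the standard shell behavior that a `VAR=value command` prefix places the variable in that command's environment.

```shell
# Sketch: run a command with RAT_PASSWORDCHECK_TIMEOUT set for that command only.
# run_with_timeout is an illustrative wrapper, not an exachk option.
run_with_timeout() {
    local timeout_secs="$1"
    shift

    # Reject empty or non-numeric values before starting a long health check.
    case "$timeout_secs" in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 2
            ;;
    esac

    # The assignment prefix exports the variable to the command being run,
    # without changing the calling shell's environment.
    RAT_PASSWORDCHECK_TIMEOUT="$timeout_secs" "$@"
}
```

For example, `run_with_timeout 40 ./exachk -a` reruns the full check with a 40-second connection timeout.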
Exachk Output
The output of Exachk is displayed at the end of the health check, and looks like:
    Detailed report (html) - /<path to exachk installation>/exachk_<hostname>_<date>_<timestamp>/exachk_<hostname>_<date>_<timestamp>.html
    UPLOAD(if required) - /<path to exachk installation>/exachk_<hostname>_<date>_<timestamp>.zip
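When Exachk output is captured to a file, the two result paths can be pulled out of it. This is a minimal sketch that assumes the final lines have exactly the "Detailed report (html) - " and "UPLOAD(if required) - " prefixes shown above; the function name extract_report_paths is hypothetical.

```shell
# Sketch: read captured exachk console output on stdin and print the HTML report
# path and the upload zip path, one per line, in the order they appear.
# extract_report_paths is an illustrative helper, not part of exachk.
extract_report_paths() {
    # In basic regular expressions the parentheses are literal characters.
    sed -n \
        -e 's/^[[:space:]]*Detailed report (html) - *//p' \
        -e 's/^[[:space:]]*UPLOAD(if required) - *//p'
}
```

For example, `./exachk -a | tee run.out` followed by `extract_report_paths < run.out` prints the two paths, assuming the interactive prompts have been handled.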
Note: Do not rename any of the Exachk output report files or directories.
    Directory: exachk_scaj31bda01_041314_214504
    Zip file: exachk_scaj31bda01_041314_214504.zip
The directory contains the HTML report, and the zip file as well as other supporting files and directories.
Note: The directory in which the subdirectory and zip file are created should be cleaned up on a regular basis.
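Regular cleanup can start from a dry-run listing of old output. This is a minimal sketch under stated assumptions: the function name list_old_exachk_output and the 90-day default retention are illustrative, and output entries are assumed to follow the exachk_<hostname>_<date>_<timestamp> naming shown above. It only lists candidates and removes nothing.

```shell
# Sketch: list exachk output directories and zip files older than keep_days.
# The name pattern needs three underscores, so exachk.zip and the exachk
# executable themselves are never listed.
# list_old_exachk_output and the 90-day default are assumptions for illustration.
list_old_exachk_output() {
    local base_dir="$1"
    local keep_days="${2:-90}"

    # -mtime +N selects entries last modified more than N full days ago.
    find "$base_dir" -mindepth 1 -maxdepth 1 \
        -name 'exachk_*_*_*' -mtime "+$keep_days" -print
}
```

Review the listing before deleting anything, for example with `list_old_exachk_output /root/exachk_home 180`.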
BDA Health Assessment Report
The HTML report contains the following sections. The sections vary depending on the options selected while executing Exachk:

Exachk Summary
This section of the report summarizes the key data collected from the Exachk environment. It shows:

Findings Needing Attention
This section lists the health checks that failed or that resulted in an ERROR, WARNING or INFO status. Only the issues reported in the "Findings Needing Attention" section are real problems.
Findings Passed
This section lists the health checks that passed.

System Wide Automatic Service Request (ASR) Healthcheck
ASRExacheck is designed to check and test ASR configurations to make sure that a BDA can communicate with the ASR Manager. It is a non-invasive script that checks configurations only and does not write to any system or configuration files. It checks for known configuration issues and for any previous hardware faults that may not have been reported by ASR due to a misconfiguration on the BDA.

Known Issues
On BDA V2.4/V2.5 only, the BDA Health Assessment report may show:
    Verify BDA Software Profile -- Failure
In this case the output under "FAIL => Big Data Appliance failed software validation checks" reports "ERROR: Wrong mounted partitions":
    ERROR: Wrong mounted partitions :
    /dev/md2 / ext3
    /dev/md0 /boot ext3
    /dev/sd4 /u01 ext4
    /dev/sd4 /u02 ext4
    /dev/sd1 /u03 ext4
    /dev/sd1 /u04 ext4
    /dev/sd1 /u05 ext4
    /dev/sd1 /u06 ext4
    /dev/sd1 /u07 ext4
    /dev/sd1 /u08 ext4
    /dev/sd1 /u09 ext4
    /dev/sd1 /u10 ext4
    /dev/sd1 /u11 ext4
    /dev/sd1 /u12 ext4
    INFO: Expected mounted partitions : 12 data partitions, /boot and /
However, bdacheckcluster, bdachecksw and bdacheckhw all complete successfully when run. On BDA V2.4/2.5, for such symptoms, ignore the software validation check error.

Comparing Two Exachk Reports
You can compare two Exachk reports by using the -diff option with the exachk command. The -diff option generates a comparison HTML report, which can be used to find changes in the health of a BDA between Exachk runs. You can also use this report to find checks that have been added to a new version of Exachk.
    # ./exachk -diff report1 report2 [-outfile name_of_compared_report.html]
- report1 and report2 are the names of the reports being compared.
- The -outfile option is optional. By default, when the exachk -diff command is run, the comparison report is stored in a file called exachk_report1_report2_diff.html
a) Default output file:
    # ./exachk -diff exachk_<host>_040314_073600.zip exachk_<host>_040314_134241.zip
    Summary
    Total : 14
    Missing : 0
    New : 0
    Changed : 0
    Same : 14
    File comparison is complete. The comparison report can be viewed in: /<path>/exachk_040314073600_040314134241_diff.html
b) With the -outfile option:
    # ./exachk -diff exachk_<host>_040314_073600.zip exachk_<host>_040314_134241.zip -outfile compared_report.html
    Summary
    Total : 14
    Missing : 0
    New : 0
    Changed : 0
    Same : 14
    File comparison is complete. The comparison report can be viewed in: /<path>/compared_report.html
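In the default-output example above, the comparison file name is built from the date and time fields of the two report names. The following is a minimal sketch of that naming, inferred only from the example and not from any documented Exachk rule; the helper names report_stamp and default_diff_name are hypothetical.

```shell
# Sketch: predict the default -diff output file name from two report zip names.
# Inferred from the example above (exachk_<host>_040314_073600.zip ->
# 040314073600); this is an assumption, not a documented exachk rule.
report_stamp() {
    local name stamp
    name="$(basename "$1" .zip)"
    # Take the trailing _<6 digits>_<6 digits> and join the two fields.
    stamp="$(printf '%s\n' "$name" |
        sed -n 's/^.*_\([0-9]\{6\}\)_\([0-9]\{6\}\)$/\1\2/p')"
    if [ -z "$stamp" ]; then
        echo "unrecognized report name: $1" >&2
        return 1
    fi
    printf '%s\n' "$stamp"
}

default_diff_name() {
    local old_stamp new_stamp
    old_stamp="$(report_stamp "$1")" || return 1
    new_stamp="$(report_stamp "$2")" || return 1
    printf 'exachk_%s_%s_diff.html\n' "$old_stamp" "$new_stamp"
}
```

This can help a wrapper script locate the comparison report after a -diff run; passing -outfile avoids depending on the inferred naming at all.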
Full Set of Rules and Checks
Best practices and other recommendations are generally items documented in various sources which could be overlooked. Exachk assesses them and calls attention to any findings. The current Exachk makes the following checks:
- Verify Subnet Manager
- Verify BDA Software Profile
- Verify NTP Synchronization
- bdachecknet
- Verify BDA Hardware Profile
- Verify ILOM Power Up Configuration
- Verify MTU Size
- bdacheckib
- Verify DNS Setup
- Verify InfiniBand Cable Connection Quality
- Infiniband Switch NTP configuration
- Infiniband switch HOSTNAME configuration

Troubleshooting
For support on any problems that you might encounter while using Exachk, create a service request via My Oracle Support.

Runtime Command Timeouts
During the health check process, if a particular node or switch does not respond to a health-check command within a predefined duration, Exachk terminates that command. This prevents the program from freezing. On a busy system, targets are more likely to exceed the default timeouts, so more commands are terminated.
Note: To avoid runtime command timeouts during health checks, run the tool when the system is under the least load.
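Before opening a service request, it can help to see how many commands timed out in a run. This is a minimal sketch: the function name count_timeouts is hypothetical, and it assumes that timeout messages containing the words "Timed out", as in the console messages quoted in this note, also appear in the file being searched (for example the run's log/exachk.log or captured console output).

```shell
# Sketch: count lines mentioning a timeout in an exachk log or captured output.
# count_timeouts is an illustrative helper; whether exachk.log records the same
# "Timed out" wording as the console is an assumption.
count_timeouts() {
    local log_file="$1"

    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi

    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so that status is deliberately ignored here.
    grep -c -i 'timed out' "$log_file" || true
}
```

A non-zero count suggests rerunning during a quieter period, as the note above recommends.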
Attachments
This solution has no attachment