Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-79-1643715.1
Update Date:2018-05-29
Keywords:

Solution Type  Predictive Self-Healing Sure

Solution  1643715.1 :   Oracle Big Data Appliance Exachk Health-Check Tool  


Related Items
  • Big Data Appliance X5-2 Hardware
  • Big Data Appliance Hardware
  • Big Data Appliance X3-2 Hardware
  • Big Data Appliance X4-2 Hardware
  • Big Data Appliance X6-2 Hardware
Related Categories
  • PLA-Support>Eng Systems>BDA>Big Data Appliance>DB: BDA_EST
  • _Old GCS Categories>Support>SET>DiagnosticTools>Health Check Rulesets




In this Document
Purpose
Scope
Details
 About Exachk
 About Exachk on BDA
 Exachk Known Issues
 Goals for BDA Health Checks
 Recommended Validation Frequency
 Exachk New Features
 Exachk Initial Deployment and Installation
 Running Exachk
 Prerequisites
 Use
 Known Issues
 Exachk Output
 BDA Health Assessment Report
 Exachk Summary
 Findings Needing Attention
 Findings Passed
 System Wide Automatic Service Request(ASR) Healthcheck
 Known Issues
 Comparing Two Exachk Reports
 Full Set of Rules and Checks
 Verify Subnet Manager
 Verify BDA Software Profile
 Verify NTP Synchronization
 bdachecknet
 Verify BDA Hardware Profile
 Verify ILOM Power Up Configuration
 Verify MTU Size
 bdacheckib
 Verify DNS Setup
 Verify InfiniBand Cable Connection Quality
 Infiniband Switch NTP configuration
 Infiniband switch HOSTNAME configuration
 Troubleshooting
 Runtime Command Timeouts
References


Applies to:

Big Data Appliance Hardware
Big Data Appliance X6-2 Hardware
Big Data Appliance X3-2 Hardware
Big Data Appliance X4-2 Hardware
Big Data Appliance X5-2 Hardware
Linux x86-64

Purpose

Exachk for Oracle Big Data Appliance (BDA) is a health-check tool designed to audit important configuration settings within an Oracle BDA cluster. For each health check that Exachk performs on a BDA, this reference document describes the benefit of the check, the risk if the check fails, and the steps to resolve a failure.

Scope

This document is intended for anyone planning to use and run Exachk on a BDA.

Details

This document outlines the Exachk health check diagnostic information on BDA.

About Exachk

Exachk is a healthcheck tool for Engineered Systems.  It automates auditing of customer systems for known configuration problems and best practices. The Exachk tool consists of data collection, analysis, and reporting stages.

Key Components of the Exachk Kit:

  • exachk – bash shell script
  • collections.dat – driver file (required)
  • rules.dat – driver file (required for full analysis mode)
  • readme.txt

About Exachk on BDA

Exachk for Big Data Appliance supports all BDA versions later than 2.0.1.  It is considered the standard BDA procedure for performing hardware and software health checks.

BDA Exachk can audit important configuration settings within a BDA. Exachk examines the following components:

  • Compute – CPU
  • Hardware, Firmware, BIOS
  • Operating System - kernel parameters, system packages
  • Network - Ethernet, InfiniBand
  • Memory - RAM, disks
  • Software Installed

Exachk Known Issues

1. If exachk times out with a message like:

Timed out while checking password on bdanode0x.
Set RAT_PASSWORDCHECK_TIMEOUT to increase timeout. Ex: export RAT_PASSWORDCHECK_TIMEOUT=10

exachk is exiting.


To resolve:

a) Log into the node where you are running exachk and note how long the login takes, in seconds.
b) If the login takes more than 10 seconds, increase the timeout using the RAT_PASSWORDCHECK_TIMEOUT environment variable, as the error message suggests.  Set RAT_PASSWORDCHECK_TIMEOUT to a value a little longer than the time a manual ssh login takes.

By default exachk waits 10 seconds to connect to the target node. If it cannot connect within that time, exachk gives up, which is why RAT_PASSWORDCHECK_TIMEOUT needs to be increased in this case.
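
The timing step above can be sketched in shell as follows. This is a minimal sketch and not part of exachk: the helper name suggest_timeout and the default margin of 5 seconds are assumptions made for illustration, and bdanode0x stands for the node named in the timeout message.

```shell
# Hypothetical helper (not shipped with exachk): given a measured ssh login
# time in whole seconds, print a timeout slightly longer than that.
# The default margin of 5 seconds is an assumption; adjust as needed.
suggest_timeout() {
    measured=$1
    margin=${2:-5}
    echo $(( measured + margin ))
}

# Example use (commented out so the snippet has no side effects):
#   start=$(date +%s)
#   ssh root@bdanode0x true        # node named in the timeout message
#   end=$(date +%s)
#   export RAT_PASSWORDCHECK_TIMEOUT=$(suggest_timeout $(( end - start )))
```

The variable has to be exported in the same shell session from which exachk is then started.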

Goals for BDA Health Checks

  1. Provide a mechanism to check the complete health of a BDA system on a proactive (before an issue happens) and reactive (after an issue happens) basis.
  2. Provide a “recommendation engine” for best practices and tips to fix known issues on the BDA.

Recommended Validation Frequency

It is recommended that a BDA be validated immediately after initial deployment, before and after any change, and at least once a quarter as part of planned maintenance operations. The runtime duration of Exachk depends on the number of nodes to be checked, CPU load, network latency, etc.

Note:  Plan to run exachk when the load on the system is lowest.  This reduces the chance of runtime timeouts during health checks.

Exachk New Features

See the readme.txt and UserGuide.txt from the unzipped exachk.zip file for details/pointers to bug fixes and new features.

Exachk Initial Deployment and Installation

The latest Exachk is located on My Oracle Support in Patch 18622611.

1. Download exachk.zip from the patch to a directory of your choice on the BDA.  You can do this as the "root" user.

2. Extract the contents of exachk.zip. 

a) Unzip exachk.zip:

$ unzip exachk.zip

...
inflating: readme.txt
inflating: doc/ORAchk_and_EXAchk_User_Guide.pdf

b) Verify the directory contents. Example output:

$ ls -ltr
  
total 91528
-r-xr-xr-x 1 root root 8218911 Apr 4 2017 Apex5_CollectionManager_App.sql
-r-xr-xr-x 1 root root 4816355 Sep 15 2016 CollectionManager_App.sql
-r--r----- 1 root root 49666697 Oct 11 20:41 collections.dat
drwxr-xr-x 2 root root 4096 Oct 19 06:36 doc
-r-xr-xr-x 1 root root 2901231 Oct 11 20:05 exachk
-rw-r--r-- 1 root root 1976299 Jul 19 01:03 EXAchk_Health_Check_Catalog.html
-rw-r--r-- 1 root root 19691135 Oct 19 06:35 exachk.zip
drwxr-xr-x 2 root root 4096 Oct 11 19:14 exadiscover
-r--r--r-- 1 root root 4898 Oct 18 09:41 readme.txt
-r--r----- 1 root root 6368905 Oct 11 20:41 rules.dat
-r-xr-xr-x 1 root root 40052 Jul 22 2015 sample_user_defined_checks.xml
drwxr-xr-x 2 root root 4096 Oct 11 19:14 templates
-r-xr-xr-x 1 root root 2888 Oct 9 2015 user_defined_checks.xsd
-r--r--r-- 1 root root 234 Apr 1 2017 UserGuide.txt


3. Verify the exachk version. For example, as the root user, run:

./exachk -v
 
EXACHK VERSION: 12.2.0.1.3_20171011

4. Add the location of the exachk executable to the PATH in /root/.bash_profile so that it can be invoked from anywhere. This step is optional but recommended.  For example, update /root/.bash_profile as follows:

From:

# User specific environment and startup programs

PATH=$PATH:$HOME/bin

To:

# User specific environment and startup programs

PATH=$PATH:$HOME/bin:<path to exachk>

If exachk is installed in /root/exachk_home (for example) update /root/.bash_profile with:

PATH=$PATH:$HOME/bin:/root/exachk_home
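
The PATH update in step 4 can also be scripted. The sketch below is illustrative only: the function name add_exachk_to_profile is hypothetical, and instead of editing the existing PATH line it appends a new one, which has the same effect when the profile is sourced. It leaves the file untouched if the directory is already mentioned in it.

```shell
# Hypothetical helper: append a PATH line for the exachk directory to a
# profile file, unless a line mentioning that directory is already present.
add_exachk_to_profile() {
    exachk_dir=$1
    profile=${2:-/root/.bash_profile}
    if grep -qF -- "$exachk_dir" "$profile" 2>/dev/null; then
        return 0    # already present; leave the file unchanged
    fi
    # Single quotes keep $PATH literal so it is expanded at login time.
    printf 'PATH=$PATH:%s\nexport PATH\n' "$exachk_dir" >> "$profile"
}

# Example use (commented out so the snippet has no side effects):
#   add_exachk_to_profile /root/exachk_home
```

Log in again, or source /root/.bash_profile, for the change to take effect.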

Running Exachk

This section gives an overview of the Exachk options for the BDA.

Note: Not all options apply to BDA.  
Unless otherwise noted, run exachk as root. Run from Node1 of the BDA cluster. 

Prerequisites

1. The password for each Infiniband switch is required for most data collection options. (This is only the case if there is no ssh user equivalency from the compute node running exachk to the switch.)
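
Whether a switch password will be needed can be checked ahead of time. The following is a sketch under stated assumptions, not an exachk feature: the function name has_ssh_equivalency and the SSH_CMD override are hypothetical, and the check relies on the standard ssh BatchMode option, which makes ssh fail instead of prompting for a password.

```shell
# Hypothetical helper: succeed only if passwordless ssh (user equivalency)
# to the given host works for root. BatchMode=yes makes ssh fail rather
# than prompt for a password. SSH_CMD can be overridden, e.g. for testing.
has_ssh_equivalency() {
    host=$1
    ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=10 "root@$host" true \
        >/dev/null 2>&1
}

# Example use; the switch names below are placeholders:
#   for sw in <RackName>sw-ib1 <RackName>sw-ib2 <RackName>sw-ib3; do
#       has_ssh_equivalency "$sw" || echo "password needed for $sw"
#   done
```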

Use

1. To show usage run the command below as root or non-root user:

 ./exachk -h
  
Usage : ./exachk [-abvhpfmsuSo:c:t:]
        -a      All (Perform best practice check and recommended patch check)
        -b      Best Practice check only. No recommended patch check
        -h      Show usage
        -v      Show version
         ...

2.  The list of Exachk options supported for BDA is:

Usage for BDA Exachk

        -a      (Perform best practice check and recommended patch check.  This is the default option.  If no options are specified exachk runs with -a)
        -b      Best Practice check only. No recommended patch check
        -h      Show usage
        -v      Show version
        -m      exclude checks for Maximum Availability Architecture (MAA) scorecards(see user guide for more details)
        -o      Argument to an option. If -o is followed by v, V, Verbose or VERBOSE, checks that pass are printed on the screen.
                 If the -o option is not specified, only failures are printed on the screen. For example: exachk -a -o v
        -clusternodes
                Pass comma separated node names to run exachk only on subset of nodes.
        -localonly
                Run exachk only on local node.

        -debug  Run exachk in debug mode. Debug log will be generated.
                eg:- ./exachk -debug
                Output goes to stdout as well as generated log files

        -nopass  Do not print PASS'ed checks in the exachk report or upload them to the database.

        -noscore  Do not print healthscore in HTML report.

        -diff <Old Report> <New Report> [-outfile <Output HTML>]
                Diff two exachk reports. Pass directory name or zip file or html report file as <Old Report> & <New Report>
        -<initsetup|initrmsetup|initcheck|initpresetup>
                initsetup       : Setup auto restart. Auto restart functionality automatically brings up exachk daemon when node starts
                initrmsetup   : Remove auto restart functionality
                initcheck       : Check if auto restart functionality is setup or not
                initpresetup  : Sets root user equivalency for COMPUTE, STORAGE and IBSWITCHES.(root equivalency for COMPUTE nodes is mandatory for setting up auto restart   functionality)                        
        -d <start|start_debug|stop|status|info|stop_client|nextautorun>
                start           : Start the exachk daemon
                start_debug     : Start the exachk daemon in debug mode
                stop            : Stop the exachk daemon
                status          : Check if the exachk daemon is running

        -daemon
                run exachk only if daemon is running

       -nodaemon
                Dont use daemon to run exachk

       -set
                configure exachk daemon parameter like "param1=value1;param2=value2... "

                 Supported parameters are:-

                 AUTORUN_INTERVAL <n[d|h]> :- Automatic rerun interval in daemon mode.Set it zero to disable automatic rerun which is zero.

                 AUTORUN_SCHEDULE * * * *       :- Automatic run at specific time in daemon mode.
                                  - - - -
                                  ¦ ¦ ¦ ¦
                                  ¦ ¦ ¦ +----- day of week (0 - 6) (0 to 6 are Sunday to Saturday)
                                  ¦ ¦ +---------- month (1 - 12)
                                  ¦ +--------------- day of month (1 - 31)
                                  +-------------------- hour (0 - 23)

                     example: exachk -set "AUTORUN_SCHEDULE=8,20 * * 2,5" will schedule runs on tuesday and friday at 8 and 20 hour.

                 AUTORUN_FLAGS <flags> : exachk flags to use for auto runs.

                     example: exachk -set "AUTORUN_INTERVAL=12h;AUTORUN_FLAGS=-profile sysadmin" to run sysadmin profile every 12 hours

                              exachk -set "AUTORUN_INTERVAL=2d;AUTORUN_FLAGS=-profile dba" to run dba profile once every 2 days.

                 NOTIFICATION_EMAIL : Comma separated list of email addresses used for notifications by daemon if mail server is configured.

                 PASSWORD_CHECK_INTERVAL <number of hours> : Interval to verify passwords in daemon mode

                 COLLECTION_RETENTION <number of days> : Purge exachk collection directories and zip files older than specified days.

       -unset <parameter>
                unset the parameter
                  example: exachk -unset "AUTORUN_SCHEDULE"

       -get <parameter | all>
                Print the value of parameter

        -excludeprofile
                Pass specific profile.
                List of supported profiles is same as for -profile.

       -merge
                Pass comma separated collection names(directory or zip files) to merge collections and prepare single report.
                eg:- ./exachk -merge exachk_hostname1_db1_120213_163405.zip,exachk_hostname2_db2_120213_164826.zip

       -profile Pass specific profile.
                 List of supported profiles for BDA:
                 switch          Infiniband switch checks
                 sysadmin     sysadmin checks
      
       -ibswitches
                Pass comma separated infiniband switch names to run exachk only on selected infiniband switches.

  

Note that running any profile other than those listed above returns:

<profile_name> is not supported component. exachk will run generic checks for components identified from environment
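
Returning to the AUTORUN_SCHEDULE parameter in the option list above: its value has four fields, in the order hour, day of month, month, day of week. The sketch below assembles that value from its parts. The function name autorun_schedule is hypothetical and is not part of exachk; quoting each field keeps the shell from expanding a bare * into file names.

```shell
# Hypothetical helper: build the value passed to "exachk -set" for
# AUTORUN_SCHEDULE from its four fields (hour, day of month, month,
# day of week), in the order shown in the usage text.
autorun_schedule() {
    printf 'AUTORUN_SCHEDULE=%s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Example use (requires the exachk daemon; commented out here):
#   exachk -set "$(autorun_schedule '8,20' '*' '*' '2,5')"
```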

 

3. For example, to perform all checks, including best practice checks and recommended patch checks, run:

# ./exachk -a

Note: The -a option is the default. You do not have to specify it.  Running ./exachk with no options runs ./exachk -a.

Output looks like:

# ./exachk -a
  

Checking ssh user equivalency settings on all nodes in cluster

Node <BDANode01> is configured for ssh user equivalency for root user
...

Node <BDANode0n> is configured for ssh user equivalency for root user

Copying plug-ins
. . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . .


9 of the included audit checks require root privileged data collection on INFINIBAND SWITCH .

1. Enter 1 if you will enter root password for each INFINIBAND SWITCH when prompted

2. Enter 2 to exit and to arrange for root access and run the exachk later.

3. Enter 3 to skip checking best practices on INFINIBAND SWITCH

Please indicate your selection from one of the above options for INFINIBAND SWITCH[1-3][1]:- 1

Is root password same on all INFINIBAND SWITCH ?[y/n][y]y

Enter root password for INFINIBAND SWITCH :-

Verifying root password.
. . .

*** Checking Best Practice Recommendations (PASS/WARNING/FAIL) ***

Collections and audit checks log file is
/<dir>/exachk_<BDANode0x>_040414_091246/log/exachk.log

Starting to run exachk in background on <BDANode01>
...
Starting to run exachk in background on <BDANode0n>

=============================================================
                    Node name - <BDANode01>
=============================================================

Collecting - Verify ASR configuration check via ASREXACHECK

Starting to run root privileged commands in background on INFINIBAND SWITCH <RackName>sw-ib1.

Starting to run root privileged commands in background on INFINIBAND SWITCH <RackName>sw-ib2.

Starting to run root privileged commands in background on INFINIBAND SWITCH <RackName>sw-ib3.

Collections from INFINIBAND SWITCH:
------------------------------------
Collecting - Infiniband Switch NTP configuration
Collecting - Infiniband switch HOSTNAME configuration


Data collections completed. Checking best practices on <BDANode01>
--------------------------------------------------------------------------------------
 ...

Copying results from <BDANode02> and generating report. This might take a while. Be patient.

=============================================================
                    Node name - <BDANode02>
=============================================================

Collecting - Verify ASR configuration check via ASREXACHECK

Data collections completed. Checking best practices on <BDANode02>
--------------------------------------------------------------------------------------
...
---------------------------------------------------------------------------------

Detailed report (html) - /<dir>/exachk_<BDANode01>_040414_091246/exachk_<BDANode01>_040414_091246.html


UPLOAD(if required) - /<dir>/exachk_<BDANode01>_040414_091246.zip

Known Issues

1. On BDA V2.4/V2.5 only, running exachk may incorrectly indicate failed software validation checks.

 FAIL =>    Big Data Appliance failed software validation checks.

In this case further analysis shows that:

a) bdacheckcluster, bdachecksw and bdacheckhw all complete successfully.

b) Looking through the logs shows: "ERROR: Wrong mounted partitions" like:

<name>.html.out:ERROR: Wrong mounted partitions :
/dev/md2 / ext3
/dev/md0 /boot ext3
/dev/sd4 /u01 ext4
/dev/sd4 /u02 ext4
/dev/sd1 /u03 ext4
/dev/sd1 /u04 ext4
/dev/sd1 /u05 ext4
/dev/sd1 /u06 ext4
/dev/sd1 /u07 ext4
/dev/sd1 /u08 ext4
/dev/sd1 /u09 ext4
/dev/sd1 /u10 ext4
/dev/sd1 /u11 ext4
/dev/sd1 /u12 ext4
INFO: Expected mounted partitions : 12 data partitions, /boot and /
...
ERROR: Big Data Appliance failed software validation checks

On BDA V2.4/V2.5, if you see these symptoms, ignore the software validation check error.

2. If ssh to a given switch is slow, exachk can fail as shown below. In this case, increase the SSH timeout using the RAT_PASSWORDCHECK_TIMEOUT environment variable.
The error looks like:

Starting to run root privileged commands in background on INFINIBAND SWITCH <cluster>sw-ib1.

Timed out
Unable to create temp directory on <cluster>sw-ib1

Skipping root privileged commands on INFINIBAND SWITCH <cluster> sw-ib1 is available but SSH is blocked.


To resolve:

a) Set RAT_PASSWORDCHECK_TIMEOUT to 40:

# export RAT_PASSWORDCHECK_TIMEOUT=40

b) Rerun exachk, for example:

# ./exachk -a 

Exachk Output

The output of Exachk is displayed at the end of the health check, and looks like:

Detailed report (html) - /<path to exachk installation>/exachk_<hostname>_<date>_<timestamp>/exachk_<hostname>_<date>_<timestamp>.html

UPLOAD(if required) - /<path to exachk installation>/exachk_<hostname>_<date>_<timestamp>.zip

 

Note: Do not rename any of the Exachk output report files or directories.


The detailed Exachk report is available in the following formats:

1. HTML report:
The HTML report is structured such that the most important exceptions are listed first.

The reports are stored in the directory in which you installed Exachk, and can be accessed through a browser by using an HTTP URL.
 
2. Zip file:
The Exachk report is also available within the zip file that is provided after each Exachk run. The information that Exachk collected about the system is also included in the zip file.


Whenever you run Exachk, it automatically creates a subdirectory and a zip file in the directory in which you installed Exachk, using the naming convention exachk_<hostname>_<date>_<timestamp>, as shown in the following examples:

Directory: exachk_scaj31bda01_041314_214504

Zip file: exachk_scaj31bda01_041314_214504.zip

The directory contains the HTML report, and the zip file as well as other supporting files and directories.
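
For scripts that need the host name or run time of a collection, the naming convention above can be split into its parts. This is a minimal sketch; the function name parse_collection_name is hypothetical and not part of exachk.

```shell
# Hypothetical helper: split an exachk collection name of the form
# exachk_<hostname>_<date>_<timestamp>[.zip] into its three parts.
# Fields are peeled off from the right, so a hostname that itself
# contains underscores is still handled.
parse_collection_name() {
    name=${1##*/}          # drop any leading directory
    name=${name%.zip}      # drop a trailing .zip, if present
    name=${name#exachk_}   # drop the fixed prefix
    timestamp=${name##*_}
    name=${name%_*}
    date_part=${name##*_}
    host=${name%_*}
    echo "$host $date_part $timestamp"
}

# Example use:
#   parse_collection_name exachk_scaj31bda01_041314_214504.zip
#   prints: scaj31bda01 041314 214504
```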

Note: The directory in which the subdirectory and zip file are created should be cleaned up on a regular basis.
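
One way to find candidates for this cleanup is sketched below. The function name list_old_collections is hypothetical, and the name pattern is an assumption based on the naming convention described above. The function only lists entries and removes nothing. In daemon mode, the COLLECTION_RETENTION parameter described under Running Exachk covers the same need.

```shell
# Hypothetical helper: list exachk collection directories and zip files in
# a directory that were last modified more than N days ago. It only prints
# candidates; review the list before removing anything.
list_old_collections() {
    dir=$1
    days=$2
    find "$dir" -maxdepth 1 -name 'exachk_*_*_*' -mtime +"$days" -print
}

# Example use (commented out):
#   list_old_collections /root/exachk_home 90
```

The pattern requires three underscores, so the downloaded exachk.zip itself is not listed.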

BDA Health Assessment Report

The HTML report contains the following sections. The sections vary depending on the options selected while executing Exachk:

  • Exachk Summary
  • Findings Needing Attention
  • Findings Passed
  • System Wide Automatic Service Request(ASR) Healthcheck

Exachk Summary

This section of the report summarizes the key data collected from the Exachk environment. It shows:

  • Operating system/Kernel Version
  • BDA Versions
  • System Identifier (Rack serial number)
  • Number of Nodes (which is number of nodes in the cluster plus the number of IB switches which is 3)
  • Number of IB Switches
  • exachk Version
  • Collection name
  • Collection date

Findings Needing Attention

This section lists the health checks that resulted in a FAIL, WARNING or INFO status.  Only the issues reported in the "Findings Needing Attention" section are real problems.

The status messages and the action to be taken for each are described below:

__________________________________________________________________________

Message Status   Description or Possible Impact     Action to be Taken
__________________________________________________________________________

FAIL             Shows checks that did not pass.    Address immediately.

WARNING          Shows checks that might cause      Investigate further.
                 performance or stability issues
                 if not addressed.

INFO             Indicates information about        Read the information.
                 the system.                        Follow any instructions
                                                    provided.
__________________________________________________________________________
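
To get a quick tally per status from saved console output, a sketch such as the following can be used. The function name count_status is hypothetical. It matches lines of the form shown for FAIL in the console output ("FAIL =>    ..."). That WARNING and INFO lines use the same form is an assumption.

```shell
# Hypothetical helper: count lines on stdin that begin with the given
# status followed by "=>", e.g.
#   FAIL =>    Big Data Appliance failed software validation checks.
count_status() {
    # grep -c prints the count; "|| true" keeps a zero count from being
    # reported as a failure exit status.
    grep -c -E "^[[:space:]]*$1[[:space:]]*=>" || true
}

# Example use, with console output previously saved to run.out:
#   count_status FAIL < run.out
```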

  

Findings Passed

This section lists the health checks that passed.

System Wide Automatic Service Request(ASR) Healthcheck

ASRExacheck is designed to check and test ASR configurations to make sure that a BDA can communicate with the ASR Manager. It is a non-invasive script that checks configurations only and does not write to any system or configuration files. It checks for known configuration issues and for any previous hardware faults that may not have been reported by ASR due to a misconfiguration on the BDA.

Known Issues

On BDA V2.4/V2.5 only, the BDA Health Assessment report may show:

Verify BDA Software Profile -- Failure

[Figure: softwarefailure]

 

In this case the output under the FAIL => Big Data Appliance failed software validation checks reports an ERROR: Wrong mounted partitions:

ERROR: Wrong mounted partitions :
/dev/md2 / ext3
/dev/md0 /boot ext3
/dev/sd4 /u01 ext4
/dev/sd4 /u02 ext4
/dev/sd1 /u03 ext4
/dev/sd1 /u04 ext4
/dev/sd1 /u05 ext4
/dev/sd1 /u06 ext4
/dev/sd1 /u07 ext4
/dev/sd1 /u08 ext4
/dev/sd1 /u09 ext4
/dev/sd1 /u10 ext4
/dev/sd1 /u11 ext4
/dev/sd1 /u12 ext4
INFO: Expected mounted partitions : 12 data partitions, /boot and /
 

However, bdacheckcluster, bdachecksw and bdacheckhw all complete successfully.

On BDA V2.4/V2.5, if you see these symptoms, ignore the software validation check error.

Comparing Two Exachk Reports

You can compare two Exachk reports by using the -diff option with the exachk command. You can use the -diff option to generate a comparison HTML report which can be used to find changes in the health of a BDA between Exachk runs. You can also use this report to find checks that have been added to a new version of Exachk.

To compare two Exachk reports, run the following command:

# ./exachk -diff report1 report2 [-outfile name_of_compared_report.html]

    - report1 and report2 are the names of the reports being compared.

    - The -outfile option is optional. By default, when the exachk -diff command is run, the comparison report is stored in a file called exachk_report1_report2_diff.html


Example:

a) Default output file:

# ./exachk -diff  exachk_<host>_040314_073600.zip exachk_<host>_040314_134241.zip
  
Summary
Total   : 14
Missing : 0
New     : 0
Changed : 0
Same    : 14
File comparison is complete. The comparison report can be viewed in: /<path>/exachk_040314073600_040314134241_diff.html

 

b) With the -outfile option:

# ./exachk -diff  exachk_<host>_040314_073600.zip exachk_<host>_040314_134241.zip -outfile compared_report.html
  
Summary
Total   : 14
Missing : 0
New     : 0
Changed : 0
Same    : 14
File comparison is complete. The comparison report can be viewed in: /<path>/compared_report.html



The comparison report contains a summary of the Exachk report comparison, covering:

  •     The differences between the two reports
  •     Checks that are in only the first report
  •     Checks that are in only the second report
  •     Checks that are common to both reports
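
When the comparison is run from a script, the console summary shown in the examples above can be reduced to a single number. The sketch below assumes the summary keeps the "Changed : N" layout shown there. The function name diff_changed_count is hypothetical and not part of exachk.

```shell
# Hypothetical helper: read "exachk -diff" console output on stdin and print
# the number on the "Changed :" line of the Summary block.
diff_changed_count() {
    awk -F':' '/^Changed[[:space:]]*:/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Example use (commented out; report1 and report2 are placeholders):
#   changed=$(./exachk -diff report1 report2 | diff_changed_count)
#   [ "$changed" = "0" ] || echo "$changed checks changed between runs"
```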

Full Set of Rules and Checks

Best Practices and Other Recommendations are generally items documented in various sources which could be overlooked. Exachk assesses them and calls attention to any findings.  The current Exachk makes the following checks:

  • Verify Subnet Manager
  • Verify BDA Software Profile
  • Verify NTP Synchronization
  • bdachecknet
  • Verify BDA Hardware Profile
  • Verify ILOM Power Up Configuration
  • Verify MTU Size
  • bdacheckib
  • Verify DNS Setup
  • Verify InfiniBand Cable Connection Quality
  • Infiniband Switch NTP configuration
  • Infiniband switch HOSTNAME configuration

Verify Subnet Manager

[Figure: verifysubnet]

Verify BDA Software Profile

[Figure: verifysoftware]

Verify NTP Synchronization

[Figure: verifyntp]

bdachecknet

[Figure: bdachecknet]

Verify BDA Hardware Profile

[Figure: verifyhardware]

Verify ILOM Power Up Configuration

[Figure: verifyilom]

Verify MTU Size

[Figure: verifymtu]

bdacheckib

[Figure: bdacheckib]

Verify DNS Setup

[Figure: verifydns]

Verify InfiniBand Cable Connection Quality

[Figure: verifyib]

Infiniband Switch NTP configuration

[Figure: verifyswitchntp]

Infiniband switch HOSTNAME configuration

[Figure: inifinbandswitchhos]

Troubleshooting

For support on any problems that you might encounter while using Exachk, create a service request via My Oracle Support.

Runtime Command Timeouts

During the health check process, if a particular node or switch does not respond to a health-check command within a predefined duration, Exachk terminates that command.  This prevents the program from freezing. On a busy system, commands are more likely to exceed the default timeouts and be terminated.

Note: To avoid runtime command timeouts during health checks, run the tool when the load on the system is lowest.

  

 


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.