Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-761868.1
Update Date: 2018-03-21

Solution Type: Troubleshooting

Solution  761868.1 :   Oracle Exadata Diagnostic Information required for Disk Failures and some other Hardware issues  


Related Items
  • Oracle Exadata Hardware
  • Exadata X3-2 Hardware
  • Exadata X4-2 Hardware
  • Exadata Database Machine X2-2 Hardware
  • Exadata X4-2 Half Rack
  • Exadata X5-8 Hardware
  • Exadata X5-2 Hardware
  • Exadata Database Machine X2-8
  • Oracle Exadata Storage Server Software
  • Exadata X4-8 Hardware
  • Exadata Database Machine V2
  • Exadata X3-8 Hardware
  • Exadata X3-8b Hardware
Related Categories
  • PLA-Support>Eng Systems>Exadata/ODA/SSC>Oracle Exadata>DB: Exadata_EST
  • _Old GCS Categories>ST>Server>Engineered Systems>Exadata>Hardware




In this Document
Purpose
Troubleshooting Steps
 --- Software Requirements/Prerequisites ---
 --- Configuring the Script ---
 --- Running the Script ---
 For Sun Oracle Exadata Environments:
References


Applies to:

Exadata Database Machine V2 - Version All Versions and later
Exadata Database Machine X2-8 - Version All Versions and later
Exadata X3-8 Hardware - Version All Versions and later
Oracle Exadata Storage Server Software - Version 11.1.0.3.0 and later
Exadata Database Machine X2-2 Hardware
Information in this document applies to any platform.

Purpose

This document provides the set of commands required to collect diagnostic information for disk failures on Exadata Storage Servers. The same collection can also be run on database nodes or storage servers for certain other hardware issues.

 

Note:  sundiag is an Exadata node/cell tool.  It works on both Linux and Solaris installations. For issues on Solaris platforms, additional OS data may be collected by the Explorer tool - <Document 1006990.1> Oracle Explorer Data Collector Implementation Best Practice.  For issues on Linux platforms, additional OS data may be collected by the sosreport tool - <Document 1500235.1>.

 

Troubleshooting Steps

--- Software Requirements/Prerequisites ---


The script must be executed as root on the Exadata Storage Server exhibiting disk problems. For certain other hardware issues, it may also need to be run on database nodes or on other storage servers.

The script collects diagnostic information required by Sun or HP.

--- Configuring the Script ---


No configuration is required; the script is simply executed as the root user.

--- Running the Script ---

 

For Sun Oracle Exadata Environments:


sundiag.sh is included in the Exadata base image in /opt/oracle.SupportTools:

For image versions 12.1.2.2.0 or later, use the sundiag.sh already included in /opt/oracle.SupportTools/sundiag.sh.

For systems using any image version prior to 12.1.2.2.0, update the existing sundiag.sh to the latest v12.1.2.2.0_150917 version.

Note: as of the 12.1.2.2.0 image, the version numbering scheme changed from standalone version numbers (the last being v1.5.1) to version numbers matching the image release plus a date.
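To decide which case applies, the installed image version can be checked first. The following is a minimal sketch, assuming the standard Exadata imageinfo tool and GNU coreutils; the version comparison via sort -V is an illustration added here, not part of sundiag:

```shell
# Check the Exadata image version to decide whether sundiag.sh needs updating.
IMG_VER=$(imageinfo -ver 2>/dev/null | awk '{print $1}')
echo "Image version: ${IMG_VER}"

# Compare against 12.1.2.2.0 using version sort: if 12.1.2.2.0 sorts first,
# the installed image is 12.1.2.2.0 or later.
if [ "$(printf '%s\n' "12.1.2.2.0" "$IMG_VER" | sort -V | head -1)" = "12.1.2.2.0" ]; then
    echo "Image is 12.1.2.2.0 or later - use the bundled sundiag.sh"
else
    echo "Image is older than 12.1.2.2.0 - update sundiag.sh as described below"
fi
```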


Updating sundiag.sh to the latest on older image systems:

1. Download the file sundiag.zip attached to this note and copy it to the first compute node under /tmp.
2. Using dcli, copy the file to all the nodes and unzip it
      #dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group -f /tmp/sundiag.zip -d /tmp
      #dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;unzip sundiag.zip;ls -l sundiag_12.1.2.2.0_150917.sh;md5sum sundiag_12.1.2.2.0_150917.sh"

The output should look like this for all the nodes referenced in the file all_group:

nodedb03: Archive: sundiag.zip
nodedb03: inflating: sundiag_12.1.2.2.0_150917.sh
nodedb03: -r-xr-xr-x 1 root root 54919 Sep 17 19:49 sundiag_12.1.2.2.0_150917.sh
nodedb03: 0e6fa48b54d7881b9fc8a252a9b068aa sundiag_12.1.2.2.0_150917.sh

3. Copy the new version of sundiag.sh to the default location
     #dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /opt/oracle.SupportTools;mv sundiag.sh sundiag.sh.orig"
     #dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;mv sundiag_12.1.2.2.0_150917.sh /opt/oracle.SupportTools/sundiag.sh;md5sum /opt/oracle.SupportTools/sundiag.sh;ls -l /opt/oracle.SupportTools/*sundiag*"

4. Remove temporary files
 
    #dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group "cd /tmp;rm -fr sundiag.zip;rm -fr sundiag_12.1.2.2.0_150917.sh"
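After step 4, it can be worth confirming that every node now reports the updated version. The following sketch uses the same dcli group file and assumes the help banner prints a "Version:" line, as the 12.1.2.2.0_150917 usage output does:

```shell
# Verify each node reports the updated sundiag version via its help banner.
dcli -l root -g /opt/oracle.SupportTools/onecommand/all_group \
    "/opt/oracle.SupportTools/sundiag.sh -h 2>&1 | grep -i '^Version'"
```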



Usage of sundiag.sh:


# /opt/oracle.SupportTools/sundiag.sh -h

Oracle Exadata Database Machine - Diagnostics Collection Tool

Version: 12.1.2.2.0.150917

By default sundiag will collect OSWatcher/ExaWatcher, Cell Metrics and traces,
if there was an alert in the last 7 days. If there is more than one alert, latest
alert is chosen to set the time range for data collection.
Time range is 8hrs prior to and 1hr after the latest alert, for the total of 9 hrs
e.g: latest alert timestamp =  2014-03-29T01:20:04-05:00
      echo  Time range = 2014-03-28_16:00:00 and 2014-03-29_01:00:00
User can also specify time ranges (as explained in usage below), which takes
precedence over default behavior of checking for alerts

Usage: /opt/oracle.SupportTools/sundiag.sh [ilom | snapshot] [osw <time ranges>]
   osw      - This argument when used expects value of one or more comma separated
              time ranges. OSWatcher/ExaWatcher, cell metrics and traces will be gathered
              in those time ranges.
              The format for time range(s) is <from>-<to>,<from>-<to> and so on without spaces
              where <from> and <to> format is <date>_<time>
              <date> and <time> format should be any valid format that can be recognized by
              'date' command. The command 'date -d <date>' or 'date -d <time>' should be valid
              e.g: /opt/oracle.SupportTools/sundiag.sh osw 2014/03/31_15:00:00-2014/03/31_18:00:00
              Note: Total time range should not exceed 9 hrs. Only the time ranges that
              fall within this limit are considered for the collection of above data
   ilom     - User level ILOM data gathering option via ipmitool, in place of
              separately using root login to get ILOM snapshot over the network.
   snapshot - Collects node ILOM snapshot- requires host root password for ILOM
              to send snapshot data over the network.


Execution will create a date stamped tar.bz2 file in /var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2 (in /tmp on v1.4-1.5.1).
Upload this file to the Service Request.
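Putting the default invocation together, a plain run on a single node looks like the following sketch (output path per the note above; the exact tarball name varies with hostname, serial number and timestamp):

```shell
# Run a default sundiag collection and locate the newest resulting tarball.
# On image 12.1.2.2.0 and later output lands in /var/log/exadatatmp;
# on v1.4-1.5.1 it is /tmp instead.
/opt/oracle.SupportTools/sundiag.sh
ls -lt /var/log/exadatatmp/sundiag_*.tar.bz2 | head -1
```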

This is the list of files and directories created by version 12.1.2.2.0_150917 of sundiag.sh:

asr
cell
disk
etc_configs
etc_sysconfig_net
fru-print_ipmitool.out
ilom
imagehistory-all.out
imageinfo-all.out
messages
mrdiag
net
osw
RackMasterSN
raid
SerialNumbers
stderr.txt
sysconfig
var_log_cellos
.version_sundiag



For gathering OS Watcher data alongside sundiag:

Usage:

# /opt/oracle.SupportTools/sundiag.sh osw <time ranges>

By default sundiag will collect OSWatcher/ExaWatcher, Cell Metrics and traces if there was an alert in the last 7 days. If there is more than one alert, the latest alert is chosen to set the time range for data collection. The time range is 8 hrs prior to and 1 hr after the latest alert, for a total of 9 hrs. For example, a latest alert timestamp of 2014-03-29T01:20:04-05:00 gives a time range of 2014-03-28_16:00:00 to 2014-03-29_01:00:00.

The user can also specify time ranges, which take precedence over the default behavior of checking for alerts. When used, this argument expects a value of one or more comma-separated time ranges. OSWatcher/ExaWatcher, cell metrics and traces will be gathered in those time ranges.

The format for time range(s) is <from>-<to>,<from>-<to> and so on without spaces  where <from> and <to> format is <date>_<time>.
<date> and <time> format should be any valid format that can be recognized by 'date' command. The command 'date -d <date>' or 'date -d <time>' should be valid.

For Example: /opt/oracle.SupportTools/sundiag.sh osw 2014/03/31_15:00:00-2014/03/31_18:00:00

Note: Total time range should not exceed 9 hrs. Only the time ranges that fall within this limit are considered for the collection of above data. This is to limit the amount of data being gathered to be appropriate for the problem being analysed.
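Because <from> and <to> must be strings that 'date -d' can parse, one convenient approach is to let 'date' generate them. A sketch assuming GNU date (as shipped on Exadata's Linux); the 3-hour window is an arbitrary example:

```shell
# Build a sundiag 'osw' time-range argument covering the last 3 hours,
# in the <date>_<time> format shown in the usage text.
FROM=$(date -d '3 hours ago' '+%Y/%m/%d_%H:%M:%S')
TO=$(date '+%Y/%m/%d_%H:%M:%S')
RANGE="${FROM}-${TO}"
echo "$RANGE"

# Sanity check: both endpoints must themselves be parseable by 'date -d'
# (split the <date>_<time> string back into "<date> <time>").
date -d "${FROM%_*} ${FROM#*_}" >/dev/null && \
date -d "${TO%_*} ${TO#*_}" >/dev/null && echo "range looks valid"

# Then run, e.g.:
# /opt/oracle.SupportTools/sundiag.sh osw "$RANGE"
```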

Execution will create a date stamped tar.bz2 file in /var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2 (in /tmp on v1.4-1.5.1) including OS Watcher archive logs. These logs may be very large.

Upload this file to the Service Request.


For gathering ILOM  data alongside sundiag:


Usage:

# /opt/oracle.SupportTools/sundiag.sh snapshot

Execution will create a date stamped tar.bz2 file in /var/log/exadatatmp/sundiag_<hostname>_<serial#>_<date/time>.tar.bz2 (in /tmp on v1.4-1.5.1) which includes running an ILOM snapshot. In order to collect a snapshot, the host (not ILOM) 'root' password is required to facilitate network transfer of the snapshot into the /tmp directory.  This is the preferred method of ILOM data collection.

If there are concerns about entering the host 'root' password, then an alternative option is provided using the "# /opt/oracle.SupportTools/sundiag.sh ilom" which will use IPMI to gather user-level ILOM outputs. This is usually good but the ILOM snapshot level can provide more underlying ILOM outputs for troubleshooting issues with ILOM and system faults that the user-level data may not provide.

Upload this file to the Service Request.


For gathering sundiag data across a whole rack:

For gathering sundiag.sh outputs on versions where the generated filename is unique for each node (v1.4 and later), use the following from DB01:

1. [root@exadb01 ~]# cd /opt/oracle.SupportTools/onecommand (or wherever the all_group file is with the list of the rack hostnames)

2. [root@exadb01 onecommand]# dcli -g all_group -l root /opt/oracle.SupportTools/sundiag.sh 2>&1
<this will take up to several minutes while each node runs sundiag.sh>

3. Verify there is output in /tmp or  /var/log/exadatatmp/ on each node:
[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /tmp/sundiag* '   (v.1.4-1.5.1)

[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /var/log/exadatatmp/sundiag* '   (v12.1.2.2.0_150917)

4. Make a temporary directory to copy for zipping:
[root@exadb01 onecommand]# mkdir dbm01_sundiags_date  

It is recommended that the date be in YYMMDD (year, month, day) format for SRs where multiple days of analysis may be required.

5. Copy the generated sundiag files from the nodes to the temporary directory (/tmp on v1.4-1.5.1, /var/log/exadatatmp on v12.1.2.2.0_150917):

[root@exadb01 onecommand]# for H in `cat all_group`; do  scp -p $H:/tmp/sundiag*.tar.bz2 dbm01_sundiags_date ; done  

[root@exadb01 onecommand]# for H in `cat all_group`; do  scp -p $H:/var/log/exadatatmp/sundiag*.tar.bz2 dbm01_sundiags_date ; done

6. Bundle them into a single file for upload to Oracle:

[root@exadb01 ~]# tar jcvf exa_rack_sundiag_date.tar.bz2 dbm01_sundiags_date


For gathering sundiag.sh outputs on older versions of sundiag.sh (prior to v1.4), the generated filename does not include the hostname. When gathering outputs across a whole rack using dcli, the tarballs may all end up with the same name and overwrite each other when unzipped. To avoid this, use the following from DB01:

1. [root@exadb01 ~]# cd /opt/oracle.SupportTools/onecommand (or wherever the all_group file is with the list of the rack hostnames)

2. [root@exadb01 onecommand]# dcli -g all_group -l root /opt/oracle.SupportTools/sundiag.sh 2>&1
<this will take up to about 2 minutes>

3. Verify there is output in /tmp on each node:
[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /tmp/sundiag* '

4. Sort them by hostname into directories, as they will likely mostly have the same filename with the same date stamp:
[root@exadb01 onecommand]# for H in `cat all_group`; do mkdir -p /root/rack-sundiag/$H ; scp -p $H:/tmp/sundiag*.tar.bz2 /root/rack-sundiag/$H ; done

5. [root@exadb01 onecommand]# cd /root/rack-sundiag

6. [root@exadb01 ~]# ls exa*
exacel01:
sundiag_2011_05_24_10_11.tar.bz2

exacel02:
sundiag_2011_05_24_10_11.tar.bz2
...
exadb08:
sundiag_2011_05_24_10_11.tar.bz2

7. [root@exadb01 ~]# tar jcvf exa_rack_sundiag_oracle.tar.bz2 exa*
exacel01/
exacel01/sundiag_2011_05_24_10_11.tar.bz2
exacel02/
exacel02/sundiag_2011_05_24_10_11.tar.bz2
...
exadb08/
exadb08/sundiag_2011_05_24_10_11.tar.bz2

8. [root@exadb01 ~]# ls -l exa_rack_sundiag_oracle.tar.bz2
-rw-r--r-- 1 root root 3636112 May 24 10:21 exa_rack_sundiag_oracle.tar.bz2

Upload this file to the Service Request.


For HP Oracle Exadata Environments:

1. Download the file deaddisk.zip to the Exadata Storage Server with the disk failures.
2. Unzip into any directory
3. Change permissions (700) on the file deaddisk.sh and execute it:
    # chmod 700 deaddisk.sh
    # ./deaddisk.sh
4. Upload the generated zip file. Under /tmp there will be a zip file containing the contents of the directory created by the script, named in the format info_deaddisk_<timestamp>.zip.
    Example: info_deaddisk_2009-01-18-19:52:14.zip


If link for deaddisk.zip is broken, please copy this code:


#!/bin/ksh
# Collect "dead disk" diagnostics on an HP Oracle Exadata Storage Server.
mdate=`/bin/date '+%Y-%m-%d-%H:%M:%S'`
mkdir -p /tmp/info_deaddisk_$mdate
cd /tmp/info_deaddisk_$mdate

# Stop the MS service before collecting controller diagnostics.
cellcli -e alter cell shutdown services ms
# MSSTATUS holds the status-query command; the backticks below execute it.
# Whitespace is stripped from the output before comparison.
MSSTATUS='cellcli -e list cell attributes msstatus'
if [ "`$MSSTATUS | tr -d ' \t'`" != 'stopped' ]
then
    echo 'MS did not stop!!!!'
    echo -n 'Continuing may not be safe, do you wish to continue (y|N)? '
    read Answer
    if [ "$Answer" = 'Y' -o "$Answer" = 'y' ]
    then
        echo 'Continuing to dump diagnostics'
    else
        exit 0
    fi
fi

echo "Starting to collect disk information....."
hpacucli ctrl all show config detail > `hostname -a`_$mdate.hpacucli.txt
hpaducli -f `hostname -a`_$mdate.hpaducli.txt
hpaducli -x -f `hostname -a`_$mdate.hpaducli.xml

# Restart the MS service and verify it came back.
cellcli -e alter cell startup services ms
if [ "`$MSSTATUS | tr -d ' \t'`" != 'running' ]
then
    echo 'MS did not start!!!!'
    echo 'If this is unexpected, please contact Oracle Customer Support'
    echo 'dead disk diagnostics were collected but the state of MS is unknown'
fi

cp /var/log/messages .
cp /var/spool/compaq/hpasm/registry/serial_output/* .
cd ..
zip -r info_deaddisk_$mdate info_deaddisk_$mdate





References

<NOTE:1006990.1> - Oracle Explorer Data Collector Implementation Best Practice
<NOTE:1500235.1> - How To Collect an Sosreport on Oracle Linux

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.