Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-75-1538237.1
Update Date:2018-05-21

Solution Type  Troubleshooting Sure

Solution  1538237.1 :   Gathering Troubleshooting Information for the Infiniband Network in Engineered Systems  


Related Items
  • Sun Network QDR InfiniBand Gateway Switch
  • Sun Datacenter InfiniBand Switch 36
  • Exadata Database Machine V2
  • Exadata X3-2 Hardware
  • Oracle SuperCluster T5-8 Full Rack
  • Oracle SuperCluster Specific Software
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Sun Network Infiniband


This document lists the data gathering required to troubleshoot issues with the InfiniBand network in Exadata, Exalogic and SuperCluster Engineered Systems.
It can also be used to gather information on any InfiniBand network where a Sun Datacenter InfiniBand Switch 36 and/or a Sun Network QDR InfiniBand Gateway Switch is used.

In this Document
Purpose
Troubleshooting Steps
References


Applies to:

Exadata X3-2 Hardware
Sun Datacenter InfiniBand Switch 36 - Version Not Applicable to Not Applicable [Release N/A]
Sun Network QDR InfiniBand Gateway Switch - Version Not Applicable to Not Applicable [Release N/A]
Oracle SuperCluster Specific Software
Sun Microsystems > Boards > InfiniBand (IB)
Information in this document applies to any platform.

Purpose

This document lists the data to be collected to troubleshoot issues with an InfiniBand network in Exadata, Exalogic and SuperCluster Engineered Systems. It is also useful for gathering information on any InfiniBand network where a Sun Datacenter InfiniBand Switch 36 and/or a Sun Network QDR InfiniBand Gateway Switch is used.

Troubleshooting Steps

1.  From all the InfiniBand switches in the network, collect the outputs of the following commands:

          a) version
          b) env_test
          c) listlinkup
          d) showunhealthy
          e) getmaster -l
          f) service opensmd status
          g) setsmpriority list
          h) smnodes list
          i) md5sum /conf/partitions.current
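
The command list above could be gathered from a workstation with a small wrapper such as the following. This is a minimal sketch, not part of the original procedure: it assumes root ssh access to each switch and that the switch commands are on the PATH of a non-interactive ssh session. The function names, the DRY_RUN variable and the output layout are illustrative only.

```shell
#!/bin/sh
# Sketch only: gather the step 1 outputs from one switch over ssh.
# Assumptions (not from this document): root ssh access to the switch, and
# the switch commands being on the PATH of a non-interactive ssh session.

# The step 1 commands, one per line.
switch_cmds() {
  cat <<'EOF'
version
env_test
listlinkup
showunhealthy
getmaster -l
service opensmd status
setsmpriority list
smnodes list
md5sum /conf/partitions.current
EOF
}

# collect_switch <switch> <outdir>
# Appends each command's output to <outdir>/<switch>.txt.
# With DRY_RUN=1 the ssh command lines are printed instead of executed.
collect_switch() {
  sw="$1"
  outdir="$2"
  mkdir -p "$outdir" || return 1
  switch_cmds | while IFS= read -r cmd; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh root@$sw $cmd"
    else
      {
        echo "=== $sw: $cmd ==="
        # stdin comes from /dev/null so ssh does not swallow the command list
        ssh "root@$sw" "$cmd" < /dev/null 2>&1
      } >> "$outdir/$sw.txt"
    fi
  done
}
```

For example, calling collect_switch once per switch (with placeholder host names such as ibsw-leaf1 and ibsw-spine1) and a common output directory leaves one text file per switch. Setting DRY_RUN=1 first shows what would be run without contacting any switch.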

 

2.  Copy the following files from all the InfiniBand switches:

        a)  /var/log/messages
        b)  /var/log/opensm.log
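
The two log files could be pulled with scp along the following lines. This is a sketch under the assumption that root scp access to the switches is available; the function name, the DRY_RUN variable and the per-switch directory layout are illustrative and not part of the original procedure.

```shell
#!/bin/sh
# Sketch only: copy the step 2 log files from one switch.
# Assumption (not from this document): root scp access to the switch.

# copy_switch_logs <switch> <outdir>
# Files land in <outdir>/<switch>/ so logs from different switches do not collide.
# With DRY_RUN=1 the scp command lines are printed instead of executed.
copy_switch_logs() {
  sw="$1"
  outdir="$2"
  mkdir -p "$outdir/$sw" || return 1
  for f in /var/log/messages /var/log/opensm.log; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "scp root@$sw:$f $outdir/$sw/"
    else
      scp "root@$sw:$f" "$outdir/$sw/"
    fi
  done
}
```

Keeping one subdirectory per switch matters here because every switch uses the same file names.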


3.  Collect the outputs of the following commands on any leaf switch:

       a) ibswitches
       b) ibnetdiscover
       c) sminfo

 

4.  From the IB switch running as master, collect the output of the following command:

         # smpartition list active


5.  Run the following commands on a leaf switch:

        ibqueryerrors.pl -rR -s RcvSwRelayErrors,XmtDiscards,XmtWait,VL15Dropped

        /usr/bin/ibdiagnet -skip dup_guids -ls 10 -lw 4x -pm

             The ibdiagnet command creates a few files in the /tmp directory.

             Copy these files.

                    Example:
                          # cd /tmp
                          # tar cvf pre-clear-ibdiagnet.tar ibdiagnet*

 

        This captures all of the PM counters accumulated since the errors and counters were last cleared.

        After the above commands have been run and the files collected, run the following two commands to reset the counters and errors:

        # ibclearcounters
        # ibclearerrors

        Then wait for an hour before collecting the ibqueryerrors and ibdiagnet output a second time.

NOTE: Alternatively, if immediate results are required, traffic may be generated manually:

          # /usr/bin/ibdiagnet -c 500 -P all=1 (this will send 500 packets over all links)

Then collect the ibqueryerrors and ibdiagnet output a second time:

              ibqueryerrors.pl -rR -s RcvSwRelayErrors,XmtDiscards,XmtWait,VL15Dropped

             /usr/bin/ibdiagnet -skip dup_guids -ls 10 -lw 4x -pm

              Then copy the files from the /tmp directory as follows:

                    Example:
                          # cd /tmp
                          # tar cvf post-clear-ibdiagnet.tar ibdiagnet*

 NOTE: Once the information has been provided, please remove the pre- and post-clear-ibdiagnet.tar files from the switch.
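
The whole step 5 sequence (capture, clear, wait, capture again) can be sketched as a single script to run on the leaf switch. Only the commands and tar file names come from this document. The helper names run, capture and step5, and the DRY_RUN and WAIT_SECS variables, are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: the step 5 sequence, intended to be run as root on a leaf switch.

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# capture <label>   (label is "pre-clear" or "post-clear")
capture() {
  label="$1"
  run ibqueryerrors.pl -rR -s RcvSwRelayErrors,XmtDiscards,XmtWait,VL15Dropped
  run /usr/bin/ibdiagnet -skip dup_guids -ls 10 -lw 4x -pm
  # ibdiagnet leaves its result files in /tmp; bundle them under the label.
  run sh -c "cd /tmp && tar cvf $label-ibdiagnet.tar ibdiagnet*"
}

step5() {
  capture pre-clear
  run ibclearcounters
  run ibclearerrors
  # Wait one hour by default so that fresh counters can accumulate.
  run sleep "${WAIT_SECS:-3600}"
  capture post-clear
}
```

Running step5 with its output redirected to a file keeps the ibqueryerrors output from both passes alongside the two tar files. With DRY_RUN=1 the function only prints the command sequence, which is a quick way to review it before touching the counters.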

 

6.  If there are Sun Network QDR InfiniBand Gateway Switches in the network (in Exalogic systems, for example), collect the outputs of the following commands on all of the gateway switches:

      a) showvnics
      b) showvlan
      c) showioadapters
      d) showgwports
      e) showgwconfig
      f) bxmtool --gw


7.  On all hosts experiencing issues that are connected to the InfiniBand network through InfiniBand HCAs, collect an Explorer (if running Solaris), a sosreport (if running Linux), or a support bundle (if it is a ZFS appliance).

      The following data may also be collected on these nodes if it is not already included in the Explorer, sosreport, or support bundle:

            ibstat
            ibv_devinfo -v
            mlx4_vnic_info -s
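
On a host, the three commands above can be collected into a single file while tolerating tools that are not installed. The following is a minimal sketch; the function name and the output format are illustrative assumptions, and the three command lines are the only part taken from this document.

```shell
#!/bin/sh
# Sketch only: collect the step 7 host-side outputs into one file.

# collect_host_ib <outfile>
collect_host_ib() {
  out="$1"
  : > "$out" || return 1          # create or truncate the output file
  for cmd in "ibstat" "ibv_devinfo -v" "mlx4_vnic_info -s"; do
    bin="${cmd%% *}"              # first word is the executable name
    echo "=== $cmd ===" >> "$out"
    if command -v "$bin" >/dev/null 2>&1; then
      # $cmd is left unquoted on purpose so its arguments are split
      $cmd >> "$out" 2>&1 || echo "(command returned an error)" >> "$out"
    else
      echo "(not installed: $bin)" >> "$out"
    fi
  done
}
```

Recording a marker for a missing tool, rather than skipping it silently, makes it clear to whoever reviews the file that the command was attempted.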

8.  If the issue is with communication between nodes, additional data may be collected as described in the following document:

          Troubleshooting communication issues over an Infiniband fabric Using ibping, ping, and rds-ping (Doc ID 2016560.1)

          tcpdump data may also be collected on the appropriate interfaces of both nodes while pinging from one node to the other.


9.  If this is a single-rack Exadata or Exalogic system, the following command may be run on any compute node in this InfiniBand network, and its output collected.

           /opt/oracle.SupportTools/ibdiagtools/verify-topology -t  [quarterrack | halfrack | fullrack ]

                   Use the option that matches the type of rack in use. If no option is given, the default of fullrack is assumed.

 

NOTE: The -t option has been deprecated and is no longer available in newer versions of the 'verify-topology' script.

In that case, the 'ibnetdiscover' data can be reviewed instead to assess the IB topology.

 

10.   If the issue is with the InfiniBand switch and its hardware, a snapshot of its ILOM may also be collected.
      Refer to How to Generate iLOM Snapshot on Exadata IB Switches (Doc ID 1594992.1).

         Example:

            -> cd /SP/diag
            -> set snapshot dataset=normal dump_uri=ftp://<user>:<password>@<destination host>//<directory name>/
                   where <user> and <password> are the user name and password for the destination host to which the snapshot is transferred by FTP.

            -> show snapshot result
                   When the snapshot data has been fully transferred to the specified location, the output of this command shows the status as "completed".



Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.