Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1904446.1
Update Date:2017-12-04
Keywords:

Solution Type  Problem Resolution Sure

Solution  1904446.1 :   Oracle ZFS Storage Appliance: Infiniband IB Port not Activated  


Related Items
  • Sun ZFS Storage 7320
  • Oracle ZFS Storage ZS3-2
  • Oracle ZFS Storage ZS3-4
  • Sun ZFS Storage 7420
  • Sun ZFS Storage 7120
Related Categories
  • PLA-Support>Sun Systems>DISK>ZFS Storage>SN-DK: 7xxx NAS




In this Document
Symptoms
Changes
Cause
Solution
References


Created from <SR 3-8688320181>

Applies to:

Oracle ZFS Storage ZS3-2 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7120 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7420 - Version All Versions to All Versions [Release All Releases]
Oracle ZFS Storage ZS3-4 - Version All Versions to All Versions [Release All Releases]
Sun ZFS Storage 7320 - Version All Versions to All Versions [Release All Releases]
7000 Appliance OS (Fishworks)

Symptoms

The appliance, a ZS3-2 active-active cluster, is unable to communicate over its InfiniBand ports.


The environment has two Sun Datacenter InfiniBand Switch 36 units, cabled as follows:

Port1 of Head1 connected to Switch1 Port 0A
Port2 of Head1 connected to Switch2 Port 0A

Port1 of Head2 connected to Switch1 Port 0B
Port2 of Head2 connected to Switch2 Port 0B

Both switches are configured with pkey 0xfe80.
Switch1 has SM priority 6 and Switch2 has SM priority 5.

All connected ports show as active and enabled on the switch side, but the IB links are not active on the storage side.

 

Changes

 New Infiniband Installation

Cause

The network logs in the support bundle showed the following:

dladm-show-ib.out

LINK         HCAGUID         PORTGUID        PORT  STATE  PKEYS
ibp1         10E000013284D0  10E000013284D2  2     up     FFFF  <<<<<<<<<<<<<<<<<<<<<<<<<< Pkey FFFF
ibp0         10E000013284D0  10E000013284D1  1     up     FFFF

dladm-show-part.out 

LINK         PKEY  OVER         STATE    FLAGS
pfe80_ibp0   FE80  ibp0         down     f---  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Pkey FE80
pfe80_ibp1   FE80  ibp1         down     f---

 
 

Solution

Looking at the support bundles for both appliances, the pkey was shown as FFFF in some outputs and FE80 in others:

## dladm-show-ib.out

LINK         HCAGUID         PORTGUID        PORT  STATE  PKEYS
ibp1         10E000013284D0  10E000013284D2  2     up     FFFF  <<<<<<<<<<<<<<<<<<<<<<<<<< FFFF
ibp0         10E000013284D0  10E000013284D1  1     up     FFFF

(The dladm show-ib subcommand displays the physical links, port GUID, port number, HCA GUID, and the P_Keys present on the port at the time the command is run.)

(The dladm show-part subcommand displays IB partition link information.)

## dladm-show-part.out

LINK         PKEY  OVER         STATE    FLAGS
pfe80_ibp0   FE80  ibp0         down     f---  <<<<<<<<<<<<<<<<<<<<<<<<<< FE80
pfe80_ibp1   FE80  ibp1         down     f---


Checking the IB network configuration revealed that the InfiniBand ports showed the PKEY set to two different values.

The default PKEY on the appliance is FFFF, and that is the only P_Key present on the HCA ports, but the partition datalinks were set to FE80, which implied that the switch was set to a different PKEY value.
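
The comparison between the two outputs can also be done mechanically. The following is a minimal Python sketch, not part of the appliance tooling or the support bundle. The function names (parse_table, missing_pkeys) are hypothetical, the sample text is copied from the outputs above, and the handling of a comma-separated PKEYS column is an assumption.

```python
def parse_table(text):
    """Split a whitespace-separated table with a header row into dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


# Sample data copied from dladm-show-ib.out and dladm-show-part.out above.
SHOW_IB = """
LINK  HCAGUID         PORTGUID        PORT  STATE  PKEYS
ibp1  10E000013284D0  10E000013284D2  2     up     FFFF
ibp0  10E000013284D0  10E000013284D1  1     up     FFFF
"""

SHOW_PART = """
LINK        PKEY  OVER  STATE  FLAGS
pfe80_ibp0  FE80  ibp0  down   f---
pfe80_ibp1  FE80  ibp1  down   f---
"""


def missing_pkeys(show_ib, show_part):
    """Return (partition link, pkey, port) for partitions whose P_Key
    is not listed on the underlying port."""
    # Assumption: several P_Keys on one port are separated by commas.
    port_pkeys = {
        row["LINK"]: set(row["PKEYS"].upper().split(","))
        for row in parse_table(show_ib)
    }
    problems = []
    for part in parse_table(show_part):
        pkey = part["PKEY"].upper()
        if pkey not in port_pkeys.get(part["OVER"], set()):
            problems.append((part["LINK"], pkey, part["OVER"]))
    return problems


for link, pkey, port in missing_pkeys(SHOW_IB, SHOW_PART):
    print(f"{link}: P_Key {pkey} is not present on {port}")
```

With the sample data, both partition links are reported because FE80 does not appear in the PKEYS column of either port.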

The partition link state is down under the following conditions:

  • HCA port is down

  • P_Key is absent

  • Broadcast group is absent
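
The three conditions above can be expressed as a small checklist. This is an illustrative Python sketch only; the function name and its boolean inputs are hypothetical, and the values have to be gathered by hand from the port state, the P_Keys listed on the port, and the subnet manager configuration.

```python
def partition_down_reasons(port_up, pkey_on_port, broadcast_group_present):
    """Return the conditions from the list above that apply to a partition link."""
    reasons = []
    if not port_up:
        reasons.append("HCA port is down")
    if not pkey_on_port:
        reasons.append("P_Key is absent")
    if not broadcast_group_present:
        reasons.append("Broadcast group is absent")
    return reasons


# In this case the ports were up but FE80 was not present on them.
# The broadcast group is passed as True here only to isolate the P_Key check.
print(partition_down_reasons(port_up=True, pkey_on_port=False,
                             broadcast_group_present=True))
```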

 

The appliance help also makes it clear that the partition will remain "down" until the port GUID is a member of the subnet partition:

Partition Key
Use the partition (fabric domain) in which the underlying port device is a member. The partition key (pkey) is found on and configured by the subnet manager (SM).
The pkey may be defined before configuring the subnet manager but the datalink will remain "down" until the subnet partition has been properly configured
with the port GUID as a member. It is important to keep partition membership for HCA ports consistent with IPMP and clustering rules on the subnet manager.


==================================================================================================================

Further data that can be collected to help isolate an Infiniband issue:

## What do you see from the appliance to the IB switches?

To confirm, run the following commands from the CLI prompt of both appliances:

CLI>  confirm shell ibswitches

For Example:

Switch  : 0x002128f56f5da0a0 ports 36 "SUN DCS 36P QDR localhost 10.145.229.242" enhanced port 0 lid 1 lmc 0

CLI>  confirm shell ibstat

For Example:

CA 'mlx4_0'
        CA type: 0
        Number of ports: 2
        Firmware version: 2.6.000
        Hardware version: 160
        Node GUID: 0x00212800013f2416
        System image GUID: 0x00212800013f2419
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 21
                LMC: 0
                SM lid: 1
                Capability mask: 0x00000030
                Port GUID: 0x00212800013f2417
                Link layer: IB
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 22
                LMC: 0
                SM lid: 1
                Capability mask: 0x00000030
                Port GUID: 0x00212800013f2418
                Link layer: IB
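
When several ibstat outputs have to be compared, the fields of interest can be pulled out with a short script. The following Python sketch is illustrative only; the function names are hypothetical, the sample is an abridged copy of the example above, and treating an SM lid of 0 as "no subnet manager seen" is an assumption.

```python
import re

# Abridged copy of the ibstat example above.
IBSTAT = """\
CA 'mlx4_0'
        CA type: 0
        Number of ports: 2
        Port 1:
                State: Active
                Physical state: LinkUp
                Base lid: 21
                SM lid: 1
                Port GUID: 0x00212800013f2417
        Port 2:
                State: Active
                Physical state: LinkUp
                Base lid: 22
                SM lid: 1
                Port GUID: 0x00212800013f2418
"""


def parse_ports(text):
    """Return {port number: {field: value}} for each 'Port N:' block."""
    ports, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"Port (\d+):", line)
        if header:
            current = ports.setdefault(int(header.group(1)), {})
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def unhealthy_ports(ports):
    """Return port numbers that are not Active/LinkUp or report SM lid 0."""
    bad = []
    for number, fields in sorted(ports.items()):
        ok = (fields.get("State") == "Active"
              and fields.get("Physical state") == "LinkUp"
              and fields.get("SM lid", "0") != "0")  # assumption: 0 = no SM
        if not ok:
            bad.append(number)
    return bad


print(unhealthy_ports(parse_ports(IBSTAT)))
```

For the sample above the list is empty, since both ports are Active, LinkUp, and report SM lid 1.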


CLI>  confirm shell iblinkinfo.pl

For Example:

Switch 0x002128f56f5da0a0 SUN DCS 36P QDR localhost 10.145.229.242:
           1    1[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    2[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    3[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    4[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    5[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    6[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    7[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    8[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    9[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      16    2[  ] "zs3-2-a PCIe 6" ( )
           1   10[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      23    1[  ] "zs3-2-a PCIe 6" ( )
           1   11[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   12[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   13[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   14[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   15[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   16[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   17[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   18[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   19[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   20[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      22    2[  ] "s7420-a PCIe 4" ( )
           1   22[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      21    1[  ] "s7420-a PCIe 4" ( )
           1   23[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   24[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   25[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      19    1[  ] "s7420b1 PCIe 3" ( )
           1   26[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      20    2[  ] "s7420b1 PCIe 3" ( )
           1   27[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      18    1[  ] "s7420b2 PCIe 3" ( )
           1   28[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      17    2[  ] "s7420b2 PCIe 3" ( )
           1   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           1   30[  ] ==(                Down/ Polling)==>             [  ] "" ( )
           1   31[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   32[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   33[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   34[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1   35[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      12    2[  ] "MT25408 ConnectX Mellanox Technologies" ( )
           1   36[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      11    1[  ] "MT25408 ConnectX Mellanox Technologies" ( )
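
On a 36-port switch most lines are Down/Disabled, so it can help to filter the listing down to the ports that matter. The following Python sketch is illustrative only; the regular expression is written against the line layout shown in the example above and may need adjusting for other iblinkinfo versions, and the function name is hypothetical.

```python
import re

# A few lines copied from the iblinkinfo.pl example above.
SAMPLE = '''\
Switch 0x002128f56f5da0a0 SUN DCS 36P QDR localhost 10.145.229.242:
           1    8[  ] ==(                Down/Disabled)==>             [  ] "" ( )
           1    9[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      16    2[  ] "zs3-2-a PCIe 6" ( )
           1   10[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      23    1[  ] "zs3-2-a PCIe 6" ( )
           1   29[  ] ==(                Down/ Polling)==>             [  ] "" ( )
'''

LINE = re.compile(
    r'^\s*(?P<lid>\d+)\s+(?P<port>\d+)\[.*?\]\s*==\(\s*(?P<status>.*?)\)==>'
    r'\s*(?:(?P<peer_lid>\d+)\s+(?P<peer_port>\d+))?\[.*?\]\s*"(?P<peer>.*?)"'
)


def parse_links(text):
    """Return one dict per switch port line; header lines are skipped."""
    links = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        logical, _, physical = m["status"].partition("/")
        words = logical.split()
        links.append({
            "port": int(m["port"]),
            "state": words[-1] if words else "",   # e.g. Active or Down
            "physical": physical.strip(),          # e.g. LinkUp, Polling
            "peer": m["peer"],
            "peer_port": int(m["peer_port"]) if m["peer_port"] else None,
        })
    return links


links = parse_links(SAMPLE)
for link in links:
    if link["state"] == "Active":
        print(f'port {link["port"]} -> {link["peer"]} port {link["peer_port"]}')
    elif link["physical"] == "Polling":
        print(f'port {link["port"]} is polling (no link established)')
```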




For step-by-step guidance on resolving issues like this, refer to the following knowledge document:

Sun Storage 7000 Unified Storage System: How to Troubleshoot Infiniband issues (Doc ID 1435063.1)

It covers everything from checking the basic cabling to the IB port setup, with examples of the expected output.

Also gather support data on both InfiniBand switches by running the following commands.

All commands are taken from the Sun Datacenter InfiniBand Switch 36 user guide: http://docs.oracle.com/cd/E76424_01/pdf/E36271.pdf
1/ Display Link Status
In some situations, you might need to know the status of each route through the switch. Additionally, the listlinkup command displays where InfiniBand cables are connected to the switch.

On the management controller, type:

# listlinkup


2/ Discover the InfiniBand Fabric Topology:
The ibnetdiscover command enables you to see the InfiniBand fabric topology and build a topology file, which is used by the OpenSM Subnet Manager.
Identify the prerequisite and subsequent installation tasks that you must perform in conjunction with this procedure.

On the management controller, type:

# ibnetdiscover


3/ Display Subnet Manager Status:
If you want to quickly determine your Subnet Manager’s priority and state, the sminfo command can also provide the LID and GUID of the hosting HCA.

On the management controller, type:

# sminfo


4/ Perform Comprehensive Diagnostics for the Entire Fabric

If you require a full testing of your InfiniBand fabric, the ibdiagnet command can perform many tests with verbose results.
The command is a useful tool to determine the general overall health of the InfiniBand fabric.

On the management controller, type:

# ibdiagnet -v -r | tee /var/ak/dropbox/ibdiagnet.out


The ibdiagnet.log file contains the log of the testing.

====================================================================================
Another good reference is the "Oracle Exalogic install guide":

Using InfiniBand Partitions in Exalogic Physical Environments

http://docs.oracle.com/cd/E18476_01/doc.220/e18478/physical_part.htm
====================================================================================

 

References

<NOTE:1587913.1> - Sun Storage 7000 Unified Storage System: How to rebuild network interfaces
<NOTE:1435063.1> - Sun Storage 7000 Unified Storage System: How to Troubleshoot Infiniband Issues

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.