Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-2084598.1
Update Date:2018-01-03

Solution Type  Problem Resolution Sure

Solution  2084598.1 :   smpartition start - Fails With Unable To Get Rpc Version On Some Nodes In The Fabric  


Related Items
  • Sun Datacenter InfiniBand Switch 36
  • Sun Network QDR InfiniBand Gateway Switch
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Sun Network Infiniband




In this Document
Symptoms
Changes
Cause
Solution


Created from <SR 3-11769456148>

Applies to:

Sun Datacenter InfiniBand Switch 36 - Version All Versions to All Versions [Release All Releases]
Sun Network QDR InfiniBand Gateway Switch - Version All Versions to All Versions [Release All Releases]
Information in this document applies to any platform.

Symptoms

smpartition start (i.e. /usr/local/sbin/smpartition start) can fail if the peer checks performed by partitiond (i.e. /usr/local/util/partitiond) fail.

Here are some example scenarios.

1)
# smpartition start
Unable to get rpc version on some nodes in the fabric
Please check /var/log/messages
#

- /var/log/messages -
Nov 25 11:31:46 ibsw partitiond: doPeerCheck: Unable to get rpc version for peer W.X.Y.Z
Nov 25 11:31:46 ibsw partitiond: doPeerCheck: Unable to get rpc version for peer W.X.Y.Z
Nov 25 11:36:16 ibsw partitiond: doPeerCheck: Unable to get rpc version for peer W.X.Y.Z
Nov 25 11:36:16 ibsw partitiond: doPeerCheck: Unable to get rpc version for peer W.X.Y.Z

2)
# smpartition start
Unable to signal some of the smnodes
Please check /var/log/messages
#

- /var/log/messages -
Dec 10 19:28:36 ibsw partitiond: doPeerCheck:Signaling other host failed: W.X.Y.Z
Dec 10 19:28:38 ibsw partitiond: doPeerCheck:Signaling other host failed: W.X.Y.Z
Dec 10 19:37:48 ibsw partitiond: doPeerCheck:Signaling other host failed: W.X.Y.Z
Dec 10 19:37:49 ibsw partitiond: doPeerCheck:Signaling other host failed: W.X.Y.Z

3)
# smpartition start
smnodes list on some nodes is inconsistant(empty/doesn't exist/different) with that of the master
Please check /var/log/messages
Aborting.
#

- /var/log/messages -
Dec 10 19:32:24 ibsw partitiond: doPeerCheck :smnode file on W.X.Y.Z is different from master
Dec 10 19:38:58 ibsw partitiond: doPeerCheck :smnode file on W.X.Y.Z is different from master
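
The three failure signatures above can be told apart mechanically. The following is a minimal sketch and not part of the switch software: the function names classify_peercheck and summarize_peercheck are hypothetical, and the match strings are copied from the log excerpts in this document, so other firmware versions may word them differently.

```shell
#!/bin/sh
# Sketch: map a partitiond doPeerCheck log line to the scenario number used in
# this document. Prints 0 when the line matches none of the three scenarios.
# The function names are hypothetical helpers, not switch commands.
classify_peercheck() {
    case "$1" in
        *"doPeerCheck"*"Unable to get rpc version for peer"*) echo 1 ;;
        *"doPeerCheck"*"Signaling other host failed"*)        echo 2 ;;
        *"doPeerCheck"*"is different from master"*)           echo 3 ;;
        *)                                                    echo 0 ;;
    esac
}

# Read log lines on stdin and print "scenario N: <line>" for each line that
# matches one of the three scenarios; all other lines are skipped.
summarize_peercheck() {
    while IFS= read -r pc_line; do
        pc_n=$(classify_peercheck "$pc_line")
        if [ "$pc_n" -ne 0 ]; then
            printf 'scenario %s: %s\n' "$pc_n" "$pc_line"
        fi
    done
    return 0
}
```

For example, the output of grep partitiond /var/log/messages could be piped into summarize_peercheck to label each doPeerCheck entry with its scenario number.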

 

 

Changes

InfiniBand (IB) environments can use IB partitions.  The smpartition start command starts a session in which the IB partitions can be edited.

 

Cause

The IB switch on which smpartition start is invoked must be the master subnet manager (i.e. SMINFO_MASTER), because IB partitions can only be modified on SMINFO_MASTER.  partitiond on SMINFO_MASTER talks to the partitiond instances on the other IB switches specified in the smnodes list.  The smnodes list contains only the IP addresses of the IB switches running the subnet manager (i.e. OpenSM), and every IB switch maintains its own smnodes list.

Scenario 1 means partitiond on SMINFO_MASTER is not able to acquire the RPC version from the peer partitiond on the peer IB switch with IP address W.X.Y.Z.

This can happen if the peer IB switch is not running portmap (i.e. /sbin/portmap), or if portmap is running but partitiond is not.  partitiond is an RPC program, so when it starts, it has to register itself with portmap.

Scenario 2 means partitiond on SMINFO_MASTER is not able to communicate with the peer partitiond on the peer IB switch with IP address W.X.Y.Z.

This can happen if the peer IB switch is not running partitiond, in which case no communication is possible at all.

Scenario 3 means the smnodes list on SMINFO_MASTER is not the same as the smnodes list on the peer IB switch with IP address W.X.Y.Z.

 

Solution

For scenario 1, check portmap and start it if it is not running, then check partitiond and start it if it is not running.

On SMINFO_MASTER, list the RPC programs registered with portmap on the peer:

# rpcinfo -p W.X.Y.Z

On the peer IB switch with IP address W.X.Y.Z, check portmap:

# rpcinfo -p

# service portmap status

# service portmap start  <<-- if not already running

# ps -ef | grep 'portmap'

Then, on the same peer IB switch, check partitiond:

# service partconfigd status

# enablesm               <<-- if not already running

# ps -ef | grep 'part'

# rpcinfo -p

# netstat -lnp
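
The order of checks for scenario 1 (portmap first, then partitiond) can be sketched as a small helper that reads saved rpcinfo -p output. This is only an illustration under stated assumptions: the function names rpc_registered and scenario1_next_step are hypothetical, and this document does not state the RPC program number that partitiond registers, so it is passed in as a parameter and has to be determined for the installed firmware. 100000 is the standard program number of the portmapper itself.

```shell
#!/bin/sh
# Return success if RPC program number $1 appears in the first column of
# `rpcinfo -p` output supplied on stdin.
rpc_registered() {
    awk -v prog="$1" '$1 == prog { found = 1 } END { exit !found }'
}

# Print the next step of the scenario 1 procedure.
#   $1 = file holding saved `rpcinfo -p` output from the peer switch
#        (empty if rpcinfo could not reach portmap)
#   $2 = RPC program number of partitiond (ASSUMPTION: not given in this
#        document; supply the value that applies to your firmware)
scenario1_next_step() {
    if ! rpc_registered 100000 < "$1"; then
        echo "start portmap"
    elif ! rpc_registered "$2" < "$1"; then
        echo "start partitiond (enablesm)"
    else
        echo "partitiond is registered"
    fi
}
```

A possible use is to save the output of rpcinfo -p W.X.Y.Z to a file on SMINFO_MASTER and pass that file, together with the partitiond program number, to scenario1_next_step.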

For scenario 2, check partitiond and start it if it is not running.

On the peer IB switch with IP address, W.X.Y.Z, just run:

# service partconfigd status

# enablesm <<-- if not already running

# ps -ef | grep 'part'

For scenario 3, compare the output of smnodes list on SMINFO_MASTER with the output of smnodes list on the peer IB switch with IP address W.X.Y.Z, then correct the list on the peer so that the two match.

On the peer IB switch with IP address, W.X.Y.Z, just run:

# smnodes list

# smnodes delete <...>

# smnodes add <...>
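
The comparison for scenario 3 can be done by eye, but for longer lists a small diff helper may be convenient. The sketch below assumes that the output of smnodes list has been saved to one file per switch with one IP address per line; that output format and the function name smnodes_diff are assumptions, not something this document specifies. The sketch ignores ordering and duplicate entries.

```shell
#!/bin/sh
# Print addresses that appear in only one of two saved smnodes lists.
#   $1 = list saved on SMINFO_MASTER
#   $2 = list saved on the peer IB switch
# Lines prefixed "< " exist only on the master, "> " only on the peer.
# No output means both lists contain the same set of addresses.
# ASSUMPTION: each file holds one IP address per line.
smnodes_diff() {
    sd_m=$(mktemp) && sd_p=$(mktemp) || return 2
    # Trim surrounding whitespace, drop blank lines, then sort for comm.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1" | sort -u > "$sd_m"
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$2" | sort -u > "$sd_p"
    comm -23 "$sd_m" "$sd_p" | sed 's/^/< /'
    comm -13 "$sd_m" "$sd_p" | sed 's/^/> /'
    rm -f "$sd_m" "$sd_p"
}
```

An address reported with "< " would be a candidate for smnodes add on the peer, and one reported with "> " a candidate for smnodes delete on the peer, provided the list on SMINFO_MASTER is the correct one.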

 

enablesm (i.e. /usr/local/sbin/enablesm) starts opensm (i.e. via service opensmd start) followed by partitiond (i.e. via service partconfigd start).

disablesm (i.e. /usr/local/sbin/disablesm) stops the processes started by enablesm in reverse order.

opensm is the name of the running process as shown in the "ps -ef" output, whereas /etc/init.d/opensmd is the corresponding service script.

partitiond is the name of the running process as shown in the "ps -ef" output, whereas /etc/init.d/partconfigd is the corresponding service script.
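
Because the process names and the service script names differ, and because a command such as ps -ef | grep 'part' can also list the grep process itself, a small wrapper may make the checks less ambiguous. The following is a sketch only: proc_listed and service_for are hypothetical helper names, and the name mapping is the one described in this document.

```shell
#!/bin/sh
# Return success if process name $1 occurs in `ps -ef` output read from stdin.
# The pattern is rewritten from e.g. "partitiond" to "[p]artitiond" so that,
# when used in a live pipeline, grep's own command line does not match.
# This is a substring match, so "opensm" also matches "opensmd".
proc_listed() {
    pl_first=$(printf '%s' "$1" | cut -c1)
    pl_rest=$(printf '%s' "$1" | cut -c2-)
    grep -q "[$pl_first]$pl_rest"
}

# Print the service script name that corresponds to a process name,
# following the mapping given in this document.
service_for() {
    case "$1" in
        opensm)     echo opensmd ;;
        partitiond) echo partconfigd ;;
        *)          return 1 ;;
    esac
}
```

For example, ps -ef | proc_listed partitiond succeeds only when a partitiond process is listed, and service_for partitiond prints partconfigd, the name to use with the service command.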

 

 


  Copyright © 2018 Oracle, Inc.  All rights reserved.