Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-72-1505294.1
Update Date:2012-11-13
Keywords:

Solution Type  Problem Resolution Sure

Solution  1505294.1 :   Global InfiniBand commands, such as "ibnetdiscover" or "ibswitches", fail with "mad_rpc" warning messages.


Related Items
  • Oracle Exalogic Elastic Cloud X2-2 Full Rack
Related Categories
  • PLA-Support>Sun Systems>SAND>Network>SN-SND: Sun Network Infiniband
  • _Old GCS Categories>Sun Microsystems>Switches>Sun InfiniBand IB




In this Document
Symptoms
Cause
Solution


Applies to:

Oracle Exalogic Elastic Cloud X2-2 Full Rack - Version All Versions to All Versions [Release All Releases]
Linux x86

Symptoms

Symptoms include very slow responses to global InfiniBand commands (ibnetdiscover, ibswitches), together with errors such as the following:

ibwarn: [31013] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,26,13)
ibwarn: [31013] discover: can't reach node DR path slid 0; dlid 0; 0,26,13 port 16
ibwarn: [31013] _do_madrpc: recv failed: Connection timed out
ibwarn: [31013] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,26,13,18)
ibwarn: [31013] handle_port: NodeInfo on DR path slid 0; dlid 0; 0,26,13,18 failed, skipping port
ibwarn: [31013] _do_madrpc: recv failed: Invalid argument
ibwarn: [31013] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,26,13)
ibwarn: [31013] discover: can't reach node DR path slid 0; dlid 0; 0,26,13 port 19
ibwarn: [31013] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,26,13,33)
ibwarn: [31013] handle_port: NodeInfo on DR path slid 0; dlid 0; 0,26,13,33 failed, skipping port
ibwarn: [31013] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,26,13,34)
ibwarn: [31013] handle_port: NodeInfo on DR path slid 0; dlid 0; 0,26,13,34 failed, skipping port

 

Cause

Whenever a global InfiniBand command is run, it tries to contact every InfiniBand device in the fabric that has a link. If the device at the other end of a link does not respond, these error messages are displayed to indicate that it did not respond. That device may be down or crashed, in a state where the IB link is still present but nothing else is working.

Solution

Based on the DR (Direct Route) path information in the warnings, determine which InfiniBand device is not responding by collecting the output of

# ibnetdiscover -s

and then verify that the device is properly booted and in an up and stable state.
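
As a starting point for that comparison, the DR paths named in the warnings can be tallied so the affected hops stand out. The following is a minimal Python sketch, not part of the original solution; the function name failing_dr_paths and the SAMPLE text are illustrative assumptions, and the pattern assumes the warning format shown under Symptoms.

```python
import re
from collections import Counter

# Matches the "DR path slid 0; dlid 0; 0,26,13" fragment of an ibwarn line
# and captures the comma-separated direct-route path.
DR_PATH_RE = re.compile(r"DR path slid \d+; dlid \d+; ([\d,]+)")


def failing_dr_paths(log_text):
    """Count how often each direct-route path appears in ibwarn output.

    Hypothetical helper: it only parses text already captured from a
    command such as ibnetdiscover; it does not query the fabric itself.
    """
    counts = Counter()
    for line in log_text.splitlines():
        if not line.startswith("ibwarn:"):
            continue  # ignore normal topology output
        match = DR_PATH_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input copied from the warnings shown under Symptoms.
SAMPLE = """\
ibwarn: [31013] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,26,13)
ibwarn: [31013] discover: can't reach node DR path slid 0; dlid 0; 0,26,13 port 16
ibwarn: [31013] _do_madrpc: recv failed: Connection timed out
ibwarn: [31013] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,26,13,18)
"""

if __name__ == "__main__":
    # Print each path with the number of warnings that mention it.
    for path, hits in failing_dr_paths(SAMPLE).most_common():
        print(path, hits)
```

A path that recurs as the prefix of the other failing paths (0,26,13 in the warnings above) is a reasonable first hop to look up in the ibnetdiscover output; the devices attached to the ports named after it are the candidates to check.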


Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.