Sun Microsystems, Inc.  Oracle System Handbook - ISO 7.0 May 2018 Internal/Partner Edition

Asset ID: 1-71-2263643.1
Update Date:2017-06-14
Keywords:

Solution Type  Technical Instruction Sure

Solution  2263643.1 :   Information On Infiniband Architecture Layer  


Related Items
  • Oracle Exalogic Elastic Cloud Software
  • Sun Network QDR InfiniBand Gateway Switch
Related Categories
  • PLA-Support>Eng Systems>Exalogic/OVCA>Oracle Exalogic>MW: Exalogic Core




In this Document
Goal
Solution
References


Created from <SR 3-14848718931>

Applies to:

Oracle Exalogic Elastic Cloud Software - Version 1.0.0.0.0 and later
Sun Network QDR InfiniBand Gateway Switch - Version All Versions to All Versions [Release All Releases]
Linux x86-64
Oracle Solaris on x86-64 (64-bit)
Oracle Virtual Server x86-64

Goal

This document outlines the InfiniBand network layers and compares each layer with its counterpart in a traditional TCP/IP network at a high level. Although each InfiniBand layer serves a function similar to the corresponding TCP/IP layer, the underlying technical details are very different.

Please refer to the latest version of the IB specification for further detail.

Solution

InfiniBand (abbreviated IB) is a computer-networking communications standard, used in high-performance computing (HPC) and in enterprise data centers, that features very high throughput and very low latency.

IB is a completely different network technology from traditional Ethernet-based TCP/IP networks, from software to hardware. IB is designed to work independently to provide a message-delivery service to applications. That means the IB software stack can work with applications directly, without using IP addresses to communicate between end nodes. Although IB is a completely different network technology, its layered architecture follows the OSI 7-layer model and defines layers similar to those of a TCP/IP network. Below is a diagram of the TCP/IP layers compared with the OSI model:

[Figure: tcp_ip_layer_small]

Below is the diagram of IB layers:

[Figure: IB_Layer_small]

As you can see, both IB and TCP/IP networks define a 5-layer protocol stack: physical layer, link layer, network layer, transport layer, and application layer. In an IB network, the application layer is also called the upper layer. From the application layer's point of view, the underlying layers of both IB and TCP/IP provide a message service.

If you have a basic knowledge of Ethernet-based TCP/IP networks, you will know that each layer on one host communicates only with the same layer on the remote host, and that each layer uses its own kind of address. For example, a MAC address is used on the link layer, an IP address on the network layer, and a port number on the transport layer, while a URL might be used on the application layer. The packet is the basic unit of information transferred across a network. It consists, at a minimum, of a header for each layer with the sending and receiving addresses, and a body with the data to be transferred. As the packet travels through the TCP/IP protocol stack, the protocol at each layer either adds or removes its own header. When a protocol on the sending host adds a header to the packet, the process is called data encapsulation.

Because the address information must be filled into the header of each layer during encapsulation, an address resolution step might be required to determine the address. For example, when we type a URL in a browser, the URL is the address used on the application layer. Before the HTTP request is sent to the network layer, a DNS lookup is performed to get the IP address for the host name in the URL. When the request travels from the network layer to the link layer, the host has to obtain the MAC address of the next hop through an ARP broadcast so that the MAC address can be filled into the Ethernet header of the packets.
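The encapsulation process described above can be sketched as a toy model in Python. This is only an illustration: the header format ("layer:src>dst|"), the function names and the sample addresses are assumptions made for this sketch, and real headers are binary structures with many more fields.

```python
# Toy model of data encapsulation: each layer prepends its own header on the
# sending host and strips it again on the receiving host.
# The text header format "layer:src>dst|" is invented for this sketch.

LAYERS = ["transport", "network", "link"]  # order applied when sending


def encapsulate(payload: bytes, addresses: dict) -> bytes:
    """Prepend one header per layer; the link header ends up outermost."""
    packet = payload
    for layer in LAYERS:
        src, dst = addresses[layer]
        packet = f"{layer}:{src}>{dst}|".encode() + packet
    return packet


def decapsulate(packet: bytes):
    """Strip the headers from the outside in; return (headers, payload)."""
    headers = []
    for layer in reversed(LAYERS):
        header, _, packet = packet.partition(b"|")
        name, _, addrs = header.decode().partition(":")
        if name != layer:
            raise ValueError(f"expected {layer} header, got {name!r}")
        src, _, dst = addrs.partition(">")
        headers.append((name, src, dst))
    return headers, packet


if __name__ == "__main__":
    sample = {
        "transport": ("49152", "80"),                           # port numbers
        "network": ("192.0.2.10", "192.0.2.20"),                # IP addresses
        "link": ("02:00:00:00:00:01", "02:00:00:00:00:02"),     # MAC addresses
    }
    wire = encapsulate(b"GET / HTTP/1.1", sample)
    print(wire)
    print(decapsulate(wire))
```

The same pattern applies to an IB packet, except that the link, network and transport headers carry LIDs, GIDs and QP numbers instead of MAC addresses, IP addresses and port numbers.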

The IB protocol stack is similar in the general function of each layer, though the underlying details are very different. As mentioned above, IB can work without IP addresses, which means that IB has its own addressing scheme.

On the link layer, Local Identifiers (LIDs) are used as communication addresses. LIDs are subnet-unique 16-bit identifiers used by switches for routing within a subnet. Unlike a MAC address, which is assigned by the manufacturer and stored in the network adapter, LIDs are assigned by the Subnet Manager (SM), which, as its name suggests, configures and controls the IB subnet. The SM assigns a LID to each port on an end node and to each switch as a whole (individual switch ports are not assigned LIDs).
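To make the assignment rule concrete, here is a minimal sketch of a toy Subnet Manager that hands out LIDs. The class and function names are hypothetical, and the sequential numbering is an assumption of this sketch. A real SM discovers the fabric topology first and may use a different assignment policy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str          # "ca" for an end node, "switch" for a switch
    num_ports: int
    lids: dict = field(default_factory=dict)   # port number -> LID


def assign_lids(nodes, first_lid=1):
    """Assign one LID per end-node port and a single LID per switch."""
    next_lid = first_lid
    for node in nodes:
        if node.kind == "switch":
            # One LID for the whole switch, recorded here under port 0.
            # The external switch ports get no LID of their own.
            ports = [0]
        else:
            ports = range(1, node.num_ports + 1)
        for port in ports:
            if next_lid > 0xFFFF:
                raise ValueError("LIDs are 16-bit values; subnet is full")
            node.lids[port] = next_lid
            next_lid += 1
    return {node.name: dict(node.lids) for node in nodes}


if __name__ == "__main__":
    fabric = [
        Node("host1", "ca", 2),
        Node("sw1", "switch", 36),
        Node("host2", "ca", 1),
    ]
    print(assign_lids(fabric))
```

In this model a two-port host consumes two LIDs, while a 36-port switch consumes only one.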

On the network layer, Global IDs (GIDs) are used for routing across subnets, similar to IP addresses. In an IP network, the source and destination IP addresses in a packet header can belong to different subnets, and a router (gateway) forwards such packets. At the time of writing, no IB router device has been produced, because IB devices currently work mainly within a single subnet. GIDs are 128-bit identifiers used to identify an end node port, a switch, and so on. A GID is constructed by prepending a 64-bit GID prefix onto a GUID.

Globally Unique IDs (GUIDs) are 64-bit identifiers, in the EUI-64 format defined by the IEEE, for elements in a subnet. A GUID is a global-scope EUI-64 identifier assigned to a device, created by concatenating a 24-bit company_id, assigned by the IEEE Registration Authority, with a 40-bit extension identifier. Manufacturers assign GUIDs to a chassis, channel adapter (CA), switch, CA port, and router port. The SM may assign additional local-scope EUI-64 identifiers to a CA port or a router port. GUIDs are built into IB devices when they are manufactured, which is similar to the MAC address of an Ethernet card. The difference is that a GUID is not used as a link layer address.
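The bit layout of a GUID can be illustrated with a few lines of Python. The helper names and the sample company_id and extension values below are made up for illustration and do not belong to any real vendor or device.

```python
COMPANY_ID_BITS = 24
EXTENSION_BITS = 40


def make_guid(company_id: int, extension_id: int) -> int:
    """Concatenate a 24-bit company_id with a 40-bit extension identifier."""
    if not 0 <= company_id < (1 << COMPANY_ID_BITS):
        raise ValueError("company_id must fit in 24 bits")
    if not 0 <= extension_id < (1 << EXTENSION_BITS):
        raise ValueError("extension_id must fit in 40 bits")
    return (company_id << EXTENSION_BITS) | extension_id


def split_guid(guid: int):
    """Return (company_id, extension_id) for a 64-bit GUID."""
    return guid >> EXTENSION_BITS, guid & ((1 << EXTENSION_BITS) - 1)


def format_guid(guid: int) -> str:
    """Render the GUID as 16 hexadecimal digits."""
    return f"0x{guid:016x}"


if __name__ == "__main__":
    guid = make_guid(0x001122, 0x3344556677)   # illustrative values only
    print(format_guid(guid))
    print(split_guid(guid))
```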

A GID is built on top of a GUID: the Subnet Manager constructs a GID by prepending a 64-bit GID prefix onto a GUID. The GID prefix has a detailed format of its own, which includes a 16-bit subnet prefix as its low-order 16 bits.
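As a rough sketch, a GID can be computed by placing the 64-bit prefix in the upper half of a 128-bit value and the GUID in the lower half. The example below assumes the link-local default prefix fe80::/64 and uses Python's ipaddress module only for formatting, since GIDs are conventionally written like IPv6 addresses. The function name and the sample GUID are hypothetical.

```python
import ipaddress

# Assumption for this sketch: the default (link-local) GID prefix.
DEFAULT_GID_PREFIX = 0xFE80_0000_0000_0000


def make_gid(gid_prefix: int, guid: int) -> ipaddress.IPv6Address:
    """Prepend a 64-bit GID prefix onto a 64-bit GUID to form a 128-bit GID."""
    if not 0 <= gid_prefix < (1 << 64):
        raise ValueError("GID prefix must fit in 64 bits")
    if not 0 <= guid < (1 << 64):
        raise ValueError("GUID must fit in 64 bits")
    return ipaddress.IPv6Address((gid_prefix << 64) | guid)


if __name__ == "__main__":
    gid = make_gid(DEFAULT_GID_PREFIX, 0x0011223344556677)  # made-up GUID
    print(gid)            # compressed IPv6-style notation
    print(gid.exploded)   # all eight 16-bit groups written out
```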

On the transport layer, Queue Pairs (QPs) are used. Each QP consists of a Send queue and a Receive queue. The QP number is somewhat similar to the port number on the transport layer of a TCP/IP network in that it identifies the communication endpoint. As in a TCP/IP network, the transport layer is responsible for segmenting a message into multiple packets when the message's data payload is greater than the maximum transfer unit (MTU) of the path. The QP on the receiving end reassembles the data into the specified data buffer in its memory.

QPs are allocated by the IB device. Each HCA (Host Channel Adapter) port requires at least three QPs: two are used for management purposes (QP0 and QP1), and the other is used for application data transfer.
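Segmentation and reassembly can be sketched as follows. This is a simplified model: the sequence number is a plain counter invented for this sketch, and in real hardware the work is done by the channel adapter rather than by application code.

```python
def segment(message: bytes, mtu: int):
    """Split a message into (sequence number, payload) packets of at most mtu bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [
        (seq, message[offset:offset + mtu])
        for seq, offset in enumerate(range(0, len(message), mtu))
    ]


def reassemble(packets) -> bytes:
    """Rebuild the message on the receiving side from its packets."""
    # Sorting by sequence number is only a convenience of this toy model.
    return b"".join(payload for _, payload in sorted(packets))


if __name__ == "__main__":
    message = bytes(range(256)) * 20          # 5120-byte message
    packets = segment(message, mtu=2048)      # 2048 is one possible path MTU
    print([len(payload) for _, payload in packets])
    print(reassemble(packets) == message)
```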

Below is the complete IB packet format:

[Figure: IB_PACKET_FORMAT_s]

The Local Routing Header (LRH) is used by the link layer to move a packet across a subnet to the network port of its destination end node. Each packet carries the destination and source LIDs (DLID and SLID) in its LRH. Like the Ethernet header in a TCP/IP network, the LRH is interpreted by IB switches to forward packets within a subnet.

The Global Routing Header (GRH) is used by the network layer to move packets between subnets. Each packet carries the destination and source GIDs (DGID and SGID) in its GRH. Like the IP header of a packet in a TCP/IP network, the GRH is interpreted by a device working on the network layer (a router) to route packets between subnets.
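The forwarding decision made by a switch can be modelled as a simple table lookup on the DLID. The sketch below is a toy model with hypothetical class and method names. It assumes the forwarding table is filled in by the Subnet Manager and ignores all other LRH fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LRH:
    """Only the two LID fields of the Local Routing Header are modelled."""
    slid: int   # source LID
    dlid: int   # destination LID


class Switch:
    def __init__(self, name: str):
        self.name = name
        self.forwarding_table = {}   # DLID -> output port number

    def program(self, dlid: int, out_port: int) -> None:
        """Install a table entry (done by the Subnet Manager in this model)."""
        self.forwarding_table[dlid] = out_port

    def forward(self, lrh: LRH):
        """Return the output port for a packet, or None if the DLID is unknown."""
        return self.forwarding_table.get(lrh.dlid)


if __name__ == "__main__":
    sw = Switch("sw1")
    sw.program(dlid=4, out_port=7)
    print(sw.forward(LRH(slid=1, dlid=4)))
    print(sw.forward(LRH(slid=1, dlid=9)))
```

Note that the switch never looks at the SLID when forwarding. Only the destination LID selects the output port.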

The Base Transport Header (BTH) is used by the transport layer protocol. The destination QP is carried in the BTH. The figure below shows an IPoIB ARP request captured by ibdump. As you may have noticed, the BTH carries no source QP, unlike TCP/IP, where both the source and destination ports are present. You may wonder how the destination host can send a response back to the source host without finding the source QP in the packets it receives. The answer is that the remote host does know the QP on the source host, but it does not learn it from the data packets it receives. For example, with connected-mode IPoIB, before the source host sends its first application data packet to the destination host, a negotiation is performed through the Communication Manager (CM) so that the source and destination hosts become aware of each other's QPs. The CM is an agent located on IB hosts that provides the services needed to allow hosts to establish connections. The CM communicates through the well-known QP1.

In addition to the destination QP, the BTH has an opcode, a partition key, and a number of other fields. A partition is somewhat like a VLAN in an Ethernet network.
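The idea of the negotiation can be sketched as a toy exchange in which each side tells the other its QP number before any data is sent. The class and function names are hypothetical, the two messages are heavily simplified stand-ins for the real CM request and reply, and QP numbering from 2 upward is an assumption of this sketch.

```python
from itertools import count


class Host:
    def __init__(self, name: str):
        self.name = name
        # QP0 and QP1 are reserved for management, so this toy model
        # starts numbering data QPs at 2.
        self._qpn_counter = count(2)
        self.peers = {}   # peer name -> (local QP number, remote QP number)

    def allocate_qp(self) -> int:
        return next(self._qpn_counter)


def cm_connect(client: Host, server: Host):
    """Exchange QP numbers through a simplified request/reply handshake."""
    client_qpn = client.allocate_qp()
    # Request: the client announces its QP number (sent via QP1 in real IB).
    request = {"type": "REQ", "from": client.name, "qpn": client_qpn}

    server_qpn = server.allocate_qp()
    # Reply: the server answers with its own QP number.
    reply = {"type": "REP", "from": server.name, "qpn": server_qpn}

    # After the handshake, each side knows both ends of the connection.
    server.peers[request["from"]] = (server_qpn, request["qpn"])
    client.peers[reply["from"]] = (client_qpn, reply["qpn"])
    return client_qpn, server_qpn


if __name__ == "__main__":
    a, b = Host("hostA"), Host("hostB")
    print(cm_connect(a, b))
    print(a.peers, b.peers)
```

Once the exchange is complete, data packets only need to name the destination QP, because the receiver already holds the sender's QP number in its connection state.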

[Figure: IB_PACKET_FORMAT1_s]
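For readers who want to decode a captured BTH by hand, the following sketch unpacks the 12-byte header. The field offsets reflect one reading of the BTH layout in the IB specification and should be verified against the specification before being relied on. The function name and the sample bytes are made up for illustration.

```python
import struct


def parse_bth(bth: bytes) -> dict:
    """Decode a 12-byte Base Transport Header (layout assumed, see lead-in)."""
    if len(bth) != 12:
        raise ValueError("a BTH is 12 bytes long")
    # Big-endian: opcode (1 byte), flags (1), P_Key (2), two 32-bit words.
    opcode, flags, pkey, dest_qp_word, psn_word = struct.unpack(">BBHII", bth)
    return {
        "opcode": opcode,
        "solicited_event": bool(flags & 0x80),
        "pad_count": (flags >> 4) & 0x3,
        "transport_version": flags & 0x0F,
        "partition_key": pkey,
        "dest_qp": dest_qp_word & 0xFFFFFF,   # low 24 bits of the word
        "ack_request": bool(psn_word >> 31),
        "psn": psn_word & 0xFFFFFF,           # packet sequence number
    }


if __name__ == "__main__":
    # Made-up sample header, not taken from a real capture.
    sample = struct.pack(">BBHII", 0x64, 0x00, 0xFFFF, 0x00FFFFFF, 0x00000001)
    for name, value in parse_bth(sample).items():
        print(f"{name}: {value}")
```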

In summary, GIDs and LIDs are assigned by the Subnet Manager, while GUIDs are assigned at the factory. The LID is used on the link layer, the GID on the network layer, and the QP on the transport layer.

Although IB can work without IP, in practice IB still works closely with TCP/IP in most cases. Part of the reason is the complexity of application coding. IB provides its message service to upper layer applications through verbs (the interface that IB provides to applications), which requires application developers to be familiar with IB/RDMA technology and increases the complexity of application code. To let traditional applications use the IB network without any code changes, ULPs (Upper Layer Protocols) were introduced.

ULPs are special applications (from the IB point of view) written over the verbs interface provided by IB. They bridge standard interfaces such as TCP/IP (or traditional applications) to IB, so that legacy applications can run unchanged. As the IB software stack diagram below shows, the protocols on the ULP layer include IPoIB, SDP, RDS, and others.

[Figure: IB_software_stack]

As the following figure shows more clearly, IPoIB (IP over IB) bridges the existing TCP/IP kernel module and the IB fabric by encapsulating IP packets over IB. The advantage of IPoIB is that traditional IP-based applications and the TCP/IP module of the operating system kernel can use the IB network without any changes. The disadvantage is that it cannot use the kernel-bypass and CPU-offload features of IB.

[Figure: IPoIB]

Another ULP is SDP (Sockets Direct Protocol). SDP bridges traditional IP-based applications and IB. Because SDP bypasses the TCP and IP layers of the kernel, it reduces the protocol-handling overhead of TCP/IP and avoids copying data between the application buffer and the kernel TCP/IP buffer. This offloads work from the CPU, which is what is meant by kernel bypass or CPU bypass.

[Figure: sdp]

Finally, note that in a software stack that involves a ULP, from the TCP/IP point of view the entire IB network operates at the link layer of the TCP/IP network. From the IB point of view, ULPs operate at the application layer of the IB network.

References

<NOTE:1374890.1> - Exalogic FAQs

Attachments
This solution has no attachment
  Copyright © 2018 Oracle, Inc.  All rights reserved.