Date post: | 19-Dec-2015 |
Contemporary High-speed Techniques
Tan Li
2
Outline
• Native InfiniBand
  - Components
  - Subnet Management and Services
• High Speed Ethernet (HSE) Family
  - Internet Wide Area RDMA Protocol (iWARP)
  - Alternate choice - OpenOnLoad
• InfiniBand/Ethernet Convergence Technologies
  - (InfiniBand) RDMA over Ethernet (RoE)
  - (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
• Resources
• Summary
3
Outline - Native InfiniBand
• Recall
• InfiniBand Components
• InfiniBand Link Speed Roadmap
• InfiniBand Communication Model
• InfiniBand Switching and Routing
• InfiniBand Transport Layer
• Subnet Management and Services
4
Recall - Comparing InfiniBand with Traditional Networking Stack
5
Recall - InfiniBand Protocol Offload Engines
Completely implement ISO/OSI layers 2‐4 (link layer, network layer and transport layer) in hardware
Stack (top to bottom): verbs, IB transport, IB network, IB link/phy, IB fabric
6
InfiniBand Components
• Cables and Connectors
• Channel Adapter
• Switches
• Routers
7
Cables and Connectors
• Volume 2 of the Architecture Specification is devoted to the physical and electrical characteristics of InfiniBand. This has enabled vendors to offer a wide range of copper and optical cables in several widths (4x, 12x) and speed grades (SDR, DDR, QDR).
• Many networks so far (1GE, Myrinet, Quadrics) used 8b/10b encoding
• Newer networks (IB post‐QDR, HSE >= 10GE) use 64b/66b encoding
• The eternal IB confusion:
  - All networks other than IB specify the data rate (1 Gigabit Ethernet == 1 Gbps data rate)
  - IB initially broke this convention: when a 4x IB link (up to QDR) is reported as 10/20/40 Gbps, that is actually the signaling rate; the data rate is 8/16/32 Gbps
  - The IB FDR and EDR standards fixed this “error” and started reporting the data rate (IB EDR reported as 100 Gbps is truly the data rate; the signaling rate is 103.125 Gbps)
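The relation between signaling rate and data rate above is plain encoding arithmetic. A minimal sketch, assuming only the two encodings named on this slide (the function names are illustrative, not from any library):

```python
from fractions import Fraction

# Encoding efficiency: payload bits per bit on the wire.
ENCODINGS = {"8b/10b": Fraction(8, 10), "64b/66b": Fraction(64, 66)}

def data_rate_gbps(signaling_gbps, encoding):
    """Usable data rate for a given signaling rate."""
    return Fraction(signaling_gbps) * ENCODINGS[encoding]

def signaling_rate_gbps(data_gbps, encoding):
    """Signaling rate needed to carry a given data rate."""
    return Fraction(data_gbps) / ENCODINGS[encoding]

# 4x SDR/DDR/QDR are quoted by signaling rate and use 8b/10b.
for name, signaling in [("SDR", 10), ("DDR", 20), ("QDR", 40)]:
    print(name, float(data_rate_gbps(signaling, "8b/10b")))

# EDR is quoted by data rate and uses 64b/66b.
print("EDR", float(signaling_rate_gbps(100, "64b/66b")))
```

Exact fractions are used so that the 103.125 Gbps figure comes out without floating-point noise.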
8
Channel Adapter (CA)
Figure: two hosts, each with App, QP, OS and CA
• Different terms: HCA, TCA, DCA
• Channel adapter is a service, not a hardware service
• Using address translation mechanisms to access the queue pairs
9
InfiniBand switches & routers
• Switches: IB supports Virtual Cut Through (VCT). This is a subtle but key element of InfiniBand, since it means that packets are never dropped in the network during normal operation. This “no drop” behavior is central to the operation of InfiniBand’s highly efficient transport protocol.
• Routers: the routing algorithm is not specified by the IB spec; Up*/Down* and Shift are popular routing engines supported by OFED. Since InfiniBand’s management architecture is defined on a per-subnet basis, using an InfiniBand router allows a large network to be partitioned into a number of smaller subnets. This enables InfiniBand networks to scale to very large sizes without the performance penalty of routing management traffic throughout the entire network.
Research spot: IB routing & WAN capability
10
InfiniBand link speed roadmap
11
InfiniBand link speed roadmap
12
InfiniBand Communication Model
1. Queue Model
2. Overview
3. Memory Registration
4. Memory Protection
5. Verbs
13
InfiniBand Communication Model - Queue Pair (QP) Model
• Send Queue (SQ)
• Receive Queue (RQ)
• Completion Queue (CQ)
• Work requests (WQEs)
• Notification of operation completion (CQE)
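How the three queues relate can be sketched with a toy model. This is not the verbs API: the class, the method names and the `deliver` helper are hypothetical and only mimic the roles of SQ, RQ and CQ.

```python
from collections import deque

class QueuePair:
    """Toy model: one send queue, one receive queue, one completion queue."""
    def __init__(self):
        self.sq = deque()   # posted send WQEs
        self.rq = deque()   # posted receive WQEs (empty buffers)
        self.cq = deque()   # CQEs waiting for the application

    def post_send(self, wr_id, payload):
        self.sq.append((wr_id, payload))

    def post_recv(self, wr_id):
        self.rq.append(wr_id)

    def poll_cq(self):
        """Return and remove all pending CQEs."""
        done = list(self.cq)
        self.cq.clear()
        return done

def deliver(sender, receiver):
    """Stand-in for the adapters: move one message from the sender's SQ
    into the oldest receive WQE of the receiver, then report completions."""
    send_id, payload = sender.sq.popleft()
    recv_id = receiver.rq.popleft()   # a receive WQE must already be posted
    sender.cq.append(("send", send_id))
    receiver.cq.append(("recv", recv_id, payload))

a, b = QueuePair(), QueuePair()
b.post_recv(wr_id=1)
a.post_send(wr_id=7, payload=b"hello")
deliver(a, b)
print(a.poll_cq())
print(b.poll_cq())
```

The point of the sketch is the ordering: the receiver posts its WQE first, and both sides learn about the outcome only through their CQ.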
14
InfiniBand Communication Model - Overview
15
InfiniBand Communication Model – Memory Registration
1. Registration request
2. Kernel handles the virtual->physical mapping and pins the region into physical memory
3. HCA caches the virtual-to-physical mapping and issues a handle
4. Handle is returned to the application
All memory used for communication must be registered!
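The four registration steps can be mimicked in a few lines. A minimal sketch, assuming a hypothetical `ToyHCA` class that stands in for the adapter's registration table; pinning and address translation are not modelled.

```python
import itertools

class ToyHCA:
    """Hypothetical stand-in for an adapter's registration table."""
    def __init__(self):
        self._next_handle = itertools.count(1)
        self.regions = {}            # handle -> (start address, length)

    def register(self, addr, length):
        # Steps 2-3: the mapping is recorded and a handle is issued.
        handle = next(self._next_handle)
        self.regions[handle] = (addr, length)
        return handle                # step 4: returned to the application

    def is_registered(self, addr, length):
        """True if [addr, addr + length) lies inside one registered region."""
        return any(start <= addr and addr + length <= start + size
                   for start, size in self.regions.values())

hca = ToyHCA()
handle = hca.register(addr=0x1000, length=4096)
print(handle, hca.is_registered(0x1800, 256), hca.is_registered(0x3000, 16))
```

A buffer outside every registered region fails the check, which is the toy equivalent of the rule that all communication memory must be registered.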
16
InfiniBand Communication Model – Memory Protection
• To send or receive data, the l_key must be provided to the HCA
• For security, keys are required for all operations that touch buffers
• For RDMA, the initiator must have the r_key for the remote virtual address
r_key is not encrypted in IB!
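The key check for an RDMA write can be illustrated as follows. This is a sketch under assumptions: `Region` and `rdma_write` are hypothetical names, and the check is reduced to a key comparison plus a bounds test.

```python
class Region:
    """A registered buffer on the target side, protected by an r_key."""
    def __init__(self, base, size, r_key):
        self.base, self.r_key = base, r_key
        self.buf = bytearray(size)

def rdma_write(region, r_key, remote_addr, data):
    """Toy target-side check: the key must match and the range must fit."""
    if r_key != region.r_key:
        raise PermissionError("r_key mismatch")
    offset = remote_addr - region.base
    if offset < 0 or offset + len(data) > len(region.buf):
        raise ValueError("access outside the registered region")
    region.buf[offset:offset + len(data)] = data

target = Region(base=0x2000, size=16, r_key=0xBEEF)
rdma_write(target, 0xBEEF, 0x2004, b"data")
print(bytes(target.buf[4:8]))
```

Because the key is compared in the clear, anyone who learns it can pass the check, which is what the warning about unencrypted r_keys refers to.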
17
InfiniBand Communication Model - Verbs
• Post receive, send
• RDMA-read, RDMA-write
• Notify CQEs
Typical flow (the kernel is involved only in step 1):
1. Memory registration
2. Post receive and send WQEs
3. Poll completed CQEs from the CQ
18
InfiniBand Switching and Routing
Virtual Lanes
• Multiple virtual links within the same physical link
• VL15: reserved for management
• Each port supports one or more data VLs
Service Levels
• Packets may operate at one of 16 different SLs
• Meaning not defined by IB
• SL determines which VL on the next link is to be used
• Each port (switches, routers, end nodes) has an SL-to-VL mapping table configured by the subnet manager
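An SL-to-VL table is just a 16-entry lookup. A minimal sketch, assuming a hypothetical round-robin policy for filling it; the actual policy is chosen by subnet management and is not described on this slide.

```python
MANAGEMENT_VL = 15          # reserved for management traffic
NUM_SLS = 16

def make_sl2vl(data_vls):
    """Build a table that spreads the 16 SLs over the given data VLs."""
    if not data_vls or MANAGEMENT_VL in data_vls:
        raise ValueError("need at least one data VL, and VL15 is not a data VL")
    return [data_vls[sl % len(data_vls)] for sl in range(NUM_SLS)]

def select_vl(sl2vl, sl):
    """Pick the VL for the next link from the packet's SL."""
    return sl2vl[sl]

table = make_sl2vl([0, 1, 2, 3])
print(select_vl(table, 0), select_vl(table, 5), select_vl(table, 15))
```

With four data VLs, SL 5 maps to VL 1 and SL 15 to VL 3; no SL ever maps to VL15.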
19
InfiniBand Switching and Routing
Allow the multiplexing of the multiple independent logical traffic flows on the same physical link
Simulate multiple networks in one physical network
20
InfiniBand Switching and Routing
• Sender can utilize multiple LIDs associated with the same destination port
  - Packets sent to one DLID take a fixed path
  - Different packets can be sent using different DLIDs
  - Each DLID can have a different path (switches can be configured differently for each DLID)
• Each QP utilizes a single LID (one-to-one)
  - All WQEs posted on the same QP take the same path
  - All packets are received by the receiver in the same order
  - All receive WQEs are completed in the order in which they were posted
• Handling out-of-order packets
  - IB uses a simplistic approach: if packets in one connection arrive out‐of‐order, they are dropped
Mark a node: LID + GID, comparable to MAC + IP on Ethernet
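The idea that each DLID follows its own fixed path can be shown with per-switch forwarding tables. The fabric below (switch names, LID values) is entirely hypothetical.

```python
# Each switch maps a DLID to the next hop. Two DLIDs lead to the same host.
FORWARDING = {
    "sw1": {0x10: "sw2", 0x11: "sw3"},
    "sw2": {0x10: "hostB"},
    "sw3": {0x11: "hostB"},
}

def route(first_switch, dlid):
    """Follow the forwarding tables hop by hop until a host is reached."""
    path, node = [], first_switch
    while node in FORWARDING:
        path.append(node)
        node = FORWARDING[node][dlid]
    return path + [node]

print(route("sw1", 0x10))
print(route("sw1", 0x11))
```

Packets for one DLID always traverse the same switches, so a QP bound to a single LID sees in-order delivery, while a sender using both DLIDs spreads load over two paths.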
21
InfiniBand Transport Layer
IB Transport Services (Queue-pair based)
22
InfiniBand Transport Layer
IB allows link rates to be statically changed
• On a 4X link, we can set data to be sent at 1X
• For heterogeneous links, the rate can be set to the lowest link rate
• Useful for low‐priority traffic
Auto‐negotiation is also available
• E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at the 1X rate
Only fixed settings are available
• Cannot set a rate requirement of 3.16 Gbps, for example
Demo
23
Subnet Management and Services
Subnet Management Agents (SMA)
• Processes or hardware units running on each adapter, switch, router (everything on the network)
• Provide capability to query and set parameters
Managers
• Make high-level decisions and implement them on the network fabric using the agents
Subnet management packets (SMPs)
• Used for interactions between the manager and agents (or between agents)
Messages
24
Subnet Management and Services
25
Subnet Management and Services
Subnet management packets (SMPs) define the operation to be performed by the SM:
• Get: get information about a CA, switch, or port
• Set: set an attribute of a port (e.g. LID)
• GetResp: the response to a Get or Set
• Trap: inform the SM about the state of a local node
• An SMA keeps sending a Trap message until it receives a TrapRepress packet.
• Topology information can be obtained by a sweep and by periodic Traps.
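The Trap and TrapRepress interplay can be sketched as a tiny state machine. `ToySMA` and its methods are hypothetical; timers and packet formats are left out.

```python
class ToySMA:
    """Hypothetical agent that repeats a Trap until the SM represses it."""
    def __init__(self):
        self.pending_trap = None

    def raise_trap(self, info):
        self.pending_trap = info

    def tick(self):
        """Called periodically; returns the message to send, if any."""
        if self.pending_trap is None:
            return None
        return ("Trap", self.pending_trap)

    def receive(self, method):
        if method == "TrapRepress":
            self.pending_trap = None

sma = ToySMA()
sma.raise_trap("port 3 link down")
sent = [sma.tick(), sma.tick()]   # repeated while unacknowledged
sma.receive("TrapRepress")
sent.append(sma.tick())           # nothing more to send
print(sent)
```

The repeat-until-repressed loop is what lets the SM learn about a change even if one Trap is lost.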
26
Subnet Management and Services
Subnet Management phases:
• Topology discovery: sending directed-route SMPs to every port and processing the responses
• Path computation: computing valid paths between each pair of end nodes
• Path distribution: configuring the forwarding tables
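The three phases can be walked through on a toy fabric. This is a sketch under assumptions: the topology, the LID assignment and the shortest-path policy are hypothetical, and real subnet managers offer several routing engines.

```python
from collections import deque

# Hypothetical topology: node -> {local port: neighbour}
LINKS = {
    "hostA": {1: "sw1"},
    "hostB": {1: "sw2"},
    "sw1": {1: "hostA", 2: "sw2"},
    "sw2": {1: "hostB", 2: "sw1"},
}

def discover(start):
    """Phase 1: sweep outward from the SM's node, port by port (BFS)."""
    seen, queue = [start], deque([start])
    while queue:
        node = queue.popleft()
        for port in sorted(LINKS[node]):
            peer = LINKS[node][port]
            if peer not in seen:
                seen.append(peer)
                queue.append(peer)
    return seen

def compute_paths(nodes, dest):
    """Phase 2: for every node, the output port on a shortest path to dest."""
    out_port, queue = {dest: None}, deque([dest])
    while queue:
        node = queue.popleft()
        for peer in LINKS[node].values():
            if peer in nodes and peer not in out_port:
                # peer reaches dest through its port that faces `node`
                out_port[peer] = next(p for p, n in LINKS[peer].items() if n == node)
                queue.append(peer)
    return out_port

def distribute(nodes, lids):
    """Phase 3: build one forwarding table (LID -> port) per switch."""
    tables = {n: {} for n in nodes if n.startswith("sw")}
    for dest, lid in lids.items():
        ports = compute_paths(nodes, dest)
        for sw in tables:
            tables[sw][lid] = ports[sw]
    return tables

nodes = discover("hostA")
tables = distribute(nodes, {"hostA": 1, "hostB": 2})
print(nodes)
print(tables)
```

Each switch ends up with one entry per destination LID, which is the state the path distribution phase writes into the fabric.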
27
Subnet Management and Services
28
High Speed Ethernet (HSE) Family
Internet Wide Area RDMA Protocol (iWARP)
• Idea of iWARP
• iWARP & InfiniBand
• iWARP Architecture and Components
• iWARP Features
• Software iWARP
Alternative – OpenOnLoad
Alternative – pure Ethernet/TCP/IP
29
Idea of iWARP
Stack (top to bottom): verbs, RDMAP, RDDP, MPA, TCP, IP, Enet MAC, IP network
30
iWARP & InfiniBand
31
iWARP Architecture and Components
Stack (top to bottom): verbs, RDMAP, RDDP, MPA, TCP, IP, Enet MAC, IP network
• RDMA Protocol (RDMAP)
  - Feature-rich interface
  - Security management
• Remote Direct Data Placement (RDDP)
  - Data placement and delivery
  - Connection management
• Marker PDU Aligned (MPA)
  - Middle box fragmentation
  - Data integrity (CRC)
32
iWARP Features
• Decoupled data placement and data delivery: if data is out‐of‐order, place it at the appropriate offset
• Complicated because of TCP windowing behavior
• Can allow for simple prioritization: 8 classes provided, two priority classes for high‐priority traffic
• Can allow for specific bandwidth requests, e.g., can request 3.62 Gbps bandwidth
• Link aggregation allows multiple links to logically look like a single faster link; this is done at a hardware level
• Primarily provides an InfiniBand RC transport-like behavior
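Decoupling placement from delivery, the first feature in the list, can be sketched as below. The function name and the tuple layout are hypothetical, and the sketch assumes segments do not overlap or repeat.

```python
def place_segments(total_len, segments):
    """Place each (offset, data) segment directly into the destination buffer.

    Placement happens on arrival, in any order. Delivery (reporting the
    message as complete) happens only once every byte has been placed.
    """
    buf = bytearray(total_len)
    placed = 0
    for offset, data in segments:
        buf[offset:offset + len(data)] = data
        placed += len(data)
    delivered = placed == total_len
    return bytes(buf), delivered

# Segments arriving out of order still land at the right offsets.
print(place_segments(10, [(5, b"WORLD"), (0, b"HELLO")]))
# A missing segment leaves the message placed but not delivered.
print(place_segments(10, [(5, b"WORLD")]))
```

This contrasts with the InfiniBand behaviour described earlier, where out-of-order packets in a connection are dropped rather than placed.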
33
Software iWARP
34
Alternative – Solarflare OpenOnLoad
Support standard Socket API
Acceleration of TCP/UDP applications with no need to modify applications or to run a new protocol
35
Alternative – pure Ethernet/TCP/IP
High speed Ethernet (HSE) Consortium (10GE/40GE/100GE)
• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step
• Goal: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet
• http://www.ethernetalliance.org
• 40 Gbps (servers) and 100 Gbps Ethernet (backbones, switches, routers): IEEE 802.3 WG
36
InfiniBand/Ethernet Convergence Technologies
Motivation & Hint
(InfiniBand) RDMA over Ethernet (RoE)
(InfiniBand) RDMA over Converged Ethernet (RoCE)
Some Test Results
37
Motivation & Hint - Virtual Protocol Interconnect (VPI)
• Single network firmware to support both IB and Ethernet
• Autosensing of layer-2 protocol
• Multi-port adapters can use one port on IB and another on Ethernet
• Datacenters with IB inside the cluster and Ethernet outside, or clusters with IB network and Ethernet management
38
Motivation & Hint
Figure: VPI protocol stack. L1: IB (S/D/Q), XAUI, XFI, SGMII. L2: IB, Ethernet. L3: IB L3, IPv4. L4: IB transport, TCP. ULP: SDP, RDS, IPoIB. Verbs. RDMA applications, socket applications.
39
(InfiniBand) RDMA over Ethernet (IBoE or RoE)
• Native convergence of IB network and transport layers with Ethernet link layer
• IB packets encapsulated in Ethernet frames
Advantages
• Works natively in Ethernet environments (entire Ethernet management ecosystem is available)
• Has all the benefits of IB verbs
Disadvantages
• Network bandwidth might be limited by Ethernet switches: 10GE switches available, 40GE yet to arrive, but 32 Gbps IB available now
• Some IB native link‐layer features are optional in (regular) Ethernet
40
(InfiniBand) RDMA over Converged Ethernet (RoCE)
• Native convergence of IB network and transport layers with Ethernet link layer
• IB packets encapsulated in Ethernet frames
Advantages
• CE is very similar to the link layer of native IB, so there are no missing features
Disadvantages
• Network bandwidth might be limited by Ethernet switches: 10GE switches available, 40GE yet to arrive, but 32 Gbps IB available now
41
(InfiniBand) RDMA over Converged Ethernet (RoCE)
Packet formats
InfiniBand: LRH (L2 Hdr) | GRH (L3 Hdr) | BTH+ (L4 Hdr) | IB Payload | ICRC | VCRC
RoCEE: MAC | ET | GRH | BTH+ | IB Payload | ICRC | FCS
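The two layouts differ only in the link-layer wrapper, which a short comparison makes visible. The byte sizes below are not from the slides; they are the commonly cited field sizes, and "BTH+" is counted as the 12-byte base header only, which is an assumption.

```python
# Field sizes in bytes; None marks the variable-length payload.
# Extended transport headers after the BTH would add to these totals.
IB_FRAME = [("LRH", 8), ("GRH", 40), ("BTH+", 12),
            ("payload", None), ("ICRC", 4), ("VCRC", 2)]
ROCE_FRAME = [("MAC", 12), ("ET", 2), ("GRH", 40), ("BTH+", 12),
              ("payload", None), ("ICRC", 4), ("FCS", 4)]

def frame_length(layout, payload_len):
    """Total length of a frame carrying payload_len bytes."""
    return sum(payload_len if size is None else size for _, size in layout)

def swapped_fields(ib, roce):
    """Fields present in only one layout (the link-layer wrapper)."""
    ib_names = [name for name, _ in ib]
    roce_names = [name for name, _ in roce]
    return ([n for n in ib_names if n not in roce_names],
            [n for n in roce_names if n not in ib_names])

print(frame_length(IB_FRAME, 1024), frame_length(ROCE_FRAME, 1024))
print(swapped_fields(IB_FRAME, ROCE_FRAME))
```

Everything from the GRH through the ICRC is carried unchanged, which is what "IB packets encapsulated in Ethernet frames" amounts to.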
42
Some Test Results
43
Some Test Results
44
Feature Comparison
45
Summary
46
Summary
Figure labels: socket apps, MPI, storage apps, file systems, native apps; SDP, Lustre, MPI, SRP, iSER; Verbs; InfiniBand, iWARP, RoCE
47
Resources
• InfiniBand:
  - Introduction to InfiniBand™ for End Users
  - InfiniBand Trade Association: http://www.infinibandta.org/
• iWarp:
  - Rdma Consortium: http://www.rdmaconsortium.org
• OpenFabrics: http://www.openfabrics.org
48
Future plan
• Design and programming in RDMA
• Lustre concept and tuning
49
Thanks & Questions