2012 Storage Developer Conference. © 2012 Mellanox Technologies. All Rights Reserved.
InfiniBand Technology and Usage Update
Erin Filliater Mellanox Technologies
Abstract
FDR InfiniBand solutions were introduced in mid-2011, and the InfiniBand roadmap and EDR specification were updated to provide a data rate of 100Gb/s per 4x EDR port (26Gb/s per lane). FDR InfiniBand introduced a new 64/66-bit link encoding and a new reliability mechanism called Forward Error Correction. The newly defined link speeds, reliability mechanisms and transport features are designed to keep the rate of performance increase in line with system-level performance increases. This session provides a detailed review of the new InfiniBand speeds, features and roadmap.
Learning Objectives
Detailed understanding of the new InfiniBand capabilities
View into the InfiniBand roadmap through 2016
Usage of RDMA for storage acceleration
RDMA storage examples
Agenda
InfiniBand Technology Review
New Features for FDR
InfiniBand Roadmap
InfiniBand and RDMA for Storage
InfiniBand Technology
What is InfiniBand?
Industry standard defined by the InfiniBand Trade Association (IBTA)
  Originated in 1999
Input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems
Pervasive, low-latency, high-bandwidth interconnect with low processing overhead, ideal for carrying multiple traffic types (clustering, communications, storage, management) over a single connection
As a mature and field-proven technology, InfiniBand is used in thousands of data centers, high-performance compute clusters and embedded applications, ranging from small to very large scale
The InfiniBand Architecture
Defines System Area Network architecture
Architecture supports
  Host Channel Adapters (HCAs)
  Target Channel Adapters (TCAs)
  Switches
  Routers
Facilitates HW design for
  Low latency / high bandwidth
  Transport offload
[Figure: an InfiniBand subnet – processor nodes attach through HCAs, a storage subsystem, RAID and consoles attach through TCAs, switches form the fabric under the control of a Subnet Manager, and gateways bridge to Ethernet and Fibre Channel.]
InfiniBand Feature Highlights
Serial high-bandwidth, ultra-low-latency links
Reliable, lossless, self-managing fabric
Full CPU offload
Quality Of Service
Cluster scalability, flexibility and simplified management
Delivering a Unified Data Center Fabric
InfiniBand Network Stack
[Figure: the InfiniBand network stack – applications on InfiniBand nodes sit above the transport, network, link and physical layers, with the user code / kernel code / hardware split indicated; an InfiniBand switch relays packets at the link layer, a router relays packets at the network layer, and a legacy node is reached through the same layering.]
Physical Layer
Data transfer over serial bit streams
Auto-negotiation of link speed and width
Power management
Bit encoding
Control symbols
Link speed (10^9 bit/sec) by link width and lane speed:

Link Width | SDR (2.5GHz) | DDR (5GHz) | QDR (10GHz) | FDR (14GHz) | EDR (25GHz)
1X         |     2.5      |      5     |     10      |     14      |     25
4X         |     10       |     20     |     40      |     56      |    100
8X         |     20       |     40     |     80      |    112      |    200
12X        |     30       |     60     |    120      |    168      |    300
Link Layer
Addressing and Switching
  Local Identifier (LID) addressing
    Unicast LID – 48K addresses
    Multicast LID – up to 16K addresses
  Efficient linear lookup
  Cut-through switching (ultra-low latency)
  Multi-pathing support through LMC
Data Integrity
  Invariant CRC (ICRC)
  Variant CRC (VCRC)
Link Layer – Flow Control
Credit-based link-level flow control
  No packet loss within the fabric, even in the presence of congestion
  Link receivers grant packet receive buffer space credits per Virtual Lane
Separate flow control per Virtual Lane
  Alleviates head-of-line blocking
  Virtual Fabrics – congestion and latency on one VL do not impact traffic with guaranteed QoS on another VL, even though they share the same physical link
(A simplified sketch of the credit mechanism follows below.)
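To make the credit mechanism concrete, here is a minimal toy model in C of per-VL credit accounting; the structure names, credit granularity and NUM_VLS value are illustrative assumptions, not part of the InfiniBand specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_VLS 8   /* illustrative; the actual VL count is negotiated per link */

struct vl_credits { uint32_t credits[NUM_VLS]; };

/* The link receiver grants credits as it frees receive-buffer space on a VL. */
static void grant_credits(struct vl_credits *tx, unsigned vl, uint32_t blocks)
{
    tx->credits[vl] += blocks;
}

/* The transmitter only sends when it holds credits for that VL, so packets are
 * never dropped; exhausting credits on one VL does not block another VL. */
static bool try_send(struct vl_credits *tx, unsigned vl, uint32_t blocks)
{
    if (tx->credits[vl] < blocks)
        return false;               /* wait for more credits from the receiver */
    tx->credits[vl] -= blocks;
    return true;
}
```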
Virtual Lanes and Scheduling
Dynamically configure and adjust VLs and scheduling to match application performance needs
[Figure: one physical InfiniBand fabric carrying multiple logical VLs – a low-latency VL for clustering, a mainstream storage VL (≥ 40% BW by day, ≥ 20% at night) and a backup VL (≥ 20% BW by day, ≥ 60% at night).]
Network Layer
Global Identifier (GID) addressing
  Based on the IPv6 addressing scheme
  GID = {64-bit GID prefix, 64-bit GUID}
  GUID = Globally Unique Identifier (64-bit EUI-64)
    GUID 0 – assigned by the manufacturer
    GUID 1..(N-1) – assigned by the subnet manager
Used for multicast distribution within end nodes (a query sketch follows below)
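As a small illustration of the GID structure, the following sketch reads the port GID with libibverbs and splits it into the subnet prefix and GUID; the choice of device 0, port 1 and GID index 0 is an assumption for the example.

```c
#include <stdio.h>
#include <endian.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0])
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;

    union ibv_gid gid;
    /* GID index 0 on port 1: the port GID = {64-bit subnet prefix, 64-bit port GUID} */
    if (ibv_query_gid(ctx, 1, 0, &gid) == 0)
        printf("GID prefix: 0x%016llx  GUID: 0x%016llx\n",
               (unsigned long long)be64toh(gid.global.subnet_prefix),
               (unsigned long long)be64toh(gid.global.interface_id));

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```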
[Figure: an InfiniBand router connecting Subnet A and Subnet B.]
Transport Layer
Queue Pair (QP) – transport endpoint
Asynchronous interface
  Send Queue, Receive Queue, Completion Queue
Full transport offload
  Segmentation, reassembly, timers, retransmission, etc.
Kernel bypass
  Enables low latency and CPU offload
  Exposure of application buffers to the network
Polling and interrupt models supported (a setup sketch follows below)
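A minimal sketch of creating these transport endpoints with the libibverbs API; the queue depths and the choice of a single shared CQ are illustrative assumptions, and connection establishment is omitted.

```c
#include <stddef.h>
#include <infiniband/verbs.h>

/* Create a Reliable Connection QP with its send/receive queues and a CQ. */
struct ibv_qp *create_rc_qp(struct ibv_context *ctx)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);  /* completion queue  */
    if (!pd || !cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,              /* send and receive completions share one CQ here */
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,      /* Reliable Connection transport */
    };
    return ibv_create_qp(pd, &attr); /* QP = Send Queue + Receive Queue */
}
```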
Transport Layer – Queue Pairs
QPs are in pairs (Send/Receive)
The Work Queue is the consumer/producer interface to the fabric
The consumer/producer initiates a Work Queue Element (WQE)
The channel adapter executes the work request
The channel adapter notifies on completion or errors by writing a Completion Queue Element (CQE) to a Completion Queue (CQ), as in the sketch below
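A hedged sketch of that WQE/CQE cycle with libibverbs: the consumer posts a send WQE and then polls the CQ for the resulting CQE. The buffer, memory region and connected QP are assumed to exist already; busy-polling is used only for brevity.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,        /* SEND: consumed by a posted receive on the peer */
        .send_flags = IBV_SEND_SIGNALED,  /* ask the HCA to write a CQE on completion */
    };
    struct ibv_send_wr *bad;
    if (ibv_post_send(qp, &wr, &bad))     /* post the WQE to the Send Queue */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)  /* poll until the CQE arrives */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```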
Transport – HCA Model
Asynchronous interface
  Consumer posts work requests
  HCA processes them
  Consumer polls completions
Transport executed by HCA
I/O channel exposed to the application
[Figure: HCA model – the consumer posts WQEs to QP send/receive queues and polls CQEs from the Completion Queue; the HCA's transport and RDMA offload engine services the QPs and schedules traffic onto the VLs of each port.]
Transport Layer – Transfer Operation Types
SEND
  Reads the message from the requester HCA's local system memory
  Transfers data to the responder HCA's Receive Queue logic
  Does not specify where the data will be written in remote memory
  Immediate Data option available
RDMA Read
  Responder HCA reads its local memory and returns it to the requesting HCA
  Requires remote memory access rights, memory start address and message length
RDMA Write
  Requester HCA sends data to be written into the responder HCA's system memory
  Requires remote memory access rights, memory start address and message length (see the sketch below)
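A minimal sketch of an RDMA Write work request using libibverbs, showing where the remote access rights, start address and length appear; remote_addr and rkey are assumed to have been obtained from the responder beforehand (e.g. in an earlier Send).

```c
#include <stdint.h>
#include <infiniband/verbs.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,  /* requester writes directly into responder memory */
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma    = { .remote_addr = remote_addr,  /* responder's buffer address    */
                        .rkey        = rkey },       /* responder's remote access key */
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}
```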
Typical Buffer Copy Flow
[Figure: in a traditional buffer-copy flow, data moves from the application buffer through a chain of protocol buffers on the data source, crosses the wire in Send data messages, and is copied back up through protocol buffers into the application buffer on the data sink.]
Typical Read Zero Copy Flow
[Figure: read zero-copy flow – the data source advertises its application buffer in a Send message, the data sink issues an RDMA Read, the Read Response moves the data directly between application buffers, and a completion message (Send) ends the exchange.]
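The "advertise" step amounts to registering the application buffer and handing its address, length and rkey to the peer. Below is a hedged C sketch of that step; the buf_advert wire format and the helper name are assumptions for illustration, and sending the advertisement (e.g. in a Send message) is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

struct buf_advert {        /* hypothetical advertisement message payload */
    uint64_t addr;
    uint32_t length;
    uint32_t rkey;
};

struct ibv_mr *advertise_buffer(struct ibv_pd *pd, void *buf, size_t len,
                                struct buf_advert *out)
{
    /* Expose the application buffer for remote reads; no intermediate copies. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    if (!mr)
        return NULL;

    out->addr   = (uintptr_t)buf;
    out->length = (uint32_t)len;
    out->rkey   = mr->rkey;    /* the peer uses this rkey in its RDMA Read */
    return mr;
}
```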
Typical Write Zero Copy Flow
[Figure: the data sink advertises its application buffer in a Send message, the data source writes directly into it with RDMA Write, and a completion message (Send) ends the exchange.]
Management Model
Subnet Manager (SM)
  Configures/administers the fabric topology
  Implemented at an end-node or switch
  Active/passive model when more than one SM is present
  Talks with SM agents in nodes/switches
Subnet Administration
  Provides path records
  QoS management
Communication Management
  Connection establishment processing
[Figure: management model – the Subnet Manager and Subnet Administration talk to each node's Subnet Management Agent over the Subnet Management Interface (QP0, VL15); the General Service Interface (QP1) carries the baseboard, communication, performance, device, vendor-specific, application-specific and SNMP tunneling management agents.]
Partitions
Logically divide fabric into isolated domains
Partial and full membership per partition
Partition filtering at switches
Similar to FC zoning and 802.1Q VLANs (see the sketch below)
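Partition membership is selected per QP through an index into the port's P_Key table. The following hedged sketch shows the relevant libibverbs call; the port number, index value and remaining INIT-state attributes are illustrative assumptions.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Move a QP to INIT and bind it to the partition at 'pkey_index'
 * in the port's P_Key table (which the subnet manager populates). */
int set_partition(struct ibv_qp *qp, uint16_t pkey_index)
{
    struct ibv_qp_attr attr = {
        .qp_state        = IBV_QPS_INIT,
        .pkey_index      = pkey_index,
        .port_num        = 1,
        .qp_access_flags = 0,
    };
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                         IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
}
```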
[Figure: example fabric with Host A, Host B and I/O units A–D – Partition 1 is inter-host, Partition 2 is private to Host B, Partition 3 is private to Host A, and Partition 4 is shared.]
High Availability and Redundancy
Multi-port HCAs
Redundant fabric topologies
Link layer multi-pathing (LMC)
Automatic Path Migration (APM)
ULP high availability
  Application-level multi-pathing (SRP/iSER)
  Teaming/bonding (IPoIB)
Upper Layer Protocols
ULPs connect InfiniBand to common interfaces
Clustering
  MPI (Message Passing Interface)
  RDS (Reliable Datagram Socket)
Network
  IPoIB (IP over InfiniBand)
  WSD (Winsock Direct)
  SDP (Sockets Direct Protocol)
  Future: EthoIB
Storage
  SRP (SCSI RDMA Protocol)
  iSER (iSCSI Extensions for RDMA)
  NFSoRDMA (NFS over RDMA)
  Future: FCoIB
[Figure: InfiniBand software stack – the device driver and InfiniBand core services sit above the hardware in the kernel; IPoIB, SDP, RDS and WSD serve the sockets interface, SRP, iSER and NFS over RDMA serve the block and file storage interfaces, MPI serves HPC clustering, and native IB applications use kernel bypass.]
InfiniBand Block Storage
SRP – SCSI RDMA Protocol
  Defined by T10
iSER – iSCSI Extensions for RDMA
  Defined by the IETF IP Storage WG
  InfiniBand spec defined by the IBTA
  Leverages the iSCSI management infrastructure
Protocol offload
  InfiniBand Reliable Connection
  RDMA for zero-copy data transfer
[Figure: SAM-3 layering comparison – the SCSI application layer runs over FCP-3/FC-4 mapping and Fibre Channel (FC-0 through FC-3), over SRP on InfiniBand, and over iSCSI with iSER on InfiniBand/iWARP.]
SRP: Data Transfer Operations
Send/Receive
  Commands, responses
  Task management
RDMA – zero-copy path
  Data-In, Data-Out
  Target issues the RDMA operations
iSER uses the same principles
  Immediate/unsolicited data allowed through Send/Receive
Included in the mainline Linux kernel
[Figure: initiator/target ladder diagrams for the I/O Read and I/O Write sequences.]
Discovery Mechanism
SRP
  Persistent information: {Node_GUID:IOC_GUID}
  Subnet Administrator
Identifiers
  Per-LUN WWN (through INQUIRY VPD)
  SRP Target Port ID – {IdentifierExt[63:0], IOC GUID[63:0]}
  Service Name – SRP.T10.{PortID ASCII}
  Service ID – locally assigned by the IOC/IOU
[Figure: InfiniBand I/O model – an I/O Unit containing multiple I/O Controllers.]
Discovery Mechanism
iSER – uses iSCSI discovery (RFC 3721)
  Static configuration: {IP, port, target name}
  SendTargets: {IP, port}
  SLP
  iSNS
Target naming (RFC 3721/3980)
  iSCSI Qualified Names (iqn.), IEEE EUI-64 (eui.), T11 Network Address Authority (naa.)
NFS Over RDMA
Defined by the IETF
  ONC-RPC extensions for RDMA
  NFS mapping
RPC Call/Reply
  Send/Receive, or via an RDMA Read chunk list
Data transfer
  RDMA Read/Write – described by a chunk list in the XDR message
  Send – inline in the XDR message
Uses an InfiniBand Reliable Connection QP
  IP extensions to the CM – connection based on {IP, port}
  Zero-copy data transfers
Part of the mainline Linux kernel
[Figure: client/server ladder diagrams for the NFS READ and NFS WRITE sequences.]
Storage Gateways
Benefits
  InfiniBand-island-to-SAN connectivity
  I/O scales independently of compute
  Design based on average server load
Current gateways
  SRP→FC, iSER→FC
  Stateful architecture
Future gateways
  FCoIB→FC – FCoE sibling
  Stateless architecture
[Figure: servers on an InfiniBand fabric reach scalable Fibre Channel storage through a stateless packet-relay gateway; an FCoIB frame carries {IB header, FCoIB header, FC header, FC payload, FC CRC, FCoIB trailer, IB CRCs} and is relayed onto the SAN as a native {FC header, FC payload, FC CRC} frame.]
InfiniBand Fourteen Data Rate (FDR)
FDR InfiniBand
Launched mid-2011
Next-generation high-speed interconnect
  14Gb/s per lane, 56Gb/s per port
PCIe 3.0 support
FDR InfiniBand: New Features
New bit encoding scheme: 64/66
Forward Error Correction (FEC)
  Fixes bit errors throughout the network
  Reduces the overhead of data retransmission between nodes
InfiniBand routing
(A sketch of the encoding efficiency gain follows below.)
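To show why the 64/66 encoding matters, this small C sketch compares the effective per-lane data rate of QDR (10Gb/s signaling with 8b/10b encoding) and FDR (14.0625Gb/s signaling with 64/66 encoding); the numbers are the standard nominal rates, and the program is purely illustrative.

```c
#include <stdio.h>

int main(void)
{
    double qdr = 10.0    * 8.0  / 10.0;   /* 8b/10b: 8.00 Gb/s of data per lane   */
    double fdr = 14.0625 * 64.0 / 66.0;   /* 64/66:  ~13.64 Gb/s of data per lane */
    printf("QDR effective: %.2f Gb/s per lane\n", qdr);
    printf("FDR effective: %.2f Gb/s per lane\n", fdr);
    return 0;
}
```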
FDR InfiniBand: Performance
120% Higher Application ROI
Double the Bandwidth of QDR Half the Latency of QDR
Remote Storage Access with Local Storage Performance
InfiniBand and Storage
[Figure: I/O micro-benchmark setup – an SMB client connected to an SMB server over two FDR InfiniBand links, with Fusion-io PCIe flash behind the server.]
InfiniBand Roadmap
[Figure: IBTA InfiniBand speed roadmap. Source: InfiniBand Trade Association]
Leading Interconnect, Leading Performance
[Figure: InfiniBand latency and bandwidth trends, 2001–2017 – latency falls from 5usec through 2.5, 1.3, 0.7 and 0.5usec toward <0.5usec, while 4x link bandwidth grows from 10Gb/s through 20, 40, 56 and 100Gb/s toward 160/200Gb/s, all with the same software interface.]
InfiniBand and RDMA Storage
Efficient Storage Access
Full I/O offload
  Zero copy
  Interrupt avoidance (moderated per-I/O interrupts)
  Offloaded segmentation and reassembly
  Transport reliability
  Lossless fabric – credit-based flow control
Fabric consolidation
  Partitioning
  VL arbitration and QoS
  Host virtualization compatible
  High throughput
  Performance counters
InfiniBand Storage Benefits
High-bandwidth fabric
Fabric consolidation
Data center efficiency
Gateways
  One wire out of the server
  FC port sharing
  Independent growth for I/O, storage and compute
  Network cache
[Figure: InfiniBand storage deployment options – servers on an InfiniBand fabric reach an InfiniBand back end of native IB JBODs, direct-attach native IB block storage, a native IB file server (NFS RDMA), native IB block storage (SRP/iSER), or Fibre Channel storage through a gateway.]
Clustered/Parallel Storage Benefits
Integrated with clustering infrastructure
Efficient object/block transfer
Atomic operations
Ultra-low latency
High bandwidth
Back-end storage fabric
[Figure: servers connected over InfiniBand to a parallel/clustered file system or parallel NFS server, backed by OSD/block storage targets.]
Microsoft Windows Server 2012 and SMB Direct
New class of low-latency enterprise file storage
Minimal CPU utilization for file storage processing
Leverages RDMA technologies
Easy to provision, manage and migrate
No application change or admin configuration
RDMA-capable network interface and hardware required (InfiniBand and RoCE)
SMB Multichannel for load balancing and failover
[Figure: SMB Direct data path – the application on the file client goes through the SMB client, RDMA adapter and RDMA-capable network to the file server's RDMA adapter, SMB server, NTFS/SCSI stack and disk.]
10X Performance Improvement versus 10GbE Preliminary results based on Windows Server 2012 beta
Measuring SMB Direct Performance
Maximizing File Server Performance
Configuration | BW (MB/sec) | IOPS (512KB IOs/sec) | CPU Overhead (Privileged)
Local         | 10,090      | 38,492               | 2.5%
Remote        |  9,852      | 37,584               | 5.1%
VM            | 10,367      | 39,548               | 4.6%
Preliminary results from SuperMicro servers, each with 2 Intel E5-2680 CPUs at 2.70GHz. Both client and server use two Mellanox ConnectX-3 network interfaces on PCIe Gen 3 x8 slots. Data goes to 4 LSI 9285-8e RAID controllers and 4 JBODs, each with 8 OCZ Talos 2 R SSDs.
Workload: 512KB IOs, 2 threads, 16 outstanding IOs per thread
10GB/sec Bandwidth with 5% CPU Overhead
Thank You