IO Virtualization with InfiniBand [InfiniBand as a Hypervisor Accelerator]
Michael Kagan, Vice President, Architecture
Mellanox, [email protected]
Key messages
• InfiniBand enables efficient server virtualization
  – Cross-domain isolation
  – Efficient IO sharing
  – Protection enforcement
• Existing HW fully supports virtualization
  – The most cost-effective path for single-node virtual servers
  – SW-transparent scale-out
• VMM support in the OpenIB SW stack by fall '05
  – Alpha version of FW and driver in June
InfiniBand scope in server virtualization
• CPU virtualization (compute power) – NO
• Memory virtualization – Partial
  – Memory allocation – No
  – Address translation – Yes, for IO accesses
  – Protection – Yes, for IO accesses
• IO virtualization – YES
[Diagram: virtualized server – Hypervisor hosting Domain0, DomainX and DomainY, each kernel with an IO driver; virtual switch(es) and a bridge in Domain0 connect the domains to the IO device, CPU and memory]
InfiniBand – Overview
[Diagram: InfiniBand fabric – switches interconnecting multiple end nodes]
• Performance
  – Bandwidth – up to 120 Gbit/sec per link
  – Latency – under 3 µs (today)
• Kernel bypass for IO access
  – Cross-process protection and isolation
• Quality of Service
  – End node
  – Fabric
• Scalability/flexibility
  – Up to 48K local nodes, up to 2^128 total
  – Multiple link width/trace options (Cu, fiber)
• Multiple transport services in HW
  – Reliable and unreliable
  – Connected and datagram
  – Automatic path migration in HW
• Memory exposure to remote node
  – RDMA-read and RDMA-write
• Multiple networks on a single wire
  – Network partitioning in HW ("VLAN")
  – Multiple independent virtual networks on a wire
Link data rate – today: 2.5, 10, 20, 30, 60 Gb/s; spec: up to 120 Gb/s; Cu & optical
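As an illustration of the kernel-bypass and RDMA bullets above, here is a minimal sketch of posting an RDMA-write work request with the libibverbs API of the OpenIB stack. It assumes a connected QP, a registered local buffer, and an out-of-band exchange of the remote virtual address and rkey; it is not code from the presentation.

```c
/* Minimal sketch: posting an RDMA-write work request with libibverbs.
 * Assumes qp, mr (local registered buffer), remote_addr and rkey
 * (obtained out of band, e.g. via a connection manager) already exist. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* local virtual address */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,               /* local memory key */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* write into remote memory */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* remote virtual address */
    wr.wr.rdma.rkey        = rkey;               /* remote memory key */

    return ibv_post_send(qp, &wr, &bad_wr);      /* 0 on success */
}
```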
InfiniBand communication
[Diagram: consumer channel interface on the host side; network (fabric) interface toward the wire]
Consumer Queue Model
• Asynchronous execution
• In-order execution on queue
• Flexible completion report
• Consumers connected via queues to the Host Channel Adapter (HCA)
  – Local or remote node
• 16M independent queues
  – 16M IO channels
  – 16M QoS levels (transport, priority)
• Memory access through virtual address
  – Remote and local
  – 2G address spaces, 64 bit each
  – Access rights and isolation enforced by HW
[Diagram: HCA bridging the PCI-Express host interface and the InfiniBand ports; consumers attach through the InfiniBand channel interface]
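To make the protection-domain and virtual-address bullets concrete, the sketch below (using libibverbs; device index 0 and the access flags are illustrative assumptions, error handling trimmed) allocates a protection domain and registers a buffer so the HCA can enforce boundaries and access rights in hardware.

```c
/* Minimal sketch: allocating a protection domain and registering memory
 * so the HCA can validate virtual addresses and access rights in hardware. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_context **ctx_out,
                               struct ibv_pd **pd_out, size_t len)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]); /* first HCA, an assumption */
    struct ibv_pd      *pd   = ibv_alloc_pd(ctx);        /* protection domain */
    void               *buf  = malloc(len);

    /* The returned MR carries the lkey/rkey the HCA checks on every access. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    ibv_free_device_list(devs);
    *ctx_out = ctx;
    *pd_out  = pd;
    return mr;
}
```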
InfiniBand Host Channel Adapter
[Diagram: HCA exposing up to 16M work queues; application WQ/CQ pairs in userland, driver command queue in the kernel]
• HCA configuration via Command Queue
  – Initialization
  – Run-time resource assignment and setup
• HCA resources (queues) allocated for applications
  – Resource protection through User Access Region (UAR)
• IO access through HCA QPs ("IO channels")
  – QP properties match IO requirements
  – Cross-QP resource isolation
• Memory protection via Protection Domains
  – Many-to-one association
    • Address space to Protection Domain
    • QP to Protection Domain
  – Memory access using Key and virtual address
    • Boundary and access-right validation
    • Protection Domain validation
    • Virtual-to-physical (HW) address translation
• Interrupt delivery – Event Queues
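A minimal libibverbs sketch of the QP ("IO channel") and completion-queue model described above, reusing a context and protection domain set up as in the earlier sketch. Queue depths are illustrative assumptions and error handling is omitted.

```c
/* Minimal sketch: creating a QP bound to a protection domain and reaping
 * completions from its CQ. */
#include <infiniband/verbs.h>
#include <string.h>

struct ibv_qp *create_channel(struct ibv_context *ctx, struct ibv_pd *pd,
                              struct ibv_cq **cq_out)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 256 /* depth */, NULL, NULL, 0);
    struct ibv_qp_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;            /* reliable connected transport */
    attr.cap.max_send_wr  = 64;
    attr.cap.max_recv_wr  = 64;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    *cq_out = cq;
    return ibv_create_qp(pd, &attr);      /* QP inherits the PD's protection */
}

int drain_completions(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) > 0)  /* poll one completion at a time */
        if (wc.status != IBV_WC_SUCCESS)
            return -1;
    return n;                              /* 0 when the CQ is empty */
}
```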
InfiniBand Host Channel Adapter (cont.)
[Diagram: the HCA now shared by Domain0 and guest domains; each domain has its own driver and work queues (up to 16M in total)]
• HCA initialization by VMM
  – Assign command queue per guest domain
  – HCA resources partitioned and exported to guest OSes
• HCA resources allocated to guests/their apps
  – Resource protection through UAR
• Each VM has direct IO access
  – "Hypervisor offload"
• Memory protection via Protection Domains
  – Address translation step generates the HW address
    • Guest Physical Address to HW address translation
    • Access-right validation
InfiniBand Host Channel Adapter (cont.)
[Diagram: HCA with up to 128 command queues (one per domain) and up to 16M work queues; Domain0 and guest domains DomainX, DomainY, DomainZ each run their own driver]
• Guest driver manages HCA resources at run-time
  – Each OS sees "its own HCA"
  – HCA HW keeps the guest OS honest
  – Connection manager – see later
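The firmware interface behind this partitioning is not shown in the slides; the sketch below is a purely hypothetical illustration of how a VMM could carve per-guest command-queue, UAR and QP ranges out of the HCA so the hardware can keep each guest within its own resources.

```c
/* Purely illustrative sketch (not the real Mellanox firmware interface):
 * a VMM partitioning HCA resources per guest domain. All names and fields
 * are hypothetical. */
#include <stdint.h>

#define MAX_GUEST_DOMAINS 128          /* matches "up to 128 command queues" */

struct hca_partition {
    uint32_t cmd_queue_id;             /* per-guest command queue */
    uint32_t uar_first, uar_count;     /* User Access Region pages */
    uint32_t qp_first,  qp_count;      /* QP number range owned by the guest */
};

static struct hca_partition partitions[MAX_GUEST_DOMAINS];

/* Carve a fixed slice of QPs and UAR pages out for a guest domain. */
int assign_partition(unsigned dom, uint32_t qps_per_dom, uint32_t uars_per_dom)
{
    if (dom >= MAX_GUEST_DOMAINS)
        return -1;
    partitions[dom].cmd_queue_id = dom;
    partitions[dom].uar_first    = dom * uars_per_dom;
    partitions[dom].uar_count    = uars_per_dom;
    partitions[dom].qp_first     = dom * qps_per_dom;
    partitions[dom].qp_count     = qps_per_dom;
    return 0;
}
```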
Address translation and protection
Non-virtual server
• HCA TPT set by driver
  – Boundaries, access rights
  – vir2phys table
• Run-time address translation
  – Access-right validation
  – Translation-table walk
Virtual server
• VMM sets guest HW address tables
  – Address space per guest domain
  – Managed and updated by VMM
• Guest driver sets HCA TPT
  – Guest PA in vir2phys table
• Run-time address translation
  1. Virtual to Guest Physical Address
  2. Guest Physical to HW address
[Diagram: non-virtual server – MKey and virtual address (1) look up an MKey entry, then translation tables (2) yield the HW physical address; virtual server – the MKey entry first yields the VM guest physical address, which a second table maps to the HW physical address]
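A conceptual sketch of the two run-time translation steps listed above: virtual to guest-physical via the MKey entry, then guest-physical to HW address via the VMM-maintained table. The table layouts and field names are hypothetical, not the actual HCA TPT format.

```c
/* Conceptual sketch of two-stage address translation with HW-enforced
 * boundary and access-right checks. Page-table entries are assumed to be
 * page-aligned frame addresses. */
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

struct mkey_entry {                     /* set by the guest driver */
    uint64_t base_va, length;           /* registered region boundaries */
    unsigned writable;                  /* access rights */
    uint64_t *va_to_gpa;                /* per-page virtual -> guest-physical */
};

/* Step 1: MKey check + virtual -> guest-physical (what a bare-metal HCA
 * does to reach the HW address directly). Step 2: guest-physical -> HW
 * physical, using the table the VMM maintains for this guest domain. */
int translate(const struct mkey_entry *mk, const uint64_t *gpa_to_hpa,
              uint64_t va, int is_write, uint64_t *hw_addr)
{
    if (va < mk->base_va || va >= mk->base_va + mk->length)
        return -1;                      /* boundary validation */
    if (is_write && !mk->writable)
        return -1;                      /* access-right validation */

    uint64_t off = va - mk->base_va;
    uint64_t gpa = mk->va_to_gpa[off >> PAGE_SHIFT] | (off & PAGE_MASK);
    *hw_addr = gpa_to_hpa[gpa >> PAGE_SHIFT] | (gpa & PAGE_MASK);
    return 0;
}
```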
IO virtualization with InfiniBand – single node, local IO
• Full offload for local cross-domain access
  – Eliminate Hypervisor kernel transition on the data path
    • Reduce cross-domain access latency
    • Reduce CPU utilization
    • Kernel bypass on IO access to guest application
• Shared [local] IO
  – Shared by guest domains
[Diagram: left – conventional virtualized server with Hypervisor, per-domain IO drivers, virtual switch(es) and a bridge in Domain0 in front of the IO device; right – the same server with an HCA, where HW switch(es) in the HCA provide the IO Hypervisor offload]
IO virtualization with InfiniBand – multiple nodes (cluster), network-resident IO
• SW-transparent remote IO access
• No Hypervisor kernel transition
• Kernel bypass for guest apps
• Shared [remote] IO
  – Shared by domains
[Diagram: left – conventional virtualized server with Hypervisor, virtual switch(es) and a bridge in front of local IO; right – multiple virtualized servers (DomainX1/DomainY1 on one node, DomainX2/DomainY2 on another), each with an HCA and HW switch(es), joined by an InfiniBand switch to a network-resident IO bridge, giving IO Hypervisor offload and SW-transparent scale-out IO sharing]
Network – IP over IB
• IP over Ethernet
  – SW channel for each domain
    • Virtual NIC in domain
    • Switch in SW
      – Copy, VLANs
  – Hypervisor call
    • Kernel transition
  – NIC driver in domain0
    • External L2 bridge
• IP over IB
  – HW channel for each domain
    • Virtual NIC in domain
    • Switch in HW
      – VLANs, data move
  – Direct HW access from guest domain
    • No Hypervisor transition
  – IPoIB in domain0
    • Bypass L2 bridge
[Diagram: left – virtualized server with Hypervisor, per-domain N/W drivers, SW virtual switch(es), and a bridge with the NIC driver in Domain0 in front of the NIC; right – virtualized server where each domain runs IPoIB directly over the HCA's HW switch(es), with the bridge and NIC needed only to reach external Ethernet]
Network – sockets
• Sockets over Ethernet
  – TCP/IP stack in guest domain
  – SW L2 channel for guest domain
    • Virtual NIC in domain
    • Switch in SW
      – Copy, VLANs
    • Hypervisor call
      – Kernel transition
  – NIC driver in domain0
• Sockets over InfiniBand (SDP)
  – HW L4 channel for guest domain
    • Socket QP(s) per domain
    • Transport and switch in HW
      – Copy, VLANs
  – Direct HW access from guest domain
    • No Hypervisor transition
  – Full bypass of domain0
[Diagram: left – virtualized server with Hypervisor, TCP/IP and N/W drivers in each guest kernel, SW virtual switch(es), and the bridge with the NIC driver in Domain0 in front of the NIC; right – virtualized server where each domain runs SDP directly over the HCA's HW switch(es), with the bridge/NIC only for external traffic]
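A minimal sketch of the guest-side view of SDP: a stream socket opened with an SDP address family instead of TCP, so the connection rides an InfiniBand reliable-connected channel. The AF_INET_SDP value shown is an assumption that depends on the installed SDP stack; unmodified applications can get the same effect by preloading libsdp.

```c
/* Minimal sketch: opening a socket over SDP instead of TCP. The address
 * itself stays ordinary IPv4; only the socket family changes. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27                  /* assumption: OFED SDP family value */
#endif

int sdp_connect(const char *ip, uint16_t port)
{
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);  /* SDP instead of TCP */
    struct sockaddr_in sa;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;            /* destination is a normal IPv4 address */
    sa.sin_port   = htons(port);
    inet_pton(AF_INET, ip, &sa.sin_addr);

    if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        return -1;
    return fd;
}
```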
Storage
• Virtualized disk access (Hypervisor-based)
  – vSCSI driver in guest domain
  – SCSI "switch" in Hypervisor
    • Switch in SW
      – Copy, isolation
    • Hypervisor call
      – Kernel transition
  – Disk driver in domain0
    • HBA for SAN
• Virtualized disk access (with InfiniBand)
  – SRP initiator per guest domain
  – SCSI "switch" in HCA
    • Transport and switch in HW
      – Copy, isolation
  – Direct HW access from guest domain
    • No Hypervisor transition
  – Disk driver in domain0
    • Bypass domain0 for SAN
[Diagram: left – virtualized server with vSCSI drivers in the guest domains, a virtual switch in the Hypervisor, and the SRP target reached through an adapter from Domain0; right – virtualized server where each guest domain runs an SRP initiator directly over the HCA's HW switch(es) to the SRP target]
MPI applications [MPI as an example of user-mode access to the network]
• MPI over TCP/IP?
  – Datapath performance hit
    • Two kernel transitions on the performance path
  – Forget about low latency
• MPI driver in the guest app
  – No datapath performance hit
    • Direct access to HCA HW
    • Full guest kernel bypass
    • Full Hypervisor bypass
  – Event delivery directly to guest OS
    • Retains control-path performance
  – Memory registration needs attention
    • Registration cache
[Diagram: left – virtualized server with Hypervisor, MPI over TCP/IP and N/W drivers in the guest kernels, virtual switch(es), and the bridge with the NIC driver in Domain0 in front of the NIC; right – virtualized server where MPI in each guest domain talks directly to the HCA's HW switch(es)]
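A minimal MPI ping-pong sketch of the "MPI driver in the guest app" path: with an InfiniBand-aware MPI library (e.g. MVAPICH or Open MPI), these calls reach the HCA directly from user space, bypassing both the guest kernel and the hypervisor on the data path. The pairing scheme is illustrative and assumes an even number of ranks.

```c
/* Minimal MPI ping-pong: each even rank exchanges a token with its odd peer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, peer, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = rank ^ 1;                    /* pair ranks 0<->1, 2<->3, ... */

    if (rank % 2 == 0) {
        MPI_Send(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    }
    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}
```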
Plans
• Stage 1, 2 driver update – June '05
  – Hypervisor bypass for data-path operation
• Stage 3 FW and driver update – Aug '05
  – Full HCA export to guest domain
[Diagram: HCA shared by Domain0 and guest domains (up to 128 command queues, up to 16M work queues); stages 1/2 and stage 3 differ in whether guest-domain drivers get their own command queues. Two virtualized-server views show the HCA HW switch(es) in front of the IO bridge.]
Summary
• InfiniBand HCA is a Hypervisor offload engine
• InfiniBand enables efficient server virtualization
  – Cross-domain isolation
  – Efficient IO sharing
  – Protection enforcement
• Existing HW fully supports virtualization
  – The most cost-effective path for single-node virtual servers
  – SW-transparent scale-out
• VMM support in the OpenIB SW stack by fall '05
  – Alpha version of FW and driver in June