Intel® MIC x100 Coprocessor Driver - on the Frontiers of Linux & HPC
Nikhil Rao ([email protected])
LinuxCon 2013
Intel® Xeon Phi* (MIC) x100 Coprocessors
Highly-parallel Processing for Unparalleled Discovery
Groundbreaking differences
• Up to 61 IA cores / 1.1 GHz / 244 threads
• Up to 16 GB memory with up to 352 GB/s bandwidth
• 512-bit SIMD instructions
• Linux* operating system, IP addressable
• Standard programming languages and tools
Leading to groundbreaking results
• Up to 1 TeraFlop/s double-precision peak performance¹
• Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server²
• Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server³
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. For notes 1, 2 & 3, see backup for system configuration details.
Programming Models
[Figure: three programming models. Offload: main() runs on the CPU and offloads foo() to the coprocessor. Native: main() runs entirely on the coprocessor. Symmetric: main() runs on both the CPU and the coprocessor.]
Compiler Assisted Offload
Host Only Code
float ret = 0;
#pragma omp parallel for reduction (+:ret)
for (int i = 0; i < size; i++)
{
    ret += data[i];
}
ret = data[0] + data[1] + .. + data[size-1]
Compiler Assisted Offload
Loop Offloaded to Coprocessor
float ret = 0;
#pragma offload target(mic) in(size) in(data:length(size))
{
    #pragma omp parallel for reduction (+:ret)
    for (int i = 0; i < size; i++)
    {
        ret += data[i];
    }
}
ret = data[0] + data[1] + .. + data[size-1]
Intel® Manycore Platform Software Stack (MPSS)
Host-side tools
• Coprocessor FS, network configuration
• Status monitoring (e.g., temperature, power, RAS)
• Coprocessor OS state management (micctrl, mpssd)
• VirtIO devices (mpssd)
[Figure: programming models and tools sit above the driver on the host platform, connected over PCIe* to the coprocessor Linux* OS running offload apps; MPI* and TCP/IP span host and coprocessor]
Intel® MPSS Coprocessor Environment
• Linux* OS, K1OM ABI
• Busybox filesystem
[Figure: same stack diagram; the coprocessor Linux* OS and offload apps connect to the host-side tools and driver over PCIe*, with MPI and TCP/IP spanning both sides]
Intel® Xeon Phi™ Coprocessor Driver
• Coprocessor OS management
• Virtual (VirtIO based) device support
• PCIe* messaging & RDMA APIs (SCIF)
[Figure: processes P0 and P1 communicating across PCIe* via SCIF]
Coprocessor OS Boot
[Figure: boot handshake between user space (micctrl), the host driver's sysfs interface, and the coprocessor]
• Coprocessor firmware signals "FW ready" to the host driver
• micctrl -b writes the bzImage and RAMdisk file names to the driver via sysfs
• The host driver transfers the bzImage and the ramdisk to the coprocessor and sends a boot interrupt
• The coprocessor boots its Linux* OS
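A minimal user-space sketch of that sequence, assuming sysfs attribute names similar to the ones the in-kernel mic host driver exposes (firmware, ramdisk, state under /sys/class/mic/mic0); the exact names and accepted values differ between MPSS releases, so treat the paths and strings below as illustrative only:

/* Illustrative sketch of what a tool like micctrl -b conceptually does:
 * hand the kernel image and ramdisk names to the host driver via sysfs,
 * then request the state transition that boots the card. */
#include <stdio.h>
#include <stdlib.h>

static void write_sysfs(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    fprintf(f, "%s\n", val);
    fclose(f);
}

int main(void)
{
    /* Tell the host driver which images to load (illustrative paths) ... */
    write_sysfs("/sys/class/mic/mic0/firmware", "mic/uos.img");
    write_sysfs("/sys/class/mic/mic0/ramdisk", "mic/initramfs.gz");
    /* ... then ask the driver to boot the coprocessor. */
    write_sysfs("/sys/class/mic/mic0/state", "boot");
    return 0;
}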
Virtio Drivers
• Virtio - framework that enables use of common guest drivers across hypervisors
[Figure: the same virtio_net.ko guest driver and its virtqueues reused across backends: QEMU/KVM, lguest, and the coprocessor, where the host-side mpssd daemon plays the backend role]
• Key benefits
  • Reuse of well designed, maintained code
  • Standard, enables a simple backend
  • New devices possible in the future
Virtio Data Path
[Figure: virtio driver in the guest/coprocessor OS and device emulation in the hypervisor/host OS, sharing a virtqueue with avail and used rings]
• The virtio driver places a buffer on the virtqueue's avail ring and notifies the device emulation
• Device emulation processes the buffer (e.g., hands it to the HW), moves it to the used ring, and interrupts the guest/coprocessor OS
• The virtio driver reclaims the completed buffer from the used ring
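A minimal guest-side sketch of that flow, assuming the standard Linux virtqueue API (virtqueue_add_outbuf, virtqueue_kick, virtqueue_get_buf); this is not taken from the MPSS sources:

#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

/* Post one buffer to the backend over a virtqueue. */
static int post_one_buffer(struct virtqueue *vq, void *buf, unsigned int len)
{
    struct scatterlist sg;
    int err;

    sg_init_one(&sg, buf, len);

    /* Place the buffer on the avail ring ... */
    err = virtqueue_add_outbuf(vq, &sg, 1, buf, GFP_ATOMIC);
    if (err)
        return err;

    /* ... and kick (notify) the device emulation on the other side. */
    virtqueue_kick(vq);
    return 0;
}

/* Virtqueue callback: runs after the backend interrupts us; reclaims
 * completed buffers from the used ring. */
static void buffers_done(struct virtqueue *vq)
{
    unsigned int len;
    void *buf;

    while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
        ; /* 'buf' has been consumed by the backend; free or reuse it */
}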
Virtio Data Path Setup
[Figure: mpss daemon (device emulation) and the virtio-mic host driver on the host OS, the virtio bus in the coprocessor OS, and a shared device page]
• Device create IOCTL (from the mpss daemon to the host driver) adds a device page entry
  – vring addresses, interrupt information
  – Status notification (e.g., driver unloaded)
• The coprocessor OS discovers the new entry on its virtio bus and instantiates the virtio device
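For illustration only, a hypothetical C layout of such a device page entry; the real structures live in the driver's shared headers and differ in names and fields, but they carry the same kind of information the slide lists (vring addresses, interrupt data, status):

/* Hypothetical device page entry layout, for illustration only. */
#include <stdint.h>

struct vring_config {
    uint64_t desc_addr;   /* address of the descriptor ring */
    uint64_t used_addr;   /* address of the used ring */
    uint16_t num;         /* number of descriptors */
};

struct device_page_entry {
    uint8_t  type;             /* virtio device ID, e.g. net or block */
    uint8_t  num_vq;           /* number of virtqueues */
    uint8_t  status;           /* driver status / unload notification */
    uint8_t  config_len;       /* length of device config space */
    uint32_t interrupt;        /* doorbell / interrupt vector to use */
    struct vring_config vq[2]; /* one entry per virtqueue */
};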
[Figure: virtio network stacks compared. KVM guest: TCP/IP over virtio-net over virtio-pci, with the QEMU network backend (QEMU process, kvm.ko) feeding a TAP device and bridge on the host OS. Coprocessor: TCP/IP over virtio-net over virtio-mic, with the network backend in mpssd and the coprocessor driver feeding a TAP device and bridge on the host OS]
What's different?
• Data path
• Control path
SCIF • Symmetric Communications Interface
• Goals
  – Performance (PCIe* available BW ~7 GB/s)
    • TCP/IP host-to-card BW is around 400 MB/s
  – Abstract the PCIe* network
• send/recv, RMA, mapped memory APIs
[Figure: host and multiple coprocessors, plus an IB* HCA, all on the PCIe* network]
SCIF Endpoints & Connections
• SCIF endpoint
  – pipe to a PCIe* node or loopback, bound to a port ID
• Exactly 2 endpoints form a connection; SCIF data transfer/mapping APIs accept only a connected endpoint
[Figure: process P0 on node 0 (port X) connected to process P1 on node 1 (port Y) over PCIe*]
SCIF API Functional Grouping
• Connection
• Messaging
• Memory Registration
• Remote Memory Access (RMA)
• RMA Fencing
• Remote memory mapping (mmap)
Connection & send/recv
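A minimal sketch of the connection group, assuming the user-space SCIF API from the MPSS <scif.h> header (scif_open, scif_bind, scif_listen, scif_accept, scif_connect); the port number is an arbitrary example and error handling is omitted:

#include <scif.h>

#define EXAMPLE_PORT 2050  /* arbitrary example port number */

/* Listening side (e.g. running on the host, node 0) */
scif_epd_t listen_side(void)
{
    scif_epd_t lepd = scif_open();
    scif_epd_t newepd;
    struct scif_portID peer;

    scif_bind(lepd, EXAMPLE_PORT);        /* bind endpoint to a local port */
    scif_listen(lepd, 1);                 /* mark it as a listening endpoint */
    scif_accept(lepd, &peer, &newepd, SCIF_ACCEPT_SYNC);
    return newepd;                        /* connected endpoint */
}

/* Connecting side (e.g. running on coprocessor node 1) */
scif_epd_t connect_side(void)
{
    scif_epd_t epd = scif_open();
    struct scif_portID dst = { .node = 0, .port = EXAMPLE_PORT };

    scif_connect(epd, &dst);              /* connect to the port on node 0 */
    return epd;                           /* connected endpoint */
}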
Send/recv Implementation
[Figure: each connected endpoint owns a receive queue; scif_send() on node 0 copies msg0 across PCIe* into the endpoint receive queue on node 1, and scif_recv() copies it out into msg1]
P0: scif_send(epd, msg0, len, flags); P1: scif_recv(epd, msg1, len, flags);
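A short usage sketch, again assuming the user-space SCIF API; in practice P0 and P1 are separate processes (possibly on different nodes), shown here in one function only for brevity:

#include <scif.h>

void ping(scif_epd_t epd0, scif_epd_t epd1)
{
    char msg0[64] = "hello from P0";
    char msg1[64];

    /* P0: copy msg0 into the peer endpoint's receive queue */
    scif_send(epd0, msg0, sizeof(msg0), SCIF_SEND_BLOCK);

    /* P1: drain the endpoint receive queue into msg1 */
    scif_recv(epd1, msg1, sizeof(msg1), SCIF_RECV_BLOCK);
}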
Memory Registration
• SCIF RMA provides zero copy inter-process data transfer
• Registration exposes local memory for remote access
• Pins pages
  – Local DMA engine access
  – Remote access
[Figure: buf0 in process P0 (node 0, port X) and buf1 in process P1 (node 1, port Y), connected over PCIe*]
Registered Address Space (RAS)
• Offsets reference registered memory in RMA APIs
• RAS is per connection
• Connection has 2 registered address spaces – Local & Remote
  – Local RAS offset = peer's Remote RAS offset
off_t scif_register(epd, addr, len, …, prot, ..);
[Figure: P0 registers buf0 at offset0 in its Local RAS, so buf0 appears at offset0 in P1's Remote RAS; likewise P1 registers buf1 at offset1, which appears at offset1 in P0's Remote RAS]
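A sketch of registering a buffer for RMA, assuming the user-space SCIF API; the offset returned here is what the peer uses in its Remote RAS:

#include <stdlib.h>
#include <scif.h>

#define BUF_LEN 0x100000   /* 1 MB, a multiple of the page size */

off_t register_buffer(scif_epd_t epd, void **bufp)
{
    void *buf;

    /* Registered memory must be page aligned */
    if (posix_memalign(&buf, 0x1000, BUF_LEN))
        return -1;

    *bufp = buf;
    /* Let SCIF pick the offset (SCIF_MAP_FIXED not set); the pages are
     * pinned and become remotely accessible at the returned offset. */
    return scif_register(epd, buf, BUF_LEN, 0,
                         SCIF_PROT_READ | SCIF_PROT_WRITE, 0);
}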
RMA
int scif_writeto(epd, offset0, len, offset1, flags);
[Figure: copies len bytes from registered offset0 (buf0) in P0's Local RAS to registered offset1 (buf1) in its Remote RAS]
int scif_vwriteto(epd, buf0, len, offset1, flags);
[Figure: copies len bytes from the local virtual address buf0 in P0's address space to registered offset1 in its Remote RAS]
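A sketch of both calls, assuming the user-space SCIF API; offset0 is a local registered offset and offset1 a registered offset in the peer's RAS (names follow the slide; the buffers could come from register_buffer() above). SCIF_RMA_SYNC makes the call block until the transfer completes:

#include <sys/types.h>
#include <scif.h>

void copy_examples(scif_epd_t epd, off_t offset0, off_t offset1,
                   void *buf0, size_t len)
{
    /* Registered-to-registered transfer (may use the DMA engine) */
    scif_writeto(epd, offset0, len, offset1, SCIF_RMA_SYNC);

    /* Same transfer, but sourcing directly from the local virtual
     * address buf0 instead of a registered offset */
    scif_vwriteto(epd, buf0, len, offset1, SCIF_RMA_SYNC);
}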
RMA Fence APIs
• Asynchronous RMAs allow overlap of compute & communication
• Fence APIs allow synchronization with RMA completion
• Non-blocking (polling) synchronization: scif_fence_signal(ep, off, v) writes the value v to registered offset off once the outstanding RMAs complete, so the application can poll that location
• Blocking synchronization: m = scif_fence_mark(ep) marks the RMAs issued so far; scif_fence_wait(ep, m) blocks until the marked RMAs complete
[Figure: timeline of RMA1, RMA2, RMA3 issued at t1 to t5, with fence completion observed at t6 and t7]
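A sketch of both fence styles, assuming the user-space SCIF API; the prototypes below (flag arguments, local vs. remote signaling) follow <scif.h> rather than the simplified forms on the slide:

#include <sys/types.h>
#include <scif.h>

void fence_examples(scif_epd_t epd, off_t loff, off_t roff,
                    off_t src, off_t dst, size_t len)
{
    int mark;

    /* Blocking: mark the RMAs issued on this endpoint so far, then wait */
    scif_writeto(epd, src, len, dst, 0);
    scif_fence_mark(epd, SCIF_FENCE_INIT_SELF, &mark);
    scif_fence_wait(epd, mark);

    /* Non-blocking: ask SCIF to write the value 1 to registered offset loff
     * when the outstanding RMAs complete; the application polls the memory
     * mapped at loff instead of blocking here. */
    scif_writeto(epd, src, len, dst, 0);
    scif_fence_signal(epd, loff, 1, roff, 1,
                      SCIF_FENCE_INIT_SELF | SCIF_SIGNAL_LOCAL);
}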
Remote Memory Mapping
va = mmap(addr, len, prot, flags, epd, offset1);
• Maps the peer's registered memory (buf1 at offset1 in the Remote RAS) into the local process's virtual address space
• Lowest latency path for messaging
[Figure: after the mapping, P0's process VA contains buf0 and a mapping va of the remote buf1]
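A sketch of mapping remote registered memory, assuming the user-space SCIF library call scif_mmap() (the slide shows the equivalent mmap() form on the endpoint file descriptor); loads and stores through the returned pointer go directly over PCIe* to the peer's buffer:

#include <stddef.h>
#include <scif.h>

void *map_remote(scif_epd_t epd, off_t offset1, size_t len)
{
    void *va = scif_mmap(NULL, len, SCIF_PROT_READ | SCIF_PROT_WRITE,
                         0, epd, offset1);

    if (va == (void *)-1)      /* SCIF mmap failure value */
        return NULL;

    /* e.g. low-latency flag/doorbell writes into the peer's memory */
    *(volatile int *)va = 1;
    return va;
}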
OFED* over SCIF
• OpenFabrics Enterprise Distribution (OFED*): open-source software stack for InfiniBand* and iWARP*
• IB-SCIF driver
  – Software emulated HCA
  – Used within the box
  – IB-SCIF driver uses kernel SCIF send/recv and RMA operations
[Figure: on both host and coprocessor, an MPI application sits on uDAPL, the IB Verbs Library and the IB-SCIF Library in user mode; IB uverbs, IB core and the IB-SCIF driver run over SCIF in kernel mode]
SCIF RMA Performance
[Figure: comparison of TCP- and SCIF-based bandwidth; throughput (GB/s, 0 to 8) vs. transfer size (KB). Series: available PCIe* BW, SCIF write DMA (host initiated), SCIF write DMA (coprocessor initiated), TCP (host to coprocessor), TCP (coprocessor to host)]
Code Status & Plans
• Patches for features below submitted, expect inclusion in 3.13
– Coprocessor OS state management
– Virtio device support
• Future patches
– DMA engine & usage in Virtio device support
– SCIF
Summary
• MIC x100 Coprocessor driver is a key element of an all Linux* HPC platform
– Enables choice of programming models
• New driver features
– Virtio for PCIe* endpoints
– SCIF communication
Possibilities for reuse in your HW? Suggestions?
Let us know!
Acknowledgements!
• Team
– Dasa Chandramouli, Bruce Chang, Bill Clifford, Ashutosh Dixit, Sudeep Dutt, Harsha Kharche, Sanath Kumar, Ravi Murty, Johnnie Peters, Evan Powers, John Wiegert, Siva Yerramreddy, Caz Yokoyama, Jianxin Xiong
• Reviewers
– PJ Waskiewicz, Eddie Dong
• Presentation – James Reinders
Links
• Patches
– https://lkml.org/lkml/2013/9/5/561
• MPSS
• Software Developer's Guide
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, Cilk, VTune, Xeon, Xeon Phi, Look Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others. Copyright ©2013 Intel Corporation.