Intel® MIC x100 Coprocessor Driver - on the Frontiers of Linux & HPC
Nikhil Rao ([email protected])
LinuxCon 2013
Intel® Xeon Phi* (MIC) x100 Coprocessors
Highly-parallel Processing for Unparalleled Discovery
Groundbreaking differences
• Up to 61 IA cores / 1.1 GHz / 244 threads
• Up to 16 GB memory with up to 352 GB/s bandwidth
• 512-bit SIMD instructions
• Linux* operating system, IP addressable
• Standard programming languages and tools
Leading to groundbreaking results
• Up to 1 TeraFlop/s double-precision peak performance¹
• Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server²
• Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server³
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. For notes 1, 2 & 3, see backup for system configuration details.
Programming Models
[Figure: three programming models. Offload: main() runs on the CPU and offloads foo() to the coprocessor. Native: main() runs entirely on the coprocessor. Symmetric: main() runs on both the CPU and the coprocessor.]
Compiler Assisted Offload
Host Only Code
float ret = 0;
#pragma omp parallel for reduction (+:ret)
for (int i = 0; i < size; i++)
{
    ret += data[i];
}
ret = data[0] + data[1] + .. + data[size-1]
Compiler Assisted Offload
Loop Offloaded to Coprocessor
float ret = 0;
#pragma offload target(mic) in(size) in(data:length(size))
{
    #pragma omp parallel for reduction (+:ret)
    for (int i = 0; i < size; i++)
    {
        ret += data[i];
    }
}
ret = data[0] + data[1] + .. + data[size-1]
Intel® Manycore Platform Software Stack (MPSS)
Host-side tools
• Coprocessor FS, network configuration
• Status monitoring (e.g., temperature, power, RAS)
• Coprocessor OS state management (micctrl, mpssd)
• VirtIO devices (mpssd)
[Figure: programming models and tools sit above the driver on the host platform, connected over PCIe* to the coprocessor Linux* OS running offload apps; MPI* and TCP/IP span host and coprocessor]
Intel® MPSS Coprocessor Environment
• Linux* OS, K1OM ABI
• Busybox filesystem
[Figure: same stack diagram; the coprocessor Linux* OS and offload apps connect to the host-side tools and driver over PCIe*, with MPI and TCP/IP spanning both sides]
Intel® Xeon Phi™ Coprocessor Driver
• Coprocessor OS management
• Virtual (VirtIO based) device support
• PCIe* messaging & RDMA APIs (SCIF)
[Figure: processes P0 and P1 communicating across PCIe* via SCIF]
Coprocessor OS Boot
[Figure: boot handshake between user space (micctrl), the host driver's sysfs interface, and the coprocessor]
• Coprocessor firmware signals "FW ready" to the host driver
• micctrl -b writes the bzImage and RAMdisk file names to the driver via sysfs
• The host driver transfers the bzImage and the ramdisk to the coprocessor and sends a boot interrupt
• The coprocessor boots its Linux* OS
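A minimal user-space sketch of that sequence, assuming sysfs attribute names similar to the ones the in-kernel mic host driver exposes (firmware, ramdisk, state under /sys/class/mic/mic0); the exact names and accepted values differ between MPSS releases, so treat the paths and strings below as illustrative only:

/* Illustrative sketch of what a tool like micctrl -b conceptually does:
 * hand the kernel image and ramdisk names to the host driver via sysfs,
 * then request the state transition that boots the card. */
#include <stdio.h>
#include <stdlib.h>

static void write_sysfs(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    fprintf(f, "%s\n", val);
    fclose(f);
}

int main(void)
{
    /* Tell the host driver which images to load (illustrative paths) ... */
    write_sysfs("/sys/class/mic/mic0/firmware", "mic/uos.img");
    write_sysfs("/sys/class/mic/mic0/ramdisk", "mic/initramfs.gz");
    /* ... then ask the driver to boot the coprocessor. */
    write_sysfs("/sys/class/mic/mic0/state", "boot");
    return 0;
}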
Virtio Drivers
• Virtio - framework that enables use of common guest drivers across hypervisors
[Figure: the same virtio_net.ko guest driver and its virtqueues reused across backends: QEMU/KVM, lguest, and the coprocessor, where the host-side mpssd daemon plays the backend role]
• Key benefits
  • Reuse of well designed, maintained code
  • Standard, enables a simple backend
  • New devices possible in the future
Virtio Data Path
[Figure: virtio driver in the guest/coprocessor OS and device emulation in the hypervisor/host OS, sharing a virtqueue with avail and used rings]
• The virtio driver places a buffer on the virtqueue's avail ring and notifies the device emulation
• Device emulation processes the buffer (e.g., hands it to the HW), moves it to the used ring, and interrupts the guest/coprocessor OS
• The virtio driver reclaims the completed buffer from the used ring
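A minimal guest-side sketch of that flow, assuming the standard Linux virtqueue API (virtqueue_add_outbuf, virtqueue_kick, virtqueue_get_buf); this is not taken from the MPSS sources:

#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

/* Post one buffer to the backend over a virtqueue. */
static int post_one_buffer(struct virtqueue *vq, void *buf, unsigned int len)
{
    struct scatterlist sg;
    int err;

    sg_init_one(&sg, buf, len);

    /* Place the buffer on the avail ring ... */
    err = virtqueue_add_outbuf(vq, &sg, 1, buf, GFP_ATOMIC);
    if (err)
        return err;

    /* ... and kick (notify) the device emulation on the other side. */
    virtqueue_kick(vq);
    return 0;
}

/* Virtqueue callback: runs after the backend interrupts us; reclaims
 * completed buffers from the used ring. */
static void buffers_done(struct virtqueue *vq)
{
    unsigned int len;
    void *buf;

    while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
        ; /* 'buf' has been consumed by the backend; free or reuse it */
}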
Virtio Data Path Setup
[Figure: mpss daemon (device emulation) and the virtio-mic host driver on the host OS, the virtio bus in the coprocessor OS, and a shared device page]
• Device create IOCTL (from the mpss daemon to the host driver) adds a device page entry
  – vring addresses, interrupt information
  – Status notification (e.g., driver unloaded)
• The coprocessor OS discovers the new entry on its virtio bus and instantiates the virtio device
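For illustration only, a hypothetical C layout of such a device page entry; the real structures live in the driver's shared headers and differ in names and fields, but they carry the same kind of information the slide lists (vring addresses, interrupt data, status):

/* Hypothetical device page entry layout, for illustration only. */
#include <stdint.h>

struct vring_config {
    uint64_t desc_addr;   /* address of the descriptor ring */
    uint64_t used_addr;   /* address of the used ring */
    uint16_t num;         /* number of descriptors */
};

struct device_page_entry {
    uint8_t  type;             /* virtio device ID, e.g. net or block */
    uint8_t  num_vq;           /* number of virtqueues */
    uint8_t  status;           /* driver status / unload notification */
    uint8_t  config_len;       /* length of device config space */
    uint32_t interrupt;        /* doorbell / interrupt vector to use */
    struct vring_config vq[2]; /* one entry per virtqueue */
};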
[Figure: virtio network stacks compared. KVM guest: TCP/IP over virtio-net over virtio-pci, with the QEMU network backend (QEMU process, kvm.ko) feeding a TAP device and bridge on the host OS. Coprocessor: TCP/IP over virtio-net over virtio-mic, with the network backend in mpssd and the coprocessor driver feeding a TAP device and bridge on the host OS]
What's different?
• Data path
• Control path
SCIF • Symmetric Communications Interface
• Goals
  – Performance (PCIe* available BW ~7 GB/s)
    • TCP/IP host-to-card BW is around 400 MB/s
  – Abstract the PCIe* network
• send/recv, RMA, mapped memory APIs
[Figure: host and multiple coprocessors, plus an IB* HCA, all on the PCIe* network]
SCIF Endpoints & Connections
• SCIF endpoint
  – pipe to a PCIe* node or loopback, bound to a port ID
• Exactly 2 endpoints form a connection; SCIF data transfer/mapping APIs accept only a connected endpoint
[Figure: process P0 on node 0 (port X) connected to process P1 on node 1 (port Y) over PCIe*]
SCIF API Functional Grouping
• Connection
• Messaging
• Memory Registration
• Remote Memory Access (RMA)
• RMA Fencing
• Remote memory mapping (mmap)
Connection & send/recv
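A minimal sketch of the connection group, assuming the user-space SCIF API from the MPSS <scif.h> header (scif_open, scif_bind, scif_listen, scif_accept, scif_connect); the port number is an arbitrary example and error handling is omitted:

#include <scif.h>

#define EXAMPLE_PORT 2050  /* arbitrary example port number */

/* Listening side (e.g. running on the host, node 0) */
scif_epd_t listen_side(void)
{
    scif_epd_t lepd = scif_open();
    scif_epd_t newepd;
    struct scif_portID peer;

    scif_bind(lepd, EXAMPLE_PORT);        /* bind endpoint to a local port */
    scif_listen(lepd, 1);                 /* mark it as a listening endpoint */
    scif_accept(lepd, &peer, &newepd, SCIF_ACCEPT_SYNC);
    return newepd;                        /* connected endpoint */
}

/* Connecting side (e.g. running on coprocessor node 1) */
scif_epd_t connect_side(void)
{
    scif_epd_t epd = scif_open();
    struct scif_portID dst = { .node = 0, .port = EXAMPLE_PORT };

    scif_connect(epd, &dst);              /* connect to the port on node 0 */
    return epd;                           /* connected endpoint */
}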
Send/recv Implementation
[Figure: each connected endpoint owns a receive queue; scif_send() on node 0 copies msg0 across PCIe* into the endpoint receive queue on node 1, and scif_recv() copies it out into msg1]
P0: scif_send(epd, msg0, len, flags); P1: scif_recv(epd, msg1, len, flags);
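A short usage sketch, again assuming the user-space SCIF API; in practice P0 and P1 are separate processes (possibly on different nodes), shown here in one function only for brevity:

#include <scif.h>

void ping(scif_epd_t epd0, scif_epd_t epd1)
{
    char msg0[64] = "hello from P0";
    char msg1[64];

    /* P0: copy msg0 into the peer endpoint's receive queue */
    scif_send(epd0, msg0, sizeof(msg0), SCIF_SEND_BLOCK);

    /* P1: drain the endpoint receive queue into msg1 */
    scif_recv(epd1, msg1, sizeof(msg1), SCIF_RECV_BLOCK);
}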
Memory Registration
• SCIF RMA provides zero copy inter-process data transfer
• Registration exposes local memory for remote access
• Pins pages
  – Local DMA engine access
  – Remote access
[Figure: buf0 in process P0 (node 0, port X) and buf1 in process P1 (node 1, port Y), connected over PCIe*]
Registered Address Space (RAS)
• Offsets reference registered memory in RMA APIs
• RAS is per connection
• Connection has 2 registered address spaces – Local & Remote
  – Local RAS offset = peer's Remote RAS offset
off_t scif_register(epd, addr, len, …, prot, ..);
[Figure: P0 registers buf0 at offset0 in its Local RAS, so buf0 appears at offset0 in P1's Remote RAS; likewise P1 registers buf1 at offset1, which appears at offset1 in P0's Remote RAS]
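A sketch of registering a buffer for RMA, assuming the user-space SCIF API; the offset returned here is what the peer uses in its Remote RAS:

#include <stdlib.h>
#include <scif.h>

#define BUF_LEN 0x100000   /* 1 MB, a multiple of the page size */

off_t register_buffer(scif_epd_t epd, void **bufp)
{
    void *buf;

    /* Registered memory must be page aligned */
    if (posix_memalign(&buf, 0x1000, BUF_LEN))
        return -1;

    *bufp = buf;
    /* Let SCIF pick the offset (SCIF_MAP_FIXED not set); the pages are
     * pinned and become remotely accessible at the returned offset. */
    return scif_register(epd, buf, BUF_LEN, 0,
                         SCIF_PROT_READ | SCIF_PROT_WRITE, 0);
}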
RMA
int scif_writeto(epd, offset0, len, offset1, flags);
[Figure: copies len bytes from registered offset0 (buf0) in P0's Local RAS to registered offset1 (buf1) in its Remote RAS]
int scif_vwriteto(epd, buf0, len, offset1, flags);
[Figure: copies len bytes from the local virtual address buf0 in P0's address space to registered offset1 in its Remote RAS]
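A sketch of both calls, assuming the user-space SCIF API; offset0 is a local registered offset and offset1 a registered offset in the peer's RAS (names follow the slide; the buffers could come from register_buffer() above). SCIF_RMA_SYNC makes the call block until the transfer completes:

#include <sys/types.h>
#include <scif.h>

void copy_examples(scif_epd_t epd, off_t offset0, off_t offset1,
                   void *buf0, size_t len)
{
    /* Registered-to-registered transfer (may use the DMA engine) */
    scif_writeto(epd, offset0, len, offset1, SCIF_RMA_SYNC);

    /* Same transfer, but sourcing directly from the local virtual
     * address buf0 instead of a registered offset */
    scif_vwriteto(epd, buf0, len, offset1, SCIF_RMA_SYNC);
}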
RMA Fence APIs
• Asynchronous RMAs allow overlap of compute & communication
• Fence APIs allow synchronization with RMA completion
• Non-blocking (polling) synchronization: scif_fence_signal(ep, off, v) writes the value v to registered offset off once the outstanding RMAs complete, so the application can poll that location
• Blocking synchronization: m = scif_fence_mark(ep) marks the RMAs issued so far; scif_fence_wait(ep, m) blocks until the marked RMAs complete
[Figure: timeline of RMA1, RMA2, RMA3 issued at t1 to t5, with fence completion observed at t6 and t7]
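A sketch of both fence styles, assuming the user-space SCIF API; the prototypes below (flag arguments, local vs. remote signaling) follow <scif.h> rather than the simplified forms on the slide:

#include <sys/types.h>
#include <scif.h>

void fence_examples(scif_epd_t epd, off_t loff, off_t roff,
                    off_t src, off_t dst, size_t len)
{
    int mark;

    /* Blocking: mark the RMAs issued on this endpoint so far, then wait */
    scif_writeto(epd, src, len, dst, 0);
    scif_fence_mark(epd, SCIF_FENCE_INIT_SELF, &mark);
    scif_fence_wait(epd, mark);

    /* Non-blocking: ask SCIF to write the value 1 to registered offset loff
     * when the outstanding RMAs complete; the application polls the memory
     * mapped at loff instead of blocking here. */
    scif_writeto(epd, src, len, dst, 0);
    scif_fence_signal(epd, loff, 1, roff, 1,
                      SCIF_FENCE_INIT_SELF | SCIF_SIGNAL_LOCAL);
}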
Remote Memory Mapping
va = mmap(addr, len, prot, flags, epd, offset1);
• Maps the peer's registered memory (buf1 at offset1 in the Remote RAS) into the local process's virtual address space
• Lowest latency path for messaging
[Figure: after the mapping, P0's process VA contains buf0 and a mapping va of the remote buf1]
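A sketch of mapping remote registered memory, assuming the user-space SCIF library call scif_mmap() (the slide shows the equivalent mmap() form on the endpoint file descriptor); loads and stores through the returned pointer go directly over PCIe* to the peer's buffer:

#include <stddef.h>
#include <scif.h>

void *map_remote(scif_epd_t epd, off_t offset1, size_t len)
{
    void *va = scif_mmap(NULL, len, SCIF_PROT_READ | SCIF_PROT_WRITE,
                         0, epd, offset1);

    if (va == (void *)-1)      /* SCIF mmap failure value */
        return NULL;

    /* e.g. low-latency flag/doorbell writes into the peer's memory */
    *(volatile int *)va = 1;
    return va;
}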
OFED* over SCIF
• OpenFabrics Enterprise Distribution (OFED*): open-source software stack for InfiniBand* and iWARP*
• IB-SCIF driver
  – Software emulated HCA
  – Used within the box
  – IB-SCIF driver uses kernel SCIF send/recv and RMA operations
[Figure: on both host and coprocessor, an MPI application sits on uDAPL, the IB Verbs Library and the IB-SCIF Library in user mode; IB uverbs, IB core and the IB-SCIF driver run over SCIF in kernel mode]
SCIF RMA Performance
[Figure: comparison of TCP- and SCIF-based bandwidth; throughput (GB/s, 0 to 8) vs. transfer size (KB). Series: available PCIe* BW, SCIF write DMA (host initiated), SCIF write DMA (coprocessor initiated), TCP (host to coprocessor), TCP (coprocessor to host)]
Code Status & Plans
• Patches for features below submitted, expect inclusion in 3.13
– Coprocessor OS state management
– Virtio device support
• Future patches
– DMA engine & usage in Virtio device support
– SCIF
Summary
• MIC x100 Coprocessor driver is a key element of an all Linux* HPC platform
– Enables choice of programming models
• New driver features
– Virtio for PCIe* endpoints
– SCIF communication
Possibilities for reuse in your HW? Suggestions?
Let us know!
Acknowledgements!
• Team
– Dasa Chandramouli, Bruce Chang, Bill Clifford, Ashutosh Dixit, Sudeep Dutt, Harsha Kharche, Sanath Kumar, Ravi Murty, Johnnie Peters, Evan Powers, John Wiegert, Siva Yerramreddy, Caz Yokoyama, Jianxin Xiong
• Reviewers
– PJ Waskiewicz, Eddie Dong
• Presentation – James Reinders
Links
• Patches
– https://lkml.org/lkml/2013/9/5/561
• MPSS
• Software Developer's Guide
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Intel, Cilk, VTune, Xeon, Xeon Phi, Look Inside and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others. Copyright ©2013 Intel Corporation.