
NICA: An Infrastructure for Inline Acceleration of Network Applications

Haggai Eran†#, Lior Zeno†, Maroun Tork†, Gabi Malka†, Mark Silberstein†

†Technion – Israel Institute of Technology  #Mellanox Technologies


FPGA-Based SmartNICs

[Diagram: a bump-in-the-wire SmartNIC – the FPGA sits between the network and the host NIC.]

Azure SmartNICs implementing AccelNet have been deployed on all new Azure servers since late 2015 in a fleet of >1M hosts. [NSDI’18]


Inline Acceleration (Analogy)

[Image credits: https://www.pinterest.com/pin/334603447314940566/ and https://www.amazon.com/WolVol-Friction-Powered-Garbage-Lights/dp/B00WH4Y9SG3]

Inline Processing - Filtering

[Diagram: packets from the network pass through the FPGA, which filters some of them before they reach the host.]

Inline Processing - Transformation

[Diagram: packets from the network pass through the FPGA, which transforms them before they reach the host.]

A Key-Value Store Cache

[Diagram: the FPGA on the SmartNIC holds a key-value cache; GET requests that hit are answered directly from the NIC, while misses continue to the key-value store on the host.]

Previous work: KV-Direct [SOSP’17], Floem [OSDI’18], LaKe [ReConFig’18]
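The caching idea can be sketched in software: the AFU keeps a small key-value table on the NIC, answers GET hits directly, and forwards misses to the host. Everything below is illustrative – the table size, hash, and function names are invented for the example and are not NICA's implementation (the evaluated cache holds millions of entries in FPGA memory):

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 2048   /* toy size; the evaluated cache holds 2M keys */
#define KV_LEN 16          /* 16-byte keys and values, as in the talk */

struct slot { int valid; uint8_t key[KV_LEN]; uint8_t val[KV_LEN]; };
static struct slot cache[CACHE_SLOTS];

/* FNV-1a hash of a fixed-length key, reduced to a slot index. */
static uint32_t slot_of(const uint8_t *key)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < KV_LEN; i++)
        h = (h ^ key[i]) * 16777619u;
    return h % CACHE_SLOTS;
}

/* Host update path: install (or overwrite) a cached key/value pair. */
void cache_set(const uint8_t *key, const uint8_t *val)
{
    struct slot *s = &cache[slot_of(key)];
    s->valid = 1;
    memcpy(s->key, key, KV_LEN);
    memcpy(s->val, val, KV_LEN);
}

/* Inline GET path: 1 = hit (reply built on the NIC, 'val' filled in),
 * 0 = miss (the request is passed through to the host memcached). */
int cache_get(const uint8_t *key, uint8_t *val)
{
    struct slot *s = &cache[slot_of(key)];
    if (!s->valid || memcmp(s->key, key, KV_LEN) != 0)
        return 0;
    memcpy(val, s->val, KV_LEN);
    return 1;
}
```

A hit never touches the host CPU, which is where the bare-metal speedups shown later in the talk come from.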

CoAP Cryptographic Authentication

[Diagram: IoT devices send CoAP messages over the network; the FPGA on the SmartNIC verifies each message's cryptographic authentication before it reaches the server.]

Challenges for Cloud Inline Accelerators

• No operating system abstractions

• No virtualization support:
  • Performance & state isolation

The NICA infrastructure fulfills these requirements for arbitrary inline accelerators.


NICA – Contributions

• ikernel OS abstraction for inline application acceleration

• SmartNIC virtualization: fine-grain time sharing & strict performance isolation

The ikernel Abstraction

[Diagram: a host process holds an ikernel handle bound to one of several AFUs on the SmartNIC's FPGA; traffic from the network passes through the AFU.]

AFU – Accelerator Functional Unit

Attaching an ikernel

[Diagrams: a host process opens a socket; attaching binds the ikernel to it, after which the socket's traffic is steered through the ikernel's AFU on the FPGA.]

Interfaces of an ikernel

• Attach to sockets
  • Use POSIX API for data

• Custom ring – direct producer-consumer interface
  • Bypass the host network stack

• RPC – access AFU state
  • E.g. configure cryptographic keys, read counters


The ikernel Abstraction

// Create a handle
k = ik_create(MEMCACHED_AFU);
ik_command(k, CONFIGURE, ...);

// Init a socket
s = socket(...); bind(s, ...);

// Activate the ikernel
ik_attach(k, s);

// Use POSIX APIs to receive
while (recvmsg(s, buf, ...))
    ...

107 lines of code for memcached integration
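The custom-ring interface is, at heart, a single-producer/single-consumer queue between the AFU and the application (the real implementation runs over RDMA, as the paper describes). The sketch below is an illustrative in-memory model of such a ring; the names and sizes are invented for the example:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative single-producer/single-consumer ring, standing in for
 * the NIC-to-application custom ring that bypasses the host stack. */
#define RING_SLOTS 256          /* power of two, chosen for the example */
#define MSG_LEN    64

struct ring {
    uint32_t head;              /* consumer (application) index */
    uint32_t tail;              /* producer (AFU) index */
    uint8_t  buf[RING_SLOTS][MSG_LEN];
};

/* Producer side: the AFU publishes a message directly to the ring. */
int ring_push(struct ring *r, const uint8_t *msg)
{
    if (r->tail - r->head == RING_SLOTS)
        return 0;                       /* ring full */
    memcpy(r->buf[r->tail % RING_SLOTS], msg, MSG_LEN);
    r->tail++;
    return 1;
}

/* Consumer side: the application polls for the next message. */
int ring_pop(struct ring *r, uint8_t *msg)
{
    if (r->head == r->tail)
        return 0;                       /* ring empty */
    memcpy(msg, r->buf[r->head % RING_SLOTS], MSG_LEN);
    r->head++;
    return 1;
}
```

Because producer and consumer advance disjoint indices, no lock is needed in this single-threaded model; a real cross-device ring additionally needs memory ordering (and, in NICA's case, RDMA writes) that this sketch omits.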

Virtualization Support

How FPGAs Are Shared [Background]

[Diagram: the FPGA's computation and I/O resources divided among tenants.]

• Space sharing
  – Limited utilization
• Coarse-grain time sharing
  – Long context switch
• Fine-grain time sharing – multiple tenants share the same AFU
  + Low latency – hardware context switch (packet granularity)

AmorphOS [OSDI’18] combines the first two.

NICA applies these techniques to SmartNIC AFUs
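Fine-grain time sharing works because an AFU's per-tenant state is small enough to select on every packet, so a "context switch" is just indexing a context table by tenant. The sketch below illustrates that idea in software; the structure and names are invented for the example, not NICA's hardware runtime:

```c
#include <stdint.h>

#define MAX_VAFUS 8             /* toy number of tenant contexts */

/* Per-tenant vAFU context: small enough to swap at packet granularity. */
struct vafu_ctx {
    uint64_t packets;           /* stand-in for real per-tenant AFU state */
    uint64_t bytes;
};

static struct vafu_ctx ctx[MAX_VAFUS];

/* Dispatch one packet: a "context switch" is one table lookup keyed by
 * the packet's tenant, then processing runs against that context. */
void dispatch_packet(uint32_t tenant_id, uint32_t len)
{
    struct vafu_ctx *c = &ctx[tenant_id % MAX_VAFUS];
    c->packets++;               /* stand-in for the AFU's real work */
    c->bytes += len;
}

uint64_t vafu_packets(uint32_t tenant_id)
{
    return ctx[tenant_id % MAX_VAFUS].packets;
}
```

Since no state is saved or restored beyond the lookup, consecutive packets from different tenants interleave with no switching penalty – the property the slide calls packet-granularity context switching.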

Cloud Deployment Model

FPGA-as-a-Service:
• Customers bring their own design
• Coarse-grain sharing

AFU Marketplace:
• Cloud provider develops and audits AFUs
• AFUs trusted to implement fine-grain virtualization

NICA I/O Virtualization

[Diagram: the NICA hardware runtime on the SmartNIC multiplexes the physical AFU into per-tenant vAFUs (vAFU 1, vAFU 2), attached to host processes and VMs alongside the CPU network stack.]

Performance Isolation

[Diagram: the same setup; the NICA hardware runtime arbitrates I/O between the tenants' vAFUs to enforce performance isolation.]
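One standard way to enforce per-tenant I/O rates in such a runtime is a token bucket per vAFU: each tenant accumulates budget at its allotted rate and a packet is admitted only if budget remains. This is a generic illustration of that technique, not necessarily the exact mechanism NICA implements:

```c
#include <stdint.h>

/* Generic per-vAFU token bucket: 'rate' bytes of budget are added each
 * tick, capped at 'burst'; packets spend budget to be admitted. */
struct bucket {
    uint64_t tokens;        /* bytes the tenant may currently send */
    uint64_t rate;          /* bytes added per tick */
    uint64_t burst;         /* maximum accumulated budget */
};

/* Called periodically (e.g. per scheduling interval) to refill budget. */
void bucket_tick(struct bucket *b)
{
    b->tokens += b->rate;
    if (b->tokens > b->burst)
        b->tokens = b->burst;
}

/* Admit a packet of 'len' bytes if the tenant has budget left;
 * returning 0 means the packet is deferred or dropped. */
int bucket_admit(struct bucket *b, uint64_t len)
{
    if (b->tokens < len)
        return 0;
    b->tokens -= len;
    return 1;
}
```

The burst cap keeps an idle tenant from hoarding unlimited budget and then starving others, which is the "strict" part of strict performance isolation.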

In the Paper

• NICA hardware runtime

• Network stack integration and TCP support

• Custom ring implementation using RDMA

• SR-IOV and para-virtual interfaces

• Implementation details


Evaluation

40 Gbps bump-in-the-wire SmartNIC:

• Mellanox Innova Flex

• ConnectX 4 Lx EN

• Xilinx Kintex UltraScale FPGA

Similar to Microsoft Catapult.



Evaluation

Baseline:
• VMA user-space networking stack

Microbenchmarks:
• UDP/TCP performance
• Virtualization overheads
• I/O isolation overheads

Applications:
• memcached
  • Comparison with MICA [NSDI’14] and KV-Direct [SOSP’17]
• IoT message authentication
  • Node.js server, integrated with 20 lines of JavaScript


Memcached Cache

• Host working set: 32M keys
• SmartNIC cache: 2M keys
• 16-byte keys and values
• Memcached UDP ASCII protocol



Simplicity of host integration:

• POSIX – 107 lines of C code

• Custom ring – 135 lines of code

Key-Value Store Cache – Bare Metal R/O

[Chart: throughput (Mtps) by Zipf skew (0.69–1.49) under a Zipf key distribution, comparing CPU-only, NICA, and NICA+CR against the 40.3 Mtps line rate.]

• At Zipf(0.99): ×4 throughput with filtering, ×9 with both (filtering + custom ring)
• Low hit rate: ×2 throughput with the custom ring
• Near line-rate hardware throughput

Key-Value Store Cache – Virtualization

• 1 core, 5 GB host RAM, 2M-key cache, Zipf(0.9)

[Chart: throughput (Mtps) by number of VMs (1–6), comparing CPU and NICA+CR.]

Latency under virtualization

• 60% hit rate – 2.1 µs
  • Same as bare-metal
• Misses served on the CPU – 12–62 µs
  • Compared to 6 µs on bare-metal
• Negligible sharing overhead


Key-Value Store – Performance Isolation

[Chart: throughput (Mtps) over time for VM 1, VM 2, and VM 3 sharing the AFU.]

Conclusion

• The NICA framework enables SmartNIC inline processing in cloud environments.
• Find our code at https://github.com/acsl-technion/nica

Thank you! Questions?