
Accelerating MemCached on Cloud FPGAs - Xilinx · 2019-10-11

Date post: 20-Mar-2020
Transcript
Page 1

LegUp: Accelerating Memcached on Cloud FPGAs

Xilinx Developer Forum, December 10, 2018

Andrew Canis & Ruolong Lian, LegUp Computing Inc.

Page 2

COMPUTE IS BECOMING SPECIALIZED

GPU: Nvidia graphics cards are being used for floating-point computations.

TPU: Google's tensor processing unit, used for machine learning.

FPGA: Reconfigurable hardware. FPGAs excel at real-time data processing.

Page 3

LEGUP HLS PLATFORM: A Unified Hardware Acceleration Platform

(Diagram labels: Software, Test/Debug, Hardware System, CPU, Vendor-Agnostic Hardware.)

Page 4

The Era of FPGA Cloud Computing is Here

June 2014: Microsoft accelerates Bing Search with FPGAs
Oct 2016: Microsoft rolls out FPGAs in every new datacenter
Nov 2016: Amazon and Nimbix deploy FPGAs in their cloud
Jan 2017: Alibaba and Tencent deploy FPGAs in their cloud
Jul 2017 / Sept 2017: Baidu and Huawei deploy FPGAs in their cloud
Aug 2018: SKT deploys FPGAs for AI acceleration

Page 5

CLOUD PLATFORM

• Network processing engines on cloud FPGAs and on-premises FPGA acceleration cards

Flow: a real-time data stream arrives from the network, you input your C/C++ code, hardware compilation maps it onto a cloud or on-prem FPGA server for real-time analysis, and the results are output to the network.

void accelerator(FIFO *input, FIFO *output) {
    int in = fifo_read(input);
    loop: for (int i = 0; i < NUM; i++) {
        // ...
    }
}

Page 6

What is Memcached?

• Memcached is a distributed in-memory key-value store
• Used as a cache by Facebook, Twitter, Reddit, YouTube, etc.
• Facebook's Memcached cluster handles billions of requests per second
• Memcached commands:
  • Set key value
  • Get key
• Typical deployments:
  • Amazon ElastiCache
  • Google Cloud App Engine
  • Self-hosted
• Easy horizontal scaling: a cluster of Memcached servers handles the load
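The Set and Get commands above travel over Memcached's text protocol. As a sketch, the two requests can be built like this (the function names `make_set` and `make_get` are illustrative, not part of any Memcached API; flags and expiry are fixed at 0):

```cpp
#include <string>

// Build a Memcached text-protocol "set" request:
//   set <key> <flags> <exptime> <bytes>\r\n<value>\r\n
std::string make_set(const std::string &key, const std::string &value) {
    return "set " + key + " 0 0 " + std::to_string(value.size()) +
           "\r\n" + value + "\r\n";
}

// Build a Memcached text-protocol "get" request: get <key>\r\n
std::string make_get(const std::string &key) {
    return "get " + key + "\r\n";
}
```

For example, `make_set("k", "hello")` yields `set k 0 0 5\r\nhello\r\n`, which a client would write to the server's TCP socket.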

Page 7

Introducing: World's Fastest Cloud-Hosted Memcached

9X LOWER LATENCY | 9X HIGHER REQUESTS/SEC | 10X LOWER TCO

Powered by AWS FPGAs with LegUp's Platform. Easy to deploy, lower TCO, 10 Gbps network.

Page 8

Memcached vs. AWS ElastiCache

• Benchmarked Memcached against AWS ElastiCache
  • AWS provides a fully-managed CPU Memcached service
  • Different instance types based on RAM size, network bandwidth, and hourly cost
  • Chose the ElastiCache instance with the closest specs to F1

AWS Instance             vCPUs   RAM      Network Speed    Cost
cache.r4.4xlarge (CPU)   16      101 GB   Up to 10 Gbps    $1.82/hour
f1.2xlarge (FPGA)        8       122 GB   Up to 10 Gbps    $1.65/hour

Page 9

Experimental Setup

• memtier_benchmark: an open-source Memcached benchmarking tool
• 100-byte data size, pipelining (batching) of 16
• Varied the number of connections to Memcached
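The exact command line is not given in the slides; an invocation matching the stated parameters might look like the following (check your memtier_benchmark version's flags, and substitute your own server address and connection counts):

```shell
memtier_benchmark -s <server-ip> -p 11211 -P memcache_text \
    -d 100 --pipeline=16 -c 25 -t 4
```

Here `-d 100` sets the 100-byte data size, `--pipeline=16` the batching factor, and `-c`/`-t` control connections per thread and thread count, which the experiment varied.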

Page 10

Throughput Results

• Up to 9X better ops/sec vs. ElastiCache

Page 11

Latency Results

• Up to 9X lower latency vs. ElastiCache

Page 12

Total Cost of Ownership Results

• Up to 10X better throughput/$ vs. AWS ElastiCache
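The 10X figure follows from the earlier numbers: roughly 9X the throughput at a lower hourly cost ($1.65 vs. $1.82). A one-line check (the function name is ours, for illustration):

```cpp
// Throughput-per-dollar advantage relative to the CPU baseline:
// speedup in ops/sec, scaled by the cost ratio (cheaper instance => bigger win).
double relative_throughput_per_dollar(double speedup, double cost_fpga,
                                      double cost_cpu) {
    return speedup * (cost_cpu / cost_fpga);
}
```

With the slide's numbers, `relative_throughput_per_dollar(9.0, 1.65, 1.82)` gives about 9.9, consistent with the "up to 10X" claim.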

Page 13

Where is the speedup coming from?

1. We accelerated both the TCP/IP network stack and Memcached completely in the FPGA
2. Fully pipelined FPGA hardware: a new input every clock cycle
3. Multiple Memcached commands in flight, processed in streaming fashion
4. At high packets/sec, a software network stack can become a bottleneck

Page 14

Memcached Demonstration on AWS F1

• Live demo from our website: http://www.legupcomputing.com/main/memcached_demo
• Spins up an AWS F1 instance and another client instance

Pages 15-18

(Demo screenshots, steps 1 through 4.)

Page 19

AWS Cloud-Deployed FPGAs (F1)

• On F1, the FPGA is not directly connected to the network
• The CPU is connected to the network, and the FPGA is connected to the CPU over PCIe

Page 20

Memcached System Architecture

(Diagram: an AWS F1 instance. The CPU, attached to the 10 Gbps network, runs the Virtual Network to FPGA software. Over PCIe, the FPGA runs the Virtual Network to FPGA hardware, a TCP/IP Offload Engine, and the Memcached Accelerator, backed by 64 GB of DDR4.)

Page 21

Virtual Network to FPGA (VN2F)

• VN2F SW: bypass the Linux kernel, send/receive raw network packets, DMA from/to the FPGA
• VN2F HW: split/combine DMA data to/from individual network packets
• Each direction takes 20~50 us; transfers are overlapped

(Diagram: the application's network stack runs over VN2F S/W, which uses DPDK F1 drivers for kernel bypass across PCIe; VN2F H/W in the F1 Shell splits and combines the packets.)
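The slides don't specify how VN2F frames individual packets inside a DMA transfer, so the following is only a software model under an assumed framing (each packet prefixed by a 16-bit little-endian length; function names are ours):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Split one DMA buffer into individual network packets.
// Assumed framing: [len_lo][len_hi][payload...] repeated.
std::vector<std::string> split_buffer(const std::string &buf) {
    std::vector<std::string> pkts;
    size_t pos = 0;
    while (pos + 2 <= buf.size()) {
        uint16_t len = (uint8_t)buf[pos] | ((uint8_t)buf[pos + 1] << 8);
        pos += 2;
        pkts.push_back(buf.substr(pos, len));
        pos += len;
    }
    return pkts;
}

// Combine packets into one DMA buffer using the same framing.
std::string combine_packets(const std::vector<std::string> &pkts) {
    std::string buf;
    for (const auto &p : pkts) {
        buf.push_back((char)(p.size() & 0xff));
        buf.push_back((char)(p.size() >> 8));
        buf += p;
    }
    return buf;
}
```

The two functions are inverses, which is the invariant the VN2F HW split/combine pair must maintain regardless of the actual framing used.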

Page 22

Network Offload: TCP/IP & UDP

• Supports TCP/UDP/IP network protocols
• 10 Gbps Ethernet support
• 1000s of TCP connections
• Implemented in C++, synthesized by LegUp
• Can be used by other applications
• Interfaces with the application via AXI-S

* D. Sidler, et al., Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware, in FCCM’15

Page 23

Memcached Core

• The Memcached core is fully pipelined with an initiation interval of 1
• The Request Decoder block decodes requests and partitions them into key and value pairs
• Hash Lookup hashes keys to hash values and looks up the corresponding addresses
• Values are stored to / retrieved from memory by the Value Store block
• The Response Encoder creates Memcached responses to return to the clients

(Diagram: Requests → Request Decoder → keys/values → Hash Lookup → addresses → Value Store → values → Response Encoder → Responses.)

* M. Blott et. al., “Achieving 10Gbps Line-rate Key-value Stores with FPGAs,” in Hot Cloud, 2013
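As a rough software model of the four stages (sequential here, whereas the FPGA core streams one request per clock; the class and its hashing/collision behavior are our simplification, not the actual hardware design):

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// One request as produced by the Request Decoder stage.
struct Request { std::string cmd, key, value; };

class MemcachedModel {
    std::unordered_map<size_t, std::string> value_store;  // address -> value
public:
    std::string handle(const Request &req) {
        // Hash Lookup: hash the key to an address.
        size_t addr = std::hash<std::string>{}(req.key);
        // Value Store + Response Encoder.
        if (req.cmd == "set") {
            value_store[addr] = req.value;
            return "STORED\r\n";
        }
        auto it = value_store.find(addr);
        if (it == value_store.end()) return "END\r\n";
        return "VALUE " + req.key + " 0 " + std::to_string(it->second.size()) +
               "\r\n" + it->second + "\r\nEND\r\n";
    }
};
```

In hardware each stage is a separate pipelined block, so a new request can enter the Request Decoder while earlier ones are still in Hash Lookup or the Value Store.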

Page 24

Network Bandwidth on AWS F1

• The f1.2xlarge instance has an "Up to 10 Gbps" network
• Bottlenecks: bandwidth and packets per second (PPS)
• Max PPS is around 700K; small packets cannot saturate the 10 Gbps network

(Figure: bandwidth and packets-per-second plots against the bandwidth limit.)

Page 25

Memcached Request Batching

• Batching in Memcached permits packing multiple requests into a single network packet
  • Reduces packet-processing overhead
  • Important for performance, especially when PPS is the bottleneck
• The Batching Adapter splits aggregated requests into individual requests
  • Sends each request to the Memcached core in a pipelined fashion

Without pipelining: PKT1 REQ1 | PKT2 REQ2 | PKT3 REQ3 | PKT4 REQ4
With pipelining:    PKT1 REQ1 REQ2 REQ3 | PKT2 REQ4 REQ5 REQ6 ...

Batching is not needed for on-premises FPGAs.
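A minimal sketch of the Batching Adapter's split step, assuming text-protocol requests that each end in "\r\n" (this ignores "set", whose data block can itself contain "\r\n"; the real adapter must parse the protocol, and the function name is ours):

```cpp
#include <string>
#include <vector>

// Split one batched packet into individual requests, one per "\r\n"-terminated
// line, so they can be fed to the Memcached core one per cycle.
std::vector<std::string> split_requests(const std::string &packet) {
    std::vector<std::string> reqs;
    size_t pos = 0;
    while (pos < packet.size()) {
        size_t end = packet.find("\r\n", pos);
        if (end == std::string::npos) break;  // incomplete trailing request
        reqs.push_back(packet.substr(pos, end - pos));
        pos = end + 2;
    }
    return reqs;
}
```

For example, a single packet carrying `get a\r\nget b\r\n` becomes two independent requests, matching the "with pipelining" row above.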

Page 26

Streaming b/w Host & Kernel in SDAccel

(Diagram: the host connects over PCIe to custom logic on the FPGA through the HDK Shell interface. A PCIS AXI-4 port provides WR and RD channels, and BAR1 provides AXI-L read & write. MM2S and S2MM engines move data through RX/TX FIFOs, with an AXI-S data count, to and from the streaming kernel over AXI-S.)

Page 27

Streaming b/w Host & Kernel in SDAccel (continued)

(Same diagram, adding the kernel interface and DDR memory: the MM2S and S2MM engines also connect to DDR memory over AXI-4.)

Page 28

Streaming b/w Host & Kernel in SDAccel (continued)

(Same diagram, highlighting the host-to-kernel path: the MM2S engine reads AXI-4 data and streams it to the kernel over AXI-S, driven by a transfer size.)

Page 29

Streaming b/w Host & Kernel in SDAccel (continued)

(Same diagram, highlighting the kernel-to-host path through the S2MM engine.)

Page 30

Streaming b/w Host & Kernel in SDAccel (continued)

(Same diagram: a Batch Accumulator, connected by an AXI-S ready/valid interface, feeds the S2MM engine, which writes out over AXI-4.)

The Batch Accumulator sends the transfer size to the AXI-S pipe when:
• Enough data has been accumulated
• There has been no new incoming data for X cycles
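The flush rule above can be sketched as a cycle-level state machine; the byte threshold and idle limit here are made-up illustrative values (the slides leave both unspecified), and the class is our model, not the actual RTL:

```cpp
#include <cstddef>

// Cycle-level model of the Batch Accumulator's flush rule: emit a transfer
// size once enough bytes have accumulated, or once the input has been idle
// for IDLE_LIMIT cycles while data is pending.
class BatchAccumulator {
    size_t bytes_ = 0, idle_ = 0;
public:
    static const size_t FLUSH_BYTES = 4096;  // "enough data" (assumed)
    static const size_t IDLE_LIMIT  = 8;     // "X cycles" (assumed)

    // Called once per cycle with the bytes arriving that cycle.
    // Returns the transfer size to send downstream, or 0 for none.
    size_t cycle(size_t incoming_bytes) {
        if (incoming_bytes) { bytes_ += incoming_bytes; idle_ = 0; }
        else if (bytes_)    { ++idle_; }
        if (bytes_ && (bytes_ >= FLUSH_BYTES || idle_ >= IDLE_LIMIT)) {
            size_t n = bytes_;
            bytes_ = 0;
            idle_ = 0;
            return n;
        }
        return 0;
    }
};
```

The idle-timeout path matters for latency: without it, a small trailing batch would sit in the accumulator until more data happened to arrive.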

Page 31

Come talk to us about:
• Memcached acceleration
• FPGA network stack
• SDAccel streaming handler
• LegUp high-level synthesis tool
• Any other FPGA acceleration needs

Andrew Canis & Ruolong Lian

www.LegUpComputing.com

[email protected] | 647-834-6654 | Toronto, Canada

