© Copyright 2019 Xilinx FPGA as a First-Class Citizen in Cloud Computing Lintao Zhang Microsoft Research Asia
Page 1

FPGA as a First-Class Citizen in Cloud Computing

Lintao Zhang
Microsoft Research Asia

Page 2

Standard Disclaimer

• This presentation contains work of many people, including many of my colleagues, collaborators, and student interns.

• All mistakes are mine.

• This presentation contains only my personal opinions; it does not reflect the position of Microsoft.

Page 3

[Figure: energy efficiency (GigaOps/Watt, log scale) versus estimated year of deployment (2014 to 2020)]

Data points (GigaOps/Watt):
• Haswell (fp32): 8.6
• Broadwell (fp32): 9.1
• K80 (fp32): 29
• P100 (fp16): 71
• P40 (int8): 188
• V100 (fp 4x4x4): 400
• TPU1 (int8): 1227
• TPU2 (fp): 281
• A10 (int8): 69
• A10 (int4): 200
• S10 (int8): 240
• S10 (int6): 440
• S10 (int4): 640

Note: assumes power consumption of 160W/TPU2 chip (not confirmed)

CPU: End of frequency & Dennard scaling

The rise of domain-specific computing

Credit: Doug Burger

Page 4

Heterogeneous Architectures

                 CPU         GPU         FPGA    ASIC
Throughput       Low         High        High    Very high
Latency          High        Very high   Low     Low
Power            High        High        Low     Very low
Price at scale   High        Very high   Low     Low
Flexibility      Very high   High        High    Low

Page 5

Project Catapult: FPGA in Azure

Page 7

Deployment of FPGAs in Azure

[Figure: FPGA deployment in an Azure server. Labels: CPU/GPU/TPU compute + NVMe SSD storage; CPU; QPI; FPGA; 40Gb/s; 40Gb/s ToR; SmartNIC; hardware acceleration plane (Search, ML, Network, DB)]

Page 8

FPGA in Production: SmartNICs in Azure

High Performance FPGA-powered SDN

Bandwidth: highest bandwidth of any cloud so far; standard compute VMs get up to 32 Gbps

Latency: consistently low-latency network, 5x+ latency improvement, sub-15 us within tenants

Enables workloads requiring native performance to run in cloud VMs

Page 9

FPGA in Production: Project Brainwave

A Scalable FPGA-powered DNN Serving Platform

Performance: excellent inference performance at low batch sizes, >10x lower latency than CPUs and GPUs

Flexibility: supports CNN, LSTM, MLP, RL, and decision trees, and exploits sparsity and compression

Scale: deployed in Azure and serving production workloads

Page 10

Challenges of Using FPGA in Data Centers

• Programming FPGA is hard
  • Software developers are not familiar with HDL programming
• FPGA needs to communicate and work with other devices
  • No generic socket programming interface is available for hardware
• FPGA needs to provide value for data center workloads
  • FPGA is not traditionally leveraged by applications such as storage, database, stream processing, graph ...
• Algorithms need to adapt for FPGA
  • Need to leverage fine-grained parallelism and contend with bandwidth, capacity, and complexity limits

Page 14

Overview of The Talk

• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High-Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion

Page 15

Overview of the Talk

• Introduction
• Simplifying FPGA Application Development
  ClickNP: Highly Flexible and High-Performance Network Processing with Reconfigurable Hardware (SIGCOMM’16)
• Simplifying FPGA Communication
• A High-Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion

Page 16

Challenge: FPGA Programming

Writing and debugging HDL code is hard

88’h68656C6C6F20776F726C64
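The literal on this slide is a Verilog-style sized hexadecimal constant. As a small illustration of why such code is hard to read, the sketch below decodes it in Python. The function name `decode_verilog_hex` is a hypothetical helper written for this example, and the literal is typed with a plain ASCII apostrophe.

```python
def decode_verilog_hex(literal):
    """Split a sized hex literal such as 88'h... into (bit width, ASCII text)."""
    width_str, hex_digits = literal.split("'h")
    payload = bytes.fromhex(hex_digits)
    width = int(width_str)
    # The declared width should match the number of payload bits.
    if width != 8 * len(payload):
        raise ValueError("declared width does not match payload length")
    return width, payload.decode("ascii")


width, text = decode_verilog_hex("88'h68656C6C6F20776F726C64")
print(width, text)  # 88 hello world
```

The 88-bit constant is simply the 11-character string "hello world", which a software developer would normally write as a string literal.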

Page 17

ClickNP: Making FPGA accessible to software developers

• Flexible: fully programmable using a high-level language
• Modularized: Click abstractions familiar to software developers; easy code reuse
• High performance: high throughput; microsecond-scale latency
• Joint CPU/FPGA packet processing: FPGA is no panacea; fine-grained processing separation

Page 18

ClickNP Programming Model

Stream processing model

[Figure: elements A, B, and C]

Page 19

Element: single-threaded core

[Figure: a ClickNP element drawn as a single-threaded core. Labels: (main thread), (ISR), (interrupt), (I/O), (reg/mem)]

• Developed in a C-like language (OpenCL)
• Write once, run on both CPU and FPGA
• Can embed native Verilog code in a ClickNP element
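The element abstraction can be pictured with a small software model. The sketch below is not ClickNP syntax; it is a minimal Python analogy, with hypothetical names (`Element`, `connect`, `run_pipeline`), of one plausible reading of the figure labels: private state (reg/mem), input and output channels (I/O), a packet handler (main thread), and a signal handler (ISR) that the host can call.

```python
from collections import deque


class Element:
    """Illustrative element: private state, a packet handler, a signal handler."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler       # plays the role of the main thread
        self.state = {"count": 0}    # plays the role of reg/mem
        self.inbox = deque()         # input channel (I/O)
        self.outbox = None           # output channel, set by connect()

    def step(self):
        """Process at most one packet from the input channel."""
        if not self.inbox:
            return False
        out = self.handler(self.state, self.inbox.popleft())
        if out is not None and self.outbox is not None:
            self.outbox.append(out)
        return True

    def signal(self, key):
        """Plays the role of the ISR: answer a host query about private state."""
        return self.state.get(key)


def connect(elements, sink):
    """Chain elements so each output channel is the next element's input."""
    for up, down in zip(elements, elements[1:]):
        up.outbox = down.inbox
    elements[-1].outbox = sink


def count_and_forward(state, pkt):
    state["count"] += 1
    return pkt


def drop_odd(state, pkt):
    state["count"] += 1
    return pkt if pkt % 2 == 0 else None


def run_pipeline(packets):
    a = Element("A", count_and_forward)
    b = Element("B", drop_odd)
    c = Element("C", count_and_forward)
    sink = deque()
    connect([a, b, c], sink)
    a.inbox.extend(packets)
    progressed = True
    while progressed:
        progressed = False
        for element in (a, b, c):
            # Step every element each round; none of them shares state.
            progressed = element.step() or progressed
    return list(sink), [e.signal("count") for e in (a, b, c)]


print(run_pipeline(range(6)))  # ([0, 2, 4], [6, 6, 3])
```

Because elements interact only through channels and signals, each one can in principle be compiled to its own hardware block or run as a software thread, which is the property the slide highlights.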

Page 20

Workflow and Architecture

[Figure: ClickNP workflow and architecture. Labels: ClickNP script; ClickNP elements; ClickNP compiler; C compiler; vendor HLS; vendor libs; vendor-specific runtime; ClickNP library; ClickNP host manager; ClickNP host process (manager thread, worker thread); Catapult PCIe driver; PCIe I/O channel; host; FPGA; Catapult shell; ClickNP]

Cross-platform toolchain:
• Altera OpenCL / Vivado HLS
• Visual Studio / GCC

Page 21

Development Efficiency & Performance

Network Function   Lines of Code   Resource Util.   Throughput (vs. CPU)   Avg Latency (vs. CPU)
Packet generator   665             16%              18x                    25x
Packet capture     250             8%               17x                    18x
Firewall           538             32%              22x                    40x
IPSec gateway      695             35%              60x                    3x
Load balancer      860             36%              55x                    12x
Packet scheduler   584             11%              32x                    10x

[Figure: IPSec gateway performance comparisons]

Page 22

Overview of the Talk

• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
  Direct Universal Access: Making Data Center Resources Available to FPGA (NSDI’19)
• A High Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion

Page 23

Current FPGA Communication Architecture

[Figure: layered view of an FPGA board. Layer labels: application layer; transport layer; data link layer; physical layer. Component labels: applications; DDR stack, LTL (with DCQCN), NVMe, host DMA; DDR4 IP, PCIe Gen3 IP, QSFP IP, DMA; DDR4, PCIe Gen3, QSFP; FPGA chip; FPGA board]

Page 24

Current FPGA Communication Architecture

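[Figure: the same FPGA board stack with only the MAC layer and physical layer named; the layer between the applications and the stacks is marked "???". Component labels: applications; DDR stack, LTL (with DCQCN), NVMe, host DMA; DDR4 IP, PCIe Gen3 IP, QSFP IP, DMA; DDR4, PCIe Gen3, QSFP; FPGA chip; FPGA board]

Image from Lixia Zhang et al., CCR 2014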

Page 25

Direct Universal Access

[Figure: DUA topology. Labels: two servers; CPU; FPGA 1, FPGA 2, FPGA 3; App 1, App 2; DUA; LTL; DDR access; FPGA Connect; host DMA; QSFP; DDR; PCIe Gen3; intra-server networking fabric; datacenter networking fabric; path markers ①, ③, ④]

Page 26

DUA Overview

[Figure: DUA topology. Labels: servers; CPU; FPGAs; apps; DUA; LTL; DDR access; FPGA Connect; host DMA; QSFP; DDR; PCIe Gen3; intra-server networking fabric; datacenter networking fabric; path markers ①, ③, ④]

DUA is an “IP layer”
• An abstract overlay network
• Leverages all existing h/w stacks
• Hierarchical addressing & routing
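Hierarchical addressing and routing can be illustrated with a toy lookup. The sketch below is an assumption-laden illustration, not DUA's actual address format or routing logic: the three-level address (`server`, `device`, `resource`) and the function `pick_fabric` are hypothetical, and only the two fabric names come from the slide.

```python
from typing import NamedTuple


class DuaAddress(NamedTuple):
    """Hypothetical three-level address: which server, which device, which resource."""
    server: str
    device: str
    resource: str


def pick_fabric(src, dst):
    """Choose the underlay by comparing addresses from the top level down."""
    if src.server != dst.server:
        return "datacenter networking fabric"
    if src.device != dst.device:
        return "intra-server networking fabric"
    return "on-device stack"


app = DuaAddress("server-1", "fpga-1", "app-1")
local_ddr = DuaAddress("server-1", "fpga-1", "ddr")
peer_fpga = DuaAddress("server-1", "fpga-2", "ddr")
remote_fpga = DuaAddress("server-2", "fpga-3", "ddr")

for dst in (local_ddr, peer_fpga, remote_fpga):
    print(dst.resource, "on", dst.device, "via", pick_fabric(app, dst))
```

The point of the hierarchy is that the sender only needs the destination address; the layer underneath decides which existing stack carries the traffic.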

Page 27

DUA Overview

[Figure: DUA topology diagram, with the same labels as on the previous slide]

DUA is an “IP layer”
• Efficient routing
• Direct resource access by FPGA, totally bypassing the CPU

Page 28

DUA Overview

[Figure: DUA topology diagram, with the same labels as on the previous slide]

DUA is an “IP layer”
• Efficient routing
• Compatible BSD-socket interface for both applications and communication stacks

Page 29

DUA Overview

[Figure: DUA topology diagram, with the same labels as on the previous slide]

DUA is an “IP layer”
• Efficient routing
• Compatible BSD-socket interface
• General multiplexing for both applications and communication stacks

Page 30

DUA Overview

[Figure: DUA topology diagram, with the same labels as on the previous slide]

DUA is an “IP layer”
• Efficient routing
• Compatible BSD-socket interface
• Unified multiplexing
• Security: protect against both inside and outside attacks

Page 31

System Architecture

[Figure: DUA system architecture, highlighting the DUA data plane. Labels: servers; CPUs; NICs; FPGAs; apps; CPU control agent; FPGA control agent; DUA overlay; DUA underlay; FPGA host stack; NVMe stack; LTL; FPGA Connect; host DMA; DDR controller; QSFP; PCIe Gen3; DDR; intra-server networking fabric; datacenter networking fabric]

Page 32

System Architecture

[Figure: DUA system architecture, highlighting the DUA data plane and the DUA control plane. Labels: servers; CPUs; NICs; FPGAs; apps; CPU control agent; FPGA control agent; DUA overlay; DUA underlay; FPGA host stack; NVMe stack; LTL; FPGA Connect; host DMA; DDR controller; QSFP; PCIe Gen3; DDR; intra-server networking fabric; datacenter networking fabric]

Page 33

Evaluation – Regex Matching

• Up to 10^5~10^7 higher than CPU, up to 10^5 lower than CPU
• Up to 3 times the throughput and up to 55% latency reduction compared to using the CPU to move data between FPGAs

[Figure: throughput (GB/s) vs. input string length (bytes); series: through FPGA Connect, through CPU, pure CPU]

[Figure: latency (us, log scale) vs. input string length (64 to 16384 bytes); series: through FPGA Connect, through CPU, pure CPU]

Page 34

Overview of the Talk

• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High Performance FPGA Based Key Value Store
  KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC (SOSP’17)
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion

Page 35

Key-Value Store in Data Centers

Web Cache

Page 36

Key-Value Store in Data Centers

Web Cache Shared Data Structure

Page 37

Key-Value Store in Data Centers

Design Goals


Page 40

Key-Value Store Architectures

Bottleneck: network stack in OS (~300 Kops per core)

Communication overhead: multiple round-trips per KV operation (fetch index, data)
Synchronization overhead: write operations

e.g. DPDK, mTCP, libvma, two-sided RDMA
Bottlenecks: CPU random memory access and KV operation computation (~5 Mops per core)

Page 41

Key-Value Store Architectures

Offload KV processing from the CPU to a programmable NIC

Page 42

Performance Challenges

[Figure: hardware setup. Labels: ToR switch; 40 GbE; FPGA; on-board DRAM (4 GB); PCIe Gen3 x16 DMA; host DRAM (256 GB)]

13 GB/s, 120 Mops

Header overhead and limited parallelism: be frugal on memory accesses

Page 43

Performance Challenges

[Figure: hardware setup. Labels: ToR switch; 40 GbE; FPGA; on-board DRAM (4 GB); PCIe Gen3 x16 DMA; host DRAM (256 GB)]

1 us delay, 120 Mops

Atomic operations have dependency: PCIe latency hiding
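The need for latency hiding follows from simple arithmetic (Little's law). The sketch below uses only the figures quoted on these slides (1 us delay with 120 Mops, and 0.2 us delay with 100 Mops); the function names are hypothetical and the calculation is a back-of-the-envelope estimate, not a model of the actual design.

```python
def required_in_flight(throughput_mops, latency_us):
    """Little's law: operations in flight = throughput x latency.

    Mops multiplied by microseconds gives a plain count of operations.
    """
    return throughput_mops * latency_us


def dependent_chain_mops(latency_us):
    """If every operation must wait for the previous one, rate = 1 / latency."""
    return 1.0 / latency_us


# Independent requests: how many must be outstanding to keep the link busy.
print(required_in_flight(120, 1.0))   # about 120 requests at 1 us
print(required_in_flight(100, 0.2))   # about 20 requests at 0.2 us

# Fully dependent requests (for example, atomics on one key) at 1 us each.
print(dependent_chain_mops(1.0))      # 1.0 Mops
```

Under these assumptions, a chain of dependent operations that waits a full 1 us per step would reach roughly 1 Mops, two orders of magnitude below the 120 Mops figure, which is why dependencies have to be resolved without waiting on every round trip.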

Page 44

Performance Challenges

[Figure: hardware setup. Labels: ToR switch; 40 GbE; FPGA; on-board DRAM (4 GB); PCIe Gen3 x16 DMA; host DRAM (256 GB)]

1 us delay, 120 Mops

0.2 us delay, 100 Mops

Load dispatch

Page 45

Performance Challenges

[Figure: hardware setup. Labels: ToR switch; 40 GbE; FPGA; on-board DRAM (4 GB); PCIe Gen3 x16 DMA; host DRAM (256 GB)]

1 us delay, 120 Mops

0.2 us delay, 100 Mops

60 Mpps

• Client-side batching
• Vector-type operations

Page 46

KV Processor Architecture

[Figure: KV processor architecture. Labels: NIC; PCIe; DRAM]

Page 47

KV-Direct Performance

Page 48

Scalability with Multiple NICs

Page 49

Scalability with Multiple NICs

1.22 billion KV op/s
357 watts power
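To put the two numbers on this slide in perspective, the sketch below derives throughput per watt and compares the total against the per-core figures quoted earlier in the talk (~300 Kops per core for an OS network stack, ~5 Mops per core for kernel-bypass designs). This is rough arithmetic on the slide's figures only; the function names are hypothetical.

```python
def mops_per_watt(ops_per_s, watts):
    """Throughput per watt, in millions of operations per second per watt."""
    return ops_per_s / watts / 1e6


def equivalent_cores(ops_per_s, per_core_ops_per_s):
    """How many CPU cores would be needed at a given per-core rate."""
    return ops_per_s / per_core_ops_per_s


TOTAL_OPS = 1.22e9   # KV op/s, from the slide
POWER_W = 357        # watts, from the slide

print(round(mops_per_watt(TOTAL_OPS, POWER_W), 2))   # 3.42
print(round(equivalent_cores(TOTAL_OPS, 5e6)))       # 244
print(round(equivalent_cores(TOTAL_OPS, 300e3)))     # 4067
```

Under these assumptions, matching the same rate would take on the order of hundreds of cores with a kernel-bypass design and thousands with an OS network stack.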

Page 50

Overview of the Talk

• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
  Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity (FPGA’19)
• Conclusion

Page 51

Low Latency DNN Inference is Important

[Figure: recurrent cell with input $x_t$, output $y_t$, previous output $y_{t-1}$, and previous cell state $c_{t-1}$]

Matrix-vector multiplication is the workhorse of DNN inference

Page 52

Han, Song, et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS’15

Redundancy in neural networks

Weights Distribution

Page 53

Weight Pruning

1. Train Connectivity
2. Prune Connections
3. Train Weights

Pruning rule: for all $w_i$, if $|w_i| \le \text{Threshold}$ then $w_i \leftarrow 0$

• Highly unstructured sparse pattern
• Sparsity = 50%~90% (near-zeros)
• Difficult to accelerate
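The prune step above can be sketched in a few lines. This is a minimal illustration of threshold-based magnitude pruning on a flat list of weights; the sample weights and the threshold value are made up for the example, and the retraining steps (1 and 3) are not shown.

```python
def magnitude_prune(weights, threshold):
    """Set every weight whose magnitude is at or below the threshold to zero."""
    return [0.0 if abs(w) <= threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


weights = [0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4]   # illustrative values
pruned = magnitude_prune(weights, threshold=0.5)

print(pruned)            # [0.8, 0.0, 0.0, 1.5, 1.0, 0.0, 0.0, -1.4]
print(sparsity(pruned))  # 0.5
```

Nothing in this rule constrains where the zeros end up, which is why the resulting pattern is irregular and hard for parallel hardware to exploit.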

Page 54

Accuracy and Speedup Tradeoff

Fine-grained (irregular)
Pros:
• High model accuracy
• High compression ratio
Cons:
• Irregular pattern
• Difficult to accelerate

Coarse-grained (regular)
Pros:
• Regular pattern
• Easy to accelerate
Cons:
• Low model accuracy
• Low compression ratio

Page 55

Speedup and Accuracy Tradeoff

[Figure: model loss (the smaller the better) vs. sparsity (0% to 100%); series: original, fine-grained, block size 4*4, block size 8*8, block size 16*16]

• Fine-grained: accurate but irregular
• Coarse-grained: regular but not accurate

Page 57

Bank Balanced Pruning

Dense matrix row, after bank split:
index: 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15
value: 0.8 -0.1 0.2 1.5 | 1.0 0.3 -0.4 -1.4 | 0.7 2.0 0.9 -0.5 | 1.2 -1.3 2.1 0.2

BBS matrix row (weights kept after pruning):
0.8 1.5 | 1.0 -1.4 | 2.0 0.9 | -1.3 2.1

• Traverse all rows
• Fine-grained pruning inside each bank
• Threshold percentage to obtain identical sparsity ratio among banks
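The example row on this slide can be reproduced with a short sketch. It assumes four banks of four weights and 50% sparsity (two weights kept per bank), which is consistent with the values shown on the slide; the function name `bank_balanced_prune` is a hypothetical helper, not code from the paper.

```python
def bank_balanced_prune(row, num_banks, keep_per_bank):
    """Zero all but the keep_per_bank largest-magnitude weights in each bank."""
    if len(row) % num_banks != 0:
        raise ValueError("row length must be a multiple of num_banks")
    bank_size = len(row) // num_banks
    pruned = []
    for b in range(num_banks):
        bank = row[b * bank_size:(b + 1) * bank_size]
        # Rank positions inside this bank by absolute value, largest first.
        ranked = sorted(range(bank_size), key=lambda i: abs(bank[i]), reverse=True)
        keep = set(ranked[:keep_per_bank])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(bank))
    return pruned


row = [0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4,
       0.7, 2.0, 0.9, -0.5, 1.2, -1.3, 2.1, 0.2]
pruned = bank_balanced_prune(row, num_banks=4, keep_per_bank=2)
kept = [w for w in pruned if w != 0.0]

print(kept)  # [0.8, 1.5, 1.0, -1.4, 2.0, 0.9, -1.3, 2.1]
```

Pruning stays fine-grained inside a bank, but every bank ends up with the same number of non-zeros, which is the regularity the hardware relies on.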

Page 58

Weight map visualization

Page 59

Bank 0 Bank 1

Weight map visualization

Page 60

Speech Recognition on TIMIT dataset

Language model PTB dataset

Model Accuracy

Page 61

Speech Recognition on TIMIT dataset

Language model PTB dataset

Very close

Model Accuracy

Very close

Page 62

Sparse MV Multiplication (SpMxV)

BBS matrix row:
Bank 0: A 0 B
Bank 1: C D 0
Bank 2: 0 E F
Bank 3: G 0 H

Dense vector:
Bank 0: V0 V1 V2
Bank 1: V3 V4 V5
Bank 2: V6 V7 V8
Bank 3: V9 V10 V11

Step 1: multiply A, C, E, G with V0, V3, V7, V9
Partial dot product: S1 = V0A+V3C+V7E+V9G
Accumulate: S1

Page 63

Sparse MV Multiplication (SpMxV)

BBS matrix row:
Bank 0: A 0 B
Bank 1: C D 0
Bank 2: 0 E F
Bank 3: G 0 H

Dense vector:
Bank 0: V0 V1 V2
Bank 1: V3 V4 V5
Bank 2: V6 V7 V8
Bank 3: V9 V10 V11

Step 2: multiply B, D, F, H with V2, V4, V8, V11
Partial dot product: S2 = V2B+V4D+V8F+V11H
Accumulate: S1+S2
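The two steps shown on these slides can be replayed in software. The sketch below is a sequential Python model of the bank-parallel schedule, not the hardware design: it takes one non-zero from every bank per step, forms the partial dot product, and accumulates. The numeric stand-ins for A..H and V0..V11 are made up for the example.

```python
def bbs_spmv_row(row, vector, num_banks):
    """Dot product of one bank-balanced sparse row with a dense vector.

    Returns the per-step partial sums and the accumulated total.
    """
    bank_size = len(row) // num_banks
    # Per bank: (value, position in the full row) for each non-zero weight.
    banks = [
        [(row[i], i) for i in range(b * bank_size, (b + 1) * bank_size) if row[i] != 0]
        for b in range(num_banks)
    ]
    steps = len(banks[0])
    if any(len(bank) != steps for bank in banks):
        raise ValueError("banks must hold the same number of non-zeros")
    partials = []
    for s in range(steps):
        # One non-zero per bank per step; these multiplies are independent.
        partials.append(sum(value * vector[idx] for value, idx in (bank[s] for bank in banks)))
    return partials, sum(partials)


# A..H stand for 1..8 and V0..V11 stand for 1..12 (illustrative values).
row = [1, 0, 2, 3, 4, 0, 0, 5, 6, 7, 0, 8]
vector = list(range(1, 13))

partials, total = bbs_spmv_row(row, vector, num_banks=4)
print(partials, total)  # [123, 176] 299
```

Here the first partial sum corresponds to V0A+V3C+V7E+V9G and the second to V2B+V4D+V8F+V11H. Because every bank contributes exactly one operand per step, each step can draw its vector elements from different banks at once.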

Page 64

Accelerator Overview

[Figure: FPGA accelerator block diagram. Labels: host server; PCIe controller; DMA; DRAM controller; off-chip DRAM; controller; instruction buffer; matrix memory (values, indices); vector memory; SpMxV PE; private vector buffer; multipliers (*); adders (+); EWOP; ACT; output]

Page 68

Hardware Efficiency

Page 69

~34x ~7x

Hardware Efficiency

Page 70

Challenges of Using FPGA in Data Centers

• Programming FPGA is hard
  • Programming network functions with ClickNP
• FPGA needs to communicate and work with other devices
  • Facilitate communication with Direct Universal Access
• FPGA needs to provide value for data center workloads
  • KV-Direct is a fast key-value store
• Algorithms need to adapt for FPGA
  • Bank Balanced Sparsity accelerates DNN inference

Page 71

Conclusion

• FPGA is becoming a key resource in the cloud
• It is not easy to leverage FPGA for data center workloads
• We did several pieces of work to
  • Reduce development cost
  • Simplify communication
  • Find valuable applications to accelerate
  • Adapt algorithms to fit FPGA
• Many challenges remain to be addressed
• The trend is clear: FPGA is a first-class citizen in the cloud, and future applications need to make good use of it

Page 72

Thanks!

