
Slide 1

FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks

Yijin Guan (1), Zhihang Yuan (1), Guangyu Sun (1,3), Jason Cong (2,3,1)

(1) Center for Energy-Efficient Computing and Applications, Peking University, China
(2) Computer Science Department, University of California, Los Angeles, USA
(3) PKU/UCLA Joint Research Institute in Science and Engineering

Slide 2

Deep Learning

[Figure: deep learning scenarios and applications]

Slide 3

Recurrent Neural Network

[Figure: a feed-forward NN (input, hidden, and output layers) next to an RNN, where the hidden layer at time t also receives a recurrent connection from the hidden layer at time t-1]

Slide 4

Recurrent Neural Network

[Figure: the RNN unfolded over time steps t-1, t, and t+1, each with its own input, hidden, and output layers]

An RNN unfolds into a deep feed-forward network over time.

Slide 5

Long Short-Term Memory

[Figure: the input gate, applying weights W_xi and W_hi to x_t and h_{t-1}]

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)

Slide 6

Long Short-Term Memory

[Figure: the forget gate added alongside the input gate, applying weights W_xf and W_hf]

f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)

Slide 7

Long Short-Term Memory

[Figure: the cell gate added, applying weights W_xc and W_hc to produce the cell candidate c̃_t]

c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)

Slide 8

Long Short-Term Memory

[Figure: the cell state c_t updated from the previous state c_{t-1}, gated by f_t and i_t]

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

Slide 9

Long Short-Term Memory

[Figure: the output gate added, applying weights W_xo and W_ho]

o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)

Slide 10

Long Short-Term Memory

[Figure: the complete LSTM cell with input, forget, cell, and output gates producing the hidden state h_t]

h_t = o_t ⊙ tanh(c_t)
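The five gate equations on the preceding slides can be collected into one time-step update. The sketch below is a plain C++ reference implementation of those equations; the helper names (`affine`, `lstm_step`), the `Vec`/`Mat` types, and the dense nested loops are illustrative choices, not the paper's actual HLS source.

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Pre-activation of one gate: W*x + U*h + b, evaluated element-wise.
static Vec affine(const Mat& W, const Vec& x,
                  const Mat& U, const Vec& h, const Vec& b) {
    Vec y(b);
    for (size_t i = 0; i < y.size(); ++i) {
        for (size_t j = 0; j < x.size(); ++j) y[i] += W[i][j] * x[j];
        for (size_t j = 0; j < h.size(); ++j) y[i] += U[i][j] * h[j];
    }
    return y;
}

// One LSTM time step: updates the cell state c and hidden state h in place.
void lstm_step(const Vec& x, Vec& h, Vec& c,
               const Mat& Wxi, const Mat& Whi, const Vec& bi,
               const Mat& Wxf, const Mat& Whf, const Vec& bf,
               const Mat& Wxc, const Mat& Whc, const Vec& bc,
               const Mat& Wxo, const Mat& Who, const Vec& bo) {
    Vec i = affine(Wxi, x, Whi, h, bi);   // input gate pre-activation
    Vec f = affine(Wxf, x, Whf, h, bf);   // forget gate pre-activation
    Vec g = affine(Wxc, x, Whc, h, bc);   // cell candidate pre-activation
    Vec o = affine(Wxo, x, Who, h, bo);   // output gate pre-activation
    for (size_t k = 0; k < h.size(); ++k) {
        float it = sigmoid(i[k]);
        float ft = sigmoid(f[k]);
        float gt = std::tanh(g[k]);       // c̃_t
        float ot = sigmoid(o[k]);
        c[k] = ft * c[k] + it * gt;       // c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
        h[k] = ot * std::tanh(c[k]);      // h_t = o_t ⊙ tanh(c_t)
    }
}
```

Each matrix-vector product here is exactly the kind of loop nest that the optimizations on the following slides target.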

Slide 11

Why FPGA

[Figure: radar charts comparing CPU, GPU, FPGA, and ASIC along five axes: adaptability, performance, energy efficiency, programmability, and scalability]

Slide 12

Design Challenges and Optimizations

[Figure: the FPGA chip, containing a computation engine and data buffers, connected to off-chip memory]

Slide 13

Design Challenges and Optimizations

[Figure: the same architecture diagram, highlighting the computation engine]

Computation resources & performance:
- Loop unrolling
- Deep pipelining
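Loop unrolling and deep pipelining are expressed in Vivado HLS through pragmas on the loop nest. The matrix-vector kernel below is a sketch of that style; the sizes, the loop labels, the unroll factor of 4, and II=1 are illustrative choices, not figures from the paper.

```cpp
constexpr int N = 8;   // rows (hidden size, illustrative)
constexpr int M = 8;   // cols (input size, illustrative)

// y = W * x, written the way Vivado HLS expects loop nests.
void matvec(const float W[N][M], const float x[M], float y[N]) {
Row:
    for (int i = 0; i < N; ++i) {
// Pipeline the row loop: start a new row every cycle (II=1).
#pragma HLS PIPELINE II=1
        float acc = 0.0f;
Col:
        for (int j = 0; j < M; ++j) {
// Unroll the column loop by 4: four multiply-accumulates per iteration.
#pragma HLS UNROLL factor=4
            acc += W[i][j] * x[j];
        }
        y[i] = acc;
    }
}
```

Unrolling trades DSP/LUT resources for parallel multipliers, while pipelining overlaps successive iterations of the outer loop; on a standard compiler the pragmas are ignored and the kernel behaves as plain C++.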

Slide 14

Design Challenges and Optimizations

[Figure: the same architecture diagram, highlighting the on-chip data buffers]

On-chip memory resources:
- Loop tiling
- Eclectic data partition
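Loop tiling keeps only a small block of the weight matrix on chip at a time. The sketch below shows the tiled loop structure for a matrix-vector product; the problem size and the tile sizes TI and TJ are illustrative, chosen as stand-ins for whatever fits the available BRAM.

```cpp
constexpr int N = 16, M = 16;   // full problem size (illustrative)
constexpr int TI = 4, TJ = 4;   // tile sizes sized to on-chip buffers

// y = W * x, computed one TI x TJ tile of W at a time.
void matvec_tiled(const float W[N][M], const float x[M], float y[N]) {
    for (int i = 0; i < N; ++i) y[i] = 0.0f;
    for (int ii = 0; ii < N; ii += TI)
        for (int jj = 0; jj < M; jj += TJ) {
            // In hardware, this is where a TI x TJ block of W and a
            // TJ-long slice of x would be copied into on-chip buffers
            // before the inner compute loops run out of them.
            for (int i = ii; i < ii + TI; ++i)
                for (int j = jj; j < jj + TJ; ++j)
                    y[i] += W[i][j] * x[j];
        }
}
```

Partitioning those on-chip buffers across multiple BRAM banks is what then lets the unrolled compute loops read several operands per cycle.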

Slide 15

Design Challenges and Optimizations

[Figure: the same architecture diagram, highlighting the link to off-chip memory]

Bandwidth:
- Ping-pong buffers
- Reshaping the data layout

Slide 16

System Design

[Figure: block diagram of the FPGA chip: a MicroBlaze soft core, the LSTM accelerator, a timer, and a data dispatcher, connected to DDR3 DRAM over an AXI4 bus and to a UART over an AXI4-Lite bus]

Toolchain: Vivado High-Level Synthesis (v2015.4) and Vivado Design Suite (v2015.4)

Slide 17

Accelerator Design

[Figure: LSTM functional logic for the i, f, c̃, and o gates, fed from Input Group 0 and Input Group 1, writing to Output Group 0 and Output Group 1, with a dedicated cell buffer]

Experimental Results

0

5

10

15

20

25

1 thread -O3 16 thread -O3 Our Imp.

Speedup

Device Model Freq. Development Env.

CPU Xeon E5-2430 2.20 GHz gcc -O3 & OpenMP

FPGA Xilinx Virtex7-485t 150 MHz Vivado Design Suite

5.4x

20.2x

19

Experimental Results

0

5

10

15

20

25

Previous Imp. A Previous Imp. B Our Imp. A Our Imp. B

Speedup

Imp. Model Freq. Data Precision

Previous Imp. Xilinx Zynq7020 142MHz Fixed-16

Our Imp. Xilinx Virtex7-485t 150 MHz Float-32

~47%

15.5x

2x

Slide 20

Future Work

Data quantization:
- Low-precision fixed-point numbers

Model compression:
- Connection pruning
- Matrix compression (e.g., SVD)

General architecture:
- Support for all LSTM variants

Slide 21

Conclusions

- An accelerator for LSTM-RNN
- Optimizations for computation & communication at the architecture level
- On-board implementation with high-performance computation engines & a data dispatcher
- Outperforms CPU and other FPGA implementations

Slide 22

Thank You

Q & A