
Slide 1

FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks

Yijin Guan (1), Zhihang Yuan (1), Guangyu Sun (1,3), Jason Cong (2,3,1)

(1) Center for Energy-Efficient Computing and Applications, Peking University, China
(2) Computer Science Department, University of California, Los Angeles, USA
(3) PKU/UCLA Joint Research Institute in Science and Engineering

Slide 2

Deep Learning

[Figure: deep learning scenarios and applications]

Slide 3

Recurrent Neural Network

[Figure: a feed-forward NN (input, hidden, and output layers) next to an RNN, where the hidden layer at time t also receives a recurrent connection from the hidden layer at time t-1]

Slide 4

Recurrent Neural Network

[Figure: the RNN unfolded over time steps t-1, t, and t+1, each with its own input, hidden, and output layers]

An RNN unfolds into a deep feed-forward network over time.

Slide 5

Long Short-Term Memory

[Figure: the input gate, applying weights W_xi and W_hi to x_t and h_{t-1}]

i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)

Slide 6

Long Short-Term Memory

[Figure: the forget gate added alongside the input gate, applying weights W_xf and W_hf]

f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)

Slide 7

Long Short-Term Memory

[Figure: the cell gate added, applying weights W_xc and W_hc to produce the cell candidate c̃_t]

c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)

Slide 8

Long Short-Term Memory

[Figure: the cell state c_t updated from the previous state c_{t-1}, gated by f_t and i_t]

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

Slide 9

Long Short-Term Memory

[Figure: the output gate added, applying weights W_xo and W_ho]

o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)

Slide 10

Long Short-Term Memory

[Figure: the complete LSTM cell with input, forget, cell, and output gates producing the hidden state h_t]

h_t = o_t ⊙ tanh(c_t)
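The five gate equations on the preceding slides can be collected into one time-step update. The sketch below is a plain C++ reference implementation of those equations; the helper names (`affine`, `lstm_step`), the `Vec`/`Mat` types, and the dense nested loops are illustrative choices, not the paper's actual HLS source.

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Pre-activation of one gate: W*x + U*h + b, evaluated element-wise.
static Vec affine(const Mat& W, const Vec& x,
                  const Mat& U, const Vec& h, const Vec& b) {
    Vec y(b);
    for (size_t i = 0; i < y.size(); ++i) {
        for (size_t j = 0; j < x.size(); ++j) y[i] += W[i][j] * x[j];
        for (size_t j = 0; j < h.size(); ++j) y[i] += U[i][j] * h[j];
    }
    return y;
}

// One LSTM time step: updates the cell state c and hidden state h in place.
void lstm_step(const Vec& x, Vec& h, Vec& c,
               const Mat& Wxi, const Mat& Whi, const Vec& bi,
               const Mat& Wxf, const Mat& Whf, const Vec& bf,
               const Mat& Wxc, const Mat& Whc, const Vec& bc,
               const Mat& Wxo, const Mat& Who, const Vec& bo) {
    Vec i = affine(Wxi, x, Whi, h, bi);   // input gate pre-activation
    Vec f = affine(Wxf, x, Whf, h, bf);   // forget gate pre-activation
    Vec g = affine(Wxc, x, Whc, h, bc);   // cell candidate pre-activation
    Vec o = affine(Wxo, x, Who, h, bo);   // output gate pre-activation
    for (size_t k = 0; k < h.size(); ++k) {
        float it = sigmoid(i[k]);
        float ft = sigmoid(f[k]);
        float gt = std::tanh(g[k]);       // c̃_t
        float ot = sigmoid(o[k]);
        c[k] = ft * c[k] + it * gt;       // c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
        h[k] = ot * std::tanh(c[k]);      // h_t = o_t ⊙ tanh(c_t)
    }
}
```

Each matrix-vector product here is exactly the kind of loop nest that the optimizations on the following slides target.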

Slide 11

Why FPGA

[Figure: radar charts comparing CPU, GPU, FPGA, and ASIC along five axes: adaptability, performance, energy efficiency, programmability, and scalability]

Slide 12

Design Challenges and Optimizations

[Figure: the FPGA chip, containing a computation engine and data buffers, connected to off-chip memory]

Slide 13

Design Challenges and Optimizations

[Figure: the same architecture diagram, highlighting the computation engine]

Computation resources & performance:
- Loop unrolling
- Deep pipelining
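Loop unrolling and deep pipelining are expressed in Vivado HLS through pragmas on the loop nest. The matrix-vector kernel below is a sketch of that style; the sizes, the loop labels, the unroll factor of 4, and II=1 are illustrative choices, not figures from the paper.

```cpp
constexpr int N = 8;   // rows (hidden size, illustrative)
constexpr int M = 8;   // cols (input size, illustrative)

// y = W * x, written the way Vivado HLS expects loop nests.
void matvec(const float W[N][M], const float x[M], float y[N]) {
Row:
    for (int i = 0; i < N; ++i) {
// Pipeline the row loop: start a new row every cycle (II=1).
#pragma HLS PIPELINE II=1
        float acc = 0.0f;
Col:
        for (int j = 0; j < M; ++j) {
// Unroll the column loop by 4: four multiply-accumulates per iteration.
#pragma HLS UNROLL factor=4
            acc += W[i][j] * x[j];
        }
        y[i] = acc;
    }
}
```

Unrolling trades DSP/LUT resources for parallel multipliers, while pipelining overlaps successive iterations of the outer loop; on a standard compiler the pragmas are ignored and the kernel behaves as plain C++.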

Slide 14

Design Challenges and Optimizations

[Figure: the same architecture diagram, highlighting the on-chip data buffers]

On-chip memory resources:
- Loop tiling
- Eclectic data partition
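Loop tiling keeps only a small block of the weight matrix on chip at a time. The sketch below shows the tiled loop structure for a matrix-vector product; the problem size and the tile sizes TI and TJ are illustrative, chosen as stand-ins for whatever fits the available BRAM.

```cpp
constexpr int N = 16, M = 16;   // full problem size (illustrative)
constexpr int TI = 4, TJ = 4;   // tile sizes sized to on-chip buffers

// y = W * x, computed one TI x TJ tile of W at a time.
void matvec_tiled(const float W[N][M], const float x[M], float y[N]) {
    for (int i = 0; i < N; ++i) y[i] = 0.0f;
    for (int ii = 0; ii < N; ii += TI)
        for (int jj = 0; jj < M; jj += TJ) {
            // In hardware, this is where a TI x TJ block of W and a
            // TJ-long slice of x would be copied into on-chip buffers
            // before the inner compute loops run out of them.
            for (int i = ii; i < ii + TI; ++i)
                for (int j = jj; j < jj + TJ; ++j)
                    y[i] += W[i][j] * x[j];
        }
}
```

Partitioning those on-chip buffers across multiple BRAM banks is what then lets the unrolled compute loops read several operands per cycle.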

Slide 15

Design Challenges and Optimizations

[Figure: the same architecture diagram, highlighting the link to off-chip memory]

Bandwidth:
- Ping-pong buffers
- Reshaping the data layout

Slide 16

System Design

[Figure: block diagram of the FPGA chip: a MicroBlaze soft core, the LSTM accelerator, a timer, and a data dispatcher, connected to DDR3 DRAM over an AXI4 bus and to a UART over an AXI4-Lite bus]

Toolchain: Vivado High-Level Synthesis (v2015.4) and Vivado Design Suite (v2015.4)

Slide 17

Accelerator Design

[Figure: LSTM functional logic for the i, f, c̃, and o gates, fed from Input Group 0 and Input Group 1, writing to Output Group 0 and Output Group 1, with a dedicated cell buffer]

Experimental Results

0

5

10

15

20

25

1 thread -O3 16 thread -O3 Our Imp.

Speedup

Device Model Freq. Development Env.

CPU Xeon E5-2430 2.20 GHz gcc -O3 & OpenMP

FPGA Xilinx Virtex7-485t 150 MHz Vivado Design Suite

5.4x

20.2x

19

Experimental Results

0

5

10

15

20

25

Previous Imp. A Previous Imp. B Our Imp. A Our Imp. B

Speedup

Imp. Model Freq. Data Precision

Previous Imp. Xilinx Zynq7020 142MHz Fixed-16

Our Imp. Xilinx Virtex7-485t 150 MHz Float-32

~47%

15.5x

2x

Slide 20

Future Work

Data quantization:
- Low-precision fixed-point numbers

Model compression:
- Connection pruning
- Matrix compression (e.g., SVD)

General architecture:
- Support for all LSTM variants

Slide 21

Conclusions

- An accelerator for LSTM-RNN
- Optimizations for computation & communication at the architecture level
- On-board implementation with high-performance computation engines & a data dispatcher
- Outperforms CPU and other FPGA implementations

Slide 22

Thank You

Q & A