Date post: 18-Jun-2020
DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Xiaofan Zhang¹, Junsong Wang², Chao Zhu², Yonghua Lin², Jinjun Xiong³, Wen-mei Hwu¹, Deming Chen¹

¹UIUC, ²IBM Research-China, ³IBM T. J. Watson Research Center

IBM Research AI Systems Day

ICCAD’18 Best Paper Award
Outline:

1. Background
2. Motivations
3. Automation Flow
4. Accelerator Architecture
5. Design Space Exploration
6. Experimental Results
7. Conclusions

Background

Deploying deep learning workloads in the cloud

Major requirements:

• Throughput performance
• Tail latency
• Power efficiency

[Figure labels: Recommendations; Auto-gen Sport Highlights; DNN Design]

Background

Deploying deep learning workloads at the edge

Major requirements:

• Real-time ability
• Energy-efficient design
• Area constraint

Motivation

FPGAs deliver improved latency & energy efficiency (vs. CPUs, GPUs)

Try FPGAs for both cloud and edge computing

But...

• FPGAs have limited computation & memory resources (DSPs, BRAMs)
• Large design & test efforts (RTL programming, HW verification, ...)
• Challenges in resource allocation for unbalanced DNN layers

We need

An end-to-end automated tool for mapping DNNs to FPGAs

Automation Design Flow

To bridge the gap between fast DNN construction in software and slow hardware implementation

A three-step solution: Design, Generation, and Execution

➢ Low latency
➢ High throughput
➢ Efficient use of FPGA on-chip memory
➢ Automatic on-chip resource allocation

Architecture

Overview of the proposed accelerator design

➢ A fine-grained layer-based pipeline structure
1) Higher throughput
2) Better support of streaming inputs
3) Higher efficiency with dedicated design for each DNN layer

➢ A column-based cache scheme
1) Lower latency, lower on-chip memory demands
2) Support for HD input
3) Real-time capability

Architecture

A fine-grained layer-based pipelined architecture

• Higher throughput vs. recurrent structure
• Lower latency vs. conventional pipeline structure
• Reduces latency by 7.7x when running YOLO

[Figure: proposed design vs. general design]

Architecture

Pipelined stages instantiated on FPGA

➢ 2-dim parallelism
KPF - kernel parallel factor
CPF - channel parallel factor

➢ Arbitrary quantization
DW - bit-width for feature map
WW - bit-width for weight/bias

[Figure labels: External memory (DRAM); On-chip memory (BRAM); Computation resources]
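
The slide names only the two bit-widths (DW for feature maps, WW for weights/biases) and does not describe the number format. As a rough illustration, the sketch below assumes a signed fixed-point representation with saturation; the function names and the `frac_bits` parameter are hypothetical and are not taken from DNNBuilder.

```python
def quantize(value: float, bits: int, frac_bits: int) -> int:
    """Round `value` to a signed fixed-point integer with `bits` total bits.

    Assumption: `frac_bits` of the word hold the fraction; overflow saturates.
    """
    scaled = round(value * (1 << frac_bits))
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, scaled))


def dequantize(q: int, frac_bits: int) -> float:
    """Map a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)


DW = 8   # hypothetical feature-map bit-width
WW = 16  # hypothetical weight/bias bit-width

activation_q = quantize(1.3, DW, frac_bits=4)
weight_q = quantize(-0.0371, WW, frac_bits=12)
print(activation_q, dequantize(activation_q, 4))
print(weight_q, dequantize(weight_q, 12))
```

Narrower words cost fewer DSP and BRAM resources per value but lose precision, which is presumably why the two widths are exposed as separate knobs.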

Architecture

RTL IPs for different DNN layers

• Adjustable parallel factor = CPF × KPF (more/less DSP utilization)
• On-chip buffers for sufficient data supply
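
To make the parallel factor concrete, here is a minimal sketch of how CPF × KPF could translate into cycle counts for one convolution stage. The cycle model is an idealized assumption (one multiply-accumulate per lane per cycle, no stalls), and the layer shape is hypothetical; neither comes from the slides.

```python
import math


def macs_per_cycle(cpf: int, kpf: int) -> int:
    """Parallel factor of a stage: CPF x KPF multiply-accumulates per cycle."""
    return cpf * kpf


def conv_cycles(in_ch, out_ch, kernel, out_h, out_w, cpf, kpf):
    """Idealized cycle count for one conv layer.

    Input channels are processed CPF at a time and kernels KPF at a time.
    """
    channel_steps = math.ceil(in_ch / cpf)
    kernel_steps = math.ceil(out_ch / kpf)
    return channel_steps * kernel_steps * kernel * kernel * out_h * out_w


# Hypothetical 3x3 layer: 64 input channels, 128 kernels, 56x56 output.
narrow = conv_cycles(64, 128, 3, 56, 56, cpf=4, kpf=8)
wide = conv_cycles(64, 128, 3, 56, 56, cpf=8, kpf=16)
print(macs_per_cycle(4, 8), narrow)
print(macs_per_cycle(8, 16), wide)
```

Under this model, quadrupling the parallel factor cuts the cycle count by four at the cost of four times as many multipliers, which is the DSP trade-off the slide refers to.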

Architecture

A column-based cache scheme

➢ Save on-chip memory
➢ Adjust data reuse factor

For example, with kernel size = 3 and stride = 1, 4 slices are cached on-chip instead of the whole feature maps.
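
The example above can be turned into a small calculation. The sketch assumes that one slice is one column of the input feature map and that the number of cached slices is kernel size plus stride (which gives 4 for kernel 3, stride 1). Both points are assumptions chosen to match the slide's example, not a rule stated in the slides.

```python
def cached_slices(kernel: int, stride: int) -> int:
    """Slices kept on-chip.

    Assumption: `kernel` slices feed the current window while `stride`
    new slices are being loaded.
    """
    return kernel + stride


def cache_saving(width: int, kernel: int, stride: int) -> float:
    """Ratio of whole-feature-map storage to column-cache storage.

    Height, channel count and bit-width are the same in both cases and cancel.
    """
    return width / cached_slices(kernel, stride)


print(cached_slices(3, 1))       # slide example: kernel 3, stride 1
print(cache_saving(1280, 3, 1))  # a 1280-column input, as in the HD case study
```

Under these assumptions a 1280-column input gives a 320x saving, which happens to coincide with the upper end of the range reported on the next slide, though the slides do not say how that figure was derived.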

Architecture

Column-based cache scheme

Reduces BRAM usage by 43x for running YOLO

BRAM usage reduction for keeping feature maps: ranges from 320x down to 7x, 43x on average

Design Space Exploration

An automatic resource allocator

Step 1: Computation allocation

Total capability: 100 GOPS; total BW: 10 GB/s

Conv1 -> 15 GOPS
Conv2 -> 15 GOPS
Conv3 -> 21 GOPS
Conv4 -> 8 GOPS
Conv5 -> 5 GOPS
FC layer maximum usage: 5 GOPS

[Figure: CTC plot placing Conv1 to Conv5 and FC in the memory-bound and computation-bound regions]
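
The slides do not state the rule behind the per-layer split. One common way to balance a layer pipeline is to give each stage compute in proportion to its workload, so that every stage takes the same time per frame. The sketch below illustrates that idea only; the workload numbers are hypothetical and the function is not DNNBuilder's allocator.

```python
def allocate_compute(layer_ops, total_gops):
    """Split a compute budget across stages in proportion to each layer's work."""
    total_ops = sum(layer_ops.values())
    return {name: total_gops * ops / total_ops for name, ops in layer_ops.items()}


# Hypothetical per-layer workloads in GOP per frame (not taken from the slides).
layer_ops = {"Conv1": 2.0, "Conv2": 3.0, "Conv3": 4.0, "Conv4": 1.0}

budget = allocate_compute(layer_ops, total_gops=100.0)

# With a proportional split every stage needs the same time per frame,
# so no single stage throttles the pipeline.
stage_time = {name: layer_ops[name] / budget[name] for name in layer_ops}

for name in layer_ops:
    print(name, budget[name], stage_time[name])
```

In a real design the allocation would also have to be rounded to realizable CPF × KPF values and capped for layers that cannot use more compute, as the FC cap on the slide suggests.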

Design Space Exploration

An automatic resource allocator

Step 2: Memory bandwidth adjustment

To meet the BW constraint

Total capability: 100 GOPS; total BW: 10 GB/s

Cache one more column (Col. 1, Col. 2) -> CTC increases -> required memory BW drops

[Figure: CTC plot with memory-bound and computation-bound regions; Conv3 marked]
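
The relationship between CTC and required bandwidth can be sketched with a standard roofline calculation, reading CTC as the computation-to-communication ratio in operations per byte. The 100 GOPS and 10 GB/s limits and the 21 GOPS Conv3 allocation come from the slides; the CTC values before and after caching one more column are hypothetical.

```python
PEAK_GOPS = 100.0     # total capability from the slide
PEAK_BW_GBPS = 10.0   # total bandwidth from the slide


def attainable_gops(ctc, peak=PEAK_GOPS, bw=PEAK_BW_GBPS):
    """Roofline: performance is capped by compute or by CTC x bandwidth."""
    return min(peak, ctc * bw)


def required_bw(gops, ctc):
    """Bandwidth in GB/s needed to sustain `gops` at a CTC given in ops per byte."""
    return gops / ctc


def is_memory_bound(ctc, peak=PEAK_GOPS, bw=PEAK_BW_GBPS):
    """A stage is memory bound when bandwidth, not compute, sets its ceiling."""
    return ctc * bw < peak


conv3_gops = 21.0  # Conv3 allocation from Step 1
ctc_before = 1.5   # hypothetical CTC with the minimum number of cached columns
ctc_after = 3.0    # hypothetical CTC after caching one more column

print(required_bw(conv3_gops, ctc_before))  # exceeds the 10 GB/s budget
print(required_bw(conv3_gops, ctc_after))   # fits within the budget
```

With these limits the ridge point sits at 10 ops per byte; caching more columns raises data reuse and therefore CTC, lowering the bandwidth a stage needs for the same throughput.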

Experimental Results

Case study: real-time pedestrian/cyclist/car detection

YOLO9000 with HD input (1280x384, 20 FPS) is mapped to a Xilinx ZC706 (Zynq XC7Z045) FPGA @ 200 MHz

Experimental Results

Accuracy results after 16-bit & 8-bit quantization

*f.-t. in Design indicates that the accuracy results were collected after retraining and fine-tuning

Experimental Results

Comparison: Embedded FPGAs for edge devices

Zynq XC7Z045
• LUT: 218,600
• FF: 437,200
• BRAM: 545
• DSP: 900

Peaking at 524 GOPS

Comparison: High-performance FPGAs for cloud computing

KU115
• LUT: 663,360
• FF: 1,326,720
• BRAM: 2160
• DSP: 5520

Peaking at 4022 GOPS

Experimental Results

Comparison: AlexNet inference performance, GPU vs. FPGA

Conclusions

➢ We presented DNNBuilder for building DNN accelerators on FPGAs:
1) an automation tool (Design, Generation, and Execution)
2) a fine-grained layer-based pipeline architecture
3) a column-based cache scheme
4) an automatic resource allocation algorithm

➢ We delivered state-of-the-art performance and power efficiency:
1) the best throughput: 4022 GOPS (KU115) and 524 GOPS (ZC706)
2) the best efficiency: 180.4 GOPS/W (KU115) and 72.8 GOPS/W (ZC706)
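
Dividing throughput by efficiency gives a rough idea of the power each board would draw. This is a back-of-the-envelope sketch that assumes the throughput and efficiency figures for each board were measured on the same design point, which the slides do not confirm.

```python
# (throughput in GOPS, efficiency in GOPS/W), as reported in the conclusions.
results = {
    "KU115": (4022.0, 180.4),
    "ZC706": (524.0, 72.8),
}

# Implied power in watts = GOPS / (GOPS/W).
implied_power_w = {board: gops / eff for board, (gops, eff) in results.items()}

for board, watts in implied_power_w.items():
    print(f"{board}: {watts:.1f} W")
```

The result is roughly 22 W for the KU115 and roughly 7 W for the ZC706 under that assumption.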

Q & A

#AI Research Week, hosted by MIT-IBM Watson AI Lab

