Date post: 15-Aug-2020
Page 1

High-Level Synthesis of Multiple Dependent CUDA Kernels for FPGA

Swathi Gurumani², Hisham Cholakkai², Yun Liang³, Kyle Rupnow¹,², Deming Chen⁴

¹ Nanyang Technological University
² Advanced Digital Sciences Center, Illinois at Singapore
³ Peking University
⁴ Univ. of Illinois Urbana-Champaign

Page 2

High-Level Synthesis

Automatic generation of hardware from algorithm descriptions
• RTL design time is high for complex designs

Different input languages
• Extensions to C/C++ (SystemC, ImpulseC)
• Functional (Haskell), GPGPU (CUDA), Graphical (LabVIEW)

[Figure: high-level synthesis tools accepting C/C++, SystemC, CUDA, … as input]

Page 3

High-Level Synthesis Tools

Facilitate design space exploration
• Compiler directives or language features
• Automate (partially) selection of design parameters

Challenge – extracting parallelism
• Requires restructuring or reimplementing code in an HLS-specific manner

Data-parallel input languages provide an inherent advantage

Page 4

Parallel Computing & GPU Languages

Shift towards parallel computing & heterogeneous

CUDA programming model (NVIDIA)
• Minimal extensions to C/C++
• CUDA (GPU), MCUDA (Multi-core), FCUDA (FPGA)

CUDA advantages for HLS
• Easier analysis of application parallelism
• Exploration of parallelism granularity options

Page 5

Synthesis of CUDA Kernel

FCUDA - CUDA to FPGA [SASP'09], [FCCM'11]

Automates design space exploration of single CUDA kernel
• Match GPU performance with significantly less power

Currently supports only single kernel synthesis

[Figure: FCUDA flow. CUDA code → FCUDA annotation → annotated CUDA → FCUDA translation → AutoPilot C code → HLS & logic synthesis → FPGA bitfile]

Page 6

Synthesis of Multiple CUDA Kernels

Possible to create single enclosing wrapper kernel

Single enclosing wrapper kernel is not ideal
• Must fully buffer all sub-kernel communications on-chip
• Must use the same thread organization for sub-kernels
• Forces all sub-kernels to be CUDA device-only functions

[Figure: graph of dependent kernels K1, K2, K3, K4]

Page 7

Objective

Map multiple communicating CUDA kernels onto FPGA
• Allow fine-grained communication
• Enable data streaming
• Handle different thread organizations

Key contributions to synthesize communicating CUDA kernels to RTL
• Manual step-by-step procedure
• Identify key challenges in automation
• Case study of stereo-matching algorithm

Page 8

Multi-Kernel Synthesis - Steps

Individual Kernel Synthesis

Communication Buffer Generation

Analytical Design Space Exploration

Implementation and Verification

Page 9

Individual Kernel Synthesis

Kernel extraction and FCUDA flow

Initial solution of cores to be minimal in area
• Perform joint design space exploration of kernels later!

Measure resource usage and latency

[Figure: each kernel of the CUDA application is synthesized into FPGA cores]

Page 10

Communication Buffers

Generate control flow graph (CFG) for kernels
• ASAP scheduling to determine execution critical path

Buffers between each pair of communicating kernels

[Figure: kernel graph K1, K2, K3, K4, shown without and with communication buffers B1 to B4 inserted between communicating kernels]
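The ASAP scheduling step named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the edge set K1 → {K2, K3} → K4 is read off the figure, and the latencies, the function name `asap_schedule`, and its data layout are assumptions made for the example.

```python
from collections import deque

def asap_schedule(latency, edges):
    """ASAP-schedule a kernel DAG; return (start times, critical path length)."""
    succs = {k: [] for k in latency}
    indeg = {k: 0 for k in latency}
    for src, dst in edges:
        succs[src].append(dst)
        indeg[dst] += 1
    start = {k: 0 for k in latency}
    ready = deque(k for k, d in indeg.items() if d == 0)
    visited = 0
    while ready:
        k = ready.popleft()
        visited += 1
        finish = start[k] + latency[k]
        for s in succs[k]:
            # A kernel starts as soon as its slowest producer has finished.
            start[s] = max(start[s], finish)
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    if visited != len(latency):
        raise ValueError("kernel graph contains a cycle")
    total = max(start[k] + latency[k] for k in latency)
    return start, total

# Hypothetical per-kernel latencies (arbitrary time units).
latency = {"K1": 2, "K2": 8, "K3": 6, "K4": 3}
# Each edge is a producer/consumer pair, i.e. a place where a buffer is needed.
edges = [("K1", "K2"), ("K1", "K3"), ("K2", "K4"), ("K3", "K4")]
start, total = asap_schedule(latency, edges)
```

With these assumed numbers the critical path runs through K2, so K2 is the first kernel worth speeding up; the four edges correspond to the four buffers B1 to B4 in the figure.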

Page 11

Communication Buffers

Size of buffers?
• Full-size buffers infeasible

Data access pattern analysis
• Initial buffer size = minimal data processing quanta
• Bigger sizes explored in analytical model

Growth rate of communication buffer

Include overlap data size for correctness
• Boundary data for algorithms with windowed computations
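The sizing rule above (minimal quantum plus overlap for windowed computations) can be made concrete with a small Python sketch. It assumes row-based quanta and a square window that needs `window - 1` extra boundary rows; the function name and all numbers are hypothetical and do not come from the slides.

```python
def buffer_bytes(quantum_rows, row_width, bytes_per_elem, window=1, n_quanta=1):
    """Size of one communication buffer, in bytes.

    Assumes the consumer processes `n_quanta` quanta of `quantum_rows` rows
    each, and that a `window` x `window` computation needs `window - 1`
    additional boundary rows to produce correct results at the edges.
    """
    if min(quantum_rows, row_width, bytes_per_elem, window, n_quanta) < 1:
        raise ValueError("all parameters must be >= 1")
    overlap_rows = window - 1
    rows = quantum_rows * n_quanta + overlap_rows
    return rows * row_width * bytes_per_elem

# Hypothetical example: 4-row quantum, 64 elements per row, 4 bytes each.
minimal = buffer_bytes(4, 64, 4, window=3)              # 6 rows
larger = buffer_bytes(4, 64, 4, window=3, n_quanta=2)   # 10 rows
```

Under these assumptions the overlap is paid once per buffer, so doubling the number of quanta less than doubles the buffer size.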

Page 12

Buffering Schemes

[Figure: a. Single-Buffer Flow: Kernels A, B, and C are connected through one buffer per link, and their executions are shown along a time axis. b. Dual-Buffer Flow: each link has Buffer 0 and Buffer 1, and executions of A, B, and C on successive data overlap in time.]
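The difference between the two buffering schemes can be explored with a small timing model in Python. This is a sketch under stated assumptions, not the model used by the authors: a stage may start a quantum only when its input is ready, its core is free, and one of the `buffers_per_link` output buffers has been released by the consumer. The stage latencies are hypothetical.

```python
def pipeline_latency(stage_latency, n_quanta, buffers_per_link):
    """Total time to push n_quanta through a linear kernel pipeline."""
    if n_quanta < 1 or buffers_per_link < 1:
        raise ValueError("n_quanta and buffers_per_link must be >= 1")
    n_stages = len(stage_latency)
    # finish[q][s]: time at which stage s finishes quantum q.
    finish = [[0] * n_stages for _ in range(n_quanta)]
    for q in range(n_quanta):
        for s in range(n_stages):
            ready = 0
            if s > 0:
                ready = max(ready, finish[q][s - 1])      # input produced
            if q > 0:
                ready = max(ready, finish[q - 1][s])      # this core is free
            if s + 1 < n_stages and q >= buffers_per_link:
                # The output buffer is reusable once the consumer has
                # finished the quantum that last occupied it.
                ready = max(ready, finish[q - buffers_per_link][s + 1])
            finish[q][s] = ready + stage_latency[s]
    return finish[-1][-1]

stages = [2, 8, 6]  # hypothetical latencies of kernels A, B, C
single = pipeline_latency(stages, n_quanta=4, buffers_per_link=1)
dual = pipeline_latency(stages, n_quanta=4, buffers_per_link=2)
```

In this model the dual-buffer flow reaches the usual pipeline bound (fill time plus one bottleneck latency per additional quantum), while the single-buffer flow stalls producers until consumers drain the only buffer.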

Page 13

Analytical Design Exploration Model

Page 14

Analytical Design Exploration

nQuanta = 8, Max cores = (4,2,4)

[Figure: timeline of pipeline stages K1, K2, K3 with latencies 2, 8, 6, one core each]

Page 15

Analytical Design Exploration

nQuanta = 8, Max cores = (4,2,4)

[Figure: timelines of K1, K2, K3. Initial latencies 2, 8, 6; after adding a second K2 core the latencies are 2, 4, 6]

Page 16

Analytical Design Exploration

nQuanta = 8, Max cores = (4,2,4)

[Figure: timelines of K1, K2, K3. Latencies 2, 8, 6 with one core each; 2, 4, 6 with two K2 cores; 2, 4, 3 with two K2 cores and two K3 cores]

Page 17

Analytical Design Exploration

nQuanta = 8, Max cores = (4,2,4)

[Figure: timelines of K1, K2, K3. Latencies 2, 8, 6 with one core each; 2, 4, 6 with two K2 cores; 2, 4, 3 with two K2 cores and two K3 cores]

Only 2 cores of K2 are possible!

Page 18

Analytical Design Exploration

Increase nQuanta

nQuanta = 16, Max cores = (8,4,8)

[Figure: timelines of K1, K2, K3. Latencies 2, 8, 6 with one core each; 2, 4, 6 with two K2 cores; 2, 4, 3 with two K2 cores and two K3 cores]

Page 19

Analytical Design Exploration

nQuanta = 16, Max cores = (8,4,8)

[Figure: timelines of K1, K2, K3. Latencies 2, 8, 6 with one core each; 2, 4, 6 with two K2 cores; 2, 4, 3 with two K2 cores and two K3 cores; 2, 2, 3 with four K2 cores and two K3 cores]
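The walkthrough on these slides can be replayed with a short greedy loop in Python. The slides do not give the analytical model's equations, so everything here is an assumption chosen to reproduce the pictured sequence: latency scales as base latency divided by core count, each step doubles the cores of the bottleneck kernel, doubling nQuanta doubles every core cap, and a hypothetical `core_budget` stands in for the resource limit.

```python
def explore(base_latency, max_cores, n_quanta, core_budget):
    """Greedy sketch: repeatedly double the cores of the slowest stage."""
    if any(m < 1 for m in max_cores):
        raise ValueError("every kernel needs a core cap of at least 1")
    cores = [1] * len(base_latency)
    max_cores = list(max_cores)
    while True:
        latency = [b / c for b, c in zip(base_latency, cores)]
        k = latency.index(max(latency))  # current bottleneck stage
        if cores[k] >= max_cores[k]:
            # Bottleneck is capped: assume doubling nQuanta doubles every cap.
            n_quanta *= 2
            max_cores = [2 * m for m in max_cores]
            continue
        grown = min(2 * cores[k], max_cores[k])
        if sum(cores) - cores[k] + grown > core_budget:
            break  # hypothetical area budget reached
        cores[k] = grown
    final_latency = [b / c for b, c in zip(base_latency, cores)]
    return cores, n_quanta, final_latency

# Values from the slides: latencies (2, 8, 6), nQuanta = 8, max cores (4, 2, 4).
# The budget of 7 cores is an assumption used only to stop the loop.
cores, n_quanta, latency = explore([2, 8, 6], [4, 2, 4], 8, core_budget=7)
```

Under these assumptions the loop duplicates K2, then K3, hits the K2 cap of two cores, raises nQuanta to 16, and ends with four K2 cores and two K3 cores, matching the stage latencies 2, 2, 3 in the last timeline.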

Page 20

Implementation and Verification

Core allocations from analytical model
• AutoPilot-C pragmas for suggested parallelism

Communication buffers and kernel-level parallelism

SystemC simulation

Vivado Synthesis

Page 21

Stereo Matching

Two spatially separated color cameras

The distance in pixels between the same object in the two images indicates its depth

Complex algorithms to match pixels
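The relation between pixel distance (disparity) and depth can be written down for the standard rectified pinhole stereo model, where depth = focal length × baseline / disparity. The Python sketch below uses that textbook relation; the slides do not state which camera model the case study uses, and the focal length and baseline values are hypothetical.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair (pinhole model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 800-pixel focal length, cameras 0.5 m apart.
near = depth_from_disparity(80, focal_px=800, baseline_m=0.5)
far = depth_from_disparity(40, focal_px=800, baseline_m=0.5)
```

Halving the disparity doubles the estimated depth, which is why nearby objects shift more between the two images than distant ones.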

Page 22

Case Study – Stereo Matcher

[Figure: stereo matcher kernel graph from Left Image and Right Image to Left Depth Map and Right Depth Map. Kernels shown for each of the left and right paths: RGB to Lab Conversion, Census Transform, GridBuilding, Matching, Pre Filtering, Median Filtering. A single Cross Correction kernel is also shown.]

Page 23

Design Space for Dual-Buffer Flow

[Figure: scatter plot of latency (Mcycles, log scale from 1 to 1000) versus sum of normalized resource use (0 to 3). Legend of dual-buffer (DB) configurations: 6x96, 12x96, 18x96, 12x192, 18x192, 12x384, 18x384, 24x384, 36x384, 48x384, 72x384, 96x384, 144x384]


Page 27

Design Space for Dual-Buffer Flow

[Figure: scatter plot of latency (Mcycles, log scale from 1 to 1000) versus sum of normalized resource use (0 to 3). Legend of dual-buffer (DB) configurations: 6x96, 12x96, 18x96, 12x192, 18x192, 12x384, 18x384, 24x384, 36x384, 48x384, 72x384, 96x384, 144x384. One point is marked "Selected solution".]

Page 28

Design Space for Dual and Single Buffer Flow

[Figure: scatter plot of latency (Mcycles, log scale from 1 to 1000) versus sum of normalized resource use (0 to 3), with separate series for single-buffer (SB) and dual-buffer (DB) designs]


Page 30

Performance–Power Comparison

HLS of sequential code achieved speedup of 6.9x over software [FPT'11]

HLS of CUDA parallel code achieved speedup of >50x over sequential software **

Greater exposed parallelism provides synthesis tool greater opportunity for optimization

[Figure: bar chart comparing GPU and FPGA on normalized latency and normalized energy; vertical axis from 0 to 1.2]

Page 31

Challenges in Automation – Single Kernel

Single kernel synthesis

• Critical: Replicating the initial solution for concurrency

Multi-kernel synthesis

Page 32

Challenges in Automation – Single Kernel

Optimize thread index computations
• Solution: Improved analytical techniques in FCUDA to optimize index computations

Floating-point to fixed-point computations
• Solution: Automatic transformation with functional verification of transform

Inefficient implementations of difficult operations
• Solution: Automatic instantiation of library elements for common but challenging operations
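The "functional verification of transform" idea for floating-point to fixed-point conversion can be illustrated with a small Python check. This sketch is not FCUDA's transformation: it assumes a simple signed fixed-point format with `frac_bits` fractional bits, uses multiplication as the operation under test, and compares against the floating-point result with a user-chosen tolerance. All function names are hypothetical.

```python
def to_fixed(x, frac_bits):
    """Quantize a float to an integer with frac_bits fractional bits."""
    return int(round(x * (1 << frac_bits)))

def from_fixed(q, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

def fixed_mul(a, b, frac_bits):
    """Multiply two fixed-point values, rescaling the double-width product."""
    return (a * b) >> frac_bits

def verify_mul(samples, frac_bits, tol):
    """Return (passed, worst_error) against the floating-point reference."""
    worst = 0.0
    for x, y in samples:
        reference = x * y
        product = fixed_mul(to_fixed(x, frac_bits), to_fixed(y, frac_bits), frac_bits)
        worst = max(worst, abs(reference - from_fixed(product, frac_bits)))
    return worst <= tol, worst

samples = [(0.5, 0.25), (1.5, -2.0), (3.14159, 2.71828)]
ok_12, err_12 = verify_mul(samples, frac_bits=12, tol=1e-3)
ok_2, err_2 = verify_mul(samples, frac_bits=2, tol=1e-3)
```

A check of this kind lets a tool accept a narrow format only when the measured error stays within the tolerance, and reject formats such as the 2-bit case here.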

Page 33

Challenges in Automation – Multiple Kernel

Selection of single-core implementation
• Solution: Complex value function, knowledge of resource criticality, and iteration of entire design flow

Automatic buffer-generation and insertion
• Solution: Complex memory access pattern analysis and transformations (See upcoming FPGA 13 paper)

Performance estimation within synthesis process
• Solution: Improved analytical model for loop bounds, trip counts, resource estimates

Sub-kernel optimizations to match pipeline stage latencies
• Solution: Improved ability to combine or split pipeline stages

[Figure: pipeline stages S1, S2, S3, and the same pipeline with S2 split into S21 and S22]

Page 34

Conclusion

Multi-kernel CUDA synthesis is important

Manual process for mapping multiple dependent CUDA kernels to FPGA

Performance parity with GPU while consuming 16x less energy
• Benefit of data-parallel input language for HLS

Fully automating multi-kernel synthesis is challenging

Page 35

Acknowledgement

A*STAR HSS Funding

Peking University

University of Illinois at Urbana-Champaign

ADSC Lab Colleagues

• Hongbin Zheng

• Muhammad Teguh Satria

