
ESP4ML Platform-Based Design of System-on-Chip for Embedded Machine Learning Davide Giri Kuan-Lin Chiu Giuseppe di Guglielmo Paolo Mantovani Luca P. Carloni DATE 2020
Page 1: ESP4ML - Columbia Universitydavide_giri/pdf/giri_date20_slides.pdfGiuseppe di Guglielmo Paolo Mantovani Luca P. Carloni DATE 2020 Thank you from the ESP team! sld.cs.columbia.edu esp.cs.columbia.edu

ESP4ML
Platform-Based Design of System-on-Chip for Embedded Machine Learning

Davide Giri, Kuan-Lin Chiu, Giuseppe di Guglielmo, Paolo Mantovani, Luca P. Carloni

DATE 2020

Page 2: ESP4ML

Combines ESP and hls4ml:

• ESP is a platform for heterogeneous SoC design
• hls4ml automatically generates accelerators from ML models

Main contributions to ESP:

• Automated integration of hls4ml accelerators
• Accelerator-accelerator communication
• Accelerator invocation API

Open-source design flow to build and program SoCs for ML applications.

Page 3: hls4ml

• Open-source tool developed by Fast ML Lab
• Translates ML algorithms into HLS-able accelerator specifications
  o Targets Xilinx Vivado HLS (i.e. FPGA only)
  o ASIC support is in the works
• Born for high-energy physics (small and ultra-low latency networks)
  o Now has broad applicability

Image from https://fastmachinelearning.org/hls4ml/

Page 4: ESP motivation

Heterogeneous systems are pervasive

Integrating accelerators into a SoC is hard

Doing so in a scalable way is very hard

Keeping the system simple to program while doing so is even harder

ESP makes it easy

ESP combines a scalable architecture with a flexible methodology

ESP enables several accelerator design flows and takes care of the hardware and software integration

[Figure: heterogeneous systems, from data-center blades to an embedded SoC with CPU, GPU, cache ($), accelerators, I/O and DDR]

Page 5: ESP overview

[Figure: ESP overview. Labels: Application Developers, Hardware Designers, HLS Design Flows, RTL Design Flows, new design flows, accelerator, Processor, SoC Integration, Rapid Prototyping. Image credits: * by Nvidia Corporation; ** by lewing@isc.tamu.edu Larry Ewing and The GIMP]

Page 6: ESP architecture

• Multi-Processors
• Many-Accelerator
• Distributed Memory
• Multi-Plane NoC

The ESP architecture implements a distributed system, which is scalable, modular and heterogeneous, giving processors and accelerators similar weight in the SoC

Page 7: ESP architecture: the tiles

Page 8: ESP methodology in practice

[Figure: ESP methodology. Application developers and hardware designers feed HLS design flows and RTL design flows, which produce accelerators. Each step is marked as interactive, automated, manual, or manual (optional).]

Accelerator Flow steps shown in the figure:

• Generate accelerator
• Test behavior
• Generate RTL
• Test RTL
• Optimize accelerator
• Specialize accelerator (not required by hls4ml flow)

SoC Flow steps shown in the figure:

• Generate sockets
• Configure SoC
• Compile bare-metal
• Simulate system
• Implement for FPGA
• Compile Linux
• Deploy prototype
• Design runtime apps

Page 9: ESP accelerator flow

Developers focus on the high-level specification, decoupled from memory access, system communication, hardware/software interface

[Figure: accelerator design-space exploration. Code transformation moves between versions (Ver. 1, Ver. 2, Ver. 3) in the programmer-view design space; high-level synthesis maps each version to points (1, 2, 3) in the RTL design space. Axes: Performance versus Area / Power.]

Page 10: ESP Interactive SoC Flow

[Figure: accelerators entering SoC integration]

Page 11: New ESP features

• New accelerator design flows (C/C++, Keras/PyTorch/ONNX)
• Accelerator-to-accelerator communication
• Accelerator invocation API

Page 12: New accelerator design flows

C/C++ accelerators with Vivado HLS

• Generate the accelerator skeleton with ESP
  o Takes care of communication with the ESP tile socket
• Implement the computation part of the accelerator

Example of top level function of ESP accelerator for Vivado HLS

    void top(dma_t *out, dma_t *in1, unsigned cfg_size,
             dma_info_t *load_ctrl, dma_info_t *store_ctrl)
    {
        /* One load / compute / store round per iteration */
        for (unsigned i = 0; i < cfg_size; i++) {
            word_t _inbuff[IN_BUF_SIZE];
            word_t _outbuff[OUT_BUF_SIZE];
            load(_inbuff, in1, i, load_ctrl, 0);
            compute(_inbuff, _outbuff);
            store(_outbuff, out, i, store_ctrl, cfg_size);
        }
    }
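The load / compute / store structure of the skeleton above can be tried out in plain C. The following is only a minimal software sketch: `BUF_SIZE`, `word_t`, the simplified `load`/`compute`/`store` signatures, the doubling kernel and `top_sketch` are all hypothetical stand-ins chosen for illustration, not the code that ESP generates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the sizes and types used in the slide. */
#define BUF_SIZE 4
typedef int word_t;

/* Load phase: copy one batch from "memory" into the local input buffer. */
static void load(word_t *buf, const word_t *mem, unsigned batch)
{
    for (unsigned j = 0; j < BUF_SIZE; j++)
        buf[j] = mem[batch * BUF_SIZE + j];
}

/* Compute phase: placeholder kernel that doubles every word. */
static void compute(const word_t *in, word_t *out)
{
    for (unsigned j = 0; j < BUF_SIZE; j++)
        out[j] = 2 * in[j];
}

/* Store phase: copy the local output buffer back to "memory". */
static void store(const word_t *buf, word_t *mem, unsigned batch)
{
    for (unsigned j = 0; j < BUF_SIZE; j++)
        mem[batch * BUF_SIZE + j] = buf[j];
}

/* Same loop shape as the slide: one load/compute/store round per batch. */
void top_sketch(word_t *out, const word_t *in, unsigned n_batches)
{
    assert(out != NULL && in != NULL);
    for (unsigned i = 0; i < n_batches; i++) {
        word_t inbuff[BUF_SIZE];
        word_t outbuff[BUF_SIZE];
        load(inbuff, in, i);
        compute(inbuff, outbuff);
        store(outbuff, out, i);
    }
}
```

In this sketch only `compute` would change from one accelerator to another, which mirrors the slide's point that the designer implements the computation part while the skeleton handles data movement.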

Page 13: New accelerator design flows

Keras/PyTorch/ONNX accelerators with hls4ml

Completely automated integration in ESP:

• Generate an accelerator with hls4ml
• Generate the accelerator wrapper with ESP

Page 14: Accelerator-to-accelerator communication

Accelerators can exchange data with:

• Shared memory
• Other accelerators (new!)

Benefits

• Avoid round trips to shared memory
• Fine-grained accelerator synchronization
  o Higher throughput
  o Lower invocation and data pre- or post-processing overheads

Page 15: Accelerator-to-accelerator communication

• No need for additional queues or NoC channels
• Communication configured at invocation time
• Accelerators pull data from other accelerators; they do not push it
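The pull-based model can be illustrated with a small software analogy. This is a sketch under stated assumptions, not the ESP hardware protocol: `producer_t`, `producer_serve`, `consumer_run` and the `CHUNK` size are hypothetical names introduced here, and the "producer" simply generates consecutive integers.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 4  /* hypothetical transfer granularity, in words */

/* Hypothetical producer accelerator: emits a chunk only when asked. */
typedef struct {
    unsigned next_chunk;  /* index of the next chunk to produce */
    unsigned served;      /* number of pull requests served so far */
} producer_t;

/* Serve one pull request by filling dst with the next chunk. */
static void producer_serve(producer_t *p, int *dst)
{
    for (unsigned j = 0; j < CHUNK; j++)
        dst[j] = (int)(p->next_chunk * CHUNK + j);
    p->next_chunk++;
    p->served++;
}

/*
 * Consumer accelerator: every transfer is initiated by the consumer,
 * which pulls from the producer instead of loading from shared memory.
 * Returns the sum of all words received.
 */
static long consumer_run(producer_t *src, unsigned n_chunks)
{
    assert(src != NULL);
    long sum = 0;
    int buf[CHUNK];
    for (unsigned i = 0; i < n_chunks; i++) {
        producer_serve(src, buf);  /* pull request */
        for (unsigned j = 0; j < CHUNK; j++)
            sum += buf[j];
    }
    return sum;
}
```

Because the consumer drives each transfer, the producer in this sketch never needs to know where its output goes; it only answers requests.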

Page 16: Accelerator invocation API

API for the invocation of accelerators from a user application

• Exposes only 3 functions to the programmer
• Invokes accelerators through Linux device drivers
  o ESP automatically generates the device drivers
• Enables shared memory between processors and accelerators
  o No data copies
• Can be targeted by existing applications with minimal modifications
• Can be targeted to automatically map tasks to accelerators

[Figure: software stack. User mode: Application, ESP Library, ESP alloc. Kernel mode: ESP accelerator driver, ESP core, Linux.]

Page 17: Accelerator invocation API

[Figure: software stack. User mode: Application, ESP Library, ESP alloc. Kernel mode: ESP accelerator driver, ESP core, Linux.]

    /*
     * Example of existing C application
     * with ESP accelerators that replace
     * software kernels 2, 3 and 5
     */
    {
        int *buffer = esp_alloc(size);
        for (...) {
            kernel_1(buffer,...); // existing software
            esp_run(cfg_k2);      // run accelerator(s)
            esp_run(cfg_k3);
            kernel_4(buffer,...); // existing software
            esp_run(cfg_k5);
        }
        validate(buffer);         // existing checks
        esp_cleanup();            // memory free
    }

API for the invocation of accelerators from a user application

• Exposes only 3 functions to the programmer
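To make the allocate / run / clean up pattern concrete without ESP hardware, the sketch below models it in plain C. Everything here is an assumption for illustration: `mock_alloc`, `mock_run`, `mock_cleanup`, `mock_cfg_t` and the two toy kernels are hypothetical and are not the ESP library; an "accelerator" is just a function that works in place on the shared buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical software model of an accelerator working in place. */
typedef void (*mock_kernel_t)(int *buf, size_t n);

/* Hypothetical per-invocation configuration. */
typedef struct {
    const char   *devname;
    mock_kernel_t kernel;
} mock_cfg_t;

static int   *shared_buf;  /* buffer visible to software and "accelerators" */
static size_t shared_len;

/* Allocate the shared, zero-initialized buffer. */
int *mock_alloc(size_t n_words)
{
    shared_buf = calloc(n_words, sizeof *shared_buf);
    shared_len = shared_buf ? n_words : 0;
    return shared_buf;
}

/* Run one "accelerator" directly on the shared buffer: no data copies. */
void mock_run(const mock_cfg_t *cfg)
{
    assert(cfg != NULL && shared_buf != NULL);
    cfg->kernel(shared_buf, shared_len);
}

/* Release the shared buffer. */
void mock_cleanup(void)
{
    free(shared_buf);
    shared_buf = NULL;
    shared_len = 0;
}

/* Two toy kernels standing in for accelerated stages. */
static void add_one(int *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] += 1;
}

static void times_two(int *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= 2;
}
```

A caller would allocate once, interleave `mock_run` calls with ordinary software kernels on the same pointer, and call `mock_cleanup` at the end, which is the shape of the application shown on the slide.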

Page 18: Accelerator API

Configuration example:

• Invoke accelerators k1 and k2
• Enable point-to-point communication between them

    /* Example of double-accelerator config */
    esp_thread_info_t cfg_k12[] =
    {
        {
            .devname = "k1.0",
            .type = k1,
            /* accelerator configuration */
            .desc.k1_desc.nbursts = 8,
            /* p2p configuration */
            .desc.k1_desc.esp.p2p_store = true,
            .desc.k1_desc.esp.p2p_nsrcs = 0,
            .desc.k1_desc.esp.p2p_srcs = {"","","",""},
        },
        {
            .devname = "k2.0",
            .type = k2,
            /* accelerator configuration */
            .desc.k2_desc.nbursts = 8,
            /* p2p configuration */
            .desc.k2_desc.esp.p2p_store = false,
            .desc.k2_desc.esp.p2p_nsrcs = 1,
            .desc.k2_desc.esp.p2p_srcs = {"k1.0","","",""},
        },
    };
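In the configuration above, k1 stores over p2p and k2 names k1 as its single source. A small checker can make that relationship explicit. This is a sketch built on assumptions: `p2p_cfg_t` is a hypothetical, flattened version of the p2p fields on the slide, and the rule that every named source must appear in the array with `p2p_store` enabled is inferred from the example, not taken from ESP documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SRCS 4  /* the slide's example lists four source slots */

/* Hypothetical flattened view of the p2p fields shown on the slide. */
typedef struct {
    const char *devname;
    bool        p2p_store;
    unsigned    p2p_nsrcs;
    const char *p2p_srcs[MAX_SRCS];
} p2p_cfg_t;

/* Look up a device by name; returns NULL when it is not in the array. */
static const p2p_cfg_t *find_dev(const p2p_cfg_t *cfg, unsigned n,
                                 const char *name)
{
    for (unsigned i = 0; i < n; i++)
        if (strcmp(cfg[i].devname, name) == 0)
            return &cfg[i];
    return NULL;
}

/*
 * Assumed consistency rule: each source a consumer pulls from must be
 * part of the same invocation and must have p2p_store enabled.
 */
bool p2p_chain_is_consistent(const p2p_cfg_t *cfg, unsigned n)
{
    assert(cfg != NULL);
    for (unsigned i = 0; i < n; i++) {
        if (cfg[i].p2p_nsrcs > MAX_SRCS)
            return false;
        for (unsigned s = 0; s < cfg[i].p2p_nsrcs; s++) {
            const p2p_cfg_t *src = find_dev(cfg, n, cfg[i].p2p_srcs[s]);
            if (src == NULL || !src->p2p_store)
                return false;
        }
    }
    return true;
}
```

Applied to a two-entry array mirroring the slide (k1.0 storing over p2p, k2.0 pulling from k1.0), the check passes; flipping `p2p_store` on k1.0 to false makes it fail.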

Page 19: Evaluation

Page 20: Experimental setup

• We deploy two multi-accelerator SoCs on FPGA (Xilinx VCU118)
• We execute applications with accelerator chaining and parallelism opportunities
• We compare our SoCs against:
  o Intel i7 8700K processor
  o NVIDIA Jetson TX1
    ▪ 256-core NVIDIA Maxwell GPU
    ▪ Quad-core ARM Cortex A57

Featured accelerators:

• Image classifier (hls4ml)
  o Street View House Numbers (SVHN) dataset from Google
• Denoiser (hls4ml)
  o Implemented as an autoencoder
• Night-vision (Stratus HLS)
  o Noise filtering, histogram, histogram equalization

Page 21: Case studies

Page 22: Efficiency

Chaining accelerators brings energy savings.

Our SoCs achieve better energy efficiency than Jetson and i7.

[Figure: three bar charts of Frames / Joule (normalized), log scale from 0.1 to 100. Panels: Night-Vision and Classifier (1NV+1Cl, 4NV+1Cl, 4NV+4Cl); Denoiser and Classifier (1De + 1Cl); Multi-tile Classifier (1Cl split). Series: memory, p2p, i7 8700k, Jetson TX1.]

Page 23: Performance

Performance improves by up to 4.5 times thanks to:

- Parallelization
- Chaining (p2p)

[Figure: bar chart of Frames / sec (normalized), scale from 0 to 5. Configurations: Cl split in 5, 1NV+1Cl, 2NV+1Cl, 4NV+1Cl, 2NV+2Cl, 4NV+4Cl. Series: memory, p2p.]

Page 24: Memory accesses

Accelerator chaining (p2p) reduces memory accesses by 2 to 3 times

[Figure: bar chart of DRAM accesses (normalized), scale from 0% to 100%. Applications: Multi-tile classifier, Night-vision + classifier, Denoiser + classifier. Series: memory, p2p.]
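One idealized way to see why chaining cuts memory traffic is to count DRAM transfers per frame in a linear pipeline. The model below is an assumption made purely for intuition: it treats every transfer as the same size and ignores everything else, so it does not reproduce the measured results; the function names are hypothetical.

```c
#include <assert.h>

/*
 * Without chaining, every stage loads its input from DRAM and stores
 * its output back: two transfers per stage.
 */
unsigned dram_transfers_memory(unsigned n_stages)
{
    assert(n_stages > 0);
    return 2 * n_stages;
}

/*
 * With p2p chaining, only the first stage loads from DRAM and only the
 * last stage stores to it; intermediate data moves between accelerators.
 */
unsigned dram_transfers_p2p(unsigned n_stages)
{
    assert(n_stages > 0);
    return 2;
}
```

Under these assumptions a two-stage chain halves the DRAM transfers and a three-stage chain cuts them to a third. Real pipelines move differently sized data at each stage, so measured reductions will differ from this count.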

Page 25: Conclusions

ESP4ML is a complete system-level design flow to implement many-accelerator SoCs and to deploy embedded applications on them.

We enhanced ESP with the following features:

• Fully automatic integration in ESP of accelerators specified in C/C++ (Vivado HLS) and Keras/PyTorch/ONNX (hls4ml)
• Minimal API to invoke ESP accelerators
• Reconfigurable activation of accelerator pipelines through efficient point-to-point communication mechanisms

Page 26: ESP4ML

Platform-Based Design of System-on-Chip for Embedded Machine Learning

Davide Giri (www.cs.columbia.edu/~davide_giri), Kuan-Lin Chiu, Giuseppe di Guglielmo, Paolo Mantovani, Luca P. Carloni

DATE 2020

Thank you from the ESP team!

sld.cs.columbia.edu esp.cs.columbia.edu sld-columbia/esp
