Modelling and Mapping Framework for Coarse-Grained Programmable Architectures
E. Barbudo, E. Dokladalova and Th. Grandpierre
Laboratoire d'informatique de l'Institut Gaspard Monge, Université Gustave Eiffel
ESIEE Paris
14th Summer School on Modelling and Verification of Parallel Processes
June 26, 2020
Context
Machine Vision for Future Construction Sites (DiXite project [1], I-SITE impulsion project).
(a) Autonomous construction machinery [2]. (b) Autonomous unmanned aerial vehicles [3].
Figure 1: Examples of time-critical applications.
Embedded vision systems:
Time-critical execution → strict latency constraints.
Multi-processing capabilities → adaptable or programmable computing support.
Existing solutions: GPPs, GPUs, FPGAs and CGRAs.
Target hardware family in our context: Coarse-grained ProgrammableArchitectures (CGPA).
GPP = General Purpose Processor, GPU = Graphics Processing Unit, FPGA = Field Programmable Gate Array, CGRA = Coarse-Grained Reconfigurable Architecture
[1] https://dixite.future-isite.fr/
[2] https://www.builtrobotics.com/
[3] https://www.geospatialworld.net/blogs/drones-to-propel-new-technological-innovations-in-the-construction-industry/
Contents
1 Introduction
  Coarse-Grained Reconfigurable Architectures
  Coarse-Grained Programmable Architectures
2 Methodology
  Models
  Mapping algorithms
  Performance analysis
  Experimental study
  Preliminary results
3 Conclusions
Coarse-Grained Reconfigurable Architectures
Programmable, configuration-driven computing fabric with fixed post-fabrication flexibility.
There are many variations of CGRAs, depending upon the interconnection, the type of processing elements, or the method of reconfiguration.
The main application of a CGRA is to execute the inner loops of an application.
[Figure: a grid of processing elements (PEs) connected to a data memory and an instruction memory.]
(a) Generic structure of a CGRA.
[Figure: an ALU with a register file and an output register.]
(b) Generic processing element of a CGRA.
Figure 2: Example of a generic CGRA [1].
[1] M. Hamzeh et al., “EPIMap: Using Epimorphism to map applications on CGRAs”. Design Automation Conference, 2012, pp. 1280–1287.
Coarse-Grained Programmable Architectures
Heterogeneous linear array of hardware resources with mixed granularity.
Multiple configurations, multiple data (MCMD).
Used as a coprocessor with loose coupling and no shared resources.
A CGRA with fixed hardware resources and a set of parameters to program.
[Figure: a linear array of fixed hardware resources (sensors, GPP, HW accelerators 1–6, memories, actuators) linked by interconnections under configuration control; every hardware resource is fixed, only its parameters need to be configured.]
Figure 3: Global architecture of a generic CGPA.
Real life CGPA - motivation hardware
The morphological co-processing unit (MCPU) is dedicated to image processing.
+ High performance
+ Flexibility
− Manual mapping
[Figure: the Bart_proc peripheral (pcore): an MPMC with VFBC read/write channels feeding input/output FIFO buffers, a PLB interface, configuration registers, and the Large SE and geodesic pipelines selected through input/output multiplexers.]
(a) Architecture of the Morphological co-processing unit.
[Figure: each stage chains two Large SE erosion/dilation units and an ALU selected by Mux_1/Mux_2; stages 1 to n exchange 8-bit data with 1-bit acknowledge and 1-bit FIFO-full flags.]
(b) Large SE pipeline basic stage of the MCPU.
Figure 4: Morphological co-processing unit [2].
[2] J. Bartovsky et al., “Morphological co-processing unit for embedded devices”. JRTIP, pp. 1–12, Jul 2015.
Modelling and mapping framework for CGPA
Hardware-agnostic, usable with most of the CGPAs on the market.
Easy reuse of CGPAs.
[Figure: hardware features feed the CGPA model (configuration and latency models); applications feed the tasks-and-data-dependency model; both enter mapping and scheduling, which produces the implementation model for performance analysis and generation of the overall configuration context.]
Figure 5: Overall scheme of our modelling and mapping framework for CGPA.
Application model
G_APP(T, D) is a directed hypergraph, where each task t_i ∈ T is further described as (type_i, p_i).
type_i represents the transformation applied to the data, and p_i is a vector of parameters.
[Figure: a linear pipeline of tasks t0–t4, e.g. erosion (square, 3), Sobel, dilation (square, 3).]
Figure 6: Generic example of an application graph.
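As a toy illustration (not from the slides), the application model can be held as a task table plus an edge list; all task names and types here are hypothetical:

```python
# Hypothetical sketch of G_APP(T, D): each task t_i carries (type_i, p_i);
# hyperedges are approximated by plain directed edges for simplicity.
tasks = {
    "t0": ("input", []),
    "t1": ("erosion", ["square", 3]),
    "t2": ("sobel", []),
    "t3": ("dilation", ["square", 3]),
    "t4": ("output", []),
}
edges = [("t0", "t1"), ("t1", "t2"), ("t2", "t3"), ("t3", "t4")]

def successors(task, edges):
    """Tasks consuming the output of `task` (needed later by the heuristic)."""
    return [dst for src, dst in edges if src == task]

print(successors("t1", edges))  # ['t2']
```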
Hardware model
G_HW(S, K) is a directed hypergraph, where S represents the hardware resources and K the set of oriented hyperedges.
[Figure: hierarchy of S, partitioned into s_cfg and resource classes such as R_C, R_P, R_M, R_MUX, R_RD, R_WR, R_CR and R_INTERFACE.]
Figure 7: General hierarchy of the set S .
Each hardware resource is described by its programmable parameters, a configuration cost function and a latency function.
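A hedged sketch of how one node of G_HW could be described in code — the field names and the per-parameter cost model are assumptions for illustration, not the framework's API:

```python
from dataclasses import dataclass, field
from typing import Set, Dict

@dataclass
class Resource:
    """One hardware resource of G_HW: programmable parameters plus
    configuration-cost and latency information (illustrative fields)."""
    name: str
    supported_types: Set[str]                  # task types it complies with
    params: Dict[str, int] = field(default_factory=dict)
    input_latency: int = 0                     # L_IN, in clock cycles
    computing_latency: int = 0                 # L_CL, in clock cycles

    def config_cost(self) -> int:
        # Assumed model: one cycle per programmed parameter.
        return len(self.params)

r = Resource("rP5", {"erosion", "dilation"},
             params={"se_size": 3, "op": 0},
             input_latency=9, computing_latency=2)
print(r.config_cost())  # 2
```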
Hardware model
We propose to differentiate the input latency and the computing latency of each hardware resource.
The input latency is the number of clock cycles needed to read all the samples required to start computing the first result.
The computing latency is the number of clock cycles necessary to produce the result once all input samples are available.
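For intuition, here is our own example (not from the slides): a streaming 3×3 neighborhood operator must buffer two full image lines plus three pixels before it can emit its first result, assuming one sample per clock cycle:

```python
def input_latency(kernel_h: int, kernel_w: int, image_width: int) -> int:
    """Cycles to read all samples needed for the first output of a
    kernel_h x kernel_w neighborhood operator streaming over
    image_width-pixel lines (one sample per cycle assumed)."""
    return (kernel_h - 1) * image_width + kernel_w

print(input_latency(3, 3, 640))  # 1283
```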
Implementation model
G_MAP is obtained by graph transformation between G_APP and G_HW.
G_MAP is a weighted directed hypergraph.
It has a similar structure to G_HW, with fixed parameters.
It is composed of time slots: each time slot is a subset of resources configured to perform a subset of tasks. Hence, when resources are missing, adding a time slot allows the hardware to be reused, reconfiguring it between time slots.
[Figure: time slot 1 maps tasks t0–t3 onto resources r_SNSR0, r1, r2, r_ACTR3 under s_cfg1; on the time axis, the configuration cost precedes the overall input latency plus execution duration.]
Figure 8: Generic example of an implementation graph.
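A minimal sketch of the time-slot idea, with invented numbers: each slot binds tasks to configured resources, and the total configuration cost accumulates one configuration cost per slot:

```python
# Illustrative only: two time slots reusing resource rP1 after a
# reconfiguration; the costs and bindings are invented.
time_slots = [
    {"config_cost": 12, "bindings": {"t0": "rSNSR0", "t1": "rP1", "t2": "rACTR3"}},
    {"config_cost": 12, "bindings": {"t3": "rP1"}},  # rP1 reused in slot 2
]
total_config_cost = sum(slot["config_cost"] for slot in time_slots)
print(total_config_cost)  # 24
```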
Mapping algorithms
Mapping an application onto hardware leads to a very large number of possible implementations.
The mapping process needs to be automated.
There are many mapping algorithms; we propose to use the following two:
An exhaustive algorithm that provides the best result and can be used as a ground reference.
A list-scheduling heuristic based on look-ahead techniques to reduce the exploration time.
Exhaustive algorithm
The principle is to enumerate all the possible mappings.
We consider the possibility of mapping one task in one time slot.
The basic process is:
We start with a topological sorting of the application graph.
We test mapping each task to any available hardware resource that complies with its requirements.
+ Optimal mapping.
− Very long exploration time (from hours to days).
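The enumeration step can be sketched as follows; the compatibility check is reduced to a type match and time-slot handling is omitted, so this is a simplification of the algorithm with invented data, not its implementation:

```python
from itertools import product

# Topologically sorted tasks (name, type) and resources with the task
# types they comply with -- all illustrative.
tasks = [("t0", "read"), ("t1", "erosion"), ("t2", "write")]
resources = {"rRD": {"read"}, "rP1": {"erosion", "dilation"},
             "rP2": {"erosion", "dilation"}, "rWR": {"write"}}

def all_mappings(tasks, resources):
    """Yield every assignment of tasks to compliant, pairwise-distinct resources."""
    candidates = [[r for r, types in resources.items() if ttype in types]
                  for _, ttype in tasks]
    for combo in product(*candidates):
        if len(set(combo)) == len(combo):   # one task per resource
            yield dict(zip((name for name, _ in tasks), combo))

mappings = list(all_mappings(tasks, resources))
print(len(mappings))  # 2 (t1 on rP1 or on rP2)
```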
[Figure: from the application and hardware graphs (inputs), stages 1 and 2 of the exhaustive algorithm try every feasible task-to-resource assignment across time slots, producing the candidate implementation graphs (outputs).]
Figure 9: Exhaustive algorithm principle.
List-scheduling heuristic
Based on look-ahead techniques.
We evaluate the mapping of the successors of the task onto the descendants of the resource.
The basic process is:
We start with a topological sorting of the application graph and a list of the source nodes of the hardware graph, which are the candidates.
We select the first task t_i ∈ G_APP and try to map it to any source node of the hardware graph.
We get the successors of t_i and the descendants of the source node r_j ∈ G_HW, then compute the chance of successfully allocating the successors of t_i onto those descendants, taking the topological distance into account.
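The look-ahead check can be sketched like this: a candidate resource r_j scores higher when more successors of t_i find a compliant resource among r_j's descendants. This is a simplified stand-in for the MSF metric (the distance weighting is omitted), and the graphs and names are invented:

```python
# Descendants of each candidate resource in G_HW, and the task types each
# resource complies with (illustrative data).
descendants = {"rP1": ["rP2", "rMUX", "rWR"], "rP2": ["rMUX", "rWR"]}
complies = {"rP1": {"erosion"}, "rP2": {"dilation"},
            "rMUX": {"mux"}, "rWR": {"write"}}

def lookahead_score(candidate, successor_types):
    """Fraction of t_i's successors that some descendant of `candidate`
    complies with (distance weighting omitted for brevity)."""
    desc = descendants.get(candidate, [])
    hits = sum(any(s in complies[d] for d in desc) for s in successor_types)
    return hits / max(len(successor_types), 1)

print(lookahead_score("rP1", ["dilation", "write"]))  # 1.0
print(lookahead_score("rP2", ["dilation", "write"]))  # 0.5
```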
Probability of mapping success
MSF_j = ( ∑_{b=1}^{x} d · |Q_b| ) / ( C · ∑_{k=1}^{n} |F_k| )   (1)
where:
x is the number of successors of t_i.
n is the number of possible resource candidates.
d is the shortest distance to a resource node that complies with the task successor of interest.
C is the critical path of the subgraph formed by the descendants of all the possible resource candidates.
Q_b is the set of descendants of r_j that comply with the task successor of interest.
F_k is the set of descendants of r_k.
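With the terms above, equation (1) can be evaluated on toy numbers. The grouping of d, C and the two sums is our reconstruction of the garbled slide formula, so treat it as an assumption:

```python
def msf(d_q_pairs, C, f_sizes):
    """Mapping-success factor for a candidate resource r_j.
    d_q_pairs: (d, |Q_b|) for each of the x successors of t_i;
    C: critical path of the subgraph of all candidates' descendants;
    f_sizes: |F_k| for each of the n candidate resources."""
    numerator = sum(d * q for d, q in d_q_pairs)
    denominator = C * sum(f_sizes)
    return numerator / denominator

print(msf([(1, 2), (2, 1)], C=4, f_sizes=[3, 2]))  # 0.2
```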
Performance analysis
Computing cost (CC) as a metric.
It computes the input and computing latencies of all the elements of the critical path.
CC = ∑_{i=1}^{N} ( T_c_i + (CL_i)(TS) + ∑_{j=1}^{|CP_i|−1} ((L_IN_j − 1) · α_j + L_CL_j + 1) )   (2)

where:
N is the number of time slots.
T_c_i is the configuration cost of time slot i.
T_in_i (the third term) is the overall input latency of the resources of time slot i.
T_ex_i = (CL_i)(TS) is the execution duration of time slot i.
CL_i is the worst computing latency of the critical path of time slot i.
TS is the total number of input samples.
Performance analysis
T_in_i = ∑_{j=1}^{|CP_i|−1} ((L_IN_j − 1) · α_j + L_CL_j + 1)   (3)

where
CP_i is the set of resources that belong to the critical path of time slot i.
L_IN_j is the input latency of resource j.
L_CL_j is the computing latency of resource j.
α_j expresses the propagation of computing latency: α_j = max(α_{j−1}, L_CL_{j−1}), where α_{j−1} is the α of the predecessor and L_CL_{j−1} is the computing latency of the predecessor.
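Equations (2) and (3) can be exercised on toy numbers. The slide does not fix α for the first resource, so starting it at 1 is our assumption; latencies and costs below are invented:

```python
def t_in(path):
    """Equation (3): overall input latency over the critical path of one
    time slot. path = [(L_IN_j, L_CL_j), ...]; the sum stops at |CP_i| - 1.
    alpha starts at 1 (assumption) and propagates as max(alpha, L_CL)."""
    total, alpha = 0, 1
    for l_in, l_cl in path[:-1]:
        total += (l_in - 1) * alpha + l_cl + 1
        alpha = max(alpha, l_cl)
    return total

def computing_cost(slots, TS):
    """Equation (2): slots = [(T_c_i, CL_i, path_i), ...]; TS = input samples."""
    return sum(tc + cl * TS + t_in(path) for tc, cl, path in slots)

path = [(9, 2), (3, 1), (1, 0)]
print(t_in(path))                           # (8*1+2+1) + (2*2+1+1) = 17
print(computing_cost([(5, 2, path)], 100))  # 5 + 2*100 + 17 = 222
```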
Evaluation of the methodology
We consider the morphological co-processing unit (MCPU) [2] as a candidate for our framework, together with two applications.
Figure 10: Large SE pipeline basic stage of the MCPU.
Hardware graph
For the purpose of the evaluation, we only use the Large SE pipeline double stage (Fig. 10).
[Figure: G_HW with s_cfg, sensors r_SNSR0 and r_SNSR1, readers r_RD3 and r_RD4, processing resources r_P5–r_P7, r_P10–r_P14, r_P17 and r_P20, multiplexers r_MUX8, r_MUX9, r_MUX15 and r_MUX16, memories r_M2, writers r_WR18 and r_WR19, and actuators r_ACTR21 and r_ACTR22.]
Figure 11: GHW of the morphological co-processor unit.
Application graph
The first application example is a long linear pipeline of tasks. This application exceeds the available resources (Fig. 12a).
The second application example represents a highly parallel task organization (Fig. 12b).
[Figure: application example 1 — nine tasks t0–t8.]
(a) Application example 1.
[Figure: application example 2 — ten tasks t0–t9.]
(b) Application example 2.
Figure 12: Application examples.
Evaluation
We evaluate our algorithms on two metrics: computing cost and exploration time. Table 1 summarizes the results. The values of computing cost are in clock cycles.
Table 1: Algorithms evaluation

Application example 1 (linear set of tasks), 9 nodes:
  Exhaustive: computing cost 62470, exploration time 70 minutes
  Heuristic:  computing cost 62470, exploration time 0.53 seconds (99 % improvement)
Application example 2 (parallel set of tasks), 10 nodes:
  Exhaustive: computing cost 61628, exploration time 19 minutes
  Heuristic:  computing cost 62628, exploration time 0.9 seconds (99 % improvement)
Resulting mappings
Figure 13: Exhaustive final mapping for application example 1.
Figure 14: Heuristic final mapping for application example 1.
Resulting mappings
Figure 15: Exhaustive final mapping for application example 2.
Figure 16: Heuristic final mapping for application example 2.
Conclusions
We have introduced our proposed modelling and mapping framework.
Our models provide the means to abstract the configuration cost andthe heterogeneous latency of a CGPA.
We presented two mapping approaches. The exhaustive approach is able to produce an optimal mapping in terms of latency, but at the cost of a long exploration time. The look-ahead-based heuristic approach provides a mapping in much less exploration time.
Additionally, we define a method for performance analysis based onthe computing cost over the critical path.
Future work
Analyze the complexity of the mapping algorithms.
Include the notion of latency in the heuristic equation.
Extensively evaluate the framework on real hardware examples.
Thank you for your attention