Modelling and Mapping Framework for Coarse-Grained Programmable Architectures
E. Barbudo, E. Dokladalova and Th. Grandpierre
Laboratoire d'informatique de l'Institut Gaspard Monge, Université Gustave Eiffel
ESIEE Paris
14th Summer School on Modelling and Verification of Parallel Processes
June 26, 2020
Context
Machine Vision for Future Construction Sites (DiXite project [1], I-SITE impulsion project).
(a) Autonomous construction machinery [2]. (b) Autonomous unmanned aerial vehicles [3].
Figure 1: Examples of time-critical applications.
Embedded vision systems:
Time-critical execution → strict latency constraints.
Multi-processing capabilities → adaptable or programmable computing support.
Existing solutions: GPPs, GPUs, FPGAs and CGRAs.
Target hardware family in our context: Coarse-grained ProgrammableArchitectures (CGPA).
GPP = General Purpose Processor, GPU = Graphics Processing Unit, FPGA = Field Programmable Gate Array, CGRA = Coarse-Grained Reconfigurable Architecture
[1] https://dixite.future-isite.fr/
[2] https://www.builtrobotics.com/
[3] https://www.geospatialworld.net/blogs/drones-to-propel-new-technological-innovations-in-the-construction-industry/
Contents
1 Introduction
  Coarse-Grained Reconfigurable Architectures
  Coarse-Grained Programmable Architectures
2 Methodology
  Models
  Mapping algorithms
  Performance analysis
  Experimental study
  Preliminary results
3 Conclusions
Coarse-Grained Reconfigurable Architectures
Programmable, configuration-driven computing fabric with fixed post-fabrication flexibility.
There are many variations of CGRAs, depending upon the interconnection, the type of processing elements, or the method of reconfiguration.
The main application of a CGRA is to execute the inner loops of an application.
[Figure: a grid of processing elements (PEs) connected to a data memory and an instruction memory.]
(a) Generic structure of a CGRA.
[Figure: an ALU with a register file and an output register.]
(b) Generic processing element of a CGRA.
Figure 2: Example of a generic CGRA [1].
[1] M. Hamzeh et al., “EPIMap: Using Epimorphism to map applications on CGRAs”. Design Automation Conference, 2012, pp. 1280–1287.
Coarse-Grained Programmable Architectures
Heterogeneous linear array of hardware resources with mixed granularity.
Multiple configurations, multiple data (MCMD).
Used as a coprocessor with loose coupling and no shared resources.
A CGRA with fixed hardware resources and a set of parameters to program.
[Figure: a linear array of fixed hardware resources (sensors, GPP, HW accelerators 1–6, memories, actuators) linked by interconnections under configuration control; every hardware resource is fixed, only its parameters need to be configured.]
Figure 3: Global architecture of a generic CGPA.
Real life CGPA - motivation hardware
The morphological co-processing unit (MCPU) is dedicated to image processing.
+ High performance
+ Flexibility
− Manual mapping
[Figure: the Bart_proc peripheral (pcore): an MPMC with VFBC read/write channels feeding input/output FIFO buffers, a PLB interface, configuration registers, and the Large SE and geodesic pipelines selected through input/output multiplexers.]
(a) Architecture of the Morphological co-processing unit.
[Figure: each stage chains two Large SE erosion/dilation units and an ALU selected by Mux_1/Mux_2; stages 1 to n exchange 8-bit data with 1-bit acknowledge and 1-bit FIFO-full flags.]
(b) Large SE pipeline basic stage of the MCPU.
Figure 4: Morphological co-processing unit [2].
[2] J. Bartovsky et al., “Morphological co-processing unit for embedded devices”. JRTIP, pp. 1–12, Jul 2015.
Modelling and mapping framework for CGPA
Hardware-agnostic, usable with most of the CGPAs on the market.
Easy reuse of CGPAs.
[Figure: hardware features feed the CGPA model (configuration and latency models); applications feed the tasks-and-data-dependency model; both enter mapping and scheduling, which produces the implementation model for performance analysis and generation of the overall configuration context.]
Figure 5: Overall scheme of our modelling and mapping framework for CGPA.
Application model
G_APP(T, D) is a directed hypergraph, where each task t_i ∈ T is further described as (type_i, p_i).
type_i represents the transformation applied to the data, and p_i is a vector of parameters.
[Figure: a linear pipeline of tasks t0–t4, e.g. erosion (square, 3), Sobel, dilation (square, 3).]
Figure 6: Generic example of an application graph.
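As a toy illustration (not from the slides), the application model can be held as a task table plus an edge list; all task names and types here are hypothetical:

```python
# Hypothetical sketch of G_APP(T, D): each task t_i carries (type_i, p_i);
# hyperedges are approximated by plain directed edges for simplicity.
tasks = {
    "t0": ("input", []),
    "t1": ("erosion", ["square", 3]),
    "t2": ("sobel", []),
    "t3": ("dilation", ["square", 3]),
    "t4": ("output", []),
}
edges = [("t0", "t1"), ("t1", "t2"), ("t2", "t3"), ("t3", "t4")]

def successors(task, edges):
    """Tasks consuming the output of `task` (needed later by the heuristic)."""
    return [dst for src, dst in edges if src == task]

print(successors("t1", edges))  # ['t2']
```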
Hardware model
G_HW(S, K) is a directed hypergraph, where S represents the hardware resources and K the set of oriented hyperedges.
[Figure: hierarchy of S, partitioned into s_cfg and resource classes such as R_C, R_P, R_M, R_MUX, R_RD, R_WR, R_CR and R_INTERFACE.]
Figure 7: General hierarchy of the set S .
Each hardware resource is described by its programmable parameters, a configuration cost function and a latency function.
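A hedged sketch of how one node of G_HW could be described in code — the field names and the per-parameter cost model are assumptions for illustration, not the framework's API:

```python
from dataclasses import dataclass, field
from typing import Set, Dict

@dataclass
class Resource:
    """One hardware resource of G_HW: programmable parameters plus
    configuration-cost and latency information (illustrative fields)."""
    name: str
    supported_types: Set[str]                  # task types it complies with
    params: Dict[str, int] = field(default_factory=dict)
    input_latency: int = 0                     # L_IN, in clock cycles
    computing_latency: int = 0                 # L_CL, in clock cycles

    def config_cost(self) -> int:
        # Assumed model: one cycle per programmed parameter.
        return len(self.params)

r = Resource("rP5", {"erosion", "dilation"},
             params={"se_size": 3, "op": 0},
             input_latency=9, computing_latency=2)
print(r.config_cost())  # 2
```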
Hardware model
We propose to differentiate the input latency and the computing latency of each hardware resource.
The input latency is the number of clock cycles needed to read all the samples required to start computing the first result.
The computing latency is the number of clock cycles necessary to produce the result once all input samples are available.
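For intuition, here is our own example (not from the slides): a streaming 3×3 neighborhood operator must buffer two full image lines plus three pixels before it can emit its first result, assuming one sample per clock cycle:

```python
def input_latency(kernel_h: int, kernel_w: int, image_width: int) -> int:
    """Cycles to read all samples needed for the first output of a
    kernel_h x kernel_w neighborhood operator streaming over
    image_width-pixel lines (one sample per cycle assumed)."""
    return (kernel_h - 1) * image_width + kernel_w

print(input_latency(3, 3, 640))  # 1283
```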
Implementation model
G_MAP is obtained by graph transformation between G_APP and G_HW.
G_MAP is a weighted directed hypergraph.
It has a similar structure to G_HW, with fixed parameters.
It is composed of time slots: each time slot is a subset of resources configured to perform a subset of tasks. Hence, when resources are missing, adding a time slot allows the hardware to be reused, reconfiguring it between time slots.
[Figure: time slot 1 maps tasks t0–t3 onto resources r_SNSR0, r1, r2, r_ACTR3 under s_cfg1; on the time axis, the configuration cost precedes the overall input latency plus execution duration.]
Figure 8: Generic example of an implementation graph.
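A minimal sketch of the time-slot idea, with invented numbers: each slot binds tasks to configured resources, and the total configuration cost accumulates one configuration cost per slot:

```python
# Illustrative only: two time slots reusing resource rP1 after a
# reconfiguration; the costs and bindings are invented.
time_slots = [
    {"config_cost": 12, "bindings": {"t0": "rSNSR0", "t1": "rP1", "t2": "rACTR3"}},
    {"config_cost": 12, "bindings": {"t3": "rP1"}},  # rP1 reused in slot 2
]
total_config_cost = sum(slot["config_cost"] for slot in time_slots)
print(total_config_cost)  # 24
```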
Mapping algorithms
Mapping an application onto hardware leads to a very large number of possible implementations.
The mapping process needs to be automated.
There are many mapping algorithms; we propose to use the following two:
An exhaustive algorithm that provides the best result and can be used as a ground reference.
A list-scheduling heuristic based on look-ahead techniques to reduce the exploration time.
Exhaustive algorithm
The principle is to enumerate all the possible mappings.
We consider the possibility of mapping one task in one time slot.
The basic process is:
We start with a topological sorting of the application graph.
We test mapping each task to any available hardware resource that complies with its requirements.
+ Optimal mapping.
− Very long exploration time (from hours to days).
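The enumeration step can be sketched as follows; the compatibility check is reduced to a type match and time-slot handling is omitted, so this is a simplification of the algorithm with invented data, not its implementation:

```python
from itertools import product

# Topologically sorted tasks (name, type) and resources with the task
# types they comply with -- all illustrative.
tasks = [("t0", "read"), ("t1", "erosion"), ("t2", "write")]
resources = {"rRD": {"read"}, "rP1": {"erosion", "dilation"},
             "rP2": {"erosion", "dilation"}, "rWR": {"write"}}

def all_mappings(tasks, resources):
    """Yield every assignment of tasks to compliant, pairwise-distinct resources."""
    candidates = [[r for r, types in resources.items() if ttype in types]
                  for _, ttype in tasks]
    for combo in product(*candidates):
        if len(set(combo)) == len(combo):   # one task per resource
            yield dict(zip((name for name, _ in tasks), combo))

mappings = list(all_mappings(tasks, resources))
print(len(mappings))  # 2 (t1 on rP1 or on rP2)
```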
[Figure: from the application and hardware graphs (inputs), stages 1 and 2 of the exhaustive algorithm try every feasible task-to-resource assignment across time slots, producing the candidate implementation graphs (outputs).]
Figure 9: Exhaustive algorithm principle.
List-scheduling heuristic
Based on look-ahead techniques.
We evaluate the mapping of the successors of the task onto the descendants of the resource.
The basic process is:
We start with a topological sorting of the application graph and a list of the source nodes of the hardware graph, which are the candidates.
We select the first task t_i ∈ G_APP and try to map it to any source node of the hardware graph.
We get the successors of t_i and the descendants of the source node r_j ∈ G_HW, then compute the chance of successfully allocating the successors of t_i onto those descendants, taking the topological distance into account.
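The look-ahead check can be sketched like this: a candidate resource r_j scores higher when more successors of t_i find a compliant resource among r_j's descendants. This is a simplified stand-in for the MSF metric (the distance weighting is omitted), and the graphs and names are invented:

```python
# Descendants of each candidate resource in G_HW, and the task types each
# resource complies with (illustrative data).
descendants = {"rP1": ["rP2", "rMUX", "rWR"], "rP2": ["rMUX", "rWR"]}
complies = {"rP1": {"erosion"}, "rP2": {"dilation"},
            "rMUX": {"mux"}, "rWR": {"write"}}

def lookahead_score(candidate, successor_types):
    """Fraction of t_i's successors that some descendant of `candidate`
    complies with (distance weighting omitted for brevity)."""
    desc = descendants.get(candidate, [])
    hits = sum(any(s in complies[d] for d in desc) for s in successor_types)
    return hits / max(len(successor_types), 1)

print(lookahead_score("rP1", ["dilation", "write"]))  # 1.0
print(lookahead_score("rP2", ["dilation", "write"]))  # 0.5
```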
Probability of mapping success
MSF_j = ( ∑_{b=1}^{x} d · |Q_b| ) / ( C · ∑_{k=1}^{n} |F_k| )   (1)
where:
x is the number of successors of t_i.
n is the number of possible resource candidates.
d is the shortest distance to a resource node that complies with the task successor of interest.
C is the critical path of the subgraph formed by the descendants of all the possible resource candidates.
Q_b is the set of descendants of r_j that comply with the task successor of interest.
F_k is the set of descendants of r_k.
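With the terms above, equation (1) can be evaluated on toy numbers. The grouping of d, C and the two sums is our reconstruction of the garbled slide formula, so treat it as an assumption:

```python
def msf(d_q_pairs, C, f_sizes):
    """Mapping-success factor for a candidate resource r_j.
    d_q_pairs: (d, |Q_b|) for each of the x successors of t_i;
    C: critical path of the subgraph of all candidates' descendants;
    f_sizes: |F_k| for each of the n candidate resources."""
    numerator = sum(d * q for d, q in d_q_pairs)
    denominator = C * sum(f_sizes)
    return numerator / denominator

print(msf([(1, 2), (2, 1)], C=4, f_sizes=[3, 2]))  # 0.2
```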
Performance analysis
Computing cost (CC) as a metric.
It computes the input and computing latencies of all the elements of the critical path.
CC = ∑_{i=1}^{N} ( T_c_i + (CL_i)(TS) + ∑_{j=1}^{|CP_i|−1} ((L_IN_j − 1) · α_j + L_CL_j + 1) )   (2)

where:
N is the number of time slots.
T_c_i is the configuration cost of time slot i.
T_in_i (the third term) is the overall input latency of the resources of time slot i.
T_ex_i = (CL_i)(TS) is the execution duration of time slot i.
CL_i is the worst computing latency of the critical path of time slot i.
TS is the total number of input samples.
Performance analysis
T_in_i = ∑_{j=1}^{|CP_i|−1} ((L_IN_j − 1) · α_j + L_CL_j + 1)   (3)

where
CP_i is the set of resources that belong to the critical path of time slot i.
L_IN_j is the input latency of resource j.
L_CL_j is the computing latency of resource j.
α_j expresses the propagation of computing latency: α_j = max(α_{j−1}, L_CL_{j−1}), where α_{j−1} is the α of the predecessor and L_CL_{j−1} is the computing latency of the predecessor.
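Equations (2) and (3) can be exercised on toy numbers. The slide does not fix α for the first resource, so starting it at 1 is our assumption; latencies and costs below are invented:

```python
def t_in(path):
    """Equation (3): overall input latency over the critical path of one
    time slot. path = [(L_IN_j, L_CL_j), ...]; the sum stops at |CP_i| - 1.
    alpha starts at 1 (assumption) and propagates as max(alpha, L_CL)."""
    total, alpha = 0, 1
    for l_in, l_cl in path[:-1]:
        total += (l_in - 1) * alpha + l_cl + 1
        alpha = max(alpha, l_cl)
    return total

def computing_cost(slots, TS):
    """Equation (2): slots = [(T_c_i, CL_i, path_i), ...]; TS = input samples."""
    return sum(tc + cl * TS + t_in(path) for tc, cl, path in slots)

path = [(9, 2), (3, 1), (1, 0)]
print(t_in(path))                           # (8*1+2+1) + (2*2+1+1) = 17
print(computing_cost([(5, 2, path)], 100))  # 5 + 2*100 + 17 = 222
```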
Evaluation of the methodology
We consider the morphological co-processing unit (MCPU) [2] as a candidate for our framework, together with two applications.
Figure 10: Large SE pipeline basic stage of the MCPU.
Hardware graph
For the purpose of the evaluation, we only use the Large SE pipeline double stage (Fig. 10).
[Figure: G_HW with s_cfg, sensors r_SNSR0 and r_SNSR1, readers r_RD3 and r_RD4, processing resources r_P5–r_P7, r_P10–r_P14, r_P17 and r_P20, multiplexers r_MUX8, r_MUX9, r_MUX15 and r_MUX16, memories r_M2, writers r_WR18 and r_WR19, and actuators r_ACTR21 and r_ACTR22.]
Figure 11: GHW of the morphological co-processor unit.
Application graph
The first application example is a long linear pipeline of tasks. This application exceeds the available resources (Fig. 12a).
The second application example represents a highly parallel task organization (Fig. 12b).
[Figure: application example 1 — nine tasks t0–t8.]
(a) Application example 1.
[Figure: application example 2 — ten tasks t0–t9.]
(b) Application example 2.
Figure 12: Application examples.
Evaluation
We evaluate our algorithms on two metrics: computing cost and exploration time. Table 1 summarizes the results. The values of computing cost are in clock cycles.
Table 1: Algorithms evaluation

Application example 1 (linear set of tasks), 9 nodes:
  Exhaustive: computing cost 62470, exploration time 70 minutes
  Heuristic:  computing cost 62470, exploration time 0.53 seconds (99 % improvement)
Application example 2 (parallel set of tasks), 10 nodes:
  Exhaustive: computing cost 61628, exploration time 19 minutes
  Heuristic:  computing cost 62628, exploration time 0.9 seconds (99 % improvement)
Resulting mappings
Figure 13: Exhaustive final mapping for application example 1.
Figure 14: Heuristic final mapping for application example 1.
Resulting mappings
Figure 15: Exhaustive final mapping for application example 2.
Figure 16: Heuristic final mapping for application example 2.
Conclusions
We have introduced our proposed modelling and mapping framework.
Our models provide the means to abstract the configuration cost andthe heterogeneous latency of a CGPA.
We presented two mapping approaches. The exhaustive approach is able to produce an optimal mapping in terms of latency, but at the cost of a long exploration time. The look-ahead-based heuristic approach provides a mapping in much less exploration time.
Additionally, we define a method for performance analysis based onthe computing cost over the critical path.
Future work
Analyze the complexity of the mapping algorithms.
Include the notion of latency in the heuristic equation.
Extensively evaluate the framework on real hardware examples.
Thank you for your attention