Faculty of Computer Science, Institute of Computer Engineering, Chair for Embedded Systems
REGISTER ALLOCATION FOR HIGH-LEVEL SYNTHESIS OF HARDWARE ACCELERATORS TARGETING FPGAS
Gerald Hempel, Christian Hochberger, Jan Hoyer, Thilo Pionteck
Darmstadt, 12 July 2013
Outline
• Motivation
• Design Flow
• Register Allocation
• Results & Discussion
• Summary
Motivation
• FPGAs are a suitable platform for application-specific designs
• Predominant concept → standard processor IP core in combination with application-specific accelerators
• Goal: automatic mapping of high-level application code into HW
– Synthesis result is a combination of HLS and vendor synthesis tools (XST, Synopsys, Altera Synthesizer)
How much effort is required in HLS optimization?
• Evaluation of several register allocation strategies for FPGA-based accelerators on Spartan 6 and Artix 7
• Underlying synthesis tool: Xilinx XST P.49d in ISE 14.4
SpartanMC Soft-Core and Kernel Interface
• SpartanMC processor soft-core for software execution
• Program and data stored in local BRAM
• HW accelerators treated as peripherals
• Peripheral stub used as wrapper for accelerator
• Peripheral stub provides access to memory and peripheral bus
– Parameters transferred via peripheral bus during kernel startup
– Direct BRAM accesses (triggered by pointers and arrays) during kernel execution
[Figure: block diagram with SpartanMC Processor Core, BRAM, Peripheral Interface, Peripheral Stub, and Accelerator]
Typical Workflow
Kernel Extraction using GCC
• GIMPLE passes for analysis and synthesis
[Figure: kernel extraction flow. Labels: Application Sources, GCC Analysis Runs (Intermediate GIMPLE Tree, Analysis & Kernel Estimation), Analysis Transcript, GCC Synthesis Runs (GIMPLE Tree, Patch & HDL Generation), HDL, Binary Objects]
• Analysis:
– Finds the most worthwhile loop
– Omits function calls (inlined functions only)
• Synthesis:
– Uses list scheduling
– Performs high-level register allocation
Register Allocation (Modified Left-Edge)
Register Allocation Strategies
• le_simple: Assigns one register to each GIMPLE variable
• le_full: Minimizes the number of registers
• le_uid: Maps variables with identical GIMPLE ID to one register
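The three strategies can be contrasted with a small sketch. This is not the authors' GCC pass: it assumes each variable has a single lifetime interval [start, end) measured in control steps, and it uses the classic (unmodified) left-edge algorithm as a stand-in for the register-minimizing strategy. The variable names, UIDs, and lifetimes are invented for illustration, and the UID-based mapping does not check whether lifetimes overlap.

```python
from collections import namedtuple

# Hypothetical model: one lifetime interval per GIMPLE variable, in control steps.
Var = namedtuple("Var", "name uid start end")  # live in [start, end)

def le_simple(variables):
    """One register per variable, no sharing."""
    return {v.name: i for i, v in enumerate(variables)}

def le_uid(variables):
    """Variables with the same UID share one register."""
    regs = {}
    return {v.name: regs.setdefault(v.uid, len(regs)) for v in variables}

def le_full(variables):
    """Classic left-edge: sort by start, reuse the first register that is free."""
    free_at = []  # free_at[r] = control step at which register r becomes free
    alloc = {}
    for v in sorted(variables, key=lambda v: (v.start, v.end)):
        for r, t in enumerate(free_at):
            if t <= v.start:
                free_at[r] = v.end
                alloc[v.name] = r
                break
        else:
            # No existing register is free: open a new one.
            free_at.append(v.end)
            alloc[v.name] = len(free_at) - 1
    return alloc

def register_count(alloc):
    return len(set(alloc.values()))

# Invented example lifetimes.
variables = [
    Var("a_1", 1, 0, 2),
    Var("a_2", 1, 2, 4),
    Var("b_1", 2, 1, 3),
    Var("c_1", 3, 3, 5),
    Var("d_1", 4, 4, 6),
]

for strategy in (le_simple, le_uid, le_full):
    print(strategy.__name__, register_count(strategy(variables)))
```

For these invented lifetimes the sketch uses 5, 4, and 2 registers respectively, which shows how aggressively the left-edge packing shares registers compared with the other two mappings.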
Benchmarks
• Typical algorithms for embedded systems:
– base64 encoder, bitreverse, FFT, grayscale filter, IIR filter, haar wavelet transformation, matrix multiplication
• Kernels were generated automatically for compute-intensive parts
• Parameter sweep:
– Area or speed optimization
– Spartan 6 or Artix 7 device
– Allocation strategy (le_full, le_simple, le_2, le_3, le_4, le_5, le_uid)
– 8 benchmarks
• 224 test cases
[Figure: parameter space with axes Optimization, Target Device, and Allocation Strategy]
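The count of 224 test cases follows from the full cross product of the sweep dimensions. A minimal sketch that enumerates it is given here; the list and variable names are chosen for illustration, and the eight benchmark names are taken from the benchmark characteristics table later in the deck.

```python
from itertools import product

optimizations = ["area", "speed"]
devices = ["Spartan 6", "Artix 7"]
strategies = ["le_full", "le_simple", "le_2", "le_3", "le_4", "le_5", "le_uid"]
benchmarks = [
    "base64 encode", "bitreverse", "bitreverse (spec.)", "haar wavelet",
    "FFT", "grayscale filter", "IIR filter", "matrix mult",
]

# Every combination of optimization goal, device, strategy, and benchmark.
test_cases = list(product(optimizations, devices, strategies, benchmarks))
print(len(test_cases))  # 2 * 2 * 7 * 8 = 224
```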
Artix 7 Resource Consumption
[Figure: used LUTs (optimized for area), axis 200 to 1000, per benchmark (base64, bit reverse, haar wavelet, FFT, grayscale, IIR filter, matrix mult, bit reverse spec) for le_full, le_uid, le_simple, le_2, le_3, le_4, le_5]
• Allocation strategies show little effect on resource consumption
• Best results for le_simple and le_uid
Artix 7 Frequency Results
[Figure: achievable clock frequency (optimized for area), axis 150 to 400 MHz, per benchmark (base64, bit reverse, haar wavelet, FFT, grayscale, IIR filter, matrix mult, bit reverse spec) for le_full, le_uid, le_simple, le_2, le_3, le_4, le_5]
• Allocation strategy has a significant influence on the achievable clock frequency
• Results for le_2, le_3, le_4, le_5 are hard to predict (see haar wavelet, bit reverse, and base64 encoder)
• FFT, grayscale filter, IIR filter, and matrix multiplication → negative effect for all allocation strategies except le_uid and le_simple
What is the difference?
Benchmark            #GIMPLE variables  #basic blocks  #branches  #ALU operations  #MUL operations
base64 encode        50                 3              1          21               0
bitreverse           62                 22             11         61               0
bitreverse (spec.)   62                 6              3          61               0
haar wavelet         14                 3              1          7                0
FFT                  51                 6              2          22               4
grayscale filter     29                 3              1          17               2
IIR filter           79                 6              2          29               11
matrix mult          59                 3              1          25               8
• Multiply-accumulate operations are mapped to sequential DSP primitives
• Multiplexer trees caused by variable swaps lead to performance degradation
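The table can be checked against the frequency results with a few lines of code. The sketch below only restates the table values and selects the benchmarks that contain multiplications; the tuple layout and names are illustrative choices. The selection coincides with the four benchmarks reported as negatively affected in the frequency results, which is consistent with, but does not by itself prove, the DSP-related explanation.

```python
# (benchmark, GIMPLE variables, basic blocks, branches, ALU ops, MUL ops)
TABLE = [
    ("base64 encode",      50,  3,  1, 21,  0),
    ("bitreverse",         62, 22, 11, 61,  0),
    ("bitreverse (spec.)", 62,  6,  3, 61,  0),
    ("haar wavelet",       14,  3,  1,  7,  0),
    ("FFT",                51,  6,  2, 22,  4),
    ("grayscale filter",   29,  3,  1, 17,  2),
    ("IIR filter",         79,  6,  2, 29, 11),
    ("matrix mult",        59,  3,  1, 25,  8),
]

# Benchmarks with at least one MUL operation.
with_mul = [row[0] for row in TABLE if row[5] > 0]
print(with_mul)  # ['FFT', 'grayscale filter', 'IIR filter', 'matrix mult']
```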
Artix 7 normalized Frequency
[Figure: normalized frequency, axis 0.4 to 1, for FFT, grayscale filter, matrix mult, and IIR filter in four panels: Artix 7 (Speed), Artix 7 (Area), Spartan 6 (Speed), Spartan 6 (Area); strategies le_uid, le_simple, le_2, le_3, le_4, le_5, le_full]
• Achievable clock frequency optimized for area normalized to one
• Gap of about 30% between le_simple, le_uid and more complex algorithms
Conclusion
For HLS targeting FPGAs:
• Complex register allocation strategies that reuse a single register for many variables are not advisable
– Little effect on area consumption
– May lead to performance degradation (up to 30%)
• A naive allocation strategy gives the most freedom to the synthesis tool and mostly yields better results
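One way to see why heavy register reuse can hurt is to count how many distinct sources write into each register, since every additional source adds an input to the multiplexer in front of that register. The sketch below is a simplified model under stated assumptions: the variable names, the functional-unit names, and both allocations are invented, and real multiplexer cost also depends on how the synthesis tool maps the logic.

```python
from collections import defaultdict

def mux_fan_in(alloc, producer):
    """Number of distinct sources feeding each register."""
    sources = defaultdict(set)
    for var, reg in alloc.items():
        sources[reg].add(producer[var])
    return {reg: len(srcs) for reg, srcs in sources.items()}

# Hypothetical: which unit produces each variable.
producer = {"a": "add0", "b": "mul0", "c": "mem", "d": "add0"}

naive = {"a": 0, "b": 1, "c": 2, "d": 3}   # one register per variable
shared = {"a": 0, "b": 0, "c": 0, "d": 1}  # three variables share register 0

print(mux_fan_in(naive, producer))   # {0: 1, 1: 1, 2: 1, 3: 1}
print(mux_fan_in(shared, producer))  # {0: 3, 1: 1}
```

In the naive mapping every register has a single source and needs no multiplexer, while the shared mapping saves two registers at the price of a three-input multiplexer on register 0.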
Artix 7 normalized LUTs
[Figure: normalized usage of LUTs, axis 0.6 to 1, for FFT, grayscale filter, matrix mult, and IIR filter in four panels: Artix 7 (Speed), Artix 7 (Area), Spartan 6 (Speed), Spartan 6 (Area); strategies le_uid, le_simple, le_2, le_3, le_4, le_5, le_full]
• A simple allocation strategy tends to yield a small resource footprint.
Artix 7 Frequency and LUTs (speed)
[Figure: achievable clock frequency optimized for speed, axis 150 to 400 MHz, for base64, bitreverse, bitreverse spec, fft, grayscale, haar wavelet, iir, matrix mult; strategies le_full, le_uid, le_simple, le_2, le_3, le_4, le_5]
[Figure: used LUTs optimized for speed, axis 200 to 1,000, same benchmarks and strategies]
Spartan 6 Frequency and LUTs (area)
[Figure: achievable clock frequency optimized for area, axis 100 to 350 MHz, for base64, bitreverse, bitreverse spec, fft, grayscale, haar wavelet, iir, matrix mult; strategies le_full, le_uid, le_simple, le_2, le_3, le_4, le_5]
[Figure: used LUTs optimized for area, axis 200 to 1,000, same benchmarks and strategies]
Spartan 6 Frequency and LUTs (speed)
[Figure: achievable clock frequency optimized for speed, axis 100 to 350 MHz, for base64, bitreverse, bitreverse spec, fft, grayscale, haar wavelet, iir, matrix mult; strategies le_full, le_uid, le_simple, le_2, le_3, le_4, le_5]
[Figure: used LUTs optimized for speed, axis 200 to 1,400, same benchmarks and strategies]