Post on 09-Feb-2016
Application-Specific Customization of Parameterized FPGA Soft-Core Processors
David Sheldon (a), Rakesh Kumar (b), Roman Lysecky (c), Frank Vahid (a)*, Dean Tullsen (b)
(a) Department of Computer Science and Engineering, University of California, Riverside
* Also with the Center for Embedded Computer Systems at UC Irvine
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Department of Electrical and Computer Engineering, University of Arizona
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx
David Sheldon, UC Riverside 2 of 22
FPGA Soft Core Processors
  Soft-core processor
    HDL description
    Flexible implementation: FPGA or ASIC
    Technology independent
  [Figure: one HDL description targeting either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]
FPGA Soft Core Processors
  Soft-core processors can have configurable options
    Datapath units
    Cache
    Bus architecture
  Current commercial FPGA soft-core processors
    Xilinx MicroBlaze
    Altera Nios
  [Figure: FPGA containing a μP with cache, FPU, and MAC]
Goal
  Tune FPGA soft-core microprocessor for a given application
  [Figure: the application and the μP parameter values feed synthesis, producing a configured μP on the FPGA that is evaluated for size and time]
MicroBlaze: Xilinx FPGA Soft-Core
  Instantiatable units added to the base MicroBlaze
    Multiplier
    Barrel shifter
    Divider
    FPU
    Cache
  Significant tradeoffs
  Instantiating all units is not necessarily fastest, because the added units lengthen the critical path
  [Figure: application runtime (ms) vs. size (equivalent LUTs) for the configurations base, bs, mul, mul+bs, bs+cache, mul+bs+cache, and FPU]
  [Figure: speedup of base MB, full MB, and optimal MB for aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, raytrace, tblook, ttsprk, and AVG]
Problem
  Need fast exploration
  Synthesis runs can take an hour
  [Figure: exploration chooses parameter values, synthesis (~20-60 mins) produces the configured μP]
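To make the cost of slow synthesis concrete, the sketch below counts the configurations an exhaustive search would have to synthesize. It assumes each of the five units named in this deck is simply on or off and that no other parameters (for example cache size) vary; that simplification and the names `UNITS` and `SYNTHESIS_MINUTES` are illustrative, not taken from the authors' tool.

```python
from itertools import product

# The five optional units named in the slides.
UNITS = ["multiplier", "barrel shifter", "divider", "FPU", "cache"]

# Every on/off combination of the units (assumption: no other parameters vary).
configurations = [
    {unit for unit, enabled in zip(UNITS, flags) if enabled}
    for flags in product([False, True], repeat=len(UNITS))
]

# Per-run synthesis time range quoted on the slide, in minutes.
SYNTHESIS_MINUTES = (20, 60)
low, high = (len(configurations) * minutes for minutes in SYNTHESIS_MINUTES)

print(f"{len(configurations)} configurations, {low}-{high} minutes of synthesis")
```

Under these assumptions an exhaustive sweep needs one synthesis run per configuration, which is why the talk looks for ways to synthesize far fewer.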
This talk
  Two approaches
    Approach 1: Using Traditional CAD Techniques
    Approach 2: Synthesis-in-the-loop
  Results
Constraints on Configurations
  Size constraints may prevent use of all possible units
    Multiplier
    FPU
    Cache
    Barrel shifter
    Divider
  [Figure: MicroBlaze with cache, multiplier, and FPU filling the max area]
Approach 1: Traditional CAD Techniques
  Create a model of the problem
  Solve model with extensive search heuristics
  We will model this problem as a 0-1 knapsack problem
  [Figure: create model (slow, includes synthesis), then explore the model (fast, considers 1000s of configurations) to fit MicroBlaze, cache, multiplier, and FPU within the max area]
Approach 1: Traditional CAD Techniques
  Creating the model
  [Figure: the base MicroBlaze and the MicroBlaze with each single unit (multiplier, cache, divider, barrel shifter, FPU) are synthesized and run with the application to measure size and perf]

                   BS     FPU    MUL    DIV    CACHE
  Perf increment   1.1    0.9    1.2    1.0    1.3
  Size increment   1.4    2.7    1.8    1.1    1.6
  Perf/Size        0.96   0.34   0.63   0.93   0.80
Approach 1: Traditional CAD Techniques
  0-1 knapsack model
    Object's benefit = unit's performance increment / size increment
    Object's weight = unit's size
    Knapsack's size constraint = FPGA size constraint

                   BS     FPU    MUL    DIV    CACHE
  Perf increment   1.1    0.9    1.2    1.0    1.3
  Size increment   1.4    2.7    1.8    1.1    1.6
  Perf/Size        0.96   0.34   0.63   0.93   0.80
Approach 1: Traditional CAD Techniques
  Solved the 0-1 knapsack problem using established methods
    Toth, P., Dynamic Programming Algorithms for the Zero-One Knapsack Problem. Computing 1980
  Running time
    6 MicroBlaze configuration synthesis runs to create model
    O(n*p) to solve model
      n is the number of factors
      p is the available area
    Negligible (seconds) compared to synthesis runtimes (~hour)
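The O(n*p) solve step can be illustrated with the textbook 0-1 knapsack dynamic program. This is a minimal sketch, not the authors' code or the specific algorithm from the cited reference. The benefits are the Perf/Size row from the slides; the integer weights (size increments scaled by 10) and the budget of 45 are hypothetical values chosen only to make the example run.

```python
def knapsack_01(items, capacity):
    """Return (best total benefit, sorted chosen names) for a 0-1 knapsack.

    items: list of (name, weight, benefit) with non-negative integer weights.
    capacity: integer size budget in the same units as the weights.
    """
    n = len(items)
    # best[i][cap] = best benefit using the first i items within budget cap.
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, weight, benefit) in enumerate(items, start=1):
        for cap in range(capacity + 1):
            best[i][cap] = best[i - 1][cap]          # skip item i
            if weight <= cap:
                candidate = best[i - 1][cap - weight] + benefit
                if candidate > best[i][cap]:         # take item i
                    best[i][cap] = candidate
    # Walk back through the table to recover which items were taken.
    chosen = []
    cap = capacity
    for i in range(n, 0, -1):
        name, weight, _ = items[i - 1]
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(name)
            cap -= weight
    return best[n][capacity], sorted(chosen)


# (name, hypothetical integer weight, Perf/Size benefit from the slides)
UNITS = [
    ("BS", 14, 0.96),
    ("FPU", 27, 0.34),
    ("MUL", 18, 0.63),
    ("DIV", 11, 0.93),
    ("CACHE", 16, 0.80),
]

total_benefit, chosen_units = knapsack_01(UNITS, capacity=45)  # 45 is hypothetical
print(chosen_units, round(total_benefit, 2))
```

The two nested loops run over n items and p budget steps, which matches the O(n*p) bound on the slide; the table fill takes far less time than a single synthesis run.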
Approach 1: Traditional CAD Techniques
  Problems
    100s of target FPGAs
      Different hard core resources (multiplier, block RAM)
    Model approach estimates size and performance for two or more units
      MUL speedup 1.3, DIV speedup 1.6: estimated MUL+DIV speedup 1.9
      May really be 1.7
    Model inaccuracies may be large
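The numbers in this example (1.3 and 1.6 giving 1.9) are consistent with adding each unit's gain over the baseline. The sketch below assumes that additive rule, which is inferred from the figures on the slide rather than stated by the authors; the function name is hypothetical.

```python
def additive_speedup_estimate(single_unit_speedups):
    """Estimate a combined speedup by adding each unit's gain over 1.0.

    Assumption: gains combine additively; the slide does not state the rule.
    """
    return 1.0 + sum(speedup - 1.0 for speedup in single_unit_speedups)


estimate = additive_speedup_estimate([1.3, 1.6])  # MUL and DIV, from the slide
measured = 1.7                                    # "may really be" value from the slide
error = estimate - measured

print(f"estimate={estimate:.1f} measured={measured:.1f} error={error:.1f}")
```

The gap between the estimate and the measured value is the kind of multi-unit inaccuracy that a model built only from single-unit measurements cannot see.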
  Device       LUTs    PPCs
  XC2V2000     21504   0
  XC2VP2       2816    0
  XC4VLX80     71680   0
  XC4VLX15     12288   0
  XC2S300E     6140    0
  XC2V4000     46080   0
  XC2VP40      38784   2
  XC4VSX25     20480   0
  XC4VSX35     30720   0
  XC4VFX20     17088   1
  XC2S150E     3456    0
  XC2VP30      27392   2
  XC4VLX60     53248   0
  XC2S600E     13824   0
  XC2VP20      18560   2
  XC2V500      6144    0
  XC2VPX70     66176   2
  XC4VLX40     36864   0
  XC2V6000     67584   0
  XC4VFX60     50560   2
  XC4VFX100    84352   2
  XC2VP4       6016    1
  XC2VP70      66176   2
Approach 2: Synthesis-in-the-Loop
  Problem with traditional CAD approach
    100s of target FPGAs
    Model approach estimates size and performance for two or more units
    Model inaccuracies may be large
  Solution: synthesis in the loop
    No abstract model
    Guided by actual size and performance data
    But slow: can only explore a few configurations
  [Figure: traditional flow (create model, then explore the model) compared with synthesis-in-the-loop, where exploration drives synthesis and execution and receives size and perf back; each iteration takes 10s of minutes]
Approach 2: Synthesis-in-the-Loop
  First pre-analyze units to guide heuristic
    Same calculations as when creating model for knapsack
  [Figure: size and perf measured for the multiplier, cache, divider, barrel shifter, and floating point unit]

                   BS     FPU    MUL    DIV    CACHE
  Perf increment   1.1    0.9    1.2    1.0    1.3
  Size increment   1.4    2.7    1.8    1.1    1.6
  Perf/Size        0.96   0.34   0.63   0.93   0.80
Approach 2: Synthesis-in-the-Loop
  Build "impact-ordered tree" structure
    Tree is specific to given application

              BS     FPU    MUL    DIV    CACHE
  Perf/Size   0.96   0.34   0.63   0.93   0.80

  Sort by Perf/Size to get the application-specific impact ordering

              BS     DIV    CACHE  MUL    FPU
  Impact      0.96   0.93   0.80   0.63   0.34
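The sort step amounts to ordering the units by decreasing Perf/Size. A minimal sketch using the values from the slide (the variable names are illustrative):

```python
# Perf/Size values for one application, as listed on the slide.
PERF_PER_SIZE = {"BS": 0.96, "FPU": 0.34, "MUL": 0.63, "DIV": 0.93, "CACHE": 0.80}

# Highest-impact unit first.
impact_order = sorted(PERF_PER_SIZE, key=PERF_PER_SIZE.get, reverse=True)

print(impact_order)
```

Because the Perf/Size values come from running a particular application, a different application would generally produce a different ordering.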
Approach 2: Synthesis-in-the-Loop
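  Run tree-based search heuristic
  [Figure: impact-ordered tree over BS, DIV, CACHE, MUL, and FPU (Perf/Size 0.96, 0.93, 0.80, 0.63, 0.34); each node branches to Include or Not Include depending on whether the unit proved useful (yes/no); each step of the exploration runs synthesis and execution to obtain size and perf]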
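One plausible reading of the tree walk is a greedy pass over the units in impact order: try adding the next unit, synthesize and execute, and keep the unit only if the result fits the size limit and runs faster. The slides do not spell out the exact decision rule, so the sketch below is an assumption, not the authors' heuristic. In particular, `toy_synthesize_and_run` is a hypothetical stand-in for real synthesis and execution, built from the single-unit numbers on the slides, and the size limit of 2.2 (relative to the base core) is made up.

```python
import math

# Single-unit measurements from the slides, relative to the base MicroBlaze.
PERF_INC = {"BS": 1.1, "FPU": 0.9, "MUL": 1.2, "DIV": 1.0, "CACHE": 1.3}
SIZE_INC = {"BS": 1.4, "FPU": 2.7, "MUL": 1.8, "DIV": 1.1, "CACHE": 1.6}
IMPACT_ORDER = ["BS", "DIV", "CACHE", "MUL", "FPU"]  # decreasing Perf/Size


def toy_synthesize_and_run(config):
    """Hypothetical stand-in for synthesis + execution: (size, runtime).

    A real flow would invoke the vendor tools and run the application.
    """
    size = 1.0 + sum(SIZE_INC[unit] - 1.0 for unit in config)
    runtime = 1.0 / math.prod(PERF_INC[unit] for unit in config)
    return size, runtime


def impact_ordered_search(order, evaluate, max_size):
    """Greedy walk: one evaluation per unit, keep a unit only if it helps."""
    config = frozenset()
    _, best_runtime = evaluate(config)  # base core, known from pre-analysis
    runs = 0
    for unit in order:
        trial = config | {unit}
        size, runtime = evaluate(trial)
        runs += 1
        if size <= max_size and runtime < best_runtime:
            config, best_runtime = trial, runtime  # "Include" branch
        # otherwise take the "Not Include" branch and move on
    return config, runs


selected, synthesis_runs = impact_ordered_search(
    IMPACT_ORDER, toy_synthesize_and_run, max_size=2.2  # hypothetical limit
)
print(sorted(selected), synthesis_runs)
```

With five units the walk performs at most five evaluations, which lines up with the five exploration runs quoted later in the deck; the benefit over the knapsack model is that each decision uses measured size and performance for the actual combination.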
Comparison of Approaches
  Approach 1: Traditional CAD
    6 synthesis runs to build model
    O(np) knapsack solution
    Examines thousands of configurations during exploration
  Approach 2: Synthesis in the loop
    11 synthesis runs (6 pre-analysis, 5 exploration)
    Examines (at most) 5 configurations during exploration
Results
  10 EEMBC and Powerstone benchmarks
    aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
  Average results shown, on Virtex 2 Pro, for particular size constraint
  [Figure: tool run time (min, 0 to 800) and speedup (1 to 2.5) for Exhaustive, App-Spec, and Knapsack]
  Application-specific impact-ordered tree approach yields near-optimal results in acceptable tool runtime
  Knapsack sub-optimality due to multi-unit estimation inaccuracy
Results
  Obtained results for six different size constraints
  Results shown for a second size constraint
  Similar findings for all six constraints
  [Figure: tool run time (min, 0 to 800) and speedup (1 to 2.5) for Exhaustive, App-Spec, and Knapsack]
Results
  Also ran for different FPGA
    Xilinx Spartan2
    Similar findings
  [Figure: tool run time (min, 0 to 300) and speedup (1 to 1.6) for Exhaustive, App-Spec, and Knapsack]
Conclusions
  Synthesis-in-the-loop approach outperformed traditional CAD approach
    Better results
    Slightly longer runtime
  Application-specific impact-ordered tree heuristic served well for synthesis-in-the-loop approach
  Future
    Extend for highly-configurable soft-core processors, and for multiple processors competing for and/or sharing resources
Questions?