Date posted: 21-Dec-2015
University of Michigan, Electrical Engineering and Computer Science
Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
Manjunath Kudlur, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan

Automated C to Gates Solution
• SoC design
  – 10-100 Gops, 200 mW power budget
  – Low level tools ineffective
• Automated accelerator synthesis for whole application
  – Correct by construction
  – Increase designer productivity
  – Faster time to market
[Figure: app.c synthesized into a set of loop accelerators (LA)]

Streaming Applications
• Data “streaming” through kernels
• Kernels are tight loops
  – FIR, Viterbi, DCT
• Coarse grain dataflow between kernels
  – Sub-blocks of images, network packets
[Figure: H.264 Encoder, from Image to Coded Image. Blocks shown: Motion Estimator, Transform Coder, Quantizer, Inverse Quantizer, Inverse Transform, Motion Predictor]
[Figure: W-CDMA Transmitter, from Data in to Data out. Blocks shown: CRC, Conv./Turbo, Block Interleaver, OVSF Generator, Spreader/Scrambler, RRC Filter, Baseband Transmitter]

Software Overview
[Figure: tool flow with the stages Whole Application, Frontend Analyses, Loop Graph, and System Level Synthesis, producing an Accelerator Pipeline with SRAM Buffers]

Input Specification
• Sequential C program
• Kernel specification
  – Perfectly nested FOR loop
  – Wrapped inside C function
  – All data access made explicit

    row_trans(char inp[8][8], char out[8][8]) {
        for(i=0; i<8; i++) {
            for(j=0; j<8; j++) {
                . . . = inp[i][j];
                out[i][j] = . . . ;
            }
        }
    }
    col_trans(char inp[8][8], char out[8][8]);
    zigzag_trans(char inp[8][8], char out[8][8]);

• System specification
  – Function with main input/output
  – Local arrays to pass data
  – Sequence of calls to kernels

    dct(char inp[8][8], char out[8][8]) {
        char tmp1[8][8], tmp2[8][8];
        row_trans(inp, tmp1);
        col_trans(tmp1, tmp2);
        zigzag_trans(tmp2, out);
    }

[Figure: dataflow inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out]

Performance Specification
• High performance DCT
  – Process one 1024x768 image every 2 ms
  – Given 400 MHz clock
    • One image every 800000 cycles
    • One block every 64 cycles
• Low performance DCT
  – Process one 1024x768 image every 4 ms
  – One block every 128 cycles
• Performance goal: task throughput in number of cycles between tasks
[Figure: input image (1024 x 768) divided into 8 x 8 blocks; each block is one task that flows inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out to produce the output coeffs]
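The cycle budgets on this slide follow from the clock rate and the frame period. The arithmetic can be sketched in C as below, assuming (as the slide's figure suggests) that one task is one 8 x 8 block. The function names are illustrative only and are not part of Streamroller.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available for one image: clock rate (Hz) times frame period (us). */
uint64_t cycles_per_image(uint64_t clock_hz, uint64_t period_us) {
    return clock_hz * period_us / 1000000u;
}

/* Number of block x block tasks in a width x height image. */
uint64_t blocks_per_image(uint64_t width, uint64_t height, uint64_t block) {
    assert(block > 0 && width % block == 0 && height % block == 0);
    return (width / block) * (height / block);
}

/* Whole cycles available per task, rounded down. */
uint64_t cycles_per_block(uint64_t clock_hz, uint64_t period_us,
                          uint64_t width, uint64_t height, uint64_t block) {
    return cycles_per_image(clock_hz, period_us)
         / blocks_per_image(width, height, block);
}
```

For the 400 MHz, 2 ms case this gives 800000 cycles per image and 12288 blocks, so about 65 cycles per block; the slide's 64 is a slightly tighter round figure. The 4 ms case gives about 130 cycles against the slide's 128.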

Building Blocks
• Multifunction Loop Accelerator [CODES/ISSS ’06]
• SRAM buffers
[Figure: Kernels 1 to 4 connected through the intermediate arrays tmp1, tmp2, tmp3; the kernels are implemented by multifunction loop accelerators and the arrays by SRAM buffers]

System Schema Overview
[Figure: Kernels 1 to 5 mapped onto loop accelerators LA 1, LA 2, and LA 3; a timeline shows the kernels (Kernel 1, K2, K3, Kernel 4, Kernel 5) of successive tasks overlapping in time, with the task throughput marked as the interval between tasks]

Cost Components
• Cost of loop accelerator data path
  – Cost of FUs, shift registers, muxes, interconnect
• Initiation interval (II)
  – Key parameter that decides LA cost
    • Low II → high performance → high cost
  – Loop execution time ≈ (trip count) x II
  – Appropriate II chosen to satisfy task throughput
[Figure: kernels K1, K2, K3, each with TC=100, pipelined across successive tasks. High performance: II=1 for every kernel, throughput = 1 task/100 cycles. Low performance: II=2 for every kernel, throughput = 1 task/200 cycles.]
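The relation between trip count, II, and task throughput on this slide can be sketched in C as follows. The helper names are hypothetical, and the model uses only the slide's approximation (execution time ≈ trip count x II), ignoring pipeline fill and drain.

```c
#include <assert.h>

/* Approximate loop execution time: trip count times initiation interval. */
unsigned long loop_cycles(unsigned long trip_count, unsigned long ii) {
    return trip_count * ii;
}

/* Largest II whose loop time still fits in the task interval.
   Returns 0 when even II=1 is too slow for the requested throughput. */
unsigned long max_ii(unsigned long trip_count, unsigned long task_interval) {
    assert(trip_count > 0);
    return task_interval / trip_count;
}
```

With TC=100, a target of one task per 100 cycles forces II=1, while one task per 200 cycles allows II=2, which is the cheaper design point in the figure.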

Cost Components (Contd.)
• Grouping of loops into a multifunction LA
  – More loops in a single LA → LA occupied for longer time in current task
[Figure: kernels K1 to K4, each with TC=100, mapped onto LA 1, LA 2, and LA 3. LA 1 is shared by two kernels (including K4) and is occupied for 200 cycles per task, so throughput = 1 task / 200 cycles.]
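The effect of grouping can be sketched as follows, assuming each LA runs its loops one after another and the busiest LA sets the interval between tasks. The function, the `MAX_LAS` bound, and the loop-to-LA assignment array are illustrative assumptions, not the tool's data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LAS 16  /* arbitrary bound for this sketch */

/* Cycles between tasks when loop k (trip count tc[k], initiation interval
   ii[k]) runs on accelerator la[k]. Each LA is busy for the sum of its
   loops' execution times; the busiest LA limits the task throughput. */
unsigned long task_interval(const unsigned long *tc, const unsigned long *ii,
                            const size_t *la, size_t num_loops,
                            size_t num_las) {
    unsigned long busy[MAX_LAS] = {0};
    unsigned long worst = 0;
    assert(num_las <= MAX_LAS);
    for (size_t k = 0; k < num_loops; k++) {
        assert(la[k] < num_las);
        busy[la[k]] += tc[k] * ii[k];
    }
    for (size_t a = 0; a < num_las; a++) {
        if (busy[a] > worst) worst = busy[a];
    }
    return worst;
}
```

Four loops with TC=100 and II=1 on three LAs, with two of them sharing one LA, give an interval of 200 cycles; giving every loop its own LA brings it back to 100.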

Cost Components (Contd.)
• Cost of SRAM buffers for intermediate arrays
• More buffers → more task overlap → high performance
[Figure: K1, K2, K3 (II=1, TC=100 each) run on LA 1, LA 2, LA 3 and communicate through tmp1 and tmp2. First timeline: the tmp1 buffer is in use by LA 2, which limits how far tasks can overlap. Second timeline: adjacent tasks use different buffers.]

ILP Formulation
• Variables
  – II for each loop
  – Which loops are combined into single LA
  – Number of buffers for temp array
• Objective function
  – Cost of LAs + cost of buffers
• Constraints
  – Overall task throughput should be achieved
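One way the three ingredients above could be written down is sketched here. The notation is an assumption made for illustration and is not the formulation from the talk: T is the required number of cycles between tasks, TC_k and II_k are the trip count and initiation interval of loop k, x_{k,a} in {0,1} assigns loop k to accelerator a, C_a is the cost of accelerator a, and B_t is the number of buffers for temporary array t with per-buffer cost c_t.

```latex
\begin{align*}
\text{minimize}\quad   & \sum_{a} C_a + \sum_{t} c_t\, B_t \\
\text{subject to}\quad & \sum_{a} x_{k,a} = 1
                         && \text{for every loop } k \\
                       & \sum_{k} x_{k,a}\, TC_k\, II_k \le T
                         && \text{for every accelerator } a
\end{align*}
```

As written, the product of x_{k,a} and II_k is not linear, so an actual ILP would have to linearize it (for example with per-II selector variables). The sketch also leaves out how the buffer counts B_t constrain task overlap.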

Non-linear LA Cost
[Figure: relative cost (0 to 1.0) plotted against initiation interval (1 to 20), with IImin and IImax marked]
$II = 1 \cdot II_1 + 2 \cdot II_2 + 3 \cdot II_3 + \dots + 14 \cdot II_{14}$ and $0 \le II_i \le 1$
$Cost(II) = C_1 \cdot II_1 + C_2 \cdot II_2 + C_3 \cdot II_3 + \dots + C_{14} \cdot II_{14}$
$II_{min} \le II \le II_{max}$
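The encoding above turns a cost table lookup into a linear expression. The C sketch below evaluates it under one added assumption that the slide does not state: the II_i are 0/1 selectors with exactly one of them set. The cost values in any call are placeholders, not measured LA costs.

```c
#include <assert.h>
#include <stddef.h>

/* sel[i] stands for II_{i+1}. Returns 1 if exactly one selector is set. */
int is_one_hot(const int *sel, size_t n) {
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        assert(sel[i] == 0 || sel[i] == 1);
        count += sel[i];
    }
    return count == 1;
}

/* II = 1*II_1 + 2*II_2 + ... + n*II_n */
int encoded_ii(const int *sel, size_t n) {
    int ii = 0;
    for (size_t i = 0; i < n; i++) {
        ii += (int)(i + 1) * sel[i];
    }
    return ii;
}

/* Cost(II) = C_1*II_1 + C_2*II_2 + ... + C_n*II_n */
double encoded_cost(const int *sel, const double *cost, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += cost[i] * sel[i];
    }
    return total;
}
```

With a one-hot selection, `encoded_ii` returns the chosen II and `encoded_cost` returns the matching table entry, so the non-linear cost curve is captured without any non-linear term.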

Multifunction Accelerator Cost
• Worst case: no sharing, Cost = Sum
• Realistic case: some sharing, Cost = between Sum and Max
• Best case: full sharing, Cost = Max
[Figure: LA 1 to LA 4 combined with no, some, and full hardware sharing]
• Impractical to obtain accurate cost of all combinations
• $C_{LA} = 0.5 \cdot (SUM_{C_{LA}} + MAX_{C_{LA}})$
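The midpoint estimate on this slide can be sketched in C as follows. The function name and the cost values used with it are illustrative; only the 0.5 * (sum + max) rule comes from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated cost of a multifunction LA built from n single-loop LAs:
   halfway between no sharing (sum of costs) and full sharing (max cost). */
double multifunction_cost(const double *la_cost, size_t n) {
    double sum = 0.0;
    double max = 0.0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        assert(la_cost[i] >= 0.0);
        sum += la_cost[i];
        if (la_cost[i] > max) max = la_cost[i];
    }
    return 0.5 * (sum + max);
}
```

For component costs 1, 2, 3, and 2 the sum is 8 and the max is 3, so the estimate is 5.5. For a single loop the sum and max coincide and the estimate equals that loop's own cost.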

Case Study: “Simple” benchmark
[Figure: loop graph of eight loops, each with TC=256, synthesized at three design points. 512 cycles: every loop at II=1 on four LAs (LA 1 to LA 4). 1792 cycles: IIs of 1, 1, 2, 1, 1, 1, 3, 3 on two LAs (LA 1, LA 2), with 1536 cycles also marked. 2048 cycles: every loop at II=1 on a single LA (LA 1).]

Beamformer
• Beamformer: 10 loops
• Memory cost: 60% to 70%
• Up to 20% cost savings due to hardware sharing in multifunction accelerators
• Systems at lower throughput have over-designed LAs
  – Not profitable to pick a lower performance LA
• Memory buffer cost significant
  – High performance producer consumer better than more buffers

Conclusions
• Automated design realistic for system of loops
• Designers can move up the abstraction hierarchy
• Observations
  – Macro level hardware sharing can achieve significant cost savings
  – Memory cost is significant; need to simultaneously optimize for datapath and memory cost
• ILP formulation tractable
  – Solver took less than 1 minute for systems with 30 loops