5th International Conference, HiPEAC 2010
MEMORY-AWARE APPLICATION MAPPING ON COARSE-GRAINED RECONFIGURABLE ARRAYS
Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**,
Jonghee Yoon and Yunheung Paek
** Compiler and Microarchitecture Lab, Center for Embedded Systems,
Arizona State University, Tempe, AZ, USA
* Embedded Systems Research Lab, ECE, Ulsan Nat’l Institute of Science & Tech,
Ulsan, Korea
Software Optimization And Restructuring, Department of Electrical Engineering,
Seoul National University, Seoul, South Korea
2010-01-25
Coarse-Grained Reconfigurable Array (CGRA)
SO&R and CML Research Group
High computation throughput
Low power consumption and scalability
High flexibility with fast configuration

Category    Processor       MIPS    mW     MIPS/mW
Embedded    Xscale          1250    1600   0.78
DSP         TI TM320C6455   9.57    3.3    2.9
DSP (VLIW)  TI TM320C614T   4.711   0.67   7
* CGRA shows 10~100 MIPS/mW
Array of PEs
Mesh-like network
Each PE operates on the results of its neighbor PEs
Executes computation-intensive kernels
Application mapping in CGRA
Mapping a DFG onto the PE array mapping space
Should satisfy several conditions
- Nodes should be mapped onto PEs that have the right functionality
- Data transfer between nodes should be guaranteed
- Resource consumption should be minimized for performance
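The conditions above can be sketched as a small validity check. This is only an illustration under simplifying assumptions that are not from the slides: a purely spatial mapping (no time dimension or routing PEs), links only between PEs one mesh step apart, and hypothetical names such as `check_mapping` and `pes_used`.

```python
from typing import Dict, List, Set, Tuple

PE = Tuple[int, int]  # (row, column) position in the PE array


def connected(a: PE, b: PE) -> bool:
    """Assumed mesh link: the two PEs are exactly one step apart."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


def check_mapping(
    node_ops: Dict[str, str],
    edges: List[Tuple[str, str]],
    pe_ops: Dict[PE, Set[str]],
    placement: Dict[str, PE],
) -> List[str]:
    """Return the violated conditions (empty list if the mapping is valid)."""
    problems = []
    # Condition 1: each node sits on a PE with the right functionality.
    for node, op in node_ops.items():
        if op not in pe_ops[placement[node]]:
            problems.append(f"{node}: PE {placement[node]} cannot execute {op}")
    # Condition 2: every DFG edge can transfer its data between the two PEs.
    for src, dst in edges:
        if not connected(placement[src], placement[dst]):
            problems.append(f"{src}->{dst}: PEs are not connected")
    return problems


def pes_used(placement: Dict[str, PE]) -> int:
    """Condition 3 (resource consumption): number of distinct PEs occupied."""
    return len(set(placement.values()))


# Hypothetical 1x3 array: only the two outer PEs can load from memory.
pe_ops = {(0, 0): {"load", "add"}, (0, 1): {"add"}, (0, 2): {"load", "add"}}
node_ops = {"ld0": "load", "ld1": "load", "add": "add"}
edges = [("ld0", "add"), ("ld1", "add")]

good = {"ld0": (0, 0), "ld1": (0, 2), "add": (0, 1)}
bad = {"ld0": (0, 0), "ld1": (0, 1), "add": (0, 2)}

good_problems = check_mapping(node_ops, edges, pe_ops, good)
bad_problems = check_mapping(node_ops, edges, pe_ops, bad)
print(good_problems)
print(bad_problems)
```

The second placement fails on both counts: `ld1` lands on a PE without load capability, and `ld0` is two steps away from its consumer.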
CGRA execution & data mapping
tc : computation time, td : data transfer time
[Figure: PE array with configuration memory; local memory with four banks (Bk1 to Bk4), each holding two buffers (buf1, buf2) for double buffering; DMA between local memory and main memory]
Total runtime = max(tc, td)
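A minimal sketch of the runtime model on this slide. The double-buffered case follows the formula above; the serialized case (tc + td when computation and transfer cannot overlap) is an assumption added for contrast, and the function name is hypothetical. The sample values are the Tc and Td of the 4r-2m configuration in the partial shutdown example later in the deck.

```python
def tile_runtime(tc: float, td: float, double_buffering: bool = True) -> float:
    """Time per tile for computation time tc and data transfer time td."""
    if double_buffering:
        # The DMA fills one buffer while the PEs compute on the other,
        # so the slower of the two activities sets the pace.
        return max(tc, td)
    # Assumed baseline: transfer and computation run one after the other.
    return tc + td


overlapped = tile_runtime(180, 288)
serialized = tile_runtime(180, 288, double_buffering=False)
print(overlapped, serialized)
```

With td larger than tc, reducing tc further does not change the overlapped runtime, which is why the data transfer side matters for mapping.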
The performance bottleneck : Data transfer
Many multimedia kernels have a larger td than tc
On average, tc accounts for only 22% of the total (tc + td)
[Figure: stacked bar chart of data transfer time and computation time (0% to 100%) for swim_calc1, swim_calc2, *compress, *lowpass, laplace, form_prediction, wavelet, SOR, *GSR, sobel, and AVERAGE]
Most applications are memory-bound.
< The ratio between tc and td >
100% = tc + td
Computation Mapping & Data Mapping
Duplicate arrays increase data transfer time
[Figure: DFG in which two loads, LD S[i] and LD S[i+1], feed an addition, shown next to a local memory that holds S[i] and S[i+1]]
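One way to see the effect is to count how many banks each array ends up in for a given placement of memory operations. The sketch below assumes that each memory operation can only reach the bank attached to its own PE row (as in the target architecture described in the experimental setup), so an array accessed from two rows needs a copy in two banks. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def banks_per_array(mem_ops: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """mem_ops: (array name, bank reached by the PE running the operation)."""
    banks = defaultdict(set)
    for array, bank in mem_ops:
        banks[array].add(bank)
    return dict(banks)


def duplicate_copies(mem_ops: List[Tuple[str, str]]) -> int:
    """Extra array copies beyond one per array; each one adds transfer time."""
    return sum(len(b) - 1 for b in banks_per_array(mem_ops).values())


# LD S[i] and LD S[i+1] placed on rows served by different banks.
split = [("S", "Bank1"), ("S", "Bank2")]
# Both loads placed on rows served by the same bank.
shared = [("S", "Bank1"), ("S", "Bank1")]

print(duplicate_copies(split), duplicate_copies(shared))
```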
Contributions of this work
First approach to consider computation mapping and data mapping together
- balance tc and td
- minimize duplicate arrays (maximize data reuse)
- balance bank utilization
Simple yet effective extension
- a set of cost functions
- can be plugged into existing compilation frameworks
- e.g., EMS (edge-centric modulo scheduling)
Application mapping flow
[Figure: flow diagram. The DFG goes through Performance Bottleneck Analysis (producing DCR) and Data Reuse Analysis (producing DRG), then Memory-aware Modulo Scheduling, which produces the Mapping]
Preprocessing 1 : Performance bottleneck analysis
Determines whether computation or data transfer limits the overall performance
Calculates the DCR (data-transfer-to-computation time ratio): DCR = td / tc
DCR > 1 : the loop is memory-bound
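A small sketch of this check. The function names are hypothetical; the sample input reuses the average share reported earlier in the deck (tc is 22% of tc + td, so td is 78%).

```python
def dcr(td: float, tc: float) -> float:
    """Data-transfer-to-computation time ratio."""
    return td / tc


def is_memory_bound(td: float, tc: float) -> bool:
    """A loop is memory-bound when DCR > 1."""
    return dcr(td, tc) > 1


# Average kernel from the bottleneck slide: tc = 22%, td = 78% of tc + td.
average_dcr = dcr(td=78, tc=22)
print(f"DCR = {average_dcr:.2f}, memory-bound: {is_memory_bound(78, 22)}")
```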
Preprocessing 2 : Data reuse analysis
Finds the amount of potential data reuse
Creates a DRG (Data Reuse Graph)
- nodes correspond to memory operations
- edge weights approximate the amount of reuse
The edge weight is estimated as TS - rd
- TS : the tile size
- rd : the reuse distance in iterations
[Figure: DRG with nodes S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], and R2[i]]
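The edge-weight rule can be sketched as follows, using the memory references shown in the DRG figure. Several details here are assumptions rather than statements from the slides: edges are created only between references to the same array, the reuse distance is taken as the difference between index offsets, edges with non-positive weight are dropped, and the tile size of 64 is an arbitrary illustrative value.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Ref = Tuple[str, int]  # (array name, offset), meaning array[i + offset]


def build_drg(refs: List[Ref], tile_size: int) -> Dict[Tuple[Ref, Ref], int]:
    """Return DRG edges as {(ref_a, ref_b): weight} with weight = TS - rd."""
    edges = {}
    for (arr_a, off_a), (arr_b, off_b) in combinations(refs, 2):
        if arr_a != arr_b:
            continue  # assumed: no reuse between different arrays
        reuse_distance = abs(off_a - off_b)
        weight = tile_size - reuse_distance
        if weight > 0:
            edges[((arr_a, off_a), (arr_b, off_b))] = weight
    return edges


refs = [("S", 0), ("S", 1), ("D", 0), ("R", 0), ("S", 5), ("D", 10), ("R2", 0)]
drg = build_drg(refs, tile_size=64)
for (a, b), w in drg.items():
    print(f"{a[0]}[i+{a[1]}] -- {b[0]}[i+{b[1]}] : {w}")
```

References that are close together, such as S[i] and S[i+1], get the heaviest edges, so placing them in the same bank preserves the most reuse.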
Application mapping flow
DCR & DRG are used for cost calculation
Mapping with data reuse opportunity cost
[Figure: example schedule on PE0 to PE3 for a DFG whose memory operations access A[i], A[i+1], B[i], and B[i+1]; PE array with a local memory of two banks (Bank1, Bank2); three tables of candidate-slot costs: memory-unaware cost, data reuse opportunity cost (DROC), and new total cost]
New total cost = memory-unaware cost + DROC
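A rough sketch of how the combined cost could steer a placement decision. Only the sum (memory-unaware cost + DROC) comes from the slide. The rule used here for when a DROC is charged (the candidate's bank does not hold an array that the DRG links to the one being placed), the slot names, and all numbers are illustrative assumptions.

```python
from typing import Dict, Set, Tuple


def droc(candidate_bank: str, related_banks: Set[str], penalty: int) -> int:
    """Assumed rule: charge the penalty when the candidate misses the reuse."""
    if not related_banks:
        return 0  # nothing to reuse yet, so no opportunity is lost
    return 0 if candidate_bank in related_banks else penalty


def total_costs(
    candidates: Dict[str, Tuple[str, int]],
    related_banks: Set[str],
    penalty: int,
) -> Dict[str, int]:
    """candidates: {slot: (bank reached from that slot, memory-unaware cost)}."""
    return {
        slot: cost + droc(bank, related_banks, penalty)
        for slot, (bank, cost) in candidates.items()
    }


# Placing the load of A[i+1] after A[i] has already been put in Bank1.
candidates = {"slot_a": ("Bank1", 50), "slot_b": ("Bank2", 40)}
totals = total_costs(candidates, related_banks={"Bank1"}, penalty=20)
best = min(totals, key=totals.get)
print(totals, best)
```

In this made-up case the memory-unaware cost alone would pick `slot_b`, while the total cost prefers `slot_a` and keeps array A in a single bank.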
BBC (Bank Balancing Cost)
To prevent allocating all data to just one bank
BBC(b) = β × A(b)
β : the base balancing cost (a design parameter)
A(b) : the number of arrays already mapped onto bank b
[Figure: example with β = 10 on a PE array with a local memory of two banks (Bank1, Bank2); array A (A[i], A[i+1]) is already mapped, and the two candidate slots (Cand) for the next memory operation receive BBC values of +10 and +0]
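The BBC formula is simple enough to state directly in code. The sketch below uses β = 10 as in the example; the bank contents (array A already in Bank1, Bank2 empty) are an assumed reading of the figure, and the function name is hypothetical.

```python
from typing import Dict, Set


def bbc(bank: str, arrays_in_bank: Dict[str, Set[str]], beta: int) -> int:
    """BBC(b) = beta * A(b), where A(b) counts arrays already mapped onto b."""
    return beta * len(arrays_in_bank.get(bank, set()))


# A[i] and A[i+1] belong to the same array A, so Bank1 holds one array.
arrays_in_bank = {"Bank1": {"A"}, "Bank2": set()}
costs = {bank: bbc(bank, arrays_in_bank, beta=10) for bank in arrays_in_bank}
print(costs)
```

Because the cost grows with every array already in a bank, a new array is nudged toward the emptier bank unless reuse considerations outweigh it.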
Application mapping flow
[Figure: flow diagram with the stages DFG, Performance Bottleneck Analysis (DCR), Data Reuse Analysis (DRG), Memory-aware Modulo Scheduling, and Mapping, extended with a Partial Shutdown Exploration stage]
Partial Shutdown Exploration
For a memory-bound loop, performance is often limited by memory bandwidth rather than by computation, so computation resources are in surplus.
Partial shutdown exploration over the PE rows and the memory banks finds the configuration that gives the minimum EDP (Energy-Delay Product)
Example of partial shutdown exploration
Config   Tc    Td    R     E       R*E
4r-2m    180   288   288   10.46   3012
2r-2m    270   288   288   10.01   2882
[Figure: mappings of a DFG with nodes 0 to 8 (including LD S[i], LD S[i+1], LD D[i], and ST R[i]) for the < 4 row - 2 bank > and < 2 row - 2 bank > configurations; in both, S[…] is in one bank and D[…], R[…] are in the other]
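The selection step in the example can be sketched as below, using the Tc, Td, and E values from the table. The runtime R is taken as max(Tc, Td), which matches both the R column and the double-buffering formula earlier in the deck. The dictionary layout and function name are hypothetical.

```python
from typing import Dict


def edp(tc: float, td: float, energy: float) -> float:
    """Energy-delay product, with runtime R = max(tc, td)."""
    runtime = max(tc, td)
    return runtime * energy


# Values copied from the example table (rows-banks configuration).
configs: Dict[str, Dict[str, float]] = {
    "4r-2m": {"tc": 180, "td": 288, "energy": 10.46},
    "2r-2m": {"tc": 270, "td": 288, "energy": 10.01},
}

edps = {name: edp(**params) for name, params in configs.items()}
best = min(edps, key=edps.get)
print(edps, best)
```

Shutting down two PE rows raises Tc from 180 to 270, but the loop stays limited by Td = 288, so the runtime is unchanged while the energy drops.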
Experimental Setup
A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks
Target architecture
- 4x4 heterogeneous CGRA (4 memory-accessible PEs)
- 4 memory banks, one connected to each row
- Each PE is connected to its four neighbors and four diagonal ones
Compared with other mapping flows
- Ideal : memory-unaware mapping + single-bank memory architecture
- MU : memory-unaware mapping (*EMS) + multi-bank memory architecture
- MA : memory-aware mapping + multi-bank memory architecture
- MA + PSE : MA + partial shutdown exploration
* Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT 08
Runtime comparison
Compared with MU, MA reduces the runtime by 30%
[Figure: bar chart of normalized runtime (0 to 1) for Ideal, MU, MA, and MA+PSE on form_pred, laplace, sobel, SOR, swim_calc1, swim_calc2, wavelet, *compress, *GSR, *lowpass, and AVERAGE]
Energy consumption comparison
MA + PSE reduces energy consumption by 47%.
[Figure: bar chart of normalized energy (0 to 1) for MU, MA, and MA+PSE on form_pred, laplace, sobel, SOR, swim_calc1, swim_calc2, wavelet, *compress, *GSR, *lowpass, and AVERAGE]
Conclusion
CGRAs provide very high power efficiency while being software programmable.
While previous solutions have focused on computation speed, we also consider data transfer to achieve higher performance.
We proposed an effective heuristic that considers the memory architecture.
It achieves a 62% reduction in the energy-delay product, which factors into 47% and 28% reductions in energy consumption and runtime, respectively.
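The way the two reductions combine into the EDP figure can be checked with a short calculation. Since EDP is energy times delay, the remaining fractions multiply; the function name is hypothetical.

```python
def edp_reduction(energy_reduction: float, runtime_reduction: float) -> float:
    """Reduction in energy * delay when both factors shrink independently."""
    remaining = (1 - energy_reduction) * (1 - runtime_reduction)
    return 1 - remaining


combined = edp_reduction(0.47, 0.28)
print(f"{combined:.4f}")  # 0.53 * 0.72 = 0.3816 remains
```

The result, about 0.62, matches the 62% reduction quoted above.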
Thank you for your attention!