UPC
CGO’03, San Francisco
March 2003
Local Scheduling Techniques for Memory Coherence in a Clustered VLIW
Processor with a Distributed Data Cache
Enric Gibert (1)
Jesús Sánchez (2)
Antonio González (1, 2)
(1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona
Motivation
Capacity vs. Communication-bound
Clustered microarchitectures
  – Simpler + faster
  – Power consumption
  – Communications not homogeneous
Clustering embedded/DSP domain
Clustered Microarchitectures
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file (Reg. File) and functional units (FUs), linked by register-to-register communication buses. The first build shows a unified L1 cache above an L2 cache, reached over memory buses; the following builds replace it with L1 cache modules, again with an L2 cache and memory buses.]
Contributions
Distribution of data cache
  – Architecture design + data mapping
    • Word-interleaved scheme [ICS’02]
  – Appropriate scheduling techniques [MICRO’02]
  – Memory coherence
Scheduling techniques for mem. coherence
  – Local software-based techniques
  – Applied to word-interleaved cache
    • Complex conf. (with Attraction Buffers – refer to paper)
    • Simple conf. (without Attraction Buffers)
  – Applicable to any other cache configuration
Talk Outline
Architecture and Scheduling Algorithms
Memory Coherence Problem
Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
Evaluation
Conclusions
Word-Interleaved Distribution
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and a cache module, connected by register-to-register communication buses.]
Cache block in the L2 cache: TAG W0 W1 W2 W3 W4 W5 W6 W7
Subblocks held by the cache modules:
  Cluster 1: TAG W0 W4 (subblock 1)
  Cluster 2: TAG W1 W5
  Cluster 3: TAG W2 W6
  Cluster 4: TAG W3 W7
Access types: local hit, remote hit, local miss, remote miss
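The word-to-module mapping on this slide can be sketched in a few lines. This is only an illustration: the constants (4 clusters, 4-byte words, 32-byte blocks) are taken from the evaluation setup later in the talk, and the function names `home_cluster` and `classify` are hypothetical.

```python
# Assumed parameters: 4 clusters, 4-byte interleaving factor, 32-byte blocks.
N_CLUSTERS = 4
WORD_BYTES = 4
BLOCK_BYTES = 32


def home_cluster(addr):
    """Return the 1-based cluster whose cache module holds the word at addr."""
    word_in_block = (addr % BLOCK_BYTES) // WORD_BYTES  # index of W0..W7
    return word_in_block % N_CLUSTERS + 1


def classify(addr, issuing_cluster, cached):
    """Name the access type from the issuing cluster's point of view."""
    where = "local" if home_cluster(addr) == issuing_cluster else "remote"
    what = "hit" if cached else "miss"
    return f"{where} {what}"


# Rebuild the subblock layout shown on the slide.
layout = {c: [] for c in range(1, N_CLUSTERS + 1)}
for w in range(BLOCK_BYTES // WORD_BYTES):
    layout[home_cluster(w * WORD_BYTES)].append(f"W{w}")
```

Under these assumptions `layout` reproduces the slide: cluster 1 holds W0 and W4, cluster 2 holds W1 and W5, and so on.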
Scheduling Techniques
Cache module contents:
  CLUSTER 1: a[0] a[4]
  CLUSTER 2: a[1] a[5]
  CLUSTER 3: a[2] a[6]
  CLUSTER 4: a[3] a[7]
Original loop:
  for (i=0; i<MAX; i++) {
    ld r3, a[i]
    r4 = OP(r3)
    st r4, b[i]
  }
After unrolling by 4:
  for (i=0; i<MAX; i+=4) {
    ld r31, a[i]     (stride 16 bytes)
    ld r32, a[i+1]   (stride 16 bytes)
    ld r33, a[i+2]   (stride 16 bytes)
    ld r34, a[i+3]   (stride 16 bytes)
    ...
  }
Techniques:
  Modulo scheduling
  Loop unrolling
  Assignment of latencies
  Padding + Profiling
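The effect of unrolling on the loads can be checked with a small sketch. It assumes 4-byte elements (so the stride after unrolling by 4 is 16 bytes, as on the slide) and that padding places a[0] in cluster 1; `cluster_of` and `clusters_touched` are illustrative names, not part of the original work.

```python
N_CLUSTERS = 4
ELEM_BYTES = 4   # assumed element size of a[], equal to the interleaving factor
UNROLL = 4
BASE = 0         # assumed: a[] is padded so that a[0] maps to cluster 1


def cluster_of(addr):
    """1-based cluster holding the word at addr under word interleaving."""
    return (addr // ELEM_BYTES) % N_CLUSTERS + 1


def clusters_touched(offset, step, iterations=8):
    """Set of clusters accessed by the load a[i + offset] when i advances by step."""
    return {
        cluster_of(BASE + (i + offset) * ELEM_BYTES)
        for i in range(0, iterations * step, step)
    }


rolled = clusters_touched(0, 1)                                # ld r3, a[i]
unrolled = [clusters_touched(k, UNROLL) for k in range(UNROLL)]  # ld r31..r34
```

In this model the single load of the rolled loop visits all four clusters, while each of the four unrolled loads always visits one cluster, which is what lets the scheduler place it next to its data.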
Cluster Assignment
Non-memory instructions
  • Minimize register communications
  • Maximize workload balance
Memory instructions: 2 heuristics
  – PrefClus Heuristic
    • Preferred Cluster = most accessed cluster
    • Profiling + Padding
  – MinComs Heuristic
    • Minimize register communications
    • Maximize workload balance
    • Post-pass phase to increase local accesses
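A minimal sketch of the "preferred cluster = most accessed cluster" rule, assuming the profile is available as a list of the clusters touched by each dynamic execution of a memory instruction. The profile data, the instruction names and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def preferred_cluster(profiled_clusters):
    """Most accessed cluster in the profile; ties go to the lowest cluster id."""
    counts = Counter(profiled_clusters)
    return min(counts, key=lambda c: (-counts[c], c))


# Hypothetical profile: cluster touched by each dynamic execution of two loads.
profile = {
    "ld_a": [1, 1, 1, 2, 1, 1],
    "ld_b": [3, 4, 3, 3, 4, 3],
}
assignment = {ins: preferred_cluster(trace) for ins, trace in profile.items()}
```

Here `ld_a` would be assigned to cluster 1 and `ld_b` to cluster 3.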
Memory Coherence Problem
[Figure: clusters 1 to 4, each with a cache module, connected through memory buses to the next memory level. The module of cluster 1 holds a[0] and a[4]; the module of cluster 4 holds a[3] and a[7]. Animation labels: Store to a[0], Update a[0], Read a[0].]
Schedule:
              Cluster 1        Cluster 2   Cluster 3   Cluster 4
  cycle i     -                -           -           store to a[0]
  cycle i+1   -                -           -           -
  cycle i+2   -                -           -           -
  cycle i+3   -                -           -           -
  cycle i+4   load from a[0]   -           -           -
Memory bus traffic: remote accesses, misses, replacements, others
NON-DETERMINISTIC BUS LATENCY!!!
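The hazard in the schedule above can be illustrated with a toy timing model. It assumes the remote store updates a[0] in the home module `bus_delay` cycles after it issues and that the local load reads the module in its issue cycle; the delay values are hypothetical and only stand in for the non-deterministic bus latency.

```python
def load_sees_store(store_cycle, bus_delay, load_cycle):
    """True if the remote store has updated a[0] by the time the load reads it."""
    return store_cycle + bus_delay <= load_cycle


# Cycle i and cycle i+4 from the slide, with i = 0.
STORE_CYCLE, LOAD_CYCLE = 0, 4

# Hypothetical bus delays: lightly loaded, borderline, and contended bus.
outcomes = {d: load_sees_store(STORE_CYCLE, d, LOAD_CYCLE) for d in (2, 4, 6)}
```

With a delay of 6 cycles the load reads a stale a[0], even though the static schedule orders the store first.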
Solutions Outline
Local scheduling solutions applied at a loop granularity
  – Memory Dependent Chains (MDC)
  – Data Dependence Graph Transformations (DDGT)
    • Store replication
    • Load-store synchronization
Software-based solutions
Applicable to other configurations
  – Replicated distributed cache
  – MultiVLIW [MICRO00]
…
Memory Dependent Chains
Sets of aliased instructions:
  – Memory Dependent Chains (MDC)
Instructions in same set:
  – Assigned to same cluster
Restrictions on cluster assignment
  – PrefClus: average preferred cluster
  – MinComs: minimize comms. when scheduling first node
[Figure: data dependence graph with nodes n1 (load), n2 (load), n3 (add), n4 (store), n6 (load), n7 (div), n8 (add), and edges labeled RF, MA and MF.]
Legend: MF = memory-flow, MA = memory-anti, RF = register-flow
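One way to form the chains is to take connected components over the memory dependence edges (MF, MA, MO) and ignore register-flow edges. The sketch below does this with a small union-find; the edge list is hypothetical, since the exact edges of the slide's graph are not recoverable from the text, and this is not necessarily how the authors implemented it.

```python
def memory_dependent_chains(mem_instrs, mem_deps):
    """Group memory instructions connected by memory dependences into chains."""
    parent = {n: n for n in mem_instrs}

    def find(n):
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in mem_deps:
        parent[find(a)] = find(b)

    chains = {}
    for n in mem_instrs:
        chains.setdefault(find(n), []).append(n)
    return sorted(chains.values())


mem_instrs = ["n1", "n2", "n4", "n6"]      # the loads and the store in the figure
mem_deps = [("n1", "n4"), ("n4", "n6")]    # hypothetical MA and MF edges
chains = memory_dependent_chains(mem_instrs, mem_deps)
```

Every instruction in one chain would then be pinned to a single cluster, while `n2`, which has no memory dependence in this made-up edge list, stays free.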
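Memory Dependent Chains
[Figure: clusters 1 to 4, each with a cache module, connected through memory buses to the next memory level. The module of cluster 1 holds a[0] and a[4]; the module of cluster 4 holds a[3] and a[7]. Labels shown together: store to a[0], load from a[0].]
Schedule (as on the Memory Coherence Problem slide):
              Cluster 1        Cluster 2   Cluster 3   Cluster 4
  cycle i     -                -           -           store to a[0]
  cycle i+1   -                -           -           -
  cycle i+2   -                -           -           -
  cycle i+3   -                -           -           -
  cycle i+4   load from a[0]   -           -           -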
DDGT: Store Replication
Overcome MEM_FLOW (MF) and MEM_OUT (MO)
[Figure, MF case: storeA with an MF dependence to loadB. After store replication: storeA, storeA’, storeA’’ and storeA’’’, with the MF dependence to loadB.]
[Figure, MO case: storeA with an MO dependence to storeB. After store replication: storeA, storeA’, storeA’’, storeA’’’ and storeB, storeB’, storeB’’, storeB’’’, with the MO dependence between them.]
Legend: local instance, remote instances
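A rough sketch of what replication means at run time, under the assumption (not stated on the slide) that one instance of the store is placed in every cluster and that only the instance in the home cluster of the address, the local instance, performs the write. The function names and the prime-per-cluster naming are illustrative.

```python
N_CLUSTERS = 4
WORD_BYTES = 4   # assumed interleaving factor


def home_cluster(addr):
    """1-based cluster whose cache module holds the word at addr."""
    return (addr // WORD_BYTES) % N_CLUSTERS + 1


def replicate_store(name):
    """One instance per cluster: name, name', name'', name'''."""
    return {c: name + "'" * (c - 1) for c in range(1, N_CLUSTERS + 1)}


def execute_replicated_store(instances, addr, value, modules):
    """Assumed semantics: only the instance in the home cluster writes."""
    local = None
    for cluster, inst in instances.items():
        if cluster == home_cluster(addr):
            modules[cluster][addr] = value   # local instance updates its module
            local = inst
    return local


modules = {c: {} for c in range(1, N_CLUSTERS + 1)}
instances = replicate_store("storeA")
local_instance = execute_replicated_store(instances, 8, 42, modules)
```

Because the write is always performed by a local instance, a dependent load scheduled in the home cluster no longer relies on memory bus timing; the cost is that the stored value and address must reach every cluster.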
DDGT: Store Replication
[Figure: clusters 1 to 4, each with a cache module, connected through memory buses to the next memory level. The module of cluster 1 holds a[0] and a[4]; the module of cluster 4 holds a[3] and a[7].]
Schedule:
              Cluster 1        Cluster 2       Cluster 3       Cluster 4
  cycle i     -                -               -               store to a[0]
  cycle i+1   store to a[0]    -               store to a[0]   -
  cycle i+2   -                -               -               -
  cycle i+3   -                store to a[0]   -               -
  cycle i+4   load from a[0]   -               -               -
Legend: local instance, remote instances
Increase number of register communications!!!
DDGT: ld-st Synchronization
Overcome MEM_ANTI (MA) dependences
[Figure: loadA has an MA dependence to storeB and an RF dependence to add. After load-store sync., a SYNC dependence takes the place of the MA dependence.]
Special cases:
  – Store is already REG_FLOW dependent on the load
  – Impossible recurrences
[Figure: loadA with an RF dependence to storeC and an MA dependence to storeB, plus an MO dependence between storeB and storeC. After load-store sync., a fake consumer ("fake cons") is added, with RF and SYNC dependences.]
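The transformation can be sketched on an edge list. The sketch assumes that the SYNC edge runs from a register-flow consumer of the load to the store (the slide text does not make the endpoints explicit), covers only the basic case and the first special case, and does not model fake consumers or recurrences. All names are illustrative.

```python
def synchronize(edges):
    """Replace each MA edge load -> store by a SYNC edge consumer -> store.

    edges is a list of (source, destination, kind) tuples.
    Assumption: once a consumer of the load has executed, the load is done,
    so ordering the store after that consumer preserves the MA dependence.
    """
    out = []
    for src, dst, kind in edges:
        if kind != "MA":
            out.append((src, dst, kind))
            continue
        consumers = [d for s, d, k in edges if s == src and k == "RF"]
        if dst in consumers:
            continue  # store already register-flow dependent on the load
        if not consumers:
            raise ValueError("no consumer: a fake consumer would be needed")
        out.append((consumers[0], dst, "SYNC"))
    return out


basic = synchronize([("loadA", "add", "RF"), ("loadA", "storeB", "MA")])
special = synchronize([("loadA", "storeB", "RF"), ("loadA", "storeB", "MA")])
```

In the basic case the MA edge becomes `("add", "storeB", "SYNC")`; in the special case the MA edge is simply dropped because the RF edge already orders the two instructions.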
MDC Solution: Case Study
Impact on compute time
  – May increase the IIres
[Figure: data dependence graph with loadA, loadB, storeC and add, and edges labeled MA, MF, MF and RF. Modulo reservation tables (MRT) over clusters C1 to C4 for the memory instructions A, B and C: IIres=2 when they are spread over clusters, IIres=3 when A, B and C share one cluster.]
Impact on stall time
  – May increase remote accesses
    • Extra stall cycles = 3 cycles / iteration
[Figure annotations: "always accesses data in cluster 1", "always accesses data in cluster 2"; schedule labels "cycle 1" and "cycle 3".]
Latency LH = 1 cycle
Latency RH = 5 cycles
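The figure of 3 extra stall cycles per iteration is consistent with a simple stall-on-use model, sketched below. It assumes the "cycle 1" and "cycle 3" labels are the issue cycles of a load and of its consumer; that reading of the figure is a guess, as are the function and constant names.

```python
LAT_LOCAL_HIT = 1    # Latency LH from the slide
LAT_REMOTE_HIT = 5   # Latency RH from the slide


def stall_cycles(load_cycle, use_cycle, actual_latency):
    """Cycles the consumer waits if the load takes actual_latency cycles."""
    slack = use_cycle - load_cycle
    return max(0, actual_latency - slack)


# Assumed schedule: load issued in cycle 1, its consumer in cycle 3.
stall_if_local = stall_cycles(1, 3, LAT_LOCAL_HIT)
stall_if_remote = stall_cycles(1, 3, LAT_REMOTE_HIT)
```

With 2 cycles between the load and its use, a local hit causes no stall and a remote hit causes 5 - 2 = 3 stall cycles.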
DDGT Solution: Case Study
Impact on compute time
  – More instructions (IIres)
    • Store replication
    • Fake consumers (few)
    • Register communications
[Figure: storeB and loadA with MA and MF dependences. Modulo reservation tables (MRT) over clusters C1 to C4: IIres=2 before the transformation, IIres=3 after an instance of B appears in every cluster. X = set of memory instructions.]
Impact on stall time
  – Small
    • New dependences may decrease slack of some memory instructions
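How replication can raise IIres is easy to see by counting memory operations per cluster. The sketch considers only the memory units (one per cluster, as in the evaluated configuration) and uses a made-up placement of A, B and the X instructions; it is not the exact table from the slide.

```python
from collections import Counter
from math import ceil

MEM_UNITS_PER_CLUSTER = 1   # from the evaluated configuration


def ii_res(placements):
    """Resource-constrained II, counting memory operations per cluster only."""
    per_cluster = Counter(cluster for _, cluster in placements)
    return max(ceil(n / MEM_UNITS_PER_CLUSTER) for n in per_cluster.values())


# Hypothetical placement: (instruction, cluster) pairs.
before = [("A", 1), ("B", 2)] + [(f"X{c}", c) for c in range(1, 5)]
# Store replication adds an instance of B to every other cluster.
after = before + [(f"B@C{c}", c) for c in (1, 3, 4)]

ii_before = ii_res(before)
ii_after = ii_res(after)
```

In this made-up placement the busiest cluster goes from 2 to 3 memory operations, so IIres grows from 2 to 3, matching the direction of the change shown on the slide.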
Evaluation Framework
IMPACT C compiler
  • Compile + optimize + memory disambiguation
Mediabench benchmark suite

  Benchmark   Profile      Execution
  epicdec     test_image   titanic
  g721dec     clinton      S_16_44
  g721enc     clinton      S_16_44
  gsmdec      clinton      S_16_44
  gsmenc      clinton      S_16_44
  jpegdec     testimg      monalisa
  jpegenc     testimg      monalisa
  mpeg2dec    mei16v2      tek6
  pegwitdec   pegwit       techrep
  pegwitenc   pgptest      techrep
  pgpdec      pgptext      techrep
  pgpenc      pgptest      techrep
  rasta       ex5_c1       ex5_c1
Evaluation Framework
Word-Interleaved Cache Clustered VLIW Processor

  # clusters            4
  Functional units      1 FP / cluster + 1 integer / cluster + 1 memory / cluster
  Register buses        4 buses running at ½ the core freq.
  Memory buses          4 buses running at ½ the core freq.
  Cache configuration   8KB, 2-way set-associative, 32 byte blocks; L2 always hits
  Cache latencies       Local Hit=1, Remote Hit=5, Local Miss=10, Remote Miss=15
  Algorithm             PrefClus and MinComs
  Interleaving factor   2 or 4 bytes depending on benchmark
  BASELINE              Same architecture but complete freedom when assigning instructions to clusters
Local vs. Remote Accesses
[Figure: stacked bars from 0% to 100% showing local hits, remote hits, local misses and remote misses for NoRes, MDC and DDGT on epicdec, jpegdec, pegwitdec, pgpdec, rasta and AMEAN.]
Execution Time
[Figure: bars of execution time (axis from 0 to 1.4), split into compute time and stall time, for MDC PrefClus, MDC MinComs, DDGT PrefClus and DDGT MinComs on epicdec, jpegdec, pegwitdec, pgpdec, rasta and AMEAN.]
Other Configurations
Configuration 1

                   # Buses   Latency
  Register buses   2         4
  Memory buses     4         2

More pressure on register buses
MDC outperforms DDGT in all cases
MDC requires fewer register communications

Configuration 2

                   # Buses   Latency
  Register buses   4         2
  Memory buses     2         4

More pressure on memory buses
DDGT outperforms best MDC in several cases: epicdec 17%, pgpdec 20%, pgpenc 9%, rasta 7%…
Conclusions
Memory coherence problem
  – Two software-based solutions: MDC and DDGT
  – Applied to a word-interleaved cache clustered VLIW processor
MDC vs DDGT
  – Results depend on the architecture configuration
    • MDC outperforms DDGT in most cases
    • DDGT is better by up to 20% in a specific configuration
  – Sets of memory dependent insts. are small
  – DDGT freedom in cluster assignment
    • Increases local accesses by 15%, which reduces stall time
Questions?