Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors
Shuchang Shan †‡, Yu Hu †, Xiaowei Li †
† Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
‡ Graduate University of Chinese Academy of Sciences (GUCAS)
2
Outline
Introduction
TDB execution model
Experimental results
Conclusion
3
Architectural level Dual Modular Redundancy
[Figure: two identical out-of-order pipelines (Fetch, Decode/Rename, Issue Queue, FUs, Register File, Reorder Buffer, Writeback/Commit) above a memory system with private L1 caches; panels show leading/trailing instructions with EX' and CHK stages, leading/trailing threads, and core pairs A/A', B/B']
Instruction-level DMR
– DIVA [MICRO’99], SHREC [MICRO’04], EDDI [TR’02]
Thread-level DMR
– AR-SMT [FTCS’99], SRT [ISCA’00]
Core-level DMR
– CRTR [ISCA’03], Reunion [MICRO’06], DCC [DSN’07]
For CMP systems, to make use of the abundant hardware resources: build core-level DMR!
4
Core-level Dual Modular Redundancy (DMR)
Using coupled cores to verify each other’s execution
Static binding
– Lacks flexibility
– e.g., Reunion [MICRO’06], CRT [ISCA’02], CRTR [ISCA’03]
Dynamic binding
– Lacks scalability for parallel processing
– e.g., DCC [DSN’07, WDDD’08]
[Figure: static binding (fixed pairs A/A' and B/B') versus dynamic binding (pairs A/A', B/B', C/C' placed anywhere on the chip); cores are connected by the on-chip network and shared cache, and some cores are marked X]
5
Key issue in Core-level DMR
Maintain master-slave memory consistency
Master-slave memory consistency
– Coupled cores must get the same memory value
– External writes cause consistency violations
Reunion [Smolens-MICRO’06]
– Rollback and recovery on inconsistency
Dynamic Core Coupling (DCC) [LaFrieda-DSN’07]
– Consistency window to stall the external writes
Scalability problem
[Figure: two timelines of LD1, ST3, LD1' and LD2. Left, "Consistency violation": the external store ST3 falls between the master's LD1 and the slave's LD1'. Right, "Stall latency": ST3 is held back until after LD1']
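The LD1 / ST3 / LD1' interleaving can be sketched as a toy replay of one shared memory word. This is only an illustration under simplifying assumptions: the `run` function, the event tuples and the cycle numbers are hypothetical, and the "window" here is a much-reduced stand-in for the consistency window of DCC, not its actual mechanism.

```python
# Toy replay of the LD1 / ST3 / LD1' timeline. All names and cycle
# numbers are illustrative assumptions, not taken from Reunion or DCC.

def run(events, use_window):
    """Replay (cycle, actor, op) events against one shared memory word.

    Returns (master_value, slave_value, stall_cycles).
    """
    memory = 1                # value both loads are expected to observe
    master = slave = None
    window_open = False       # open between the master's and the slave's load
    pending_store = None      # (value, arrival_cycle) held back by the window
    stall = 0
    for cycle, actor, op in sorted(events, key=lambda e: e[0]):
        if actor == "master":
            master = memory
            window_open = use_window
        elif actor == "slave":
            slave = memory
            window_open = False
            if pending_store is not None:   # release the stalled store
                value, arrived = pending_store
                memory = value
                stall = cycle - arrived
                pending_store = None
        else:                               # external write, op is the value
            if window_open:
                pending_store = (op, cycle)
            else:
                memory = op
    return master, slave, stall


events = [(10, "master", "load"), (15, "external", 3), (40, "slave", "load")]
print(run(events, use_window=False))  # the pair observes different values
print(run(events, use_window=True))   # same values, but the store is stalled
```

Without the window the pair diverges; with it the pair agrees, at the cost of stalling the external writer for the full master-slave slack.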
6
Scalability problem
External writes occur earlier and more frequently as the system scales
– Reunion: unacceptable recovery overhead for consistency violations
– DCC: unacceptable stall latency caused by the consistency window
Scalable solution needed
– Reduce the consistency maintenance overhead
[Figure: external write interval breakdown (<100, <200, <300, <500, >500 cycles) for lu, fft, ocean-con, barnes, cholesky, radix, radiosity and the average, each on 16-, 8- and 4-core systems; y-axis 0.0 to 1.0]
Probability of external writes occurring within certain slacks
– For 4-CMP system: 28% within 100 cycles, 37% within 500 cycles
– For 16-CMP system: 43% within 100 cycles, 55% within 500 cycles
#External writes within 1K cycles: 0.3 for 4-CMP, 3.3 for 16-CMP
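A breakdown like the one in the chart can be computed from a list of measured intervals. The sketch below assumes each interval is the number of cycles until the next external write and reports the cumulative fraction below each threshold. The sample intervals are invented for illustration and are not the SPLASH-2 measurements.

```python
from bisect import bisect_left

THRESHOLDS = (100, 200, 300, 500)   # slack thresholds in cycles, as on the chart


def interval_breakdown(intervals):
    """Return the fraction of intervals strictly below each threshold."""
    ordered = sorted(intervals)
    total = len(ordered)
    return {t: bisect_left(ordered, t) / total for t in THRESHOLDS}


# Hypothetical intervals (cycles until the next external write).
sample = [50, 80, 150, 450, 900, 1200, 3000, 90, 250, 700]
breakdown = interval_breakdown(sample)
for threshold, fraction in breakdown.items():
    print(f"<{threshold} cycles: {fraction:.0%}")
```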
7
Basic idea
Sphere of Consistency (SoC): the scope of the master-slave memory consistency maintenance
– The memory hierarchy
– The private caches
[Figure: two master/slave pairs, each core with its own L1 cache above global memory, contrasting the two SoC scopes]
Transparent Dynamic Binding (TDB): reduce the SoC to the scale of the private caches; provide a scalable and flexible core-level DMR solution!
8
Outline
Introduction
TDB execution model
Experimental results
Conclusion
9
TDB principle
The same program input for the pair
Similar memory access behavior
[Figure: one program feeding the private caches A-L1$ and A’-L1$ above global memory]
Transparent binding: the master issues L1 miss requests for the logical pair; the slave is prevented from accessing the global memory
Dynamic binding: the system network is used for data communication and result comparison
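The transparent-binding rule can be illustrated with a small model in which only the master ever reads global memory and the fill is copied to the slave. The class names (`GlobalMemory`, `LogicalPair`) and the direct dictionary copy standing in for the network transfer are assumptions made for this sketch. Coherence actions and result comparison are left out.

```python
class GlobalMemory:
    """Backing store that logs which core issued each read."""

    def __init__(self, data):
        self.data = dict(data)
        self.readers = []

    def read(self, addr, requester):
        self.readers.append(requester)
        return self.data[addr]


class LogicalPair:
    """Master and slave private L1 caches, modelled as plain dictionaries."""

    def __init__(self, memory):
        self.memory = memory
        self.master_l1 = {}
        self.slave_l1 = {}

    def load(self, addr):
        """Both cores execute the same load; returns (master, slave) values."""
        if addr not in self.master_l1:
            # Only the master issues the L1 miss request for the pair.
            value = self.memory.read(addr, requester="master")
            self.master_l1[addr] = value
            # The slave waits passively and receives a copy (assumed to
            # travel over the system network in the real design).
            self.slave_l1[addr] = value
        return self.master_l1[addr], self.slave_l1[addr]


memory = GlobalMemory({0x40: 7})
pair = LogicalPair(memory)
first = pair.load(0x40)    # miss: the master fetches, the slave gets a copy
second = pair.load(0x40)   # hit in both private caches
print(first, second, memory.readers)
```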
10
Transparent dynamic binding
[Figure: master and slave run the same program; data flows from the producer and global memory to the pair]
Logical pair: consumer-consumer data access pattern
Sphere of Consistency: the private caches
Transparency of slaves: passively waiting
11
Maintain Consistency under Out-of-Order Execution
Out-of-order execution brings in wrong-path effects [1]
[Figure, animated over five slides: memory accesses MA1 to MA6 from the program execute on the master and slave, with data supplied by the producer and global memory; data blocks 1 to 5 fill the two private caches, drawn in LRU-to-MRU order. After a pipeline refresh, the master's cache holds different blocks than the slave's.]
Master-slave private cache consistency violation
Invariant: in-order memory instruction retirement sequence
[1] R. Sendag, et al. “The impact of wrong-path memory references in cache-coherent multiprocessor systems.” In JPDC’07
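The divergence can be reproduced with two small LRU caches: one sees every executed access, including a wrong-path one, and the other sees only the retired sequence. This is a toy sketch with an invented trace and a 3-block cache, not the example drawn on the slides.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache; the first key is LRU, the last is MRU."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)        # hit: promote to MRU
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)   # evict the LRU block
            self.blocks[block] = True

    def contents(self):
        return list(self.blocks)


def replay(trace, capacity=3):
    cache = LRUCache(capacity)
    for block in trace:
        cache.access(block)
    return cache.contents()


retired = [1, 2, 3, 1]        # in-order retirement sequence (correct path)
executed = [1, 2, 3, 4, 1]    # what the master executes; block 4 is wrong-path

master_cache = replay(executed)
slave_cache = replay(retired)
print(master_cache, slave_cache)
```

The wrong-path access to block 4 evicts block 1 on the master, so the later correct-path access to block 1 misses there and hits on the other cache, leaving the two with different contents.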
16
Victim Buffer Assisted Conservative Private Cache Ingress Rule
Victim Buffer: filters the wrong-path (WP) data blocks
Conservative private cache ingress rule: accept only data blocks from the correct path into the private caches
[Figure, animated over four slides: memory accesses MA1 to MA6 execute on the master and slave above global memory; fetched blocks 1 to 5 are staged beside the private caches, which are drawn in LRU-to-MRU order]
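One way to picture the ingress rule is a cache whose misses are staged in a buffer and installed only when the instruction retires. The sketch below is an interpretation under stated assumptions: the `execute`/`retire`/`squash` interface, dropping squashed blocks, and the 3-block capacity are all hypothetical and are not details given on the slides.

```python
from collections import OrderedDict


class PrivateCache:
    """LRU cache with a staging buffer in front of it (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # first key is LRU, last key is MRU
        self.victim_buffer = set()    # fetched blocks not yet retired

    def execute(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # hit: promote to MRU
        else:
            self.victim_buffer.add(block)    # miss: stage, do not install

    def retire(self, block):
        # Ingress rule: only a retired (correct-path) block enters the cache.
        if block in self.victim_buffer:
            self.victim_buffer.remove(block)
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)
            self.blocks[block] = True

    def squash(self, block):
        # Assumption: a wrong-path block is simply dropped from the buffer.
        self.victim_buffer.discard(block)


def replay(trace, capacity=3):
    cache = PrivateCache(capacity)
    for block, on_correct_path in trace:
        cache.execute(block)
        if on_correct_path:
            cache.retire(block)
        else:
            cache.squash(block)
    return list(cache.blocks)


correct_path = [(1, True), (2, True), (3, True), (1, True)]
master_trace = correct_path[:3] + [(4, False)] + correct_path[3:]

master_cache = replay(master_trace)
slave_cache = replay(correct_path)
print(master_cache, slave_cache)
```

In this model the wrong-path block 4 never enters the master's cache, so both caches end with the same contents. Hits still update the LRU order at execute time.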
20
Maintain Consistency under Out-of-Order Execution
[Figure: master and slave private caches (LRU to MRU) holding blocks 1 to 5 above global memory, with accesses MA1 and MA5 highlighted]
Potential master-slave consistency violation
Invariant: in-order memory instruction retirement sequence
21
Update-after-retirement LRU replacement policy (uar-LRU)
uar-LRU: update the MRU position only after the instruction retires, to prevent WP memory references from violating the consistency
[Figure, animated over two slides: master and slave private caches (LRU to MRU) holding blocks 1 to 5 above global memory, with accesses MA1 and MA5 highlighted]
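The effect of deferring the LRU update can be sketched by replaying the same trace under two policies. The `Cache` class and the `update_at_retire` flag are hypothetical names for this sketch. For brevity, fills are also deferred to retirement in the uar-LRU mode, which is a simplifying assumption rather than a statement about the actual design.

```python
from collections import OrderedDict


class Cache:
    """LRU cache that updates its state at execute or at retire time."""

    def __init__(self, capacity, update_at_retire):
        self.capacity = capacity
        self.update_at_retire = update_at_retire
        self.blocks = OrderedDict()   # first key is LRU, last key is MRU

    def _touch(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)
            self.blocks[block] = True

    def execute(self, block):
        if not self.update_at_retire:   # conventional LRU
            self._touch(block)

    def retire(self, block):
        if self.update_at_retire:       # uar-LRU
            self._touch(block)


def replay(trace, update_at_retire, capacity=3):
    cache = Cache(capacity, update_at_retire)
    for block, on_correct_path in trace:
        cache.execute(block)
        if on_correct_path:
            cache.retire(block)
    return list(cache.blocks)


warmup = [(1, True), (2, True), (3, True)]
slave_trace = warmup + [(4, True)]
master_trace = warmup + [(1, False), (4, True)]   # wrong-path hit on block 1

lru_master = replay(master_trace, update_at_retire=False)
lru_slave = replay(slave_trace, update_at_retire=False)
uar_master = replay(master_trace, update_at_retire=True)
uar_slave = replay(slave_trace, update_at_retire=True)
print(lru_master, lru_slave, uar_master, uar_slave)
```

Under conventional LRU the wrong-path hit promotes block 1 and changes which block the next miss evicts. With the update deferred to retirement, the squashed access leaves no trace and both caches evict the same block.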
23
Master-slave memory consistency violation
External writes violate the master-slave memory consistency
Atomicity of master-slave data access behavior
Lacks scalability as external writes become more frequent
[Figure: master-slave input coherence: (a) external writes violate the consistency; (b) the master-slave consistency window in DCC]
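24
Transparent Input Coherence Strategy
Take advantage of transparent dynamic binding
Break the atomicity of master-slave data access behavior
[Figure: timeline of LD1, an external ST3 and LD1', with per-core block states shown as D D, I D and I D, a checker, and an arrow labelled "optimization"]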
25
Outline
Introduction
TDB execution model
Experimental results
Conclusion
26
Experimental Setup
Full-system simulator: Simics + GEMS
Parallel workloads: SPLASH-2
The baseline Dual Modular Redundancy system
– N active cores and another N disabled cores
– Simulates a DMR system where the slaves work without interfering with the masters
27
The Performance of TDB Proposal
[Figure: normalized runtime (y-axis 0.8 to 1.1) for 4P, 8P, 16P and 32P]
Normalized runtime relative to the baseline: 97.2%, 99.8%, 101.2% and 105.4% for 4, 8, 16 and 32 cores respectively
The conservative private cache ingress rule helps filter the WP effects
28
Network Traffic of TDB Proposal
[Figure: normalized network traffic (y-axis 0.8 to 1.1) for 4P, 8P, 16P and 32P]
The total traffic increases by 5.2%, 3.6%, 1.3% and 2.5% for 4-, 8-, 16- and 32-core CMP systems
29
Comparison against DCC [DSN’07]
[Figure: two bar charts comparing TDB and DCC for 4P, 8P, 16P and 32P: normalized runtime and normalized network traffic (y-axes 1.0 to 1.6 and 1.0 to 1.1); callouts on the charts read 9.2%, 10.4%, 18% and 37.1%]
Transparent Dynamic Binding (TDB): scalable and flexible core-level DMR solution!
30
Conclusion
Transparent Dynamic Binding
– Reduce SoC to the scale of private caches
Techniques to maintain the consistency
– Consumer-consumer data access pattern
– Victim Buffer assisted conservative ingress rule
– uar-LRU replacement policy
– Transparent input coherence policy
Scalable and flexible core-level DMR solution
31
Q&A?