Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification
Alok Garg, M. W. Rashid, and Michael Huang
Department of Electrical & Computer Engineering
University of Rochester
6/20/2006 "Slackened Memory Dependence Enforcement", Alok Garg, ISCA 2006 2
Motivation
Out-of-order execution needs efficient memory dependence enforcement logic
Conventional approach – complex, hard to scale
  - Tightly coupled forwarding and enforcement
We use two decoupled components to simplify the task
  - Opportunistic forwarding using L0 cache
  - Verification against in-order re-execution
Slackened memory dependence enforcement (SMDE)
LSQ: complex & hard to scale
Needs priority CAMs
Forwarding from LSQ on timing critical path
  - Serialized with address translation
Design further complicated by
  - Coherence and consistency considerations
  - Corner cases: e.g., partial overlap of operands
Highlights of prior work
Two-level load store queue [sethumadhavan03], [akkary03], [baugh04], [roth04], [torres05], [gandhi05]
Reducing search frequency using clever filtering and prediction mechanisms [park03], [sethumadhavan03]
Memory dependence prediction [moshovos.isca97], [moshovos.micro97], [sha05], [stone05]
Value based re-execution [cain04], [roth04], [sha05]
(more detailed contrast in paper)
Outline
Overview of SMDE
Optional performance optimizations
Evaluation
Conclusion
Overview of SMDE
Decoupled execution
LSQ: competing requirements
Front-end execution: little mem dependence enforcement
Back-end execution: detect violations (mem access only)
Memory B/W: naturally handled
[Figure: pipeline diagram with blocks labeled Fetch/Decode/Dispatch, Execution (out-of-order), Commit, LSQ, MUX, L0, L1, and Memory Hierarchy; the two paths are labeled Front-end execution and Back-end execution.]
Why it works – two perspectives
Back-end execution is the only one required
  - Totally in-order, preserving dependence
  - Any front-end execution is OK
  - L0 effectively a slow but accurate value predictor
Front-end execution correct most of the time
  - Common case: 99% of loads happen at the right time
  - Speculation is on timing of load-store pairs (two-level LSQs speculate on the scope of stores)
  - Relatively expensive replays OK
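The two perspectives above can be made concrete with a small sketch. This is only an illustration under simplifying assumptions, not the authors' implementation: the L0 and L1 are modeled as plain dictionaries, and `MemOp`, `front_end_load`, and `back_end_verify` are hypothetical names introduced here. The front end reads whatever the L0 happens to hold; the back end re-executes memory operations in program order and flags the first load whose front-end value disagrees.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemOp:
    kind: str       # "LD" or "ST" (hypothetical encoding)
    addr: int
    value: int = 0  # store data, or the value a load obtained at the front end


def front_end_load(l0: Dict[int, int], addr: int) -> int:
    """Opportunistic load: return whatever the speculative L0 holds right now."""
    return l0.get(addr, 0)


def back_end_verify(ops: List[MemOp], l1: Dict[int, int]) -> Optional[int]:
    """Re-execute memory ops in program order against the L1 model.

    Returns the index of the first load whose front-end value was wrong
    (where a replay would start), or None if every load verifies.
    """
    for i, op in enumerate(ops):
        if op.kind == "ST":
            l1[op.addr] = op.value          # stores update L1 in order
        elif l1.get(op.addr, 0) != op.value:
            return i                        # front-end value was premature
    return None
```

In this model a load that runs before an older store to the same address picks up a stale value from the L0, and the in-order pass catches it; a load that runs at the right time verifies without any dependence-tracking hardware at the front end.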
Advantage – simplicity
No priority CAM
Decoupled design – flexible, modular
Front end – large degree of freedom
  - No need for address translation
  - Soft errors can be ignored (ECC not needed)
  - Corner cases – handle partial overlaps naturally
  - Can ignore coherence invalidations
Performance of naïve design
[Figure: performance chart. Configuration shown: LQ: 64, SQ: 48]
Optional performance optimizations
Reducing replay frequency
Major replay cause – RAW violations
  - 48% of replays due to RAW violations
  - Replays indirectly cause more replays
  - Often address available (data is not)
Fuzzy disambiguation queue (FDQ)
  - Reject known premature loads
  - Best effort enough, no need to guarantee anything
  - Conventional LSQ handles this (e.g., POWER4)
FDQ: How it works
[Figure: Fuzzy Disambiguation Queue (FDQ) with entries that each hold an Address and an AGE field, shown beside a ROB holding LD and ST instructions numbered 1 to 6, ordered old to new.]
FDQ not complex
Very different from conventional SQ
  - Does not have priority logic
  - No need to merge with cache data path
  - Small queue is sufficient – no scalability pressure
  - Stores do not stay in FDQ for the entire lifetime
  - Flexible replacement
A “local” technique
  - Only support needed: load rejection
  - No need to augment issue logic to enforce predicted dependence
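A minimal sketch of how an FDQ along these lines might behave, assuming (this is an interpretation, not a detail stated on the slides) that a store enters the queue once its address is known and leaves once its data is available. The class name, method names, capacity, and replacement choice are all hypothetical. A load is rejected if any older store in the queue matches its address; no priority search for the youngest match is needed because the answer is only "retry later".

```python
class FuzzyDisambiguationQueue:
    """Best-effort filter that rejects loads known to be premature."""

    def __init__(self, capacity=4):   # capacity is an arbitrary assumption
        self.capacity = capacity
        self.entries = []             # list of (address, age); smaller age = older

    def store_address_ready(self, addr, age):
        # Flexible replacement: when full, simply drop the oldest-inserted
        # entry. Losing an entry is safe because the back end still verifies.
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)
        self.entries.append((addr, age))

    def store_data_ready(self, age):
        # The store no longer needs to block loads, so it leaves the queue.
        self.entries = [e for e in self.entries if e[1] != age]

    def should_reject(self, load_addr, load_age):
        # Any matching older store is enough; no priority logic required.
        return any(addr == load_addr and age < load_age
                   for addr, age in self.entries)
```

Because the structure is best effort, a missed rejection only costs a replay detected by the back end, which is why a small queue with a simple replacement policy can suffice.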
Write buffer at the back-end
Temporarily holds not-yet-committed stores
Allows back-end execution of loads and stores to start early
A few entries sufficient to streamline back-end execution
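The role of the write buffer can be sketched as follows. This is an illustrative model under stated assumptions, not the design from the paper: the L1 is a dictionary, the buffer is a small FIFO, a load takes the newest matching buffered store, and `BackEndWriteBuffer` and its methods are hypothetical names. The default of 8 entries mirrors the 8-entry write buffer used in the evaluation.

```python
from collections import deque


class BackEndWriteBuffer:
    """Holds back-end stores that have executed but not yet committed."""

    def __init__(self, l1, capacity=8):
        self.l1 = l1                 # dict modeling the L1 cache
        self.capacity = capacity
        self.pending = deque()       # (address, value), oldest first

    def execute_store(self, addr, value):
        """Execute a store early; returns False if the buffer is full (stall)."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append((addr, value))
        return True

    def execute_load(self, addr):
        """Back-end load: newest buffered store wins, otherwise read L1."""
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.l1.get(addr, 0)

    def commit_oldest_store(self):
        """At commit, the oldest buffered store is written into L1."""
        addr, value = self.pending.popleft()
        self.l1[addr] = value
```

In this model the L1 only ever sees committed stores, while back-end loads and stores can run ahead of the commit point by up to the buffer's capacity.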
Evaluation
Evaluation environment
Simulator strives to model SMDE very faithfully
  - Load speculation, load rejection, and store-load replay
  - Data value in the caches
  - Scheduling replays
  - Do not allocate load queue entry for pre-fetches
SPEC CPU2000 benchmark suite
System configuration
  - ROB/Register (INT, FP) – 512/(400,400)
  - LSQ (LQ, SQ) – 112 (64, 48)
  - L0 speculative cache – 16KB, 2-way, 1 cycle
Impact of 8-entry Write buffer
Replay frequency reduction
(a) Integer applications.
(b) Floating-point applications.
Replay breakdown
Performance improvement
Scalability test
Memory dependence logic unchanged
ROB, RFs, IQs doubled
Other details in paper
Scope of replay
Detailed study on replay causes
Replay suppression technique
Age based filtering
Discussion on L0 flush policy
Understanding write buffer
Membership test for write buffer
* “Implementation Issues of Slackened Memory Dependence Enforcement”, A. Garg, M. Rashid, and M. Huang, Technical Report.
Conclusions
Common-case forwarding and correctness guarantee separately handled
Decoupled execution allows modular design, verification, and optimization
Forwarding logic is simple to design and incurs minimal interference on execution
Scales very well
Can achieve close to ideal performance
Link to technical report: http://www.ece.rochester.edu/~garg/documents/isca06tr.pdf
Streamlining back-end execution
[Figure: timing diagram of back-end execution. ROB entries (ST, LD) are ordered by age, old to new, against cycles 1 to 7; the diagram marks Reload, Verification, commit, and a Bubble.]
Streamlining back-end execution
[Figure: timing diagram of back-end execution with a write buffer (WB). ROB entries are ordered by age, old to new, against cycles 1 to 7; labels: ST, LD, WB, RL, CT.]
Insert write buffer at the commit stage