Date post: | 13-Jan-2016 |
- 1 -
Dongyoon Lee†, Mahmoud Said*,
Satish Narayanasamy†, Zijiang James Yang*, and
Cristiano L. Pereira‡
University of Michigan, Ann Arbor †
Western Michigan University *
Intel, Inc ‡
Offline Symbolic Analysis for Multi-Processor Execution Replay
- 2 -
Overview
Goal: Deterministic replay for multi-threaded programs
• Debug non-deterministic bugs
Sources of non-determinism
• Program input (interrupt, I/O, DMA, etc.)
• Shared-memory dependencies
Program Input
• Past Solutions: Log I/O, signals, DMA, etc.
• Our Solution: BugNet [ISCA'05], log loads (cache miss data)
Shared Memory Dependency
• Past Solutions: Monitor memory operations (software is slow, hardware is complex)
• Our Solution: SAT constraint solver, determine offline before replay
- 3 -
Deterministic Replay Uses
Recorder (Remote Site OR In-house) → Replayer (Developer Site)
• Reproduce non-deterministic bugs
• Debugging: Step backward in time
• Dynamic Program Analysis: Memory Leaks, Data Races, Dangling Pointers
- 4 -
Traditional Record-N-Replay Systems
• Checkpoint Memory and Register State
• Log non-deterministic program input: Interrupts, I/O values, DMA, etc.
• Log shared memory dependencies
[Figure: Thread 1, Thread 2, and Thread 3, with a Write in one thread feeding Reads in the others]
- 5 -
Recording Shared Memory Dependency
Problem: Need to monitor every memory operation
Software-based Replay System (x100, x10)
• PinSEL (UCSD/Intel)
• iDNA (Microsoft)
Hardware-based Replay System (complex hardware)
• FDR/ReRun (Wisconsin)
• Strata (UCSD)
• DeLorean (UIUC)
- 6 -
Hardware Complexity
Hardware-based solution
• Detect shared memory dependencies by monitoring cache coherence messages
• Transitive optimization to reduce log size
Complexity
• Requires changes to coherence sub-system
• Complex to design and verify
• 9 design bugs in coherence mechanism of AMD64 [Narayanasamy et al. ICCD’06]
W(a)W(b)
W(b)R(a)
- 7 -
New Direction to Hardware-based Solution
Complexity-effective solution
• Do NOT record shared-memory dependencies at all
• Infer dependencies offline, before replay, using a Satisfiability Modulo Theories (SMT) solver
- 8 -
Our Approach
Traditional → Our Approach
• Checkpoint Memory and Registers → Checkpoint Registers
• Log non-deterministic program input (Interrupts, I/O values, DMA, etc.) → BugNet [ISCA’05] Load-based Hardware Recorder
• Log shared memory dependency → Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline
[Figure: a Write in one thread feeding Reads in the other threads]
- 9 -
Roadmap
• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion
- 10 -
BugNet [Narayanasamy et al, ISCA’05]
Insight
• Recording initial register state and values of loads is sufficient for deterministic replay
• Implicitly captures the program input from I/O, DMA, interrupts, etc.
• Input and output of other instructions are reproduced during replay
Optimization
• Record a load only if it is the first access to a memory location
Our modification
• Recording data fetched on cache miss captures first loads
• Any first access to a location would result in a cache miss
• May unnecessarily record data due to store misses, but that is OK
- 11 -
Recording Cache Miss Data (First Loads)
Checkpoint
• Register Values
• Program Counter
Record cache misses
• (Memory count, Data)
• Implicitly capture first loads
On a store miss
• Record old value: data before store update
• New value (data after store update) can be reproduced deterministically
Deterministic Replay
• Input and output (including address) of all instructions are replayed
Example (execution time → log file)
• Load A = 0 (first load, cache miss) → (cnt1, 0)
• Load A = 0 → nothing logged
• Load B = 5 (first load, cache miss) → (cnt2, 5)
• Store C = 1 (cache miss) → (cnt3, 0)
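The recording rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the BugNet hardware design: the names `record` and `replay` are hypothetical, the cache is modeled as an unbounded set of addresses, and the memory count is simply the index of the operation in the thread. Store values are carried in the trace as a stand-in for the values a real replay would recompute from re-executed instructions.

```python
def record(trace, memory):
    """Run a trace of (kind, addr, value) ops and log cache-miss data."""
    cached = set()   # addresses currently in the (unbounded, assumed) cache
    log = []         # entries: (memory op count, data fetched on the miss)
    for count, (kind, addr, value) in enumerate(trace, start=1):
        if addr not in cached:
            # Miss: log the value held in memory before this op.
            # For a store miss this is the old value, which is harmless.
            log.append((count, memory[addr]))
            cached.add(addr)
        if kind == "store":
            memory[addr] = value
    return log


def replay(trace, log):
    """Reproduce every load value using only the log and the stores."""
    misses = dict(log)
    known = {}       # memory contents reconstructed so far
    loaded = []
    for count, (kind, addr, value) in enumerate(trace, start=1):
        if count in misses:
            known[addr] = misses[count]
        if kind == "store":
            known[addr] = value
        else:
            loaded.append(known[addr])
    return loaded


# The slide's example: Load A, Load A, Load B, Store C = 1.
trace = [("load", "A", None), ("load", "A", None),
         ("load", "B", None), ("store", "C", 1)]
log = record(trace, {"A": 0, "B": 5, "C": 0})
loads = replay(trace, log)
```

Only three of the four operations produce a log entry; the second load of A hits in the cache and is reproduced from the first entry.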
- 12 -
BugNet Extension
Self-modifying code• Consider instruction read as a load; so instructions are logged
Full system Replay• Continue logging in kernel mode• See the paper for details on context switches, page faults, etc.
- 13 -
Roadmap
• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion
- 14 -
BugNet for Multithreaded Programs
Insight
• BugNet recorder (initial register state + loads) for each thread is sufficient for replaying that thread
⇒ Recording cache miss data is sufficient for multithreaded programs
⇒ No additional hardware support required for recording dependencies
Reason
• A load that depends on a remote write causes a cache miss, to ensure coherence
⇒ BugNet implicitly records load values dependent on remote writes
Effect
• Can replay each thread in isolation (independent of other threads) using BugNet logs
- 15 -
Replaying Each Thread Independently
Cache Coherence
• Invalidate cache block to gain exclusive permission
Log cache miss data
• Implicitly records loads dependent on remote writes
• No change to coherence mechanism
Replay each thread
• Independent of others
Example
• Proc 1: Load A=0 (cache miss) → Proc 1 LOG (1st, 0)
• Proc 1: Load A=0 (cache hit) → nothing logged
• Proc 2: Store A=1 (cache miss; invalidates the cache block in Proc 1) → Proc 2 LOG (1st, 0)
• Proc 1: Load A=1 (cache miss, cache block invalidated) → Proc 1 LOG (3rd, 1)
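The two-processor example can be simulated with a small sketch. It assumes a simplified invalidation protocol (every store removes the block from the other processor's cache) and hypothetical function names; it is meant only to show why each per-processor log is enough to replay that thread alone.

```python
def record_two_procs(schedule, memory):
    """schedule: list of (proc, kind, addr, value) in global execution order."""
    caches = {1: set(), 2: set()}   # per-processor cached addresses
    counts = {1: 0, 2: 0}           # per-processor memory op counts
    logs = {1: [], 2: []}
    for proc, kind, addr, value in schedule:
        other = 2 if proc == 1 else 1
        counts[proc] += 1
        if addr not in caches[proc]:
            # Cache miss: log (op count, data fetched).
            logs[proc].append((counts[proc], memory[addr]))
            caches[proc].add(addr)
        if kind == "store":
            caches[other].discard(addr)   # invalidate the remote copy
            memory[addr] = value
    return logs


def replay_thread(ops, log):
    """Replay one thread's loads from its own log, ignoring other threads."""
    misses = dict(log)
    known, values = {}, []
    for count, (kind, addr, value) in enumerate(ops, start=1):
        if count in misses:
            known[addr] = misses[count]
        if kind == "store":
            known[addr] = value
        else:
            values.append(known[addr])
    return values


schedule = [
    (1, "load", "A", None),
    (1, "load", "A", None),
    (2, "store", "A", 1),
    (1, "load", "A", None),
]
logs = record_two_procs(schedule, {"A": 0})
proc1_ops = [(kind, addr, value) for proc, kind, addr, value in schedule if proc == 1]
proc1_loads = replay_thread(proc1_ops, logs[1])
```

The third load on Proc 1 misses only because of the invalidation, so the value written by Proc 2 lands in Proc 1's log without any explicit dependency record.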
- 16 -
Shared Memory Dependency
• SMT Solver resolves shared memory dependency
• Billion instructions: offline analysis would not scale
• We need to bound search space
[Figure: Thread 1 and Thread 2 each execute a sequence of Loads and Stores to locations A, B, and C; every access is annotated with its old value and new value; the order across threads is unknown (?); Final State: A, B, C]
- 17 -
Roadmap
• Motivation
• BugNet
• Offline Symbolic Analysis
  • Encoding Ordering Constraints
  • Bounding Search Space
• Evaluation
• Conclusion
- 18 -
Encoding Ordering Constraints
[Figure: Proc 1 performs x1, x2 and Proc 2 performs x3, x4, x5 on location x, ending in xFinal; each access is annotated with its old value and new value]
Program Order Constraint (Assume Sequential Consistency)
Proc1: X1 < X2 AND
Proc2: X3 < X4 < X5 AND
Load-Store Constraint (M→old == M→prev→new)
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
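The constraints on this slide are small enough to check by brute force. The sketch below enumerates every permutation of the five accesses and keeps those that satisfy the program-order and load-store constraints exactly as written above. This is a stand-in for the SMT solver, not how the solver works internally, and the function names are hypothetical.

```python
from itertools import permutations

EVENTS = ["X1", "X2", "X3", "X4", "X5"]


def satisfies(pos):
    """pos maps each event to its position in a candidate total order."""
    program_order = (pos["X1"] < pos["X2"]
                     and pos["X3"] < pos["X4"] < pos["X5"])
    load_store = (pos["X1"] < pos["X3"]
                  and (pos["X3"] < pos["X2"] < pos["X4"]
                       or pos["X5"] < pos["X2"]))
    return program_order and load_store


def valid_total_orders():
    orders = []
    for perm in permutations(EVENTS):
        pos = {event: i for i, event in enumerate(perm)}
        if satisfies(pos):
            orders.append(list(perm))
    return orders


orders = valid_total_orders()
for order in orders:
    print(" < ".join(order))
```

Two total orders satisfy the constraints: X2 falls either between X3 and X4 or after X5. Enumeration is factorial in the number of accesses, which is why a solver and a bounded search space are needed in practice.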
- 19 -
Multiple Memory Locations
[Figure: Proc 1 performs y1, x1, x2, y2 and Proc 2 performs x3, x4, x5, y3, ending in xFinal and yFinal; each access is annotated with its old value and new value]
Program Order Constraints (Assume Sequential Consistency)
Proc1: Y1 < X1 < X2 < Y2 AND
Proc2: X3 < X4 < X5 < Y3 AND
Load-Store Constraints (M→old == M→prev→new)
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
⋮
Y1: Y1 < Y2 AND
Y2: Y1 < Y2 < Y3 AND
⋮
- 20 -
Satisfiability-Modulo-Theory (SMT) Solver
Ordering Constraints → SMT Solver → Total Order
Ordering Constraints
(Program Order) ∧ (Load-Store Order for X) ∧ (Load-Store Order for Y) ∧ ⋮
Total Order
• One interleaving of x1, x2, x3, x4, x5, y1, y2, y3
SMT solver
• Find one valid total order from multiple solutions
• All solutions could be produced, if needed
- 21 -
Replay Guarantees
• The replayed execution has the same final register and memory states
• Each thread has exactly the same sequence of instructions, along with their inputs and outputs
• Reconstructed shared memory dependencies obey program order and load-store semantics
- 22 -
Roadmap
• Motivation
• BugNet
• Offline Symbolic Analysis
  • Encoding Ordering Constraints
  • Bounding Search Space
• Evaluation
• Conclusion
- 23 -
Bounding Search Space
Record “Strata hints”
• Each processor periodically records memory operation count
• Strata regions have a global order
SMT solver analyzes
• One region at a time
• Start from the last region
• Final state of a region = Initial state of the following region
[Figure: Proc 1 and Proc 2 record memory operation counts (cnt 1, cnt 2, cnt 3, cnt 4) every N cycles, dividing the execution into Strata Regions 1, 2, and 3, each with an Initial State and a Final State]
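A minimal sketch of how recorded counts could partition each thread's accesses into regions. The data layout is an assumption: `hints[thread]` is taken to hold the cumulative memory operation count at the end of each region boundary, and `split_into_regions` is a hypothetical helper, not part of the recorder.

```python
def split_into_regions(thread_ops, hints):
    """Cut each thread's op list at its recorded memory operation counts.

    thread_ops: {thread: [op, ...]} in program order
    hints:      {thread: [cumulative count at each region boundary]}
    Returns a list of regions; each region maps thread -> its ops.
    """
    n_regions = len(next(iter(hints.values()))) + 1
    regions = [dict() for _ in range(n_regions)]
    for thread, ops in thread_ops.items():
        bounds = [0] + list(hints[thread]) + [len(ops)]
        for r in range(n_regions):
            regions[r][thread] = ops[bounds[r]:bounds[r + 1]]
    return regions


thread_ops = {
    1: ["a1", "a2", "a3", "a4", "a5"],
    2: ["b1", "b2", "b3"],
}
hints = {1: [2, 4], 2: [1, 3]}   # two boundaries, so three regions
regions = split_into_regions(thread_ops, hints)

# Analysis would then visit regions from last to first, since the final
# state of the last region is the known final state of the execution.
for index in reversed(range(len(regions))):
    print("Strata Region", index + 1, regions[index])
```

Because regions have a global order, the solver only has to order accesses within one region at a time, which keeps each constraint problem small.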
- 24 -
Strata Hints
Cycle-bound
• After N cycles, each core records its memory operation count
• No communication is required between cores
Problem
• The size of a Strata region is not based on the number of shared memory dependencies
• Can we bound based on number of shared memory dependencies?
Downgrade-bound
• Count coherence downgrade requests
• Requires communication between cores, but reduces offline analysis overhead
- 25 -
Filtering Local & Read-only Accesses
Filter
• Local accesses: no shared-memory dependency
• Read-only accesses: any total order is valid
Effectiveness
• < 1% of memory accesses remain to be analyzed
[Figure: a Strata Region of Thread 1 and Thread 2 containing Load A, Store B, Load B, Store B, Store A, and repeated Load C accesses]
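The filter can be sketched as a single pass over the accesses of one region. The assignment of accesses to threads below is hypothetical (the figure's layout is not recoverable), and `filter_accesses` is an illustrative name.

```python
from collections import defaultdict


def filter_accesses(ops):
    """Keep only accesses that can carry a cross-thread dependency.

    ops: list of (thread, kind, addr) within one Strata region.
    Drops local locations (touched by one thread only) and read-only
    locations (never stored to in the region).
    """
    threads_by_addr = defaultdict(set)
    written = set()
    for thread, kind, addr in ops:
        threads_by_addr[addr].add(thread)
        if kind == "store":
            written.add(addr)
    return [op for op in ops
            if len(threads_by_addr[op[2]]) > 1 and op[2] in written]


# Hypothetical region: A is local to thread 1, C is read-only, B is shared.
region = [
    (1, "load", "A"), (1, "store", "B"), (2, "load", "B"),
    (2, "store", "B"), (1, "store", "A"),
    (1, "load", "C"), (2, "load", "C"), (1, "load", "C"),
]
remaining = filter_accesses(region)
```

Only the three accesses to B survive, so only those need ordering constraints for this region.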
- 26 -
Roadmap
• Motivation
• Record & Replay
• Offline Symbolic Analysis
• Evaluation
  • Strata Hint Size
  • Offline Symbolic Analysis Overhead
• Conclusion
- 27 -
Evaluation
• Simics + cycle-accurate simulator
  • Simulate multi-processor execution (2, 4, 8, 16 cores)
  • Fast-forward up to known synchronization points
  • Trace collected for 500 million instructions
• Benchmarks
  • SPLASH2: barnes, fmm, ocean
  • Parsec 2.0: blackscholes, bodytrack, x264
  • SPEComp: wupwise, swim
  • Apache
  • MySQL
• Yices SMT constraint solver [Dutertre and de Moura CAV’06]
- 28 -
Strata Hints Size vs. Offline Analysis Overhead
• Downgrade-bound scheme is effective
[Figure: Strata log size (MB/sec) and offline analysis time (secs per sec of prog. exec, log scale) for Cycle-bound (10,000), Downgrade-bound (25), and Downgrade-bound (10); annotations: 10%, x100]
• Offline analysis overhead is one-time cost (not for every replay)
- 29 -
Strata hints vs. ReRun log
• Strata hints are 4x smaller than the ReRun log
• Significant reduction in hardware complexity
[Figure: log size (MB/sec, log scale) per benchmark (barnes, fmm, ocean, blackscholes, bodytrack, x264, wupwise, swim, apache, mysql, average); series: Proposed System, Downgrade-bound (d10.c10000), and ReRun [Hower and Hill, ISCA’08], labeled Rerun (henkins); annotation: x4]
- 30 -
Recording Performance, etc.
• Cache Miss Data Log
  • 290 Mbytes per second of program execution
• Recording Performance
  • On average, 0.35% slowdown in IPC
• Scalability results can be found in the paper
- 31 -
Conclusion
• Deterministic replay for multi-threaded programs is critical
• We proposed a complexity-effective solution
  • Use BugNet: Record cache miss data
  • No need to record shared memory dependencies
  • Determine shared memory dependencies offline using an SMT constraint solver
• Result
  • < 1% recording overhead
  • Efficient log size (4x smaller than state-of-the-art scheme ReRun)
  • Can analyze one second of an 8-threaded program in less than 1000 seconds
  • One-time offline analysis cost (not for every replay)
- 32 -
Thank you