- 1 -

Dongyoon Lee†, Mahmoud Said*,

Satish Narayanasamy†, Zijiang James Yang*, and

Cristiano L. Pereira‡

University of Michigan, Ann Arbor †

Western Michigan University *

Intel, Inc ‡

Offline Symbolic Analysis for Multi-Processor Execution Replay

- 2 -

Overview

Goal: Deterministic replay for multi-threaded programs
• Debug non-deterministic bugs

Sources of non-determinism
• Program input (interrupt, I/O, DMA, etc.)
• Shared-memory dependencies

Program Input
• Past Solutions: Log I/O, signals, DMA, etc.
• Our Solution: BugNet [ISCA'05], log loads (cache miss data)

Shared Memory Dependency
• Past Solutions: Monitor memory operations (software is slow, hardware is complex)
• Our Solution: SAT constraint solver, determine offline before replay

- 3 -

Deterministic Replay Uses

Recorder

Replayer

Memory Leaks

Data Races

Dangling Pointers

Dynamic Program Analysis

Reproduce non-deterministic bugs

Remote Site OR In-house

Developer Site

Step Backward in time

Debugging

- 4 -

Traditional Record-N-Replay Systems

Write

Read Read

Log shared memory dependencies

Checkpoint Memory and Register State

Log non-deterministic program input: interrupts, I/O values, DMA, etc.

Thread 1 Thread 2 Thread 3

- 5 -

Recording Shared Memory Dependency

Problem: Need to monitor every memory operation

Software-based Replay System: PinSEL (UCSD/Intel), iDNA (Microsoft)

Hardware-based Replay System: FDR/ReRun (Wisconsin), Strata (UCSD), DeLorean (UIUC)

x100 x10

Complex hardware

- 6 -

Hardware Complexity

Hardware-based solution
• Detect shared memory dependencies by monitoring cache coherence messages
• Transitive optimization to reduce log size

Complexity
• Requires changes to coherence sub-system
• Complex to design and verify
• 9 design bugs in coherence mechanism of AMD64 [Narayanasamy et al. ICCD’06]

W(a) W(b)

W(b) R(a)

- 7 -

New Direction to Hardware-based Solution

Complexity-effective solution
• Do NOT record shared-memory dependencies at all
• Infer dependencies offline before replay using a Satisfiability Modulo Theories (SMT) solver

- 8 -

Our Approach

Write

Read Read

Log shared memory dependency

Checkpoint Memory and Registers

Log non-deterministic program input: interrupts, I/O values, DMA, etc.

BugNet [ISCA’05]: Load-based Hardware Recorder

Satisfiability Modulo Theories (SMT) solver reconstructs interleaving offline

Checkpoint Registers

- 9 -

Roadmap

• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion

- 10 -

BugNet [Narayanasamy et al, ISCA’05]

Insight
• Recording initial register state and values of loads is sufficient for deterministic replay
• Implicitly captures the program input from I/O, DMA, interrupts, etc.
• Input and output of other instructions are reproduced during replay

Optimization
• Record a load only if it is the first access to a memory location

Our modification
• Recording data fetched on cache miss captures first loads
• Any first access to a location would result in a cache miss
• May unnecessarily record data due to store misses, but that is OK

- 11 -

Recording Cache Miss Data (First Loads)

Checkpoint
• Register Values
• Program Counter

Execution Time → Log file
• Load A = 0 (first load, cache miss) → (cnt1, 0)
• Load A = 0 → not logged
• Load B = 5 → (cnt2, 5)
• Store C = 1 → (cnt3, 0)

Record cache misses
• (Memory count, Data)
• Implicitly capture first loads

On a store miss
• Record old value: data before store update
• New value (data after store update) can be reproduced deterministically

Deterministic Replay
• Input and output (including address) of all instructions are replayed
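The recording rule on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `record` and `replay`, the trace format, and the use of a set to stand in for cache contents are all hypothetical and are not taken from BugNet. Store values appear in the trace only for convenience; in a real replayer they would be recomputed by re-executing the instructions.

```python
# Illustrative sketch only: names and the trace format are assumptions.

def record(trace, memory):
    """Log (memory_op_count, data) for the first access to each address."""
    touched = set()   # stands in for "already in the cache"
    log = []
    for count, (op, addr, value) in enumerate(trace, start=1):
        if addr not in touched:
            # A first access misses in the cache; log the data it fetches.
            # For a store miss this is the old value, before the update.
            log.append((count, memory[addr]))
            touched.add(addr)
        if op == "store":
            memory[addr] = value
    return log


def replay(trace, log):
    """Recompute every load value from the log, without the original memory."""
    logged = dict(log)
    view = {}         # the replayer's private copy of memory
    loads = []
    for count, (op, addr, value) in enumerate(trace, start=1):
        if count in logged:
            view[addr] = logged[count]
        if op == "load":
            loads.append((addr, view[addr]))
        else:
            view[addr] = value
    return loads


# The slide's example: Load A, Load A, Load B, Store C = 1.
trace = [("load", "A", None), ("load", "A", None),
         ("load", "B", None), ("store", "C", 1)]
log = record(trace, {"A": 0, "B": 5, "C": 0})
print(log)                 # [(1, 0), (3, 5), (4, 0)]
print(replay(trace, log))  # [('A', 0), ('A', 0), ('B', 5)]
```

The second Load A produces no log entry, yet the replayer still recovers its value, which is the point of logging only first accesses.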

- 12 -

BugNet Extension

Self-modifying code
• Consider instruction read as a load; so instructions are logged

Full system Replay
• Continue logging in kernel mode
• See the paper for details on context switches, page faults, etc.

- 13 -

Roadmap

• Motivation
• BugNet for single-threaded programs [ISCA’05]
  • Recording cache miss data is sufficient
• BugNet is sufficient for multi-threaded programs
  • Insight: BugNet can replay each thread in isolation
• Offline SMT Analysis
• Evaluation
• Conclusion

- 14 -

BugNet for Multithreaded Programs

Insight
• BugNet recorder (initial register state + loads) for each thread is sufficient for replaying that thread
⇒ Recording cache miss data is sufficient for multithreaded programs
⇒ No additional hardware support required for recording dependencies

Reason
• A load that depends on a remote write causes a cache miss to ensure coherence
⇒ BugNet implicitly records load values dependent on remote writes

Effect
• Can replay each thread in isolation (independent of other threads) using BugNet logs

- 15 -

Replaying Each Thread Independently

Proc 1
• Load A=0 (cache miss)
• Load A=0
• Load A=1 (cache miss: cache block invalidated)

Proc 2
• Store A=1 (invalidation sent to Proc 1)

Proc 1 LOG: (1st, 0), (3rd, 1)

Proc 2 LOG: (1st, 0)

Cache Coherence
• Invalidate cache block to gain exclusive permission

Log cache miss data
• Implicitly records loads dependent on remote writes
• No change to coherence mechanism

Replay each thread
• independent of others
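The Proc 1 / Proc 2 example on this slide can be reproduced with a toy model of invalidation-based coherence. This is a sketch under assumptions: `record_multiprocessor` and the trace format are hypothetical, each cache is modelled as a set of addresses, and memory is treated as always up to date, which is far simpler than a real coherence protocol.

```python
# Illustrative sketch only: a toy invalidation protocol, not real hardware.

def record_multiprocessor(trace, memory, num_procs):
    """Return one cache-miss log per processor for a globally ordered trace."""
    caches = [set() for _ in range(num_procs)]
    counts = [0] * num_procs            # per-processor memory operation count
    logs = [[] for _ in range(num_procs)]
    for proc, op, addr, value in trace:
        counts[proc] += 1
        if addr not in caches[proc]:
            # Cache miss: log the data that is fetched.
            logs[proc].append((counts[proc], memory[addr]))
            caches[proc].add(addr)
        if op == "store":
            memory[addr] = value
            # Gaining exclusive permission invalidates all other copies.
            for other in range(num_procs):
                if other != proc:
                    caches[other].discard(addr)
    return logs


# Proc 1 is index 0 and Proc 2 is index 1.
trace = [(0, "load", "A", None), (0, "load", "A", None),
         (1, "store", "A", 1), (0, "load", "A", None)]
logs = record_multiprocessor(trace, {"A": 0}, num_procs=2)
print(logs[0])  # [(1, 0), (3, 1)]
print(logs[1])  # [(1, 0)]
```

Proc 1's third load misses only because Proc 2's store invalidated the block, so the remotely written value lands in Proc 1's log without any extra dependency tracking.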

- 16 -

Shared Memory Dependency

[Figure: loads and stores of Thread 1 and Thread 2 to locations A, B, and C, each annotated with its old value and new value; the order between the two threads' accesses is unknown (?). Final State: A, B, C]

SMT Solver resolves shared memory dependency

Billion instructions
• Offline analysis would not scale

We need to bound search space

- 17 -

Roadmap

• Motivation
• BugNet
• Offline Symbolic Analysis
  • Encoding Ordering Constraints
  • Bounding Search Space
• Evaluation
• Conclusion

- 18 -

Encoding Ordering Constraints

[Figure: memory operations x1 and x2 on Proc 1 and x3, x4, and x5 on Proc 2, followed by xFinal; each operation is annotated with its old value and new value]

Program Order Constraint (Assume Sequential Consistency)

Proc1: X1 < X2 AND
Proc2: X3 < X4 < X5 AND

Load-Store Constraint (M→old == M→prev→new)

X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
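For five events, the constraints listed on this slide are small enough to check by brute force. The sketch below enumerates all permutations in place of an SMT solver and covers only the constraints shown on the slide (the slide's formula continues past the trailing AND); the names `satisfies` and `valid_total_orders` are hypothetical.

```python
# Illustrative sketch only: exhaustive search standing in for an SMT solver.
from itertools import permutations

EVENTS = ["X1", "X2", "X3", "X4", "X5"]


def satisfies(pos):
    """pos maps each event to its position in a candidate total order."""
    program_order = (pos["X1"] < pos["X2"]
                     and pos["X3"] < pos["X4"] < pos["X5"])
    load_store = (pos["X1"] < pos["X3"]
                  and (pos["X3"] < pos["X2"] < pos["X4"]
                       or pos["X5"] < pos["X2"]))
    return program_order and load_store


def valid_total_orders():
    """Return every permutation of EVENTS that meets all constraints."""
    orders = []
    for perm in permutations(EVENTS):
        pos = {event: index for index, event in enumerate(perm)}
        if satisfies(pos):
            orders.append(perm)
    return orders


for order in valid_total_orders():
    print(" < ".join(order))
# X1 < X3 < X2 < X4 < X5
# X1 < X3 < X4 < X5 < X2
```

Two total orders satisfy the constraints, one for each branch of the OR in the X2 constraint.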

- 19 -

Multiple Memory Locations

[Figure: memory operations y1, x1, x2, and y2 on Proc 1 and x3, x4, x5, and y3 on Proc 2, followed by xFinal and yFinal; each operation is annotated with its old value and new value]

Program Order Constraints (Assume Sequential Consistency)

Proc1: Y1 < X1 < X2 < Y2 AND
Proc2: X3 < X4 < X5 < Y3 AND

Load-Store Constraints (M→old == M→prev→new)

X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND
:
Y1: Y1 < Y2 AND
Y2: Y1 < Y2 < Y3 AND
:

- 20 -

Satisfiability Modulo Theories (SMT) Solver

Ordering Constraints → SMT Solver → Total Order

Ordering Constraints
(Program Order) ∧
(Load-Store Order for X) ∧
(Load-Store Order for Y) ∧
:

[Figure: total order over x1, x2, x3, x4, x5, y1, y2, and y3]

SMT solver
• Find one valid total order from multiple solutions
• All solutions could be produced, if needed
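To make "find one valid total order" concrete, the sketch below uses a plain backtracking search over thread interleavings instead of an SMT solver. It applies the load-store rule directly: an operation may be scheduled only when its recorded old value equals the current contents of its location. The function name, the operation tuple format, and the concrete values in the example are assumptions chosen for illustration; they are not from the slides.

```python
# Illustrative sketch only: backtracking search standing in for an SMT solver.

def find_total_order(threads, initial, final=None):
    """threads[t] is a list of (name, addr, old, new) in program order.

    Returns one total order (list of names) consistent with program order,
    the old/new values, and the optional final state, or None.
    """
    heads = [0] * len(threads)   # next unscheduled op of each thread
    order = []

    def search(memory):
        if all(head == len(ops) for head, ops in zip(heads, threads)):
            return final is None or memory == final
        for t, ops in enumerate(threads):
            if heads[t] == len(ops):
                continue
            name, addr, old, new = ops[heads[t]]
            if memory[addr] != old:
                continue         # would violate M.old == M.prev.new
            heads[t] += 1
            order.append(name)
            if search({**memory, addr: new}):
                return True
            order.pop()
            heads[t] -= 1
        return False

    return order if search(dict(initial)) else None


# Assumed values: loads leave the value unchanged, stores replace it.
proc1 = [("X1", "x", 0, 0), ("X2", "x", 1, 1)]
proc2 = [("X3", "x", 0, 1), ("X4", "x", 1, 2), ("X5", "x", 2, 1)]
print(find_total_order([proc1, proc2], {"x": 0}, final={"x": 1}))
# ['X1', 'X3', 'X2', 'X4', 'X5']
```

Continuing the search after a solution is found, rather than returning, would enumerate all valid orders. An SMT solver does this work symbolically and scales much further than naive backtracking.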

- 21 -

Replay Guarantees

• The replayed execution has the same final register and memory states

• Each thread executes exactly the same sequence of instructions, with the same inputs and outputs

• Reconstructed shared memory dependencies obey program order and load-store semantics

- 22 -

Roadmap

• Motivation
• BugNet
• Offline Symbolic Analysis
  • Encoding Ordering Constraints
  • Bounding Search Space
• Evaluation
• Conclusion

- 23 -

Bounding Search Space

[Figure: the executions of Proc 1 and Proc 2 are cut every N cycles into Strata Region 1, Strata Region 2, and Strata Region 3; counts cnt 1 and cnt 2 are recorded at one boundary and cnt 3 and cnt 4 at the next; each region has an Initial State and a Final State]

Record “Strata hints”
• Each processor periodically records memory operation count
• Strata regions have a global order

SMT solver analyzes
• One region at a time
• Start from the last region
• Final state of a region = Initial state of the following region
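The way recorded counts bound the search can be sketched as a slicing step: each hint gives, per thread, how many memory operations had completed at a region boundary, so the per-thread operation lists can be cut into regions that are analyzed separately. The function name `split_into_regions` and the data layout are assumptions made for this illustration.

```python
# Illustrative sketch only: the hint layout is an assumption.

def split_into_regions(thread_ops, hints):
    """Cut per-thread operation lists into globally ordered regions.

    thread_ops[t] is the operation list of thread t.
    hints[r][t] is thread t's cumulative operation count at the end of region r.
    """
    regions = []
    starts = [0] * len(thread_ops)
    for counts in hints:
        regions.append([ops[start:end]
                        for ops, start, end in zip(thread_ops, starts, counts)])
        starts = list(counts)
    # Operations after the last hint form the final region.
    regions.append([ops[start:] for ops, start in zip(thread_ops, starts)])
    return regions


thread_ops = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3"]]
hints = [[1, 2], [3, 2]]
regions = split_into_regions(thread_ops, hints)

# Analyze one region at a time, starting from the last region.
for number, region in reversed(list(enumerate(regions, start=1))):
    print(number, region)
# 3 [['a4'], ['b3']]
# 2 [['a2', 'a3'], []]
# 1 [['a1'], ['b1', 'b2']]
```

Because the regions are globally ordered, an operation in one region never needs to be interleaved with operations from another region, which is what keeps each solver query small.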

- 24 -

Strata Hints

Cycle-bound
• After N cycles, each core records its memory operation count
• No communication is required between cores

Problem
• The size of a Strata region is not based on the number of shared memory dependencies
• Can we bound based on the number of shared memory dependencies?

Downgrade-bound
• Count coherence downgrade requests
• Requires communication between cores, but reduces offline analysis overhead
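The two hint policies can be contrasted with a rough software model. Everything here is an assumption made for illustration: the function names, the event formats, and in particular the choice to model downgrade-bound hints as a single shared downgrade counter that makes every core record its count. The slides do not specify the hardware mechanism at this level.

```python
# Illustrative sketch only: both policies are modelled, not taken from the paper.

def cycle_bound_hints(memop_cycles, n_cycles, total_cycles):
    """Per-core policy: log the cumulative memory-op count every n_cycles.

    memop_cycles is the sorted list of cycles at which this core issued
    a memory operation. No other core is consulted.
    """
    hints = []
    done = 0
    for boundary in range(n_cycles, total_cycles + 1, n_cycles):
        while done < len(memop_cycles) and memop_cycles[done] < boundary:
            done += 1
        hints.append(done)
    return hints


def downgrade_bound_hints(events, num_cores, d):
    """Shared policy: after every d downgrade requests, all cores log counts.

    events is a globally ordered list of (core, kind) pairs, where kind is
    "mem" for a memory operation or "downgrade" for a coherence downgrade.
    """
    counts = [0] * num_cores
    downgrades = 0
    hints = []
    for core, kind in events:
        if kind == "mem":
            counts[core] += 1
        elif kind == "downgrade":
            downgrades += 1
            if downgrades == d:
                hints.append(list(counts))
                downgrades = 0
    return hints


print(cycle_bound_hints([1, 3, 12, 15, 27], n_cycles=10, total_cycles=30))
# [2, 4, 5]
events = [(0, "mem"), (1, "mem"), (1, "downgrade"), (0, "mem"),
          (0, "downgrade"), (1, "mem"), (0, "downgrade")]
print(downgrade_bound_hints(events, num_cores=2, d=2))
# [[2, 1]]
```

Under the cycle-bound policy a region may contain any number of dependencies, whereas the downgrade-bound policy caps the number of downgrades per region, which is why it bounds the solver's work more directly.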

- 25 -

Filtering Local & Read-only Accesses

[Figure: a Strata Region in which Thread 1 and Thread 2 issue Load A, Store B, Load B, Store B, Store A, and six Load C operations]

Filter
• Local accesses: no shared-memory dependency
• Read-only accesses: any total order is valid

Effectiveness
• < 1% of memory accesses remain to be analyzed
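The two filters can be expressed as a single pass over a region. In this sketch the function name `filter_accesses`, the access format, and the example region are hypothetical; the thread assignment in the example is invented for illustration and is not read off the slide's figure.

```python
# Illustrative sketch only: the region format is an assumption.

def filter_accesses(region):
    """Keep only accesses the solver must order.

    region[t] is a list of (op, addr) pairs for thread t. An address is
    dropped if only one thread touches it (local) or if no thread stores
    to it (read-only).
    """
    threads_by_addr = {}
    written = set()
    for t, ops in enumerate(region):
        for op, addr in ops:
            threads_by_addr.setdefault(addr, set()).add(t)
            if op == "store":
                written.add(addr)
    keep = {addr for addr, threads in threads_by_addr.items()
            if len(threads) > 1 and addr in written}
    return [[(op, addr) for op, addr in ops if addr in keep]
            for ops in region]


region = [
    [("load", "A"), ("store", "B"), ("load", "B"), ("load", "C")],
    [("store", "A"), ("load", "C"), ("load", "C")],
]
print(filter_accesses(region))
# [[('load', 'A')], [('store', 'A')]]
```

Here B is local to the first thread and C is read-only, so only the accesses to A are passed on for constraint solving.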

- 26 -

Roadmap

• Motivation
• Record & Replay
• Offline Symbolic Analysis
• Evaluation
  • Strata Hint Size
  • Offline Symbolic Analysis Overhead
• Conclusion

- 27 -

Evaluation

• Simics + cycle accurate simulator
  • Simulate multi-processor execution (2, 4, 8, 16 cores)
  • Fast-forward up to known synchronization points
  • Trace collected for 500 million instructions
• Benchmarks
  • SPLASH2: barnes, fmm, ocean
  • Parsec 2.0: blackscholes, bodytrack, x264
  • SPEComp: wupwise, swim
  • Apache
  • MySQL
• Yices SMT constraint solver [Dutertre and de Moura CAV’06]

- 28 -

Strata Hints Size vs. Offline Analysis Overhead

• Downgrade-bound scheme is effective

[Figure: Strata log size (MB/sec, axis from 2.7 to 3.3) and offline analysis time (secs per sec of prog. exec, log scale from 100 to 1,000,000) for Cycle-bound (10,000), Downgrade-bound (25), and Downgrade-bound (10); annotations: 10% and x100]

• Offline analysis overhead is one-time cost (not for every replay)

- 29 -

Strata hints vs. ReRun log

• Strata hints are 4x smaller than the ReRun log
• Significant reduction in hardware complexity

[Figure: Strata log size (MB/sec, log scale from 1 to 100) for barnes, fmm, ocean, blackscholes, bodytrack, x264, wupwise, swim, apache, mysql, and average; series: Downgrade-bound (d10.c10000), the Proposed System, and Rerun (henkins), ReRun [Hower and Hill, ISCA’08]; annotation: x4]

- 30 -

Recording Performance, etc.

• Cache Miss Data Log
  • 290 Mbytes / one second of program execution
• Recording Performance
  • On average, 0.35% slowdown in IPC
• Scalability results can be found in the paper

- 31 -

Conclusion

• Deterministic replay for multi-threaded programs is critical

• We proposed a complexity-effective solution
  • Use BugNet: Record cache miss data
  • No need to record shared memory dependencies
  • Determine shared memory dependency using SMT constraint solver offline
• Result
  • < 1% recording overhead
  • Efficient log size (4x smaller than state-of-the-art scheme ReRun)
  • Can analyze one second of 8-threaded program in less than 1000 seconds
  • One-time offline analysis cost (not for every replay)

- 32 -

Thank you
