Computer Architecture Lab at
Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi, and Todd C. Mowry
ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications
Software Errors & Analysis Tools
• Errors abundant in parallel software
– Program crashes/vulnerabilities, limited performance
• Three main categories of analysis tools
– Checking before, during or after program execution
• Instruction-grain Lifeguards
– Online detailed analysis, but with high overhead
– Several tools available, but mostly support for single-threaded code
2© Evangelos Vlachos ASPLOS '10 - ParaLog
ParaLog: a framework for efficient analysis of parallel applications
Lifeguards and Parallel Applications
[Figure: application threads over time under timesliced and parallel execution and analysis]
• Timesliced Execution & Analysis: DBI tools available today
– Con: high overhead due to serialization
• Parallel Execution & Analysis: Butterfly Analysis (previous talk)
– Windows of uncertainty
– Pro: software-based
– Con: some false positives
• Parallel Execution & Analysis: ParaLog (this talk)
– Precise application order
– Pro: no false positives, even better performance
– Con: new hardware required
Low-Overhead Instruction-level Analysis
[Figure: online monitoring platform [Chen et al., ISCA ’08]. On the application core, event capturing turns the application thread's instructions into an event stream; event delivery hands each event to the lifeguard thread on the lifeguard core, which maintains the metadata. Accelerators: IT, IF, M-TLB.]
Example: the application instruction
    add r1, r2, r4
is captured as the event (add, r1, r2, r4) and delivered to the lifeguard handler:
    add_handler() {
        i = load_state(r2);
        j = load_state(r4);
        if (check(i, j)) upd_state(r1);
        else error();
    }
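The capture-and-deliver flow on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lifeguard code from the talk: the event tuple layout, the `state` dictionary standing in for lifeguard metadata, and the "both sources initialized" rule used in place of `check` are all hypothetical.

```python
# Illustrative sketch only: the event layout, the state encoding, and the
# check rule are assumptions, not the lifeguard implementation from the talk.

def add_handler(state, dst, src1, src2, errors):
    """Mirror of the slide's add_handler(): load state, check, update or report."""
    i = state.get(src1, False)            # load_state(r2)
    j = state.get(src2, False)            # load_state(r4)
    if i and j:                           # check(i, j): both sources initialized
        state[dst] = True                 # upd_state(r1)
    else:
        errors.append((dst, src1, src2))  # error()

HANDLERS = {"add": add_handler}

def run_lifeguard(event_stream, state):
    """Consume captured events in order and dispatch each one to its handler."""
    errors = []
    for op, dst, src1, src2 in event_stream:
        HANDLERS[op](state, dst, src1, src2, errors)
    return errors

state = {"r2": True, "r4": True}          # r2 and r4 start out initialized
events = [("add", "r1", "r2", "r4"),      # passes the check, marks r1
          ("add", "r5", "r1", "r9")]      # r9 was never initialized
errors = run_lifeguard(events, state)
```

The first event updates the state of `r1`; the second is reported because one of its sources has no recorded state.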
Challenges in Parallel Monitoring
[Figure: online parallel monitoring platform [ParaLog]. Each application thread 1..k feeds event capturing, an event stream, and event delivery to its own lifeguard thread 1..k. The lifeguard threads share global metadata, and each has its own accelerators (IT, IF, M-TLB).]
Addressing the Challenges
1. Application event ordering
2. Ensuring metadata access atomicity efficiently
3. Parallelizing hardware accelerators
[Figure: online parallel monitoring platform [ParaLog] with the additions that address the challenges. Application-only order capturing sits next to event capturing for each application thread 1..k, order enforcing sits next to event delivery for each lifeguard thread 1..k, and dependence arcs travel with the event streams. The lifeguard threads share global metadata.]
Outline
• Introduction
• Addressing the Challenges of Parallel Monitoring
1. Capturing & enforcing application event ordering
2. Ensuring metadata access atomicity
3. Parallelizing hardware accelerators
• Evaluation
• Conclusions
Event Ordering: the Problem
• Case Study: Information flow analysis (e.g., TaintCheck)
[Figure: in application time, application thread j executes store(A) before application thread k executes load(A). In lifeguard time, lifeguard thread j runs st_handler(A) and lifeguard thread k runs ld_handler(A), with no inherent ordering between the two handlers. Annotations: dependence arc {thread j, tj}; progress counters progressj (tj - 2, tj - 1, tj) and progressk (tk - 2, tk - 1, tk).]
• Expose happens-before information to lifeguards
Event Ordering: the Solution (1/2)
• Coherence-based ordering of application events
– Similar to FDR, but online, focusing on application-only events
[Figure: application thread j executes store(A) at time tj (timeline tj - 1, tj, tj + 1) and application thread k executes load(A) at time tk (timeline tk - 1, tk, tk + 1). Lifeguard thread j runs st_handler(A). Lifeguard thread k stalls before ld_handler(A): wait while progressj < tj.]
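The "wait while progressj < tj" rule can be sketched with two Python threads standing in for lifeguard threads j and k. The `Progress` class, the log record layout `(timestamp, op, address, arc)`, and the taint set are hypothetical stand-ins chosen for this sketch; the talk describes hardware-assisted capturing, which is not modeled here.

```python
import threading

class Progress:
    """Progress counter of one lifeguard thread (hypothetical helper)."""
    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def advance(self, timestamp):
        with self._cond:
            self._value = timestamp
            self._cond.notify_all()

    def wait_for(self, timestamp):
        # Equivalent to the slide's "wait while progress_j < t_j".
        with self._cond:
            self._cond.wait_for(lambda: self._value >= timestamp)

def lifeguard(tid, log, progress, tainted, observed):
    for timestamp, op, addr, arc in log:
        if arc is not None:                   # arc = (producer thread, timestamp)
            producer, t = arc
            progress[producer].wait_for(t)
        if op == "store":
            tainted.add(addr)                 # st_handler: mark the address tainted
        elif op == "load":
            observed.append(addr in tainted)  # ld_handler: read the metadata
        progress[tid].advance(timestamp)      # publish progress after the handler

progress = {"j": Progress(), "k": Progress()}
tainted, observed = set(), []
log_j = [(1, "store", "A", None)]
log_k = [(1, "load", "A", ("j", 1))]          # load(A) depends on {thread j, t_j = 1}

k = threading.Thread(target=lifeguard, args=("k", log_k, progress, tainted, observed))
j = threading.Thread(target=lifeguard, args=("j", log_j, progress, tainted, observed))
k.start()                                     # the consumer starts first and must stall
j.start()
k.join()
j.join()
```

Even though lifeguard k starts first, its load handler only runs after lifeguard j has published progress for the store, so it sees the taint written by the store handler.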
Is monitoring coherence enough?
Event Ordering: the Solution (2/2)
• Previous work has not solved the problem of Logical Races
• Both logical races and system calls resolved with Conflict Alert messages
[Figure: logical race. In application time, application thread j calls free(A) and application thread k executes load(A). In lifeguard time, lifeguard thread j changes Metadata(A) between free(A) start and free(A) end, while lifeguard thread k's ld_handler(A) reads Metadata(A). A Conflict Alert message creates the dependence between the two.]
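A small single-threaded replay can illustrate why a logical race needs an explicit dependence. The sketch below assumes that a Conflict Alert leaves a record carrying an arc `(source thread, timestamp)` in every other thread's log; that record layout, the helper names, and the round-robin replay are hypothetical simplifications, not the mechanism as specified in the paper.

```python
def conflict_alert(logs, src, timestamp, addr):
    """Append an alert record carrying the arc (src, timestamp) to every other log."""
    for tid, log in logs.items():
        if tid != src:
            log.append(("alert", addr, (src, timestamp)))

def replay(logs, allocated):
    """Round-robin replay of per-thread logs that stalls on unmet arcs."""
    progress = {tid: 0 for tid in logs}
    pos = {tid: 0 for tid in logs}
    errors = []
    while any(pos[t] < len(logs[t]) for t in logs):
        moved = False
        for tid, log in logs.items():
            if pos[tid] == len(log):
                continue
            kind, addr, arc = log[pos[tid]]
            if arc is not None and progress[arc[0]] < arc[1]:
                continue                      # producer has not reached the arc yet
            if kind == "free":
                allocated.discard(addr)       # metadata: address no longer valid
            elif kind == "load" and addr not in allocated:
                errors.append((tid, addr))    # AddrCheck-style violation
            pos[tid] += 1
            progress[tid] = pos[tid]          # timestamp = records retired so far
            moved = True
        if not moved:
            raise RuntimeError("cyclic dependence")
    return errors

def demo(with_alert):
    logs = {"k": [], "j": []}                 # k is replayed first in each round
    logs["j"].append(("free", "A", None))     # timestamp 1 in thread j's log
    if with_alert:
        conflict_alert(logs, "j", 1, "A")
    logs["k"].append(("load", "A", None))     # happens after free(A) in the application
    return replay(logs, {"A"})

missed = demo(with_alert=False)
caught = demo(with_alert=True)
```

Without the alert record, lifeguard k checks the load before the free has updated the metadata and reports nothing; with it, the load handler waits and the access to freed memory is reported.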
Metadata Atomicity
• Frequent use of locking is too expensive
– # of instructions added & synchronization cost
• Dependence arcs handle the majority of the cases
– Sufficient conditions:
1. One-to-one data-to-metadata mapping
2. Application reads don’t become metadata writes
– Under these conditions, enforcing dependence arcs ensures race-free operation
• Rest of the cases handled by acquiring a lock
– Lock used only in the load_handler(); other handlers safe
(more details in the paper)
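The fast-path and locked-path split can be sketched as follows. The `Lifeguard` class, its `reads_write_metadata` flag, and the per-address read counter are made up for illustration; they stand for any analysis in which an application read turns into a metadata write, which violates the second sufficient condition.

```python
import threading

class Lifeguard:
    """Toy lifeguard whose load handler may write metadata (hypothetical)."""
    def __init__(self, reads_write_metadata):
        self.reads_write_metadata = reads_write_metadata
        self.metadata = {}
        self._lock = threading.Lock()

    def store_handler(self, addr, value):
        # Application writes are ordered by dependence arcs: no lock taken.
        self.metadata[addr] = value

    def load_handler(self, addr):
        if not self.reads_write_metadata:
            return self.metadata.get(addr)    # fast path: metadata read only
        with self._lock:                      # slow path: read-modify-write
            self.metadata[addr] = self.metadata.get(addr, 0) + 1
            return self.metadata[addr]

def reader(lifeguard, addr, n):
    for _ in range(n):
        lifeguard.load_handler(addr)

counting = Lifeguard(reads_write_metadata=True)
threads = [threading.Thread(target=reader, args=(counting, "A", 1000))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
read_count = counting.metadata["A"]

plain = Lifeguard(reads_write_metadata=False)
plain.store_handler("A", "tainted")
fast_path_value = plain.load_handler("A")
```

Concurrent application reads carry no dependence arc between them, so only the load handler needs the lock, and only when it writes metadata.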
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
– Metadata-TLB; fast metadata address calculation
– Idempotent Filters; filter out redundant checking
– Inheritance Tracking; fast tracking of dataflow paths
• Accelerators have only local view of the analysis
– Cache locally analysis information (e.g., frequent events)
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state
• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators’ state
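The flush-on-alert idea can be sketched with a software stand-in for an Idempotent Filter. The `IdempotentFilter` class and the `conflict_alert` helper are hypothetical; the real filters are hardware structures, and this sketch only models "remember what was already checked, forget it when an important remote event arrives".

```python
class IdempotentFilter:
    """Per-lifeguard cache of already-checked events (illustrative only)."""
    def __init__(self):
        self._seen = set()

    def admit(self, event):
        """Return True if the event must be delivered to the lifeguard."""
        if event in self._seen:
            return False                  # redundant: the check would repeat
        self._seen.add(event)
        return True

    def flush(self):
        self._seen.clear()

def conflict_alert(filters):
    """Flush every filter's local state, local and remote alike."""
    for f in filters.values():
        f.flush()

filters = {"LG0": IdempotentFilter(), "LG1": IdempotentFilter()}
delivered = []
delivered.append(filters["LG0"].admit(("R", "A")))   # first read of A: delivered
delivered.append(filters["LG0"].admit(("R", "A")))   # repeated read: filtered out
conflict_alert(filters)                               # another thread ran free(A)
delivered.append(filters["LG0"].admit(("R", "A")))   # must be checked again
```

Without the flush, the third read would be discarded as redundant and the access to freed memory would go unchecked.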
Outline
• Introduction
• Addressing the Challenges of Parallel Monitoring
– Capturing & enforcing application event ordering
– Ensuring metadata access atomicity
– Parallelizing hardware accelerators
• Evaluation
• Conclusions
Experimental Framework
• Log-Based Architectures framework
– Simics full-system simulation
– CMP system with {2, 4, 8, 16} cores
– {1, 2, 4, 8} application threads and as many lifeguard threads
– Sequentially Consistent memory model
• Benchmarks and multithreaded Lifeguards used
– SPLASH-2 and PARSEC
– TaintCheck: Information flow tracking; accelerated by M-TLB, IT
– AddrCheck: Memory access checking; accelerated by M-TLB, IF
• Comparison with Timesliced Monitoring
Performance Results: AddrCheck
[Figure: AddrCheck performance, normalized to sequential, unmonitored execution. Annotation: 8 app/lifeguard threads, 16 cores total. Labels on clipped bars, in extraction order: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4.]
• Timesliced Monitoring is not scalable
• On average 15x slowdown over No Monitoring (8 threads)
• Highest ParaLog overhead with 8 threads: SWAPTIONS 6x
• Lowest ParaLog overhead with 8 threads: < 5%
• Average ParaLog overhead with 8 threads: 26%
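These bullets mix "x" factors and percentages. The sketch below shows the usual way the two relate, assuming both are measured against an unmonitored run with the same number of threads; that baseline is an assumption, and the timings are made-up values chosen only to illustrate the arithmetic.

```python
def slowdown_factor(monitored_time, unmonitored_time):
    """Monitored execution time as a multiple of the unmonitored time."""
    return monitored_time / unmonitored_time

def overhead_percent(monitored_time, unmonitored_time):
    """Extra execution time caused by monitoring, as a percentage."""
    return (monitored_time / unmonitored_time - 1.0) * 100.0

# Hypothetical timings, not measurements from the talk.
factor = slowdown_factor(600.0, 100.0)
percent = overhead_percent(126.0, 100.0)
```

Under this convention an overhead of 26% corresponds to a slowdown factor of 1.26.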
Performance Results: TaintCheck
[Figure: TaintCheck performance. Labels on clipped bars, in extraction order: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7.]
• Timesliced Monitoring is not scalable
• On average 23x slowdown over No Monitoring (8 threads)
• Highest ParaLog overhead with 8 threads: BARNES 2.6x
• Lowest ParaLog overhead with 8 threads: LU 5%
• Average ParaLog overhead with 8 threads: 48%
Other Results in the Paper
• Order capturing and order enforcing under TSO
• Performance Impact of Lifeguard Accelerators
– AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]
• A less expensive order capturing mechanism gets similar performance results
– 1 timestamp per core vs. 1 timestamp per cache block
Conclusions
• ParaLog: Fast and precise parallel monitoring
• Components of event ordering
– Normal memory accesses: monitor coherence activity
– Logical Races: use of Conflict Alert messages
• Metadata Atomicity
– Enforcing dependence arcs ensures atomicity (most cases)
• Parallel Hardware Accelerators
– Flush local state on remote events (Conflict Alert)
• Average overhead is relatively low
– AddrCheck: 26% and TaintCheck: 48% (8 threads)
Questions?
Backup Slides
Metadata Atomicity
• Synchronization-free fast path vs. slow path
– Concurrent application reads; no ordering available!
• Concurrent metadata reads: follow the fast path
• Concurrent metadata writes: follow the slow path, acquiring a lock
• Concurrent metadata read and write: read may get either value
– In any other case dependence arcs are available
[Table: Application Event (R, W) versus Lifeguard Action (R, W), annotated with the lifeguards AddrCheck, TaintCheck, MemCheck, and LockSet.]
Parallel Hardware Accelerators
• Accelerators have only local view of the analysis
– Important events have system-wide effects
– Case study: Idempotent Filters and AddrCheck
[Figure: Idempotent Filters (IF) with AddrCheck on two lifeguards, LG 0 and LG 1. Each IF sees a stream of reads such as R(A), R(B), and R(C); a read is either delivered to the lifeguard (✔) or found redundant and discarded (✖). A free(A) flushes local and remote IF filters, so a later R(A) is delivered again. Builds on Remote Conflict Messages.]
• Details for parallel M-TLB and IT can be found in the paper
Performance Impact of Lifeguard Accelerators
[Figure: performance impact of lifeguard accelerators. Labels on clipped bars: 9.4, 6.8, 7.3, 11.3.]
• TaintCheck: accelerators provide a major speedup [2x – 9x]
• AddrCheck: accelerators provide a major speedup [1.13x – 3.4x]
Transitive Reduction Sensitivity Study
• Limited transitive reduction
– No major performance impact; savings in chip area
Supporting Total Store Order (TSO)
• Cycle of dependencies in relaxed memory models
– TSO relaxes the RAW ordering
– Previous work (RTR): maintain versions of data
– Identify SC offending instructions; save loaded value
• This paper: maintain versions of metadata
[Figure: TSO example with commit order 0, 1, 2 and the resulting memory order.
Thread 0 commits Wr(A) then Rd(B); Thread 1 commits Wr(B) then Rd(A).
Log annotations: P(v1, A), C(v0, B), P(v0, B), C(v1, A).
Log 0: Wr(A), Rd(B, v0). Log 1: Wr(B), Rd(A, v1).
Lifeguard 0: produce_version(v1, A); store_handler(A); wait_until_available(v0, B); load_handler(B, v0).]
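The lifeguard-side sequence in this figure can be sketched with two Python threads. The sketch assumes that P and C stand for producing and consuming a metadata version, and that a version is a snapshot of the metadata taken just before the store handler runs; the `VersionTable` class and the string metadata values are hypothetical.

```python
import threading

class VersionTable:
    """Holds saved metadata versions until a consumer asks for them (hypothetical)."""
    def __init__(self):
        self._versions = {}
        self._cond = threading.Condition()

    def produce_version(self, version, addr, metadata):
        with self._cond:
            self._versions[(version, addr)] = metadata.get(addr)
            self._cond.notify_all()

    def wait_until_available(self, version, addr):
        with self._cond:
            self._cond.wait_for(lambda: (version, addr) in self._versions)
            return self._versions[(version, addr)]

def lifeguard(own, other, produce, consume, versions, metadata, seen):
    versions.produce_version(produce, own, metadata)     # snapshot before the store
    metadata[own] = "new"                                # store_handler(own)
    old = versions.wait_until_available(consume, other)  # stall until produced
    seen[other] = old                                    # load_handler(other, version)

metadata = {"A": "old", "B": "old"}
versions, seen = VersionTable(), {}
t0 = threading.Thread(target=lifeguard,
                      args=("A", "B", "v1", "v0", versions, metadata, seen))
t1 = threading.Thread(target=lifeguard,
                      args=("B", "A", "v0", "v1", versions, metadata, seen))
t0.start()
t1.start()
t0.join()
t1.join()
```

Each lifeguard produces its version before it waits, so the two waits cannot form a cycle, and both load handlers see the metadata from before the other thread's store.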
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
– Fast metadata address calculation: Metadata-TLB
– Fast tracking of data-flow paths: Inheritance Tracking
– Filter out redundant checking: Idempotent Filters
• Per-instruction checking gives the same result; cache event
• Accelerators have only local view of the analysis
– Important events have system-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state
• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush state and deliver pending events
Experimental Framework
Benchmarks             Input
barnes                 16K bodies
ocean                  Grid: 258 x 258
lu                     Matrix: 1024 x 1024
fmm                    32768 particles
radiosity              Base problem
blackscholes           Simlarge
fluidanimate           Simlarge
swaptions              Simlarge

Simulation Parameters
Cores                  {2, 4, 8, 16}, 1 GHz, in-order scalar x86
L1I & L1D (private)    64KB, 64B line, 4-way assoc.
L2 (shared)            {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
Memory                 90-cycle latency
Log Buffer             64KB per thread

Multithreaded Lifeguards
TaintCheck             Information flow tracking; accelerated by M-TLB and IT
AddrCheck              Memory access checking; accelerated by M-TLB and IF
Relative Slowdown - TaintCheck
[Figure: TaintCheck relative slowdown (y-axis: Slowdown, 0.0 to 3.0) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCHOLES, FLUIDANIMATE, SWAPTIONS, FMM, and RADIOSITY. Each bar is split into Useful Work, Waiting for Dependence, and Waiting for Application.]
Relative Slowdown - AddrCheck
[Figure: AddrCheck relative slowdown (y-axis: Slowdown, 0.0 to 1.4) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCHOLES, FLUIDANIMATE, SWAPTIONS, FMM, and RADIOSITY. Each bar is split into Useful Work, Waiting for Dependence, and Waiting for Application. Labels on clipped bars: 3.0, 6.0.]
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
– Metadata-TLB & Inheritance Tracking (discussed in the paper)
– Idempotent Filters; identify and filter out redundant checking
• Per-instruction checking gives the same result
• Cache incoming event and local state to identify redundancy
• Accelerators have only local view of the analysis
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state
• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators’ state