HARD: Hardware-Assisted lockset-based Race Detection
P. Zhou, R. Teodorescu, Y. Zhou. HPCA’07
Shimin Chen
LBA Reading Group Presentation
Motivation
Data race detection is important
S/W solutions are slow (not good for production runs)
Previous H/W solutions focus on the happens-before relation
– Cannot detect potential races
Motivating Example
Solution: HARD (h/w lockset)
Challenges:
– How to efficiently store and maintain the lockset for each variable in hardware?
– How to efficiently perform the set operations in the lockset algorithm?
Main ideas (will be detailed later)
– h/w bloom filter
– Piggybacking on cache coherence protocols
– Reset all bloom filters after exiting a barrier
Outline
LockSet (refresh our memory)
HARD
Evaluation
Conclusion
Main Lockset Algorithm
Idea: accesses to every shared variable should be protected by some common lock.
Data structures:
– Thread t’s current lock set: L(t)
– Candidate set for a variable v: C(v)
Algorithm:
– Modify L(t) upon lock acquire and release
– Initialize C(v) to be the set of all locks
– When t accesses v, C(v) = C(v) ∩ L(t)
– If C(v) == ∅ then report violation on variable v
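The steps above can be sketched in software as follows. This is a minimal illustrative model, not the HARD hardware design; the class name `LocksetDetector`, its method names, and the `demo` trace are hypothetical and chosen only for this example.

```python
class LocksetDetector:
    """Minimal software model of the basic lockset algorithm."""

    def __init__(self, all_locks):
        self.all_locks = frozenset(all_locks)
        self.held = {}        # L(t): thread -> set of locks currently held
        self.candidates = {}  # C(v): variable -> candidate lock set
        self.violations = []  # variables whose candidate set became empty

    def acquire(self, thread, lock):
        self.held.setdefault(thread, set()).add(lock)

    def release(self, thread, lock):
        self.held.setdefault(thread, set()).discard(lock)

    def access(self, thread, var):
        # C(v) starts as the set of all locks and is refined on each access.
        c = self.candidates.get(var, self.all_locks)
        c = c & self.held.get(thread, set())
        self.candidates[var] = c
        if not c and var not in self.violations:
            self.violations.append(var)
        return c


def demo():
    d = LocksetDetector({"A", "B"})
    # x is accessed under lock A by t1 and under lock B by t2: no common lock.
    d.acquire("t1", "A"); d.access("t1", "x"); d.release("t1", "A")
    d.acquire("t2", "B"); d.access("t2", "x"); d.release("t2", "B")
    # y is always accessed under lock A: no violation.
    d.acquire("t1", "A"); d.access("t1", "y"); d.release("t1", "A")
    d.acquire("t2", "A"); d.access("t2", "y"); d.release("t2", "A")
    return d.violations
```

In this trace only `x` is reported, even though the two accesses may never overlap in time; that is the sense in which lockset detects potential races.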
Reducing False Positives
HARD Overview
LState: exclusive, shared, etc.
BFVector: candidate lock set for the cache line
Lock Register: Thread’s lockset
Counter Register: used for resolving hash collisions (more detail later)
Field widths (from figure): LState 2 bits, BFVector 16 bits, Lock Register 16 bits, Counter Register 32 bits
HARD Overview: Operations
A lock → a ‘1’ in the bloom filter
Fetching a line from memory: set the BFVector to all 1s, LState to exclusive
Update BFVector and LState on accesses
Communicate them through the coherence protocol
Lock register: thread’s lock set
Bloom Filter
Bloom filter: a bit vector that represents a set of keys
– A key is hashed d (e.g. d=3) times and represented by d bits
Construct: for every key in the set, set its 3 bits in the vector
Membership test: given a key, check if all its 3 bits are 1
– Definitely not in the set if some bits are 0
– May have false positives
[Figure: a key is hashed by H0, H1, H2 to select Bit0, Bit1, Bit2 in the filter bit vector]
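A generic bloom filter with d=3 hash functions might look like the sketch below. The 20-bit vector size follows the figure; the hash family (SHA-256 salted with the hash index) is an assumption made only so the example is deterministic and self-contained, and is not what a hardware filter would use.

```python
import hashlib


class BloomFilter:
    """Bit-vector set representation with possible false positives."""

    def __init__(self, num_bits=20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, key):
        # Hypothetical hash family H0..H(d-1): salt SHA-256 with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, key):
        # Construct: set every bit the key hashes to.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))
```

A key that was added always tests positive; a key that was never added usually tests negative, but can test positive if other keys happen to have set all of its bits.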
Representing LockSet as Bloom Filter
4 hash functions
Lockset intersection: bloom filter intersection (bitwise AND)
Lockset empty: any of the four 4-bit parts is all 0
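The sketch below models a lockset as a 16-bit vector split into k=4 parts of n=4 bits, with one hash function per part. How a lock address is mapped to a bit in each part is an assumption here (two address bits per part); the names `lock_signature`, `intersect`, and `is_empty` are hypothetical.

```python
PARTS = 4          # k: number of parts, one hash function each
BITS_PER_PART = 4  # n: bits per part
ALL_ONES = (1 << (PARTS * BITS_PER_PART)) - 1  # 0xFFFF stands for "all locks"


def lock_signature(lock_addr):
    """Map a lock address to a 16-bit vector with one bit set per part.

    Assumption: part i is indexed by address bits [2+2i, 3+2i]; the real
    hardware hash functions may differ.
    """
    sig = 0
    for part in range(PARTS):
        index = (lock_addr >> (2 + 2 * part)) & (BITS_PER_PART - 1)
        sig |= 1 << (part * BITS_PER_PART + index)
    return sig


def intersect(bf_vector, lock_register):
    # Set intersection on bloom filters is a bitwise AND.
    return bf_vector & lock_register


def is_empty(bf_vector):
    # Every lock sets one bit in each part, so an all-zero part means no lock.
    mask = (1 << BITS_PER_PART) - 1
    return any(((bf_vector >> (p * BITS_PER_PART)) & mask) == 0
               for p in range(PARTS))
```

For example, starting from `ALL_ONES` and intersecting with `lock_signature(0x1000)` and then `lock_signature(0x1004)` leaves the first part all zero, so `is_empty` reports an empty candidate set. A thread holding no lock has a lock register of 0, which empties any candidate set it is intersected with.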
False Negative Caused by Bloom Filter
Prob of False Negatives
Suppose the candidate set contains m locks
Given a lock, probability of recognizing it as a member:
prob_whole = prob_part^k
prob_part = 1 - (1 - 1/n)^m
When k=4, n=4:
– 0.0039 (m=1), 0.037 (m=2), 0.112 (m=3)
– Paper says: “experiments show that no races were missed”
But what if the thread currently holds multiple locks?
[Figure: bloom filter vector split into k parts of n bits each; k=4, n=4]
If threads hold 1 to 8 locks (not in the paper)
t: locks held by the thread; m: locks in the candidate set
n bits = 4, k parts = 4
-----------------------------------------------
        m=1     m=2     m=3     m=4
t=1 :   0.0039  0.0366  0.1117  0.2184
t=2 :   0.0078  0.0719  0.2109  0.3891
t=3 :   0.0117  0.1059  0.2991  0.5225
t=4 :   0.0155  0.1387  0.3774  0.6267
t=5 :   0.0194  0.1702  0.4469  0.7083
t=6 :   0.0232  0.2006  0.5087  0.7720
t=7 :   0.0270  0.2299  0.5636  0.8218
t=8 :   0.0308  0.2581  0.6123  0.8607
-----------------------------------------------
Try another design
n bits = 8, k parts = 8
-----------------------------------------------
        m=1     m=2     m=3     m=4
t=1 :   0.0000  0.0000  0.0001  0.0009
t=2 :   0.0000  0.0000  0.0003  0.0017
t=3 :   0.0000  0.0000  0.0004  0.0026
t=4 :   0.0000  0.0000  0.0006  0.0034
t=5 :   0.0000  0.0000  0.0007  0.0043
t=6 :   0.0000  0.0001  0.0008  0.0051
t=7 :   0.0000  0.0001  0.0010  0.0060
t=8 :   0.0000  0.0001  0.0011  0.0069
-----------------------------------------------
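The numbers in both tables can be reproduced with a few lines of code. The single-lock formula is the one given on the earlier slide; the extension to t held locks, 1 - (1 - prob_whole)^t, is an assumption inferred from the table values and treats the t locks as independent. The function names are hypothetical.

```python
def prob_false_member(m, n=4, k=4):
    """Probability that a lock outside an m-lock candidate set looks like a member."""
    prob_part = 1 - (1 - 1 / n) ** m  # its bit in one part is already set
    return prob_part ** k             # this must happen in all k parts


def prob_missed(t, m, n=4, k=4):
    """Probability that at least one of t held locks is falsely recognized.

    Assumption: the t locks are treated as independent.
    """
    return 1 - (1 - prob_false_member(m, n, k)) ** t


def table(n=4, k=4, max_t=8, max_m=4):
    """Rows are t = 1..max_t, columns are m = 1..max_m, rounded to 4 places."""
    return [[round(prob_missed(t, m, n, k), 4) for m in range(1, max_m + 1)]
            for t in range(1, max_t + 1)]
```

Calling `table()` gives the k=4, n=4 design and `table(n=8, k=8)` the alternative design. The alternative uses 64 bits per vector instead of 16, so the lower false-negative rate comes at four times the storage.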
Unlock operation: remove bit from bloom filter?
32-bit counter register: each bloom filter bit has a 2-bit counter
Increment the 2-bit counter if the bloom filter bit is set
Unlock: decrement the 2-bit counter; if 0, clear the bloom filter bit
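The counting scheme can be modeled as below. This is a sketch under stated assumptions: a lock's bloom filter signature is passed in as a 16-bit integer, and a counter that would overflow saturates at 3 (the slide does not say how overflow is handled). The class name `LockRegister` and its methods are hypothetical.

```python
class LockRegister:
    """Software model of a 16-bit lock register with one 2-bit counter per bit."""

    NUM_BITS = 16
    COUNTER_MAX = 3  # largest value a 2-bit counter can hold

    def __init__(self):
        self.bits = 0
        self.counters = [0] * self.NUM_BITS

    def acquire(self, signature):
        for i in range(self.NUM_BITS):
            if signature & (1 << i):
                self.bits |= 1 << i
                # Assumption: saturate rather than wrap on overflow.
                self.counters[i] = min(self.counters[i] + 1, self.COUNTER_MAX)

    def release(self, signature):
        for i in range(self.NUM_BITS):
            if signature & (1 << i) and self.counters[i] > 0:
                self.counters[i] -= 1
                if self.counters[i] == 0:
                    # No held lock maps to this bit any more.
                    self.bits &= ~(1 << i)
```

With two locks whose signatures are `0x1111` and `0x1112`, acquiring both gives `bits == 0x1113`. Releasing the first clears only bit 0, because the three bits the locks share still have a count of 1. Without the counters, the release would wrongly clear bits that the second lock still needs.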
Candidate Set and LState Communications
Must broadcast changes to C(v) if the cache line is in shared state
Handling Barriers
Set BFVectors to all 1s after exiting a barrier
(what if t2 does not hold any lock?)
Three Approximations
Bloom filter to represent lockset
Lockset info only in cache
– Can only detect races in a short window of execution
Cache line granularity
– False sharing
– Compiler to put shared variables to different lines?
– Removing false sharing is generally good
Methodology
SESC: cycle-accurate execution-driven simulator (MIPS instruction set)
Six SPLASH-2 benchmarks
Randomly inject a data race: randomly remove a dynamic instance of lock and corresponding unlock
Compare with happens-before, ideal lockset
Bug detected, false alarms
Ideal: word-granularity, keep state in memory, perfect lockset
# of false alarms is # of source code locations; dynamic errors are far more numerous
Mainly bus traffic increase
Note that HARD requires a bloom filter operation per memory access in the processor pipeline
Conclusion
Main idea: bloom filter to represent lockset
Three approximations:
– Bloom filter to represent lockset
– Lockset info only in cache
– Cache line granularity
Problems:
– Lockset: false positives
– Seems hard to add operations into processor pipeline
– Are these the right approximations for monitoring production runs?