
Efficient Runtime Detection and Toleration of Asymmetric Races

Paruj Ratanaworabhan, Member, IEEE, Martin Burtscher, Senior Member, IEEE,

Darko Kirovski, Member, IEEE, Benjamin Zorn, Member, IEEE Computer Society,

Rahul Nagpal, Member, IEEE, and Karthik Pattabiraman, Member, IEEE

Abstract—We introduce ToleRace, a runtime system that allows programs to detect and even tolerate asymmetric data races. Asymmetric races are race conditions where one thread correctly acquires and releases a lock for a shared variable while another thread improperly accesses the same variable. ToleRace provides approximate isolation in the critical sections of lock-based parallel programs by creating a local copy of each shared variable when entering a critical section, operating on the local copies, and propagating the appropriate copies upon leaving the critical section. We start by characterizing all possible interleavings that can cause races and precisely describe the effect of ToleRace in each case. Then, we study the theoretical aspects of an oracle that knows exactly what type of interleaving has occurred. Finally, we present software implementations of ToleRace and evaluate them on multithreaded applications from the SPLASH2 and PARSEC suites.

Index Terms—Debugging aids, dynamic instrumentation, parallel programming, race detection and toleration.


1 INTRODUCTION

THIS paper tackles data race problems in lock-based parallel programs. It focuses on programs written in unsafe languages such as C or C++ that use add-on libraries for threading and synchronization. At present, a large installed code base of such programs exists and programmers continue to write parallel code in this paradigm.

In general, a race is defined as a condition where multiple threads access a shared memory location without synchronization and there is at least one write among the accesses. With proper synchronization, lock-based programs adhere to the data-race-free model [3] where synchronization operations are made explicit by calls to specific library functions, e.g., pthread_mutex_lock in POSIX threads (pthreads). In this model, the hardware appears sequentially consistent with respect to the programs even though it may be weakly ordered in reality.

We are interested in asymmetric races, which occur when one thread correctly protects a shared variable using a lock while another thread accesses the same variable improperly due to a synchronization error (e.g., not taking a lock, taking the wrong lock, taking a lock late, etc.).

An example of an asymmetric race is shown in Fig. 1. Here, Thread 1 correctly uses a critical section to protect its read accesses to the shared variable gScript. Thread 2 incorrectly updates gScript without a lock, thus creating a race. The race occurs infrequently, i.e., only when Thread 2’s update happens between the test for NULL and the else part of the conditional in Thread 1. Our reasons for focusing on asymmetric races are:

1.1 They Are Common in Software Development Projects

This conclusion comes from direct experience with developers in software houses like Microsoft. There are two reasons for this. First, usually a programmer’s local reasoning about concurrency, e.g., taking proper locks to protect shared variables, is correct. Errors due to taking wrong locks or no locks lie outside of the programmer’s code, for example, in third-party libraries. Given that lock-based programs rely on convention, this phenomenon is understandable. The second reason has to do with legacy code. As software evolves, assumptions about a piece of code may be invalidated. For instance, a library may have been written assuming a single-threaded environment, but later the requirements change and multiple threads use it. An expedient response to this change is to demand that all clients wrap their calls to the library, acquiring locks before entry and releasing them on exit. Because this solution requires that all clients be changed, races can be introduced when clients fail to follow the proper locking discipline.

1.2 Symmetric Races Are Often Benign

Because calls to synchronization operations are expensive, programmers often resort to lightweight user-defined synchronization, as shown in Fig. 2, where Threads 1 and 2 synchronize on the flag variable. In this situation, even though a race occurs by definition (the shared variable flag is accessed without explicit synchronization), it does not

548 IEEE TRANSACTIONS ON COMPUTERS, VOL. 61, NO. 4, APRIL 2012

. P. Ratanaworabhan is with the Faculty of Engineering, Kasetsart University, Rm. 312, 50 Phaholyothin Rd., Bangkok 10900, Thailand. E-mail: [email protected].

. M. Burtscher is with Texas State University-San Marcos, 601 University Drive, San Marcos, TX 78666. E-mail: [email protected].

. D. Kirovski and B. Zorn are with Microsoft Research, One Microsoft Way, Redmond, WA 98052. E-mail: {darkok, zorn}@microsoft.com.

. R. Nagpal is with the Department of CSA, Indian Institute of Science, Bengaluru 560012, Karnataka, India. E-mail: [email protected].

. K. Pattabiraman is with the University of British Columbia, Fred Kaiser Building, 2332 Main Mall, Vancouver, BC V6T 1Z4, Canada. E-mail: [email protected].

Manuscript received 26 May 2009; revised 31 July 2010; accepted 19 Jan. 2011; published online 18 Feb. 2011. Recommended for acceptance by A. George. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TC-2009-05-0288. Digital Object Identifier no. 10.1109/TC.2011.48.

0018-9340/12/$31.00 © 2012 IEEE Published by the IEEE Computer Society


harm the program. Narayanasamy et al. [22] show other types of benign symmetric races, e.g., redundant writes and disjoint bit manipulation. Their experience with Windows Vista and Internet Explorer indicates that these benign races are rather common.

This work presents ToleRace, a runtime system that not only detects asymmetric races but also tolerates them. ToleRace allows programs to continue executing in the presence of asymmetric races and possibly complete with well-defined semantics. Inspired by the DieHard system [5], which probabilistically tolerates memory safety errors, ToleRace uses replication to detect and/or tolerate races. It provides an approximation of isolation in critical sections by creating local copies of shared variables when a critical section is entered, operating on the local copy while in the critical section, detecting conflicting changes to shared data when the critical section is exited, and propagating the appropriate copy when possible to hide the race.

ToleRace can be compared to transactional memory (TM) [14]. The ToleRace mechanism outlined above is analogous to constructing a read-write set while executing in a transaction with a lazy versioning policy and lazily detecting conflicts to the set, i.e., just before the transaction commits. However, ToleRace is not based on optimistic synchronization as TM is; there is no notion of abort-and-rollback, nor is there a need for contention management. Whereas handling side-effect operations and nested transactions are still open issues with TM, ToleRace handles all I/O operations as well as overlapped critical sections transparently. While TM can provide isolation and tolerate races just as ToleRace does, it is not clear how TM can be applied to existing lock-based code. Converting from lock-based to transaction-based code is not trivial [7].

This paper makes the following contributions:

. Foundations for runtime management of races. We present a theoretical framework that investigates all possible interactions among safe threads that observe a proper locking discipline and unsafe threads that fail to do so. Then, we focus on cases where a race occurs, categorize them, and describe our race detection and toleration scheme for each category.

. Precise race detection. ToleRace identifies races that actually happen at runtime. It detects a race when the critical section in which the race took place exits and, by design, never generates a false positive.

. Low overhead software implementation. We present three software implementations of ToleRace. Our first version uses a dynamic instrumentation-based approach and performs all analysis at runtime. For the second version, we add a static program analysis phase to remedy the shortfalls in the first version. The third version is radically different from the first two. It is based on source-code modifications to implement ToleRace.

2 CHARACTERIZING ASYMMETRIC RACES

To characterize asymmetric races, we consider all interleavings between operations in a correctly synchronized thread and a second, unsynchronized thread. We then reduce the interleavings that result in races into four classes and consider how ToleRace handles each class. We assume that there are two types of threads:

. a safe thread consisting of a single critical section, and

. a nonsafe thread that might access a shared variable outside of a critical section or using the wrong lock to guard it.

Let r, w, and x denote read, write, and don’t-care operations, respectively. An x don’t-care can be either a read or a write operation. Let lowercase letters represent accesses of nonsafe threads and uppercase letters accesses of safe threads. r+ denotes a sequence of at least one read and r* indicates zero or more reads. The operators + and * are equally defined for writes and don’t-cares. There are only three ways in which a sequence of operations from a single thread can interact with a single variable: by reading it only (r+), by setting its value regardless of its prior value (w x*), and by setting its value based upon its prior value (r+ w x*). For the r+ w x* sequence, we assume that w is dependent upon the value retrieved by r.

Definition 1. A race condition represents any one of all possible execution interleavings of a set of threads T = {T1 . . . TN} where at least one of the threads in T is nonsafe and at least one is safe, such that the final computation state after all threads have executed does not correspond to any case when all safe threads in T have executed in isolation.

To understand how the safe and nonsafe threads can interact, we exhaustively explore all interleavings where the nonsafe thread T2 executes between operations in the safe thread T1. Table 1 tabulates all possible interactions between a safe thread T1 and a nonsafe thread T2. The safe thread is improperly intercepted by T2 at a position that slices the operations of T1 into two parts T1' and T1''. The table evaluates the outcomes of this interaction exhaustively. We derive the following classification theorem from Table 1.

Theorem 1. Race condition cases. A race between two threads occurs due to one of the following conditions:

1. XwR = X+ w x* R+ X*. This case specifies that any sequence of operations by T2 that starts with a write and occurs after one or more arbitrary operations but before a read in T1 causes a race.

2. WrW = R* W X* r+ R* W X*. This case specifies that any sequence of reads by T2, when placed in between two writes by T1, results in a race.

3. RwW = R+ X* w x* W X*. When T1 starts with a read followed by an arbitrary sequence of operations, and T2 executes any sequence of operations that starts

RATANAWORABHAN ET AL.: EFFICIENT RUNTIME DETECTION AND TOLERATION OF ASYMMETRIC RACES 549

Fig. 2. User-defined synchronization.

Fig. 1. An asymmetric race.

Page 3: 548 IEEE TRANSACTIONS ON COMPUTERS, VOL. 61, NO. 4, APRIL ...paruj/papers/ieee-tc12.pdf · only detects asymmetric races but also tolerates them. ToleRace allows programs to continue

with a write just before T1 writes back to this variable, a race will occur.

4. XrwX = X+ r+ w x* X+. This case specifies that any sequence by T2 of one or more reads followed by a write based upon the value read causes a race when interleaved between any two T1 operations.

Without affecting the generality of the theorem, in all sequences we assume that the last operation in T1, which completes the race condition, is the last operation in the critical section.

Proof. Direct result of combining cases from Table 1. □

There is previous work [18], [26] that also proposes enumeration of possible interleavings. However, it does not focus on race toleration as we do. Section 3.1 describes how we employ the classification from Table 1 for this purpose.

Theorem 2. Reduction of race conditions. Any race condition among K > 2 threads can always be reduced to one of the cases I-IV of a race between two threads.

Proof. Consider a single safe thread among K interacting threads. The K-1 nonsafe threads impart intervening sequences of operations r+, w x*, or r+ w x* on the safe thread. When these three sequences interleave, the resulting sequence still belongs to one of the three sequences. As far as the safe thread is concerned, no matter how many nonsafe threads interact with it, it only observes the resulting intervening sequence. If such a sequence is one of the three sequences mentioned, it is as if it interacted with just a single nonsafe thread, and the resulting race instances can be classified by Table 1.

Now, consider multiple safe threads among the K interacting threads. Because safe threads, by definition, hold consistent locking for a given shared variable, only one can be in the critical section accessing this variable at a given time. This brings us back to the first case we just considered and completes the proof. □

3 THE TOLERACE ORACLE

Having characterized asymmetric races, we now present our theoretical framework, the ToleRace Oracle, and describe how it handles all the race cases specified in Theorem 1.

The core of our approach to managing races is to replicate the protected shared state so that the thread that acquires a lock on the shared state has an exclusive copy (see Fig. 3). This thread continues reading from and/or writing to this copy until it releases the lock. When the lock

is released, the ToleRace runtime can employ a variety of software and/or hardware mechanisms to determine which race, if any, has occurred. Possible outcomes range from tolerating the race completely, to reporting that a race has occurred, to executing a programmer-specific handler when an intolerable race is detected.

Next, we study the effect of ToleRace on the cases described in Theorem 1, assuming an oracle determines which race has occurred.

Initialization and finalization. We assume that the binding of locks (xV) to shared variables (V) is known before the critical section in T1 is entered and that storage for two additional copies (V', V'') of variable V has been allocated. After the lock is released, the storage for the two copies is deallocated.

Lock (entry). When lock xV is acquired by T1, we copy V to V' and V'' (V'' = V' = V) atomically.

Reads and writes inside the critical section. ToleRace alters all instructions in the critical section of T1 to use V' instead of V. Thus, V' is the local copy of V for T1 that cannot be accessed by other threads due to a race. All other threads such as T2 are unchanged and continue using V for all accesses. Copy V'' is not accessed by any thread until T1 exits the critical section.

Unlock (exit). When T1 exits the critical section by releasing the acquired lock, ToleRace analyzes the content of V', the original value V'', and the value V that could have been altered by other threads as a consequence of a race. Depending on the relationship of the values in {V, V', V''} and knowledge about the specific case in Theorem 1 that


TABLE 1
Tabulating Classes of Race Instances

Column marked “race” denotes if the schedule T1' T2 T1'' results in a race.

Fig. 3. ToleRace uses two additional copies of a variable to tolerate races.


has occurred, ToleRace deploys a resolution function V = f(V, V', V'') that defines the value of V after T1 finishes its critical section. The resolution function is executed atomically in the oracle ToleRace.

3.1 Tolerating and Detecting Races with the Oracle

Combining the mechanism outlined above with the exhaustive interleavings enumerated in Table 1, we can reason about which cases ToleRace will tolerate. Assuming perfect knowledge of the specific race case that has occurred, Table 2 summarizes the definition of f and indicates the cases that ToleRace correctly tolerates.

Because ToleRace can tolerate only some races of type IV, in Table 2 we subdivide this case into three subcases:

IVA: RrwR = R+ r+ w x* R+;
IVB: WrwX = W X* r+ w x* X+; and
IVC: RrwW = XrwX - {RrwR ∪ WrwX}.

The first column in Table 2 lists the race type based upon the classification from Theorem 1, the second column specifies whether V is equal to V'' at the point when f is called, the third column shows a resolution function f that allows ToleRace to tolerate the race, the fourth column indicates whether f provably succeeds in tolerating the race, and the fifth column presents the schedule of threads that ToleRace’s result represents. Table 2 shows that the ToleRace oracle tolerates all races with the resolution function f defined by Table 2 except sequences of the form RrwW.

For races of type RrwW, the interleaving of reads and writes from T2 breaks the program’s sequential memory consistency. Here, T1 and the interleaved part of T2 both read the value of the shared variable once T1 has entered the critical section, execute in parallel, and then join at the exit of the critical section of T1. T1 and T2 see the same value returned by the read, which would not be possible if T1 had executed its critical section in isolation.

When the oracle ToleRace is used as a pure race detector, i.e., when the resolution function is turned off, we can reason about the situations in which it may produce false positives or false negatives. The oracle ToleRace inherently generates no false positives. When V ≠ V'', an asymmetric race has occurred by definition. However, it produces a false negative when:

1. The last write in the intervening sequence writes the same value as the value in V''. This is the so-called ABA problem, i.e., the intervening sequence writes B and then A after the safe thread reads A. From the viewpoint of ToleRace, the clean copy appears to be untouched and ToleRace does not report a race.

Surprisingly, although ToleRace does not detect this case, it tolerates it by scheduling the operations from the intervening sequence to have come before those of the safe thread. ABA will now appear as BAA.

2. There is a WrW race. ToleRace cannot detect this race case, but it again tolerates it.

3.2 Multiple Variables and Nested Critical Sections

So far, we have considered the oracle ToleRace in a multithreaded, single-variable, nonnested critical-section context. We now extend this framework to handle general cases, which may involve multiple variables and nested critical sections. Making local copies and executing the resolution function need to be done atomically for multiple variables. Nested critical sections share their local copies with the outer critical sections. However, they have their own resolution function to resolve races for their protected variables. When dealing with these general cases, the race toleration mechanism employed in ToleRace may lead to inconsistent execution. If this happens, ToleRace prevents the shared variable reordering by acting as a race detector only.

Theorem 3. Inconsistent execution. In the general case of tolerating asymmetric races involving multiple variables and nested critical sections, ToleRace may reorder operations of a nonsafe thread such that the operations do not follow the original program order. If there are data dependencies among the operations that must be observed, ToleRace disallows such reorderings and reverts to detection mode.

Here we outline the proof of Theorem 3. Note that the dependencies in Theorem 3 refer to data dependencies, which occur when a write to a given variable depends on a read of another variable.

We consider cases I through IVB from Table 2 where ToleRace tolerates races without a custom resolution function. ToleRace can schedule operations from the nonsafe thread to have come before or after the critical section. Any intervening sequence r+ always appears to have come before the critical section (race type II) whereas the sequence w x* always appears after (race types I and III). For the r+ w x* sequence, the schedule depends on the race type (after for IVA and before for IVB).

Consider an asymmetric race involving two variables P and Q. Let a nonnested critical section protect both variables in a safe thread. In a nonsafe thread, let an intervening sequence to P come before an intervening sequence to Q in program order, but the two can overlap each other. Table 3


TABLE 2
Tabulating the Outcome of f for Each Race Type

TABLE 3
Possible Intervening Sequences to P and Q

Trailing x* and r+ of the P sequence may overlap with the Q sequence.


enumerates all possible P and Q intervening combinations from the nonsafe thread. The first two columns show the nine possible combinations. The third column indicates whether ToleRace reorders the intervening operations to P and Q. This follows directly from the resolution function in Table 2. For example, in the second row of Table 3, there is a reordering to make it appear that the Q intervening sequence comes before the P intervening sequence since r+ sequences will always be scheduled to appear before any critical section operations (case II in Table 2) whereas the reverse is true for w x* sequences (cases I and III in Table 2). The fourth column specifies whether there is a dependency from P to Q. In general, when there is a write to Q and the accesses to P may contain a read, then Q may be dependent on P, and, hence, the operations must observe program order. The fifth column shows the ToleRace action for each combination, which can be deduced directly from the results in columns 3 and 4. ToleRace reverts to detection mode when it determines that there may be a dependency among the variables and the resolution function allows out-of-order execution.

The oracle ToleRace we have described represents a theoretical framework that cannot be fully realized in practice. The next three sections describe software implementations that approximate it. Although the framework permits both software and hardware implementations, a software approach may be more appealing as it can be deployed immediately. Section 4 describes an initial implementation that is restricted and suboptimal. It serves as a baseline for other implementations to benchmark against. Section 5 presents an improved version that addresses the shortfalls in the initial version. It approximates what would likely be deployed in practice. Section 6 investigates the idealized version of software ToleRace. It assumes an oracle compiler and the availability of the program’s source code.

4 SOFTWARE TOLERACE: A FIRST VERSION

This section discusses the initial version of software ToleRace that is nonoptimal and possesses some inherent restrictions. This first version makes all decisions at runtime and does not perform any static program analysis. It allows us to gauge an upper bound on the software ToleRace overhead. In the next section, we will present an improved implementation that incorporates an additional static analysis phase to generate hints for the runtime, allowing it to make better decisions. This improved version has a lower overhead and eliminates all the restrictions of the first version.

We implement ToleRace on top of Pin [20] running on x86 Linux systems. Our parallel applications are written in C/C++ and use the pthreads library for synchronization operations. However, we believe the framework described here generalizes to other platforms and threading libraries. In the rest of this paper, we apply software ToleRace to critical sections in the user code whereas critical sections in the library code receive no ToleRace protection. We assume that we can readily distinguish the two code regions. For example, in an x86/Linux executable compiled to use shared libraries, all routines in the .text section are considered user code (see some exceptions in Section 5.3.2). Library code is not present at load time and is discovered only at runtime via the procedure linkage table in the .plt section.

4.1 The General Pin-ToleRace Framework

As the oracle ToleRace has complete knowledge of all the shared variables protected by a critical section, it can create the local copies as soon as the critical section is entered. Of course, such oracle knowledge may not be available in practice due to dynamically allocated shared variables. Hence, our Pin-ToleRace implementation assumes no such knowledge and the shared variables associated with a particular critical section are always determined on the fly. Pin-ToleRace works directly on the executable; no source code is required. The notion of shared variables, thus, is redefined to that of shared memory locations. We conservatively assume that all memory accesses in a critical section touch shared memory locations except for those touching the thread-local stack. We use the term safe memory to refer to the region of memory that holds the local copies of the shared memory data.

The safe memory is initially empty. Once a running thread is detected to have entered a critical section, each executed instruction with a memory operand touching a shared location is instrumented; no instructions outside of critical sections are instrumented. The instrumented code is generally referred to as the analysis routines. It searches the safe memory region for a local copy of the shared memory that is being accessed. If found, the memory access is redirected to this copy. If not found, the analysis routine creates a new node in the safe memory. The node records the address, the original value, and the current value of the shared memory location together with other metadata that we describe later. It serves as a local copy of this shared location that all subsequent accesses in this critical section will consult. When exiting from the critical section, Pin-ToleRace traverses the nodes in the safe memory region and compares the saved original value with the value in the corresponding true memory location. After taking the appropriate action to tolerate or detect a race, if any, it frees the nodes.

For this first version of Pin-ToleRace, we assume that code segments touched while executing in a critical section can be reached from outside of critical sections only after they have already been instrumented inside of the critical section. We will revisit this restriction in Section 5 when we introduce the improved version of Pin-ToleRace. For now, it suffices to say that the presence of Pin's code cache in its dynamic translator engine necessitates this restriction.

4.2 Implementation Details

This section describes the implementation of Pin-ToleRace, whose framework is shown in Fig. 4.

4.2.1 The Safe Memory Region

The safe memory contains three main data structures: a thread ID (tid)-lock mapping table, a safemem header, and a list of safemem nodes. The first two structures are required to handle condition variables and nested/overlapped critical sections. If the program has neither, i.e., it contains only nonnested critical sections, only the safemem list is necessary. In a safemem node, the fields origvalue, origaccesstype, currentvalue, and write_aft_orig_accs are used by the resolution function to tolerate races. The lockvar field indicates the lock variable



protecting a given memory location. It is used in conjunction with locklist in the safemem header to correctly resolve races in nested/overlapped critical sections. cond_wait_threadlist and sharedsafemem track the number of outstanding threads waiting to be signaled. The tid-lock table associates the ID of a thread executing inside of a critical section with the outer lock variable. When multiple threads can be inside of a critical section at the same time, the safemem structures are shared as shown in Fig. 4. The role of each of these structures and their associated fields is explained next.

4.2.2 Identifying Critical Sections

A critical section is defined by a mutex variable and a pair of pthread_mutex_lock and pthread_mutex_unlock calls with the mutex variable as their argument. Pin-ToleRace instruments lock/unlock calls dynamically. When a lock routine is executed, it adds a call to the CSEnter analysis routine. The analysis routine increments the CSLevel counter and sets the respective entry in the tid-lock table by updating it with the thread ID and the lock variable argument passed to it. The CSLevel counter is a per-thread counter that keeps track of the critical section nesting level. When an unlock call is encountered, a call to the CSExit routine is added, which decrements the CSLevel counter. A thread is executing inside a critical section if its CSLevel counter (CSLevel[tid]) is greater than or equal to one. Because Pin-ToleRace is only concerned with user code (see earlier definition), we only instrument lock/unlock calls in the selected code regions.
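The CSEnter/CSExit bookkeeping can be pictured with a few lines of C. This is an illustrative sketch only; the array bound, names, and single-lock-per-thread simplification are assumptions, not the paper's data layout.

```c
#include <stddef.h>

#define MAX_THREADS 64   /* assumed bound for illustration */

/* Hypothetical per-thread nesting counters and tid-lock table entries. */
static int   CSLevel[MAX_THREADS];
static void *tid_lock[MAX_THREADS];   /* outer lock variable per thread */

/* Analysis routine added after a lock call (CSEnter). */
static void cs_enter(int tid, void *lockvar)
{
    if (CSLevel[tid]++ == 0)          /* record only the outermost lock */
        tid_lock[tid] = lockvar;
}

/* Analysis routine added before an unlock call (CSExit). */
static void cs_exit(int tid)
{
    if (--CSLevel[tid] == 0)
        tid_lock[tid] = NULL;         /* left the outermost critical section */
}

/* Instrumentation of shared accesses is enabled only while this holds. */
static int in_critical_section(int tid) { return CSLevel[tid] > 0; }
```

Nesting simply raises the counter; the thread is considered inside a critical section whenever the counter is positive.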

4.2.3 Instrumenting Accesses to Shared Memory

When an instruction is executed, Pin-ToleRace determines which thread it belongs to with the PIN_ThreadId() function. Then, it checks the value of CSLevel[tid] and whether the instruction is accessing a shared memory location. Instrumentation is enabled only when CSLevel[tid] is greater than zero. We ignore operands that access the local stack; all other locations are presumed to be shared, which includes all truly shared locations as well as some false locations such as private heap variables. Pin-ToleRace cannot determine whether a particular heap location is shared and, therefore, conservatively assumes all heap locations to be shared.

Once we decide that an instruction accesses a shared location, we rewrite its memory operand. The operand is converted from its current addressing mode to the base-register addressing mode using one of Pin's scratch registers. We instrument this instruction and pass the effective address of the memory operand to the analysis routine. The analysis routine determines which thread is executing it and searches the corresponding safemem linked list using the effective address as the search key. If a match is found, the routine returns the address of the currentvalue field of the matching node. This address is written into the scratch register that is used as the base address register for the rewritten operand. If no match is found, the analysis routine creates a new node and updates the origvalue and currentvalue fields with the true memory value obtained by dereferencing the effective address. (This performs the V'' = V' = V operation.) It then returns the address of the currentvalue field as in the found case. Although the instrumentation routine is a callback routine that is called by multiple threads, it does not create a race as it is serialized under Pin. Any thread can instrument code as long as it is executing in a critical section, and the same instrumented code will apply to all other threads.

4.2.4 Critical Section Exit

Before the call to the unlock routine at the critical section exit, we insert a call to an analysis routine that executes the resolution function. The associated lock variable is passed to this routine to handle nested critical sections. At this point, we resolve all race conditions for the shared memory locations accessed within the critical section according to Table 2. Section 4.3 provides more detail. After the race condition resolution, the safemem nodes are freed, provided that the current critical section is not nested and that there are no outstanding waits on condition variables (cf. Sections 4.2.5 and 4.2.7).

4.2.5 Nested and Overlapped Critical Sections

The main component of the safe memory data structure that handles nested and overlapped critical sections is the locklist in the safemem header. The locklist is maintained such that the head of the list always points to the most recent lock variable associated with the innermost critical section. This approach correctly associates shared memory accesses with the most recently acquired lock variable. Note that the inner mutex lock variable itself cannot be part of the protected shared variables under the outer mutex. If it could, the safe thread might be left spinning forever on a local copy of the inner lock variable that no other thread can reset (i.e., unlock), thus leading to deadlock.

A critical section that executes inside another critical section never creates a new safemem list; it shares this structure with the outer critical section(s). If this were not so, the inner critical section could access stale memory values as the most up-to-date values may be in another safe memory region.
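The locklist discipline, where the head always names the innermost lock, can be sketched as a simple stack of lock variables. This is a hypothetical illustration of the bookkeeping, not the paper's code; the function names are invented.

```c
#include <stdlib.h>

/* Hypothetical locklist: the head always points at the innermost (most
 * recently acquired) lock, so new safemem nodes are tagged with that lock. */
typedef struct lock_entry {
    void *lockvar;
    struct lock_entry *next;
} lock_entry;

static lock_entry *locklist = NULL;

static void locklist_push(void *lockvar)   /* on entering a (nested) CS */
{
    lock_entry *e = malloc(sizeof *e);
    e->lockvar = lockvar;
    e->next = locklist;
    locklist = e;
}

static void *locklist_innermost(void)      /* lock to tag new nodes with */
{
    return locklist ? locklist->lockvar : NULL;
}

static void locklist_pop(void)             /* on exiting the innermost CS */
{
    lock_entry *e = locklist;
    if (e) { locklist = e->next; free(e); }
}
```

A shared-memory access made while the inner critical section is active is attributed to the lock at the head of this stack, which is exactly the "most recently acquired" association the text describes.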

Upon critical section exit, the resolution function selectively resolves races for the shared memory locations that are associated with the current lock variable. Recall from the previous section that the lock mutex variable is passed


Fig. 4. Pin-ToleRace framework.


to the analysis routine. We traverse all safemem nodes, check for a matching lockvar value, resolve races for that particular node, and delete that node from the safemem list. The corresponding node in the locklist is also deleted. At this point, the shared memory associated with the matching lockvar becomes globally visible. If the locklist becomes empty, the safemem header is freed and the respective entry in the tid-lock table is reclaimed.

One subtlety with Pin-ToleRace involves a (nonnested) critical section that calls a function that is also called from outside any critical section. This creates a situation where the noncritical code in the called function is executed under a nonnested critical section whereas the code inside the critical sections receives an extra nesting level. A problem arises once the function's code is no longer executed under any critical section, as it may contain accesses to false locations whose addresses were redirected by the code instrumentation. Since there is no resolution routine, the content of the safe memory is never transferred to the true memory locations, which will likely crash the program. Our solution to this problem is to put a guard on the analysis code that only allows it to perform the safe memory access when the CSLevel is greater than zero. Thus, when the function is executed outside a critical section, it will access the original memory locations.

4.2.6 Routine Calls inside a Critical Section

Function calls inside a critical section are handled correctly with the already described data structures of the safe memory. If a call passes a shared memory value on the stack, this shared value is correctly obtained from the safe memory region. Likewise, if the called function accesses shared memory locations, its accesses are redirected to the safe memory. As we want to protect only user routines, Pin-ToleRace must distinguish them from library routines. Note that Pin itself instruments every instruction dynamically and has no knowledge of whether an instruction comes from a user or library routine. Shared memory accesses in user code need redirection to the safe memory whereas those in library code do not. Nevertheless, we cannot simply exclude accesses to the safe memory from libraries because a call to a library routine can pass pointers to shared variables as arguments. To handle this case, we allow the library code to access the existing nodes in the safemem list but disallow the addition of new nodes to the list.

4.2.7 Handling Condition Variables

In addition to lock and mutex variables that synchronize threads by controlling access to data, the pthreads library also supports the use of condition variables to synchronize threads based on a data value. A call to pthread_cond_wait with a condition variable and a mutex variable as arguments atomically unlocks the mutex variable and makes the thread wait on the condition variable. A call to pthread_cond_signal with the corresponding condition variable wakes up one of the waiting threads. These two calls are instrumented with an analysis routine that increments and decrements, respectively, the global wait counter. Our current implementation does not support waits on more than a single mutex variable.

Condition variables complicate ToleRace because they allow multiple threads to be in a critical section at the same time. When a new thread enters a critical section while some other threads are waiting, this new thread cannot simply create its own copy of the safe memory. Instead, it must share this copy with the waiting threads. Hence, whenever a thread enters the critical section and there is an outstanding conditional wait as indicated by the wait counter, Pin-ToleRace searches the tid-lock table for the lock variable, uses the safemem header associated with this lock variable, and increments the sharedsafemem field in the safemem header. When the thread updates or creates a node in the safemem list, it puts its tid on the node's cond_wait_threadlist. When it exits the critical section, it checks whether it is the last thread to exit and, if so, follows the normal exit procedure and frees the safemem list. Otherwise, it resolves races only on the locations it touched. If it was the only thread accessing this node, it deletes the node from the list. If the node has been accessed by multiple threads, the thread resolves any races for the node but leaves the node in the list and only deletes its tid from the node's cond_wait_threadlist. If the thread needs to copy the value to the true memory, it must also update the origvalue field with the currentvalue. This ensures that when the remaining threads sharing this node resolve race conditions, they will not signal a false race.
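The per-node sharing and release protocol can be sketched as follows. This is a simplified model, not the paper's code: the fixed sharer bound and function names are assumptions, and the race-resolution action is reduced to the write-back plus origvalue refresh described above.

```c
/* Hypothetical sketch of a safemem node shared under condition variables. */
#define MAX_SHARERS 8   /* assumed bound for illustration */

typedef struct shared_node {
    long origvalue;               /* baseline used to detect races          */
    long currentvalue;            /* shared local copy                      */
    int  tids[MAX_SHARERS];       /* cond_wait_threadlist: sharers' tids    */
    int  ntids;
} shared_node;

/* Record that 'tid' touched this node (idempotent). */
static void node_touch(shared_node *n, int tid)
{
    for (int i = 0; i < n->ntids; i++)
        if (n->tids[i] == tid) return;
    n->tids[n->ntids++] = tid;
}

/* A thread leaving the critical section removes its tid, writes back the
 * local copy, and refreshes origvalue so the remaining sharers do not
 * signal a false race. Returns 1 when the last sharer left (node freeable). */
static int node_release(shared_node *n, int tid, long *true_mem)
{
    for (int i = 0; i < n->ntids; i++)
        if (n->tids[i] == tid) { n->tids[i] = n->tids[--n->ntids]; break; }
    *true_mem = n->currentvalue;      /* propagate the local copy       */
    n->origvalue = n->currentvalue;   /* new baseline for the remainder */
    return n->ntids == 0;
}
```

The origvalue refresh is the key step: without it, a sharer exiting later would compare true memory against the stale baseline and misreport a race caused by its own peer's legitimate write-back.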

4.3 Tolerating and Detecting Races with Pin-ToleRace

When Pin-ToleRace performs the resolution function, it knows the type of the first access to a shared location as this information is recorded in the origaccesstype field when the node is created. It also knows whether subsequent accesses to this location included a write (write_aft_orig_accs field). Therefore, Pin-ToleRace can determine the types of accesses that are involved in a race to this shared location. When it compares V with V'' and finds that V != V'', the nonsafe interleaving thread must contain a write. However, it cannot distinguish between the two write sequences, wx* and r+wx*. In some environments, the write sequence may be known, which enables Pin-ToleRace to tolerate all races that the oracle ToleRace can tolerate (see Table 2). In general, however, Pin-ToleRace must conservatively assume the worst-case interleaving, i.e., r+wx*, which prevents it from tolerating type III races. Aside from this restriction, it tolerates the same race types as the oracle.

As a race detector, Pin-ToleRace has the same properties as the oracle (cf. Section 3.1) except that it introduces an additional false negative due to its nonatomic execution of the resolution function. This happens when, immediately after the comparison of V and V'' returns equal, the intervening sequence writes to V. Given that the intervention must happen precisely at that moment, the probability of this occurring should be low. Pin-ToleRace does tolerate races in this situation. To see this, let us revisit Table 2. It is sufficient to consider only race case IV as Pin-ToleRace assumes r+wx* for all intervening write sequences. In the absence of a race, when the safe thread operations contain only reads, Pin-ToleRace never writes the local copy back; when the operations start with a write, it always writes back the local copy. This effectively enforces schedules T1T2 and T2T1 and thus tolerates race types IVA and IVB, respectively, if they occurred. Only race type IVC remains problematic. When dealing with intolerable races, Pin-ToleRace reports the race and halts program execution.
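The exit-time decision can be summarized as a small pure function. Since Table 2 is not reproduced in this excerpt, the mapping below is an illustrative approximation built only from the rules stated in the text (reads-only: discard; first access a write: write back; read-then-write with an intervening write: intolerable); the enum and parameter names are invented.

```c
/* Hedged sketch of the resolution decision under the conservative r+wx*
 * assumption. Names are illustrative, not taken from the paper. */
typedef enum { NO_RACE_WRITEBACK, NO_RACE_DISCARD,
               TOLERATED, INTOLERABLE } resolution;

typedef enum { FIRST_READ, FIRST_WRITE } first_access;

static resolution resolve(long V, long Vorig, first_access first,
                          int wrote_after_first /* write_aft_orig_accs */)
{
    int raced = (V != Vorig);      /* intervening sequence contained a write */
    if (!raced)
        return (first == FIRST_WRITE || wrote_after_first)
               ? NO_RACE_WRITEBACK     /* propagate local copy V' to V */
               : NO_RACE_DISCARD;      /* reads only: nothing to write back */
    if (first == FIRST_READ && !wrote_after_first)
        return TOLERATED;              /* safe thread only read: keep V (T1T2) */
    if (first == FIRST_WRITE)
        return TOLERATED;              /* write back V': nonsafe scheduled first (T2T1) */
    return INTOLERABLE;                /* read-then-write, e.g., the IVC-like case */
}
```

A real resolver would additionally report the race (and, for intolerable cases, halt the program) rather than just return a verdict.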



4.4 Evaluation

4.4.1 Benchmarks

We use 13 applications from the SPLASH2 [27] and PARSEC [6] benchmark suites to evaluate Pin-ToleRace. We also developed three microbenchmarks to stress-test a safe thread's race toleration in the presence of nonsafe threads.

The microbenchmarks are called scalar, static array, and dynamic array. The eight programs from the SPLASH2 suite were chosen per the minimum set recommended by the suite's guidelines. For each of the eight programs, the default inputs were used. However, we increased some of the input sizes to lengthen the program runtimes. We selected the five programs from the PARSEC suite that use the pthreads library. They are run with the simlarge inputs.

4.4.2 System and Compiler

All benchmarks, including the microbenchmarks, are compiled and run on an Intel 32-bit system (IA-32) with a four-core 2.8 GHz Pentium 4 Xeon CPU with a 4-way associative 16 kB L1 data cache per core, a 2 MB unified L2 cache, and 2 GB of main memory. The operating system is Red Hat Enterprise Linux Release 4 and the compiler is gcc version 3.4.6. We compiled the SPLASH2 and PARSEC programs per each suite's guidelines with the -O2 and -O3 optimization levels, respectively. The microbenchmarks use the -O3 optimization level.

4.4.3 Stress Test

The stress tests demonstrate Pin-ToleRace's ability to tolerate races of the form RwW. In this test, the safe thread performs read-increment-write operations on some shared locations while the nonsafe threads write random values to these locations.

In the program scalar, the safe thread increments a single shared location from zero to a given number of iterations. The entire incrementing loop resides in a single critical section. At the same time, several nonsafe threads set this memory location to their thread ID and then read the value back to compute its square. The programs static array and dynamic array perform the same function. However, instead of a single shared location, the safe thread increments all elements in a static array of size 10 and all elements in a 5x5 2D dynamic array allocated on the heap, respectively. The nonsafe threads write their IDs to all of these shared locations.

For these tests, we know that the nonsafe threads will cause races that always begin with a write to a shared location. By monitoring all shared accesses to the safe memory region, Pin-ToleRace determines that the safe thread reads and then writes to the shared locations. Once it identifies this RwW type race, it can tolerate it by scheduling the nonsafe thread's action to have happened after the safe thread's read-increment-write operations. Our test setup uses five nonsafe threads and runs the three programs with 5M, 7.5M, and 10M iterations. In each experiment, we observe the correct values in all shared locations just before the critical section exit. We also see that after exiting from the critical section, the values of these locations change to the thread ID of the nonsafe thread that ran last.

Fig. 5 reports the overhead of Pin-ToleRace for tolerating these RwW races. It is normalized to the runtime of the three programs under Pin with no instrumentation. We find that the overhead is largely constant with respect to the number of iterations. Note that the native and Pin runs of all three programs suffer from race conditions while the Pin-ToleRace runs have all their races correctly tolerated.

For all three microbenchmarks, the overhead of Pin-ToleRace over native is very high, up to 80 times in the dynamic array case. The primary reason for this high overhead is that we are riding on the Pin overhead. If we measure the overhead of Pin-ToleRace over Pin, the dynamic array benchmark incurs an overhead of about 4.5 times. While this is substantial, it should be noted that the microbenchmarks almost always execute in a critical section, which is where all the Pin-ToleRace code resides. Moreover, because the safemem nodes are organized as a linked list, the linear search operation in the presence of many shared locations contributes greatly to the overhead. For example, going from scalar to static array more than doubles the overhead. In other words, these microbenchmarks reflect worst-case scenarios as they are always busy tolerating races inside a critical section. The next section shows that real applications have critical section characteristics that lead to a much lower Pin-ToleRace overhead.

4.4.4 Benchmark Applications

This section characterizes the critical sections of the 13 benchmarks and discusses the overhead of Pin-ToleRace on these programs.

Critical section characterization. For this study, we compiled the 13 benchmarks to use four processors, which corresponds to the number of cores on our evaluation platform. We then used Pin to collect the critical section statistics shown in Table 4. Note that we only study critical sections that reside in the user code, i.e., we exclude all library code.


Fig. 5. Normalized execution time of Pin-ToleRace for scalar (a), static array (b), and dynamic array (c) for different iteration counts.


The second column of Table 4 shows that the number of unique critical sections per benchmark is quite small. radiosity tops the list with 36. All but two of the programs have 16 or fewer critical sections. Only four benchmarks, radiosity, dedup, facesim, and ferret, contain nested critical sections. Note that some of these nestings are statically nonnested. For example, a call inside a nonnested critical section to a function that contains a nonnested critical section dynamically results in nesting. The last column shows the total number of executed instructions within the critical sections. The numbers in this column exclude the instructions of any library routines called from the critical sections. All programs except ferret execute less than one percent of their dynamic user instructions in critical sections. The fourth column of Table 4 shows the total number of executed critical sections. The counts range from under one hundred in fft and radix to over one million in barnes, radiosity, and fluidanimate. The average number of instructions executed in user code per critical section is given in column five. Two benchmarks, dedup and ferret, stand out. Both execute over 600 instructions per critical section. barnes follows as a distant third at 94. These three benchmarks execute loops inside their critical sections. The rest of the programs execute fewer than 30 instructions per critical section. Nevertheless, some of them have a high total dynamic instruction count inside critical sections, notably fluidanimate and radiosity, whose small critical sections are being looped over.

Next, we look at the critical sections from the point of view of Pin-ToleRace. Table 5 shows the average number of shared memory locations accessed per critical section execution by each benchmark. With the exception of ferret, this number is very uniform across the running threads, as the standard deviations indicate. Nine of the 13 benchmarks access fewer than five unique locations per critical section. With so few accesses, Pin-ToleRace's linked list structure in the safe memory should not be a performance bottleneck. However, in barnes and especially in dedup and facesim, the number of unique shared locations accessed is quite high. For these programs, the linear search through the linked list structure can add considerably to the Pin-ToleRace overhead. Overall, the number of unique shared locations accessed seems to be in proportion to the number of instructions executed per critical section.

Pin-ToleRace Performance. This section studies the overhead of Pin-ToleRace on our benchmark applications. Given the results of the previous section, we decided to investigate two implementations of the safe memory. One uses the linked list approach described earlier and the other uses a chained hash table with 128 entries. We chose this size to minimize the collisions in dedup and ferret.

Fig. 6 presents the results. The timing measurements are normalized to the native runtime. Note that this is different from the normalization we used for the stress tests. The second bar shows the pure Pin overhead without instrumentation for each program. The third and fourth bars indicate the overhead of Pin-ToleRace with the linked list and hash table implementations of the safe memory, respectively. On average, Pin-ToleRace incurs about a factor of two slowdown relative to the native runs. Much of this overhead is an artifact of using Pin; the slowdown due to Pin alone is 1.8X. If we consider the overhead of Pin-ToleRace relative to the Pin runs, it is only about 24 percent. By adding static analysis (see Section 5) or hardware


TABLE 4
Critical Section Characteristics

TABLE 5
Unique Accesses to Possibly Shared Locations per Critical Section by Each Thread

Fig. 6. Normalized execution time of Pin-ToleRace.


support, it should be possible to reduce the overhead. Note that when a program runs under Pin-ToleRace, it effectively runs with a race detector. Therefore, the results in Fig. 6 include the detection overhead. When an intolerable race is detected, Pin-ToleRace simply stops the program and reports the race.

As expected, the hash table implementation of the safe memory reduces the Pin-ToleRace overhead of barnes, dedup, and ferret. Unfortunately, it increases the overhead of all the other programs. The reason is that the chained hash table is more expensive to initialize and free than the linked list. With the hash table scheme, there is a fixed minimum number of entries to process (proportional to the table size) whereas with the linked list, there are only as many nodes as there are unique shared memory locations. Therefore, the hash table is only attractive when the execution in a critical section can amortize this overhead. Recall from the previous section that each of the three benchmarks for which the hash table implementation works better executes a relatively large number of instructions and touches many unique shared memory locations inside its critical sections. The remaining benchmarks have small critical sections, and each critical section execution does not touch many unique shared locations, making the linked list implementation better suited.
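The tradeoff above is easy to see in a sketch of the chained-hash-table variant of the safe memory: lookup and insert become O(1), but clearing must visit all buckets regardless of how few nodes exist. This is an illustration under assumed names and a simplified node, not the paper's implementation; only the 128-entry table size comes from the text.

```c
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 128   /* table size chosen in the paper to limit collisions */

typedef struct hnode {
    uintptr_t addr;
    uint64_t  currentvalue;
    struct hnode *next;
} hnode;

typedef struct { hnode *bucket[NBUCKETS]; } safemem_table;

static unsigned bucket_of(uintptr_t addr)
{
    return (unsigned)((addr >> 3) & (NBUCKETS - 1));  /* drop alignment bits */
}

static hnode *table_lookup(safemem_table *t, uintptr_t addr)
{
    for (hnode *n = t->bucket[bucket_of(addr)]; n; n = n->next)
        if (n->addr == addr) return n;
    return NULL;
}

static hnode *table_insert(safemem_table *t, uintptr_t addr, uint64_t v)
{
    hnode *n = malloc(sizeof *n);
    unsigned b = bucket_of(addr);
    n->addr = addr;
    n->currentvalue = v;
    n->next = t->bucket[b];
    t->bucket[b] = n;
    return n;
}

/* Clearing visits all NBUCKETS slots even when only a few nodes exist,
 * which is the fixed per-critical-section cost discussed in the text. */
static void table_clear(safemem_table *t)
{
    for (int b = 0; b < NBUCKETS; b++) {
        for (hnode *n = t->bucket[b]; n; ) {
            hnode *next = n->next;
            free(n);
            n = next;
        }
        t->bucket[b] = NULL;
    }
}
```

A critical section touching two locations pays the full 128-slot sweep at exit, whereas the linked list would free exactly two nodes; the table only wins when many locations amortize that sweep.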

Note that it is sufficient to measure Pin-ToleRace performance with no-race execution since the cost of executing race-free is always equal to or greater than the cost of tolerating races. With no-race execution, when there is a write access to a shared variable, Pin-ToleRace needs to write back the local copy V' to the actual shared location V. When it tolerates a race, however, sometimes no such write-back is necessary since the intervening write by an unsafe thread to V might already be legitimate to pass on.

5 IMPROVING THE INITIAL PIN-TOLERACE VERSION

This section describes how to implement a more efficient Pin-ToleRace. The improved version also eliminates the restriction mentioned in Section 4.1.

5.1 Inefficiency in Pin-ToleRace

The sources of inefficiency in the initial Pin-ToleRace can be attributed to the following.

Provision for generality. As the initial Pin-ToleRace assumes no a priori knowledge when encountering a critical section, it needs to be conservative and has to provision for the general case. Thus, the system creates the full structure of the safe memory every time a critical section is executed. However, if a critical section is nonnested and does not have any condition variables, the tid-lock table and the safemem header become unnecessary and introduce two extra levels of indirection when accessing the safemem nodes.

Malloc and free operations. As we postpone all the analysis of possibly shared memory locations until runtime, our safe memory needs to be able to grow dynamically to account for those locations that are generated on the fly. It is natural to use malloc and, hence, its corresponding free operations for this purpose. However, malloc and free are rather heavyweight calls and are not easily amortized in small critical sections. Worse yet, as these small critical sections are being looped over, the call overhead can add up significantly. Ideally, if we can bound the number of possibly shared locations, we can resort to a stack-based allocation style where the corresponding malloc and free operations are reduced to adding and subtracting a value from the stack pointer.
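The stack-based allocation idea can be sketched as a bump allocator over a fixed pool: allocation is one pointer adjustment and freeing all nodes at critical section exit is a single reset. The pool size and names below are illustrative assumptions.

```c
#include <stddef.h>

/* Hypothetical bump allocator for safemem nodes, usable when the number of
 * possibly shared locations per critical section can be bounded. */
#define POOL_BYTES 4096   /* assumed bound for illustration */

static unsigned char pool[POOL_BYTES];
static size_t pool_top = 0;

static void *pool_alloc(size_t size)
{
    size = (size + 7) & ~(size_t)7;                 /* 8-byte alignment   */
    if (pool_top + size > POOL_BYTES) return NULL;  /* bound exceeded     */
    void *p = pool + pool_top;
    pool_top += size;                               /* "add to stack ptr" */
    return p;
}

/* Frees every node at once: "subtract from the stack pointer". */
static void pool_reset(void) { pool_top = 0; }
```

Compared with malloc/free per node, this removes the per-node call overhead entirely, which matters most for the small, tightly looped critical sections described above.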

Fixed data structure for the safe memory. With the previous implementation of Pin-ToleRace, the safe memory data structure is fixed throughout the entire run of a program. This may not be optimal for an application that contains both short and long critical sections. We, therefore, want to selectively assign the right safe memory structure to each critical section.

5.2 Inherent Restriction in Pin-ToleRace

Fig. 7 shows a situation where the assumption we made for the initial Pin-ToleRace in Section 4.1 may not hold. Statements 1 through 4 may get executed inside of a critical section, i.e., when cond2 is true, or outside of a critical section, i.e., when cond2 is false. In addition, the function f() may be called from within a critical section (line 7) or from outside of one (line 2).

Pin's code cache poses some complications for the situation depicted in Fig. 7. First, let cond1 be true and cond2 be false. Statements 1 through 4 and function f() get executed outside of a critical section and their translated execution code is stored in the code cache. Then, let cond1 stay the same and cond2 become true. The four statements and f() now execute inside of the critical section. This time, however, the executed code, in particular the instructions that may access shared memory, may not get the proper operand rewriting and instrumentation. When the runtime system consults the code cache, it may find and use instances of the translation from the first execution, causing incorrect ToleRace operation as the previously translated code does not rewrite memory operands and redirect accesses to shared memory locations. In general, in the presence of a code cache, code segments that can potentially be executed both inside and outside of critical sections may cause incorrect runtime behavior in Pin-ToleRace.

Aliasing caused by indirect calls. Indirect calls inside critical sections may have their targets alias with functions that can be executed both inside and outside of critical sections. Furthermore, indirect calls outside of critical sections can also be problematic as their targets may alias with code that executes inside of critical sections. These scenarios bring back the correctness issue we have just discussed above.


Fig. 7. An example illustrating how the assumption in the first version of Pin-ToleRace may be violated.


5.3 Static Program Analysis

In this section, we discuss the static program analysis whose role is to generate and pass additional information and hints to the runtime system. This information is used to remedy both the inefficiencies and the restrictions in the first version of Pin-ToleRace. Fig. 8 shows a block diagram of the static analysis phase. The input program is first passed into a call graph construction module. This module produces a graph representation of all calls in the program; every function is a node in the graph and there is an edge from function X to function Y if X calls Y. This call graph information together with the original program is in turn fed into the second module, which traverses every critical section in the program. The output of this second module is a candidate list of instructions that potentially access shared memory locations inside critical sections. These modules and their interactions are described in detail below.

5.3.1 Assumptions about the Input Program

We assume that the program's executable contains all the user code and is available to us. The corresponding source code, however, may or may not be available. We assume that the program is compiled to use shared libraries. While the library source code is not available to us, the library's function prototypes are. That is, we are fully aware of the interface given to the user, i.e., the number and types of parameters for all library calls are known. Threading and synchronization libraries (pthreads in our case) are also part of the shared libraries.

5.3.2 Static Call Graph Construction

Below are the details on how the call graph construction module functions.

Input. The call graph module takes the program's executable as its input.

Processing. We use a two-pass algorithm. During the first pass, we traverse the program's executable and collect target addresses and, possibly, names of the user routines. We obtain this information by examining the .text section of the program. We eliminate certain routines that are not actually part of the program but get inserted per operating system requirements, for example, call_gmon_start. These target addresses become nodes of the call graph to be constructed in the next step. In addition, we also gather target addresses and, possibly, names of the shared libraries, including the pthreads library. This information is manifested in the procedure linkage table, which is contained in the .plt section of the executable. Note that we deal with x86/Linux platforms here; others may have different executable formats and conventions.

After we have collected all the necessary information in the first pass, in the second pass, we traverse the .text section to build the call graph. We walk each routine in the section one by one. For a given routine, we traverse every instruction in the routine from start to end in static program order. We search for calls to other routines. If a call is found, we check its target and create an edge from the current (calling) routine to the called routine. When examining each routine, we also gather other information required by the analysis in the next module (see output below). Note that we only deal with calls whose targets are known at compile time. We discuss the handling of indirect calls in Section 5.3.4.

After the call graph has been constructed, we generate a call chain for each routine. The call chain of a particular routine gives all the user routines that can be reached by initiating a call to that routine. The chain is generated by traversing the call graph with that routine as the starting node.

Output. After processing, we have information about each routine in the .text section, which represents a user routine. For each routine, we are able to tell:

. its call chain,

. its list of calls to shared libraries,

. its instructions that may access shared memory, and

. if it contains indirect calls.

5.3.3 Static Critical Section Traversal

The purpose of this module is to identify all instructions that may access shared memory locations and are reachable from critical sections.

Input. The module takes the original program and the output from the call graph construction module as its inputs.

Processing. At the heart of the processing stage is the critical section traversal routine. This function gets invoked when a call to the pthread_mutex_lock routine is found while we traverse the .text section of the program. The first action is to advance to the next instruction and mark the instruction as visited. It then recursively traverses the instructions in the critical section. The recursion terminates when the routine finds matching unlocks for all the locks found along a possible execution path.

When we encounter a conditional branch, we traverse the fall-through path first, check if the branch target instruction has been visited, and, if not, traverse the target path accordingly. For an unconditional branch, we need to traverse the target path only if the target instruction has not already been visited. In both cases, whenever we encounter a branch target address that is less than the current branch address, i.e., a back edge, we check if this forms a loop and whether there are instructions in the loop that potentially access shared memory locations. The loop analysis information is used to decide whether malloc/free calls can be eliminated as well as to select a suitable data structure for the safe memory (Section 5.3.4).

For a critical section that contains calls to user routines, we also need to include the candidate instructions from the called routines. We first consult the call chain of each called routine. Then, we obtain the list of candidate instructions from all routines in the call chain. The call chain and the list of candidate instructions are taken from the output of the previous module.

Output. After we traverse every critical section in the program, we produce a list of addresses of instructions that may execute inside of critical sections and access shared memory locations. We also obtain the following information about each critical section in the program:


Fig. 8. Static program analysis phase.


. its list of calls to shared libraries,

. if it contains indirect calls,

. if it may access shared memory inside loops,

. if it contains condition variables,

. if it contains overlapped critical sections,

. if it contains statically nested critical sections, and

. if it contains dynamically nested critical sections.

5.3.4 Putting It All Together

This section describes how we use the results of the static program analysis to remedy the inefficiency and restrictions in Pin-ToleRace. First, we address the inefficiency.

Addressing provisions for generality. This inefficiency is caused by uniformly implementing the full safe memory structure in every critical section. With the analysis, we can tailor the safe memory to suit a particular critical section, i.e., each critical section implements only the parts of the safe memory that are necessary for its correct operation. We need to know whether a given critical section contains condition variables, overlapped critical sections, or nested critical sections. For example, if the critical section contains none of the above, we can eliminate the tid-lock table, the safemem header, and the lockvar field. This allows us to access the safemem node directly without any extra indirection, which should improve the efficiency of the safe memory accesses.
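The contrast between the full and the reduced safe memory can be sketched as two layouts; the field names echo the paper's terminology (lockvar, safemem node), but the concrete structure below is our own illustration, not the actual Pin-ToleRace layout.

```c
#include <stddef.h>

/* Illustrative sketch only: a full safe memory carries bookkeeping needed
 * for condition variables and nested or overlapped critical sections,
 * while a simple critical section can keep just the node list and skip
 * one level of indirection. */
struct safemem_node {            /* per shared location: the copies */
    void *addr;                  /* address of the shared location */
    unsigned long original;      /* value saved at critical-section entry */
    unsigned long local;         /* thread-local working copy */
    struct safemem_node *next;
};

struct safemem_full {            /* general case: header + bookkeeping */
    void *lockvar;               /* lock guarding this critical section */
    int nesting_depth;           /* tracks nested critical sections */
    struct safemem_node *nodes;  /* nodes reached via one extra hop */
};

/* Simple critical sections point at the node list directly. */
typedef struct safemem_node *safemem_simple;
```

For a critical section that the analysis proves free of condition variables and nesting, accesses go straight through a `safemem_simple` pointer, which is the indirection saving described above.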

Eliminating malloc/free calls. Generally, if we can bound the number of shared memory locations touched when a given critical section is executed, we can use stack-style memory allocation in place of malloc and free calls. This allows us to replace the costly call overhead with simple stack pointer operations. If the analysis result for a critical section indicates that there are no accesses to shared memory locations inside loops, the number of locations touched is bounded. With stack-based allocation, we preallocate a chunk of memory for every thread when it starts. In setting the chunk size, we consider all the critical sections whose shared memory accesses can be bounded, find the maximum number of bounded accesses, and set the chunk size accordingly.

Suitable data structure for the safe memory. As previously noted, for long critical sections, we prefer a hash table structure, whereas for short critical sections, a linked-list structure is more efficient. We approximate these characteristics from the analysis result by saying that long critical sections may access shared memory inside of loops, whereas short critical sections never access shared memory inside of loops. Note that we use the same type of analysis here as we did when trying to eliminate malloc and free calls. These two optimizations, eliminating malloc/free and using an optimized safe memory structure, go hand in hand. Whenever we encounter a critical section that never loops over shared memory accesses, we eliminate the malloc/free calls, i.e., we use stack-based allocation, and choose a linked-list structure. Otherwise, we cannot avoid malloc/free completely and select a hash table structure.

We now turn to the restrictions in the first version of Pin-ToleRace. All the analysis that we have done enables us to solve the situation depicted in Fig. 7. We are able to statically identify code segments that may execute inside critical sections and access shared memory locations. The critical section traversal module performs the analysis intraprocedurally, while the call graph extends the analysis interprocedurally, enabling whole-program analysis. With the static analysis hints, the ToleRace runtime guarantees correctness even in the presence of Pin’s code cache. It instruments said code segments while they execute both inside and outside of critical sections. Note that this is in contrast to the initial Pin-ToleRace, which performs instrumentation only when the program executes inside of critical sections.

Handling indirect call aliasing. Because we have identified the code segments that may execute inside of critical sections up front, aliasing from indirect calls executed exclusively outside of critical sections is not a problem. If such aliasing occurs, the runtime correctly performs instrumentation at the moment the aliasing takes place.

What if we encounter indirect calls inside of a critical section, i.e., the critical section and the routines in its associated call chain contain indirect calls? Unfortunately, this situation cannot be solved completely with static program analysis. We simply do not know the targets of such indirect calls until runtime. Therefore, any successful solution to this problem inherently requires the help of the ToleRace runtime. One possible solution is to keep track of all (user) routines executed outside of critical sections that have been translated by the just-in-time compiler. Once an indirect call is reached while executing inside a critical section, we instrument an analysis routine to search all the routines that have been translated and, hence, reside in the code cache. If there is any aliasing, we flush the code cache so that the aliased routine is correctly instrumented.

So far, we have been concerned only with indirect call aliasing within user code. However, whenever we discover a library call that may execute inside of critical sections, we also need to worry about indirect call aliasing coming from the library code. To tackle this problem, we check if the library call passes function pointers as callback arguments. If so, we hint to the ToleRace runtime to instrument these callback functions to use the safe memory. We assume that we have complete knowledge about these callback functions (cf. Section 5.3.1) so that we can statically identify them.

5.4 Results and Discussion

Table 6 shows the characteristics of the critical sections in each benchmark application, i.e., the results from the static program analysis described in the previous section. The first column of the table gives the total number of critical sections discovered statically. This result is consistent with that given in Table 4. Apparently, certain critical sections in some applications never get executed; for example, we statically found 43 critical sections in radiosity, but only 36 of them are executed (see Table 4) with the given input.

TABLE 6
Critical Sections with Properties Given in Each Column for Each Application

The programs in the two benchmark suites we consider do not have indirect calls in critical sections or overlapped critical sections. This frees us from worrying about indirect call aliasing and allows us to get rid of the safemem header structure. Hence, the improved Pin-ToleRace version should run more efficiently with these benchmarks. Most critical sections in some kernels of the SPLASH2 suite, fft, lu, and radix, contain condition variables. They are mainly there to support barrier-style synchronization. Similarly, in the PARSEC suite, almost all critical sections in dedup and ferret have condition variables. They are there to support pipelined-style parallelism. facesim is the only benchmark with a statically nested critical section. All critical sections in fluidanimate are simple in the sense that they are nonnested, do not contain any condition variables, and do not have any direct or indirect calls.

Fig. 9 compares the overhead of the improved version of Pin-ToleRace against that of the initial version, bare Pin, and native execution. Note that the improved and the initial versions cannot be compared directly as the latter suffers from some restrictions whereas the former does not (cf. Sections 4.1 and 5.2). fluidanimate benefits most from the static analysis. Since it contains only simple critical sections, we can eliminate all the safe memory structures except the safemem nodes themselves. In addition, we can bound the shared memory locations for all the critical sections, allowing us to use stack-based allocation in place of malloc. Benchmarks such as fft, lu, radix, ocean, water-spatial, and facesim do not get significant benefit from the static hints as these programs spend very little time in critical sections.

6 IDEALIZED SOFTWARE TOLERACE

Suppose we have an oracle compiler that knows all the shared locations within a critical section. The performance overhead of a ToleRace implementation based on such a compiler presents a lower bound on what we can achieve in software. (Recall that Pin-ToleRace infers all the shared memory locations on-the-fly, thus yielding an upper bound.)

To mimic the effect of such an oracle compiler, we manually modified the source code of our benchmarks after carefully studying the critical sections and the shared variables in each of them. In a few critical sections, we could not precisely mimic the effect of the oracle compiler because of shared variables that are allocated at runtime. In these instances, we instead mimic the mechanism used in Pin-ToleRace. Moreover, in barnes and radiosity, we only modified the frequently executed critical sections that cumulatively account for 99 and 90 percent of all dynamic critical section executions, respectively. We believe that doing so should not significantly affect the overhead results.

After we incorporated ToleRace into the critical sections, we recompiled and ran these applications. Fig. 10 shows the overhead results, which are normalized to the native execution time without ToleRace. The ideal software ToleRace incurs a 6.4 percent overhead on average across our benchmarks. ferret executes inside critical sections more often than the other applications and has many runtime-allocated shared variables. Consequently, it incurs the highest overhead. dedup, which has the second highest overhead, has similar characteristics. Most of the applications, however, incur less than one percent overhead with the ideal software ToleRace.

7 RELATED WORK

Related race-detection research includes both static and dynamic approaches. Static race detection relies on program analysis and either assumes existing programming languages (e.g., Java [21]) or defines new programming language semantics that help improve the static detection of races (e.g., Cyclone [12]). Static analysis techniques face several challenges. First, because many of the techniques are based on some form of model checking [13], they are computationally expensive and issues of scalability arise. Second, the conservative and approximate nature of the analysis creates the potential for many false positives. RacerX [10] and Houdini/rcc [11] address these issues by combining traditional static analysis with heuristics and statistical ranking to identify the most probable races. One inherent drawback of static analysis for race detection is that asymmetric races can occur in contexts where the source code for the component containing the error is not available for examination.

Fig. 9. Normalized execution time for the improved version of Pin-ToleRace.

Fig. 10. Normalized execution time of ideal software ToleRace.

Eraser is a dynamic race detection system based on locksets [25]. Experience with this approach has shown that the overhead of maintaining the locksets is high and that false positives can be problematic. Subsequent approaches extend locksets with happens-before analysis [2]. Combining locksets with a happens-before scheme results in higher precision dynamic race detectors [8], [9], [23], [28]. Even with refinements, the execution overhead of these approaches is typically larger than a factor of two. Previous work focuses primarily on detecting data races rather than tolerating them. The ToleRace detection technique is distinct from the lockset and happens-before algorithms. Focusing only on asymmetric races allows ToleRace to take a transaction-like approach to race detection and toleration, which significantly reduces the overhead of dynamic race detection.

Dynamic race detection approaches have also been adopted by Intel’s Thread Checker [16] and Sun’s Thread Analyzer [15], which are commercial tools capable of locating data races in concurrent programs. Both tools suffer from a high memory footprint and runtime overhead and are, thus, primarily used for software testing.

Atomicity violation is another important class of concurrency errors. It can be addressed statically [4] or dynamically. The AVIO system [18] belongs to the latter category and enumerates erroneous access interleavings similar to our asymmetric race interleavings. However, it only looks at single load/store pairs and not sequences of accesses. Without hardware support, the overhead of AVIO is very high, which makes it suitable only for test environments. The work by Lucia et al. [19] offers to tolerate some degree of atomicity violation with implicit atomicity by grouping consecutive memory operations into atomic blocks.

Vaziri et al. [26] classify harmful interleavings into 11 categories, which is more than the six race cases (with case IV subdivided) we considered. The extra categories address high-level data races at the object granularity, which we do not consider. Their approach to race detection requires source-code annotation and targets safe language environments.

Krena et al. [17] propose two schemes to dynamically heal data races for Java programs. In one scheme, they reduce the probability of races happening by forcing threads that are about to cause racy accesses to yield. This is done at the byte-code level through yield() calls. In the other scheme, they add extra locks to some common code patterns that are likely to result in races.

Concurrent to our work, Rajamani et al. [24] propose a runtime system called Isolator that enforces isolation through page protection. The idea is to protect the pages containing shared variables (that are protected by a lock) so that accesses to them can be intercepted. Then, accesses to those variables that observe the proper locking discipline are redirected to a local copy of the corresponding page. Any improper access will be to the original page and hence raise a page protection fault. Similarly, Abadi et al. [1] use page-level protection to guarantee strong atomicity in software transactional memory.

8 CONCLUSIONS

This paper introduces ToleRace, a runtime system that uses data replication for detecting and tolerating asymmetric races. We have presented a theoretical framework as well as three software implementations, which we evaluated on 13 real parallel applications from the SPLASH2 and the PARSEC suites.

REFERENCES

[1] M. Abadi, T. Harris, and M. Mehrara, “Transactional Memory with Strong Atomicity Using Off-the-Shelf Memory Protection Hardware,” Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 185-196, 2009.

[2] S.V. Adve, M.D. Hill, B.P. Miller, and R.H.B. Netzer, “Detecting Data Races on Weak Memory Systems,” ISCA ’91: Proc. 18th Ann. Int’l Symp. Computer Architecture, pp. 234-243, 1991.

[3] S.V. Adve, V.S. Pai, P. Ranganathan, and P. Ranganathan, “Recent Advances in Memory Consistency Models for Hardware Shared-Memory Multiprocessors,” Proc. IEEE, vol. 87, no. 3, pp. 445-455, Mar. 1999.

[4] R. Agarwal, A. Sasturkar, L. Wang, and S. Stoller, “Optimized Run-Time Race Detection and Atomicity Checking Using Partial Discovered Types,” Proc. 20th IEEE/ACM Int’l Conf. Automated Software Eng., pp. 233-242, 2005.

[5] E.D. Berger and B.G. Zorn, “DieHard: Probabilistic Memory Safety for Unsafe Languages,” ACM SIGPLAN Notices, vol. 41, pp. 158-168, 2006.

[6] C. Bienia, S. Kumar, J. Singh, and K. Li, “The PARSEC Benchmark Suite: Characterization and Architectural Implications,” Technical Report TR-811-08, Princeton Univ., 2008.

[7] C. Blundell, C. Lewis, and M. Martin, “Deconstructing Transactional Semantics: The Subtleties of Atomicity,” Proc. Fourth Ann. Workshop Duplicating, Deconstructing, and Debunking, 2005.

[8] R. O’Callahan and J.-D. Choi, “Hybrid Dynamic Data Race Detection,” Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, 2003.

[9] T. Elmas, S. Qadeer, and S. Tasiran, “Goldilocks: Efficiently Computing the Happens-Before Relation Using Locksets,” Proc. Int’l Workshop Formal Approaches to Testing of Software (FATES/RV), K. Havelund, M. Núñez, G. Rosu, and B. Wolff, eds., pp. 193-208, 2006.

[10] D.R. Engler and K. Ashcraft, “RacerX: Effective, Static Detection of Race Conditions and Deadlocks,” SOSP ’03: Proc. 19th ACM Symp. Operating Systems Principles, pp. 237-252, 2003.

[11] C. Flanagan and S.N. Freund, “Detecting Race Conditions in Large Programs,” PASTE ’01: Proc. 2001 ACM SIGPLAN-SIGSOFT Workshop Program Analysis for Software Tools and Eng., pp. 90-96, 2001.

[12] D. Grossman, “Type-Safe Multithreading in Cyclone,” TLDI ’03: Proc. ACM SIGPLAN Int’l Workshop Types in Languages Design and Implementation, pp. 13-25, 2003.

[13] T.A. Henzinger, R. Jhala, and R. Majumdar, “Race Checking by Context Inference,” PLDI ’04: Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 1-13, 2004.

[14] M. Herlihy and J.E.B. Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA ’93: Proc. 20th Ann. Int’l Symp. Computer Architecture, pp. 289-300, 1993.

[15] http://developers.sun.com/sunstudio/downloads/ssx/tha/, 2011.

[16] http://www.intel.com/cd/software/products/asmo-na/eng/286406.htm, 2011.

[17] B. Krena, Z. Letko, R. Tzoref, S. Ur, and T. Vojnar, “Healing Data Races on-the-Fly,” Proc. ACM Workshop Parallel and Distributed Systems: Testing and Debugging, 2007.

[18] S. Lu, J. Tucek, F. Qin, and Y. Zhou, “AVIO: Detecting Atomicity Violations via Access Interleaving Invariants,” ASPLOS-XII: Proc. 12th Int’l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 37-48, 2006.



[19] B. Lucia, J. Devietti, K. Strauss, and L. Ceze, “Atom-Aid: Detecting and Surviving Atomicity Violations,” Proc. 35th Int’l Symp. Computer Architecture, 2008.

[20] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V.J. Reddi, and K. Hazelwood, “Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, 2005.

[21] M. Naik, A. Aiken, and J. Whaley, “Effective Static Race Detection for Java,” PLDI ’06: Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 308-319, 2006.

[22] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder, “Automatically Classifying Benign and Harmful Data Races Using Replay Analysis,” Proc. Int’l Conf. Programming Language Design and Implementation (PLDI), 2007.

[23] E. Pozniansky and A. Schuster, “Efficient on-the-Fly Data Race Detection in Multithreaded C++ Programs,” PPoPP ’03: Proc. Ninth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 179-190, 2003.

[24] S. Rajamani, G. Ramalingam, V. Ranganath, and K. Vaswani, “ISOLATOR: Dynamically Ensuring Isolation in Concurrent Programs,” Proc. Symp. Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2009.

[25] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T.E. Anderson, “Eraser: A Dynamic Data Race Detector for Multi-Threaded Programs,” Proc. Symp. Operating Systems Principles (SOSP), pp. 27-37, 1997.

[26] M. Vaziri, F. Tip, and J. Dolby, “Associating Synchronization Constraints with Data in an Object-Oriented Language,” Proc. 33rd Ann. Symp. Principles of Programming Languages, 2006.

[27] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta, “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. 22nd Int’l Symp. Computer Architecture, 1995.

[28] Y. Yu, T. Rodeheffer, and W. Chen, “RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking,” SOSP ’05: Proc. 20th ACM Symp. Operating Systems Principles, pp. 221-234, 2005.

Paruj Ratanaworabhan received the PhD degree in electrical and computer engineering from Cornell University in 2009. He is currently a lecturer at the Department of Computer Engineering, Faculty of Engineering, Kasetsart University. His research focus has been on dynamic race detection and toleration, phase-aware architectures, floating-point data compression, and value-based compiler optimization. Recently, he has been investigating issues in performance and security of JavaScript. He is a member of the IEEE.

Martin Burtscher received the combined BS/MS degree in computer science from the Swiss Federal Institute of Technology (ETH) Zurich in 1996 and the PhD degree in computer science from the University of Colorado at Boulder in 2000. He is an associate professor in the Department of Computer Science at Texas State University-San Marcos. He has coauthored more than 60 peer-reviewed publications. His research interests include parallelization of irregular programs as well as automatic performance assessment and optimization of HPC applications. He is a senior member of the IEEE, the IEEE Computer Society, and the ACM.

Darko Kirovski received the PhD degree in computer science from the University of California, Los Angeles, in 2001. Since April 2000, he has been a researcher at Microsoft Research. His research interests include error-resilient computing, anticounterfeiting, system security, electronic trading, Web services, mobile payments, multimedia processing, and embedded system design. He has received the Microsoft Graduate Research Fellowship, the ACM/IEEE Design Automation Conference Graduate Scholarship, the ACM Outstanding PhD Dissertation Award in Electronic Design Automation, and best paper awards at ACM Multimedia and IEEE MMSP. He has been serving on several boards of the IEEE Signal Processing Society. He is a coauthor of more than 100 journal and conference publications and more than 80 filed patents. He is a member of the IEEE.

Benjamin Zorn received the PhD degree in computer science from UC Berkeley in 1989, then served eight years on the Computer Science faculty at the University of Colorado in Boulder, receiving tenure and being promoted to associate professor in 1996. He left the University of Colorado in 1998 to join Microsoft Research, where he currently works. He is a principal researcher in the Research in Software Engineering (RiSE) group at Microsoft Research. His research interests include programming language design and implementation for performance, reliability, and security. He has served as an associate editor of the ACM journals Transactions on Programming Languages and Systems and Transactions on Architecture and Code Optimization, and he currently serves as a Member-at-Large of the SIGPLAN Executive Committee. He is a member of the IEEE Computer Society. For more information, visit his web page at http://research.microsoft.com/~zorn/.

Rahul Nagpal received the MS and PhD degrees in computer science from the Indian Institute of Science, Bengaluru, India. After a couple of industrial stints, he undertook a postdoctoral research fellowship at the Indian Institute of Science. His research focuses on energy-performance trade-offs and reliability issues in the context of decentralized architectures. He is a member of the IEEE.

Karthik Pattabiraman received the MS and PhD degrees in computer science from the University of Illinois, Urbana-Champaign (UIUC), in 2004 and 2009, respectively. After spending a postdoctoral year at Microsoft Research (Redmond), he joined the University of British Columbia (UBC), where he is currently an assistant professor of electrical and computer engineering. His research interests include fault-tolerant and secure computer systems, programming languages, and computer architecture. He was awarded the William C. Carter award for the best paper at the IEEE International Conference on Dependable Systems and Networks (DSN), 2008. He has served on the program committees of the International Conference on Dependable Systems and Networks (DSN) and cochaired the first and second workshops on “Compiler and Architectural Techniques for Application Reliability and Security (CATARS),” held in conjunction with DSN 2008 and 2009. He is a member of the IEEE.



