
Int J Parallel Prog (2012) 40:381–396
DOI 10.1007/s10766-011-0190-5

Software Controlled Adaptive Pre-Execution for Data Prefetching

Ákos Dudás · Sándor Juhász · Tamás Schrádi

Received: 27 March 2011 / Accepted: 18 October 2011 / Published online: 30 October 2011
© Springer Science+Business Media, LLC 2011

Abstract Data prefetching mechanisms are widely used for hiding memory latency in data intensive applications. They mask the speed gap between CPUs and their memory systems by preloading data into the CPU caches, where accessing them is at least one order of magnitude faster. Pre-execution is a combined prefetching method, which executes a slice of the original code, preloading the code and its data at the same time. Pre-execution is often mentioned in the literature, but to our knowledge it has not been formally defined yet. We fill this void by presenting the formal definition of speculative and non-speculative pre-execution, and derive a lightweight software-based strategy which accelerates the main working thread by introducing an adaptive, non-speculative pre-execution helper thread. This helper thread acts as a perfect predictor, calculates memory addresses, prefetches the data and consumes cache misses early. The adaptive automatic control allows the helper thread to configure itself at run-time for best performance. The method is directly applicable to any data intensive application without requiring hardware modifications. Our method was able to achieve an average speedup of 10–30% in a real-life application.

Keywords Pre-execution · Data prefetch · Self-configuration · Data intensive application · Performance

Á. Dudás (B) · S. Juhász · T. Schrádi
Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Magyar Tudósok körútja 2, 1117 Budapest, Hungary
e-mail: [email protected]

S. Juhász
e-mail: [email protected]

T. Schrádi
e-mail: [email protected]


1 Introduction

Data prefetch mechanisms [3] mask memory latency by preloading data from the main system memory into the CPU caches before they are actually needed by the processing algorithm. These mechanisms speed up execution by eliminating the time the CPU stalls on cache misses in the main execution path. The two key components of these methods are prediction and timing. Prefetch methods should correctly identify data elements that will be requested in the near future; and the data should be brought into the cache before they are actually needed, but not too early (so that they are not purged by the time they are accessed).
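As a minimal illustration of the timing criterion (our example, not part of the paper's method), explicit software prefetches can be issued from C++ through compiler intrinsics; DIST is a hypothetical prefetch distance:

#include <xmmintrin.h>   // _mm_prefetch
#include <cstddef>

// Sum a large array while prefetching the element DIST iterations ahead,
// so that each element is already in the cache when the loop reaches it.
double process(const double* data, std::size_t n) {
    constexpr std::size_t DIST = 64;   // hypothetical prefetch distance (elements)
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + DIST < n)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + DIST),
                         _MM_HINT_T0); // hint: bring into all cache levels
        sum += data[i];                // stands in for the real per-item work
    }
    return sum;
}

Issuing the prefetch too late leaves the stall in place; issuing it too early lets the line be evicted again, which is exactly the trade-off the paper's parameters control.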

Luk in [15] argues that irregular memory access patterns (responsible for increased memory latencies and inefficient hardware prefetch) can only be predicted through executing the code itself. Hardware prefetch mechanisms are not always effective as they rely on mere prediction. Executing the software itself, on the other hand, calculates the memory addresses correctly, but by the time it does so it is already too late, and there is no use in issuing a prefetch. The basic idea of pre-execution [4,6,11,15,16,24,26,27] (terminology also includes assisted execution, prefetching helper threads, microthreading, speculative precomputation) is to run a second copy of the algorithm itself and let it calculate the required memory addresses. Allowing more subtle control over what to load and when to load it, pre-execution is a more flexible solution than hardware prefetch mechanisms.

In this paper we introduce the formal definition of pre-execution. We define pre-execution as a secondary execution engine (e.g. a software thread or a scheduler) which issues instructions that “somehow” figure out what memory addresses the main execution engine will need, and preloads them. It does not apply history-based prediction but is allowed to make speculative decisions.

Using the formal definitions we propose a high-level, lightweight, easy-to-implement software pre-execution scheme which requires no special hardware. With runtime parameter tuning the algorithm configures itself for maximum performance and adapts to the execution environment and the memory access characteristics of the application. When applying this method in a real-life data intensive application of data coding we experienced an average speedup of 10–30%.

Undoubtedly, an appropriately and carefully designed functional or data decomposition of the task could make better use of the CPU resources and achieve a more significant speedup than pre-execution. However, such methods require much more expenditure on behalf of the programmer [8,20], and may not even be feasible in all cases (for example, if the source code of a component is not available). It may also occur that a naive parallel approach scales badly [22] and behaves worse than pre-execution. At the same time, it is suggested by Bianchini and Lim [1] that prefetching can have a significant advantage over multi-threading for applications with high cache miss rates.

Considering these facts, in this paper we focus on pre-execution and show a low-cost and simple software solution for parallelizing algorithms with a modest, but easily attainable performance gain.

The rest of the paper is organized as follows. The formal definition of pre-execution is presented in Sect. 2, followed by a novel software pre-execution scheme in Sect. 3. The proposed method is evaluated in Sect. 4 by applying it to data transformation. Sect. 5 presents the literature of pre-execution and discusses how pre-execution is approached by others. We conclude in Sect. 6 with a summary of our findings.

2 Definition of Pre-Execution

This section presents the formal definition of speculative pre-execution and non-speculative pre-execution. Although there have been numerous works published in this area (see Sect. 5), no such formal definition has been proposed, to the best of our knowledge.

Let σ be a sequence of instructions, R(σ) the set of variables the instructions in σ only read, and W(σ) the set of variables they read and write. Let Σ = {σi} be the set of these sequences of instructions.

Let Γ = γ1, …, γn be an algorithm such that γi ∈ Σ, R(γi) ≠ R(γi+1) or W(γi) ≠ W(γi+1), and R(γi) ∈ 2^R, W(γi) ∈ 2^W ∀i. That is, an algorithm is a series of sequences of instructions where adjacent sequences of instructions are distinguished by either the different variables they read or the different variables they read and write; here R = ∪i=1…n R(γi) is the set of all variables Γ only reads and W = ∪i=1…n W(γi) is the set of all variables it reads and writes. The algorithm in this definition is not described in a structural programming language, but as a series of assembly-level instructions. The total number n of the blocks of instructions in Γ should be maximal. (This decomposition of Γ is not unique.)

γi ≡ γj (γi, γj ∈ Γ) if completing γi puts the execution engine in the same state as executing γj does, with respect to the values of all variables that any future σ (within a specified scope) may access. Γ1 ≡ Γ2 if completing Γ1 puts the execution engine in the same state as executing Γ2 does.

γi and γj (γi, γj ∈ Γ) can be executed in parallel if R(γi) ∩ W(γj) = ∅, R(γj) ∩ W(γi) = ∅ and W(γi) ∩ W(γj) = ∅. This is denoted by γi ‖ γj. In other words, the execution engine must be in the same state after executing γi and γj at the same time as it would be if it executed γiγj or γjγi. This also means that γi ‖ γj ⇔ γiγj ≡ γjγi.

We can define the same for a sequence of instructions and an algorithm: σ ‖ Γ (σ ∈ Σ) if σ ‖ γ1, σ ‖ γ2, …, σ ‖ γn.
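For example, if γi computes x = a + b (so R(γi) = {a, b}, W(γi) = {x}) and γj computes y = a · c (R(γj) = {a, c}, W(γj) = {y}), then R(γi) ∩ W(γj) = ∅, R(γj) ∩ W(γi) = ∅ and W(γi) ∩ W(γj) = ∅, hence γi ‖ γj. If γj instead wrote a, the condition R(γi) ∩ W(γj) = ∅ would fail, and executing γiγj and γjγi would no longer be equivalent.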

Definition 1 Let Γ be an algorithm as described above. The main execution engine completes Γ by computing each and every γi ∈ Γ individually, in order. Speculative pre-execution is another execution engine which, at specified trigger points, schedules σj ∈ Σ for execution, where σj ‖ Γ. The scheduler of the speculative pre-execution engine, if the main thread is currently executing γi, must have already completed σi.

Speculative pre-execution (see Definition 1) is an execution engine which executes sequences of instructions in parallel with the main execution engine. The σj it executes relates to the γj the main execution engine will execute in that it preloads data for it, but there are no criteria as to how it should achieve that. The execution engines must not interfere with each other under any circumstances.


This definition is in line with the works in the literature. σj is called a slice, a speculative code fragment, by Zilles and Sohi [27], which “approximates” the original program; this is essentially the same as the p-thread of Ro and Gaudiot [23].

Definition 2 Let Γ be an algorithm as described above. The main execution engine completes Γ by computing each and every γi ∈ Γ individually, in order. Non-speculative pre-execution is another execution engine which, at specified trigger points, schedules σj ∈ Σ for execution, where σj ‖ Γ and σ1σ2…σj ≡ γ1γ2…γj. The scheduler of the non-speculative pre-execution engine, if the main thread is currently executing γi, must have already completed σi.

Non-speculative pre-execution (see Definition 2) adds the additional constraint that the pre-execution engine, although it executes different instructions from the main execution engine, must be in the same state. This restriction increases the accuracy of the prefetch by requiring the pre-execution code to correctly mimic the main thread. Obviously σj = γj is a trivial choice, but a subset of γj, or some other sequence of instructions, may achieve the same result at a lower cost.
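As a hypothetical illustration of such a cheaper σj (the names are ours, not from the paper): if γj performs a full hash table lookup, a slice that repeats only the address computation and prefetches the bucket reads the same inputs but writes no shared state:

#include <xmmintrin.h>
#include <cstddef>

struct Bucket { /* keys, values, ... */ };

// gamma_j: the main thread's full lookup (compares keys, may insert, etc.).
// sigma_j: a cheaper slice that repeats only the address computation and
// prefetches the bucket; it reads the same inputs but writes no shared
// state, so it satisfies sigma_j || Gamma.
void sigma_j(const Bucket* table, std::size_t mask, std::size_t hash) {
    const Bucket* b = table + (hash & mask);        // same address gamma_j will compute
    _mm_prefetch(reinterpret_cast<const char*>(b),  // consume the cache miss early
                 _MM_HINT_T0);
}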

3 A Novel Software Pre-Execution Scheme

Using the above formalism we present a novel pre-execution scheme. We advocate a purely software solution, which is lightweight, has low complexity, targets data intensive applications at a high level without the need to intensively profile the application, has low overhead on the execution, and applies adaptive runtime reconfiguration (detailed in Sect. 3.3).

Software pre-execution is our choice because it is fast and easy to prototype, requiring no special hardware. It is also guaranteed that a software-only solution can be evaluated through execution on physical hardware, and the performance gain can be measured in terms of wall-clock execution time.

3.1 General Scheme

The pre-execution engine is realized as a helper thread. Most solutions suggested in the literature use a simple helper thread which, when triggered, prefetches a single data item, and then terminates [5,11,15]. This thread is started by inserting explicit thread creation and management instructions into the main thread, increasing its complexity and introducing overhead. For the sake of performance, we propose that the pre-execution engine be implemented as a persistent helper thread, meaning that no new thread is created and destroyed for the computation of item σj. This single helper thread contains not only the pre-execution engine but the additional overhead of its own management as well.

The pre-execution engine needs to act in synchrony with the main thread to make sure that by the time γi is scheduled, σi has already been completed. The helper thread gains access to the actual value of i by the main thread publishing it via a shared variable.¹ The pre-execution engine periodically reads this shared variable and selects σj for execution whenever it is not performing any other calculations.

The question is how to choose j for a given i. If we do speculative pre-execution, then all we have to make sure is that j > i and t(σj) < t(γi…γj−1), where t(σ) denotes the execution time of σ. This assures that σj will be finished before γj is scheduled; but it also means that the pre-execution engine may, after σj, choose σj+d with d ∈ Z+, d > 1; that is, it may skip some σ. The simplest solution is to specify this temporal distance between the threads with a constant factor k, where k is large enough so that t(σi+k) < t(γi…γi+k−1) ∀i.

For non-speculative pre-execution we are not allowed to skip calculations, except if σi−1σiσi+1 ≡ σi−1σi+1 ∀i, meaning that the blocks are independent. If this does not hold, non-speculative pre-execution cannot be applied in this scheme.²

For the pseudo-code of the proposed pre-execution scheme please refer to Algorithm 1.

Algorithm 1 Pseudo-code of the proposed pre-execution scheme

shared volatile int idx
shared volatile bool end = false

procedure main(γ1 … γn)
  startThread(preExecutionEngine)
  for i = 0 to n do
    idx = i
    execute γi
  end for
  end = true
  return

procedure preExecutionEngine(γ1 … γn)
  while end == false do
    if notDoneAlready(min(idx + k, n)) then
      execute σmin(idx+k,n)
    end if
  end while
  return
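A minimal C++11 sketch of this scheme (our illustration, not the paper's code; executeGamma and executeSigma are hypothetical stand-ins for the application's γi and σi, and k is the constant jumpahead factor):

#include <atomic>
#include <thread>
#include <algorithm>

std::atomic<int>  idx{0};     // progress of the main thread (the shared variable)
std::atomic<bool> done{false};

// gamma_i / sigma_i are application-specific; empty stubs keep the sketch self-contained.
void executeGamma(int /*i*/) { /* the real per-item work of the main thread */ }
void executeSigma(int /*i*/) { /* prefetch slice: compute addresses, touch the data */ }

void preExecutionEngine(int n, int k) {
    int last = -1;                               // highest sigma executed so far
    while (!done.load(std::memory_order_relaxed)) {
        int j = std::min(idx.load(std::memory_order_relaxed) + k, n - 1);
        if (j > last) {                          // notDoneAlready(...)
            executeSigma(j);                     // consume the cache miss early
            last = j;                            // speculative variant: may skip some sigma
        }
    }
}

void runWithPreExecution(int n, int k) {
    std::thread helper(preExecutionEngine, n, k);
    for (int i = 0; i < n; ++i) {
        idx.store(i, std::memory_order_relaxed); // publish progress
        executeGamma(i);
    }
    done.store(true, std::memory_order_relaxed);
    helper.join();
}

Here std::atomic with relaxed ordering plays the role of the volatile shared variable of Algorithm 1 (cf. footnote 1 on the atomicity of integer writes on x86).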

3.2 “Jumpahead Distance” and “Workahead Size”

The question now is how to choose k. This number should be larger than zero so that the pre-execution engine works ahead of the main thread, but not too far ahead. If there is not enough “distance” between the threads, the main thread simply overtakes the helper thread; if the distance is too big, the data would be removed from the cache prematurely. The prefetched data also should not evict other useful data from the cache (called cache conflicts or cache pollution [14,17]).

¹ Please note that on x86 architectures writing a value to an integer is implicitly an atomic operation. This allows us to use the shared variable across threads without synchronizing its access.
² A solution would be to use more than one helper thread, or to pause the main thread if the helper thread cannot keep up the pace.

Fig. 1 Jumpahead distance and workahead size parameters

To manage all the criteria above, we propose that a safe distance always be maintained between the threads. We call this distance the jumpahead distance. The value, however, cannot be a simple constant; it must reflect the characteristics of the system.

If the resolution of the stages of the algorithm (γi) is fine-grained, then there would have to be frequent synchronization between the threads; however, we do not require the pre-execution engine to strictly keep the jumpahead distance at all times. We can significantly reduce the overhead if the progress is not monitored so tightly. We can apply a batch-like execution and intervene only after a given number of steps has been completed by the threads. The reduced (less frequent) management is beneficial in terms of performance too: periodic writes and reads to the shared progress variable can result in cache pollution and excessive pipeline flushing [26]. In order to regulate the frequency of intervention we introduce a second parameter, called the workahead size. See Fig. 1 for an illustration of the meaning of the parameters.

The name workahead size comes from Zhou et al. [26]. They describe an analogous notion they call the workahead-set: a structure used for communication between the main and helper threads. The main thread enqueues tasks into this set and the helper thread dequeues and executes them. The workahead-set, in the sense Zhou et al. use it, is a range of tasks (i.e. items in a data set), which we identify with its first item and its length. The length is what we call the workahead size.

3.3 Adaptive Control

The optimal values for these two parameters depend on a number of factors. They are affected by the memory characteristics of the application, the amount of data prefetched, the size of the caches, the speed of the CPU, the performance of the application and the scheduler of the operating system. In order to achieve maximum performance we propose a run-time, adaptive reconfiguration algorithm.

The Nelder–Mead method [19] is a commonly used nonlinear optimization technique. It approximates a local optimum of a problem with N variables. Our variables are the two parameters, jumpahead distance and workahead size, and the objective function is the performance of the application, which we measure in terms of average execution time between two samplings (see below).


The Nelder–Mead method finds a (local) optimum of f(x) by evaluating the function at specified x values. Our function, however, cannot be evaluated at any given point. We have to execute the algorithm with the proposed jumpahead distance and workahead size to measure the performance, and only then is evaluation possible.

To do the reconfiguration at run-time, the Nelder–Mead method must be transformed into an automaton, where the evaluation of f(x) is carried out by the automaton providing the next proposed x and then returning without evaluating the function. Our pre-execution algorithm then executes σi through σi+m (m is the sample interval) and calculates the performance (measured in execution time). When the results are returned to the Nelder–Mead automaton, it evaluates them and yields its next estimate. The estimation is carried out solely in the helper thread, not affecting the main thread's performance, and may be terminated once no considerable change in the parameter estimate is detected.³ If the performance of the application in one sample greatly differs from previous measurements, the measurement is repeated using the same parameters. This solely compensates for the noisy performance function.
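A sketch of what this inversion of control can look like (our naming, not the authors' code): an “ask/tell” wrapper hands out the next (jumpahead distance, workahead size) pair and later receives the measured execution time; the simplex bookkeeping of the Nelder–Mead step itself is elided here:

struct Params { int jumpAheadDist; int workAheadSize; };

// Step-wise ("ask/tell") wrapper around a Nelder-Mead simplex update:
// instead of evaluating f(x) itself, it hands out the next vertex to try
// and receives the measured objective value later, from the helper thread.
class NelderMeadAutomaton {
public:
    explicit NelderMeadAutomaton(Params start) : next_(start) {}

    Params ask() const { return next_; }   // next (JAD, WAS) pair to run with

    void tell(double elapsedSeconds) {
        // Record f(next_) = elapsedSeconds, then apply one reflection /
        // expansion / contraction step of the simplex to choose the next
        // vertex. (The simplex state and update rules are elided here.)
        (void)elapsedSeconds;
    }

private:
    Params next_;
};

// In Algorithm 2, evaluateNelderMead(t) then reduces to: tell(t); return ask();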

As for the concern about cache pollution, the beauty of the adaptive control is that it does not cause more cache conflicts. The access pattern of an algorithm may be such that it is already configured to avoid cache conflicts, and pre-execution could theoretically spoil this behavior due to the helper thread evicting data from the cache. However, the adaptive control takes care of this implicitly. If the workahead size of the algorithm is such that it causes cache pollution, the algorithm discovers the decreased performance and tunes the parameters until the worsened behavior is compensated for.

The complete pseudo-code of the proposed pre-execution algorithm with the Nelder–Mead approximation of the parameters is listed below (see Algorithm 2).

4 Evaluation

This section presents the evaluation and testing of the proposed pre-execution model in a real-life application.

4.1 Test Environment

Pre-execution was implemented for a real-life application where a set of items (billions of records) is pushed through a pipeline of transformation steps, one of which is an encoding with the use of a dictionary. This problem (data transformation) is data intensive [2,21]: the transformation is a series of simple computations and a lookup/insert in the dictionary.

The main issue with this particular application, and the reason for even considering pre-execution, is that there is a need for a single, central dictionary. The encoding must retain the uniqueness of the data in the input, thus a bijection has to be realized between the transformed values and the input. For this reason the data cannot be split into subsets with multiple threads processing the data independently. Either a single dictionary shared between multiple threads is required, using an appropriate level of mutual exclusion, or pre-execution can be used. Our focus, of course, is the latter.

³ Assuming that the performance of the algorithm remains the same.

Algorithm 2 Overview of the proposed pre-execution model

shared volatile int idx
shared volatile bool end = false
shared volatile int jumpAheadDist
shared volatile int workAheadSize

procedure main(σ1 … σn)
  startThread(preExecutionEngine)
  for i = 0 to n/workAheadSize do
    idx = i
    for j = 0 to workAheadSize do
      execute σi∗workAheadSize+j
    end for
  end for
  end = true
  return

procedure preExecutionEngine(σ1 … σn)
  int samples = 0
  time start = getTime()
  while end == false do
    for i = 0 to workAheadSize do
      if samples == m then
        {jumpAheadDist, workAheadSize} = evaluateNelderMead(getTime() − start)
        start = getTime()
        samples = 0
      end if
      if notDoneAlready(min(idx + jumpAheadDist + i, n)) then
        execute σmin(idx+jumpAheadDist+i,n)
        samples = samples + 1
      end if
    end for
  end while
  return

The dictionary is large (two orders of magnitude larger than the cache). For best performance a hash table is used to realize the dictionary. Three different hash tables were used: STL's hash_map, Google's sparsehash and a custom bucket hash table [10]. All three are implemented in C++ and are carefully optimized. The pre-execution algorithm is the same for all three hash tables; only the hash table invocation differs in the test cases.

As an experiment, the application was also implemented in .NET using the C# language and the built-in Dictionary class. .NET's managed memory system is much less controlled by the programmer than in C++, which allows us to test the performance of the pre-execution model under substantially different circumstances.

The applications are compiled with Microsoft Visual Studio 2010 (using default release build optimization settings) and are executed on two different computers, both running Microsoft Windows 7 64-bit. The first machine is a desktop computer with an Intel Core i7-2600 CPU (4 cores, 2×32 kB L1 cache and 256 kB L2 cache per core, 8 MB shared L3 cache) and 8 GB RAM. The second one is a laptop with an Intel Core 2 Duo L7100 CPU (dual core, 2×32 kB L1 cache per core and a shared 2 MB L2 cache) and 3 GB RAM. The different hardware is expected to yield different memory system performance and thus different optimal parameter configurations.

Executing two threads simultaneously comes at some cost. However, having a multi-core CPU at our disposal, we can execute multiple threads on a single CPU without significant performance penalty. As a matter of fact, one of the test machines, besides having 4 cores, also supports HyperThreading technology. This leaves us two choices: use a single physical core with HyperThreading technology to execute the main and the helper thread, or use two physical cores without HyperThreading. Both come with advantages as well as disadvantages. If two physical cores are used, the threads do not share the computational pipeline, causing no performance degradation, but they share only the last-level cache. On the other hand, if both threads are executed on the same core, they share both the L3 and the L2 caches, but the computational resources as well.
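Which setup the threads run under can be controlled by pinning them to logical processors; on the Windows 7 test systems this can be done with SetThreadAffinityMask (a sketch with hypothetical masks; the mapping of logical CPUs to hyper-threads and cores is hardware-dependent):

#include <windows.h>

// Pin the calling thread to the logical processors selected by 'mask'.
// With HyperThreading enabled, logical CPUs 0 and 1 typically map to the
// two hardware threads of physical core 0, but the mapping is
// hardware-dependent (query GetLogicalProcessorInformation to be sure).
void pinCurrentThread(DWORD_PTR mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

// D/HT setup: pinCurrentThread(1) in the main thread, pinCurrentThread(2)
// in the helper thread (two hyper-threads of the same core).
// D/MC setup: masks selecting two different physical cores instead.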

4.2 Implementation Details

For the sake of evaluation we implemented the proposed pre-execution model manually. The motivations were manifold. In data parallel applications [9], like this one, the same operations are often performed in a loop on all the data elements; thus this loop is a likely candidate for being the hotspot. The best way to choose γi then is to choose it as the processing performed in this loop. In these applications the algorithm is usually not complicated and can be cloned and altered manually to get σi. Thirdly, and probably most importantly, we must point out that numerous works have been published on compiler support for the creation of the helper thread [4,11–13,25]. This allowed us to focus our attention on the pre-execution engine rather than the code behind it.

Cache pollution through the shared variables is a concern. If a shared variable is changed by a thread executed on one of the physical cores, the entire cache line must be immediately written to system memory and removed from the cache of every other core. We must ensure that these writes do not pollute other variables which may reside in the same cache line as the shared variable. For this reason the shared integer variables are each implemented as cache line-long byte arrays and are aligned on cache lines.
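A sketch of this padding (our illustration; it uses C++11 alignas for brevity, whereas the paper's VS2010-era implementation used manually aligned byte arrays; 64 bytes is the cache line size of both test CPUs):

#include <cstdint>

// Each shared variable occupies its own 64-byte cache line, so a write to
// 'idx' cannot invalidate the line holding 'end_' (or any unrelated data)
// in the caches of the other cores.
struct alignas(64) PaddedInt {
    volatile std::int32_t value;
    char pad[64 - sizeof(std::int32_t)];   // fill the rest of the line
};

PaddedInt idx;    // main-thread progress, published to the helper thread
PaddedInt end_;   // termination flag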

4.3 Performance Characteristics

For fair evaluation, five executions are recorded in each test scenario and their average is reported below. The pre-executed application (adaptive) is tested against the baseline single threaded version (baseline).

Figure 2 plots the wall-clock execution times of the four different hash tables executed on the two different machines, and also lists the final jumpahead distance (JAD) and workahead size (WAS) values estimated by the algorithm. The filled bars correspond to the desktop machine (D/*), plotted on the left vertical axis, and the striped ones to the laptop (L) on the right axis (they are of different scales). In the case of the desktop machine, both the true multi-core setup (D/MC) and the single-core setup with HyperThreading (D/HT) results are shown. The numbers above the columns show the speedup of the adaptive pre-execution compared to the baseline.

Fig. 2 Performance evaluation of adaptive pre-execution

All of the hash tables have different internal structures, hence the different performance of the baselines. The adaptive pre-execution model we presented achieves 2–35% speedups in all but two of the test cases across the three hardware setups. It is also notable that this speedup is present for the third-party STL hash_map and Google sparsehash as well, not to mention the .NET Dictionary, which is a pleasant surprise.

It is noticeable that the desktop machine seems to be more effective with pre-execution than the laptop. This must be due to the significantly larger last-level cache (8 MB vs. 2 MB). On the same note, pre-execution seems to favor true multi-core execution rather than HyperThreading, despite the additional level of shared cache with HyperThreading (although its size, 256 kB, seems too small to matter compared to the hundreds of megabytes of data the tables allocate). Nevertheless, both physical and virtual cores are able to produce a performance gain with pre-execution.

4.4 Parameter Estimation

The adaptive reconfiguration of the pre-execution model has been shown to be effective. It might be asked, though, how “good” it is. The Nelder–Mead method minimizes a function; the function for us is the performance, characterized by execution times. This function is expected to be noisy. Figure 3 plots the processing time per item during the execution of the application. As we can see, there are fluctuations and some outliers, but the algorithm is able to handle them and the function shows a steady decline.

Fig. 3 Convergence of the minimized f(x) function: normalized processing time over the samples, with a linear-regression trend

The Nelder–Mead method is prone to finding local optima. To illustrate the capabilities of the adaptive algorithm, Fig. 4 plots the execution times as a 2D surface for two test cases using various, manually specified jumpahead distance and workahead size values. The darker the shade, the lower the execution time. The lines drawn on the surface show the jumpahead distance and workahead size estimates chosen by the adaptive algorithm during its configuration phase. The starting point in both cases is JAD = 500, WAS = 100.

Fig. 4 Performance for various jumpahead distance and workahead size values and the convergence path of the adaptive algorithm


Fig. 5 Final jumpahead distance and workahead size values (workahead size on a log scale) for the custom hash table, STL hash_map, Google sparsehash and .NET Dictionary on the desktop (D) and laptop (L) machines

Another question worth exploring is how well the parameters adapt to the capabilities of the hardware. Figure 5 plots the final parameter estimates. The values are scattered across the plane, as each test scenario has different characteristics.

The two machines used for the evaluation have different performance characteristics and different cache sizes. It is expected that the optimal parameter estimates for both will reflect these differences. Examining the values, for example, for the test case using the Google sparsehash, the desktop machine estimates the jumpahead distance as 3003 and the workahead size as 500, while the estimates of the laptop are 500 and 100, respectively. The jumpahead distance, we remind the reader, is the distance between the main thread and the helper thread, and so it controls the amount of data resident in the cache. The smaller the cache, the smaller this value should be; this is verified by the numbers above: the laptop has the smaller cache and the resulting jumpahead distance is indeed smaller.

Through these experiments we have shown that the adaptive reconfiguration at run-time not only yields good performance but behaves as expected.

4.5 True Parallelism and Pre-Execution

Finally, an interesting question, which we briefly touched upon in the introduction, is the relation between pre-execution and true parallelism. It has been explained that a central dictionary is required by the application in question; the threads can work on separate blocks of the input, but they must synchronize access to the hash table.

Let us consider the application (the hash table) as a black box: the source code is not available, should not be altered, or cannot be altered easily with minimal effort (as with the STL hash_map or the Google sparsehash). In such cases, to protect the application from concurrency problems, a global lock must be used when multiple threads want to access the common resource. (It is understood that fine-grained locking, or even lock-free hash tables, are available, but their implementation requires access to the source code of the original algorithm, thus they are out of the scope of this paper. The hypothesis is that the shared component is not altered but integrated into a system as a black box.) Adding a global lock is about the same complexity as implementing the proposed pre-execution scheme, as the sketch below illustrates.
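For comparison, the parallel-mutex variant can be obtained by wrapping every dictionary call in a single global lock, roughly as in the following sketch (our names; the underlying hash table stands in for the unmodified black box):

#include <mutex>
#include <string>
#include <unordered_map>

// Black-box dictionary shared by all worker threads; every access is
// serialized by a single global mutex, so the threads contend on every
// lookup/insert; this is the 'parallel-mutex' configuration.
class LockedDictionary {
public:
    int lookupOrInsert(const std::string& key, int nextCode) {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = table_.find(key);
        if (it != table_.end()) return it->second;   // already encoded
        table_.emplace(key, nextCode);               // assign a new code
        return nextCode;
    }
private:
    std::mutex lock_;
    std::unordered_map<std::string, int> table_;     // stand-in for the real hash table
};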


Fig. 6 Pre-execution and true parallelism

Let us examine how this true parallelism with mutual exclusion relates to pre-execution in terms of performance (see Fig. 6). The same conditions are used for these test scenarios as previously. The diagonally striped columns correspond to the results obtained on the laptop, while the full columns correspond to the results of the desktop test machine. Pre-execution (adaptive) is compared to the parallel version (parallel-mutex). The baseline is plotted for reference.

At first it might be surprising to see that the parallel implementation actually has the same performance as, or even worse than, the baseline. This, however, is exactly as expected, and is explained by the effect of lock contention. The threads, although they execute in parallel, compete for the same resource (here the global mutex) and the accesses are basically serialized by the lock. The advantage of pre-execution is clear: it does not suffer from this penalty as it does not require any locks.

These results show that pre-execution is an alternative to true parallelism under similar circumstances. It is easier to implement, requires no modification to the internals of the algorithms, and integrates with the algorithms at a high level. Parallelization of the same algorithm at such a high-level scope requires about the same effort, but yields worse results.

5 Related Works

Pre-execution became popular in the last decade after the first SMT (simultaneous multithreading) CPUs appeared. This section presents some of the related works, including a few practical applications. For a survey and a taxonomy of prefetch methods see Byna et al. [3].

Malhotra and Kozyrakis implemented pre-execution for several data structures in the C++ STL library [16] and achieved an average speedup of 26% for various applications using the library. Zhou et al. [26] have also shown that pre-execution brings significant speedup (30–70% in throughput) in a relational database.

The concept of a program slice, which was mentioned in the formal definition in Sect. 2, was introduced by Zilles and Sohi [28] as an extract of the original algorithm. This slice performs the same calculation as the original algorithm does, hence it calculates the correct memory addresses, but it may execute a subset of the original program.


It was also mentioned that numerous works [4,13,15,25,27] focused on creating this slice (Sect. 4.2). The simplest way is to do this manually. The hotspot can be identified through profiling, or by looking at the code and making an educated guess. Then the slice can be extracted by going backwards from the hotspot, identifying the relevant instructions. These were the first approaches, used by Zilles and Sohi [27] and Luk [15]. Another approach is the use of heuristics: for example, when traversing binary trees, Roth et al. suggest an automatic prefetch be issued for the child nodes when a particular node is visited [24]. More sophisticated methods use the compiler [4,11–13,25] to extract the helper thread code and manual/automatic profiling to identify the hotspots [11].

The trigger of pre-execution is either an instruction in the software code or a hardware event. Software triggers must be placed in the source code at compile time, either manually [27] or automatically [11,25]. Hardware triggers can be, for example, load operations during the instruction decoding stage (used by Ro and Gaudiot [23]) or cache misses (used by Dundas and Mudge [7]).

Mutlu et al. showed that pre-execution, or as they refer to it, “runahead execution” [18], is an alternative to large instruction windows. They claim that runahead execution combined with a 128-entry instruction window has approximately the same performance as a significantly larger 384-entry instruction window (without pre-execution). From this point of view, software pre-execution reduces hardware complexity by allowing smaller instruction windows.

6 Conclusion and Discussion

This paper was motivated by data prefetch mechanisms for data intensive applications suffering from memory-to-CPU latency. Pre-execution is a method with promising results.

In this paper we introduced a formalism to describe speculative pre-execution and non-speculative pre-execution. We presented a novel software-only, lightweight, adaptive pre-execution algorithm which, when applied to data intensive applications, masks memory latency by bringing data into the cache of the CPU at the right time. The model we advocate attacks memory latency at a high level and does not require detailed knowledge about the algorithm or profiling of the application.

The algorithm applies two parameters which greatly affect its performance. The optimal values of these parameters depend on the execution environment and the algorithm itself. We proposed an adaptive, run-time reconfiguring solution to find the optimal values of these parameters.

Through practical evaluation we have reached an average speedup of 10–30% and have shown that the algorithm can successfully adapt to the execution environment by reconfiguring itself.

Acknowledgments This project is supported by the New Hungary Development Plan (Project ID: TÁMOP-4.2.1/B-09/1/KMR-2010-0002).


References

1. Bianchini, R., Lim, B.-H.: Evaluating the performance of multithreading and prefetching in multiprocessors. J. Parallel Distrib. Comput. 37(1), 83–97 (1996)
2. Bryant, R.E.: Data-Intensive Supercomputing: The Case for DISC. Technical Report CMU-CS-07-128, School of Computer Science, Carnegie Mellon University (2007)
3. Byna, S., Chen, Y., Sun, X.-H.: A taxonomy of data prefetching mechanisms. In: Proceedings of the 2008 International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN 2008), pp. 19–24, Sydney (2008)
4. Chappell, R.S., Stark, J., Kim, S.P., Reinhardt, S.K., Patt, Y.N.: Simultaneous subordinate microthreading (SSMT). Int. Symp. Comput. Archit. 27(2), 186–195 (1999)
5. Cintra, M., Llanos, D.R.: Toward efficient and robust software speculative parallelization on multiprocessors. In: PPoPP '03: Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 13–24 (2003)
6. Dubois, M.: Fighting the memory wall with assisted execution. In: Proceedings of the 1st Conference on Computing Frontiers, pp. 168–180, Ischia, Italy (2004)
7. Dundas, J., Mudge, T.: Improving data cache performance by pre-executing instructions under a cache miss. In: Proceedings of the 11th International Conference on Supercomputing (ICS '97), pp. 68–75. ACM Press, New York (1997)
8. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, Burlington (2008)
9. Hillis, W.D., Steele, G.L.: Data parallel algorithms. Commun. ACM 29(12), 1170–1183 (1986)
10. Juhász, S., Dudás, Á.: Adapting hash table design to real-life datasets. In: Proceedings of the IADIS European Conference on Informatics 2009, Part of the IADIS Multiconference of Computer Science and Information Systems 2009, pp. 3–10, Algarve, Portugal (2009)
11. Kim, D., Liao, S.S.-W., Wang, P.H., Cuvillo, J.D., Tian, X., Zou, X., Wang, H., Yeung, D., Girkar, M., Shen, J.P.: Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, pp. 27–38 (2004)
12. Kim, D., Yeung, D.: Design and evaluation of compiler algorithms for pre-execution. ACM SIGPLAN Notices 37(10), 159 (2002)
13. Kim, D., Yeung, D.: A study of source-level compiler algorithms for automatic construction of pre-execution code. ACM Trans. Comput. Syst. 22(3), 326–379 (2004)
14. Lee, J.-H., Lee, M.-Y., Choi, S.-U., Park, M.-S.: Reducing cache conflicts in data cache prefetching. ACM SIGARCH Comput. Archit. News 22(4), 71–77 (1994)
15. Luk, C.-K.: Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. ACM SIGARCH Comput. Archit. News 29(2), 40–51 (2001)
16. Malhotra, V., Kozyrakis, C.: Library-Based Prefetching for Pointer-Intensive Applications. Technical report (2006)
17. Manjikian, N., Abdelrahman, T.S.: Array data layout for the reduction of cache conflicts. In: Proceedings of the 8th International Conference on Parallel and Distributed Computing Systems (1995)
18. Mutlu, O., Stark, J., Wilkerson, C., Patt, Y.N.: Runahead execution: an alternative to very large instruction windows for out-of-order processors. In: Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), pp. 129–140. IEEE Computer Society (2003)
19. Nelder, J.A., Mead, R.: A simplex method for function minimization. Comput. J. 7(4), 308–313 (1965)
20. Oliker, L., Biswas, R.: Parallelization of a dynamic unstructured algorithm using three leading programming paradigms. IEEE Trans. Parallel Distrib. Syst. 11(9), 931–940 (2000)
21. Perkins, L.S., Andrews, P., Panda, D., Morton, D., Bonica, R., Werstiuk, N., Kreiser, R.: Data intensive computing. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), p. 69. ACM Press, New York (2006)
22. Purcell, C., Harris, T.: Non-blocking hashtables with open addressing. In: Proceedings of the 19th International Symposium on Distributed Computing, pp. 108–121, Krakow, Poland. Springer-Verlag GmbH (2005)
23. Ro, W.W., Gaudiot, J.-L.: SPEAR: a hybrid model for speculative pre-execution. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium, pp. 75–84. IEEE (2004)
24. Roth, A., Moshovos, A., Sohi, G.S.: Dependence based prefetching for linked data structures. ACM SIGPLAN Notices 33(11), 115–126 (1998)
25. Song, Y., Kalogeropulos, S., Tirumalai, P.: Design and implementation of a compiler framework for helper threading on multi-core processors. In: 14th International Conference on Parallel Architectures and Compilation Techniques (PACT '05), pp. 99–109. IEEE Computer Society, Washington, DC (2005)
26. Zhou, J., Cieslewicz, J., Ross, K.A., Shah, M.: Improving database performance on simultaneous multithreading processors. In: Proceedings of the 31st International Conference on Very Large Data Bases, pp. 49–60, Trondheim, Norway (2005)
27. Zilles, C., Sohi, G.: Execution-based prediction using speculative slices. ACM SIGARCH Comput. Archit. News 29(2), 2–13 (2001)
28. Zilles, C.B., Sohi, G.S.: Understanding the backward slices of performance degrading instructions. ACM SIGARCH Comput. Archit. News 28(2), 172–181 (2000)
