Avoiding Locks and Atomic Instructions in Shared-Memory Parallel BFS Using Optimistic Parallelization
Jesmin Jahan Tithi, Dhruv Matani, Gaurav Menghani, Rezaul A. Chowdhury
Department of Computer Science, Stony Brook University
Stony Brook, New York, 11790, USA
E-mail: {jtithi, dmatani, gmenghani, rezaul}@cs.stonybrook.edu
Abstract—Dynamic load-balancing in parallel algorithms typically requires locks and/or atomic instructions for correctness. We have shown that sometimes an optimistic parallelization approach can be used to avoid locks and atomic instructions during dynamic load balancing. In this approach one allows potentially conflicting operations to run in parallel with the hope that everything will run without conflicts, and if occasional inconsistencies arise due to conflicts, one will be able to handle them without hampering the overall correctness of the program. We have used this approach to implement two new types of high-performance lockfree parallel BFS algorithms and their variants, based on centralized job queues and distributed randomized work-stealing, respectively. These algorithms are implemented using Intel Cilk++, and shown to be scalable and faster than two state-of-the-art multicore parallel BFS algorithms by Leiserson and Schardl (SPAA, 2010) and Hong et al. (PACT, 2011), where the algorithm described in the first paper is also free of locks and atomic instructions but does not use optimistic parallelization. Our implementations can also handle very efficiently the scale-free graphs that frequently arise in real-world scenarios such as the World Wide Web, social networks, biological interaction networks, etc.
Keywords-Optimistic parallelism, Lockfree, Breadth-First Search, BFS, Work-stealing, Cilk++.
I. INTRODUCTION
Optimistic parallelization is an approach where we allow
parallel execution of potentially conflicting code blocks,
provided we know how to handle conflicts if they actually
occur [27]. In this approach threads modify shared data
optimistically and try to detect conflicts, and if conflicts
arise, they undo the modifications or take recovery steps.
Such parallelization techniques are specifically used for
irregular problems where it is hard to exploit fine-grained
parallelism [20]. Variations of optimistic parallelization ap-
proaches have been used in many applications including
Delaunay mesh refinement, image segmentation using graph-
cut, agglomerative clustering, Delaunay triangulation, SAT
solver [20], [27], and so on. However, all of them typically
use locks or atomic instructions while recovering from
conflicts. In this paper we consider an optimistic paral-
lelization approach in which we allow multiple concurrent
threads to access and modify a shared data structure without
any lock and atomic instruction. We show that sometimes
problem-specific properties can be used to avoid the use
of locks and atomic instructions even during the handling
of the inconsistencies in the data structure that arise due
to the unprotected concurrent updates. We use such a
parallelization approach to implement two shared-memory
parallel BFS algorithms based on dynamic load-balancing.
We show that these algorithms outperform their lock-based
counterparts. To the best of our knowledge, there is only
one other BFS implementation [21] that performs dynamic
load-balancing without using locks and atomic instructions.
Our implementations outperform that one, too. However, that
implementation uses a complicated data structure (called a
bag) instead of optimistic parallelization to achieve the goal,
while our implementations use very simple array-based data
structures.
Breadth First Search (BFS) is one of the basic graph
search algorithms in which we explore a graph system-
atically level by level from a source vertex. Graphs are
used as the fundamental representational tool for solv-
ing problems in a wide range of application areas such
as analyzing social networks [25], consumer-product web
analysis, computational biology [22], intelligence analysis,
robotics, network analysis, and even image processing
[30]. All these applications require handling massive data,
and traditionally demand long processing times and substantial
computational resources. Therefore, efficient parallelization
of graph processing algorithms such as BFS is of utmost
importance. BFS is used as a building block for several
other important algorithms such as finding shortest paths and
connected components [7], [11], [13], graph clustering [8],
community structure discovery, max-flow computation and
the betweenness centrality problem [17]. High-performance
BFS is also used in PathMiner [22], for pattern searching in
DNA/RNA strings using Trie [16], and for the execution of
range queries of an MVP-index [23]. BFS is also used as
a graph benchmark application for ranking supercomputers
[3], [4].
BFS falls in the class of parallel algorithms where memory
accesses and work distribution are both irregular and data-
dependent [24]. Various techniques have been engineered so
far to parallelize the traditional BFS algorithm, such as using
distributed memory parallelism [30], [31], shared memory
parallelism [1], [4], [19], [21], centralized queues, distributed
queues or complicated concurrent data structures [2], [21]
and so on. Several GPU based implementations of BFS have
also been proposed [5], [19], [24]. Each year a number of
new approaches to efficient and scalable parallel BFS are
proposed [9], [10], [17], [29]. However, none of them uses
optimistic parallelization. Most
of the previous algorithms either use locks (fine-grained or
coarse-grained), or atomic instructions, or complicated data-
structures to parallelize BFS. In [10] the authors proposed
a lock and atomic-instruction free VIS data-structure (bit
array) that keeps track of visited vertices for BFS. They have
used static load balancing to divide the vertices, adjacency
lists, and the VIS data structure, and because of static
load balancing, no locks are required. In contrast, in
our BFS algorithms, we use dynamic load-balancing to
access and modify the shared queues (arrays). Note that
although lock or atomic instruction based resource protection
is very common, it has many disadvantages, including poor
scalability and inefficiency.
In this paper, we present two different types of
BFS algorithms and their variants that build on very simple
data structures and use optimistic parallelism to avoid the
use of locks and atomic instructions. We show that the
savings from optimistic parallelization outweigh the cost of
the redundant computations that occur in a lockfree
algorithm. We have used
centralized job queues and distributed randomized work-
stealing for dynamic load balancing in our parallel BFS
algorithms. Experimental results show that these algorithms
perform better than two state-of-the-art parallel BFS im-
plementations presented in [19] and [21] for graphs with
millions of vertices and billions of edges. Moreover, our
BFS algorithms with explicit work-stealing handle scale-free
graphs (graphs in which vertex degrees follow a power-law
degree distribution) very efficiently; such graphs arise very
frequently in real-world scenarios.
The rest of the paper is organized as follows. In Section
II, we discuss some related prior work on parallel BFS, and
in Section III, we highlight our contributions. In Section
IV, we present our proposed BFS algorithms, and Section V
presents experimental results and a detailed comparison with
other previous approaches, namely, [19] and [21]. Finally, in
Section VI we conclude our paper.
II. PRIOR WORK
In the standard serial BFS algorithm, a FIFO queue is
used to sequentially explore the vertices of a graph level by
level from a source vertex and find the levels (or shortest
distance) of other vertices from the source. In a parallel
BFS algorithm, typically all vertices of a given level and all
the neighbors of a given vertex can be explored in parallel.
Parallel BFS algorithms are typically level-synchronized,
i.e., a synchronization barrier is needed after each level
of the breadth first search. This also means that there
is a synchronization overhead for each BFS level which
increases linearly with the diameter of the graph. Moreover,
the amount of parallelism achievable at each BFS level is
constrained by the number of nodes/vertices at that particular
level.
In [2], the authors have proposed three different paral-
lel BFS algorithms, using a single concurrent queue, two
sequential queues with locks, and two concurrent queues,
respectively. They have used parallel iterators (i.e., iterators
that can iterate over an array of objects in parallel without
any determinacy race) to implement concurrent queues.
However, they reported results only for graphs with fewer
than 20 thousand vertices. The major drawbacks of their
approaches are idling of threads and the locking overhead.
A different approach to BFS has been presented by Su et
al. [30], based on a structured-grid graph representation
for image processing applications. In this algorithm,
at each BFS level each vertex checks whether it can be
visited from an already visited vertex, which is similar to
searching in the reverse direction compared to standard BFS
(bottom-up approach). Although this approach offers good
load-balancing opportunities among the processors, it does
much more unnecessary work than serial BFS.
Beamer et al. [5] have proposed a hybrid of top-down
(parent to child) and bottom-up (child to parent) exploration
of edges during BFS, and used atomic instructions for
ensuring mutually exclusive writes. A NUMA (Non-Uniform
Memory Architecture) aware graph traversal technique has
been proposed in [17] which used a work-stealing approach
where an idle thread steals from other neighboring threads
running on the same socket to improve cache efficiency.
They have used atomic instructions in their implementations,
too. Saule et al. [29] have proposed a block-accessed shared
queue data structure for implementing layered BFS and used
an atomic fetch-and-add to increment the queue index pointer.
This algorithm is similar to our centralized queue based
BFS algorithm, except that we do not use locks and atomic
instructions to change the queue index pointers.
Lastly, a hybrid approach for BFS has been proposed in
[19], in which an appropriate version of the BFS algorithm is
chosen from a) a serial version, b) two different multicore
versions, and c) a GPU version, based on the number of
vertices in the current level and the next level queues. They
have presented two level-synchronous parallel BFS algo-
rithms for multicores using a read-based method (random
arrays instead of queues) and a read+queue based approach
with/without using a visited array bitmap, respectively. Their
implementations use atomic instructions for updating the
visited vertices bitmap.
Although different optimistic parallelization techniques
have been used for efficiently parallelizing several irregular
problems, most of them use locks and/or atomic instructions
for resolving conflicts. Cledat et al. [12], [14] proposed
another type of optimistic parallelization where one traces
the data dependency and readability to decide whether two
operations can be executed in parallel, and demonstrated
its use for the graph coloring problem. However, the type
of optimistic parallelization techniques we are using differs
slightly from all of these. In this lock & atomic instruction
free optimistic parallelization, threads update global shared
data structures without any protection. However, threads
can detect inconsistencies and use their perceived values
to explore a segment of vertices from a queue only when
it is safe to do so. In case of inconsistency, threads retry
to get consistent values. We also exploit problem-specific
properties to maintain correctness of the algorithm.
III. OUR CONTRIBUTIONS
None of the prior known parallel BFS algorithms based
on dynamic load-balancing is simultaneously free of locks
(e.g., [2] uses locks), atomic instructions (e.g., [19] uses
atomic compare-and-swap), and complicated data structures
(e.g., [21] uses bags of reducers), unlike ours. Similarly,
none of them has considered lock-free, atomic-instruction-
free work-stealing with dynamic load-balancing,
which we have employed in our algorithm. In our work-
stealing method, if a thread becomes idle at a given BFS
level while other threads are still working, the idle thread
steals half of the remaining work from a random busy thread.
For the centralized queue based approaches, no thread sits
idle as long as there is some work in the system. In this way,
these approaches improve load balancing among the worker
threads, and often reduce slowdowns due to the irregular
shape of the input graph (e.g., sometimes the presence of
very high degree vertices). Main differences between our
algorithms and other BFS algorithms are: (i) no use of locks
and atomic instructions, (ii) no complicated data structures:
use of simple random-access arrays as queues, and (iii)
optimistic parallelization. As a consequence, our algorithms
are easy to implement. Our main contributions in this paper
are:
• Lockfree, atomic instruction free, and simple data struc-
ture based BFS algorithms with explicit dynamic load-
balancing.
• Two types of level-synchronous parallel BFS algo-
rithms for multicores with their variants:
– based on centralized queues
∗ one centralized queue
∗ multiple centralized queues (or distributed
queues)
– based on distributed randomized work-stealing
∗ with/without special considerations for scale-
free graphs
• Introduction of lock & atomic instructions free opti-
mistic parallelization for BFS
– We allow concurrent threads to write to shared
variables (updating queue indices) without locks
and atomic instructions. However, we make sure
that results are still correct and overhead of dupli-
cate exploration (a consequence of lock & atomic
instruction free update of queue indices) is as small
as possible.
• Comparison with two state-of-the-art publicly available
level synchronous parallel BFS algorithms
– Baseline1 [21]: To the best of our knowledge, this
is the only known state-of-the-art BFS algorithm
that avoids the use of locks and atomic instructions
during dynamic load-balancing. However, unlike
our algorithms it uses a complicated reducer-based
[18] recursive data structure (called a bag) pro-
vided by cilk++, and relies on cilk’s random-
ized work-stealing scheduler for efficiency. It also
does not use optimistic parallelization.
– Baseline2 [19]: This algorithm uses atomic instruc-
tions and performs better than one other efficient
multicore algorithm presented in [1].
We have also discussed how to optimize our algorithms for
NUMA machines.
IV. PROPOSED PARALLEL BFS ALGORITHMS FOR
MULTICORES
In this section we present sketches of our parallel BFS
algorithms. We have proposed two different types of BFS
algorithms and their variants based on centralized job queues
and randomized work-stealing. We use optimistic paral-
lelization in the following way: we allow multiple threads to
steal or grab a segment from shared distributed/centralized
queues without any locks or atomic instructions assuming
that nothing will go wrong. However, because of the absence
of locks and atomic instructions during the update of shared
queue indices, threads can pick:
• invalid segments (i.e., at least one of the queue indices
falls outside the actual queue range) or
• overlapping segments (i.e., segment is valid but over-
laps with other thread’s current queue segment) or
• stale segments (i.e., segment is valid but already ex-
plored by other threads).
While invalid segments may produce wrong results, or
even cause the program to crash, overlapping and stale
segments can only cause duplicate explorations. In our al-
gorithms, threads check for invalid segments during stealing
or fetching a segment from the queues. In case of failure
(i.e., actually picked an invalid segment), threads retry to
get a valid segment. In case of overlapping and stale
segments, threads skip the exploration as soon as they detect
this. The idea behind this type of optimistic parallelization
is that for BFS algorithms duplicate exploration does not
hamper correctness. Moreover, several tricks can be applied
to reduce duplicate explorations as much as possible. Note
that because of allowing duplicate exploration, some extra
overhead may be added to the system. On the other hand,
we are completely removing the overhead of locks2 and
atomic instructions which are known to create bottlenecks
when the number of threads increases. So, here the chal-
lenge of optimistic parallelization is to reduce the cost of
inconsistency/conflict detection and duplicate exploration to
such an extent that the total overhead does not negate the
total savings resulting from the avoidance of locks, atomic
instructions and other complicated data structures. Similar
techniques can also be used for other types of algorithms
where repeated work does not introduce inaccuracy in results
(e.g., DFS, IDA*, A*, other algorithms that use BFS, etc.).

In a scale-free graph, vertex degrees asymptotically follow
a power-law degree distribution, i.e., the number of vertices
with degree k is n_k ∼ c·n·k^(−γ), where c is a normalization
constant and γ typically lies between 2 and 3. Such a
distribution implies the presence of a few vertices of very
high degree. These high
degree vertices are known as the hotspots, and are the main
bottlenecks in achieving high speedup during parallel BFS.
Scale-free graphs arise in real-world scenarios very often
such as in web graphs, different types of collaboration
networks, homeland security graphs, airline networks,
and biological networks. As a result, scale-
free graphs are receiving an increased amount of attention
nowadays. In order to handle scale-free graphs efficiently,
we either divide the adjacency list of a high degree vertex
equally among the threads, or make sure that an idle thread
can even steal part of the adjacency lists of high degree
vertices.

In all our BFS algorithms, we use two arrays of queues
(these queues are basically randomly accessible arrays),
Qin[p] and Qout[p], to store vertices in the current level and
in the next level (assuming there are p cores or threads in
the system), respectively. When we are done exploring the
current BFS level's vertices, we swap these queues (Qin and
Qout) for the next level of exploration. We always add a
sentinel (0) at the end of each queue, which helps in ensuring
correctness of the lockfree algorithms.
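To make this concrete, the following minimal C++ sketch shows one possible layout of these per-level queues and the level swap. The type and member names are illustrative only; the actual Cilk++ implementation may differ, and vertices are assumed to be numbered from 1 so that 0 can double as the sentinel.

#include <utility>
#include <vector>

// Per-level queue layout (illustrative sketch, not the paper's code).
struct Queue {
    std::vector<int> buf;  // randomly accessible array used as a queue
    int f = 0;             // front: every slot to the left of f is explored
    int r = 0;             // rear: one past the last valid vertex

    void push(int v) { buf.push_back(v); ++r; }
    void seal()      { buf.push_back(0); }  // sentinel 0 terminates the queue
};

// One input and one output queue per thread. Qout[i] is private to thread i
// while a level is being explored, so pushing into it needs no protection.
struct LevelQueues {
    std::vector<Queue> Qin, Qout;
    explicit LevelQueues(int p) : Qin(p), Qout(p) {}

    void advance_level() {      // called once a level is fully explored
        std::swap(Qin, Qout);   // next level's input = this level's output
        for (Queue &q : Qout) { q.buf.clear(); q.f = q.r = 0; }
    }
};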
Table I shows the convention we have used for naming
our algorithms, and Table II shows the acronyms of the
presented algorithms. For simplicity of exposition we first
describe our algorithms using locks, and then explain how
to remove the locks.

Subscript Letter Used   Meaning
C                       Centralized
D                       Decentralized
L                       Lockfree
W                       Work-stealing
S                       Scale-free

Table I: Program Naming Convention

Acronym     Full Name
BFSC        Centralized (with locks)
BFSCL       Centralized + Lockfree
BFSW        Work-stealing (with locks)
BFSWL       Work-stealing + Lockfree
BFSWS       Work-stealing + Scale-free (with locks)
BFSWSL      Work-stealing + Scale-free + Lockfree
BFSDL       Decentralized + Lockfree
sbfs        Serial BFS
Baseline1   Implementations from [21]
Baseline2   Implementations from [19]

Table II: Program Acronyms

2. On a typical PC, locks are known to be more than 20 times slower than standard CPU operations [28].
A. Based on Centralized Queues
1) BFSC (Centralized (with locks)): In this algorithm,
the access to the centralized queue Qin is controlled by
a lock. The queues in Qin and the vertices in each such
queue are explored from left to right. We maintain a pair
of global indices 〈q, f〉 with the invariant that all vertices
in Qin[j] with j < q, and all vertices to the left of index
f in Qin[q] are already explored. All p threads try to fetch
the next available segment of length s (unless fewer than s
vertices remain in the current queue) from Qin by changing
〈q, f〉 using a global lock. If a thread is successful in fetching
a new segment, it advances 〈q, f〉 and starts exploring the
newly grabbed segment. While exploring the vertices, each
thread puts the newly discovered vertices in its own private
output queue (Qout[i], where i ∈ [0, p) is the thread id). At the
end of the exploration of all vertices in the current level
we swap Qin and Qout, and exploration starts again for the
next BFS level. Note that we change s adaptively after each
dispatch of a segment, based on the total number of vertices
in Qin and the number of worker threads, to make the work
division as efficient as possible.
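A minimal sketch of this lock-protected dispatch follows, reusing the Queue type from the earlier sketch. The adaptive computation of s is elided and all names are illustrative:

#include <algorithm>
#include <mutex>
#include <vector>

std::mutex dispatch_lock;  // the single global lock protecting <q, f>
int g_q = 0, g_f = 0;      // global queue index q and front index f

// Grab the next segment of at most s vertices from Qin; returns false when
// the current level is exhausted. [*lo, *hi) indexes into Qin[*q].buf.
bool fetch_segment(std::vector<Queue> &Qin, int s,
                   int *q, int *lo, int *hi) {
    std::lock_guard<std::mutex> guard(dispatch_lock);
    while (g_q < (int)Qin.size() && g_f >= Qin[g_q].r) {
        ++g_q;                 // this queue is drained; move to the next one
        g_f = 0;
    }
    if (g_q >= (int)Qin.size()) return false;  // nothing left at this level
    *q  = g_q;
    *lo = g_f;
    *hi = std::min(g_f + s, Qin[g_q].r);       // clip to the queue's rear
    g_f = *hi;                                 // advance <q, f> under the lock
    return true;
}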
2) BFSCL (Centralized + Lockfree): In this algorithm,
we use optimistic parallelism to avoid the use of locks and
atomic instructions. We maintain a global queue pointer q
with the invariant that all the vertices in the queues to the
left of Qin[q] have already been explored. For each queue,
Qin[k], we maintain a front pointer Qin[k].f initialized to
0, and maintain the invariant that all vertices to the left of
Qin[k].f in that queue have already been visited. Whenever
a thread needs to fetch a segment, it first stores q in a
local variable k. It then keeps incrementing k (if needed
and as long as necessary) to find the leftmost queue with
f ′ < Qin[k].r, where r is Qin[k]’s rear pointer, and f ′ is
a local variable that holds the value of Qin[k].f (front). As
soon as it finds such a k, it updates q to k, and Qin[k].f
to f′ + s. Observe that in case of two or more threads
changing q at the same time, q may end up updated to a
point to the left of where it should actually be, which can
result in a thread receiving a segment with vertices that are
Figure 1: Concurrent access to the current queue pointer by threads T1, T2, T3 can change the pointer to point to a previous location of Qin.
already visited. The Qin[q].f pointer can also get updated
backwards in a similar way, which may cause two threads
receiving the same segment for exploration (a predictable
consequence of not using locks). Figure 1 explains this
phenomenon. However, as mentioned before, this type of
duplicate exploration does not hamper the correctness of the
algorithm. Nevertheless, to reduce the frequency of duplicate
exploration, we use the following trick. Whenever a thread
reads a new vertex from the queue for exploration, it empties
that location (say, sets it to 0), and whenever a thread sees a 0
in the queue, it concludes that the segment has already been
explored or is under exploration by some other thread. So,
it simply stops at that point and retries to get a new segment
from the queues. Note that there is no possibility of creating
a gap in the queues, because a thread only stops when it
sees a 0 (rather than stopping by checking a rear pointer),
and a 0 can only appear either at the end of the queue, or
if the element has already been explored. As before, while
exploring the vertices, each thread puts the newly discovered
vertices in its own private output queue Qout[i], where i
denotes the thread id. After finishing the exploration of all
vertices from all queues in the current level, we swap Qin
and Qout, and exploration starts again for the next level.
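The sketch below shows one way the unprotected dispatch and the clear-and-stop trick could look; it is an illustration, not the paper's code. The races on q and Qin[k].f are deliberate: at worst they hand out an already-explored or overlapping segment, which the 0-check detects. Since such races are formally undefined behavior in the C++11 memory model, the sketch expresses the shared indices as relaxed atomics, which compile to plain loads and stores on x86 and involve no locks and no atomic read-modify-write instructions:

#include <atomic>
#include <vector>

struct LFQueue {
    std::vector<int> buf;   // vertex slots; 0 = sentinel or "already taken"
                            // (a strictly conforming version would make the
                            // slots relaxed atomics as well)
    std::atomic<int> f{0};  // front; relaxed ops compile to plain MOVs on x86
    int r = 0;              // rear; fixed for the duration of a level
};

std::atomic<int> g_q{0};    // global queue pointer, also updated unprotected

// Try to grab the next segment of s vertices: no locks, no CAS, no fetch-add.
bool fetch_segment_lockfree(std::vector<LFQueue> &Qin, int s,
                            int *q, int *lo) {
    for (int k = g_q.load(std::memory_order_relaxed);
         k < (int)Qin.size(); ++k) {
        int f2 = Qin[k].f.load(std::memory_order_relaxed);  // local copy f'
        if (f2 < Qin[k].r) {            // leftmost queue with work remaining
            g_q.store(k, std::memory_order_relaxed);        // may race: fine
            Qin[k].f.store(f2 + s, std::memory_order_relaxed);
            *q = k;
            *lo = f2;
            return true;
        }
    }
    return false;
}

// Explore up to s slots starting at lo, clearing each slot as it is read.
// Hitting a 0 means "sentinel, or another thread got here first": abort.
template <class Visit>
void explore_segment(LFQueue &q, int lo, int s, Visit visit) {
    for (int i = lo; i < lo + s && i < (int)q.buf.size(); ++i) {
        int v = q.buf[i];   // racy read, tolerated by the algorithm
        q.buf[i] = 0;       // clear so that later readers skip this slot
        if (v == 0) break;  // duplicate/stale segment detected
        visit(v);           // push v's unvisited neighbors into our Qout
    }
}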
3) BFSDL (Decentralized + Lockfree): This algorithm
builds on BFSCL. However, rather than having one cen-
tralized queue, we now have j centralized queues for some
j ∈ [1, p], where each centralized queue consists of ⌈p/j⌉
queues from Qin. Note that j = 1 means it is a purely
centralized approach like BFSCL, whereas j = p means
purely distributed. At the beginning of each BFS level, each
thread picks a random centralized queue, and whenever the
thread becomes idle, it fetches the next available segment
from that centralized queue pool, and explores the vertices
from that segment. However, if the chosen centralized queue
pool is empty, it randomly tries at most cj log j times (where
c > 1 is a constant) to get a new nonempty centralized
queue. If it succeeds in finding such a queue, it explores
vertices from it in the same way as before. It can be proved
using the balls and bins model [26] that w.h.p. it takes no
more than cj log j tries to check each centralized queue at
least once for work provided c > 1. This process continues
until all queues become empty, and the next level of BFS
starts.
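A sketch of this give-up rule follows: an idle thread probes random centralized queue pools at most about c·j·log j times before quitting the level. The Pool type and its empty() test are stand-ins for the paper's bookkeeping, not its actual interface:

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

struct Pool {               // stand-in for one centralized queue pool
    int remaining = 0;      // rough count of unexplored vertices
    bool empty() const { return remaining <= 0; }
};

// Returns the index of a nonempty pool, or -1 after roughly c*j*log(j)
// failed probes, which w.h.p. suffices to have tested every pool at least
// once (balls-and-bins argument [26], with c > 1).
int pick_nonempty_pool(const std::vector<Pool> &pools, double c) {
    int j = (int)pools.size();
    int tries = (int)std::ceil(c * j * std::log((double)std::max(j, 2)));
    while (tries-- > 0) {
        int cand = std::rand() % j;             // probe a random pool
        if (!pools[cand].empty()) return cand;  // found one with work
    }
    return -1;                                  // give up for this BFS level
}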
B. Based on Distributed Randomized Work-stealing
1) BFSW (Work-stealing (with locks)): In this algo-
rithm, during the exploration of any given BFS level, the ith
(for 0 ≤ i ≤ p− 1) thread starts working on the vertices of
Qin[i]. However, whenever a thread becomes idle, it chooses
another random thread (victim) with enough work, and steals
half of its work (right half of the victim’s segment). Each
thread tries at most cp log p times to find a victim, where
c > 1 is a small constant3. Otherwise, it quits for that level
of BFS. Threads use locks for safe stealing (changing the
queue segment indices) to avoid the exploration of an invalid
segment.
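A sketch of the locked steal, with per-thread state held in an assumed Worker struct; using try_lock keeps the lock wait time O(1), since a thief that cannot immediately acquire a victim's lock simply moves on to another random victim:

#include <mutex>

struct Worker {
    std::mutex m;        // protects this worker's segment bounds in BFSW
    int q = 0;           // queue id the worker is exploring
    int f = 0, r = 0;    // unexplored segment is [f, r)
};

// Thief tries to take the right half of victim's segment. Returns false if
// the victim is busy (lock already held) or has too little work to share.
bool try_steal(Worker &victim, Worker &thief) {
    if (!victim.m.try_lock()) return false;  // O(1) wait: just try elsewhere
    bool ok = (victim.r - victim.f) > 1;     // enough work to split?
    if (ok) {
        int mid = victim.f + (victim.r - victim.f) / 2;
        thief.q = victim.q;                  // thief takes [mid, r)
        thief.f = mid;
        thief.r = victim.r;
        victim.r = mid;                      // victim keeps [f, mid)
    }
    victim.m.unlock();
    return ok;
}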
2) BFSWL (Work-stealing + Lockfree): This algorithm
works in the same way as BFSW except that here we
use optimistic parallelization to get rid of locks and atomic
instructions. To ensure correctness of the lockfree work-
stealing, a thread performs sanity checks on the segment
boundaries, and changes the queue/segment boundary vari-
ables based on locally saved states of the global queue
segment variables. This helps in maintaining the correctness
of the whole process. We assume that at the start of a BFS
level Qin[q].r holds the rear pointer of Qin[q] for every q,
and this variable remains unchanged throughout that level.
Every thread maintains three variables, namely, q, f and
r to keep track of the queue id, front pointer and rear
pointer, respectively, of the segment of vertices it is currently
working on. Initially, thread t ∈ [0, p) gets the entire Qin[t]
as a single segment. As it explores the segment, it keeps
updating its own f pointer accordingly. In order to reduce
the chances of duplicate exploration of the same vertex by
other threads, a thread clears (i.e., sets to 0) every location
of the queue segment as soon as it reads the vertex stored
in that location. A thread aborts working on a segment as
soon as it encounters a 0 value (i.e., a cleared value) in the
segment. Whenever the thread runs out of work, it chooses a
random thread with enough work, and tries to steal half of its
work (i.e., right half of its unexplored segment of vertices).
The thief first saves the queue id q, front pointer f and rear
pointer r of the victim’s segment to local variables q′, f ′
and r′, respectively. It then performs the following sanity
check: f ′ < r′ ≤ Qin[q′].r. If the check fails (meaning
the victim has possibly moved to another queue, and the
retrieved segment is invalid), the thief aborts this steal and
tries another random victim. Otherwise it updates its own q,
f and r pointers to q′, f′ + (r′ − f′)/2, and r′, respectively,
and the victim's r pointer to f′ + (r′ − f′)/2. It does not
change the victim’s q and f pointers. Observe that as no
thread checks its own rear pointer while exploring, any
invalid change to the rear pointer of the segment (which may
happen due to not using locks and atomic instructions) does
not hamper correctness. If a thief changes a rear segment
3. The logic behind trying cp log p times follows from the same balls and bins model [26] as mentioned in Section IV-A3.
pointer of the victim to any invalid location, its consequence
will only be that no other thread will be able to steal from
that particular victim for some time until the victim itself
becomes a thief and changes its own rear pointer. On the
contrary, if a thief gets an invalid segment from a victim,
using the sanity checks described above, it can safely
avoid that segment and retry for a valid one.
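The lockfree counterpart is sketched below with illustrative names, reusing the LFQueue type from the earlier sketch: all victim fields are first copied into locals q′, f′, r′, validated, and only then committed. The unprotected write to the victim's rear pointer may race with other thieves, but as explained above that at worst delays future steals from this victim:

#include <vector>

struct LFWorker {            // like Worker above, but with no mutex at all
    int q = 0, f = 0, r = 0;
};

bool try_steal_lockfree(const std::vector<LFQueue> &Qin,
                        LFWorker &victim, LFWorker &thief) {
    int q2 = victim.q, f2 = victim.f, r2 = victim.r;  // local snapshots
    if (q2 < 0 || q2 >= (int)Qin.size()) return false;
    if (!(f2 < r2 && r2 <= Qin[q2].r))   // sanity check: f' < r' <= Qin[q'].r
        return false;                    // invalid segment: try another victim
    int mid = f2 + (r2 - f2) / 2;
    thief.q = q2;                        // thief takes the right half [mid, r')
    thief.f = mid;
    thief.r = r2;
    victim.r = mid;   // unprotected write; another thief may clobber it, which
                      // only makes this victim temporarily un-stealable
    return true;
}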
3) BFSWSL (Work-stealing + Scale-free + Lockfree): This algorithm uses an approach similar to that used in
BFSWL. However, the vertices of each level are explored in
two phases. In the first phase, the threads only explore the
low-degree vertices using explicit work-stealing as before
and push the higher degree vertices into a separate queue,
Qs (the definition of high degree can be changed using a
threshold variable). At the end of this phase, we divide the
adjacency list of each vertex from Qs into p chunks, and
for 1 ≤ i ≤ p, the ith thread explores the ith chunk of
the adjacency list (phase 2). No work-stealing happens in
this phase. We have also experimented with another variant
of BFSWSL which uses the work-stealing mechanism in the
second phase, too. In this version, a thread is allowed to
steal half of the remaining unexplored adjacency list of a
vertex, if there is only one vertex left in the queue. However,
this approach often performed worse than the first approach
in our experiments.
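A sketch of phase 2, which splits each deferred hotspot's adjacency list evenly across the p threads, is given below. The CSR-style Graph interface, the visit() helper, and the degree threshold are our stand-ins, not details from the paper; cilk_for reflects the paper's Cilk setting and needs a Cilk-enabled compiler:

#include <cilk/cilk.h>   // Cilk Plus header; the paper's code used Cilk++
#include <vector>

struct Graph {                         // assumed CSR-style graph interface
    std::vector<int> offset, adj;
    int degree(int v) const { return offset[v + 1] - offset[v]; }
    int neighbor(int v, int e) const { return adj[offset[v] + e]; }
};

void visit(int v);  // assumed: marks v and pushes it into the caller's Qout

// Phase 2: each hotspot's adjacency list is cut into p equal chunks and the
// i-th thread explores the i-th chunk; no work-stealing in this phase.
void explore_hotspots(const Graph &g, const std::vector<int> &Qs, int p) {
    for (int v : Qs) {                        // hotspots deferred in phase 1
        long long deg = g.degree(v);
        cilk_for (int i = 0; i < p; ++i) {
            long long lo = deg * i / p;       // chunk boundaries
            long long hi = deg * (i + 1) / p;
            for (long long e = lo; e < hi; ++e)
                visit(g.neighbor(v, (int)e));
        }
    }
}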
4) BFSWS (Work-stealing + Scale-free (with locks)): This algorithm is similar to BFSWSL except that in the
work-stealing phase threads use locks to change the queue
segments, which completely removes any possibility of
having invalid segments.
C. Extension to NUMA Architecture
It is not difficult to optimize our algorithms for NUMA.
For example, for the decentralized algorithm (BFSDL), we
make sure that all threads that are initially assigned to the
same centralized queue are launched on the cores of the
same socket.4 When a group of threads finishes exploring
the vertices from their centralized queue, each of them can
migrate to another random queue allocated on the same
socket. If no queue is available on the same socket,
it explores from a queue allocated on another socket. This can
also be done by assigning higher priorities to centralized
queues allocated on the same socket and lower priorities to
others. For the work-stealing based algorithms, we can use
the following approach. While stealing, a thread randomly
chooses a thread running on the same socket with higher
priority. In case of failure to get a thread from the same
socket with enough work, a thread steals from threads
running on other sockets. A NUMA aware work-stealing
approach for the betweenness centrality problem has been
proposed in [17] which can also be followed.
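For instance, victim selection could be biased toward the thief's own socket as in the sketch below; the contiguous thread-to-socket mapping, the retry count, and the has_enough_work() helper are assumptions, not details from the paper:

#include <cstdlib>

bool has_enough_work(int t);  // assumed: does thread t have work to split?

// Pick a steal victim, preferring threads on the thief's own socket.
// Assumes p > 1 threads laid out contiguously across sockets.
int choose_victim(int self, int p, int per_socket, int local_tries) {
    int base = (self / per_socket) * per_socket;   // first thread on my socket
    for (int t = 0; t < local_tries; ++t) {        // same-socket attempts first
        int v = base + std::rand() % per_socket;
        if (v != self && v < p && has_enough_work(v)) return v;
    }
    int v;                                         // fall back to any socket
    do { v = std::rand() % p; } while (v == self);
    return v;
}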
4. Cilk++ does not allow setting thread affinities, and so, we can use OpenMP instead.
D. Discussion: Further Improvements
Note that none of our algorithms has used any technique
for removing duplicate vertices from the queues. To remove
duplicate vertices (or to prevent duplicate exploration by
different threads) from queues, one can use locks or atomic
instructions and/or a bitmap of visited vertices, as used in
[19]. It is also possible to rely on the arbitrary-concurrent-
write property to record only one parent of a vertex (since a
vertex can have multiple parents), as done in [6]. However,
we plan to use the following method for reducing duplicate
exploration of vertices even further. Each thread will store
the queue id (or parent id) of a vertex in a global array
while exploring (using arbitrary concurrent write), and it
will also check the queue id (or parent id) before exploring
a vertex. If the current queue id matches the stored
value, the thread explores the vertex, otherwise it skips that
vertex. Note that this approach does not require any locking
or atomic instructions. Avoiding duplicate explorations can
be beneficial for dense and low diameter graphs, where the
number of duplicate vertices can be huge.
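A sketch of this planned parent-id filter follows; it relies on the stated arbitrary-concurrent-write assumption (when several threads write parent[v] simultaneously, exactly one write survives), and the names are illustrative:

#include <vector>

std::vector<int> parent;   // parent[v] = queue (or thread) id that claimed v
                           // (sized to the number of vertices, reset per run)

// Claim v for my_id and explore it only if our write is the one that survived.
// No locks and no atomic RMW: concurrent writers simply race, one value wins,
// and most losers see the winner's id and skip v. This greatly reduces, though
// does not fully eliminate, duplicate exploration.
bool claim_vertex(int v, int my_id) {
    parent[v] = my_id;             // unprotected concurrent write
    return parent[v] == my_id;     // read back: did our write survive?
}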
We are planning to implement another variant, in which
we will divide the edges evenly instead of the vertices,
while using dynamic load-balancing as before. We expect
this approach to be more scalable.
In [29], the authors have shown that a graph traversal
algorithm implemented using Intel Cilk Plus is 2–3 times
slower than its OpenMP based implementation on Intel MIC
with 120 cores. It would be interesting to see whether
OpenMP based implementations of our algorithms show
similar trends.
V. EXPERIMENTAL RESULTS
All experiments included in this section were performed
on a single node of the Lonestar 4 computing cluster located
at the Texas Advanced Computing Center (TACC), and the
Trestles cluster at the San Diego Supercomputer Center
(SDSC). The properties of the simulation environment are
summarized in Table III.
Attribute                   Lonestar                        Trestles
Processors                  3.33 GHz hexa-core 64-bit       2.4 GHz 8-core AMD
                            Intel Westmere                  Magny-Cours
Cores/node                  12                              32
RAM size and memory speed   24 GB, 177 GB/s                 64 GB, 171 GB/s
OS                          Linux CentOS 5.5                Linux CentOS 5.5
Cache                       12 MB L3 (shared in same        12 MB L3 (shared in same
                            socket), 256 KB private L2,     socket), 512 KB private L2,
                            64 KB private L1                128 KB private L1

Table III: Simulation Environment
We have tested all of our parallel BFS algorithms on
real-world graphs such as cage15, cage14, kkt-power,
freescale, and Wikipedia-2007 from the Florida Sparse Ma-
trix Collection [15]. We have also tested the programs on
Graph       Description                                       n       m       Diameter
Cage15      DNA electrophoresis, 15 monomers in polymer       5.2M    99.2M   53
Cage14      DNA electrophoresis, 14 monomers in polymer       15.1M   27.1M   42
Freescale   Large circuit, Freescale Semiconductor            3.4M    18.9M   141
Wikipedia   Gleich/Wikipedia-20070206                         3.6M    45M     14
kkt-power   Optimal power flow, nonlinear optimization (KKT)  2M      8.1M    11
RMAT100M    RMAT graph generated using the Graph-500          10M     100M    12
            RMAT generator
RMAT1B      RMAT graph generated using the Graph-500          10M     1B      5
            RMAT generator

Table IV: Graphs and their properties. In this table, n and m denote the number of vertices and the number of edges of the graph, respectively. The diameters in the table show the maximum diameter explored by the BFS rather than the actual diameter of these graphs.
synthetic random RMAT graphs generated using the Graph-
500 RMAT generator5 with millions of vertices and up to
a billion edges. All graphs were directed. The properties
of these graphs are summarized in Table IV.
We have compared all variants of our parallel BFS algo-
rithms6 mentioned in Section IV with the BFS implementa-
tions of Baseline1 [21] and Baseline2 [19]. We collected
the source codes from the authors of [21] and [19] and
ran all these programs on the graphs listed in Table IV.
The algorithms presented in [19] had both CPU and GPU
based implementations, but we compared only with the four
multicore based CPU implementations. In [19] and [21] the
authors used the pthreads and Cilk++ concurrency platforms
for parallelism, respectively. All the programs (including
codes from [19] and [21]) were compiled using the -O3
optimization flag. No hyperthreading was used. We
ran all programs for 1000 random non-zero degree source
vertices, and computed the average running time per source.
We found that our work-stealing algorithms optimized for
scale-free graphs almost always perform better than the
corresponding unoptimized variant, even on general graphs.
Hence, we have not reported the results for those unopti-
mized work-stealing variants.
Table V shows the running times of all these algo-
rithms on a single machine. Tables V(a) and V(b) show
the running times of different algorithms run on a sin-
gle compute node of Lonestar and Trestles, respectively.
Observe that each lockfree version generally runs faster
than the corresponding lock-based version. Table V also
shows that the centralized queue based BFS implementations
perform better than the work-stealing based approaches on
Lonestar (with 12 cores/node), whereas on Trestles (with 32
5. Parameters used: a = 0.45, b = 0.15, and c = 0.15.
6. The decentralized algorithm was run with 1 centralized queue.
cores/node) the work-stealing BFS algorithms show better
performance. One possible reason for this behavior may be
as follows. For all our algorithms, the number of accesses
to the shared queue(s) increases as the number of threads
increases. However, in the work-stealing implementation,
steal attempts are more or less evenly distributed among all
queues. Thus, the number of simultaneous accesses to each
queue in those implementations increases at a much slower
rate than the number of accesses into the single shared
queue pool of the centralized queue based implementations.
This means that the overhead of locks increases at a faster
rate in the centralized queue implementation compared to
the lock-based work-stealing implementation. Similarly, in
the lockfree work-stealing version far fewer cases of dupli-
cate/invalid/stale segment extraction and the corresponding
overhead occur compared to the lockfree centralized version.
As a result, though the centralized versions run faster when
p is smaller, they start to slow down w.r.t. the work-stealing
versions as p increases, and at some point, the work-
stealing versions become faster. Also note that the lock
wait time (i.e., time between requesting and acquiring a
lock) for each thread in the work-stealing algorithm with
locks is O(1) (using try_lock()). On the other hand, for the
centralized queue based approach, the wait time can be as
high as Θ(p).

Figures 2(a) and 2(b) show the scalability of our algo-
rithms on Lonestar and Trestles, respectively. We have run
the lockfree versions of our algorithms on the scale-free
Wikipedia graph, and varied the number of worker threads.
The plots show that the centralized queue based versions
are not scalable beyond 20 cores, while the work-stealing
version remains scalable till the end (i.e., up to 32 cores). We
believe that in addition to the reason given before, the fact that
the work-stealing version is optimized for scale-free graphs
has contributed to its scalability.
For all the real-world graphs, our best performing BFS im-
plementation was better than both the implementations from
Baseline1 [21] and Baseline2 [19]. Our algorithms perform
the best for scale-free graphs and sparse graphs. However,
for the synthetic RMAT graphs (graph-10M-100M and
graph-10M-1B) Baseline2 implementations performed
slightly better than ours and the Baseline1 implementation.
The Baseline2 implementation (local queue + read + bitmap)
runs faster only for graph-10M-1B on Lonestar,
and for both graph-10M-1B and graph-10M-100M
on Trestles. The possible reason for this is that the Baseline2
implementation uses a bitmap to track visited vertices in the
queues/read arrays (which in turn helps in avoiding duplicate
exploration of the same vertex by different threads). Note
that although the same vertex can appear only once in
a particular thread’s output queue Qout[i], it is possible
that the same vertex may appear once in the output queue of
every thread. This happens if it is discovered by all the
threads at exactly the same time.
[Table V: Running times of different algorithms (all times are shown in milliseconds); (a) running times (ms) on Lonestar (12 cores), (b) running times (ms) on Trestles (32 cores). Here, a blue-colored cell means the global best running time for each row, and a green-colored cell indicates the best among ours when it is not the global minimum. The numeric entries of this table are not legible in this copy.]
The graph graph-10M-1B, with 10M vertices and 1B edges, is
a very dense graph (in fact the densest graph we have used)
resulting in many duplicate vertices (as a vertex can have
many parents). As the Baseline2 implementation keeps
track of visited vertices using an atomic compare-and-swap
instruction, and removes the bulk of the overhead of duplicate exploration of a
very high degree vertex by several threads, it runs faster for
dense graphs. Therefore, algorithms like ours that do not
track visited vertices explicitly or do not remove duplicate
vertices from queues will slow down due to a high number
of duplicate explorations. We have not included the plots for
the largest graph and the smallest graph in our input set (i.e.,
the RMAT graph graph-10M-1B, or kkt-power) because
including them in the plots makes the curves for smaller
graphs indistinguishable.
Table VI shows some statistics on the steal attempts made
by threads in BFSWS and BFSWSL. Both implementa-
tions were run 5 times from 100 sources of the Wikipedia
Graph on a single Lonestar node, and average values were
computed for the number of successful steal attempts as
well as different types of failed steal attempts. We observe
that though the total number of steal attempts in BFSWSL
was slightly larger than that in BFSWS , the percentage of
successful steal attempts was also higher in BFSWSL. Also
observe that the number of failed steal attempts as a result of
a victim being idle was also lower in BFSWSL. Recall that a
thread becomes idle when it runs out of work and gives up
searching for work after a certain (say, MAX STEAL)
number of failed steal attempts. Thus BFSWSL achieved
better load balancing than BFSWS which was translated
into a better running time for BFSWSL. Since BFSWSL
did not use locks, there were no failed steal attempts as a
result of choosing a victim that was already locked. Instead
in BFSWSL more steal attempts failed because the segment
obtained by the thief was either too small (e.g., could happen
if the victim did not have any work and so was also trying
to steal), or stale (e.g., could happen if two thieves were
stealing from the same victim with work), or invalid (e.g.,
could happen if more than one thief were trying to steal from
the same victim and thus messed up the queue indices).
In both implementations most of the steal attempts failed
because of the large value used for MAX STEAL. A large
MAX STEAL results in a large number of failed steal
attempts at the end of each level which is reflected in the
large number of steal attempts that failed because of idle
victims.
VI. CONCLUSION
In this paper, we have presented two different types of
lockfree parallel BFS algorithms along with their variants
based on centralized job queues and distributed random-
ized work-stealing. These algorithms use a novel optimistic
parallelization technique to avoid any kind of locks and
[Figure 2: Scalability of our lockfree parallel BFS algorithms running on (a) Lonestar and (b) Trestles. All algorithms were run on the Wikipedia graph only. The plots are not legible in this copy.]
atomic instructions. We have shown that lockfree algorithms
are typically faster than the corresponding locked versions.
Although work-stealing is an old technique for dynamic
load balancing, lockfree work-stealing is novel for BFS.
Experimental results show that these algorithms perform
very well for massive, scale-free and sparse graphs and
achieve better performance compared to two other state-of-
the-art algorithms. We have several interesting ideas that we
plan to try next for BFS including extending this lock and
atomic instruction free optimistic parallelization technique
to other graph traversal algorithms such as IDA*, A*, etc.,
and to other important application areas of BFS itself.
Implementation and analysis of these BFS algorithms on
GPUs, Intel MIC, and clusters of multicores would also
be interesting to explore.
ACKNOWLEDGMENT
Thanks to all authors of [21] and [19] for sharing their codes with us. Special thanks to T. B. Schardl for further explaining his code. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.
[Figure 3: Performance in terms of Traversed Edges per Second (TEPS) when processing real-world graphs on (a) Lonestar (12 cores) and (b) Trestles (32 cores). The plots are not legible in this copy.]
REFERENCES
[1] V. Agarwal, F. Petrini, D. Pasetto, and D. Bader, Scalable graph exploration on multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'10), pp. 1–11, 2010.
[2] L. Akeila, O. Sinnen, and W. Humadi, Object oriented parallelisation of graph algorithms using parallel iterator, Proceedings of the 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC'10), pp. 41–50, 2010.
[3] M. Anderson, Better benchmarking for supercomputers, IEEE Spectrum, pp. 12–14, 2011.
[4] D. Bader and K. Madduri, Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2, Proceedings of the International Conference on Parallel Processing (ICPP'06), pp. 523–530, 2006.
[5] S. Beamer, K. Asanovic, and D. Patterson, Direction-optimizing breadth-first search, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'12), p. 12, 2012.
[6] G. Blelloch and B. Maggs, Parallel algorithms, Algorithms and Theory of Computation Handbook, Chapman & Hall/CRC, 2010.
[7] G. Brodal, R. Fagerberg, U. Meyer, and N. Zeh, Cache-oblivious data structures and algorithms for undirected breadth-first search and shortest paths, Proceedings of the 9th Scandinavian Workshop on Algorithm Theory (SWAT'04), pp. 480–492, 2004.
[8] A. Beckmann and U. Meyer, Deterministic graph-clustering in external-memory with applications to breadth-first search, unpublished manuscript, 2009.
Program: BFSWS. Time: 7.72 s. Total steal attempts: 732,535 (100.00%).
  Failed steal attempts: victim locked 265,198 (36.20%); victim idle
  271,731 (37.09%); segment too small 137,675 (18.79%); stale segment
  49,387 (6.74%); invalid segment N/A; total failed 723,991 (98.83%).
  Successful steal attempts: 8,544 (1.17%).

Program: BFSWSL. Time: 7.53 s. Total steal attempts: 734,535 (100.00%).
  Failed steal attempts: victim locked N/A; victim idle 268,710 (36.58%);
  segment too small 399,840 (54.43%); stale segment 56,849 (7.74%);
  invalid segment 221 (0.03%); total failed 725,620 (98.79%).
  Successful steal attempts: 8,915 (1.21%).

Table VI: Statistics of successful and failed steal attempts on the Wikipedia graph when run from 100 sources. For each program we report the average of 5 independent runs.
[9] A. Buluc and K. Madduri, Parallel breadth-first search on distributed memory systems, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'11), pp. 1–12, 2011.
[10] J. Chhugani, N. Satish, J. Sewall, C. Kim, and P. Dubey, Fast and efficient graph traversal algorithm for CPUs: Maximizing single-node efficiency, Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS'12), pp. 378–389, 2012.
[11] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 3rd Edition, MIT Press, 2009.
[12] R. Cledat and S. Pande, Discovering optimistic data-structure oriented parallelism, Proceedings of the 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar'12), 2012.
[13] R. Chowdhury and V. Ramachandran, Cache-oblivious shortest paths in graphs using buffer heap, Proceedings of the 16th Annual Symposium on Parallelism in Algorithms and Architectures (SPAA'04), pp. 245–254, 2004.
[14] R. Cledat, Programming models for speculative and optimistic parallelism based on algorithmic properties, PhD thesis, Georgia Institute of Technology, 2012.
[15] T. Davis, University of Florida sparse matrix collection, http://www.cise.ufl.edu/research/sparse/matrices/.
[16] T. Flouri, C. Iliopoulos, M. Rahman, L. Vagner, and M. Voracek, Indexing factors in DNA/RNA sequences, Bioinformatics Research and Development, pp. 436–445, 2008.
[17] M. Frasca, K. Madduri, and P. Raghavan, NUMA-aware graph mining techniques for performance and energy efficiency, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'12), p. 95, 2012.
[18] M. Frigo, P. Halpern, C. Leiserson, and S. Lewin-Berlin, Reducers and other Cilk++ hyperobjects, Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures (SPAA'09), pp. 79–90, 2009.
[19] S. Hong, T. Oguntebi, and K. Olukotun, Efficient parallel graph exploration on multicore CPU and GPU, Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'11), pp. 100–113, 2011.
[20] M. Kulkarni, B. Walter, K. Pingali, G. Ramanarayanan, K. Bala, and L. Chew, Optimistic parallelism requires abstractions, Communications of the ACM, 52(9):89–97, 2009.
[21] C. Leiserson and T. Schardl, A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers), Proceedings of the 22nd Annual Symposium on Parallelism in Algorithms and Architectures (SPAA'10), pp. 301–314, 2010.
[22] D. McShan, S. Rao, and I. Shah, PathMiner: predicting metabolic pathways by heuristic search, Bioinformatics, 19(13):1692–1698, 2003.
[23] R. Mao, Distance-based indexing and its applications in bioinformatics, PhD thesis, University of Texas at Austin, 2007.
[24] D. Merrill, M. Garland, and A. Grimshaw, Scalable GPU graph traversal, Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12), pp. 117–128, 2012.
[25] A. Mislove, M. Marcon, P. Gummadi, P. Druschel, and B. Bhattacharjee, Measurement and analysis of online social networks, Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC'07), pp. 29–42, 2007.
[26] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 2005.
[27] G. Morrisett and M. Herlihy, Optimistic parallelization, Technical report, School of Computer Science, Carnegie Mellon University, 1993.
[28] P. Norvig, Teach Yourself Programming in Ten Years, http://norvig.com/21-days.html, 2001.
[29] E. Saule and U. Catalyurek, An early evaluation of the scalability of graph algorithms on the Intel MIC architecture, Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 1629–1639, 2012.
[30] B. Su, T. Brutch, and K. Keutzer, Parallel BFS graph traversal on images using structured grid, Proceedings of the IEEE 17th International Conference on Image Processing (ICIP'10), pp. 4489–4492, 2010.
[31] A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek, A scalable distributed parallel breadth-first search algorithm on BlueGene/L, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'05), p. 25, 2005.