Avoiding Locks and Atomic Instructions in Shared-Memory Parallel BFS Using Optimistic Parallelization
Jesmin Jahan Tithi, Dhruv Matani, Gaurav Menghani, Rezaul A. Chowdhury
Department of Computer Science, Stony Brook University
Stony Brook, New York, 11790, USA
E-mail: {jtithi, dmatani, gmenghani, rezaul}@cs.stonybrook.edu
Abstract—Dynamic load-balancing in parallel algorithms typically requires locks and/or atomic instructions for correctness. We have shown that sometimes an optimistic parallelization approach can be used to avoid locks and atomic instructions during dynamic load balancing. In this approach one allows potentially conflicting operations to run in parallel with the hope that everything will run without conflicts, and if occasional inconsistencies arise due to conflicts, one will be able to handle them without hampering the overall correctness of the program. We have used this approach to implement two new types of high-performance lockfree parallel BFS algorithms and their variants, based on centralized job queues and distributed randomized work-stealing, respectively. These algorithms are implemented using Intel Cilk++, and shown to be scalable and faster than two state-of-the-art multicore parallel BFS algorithms by Leiserson and Schardl (SPAA, 2010) and Hong et al. (PACT, 2011), where the algorithm described in the first paper is also free of locks and atomic instructions but does not use optimistic parallelization. Our implementations can also handle very efficiently the scale-free graphs that frequently arise in real-world scenarios such as the World Wide Web, social networks, biological interaction networks, etc.
Keywords-Optimistic parallelism, Lockfree, Breadth-First Search, BFS, Work-stealing, Cilk++.
I. INTRODUCTION
Optimistic parallelization is an approach where we allow
parallel execution of potentially conflicting code blocks,
provided we know how to handle conflicts if they actually
occur [27]. In this approach threads modify shared data
optimistically and try to detect conflicts, and if conflicts
arise, they undo the modifications or take recovery steps.
Such parallelization techniques are specifically used for
irregular problems where it is hard to exploit fine-grained
parallelism [20]. Variations of optimistic parallelization ap-
proaches have been used in many applications including
Delaunay mesh refinement, image segmentation using graph-
cut, agglomerative clustering, Delaunay triangulation, SAT
solver [20], [27], and so on. However, all of them typically
use locks or atomic instructions while recovering from
conflicts. In this paper we consider an optimistic paral-
lelization approach in which we allow multiple concurrent
threads to access and modify a shared data structure without
any lock and atomic instruction. We show that sometimes
problem-specific properties can be used to avoid the use
of locks and atomic instructions even during the handling
of the inconsistencies in the data structure that arise due
to the unprotected concurrent updates. We use such a
parallelization approach to implement two shared-memory
parallel BFS algorithms based on dynamic load-balancing.
We show that these algorithms outperform their lock-based
counterparts. To the best of our knowledge, there is only
one other BFS implementation [21] that performs dynamic
load-balancing without using locks and atomic instructions.
Our implementations outperform that one, too. However, that
implementation uses a complicated data structure (called a
bag) instead of optimistic parallelization to achieve the goal,
while our implementations use very simple array-based data
structures.
Breadth First Search (BFS) is one of the basic graph
search algorithms in which we explore a graph system-
atically level by level from a source vertex. Graphs are
used as the fundamental representational tool for solv-
ing problems in a wide range of application areas such
as analyzing social networks [25], consumer-product web
analysis, computational biology [22], intelligence analysis,
robotics, network analysis, and even image processing
[30]. All these applications require handling massive data,
and traditionally demand long processing times and substantial
computational resources. Therefore, efficient parallelization
of graph processing algorithms such as BFS is of utmost
importance. BFS is used as a building block for several
other important algorithms such as finding shortest paths and
connected components [7], [11], [13], graph clustering [8],
community structure discovery, max-flow computation and
the betweenness centrality problem [17]. High-performance
BFS is also used in PathMiner [22], for pattern searching in
DNA/RNA strings using Trie [16], and for the execution of
range queries of an MVP-index [23]. BFS is also used as
a graph benchmark application for ranking supercomputers
[3], [4].
BFS falls in the class of parallel algorithms where memory
accesses and work distribution are both irregular and data-
dependent [24]. Various techniques have been engineered so
far to parallelize the traditional BFS algorithm, such as using
distributed memory parallelism [30], [31], shared memory
parallelism [1], [4], [19], [21], centralized queues, distributed
queues or complicated concurrent data structures [2], [21]
and so on. Several GPU based implementations of BFS have
also been proposed [5], [19], [24]. Each year a number of
new approaches to efficient and scalable parallel BFS are
proposed [9], [10], [17], [29]. However, none of them uses
optimistic parallelization. Most
of the previous algorithms either use locks (fine-grained or
coarse-grained), or atomic instructions, or complicated data-
structures to parallelize BFS. In [10] the authors proposed
a lock and atomic-instruction free VIS data-structure (bit
array) that keeps track of visited vertices for BFS. They have
used static load balancing to divide the vertices, adjacency
lists, and the VIS data structure, and because of static
load balancing, no locks are required. In contrast, in
our BFS algorithms, we use dynamic load-balancing to
access and modify the shared queues (arrays). Note that
although lock or atomic instruction based resource protection
is very common, it has many disadvantages, including poor
scalability and inefficiency.
In this paper, we present two different types of
BFS algorithms and their variants that build on very simple
data structures and use optimistic parallelism to avoid the
use of locks and atomic instructions. We show that the
savings from optimistic parallelization outweigh the cost of
the redundant computations that occur in a lockfree
algorithm. We have used
centralized job queues and distributed randomized work-
stealing for dynamic load balancing in our parallel BFS
algorithms. Experimental results show that these algorithms
perform better than two state-of-the-art parallel BFS im-
plementations presented in [19] and [21] for graphs with
millions of vertices and billions of edges. Moreover, our
BFS algorithms with explicit work-stealing handle scale-free
graphs (graphs in which vertex degrees follow a power-law
degree distribution) very efficiently; such graphs arise very
frequently in real-world scenarios.
The rest of the paper is organized as follows. In Section
II, we discuss some related prior work on parallel BFS, and
in Section III, we highlight our contributions. In Section
IV, we present our proposed BFS algorithms, and Section V
presents experimental results and a detailed comparison with
other previous approaches, namely, [19] and [21]. Finally, in
Section VI we conclude our paper.
II. PRIOR WORK
In the standard serial BFS algorithm, a FIFO queue is
used to sequentially explore the vertices of a graph level by
level from a source vertex and find the levels (or shortest
distance) of other vertices from the source. In a parallel
BFS algorithm, typically all vertices of a given level and all
the neighbors of a given vertex can be explored in parallel.
Parallel BFS algorithms are typically level-synchronized,
i.e., a synchronization barrier is needed after each level
of the breadth first search. This also means that there
is a synchronization overhead for each BFS level which
increases linearly with the diameter of the graph. Moreover,
the amount of parallelism achievable at each BFS level is
constrained by the number of nodes/vertices at that particular
level.
In [2], the authors have proposed three different paral-
lel BFS algorithms, using a single concurrent queue, two
sequential queues with locks, and two concurrent queues,
respectively. They have used parallel iterators (i.e., iterators
that can iterate over an array of objects in parallel without
any determinacy race) to implement concurrent queues.
However, they reported results only for graphs with fewer
than 20 thousand vertices. The major drawbacks of their
approaches are idling of threads and the locking overhead.
A different approach to BFS has been presented by Su et
al. [30], based on a structured-grid graph representation
for image processing applications. In this algorithm,
at each BFS level each vertex checks whether it can be
visited from an already visited vertex, which is similar to
searching in the reverse direction compared to standard BFS
(bottom-up approach). Although this approach offers good
load-balancing opportunities among the processors, it does
much more unnecessary work than serial BFS.
Beamer et al. [5] have proposed a hybrid of top-down
(parent to child) and bottom-up (child to parent) exploration
of edges during BFS, and used atomic instructions for
ensuring mutually exclusive writes. A NUMA (Non-Uniform
Memory Architecture) aware graph traversal technique has
been proposed in [17] which used a work-stealing approach
where an idle thread steals from other neighboring threads
running on the same socket to improve cache efficiency.
They have used atomic instructions in their implementations,
too. Saule et al. [29] have proposed a block-accessed shared
queue data structure for implementing layered BFS and used
an atomic fetch-and-add to increment the queue index pointer.
This algorithm is similar to our centralized queue based
BFS algorithm, except that we do not use locks and atomic
instructions to change the queue index pointers.
Lastly, a hybrid approach for BFS has been proposed in
[19], in which an appropriate version of the BFS algorithm is
chosen from a) a serial version, b) two different multicore
versions, and c) a GPU version, based on the number of
vertices in the current level and the next level queues. They
have presented two level-synchronous parallel BFS algo-
rithms for multicores using a read-based method (random
arrays instead of queues) and a read+queue based approach
with/without using a visited array bitmap, respectively. Their
implementations use atomic instructions for updating the
visited vertices bitmap.
Although different optimistic parallelization techniques
have been used for efficiently parallelizing several irregular
problems, most of them use locks and/or atomic instructions
for resolving conflicts. Cledat et al. [12], [14] proposed
another type of optimistic parallelization where one traces
the data dependency and readability to decide whether two
operations can be executed in parallel, and demonstrated
its use for the graph coloring problem. However, the type
of optimistic parallelization techniques we are using differs
slightly from all of these. In this lock & atomic instruction
free optimistic parallelization, threads update global shared
data structures without any protection. However, threads
can detect inconsistencies and use their perceived values
to explore a segment of vertices from a queue only when
it is safe to do so. In case of inconsistency, threads retry
to get consistent values. We also exploit problem-specific
properties to maintain correctness of the algorithm.
III. OUR CONTRIBUTIONS
None of the prior known parallel BFS algorithms based
on dynamic load-balancing is simultaneously free of locks
(e.g., [2] uses locks), atomic instructions (e.g., [19] uses
atomic compare-and-swap), and complicated data structures
(e.g., [21] uses bags of reducers), unlike ours. Similarly,
none of them has considered lock-free, atomic-instruction-
free work-stealing with dynamic load-balancing,
which we have employed in our algorithm. In our work-
stealing method, if a thread becomes idle at a given BFS
level while other threads are still working, the idle thread
steals half of the remaining work from a random busy thread.
For the centralized queue based approaches, no thread sits
idle as long as there is some work in the system. In this way,
these approaches improve load balancing among the worker
threads, and often reduce slowdowns due to the irregular
shape of the input graph (e.g., sometimes the presence of
very high degree vertices). Main differences between our
algorithms and other BFS algorithms are: (i) no use of locks
and atomic instructions, (ii) no complicated data structures:
use of simple random-access arrays as queues, and (iii)
optimistic parallelization. As a consequence, our algorithms
are easy to implement. Our main contributions in this paper
are:
• Lockfree, atomic instruction free, and simple data struc-
ture based BFS algorithms with explicit dynamic load-
balancing.
• Two types of level-synchronous parallel BFS algo-
rithms for multicores with their variants:
– based on centralized queues
∗ one centralized queue
∗ multiple centralized queues (or distributed
queues)
– based on distributed randomized work-stealing
∗ with/without special considerations for scale-
free graphs
• Introduction of lock & atomic instructions free opti-
mistic parallelization for BFS
– We allow concurrent threads to write to shared
variables (updating queue indices) without locks
and atomic instructions. However, we make sure
that results are still correct and overhead of dupli-
cate exploration (a consequence of lock & atomic
instruction free update of queue indices) is as small
as possible.
• Comparison with two state-of-the-art publicly available
level synchronous parallel BFS algorithms
– Baseline1 [21]: To the best of our knowledge, this
is the only known state-of-the-art BFS algorithm
that avoids the use of locks and atomic instructions
during dynamic load-balancing. However, unlike
our algorithms it uses a complicated reducer-based
[18] recursive data structure (called a bag) pro-
vided by cilk++, and relies on cilk’s random-
ized work-stealing scheduler for efficiency. It also
does not use optimistic parallelization.
– Baseline2 [19]: This algorithm uses atomic instruc-
tions and performs better than one other efficient
multicore algorithm presented in [1].
We have also discussed how to optimize our algorithms for
NUMA machines.
IV. PROPOSED PARALLEL BFS ALGORITHMS FOR
MULTICORES
In this section we present sketches of our parallel BFS
algorithms. We have proposed two different types of BFS
algorithms and their variants based on centralized job queues
and randomized work-stealing. We use optimistic paral-
lelization in the following way: we allow multiple threads to
steal or grab a segment from shared distributed/centralized
queues without any locks or atomic instructions assuming
that nothing will go wrong. However, because of the absence
of locks and atomic instructions during the update of shared
queue indices, threads can pick:
• invalid segments (i.e., at least one of the queue indices
falls outside the actual queue range) or
• overlapping segments (i.e., segment is valid but over-
laps with other thread’s current queue segment) or
• stale segments (i.e., segment is valid but already ex-
plored by other threads).
While invalid segments may produce wrong results, or
even cause the program to crash, overlapping and stale
segments can only cause duplicate explorations. In our al-
gorithms, threads check for invalid segments during stealing
or fetching a segment from the queues. In case of failure
(i.e., actually picked an invalid segment), threads retry to
get a valid segment. In case of overlapping and stale
segments, threads skip the exploration as soon as they detect
this. The idea behind this type of optimistic parallelization
is that for BFS algorithms duplicate exploration does not
hamper correctness. Moreover, several tricks can be applied
to reduce duplicate explorations as much as possible. Note
that because of allowing duplicate exploration, some extra
overhead may be added to the system. On the other hand,
we are completely removing the overhead of locks2 and
atomic instructions which are known to create bottlenecks
when the number of threads increases. So, here the chal-
lenge of optimistic parallelization is to reduce the cost of
inconsistency/conflict detection and duplicate exploration to
such an extent that the total overhead does not negate the
total savings resulting from the avoidance of locks, atomic
instructions and other complicated data structures. Similar
techniques can also be used for other types of algorithms
where repeated work does not introduce inaccuracy in results
(e.g., DFS, IDA*, A*, other algorithms that use BFS, etc.).

In a scale-free graph, vertex degrees asymptotically follow
a power-law degree distribution, i.e., the number of vertices
with degree k is n_k ∼ c·n·k^(−γ), where c is a normalization
constant and γ typically lies between 2 and 3. Such a
distribution implies the presence of a few vertices of very
high degree. These high
degree vertices are known as the hotspots, and are the main
bottlenecks in achieving high speedup during parallel BFS.
Scale-free graphs arise in real-world scenarios very often
such as in web graphs, different types of collaboration
networks, homeland security graphs, airline networks,
and biological networks. As a result, scale-
free graphs are receiving an increased amount of attention
nowadays. In order to handle scale-free graphs efficiently,
we either divide the adjacency list of a high degree vertex
equally among the threads, or make sure that an idle thread
can even steal part of the adjacency lists of high degree
vertices.

In all our BFS algorithms, we use two arrays of queues
(these queues are basically randomly accessible arrays),
Qin[p] and Qout[p], to store vertices in the current level and
in the next level (assuming there are p cores or threads in
the system), respectively. When we are done exploring the
current BFS level's vertices, we swap these queues (Qin and
Qout) for the next level of exploration. We always add a
sentinel (0) at the end of each queue, which helps in ensuring
correctness of the lockfree algorithms.
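To make this concrete, the following minimal C++ sketch shows one possible layout of these per-level queues and the level swap. The type and member names are illustrative only; the actual Cilk++ implementation may differ, and vertices are assumed to be numbered from 1 so that 0 can double as the sentinel.

#include <utility>
#include <vector>

// Per-level queue layout (illustrative sketch, not the paper's code).
struct Queue {
    std::vector<int> buf;  // randomly accessible array used as a queue
    int f = 0;             // front: every slot to the left of f is explored
    int r = 0;             // rear: one past the last valid vertex

    void push(int v) { buf.push_back(v); ++r; }
    void seal()      { buf.push_back(0); }  // sentinel 0 terminates the queue
};

// One input and one output queue per thread. Qout[i] is private to thread i
// while a level is being explored, so pushing into it needs no protection.
struct LevelQueues {
    std::vector<Queue> Qin, Qout;
    explicit LevelQueues(int p) : Qin(p), Qout(p) {}

    void advance_level() {      // called once a level is fully explored
        std::swap(Qin, Qout);   // next level's input = this level's output
        for (Queue &q : Qout) { q.buf.clear(); q.f = q.r = 0; }
    }
};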
Table I shows the convention we have used for naming
our algorithms, and Table II shows the acronyms of the
presented algorithms. For simplicity of exposition we first
describe our algorithms using locks, and then explain how
to remove the locks.

Subscript Letter Used   Meaning
C                       Centralized
D                       Decentralized
L                       Lockfree
W                       Work-stealing
S                       Scale-free

Table I: Program Naming Convention

Acronym     Full Name
BFSC        Centralized (with locks)
BFSCL       Centralized + Lockfree
BFSW        Work-stealing (with locks)
BFSWL       Work-stealing + Lockfree
BFSWS       Work-stealing + Scale-free (with locks)
BFSWSL      Work-stealing + Scale-free + Lockfree
BFSDL       Decentralized + Lockfree
sbfs        Serial BFS
Baseline1   Implementations from [21]
Baseline2   Implementations from [19]

Table II: Program Acronyms

2. On a typical PC, locks are known to be more than 20 times slower than standard CPU operations [28].
A. Based on Centralized Queues
1) BFSC (Centralized (with locks)): In this algorithm,
the access to the centralized queue Qin is controlled by
a lock. The queues in Qin and the vertices in each such
queue are explored from left to right. We maintain a pair
of global indices 〈q, f〉 with the invariant that all vertices
in Qin[j] with j < q, and all vertices to the left of index
f in Qin[q] are already explored. All p threads try to fetch
the next available segment of length s (unless fewer than s
vertices remain in the current queue) from Qin by changing
〈q, f〉 using a global lock. If a thread is successful in fetching
a new segment, it advances 〈q, f〉 and starts exploring the
newly grabbed segment. While exploring the vertices, each
thread puts the newly discovered vertices in its own private
output queue (Qout[i], where i ∈ [0, p) is the thread id). At the
end of the exploration of all vertices in the current level
we swap Qin and Qout, and exploration starts again for the
next BFS level. Note that we change s adaptively after each
dispatch of a segment, based on the total number of vertices
in Qin and the number of worker threads, to make the work
division as efficient as possible.
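A minimal sketch of this lock-protected dispatch follows, reusing the Queue type from the earlier sketch. The adaptive computation of s is elided and all names are illustrative:

#include <algorithm>
#include <mutex>
#include <vector>

std::mutex dispatch_lock;  // the single global lock protecting <q, f>
int g_q = 0, g_f = 0;      // global queue index q and front index f

// Grab the next segment of at most s vertices from Qin; returns false when
// the current level is exhausted. [*lo, *hi) indexes into Qin[*q].buf.
bool fetch_segment(std::vector<Queue> &Qin, int s,
                   int *q, int *lo, int *hi) {
    std::lock_guard<std::mutex> guard(dispatch_lock);
    while (g_q < (int)Qin.size() && g_f >= Qin[g_q].r) {
        ++g_q;                 // this queue is drained; move to the next one
        g_f = 0;
    }
    if (g_q >= (int)Qin.size()) return false;  // nothing left at this level
    *q  = g_q;
    *lo = g_f;
    *hi = std::min(g_f + s, Qin[g_q].r);       // clip to the queue's rear
    g_f = *hi;                                 // advance <q, f> under the lock
    return true;
}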
2) BFSCL (Centralized + Lockfree): In this algorithm,
we use optimistic parallelism to avoid the use of locks and
atomic instructions. We maintain a global queue pointer q
with the invariant that all the vertices in the queues to the
left of Qin[q] have already been explored. For each queue,
Qin[k], we maintain a front pointer Qin[k].f initialized to
0, and maintain the invariant that all vertices to the left of
Qin[k].f in that queue have already been visited. Whenever
a thread needs to fetch a segment, it first stores q in a
local variable k. It then keeps incrementing k (if needed
and as long as necessary) to find the leftmost queue with
f ′ < Qin[k].r, where r is Qin[k]’s rear pointer, and f ′ is
a local variable that holds the value of Qin[k].f (front). As
soon as it finds such a k, it updates q to k, and Qin[k].f
to f′ + s. Observe that in case of two or more threads
changing q at the same time, q may end up updated to a
point to the left of where it should actually be, which can
result in a thread receiving a segment with vertices that are
Figure 1: Concurrent access to the current queue pointer by threads T1, T2, T3 can change the pointer to point to a previous location of Qin.
already visited. The Qin[q].f pointer can also get updated
backwards in a similar way, which may cause two threads
receiving the same segment for exploration (a predictable
consequence of not using locks). Figure 1 explains this
phenomenon. However, as mentioned before, this type of
duplicate exploration does not hamper the correctness of the
algorithm. Nevertheless, to reduce the frequency of duplicate
exploration, we use the following trick. Whenever a thread
reads a new vertex from the queue for exploration, it empties
that location (say, sets it to 0), and whenever a thread sees a 0
in the queue, it concludes that the segment has already been
explored or is under exploration by some other thread. So,
it simply stops at that point and retries to get a new segment
from the queues. Note that there is no possibility of creating
a gap in the queues, because a thread only stops when it
sees a 0 (rather than stopping by checking a rear pointer),
and a 0 can only appear either at the end of the queue, or
if the element has already been explored. As before, while
exploring the vertices, each thread puts the newly discovered
vertices in its own private output queue Qout[i], where i
denotes the thread id. After finishing the exploration of all
vertices from all queues in the current level, we swap Qin
and Qout, and exploration starts again for the next level.
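The sketch below shows one way the unprotected dispatch and the clear-and-stop trick could look; it is an illustration, not the paper's code. The races on q and Qin[k].f are deliberate: at worst they hand out an already-explored or overlapping segment, which the 0-check detects. Since such races are formally undefined behavior in the C++11 memory model, the sketch expresses the shared indices as relaxed atomics, which compile to plain loads and stores on x86 and involve no locks and no atomic read-modify-write instructions:

#include <atomic>
#include <vector>

struct LFQueue {
    std::vector<int> buf;   // vertex slots; 0 = sentinel or "already taken"
                            // (a strictly conforming version would make the
                            // slots relaxed atomics as well)
    std::atomic<int> f{0};  // front; relaxed ops compile to plain MOVs on x86
    int r = 0;              // rear; fixed for the duration of a level
};

std::atomic<int> g_q{0};    // global queue pointer, also updated unprotected

// Try to grab the next segment of s vertices: no locks, no CAS, no fetch-add.
bool fetch_segment_lockfree(std::vector<LFQueue> &Qin, int s,
                            int *q, int *lo) {
    for (int k = g_q.load(std::memory_order_relaxed);
         k < (int)Qin.size(); ++k) {
        int f2 = Qin[k].f.load(std::memory_order_relaxed);  // local copy f'
        if (f2 < Qin[k].r) {            // leftmost queue with work remaining
            g_q.store(k, std::memory_order_relaxed);        // may race: fine
            Qin[k].f.store(f2 + s, std::memory_order_relaxed);
            *q = k;
            *lo = f2;
            return true;
        }
    }
    return false;
}

// Explore up to s slots starting at lo, clearing each slot as it is read.
// Hitting a 0 means "sentinel, or another thread got here first": abort.
template <class Visit>
void explore_segment(LFQueue &q, int lo, int s, Visit visit) {
    for (int i = lo; i < lo + s && i < (int)q.buf.size(); ++i) {
        int v = q.buf[i];   // racy read, tolerated by the algorithm
        q.buf[i] = 0;       // clear so that later readers skip this slot
        if (v == 0) break;  // duplicate/stale segment detected
        visit(v);           // push v's unvisited neighbors into our Qout
    }
}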
3) BFSDL (Decentralized + Lockfree): This algorithm
builds on BFSCL. However, rather than having one cen-
tralized queue, we now have j centralized queues for some
j ∈ [1, p], where each centralized queue consists of ⌈p/j⌉
queues from Qin. Note that j = 1 means it is a purely
centralized approach like BFSCL, whereas j = p means
purely distributed. At the beginning of each BFS level, each
thread picks a random centralized queue, and whenever the
thread becomes idle, it fetches the next available segment
from that centralized queue pool, and explores the vertices
from that segment. However, if the chosen centralized queue
pool is empty, it randomly tries at most cj log j times (where
c > 1 is a constant) to get a new nonempty centralized
queue. If it succeeds in finding such a queue, it explores
vertices from it in the same way as before. It can be proved
using the balls and bins model [26] that w.h.p. it takes no
more than cj log j tries to check each centralized queue at
least once for work provided c > 1. This process continues
until all queues become empty, and the next level of BFS
starts.
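A sketch of this give-up rule follows: an idle thread probes random centralized queue pools at most about c·j·log j times before quitting the level. The Pool type and its empty() test are stand-ins for the paper's bookkeeping, not its actual interface:

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

struct Pool {               // stand-in for one centralized queue pool
    int remaining = 0;      // rough count of unexplored vertices
    bool empty() const { return remaining <= 0; }
};

// Returns the index of a nonempty pool, or -1 after roughly c*j*log(j)
// failed probes, which w.h.p. suffices to have tested every pool at least
// once (balls-and-bins argument [26], with c > 1).
int pick_nonempty_pool(const std::vector<Pool> &pools, double c) {
    int j = (int)pools.size();
    int tries = (int)std::ceil(c * j * std::log((double)std::max(j, 2)));
    while (tries-- > 0) {
        int cand = std::rand() % j;             // probe a random pool
        if (!pools[cand].empty()) return cand;  // found one with work
    }
    return -1;                                  // give up for this BFS level
}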
B. Based on Distributed Randomized Work-stealing
1) BFSW (Work-stealing (with locks)): In this algo-
rithm, during the exploration of any given BFS level, the ith
(for 0 ≤ i ≤ p− 1) thread starts working on the vertices of
Qin[i]. However, whenever a thread becomes idle, it chooses
another random thread (victim) with enough work, and steals
half of its work (right half of the victim’s segment). Each
thread tries at most cp log p times to find a victim, where
c > 1 is a small constant3. Otherwise, it quits for that level
of BFS. Threads use locks for safe stealing (changing the
queue segment indices) to avoid the exploration of an invalid
segment.
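A sketch of the locked steal, with per-thread state held in an assumed Worker struct; using try_lock keeps the lock wait time O(1), since a thief that cannot immediately acquire a victim's lock simply moves on to another random victim:

#include <mutex>

struct Worker {
    std::mutex m;        // protects this worker's segment bounds in BFSW
    int q = 0;           // queue id the worker is exploring
    int f = 0, r = 0;    // unexplored segment is [f, r)
};

// Thief tries to take the right half of victim's segment. Returns false if
// the victim is busy (lock already held) or has too little work to share.
bool try_steal(Worker &victim, Worker &thief) {
    if (!victim.m.try_lock()) return false;  // O(1) wait: just try elsewhere
    bool ok = (victim.r - victim.f) > 1;     // enough work to split?
    if (ok) {
        int mid = victim.f + (victim.r - victim.f) / 2;
        thief.q = victim.q;                  // thief takes [mid, r)
        thief.f = mid;
        thief.r = victim.r;
        victim.r = mid;                      // victim keeps [f, mid)
    }
    victim.m.unlock();
    return ok;
}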
2) BFSWL (Work-stealing + Lockfree): This algorithm
works in the same way as BFSW except that here we
use optimistic parallelization to get rid of locks and atomic
instructions. To ensure correctness of the lockfree work-
stealing, a thread performs sanity checks on the segment
boundaries, and changes the queue/segment boundary vari-
ables based on locally saved states of the global queue
segment variables. This helps in maintaining the correctness
of the whole process. We assume that at the start of a BFS
level Qin[q].r holds the rear pointer of Qin[q] for every q,
and this variable remains unchanged throughout that level.
Every thread maintains three variables, namely, q, f and
r to keep track of the queue id, front pointer and rear
pointer, respectively, of the segment of vertices it is currently
working on. Initially, thread t ∈ [0, p) gets the entire Qin[t]
as a single segment. As it explores the segment, it keeps
updating its own f pointer accordingly. In order to reduce
the chances of duplicate exploration of the same vertex by
other threads, a thread clears (i.e., sets to 0) every location
of the queue segment as soon as it reads the vertex stored
in that location. A thread aborts working on a segment as
soon as it encounters a 0 value (i.e., a cleared value) in the
segment. Whenever the thread runs out of work, it chooses a
random thread with enough work, and tries to steal half of its
work (i.e., right half of its unexplored segment of vertices).
The thief first saves the queue id q, front pointer f and rear
pointer r of the victim’s segment to local variables q′, f ′
and r′, respectively. It then performs the following sanity
check: f ′ < r′ ≤ Qin[q′].r. If the check fails (meaning
the victim has possibly moved to another queue, and the
retrieved segment is invalid), the thief aborts this steal and
tries another random victim. Otherwise it updates its own q,
f and r pointers to q′, f′ + (r′ − f′)/2, and r′, respectively,
and the victim's r pointer to f′ + (r′ − f′)/2. It does not
change the victim’s q and f pointers. Observe that as no
thread checks its own rear pointer while exploring, any
invalid change to the rear pointer of the segment (which may
happen due to not using locks and atomic instructions) does
not hamper correctness. If a thief changes a rear segment
3. The logic behind trying cp log p times follows from the same balls and bins model [26] as mentioned in Section IV-A3.
pointer of the victim to any invalid location, its consequence
will only be that no other thread will be able to steal from
that particular victim for some time until the victim itself
becomes a thief and changes its own rear pointer. On the
contrary, if a thief gets an invalid segment from a victim,
using the sanity checks described above, it can safely
avoid that segment and retry for a valid one.
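The lockfree counterpart is sketched below with illustrative names, reusing the LFQueue type from the earlier sketch: all victim fields are first copied into locals q′, f′, r′, validated, and only then committed. The unprotected write to the victim's rear pointer may race with other thieves, but as explained above that at worst delays future steals from this victim:

#include <vector>

struct LFWorker {            // like Worker above, but with no mutex at all
    int q = 0, f = 0, r = 0;
};

bool try_steal_lockfree(const std::vector<LFQueue> &Qin,
                        LFWorker &victim, LFWorker &thief) {
    int q2 = victim.q, f2 = victim.f, r2 = victim.r;  // local snapshots
    if (q2 < 0 || q2 >= (int)Qin.size()) return false;
    if (!(f2 < r2 && r2 <= Qin[q2].r))   // sanity check: f' < r' <= Qin[q'].r
        return false;                    // invalid segment: try another victim
    int mid = f2 + (r2 - f2) / 2;
    thief.q = q2;                        // thief takes the right half [mid, r')
    thief.f = mid;
    thief.r = r2;
    victim.r = mid;   // unprotected write; another thief may clobber it, which
                      // only makes this victim temporarily un-stealable
    return true;
}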
3) BFSWSL (Work-stealing + Scale-free + Lockfree): This algorithm uses an approach similar to that used in
BFSWL. However, the vertices of each level are explored in
two phases. In the first phase, the threads only explore the
low-degree vertices using explicit work-stealing as before
and push the higher degree vertices into a separate queue,
Qs (the definition of high degree can be changed using a
threshold variable). At the end of this phase, we divide the
adjacency list of each vertex from Qs into p chunks, and
for 1 ≤ i ≤ p, the ith thread explores the ith chunk of
the adjacency list (phase 2). No work-stealing happens in
this phase. We have also experimented with another variant
of BFSWSL which uses the work-stealing mechanism in the
second phase, too. In this version, a thread is allowed to
steal half of the remaining unexplored adjacency list of a
vertex, if there is only one vertex left in the queue. However,
this approach often performed worse than the first approach
in our experiments.
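A sketch of phase 2, which splits each deferred hotspot's adjacency list evenly across the p threads, is given below. The CSR-style Graph interface, the visit() helper, and the degree threshold are our stand-ins, not details from the paper; cilk_for reflects the paper's Cilk setting and needs a Cilk-enabled compiler:

#include <cilk/cilk.h>   // Cilk Plus header; the paper's code used Cilk++
#include <vector>

struct Graph {                         // assumed CSR-style graph interface
    std::vector<int> offset, adj;
    int degree(int v) const { return offset[v + 1] - offset[v]; }
    int neighbor(int v, int e) const { return adj[offset[v] + e]; }
};

void visit(int v);  // assumed: marks v and pushes it into the caller's Qout

// Phase 2: each hotspot's adjacency list is cut into p equal chunks and the
// i-th thread explores the i-th chunk; no work-stealing in this phase.
void explore_hotspots(const Graph &g, const std::vector<int> &Qs, int p) {
    for (int v : Qs) {                        // hotspots deferred in phase 1
        long long deg = g.degree(v);
        cilk_for (int i = 0; i < p; ++i) {
            long long lo = deg * i / p;       // chunk boundaries
            long long hi = deg * (i + 1) / p;
            for (long long e = lo; e < hi; ++e)
                visit(g.neighbor(v, (int)e));
        }
    }
}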
4) BFSWS (Work-stealing + Scale-free (with locks)): This algorithm is similar to BFSWSL except that in the
work-stealing phase threads use locks to change the queue
segments, which completely removes any possibility of
having invalid segments.
C. Extension to NUMA Architecture
It is not difficult to optimize our algorithms for NUMA.
For example, for the decentralized algorithm (BFSDL), we
make sure that all threads that are initially assigned to the
same centralized queue are launched on the cores of the
same socket.4 When a group of threads finishes exploring
the vertices from their centralized queue, each of them can
migrate to another random queue allocated on the same
socket. If no queue is available on the same socket,
it explores from a queue allocated on another socket. This can
also be done by assigning higher priorities to centralized
queues allocated on the same socket and lower priorities to
others. For the work-stealing based algorithms, we can use
the following approach. While stealing, a thread randomly
chooses a thread running on the same socket with higher
priority. In case of failure to get a thread from the same
socket with enough work, a thread steals from threads
running on other sockets. A NUMA aware work-stealing
approach for the betweenness centrality problem has been
proposed in [17] which can also be followed.
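For instance, victim selection could be biased toward the thief's own socket as in the sketch below; the contiguous thread-to-socket mapping, the retry count, and the has_enough_work() helper are assumptions, not details from the paper:

#include <cstdlib>

bool has_enough_work(int t);  // assumed: does thread t have work to split?

// Pick a steal victim, preferring threads on the thief's own socket.
// Assumes p > 1 threads laid out contiguously across sockets.
int choose_victim(int self, int p, int per_socket, int local_tries) {
    int base = (self / per_socket) * per_socket;   // first thread on my socket
    for (int t = 0; t < local_tries; ++t) {        // same-socket attempts first
        int v = base + std::rand() % per_socket;
        if (v != self && v < p && has_enough_work(v)) return v;
    }
    int v;                                         // fall back to any socket
    do { v = std::rand() % p; } while (v == self);
    return v;
}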
4. Cilk++ does not allow setting thread affinities, and so, we can use OpenMP instead.
D. Discussion: Further Improvements
Note that none of our algorithms has used any technique
for removing duplicate vertices from the queues. To remove
duplicate vertices (or to prevent duplicate exploration by
different threads) from queues, one can use locks or atomic
instructions and/or a bitmap of visited vertices, as used in
[19]. It is also possible to rely on the arbitrary-concurrent-
write property to record only one parent of a vertex (since a
vertex can have multiple parents), as done in [6]. However,
we plan to use the following method for reducing duplicate
exploration of vertices even further. Each thread will store
the queue id (or parent id) of a vertex in a global array
while exploring (using arbitrary concurrent write), and it
will also check the queue id (or parent id) before exploring
a vertex. If the current queue id matches the stored
value, the thread explores the vertex, otherwise it skips that
vertex. Note that this approach does not require any locking
or atomic instructions. Avoiding duplicate explorations can
be beneficial for dense and low diameter graphs, where the
number of duplicate vertices can be huge.
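A sketch of this planned parent-id filter follows; it relies on the stated arbitrary-concurrent-write assumption (when several threads write parent[v] simultaneously, exactly one write survives), and the names are illustrative:

#include <vector>

std::vector<int> parent;   // parent[v] = queue (or thread) id that claimed v
                           // (sized to the number of vertices, reset per run)

// Claim v for my_id and explore it only if our write is the one that survived.
// No locks and no atomic RMW: concurrent writers simply race, one value wins,
// and most losers see the winner's id and skip v. This greatly reduces, though
// does not fully eliminate, duplicate exploration.
bool claim_vertex(int v, int my_id) {
    parent[v] = my_id;             // unprotected concurrent write
    return parent[v] == my_id;     // read back: did our write survive?
}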
We are planning to implement another variant, in which
we will divide the edges evenly instead of the vertices,
while using dynamic load-balancing as before. We expect
this approach to be more scalable.
In [29], the authors have shown that a graph traversal
algorithm implemented using Intel Cilk Plus is 2–3 times
slower than its OpenMP based implementation on Intel MIC
with 120 cores. It would be interesting to see whether
OpenMP based implementations of our algorithms show
similar trends.
V. EXPERIMENTAL RESULTS
All experiments included in this section were performed
on a single node of the Lonestar 4 computing cluster located
at the Texas Advanced Computing Center (TACC), and the
Trestles cluster at the San Diego Supercomputer Center
(SDSC). The properties of the simulation environment are
summarized in Table III.
Attribute                   Lonestar                        Trestles
Processors                  3.33 GHz hexa-core 64-bit       2.4 GHz 8-core AMD
                            Intel Westmere                  Magny-Cours
Cores/node                  12                              32
RAM size and memory speed   24 GB, 177 GB/s                 64 GB, 171 GB/s
OS                          Linux CentOS 5.5                Linux CentOS 5.5
Cache                       12 MB L3 (shared in same        12 MB L3 (shared in same
                            socket), 256 KB private L2,     socket), 512 KB private L2,
                            64 KB private L1                128 KB private L1

Table III: Simulation Environment
We have tested all of our parallel BFS algorithms on
real-world graphs such as cage15, cage14, kkt-power,
freescale, and Wikipedia-2007 from the Florida Sparse Ma-
trix Collection [15]. We have also tested the programs on
Graph       Description                                       n       m       Diameter
Cage15      DNA electrophoresis, 15 monomers in polymer       5.2M    99.2M   53
Cage14      DNA electrophoresis, 14 monomers in polymer       15.1M   27.1M   42
Freescale   Large circuit, Freescale Semiconductor            3.4M    18.9M   141
Wikipedia   Gleich/Wikipedia-20070206                         3.6M    45M     14
kkt-power   Optimal power flow, nonlinear optimization (KKT)  2M      8.1M    11
RMAT100M    RMAT graph generated using the Graph-500          10M     100M    12
            RMAT generator
RMAT1B      RMAT graph generated using the Graph-500          10M     1B      5
            RMAT generator

Table IV: Graphs and their properties. In this table, n and m denote the number of vertices and the number of edges of the graph, respectively. The diameters in the table show the maximum diameter explored by the BFS rather than the actual diameter of these graphs.
synthetic random RMAT graphs generated using the Graph-
500 RMAT generator5 with millions of vertices and up to
a billion edges. All graphs were directed. The properties
of these graphs are summarized in Table IV.
We have compared all variants of our parallel BFS algo-
rithms6 mentioned in Section IV with the BFS implementa-
tions of Baseline1 [21] and Baseline2 [19]. We collected
the source codes from the authors of [21] and [19] and
ran all these programs on the graphs listed in Table IV.
The algorithms presented in [19] had both CPU and GPU
based implementations, but we compared only with the four
multicore based CPU implementations. In [19] and [21] the
authors used the pthreads and Cilk++ concurrency platforms
for parallelism, respectively. All the programs (including
codes from [19] and [21]) were compiled using the -O3
optimization flag. No hyperthreading was used. We
ran all programs for 1000 random non-zero degree source
vertices, and computed the average running time per source.
We found that our work-stealing algorithms optimized for
scale-free graphs almost always perform better than the
corresponding unoptimized variant, even on general graphs.
Hence, we have not reported the results for those unopti-
mized work-stealing variants.
Table V shows the running times of all these algo-
rithms on a single machine. Tables V(a) and V(b) show
the running times of different algorithms run on a sin-
gle compute node of Lonestar and Trestles, respectively.
Observe that each lockfree version generally runs faster
than the corresponding lock-based version. Table V also
shows that the centralized queue based BFS implementations
perform better than the work-stealing based approaches on
Lonestar (with 12 cores/node), whereas on Trestles (with 32
5. Parameters used: a = 0.45, b = 0.15, and c = 0.15.
6. The decentralized algorithm was run with 1 centralized queue.
cores/node) the work-stealing BFS algorithms show better
performance. One possible reason for this behavior may be
as follows. For all our algorithms, the number of accesses
to the shared queue(s) increases as the number of threads
increases. However, in the work-stealing implementation,
steal attempts are more or less evenly distributed among all
queues. Thus, the number of simultaneous accesses to each
queue in those implementations increases at a much slower
rate than the number of accesses into the single shared
queue pool of the centralized queue based implementations.
This means that the overhead of locks increases at a faster
rate in the centralized queue implementation compared to
the lock-based work-stealing implementation. Similarly, in
the lockfree work-stealing version far fewer cases of dupli-
cate/invalid/stale segment extraction and the corresponding
overhead occur compared to the lockfree centralized version.
As a result, though the centralized versions run faster when
p is smaller, they start to slow down w.r.t. the work-stealing
versions as p increases, and at some point, the work-
stealing versions become faster. Also note that the lock
wait time (i.e., time between requesting and acquiring a
lock) for each thread in the work-stealing algorithm with
locks is O(1) (using try_lock()). On the other hand, for the
centralized queue based approach, the wait time can be as
high as Θ(p).

Figures 2(a) and 2(b) show the scalability of our algo-
rithms on Lonestar and Trestles, respectively. We have run
the lockfree versions of our algorithms on the scale-free
Wikipedia graph, and varied the number of worker threads.
The plots show that the centralized queue based versions
are not scalable beyond 20 cores, while the work-stealing
version remains scalable till the end (i.e., up to 32 cores). We
believe that in addition to the reason given before, the fact that
the work-stealing version is optimized for scale-free graphs
has contributed to its scalability.
For all the real-world graphs, our best performing BFS im-
plementation was better than both the implementations from
Baseline1 [21] and Baseline2 [19]. Our algorithms perform
the best for scale-free graphs and sparse graphs. However,
for the synthetic RMAT graphs (graph-10M-100M and
graph-10M-1B) Baseline2 implementations performed
slightly better than ours and the Baseline1 implementation.
The Baseline2 implementation (local queue + read + bitmap)
runs faster only for graph-10M-1B on Lonestar,
and for both graph-10M-1B and graph-10M-100M
on Trestles. The possible reason for this is that the Baseline2
implementation uses a bitmap to track visited vertices in the
queues/read arrays (which in turn helps in avoiding duplicate
exploration of the same vertex by different threads). Note
that although the same vertex can appear only once in
a particular thread’s output queue Qout[i], it is possible
that the same vertex may appear once in the output queue of
every thread. This happens if it is discovered by all the
threads at exactly the same time.
[Table V: Running times of different algorithms (all times are shown in milliseconds); (a) running times (ms) on Lonestar (12 cores), (b) running times (ms) on Trestles (32 cores). Here, a blue-colored cell means the global best running time for each row, and a green-colored cell indicates the best among ours when it is not the global minimum. The numeric entries of this table are not legible in this copy.]
The graph graph-10M-1B, with 10M vertices and 1B edges, is
a very dense graph (in fact the densest graph we have used)
resulting in many duplicate vertices (as a vertex can have
many parents). As the Baseline2 implementation keeps
track of visited vertices using an atomic compare-and-swap
instruction, and removes the bulk of the overhead of duplicate exploration of a
very high degree vertex by several threads, it runs faster for
dense graphs. Therefore, algorithms like ours that do not
track visited vertices explicitly or do not remove duplicate
vertices from queues will slow down due to a high number
of duplicate explorations. We have not included the plots for
the largest graph and the smallest graph in our input set (i.e.,
the RMAT graph graph-10M-1B, or kkt-power) because
including them in the plots makes the curves for smaller
graphs indistinguishable.
Table VI shows some statistics on the steal attempts made
by threads in BFSWS and BFSWSL. Both implementa-
tions were run 5 times from 100 sources of the Wikipedia
Graph on a single Lonestar node, and average values were
computed for the number of successful steal attempts as
well as different types of failed steal attempts. We observe
that though the total number of steal attempts in BFSWSL
was slightly larger than that in BFSWS , the percentage of
successful steal attempts was also higher in BFSWSL. Also
observe that the number of failed steal attempts as a result of
a victim being idle was also lower in BFSWSL. Recall that a
thread becomes idle when it runs out of work and gives up
searching for work after a certain (say, MAX STEAL)
number of failed steal attempts. Thus BFSWSL achieved
better load balancing than BFSWS which was translated
into a better running time for BFSWSL. Since BFSWSL
did not use locks, there were no failed steal attempts as a
result of choosing a victim that was already locked. Instead
in BFSWSL more steal attempts failed because the segment
obtained by the thief was either too small (e.g., could happen
if the victim did not have any work and so was also trying
to steal), or stale (e.g., could happen if two thieves were
stealing from the same victim with work), or invalid (e.g.,
could happen if more than one thief were trying to steal from
the same victim and thus messed up the queue indices).
In both implementations most of the steal attempts failed
because of the large value used for MAX STEAL. A large
MAX STEAL results in a large number of failed steal
attempts at the end of each level which is reflected in the
large number of steal attempts that failed because of idle
victims.
VI. CONCLUSION
In this paper, we have presented two different types of
lockfree parallel BFS algorithms along with their variants
based on centralized job queues and distributed random-
ized work-stealing. These algorithms use a novel optimistic
parallelization technique to avoid any kind of locks and
[Figure 2: Scalability of our lockfree parallel BFS algorithms running on (a) Lonestar and (b) Trestles. All algorithms were run on the Wikipedia graph only. The plots are not legible in this copy.]
atomic instructions. We have shown that lockfree algorithms
are typically faster than the corresponding locked versions.
Although work-stealing is an old technique for dynamic
load balancing, lockfree work-stealing is novel for BFS.
Experimental results show that these algorithms perform
very well for massive, scale-free and sparse graphs and
achieve better performance compared to two other state-of-
the-art algorithms. We have several interesting ideas that we
plan to try next for BFS including extending this lock and
atomic instruction free optimistic parallelization technique
to other graph traversal algorithms such as IDA*, A*, etc.,
and to other important application areas of BFS itself.
Implementation and analysis of these BFS algorithms on
GPUs, Intel MIC, and clusters of multicores would also
be interesting to explore.
ACKNOWLEDGMENT
Thanks to all authors of [21] and [19] for sharing their codes with us. Special thanks to T. B. Schardl for further explaining his code. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.
[Figure 3: Performance in terms of Traversed Edges per Second (TEPS) when processing real-world graphs on (a) Lonestar (12 cores) and (b) Trestles (32 cores). The plots are not legible in this copy.]
REFERENCES
[1] V. Agarwal, F. Petrini, D. Pasetto, and D. Bader, Scalable graph exploration on multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'10), pp. 1–11, 2010.
[2] L. Akeila, O. Sinnen, and W. Humadi, Object oriented parallelisation of graph algorithms using parallel iterator, Proceedings of the 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC'10), pp. 41–50, 2010.
[3] M. Anderson, Better benchmarking for supercomputers, IEEE Spectrum, pp. 12–14, 2011.
[4] D. Bader and K. Madduri, Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2, Proceedings of the International Conference on Parallel Processing (ICPP'06), pp. 523–530, 2006.
[5] S. Beamer, K. Asanovic, and D. Patterson, Direction-optimizing breadth-first search, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'12), p. 12, 2012.
[6] G. Blelloch and B. Maggs, Parallel algorithms, Algorithms and Theory of Computation Handbook, Chapman & Hall/CRC, 2010.
[7] G. Brodal, R. Fagerberg, U. Meyer, and N. Zeh, Cache-oblivious data structures and algorithms for undirected breadth-first search and shortest paths, Proceedings of the 9th Scandinavian Workshop on Algorithm Theory (SWAT'04), pp. 480–492, 2004.
[8] A. Beckmann and U. Meyer, Deterministic graph-clustering in external-memory with applications to breadth-first search, unpublished manuscript, 2009.
Program: BFSWS. Time: 7.72 s. Total steal attempts: 732,535 (100.00%).
  Failed steal attempts: victim locked 265,198 (36.20%); victim idle
  271,731 (37.09%); segment too small 137,675 (18.79%); stale segment
  49,387 (6.74%); invalid segment N/A; total failed 723,991 (98.83%).
  Successful steal attempts: 8,544 (1.17%).

Program: BFSWSL. Time: 7.53 s. Total steal attempts: 734,535 (100.00%).
  Failed steal attempts: victim locked N/A; victim idle 268,710 (36.58%);
  segment too small 399,840 (54.43%); stale segment 56,849 (7.74%);
  invalid segment 221 (0.03%); total failed 725,620 (98.79%).
  Successful steal attempts: 8,915 (1.21%).

Table VI: Statistics of successful and failed steal attempts on the Wikipedia graph when run from 100 sources. For each program we report the average of 5 independent runs.
[9] A. Buluc and K. Madduri, Parallel breadth-first search on distributed memory systems, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'11), pp. 1–12, 2011.
[10] J. Chhugani, N. Satish, J. Sewall, C. Kim, and P. Dubey, Fast and efficient graph traversal algorithm for CPUs: Maximizing single-node efficiency, Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS'12), pp. 378–389, 2012.
[11] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 3rd Edition, MIT Press, 2009.
[12] R. Cledat and S. Pande, Discovering optimistic data-structure oriented parallelism, Proceedings of the 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar'12), 2012.
[13] R. Chowdhury and V. Ramachandran, Cache-oblivious shortest paths in graphs using buffer heap, Proceedings of the 16th Annual Symposium on Parallelism in Algorithms and Architectures (SPAA'04), pp. 245–254, 2004.
[14] R. Cledat, Programming models for speculative and optimistic parallelism based on algorithmic properties, PhD thesis, Georgia Institute of Technology, 2012.
[15] T. Davis, University of Florida sparse matrix collection, http://www.cise.ufl.edu/research/sparse/matrices/.
[16] T. Flouri, C. Iliopoulos, M. Rahman, L. Vagner, and M. Voracek, Indexing factors in DNA/RNA sequences, Bioinformatics Research and Development, pp. 436–445, 2008.
[17] M. Frasca, K. Madduri, and P. Raghavan, NUMA-aware graph mining techniques for performance and energy efficiency, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'12), p. 95, 2012.
[18] M. Frigo, P. Halpern, C. Leiserson, and S. Lewin-Berlin, Reducers and other Cilk++ hyperobjects, Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures (SPAA'09), pp. 79–90, 2009.
[19] S. Hong, T. Oguntebi, and K. Olukotun, Efficient parallel graph exploration on multicore CPU and GPU, Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'11), pp. 100–113, 2011.
[20] M. Kulkarni, B. Walter, K. Pingali, G. Ramanarayanan, K. Bala, and L. Chew, Optimistic parallelism requires abstractions, Communications of the ACM, 52(9):89–97, 2009.
[21] C. Leiserson and T. Schardl, A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers), Proceedings of the 22nd Annual Symposium on Parallelism in Algorithms and Architectures (SPAA'10), pp. 301–314, 2010.
[22] D. McShan, S. Rao, and I. Shah, PathMiner: predicting metabolic pathways by heuristic search, Bioinformatics, 19(13):1692–1698, 2003.
[23] R. Mao, Distance-based indexing and its applications in bioinformatics, PhD thesis, University of Texas at Austin, 2007.
[24] D. Merrill, M. Garland, and A. Grimshaw, Scalable GPU graph traversal, Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12), pp. 117–128, 2012.
[25] A. Mislove, M. Marcon, P. Gummadi, P. Druschel, and B. Bhattacharjee, Measurement and analysis of online social networks, Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC'07), pp. 29–42, 2007.
[26] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, Cambridge University Press, 2005.
[27] G. Morrisett and M. Herlihy, Optimistic parallelization, Technical report, School of Computer Science, Carnegie Mellon University, 1993.
[28] P. Norvig, Teach Yourself Programming in Ten Years, http://norvig.com/21-days.html, 2001.
[29] E. Saule and U. Catalyurek, An early evaluation of the scalability of graph algorithms on the Intel MIC architecture, Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 1629–1639, 2012.
[30] B. Su, T. Brutch, and K. Keutzer, Parallel BFS graph traversal on images using structured grid, Proceedings of the IEEE 17th International Conference on Image Processing (ICIP'10), pp. 4489–4492, 2010.
[31] A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek, A scalable distributed parallel breadth-first search algorithm on BlueGene/L, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC'05), p. 25, 2005.