Knowledge and Information Systems
https://doi.org/10.1007/s10115-019-01328-3

REGULAR PAPER

PPR-partitioning: a distributed graph partitioning algorithm based on the personalized PageRank vectors in vertex-centric systems

Nasrin Mazaheri Soudani1 · Afsaneh Fatemi1 · Mohammadali Nematbakhsh1

Received: 24 April 2018 / Revised: 7 January 2019 / Accepted: 10 January 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019

Abstract
Relations among data items can be modeled with graphs in most big data sets, such as social network data. This modeling creates big graphs with many vertices and edges. Balanced k-way graph partitioning is a common problem with big graphs and has applications in several fields. There are many approximate solutions for this problem; however, most of them do not scale to big graphs and cannot be executed in a distributed manner. The vertex-centric model has been introduced recently as a scalable distributed processing method for big graphs. There are a few methods for graph partitioning based on this model. Existing approaches consider only the one-step neighbors of vertices for graph partitioning and do not consider higher-step neighbors. In this paper, a distributed method is introduced based on the vertex-centric model for balanced k-way graph partitioning. This method applies the personalized PageRank vectors of vertices and partitions to decide how vertices join partitions. The method has been implemented in the Giraph system and evaluated with several synthetic and real graphs. Experimental results show that the method scales to big graphs. It also produces partitions of higher quality than the state-of-the-art stream-based methods and the distributed methods based on the vertex-centric programming model. Its results are close to those of the Metis method.

Keywords Graph partitioning · Big graphs · Personalized PageRank · Vertex-centric systems

B Afsaneh [email protected]

Nasrin Mazaheri [email protected]

Mohammadali [email protected]

1 Department of Computer Engineering, Faculty of Software Engineering, University of Isfahan, HezarJerib Ave, Isfahan, Iran

1 Introduction

Today, data are generated at an increasing rate due to the development of the Web, social networks and data gathering technologies such as sensors and smart phones. This creates big data sets with high volumes. In most big data sets, each data item has relations with some other data items. These relations can be modeled by a graph in which vertices represent data items and edges represent the relationships among them. This modeling creates big graphs with many vertices and edges.

In 2010, Google introduced the Pregel system [26], based on the vertex-centric programming model, for distributed processing of big graphs. This system is highly scalable for graph computations. In this model, a graph algorithm is described by a function that is executed repeatedly for each graph vertex. Each vertex may be active or inactive, and the function is executed only for active vertices. All vertices are active at first. When the function is executed for an active vertex, the data values of adjacent vertices or edges are read, the vertex data are updated, and the results are made available to adjacent vertices or edges. Therefore, each vertex has a local view of the graph structure. This model is called “Think like a vertex” or TLAV. Today, many graph processing systems such as Giraph [4], Graphlab [25], PowerGraph [15] and Powerlyra [9] are built on this model.

Balanced partitioning of big graphs is difficult, yet it remains an important problem with applications in several fields such as image processing [16], VLSI circuit design [20] and graph distribution for scalable graph processing [15]. This problem has been studied in the literature, and there are many approximate algorithms for solving it [8]. Most existing algorithms that produce close-to-optimal solutions do not access graph vertices or edges locally and cannot be executed in a distributed manner. Therefore, these algorithms do not scale to big graph partitioning. Despite the scalability of vertex-centric systems for graph processing, there are few algorithms for graph partitioning using this model.

Most of the existing approaches to graph partitioning consider only the one-step neighbors of each vertex when deciding which partition it joins; however, considering only one-step neighbors may not produce the best partitioning results. In addition, these algorithms treat all one-step neighbors of a vertex as equally important when partitioning it, whereas a one-step neighbor of a vertex may also be related to it through higher-step neighborhoods.

The personalized PageRank (PPR) vector [32] of a vertex carries much information about its neighbors. Suppose a random walker starts moving from a vertex u and, at each step, returns to the vertex u with probability α and jumps to a neighbor of the current vertex with probability 1 − α. The personalized PageRank vector of u is the probability distribution of the position of this random walker over the graph vertices. If a vertex has a stronger neighborhood relation with u, its corresponding element in the ppr vector of u will be higher. Therefore, the ppr vectors of vertices can be used for graph partitioning [2,3].

The PPR vectors are mainly used for the local graph clustering problem. There are a few works on balanced k-way graph partitioning that consider these vectors. Existing works on graph clustering and partitioning select some seed vertices and find dense clusters around them using their PPR vectors. These works have some issues: (1) They consider only the PPR vectors of the seed vertices and do not consider the PPR vectors of other vertices that join the clusters over time. This increases the number of cut edges. (2) These methods are centralized and cannot be executed in a distributed manner using a common distributed paradigm like the vertex-centric model. (3) Most of the existing works are for bidirected graphs, whereas many real graphs are directed. (4) These algorithms select seed vertices randomly and do not have any approach for selecting them.

In this paper, a distributed method for big graph partitioning is presented based on the vertex-centric programming model that overcomes these issues. This method uses the personalized PageRank vectors of partitions and vertices to decide which vertices join which partitions. The method has three phases. First, the ppr vectors of vertices are calculated simultaneously. Second, a start vertex is selected for each partition. These vertices are selected so that they have low ppr values with respect to each other. Third, the vertices of each partition, based on the ppr vectors, invite other vertices that have not yet joined a partition.

This method was implemented on the Giraph system. It has been evaluated with synthetic and real graphs. The planted-partition and power-law models were selected for the synthetic graphs in these experiments. Experimental results show that this method produces balanced partitions with higher locality and fewer cut edges than state-of-the-art stream-based methods and distributed methods based on the vertex-centric programming model. Its results are close to those of the Metis algorithm, a common centralized algorithm for graph partitioning. The proposed method scales to big graphs.

Our contributions in this paper can be listed as follows:

– We introduce the PPR-partitioning algorithm. This is a distributed graph partitioning algorithm in the vertex-centric programming model. It uses the personalized PageRank vectors of vertices to decide in which partitions to place them.

– The PPR-partitioning algorithm first selects a seed vertex for each partition and then extends the partitions. We introduce a new metric for selecting seed vertices, which can be used in other seed-based graph clustering or partitioning algorithms.

– We use the probability that a random walker stays in the set of vertices of a partition as a metric for evaluating the density of its internal edges. We rank vertices for joining a partition based on the increase in this metric.

– We have evaluated the PPR-partitioning algorithm with several synthetic and real-world graphs. We have compared its outputs with the outputs of Spinner [27] and Revolver [30] (distributed methods based on the vertex-centric programming model), Metis [21] (the most popular centralized graph partitioning algorithm) and some state-of-the-art stream-based methods like Fennel [43], LDG [40,46] and their restream versions [31]. The experiments show that the PPR-partitioning algorithm produces balanced partitions with fewer cut edges than Spinner, Revolver and the stream-based methods. Despite the distributed implementation of the PPR-partitioning algorithm, its results are close to those of Metis, which is a centralized algorithm that produces close-to-optimal solutions.

– We have compared the run time of PPR-partitioning with that of Spinner when the number of vertices, partitions or workers in the distributed environment is increased. In all situations, the PPR-partitioning algorithm has a lower run time than Spinner.

The rest of this paper is organized as follows: Related works on big graph partitioning are discussed in Sect. 2. The definitions and notations used in this paper are presented in Sect. 3. Section 4 describes the proposed algorithm. The experimental results are presented in Sect. 5. Finally, the future works are discussed in Sect. 6.

2 Related works

Today, the state-of-the-art algorithms for big graph partitioning are stream-based methods [15,31,41,43,47]. These methods visit the vertices of the graph successively in a stream. Each vertex is assigned to a partition as soon as it is visited in the stream. This assignment does not change later. The performance of these methods is high because they have linear time complexity in the number of graph vertices and edges. These methods have some limitations: (1) The resulting partitioning depends on the order of vertices in the stream [40]. (2) Parallel execution of these methods is limited and needs centralized access to the partitioning information of the previous vertices in the stream [15,36]. (3) There is no heuristic that approximates one-pass balanced graph partitioning in o(n) [40].

The alternative algorithms that can be used for big graph partitioning are the distributed methods. These algorithms can be executed in new distributed computing systems based on the vertex-centric or map-reduce models. Most of the existing distributed algorithms [27,29,30,34,35,44,45] are based on the label propagation method. In this method, each vertex chooses a random partition label at first. Then, each vertex changes its label based on the labels of its neighbors so that the number of cut edges is reduced. This routine continues until the vertices’ labels no longer change. The Spinner algorithm [27] is one of the most successful of these algorithms. The main issue with the label propagation methods is that adjacent vertices may change their partitions simultaneously, which can increase the number of cut edges. The other issue is that the final result depends on the initial random partitioning of vertices.

The authors of [30] introduce a distributed method, called Revolver, based on reinforcement learning and the label propagation method. This algorithm can be implemented in vertex-centric systems. It assigns a learning agent to each graph vertex. These agents choose a partition number for their vertices in each round and evaluate these selections with the label propagation method.

One of the most successful traditional algorithms is Metis, which applies a hierarchical approach [21]. It has three phases. In the first phase, called coarsening, the vertices of the graph are clustered and aggregated step-by-step to obtain a smaller graph. This smaller graph is partitioned with an exact algorithm in the second phase. In the third phase, the coarsened graph is expanded step-by-step back to the initial graph, and the resulting partitions are refined in each expansion step. Today, some distributed algorithms based on the label propagation method are presented for the first and third phases of this algorithm.

A different distributed method is presented in Guerrieri Alessio [17]. In this method, a vertex and an initial amount of funds are placed in each partition. Then, each partition offers some amount of funds for buying unassigned edges. The amount offered is based on the edges’ adjacent vertices located in that partition. Each edge is sold to the partition that makes the highest offer. Each step of this algorithm has three phases. Two of them can be executed in a distributed manner in edge-centric and vertex-centric systems, but one phase is centralized.

A distributed method based on the map-reduce programming model is presented in Aydin et al. [6]. This method first uses a hierarchical clustering method to approximate the proximity of graph vertices. Then, it sorts the graph vertices in a line based on this proximity. Finally, this line is divided into k balanced partitions and some refinements are done at the partition boundaries.

Random walk methods can also be executed in a distributed manner. These methods are mainly used for graph clustering. If a t-step random walker starts moving from a vertex, it is more probable that it visits vertices that are in the same cluster as the start vertex. The vertices of a cluster are densely connected. Therefore, the random walker is not likely to leave the cluster; it is trapped among these vertices [3,10,33,38,39,42,46,48].

Based on this argument, a local clustering method called Nibble is presented in Spielman and Teng [39]. This method approximates the probability distribution of a t-step random walker visiting the graph vertices. It sorts the graph vertices based on these probabilities divided by the vertices’ degrees. Vertices are joined to the local cluster from the beginning of this list up to the point at which the cluster conductance begins to increase. This routine is called the sweep cut.
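The sweep cut described above can be sketched in a few lines. This is only an illustration for an undirected graph stored as an adjacency dictionary; the names `conductance` and `sweep_cut`, the input vector `p` and the example graph are assumptions made for the sketch and are not taken from Nibble itself.

```python
def conductance(adj, S):
    """Cut edges of S divided by the smaller of vol(S) and vol(V \\ S)."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S
    return cut / min(vol_S, vol_rest)

def sweep_cut(adj, p):
    """Add vertices in decreasing order of p[v] / degree(v) and stop as soon
    as the conductance of the growing cluster begins to increase."""
    order = sorted(adj, key=lambda v: -p.get(v, 0.0) / len(adj[v]))
    cluster, best = [], float("inf")
    for v in order[:-1]:  # never take the whole vertex set
        phi = conductance(adj, cluster + [v])
        if phi > best:
            break
        cluster.append(v)
        best = phi
    return cluster

# Hypothetical example: two triangles joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p = {0: 0.3, 1: 0.3, 2: 0.25, 3: 0.1, 4: 0.03, 5: 0.02}
cluster = sweep_cut(adj, p)
```

In this example the sweep stops after the first triangle, because adding vertex 3 would raise the conductance again.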

A graph bisection algorithm based on the sweep-cut routine is introduced in Spielman and Teng [39]. In each step of this method, a high-degree start vertex is selected and a local cluster around it is calculated with the sweep-cut routine and random parameters. Then, another start vertex is selected from the remaining vertices and this routine is repeated. This continues until the union of these local clusters contains half of the graph vertices.

The authors of [2] use the ppr vector instead of the probability vector of a t-step random walker in the sweep routine. Indeed, the ppr vector is a weighted average of the probability vectors of t-step random walkers with different t values. These two local clustering algorithms are for bidirected graphs. The authors of [3] present a similar method for directed graphs. This method divides the corresponding ppr value of each vertex by its total PageRank value instead of its degree to build the sweep list. The main issue with these methods is that they use only the ppr vector of the start vertex to build the local cluster and ignore the ppr vectors of the vertices that join the cluster over time during the sweep routine.

A method for overlapping community detection is presented in Whang et al. [46]. It is similar to the method presented in this paper in that it uses seed vertex extraction. This method uses the sweep routine with the ppr vector for finding clusters around seed vertices. It compares different strategies for finding seed vertices and indicates that selecting high-degree vertices produces the best clustering results.

All of the mentioned methods based on the random walk are centralized. The authors of [33] introduced a distributed random walk-based method that implements the sweep routine with the map-reduce model [12]. They also present a distributed method for approximating ppr vectors based on the vertex-centric model. Using the map-reduce model in this method reduces its performance because the graph information must be written to external memory for each run of the sweep routine.

3 Preliminaries

Assume that G = (V, E) is a graph where V is its vertex set and E is its edge set. The balanced k-way graph partitioning problem seeks a set of partitions P = {P_1, P_2, ..., P_k} of the vertex set V that are pairwise disjoint and whose union is equal to V. These partitions should meet the following two conditions:

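$$\min_{P} \left| \{ e \mid e = (v_i, v_j) \in E,\ v_i \in P_x,\ v_j \in P_y,\ x \neq y \} \right| \tag{1}$$

$$\text{s.t.} \quad \frac{\max_i |P_i|}{\frac{1}{k} \sum_{i=1}^{k} |P_i|} \leq \varepsilon, \tag{2}$$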

where k is the number of partitions and ε ≥ 1 is a constant that indicates the acceptable imbalance in the partition sizes. The size of a partition in the second condition can be either the number of its vertices or the number of its internal edges. In this paper, the number of vertices in a partition is considered as its size.
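As a small illustration of the two conditions, the following sketch evaluates the objective (1) and the left-hand side of constraint (2) for a given assignment of vertices to partitions. The function names and the example graph are hypothetical and serve only to make the definitions concrete.

```python
from collections import Counter

def cut_edges(edges, part):
    """Objective (1): number of edges whose endpoints lie in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def imbalance(part, k):
    """Left-hand side of (2): largest partition size over the average size."""
    sizes = Counter(part.values())
    return max(sizes.values()) / (len(part) / k)

# Hypothetical 4-vertex graph split into k = 2 partitions.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
part = {0: 0, 1: 0, 2: 1, 3: 1}
```

A partitioning is feasible when `imbalance(part, k)` does not exceed ε; among the feasible partitionings, the one with the smallest `cut_edges` value is preferred.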

Since the proposed method in this paper is based on the personalized PageRank concept, we define this concept here. Consider a random walker that begins at a vertex u ∈ V and executes a random walk on the graph vertices as follows. At each step, the walker follows an outgoing edge chosen uniformly at random from its current vertex with probability 1 − α and returns to vertex u with probability α. The probability distribution of the position of this random walker over the vertices is the personalized PageRank vector of u [32], which we denote by $\overrightarrow{ppr}(u)$. The vector $\overrightarrow{ppr}(u)$ can be computed by solving the following recursive equation [14]:

$$\overrightarrow{ppr}(u) = \alpha * \overrightarrow{e}(u) + (1 - \alpha) * \overrightarrow{ppr}(u) * D^{-1} * A, \tag{3}$$

where A is the adjacency matrix of the graph, $D^{-1} \in \mathbb{R}^{|V| \times |V|}$ is a diagonal matrix whose ith diagonal element is the reciprocal of the out-degree of vertex $v_i \in V$, and $\overrightarrow{e}(u) \in \mathbb{R}^{|V|}$ is a vector whose element corresponding to u is one and whose other elements are zero (vectors are treated as row vectors here). In this paper, ppr(u, v) is the element corresponding to vertex v in $\overrightarrow{ppr}(u)$. Also, $\overrightarrow{ppr_{rev}}(u)$ is a vector whose ith element is ppr(v_i, u) for $v_i \in V$.
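The recursion for the ppr vector can be solved by fixed-point iteration: start from e(u) and repeatedly apply p ← α·e(u) + (1 − α)·p·D⁻¹·A. The sketch below does this for a graph given as out-neighbor lists. It assumes that every vertex has at least one outgoing edge; the function name and the tiny example graph are illustrative choices, not part of the paper.

```python
def ppr_fixed_point(out_nbrs, u, alpha=0.15, iters=200):
    """Iterate p <- alpha * e(u) + (1 - alpha) * p * D^-1 * A.
    out_nbrs[i] lists the out-neighbors of vertex i (assumed non-empty)."""
    n = len(out_nbrs)
    p = [0.0] * n
    p[u] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[u] = alpha                      # restart mass returns to u
        for i, nbrs in enumerate(out_nbrs):
            share = (1.0 - alpha) * p[i] / len(nbrs)
            for j in nbrs:                  # spread the rest along out-edges
                nxt[j] += share
        p = nxt
    return p

# Directed 3-cycle 0 -> 1 -> 2 -> 0 with alpha = 0.5.
cycle = [[1], [2], [0]]
p = ppr_fixed_point(cycle, 0, alpha=0.5)
```

For this cycle the recursion can also be solved by hand, giving (4/7, 2/7, 1/7), which the iteration approaches.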

We can extend the definition of ppr by calculating this vector for a subset S of graph vertices instead of one vertex. In this definition, the random walker follows a random outgoing edge with probability 1 − α and returns to one of the vertices in S with probability α. In this case, the vector e in Eq. 3 gives the probability distribution of the random walker returning to each of the vertices in S. In the uniform distribution, the elements corresponding to vertices in S are 1/|S| and the other elements are zero. In the present paper, the ppr vector related to an arbitrary vector $\overrightarrow{e}$ is denoted by $\overrightarrow{ppr}(\overrightarrow{e})$ and the aggregated ppr vector related to the subset S of vertices with uniform probability distribution is denoted by $\overrightarrow{ppr}(S)$.

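The ppr vectors have linear properties [14]. This means that if β_1 and β_2 are two constants with β_1 + β_2 = 1, then for any probability vectors $\overrightarrow{e_1}$ and $\overrightarrow{e_2}$ Eq. 4 holds:

$$\overrightarrow{ppr}(\beta_1 * \overrightarrow{e_1} + \beta_2 * \overrightarrow{e_2}) = \beta_1 * \overrightarrow{ppr}(\overrightarrow{e_1}) + \beta_2 * \overrightarrow{ppr}(\overrightarrow{e_2}). \tag{4}$$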

With this property, the aggregated ppr vector of any subset of graph vertices can be calculated from the ppr vectors of all vertices in that subset. We use this property in the third phase of the proposed method for computing the aggregated ppr vector of the vertices in each partition.
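Applied to the uniform distribution over a subset S, the linearity property says that the aggregated ppr vector of S is the average of the ppr vectors of its members. The following sketch computes this average from sparse vectors stored as dictionaries; the storage layout and the numeric values are assumptions made for illustration.

```python
def aggregated_ppr(ppr, S):
    """Average of the sparse ppr vectors of the vertices in S.
    ppr[s] maps a vertex ID to the nonzero value ppr(s, v)."""
    agg = {}
    for s in S:
        for v, value in ppr[s].items():
            agg[v] = agg.get(v, 0.0) + value / len(S)
    return agg

# Hypothetical sparse ppr vectors of two vertices.
ppr = {0: {0: 0.6, 1: 0.4}, 1: {1: 0.5, 2: 0.5}}
agg = aggregated_ppr(ppr, [0, 1])
```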

4 The proposed algorithm

4.1 The first phase: approximating ppr vectors of graph vertices

The ppr vectors of all graph vertices should be approximated in the first phase. There are two main approaches for calculating approximate ppr vectors [18,24]. The first approach is the power iteration method. In this method, the initial value of the ppr vector of each vertex u is set to $\overrightarrow{e}(u)$. Then, this value is updated recursively based on the values of the ppr vectors of adjacent vertices with Eq. 5:

$$\overrightarrow{ppr}(u) = \alpha * \overrightarrow{e}(u) + \frac{(1 - \alpha)}{|N_{out}(u)|} * \sum_{v \in N_{out}(u)} \overrightarrow{ppr}(v), \tag{5}$$

where N_out(u) is the set of outgoing neighbors of u. We can use this method to compute the ppr vectors of all graph vertices simultaneously in a vertex-centric system like Giraph. Using Giraph, each vertex updates its ppr vector and sends it to all outgoing neighbors in each super-step. This continues until the ppr vectors converge and no longer change [2,14,19,33].

The other approach for computing approximate ppr vectors is the Monte Carlo method. In this method, a number of random walkers start moving from each vertex u simultaneously. In each step, each of these random walkers goes to a random outgoing neighbor of its current vertex with probability 1 − α and stops walking with probability α. ppr(u, v) can be approximated by dividing the number of visits of u’s random walkers to the vertex v by their total number of steps. Some references compute ppr(u, v) as the fraction of random walks that terminate at v. The result of this method can be adapted to changes in the graph [5,7,24].

We used the second approach for the implementation of the first phase with the Giraph system [4]. The number of random walkers for each vertex u is set to R(u) = c * |N_out(u)|, where c is a constant natural number. All vertices send out their random walkers simultaneously in the first super-step. In the following super-steps, each vertex forwards every received random walker to one of its neighbors with probability 1 − α. This continues until all random walkers stop walking. Each vertex saves the number of times it is visited by other vertices’ random walkers.

The number of steps of a random walker is a random variable with a geometric distribution and success probability α. Therefore, the expected number of steps of a random walker is 1/α and the expected total number of steps of all random walkers of a vertex u is totalsteps(u) = c * |N_out(u)| * (1/α). ppr(u, v) can be approximated by visits(u, v) / totalsteps(u), where visits(u, v) is the number of visits of u’s random walkers to vertex v.

Each vertex u can save only the nonzero values of its ppr vector to reduce the required memory. We used the HashMap data structure to save these values, in which the IDs of vertices are used as keys for retrieving their ppr values. This list is called ppr_u. The list of vertices whose ppr values related to vertex u are nonzero is also saved in vertex u. This list is called ppr_rev_u. The union of the vertices in these two lists is called promising(u). These vertices are more likely to be in the same partition as u than the other vertices.
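The Monte Carlo estimate can be illustrated with a single-machine sketch (not the Giraph implementation). It starts R(u) = c * |N_out(u)| walkers from u, counts every vertex a walker occupies, including the start vertex, and divides by the expected total number of steps R(u)/α. Counting the start vertex as a visit, the seeded random generator and the assumption that every vertex has an outgoing edge are choices made for this sketch.

```python
import random

def monte_carlo_ppr(out_nbrs, u, alpha=0.15, c=1000, seed=0):
    """Estimate the nonzero entries of ppr(u) with c * |N_out(u)| random walkers."""
    rng = random.Random(seed)
    walkers = c * len(out_nbrs[u])
    visits = {}                              # sparse counter, like a HashMap
    for _ in range(walkers):
        v = u
        while True:
            visits[v] = visits.get(v, 0) + 1
            if rng.random() < alpha:         # the walker stops with probability alpha
                break
            v = rng.choice(out_nbrs[v])      # otherwise it follows a random out-edge
    total_steps = walkers / alpha            # expected total steps, as in the text
    return {v: n / total_steps for v, n in visits.items()}

# Directed 3-cycle; the exact ppr vector of vertex 0 for alpha = 0.5 is (4/7, 2/7, 1/7).
est = monte_carlo_ppr([[1], [2], [0]], 0, alpha=0.5, c=20000, seed=1)
```

Because only visited vertices receive an entry, the returned dictionary plays the role of the list ppr_u; recording at each visited vertex which walker's origin reached it would give ppr_rev_u without extra messages.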

We have selected the second approach for two reasons:

– To implement the first approach in the vertex-centric model, each vertex should send its PPR vector to all neighbors in each super-step. This has high overhead because the PPR vector of a vertex has |V| elements. Even if each vertex sends only the nonzero elements of its PPR vector to its neighbors, the volume of each message would be high.

– Each vertex u should have the list of vertices whose ppr values corresponding to vertex u are nonzero (ppr_rev_u). With the second approach, no additional message passing among vertices is needed to calculate these lists. Each vertex u can save the IDs of the vertices whose random walkers visit u.

4.2 The second phase: selecting seed vertices

In this phase, K seed vertices are selected as the start points of the partitions. The partitions grow around these vertices in the third phase. The proposed method applies a greedy algorithm for selecting these vertices, which uses the aggregators [26] of the Giraph system [4]. Aggregators are global values that all vertices of the graph can read during the execution of an algorithm. The value of an aggregator is calculated in a distributed manner after each super-step based on the vertices’ values. In the proposed algorithm, a score is assigned to each vertex that reflects how unsuitable that vertex is as a partition seed. A min-aggregator is used to find the vertex with the lowest negative score. This score is initially zero for all vertices. Therefore, the seed vertex of the first partition, s_0, can be selected arbitrarily. It is best to choose a vertex that has the highest out-degree at this stage.

Vertices in promising(s_0) are more likely to be in the same partition as the vertex s_0. These vertices should not be chosen as seeds of other partitions. The chance that a vertex w is in the same partition as s_0 is proportional to the values of ppr(s_0, w) and ppr(w, s_0). Therefore, the negative score of each vertex w in promising(s_0) should be set to the sum of these two values.

After s_0 is chosen, this vertex sends a message to all vertices in promising(s_0) to update their negative scores. The message sent to each vertex w contains the value ppr(s_0, w). The vertex w itself knows the value ppr(w, s_0). Each vertex w in promising(s_0) updates its negative score based on these two values.

In the next step, an aggregator in Giraph is used for finding the next seed vertex s_1 such that it has the minimum negative score. These steps are continued until the K seed vertices are selected. The negative score of each vertex w in the ith step, after choosing {s_0, s_1, ..., s_{i−1}}, is updated based on Eq. 6:

$$neg_i(w) = neg_{i-1}(w) + ppr(s_{i-1}, w) + ppr(w, s_{i-1}). \tag{6}$$

In this equation, the negative score of the previous step is added to the ppr values, so that all previous seeds are considered in computing the negative scores. In each step, among the vertices that have the minimum negative score, the algorithm chooses the one with the highest degree. The distance of a vertex from the cluster containing it is inversely proportional to its degree [13,46]. Therefore, a vertex with high degree is more central in its cluster than other vertices. Since partitions grow around seed vertices, selecting seed vertices that are more central in their clusters reduces the number of cut edges. In addition, partitions around high-degree vertices can grow with more freedom [46].
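The greedy selection can be summarized with a centralized sketch in which a plain `min` over all vertices stands in for the Giraph min-aggregator. The function name, the dictionary layout of the sparse ppr vectors and the example values are assumptions for illustration only.

```python
def select_seeds(ppr, degree, K):
    """Greedy seed selection with negative scores (Eq. 6).
    ppr[u] is the sparse ppr vector of u; degree[u] is its degree."""
    neg = {w: 0.0 for w in degree}
    seeds = []
    for _ in range(K):
        candidates = [w for w in degree if w not in seeds]
        # lowest negative score first; ties are broken by the highest degree
        s = min(candidates, key=lambda w: (neg[w], -degree[w]))
        seeds.append(s)
        for w in degree:  # only vertices in promising(s) actually change
            neg[w] += ppr[s].get(w, 0.0) + ppr[w].get(s, 0.0)
    return seeds

# Hypothetical graph with two dense groups {0, 1} and {2, 3}.
ppr = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.4, 1: 0.6},
       2: {2: 0.6, 3: 0.4}, 3: {2: 0.4, 3: 0.6}}
degree = {0: 3, 1: 1, 2: 2, 3: 1}
seeds = select_seeds(ppr, degree, K=2)
```

In this example the first seed is the highest-degree vertex 0; vertex 1 then receives a positive negative score, so the second seed is taken from the other group.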

4.3 The third phase: graph partitioning

Graph partitions are built around seed vertices in the third phase. First, a different partition number is assigned to each seed vertex. Then, each vertex that is already located in a partition and has a partition number invites the best vertices in its promising set to join its partition. This is continued until all vertices are partitioned.

This method, unlike the previous methods, considers the aggregated ppr vector of all vertices in a partition to choose the best vertices for invitation in each step. Other methods that use the sweep routine consider only the ppr vector of the seed vertex for this purpose.

Each vertex u ∈ V saves two values, ppr(p_j, u) and ppr(u, p_j), for each partition j, which are continuously updated during the execution of the algorithm. These two numbers are the criteria for inviting vertex u to join partition p_j. Each vertex u also saves the size of each partition j at the last update of ppr(p_j, u) and ppr(u, p_j) in order to use it in the following update of these values. The values that each vertex u holds are listed in Table 1.

Each previously partitioned vertex recursively executes an invitation routine to invite its promising vertices to join its partition. The invitation routine of a vertex v that is in partition j consists of three steps. In the first super-step, vertex v sends a message to each vertex u in its promising set and asks it to send the two values ppr(p_j, u) and ppr(u, p_j) to v. The vertices in promising(v) receive these messages in the second super-step, and those that are not yet partitioned reply to them. Vertex v receives the answers in the third super-step. It ranks these promising vertices based on the ppr values in their messages. The rank of vertex u in promising(v) is calculated based on Eq. 7:

$$rank(u, j) = \frac{|p_j| * ppr(p_j, u) + ppr(u, p_j)}{|p_j| + 1}. \tag{7}$$

Table 1 List of values which each vertex u holds

| Name | Description |
|---|---|
| ppr_u | The list of nonzero elements of $\overrightarrow{ppr}(u)$ with the corresponding vertices’ IDs |
| ppr_rev_u | The list of IDs of the vertices that have a nonzero corresponding element in $\overrightarrow{ppr_{rev}}(u)$ |
| ∀ 1 ≤ j ≤ K: ppr(p_j, u) | The element corresponding to u in the aggregated ppr vector of the vertices in each partition j |
| ∀ 1 ≤ j ≤ K: ppr(u, p_j) | The sum of the values corresponding to the vertices of each partition j in the ppr vector of u |
| ∀ 1 ≤ j ≤ K: s_j | The size of partition j at the last update of ppr(p_j, u) and ppr(u, p_j) |
| pid | The partition number of the vertex u |
| Invite_id | The partition number that vertex u invites other vertices to join. This is equal to pid at first |

Assume δ(p_j) is the sum of the elements of $\overrightarrow{ppr}(p_j)$ corresponding to the vertices in partition p_j. It is equal to the probability that the random walker stays in this subset of vertices and does not leave it. A higher number of internal edges among the vertices in partition p_j increases this probability because a random walker that starts walking from a vertex in p_j is trapped in this subset with higher probability. Therefore, in each step, we should choose the new vertices that join p_j such that this number increases the most. According to Theorem 4.1, joining the vertex with the highest rank to a partition j causes the highest increase in the value of δ(p_j).
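The ranking step of Eq. 7 amounts to a one-line formula per responder. The sketch below computes it and picks the responder with the highest rank; the reply format, a pair (ppr(p_j, u), ppr(u, p_j)) per vertex, and the names used are assumptions for illustration.

```python
def rank(size_j, ppr_pj_u, ppr_u_pj):
    """Eq. 7: rank of vertex u with respect to a partition of size |p_j|."""
    return (size_j * ppr_pj_u + ppr_u_pj) / (size_j + 1)

def best_candidate(size_j, replies):
    """replies maps each responder u to the pair (ppr(p_j, u), ppr(u, p_j))."""
    return max(replies, key=lambda u: rank(size_j, *replies[u]))

# Hypothetical replies received by a vertex of a partition with 3 vertices.
replies = {"a": (0.2, 0.4), "b": (0.1, 0.9)}
winner = best_candidate(3, replies)
```

Note how the weight of ppr(p_j, u) grows with the partition size, while ppr(u, p_j) enters with weight one.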

Theorem 4.1 If rank(u, j) is higher than rank(v, j) for two vertices v and u in V that are not in p_j, then δ(p_j ∪ u) will also be higher than δ(p_j ∪ v).

Proof According to the definition of the e vector with uniform distribution for a set of vertices in Sect. 3, we can write:

$$\overrightarrow{e}(p_j \cup u) = \frac{|p_j|}{|p_j| + 1} \overrightarrow{e}(p_j) + \frac{1}{|p_j| + 1} \overrightarrow{e}(u). \tag{8}$$

Since ppr vectors have linear properties, Eq. 9 can be derived from Eq. 8:

$$\delta(p_j \cup u) = \frac{|p_j|}{|p_j| + 1} \delta(p_j) + \frac{|p_j|}{|p_j| + 1} ppr(p_j, u) + \frac{1}{|p_j| + 1} ppr(u, p_j) \;\Rightarrow\; \delta(p_j \cup u) = \frac{|p_j|}{|p_j| + 1} \delta(p_j) + rank(u, j). \tag{9}$$

Therefore, the value of δ(p_j ∪ u) depends on the values of δ(p_j), rank(u, j) and |p_j|, of which only rank(u, j) depends on vertex u. Therefore, we can write:

$$rank(u, j) > rank(v, j) \;\Rightarrow\; \frac{|p_j|}{|p_j| + 1} \delta(p_j) + rank(u, j) > \frac{|p_j|}{|p_j| + 1} \delta(p_j) + rank(v, j) \;\Rightarrow\; \delta(p_j \cup u) > \delta(p_j \cup v). \tag{10}$$

□

After calculating the ranks of all vertices that responded to v, vertex v chooses the vertices whose ranks are higher than or equal to a threshold for invitation to join the partition j. The threshold is calculated as follows:

$$tr(v) = \frac{|p_j|}{\max_{1 \le i \le K}(|p_i|)} \max_{\forall u \in msg(v)} (rank(u, j)), \tag{11}$$

where msg(v) is the set of vertices that are not yet partitioned and reply to v in the second super-step. This threshold is calculated based on the maximum rank of the responder vertices in promising(v). This maximum is multiplied by a balance factor that grows with the partition size. If partition j has a smaller size, its threshold will also be smaller and more vertices will be invited to join this partition. This causes the resulting partitions to be balanced.

One issue is that if size of a partition is very small, the threshold of its vertices will be verysmall and many vertices without considering their ranks will be invited to this partition. Thiscauses sudden increase in the size of the partition. Also, the quality of partitioning in termsof the number of cut edges decreases. To avoid this issue, each vertex can maximally inviteβ vertices to join its partition. We consider β value for partition j according to Eq. 12. Eachvertex in the largest partition can invite one other vertex according to the previous equation.Therefore, the size of the largest partition will be doubled at most in the next super-step.Other partitions can grow as large as the largest partition. This causes the sizes of partitionsto grow synchronously.

Each vertex uses a priority queue into which it inserts its promising vertices based on their ranks. It removes the β highest-ranked vertices from this queue and invites them to join the partition. Therefore, regardless of the value of β, high-rank vertices are invited to each partition.

$$\beta = \frac{2 \cdot \max_{1 \le i \le K} |p_i| - |p_j|}{|p_j|}. \qquad (12)$$
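As an illustration, Eqs. 11 and 12 can be sketched in Python together with the rank formula used in Algorithm 3. This is a minimal sketch, not the authors' Giraph implementation: the function names, the dictionary of responder ranks and the rounding of β down to an integer are assumptions made for the example.

```python
from typing import Dict, List


def rank(p_size: int, ppr_part_to_u: float, ppr_u_to_part: float) -> float:
    """rank(u, j) = (|p_j| * ppr(p_j, u) + ppr(u, p_j)) / (|p_j| + 1)."""
    return (p_size * ppr_part_to_u + ppr_u_to_part) / (p_size + 1)


def threshold(p_size: int, max_size: int, responder_ranks: List[float]) -> float:
    """Eq. 11: balance factor |p_j| / max_i |p_i| times the best responder rank."""
    return (p_size / max_size) * max(responder_ranks)


def beta(p_size: int, max_size: int) -> float:
    """Eq. 12: maximum number of invitations per vertex of partition j."""
    return (2 * max_size - p_size) / p_size


def select_invitees(ranks: Dict[str, float], p_size: int, max_size: int) -> List[str]:
    """Return responders with rank >= threshold, best first, at most floor(beta)."""
    tr = threshold(p_size, max_size, list(ranks.values()))
    limit = int(beta(p_size, max_size))  # assumption: beta is rounded down
    eligible = [v for v, r in ranks.items() if r >= tr]
    eligible.sort(key=lambda v: ranks[v], reverse=True)
    return eligible[:limit]
```

For a vertex of the largest partition, `beta` returns 1, so only its best-ranked responder is invited. A vertex of a smaller partition gets a lower threshold and a larger invitation budget.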

Vertex v sends messages to the candidate vertices and invites them to join partition j. This finishes the current invitation routine, and v starts its next invitation routine in the same super-step.

Each vertex u that is not partitioned yet may receive invitation messages from different partitions in the next super-step. Among these partitions, vertex u joins the partition j that has the maximum value of rank(u, j) * |p_j| / max_{1≤i≤K}(|p_i|). After vertex u joins partition j, it sends a message to each vertex w in its promising set to update the values ppr(p_j, w) and ppr(w, p_j) according to Eqs. 13 and 14:

$$ppr(p_j, w) = \frac{s_j \cdot ppr(p_j, w) + \sum_{x \in msg(j, w)} ppr(x, w)}{|p_j|}, \qquad s_j = |p_j| \qquad (13)$$

$$ppr(w, p_j) = ppr(w, p_j) + \sum_{x \in msg(j, w)} ppr(w, x), \qquad (14)$$


where msg(j, w) is the set of vertices of partition j that send update messages to vertex w in the current super-step. Equation 13 is obtained by generalizing Eq. 8. All vertices that have joined partition j since the last update of ppr(p_j, w) and ppr(w, p_j) and that have nonzero ppr or ppr-rev values corresponding to w send a message to w. Therefore, w can update these two values from the messages it receives, according to Eqs. 13 and 14.
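The update in Eqs. 13 and 14 can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function name, the argument layout and the representation of each update message as a pair of floats are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple


def apply_updates(
    s_j: int,
    ppr_pj_w: float,
    ppr_w_pj: float,
    new_size: int,
    updates: List[Tuple[float, float]],
) -> Tuple[float, float, int]:
    """Apply Eqs. 13 and 14 at vertex w for one partition j.

    s_j      -- size of partition j when w last updated its values
    new_size -- current size |p_j|
    updates  -- one (ppr(x, w), ppr(w, x)) pair per new member x in msg(j, w)
    """
    sum_in = sum(a for a, _ in updates)
    sum_out = sum(b for _, b in updates)
    new_ppr_pj_w = (s_j * ppr_pj_w + sum_in) / new_size  # Eq. 13
    new_ppr_w_pj = ppr_w_pj + sum_out  # Eq. 14
    return new_ppr_pj_w, new_ppr_w_pj, new_size  # new_size is stored as s_j
```

Note that, as the equations are written, ppr(p_j, w) is kept as an average over the members of partition j, while ppr(w, p_j) is kept as a running sum.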

4.4 Improving run time and memory usage

Each vertex sends messages to its promising vertices in the third super-step. Message passing is therefore done on the graph G′ of Definition 15, in which each vertex is connected to its promising vertices, instead of on G. In the Giraph system, each vertex can send messages to all vertices whose IDs it knows. Therefore, there is no need to build G′; each vertex only needs to know the IDs of its promising vertices.

$$G' = (V, E'), \qquad E' = \{(v, w) : v \in V,\ w \in promising(v)\}. \qquad (15)$$

This algorithm finishes when all vertices have joined partitions. For this purpose, each vertex must be invited to join a partition by at least one other vertex. This requires the graph G′ to be connected, because each vertex must be in the promising set of at least one other vertex.

Theorem 4.2 If the graph G is weakly connected, the graph G ′ will be strongly connected.

Proof If there is an edge (u, v) in G for two vertices u and v, this edge also exists in G′ because the values of the ppr vector of u that correspond to u's outgoing neighbors are nonzero, so v is in the promising set of u. On the other hand, the edge (v, u) exists in G′ because u is in ppr-rev(v) and, as a result, u is in the promising set of v. Therefore, for each directed edge (u, v) in G, there exist two edges (u, v) and (v, u) in G′, and G′ contains all edges of G regardless of their directions. Therefore, if G is weakly connected, G′ will be strongly connected. □

The number of edges in G′ is greater than that of G because the promising set of each vertex is larger than its set of outgoing edges. To decrease the execution time and memory usage of the algorithm, each vertex can delete from its promising set the elements whose corresponding ppr values are less than a threshold. This decreases the number of edges in G′. However, G′ should remain connected after these edges are deleted. If the incoming and outgoing neighbors of each vertex are not removed from its promising set, G′ remains connected.

In the first phase, each vertex w sends R = c * |N_out(w)| random walkers to its outgoing neighbors to estimate its ppr vector. On average, c random walkers jump to each neighbor in their first step. The minimum value of the elements of the ppr vector that correspond to the outgoing neighbors of a vertex can be calculated by dividing c by the total number of jumps of its random walkers. This value is equal to α/|N_out(w)|. It can be used as a threshold for removing elements from the ppr list of each vertex. Therefore, in the first phase, each vertex estimates its ppr vector and removes the elements below this threshold from it. Then, it sends messages to all vertices in this list. Vertices can build their ppr-rev lists based on the messages they receive. With this method, the graph G′ remains connected.
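The pruning step can be sketched in Python as below. This is only a sketch under assumptions: the ppr vector is modeled as a dictionary, the function name is hypothetical, and the explicit retention of outgoing neighbors is a guard added for the example (according to the text, their estimated values should already reach the threshold). A vertex without outgoing neighbors is left unpruned here, which is also an assumption.

```python
from typing import Dict, Set


def prune_ppr(ppr: Dict[str, float], out_neighbors: Set[str], alpha: float) -> Dict[str, float]:
    """Drop ppr entries below alpha / |N_out(w)|, always keeping out-neighbors."""
    if not out_neighbors:
        return dict(ppr)  # assumption: nothing to prune against
    threshold = alpha / len(out_neighbors)
    return {
        v: p
        for v, p in ppr.items()
        if p >= threshold or v in out_neighbors  # guard keeps G' connected
    }
```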

Fig. 1 An example of balancing problem (all promising vertices of the black partition have joined other partitions and this partition cannot grow any more)

The ppr vector of a vertex w represents the probability distribution of the positions of its random surfers over the graph vertices. The sum of the elements in the ppr vector of w is 1. Therefore, after removing the elements whose values are below this threshold, at most |N_out(w)|/α elements remain. The total memory needed for storing the ppr vectors of the vertices can be calculated according to Eq. 16.

$$\sum_{w \in V} \frac{|N_{out}(w)|}{\alpha} = \frac{|E|}{\alpha}. \qquad (16)$$

The memory for storing the ppr-rev vectors is also equal to this value. In addition to these two vectors, each vertex saves K values that represent its corresponding elements in the total ppr vectors of the different partitions. Therefore, the memory complexity of the algorithm is O(2|E|/α + K|V|).

4.5 Fix a problem with balancing

In the third phase, all the promising vertices of a partition may have already joined other partitions. In this case, the partition is trapped and cannot grow any more, so the final partitions will not be balanced. This state occurs for the black partition in Fig. 1.

To solve this problem, each vertex in partition j checks the size of partition j in the third super-step of each invitation routine. If this partition is full, that is, its size is ε * |V|/K, this vertex does not invite any other vertex to join it. Instead, it randomly selects another partition f and invites its promising vertices to join f from then on until f is also full. The probability of selecting a partition is inversely proportional to its size. Therefore, if a partition such as f is trapped and cannot be grown by its own vertices, it can be grown by the vertices of partitions whose capacity has been filled.
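A possible way to pick the replacement partition f is sketched below. The sketch assumes that only partitions below the capacity ε * |V|/K are candidates and that the selection weight is 1/size, following the sentence above; the function name, the dictionary of partition sizes and the treatment of empty partitions are hypothetical.

```python
import random
from typing import Dict, Optional


def select_random_partition(
    sizes: Dict[int, int],
    capacity: float,
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Pick a non-full partition with probability inversely proportional to its size."""
    rng = rng or random.Random()
    open_parts = [j for j, size in sizes.items() if size < capacity]
    if not open_parts:
        return None  # every partition is full
    # assumption: an empty partition is weighted like a partition of size 1
    weights = [1.0 / max(sizes[j], 1) for j in open_parts]
    return rng.choices(open_parts, weights=weights, k=1)[0]
```

With sizes `{0: 10, 1: 2, 2: 10}` and capacity 10, only partition 1 can be returned, because the other two partitions are already full.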


4.6 The pseudocode

The first two phases are straightforward. Therefore, only the pseudocode of the third phase is explained in this section. The pseudocode for a vertex u is shown in Algorithm 1. It represents the actions that vertex u performs in each super-step of the third phase. Four types of messages are exchanged among vertices in this phase: (1) messages of the first super-step of each invitation routine, which request rank information; (2) the answers to messages of type 1, which carry rank information; (3) messages of the third super-step of each invitation routine, which invite vertices to join the partitions; and (4) messages for updating the ppr of partitions, sent by vertices that recently joined the partitions. Each vertex may receive messages of different types in each super-step.

Lines 2–10 of the pseudocode correspond to the first super-step of phase 3. If u has been selected as the jth seed vertex in phase 2, the partition number j is assigned to it. In line 6, this vertex sends messages of type 4 to its promising vertices to update their ppr values related to partition j. The vertex should then start its invitation routine for partition j. Therefore, in line 7, it sends messages of type 1 to its promising vertices and asks them to send rank information.

In the following super-steps, each vertex, like u, may receive messages of different types. Four procedures F1, F2, F3 and F4 represent the actions that vertex u performs on receiving each type of message. Procedures F4 and F3 should be called before the others to update the ppr values of vertex u according to the latest changes in the partitions. They determine whether a partition number is assigned to u.

Algorithm 1: ppr-partitioning
Input: vertex u
       Mu: all incoming messages to u
       β: the maximum number of vertices invited by u to join the partition in each step
       ε: the accepted imbalance on partition sizes

 1 begin
 2   if getSuperStep() = 0 then
 3     if u is a jth seed vertex then
 4       u.value.pid = j
 5       foreach vertex w: promising(u) do
 6         send message m(type ← 4, pid ← j, ppr ← u.value.ppr(w), vid ← u.getId()) to w
 7         send message m(type ← 1, pid ← j) to w
 8       end
 9     end
10   end
11   else
12     Call F4(u, Mu) //this procedure checks the messages with type 4
13     Call F3(u, Mu, ε) //this procedure checks the messages with type 3
14     Call F2(u, Mu, ε, β) //this procedure checks the messages with type 2
15     Call F1(u, Mu) //this procedure checks the messages with type 1
16   end
17   u.voteToHalt()
18 end

The pseudocode of procedure F4 is shown in Algorithm 2. First, in lines 3–8, the sum of the ppr values of the vertices that joined partition j and sent a message to u is accumulated in sum_j. The sum of their corresponding elements in the ppr vector of u is accumulated in sumr_j. Then, in lines 10–14, the values ppr(p_j, u), ppr(u, p_j) and s_j are updated based on sum_j and sumr_j for each partition j.

Algorithm 2: F4
Input: vertex u
       Mu: all incoming messages to u

 1 begin
 2   sum_j ← 0, sumr_j ← 0 ∀ 1 ≤ j ≤ K
 3   foreach message m: Mu do
 4     if m.type = 4 then
 5       j ← m.pid, w ← m.vid
 6       sum_j ← sum_j + m.ppr
 7       sumr_j ← sumr_j + u.value.ppr_u.get(w)
 8     end
 9   end
10   for j: 1..K do
11     u.value.ppr(p_j, u) ← (u.value.s_j * u.value.ppr(p_j, u) + sum_j) / GetAgrigatedValue(p_j)
12     u.value.ppr(u, p_j) ← u.value.ppr(u, p_j) + sumr_j
13     u.value.s_j ← GetAgrigatedValue(p_j)
14   end
15 end

The pseudocode of procedure F2 is shown in Algorithm 3. First, in lines 6–14, the maximum rank of the vertices that sent messages of type 2 is calculated, and these vertices are added to a priority queue based on their ranks. This maximum value is used for calculating the rank threshold for inviting vertices. A true value in flag shows that vertex u has received messages of type 2 and is in the second phase of its invitation routine. The size of partition j is checked in line 15 to ensure that it does not exceed the maximum allowed partition size. In lines 16–20, vertex u selects up to β vertices with the highest ranks in Q whose ranks are above the threshold and sends them messages of type 3 to invite them to join partition j. If the size of partition j is greater than the maximum allowed size, no vertices are invited to this partition. In this situation, in line 23, vertex u randomly selects another partition and from then on invites its promising vertices to join this new partition. In the function selectrandompartition, the probability of selecting each partition is directly proportional to its empty capacity. Finally, vertex u starts its next invitation routine by sending messages of type 1 to its promising vertices in lines 26–28.

The pseudocode of procedure F1 is shown in Algorithm 4. The partition number of vertex u is checked in line 2. If u has no partition number, it replies to messages of type 1 in lines 3–7.

Algorithm 3: F2
Input: vertex u
       Mu: all incoming messages to u
       β: the maximum number of vertices invited by u to join the partition in each step
       ε: the accepted imbalance on partition sizes

 1 begin
 2   Q ← null //Q is a priority queue
     , max ← −∞, i ← 0,
 3   j ← u.value.inviteid,
 4   flag ← false
 5   |p_j| ← GetAgrigatedValue(p_j)
 6   foreach message m: Mu do
 7     if m.type = 2 then
 8       rank ← (|p_j| * m.ppr_j + m.pprrev_j) / (|p_j| + 1), flag ← true
 9       if rank > max then
10         max ← rank
11         insert (m.vid, rank) to Q
12       end
13     end
14   end
15   if |p_j| ≤ ε * |V|/K then
16     foreach element e : Q do
17       if e.rank > max * |p_j|/maxpsize and i ≤ β then
18         Send message m(type ← 3, pid ← u.value.pid) to e.vid, i ← i + 1
19       end
20     end
21   end
22   else if flag = true then
23     j ← selectrandompartition(), u.value.inviteid ← j
24   end
25   if flag = true then
26     foreach vertex w: promising(u) do
27       send message m(type ← 1, pid ← j) to w
28     end
29   end
30 end

Algorithm 4: F1
Input: vertex u
       Mu: all incoming messages to u

 1 begin
 2   if u.value.pid = null then
 3     foreach message m: Mu do
 4       if m.type = 1 then
 5         send message m(type ← 2, ppr_j ← u.value.ppr(p_j, u), pprrev_j ← u.value.ppr(u, p_j), vid ← u.getId()) to m.vid
 6       end
 7     end
 8   end
 9 end

The pseudocode of procedure F3 is shown in Algorithm 5. The partition number of vertex u is checked in line 2. If this vertex has already joined a partition, it ignores messages of type 3. Otherwise, messages of type 3 are considered in lines 4–14. First, for each message, line 7 checks whether it relates to a partition that has empty capacity. If so, the message rank is multiplied by a penalty function proportional to its partition size in line 8. The partition with the maximum of these values is found in lines 9 and 10. Vertex u joins this partition in line 17, and the size of this partition is increased by one. Vertex u sends messages of type 4 to its promising vertices to update their ppr values related to u's partition in line 19. It also sends messages of type 1 to start its first invitation routine in line 20.

Algorithm 5: F3
Input: vertex u
       Mu: all incoming messages to u
       ε: the accepted imbalance on partition sizes

 1 begin
 2   if u.value.pid = null then
 3     max ← −∞, maxp ← −∞, maxpsize ← max(GetAgrigatedValue(p_j) ∀ 1 ≤ j ≤ k)
 4     foreach message m: Mu do
 5       if m.type = 3 then
 6         |p_j| ← GetAgrigatedValue(p_m.pid)
 7         if |p_j| < ε * |V|/K then
 8           rank ← m.rank * |p_j|/maxpsize
 9           if rank > max then
10             max ← rank, maxp ← m.pid
11           end
12         end
13       end
14     end
15   end
16   if maxp <> −1 then
17     u.value.pid ← maxp, u.value.inviteid ← maxp, aggregate(maxp, 1)
18     foreach vertex w: promising(u) do
19       send message m(type ← 4, pid ← u.value.pid, ppr ← u.value.ppr(w), vid ← u.getId()) to w
20       send message m(type ← 1, pid ← u.value.pid) to w
21     end
22   end
23 end
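The joining decision of lines 3–14 of Algorithm 5 can be sketched in Python as follows. This is a sketch under assumptions, not the authors' Giraph code: invitations are modeled as (pid, rank) pairs, partition sizes are passed in as a dictionary instead of being read from aggregators, and the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def choose_partition(
    invitations: List[Tuple[int, float]],
    sizes: Dict[int, int],
    capacity: float,
) -> Optional[int]:
    """Return the inviting partition with the best size-weighted rank, or None."""
    max_size = max(sizes.values())
    best_pid: Optional[int] = None
    best_score = float("-inf")
    for pid, msg_rank in invitations:
        if sizes[pid] < capacity:  # skip partitions without empty capacity
            score = msg_rank * sizes[pid] / max_size
            if score > best_score:
                best_pid, best_score = pid, score
    return best_pid
```

For example, with sizes `{1: 4, 2: 2, 3: 5}`, capacity 5 and invitations `[(1, 0.5), (2, 0.9), (3, 0.99)]`, partition 3 is skipped because it is full, and partition 1 wins with a score of 0.4 against 0.36 for partition 2.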

5 Experimental results

The performance of the proposed method is compared with other partitioning methods in this section. Among recent graph partitioning algorithms, we selected the Spinner [27] and Revolver [30] algorithms for comparison. They were selected because they are distributed methods that, like the proposed method, can be implemented with the vertex-centric programming model. The two best stream-based methods, LDG [40,46] and Fennel [43], and their restream versions [31] were also selected because these methods are commonly used for big graph partitioning. The Metis algorithm [21] was also selected; it is the most famous centralized multi-level partitioning method and produces close-to-optimal answers.

Table 2 Real data sets which have been used in experimental evaluation

Name              #vertices     #edges           Type
p2p-Gnutella31    62,561        147,878          p2p
Amazon0601        403,394       3,387,388        Co-purchasing
Twitter(MPI)      52,579,682    1,963,263,821    Social
LiveJournal       10,690,276    112,307,385      Social

Both real-world and synthetic data sets were used for experimental evaluation. The real-world graphs used in this paper are listed in Table 2. We obtained these graphs from the Stanford large network dataset collection [23] and Konect [22]. The synthetic data sets in this paper were produced based on the planted-partition [11,28] and power-law models [1]. The planted-partition model builds a random graph with |V| vertices and K hidden partitions. This model is used for evaluating partitioning algorithms. It first assigns an equal number of vertices to the partitions. Then, it adds an edge between each pair of vertices in the same partition with probability p and between each pair of vertices in different partitions with probability q < p [11].
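The planted-partition model described above can be sketched in a few lines of Python. This is an illustrative generator only, not the one used in the experiments: the function name, the round-robin assignment of vertices to hidden partitions and the seed parameter are assumptions, and the quadratic loop over vertex pairs is suitable only for small graphs.

```python
import random
from itertools import combinations
from typing import List, Tuple


def planted_partition(
    n: int, k: int, p: float, q: float, seed: int = 0
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Build an undirected planted-partition graph with n vertices and k hidden blocks."""
    rng = random.Random(seed)
    # round-robin assignment gives equal block sizes when k divides n
    block = [v % k for v in range(n)]
    edges = []
    for u, v in combinations(range(n), 2):
        prob = p if block[u] == block[v] else q
        if rng.random() < prob:
            edges.append((u, v))
    return block, edges
```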

The power-law model was selected because many real-world big graphs, such as social networks' graphs and the Web graph, fit this model well [37]. The power-law graph in this paper is built by the Barabasi–Albert algorithm [1]. This algorithm starts with an initial connected graph with m0 primary vertices. Other vertices are added to this graph one by one. Each new vertex connects to m ≤ m0 previous vertices. The probability of connecting a new vertex to each of the previous vertices is directly proportional to the degree of that vertex.
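The preferential-attachment process can be sketched as follows. The sketch makes several assumptions that the text does not specify: the initial connected graph is a complete graph on m0 ≥ 2 vertices, each new vertex connects to m distinct earlier vertices, and degree-proportional sampling is done by drawing uniformly from a list of edge endpoints. The function name is hypothetical.

```python
import random
from typing import List, Tuple


def barabasi_albert(n: int, m0: int, m: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Grow an undirected graph by preferential attachment and return its edge list."""
    if not (2 <= m0 <= n and 1 <= m <= m0):
        raise ValueError("require 2 <= m0 <= n and 1 <= m <= m0")
    rng = random.Random(seed)
    # assumption: the initial connected graph is complete on m0 vertices
    edges = [(u, v) for u in range(m0) for v in range(u + 1, m0)]
    # each vertex appears in this list once per incident edge (its degree)
    endpoints = [x for e in edges for x in e]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))  # probability proportional to degree
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges
```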

We use the two metrics shown in Eqs. 17 and 18 to evaluate the number of cut edges and the balance of the partition sizes. We also use the execution time to evaluate the scalability of the proposed method.

$$\mu = \frac{\#\text{cut edges}}{\#\text{total edges}} \qquad (17)$$

$$\theta = \frac{|\text{maximum partition}|}{|V| / K}. \qquad (18)$$
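Both metrics are straightforward to compute from an edge list and a vertex-to-partition map. The following Python sketch shows one way to do so; the function names and the dictionary representation of the partitioning are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def cut_edge_ratio(edges: Iterable[Tuple[Hashable, Hashable]], part: Dict[Hashable, int]) -> float:
    """Eq. 17: fraction of edges whose endpoints lie in different partitions."""
    edges = list(edges)
    cut = sum(1 for u, v in edges if part[u] != part[v])
    return cut / len(edges)


def balance(part: Dict[Hashable, int], k: int) -> float:
    """Eq. 18: size of the largest partition divided by the ideal size |V| / K."""
    sizes = Counter(part.values())
    return max(sizes.values()) / (len(part) / k)
```

A perfectly balanced partitioning has θ = 1, and larger values indicate that the largest partition exceeds the ideal size.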

A Hadoop cluster with 32 machines is used for the evaluations. Each machine has 8 gigabytes of RAM and a 2-core Core i7 processor. Evaluations are done with Hadoop 2 and Giraph version 1.1.0. All algorithms except Metis are implemented in Java. For the Metis algorithm, we use the Karypis laboratory implementation in C++ [21].

The results of comparing the proposed method, called PPR-partitioning, with other algorithms for partitioning planted-partition graphs are shown in Fig. 2. All graphs have 100K vertices and 10 hidden partitions. The value of p is 0.9, and the value of q is changed from 0.1 to 0.8.

Parts a and b compare the cut-edge ratio of the PPR-partitioning algorithm with that of the other methods. The cut-edge ratio μ of the PPR-partitioning algorithm is less than that of the stream-based, Spinner and Revolver algorithms. Only when p − q is less than 0.3 does the restream Fennel method produce fewer cut edges, but its resulting partitions are not balanced. As the distance between p and q increases, the proposed algorithm improves more over the other algorithms and succeeds in finding the hidden partitions. In most cases, the results of the PPR-partitioning algorithm are equal to those of the Metis method.

Parts c and d compare the balance factor of the PPR-partitioning algorithm with that of the other methods. The PPR-partitioning algorithm, despite its distributed execution, produces balanced partitions. Except for the case p − q = 0.6, its results show less than 5% imbalance in all cases. Its balance factor θ is better than that of Fennel, the restream version of Fennel, Spinner and Revolver. It is also equal to that of LDG, its restream version and Metis.

Fig. 2 Experimental results on planted-partition graphs. Panels: (a) μ versus p − q (Spinner, Metis, PPR, Revolver); (b) μ versus p − q (stream-based, PPR); (c) θ versus p − q (Spinner, Metis, PPR, Revolver); (d) θ versus p − q (stream-based, PPR)

Figure 3 shows the results of the execution of the different algorithms on a power-law graph. This graph has 100K vertices. Its degree distribution is d^−3. In these experiments, the number of partitions has been changed from 2 to 18. According to this figure, the PPR-partitioning algorithm produces partitions with fewer cut edges than all stream-based algorithms and the Spinner algorithm for different values of K, but the number of cut edges of the Metis algorithm is about 6% lower than that of the PPR-partitioning algorithm. The growth rate of the number of cut edges with an increasing number of partitions is lower in the proposed method than in the stream-based, Revolver and Spinner methods. All algorithms produce balanced partitions on this graph. In the worst case, the number of vertices in the biggest partition is about 5% greater than the average value.

The results of PPR-partitioning and the other algorithms on the real graph p2p-Gnutella are shown in Fig. 4. As the number of partitions increases, the number of cut edges grows more slowly in this method than in the other methods except Metis. The cut-edge ratio of the PPR-partitioning algorithm is up to 20% lower than that of the stream-based, Spinner and Revolver algorithms. Its results are only 2% to 3% greater than those of the Metis algorithm. All algorithms produce balanced partitions on this graph. The value of θ of the PPR-partitioning algorithm is below 1.01 in the worst case.

Fig. 3 Experimental results on power-law graph. Panels: (a) μ versus K (Spinner, Metis, PPR, Revolver); (b) μ versus K (stream-based, PPR); (c) θ versus K (Spinner, Metis, PPR, Revolver); (d) θ versus K (stream-based, PPR)

Fig. 4 Experimental results on p2p-Gnutella graph. Panels: (a) μ versus K (Spinner, Metis, PPR, Revolver); (b) μ versus K (stream-based, PPR)

Fig. 5 Experimental results on Amazon0601 graph. Panels: (a) μ versus K (Spinner, Metis, PPR, Revolver); (b) μ versus K (stream-based, PPR)

Fig. 6 Experimental results on Twitter and LiveJournal graphs

The results on the Amazon0601 graph are shown in Fig. 5. The PPR-partitioning algorithm has better results than all stream-based, Spinner and Revolver methods in terms of the number of cut edges on this graph. Its results are similar to those of the Metis algorithm. The only negative aspect is that, on this graph, the cut-edge ratio increases more in the PPR-partitioning algorithm than in the Fennel, Spinner and Revolver algorithms when the number of partitions is increased. The proposed method, like the other algorithms, produces partitions with up to 1% imbalance on this graph.

The results of partitioning the Twitter and LiveJournal graphs are shown in Fig. 6. The number of partitions (K) in these experiments is 16. The PPR-partitioning and Metis methods have the lowest ratio of cut edges relative to the other methods. On the LiveJournal graph, the PPR-partitioning algorithm produces more balanced partitions with about 2% lower cut-edge ratio than that of the Metis method. The difference in cut-edge ratio between Metis and PPR-partitioning on the Twitter graph is about 1%.

Figure 7 shows the scalability of the PPR-partitioning algorithm compared to the Spinner algorithm. The Spinner algorithm is selected because, like the PPR-partitioning method, it is a distributed vertex-centric algorithm. The execution time of both algorithms as a function of the number of graph vertices is presented in Fig. 7a. Random graphs with 10 edges for each vertex have been used for these experiments. Experiments were done with 32 machines as workers. The experiments show that the PPR-partitioning method has a better run time as the number of graph vertices increases. The execution time of both algorithms increases linearly with the number of vertices.

Fig. 7 Run time of PPR-partitioning compared to Spinner method. Panels: (a) time (s) versus thousands of vertices (time-#vertices); (b) time (s) versus number of workers (time-#workers); (c) time (s) versus K (time-#partitions)

Figure 7b presents the scalability of the algorithms with respect to the number of workers. The number of graph vertices in these experiments is 1 million. As the number of workers increases, the run time of both algorithms decreases exponentially: their execution time is halved with every 5 additional workers. Finally, the run time of the algorithms as a function of the number of partitions is presented in Fig. 7c. The run time of the PPR-partitioning algorithm increases less than that of the Spinner algorithm as the number of partitions is increased.

5.1 Summary of results

In planted-partition graphs, the proposed method, similar to Metis, can detect hidden partitions better than the stream-based and distributed methods, especially when the hidden partitions have higher density and lower cut-edge ratio. In power-law graphs, the growth rate of the number of cut edges of the proposed method with an increasing number of partitions is lower than that of the stream-based and distributed methods. The PPR-partitioning algorithm produces balanced partitions with lower cut-edge ratio compared to the stream-based and distributed methods in all synthetic and real-world graphs. The difference between the cut-edge ratios of Metis and the PPR-partitioning method is less than 5 percent in all real-world graphs.

The run time of the PPR-partitioning algorithm is less than that of the Spinner method in all cases. With a constant number of workers, the execution time increases linearly with the number of vertices. It decreases exponentially as the number of workers increases.

6 Conclusion

In this paper, a new distributed big graph partitioning method for vertex-centric systems has been introduced. This method has three phases. In the first phase, the approximate personalized PageRank vectors of all vertices are calculated simultaneously. In the second phase, a seed vertex is selected for each partition. Seed vertices are selected so that they have low ppr values with respect to each other, because if two vertices have a high ppr value, the probability that they are in the same cluster is also high, and they should therefore not be placed in different partitions. The degrees of the seed vertices should be high to allow more freedom for partition growth. In the third phase, vertices in each partition invite other vertices to join their partitions. Each vertex invites vertices that have not been partitioned yet and have high ppr values corresponding to the existing vertices in its partition. This continues until all vertices have joined partitions. A penalty function based on partition sizes is used in selecting vertices for invitation and in deciding whether to accept these invitations. It balances the partition sizes.

This method can be implemented in distributed vertex-centric systems, which are the most commonly used graph processing systems. The experimental evaluations show that this method scales to partitioning big graphs. The run time of this algorithm is reduced by increasing the number of worker machines in the distributed environment. Evaluations have been done on synthetic planted-partition and power-law graphs and on some real-world graphs. They show that the resulting partitions of the proposed method, despite its distributed implementation, are balanced, similar to those of centralized methods such as the stream-based and Metis methods. The number of cut edges in this method is smaller than that of the stream-based, Spinner and Revolver methods. Its results are close to those of the Metis method.

The proposed method is one of the few graph partitioning algorithms based on personalized PageRank vectors. The ppr vectors are most commonly used for graph clustering because it is hard to produce balanced partitions with ppr-based algorithms. The existing methods use only the ppr vectors of the seed vertices for selecting other vertices to join partitions. As a result, the selection of seed vertices strongly affects the quality of the partitions. In this paper, for the first time, the ppr vectors of all vertices are considered for selecting other vertices to join partitions.

Most existing graph partitioning methods consider only the one-step neighbors of a vertex for partitioning it, while the proposed method uses the ppr vectors of vertices. The ppr vectors contain useful information regarding all neighbors of a vertex at different numbers of steps.

One issue of this method is that it can only be used for partitioning connected graphs, but this is not a major limitation because most vertices in real graphs, such as the Web graph and social networks' graphs, are in one connected component. Extending this method to partition graphs that are not connected is one suggested area for future study. Another useful direction for future work is adapting the resulting partitions of this method to graph changes in dynamic graphs. The analysis of the time complexity of this algorithm is another possible future work.

Compliance with ethical standards

Funding No funding was received by the authors.

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Albert R, Barabási A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–972. Andersen R, Chung F, Lang K (2006) Local graph partitioning using PageRank vectors. In: 47th annual

IEEE symposium on foundations of computer science (FOCS’06), pp 475–4863. Andersen R, Chung F, Lang K (2008) Local partitioning for directed graphs using pagerank. Internet

Math. 5(1–2):3–224. Avery C (2011) Giraph: large-scale graph processing infrastructure on hadoop. Proc Hadoop Summit

Santa Clara 11(3):5–95. Avrachenkov K, Litvak N, Nemirovsky D, Osipova N (2007) Monte carlo methods in pagerank compu-

tation: when one iteration is sufficient. SIAM J Numer Anal 45(2):890–9046. Aydin K, Bateni M, Mirrokni V (2016) Distributed balanced partitioning via linear embedding. In: Pro-

ceedings of the 9th international conference on web search and data mining, WSDM’16. ACM, pp387–396

7. Bahmani B, Chowdhury A, Goel A (2010) Fast incremental and personalized pagerank. Proc VLDBEndow 4(3):173–184

8. Buluç A, Meyerhenke H, Safro I, Sanders P, Schulz C (2016) Recent advances in graph partitioning. In:Kliemann L, Sanders P (eds) Algorithm engineering: selected results and surveys, vol 9220. Springer,Cham, pp 117–158. https://doi.org/10.1007/978-3-319-49487-6_4

9. Chen R, Shi J, Chen Y, Chen H (2015) PowerLyra: differentiated graph computation and partitioningon skewed graphs. In: Proceedings of the 10th European conference on computer systems, EuroSys ’15.ACM, pp 1:1–1:15

10. Chung F, Simpson O (2018) Computing heat kernel pagerank and a local clustering algorithm. Eur JComb 68(Supplement C):96–119

11. Condon A, Karp RM (2001) Algorithms for graph partitioning on the planted partition model. RandomStruct Algorithms 18(2):116–140

12. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM51(1):107–113

13. Dhillon IS, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors a multilevel approach.IEEE Trans Pattern Anal Mach Intell 29(11):1944–1957

14. Fogaras D, Rcz B, Csalogny K, Sarls T (2005) Towards scaling fully personalized pagerank: algorithms,lower bounds, and experiments. Internet Math 2(3):333–358

15. Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) PowerGraph: distributed graph-parallel com-putation on natural graphs. In: Proceedings of 10th USENIX symposium on operating systems designand implementation (OSDI), vol 12, pp 17–30

16. Grady L, Schwartz EL (2006) Isoperimetric graph partitioning for image segmentation. IEEE TransPattern Anal Mach Intell 28(3):469–475

17. Guerrieri Alessio MA (2015) DFEP: distributed funding-based edge partitioning. In: Euro-Par: 21stinternational conference on parallel and distributed computing. Springer, Berlin, pp 346–358

18. Guo T, Cao X, Cong G, Lu J, Lin X (2017) Distributed algorithms on exact personalized PageRank. In:Proceedings of the international conference on management of data, SIGMOD ’17. ACM, pp 479–494

19. Jeh G, Widom J (2003) Scaling personalized web search. In: Proceedings of the 12th international con-ference on world wide web, WWW ’03. ACM, pp 271–279

20. Karypis G, Aggarwal R, Kumar V, Shekhar S (1999) Multilevel hypergraph partitioning: applications inVLSI domain. IEEE Trans Very Large Scale Integr VLSI Syst 7(1):69–79

21. Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs.SIAM J Sci Comput 20(1):359–392

22. Kunegis J (2013) KONECT: the Koblenz network collection. In: Proceedings of the 22nd international conference on world wide web, WWW ’13 companion. ACM, pp 1343–1350

23. Leskovec J, Krevl A (2014) SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data

24. Lofgren PA, Banerjee S, Goel A, Seshadhri C (2014) FAST-PPR: scaling personalized PageRank estimation for large graphs. In: Proceedings of the 20th SIGKDD international conference on knowledge discovery and data mining, KDD ’14. ACM, pp 1436–1445

25. Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endow 5(8):716–727

26. Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of SIGMOD international conference on management of data, SIGMOD ’10. ACM, pp 135–146

27. Martella C, Logothetis D, Loukas A, Siganos G (2017) Spinner: scalable graph partitioning in the cloud. In: IEEE 33rd international conference on data engineering (ICDE), pp 1083–1094

28. McSherry F (2001) Spectral partitioning of random graphs. In: Proceedings IEEE international conference on cluster computing, pp 529–537

29. Meyerhenke H, Sanders P, Schulz C (2017) Parallel graph partitioning for complex networks. IEEE Trans Parallel Distrib Syst 28(9):2625–2638

30. Mofrad MH, Melhem R, Hammoud M (2018) Revolver: vertex-centric graph partitioning using reinforcement learning. In: 2018 IEEE 11th international conference on cloud computing (CLOUD), vol 00, pp 818–821. https://doi.org/10.1109/CLOUD.2018.00111

31. Nishimura J, Ugander J (2013) Restreaming graph partitioning: simple versatile algorithms for advanced balancing. In: Proceedings of the 19th SIGKDD international conference on knowledge discovery and data mining, KDD ’13. ACM, pp 1106–1114. https://doi.org/10.1145/2487575.2487696

32. Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: bringing order to the web. Technical report 1999-66, Stanford InfoLab

33. Perozzi B, McCubbin C, Halbert JT (2014) Scalable graph clustering with parallel approximate PageRank. Soc Netw Anal Min 4(1):179

34. Rahimian F, Payberah AH, Girdzijauskas S, Haridi S (2014) Distributed vertex-cut partitioning. In: IFIP international conference on distributed applications and interoperable systems. Springer, pp 186–200

35. Rahimian F, Payberah AH, Girdzijauskas S, Jelasity M, Haridi S (2013) JA-BE-JA: a distributed algorithm for balanced graph partitioning. In: IEEE 7th international conference on self-adaptive and self-organizing systems, pp 51–60

36. Sajjad HP, Payberah AH, Rahimian F, Vlassov V, Haridi S (2016) Boosting vertex-cut partitioning for streaming graphs. In: IEEE international congress on big data (BigData congress), pp 1–8

37. Sala A, Cao L, Wilson C, Zablit R, Zheng H, Zhao BY (2010) Measurement-calibrated graph models for social network experiments. In: Proceedings of the 19th international conference on world wide web, WWW ’10. ACM, pp 861–870

38. Spielman DA, Teng S-H (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the 36th symposium on theory of computing, STOC ’04. ACM, pp 81–90

39. Spielman DA, Teng S-H (2013) A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM J Comput 42(1):1–26

40. Stanton I (2014) Streaming balanced graph partitioning algorithms for random graphs. In: Proceedings of the 25th symposium on discrete algorithms. SIAM, pp 1287–1301

41. Stanton I, Kliot G (2012) Streaming graph partitioning for large distributed graphs. In: Proceedings of the 18th SIGKDD international conference on knowledge discovery and data mining, KDD ’12. ACM, pp 1222–1230

42. Tabrizi SA, Shakery A, Asadpour M, Abbasi M, Tavallaie MA (2013) Personalized pagerank clustering: a graph clustering algorithm based on random walks. Phys A Stat Mech Appl 392(22):5772–5785

43. Tsourakakis C, Gkantsidis C, Radunovic B, Vojnovic M (2014) Fennel: streaming graph partitioning for massive scale graphs. In: Proceedings of the 7th international conference on web search and data mining, WSDM ’14. ACM, pp 333–342

44. Ugander J, Backstrom L (2013) Balanced label propagation for partitioning massive graphs. In: Proceedings of the 6th international conference on web search and data mining, WSDM ’13. ACM, pp 507–516

45. Wang L, Xiao Y, Shao B, Wang H (2014) How to partition a billion-node graph. In: IEEE 30th international conference on data engineering, pp 568–579

46. Whang JJ, Gleich DF, Dhillon IS (2016) Overlapping community detection using neighborhood-inflated seed expansion. IEEE Trans Knowl Data Eng 28(5):1272–1284

47. Xie C, Li W-J, Zhang Z (2015) S-PowerGraph: streaming graph partitioning for natural graphs by vertex-cut. CoRR arXiv:1511.02586

48. Zhang H, Raitoharju J, Kiranyaz S, Gabbouj M (2016) Limited random walk algorithm for big graph data clustering. J Big Data 3(1):26

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Nasrin Mazaheri Soudani is a Ph.D. candidate in software engineering at the University of Isfahan. She received her B.S. and M.S. degrees from the University of Isfahan in 2009 and 2011, respectively. Her current research interests include big data management systems, database management systems and big graph computations.

Afsaneh Fatemi received her B.S. degree in software engineering from Isfahan University of Technology in 1995, and her M.S. and Ph.D. degrees in software engineering from the University of Isfahan in 2002 and 2012, respectively. She is currently an assistant professor in the Department of Software Engineering at the University of Isfahan. Her current research interests include complex systems, social networks and computational modeling. She is also a member of the Big Data Research Group of the University of Isfahan.

Mohammadali Nematbakhsh is a full professor of software engineering in the School of Computer Engineering at the University of Isfahan. He received his B.Sc. in electrical engineering from Louisiana Tech University in 1981 and his M.Sc. and Ph.D. degrees in electrical and computer engineering from the University of Arizona in 1983 and 1987, respectively. He worked for Micro Advanced Co. and Toshiba Corporation for many years before joining the University of Isfahan. He has published more than 150 research papers, several US-registered patents and two database books that are widely used in universities. His main research interests include intelligent Web and big data processing.
