Pregel: A System for Large-Scale Graph Processing
Written by: Grzegorz Malewicz et al., SIGMOD 2010
Presented by: Abolfazl Asudeh
CSE 6339 – Spring 2013
The Problem
Very large graphs are a popular object of analysis, e.g. social networks, the web, and several other areas. Efficient processing of large graphs is challenging:
- poor locality of memory access
- very little work per vertex
- distribution over many machines worsens the locality issue, and machines can fail
There was no scalable, general-purpose system for implementing arbitrary graph algorithms over arbitrary graph representations in a large-scale distributed environment.
Want to process a large-scale graph? The options:
- Crafting a custom distributed infrastructure: needs a lot of effort and has to be repeated for every new algorithm.
- Relying on an existing distributed platform, e.g. MapReduce: must store the graph state at each stage, and there is too much communication between stages.
- Using a single-computer graph algorithm library: does not scale to graphs of this size.
- Using an existing parallel graph system: does not address problems like fault tolerance that are very important in large-scale graph processing.
How can we solve the problem?
- Thousands/millions of computers (the distributed system)
- Billions of vertices/edges (the graph)
- How do we assign vertices to machines?
The high-level organization of Pregel programs
A Pregel program is a sequence of iterations called supersteps. In each superstep, every vertex conceptually invokes a user-defined function in parallel. The function specifies the behavior at a single vertex V in a single superstep S: it can read messages sent in the previous superstep, send messages for the next superstep, and modify the state of V and of its outgoing edges.
[Diagram: Input → supersteps until all vertices vote to halt → Output]
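To make the superstep model concrete, here is a minimal single-machine sketch of the driver loop (an illustration only; the names VertexState and RunSupersteps are assumptions, and real Pregel distributes this loop across many workers):

#include <functional>
#include <map>
#include <queue>
#include <string>

// Conceptual sketch only: one VertexState per vertex, kept on one machine.
struct VertexState {
  bool active = true;      // in superstep 0, every vertex is active
  std::queue<int> inbox;   // messages delivered for the current superstep
};

using Graph = std::map<std::string, VertexState>;
using Outbox = std::map<std::string, std::queue<int>>;
using ComputeFn =
    std::function<void(const std::string& id, VertexState& v, int superstep,
                       Outbox& outbox)>;

void RunSupersteps(Graph& graph, const ComputeFn& compute) {
  for (int superstep = 0; ; ++superstep) {
    Outbox outbox;  // messages destined for superstep S+1
    bool any_active = false;
    for (auto& [id, v] : graph) {
      if (!v.inbox.empty()) v.active = true;  // a message reactivates a vertex
      if (!v.active) continue;                // halted vertices are skipped
      // The user-defined function reads v.inbox, updates the vertex value,
      // enqueues messages into outbox, and may vote to halt (active = false).
      compute(id, v, superstep, outbox);
      v.inbox = {};                           // messages are consumed
      any_active = any_active || v.active;
    }
    bool any_messages = !outbox.empty();
    for (auto& [id, q] : outbox) graph[id].inbox = std::move(q);
    if (!any_active && !any_messages) break;  // all voted to halt, no messages
  }
}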
What is the advantage? In the vertex-centric approach, users focus on a local action, processing each item independently. This ensures that Pregel programs are inherently free of the deadlocks and data races common in asynchronous systems.
MODEL OF COMPUTATION
A directed graph is given to Pregel. It runs the computation at each vertex until all vertices vote to halt, and then returns the results.
Vertex State Machine
Algorithm termination is based on every vertex voting to halt. In superstep 0, every vertex is in the active state. A vertex deactivates itself by voting to halt; it can be reactivated by receiving an (external) message.
The C++ API (Vertex Class)

template <typename VertexValue,
          typename EdgeValue,
          typename MessageValue>
class Vertex {
 public:
  virtual void Compute(MessageIterator* msgs) = 0;
  const string& vertex_id() const;
  int64 superstep() const;
  const VertexValue& GetValue();
  VertexValue* MutableValue();
  OutEdgeIterator GetOutEdgeIterator();
  void SendMessageTo(const string& dest_vertex,
                     const MessageValue& message);
  void VoteToHalt();
};
The C++ API – Message Passing
Messages are guaranteed to be delivered, but not necessarily in the order they were sent. Each message is delivered exactly once. A vertex can send a message to any vertex, not just its neighbors.
Maximum Value Example: each vertex starts with its own value; in every superstep it adopts the largest value it has received and propagates it, until all vertices hold the global maximum and have voted to halt.
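In the paper this example is a picture of vertices exchanging values; here is a sketch of the Compute function it implies, written against the Vertex API above (the class name MaxValueVertex is an assumption):

// Sketch (assumed, not given in the paper): each vertex adopts the largest
// value it has seen and forwards it; it halts when nothing changes.
class MaxValueVertex : public Vertex<int, void, int> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    bool changed = (superstep() == 0);  // everyone announces its value first
    for (; !msgs->Done(); msgs->Next()) {
      if (msgs->Value() > GetValue()) {
        *MutableValue() = msgs->Value();
        changed = true;
      }
    }
    if (changed)
      SendMessageToAllNeighbors(GetValue());
    VoteToHalt();  // reactivated only if a new message arrives
  }
};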
The C++ API – Other Classes
Combiners (not active by default): the user can specify a way to combine several messages destined for the same vertex into one, reducing the number of messages sent (see the sketch after this list).
Aggregators: gather global information such as statistical values (sum, average, min, ...). They can also be used as a global coordinator, e.g. to force the vertices to run specific branches of their Compute functions during specific supersteps.
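For example, in the shortest-path application below only the minimum of the incoming distances matters, so messages headed for the same vertex can be collapsed into one. This is essentially the combiner the paper gives; the exact signature is reproduced from memory, so treat the details as approximate:

class MinIntCombiner : public Combiner<int> {
  virtual void Combine(MessageIterator* msgs) {
    // Keep only the smallest distance among the combined messages.
    int mindist = INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    Output("combined_source", mindist);
  }
};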
The C++ API – Topology Mutation
Some graph algorithms need to change the graph's topology; e.g. a clustering algorithm may need to replace a whole cluster with a single node. Mutations are applied in a fixed order: add a vertex before adding its edges; remove all edges before removing a vertex. User-defined handlers can be added to resolve conflicting requests.
Input / Output
Pregel provides readers/writers for common formats such as text files and relational databases. Users can write custom readers/writers for new input/output formats.
Implementation
Pregel was designed for the Google cluster architecture. Each cluster consists of thousands of commodity PCs organized into racks with high intra-rack bandwidth. Clusters are interconnected but distributed geographically. Vertices are assigned to machines based on their vertex ID, by default hash(ID) mod N, so any machine can cheaply determine which worker owns a given vertex, as sketched below.
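A minimal sketch of that default assignment (the function name PartitionFor is an assumption):

#include <functional>
#include <string>

// Default assignment from the paper: partition = hash(vertex ID) mod N.
int PartitionFor(const string& vertex_id, int num_partitions) {
  return static_cast<int>(std::hash<string>{}(vertex_id) % num_partitions);
}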
Implementation (execution steps)
1. User programs are copied onto the machines.
2. One machine becomes the master. The other machines find the master using a name service and register themselves with it. The master determines how many partitions the graph will have.
3. The master assigns one or more partitions to each worker (more than one per worker allows parallelism among partitions and better load balancing), along with a portion of the user input.
4. The workers run the Compute function for their active vertices and send messages asynchronously. There is one thread per partition in each worker. When the superstep is finished, the workers tell the master how many vertices will be active in the next superstep.
Fault Tolerance
At the end of each superstep, workers checkpoint V, E, and the messages, and the master checkpoints the aggregated values. Failure is detected by "ping" messages from the master to the workers. The master reassigns the failed worker's partitions to the available workers, and the state is recovered from the last checkpoint. In confined recovery, only the lost partitions have to be recomputed, because outgoing messages are also logged and the results of the other partitions' computations are therefore already known. This is not possible for randomized algorithms, whose recomputation would not reproduce the original messages.
Worker Implementation
Maintains one or more partitions of the graph. Keeps two message queues, for supersteps S and S+1. If a message's destination vertex is not on this machine, the message is buffered for sending, and when the buffer is full it is flushed. The user may define Combiners that are applied to remote messages before they are sent.
Master Implementation
Maintains a list of all workers currently known to be alive, including each worker's ID and address and which portion of the graph it has been assigned. Performs the synchronization between supersteps and coordinates all operations. Maintains statistics and runs an HTTP server so the user can monitor progress.
Aggregators Implementation
The information is passed to the master in a tree structure: workers send their partial values to aggregator machines, which aggregate them and send the combined values on to the master machine. This reduces the load on the master.
[Diagram: Worker, Worker → Aggregator → Master, forming an aggregation tree]
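A small illustration of that tree reduction (plain C++, not the Pregel API; a sum is used as the example aggregation):

#include <numeric>
#include <vector>

// Each inner vector holds the partial values of the workers attached to one
// aggregator machine; each aggregator reduces its workers' values, and the
// master reduces the aggregators' results.
int AggregateTree(const std::vector<std::vector<int>>& values_per_aggregator) {
  int global = 0;
  for (const auto& worker_values : values_per_aggregator) {
    int partial =
        std::accumulate(worker_values.begin(), worker_values.end(), 0);
    global += partial;  // the master combines the aggregators' partial sums
  }
  return global;
}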
Application – PageRank

class PageRankVertex
    : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      // Sum the PageRank contributions received from in-neighbors.
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      *MutableValue() = 0.15 / NumVertices() + 0.85 * sum;
    }
    if (superstep() < 30) {
      // Distribute this vertex's rank evenly over its out-edges.
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      VoteToHalt();  // stop after 30 supersteps
    }
  }
};
Application – Shortest Path

class ShortestPathVertex
    : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    // Start from 0 at the source, "infinity" everywhere else.
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      // Found a shorter path: record it and relax all out-edges.
      *MutableValue() = mindist;
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();  // reactivated when a shorter distance arrives
  }
};
Application – Bipartite Matching
Problem: find a maximal set of edges in a bipartite graph that share no endpoints. The algorithm runs in cycles of four phases (supersteps); a sketch of the corresponding Compute function follows this list.
1. Each left vertex not yet matched sends a message to each of its neighbors to request a match, and then unconditionally votes to halt.
2. Each right vertex not yet matched randomly chooses one of the messages it receives, sends a message granting that request, and votes to halt.
3. Each left vertex not yet matched chooses one of the grants it receives, records the match, and sends an acceptance message.
4. The right vertex receives the acceptance, records the match, and votes to halt.
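A hypothetical sketch of how the four phases might map onto Compute (not given in the paper or slides): the vertex value holds the matched partner's ID, or "" while unmatched; IsLeft() is an assumed helper telling which side of the bipartition a vertex is on; denial messages and the random choice are simplified away.

class MatchingVertex : public Vertex<string, void, string> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (!GetValue().empty()) { VoteToHalt(); return; }  // already matched
    switch (superstep() % 4) {
      case 0:  // phase 1: unmatched left vertices request a match
        if (IsLeft(vertex_id())) {
          OutEdgeIterator iter = GetOutEdgeIterator();
          for (; !iter.Done(); iter.Next())
            SendMessageTo(iter.Target(), vertex_id());
        }
        break;
      case 1:  // phase 2: unmatched right vertices grant one request
        if (!IsLeft(vertex_id()) && !msgs->Done())
          SendMessageTo(msgs->Value(), vertex_id());  // paper: chosen randomly
        break;
      case 2:  // phase 3: left vertices accept one grant and record it
        if (IsLeft(vertex_id()) && !msgs->Done()) {
          *MutableValue() = msgs->Value();
          SendMessageTo(msgs->Value(), vertex_id());  // acceptance message
        }
        break;
      case 3:  // phase 4: right vertices record the acceptance
        if (!msgs->Done()) *MutableValue() = msgs->Value();
        break;
    }
    VoteToHalt();  // every phase ends with a vote to halt
  }
};

In the full algorithm, the denial messages of phase 2 (omitted here) are what reactivate unsuccessful left vertices so that the next cycle of requests can begin.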
Experiments
300 multicore PCs were used. Only the running time was measured (checkpointing was disabled). Two experiments: measure scalability with the number of workers, and measure scalability with the number of vertices.
Shortest-path runtimes for a binary tree with a billion vertices (and thus a billion minus one edges) as the number of Pregel workers varies from 50 to 800.
Shortest-path runtimes for binary trees varying in size from one billion to 50 billion vertices, now using a fixed number of 800 worker tasks scheduled on 300 multicore machines.
Shortest-path runtimes for random graphs that use a log-normal distribution of out-degrees.
Thank you