Carnegie Mellon University
The Next Generation of the GraphLab Abstraction
Joseph Gonzalez
Joint work with: Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joe Hellerstein, Alex Smola, Jay Gu
How will we design and implement
parallel learning systems?
... a popular answer: build learning algorithms on top of
high-level parallel abstractions such as Map-Reduce / Hadoop.

Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce):
    Cross Validation
    Feature Extraction
    Computing Sufficient Statistics

Graph-Parallel:
    Belief Propagation
    Label Propagation
    Kernel Methods
    Deep Belief Networks
    Neural Networks
    Tensor Factorization
    PageRank
    Lasso
Example of Graph Parallelism
PageRank Example
Iterate:
    R[i] = α + (1 - α) Σ_{j ∈ N[i]} W_ji R[j]
Where:
    α is the random reset probability
    L[j] is the number of links on page j (W_ji = 1/L[j])
Properties of Graph-Parallel Algorithms
    Dependency Graph
    Iterative Computation (my rank depends on my friends' ranks)
    Factored Computation
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce):
    Cross Validation
    Feature Extraction
    Computing Sufficient Statistics

Graph-Parallel (Map Reduce? Pregel (Giraph)?):
    Belief Propagation
    SVM
    Kernel Methods
    Deep Belief Networks
    Neural Networks
    Tensor Factorization
    PageRank
    Lasso
Pregel (Giraph)
Bulk Synchronous Parallel Model: Compute, Communicate, Barrier.
PageRank in Giraph (Pregel)

public void compute(Iterator<DoubleWritable> msgIterator) {
  double sum = 0;
  while (msgIterator.hasNext())
    sum += msgIterator.next().get();
  DoubleWritable vertexValue = new DoubleWritable(0.15 + 0.85 * sum);
  setVertexValue(vertexValue);
  if (getSuperstep() < getConf().getInt(MAX_STEPS, -1)) {
    long edges = getOutEdgeMap().size();
    sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
  } else {
    voteToHalt();
  }
}
Bulk synchronous computation can be inefficient.
Problem: Curse of the Slow Job
[Figure: BSP execution, with data partitioned across CPU 1, CPU 2, and CPU 3; each iteration ends at a barrier, so every iteration waits for the slowest CPU.]
Curse of the Slow Job
Assuming each job's runtime is drawn from an exponential distribution with mean 1.
[Figure: runtime per barrier vs. number of jobs per barrier.]
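The claim behind this plot can be checked numerically: a barrier finishes only when the slowest of n jobs does, and for i.i.d. Exp(1) runtimes the expected maximum is the harmonic number H_n ≈ ln n. The sketch below is self-contained and illustrative; the function names are assumptions, not from the talk.

```cpp
#include <algorithm>
#include <random>

// Closed form: E[max of n i.i.d. Exp(1)] = H_n = 1 + 1/2 + ... + 1/n.
double expected_barrier_time(int n) {
  double h = 0;
  for (int k = 1; k <= n; ++k) h += 1.0 / k;
  return h;
}

// Monte Carlo estimate of the mean barrier time with n parallel jobs.
double simulated_barrier_time(int n, int trials, unsigned seed) {
  std::mt19937 gen(seed);
  std::exponential_distribution<double> job_time(1.0);
  double total = 0;
  for (int t = 0; t < trials; ++t) {
    double slowest = 0;
    for (int j = 0; j < n; ++j) slowest = std::max(slowest, job_time(gen));
    total += slowest;  // the barrier waits for the slowest job
  }
  return total / trials;
}
```

So adding jobs per barrier makes each barrier slower in expectation, even though every individual job still takes 1 time unit on average.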
Problem with Messaging
Storage overhead: requires keeping old and new messages [2x overhead].
Redundant messages: in PageRank each vertex sends a copy of its own rank to all neighbors, turning O(|V|) state into O(|E|) messages; a vertex on CPU 1 with three neighbors on CPU 2 sends the same message three times!
Often requires complex protocols: when will my neighbors need information about me?
Unable to constrain neighborhood state: how would you implement graph coloring?
Converge More Slowly
[Figure: runtime in seconds vs. number of CPUs, comparing optimized in-memory bulk synchronous against asynchronous Splash BP; the bulk synchronous version converges more slowly.]
Bulk synchronous computation can be wrong!
Problem
The Problem with Bulk Synchronous Gibbs
Adjacent variables cannot be sampled simultaneously.
Sequential execution (t = 0, 1, 2, 3): preserves the strong positive correlation between adjacent variables.
Parallel execution: sampling adjacent variables at the same time instead produces a strong negative correlation.
The Need for a New Abstraction
If not Pregel, then what?

Data-Parallel (Map Reduce):
    Cross Validation
    Feature Extraction
    Computing Sufficient Statistics

Graph-Parallel (Pregel (Giraph)?):
    Belief Propagation
    SVM
    Kernel Methods
    Deep Belief Networks
    Neural Networks
    Tensor Factorization
    PageRank
    Lasso
What is GraphLab?
The GraphLab Framework
    Graph-Based Data Representation
    Update Functions (User Computation)
    Scheduler
    Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
    Graph: social network
    Vertex data: user profile text, current interest estimates
    Edge data: similarity weights
Comparison with Pregel
    Pregel: data is associated only with vertices.
    GraphLab: data is associated with both vertices and edges.
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

pagerank(i, scope) {
  // Get neighborhood data (R[i], W_ji, R[j]) from scope

  // Update the vertex data:
  //   R[i] = α + (1 - α) Σ_{j ∈ N[i]} W_ji R[j]

  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}
PageRank in GraphLab2

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double sum = 0;
    foreach(edge_type edge, context.in_edges())
      sum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = fabs(vdata.rank - old_rank) / context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};
Comparison with Pregel
    Pregel: data must be sent to adjacent vertices; the user code describes the movement of data as well as the computation.
    GraphLab: data is read from adjacent vertices; the user code describes only the computation.
The Scheduler
The scheduler determines the order in which vertices are updated.
[Figure: CPU 1 and CPU 2 pull vertices (a, b, ..., k) from a shared scheduler queue; updates may push neighbors back onto the queue.]
The process repeats until the scheduler is empty.
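The loop just described can be sketched in a few lines. This is a hypothetical single-threaded simplification, not the GraphLab API; `Scheduler`, `schedule`, and `run` are illustrative names.

```cpp
#include <deque>
#include <functional>
#include <unordered_set>

// Pop a vertex, run the user's update function on it, and let the update
// function push neighbors back onto the queue; stop when the queue drains.
struct Scheduler {
  std::deque<int> queue;
  std::unordered_set<int> pending;  // avoid scheduling a vertex twice

  void schedule(int v) {
    if (pending.insert(v).second) queue.push_back(v);
  }

  void run(const std::function<void(int, Scheduler&)>& update) {
    while (!queue.empty()) {
      int v = queue.front();
      queue.pop_front();
      pending.erase(v);
      update(v, *this);  // the update may call schedule() on neighbors
    }
  }
};
```

This captures the key property of the abstraction: computation is driven by which vertices still need work, not by global sweeps.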
The GraphLab Framework
    Graph-Based Data Representation
    Update Functions (User Computation)
    Scheduler
    Consistency Model
Ensuring Race-Free Code
How much can computation overlap?
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: a parallel execution on CPU 1 and CPU 2 is equivalent over time to a sequential execution on a single CPU.]
Consistency Rules
Guaranteed sequential consistency for all update functions.

Full Consistency
[Figure: full-consistency scopes cover the vertex, its edges, and neighboring vertex data, so overlapping scopes cannot run concurrently.]

Obtaining More Parallelism

Edge Consistency
[Figure: edge-consistency scopes grant write access to the vertex and its edges but only safe read access to adjacent vertices, so CPU 1 and CPU 2 can update nearby vertices concurrently.]
GraphLab is pretty neat!

In Summary ...
Pregel vs. GraphLab
Multicore PageRank (25M vertices, 355M edges)
Pregel [simulated]: synchronous schedule; no skipping [unfair updates comparison]; no combiner [unfair runtime comparison].
[Figure: L1 error vs. runtime (s), and L1 error vs. number of updates, for GraphLab and Pregel.]
Update Count Distribution
[Figure: histogram of number of updates per vertex.]
Most vertices need to be updated infrequently.
Bayesian Tensor Factorization
Gibbs Sampling
Dynamic Block Gibbs Sampling
Matrix Factorization
Lasso
SVM
Belief Propagation
PageRank
CoEM
K-Means
SVD
LDA
…Many others…
Startups using GraphLab
Companies experimenting with GraphLab
Academic projects exploring GraphLab
1600+ unique downloads tracked (possibly many more from direct repository checkouts)
Why do we need a NEW GraphLab?
Natural Graphs

Natural Graphs: Power Law
The top 1% of vertices are adjacent to 53% of the edges!
[Figure: log-log degree distribution of the Yahoo! Web Graph, a "power law".]
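A synthetic Zipf-like degree sequence illustrates how a small fraction of vertices can hold most of the edge endpoints. This is a back-of-the-envelope sketch, not the Yahoo! graph; the function name and the exponent `alpha` are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Fraction of total degree (edge endpoints) covered by the top `top_frac`
// of vertices when vertex k has degree proportional to 1 / k^alpha.
double top_degree_share(int num_vertices, double alpha, double top_frac) {
  std::vector<double> degree(num_vertices);
  for (int k = 0; k < num_vertices; ++k)
    degree[k] = std::pow(k + 1.0, -alpha);  // already sorted descending
  double total = 0;
  for (double d : degree) total += d;
  int top = std::max(1, static_cast<int>(num_vertices * top_frac));
  double covered = 0;
  for (int k = 0; k < top; ++k) covered += degree[k];
  return covered / total;
}
```

With alpha = 1 and 100,000 vertices, the top 1% of vertices already account for roughly 60% of the degree mass, the same heavy-tail behavior the slide reports for the web graph.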
Problem: High Degree Vertices
High-degree vertices limit parallelism:
    they touch a large amount of state,
    require heavy locking,
    and are processed sequentially.
High Degree Vertices are Common
    "Social" networks: popular people
    Netflix (users x movies): popular movies
    LDA (docs x words): the hyperparameters (α, B) connect to every document's (θ, Z, w)
    Word frequencies: common words such as "Obama"
Proposed Four Solutions
    Decomposable update functors: expose greater parallelism by further factoring update functions.
    Commutative-associative update functors: transition from stateless to stateful update functions.
    Abelian group caching (concurrent revisions): allows for controllable races through diff operations.
    Stochastic scopes: reduce degree through sampling.
PageRank in GraphLab

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    // Parallel "Sum" Gather
    double sum = 0;
    foreach(edge_type edge, context.in_edges())
      sum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
    // Atomic Single Vertex Apply
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    // Parallel Scatter [Reschedule]
    double residual = fabs(vdata.rank - old_rank) / context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};
Decomposable Update Functors
Decompose update functions into 3 phases:
    Gather (user-defined): a parallel sum over the scope, Δ1 + Δ2 + Δ3 + ... + Δ
    Apply (user-defined): apply the accumulated value Δ to the center vertex
    Scatter (user-defined): update adjacent edges and vertices, rescheduling as needed
Locks are acquired only for the region within a scope: relaxed consistency.
Factorized PageRank

struct pagerank : public iupdate_functor<graph, pagerank> {
  double accum = 0, residual = 0;
  void gather(icontext_type& context, const edge_type& edge) {
    accum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
  }
  void merge(const pagerank& other) { accum += other.accum; }
  void apply(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double old_value = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
    residual = fabs(vdata.rank - old_value) / context.num_out_edges();
  }
  void scatter(icontext_type& context, const edge_type& edge) {
    if (residual > EPSILON)
      context.schedule(edge.target(), pagerank());
  }
};
Decomposable Execution Model
Split computation across machines:
[Figure: the update for a vertex Y is decomposed into partial functors F1 and F2 run on different machines, then composed.]
Weaker Consistency
Neighboring vertices may be updated simultaneously:
[Figure: adjacent vertices A, B, C gather concurrently; each apply remains atomic.]
Other Decomposable Algorithms
Loopy Belief Propagation:
    Gather: accumulates the product (log sum) of in-messages
    Apply: updates the central belief
    Scatter: computes out-messages and schedules adjacent vertices
Alternating Least Squares (ALS):
[Figure: Netflix users x movies bipartite graph with ratings y1 ... y4; user factors w1, w2 (W) and movie factors x1, x2, x3 (X) are fit so that W x X approximates the ratings matrix.]
Convergent Gibbs Sampling
Cannot be done under weaker consistency:
[Figure: simultaneous gathers at adjacent vertices A, B, C are unsafe for Gibbs sampling.]
Decomposable Functors
Fits many algorithms: Loopy Belief Propagation, Label Propagation, PageRank, ...
Addresses the earlier concerns:
    Large state: distributed gather and scatter
    Heavy locking: fine-grained locking
    Sequential processing: parallel gather and scatter
Problem: does not exploit asynchrony at the vertex level.
Need for Vertex-Level Asynchrony
Exploit a commutative-associative "sum":
    Re-running the full gather over all neighbors is costly for a single change!
    Instead, keep the old (cached) sum at the vertex and fold in only the delta Δ contributed by each changed neighbor.
Commutative-Associative Update

struct pagerank : public iupdate_functor<graph, pagerank> {
  double delta;
  pagerank(double d) : delta(d) { }
  void operator+=(pagerank& other) { delta += other.delta; }
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    vdata.rank += delta;
    if (fabs(delta) > EPSILON) {
      double out_delta = delta * (1 - RESET_PROB) *
                         1.0 / context.num_out_edges();
      context.schedule_out_neighbors(pagerank(out_delta));
    }
  }
};
// Initial rank:     R[i] = 0;
// Initial schedule: pagerank(RESET_PROB);
Scheduling Composes Updates
Calling reschedule_out_neighbors forces update-function composition:
    reschedule_out_neighbors(pagerank(3))
    No pending update:    the queue now holds pagerank(3)
    Pending pagerank(7):  pagerank(3) + pagerank(7) composes to a pending pagerank(10)
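The composition rule can be sketched as follows. This is a hypothetical simplification, not the GraphLab API: the scheduler keeps at most one pending functor per vertex and merges a newly scheduled functor into it with `operator+=`.

```cpp
#include <unordered_map>

// Delta functor: scheduling two of these for the same vertex composes them.
struct pagerank_fn {
  double delta = 0;
  void operator+=(const pagerank_fn& other) { delta += other.delta; }
};

struct delta_scheduler {
  std::unordered_map<int, pagerank_fn> pending;  // vertex -> composed functor

  void schedule(int vertex, pagerank_fn fn) {
    // e.g. pagerank(3) merged into a pending pagerank(7) yields pagerank(10)
    pending[vertex] += fn;
  }
};
```

Because the functor is commutative and associative, the merged functor produces the same result as running the two scheduled updates one after the other.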
Experimental Comparison

Comparison of Abstractions:
Multicore PageRank (25M vertices, 355M edges)
[Figure: L1 error vs. runtime (s) for the GraphLab1, Factorized, and Delta variants.]
Comparison of Abstractions:
Distributed PageRank (25M vertices, 355M edges)
[Figure: runtime (s) and total communication (GB) vs. number of machines (8 CPUs per machine), GL1 (Chromatic) vs. GL2 Delta (Asynchronous).]
PageRank on the Web circa 2000
Invented comparison (number of machines in parentheses):
[Figure: runtime in seconds for GraphLab2 (512), Pegasus (800), Priter (200), Dryad (960).]
Ongoing Work
    Extending all of GraphLab2 to the distributed setting:
        Implemented push-based engines (chromatic)
        Need to build the GraphLab2 distributed locking engine
    Improving storage efficiency of the distributed data-graph
    Porting a large set of Danny's applications
Questions?
http://graphlab.org