Carnegie Mellon University
The Next Generation of the GraphLab Abstraction
Joseph Gonzalez
Joint work with: Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joe Hellerstein, Alex Smola, Jay Gu
How will we design and implement
parallel learning systems?
... a popular answer: build learning algorithms on top of
high-level parallel abstractions such as Map-Reduce / Hadoop.

Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce):
    Cross Validation
    Feature Extraction
    Computing Sufficient Statistics

Graph-Parallel:
    Belief Propagation
    Label Propagation
    Kernel Methods
    Deep Belief Networks
    Neural Networks
    Tensor Factorization
    PageRank
    Lasso
Example of Graph Parallelism
PageRank Example
Iterate:
    R[i] = α + (1 - α) Σ_{j ∈ N[i]} W_ji R[j]
Where:
    α is the random reset probability
    L[j] is the number of links on page j (W_ji = 1/L[j])
Properties of Graph-Parallel Algorithms
    Dependency Graph
    Iterative Computation (my rank depends on my friends' ranks)
    Factored Computation
Map-Reduce for Data-Parallel ML
Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce):
    Cross Validation
    Feature Extraction
    Computing Sufficient Statistics

Graph-Parallel (Map Reduce? Pregel (Giraph)?):
    Belief Propagation
    SVM
    Kernel Methods
    Deep Belief Networks
    Neural Networks
    Tensor Factorization
    PageRank
    Lasso
Pregel (Giraph)
Bulk Synchronous Parallel Model: Compute, Communicate, Barrier.
PageRank in Giraph (Pregel)

public void compute(Iterator<DoubleWritable> msgIterator) {
  double sum = 0;
  while (msgIterator.hasNext())
    sum += msgIterator.next().get();
  DoubleWritable vertexValue = new DoubleWritable(0.15 + 0.85 * sum);
  setVertexValue(vertexValue);
  if (getSuperstep() < getConf().getInt(MAX_STEPS, -1)) {
    long edges = getOutEdgeMap().size();
    sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
  } else {
    voteToHalt();
  }
}
Bulk synchronous computation can be inefficient.
Problem: Curse of the Slow Job
[Figure: BSP execution, with data partitioned across CPU 1, CPU 2, and CPU 3; each iteration ends at a barrier, so every iteration waits for the slowest CPU.]
Curse of the Slow Job
Assuming each job's runtime is drawn from an exponential distribution with mean 1.
[Figure: runtime per barrier vs. number of jobs per barrier.]
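The claim behind this plot can be checked numerically: a barrier finishes only when the slowest of n jobs does, and for i.i.d. Exp(1) runtimes the expected maximum is the harmonic number H_n ≈ ln n. The sketch below is self-contained and illustrative; the function names are assumptions, not from the talk.

```cpp
#include <algorithm>
#include <random>

// Closed form: E[max of n i.i.d. Exp(1)] = H_n = 1 + 1/2 + ... + 1/n.
double expected_barrier_time(int n) {
  double h = 0;
  for (int k = 1; k <= n; ++k) h += 1.0 / k;
  return h;
}

// Monte Carlo estimate of the mean barrier time with n parallel jobs.
double simulated_barrier_time(int n, int trials, unsigned seed) {
  std::mt19937 gen(seed);
  std::exponential_distribution<double> job_time(1.0);
  double total = 0;
  for (int t = 0; t < trials; ++t) {
    double slowest = 0;
    for (int j = 0; j < n; ++j) slowest = std::max(slowest, job_time(gen));
    total += slowest;  // the barrier waits for the slowest job
  }
  return total / trials;
}
```

So adding jobs per barrier makes each barrier slower in expectation, even though every individual job still takes 1 time unit on average.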
Problem with Messaging
Storage overhead: requires keeping old and new messages [2x overhead].
Redundant messages: in PageRank each vertex sends a copy of its own rank to all neighbors, turning O(|V|) state into O(|E|) messages; a vertex on CPU 1 with three neighbors on CPU 2 sends the same message three times!
Often requires complex protocols: when will my neighbors need information about me?
Unable to constrain neighborhood state: how would you implement graph coloring?
Converge More Slowly
[Figure: runtime in seconds vs. number of CPUs, comparing optimized in-memory bulk synchronous against asynchronous Splash BP; the bulk synchronous version converges more slowly.]
Bulk synchronous computation can be wrong!
Problem
The Problem with Bulk Synchronous Gibbs
Adjacent variables cannot be sampled simultaneously.
Sequential execution (t = 0, 1, 2, 3): preserves the strong positive correlation between adjacent variables.
Parallel execution: sampling adjacent variables at the same time instead produces a strong negative correlation.
The Need for a New Abstraction
If not Pregel, then what?

Data-Parallel (Map Reduce):
    Cross Validation
    Feature Extraction
    Computing Sufficient Statistics

Graph-Parallel (Pregel (Giraph)?):
    Belief Propagation
    SVM
    Kernel Methods
    Deep Belief Networks
    Neural Networks
    Tensor Factorization
    PageRank
    Lasso
What is GraphLab?
The GraphLab Framework
    Graph-Based Data Representation
    Update Functions (User Computation)
    Scheduler
    Consistency Model
Data Graph
A graph with arbitrary data (C++ objects) associated with each vertex and edge.
    Graph: social network
    Vertex data: user profile text, current interest estimates
    Edge data: similarity weights
Comparison with Pregel
    Pregel: data is associated only with vertices.
    GraphLab: data is associated with both vertices and edges.
Update Functions
An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

pagerank(i, scope) {
  // Get neighborhood data (R[i], W_ji, R[j]) from scope

  // Update the vertex data:
  //   R[i] = α + (1 - α) Σ_{j ∈ N[i]} W_ji R[j]

  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}
PageRank in GraphLab2

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double sum = 0;
    foreach(edge_type edge, context.in_edges())
      sum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = fabs(vdata.rank - old_rank) / context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};
Comparison with Pregel
    Pregel: data must be sent to adjacent vertices; the user code describes the movement of data as well as the computation.
    GraphLab: data is read from adjacent vertices; the user code describes only the computation.
The Scheduler
The scheduler determines the order in which vertices are updated.
[Figure: CPU 1 and CPU 2 pull vertices (a, b, ..., k) from a shared scheduler queue; updates may push neighbors back onto the queue.]
The process repeats until the scheduler is empty.
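The loop just described can be sketched in a few lines. This is a hypothetical single-threaded simplification, not the GraphLab API; `Scheduler`, `schedule`, and `run` are illustrative names.

```cpp
#include <deque>
#include <functional>
#include <unordered_set>

// Pop a vertex, run the user's update function on it, and let the update
// function push neighbors back onto the queue; stop when the queue drains.
struct Scheduler {
  std::deque<int> queue;
  std::unordered_set<int> pending;  // avoid scheduling a vertex twice

  void schedule(int v) {
    if (pending.insert(v).second) queue.push_back(v);
  }

  void run(const std::function<void(int, Scheduler&)>& update) {
    while (!queue.empty()) {
      int v = queue.front();
      queue.pop_front();
      pending.erase(v);
      update(v, *this);  // the update may call schedule() on neighbors
    }
  }
};
```

This captures the key property of the abstraction: computation is driven by which vertices still need work, not by global sweeps.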
The GraphLab Framework
    Graph-Based Data Representation
    Update Functions (User Computation)
    Scheduler
    Consistency Model
Ensuring Race-Free Code
How much can computation overlap?
GraphLab Ensures Sequential Consistency
For each parallel execution, there exists a sequential execution of update functions which produces the same result.
[Figure: a parallel execution on CPU 1 and CPU 2 is equivalent over time to a sequential execution on a single CPU.]
Consistency Rules
Guaranteed sequential consistency for all update functions.

Full Consistency
[Figure: full-consistency scopes cover the vertex, its edges, and neighboring vertex data, so overlapping scopes cannot run concurrently.]

Obtaining More Parallelism

Edge Consistency
[Figure: edge-consistency scopes grant write access to the vertex and its edges but only safe read access to adjacent vertices, so CPU 1 and CPU 2 can update nearby vertices concurrently.]
GraphLab is pretty neat!

In Summary ...
Pregel vs. GraphLab
Multicore PageRank (25M vertices, 355M edges)
Pregel [simulated]: synchronous schedule; no skipping [unfair updates comparison]; no combiner [unfair runtime comparison].
[Figure: L1 error vs. runtime (s), and L1 error vs. number of updates, for GraphLab and Pregel.]
Update Count Distribution
[Figure: histogram of number of updates per vertex.]
Most vertices need to be updated infrequently.
Bayesian Tensor Factorization
Gibbs Sampling
Dynamic Block Gibbs Sampling
Matrix Factorization
Lasso
SVM
Belief Propagation
PageRank
CoEM
K-Means
SVD
LDA
…Many others…
Startups using GraphLab
Companies experimenting with GraphLab
Academic projects exploring GraphLab
1600+ unique downloads tracked (possibly many more from direct repository checkouts)
Why do we need a NEW GraphLab?
Natural Graphs

Natural Graphs: Power Law
The top 1% of vertices are adjacent to 53% of the edges!
[Figure: log-log degree distribution of the Yahoo! Web Graph, a "power law".]
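A synthetic Zipf-like degree sequence illustrates how a small fraction of vertices can hold most of the edge endpoints. This is a back-of-the-envelope sketch, not the Yahoo! graph; the function name and the exponent `alpha` are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Fraction of total degree (edge endpoints) covered by the top `top_frac`
// of vertices when vertex k has degree proportional to 1 / k^alpha.
double top_degree_share(int num_vertices, double alpha, double top_frac) {
  std::vector<double> degree(num_vertices);
  for (int k = 0; k < num_vertices; ++k)
    degree[k] = std::pow(k + 1.0, -alpha);  // already sorted descending
  double total = 0;
  for (double d : degree) total += d;
  int top = std::max(1, static_cast<int>(num_vertices * top_frac));
  double covered = 0;
  for (int k = 0; k < top; ++k) covered += degree[k];
  return covered / total;
}
```

With alpha = 1 and 100,000 vertices, the top 1% of vertices already account for roughly 60% of the degree mass, the same heavy-tail behavior the slide reports for the web graph.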
Problem: High Degree Vertices
High-degree vertices limit parallelism:
    they touch a large amount of state,
    require heavy locking,
    and are processed sequentially.
High Degree Vertices are Common
    "Social" networks: popular people
    Netflix (users x movies): popular movies
    LDA (docs x words): the hyperparameters (α, B) connect to every document's (θ, Z, w)
    Word frequencies: common words such as "Obama"
Proposed Four Solutions
    Decomposable update functors: expose greater parallelism by further factoring update functions.
    Commutative-associative update functors: transition from stateless to stateful update functions.
    Abelian group caching (concurrent revisions): allows for controllable races through diff operations.
    Stochastic scopes: reduce degree through sampling.
PageRank in GraphLab

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    // Parallel "Sum" Gather
    double sum = 0;
    foreach(edge_type edge, context.in_edges())
      sum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
    // Atomic Single Vertex Apply
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    // Parallel Scatter [Reschedule]
    double residual = fabs(vdata.rank - old_rank) / context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};
Decomposable Update Functors
Decompose update functions into 3 phases:
    Gather (user-defined): a parallel sum over the scope, Δ1 + Δ2 + Δ3 + ... + Δ
    Apply (user-defined): apply the accumulated value Δ to the center vertex
    Scatter (user-defined): update adjacent edges and vertices, rescheduling as needed
Locks are acquired only for the region within a scope: relaxed consistency.
Factorized PageRank

struct pagerank : public iupdate_functor<graph, pagerank> {
  double accum = 0, residual = 0;
  void gather(icontext_type& context, const edge_type& edge) {
    accum += 1.0 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
  }
  void merge(const pagerank& other) { accum += other.accum; }
  void apply(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double old_value = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
    residual = fabs(vdata.rank - old_value) / context.num_out_edges();
  }
  void scatter(icontext_type& context, const edge_type& edge) {
    if (residual > EPSILON)
      context.schedule(edge.target(), pagerank());
  }
};
Decomposable Execution Model
Split computation across machines:
[Figure: the update for a vertex Y is decomposed into partial functors F1 and F2 run on different machines, then composed.]
Weaker Consistency
Neighboring vertices may be updated simultaneously:
[Figure: adjacent vertices A, B, C gather concurrently; each apply remains atomic.]
Other Decomposable Algorithms
Loopy Belief Propagation:
    Gather: accumulates the product (log sum) of in-messages
    Apply: updates the central belief
    Scatter: computes out-messages and schedules adjacent vertices
Alternating Least Squares (ALS):
[Figure: Netflix users x movies bipartite graph with ratings y1 ... y4; user factors w1, w2 (W) and movie factors x1, x2, x3 (X) are fit so that W x X approximates the ratings matrix.]
Convergent Gibbs Sampling
Cannot be done under weaker consistency:
[Figure: simultaneous gathers at adjacent vertices A, B, C are unsafe for Gibbs sampling.]
Decomposable Functors
Fits many algorithms: Loopy Belief Propagation, Label Propagation, PageRank, ...
Addresses the earlier concerns:
    Large state: distributed gather and scatter
    Heavy locking: fine-grained locking
    Sequential processing: parallel gather and scatter
Problem: does not exploit asynchrony at the vertex level.
Need for Vertex-Level Asynchrony
Exploit a commutative-associative "sum":
    Re-running the full gather over all neighbors is costly for a single change!
    Instead, keep the old (cached) sum at the vertex and fold in only the delta Δ contributed by each changed neighbor.
Commutative-Associative Update

struct pagerank : public iupdate_functor<graph, pagerank> {
  double delta;
  pagerank(double d) : delta(d) { }
  void operator+=(pagerank& other) { delta += other.delta; }
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    vdata.rank += delta;
    if (fabs(delta) > EPSILON) {
      double out_delta = delta * (1 - RESET_PROB) *
                         1.0 / context.num_out_edges();
      context.schedule_out_neighbors(pagerank(out_delta));
    }
  }
};
// Initial rank:     R[i] = 0;
// Initial schedule: pagerank(RESET_PROB);
Scheduling Composes Updates
Calling reschedule_out_neighbors forces update-function composition:
    reschedule_out_neighbors(pagerank(3))
    No pending update:    the queue now holds pagerank(3)
    Pending pagerank(7):  pagerank(3) + pagerank(7) composes to a pending pagerank(10)
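The composition rule can be sketched as follows. This is a hypothetical simplification, not the GraphLab API: the scheduler keeps at most one pending functor per vertex and merges a newly scheduled functor into it with `operator+=`.

```cpp
#include <unordered_map>

// Delta functor: scheduling two of these for the same vertex composes them.
struct pagerank_fn {
  double delta = 0;
  void operator+=(const pagerank_fn& other) { delta += other.delta; }
};

struct delta_scheduler {
  std::unordered_map<int, pagerank_fn> pending;  // vertex -> composed functor

  void schedule(int vertex, pagerank_fn fn) {
    // e.g. pagerank(3) merged into a pending pagerank(7) yields pagerank(10)
    pending[vertex] += fn;
  }
};
```

Because the functor is commutative and associative, the merged functor produces the same result as running the two scheduled updates one after the other.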
Experimental Comparison

Comparison of Abstractions:
Multicore PageRank (25M vertices, 355M edges)
[Figure: L1 error vs. runtime (s) for the GraphLab1, Factorized, and Delta variants.]
Comparison of Abstractions:
Distributed PageRank (25M vertices, 355M edges)
[Figure: runtime (s) and total communication (GB) vs. number of machines (8 CPUs per machine), GL1 (Chromatic) vs. GL2 Delta (Asynchronous).]
PageRank on the Web circa 2000
Invented comparison (number of machines in parentheses):
[Figure: runtime in seconds for GraphLab2 (512), Pegasus (800), Priter (200), Dryad (960).]
Ongoing Work
    Extending all of GraphLab2 to the distributed setting:
        Implemented push-based engines (chromatic)
        Need to build the GraphLab2 distributed locking engine
    Improving storage efficiency of the distributed data-graph
    Porting a large set of Danny's applications
Questions?
http://graphlab.org