Announcements
Thursday 4/23: student presentations on projects; come with a tablet/laptop/etc.
Fri 4/24: more sample questions
Tuesday 4/28: review session for exam
Thursday 4/30: exam, 80 min, in this room; closed book, but you can bring one 8 x 11 sheet (front and back) of notes
Tues May 5th: project writeups due
Slide 3
After next Thursday: next fall!
10-605 will switch to fall, so it's happening F2015.
If you're wondering:
  what looks even better on my c.v. than acing a class called "Machine Learning from Large Datasets"?
  how you can get even more expert in big ML?
  how you can help your fellow students take a really great course on big ML?
consider TA-ing for me next semester
Slide 4
Graph-Based Parallel Computing
William Cohen
Slide 5
Outline
Motivation / where it fits
Sample systems (c. 2010)
  Pregel, and some sample programs (bulk synchronous processing)
  Signal/Collect and GraphLab (asynchronous processing)
GraphLab descendants
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
Slide 6
Problems we've seen so far
Operations on sets of sparse feature vectors:
  classification
  topic modeling
  similarity joins
Graph operations:
  PageRank, personalized PageRank
  semi-supervised learning on graphs
Slide 7
Architectures we've seen so far
Stream-and-sort: limited memory, serial, simple workflows
+ parallelism: map-reduce (Hadoop)
+ abstract operators like join, group: PIG, Hive, ...
+ caching in memory and efficient iteration: Spark, Flink, ...
+ parameter servers (Petuum, ...)
+ ...? One candidate: architectures for graph processing
Slide 8
Architectures we've seen so far
Large immutable data structures on (distributed) disk, processed by sweeping through them and creating new data structures: stream-and-sort, Hadoop, PIG, Hive, ...
Large immutable data structures in distributed memory: Spark distributed tables
Large mutable data structures in distributed memory:
  parameter server: the structure is a hashtable
  today: large mutable graphs
Slide 9
Outline
Motivation / where it fits
Sample systems (c. 2010)
  Pregel, and some sample programs (bulk synchronous processing)
  Signal/Collect and GraphLab (asynchronous processing)
GraphLab descendants
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
Slide 10
GRAPH ABSTRACTIONS: PREGEL (SIGMOD 2010*)
*used internally at least 1-2 years before
Slide 11
Many ML algorithms tend to have:
  sparse data dependencies
  local computations
  iterative updates
Typical example: Gibbs sampling
Slide 12
Example: Gibbs Sampling [Guestrin UAI 2010]
[Figure: a graphical model over variables X1..X9, each connected to a few neighbors.]
1) Sparse data dependencies  2) Local computations  3) Iterative updates
For LDA: the assignment Z_{d,m} for word X_{d,m} depends on the other Z's in document d, and on the topic assignments to other copies of the word X_{d,m}.
Slide 13
Pregel (Google, SIGMOD 2010)
Primary data structure is a graph.
Computation is a sequence of supersteps; in each one, a user-defined function (UDF) is invoked (in parallel) at each vertex v and can get/set its value.
The UDF can also issue requests to get/set edges.
The UDF can read the messages sent to v in the last superstep and schedule messages to be sent in the next superstep.
Halt when every vertex votes to halt.
Output is a directed graph.
Also: aggregators (like ALLREDUCE).
Bulk synchronous processing (BSP) model: all vertex operations happen simultaneously; vertex value changes are communicated only between supersteps.
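To make the superstep model concrete, here is a minimal sketch of a Pregel-style PageRank vertex program plus a toy BSP driver. This is illustrative Python, not Google's actual C++ API; the names (PageRankVertex, compute, send_message, run_bsp) and the fixed superstep cutoff are assumptions for the sketch.

```python
# Illustrative Pregel-style PageRank (Python sketch; not Google's C++ API).
# Each superstep, every vertex reads the messages sent to it in the previous
# superstep, updates its value, and sends new messages along its out-edges.

DAMPING = 0.85
MAX_SUPERSTEPS = 30   # assumed stopping rule for this sketch

class PageRankVertex:
    def __init__(self, vertex_id, out_neighbors):
        self.id = vertex_id
        self.out_neighbors = out_neighbors
        self.value = 1.0          # initial rank
        self.active = True        # becomes False when the vertex votes to halt

    def compute(self, superstep, messages, send_message):
        if superstep > 0:
            self.value = (1 - DAMPING) + DAMPING * sum(messages)
        if superstep < MAX_SUPERSTEPS:
            share = self.value / max(len(self.out_neighbors), 1)
            for j in self.out_neighbors:
                send_message(j, share)
        else:
            self.active = False   # vote to halt

def run_bsp(vertices):
    """Toy BSP driver: within a superstep, compute() sees only the messages
    produced in the previous superstep (bulk synchronous processing)."""
    inbox = {v.id: [] for v in vertices}
    superstep = 0
    while any(v.active for v in vertices):
        outbox = {v.id: [] for v in vertices}
        send = lambda dst, msg: outbox[dst].append(msg)
        for v in vertices:
            v.compute(superstep, inbox[v.id], send)
        inbox, superstep = outbox, superstep + 1
```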
Slide 14
Pregel (Google, SIGMOD 2010)
One master partitions the graph among workers.
Workers keep their graph shard in memory.
Messages to other partitions are buffered.
Communication across partitions is expensive; within a partition it is cheap.
The quality of the partition makes a difference!
Slide 15
Everyone computes in parallel; simplest rule: stop when everyone votes to halt.
Slide 16
Streaming PageRank: with some long rows
Repeat until converged:
  let v^(t+1) = c*u + (1-c)*W*v^(t)
Store A as a list of edges; each line is "i d(i) j" (source, out-degree of source, destination).
Store v and v' in memory; v starts out as c*u.
For each line "i d j": v'[j] += (1-c)*v[i]/d
(We need to get the degree of i and store it locally.)
Recap from 3/17; note that we need to scan through the graph each time.
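For comparison with the graph systems below, here is what that scan looks like in code: a hedged sketch of one streaming pass, assuming the edge-file format above ("i d(i) j" per line); the function name and convergence handling are illustrative.

```python
# One streaming PageRank pass over an edge list on disk (illustrative sketch).
# Assumed line format: "i d j" = source vertex, out-degree of source, destination.
def pagerank_pass(edge_file, v, c):
    n = len(v)
    v_next = [c / n] * n                  # the teleport term c*u, with u uniform
    with open(edge_file) as f:
        for line in f:
            i, d, j = line.split()
            i, d, j = int(i), int(d), int(j)
            v_next[j] += (1 - c) * v[i] / d
    return v_next

# Repeat pagerank_pass until v stops changing; note that every iteration
# re-reads the whole graph from disk, which is exactly the cost the systems
# discussed below try to avoid.
```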
Slide 17
Slide 18
Another task: single-source shortest path (edges now carry an edge weight).
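A hedged sketch of how single-source shortest path might look as a Pregel-style vertex program, reusing the toy machinery from the PageRank sketch above; a real runtime would re-activate a halted vertex when a message arrives, which the toy driver does not do.

```python
# Illustrative Pregel-style single-source shortest path (Python sketch).
INF = float("inf")

class ShortestPathVertex:
    def __init__(self, vertex_id, out_edges, is_source=False):
        self.id = vertex_id
        self.out_edges = out_edges        # list of (neighbor_id, edge_weight)
        self.is_source = is_source
        self.dist = INF                   # best-known distance from the source
        self.active = True

    def compute(self, superstep, messages, send_message):
        candidate = 0.0 if self.is_source else INF
        if messages:
            candidate = min(candidate, min(messages))
        if candidate < self.dist:
            self.dist = candidate
            # Relax all out-edges: offer each neighbor a path through this vertex.
            for j, w in self.out_edges:
                send_message(j, self.dist + w)
        # Vote to halt; a real Pregel runtime re-activates the vertex
        # whenever a new message arrives for it.
        self.active = False
```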
Slide 19
a little bit of a cheat
Slide 20
Many Graph-Parallel Algorithms
Collaborative filtering: alternating least squares, stochastic gradient descent, tensor factorization
Structured prediction: loopy belief propagation, max-product linear programs, Gibbs sampling
Semi-supervised ML: graph SSL, CoEM
Community detection: triangle counting, k-core decomposition, k-truss
Graph analytics: PageRank, personalized PageRank, shortest path, graph coloring
Classification: neural networks
Slide 21
Low-Rank Matrix Factorization: Recommending Products
[Figure: the Netflix ratings matrix (users x movies) is approximated by user factors U and movie factors M; each observed rating r_ij links user factor f(i) to movie factor f(j).]
Iterate: update each user factor f(i) and each movie factor f(j) from the ratings incident to it.
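The "Iterate:" equations did not survive the slide export; as a stand-in, here is a hedged sketch of one alternating-least-squares sweep. The regularizer and exact update are assumptions, but the graph-parallel structure is the point: each user or movie factor is refit using only the ratings (edges) incident to it.

```python
import numpy as np

def als_sweep(ratings_by_user, ratings_by_movie, U, M, lam=0.1):
    """One ALS sweep (illustrative). ratings_by_user[i] = [(movie j, r_ij), ...],
    ratings_by_movie[j] = [(user i, r_ij), ...]; U and M hold one length-k
    factor per user / movie. lam is an assumed ridge regularizer."""
    k = U.shape[1]
    # Fix the movie factors M and solve a small least-squares problem per user...
    for i, rated in ratings_by_user.items():
        A = np.vstack([M[j] for j, _ in rated])
        b = np.array([r for _, r in rated])
        U[i] = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)
    # ...then fix U and do the same per movie.
    for j, rated in ratings_by_movie.items():
        A = np.vstack([U[i] for i, _ in rated])
        b = np.array([r for _, r in rated])
        M[j] = np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)
    return U, M
```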
Slide 22
Outline
Motivation / where it fits
Sample systems (c. 2010)
  Pregel, and some sample programs (bulk synchronous processing)
  Signal/Collect and GraphLab (asynchronous processing)
GraphLab descendants
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
Signal/Collect model vs Pregel
Integrated with RDF/SPARQL.
Vertices can be of non-uniform types.
Vertex: id, mutable state, outgoing edges, most recently received signals (map: neighbor id → signal), uncollected signals; user-defined collect() function.
Edge: id, source, dest; user-defined signal() function.
Allows asynchronous computation via v.scoreSignal, v.scoreCollect.
For data-flow operations.
On a multicore architecture: shared memory for workers.
Slide 25
Signal/Collect model
Signals are made available in a list and a map; the next state for a vertex is the output of the collect() operation.
(We'll relax the fixed num_iterations soon.)
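A hedged sketch of that decomposition for PageRank: the edge-side signal() turns the source vertex's state into a message, and the vertex-side collect() folds the received signals into the next state. The Python names mirror the slide loosely and are not the real Scala API.

```python
# Illustrative Signal/Collect-style PageRank (Python sketch, not the Scala API).
DAMPING = 0.85

def signal(edge, vertices):
    """Edge UDF: the signal an edge sends from its source to its target."""
    src = vertices[edge["source"]]
    return src["state"] / src["out_degree"]

def collect(vertex, signals):
    """Vertex UDF: the next state, computed from the received signals."""
    return (1 - DAMPING) + DAMPING * sum(signals)

def run(vertices, edges, num_iterations=20):
    # vertices: id -> {"state": float, "out_degree": int}
    # edges: list of {"source": id, "target": id}
    # Synchronous for now; scoreSignal / scoreCollect would let the framework
    # skip edges and vertices with nothing new to say and go asynchronous.
    for _ in range(num_iterations):
        inbox = {vid: [] for vid in vertices}
        for e in edges:                          # signal phase (per edge)
            inbox[e["target"]].append(signal(e, vertices))
        for vid, v in vertices.items():          # collect phase (per vertex)
            v["state"] = collect(v, inbox[vid])
```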
Signal/Collect model vs Pregel
Integrated with RDF/SPARQL.
Vertices can be of non-uniform types.
Vertex: id, mutable state, outgoing edges, most recently received signals (map: neighbor id → signal), uncollected signals; user-defined collect() function.
Edge: id, source, dest; user-defined signal() function.
Allows asynchronous computation via v.scoreSignal, v.scoreCollect.
For data-flow operations.
Slide 34
Asynchronous Parallel Computation
Bulk-synchronous: all vertices update in parallel; need to keep a copy of both old and new vertex values.
Asynchronous:
  Reason 1: if two vertices are not connected, they can be updated in any order (more flexibility, less storage).
  Reason 2: not all updates are equally important; parts of the graph converge quickly, parts slowly.
Outline
Motivation / where it fits
Sample systems (c. 2010)
  Pregel, and some sample programs (bulk synchronous processing)
  Signal/Collect and GraphLab (asynchronous processing)
GraphLab descendants
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
Slide 39
GRAPH ABSTRACTIONS: GRAPHLAB (UAI, 2010)
Guestrin, Gonzalez, Bikel, etc. Many slides below pilfered from Carlos or Joey.
Slide 40
GraphLab
Data lives in the graph; a UDF vertex function does the computation.
Differences from Pregel:
  some control over scheduling: the vertex function can insert new tasks into a queue
  messages must follow graph edges: a vertex can access adjacent vertices only
  shared data table for global data
  library algorithms for matrix factorization, CoEM, SVM, Gibbs, ...
GraphLab is now Dato.
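A hedged sketch of what a GraphLab-1-style update function plus scheduler might look like, again for PageRank: the update function only touches its vertex's neighborhood and pushes newly dirtied neighbors onto a task queue, so there are no global supersteps. The function and variable names here are illustrative, not the real C++ API.

```python
from collections import deque

DAMPING, TOL = 0.85, 1e-4

def pagerank_update(v, rank, in_nbrs, out_nbrs, schedule):
    """GraphLab-style update: recompute v's rank from its in-neighbors and
    reschedule out-neighbors only if the value moved enough to matter."""
    old = rank[v]
    total = sum(rank[u] / max(len(out_nbrs[u]), 1) for u in in_nbrs[v])
    rank[v] = (1 - DAMPING) + DAMPING * total
    if abs(rank[v] - old) > TOL:
        for w in out_nbrs[v]:
            schedule(w)                 # insert new tasks into the queue

def run(in_nbrs, out_nbrs):
    # in_nbrs / out_nbrs: dicts mapping vertex id -> list of neighbor ids
    rank = {v: 1.0 for v in in_nbrs}
    queue, queued = deque(in_nbrs), set(in_nbrs)
    def schedule(w):
        if w not in queued:
            queued.add(w)
            queue.append(w)
    while queue:                        # asynchronous: no global supersteps
        v = queue.popleft()
        queued.discard(v)
        pagerank_update(v, rank, in_nbrs, out_nbrs, schedule)
    return rank
```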
Slide 41
Graphical Model Learning
[Chart: approximation quality over time for the Priority schedule and the Splash schedule, relative to Optimal; better schedules approach optimal faster.]
15.5x speedup on 16 CPUs.
On a multicore architecture: shared memory for workers.
CoEM (Rosie Jones, 2005)
Named entity recognition task: is "Dog" an animal? Is "Catalina" a place?
[Figure: graph linking noun phrases ("the dog", "Australia", "Catalina Island") to contexts ("ran quickly", "travelled to", "is pleasant").]
Graph sizes:
         Vertices   Edges
  Small  0.2M       20M
  Large  2M         200M
Hadoop: 95 cores, 7.5 hrs.
Slide 44
CoEM (Rosie Jones, 2005), small and large graphs:
  Hadoop: 95 cores, 7.5 hrs
  GraphLab: 16 cores, 30 min
15x faster with 6x fewer CPUs!
Slide 45
GRAPH ABSTRACTIONS: GRAPHLAB CONTINUED.
Slide 46
Outline
Motivation / where it fits
Sample systems (c. 2010)
  Pregel, and some sample programs (bulk synchronous processing)
  Signal/Collect and GraphLab (asynchronous processing)
GraphLab descendants
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
Slide 47
GraphLab's descendants: PowerGraph, GraphChi, GraphX
On a multicore architecture: shared memory for workers.
On a cluster architecture (like Pregel): different memory spaces.
What are the challenges in moving away from shared memory?
Slide 48
Natural Graphs: Power Law
Top 1% of vertices is adjacent to 53% of the edges!
Altavista web graph: 1.4B vertices, 6.7B edges; power-law slope ≈ 2.
GraphLab group/Aapo
Slide 49
Problem: high-degree vertices limit parallelism.
A high-degree vertex:
  touches a large fraction of the graph (GraphLab 1)
  produces many messages (Pregel, Signal/Collect)
  has edge information too large for a single machine
Asynchronous consistency requires heavy locking (GraphLab 1); synchronous consistency is prone to stragglers (Pregel).
GraphLab group/Aapo
Slide 50
PowerGraph
Problem: GraphLab's localities can be large; the set of all neighbors of a node is huge for hubs (high-indegree nodes).
Approach:
  a new graph partitioning algorithm that can replicate vertex data
  a gather-apply-scatter API for finer-grained parallelism:
    gather ~ combiner
    apply ~ vertex UDF (result applied to all replicas)
    scatter ~ messages from the vertex to its edges
Slide 51
Factorized Vertex Updates
Split the update into three phases:
  Gather: a parallel sum over the scope of vertex Y (accumulate contributions along its edges)
  Apply: locally apply the accumulated value to the vertex Y
  Scatter: update neighbors; data-parallel over edges
GraphLab group/Aapo
Slide 52
PageRank in PowerGraph
PageRankProgram(i):
  Gather(j → i): return w_ji * R[j]
  sum(a, b): return a + b
  Apply(i, Σ): R[i] = β + (1 - β) * Σ
  Scatter(i → j): if R[i] changed then activate(j)
GraphLab group/Aapo
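Reading the program above as plain functions makes the decomposition explicit; the sketch below is an illustrative Python rendering (β for the teleport term and the per-machine gather loop are assumptions). The key property is that gather's sum is associative, so each machine holding a replica of a high-degree vertex can sum its local in-edges and ship only one partial result.

```python
# Gather-apply-scatter PageRank, following the slide's pseudocode (illustrative).
BETA = 0.15   # teleport probability; stands in for the slide's beta symbol

def gather(j, i, w_ji, R):                 # per in-edge j -> i
    return w_ji * R[j]

def gather_sum(a, b):                      # must be associative/commutative
    return a + b

def apply_update(i, acc, R):               # once per vertex, on the master copy
    R[i] = BETA + (1 - BETA) * acc

def scatter(i, j, changed, activate):      # per out-edge i -> j
    if changed:
        activate(j)

def gather_on_machine(i, local_in_edges, R):
    """Each machine sums only the in-edges of i that it stores locally; the
    per-machine partial sums are then combined with gather_sum."""
    acc = 0.0
    for j, w in local_in_edges:
        acc = gather_sum(acc, gather(j, i, w, R))
    return acc
```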
Slide 53
Distributed Execution of a PowerGraph Vertex-Program
[Figure: a vertex Y replicated across machines 1-4; Gather runs in parallel on each replica, the partial sums are combined, Apply updates Y, and Scatter pushes the result back out along Y's edges.]
GraphLab group/Aapo
Slide 54
Minimizing Communication in PowerGraph
A vertex-cut minimizes the number of machines each vertex spans.
Communication is linear in the number of machines each vertex spans.
Percolation theory suggests that power-law graphs have good vertex cuts [Albert et al. 2000].
GraphLab group/Aapo
Slide 55
Partitioning Performance
Twitter graph: 41M vertices, 1.4B edges.
[Chart: partition cost vs. construction time for Random, Oblivious, and Coordinated partitioning; Oblivious balances partition quality and partitioning time.]
GraphLab group/Aapo
Slide 56
Partitioning matters. GraphLab group/Aapo
Slide 57
Outline
Motivation / where it fits
Sample systems (c. 2010)
  Pregel, and some sample programs (bulk synchronous processing)
  Signal/Collect and GraphLab (asynchronous processing)
GraphLab descendants
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark
GraphLab con't: PowerGraph, GraphChi
GraphChi goal: use the graph abstraction on disk, not in memory, on a conventional workstation.
[Stack diagram: GraphLab's general-purpose API over MPI/TCP-IP, PThreads, and Hadoop/HDFS on Linux cluster services (Amazon AWS); applications include graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering.]
Slide 60
GraphLab con't: GraphChi
Key insight: some algorithms on graphs are streamable (e.g., PageRank-Nibble).
In general we can't easily stream the graph, because a vertex's neighbors will be scattered across the file;
but maybe we can limit the degree to which they're scattered, enough to make streaming possible?
Almost-streaming: keep P cursors in a file instead of one.
Slide 61
PSW: Shards and Intervals
Vertices are numbered from 1 to n.
P intervals, each associated with a shard on disk; a sub-graph = one interval of vertices.
[Figure: the vertex range 1..n split into interval(1)..interval(P), with shard(1)..shard(P) on disk.]
1. Load  2. Compute  3. Write
Slide 62
PSW: Layout
Shards are small enough to fit in memory; balance the sizes of the shards.
A shard holds the in-edges for one interval of vertices, sorted by source id.
Example: Shard 1 = in-edges for vertices 1..100, sorted by source_id; the intervals are vertices 1..100, 101..700, 701..1000, 1001..10000 (Shards 1-4).
1. Load  2. Compute  3. Write
Slide 63
PSW: Loading Sub-graph
Load the subgraph for vertices 1..100: load all of its in-edges (Shard 1, sorted by source_id) into memory.
What about out-edges? They are arranged in sequence in the other shards (Shards 2-4), because every shard is sorted by source id.
1. Load  2. Compute  3. Write
Slide 64
PSW: Loading Sub-graph (continued)
Load the subgraph for vertices 101..700: load all of its in-edges (Shard 2) into memory, plus the corresponding out-edge blocks from the other shards.
1. Load  2. Compute  3. Write
Slide 65
PSW Load-Phase
Only P large reads for each interval; P² reads on one full pass.
1. Load  2. Compute  3. Write
Slide 66
PSW: Execute updates
The update function is executed on the interval's vertices.
Edges have pointers to the loaded data blocks; changes take effect immediately (asynchronous).
1. Load  2. Compute  3. Write
Slide 67
PSW: Commit to Disk
In the write phase, the blocks are written back to disk; the next load-phase sees the preceding writes (asynchronous).
In total: P² reads and writes per full pass over the graph. Performs well on both SSD and hard drive.
To make this work: the size of a vertex's state can't change when it's updated (at least, as stored on disk).
1. Load  2. Compute  3. Write
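To tie the three phases together, here is a toy sketch of one parallel-sliding-windows pass. It keeps everything in Python lists and does linear scans, so it only illustrates the bookkeeping (P shards of in-edges sorted by source, one interval at a time, roughly P reads per interval and P² per full pass); the data layout and update signature are assumptions.

```python
# Toy parallel-sliding-windows pass (illustrative; real GraphChi shard files,
# compression, and parallel execution of the updates are omitted).
def psw_pass(shards, intervals, update):
    """shards[p]    = list of (src, dst, value) edges with dst in intervals[p],
                      sorted by src
       intervals[p] = (lo, hi) vertex-id range of interval p
       update(v, in_edges, out_edges) = user's vertex update function"""
    P = len(shards)
    for p, (lo, hi) in enumerate(intervals):
        # 1. Load: the whole shard p (the in-edges of this interval) ...
        in_edges = shards[p]
        # ... plus, from each other shard, the contiguous block of edges whose
        # source lies in [lo, hi] (this interval's out-edges). That is about P
        # sequential reads per interval, about P*P per full pass.
        out_edges = [e for q in range(P) if q != p
                       for e in shards[q] if lo <= e[0] <= hi]
        # 2. Compute: run the update function on every vertex in the interval.
        for v in range(lo, hi + 1):
            update(v,
                   [e for e in in_edges if e[1] == v],
                   [e for e in out_edges if e[0] == v])
        # 3. Write: modified edge blocks would be written back to disk here.
```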
Slide 68
Experiment Setting
Mac Mini (Apple Inc.): Intel Core i5, 2.5 GHz; 8 GB RAM; 256 GB SSD, 1 TB hard drive.
Experiment graphs:
  Graph         Vertices  Edges  P (shards)  Preprocessing
  live-journal  4.8M      69M    3           0.5 min
  netflix       0.5M      99M    20          1 min
  twitter-2010  42M       1.5B   20          2 min
  uk-2007-05    106M      3.7B   40          31 min
  uk-union      133M      5.4B   50          33 min
  yahoo-web     1.4B      6.6B   50          37 min
Slide 69
Comparison to Existing Systems
Notes: the comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.
Benchmarks: PageRank, WebGraph belief propagation (U Kang et al.), matrix factorization (alternating least squares), triangle counting. See the paper for more comparisons.
On a Mac Mini, GraphChi can solve problems as big as those handled by existing large-scale systems, with comparable performance.
Slide 70
Outline
Motivation / where it fits
Sample systems (c. 2010)
  Pregel, and some sample programs (bulk synchronous processing)
  Signal/Collect and GraphLab (asynchronous processing)
GraphLab descendants
  PowerGraph: partitioning
  GraphChi: graphs w/o parallelism
  GraphX: graphs over Spark (Gonzalez)
Slide 71
GraphLab's descendants: PowerGraph, GraphChi, GraphX
GraphX: an implementation of GraphLab's API on top of Spark.
Motivations: avoid transfers between subsystems; leverage a larger community for common infrastructure.
What's different: graphs are now immutable, and operations transform one graph into another (RDD → RDG, "resilient distributed graph").
Slide 72
Idea 1: Graph as Tables
A property graph is stored as two tables:
  Vertex Property Table (V)
    rxin      (Stu., Berk.)
    jegonzal  (PstDoc, Berk.)
    franklin  (Prof., Berk.)
    istoica   (Prof., Berk.)
  Edge Property Table (E)
    SrcId     DstId     Property
    rxin      jegonzal  Friend
    franklin  rxin      Advisor
    istoica   franklin  Coworker
    franklin  jegonzal  PI
Under the hood things can be split even more finely, e.g. a vertex map table + vertex data table. Operators maximize structure sharing and minimize communication.
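A hedged sketch of the same idea in plain Python (GraphX itself is Scala over Spark RDDs): the graph is just a vertex property table plus an edge property table, and the "triplets" view is a join of the edge table with the vertex table on both endpoints.

```python
# "Graph as tables", with plain dicts/lists standing in for distributed RDDs.
vertex_table = {
    "rxin":     ("Stu.", "Berk."),
    "jegonzal": ("PstDoc", "Berk."),
    "franklin": ("Prof.", "Berk."),
    "istoica":  ("Prof.", "Berk."),
}
edge_table = [
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI"),
]

def triplets(vertex_table, edge_table):
    """The triplets view: each edge joined with both endpoints' properties.
    In GraphX this join is distributed, and operators are arranged to share
    the underlying tables rather than copy them."""
    for src, dst, e_prop in edge_table:
        yield (src, vertex_table[src]), e_prop, (dst, vertex_table[dst])
```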
Slide 73
Operators
Table (RDD) operators are inherited from Spark:
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
The GraphX Stack (Lines of Code)
GraphX (3,575) on top of Spark
Interfaces: Pregel (28) + GraphLab (50)
Algorithms: PageRank (5), Connected Components (10), Shortest Path (10), ALS (40), LDA (120), K-core (51), Triangle Count (45), SVD (40)
Slide 76
Performance Comparisons
GraphX is roughly 3x slower than GraphLab (Live-Journal: 69 million edges).
Slide 77
Summary
Large immutable data structures on (distributed) disk, processed by sweeping through them and creating new data structures: stream-and-sort, Hadoop, PIG, Hive, ...
Large immutable data structures in distributed memory: Spark distributed tables.
Large mutable data structures in distributed memory:
  parameter server: the structure is a hashtable
  Pregel, GraphLab, GraphChi, GraphX: the structure is a graph
Slide 78
Summary
The APIs of the various systems vary in detail but have a similar flavor:
  typical algorithms iteratively update vertex state;
  changes in state are communicated with messages, which need to be aggregated from neighbors.
The biggest wins are on problems where the graph is fixed in each iteration but the vertex data changes, and on graphs small enough to fit in (distributed) memory.
Slide 79
Some things to take away
Platforms for iterative operations on graphs:
  GraphX: if you want to integrate with Spark
  GraphChi: if you don't have a cluster
  GraphLab/Dato: if you don't need free software and performance is crucial
  Pregel: if you work at Google
  Giraph, Signal/Collect, ...
Important differences:
  intended architecture: shared memory and threads, distributed cluster memory, or graph on disk
  how graphs are partitioned for clusters
  whether processing is synchronous or asynchronous