Graph-Based Parallel Computing (William Cohen)
Transcript
  • Slide 1
  • Graph-Based Parallel Computing William Cohen 1
  • Slide 2
  • Announcements: Thursday 4/23: student presentations on projects (come with a tablet/laptop/etc.). Fri 4/24: more sample questions. Tuesday 4/28: review session for exam. Thursday 4/30: exam (80 min, in this room; closed book, but you can bring one 8 x 11 sheet of notes, front and back). Tues May 5th: project writeups due. 2
  • Slide 3
  • After next Thursday: next fall! 10-605 will switch to fall, so it's happening F2015. If you're wondering what looks even better on my c.v. than acing a class called "Machine Learning from Large Datasets": how you can get even more expert in big ML, and how you can help your fellow students take a really great course on big ML. Consider TA-ing for me next semester. 3
  • Slide 4
  • Graph-Based Parallel Computing William Cohen 4
  • Slide 5
  • Outline Motivation/where it fits Sample systems (c. 2010) Pregel: and some sample programs Bulk synchronous processing Signal/Collect and GraphLab Asynchronous processing GraphLab descendants PowerGraph: partitioning GraphChi: graphs w/o parallelism GraphX: graphs over Spark 5
  • Slide 6
  • Problems we've seen so far. Operations on sets of sparse feature vectors: classification, topic modeling, similarity joins. Graph operations: PageRank, personalized PageRank, semi-supervised learning on graphs. 6
  • Slide 7
  • Architectures we've seen so far. Stream-and-sort: limited-memory, serial, simple workflows; + parallelism: map-reduce (Hadoop); + abstract operators like join, group: PIG, Hive, ...; + caching in memory and efficient iteration: Spark, Flink, ...; + parameter servers (Petuum, ...); + ...? One candidate: architectures for graph processing. 7
  • Slide 8
  • Architectures we've seen so far. Large immutable data structures on (distributed) disk, processed by sweeping through them and creating new data structures: stream-and-sort, Hadoop, PIG, Hive, ... Large immutable data structures in distributed memory: Spark distributed tables. Large mutable data structures in distributed memory: parameter server, where the structure is a hashtable. Today: large mutable graphs. 8
  • Slide 9
  • Outline Motivation/where it fits Sample systems (c. 2010) Pregel: and some sample programs Bulk synchronous processing Signal/Collect and GraphLab Asynchronous processing GraphLab descendants PowerGraph: partitioning GraphChi: graphs w/o parallelism GraphX: graphs over Spark 9
  • Slide 10
  • GRAPH ABSTRACTIONS: PREGEL (SIGMOD 2010*) *Used internally at least 1-2 years before 10
  • Slide 11
  • Many ML algorithms tend to have: sparse data dependencies, local computations, iterative updates. Typical example: Gibbs sampling. 11
  • Slide 12
  • Example: Gibbs Sampling [Guestrin UAI 2010]. [Figure: an MRF over variables X1..X9 illustrating] 1) sparse data dependencies, 2) local computations, 3) iterative updates. For LDA: Z(d,m) for X(d,m) depends on the other Z's in doc d, and on topic assignments to copies of word X(d,m). 12
  • Slide 13
  • Pregel (Google, SIGMOD 2010). Primary data structure is a graph. Computations are a sequence of supersteps; in each, a user-defined function is invoked (in parallel) at each vertex v and can get/set its value. The UDF can also issue requests to get/set edges, and can read messages sent to v in the last superstep and schedule messages to send to other vertices in the next superstep. Halt when every vertex votes to halt. Output is a directed graph. Also: aggregators (like ALLREDUCE). Bulk synchronous parallel (BSP) model: all vertex operations happen simultaneously, and vertex value changes are communicated via messages between supersteps. 13
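To make the model concrete, here is a minimal, self-contained sketch of a Pregel-style superstep loop in plain Scala (this is not Google's actual C++ API; the graph representation and names are made up). The vertex program is single-source shortest path: each superstep, a vertex reads the messages sent to it in the previous superstep, updates its value if a better distance arrived, and sends messages along its out-edges; everything halts when no messages remain in flight.

    object PregelSketch {
      type VId = Int
      case class Edge(dst: VId, weight: Double)

      def sssp(outEdges: Map[VId, Seq[Edge]], source: VId): Map[VId, Double] = {
        val vertices = outEdges.keySet ++ outEdges.values.flatten.map(_.dst)
        val dist = scala.collection.mutable.Map[VId, Double]() ++
          vertices.map(_ -> Double.PositiveInfinity)
        dist(source) = 0.0
        // superstep 0: the source sends (its distance + edge weight) to each out-neighbor
        var inbox: Map[VId, Seq[Double]] =
          outEdges.getOrElse(source, Nil).groupBy(_.dst)
            .map { case (d, es) => d -> es.map(e => 0.0 + e.weight) }
        while (inbox.nonEmpty) {                 // every vertex votes to halt once no messages arrive
          val outbox = scala.collection.mutable.Map[VId, List[Double]]().withDefaultValue(Nil)
          for ((v, msgs) <- inbox) {             // conceptually runs in parallel over vertices
            val candidate = msgs.min
            if (candidate < dist(v)) {           // a better distance arrived: update and re-signal
              dist(v) = candidate
              for (e <- outEdges.getOrElse(v, Nil))
                outbox(e.dst) = (candidate + e.weight) :: outbox(e.dst)
            }
          }
          inbox = outbox.toMap                   // messages become visible only in the next superstep
        }
        dist.toMap
      }
    }

The BSP discipline shows up in the last line of the loop: updates made during a superstep are exchanged only at the superstep boundary.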
  • Slide 14
  • Pregel (Google, SIGMOD 2010). One master partitions the graph among workers; each worker keeps its graph shard in memory. Messages to other partitions are buffered. Communication across partitions is expensive, within partitions it is cheap, so the quality of the partition makes a difference! 14
  • Slide 15
  • [Figure: Pregel execution model.] Everyone computes in parallel; simplest rule: stop when everyone votes to halt. 15
  • Slide 16
  • Streaming PageRank: with some long rows. Repeat until converged: let v(t+1) = c*u + (1-c)*W*v(t). Store the graph as a list of edges: each line is "i d(i) j". Store v and v' in memory; v starts out as c*u. For each line "i d j": v'[j] += (1-c)*v[i]/d. (We need to get the degree of i and store it locally.) Recap from 3/17; note we need to scan through the graph each time. 16
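As a reminder of what that scan looks like in code, here is a minimal sketch of one pass (plain Scala; the file name is a placeholder, u is assumed to be the uniform vector, and only the two rank vectors are held in memory while the edge list streams from disk):

    import scala.io.Source

    def pageRankPass(edgeFile: String, v: Array[Double], c: Double): Array[Double] = {
      val n = v.length
      val vNext = Array.fill(n)(c / n)                  // v' starts out as c*u (uniform jump)
      val src = Source.fromFile(edgeFile)
      try {
        for (line <- src.getLines()) {
          val Array(i, d, j) = line.trim.split("\\s+")  // "i d(i) j": source, its out-degree, target
          vNext(j.toInt) += (1 - c) * v(i.toInt) / d.toDouble
        }
      } finally src.close()
      vNext        // repeat until ||vNext - v|| is small; each iteration is one sequential scan
    }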
  • Slide 17
  • 17
  • Slide 18
  • Another task: single-source shortest path (the figure shows edge weights on the graph). 18
  • Slide 19
  • a little bit of a cheat 19
  • Slide 20
  • Many graph-parallel algorithms. Collaborative filtering: alternating least squares, stochastic gradient descent, tensor factorization. Structured prediction: loopy belief propagation, max-product linear programs, Gibbs sampling. Semi-supervised ML: graph SSL, CoEM. Community detection: triangle counting, k-core decomposition, k-truss. Graph analytics: PageRank, personalized PageRank, shortest path, graph coloring. Classification: neural networks. 20
  • Slide 21
  • Low-Rank Matrix Factorization: recommending products. [Figure: the Netflix users x movies rating matrix is factored into user factors U and movie factors M; each observed rating r(i,j) links a user factor f(i) and a movie factor f(j); iterate to fit the factors.] 21
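The update equations on this slide did not survive extraction. For reference, a standard alternating least squares formulation consistent with the picture (ratings $r_{ij}$, user factors $u_i = f(i)$, movie factors $m_j = f(j)$; the regularization weight $\lambda$ is an assumption here) is

$$\min_{U,M}\ \sum_{(i,j)\ \mathrm{observed}} \bigl(r_{ij} - u_i^\top m_j\bigr)^2 \;+\; \lambda\bigl(\lVert U\rVert_F^2 + \lVert M\rVert_F^2\bigr),$$

iterated by alternating the closed-form updates

$$u_i \leftarrow \Bigl(\sum_{j} m_j m_j^\top + \lambda I\Bigr)^{-1} \sum_{j} r_{ij}\, m_j,
\qquad
m_j \leftarrow \Bigl(\sum_{i} u_i u_i^\top + \lambda I\Bigr)^{-1} \sum_{i} r_{ij}\, u_i,$$

where each sum runs over the observed ratings for that user or movie.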
  • Slide 22
  • Outline Motivation/where it fits Sample systems (c. 2010) Pregel: and some sample programs Bulk synchronous processing Signal/Collect and GraphLab Asynchronous processing GraphLab descendants PowerGraph: partitioning GraphChi: graphs w/o parallelism GraphX: graphs over Spark 22
  • Slide 23
  • GRAPH ABSTRACTIONS: SIGNAL/COLLECT (SEMANTIC WEB CONFERENCE, 2010) Stutz, Strebel, Bernstein, Univ Zurich 23
  • Slide 24
  • Signal/collect model vs Pregel. Integrated with RDF/SPARQL; vertices can be of non-uniform types. Vertex: id, mutable state, outgoing edges, most recently received signals (map: neighbor id -> signal), uncollected signals, and a user-defined collect function. Edge: id, source, dest, and a user-defined signal function. Allows asynchronous computation via v.scoreSignal and v.scoreCollect. Supports data-flow operations. On a multicore architecture: shared memory for workers. 24
  • Slide 25
  • Signal/collect model. Signals are made available in a list and a map; the next state for a vertex is the output of the collect() operation. (We will relax num_iterations soon.) 25
  • Slide 26
  • Signal/collect examples Single-source shortest path 26
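A minimal sketch of the idea in plain Scala (this is not the actual Signal/Collect framework API; the synchronous driver loop and names are made up, and edge weights are assumed non-negative). Each edge computes a signal from its source vertex's state; each vertex collects the signals it received into its next state:

    case class SCEdge(src: Int, dst: Int, weight: Double)

    // signal(): computed per edge from the source vertex's state
    def signal(state: Map[Int, Double], e: SCEdge): Double = state(e.src) + e.weight

    // collect(): combines the old state with the received signals
    def collect(oldState: Double, signals: Iterable[Double]): Double =
      if (signals.isEmpty) oldState else math.min(oldState, signals.min)

    def ssspSignalCollect(vertices: Set[Int], edges: Seq[SCEdge], source: Int): Map[Int, Double] = {
      var state = vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap
      var changed = true
      while (changed) {                        // synchronous rounds for simplicity
        val inbox = edges.filter(e => state(e.src) < Double.PositiveInfinity)
                         .groupBy(_.dst)
                         .map { case (v, es) => v -> es.map(e => signal(state, e)) }
        val next = state.map { case (v, s) => v -> collect(s, inbox.getOrElse(v, Nil)) }
        changed = next != state
        state = next
      }
      state
    }

In the real system, v.scoreSignal and v.scoreCollect let the scheduler decide which edges signal and which vertices collect next, which is what enables the asynchronous execution discussed on the later slides.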
  • Slide 27
  • Signal/collect examples PageRank Life 27
  • Slide 28
  • PageRank + Preprocessing and Graph Building 28
  • Slide 29
  • Signal/collect examples Co-EM/wvRN/Harmonic fields 29
  • Slide 30
  • Signal/collect examples. Matching path queries: dept(X) -[member]-> postdoc(Y) -[received]-> grant(Z). [Figure: a small graph over the nodes LTI, MLD, wcohen, partha, NSF378, InMind7.] 30
  • Slide 31
  • Signal/collect examples: data flow. Matching path queries: dept(X) -[member]-> postdoc(Y) -[received]-> grant(Z). [Figure: the query is propagated as signals over the same graph, e.g. dept(X=MLD) -[member]-> postdoc(Y) -[received]-> grant(Z) and dept(X=LTI) -[member]-> postdoc(Y) -[received]-> grant(Z).] Note: there can be multiple input signals. 31
  • Slide 32
  • Signal/collect examples. Matching path queries: dept(X) -[member]-> postdoc(Y) -[received]-> grant(Z). [Figure: the binding is propagated one step further, e.g. dept(X=MLD) -[member]-> postdoc(Y=partha) -[received]-> grant(Z).] 32
  • Slide 33
  • Signal/collect model vs Pregel. Integrated with RDF/SPARQL; vertices can be of non-uniform types. Vertex: id, mutable state, outgoing edges, most recently received signals (map: neighbor id -> signal), uncollected signals, and a user-defined collect function. Edge: id, source, dest, and a user-defined signal function. Allows asynchronous computation via v.scoreSignal and v.scoreCollect. Supports data-flow operations. 33
  • Slide 34
  • Asynchronous Parallel Computation. Bulk-synchronous: all vertices update in parallel, so you need to keep a copy of the old and new vertex values. Asynchronous: Reason 1: if two vertices are not connected, they can be updated in any order (more flexibility, less storage). Reason 2: not all updates are equally important (parts of the graph converge quickly, parts slowly). 34
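A sketch of what priority-driven asynchronous updating looks like, in the spirit of GraphLab's scheduler but not its actual API (plain Scala; the damping constant, tolerance, and graph representation are illustrative assumptions). Vertices with the largest pending change ("residual") are updated first, and every update is immediately visible to later updates, so no second copy of the vertex values is needed:

    import scala.collection.mutable

    // outEdges(u) = successors of u
    def asyncPageRank(outEdges: Map[Int, Seq[Int]], c: Double = 0.15, tol: Double = 1e-3): Map[Int, Double] = {
      val vertices = outEdges.keySet ++ outEdges.values.flatten
      val inEdges: Map[Int, Seq[Int]] =
        outEdges.toSeq.flatMap { case (u, vs) => vs.map(v => (v, u)) }
                .groupBy(_._1).map { case (v, ps) => v -> ps.map(_._2) }
      val rank     = mutable.Map[Int, Double]() ++ vertices.map(_ -> 1.0)
      val residual = mutable.Map[Int, Double]() ++ vertices.map(_ -> Double.PositiveInfinity)
      // queue of (priority, vertex); stale entries are skipped when dequeued
      val queue = mutable.PriorityQueue[(Double, Int)]()
      vertices.foreach(v => queue.enqueue((residual(v), v)))
      while (queue.nonEmpty) {
        val (prio, v) = queue.dequeue()
        if (prio == residual(v) && residual(v) > tol) {    // skip stale or already-converged entries
          residual(v) = 0.0
          val newRank = c + (1 - c) * inEdges.getOrElse(v, Nil).map(u => rank(u) / outEdges(u).size).sum
          val delta = math.abs(newRank - rank(v))
          rank(v) = newRank                                // immediately visible: asynchronous
          for (w <- outEdges.getOrElse(v, Nil)) {          // out-neighbors now have stale input
            residual(w) += delta
            queue.enqueue((residual(w), w))
          }
        }
      }
      rank.toMap
    }

This single-threaded loop only illustrates the scheduling idea; a real asynchronous engine runs many such updates concurrently and needs locking or another consistency mechanism, which is exactly the issue the GraphLab papers address.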
  • Slide 35
  • [Figure: asynchronous scheduling in Signal/Collect, using v.scoreSignal and v.scoreCollect.] 35
  • Slide 36
  • 36
  • Slide 37
  • SSSP PageRank 37
  • Slide 38
  • Outline Motivation/where it fits Sample systems (c. 2010) Pregel: and some sample programs Bulk synchronous processing Signal/Collect and GraphLab Asynchronous processing GraphLab descendants PowerGraph: partitioning GraphChi: graphs w/o parallelism GraphX: graphs over Spark 38
  • Slide 39
  • GRAPH ABSTRACTIONS: GRAPHLAB (UAI, 2010) Guestrin, Gonzalez, Bickson, etc. Many slides below pilfered from Carlos or Joey. 39
  • Slide 40
  • GraphLab. Data lives in the graph; a UDF vertex function does the work. Differences from Pregel: some control over scheduling (the vertex function can insert new tasks into a queue); messages must follow graph edges, so a vertex can access adjacent vertices only; a shared data table holds global data; library algorithms for matrix factorization, CoEM, SVM, Gibbs, ... GraphLab is now Dato. 40
  • Slide 41
  • Graphical Model Learning. [Figure: convergence toward optimal for an approximate priority schedule vs. a splash schedule.] 15.5x speedup on 16 CPUs. On a multicore architecture: shared memory for workers. 41
  • Slide 42
  • Gibbs Sampling. Protein-protein interaction networks [Elidan et al. 2006]: a pairwise MRF with 14K vertices and 100K edges. 10x speedup; scheduling (a colored schedule vs. a round-robin schedule) reduces locking overhead. [Figure: convergence toward optimal for the two schedules.] 42
  • Slide 43
  • CoEM (Rosie Jones, 2005): named entity recognition task (is "Dog" an animal? is "Catalina" a place?). Noun phrases such as "the dog", "Australia", "Catalina Island" are linked to contexts such as "ran quickly", "travelled to", "is pleasant". Graph sizes: small = 0.2M vertices, 20M edges; large = 2M vertices, 200M edges. Hadoop: 95 cores, 7.5 hrs. 43
  • Slide 44
  • CoEM (Rosie Jones, 2005), continued. [Figure: convergence toward optimal on the small and large graphs.] GraphLab: 16 cores, 30 min. 15x faster with 6x fewer CPUs than Hadoop (95 cores, 7.5 hrs). 44
  • Slide 45
  • GRAPH ABSTRACTIONS: GRAPHLAB CONTINUED. 45
  • Slide 46
  • Outline Motivation/where it fits Sample systems (c. 2010) Pregel: and some sample programs Bulk synchronous processing Signal/Collect and GraphLab Asynchronous processing GraphLab descendants PowerGraph: partitioning GraphChi: graphs w/o parallelism GraphX: graphs over Spark 46
  • Slide 47
  • GraphLab's descendants: PowerGraph, GraphChi, GraphX. GraphLab ran on a multicore architecture: shared memory for workers. On a cluster architecture (like Pregel), workers have different memory spaces. What are the challenges in moving away from shared memory? 47
  • Slide 48
  • Natural Graphs = Power Law. AltaVista web graph: 1.4B vertices, 6.7B edges; the top 1% of vertices is adjacent to 53% of the edges! [Figure: log-log degree distribution, power-law slope about 2.] (GraphLab group/Aapo) 48
  • Slide 49
  • Problem: high-degree vertices limit parallelism. They touch a large fraction of the graph (GraphLab 1), produce many messages (Pregel, Signal/Collect), and their edge information is too large for a single machine. Asynchronous consistency requires heavy locking (GraphLab 1); synchronous consistency is prone to stragglers (Pregel). (GraphLab group/Aapo) 49
  • Slide 50
  • PowerGraph. Problem: GraphLab's localities can be large; the neighborhood of a hub (a high in-degree node) can be huge. Approach: a new graph partitioning algorithm that can replicate data, plus a gather-apply-scatter (GAS) API for finer-grained parallelism: gather ~ combiner; apply ~ vertex UDF (run on all replicas); scatter ~ messages from vertex to edges. 50
  • Slide 51
  • Factorized Vertex Updates: split the update into three phases. Gather: a parallel sum over the vertex's scope (data-parallel over edges). Apply: locally apply the accumulated value to the vertex. Scatter: update the neighbors (data-parallel over edges). (GraphLab group/Aapo) 51
  • Slide 52
  • PageRank in PowerGraph. PageRankProgram(i): Gather(j -> i): return w(j,i) * R[j]; sum(a, b): return a + b; Apply(i, Sigma): R[i] = beta + (1 - beta) * Sigma; Scatter(i -> j): if R[i] changed then activate(j). (GraphLab group/Aapo) 52
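The same program, written out as a runnable one-round sketch in plain Scala (not the PowerGraph C++ API; the damping constant beta and the change threshold are assumptions). The point of the factorization is that gather is a commutative, associative sum over in-edges, so it can be computed in pieces on different machines and then combined:

    case class WEdge(src: Int, dst: Int, w: Double)

    def gasRound(edges: Seq[WEdge], rank: Map[Int, Double],
                 beta: Double = 0.15): (Map[Int, Double], Set[Int]) = {
      // Gather: per-edge partial results, combined with sum(a, b) = a + b
      val acc: Map[Int, Double] =
        edges.groupBy(_.dst).map { case (i, es) => i -> es.map(e => e.w * rank(e.src)).sum }
      // Apply: R[i] = beta + (1 - beta) * acc(i)
      val newRank = rank.map { case (i, r) => i -> (beta + (1 - beta) * acc.getOrElse(i, 0.0)) }
      // Scatter: activate the out-neighbors of vertices whose rank changed noticeably
      val changed   = newRank.keySet.filter(i => math.abs(newRank(i) - rank(i)) > 1e-6)
      val activated = edges.filter(e => changed(e.src)).map(_.dst).toSet
      (newRank, activated)
    }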
  • Slide 53
  • Distributed Execution of a PowerGraph Vertex-Program. [Figure: a single vertex's program spread across machines 1-4: partial gather sums are computed per machine and combined, the apply runs on the vertex's master copy, and the scatter pushes updates back out through the mirrors.] (GraphLab group/Aapo) 53
  • Slide 54
  • Minimizing Communication in PowerGraph. A vertex-cut minimizes the number of machines each vertex spans, and communication is linear in the number of machines each vertex spans. Percolation theory suggests that power-law graphs have good vertex cuts [Albert et al. 2000]. (GraphLab group/Aapo) 54
  • Slide 55
  • Partitioning Performance. Twitter graph: 41M vertices, 1.4B edges. [Figure: replication cost vs. construction time for random, oblivious, and coordinated partitioning; lower is better on both axes.] Oblivious balances partition quality and partitioning time. (GraphLab group/Aapo) 55
  • Slide 56
  • Partitioning matters GraphLab group/Aapo 56
  • Slide 57
  • Outline Motivation/where it fits Sample systems (c. 2010) Pregel: and some sample programs Bulk synchronous processing Signal/Collect and GraphLab Asynchronous processing GraphLab descendants PowerGraph: partitioning GraphChi: graphs w/o parallelism GraphX: graphs over Spark 57
  • Slide 58
  • GraphLab's descendants: PowerGraph, GraphChi, GraphX. 58
  • Slide 59
  • GraphLab continued: GraphChi. Goal: use the graph abstraction on disk, not in memory, on a conventional workstation. [Figure: the GraphLab stack: a general-purpose API for graph analytics, graphical models, computer vision, clustering, topic modeling, and collaborative filtering, running over MPI/TCP-IP, PThreads, and Hadoop/HDFS on a Linux cluster (e.g. Amazon AWS).] 59
  • Slide 60
  • GraphLab continued: GraphChi. Key insight: some algorithms on graphs are streamable (e.g., PageRank-Nibble). In general we can't easily stream the graph because neighbors will be scattered, but maybe we can limit the degree to which they're scattered enough to make streaming possible? Almost-streaming: keep P cursors into a file instead of one. 60
  • Slide 61
  • PSW: Shards and Intervals. Vertices are numbered from 1 to n; there are P intervals, each associated with a shard on disk. A sub-graph = an interval of vertices. [Figure: interval(1), interval(2), ..., interval(P) spanning 1..n, with shard(1)..shard(P).] (1. Load, 2. Compute, 3. Write.) 61
  • Slide 62
  • PSW: Layout. A shard stores the in-edges for an interval of vertices, sorted by source id; shards are small enough to fit in memory, and shard sizes are balanced. [Figure: shard 1 holds the in-edges for vertices 1..100 sorted by source_id; shards 2-4 cover vertices 101..700, 701..1000, and 1001..10000.] (1. Load, 2. Compute, 3. Write.) 62
  • Slide 63
  • PSW: Loading Sub-graph. To load the sub-graph for vertices 1..100, load all of shard 1 (their in-edges) into memory. What about out-edges? They are arranged in sequence in the other shards (2-4). (1. Load, 2. Compute, 3. Write.) 63
  • Slide 64
  • PSW: Loading Sub-graph. To load the sub-graph for vertices 101..700, load all of shard 2 (their in-edges) into memory; the corresponding out-edge blocks from the other shards are also brought into memory. (1. Load, 2. Compute, 3. Write.) 64
  • Slide 65
  • PSW Load-Phase: only P large reads for each interval, so P^2 reads in one full pass. (1. Load, 2. Compute, 3. Write.) 65
  • Slide 66
  • PSW: Execute updates. The update function is executed on the interval's vertices. Edges have pointers into the loaded data blocks, and changes take effect immediately: asynchronous. (1. Load, 2. Compute, 3. Write.) 66
  • Slide 67
  • PSW: Commit to Disk. In the write phase, the blocks are written back to disk, and the next load-phase sees the preceding writes: asynchronous. In total: P^2 reads and writes per full pass over the graph; this performs well on both SSD and hard drive. To make this work, the size of a vertex's state can't change when it is updated (at least, as stored on disk). (1. Load, 2. Compute, 3. Write.) 67
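A tiny sketch of the shard layout and of which edges a single interval needs (plain Scala, not GraphChi's actual implementation; equal-width intervals are an assumption, whereas GraphChi balances intervals by shard size):

    object PswSketch {
      case class E(src: Int, dst: Int)

      private def intervalOf(v: Int, n: Int, p: Int): Int = math.min(p - 1, v * p / n)

      // shard i holds the in-edges of interval i, sorted by source id
      def buildShards(edges: Seq[E], n: Int, p: Int): Vector[Vector[E]] =
        Vector.tabulate(p)(i => edges.filter(e => intervalOf(e.dst, n, p) == i).sortBy(_.src).toVector)

      // processing interval i needs all of shard(i) (the in-edges) plus, from every other
      // shard j, the edges whose source lies in interval i (the out-edges). Because shards
      // are sorted by source id, that slice is one contiguous block on disk: P sequential
      // reads per interval, P*P per full pass.
      def loadSubgraph(shards: Vector[Vector[E]], n: Int, p: Int, i: Int): Seq[E] =
        shards(i) ++ shards.zipWithIndex.collect {
          case (shard, j) if j != i => shard.filter(e => intervalOf(e.src, n, p) == i)
        }.flatten
    }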
  • Slide 68
  • Experiment Setting. Mac Mini (Apple Inc.): 8 GB RAM, 256 GB SSD, 1 TB hard drive, Intel Core i5 at 2.5 GHz. Experiment graphs:

    Graph         Vertices   Edges   P (shards)   Preprocessing
    live-journal  4.8M       69M     3            0.5 min
    netflix       0.5M       99M     20           1 min
    twitter-2010  42M        1.5B    20           2 min
    uk-2007-05    106M       3.7B    40           31 min
    uk-union      133M       5.4B    50           33 min
    yahoo-web     1.4B       6.6B    50           37 min

    68
  • Slide 69
  • Comparison to Existing Systems. Notes: the comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all the other systems except GraphLab compute synchronously. Benchmarks: PageRank, WebGraph belief propagation (U Kang et al.), matrix factorization (alternating least squares), and triangle counting; see the paper for more comparisons. Bottom line: on a Mac Mini, GraphChi can solve problems as big as existing large-scale systems can, with comparable performance. 69
  • Slide 70
  • Outline Motivation/where it fits Sample systems (c. 2010) Pregel: and some sample programs Bulk synchronous processing Signal/Collect and GraphLab Asynchronous processing GraphLab descendants PowerGraph: partitioning GraphChi: graphs w/o parallelism GraphX: graphs over Spark (Gonzalez) 70
  • Slide 71
  • GraphLab's descendants: PowerGraph, GraphChi, GraphX. GraphX is an implementation of GraphLab's API on top of Spark. Motivations: avoid transfers between subsystems, and leverage a larger community for common infrastructure. What's different: graphs are now immutable, and operations transform one graph into another (RDD -> RDG, a resilient distributed graph). 71
  • Slide 72
  • Idea 1: Graph as Tables. A property graph is represented as a vertex property table (id -> vertex property, e.g. rxin, jegonzal, franklin, istoica with properties like student, postdoc, or professor at Berkeley) and an edge property table (srcId, dstId, edge property, e.g. friend, advisor, coworker, PI). Under the hood things can be split even more finely, e.g. a vertex map table + a vertex data table. Operators maximize structure sharing and minimize communication. 72
  • Slide 73
  • Operators. Table (RDD) operators are inherited from Spark: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ... 73
  • Slide 74
  • Graph Operators. Idea 2: mrTriplets is a low-level routine similar to scatter-gather-apply; it later evolved into aggregateNeighbors and aggregateMessages. The API from the slide: 74

    class Graph[V, E] {
      def Graph(vertices: Table[(Id, V)], edges: Table[(Id, Id, E)])

      // Table views -----------------
      def vertices: Table[(Id, V)]
      def edges: Table[(Id, Id, E)]
      def triplets: Table[((Id, V), (Id, V), E)]

      // Transformations ------------------------------
      def reverse: Graph[V, E]
      def subgraph(pV: (Id, V) => Boolean,
                   pE: Edge[V, E] => Boolean): Graph[V, E]
      def mapV(m: (Id, V) => T): Graph[T, E]
      def mapE(m: Edge[V, E] => T): Graph[V, T]

      // Joins ----------------------------------------
      def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
      def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]

      // Computation ----------------------------------
      def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                     reduceF: (T, T) => T): Graph[T, E]
    }
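For comparison with the present-day GraphX API (which is where mrTriplets ended up, as aggregateMessages), here is a hedged sketch of one PageRank-style iteration. It assumes an existing SparkContext named sc and an edge-list file path, both placeholders:

    import org.apache.spark.graphx.{Graph, GraphLoader, VertexRDD, TripletFields}

    val raw = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")   // placeholder path
    // edge attribute = 1/outDegree of the source; vertex attribute = initial rank 1.0
    val g: Graph[Double, Double] = raw
      .outerJoinVertices(raw.outDegrees) { (_, _, deg) => deg.getOrElse(0) }
      .mapTriplets(t => 1.0 / t.srcAttr, TripletFields.Src)
      .mapVertices((_, _) => 1.0)

    // one iteration: gather weighted ranks along edges, then apply the PageRank update
    val msgs: VertexRDD[Double] =
      g.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _)
    val next: Graph[Double, Double] =
      g.outerJoinVertices(msgs) { (_, _, s) => 0.15 + 0.85 * s.getOrElse(0.0) }

The sendMsg function plays the gather/scatter role over triplets and the merge function (_ + _) is the combiner, which is exactly the mrTriplets pattern described above.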
  • Slide 75
  • The GraphX Stack (lines of code): GraphX (3575) sits on top of Spark; on top of GraphX sit Pregel (28) + GraphLab (50); and on top of those: PageRank (5), Connected Components (10), Shortest Path (10), ALS (40), LDA (120), K-core (51), Triangle Count (45), SVD (40). 75
  • Slide 76
  • Performance Comparisons: GraphX is roughly 3x slower than GraphLab (on Live-Journal: 69 million edges). 76
  • Slide 77
  • Summary. Large immutable data structures on (distributed) disk, processed by sweeping through them and creating new data structures: stream-and-sort, Hadoop, PIG, Hive, ... Large immutable data structures in distributed memory: Spark distributed tables. Large mutable data structures in distributed memory: parameter server, where the structure is a hashtable; Pregel, GraphLab, GraphChi, GraphX, where the structure is a graph. 77
  • Slide 78
  • Summary. The APIs of the various systems vary in detail but have a similar flavor: typical algorithms iteratively update vertex state, and changes in state are communicated with messages that need to be aggregated from neighbors. The biggest wins are on problems where the graph is fixed in each iteration but the vertex data changes, and on graphs small enough to fit in (distributed) memory. 78
  • Slide 79
  • Some things to take away. Platforms for iterative operations on graphs: GraphX if you want to integrate with Spark; GraphChi if you don't have a cluster; GraphLab/Dato if you don't need free software and performance is crucial; Pregel if you work at Google; also Giraph, Signal/Collect, ... Important differences: the intended architecture (shared memory and threads, distributed cluster memory, or graph on disk); how graphs are partitioned for clusters; and whether processing is synchronous or asynchronous. 79
