Single-Pass Graph Stream Analytics with Apache Flink

Post on 16-Apr-2017

@GraphDevroom

Single-pass Graph Stream Analytics with Apache Flink

Rethinking graph processing for dynamic data

Vasiliki Kalavri <vasia@apache.org> Paris Carbone <senorcarbone@apache.org>

Real Graphs are dynamic

Graphs created by events happening in real-time
• liking a post
• buying a book
• listening to a song
• rating a movie
• packet switching in computer networks
• bitcoin transactions

Each event adds an edge to the graph

In a batch world

We create and analyze a snapshot of the real graph
• all events / interactions / relationships that happened between t0 and tn
• the Facebook social network on January 30 2016
• user web logs gathered between March 1st 12:00 and 16:00
• retweets and replies for 24h after the announcement of the death of David Bowie

Batch Graph Processing

In a streaming world

• We receive and consume the events as they are happening, in real-time

• We analyze the evolving graph and receive results continuously

Streaming Graph Processing

Sounds expensive?

Challenges
• maintain the graph structure
  • how to apply state updates efficiently?
• update the result
  • re-run the analysis for each event?
  • design an incremental algorithm?
  • run separate instances on multiple snapshots?
• compute only on most recent events

The Apache Flink Stack

APIs: DataSet, DataStream
Execution: Distributed Dataflow
Deployment

DataSet
• Bounded Data Sources
• Structured Iterations
• Blocking Operations

DataStream
• Unbounded Data Sources
• Asynchronous Iterations
• Incremental Operations

Unifying Data Processing

Job Manager • scheduling tasks • monitoring/recovery

Client

• task pipelining • blocking

• execution plan building • optimisation

Batch source (HDFS):
DataSet<String> text = env.readTextFile("hdfs://…");
text.map(…).groupReduce(…)…

Streaming source (Kafka):
DataStream<String> events = env.addSource(new KafkaConsumer(…));
events.map(…).filter(…).window(…).fold(…)…

Graph Processing on Apache Flink

Gelly (on the DataSet API)
• Static Graphs
• Multi-Pass Algorithms
• Full Computations

Data Streams as ADTs

• Direct access to the execution graph / topology

• Suitable for engineers

• Abstract Data Type Transformations hide operator details

• Suitable for data analysts and engineers

similar to: PCollection, DStream

DataStream

Nature of a DataStream Job

• Tasks are long running in a pipelined execution.

• State is kept within tasks.

• Transformations are applied per-record or per-window.

Execution Graph: unbounded data sources -> unbounded data sinks

Execution Properties
• operator parallelism
• stream partitioning

Working with DataStreams

Creation

DataStream<String> myStream =

-for supported data sources:
env.addSource(new FlinkKafkaConsumer<String>(…));
env.addSource(new RMQSource<String>(…));
env.addSource(new TwitterSource(propsFile));
env.socketTextStream(…);

-for testing:
env.fromCollection(…);
env.fromElements(…);

-for adding any custom source:
env.addSource(new MyCustomSource(…));

Properties

myStream.setParallelism(3)

-partitioning:
myStream.broadcast(); .rebalance(); .forward();

-partition stream and operator state by key:
myStream.keyBy(key);

Transformations

myStream.map(…);
myStream.flatMap(…);
myStream.filter(…);
myStream.union(myOtherStream);

-for aggregations on partitioned-by-key streams:
myKeyStream.reduce(…);
myKeyStream.fold(…);
myKeyStream.sum(…);

Example

env.setParallelism(2); //default parallelism
DataStream<Tuple2<String, Integer>> counts = env
    .socketTextStream("localhost", 9999)
    .flatMap(new Splitter()) //transformation
    .keyBy(0) //partitioning
    .sum(1) //rolling aggregation
    .setParallelism(4);
counts.print();

[Figure: dataflow flatMap -> sum -> print for the input "cool, gelly is cool". flatMap emits <"cool",1> <"gelly",1> <"is",1> <"cool",1>; the rolling sum emits <"cool",1> <"gelly",1> <"is",1> <"cool",2>.]

Working with Windows

Why windows? We are often interested in fresh data!

Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!

1) Sliding windows
myKeyStream.timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS));

[Figure: sliding-window timeline (seconds, 0 to 120) showing the overlapping sums SUM #1, SUM #2 and SUM #3, each computed from window buckets/panes]

2) Tumbling windows
myKeyStream.timeWindow(Time.of(60, TimeUnit.SECONDS));
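
To make the difference between the two window types concrete, here is a minimal plain-Java sketch of window assignment. It does not use Flink; the class and method names (`WindowAssignment`, `tumblingStart`, `slidingStarts`) are made up for illustration, and it assumes non-negative timestamps with windows aligned at time 0. An event belongs to exactly one tumbling window but to size/slide sliding windows.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of time-window assignment (hypothetical helper, no Flink dependency). */
final class WindowAssignment {

    /** Start of the tumbling window of length size that contains timestamp ts. */
    static long tumblingStart(long ts, long size) {
        return ts - (ts % size);
    }

    /** Starts of all sliding windows (length size, step slide) that contain ts, newest first. */
    static List<Long> slidingStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide); // latest window that starts at or before ts
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // An event at second 65 with 60 s windows sliding every 20 s
        // falls into the windows [60,120), [40,100) and [20,80).
        System.out.println(slidingStarts(65, 60, 20)); // [60, 40, 20]
        // With 60 s tumbling windows it falls only into [60,120).
        System.out.println(tumblingStart(65, 60));     // 60
    }
}
```

Because neighbouring sliding windows overlap, partial results per slide-sized bucket (pane) can be shared between them, which is what the buckets/panes label in the figure refers to.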

Example

env.setParallelism(2); //default parallelism
DataStream<Tuple2<String, Integer>> counts = env
    .socketTextStream("localhost", 9999)
    .flatMap(new Splitter()) //transformation
    .keyBy(0) //partitioning
    .timeWindow(Time.of(5, TimeUnit.MINUTES)) //5-minute tumbling window
    .sum(1) //window aggregation
    .setParallelism(4);
counts.print();

[Figure: dataflow flatMap -> window sum -> print. The line "cool, gelly is cool" arrives at 10:48 and its window emits <"gelly",1> … <"cool",2>; the line "dataflow is cool too" arrives at 11:01 and its window emits <"dataflow",1> … <"cool",1>.]

Single-Pass Graph Streaming with Windows

• Each event represents an edge addition

• Each edge is processed once and thrown away, i.e. the graph structure is not explicitly maintained

• The state maintained corresponds to a graph summary, a continuously improving property, an aggregation

• Recent events can be grouped in a graph window and processed independently

What’s the benefit?

• Get results faster
  • No need to wait for the job to finish
  • Sometimes, early approximations are better than late exact answers
• Get results continuously
  • Process unbounded number of events
• Use less memory
  • single-pass algorithms don’t store the graph structure
  • run computations on a graph summary

What can you do in this model?

• transformations, e.g. mapping, filtering vertex / edge values, reverse edge direction

• continuous aggregations, e.g. degree distribution

Streaming Degrees Distribution

[Figure: animation of an example graph with vertices 1 to 8 that grows one edge at a time, next to a bar chart of the degree distribution (x-axis: degree, 1 to 4; y-axis: #vertices, 0 to 6) that is updated after every edge]
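
The continuous degree-distribution aggregation can be sketched in plain Java as follows. This is an illustrative, single-threaded sketch and not the Gelly-Stream implementation; the class name `StreamingDegreeDistribution` and its methods are hypothetical. Each edge is consumed once and discarded; only the per-vertex degrees and the histogram are kept as state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Single-pass degree distribution over a stream of undirected edges (illustrative sketch). */
final class StreamingDegreeDistribution {
    private final Map<Integer, Integer> degree = new HashMap<>();           // vertex -> degree
    private final TreeMap<Integer, Integer> distribution = new TreeMap<>(); // degree -> #vertices

    /** Consume one edge and update the summary; the edge itself is not stored. */
    void addEdge(int u, int v) {
        bump(u);
        bump(v);
    }

    private void bump(int vertex) {
        int old = degree.getOrDefault(vertex, 0);
        degree.put(vertex, old + 1);
        if (old > 0) {
            // the vertex leaves the bucket of its old degree
            if (distribution.merge(old, -1, Integer::sum) == 0) {
                distribution.remove(old);
            }
        }
        // ... and enters the bucket of its new degree
        distribution.merge(old + 1, 1, Integer::sum);
    }

    /** Snapshot of the current distribution, ordered by degree. */
    Map<Integer, Integer> distribution() {
        return new TreeMap<>(distribution);
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        StreamingDegreeDistribution summary = new StreamingDegreeDistribution();
        for (int[] e : edges) {
            summary.addEdge(e[0], e[1]);
            System.out.println("after (" + e[0] + "," + e[1] + "): " + summary.distribution());
        }
        // final line shows {2=7, 4=1}: seven vertices of degree 2, one of degree 4
    }
}
```

The state grows with the number of vertices, not with the number of edges, which is the memory benefit of working on a summary instead of the graph structure.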

What can you do in this model?

• spanners for distance estimation
• sparsifiers for cut estimation
• sketches for homomorphic properties

[Figure: running an algorithm on the full graph gives result R1, running it on the graph summary gives result R2, with R1 ~ R2]

• neighborhood aggregations on windows, e.g. triangle counting, clustering coefficient (no iterations… yet!)

Examples

Batch Connected Components

• State: the graph and a component ID per vertex (initially equal to vertex ID)

• Iterative Computation: For each vertex:

• choose the min of neighbors’ component IDs and own component ID as new ID

• if component ID changed since last iteration, notify neighbors
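
The iterative scheme above can be sketched in plain Java as follows. This is a simplified, single-machine sketch and not Gelly's implementation: every vertex is re-evaluated in each iteration instead of only the notified ones, and the class name `BatchConnectedComponents` is hypothetical. Note that the whole graph (the adjacency lists) has to be kept in memory for the multiple passes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Batch connected components by iterative min-ID propagation (illustrative sketch). */
final class BatchConnectedComponents {

    /** Returns vertex -> component ID, where the ID is the smallest vertex ID in the component. */
    static Map<Integer, Integer> run(int[][] edges) {
        // State: the graph itself plus one component ID per vertex (initially the vertex ID).
        Map<Integer, List<Integer>> neighbors = new HashMap<>();
        Map<Integer, Integer> component = new TreeMap<>();
        for (int[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
            component.put(e[0], e[0]);
            component.put(e[1], e[1]);
        }
        boolean changed = true;
        while (changed) { // one loop iteration corresponds to one superstep
            changed = false;
            Map<Integer, Integer> next = new TreeMap<>(component);
            for (Map.Entry<Integer, List<Integer>> entry : neighbors.entrySet()) {
                int own = component.get(entry.getKey());
                int best = own;
                for (int n : entry.getValue()) {
                    best = Math.min(best, component.get(n)); // min of neighbors' IDs and own ID
                }
                if (best < own) {
                    next.put(entry.getKey(), best);
                    changed = true;
                }
            }
            component = next;
        }
        return component;
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        System.out.println(run(edges)); // {1=1, 2=1, 3=1, 4=1, 5=1, 6=6, 7=6, 8=6}
    }
}
```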

[Figure: Batch Connected Components on the example graph with vertices 1 to 8. i=0: every vertex holds its own ID as component ID. i=1 and i=2: vertices exchange component IDs with their neighbors and keep the minimum. i=3: the computation has converged to two components, with ID 1 (five vertices) and ID 6 (three vertices).]

Streaming Connected Components

• State: a disjoint set data structure for the components

• Computation: For each edge

• if seen for the 1st time, create a component with ID the min of the vertex IDs

• if in different components, merge them and update the component ID to the min of the component IDs

• if only one of the endpoints belongs to a component, add the other one to the same component
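
A minimal plain-Java sketch of this single-pass scheme, using a union-find (disjoint set) structure in which the smallest vertex ID is always the root of its component. The class name `StreamingConnectedComponents` is hypothetical and the code is an illustration, not the Gelly-Stream implementation. The three cases listed above all reduce to one union operation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Single-pass connected components over an edge stream (illustrative sketch). */
final class StreamingConnectedComponents {
    private final Map<Integer, Integer> parent = new HashMap<>(); // the only state: a disjoint set

    private int find(int v) {
        parent.putIfAbsent(v, v); // a vertex seen for the first time is its own component
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        while (parent.get(v) != root) { // path compression
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** Consume one edge; the edge is not stored afterwards. */
    void addEdge(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru == rv) {
            return; // both endpoints are already in the same component
        }
        // merge: the smaller ID becomes the ID of the merged component
        if (ru < rv) {
            parent.put(rv, ru);
        } else {
            parent.put(ru, rv);
        }
    }

    /** Component ID -> vertices, for inspection. */
    Map<Integer, TreeSet<Integer>> components() {
        Map<Integer, TreeSet<Integer>> result = new TreeMap<>();
        for (int v : new ArrayList<>(parent.keySet())) {
            result.computeIfAbsent(find(v), k -> new TreeSet<>()).add(v);
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        StreamingConnectedComponents cc = new StreamingConnectedComponents();
        for (int[] e : edges) {
            cc.addEdge(e[0], e[1]);
            System.out.println("after (" + e[0] + "," + e[1] + "): " + cc.components());
        }
        // final line shows {1=[1, 2, 3, 4, 5], 6=[6, 7, 8]}
    }
}
```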

[Figure: the example graph with vertices 1 to 8 arrives as the edge stream (3,1), (5,2), (5,4), (7,6), (8,6), (4,2), (4,3), (8,7), (4,1). The edges are consumed one at a time and the component table is updated as follows:]

Edge    ComponentID: Vertices
(3,1)   1: 1, 3
(5,2)   1: 1, 3 | 2: 2, 5
(5,4)   1: 1, 3 | 2: 2, 4, 5
(7,6)   1: 1, 3 | 2: 2, 4, 5 | 6: 6, 7
(8,6)   1: 1, 3 | 2: 2, 4, 5 | 6: 6, 7, 8
(4,2)   unchanged
(4,3)   1: 1, 2, 3, 4, 5 | 6: 6, 7, 8
(8,7)   unchanged
(4,1)   unchanged

Distributed Streaming Connected Components

Streaming Bipartite Detection

Similar to connected components, but

• each vertex is also assigned a sign, (+) or (-)

• edge endpoints must have different signs

• when merging components, if flipping all signs doesn’t work => the graph is not bipartite
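
One way to sketch this is a union-find structure that also stores, for every vertex, its sign relative to its parent. The code below is an illustrative plain-Java sketch, not the Gelly-Stream implementation; the class name `StreamingBipartiteness` and the edge list in `main` are hypothetical (they are not the graph from the slides). Merging two components fixes the relative sign of the two roots, which has the same effect as flipping all signs in one of them.

```java
import java.util.HashMap;
import java.util.Map;

/** Single-pass bipartiteness check with a signed disjoint set (illustrative sketch). */
final class StreamingBipartiteness {
    private final Map<Integer, Integer> parent = new HashMap<>();
    private final Map<Integer, Integer> parity = new HashMap<>(); // 1 = opposite sign to parent
    private boolean bipartite = true;

    /** Returns {root, parity}, where parity 1 means v has the opposite sign of the root. */
    private int[] find(int v) {
        parent.putIfAbsent(v, v);
        parity.putIfAbsent(v, 0);
        int p = parent.get(v);
        if (p == v) {
            return new int[] {v, 0};
        }
        int[] r = find(p);
        int toRoot = parity.get(v) ^ r[1];
        parent.put(v, r[0]); // path compression keeps the sign relative to the root
        parity.put(v, toRoot);
        return new int[] {r[0], toRoot};
    }

    /** Consume one edge; returns whether the graph seen so far is still bipartite. */
    boolean addEdge(int u, int v) {
        int[] a = find(u);
        int[] b = find(v);
        if (a[0] == b[0]) {
            if (a[1] == b[1]) {
                bipartite = false; // same component, same sign: no consistent flip exists
            }
        } else {
            // merge under the smaller ID; choose the root's relative sign so that u and v differ
            int lo = Math.min(a[0], b[0]);
            int hi = Math.max(a[0], b[0]);
            parent.put(hi, lo);
            parity.put(hi, a[1] ^ b[1] ^ 1);
        }
        return bipartite;
    }

    public static void main(String[] args) {
        StreamingBipartiteness check = new StreamingBipartiteness();
        int[][] edges = {{1, 2}, {2, 3}, {5, 6}, {5, 7}, {3, 5}, {3, 7}}; // hypothetical stream
        for (int[] e : edges) {
            System.out.println("(" + e[0] + "," + e[1] + ") -> bipartite: " + check.addEdge(e[0], e[1]));
        }
        // the first five edges keep the graph bipartite; (3,7) closes the odd cycle 3-5-7
    }
}
```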

[Figure: Streaming Bipartite Detection on an example with vertices 1 to 7, split into two components (Cid=1 and Cid=5) whose vertices carry signs (+) and (-). Edge (3,5) arrives and connects the two components: the signs of one component are flipped and both are merged into Cid=1. Edge (3,7) then arrives inside the merged component: can’t flip signs and stay consistent => not bipartite!]

The GraphStream

Gelly (on DataSet)
• Static Graphs
• Multi-Pass Algorithms
• Full Computations

Gelly-Stream (on DataStream)
• Dynamic Graphs
• Single-Pass Algorithms
• Incremental Computations

Introducing Gelly-Stream

• Gelly-Stream enriches the DataStream API with two new ADTs:

• GraphStream:

• A representation of a data stream of edges.

• Edges can have state (e.g. weights).

• Supports property streams, transformations and aggregations.

• GraphWindow:

• A “time-slice” of a graph stream.

• It enables neighborhood aggregations (and iterations in the future)

Graph Property Streams

[Figure: an example graph stream of edges over the vertices A, B, C, D]

.getEdges()

.getVertices()

.numberOfVertices()

.numberOfEdges()

.getDegrees()

.inDegrees()

.outDegrees()

GraphStream -> DataStream

Transform Graph Streams

.mapEdges();

.distinct();

.filterVertices();

.filterEdges();

.reverse();

.undirected();

.union();

GraphStream -> GraphStream

Graph Stream Aggregations

[Figure: aggregation dataflow from graph stream to property stream: (window) fold over the edges into partitioned aggregates, then combine / merge (reduce, map) into global aggregates and the result aggregate]

• global aggregates can be persistent or transient

graphStream.aggregate(new MyGraphAggregation(window, update, fold, combine, merge))

Connected Components

graphStream.aggregate(new ConnectedComponents(window, update, fold, combine, merge))

[Figure: animation of the aggregation on the example edge stream. Edges are folded in parallel into partial component sets such as {1,3}, {2,5}, {4,5}, {6,7}, {6,8}; these are combined into {1,3}, {2,4,5}, {6,7,8}; later edges contribute {2,4}, {3,4}, {7,8}, and merging yields the final components {1,2,3,4,5} and {6,7,8}.]
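
The fold / combine pattern for connected components can be imitated in plain Java. This is a sequential simulation under stated assumptions, not the Gelly-Stream API: edges are distributed round-robin over a fixed number of partitions, each partition folds its edges into a local disjoint set, and the partial summaries are then combined into a global one. All names (`PartitionedComponents`, `DisjointSet`, `run`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Fold edges into per-partition summaries, then combine them (illustrative sketch). */
final class PartitionedComponents {

    /** A small disjoint-set summary in which the smallest ID is the root of each component. */
    static final class DisjointSet {
        final Map<Integer, Integer> parent = new HashMap<>();

        int find(int v) {
            parent.putIfAbsent(v, v);
            while (parent.get(v) != v) {
                v = parent.get(v);
            }
            return v;
        }

        /** fold step: add one edge to the summary. */
        void union(int u, int v) {
            int ru = find(u);
            int rv = find(v);
            if (ru != rv) {
                parent.put(Math.max(ru, rv), Math.min(ru, rv));
            }
        }

        /** combine step: replay the other summary as (vertex, root) edges. */
        void merge(DisjointSet other) {
            for (int v : new ArrayList<>(other.parent.keySet())) {
                union(v, other.find(v));
            }
        }

        Map<Integer, TreeSet<Integer>> components() {
            Map<Integer, TreeSet<Integer>> result = new TreeMap<>();
            for (int v : new ArrayList<>(parent.keySet())) {
                result.computeIfAbsent(find(v), k -> new TreeSet<>()).add(v);
            }
            return result;
        }
    }

    static DisjointSet run(int[][] edges, int parallelism) {
        DisjointSet[] partial = new DisjointSet[parallelism];
        for (int i = 0; i < parallelism; i++) {
            partial[i] = new DisjointSet();
        }
        for (int i = 0; i < edges.length; i++) {
            partial[i % parallelism].union(edges[i][0], edges[i][1]); // partitioned aggregates
        }
        DisjointSet global = new DisjointSet();
        for (DisjointSet p : partial) {
            global.merge(p); // global aggregate
        }
        return global;
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        System.out.println(run(edges, 2).components()); // {1=[1, 2, 3, 4, 5], 6=[6, 7, 8]}
    }
}
```

A partial summary only has to carry one (vertex, root) pair per vertex, so what is shipped between the fold and combine steps is bounded by the number of vertices rather than the number of edges.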

Slicing Graph Streams

graphStream.slice(Time.of(1, MINUTE));

[Figure: the graph stream cut into one-minute slices at 11:40, 11:41, 11:42, 11:43]

Aggregating Slices

graphStream.slice(Time.of(1, MINUTE), direction)
    .reduceOnEdges();
    .foldNeighbors();
    .applyOnNeighbors();

• Slicing collocates edges by vertex information
• Neighbourhood aggregations are now enabled on sliced graphs

[Figure: the edges of a slice grouped by source or target vertex, feeding the aggregations]
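
What slicing does to an edge stream can be sketched in plain Java. This is an illustration only, not the Gelly-Stream implementation: the class name `SliceNeighborhoods`, the `{timestampMillis, source, target}` edge encoding and the sample events are assumptions. Edges are first assigned to a time slice and then collocated by source vertex, so that a neighborhood function can run per vertex and per slice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Group timestamped edges into slices and collocate them by source vertex (illustrative sketch). */
final class SliceNeighborhoods {

    /**
     * Each edge is {timestampMillis, source, target}.
     * Returns slice start -> (source vertex -> targets seen in that slice).
     */
    static Map<Long, Map<Long, List<Long>>> slice(long[][] edges, long sliceMillis) {
        Map<Long, Map<Long, List<Long>>> slices = new TreeMap<>();
        for (long[] e : edges) {
            long start = e[0] - (e[0] % sliceMillis); // the slice this edge falls into
            slices.computeIfAbsent(start, k -> new TreeMap<>())
                  .computeIfAbsent(e[1], k -> new ArrayList<>())
                  .add(e[2]);
        }
        return slices;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long[][] edges = {{1_000, 1, 2}, {2_000, 1, 3}, {61_000, 1, 2}}; // hypothetical events
        Map<Long, Map<Long, List<Long>>> slices = slice(edges, minute);
        System.out.println(slices); // {0={1=[2, 3]}, 60000={1=[2]}}
        // A simple neighborhood aggregation: out-degree per vertex and slice.
        slices.forEach((start, bySource) ->
            bySource.forEach((vertex, targets) ->
                System.out.println("slice " + start + ", vertex " + vertex + ": " + targets.size())));
    }
}
```

Grouping by target instead of source would correspond to choosing the other direction.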

Finding matches nearby

graphStream.slice(Time.of(1, MINUTE)).applyOnNeighbors(FindPairs())

[Figure: dataflow slice -> applyOnNeighbors]

Summary

• Many graph analysis problems can be solved in a single pass

• Processing dynamic graphs requires an incremental graph processing model

• We introduce Gelly-Stream, a simple yet powerful library for graph streams