Slides: ictlabs-summer-school.sics.se/2016/slides/flink.pdf

An Intro to Modern Data Stream Analytics

EIT Summer School 2016

Paris Carbone

PhD Candidate @ KTH <[email protected]>
Committer @ Apache Flink <[email protected]>

Motivation

• Time-critical problems / Actionable Insights

• Stock market predictions

• Fraud detection

• Network security

• Fresh customer recommendations

More like First-World Problems...

How about Tsunamis?

[Figure: tsunami early-warning workflow. Labels: Deploy Sensors; Collect Data (earth & wave activity); Analyse Data Regularly (query Q); evacuation window]

Motivation

[Figure: the analysis expressed as a Standing Query Q; label: evacuation window]

The Data Stream Paradigm

• Standing queries are evaluated continuously

• Input data is unbounded

• Queries operate on the full data stream or on the most recent views of the stream (windows)

Data Stream Basics

• Events/Tuples: elements of computation that respect a schema

• Data Streams: unbounded sequences of events

• Stream Operators/Tasks: consume and produce data streams

• Events are consumed once, with no backtracking!

[Figure: stream operator f with streams S1, S2, So, S’1, S’2]

Where are computations stored?

Synopsis-Task State

We cannot store all events seen indefinitely

• Synopsis: A summary of an infinite stream

• It is in principle any streaming operator state

• Examples: samples, histograms, sketches, state machines…

[Figure: operator f holds state s, a summary of everything seen so far; it consumes an event t and emits t’]

1. process t, s
2. update s
3. produce t’
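The three steps on this slide (process t with s, update s, produce t’) can be sketched in plain Java. This is only an illustration under assumptions: the class name `SynopsisTask` and its two function parameters are hypothetical and not part of any streaming API, and the update is applied before the output is produced.

```java
import java.util.function.BiFunction;

/** A stream task that keeps a synopsis s and folds every event t into it. */
final class SynopsisTask<T, S, R> {
    private S synopsis;                          // s: summary of everything seen so far
    private final BiFunction<S, T, S> update;    // next synopsis from (s, t)
    private final BiFunction<S, T, R> produce;   // output t' from (updated s, t)

    SynopsisTask(S initial, BiFunction<S, T, S> update, BiFunction<S, T, R> produce) {
        this.synopsis = initial;
        this.update = update;
        this.produce = produce;
    }

    /** Handles one event: process t with s, update s, produce t'. */
    R onEvent(T t) {
        synopsis = update.apply(synopsis, t);
        return produce.apply(synopsis, t);
    }

    public static void main(String[] args) {
        // Synopsis = number of events seen; output = the event tagged with that count.
        SynopsisTask<String, Integer, String> counter =
                new SynopsisTask<>(0, (s, t) -> s + 1, (s, t) -> t + "#" + s);
        for (String t : new String[] {"a", "b", "c"}) {
            System.out.println(counter.onEvent(t)); // a#1, then b#2, then c#3
        }
    }
}
```

The synopsis here is a single integer, so memory stays constant no matter how many events arrive.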

Synopses-Aggregations

• Discussion - Rolling Aggregations

• Propose a synopsis, s = ?, when

• f = max

• f = ArithmeticMean

• f = stDev
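One possible answer to the discussion, as a sketch rather than the answer given in the talk: max needs only the largest value so far, the mean needs a count and a running mean, and the standard deviation additionally needs a running sum of squared deviations. The class name `RollingAggregates` is hypothetical, and the update shown is Welford's online method, which is one choice among several.

```java
/** Constant-size synopses for rolling max, arithmetic mean and standard deviation. */
final class RollingAggregates {
    private double max = Double.NEGATIVE_INFINITY; // synopsis for max: largest value so far
    private long count = 0;                        // synopsis for mean: (count, mean)
    private double mean = 0.0;
    private double m2 = 0.0;                       // extra term for stDev: sum of squared deviations

    /** Folds one value into all three synopses (Welford's online update). */
    void add(double x) {
        max = Math.max(max, x);
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    double max() { return max; }

    double mean() { return mean; }

    /** Population standard deviation of the values seen so far. */
    double stdDev() { return count == 0 ? 0.0 : Math.sqrt(m2 / count); }

    public static void main(String[] args) {
        RollingAggregates agg = new RollingAggregates();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            agg.add(x);
        }
        System.out.println(agg.max() + " " + agg.mean() + " " + agg.stdDev());
    }
}
```

Each synopsis has a fixed size, so none of the individual events has to be stored.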

Synopses-Approximations

• Discussion - Approximate Results

• Propose a synopsis, s = ?, when

• f = uniform random sample of k records over the whole stream

• f = filter distinct records over windows of 1000 records with a 5% error
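For the first question, one well-known candidate synopsis is a reservoir of k records (reservoir sampling, Algorithm R). The sketch below is an assumption about what such an answer could look like, not the solution presented in the talk; the class name `ReservoirSample` and the seed parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Keeps a uniform random sample of k records from a stream of unknown length. */
final class ReservoirSample<T> {
    private final int k;
    private final List<T> sample;
    private final Random rng;
    private long seen = 0; // number of records observed so far

    ReservoirSample(int k, long seed) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
        this.sample = new ArrayList<>(k);
        this.rng = new Random(seed);
    }

    void add(T item) {
        seen++;
        if (sample.size() < k) {
            sample.add(item); // the first k records fill the reservoir
        } else {
            // Keep the new record with probability k / seen by overwriting a random slot.
            long j = (long) (rng.nextDouble() * seen); // approximately uniform in [0, seen)
            if (j < k) {
                sample.set((int) j, item);
            }
        }
    }

    /** Returns a copy of the current sample. */
    List<T> sample() { return new ArrayList<>(sample); }

    public static void main(String[] args) {
        ReservoirSample<Integer> reservoir = new ReservoirSample<>(5, 42L);
        for (int i = 0; i < 1000; i++) {
            reservoir.add(i);
        }
        System.out.println(reservoir.sample());
    }
}
```

The synopsis is the reservoir plus one counter, so its size depends on k only and not on the length of the stream.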

Synopses-ML and Graphs

• Examples of cool synopses to check out

• Sparsifiers/Spanners - approximating graph properties such as shortest paths

• Change detectors - detecting concept drift

• Incremental decision trees - continuous stream training and classification

Data Stream Basics

Any other problems?

[Figure: stream operator f with streams S1, S2, So, S’1, S’2]

Does this scale?

Task Parallelism

• We need task parallelism:

• Data might be too large to process

• State can get too large to fit in memory (e.g. graphs)

• Data Streams might already be partitioned! (e.g. by key / Kafka partitions)

[Figure: stream operator f with streams S1, S2, So, S’1, S’2]

How do streams get partitioned?

Task Partitioning

• Partitioning defines how we allocate events to each parallel task instance. Typical partitioners are:

• Broadcast

• Shuffle

• Key-based
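The three partitioners can be sketched as functions from an event to the index (or indices) of the parallel task instance that receives it. This is a simplified illustration; the class `Partitioners` and its method names are hypothetical, and real systems may use different hashing or load-balancing schemes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Maps an event to the parallel task instance(s), numbered 0..parallelism-1, that receive it. */
final class Partitioners {
    private Partitioners() {}

    /** Broadcast: every parallel instance receives the event. */
    static List<Integer> broadcast(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    /** Shuffle: one instance is chosen uniformly at random. */
    static int shuffle(Random rng, int parallelism) {
        return rng.nextInt(parallelism);
    }

    /** Key-based: events with equal keys always reach the same instance. */
    static int byKey(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        System.out.println(broadcast(3));
        System.out.println(shuffle(new Random(1L), 3));
        System.out.println(byKey("red", 3) == byKey("red", 3)); // true: same key, same instance
    }
}
```

Key-based partitioning is what makes per-key state possible: all events for one key meet the same task instance, so that instance can keep the key's state locally.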

[Figure: partitioners P route events to parallel instances of task f, each with its own state s, here partitioned by color]

Dataflow Pipelines

[Figure: sources (stream1, stream2) feed the query Q, whose results (approximations, predictions, alerts, …) flow to sinks]

Dataflow Programming with Apache Storm

• Step 1: Implement input (Spouts) and intermediate operators (Bolts)

• Step 2: Construct a Topology by combining operators

[Figure: Spout → Bolt → Bolt]

Spouts are the topology sources. They listen to data feeds.

Bolts represent all intermediate computation vertices of the topology. They perform arbitrary data manipulation.

Each operator can emit or subscribe to Streams (computation results).

Example: Topology Definition

[Figure: example topology; labels: numbers, new_numbers, toFile]

Stream Analytics Systems

Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure

Open Source: Flink, Storm, Samza, Spark, Beam

Programming Models

Compositional

• Physical Representations

• Offer basic building blocks (Operators/Data Exchange)

• Custom Optimisation/Tuning

Declarative

• Logical Representations

• Operators are transformations on abstract data types

• Advanced behaviour such as windowing is supported

• Self-Optimisation

Programming Abstraction Levels

• Direct access to the execution graph / topology

• Suitable for engineers

DStream, DataStream, PCollection…

• Transformations abstract operator details

• Suitable for engineers and data analysts

Introducing Apache Flink

[Figure: number of unique contributor ids by git commits, July 2009 to May 2016; y-axis from 0 to 120]

• A Top-level project

• Community-driven open source software development

• Publicly open to new contributors

Native Workload Support

[Figure: Apache Flink supporting Stream Pipelines, Batch Pipelines, Scalable Machine Learning and Graph Analytics]

The Apache Flink Stack

[Figure: stack layers, top to bottom: DataSet and DataStream (APIs); Distributed Dataflow (Execution); Deployment]

DataSet

• Bounded Data Sources

• Staged/Pipelined Execution

DataStream

• Unbounded Data Sources

• Pipelined Execution

The Big Picture

[Figure: the full stack. DataSet and DataStream run on Distributed Dataflow over Deployment; libraries shown on top of the two APIs: Graph (Gelly), Table, ML, Hadoop M/R, CEP, SQL]

Basic API Concept

[Figure: Source → Data Stream → Operator → Data Stream → Sink]

[Figure: Source → Data Set → Operator → Data Set → Sink]

Writing a Flink Program

1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks

Data Streams as Abstract Data Types

• Tasks are distributed and run in a pipelined fashion.

• State is kept within tasks.

• Transformations are applied per-record or window.

• Transformations: map, flatMap, filter, union…

• Aggregations: reduce, fold, sum

• Partitioning: forward, broadcast, shuffle, keyBy

• Sources/Sinks: custom or Kafka, Twitter, Collections…

Example

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .sum(1)
  .print()

Input: “live and let live”

After flatMap: “live” “and” “let” “live”

After map: (live,1) (and,1) (let,1) (live,1)

After keyBy and sum: (live,1) (and,1) (let,1) (live,2)
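To see why the last record is (live,2), the pipeline can be imitated without Flink in a few lines of plain Java. This is a single-threaded sketch with a hypothetical class name `RollingWordCount`; it only mirrors the logic of flatMap, map, keyBy and sum, and says nothing about how Flink executes them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java imitation of the rolling word count: one running counter per word. */
final class RollingWordCount {
    private RollingWordCount() {}

    /** Returns the (word,count) record emitted after each word of the input lines. */
    static List<String> run(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();       // keyed state: one counter per word
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {         // flatMap: line -> words
                if (word.isEmpty()) {
                    continue;                                // split may yield a leading empty token
                }
                int c = counts.merge(word, 1, Integer::sum); // map to (word, 1), keyBy word, sum
                out.add("(" + word + "," + c + ")");         // print
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [(live,1), (and,1), (let,1), (live,2)]
        System.out.println(run(Arrays.asList("live and let live")));
    }
}
```

The map of counters plays the role of the per-key state: the second "live" finds the counter left behind by the first one.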

Working with Windows

Why windows?

We are often interested in fresh data!

Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!

1) Sliding windows

myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20));

2) Tumbling windows

myKeyedStream.timeWindow(Time.seconds(60));

[Figures: time axes in seconds (0 to 120) showing the windows SUM #1, SUM #2 and SUM #3 over events labelled 15, 38, 65, 88, 110 and 120, grouped into window buckets/panes]
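The difference between the two window types comes down to how many windows one event belongs to. The sketch below computes the start times of every window containing a given timestamp; it assumes timestamps in seconds and half-open windows [start, start + size) aligned to multiples of the slide. The class `SlidingWindows` and the method `assign` are hypothetical names, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Assigns a timestamp to every window [start, start + size) that contains it. */
final class SlidingWindows {
    private SlidingWindows() {}

    /** Window start times, latest first; size and slide use the same unit as the timestamp. */
    static List<Long> assign(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is aligned to the slide and not after the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Sliding: size 60 s, slide 20 s. An event at second 65 falls into three windows.
        System.out.println(assign(65, 60, 20)); // [60, 40, 20]
        // Tumbling: slide equals size, so every event falls into exactly one window.
        System.out.println(assign(65, 60, 60)); // [60]
    }
}
```

With size 60 and slide 20 every event is counted in size / slide = 3 windows, which is why sliding windows are often computed from shared buckets/panes rather than from scratch.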

Example

Counting words over windows

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()

“live and” at 10:48, Window (10:45-10:50): (live,1) (and,1)

“let live” at 11:01, Window (11:00-11:05): (let,1) (live,1)

Example

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()

[Figure: dataflow graph flatMap → map → window sum → print, where counts are kept in state]

Example

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .setParallelism(4)
  .print()

[Figure: dataflow graph flatMap → map → window sum → print]

Making State Explicit

• Explicitly defined state is durable to failures

• Flink supports two types of explicit states

• Operator State - full state

• Key-Value State - partitioned state per key

• State Backends: In-memory, RocksDB, HDFS

Fault Tolerance

State is not affected by failures

When failures occur we revert computation and state back to a snapshot

[Figure: stream of events with snapshots snap - t1 and snap - t2 taken at times t1 and t2]
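The idea of reverting to a snapshot can be illustrated for a single task. The sketch below is deliberately simplified and is not Flink's distributed snapshotting mechanism: it assumes one task, a replayable input, and a snapshot that stores the state together with the input position. The class `SnapshottingCounter` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** A counting task whose state and input position can be snapshotted and restored. */
final class SnapshottingCounter {
    private Map<String, Integer> counts = new HashMap<>(); // task state
    private long position = 0;                             // number of events consumed so far

    /** Immutable copy of the state together with the input position it corresponds to. */
    static final class Snapshot {
        final Map<String, Integer> counts;
        final long position;

        private Snapshot(Map<String, Integer> counts, long position) {
            this.counts = counts;
            this.position = position;
        }
    }

    void onEvent(String key) {
        counts.merge(key, 1, Integer::sum);
        position++;
    }

    Snapshot snapshot() { return new Snapshot(new HashMap<>(counts), position); }

    /** Reverts state and position; the caller then replays the input from position(). */
    void restore(Snapshot s) {
        counts = new HashMap<>(s.counts);
        position = s.position;
    }

    int count(String key) { return counts.getOrDefault(key, 0); }

    long position() { return position; }

    public static void main(String[] args) {
        String[] events = {"a", "b", "a", "a"};
        SnapshottingCounter task = new SnapshottingCounter();
        task.onEvent(events[0]);
        task.onEvent(events[1]);
        Snapshot snap = task.snapshot();      // snapshot after two events
        task.onEvent(events[2]);              // progress that a failure will wipe out
        task.restore(snap);                   // failure: revert to the snapshot
        for (int i = (int) task.position(); i < events.length; i++) {
            task.onEvent(events[i]);          // replay everything after the snapshot
        }
        System.out.println(task.count("a"));  // 3, as if no failure had happened
    }
}
```

Because the snapshot records both the state and the position in the input, replaying from that position counts each event exactly once in the final state.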

Performance

• Twitter Hack Week - Flink as an in-memory data store

Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

So how is Flink different than Spark?

Two major differences

1) Stream Execution
2) Mutable State

Flink vs Spark

Flink

• dedicated resources

• mutable state

Spark (Spark Streaming)

• leased resources

• immutable state: dstream.updateStateByKey(…) puts new states in an output RDD

[Figure: input In and state S yield a new state S’]

What about DataSets?

• Sophisticated SQL-inspired optimiser

• Efficient Join Strategies

• Managed Memory bypasses Garbage Collection

• Fast, in-memory Iterative Bulk Computations

Some Interesting Libraries

Detecting Patterns

CEP Library Example (Java)

PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream,
    Pattern.begin("seismic").where(evt -> evt.motion.equals("ClassB"))
           .next("tidal").where(evt -> evt.elevation > 500));

DataStream<Alert> result = tsunamiPattern.select(
    pattern -> { return getEvacuationAlert(pattern); });

Mining Graphs with Gelly

• Iterative Graph Processing

• Scatter-Gather

• Gather-Sum-Apply

• Graph Transformations/Properties

• Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count, etc.

Coming up next: Dynamic graph processing support

Machine Learning Pipelines

• Scikit-learn inspired pipelining

• Supervised: SVM, Linear Regression

• Preprocessing: Polynomial Features, Scalers

• Recommendation: ALS

Relational Queries

Example: Stream SQL on Table API

// Ingest a DataStream from an external source
DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);

// Register the DataStream as table "Orders"
tableEnv.registerDataStream("Orders", ds, "user, product, amount");

// Run a SQL query on the Table and retrieve the result as a new Table
Table result = tableEnv.sql(
    "SELECT STREAM product, amount FROM Orders WHERE product LIKE '%Rubber%'");

Live Monitoring

Coming Soon

• Stream ML

• Stream Graph Processing (Gelly-Stream)

• Autoscaling

• Incremental Snapshots

