An Intro to Modern Data Stream Analytics
EIT Summer School 2016
Paris Carbone
PhD Candidate @ KTH <[email protected]> Committer @ Apache Flink <[email protected]>
Motivation

• Time-critical problems / Actionable Insights
• Stock market predictions
• Fraud detection
• Network security
• Fresh customer recommendations
more like First-World Problems…
How about Tsunamis?
Motivation

[Diagram: deploy sensors → collect data → analyse data regularly; a query Q over earth & wave activity determines the evacuation window]
Motivation

[Diagram: the query Q is re-issued periodically over the collected data]
[Diagram: a Standing Query Q is evaluated continuously over the incoming data, keeping the evacuation window up to date]
The Data Stream Paradigm
• Standing queries are evaluated continuously
• Input data is unbounded
• Queries operate on the full data stream or on the most recent views of the stream ~ windows
Data Stream Basics

• Events/Tuples: elements of computation that respect a schema
• Data Streams : unbounded sequences of events
• Stream Operators/Tasks: consume and produce data streams
• Events are consumed once - no backtracking!
[Diagram: an operator f consumes input streams S1, S2, … and produces output streams S'1, S'2]

Where are computations stored?
Synopsis - Task State

We cannot store all the events we have ever seen.
• Synopsis: A summary of an infinite stream
• It is in principle any streaming operator state
• Examples: samples, histograms, sketches, state machines…
[Diagram: operator f keeps a synopsis s, a summary of everything seen so far; for each input tuple t it 1) processes t with s, 2) updates s, 3) produces t']
Synopses-Aggregations
• Discussion - Rolling Aggregations
• Propose a synopsis s = ? when:
• f = max
• f = ArithmeticMean
• f = stDev
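One possible answer per case, sketched in Scala below (illustrative, not from the slides): max needs only the largest value seen so far; the arithmetic mean needs a running (count, sum) pair; the standard deviation additionally needs a running sum of squared deviations, which Welford's online algorithm maintains in constant space.

// Minimal constant-space synopses for rolling aggregates (illustrative names).
object RollingSynopses {

  // f = max: the synopsis is just the largest value seen so far
  case class MaxS(max: Double = Double.NegativeInfinity) {
    def update(t: Double): MaxS = MaxS(math.max(max, t))
  }

  // f = ArithmeticMean: keep (count, sum); the mean is sum / count
  case class MeanS(n: Long = 0L, sum: Double = 0.0) {
    def update(t: Double): MeanS = MeanS(n + 1, sum + t)
    def mean: Double = if (n == 0) 0.0 else sum / n
  }

  // f = stDev: Welford's algorithm keeps (count, mean, M2), where M2 is
  // the running sum of squared deviations from the current mean
  case class StdDevS(n: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {
    def update(t: Double): StdDevS = {
      val n1 = n + 1
      val d  = t - mean
      val m  = mean + d / n1
      StdDevS(n1, m, m2 + d * (t - m))
    }
    def stDev: Double = if (n < 2) 0.0 else math.sqrt(m2 / (n - 1))
  }
}

In all three cases the synopsis is a handful of numbers, no matter how many events the stream has carried.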
Synopses-Approximations
• Discussion - Approximate Results
• Propose a synopsis s = ? when:
• f = uniform random sample of k records over the whole stream
• f = filter distinct records over windows of 1000 records with a 5% error
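For the uniform sample, the classic synopsis is a reservoir of k records (Algorithm R); for the distinct-record filter, a per-window Bloom filter provides approximate membership tests with a bounded false-positive rate. A minimal reservoir-sampling sketch in Scala (illustrative, not from the slides):

import scala.util.Random

// After n events, `sample` holds a uniform random sample of min(k, n) of them.
class Reservoir[T](k: Int, rng: Random = new Random) {
  private val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  private var n = 0L // events seen so far

  def update(t: T): Unit = {
    n += 1
    if (buf.size < k) buf += t // fill the reservoir first
    else {
      // keep the new event with probability k / n, evicting a random slot
      val j = (rng.nextDouble() * n).toLong
      if (j < k) buf(j.toInt) = t
    }
  }

  def sample: Seq[T] = buf.toSeq
}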
Synopses-ML and Graphs
• Examples of cool synopses to check out
• Sparsifiers/Spanners - approximating graph properties such as shortest paths
• Change detectors - detecting concept drift
• Incremental decision trees - continuous stream training and classification
Data Stream Basics
Any other problems?
[Diagram: a single operator f over streams S1, S2, … producing S'1, S'2]

Does this scale?
Task Parallelism

• We need task parallelism:
• Data might be too large to process
• State can get too large to fit in memory (e.g. graphs)
• Data Streams might already be partitioned! (e.g. by key / Kafka partitions)
[Diagram: the operator f over streams S1, S2 and outputs S'1, S'2, now with parallel instances]

How do streams get partitioned?
Task Partitioning

• Partitioning defines how we allocate events to each parallel task instance. Typical partitioners are:
• Broadcast
• Shuffle
• Key-based
(a conceptual sketch follows the diagram below)
[Diagram: producers P route events to parallel task instances fs, partitioned by colour/key]
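Conceptually, a partitioner maps each event to the index (or indices) of the receiving parallel instance. A minimal Scala sketch of the three strategies (illustrative pseudocode, not any system's real API):

import scala.util.Random

object Partitioners {
  // Broadcast: every parallel instance receives the event
  def broadcast(parallelism: Int): Seq[Int] = 0 until parallelism

  // Shuffle: a uniformly random instance receives the event
  def shuffle(parallelism: Int, rng: Random = new Random): Int =
    rng.nextInt(parallelism)

  // Key-based: events with the same key always reach the same instance
  def byKey[K](key: K, parallelism: Int): Int =
    ((key.hashCode % parallelism) + parallelism) % parallelism
}

Key-based partitioning is what makes per-key state possible: all events with the same key meet at the same task instance.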
Dataflow Pipelines
[Diagram: sources feed stream1 and stream2 into a query Q whose sinks emit approximations, predictions, alerts, …]
Dataflow Programming with Apache Storm
• Step 1: Implement input operators (Spouts) and intermediate operators (Bolts)
• Step 2: Construct a Topology by combining operators
[Diagram: Spout → Bolt → Bolt]
Spouts are the topology sources
They listen to data feeds
Bolts represent all intermediate computation vertices of the topology
They do arbitrary data manipulation
Each operator can emit/subscribe to Streams (computation results)
Example: Topology Definition
[Diagram: a topology wiring stream "numbers" into "new_numbers" and then into "toFile"; a reconstruction is sketched below]
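A hedged reconstruction of what such a topology definition looks like, using Storm's TopologyBuilder from Scala (only the names "numbers", "new_numbers" and "toFile" come from the slide; NumberSpout, IncrementBolt and FileBolt are hypothetical components whose implementations are omitted):

import org.apache.storm.topology.TopologyBuilder

// Wire the topology: a spout emits "numbers", a bolt derives
// "new_numbers" from them, and a final bolt writes results to a file.
val builder = new TopologyBuilder()
builder.setSpout("numbers", new NumberSpout())      // hypothetical spout
builder.setBolt("new_numbers", new IncrementBolt()) // hypothetical bolt
  .shuffleGrouping("numbers")                       // subscribe to "numbers"
builder.setBolt("toFile", new FileBolt())           // hypothetical sink bolt
  .shuffleGrouping("new_numbers")
val topology = builder.createTopology()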
Stream Analytics Systems
Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark, Beam
Programming Models
Compositional:
• Physical Representations
• Offer basic building blocks (Operators/Data Exchange)
• Custom Optimisation/Tuning

Declarative:
• Logical Representations
• Operators are transformations on abstract data types
• Advanced behaviour such as windowing is supported
• Self-Optimisation
Programming Abstraction Levels
Low level:
• Direct access to the execution graph / topology
• Suitable for engineers

High level (DStream, DataStream, PCollection…):
• Transformations abstract operator details
• Suitable for engineers and data analysts
Introducing Apache Flink
[Chart: #unique contributor ids by git commits, July 2009 to May 2016, rising from 0 to over 100]
• A Top-level project
• Community-driven open source software development
• Publicly open to new contributors
Native Workload Support
Apache Flink natively supports:
• Stream Pipelines
• Batch Pipelines
• Scalable Machine Learning
• Graph Analytics
The Apache Flink Stack
• APIs: DataStream, DataSet
• Execution: Distributed Dataflow
• Deployment

DataSet: bounded data sources, staged/pipelined execution
DataStream: unbounded data sources, pipelined execution
The Big Picture
[Diagram: libraries on top of the DataSet and DataStream APIs, the Distributed Dataflow runtime and the Deployment layer: Gelly (graphs), Table, ML, Hadoop M/R compatibility, CEP and SQL]
Basic API Concept
Source → Data Stream → Operator → Data Stream → Sink
Source → Data Set → Operator → Data Set → Sink
Writing a Flink Program

1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks
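All three steps in one minimal Scala sketch (the socket source, host/port and job name are illustrative choices, not from the slides):

import org.apache.flink.streaming.api.scala._

object WordCountJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. Bootstrap a source (here: lines of text from a socket)
    val textStream = env.socketTextStream("localhost", 9999)

    // 2. Apply operators
    val counts = textStream
      .flatMap {_.split("\\W+")}
      .map {(_, 1)}
      .keyBy(0)
      .sum(1)

    // 3. Output to a sink
    counts.print()

    env.execute("WordCount")
  }
}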
Data Streams as Abstract Data Types
• Tasks are distributed and run in a pipelined fashion.
• State is kept within tasks.
• Transformations are applied per-record or window.
• Transformations: map, flatmap, filter, union…
• Aggregations: reduce, fold, sum
• Partitioning: forward, broadcast, shuffle, keyBy
• Sources/Sinks: custom or Kafka, Twitter, Collections…
Example
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .sum(1)
  .print()

Input: “live and let live”
After flatMap: “live” “and” “let” “live”
After map: (live,1) (and,1) (let,1) (live,1)
Output (rolling sum): (live,1) (and,1) (let,1) (live,2)
Working with Windows
Why windows?
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
[Diagram: an event timeline split into window buckets/panes, with sums computed over 1) sliding windows of 60s every 20s and 2) tumbling windows of 60s]
1) Sliding windows:
myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20));

2) Tumbling windows:
myKeyedStream.timeWindow(Time.seconds(60));

We are often interested in fresh data!
Example
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
Counting words over windows:
“live and” at 10:48 → (live,1) (and,1) in window (10:45-10:50)
“let live” at 11:01 → (let,1) (live,1) in window (11:00-11:05)
Example
[Dataflow: flatMap → map → window sum → print, where counts are kept in window state]

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
Example
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .setParallelism(4)
  .print()

[Dataflow: flatMap → map → window sum with parallelism 4 → print]
Making State Explicit
• Explicitly defined state is durable to failures
• Flink supports two types of explicit states
• Operator State - full state
• Key-Value State - partitioned state per key
• State Backends: In-memory, RocksDB, HDFS (see the sketch below)
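A minimal sketch of explicit Key-Value state in Scala (the RunningCount class is illustrative; the ValueState/ValueStateDescriptor API is Flink's, as of this talk's timeframe):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Keeps an explicit, fault-tolerant running count per key.
class RunningCount extends RichFlatMapFunction[(String, Int), (String, Long)] {
  @transient private var count: ValueState[Long] = _

  override def open(conf: Configuration): Unit =
    count = getRuntimeContext.getState(
      new ValueStateDescriptor("count", classOf[Long], 0L))

  override def flatMap(in: (String, Int), out: Collector[(String, Long)]): Unit = {
    val updated = count.value() + in._2   // read the partitioned state
    count.update(updated)                 // write it back
    out.collect((in._1, updated))
  }
}

// Usage: explicit Key-Value state requires a keyed stream
// stream.keyBy(0).flatMap(new RunningCount)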
Fault Tolerance
[Diagram: events flow while consistent snapshots of operator state are taken at t1 and t2]

State is not affected by failures: when failures occur, we revert computation and state back to the latest snapshot.
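Snapshotting is enabled per job on the stream environment; a minimal sketch (the interval and backend path are illustrative choices):

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Draw a consistent snapshot of all operator state every 10 seconds
env.enableCheckpointing(10000)

// Keep snapshots in a durable backend, e.g. a filesystem/HDFS path
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))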
Performance

• Twitter Hack Week - Flink as an in-memory data store
Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
So how is Flink different from Spark?

Two major differences:
1) Stream Execution
2) Mutable State
Flink vs Spark

[Diagram: in Spark Streaming, dstream.updateStateByKey(…) folds input In and state S into a new state S', putting new states in an output RDD]

• Flink: dedicated resources, mutable state
• Spark: leased resources, immutable state
(see the sketch below)
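For contrast, a minimal Spark Streaming sketch of the immutable-state side (updateStateByKey is Spark's API; the input DStream is assumed): every micro-batch returns a new state value per key, collected into a new RDD, rather than mutating state in place.

import org.apache.spark.streaming.dstream.DStream

// Assumed: a DStream of (word, 1) pairs; requires ssc.checkpoint(...) at runtime.
def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
  pairs.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
    // a *new* immutable state for this key, computed every micro-batch
    Some(state.getOrElse(0) + values.sum)
  }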
What about DataSets?
• Sophisticated SQL-inspired optimiser
• Efficient Join Strategies
• Managed Memory bypasses Garbage Collection
• Fast, in-memory Iterative Bulk Computations (see the sketch below)
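A small taste of DataSet iterations in Scala (the fixed-point computation is an illustrative toy, not from the slides): iterate re-applies the step function over the distributed data set, keeping intermediate results in managed memory.

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Bulk iteration: approximate the golden ratio via the map x -> 1 + 1/x
val result = env.fromElements(1.0).iterate(50) { x =>
  x.map(v => 1.0 + 1.0 / v)
}
result.print()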
Some Interesting Libraries
Detecting Patterns
CEP Library Example (Java):

PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream, Pattern
    .begin("seismic").where(evt -> evt.motion.equals("ClassB"))
    .next("tidal").where(evt -> evt.elevation > 500));

DataStream<Alert> result = tsunamiPattern.select(
    pattern -> { return getEvacuationAlert(pattern); });
Mining Graphs with Gelly
• Iterative Graph Processing
• Scatter-Gather
• Gather-Sum-Apply
• Graph Transformations/Properties
• Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count, etc…
Coming up next : Dynamic graph processing support
Machine Learning Pipelines
• Scikit-learn inspired pipelining
• Supervised: SVM, Linear Regression
• Preprocessing: Polynomial Features, Scalers
• Recommendation: ALS (a pipeline sketch follows)
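A minimal FlinkML pipeline sketch in Scala (hedged: the parameter value and training DataSet are assumed; the scaler-into-SVM chaining mirrors the scikit-learn style named above):

import org.apache.flink.api.scala._
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.preprocessing.StandardScaler

// Assumed: a DataSet[LabeledVector] of labelled training examples
def train(training: DataSet[LabeledVector]) = {
  val scaler   = StandardScaler()            // preprocessing stage
  val svm      = SVM().setIterations(100)    // supervised learner
  val pipeline = scaler.chainPredictor(svm)  // scikit-learn-style chaining
  pipeline.fit(training)                     // fits the scaler, then the SVM
  pipeline
}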
Relational Queries
Example: Stream SQL on the Table API

// Ingest a DataStream from an external source
DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);

// Register the DataStream as table "Orders"
tableEnv.registerDataStream("Orders", ds, "user, product, amount");

// Run a SQL query on the Table and retrieve the result as a new Table
Table result = tableEnv.sql(
  "SELECT STREAM product, amount FROM Orders WHERE product LIKE '%Rubber%'");
Live Monitoring
Coming Soon
• Stream ML
• Stream Graph Processing (Gelly-Stream)
• Autoscaling
• Incremental Snapshots