An Intro to Modern Data Stream Analytics
EIT Summer School 2016
Paris Carbone
PhD Candidate @ KTH <[email protected]> Committer @ Apache Flink <[email protected]>
Motivation

• Time-critical problems / Actionable Insights
• Stock market predictions
• Fraud detection
• Network security
• Fresh customer recommendations
more like First-World Problems…
How about Tsunamis?
Motivation

[Diagram: deploy sensors → collect data → analyse data regularly; a query Q over earth & wave activity determines the evacuation window]
Motivation

[Diagram: the query Q is re-issued periodically over the collected data]
[Diagram: a Standing Query Q is evaluated continuously over the incoming data, keeping the evacuation window up to date]
The Data Stream Paradigm
• Standing queries are evaluated continuously
• Input data is unbounded
• Queries operate on the full data stream or on the most recent views of the stream ~ windows
Data Stream Basics

• Events/Tuples: elements of computation that respect a schema
• Data Streams : unbounded sequences of events
• Stream Operators/Tasks: consume and produce data streams
• Events are consumed once - no backtracking!
[Diagram: an operator f consumes input streams S1, S2, … and produces output streams S'1, S'2]

Where are computations stored?
Synopsis - Task State

We cannot store all the events we have ever seen.
• Synopsis: A summary of an infinite stream
• It is in principle any streaming operator state
• Examples: samples, histograms, sketches, state machines…
[Diagram: operator f keeps a synopsis s, a summary of everything seen so far; for each input tuple t it 1) processes t with s, 2) updates s, 3) produces t']
Synopses-Aggregations
• Discussion - Rolling Aggregations
• Propose a synopsis s = ? when:
• f = max
• f = ArithmeticMean
• f = stDev
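One possible answer per case, sketched in Scala below (illustrative, not from the slides): max needs only the largest value seen so far; the arithmetic mean needs a running (count, sum) pair; the standard deviation additionally needs a running sum of squared deviations, which Welford's online algorithm maintains in constant space.

// Minimal constant-space synopses for rolling aggregates (illustrative names).
object RollingSynopses {

  // f = max: the synopsis is just the largest value seen so far
  case class MaxS(max: Double = Double.NegativeInfinity) {
    def update(t: Double): MaxS = MaxS(math.max(max, t))
  }

  // f = ArithmeticMean: keep (count, sum); the mean is sum / count
  case class MeanS(n: Long = 0L, sum: Double = 0.0) {
    def update(t: Double): MeanS = MeanS(n + 1, sum + t)
    def mean: Double = if (n == 0) 0.0 else sum / n
  }

  // f = stDev: Welford's algorithm keeps (count, mean, M2), where M2 is
  // the running sum of squared deviations from the current mean
  case class StdDevS(n: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {
    def update(t: Double): StdDevS = {
      val n1 = n + 1
      val d  = t - mean
      val m  = mean + d / n1
      StdDevS(n1, m, m2 + d * (t - m))
    }
    def stDev: Double = if (n < 2) 0.0 else math.sqrt(m2 / (n - 1))
  }
}

In all three cases the synopsis is a handful of numbers, no matter how many events the stream has carried.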
Synopses-Approximations
• Discussion - Approximate Results
• Propose a synopsis s = ? when:
• f = uniform random sample of k records over the whole stream
• f = filter distinct records over windows of 1000 records with a 5% error
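For the uniform sample, the classic synopsis is a reservoir of k records (Algorithm R); for the distinct-record filter, a per-window Bloom filter provides approximate membership tests with a bounded false-positive rate. A minimal reservoir-sampling sketch in Scala (illustrative, not from the slides):

import scala.util.Random

// After n events, `sample` holds a uniform random sample of min(k, n) of them.
class Reservoir[T](k: Int, rng: Random = new Random) {
  private val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  private var n = 0L // events seen so far

  def update(t: T): Unit = {
    n += 1
    if (buf.size < k) buf += t // fill the reservoir first
    else {
      // keep the new event with probability k / n, evicting a random slot
      val j = (rng.nextDouble() * n).toLong
      if (j < k) buf(j.toInt) = t
    }
  }

  def sample: Seq[T] = buf.toSeq
}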
Synopses-ML and Graphs
• Examples of cool synopses to check out
• Sparsifiers/Spanners - approximating graph properties such as shortest paths
• Change detectors - detecting concept drift
• Incremental decision trees - continuous stream training and classification
Data Stream Basics
Any other problems?
[Diagram: a single operator f over streams S1, S2, … producing S'1, S'2]

Does this scale?
Task Parallelism

• We need task parallelism:
• Data might be too large to process
• State can get too large to fit in memory (e.g. graphs)
• Data Streams might already be partitioned! (e.g. by key / Kafka partitions)
[Diagram: the operator f over streams S1, S2 and outputs S'1, S'2, now with parallel instances]

How do streams get partitioned?
Task Partitioning

• Partitioning defines how we allocate events to each parallel task instance. Typical partitioners are:
• Broadcast
• Shuffle
• Key-based
(a conceptual sketch follows the diagram below)
[Diagram: producers P route events to parallel task instances fs, partitioned by colour/key]
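Conceptually, a partitioner maps each event to the index (or indices) of the receiving parallel instance. A minimal Scala sketch of the three strategies (illustrative pseudocode, not any system's real API):

import scala.util.Random

object Partitioners {
  // Broadcast: every parallel instance receives the event
  def broadcast(parallelism: Int): Seq[Int] = 0 until parallelism

  // Shuffle: a uniformly random instance receives the event
  def shuffle(parallelism: Int, rng: Random = new Random): Int =
    rng.nextInt(parallelism)

  // Key-based: events with the same key always reach the same instance
  def byKey[K](key: K, parallelism: Int): Int =
    ((key.hashCode % parallelism) + parallelism) % parallelism
}

Key-based partitioning is what makes per-key state possible: all events with the same key meet at the same task instance.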
Dataflow Pipelines
[Diagram: sources feed stream1 and stream2 into a query Q whose sinks emit approximations, predictions, alerts, …]
Dataflow Programming with Apache Storm
• Step 1: Implement input operators (Spouts) and intermediate operators (Bolts)
• Step 2: Construct a Topology by combining operators
[Diagram: Spout → Bolt → Bolt]
Spouts are the topology sources
They listen to data feeds
Bolts represent all intermediate computation vertices of the topology
They do arbitrary data manipulation
Each operator can emit/subscribe to Streams (computation results)
Example: Topology Definition
[Diagram: a topology wiring stream "numbers" into "new_numbers" and then into "toFile"; a reconstruction is sketched below]
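A hedged reconstruction of what such a topology definition looks like, using Storm's TopologyBuilder from Scala (only the names "numbers", "new_numbers" and "toFile" come from the slide; NumberSpout, IncrementBolt and FileBolt are hypothetical components whose implementations are omitted):

import org.apache.storm.topology.TopologyBuilder

// Wire the topology: a spout emits "numbers", a bolt derives
// "new_numbers" from them, and a final bolt writes results to a file.
val builder = new TopologyBuilder()
builder.setSpout("numbers", new NumberSpout())      // hypothetical spout
builder.setBolt("new_numbers", new IncrementBolt()) // hypothetical bolt
  .shuffleGrouping("numbers")                       // subscribe to "numbers"
builder.setBolt("toFile", new FileBolt())           // hypothetical sink bolt
  .shuffleGrouping("new_numbers")
val topology = builder.createTopology()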
Stream Analytics Systems
Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark, Beam
Programming Models
Compositional:
• Physical Representations
• Offer basic building blocks (Operators/Data Exchange)
• Custom Optimisation/Tuning

Declarative:
• Logical Representations
• Operators are transformations on abstract data types
• Advanced behaviour such as windowing is supported
• Self-Optimisation
Programming Abstraction Levels
Low level:
• Direct access to the execution graph / topology
• Suitable for engineers

High level (DStream, DataStream, PCollection…):
• Transformations abstract operator details
• Suitable for engineers and data analysts
Introducing Apache Flink
[Chart: #unique contributor ids by git commits, July 2009 to May 2016, rising from 0 to over 100]
• A Top-level project
• Community-driven open source software development
• Publicly open to new contributors
Native Workload Support
Apache Flink natively supports:
• Stream Pipelines
• Batch Pipelines
• Scalable Machine Learning
• Graph Analytics
The Apache Flink Stack
• APIs: DataStream, DataSet
• Execution: Distributed Dataflow
• Deployment

DataSet: bounded data sources, staged/pipelined execution
DataStream: unbounded data sources, pipelined execution
The Big Picture
[Diagram: libraries on top of the DataSet and DataStream APIs, the Distributed Dataflow runtime and the Deployment layer: Gelly (graphs), Table, ML, Hadoop M/R compatibility, CEP and SQL]
Basic API Concept
Source → Data Stream → Operator → Data Stream → Sink
Source → Data Set → Operator → Data Set → Sink
Writing a Flink Program

1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks
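All three steps in one minimal Scala sketch (the socket source, host/port and job name are illustrative choices, not from the slides):

import org.apache.flink.streaming.api.scala._

object WordCountJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. Bootstrap a source (here: lines of text from a socket)
    val textStream = env.socketTextStream("localhost", 9999)

    // 2. Apply operators
    val counts = textStream
      .flatMap {_.split("\\W+")}
      .map {(_, 1)}
      .keyBy(0)
      .sum(1)

    // 3. Output to a sink
    counts.print()

    env.execute("WordCount")
  }
}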
Data Streams as Abstract Data Types
• Tasks are distributed and run in a pipelined fashion.
• State is kept within tasks.
• Transformations are applied per-record or window.
• Transformations: map, flatmap, filter, union…
• Aggregations: reduce, fold, sum
• Partitioning: forward, broadcast, shuffle, keyBy
• Sources/Sinks: custom or Kafka, Twitter, Collections…
Example
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .sum(1)
  .print()

Input: “live and let live”
After flatMap: “live” “and” “let” “live”
After map: (live,1) (and,1) (let,1) (live,1)
Output (rolling sum): (live,1) (and,1) (let,1) (live,2)
Working with Windows
Why windows?
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
[Diagram: an event timeline split into window buckets/panes, with sums computed over 1) sliding windows of 60s every 20s and 2) tumbling windows of 60s]
1) Sliding windows:
myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20));

2) Tumbling windows:
myKeyedStream.timeWindow(Time.seconds(60));

We are often interested in fresh data!
Example
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
Counting words over windows:
“live and” at 10:48 → (live,1) (and,1) in window (10:45-10:50)
“let live” at 11:01 → (let,1) (live,1) in window (11:00-11:05)
Example
[Dataflow: flatMap → map → window sum → print, where counts are kept in window state]

textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
Example
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .setParallelism(4)
  .print()

[Dataflow: flatMap → map → window sum with parallelism 4 → print]
Making State Explicit
• Explicitly defined state is durable to failures
• Flink supports two types of explicit states
• Operator State - full state
• Key-Value State - partitioned state per key
• State Backends: In-memory, RocksDB, HDFS (see the sketch below)
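A minimal sketch of explicit Key-Value state in Scala (the RunningCount class is illustrative; the ValueState/ValueStateDescriptor API is Flink's, as of this talk's timeframe):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Keeps an explicit, fault-tolerant running count per key.
class RunningCount extends RichFlatMapFunction[(String, Int), (String, Long)] {
  @transient private var count: ValueState[Long] = _

  override def open(conf: Configuration): Unit =
    count = getRuntimeContext.getState(
      new ValueStateDescriptor("count", classOf[Long], 0L))

  override def flatMap(in: (String, Int), out: Collector[(String, Long)]): Unit = {
    val updated = count.value() + in._2   // read the partitioned state
    count.update(updated)                 // write it back
    out.collect((in._1, updated))
  }
}

// Usage: explicit Key-Value state requires a keyed stream
// stream.keyBy(0).flatMap(new RunningCount)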
Fault Tolerance
[Diagram: events flow while consistent snapshots of operator state are taken at t1 and t2]

State is not affected by failures: when failures occur, we revert computation and state back to the latest snapshot.
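Snapshotting is enabled per job on the stream environment; a minimal sketch (the interval and backend path are illustrative choices):

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Draw a consistent snapshot of all operator state every 10 seconds
env.enableCheckpointing(10000)

// Keep snapshots in a durable backend, e.g. a filesystem/HDFS path
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))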
Performance

• Twitter Hack Week - Flink as an in-memory data store
Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
So how is Flink different from Spark?

Two major differences:
1) Stream Execution
2) Mutable State
Flink vs Spark

[Diagram: in Spark Streaming, dstream.updateStateByKey(…) folds input In and state S into a new state S', putting new states in an output RDD]

• Flink: dedicated resources, mutable state
• Spark: leased resources, immutable state
(see the sketch below)
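For contrast, a minimal Spark Streaming sketch of the immutable-state side (updateStateByKey is Spark's API; the input DStream is assumed): every micro-batch returns a new state value per key, collected into a new RDD, rather than mutating state in place.

import org.apache.spark.streaming.dstream.DStream

// Assumed: a DStream of (word, 1) pairs; requires ssc.checkpoint(...) at runtime.
def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
  pairs.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
    // a *new* immutable state for this key, computed every micro-batch
    Some(state.getOrElse(0) + values.sum)
  }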
What about DataSets?
• Sophisticated SQL-inspired optimiser
• Efficient Join Strategies
• Managed Memory bypasses Garbage Collection
• Fast, in-memory Iterative Bulk Computations (see the sketch below)
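A small taste of DataSet iterations in Scala (the fixed-point computation is an illustrative toy, not from the slides): iterate re-applies the step function over the distributed data set, keeping intermediate results in managed memory.

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Bulk iteration: approximate the golden ratio via the map x -> 1 + 1/x
val result = env.fromElements(1.0).iterate(50) { x =>
  x.map(v => 1.0 + 1.0 / v)
}
result.print()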
Some Interesting Libraries
Detecting Patterns
CEP Library Example (Java):

PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream, Pattern
    .begin("seismic").where(evt -> evt.motion.equals("ClassB"))
    .next("tidal").where(evt -> evt.elevation > 500));

DataStream<Alert> result = tsunamiPattern.select(
    pattern -> { return getEvacuationAlert(pattern); });
Mining Graphs with Gelly
• Iterative Graph Processing
• Scatter-Gather
• Gather-Sum-Apply
• Graph Transformations/Properties
• Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count, etc…
Coming up next : Dynamic graph processing support
Machine Learning Pipelines
• Scikit-learn inspired pipelining
• Supervised: SVM, Linear Regression
• Preprocessing: Polynomial Features, Scalers
• Recommendation: ALS (a pipeline sketch follows)
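A minimal FlinkML pipeline sketch in Scala (hedged: the parameter value and training DataSet are assumed; the scaler-into-SVM chaining mirrors the scikit-learn style named above):

import org.apache.flink.api.scala._
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.preprocessing.StandardScaler

// Assumed: a DataSet[LabeledVector] of labelled training examples
def train(training: DataSet[LabeledVector]) = {
  val scaler   = StandardScaler()            // preprocessing stage
  val svm      = SVM().setIterations(100)    // supervised learner
  val pipeline = scaler.chainPredictor(svm)  // scikit-learn-style chaining
  pipeline.fit(training)                     // fits the scaler, then the SVM
  pipeline
}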
Relational Queries
Example: Stream SQL on the Table API

// Ingest a DataStream from an external source
DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);

// Register the DataStream as table "Orders"
tableEnv.registerDataStream("Orders", ds, "user, product, amount");

// Run a SQL query on the Table and retrieve the result as a new Table
Table result = tableEnv.sql(
  "SELECT STREAM product, amount FROM Orders WHERE product LIKE '%Rubber%'");
Live Monitoring
Coming Soon
• Stream ML
• Stream Graph Processing (Gelly-Stream)
• Autoscaling
• Incremental Snapshots