JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of Apache Flink

Post on 23-Jan-2018


On-Demand Data Streaming from Sensor Nodes

(ACM SoCC 2017)

and

A quick overview of Apache Flink

Presentation on Sep. 30, 2017

University of California, Santa Barbara

About me

• Researcher and PhD candidate at

– Technische Universität Berlin (DIMA)

– German Research Center for Artificial Intelligence (DFKI) / (IAM)

• Working with Volker Markl

• Before

– Master’s degree in Computer Science (KTH Stockholm and TU Berlin)

– Bachelor’s degree in Applied Computer Science (DHBW Stuttgart)

– Four years at IBM in Germany and the USA

Jonas Traub jon@s-traub.com Jonas.traub@tu-berlin.de

Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ‘17

Optimized On-Demand Data Streaming from Sensor Nodes

Jonas Traub, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl

Extended Talk for . .

Santa Clara, California,

September 25-27, 2017


The Sensor Cloud

Real-time insights

Billions of sensor nodes form a sensor cloud and provide data streams to analysis systems.


The Sensor Cloud – Problems

• Streaming all data from billions of sensors to all applications with maximal frequencies is impossible.

• Increasing data rates require expensive system scale-out.


The Sensor Cloud – Solutions

Tailor Data Streams to the Demand of Applications

• Provide an abstraction to define the data demand of applications: User-Defined Sampling Functions (UDSFs)

• Optimize communication costs while maintaining the result accuracy: Read-Time Optimization

• Share sensor reads and data transfer among users and queries: Multi-Query / Multi-User Optimization


A Motivating Example

Different Data Demands:

• Query 1 adaptively increases sampling rates when accelerating or braking.

• Query 2 requires a sample at least every 20 meters.

• Query 3 requires a sample at least every 0.3 s.


A Motivating Example - Evaluation

[Figure: evaluation results for the motivating example; annotations: -57%, -72%, and "1/3 because 3 values per tuple"]

Architecture Overview


Sensor Read Scheduling


User-Defined Sampling Functions

Input: Sensor read time and value

Output: Next Sensor Read Request

Enable adaptive sampling techniques to reduce data transmission

e.g., Adam [Trihinas ‘15], FAST [Fan ‘14], L-SIP [Gaura ’13]
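A UDSF, as described on this slide, maps the latest sensor read (time and value) to the next sensor read request. The Python sketch below illustrates that interface for the time-based and distance-based demands of the motivating example. Everything beyond the input/output contract is an assumption made for illustration: the names `ReadRequest`, `every_interval`, and `every_distance`, the `latest_time` field, the `max_period_s` fallback, and the interpretation of the sensor value as a speed in m/s.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReadRequest:
    desired_time: float  # preferred time of the next read, in seconds
    latest_time: float   # latest acceptable read time (assumed tolerance model)


# A UDSF takes the time and value of the last sensor read
# and returns the request for the next read.
UDSF = Callable[[float, float], ReadRequest]


def every_interval(period_s: float) -> UDSF:
    """Time-based demand, e.g. 'a sample at least every 0.3 s'."""
    def udsf(read_time: float, value: float) -> ReadRequest:
        next_time = read_time + period_s
        return ReadRequest(desired_time=next_time, latest_time=next_time)
    return udsf


def every_distance(meters: float, max_period_s: float) -> UDSF:
    """Distance-based demand, e.g. 'a sample at least every 20 meters'.
    The sensor value is interpreted as speed in m/s (an assumption)."""
    def udsf(read_time: float, speed_mps: float) -> ReadRequest:
        if speed_mps <= 0:
            period = max_period_s  # standing still: fall back to a fixed period
        else:
            # At the current speed, `meters` are covered in meters / speed seconds.
            period = min(max_period_s, meters / speed_mps)
        next_time = read_time + period
        return ReadRequest(desired_time=next_time, latest_time=next_time)
    return udsf


q2 = every_distance(meters=20.0, max_period_s=5.0)
q3 = every_interval(0.3)
print(q2(10.0, 40.0))  # at 40 m/s, 20 meters take 0.5 s
print(q3(10.0, 40.0))
```

Because the function sees every read value, it can change the sampling rate from one read to the next, which is what adaptive sampling techniques require.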


User-Defined Sampling Functions - Examples


Sensor Read Fusion


1) Minimize Sensor Reads and Data Transfer:

Latest possible read time
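One way to read "latest possible read time": if every read request tolerates a time interval, a read placed at the earliest deadline among the pending requests serves every request whose interval contains that time. The sketch below is the classic greedy interval-stabbing algorithm, shown only as an illustration under that assumption; it is not claimed to be the exact algorithm of the paper. The function name `fuse_reads` and the `(earliest, latest)` tuples are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (earliest, latest) acceptable read time


def fuse_reads(requests: List[Interval]) -> List[float]:
    """Return a minimal list of read times such that every
    request interval contains at least one read time."""
    read_times: List[float] = []
    # Process requests by deadline; read as late as the most urgent one allows.
    for earliest, latest in sorted(requests, key=lambda r: r[1]):
        if read_times and earliest <= read_times[-1]:
            continue  # already served by the last scheduled read
        read_times.append(latest)
    return read_times


requests = [(0.0, 1.0), (0.5, 1.5), (0.9, 2.0), (1.8, 3.0), (2.5, 3.5)]
print(fuse_reads(requests))  # two reads serve all five requests
```

Delaying each read to the deadline of the most urgent request gives later requests the best chance to overlap with it, which keeps the number of reads and transfers minimal.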


Read Time Optimization


2) Optimize Sensor Read Times:

● Minimize penalty while executing the minimum number of sensor reads only

● Challenge: assign read requests to sensor reads
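Once read requests are assigned to a sensor read, the read time can still move within the overlap of their tolerance intervals. The sketch below picks that time for one group of requests. It assumes the penalty is the summed absolute deviation from the desired read times; the slides do not define the penalty function, so this choice, the `(earliest, desired, latest)` layout, and the name `best_read_time` are hypothetical.

```python
from statistics import median
from typing import List, Tuple

# (earliest, desired, latest) read time of one request; a hypothetical layout
Request = Tuple[float, float, float]


def best_read_time(group: List[Request]) -> float:
    """Pick one read time for all requests in `group`, minimizing the
    summed absolute deviation from the desired times (assumed penalty)."""
    lo = max(r[0] for r in group)  # the read must not happen before this
    hi = min(r[2] for r in group)  # ... and not after this
    if lo > hi:
        raise ValueError("requests in the group cannot share a single read")
    # The median minimizes the sum of absolute deviations; clamping it into
    # [lo, hi] stays optimal because that sum is convex in the read time.
    return min(max(median(r[1] for r in group), lo), hi)


group = [(0.0, 0.2, 1.0), (0.5, 0.6, 1.5), (0.9, 1.4, 2.0)]
print(best_read_time(group))
```

For this group the latest possible read time would be 1.0, while reading at 0.9 serves the same three requests with a smaller total deviation, so the number of reads stays unchanged.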


Assigning Read Requests to Sensor Reads

[Figure: assignment of read requests to sensor reads; labels: "Postpone", "Assign to next Read"]

Local Filtering


● Enable adaptive filtering in combination with adaptive sampling

● Enable model-driven data acquisition
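As a rough illustration of filtering on the sensor node, the sketch below forwards a sample only if its value differs from the last transmitted value by more than a threshold. This is a generic value-based filter with a fixed threshold, chosen here as an assumption; the slides do not specify the filters used, and an adaptive scheme would additionally adjust the threshold at runtime. The name `deadband_filter` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Sample = Tuple[float, float]  # (read time, value)


def deadband_filter(samples: Iterable[Sample], threshold: float) -> Iterator[Sample]:
    """Yield only samples whose value differs from the last
    transmitted value by more than `threshold`."""
    last_sent = None
    for t, value in samples:
        if last_sent is None or abs(value - last_sent) > threshold:
            last_sent = value
            yield (t, value)


samples = [(0.0, 20.0), (0.3, 20.1), (0.6, 20.2), (0.9, 21.5), (1.2, 21.6)]
print(list(deadband_filter(samples, threshold=1.0)))
```

In this example only two of the five samples leave the sensor node; the others are read locally but not transmitted.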


Evaluation

● Replay sensor data

- from a football match [DEBS Grand Challenge ’13]

- Formula 1 telemetry data

● Random UDSFs:

- Reads follow a Poisson process (this also simulates load peaks)

- On average 1 read per query per second

- Exponentially distributed read time tolerance

  - high probability for small tolerances

  - small probability for large tolerances

- On average 0.04 s read time tolerance
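The random workload described above can be sketched as follows: read requests of one query arrive in a Poisson process with on average 1 read per second, and each request gets an exponentially distributed read time tolerance with a mean of 0.04 s. The function name, its parameters, and the fixed seed are assumptions for illustration; the actual generator used in the evaluation is not shown in the slides.

```python
import random
from typing import List, Tuple


def random_read_requests(duration_s: float,
                         rate_per_s: float = 1.0,
                         mean_tolerance_s: float = 0.04,
                         seed: int = 42) -> List[Tuple[float, float]]:
    """Generate (desired_time, tolerance) pairs for one query."""
    rng = random.Random(seed)
    requests: List[Tuple[float, float]] = []
    t = 0.0
    while True:
        # Poisson process: exponentially distributed gaps between reads.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            break
        # Exponential tolerance: small values are likely, large values rare.
        tolerance = rng.expovariate(1.0 / mean_tolerance_s)
        requests.append((t, tolerance))
    return requests


workload = random_read_requests(duration_s=1000.0)
print(len(workload), "requests in 1000 s")
```

Because the gaps between reads are random, several requests occasionally fall close together, which is how a Poisson process also produces short load peaks.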


Increasing the number of concurrent queries

• On-Demand scheduling reduces sensor reads and data transfer by up to 87%.

• The # of reads and transfers increases sub-linearly with the # of queries.

• Our read-time optimizer reduces the deviation from desired read times by up to 69% (preserving the min. # of reads and transfers).

Increasing read time tolerances


Query Prioritization

Slack Robustness of Adaptive Sampling

Optimized On-Demand Data Streaming from Sensor Nodes

Wrap-Up:

Tailor Data Streams to the Demand of Applications

• Define data demand: User-Defined Sampling Functions

• Schedule sensor reads and data transfer on-demand

• Optimize read times globally - for all users and queries

Jonas Traub, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl


A quick overview of Apache Flink - research summary -

Jonas Traub visiting September 30, 2017

Outline

Apache Flink Primer

• Stratosphere – The origin of Apache Flink

• What is Apache Flink? – Basic System Internals

• The Flink Community

An Apache Flink Research Summary

Stratosphere: General Purpose Programming + Database Execution

Draws on Database Technology

• Relational Algebra

• Declarativity

• Query Optimization

• Robust Out-of-core

Draws on MapReduce Technology

• Scalability

• User-defined Functions

• Complex Data Types

• Schema on Read

Adds

• Iterations

• Advanced Dataflows

• General APIs

• Native Streaming

© Volker Markl

What is Apache Flink?

Apache Flink is an open source platform for scalable batch and stream data processing.

http://flink.apache.org

A distributed system that you can use to process data

Like a DBMS but not exactly a DBMS

What kind of data? Data that comes in the form of streams

What kind of processing? Quite flexible. You can use Java/Scala APIs similar to programming with Java collections, the new SQL API, etc.

Distributed: runs on many (1000s of) machines and hides this complexity from the user

Basic application architecture

[Figure: components of the architecture]

• Sources of data (e.g., sensors, logs, …)

• Event log: a replayable log of events with pub/sub functionality

• Processing of events (app state)

• Storage and query systems (query service)

By courtesy of Kostas Tzoumas

© 2013 Berlin Big Data Center • All Rights Reserved

Technology inside Flink

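Program:

    import org.apache.flink.api.scala._

    // A path (or edge) in the graph, from one vertex id to another.
    case class Path(from: Long, to: Long)

    // edges: DataSet[Path] is assumed to be defined elsewhere.
    // Transitive closure: extend the known paths by one edge, 10 times.
    val tc = edges.iterate(10) { paths: DataSet[Path] =>
      val next = paths
        .join(edges)
        .where("to")
        .equalTo("from") { (path, edge) =>
          Path(path.from, edge.to)
        }
        .union(paths)
        .distinct()
      next
    }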

[Figure: from program to execution]

• Pre-flight (Client): type extraction stack, cost-based optimizer

• Dataflow Graph (example): DataSource orders.tbl, Filter, Map, DataSource lineitem.tbl, Join (Hybrid Hash: build HT, probe), hash-part [0], GroupRed, sort, forward

• Master: task scheduling, recovery metadata; deploy operators, track intermediate results

• Workers

Flink community

[Figure: three charts over Feb 15, Dec 15, and Dec 16: number of contributors, stars on GitHub, forks on GitHub]

By courtesy of Kostas Tzoumas

Project Statistics (Updated: Sep 29, 2017)

Companies Using Flink

Apache Flink - Related Publications

System Paper 2015:

System Paper 2014:


State Management

VLDB 2017

Iterative Processing

VLDB 2012

SIGMOD 2013

Fault Tolerance

Streaming Window Aggregation

Visualization of Streaming Data
