Apache Flink: Past, Present and Future


Gyula Fóra

What is Apache Flink


Distributed Data Flow Processing System

▪Focused on large-scale data analytics

▪Unified real-time stream and batch processing

▪Easy and powerful APIs in Java / Scala (+ Python)

▪Robust and fast execution backend

[Figure: example dataflow graph built from Source, Map, Filter, Join, Reduce, Iterate, and Sink operators]

What is Flink good at


It's a general-purpose data analytics system

▪Real-time stream processing with flexible windowing

▪Complex and heavy ETL jobs

▪Analyzing huge graphs

▪Machine learning on large data sets and streams

▪…

The Flink Stack

[Figure: the Flink stack, top to bottom]

Libraries and frontends: Python, Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R

APIs: DataSet (Java/Scala) with the Batch Optimizer; DataStream (Java/Scala) with the Streaming Optimizer

Core: Flink Runtime

Deployment: Local, Remote, Yarn, Tez, Embedded

Word count in Flink

DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Time.of(1, MINUTES)).every(Time.of(30, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
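
The difference between the two word counts is the window: the streaming variant counts over a 1-minute window that is evaluated every 30 seconds, while the batch variant counts over the whole input. A minimal sketch in plain Scala (no Flink dependency) may make this concrete. `TimedLine`, `WordCountSketch`, and the window alignment (windows end at multiples of the slide interval) are assumptions made for illustration; they are not part of the Flink API and may not match how Flink aligns windows.

```scala
// Plain-Scala illustration of batch vs. sliding-window word count.
// All names here are hypothetical; nothing in this block calls Flink.
object WordCountSketch {
  final case class Word(word: String, frequency: Int)
  final case class TimedLine(timestampSec: Long, text: String)

  // Batch-style count: split lines into words, group by word, sum frequencies.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").filter(_.nonEmpty).map(w => Word(w, 1)))
      .groupBy(_.word)
      .map { case (w, ws) => w -> ws.map(_.frequency).sum }

  // Sliding-window count: a window of sizeSec seconds is evaluated every
  // slideSec seconds. Each result pairs the window end with the counts of
  // all lines whose timestamp falls in [end - sizeSec, end).
  def slidingWordCount(
      lines: Seq[TimedLine],
      sizeSec: Long,
      slideSec: Long): Seq[(Long, Map[String, Int])] = {
    require(sizeSec > 0 && slideSec > 0, "window size and slide must be positive")
    if (lines.isEmpty) Seq.empty
    else {
      val lastTimestamp = lines.map(_.timestampSec).max
      // Window ends at slideSec, 2 * slideSec, ... until the last line is covered.
      val windowEnds = slideSec to (lastTimestamp + slideSec) by slideSec
      windowEnds.map { end =>
        val inWindow =
          lines.filter(l => l.timestampSec >= end - sizeSec && l.timestampSec < end)
        end -> wordCount(inWindow.map(_.text))
      }
    }
  }
}
```

With a 60-second window and a 30-second slide, a line arriving at second 40 is counted in two consecutive results (the windows ending at 60 and at 90), which is the overlap that `.window(...).every(...)` expresses.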

Table API

val orders = env.readCsvFile(…)
  .as('oId, 'oDate, 'shipPrio)
  .filter('shipPrio === 5)

val items = orders
  .join(lineitems).where('oId === 'id)
  .select('oId, 'oDate, 'shipPrio,
    'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)

val result = items
  .groupBy('oId, 'oDate, 'shipPrio)
  .select('oId, 'revenue.sum, 'oDate, 'shipPrio)

▪ Execute SQL-like expressions on table data
  • Tight integration with Java and Scala APIs
  • Available for batch and streaming programs
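
To show what the query computes, here is a rough equivalent over in-memory Scala collections: filter orders by shipping priority, join them with line items, compute the discounted revenue, and sum it per order. This is only a sketch of the semantics, not of how the Table API executes the query. The case classes, field types, and the name `revenuePerOrder` are assumptions; the slide does not give the schemas.

```scala
// Plain-Scala illustration of the filter / join / group-by / sum query.
// Schemas and names are hypothetical; nothing in this block calls Flink.
object TableQuerySketch {
  final case class Order(oId: Long, oDate: String, shipPrio: Int)
  final case class LineItem(id: Long, extdPrice: Double, discnt: Double)
  final case class Result(oId: Long, revenue: Double, oDate: String, shipPrio: Int)

  def revenuePerOrder(orders: Seq[Order], lineitems: Seq[LineItem]): Seq[Result] = {
    // filter('shipPrio === 5)
    val filtered = orders.filter(_.shipPrio == 5)

    // join(lineitems).where('oId === 'id), plus the revenue expression
    val items = for {
      o <- filtered
      l <- lineitems if l.id == o.oId
    } yield (o, l.extdPrice * (1.0 - l.discnt))

    // groupBy('oId, 'oDate, 'shipPrio) and 'revenue.sum
    items
      .groupBy { case (o, _) => (o.oId, o.oDate, o.shipPrio) }
      .map { case ((oId, oDate, shipPrio), rows) =>
        Result(oId, rows.map(_._2).sum, oDate, shipPrio)
      }
      .toSeq
      .sortBy(_.oId) // sorted only to make the output deterministic
  }
}
```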

A trip down memory lane


April 16, 2014

Stratosphere 0.5

[Figure: Stratosphere 0.5 stack: DataSet API (Java) and DataSet API (Scala) on top of the Stratosphere Optimizer and the Stratosphere Runtime; deployment: Local, Remote, Yarn]

Key new features
• New Java API
• Distributed cache
• Collection data sources and sinks
• JDBC data sources and sinks
• Hadoop I/O format
• Avro support

Flink 0.7

[Figure: Flink 0.7 stack: DataSet (Java/Scala) with the Flink Optimizer, DataStream (Java) with the Stream Builder, and Hadoop M/R, all on top of the Flink Runtime; deployment: Local, Remote, Yarn, Embedded]

Key new features
• Unification of Java and Scala APIs
• Logical keys/POJO support
• MR compatibility
• Collections backend
• Extended filesystem support

Flink 0.8

[Figure: Flink 0.8 stack: Flink Optimizer, Stream Builder, and Hadoop M/R on top of the Flink Runtime; deployment: Local, Remote, Yarn, Embedded]

Key new features
• Improved filesystem support
• DataStream Scala
• Streaming windows
• Lots of performance and stability improvements
• Kryo default serializer

Current master (0.9-Snapshot)

[Figure: current stack: Python, Gelly, Table, ML, SAMOA, Dataflow, and Hadoop M/R on top of the Batch Optimizer and the Stream Optimizer, which run on the New Flink Runtime; deployment: Local, Remote, Yarn, Tez, Embedded]

Key new features
• New runtime
• Tez mode
• Python API
• Gelly
• Flinq
• FlinkML
• Streaming FT

Flink community

[Figure: number of unique contributors by git commits (without manual de-duplication)]

Summary

▪ The project has a lot of momentum with major improvements every release
▪ Healthy community
▪ Project diversification
  • Real-time data streaming
  • Several frontends (targeting different user profiles and use cases)
  • Several backends (targeting different production settings)
▪ Integration with open source ecosystem

Vision for Flink


What are we building?

A "use-case complete" framework to unify batch & stream processing

[Figure: Flink at the center, connecting data sources to analytical workloads]

Data streams
• Kafka
• RabbitMQ
• ...

“Historic” data
• HDFS
• JDBC
• ...

Analytical workloads
• ETL
• Relational processing
• Graph analysis
• Machine learning
• Streaming data analysis

What are we building? (master)

An engine that puts equal emphasis on stream and batch processing

[Figure: Flink (master) fed by real-time data streams (event logs from Kafka, RabbitMQ, ...) for low-latency windowing, aggregations, ..., and by historic data (HDFS, JDBC, ...) for ETL, graphs, machine learning, relational, …]

Integrating batch with streaming

Why?
▪ Applications need to combine streaming and static data sources
▪ Making the switch from batch to streaming easy will be key to boost adoption
▪ Companies are making the transition from batch to streaming now

What is stream processing?

▪ Data stream: Infinite sequence of data arriving in a continuous fashion
▪ Stream processing: Analyzing and acting on real-time streaming data, using continuous queries
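
As a small illustration of these two definitions, the sketch below models an unbounded stream as a lazy Scala `Iterator` and a continuous query as a running sum that emits an updated result for every arriving element. The names `ContinuousQuerySketch`, `readings`, and `runningSum` are made up for this example, and a real stream processor would also handle distribution, state, and fault tolerance.

```scala
// Plain-Scala illustration of an infinite stream and a continuous query.
// Names are hypothetical; nothing in this block calls Flink.
object ContinuousQuerySketch {
  // An unbounded stream: 1, 2, 3, ... produced lazily, never materialized.
  def readings: Iterator[Int] = Iterator.from(1)

  // A continuous query: for each arriving element, emit the sum so far.
  // scanLeft on an Iterator is lazy, so this also works on infinite input.
  def runningSum(stream: Iterator[Int]): Iterator[Long] =
    stream.scanLeft(0L)(_ + _).drop(1) // drop the initial 0 seed
}
```

Because the result is itself a lazy iterator, a consumer can take as many updates as it needs, for example `ContinuousQuerySketch.runningSum(ContinuousQuerySketch.readings).take(4)`, without the query ever terminating on its own.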

Lambda architecture

▪ "Speed layer" can be a stream processing system
▪ "Picks up" after the batch layer

Kappa architecture

▪ The need for separate batch & speed layers is not fundamental; it is a practical limitation of current tech

▪ Idea: use a stream processing system for all data processing

▪ They are all dataflows anyway

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

Data streaming with Flink

▪ Flink is building a proper stream processing system
  • that can execute both batch and stream jobs natively
  • batch-only jobs pass through a different optimization code path
▪ Flink is building libraries and DSLs on top of both batch and streaming
  • e.g., see the recent Table API

Data streaming with Flink

▪ Low-latency stream processor

▪Expressive APIs in Scala/Java

▪Stateful operators and flexible windowing

▪Efficient fault tolerance for exactly-once guarantees


Summary

▪Flink is a general-purpose data analytics system

▪Unifies batch and stream processing

▪Expressive high-level APIs

▪Robust and fast execution engine

flink.apache.org

@ApacheFlink

Date post:	03-Aug-2015