Apache Flink: Past, Present and Future

Date post: 03-Aug-2015
Upload: gyula-fora
Transcript
Page 1: Apache Flink: Past, Present and Future

Apache Flink Past, present and future

Gyula Fóra

Page 2: Apache Flink: Past, Present and Future

What is Apache Flink

Distributed Data Flow Processing System

▪ Focused on large-scale data analytics

▪ Unified real-time stream and batch processing

▪ Easy and powerful APIs in Java / Scala (+ Python)

▪ Robust and fast execution backend

[Figure: example dataflow graph built from Source, Map, Filter, Join, Reduce, Iterate and Sink operators]

Page 3: Apache Flink: Past, Present and Future

What is Flink good at

It's a general-purpose data analytics system

▪ Real-time stream processing with flexible windowing

▪ Complex and heavy ETL jobs

▪ Analyzing huge graphs

▪ Machine learning on large data sets and streams

▪ …

Page 4: Apache Flink: Past, Present and Future

The Flink Stack

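[Figure: the Flink stack, listed from top to bottom]

Libraries and frontends: Python, Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow (shown twice in the diagram)

Core APIs: DataSet (Java/Scala), DataStream (Java/Scala)

Optimizers: Batch Optimizer, Streaming Optimizer

Flink Runtime

Deployment: Local, Remote, Yarn, Tez, Embedded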

Page 5: Apache Flink: Past, Present and Future

Word count in Flink

    case class Word(word: String, frequency: Int)

DataStream API (streaming):

    val lines: DataStream[String] = env.fromSocketStream(...)

    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .window(Time.of(1, MINUTES)).every(Time.of(30, SECONDS))
      .groupBy("word").sum("frequency")
      .print()

DataSet API (batch):

    val lines: DataSet[String] = env.readTextFile(...)

    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()
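To make the windowing in the streaming variant concrete, the following is a minimal sketch in plain Scala collections of what "a one-minute window, evaluated every 30 seconds" computes. It is not Flink code: `Event`, `slidingCounts`, the millisecond timestamps, and the choice to align window starts at 0 are all assumptions made for illustration, and Flink's actual window alignment may differ.

```scala
// Plain-Scala sketch of sliding-window word counts; no Flink dependency.
// All names here are hypothetical and only illustrate the window semantics.
object SlidingWordCount {
  final case class Event(timestampMs: Long, line: String)

  // Returns (windowStart, counts) pairs. Each window covers
  // [start, start + sizeMs) and a new window starts every slideMs.
  def slidingCounts(events: Seq[Event],
                    sizeMs: Long,
                    slideMs: Long): Seq[(Long, Map[String, Int])] =
    if (events.isEmpty) Seq.empty
    else {
      val last = events.map(_.timestampMs).max
      // Assumption: window starts are aligned at 0, slideMs, 2 * slideMs, ...
      (0L to last by slideMs).map { start =>
        val words = events
          .filter(e => e.timestampMs >= start && e.timestampMs < start + sizeMs)
          .flatMap(_.line.split(" "))
          .filter(_.nonEmpty)
        start -> words.groupBy(identity).map { case (w, ws) => w -> ws.size }
      }
    }
}
```

Because the slide (30 seconds) is shorter than the window (one minute), each event contributes to two consecutive windows, which is why the same word can be counted in more than one result.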

Page 6: Apache Flink: Past, Present and Future

Table API

    val orders = env.readCsvFile(…)
      .as('oId, 'oDate, 'shipPrio)
      .filter('shipPrio === 5)

    val items = orders
      .join(lineitems).where('oId === 'id)
      .select('oId, 'oDate, 'shipPrio,
        'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)

    val result = items
      .groupBy('oId, 'oDate, 'shipPrio)
      .select('oId, 'revenue.sum, 'oDate, 'shipPrio)

▪ Execute SQL-like expressions on table data
  • Tight integration with Java and Scala APIs
  • Available for batch and streaming programs
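For readers unfamiliar with the expression syntax, the same filter, join, projection and grouped sum can be sketched with plain Scala collections. This is only an illustration of the query's logic, not the Table API: the `Order` and `LineItem` case classes, their field types, and `revenueByOrder` are assumptions derived from the column names on the slide.

```scala
// Plain-Scala sketch of the relational query; no Flink dependency.
// Case classes and field types are assumed from the slide's column names.
object RevenueQuery {
  final case class Order(oId: Long, oDate: String, shipPrio: Int)
  final case class LineItem(id: Long, extdPrice: Double, discnt: Double)

  def revenueByOrder(orders: Seq[Order],
                     lineitems: Seq[LineItem]): Map[(Long, String, Int), Double] = {
    // Corresponds to .filter('shipPrio === 5)
    val filtered = orders.filter(_.shipPrio == 5)

    // Corresponds to .join(lineitems).where('oId === 'id) plus the revenue projection
    val items = for {
      o <- filtered
      l <- lineitems if o.oId == l.id
    } yield ((o.oId, o.oDate, o.shipPrio), l.extdPrice * (1.0 - l.discnt))

    // Corresponds to .groupBy('oId, 'oDate, 'shipPrio) with 'revenue.sum
    items.groupBy(_._1).map { case (key, rows) => key -> rows.map(_._2).sum }
  }
}
```

The nested loop in the `for` expression is the naive way to evaluate a join; an engine with an optimizer is free to pick a different join strategy for the same declarative query.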

Page 7: Apache Flink: Past, Present and Future

A trip down memory lane

Page 8: Apache Flink: Past, Present and Future

April 16, 2014

Page 9: Apache Flink: Past, Present and Future

Stratosphere 0.5

[Figure: the Stratosphere 0.5 stack, listed from top to bottom]

APIs: DataSet API (Java), DataSet API (Scala)

Stratosphere Optimizer

Stratosphere Runtime

Deployment: Local, Remote, Yarn

Key new features
• New Java API
• Distributed cache
• Collection data sources and sinks
• JDBC data sources and sinks
• Hadoop I/O format
• Avro support

Page 10: Apache Flink: Past, Present and Future

Flink 0.7

[Figure: the Flink 0.7 stack, listed from top to bottom]

APIs: DataSet (Java/Scala), DataStream (Java), Hadoop M/R

Flink Optimizer, Stream Builder

Flink Runtime

Deployment: Local, Remote, Yarn, Embedded

Key new features
• Unification of Java and Scala APIs
• Logical keys/POJO support
• MR compatibility
• Collections backend
• Extended filesystem support

Page 11: Apache Flink: Past, Present and Future

Flink 0.8

[Figure: the Flink 0.8 stack, listed from top to bottom]

APIs: DataSet (Java/Scala), DataStream (Java/Scala), Hadoop M/R

Flink Optimizer, Stream Builder

Flink Runtime

Deployment: Local, Remote, Yarn, Embedded

Key new features
• Improved filesystem support
• DataStream Scala
• Streaming windows
• Lots of performance and stability improvements
• Kryo default serializer

Page 12: Apache Flink: Past, Present and Future

Current master (0.9-Snapshot)

[Figure: the current Flink stack, listed from top to bottom]

Libraries and frontends: Python, Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow (shown twice in the diagram)

Core APIs: DataSet (Java/Scala), DataStream (Java/Scala)

Optimizers: Batch Optimizer, Stream Optimizer

New Flink Runtime

Deployment: Local, Remote, Yarn, Tez, Embedded

Key new features
• New runtime
• Tez mode
• Python API
• Gelly
• Flinq
• FlinkML
• Streaming FT

Page 13: Apache Flink: Past, Present and Future

Flink community

[Figure: number of unique contributors by git commits (without manual de-duplication)]

Page 14: Apache Flink: Past, Present and Future

Summary

▪ The project has a lot of momentum, with major improvements in every release

▪ Healthy community

▪ Project diversification
  • Real-time data streaming
  • Several frontends (targeting different user profiles and use cases)
  • Several backends (targeting different production settings)

▪ Integration with the open source ecosystem

Page 15: Apache Flink: Past, Present and Future

Vision for Flink

Page 16: Apache Flink: Past, Present and Future

What are we building?

A "use-case complete" framework to unify batch & stream processing

[Figure: Flink in the center, connecting the data sources below to the analytical workloads]

Data streams
• Kafka
• RabbitMQ
• ...

"Historic" data
• HDFS
• JDBC
• ...

Analytical workloads
• ETL
• Relational processing
• Graph analysis
• Machine learning
• Streaming data analysis

Page 17: Apache Flink: Past, Present and Future

What are we building?

An engine that puts equal emphasis on stream and batch processing

[Figure: Flink (master) in the center, with the following labels]

Real-time data streams: Kafka, RabbitMQ, ...

Event logs

Historic data: HDFS, JDBC, ...

Workloads: low-latency windowing, aggregations, ...; ETL, graphs, machine learning, relational, ...

Page 18: Apache Flink: Past, Present and Future

Integrating batch with streaming

Page 19: Apache Flink: Past, Present and Future

Why?

▪ Applications need to combine streaming and static data sources

▪ Making the switch from batch to streaming easy will be key to boosting adoption

▪ Companies are making the transition from batch to streaming now

Page 20: Apache Flink: Past, Present and Future

What is stream processing?

▪ Data stream: Infinite sequence of data arriving in a continuous fashion

▪ Stream processing: Analyzing and acting on real-time streaming data, using continuous queries

Page 21: Apache Flink: Past, Present and Future

Lambda architecture

▪ "Speed layer" can be a stream processing system

▪ "Picks up" after the batch layer
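As a rough illustration of what "picking up after the batch layer" means, the sketch below splits a word-count job at a cutoff timestamp: the batch view covers everything before the cutoff, the speed view covers the rest, and a query merges the two. This is a toy model in plain Scala, not a reference implementation of the Lambda architecture; `Event`, `cutoffMs`, and all method names are assumptions.

```scala
// Toy model of a Lambda-style query; all names are hypothetical.
object LambdaSketch {
  final case class Event(timestampMs: Long, word: String)

  private def count(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.word).map { case (w, es) => w -> es.size }

  // Batch layer: periodically recomputed over everything before the cutoff.
  def batchView(all: Seq[Event], cutoffMs: Long): Map[String, Int] =
    count(all.filter(_.timestampMs < cutoffMs))

  // Speed layer: covers only events the batch layer has not processed yet.
  def speedView(all: Seq[Event], cutoffMs: Long): Map[String, Int] =
    count(all.filter(_.timestampMs >= cutoffMs))

  // Answering a query means merging both views.
  def query(batch: Map[String, Int], speed: Map[String, Int]): Map[String, Int] =
    (batch.keySet ++ speed.keySet).map { w =>
      w -> (batch.getOrElse(w, 0) + speed.getOrElse(w, 0))
    }.toMap
}
```

Note that the counting logic has to exist in both layers and the results have to be merged at query time, which is the duplication the next slide questions.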

Page 22: Apache Flink: Past, Present and Future

Kappa architecture

▪ The need for separate batch and speed layers is not fundamental; it is a practical consequence of current technology

▪ Idea: use a stream processing system for all data processing

▪ They are all dataflows anyway

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
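The "they are all dataflows anyway" point can be sketched in a few lines of plain Scala: one operator definition is applied unchanged to a bounded input (a batch) and to an unbounded one (a stream). This is an illustration only and says nothing about how Flink implements it; `runningCounts`, `batchCounts` and `endless` are hypothetical helpers.

```scala
// One dataflow definition, two kinds of input; no Flink dependency.
object OneDataflow {
  // Emits the updated (word, count) pair after every incoming word.
  // Iterator.scanLeft and collect are lazy, so this also works on endless input.
  def runningCounts(words: Iterator[String]): Iterator[(String, Int)] =
    words
      .scanLeft((Map.empty[String, Int], Option.empty[(String, Int)])) {
        case ((counts, _), w) =>
          val c = counts.getOrElse(w, 0) + 1
          (counts.updated(w, c), Some(w -> c))
      }
      .collect { case (_, Some(update)) => update }

  // Bounded input: the final count per word is the last update emitted for it.
  def batchCounts(words: Seq[String]): Map[String, Int] =
    runningCounts(words.iterator).toMap

  // Unbounded input: an endless source that repeats "a", "b", "a".
  def endless: Iterator[String] =
    Iterator.continually(Seq("a", "b", "a")).flatten
}
```

With a bounded input the caller simply waits for the last update; with `endless` the caller consumes updates incrementally, for example `runningCounts(endless).take(4)`.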

Page 23: Apache Flink: Past, Present and Future

Data streaming with Flink

▪ Flink is building a proper stream processing system
  • that can execute both batch and stream jobs natively
  • batch-only jobs pass through a different optimization code path

▪ Flink is building libraries and DSLs on top of both batch and streaming
  • e.g., see the recent Table API

Page 24: Apache Flink: Past, Present and Future

Data streaming with Flink

▪ Low-latency stream processor

▪ Expressive APIs in Scala/Java

▪ Stateful operators and flexible windowing

▪ Efficient fault tolerance for exactly-once guarantees

Page 25: Apache Flink: Past, Present and Future

Summary

▪ Flink is a general-purpose data analytics system

▪ Unifies batch and stream processing

▪ Expressive high-level APIs

▪ Robust and fast execution engine

Page 26: Apache Flink: Past, Present and Future
Page 27: Apache Flink: Past, Present and Future

flink.apache.org

@ApacheFlink