Date post: | 03-Aug-2015 |
Category: | Engineering |
Upload: | gyula-fora |
What is Apache Flink
Distributed Data Flow Processing System
▪Focused on large-scale data analytics
▪Unified real-time stream and batch processing
▪Easy and powerful APIs in Java / Scala (+ Python)
▪Robust and fast execution backend
[Figure: example dataflow graph built from Source, Map, Filter, Join, Reduce, Iterate, and Sink operators]
What is Flink good at
It's a general-purpose data analytics system
▪Real-time stream processing with flexible windowing
▪Complex and heavy ETL jobs
▪Analyzing huge graphs
▪Machine learning on large data sets and streams
▪…
The Flink Stack
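[Figure: the Flink stack. Libraries and APIs on top: Python, Gelly, Table, ML, SAMOA, Dataflow (shown twice), Hadoop M/R. Core APIs: DataSet (Java/Scala) with the Batch Optimizer, DataStream (Java/Scala) with the Streaming Optimizer. Execution: Flink Runtime. Deployment: Local, Remote, Yarn, Tez, Embedded]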
Word count in Flink
case class Word (word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
  .window(Time.of(1,MINUTES)).every(Time.of(30,SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()
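To make the semantics of the two word-count programs concrete, here is a minimal sketch in plain Scala collections with no Flink dependency. The `Event` type, the millisecond timestamps, and the helpers `countWords` and `slidingCounts` are assumptions made for illustration. The window logic is a simplified stand-in for "a 1-minute window evaluated every 30 seconds" with windows aligned at time 0; it is not how Flink implements windowing.

```scala
// Plain-Scala sketch (no Flink dependency); all names below are hypothetical.
case class Word(word: String, frequency: Int)
case class Event(timestampMs: Long, line: String)

object WordCountSketch {
  // Batch semantics: split lines into words, group by word, sum the frequencies.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").filter(_.nonEmpty).map(w => Word(w, 1)))
      .groupBy(_.word)
      .map { case (w, ws) => w -> ws.map(_.frequency).sum }

  // Streaming semantics (simplified): windows of length `sizeMs` that start
  // every `slideMs`, beginning at time 0. Each window reuses the batch logic.
  def slidingCounts(events: Seq[Event], sizeMs: Long, slideMs: Long): Seq[(Long, Map[String, Int])] =
    if (events.isEmpty) Seq.empty
    else {
      val last = events.map(_.timestampMs).max
      (0L to last by slideMs).map { start =>
        val inWindow = events.filter(e => e.timestampMs >= start && e.timestampMs < start + sizeMs)
        start -> countWords(inWindow.map(_.line))
      }
    }
}
```

Calling `WordCountSketch.slidingCounts(events, 60000L, 30000L)` mirrors the `Time.of(1,MINUTES)` / `Time.of(30,SECONDS)` settings of the streaming program: because the windows overlap, each event contributes to two consecutive results.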
Table API
val orders = env.readCsvFile(…)
  .as('oId, 'oDate, 'shipPrio)
  .filter('shipPrio === 5)
val items = orders
  .join(lineitems).where('oId === 'id)
  .select('oId, 'oDate, 'shipPrio, 'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)
val result = items
  .groupBy('oId, 'oDate, 'shipPrio)
  .select('oId, 'revenue.sum, 'oDate, 'shipPrio)
▪ Execute SQL-like expressions on table data
• Tight integration with Java and Scala APIs
• Available for batch and streaming programs
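The Table API query reads like SQL: a filter, a join, then a grouped aggregate. As a rough illustration of what it computes, the following sketch runs the same three steps over plain Scala collections. The case classes `Order`, `LineItem`, and `Result`, their field types, and the helper `revenuePerOrder` are assumptions; the slide does not show the schema of `lineitems`.

```scala
// Plain-Scala sketch of the query's semantics; types and names are hypothetical.
case class Order(oId: Long, oDate: String, shipPrio: Int)
case class LineItem(id: Long, extdPrice: Double, discnt: Double)
case class Result(oId: Long, revenue: Double, oDate: String, shipPrio: Int)

object RevenueSketch {
  def revenuePerOrder(orders: Seq[Order], lineitems: Seq[LineItem]): Seq[Result] = {
    // Corresponds to .filter('shipPrio === 5)
    val filtered = orders.filter(_.shipPrio == 5)
    // Corresponds to .join(lineitems).where('oId === 'id) plus the revenue expression
    val items = for {
      o <- filtered
      l <- lineitems if l.id == o.oId
    } yield (o, l.extdPrice * (1.0 - l.discnt))
    // Corresponds to .groupBy('oId, 'oDate, 'shipPrio) with 'revenue.sum
    items
      .groupBy { case (o, _) => (o.oId, o.oDate, o.shipPrio) }
      .map { case ((oId, oDate, prio), rows) => Result(oId, rows.map(_._2).sum, oDate, prio) }
      .toSeq
      .sortBy(_.oId) // sorted only to make the sketch's output deterministic
  }
}
```

The nested-loop join is quadratic and only suitable for illustration; a real engine would pick a join strategy in its optimizer.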
A trip down memory lane
April 16, 2014
Stratosphere 0.5
[Figure: Stratosphere 0.5 stack. APIs: DataSet API (Java), DataSet API (Scala). Stratosphere Optimizer on top of the Stratosphere Runtime. Deployment: Local, Remote, Yarn]
Key new features
• New Java API
• Distributed cache
• Collection data sources and sinks
• JDBC data sources and sinks
• Hadoop I/O format
• Avro support
Flink 0.7
[Figure: Flink 0.7 stack. APIs: DataSet (Java/Scala), DataStream (Java), Hadoop M/R. Flink Optimizer and Stream Builder on top of the Flink Runtime. Deployment: Local, Remote, Yarn, Embedded]
Key new features
• Unification of Java and Scala APIs
• Logical keys/POJO support
• MR compatibility
• Collections backend
• Extended filesystem support
Flink 0.8
[Figure: Flink 0.8 stack. APIs: DataSet (Java/Scala), DataStream (Java/Scala), Hadoop M/R. Flink Optimizer and Stream Builder on top of the Flink Runtime. Deployment: Local, Remote, Yarn, Embedded]
Key new features
• Improved filesystem support
• DataStream Scala
• Streaming windows
• Lots of performance and stability improvements
• Kryo default serializer
Current master (0.9-Snapshot)
[Figure: current master stack. Libraries and APIs on top: Python, Gelly, Table, ML, SAMOA, Dataflow (shown twice), Hadoop M/R. Core APIs: DataSet (Java/Scala) with the Batch Optimizer, DataStream (Java/Scala) with the Stream Optimizer. Execution: New Flink Runtime. Deployment: Local, Remote, Yarn, Tez, Embedded]
Key new features
• New runtime
• Tez mode
• Python API
• Gelly
• Flinq
• FlinkML
• Streaming FT
Flink community
[Figure: number of unique contributors by git commits (without manual de-duplication)]
Summary
▪ The project has a lot of momentum with major improvements every release
▪ Healthy community
▪ Project diversification
• Real-time data streaming
• Several frontends (targeting different user profiles and use cases)
• Several backends (targeting different production settings)
▪ Integration with open source ecosystem
Vision for Flink
What are we building?
A "use-case complete" framework to unify batch & stream processing
[Figure: Flink at the center. Inputs: data streams (Kafka, RabbitMQ, ...) and "historic" data (HDFS, JDBC, ...). Analytical workloads: ETL, relational processing, graph analysis, machine learning, streaming data analysis]
What are we building? (master)
An engine that places equal emphasis on stream and batch processing
[Figure: Flink between real-time data streams (event logs; Kafka, RabbitMQ, ...) handled with low-latency windowing, aggregations, ... and historic data (HDFS, JDBC, ...) handled with ETL, graphs, machine learning, relational, ...]
Integrating batch with streaming
Why?
▪ Applications need to combine streaming and static data sources
▪ Making the switch from batch to streaming easy will be key to boost adoption
▪ Companies are making the transition from batch to streaming now
What is stream processing?
▪ Data stream: Infinite sequence of data arriving in a continuous fashion
▪ Stream processing: Analyzing and acting on real-time streaming data, using continuous queries
Lambda architecture
▪ "Speed layer" can be a stream processing system
▪ "Picks up" after the batch layer
Kappa architecture
▪ The need for separate batch & speed layers is not fundamental; it is only practical with current tech
▪ Idea: use a stream processing system for all data processing
▪ They are all dataflows anyway
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
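The point that batch and streaming jobs are both dataflows can be illustrated with a small sketch: one incremental operator, written once, consumes either a live iterator or a bounded collection. This is plain Scala with hypothetical names (`runningCounts`, `batchCounts`), not Flink code, and it ignores distribution and fault tolerance.

```scala
// Plain-Scala sketch; names are hypothetical and nothing here is Flink API.
object KappaSketch {
  // One incremental operator: after each record, emit the updated counts.
  def runningCounts(records: Iterator[String]): Iterator[Map[String, Int]] =
    records
      .scanLeft(Map.empty[String, Int]) { (counts, key) =>
        counts.updated(key, counts.getOrElse(key, 0) + 1)
      }
      .drop(1) // skip the initial empty state emitted by scanLeft

  // "Batch" is the same operator over a bounded input, keeping only the final state.
  def batchCounts(records: Seq[String]): Map[String, Int] =
    runningCounts(records.iterator)
      .foldLeft(Map.empty[String, Int])((_, latest) => latest)
}
```

Because `runningCounts` is lazy, it can be driven by an unbounded source as well; the batch answer is simply the last value the streaming operator would have emitted for a finite input.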
Data streaming with Flink
▪ Flink is building a proper stream processing system
• that can execute both batch and stream jobs natively
• batch-only jobs pass through a different optimization code path
▪ Flink is building libraries and DSLs on top of both batch and streaming
• e.g., see recent Table API
Data streaming with Flink
▪ Low-latency stream processor
▪Expressive APIs in Scala/Java
▪Stateful operators and flexible windowing
▪Efficient fault tolerance for exactly-once guarantees
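To illustrate why checkpointed operator state can yield exactly-once results, here is a toy single-process sketch: the operator snapshots its state together with the input offset, and after a failure it restores the snapshot and replays the input from that offset. All names (`Checkpoint`, `CountingOperator`, `ExactlyOnceSketch`) are hypothetical, the input is assumed to be a replayable log, and this is a simplification for illustration rather than a description of Flink's actual fault-tolerance mechanism.

```scala
// Toy sketch; hypothetical names, single process, replayable input assumed.
final case class Checkpoint(offset: Int, counts: Map[String, Int])

final class CountingOperator {
  private var counts = Map.empty[String, Int]
  private var offset = 0 // number of records already reflected in `counts`

  def process(record: String): Unit = {
    counts = counts.updated(record, counts.getOrElse(record, 0) + 1)
    offset += 1
  }
  def snapshot(): Checkpoint = Checkpoint(offset, counts)
  def restore(cp: Checkpoint): Unit = { counts = cp.counts; offset = cp.offset }
  def position: Int = offset
  def state: Map[String, Int] = counts
}

object ExactlyOnceSketch {
  // Processes `log`, checkpointing every `interval` records (interval > 0).
  // One simulated failure occurs when `failAt` records have been processed:
  // state since the last checkpoint is discarded and those records are replayed.
  def run(log: Vector[String], interval: Int, failAt: Int): Map[String, Int] = {
    val op = new CountingOperator
    var last = op.snapshot()
    var failed = false
    while (op.position < log.length) {
      if (!failed && op.position == failAt) {
        failed = true
        op.restore(last) // roll back to the last consistent snapshot
      } else {
        op.process(log(op.position))
        if (op.position % interval == 0) last = op.snapshot()
      }
    }
    op.state
  }
}
```

Since state and offset are saved and restored together, each record affects the final counts exactly once, whether or not the failure happens.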
Summary
▪Flink is a general-purpose data analytics system
▪Unifies batch and stream processing
▪Expressive high-level APIs
▪Robust and fast execution engine
flink.apache.org
@ApacheFlink