Introduction to Flink Streaming
Framework for modern streaming applications
https://github.com/phatak-dev/flink-examples
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Stream abstraction vs streaming applications
● Stream as an abstraction
● Challenges with modern streaming applications
● Why not Spark streaming?
● Introduction to Flink
● Introduction to Flink streaming
● Flink Streaming API
● References
Use of streams in applications
● Streams are used both in big data and outside big data to support two major use cases
  ○ Stream as an abstraction layer
  ○ Stream as unbounded data to support real time analysis
● Abstraction and real time analysis have different needs and expectations from streams
● Different platforms use streams with different meanings
Stream as the abstraction
● A stream is a sequence of data elements made available over time
● A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches
● Streams can be unbounded (message queues) or bounded (files)
● Streams are becoming the new abstraction to build data pipelines
Streams as abstraction outside big data
● Streams have been used as an abstraction outside big data for the last few years
● Some of them are
  ○ Reactive streams like akka-streams, akka-http
  ○ Java 8 streams
  ○ RxJava etc.
● These uses of streams do not care about real time analysis
Streams for real time analysis
● In this use case, a stream is viewed as unbounded data which has low latency and is available as soon as it arrives in the system
● The stream can be processed using a non-stream abstraction at runtime
● So the focus in these scenarios is only to model the API around streams, not the implementation
● Ex: Spark streaming
Stream abstraction in big data
● Stream is the new abstraction layer people are exploring in big data
● With the right implementation, streams can support both streaming and batch applications much more effectively than existing abstractions
● Batch on streaming is a new way of looking at processing, rather than treating streaming as a special case of batch
● Batch can be faster on streaming than on dedicated batch processing
Frameworks with stream as abstraction
Apache Flink
● Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
● Flink provides
  ○ DataSet API - for bounded streams
  ○ DataStream API - for unbounded streams
● Flink embraces the stream as the abstraction to implement its dataflow
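To make the two APIs concrete, here is a minimal sketch of creating a bounded DataSet and an unbounded DataStream; the file path, host and port are hypothetical:

```scala
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object BoundedVsUnbounded {
  def main(args: Array[String]): Unit = {
    // DataSet API: a bounded stream, e.g. the lines of a file
    val batchEnv = ExecutionEnvironment.getExecutionEnvironment
    val fileLines: DataSet[String] = batchEnv.readTextFile("file:///tmp/input.txt")
    fileLines.first(5).print() // print() triggers execution of the batch dataflow

    // DataStream API: an unbounded stream, e.g. text arriving on a socket
    val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
    val socketLines: DataStream[String] = streamEnv.socketTextStream("localhost", 9000)
    socketLines.print()
    streamEnv.execute("Unbounded stream example") // streaming jobs need an explicit execute()
  }
}
```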
Flink stack
Flink history
● Started as the Stratosphere project at Technical University of Berlin in 2009
● Entered the Apache incubator in March 2014
● Became a top level project in December 2014
● Started as a stream engine for batch processing
● Started to support streaming a few versions ago
● data Artisans is a company founded by the core Flink team
Flink streaming
● Flink Streaming is an extension of the core Flink API for high-throughput, low-latency data stream processing
● Supports many data sources like Flume, Twitter, ZeroMQ and also any user defined data source
● Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API
● Sounds much like what Spark streaming promises!
Streaming is not fast batch processing
● Most streaming frameworks focus too much on latency when they develop streaming extensions
● Both Storm and Spark streaming view streaming as a low latency batch processing system
● Though latency plays an important role in real time applications, the needs and challenges go beyond it
● Addressing the complex needs of modern streaming systems needs a fresh view on streaming APIs
Streaming in Lambda architecture
● In the Lambda architecture, streaming is viewed as a limited, approximate, low latency computing system compared to a batch system
● So we usually run a streaming system to get low latency approximate results and run a batch system to get high latency but accurate results
● All these limitations of streaming stem from conventional thinking and implementations
● The new idea is: why not make streaming a low latency, accurate system itself?
Google Dataflow
● Google articulated the first modern streaming framework, for low latency, exactly-once, accurate stream applications, in their Dataflow paper
● It talks about a single system which can replace the need for separate streaming and batch processing systems
● This is known as the Kappa architecture
● Modern stream frameworks embrace this over the Lambda architecture
● Google Dataflow is open sourced under the name Apache Beam
Google Dataflow and Flink streaming
● Flink adopted the Dataflow ideas for its streaming API
● The Flink streaming API went through a big overhaul in the 1.0 version to embrace these ideas
● It was relatively easy to adopt the ideas as both Google Dataflow and Flink use streaming as the abstraction
● Spark 2.0 may add some of these ideas in its structured stream processing effort
Needs of modern real time applications
● Ability to handle out of order events in unbounded data
● Ability to correlate events with different dimensions of time
● Ability to correlate events using custom application based characteristics like a session
● Ability to do both micro batch and event at a time processing on the same framework
● Support for complex stream processing libraries
Mandatory wordcount
● Streams are represented using DataStream in Flink streaming
● DataStream supports both RDD and Dataset like APIs for manipulation
● In this example,
  ○ Read from a socket to create a DataStream
  ○ Use map, keyBy and sum operations for aggregation
● com.madhukaraphatak.flink.streaming.examples.StreamingWordCount
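A minimal sketch of such a word count, assuming a text server is running on localhost:9000 (for example `nc -lk 9000`); the class in the repository may differ in details:

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // DataStream created from a socket source
    val lines = env.socketTextStream("localhost", 9000)

    val counts = lines
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0) // group the unbounded stream by word
      .sum(1)   // running count per word, updated for every incoming element

    counts.print()
    env.execute("Streaming WordCount")
  }
}
```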
Flink streaming vs Spark streaming

Spark Streaming                         | Flink Streaming
Streams are represented using DStreams  | Streams are represented using DataStreams
Stream is discretized into mini batches | Stream is not discretized
Supports an RDD DSL                     | Supports a Dataset-like DSL
Stateless by default                    | Stateful by default at the operator level
Runs a mini batch for each interval     | Runs pipelined operators for each event that comes in
Near real time                          | Real time
Discretizing the stream
● Flink by default doesn't need any discretization of the stream to work
● But using the window API, we can create a discretized stream similar to Spark
● This time the state will be discarded as and when the batch is computed
● This way you can mimic Spark micro batches in Flink
● com.madhukaraphatak.flink.streaming.examples.WindowedStreamingWordCount
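A sketch of the same word count, discretized into 5 second windows (the window size here is an arbitrary choice, not the repository's):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowedWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val counts = env.socketTextStream("localhost", 9000)
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5)) // behaves like a 5 second micro batch
      .sum(1)                      // per-word state is discarded when the window fires

    counts.print()
    env.execute("Windowed Streaming WordCount")
  }
}
```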
Understanding the dataflow of Flink
● All programs in Flink, both batch and streaming, are represented using a dataflow
● This dataflow signifies the stream abstraction provided by the Flink runtime
● This dataflow treats all data as streams and processes them using a long running operator model
● This is quite different from the RDD model of Spark
● The Flink UI allows us to understand the dataflow of a given Flink program
Running in local mode
● bin/start-local.sh
● bin/flink run -c com.madhukaraphatak.flink.streaming.examples.StreamingWordCount /home/madhu/Dev/mybuild/flink-examples/target/scala-2.10/flink-examples_2.10-1.0.jar
Dataflow for wordcount example
Operator fusing
● The Flink optimiser fuses operators for efficiency
● All the fused operators run in the same thread, which saves the serialization and deserialization cost between the operators
● For all fused operators, Flink generates a nested function which comprises all the code from the operators
● This is much more efficient than RDD optimization
● Spark's Dataset API is planning to support this functionality
● You can disable this with env.disableOperatorChaining()
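A small fragment showing how fusing can be turned off, for example to inspect individual operators in the UI:

```scala
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// disable operator fusing/chaining: every operator now runs as its own task,
// at the cost of extra serialization between operators
env.disableOperatorChaining()
```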
Dataflow without operator fusing
Flink streaming vs Spark streaming

Spark Streaming                                | Flink Streaming
Uses the RDD distribution model for processing | Uses a pipelined stream processing paradigm for processing
Parallelism is done at the batch level         | Parallelism is controlled at the operator level
Uses RDD immutability for fault recovery       | Uses asynchronous barriers for fault recovery
RDD level optimization for stream optimization | Operator fusing for stream optimization
Window API
● Powerful API to track and do custom state analysis
● Types of windows
  ○ Time window
    ■ Tumbling window
    ■ Sliding window
  ○ Non time based window
    ■ Count window
● Ex: WindowExample.scala
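Sketches of each window type, assuming `wordCounts` is a DataStream[(String, Int)] like the one built in the word count example (the name is an assumption):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val keyed = wordCounts.keyBy(0)

// Tumbling time window: one result per key every 10 seconds
keyed.timeWindow(Time.seconds(10)).sum(1)

// Sliding time window: a 10 second window evaluated every 5 seconds
keyed.timeWindow(Time.seconds(10), Time.seconds(5)).sum(1)

// Count window (non time based): one result per key for every 100 elements
keyed.countWindow(100).sum(1)
```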
Anatomy of the Window API
● The window API is made up of 3 different components
● The three components of a window are
  ○ Window assigner
  ○ Trigger
  ○ Evictor
● These three components make up all of the window API in Flink
Window Assigner
● A function which determines, for a given element, which window it should belong to
● Responsible for the creation of windows and for assigning elements to a window
● Two types of window assigners
  ○ Time based window assigner
  ○ GlobalWindow assigner
● Users can write their own custom window assigner too
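For illustration, the two kinds of assigners can be plugged in explicitly on a keyed stream; `keyedStream` is assumed to be a keyed word count stream as above, and the exact assigner class names may vary slightly across Flink versions:

```scala
import org.apache.flink.streaming.api.windowing.assigners.{GlobalWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

// Time based assigner: each element lands in a fixed 10 second processing time window
keyedStream.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))

// GlobalWindow assigner: all elements of a key go into one single window,
// which only fires when an explicit trigger says so
keyedStream.window(GlobalWindows.create())
```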
Trigger
● A trigger is a function responsible for determining when a given window is evaluated
● In a time based window, this function waits till the time is up before triggering
● But in a non time based window, it can use custom logic to determine when to evaluate a given window
● In our example, the number of records in a given window is used to determine whether to trigger or not
● WindowAnatomy.scala
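A sketch combining the GlobalWindow assigner with a count based trigger, which fires and purges the window after every 5 elements per key; `keyedStream` is again an assumed keyed (word, count) stream:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.{CountTrigger, PurgingTrigger}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

// The global window by itself never fires; the trigger decides when to evaluate it,
// and the purging wrapper discards the window contents after each evaluation
val everyFiveElements = keyedStream
  .window(GlobalWindows.create())
  .trigger(PurgingTrigger.of(CountTrigger.of[GlobalWindow](5)))
  .sum(1)
```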
Building a custom session window
● We want to track the session of a user
● Each session is identified using a sessionID
● We will get an event when the session is started
● Evaluate the session when we get the end of session event
● For this, we want to implement our own custom window trigger which tracks the end of the session
● Ex: SessionWindowExample.scala
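A hypothetical sketch of such a trigger: elements are assumed to be (sessionId, eventType) pairs and "end" marks the end of a session; the repository's SessionWindowExample.scala may look different, and the exact Trigger method set varies a little between Flink versions. It would be attached with keyedStream.window(GlobalWindows.create()).trigger(new SessionEndTrigger).

```scala
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

class SessionEndTrigger extends Trigger[(String, String), GlobalWindow] {

  // Evaluate and discard the window as soon as the end-of-session event arrives
  override def onElement(element: (String, String), timestamp: Long,
                         window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult =
    if (element._2 == "end") TriggerResult.FIRE_AND_PURGE else TriggerResult.CONTINUE

  // Time plays no role in this trigger
  override def onProcessingTime(time: Long, window: GlobalWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def onEventTime(time: Long, window: GlobalWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  // No state to clean up here; newer Flink versions declare this method as abstract
  def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = ()
}
```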
Concept of time in Flink streaming
● Time plays an important role in a streaming application
● So the ability to express time in a flexible way is a very important feature of a modern streaming application
● Flink supports three kinds of time
  ○ Processing time
  ○ Event time
  ○ Ingestion time
● Event time is one of the important features of Flink which complements the custom window API
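A fragment showing how the time characteristic is chosen on the execution environment (processing time is the default):

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// interpret time in window operators as event time instead of the default
// processing time; TimeCharacteristic.IngestionTime is the third option
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
```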
Understanding event time
● Time in Flink needs to address the following two questions
  ○ When did the event occur?
  ○ How much time has passed since the event occurred?
● The first question is answered by assigning timestamps
● The second question is answered by understanding the concept of watermarks
● Ex: EventTimeExample.scala
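A fragment sketching the first half, assuming a hypothetical DataStream[Event] named `events` whose elements carry their own timestamp; with ascending timestamps Flink can also derive the watermarks automatically:

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical event type carrying its own timestamp (in milliseconds)
case class Event(id: String, timestamp: Long)

// Answers "when did the event occur?" by extracting a timestamp per element;
// for monotonically increasing timestamps Flink generates the watermarks itself
val withTimestamps: DataStream[Event] = events.assignAscendingTimestamps(_.timestamp)
```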
Watermarks in Event Time
● A watermark is a special signal which signifies the flow of time in Flink
● In the above diagram, w(20) signifies that 20 units of time have passed at the source
● Watermarks allow Flink to support different time abstractions
References
● http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● http://blog.madhukaraphatak.com/categories/flink-streaming/
● https://www.youtube.com/watch?v=y7f6wksGM6c
● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
● https://www.youtube.com/watch?v=v_exWHj1vmo
● http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink