Introduction to Flink Streaming
Framework for modern streaming applications
https://github.com/phatak-dev/flink-examples
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Stream abstraction vs streaming applications
● Stream as an abstraction
● Challenges with modern streaming applications
● Why not Spark streaming?
● Introduction to Flink
● Introduction to Flink streaming
● Flink Streaming API
● References
Use of streams in applications
● Streams are used both in big data and outside big data to support two major use cases
  ○ Stream as an abstraction layer
  ○ Stream as unbounded data to support real time analysis
● Abstraction and real time analysis have different needs and expectations from streams
● Different platforms use streams with different meanings
Stream as the abstraction
● A stream is a sequence of data elements made available over time
● A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches
● Streams can be unbounded (message queues) or bounded (files)
● Streams are becoming the new abstraction to build data pipelines
Streams as abstraction outside big data
● Streams have been used as an abstraction outside big data for the last few years
● Some of them are
  ○ Reactive streams like akka-streams, akka-http
  ○ Java 8 streams
  ○ RxJava etc.
● These uses of streams do not care about real time analysis
Streams for real time analysis
● In this use case, a stream is viewed as unbounded data which has low latency and is available as soon as it arrives in the system
● The stream can be processed using a non-stream abstraction at runtime
● So the focus in these scenarios is only to model the API around streams, not the implementation
● Ex: Spark streaming
Stream abstraction in big data
● Stream is the new abstraction layer people are exploring in big data
● With the right implementation, streams can support both streaming and batch applications much more effectively than existing abstractions
● Batch on streaming is a new way of looking at processing, rather than treating streaming as a special case of batch
● Batch can be faster on streaming than on dedicated batch processing
Frameworks with stream as abstraction
Apache Flink
● Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
● Flink provides
  ○ DataSet API - for bounded streams
  ○ DataStream API - for unbounded streams
● Flink embraces the stream as the abstraction to implement its dataflow
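To make the two APIs concrete, here is a minimal sketch of creating a bounded DataSet and an unbounded DataStream; the file path, host and port are hypothetical:

```scala
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object BoundedVsUnbounded {
  def main(args: Array[String]): Unit = {
    // DataSet API: a bounded stream, e.g. the lines of a file
    val batchEnv = ExecutionEnvironment.getExecutionEnvironment
    val fileLines: DataSet[String] = batchEnv.readTextFile("file:///tmp/input.txt")
    fileLines.first(5).print() // print() triggers execution of the batch dataflow

    // DataStream API: an unbounded stream, e.g. text arriving on a socket
    val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
    val socketLines: DataStream[String] = streamEnv.socketTextStream("localhost", 9000)
    socketLines.print()
    streamEnv.execute("Unbounded stream example") // streaming jobs need an explicit execute()
  }
}
```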
Flink stack
Flink history
● Started as the Stratosphere project at Technical University of Berlin in 2009
● Entered the Apache incubator in March 2014
● Became a top level project in December 2014
● Started as a stream engine for batch processing
● Started to support streaming a few versions ago
● data Artisans is a company founded by the core Flink team
Flink streaming
● Flink Streaming is an extension of the core Flink API for high-throughput, low-latency data stream processing
● Supports many data sources like Flume, Twitter, ZeroMQ and also any user defined data source
● Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API
● Sounds much like what Spark streaming promises!
Streaming is not fast batch processing
● Most streaming frameworks focus too much on latency when they develop streaming extensions
● Both Storm and Spark streaming view streaming as a low latency batch processing system
● Though latency plays an important role in real time applications, the needs and challenges go beyond it
● Addressing the complex needs of modern streaming systems needs a fresh view on streaming APIs
Streaming in Lambda architecture
● In the Lambda architecture, streaming is viewed as a limited, approximate, low latency computing system compared to a batch system
● So we usually run a streaming system to get low latency approximate results and run a batch system to get high latency but accurate results
● All these limitations of streaming stem from conventional thinking and implementations
● The new idea is: why not make streaming a low latency, accurate system itself?
Google Dataflow
● Google articulated the first modern streaming framework, for low latency, exactly-once, accurate stream applications, in their Dataflow paper
● It talks about a single system which can replace the need for separate streaming and batch processing systems
● This is known as the Kappa architecture
● Modern stream frameworks embrace this over the Lambda architecture
● Google Dataflow is open sourced under the name Apache Beam
Google Dataflow and Flink streaming
● Flink adopted the Dataflow ideas for its streaming API
● The Flink streaming API went through a big overhaul in the 1.0 version to embrace these ideas
● It was relatively easy to adopt the ideas as both Google Dataflow and Flink use streaming as the abstraction
● Spark 2.0 may add some of these ideas in its structured stream processing effort
Needs of modern real time applications
● Ability to handle out of order events in unbounded data
● Ability to correlate events with different dimensions of time
● Ability to correlate events using custom application based characteristics like a session
● Ability to do both micro batch and event at a time processing on the same framework
● Support for complex stream processing libraries
Mandatory wordcount
● Streams are represented using DataStream in Flink streaming
● DataStream supports both RDD and Dataset like APIs for manipulation
● In this example,
  ○ Read from a socket to create a DataStream
  ○ Use map, keyBy and sum operations for aggregation
● com.madhukaraphatak.flink.streaming.examples.StreamingWordCount
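A minimal sketch of such a word count, assuming a text server is running on localhost:9000 (for example `nc -lk 9000`); the class in the repository may differ in details:

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // DataStream created from a socket source
    val lines = env.socketTextStream("localhost", 9000)

    val counts = lines
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0) // group the unbounded stream by word
      .sum(1)   // running count per word, updated for every incoming element

    counts.print()
    env.execute("Streaming WordCount")
  }
}
```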
Flink streaming vs Spark streaming

Spark Streaming                         | Flink Streaming
Streams are represented using DStreams  | Streams are represented using DataStreams
Stream is discretized into mini batches | Stream is not discretized
Supports an RDD DSL                     | Supports a Dataset-like DSL
Stateless by default                    | Stateful by default at the operator level
Runs a mini batch for each interval     | Runs pipelined operators for each event that comes in
Near real time                          | Real time
Discretizing the stream
● Flink by default doesn't need any discretization of the stream to work
● But using the window API, we can create a discretized stream similar to Spark
● This time the state will be discarded as and when the batch is computed
● This way you can mimic Spark micro batches in Flink
● com.madhukaraphatak.flink.streaming.examples.WindowedStreamingWordCount
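A sketch of the same word count, discretized into 5 second windows (the window size here is an arbitrary choice, not the repository's):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowedWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val counts = env.socketTextStream("localhost", 9000)
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5)) // behaves like a 5 second micro batch
      .sum(1)                      // per-word state is discarded when the window fires

    counts.print()
    env.execute("Windowed Streaming WordCount")
  }
}
```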
Understanding the dataflow of Flink
● All programs in Flink, both batch and streaming, are represented using a dataflow
● This dataflow signifies the stream abstraction provided by the Flink runtime
● This dataflow treats all data as streams and processes them using a long running operator model
● This is quite different from the RDD model of Spark
● The Flink UI allows us to understand the dataflow of a given Flink program
Running in local mode
● bin/start-local.sh
● bin/flink run -c com.madhukaraphatak.flink.streaming.examples.StreamingWordCount /home/madhu/Dev/mybuild/flink-examples/target/scala-2.10/flink-examples_2.10-1.0.jar
Dataflow for wordcount example
Operator fusing
● The Flink optimiser fuses operators for efficiency
● All the fused operators run in the same thread, which saves the serialization and deserialization cost between the operators
● For all fused operators, Flink generates a nested function which comprises all the code from the operators
● This is much more efficient than RDD optimization
● Spark's Dataset API is planning to support this functionality
● You can disable this with env.disableOperatorChaining()
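A small fragment showing how fusing can be turned off, for example to inspect individual operators in the UI:

```scala
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// disable operator fusing/chaining: every operator now runs as its own task,
// at the cost of extra serialization between operators
env.disableOperatorChaining()
```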
Dataflow without operator fusing
Flink streaming vs Spark streaming

Spark Streaming                                | Flink Streaming
Uses the RDD distribution model for processing | Uses a pipelined stream processing paradigm for processing
Parallelism is done at the batch level         | Parallelism is controlled at the operator level
Uses RDD immutability for fault recovery       | Uses asynchronous barriers for fault recovery
RDD level optimization for stream optimization | Operator fusing for stream optimization
Window API
● Powerful API to track and do custom state analysis
● Types of windows
  ○ Time window
    ■ Tumbling window
    ■ Sliding window
  ○ Non time based window
    ■ Count window
● Ex: WindowExample.scala
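Sketches of each window type, assuming `wordCounts` is a DataStream[(String, Int)] like the one built in the word count example (the name is an assumption):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val keyed = wordCounts.keyBy(0)

// Tumbling time window: one result per key every 10 seconds
keyed.timeWindow(Time.seconds(10)).sum(1)

// Sliding time window: a 10 second window evaluated every 5 seconds
keyed.timeWindow(Time.seconds(10), Time.seconds(5)).sum(1)

// Count window (non time based): one result per key for every 100 elements
keyed.countWindow(100).sum(1)
```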
Anatomy of the Window API
● The window API is made up of 3 different components
● The three components of a window are
  ○ Window assigner
  ○ Trigger
  ○ Evictor
● These three components make up all of the window API in Flink
Window Assigner
● A function which determines, for a given element, which window it should belong to
● Responsible for the creation of windows and for assigning elements to a window
● Two types of window assigners
  ○ Time based window assigner
  ○ GlobalWindow assigner
● Users can write their own custom window assigner too
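For illustration, the two kinds of assigners can be plugged in explicitly on a keyed stream; `keyedStream` is assumed to be a keyed word count stream as above, and the exact assigner class names may vary slightly across Flink versions:

```scala
import org.apache.flink.streaming.api.windowing.assigners.{GlobalWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

// Time based assigner: each element lands in a fixed 10 second processing time window
keyedStream.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))

// GlobalWindow assigner: all elements of a key go into one single window,
// which only fires when an explicit trigger says so
keyedStream.window(GlobalWindows.create())
```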
Trigger
● A trigger is a function responsible for determining when a given window is evaluated
● In a time based window, this function waits till the time is up before triggering
● But in a non time based window, it can use custom logic to determine when to evaluate a given window
● In our example, the number of records in a given window is used to determine whether to trigger or not
● WindowAnatomy.scala
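A sketch combining the GlobalWindow assigner with a count based trigger, which fires and purges the window after every 5 elements per key; `keyedStream` is again an assumed keyed (word, count) stream:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.{CountTrigger, PurgingTrigger}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

// The global window by itself never fires; the trigger decides when to evaluate it,
// and the purging wrapper discards the window contents after each evaluation
val everyFiveElements = keyedStream
  .window(GlobalWindows.create())
  .trigger(PurgingTrigger.of(CountTrigger.of[GlobalWindow](5)))
  .sum(1)
```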
Building a custom session window
● We want to track the session of a user
● Each session is identified using a sessionID
● We will get an event when the session is started
● Evaluate the session when we get the end of session event
● For this, we want to implement our own custom window trigger which tracks the end of the session
● Ex: SessionWindowExample.scala
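A hypothetical sketch of such a trigger: elements are assumed to be (sessionId, eventType) pairs and "end" marks the end of a session; the repository's SessionWindowExample.scala may look different, and the exact Trigger method set varies a little between Flink versions. It would be attached with keyedStream.window(GlobalWindows.create()).trigger(new SessionEndTrigger).

```scala
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

class SessionEndTrigger extends Trigger[(String, String), GlobalWindow] {

  // Evaluate and discard the window as soon as the end-of-session event arrives
  override def onElement(element: (String, String), timestamp: Long,
                         window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult =
    if (element._2 == "end") TriggerResult.FIRE_AND_PURGE else TriggerResult.CONTINUE

  // Time plays no role in this trigger
  override def onProcessingTime(time: Long, window: GlobalWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def onEventTime(time: Long, window: GlobalWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  // No state to clean up here; newer Flink versions declare this method as abstract
  def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = ()
}
```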
Concept of time in Flink streaming
● Time plays an important role in a streaming application
● So the ability to express time in a flexible way is a very important feature of a modern streaming application
● Flink supports three kinds of time
  ○ Processing time
  ○ Event time
  ○ Ingestion time
● Event time is one of the important features of Flink which complements the custom window API
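A fragment showing how the time characteristic is chosen on the execution environment (processing time is the default):

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// interpret time in window operators as event time instead of the default
// processing time; TimeCharacteristic.IngestionTime is the third option
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
```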
Understanding event time
● Time in Flink needs to address the following two questions
  ○ When did the event occur?
  ○ How much time has passed since the event occurred?
● The first question is answered by assigning timestamps
● The second question is answered by understanding the concept of watermarks
● Ex: EventTimeExample.scala
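A fragment sketching the first half, assuming a hypothetical DataStream[Event] named `events` whose elements carry their own timestamp; with ascending timestamps Flink can also derive the watermarks automatically:

```scala
import org.apache.flink.streaming.api.scala._

// Hypothetical event type carrying its own timestamp (in milliseconds)
case class Event(id: String, timestamp: Long)

// Answers "when did the event occur?" by extracting a timestamp per element;
// for monotonically increasing timestamps Flink generates the watermarks itself
val withTimestamps: DataStream[Event] = events.assignAscendingTimestamps(_.timestamp)
```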
Watermarks in Event Time
● A watermark is a special signal which signifies the flow of time in Flink
● In the above diagram, w(20) signifies that 20 units of time have passed at the source
● Watermarks allow Flink to support different time abstractions
References
● http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● http://blog.madhukaraphatak.com/categories/flink-streaming/
● https://www.youtube.com/watch?v=y7f6wksGM6c
● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
● https://www.youtube.com/watch?v=v_exWHj1vmo
● http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink