Introduction to Streaming with Apache Flink

Date post: 16-Apr-2017
Upload: tugdual-grall

#DevoxxFR

Stream Processing with Apache Flink

Tugdual “Tug” Grall, Technical Evangelist @ MapR, @tgrall

{“about” : “me”}

Tugdual “Tug” Grall

• MapR: Technical Evangelist

• MongoDB, Couchbase, eXo, Oracle

• NantesJUG co-founder

• @tgrall

• http://tgrall.github.io

MapR Converged Data Platform

[Figure: MapR Converged Data Platform. Open source engines & tools, commercial engines & applications, and custom apps (data processing) run on enterprise-grade platform services: web-scale storage (MapR-FS), database (MapR-DB), and event streaming (MapR Streams). The platform is reached through the HDFS API, POSIX, NFS, HBase API, JSON API, and Kafka API, and provides high availability, real time, unified security, multi-tenancy, disaster recovery, and a global namespace, with unified management and monitoring. Also shown: cloud and managed services, search and others.]

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Hint: you already have streaming data

Decoupling

[Figure: App A, App B, and App C sharing centrally managed state, contrasted with App A, App B, and App C each building their own state.]

Event Stream = Data Pipelines

Streaming and Batch

[Figure: events spread over two partitions on a timeline running from 2016-3-1 12:00 am to 2016-3-12 3:00 am… The same data can be read as a stream (low latency), as a stream (high latency), or as a batch (bounded stream).]

Processing

• Request / Response

• Batch

• Stream Processing

• Real-time reaction to events

• Continuous applications

• Process both real-time and historical data

Flink Architecture

Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)

Core: Runtime (Distributed Streaming Dataflow)

API & Libraries:

• DataSet API (Batch Processing): FlinkML (Machine Learning), Gelly (Graph Processing), Table (Relational)

• DataStream API (Stream Processing): CEP (Event Processing), Table (Relational)

Demonstration

Flink Basics

Batch & Stream

case class Word(word: String, frequency: Int)

// DataSet API - Batch
val lines: DataSet[String] = env.readTextFile(…)

lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

// DataStream API - Streaming
val lines: DataStream[String] = env.socketTextStream(...)

lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .keyBy("word")
  // 5-second windows, evaluated every second (sliding window)
  .timeWindow(Time.seconds(5), Time.seconds(1))
  .sum("frequency")
  .print()
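To make the sliding window of the streaming variant more concrete, here is a minimal sketch in plain Scala with no Flink dependency. It is not how Flink implements windows; it only shows which windows an element falls into. The names `TimedWord` and `slidingCounts` are hypothetical, timestamps are assumed to be non-negative whole seconds, and window starts are assumed to be multiples of the slide.

```scala
object SlidingWordCount {
  // Hypothetical input record: a word and the second at which it arrived.
  final case class TimedWord(word: String, second: Long)

  // Counts words per sliding window [start, start + size), where window
  // starts are multiples of `slide`. Returns (windowStart, word) -> count.
  // Assumes non-negative timestamps and size >= slide > 0.
  def slidingCounts(
      events: Seq[TimedWord],
      size: Long,
      slide: Long): Map[(Long, String), Int] = {
    val assigned = for {
      e <- events
      // Latest window start that is <= the event time.
      last = e.second - (e.second % slide)
      // Walk back over every earlier window that still covers the event.
      start <- last to (e.second - size + 1) by -slide
    } yield (start, e.word)
    assigned.groupBy(identity).map { case (key, hits) => key -> hits.size }
  }
}
```

With a size of 5 and a slide of 1, every word is counted in five overlapping windows, which is why the streaming job above emits an updated count each second.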

Stream Processing

Source → Filter / Transform → Sink

Flink Ecosystem

Source: Apache Kafka, MapR Streams, AWS Kinesis, RabbitMQ, Twitter, Apache Bahir

Sink: Apache Kafka, MapR Streams, AWS Kinesis, RabbitMQ, Elasticsearch, HDFS/MapR-FS

Stateful Stream Processing

Source → Filter / Transform (reads/writes State) → Sink
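As an illustration of a transform step that reads and writes state, the following plain-Scala sketch keeps a running count per key. It is only an in-memory analogy: in Flink the state would be managed and checkpointed by the runtime. The name `runningCounts` is hypothetical.

```scala
object RunningCount {
  // Emits, for every incoming key, how many times that key has been seen
  // so far. The Map plays the role of the operator's per-key state.
  def runningCounts(keys: Seq[String]): Seq[(String, Int)] = {
    val start = (Map.empty[String, Int], Vector.empty[(String, Int)])
    val (_, emitted) = keys.foldLeft(start) { case ((state, out), key) =>
      val updated = state.getOrElse(key, 0) + 1 // read the state
      // Write the state back and emit the new count downstream.
      (state.updated(key, updated), out :+ (key -> updated))
    }
    emitted
  }
}
```

A stateless filter or map would need nothing but the current element; here the result for an element depends on everything seen before it.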

Is Flink used?

Powered by Flink

• 10 billion events/day

• 2 TB of data/day

• 30 applications

• 2 PB of storage and growing

Source: Bouygues Telecom, http://berlin.flink-forward.org/wp-content/uploads/2016/07/Thomas-Lamirault_Mohamed-Amine-Abdessemed-A-brief-history-of-time-with-Apache-Flink.pdf

Stream Processing

Windowing

Stream Windows

Demonstration

Flink Windowing

Time

What about it?

Demonstration

• Multiple notions of “Time” in Flink

• Event Time

• Ingestion Time

• Processing Time

What Is Event-Time Processing

Processing Time    Event Time
1977               Episode IV
1980               Episode V
1983               Episode VI
1999               Episode I
2002               Episode II
2005               Episode III
2015               Episode VII
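The same metaphor can be written down as a small plain-Scala sketch with no Flink dependency. It assumes, for illustration only, that the episode number stands in for the event timestamp and the release year for the processing time; `Film` and the helper names are hypothetical.

```scala
object EventVsProcessingTime {
  // eventTime: when the event happened (the episode's place in the story).
  // processingTime: when the system saw it (the release year).
  final case class Film(title: String, eventTime: Int, processingTime: Int)

  val films: Seq[Film] = Seq(
    Film("Episode IV", 4, 1977),
    Film("Episode V", 5, 1980),
    Film("Episode VI", 6, 1983),
    Film("Episode I", 1, 1999),
    Film("Episode II", 2, 2002),
    Film("Episode III", 3, 2005),
    Film("Episode VII", 7, 2015))

  // Order in which a processing-time system would handle the events.
  def byProcessingTime: Seq[String] = films.sortBy(_.processingTime).map(_.title)

  // Order in which the events actually happened.
  def byEventTime: Seq[String] = films.sortBy(_.eventTime).map(_.title)

  // Tumbling event-time windows of `size` episodes: [1..size], [size+1..2*size], ...
  def eventTimeWindows(size: Int): Map[Int, Seq[String]] =
    films
      .groupBy(f => (f.eventTime - 1) / size)
      .map { case (window, fs) => window -> fs.sortBy(_.eventTime).map(_.title) }
}
```

Grouping by event time puts Episodes I to III in the first window even though they arrived last, which is the point of event-time processing: results depend on when things happened, not on when they were seen.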

Time in Flink

Complex Event Processing

• Analyzing a stream of events and drawing conclusions

• “if A and then B → infer event C”

• Demanding requirements on stream processor

• Low latency!

• Exactly-once semantics & event-time support

Stream Windows

Order Events

Process is reflected in a stream of order events

Order(orderId, tStamp, “received”)
Shipment(orderId, tStamp, “shipped”)
Delivery(orderId, tStamp, “delivered”)

orderId: Identifies the order
tStamp: Time at which the event happened

Real-time Warnings

CEP to the Rescue

Define processing and delivery intervals (SLAs)

ProcessSucc(orderId, tStamp, duration)
ProcessWarn(orderId, tStamp)
DeliverySucc(orderId, tStamp, duration)
DeliveryWarn(orderId, tStamp)

orderId: Identifies the order
tStamp: Time when the event happened
duration: Duration of the processing/delivery

CEP Example

Processing: Order → Shipment

val processingPattern = Pattern
  .begin[Event]("received").subtype(classOf[Order])
  .followedBy("shipped").where(_.status == "shipped")
  .within(Time.hours(1))

val processingPatternStream = CEP.pattern(
  input.keyBy("orderId"),
  processingPattern)

val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] =
  processingPatternStream.select {
    (pP, timestamp) => // Timeout handler
      ProcessWarn(pP("received").orderId, timestamp)
  } {
    fP => // Select function
      ProcessSucc(
        fP("received").orderId,
        fP("shipped").tStamp,
        fP("shipped").tStamp - fP("received").tStamp)
  }
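To show what the pattern computes without a running Flink job, here is a rough plain-Scala simulation over a finite list of events. It is a sketch under assumptions, not Flink CEP: the case classes are simplified stand-ins for the slide's types, timestamps are assumed to be milliseconds, and the warning is stamped with the deadline, which may differ from the timestamp Flink passes to the timeout handler.

```scala
object OrderSla {
  // Simplified stand-ins for the event types named on the slides.
  final case class Event(orderId: Long, tStamp: Long, status: String)
  final case class ProcessSucc(orderId: Long, tStamp: Long, duration: Long)
  final case class ProcessWarn(orderId: Long, tStamp: Long)

  val OneHourMs: Long = 60L * 60L * 1000L

  // Per order: "received" followed by "shipped" within `withinMs` gives
  // Right(ProcessSucc); otherwise Left(ProcessWarn) stamped with the deadline.
  def check(
      events: Seq[Event],
      withinMs: Long = OneHourMs): Seq[Either[ProcessWarn, ProcessSucc]] =
    events.groupBy(_.orderId).toSeq.sortBy(_._1).flatMap { case (orderId, perOrder) =>
      val ordered = perOrder.sortBy(_.tStamp)
      ordered.filter(_.status == "received").map { received =>
        val deadline = received.tStamp + withinMs
        // Like followedBy, other events may occur in between.
        val shipped = ordered.find { e =>
          e.status == "shipped" && e.tStamp >= received.tStamp && e.tStamp < deadline
        }
        val result: Either[ProcessWarn, ProcessSucc] = shipped match {
          case Some(s) => Right(ProcessSucc(orderId, s.tStamp, s.tStamp - received.tStamp))
          case None    => Left(ProcessWarn(orderId, deadline))
        }
        result
      }
    }
}
```

An order shipped one second after it was received yields a `Right(ProcessSucc(...))` with a duration of 1000, while an order shipped after the one-hour interval yields a `Left(ProcessWarn(...))`. In a real stream the warning can only be emitted once time has advanced past the deadline, which a batch simulation like this one does not model.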

Count Delayed Shipments

Compute Avg Processing Time

The End

• Process events in real time and/or batch

• Complex Event Processing (CEP)

• Many other things to discover

• Deployment

• High Availability

• Table/Relational API

• …

https://mapr.com/ebooks/

Thanks to

Flink Community & Kostas Tzoumas, Stephan Ewen, Fabian Hueske, Till Rohrmann, Jamie Grier
