A Data Streaming Architecture with Apache Flink

Robert Metzger, @rmetzger_, [email protected]

Berlin Buzzwords, June 7, 2016

Talk overview
• My take on the stream processing space, and how it changes the way we think about data
• Transforming an existing data analysis pattern into the streaming world (“Streaming ETL”)
• Demo

Apache Flink
Apache Flink is an open source stream processing framework:
• Low latency
• High throughput
• Stateful
• Distributed

Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production

Entering the streaming era


Streaming is the biggest change in data infrastructure since Hadoop

1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch

Real-world data is produced in a continuous fashion.

New systems like Flink and Kafka embrace the streaming nature of data.

[Figure: Web server, Kafka topic, Stream processor]

Apache Flink stack

[Figure: Apache Flink stack]
• Libraries and integrations: CEP, Table / StreamSQL, SAMOA, Apache Beam, Storm API, Gelly, Table / SQL, ML, Hadoop M/R, Cascading, Zeppelin
• APIs: DataSet (Java / Scala), DataStream (Java / Scala)
• Runtime: Streaming dataflow runtime
• Deployment: Local, Cluster, YARN

What makes Flink flink?

• True Streaming: Low latency; High throughput; Well-behaved flow control (back pressure)
• Event Time: Make more sense of data; Works on real-time and historic data
• APIs / Libraries: Flexible windows (time, count, session, roll-your own); Complex Event Processing
• Stateful Streaming: Globally consistent savepoints; Exactly-once semantics for fault tolerance; Windows & user-defined state
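The window types named on this slide (time, count, session) can be illustrated with a minimal sketch in plain Python. This is not Flink's API: the function names and the integer timestamps are assumptions made for illustration, and a real stream processor assigns windows incrementally rather than over a finished list.

```python
from collections import defaultdict


def tumbling_time_windows(timestamps, size):
    """Group timestamps into fixed, non-overlapping windows of `size` time units.

    Returns a dict mapping each window start to the timestamps it contains.
    """
    windows = defaultdict(list)
    for ts in timestamps:
        window_start = ts - (ts % size)
        windows[window_start].append(ts)
    return dict(windows)


def count_windows(items, n):
    """Group items into consecutive windows of exactly n items.

    A trailing partial window is dropped in this sketch.
    """
    return [items[i:i + n] for i in range(0, len(items) - n + 1, n)]


def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` units of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: start a new session
    return sessions


timestamps = [1, 2, 3, 11, 12, 30]
print(tumbling_time_windows(timestamps, 10))
print(count_windows(timestamps, 2))
print(session_windows(timestamps, 5))
```

The same input yields different groupings per window type, which is the point of "flexible windows": the grouping rule is a choice of the job, not of the engine.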

Moving existing (batch) data analysis into streaming


Extract, Transform, Load (ETL)
• ETL: Move data from A to B and transform it on the way
• Old approach:

[Figure: batch ETL pipeline]
• Sources: Server Logs, Mobile, IoT
• Tier 0: Raw data in HDFS / S3 (“Data Lake”)
• Periodic jobs
• Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS), accessed by users
• Periodic jobs
• Tier 2: Aggregated data (“Data Warehouse”), accessed by users

Extract, Transform, Load (Streaming ETL)
• ETL: Move data from A to B and transform it on the way
• Streaming approach:

[Figure: streaming ETL pipeline]
• Sources: Server Logs, Mobile, IoT
• Tier 0: Raw data (“Data Lake”)
• Stream Processor: Kafka Connector, Cleansing, Transformation, Time-Window, Alerts
• Sinks: ES Connector, Rolling file sink, JDBC sink, Cassandra sink
• Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS), accessed by users and by Batch Processing
• Tier 2: Aggregated data, accessed by users
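The stages in the stream processor box (Cleansing, Transformation, Time-Window, Alerts) can be sketched as chained Python generators. This is a simplified illustration, not the demo's code and not Flink's API: the JSON record layout (`ts`, `status`), the "HTTP status 500 or higher counts as an error" rule, and the alert threshold are all assumptions. The window step here counts over a finished input, whereas a stream processor would emit each window as it closes.

```python
import json
from collections import Counter


def cleanse(raw_lines):
    """Cleansing: drop lines that are not JSON objects with integer `ts` and `status`."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is skipped, not fatal
        if (isinstance(record, dict)
                and isinstance(record.get("ts"), int)
                and isinstance(record.get("status"), int)):
            yield record


def transform(records):
    """Transformation: reduce each record to (event time in seconds, is_error)."""
    for record in records:
        yield record["ts"], record["status"] >= 500


def windowed_error_counts(events, window_size):
    """Time-Window: count errors per tumbling window, keyed by window start."""
    counts = Counter()
    for ts, is_error in events:
        if is_error:
            counts[ts - ts % window_size] += 1
    return dict(counts)


def alerts(error_counts, threshold):
    """Alerts: return the starts of all windows with at least `threshold` errors."""
    return sorted(start for start, n in error_counts.items() if n >= threshold)


raw = [
    '{"ts": 1, "status": 200}',
    'not json',
    '{"ts": 2, "status": 503}',
    '{"ts": 3, "status": 500}',
    '{"status": 500}',
    '{"ts": 65, "status": 502}',
]
counts = windowed_error_counts(transform(cleanse(raw)), window_size=60)
print(counts, alerts(counts, threshold=2))
```

Because each stage consumes the previous one lazily, a record flows through all stages as soon as it arrives, which is what removes the wait for the next periodic "load" job.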

Streaming ETL: Low Latency
• Events are processed immediately
• No need to wait until the next “load” batch job is running

[Figure: latency by approach, on a scale of hours, minutes, seconds, milliseconds. Approaches: Periodic batch job, Batch processor with micro-batches, Stream processor. Annotations: Less than 500 ms*, Less than 250 ms*]

* Your mileage may vary. These are rule-of-thumb estimates.

Streaming ETL: Event-time aware
• Events derived from the same real-world activity might arrive out of order in the system
• Flink is event-time aware

[Figure: three timelines, each marked 11:28 and 11:29, showing events from the same real-world activity]
• Out of sync clocks
• Network delays
• Machine failures
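How event-time awareness copes with out-of-order arrival can be sketched as follows. This is a simplified model in plain Python, not Flink's API: events are grouped by their own timestamp into tumbling windows, and a watermark (here assumed to be the highest timestamp seen minus a fixed `max_out_of_orderness`) decides when a window is complete. The function name, parameters, and the policy of dropping late events are assumptions for illustration.

```python
from collections import defaultdict


def event_time_windows(events, window_size, max_out_of_orderness):
    """Assign (timestamp, value) events to tumbling windows by event time.

    A window is emitted once the watermark passes its end. Events that arrive
    for an already-emitted window are dropped in this sketch.
    """
    open_windows = defaultdict(list)
    max_ts = float("-inf")
    emitted = []

    for ts, value in events:
        window_start = ts - ts % window_size
        watermark = max_ts - max_out_of_orderness
        if window_start + window_size <= watermark:
            continue  # too late: this window was already closed

        open_windows[window_start].append(value)
        max_ts = max(max_ts, ts)
        watermark = max_ts - max_out_of_orderness

        # Emit every window whose end is at or before the new watermark.
        for start in sorted(s for s in open_windows if s + window_size <= watermark):
            emitted.append((start, open_windows.pop(start)))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        emitted.append((start, open_windows.pop(start)))
    return emitted


# Timestamps are seconds; "c" (ts=58) arrives after "b" (ts=62) but still lands
# in the first one-minute window, because grouping uses event time.
events = [(5, "a"), (62, "b"), (58, "c"), (75, "d"), (30, "late"), (130, "e")]
print(event_time_windows(events, window_size=60, max_out_of_orderness=10))
```

The trade-off sits in `max_out_of_orderness`: a larger value tolerates more disorder from clock skew, network delays, or machine failures, at the cost of holding windows open longer.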

Demo

Job Overview

[Figure: Flink Twitter Source, Data Ingestion Job, “Streaming ETL” Job]

[Figure: operators of the job: (Rolling) file sink, Filter operation, Filter operation, Aggregation to ElasticSearch, Streaming WordCount, TopN operator]
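The filter, WordCount, and TopN steps named in the job overview can be sketched in plain Python. This is not the demo's code (that is in the GitHub repository linked on the next slide) and not Flink's API: the tweet layout (`lang`, `text`), the language filter, and the tokenizer are assumptions for illustration, and the counts are computed over a finished list rather than per window.

```python
import heapq
import re
from collections import Counter


def filter_tweets(tweets, language="en"):
    """Filter operation: keep tweets in the given language that have text."""
    for tweet in tweets:
        if tweet.get("lang") == language and tweet.get("text"):
            yield tweet


def word_counts(tweets):
    """WordCount: count lower-cased words across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(re.findall(r"[a-z']+", tweet["text"].lower()))
    return counts


def top_n(counts, n):
    """TopN: the n most frequent words, ties broken alphabetically."""
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))


tweets = [
    {"lang": "en", "text": "Flink streams data"},
    {"lang": "de", "text": "Flink ist schnell"},
    {"lang": "en", "text": "data streams, data everywhere"},
    {"lang": "en", "text": ""},
]
print(top_n(word_counts(filter_tweets(tweets)), 2))
```

In a streaming job the counter would be keyed state scoped to a window, and the TopN result would be what gets written to a sink such as ElasticSearch.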

Demo code @ GitHub


https://github.com/rmetzger/flink-streaming-etl




Closing


https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets-25580481910

Flink Forward 2016, Berlin

Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016

http://www.flink-forward.org/

We are hiring! data-artisans.com/careers

Questions? Ask now!
• eMail: [email protected]
• Twitter: @rmetzger_
• Follow: @ApacheFlink
• Read: flink.apache.org/blog, data-artisans.com/blog/
• Mailing lists: (news | user | dev)@flink.apache.org

Appendix


Sources

• “Large scale ETL with Hadoop”, http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop