A Data Streaming Architecture with Apache Flink
Robert Metzger (@rmetzger_)
Berlin Buzzwords, June 7, 2016
Talk overview
• My take on the stream processing space, and how it changes the way we think about data
• Transforming an existing data analysis pattern into the streaming world (“Streaming ETL”)
• Demo
Apache Flink
• Apache Flink is an open source stream processing framework
  • Low latency
  • High throughput
  • Stateful
  • Distributed
• Developed at the Apache Software Foundation, 1.0.0 released in March 2016, used in production
Entering the streaming era
Streaming is the biggest change in data infrastructure
since Hadoop
1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch
Real-world data is produced in a continuous fashion.
New systems like Flink and Kafka embrace the streaming nature of data.
Web server -> Kafka topic -> Stream processor
Apache Flink stack
• Libraries and integrations: Gelly, Table / SQL, ML, SAMOA, Hadoop M/R, Apache Beam, Table / StreamSQL, Cascading, Storm API, Zeppelin, CEP
• Core APIs: DataSet (Java/Scala), DataStream (Java/Scala)
• Runtime: Streaming dataflow runtime
• Deployment: Local, Cluster, YARN
What makes Flink flink?
• True Streaming
  • Low latency
  • High throughput
  • Well-behaved flow control (back pressure)
• Event Time
  • Make more sense of data
  • Works on real-time and historic data
• Stateful Streaming
  • Globally consistent savepoints
  • Exactly-once semantics for fault tolerance
  • Windows & user-defined state
• APIs / Libraries
  • Flexible windows (time, count, session, roll-your-own)
  • Complex Event Processing
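Of the window types listed on this slide, session windows are probably the least familiar: a session ends once no event has arrived for a given gap. The following is a minimal plain-Python sketch of that idea, not Flink's API. The function name `session_windows`, the sample timestamps, and the 30-second gap are assumptions made for illustration.

```python
from typing import List


def session_windows(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group event timestamps into sessions separated by a pause of at least `gap`."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        # Start a new session for the first event, or when the pause since
        # the previous event is at least as long as the gap.
        if not sessions or ts - sessions[-1][-1] >= gap:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
    return sessions


# Hypothetical click timestamps in seconds.
clicks = [0, 10, 25, 100, 110, 300]
sessions = session_windows(clicks, gap=30)
```

A real stream processor cannot sort the whole input up front; it has to keep open sessions as state and merge them as events arrive, which is where stateful streaming comes in.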
Moving existing (batch) data analysis into streaming
Extract, Transform, Load (ETL)
• ETL: Move data from A to B and transform it on the way
• Old approach:
  • Sources: server logs, mobile, IoT
  • Tier 0: Raw data in HDFS / S3 (“Data Lake”)
  • Periodic jobs produce Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS), accessed by users
  • Periodic jobs produce Tier 2: Aggregated data (“Data Warehouse”), accessed by users
Extract, Transform, Load (Streaming ETL)
• ETL: Move data from A to B and transform it on the way
• Streaming approach:
  • Sources: server logs, mobile, IoT
  • Tier 0: Raw data (“Data Lake”)
  • Stream processor: Kafka connector, cleansing, transformation, time windows, alerts
  • Sinks: ES connector, rolling file sink, JDBC sink, Cassandra sink
  • Tier 1: Normalized, cleansed data (Parquet / ORC in HDFS), available to users and to batch processing
  • Tier 2: Aggregated data, available to users
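The stages named on this slide (cleansing, transformation, time window, alerts) can be chained as a pipeline in which each record flows through all stages as soon as it arrives. Below is a rough plain-Python sketch using generators; it is not Flink code. The JSON log format, the field names `ts` and `status`, the 60-second window, and the alert threshold are all assumptions chosen for illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

WINDOW_SECONDS = 60   # assumed window size
ALERT_THRESHOLD = 2   # assumed: alert when errors per window reach this value


def cleanse(lines: Iterable[str]) -> Iterator[dict]:
    """Tier 0 -> Tier 1: drop lines that are not JSON objects with the needed fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and {"ts", "status"} <= record.keys():
            yield record


def transform(records: Iterable[dict]) -> Iterator[Tuple[int, bool]]:
    """Normalize each record to (window start, is_error)."""
    for r in records:
        window_start = int(r["ts"]) // WINDOW_SECONDS * WINDOW_SECONDS
        yield window_start, int(r["status"]) >= 500


def errors_per_window(events: Iterable[Tuple[int, bool]]) -> Dict[int, int]:
    """Tier 1 -> Tier 2: count error events per time window."""
    counts: Counter = Counter()
    for window_start, is_error in events:
        counts[window_start] += int(is_error)
    return dict(counts)


# Hypothetical raw log lines, including one malformed and one incomplete record.
raw_logs = [
    '{"ts": 5, "status": 200}',
    'not json',
    '{"ts": 20, "status": 500}',
    '{"ts": 59, "status": 503}',
    '{"ts": 61, "status": 500}',
    '{"status": 500}',
]
counts = errors_per_window(transform(cleanse(raw_logs)))
alerts = [w for w, n in sorted(counts.items()) if n >= ALERT_THRESHOLD]
```

For brevity the window stage collects all counts in memory and returns them at the end. A stream processor would instead emit each window's result when the window closes and write it to a sink.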
Streaming ETL: Low Latency
• Events are processed immediately
• No need to wait until the next “load” batch job is running
Latency by approach (figure):
• Latency scale: hours, minutes, seconds, milliseconds
• Approaches: periodic batch job, batch processor with micro-batches, stream processor
• Annotations: less than 500 ms*, less than 250 ms*
* Your mileage may vary. These are rule of thumb estimates.
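To make the waiting time concrete: with batch-style loading, an event sits idle until the next batch boundary, whatever the processing cost. The small sketch below computes only that waiting time. The function name `wait_for_next_batch`, the 5-second interval, and the arrival times are assumptions for illustration, not measurements.

```python
import math
from typing import List


def wait_for_next_batch(arrival_ms: int, interval_ms: int) -> int:
    """Milliseconds an event waits until the next batch boundary.

    Boundaries are assumed to lie at multiples of `interval_ms`.
    """
    next_boundary = math.ceil(arrival_ms / interval_ms) * interval_ms
    return next_boundary - arrival_ms


# Hypothetical arrival times (ms) and a 5-second batch interval.
arrivals: List[int] = [200, 1000, 3700]
waits = [wait_for_next_batch(a, 5000) for a in arrivals]
```

A stream processor has no such boundary, so this waiting component disappears and only the per-event processing time remains.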
Streaming ETL: Event-time aware
• Events derived from the same real-world activity might arrive out of order in the system
• Flink is event-time aware
[Figure: events from the same real-world activity, timestamped 11:28 and 11:29, arrive out of order. Causes: out of sync clocks, network delays, machine failures]
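One way to picture event-time processing is to assign each event to a window by the timestamp it carries, and to close a window only after a watermark (here: the highest timestamp seen minus an allowed delay) has passed the window's end. The plain-Python sketch below illustrates this under simplifying assumptions; it is not Flink's API or its exact semantics. The function name `event_time_windows`, the 60-second window, and the 30-second delay are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event time in seconds, payload)


def event_time_windows(events: Iterable[Event], size: int, max_delay: int):
    """Assign events to tumbling windows by event time, not arrival order.

    A window [start, start + size) is emitted once the watermark
    (highest event time seen minus `max_delay`) reaches its end.
    Events whose window was already emitted are returned as late.
    """
    open_windows: Dict[int, List[str]] = defaultdict(list)
    emitted: List[Tuple[int, List[str]]] = []
    late: List[Event] = []
    watermark = float("-inf")
    for ts, payload in events:
        start = ts - ts % size
        if start + size <= watermark:
            late.append((ts, payload))  # window already closed
        else:
            open_windows[start].append(payload)
        watermark = max(watermark, ts - max_delay)
        # Emit every open window that the watermark has passed.
        for s in sorted(w for w in open_windows if w + size <= watermark):
            emitted.append((s, open_windows.pop(s)))
    # End of input: flush the windows that are still open.
    for s in sorted(open_windows):
        emitted.append((s, open_windows.pop(s)))
    return emitted, late


# Events listed in arrival order; "c" and "e" arrive out of order.
arrived = [(10, "a"), (70, "b"), (50, "c"), (125, "d"), (20, "e")]
emitted, late = event_time_windows(arrived, size=60, max_delay=30)
```

Event "c" arrives after "b" but still lands in the first window because the watermark has not yet passed that window's end. Event "e" arrives after the first window was emitted and is reported as late.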
Demo
Job Overview
Flink Twitter Source
Data Ingestion Job
“Streaming ETL” Job
Job Overview
(Rolling) file sink
Filter operation
Filter operation
Aggregation to ElasticSearch
Streaming WordCount
TopN operator
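The filter, word count, and TopN steps named above can be sketched in a few lines of plain Python. This is not the demo code from the repository and does not use Flink. The function name `top_n_words`, the tokenizing pattern, the "skip empty tweets" filter, and the sample tweets are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def top_n_words(tweets: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Count words across tweets and return the n most frequent ones."""
    counts: Counter = Counter()
    for text in tweets:
        # Filter step (hypothetical criterion): skip empty tweets.
        if not text.strip():
            continue
        # Word count step: lowercase and split into simple tokens.
        counts.update(re.findall(r"[a-z0-9#@']+", text.lower()))
    # TopN step: highest count first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical tweets.
sample = ["Flink streams data", "", "Kafka streams data too", "flink flink"]
top2 = top_n_words(sample, 2)
```

In a streaming job the counts would be kept as operator state and the TopN result would be recomputed per window instead of once at the end of the input.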
Demo code @ GitHub
https://github.com/rmetzger/flink-streaming-etl
Closing
https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets-25580481910
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring!
data-artisans.com/careers
Questions?
• Ask now!
• eMail: [email protected]
• Twitter: @rmetzger_
• Follow: @ApacheFlink
• Read: flink.apache.org/blog, data-artisans.com/blog/
• Mailing lists: (news | user | dev)@flink.apache.org
Appendix
Sources
“Large scale ETL with Hadoop” http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop