Beam
credit: http://www.post-gazette.com/starwars
credit: http://reallyobsessedwithfilm.blogspot.com/2012/05/x-men-second-grade-we-need-leader.html
The Evolution of Apache Beam
[Diagram: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Google Cloud Dataflow, Apache Beam]
MapReduce
[Figure: input divided into hourly windows, from Tuesday [11:00 - 12:00) through [23:00 - 0:00)]
Batch Patterns: Time Based Windows
Streaming Patterns: Event-Time Based Windows
[Figure: input and output plotted against event time and processing time, axes marked hourly from 10:00 to 15:00]
Formalizing Event-Time Skew
Watermarks describe event time progress.
"No timestamp earlier than the watermark will be seen"
Often heuristic-based.
Too slow? Results are delayed.
Too fast? Some data is late.
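The trade-off above can be illustrated with a minimal plain-Java sketch (not the Beam API). It assumes one simple heuristic: the watermark is the largest event time seen so far minus a fixed skew allowance. The class name `HeuristicWatermark`, the parameter `maxSkewMillis`, and the heuristic itself are assumptions made for illustration; real runners estimate watermarks in source-specific ways.

```java
final class HeuristicWatermark {
    private final long maxSkewMillis;            // assumed bound on out-of-order arrival
    private long maxEventTime = Long.MIN_VALUE;  // largest event time observed so far

    HeuristicWatermark(long maxSkewMillis) {
        this.maxSkewMillis = maxSkewMillis;
    }

    /** Current watermark: heuristically, no earlier timestamp is expected. */
    long current() {
        return maxEventTime == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTime - maxSkewMillis;
    }

    /** Observes one element; returns true if it arrived behind the watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
        return late;
    }
}
```

With a large `maxSkewMillis` the watermark advances slowly, so results are delayed; with a small one it advances quickly, so more elements are classified as late.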
The Beam Model: What is Being Computed?
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
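For a bounded input, the "what" of a per-key integer sum can be sketched with plain Java collections. This is only an illustration of the result being computed, not how a Beam runner executes it; `PerKeySum` and `sumPerKey` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PerKeySum {
    /** Adds up all values that share a key. */
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> input) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : input) {
            // merge() inserts the value or combines it with the existing total
            totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return totals;
    }
}
```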
The Beam Model: Where in Event Time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
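Fixed windowing by event time can be sketched in plain Java as follows. Each element is assigned to the window `[start, start + size)` that contains its event timestamp, and values are summed per window and key. This is a simplified illustration under stated assumptions, not the Beam implementation; `FixedWindowSum`, `windowStart`, and the parallel-array input are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class FixedWindowSum {
    /** Start of the fixed window [start, start + size) containing the timestamp. */
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        // floorDiv keeps negative timestamps in the correct window
        return Math.floorDiv(eventTimeMillis, sizeMillis) * sizeMillis;
    }

    /** Sums values per (window start, key); index i describes element i. */
    static Map<Long, Map<String, Integer>> sumPerWindowAndKey(
            String[] keys, int[] values, long[] eventTimes, long sizeMillis) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            long start = windowStart(eventTimes[i], sizeMillis);
            result.computeIfAbsent(start, s -> new TreeMap<>())
                  .merge(keys[i], values[i], Integer::sum);
        }
        return result;
    }
}
```

The grouping depends only on when each event happened, not on when it arrived, which is why the same logic applies to batch and streaming inputs.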
The Beam Model: When in Processing Time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey());
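A watermark trigger can be pictured as holding each window's running sum until the watermark reaches the end of that window. The sketch below is plain Java, handles a single key, and ignores late data; `WatermarkTriggeredSum` and its methods are hypothetical names, not Beam APIs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WatermarkTriggeredSum {
    private final long sizeMillis;
    // window start -> running sum, ordered by window start
    private final TreeMap<Long, Integer> openWindows = new TreeMap<>();

    WatermarkTriggeredSum(long sizeMillis) {
        this.sizeMillis = sizeMillis;
    }

    /** Buffers a value in the fixed window that contains its event time. */
    void add(int value, long eventTimeMillis) {
        long start = Math.floorDiv(eventTimeMillis, sizeMillis) * sizeMillis;
        openWindows.merge(start, value, Integer::sum);
    }

    /** Emits "start=sum" for every window whose end is at or before the watermark. */
    List<String> advanceWatermark(long watermarkMillis) {
        List<String> fired = new ArrayList<>();
        Iterator<Map.Entry<Long, Integer>> it = openWindows.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Integer> e = it.next();
            if (e.getKey() + sizeMillis > watermarkMillis) {
                break; // this window and all later ones are still open
            }
            fired.add(e.getKey() + "=" + e.getValue());
            it.remove();
        }
        return fired;
    }
}
```

Nothing is materialized until the watermark passes a window's end, so output latency follows the watermark rather than the arrival of individual elements.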
The Beam Model: How Do Refinements Relate?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .discardingFiredPanes()) // or accumulatingFiredPanes()
    .apply(Sum.integersPerKey());
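The difference between the two modes only shows up when a trigger fires more than once for the same window, for example with early or late firings, which the snippet above does not configure. Under that assumption, a plain-Java sketch of the emitted panes for one window might look like this; `PaneModes` and `panes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class PaneModes {
    /**
     * Returns the value emitted at each trigger firing for a single window.
     * Each inner array holds the elements that arrived since the previous firing.
     */
    static List<Integer> panes(int[][] batches, boolean accumulating) {
        List<Integer> emitted = new ArrayList<>();
        int state = 0;
        for (int[] batch : batches) {
            for (int v : batch) {
                state += v;
            }
            emitted.add(state);
            if (!accumulating) {
                state = 0; // discarding: drop what has already been emitted
            }
        }
        return emitted;
    }
}
```

For arrivals `{3, 4}`, then `{5}`, then `{1, 2}`, accumulating mode emits 7, 12, 15 (each pane replaces the previous one), while discarding mode emits 7, 5, 3 (the consumer must add the panes together).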
The Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
The Beam Model: Batch
PCollection<String> input = pipeline.apply(HDFSSource.read());
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
The Beam Model: Streaming
PCollection<String> input = pipeline.apply(KafkaSource.read());
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
The Beam Model: Spark Runner
// Slide shorthand: the SDK selects the runner through PipelineOptions.
Pipeline pipeline = Pipeline.create("SparkRunner");
PCollection<String> input = pipeline.apply(KafkaSource.read());
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
The Beam Model: Flink Runner
// Slide shorthand: the SDK selects the runner through PipelineOptions.
Pipeline pipeline = Pipeline.create("FlinkRunner");
PCollection<String> input = pipeline.apply(KafkaSource.read());
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
Why Apache Beam?
Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user- and community-driven SDKs, runners, transformation libraries, and IO connectors.
The Apache Beam Vision
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: Beam Java, Beam Python, and other language SDKs feed Beam Model: Pipeline Construction; Beam Model: Fn Runners hand pipelines to Apache Flink, Apache Spark, and Cloud Dataflow for execution]
Learn More!
Apache Beam (incubating) http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Why Apache Beam? A Google Perspective http://goo.gl/eWTLH1
Join the mailing lists!
User discussions - [email protected]
Developer discussions - [email protected]
Follow @ApacheBeam on Twitter
Credit
This deck is based on A Brief Introduction to The Beam Model and An Introduction to The Beam Model by Frances Perry & Tyler Akidau.