Apache Beam (incubating)
Kenneth [email protected] @KennKnowles
Apache Apex Meetup, 2016-06-27
https://goo.gl/LTLjKt
Agenda
1. Motivation
2. Beam Model
3. Beam Project / Technical Vision
Motivation
[Image: https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg]
Unbounded, delayed, out of order
[Figure: events arriving over a timeline of hours, captioned "Incoming!" and "Score per user?"]
Organizing the stream
Data Processing Tradeoffs
Completeness, Latency, Cost
What is important for your application?
[Figures: each use case below rated from Not Important to Important on Completeness, Low Latency, and Low Cost]
Monthly Billing
Billing estimate
Abuse Detection
The Beam Model
Pipeline
PTransform
PCollection
The Beam Vision (for users)
Sum Per Key
Java:
input.apply(Sum.integersPerKey())
Python:
input | Sum.PerKey()
Runners: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating), ...
What your (Java) Code Looks Like
Pipeline p = Pipeline.create(options);
// Read lines, split them into words, drop empty strings,
// count each distinct word, format the counts, and write them out.
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));
p.run();
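For readers without a runner at hand, the stages of the word count above can be sketched in plain Python. This is only an illustration of what the pipeline computes, not Beam code; the function name count_words and the sample lines are made up for the example.

```python
import re
from collections import Counter


def count_words(lines):
    """Mirror the pipeline stages: split, drop empties, count, format."""
    words = []
    for line in lines:
        # Same pattern as the Java example: split on runs of non-letters.
        words.extend(re.split(r"[^a-zA-Z']+", line))
    words = [w for w in words if w]   # the Filter step: keep non-empty words
    counts = Counter(words)           # the Count step: one total per word
    # The MapElements step: format each (word, count) as "word: count".
    return [f"{word}: {n}" for word, n in sorted(counts.items())]


for row in count_words(["to be, or not to be", "that is the question"]):
    print(row)
```

The difference in Beam is that each stage is a transform over a distributed collection, so the same description can run on any supported runner.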
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
What are you computing?
Aggregations, transformations, ...
The Beam Model: What are you computing?
Sum Per User
Sum Per Key
Java:
input.apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...));
Python:
input | Sum.PerKey() | Write(BigQuerySink(...))
http://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
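As a rough picture of what "Sum Per Key" means, here is a plain-Python sketch over an in-memory list. The helper sum_per_key and the sample scores are hypothetical; a Beam runner performs the same grouping and summing across many machines.

```python
from collections import defaultdict


def sum_per_key(pairs):
    """Add up the integer values of (key, value) pairs, one total per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical per-user scores.
scores = [("alice", 3), ("bob", 4), ("alice", 7), ("bob", 1)]
print(sum_per_key(scores))
```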
Where in event time?
Event time windowing
The Beam Model: Where in Event Time?
Processing Time vs Event Time
Event Time = Processing Time ??
[Figures: Processing Time plotted against Event Time. Labels across the series: "Realtime", "This is not possible", "Processing Delay", "Very delayed"]
Processing Time windows (probably are not what you want)
[Figure: Processing Time vs Event Time]
Event Time Windows
[Figures: Processing Time vs Event Time]
(implementing processing time windows)
Just throw away your data's timestamps and replace them with "now()"
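To make the contrast concrete, the sketch below windows the same events twice: once by the timestamp carried in the data, and once by the time each element arrived (the "now()" replacement). It is plain Python, not Beam; the function names and the sample data are assumptions made for the example.

```python
from collections import defaultdict

WINDOW = 3600  # one-hour fixed windows, in seconds


def window_start(timestamp, size=WINDOW):
    """Start of the fixed window that contains the timestamp."""
    return timestamp - timestamp % size


def sum_by_window(events, use_event_time):
    """Sum values per window, using event time or arrival time."""
    totals = defaultdict(int)
    for event_time, arrival_time, value in events:
        # Processing-time windowing is the same code with the timestamp
        # replaced by the moment the element showed up ("now()").
        stamp = event_time if use_event_time else arrival_time
        totals[window_start(stamp)] += value
    return dict(totals)


# (event_time, arrival_time, value) in seconds; hypothetical data.
# All three events happened in the first hour, but one arrived much later.
events = [(600, 660, 5), (1800, 1900, 2), (3000, 7300, 4)]

print(sum_by_window(events, use_event_time=True))
print(sum_by_window(events, use_event_time=False))
```

With event time, all three values land in the window where they happened. With arrival time, the delayed element is counted in a later window, which is why processing time windows are probably not what you want.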
The Beam Model: Where in Event Time?
Window Into
Sum Per Key
Java:
input.apply(Window.into(FixedWindows.of(Duration.standardHours(1))))
     .apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...))
Python:
input | WindowInto(FixedWindows(3600)) | Sum.PerKey() | Write(BigQuerySink(...))
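A minimal plain-Python sketch of what windowing followed by a per-key sum produces: one total per key per one-hour window. The helper fixed_window_sums and the sample elements are hypothetical, and windows are assumed to be half-open intervals [start, start + size).

```python
from collections import defaultdict


def fixed_window_sums(elements, size):
    """Sum values per (key, window start) for fixed windows of the given size."""
    totals = defaultdict(int)
    for key, value, event_time in elements:
        start = event_time - event_time % size   # window assignment
        totals[(key, start)] += value            # per-key, per-window sum
    return dict(totals)


# (key, value, event_time in seconds); hypothetical scores.
elements = [
    ("alice", 3, 100),
    ("alice", 7, 3500),
    ("bob", 4, 3599),
    ("alice", 1, 3600),  # first second of the next hour
]
print(fixed_window_sums(elements, size=3600))
```

Windowing only changes the grouping: the sum itself is written the same way as in the unwindowed case.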
So that's what and where...
When in processing time are results produced?
Watermarks & Triggers
Event time windows
[Figure: Processing Time vs Event Time]
Fixed cutoff (we can do better)
[Figure: Processing Time vs Event Time, labeled "Allowed delay" and "Concurrent windows"]
Perfect watermark
[Figure: Processing Time vs Event Time]
Check out Slava's slides from Strata London 2016 talk on watermarks: https://goo.gl/K4FnqQ
Heuristic Watermark
[Figures: Processing Time vs Event Time as the current processing time advances; the last frame is labeled "Late data"]
Watermarks measure completeness
? Running Total
✔ Monthly billing
? Abuse Detection
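The idea of a heuristic watermark and late data can be sketched in a few lines of plain Python. The heuristic used here (largest event time seen so far, minus a fixed skew) is invented for illustration and is not how any particular runner or source estimates its watermark; classify and the sample event times are likewise hypothetical.

```python
def classify(arrivals, skew):
    """Label each element on_time or late against a toy heuristic watermark.

    arrivals: event times listed in the order the elements arrive.
    The watermark is (largest event time seen so far - skew).
    """
    watermark = float("-inf")
    labels = []
    for event_time in arrivals:
        # An element is late if the watermark already passed its event time.
        late = event_time < watermark
        labels.append((event_time, "late" if late else "on_time"))
        watermark = max(watermark, event_time - skew)
    return labels


# The element with event time 12 shows up after the watermark reached 25.
print(classify([10, 14, 30, 27, 12, 35], skew=5))
```

A larger skew makes the watermark more conservative: fewer elements are late, but results that wait for the watermark are delayed longer. With skew=20 the same input has no late elements.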
The Beam Model: When in Processing Time?
Window Into
Sum Per Key
Java:
input.apply(Window.into(FixedWindows.of(...))
                   .triggering(AfterWatermark.pastEndOfWindow()))
     .apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...))
Python:
input | WindowInto(FixedWindows(3600),
                   trigger=AfterWatermark()) | Sum.PerKey() | Write(BigQuerySink(...))
Trigger after end of window
AfterWatermark.pastEndOfWindow()
[Figures: Processing Time vs Event Time as the current processing time advances, with "Late data" marked]
High completeness
Potentially high latency
Low cost
Repeatedly.forever(AfterPane.elementCountAtLeast(2))
[Figures: Processing Time vs Event Time as the current processing time advances]
Low completeness
Low latency
Cost driven by input
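A toy model of a repeated element-count trigger for a single window, in plain Python. It fires exactly when the count is reached, which is a simplification: the trigger only promises at least that many elements, so a real runner may put more into a pane. The function fire_every_n and the sample values are hypothetical.

```python
def fire_every_n(values, n):
    """Emit a pane (list of elements) each time n new elements have arrived."""
    panes, buffer = [], []
    for value in values:
        buffer.append(value)
        if len(buffer) >= n:      # the element-count condition is satisfied
            panes.append(buffer)  # fire a pane ...
            buffer = []           # ... and start counting again (repeat forever)
    return panes, buffer          # buffer holds elements that have not fired


panes, pending = fire_every_n([3, 7, 4, 1, 9], n=2)
print(panes)    # panes fired so far
print(pending)  # still waiting for one more element
```

The number of panes grows with the number of inputs, which is the sense in which cost is driven by input.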
Build a finely tuned trigger for your use case
AfterWatermark.pastEndOfWindow()                        // Bill at end of month
    .withEarlyFirings(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDuration(Duration.standardMinutes(1)))     // Near real-time estimates
    .withLateFirings(AfterPane.elementCountAtLeast(1))  // Immediate corrections
.withEarlyFirings(after 1 minute)
.withLateFirings(ASAP after each element)
[Figures: Processing Time vs Event Time as the current processing time advances]
Low completeness
Low latency
Low cost, driven by time
[Figures: Processing Time vs Event Time, with "Late output" marked]
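The firing pattern above can be walked through with a small plain-Python simulation of one window. It rests on several assumptions that are not from the slides: panes accumulate (each firing reports the running sum), the processing time at which the watermark passes the end of the window is handed in as a parameter rather than estimated, and the function simulate and the sample arrivals are hypothetical.

```python
def simulate(arrivals, watermark_passes_at, early_delay=60):
    """Return (processing_time, timing, running_sum) for each firing.

    arrivals: (processing_time, value) pairs for a single window.
    watermark_passes_at: processing time at which the watermark passes
    the end of the window (an input to this toy model).
    """
    firings, total = [], 0
    early_timer = None      # when the pending early firing is due
    on_time_fired = False
    # Merge the watermark crossing (kind 0) into the arrivals (kind 1).
    moments = sorted(
        [(t, 1, v) for t, v in arrivals] + [(watermark_passes_at, 0, None)]
    )
    for now, kind, value in moments:
        # An early timer that came due before this moment fires first.
        if early_timer is not None and early_timer <= now and not on_time_fired:
            firings.append((early_timer, "EARLY", total))
            early_timer = None
        if kind == 0:                      # watermark passes end of window
            firings.append((now, "ON_TIME", total))
            on_time_fired = True
            early_timer = None
        else:
            total += value
            if on_time_fired:              # late element: fire immediately
                firings.append((now, "LATE", total))
            elif early_timer is None:      # first element of a new pane
                early_timer = now + early_delay
    return firings


# Hypothetical arrivals in seconds of processing time.
for firing in simulate([(10, 5), (30, 2), (100, 4), (250, 1)], watermark_passes_at=200):
    print(firing)
```

In billing terms, the EARLY rows are the near real-time estimates, the ON_TIME row is the bill, and the LATE row is an immediate correction.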
Trigger Catalogue
Basic Triggers
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)
Composite Triggers
AfterEndOfWindow().withEarlyFirings(A).withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)
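One way to think about how basic triggers compose is to treat each trigger as a yes/no question about the current pane. The sketch below does that in plain Python for the count, delay, any, and all entries of the catalogue. It is a simplified mental model rather than the SDK's implementation (real triggers are stateful), and PaneState and the snake_case function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class PaneState:
    count: int     # elements buffered in the current pane
    waited: float  # seconds of processing time since the first of them


def after_count(n):
    return lambda pane: pane.count >= n


def after_processing_time_delay(seconds):
    return lambda pane: pane.count > 0 and pane.waited >= seconds


def after_any(*triggers):
    # Ready as soon as one of the sub-triggers is ready.
    return lambda pane: any(t(pane) for t in triggers)


def after_all(*triggers):
    # Ready only once every sub-trigger is ready.
    return lambda pane: all(t(pane) for t in triggers)


# Fire on 100 elements or after a minute, whichever comes first.
eager = after_any(after_count(100), after_processing_time_delay(60))
# Fire only once both conditions hold.
patient = after_all(after_count(100), after_processing_time_delay(60))

slow_trickle = PaneState(count=3, waited=75.0)
print(eager(slow_trickle), patient(slow_trickle))
```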
How do refinements relate?
Accumulation Mode
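The Beam Model: How do refinements relate?
input
    .apply(Window.into(...).triggering(...).discardingFiredPanes())
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))
discarding vs accumulating
[Figure: a first pane containing 3 and 7 outputs 10 in both modes; a second pane containing 4 and 1 outputs 5 when discarding and 15 when accumulating]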
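The two modes can be compared with a short plain-Python sketch that reuses the numbers from the figure, under the assumption that 3 and 7 form the first fired pane and 4 and 1 the second. The helper pane_outputs is hypothetical.

```python
def pane_outputs(panes, accumulating):
    """Sum each fired pane, optionally carrying earlier panes forward."""
    outputs, carried = [], 0
    for pane in panes:
        result = carried + sum(pane)
        outputs.append(result)
        if accumulating:
            carried = result  # later panes refine the running total
        # When discarding, carried stays 0: each pane reports only new elements.
    return outputs


panes = [[3, 7], [4, 1]]
print(pane_outputs(panes, accumulating=False))
print(pane_outputs(panes, accumulating=True))
```

Discarding suits a downstream consumer that adds the panes up itself; accumulating suits a consumer that overwrites the previous result with each refinement.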
Beam Project / Technical Vision
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to run Beam pipelines.
The Beam Vision
[Diagram, top to bottom]
SDKs: Beam Java, Beam Python, Other Languages
Beam Runner API: Build and submit a pipeline
Runners: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)
Beam Fn API: Invoke user-definable functions
Execution
Project Setup (vision meets code)
GoogleCloudPlatform/DataflowJavaSDK, cloudera/spark-dataflow, dataArtisans/flink-dataflow
apache/incubator-beam
SDKs
Runners: Direct (on your laptop), Google Cloud Dataflow, Flink, Spark. In pull request: Apex, Gearpump
I/O Connectors: HDFS, Kafka, BigQuery, Google Cloud Storage, Pubsub, Bigtable, Datastore. In pull request: JMS, Cassandra. Proposed: Sqoop, Parquet, JDBC, SocketStream, ...
Examples
Integration tests
[Diagram label: sharing]
… and it has been awesome
apache/incubator-beam
Committers from Google, Data Artisans, Cloudera, Talend, PayPal
● ~40 commits/week
● Rigorous code review for every commit
Contributors [with GitHub badges] from: Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your name here>
● Improvements to existing I/O connectors
● Improvements to Spark runner
● Utility classes for users
● Documentation fixes
● Bug diagnoses
● New I/O connectors
● Gearpump runner PoC
● Apex runner PoC!
Java SDK: Transition from Dataflow
[Timeline from Feb 2016 to Late 2016: Dataflow Java 1.x, Apache Beam Java 0.x, Apache Beam Java 2.x, with "We are here" marked. Legend: Bug Fix, Feature, Breaking Change]
Understanding: Capability Matrix
http://beam.incubator.apache.org/capability-matrix/
Why Apache Beam?
Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors.
Why Apache Beam?
http://data-artisans.com/why-apache-beam/
"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."
- Kostas Tzoumas (Data Artisans)
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."
- Tyler Akidau (Google)
Creating an Apache Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.
We love contributions. Join us!
Learn More!
Apache Beam
http://beam.incubator.apache.org/
Why Apache Beam? (from Data Artisans)
Why Apache Beam? (from Google)
Programming Model Overviews
Streaming 101
Streaming 102
The Dataflow Beam Model
Join the community!
User discussions - [email protected] discussions - [email protected]
@ApacheBeam on Twitter
END