Apache Flink Deep Dive
Vasia Kalavri, Flink Committer & KTH PhD student
1st Apache Flink Meetup Stockholm, May 11, 2015
Flink Internals
● Job Life-Cycle
○ what happens after you submit a Flink job?
● The Batch Optimizer
○ how are execution plans chosen?
● Delta Iterations
○ how are Flink iterations special for Graph and ML apps?
what happens after you submit a Flink job?
The Flink Stack*

Libraries: Python, Gelly, Table, Flink ML, SAMOA
APIs: DataSet (Java/Scala), DataStream (Java/Scala), Hadoop M/R
Optimizers: Batch Optimizer, Streaming Optimizer
Flink Runtime (dataflow engine)
Deployment: Local, Remote, Yarn, Tez, Embedded

*current Flink master + few PRs
DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0).aggregate(SUM, 1);
Program Life-Cycle
Flink Client & Optimizer: creates and submits the job graph
Job Manager: creates the execution graph and deploys tasks
Task Managers: execute tasks and send status updates

Example (word count):
"O Romeo, Romeo, wherefore art thou Romeo?" →
(O, 1), (Romeo, 3), (wherefore, 1), (art, 1), (thou, 1)
"Nor arm, nor face, nor any other part" →
(nor, 3), (arm, 1), (face, 1), (any, 1), (other, 1), (part, 1)
Series of Transformations

Input → [Operator X] → First → [Operator Y] → Second

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);

DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first.filter(str -> str.length() > 40);

second.print();
env.execute();
DataSet Abstraction
Think of it as a collection of data elements that can be produced/recovered in several ways:
… like a Java collection
… like an RDD
… perhaps it is never fully materialized (because the program does not need it to be)
… implicitly updated in an iteration
→ this is transparent to the user
Example: grep
Input: Romeo, Romeo, where art thou Romeo?
Plan: Load Log, then Search for str1 (Grep 1), Search for str2 (Grep 2), Search for str3 (Grep 3)
Staged (batch) execution
Same plan: Load Log, then Search for str1/str2/str3 (Grep 1, 2, 3)
Stage 1: create/cache the Log
Subsequent stages: grep the Log for matches
Caching in memory, spilling to disk if needed
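The staged mode can be sketched in plain Java. This is a single-JVM analogy, not Flink's actual runtime, and the class and method names are invented for illustration: the log is fully materialized first, and every grep stage then scans the cached collection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StagedGrep {
    // Stage 1: fully materialize (cache) the log.
    // Subsequent stages: each pattern scans the cached collection,
    // mirroring staged batch execution.
    static Map<String, List<String>> grepAll(List<String> rawLog, List<String> patterns) {
        List<String> cachedLog = new ArrayList<>(rawLog);   // stage 1: create/cache
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (String p : patterns)                           // one stage per grep
            results.put(p, cachedLog.stream()
                                    .filter(line -> line.contains(p))
                                    .collect(Collectors.toList()));
        return results;
    }

    public static void main(String[] args) {
        List<String> log = List.of("Romeo, Romeo, where art thou Romeo?",
                                   "Nor arm, nor face, nor any other part");
        System.out.println(grepAll(log, List.of("Romeo", "arm")));
    }
}
```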
Pipelined execution
Same plan: Load Log, then Search for str1/str2/str3 (Grep 1, 2, 3)
Stage 1: deploy and start all operators
Data transfer in memory, spilling to disk if needed
Note: the Log DataSet is never fully "created"!
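Pipelined execution can be loosely illustrated with Java 8 Streams (again a single-JVM analogy with invented names, not Flink's runtime): the log arrives as a lazy stream, each record flows straight through the filter operator, and only the final result is ever materialized.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PipelinedGrep {
    // The log is a lazy Stream: records flow through the filter one by
    // one as they are produced, and the full log is never materialized
    // as an intermediate collection; only the result is collected.
    static List<String> grep(Stream<String> log, String pattern) {
        return log.filter(line -> line.contains(pattern))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<String> log = Stream.of(
            "Romeo, Romeo, where art thou Romeo?",
            "Nor arm, nor face, nor any other part");
        System.out.println(grep(log, "Romeo"));
    }
}
```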
how are execution plans chosen?
Flink Batch Optimizer
Inspired by database optimizers, it creates and selects the execution plan for a user program
A Simple Program

DataSet<Tuple5<Integer, String, String, String, Integer>> orders = …
DataSet<Tuple2<Integer, Double>> lineitems = …

DataSet<Tuple2<Integer, Integer>> filteredOrders = orders
    .filter(. . .)
    .project(0, 4).types(Integer.class, Integer.class);

DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders = filteredOrders
    .join(lineitems)
    .where(0).equalTo(0)
    .projectFirst(0, 1).projectSecond(1)
    .types(Integer.class, Integer.class, Double.class);

DataSet<Tuple3<Integer, Integer, Double>> priceSums = lineitemsOfOrders
    .groupBy(0, 1).aggregate(Aggregations.SUM, 2);

priceSums.writeAsCsv(outputPath);
Alternative Execution Plans

Plan A: DataSource orders.tbl → Filter/Map (broadcast), DataSource lineitem.tbl (forward) → Join (Hybrid Hash: build HT / probe) → Combine → hash-part [0,1] → GroupReduce (sort)

Plan B: DataSource orders.tbl → Filter/Map (hash-part [0]), DataSource lineitem.tbl (hash-part [0]) → Join (Hybrid Hash: build HT / probe) → forward → GroupReduce (sort)

The best plan depends on the relative sizes of the input files.
Optimization Examples
● Evaluates physical execution strategies
○ e.g. hash join vs. sort-merge join
● Chooses data shipping strategies
○ e.g. broadcast vs. partition
● Reuses partitioning and sort orders
● Decides to cache loop-invariant data in iterations
Example: Distributed Joins

The join operator needs to create all pairs of elements from the two inputs for which the join condition evaluates to true:

case class PageVisit(url: String, ip: String, userId: Long)
case class User(id: Long, name: String, email: String, country: String)

// get your data from somewhere
val visits: DataSet[PageVisit] = ...
val users: DataSet[User] = ...

// filter the users data set
val germanUsers = users.filter((u) => u.country.equals("de"))

// join data sets
// equi-join condition (PageVisit.userId = User.id)
val germanVisits: DataSet[(PageVisit, User)] =
  visits.join(germanUsers).where("userId").equalTo("id")
Example: Distributed Joins
● Ship Strategy: the input data is distributed across all parallel instances that participate in the join
● Local Strategy: each parallel instance performs a join algorithm on its local partition

For both steps, there are multiple valid strategies which are favorable in different situations.
Repartition-Repartition Strategy
Partitions both inputs using the same partitioning function.
All elements that share the same join key are shipped to the same parallel instance and can be locally joined.
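A minimal single-machine sketch of the repartition-repartition idea (invented names; in real Flink the partitions are shipped across TaskManagers over the network): both inputs are hash-partitioned on the join key with the same function, so matching keys land in the same partition, and each "instance" then runs a plain local hash join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

public class RepartitionJoin {
    // Hash-partition records by join key with one shared partitioning
    // function, so equal keys always land in the same partition.
    static <T> List<List<T>> partition(List<T> input, ToIntFunction<T> key, int parallelism) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) parts.add(new ArrayList<>());
        for (T t : input)
            parts.get(Math.floorMod(key.applyAsInt(t), parallelism)).add(t);
        return parts;
    }

    // Records are int[]{key, value}; each "instance" p joins only its
    // own partitions with a local hash join (build side r, probe side s).
    static List<String> join(List<int[]> r, List<int[]> s, int parallelism) {
        List<List<int[]>> rParts = partition(r, a -> a[0], parallelism);
        List<List<int[]>> sParts = partition(s, a -> a[0], parallelism);
        List<String> out = new ArrayList<>();
        for (int p = 0; p < parallelism; p++) {
            Map<Integer, List<int[]>> ht = new HashMap<>();        // build hash table
            for (int[] a : rParts.get(p))
                ht.computeIfAbsent(a[0], k -> new ArrayList<>()).add(a);
            for (int[] b : sParts.get(p))                          // probe
                for (int[] a : ht.getOrDefault(b[0], List.of()))
                    out.add(a[0] + ":" + a[1] + "," + b[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> orders = List.of(new int[]{1, 10}, new int[]{2, 20});
        List<int[]> items  = List.of(new int[]{1, 5},  new int[]{3, 7});
        System.out.println(join(orders, items, 2)); // only key 1 appears in both
    }
}
```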
Broadcast-Forward Strategy
Sends one complete data set to each parallel instance that holds a partition of the other data.
The other DataSet remains local and is not shipped at all.
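A matching sketch of broadcast-forward, with the same caveats (single JVM, invented names): the small input is copied into every instance's probe table, while the large input stays in its existing partitions and is scanned locally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastJoin {
    // The small input is "broadcast": copied to every parallel instance.
    // The large input is "forwarded": each partition stays where it is
    // and is probed locally against the broadcast table.
    static List<String> join(List<List<int[]>> largePartitions, List<int[]> small) {
        Map<Integer, Integer> broadcast = new HashMap<>();
        for (int[] s : small) broadcast.put(s[0], s[1]);   // shipped to every instance
        List<String> out = new ArrayList<>();
        for (List<int[]> local : largePartitions)          // each instance, local data only
            for (int[] l : local)
                if (broadcast.containsKey(l[0]))
                    out.add(l[0] + ":" + l[1] + "," + broadcast.get(l[0]));
        return out;
    }

    public static void main(String[] args) {
        List<List<int[]>> large = List.of(
            List.of(new int[]{1, 10}), List.of(new int[]{2, 20}));
        List<int[]> small = List.of(new int[]{1, 5});
        System.out.println(join(large, small)); // only key 1 matches
    }
}
```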
How does the Optimizer choose?

The optimizer computes cost estimates for the execution plans and picks the "cheapest" plan, considering:
● the amount of data shipped over the network
● whether the data of one input is already partitioned

R-R cost: full shuffle of both data sets over the network.
B-F cost: depends on the size of the broadcast data set and the number of parallel instances.

Read more: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
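The trade-off can be captured with back-of-the-envelope cost formulas. These are a deliberate simplification of the optimizer's real estimates (described in the blog post above), counting only records shipped over the network; the class and method names are invented.

```java
public class JoinCostModel {
    // Rough network cost of repartition-repartition: both inputs are
    // shuffled once in full.
    static long repartitionCost(long sizeR, long sizeS) {
        return sizeR + sizeS;
    }

    // Rough network cost of broadcast-forward: the small input is copied
    // to every parallel instance; the large input is not shipped at all.
    static long broadcastCost(long sizeSmall, int parallelism) {
        return sizeSmall * parallelism;
    }

    static String cheaperPlan(long sizeR, long sizeS, int parallelism) {
        long small = Math.min(sizeR, sizeS);
        return broadcastCost(small, parallelism) < repartitionCost(sizeR, sizeS)
            ? "broadcast-forward" : "repartition-repartition";
    }

    public static void main(String[] args) {
        // tiny filtered input joined with a big input: broadcast wins
        System.out.println(cheaperPlan(1_000, 10_000_000, 100));
        // two similarly sized inputs: repartition wins
        System.out.println(cheaperPlan(5_000_000, 10_000_000, 100));
    }
}
```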
how are Flink iterations special?
Iterate by unrolling
● a for/while loop in the client submits one job per iteration step
● data reuse by caching in memory and/or on disk

Client → Step → Step → Step → Step → Step
Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically

Caching loop-invariant data: pushing work "out of the loop"
Maintaining state as an index
Flink Iteration Operators

Iterate: Input → Iterative Update Function → Result; the entire partial solution is fed back and replaced on every iteration.

IterateDelta: Input is split into a Workset and a Solution Set; the Iterative Update Function consumes the Workset and updates the Solution Set, which holds the State, and the Result is read from the Solution Set.
Delta Iteration
● Not all elements of the state are updated in each iteration.
● The elements that require an update are stored in the workset.
● The step function is applied only to the workset elements.
Example: Connected Components
Partition a graph into components by iteratively propagating the minimum vertex ID among neighbors.
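A single-JVM sketch of connected components as a delta iteration (invented names; in Flink the solution set and workset are distributed data sets): the solution set maps each vertex to its current component ID, and the workset holds only the vertices whose ID just changed, so the work per superstep shrinks as components converge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeltaConnectedComponents {
    // Edges are int[]{src, dst} of an undirected graph; vertices are 0..n-1.
    static Map<Integer, Integer> run(int[][] edges, int numVertices) {
        Map<Integer, List<Integer>> adjacency = new HashMap<>();   // loop-invariant
        for (int[] e : edges) {
            adjacency.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adjacency.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        Map<Integer, Integer> solutionSet = new HashMap<>();       // vertex -> component ID
        Set<Integer> workset = new HashSet<>();
        for (int v = 0; v < numVertices; v++) { solutionSet.put(v, v); workset.add(v); }
        while (!workset.isEmpty()) {                 // terminates when nothing changes
            Set<Integer> nextWorkset = new HashSet<>();
            for (int v : workset)                    // step function: changed vertices only
                for (int n : adjacency.getOrDefault(v, List.of()))
                    if (solutionSet.get(v) < solutionSet.get(n)) {
                        solutionSet.put(n, solutionSet.get(v));    // update solution set
                        nextWorkset.add(n);                        // n changed: re-examine
                    }
            workset = nextWorkset;
        }
        return solutionSet;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        System.out.println(run(edges, 5)); // components {0, 1, 2} and {3, 4}
    }
}
```

Component IDs only ever decrease and every decrease re-enqueues the vertex, so the loop reaches a fixed point with each vertex labeled by the minimum ID in its component.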
Delta-Connected Components
Performance
Want to learn more?
Read the documentation and our blog posts!
● Memory Management
● Serialization and Type Extraction
● Streaming Optimizations
● Fault-Tolerance