Generating example tuples for
Data-Flow programs in Apache Flink
Master Thesis
by
Amit Pawar
Submitted to Faculty IV, Electrical Engineering and Computer Science
Database Systems and Information Management Group
in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science
as part of the ERASMUS MUNDUS programme IT4BI
at the
Technische Universität Berlin
July 31, 2015
Thesis Advisor:
Johannes Kirschnick
Thesis Supervisor:
Prof. Dr. Volker Markl
Eidesstattliche Erklärung
Ich erkläre an Eides statt, dass ich die vorliegende Arbeit selbstständig verfasst,
andere als die angegebenen Quellen/Hilfsmittel nicht benutzt, und die den be-
nutzten Quellen wörtlich und inhaltlich entnommenen Stellen als solche kenntlich
gemacht habe.
Statutory Declaration
I declare that I have authored this thesis independently, that I have not used other
than the declared sources/resources, and that I have explicitly marked all material
which has been quoted either literally or by content from the used sources.
Berlin, July 31, 2015
Amit Pawar
GENERATING EXAMPLE TUPLES FOR DATA-FLOW
PROGRAMS IN APACHE FLINK
by Amit Pawar
Database Systems and Information Management Group
Electrical Engineering and Computer Science
Masters in Information Technology for Business Intelligence
Abstract
Dataflow programming is a programming paradigm in which computational logic is
modeled as a directed graph from the input data sources to the output sink. The
intermediate nodes between sources and sink act as processing units that define
what action is to be performed on the incoming data. Due to its inherent support
for concurrency, dataflow programming is a natural choice for many data-intensive
parallel processing systems and is used extensively in the current big-data
market. Among the wide range of parallel processing platforms available, Apache
Hadoop (with its MapReduce framework), Apache Pig (running on top of MapReduce in
the Hadoop ecosystem), and Apache Flink and Apache Spark (each with their own
runtime and optimizer) are examples that leverage the dataflow programming style.
Dataflow programs can handle terabytes of data and perform efficiently, but
data at such a large scale introduces difficulties such as understanding the
complete dataflow (what is the output of the dataflow or of any intermediate node),
debugging (it is impractical to track large-scale data throughout the program using
breakpoints or watches), and visual representation (it is quite difficult to display
terabytes of data flowing through the tree of nodes).
This thesis aims to address these limitations for dataflow programs on the Apache
Flink platform using the concept of Example Generation, a technique that generates
sample example tuples after each intermediate operation from source to sink. This
allows the user to view and validate the behavior of the underlying operators and
thus of the overall dataflow. We implement the example generator algorithm for a
defined set of operators and evaluate the quality of the generated examples. For
ease of visual representation of the dataflow, we integrate this implementation
with the Interactive Scala Shell available in Apache Flink.
Acknowledgements
I would like to express my deep sense of gratitude to my advisor Johannes Kirschnick
for his excellent guidance, scientific advice and constant encouragement through-
out this thesis work. His apt assistance helped me understand and tackle many
challenges during the research, implementation and writing of this thesis.
I would like to thank Prof. Dr. Volker Markl and Dr. Ralf-Detlef Kutsche for
their kind co-ordination efforts. I would also like to thank Nikolaas Steenbergen,
Stephan Ewen and all the members of Flink dev user group for their kind advice
and clarifications on understanding the platform.
I express my sincere thanks to all the staff and professors of the IT4BI programme
for their help and support during my Master's degree.
Finally, I would like to thank all my friends from the IT4BI programme, generations
I and II, for their constant support and the cherished memories of the past two
years.
Contents
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
   1.1 Motivation
   1.2 Goal
   1.3 Outline
2 Dataflow Programming
   2.1 Apache Hadoop
   2.2 Apache Pig
   2.3 Apache Flink
   2.4 Comparison of Big-Data Analytics Systems
   2.5 Limitations of Dataflow Programming
3 Example Generation
   3.1 Definition
   3.2 Dataflow Concepts
   3.3 Example Generation Properties
   3.4 Example Generation Techniques
      3.4.1 Downstream Propagation
      3.4.2 Upstream Propagation
      3.4.3 Example Generator Algorithm
   3.5 Example Generation in Apache Flink
4 Implementation
   4.1 Operator Tree
      4.1.1 Single Operator
      4.1.2 Constructing the Operator Tree
   4.2 Supported Operators
   4.3 Equivalence Class Model
   4.4 Algorithm
      4.4.1 Overview
      4.4.2 Downstream Pass
      4.4.3 Upstream Pass
      4.4.4 Pruning Pass
   4.5 Flink Add-ons
      4.5.1 Using Semantic Properties
      4.5.2 Interactive Scala Shell
5 Evaluation
   5.1 Performance Metrics
      5.1.1 Completeness
      5.1.2 Conciseness
      5.1.3 Realism
   5.2 Experiments
      5.2.1 Setup and Datasets
      5.2.2 Quality of Generated Examples
      5.2.3 Running Time
6 Related Work
7 Conclusion
   7.1 Discussion and Future Work
A Appendix
   A.1 Word-Count Example in different Dataflow Systems
      A.1.1 Apache Pig
      A.1.2 Apache Hadoop
      A.1.3 Apache Flink
   A.2 Sample Scala Script for Flink's Interactive Scala Shell
   A.3 Experiment's Dataflow Programs
   A.4 Flink-Illustrator Source Code
Bibliography
List of Figures
3.1 Sample Dataflow that returns highly populated countries
3.2 Sample Dataflow with output examples
3.3 Dataflow Concepts
3.4 Ideal set of output for sample dataflow
3.5 Incomplete set of examples
4.1 Single Operator Class Diagram
4.2 Tree Construction Approaches
4.3 Sample Dataflow to retrieve users visiting high page rank websites
4.4 Downstream pass on Sample Dataflow
4.5 Empty Equivalence class as a result of only Downstream pass
4.6 Upstream Pass: Synthetic records introduction
4.7 Upstream Pass: Synthetic records converted to Real records
4.8 Pruning Pass: Start
4.9 Pruning Pass: After pruning sink operator
4.10 Pruning Pass: Complete pruning (Example 1)
4.11 Pruning Pass: Complete pruning (Example 2)
4.12 Forwarded Fields Annotations
5.1 Completeness of Filter operator
5.2 Realism of Filter operator
5.3 Evaluation of Programs 1 to 6 and 13
5.4 Evaluation of Programs 7 to 9
5.5 Evaluation of Programs 10 to 12
5.6 Flink Illustrator Runtime comparison
List of Tables
2.1 Comparison of Big-Data systems in the context of Dataflow Programming
4.1 Description of SingleOperator properties
5.1 Locally created DataSet used for evaluation purposes
5.2 Experiment Datasets' descriptions and sizes
Chapter 1
Introduction
Dataflow programming has gained popularity with the booming big-data market
over the past decade. It is a data processing paradigm that makes data movement
the focal point, in contrast to traditional programming models (object-oriented,
imperative, or procedural), where passing control from one object to another
forms the basis of the programming model. Distributed data processing frameworks
leverage dataflow programming by splitting a large dataset and distributing it
across different computing nodes, where each node performs the dataflow
operations on its local data. A single dataflow program can be seen as a directed
graph from the source nodes to the sink nodes, where source nodes represent the
input datasets consumed and sink nodes represent the output datasets generated by
executing the dataflow program. All the intermediate nodes between source and
sink act as processing units that define what action is to be performed on the
incoming data. These actions can be categorized as either: i. general relational
algebra operators (e.g., join, cross, project, distinct), or ii. user-defined
function operators (e.g., flatmap, map, reduce). Apache Hadoop [1] with its
MapReduce [2] framework uses the dataflow programming style. Apache Flink [3] is a
distributed streaming dataflow engine that falls into a newer category of big-data
frameworks, an alternative to Hadoop's MapReduce, along with Apache Spark [4].
Other dataflow programming systems include Apache Pig [5], Aurora [6], Dryad
[7], River [8], Tioga [9] and CIEL [10].
This thesis is based on dataflow programs from Apache Flink, where we generate
sample examples after each intermediate node (operator) from source to sink,
allowing the Flink user to:
1. View and validate the behavior of the underlying set of operators and thus
understand and learn the complete dataflow
2. Optimize the dataflow by determining the correct set of operators to achieve
the final output
3. Monitor iterations in case of an iterative KDDM (Knowledge discovery and
Data Mining) algorithm
4. Understand the behavior of user-defined function (UDF) operators (otherwise
seen as black boxes)
1.1 Motivation
Dataflow programming, like any other programming paradigm, is an iterative,
incremental process. To arrive at a final, correct version of the dataflow, the
user may need more than one iteration. Each iteration consists of steps such as
coding the dataflow, building and executing it on the respective system, and
finally analyzing the output or the error log to determine whether the dataflow
produced the expected outcome; if not, the user revises the program (normally via
debugging) and repeats with a new iteration. Any programming paradigm incurs long
execution and testing times when dealing with large-scale datasets, and the same
applies to dataflow programming. Hence, the iterative model of development on
large-scale data is time consuming and therefore inefficient. The whole process
can be made more efficient if the user can verify the underlying execution of the
dataflow, i.e., verify the execution at each node (operator) in the dataflow.
This can be done by checking the dataset that is consumed and generated at any
given operator, which in turn allows the user to pinpoint and rectify the logic
or error (if any). Visualizing the dataflow with an example dataset after each
operator allows the user to test the assumptions made in the program and, in a
way, removes the need for debugging with breakpoints and watches. The overall
process of example generation after each operator permits the user to learn about
the logic of the individual operators as well as the complete dataflow program.
A dataflow program is a tree of operators from source to sink, where one or more
sources (leaves) converge into a single sink (root) via different intermediate
nodes. To obtain optimal performance from the program, choosing the appropriate
operator type is of utmost importance. For example, the Join operator in Apache
Flink can be executed in multiple ways, such as Repartition or Broadcast,
depending on the input sizes and order. With these options, the user can select
the best operator to optimize the overall performance. Similarly, the user can
decide to replace Join, or any costly operator, with a transformation, an
aggregation, or another cheaper operator as long as the final objective is
accomplished. Thus dataflow programming can be seen as a plug-and-play model,
where the user can plug/unplug (add/remove) operators and then play (execute) the
complete dataflow in order to decide on the most optimal set of operators for the
final version of the program. In such a scenario, having a concise set of
examples (instead of large datasets) that completely illustrates the dataflow is
of great help to the user, as it avoids the costly and time-consuming execution
of the program after each plugging or unplugging.
Dataflow programming frameworks such as Flink and Spark are well suited for
the implementation of machine-learning algorithms for Knowledge Discovery and
Data Mining (KDDM). KDDM is itself an iterative process, in which a model
(classification, clustering, etc.) is repeatedly trained for better prediction
accuracy; for example, the k-means clustering algorithm, an iterative refinement
process, is used in many data mining and related applications. In the k-means
algorithm, given an input of k clusters and n observations, the observations are
iteratively allotted to the most appropriate cluster based on the computation of
the cluster means. This problem is NP-hard, and training such predictive models
can be a cumbersome process requiring constant fine-tuning (via re-sampling,
dataset feature changes, or mathematical re-computation). The overall modeling
process in a dataflow environment can be complicated [11] and is often pictured
as a black box, because the user has little insight into how a newly tuned
dataflow might behave and has to wait for the entire process to execute before
the output can be verified and the necessary action taken. Having sample examples
that quickly demonstrate the dataflow is of great help to the user, as it opens
up the black box and provides a quick, efficient way of fine-tuning the dataflow
and, in turn, the predictive model.
Among the operators available in dataflow programming, UDF operators pose a
readability problem to a new user (one who is new to the system or the code), as
it can be difficult to guess the purpose of a given UDF: it could be a grouping
function, a transformation function, or a partition function. To interpret a UDF
operator, one needs to investigate it at the code level, which can be a tedious
task for a new user. If, instead, the same user has access to the input consumed
and the output generated by the UDF operator, it is easier to reason about the
behavior of that operator.
The idea of presenting examples after each operator has already been realized and
is used for testing and diagnostic purposes in Apache Pig. Apache Pig, a
high-level dataflow language and execution framework for parallel computation,
runs on top of the MapReduce framework in the Hadoop ecosystem. Pig features the
Pig Latin language layer that simplifies MapReduce programming via its set of
operators. One of its diagnostic operators is ILLUSTRATE 1, which displays sample
examples after each statement in a step-by-step execution of a sequence of
statements (a dataflow program), where each statement represents an operator. It
allows the user to review how data is transformed through a sequence of Pig Latin
statements (forming a dataflow). This feature is included in Pig's testing and
diagnostics package, as it allows the user to test dataflows on small datasets
and get faster turnaround times. In this thesis, we enhance the understanding of
large-scale dataflow program execution by introducing a new feature in Apache
Flink that provides the user with sample examples at the operator level, similar
to the ILLUSTRATE function in Apache Pig.
This thesis work, like Apache Pig, is based on an example generator [12]. The
algorithm used in the example generator works by retrieving a small sample of the
input data and then propagating this data through the dataflow, i.e., through the
operator tree. However, some operators, such as JOIN and FILTER, can eliminate
these sample examples from the dataflow. For example, consider an operator
performing a join of two input datasets A(x,y) and B(x,z) on the common attribute
x (the join key): if both A and B contain many distinct values of x, the initial
samples of A and B may well have no x values in common. Hence, the join may
produce no output example due to the absence of a common join key in the two
samples. Similarly, a filter operator executed on a sample dataset may produce an
empty result if no input example satisfies the filtering predicate. To address
such issues, the algorithm used in [12] generates synthetic example data, which
allows the user to examine the complete semantics of the given dataflow program.
1http://pig.apache.org/docs/r0.15.0/test.html#illustrate
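To make the sampling problem concrete, the following is a minimal, self-contained Python sketch (an illustrative analogy, not the thesis implementation; the datasets and the synthetic-record construction are invented for this example) showing how independently sampled inputs can yield an empty join, and how introducing a synthetic record repairs it:

```python
import random

# Two "large" datasets sharing join key x: A(x, y) and B(x, z).
A = [(x, "y%d" % x) for x in range(1000)]
B = [(x, "z%d" % x) for x in range(1000)]

random.seed(7)
sample_A = random.sample(A, 3)   # small initial samples, drawn independently
sample_B = random.sample(B, 3)

def join(left, right):
    # Equi-join on the first field (the join key x).
    return [(x, y, z) for (x, y) in left for (x2, z) in right if x == x2]

# With many distinct x values, the samples very likely share no key,
# so the join produces no output example at all.
print(join(sample_A, sample_B))

# Upstream repair: synthesize a record in B that matches a sampled key of A,
# so the join operator is illustrated by at least one output example.
x, _ = sample_A[0]
synthetic_B = sample_B + [(x, "z-synthetic")]
print(join(sample_A, synthetic_B))  # now non-empty
```

The same pattern applies to a filter: if no sampled record satisfies the predicate, a synthetic record satisfying it can be injected upstream.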
1.2 Goal
This thesis concentrates on dataflow programs in Apache Flink. Our main goal
is to implement an ILLUSTRATE-like example generator feature, as found in Pig, in
Flink, such that it helps users tackle the problems faced in dataflow programming
in a big-data environment. As this thesis is oriented towards the implementation
of a new feature in Flink, we try to answer the following questions:
1. What are the challenges in implementing the example generator, and how can
they be tackled?
2. How can the concept of an example generator be realized in Flink? What
inputs are required for the implementation, and how can they be obtained
from Flink?
3. What extra features of Flink can be exploited in order to differentiate this
implementation from others?
4. How well does the implemented algorithm perform with respect to the defined
metrics?
1.3 Outline
The outline lays out the approach we followed for the thesis, which subsequently
helped us answer the questions raised in the previous section.
• A comprehensive study of dataflow programming and the systems that take
advantage of this paradigm (Chapter 2)
• A broad introduction to the example generation problem, challenges and the
approach towards the solution (Chapter 3)
• Implementation of the example generator algorithm in Flink, with extensive
description of the input construction, various algorithm steps and the Flink
features used (Chapter 4)
• Experimenting and evaluating the performance of the implemented algorithm
against the defined metrics by executing the implemented code on
different dataflow programs and diverse datasets, covering the operators as
well as the stages of the algorithm (Chapter 5)
• A brief overview of the related works in the field of example generation
(Chapter 6)
• We conclude by discussing the results and findings (Chapter 7)
Chapter 2
Dataflow Programming
Dataflow programming introduces a new programming paradigm that internally
represents applications as a directed graph, similar to a dataflow diagram [13].
A program is represented as a set of nodes (also called blocks) with input and/or
output ports. These nodes act as sources, sinks, or processing blocks for the
data flowing through the system. Nodes are connected by directed edges that
define the flow of data between them. This is reminiscent of the Pipes-and-Filters
architectural model, where filters are the processing nodes and pipes serve as
the passage for the data streams between the filters.
One of the main reasons dataflow programming surged with the rise of big data is
its default support for concurrency, which allows increased parallelism. In a
dataflow program, each node is internally an independent processing block, i.e.,
each node can process on its own once it has its respective input data. This kind
of execution enables data streaming: an intermediate node in the dataflow starts
working as soon as data arrives from the previous node and transfers its output
to the next node, avoiding the need to wait for the previous node to finish its
complete execution.
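This pipelined style of execution can be mimicked with Python generators (a hand-rolled analogy, not the runtime of any of the systems discussed here): each stage pulls records from the previous stage and emits results as soon as they are available, so no stage materializes its full input before the next stage begins.

```python
def source():
    # Source node: emits one record at a time.
    for record in ["the quick", "brown fox", "the dog"]:
        yield record

def tokenize(stream):
    # Processing node: starts as soon as the first line arrives.
    for line in stream:
        for word in line.split():
            yield word

def filter_short(stream):
    # Another processing node, consuming the tokenizer's output lazily.
    for word in stream:
        if len(word) > 3:
            yield word

# The chain source -> tokenize -> filter_short is a tiny streaming dataflow.
result = list(filter_short(tokenize(source())))
print(result)  # ['quick', 'brown']
```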
In a big-data environment, the aforementioned characteristics, parallelism and
streaming, allow dataflow programs to be run on a distributed system with
computer clusters. Such a system splits the large dataset into smaller chunks and
distributes them across the cluster nodes; each node then executes the dataflow
on its local chunk. The sub-results produced after execution at all the nodes are
combined to obtain the final result for the large dataset. This is the gist of
how a dataflow program is executed on a big-data system with a distributed
processing environment.
In this chapter, we discuss different big-data analytics systems that take
advantage of the dataflow programming paradigm.
2.1 Apache Hadoop
Apache Hadoop [1] is a big-data framework that allows distributed processing of
large-scale datasets across clusters of computers using simple programming
models. Hadoop leverages dataflow programming via its MapReduce module. Hadoop
MapReduce is a software framework for writing applications that process vast
amounts of data (multi-terabyte datasets) in parallel on large clusters in a
reliable, fault-tolerant manner. A typical MapReduce program consists of a chain
of mappers (with a Map method) and reducers (with a Reduce method), in that
order. The Map and Reduce methods are second-order functions that consume an
input dataset (either the whole dataset or a part of a larger one) and a
user-defined function (UDF) that is applied to the corresponding input data. In a
MapReduce program, the mappers and reducers form the processing nodes of the
dataflow, with sources and sinks (of the respective formats) explicitly specified.
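The second-order nature of Map and Reduce can be sketched in plain Python (an illustrative analogy, not Hadoop's actual API): the framework supplies the iteration and the grouping "shuffle", while the user supplies only the two UDFs.

```python
from collections import defaultdict

def map_phase(records, map_udf):
    # The framework applies the user-defined map function to every record.
    pairs = []
    for record in records:
        pairs.extend(map_udf(record))
    return pairs

def reduce_phase(pairs, reduce_udf):
    # The framework groups intermediate pairs by key (the "shuffle"),
    # then applies the user-defined reduce function to each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_udf(key, values) for key, values in groups.items()}

# Word-count expressed purely through the two UDFs:
word_mapper = lambda line: [(word, 1) for word in line.split()]
count_reducer = lambda word, ones: sum(ones)

lines = ["to be or", "not to be"]
counts = reduce_phase(map_phase(lines, word_mapper), count_reducer)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Full Word-Count implementations in the actual frameworks are given in Appendix A.1.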
Despite its ability to scale to very large datasets, previous studies have
reported limitations of Hadoop MapReduce; the issues that relate to dataflow
programming are as follows:
• Lack of support for dedicated relational algebra operators such as Join,
Cross, Union and Filter [14, 15]. These operators are frequently used in many
iterative algorithms; e.g., Join is an important operator in the PageRank
algorithm. This forces the user to custom-code the logic for these common
operations, and the programmer needs to think in terms of map and reduce to
implement them.
• Lack of inherent support for iterative programming [16], an integral feature
of any dataflow programming framework and useful for many data analytics
algorithms, e.g., the k-means algorithm, an iterative clustering process
of finding k clusters. The workaround for iteration in MapReduce is an
external program that repeatedly invokes the mappers and reducers. But this
comes with the added overhead of increased IO latency and serialization
issues, with no explicit support for specifying a termination condition.
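The external-driver workaround can be caricatured in Python (purely illustrative; the "job" below stands in for a full MapReduce launch): a client-side loop restarts a job per iteration and checks convergence itself, re-reading and re-writing state on every round, which is exactly where the IO and serialization overhead comes from.

```python
def run_mapreduce_job(centroids, points):
    # Stand-in for one full MapReduce job: assign each point to the nearest
    # 1-D centroid (map), then recompute each centroid as the mean (reduce).
    clusters = {c: [] for c in centroids}
    for p in points:
        nearest = min(centroids, key=lambda c: abs(c - p))
        clusters[nearest].append(p)
    return [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [0.0, 10.0]

# The driver: MapReduce itself has no loop or termination condition, so the
# client launches one job per iteration and tests convergence by hand.
for _ in range(20):
    new_centroids = run_mapreduce_job(centroids, points)
    if new_centroids == centroids:   # hand-rolled termination criterion
        break
    centroids = new_centroids

print(centroids)  # converges to approximately [1.0, 8.0]
```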
The first limitation, the lack of relational operators, is addressed in the
Hadoop ecosystem through another module, Apache Pig.
2.2 Apache Pig
Apache Pig is a high-level dataflow language and execution framework for parallel
computation in Apache Hadoop. Apache Pig [5] was initially developed at Yahoo
to allow Hadoop users to focus on analyzing datasets rather than investing time
in writing complex code using the map and reduce operators. Internally, Pig is an
abstraction over MapReduce, i.e., all Pig scripts are converted into Map and
Reduce tasks by its compiler. Thus, in a way, Pig makes programming MapReduce
applications easier. The language of the platform is a simple scripting language
called Pig Latin [17], which abstracts the Java MapReduce idiom into a form
similar to SQL. Pig Latin allows the user to write a dataflow that describes how
the data will be transformed (via aggregation, join, or sort), as well as to
develop their own functions (UDFs) for reading, processing and writing data.
A typical Pig program consists of the following steps:
1. LOAD the data for manipulation.
2. Run the data through a set of transformations. These transformations can
be either relational algebra transformations (join, cross, filter, etc.) or
user-defined functions. (All transformations are internally translated to
Map and Reduce tasks by the compiler.)
3. DUMP (display) the data to the screen or STORE the results in a file.
When relating a Pig program to a dataflow, the LOAD operators are the source
nodes, the set of transformations forms the processing nodes, and DUMP or STORE
are the sink nodes.
Pig addresses the limitations of MapReduce by providing a suite of relational
operators for ease of data manipulation [5]. However, in order to achieve
iteration as well as other control-flow structures (if-else, for, while, etc.),
one needs to use Embedded Pig, where Pig Latin statements and Pig commands are
nested in scripting languages such as Python, JavaScript or Groovy. This reduces
the simplicity of programming in Pig (one of its major selling points), as it
introduces a JDBC-like compile, bind, run model that adds the extra overhead of
complex invocations. As Pig translates all the statements and commands from its
scripts into MapReduce tasks, it is considered slower than well-written
MapReduce code 1. Although Pig offers better scope for optimization than
MapReduce, an optimized Pig script can perform on par with MapReduce code.
2.3 Apache Flink
Apache Flink (formerly known as Stratosphere) [3] is a data processing system and
an alternative to Hadoop's MapReduce module. Unlike Pig, which runs on top of
MapReduce, Flink comes with its own runtime, a distributed streaming dataflow
engine that provides data distribution and communication for distributed
computations over data streams. It features powerful programming abstractions in
multiple languages, Java and Scala, thus giving the user different language
options for programming a dataflow. Flink also supports automatic program
optimization, allowing the user to focus on other data handling issues. It has
native support for iterative programming via iteration operators such as bulk
iterations and incremental iterations. It also supports programs consisting of
large directed acyclic graphs (DAGs) of operations.
One of the essential components of the Flink framework is the Flink optimizer
[18], which provides automatic optimization for a given Flink job and offers
techniques to minimize the amount of data shuffling, thus formulating an
optimized data processing pipeline. The Flink optimizer is based on the PACT [19]
(Parallelization Contract) programming model, which extends the concepts of
MapReduce but is also applicable to more complex operations. This allows Flink to
support relational operators such as Join, Cross, Union, etc. The output of the
Flink optimizer is a compiled and optimized PACT program, which is nothing but a
DAG-based dataflow program. This is how dataflow programming is achieved in
Apache Flink.
1https://cwiki.apache.org/confluence/display/PIG/PigMix
A typical Flink program consists of the same basic steps:
1. Load/Create the initial data.
2. Specify transformations of this data.
3. Specify where to put the results of your computations, and
4. Trigger the program execution.
In Step 1 we specify the source nodes of the dataflow program, the
transformations of Step 2 form the processing nodes, and in Step 3 we specify the
sink nodes.
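The lazy character of these four steps, where transformations only build up the dataflow graph and nothing runs until execution is triggered, can be sketched in plain Python (an analogy for illustration, not Flink's actual API; the class and method names are invented):

```python
class DataSetSketch:
    # A toy lazily-evaluated dataset: operators are recorded, not executed.
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # the dataflow plan, as a list of ops

    def map(self, fn):
        return DataSetSketch(self._data, self._plan + [("map", fn)])

    def filter(self, pred):
        return DataSetSketch(self._data, self._plan + [("filter", pred)])

    def collect(self):
        # Step 4: triggering execution runs the whole recorded plan.
        records = list(self._data)
        for op, fn in self._plan:
            if op == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records

env = DataSetSketch([1, 2, 3, 4])                            # Step 1: initial data
result = env.map(lambda x: x * 10).filter(lambda x: x > 15)  # Step 2: transformations
print(result.collect())  # Steps 3-4: sink + trigger -> [20, 30, 40]
```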
Flink's runtime natively supports iterative programming [20], a feature lacking
in MapReduce and not easily accessible in Pig, through its different types of
iteration operators: Bulk and Delta. These operators encapsulate a part of the
program and execute it repeatedly, feeding the result of one iteration back into
the next. They also allow the termination criterion to be declared explicitly.
Such an iterative processing system makes the Flink framework extremely fast for
data-intensive and iterative jobs compared to Hadoop's MapReduce and Apache Pig.
2.4 Comparison of Big-Data Analytics Systems
The implementation of a sample dataflow program, Word-Count (which reads text
from files or a string variable and counts how often each word occurs), is
presented in the Appendix for each framework (Hadoop MapReduce A.1.2, Pig A.1.1,
and Flink A.1.3), in order to observe the respective programming techniques and
to get an idea of the effort needed to code the same program.
As shown in Table 2.1, Apache Flink has a clear advantage over the other two
systems, with its faster processing times and native support for relational
operators as well as iterative programming. This makes Flink a natural choice
when implementing a dataflow program for a large-scale dataset.
                       | Apache Hadoop (MR)        | Apache Pig                | Apache Flink
Framework              | MapReduce                 | MapReduce                 | Flink optimizer and Flink runtime
Languages Supported    | Java, C++, etc.           | Pig Latin                 | Java, Scala, Python (beta)
Processing Nodes       | Mappers and Reducers      | Suite of operators, UDFs  | Suite of operators
Ease of Programming    | Simple but tricky when    | Simple                    | Simple
                       | implementing join or      |                           |
                       | similar operators         |                           |
Relational Operators   | Not natively supported,   | Natively supported        | Natively supported
                       | implemented via MR        |                           |
Iterative Programming  | Not natively supported    | Supported via Embedded    | Natively supported
                       |                           | Pig                       |
Processing Time        | Fast                      | Slow                      | Fastest

Table 2.1: Comparison of Big-Data systems in the context of Dataflow Programming
2.5 Limitations of Dataflow Programming
As discussed in [13], the major limitations of the dataflow programming paradigm
are visual representation and debugging. In a big-data environment, with
large-scale data flowing, these limitations are even more severe [21]. It is
quite impractical to visually represent terabytes of data flowing through the
tree of operators, and to denote what action is being performed on that data at
each operator.
For tracking an error in a program, the user often introduces breakpoints and
watches to monitor the flow of execution with changes in the variable values or
the output data. Although this approach of monitoring is futile when it comes to
data of such vast size.
This thesis addresses these limitations in the Apache Flink environment.
For a Flink dataflow program, we generate a concise set of example data at each
node in the operator tree, such that the user can validate the behavior of each
operator as well as the complete dataflow. To an extent, this eliminates the need
for debugging (via breakpoints, watches, etc.), as the user can diagnose an error,
after observing the flow of sample examples through the dataflow, by locating the
problematic operator in the tree and rectifying the logic at that very operator.
A visual representation of a dataflow is available in Flink via its Web-Client
interface, the Flink job submission interface, though the flow of data is not part
of this representation. This thesis work integrates with Flink’s Interactive
Scala Shell, a new Flink feature, to display the set of examples for a dataflow
program executed via the interactive shell, thus making the generated examples
visually accessible to users.
In this way we address the limitations of the dataflow programming paradigm in a
big-data system, Apache Flink. In the next chapter, we define the example
generation problem and explain the different approaches to generating a concise
set of examples that allows the user to reason about the complete dataflow.
Chapter 3
Example Generation
In this chapter, we describe the problem of generating example records for dataflow
programs, present the existing approaches together with their challenges and
drawbacks, and define the theoretical terms and concepts used throughout this
thesis, with a brief introduction to the algorithm that improves on the drawbacks
of the existing approaches.
3.1 Definition
A dataflow program is a directed graph G = (V, E), where V is the set of nodes
denoting operators (sources, data transformation/processing nodes, sink) and E is
the set of directed edges denoting the flow of data from one operator to the next.
Example generation is the process of producing a concise set of examples
after each operator in the dataflow, such that it allows a user to understand the
complete semantics of the dataflow program. Let us demonstrate this concept
using an input dataflow example that returns a list of highly populated countries.
Figure 3.1: Dataflow that returns highly populated countries
Figure 3.1 is a dataflow that LOADs two datasets, Countries (ISO Code, Name)
and Cities (Name, Country, Population1), and performs a JOIN on the attribute
name/country (Countries/Cities). The joined result is then grouped by country,
and an aggregation (SUM) is performed on population. We then filter out the
countries with a population less than or equal to 4 million and present the
remaining list as output.
Example generation, when applied to the dataflow from Figure 3.1, gives us the
output shown in Figure 3.2, where we have sample examples after each operator
to facilitate the user’s understanding of the operators as well as the complete
dataflow.
Figure 3.2: Sample output of Example Generation on the dataflow from Figure 3.1
This allows the user to verify the behavior of each operator just by looking at the
output sample, e.g., verifying whether the aggregation has the intended effect
or whether the filter behaves correctly. Instead of cross-checking the whole logic,
the user can now target only the faulty operators.
Figure 3.1 and Figure 3.2 are respectively considered as the input and the output
of any example generation algorithm.
Let us now briefly explain the terms used throughout this thesis with respect to
the dataflow example from Figure 3.1.
1 In millions
3.2 Dataflow Concepts
Source: Load operators are the sources in the above dataflow, as they read the
data from the input files/tables/collections.
Sink: Operators that produce the final result (by storing it in a file or
displaying it) are Sinks. Filter is the sink in the above dataflow.
Downstream pass: Starting from Sources we move in the direction toward Sink,
i.e., from Loads to Filter.
Upstream pass: Starting from Sink we move in the direction toward Sources,
i.e., from Filter to Loads.
Downstream operator/neighbor: If Operator1 consumes the output of Operator2,
Operator1 is the downstream neighbor of Operator2. Join is the downstream
operator of both Loads; similarly, Group is the downstream neighbor of Join,
and Filter of Group.
Upstream operator/neighbor: If Operator1 consumes the output of Operator2,
Operator2 is the upstream neighbor of Operator1. Group is the upstream neighbor
of Filter; similarly, Join is of Group, and the Loads are of Join.
Operator Tree: The complete chain of operators from sources to sink makes up
an operator tree. The operator tree is the representation of the given dataflow
program in terms of the operators used.
All these concepts are visualized in Figure 3.3.
Figure 3.3: Concepts related to a dataflow program
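The upstream/downstream relations above can be made concrete in a small plain-Java sketch. The names and the map-based encoding are illustrative, not part of the thesis implementation: each operator stores its upstream neighbors (parents), and the downstream relation is derived by inverting that map.

```java
import java.util.*;

// Hypothetical sketch of the dataflow concepts: operators keyed by name,
// each mapped to its list of upstream neighbors (parents).
public class DataflowConcepts {
    // Derive downstream neighbors by inverting the upstream relation.
    public static Map<String, List<String>> downstreamOf(Map<String, List<String>> upstream) {
        Map<String, List<String>> down = new LinkedHashMap<>();
        for (String op : upstream.keySet()) down.put(op, new ArrayList<>());
        for (Map.Entry<String, List<String>> e : upstream.entrySet())
            for (String parent : e.getValue())
                down.get(parent).add(e.getKey()); // e's key is parent's downstream neighbor
        return down;
    }

    // The dataflow of Figure 3.1: Loads -> Join -> Group -> Filter.
    public static Map<String, List<String>> exampleTree() {
        Map<String, List<String>> upstream = new LinkedHashMap<>();
        upstream.put("LoadCountries", List.of());
        upstream.put("LoadCities", List.of());
        upstream.put("Join", List.of("LoadCountries", "LoadCities"));
        upstream.put("Group", List.of("Join"));
        upstream.put("Filter", List.of("Group"));
        return upstream;
    }
}
```

A downstream pass walks this structure sources-first; an upstream pass walks the original parent lists sink-first.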
3.3 Example Generation Properties
Let us now briefly explain the properties that define a good example generation
technique.
Completeness
The examples generated at each operator in the dataflow should collectively be
able to illustrate the semantics of that operator. For example, in the dataflow
output of Figure 3.2, the examples before and after the Filter operator clearly
explain the semantics of filtering by population, as examples with low population
are not propagated to the output. The same can be said of the Group operator:
the examples illustrate the meaning of grouping by country and aggregating (via
sum) the population. Completeness is the most important property, as it allows
the user to verify the behavior of each and every operator in the dataflow.
Conciseness
The set of examples generated after each operator should be as small as possible,
so as to minimize the effort of examining the data in order to verify the behavior
of that operator. The output presented in Figure 3.2 cannot be considered concise,
as there is scope to explain the behavior of the operators with a smaller set of
examples, as shown in Figure 3.4.
Realism
An example is considered real if it is present in the respective source
file/table/collection. Thus, the set of examples generated by any operator must
derive from the examples in the sources in order to be considered a real set. In
the dataflow output shown in Figure 3.2, assume that all the examples generated
by the Load operators come from the respective source files; this means all the
examples generated in the output of Figure 3.2 are real.
In Figure 3.4, we have altered the output of the dataflow (Figure 3.2) to make it
more concise and thus ideal with respect to the properties defined.
Figure 3.4: Ideal set of examples satisfying all properties
3.4 Example Generation Techniques
In this section, we discuss the techniques that can be considered for example
generation, the problems related to each approach, and how we can overcome them.
3.4.1 Downstream Propagation
The simplest way to generate examples at each operator in the dataflow is to
sample a few examples from the source files and push them through the operator
tree, executing each operator and recording the results after each execution.
The problem with this approach is that the sampled examples will not always
allow all operators in the operator tree to achieve total completeness. For
example, considering our dataflow from Figure 3.1, the initial sampling might
produce examples that cannot be joined, resulting in a lack of completeness
(as shown in Figure 3.5). This can be averted by increasing the sampling size
and pushing as many examples as possible through the operator tree [22], but
doing so violates the property of conciseness. As we seek an approach that
generates a complete and concise set of examples, relying only on downstream
propagation cannot be the choice for our implementation.
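The failure mode can be demonstrated with a tiny plain-Java sketch (helper names are hypothetical): sample n rows per source and push them through a join. When the sampled keys happen not to overlap, the join output is empty and completeness is lost.

```java
import java.util.*;

// Hypothetical sketch of plain downstream propagation.
public class DownstreamPropagation {
    // Nested-loop join of String[] rows on the given key columns.
    public static List<String> joinOnKey(List<String[]> left, List<String[]> right,
                                         int leftKey, int rightKey) {
        List<String> out = new ArrayList<>();
        for (String[] l : left)
            for (String[] r : right)
                if (l[leftKey].equals(r[rightKey]))
                    out.add(Arrays.toString(l) + "|" + Arrays.toString(r));
        return out;
    }

    // Random sample of n records from a source (seeded for reproducibility).
    public static <T> List<T> sample(List<T> source, int n, long seed) {
        List<T> copy = new ArrayList<>(source);
        Collections.shuffle(copy, new Random(seed));
        return copy.subList(0, Math.min(n, copy.size()));
    }
}
```

If the two samples share no join key, `joinOnKey` returns an empty list and every operator downstream of the join also stays empty, which is exactly the incompleteness of Figure 3.5.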
Figure 3.5: Incompleteness is often the case in downstream propagation
3.4.2 Upstream Propagation
The second approach that can be considered for example generation is to move
from sink to source and generate the output examples based on the characteristics
of the operator (joined examples, crossed examples, or filtered examples). Based
on these outputs, we generate the inputs and propagate upstream recursively until
we reach the sources. This approach works fine when we are aware of the operator
behavior (join, union, cross, etc.) but fails when a UDF operator is part of the
operator tree. Since a UDF is a black box, its behavior is hard to predict
beforehand; the approach then fails to generate examples, resulting in
incompleteness. The only way to generate examples for such UDF operators is to
push data through them in the downstream direction.
Hence, neither downstream nor upstream propagation alone is sufficient to
generate an ideal set of examples satisfying both completeness and conciseness.
Nonetheless, a combination of both, with pruning of redundant examples, can lead
to a better set of results.
3.4.3 Example Generator Algorithm
The algorithm with steps (in that order):
1. Downstream Pass, 2. Pruning, 3. Upstream Pass, 4. Pruning
was proposed in [12] and was observed to be more effective (at generating a
complete and concise set of examples) than the downstream-only and upstream-only
approaches.
To ensure the completeness of all the operators in an operator tree, [12]
introduces the Equivalence Class Model.
Equivalence Class Model
For a given operator O, a set of equivalence classes εO = {E1, E2, ..., Em} is
defined such that each class denotes one aspect of the operator semantics. For
example, a FILTER operator’s equivalence set is denoted by εfilter = {E1, E2},
where E1 denotes the set of examples passing the filtering predicate and E2
denotes the set of examples failing it. In Section 4.3 we define the equivalence
classes for the set of supported operators.
Following is the pseudo-code of the algorithm [12] used in this thesis for example
generation.
Algorithm 1 Example Generator
Step 1 : Use a downstream pass to generate possible sample examples at each
operator in the operator tree.
Step 2 : Using the equivalence class model, check each operator for completeness;
for any incompleteness identified, generate sample examples using an upstream
pass.
Step 3 : Use a pruning pass to remove redundant examples from the operators.
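The three steps can be sketched as a driver that simply sequences the passes. The `Pass` interface and the map-of-examples representation below are hypothetical scaffolding, not the thesis implementation; each pass transforms the per-operator example sets:

```java
import java.util.*;

// Hypothetical skeleton of Algorithm 1: the passes are supplied as stubs
// and the driver only fixes their order (downstream, upstream repair, prune).
public class ExampleGeneratorSkeleton {
    public interface Pass {
        Map<String, List<String>> run(Map<String, List<String>> examplesPerOperator);
    }

    public static Map<String, List<String>> generate(Map<String, List<String>> initial,
                                                     Pass downstream,
                                                     Pass upstreamRepair,
                                                     Pass prune) {
        Map<String, List<String>> examples = downstream.run(initial); // Step 1
        examples = upstreamRepair.run(examples);                      // Step 2
        return prune.run(examples);                                   // Step 3
    }
}
```

The value of fixing the order is that pruning only runs once completeness has been repaired, so it never has to be undone.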
Let us now formally introduce the example generation problem with respect to
Apache Flink that we are addressing in this thesis.
3.5 Example Generation In Apache Flink
As discussed in section 1.2, we implement the ILLUSTRATE2 feature available in
Apache Pig on a completely different computing platform, Apache Flink. This
work is based on the paper [12], which describes and proves the best possible
way to generate a set of complete, concise and real examples for a given dataflow
program. In the implementation, we have considered a few notable differences
between Pig and Flink, and thereby defined our requirements:
2 http://pig.apache.org/docs/r0.15.0/test.html#illustrate
1. Flink is a multi-language supporting framework (with support for Java, Scala
and Python), in contrast to Pig that uses a textual scripting language called
Pig Latin. In our implementation, we focus on Flink’s Java jobs.
2. Pig allows ILLUSTRATE either on a complete dataflow script or after any
given dataset (also known as a relation) within the script. We have adapted
this to support only a complete Flink job, i.e., a dataflow from source to
sink. We are of the opinion that illustrating the complete dataflow is more
practical than illustrating after every dataset, because eventually all used
datasets are displayed in a complete job illustration.
3. Pig’s input is transformed into a relation; a relation is a collection of
tuples, similar to a table in a relational database. In Flink the input can
be of any format (txt, csv, hdfs, collection), so for our implementation we
need to convert these inputs into a tuple format before performing any
transformation on them, as our intention is to find example “tuples”.
These are the differences we observed and adapted to in our implementation. In
order to view the overall implementation picture, let us briefly outline the
basic life-cycle when the Flink illustrator, the module that generates sample
examples, is invoked by a Flink job:
1. For a submitted job, an operator tree is created.
2. This operator tree forms the main input for our algorithm; the different
passes then act on this input to generate complete, real and concise examples.
3. The generated examples are then displayed to the user (via the console or
the Scala shell).
After this overview of example generation concepts and the introduction to the
algorithm and the equivalence class model, the next chapter provides a detailed
explanation of the implementation with respect to Apache Flink: the process of
generating the input (the operator tree), followed by the list of supported Flink
operators and their respective equivalence classes, and each algorithm pass.
Later, we describe the Flink features that have been integrated into the
implementation, making it different and novel.
Chapter 4
Implementation
Having defined what example generation is and how it is useful in Apache Flink,
this chapter gives an in-depth account of how the concept is realized in Flink.
The topic is best treated under the following headings:
1. Operator Tree
2. Supported Operators
3. Equivalence Class
4. Algorithm
5. Flink add-ons
4.1 Operator Tree
An operator tree is a chain or list of single operators that starts at the input
sources and ends at the output sink, with one or more data processing/transforming
operators in between. The operator tree is created once the submitted Flink job
invokes the illustrator module. Before going into further detail on how we build
an operator tree for a given job, let us define the Single Operator object, the
most granular element of an operator tree.
4.1.1 Single Operator
The Single Operator object defines the different properties of the operator under
consideration. The idea behind formalizing our own operator object, rather than
using the one available from Flink1, was twofold: (i) we would not use all the
properties of Flink’s operator class, as some of them, e.g., the degree of
parallelism and compiler hints, add no value to our implementation; (ii) a few
things needed specifically for our implementation were missing from the available
Flink operator class, e.g., having the join keys easily available for a given
input, having the output of each operator available for further manipulation,
and having details that verify the operator produced an output for a given set
of inputs. Our single operator class, along with its properties, can be viewed
in Figure 4.1.
SingleOperator
equivalenceClasses: List<EquivalenceClass>
operator: Operator<?>
parentOperators: List<SingleOperator>
operatorOutputAsList: List<Object>
syntheticRecord : Tuple
semanticProperties : SemanticProperties
JUCCondition : JUCCondition
operatorName: String
operatorType : OperatorType
operatorOutputType : TypeInformation<?>
JUCCondition
firstInput : int
secondInput : int
firstInputKeyColumns: int[ ]
secondInputKeyColumns: int[ ]
operatorType : OperatorType
Figure 4.1: The Single Operator Class
1https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/java/operators/Operator.html
Property Description
equivalenceClasses To define the completeness of the output set
operator Flink operator object
parentOperators Set of input operators whose output is being consumed by the current operator
operatorOutputAsList Output examples produced by the current operator (stored as a List collection )
syntheticRecord Synthetic record constructed to ensure operator completeness
semanticProperties Defines the transformations on tuple fields in the UDF operators
JUCCondition An inner class that facilitates manipulation of details related to join, union and cross operators, such as providing the input number as well as the assigned key columns
operatorName Name of the current operator
operatorType The type of the operator (JOIN, LOAD, CROSS, etc.)
operatorOutputType Defines the structure and datatypes of the output produced
Table 4.1: Description of SingleOperator properties
The properties of a given SingleOperator object are explained in Table 4.1.
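A trimmed plain-Java sketch of the structure in Figure 4.1 may help; the Flink-specific fields (Operator<?>, SemanticProperties, TypeInformation<?>) are replaced here by plain strings, and the completeness check anticipates the equivalence class model of Section 4.3. This is a simplified illustration, not the actual thesis class:

```java
import java.util.*;

// Simplified sketch of the SingleOperator structure from Figure 4.1.
public class SingleOperatorSketch {
    public String operatorName;
    public String operatorType;                           // e.g. "JOIN", "LOAD"
    public List<SingleOperatorSketch> parentOperators = new ArrayList<>();
    public List<Object> operatorOutputAsList = new ArrayList<>(); // output examples
    public List<Set<Object>> equivalenceClasses = new ArrayList<>();

    public SingleOperatorSketch(String name, String type) {
        this.operatorName = name;
        this.operatorType = type;
    }

    // Completeness in the sense of Section 4.3: no equivalence class is empty.
    public boolean isComplete() {
        for (Set<Object> ec : equivalenceClasses)
            if (ec.isEmpty()) return false;
        return true;
    }
}
```

Keeping the output list and the equivalence classes directly on the operator object is what lets the later passes inspect and repair each operator in isolation.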
4.1.2 Constructing the Operator Tree
After defining what our custom operator class consists of, let us now see the two
different ways to construct a list of these custom operators, which constitutes
our main input, the operator tree. For better understanding, consider the sample
dataflow program shown in Figure 4.2, which takes two input Load operators,
performs a Join on them, and Projects some result as output. In this scenario,
Load1 and Load2 form our input sources and Project is our output sink.
Figure 4.2: Tree Construction Approaches
1. Top-Down Push Approach:
This is the approach used in our final implementation. In this method, we start
from the source operator and then move downstream towards the sink output
operator. If we consider the dataflow program in Figure 4.2, we get two source
operators, Load1 and Load2, in the order they are defined in the program.
Suppose Load1 is defined first: we construct Load1’s SingleOperator object and
add it to the operator tree; we then seek the downstream operator of Load1 and
thus reach the Join operator. As Join is a dual-input operator (like Cross and
Union), we make sure both of its parents are added to the tree before adding
the Join operator itself. This allows better handling of the outputs and
execution of a dual-input operator. For example, if we add and execute in the
order “Load1, Join, Project, Load2”, Join will have input from only one parent
operator, Load1, and the execution will halt. Hence, we make sure Load2’s
SingleOperator object is added to the tree before Join’s. Using this technique
we continue moving downstream until the sink is reached, giving us the complete
operator tree for the given sample dataflow program. This method can be
expressed as the following pseudo-code (Algorithm 2):
Algorithm 2 Top-Down Push Method
Step 1 : Let List<SingleOperator> be an operatorTree.
Step 2 : Get all the input sources for the given dataflow program.
Step 3 : Get the first input from the input sources.
Step 4 : Construct a SingleOperator object for this input and add it to the
operatorTree.
Step 5 : Get the downstream operator of the added input and check whether it is
a single-input or a dual-input operator; if single-input go to Step 6.1, if
dual-input go to Step 6.2.
Step 6.1 : Construct a SingleOperator object for this downstream operator and
add it to the operatorTree.
Step 6.2 : Check if both parents are present in the operatorTree. If present,
construct a SingleOperator object for this downstream operator and add it to
the operatorTree; if not, get the next input from the input sources and go to
Step 4.
Step 7 : Continue from Step 5 until the sink is reached, i.e., the added
operator has no downstream operator.
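The top-down construction can be sketched in plain Java over operator names (the data structures are illustrative, not the thesis code). The key invariant from Step 6.2 is encoded in one check: an operator is only appended once all of its parents are already in the tree; otherwise it is skipped and re-enqueued later by its remaining parent.

```java
import java.util.*;

// Hypothetical sketch of the Top-Down Push construction.
public class TopDownPush {
    public static List<String> build(Map<String, List<String>> parentsOf, List<String> sources) {
        List<String> tree = new ArrayList<>();
        Deque<String> work = new ArrayDeque<>(sources);
        while (!work.isEmpty()) {
            String op = work.poll();
            if (tree.contains(op)) continue;                      // already added
            if (!tree.containsAll(parentsOf.getOrDefault(op, List.of())))
                continue;                                         // wait for the other parent
            tree.add(op);
            for (Map.Entry<String, List<String>> e : parentsOf.entrySet())
                if (e.getValue().contains(op))
                    work.add(e.getKey());                         // enqueue downstream neighbors
        }
        return tree;
    }
}
```

For Figure 4.2 this yields the order Load1, Load2, Join, Project: Join is enqueued when Load1 is added but is only accepted after Load2 has also been added.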
2. Bottom-Up Volcano Approach
This is an alternate technique to create an operator tree. As the name suggests,
it constructs the operator tree starting from the sink, i.e., the output operator,
and then moves upstream towards the input source operators. Once a Flink job for
the sample dataflow program (Figure 4.2) is submitted, we seek the sink operator;
in this case, the sink operator is Project. We construct the single operator
object for the Project operator by setting the necessary details and add it to
the operator tree. We then seek the input/parent operators of Project, and get
the Join operator. We continue in a similar fashion until an operator has no
inputs, which means we have reached a source operator, completing the operator
tree for the given sample dataflow program.
This whole process can be expressed as the following pseudo-code (Algorithm 3):
Algorithm 3 Bottom-Up Volcano Method
Step 1 : Let List<SingleOperator> be an operatorTree.
Step 2 : Get the sink of the given dataflow program.
Step 3 : Construct a SingleOperator object for the sink and add it to the
operatorTree.
Step 4 : Get the upstream operators of the object just added to the operatorTree.
Step 5 : Construct a SingleOperator object for every upstream operator and add
it to the operatorTree.
Step 6 : Continue from Steps 4 to 5 until the sources are reached, i.e., no
upstream operator is found for the added operator.
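The bottom-up construction amounts to a simple recursive walk from the sink along the parent lists. As before, the structures are an illustrative plain-Java sketch, not the thesis code:

```java
import java.util.*;

// Hypothetical sketch of the Bottom-Up Volcano construction: start at the
// sink and recursively add upstream operators until operators with no
// parents (the sources) are reached.
public class BottomUpVolcano {
    public static List<String> build(Map<String, List<String>> parentsOf, String sink) {
        List<String> tree = new ArrayList<>();
        collect(parentsOf, sink, tree);
        return tree;
    }

    private static void collect(Map<String, List<String>> parentsOf, String op, List<String> tree) {
        if (tree.contains(op)) return;   // shared upstream operators added once
        tree.add(op);                    // sink first, sources last
        for (String parent : parentsOf.getOrDefault(op, List.of()))
            collect(parentsOf, parent, tree);
    }
}
```

Note the resulting list is in sink-to-source order, the reverse of the Top-Down Push output, which is one reason the final implementation prefers the top-down variant: operators can then be executed in list order.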
4.2 Supported Operators
Having seen the techniques to construct an operator tree, we now define the types
of operators allowed and supported in our implementation, and also look into the
conditions under which an operator may not produce any output.
• LOAD : The first operator in the operator tree; it acts as a source for
the operators downstream. It will always produce records unless the input
source file or collection is empty. Load in Flink can also be achieved via a
FlatMap user-defined function (UDF) operator, which allows us to convert the
input into the requested tuple format.
• DISTINCT : This operator de-duplicates the records produced by the upstream
operator. Unless the upstream operator generates an empty set of tuples,
Distinct will always produce output records.
• JOIN : A dual-input operator that combines the records from the upstream
parent operators on a common join key. If no matching values exist on the
join key in the parent operators’ outputs, Join produces an empty output.
Join in Flink behaves differently from a traditional SQL join: Flink joins
the tuples from both inputs into a single tuple of size 2 (tuple from input 1,
tuple from input 2), whereas in SQL the tuples are merged into one flat tuple.
Figure 3.2 shows the behavior of an SQL join, and Figure 4.4 depicts Flink’s
join behavior.
• UNION : A dual-input operator that combines the outputs of the upstream
parent operators. The output types of both parent operators must be the same,
i.e., they must produce the same tuple field types. Unless both parent
operators generate empty result sets, Union will always produce output records.
• CROSS : Also a dual-input operator; it produces the cross product of the
records from the upstream parent operators, i.e., it builds all pairwise
combinations of records from both parents’ outputs. Cross produces an empty
output if either upstream operator’s result set is empty.
• FILTER : A user-defined function (UDF) operator in Flink that takes a
filtering predicate and applies it to the upstream operator’s result set. If
no record in the upstream operator’s result set passes the predicate, Filter’s
result is an empty set.
• PROJECT : An operator that projects the specified fields from the tuples.
Unless the upstream operator’s record set is empty, Project will always have
an output result set.
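The shape difference noted under JOIN can be made concrete with a small sketch. Plain Java lists stand in for tuples here, and the class is purely illustrative: Flink's join yields a pair whose two fields are the original tuples, while an SQL-style join merges both rows into one flat tuple.

```java
import java.util.*;

// Illustrative sketch of the two join result shapes.
public class JoinShapes {
    // Flink-style: a 2-field result whose fields are the original tuples.
    public static List<List<Object>> flinkJoin(List<Object> left, List<Object> right) {
        return List.of(new ArrayList<>(left), new ArrayList<>(right));
    }

    // SQL-style: one merged tuple containing all fields of both rows.
    public static List<Object> sqlJoin(List<Object> left, List<Object> right) {
        List<Object> merged = new ArrayList<>(left);
        merged.addAll(right);
        return merged;
    }
}
```

Joining a 2-field row with a 2-field row thus gives a 2-field nested result in the Flink shape but a 4-field flat result in the SQL shape, which matters when downstream operators address fields by position.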
In our implementation, we limit ourselves to this set of operators, out of the
plethora of Flink operators that includes many other UDF operators, because it
suffices to convey the overall idea of how example generation works in Flink.
As of now, Filter, FlatMap and Map are the only Flink UDF operators included in
this implementation. However, after defining fitting semantics, it is possible
to extend the implementation to cover the currently unsupported operators.
The list of Flink operators not supported in this implementation is:
GroupReduce, GroupCombine, CoGroup, Aggregation, Iteration, Partition
The reason behind excluding these operators is that one needs to define exact
semantics for each operator, and since most of them are UDFs (whose functionality
is defined at the code level), it is difficult to characterize those semantics.
We were able to include the FlatMap and Map UDF operators in this implementation
because structural transformation of the input tuple is the only semantic one has
to take into account, in contrast to grouping or iterations.
In order to maintain completeness, i.e., to avoid the situation where an operator
generates an empty set and to cover all necessary semantics of an operator, we
adopt the model stated in the paper [12], known as the Equivalence Class Model,
which is described in the next section.
4.3 Equivalence Class Model
In this model, we assign to each operator in the operator tree a list of
equivalence classes, such that each equivalence class depicts one unique aspect
of the operator semantics. Formally, for a given operator O, εO = {E1, E2, ..., Em}
is the set of equivalence classes, where each Ei specifies and illustrates one
aspect of the operator semantics. When sample examples are generated for an
operator, either as input or output, we make sure that each generated example
belongs to an equivalence class. We must also cover all the equivalence classes
of the given operator by having at least one example for each equivalence class
in the result set. This way we achieve semantic completeness for the operator
under consideration. Let us now define the equivalence classes for the operators
we have considered:
• LOAD : Every record from the source file or collection that is being loaded
forms the single equivalence class E1. This can also be stated as, every record
that Load generates is assigned to a single equivalence class E1.
• DISTINCT : Every output record is assigned to a single equivalence class
E1. The behavior is best described when at least one duplicate record exists
in the input set, though this may not always be possible, as it depends on the
output generated by the upstream operator.
• JOIN : Every output record is assigned to a single equivalence class E1.
This illustrates that a matching join key exists in the parent operators’
outputs, thereby clarifying the semantics.
• UNION : Every input record from the first parent operator is assigned to E1,
and every input record from the second parent operator is assigned to E2. This
allows us to confirm that Union produces output containing at least one record
from each parent operator.
• CROSS : Similar to Union, every input record from the first parent operator
is assigned to E1, and every input record from the second parent operator is
assigned to E2. This way we confirm that both parents have input, which Cross
needs to produce any output.
• FILTER : Every input record that passes the filtering predicate is assigned
to E1, whereas the one that fails is assigned to E2. This way we have at
least one record that passes the filter and Filter produces an output, and we
also depict the type of examples that are blocked by the filtering predicate.
• PROJECT : Every output record is assigned to E1, this verifies that Project
had appropriate input to produce an output.
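The FILTER classes above are the easiest to sketch in code. The helper below is hypothetical (not the thesis implementation): it splits the input into E1 (records passing the predicate) and E2 (records failing it), and the operator is semantically complete only when both classes are non-empty.

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch of the FILTER equivalence classes E1/E2 defined above.
public class FilterEquivalence {
    public static List<List<Integer>> classify(List<Integer> input, Predicate<Integer> predicate) {
        List<Integer> e1 = new ArrayList<>(); // E1: passes the filtering predicate
        List<Integer> e2 = new ArrayList<>(); // E2: fails the filtering predicate
        for (Integer record : input)
            (predicate.test(record) ? e1 : e2).add(record);
        return List.of(e1, e2);
    }

    // Semantic completeness: every equivalence class holds at least one example.
    public static boolean isComplete(List<List<Integer>> classes) {
        return classes.stream().noneMatch(List::isEmpty);
    }
}
```

An empty E1 or E2 after the downstream pass is exactly the signal that triggers synthetic record creation in the upstream pass described later.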
4.4 Algorithm
In this section, we describe our implementation of the algorithm that generates
example tuples for each operator in the operator tree. The algorithm needs two
main inputs: (1) an operator tree for a submitted Flink job, created using the
techniques from Section 4.1.2, and (2) the base tables, i.e., the input source
files or collections, depending on the input format used in the dataflow program.
The algorithm is generic with respect to input formats and supports the different
formats used by Flink, such as CSV files, text files, Java collections, or HDFS
files. Every input record is then converted into a tuple structure, as tuples
form the basis of data movement through the operator tree. With respect to
operator support, we are limited to the set of operators mentioned above, for
the reasons discussed in Section 4.2. Let us quickly overview the different
steps we perform throughout our algorithm in the next section.
4.4.1 Overview
After receiving the two inputs, the operator tree for a given Flink dataflow
program and the input sources, we perform in total three different passes in
order to generate the final concise and complete set of examples. The algorithm
can be summarized as:
1. Downstream pass: In this pass we move from the source operators to the sink,
considering one operator at a time. At the source operators, we randomly sample
n records from the input sources and propagate these records to the downstream
operator. The downstream operator then acts on the records to generate an
intermediate output, which is passed to the next downstream operator. We
continue in a similar fashion until we reach the sink operator. After reaching
the sink, we set the equivalence classes for each operator in the tree using
the examples generated. Depending on the initially sampled records, a
downstream operator may or may not generate an example. To tackle this
incompleteness, the case of an empty equivalence class, we have the next pass.
2. Upstream pass: In this pass we move from the sink to the source operators,
checking at each operator whether any incompleteness exists by identifying
empty equivalence classes. In case of incompleteness, we create a synthetic
record that resolves the incompleteness for the operator under consideration.
We then propagate this synthetic record up to the source operators and convert
it into a concrete real example. We continue in a similar fashion until the
sources are reached.
3. Pruning pass: After resolving the incompleteness issue, we tackle conciseness
in this pass. Some records may be redundant, in that they add no extra semantic
value for the operator under consideration. We eliminate such records in this
pass, along with their trail throughout the tree.
In the next sections, we explain each pass in detail. For better understanding,
let us consider an example dataflow program and use its operator tree and sources
as the inputs to our algorithm. Figure 4.3 is a dataflow program that produces
the list of users who tend to visit URLs with a high page rank.
Figure 4.3: Sample example - Dataflow program that retrieves users visiting pages with high page rank
In the above running example, we load two datasets, UserVisits and PageRanks,
from their respective source files. The two datasets are joined on the field url;
the joined records are then filtered such that only records with a rank greater
than three are accepted. Finally, the filtered records are projected into the
tuple format <user, url, rank>, stating the high-ranked URLs for each user. Once
the flink-illustrator module is invoked, the operator tree is created; for our
running example, the tree is the same as Figure 4.3, i.e., {Load UserVisits,
Load PageRanks, Join, Filter, Project}, following our Top-Down Push approach
(Algorithm 2). As the input is ready, our algorithm invokes the downstream pass.
4.4.2 Downstream Pass
In our downstream pass, we randomly select n records from each input source,
which form the result set of the Load operators. These records are then propagated
to the downstream operator, which acts on the received input records and
generates an output; this output in turn acts as the input to the next downstream
operator, and so on until we reach the sink. Consider our operator tree as a list
of operators {O1, O2, O3, ..., Om}. Given this order, O1's output will be the input
of O2; operator O2 will then evaluate on the received set of inputs and possibly
generate an output, which thus becomes the input for
O3. We continue this way until we reach Om, which by definition is the sink operator,
thereby completing the downstream pass. We store all the intermediate results
produced by each operator. Let us apply this concept to the running example.
Figure 4.4: Downstream pass applied on our running example 4.3
After the downstream pass, the examples generated for each operator can be viewed
in Figure 4.4. As stated, the Loads will hold randomly selected inputs from the
source files. The Join operator receives these records as inputs and evaluates the
join on url, the joining key; two common urls are present, namely youtube and
facebook, which allows Join to produce outputs by joining the respective tuples
from the parent operators. The next downstream operator, Filter, accepts only
those records with a rank greater than three, while discarding the rest. Finally,
Project, the sink operator, creates tuples in the requested structure.
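The chain evaluation just described, where each operator consumes its upstream neighbour's output and every intermediate result is stored, can be sketched as follows. This is an illustrative sketch only: the class and method names are our own (not the thesis implementation's), and records are simplified to plain integer ranks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch (not the thesis code): the downstream pass feeds each
// operator's output to the next operator and stores every intermediate result.
public class DownstreamPass {

    public static Map<String, List<Integer>> run(
            List<Integer> sampledSourceRecords,
            LinkedHashMap<String, Function<List<Integer>, List<Integer>>> operators) {
        Map<String, List<Integer>> intermediate = new LinkedHashMap<>();
        List<Integer> current = sampledSourceRecords;
        for (Map.Entry<String, Function<List<Integer>, List<Integer>>> op
                : operators.entrySet()) {
            current = op.getValue().apply(current); // O_i evaluates its input ...
            intermediate.put(op.getKey(), current); // ... result stored, fed to O_{i+1}
        }
        return intermediate;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Function<List<Integer>, List<Integer>>> ops =
                new LinkedHashMap<>();
        ops.put("Filter(rank > 3)", in -> {
            List<Integer> out = new ArrayList<>();
            for (int r : in) if (r > 3) out.add(r);
            return out;
        });
        // Sampled source records 2, 4 and 5; only 4 and 5 survive the filter.
        System.out.println(run(Arrays.asList(2, 4, 5), ops).get("Filter(rank > 3)"));
    }
}
```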
Record’s Lineage
During the downstream pass, we track the flow of each individual record over its
lifetime in the operator tree; this tracking of an individual record is known as the
lineage of that record. For example, the lineage of the record <Amit, facebook, 5pm>,
moving in the direction from source to sink, is constructed as follows:
The left-hand side denotes the operator traversal in the operator tree (and the
effect each operator has on the tuple's structure), whereas the right-hand side
denotes the record's traversal through the operator tree and shows the effect of
evaluation after the respective operator.
Operator traversal                               Record's traversal
↓ Load
  <Username, URL, Time>                          <Amit, facebook, 5pm>
↓ Join (UserVisits with PageRanks on URL)
  <<Username, URL, Time>, <URL, Rank>>           <<Amit, facebook, 5pm>, <facebook, 4>>
↓ Filter (Rank > 3)
  <<Username, URL, Time>, <URL, Rank>>           <<Amit, facebook, 5pm>, <facebook, 4>>
↓ Project (Username, URL, Rank)
  <Username, URL, Rank>                          <Amit, facebook, 4>
Hence the lineage of record <Amit, facebook, 5pm> is:
<Amit, facebook, 5pm> → <<Amit, facebook, 5pm>, <facebook, 4>> → <<Amit, facebook, 5pm>, <facebook, 4>> → <Amit, facebook, 4>
Similarly, lineage of record <youtube,2> will be:
<youtube, 2> → <<Amit, youtube, 3pm>, <youtube, 2>>
Having these lineages will in turn let us form the lineage groups, the technique
used in our pruning pass.
Lineage Group
The lineage group of a given record adds extra information to the available
lineage, namely the operator that acts on the record and the output it generates.
For example, the lineage group of <Amit, facebook, 5pm> is as follows:
<Amit, facebook, 5pm> →
  {Load, <Amit, facebook, 5pm>}
  {Join, <<Amit, facebook, 5pm>, <facebook, 4>>}
  {Filter, <<Amit, facebook, 5pm>, <facebook, 4>>}
  {Project, <Amit, facebook, 4>}
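A lineage group can be represented as an ordered list of (operator, record) entries attached to a source record. The following sketch is hypothetical (the class and method names are ours, not the thesis implementation's), but mirrors the structure shown above:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical representation of a lineage group: for one source record we
// keep, per operator, the record(s) it contributed to.
public class LineageGroup {

    private final String sourceRecord;
    // Insertion-ordered: the Load entry first, the sink operator's entry last.
    private final List<Map.Entry<String, String>> entries = new ArrayList<>();

    public LineageGroup(String sourceRecord) {
        this.sourceRecord = sourceRecord;
    }

    public void add(String operator, String producedRecord) {
        entries.add(new AbstractMap.SimpleEntry<>(operator, producedRecord));
    }

    // All records this source record produced at a given operator.
    public List<String> recordsAt(String operator) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : entries)
            if (e.getKey().equals(operator)) out.add(e.getValue());
        return out;
    }

    public static void main(String[] args) {
        LineageGroup g = new LineageGroup("<Amit, facebook, 5pm>");
        g.add("Load", "<Amit, facebook, 5pm>");
        g.add("Join", "<<Amit, facebook, 5pm>, <facebook, 4>>");
        g.add("Filter", "<<Amit, facebook, 5pm>, <facebook, 4>>");
        g.add("Project", "<Amit, facebook, 4>");
        System.out.println(g.recordsAt("Project")); // [<Amit, facebook, 4>]
    }
}
```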
These lineage groups are then used in the pruning pass. After the completion
of the downstream pass, we examine the generated example set at each operator
and set the equivalence classes for the respective operator. The complete
downstream pass can be summarized by the following pseudo-code (Algorithm 4):
Algorithm 4 Downstream Pass
Step 1: let operatorTree be the input
Step 2: set the LOAD operators' records by randomly selecting n values from the respective sources; store these records as the output of the LOAD operators
Step 3: initiate lineage groups for every record from the above outputs as <record → <Load, record>>
Step 4: get the next downstream operator from the operator tree, evaluate the operator on the available input, store the intermediate output, and add <record → <operator, record's evaluation on this operator>> to the record's respective lineage group
Step 5: continue Step 4 until the sink is reached
Step 6: set the equivalence classes for each operator based on the output generated
In the above running dataflow example, each operator generated an output and
every equivalence class of each operator was filled. For example, Join had a
joined example, hence filling E1; Filter had both equivalence classes filled,
E1 with two records passing and E2 with one record failing the filtering
predicate. This may not always be the case: consider the situation shown in
Figure 4.5 for the same dataflow program with a slight modification, where we
remove the Filter operator for simplicity. Such scenarios may occur in any
dataflow; to tackle them we move on to the upstream pass.
4.4.3 Upstream Pass
In this pass, we move from the sink towards the source, inspecting one operator
at a time. At each operator, we identify possible cases of incompleteness and
address them by introducing synthetic records that restore completeness for the
operator under consideration. We identify incompleteness based on the equivalence
classes set in the downstream pass: if, for a given operator O, any of its
equivalence classes is empty, we introduce a synthetic record into O's input set
such that, after evaluation, it makes that equivalence class non-empty. After
introducing the synthetic record, we propagate it upstream until the respective
source is reached, where we then convert this synthetic record to an actual
record by matching it as closely as possible with the data available at the source.
Figure 4.5: Unmatching join-key attribute leads to an empty result at the Join operator
In the following section, we explain how we generate a synthetic record for an
operator’s equivalence class.
A synthetic record is a concise representation of the tuple that an operator
consumes or produces, based on the equivalence class we are trying to fill.
Depending on the operator under consideration, we explicitly mark whether a
field in the tuple is of certain significance. For example, in the case of a Join
operator we mark whether a field is the join key or not. For the set of operators
that we consider in our implementation, our language for defining synthetic
records is very simple:
i. if the field is a join key, add the marker "Join key" to the respective field
position in the tuple,
ii. all the remaining fields are marked as "Don't care" (with a hint of the field's
datatype), meaning any value is acceptable in this position.
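This two-marker language can be captured in a few lines of code. The sketch below is purely illustrative; the class and enum names are our own invention, not part of the thesis implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the two-marker synthetic-record "language" described above.
public class SyntheticRecord {

    enum Marker { JOIN_KEY, DONT_CARE }

    final List<Marker> fields;

    SyntheticRecord(Marker... fields) {
        this.fields = Arrays.asList(fields);
    }

    @Override
    public String toString() {
        List<String> parts = new ArrayList<>();
        for (Marker m : fields)
            parts.add(m == Marker.JOIN_KEY ? "Join key" : "Don't care");
        return "<" + String.join(", ", parts) + ">";
    }

    public static void main(String[] args) {
        // Synthetic Join inputs for Figure 4.5: url is the join key.
        SyntheticRecord userVisits =
                new SyntheticRecord(Marker.DONT_CARE, Marker.JOIN_KEY, Marker.DONT_CARE);
        SyntheticRecord pageRanks =
                new SyntheticRecord(Marker.JOIN_KEY, Marker.DONT_CARE);
        System.out.println(userVisits); // <Don't care, Join key, Don't care>
        System.out.println(pageRanks);  // <Join key, Don't care>
    }
}
```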
This could certainly be extended to support a richer set of operators, e.g.,
stating in the case of a Filter operator which field in the tuple is the filtering
predicate, or in the case of a FlatMap operator that a given field is transformed.
But as these operators are UDF operators in Flink, retrieving such information is
not yet possible. The semantic properties of an operator (Section 4.5.1) are
indeed useful in this scenario, but they cannot be generalized to the overall
implementation, as they are an optional property of an operator and may not
always be set.
Now let us explain how this works with respect to the example in Figure 4.5.
As we can see, there is no common join key in the two Loads, causing Join
to produce an empty result; hence the equivalence class E1 is empty. As we
traverse from the sink to the source, we first come across the Project operator.
Project being an operator that merely displays results in a given structure, we
know that having some records in its input will solve its incompleteness, and
hence we ignore it: if Join produces a result, Project surely will too. Thus, we
move upstream, reach the Join operator, and find E1, i.e., any joined example,
to be empty. Hence we now attempt to introduce records into Join's input that
resolve the incompleteness. With url being the join key, our synthetic records
are as follows:
Load (UserVisits):  <Don't care, Join key, Don't care>
Load (PageRanks):   <Join key, Don't care>
Having the above set of records as input to the Join operator resolves its
incompleteness, as we get the resulting joined set
<<Don't care, Join key, Don't care>, <Join key, Don't care>>, and progressively
Project's as well, as shown in Figure 4.6. We then move these synthetic
records upstream to the source and convert them to records with real matching
data values taken from the respective source files; these new data values are
obtained from examples that were not used in the downstream pass. As shown in
Figure 4.7, <John, linkedin, 6pm> and <linkedin, 4>, unused in the downstream
pass, are the records introduced to restore completeness.
Figure 4.6: Upstream Pass: Synthetic records introduction and assumed operator evaluation
Figure 4.7: Upstream Pass: Synthetic records to real records
In this way the upstream pass attempts to resolve any incompleteness present;
the whole process can be seen as a single iteration of introducing synthetic
records, converting them to real records, and evaluating the operators once
again. For the Join, Cross, and Union operators, we can fill an empty equivalence
class within a single iteration, but for the Filter operator one iteration may
not always suffice: finding a specific record that either satisfies or fails the
filtering predicate (depending on the equivalence class to fill) is a
time-consuming process on big datasets. This is also due to the fact that we are
not exposed to the filtering predicate, since it is a UDF in Flink; hence
manipulating the synthetic record to our benefit is not possible. Having said
that, we use the same method for Filter as for the other operators to generate
the best possible record in a single iteration. Given the input from the
downstream pass, the upstream pass can be viewed as the following pseudo-code
(Algorithm 5), where we move from sink to source examining one operator at a time.
Algorithm 5 Upstream Pass
Step 1: let O be the operator under consideration (initially O will be the sink operator)
Step 2: check for empty equivalence classes in O; if any empty equivalence class exists go to Step 3.1, if all equivalence classes are filled get the upstream operator and go to Step 1
Step 3.1: introduce synthetic records into O's input set such that they make the equivalence class non-empty; propagate these records in the upstream direction until the source
Step 3.2: at the source, convert each synthetic record to a closely matching real record from the source's input
Step 4: evaluate the operators again, store the newly added records, update the respective lineage groups, and discard the synthetic records
Step 5: set the equivalence classes for the operators based on the newly available records
Step 6: get the next upstream operator (if any) and continue from Step 1
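The sink-to-source scan for empty equivalence classes (Steps 1 and 2 above) can be sketched as follows. All names here are invented for illustration, and the actual synthetic-record creation is elided:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the sink-to-source scan in Algorithm 5.
public class UpstreamPass {

    static class Op {
        final String name;
        // Equivalence class label -> number of examples currently in it.
        final Map<String, Integer> classCounts = new LinkedHashMap<>();
        Op(String name) { this.name = name; }
    }

    // Visit operators from sink to source and report, in visit order, the
    // empty equivalence classes that would need a synthetic record.
    public static List<String> emptyClasses(List<Op> sourceToSink) {
        List<String> toFill = new ArrayList<>();
        for (int i = sourceToSink.size() - 1; i >= 0; i--) { // sink -> source
            Op o = sourceToSink.get(i);
            for (Map.Entry<String, Integer> ec : o.classCounts.entrySet())
                if (ec.getValue() == 0)                       // incompleteness
                    toFill.add(o.name + ":" + ec.getKey());
        }
        return toFill;
    }

    public static void main(String[] args) {
        Op join = new Op("Join");
        join.classCounts.put("E1(joined)", 0); // no joined example yet (Figure 4.5)
        Op project = new Op("Project");
        project.classCounts.put("E1(output)", 0);
        System.out.println(emptyClasses(Arrays.asList(join, project)));
    }
}
```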
After tackling the incompleteness issue in the upstream pass, we now move
towards making the result set as concise as possible by pruning the generated
examples. This is done in our last pass, the pruning pass.
4.4.4 Pruning Pass
In this pass, given the set of examples generated by each operator in the above
two passes, we reduce the example set of every operator as far as possible without
affecting that operator's completeness. This is done using the concepts of record
lineage and lineage groups that were set in the downstream and upstream passes. We
start from the sink operator, consider its example set, and try removing one
example at a time. If the removed example keeps the operator's completeness intact,
we then use the record's lineage group and attempt to remove the trace of that
record throughout the tree. While pruning, we make sure that completeness is
not compromised at any operator in the tree. We then move upstream and
continue in a similar fashion at each operator, making sure that removing a record
at an upstream operator does not impact the completeness of the current operator
or of the downstream operators, until the source operator is reached. Let
us now explain this concept and how it is realized with respect to the examples
from Figures 4.4 and 4.7.
Let us start with the example from Figure 4.4 (recreated here for reference).
Figure 4.8: Pruning pass on the example from Figure 4.4: Initial State
Here the sink operator is Project, and its output contains two examples. Project
being an operator where every output record is assigned to the single equivalence
class E1, one record in E1 suffices to maintain Project's completeness. This
allows us to prune one of the two examples from Project's output set
{<Hung, facebook, 4>, <Amit, facebook, 4>}. Let us try
to prune the record <Hung, facebook, 4>; this record is part of two lineage
groups, as follows:
Lineage Group 1:
<Hung, facebook, 4pm> →
  {Load, <Hung, facebook, 4pm>}
  {Join, <<Hung, facebook, 4pm>, <facebook, 4>>}
  {Filter, <<Hung, facebook, 4pm>, <facebook, 4>>}
  {Project, <Hung, facebook, 4>}
Lineage Group 2:
<facebook, 4> →
  {Load, <facebook, 4>}
  {Join, <<Hung, facebook, 4pm>, <facebook, 4>>}
  {Join, <<Amit, facebook, 5pm>, <facebook, 4>>}
  {Filter, <<Hung, facebook, 4pm>, <facebook, 4>>}
  {Filter, <<Amit, facebook, 5pm>, <facebook, 4>>}
  {Project, <Hung, facebook, 4>}
  {Project, <Amit, facebook, 4>}
Considering Lineage Group 1 as we move from sink to source: at Project,
eliminating <Hung, facebook, 4> does not affect its completeness, and thus we
are allowed to prune the example at the sink operator Project. The next upstream
operator is Filter; here too, removing
<<Hung, facebook, 4pm>, <facebook, 4>> has no impact on Filter's completeness,
as both of its equivalence classes,
E1 (passing filter: <<Amit, facebook, 5pm>, <facebook, 4>>) and
E2 (failing filter: <<Amit, youtube, 3pm>, <youtube, 2>>), are still filled, and
Project's completeness is intact as well.
Thus <<Hung, facebook, 4pm>, <facebook, 4>> is pruned at the Filter operator.
The same holds at the Join and Load (UserVisits) operators, and the
respective record is pruned at these operators as well. We have hence pruned the
complete trace of <Hung, facebook, 4> throughout the operator tree using Lineage
Group 1. After this first round of pruning, Lineage Group 2 now looks as follows:
<facebook, 4> →
  {Load, <facebook, 4>}
  {Join, <<Amit, facebook, 5pm>, <facebook, 4>>}
  {Filter, <<Amit, facebook, 5pm>, <facebook, 4>>}
  {Project, <Amit, facebook, 4>}
As the remaining records at Join, Filter, and Project are not related to
<Hung, facebook, 4>, we cannot prune them. We are, however, allowed to try
pruning <facebook, 4> at Load, but we realize that this would disturb the
completeness of the downstream Filter operator, as no record would pass the
filter, making E1 void. Hence <facebook, 4> is retained at
Load (PageRanks). After finishing pruning at the sink operator (Project), the
resulting operator tree with examples looks as in Figure 4.9; we then move to
the next upstream operator, Filter.
Figure 4.9: Pruning pass: After pruning Project operator
At Filter, it is not possible to prune the only available record from its output,
as doing so would make Filter's, as well as Project's, result set empty. Moving
further upstream to the Join operator, the output set consists of two records,
<<Amit, facebook, 5pm>, <facebook, 4>> and <<Amit, youtube, 3pm>, <youtube, 2>>.
At Join it is possible to prune either of the records without affecting Join's
completeness, but pruning either record would reduce completeness at the
downstream Filter operator:
pruning <<Amit, facebook, 5pm>, <facebook, 4>> would make Filter's E1
(passing records) void, and pruning <<Amit, youtube, 3pm>, <youtube, 2>>
would make Filter's E2 (failing records) void. Hence at Join, as at Filter, no
record is pruned. Applying similar logic to the remaining upstream operators
Load (UserVisits) and Load (PageRanks): at Load (UserVisits) we can prune
<June, gmail, 5pm>, and at Load (PageRanks) we can prune <yahoo, 3> and
<skysports, 2>, as this does not upset the completeness of the respective Loads
or of any downstream operators. The overall operator tree with examples after
completion of the pruning pass is shown in Figure 4.10.
Figure 4.10: Pruning pass: Complete pruning of example from Figure 4.4
Using a similar technique, the pruning pass on the example from Figure 4.7
produces the result set shown in Figure 4.11.
The complete pruning pass can be summarized by the pseudo-code in Algorithm 6.
Figure 4.11: Pruning pass: Complete pruning of example from Figure 4.7
Algorithm 6 Pruning Pass
Step 1: let O be the operator under consideration (initially O will be the sink operator); for every record R in O's result set perform Step 2
Step 2: verify whether pruning R would result in incompleteness at O; if completeness is intact proceed to Step 3, else get the next R (if any) from O's result set and redo Step 2
Step 3: get R's lineage group l; for every lineage entry in l execute Step 4
Step 4: get record Rli (the record at operator i in lineage group l); initially i will be the farthest downstream operator. Verify whether pruning Rli affects completeness at operator i, as well as the completeness of i's downstream operators (if any). If completeness is intact everywhere, remove Rli from operator i, move to the next lineage entry record Rli−1 (if any), where i−1 is the upstream operator of operator i, and redo Step 4 (now Rli = Rli−1)
Step 5: repeat Steps 1 through 4 for the next upstream operator until we reach the source
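The pruning test of Step 2, where a record may only be removed if every equivalence class it witnesses keeps at least one other example, can be sketched as follows. All names are illustrative, not the thesis implementation's:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the pruning test in Algorithm 6.
public class PruningPass {

    // classMembers: equivalence class -> records currently assigned to it.
    public static boolean canPrune(Map<String, List<String>> classMembers, String record) {
        for (List<String> members : classMembers.values())
            if (members.contains(record) && members.size() == 1)
                return false; // record is the sole witness of this class
        return true;
    }

    public static void main(String[] args) {
        Map<String, List<String>> projectClasses = new HashMap<>();
        projectClasses.put("E1", new ArrayList<>(Arrays.asList(
                "<Hung, facebook, 4>", "<Amit, facebook, 4>")));
        // Two witnesses for E1: one of them is redundant and may be pruned.
        System.out.println(canPrune(projectClasses, "<Hung, facebook, 4>")); // true
        projectClasses.get("E1").remove("<Hung, facebook, 4>");
        // The remaining record is now the only witness and must stay.
        System.out.println(canPrune(projectClasses, "<Amit, facebook, 4>")); // false
    }
}
```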
These were the main passes used in our example generation algorithm. In the next
section we explain the extensibility features of Flink used in our implementation.
4.5 Flink Add-ons
4.5.1 Using Semantic Properties
Flink’s optimizer provides a way to inspect UDFs, normally seen as a black box,
by presenting information such as location of the fields of the input tuples that
are being accessed, modified or copied. Hence providing an indirect access to
data movement in a UDF. To access this information, one needs to add Function
Annotations 2 to a UDF. This can be done in a manner similar to shown in Code-
Snippet 4.1 as follows:
@FunctionAnnotation.ForwardedFields("f0.f0->f0;f0.f1->f1;f1.f1->f2")
public static class PrintResultWithAnnotation implements
        FlatMapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, Long>>,
                        Tuple3<String, String, Long>> {

    // Merges the tuples to create tuples of <User, URL, PageRank>
    public void flatMap(
            Tuple2<Tuple2<String, String>, Tuple2<String, Long>> joinSet,
            Collector<Tuple3<String, String, Long>> collector)
            throws Exception {
        collector.collect(new Tuple3<String, String, Long>(
                joinSet.f0.f0, joinSet.f0.f1, joinSet.f1.f1));
    }
}
Listing 4.1: Demonstrating ForwardedFields Usage
The class PrintResultWithAnnotation, a FlatMap UDF operator, takes an input
tuple of the format <<String, String>, <String, Long>>, corresponding to
<<User, URL>, <URL, PageRank>>, and returns a new tuple of the format
<String, String, Long>, corresponding to <User, URL, PageRank>. A picture of
what is going on inside the FlatMap function can be seen in Figure 4.12:
2 https://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/programming_guide.html#semantic-annotations
Figure 4.12: Inside picture of the FlatMap function in Listing 4.1
This is exactly what is expressed by the ForwardedFields annotation that can be
seen just before the class definition. It is stated as follows:
@FunctionAnnotation.ForwardedFields("f0.f0 -> f0 ; f0.f1 -> f1 ; f1.f1 -> f2")
input tuple:   << User (f0.f0), URL (f0.f1) >, < URL (f1.f0), PageRank (f1.f1) >>
output tuple:  < User (f0), URL (f1), PageRank (f2) >
To explain this annotation in detail, we need to understand the input tuple
structure. The tuple <<User, URL>, <URL, PageRank>> is overall a Tuple2
representation, where field0 is <User, URL> and field1 is <URL, PageRank>; let
us for simplicity call them main-field0 and main-field1. These main fields are
in turn Tuple2 representations: in <User, URL>, User is field0 and URL is field1;
similarly in <URL, PageRank>, URL is field0 and PageRank is field1. In the
expression f0.f1 -> f1, the left-hand side f0.f1 denotes URL, which is field1 of
main-field0, and is simply copied, or forwarded, to f1, i.e., the location
field1 in the output tuple <User, URL, PageRank>. This demonstrates how
function annotations allow us to drill into a UDF's internal operation.
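To make the mapping notation concrete, here is a small illustrative parser for such annotation strings. It is our own demonstration helper; Flink interprets these strings internally and does not expose this as an API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Small illustrative parser for ForwardedFields strings such as in Listing 4.1.
public class ForwardedFieldsDemo {

    // "f0.f0->f0;f0.f1->f1;f1.f1->f2"  =>  {f0.f0=f0, f0.f1=f1, f1.f1=f2}
    public static Map<String, String> parse(String annotation) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String expr : annotation.split(";")) {
            String[] sides = expr.split("->");
            mapping.put(sides[0].trim(), sides[1].trim()); // input pos -> output pos
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<String, String> m = parse("f0.f0->f0;f0.f1->f1;f1.f1->f2");
        // URL sits at field1 of main-field0 in the input and is forwarded to
        // field1 of the flat output tuple, exactly as discussed above.
        System.out.println(m.get("f0.f1")); // f1
    }
}
```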
In our implementation, we have exploited this feature of Flink for the FlatMap
UDF operator. This can be further extended to other UDFs, given that the
annotation information is available.
Static Code Analysis (SCA) [23] is a new feature in the latest Flink release. SCA
implicitly generates the same type of semantic information that we gain using
Function Annotations; the user can thus avoid explicitly stating annotations in
order to capture the UDF's behavior. Our implementation could further leverage
static code analysis, as we would then be exposed to the input-output mapping of
all UDF operators. Extending our implementation to other UDFs in this way can
be considered a future enhancement.
4.5.2 Interactive Scala Shell
The Interactive Scala Shell for Flink is the latest addition to Flink's features;
it allows command-line execution of Flink Scala code, evaluating it line by
line. It provides a very good platform for our implementation to display the
example result sets generated by the algorithm for a given dataflow program. We
have hence added an extra command, ":ILLUSTRATE", to the shell's list of
standard commands; upon receiving this command we call our flink-illustrator
module and display the sample examples in the shell. To use this command we
need to provide two arguments, namely Flink's execution environment and the sink
operator (or the sink operator's dataset). As our implementation is limited to
Flink Java jobs, to execute via the Scala shell we have to make the necessary
adjustments and use the respective Java API modules. Appendix A.2 contains a
sample Scala program adapted to the Java API modules and the usage of the
ILLUSTRATE command.
In this chapter, we have explained the details of our algorithm implementation
as well as the extensions to Flink that one can use. The next chapter describes
the experiments and evaluations conducted on the algorithm.
Chapter 5
Evaluation
This chapter encompasses one of the main objectives of this thesis, i.e., to evaluate
the performance of the algorithm with respect to key metrics, namely:
Completeness: verifies that the examples generated at an operator cover all
the semantics of that operator,
Conciseness: verifies that the example set is as small as possible, to allow a quick
overview of the operator functionality and the dataflow,
Realism: verifies that the examples are indeed from the input sources, and
Runtime: verifies that the example generation is indeed faster than the
complete dataflow execution.
In the following sections, we define each metric, describe the experimental
setup and workload (the dataflow programs) used, and finally measure the
performance of our algorithm (with respect to the above metrics) against two
other example generation approaches:
• Downstream Propagation: a single downstream pass from source to sink.
• Sink-Pruning: pruning only at the final sink operator.
5.1 Performance Metrics
5.1.1 Completeness
The completeness of an operator O is defined using the concept of equivalence
classes. It is defined as the ratio of number of equivalence classes with at least one
example to the number of equivalence classes for the given operator. This can be
formulated as:
Completeness_O = (# of O's equivalence classes with at least one example) / (# of O's equivalence classes)
[This value will always range between 0 and 1; 1 denotes total completeness]
For example, the Filter operator has two equivalence classes for its input
records, namely:
E1 - for records passing the filtering predicate
E2 - for records failing the filtering predicate
Figure 5.1: Completeness of Filter = 0.5
Consider a scenario where Filter's input only has records that pass the filter, as
shown in Figure 5.1, where the operator accepts records with age above 18. From
the available set of examples E1 is filled and E2 is empty, i.e., only one out of
two equivalence classes has at least one example; substituting these values in
the above formula yields 1/2. Using this technique, we define per-operator
completeness for all the operators in a given dataflow program.
The overall completeness of a given dataflow program P can then be defined as
the average of per-operator completeness of all the operators in P .
Completeness_P = (Completeness_O1 + Completeness_O2 + ... + Completeness_Om) / m
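The two formulas above translate directly into code. The following is a minimal sketch; the class and method names are our own:

```java
// Direct transcription of the completeness formulas into a small sketch.
public class CompletenessMetric {

    // Per-operator completeness: filled classes over total classes, in [0, 1].
    public static double operatorCompleteness(int filledClasses, int totalClasses) {
        return Math.min(1.0, Math.max(0.0, (double) filledClasses / totalClasses));
    }

    // Overall completeness of a dataflow: the average over its m operators.
    public static double dataflowCompleteness(double... perOperator) {
        double sum = 0;
        for (double c : perOperator) sum += c;
        return sum / perOperator.length;
    }

    public static void main(String[] args) {
        // Figure 5.1: Filter with only E1 filled out of {E1, E2} => 1/2.
        double filter = operatorCompleteness(1, 2);
        System.out.println(filter);                            // 0.5
        System.out.println(dataflowCompleteness(1.0, filter)); // 0.75
    }
}
```

The conciseness and realism metrics in the following sections are analogous clamped ratios and could be computed the same way.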
5.1.2 Conciseness
The conciseness of an operator O is defined as the ratio of the number of the
operator's equivalence classes to the total number of example records for that
operator (with a ceiling at 1).
Conciseness_O = (# of O's equivalence classes) / (# of O's example records), capped at 1
[This value will always range between 0 and 1; 1 denotes that the operator has
produced the most concise set of examples]
Using the example from Figure 5.1, Filter generates two examples and has two
equivalence classes; hence the conciseness of this operator is 2/2 = 1, which
means that the records generated are indeed concise, though not complete. Given
the above method to evaluate per-operator conciseness, the overall conciseness of
a dataflow program P can be defined as the average of the per-operator conciseness
of all the operators in P.
Conciseness_P = (Conciseness_O1 + Conciseness_O2 + ... + Conciseness_Om) / m
5.1.3 Realism
The realism of an operator O is defined as the fraction of records that are real
over the total number of records generated by that operator. A record is
considered real if it comes from the source operators or is derived via the
different operations from the source operators. An operator may or may not have
synthetic records in its output set; having synthetic records reduces realism.
Realism_O = (# of real records in O's output example set) / (# of records in O's output example set)
[This value will always range between 0 and 1]
And similarly the overall realism for a dataflow program P can be seen as:
Realism_P = (Realism_O1 + Realism_O2 + ... + Realism_Om) / m
Considering the same example as in Figure 5.1, the realism at Filter is 1, as
there exist no synthetic records. Suppose that during the upstream pass a
synthetic record <PQR, 17> is added to Filter's input to restore its
completeness, as shown in Figure 5.2.
Figure 5.2: Realism of Filter after boosting completeness = 2/3
Due to the addition of the synthetic record, we indeed increase the completeness
of the Filter operator, but at the expense of a reduction in realism. Given the
nature of this problem, no solution can guarantee full completeness with full
realism; one of them is always compromised. A similar scenario may appear in a
real-world dataflow program, and one cannot guarantee a score of 1 on each of
completeness, conciseness, and realism. Our algorithm favors completeness over
realism, as it is more important to illustrate the complete behavior of an operator.
5.2 Experiments
5.2.1 Setup and Datasets
In this section, we evaluate our dataflow programs with respect to the above
performance metrics. Before diving into the details, let us briefly describe the
experimental environment. The experiments were conducted on a local Linux box
running Ubuntu 14.04 with 4 cores and 12 GB RAM. Apache Flink version 0.9.0
was used in our implementation. Most of our experiments were conducted
with local text files as input sources, while some used HDFS (Hadoop
version 2.6.0) as the storage layer.
We use locally constructed datasets to test our implementation, covering all the
possible operators with different scenarios for complete algorithm coverage.
We introduce empty equivalence classes by maneuvering the input files
accordingly; this allows us to verify the completeness aspect of the operators and
of the overall dataflow program. The dataset details can be seen in Table 5.1.
DataSet       Details                                             Format                                                #rows
UserVisits    Log of users' visits to websites during the day     csv and hdfs file (Username, URL, Visit TimeStamp)    100
UserVisitsEU  Log of EU users' visits to websites during the day  csv and hdfs file (Username, URL, Visit TimeStamp)    70
PageRanks     Contains pagerank of the respective website         csv (URL, Rank)                                       50
JobSeeker     Contains user details                               csv (UserId, Username)                                50
Jobs          Contains job and salary details                     csv (JobId, City, Salary)                             20
JobsEU        Contains job and salary details in EU               csv (JobId, City, Salary)                             20
Table 5.1: Locally created datasets used for evaluation purposes
For thorough evaluation, we define in total 13 dataflow programs that use
variations of the above datasets as input. These dataflow programs can be found in
Appendix A.3. We categorize these 13 programs into the following types (based on
the type of evaluation):
1. Single-operator: the dataflow contains only one data transformation operator
(Programs 1 to 6 and 13).
2. Multi-operator: the dataflow contains multiple data transformation operators
(Programs 7 to 9).
3. Empty equivalence classes: the dataflow inputs are manipulated such that they
result in an empty set or empty equivalence classes for the operator under test
(Programs 10 to 12).
5.2.2 Quality of Generated Examples
We evaluate the above programs using three different approaches against the
defined metrics: completeness, conciseness, and realism. The approaches used are:
1. Our Algorithm: the algorithm from our implementation, with downstream,
upstream, and full pruning passes (Section 4.4).
2. Downstream: in this approach we propagate the sample example sets from the
source LOAD operators to the sink, evaluating at each operator in the
operator tree. This approach allows us to validate the completeness metric.
3. Sink-level Pruning: a variant of our algorithm, with pruning limited
to the sink operator only rather than applied at each operator in the operator
tree. This approach allows us to validate the conciseness metric.
We ran three iterations of each program (to compensate for the initial
sampling); in Figures 5.3, 5.4 and 5.5 the Y-axis thus denotes the average of the
metrics, while the X-axis denotes the algorithm types. The idea behind multiple
executions of each program is that the initial sampling at LOAD might produce
examples that cannot always be consumed by the downstream operators to
generate an output, resulting in incompleteness. For example, two different LOADs
may produce examples with no common join key as required at JOIN, or a LOAD
may produce no examples that pass the filtering predicate at FILTER. Multiple
runs thus helped us verify whether our algorithm is effective enough to tackle
any incompleteness being introduced.
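The join-key scenario above can be sketched as a toy version of the upstream repair; this is an illustrative simplification with our own names, not the actual implementation. When the sampled sides share no join key, a synthetic record reusing an observed key is added so that the "joined" equivalence class can be filled.

```java
import java.util.*;

// Toy sketch of upstream repair at a JOIN on tuples represented as
// int[] {key, payload}. Not the thesis code; names are illustrative.
public class JoinRepairSketch {

    public static List<int[]> repair(List<int[]> left, List<int[]> right) {
        Set<Integer> rightKeys = new HashSet<>();
        for (int[] r : right) rightKeys.add(r[0]);
        boolean joinable = left.stream().anyMatch(l -> rightKeys.contains(l[0]));
        List<int[]> patched = new ArrayList<>(right);
        if (!joinable && !left.isEmpty()) {
            // copy a join key observed on the left into a synthetic right
            // record; -1 marks the payload as synthetic
            patched.add(new int[]{left.get(0)[0], -1});
        }
        return patched;
    }
}
```

The synthetic record restores completeness at the cost of realism, exactly the trade-off reported below.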
Let us analyze the experimental observations.
1. Single-operator dataflows:
With respect to completeness for the dataflows containing only one data trans-
formation operator, our algorithm achieves the highest score of 1 for all 7
programs, i.e., it illustrates all the necessary semantics of the operators.
In contrast, the Downstream approach fails to attain total completeness on more
than one occasion: at Join (due to the unavailability of a common join key) and
at Filter (due to a missing record that passes/fails the filtering predicate).
Our algorithm managed to tackle the incompleteness thus introduced. As stated
earlier, we emphasize completeness over realism; from Figure 5.3 (a/e/f) it is
clear that we introduce a synthetic record in order to address the Join
incompleteness and therefore compromise on realism. Compared to the other
approaches, the examples generated by our algorithm are more concise and attain
the highest score, whereas Downstream and Sink-Pruning remain in the range of
0.4 to 0.8. Thus, it is clear that our algorithm produces the most complete and
concise set of examples for single-operator dataflow validation.
Figure 5.3: Evaluation of Single-Operator dataflows (a) Join, (b) Union, (c) Cross, (d) Filter, (e) Join with Project, (f) Join with Project using Annotations, (g) Join using HDFS
2. Multi-operator dataflows:
Figure 5.4: Evaluation of Multi-Operator dataflows (a) Union, Join, Project, (b) Union, Filter, Join, Project, (c) Union, Cross
As seen in Figure 5.4 for multi-operator dataflows, our algorithm attains total
completeness in 2 out of 3 programs. Program 8 being a multi-operator dataflow
with FILTER, the upstream pass may not always succeed in filling the empty
equivalence class. This can be rectified using multiple iterations until the
equivalence class is filled, but at the expense of extra overhead: a longer
running time as well as an increased cost for pruning (each iteration might add
redundant records). Programs 7 and 8, containing UNION followed by JOIN in the
operator tree, result in more than one joined example in some scenarios and
hence a lack of conciseness. Even so, the conciseness is still higher than that
of the other two approaches. For multi-operator dataflows, the set of examples
produced by our algorithm may not always be totally complete, but it fares
better than the downstream-only approach.
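The FILTER case mentioned above can be illustrated with a toy sketch, again our own simplification rather than the thesis code. FILTER has two equivalence classes, records that pass and records that fail the predicate, and the upstream pass must supply a witness for an empty class. Here the witness is found by a naive bounded search; a real upstream pass would derive it from the predicate instead.

```java
import java.util.*;
import java.util.function.IntPredicate;

// Illustrative sketch of filling FILTER's two equivalence classes
// (pass / fail) for integer-valued records. Names are our own.
public class FilterFillSketch {

    public static List<Integer> fill(List<Integer> sample, IntPredicate pred) {
        List<Integer> out = new ArrayList<>(sample);
        boolean hasPass = sample.stream().anyMatch(pred::test);
        boolean hasFail = sample.stream().anyMatch(v -> !pred.test(v));
        // naive bounded search for a synthetic witness of the missing class
        for (int v = 0; v <= 1000 && (!hasPass || !hasFail); v++) {
            if (!hasPass && pred.test(v)) { out.add(v); hasPass = true; }
            else if (!hasFail && !pred.test(v)) { out.add(v); hasFail = true; }
        }
        return out;
    }
}
```

When the search fails (as it can for predicates with no witness in the searched range), the class stays empty, mirroring the residual incompleteness observed for Program 8.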
3. Empty Equivalence Classes:
Figure 5.5: Evaluation of special cases with intentional manipulation of inputs to create empty equivalence classes at (a) Join, (b) Union, (c) Cross
For the special cases of Programs 10 to 12, the input sources were intentionally
altered or kept empty so that incompleteness would result during the downstream
pass. The algorithm still managed to achieve total completeness at the expense
of zero realism, because only synthetic records traversed the dataflow. As
Figure 5.5 shows, compared to our algorithm the downstream approach achieved
very low scores on completeness, as there were no records in the input sources
with which to perform the join (Program 10), union (Program 11) or cross
(Program 12). Again, our approach managed to produce a concise set of examples
with the highest score of 1 on every occasion.
To summarize the evaluation of the quality of the generated examples, our
algorithm achieves total completeness on 12 out of 13 programs. A lack of
realism reflects the fact that the downstream pass resulted in incompleteness,
and irrespective of that the algorithm achieved the highest completeness. The
examples generated are always concise and achieved a score of 1 in 11 out of 13
programs. The conciseness is higher than that of the other two approaches
because the Downstream approach performs no pruning at all, whereas Sink-Pruning
prunes only at the final sink operator. Our implementation achieves good
conciseness via pruning while keeping completeness intact. We can hence confirm
that our algorithm, with its multiple passes (downstream, upstream and pruning),
is qualitatively and quantitatively the better approach to the example
generation problem.
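The pruning idea underlying this comparison can be sketched as a simple coverage test (an illustrative simplification with our own names): a record is kept only if it covers at least one equivalence class not already covered by earlier-kept records.

```java
import java.util.*;

// Coverage-based pruning sketch. Each example record is represented
// abstractly by the set of equivalence-class labels it covers.
public class PruneSketch {

    public static List<Set<String>> prune(List<Set<String>> records) {
        List<Set<String>> kept = new ArrayList<>();
        Set<String> covered = new HashSet<>();
        for (Set<String> rec : records) {
            if (!covered.containsAll(rec)) { // contributes something new
                kept.add(rec);
                covered.addAll(rec);
            }
        }
        return kept;
    }
}
```

Running this test at every operator, rather than only at the sink, is what separates our pruning from the Sink-Pruning baseline.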
5.2.3 Running Time
In this section, we compare the performance of the implementation with respect
to running time. In this experiment, we ran our implementation against input
sources of varying sizes, ranging from 17 MB to 1.1 GB. As we ran our
experiments on both a local machine and a cluster, we limited the maximum
dataset size to 1.1 GB. The datasets were obtained from the SNAP Datasets
Library [24]; the details are listed in Table 5.2:
DataSet Details Size (in MB)
Reddit Resubmitted contents on reddit.com 17
Skitter Internet topology graph, from traceroutes run daily in 2005 150
Patent-Citations Citation networks among US Patents 280
Pokec Pokec online social network 500
LiveJournal1 LiveJournal online social network 1100
Table 5.2: Experiment datasets’ descriptions and sizes
For this experiment, we ran a variation of dataflow Program 1, in which we JOIN
two datasets (any dataset from Table 5.2 and a random collection of integers).
All the datasets in Table 5.2 contain an id field as an integer, hence we join
on this id (from the dataset) and the integer (from the collection). The purpose
of this experiment was to test the runtime of the flink-illustrator over varying
input dataset sizes, hence we limited the dataflow to the simple logic of a join
on integers.
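The logic of this experimental dataflow can be sketched in plain Java without the Flink API; this is an illustrative reconstruction, not the actual benchmark code. It is a hash join of dataset records on their integer id against the random integer collection.

```java
import java.util.*;

// Plain-Java sketch of the experiment's join-on-id logic. Records are
// int[] tuples whose first field is the id; the integer collection is
// used as the build side of a hash join.
public class JoinExperimentSketch {

    public static List<int[]> joinOnId(List<int[]> dataset, List<Integer> ints) {
        Set<Integer> keys = new HashSet<>(ints); // build side
        List<int[]> joined = new ArrayList<>();
        for (int[] rec : dataset)                // probe side
            if (keys.contains(rec[0]))           // rec[0] is the id field
                joined.add(rec);
        return joined;
    }
}
```

The illustrator runs this same shape of dataflow on a small sample, while the full Flink job runs it over every input record.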
Figure 5.6 shows the difference in runtime (in seconds) between the
flink-illustrator module and the Flink job execution. For a given dataflow
program, the flink-illustrator module executes on a small sample of the input
dataset to produce the concise and complete set of examples, whereas the Flink
job execution means the total execution of the dataflow over every input record.
Figure 5.6: Comparison of Illustrator runtime vs. Flink Job Execution time for varying sizes of input
Figure 5.6 clearly shows that the illustrator module indeed provides a quick way
to get an overview of the dataflow, irrespective of the input dataset sizes,
with runtimes ranging from just 0.5 to 2 seconds. This allows the Flink user to
verify the dataflow very quickly and avoids the need to run the complete job,
with its longer waiting times (around 5 to 40 seconds for the above input
sizes), over and over again in order to make the necessary adjustments to the
dataflow program.
This verifies our claim that flink-illustrator provides a fast and effective way
to understand and learn about a dataflow (and its operators), compared to the
complete execution of the job, when it comes to large-scale input data.
Chapter 6
Related Work
The work on generating example tuples dates back to 1985 [25], when automatic
techniques were introduced and studied to generate a test database for a given
query. This work was limited to relational database systems and assisted in
testing relational queries. The methods in [25] are based on illustrating the
semantics of the complete query, whereas our work illustrates the semantics of
every operator in the given dataflow program.
QAGen [26] improves on traditional (trial and error) test database generation
techniques and generates a query-aware test database that facilitates a wide
range of DBMS testing tasks. QAGen takes the query and a set of constraints
defined on the query as input and generates a query-aware test database with all
the intermediate query results. This is similar to our approach for dataflow
programs, where we take the dataflow program as input and generate example
tuples after every intermediate operator.
Recent work on schema mapping via data examples [27] presents good data examples
that illustrate schema mappings between two databases. The work in [27]
documents the capabilities and limitations of data examples in explaining and
understanding schema mappings. This technique could indeed be extended to our
work, most notably for UDF mappings, although the main requirement would be to
make the schema of each UDF operator available beforehand.
Reverse query processing [28] is closely related to the logic of our upstream
pass: it takes a query and a result as input and returns a possible database
instance that could have produced that result for the query. In the upstream
pass, we likewise know which output will resolve the incompleteness at an
operator, and we introduce a possible set of input examples that will produce
that output. This work is effective for traditional relational operators (JOIN,
CROSS, UNION, AGGREGATIONS), but it makes no mention whatsoever of UDFs.
The work in [29] extends the concepts from [12] by introducing a new technique
to generate example data for dataflow programs. This technique, known as
Symbolic Example Data GEneration (SEDGE), uses dynamic symbolic execution for
example generation. It relies on transforming the program into symbolic
constraints and then solving these constraints using a symbolic reasoning
engine. Like [12], this work has been tried and tested only on Apache Pig's
dataflow programs with Pig Latin scripts.
Chapter 7
Conclusion
This thesis has shown the significance of dataflow programming in current
big-data environments. We provided a comparative study of three different types
of big-data systems, Apache Hadoop, Apache Pig and Apache Flink, that take
advantage of the dataflow programming paradigm. The presented study makes
several noteworthy observations about the challenges of dataflow programming in
these systems, namely debugging and visualization, and about how the example
generation technique can overcome the related issues.
Apache Flink provides a comprehensive computing platform that supports fast
data-intensive dataflow processing with inherent support for iterative
programming and traditional relational operators (join, cross, union, etc.).
This thesis addresses the limitations of dataflow programming in the Apache
Flink environment by implementing the example generation “Illustrator” module.
Flink-Illustrator provides the user a set of complete and concise example tuples
after each operator for better understanding and easier visualization of the
dataflow.
Example generation is a concept that is difficult to realize and implement in
practice. When basic techniques such as downstream-only or upstream-only
propagation are used, there are many scenarios that lead to the generation of
incomplete examples or lengthy, repetitive examples at a given operator. In
order to avoid these situations we use a combination of downstream, upstream and
pruning passes to generate a complete and concise set of examples. The
downstream pass takes the initial sampled set of examples and propagates it
through the dataflow, the upstream pass handles incompleteness (if any) by
introducing synthetic records, and pruning removes redundant examples (examples
that add no extra meaning to the present set).
The algorithm indeed proves efficient compared to the other techniques
(downstream-only propagation, sink-only pruning) and generates the most complete
and concise set of examples. The examples generated might not always be real
(taken from the input source), because synthetic examples may be introduced to
boost completeness. With respect to runtime, Flink-Illustrator is definitely the
way forward for faster debugging and dataflow/operator understanding in
comparison to complete dataflow execution. This is clearly visible from the
experimental observations on large inputs: the illustrator is 5x to 35x faster
than complete execution over increasing input sizes.
We have made use of some features of Flink to make our implementation novel and
different from the one already existing in Pig. Apache Flink allows us to read
through UDF operators via its semantic annotations; we make use of this property
in our implementation to read the mappings of tuples in the FlatMap UDF
operator. For visualization purposes, we integrate our implementation with
Flink's Interactive Scala Shell by introducing a new command “:ILLUSTRATE” that
invokes the flink-illustrator module and displays the set of examples in the
Scala shell.
We conclude by saying that example generation is indeed a technique that makes
life easier for a developer working on big-data systems. We extend this
technique to dataflow programs in Apache Flink by implementing the tried and
tested algorithm to generate a complete and concise set of examples, for better
program understanding and visualization.
7.1 Discussion and Future Work
As stated in Section 3.5, our implementation is limited to Flink's Java dataflow
programs; we could further extend it to the available Scala API, making the
flink-illustrator a language-friendly tool. We are also restricted to tuple
formats in our input datasets, as they form the basis of our implementation and
are traversed throughout the dataflow. It would be a great addition to support
POJO input types and then use a reflection API such as Trail [30] to access the
appropriate properties of the objects. Further additions could be support for
the richer UDF operators (GroupReduce, etc.) available in Flink as well as
support for iterative dataflows. In order to extend support for new operators
(UDFs and iterative), we must broaden the language used for defining equivalence
classes as well as the rules used to create synthetic records during an upstream
pass. We can take advantage of different techniques, such as reverse query
processing [28] or symbolic query processing [26], to enhance these rules.
Hence, more research-based feature extensions can be implemented as advancements
of this thesis work, thereby covering different aspects of Flink dataflow
programming.
Appendix A
Appendix
A.1 Word-Count Example in different Dataflow
Systems
A.1.1 Apache Pig
A = load ’data.txt’ as (line:chararray);
B = foreach A generate TOKENIZE(line) as tokens;
C = foreach B generate flatten(tokens) as words;
D = group C by words;
E = foreach D generate group, COUNT(C);
F = order E by $1;
dump F;
Listing A.1: Word-Count in Apache Pig
A.1.2 Apache Hadoop
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Listing A.2: Word-Count in Apache Hadoop
A.1.3 Apache Flink
public class WordCountExample {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env =
                ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> text = env.fromElements(
                "Who's there?",
                "I think I hear them. Stand, ho! Who's there?");

        DataSet<Tuple2<String, Integer>> wordCounts = text
                .flatMap(new LineSplitter())
                .groupBy(0)
                .sum(1);

        wordCounts.print();
    }

    public static class LineSplitter
            implements FlatMapFunction<String, Tuple2<String, Integer>> {

        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}
Listing A.3: Word-Count in Apache Flink
A.2 Sample Scala Script for Flink’s Interactive
Scala Shell
import java.util.regex.Pattern

val environ =
  org.apache.flink.api.java.ExecutionEnvironment.getExecutionEnvironment
val input1 = environ.readTextFile("src/test/resources/CrossInput1")
val input2 = environ.readTextFile("src/test/resources/CrossInput2")

class OneReader extends org.apache.flink.api.common.functions.FlatMapFunction[
    String, org.apache.flink.api.java.tuple.Tuple2[Integer, String]] {
  private val SEPARATOR = Pattern.compile("[ \t,]")
  def flatMap(readLineFromFile: String, collector: org.apache.flink.util.Collector[
      org.apache.flink.api.java.tuple.Tuple2[Integer, String]]) {
    if (!readLineFromFile.startsWith("%")) {
      val tokens = SEPARATOR.split(readLineFromFile)
      val id = java.lang.Integer.parseInt(tokens(0))
      val city = tokens(1)
      collector.collect(
        new org.apache.flink.api.java.tuple.Tuple2[Integer, String](id, city))
    }
  }
}

class TwoReader extends org.apache.flink.api.common.functions.FlatMapFunction[
    String, org.apache.flink.api.java.tuple.Tuple2[Integer, Double]] {
  private val SEPARATOR = Pattern.compile("[ \t,]")
  def flatMap(readLineFromFile: String, collector: org.apache.flink.util.Collector[
      org.apache.flink.api.java.tuple.Tuple2[Integer, Double]]) {
    if (!readLineFromFile.startsWith("%")) {
      val tokens = SEPARATOR.split(readLineFromFile)
      val id = java.lang.Integer.parseInt(tokens(0))
      val salary = java.lang.Double.parseDouble(tokens(1))
      collector.collect(
        new org.apache.flink.api.java.tuple.Tuple2[Integer, Double](id, salary))
    }
  }
}

val set1 = input1.flatMap(new OneReader())
val set2 = input2.flatMap(new TwoReader())
val crossSet = set1.cross(set2)
:ILLUSTRATE environ crossSet // Call to Flink-Illustrator
Listing A.4: Scala script from Flink’s Interactive scala shell with
:ILLUSTRATE command usage
A.3 Experiment Dataflow Programs
Program 1 (page rank for user-visited urls):
1. LOAD set UserVisits using a csv input source
2. FLATMAP to input tuple format < user, url >
3. DISTINCT on UserVisits
4. LOAD set PageRanks using a csv input source
5. FLATMAP to input tuple format < url, pagerank >
6. DISTINCT on PageRanks
7. JOIN UserVisits and PageRanks
Program 2 (combining two continent’s user-visited urls):
1. LOAD set UserVisits using a csv input source
2. FLATMAP to input tuple format < user, url, count >
3. LOAD set UserVisitsEU using a csv input source
4. FLATMAP to input tuple format < user, url, count >
5. UNION UserVisits and UserVisitsEU sets
Program 3 (options for available jobs):
1. LOAD set JobSeeker using a csv input source
2. FLATMAP to input tuple format < userID, user >
3. LOAD set Jobs using a csv input source
4. FLATMAP to input tuple format < jobId, salary >
5. CROSS set JobSeeker and Jobs
Program 4 (option for high paying jobs):
1. LOAD set JobSeeker using a csv input source
2. FLATMAP to input tuple format < userID, user >
3. LOAD set Jobs using a csv input source
4. FLATMAP to input tuple format < jobId, salary >
5. FILTER set Jobs on salary field
6. CROSS set JobSeeker and Jobs
Program 5 (users visiting high pagerank urls using simple flatmap for
projection):
1. LOAD set UserVisits using a csv input source
2. FLATMAP to input tuple format < user, url >
3. DISTINCT on UserVisits
4. LOAD set PageRanks using a csv input source
5. FLATMAP to input tuple format < url, pagerank >
6. DISTINCT on PageRanks
7. JOIN UserVisits and PageRanks
8. PROJECT joined set in the tuple format as < user, url, pagerank >
Program 6 (users visiting high pagerank urls using function annotated
flatmap for projection):
1. LOAD set UserVisits using a csv input source
2. FLATMAP to input tuple format < user, url >
3. DISTINCT on UserVisits
4. LOAD set PageRanks using a csv input source
5. FLATMAP to input tuple format < url, pagerank >
6. DISTINCT on PageRanks
7. JOIN UserVisits and PageRanks
8. PROJECT joined set in the tuple format as < user, url, pagerank > by setting
forwarded fields annotations.
Program 7 (page rank for combined user-visited urls):
1. LOAD set UserVisits using a csv input source
2. FLATMAP to input tuple format < user, url >
3. LOAD set UserVisitsEU using a csv input source
4. FLATMAP to input tuple format < user, url >
5. UNION UserVisits and UserVisitsEU sets
6. LOAD set PageRanks using a csv input source
7. FLATMAP to input tuple format < url, pagerank >
8. JOIN unioned set of UserVisits and UserVisitsEU with Page-Ranks
9. PROJECT joined set
Program 8 (combined user-visits urls with only high page ranks):
1. LOAD set UserVisits using a csv input source
2. FLATMAP to input tuple format < user, url >
3. LOAD set UserVisitsEU using a csv input source
4. FLATMAP to input tuple format < user, url >
5. UNION UserVisits and UserVisitsEU sets
6. LOAD set PageRanks using a csv input source
7. FLATMAP to input tuple format < url, pagerank >
8. FILTER set PageRanks on field pagerank
9. JOIN unioned set of UserVisits and UserVisitsEU with Page-Ranks
10. PROJECT joined set
Program 9 (options for finding jobs in different continents):
1. LOAD set JobSeeker using a csv input source
2. FLATMAP to input tuple format < userID, user >
3. LOAD set Jobs using a csv input source
4. FLATMAP to input tuple format < jobID, city, salary >
5. LOAD set JobsEU using a csv input source
6. FLATMAP to input tuple format < jobID, city, salary >
7. UNION sets Jobs and JobsEU
8. CROSS set JobSeeker and unioned set of Jobs and JobsEU
Program 10:
Variation of Program 1, with no common key (url) in both input sets, thereby
resulting in an empty equivalence class at JOIN.
Program 11:
Variation of Program 2, with no/blank input in one of the files, thereby
resulting in an empty equivalence class at UNION.
Program 12:
Variation of Program 3, with no/blank input in one of the files, thereby
resulting in an empty equivalence class at CROSS.
Program 13:
Variation of Program 1, with the input source as a Hadoop HDFS file.
A.4 Flink-Illustrator Source Code
Source Code
Bibliography
[1] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg,
H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, and others,
“Apache Hadoop goes realtime at Facebook,” in Proceedings of the 2011 ACM
SIGMOD International Conference on Management of data, pp. 1071–1080,
ACM, 2011.
[2] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large
clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise,
O. Kao, M. Leich, U. Leser, V. Markl, and others, “The Stratosphere platform
for big data analytics,” The VLDB Journal, vol. 23, no. 6, pp. 939–964, 2014.
[4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark:
cluster computing with working sets,” in Proceedings of the 2nd USENIX
conference on Hot topics in cloud computing, vol. 10, p. 10, 2010.
[5] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy,
C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, “Building a high-level
dataflow system on top of Map-Reduce: the Pig experience,” Proceedings of
the VLDB Endowment, vol. 2, no. 2, pp. 1414–1425, 2009.
[6] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee,
M. Stonebraker, N. Tatbul, and S. Zdonik, “Aurora: a new model and archi-
tecture for data stream management,” The VLDB Journal, vol. 12, no. 2,
pp. 120–139, 2003.
[7] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed
data-parallel programs from sequential building blocks,” in ACM SIGOPS
Operating Systems Review, vol. 41, pp. 59–72, ACM, 2007.
[8] R. H. Arpaci-Dusseau, “Run-time adaptation in river,” ACM Transactions
on Computer Systems (TOCS), vol. 21, no. 1, pp. 36–86, 2003.
[9] M. Stonebraker, J. Chen, N. Nathan, C. Paxson, and J. Wu, “Tioga: Pro-
viding data management support for scientific visualization applications,” in
VLDB, vol. 93, pp. 25–38, Citeseer, 1993.
[10] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy,
and S. Hand, “CIEL: A Universal Execution Engine for Distributed Data-
Flow Computing.,” in NSDI, vol. 11, pp. 9–9, 2011.
[11] W. Lee, S. J. Stolfo, and K. W. Mok, “Mining in a data-flow environment:
Experience in network intrusion detection,” in Proceedings of the fifth ACM
SIGKDD international conference on Knowledge discovery and data mining,
pp. 114–124, ACM, 1999.
[12] C. Olston, S. Chopra, and U. Srivastava, “Generating example data for
dataflow programs,” in Proceedings of the 2009 ACM SIGMOD International
Conference on Management of data, pp. 245–256, ACM, 2009.
[13] T. B. Sousa, “Dataflow Programming Concept, Languages and Applications,”
in Doctoral Symposium on Informatics Engineering, 2012.
[14] H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, “Map-reduce-merge:
simplified relational data processing on large clusters,” in Proceedings of
the 2007 ACM SIGMOD international conference on Management of data,
pp. 1029–1040, ACM, 2007.
[15] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke,
“Nephele/PACTs: a programming model and execution framework for web-scale
analytical processing,” in Proceedings of the 1st ACM symposium on Cloud
computing, pp. 119–130, ACM, 2010.
[16] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, “HaLoop: efficient iterative
data processing on large clusters,” Proceedings of the VLDB Endowment,
vol. 3, no. 1-2, pp. 285–296, 2010.
[17] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig latin: a
not-so-foreign language for data processing,” in Proceedings of the 2008 ACM
SIGMOD international conference on Management of data, pp. 1099–1110,
ACM, 2008.
[18] F. Hueske, M. Peters, A. Krettek, M. Ringwald, K. Tzoumas, V. Markl,
and J. Freytag, “Peeking into the optimization of data flow programs with
mapreduce-style udfs,” in Data Engineering (ICDE), 2013 IEEE 29th Inter-
national Conference on, pp. 1292–1295, IEEE, 2013.
[19] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Ni-
jkamp, and D. Warneke, “MapReduce and PACT-Comparing Data Parallel
Programming Models.,” in BTW, pp. 25–44, 2011.
[20] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, “Spinning fast iterative
data flows,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1268–
1279, 2012.
[21] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan, “Mochi: visual
log-analysis based tools for debugging hadoop,” in USENIX Workshop on Hot
Topics in Cloud Computing (HotCloud), San Diego, CA, vol. 6, 2009.
[22] S. Chaudhuri, R. Motwani, and V. Narasayya, “On random sampling over
joins,” in ACM SIGMOD Record, vol. 28, pp. 263–274, ACM, 1999.
[23] F. Hueske, A. Krettek, and K. Tzoumas, “Enabling operator reorder-
ing in data flow programs through static code analysis,” arXiv preprint
arXiv:1301.4200, 2013.
[24] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford Large Network
Dataset Collection.” http://snap.stanford.edu/data, June 2014.
[25] H. Mannila and K. J. Räihä, “Test data for relational queries,” in Proceedings
of the fifth ACM SIGACT-SIGMOD symposium on Principles of database
systems, pp. 217–223, ACM, 1985.
[26] C. Binnig, D. Kossmann, E. Lo, and M. T. Özsu, “QAGen: generating query-
aware test databases,” in Proceedings of the 2007 ACM SIGMOD interna-
tional conference on Management of data, pp. 341–352, ACM, 2007.
[27] B. Alexe, B. Ten Cate, P. G. Kolaitis, and W.-C. Tan, “Designing and refining
schema mappings via data examples,” in Proceedings of the 2011 ACM SIG-
MOD International Conference on Management of data, pp. 133–144, ACM,
2011.
Bibliography 75
[28] C. Binnig, D. Kossmann, and E. Lo, “Reverse query processing,” in Data
Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on,
pp. 506–515, IEEE, 2007.
[29] K. Li, C. Reichenbach, Y. Smaragdakis, Y. Diao, and C. Csallner, “SEDGE:
Symbolic example data generation for dataflow programs,” in Automated Soft-
ware Engineering (ASE), 2013 IEEE/ACM 28th International Conference on,
pp. 235–245, IEEE, 2013.
[30] D. Green, “Trail: The Reflection API,” The Java Tutorial Continued: The
Rest of the JDK (TM). Addison-Wesley Pub Co, 1998.