Big Data Mining:
Towards implementing Weka-on-Spark
A dissertation submitted to The University of Manchester for the degree of
Master of Science
in the Faculty of Engineering and Physical Sciences
2014
Koliopoulos Aris Kyriakos
School of Computer Science
List of Contents
List of Contents....................................................................................................... 2
List of Figures......................................................................................................... 5
List of Tables........................................................................................................... 7
List of Abbreviations............................................................................................... 8
Abstract................................................................................................................. 10
Declaration............................................................................................................ 12
Intellectual Property Statement............................................................................. 13
Acknowledgements............................................................................................... 14
1 Introduction......................................................................................................... 16
1.1 Distributed Computing Frameworks............................................................ 17
1.2 Data Mining Tools........................................................................................ 18
1.3 Project Objectives........................................................................................ 19
1.4 Implementation Summary............................................................................ 19
1.5 Evaluation Strategy...................................................................................... 21
1.6 Project Achievements................................................................................... 22
1.7 Overview of Dissertation............................................................................. 23
2 Literature Review................................................................................................ 24
2.1 Data Mining................................................................................................. 24
2.1.1 Classification......................................................................................... 24
2.1.2 Regression............................................................................................. 25
2.1.3 Clustering.............................................................................................. 26
2.1.4 Association Rule Learning.................................................................... 26
2.1.5 Data Mining System Development....................................................... 27
2.1.5.1 Toolkit Based Approaches............................................. 27
2.1.5.2 Statistical Language Based Approaches......................................... 28
2.1.5.3 Approach Selection........................................................................ 28
2.1.6 Partitioning and Parallel Performance..................................................29
2.2 Distributed Computing Frameworks............................................................ 30
2.2.1 MapReduce........................................................................................... 30
2.2.2 Hadoop.................................................................................................. 31
2.2.2.1 Beyond Hadoop.............................................................................. 32
2.2.3 Iterative MapReduce............................................................................. 33
2 Of 121
2.2.4 Distributed Systems for In-Memory Computations.............................. 34
2.2.4.1 In-Memory Data Grids................................................................... 34
2.2.4.2 Piccolo............................................................................................ 34
2.2.4.3 GraphLab........................................................................................ 35
2.2.4.4 Spark.............................................................................................. 37
2.2.5 Distributed Computing Framework Selection......................................39
2.3 Distributed Data Mining.............................................................................. 39
2.3.1 Data Mining on MapReduce................................................................. 40
2.3.2 R on MapReduce................................................................................... 41
2.3.3 Distributed Weka................................................................................... 43
2.3.4 MLBase................................................................................................. 44
2.4 Summary...................................................................................................... 46
3 System Architecture............................................................................................ 48
3.1 Required Architectural Components............................................................ 48
3.2 Multi-tier Architecture................................................................................. 48
3.2.1 Infrastructure Layer............................................................................... 49
3.2.2 Distributed Storage Layer..................................................................... 50
3.2.3 Batch Execution Layer.......................................................................... 52
3.2.3.1 Spark and Main-memory Caching................................................. 53
3.2.4 Application Layer.................................................................................. 54
3.2.5 CLI........................................................................................................ 55
3.3 Cluster Monitoring....................................................................................... 55
3.4 Summary...................................................................................................... 56
4 Execution Model................................................................................................. 57
4.1 Weka on MapReduce.................................................................................... 57
4.2 Task Initialisation......................................................................................... 58
4.3 Headers......................................................................................................... 61
4.4 Classification and Regression...................................................................... 63
4.4.1 Model Training...................................................................................... 63
4.4.2 Model Testing and Evaluation............................................................... 65
4.5 Association Rules......................................................................................... 66
4.6 Clustering..................................................................................................... 69
4.7 Summary...................................................................................................... 69
5 System Evaluation............................................................................................... 71
5.1 Evaluation Metrics....................................................................................... 71
5.2 System Configuration................................................................................... 72
5.3 Evaluation Results........................................................................................ 74
5.3.1 Execution Time..................................................................................... 74
5.3.2 Scaling Efficiency................................................................................. 79
5.3.2.1 Weak Scaling.................................................................................. 79
5.3.2.2 Strong Scaling................................................................................ 80
5.3.3 Main-Memory Caching......................................................................... 84
5.3.3.1 Caching overheads......................................................................... 84
5.3.3.2 Caching and Performance.............................................................. 86
5.3.4 IO Utilisation......................................................................................... 89
5.4 Caching Strategy Selection Algorithm......................................................... 92
5.5 Summary...................................................................................................... 96
6 Concluding remarks............................................................................................ 98
6.1 Summary...................................................................................................... 98
6.2 Further Work................................................................................................ 99
6.2.1 Clustering.............................................................................................. 99
6.2.2 Stream Processing............................................................................... 100
6.2.3 Declarative Data Mining..................................................................... 102
6.3 Conclusion.................................................................................................. 103
References........................................................................................................... 104
Appendix 1 – Benchmarking Data...................................................................... 111
Appendix 2 – Installation Guide.......................................................................... 115
Appendix 3 – User Guide.................................................................................... 117
Appendix 4 – Main-Memory Monitoring using CloudWatch............................. 120
Total Word Count: 23003
List of Figures
Figure 1.1: Cluster Architecture [5]...................................................................... 17
Figure 1.2: System Architecture............................................................................ 20
Figure 2.1: The Data Mining Process.................................................................... 24
Figure 2.2: Supervised Learning Process [22]...................................................... 25
Figure 2.3: MapReduce Execution Overview [8]................................................. 31
Figure 2.4: Hadoop Tech Stack [36]..................................................................... 32
Figure 2.5: HaLoop and MapReduce [35]............................................................ 33
Figure 2.6: GraphLab Consistency Mechanisms[44]............................................ 36
Figure 2.7: RDD Lineage Graph [14]................................................................... 37
Figure 2.8: Ricardo [52]........................................................................................ 42
Figure 2.9: Distributed Weka [59]......................................................................... 44
Figure 2.10: MLBase Architecture [60]................................................................ 45
Figure 3.1: System Architecture............................................................................ 49
Figure 3.2: HDFS Architecture [64]...................................................................... 51
Figure 3.3: Initialisation Process........................................................................... 53
Figure 4.1: Execution Model................................................................................. 58
Figure 4.2: WekaOnSpark's main thread................................................ 59
Figure 4.3: Task Executor..................................................................................... 60
Figure 4.4: Lineage Graph.................................................................................... 61
Figure 4.5: Header creation MapReduce job........................................................ 62
Figure 4.6: Header Creation Map Function.......................................................... 62
Figure 4.7: Header Creation Reduce Function...................................................... 62
Figure 4.8: Model Training Map Function............................................................ 64
Figure 4.9: Model Aggregation Reduce Function................................................. 65
Figure 4.10: Classifier Evaluation Map Function................................................. 66
Figure 4.11: Evaluation Reduce Function............................................................. 66
Figure 4.12: Association Rules job on Spark........................................................ 67
Figure 4.13: Candidate Generation Map Function................................................ 67
Figure 4.14: Candidate Generation and Validation Reduce Function...................68
Figure 4.15: Validation Phase Map Function........................................................ 68
Figure 5.1: Execution times for SVM................................................................... 75
Figure 5.2: Weak Scaling Efficiencies.................................................................. 80
Figure 5.3: Strong Scaling for SVM..................................................................... 81
Figure 5.4: Strong Scaling for Linear Regression................................................. 81
Figure 5.5: Strong Scaling for FP-Growth............................................................ 82
Figure 5.6: Strong Scaling on Weka-On-Hadoop.................................................83
Figure 5.7: Main-memory time-line....................................................... 86
Figure 5.8: Main-Memory Use Reduction............................................................ 87
Figure 5.9: Execution Time Overhead.................................................................. 87
Figure 5.10: Average per-instance disk writes...................................................... 88
Figure 5.11: Network Traffic................................................................................. 90
Figure 5.12: Per-instance average of network and disk utilisation.......................91
Figure 5.13: CPU utilisation................................................................................. 92
Figure 5.14: Storage Level Selection Process....................................................... 93
List of Tables
Table 5.1: Execution Times for SVM on Weka-On-Spark.................................... 74
Table 5.2: Execution Times for SVM on Weka-On-Hadoop................................74
Table 5.3: Speed-up............................................................................................... 76
Table 5.4: CPU Utilisation of Weka-On-Spark..................................................... 77
Table 5.5: CPU Utilisation of Weka-On-Hadoop.................................................. 77
Table 5.6: Main-memory utilisation of Weka-On-Spark....................................... 78
Table 5.7: Main-memory utilisation of Weka-On-Hadoop...................................78
Table 5.8: RDD size as percentage of the original on-disk value (I).................... 85
Table 5.9: RDD size as percentage of the original on-disk value (II)...................85
Table 5.10: Execution Times................................................................................. 96
Table 5.11: Failed Tasks........................................................................................ 97
List of Abbreviations
AMI – Amazon Machine Images
API – Application Programming Interface
AWS – Amazon Web Services
BDM – Big Data Mining
CLI – Command Line Interface
CPU – Central Processing Unit
EBS – Elastic Block Store
EC2 – Elastic Compute Cloud
ECU – EC2 Compute Unit
EMR – Elastic Map Reduce
GC – Garbage Collector
GUI – Graphical User Interface
HDFS – Hadoop Distributed File System
IMDG – In-Memory Data Grids
IO – Input/Output
JNI – Java Native Interface
JVM – Java Virtual Machines
MPI – Message Passing Interface
RDD – Resilient Distributed Datasets
SSD – Solid State Drives
SSH – Secure SHell
SVM – Support Vector Machines
VM – Virtual Machine
WEKA – Waikato Environment for Knowledge Analysis
YARN – Yet Another Resource Negotiator
Abstract
Data generation and collection across all domains are increasing exponentially.
Knowledge discovery and decision making demand the ability to process and
extract insights from “Big” Data in a scalable and efficient manner.
The traditional cluster-based Big Data platform Hadoop provides a scalable
solution but imposes performance overheads because it supports only on-disk
data. The Data Mining algorithms used in knowledge discovery usually require
multiple iterations over the dataset and thus multiple, slow disk accesses.
In contrast, modern clusters possess increasing amounts of main-memory, which
efficient caching mechanisms can exploit for performance.
Apache Spark is an innovative distributed computing framework that sup-
ports in-memory computations. The objective of this dissertation is to design
and develop a scalable Data Mining framework to run on top of Spark and to
identify and document the advantages and disadvantages of main-memory
caching on Data Mining workloads.
The workloads consisted of distributed implementations of Weka's Data
Mining algorithms. Benchmarking was performed by testing seven different
caching strategies on different workloads, measuring elapsed time and monit-
oring resource utilisation.
The project contributions are three-fold:
1. Design and development of a distributed Data Mining framework that
achieves near-linear scaling in executing Data Mining workloads in
parallel;
2. Analysis of the behaviour of distributed main-memory caching mech-
anisms on different Data Mining execution scenarios;
3. Design and development of an automated caching strategy selection
mechanism that assesses dataset and cluster characteristics and selects
an appropriate caching scheme.
The system was benchmarked using Linear Regression, Support Vector Ma-
chines and the FP-Growth algorithm on datasets from 5GB up to 80GB. The
results demonstrate that Weka-On-Spark outperforms Weka-On-Hadoop on
identical workloads by a factor of 2.36 on average and by up to four times at
small scale.
Weak scaling efficiency measures a parallel system's ability to efficiently
utilise increasing numbers of processing nodes. Average weak scaling effi-
ciency was measured at 91.4%, which is within 10% of an ideal parallel sys-
tem. Serialisation and compression were found to decrease main-memory util-
isation by 40% at only a 5% execution time penalty. Finally, the proposed
caching mechanism reduces execution times by up to 25% compared to the de-
fault strategy and diminishes the risk of main-memory exceptions.
Declaration
No portion of the work referred to in the dissertation has been submitted in
support of an application for another degree or qualification of this or any
other university or other institute of learning.
Intellectual Property Statement
1. The author of this dissertation (including any appendices and/or sched-
ules to this dissertation) owns certain copyright or related rights in it
(the “Copyright”) and s/he has given The University of Manchester
certain rights to use such Copyright, including for administrative pur-
poses.
2. Copies of this dissertation, either in full or in extracts and whether in
hard or electronic copy, may be made only in accordance with the
Copyright, Designs and Patents Act 1988 (as amended) and regulations
issued under it or, where appropriate, in accordance with licensing
agreements which the University has entered into. This page must
form part of any such copies made.
3. The ownership of certain Copyright, patents, designs, trade marks and
other intellectual property (the “Intellectual Property”) and any repro-
ductions of copyright works in the dissertation, for example graphs
and tables (“Reproductions”), which may be described in this disserta-
tion, may not be owned by the author and may be owned by third
parties. Such Intellectual Property and Reproductions cannot and must
not be made available for use without the prior written permission of
the owner(s) of the relevant Intellectual Property and/or Reproduc-
tions.
4. Further information on the conditions under which disclosure, public-
ation and commercialisation of this dissertation, the Copyright and any
Intellectual Property and/or Reproductions described in it may take
place is available in the University IP Policy (see http://documents.-
manchester.ac.uk/display.aspx?DocID=487), in any relevant Disserta-
tion restriction declarations deposited in the University Library, The
University Library’s regulations (see http://www.manchester.ac.uk/lib-
rary/aboutus/regulations) and in The University’s Guidance for the
Presentation of Dissertations.
Acknowledgements
I would like to thank my supervisor, Professor John A. Keane, for tirelessly
providing me with valuable advice, motivation, ideas and inspiration through-
out this project.
I would like to thank Dr. Firat Tekiner for his insights on building and eval-
uating an industry-level Big Data platform.
I would also like to thank Dr. Mark Hall for exchanging ideas on how to
implement distributed versions of Weka's algorithms.
Finally, I would like to thank my parents and my brother for their continu-
ous moral and financial support.
To my family
1 Introduction
Datasets across all domains are increasing in size exponentially [1]. Gantz et
al. [2] estimated that, in 2013, 4 zettabytes (10^21 bytes) of data were gener-
ated worldwide and expected this number to increase ten-fold by 2020. These
developments gave rise to the term “Big Data”.
According to Gartner in 2012, Big Data is “high-volume, high-velocity,
and/or high-variety information assets that require new forms of processing to
enable enhanced decision making, insight discovery and process optimiza-
tion” [3]. In practice, the term refers to datasets that are increasingly difficult
to collect, curate and process using traditional methodologies.
Another aspect driving the development of Big Data technologies forward is
the emerging trend of data-driven decision-making. For example,
McKinsey in 2012 [4] calculated that the health-care system in the USA could
save up to $300bn by better understanding of domain-specific data (clinical
trials, health insurance transactions, wearable sensors etc.). This trend requires
processing techniques to transform data to valuable insights. The field that ad-
dresses knowledge extraction from raw data is known as Data Mining.
The developments mentioned above are closely associated with the evolution
of distributed systems. Due to large volumes, processing is performed by or-
ganised clusters of computers. Proposed cluster architectures can be divided
into three major categories:
1. Shared-memory clusters: a global main-memory is shared between
processors by a fast interconnect.
2. Shared-disk clusters: an array of disks is accessible through the
network. Each processor has its own private memory.
3. Shared-nothing clusters: every node has a private set of resources.
Figure 1.1 (from Fernandez [5]) presents these architectures.
Shared-memory and shared-disk architectures have difficulty scaling
to large sizes [6]. Pioneered by Google, shared-nothing architectures have
dominated mainly because they can scale dynamically by adding more inex-
pensive nodes. Emerging cloud computing providers, such as AWS-EC2
(Amazon Web Services – Elastic Compute Cloud) [7], offer access to dynam-
ically configurable instances of these architectures on demand.
1.1 Distributed Computing Frameworks
Google in 2004 [8] introduced MapReduce, a distributed computing model
targeting large-scale processing in shared-nothing architectures. MapReduce
expresses computations using two operators (Map and Reduce), schedules
their execution in parallel on dataset partitions and guarantees fault-tolerance
through replication. The Map operator processes dataset partitions in parallel
and the Reduce operator aggregates the results.
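The two-operator model described above can be sketched in plain Java (a toy illustration of the paradigm, not Spark's or Hadoop's actual API): Map tasks process dataset partitions independently, and a binary, associative Reduce combines the partial results in any order.

```java
import java.util.Arrays;
import java.util.List;

public class MapReduceSketch {
    // Map: process one partition independently (here: count its records).
    static int mapCount(List<String> partition) {
        return partition.size();
    }

    public static void main(String[] args) {
        // A hypothetical dataset split into two partitions.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("rec1", "rec2", "rec3"),
                Arrays.asList("rec4", "rec5"));

        // Map phase runs one task per partition, potentially in parallel;
        // the Reduce phase aggregates with a binary, associative operator.
        int total = partitions.parallelStream()
                              .map(MapReduceSketch::mapCount)
                              .reduce(0, Integer::sum);

        System.out.println(total); // prints 5
    }
}
```

Because the reduce operator is associative, the runtime is free to combine partial results in whatever order tasks happen to complete, which is what makes the model safe to parallelise across a shared-nothing cluster.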
Yahoo in 2005 introduced Hadoop [9], an open source implementation of
MapReduce. Many institutions adopted Hadoop [10] and many others plan
Hadoop integration in the near future [11].
Although MapReduce can express many Data Mining algorithms effi-
ciently [12], a significant performance improvement is possible by introducing
a loop-aware scheduler and main-memory caching. The Data Mining al-
gorithms used in knowledge discovery usually require multiple iterations over
the dataset and thus, multiple, slow, disk accesses. Due to this iterative nature
of many Data Mining algorithms, storing and retaining datasets in-memory
and scheduling successive iterations to the same nodes may yield significant
benefits. Modern clusters possess increasing amounts of main-memory that
can provide performance benefits by efficiently using main-memory caching
mechanisms.

Figure 1.1: Cluster Architecture [5]
Out of many possible options [13], this project has utilised Apache Spark
[14] as the target platform. Spark supports main-memory caching and pos-
sesses a loop-aware scheduler. Additionally, Spark implements the MapRe-
duce paradigm and it is Java-based. These features enable users to deploy ex-
isting Hadoop application logic in Spark. Spark outperforms Hadoop by up to
two orders of magnitude in many cases [14].
1.2 Data Mining Tools
Data Mining tools can be divided into two major categories:
• Data Mining Toolkits: Libraries of data mining algorithms that can be
invoked against datasets either via Command Line Interfaces (CLI) or
Graphical User Interfaces (GUI).
• Languages for statistical computing: Special purpose programming
languages that incorporate data mining primitives and simplify al-
gorithmic development.
Weka and R are influential exemplars of the respective categories [17].
Weka incorporates libraries that cover all major categories of Data Mining
algorithms and has been under development by the open-source community
for more than 20 years [15]. However, it was developed targeting sequential
single-node execution and thus, it is not suitable for distributed environments.
The data volumes Weka can handle are limited by the heap memory of a single
node, which cannot exceed single-digit gigabytes. Thus, sequential
Weka is not suitable for modern large-scale datasets.
R [16] is a programming language designed for statistical computing. It in-
corporates essential data mining components such as linear and non-linear
models, statistical testing and classification among others. R also provides a
graphical environment for results visualisation. Although it is possibly the
most popular tool [17] in the field, it demonstrates the same shortcomings as
Weka on large-scale datasets.
Weka was selected because it is written in Java (R is C-based) and is thus
natively supported by Spark's Java-based execution environment. Addi-
tionally, Weka exposes an easy-to-use interface, and the implemented
framework can address the needs of both novice and expert users.
1.3 Project Objectives
This project has two main objectives:
1. An implementation of a scalable and efficient distributed Data Mining
framework using Weka and Spark. It will thus investigate the re-use of
existing sequential algorithms in a distributed context;
2. An experimental evaluation of different main-memory caching
strategies and their effects on Data Mining workloads.
1.4 Implementation Summary
The system implementation can be summarised by the 4-tier architecture de-
picted in Figure 1.2.
The Infrastructure Layer is provided by Amazon's EC2. The decision to se-
lect AWS was based on its ability to dynamically allocate computing in-
stances, its enhanced monitoring capabilities and its comprehensive online
documentation.
The Distributed Storage Layer consists of multiple SSD drives, managed
by the Hadoop Distributed File System (HDFS) [18]. HDFS encapsulates dis-
tributed storage into a single logical unit and guarantees fault-tolerance
through data partitioning and replication.
The Batch Processing Layer was based upon the distributed computing
framework Apache Spark. Spark is innovative in the field of large-scale in-
memory computations and offers multiple advanced caching mechanisms.
Figure 1.2: System Architecture
The Data Mining Application Layer was engineered by implementing cus-
tom distributed versions of Weka's Data Mining algorithms. More specifically,
four different categories of Data Mining methods were implemented:
• Classification
• Regression
• Association Rule Learning
• Canopy Clustering
The distribution was achieved by implementing Decorator classes
(“wrappers”) which encapsulated Weka's sequential algorithms. The func-
tional nature of the MapReduce model prohibits data dependencies between
Map and Reduce tasks executed in parallel. Thus, each Map task was designed
to process a dataset partition in parallel using an application specific Weka al-
gorithm. Each Reduce task was designed to receive the results of completed
Map tasks and aggregate them into a single output. In order to guarantee par-
allel execution, Map tasks were implemented as unary functions and Reduce
tasks as binary, commutative and associative functions. This procedure was
made possible by the functional nature of the Scala [19] programming lan-
guage.
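The wrapper pattern described above can be illustrated with a sketch in plain Java (the `PartialModel` class below is hypothetical, standing in for a wrapped Weka algorithm): each Map task trains on its partition in isolation, and the Reduce step is a binary, commutative and associative merge of partial models.

```java
import java.util.Arrays;
import java.util.List;

public class TrainSketch {
    // Hypothetical partial model: sufficient statistics for a mean predictor.
    static class PartialModel {
        final double sum; final long count;
        PartialModel(double sum, long count) { this.sum = sum; this.count = count; }
        // Reduce: commutative and associative, so partials merge in any order.
        PartialModel merge(PartialModel other) {
            return new PartialModel(sum + other.sum, count + other.count);
        }
        double predict() { return sum / count; }
    }

    // Map: a unary function training on one partition, with no data
    // dependencies on any other partition.
    static PartialModel train(List<Double> partition) {
        double s = partition.stream().mapToDouble(Double::doubleValue).sum();
        return new PartialModel(s, partition.size());
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(1.0, 2.0), Arrays.asList(3.0, 4.0, 5.0));
        PartialModel model = partitions.parallelStream()
                .map(TrainSketch::train)
                .reduce(PartialModel::merge)
                .orElseThrow(IllegalStateException::new);
        System.out.println(model.predict()); // prints 3.0
    }
}
```

In the actual system, `train` would delegate to a wrapped sequential Weka learner and `merge` to an algorithm-specific aggregation (for example, model averaging or voting); the essential property is that the merge is order-independent.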
1.5 Evaluation Strategy
The implementation has been evaluated using computer clusters and problem
sizes of three distinct levels:
• Small scale: Using an 8-core cluster of m3.xlarge instances possessing
28.2GB of main-memory;
• Medium scale: Using a 32-core cluster of m3.xlarge instances pos-
sessing 112.8GB of main-memory;
• Large scale: Using a 128-core cluster of m3.xlarge instances possess-
ing 451.2GB of main-memory.
Dataset sizes were increased in proportion to the computer cluster sizes and
ranged from 5GB up to 80GB.
Each category of Data Mining methods was evaluated by using a represent-
ative algorithm. In each case, the experiment was repeated using the three
scales and the seven different caching strategies. An identical workload was
executed on Hadoop for comparison purposes.
The metrics measured in each experiment were execution time, memory
utilisation, CPU utilisation, network utilisation and disk operations.
These measurements were analysed to provide information about the sys-
tem's scalability, achieved load balancing, resource use and potential bottle-
necks. Additionally, the trade-offs that the different in-memory caching
strategies impose on these metrics were documented and analysed.
1.6 Project Achievements
The project contributions are three-fold:
1. A distributed Data Mining framework that achieves near-linear scal-
ing in executing Data Mining algorithms in parallel;
2. An analysis on the behaviour of distributed main-memory caching
mechanisms on different Data Mining execution scenarios;
3. An automated caching strategy selection mechanism that assesses
dataset and cluster characteristics and selects an appropriate caching
scheme.
The system outperforms Hadoop by a factor of 2.36 on average and by up to
four times on small-scale datasets.
Scaling efficiency indicates a system's ability to efficiently utilise increas-
ing numbers of processing nodes [20]. In weak scaling, the per-node size of
the problem remains constant and additional nodes are used to tackle a bigger
problem. In strong scaling, the total size of the problem remains constant and
additional nodes are introduced to reduce execution times. An optimal
parallel system demonstrates linear weak and strong scaling efficiencies.
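These efficiencies are derived directly from measured execution times. The following Java sketch (class and method names are illustrative, not part of the system) shows the two calculations, where t1 is the single-node baseline time and tN the time measured on n nodes:

```java
// Illustrative sketch: scaling efficiencies from measured execution times.
// t1 is the single-node baseline time; tN the time measured on n nodes.
class ScalingEfficiency {

    // Weak scaling: per-node problem size is constant, so ideally tN == t1.
    static double weak(double t1, double tN) {
        return t1 / tN;
    }

    // Strong scaling: total problem size is constant, so ideally tN == t1 / n.
    static double strong(double t1, double tN, int n) {
        return t1 / (n * tN);
    }
}
```

For example, a workload whose per-node size is kept constant and which takes 109.4s on a multi-node cluster against a 100s single-node baseline has a weak scaling efficiency of roughly 91.4%.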
Weak scaling efficiencies for Linear Regression, Support Vector Machines
(SVM) and FP-Growth were measured at 91.4%, 91.8% and 90.9% respect-
ively. These results are within 10% of the optimal parallel performance. The
strong scaling efficiency was measured at 91.3% on the large-scale experi-
ments. This figure is on a par with the state-of-the-art in the surveyed literat-
ure.
Uncompressed datasets were found to consume up to 500% more memory
than the on-disk values. Serialisation converts distributed object structures to
continuous arrays of bytes. Compression uses encoding techniques to repre-
sent data with fewer bits. Serialisation and compression mechanisms were
found to reduce the memory footprints of uncompressed datasets to 101.3%
and 46.7% of the on-disk value respectively. The execution time overhead of
these mechanisms was measured at 5%.
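The interplay of the two mechanisms can be sketched with standard Java serialisation and gzip compression (the class below is purely illustrative; the actual system relies on Spark's pluggable serialisation and compression facilities):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch: serialisation flattens an object structure into a
// continuous byte array; compression then encodes those bytes more densely.
class CachingFootprint {

    // Serialise a partition of numeric instances to a byte array.
    static byte[] serialise(double[] partition) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(partition);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Gzip-compress the serialised form.
    static byte[] compress(byte[] serialised) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
                gzip.write(serialised);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

How far compression shrinks the serialised bytes depends heavily on dataset characteristics; the 101.3% and 46.7% figures above are averages over the evaluated workloads, not general constants.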
Disk caching offers an efficient alternative in memory-constrained environ-
ments. The disk caching mechanism was able to successfully execute Data
Mining tasks even with only 2.5% of the cluster's main-memory available.
The caching strategy selection mechanism outperformed the default caching mechanism by up to 25%. This was achieved by reducing garbage collection overheads through serialisation and compression. Additionally, this mechanism was found to eliminate the failures caused by main-memory exceptions in limited-memory experiments.
1.7 Overview of Dissertation
The remainder of the dissertation is structured as follows: Chapter 2 presents
an overview of the state-of-the-art in Data Mining and distributed computing
frameworks; Chapter 3 provides an analysis of the system architecture;
Chapter 4 presents and analyses the execution model; Chapter 5 presents the
evaluation methodology and benchmarking results. Finally, Chapter 6 draws
conclusions and proposes future improvements.
2 Literature Review
This chapter presents fundamental Data Mining techniques and tools (2.1);
the state of the art in Distributed Computing Frameworks (2.2); and distrib-
uted Data Mining efforts using these frameworks (2.3).
2.1 Data Mining
Data Mining procedures can be summarized as a number of fundamental
steps: data recording, noise reduction, data analysis and data representation
and interpretation. Figure 2.1 illustrates these steps.
High data volumes lead to signal-to-noise ratios that can reach 1:9 [21].
Consequently, a noise reduction phase is essential to maximize data quality
and to minimize noise effects. The data analysis phase can be divided into four
major categories of methods: Classification, Regression, Clustering and Asso-
ciation Rules.
2.1.1 Classification
Classification is a process in which a hypothesis function (classifier) analyses
a set of attributes and infers the category to which a data object belongs. The
produced classifier can be used to predict the class of unknown data objects.
Classification is an example of supervised learning, where a learning al-
gorithm processes a dataset with annotated data points (data objects of previ-
ously known categories) and computes the parameters of the function.
Figure 2.2 (taken from [22]) illustrates this process.
Figure 2.1: The Data Mining Process
In many cases, the hypothesis space can be large [23] and retrieving a
strong hypothesis function is a non-trivial task. For this purpose, a set of
techniques known as Ensemble Learning [24] or, more commonly, Meta Learning,
have been developed to combine the predicting power of multiple classifiers
into a single strong predictor.
A popular technique builds Voted Ensembles [25] using groups of classifiers.
Each encapsulated classifier predicts the outcome of a feature vector
independently and the Ensemble outputs the majority class.
2.1.2 Regression
Regression is a supervised learning technique targeting the implementation of
prediction models for numeric class attributes. The difference
between Regression and Classification models is that the outcome of a regres-
sion hypothesis function can have an infinite set of values (any real number).
During training, a learning algorithm computes the correlation of a set of
assumed independent variables with a dependent (class) variable. The out-
come is a trained hypothesis function which can be used to predict the value
of the class variable given the observed values of the independent feature vari-
ables.
Voted Ensembles of Regression functions can be created using similar
Meta Learning techniques. Since the possible states of the output are theoret-
ically infinite, majority votes are represented by average or median values.
Figure 2.2: Supervised Learning Process [22]
2.1.3 Clustering
Clustering is the process in which data objects are grouped together in classes
based on some measure of similarity. It is different from classification as pos-
sible data classes are not known in advance. It is an example of unsupervised
learning and it is used to discover hidden structure in datasets.
The clustering process can be divided into iterative and incremental ap-
proaches. Iterative algorithms require multiple passes over the whole dataset
to converge. For incremental cluster construction [26], two approaches can be
used: a) adding data points at each iteration and recomputing the cluster
centres and b) adding a cluster centre at each iteration. In both cases, the solu-
tion is built incrementally and it is possible to find a near optimal solution in a
single pass.
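Approach (a) can be sketched as a running-mean update of a cluster centre, folding each new point into the centre in a single pass (illustrative Java; the class and method names are not part of any surveyed system):

```java
// Illustrative sketch of incremental cluster construction, approach (a):
// points are added one at a time and the cluster centre is recomputed as a
// running mean, so the solution is built in a single pass over the data.
class IncrementalCluster {
    private final double[] centre;
    private long count = 0;

    IncrementalCluster(int dimensions) {
        this.centre = new double[dimensions];
    }

    // Fold one data point into the centre in O(d) time.
    void add(double[] point) {
        count++;
        for (int d = 0; d < centre.length; d++) {
            centre[d] += (point[d] - centre[d]) / count;
        }
    }

    double[] centre() {
        return centre.clone();
    }
}
```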
Consensus clustering [27] is the process of combining different clusterings
of the same dataset in a single output. It is a process analogous to Meta Learn-
ing in supervised learning. However, it is known to be NP-Complete and solv-
ing large instances is considered intractable [28].
2.1.4 Association Rule Learning
Association Rule Learning discovers correlations and associations between
items in a set of transactions. Given the large amounts of data stored by retail-
ers, association rules emerged as a solution to the market-basket analysis
problem [29].
A bit vector is used to indicate the presence or absence of an item in an
itemset (transaction). A group of bit vectors is used to represent a set of trans-
actions. By analysing these vectors, it is possible to discover items that fre-
quently occur together. These frequent occurrences are expressed in the form
of rules. For example, the rule
{ itemA, itemB } => { itemC }
indicates a correlation between the presence of items A and B and the presence
of item C. However, not all rules are interesting. Consequently, measures
of significance are required. Support (percentage of transactions that contain
items A, B and C), Confidence (percentage of transactions containing items A
and B that also contain item C) and Lift (a correlation measure of the two
itemsets) are three frequently used measures. By setting thresholds to these
measures it is possible to discover interesting rules on a set of transactions us-
ing various algorithms.
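These three measures can be computed directly from the bit-vector representation (an illustrative Java sketch; item indices address positions in the bit vectors):

```java
import java.util.Arrays;

// Illustrative sketch: significance measures over transactions encoded as
// bit vectors. transactions[t][i] is true when transaction t contains item i.
class RuleMeasures {

    // Support: fraction of transactions containing every item in `items`.
    static double support(boolean[][] transactions, int... items) {
        int count = 0;
        for (boolean[] transaction : transactions) {
            boolean all = true;
            for (int item : items) all &= transaction[item];
            if (all) count++;
        }
        return (double) count / transactions.length;
    }

    // Confidence of the rule {antecedent} => {consequent}.
    static double confidence(boolean[][] tx, int[] antecedent, int consequent) {
        int[] both = Arrays.copyOf(antecedent, antecedent.length + 1);
        both[antecedent.length] = consequent;
        return support(tx, both) / support(tx, antecedent);
    }

    // Lift: how much more often the rule holds than if the two
    // itemsets occurred independently.
    static double lift(boolean[][] tx, int[] antecedent, int consequent) {
        return confidence(tx, antecedent, consequent) / support(tx, consequent);
    }
}
```

Setting minimum thresholds on these values filters the rule space down to the interesting rules.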
Due to the sheer volume of monitored transactions, various techniques
have been developed that enable mining large transaction logs in small parti-
tions. A popular technique, known as Partition [30], divides transactions into
groups and applies Association Rules Learning algorithms to these groups in-
dependently. This procedure produces a set of candidate rules. Each candidate
rule is validated by computing its significance measures across all partitions.
Candidate rules that meet global significance criteria are produced in the out-
put.
2.1.5 Data Mining System Development
Data Mining systems can be developed using two main approaches: Toolkit-
Based and Statistical Languages. The following sections present these ap-
proaches in detail.
2.1.5.1 Toolkit-Based Approaches
Data Mining toolkits can be described as a set of Data Mining algorithm im-
plementations accompanied by a user interface. The user is able to interact
with the algorithms and analyse datasets through the interface without pos-
sessing advanced knowledge about implementation details. An influential ex-
emplar is the Data Mining toolkit Weka.
Weka [15] encapsulates well tested and extensively reviewed implementa-
tions of the most popular Data Mining methods and algorithms mentioned
above. It contains tools that support all phases of the Data Mining process.
A large collection of filters can be used to pre-process datasets and reduce
noise. Algorithms spanning across all major categories can be used for data
analysis. Weka offers a Graphical User Interface (GUI) that supports interact-
ive mining and results visualization. Finally, it automatically produces statist-
ics to assist results evaluation.
The major disadvantage of Weka is that it only supports sequential single-
node execution. As a result, the size of the datasets that Weka can handle us-
ing the existing environment is limited by the maximum amount of the heap
memory of a single node.
2.1.5.2 Statistical Language Based Approaches
Statistical languages are special purpose programming languages that incor-
porate primitive statistical operations such as matrix arithmetic and vector ma-
nipulation. Since Data Mining is closely associated with statistics, Data Sci-
entists extensively use these languages [17] to develop Data Mining al-
gorithms. A popular example of this approach is R.
R [16] is a statistical programming language with built-in support for linear
and non-linear modelling, matrix manipulation, time-series analysis, data
cleaning, statistical testing and graphics among others [31]. It is interpreted
and can be used interactively to implement data mining tasks. Statistical pro-
cedures are exposed through simple commands.
R gained popularity [17] among analysts mainly because it does not de-
mand advanced programming expertise. However, it is designed for sequential
execution and suffers the same shortcomings as Weka in Big Data problems.
2.1.5.3 Approach Selection
Statistical languages require state of the art knowledge of primitive statistical
operations and mainly consist of tools to develop algorithms rather than librar-
ies of implementations. In contrast, Data Mining toolkits require minimal
prior knowledge and can be directly deployed to analyse datasets.
This project is based on toolkit-based approaches and more specifically
Weka, for two main reasons:
• Weka is written in Java and can be directly deployed and executed on
top of Java-based distributed computing frameworks without requiring
additional compilation overheads;
• The toolkit-based approach is appealing to different categories of users
(novices and experts) and thus targets a broader user base.
2.1.6 Partitioning and Parallel Performance
Data Mining algorithms can be either single-pass or iterative [32]. Single-pass
algorithms have an upper bound in execution times. Iterative algorithms iter-
ate over the dataset until a stop condition is met (convergence) and thus exe-
cution times may vary. Due to the sheer volumes in Big Data, this project util-
ises the partitioned parallelism approach: the dataset is partitioned and com-
puting nodes process the partitions in parallel.
For single pass algorithms, this method can theoretically yield speed-up
proportional to the number of nodes. In practice, the overhead associated with
distributing computations and aggregating the results over all partitions will
set the limit marginally lower. However, this overhead can be experimentally
computed and system performance is predictable. An example of this case is
computing class means and variances in building a Gaussian model for Naive
Bayes. Each node computes the statistics for each class in local partitions in
one iteration and aggregation is achieved in a single synchronisation step.
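The Naive Bayes example can be sketched as follows: each partition accumulates a (count, sum, sum of squares) triple, and a single merge step, which is binary, commutative and associative, yields the global mean and variance (illustrative Java, independent of any framework):

```java
// Illustrative sketch: single-pass, partitioned computation of a class mean
// and variance, as needed for a Gaussian model in Naive Bayes. Each node
// builds one GaussianStats per class over its local partition; aggregation
// is a single, associative merge step.
class GaussianStats {
    long count;
    double sum;
    double sumSq;

    // Local, per-partition pass over attribute values.
    void accept(double value) {
        count++;
        sum += value;
        sumSq += value * value;
    }

    // Binary, commutative and associative: safe to use as a Reduce function.
    static GaussianStats merge(GaussianStats a, GaussianStats b) {
        GaussianStats merged = new GaussianStats();
        merged.count = a.count + b.count;
        merged.sum = a.sum + b.sum;
        merged.sumSq = a.sumSq + b.sumSq;
        return merged;
    }

    double mean() {
        return sum / count;
    }

    // Population variance via E[x^2] - E[x]^2.
    double variance() {
        return sumSq / count - mean() * mean();
    }
}
```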
In contrast, iterative algorithms may be unpredictable. The number of itera-
tions until convergence cannot be defined in advance. Two different ap-
proaches are possible [33]: synchronous and asynchronous.
In the synchronous approach, each node computes the model's parameters on its own
partition in a single iteration. A synchronisation step is used to collect local
parameters from all nodes and produce aggregated global values. During syn-
chronisation, the stop condition is evaluated. If the condition is not satisfied,
each node obtains a copy of the global values and begins a new iteration. This
technique achieves load balancing between the nodes, but requires constant
node communication and the network can become a performance bottleneck
[34] in large clusters.
In the asynchronous approach, each node computes a local model and a single
synchronisation step aggregates the results. This technique minimises network
overheads, but load balancing is not guaranteed. Nodes that struggle to converge
will slow down the performance of the system. One solution is to enforce a
deadline: each node has a certain number of iterations to meet the stop condi-
tion. After the deadline, the final model will be computed only on the nodes
that managed to converge. This technique may lack precision, but the execution
time is guaranteed and speed-up is proportional to the number of nodes.
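The deadline-based aggregation step can be sketched as follows (illustrative Java; for regression-like models the "vote" is an average over the parameter vectors of the nodes that converged in time):

```java
import java.util.List;

// Illustrative sketch of the asynchronous approach: each node fits a local
// parameter vector on its own partition; one synchronisation step averages
// the models of the nodes that converged before the deadline.
class AsyncAggregation {
    static double[] aggregate(List<double[]> localModels, List<Boolean> converged) {
        int dims = localModels.get(0).length;
        double[] global = new double[dims];
        int used = 0;
        for (int n = 0; n < localModels.size(); n++) {
            if (!converged.get(n)) continue; // drop nodes that missed the deadline
            for (int d = 0; d < dims; d++) {
                global[d] += localModels.get(n)[d];
            }
            used++;
        }
        for (int d = 0; d < dims; d++) {
            global[d] /= used; // average of the converged local models
        }
        return global;
    }
}
```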
This project focuses on asynchronous techniques, for two main reasons:
• Weka does not support synchronisation. Adding this feature would re-
quire re-implementing Weka's algorithms.
• Distributed computing frameworks suffer from network bottlenecks
and developers aim to avoid network saturation.
2.2 Distributed Computing Frameworks
Distributed Computing Frameworks provide an interface between clusters of
computers and algorithms. Each framework exposes an Application Program-
ming Interface (API) and guarantees that the algorithms developed using this
interface can be executed in parallel. In the following sections, a number of
influential and widely used Distributed Computing Frameworks are presented.
2.2.1 MapReduce
MapReduce was introduced by Google [8] in order to tackle the problem of
large-scale processing in clusters of inexpensive commodity hardware. Data-
sets in MapReduce are automatically partitioned, replicated and distributed
across the cluster nodes. This practice ensures that partitions can be processed
in parallel and fault-tolerance can be guaranteed through replication.
A MapReduce cluster consists of a Master node which handles data parti-
tioning and schedules tasks automatically in an arbitrary number of Workers.
The Master also maintains meta-data concerning partition locations in the
cluster. This practice assists scheduling Workers to process their local partition
and avoids transmitting large data chunks through the network.
The user is required to implement two functions: Map and Reduce. Map is
used to filter and transform a list of key-value pairs into intermediate key-
value pairs. Reduce processes the intermediate pairs, aggregates the results
and produces the output.
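The canonical illustration of these two contracts is word counting: Map emits an intermediate (word, 1) pair for every word in a record, and Reduce sums the values collected for each key (an illustrative Java sketch, independent of any particular MapReduce runtime):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the Map and Reduce contracts via word counting.
class WordCount {

    // Map: transform one input record (a line) into intermediate pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1)); // emit (word, 1)
        }
        return pairs;
    }

    // Reduce: aggregate all intermediate values observed for one key.
    static int reduce(String key, List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value; // sum the counts for this word
        }
        return total;
    }
}
```

The runtime, not the user, is responsible for grouping the intermediate pairs by key and routing each group to a Reducer.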
Once the user has specified the Map and Reduce functions, the runtime en-
vironment automatically schedules execution of Mappers on idle cluster
nodes. Each node executes the Map function against its local dataset partition,
writes intermediate results to its local disk and periodically notifies the Master
of progress. As the Mappers start producing intermediate results, the Master
node assigns Reduce tasks to idle cluster nodes. Each intermediate result has a
key and is distributed to the Reducer that handles this key (or key range).
Figure 2.3 (taken from [8]) presents the MapReduce procedure.
2.2.2 Hadoop
Hadoop [9] is an open-source implementation of MapReduce. It was de-
veloped by Yahoo and released as an open-source project in 2006. Hadoop's
main components are HDFS (Hadoop Distributed File System), YARN (Yet-
Another Resource Negotiator) [35] and the MapReduce framework.
HDFS is a disk-based file system that spans across the cluster nodes. Files
stored in HDFS are automatically divided into blocks, replicated and distrib-
uted to the nodes' local disks. HDFS maintains meta-data about the location of
blocks and assists Hadoop to schedule each node to process local blocks rather
than receive remote blocks through the network. HDFS encapsulates distributed
local storage into a single logical unit and automates the procedure of
distributed storage management.
Figure 2.3: MapReduce Execution Overview [8]
YARN is responsible for managing cluster resources and providing applications
with execution containers for Map and Reduce tasks. YARN maintains a
directory of all the execution containers in the cluster (which may be multiple
per node) and either allocates idle containers to applications for execution or
delays the application execution until a container becomes available.
Finally, MapReduce is a set of Java libraries that implement the aforemen-
tioned MapReduce paradigm.
Hortonworks' web page [36] presents the Hadoop architecture as in Figure
2.4.
2.2.2.1 Beyond Hadoop
As briefly discussed earlier, Hadoop's execution engine is inefficient in two
main areas:
• Native support for iterative algorithms: Many Data Mining algorithms re-
quire multiple iterations over a dataset to converge to a solution. For example,
the K-Means clustering algorithm iterates over a dataset until cluster assign-
ments remain unchanged after two successive iterations. These iterations are
usually included in the user-defined driver program. This feature requires both
the reload of invariant data and the restart of execution processes at each itera-
tion, leading to significant performance overheads. These overheads could be
avoided by introducing a loop-aware scheduler [37].
Figure 2.4: Hadoop Tech Stack [36]
• HDFS is a disk-based file system. However, modern clusters possess
main memory that can exceed 1TB and most data mining tasks are within this
limit [38]. As a result, significant performance improvement is possible by
using main-memory caching mechanisms.
These issues led to numerous efforts towards novel systems. The following
sections present a number of important projects in the area.
2.2.3 Iterative MapReduce
HaLoop [37] introduced a loop operator to the MapReduce framework aiming
to provide support for iterative algorithms. HaLoop's scheduler is loop-aware
and introduces caching mechanisms which avoid multiple slow disk accesses
to loop-invariant data (for example stop conditions and global variables).
Figure 2.5 (taken from [37]) shows the boundaries between user applica-
tion and system code in MapReduce and HaLoop.
Stop conditions are evaluated automatically by the system and do not re-
quire an additional task as in MapReduce. The loop-aware scheduler co-loc-
ates jobs that use the same data in successive iterations. The caching mechan-
ism is used to save loop-invariant data between iterations. These develop-
ments, as reported in [37], have enabled up to 85% speed-up compared to Ha-
doop on iterative algorithms.
Figure 2.5: HaLoop and MapReduce [37]
2.2.4 Distributed Systems for In-Memory Computations
These systems use main-memory caching strategies to speed-up computations
and offer significant performance advantages over disk-based solutions. The
following sections overview efforts in the area.
2.2.4.1 In-Memory Data Grids
In-Memory Data Grids (IMDG) [39] emerged as middleware between databases
and applications demanding low-latency access to mission-critical data.
Infinispan [40] and HazelCast [41] are two representative examples.
Data in IMDG is represented as collections of non-relational data objects.
These collections are hash-partitioned and distributed to the main-memory of
a cluster's nodes. IMDG acts as a distributed cache, serves applications seek-
ing database entries and reduces redundant slow disk accesses to disk-based
database servers.
Traditional IMDGs only supported data storing and retrieval operations.
As a result, complex computations were handled by external frameworks. This
technique faces performance bottlenecks [42] because task scheduling takes
into consideration neither data locality nor load balancing. In order to
tackle this issue and combine the low-latency data access offered by IMDG
with the computational capabilities of distributed computing solutions, in-
memory computing frameworks emerged.
The following sections present a number of frameworks that combine in-
memory caching and distributed computations in a single scalable solution.
2.2.4.2 Piccolo
Piccolo [43] provides an in-memory data-centric programming model for
building parallel applications on large clusters. Piccolo allows distributed ap-
plications to share state via a mutable key-value table. Table entries are parti-
tioned and distributed to a cluster's main-memory, in order to achieve faster
sharing of intermediate results among cluster nodes.
Piccolo applications are implemented using a control function, a set of ker-
nel functions, a set of accumulation functions and a distributed in-memory
key-value table. The user defines a control function that operates in the master
node and monitors the control flow of the application. A number of kernel
functions (sequential processing threads) are launched in the slave nodes and
perform parallel computations on the entries of the distributed table. A locality
preference mechanism by default schedules kernel functions to process local
partitions. Intermediate results are updated in the table using atomic opera-
tions and thus an accumulation function must be defined to aggregate multiple
updates to the same key.
The user is required to manually define the distribution of keys in the
cluster nodes through a partition function. As the system does not automatically
manage cases of insufficient memory in a node, the user must carefully
consider partition placement. The system guarantees fault-tolerance using a
check-pointing mechanism that captures system snapshots at regular intervals.
A failed node will force the system to roll back to a previous state and repeat
computations in all nodes because datasets are not replicated and lineage in-
formation is not monitored.
Piccolo reports [43] speed-up of up to an order of magnitude compared to
Hadoop. However, the system does not provide a high level interface that
guarantees parallel execution, exposes a complicated programming model,
lacks a mechanism to tackle main-memory shortages and uses a computation-
ally expensive disaster recovery mechanism.
2.2.4.3 GraphLab
GraphLab [44] expresses computations as data graphs and provides schedul-
ing primitives. GraphLab's data model consists of a directed data graph and a
shared (global) data table. Both the graph vertices and the shared table are
stored in-memory. The framework is optimistic: it assumes that nodes do not
fail and that the data can fit in the cluster's main memory.
Graph vertices and edges represent sparse data dependencies. Computations
can process graph elements either through a stateless user-defined update
function (analogous to Map in MapReduce) or using a synchronization mech-
anism that defines a global aggregation (analogous to Reduce in MapReduce).
Update functions operate on a graph neighbourhood and can have read-
only access to the shared data table. Unlike MapReduce, Update functions are
allowed to process overlapping context. User programs can have multiple up-
date functions and parallel execution is guaranteed as long as simultaneous ac-
cess to common vertices is not required. Consistency must be defined by the
user. Three different consistency mechanisms are offered: Vertex, Edge and
Full consistency. Figure 2.6 (taken from [44]) depicts these three options.
Consistency mechanisms define the degree of overlapping in processed
graph neighbourhoods. Relaxing consistency guarantees (by locking fewer
vertices during a vertex update) allows a higher degree of parallelism (more
functions can be executed in parallel as fewer vertices are locked).
The global data table can only be updated using the synchronisation mech-
anism. During synchronisation, data across all vertices are aggregated to an
entry in the shared table in a manner analogous to Reduce functions in
MapReduce.
Finally, a GraphLab user program must contain an update schedule. This
schedule defines the order in which vertex-function pairs will be executed.
This is a dynamic list and tasks can be updated or rearranged during execu-
tion. GraphLab provides predefined schedules based on popular data struc-
tures (FIFO queues, priority queues) and, most importantly, contains a sched-
uler construction framework that lets the user define a custom scheduling
mechanism.
Figure 2.6: GraphLab Consistency Mechanisms [44]
It is important to note that GraphLab's programming model does not limit
itself to graph computations. Sparse and dense matrices can be represented by
graphs and thus GraphLab can be used to express iterative Data Mining al-
gorithms.
Results in [44] demonstrate high performance in iterative asynchronous al-
gorithms. However, GraphLab does not possess a mechanism to tackle
memory shortages, optimistically assumes that node failures are improbable
and exposes a complicated programming model.
2.2.4.4 Spark
Resilient Distributed Datasets (RDDs) [14] are a distributed main-memory ab-
straction that enable users to perform in-memory computations in large
clusters. RDDs are implemented in the open-source Apache Spark [45] frame-
work.
An RDD is an immutable collection of records distributed across the main
memory of a cluster. These data structures can be created by invoking a set of
operators either on persistent storage data objects or on other RDDs. The sys-
tem logs dataset transformations using a lineage graph. The system is con-
sidered to be “lazy”: it does not materialise transformations until the user re-
quests either an output or saving changes to persistent storage.
Figure 2.7 (taken from [14]) illustrates an RDD lineage graph.
Figure 2.7: RDD Lineage Graph [14]
RDD operators are divided into two categories: Transformations and Ac-
tions. Transformations define a new RDD, based on an existing RDD and a
function. Actions materialise the Transformations and either return a value to
the user or export data to persistent storage.
As the system was inspired by MapReduce, Transformations provide native
support for Map and Reduce operators. Additional operators include join,
union, crossProduct and groupBy. These features extend the system's capabil-
ities by introducing an Online Analytical Processing (OLAP) engine.
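The lazy Transformation/Action split can be illustrated with Java streams, which follow the same principle: intermediate operations only describe a pipeline, and a terminal operation materialises it (an analogy for illustration, not Spark's API):

```java
import java.util.List;

// Illustrative sketch of lazy evaluation using Java streams: map and filter
// (the "Transformations") only record a pipeline; count (an "Action")
// triggers the actual computation over the records.
class LazyPipeline {
    static long countLongWords(List<String> records) {
        return records.stream()
                .map(String::trim)            // Transformation: recorded, not run
                .filter(w -> w.length() > 3)  // Transformation: recorded, not run
                .count();                     // Action: materialises the pipeline
    }
}
```

In Spark the same structure appears as a chain of RDD operators in the driver program, with nothing executed until an Action such as count or saveAsTextFile is invoked.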
The system was designed to tackle the shortcomings of MapReduce in iter-
ative computations by introducing a loop-aware scheduler and thread execut-
ors. In multi-stage execution plans the system schedules successive iterations
at the nodes where the data is cached, avoiding slow network and disk ac-
cesses. The introduction of thread executors enables the system to reuse
launched threads in the execution of multiple successive closures. This prac-
tice avoids the significant initialisation and termination costs observed in
MapReduce during each execution stage.
The users execute tasks by providing a driver program which defines a path
to secondary storage and sets of Transformations and Actions. The system
then creates an RDD and distributes its records across the main-memory of the
cluster. When an Action is issued, the system schedules the execution of all
the requested Transformations. Each node in the cluster will process its local
set of records and return the results. By caching datasets to main memory the
system avoids slow disk reads and can perform up to 100 times faster than
Hadoop's MapReduce [14] on Logistic Regression and K-means.
Spark also takes advantage of the interactive nature of Scala and Python,
by providing an interactive shell. Users can load a dataset in main-memory
and execute multiple mining tasks interactively.
In cases where the dataset is larger than the available amount of main
memory, Spark employs a mechanism to serialize and store portions of the
dataset to secondary storage. Memory consumption is a configuration para-
meter and the mechanism can be triggered at a user-defined level. Even in
cases where minimal memory is available, Spark outperforms Hadoop due to
shorter initialisation and termination overheads [46].
The system offers numerous memory management options including differ-
ent serialization and compression libraries. These features allow the user to
define an application specific caching strategy that takes advantage of dataset
and cluster characteristics.
Spark is fully compatible with the Hadoop ecosystem. Spark's Scala API
supports closures and Hadoop Map and Reduce functions can be submitted as
arguments to its native operators. This allows users to deploy existing Hadoop
applications to Spark with minor adjustments.
2.2.5 Distributed Computing Framework Selection
This work has focused on using Apache Spark as a deployment framework for
the WEKA subset of interest. The reasoning behind this decision is as follows:
• Spark offers advanced main-memory caching capabilities that include
robust serialization mechanisms, compression mechanisms and the
ability to tackle memory shortages;
• Spark is based on the Scala programming language and it is natively
compatible with Weka's Java libraries;
• Spark offers many operators through a simple and easy-to-use API;
• Spark has recently generated significant interest in both academic and
industrial environments and it is rapidly evolving into a leading Dis-
tributed Computing Framework [47].
2.3 Distributed Data Mining
The following outlines efforts to express Data Mining algorithms using the
programming models of the distributed computing frameworks described in
Section (2.2).
2.3.1 Data Mining on MapReduce
The following presents three notable efforts to develop Data Mining systems
using the MapReduce paradigm.
1. Chu et al. [12] developed many popular Data Mining algorithms (such
as the Linear Regression, the SVM and the K-Means clustering al-
gorithm) on top of MapReduce. Experiments on multi-core processors
demonstrate linear speed-up on shared-memory environments. These
results indicate that it is possible to efficiently express Data Mining us-
ing the MapReduce paradigm.
2. Mahout [48] was a community-based side project of Hadoop aiming
to provide scalable Data Mining libraries. As the libraries did not
provide a general framework for building algorithms, the quality of the
provided solutions varied significantly and was dependent on the expertise
of the contributor. This led to poor performance [49], inconsistencies in the
content of its various releases [48], and the project was eventually
discontinued.
3. Radoop [50] introduced the RapidMiner [51] toolkit to Hadoop. Rap-
idMiner has a graphical interface to design work-flows which consist
of data-loading, data-cleaning, data-mining and visualization tasks.
Radoop introduced operators to read data from HDFS and execute
Data Mining tasks. Radoop operators correspond to Mahout al-
gorithms. At runtime the workflow is translated to Mahout tasks and
executed on a Hadoop cluster. As far as performance is concerned, Ra-
doop suffers the same shortcomings as Mahout. However, Radoop in-
troduced the idea of separating Data Mining work-flow design from
distributed computations on clusters.
These efforts provide three observations that are of particular interest to
this project:
1. The MapReduce paradigm can be used to express Data Mining al-
gorithms efficiently and it is possible to achieve linear scaling;
2. The lack of a unified execution model leads to inconsistent perform-
ance, difficulties in maintaining and extending the code-base and dis-
courages widespread adoption. Mahout focused on providing imple-
mentations of specific algorithms, rather than building execution mod-
els for families of algorithms;
3. Abstracting distributed computations on clusters from the design of
data mining processes enables analysts with no technical background
to harness the power of distributed systems.
This work incorporates these observations into the design of the system.
MapReduce is used to implement the distributed versions of Weka's al-
gorithms on Spark. The execution model focuses on abstract interfaces, rather
than concrete implementations of specific algorithms.
Finally, the system is exposed through an interface that mimics the simpli-
city of Weka's command line. This practice enables users familiar with Weka
to directly utilise Weka-on-Spark, regardless of the underlying distributed
nature of the system.
2.3.2 R on MapReduce
The following presents two notable efforts to combine the statistical language
R with MapReduce.
1. Ricardo [52] is an early effort to introduce R to distributed computing
frameworks. The system used the declarative scripting language Jaql
[53] and Hadoop to execute R programs in parallel. The system con-
sists of three components: an interface where the user implements Data
Mining tasks using R and a small set of Jaql functions; an R-Jaql
bridge to integrate R programs into Jaql declarative queries; and the
Jaql compiler that compiles the queries to a series of MapReduce jobs
on the Hadoop cluster. Flow control is performed by the user program.
The bridge allows R programs to make calls to Jaql scripts and Jaql
scripts to run R processes on the cluster.
Figure 2.8, from [52], presents the system architecture.
The system uses R-syntax, with which most analysts are familiar, and
exposes a simple interface. However, execution times are doubled
compared to native MapReduce jobs for the same task. This is due to
the overhead produced by compiling high-level declarative Jaql scripts
to low-level MapReduce jobs.
2. SparkR [54] is an R package that provides R users a lightweight front-
end to Spark clusters. It enables the generation and transformation of
RDDs through an R shell. RDDs are exposed as distributed lists
through the R interface. Existing R packages can be executed in paral-
lel on partitioned datasets by serializing closures and distributing R
computations to the cluster nodes. Global variables are automatically
captured, replicated and distributed to the cluster enabling efficient
parallel execution. However, the system requires knowledge of statist-
ical algorithms as well as basic knowledge of RDD manipulation tech-
niques. To the best of the author's knowledge (as of August 2014),
benchmarking results comparing the system to other Spark-based solu-
tions are not publicly available.
Both efforts to merge R with MapReduce suffer from two major issues:
1. R is based on C and it is not native to Java-based frameworks such as
Hadoop and Spark. Thus a bridging mechanism is required between R
and the underlying Java Virtual Machine (JVM). R code is compiled to C code, which in turn uses the Java Native Interface (JNI [55]) for execution. This approach limits the portability of the system (it becomes platform-dependent) and introduces bridging overheads.
2. The underlying MapReduce paradigm is not transparent to the user.
The user needs to express the computations as a series of transforma-
tions (Map) and aggregations (Reduce). In Ricardo, this is achieved by
using Jaql declarative queries where the selection predicates use R
functions to transform the data (Map equivalent) and aggregation func-
tions (Reduce equivalent) to produce the final output. SparkR uses the
same methodology on distributed lists.
These observations further support the decision to use Weka as it is written in
Java and the bridging overheads of these systems are avoided. Additionally,
by using Weka's interface distributed MapReduce computations are abstracted
from the design of Data Mining processes.
2.3.3 Distributed Weka
Early efforts made to introduce Weka to distributed environments include
WekaG [56], parallelWeka [57] and Weka4WS [58]. WekaG and Weka4WS use
web services to submit and execute tasks to remote servers. However, they do
not support parallelism; each server executes an independent task on its own
local data. ParallelWeka proposed a parallel cross-validation scheme where
each server receives a dataset copy, computes a fold and sends back the res-
ults. This practice cannot be applied on a large scale because of network bot-
tlenecks.
Work by Wegener et al. [59] aimed to merge the user-friendliness of Weka's
user interface with Hadoop's ability to handle large datasets.
The system architecture consists of three actors: the Data Mining Client,
the Data Mining Server and the Hadoop cluster. The client uses Weka's user
interface to build mining tasks and then order execution to the server. The
server receives the client's request, computes the sequential part of the al-
gorithm locally and submits the parts that can be executed in parallel to the
Hadoop cluster. These procedures required reviewing the Weka libraries, identifying the parts of each algorithm that can be executed in parallel and rewriting those parts using MapReduce. On the server, Weka's data-loader was extended to avoid loading datasets into main-memory and instead perform a series of disk reads.
Figure 2.9, from [59], presents the architecture.
This methodology does not provide a unified framework for expressing
Weka's algorithms. Each algorithm must be inspected to identify parts that can
be executed in parallel and re-implemented using MapReduce. This process
entails producing custom distributed implementations of all the algorithms in
Weka and suffers from the same shortcomings as Mahout. Additionally, dis-
abling Weka's caching and reading incrementally from disk produces large
overheads on iterative algorithms.
2.3.4 MLBase
With MLBase [60] the user can build Data Mining tasks using a high-level de-
clarative language and submit them to the cluster's Master node. The system
then parses the request to form a Logical Learning Plan. This plan consists of
feature extraction, dimensionality reduction, filtering, learning and evaluation
algorithms. The optimizer processes that plan using statistical models and
heuristics. An Optimized Learning Plan (OLP) is produced based on which
combination of algorithms is likely to have better performance (execution
time and accuracy). MLBase then translates OLP to a set of primitive operat-
ors that the run-time environment supports. These include relational operators
(joins, projects), filters and high-level functions such as Map in MapReduce.
These primitives are then scheduled for parallel execution in the cluster's
workers.
The system builds the model in stages. From an early stage, it returns a pre-
liminary model to the user and it continues to refine it in the background. This
mechanism provides the opportunity to interrupt the process if the preliminary
results are satisfactory and avoid redundant computations.
Figure 2.10 (taken from [60]) illustrates the MLBase procedure.
The users of this system will be able to submit tasks without specifying an
algorithm. The system will then parse the request and select a near-optimal
solution by analysing various alternatives. This would be an important devel-
opment since users would no longer need to find reliable, scalable and accur-
ate solutions solely based on intuition.
Figure 2.10: MLBase Architecture [60]
As of August 2014, the system is still under development. However, the in-
terfaces of its components were described in [49] and a proof-of-concept im-
plementation using Spark was tested. The results demonstrate constant weak
scaling and an order of magnitude better performance than Mahout in twenty
times fewer lines of code.
2.4 Summary
Sequential solutions, such as Weka, fail to cope with the sheer volume of Big
Data workloads. Hadoop is a field-tested solution for large datasets and it sets
the standard for industrial Big Data platforms. However, Hadoop's native im-
plementation of MapReduce is inefficient in executing iterative algorithms.
Spark tackles this issue by introducing a main-memory caching mechanism
and a loop-aware scheduler.
The MapReduce model is efficient in expressing Data Mining algorithms.
Many projects demonstrate that Data Mining workloads can achieve linear
scaling on top of MapReduce clusters. However, designing such a system is a
non-trivial task. The following observations, as extracted from the Literature
Survey, act as guidelines and directions for the system design:
• The leverage of distributed main-memory can yield up to an order of
magnitude shorter execution times;
• The parallel execution model should focus on Data Mining methods
rather than individual algorithms;
• Designing Data Mining processes should abstract away from distrib-
uted computations;
• Using libraries that require heterogeneous execution environments in-
creases complexity, decreases portability and introduces compilation
and bridging overheads.
Spark provides native support for in-memory computations. Additionally,
both Spark and Weka are Java-based and require the same execution environ-
ment (JVM).
Weka represents each category of Data Mining methods using an abstract
interface. Any individual algorithm is required to implement this interface. By
implementing Map and Reduce execution containers (“wrappers”) for Weka's
interfaces, a scalable execution model becomes feasible.
The user interface can be closely modelled after Weka's user interface. This
feature enables users to design and execute Data Mining processes using the
same tools either locally or in distributed environments.
The following chapters provide an in-depth analysis of the proposed Weka-
on-Spark architecture and the implemented execution model.
3 System Architecture
The following chapter presents the architectural components of an efficient
and scalable Big Data Mining (BDM) solution (3.1), the implemented multi-
tier architecture (3.2) and the cluster's monitoring services (3.3).
3.1 Required Architectural Components
Based on the review in Chapter 2, we identify the following required architec-
tural components for a scalable BDM solution:
• Infrastructure Layer: consists of a reconfigurable cluster of either
physical or virtual computing instances;
• Distributed Storage Layer: automatically encapsulates the local stor-
age of the cluster's computing instances into a large-scale logical unit;
• Batch Execution Layer: schedules and executes tasks on data stored in
distributed storage. Must provide support for in-memory computing
for optimal performance;
• Application Layer: integrates the application logic of BDM workloads
into the programming model supported by the Batch Processing
Layer;
• User Interface: the user requires a mechanism to interact with the sys-
tem and submit Data Mining tasks;
• Monitoring Mechanism: performance tuning and system evaluation de-
mand a mechanism to monitor cluster resources.
The following section presents the implemented multi-tier system which
meets the aforementioned requirements.
3.2 Multi-tier Architecture
The implemented architecture is summarised in Figure 3.1.
The Infrastructure Layer is based on clusters of AWS EC2 instances. The
Distributed Storage Layer consists of a set of EBS [61] volumes and a set of SSD
drives managed by HDFS. The Batch Execution Layer is based on the innov-
ative in-memory computing framework Spark. The Application Layer incor-
porates the implemented BDM framework. The Command Line Interface
(CLI) provides user access to framework services. Finally, the monitoring ser-
vices are provided by CloudWatch.
3.2.1 Infrastructure Layer
AWS Elastic Compute Cloud (EC2) provides on demand access to virtual
servers known as compute instances. Using virtualisation technologies AWS
divides large pools of physical hardware spread among multiple data-centres
into virtual EC2 Compute Units (ECU [62]). By combining multiple ECUs,
AWS can produce virtual machines of varying capacities and capabilities.
Figure 3.1: System Architecture
The virtual nature of ECUs allows the user to either launch or terminate
multiple instances in minutes. This feature enables the provision of automatic-
ally resizeable clusters of instances. Batch processing tasks usually require
large raw computing power for short periods of time. Additionally, these tasks
may vary in volumes and complexity. The flexibility of the service provides
an easily configurable and cost-effective solution for Big Data processing
problems.
EC2 provides full administrative access to the user. EC2 instances can be
managed in the same fashion as physical hardware. The user has the ability to
select the Operating System and install any required software component. Ad-
ditionally, AWS provides access to a marketplace where multiple pre-built
Amazon Machine Images (AMI [63]) can be selected and deployed.
Out of many possible options, this project was developed on top of
Amazon's proprietary Linux distribution. Amazon Linux is designed specific-
ally to operate on top of EC2, it is provided for free to EC2 customers and in-
corporates all the essential libraries required by the forthcoming layers.
3.2.2 Distributed Storage Layer
The persistent storage, provided by the cluster's instances, consists of virtual
drives known as Elastic Block Store (EBS) volumes. These volumes are network-attached and persist even after cluster termination. Each EC2
instance possesses a pre-defined number of EBS volumes attached, but the
user is able to increase this number on demand. Additionally, each instance
possesses a physically attached SSD drive. This storage option is faster to ac-
cess, but it is ephemeral and data will be lost on termination.
Big Data problems may require much larger storage capacity than the max-
imum size of an EBS volume. Fault-tolerance demands data partitioning and
replication. Additionally, efficient parallel execution requires dataset partitions
to be processed by different instances. These requirements suggest the need
for a persistent storage abstraction that would automatically handle dataset
partitioning, replication and distribution over a set of instances.
HDFS encapsulates a set of either physical or virtual storage devices into a
single logical unit. The system has three main actors with distinct roles: the
NameNode, the Secondary NameNode and the DataNodes. The NameNode
maintains meta-data about the locations of dataset partitions in the cluster. The
SecondaryNameNode captures snapshots of the NameNode at regular inter-
vals in order to avoid a single point of failure. Finally, the DataNodes store
data on their local drives.
When a dataset is written to HDFS, the system proceeds to partition it into
a number of blocks, triplicate each block and store two blocks in the same
rack and one block in a remote rack. This procedure guarantees fault-tolerance
and promotes minimal data motion: Distributed Computing Frameworks can
obtain meta-data about block locations from the NameNode and schedule exe-
cution on the node where the data is situated.
Figure 3.2 (taken from[64]) illustrates this architecture.
HDFS was installed on the Linux AMIs of all instances by using a Python
script to download the essential libraries and update the configuration files
containing instance locations in the network. By executing a number of shell
scripts provided by the libraries, the NameNode, SecondaryNameNode and
DataNode services were initialised and configured.
At this stage, the system is ready to accept read and write requests, provide
block locations and assist a Batch Execution Layer to move computation to
the data.
3.2.3 Batch Execution Layer
Batch Execution Engines provide the execution containers for user applica-
tions, schedule the execution of user code to a number of instances in the
cluster, monitor progress, provide fault-tolerance mechanisms and present the
results to the users. Although Batch Execution Engines and Distributed Com-
puting Frameworks are usually overlapping terms, it is useful to distinguish
between the two at this stage, because of cases of Distributed Computing
Frameworks which are not batch-oriented.
Two different solutions were evaluated at this layer: Hadoop and Spark.
Hadoop is a field-tested solution with widespread adoption. However, as men-
tioned in Chapter 2 it lacks support for iterative computations and main-
memory caching. These issues were partially tackled by utilising an asyn-
chronous parallel execution model and by taking advantage of Weka's in-
memory computations. However, main-memory is a scarce resource and this
solution did not possess a mechanism to gracefully overcome main-memory
shortages. Additionally, Hadoop's process initialisation and termination costs
were significant. These observations supported the decision to use Spark as
the system's Batch Execution Engine.
Spark guarantees that applications written using the supported operators
can be executed in parallel. During application submission, Spark starts task
executors on a user-defined number of instances in the cluster. Once the ex-
ecutors are started, multiple tasks can be executed by the same threads of exe-
cution, minimizing overheads associated with initialising and terminating Map
and Reduce threads for every processed dataset partition as in Hadoop.
Spark applications retrieve data from HDFS and the system loads local par-
titions to main-memory using the RDD abstraction. Transformations and Ac-
tions can be submitted by user applications. Spark serializes the functions and
decides on work placements based on data locality. Figure 3.3 illustrates the
initialisation procedure.
Spark pessimistically logs RDD transformations and actions, but does not
maintain snapshots of the actual RDDs. This practice would entail CPU over-
heads for RDD serialization, disk overheads for disk caching and retrieval and
network overhead for snapshot distribution. Early benchmarking, by Zaharia
et al. [14], revealed that recomputing lost partitions is faster and more re-
source efficient than utilizing a snapshot and roll-back mechanism. HDFS
maintains multiple replicas in different racks which guarantees system recov-
ery even in the rare cases of rack-level failure.
Spark's greatest contributions to the system are its multiple caching op-
tions. This feature enables benchmarking multiple combinations and researching optimal caching strategies for in-memory computations on BDM workloads.
The following subsection proceeds to analyse these options in detail.
3.2.3.1 Spark and Main-memory Caching
As briefly discussed earlier, Spark applications request files from HDFS and
Spark represents the data in the cluster's memory as distributed Java objects
using the RDD abstraction. These objects are created based on a set of config-
uration parameters submitted in the application context. The user specifies the
number of instances and the allocated per instance main-memory that the ap-
plication should use during execution. The system divides total executor
memory into data caching memory and Java heap memory (to be used for al-
gorithm variables, internal data structures etc.).
The data caching fraction of the memory can be exploited using different
Storage Levels. The Storage Levels define whether Spark should cache ob-
jects in-memory, on-disk or a combination of the two. Additionally, a replica-
tion factor for in-memory objects can be defined. These levels can be further
customised by the use of different serialisation and compression libraries.
More specifically, Spark can be configured to use either the built-in Java seri-
alisation or the Kryo [65] serialisation libraries. According to [66], Kryo of-
fers performance improvements, but requires a custom object registration
class in the Application Layer. Finally, Spark's codecs can further reduce the
memory footprint of RDD objects by compressing serialised byte arrays.
System evaluation in Chapter 5 contributes an analysis of these options and
proposes a scheme to automatically select a strategy based on cluster and data-
set characteristics.
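As a purely hypothetical illustration of such automatic selection, a simple rule might compare the dataset's estimated footprint against the available cache memory. The storage-level names below are Spark's, but the thresholds are invented for illustration and do not reproduce the Caching Strategy Selection Algorithm analysed in Chapter 5.

```java
/** Hypothetical caching-strategy heuristic: choose a Spark storage level
 *  from the estimated dataset size and the total cache memory. The level
 *  names mirror Spark's options; the threshold rules are illustrative. */
public class CachingHeuristic {
    static String selectLevel(long datasetBytes, long cacheBytes) {
        if (datasetBytes <= cacheBytes)
            return "MEMORY_ONLY";            // raw objects fit entirely in memory
        if (datasetBytes <= 3 * cacheBytes)
            return "MEMORY_ONLY_SER";        // a serialised (e.g. Kryo) copy may fit
        return "MEMORY_AND_DISK_SER";        // spill the remainder to local disk
    }
}
```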
3.2.4 Application Layer
The Application Layer consists of a custom-built distributed Data Mining
framework aiming to tackle BDM problems on top of the underlying architec-
ture. It contains services that handle the complete life-cycle of a Data Mining
task. Support is provided for data and model loading and saving, user option
parsing, task initialization, submission and termination, result presentation and
progress monitoring.
The most important contributions of the implemented solution are two-
fold:
• A scalable distributed Data Mining application logic based on the
MapReduce paradigm and Weka's algorithms. The execution model is
analysed in Chapter 4;
• Advanced main-memory caching configuration capabilities that aim to
maximize the benefits of in-memory computing. Advanced users are
able to define a caching strategy, but the system is capable of automat-
ically selecting a caching strategy according to data and cluster charac-
teristics. Extensive analysis on main-memory caching is available in
Chapter 5.
The implementation is packaged in a shaded Java archive (shaded-jar).
This practice simplifies deployment because complex installation procedures
are avoided. The package is self-contained and can be directly submitted for
execution. Details on the deployment procedure are given in Appendix 2.
3.2.5 CLI
The Application Layer is accessible to the user through a Command Line In-
terface (CLI). The user is required to have access to the command line of the
Spark cluster's master node. This is possible through a Secure SHell (SSH)
connection. After establishing a successful connection, the user can submit
tasks to the cluster by providing a path to the Java archive, followed by the es-
sential execution parameters.
Details about usage and the system's supported options are given in Ap-
pendix 3.
3.3 Cluster Monitoring
Cluster resource monitoring is performed by the AWS monitoring service
CloudWatch [67]. CloudWatch-enabled EC2 instances automatically report
disk, CPU and network usage metrics every minute. These metrics are access-
ible through the AWS management console. Instances can be monitored either
independently or aggregated by type. CloudWatch provides visualisation util-
ities as well as raw readings.
However, CloudWatch does not monitor memory usage by default. As the
analysis of main-memory caching effects in relation to BDM workloads was
an integral issue in this work, an application-specific solution was implemen-
ted.
Cron [68] is a time-based task scheduler for Linux. Entries in the “crontab”
(cron table) are executed at defined time intervals. Cron accesses the
crontab every minute and executes the table entries that are scheduled for this
minute. A custom Perl script that monitors main-memory metrics and reports
to CloudWatch was scheduled for execution every minute in all cluster in-
stances. This procedure was automated by developing a Bash script which can
be found in Appendix 4.
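A hypothetical shell equivalent of this setup is sketched below; the function name, sample file and crontab path are invented, and the actual implementation used a Perl script (scheduled by the Bash automation in Appendix 4) that also pushes the metric to CloudWatch.

```shell
# Illustrative stand-in for the monitoring script: derive a used-memory
# percentage from /proc/meminfo-style fields.
mem_used_pct() {
    awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} /^Buffers:/ {b=$2} /^Cached:/ {c=$2}
         END {printf "%.1f", (t - f - b - c) * 100 / t}' "$1"
}

# On a cluster instance this would read the live file: mem_used_pct /proc/meminfo
printf 'MemTotal: 1000 kB\nMemFree: 300 kB\nBuffers: 100 kB\nCached: 100 kB\n' \
    > /tmp/meminfo.sample
echo "MemoryUtilization=$(mem_used_pct /tmp/meminfo.sample)"

# A crontab entry (hypothetical path) schedules the report every minute:
# * * * * * /home/ec2-user/monitor/mem-report.sh
```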
3.4 Summary
This chapter has presented the fundamental architectural components of a scalable BDM architecture, along with details on the interactions between the different layers. The following chapter presents the execution model of the Application Layer and analyses the algorithmic logic which enables scalable parallel execution.
4 Execution Model
This chapter presents a general framework for executing Weka algorithms on
top of MapReduce (4.1) and specific solutions for the Header creation (4.3),
Classification and Regression (4.4), Association Rules Learning (4.5) and
Clustering (4.6).
4.1 Weka on MapReduce
As briefly discussed in previous chapters, Weka was designed for sequential
single-node execution. Efficient parallel execution using the MapReduce
paradigm can be achieved by implementing Decorator classes, also known as
“wrappers”, for core Weka algorithms. Each wrapper class encapsulates a
Data Mining algorithm and exposes the containing functionality through the
Map and Reduce interfaces. The proposed execution model for the headers,
the classifiers and the regressors is based on a set of packages released by the
core development team of Weka [69], adjusted to Spark's API and Scala's
functional characteristics. To the best of the author's knowledge, there are no
benchmarking results published for this model (as of August 2014); in that re-
gard this work has provided a set of benchmarking results.
Spark begins execution by scheduling the slave instances to load local data-
set partitions to main-memory. Each slave invokes a unary Map function con-
taining a Weka algorithm against a local partition and learns an intermediate
Weka model. Intermediate models generated in parallel are aggregated by a
Reduce function and the final output is produced. However, the order of oper-
ands in the Reduce functions is not guaranteed. Consequently, Reduce func-
tions were carefully designed to be associative and commutative, so that the
arbitrary tree of Reducers can be correctly computed in parallel.
This process is illustrated in Figure 4.1.
The functional model demands stateless functions. Spark provides a mech-
anism to broadcast variables, but this practice introduces complexity, race
conditions and network overheads. As a result, Map and Reduce functions
have been designed to solely depend on their inputs. As Map outputs consist
of Weka models (plain Java objects), this should minimize network communi-
cation between the nodes during execution.
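The wrapper idea can be sketched as follows. The Model interface and the count-based stand-in are illustrative substitutes for Weka's classifier interfaces, and plain sequential iteration stands in for Spark's parallel scheduling of the Map phase.

```java
import java.util.List;
import java.util.function.Function;

/** Sketch of the execution model's "wrapper" pattern: a Map function
 *  trains one intermediate model per cached partition and a Reduce
 *  function merges models. In the real system the models wrap Weka
 *  algorithms; here a trivial row-count model stands in for them. */
public class WekaMapReduceSketch {
    interface Model { Model merge(Model other); }

    /** Generic job: map each partition to a model, then reduce. The merge
     *  must be associative and commutative, because Spark gives no
     *  guarantee on the order in which partial models meet. */
    static Model runJob(List<double[][]> partitions,
                        Function<double[][], Model> mapFn) {
        Model result = null;
        for (double[][] p : partitions) {                 // Map phase
            Model m = mapFn.apply(p);                     // stateless: depends only on p
            result = (result == null) ? m : result.merge(m);  // Reduce phase
        }
        return result;
    }

    /** Stand-in "model": per-partition row count, merged by addition. */
    static final class CountModel implements Model {
        final long rows;
        CountModel(long rows) { this.rows = rows; }
        public Model merge(Model o) {
            return new CountModel(rows + ((CountModel) o).rows);
        }
    }
}
```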
4.2 Task Initialisation
This section outlines the essential steps towards the execution of MapReduce
tasks on Spark. These include parsing user provided options, configuring the
application context, configuring the user requested task and submitting the
task for execution in the cluster.
Figure 4.2 displays the application's main thread in pseudo-code that mim-
ics the Scala syntax.
Figure 4.1: Execution Model
Upon submitting a task using the CLI and a set of parameters, this thread is
invoked in the Spark Master. The application parses user options using a cus-
tom text parser, configures the essential environment parameters (application
name, total number of cores, per-instance cache memory etc.) and initialises a
Task Executor. If a caching strategy is not provided, the application will com-
pute a custom caching strategy using the implemented Caching Strategy Se-
lection Algorithm (this algorithm is analysed and evaluated in Section 5.4).
Figure 4.3 displays the steps taken by the Task Executor to define a logical
representation of the dataset and to configure the user requested task.
Figure 4.2: WekaOnSpark's main thread
The Task Executor begins the execution procedure by defining an RDD
(“rawData”) from a file on HDFS. As briefly discussed earlier, RDD Trans-
formations are “lazy”: until an Action is issued (a Reduce operator in this
case), the RDD will not be materialised. Thus, the RDD at this stage is logical.
It contains the path upon which it will be created and the caching mechanism
that will be used.
Weka processes data in a special purpose object format known as Instances.
This object contains a header (meta-data about the attributes) and an array of
Instance objects. Each Instance object contains a set of attributes which rep-
resents the raw data. HDFS data on Spark are defined as RDDs of Java String
objects. Thus, a Transformation is needed to parse the strings and build an In-
stances object for each partition. This is achieved by defining a new RDD
(“dataset”), based on the previous RDD (“rawData”) and a Map function.
These Transformations are automatically logged by Spark into a lineage
graph. Figure 4.4 displays the state of the graph after the steps described
above.
Figure 4.3: Task Executor
At this stage, the Task Executor will use the newly defined RDD as an ini-
tialisation parameter for the user requested task. These tasks will add their
own Transformations to the graph. When an Action is issued (a Reduce func-
tion), Spark will schedule the cluster instances to build the RDD partitions
from their local dataset partitions, and to materialise the Transformations in
parallel.
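The "lazy" behaviour described above can be illustrated with Java streams, whose intermediate operations are likewise only recorded until a terminal operation (the analogue of a Spark Action) forces execution. The sketch is an analogy only; Spark additionally logs the lineage graph for fault recovery.

```java
import java.util.Arrays;
import java.util.stream.Stream;

/** Lazy evaluation in miniature: the map below is only recorded, and no
 *  parsing happens until the terminal reduce is invoked. */
public class LazyDemo {
    static int parsedLines = 0;

    static int[] run() {
        parsedLines = 0;
        Stream<Integer> dataset = Arrays.asList("1", "2", "3").stream()
            .map(s -> { parsedLines++; return Integer.parseInt(s); }); // logged, not run
        int before = parsedLines;                   // still 0: map not yet executed
        int sum = dataset.reduce(0, Integer::sum);  // "Action": forces the pipeline
        return new int[]{before, sum, parsedLines}; // {0, 6, 3}
    }
}
```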
The following sections present the implementation of the tasks supported
by the framework.
4.3 Headers
Weka pays particular attention to meta-data about the analysed datasets. An
essential initial step is to compute the header of the ARFF file (Weka's suppor-
ted format, represented by the aforementioned Instances object at runtime). A
header contains attribute names, types and multiple statistics including min-
imum and maximum values, average values and class distributions of nominal
attributes.
Figure 4.5 displays the MapReduce job that computes the dataset's header
file.
Figure 4.4: Lineage Graph
The job requires the attributes names, the total number of attributes and a
set of options. These parameters are used by the Map function to define the
expected structure of the dataset.
Map functions compute partition statistics in parallel. Figure 4.6 displays
the implementation of the Map function.
Reduce functions receive input from the Map phase and aggregate partition
statistics to global statistics. Figure 4.7 displays the implementation of the Re-
duce function.
Figure 4.5: Header creation MapReduce job
Figure 4.6: Header Creation Map Function
Figure 4.7: Header Creation Reduce Function
This procedure is only mandatory for nominal attributes, but it can be invoked on any attribute type. Upon creation, Headers are distributed to the next
MapReduce stages as an initialisation parameter. This procedure is required
only once for each dataset; upon creation Headers can be stored in HDFS and
retrieved upon request.
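For a single numeric attribute, the Map and Reduce logic of the header job can be sketched as follows. Field names are illustrative; the real job builds a complete Weka Instances header covering every attribute.

```java
/** Sketch of Header creation for one numeric attribute: Map computes
 *  partition-level statistics, Reduce merges them into global statistics. */
public class HeaderSketch {
    static final class AttrStats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        long count = 0;

        /** Reduce: associative and commutative merge of partials. */
        AttrStats merge(AttrStats o) {
            AttrStats r = new AttrStats();
            r.min = Math.min(min, o.min); r.max = Math.max(max, o.max);
            r.sum = sum + o.sum; r.count = count + o.count;
            return r;
        }
        double mean() { return sum / count; }
    }

    /** Map: statistics over one partition's attribute values. */
    static AttrStats mapPartition(double[] values) {
        AttrStats s = new AttrStats();
        for (double v : values) {
            s.min = Math.min(s.min, v); s.max = Math.max(s.max, v);
            s.sum += v; s.count++;
        }
        return s;
    }
}
```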
4.4 Classification and Regression
Classifiers and Regressors are used to build prediction models on nominal and
numeric values respectively. Although many learning algorithms in these cat-
egories are iterative, both training and testing phases can be completed in a
single step using asynchronous parallelism.
It is important to emphasize at this stage the performance improvement
offered by Spark in multi-phase execution plans. Once the dataset is loaded
into main-memory during the Header creation phase, Spark maintains a cached copy of the dataset until explicitly instructed to discard it. This feature offers signi-
ficant speed-up in consecutive MapReduce phases, because the redundant
HDFS accesses required by Hadoop are avoided.
4.4.1 Model Training
Once the Headers are either computed or loaded from persistent storage,
Spark schedules slave instances to begin the training phase. Every instance
possesses a number of cached partitions and trains a Weka model against each
partition, using a Map function. Classifiers and Regressors are represented in
Weka by the same abstract object.
Figure 4.8 displays the implementation of the model training Map function.
By using Meta-Learning techniques, the intermediate models are aggregated by a Reduce function into a final model.
Depending on the characteristics of the trained model the final output may
be:
• A single model, in case the intermediate models can be aggregated
(where a model of the same type as the inputs can be produced)
• A Voted Ensemble of models, in case intermediate models cannot be
aggregated.
Figure 4.9 displays the implementation of the model aggregation Reduce
function.
Figure 4.8: Model Training Map Function
Trained models can either be used directly for testing unknown data objects or
be stored in HDFS for future use.
4.4.2 Model Testing and Evaluation
Once a trained model is either computed or retrieved from persistent storage,
the model Evaluation phase can be completed in a single MapReduce step.
The trained model is distributed to the slave instances as an initialisation
parameter to the Evaluation Map functions. During the Map phase, each in-
stance evaluates the model against its local partitions and produces the inter-
mediate evaluation statistics.
Figure 4.10 displays the classifier Evaluation Map function.
Figure 4.9: Model Aggregation Reduce Function
Reduce functions produce the final output by aggregating intermediate res-
ults. Figure 4.11 displays the implementation of the evaluation Reduce func-
tion.
In a similar fashion, trained models can be used to classify unknown in-
stances.
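The evaluation Map/Reduce pair (Figures 4.10 and 4.11) follows the same shape. The sketch below is illustrative Python, with a hypothetical one-line model standing in for a trained Weka classifier: each Map computes partial statistics against its local partition, and the Reduce sums them into the final evaluation.

```python
def evaluate_map(model, partition):
    """Evaluate the distributed model against one partition and emit
    intermediate evaluation statistics."""
    correct = sum(1 for x, label in partition if model(x) == label)
    return {"correct": correct, "total": len(partition)}

def evaluate_reduce(s1, s2):
    """Aggregate intermediate evaluation statistics."""
    return {k: s1[k] + s2[k] for k in s1}

# A stand-in "trained model": classify non-negative inputs as "pos".
model = lambda x: "pos" if x >= 0 else "neg"
parts = [[(1, "pos"), (-1, "neg")], [(2, "neg"), (-3, "neg")]]
stats = evaluate_reduce(*(evaluate_map(model, p) for p in parts))
accuracy = stats["correct"] / stats["total"]  # 3 correct out of 4
```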
4.5 Association Rules
Association Rules are computed in parallel using a custom MapReduce imple-
mentation of the Partition [30] algorithm. Partition requires two distinct
phases to compute association rules on distributed datasets. In the candidate
generation phase, a number of candidate rules are generated in each partition.
In the candidate validation phase, the true support and significance metrics are
computed for all candidates, and those that do not meet the global criteria are
pruned.
Figure 4.10: Classifier Evaluation Map Function
Figure 4.11: Evaluation Reduce Function
Figure 4.12 displays the two distinct execution phases of the Association
Rule Learning job.
The user defines a support threshold and, optionally, a threshold on any Weka-
supported measure of significance (by default, confidence is used). A number
of Map functions proceed to mine the partitions in parallel using a Weka
association rule learner and generate candidate rules. A rule is considered a
candidate if the global significance criteria are met in any of the partitions.
Candidate rules are exported from the Map functions using a hash-table.
Figure 4.13 displays the candidate generation Map function.
Reduce functions aggregate multiple hash-tables and produce a final set of
candidates. The hash-table data structure was selected because it enables
near-constant lookup time.
Figure 4.14 displays the candidate generation and validation Reduce func-
tion.
Figure 4.12: Association Rules job on Spark
Figure 4.13: Candidate Generation Map Function
In the validation phase, each Map function receives the set of candidates
and computes support metrics for every rule.
Figure 4.15 displays the Validation phase Map function.
The validation Reduce phase uses the same Reduce function to aggregate
the metrics across all partitions. Each rule that fails to meet the global criteria
is pruned. The rules are sorted on the requested metrics and returned to the
user.
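The two phases can be sketched as follows. This Python fragment is a simplified illustration of the Partition idea, mining frequent single items rather than full Weka association rules; the thresholds and baskets are invented for the example. Phase 1 emits locally frequent items as candidates; phase 2 counts true global support and prunes.

```python
from collections import Counter

def candidate_map(partition, min_support):
    """Phase 1: mine one partition and emit items that are locally frequent."""
    counts = Counter(item for basket in partition for item in basket)
    local_min = min_support * len(partition)
    return {item for item, c in counts.items() if c >= local_min}

def validate_map(partition, candidates):
    """Phase 2: count the true support of every global candidate."""
    counts = Counter()
    for basket in partition:
        for item in candidates & basket:
            counts[item] += 1
    return counts

partitions = [[{"a", "b"}, {"a"}], [{"b"}, {"b", "c"}]]
min_support = 0.5
candidates = set().union(*(candidate_map(p, min_support) for p in partitions))
totals = sum((validate_map(p, candidates) for p in partitions), Counter())
n = sum(len(p) for p in partitions)
frequent = {i for i in candidates if totals[i] >= min_support * n}  # {"a", "b"}
```

The key Partition property illustrated here is that any globally frequent item must be locally frequent in at least one partition, so the union of local candidates cannot miss a true result.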
Figure 4.14: Candidate Generation and Validation Reduce Function
Figure 4.15: Validation Phase Map Function
4.6 Clustering
Although the execution model can theoretically be applied to any Clusterer,
in practice it was found to be usable only for the Canopy Clusterer [70].
Canopies divide the dataset into overlapping regions using a cheap distance
metric. Canopies within a threshold are assumed to represent the same region
and can thus be aggregated. Map functions can be used to build Canopy
Clusterers on partitions in parallel, and Reduce functions to aggregate
Canopies covering the same region.
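The Canopy aggregation idea can be illustrated with a minimal one-dimensional sketch. This is plain Python, not Weka's Canopy implementation; the threshold and points are invented for the example.

```python
def canopy_map(points, t):
    """Build canopy centres in one partition using a cheap 1-D distance:
    a point becomes a new centre only if no existing centre is within t."""
    centres = []
    for p in points:
        if all(abs(p - c) > t for c in centres):
            centres.append(p)
    return centres

def canopy_reduce(centres, t):
    """Merge centres from different partitions: centres within the threshold
    are assumed to describe the same region and are averaged."""
    merged = []
    for c in sorted(centres):
        if merged and abs(c - merged[-1]) <= t:
            merged[-1] = (merged[-1] + c) / 2
        else:
            merged.append(c)
    return merged

parts = [[0.0, 0.4, 5.0], [0.2, 5.3, 9.0]]
local = [c for p in parts for c in canopy_map(p, t=1.0)]
centres = canopy_reduce(local, t=1.0)  # one centre each near 0, 5 and 9
```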
Other clustering algorithms do not share this property. Aggregation would
require the use of consensus clustering techniques. However, Weka does not
support consensus clustering.
In a different approach, Map functions could be used to compute distances,
assign data points to cluster centres and update the position of each cluster
centre. Reduce functions could receive the per-partition cluster centres, com-
pute their average values and assess the stop condition. If the stop condition is
not met, the new cluster centres would be distributed to a new set of Map
functions and begin an additional iteration. This process was found to be in-
feasible in practice, because Weka, as of August 2014, does not support a
mechanism to explicitly define cluster centres at the beginning of each itera-
tion.
The Weka core development team (Dr. Mark Hall) was contacted regarding
these issues. It was concluded that clustering algorithms cannot be efficiently
expressed using the implemented model. The methodology assumes that Weka
algorithms can be encapsulated in MapReduce wrappers.
This was found to be infeasible for clustering algorithms due to Weka's limita-
tions. Therefore, distributed clustering using Weka remains an open research
issue.
4.7 Summary
This chapter presented a scalable methodology to execute Weka algorithms in
a distributed environment using the MapReduce paradigm. The following
chapter analyses the benchmarking results of the presented multi-tier architec-
ture and the proposed execution model.
5 System Evaluation
This chapter presents the Evaluation Metrics (5.1), the System Configuration
(5.2), the Evaluation Results (5.3) and the Caching Strategy Selection Al-
gorithm (5.4).
5.1 Evaluation Metrics
The system was evaluated through a number of experiments on AWS. The
evaluation assesses the system on four different metrics:
• Elapsed Execution Time: The system's execution time on different
tasks is measured across multiple task submissions, using multiple
algorithms, dataset sizes and cluster sizes;
• Memory Utilisation: The system's main-memory utilisation is assessed
under seven different caching schemes. The results are cross-validated
against the other system parameters;
• IO Utilisation: Disk and network bandwidth are limited and much
slower to access than main-memory. Input and output streams are
monitored in order to determine and, where possible, remove potential
performance bottlenecks;
• CPU Utilisation: CPU usage is monitored to determine the CPU-time
overheads introduced by different parameters.
The system's scalability is assessed in multiple execution scenarios. More spe-
cifically, two different scalability metrics are computed [20]:
1. Weak scaling: The per-instance problem size remains constant and ad-
ditional instances are used to tackle a bigger problem. Weak scaling ef-
ficiency (as percentage of linear) can be computed using the following
formula [20]:
S = (T1 / Tn) × 100%
Where T1 is the elapsed execution time for one work unit using a single instance
and Tn is the same metric for N work units on N instances. Linear weak scaling
would be achieved if execution times remained constant, regardless of the scale.
2. Strong scaling: The total problem size remains constant and additional
instances are assigned to speed-up computations. Strong scaling effi-
ciency can be computed using the following formula [20]:
S = T1 / (N × Tn) × 100%
Where T1 is the elapsed processing time for one work unit using a single
instance and Tn is the elapsed time for the same workload using N instances.
Linear strong scaling would be achieved if execution times decreased
proportionally to the number of processing instances.
An ideal parallel system would have both linear strong and weak scaling.
In practice, overheads associated with distributing computations lead to sub-
linear performance. Resource utilisation monitoring is used to determine the
causes of sub-linear performance and propose modifications.
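Applying the two formulas to the SVM execution times reported later (Table 5.1) gives a concrete sense of the metrics; the following is a minimal Python sketch of that arithmetic.

```python
def weak_efficiency(t1, tn):
    """Weak scaling: S = (T1 / Tn) * 100, with N work units on N instances."""
    return t1 / tn * 100

def strong_efficiency(t1, tn, n):
    """Strong scaling: S = T1 / (N * Tn) * 100, fixed problem on N instances."""
    return t1 / (n * tn) * 100

# SVM times from Table 5.1: 135 s on 8 cores / 5GB (baseline work unit),
# 147 s on 128 cores / 80GB (16x the data on 16x the cores),
# 32 s on 128 cores / 5GB (same data on 16x the cores).
weak = weak_efficiency(135, 147)         # ~91.8% of linear
strong = strong_efficiency(135, 32, 16)  # ~26.4% of linear
```

The weak-scaling figure is within 10% of linear, while the strong-scaling figure on the small dataset is low, consistent with the fixed initialisation overheads discussed in Section 5.3.2.2.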
The system is compared against core Weka's distributed implementation on
Hadoop (Weka-On-Hadoop). It is interesting to compare the two systems in
terms of execution time, scalability and resource efficiency. Additional com-
parisons are drawn against competing systems in the surveyed literature. It is
of course difficult to compare against proprietary solutions because accurate
details about the benchmarking procedures are usually not provided.
5.2 System Configuration
Benchmarking was performed using a single Spark Master node in three dif-
ferent cluster configurations:
1. Small scale: Two 4-core m3.xlarge slave instances, possessing 28.2GB
of main-memory;
2. Medium scale: Eight m3.xlarge slave instances, possessing 32 cores
and 112.2GB of main-memory in total;
3. Large scale: Thirty-two m3.xlarge slave instances, possessing 128
cores and 451.2GB of main-memory in total.
The dataset scales used in the evaluation process were proportional to the
cluster sizes:
1. Small scale: 5GB datasets;
2. Medium scale: 20GB datasets;
3. Large scale: 80GB datasets.
The decision to use these data volumes was based on Ananthanarayanan et
al. [71], who analysed the access patterns of data mining tasks at Facebook.
According to this report, 96% of tasks processed data that could be stored
in only a fraction of main-memory (assuming 32GB of main-memory per
server). In a similar project by Microsoft Research, Appuswamy et al. [72]
report that the majority of real-world Data Mining tasks process less than
100GB of input. These observations suggest that, although it is technically
possible to process petabyte-scale datasets in a single task, it is uncommon
in practice.
Finally, each of the three implemented categories of algorithms is represen-
ted by a commonly used algorithm:
1. Regression algorithms using Linear Regression;
2. Classification algorithms using SVM;
3. Association Rule Learning algorithms using FP-Growth.
Further work could investigate both a wider set of algorithmic approaches and
a more exhaustive set of data-sizes.
It is important to emphasize that the computing resources on which the
benchmarks were executed are virtualised. Consequently, although AWS
follows rigorous quality assurance procedures, the advertised resources of the
instances may differ marginally between experiments. This variability can be
attributed to hardware heterogeneity in Amazon's data-centres and interference
between different virtual machines hosted on the same physical hardware
[73]. In order to alleviate the effects of multi-tenancy on the physical host,
large compute instance sizes were selected. This approach decreases the
chance of “noisy neighbours” [74].
The experiments were repeated multiple times and the analysis is based on
average values. However, the results did not vary as much as expected based
on the literature survey.
5.3 Evaluation Results
The following presents and analyses the benchmarking results, computes the
system evaluation metrics and assesses the efficiency of the implemented
solution.
5.3.1 Execution Time
This section analyses the performance of the system during the benchmarking
experiments. The execution times are compared with an identical workload
on Hadoop.
Table 5.1 displays the execution times of the system on the SVM bench-
mark. Three system sizes were tested against three different dataset scales.
Table 5.2 displays the execution times of the aforementioned benchmark on
Weka-On-Hadoop.
Weka-On-Spark SVM (sec)
            5GB    20GB   80GB
8 Cores     135    506    2008
32 Cores    45     139    551
128 Cores   32     57     147
Table 5.1: Execution Times for SVM on Weka-On-Spark
Weka-On-Hadoop SVM (sec)
            5GB    20GB   80GB
8 Cores     287    900    3389
32 Cores    135    317    927
128 Cores   127    129    371
Table 5.2: Execution Times for SVM on Weka-On-Hadoop
Figure 5.1 plots the execution time results across all dataset scales and
cluster sizes.
Speed-up can be defined as [5]: S = Told / Tnew
Where Told represents the elapsed execution time of the system prior to the
introduction of the proposed improvement and Tnew is the elapsed execution
time on the improved system.
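As a worked example, the speed-up values can be recomputed directly from the execution times in Tables 5.1 and 5.2 (a short Python sketch; rounding to two decimal places follows Table 5.3).

```python
def speed_up(t_old, t_new):
    """S = T_old / T_new: Hadoop time over Spark time for the same task."""
    return t_old / t_new

# Execution times from Table 5.2 (Hadoop) and Table 5.1 (Spark):
s1 = round(speed_up(287, 135), 2)  # 2.13 (8 cores, 5GB)
s2 = round(speed_up(135, 45), 2)   # 3.0  (32 cores, 5GB)
```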
Table 5.3 displays the achieved speed-up of Weka-On-Spark compared to
Weka-On-Hadoop on identical tasks.
Figure 5.1: Execution times for SVM on Weka-On-Hadoop and Weka-On-Spark
The use of Spark on distributed Weka workloads speeds up computations
by up to four times, and by a factor of 2.36 on average. It is important to
divide these experiments into two distinct categories:
1. Experiments where full dataset caching was possible on Weka-On-
Spark;
2. Experiments where only partial dataset caching was possible.
The first category is shaded in Table 5.3.
Full dataset caching achieves an average speed-up of 2.7x. Although
Weka's caching was enabled on Hadoop, and thus the iterations SVM requires
to converge were performed in-memory at each partition, the task requires at
least two stages: Header Creation and Classifier Training. Weka-On-Hadoop
must reload the dataset at each stage, parse the rows to build the supported
format and initialise the Map tasks. In contrast, Weka-On-Spark retains the
dataset in-memory and avoids multiple initialisations by re-using the task
threads of the first stage.
In cases where full caching was not possible, the average speed-up was
measured at 1.7x. Although both systems are forced to reload partitions from
HDFS, Weka-On-Spark achieves superior performance by better leveraging
the underlying resources.
Speed-up     5GB    20GB   80GB   Average
8 Cores      2.13   1.78   1.69   1.86
32 Cores     3.00   2.28   1.68   2.32
128 Cores    3.97   2.23   2.52   2.91
Average      3.03   2.10   1.96   2.36
Table 5.3: Speed-up
Tables 5.4 and 5.5 display the average CPU utilisation of Weka-On-Spark
and Weka-On-Hadoop during each experiment.
Weka-On-Spark consumes on average 27.1% more CPU time. Both systems
execute the same implementation of the SVM algorithm in the same
environment (JVM), and they perform the same amount of computation per
byte of data. Additionally, both systems use HDFS and suffer from the same
data-fetching latencies. Consequently, the workloads are identical. Weka-On-
Spark saturates the CPU and thus achieves higher throughput and faster
execution times. In order to identify the reason behind this behaviour, it is
necessary to examine the systems from a different perspective.
Tables 5.6 and 5.7 display the average main-memory utilisation during the
aforementioned benchmark on Weka-On-Spark and Weka-On-Hadoop
(shaded cells indicate that full dataset caching was possible).
SVM on Weka-On-Hadoop (CPU %)
            5GB      20GB     80GB
8 Cores     67.00%   71.00%   67.40%
32 Cores    73.90%   73.40%   72.10%
128 Cores   65.00%   74.20%   76.10%
Table 5.5: CPU Utilisation of Weka-On-Hadoop

SVM on Weka-On-Spark (CPU %)
            5GB      20GB     80GB
8 Cores     98.00%   98.70%   98.00%
32 Cores    99.00%   98.60%   98.10%
128 Cores   97.20%   97.60%   99.10%
Table 5.4: CPU Utilisation of Weka-On-Spark
Weka-On-Hadoop demonstrates stable memory footprints across the exper-
iments. The system loads a partition for each active Map container into mem-
ory, executes the Map task, discards the processed partition and repeats the
procedure until the whole dataset is processed. In contrast, Weka-On-Spark
loads partitions until memory saturation and schedules the Map tasks to
process in-memory data.
In cases where the available memory is larger than the dataset, Weka-On-
Spark's approach to cache the dataset has obvious benefits. Successive stages
process the same RDDs and the need to reload and rebuild the dataset in the
required format is avoided.
In cases where the dataset cannot be fully cached, Spark applies a partition
replacement policy in which the Least Recently Used (LRU) partition is
replaced. Under this policy it is highly unlikely that successive stages will
find the required partitions in-memory. Thus, partitions are loaded from disk,
as in Weka-On-Hadoop. However, there is a big difference between the
mechanisms Hadoop and Spark use to implement this process.
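The LRU policy can be sketched as follows. This is an illustrative Python model with a hypothetical `load` callback, not Spark's actual BlockManager: an access moves a partition to the "recently used" end, and a miss on a full cache evicts the least recently used entry.

```python
from collections import OrderedDict

class LRUPartitionCache:
    """Minimal LRU replacement policy of the kind Spark applies when the
    distributed cache is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def access(self, partition_id, load):
        if partition_id in self.cache:
            self.cache.move_to_end(partition_id)   # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)     # evict least recently used
            self.cache[partition_id] = load(partition_id)
        return self.cache[partition_id]

cache = LRUPartitionCache(capacity=2)
for pid in ["p1", "p2", "p1", "p3"]:   # p2 is evicted; p1 and p3 survive
    cache.access(pid, load=lambda p: f"data({p})")
resident = set(cache.cache)            # {"p1", "p3"}
```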
Hadoop reads HDFS partitions using an iterator. Map tasks read an HDFS
partition line-by-line (each line is represented by a key-value pair), process
each line and emit intermediate key-value pairs if necessary. In the specific
SVM on Weka-On-Spark (Memory %)
            5GB      20GB     80GB
8 Cores     71.80%   92.10%   94.10%
32 Cores    21.10%   72.10%   96.10%
128 Cores   9.10%    25.20%   72.90%
Table 5.6: Main-memory utilisation of Weka-On-Spark

SVM on Weka-On-Hadoop (Memory %)
            5GB      20GB     80GB
8 Cores     23.00%   27.10%   26.90%
32 Cores    10.80%   11.20%   14.10%
128 Cores   6.90%    8.90%    9.20%
Table 5.7: Main-memory utilisation of Weka-On-Hadoop
case of Weka-On-Hadoop, the partitions are read line-by-line, each line is pro-
cessed by a parser and then added to an Instances object (Weka's dataset rep-
resentation). When this procedure is completed, the Map tasks execute the
SVM algorithm and iterate over the data until the algorithm converges. When
the model is built, Hadoop emits the trained model, discards the data and
schedules the Mapper to process a new partition. Thus, reading data from
HDFS is coupled with data processing: while the system is reading data, the
CPU is idle and while the system is processing data the I/O subsystem is idle.
This process leads to suboptimal resource utilisation: CPU cycles are wasted
and I/O bandwidth is never saturated.
Spark resolves this issue by introducing a main-memory abstraction (the
RDD), which decouples the two phases. Map tasks process RDD partitions
that are already in-memory. As the system is not required to wait for I/O and
reads directly from main-memory, maximum CPU utilisation is achieved.
Additionally, Spark evicts older partitions from the distributed cache and
fetches the next set of partitions from HDFS regardless of the task execution
phase. This enables data to be read at a faster rate (reading is performed at the
block level rather than the line level) and overlaps data loading with data
processing. These two features, alongside the aforementioned shorter
initialisation times, contribute to a significant speed-up over the Hadoop-
based solution.
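The decoupling of loading from processing can be illustrated with a minimal producer-consumer sketch. Plain Python threads stand in for Spark's fetch and compute machinery here; the block contents and the `process` function are invented for the example.

```python
import queue
import threading

def prefetching_pipeline(blocks, process):
    """Overlap block loading with processing using a bounded queue, in the
    spirit of Spark's decoupled fetch/compute phases."""
    q = queue.Queue(maxsize=2)
    results = []

    def loader():
        for b in blocks:
            q.put(b)      # simulated HDFS block read
        q.put(None)       # end-of-stream sentinel

    t = threading.Thread(target=loader)
    t.start()
    while (block := q.get()) is not None:
        results.append(process(block))  # CPU works while the loader reads on
    t.join()
    return results

out = prefetching_pipeline([1, 2, 3], process=lambda b: b * 2)  # [2, 4, 6]
```

In the coupled Hadoop-style design the loader and the processor would run strictly in turn, leaving one of the two resources idle at any moment.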
5.3.2 Scaling Efficiency
This section presents the weak and strong scaling efficiencies of the system,
as measured during the experiments on AWS. The scaling efficiency percent-
ages are computed using the formulas of Section 5.1.
5.3.2.1 Weak Scaling
Figure 5.2 demonstrates the system's weak scaling efficiency for the three
different algorithms used for benchmarking. The figure also presents the weak
scaling efficiency of the SVM algorithm on Hadoop.
The system's execution times approach ideal linear performance to within
10% for clusters of up to 128 cores in all cases. In contrast, Hadoop's weak
scaling efficiency decreases as the scale increases.
As discussed earlier, a slight decline in performance is expected in fully
distributed systems. This behaviour is associated with monitoring multiple
instances and load balancing data across the cluster. However, Spark achieves
high locality in consecutive stages, and these effects are minimal in multi-stage
execution plans.
5.3.2.2 Strong Scaling
Figures 5.3, 5.4 and 5.5 demonstrate the system's strong scaling efficiency on
the three different categories of algorithms. Each figure depicts the system's
strong scaling on the small, medium and large scale datasets used in the exper-
iments. Full raw data can be found in Appendix 1.
Figure 5.2: Weak Scaling Efficiencies
Figure 5.4: Strong Scaling for Linear Regression
Figure 5.3: Strong Scaling for SVM
Strong scaling efficiencies on Spark approach linearity when datasets are
large and runtime is dominated by computations. Using large clusters for
small scales proves inefficient due to constant initialisation overheads.
These overheads were measured at 11 seconds, regardless of the scale of the
system. On the large cluster (128 cores), this corresponds to 40.8% of the
average total execution time at the small scale (5GB) and to 20.3% at the
medium scale (20GB). As the dataset size increases, runtime is dominated
by computations and these overheads become minimal compared to the total
execution time.
For comparison purposes, Figure 5.6 illustrates the strong scaling effi-
ciency of Hadoop on SVM.
Figure 5.5: Strong Scaling for FP-Growth
Weka-On-Hadoop's strong scaling efficiency is inferior due to larger
initialisation overheads. Hadoop's initialisation cost was measured at 23
seconds. This overhead is introduced at the beginning of every MapReduce
stage, whereas in Spark it is only incurred in the first stage.
Additional comparisons can be drawn against similar workloads reported
in the literature survey. Wegener et al. [59] disabled Weka's caching and
forced the algorithms to read directly from persistent storage in a Hadoop
cluster. The reported strong scaling was nearly 88%, but the reported execution
times on NaiveBayes were more than an order of magnitude slower. Jha et al.
[75] benchmarked Spark on Data Mining workloads and report similar scaling
efficiencies to those presented above. Additionally, Jha compares Spark
against Hadoop, Mahout, distributed Python scripts and MPI [76]. Data
Mining on Spark outperforms all the competing systems in scaling efficiency,
but it is 50% slower than MPI. Given that MPI is a low-level paradigm, this
result was expected. In MPI, the user must program the data placement, node
communication, scheduling etc. explicitly, leading to long development times.
Finally, Ricardo [52], an R- and Hadoop-based system, reports 83% scaling
efficiency but is 100% slower than Hadoop.
Figure 5.6: Strong Scaling on Weka-On-Hadoop
Where raw data are not provided, the scaling efficiency numbers are extracted
from the figures of the surveyed publications. As the scale intervals are
usually large, these numbers are approximate and may contain errors.
5.3.3 Main-Memory Caching
One of the objectives of this project was to study the effects of different cach-
ing strategies on BDM workloads. This section presents and analyses the ex-
perimental results from a main-memory perspective.
5.3.3.1 Caching overheads
Spark RDDs are represented in memory as distributed Java objects. These
objects are very fast to access and process, but they may consume up to five
times more memory than the raw data they hold. This overhead can be
attributed to the meta-data that Java stores alongside objects and the memory
consumed by objects' internal pointers.
For example, a Java String introduces approximately 40 bytes of overhead
(the object header, the internal character array and fields such as the string
length), and each character consumes two bytes. Consequently, a 10-character
String requires about 60 bytes of main memory. However, Spark offers a
series of tools to tackle these overheads, introducing serialisation and
compression, as well as an efficient mechanism for falling back to disk-based
persistence.
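These overhead figures can be captured in a back-of-the-envelope estimator. The following Python sketch is illustrative only; the constants are the figures quoted above, not measurements.

```python
def java_string_footprint(n_chars, overhead_bytes=40, bytes_per_char=2):
    """Approximate Java String footprint: fixed object overhead plus the
    per-character cost quoted above."""
    return overhead_bytes + bytes_per_char * n_chars

def worst_case_rdd_size(raw_bytes, blowup=5):
    """Worst-case in-memory RDD size: up to 5x the raw on-disk data."""
    return raw_bytes * blowup

footprint = java_string_footprint(10)  # 60 bytes, the example above
```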
The selection of an efficient caching strategy demands consideration of
these overheads. A number of different dataset samples from the UCI Machine
Learning repository [77] and Stanford SNAP [78] were tested in order to
measure memory overheads for different categories of datasets.
A simple application that loads a dataset from HDFS, builds an RDD and
reports the on-disk and RDD sizes was implemented. Tables 5.8 and 5.9 dis-
play the RDD size as percentage of the original on-disk value.
Uncompressed main-memory footprints vary greatly and can reach 600%
of the original dataset size. However, serialised objects demonstrate footprints
close to the on-disk values in all cases. Compression yields additional
reductions of 50% and 75% for dense and sparse datasets respectively.
Table 5.8: RDD size as percentage of the original on-disk value (I)
Table 5.9: RDD size as percentage of the original on-disk value (II)

Dataset                  Type of data              Java Ser. and  Kryo Ser. and
                                                   Compression    Compression
Susy                     Structured Numeric        52.00%         51.60%
Higgs                    Structured Numeric        50.80%         50.80%
Generated                Structured String Sparse  22.45%         22.45%
USCensus                 Structured Numeric        38.08%         38.08%
Record Linkage           Structured String         44.00%         43.00%
KDD data '10             Structured String         47.00%         47.00%
adult                    Structured String         53.16%         53.16%
supermarket              Structured String Sparse  26.32%         26.32%
Million Song Dataset     Structured Numeric        55.00%         54.38%
Wiki articles            Text                      44.12%         44.12%
Social circles: Twitter  Graph                     44.55%         43.64%
Epinions social network  Graph                     48.15%         44.44%
Google Web Graph         Graph                     40.00%         38.18%
Kryo serialisation shows marginally better compression ratios than the
built-in Java serialisation.
In order to further assess the benefits of caching, a performance analysis of
different caching strategies is required. The following subsection presents
benchmarking results of seven different caching schemes at large scales.
5.3.3.2 Caching and Performance
Memory consumption followed a predictable pattern, regardless of the
caching scheme in use. Initially, the system builds the RDDs by loading parti-
tions from secondary storage. After the completion of the initialisation pro-
cess, multiple algorithms can be deployed to process the data and memory
consumption is dependent on the internal object structure of the algorithms.
Figure 5.7 illustrates the main-memory time-line of four consecutive experi-
ments on different caching schemes.
Figure 5.8 presents the average reduction in main-memory consumption of
the caching strategies used in benchmarking experiments. Figure 5.9 displays
the average execution time overhead of each caching strategy.
Full raw data are given in Appendix A.
Figure 5.7: Main-memory time-line
Figure 5.9: Execution Time Overhead (% of default in-memory caching)
Figure 5.8: Main-Memory Use Reduction (% of default in-memory caching)
Serialisation and compression mechanisms yield significant memory
footprint reductions at a small performance penalty (~5%). Additionally, as
depicted in Tables 5.8 and 5.9, memory footprints can be predicted under
those schemes, regardless of the data-type. Consequently, the experiments
indicate that serialising and compressing datasets is beneficial.
However, main-memory remains a scarce resource. In multi-tenant
environments, it is often necessary to operate with limited memory. In order
to determine performance under edge conditions, main-memory was limited
to 700MB (out of 28.2GB) per instance and the experiments were repeated.
The system, to the author's surprise, managed to use disk caching effectively
and successfully execute the requested tasks. This behaviour is achieved by
temporarily caching serialised objects to local disks.
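A simple decision rule consistent with these findings can be sketched as follows. This is illustrative Python; the strategy names are descriptive labels rather than Spark storage-level constants, and the rule itself is an assumption drawn from the measurements above.

```python
def choose_caching_strategy(dataset_bytes, free_memory_bytes):
    """Illustrative selection rule: serialised, compressed in-memory caching
    when the dataset fits (roughly 5% slower than plain objects but with a
    far smaller footprint); otherwise spill serialised partitions to local
    disk, at roughly 25% execution-time overhead."""
    if dataset_bytes <= free_memory_bytes:
        return "memory-serialised-compressed"
    return "memory-and-disk-serialised"

strategy = choose_caching_strategy(5 * 2**30, 28 * 2**30)  # 5GB data, 28GB free
```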
Disk caching on limited memory requires a disk-write for every partition of
the dataset. Due to disk latencies, disk caching leads to an execution time
overhead which was measured at 25% on average. This behaviour was con-
sistent across all experiments.
Although this practice may be appealing in constrained environments,
observing the persistent storage access patterns reveals another implication.
Figure 5.10 depicts the average per-instance disk writes in bytes.
Each instance processed on average 5GB of data during the experiments.
Figure 5.10: Average per-instance disk writes
The disk caching strategy temporarily stores an equivalent volume of
serialised objects in persistent storage. This practice requires the free disk
space to be at
least as large as the total size of the dataset. Where instances are also used to
host distributed databases or other large files, it might be infeasible to allocate
this amount of free space.
Spark offers the option to avoid disk caching and recompute each RDD
partition when needed. This feature was found to be unstable in the
constrained-memory experiments, as many of the submitted tasks failed with
insufficient-memory exceptions. This is attributed to the garbage collector's
inability to evict unused objects faster than the requested memory allocations.
Comparisons can be drawn against competing systems at this point.
GraphLab and Piccolo assume that full in-memory caching is possible. Neither
system possesses a mechanism to tackle main-memory shortages. Although,
as discussed earlier, most workloads can be stored in-memory, this practice is
dangerous in production environments.
Hadoop had consistent main-memory requirements throughout the
experiments. Due to Weka's in-memory data representation, it was necessary
to allocate 2GB of main-memory per process. This corresponds to 50% of the
cluster's memory. The figure can be decreased by requesting smaller disk
blocks (at a performance penalty). However, main-memory usage cannot be
adjusted to the needs of a specific workload, the advantages of caching cannot
be harnessed, and edge cases with limited memory cannot be handled.
The next sub-section provides a deeper analysis of the system's
Input/Output (IO) utilisation.
5.3.4 IO Utilisation
Spark schedules task execution within the cluster based on data placement in-
formation from HDFS. Spark implements a delay scheduling algorithm [79],
which delays task execution for short intervals in order to seek opportunities
to schedule nodes to process local partitions rather than remote ones. If the
period expires and the node that owns the partition is unavailable, the partition
is shipped through the network to an idle node. Zaharia et al. [79] report that
this technique can achieve nearly 100% locality when scheduling batch tasks.
EBS volumes are network-attached drives and use the network interface
of the attached instance. In order to measure network traffic accurately, the
system was forced to use the physically attached ephemeral storage during
network benchmarking.
The application layer was designed to avoid submitting data through the
network, and Spark schedules tasks based on locality. Thus, network
utilisation was expected to be minimal. In practice, the experimental results
demonstrate a spike in network usage during the initialisation phase. Figure
5.11 illustrates the aggregated network traffic across the cluster on three
consecutive medium-scale experiments. The spike occurs during the Header
creation phase, which is common to all algorithms in Weka-on-Spark. The
sum across all instances equals 87.5% of the dataset size.
This behaviour is attributed to the mechanism that Spark uses to achieve
balanced loads across the instances. The default partitioning strategy retrieves
the blocks from HDFS and transforms each block into a number of arrays of
string values (each array represents an RDD partition). Each array has a
hash-code, and Spark distributes the generated partitions to the instances
using a hash partitioner. This procedure achieves perfect load balancing but
requires the nodes to submit the majority of the dataset through the network.
Figure 5.11: Network Traffic
After this procedure is completed, Spark was tuned to schedule
transformations based on RDD placements, and perfect locality was achieved.
The Map phase of the execution model produces trained Weka models, which
are less than 50KB on average. Consequently, network traffic is minimal
during multi-stage processing tasks.
In order to determine whether Spark's load balancing strategy can cause a
network bottleneck, it is necessary to analyse disk, network, CPU and
memory metrics. Figure 5.12 plots the network and disk usage of the
aforementioned experiment on a per-instance basis.
Figure 5.12: Per-instance average of network and disk utilisation
Each instance retrieves on average 2.5GB of data from persistent storage,
builds a number of RDD partitions (either user-defined or system default) and
ships each partition to the instance that handles the range of its hash key.
During this experiment, eight 4-core instances were used. Consequently,
assuming a uniform distribution of hash keys, each instance would need to
submit 87.5% of the retrieved data and receive an equivalent amount from
remote nodes. This traffic requires a minimum of 291Mbits/s of network
bandwidth. The network bandwidth for this particular instance type was
experimentally measured (using the IPerf [80] benchmark) at 1.08Gbits/s.
These calculations indicate that network bandwidth was not saturated.
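The 291Mbits/s figure can be reproduced from the per-instance traffic. The
~60-second initialisation window used below is a hypothetical value inferred
from the reported numbers, not a measured quantity, and the class name is
illustrative:

```java
// Back-of-envelope bandwidth requirement: each instance ships
// gbRead * remoteFraction gigabytes during the initialisation phase.
// The phase duration is a parameter; ~60 s is assumed in the test.
class BandwidthEstimate {
    static double requiredMbitsPerSecond(double gbRead, double remoteFraction,
                                         double phaseSeconds) {
        double gigabitsSent = gbRead * remoteFraction * 8;  // bytes -> bits
        return gigabitsSent * 1000 / phaseSeconds;          // Gbit -> Mbit
    }
}
```

With 2.5GB read per instance, an 87.5% remote fraction and a 60-second
phase, this gives roughly 292Mbits/s, well below the measured 1.08Gbits/s
capacity.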
Similar analysis can be applied to persistent storage utilisation. Each
instance retrieves data at a rate of 166.7Mbits/s. Disk read throughput was
measured at 2.4Gbits/s using the IOzone [81] benchmark. These calculations
indicate that disk bandwidth was not saturated either.
Finally, Figure 5.13 illustrates the CPU utilisation of the experiment.
Figure 5.13: CPU utilisation
CPU usage rapidly approaches saturation during the initialisation phase.
Therefore, the system is CPU-bound, and accelerating it further requires
adding more processing power. This is a positive outcome, as it shows that the
system's scaling efficiency approaches linearity at large scales.
Hadoop, on the other hand, demonstrated minimal network usage
(~1MB/s) throughout the experiments. Hadoop implements the same delay
scheduling algorithm as Spark. However, Hadoop does not possess a
main-memory abstraction, and any effort to perform load balancing would
require HDFS intervention. This, combined with Hadoop's slower
initialisation, would produce even larger start-up latencies.
It is important to emphasise that AWS offers a plethora of options as far as
I/O resources are concerned. The aforementioned benchmarks utilised
balanced instance types to demonstrate the system's versatility. In practice,
network-bound applications can benefit from instances with enhanced
networking (10Gbits/s) and disk-bound applications can benefit from
storage-optimised instances with RAID configurations of SSD drives.
Therefore, it is very important to thoroughly benchmark the Batch Execution
and Application Layers, locate performance bottlenecks and optimise the
underlying infrastructure.
5.4 Caching Strategy Selection Algorithm
As discussed extensively earlier, the system offers a plethora of caching
options and enables users to implement custom caching strategies. However,
this practice demands state-of-the-art knowledge of the underlying platform
and extensive experimental evaluation of the different options.
Spark's default caching option is to store Java objects directly in memory.
This practice fills the cache fraction of the executors until saturation and then
recomputes additional partitions on demand.
This method was found to be unstable in practice. In many cases, new
RDD partitions are allocated memory faster than the Garbage Collector (GC)
is able to discard older partitions. Consequently, memory leaks and tasks fail
with main-memory exceptions.
In order to tackle this issue, a custom strategy was implemented. This
strategy is triggered in cases where the user does not explicitly specify a cach-
ing mechanism. The selection process is illustrated in Figure 5.14.
This process uses the file's size on HDFS, the total cluster-wide memory,
the caching fraction of the executors and the maximum overhead as input
parameters. For large text-based files the overhead was experimentally
computed and in the worst-case scenario approaches 500%. The Apache
Spark documentation mentions the same worst-case overhead without
specifying a dataset type. Consequently, the algorithm uses this value as the
default, but allows users to specify the overhead parameter that best matches
their data.
Figure 5.14: Storage Level Selection Process
If the cluster-wide executor cache memory is enough to absorb the dataset
in the worst case, the default caching is used. Uncompressed objects are faster
to access and the CPU overhead of serialisation is avoided.
If the cache size merely approaches the dataset size, serialised objects are
preferred, as they demonstrate stable memory footprints equivalent to the
original file size. Kryo serialisation proved more efficient and is used as the
default option. This process introduces serialisation overhead but decreases
GC overhead and enables up to 5 times more data to be stored in memory.
Compression is additionally used to tackle cases where at least 50% of the
on-disk data can be cached. Compression introduces an additional CPU over-
head but further reduces the memory footprints by 50% compared to serialisa-
tion.
Finally, if the dataset is more than twice as large as the available cache
(and thus the compression mechanism cannot ensure that full caching is
possible), disk caching is used.
This strategy was implemented in a single Scala class and integrated into
the system. When a task is submitted, the input parameters are read from the
application context and the algorithm selects a Storage Level and decides on
the use of serialisation and compression automatically.
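The selection logic can be sketched as follows, using the thresholds
described above: a default worst-case memory overhead of 500%, a roughly
1x footprint for serialised data and a roughly 0.5x footprint with compression
added. Java is used here instead of the Scala of the actual implementation,
and the enum and class names are illustrative, not the real class:

```java
// Illustrative sketch of the storage-level selection described above.
// Names do not correspond to the dissertation's actual Scala class.
enum Storage {
    MEMORY_DESERIALISED,            // plain Java objects
    MEMORY_SERIALISED,              // Kryo-serialised objects
    MEMORY_SERIALISED_COMPRESSED,   // serialised and compressed
    MEMORY_AND_DISK                 // spill overflow to disk
}

class StorageLevelSelector {
    static Storage select(double fileSizeGB, double clusterMemoryGB,
                          double cacheFraction, double overheadFactor) {
        double cacheGB = clusterMemoryGB * cacheFraction;
        if (cacheGB >= fileSizeGB * overheadFactor)
            return Storage.MEMORY_DESERIALISED;          // worst case still fits
        if (cacheGB >= fileSizeGB)
            return Storage.MEMORY_SERIALISED;            // serialised footprint fits
        if (cacheGB >= 0.5 * fileSizeGB)
            return Storage.MEMORY_SERIALISED_COMPRESSED; // compressed footprint fits
        return Storage.MEMORY_AND_DISK;                  // dataset > 2x cache
    }
}
```

For the 20GB experimental workload described below, a 15GB cache selects
compressed serialisation and a 2GB cache falls through to disk caching.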
The proposed strategy was tested against the default caching strategy in a
number of experiments. The workload consisted of a 20GB dataset and the
FP-Growth algorithm on an 8-core cluster. The experiment was repeated using
one thousand (1K) and four thousand (4K) partitions, with 15GB and 2GB of
cache memory.
Table 5.10 displays the elapsed execution times and Table 5.11 the
percentage of tasks that failed across the experiments.
Table 5.10: Execution Times
Table 5.11: Failed Tasks
The custom strategy decreases execution times by up to 25% and
eradicates task failures caused by insufficient main memory. The following
paragraphs explain this behaviour.
In the default strategy, objects are cached deserialised (as uncompressed
Java object structures). This forces the GC to recursively traverse the object
hierarchy before evicting unreferenced objects. Each time memory runs out, a
set of old partitions is garbage collected and a new set is fetched from disk.
As the size of the cache memory decreases, this procedure is repeated more
frequently. If this process is slower than the memory allocations, tasks fail
with main-memory exceptions. Larger partitions contain larger object
hierarchies and thus increase GC overhead (the GC has a larger workload
when it is triggered). This is why larger partitions demonstrated increased
failure rates in the experiments that used the default strategy.
The custom strategy achieves better performance by decreasing both the
GC overhead and the frequency of this procedure. The selection algorithm
assesses the available memory and determines whether the partition
replacement mechanism would be triggered for the given dataset. If so, the
algorithm activates and configures the serialisation and compression
mechanisms.
Serialised and compressed objects are represented as arrays of bytes. The
GC discards each such object as a single entity, regardless of the number of
objects it encapsulates. Thus, the cost of searching object hierarchies for
unused objects is avoided. Additionally, these objects consume up to 10 times
less memory. This enables up to 10 times more data to be stored in memory
and significantly decreases the frequency of the partition replacement
mechanism. However, this process introduces CPU overheads due to
serialisation and compression.
Experimental results demonstrate that the cost of serialisation is lower than
the cost of partition replacement. Additionally, reducing the GC overhead
achieves 100% task completion even in cases where the default strategy fails
repeatedly and aborts the execution.
5.5 Summary
This chapter evaluated the implemented multi-tier architecture from multiple
angles. The evaluation assessed the achievement of the project’s objectives, as
described in Section 1.3.
The system outperformed Weka-on-Hadoop by a factor of 2.36 on average
and by up to four times at small scales. Even in cases where full dataset
caching was not possible, the system sped up computations by a factor of 1.7.
The system met the scalability required by Big Data volumes. Both weak
and strong scaling demonstrated near-linear performance on workloads that
emulated the industry-level volumes used in Data Mining tasks. It is therefore
possible to re-use existing sequential solutions in a distributed context. Com-
parison with other systems in the surveyed literature demonstrates that the
system's scalability is on par with the state-of-the-art in the field.
Multiple caching strategies have been experimentally evaluated. The
analysis demonstrates that serialisation and compression mechanisms are able
to significantly decrease memory footprints with a small performance penalty
(approximately 5%). Additionally, disk caching behaviour was assessed and it
was determined that it can achieve task completion at a reasonable
performance penalty (approximately 25% on average), even in cases where
main memory is extremely constrained. However, the measured disk caching
overheads prohibit the processing of datasets that approach the total disk
capacity in size.
The system takes advantage of the network bandwidth during initialisation
to efficiently distribute the load across the cluster instances. This practice
achieves load-balanced instances without saturating the network, even on
instances with moderate networking capabilities.
The default caching strategy of the Batch Execution Layer was found to be
inefficient. Analysis of the results produced significant insights into the
trade-offs of different caching strategies, and a caching strategy selection
algorithm was proposed. This algorithm was found to decrease execution
times by up to 25% compared to the default strategy and to decrease the risk
of main-memory exceptions. This behaviour is attributed to reductions in
both the garbage collection overhead and the garbage collection frequency.
6 Concluding remarks
This chapter provides a summary of the work conducted in this project
(6.1), overviews future expansions and open research issues in the area (6.2)
and closes with a conclusion (6.3).
6.1 Summary
BDM is an increasingly important field, with multiple industrial and academic
applications. The exponential rate of data generation indicates that BDM
problems will continue to be challenging in the future. Consequently, the sci-
entific community needs to continue investing significant time and effort to-
wards the implementation of novel approaches.
The literature survey has focused on three different scientific fields: Data
Mining, Distributed Computing Frameworks and Distributed Data Mining.
Data Mining frameworks, such as Weka, offer a plethora of powerful al-
gorithms, but they are designed for sequential execution and fail to cope with
large data volumes. Established Distributed Computing Frameworks, such as
Hadoop, have a proven ability to tackle Big Data problems, but often demon-
strate poor performance in Data Mining workloads. Emerging platforms, such
as Spark and GraphLab, provide improved solutions by focusing on the short-
comings of Hadoop in Data Mining applications.
This project has presented a scalable multi-tier architecture based on Weka,
Spark, HDFS and AWS. Weka's algorithms have been encapsulated in Map
and Reduce wrapper classes. By submitting these classes to Spark's in-
memory batch processing engine, multiple HDFS partitions can be processed
in parallel. The elastic nature of AWS enables the system to dynamically
adjust to very large volumes. The architecture was evaluated through a series
of experiments on AWS, using multiple configurations.
Benchmarking results demonstrate that the proposed solution is faster by a
factor of 2.36 than the equivalent system on Hadoop. The system achieves
near-linear scaling and manages the cluster's main-memory and network re-
sources efficiently. The experiments were conducted at scales comparable to
the majority of industry-level workloads.
Thus, the implemented architecture and programming model would appear
to be a viable solution to modern BDM problems. Various caching strategies
were evaluated. The results demonstrated that serialisation and compression
mechanisms can greatly improve memory efficiency with marginal perform-
ance overheads. Furthermore, it was determined that Spark's caching mechan-
isms are able to tackle main-memory shortages effectively, with relatively
small performance penalties. Finally, a mechanism to automatically select a
caching strategy was implemented. The mechanism was found to decrease
execution times by up to 25% compared to the default mechanism and to
eradicate task failures.
The aforementioned achievements demonstrate that the objectives set in
Section 1.3 have been met.
6.2 Further Work
The following sections identify three areas for extension: Clustering
(6.2.1), Stream Processing (6.2.2) and Declarative Data Mining (6.2.3).
6.2.1 Clustering
As explained in Section 4.6, clustering was not implemented in this work be-
cause of Weka's limitations. Therefore, an area of further work is the imple-
mentation of an execution model specifically targeting clustering problems.
Towards that goal, three different approaches could be explored.
The first approach could build on top of the existing implementation of
cluster Canopies. Canopies define regions based on a cheap measure of simil-
arity. Members of different regions are assumed to belong to different clusters.
Thus, Canopies partition the dataset into regions, each of which contains a
number of clusters. If the regions are small enough, Map functions could build
clusterings on different regions in parallel. Clusters in different regions are as-
sumed to be non-overlapping and thus, the concatenation of the region clusters
could produce the final clustering.
This approach poses a number of challenges. The Canopy algorithm
does not guarantee that the dataset will be partitioned into a large number of
regions. A small number of regions entails dataset partitions that may be too
large for a single instance to process. Additionally, region volumes are not
guaranteed to be balanced. Developing and evaluating methods to leverage
Canopies in distributed clustering problems is an interesting challenge.
A second approach could focus on implementing distributed versions of
specific clustering algorithms. Map tasks could be used to compute distances
from cluster centres, assign data points to clusters and update the per-partition
cluster centres. Reduce tasks could aggregate the cluster centre positions
across all partitions, update the position of each centre and assess the stop
condition. If the condition is not met, a new iteration should begin. This model
has been successfully applied to the K-Means algorithm [82]. Further work
could focus on exploring the feasibility of generalising this model to other cat-
egories of clustering algorithms. An additional challenge would be to modify
Weka's code-base to apply to this model.
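One iteration of this Map/Reduce formulation can be sketched as follows.
This is an illustrative standalone version using squared Euclidean distance,
not Weka or Spark code, and all names are hypothetical:

```java
import java.util.*;

// Illustrative sketch of one distributed K-Means iteration in the
// Map/Reduce style described above.
class KMeansIteration {
    // Map phase: assign each local point to its nearest centre and emit
    // per-centre partial sums (sum vector plus a count in the last slot).
    static Map<Integer, double[]> mapPartition(List<double[]> points,
                                               double[][] centres) {
        Map<Integer, double[]> partials = new HashMap<>();
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centres.length; c++) {
                double d = 0;
                for (int i = 0; i < p.length; i++) {
                    double diff = p[i] - centres[c][i];
                    d += diff * diff;               // squared Euclidean distance
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            double[] acc = partials.computeIfAbsent(best,
                    k -> new double[p.length + 1]);
            for (int i = 0; i < p.length; i++) acc[i] += p[i];
            acc[p.length] += 1;                     // point count
        }
        return partials;
    }

    // Reduce phase: merge partial sums across partitions and recompute
    // each centre as the mean of its assigned points.
    static double[][] reduce(List<Map<Integer, double[]>> allPartials,
                             int k, int dim) {
        double[][] sums = new double[k][dim + 1];
        for (Map<Integer, double[]> part : allPartials)
            for (Map.Entry<Integer, double[]> e : part.entrySet())
                for (int i = 0; i <= dim; i++)
                    sums[e.getKey()][i] += e.getValue()[i];
        double[][] newCentres = new double[k][dim];
        for (int c = 0; c < k; c++)
            if (sums[c][dim] > 0)
                for (int i = 0; i < dim; i++)
                    newCentres[c][i] = sums[c][i] / sums[c][dim];
        return newCentres;
    }
}
```

The driver would repeat these two phases until the centres stop moving,
which is the stop condition assessed by the Reduce tasks in the text.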
A third approach could investigate the consensus clustering literature in or-
der to identify potential solutions. Although the problem is known to be NP-
Complete [28], multiple heuristics offer performance guarantees [27].
Aggarwal et al. [83] indicate that the consensus clustering problem is
equivalent to a symmetric Non-Negative Matrix Factorisation problem [84].
Work by Xie et al. [85] provides two fast parallel methods and demonstrates
their efficiency in document clustering problems. Further work could focus
on comparing the relative merits of different approaches and provide a
concrete implementation on top of Spark.
6.2.2 Stream Processing
The velocity aspect of Big Data demands advanced stream processing mecha-
nisms. Emerging data-types, such as sensor feeds and social data, are gener-
ated in large volumes and received in real time. These data-types usually con-
sist of groups of messages that are at the peak of their value at the time of gen-
eration. For example, enterprise server logs require real-time processing to en-
able early identification of potential errors.
The proposed architecture focuses on batch processing of large amounts of
on-disk data. Low-latency stream processing would require replacing the
Batch Execution Layer with a framework able to support low-latency
scheduling and execution. To test the feasibility of this modification using
the existing execution model, Spark was replaced with Spark Streaming [86]
and a classification experiment was conducted.
In Spark Streaming, the input stream is split into small batches and each
batch is distributed to a cluster instance. It was found that the system achieved
1ms scheduling latency and 20ms average execution time for 500KB mi-
cro-batches. Although the system may theoretically process 200MB/s (assum-
ing an 8-core instance), data receipt rapidly becomes a bottleneck.
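The quoted throughput ceiling follows directly from these figures. A
back-of-envelope check, assuming decimal units and an illustrative class
name:

```java
// Back-of-envelope throughput ceiling for micro-batch processing:
// one core finishes a batch every execMs milliseconds, and cores
// work on independent batches in parallel (decimal MB assumed).
class StreamingThroughput {
    static double mbPerSecond(double batchKB, double execMs, int cores) {
        double batchMB = batchKB / 1000.0;
        double batchesPerSecondPerCore = 1000.0 / execMs;
        return batchMB * batchesPerSecondPerCore * cores;
    }
}
```

With 500KB batches, 20ms execution time and 8 cores this gives the
200MB/s ceiling mentioned above; in practice, data receipt saturates well
before the processing cores do.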
Further work could focus on discovering optimal stream partitioning and
distribution techniques. This should enable high degrees of parallelism and ef-
ficient resource utilisation. A popular approach [87] splits the stream of
messages into topics and builds topic-specific sub-streams. The new streams
are forwarded to cluster instances for further processing.
Depending on the workload, different approaches yield different benefits.
Thus, a framework targeting the automation of stream partitioning and pro-
cessing would form an interesting investigation.
Another direction could focus on identifying the relative merits of different
stream processing paradigms. Spark Streaming processes streams of data by
grouping multiple small objects into batches and scheduling the processing on
Spark clusters. A different approach employs an event processing methodol-
ogy where each object is processed independently as it arrives. This approach
is implemented in the Storm [88] stream processing engine.
The execution model is agnostic to the number of input objects. A Map
function will build a Weka model either on a single object (event) or on a
large number of objects. This feature indicates that, after the appropriate
modifications, the Application Layer could be ported to either Spark
Streaming or Storm and be used to evaluate the two approaches.
6.2.3 Declarative Data Mining
An emerging trend in BDM investigates the application of query optimisation
techniques on Data Mining applications. Traditional query optimisation parses
a declarative query, which describes the requested outcome, and then automat-
ically selects a near optimal execution strategy. A similar methodology can be
applied to Data Mining workloads.
Each Data Mining task can be executed using a large number of different
algorithms. These solutions differ in performance, accuracy and complexity.
Selecting an optimal algorithm requires state-of-the-art expertise and multiple
experiments. Research projects, such as MLBase [60], investigate the feasibil-
ity of automatically selecting an optimal algorithm.
Weka offers multiple different options for each category of Data Mining
methods. Therefore, a topic of further work could focus on implementing a
mechanism to explore these options and automatically select a near-optimal
solution. Two different approaches could be used to tackle this problem.
Firstly, WekaMetal [89] offers a ranking system which tries to predict the
performance (accuracy and execution time) of Weka's algorithms on a dataset.
This ranking is produced by using a knowledge base of benchmarks. The data-
set is analysed to determine its similarity with the benchmark datasets. The
benchmarks have been tested with the whole suite of Weka's algorithms and
their performance is known. WekaMetal selects the algorithm with the best
performance in a dataset similar to the input dataset.
WekaMetal could be integrated into the system and select an appropriate
algorithm automatically. This feature could be engineered either by imple-
menting a distributed version of WekaMetal's ranking system or by using
sampling methods to generate a dataset sample. The first method would re-
quire an additional MapReduce task over the dataset. The second method
would be faster to implement and execute, but may lack precision.
A second approach could be to produce a ranking directly from the dataset.
This could be achieved by using the Header's statistics to produce stratified
samples. Map functions could be used to build multiple models on those
samples in parallel. A Reduce function could rank those models on various
criteria and select the highest.
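The Reduce-side selection amounts to a simple arg-max over the per-sample
model scores. A minimal sketch, using accuracy as the single ranking
criterion for brevity (the text envisages several); all names are hypothetical:

```java
import java.util.Map;

// Hypothetical sketch of the Reduce-side ranking described above: each
// Map task emits an (algorithm, accuracy) pair for the model it built
// on one stratified sample, and the Reduce step keeps the best scorer.
class ModelRanker {
    static String selectBest(Map<String, Double> accuracyByAlgorithm) {
        return accuracyByAlgorithm.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```

A multi-criteria version would replace the comparator with a weighted score
combining accuracy, execution time and model complexity.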
Implementing and evaluating these approaches, as well as proposing
others, would be an interesting contribution. The end result could be
integrated into Weka-on-Spark and provide a declarative interface to the
system.
6.3 Conclusion
BDM is an exciting emerging field which will play a very important role in
the era of pervasive computing. Different aspects of Big Data pose different
challenges and a consensus gold standard has yet to emerge. This project
demonstrated an efficient methodology for extracting knowledge from large
data volumes. Main-memory caching has the potential to greatly improve
performance, given an educated caching strategy. This work suggests that in-memory
cluster computing provides the solid foundations upon which the next genera-
tion of Big Data Mining systems will be built.
References
[1] C. Lynch, “Big data: how do your data grow?", Nature 455 (7209) 28–29, 2008.
[2] J. Gantz, D. Reinsel, “The Digital Universe of Opportunities: Rich Data and the
Increasing Value of the Internet of Things,” IDC iView: IDC Analyze the
Future, 2014.
[3] M. A. Beyer, D. Laney. The importance of big data: A definition. Stamford, CT:
Gartner, 2012.
[4] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A. Byers,
“Big data: The next frontier for innovation, competition, and productivity,”
Technical report, McKinsey Global Institute, 2011.
[5] A. Fernandez, “Advanced Database Management Systems,” class notes for
COMP60731, School of Computer Science, University of Manchester, Novem-
ber 2013.
[6] M. Stonebraker , R. Cattell, “10 rules for scalable performance in 'simple opera-
tion' datastores”, Communications of the ACM, v.54 n.6, June 2011
[7] “Amazon Web Services ,” https://aws.amazon.com/ec2/ . Accessed August 14th,
2014.
[8] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large
Clusters,” in OSDI’04: Proceedings of the 6th conference on Symposium on
Opearting Systems Design & Implementation. Berkeley, CA, USA: USENIX
Association, 2004, pp. 137–150.
[9] A. Bialecki, M. Cafarella, D. Cutting and O. O'Malley. "Hadoop: A Framework
for Running Applications on Large Clusters Built of Commodity Hardware",
http://lucene.apache.org/hadoop/, 2005
[10] "Powered by Hadoop," http://wiki.apache.org/hadoop/PoweredBy/ . Accessed
August 14th, 2014.
[11] P. Russom, “Integrating Hadoop into Business Intelligence and Data Warehous-
ing,” TDWI Best Practices Report, 2013.
[12] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Y. Yu, G. Bradski, A. Y. Ng, and K. Oluko-
tun, "MapReduce for machine learning on multicore," in Advances in Neural In-
formation Processing Systems, 2007.
[13] S. Sakr, A. Liu, and A. G. Fayoumi, "The family of mapreduce and large-scale
data processing systems," in ACM Comput. Surv. 46, 1, Article 11, 2013.
[14] M. Zaharia et al., “Resilient distributed datasets: A fault-tolerant abstraction for
in-memory cluster computing,” in Proceedings of the 9th USENIX Conference
on Networked Systems Design and Implementation, ser. NSDI’12. Berkeley,
CA, USA: USENIX Association, 2012.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and H. Witten. “The
WEKA data mining software: an update,” in SIGKDD Explor. Newsl., 11(1):10–
18, 2009.
[16] R Core Team, R:” A Language and Environment for Statistical Computing”, R
Foundation for Statistical Computing, Vienna, Austria, 2013. [Online].
Available: http://www.R-project.org/ . Accessed August 14th, 2014.
[17] D. Smith,"R Tops Data Mining Software Poll," in Java Developers Journal, May
31, 2012.
[18] K. Shvachko, Hairong Kuang, S. Radia and R Chansler, “The Hadoop Distrib-
uted File System,” Mass Storage Systems and Technologies (MSST), 2010 IEEE
26th Symposium 3-7 May 2010
[19] M. Odersky, L. Spoon and B. Venners. Programming in Scala. Artima Inc, 2008.
[20] "Measuring Parallel Scaling Performance,"
https://www.sharcnet.ca/help/index.php/Measuring_Parallel_Scaling_Perform
ance . Accessed August 17th, 2014.
[21] N Sawant, H Shah, “Big Data Application Architecture Q & A”, Springer, 2013
[22] A.Y. Ng. “Machine Learning,” class notes, Coursera [Online]:
http://class.coursera.org/ml-003/lecture/5. Accessed May 6th, 2014.
[23] T.Dietterich, "Ensemble methods in machine learning." In Multiple classifier
systems, pp. 1-15. Springer Berlin Heidelberg, 2000.
[24] R. Vilalta, Y. Drissi, "A Perspective View and Survey of Meta-Learning,"
in Artificial Intelligence Review, Volume 18, Issue 2, pp 77-95, 2002.
[25] J. Kittler,R. H. Mohamad, PW Duin, J. Matas, "On combining classifiers." Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on 20, no. 3: 226-
239, 1998
[26] A.M. Bagirov, J. Ugon and D. Webb, “Fast modified global K-means algorithm
for incremental cluster construction,” Pattern Recognition, 44 (4),
pp. 866–876, 2011.
[27] A. Goder and V. Filkov. "Consensus Clustering Algorithms: Comparison and
Refinement," in ALENEX. Vol. 8, 2008.
[28] V. Filkov, "Integrating microarray data by consensus clustering". In Proceedings
of the 15th IEEE International Conference on Tools with Artificial Intelligence.:
418–426, 2003.
[29] R. Agrawal, R. Srikant. "Fast algorithms for mining association rules," in Pro-
ceedings of the 20th International Conference of Very Large Databases, VLDB.
Vol. 1215. 1994.
[30] R. Agrawal,J. C. Shafer, "Parallel mining of association rules." IEEE Transac-
tions on knowledge and Data Engineering 8, no. 6,p 962-969, 1996.
[31] “The Comprehensive R Archive Network,” http://CRAN.R-project.org/ . Ac-
cessed August 14th, 2014.
[32] W. N. Street and K. YongSeog, "A streaming ensemble algorithm (SEA) for
large-scale classification," in Proceedings of the seventh ACM SIGKDD interna-
tional conference on Knowledge discovery and data mining, ACM, 2001.
[33] D. P. Bertsekas and J. N. Tsitsiklis "Some aspects of parallel and distributed iter-
ative algorithms A survey", Automatica, vol. 27, no. 1, pp.3 -21 1991.
[34] J. Han, M. Ishii and H. Makino, "A Hadoop performance model for multi-rack
clusters," in Computer Science and Information Technology (CSIT), 2013 5th
International Conference on. IEEE, 2013.
[35] V. K. Vavilapalli, “Apache Hadoop YARN: Yet Another Resource Negotiator,”
In Proceedings of SOCC, 2013.
[36] “Hadoop YARN,” http://hortonworks.com/hadoop/yarn/ .Accessed May 6th,
2014
[37] Y. Bu, B. Howe, M. Balazinska, M. D. Ernst, “HaLoop: efficient iterative data
processing on large clusters,” Proceedings of the VLDB Endowment, v.3 n.1-2,
2010.
[38] R.Appuswamy, C. Gkantsidis, D. Narayanan, O. Hodson and A. Rowstron.
“Scale-up vs scale-out for Hadoop: time to rethink?” in Proceeding SOCC '13,
Article No20, 2013.
[39] S. Ji, W. Wang, C. Ye, J. Wei,Z. Liu,
"Constructing a data accessing layer for in-memory data grid, "
In Proceedings of the Fourth Asia-Pacific Symposium on Internetware,
Internetware '12, pages 15:1-15:7, USA, 2012.
[40] “InfiniSpan ,” http://infinispan.org/about/ ,Accessed August 14th, 2014.
[41] “HazelCast, “ http://hazelcast.com/products/hazelcast/ , Accessed August 14th,
2014.
[42] S. Shahrivari, "Beyond Batch Processing: Towards Real-Time and Streaming
Big Data, " arXiv preprint arXiv:1403.3375, 2014.
[43] R. Power, J. Li, “Piccolo: Building fast, distributed programs with partitioned
tables,” In Proceedings of OSDI, 2010.
[44] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein,
“Graphlab: A new framework for parallel machine learning,” in UAI, 2010.
[45] "Apache Spark, " http://spark.apache.org/ . Accessed August 14th, 2014.
[46] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica,
“Shark: SQL and rich analytics at scale,” In Proceedings of the ACM
SIGMOD Conference, pages 13-24, 2013.
[47] “MapReduce and Spark, “ http://vision.cloudera.com/mapreduce-spark/ , Ac-
cessed August 18th, 2014.
[48] “Apache Mahout,” http://mahout.apache.org/ , Accessed August 14th,2014.
[49] E. Sparks, A. Talwalkar, V. Smith, X. Pan, J. Gonzalez,T. Kraska, M. I. Jordan,.
M. J. Franklin, “MLI: An API for distributed machine learning,” in ICDM, 2013.
[50] Z. Prekopcsak, G. Makrai, T. Henk, C. Gaspar-Papanek, “Radoop: Analyzing
big data with rapidminer and hadoop,” In RCOMM, 2011.
[51] I. Mierswa , M. Wurst , R. Klinkenberg , M. Scholz and T. Euler, “YALE: rapid
prototyping for complex data mining tasks, “ in Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery and data mining,
August 20-23, 2006.
[52] S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, and J. McPherson.
“Ricardo: integrating R and Hadoop,” In SIGMOD, pages 987–998, 2010.
[53] K. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F.
Ozcan, and E. J. Shekita, “Jaql: A Scripting Language for Large Scale Semis-
tructured Data Analysis,” In PVLDB, 2011.
[54] “SparkR, “ http://amplab-extras.github.io/SparkR-pkg/ , Accessed August 14th,
2014.
[55] R. Gordon, “Essential JNI: Java Native Interface, “ Prentice Hall PTR, Upper
Saddle River, 1998.
[56] M. Perez,A. Sanchez,A. Herrero,V. Robles, and Pea, Jos, M., “Adapting the
Weka Data Mining Toolkit to a Grid Based Environment,” Advances in Web
Intelligence, pp. 492–497, 2005.
[57] S. Celis and D.R. Musicant, “Weka-parallel: machine learning in parallel,”
Technical report, Carleton College, CS TR, 2002.
[58] D. Talia, P. Trunfio, O. Verta, “Weka4WS: a WSRF-enabled Weka Toolkit for
Distributed Data Mining on Grids,” In Proceedings of the 9th European Confer-
ence on Principles and Practice of Knowledge Discovery in Databases, Porto,
Portugal, 2005.
[59] D. Wegener, M. Mock, D. Adranale, and S. Wrobel, “Toolkit-based high-perform-
ance data mining of large data on MapReduce clusters,” In Proceedings of ICDM
Workshops, pp. 296–301, 2009.
[60] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. J. Franklin, M. Jordan,
“MLbase: A distributed machine-learning system,” In Conf. on Innovative Data
Systems Research, 2013.
[61] “Amazon EBS,” http://aws.amazon.com/ebs/ . Accessed August 17th, 2014.
[62] “Amazon EC2 FAQs,” http://aws.amazon.com/ec2/faqs/ . Accessed August 17th,
2014.
[63] “Amazon Machine Images,” http://docs.aws.amazon.com/AWSEC2/latest/User
Guide/AMIs.html . Accessed August 17th, 2014.
[64] “HDFS Architecture,” http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html .
Accessed August 26th, 2014.
[65] “Kryo,” https://github.com/EsotericSoftware/kryo . Accessed August 17th, 2014.
[66] “Spark Programming Guide,” http://spark.apache.org/docs/latest/programming-
guide.html . Accessed August 17th, 2014.
[67] “Amazon CloudWatch,” http://aws.amazon.com/cloudwatch/ . Accessed August
17th, 2014.
[68] “Cron and Crontab usage,” http://www.pantz.org/software/cron/croninfo.html .
Accessed August 17th, 2014.
[69] M. Hall, “Weka and Hadoop” blog, 15 October 2013;
http://markahall.blogspot.co.uk/2013/10/weka-and-hadoop-part-1.html .
Accessed August 17th, 2014.
[70] A. McCallum, K. Nigam, L. H. Ungar, “Efficient clustering of high-dimensional
data sets with application to reference matching,”
In Proceedings of the 6th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 169–178, 2000.
[71] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Disk-locality in
datacenter computing considered irrelevant,” In Proc. USENIX Workshop on
Hot Topics in Operating Syst. (HotOS), 2011.
[72] R. Appuswamy, C. Gkantsidis, D. Narayanan, O. Hodson, and A. Rowstron,
“Scale-up vs scale-out for Hadoop: time to rethink?” In Proceedings of SOCC '13,
Article No. 20, 2013.
[73] J. Dejun, G. Pierre, and C.-H. Chi, “Resource provisioning of web applications
in heterogeneous clouds,” In WebApps, 2011.
[74] A. Le-Quoc, M. Fiedler, C. Cabanilla, “The Top 5 AWS EC2 Performance Prob-
lems,” Datadog Inc., 2013.
[75] S. Jha, J. Qiu, A. Luckow, P. Mantha, G. C. Fox,
“A Tale of Two Data-Intensive Approaches: Applications, Architectures and
Infrastructure,” In 3rd International IEEE Congress on Big Data Application and
Experience Track, 2014.
[76] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir,
and M. Snir, “MPI: The Complete Reference,” MIT Press, 1998.
[77] “Machine Learning Repository,” https://archive.ics.uci.edu/ml/datasets.html .
Accessed August 17th, 2014.
[78] “Stanford Network Analysis Project,” http://snap.stanford.edu/ . Accessed Au-
gust 17th, 2014.
[79] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, I. Stoica,
“Delay scheduling: A simple technique for achieving locality and fairness in
cluster scheduling,” In EuroSys 10, 2010.
[80] “Iperf,” https://github.com/esnet/iperf/ . Accessed August 22nd, 2014.
[81] “Iozone Filesystem Benchmark,” http://www.iozone.org/ . Accessed August 22nd,
2014.
[82] W. Zhao, H. Ma, Q. He, “Parallel K-Means Clustering
Based on MapReduce,” In Proceedings of the 1st International
Conference on Cloud Computing, December 1–4, 2009.
[83] C. Aggarwal, C. K. Reddy, “Data Clustering: Algorithms and Applications,”
CRC Press, 2011.
[84] C. Ding, X. He, H. D. Simon, “On the equivalence of nonnegative
matrix factorization and spectral clustering,” In Proceedings of the SIAM
International Conference on Data Mining, 2005.
[85] Z. S. He, S. L. Xie, R. Zdunek, G. X. Zhou, A. Cichocki, “Symmetric nonneg-
ative matrix factorization: Algorithms and applications to probabilistic cluster-
ing,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2117–2131, 2011.
[86] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized
streams: fault-tolerant streaming computation at scale,” In SOSP, 2013.
[87] J. Kreps, N. Narkhede, and J. Rao, “Kafka: A distributed messaging system for
log processing,” In NetDB, 2011.
[88] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J.
Jackson, “Storm@twitter,” In Proceedings of the 2014 ACM SIGMOD Interna-
tional Conference on Management of Data, pp. 147–156, ACM, 2014.
[89] “WekaMetal,” http://www.cs.bris.ac.uk/Research/MachineLearning/wekametal/ .
Accessed August 18th, 2014.
Appendix 1 – Benchmarking Data
Appendix 1 contains the bulk of benchmarking data. Tables 1 and 2 display
the execution times for the Linear Regression and FP-Growth experiments.
Linear Regression (sec)    5GB    20GB    80GB
8 Cores                     86     336    1340
32 Cores                    39      90     352
128 Cores                   24      39      94
Appendix Table 1: Execution Times for Linear Regression
FP-Growth (sec)            5GB    20GB    80GB
8 Cores                    359    1437    5613
32 Cores                   100     379    1440
128 Cores                   34     104     376
Appendix Table 2: Execution Times for FP-Growth
Tables 3 and 4 display the average memory utilisation of the Weka-on-Spark
benchmarks.
Appendix Table 3: Memory Utilisation Part I
Appendix Table 4: Memory Utilisation Part II
Tables 5 and 6 display the average per-instance Network Traffic of the
Weka-on-Spark benchmarks.
Appendix Table 5: Network Traffic Part I
Appendix Table 6: Network Traffic Part II
Appendix 2 – Installation Guide
Installing the system on AWS requires a number of steps:
1. Download the latest release of Spark
(http://spark.apache.org/downloads.html ).
2. Unzip the package to a location and navigate to the /ec2 folder. Ensure
that Python 2.7 is installed and is the default Python interpreter. This
folder contains a Python script able to launch an AWS Spark cluster
according to a specification.
3. Export the ACCESS_ID and ACCESS_KEY associated with your
AWS account to the terminal window (or set-up environment vari-
ables). Ensure that the key-pair associated with your account is stored
locally.
4. In the /ec2 directory execute:
./spark-ec2 -k (key-pair name) -i (key-pair file path) -s
(num-of-slaves) -t (slave-instance type) -r (ec2-region)
launch (cluster-name)
These are the essential parameters, but the script supports multiple op-
tions; for the full documentation, see
http://spark.apache.org/docs/latest/ec2-scripts.html . Note that the
procedure is slow: launching the cluster may take up to half an hour.
When the process is complete, the script prints a confirmation message
in the terminal.
5. Login to the cluster. The easiest way is to type in the same terminal:
./spark-ec2 -k (key-pair name) -i (key pair path) login
(cluster name)
It is also possible to establish a direct SSH connection to the
Master node.
6. Once in the Master node, navigate to the /spark directory and download
the framework's uber JAR there. This can be achieved using multiple
methods, depending on where the executable is hosted.
This process assumes that the data are already on HDFS. At this point the sys-
tem is ready to accept user tasks. Appendix 3 (User Guide) provides details
about the submission scripts and the supported user options and Appendix 4
provides details on how to install the main-memory monitoring service on
CloudWatch.
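If the dataset is not yet on HDFS, it can be staged from the Master node. The sketch below assumes the default spark-ec2 layout, where an ephemeral HDFS installation lives under /root/ephemeral-hdfs; the path and dataset name are assumptions, not part of the original instructions. The commands are echoed for review rather than executed:

```shell
#!/bin/sh
# Assumed locations -- adjust for your cluster layout.
HADOOP=/root/ephemeral-hdfs/bin/hadoop
DATASET=dataset.csv
TARGET=/user/root/$DATASET

# Echo the staging commands so they can be reviewed before execution:
echo "$HADOOP fs -put $DATASET $TARGET"
echo "$HADOOP fs -ls /user/root"
```

Once the upload completes, the hdfs:// path of the dataset is what the submission script expects in its -hdfs-dataset-path option.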
Appendix 3 – User Guide
This guide assumes that the system is installed on AWS (or any other cluster),
that the data are stored on HDFS and that the reader possesses fundamental
knowledge of the Linux command line.
Navigate to the folder of the Spark installation and type:
bin/spark-submit --master <spark master address and port> \
--class uk.ac.manchester.ariskk.distributedWekaSpark.main.distributedWekaSpark \
--executor-memory <per instance memory> \
/root/spark/distributedWekaSpark-0.0.2-SNAPSHOT.jar \
-task <task to submit> \
-hdfs-dataset-path <path to hdfs> \
-num-of-partitions <> -num-of-attributes <> \
<an arbitrary number of supported options>
Double-dashed options are Spark options; single-dashed options are parsed
by the application. Double-dashed options must be submitted before the
path of the executable archive.
Any number of single-dashed options can be submitted, and their order is
irrelevant. The following list presents the supported options, their default
values and usage examples.
-task (task descriptor) possible values: buildHeaders, buildClassifier, buildClassifierEvaluation, buildFoldBasedClassifier, buildFoldBasedClassifierEvaluation, buildClusterer, findAssociationRules; default: none (Exception)
-dataset-type (dataset type to use) possible values: Instances, ArrayInstance, ArrayString; default: ArrayString
-caching (caching strategy to use) possible values: all Spark-supported strategies, e.g. MEMORY_ONLY; default: Caching Strategy Selection Algorithm
-compress (compress serialized RDDs to save memory) possible values: y (or none); default: none (do not compress)
-kryo (use the Kryo serializer) possible values: y (or none); default: none (do not use Kryo)
-caching-fraction (executor caching fraction) possible values: 0.1-1.0; default: 0.6
-overhead (dataset overhead in Java objects) possible values: any double; default: 5.0
-hdfs-dataset-path (path to the dataset on HDFS) possible values: hdfs://(host):(port)/user/username/dataset.csv; default: none (Exception)
-hdfs-names-path (path to a names file on HDFS; names must be in a comma-delimited format) possible values: hdfs://(host):(port)/user/username/names.txt; default: will try to compute att0, att1, etc.
-hdfs-headers-path (path to pre-built headers for the dataset) possible values: hdfs://(host):(port)/user/username/someheaders.arff; default: will try to compute headers
-hdfs-classifier-path (path to a trained classifier) possible values: hdfs://(host):(port)/user/username/someclassifier.model; default: will try to build the classifier
-hdfs-output-path (path to an HDFS folder where generated models will be saved) possible values: hdfs://(host):(port)/user/username/somefolder/; default: none (will not save)
-num-partitions (number of partitions) possible values: any integer; default: Spark default
-num-random-chunks (number of randomized/stratified partitions) possible values: any integer; default: no stratification/randomisation
-num-of-attributes (number of attributes in the dataset) possible values: any integer; default: will try to compute from the dataset if no headers are provided
-class-index (index of the class attribute) possible values: any integer in [0, num-of-attributes-1]; default: num-of-attributes-1
-num-folds (number of folds) possible values: any integer; default: 1
-names (attribute names) possible values: a comma-delimited list of names; default: will produce a list att0, att1, etc.
-classifier (the name and package path of the classifier to use) possible values: weka.classifiers.bayes.NaiveBayes (any Weka core classifier); default: NaiveBayes
-meta (meta learner to use) possible values: weka.classifiers.meta.Bagging (any Weka core meta learner); default: none (WekaClassifierReduceTask default)
-rule-learner (association rule learner to use) possible values: weka.associations.Apriori and weka.associations.FPGrowth only; default: FPGrowth
-clusterer (clusterer to use) possible values: weka.clusterers.Canopy only; default: Canopy
-num-clusters (number of clusters to find) possible values: any integer; default: 0 (the task will rely on the clusterer's auto-configuration, if supported)
-parser-options (options for the CSV parser) possible values: \"-N first-last\" etc.; default: Weka defaults based on the task. DO NOT FORGET \" \" when grouping parameters
-weka-options (options for Weka algorithms (base tasks and core algorithms)) possible values: \"-depth 3\" etc.; default: Weka defaults based on the task. DO NOT FORGET \" \" when grouping parameters
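Putting the template and the options above together, a hypothetical end-to-end invocation might look as follows: training a NaiveBayes classifier on a CSV dataset stored on HDFS. The master address, namenode host, memory setting and partition counts are placeholders; the command is echoed so it can be checked before being run on a live cluster:

```shell
#!/bin/sh
# Placeholder values -- substitute your own master URL and HDFS paths.
MASTER="spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077"
JAR="/root/spark/distributedWekaSpark-0.0.2-SNAPSHOT.jar"
DATA="hdfs://namenode:9000/user/root/dataset.csv"

# Echo the full submission command for inspection:
echo bin/spark-submit --master "$MASTER" \
  --class uk.ac.manchester.ariskk.distributedWekaSpark.main.distributedWekaSpark \
  --executor-memory 4g \
  "$JAR" \
  -task buildClassifier \
  -hdfs-dataset-path "$DATA" \
  -num-partitions 32 -num-of-attributes 10 \
  -classifier weka.classifiers.bayes.NaiveBayes
```

Note that the double-dashed Spark options precede the JAR path, while the application's single-dashed options follow it, in any order.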
I intend to continue supporting the project after my graduation; con-
sequently, this list may be updated in the future. The project will be released
as open source under the Apache License and will be available
through GitHub.
Appendix 4 – Main-Memory Monitoring using CloudWatch
CloudWatch monitors a variety of metrics, but there is no support for main-
memory usage monitoring by default. The following steps explain how a
main-memory monitoring daemon can be installed on Linux-based instances.
Amazon has released a number of Perl scripts to achieve this task. These
scripts must be downloaded and installed directly on the instances. This pro-
cedure requires a number of steps:
1. Establish a secure shell (SSH) connection to the instance by typing:
ssh -i (keypair path) root@(instance public DNS)
2. Once the secure connection is established copy and paste the following
commands in the terminal:
sudo yum install perl-Switch perl-Sys-Syslog perl-LWP-Protocol-https
wget http://ec2-downloads.s3.amazonaws.com/cloudwatch-samples/CloudWatchMonitoringScripts-v1.1.0.zip
unzip CloudWatchMonitoringScripts-v1.1.0.zip
rm CloudWatchMonitoringScripts-v1.1.0.zip
cd aws-scripts-mon
These commands will download the appropriate main-memory monitoring
scripts to the instance.
3. (Optional) Change the default text editor from Vim to nano:
export EDITOR=nano
4. Open the crontab (cron table):
crontab -e
The entries of this table are executed by the operating system at fixed
intervals.
5. Paste the following entries in the crontab:
* * * * * /root/aws-scripts-mon/mon-put-instance-data.pl --mem-util --aggregated=only --mem-used --mem-avail --aws-access-key-id=(your id) --aws-secret-key=(your secret key)
* * * * * /root/aws-scripts-mon/mon-put-instance-data.pl --mem-util --mem-used --mem-avail --aws-access-key-id=(your id) --aws-secret-key=(your secret key)
These entries will execute every minute and submit the instance's main-
memory utilisation statistics to CloudWatch.
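Once the entries are in place, their installation can be sanity-checked. The commands below are echoed rather than executed, since they only make sense on a live instance with AWS credentials configured; the --verify flag is, to the best of my knowledge, supported by Amazon's monitoring scripts as a dry-run mode, but should be confirmed against the script's own --help output:

```shell
#!/bin/sh
# Echo the suggested sanity checks (run them by hand on the instance):
echo "crontab -l | grep mon-put-instance-data.pl"
echo "/root/aws-scripts-mon/mon-put-instance-data.pl --mem-util --verify --verbose"
```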
At this point, the CloudWatch metrics console should display a new cat-
egory, Linux System Metrics, under which all the custom metrics can be
retrieved.