Big Data Mining:
Towards implementing Weka-on-Spark
A dissertation submitted to The University of Manchester for the degree of
Master of Science
in the Faculty of Engineering and Physical Sciences
2014
Koliopoulos Aris Kyriakos
School of Computer Science
List of Contents
List of Contents....................................................................................................... 2
List of Figures......................................................................................................... 5
List of Tables........................................................................................................... 7
List of Abbreviations............................................................................................... 8
Abstract................................................................................................................. 10
Declaration............................................................................................................ 12
Intellectual Property Statement............................................................................. 13
Acknowledgements............................................................................................... 14
1 Introduction......................................................................................................... 16
1.1 Distributed Computing Frameworks............................................................ 17
1.2 Data Mining Tools........................................................................................ 18
1.3 Project Objectives........................................................................................ 19
1.4 Implementation Summary............................................................................ 19
1.5 Evaluation Strategy...................................................................................... 21
1.6 Project Achievements................................................................................... 22
1.7 Overview of Dissertation............................................................................. 23
2 Literature Review................................................................................................ 24
2.1 Data Mining................................................................................................. 24
2.1.1 Classification......................................................................................... 24
2.1.2 Regression............................................................................................. 25
2.1.3 Clustering.............................................................................................. 26
2.1.4 Association Rule Learning.................................................................... 26
2.1.5 Data Mining System Development....................................................... 27
2.1.5.1 Toolkit Based Approaches............................................. 27
2.1.5.2 Statistical Language Based Approaches......................................... 28
2.1.5.3 Approach Selection........................................................................ 28
2.1.6 Partitioning and Parallel Performance..................................................29
2.2 Distributed Computing Frameworks............................................................ 30
2.2.1 MapReduce........................................................................................... 30
2.2.2 Hadoop.................................................................................................. 31
2.2.2.1 Beyond Hadoop.............................................................................. 32
2.2.3 Iterative MapReduce............................................................................. 33
2 Of 121
2.2.4 Distributed Systems for In-Memory Computations.............................. 34
2.2.4.1 In-Memory Data Grids................................................................... 34
2.2.4.2 Piccolo............................................................................................ 34
2.2.4.3 GraphLab........................................................................................ 35
2.2.4.4 Spark.............................................................................................. 37
2.2.5 Distributed Computing Framework Selection......................................39
2.3 Distributed Data Mining.............................................................................. 39
2.3.1 Data Mining on MapReduce................................................................. 40
2.3.2 R on MapReduce................................................................................... 41
2.3.3 Distributed Weka................................................................................... 43
2.3.4 MLBase................................................................................................. 44
2.4 Summary...................................................................................................... 46
3 System Architecture............................................................................................ 48
3.1 Required Architectural Components............................................................ 48
3.2 Multi-tier Architecture................................................................................. 48
3.2.1 Infrastructure Layer............................................................................... 49
3.2.2 Distributed Storage Layer..................................................................... 50
3.2.3 Batch Execution Layer.......................................................................... 52
3.2.3.1 Spark and Main-memory Caching................................................. 53
3.2.4 Application Layer.................................................................................. 54
3.2.5 CLI........................................................................................................ 55
3.3 Cluster Monitoring....................................................................................... 55
3.4 Summary...................................................................................................... 56
4 Execution Model................................................................................................. 57
4.1 Weka on MapReduce.................................................................................... 57
4.2 Task Initialisation......................................................................................... 58
4.3 Headers......................................................................................................... 61
4.4 Classification and Regression...................................................................... 63
4.4.1 Model Training...................................................................................... 63
4.4.2 Model Testing and Evaluation............................................................... 65
4.5 Association Rules......................................................................................... 66
4.6 Clustering..................................................................................................... 69
4.7 Summary...................................................................................................... 69
5 System Evaluation............................................................................................... 71
5.1 Evaluation Metrics....................................................................................... 71
5.2 System Configuration................................................................................... 72
5.3 Evaluation Results........................................................................................ 74
5.3.1 Execution Time..................................................................................... 74
5.3.2 Scaling Efficiency................................................................................. 79
5.3.2.1 Weak Scaling.................................................................................. 79
5.3.2.2 Strong Scaling................................................................................ 80
5.3.3 Main-Memory Caching......................................................................... 84
5.3.3.1 Caching overheads......................................................................... 84
5.3.3.2 Caching and Performance.............................................................. 86
5.3.4 IO Utilisation......................................................................................... 89
5.4 Caching Strategy Selection Algorithm......................................................... 92
5.5 Summary...................................................................................................... 96
6 Concluding remarks............................................................................................ 98
6.1 Summary...................................................................................................... 98
6.2 Further Work................................................................................................ 99
6.2.1 Clustering.............................................................................................. 99
6.2.2 Stream Processing............................................................................... 100
6.2.3 Declarative Data Mining..................................................................... 102
6.3 Conclusion.................................................................................................. 103
References........................................................................................................... 104
Appendix 1 – Benchmarking Data...................................................................... 111
Appendix 2 – Installation Guide.......................................................................... 115
Appendix 3 – User Guide.................................................................................... 117
Appendix 4 – Main-Memory Monitoring using CloudWatch............................. 120
Total Word Count: 23003
List of Figures
Figure 1.1: Cluster Architecture [5]...................................................................... 17
Figure 1.2: System Architecture............................................................................ 20
Figure 2.1: The Data Mining Process.................................................................... 24
Figure 2.2: Supervised Learning Process [22]...................................................... 25
Figure 2.3: MapReduce Execution Overview [8]................................................. 31
Figure 2.4: Hadoop Tech Stack [36]..................................................................... 32
Figure 2.5: HaLoop and MapReduce [35]............................................................ 33
Figure 2.6: GraphLab Consistency Mechanisms[44]............................................ 36
Figure 2.7: RDD Lineage Graph [14]................................................................... 37
Figure 2.8: Ricardo [52]........................................................................................ 42
Figure 2.9: Distributed Weka [59]......................................................................... 44
Figure 2.10: MLBase Architecture [60]................................................................ 45
Figure 3.1: System Architecture............................................................................ 49
Figure 3.2: HDFS Architecture [64]...................................................................... 51
Figure 3.3: Initialisation Process........................................................................... 53
Figure 4.1: Execution Model................................................................................. 58
Figure 4.2: WekaOnSpark's main thread................................................ 59
Figure 4.3: Task Executor..................................................................................... 60
Figure 4.4: Lineage Graph.................................................................................... 61
Figure 4.5: Header creation MapReduce job........................................................ 62
Figure 4.6: Header Creation Map Function.......................................................... 62
Figure 4.7: Header Creation Reduce Function...................................................... 62
Figure 4.8: Model Training Map Function............................................................ 64
Figure 4.9: Model Aggregation Reduce Function................................................. 65
Figure 4.10: Classifier Evaluation Map Function................................................. 66
Figure 4.11: Evaluation Reduce Function............................................................. 66
Figure 4.12: Association Rules job on Spark........................................................ 67
Figure 4.13: Candidate Generation Map Function................................................ 67
Figure 4.14: Candidate Generation and Validation Reduce Function...................68
Figure 4.15: Validation Phase Map Function........................................................ 68
Figure 5.1: Execution times for SVM................................................................... 75
Figure 5.2: Weak Scaling Efficiencies.................................................................. 80
Figure 5.3: Strong Scaling for SVM..................................................................... 81
Figure 5.4: Strong Scaling for Linear Regression................................................. 81
Figure 5.5: Strong Scaling for FP-Growth............................................................ 82
Figure 5.6: Strong Scaling on Weka-On-Hadoop.................................................83
Figure 5.7: Main-memory time-line....................................................... 86
Figure 5.8: Main-Memory Use Reduction............................................................ 87
Figure 5.9: Execution Time Overhead.................................................................. 87
Figure 5.10: Average per-instance disk writes...................................................... 88
Figure 5.11: Network Traffic................................................................................. 90
Figure 5.12: Per-instance average of network and disk utilisation.......................91
Figure 5.13: CPU utilisation................................................................................. 92
Figure 5.14: Storage Level Selection Process....................................................... 93
List of Tables
Table 5.1: Execution Times for SVM on Weka-On-Spark.................................... 74
Table 5.2: Execution Times for SVM on Weka-On-Hadoop................................74
Table 5.3: Speed-up............................................................................................... 76
Table 5.4: CPU Utilisation of Weka-On-Spark..................................................... 77
Table 5.5: CPU Utilisation of Weka-On-Hadoop.................................................. 77
Table 5.6: Main-memory utilisation of Weka-On-Spark....................................... 78
Table 5.7: Main-memory utilisation of Weka-On-Hadoop...................................78
Table 5.8: RDD size as percentage of the original on-disk value (I).................... 85
Table 5.9: RDD size as percentage of the original on-disk value (II)...................85
Table 5.10: Execution Times................................................................................. 96
Table 5.11: Failed Tasks........................................................................................ 97
List of Abbreviations
AMI – Amazon Machine Images
API – Application Programming Interface
AWS – Amazon Web Services
BDM – Big Data Mining
CLI – Command Line Interface
CPU – Central Processing Unit
EBS – Elastic Block Store
EC2 – Elastic Compute Cloud
ECU – EC2 Compute Unit
EMR – Elastic Map Reduce
GC – Garbage Collector
GUI – Graphical User Interface
HDFS – Hadoop Distributed File System
IMDG – In-Memory Data Grids
IO – Input/Output
JNI – Java Native Interface
JVM – Java Virtual Machines
MPI – Message Passing Interface
RDD – Resilient Distributed Datasets
SSD – Solid State Drives
SSH – Secure SHell
SVM – Support Vector Machines
VM – Virtual Machine
WEKA – Waikato Environment for Knowledge Analysis
YARN – Yet Another Resource Negotiator
Abstract
Data generation and collection across all domains are increasing exponentially.
Knowledge discovery and decision making demand the ability to process and
extract insights from “Big” Data in a scalable and efficient manner.
The traditional cluster-based Big Data platform Hadoop provides a scalable
solution but imposes performance overheads because it supports only on-disk
data. The Data Mining algorithms used in knowledge discovery usually require
multiple iterations over the dataset and thus multiple, slow disk accesses.
In contrast, modern clusters possess increasing amounts of main-memory, which
efficient caching mechanisms can exploit for performance.
Apache Spark is an innovative distributed computing framework that sup-
ports in-memory computations. The objective of this dissertation is to design
and develop a scalable Data Mining framework to run on top of Spark and to
identify and document the advantages and disadvantages of main-memory
caching on Data Mining workloads.
The workloads consisted of distributed implementations of Weka's Data
Mining algorithms. Benchmarking was performed by testing seven different
caching strategies on different workloads, measuring elapsed time and monit-
oring resource utilisation.
The project contributions are three-fold:
1. Design and development of a distributed Data Mining framework that
achieves near-linear scaling in executing Data Mining workloads in
parallel;
2. Analysis of the behaviour of distributed main-memory caching mech-
anisms on different Data Mining execution scenarios;
3. Design and development of an automated caching strategy selection
mechanism that assesses dataset and cluster characteristics and selects
an appropriate caching scheme.
The system was benchmarked using Linear Regression, Support Vector Ma-
chines and the FP-Growth algorithm on datasets from 5GB up to 80GB. The
results demonstrate that Weka-On-Spark outperforms Weka-On-Hadoop on
identical workloads by a factor of 2.36 on average and by up to four times at
small scale.
Weak scaling efficiency measures a parallel system's ability to efficiently
utilise increasing numbers of processing nodes. Average weak scaling effi-
ciency was measured at 91.4%, which is within 10% of an ideal parallel sys-
tem. Serialisation and compression were found to decrease main-memory util-
isation by 40% at only a 5% execution time penalty. Finally, the proposed
caching mechanism reduces execution times by up to 25% compared to the de-
fault strategy and diminishes the risk of main-memory exceptions.
Declaration
No portion of the work referred to in the dissertation has been submitted in
support of an application for another degree or qualification of this or any
other university or other institute of learning.
Intellectual Property Statement
1. The author of this dissertation (including any appendices and/or sched-
ules to this dissertation) owns certain copyright or related rights in it
(the “Copyright”) and s/he has given The University of Manchester
certain rights to use such Copyright, including for administrative pur-
poses.
2. Copies of this dissertation, either in full or in extracts and whether in
hard or electronic copy, may be made only in accordance with the
Copyright, Designs and Patents Act 1988 (as amended) and regulations
issued under it or, where appropriate, in accordance with licensing
agreements which the University has entered into. This page must
form part of any such copies made.
3. The ownership of certain Copyright, patents, designs, trade marks and
other intellectual property (the “Intellectual Property”) and any repro-
ductions of copyright works in the dissertation, for example graphs
and tables (“Reproductions”), which may be described in this disserta-
tion, may not be owned by the author and may be owned by third
parties. Such Intellectual Property and Reproductions cannot and must
not be made available for use without the prior written permission of
the owner(s) of the relevant Intellectual Property and/or Reproduc-
tions.
4. Further information on the conditions under which disclosure, public-
ation and commercialisation of this dissertation, the Copyright and any
Intellectual Property and/or Reproductions described in it may take
place is available in the University IP Policy (see http://documents.-
manchester.ac.uk/display.aspx?DocID=487), in any relevant Disserta-
tion restriction declarations deposited in the University Library, The
University Library’s regulations (see http://www.manchester.ac.uk/lib-
rary/aboutus/regulations) and in The University’s Guidance for the
Presentation of Dissertations.
Acknowledgements
I would like to thank my supervisor, Professor John A. Keane, for tirelessly
providing me with valuable advice, motivation, ideas and inspiration through-
out this project.
I would like to thank Dr. Firat Tekiner for his insights on building and eval-
uating an industry-level Big Data platform.
I would also like to thank Dr. Mark Hall for exchanging ideas on how to
implement distributed versions of Weka's algorithms.
Finally, I would like to thank my parents and my brother for their continu-
ous moral and financial support.
To my family
1 Introduction
Datasets across all domains are increasing in size exponentially [1]. Gantz et
al. [2] estimated that, in 2013, 4 zettabytes (10^21 bytes) of data were gener-
ated worldwide and expected this number to increase ten-fold by 2020. These
developments gave rise to the term “Big Data”.
According to Gartner in 2012, Big Data is “high-volume, high-velocity,
and/or high-variety information assets that require new forms of processing to
enable enhanced decision making, insight discovery and process optimiza-
tion” [3]. In practice, the term refers to datasets that are increasingly difficult
to collect, curate and process using traditional methodologies.
Another aspect driving the development of Big Data technologies forward is
the emerging trend of data-driven decision-making. For example,
McKinsey in 2012 [4] calculated that the health-care system in the USA could
save up to $300bn by better understanding of domain-specific data (clinical
trials, health insurance transactions, wearable sensors etc.). This trend requires
processing techniques to transform data to valuable insights. The field that ad-
dresses knowledge extraction from raw data is known as Data Mining.
The developments mentioned above are closely associated with the evolution
of distributed systems. Due to large volumes, processing is performed by or-
ganised clusters of computers. Proposed cluster architectures can be divided
into three major categories:
1. Shared-memory clusters: a global main-memory is shared between
processors by a fast interconnect.
2. Shared-disk clusters: an array of disks is accessible through the
network. Each processor has its own private memory.
3. Shared-nothing clusters: every node has a private set of resources.
Figure 1.1 (from Fernandez [5]) presents these architectures.
Shared-memory and shared-disk architectures have difficulty scaling
to large sizes [6]. Pioneered by Google, shared-nothing architectures have
dominated mainly because they can scale dynamically by adding more inex-
pensive nodes. Emerging cloud computing providers, such as AWS-EC2
(Amazon Web Services – Elastic Compute Cloud) [7], offer access to dynam-
ically configurable instances of these architectures on demand.
1.1 Distributed Computing Frameworks
Google in 2004 [8] introduced MapReduce, a distributed computing model
targeting large-scale processing in shared-nothing architectures. MapReduce
expresses computations using two operators (Map and Reduce), schedules
their execution in parallel on dataset partitions and guarantees fault-tolerance
through replication. The Map operator processes dataset partitions in parallel
and the Reduce operator aggregates the results.
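The two-operator model described above can be sketched in plain Java (a toy illustration of the paradigm, not Spark's or Hadoop's actual API): Map tasks process dataset partitions independently, and a binary, associative Reduce combines the partial results in any order.

```java
import java.util.Arrays;
import java.util.List;

public class MapReduceSketch {
    // Map: process one partition independently (here: count its records).
    static int mapCount(List<String> partition) {
        return partition.size();
    }

    public static void main(String[] args) {
        // A hypothetical dataset split into two partitions.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("rec1", "rec2", "rec3"),
                Arrays.asList("rec4", "rec5"));

        // Map phase runs one task per partition, potentially in parallel;
        // the Reduce phase aggregates with a binary, associative operator.
        int total = partitions.parallelStream()
                              .map(MapReduceSketch::mapCount)
                              .reduce(0, Integer::sum);

        System.out.println(total); // prints 5
    }
}
```

Because the reduce operator is associative, the runtime is free to combine partial results in whatever order tasks happen to complete, which is what makes the model safe to parallelise across a shared-nothing cluster.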
Yahoo in 2005 introduced Hadoop [9], an open source implementation of
MapReduce. Many institutions adopted Hadoop [10] and many others plan
Hadoop integration in the near future [11].
Although MapReduce can express many Data Mining algorithms effi-
ciently [12], a significant performance improvement is possible by introducing
a loop-aware scheduler and main-memory caching. The Data Mining al-
gorithms used in knowledge discovery usually require multiple iterations over
the dataset and thus, multiple, slow, disk accesses. Due to this iterative nature
of many Data Mining algorithms, storing and retaining datasets in-memory
and scheduling successive iterations to the same nodes may yield significant
benefits. Modern clusters possess increasing amounts of main-memory that
can provide performance benefits by efficiently using main-memory caching
mechanisms.

Figure 1.1: Cluster Architecture [5]
Out of many possible options [13], this project has utilised Apache Spark
[14] as the target platform. Spark supports main-memory caching and pos-
sesses a loop-aware scheduler. Additionally, Spark implements the MapRe-
duce paradigm and it is Java-based. These features enable users to deploy ex-
isting Hadoop application logic in Spark. Spark outperforms Hadoop by up to
two orders of magnitude in many cases [14].
1.2 Data Mining Tools
Data Mining tools can be divided into two major categories:
• Data Mining Toolkits: Libraries of data mining algorithms that can be
invoked against datasets either via Command Line Interfaces (CLI) or
Graphical User Interfaces (GUI).
• Languages for statistical computing: Special purpose programming
languages that incorporate data mining primitives and simplify al-
gorithmic development.
Weka and R are influential exemplars of the respective categories [17].
Weka incorporates libraries that cover all major categories of Data Mining
algorithms and has been under development by the open-source community
for more than 20 years [15]. However, it was developed targeting sequential
single-node execution and thus, it is not suitable for distributed environments.
The data volumes Weka can handle are limited by the heap memory of a single
node, which cannot exceed single-digit gigabytes. Thus, sequential
Weka is not suitable for modern large-scale datasets.
R [16] is a programming language designed for statistical computing. It in-
corporates essential data mining components such as linear and non-linear
models, statistical testing and classification among others. R also provides a
graphical environment for results visualisation. Although it is possibly the
most popular tool [17] in the field, it demonstrates the same shortcomings as
Weka on large-scale datasets.
Weka was selected because it is written in Java (R is C-based) and is thus
natively supported by Spark's Java-based execution environment. Addi-
tionally, Weka exposes an easy-to-use interface, and the implemented
framework can address the needs of both novice and expert users.
1.3 Project Objectives
This project has two main objectives:
1. An implementation of a scalable and efficient distributed Data Mining
framework using Weka and Spark. It will thus investigate the re-use of
existing sequential algorithms in a distributed context;
2. An experimental evaluation of different main-memory caching
strategies and their effects on Data Mining workloads.
1.4 Implementation Summary
The system implementation can be summarised by the 4-tier architecture de-
picted in Figure 1.2.
The Infrastructure Layer is provided by Amazon's EC2. The decision to se-
lect AWS was based on its ability to dynamically allocate computing in-
stances, its enhanced monitoring capabilities and its comprehensive online
documentation.
The Distributed Storage Layer consists of multiple SSD drives, managed
by the Hadoop Distributed File System (HDFS) [18]. HDFS encapsulates dis-
tributed storage into a single logical unit and guarantees fault-tolerance
through data partitioning and replication.
The Batch Processing Layer was based upon the distributed computing
framework Apache Spark. Spark is innovative in the field of large-scale in-
memory computations and offers multiple advanced caching mechanisms.
Figure 1.2: System Architecture
The Data Mining Application Layer was engineered by implementing cus-
tom distributed versions of Weka's Data Mining algorithms. More specifically,
four different categories of Data Mining methods were implemented:
• Classification
• Regression
• Association Rule Learning
• Canopy Clustering
The distribution was achieved by implementing Decorator classes
(“wrappers”) which encapsulated Weka's sequential algorithms. The func-
tional nature of the MapReduce model prohibits data dependencies between
Map and Reduce tasks executed in parallel. Thus, each Map task was designed
to process a dataset partition in parallel using an application specific Weka al-
gorithm. Each Reduce task was designed to receive the results of completed
Map tasks and aggregate them into a single output. In order to guarantee par-
allel execution, Map tasks were implemented as unary functions and Reduce
tasks as binary, commutative and associative functions. This procedure was
made possible by the functional nature of the Scala [19] programming lan-
guage.
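The wrapper pattern described above can be illustrated with a sketch in plain Java (the `PartialModel` class below is hypothetical, standing in for a wrapped Weka algorithm): each Map task trains on its partition in isolation, and the Reduce step is a binary, commutative and associative merge of partial models.

```java
import java.util.Arrays;
import java.util.List;

public class TrainSketch {
    // Hypothetical partial model: sufficient statistics for a mean predictor.
    static class PartialModel {
        final double sum; final long count;
        PartialModel(double sum, long count) { this.sum = sum; this.count = count; }
        // Reduce: commutative and associative, so partials merge in any order.
        PartialModel merge(PartialModel other) {
            return new PartialModel(sum + other.sum, count + other.count);
        }
        double predict() { return sum / count; }
    }

    // Map: a unary function training on one partition, with no data
    // dependencies on any other partition.
    static PartialModel train(List<Double> partition) {
        double s = partition.stream().mapToDouble(Double::doubleValue).sum();
        return new PartialModel(s, partition.size());
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(1.0, 2.0), Arrays.asList(3.0, 4.0, 5.0));
        PartialModel model = partitions.parallelStream()
                .map(TrainSketch::train)
                .reduce(PartialModel::merge)
                .orElseThrow(IllegalStateException::new);
        System.out.println(model.predict()); // prints 3.0
    }
}
```

In the actual system, `train` would delegate to a wrapped sequential Weka learner and `merge` to an algorithm-specific aggregation (for example, model averaging or voting); the essential property is that the merge is order-independent.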
1.5 Evaluation Strategy
The implementation has been evaluated using computer clusters and problem
sizes of three distinct levels:
• Small scale: Using an 8-core cluster of m3.xlarge instances possessing
28.2GB of main-memory;
• Medium scale: Using a 32-core cluster of m3.xlarge instances pos-
sessing 112.8GB of main-memory;
• Large scale: Using a 128-core cluster of m3.xlarge instances possess-
ing 451.2GB of main-memory.
Dataset sizes were increased in proportion to the computer cluster sizes and
ranged from 5GB up to 80GB.
Each category of Data Mining methods was evaluated by using a represent-
ative algorithm. In each case, the experiment was repeated using the three
scales and the seven different caching strategies. An identical workload was
executed on Hadoop for comparison purposes.
The metrics measured in each experiment were execution time, memory
utilisation, CPU utilisation, network utilisation and disk operations.
These measurements were analysed to provide information about the sys-
tem's scalability, achieved load balancing, resource use and potential bottle-
necks. Additionally, the trade-offs that the different in-memory caching
strategies impose on these metrics were documented and analysed.
1.6 Project Achievements
The project contributions are three-fold:
1. A distributed Data Mining framework that achieves near-linear scal-
ing in executing Data Mining algorithms in parallel;
2. An analysis on the behaviour of distributed main-memory caching
mechanisms on different Data Mining execution scenarios;
3. An automated caching strategy selection mechanism that assesses
dataset and cluster characteristics and selects an appropriate caching
scheme.
The system outperforms Hadoop by a factor of 2.36 on average and by up to
four times on small-scale datasets.
Scaling efficiency indicates a system's ability to efficiently utilise increas-
ing numbers of processing nodes [20]. In weak scaling, the per-node size of
the problem remains constant and additional nodes are used to tackle a bigger
problem. In strong scaling, the total size of the problem remains constant and
additional nodes are introduced to reduce execution times. An optimal
parallel system demonstrates linear weak and strong scaling efficiencies.
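These efficiencies are derived directly from measured execution times. The following Java sketch (class and method names are illustrative, not part of the system) shows the two calculations, where t1 is the single-node baseline time and tN the time measured on n nodes:

```java
// Illustrative sketch: scaling efficiencies from measured execution times.
// t1 is the single-node baseline time; tN the time measured on n nodes.
class ScalingEfficiency {

    // Weak scaling: per-node problem size is constant, so ideally tN == t1.
    static double weak(double t1, double tN) {
        return t1 / tN;
    }

    // Strong scaling: total problem size is constant, so ideally tN == t1 / n.
    static double strong(double t1, double tN, int n) {
        return t1 / (n * tN);
    }
}
```

For example, a workload whose per-node size is kept constant and which takes 109.4s on a multi-node cluster against a 100s single-node baseline has a weak scaling efficiency of roughly 91.4%.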
Weak scaling efficiencies for Linear Regression, Support Vector Machines
(SVM) and FP-Growth were measured at 91.4%, 91.8% and 90.9% respect-
ively. These results are within 10% of the optimal parallel performance. The
strong scaling efficiency was measured at 91.3% on the large-scale experi-
ments. This figure is on a par with the state-of-the-art in the surveyed literat-
ure.
Uncompressed datasets were found to consume up to 500% more memory
than the on-disk values. Serialisation converts distributed object structures to
continuous arrays of bytes. Compression uses encoding techniques to repre-
sent data with fewer bits. Serialisation and compression mechanisms were
found to reduce the memory footprints of uncompressed datasets to 101.3%
and 46.7% of the on-disk value respectively. The execution time overhead of
these mechanisms was measured at 5%.
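The interplay of the two mechanisms can be sketched with standard Java serialisation and gzip compression (the class below is purely illustrative; the actual system relies on Spark's pluggable serialisation and compression facilities):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch: serialisation flattens an object structure into a
// continuous byte array; compression then encodes those bytes more densely.
class CachingFootprint {

    // Serialise a partition of numeric instances to a byte array.
    static byte[] serialise(double[] partition) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(partition);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Gzip-compress the serialised form.
    static byte[] compress(byte[] serialised) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
                gzip.write(serialised);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

How far compression shrinks the serialised bytes depends heavily on dataset characteristics; the 101.3% and 46.7% figures above are averages over the evaluated workloads, not general constants.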
Disk caching offers an efficient alternative in memory-constrained environ-
ments. The disk caching mechanism was able to successfully execute Data
Mining tasks even with only 2.5% of the cluster's main-memory available.
The caching strategy selection mechanism outperformed the default caching mechanism by up to 25%. This was achieved by reducing garbage collection overheads through serialisation and compression. Additionally, this mechanism was found to eliminate the failures caused by main-memory exceptions in limited-memory experiments.
1.7 Overview of Dissertation
The remainder of the dissertation is structured as follows: Chapter 2 presents
an overview of the state-of-the-art in Data Mining and distributed computing
frameworks; Chapter 3 provides an analysis of the system architecture;
Chapter 4 presents and analyses the execution model; Chapter 5 presents the
evaluation methodology and benchmarking results. Finally, Chapter 6 draws
conclusions and proposes future improvements.
2 Literature Review
This chapter presents fundamental Data Mining techniques and tools (2.1);
the state of the art in Distributed Computing Frameworks (2.2); and distrib-
uted Data Mining efforts using these frameworks (2.3).
2.1 Data Mining
Data Mining procedures can be summarized as a number of fundamental
steps: data recording, noise reduction, data analysis and data representation
and interpretation. Figure 2.1 illustrates these steps.
High data volumes lead to signal-to-noise ratios that can reach 1:9 [21].
Consequently, a noise reduction phase is essential to maximize data quality
and to minimize noise effects. The data analysis phase can be divided into four
major categories of methods: Classification, Regression, Clustering and Asso-
ciation Rules.
2.1.1 Classification
Classification is a process in which a hypothesis function (classifier) analyses
a set of attributes and infers the category to which a data object belongs. The
produced classifier can be used to predict the class of unknown data objects.
Classification is an example of supervised learning, where a learning al-
gorithm processes a dataset with annotated data points (data objects of previ-
ously known categories) and computes the parameters of the function.
Figure 2.2 (taken from [22]) illustrates this process.
Figure 2.1: The Data Mining Process
In many cases, the hypothesis space can be large [23] and retrieving a
strong hypothesis function is a non-trivial task. For this purpose, a set of
techniques known as Ensemble Learning [24] or, more commonly, Meta Learning,
have been developed to combine the predicting power of multiple classifiers
into a single strong predictor.
A popular technique builds Voted Ensembles [25] using groups of classifiers.
Each encapsulated classifier predicts the outcome of a feature vector
independently and the Ensemble outputs the majority class.
2.1.2 Regression
Regression is a supervised learning technique targeting the implementation of
prediction models for numeric class attributes. The difference
between Regression and Classification models is that the outcome of a regres-
sion hypothesis function can have an infinite set of values (any real number).
During training, a learning algorithm computes the correlation of a set of
assumed independent variables with a dependent (class) variable. The out-
come is a trained hypothesis function which can be used to predict the value
of the class variable given the observed values of the independent feature vari-
ables.
Voted Ensembles of Regression functions can be created using similar
Meta Learning techniques. Since the possible states of the output are theoret-
ically infinite, majority votes are represented by average or median values.
Figure 2.2: Supervised Learning Process [22]
2.1.3 Clustering
Clustering is the process in which data objects are grouped together in classes
based on some measure of similarity. It is different from classification as pos-
sible data classes are not known in advance. It is an example of unsupervised
learning and it is used to discover hidden structure in datasets.
The clustering process can be divided into iterative and incremental ap-
proaches. Iterative algorithms require multiple passes over the whole dataset
to converge. For incremental cluster construction [26], two approaches can be
used: a) adding data points at each iteration and recomputing the cluster
centres and b) adding a cluster centre at each iteration. In both cases, the solu-
tion is built incrementally and it is possible to find a near optimal solution in a
single pass.
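Approach (a) can be sketched as a running-mean update of a cluster centre, folding each new point into the centre in a single pass (illustrative Java; the class and method names are not part of any surveyed system):

```java
// Illustrative sketch of incremental cluster construction, approach (a):
// points are added one at a time and the cluster centre is recomputed as a
// running mean, so the solution is built in a single pass over the data.
class IncrementalCluster {
    private final double[] centre;
    private long count = 0;

    IncrementalCluster(int dimensions) {
        this.centre = new double[dimensions];
    }

    // Fold one data point into the centre in O(d) time.
    void add(double[] point) {
        count++;
        for (int d = 0; d < centre.length; d++) {
            centre[d] += (point[d] - centre[d]) / count;
        }
    }

    double[] centre() {
        return centre.clone();
    }
}
```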
Consensus clustering [27] is the process of combining different clusterings
of the same dataset in a single output. It is a process analogous to Meta Learn-
ing in supervised learning. However, it is known to be NP-Complete and solv-
ing large instances is considered intractable [28].
2.1.4 Association Rule Learning
Association Rule Learning discovers correlations and associations between
items in a set of transactions. Given the large amounts of data stored by retail-
ers, association rules emerged as a solution to the market-basket analysis
problem [29].
A bit vector is used to indicate the presence or absence of an item in an
itemset (transaction). A group of bit vectors is used to represent a set of trans-
actions. By analysing these vectors, it is possible to discover items that fre-
quently occur together. These frequent occurrences are expressed in the form
of rules. For example, the rule
{ itemA, itemB } => { itemC }
indicates a correlation between the presence of items A and B and the presence
of item C. However, not all rules are interesting. Consequently, measures
of significance are required. Support (percentage of transactions that contain
items A, B and C), Confidence (percentage of transactions containing items A
and B that also contain item C) and Lift (a correlation measure of the two
itemsets) are three frequently used measures. By setting thresholds to these
measures it is possible to discover interesting rules on a set of transactions us-
ing various algorithms.
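These three measures can be computed directly from the bit-vector representation (an illustrative Java sketch; item indices address positions in the bit vectors):

```java
import java.util.Arrays;

// Illustrative sketch: significance measures over transactions encoded as
// bit vectors. transactions[t][i] is true when transaction t contains item i.
class RuleMeasures {

    // Support: fraction of transactions containing every item in `items`.
    static double support(boolean[][] transactions, int... items) {
        int count = 0;
        for (boolean[] transaction : transactions) {
            boolean all = true;
            for (int item : items) all &= transaction[item];
            if (all) count++;
        }
        return (double) count / transactions.length;
    }

    // Confidence of the rule {antecedent} => {consequent}.
    static double confidence(boolean[][] tx, int[] antecedent, int consequent) {
        int[] both = Arrays.copyOf(antecedent, antecedent.length + 1);
        both[antecedent.length] = consequent;
        return support(tx, both) / support(tx, antecedent);
    }

    // Lift: how much more often the rule holds than if the two
    // itemsets occurred independently.
    static double lift(boolean[][] tx, int[] antecedent, int consequent) {
        return confidence(tx, antecedent, consequent) / support(tx, consequent);
    }
}
```

Setting minimum thresholds on these values filters the rule space down to the interesting rules.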
Due to the sheer volume of monitored transactions, various techniques
have been developed that enable mining large transaction logs in small parti-
tions. A popular technique, known as Partition [30], divides transactions into
groups and applies Association Rules Learning algorithms to these groups in-
dependently. This procedure produces a set of candidate rules. Each candidate
rule is validated by computing its significance measures across all partitions.
Candidate rules that meet global significance criteria are produced in the out-
put.
2.1.5 Data Mining System Development
Data Mining systems can be developed using two main approaches: Toolkit-
Based and Statistical Languages. The following sections present these ap-
proaches in detail.
2.1.5.1 Toolkit-Based Approaches
Data Mining toolkits can be described as a set of Data Mining algorithm im-
plementations accompanied by a user interface. The user is able to interact
with the algorithms and analyse datasets through the interface without pos-
sessing advanced knowledge about implementation details. An influential ex-
emplar is the Data Mining toolkit Weka.
Weka [15] encapsulates well tested and extensively reviewed implementa-
tions of the most popular Data Mining methods and algorithms mentioned
above. It contains tools that support all phases of the Data Mining process.
A large collection of filters can be used to pre-process datasets and reduce
noise. Algorithms spanning across all major categories can be used for data
analysis. Weka offers a Graphical User Interface (GUI) that supports interact-
ive mining and results visualization. Finally, it automatically produces statist-
ics to assist results evaluation.
The major disadvantage of Weka is that it only supports sequential single-
node execution. As a result, the size of the datasets that Weka can handle us-
ing the existing environment is limited by the maximum amount of the heap
memory of a single node.
2.1.5.2 Statistical Language Based Approaches
Statistical languages are special purpose programming languages that incor-
porate primitive statistical operations such as matrix arithmetic and vector ma-
nipulation. Since Data Mining is closely associated with statistics, Data Sci-
entists extensively use these languages [17] to develop Data Mining al-
gorithms. A popular example of this approach is R.
R [16] is a statistical programming language with built-in support for linear
and non-linear modelling, matrix manipulation, time-series analysis, data
cleaning, statistical testing and graphics among others [31]. It is interpreted
and can be used interactively to implement data mining tasks. Statistical pro-
cedures are exposed through simple commands.
R gained popularity [17] among analysts mainly because it does not de-
mand advanced programming expertise. However, it is designed for sequential
execution and suffers the same shortcomings as Weka in Big Data problems.
2.1.5.3 Approach Selection
Statistical languages require state of the art knowledge of primitive statistical
operations and mainly consist of tools to develop algorithms rather than librar-
ies of implementations. In contrast, Data Mining toolkits require minimal
prior knowledge and can be directly deployed to analyse datasets.
This project is based on toolkit-based approaches and more specifically
Weka, for two main reasons:
• Weka is written in Java and can be directly deployed and executed on
top of Java-based distributed computing frameworks without requiring
additional compilation overheads;
• The toolkit-based approach is appealing to different categories of users
(novices and experts) and thus targets a broader user base.
2.1.6 Partitioning and Parallel Performance
Data Mining algorithms can be either single-pass or iterative [32]. Single-pass
algorithms have an upper bound in execution times. Iterative algorithms iter-
ate over the dataset until a stop condition is met (convergence) and thus exe-
cution times may vary. Due to the sheer volumes in Big Data, this project util-
ises the partitioned parallelism approach: the dataset is partitioned and com-
puting nodes process the partitions in parallel.
For single pass algorithms, this method can theoretically yield speed-up
proportional to the number of nodes. In practice, the overhead associated with
distributing computations and aggregating the results over all partitions will
set the limit marginally lower. However, this overhead can be experimentally
computed and system performance is predictable. An example of this case is
computing class means and variances in building a Gaussian model for Naive
Bayes. Each node computes the statistics for each class in local partitions in
one iteration and aggregation is achieved in a single synchronisation step.
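The Naive Bayes example can be sketched as follows: each partition accumulates a (count, sum, sum of squares) triple, and a single merge step, which is binary, commutative and associative, yields the global mean and variance (illustrative Java, independent of any framework):

```java
// Illustrative sketch: single-pass, partitioned computation of a class mean
// and variance, as needed for a Gaussian model in Naive Bayes. Each node
// builds one GaussianStats per class over its local partition; aggregation
// is a single, associative merge step.
class GaussianStats {
    long count;
    double sum;
    double sumSq;

    // Local, per-partition pass over attribute values.
    void accept(double value) {
        count++;
        sum += value;
        sumSq += value * value;
    }

    // Binary, commutative and associative: safe to use as a Reduce function.
    static GaussianStats merge(GaussianStats a, GaussianStats b) {
        GaussianStats merged = new GaussianStats();
        merged.count = a.count + b.count;
        merged.sum = a.sum + b.sum;
        merged.sumSq = a.sumSq + b.sumSq;
        return merged;
    }

    double mean() {
        return sum / count;
    }

    // Population variance via E[x^2] - E[x]^2.
    double variance() {
        return sumSq / count - mean() * mean();
    }
}
```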
In contrast, iterative algorithms may be unpredictable. The number of itera-
tions until convergence cannot be defined in advance. Two different ap-
proaches are possible [33]: synchronous and asynchronous.
In the synchronous approach, each node computes the model's parameters on its own
partition in a single iteration. A synchronisation step is used to collect local
parameters from all nodes and produce aggregated global values. During syn-
chronisation, the stop condition is evaluated. If the condition is not satisfied,
each node obtains a copy of the global values and begins a new iteration. This
technique achieves load balancing between the nodes, but requires constant
node communication and the network can become a performance bottleneck
[34] in large clusters.
In the asynchronous approach, each node computes a local model and a single
synchronisation step aggregates the results. This technique minimises network
overheads, but load balancing is not guaranteed. Nodes that struggle to converge
will slow down the performance of the system. One solution is to enforce a
deadline: each node has a certain number of iterations to meet the stop condi-
tion. After the deadline, the final model will be computed only on the nodes
that managed to converge. This technique may lack precision, but the execution
time is guaranteed and speed-up is proportional to the number of nodes.
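The deadline-based aggregation step can be sketched as follows (illustrative Java; for regression-like models the "vote" is an average over the parameter vectors of the nodes that converged in time):

```java
import java.util.List;

// Illustrative sketch of the asynchronous approach: each node fits a local
// parameter vector on its own partition; one synchronisation step averages
// the models of the nodes that converged before the deadline.
class AsyncAggregation {
    static double[] aggregate(List<double[]> localModels, List<Boolean> converged) {
        int dims = localModels.get(0).length;
        double[] global = new double[dims];
        int used = 0;
        for (int n = 0; n < localModels.size(); n++) {
            if (!converged.get(n)) continue; // drop nodes that missed the deadline
            for (int d = 0; d < dims; d++) {
                global[d] += localModels.get(n)[d];
            }
            used++;
        }
        for (int d = 0; d < dims; d++) {
            global[d] /= used; // average of the converged local models
        }
        return global;
    }
}
```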
This project focuses on asynchronous techniques, for two main reasons:
• Weka does not support synchronisation. Adding this feature would re-
quire re-implementing Weka's algorithms.
• Distributed computing frameworks suffer from network bottlenecks
and developers aim to avoid network saturation.
2.2 Distributed Computing Frameworks
Distributed Computing Frameworks provide an interface between clusters of
computers and algorithms. Each framework exposes an Application Program-
ming Interface (API) and guarantees that the algorithms developed using this
interface can be executed in parallel. In the following sections, a number of
influential and widely used Distributed Computing Frameworks are presented.
2.2.1 MapReduce
MapReduce was introduced by Google [8] in order to tackle the problem of
large-scale processing in clusters of inexpensive commodity hardware. Data-
sets in MapReduce are automatically partitioned, replicated and distributed
across the cluster nodes. This practice ensures that partitions can be processed
in parallel and fault-tolerance can be guaranteed through replication.
A MapReduce cluster consists of a Master node which handles data parti-
tioning and schedules tasks automatically in an arbitrary number of Workers.
The Master also maintains meta-data concerning partition locations in the
cluster. This practice assists scheduling Workers to process their local partition
and avoids transmitting large data chunks through the network.
The user is required to implement two functions: Map and Reduce. Map is
used to filter and transform a list of key-value pairs into intermediate key-
value pairs. Reduce processes the intermediate pairs, aggregates the results
and produces the output.
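The canonical illustration of these two contracts is word counting: Map emits an intermediate (word, 1) pair for every word in a record, and Reduce sums the values collected for each key (an illustrative Java sketch, independent of any particular MapReduce runtime):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the Map and Reduce contracts via word counting.
class WordCount {

    // Map: transform one input record (a line) into intermediate pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1)); // emit (word, 1)
        }
        return pairs;
    }

    // Reduce: aggregate all intermediate values observed for one key.
    static int reduce(String key, List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value; // sum the counts for this word
        }
        return total;
    }
}
```

The runtime, not the user, is responsible for grouping the intermediate pairs by key and routing each group to a Reducer.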
Once the user has specified the Map and Reduce functions, the runtime en-
vironment automatically schedules execution of Mappers on idle cluster
nodes. Each node executes the Map function against its local dataset partition,
writes intermediate results to its local disk and periodically notifies the Master
of progress. As the Mappers start producing intermediate results, the Master
node assigns Reduce tasks to idle cluster nodes. Each intermediate result has a
key and is distributed to the Reducer that handles this key (or key range).
Figure 2.3 (taken from [8]) presents the MapReduce procedure.
2.2.2 Hadoop
Hadoop [9] is an open-source implementation of MapReduce. It was de-
veloped by Yahoo and released as an open-source project in 2006. Hadoop's
main components are HDFS (Hadoop Distributed File System), YARN (Yet-
Another Resource Negotiator) [35] and the MapReduce framework.
HDFS is a disk-based file system that spans across the cluster nodes. Files
stored in HDFS are automatically divided into blocks, replicated and distrib-
uted to the nodes' local disks. HDFS maintains meta-data about the location of
blocks and assists Hadoop to schedule each node to process local blocks rather
than receive remote blocks through the network. HDFS encapsulates distributed
local storage into a single logical unit and automates the procedure of
distributed storage management.
Figure 2.3: MapReduce Execution Overview [8]
YARN is responsible for managing cluster resources and providing applications
with execution containers for Map and Reduce tasks. YARN maintains a
directory of all the execution containers in the cluster (which may be multiple
per node) and either allocates idle containers to applications for execution or
delays the application execution until a container becomes available.
Finally, MapReduce is a set of Java libraries that implement the aforemen-
tioned MapReduce paradigm.
Hortonworks' web page [36] presents the Hadoop architecture as in Figure
2.4.
2.2.2.1 Beyond Hadoop
As briefly discussed earlier, Hadoop's execution engine is inefficient in two
main areas:
• Native support for iterative algorithms: Many Data Mining algorithms re-
quire multiple iterations over a dataset to converge to a solution. For example,
the K-Means clustering algorithm iterates over a dataset until cluster assign-
ments remain unchanged after two successive iterations. These iterations are
usually included in the user-defined driver program. This feature requires both
the reload of invariant data and the restart of execution processes at each itera-
tion, leading to significant performance overheads. These overheads could be
avoided by introducing a loop-aware scheduler [37].
Figure 2.4: Hadoop Tech Stack [36]
• HDFS is a disk-based file system. However, modern clusters possess
main memory that can exceed 1TB and most data mining tasks are within this
limit [38]. As a result, significant performance improvement is possible by
using main-memory caching mechanisms.
These issues led to numerous efforts towards novel systems. The following
sections present a number of important projects in the area.
2.2.3 Iterative MapReduce
HaLoop [37] introduced a loop operator to the MapReduce framework aiming
to provide support for iterative algorithms. HaLoop's scheduler is loop-aware
and introduces caching mechanisms which avoid multiple slow disk accesses
to loop-invariant data (for example stop conditions and global variables).
Figure 2.5 (taken from [37]) shows the boundaries between user applica-
tion and system code in MapReduce and HaLoop.
Stop conditions are evaluated automatically by the system and do not re-
quire an additional task as in MapReduce. The loop-aware scheduler co-loc-
ates jobs that use the same data in successive iterations. The caching mechan-
ism is used to save loop-invariant data between iterations. These develop-
ments, as reported in [37], have enabled up to 85% speed-up compared to Ha-
doop on iterative algorithms.
Figure 2.5: HaLoop and MapReduce [37]
2.2.4 Distributed Systems for In-Memory Computations
These systems use main-memory caching strategies to speed-up computations
and offer significant performance advantages over disk-based solutions. The
following sections overview efforts in the area.
2.2.4.1 In-Memory Data Grids
In-Memory Data Grids (IMDG) [39] emerged as middleware between databases
and applications demanding low-latency access to mission-critical data.
Infinispan [40] and HazelCast [41] are two representative examples.
Data in IMDG is represented as collections of non-relational data objects.
These collections are hash-partitioned and distributed to the main-memory of
a cluster's nodes. IMDG acts as a distributed cache, serves applications seek-
ing database entries and reduces redundant slow disk accesses to disk-based
database servers.
Traditional IMDGs only supported data storing and retrieval operations.
As a result, complex computations were handled by external frameworks. This
technique faces performance bottlenecks [42] because task scheduling takes
into consideration neither data locality nor load balancing. In order to
tackle this issue and combine the low-latency data access offered by IMDG
with the computational capabilities of distributed computing solutions, in-
memory computing frameworks emerged.
The following sections present a number of frameworks that combine in-
memory caching and distributed computations in a single scalable solution.
2.2.4.2 Piccolo
Piccolo [43] provides an in-memory data-centric programming model for
building parallel applications on large clusters. Piccolo allows distributed ap-
plications to share state via a mutable key-value table. Table entries are parti-
tioned and distributed to a cluster's main-memory, in order to achieve faster
sharing of intermediate results among cluster nodes.
Piccolo applications are implemented using a control function, a set of ker-
nel functions, a set of accumulation functions and a distributed in-memory
key-value table. The user defines a control function that operates in the master
node and monitors the control flow of the application. A number of kernel
functions (sequential processing threads) are launched in the slave nodes and
perform parallel computations on the entries of the distributed table. A locality
preference mechanism by default schedules kernel functions to process local
partitions. Intermediate results are updated in the table using atomic opera-
tions and thus an accumulation function must be defined to aggregate multiple
updates to the same key.
The user is required to manually define the distribution of keys in the
cluster nodes through a partition function. As the system does not automatically
manage cases of insufficient memory in a node, the user must carefully
consider partition placement. The system guarantees fault-tolerance using a
check-pointing mechanism that captures system snapshots at regular intervals.
A failed node will force the system to roll back to a previous state and repeat
computations in all nodes because datasets are not replicated and lineage in-
formation is not monitored.
Piccolo reports [43] speed-up of up to an order of magnitude compared to
Hadoop. However, the system does not provide a high level interface that
guarantees parallel execution, exposes a complicated programming model,
lacks a mechanism to tackle main-memory shortages and uses a computation-
ally expensive disaster recovery mechanism.
2.2.4.3 GraphLab
GraphLab [44] expresses computations as data graphs and provides schedul-
ing primitives. GraphLab's data model consists of a directed data graph and a
shared (global) data table. Both the graph vertices and the shared table are
stored in-memory. The framework is optimistic: it assumes that nodes do not
fail and that the data can fit in the cluster's main memory.
Graph vertices and edges represent sparse data dependencies. Computations
can process graph elements either through a stateless user-defined update
function (analogous to Map in MapReduce) or using a synchronization mech-
anism that defines a global aggregation (analogous to Reduce in MapReduce).
Update functions operate on a graph neighbourhood and can have read-
only access to the shared data table. Unlike MapReduce, Update functions are
allowed to process overlapping context. User programs can have multiple up-
date functions and parallel execution is guaranteed as long as simultaneous ac-
cess to common vertices is not required. Consistency must be defined by the
user. Three different consistency mechanisms are offered: Vertex, Edge and
Full consistency. Figure 2.6 (taken from [44]) depicts these three options.
Consistency mechanisms define the degree of overlapping in processed
graph neighbourhoods. Relaxing consistency guarantees (by locking fewer
vertices during a vertex update) allows a higher degree of parallelism (more
functions can be executed in parallel as fewer vertices are locked).
The global data table can only be updated using the synchronisation mech-
anism. During synchronisation, data across all vertices are aggregated to an
entry in the shared table in a manner analogous to Reduce functions in
MapReduce.
Finally, a GraphLab user program must contain an update schedule. This
schedule defines the order in which vertex-function pairs will be executed.
This is a dynamic list and tasks can be updated or rearranged during execu-
tion. GraphLab provides predefined schedules based on popular data struc-
tures (FIFO queues, priority queues) and, most importantly, contains a sched-
uler construction framework that lets the user define a custom scheduling
mechanism.
Figure 2.6: GraphLab Consistency Mechanisms [44]
It is important to note that GraphLab's programming model does not limit
itself to graph computations. Sparse and dense matrices can be represented by
graphs and thus GraphLab can be used to express iterative Data Mining al-
gorithms.
Results in [44] demonstrate high performance in iterative asynchronous al-
gorithms. However, GraphLab does not possess a mechanism to tackle
memory shortages, optimistically assumes that node failures are improbable
and exposes a complicated programming model.
2.2.4.4 Spark
Resilient Distributed Datasets (RDDs) [14] are a distributed main-memory ab-
straction that enable users to perform in-memory computations in large
clusters. RDDs are implemented in the open-source Apache Spark [45] frame-
work.
An RDD is an immutable collection of records distributed across the main
memory of a cluster. These data structures can be created by invoking a set of
operators either on persistent storage data objects or on other RDDs. The sys-
tem logs dataset transformations using a lineage graph. The system is con-
sidered to be “lazy”: it does not materialise transformations until the user re-
quests either an output or saving changes to persistent storage.
Figure 2.7 (taken from [14]) illustrates an RDD lineage graph.
Figure 2.7: RDD Lineage Graph [14]
RDD operators are divided into two categories: Transformations and Ac-
tions. Transformations define a new RDD, based on an existing RDD and a
function. Actions materialise the Transformations and either return a value to
the user or export data to persistent storage.
As the system was inspired by MapReduce, Transformations provide native
support for Map and Reduce operators. Additional operators include join,
union, crossProduct and groupBy. These features extend the system's capabil-
ities by introducing an Online Analytical Processing (OLAP) engine.
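The lazy Transformation/Action split can be illustrated with Java streams, which follow the same principle: intermediate operations only describe a pipeline, and a terminal operation materialises it (an analogy for illustration, not Spark's API):

```java
import java.util.List;

// Illustrative sketch of lazy evaluation using Java streams: map and filter
// (the "Transformations") only record a pipeline; count (an "Action")
// triggers the actual computation over the records.
class LazyPipeline {
    static long countLongWords(List<String> records) {
        return records.stream()
                .map(String::trim)            // Transformation: recorded, not run
                .filter(w -> w.length() > 3)  // Transformation: recorded, not run
                .count();                     // Action: materialises the pipeline
    }
}
```

In Spark the same structure appears as a chain of RDD operators in the driver program, with nothing executed until an Action such as count or saveAsTextFile is invoked.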
The system was designed to tackle the shortcomings of MapReduce in iter-
ative computations by introducing a loop-aware scheduler and thread execut-
ors. In multi-stage execution plans the system schedules successive iterations
at the nodes where the data is cached, avoiding slow network and disk ac-
cesses. The introduction of thread executors enables the system to reuse
launched threads in the execution of multiple successive closures. This prac-
tice avoids the significant initialisation and termination costs observed in
MapReduce during each execution stage.
The users execute tasks by providing a driver program which defines a path
to secondary storage and sets of Transformations and Actions. The system
then creates an RDD and distributes its records across the main-memory of the
cluster. When an Action is issued, the system schedules the execution of all
the requested Transformations. Each node in the cluster will process its local
set of records and return the results. By caching datasets to main memory the
system avoids slow disk reads and can perform up to 100 times faster than
Hadoop's MapReduce [14] on Logistic Regression and K-means.
Spark also takes advantage of the interactive nature of Scala and Python,
by providing an interactive shell. Users can load a dataset in main-memory
and execute multiple mining tasks interactively.
In cases where the dataset is larger than the available amount of main
memory, Spark employs a mechanism to serialize and store portions of the
dataset to secondary storage. Memory consumption is a configuration para-
meter and the mechanism can be triggered at a user-defined level. Even in
cases where minimal memory is available, Spark outperforms Hadoop due to
shorter initialisation and termination overheads [46].
The system offers numerous memory management options including differ-
ent serialization and compression libraries. These features allow the user to
define an application specific caching strategy that takes advantage of dataset
and cluster characteristics.
Spark is fully compatible with the Hadoop ecosystem. Spark's Scala API
supports closures and Hadoop Map and Reduce functions can be submitted as
arguments to its native operators. This allows users to deploy existing Hadoop
applications to Spark with minor adjustments.
2.2.5 Distributed Computing Framework Selection
This work has focused on using Apache Spark as a deployment framework for
the WEKA subset of interest. The reasoning behind this decision is as follows:
• Spark offers advanced main-memory caching capabilities that include
robust serialization mechanisms, compression mechanisms and the
ability to tackle memory shortages;
• Spark is based on the Scala programming language and it is natively
compatible with Weka's Java libraries;
• Spark offers many operators through a simple and easy-to-use API;
• Spark has recently generated significant interest in both academic and
industrial environments and it is rapidly evolving into a leading Dis-
tributed Computing Framework [47].
2.3 Distributed Data Mining
The following outlines efforts to express Data Mining algorithms using the
programming models of the distributed computing frameworks described in
Section (2.2).
2.3.1 Data Mining on MapReduce
The following presents three notable efforts to develop Data Mining systems
using the MapReduce paradigm.
1. Chu et al. [12] developed many popular Data Mining algorithms (such
as the Linear Regression, the SVM and the K-Means clustering al-
gorithm) on top of MapReduce. Experiments on multi-core processors
demonstrate linear speed-up on shared-memory environments. These
results indicate that it is possible to efficiently express Data Mining us-
ing the MapReduce paradigm.
2. Mahout [48] was a community-based side project of Hadoop aiming
to provide scalable Data Mining libraries. As the libraries did not
provide a general framework for building algorithms, the quality of the
provided solutions varied significantly and was dependent on the expertise
of the contributor. This led to poor performance [49], inconsistencies in the
content of its various releases [48], and the project was eventually
discontinued.
3. Radoop [50] introduced the RapidMiner [51] toolkit to Hadoop. Rap-
idMiner has a graphical interface to design work-flows which consist
of data-loading, data-cleaning, data-mining and visualization tasks.
Radoop introduced operators to read data from HDFS and execute
Data Mining tasks. Radoop operators correspond to Mahout al-
gorithms. At runtime the workflow is translated to Mahout tasks and
executed on a Hadoop cluster. As far as performance is concerned, Ra-
doop suffers the same shortcomings as Mahout. However, Radoop in-
troduced the idea of separating Data Mining work-flow design from
distributed computations on clusters.
These efforts provide three observations that are of particular interest to
this project:
1. The MapReduce paradigm can be used to express Data Mining al-
gorithms efficiently and it is possible to achieve linear scaling;
2. The lack of a unified execution model leads to inconsistent perform-
ance, difficulties in maintaining and extending the code-base and dis-
courages widespread adoption. Mahout focused on providing imple-
mentations of specific algorithms, rather than building execution mod-
els for families of algorithms;
3. Abstracting distributed computations on clusters from the design of
data mining processes enables analysts with no technical background
to harness the power of distributed systems.
This work incorporates these observations into the design of the system.
MapReduce is used to implement the distributed versions of Weka's al-
gorithms on Spark. The execution model focuses on abstract interfaces, rather
than concrete implementations of specific algorithms.
Finally, the system is exposed through an interface that mimics the simpli-
city of Weka's command line. This practice enables users familiar with Weka
to directly utilise Weka-on-Spark, regardless of the underlying distributed
nature of the system.
2.3.2 R on MapReduce
The following presents two notable efforts to combine the statistical language
R with MapReduce.
1. Ricardo [52] is an early effort to introduce R to distributed computing
frameworks. The system used the declarative scripting language Jaql
[53] and Hadoop to execute R programs in parallel. The system con-
sists of three components: an interface where the user implements Data
Mining tasks using R and a small set of Jaql functions; an R-Jaql
bridge to integrate R programs into Jaql declarative queries; and the
Jaql compiler that compiles the queries to a series of MapReduce jobs
on the Hadoop cluster. Flow control is performed by the user program.
The bridge allows R programs to make calls to Jaql scripts and Jaql
scripts to run R processes on the cluster.
Figure 2.8, from [52], presents the system architecture.
The system uses R-syntax, with which most analysts are familiar, and
exposes a simple interface. However, execution times are doubled
compared to native MapReduce jobs for the same task. This is due to
the overhead produced by compiling high-level declarative Jaql scripts
to low-level MapReduce jobs.
2. SparkR [54] is an R package that provides R users a lightweight front-
end to Spark clusters. It enables the generation and transformation of
RDDs through an R shell. RDDs are exposed as distributed lists
through the R interface. Existing R packages can be executed in paral-
lel on partitioned datasets by serializing closures and distributing R
computations to the cluster nodes. Global variables are automatically
captured, replicated and distributed to the cluster enabling efficient
parallel execution. However, the system requires knowledge of statist-
ical algorithms as well as basic knowledge of RDD manipulation tech-
niques. To the best of the author's knowledge (as of August 2014),
benchmarking results comparing the system to other Spark-based solu-
tions are not publicly available.
Both efforts to merge R with MapReduce suffer from two major issues:
1. R is based on C and it is not native to Java-based frameworks such as
Hadoop and Spark. Thus a bridging mechanism is required between R
and the underlying Java Virtual Machine (JVM). R code is compiled to C code, which in turn uses the Java Native Interface (JNI [55]) for execution. This approach limits the portability of the system (it becomes platform-dependent) and introduces bridging overheads.
2. The underlying MapReduce paradigm is not transparent to the user.
The user needs to express the computations as a series of transforma-
tions (Map) and aggregations (Reduce). In Ricardo, this is achieved by
using Jaql declarative queries where the selection predicates use R
functions to transform the data (Map equivalent) and aggregation func-
tions (Reduce equivalent) to produce the final output. SparkR uses the
same methodology on distributed lists.
These observations further support the decision to use Weka as it is written in
Java and the bridging overheads of these systems are avoided. Additionally,
by using Weka's interface distributed MapReduce computations are abstracted
from the design of Data Mining processes.
2.3.3 Distributed Weka
Early efforts made to introduce Weka to distributed environments include
WekaG [56], parallelWeka [57] and Weka4WS [58]. WekaG and Weka4WS use
web services to submit and execute tasks to remote servers. However, they do
not support parallelism; each server executes an independent task on its own
local data. ParallelWeka proposed a parallel cross-validation scheme where
each server receives a dataset copy, computes a fold and sends back the res-
ults. This practice cannot be applied on a large scale because of network bot-
tlenecks.
Work by Wegener et al. [59] aimed to merge the user-friendliness of Weka's
user interface with Hadoop's ability to handle large datasets.
The system architecture consists of three actors: the Data Mining Client,
the Data Mining Server and the Hadoop cluster. The client uses Weka's user
interface to build mining tasks and then order execution to the server. The
server receives the client's request, computes the sequential part of the al-
gorithm locally and submits the parts that can be executed in parallel to the
Hadoop cluster. These procedures required reviewing the Weka libraries, identifying the parts of each algorithm that can be executed in parallel and rewriting those parts using MapReduce. On the server, Weka's data-loader was extended to avoid loading datasets into main-memory and instead perform a series of disk reads.
Figure 2.9, from [59], presents the architecture.
This methodology does not provide a unified framework for expressing
Weka's algorithms. Each algorithm must be inspected to identify parts that can
be executed in parallel and re-implemented using MapReduce. This process
entails producing custom distributed implementations of all the algorithms in
Weka and suffers from the same shortcomings as Mahout. Additionally, dis-
abling Weka's caching and reading incrementally from disk produces large
overheads on iterative algorithms.
2.3.4 MLBase
With MLBase [60] the user can build Data Mining tasks using a high-level de-
clarative language and submit them to the cluster's Master node. The system
then parses the request to form a Logical Learning Plan. This plan consists of
feature extraction, dimensionality reduction, filtering, learning and evaluation
algorithms. The optimizer processes that plan using statistical models and
heuristics. An Optimized Learning Plan (OLP) is produced based on which
combination of algorithms is likely to have better performance (execution
time and accuracy). MLBase then translates OLP to a set of primitive operat-
ors that the run-time environment supports. These include relational operators
(joins, projects), filters and high-level functions such as Map in MapReduce.
These primitives are then scheduled for parallel execution in the cluster's
workers.
The system builds the model in stages. From an early stage, it returns a pre-
liminary model to the user and it continues to refine it in the background. This
mechanism provides the opportunity to interrupt the process if the preliminary
results are satisfactory and avoid redundant computations.
Figure 2.10 (taken from [60]) illustrates the MLBase procedure.
The users of this system will be able to submit tasks without specifying an
algorithm. The system will then parse the request and select a near-optimal
solution by analysing various alternatives. This would be an important devel-
opment since users would no longer need to find reliable, scalable and accur-
ate solutions solely based on intuition.
Figure 2.10: MLBase Architecture [60]
As of August 2014, the system is still under development. However, the in-
terfaces of its components were described in [49] and a proof-of-concept im-
plementation using Spark was tested. The results demonstrate constant weak
scaling and an order of magnitude better performance than Mahout in twenty
times fewer lines of code.
2.4 Summary
Sequential solutions, such as Weka, fail to cope with the sheer volume of Big
Data workloads. Hadoop is a field-tested solution for large datasets and it sets
the standard for industrial Big Data platforms. However, Hadoop's native im-
plementation of MapReduce is inefficient in executing iterative algorithms.
Spark tackles this issue by introducing a main-memory caching mechanism
and a loop-aware scheduler.
The MapReduce model is efficient in expressing Data Mining algorithms.
Many projects demonstrate that Data Mining workloads can achieve linear
scaling on top of MapReduce clusters. However, designing such a system is a
non-trivial task. The following observations, as extracted from the Literature
Survey, act as guidelines and directions for the system design:
• The leverage of distributed main-memory can yield up to an order of
magnitude shorter execution times;
• The parallel execution model should focus on Data Mining methods
rather than individual algorithms;
• Designing Data Mining processes should abstract away from distrib-
uted computations;
• Using libraries that require heterogeneous execution environments in-
creases complexity, decreases portability and introduces compilation
and bridging overheads.
Spark provides native support for in-memory computations. Additionally,
both Spark and Weka are Java-based and require the same execution environ-
ment (JVM).
Weka represents each category of Data Mining methods using an abstract
interface. Any individual algorithm is required to implement this interface. By
implementing Map and Reduce execution containers (“wrappers”) for Weka's
interfaces, a scalable execution model becomes feasible.
The user interface can be closely modelled after Weka's user interface. This
feature enables users to design and execute Data Mining processes using the
same tools either locally or in distributed environments.
The following chapters provide an in-depth analysis of the proposed Weka-
on-Spark architecture and the implemented execution model.
3 System Architecture
The following chapter presents the architectural components of an efficient
and scalable Big Data Mining (BDM) solution (3.1), the implemented multi-
tier architecture (3.2) and the cluster's monitoring services (3.3).
3.1 Required Architectural Components
Based on the review in Chapter 2, we identify the following required architec-
tural components for a scalable BDM solution:
• Infrastructure Layer: consists of a reconfigurable cluster of either
physical or virtual computing instances;
• Distributed Storage Layer: automatically encapsulates the local stor-
age of the cluster's computing instances into a large-scale logical unit;
• Batch Execution Layer: schedules and executes tasks on data stored in
distributed storage. Must provide support for in-memory computing
for optimal performance;
• Application Layer: integrates the application logic of BDM workloads
into the programming model supported by the Batch Processing
Layer;
• User Interface: the user requires a mechanism to interact with the sys-
tem and submit Data Mining tasks;
• Monitoring Mechanism: performance tuning and system evaluation de-
mand a mechanism to monitor cluster resources.
The following section presents the implemented multi-tier system which
meets the aforementioned requirements.
3.2 Multi-tier Architecture
The implemented architecture is summarised in Figure 3.1.
The Infrastructure Layer is based on clusters of AWS EC2 instances. The
Distributed Storage Layer consists of a set of EBS [61] volumes and a set of SSD
drives managed by HDFS. The Batch Execution Layer is based on the innov-
ative in-memory computing framework Spark. The Application Layer incor-
porates the implemented BDM framework. The Command Line Interface
(CLI) provides user access to framework services. Finally, the monitoring ser-
vices are provided by CloudWatch.
3.2.1 Infrastructure Layer
AWS Elastic Compute Cloud (EC2) provides on demand access to virtual
servers known as compute instances. Using virtualisation technologies AWS
divides large pools of physical hardware spread among multiple data-centres
into virtual EC2 Compute Units (ECU [62]). By combining multiple ECUs,
AWS can produce virtual machines of varying capacities and capabilities.
Figure 3.1: System Architecture
The virtual nature of ECUs allows the user to either launch or terminate
multiple instances in minutes. This feature enables the provision of automatic-
ally resizeable clusters of instances. Batch processing tasks usually require
large raw computing power for short periods of time. Additionally, these tasks
may vary in volumes and complexity. The flexibility of the service provides
an easily configurable and cost-effective solution for Big Data processing
problems.
EC2 provides full administrative access to the user. EC2 instances can be
managed in the same fashion as physical hardware. The user has the ability to
select the Operating System and install any required software component. Ad-
ditionally, AWS provides access to a marketplace where multiple pre-built
Amazon Machine Images (AMI [63]) can be selected and deployed.
Out of many possible options, this project was developed on top of
Amazon's proprietary Linux distribution. Amazon Linux is designed specific-
ally to operate on top of EC2, it is provided for free to EC2 customers and in-
corporates all the essential libraries required by the forthcoming layers.
3.2.2 Distributed Storage Layer
The persistent storage, provided by the cluster's instances, consists of virtual
drives known as Elastic Block Store (EBS) volumes. These volumes are network-attached and persist even after cluster termination. Each EC2
instance possesses a pre-defined number of EBS volumes attached, but the
user is able to increase this number on demand. Additionally, each instance
possesses a physically attached SSD drive. This storage option is faster to ac-
cess, but it is ephemeral and data will be lost on termination.
Big Data problems may require much larger storage capacity than the max-
imum size of an EBS volume. Fault-tolerance demands data partitioning and
replication. Additionally, efficient parallel execution requires dataset partitions
to be processed by different instances. These requirements suggest the need
for a persistent storage abstraction that would automatically handle dataset
partitioning, replication and distribution over a set of instances.
HDFS encapsulates a set of either physical or virtual storage devices into a
single logical unit. The system has three main actors with distinct roles: the
NameNode, the Secondary NameNode and the DataNodes. The NameNode
maintains meta-data about the locations of dataset partitions in the cluster. The
SecondaryNameNode captures snapshots of the NameNode at regular inter-
vals in order to avoid a single point of failure. Finally, the DataNodes store
data on their local drives.
When a dataset is written to HDFS, the system proceeds to partition it into
a number of blocks, triplicate each block and store two blocks in the same
rack and one block in a remote rack. This procedure guarantees fault-tolerance
and promotes minimal data motion: Distributed Computing Frameworks can
obtain meta-data about block locations from the NameNode and schedule exe-
cution on the node where the data is situated.
Figure 3.2 (taken from[64]) illustrates this architecture.
HDFS was installed on the Linux AMIs of all instances by using a Python
script to download the essential libraries and update the configuration files
containing instance locations in the network. By executing a number of shell
scripts provided by the libraries, the NameNode, SecondaryNameNode and
DataNode services were initialised and configured.
At this stage, the system is ready to accept read and write requests, provide
block locations and assist a Batch Execution Layer to move computation to
the data.
3.2.3 Batch Execution Layer
Batch Execution Engines provide the execution containers for user applica-
tions, schedule the execution of user code to a number of instances in the
cluster, monitor progress, provide fault-tolerance mechanisms and present the
results to the users. Although Batch Execution Engines and Distributed Com-
puting Frameworks are usually overlapping terms, it is useful to distinguish
between the two at this stage, because of cases of Distributed Computing
Frameworks which are not batch-oriented.
Two different solutions were evaluated at this layer: Hadoop and Spark.
Hadoop is a field-tested solution with widespread adoption. However, as men-
tioned in Chapter 2 it lacks support for iterative computations and main-
memory caching. These issues were partially tackled by utilising an asyn-
chronous parallel execution model and by taking advantage of Weka's in-
memory computations. However, main-memory is a scarce resource and this
solution did not possess a mechanism to gracefully overcome main-memory
shortages. Additionally, Hadoop's process initialisation and termination costs
were significant. These observations supported the decision to use Spark as
the system's Batch Execution Engine.
Spark guarantees that applications written using the supported operators
can be executed in parallel. During application submission, Spark starts task
executors on a user-defined number of instances in the cluster. Once the ex-
ecutors are started, multiple tasks can be executed by the same threads of exe-
cution, minimizing overheads associated with initialising and terminating Map
and Reduce threads for every processed dataset partition as in Hadoop.
Spark applications retrieve data from HDFS and the system loads local par-
titions to main-memory using the RDD abstraction. Transformations and Ac-
tions can be submitted by user applications. Spark serializes the functions and
decides on work placements based on data locality. Figure 3.3 illustrates the
initialisation procedure.
Spark pessimistically logs RDD transformations and actions, but does not
maintain snapshots of the actual RDDs. This practice would entail CPU over-
heads for RDD serialization, disk overheads for disk caching and retrieval and
network overhead for snapshot distribution. Early benchmarking, by Zaharia
et al. [14], revealed that recomputing lost partitions is faster and more re-
source efficient than utilizing a snapshot and roll-back mechanism. HDFS
maintains multiple replicas in different racks which guarantees system recov-
ery even in the rare cases of rack-level failure.
Spark's greatest contributions to the system are its multiple caching op-
tions. This feature enables benchmarking multiple combinations and researching optimal caching strategies for in-memory computations on BDM workloads.
The following subsection proceeds to analyse these options in detail.
3.2.3.1 Spark and Main-memory Caching
As briefly discussed earlier, Spark applications request files from HDFS and
Spark represents the data in the cluster's memory as distributed Java objects
using the RDD abstraction. These objects are created based on a set of config-
uration parameters submitted in the application context. The user specifies the
number of instances and the allocated per instance main-memory that the ap-
plication should use during execution. The system divides total executor
memory into data caching memory and Java heap memory (to be used for al-
gorithm variables, internal data structures etc.).
The data caching fraction of the memory can be exploited using different
Storage Levels. The Storage Levels define whether Spark should cache ob-
jects in-memory, on-disk or a combination of the two. Additionally, a replica-
tion factor for in-memory objects can be defined. These levels can be further
customised by the use of different serialisation and compression libraries.
More specifically, Spark can be configured to use either the built-in Java seri-
alisation or the Kryo [65] serialisation libraries. According to [66], Kryo of-
fers performance improvements, but requires a custom object registration
class in the Application Layer. Finally, Spark's codecs can further reduce the
memory footprint of RDD objects by compressing serialised byte arrays.
System evaluation in Chapter 5 contributes an analysis of these options and
proposes a scheme to automatically select a strategy based on cluster and data-
set characteristics.
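As a purely hypothetical illustration of such automatic selection, a simple rule might compare the dataset's estimated footprint against the available cache memory. The storage-level names below are Spark's, but the thresholds are invented for illustration and do not reproduce the Caching Strategy Selection Algorithm analysed in Chapter 5.

```java
/** Hypothetical caching-strategy heuristic: choose a Spark storage level
 *  from the estimated dataset size and the total cache memory. The level
 *  names mirror Spark's options; the threshold rules are illustrative. */
public class CachingHeuristic {
    static String selectLevel(long datasetBytes, long cacheBytes) {
        if (datasetBytes <= cacheBytes)
            return "MEMORY_ONLY";            // raw objects fit entirely in memory
        if (datasetBytes <= 3 * cacheBytes)
            return "MEMORY_ONLY_SER";        // a serialised (e.g. Kryo) copy may fit
        return "MEMORY_AND_DISK_SER";        // spill the remainder to local disk
    }
}
```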
3.2.4 Application Layer
The Application Layer consists of a custom-built distributed Data Mining
framework aiming to tackle BDM problems on top of the underlying architec-
ture. It contains services that handle the complete life-cycle of a Data Mining
task. Support is provided for data and model loading and saving, user option
parsing, task initialization, submission and termination, result presentation and
progress monitoring.
The most important contributions of the implemented solution are two-
fold:
• A scalable distributed Data Mining application logic based on the
MapReduce paradigm and Weka's algorithms. The execution model is
analysed in Chapter 4;
• Advanced main-memory caching configuration capabilities that aim to
maximize the benefits of in-memory computing. Advanced users are
able to define a caching strategy, but the system is capable of automat-
ically selecting a caching strategy according to data and cluster charac-
teristics. Extensive analysis on main-memory caching is available in
Chapter 5.
The implementation is packaged in a shaded Java archive (shaded-jar).
This practice simplifies deployment because complex installation procedures
are avoided. The package is self-contained and can be directly submitted for
execution. Details on the deployment procedure are given in Appendix 2.
3.2.5 CLI
The Application Layer is accessible to the user through a Command Line In-
terface (CLI). The user is required to have access to the command line of the
Spark cluster's master node. This is possible through a Secure SHell (SSH)
connection. After establishing a successful connection, the user can submit
tasks to the cluster by providing a path to the Java archive, followed by the es-
sential execution parameters.
Details about usage and the system's supported options are given in Ap-
pendix 3.
3.3 Cluster Monitoring
Cluster resource monitoring is performed by the AWS monitoring service
CloudWatch [67]. CloudWatch-enabled EC2 instances automatically report
disk, CPU and network usage metrics every minute. These metrics are access-
ible through the AWS management console. Instances can be monitored either
independently or aggregated by type. CloudWatch provides visualisation util-
ities as well as raw readings.
However, CloudWatch does not monitor memory usage by default. As the
analysis of main-memory caching effects in relation to BDM workloads was
an integral issue in this work, an application-specific solution was implemen-
ted.
Cron [68] is a time-based task scheduler for Linux. Entries in the “crontab”
(cron table) are executed at defined time intervals. Cron accesses the
crontab every minute and executes the table entries that are scheduled for this
minute. A custom Perl script that monitors main-memory metrics and reports
to CloudWatch was scheduled for execution every minute in all cluster in-
stances. This procedure was automated by developing a Bash script which can
be found in Appendix 4.
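A hypothetical shell equivalent of this setup is sketched below; the function name, sample file and crontab path are invented, and the actual implementation used a Perl script (scheduled by the Bash automation in Appendix 4) that also pushes the metric to CloudWatch.

```shell
# Illustrative stand-in for the monitoring script: derive a used-memory
# percentage from /proc/meminfo-style fields.
mem_used_pct() {
    awk '/^MemTotal:/ {t=$2} /^MemFree:/ {f=$2} /^Buffers:/ {b=$2} /^Cached:/ {c=$2}
         END {printf "%.1f", (t - f - b - c) * 100 / t}' "$1"
}

# On a cluster instance this would read the live file: mem_used_pct /proc/meminfo
printf 'MemTotal: 1000 kB\nMemFree: 300 kB\nBuffers: 100 kB\nCached: 100 kB\n' \
    > /tmp/meminfo.sample
echo "MemoryUtilization=$(mem_used_pct /tmp/meminfo.sample)"

# A crontab entry (hypothetical path) schedules the report every minute:
# * * * * * /home/ec2-user/monitor/mem-report.sh
```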
3.4 Summary
This chapter has presented the fundamental architectural components of a scalable BDM architecture, along with details on the interactions between the different layers. The following chapter presents the execution model of the Application Layer and analyses the algorithmic logic which enables scalable parallel execution.
4 Execution Model
This chapter presents a general framework for executing Weka algorithms on
top of MapReduce (4.1) and specific solutions for the Header creation (4.3),
Classification and Regression (4.4), Association Rules Learning (4.5) and
Clustering (4.6).
4.1 Weka on MapReduce
As briefly discussed in previous chapters, Weka was designed for sequential
single-node execution. Efficient parallel execution using the MapReduce
paradigm can be achieved by implementing Decorator classes, also known as
“wrappers”, for core Weka algorithms. Each wrapper class encapsulates a
Data Mining algorithm and exposes the containing functionality through the
Map and Reduce interfaces. The proposed execution model for the headers,
the classifiers and the regressors is based on a set of packages released by the
core development team of Weka [69], adjusted to Spark's API and Scala's
functional characteristics. To the best of the author's knowledge, there are no
benchmarking results published for this model (as of August 2014); in that re-
gard this work has provided a set of benchmarking results.
Spark begins execution by scheduling the slave instances to load local data-
set partitions to main-memory. Each slave invokes a unary Map function con-
taining a Weka algorithm against a local partition and learns an intermediate
Weka model. Intermediate models generated in parallel are aggregated by a
Reduce function and the final output is produced. However, the order of oper-
ands in the Reduce functions is not guaranteed. Consequently, Reduce func-
tions were carefully designed to be associative and commutative, so that the
arbitrary tree of Reducers can be correctly computed in parallel.
This process is illustrated in Figure 4.1.
The functional model demands stateless functions. Spark provides a mech-
anism to broadcast variables, but this practice introduces complexity, race
conditions and network overheads. As a result, Map and Reduce functions
have been designed to solely depend on their inputs. As Map outputs consist
of Weka models (plain Java objects), this should minimize network communi-
cation between the nodes during execution.
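The wrapper idea can be sketched as follows. The Model interface and the count-based stand-in are illustrative substitutes for Weka's classifier interfaces, and plain sequential iteration stands in for Spark's parallel scheduling of the Map phase.

```java
import java.util.List;
import java.util.function.Function;

/** Sketch of the execution model's "wrapper" pattern: a Map function
 *  trains one intermediate model per cached partition and a Reduce
 *  function merges models. In the real system the models wrap Weka
 *  algorithms; here a trivial row-count model stands in for them. */
public class WekaMapReduceSketch {
    interface Model { Model merge(Model other); }

    /** Generic job: map each partition to a model, then reduce. The merge
     *  must be associative and commutative, because Spark gives no
     *  guarantee on the order in which partial models meet. */
    static Model runJob(List<double[][]> partitions,
                        Function<double[][], Model> mapFn) {
        Model result = null;
        for (double[][] p : partitions) {                 // Map phase
            Model m = mapFn.apply(p);                     // stateless: depends only on p
            result = (result == null) ? m : result.merge(m);  // Reduce phase
        }
        return result;
    }

    /** Stand-in "model": per-partition row count, merged by addition. */
    static final class CountModel implements Model {
        final long rows;
        CountModel(long rows) { this.rows = rows; }
        public Model merge(Model o) {
            return new CountModel(rows + ((CountModel) o).rows);
        }
    }
}
```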
4.2 Task Initialisation
This section outlines the essential steps towards the execution of MapReduce
tasks on Spark. These include parsing user provided options, configuring the
application context, configuring the user requested task and submitting the
task for execution in the cluster.
Figure 4.2 displays the application's main thread in pseudo-code that mim-
ics the Scala syntax.
Figure 4.1: Execution Model
Upon submitting a task using the CLI and a set of parameters, this thread is
invoked in the Spark Master. The application parses user options using a cus-
tom text parser, configures the essential environment parameters (application
name, total number of cores, per-instance cache memory etc.) and initialises a
Task Executor. If a caching strategy is not provided, the application will com-
pute a custom caching strategy using the implemented Caching Strategy Se-
lection Algorithm (this algorithm is analysed and evaluated in Section 5.4).
Figure 4.3 displays the steps taken by the Task Executor to define a logical
representation of the dataset and to configure the user requested task.
Figure 4.2: WekaOnSpark's main thread
The Task Executor begins the execution procedure by defining an RDD
(“rawData”) from a file on HDFS. As briefly discussed earlier, RDD Trans-
formations are “lazy”: until an Action is issued (a Reduce operator in this
case), the RDD will not be materialised. Thus, the RDD at this stage is logical.
It contains the path upon which it will be created and the caching mechanism
that will be used.
Weka processes data in a special purpose object format known as Instances.
This object contains a header (meta-data about the attributes) and an array of
Instance objects. Each Instance object contains a set of attributes which rep-
resents the raw data. HDFS data on Spark are defined as RDDs of Java String
objects. Thus, a Transformation is needed to parse the strings and build an In-
stances object for each partition. This is achieved by defining a new RDD
(“dataset”), based on the previous RDD (“rawData”) and a Map function.
These Transformations are automatically logged by Spark into a lineage
graph. Figure 4.4 displays the state of the graph after the steps described
above.
Figure 4.3: Task Executor
At this stage, the Task Executor will use the newly defined RDD as an ini-
tialisation parameter for the user requested task. These tasks will add their
own Transformations to the graph. When an Action is issued (a Reduce func-
tion), Spark will schedule the cluster instances to build the RDD partitions
from their local dataset partitions, and to materialise the Transformations in
parallel.
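The "lazy" behaviour described above can be illustrated with Java streams, whose intermediate operations are likewise only recorded until a terminal operation (the analogue of a Spark Action) forces execution. The sketch is an analogy only; Spark additionally logs the lineage graph for fault recovery.

```java
import java.util.Arrays;
import java.util.stream.Stream;

/** Lazy evaluation in miniature: the map below is only recorded, and no
 *  parsing happens until the terminal reduce is invoked. */
public class LazyDemo {
    static int parsedLines = 0;

    static int[] run() {
        parsedLines = 0;
        Stream<Integer> dataset = Arrays.asList("1", "2", "3").stream()
            .map(s -> { parsedLines++; return Integer.parseInt(s); }); // logged, not run
        int before = parsedLines;                   // still 0: map not yet executed
        int sum = dataset.reduce(0, Integer::sum);  // "Action": forces the pipeline
        return new int[]{before, sum, parsedLines}; // {0, 6, 3}
    }
}
```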
The following sections present the implementation of the tasks supported
by the framework.
4.3 Headers
Weka pays particular attention to meta-data about the analysed datasets. An
essential initial step is to compute the header of the ARFF file (Weka's suppor-
ted format, represented by the aforementioned Instances object at runtime). A
header contains attribute names, types and multiple statistics including min-
imum and maximum values, average values and class distributions of nominal
attributes.
Figure 4.5 displays the MapReduce job that computes the dataset's header
file.
Figure 4.4: Lineage Graph
The job requires the attributes names, the total number of attributes and a
set of options. These parameters are used by the Map function to define the
expected structure of the dataset.
Map functions compute partition statistics in parallel. Figure 4.6 displays
the implementation of the Map function.
Reduce functions receive input from the Map phase and aggregate partition
statistics to global statistics. Figure 4.7 displays the implementation of the Re-
duce function.
Figure 4.5: Header creation MapReduce job
Figure 4.6: Header Creation Map Function
Figure 4.7: Header Creation Reduce Function
This procedure is only mandatory for nominal attributes, but it can be invoked on any attribute type. Upon creation, Headers are distributed to the next
MapReduce stages as an initialisation parameter. This procedure is required
only once for each dataset; upon creation Headers can be stored in HDFS and
retrieved upon request.
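For a single numeric attribute, the Map and Reduce logic of the header job can be sketched as follows. Field names are illustrative; the real job builds a complete Weka Instances header covering every attribute.

```java
/** Sketch of Header creation for one numeric attribute: Map computes
 *  partition-level statistics, Reduce merges them into global statistics. */
public class HeaderSketch {
    static final class AttrStats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        long count = 0;

        /** Reduce: associative and commutative merge of partials. */
        AttrStats merge(AttrStats o) {
            AttrStats r = new AttrStats();
            r.min = Math.min(min, o.min); r.max = Math.max(max, o.max);
            r.sum = sum + o.sum; r.count = count + o.count;
            return r;
        }
        double mean() { return sum / count; }
    }

    /** Map: statistics over one partition's attribute values. */
    static AttrStats mapPartition(double[] values) {
        AttrStats s = new AttrStats();
        for (double v : values) {
            s.min = Math.min(s.min, v); s.max = Math.max(s.max, v);
            s.sum += v; s.count++;
        }
        return s;
    }
}
```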
4.4 Classification and Regression
Classifiers and Regressors are used to build prediction models on nominal and
numeric values respectively. Although many learning algorithms in these cat-
egories are iterative, both training and testing phases can be completed in a
single step using asynchronous parallelism.
It is important to emphasize at this stage the performance improvement
offered by Spark in multi-phase execution plans. Once the dataset is loaded
into main-memory during the Header creation phase, Spark maintains a cached copy of the dataset until explicitly instructed to discard it. This feature offers signi-
ficant speed-up in consecutive MapReduce phases, because the redundant
HDFS accesses required by Hadoop are avoided.
4.4.1 Model Training
Once the Headers are either computed or loaded from persistent storage,
Spark schedules slave instances to begin the training phase. Every instance
possesses a number of cached partitions and trains a Weka model against each
partition, using a Map function. Classifiers and Regressors are represented in
Weka by the same abstract object.
Figure 4.8 displays the implementation of the model training Map function.
By using Meta-Learning techniques, the intermediate models are aggregated by a Reduce function into a final model.
Depending on the characteristics of the trained model the final output may
be:
• A single model, in case the intermediate models can be aggregated
(where a model of the same type as the inputs can be produced)
• A Voted Ensemble of models, in case intermediate models cannot be
aggregated.
Figure 4.9 displays the implementation of the model aggregation Reduce
function.
Figure 4.8: Model Training Map Function
Trained models can either be used directly for testing unknown data objects or
be stored in HDFS for future use.
4.4.2 Model Testing and Evaluation
Once a trained model is either computed or retrieved from persistent storage,
the model Evaluation phase can be completed in a single MapReduce step.
The trained model is distributed to the slave instances as an initialisation
parameter to the Evaluation Map functions. During the Map phase, each in-
stance evaluates the model against its local partitions and produces the inter-
mediate evaluation statistics.
Figure 4.10 displays the classifier Evaluation Map function.
Figure 4.9: Model Aggregation Reduce Function
Reduce functions produce the final output by aggregating intermediate res-
ults. Figure 4.11 displays the implementation of the evaluation Reduce func-
tion.
In a similar fashion, trained models can be used to classify unknown in-
stances.
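The evaluation Map/Reduce pair (Figures 4.10 and 4.11) follows the same shape. The sketch below is illustrative Python, with a hypothetical one-line model standing in for a trained Weka classifier: each Map computes partial statistics against its local partition, and the Reduce sums them into the final evaluation.

```python
def evaluate_map(model, partition):
    """Evaluate the distributed model against one partition and emit
    intermediate evaluation statistics."""
    correct = sum(1 for x, label in partition if model(x) == label)
    return {"correct": correct, "total": len(partition)}

def evaluate_reduce(s1, s2):
    """Aggregate intermediate evaluation statistics."""
    return {k: s1[k] + s2[k] for k in s1}

# A stand-in "trained model": classify non-negative inputs as "pos".
model = lambda x: "pos" if x >= 0 else "neg"
parts = [[(1, "pos"), (-1, "neg")], [(2, "neg"), (-3, "neg")]]
stats = evaluate_reduce(*(evaluate_map(model, p) for p in parts))
accuracy = stats["correct"] / stats["total"]  # 3 correct out of 4
```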
4.5 Association Rules
Association Rules are computed in parallel using a custom MapReduce imple-
mentation of the Partition [30] algorithm. Partition requires two distinct
phases to compute association rules on distributed datasets. In the candidate
generation phase, a number of candidate rules are generated in each partition.
In the candidate validation phase, the true support and significance metrics are
computed for all candidates, and those that do not meet the global criteria are
pruned.
Figure 4.10: Classifier Evaluation Map Function
Figure 4.11: Evaluation Reduce Function
Figure 4.12 displays the two distinct execution phases of the Association
Rule Learning job.
The user defines a support threshold and, optionally, a threshold on any Weka-
supported measure of significance (by default, confidence is used). A number
of Map functions proceed to mine the partitions in parallel using a Weka
association rule learner and generate candidate rules. A rule is considered a
candidate if the global significance criteria are met in any of the partitions.
Candidate rules are exported from the Map functions using a hash-table.
Figure 4.13 displays the candidate generation Map function.
Reduce functions aggregate multiple hash-tables and produce a final set of
candidates. The hash-table data structure was selected because it enables
near-constant lookup time.
Figure 4.14 displays the candidate generation and validation Reduce func-
tion.
Figure 4.12: Association Rules job on Spark
Figure 4.13: Candidate Generation Map Function
In the validation phase, each Map function receives the set of candidates
and computes support metrics for every rule.
Figure 4.15 displays the Validation phase Map function.
The validation Reduce phase uses the same Reduce function to aggregate
the metrics across all partitions. Each rule that fails to meet the global criteria
is pruned. The rules are sorted on the requested metrics and returned to the
user.
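The two phases can be sketched as follows. This Python fragment is a simplified illustration of the Partition idea, mining frequent single items rather than full Weka association rules; the thresholds and baskets are invented for the example. Phase 1 emits locally frequent items as candidates; phase 2 counts true global support and prunes.

```python
from collections import Counter

def candidate_map(partition, min_support):
    """Phase 1: mine one partition and emit items that are locally frequent."""
    counts = Counter(item for basket in partition for item in basket)
    local_min = min_support * len(partition)
    return {item for item, c in counts.items() if c >= local_min}

def validate_map(partition, candidates):
    """Phase 2: count the true support of every global candidate."""
    counts = Counter()
    for basket in partition:
        for item in candidates & basket:
            counts[item] += 1
    return counts

partitions = [[{"a", "b"}, {"a"}], [{"b"}, {"b", "c"}]]
min_support = 0.5
candidates = set().union(*(candidate_map(p, min_support) for p in partitions))
totals = sum((validate_map(p, candidates) for p in partitions), Counter())
n = sum(len(p) for p in partitions)
frequent = {i for i in candidates if totals[i] >= min_support * n}  # {"a", "b"}
```

The key Partition property illustrated here is that any globally frequent item must be locally frequent in at least one partition, so the union of local candidates cannot miss a true result.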
Figure 4.14: Candidate Generation and Validation Reduce Function
Figure 4.15: Validation Phase Map Function
4.6 Clustering
Although the execution model can theoretically be applied to any Clusterer,
in practice it was found to be usable only for the Canopy Clusterer [70].
Canopies divide the dataset into overlapping regions using a cheap distance
metric. Canopies within a threshold are assumed to represent the same region
and can thus be aggregated. Map functions can be used to build Canopy
Clusterers on partitions in parallel, and Reduce functions to aggregate
Canopies covering the same region.
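The Canopy aggregation idea can be illustrated with a minimal one-dimensional sketch. This is plain Python, not Weka's Canopy implementation; the threshold and points are invented for the example.

```python
def canopy_map(points, t):
    """Build canopy centres in one partition using a cheap 1-D distance:
    a point becomes a new centre only if no existing centre is within t."""
    centres = []
    for p in points:
        if all(abs(p - c) > t for c in centres):
            centres.append(p)
    return centres

def canopy_reduce(centres, t):
    """Merge centres from different partitions: centres within the threshold
    are assumed to describe the same region and are averaged."""
    merged = []
    for c in sorted(centres):
        if merged and abs(c - merged[-1]) <= t:
            merged[-1] = (merged[-1] + c) / 2
        else:
            merged.append(c)
    return merged

parts = [[0.0, 0.4, 5.0], [0.2, 5.3, 9.0]]
local = [c for p in parts for c in canopy_map(p, t=1.0)]
centres = canopy_reduce(local, t=1.0)  # one centre each near 0, 5 and 9
```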
Other clustering algorithms do not share this property. Aggregation would
require the use of consensus clustering techniques. However, Weka does not
support consensus clustering.
In a different approach, Map functions could be used to compute distances,
assign data points to cluster centres and update the position of each cluster
centre. Reduce functions could receive the per-partition cluster centres, com-
pute their average values and assess the stop condition. If the stop condition is
not met, the new cluster centres would be distributed to a new set of Map
functions and begin an additional iteration. This process was found to be in-
feasible in practice, because Weka, as of August 2014, does not support a
mechanism to explicitly define cluster centres at the beginning of each itera-
tion.
The Weka core development team (Dr. Mark Hall) was contacted regarding
these issues. It was concluded that clustering algorithms cannot be efficiently
expressed using the implemented model. The methodology assumes that Weka
algorithms can be encapsulated in MapReduce wrappers.
This was found to be infeasible for clustering algorithms due to Weka's limita-
tions. Therefore, distributed clustering using Weka remains an open research
issue.
4.7 Summary
This chapter presented a scalable methodology to execute Weka algorithms in
a distributed environment using the MapReduce paradigm. The following
chapter analyses the benchmarking results of the presented multi-tier architec-
ture and the proposed execution model.
5 System Evaluation
This chapter presents the Evaluation Metrics (5.1), the System Configuration
(5.2), the Evaluation Results (5.3) and the Caching Strategy Selection Al-
gorithm (5.4).
5.1 Evaluation Metrics
The system was evaluated through a number of experiments on AWS. The
evaluation assesses the system on four different metrics:
• Elapsed Execution Time: The system's execution time on different
tasks is measured across multiple task submissions, using multiple
algorithms, dataset sizes and cluster sizes;
• Memory Utilisation: The system's main-memory utilisation is assessed
under seven different caching schemes. The results are cross-validated
against the other system parameters;
• IO Utilisation: Disk and network bandwidth are limited and much
slower to access than main-memory. Input and output streams are
monitored in order to determine and, where possible, remove potential
performance bottlenecks;
• CPU Utilisation: CPU usage is monitored to determine the CPU-time
overheads introduced by different parameters.
The system's scalability is assessed in multiple execution scenarios. More spe-
cifically, two different scalability metrics are computed [20]:
1. Weak scaling: The per-instance problem size remains constant and ad-
ditional instances are used to tackle a bigger problem. Weak scaling ef-
ficiency (as percentage of linear) can be computed using the following
formula [20]:
S = (T1 / Tn) × 100%
Where T1 is the elapsed execution time for one work unit using a single instance
and Tn is the same metric for N work units on N instances. Linear weak scaling
would be achieved if execution times remained constant, regardless of the scale.
2. Strong scaling: The total problem size remains constant and additional
instances are assigned to speed-up computations. Strong scaling effi-
ciency can be computed using the following formula [20]:
S = T1 / (N × Tn) × 100%
Where T1 is the elapsed processing time for one work unit using a single
instance and Tn is the elapsed time for the same workload using N instances.
Linear strong scaling would be achieved if execution times decreased
proportionally to the number of processing instances.
An ideal parallel system would have both linear strong and weak scaling.
In practice, overheads associated with distributing computations lead to sub-
linear performance. Resource utilisation monitoring is used to determine the
causes of sub-linear performance and propose modifications.
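Applying the two formulas to the SVM execution times reported later (Table 5.1) gives a concrete sense of the metrics; the following is a minimal Python sketch of that arithmetic.

```python
def weak_efficiency(t1, tn):
    """Weak scaling: S = (T1 / Tn) * 100, with N work units on N instances."""
    return t1 / tn * 100

def strong_efficiency(t1, tn, n):
    """Strong scaling: S = T1 / (N * Tn) * 100, fixed problem on N instances."""
    return t1 / (n * tn) * 100

# SVM times from Table 5.1: 135 s on 8 cores / 5GB (baseline work unit),
# 147 s on 128 cores / 80GB (16x the data on 16x the cores),
# 32 s on 128 cores / 5GB (same data on 16x the cores).
weak = weak_efficiency(135, 147)         # ~91.8% of linear
strong = strong_efficiency(135, 32, 16)  # ~26.4% of linear
```

The weak-scaling figure is within 10% of linear, while the strong-scaling figure on the small dataset is low, consistent with the fixed initialisation overheads discussed in Section 5.3.2.2.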
The system is compared against core Weka's distributed implementation on
Hadoop (Weka-On-Hadoop). It is interesting to compare the two systems in
terms of execution time, scalability and resource efficiency. Additional com-
parisons are drawn against competing systems in the surveyed literature. It is
of course difficult to compare against proprietary solutions because accurate
details about the benchmarking procedures are usually not provided.
5.2 System Configuration
Benchmarking was performed using a single Spark Master node in three dif-
ferent cluster configurations:
1. Small scale: Two 4-core m3.xlarge slave instances, possessing 28.2GB
of main-memory;
2. Medium scale: Eight m3.xlarge slave instances, possessing 32 cores
and 112.2GB of main-memory in total;
3. Large scale: Thirty-two m3.xlarge slave instances, possessing 128
cores and 451.2GB of main-memory in total.
The dataset scales used in the evaluation process were proportional to the
cluster sizes:
1. Small scale: 5GB datasets;
2. Medium scale: 20GB datasets;
3. Large scale: 80GB datasets.
The decision to use these data volumes was based on Ananthanarayanan et
al. [71], who analysed the access patterns of data mining tasks at Facebook.
According to this report, 96% of tasks processed data that could be stored
in only a fraction of main-memory (assuming 32GB of main-memory per
server). In a similar project by Microsoft Research, Appuswamy et al. [72]
report that the majority of real-world Data Mining tasks process less than
100GB of input. These observations suggest that, although it is technically
possible to process petabyte-scale datasets in a single task, it is uncommon
in practice.
Finally, each of the three implemented categories of algorithms is represen-
ted by a commonly used algorithm:
1. Regression algorithms using Linear Regression;
2. Classification algorithms using SVM;
3. Association Rule Learning algorithms using FP-Growth.
Further work could investigate both a wider set of algorithmic approaches and
a more exhaustive set of data-sizes.
It is important to emphasize that the computing resources on which the
benchmarks were executed are virtualised. Consequently, although AWS
follows rigorous quality assurance procedures, the advertised resources of the
instances may differ marginally between experiments. This variability can be
attributed to hardware heterogeneity in Amazon's data-centres and interference
between different virtual machines hosted on the same physical hardware
[73]. In order to alleviate the effects of multi-tenancy on the physical host,
large compute instance sizes were selected. This approach decreases the
chance of “noisy neighbours” [74].
The experiments were repeated multiple times and the analysis is based on
average values. However, the results did not vary as much as expected based
on the literature survey.
5.3 Evaluation Results
The following presents and analyses the benchmarking results, computes the
system evaluation metrics and assesses the efficiency of the implemented
solution.
5.3.1 Execution Time
This section analyses the performance of the system during the benchmarking
experiments. The execution times are compared with an identical workload
on Hadoop.
Table 5.1 displays the execution times of the system on the SVM bench-
mark. Three system sizes were tested against three different dataset scales.
Table 5.2 displays the execution times of the aforementioned benchmark on
Weka-On-Hadoop.
Weka-On-Spark SVM (sec)
            5GB    20GB   80GB
8 Cores     135    506    2008
32 Cores    45     139    551
128 Cores   32     57     147
Table 5.1: Execution Times for SVM on Weka-On-Spark
Weka-On-Hadoop SVM (sec)
            5GB    20GB   80GB
8 Cores     287    900    3389
32 Cores    135    317    927
128 Cores   127    129    371
Table 5.2: Execution Times for SVM on Weka-On-Hadoop
Figure 5.1 plots the execution time results across all dataset scales and
cluster sizes.
Speed-up can be defined as [5]: S = Told / Tnew
Where Told represents the elapsed execution time of the system prior to the
introduction of the proposed improvement and Tnew is the elapsed execution
time on the improved system.
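As a worked example, the speed-up values can be recomputed directly from the execution times in Tables 5.1 and 5.2 (a short Python sketch; rounding to two decimal places follows Table 5.3).

```python
def speed_up(t_old, t_new):
    """S = T_old / T_new: Hadoop time over Spark time for the same task."""
    return t_old / t_new

# Execution times from Table 5.2 (Hadoop) and Table 5.1 (Spark):
s1 = round(speed_up(287, 135), 2)  # 2.13 (8 cores, 5GB)
s2 = round(speed_up(135, 45), 2)   # 3.0  (32 cores, 5GB)
```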
Table 5.3 displays the achieved speed-up of Weka-On-Spark compared to
Weka-On-Hadoop on identical tasks.
Figure 5.1: Execution times for SVM on Weka-On-Hadoop and Weka-On-Spark
The use of Spark on distributed Weka workloads speeds up computations
by up to four times, and by a factor of 2.36 on average. It is important to
divide these experiments into two distinct categories:
1. Experiments where full dataset caching was possible on Weka-On-
Spark;
2. Experiments where only partial dataset caching was possible.
The first category is shaded in Table 5.3.
Full dataset caching achieves an average speed-up of 2.7x. Although
Weka's caching was enabled on Hadoop, and thus the iterations SVM requires
to converge were performed in-memory at each partition, the task requires at
least two stages: Header Creation and Classifier Training. Weka-On-Hadoop
must reload the dataset at each stage, parse the rows to build the supported
format and initialise the Map tasks. In contrast, Weka-On-Spark retains the
dataset in-memory and avoids multiple initialisations by re-using the task
threads of the first stage.
In cases where full caching was not possible, the average speed-up was
measured at 1.7x. Although both systems are forced to reload partitions from
HDFS, Weka-On-Spark achieves superior performance by better leveraging
the underlying resources.
Speed-up     5GB    20GB   80GB   Average
8 Cores      2.13   1.78   1.69   1.86
32 Cores     3.00   2.28   1.68   2.32
128 Cores    3.97   2.23   2.52   2.91
Average      3.03   2.10   1.96   2.36
Table 5.3: Speed-up
Tables 5.4 and 5.5 display the average CPU utilisation of Weka-On-Spark
and Weka-On-Hadoop during each experiment.
Weka-On-Spark consumes on average 27.1% more CPU time. Both systems
execute the same implementation of the SVM algorithm in the same
environment (JVM), and they perform the same amount of computation per
byte of data. Additionally, both systems use HDFS and suffer from the same
data-fetching latencies. Consequently, the workloads are identical. Weka-On-
Spark saturates the CPU and thus achieves higher throughput and faster
execution times. In order to identify the reason behind this behaviour, it is
necessary to examine the systems from a different perspective.
Tables 5.6 and 5.7 display the average main-memory utilisation during the
aforementioned benchmark on Weka-On-Spark and Weka-On-Hadoop
(shaded cells indicate that full dataset caching was possible).
SVM on Weka-On-Hadoop (CPU %)
            5GB      20GB     80GB
8 Cores     67.00%   71.00%   67.40%
32 Cores    73.90%   73.40%   72.10%
128 Cores   65.00%   74.20%   76.10%
Table 5.5: CPU Utilisation of Weka-On-Hadoop

SVM on Weka-On-Spark (CPU %)
            5GB      20GB     80GB
8 Cores     98.00%   98.70%   98.00%
32 Cores    99.00%   98.60%   98.10%
128 Cores   97.20%   97.60%   99.10%
Table 5.4: CPU Utilisation of Weka-On-Spark
Weka-On-Hadoop demonstrates stable memory footprints across the exper-
iments. The system loads a partition for each active Map container into mem-
ory, executes the Map task, discards the processed partition and repeats the
procedure until the whole dataset is processed. In contrast, Weka-On-Spark
loads partitions until memory saturation and schedules the Map tasks to
process in-memory data.
In cases where the available memory is larger than the dataset, Weka-On-
Spark's approach to cache the dataset has obvious benefits. Successive stages
process the same RDDs and the need to reload and rebuild the dataset in the
required format is avoided.
In cases where the dataset cannot be fully cached, Spark applies a partition
replacement policy in which the Least Recently Used (LRU) partition is
replaced. Under this policy it is highly unlikely that successive stages will
find the required partitions in-memory. Thus, partitions are loaded from disk,
as in Weka-On-Hadoop. However, there is a big difference between the
mechanisms Hadoop and Spark use to implement this process.
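The LRU policy can be sketched as follows. This is an illustrative Python model with a hypothetical `load` callback, not Spark's actual BlockManager: an access moves a partition to the "recently used" end, and a miss on a full cache evicts the least recently used entry.

```python
from collections import OrderedDict

class LRUPartitionCache:
    """Minimal LRU replacement policy of the kind Spark applies when the
    distributed cache is full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def access(self, partition_id, load):
        if partition_id in self.cache:
            self.cache.move_to_end(partition_id)   # mark as recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)     # evict least recently used
            self.cache[partition_id] = load(partition_id)
        return self.cache[partition_id]

cache = LRUPartitionCache(capacity=2)
for pid in ["p1", "p2", "p1", "p3"]:   # p2 is evicted; p1 and p3 survive
    cache.access(pid, load=lambda p: f"data({p})")
resident = set(cache.cache)            # {"p1", "p3"}
```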
Hadoop reads HDFS partitions using an iterator. Map tasks read an HDFS
partition line-by-line (each line is represented by a key-value pair), process
each line and emit intermediate key-value pairs if necessary. In the specific
SVM on Weka-On-Spark (Memory %)
            5GB      20GB     80GB
8 Cores     71.80%   92.10%   94.10%
32 Cores    21.10%   72.10%   96.10%
128 Cores   9.10%    25.20%   72.90%
Table 5.6: Main-memory utilisation of Weka-On-Spark

SVM on Weka-On-Hadoop (Memory %)
            5GB      20GB     80GB
8 Cores     23.00%   27.10%   26.90%
32 Cores    10.80%   11.20%   14.10%
128 Cores   6.90%    8.90%    9.20%
Table 5.7: Main-memory utilisation of Weka-On-Hadoop
case of Weka-On-Hadoop, the partitions are read line-by-line, each line is pro-
cessed by a parser and then added to an Instances object (Weka's dataset rep-
resentation). When this procedure is completed, the Map tasks execute the
SVM algorithm and iterate over the data until the algorithm converges. When
the model is built, Hadoop emits the trained model, discards the data and
schedules the Mapper to process a new partition. Thus, reading data from
HDFS is coupled with data processing: while the system is reading data, the
CPU is idle and while the system is processing data the I/O subsystem is idle.
This process leads to suboptimal resource utilisation: CPU cycles are wasted
and I/O bandwidth is never saturated.
Spark resolves this issue by introducing a main-memory abstraction (the
RDD), which decouples the two phases. Map tasks process RDD partitions
that are already in-memory. As the system is not required to wait for I/O and
reads directly from main-memory, maximum CPU utilisation is achieved.
Additionally, Spark evicts older partitions from the distributed cache and
fetches the next set of partitions from HDFS regardless of the task execution
phase. This enables data to be read at a faster rate (reading is performed at the
block level rather than the line level) and overlaps data loading with data
processing. These two features, alongside the aforementioned shorter
initialisation times, contribute to a significant speed-up over the Hadoop-
based solution.
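The decoupling of loading from processing can be illustrated with a minimal producer-consumer sketch. Plain Python threads stand in for Spark's fetch and compute machinery here; the block contents and the `process` function are invented for the example.

```python
import queue
import threading

def prefetching_pipeline(blocks, process):
    """Overlap block loading with processing using a bounded queue, in the
    spirit of Spark's decoupled fetch/compute phases."""
    q = queue.Queue(maxsize=2)
    results = []

    def loader():
        for b in blocks:
            q.put(b)      # simulated HDFS block read
        q.put(None)       # end-of-stream sentinel

    t = threading.Thread(target=loader)
    t.start()
    while (block := q.get()) is not None:
        results.append(process(block))  # CPU works while the loader reads on
    t.join()
    return results

out = prefetching_pipeline([1, 2, 3], process=lambda b: b * 2)  # [2, 4, 6]
```

In the coupled Hadoop-style design the loader and the processor would run strictly in turn, leaving one of the two resources idle at any moment.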
5.3.2 Scaling Efficiency
This section presents the weak and strong scaling efficiencies of the system,
as measured during the experiments on AWS. The scaling efficiency percent-
ages are computed using the formulas of Section 5.1.
5.3.2.1 Weak Scaling
Figure 5.2 demonstrates the system's weak scaling efficiency for the three
different algorithms used for benchmarking. The figure also presents the weak
scaling efficiency of the SVM algorithm on Hadoop.
The system's execution times approach ideal linear performance to within
10% for clusters of up to 128 cores in all cases. In contrast, Hadoop's weak
scaling efficiency decreases as the scale increases.
As discussed earlier, a slight decline in performance is expected in fully
distributed systems. This behaviour is associated with monitoring multiple
instances and load balancing data across the cluster. However, Spark achieves
high locality in consecutive stages, and these effects are minimal in multi-stage
execution plans.
5.3.2.2 Strong Scaling
Figures 5.3, 5.4 and 5.5 demonstrate the system's strong scaling efficiency on
the three different categories of algorithms. Each figure depicts the system's
strong scaling on the small, medium and large scale datasets used in the exper-
iments. Full raw data can be found in Appendix 1.
Figure 5.2: Weak Scaling Efficiencies
Figure 5.4: Strong Scaling for Linear Regression
Figure 5.3: Strong Scaling for SVM
Strong scaling efficiencies on Spark approach linearity when datasets are
large and runtime is dominated by computations. Using large clusters for
small scales proves inefficient due to constant initialisation overheads.
These overheads were measured at 11 seconds, regardless of the scale of the
system. On the large cluster (128 cores), this corresponds to 40.8% of the
average total execution time at the small scale (5GB) and to 20.3% at the
medium scale (20GB). As the dataset size increases, runtime is dominated
by computations and these overheads become minimal compared to the total
execution time.
For comparison purposes, Figure 5.6 illustrates the strong scaling effi-
ciency of Hadoop on SVM.
Figure 5.5: Strong Scaling for FP-Growth
Weka-On-Hadoop's strong scaling efficiency is inferior due to larger
initialisation overheads. Hadoop's initialisation cost was measured at 23
seconds. This overhead is introduced at the beginning of every MapReduce
stage, whereas in Spark it is only incurred in the first stage.
Additional comparisons can be drawn against similar workloads reported
in the literature survey. Wegener et al. [59] disabled Weka's caching and
forced the algorithms to read directly from persistent storage in a Hadoop
cluster. The reported strong scaling was nearly 88%, but the reported execution
times on NaiveBayes were more than an order of magnitude slower. Jha et al.
[75] benchmarked Spark on Data Mining workloads and report similar scaling
efficiencies to those presented above. Additionally, Jha compares Spark
against Hadoop, Mahout, distributed Python scripts and MPI [76]. Data
Mining on Spark outperforms all the competing systems in scaling efficiency,
but it is 50% slower than MPI. Given that MPI is a low-level paradigm, this
result was expected. In MPI, the user must program the data placement, node
communication, scheduling etc. explicitly, leading to long development times.
Finally, Ricardo [52], an R- and Hadoop-based system, reports 83% scaling
efficiency but is 100% slower than Hadoop.
Figure 5.6: Strong Scaling on Weka-On-Hadoop
Where raw data are not provided, the scaling efficiency numbers are extracted
from the figures of the surveyed publications. As the scale intervals are
usually large, these numbers are approximate and may contain errors.
5.3.3 Main-Memory Caching
One of the objectives of this project was to study the effects of different cach-
ing strategies on BDM workloads. This section presents and analyses the ex-
perimental results from a main-memory perspective.
5.3.3.1 Caching overheads
Spark RDDs are represented in memory as distributed Java objects. These
objects are very fast to access and process, but they may consume up to five
times more memory than the raw data they hold. This overhead can be
attributed to the meta-data that Java stores alongside objects and the memory
consumed by objects' internal pointers.
For example, a Java String introduces approximately 40 bytes of overhead
(the object header, the internal character array and fields such as the string
length), and each character consumes two bytes. Consequently, a 10-character
String requires about 60 bytes of main memory. However, Spark offers a
series of tools to tackle these overheads, introducing serialisation and
compression, as well as an efficient mechanism for falling back to disk-based
persistence.
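These overhead figures can be captured in a back-of-the-envelope estimator. The following Python sketch is illustrative only; the constants are the figures quoted above, not measurements.

```python
def java_string_footprint(n_chars, overhead_bytes=40, bytes_per_char=2):
    """Approximate Java String footprint: fixed object overhead plus the
    per-character cost quoted above."""
    return overhead_bytes + bytes_per_char * n_chars

def worst_case_rdd_size(raw_bytes, blowup=5):
    """Worst-case in-memory RDD size: up to 5x the raw on-disk data."""
    return raw_bytes * blowup

footprint = java_string_footprint(10)  # 60 bytes, the example above
```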
The selection of an efficient caching strategy demands consideration of
these overheads. A number of different dataset samples from the UCI Machine
Learning repository [77] and Stanford SNAP [78] were tested in order to
measure memory overheads for different categories of datasets.
A simple application that loads a dataset from HDFS, builds an RDD and
reports the on-disk and RDD sizes was implemented. Tables 5.8 and 5.9 dis-
play the RDD size as percentage of the original on-disk value.
Uncompressed main-memory footprints vary greatly and can reach 600%
of the original dataset size. However, serialised objects demonstrate footprints
close to the on-disk values in all cases. Compression yields additional
reductions of 50% and 75% for dense and sparse datasets respectively.
Table 5.8: RDD size as percentage of the original on-disk value (I)
Table 5.9: RDD size as percentage of the original on-disk value (II)

Dataset                  Type of data              Java Ser. and  Kryo Ser. and
                                                   Compression    Compression
Susy                     Structured Numeric        52.00%         51.60%
Higgs                    Structured Numeric        50.80%         50.80%
Generated                Structured String Sparse  22.45%         22.45%
USCensus                 Structured Numeric        38.08%         38.08%
Record Linkage           Structured String         44.00%         43.00%
KDD data '10             Structured String         47.00%         47.00%
adult                    Structured String         53.16%         53.16%
supermarket              Structured String Sparse  26.32%         26.32%
Million Song Dataset     Structured Numeric        55.00%         54.38%
Wiki articles            Text                      44.12%         44.12%
Social circles: Twitter  Graph                     44.55%         43.64%
Epinions social network  Graph                     48.15%         44.44%
Google Web Graph         Graph                     40.00%         38.18%
Kryo serialisation shows marginally better compression ratios than the
built-in Java serialisation.
In order to further assess the benefits of caching, a performance analysis of
different caching strategies is required. The following subsection presents
benchmarking results of seven different caching schemes at large scales.
5.3.3.2 Caching and Performance
Memory consumption followed a predictable pattern, regardless of the
caching scheme in use. Initially, the system builds the RDDs by loading parti-
tions from secondary storage. After the completion of the initialisation pro-
cess, multiple algorithms can be deployed to process the data and memory
consumption is dependent on the internal object structure of the algorithms.
Figure 5.7 illustrates the main-memory time-line of four consecutive experi-
ments on different caching schemes.
Figure 5.8 presents the average reduction in main-memory consumption of
the caching strategies used in benchmarking experiments. Figure 5.9 displays
the average execution time overhead of each caching strategy.
Full raw data are given in Appendix A.
Figure 5.7: Main-memory time-line
Figure 5.9: Execution Time Overhead (% of default in-memory caching)
Figure 5.8: Main-Memory Use Reduction (% of default in-memory caching)
Serialisation and compression mechanisms yield significant memory
footprint reductions at a small performance penalty (~5%). Additionally, as
depicted in Tables 5.8 and 5.9, memory footprints can be predicted under
those schemes, regardless of the data-type. Consequently, the experiments
indicate that serialising and compressing datasets is beneficial.
However, main-memory remains a scarce resource. In multi-tenant
environments, it is often necessary to operate with limited memory. In order
to determine performance under edge conditions, main-memory was limited
to 700MB (out of 28.2GB) per instance and the experiments were repeated.
The system, to the author's surprise, managed to use disk caching effectively
and successfully execute the requested tasks. This behaviour is achieved by
temporarily caching serialised objects to local disks.
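A simple decision rule consistent with these findings can be sketched as follows. This is illustrative Python; the strategy names are descriptive labels rather than Spark storage-level constants, and the rule itself is an assumption drawn from the measurements above.

```python
def choose_caching_strategy(dataset_bytes, free_memory_bytes):
    """Illustrative selection rule: serialised, compressed in-memory caching
    when the dataset fits (roughly 5% slower than plain objects but with a
    far smaller footprint); otherwise spill serialised partitions to local
    disk, at roughly 25% execution-time overhead."""
    if dataset_bytes <= free_memory_bytes:
        return "memory-serialised-compressed"
    return "memory-and-disk-serialised"

strategy = choose_caching_strategy(5 * 2**30, 28 * 2**30)  # 5GB data, 28GB free
```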
Disk caching on limited memory requires a disk-write for every partition of
the dataset. Due to disk latencies, disk caching leads to an execution time
overhead which was measured at 25% on average. This behaviour was con-
sistent across all experiments.
Although this practice may be appealing in constrained environments,
observing the persistent storage access patterns reveals another implication.
Figure 5.10 depicts the average per-instance disk writes in bytes.
Each instance processed on average 5GB of data during the experiments.
Figure 5.10: Average per-instance disk writes
The disk caching strategy temporarily stores an equivalent volume of
serialised objects in persistent storage. This practice requires the free disk
space to be at
least as large as the total size of the dataset. Where instances are also used to
host distributed databases or other large files, it might be infeasible to allocate
this amount of free space.
Spark offers the option to avoid disk caching and recompute each RDD
partition when needed. This feature was found to be unstable in the
constrained-memory experiments, as many of the submitted tasks failed with
insufficient-memory exceptions. This is attributed to the garbage collector's
inability to evict unused objects faster than the requested memory allocations.
Comparisons can be drawn against competing systems at this point.
GraphLab and Piccolo assume that full in-memory caching is possible. Neither
system possesses a mechanism to tackle main-memory shortages. Although,
as discussed earlier, most workloads can be stored in-memory, this practice is
dangerous in production environments.
Hadoop had consistent main-memory requirements throughout the
experiments. Due to Weka's in-memory data representation, it was necessary
to allocate 2GB of main-memory per process. This corresponds to 50% of the
cluster's memory. The figure can be decreased by requesting smaller disk
blocks (at a performance penalty). However, main-memory usage cannot be
adjusted to the needs of a specific workload, the advantages of caching cannot
be harnessed, and edge cases with limited memory cannot be handled.
The next sub-section provides a deeper analysis of the system's
Input/Output (IO) utilisation.
5.3.4 IO Utilisation
Spark schedules task execution within the cluster based on data placement in-
formation from HDFS. Spark implements a delay scheduling algorithm [79],
which delays task execution for short intervals in order to seek opportunities
to schedule nodes to process local partitions rather than remote ones. If the
period expires and the node that owns the partition is unavailable, the partition
is shipped through the network to an idle node. Zaharia et al. [79] report that
this technique can achieve nearly 100% locality when scheduling batch tasks.
EBS volumes are network-attached drives and use the network interface
of the attached instance. In order to measure network traffic accurately, the
system was forced to use the physically attached ephemeral storage during
network benchmarking.
The application layer was designed to avoid submitting data through the
network, and Spark schedules tasks based on locality. Thus, network
utilisation was expected to be minimal. In practice, the experimental results
demonstrate a spike in network usage during the initialisation phase. Figure
5.11 illustrates the aggregated network traffic across the cluster on three
consecutive medium-scale experiments. The spike occurs during the Header
creation phase, which is common to all algorithms in Weka-on-Spark. The
sum across all instances equals 87.5% of the dataset size.
This behaviour is attributed to the mechanism that Spark uses to achieve
balanced loads across the instances. The default partitioning strategy retrieves
the blocks from HDFS and transforms each block into a number of arrays of
string values (each array represents an RDD partition). Each array has a
hash-code, and Spark distributes the generated partitions to the instances
using a hash partitioner. This procedure achieves perfect load balancing but
requires the nodes to submit the majority of the dataset through the network.
Figure 5.11: Network Traffic
After this procedure is completed, Spark was tuned to schedule
transformations based on RDD placements, and perfect locality was achieved.
The Map phase of the execution model produces trained Weka models, which
are less than 50KB on average. Consequently, network traffic is minimal
during multi-stage processing tasks.
In order to determine whether Spark's load balancing strategy can cause a
network bottleneck, it is necessary to analyse disk, network, CPU and
memory metrics. Figure 5.12 plots the network and disk usage of the
aforementioned experiment on a per-instance basis.
Figure 5.12: Per-instance average of network and disk utilisation
Each instance retrieves on average 2.5GB of data from persistent storage,
builds a number of RDD partitions (either user-defined or system default) and
ships each partition to the instance that handles the range of its hash key.
During this experiment, eight 4-core instances were used. Consequently,
assuming a uniform distribution of hash keys, each instance would need to
submit 87.5% of the retrieved data and receive an equivalent amount from
remote nodes. This traffic requires a minimum of 291Mbits/s of network
bandwidth. The network bandwidth for this particular instance type was
experimentally measured (using the IPerf [80] benchmark) at 1.08Gbits/s.
These calculations indicate that network bandwidth was not saturated.
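The 291Mbits/s figure can be reproduced from the per-instance traffic. The
~60-second initialisation window used below is a hypothetical value inferred
from the reported numbers, not a measured quantity, and the class name is
illustrative:

```java
// Back-of-envelope bandwidth requirement: each instance ships
// gbRead * remoteFraction gigabytes during the initialisation phase.
// The phase duration is a parameter; ~60 s is assumed in the test.
class BandwidthEstimate {
    static double requiredMbitsPerSecond(double gbRead, double remoteFraction,
                                         double phaseSeconds) {
        double gigabitsSent = gbRead * remoteFraction * 8;  // bytes -> bits
        return gigabitsSent * 1000 / phaseSeconds;          // Gbit -> Mbit
    }
}
```

With 2.5GB read per instance, an 87.5% remote fraction and a 60-second
phase, this gives roughly 292Mbits/s, well below the measured 1.08Gbits/s
capacity.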
Similar analysis can be applied to persistent storage utilisation. Each
instance retrieves data at a rate of 166.7Mbits/s. Disk read throughput was
measured at 2.4Gbits/s using the IOzone [81] benchmark. These calculations
indicate that disk bandwidth was not saturated either.
Finally, Figure 5.13 illustrates the CPU utilisation of the experiment.
Figure 5.13: CPU utilisation
CPU usage rapidly approaches saturation during the initialisation phase.
Therefore, the system is CPU-bound, and accelerating it further requires
adding more processing power. This is a positive outcome, as it shows that the
system's scaling efficiency approaches linearity at large scales.
Hadoop, on the other hand, demonstrated minimal network usage
(~1MB/s) throughout the experiments. Hadoop implements the same delay
scheduling algorithm as Spark. However, Hadoop does not possess a
main-memory abstraction, and any effort to perform load balancing would
require HDFS intervention. This, combined with Hadoop's slower
initialisation, would produce even larger start-up latencies.
It is important to emphasise that AWS offers a plethora of options as far as
I/O resources are concerned. The aforementioned benchmarks utilised
balanced instance types to demonstrate the system's versatility. In practice,
network-bound applications can benefit from instances with enhanced
networking (10Gbits/s) and disk-bound applications can benefit from
storage-optimised instances with RAID configurations of SSD drives.
Therefore, it is very important to thoroughly benchmark the Batch Execution
and Application Layers, locate performance bottlenecks and optimise the
underlying infrastructure.
5.4 Caching Strategy Selection Algorithm
As discussed extensively earlier, the system offers a plethora of caching
options and enables users to implement custom caching strategies. However,
this practice demands state-of-the-art knowledge of the underlying platform
and extensive experimental evaluation of the different options.
Spark's default caching option is to store Java objects directly in memory.
This practice fills the cache fraction of the executors until saturation and then
recomputes additional partitions on demand.
This method was found to be unstable in practice. In many cases, new
RDD partitions are allocated memory faster than the Garbage Collector (GC)
is able to discard older partitions. Consequently, memory leaks and tasks fail
with main-memory exceptions.
In order to tackle this issue, a custom strategy was implemented. This
strategy is triggered in cases where the user does not explicitly specify a cach-
ing mechanism. The selection process is illustrated in Figure 5.14.
This process uses the file's size on HDFS, the total cluster-wide memory,
the caching fraction of the executors and the maximum overhead as input
parameters. For large text-based files the overhead was experimentally
computed and in the worst-case scenario approaches 500%. The Apache
Spark documentation mentions the same worst-case overhead without
specifying a dataset type. Consequently, the algorithm uses this value as the
default, but allows users to specify the overhead parameter that best matches
their data.
Figure 5.14: Storage Level Selection Process
If the cluster-wide executor cache memory is enough to absorb the dataset
in the worst case, the default caching is used. Uncompressed objects are faster
to access and the CPU overhead of serialisation is avoided.
If the cache size merely approaches the dataset size, serialised objects are
preferred, as they demonstrate stable memory footprints equivalent to the
original file size. Kryo serialisation proved more efficient and is used as the
default option. This process introduces serialisation overhead but decreases
GC overhead and enables up to 5 times more data to be stored in memory.
Compression is additionally used to tackle cases where at least 50% of the
on-disk data can be cached. Compression introduces an additional CPU over-
head but further reduces the memory footprints by 50% compared to serialisa-
tion.
Finally, if the dataset is more than twice as large as the available cache
(and thus the compression mechanism cannot ensure that full caching is
possible), disk caching is used.
This strategy was implemented in a single Scala class and integrated into
the system. When a task is submitted, the input parameters are read from the
application context and the algorithm selects a Storage Level and decides on
the use of serialisation and compression automatically.
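The selection logic can be sketched as follows, using the thresholds
described above: a default worst-case memory overhead of 500%, a roughly
1x footprint for serialised data and a roughly 0.5x footprint with compression
added. Java is used here instead of the Scala of the actual implementation,
and the enum and class names are illustrative, not the real class:

```java
// Illustrative sketch of the storage-level selection described above.
// Names do not correspond to the dissertation's actual Scala class.
enum Storage {
    MEMORY_DESERIALISED,            // plain Java objects
    MEMORY_SERIALISED,              // Kryo-serialised objects
    MEMORY_SERIALISED_COMPRESSED,   // serialised and compressed
    MEMORY_AND_DISK                 // spill overflow to disk
}

class StorageLevelSelector {
    static Storage select(double fileSizeGB, double clusterMemoryGB,
                          double cacheFraction, double overheadFactor) {
        double cacheGB = clusterMemoryGB * cacheFraction;
        if (cacheGB >= fileSizeGB * overheadFactor)
            return Storage.MEMORY_DESERIALISED;          // worst case still fits
        if (cacheGB >= fileSizeGB)
            return Storage.MEMORY_SERIALISED;            // serialised footprint fits
        if (cacheGB >= 0.5 * fileSizeGB)
            return Storage.MEMORY_SERIALISED_COMPRESSED; // compressed footprint fits
        return Storage.MEMORY_AND_DISK;                  // dataset > 2x cache
    }
}
```

For the 20GB experimental workload described below, a 15GB cache selects
compressed serialisation and a 2GB cache falls through to disk caching.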
The proposed strategy was tested against the default caching strategy in a
number of experiments. The workload consisted of a 20GB dataset and the
FP-Growth algorithm on an 8-core cluster. The experiment was repeated using
one thousand (1K) and four thousand (4K) partitions, with 15GB and 2GB of
cache memory.
Table 5.10 displays the elapsed execution times and Table 5.11 the
percentage of tasks that failed across the experiments.
Table 5.10: Execution Times
Table 5.11: Failed Tasks
The custom strategy decreases execution times by up to 25% and
eradicates task failures caused by insufficient main memory. The following
paragraphs explain this behaviour.
In the default strategy, objects are cached deserialised (as uncompressed
Java object structures). This forces the GC to recursively traverse the object
hierarchy before evicting unreferenced objects. Each time memory runs out, a
set of old partitions is garbage collected and a new set is fetched from disk.
As the size of the cache memory decreases, this procedure is repeated more
frequently. If this process is slower than the memory allocations, tasks fail
with main-memory exceptions. Larger partitions contain larger object
hierarchies and thus increase GC overhead (the GC has a larger workload
when it is triggered). This is why larger partitions demonstrated increased
failure rates in the experiments that used the default strategy.
The custom strategy achieves better performance by decreasing both the
GC overhead and the frequency of this procedure. The selection algorithm
assesses the available memory and determines whether the partition
replacement mechanism would be triggered for the given dataset. If so, the
algorithm activates and configures the serialisation and compression
mechanisms.
Serialised and compressed objects are represented as arrays of bytes. The
GC discards each such object as a single entity, regardless of the number of
objects it encapsulates. Thus, the cost of searching object hierarchies for
unused objects is avoided. Additionally, these objects consume up to 10 times
less memory. This enables up to 10 times more data to be stored in memory
and significantly decreases the frequency of the partition replacement
mechanism. However, this process introduces CPU overheads due to
serialisation and compression.
Experimental results demonstrate that the cost of serialisation is lower than
the cost of partition replacement. Additionally, reducing the GC overhead
achieves 100% task completion even in cases where the default strategy fails
repeatedly and aborts the execution.
5.5 Summary
This chapter evaluated the implemented multi-tier architecture from multiple
angles. The evaluation assessed the achievement of the project’s objectives, as
described in Section 1.3.
The system outperformed Weka-on-Hadoop by a factor of 2.36 on average
and by up to four times at small scales. Even in cases where full dataset
caching was not possible, the system sped up computations by a factor of 1.7.
The system met the scalability required by Big Data volumes. Both weak
and strong scaling demonstrated near-linear performance on workloads that
emulated the industry-level volumes used in Data Mining tasks. It is therefore
possible to re-use existing sequential solutions in a distributed context. Com-
parison with other systems in the surveyed literature demonstrates that the
system's scalability is on par with the state-of-the-art in the field.
Multiple caching strategies have been experimentally evaluated. The
analysis demonstrates that serialisation and compression mechanisms are able
to significantly decrease memory footprints with a small performance penalty
(approximately 5%). Additionally, disk caching behaviour was assessed and it
was determined that it can achieve task completion at a reasonable
performance penalty (approximately 25% on average), even in cases where
main memory is extremely constrained. However, the measured disk caching
overheads prohibit the processing of datasets that approach the total disk
capacity in size.
The system takes advantage of the network bandwidth during initialisation
to efficiently distribute the load across the cluster instances. This practice
achieves load-balanced instances without saturating the network, even on
instances with moderate networking capabilities.
The default caching strategy of the Batch Execution Layer was found to be
inefficient. Analysis of the results produced significant insights into the
trade-offs of different caching strategies, and a caching strategy selection
algorithm was proposed. This algorithm was found to decrease execution
times by up to 25% compared to the default strategy and to decrease the risk
of main-memory exceptions. This behaviour is attributed to reductions in
both the garbage collection overhead and the garbage collection frequency.
6 Concluding remarks
This chapter provides a summary of the work conducted in this project
(6.1), overviews future expansions and open research issues in the area (6.2)
and closes with a conclusion (6.3).
6.1 Summary
BDM is an increasingly important field, with multiple industrial and academic
applications. The exponential rate of data generation indicates that BDM
problems will continue to be challenging in the future. Consequently, the sci-
entific community needs to continue investing significant time and effort to-
wards the implementation of novel approaches.
The literature survey has focused on three different scientific fields: Data
Mining, Distributed Computing Frameworks and Distributed Data Mining.
Data Mining frameworks, such as Weka, offer a plethora of powerful al-
gorithms, but they are designed for sequential execution and fail to cope with
large data volumes. Established Distributed Computing Frameworks, such as
Hadoop, have a proven ability to tackle Big Data problems, but often demon-
strate poor performance in Data Mining workloads. Emerging platforms, such
as Spark and GraphLab, provide improved solutions by focusing on the short-
comings of Hadoop in Data Mining applications.
This project has presented a scalable multi-tier architecture based on Weka,
Spark, HDFS and AWS. Weka's algorithms have been encapsulated in Map
and Reduce wrapper classes. By submitting these classes to Spark's in-
memory batch processing engine, multiple HDFS partitions can be processed
in parallel. The elastic nature of AWS enables the system to dynamically
adjust to very large volumes. The architecture was evaluated through a series
of experiments on AWS, using multiple configurations.
Benchmarking results demonstrate that the proposed solution is faster by a
factor of 2.36 than the equivalent system on Hadoop. The system achieves
near-linear scaling and manages the cluster's main-memory and network re-
sources efficiently. The experiments were conducted at scales comparable to
the majority of industry-level workloads.
Thus, the implemented architecture and programming model would appear
to be a viable solution to modern BDM problems. Various caching strategies
were evaluated. The results demonstrated that serialisation and compression
mechanisms can greatly improve memory efficiency with marginal perform-
ance overheads. Furthermore, it was determined that Spark's caching mechan-
isms are able to tackle main-memory shortages effectively, with relatively
small performance penalties. Finally, a mechanism to automatically select a
caching strategy was implemented. The mechanism was found to decrease
execution times by up to 25% compared to the default mechanism and to
eradicate task failures.
The aforementioned achievements demonstrate that the objectives set in
Section 1.3 have been met.
6.2 Further Work
The following sections identify three areas for extension: Clustering
(6.2.1), Stream Processing (6.2.2) and Declarative Data Mining (6.2.3).
6.2.1 Clustering
As explained in Section 4.6, clustering was not implemented in this work be-
cause of Weka's limitations. Therefore, an area of further work is the imple-
mentation of an execution model specifically targeting clustering problems.
Towards that goal, three different approaches could be explored.
The first approach could build on top of the existing implementation of
cluster Canopies. Canopies define regions based on a cheap measure of simil-
arity. Members of different regions are assumed to belong to different clusters.
Thus, Canopies partition the dataset into regions, each of which contains a
number of clusters. If the regions are small enough, Map functions could build
clusterings on different regions in parallel. Clusters in different regions are as-
sumed to be non-overlapping and thus, the concatenation of the region clusters
could produce the final clustering.
This approach poses a number of challenges. The Canopy algorithm
does not guarantee that the dataset will be partitioned into a large number of
regions. A small number of regions entails dataset partitions that may be too
large for a single instance to process. Additionally, region volumes are not
guaranteed to be balanced. Developing and evaluating methods to leverage
Canopies in distributed clustering problems is an interesting challenge.
A second approach could focus on implementing distributed versions of
specific clustering algorithms. Map tasks could be used to compute distances
from cluster centres, assign data points to clusters and update the per-partition
cluster centres. Reduce tasks could aggregate the cluster centre positions
across all partitions, update the position of each centre and assess the stop
condition. If the condition is not met, a new iteration should begin. This model
has been successfully applied to the K-Means algorithm [82]. Further work
could focus on exploring the feasibility of generalising this model to other cat-
egories of clustering algorithms. An additional challenge would be to modify
Weka's code-base to apply to this model.
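One iteration of this Map/Reduce formulation can be sketched as follows.
This is an illustrative standalone version using squared Euclidean distance,
not Weka or Spark code, and all names are hypothetical:

```java
import java.util.*;

// Illustrative sketch of one distributed K-Means iteration in the
// Map/Reduce style described above.
class KMeansIteration {
    // Map phase: assign each local point to its nearest centre and emit
    // per-centre partial sums (sum vector plus a count in the last slot).
    static Map<Integer, double[]> mapPartition(List<double[]> points,
                                               double[][] centres) {
        Map<Integer, double[]> partials = new HashMap<>();
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centres.length; c++) {
                double d = 0;
                for (int i = 0; i < p.length; i++) {
                    double diff = p[i] - centres[c][i];
                    d += diff * diff;               // squared Euclidean distance
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            double[] acc = partials.computeIfAbsent(best,
                    k -> new double[p.length + 1]);
            for (int i = 0; i < p.length; i++) acc[i] += p[i];
            acc[p.length] += 1;                     // point count
        }
        return partials;
    }

    // Reduce phase: merge partial sums across partitions and recompute
    // each centre as the mean of its assigned points.
    static double[][] reduce(List<Map<Integer, double[]>> allPartials,
                             int k, int dim) {
        double[][] sums = new double[k][dim + 1];
        for (Map<Integer, double[]> part : allPartials)
            for (Map.Entry<Integer, double[]> e : part.entrySet())
                for (int i = 0; i <= dim; i++)
                    sums[e.getKey()][i] += e.getValue()[i];
        double[][] newCentres = new double[k][dim];
        for (int c = 0; c < k; c++)
            if (sums[c][dim] > 0)
                for (int i = 0; i < dim; i++)
                    newCentres[c][i] = sums[c][i] / sums[c][dim];
        return newCentres;
    }
}
```

The driver would repeat these two phases until the centres stop moving,
which is the stop condition assessed by the Reduce tasks in the text.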
A third approach could investigate the consensus clustering literature in or-
der to identify potential solutions. Although the problem is known to be NP-
Complete [28], multiple heuristics offer performance guarantees [27].
Aggarwal et al. [83] indicate that the consensus clustering problem is
equivalent to a symmetric Non-Negative Matrix Factorisation problem [84].
Work by Xie et al. [85] provides two fast parallel methods and demonstrates
their efficiency in document clustering problems. Further work could focus
on comparing the relative merits of different approaches and provide a
concrete implementation on top of Spark.
6.2.2 Stream Processing
The velocity aspect of Big Data demands advanced stream processing mecha-
nisms. Emerging data-types, such as sensor feeds and social data, are gener-
ated in large volumes and received in real time. These data-types usually con-
sist of groups of messages that are at the peak of their value at the time of gen-
eration. For example, enterprise server logs require real-time processing to en-
able early identification of potential errors.
The proposed architecture focuses on batch processing of large amounts of
on-disk data. Low-latency stream processing would require replacing the
Batch Execution Layer with a framework able to support low-latency
scheduling and execution. To test the feasibility of this modification using
the existing execution model, Spark was replaced with Spark Streaming [86]
and a classification experiment was conducted.
In Spark Streaming, the input stream is split into small batches and each
batch is distributed to a cluster instance. It was found that the system achieved
1ms scheduling latency and 20ms average execution time for 500KB mi-
cro-batches. Although the system may theoretically process 200MB/s (assum-
ing an 8-core instance), data receipt rapidly becomes a bottleneck.
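The quoted throughput ceiling follows directly from these figures. A
back-of-envelope check, assuming decimal units and an illustrative class
name:

```java
// Back-of-envelope throughput ceiling for micro-batch processing:
// one core finishes a batch every execMs milliseconds, and cores
// work on independent batches in parallel (decimal MB assumed).
class StreamingThroughput {
    static double mbPerSecond(double batchKB, double execMs, int cores) {
        double batchMB = batchKB / 1000.0;
        double batchesPerSecondPerCore = 1000.0 / execMs;
        return batchMB * batchesPerSecondPerCore * cores;
    }
}
```

With 500KB batches, 20ms execution time and 8 cores this gives the
200MB/s ceiling mentioned above; in practice, data receipt saturates well
before the processing cores do.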
Further work could focus on discovering optimal stream partitioning and
distribution techniques. This should enable high degrees of parallelism and ef-
ficient resource utilisation. A popular approach [87] splits the stream of
messages into topics and builds topic-specific sub-streams. The new streams
are forwarded to cluster instances for further processing.
Depending on the workload, different approaches yield different benefits.
Thus, a framework targeting the automation of stream partitioning and pro-
cessing would form an interesting investigation.
Another direction could focus on identifying the relative merits of different
stream processing paradigms. Spark Streaming processes streams of data by
grouping multiple small objects into batches and scheduling the processing on
Spark clusters. A different approach employs an event processing methodol-
ogy where each object is processed independently as it arrives. This approach
is implemented in the Storm [88] stream processing engine.
The execution model is agnostic to the number of input objects. A Map
function will build a Weka model either on a single object (event) or on a
large number of objects. This feature indicates that, after the appropriate
modifications, the Application Layer could be ported to either Spark
Streaming or Storm and be used to evaluate the two approaches.
6.2.3 Declarative Data Mining
An emerging trend in BDM investigates the application of query optimisation
techniques on Data Mining applications. Traditional query optimisation parses
a declarative query, which describes the requested outcome, and then automat-
ically selects a near optimal execution strategy. A similar methodology can be
applied to Data Mining workloads.
Each Data Mining task can be executed using a large number of different
algorithms. These solutions differ in performance, accuracy and complexity.
Selecting an optimal algorithm requires state-of-the-art expertise and multiple
experiments. Research projects, such as MLBase [60], investigate the feasibil-
ity of automatically selecting an optimal algorithm.
Weka offers multiple different options for each category of Data Mining
methods. Therefore, a topic of further work could focus on implementing a
mechanism to explore these options and automatically select a near-optimal
solution. Two different approaches could be used to tackle this problem.
Firstly, WekaMetal [89] offers a ranking system which tries to predict the
performance (accuracy and execution time) of Weka's algorithms on a dataset.
This ranking is produced by using a knowledge base of benchmarks. The data-
set is analysed to determine its similarity with the benchmark datasets. The
benchmarks have been tested with the whole suite of Weka's algorithms and
their performance is known. WekaMetal selects the algorithm with the best
performance in a dataset similar to the input dataset.
WekaMetal could be integrated into the system and select an appropriate
algorithm automatically. This feature could be engineered either by imple-
menting a distributed version of WekaMetal's ranking system or by using
sampling methods to generate a dataset sample. The first method would re-
quire an additional MapReduce task over the dataset. The second method
would be faster to implement and execute, but may lack precision.
A second approach could be to produce a ranking directly from the dataset.
This could be achieved by using the Header's statistics to produce stratified
samples. Map functions could be used to build multiple models on those
samples in parallel. A Reduce function could rank those models on various
criteria and select the highest.
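The Reduce-side selection amounts to a simple arg-max over the per-sample
model scores. A minimal sketch, using accuracy as the single ranking
criterion for brevity (the text envisages several); all names are hypothetical:

```java
import java.util.Map;

// Hypothetical sketch of the Reduce-side ranking described above: each
// Map task emits an (algorithm, accuracy) pair for the model it built
// on one stratified sample, and the Reduce step keeps the best scorer.
class ModelRanker {
    static String selectBest(Map<String, Double> accuracyByAlgorithm) {
        return accuracyByAlgorithm.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}
```

A multi-criteria version would replace the comparator with a weighted score
combining accuracy, execution time and model complexity.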
Implementing and evaluating these approaches, as well as proposing
others, would be an interesting contribution. The end result could be
integrated into Weka-on-Spark and provide a declarative interface to the
system.
6.3 Conclusion
BDM is an exciting emerging field which will play a very important role in
the era of pervasive computing. Different aspects of Big Data pose different
challenges and a consensus gold standard has yet to emerge. This project
demonstrated an efficient methodology for extracting knowledge from large
data volumes. Main-memory caching has the potential to greatly improve
performance, given an educated caching strategy. This work suggests that in-memory
cluster computing provides the solid foundations upon which the next genera-
tion of Big Data Mining systems will be built.
References
[1] C. Lynch, “Big data: how do your data grow?", Nature 455 (7209) 28–29, 2008.
[2] J. Gantz, D. Reinsel, “The Digital Universe of Opportunities: Rich Data and the
Increasing Value of the Internet of Things,” IDC iView: IDC Analyze the
Future, 2014.
[3] M. A. Beyer, D. Laney. The importance of big data: A definition. Stamford, CT:
Gartner, 2012.
[4] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A. Byers,
“Big data: The next frontier for innovation, competition, and productivity,”
Technical report, McKinsey Global Institute, 2011.
[5] A. Fernandez, “Advanced Database Management Systems,” class notes for
COMP60731, School of Computer Science, University of Manchester, Novem-
ber 2013.
[6] M. Stonebraker , R. Cattell, “10 rules for scalable performance in 'simple opera-
tion' datastores”, Communications of the ACM, v.54 n.6, June 2011
[7] “Amazon Web Services ,” https://aws.amazon.com/ec2/ . Accessed August 14th,
2014.
[8] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large
Clusters,” in OSDI’04: Proceedings of the 6th conference on Symposium on
Opearting Systems Design & Implementation. Berkeley, CA, USA: USENIX
Association, 2004, pp. 137–150.
[9] A. Bialecki, M. Cafarella, D. Cutting and O. O'Malley. "Hadoop: A Framework
for Running Applications on Large Clusters Built of Commodity Hardware",
http://lucene.apache.org/hadoop/, 2005
[10] "Powered by Hadoop," http://wiki.apache.org/hadoop/PoweredBy/ . Accessed
August 14th, 2014.
[11] P. Russom, “Integrating Hadoop into Business Intelligence and Data Warehous-
ing,” TDWI Best Practices Report, 2013.
[12] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Y. Yu, G. Bradski, A. Y. Ng, and K. Oluko-
tun, "MapReduce for machine learning on multicore," in Advances in Neural In-
formation Processing Systems, 2007.
[13] S. Sakr, A. Liu, and A. G. Fayoumi, "The family of mapreduce and large-scale
data processing systems," in ACM Comput. Surv. 46, 1, Article 11, 2013.
[14] M. Zaharia et al., “Resilient distributed datasets: A fault-tolerant abstraction for
in-memory cluster computing,” in Proceedings of the 9th USENIX Conference
on Networked Systems Design and Implementation, ser. NSDI’12. Berkeley,
CA, USA: USENIX Association, 2012.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and H. Witten. “The
WEKA data mining software: an update,” in SIGKDD Explor. Newsl., 11(1):10–
18, 2009.
[16] R Core Team, R:” A Language and Environment for Statistical Computing”, R
Foundation for Statistical Computing, Vienna, Austria, 2013. [Online].
Available: http://www.R-project.org/ . Accessed August 14th, 2014.
[17] D. Smith,"R Tops Data Mining Software Poll," in Java Developers Journal, May
31, 2012.
[18] K. Shvachko, Hairong Kuang, S. Radia and R Chansler, “The Hadoop Distrib-
uted File System,” Mass Storage Systems and Technologies (MSST), 2010 IEEE
26th Symposium 3-7 May 2010
[19] M. Odersky, L. Spoon and B. Venners. Programming in Scala. Artima Inc, 2008.
[20] "Measuring Parallel Scaling Performance,"
https://www.sharcnet.ca/help/index.php/Measuring_Parallel_Scaling_Perform
ance . Accessed August 17th, 2014.
[21] N Sawant, H Shah, “Big Data Application Architecture Q & A”, Springer, 2013
[22] A.Y. Ng. “Machine Learning,” class notes, Coursera [Online]:
http://class.coursera.org/ml-003/lecture/5. Accessed May 6th, 2014.
[23] T.Dietterich, "Ensemble methods in machine learning." In Multiple classifier
systems, pp. 1-15. Springer Berlin Heidelberg, 2000.
[24] R. Vilalta, Y. Drissi, "A Perspective View and Survey of Meta-Learning,"
in Artificial Intelligence Review, Volume 18, Issue 2, pp 77-95, 2002.
[25] J. Kittler,R. H. Mohamad, PW Duin, J. Matas, "On combining classifiers." Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on 20, no. 3: 226-
239, 1998
[26] A.M. Bagirov, J. Ugon and D. Webb, “Fast modified global K-means algorithm
for incremental cluster construction,” Pattern Recognition, 44 (4),
pp. 866–876, 2011.
[27] A. Goder and V. Filkov. "Consensus Clustering Algorithms: Comparison and
Refinement," in ALENEX. Vol. 8, 2008.
[28] V. Filkov, "Integrating microarray data by consensus clustering". In Proceedings
of the 15th IEEE International Conference on Tools with Artificial Intelligence.:
418–426, 2003.
[29] R. Agrawal, R. Srikant. "Fast algorithms for mining association rules," in Pro-
ceedings of the 20th International Conference of Very Large Databases, VLDB.
Vol. 1215. 1994.
[30] R. Agrawal,J. C. Shafer, "Parallel mining of association rules." IEEE Transac-
tions on knowledge and Data Engineering 8, no. 6,p 962-969, 1996.
[31] “The Comprehensive R Archive Network,” http://CRAN.R-project.org/ . Ac-
cessed August 14th, 2014.
[32] W. N. Street and K. YongSeog, "A streaming ensemble algorithm (SEA) for
large-scale classification," in Proceedings of the seventh ACM SIGKDD interna-
tional conference on Knowledge discovery and data mining, ACM, 2001.
[33] D. P. Bertsekas and J. N. Tsitsiklis "Some aspects of parallel and distributed iter-
ative algorithms A survey", Automatica, vol. 27, no. 1, pp.3 -21 1991.
[34] J. Han, M. Ishii and H. Makino, "A Hadoop performance model for multi-rack
clusters," in Computer Science and Information Technology (CSIT), 2013 5th
International Conference on. IEEE, 2013.
[35] V. K. Vavilapalli, “Apache Hadoop YARN: Yet Another Resource Negotiator,”
In Proceedings of SOCC, 2013.
[36] “Hadoop YARN,” http://hortonworks.com/hadoop/yarn/ .Accessed May 6th,
2014
[37] Y. Bu, B. Howe, M. Balazinska, M. D. Ernst, “HaLoop: efficient iterative data
processing on large clusters,” Proceedings of the VLDB Endowment, v.3 n.1-2,
2010.
[38] R.Appuswamy, C. Gkantsidis, D. Narayanan, O. Hodson and A. Rowstron.
“Scale-up vs scale-out for Hadoop: time to rethink?” in Proceeding SOCC '13,
Article No20, 2013.
[39] S. Ji, W. Wang, C. Ye, J. Wei,Z. Liu,
"Constructing a data accessing layer for in-memory data grid, "
In Proceedings of the Fourth Asia-Pacific Symposium on Internetware,
Internetware '12, pages 15:1-15:7, USA, 2012.
[40] “InfiniSpan ,” http://infinispan.org/about/ ,Accessed August 14th, 2014.
[41] “HazelCast, “ http://hazelcast.com/products/hazelcast/ , Accessed August 14th,
2014.
[42] S. Shahrivari, "Beyond Batch Processing: Towards Real-Time and Streaming
Big Data, " arXiv preprint arXiv:1403.3375, 2014.
[43] R. Power, J. Li, “Piccolo: Building fast, distributed programs with partitioned
tables,” In Proceedings of OSDI, 2010.
[44] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, J. M. Hellerstein,
“Graphlab: A new framework for parallel machine learning,” in UAI, 2010.
[45] "Apache Spark, " http://spark.apache.org/ . Accessed August 14th, 2014.
[46] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica,
“Shark: SQL and rich analytics at scale,” In Proceedings of the ACM
SIGMOD Conference, pages 13-24, 2013.
[47] “MapReduce and Spark, “ http://vision.cloudera.com/mapreduce-spark/ , Ac-
cessed August 18th, 2014.
[48] “Apache Mahout,” http://mahout.apache.org/ , Accessed August 14th,2014.
[49] E. Sparks, A. Talwalkar, V. Smith, X. Pan, J. Gonzalez,T. Kraska, M. I. Jordan,.
M. J. Franklin, “MLI: An API for distributed machine learning,” in ICDM, 2013.
[50] Z. Prekopcsak, G. Makrai, T. Henk, C. Gaspar-Papanek, “Radoop: Analyzing
big data with rapidminer and hadoop,” In RCOMM, 2011.
[51] I. Mierswa , M. Wurst , R. Klinkenberg , M. Scholz and T. Euler, “YALE: rapid
prototyping for complex data mining tasks, “ in Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery and data mining,
August 20-23, 2006.
[52] S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, and J. McPherson.
“Ricardo: integrating R and Hadoop,” In SIGMOD, pages 987–998, 2010.
[53] K. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F.
Ozcan, and E. J. Shekita, “Jaql: A Scripting Language for Large Scale Semis-
tructured Data Analysis,” In PVLDB, 2011.
[54] “SparkR, “ http://amplab-extras.github.io/SparkR-pkg/ , Accessed August 14th,
2014.
[55] R. Gordon, “Essential JNI: Java Native Interface, “ Prentice Hall PTR, Upper
Saddle River, 1998.
[56] M. Perez,A. Sanchez,A. Herrero,V. Robles, and Pea, Jos, M., “Adapting the
Weka Data Mining Toolkit to a Grid Based Environment,” Advances in Web
Intelligence, pp. 492–497, 2005.
[57] S. Celis and D.R. Musicant, “Weka-parallel: machine learning in parallel,”
Technical report, Carleton College, CS TR, 2002.
[58] D. Talia, P. Trunfio, O. Verta, “Weka4WS: a WSRF-enabled Weka Toolkit for
Distributed Data Mining on Grids,” In Proceedings of the 9th European Confer-
ence on Principles and Practice of Knowledge Discovery in Databases, Porto,
Portugal, 2005.
[59] D. Wegener, M. Mock, D. Adranale, and S. Wrobel, “Toolkit-based high-perform-
ance data mining of large data on MapReduce clusters,” In Proceedings of ICDM
Workshops, pp. 296–301, 2009.
[60] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. J. Franklin, M. Jordan,
“MLbase: A distributed machine-learning system,” In Conf. on Innovative Data
Systems Research, 2013.
[61] “Amazon EBS,” http://aws.amazon.com/ebs/ . Accessed August 17th, 2014.
[62] “Amazon EC2 FAQs,” http://aws.amazon.com/ec2/faqs/ . Accessed August 17th,
2014.
[63] “Amazon Machine Images,” http://docs.aws.amazon.com/AWSEC2/latest/User
Guide/AMIs.html . Accessed August 17th, 2014.
[64] “HDFS Architecture,” http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html .
Accessed August 26th, 2014.
[65] “Kryo,” https://github.com/EsotericSoftware/kryo . Accessed August 17th, 2014.
[66] “Spark Programming Guide,” http://spark.apache.org/docs/latest/programming-
guide.html . Accessed August 17th, 2014.
[67] “Amazon CloudWatch,” http://aws.amazon.com/cloudwatch/ . Accessed August
17th, 2014.
[68] “Cron and Crontab usage,” http://www.pantz.org/software/cron/croninfo.html .
Accessed August 17th, 2014.
[69] M. Hall, “Weka and Hadoop” blog, 15 October 2013;
http://markahall.blogspot.co.uk/2013/10/weka-and-hadoop-part-1.html .
Accessed August 17th, 2014.
[70] A. McCallum, K. Nigam, L. H. Ungar, “Efficient clustering of high-dimensional
data sets with application to reference matching,”
In Proceedings of the 6th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 169–178, 2000.
[71] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Disk-locality in
datacenter computing considered irrelevant,” In Proc. USENIX Workshop on
Hot Topics in Operating Syst. (HotOS), 2011.
[72] R. Appuswamy, C. Gkantsidis, D. Narayanan, O. Hodson, and A. Rowstron,
“Scale-up vs scale-out for Hadoop: time to rethink?” In Proceedings of SOCC '13,
Article No. 20, 2013.
[73] J. Dejun, G. Pierre, and C.-H. Chi, “Resource provisioning of web applications
in heterogeneous clouds,” In WebApps, 2011.
[74] A. Le-Quoc, M. Fiedler, C. Cabanilla, “The Top 5 AWS EC2 Performance Prob-
lems,” Datadog Inc., 2013.
[75] S. Jha, J. Qiu, A. Luckow, P. Mantha, G. C. Fox,
“A Tale of Two Data-Intensive Approaches: Applications, Architectures and
Infrastructure,” In 3rd International IEEE Congress on Big Data Application and
Experience Track, 2014.
[76] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir,
and M. Snir, “MPI: The Complete Reference,” MIT Press, 1998.
[77] “Machine Learning Repository,” https://archive.ics.uci.edu/ml/datasets.html .
Accessed August 17th, 2014.
[78] “Stanford Network Analysis Project,” http://snap.stanford.edu/ . Accessed Au-
gust 17th, 2014.
[79] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, I. Stoica,
“Delay scheduling: A simple technique for achieving locality and fairness in
cluster scheduling,” In EuroSys 10, 2010.
[80] “Iperf,” https://github.com/esnet/iperf/ . Accessed August 22nd, 2014.
[81] “Iozone Filesystem Benchmark,” http://www.iozone.org/ . Accessed August 22nd,
2014.
[82] W. Zhao, H. Ma, Q. He, “Parallel K-Means Clustering
Based on MapReduce,” In Proceedings of the 1st International
Conference on Cloud Computing, December 1–4, 2009.
[83] C. Aggarwal, C. K. Reddy, “Data Clustering: Algorithms and Applications,”
CRC Press, 2011.
[84] C. Ding, X. He, H. D. Simon, “On the equivalence of nonnegative
matrix factorization and spectral clustering,” In Proceedings of the SIAM
International Conference on Data Mining, 2005.
[85] Z. S. He, S. L. Xie, R. Zdunek, G. X. Zhou, A. Cichocki, “Symmetric nonneg-
ative matrix factorization: Algorithms and applications to probabilistic cluster-
ing,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2117–2131, 2011.
[86] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized
streams: fault-tolerant streaming computation at scale,” In SOSP, 2013.
[87] J. Kreps, N. Narkhede, and J. Rao, “Kafka: A distributed messaging system for
log processing,” In NetDB, 2011.
[88] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J.
Jackson, “Storm@twitter,” In Proceedings of the 2014 ACM SIGMOD Interna-
tional Conference on Management of Data, pp. 147–156, ACM, 2014.
[89] “WekaMetal,” http://www.cs.bris.ac.uk/Research/MachineLearning/wekametal/ .
Accessed August 18th, 2014.
Appendix 1 – Benchmarking Data
Appendix 1 contains the bulk of benchmarking data. Tables 1 and 2 display
the execution times for the Linear Regression and FP-Growth experiments.
Linear Regression (sec)    5GB    20GB    80GB
8 Cores                     86     336    1340
32 Cores                    39      90     352
128 Cores                   24      39      94
Appendix Table 1: Execution Times for Linear Regression
FP-Growth (sec)            5GB    20GB    80GB
8 Cores                    359    1437    5613
32 Cores                   100     379    1440
128 Cores                   34     104     376
Appendix Table 2: Execution Times for FP-Growth
Tables 3 and 4 display the average memory utilisation of the Weka-on-Spark
benchmarks.
Appendix Table 3: Memory Utilisation Part I
Appendix Table 4: Memory Utilisation Part II
Tables 5 and 6 display the average per-instance Network Traffic of the
Weka-on-Spark benchmarks.
Appendix Table 5: Network Traffic Part I
Appendix Table 6: Network Traffic Part II
Appendix 2 – Installation Guide
Installing the system on AWS requires a number of steps:
1. Download the latest release of Spark
(http://spark.apache.org/downloads.html ).
2. Unzip the package to a location and navigate to the /ec2 folder. Ensure
that Python 2.7 is installed and is the default Python interpreter. This
folder contains a Python script able to launch an AWS Spark cluster
according to a specification.
3. Export the ACCESS_ID and ACCESS_KEY associated with your
AWS account to the terminal window (or set-up environment vari-
ables). Ensure that the key-pair associated with your account is stored
locally.
4. In the /ec2 directory execute:
./spark-ec2 -k (key-pair name) -i (key-pair file path) -s
(num-of-slaves) -t (slave-instance type) -r (ec2-region)
launch (cluster-name)
These are the essential parameters, but the script supports multiple op-
tions; for the full documentation, see
http://spark.apache.org/docs/latest/ec2-scripts.html . Note that the
procedure is slow: launching the cluster may take up to half an hour.
When the process is complete, the script prints a confirmation message
in the terminal.
5. Login to the cluster. The easiest way is to type in the same terminal:
./spark-ec2 -k (key-pair name) -i (key pair path) login
(cluster name)
It is also possible to establish a direct SSH connection to the
Master node.
6. Once in the Master node, navigate to the /spark directory and download
the framework's uber JAR there. This can be achieved using multiple
methods, depending on where the executable is hosted.
This process assumes that the data are already on HDFS. At this point the sys-
tem is ready to accept user tasks. Appendix 3 (User Guide) provides details
about the submission scripts and the supported user options and Appendix 4
provides details on how to install the main-memory monitoring service on
CloudWatch.
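If the dataset is not yet on HDFS, it can be staged from the Master node. The sketch below assumes the default spark-ec2 layout, where an ephemeral HDFS installation lives under /root/ephemeral-hdfs; the path and dataset name are assumptions, not part of the original instructions. The commands are echoed for review rather than executed:

```shell
#!/bin/sh
# Assumed locations -- adjust for your cluster layout.
HADOOP=/root/ephemeral-hdfs/bin/hadoop
DATASET=dataset.csv
TARGET=/user/root/$DATASET

# Echo the staging commands so they can be reviewed before execution:
echo "$HADOOP fs -put $DATASET $TARGET"
echo "$HADOOP fs -ls /user/root"
```

Once the upload completes, the hdfs:// path of the dataset is what the submission script expects in its -hdfs-dataset-path option.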
Appendix 3 – User Guide
This guide assumes that the system is installed on AWS (or any other cluster),
that the data are stored on HDFS and that the reader possesses fundamental
knowledge of the Linux command line.
Navigate to the folder of the Spark installation and type:
bin/spark-submit --master <spark master address and port> \
--class uk.ac.manchester.ariskk.distributedWekaSpark.main.distributedWekaSpark \
--executor-memory <per instance memory> \
/root/spark/distributedWekaSpark-0.0.2-SNAPSHOT.jar \
-task <task to submit> \
-hdfs-dataset-path <path to hdfs> \
-num-of-partitions <> -num-of-attributes <> \
<an arbitrary number of supported options>
Double-dashed options are Spark options; single-dashed options are parsed
by the application. Double-dashed options must be submitted before the
path of the executable archive.
Any number of single-dashed options can be submitted, and their order is
irrelevant. The following list presents the supported options, their default
values and usage examples.
-task (task descriptor) possible values: buildHeaders, buildClassifier, buildClassifierEvaluation, buildFoldBasedClassifier, buildFoldBasedClassifierEvaluation, buildClusterer, findAssociationRules; default: none (Exception)
-dataset-type (dataset type to use) possible values: Instances, ArrayInstance, ArrayString; default: ArrayString
-caching (caching strategy to use) possible values: all Spark-supported strategies, e.g. MEMORY_ONLY; default: Caching Strategy Selection Algorithm
-compress (compress serialized RDDs to save memory) possible values: y (or none); default: none (do not compress)
-kryo (use the Kryo serializer) possible values: y (or none); default: none (do not use Kryo)
-caching-fraction (executor caching fraction) possible values: 0.1-1.0; default: 0.6
-overhead (dataset overhead in Java objects) possible values: any double; default: 5.0
-hdfs-dataset-path (path to the dataset on HDFS) possible values: hdfs://(host):(port)/user/username/dataset.csv; default: none (Exception)
-hdfs-names-path (path to a names file on HDFS; names must be in a comma-delimited format) possible values: hdfs://(host):(port)/user/username/names.txt; default: will try to compute att0, att1, etc.
-hdfs-headers-path (path to pre-built headers for the dataset) possible values: hdfs://(host):(port)/user/username/someheaders.arff; default: will try to compute headers
-hdfs-classifier-path (path to a trained classifier) possible values: hdfs://(host):(port)/user/username/someclassifier.model; default: will try to build the classifier
-hdfs-output-path (path to an HDFS folder where generated models will be saved) possible values: hdfs://(host):(port)/user/username/somefolder/; default: none (will not save)
-num-partitions (number of partitions) possible values: any integer; default: Spark default
-num-random-chunks (number of randomized/stratified partitions) possible values: any integer; default: no stratification/randomisation
-num-of-attributes (number of attributes in the dataset) possible values: any integer; default: will try to compute from the dataset if no headers are provided
-class-index (index of the class attribute) possible values: any integer in [0, num-of-attributes-1]; default: num-of-attributes-1
-num-folds (number of folds) possible values: any integer; default: 1
-names (attribute names) possible values: a comma-delimited list of names; default: will produce a list att0, att1, etc.
-classifier (the name and package path of the classifier to use) possible values: weka.classifiers.bayes.NaiveBayes (any Weka core classifier); default: NaiveBayes
-meta (meta learner to use) possible values: weka.classifiers.meta.Bagging (any Weka core meta learner); default: none (WekaClassifierReduceTask default)
-rule-learner (association rule learner to use) possible values: weka.associations.Apriori and weka.associations.FPGrowth only; default: FPGrowth
-clusterer (clusterer to use) possible values: weka.clusterers.Canopy only; default: Canopy
-num-clusters (number of clusters to find) possible values: any integer; default: 0 (the task will rely on the clusterer's auto-configuration, if supported)
-parser-options (options for the CSV parser) possible values: \"-N first-last\" etc.; default: Weka defaults based on the task. DO NOT FORGET \" \" when grouping parameters
-weka-options (options for Weka algorithms (base tasks and core algorithms)) possible values: \"-depth 3\" etc.; default: Weka defaults based on the task. DO NOT FORGET \" \" when grouping parameters
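Putting the template and the options above together, a hypothetical end-to-end invocation might look as follows: training a NaiveBayes classifier on a CSV dataset stored on HDFS. The master address, namenode host, memory setting and partition counts are placeholders; the command is echoed so it can be checked before being run on a live cluster:

```shell
#!/bin/sh
# Placeholder values -- substitute your own master URL and HDFS paths.
MASTER="spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077"
JAR="/root/spark/distributedWekaSpark-0.0.2-SNAPSHOT.jar"
DATA="hdfs://namenode:9000/user/root/dataset.csv"

# Echo the full submission command for inspection:
echo bin/spark-submit --master "$MASTER" \
  --class uk.ac.manchester.ariskk.distributedWekaSpark.main.distributedWekaSpark \
  --executor-memory 4g \
  "$JAR" \
  -task buildClassifier \
  -hdfs-dataset-path "$DATA" \
  -num-partitions 32 -num-of-attributes 10 \
  -classifier weka.classifiers.bayes.NaiveBayes
```

Note that the double-dashed Spark options precede the JAR path, while the application's single-dashed options follow it, in any order.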
I intend to continue supporting the project after my graduation; con-
sequently, this list may be updated in the future. The project will be released
as open source under the Apache License and will be available
through GitHub.
Appendix 4 – Main-Memory Monitoring using CloudWatch
CloudWatch monitors a variety of metrics, but there is no support for main-
memory usage monitoring by default. The following steps explain how a
main-memory monitoring daemon can be installed on Linux-based instances.
Amazon has released a number of Perl scripts to achieve this task. These
scripts must be downloaded and installed directly on the instances. This pro-
cedure requires a number of steps:
1. Establish a secure shell (SSH) connection to the instance by typing:
ssh -i (keypair path) root@(instance public DNS)
2. Once the secure connection is established copy and paste the following
commands in the terminal:
sudo yum install perl-Switch perl-Sys-Syslog perl-LWP-Protocol-https
wget http://ec2-downloads.s3.amazonaws.com/cloudwatch-samples/CloudWatchMonitoringScripts-v1.1.0.zip
unzip CloudWatchMonitoringScripts-v1.1.0.zip
rm CloudWatchMonitoringScripts-v1.1.0.zip
cd aws-scripts-mon
These commands will download the appropriate main-memory monitoring
scripts to the instance.
3. (Optional) Change the default text editor from Vim to nano:
export EDITOR=nano
4. Open the crontab (cron table):
crontab -e
The entries of this table are executed by the operating system at fixed
intervals.
5. Paste the following entries in the crontab:
* * * * * /root/aws-scripts-mon/mon-put-instance-data.pl --mem-util --aggregated=only --mem-used --mem-avail --aws-access-key-id=(your id) --aws-secret-key=(your secret key)
* * * * * /root/aws-scripts-mon/mon-put-instance-data.pl --mem-util --mem-used --mem-avail --aws-access-key-id=(your id) --aws-secret-key=(your secret key)
These entries will execute every minute and submit the instance's main-
memory utilisation statistics to CloudWatch.
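Once the entries are in place, their installation can be sanity-checked. The commands below are echoed rather than executed, since they only make sense on a live instance with AWS credentials configured; the --verify flag is, to the best of my knowledge, supported by Amazon's monitoring scripts as a dry-run mode, but should be confirmed against the script's own --help output:

```shell
#!/bin/sh
# Echo the suggested sanity checks (run them by hand on the instance):
echo "crontab -l | grep mon-put-instance-data.pl"
echo "/root/aws-scripts-mon/mon-put-instance-data.pl --mem-util --verify --verbose"
```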
At this point, the CloudWatch metrics console should display a new cat-
egory, Linux System Metrics, under which all the custom metrics can be
retrieved.