PARALLEL & SCALABLE MACHINE LEARNING & DEEP LEARNING
Prof. Dr.-Ing. Morris Riedel
Adjunct Associated Professor
School of Engineering and Natural Sciences, University of Iceland
Research Group Leader, Juelich Supercomputing Centre, Germany
Apache Spark for Cloud Applications
September 13th, 2018
Room Stapi 108
Cloud Computing & Big Data
LECTURE 3
Review of Lecture 2 – Machine Learning Models in Clouds
Machine Learning Fundamentals
1. Some pattern exists
2. No exact mathematical formula
3. Data exists
Lecture 3 – Apache Spark for Cloud Applications
‘Simple‘ Perceptron Model
Logistic Regression Model
[1] Image sources: Species Iris Group of North America Database
wi
HDInsight
2 / 56
Outline of the Course
1. Cloud Computing & Big Data
2. Machine Learning Models in Clouds
3. Apache Spark for Cloud Applications
4. Virtualization & Data Center Design
5. Map-Reduce Computing Paradigm
6. Deep Learning driven by Big Data
7. Deep Learning Applications in Clouds
8. Infrastructure-As-A-Service (IAAS)
9. Platform-As-A-Service (PAAS)
10. Software-As-A-Service (SAAS)
11. Data Analytics & Cloud Data Mining
12. Docker & Container Management
13. OpenStack Cloud Operating System
14. Online Social Networking & Graphs
15. Data Streaming Tools & Applications
16. Epilogue
+ additional practical lectures for our
hands-on exercises in context
Practical Topics
Theoretical / Conceptual Topics
3 / 56
Outline
Basic Apache Spark Capabilities
 Resilient Distributed Datasets
 Apache Spark vs. Apache Hadoop
 Parallel Job & In-Memory Execution
 Transformations & Actions
 Word Count Application Example
Libraries & Open Source Frameworks
 Structured Query Language (SQL)
 Streaming Library
 Machine Learning Library (MLlib)
 GraphX Library
 Machine Learning Applications in Clouds
Promises from previous lecture(s):
Lecture 1: Lecture 3 will provide details on data-intensive services & computing using Apache Spark
Lecture 1: Lecture 3 will give in-depth details on using in-memory processing & Apache Spark & MLlib
Lecture 2: Lectures 3 & 6 provide more insights into Logistic Regression & Gradient Descent Optimization
Lecture 2: Lecture 3 provides a detailed introduction to Apache Spark and its related ‘service ecosystem’
Lecture 2: Lecture 3 provides an introduction to Apache Spark with SparkSession, SparkContext and YARN
Lecture 2: Lecture 3 provides a detailed introduction to Apache Spark and its Spark SQL dataframe concepts
Lecture 2: Lecture 3 provides a detailed introduction to Apache Spark and its ‘service ecosystem’ with Hive
Lecture 2: Lecture 3 provides details about Apache Spark and other available MLlib models & pipelines
Lecture 2: Lecture 3 provides more examples on training and test datasets and different model evaluations
Lecture 2: Lecture 3 offers more insights into other machine learning approaches in different cloud systems
4 / 56
Basic Apache Spark Capabilities
5 / 56
Food Inspection in Chicago: Advanced Application Revisited
1. Some pattern exists: we believe that a pattern in the recorded ‘quality violations’ of restaurants somehow influences whether a food inspection passes or fails (binary classification)
2. No exact mathematical formula: to the best of our knowledge there is no precise formula for this problem
3. Data exists: data collection from the City of Chicago
The goal of this advanced machine learning application is to predict the outcome of food inspections of new Chicago restaurants, given the violations already recorded for other restaurants in the City of Chicago
6 / 56
Deploy Spark Cluster in MS Azure HDInsight Revisited
Microsoft Azure HDInsight Spark clusters can be deployed via pre-configured resource manager templates that use Azure computing resources & Azure Storage Blobs as cluster storage
A wide variety of templates is available on GitHub pages for various general cloud services
[2] Github template Web page
HDInsight
7 / 56
HDInsight
Networking & Big Data Impacts on Cloud Computing
Requirements for scalable programming models and tools
 CPU speed has surpassed the I/O capabilities of existing cloud resources
 Data-intensive clouds with advanced analytics and analysis capabilities
 Considering moving compute tasks to data vs. moving data to compute
(MS Azure is only one concrete example of many cloud solutions)
Requirements for reliable filesystems
 Traditional parallel filesystems need to prove their ‘big data’ feasibility
 Emerging new forms of filesystems assume that hardware errors occur constantly
 E.g. the Hadoop Distributed File System (HDFS)
[27] HDFS Architecture Guide
focus in this lecture
Lecture 5 will give in-depth details on map-reduce & its implementation Hadoop & the filesystem
8 / 56
Big Data Analytics Frameworks – Revisited
Distributed processing
 ‘Map-reduce via files’: tackle large problems with many small tasks
 Advantage of ‘data replication’ via a specialized distributed file system
 E.g. Apache Hadoop
In-memory processing
 Perform many operations fast via ‘in-memory’ computing
 Enables tasks such as ‘map-reduce in-memory’
 E.g. Apache Spark, Apache Flink
Big Data analytics frameworks shift the approach from ‘bring data to compute resources‘ into ‘bring compute tasks close to data‘
[4] Apache Spark
[3] Map-Reduce
focus in this lecture
Lecture 5 offers more insights into the map-reduce computing paradigm including applications
9 / 56
Apache Spark uses a Distributed Filesystem & Map-Reduce
Taking into account the ever-increasing amounts of ‘big data’
 Think: ‘Big Data’ is not only denoted by volume; there is also velocity, variety, …
Reliable, scalable ‘big data’ storage method
 Data is (almost always) accessible even if failures of nodes/disks occur
 Enables access to large quantities of data with good performance
Fault-tolerant and scalable data analytics processing approach
 Data analytics computations must be divided into small tasks for easy restart
 The restart of one data analytics task has no effect on other active tasks
A specialized distributed file system is required that assumes failures as the default
A data analytics processing programming model is required that is easy to use and simple to program, with fault tolerance already within its design
Lecture 5 provides more information on the Hadoop Distributed File System (HDFS) used by Spark
10 / 56
Apache Spark Basic Properties
Production deployments
 Adobe, Intel and Yahoo have production deployments
Open source (initially invented at UC Berkeley)
 One of the ‘most active big data’ open source projects today
Idea: support two properties beyond traditional map-reduce
 Extension towards Directed Acyclic Graphs (DAGs)
 Improvements in ‘data sharing capabilities’ & ‘in-memory computing’
Key concepts: ‘Resilient Distributed Dataset (RDD)’ & usability
 RDDs are a fault-tolerant collection of elements to operate on in parallel
 Good usability w.r.t. writing less code via rich APIs
 Uses the Hadoop Distributed File System (HDFS)
Apache Spark is a fast & general engine for large-scale data processing optimized for memory usage
Open source & compatible with cloud data storage systems like Amazon Web Services (AWS) S3
[4] Apache Spark
[12] Apache Hadoop Web page
11 / 56
Resilient Distributed Datasets (RDDs)
Key approach
 Immutable collections of objects across a cluster (i.e. partitions)
 Created through parallel transformations (e.g. map, filter, group, etc.)
 Controllable persistence & cached partitions (e.g. caching in RAM, if it fits in memory)
 Automatically rebuilt on failure: RDDs track the transformations that created them (i.e. their lineage) to recompute lost data
The key idea of Resilient Distributed Datasets (RDDs) in Apache Spark is that they enable working with distributed data collections as if they were local data collections
[6] Parallel Programming with Spark
12 / 56
Apache Spark RDD Transformations – Example
[6] Parallel Programming with Spark
(the use of ‘lambda’ enables small anonymous functions, e.g. f(x) = x*x, returning the result directly as an expression)
(domain decomposition: a parallelized collection holding the numbers 1, 2, and 3 in 3 different slices; the slice count can be configured for bigger data)
(the transformation is computed on nums, not on squares or even!)
(the reduce starts with 0, not with x itself!)
13 / 56
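The transformations above can be mimicked in plain Python on a local list; this is only a sketch of the semantics (in actual PySpark the collection would be created with sc.parallelize and processed in parallel over slices), with the variable names nums, squares and even taken from the slide:

```python
from functools import reduce

# Local stand-in for: nums = sc.parallelize([1, 2, 3])  (3 slices in Spark)
nums = [1, 2, 3]

# map with a lambda: the anonymous function f(x) = x * x applied to every element
squares = list(map(lambda x: x * x, nums))           # [1, 4, 9]

# filter: keep only the even squares
even = list(filter(lambda x: x % 2 == 0, squares))   # [4]

# reduce: fold the elements together, starting with 0 (not with x itself!)
total = reduce(lambda acc, x: acc + x, nums, 0)      # 6

print(squares, even, total)
```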
Apache Spark RDD Actions – Example
[6] Parallel Programming with Spark
(collect returns elements of the dataset as an array back to the driver program, often used in debugging)
14 / 56
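The transformation/action split is about laziness: transformations only record what to do, while an action such as collect() triggers the actual computation and returns results to the driver. A plain-Python generator, used here only as a stand-in for an RDD, shows the same behavior:

```python
nums = [1, 2, 3, 4]

# "Transformation": building the generator computes nothing yet (lazy, like an RDD)
squares = (x * x for x in nums)

# "Action" like collect(): materialize the elements back in the driver program
collected = list(squares)     # the computation happens only here

# Further actions such as count() and reduce() would likewise force evaluation
count = len(collected)
total = sum(collected)
print(collected, count, total)
```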
Apache Spark vs. Apache Hadoop
Apache Hadoop: DATA INPUT → HDFS read (slow) → Iteration 1 → HDFS write (slow) → HDFS read (slow) → Iteration 2 → HDFS write (slow) → HDFS read (slow) → RESULT
Apache Spark: DATA INPUT → HDFS read (slow) → Iteration 1 → RAM write (fast) → RAM read (fast) → Iteration 2 → RAM write (fast) → RAM read (fast) → RESULT
Spark results
 Up to 100x faster than traditional Hadoop (for some applications)
 Requires ‘big memory’ systems to perform well
One key difference between Apache Spark and Apache Hadoop is that slow HDFS reads and writes are exchanged for fast RAM reads and writes across several iterations that might be part of a larger workflow
A workflow is a sequence of processing tasks on data that leads to results via ‘iterations’
[5] big-data.tips, Apache Spark vs. Hadoop
15 / 56
Amazon Web Services – High Memory Computing Resource
[7] AWS EC2 Pricing
16 / 56
Distributed Operations & Parallel Job Execution
Different parallel options for tasks SparkContext is head/master and
used for starting applications E.g. use local threads E.g. use cluster via Apache
(e.g. Hadoop YARN)
Spark Task Scheduler Supports general task graphs Cache-aware data re-use & locality
(i.e. multiple iterations over databenefit most, e.g. machine learning)
Partitioning-aware to avoid shuffles
Transformations build RDDs from other RDDs (i.e. map, filter, groupBy, join, union, etc.); actions return a real result or write it to storage (e.g. count, collect, save, etc.)
[6] Parallel Programming with Spark
(also named the Spark Driver master)
[8] Advanced Spark
[12] Apache Hadoop Web page
17 / 56
Hadoop Resource Manager Yarn & Wordcount Example
Apache Hadoop offers the Yet Another Resource Negotiator (YARN) scheduler for map-reduce jobs
 The idea of YARN is to split up the functionalities of resource management & job scheduling/monitoring
 The ResourceManager is a scheduler that controls resources among all applications in the system
 The NodeManager is the per-machine agent that is responsible for containers, monitoring of their resource usage (CPU, memory, disk, network), and reporting to the ResourceManager
[10] Introduction to Yarn
Lecture 5 provides more detailed information on the scheduler Hadoop YARN and its usage
[12] Apache Hadoop Web page
18 / 56
Understand AWS Cloud Service Portfolio – Analytics
Multiple analytics products
 Extracting insights and actionable information from data requires technologies like analytics & machine learning
Products & usage
 Amazon Athena: Serverless Query Service
 Amazon Elastic MapReduce: Hadoop
 Amazon Elasticsearch Service: Elasticsearch on AWS
 Amazon Kinesis: Streaming Data
 Amazon QuickSight: Business Analytics
 Amazon Redshift: Data Warehouse
 …
[9] AWS Web page
Apache Spark on Hadoop YARN is natively supported in Amazon Elastic MapReduce (EMR); the AWS management console makes it easy to create and manage Apache Spark clusters
[11] Apache Spark on Amazon EMR
Lecture 8 provides more detailed information about AWS products with Infrastructure-As-A-Service
19 / 56
Chicago Restaurant Food Inspection Example (cf. Lecture 2)
The execution of the import cell using PySpark kernel notebooks automatically generates in the background a SparkSession with a SparkContext in Azure HDInsight Clusters
The startup and printout of the Spark application status might take a while when executing the cell (shift+enter)
(remember that even idle SparkSessions and Azure HDInsight clusters cost per minute)
A SparkSession (since version 2.0) subsumes the earlier SQLContext and HiveContext; it is the entry point for programming Spark with the Dataset & DataFrame API
[12] Apache Hadoop Web page
(after having started Jupyter in the cloud, we started our example application by reading data into a dataframe)
HDInsight
20 / 56
Spark Dataset & DataFrame API
Example: load dataset
 Read/load the dataset from a CSV file
 Generate a Spark dataframe for further use
A Spark DataFrame is (like an RDD) also an immutable distributed collection of data
The key difference between DataFrames & RDDs is that the data is organized into named columns, which is closer to the traditional handling of data in relational databases with a clear structure/schema
Our assignments will work with more examples of using the Apache Spark DataFrame API
HDInsight
[30] Databricks
(a Dataset is a collection of strongly-typed Java Virtual Machine objects, dictated by a case class you define in Scala or a class in Java)
(a DataFrame is an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped Java Virtual Machine object)
21 / 56
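The named-column idea can be illustrated without a cluster: below, Python's csv module (standing in for spark.read.csv) reads a made-up two-column inspection file into dict rows, so fields are accessed by column name as in a DataFrame. The data and column names are invented for illustration:

```python
import csv
import io

# Hypothetical stand-in for the Chicago inspections CSV file
raw = "name,results\nCafe A,Pass\nDiner B,Fail\nGrill C,Pass\n"

# Each row becomes a dict keyed by the column names from the header line
rows = list(csv.DictReader(io.StringIO(raw)))

# Access by column name -- the key difference from an opaque RDD element
passed = [r["name"] for r in rows if r["results"] == "Pass"]
print(passed)   # ['Cafe A', 'Grill C']
```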
Google Cloud Example – Data Analytics & Machine Learning
The GAE services re-use other existing big data services of the Google Cloud Platform through well-defined interfaces such as Google BigQuery or Google Dataproc for Apache Spark & Hadoop
[13] Google Dataproc service
Google Dataproc enables scalable & automated cluster management for Apache Spark with quick deployment, logging, and monitoring in order to focus on data analysis and not on infrastructure
Lecture 9 provides more detailed information about Google Cloud Platform-As-A-Service tools
22 / 56
IBM Cloud Example – Data Science
IBM Cloud provides 150+ SAAS-based applications for a wide variety of business processes or software for highly specialized purposes such as data science or talent management & recruiting
[14] IBM Cloud SAAS
(executors: 2 worker nodes vs. 30)
Lecture 10 provides detailed information about IBM Cloud Software-As-A-Service Applications
23 / 56
WordCount Application Example including Spark Benefits
Example: error message log analysis (e.g. wordcount)
 Load a log into memory, then interactively search for patterns
 Benefit: scaled to 1 TB of data in 5-7 sec (vs. 170 sec for on-disk data)
 Reasoning: takes advantage of the configured cache at the workers (‘in-memory’)
[6] Parallel Programming with Spark
(.map returns the iterable resulting from applying the given function to each element of this iterable, e.g. (word, 1))
(.flatMap applies the given function to each element of this iterable and THEN concatenates the results, e.g. [1,2], [], [4] → [1,2,4])
(spark is a created SparkContext object) (inherently parallel)
[8] Advanced Spark
24 / 56
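The wordcount pipeline flatMap → map → reduceByKey can be emulated step by step in plain Python; Spark would run the same three steps in parallel over partitions. The sample log lines here are made up:

```python
from collections import defaultdict

# Emulates: lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)
lines = ["error in task", "error in stage"]

# flatMap: split each line into words, THEN concatenate all the results
words = [w for line in lines for w in line.split()]

# map: turn every word into a (word, 1) pair
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word (Spark does this per partition, in parallel)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))   # {'error': 2, 'in': 2, 'task': 1, 'stage': 1}
```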
[Video] Apache Spark Summary
[15] YouTube, Apache Spark
25 / 56
Advanced Concepts & Libraries
26 / 56
Cloud Computing Approach & MS Azure HDInsight Revisited
Microsoft Azure Cloud
 Wide variety of different cloud-based services & resources for many application areas
 Managed via the Microsoft Azure Portal Hub
 Needs a Microsoft Azure account
HDInsight cluster ‘service’
 Fully managed service to deploy known open source analytics frameworks
 Easy deployment & configuration via the Azure Portal Hub
 Pay-per-use model
HDInsight
[28] Azure Portal Hub
[29] Azure HDInsight Web page
(known open source analytics frameworks)
27 / 56
Apache Spark as Base Platform
Generalized platform for a wide variety of applications
 Extension towards Directed Acyclic Graphs (DAGs)
 Improvements in ‘data sharing capabilities’ & ‘in-memory computing’
[4] Apache Spark
Apache Spark can be considered as a base platform for advanced analytics such as provided by higher level libraries: (a) Spark SQL; (b) Spark Streaming; (c) MLlib; and (d) GraphX library
[16] Berkeley Data Analytics Stack
28 / 56
Apache Spark SQL Library
Usage example
 Connect the ‘Apache Spark context’ to an existing Apache Hive database
 Apply functions to the results of SQL queries
[4] Apache Spark
Apache Spark Structured Query Language (SQL) library works with structured data and enables queries for data inside Spark programs using SQL (e.g. based on Apache Hive/HiveQL)
[17] Apache Hive
(i.e. HiveQL, a SQL-like language)
[32] Edureka Spark SQL Tutorial
29 / 56
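The Spark SQL pattern of this slide, running a SQL query inside a program and then applying ordinary functions to its rows, can be sketched with Python's built-in sqlite3 standing in for the Hive setup; the table and data are invented for illustration (in PySpark one would call spark.sql(...) instead):

```python
import sqlite3

# sqlite3 stands in for the Hive/HiveQL database of the slide
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO inspections VALUES (?, ?)",
                 [("Cafe A", 80), ("Diner B", 45), ("Grill C", 95)])

# SQL query inside the program
rows = conn.execute(
    "SELECT name, score FROM inspections WHERE score >= 50 ORDER BY name"
).fetchall()

# Apply an ordinary function to the query results, as the slide describes
labels = [f"{name}: pass" for name, score in rows]
print(labels)   # ['Cafe A: pass', 'Grill C: pass']
```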
Chicago Restaurant Food Inspection Example (cf. Lecture 2)
pyspark is a Python Application Programming Interface (API) for Spark available in Microsoft Azure HDInsight clusters
Import the required libraries for our logistic regression application via selected libraries from pyspark and the Spark Machine Learning Library (MLlib)
Import selected libraries for better analyzing data & creating features using the Apache Spark Structured Query Language (SQL) library
Apache Hive offers an optimized column-style table for ‘big data queries’ by storing columns separately from each other, reducing the reading, decompression & processing of data that is not even queried
[17] Apache Hive
30 / 56
Apache Spark Streaming Library
[4] Apache Spark
The Apache Spark Streaming library enables users to write streaming jobs the same way they would write batch jobs, or to combine streaming with batch and interactive queries or filters
Usage example
 Combine streaming with batch and interactive queries or use ‘windows’
 E.g. find words with a higher frequency than in historic data
 Recovers lost work without any extra code for users to write (cf. the RDD fault-tolerance features)
31 / 56
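The sliding-window idea can be sketched in plain Python: keep the last few micro-batches in a bounded deque and recompute word frequencies over the window. The window size and the batch contents below are made up for illustration:

```python
from collections import Counter, deque

# Window over the 3 most recent micro-batches (older batches fall out automatically)
window = deque(maxlen=3)

def process_batch(batch_words):
    """Add one micro-batch, then return word frequencies over the whole window."""
    window.append(Counter(batch_words))
    total = Counter()
    for batch in window:
        total += batch
    return total

process_batch(["error", "ok"])
process_batch(["ok", "ok"])
freq = process_batch(["error"])
print(freq)   # Counter({'ok': 3, 'error': 2})
```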
Spark Machine Learning Library (MLlib)
[4] Apache Spark
Apache Spark Machine Learning library (MLlib) is a scalable and parallel machine learning library with a number of implemented algorithms for classification, clustering, and regression
Usage example
 E.g. clustering data with K-Means
 MLlib contains high-quality algorithms that leverage iteration (a key benefit of Apache Spark)
 Parallel algorithms for clustering, classification, and regression
32 / 56
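One K-Means iteration, the kind of step MLlib parallelizes and repeats until convergence, can be sketched in plain Python with 1-D points and k = 2; the points and starting centroids are made up:

```python
# One K-Means iteration: assign each point to the nearest centroid,
# then recompute every centroid as the mean of its assigned points.
points = [1.0, 1.5, 9.0, 10.0]
centroids = [0.0, 5.0]

# Assignment step: nearest centroid per point
clusters = {0: [], 1: []}
for p in points:
    idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
    clusters[idx].append(p)

# Update step: centroid = mean of its assigned points (skip empty clusters)
centroids = [sum(c) / len(c) for c in clusters.values() if c]
print(centroids)   # [1.25, 9.5]
```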
Chicago Restaurant Food Inspection Example (cf. Lecture 2)
pyspark is a Python Application Programming Interface (API) for Spark available in Microsoft Azure HDInsight clusters
Import the required libraries for our logistic regression application via selected libraries from pyspark and the Spark Machine Learning Library (MLlib)
Import selected libraries for better analyzing data & creating features using the Apache Spark Structured Query Language (SQL) library
MLlib offers several feature engineering techniques (e.g. tokenizers for Natural Language Processing) & offers algorithms for classification like Logistic Regression or Random Forests
(i.e. using a logistic function like the sigmoid function in Logistic Regression)
33 / 56
Pipelines: Logistic Regression on Spark Cluster Revisited
[31] Apache Spark ML Pipelines
Pipelines are part of the Machine Learning library in Apache Spark
The Machine Learning (ML) Pipelines offer a set of high-level APIs that are built on top of DataFrames
ML Pipelines enable users to create & tune common practical machine learning pipelines
In the Chicago Food Inspection example the pipeline first generates features using the Tokenizer & HashingTF before applying the actual machine learning via logistic regression
Pipeline explained
 Before using machine learning algorithms, experts perform feature engineering and/or feature selection on the datasets
 • E.g. split each violation’s text into words
 • E.g. convert each violation’s words into a numerical feature vector
 Once good features in the datasets are found, processing continues with concrete machine learning algorithms
 • E.g. fit a prediction model using feature vectors & labels
34 / 56
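The pipeline stages described above can be sketched in plain Python: tokenize the violation text, hash the words into a fixed-size feature vector (as HashingTF does), and score the vector with a logistic (sigmoid) function. The sample text and the weight vector are invented for illustration; a real pipeline would learn the weights by training:

```python
import math

def tokenize(text):
    """Tokenizer stage: split the violation text into lowercase words."""
    return text.lower().split()

def hash_features(words, dim=8):
    """HashingTF stage: map words into a fixed-size term-frequency vector."""
    vec = [0.0] * dim
    for w in words:
        vec[hash(w) % dim] += 1.0
    return vec

def logistic(features, weights):
    """Logistic regression scoring: sigmoid of the weighted feature sum."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

feats = hash_features(tokenize("food stored at unsafe temperature"))
score = logistic(feats, [0.3] * 8)   # assumed, untrained weights
print(score)
```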
Spark Machine Learning Library (MLlib) – Parallel Algorithms
Classification
 Algorithms include logistic regression and naïve Bayes
Regression
 Generalized linear regression and survival regression
Tree-based approaches
 Standard decision trees, random forests, gradient-boosted trees
Alternating least squares (ALS)
 Used to create recommendation engines
[4] Apache Spark
Clustering
 K-Means and Gaussian Mixture Models (GMM)
Pattern mining
 Frequent itemsets & association rule mining
35 / 56
Machine Learning Applications & Performance
A complementary Data Mining or Machine Learning course offers more specific algorithm details
[6] Parallel Programming with Spark
E.g. Logistic Regression: Explore unknown trends in datasets E.g. K-Means Clustering: Explore unknown groups in datasets
36 / 56
Apache Spark GraphX Library
[4] Apache Spark
Apache Spark GraphX enables iterative graph computations within a single system, with performance comparable to the fastest specialized graph processing systems
Usage example
 Tables and graphs are composable ‘views’ of the same physical data
 Each ‘view’ has its own operators that exploit the semantics of the view to achieve efficient execution
[16] Berkeley Data Analytics Stack
(e.g. the ranking of Web pages can be represented as a graph whereby the rank computation depends only on the neighbours that link to the corresponding page: a graph-parallel pattern)
37 / 56
Web Application Example: PageRank Technique – Revisited
Change of life: availability of efficient & accurate Web search
 Many search engines exist, but the search engine Google was the pioneer
 The innovation by Google was a nontrivial technological advance: ‘PageRank’ (the term comes from Larry Page, its inventor and a Google founder)
Goal: an approach against so-called ‘term spam’
 Provides techniques against approaches that fool search engines into believing a page is about something it is not (e.g. terms hidden in the background)
Basic algorithm
 Idea: analyse links and assign weights to each vertex in a ‘Web graph’ by iteratively aggregating the weights of inbound links
PageRank is a tool for evaluating the importance of Web pages in a way that is not easy to fool
modified from [23] Mining of Massive Datasets
38 / 56
Page Rank Technique
Big data problem: the ‘whole Web of links’
 Memory challenge for the whole Web: in order to compute the PageRank at a given iteration we also need the PageRank of the previous iteration
 E.g. two extremely large arrays are required: PR(t) and PR(t+1)
 It is important in implementations not to mix up these two time steps
Simplified formula
 R(u) = c · Σv∈Bu R(v) / Nv, where c is a factor of normalization
Better formula
 Solves the ‘rank sink’ problem: two pages that just link to each other plus one incoming link ‘accumulate’ rank
 Add ‘random picks’ (aka a rank source): R′(u) = c · Σv∈Bu R′(v) / Nv + c · E(u)
[25] Big-data.tips, ‘Page Rank Technique‘
PageRank is driven by the probability of landing on one page or another (aka ‘being at a node’); this probability can be interpreted as: the higher the chance, the more important the page
[24] The PageRank Citation Ranking
(Bu is the set of pages that point to u; Nv is the number of links from v)
(the added decay factor E(u), aka ‘random picks’, is called a rank source)
39 / 56
Web Application Example: PageRank (1)
Goal: give Web pages a rank (score)
 Reasoning: the score is based on two different types of links to a page
 Links from many pages bring a high rank
 A link from a high-rank page gives a high rank
Easy problem, complex algorithm
 Multiple stages of map & reduce
 Benefits from in-memory caching
 Multiple iterations over the same data
 Known also to have been used initially with plain map-reduce
[6] Parallel Programming with Spark
40 / 56
Web Application Example: PageRank (2)
Simplified basic algorithm
 Start each page at a rank of 1
 On each iteration, have page p contribute rank_p / |neighbors_p| to each of its neighbors
 Set each page’s rank to 0.15 + 0.85 × contribs
[6] Parallel Programming with Spark
(aka ‘link out‘)
41 / 56
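The simplified algorithm above can be run directly in plain Python on a tiny made-up link graph (in PySpark, links and ranks would be RDDs joined on each iteration):

```python
# Simplified PageRank: every page starts at rank 1, contributes
# rank / |out-links| to each neighbor, and new rank = 0.15 + 0.85 * contribs.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # tiny example web graph
ranks = {page: 1.0 for page in links}

for _ in range(20):                                  # iterate until roughly stable
    contribs = {page: 0.0 for page in links}
    for page, neighbors in links.items():
        for n in neighbors:
            contribs[n] += ranks[page] / len(neighbors)
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

print(ranks)   # page "c" (two in-links) ends up ranked above "b" (one in-link)
```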
Web Application Example: PageRank (3)
42 / 56
Web Application Example: PageRank (4)
43 / 56
Web Application Example: PageRank (5)
44 / 56
Web Application Example: PageRank (6)
45 / 56
Web Application Example: PageRank (7)
46 / 56
Web Application Example: PageRank (8)
[24] The PageRank Citation Ranking
47 / 56
Web Application Example: PageRank (9)
Performance comparisons
 Using Apache Spark (i.e. cache, etc.) vs. plain Apache Hadoop
[6] Parallel Programming with Spark
48 / 56
Apache Spark Applications in Business
Spark application use cases at Yahoo
 Personalizing news pages for Web visitors: ML algorithms running on Spark figure out what individual users are interested in and categorize news stories as they arise to determine what types of users would be interested in reading them, …
 Running analytics for advertising: Hive (cf. previous lectures) on Spark (‘Shark’) offers interactive capability; existing BI tools can view and query the advertising analytics data collected in Hadoop
Spark application use cases at Conviva
 One of the largest streaming video companies on the Internet; uses Spark Streaming to learn network conditions in real time
Spark application use cases at ClearStory
 Relies on Spark technology as one of the core underpinnings of its interactive & real-time product
[20] Yahoo
[21] Conviva
[22] ClearStory
49 / 56
Spark Machine Learning Library (MLlib) – Revisited
[4] Apache Spark
Apache Spark Machine Learning library (MLlib) is a scalable and parallel machine learning library with a number of implemented algorithms for classification, clustering, and regression
Usage example
 E.g. clustering data with K-Means
 MLlib contains high-quality algorithms that leverage iteration (a key benefit of Apache Spark)
 Parallel algorithms for clustering, classification, and regression
50 / 56
[Video] K-Means Clustering Example
[26] Animation of the k-means clustering algorithm, YouTube Video
51 / 56
Lecture Bibliography
52 / 56
Lecture Bibliography (1)
[1] Species Iris Group of North America Database, Online: http://www.signa.org
[2] Github template: Deploy a Spark cluster in Azure HDInsight,Online: https://github.com/Azure/azure-quickstart-templates/tree/master/101-hdinsight-spark-linux
[3] J. Dean, S. Ghemawat, ‘MapReduce: Simplified Data Processing on Large Clusters’, OSDI'04: Sixth Symposium on Operating System Design and Implementation, December, 2004.
[4] Apache Spark Web page, Online: http://spark.apache.org/
[5] Big Data Tips, ‘Apache Spark vs Hadoop‘, Online: http://www.big-data.tips/apache-spark-vs-hadoop
[6] M. Zaharia, ‘Parallel Programming with Spark’, 2013
[7] Amazon Web Services EC2 On-Demand Pricing models, Online: https://aws.amazon.com/ec2/pricing/on-demand/
[8] Reynold Xin, ‘Advanced Spark’, 2014, Spark Summit Training
[9] Amazon Web Services Web Page, Online: https://aws.amazon.com
[10] SlideShare, ‘Introduction to Yarn and MapReduce’, Online: https://www.slideshare.net/cloudera/introduction-to-yarn-and-mapreduce-2
[11] Apache Spark on Amazon EMR, Online: https://aws.amazon.com/emr/details/spark/
53 / 56
Lecture Bibliography (2)
[12] Apache Hadoop Web page, Online: http://hadoop.apache.org/
[13] Google DataProc Service, Online: https://cloud.google.com/dataproc/
[14] IBM Cloud SAAS Applications, Online: https://www.ibm.com/cloud-computing/products/saas/
[15] YouTube Video, ‘Solving Big Data with Apache Spark’, Online: https://www.youtube.com/watch?v=WFoFLJOCOLA
[16] The Berkeley Data Analytics Stack: Present and Future, Online: http://events-tce.technion.ac.il/files/2014/04/Michael-Franklin-UC-Berkeley.pdf
[17] Apache Hive, Online: https://hive.apache.org/
[18] Big Data Tips, Spark Summit, Online: http://www.big-data.tips/spark-summit
[19] Docker Hub, Online: https://hub.docker.com/
[20] Yahoo Web page, Online: http://www.yahoo.com
[21] Conviva Web page, Online: http://www.conviva.com
[22] ClearStory Web page, Online: http://www.clearstorydata.com
54 / 56
Lecture Bibliography (3)
[23] Mining of Massive Datasets, Online: http://infolab.stanford.edu/~ullman/mmds/book.pdf
[24] L. Page, S. Brin, R. Motwani, T. Winograd, ‘The PageRank Citation Ranking: Bringing Order to the Web‘, Technical Report of the Stanford InfoLab, 1999
[25] www.big-data.tips, ‘Page Rank Technique‘, Online: http://www.big-data.tips/page-rank
[26] YouTube Video, ‘Animation of the k-means algorithm using Matlab 2013‘, Online: http://www.youtube.com/watch?v=5FmnJVv73fU
[27] HDFS Architecture Guide, Online: http://hadoop.apache.org/docs/stable/hdfs_design.html
[28] Microsoft Azure Portal Hub,Online: https://portal.azure.com/#create/hub
[29] Microsoft Azure HDInsight Cluster Web page, Online: https://azure.microsoft.com/en-us/services/hdinsight/
[30] Databricks, A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets, Online: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
[31] Apache Spark Machine Learning Pipelines,Online: https://spark.apache.org/docs/latest/ml-pipeline.html
[32] Edureka Spark SQL Tutorial,Online: https://www.edureka.co/blog/spark-sql-tutorial/
55 / 56
56 / 56