PARALLEL & SCALABLE MACHINE LEARNING & DEEP LEARNING
Prof. Dr.-Ing. Morris Riedel
Adjunct Associated Professor
School of Engineering and Natural Sciences, University of Iceland
Research Group Leader, Juelich Supercomputing Centre, Germany
Apache Spark for Cloud Applications
September 13th, 2018
Room Stapi 108
Cloud Computing & Big Data
LECTURE 3
Review of Lecture 2 – Machine Learning Models in Clouds
Machine Learning Fundamentals
1. Some pattern exists
2. No exact mathematical formula
3. Data exists
Lecture 3 – Apache Spark for Cloud Applications
‘Simple‘ Perceptron Model
Logistic Regression Model
[1] Image sources: Species Iris Group of North America Database
wi
HDInsight
2 / 56
Outline of the Course
1. Cloud Computing & Big Data
2. Machine Learning Models in Clouds
3. Apache Spark for Cloud Applications
4. Virtualization & Data Center Design
5. Map-Reduce Computing Paradigm
6. Deep Learning driven by Big Data
7. Deep Learning Applications in Clouds
8. Infrastructure-As-A-Service (IAAS)
9. Platform-As-A-Service (PAAS)
10. Software-As-A-Service (SAAS)
11. Data Analytics & Cloud Data Mining
12. Docker & Container Management
13. OpenStack Cloud Operating System
14. Online Social Networking & Graphs
15. Data Streaming Tools & Applications
16. Epilogue
+ additional practical lectures for our
hands-on exercises in context
Practical Topics
Theoretical / Conceptual Topics
3 / 56
Outline
Basic Apache Spark Capabilities
 Resilient Distributed Datasets
 Apache Spark vs. Apache Hadoop
 Parallel Job & In-Memory Execution
 Transformations & Actions
 Word Count Application Example
Libraries & Open Source Frameworks
 Structured Query Language (SQL)
 Streaming Library
 Machine Learning Library (MLlib)
 GraphX Library
 Machine Learning Applications in Clouds
Promises from previous lecture(s):
Lecture 1: Lecture 3 will provide details on data-intensive services & computing using Apache Spark
Lecture 1: Lecture 3 will give in-depth details on using in-memory processing & Apache Spark & MLlib
Lecture 2: Lectures 3 & 6 provide more insights into Logistic Regression & Gradient Descent Optimization
Lecture 2: Lecture 3 provides a detailed introduction to Apache Spark and its related ‘service ecosystem’
Lecture 2: Lecture 3 provides an introduction to Apache Spark with SparkSession, SparkContext and YARN
Lecture 2: Lecture 3 provides a detailed introduction to Apache Spark and its Spark SQL dataframe concepts
Lecture 2: Lecture 3 provides a detailed introduction to Apache Spark and its ‘service ecosystem’ with Hive
Lecture 2: Lecture 3 provides details about Apache Spark and other available MLlib models & pipelines
Lecture 2: Lecture 3 provides more examples on training and test datasets and different model evaluations
Lecture 2: Lecture 3 offers more insights into other machine learning approaches in different cloud systems
4 / 56
Basic Apache Spark Capabilities
5 / 56
Food Inspection in Chicago: Advanced Application Revisited
1. Some pattern exists: we believe that a pattern in the recorded ‘quality violations’ of restaurants somehow influences whether a food inspection passes or fails (binary classification)
2. No exact mathematical formula: to the best of our knowledge there is no precise formula for this problem
3. Data exists: data collection from the City of Chicago
The goal of this advanced machine learning application is to predict the outcome of food inspections of new Chicago restaurants, given the violations already recorded for other restaurants in the City of Chicago
6 / 56
Deploy Spark Cluster in MS Azure HDInsight Revisited
Microsoft Azure HDInsight Spark clusters can be deployed via pre-configured resource manager templates that use Azure computing resources & Azure Storage Blobs as cluster storage
A wide variety of templates is available on GitHub pages for various general cloud services
[2] Github template Web page
HDInsight
7 / 56
HDInsight
Networking & Big Data Impacts on Cloud Computing
Requirements for scalable programming models and tools
 CPU speed has surpassed the I/O capabilities of existing cloud resources
 Data-intensive clouds with advanced analytics and analysis capabilities
 Considering moving compute tasks to data vs. moving data to compute
(MS Azure is only one concrete example of many cloud solutions)
Requirements for reliable filesystems
 Traditional parallel filesystems need to prove their ‘big data’ feasibility
 Emerging new forms of filesystems assume that hardware errors occur constantly
 E.g. the Hadoop Distributed File System (HDFS)
[27] HDFS Architecture Guide
focus in this lecture
Lecture 5 will give in-depth details on map-reduce & its implementation Hadoop & the filesystem
8 / 56
Big Data Analytics Frameworks – Revisited
Distributed processing
 ‘Map-reduce via files’: tackle large problems with many small tasks
 Advantage of ‘data replication’ via a specialized distributed file system
 E.g. Apache Hadoop
In-memory processing
 Perform many operations fast via ‘in-memory’ computing
 Enables tasks such as ‘map-reduce in-memory’
 E.g. Apache Spark, Apache Flink
Big Data analytics frameworks shift the approach from ‘bring data to compute resources‘ into ‘bring compute tasks close to data‘
[4] Apache Spark
[3] Map-Reduce
focus in this lecture
Lecture 5 offers more insights into the map-reduce computing paradigm including applications
9 / 56
Apache Spark uses a Distributed Filesystem & Map-Reduce
Taking into account the ever-increasing amounts of ‘big data’
 Think: ‘Big Data’ is not only denoted by volume; there is also velocity, variety, …
Reliable, scalable ‘big data’ storage method
 Data is (almost always) accessible even if failures of nodes/disks occur
 Enables access to large quantities of data with good performance
Fault-tolerant and scalable data analytics processing approach
 Data analytics computations must be divided into small tasks for easy restart
 The restart of one data analytics task has no effect on other active tasks
A specialized distributed file system is required that assumes failures as the default
A data analytics processing programming model is required that is easy to use and simple to program, with fault tolerance already within its design
Lecture 5 provides more information on the Hadoop Distributed File System (HDFS) used by Spark
10 / 56
Apache Spark Basic Properties
Production deployments
 Adobe, Intel and Yahoo have production deployments
Open source (initially invented at UC Berkeley)
 One of the ‘most active big data’ open source projects today
Idea: support two properties beyond traditional map-reduce
 Extension towards Directed Acyclic Graphs (DAGs)
 Improvements in ‘data sharing capabilities’ & ‘in-memory computing’
Key concepts: ‘Resilient Distributed Dataset (RDD)’ & usability
 RDDs are a fault-tolerant collection of elements to operate on in parallel
 Good usability w.r.t. writing less code via rich APIs
 Uses the Hadoop Distributed File System (HDFS)
Apache Spark is a fast & general engine for large-scale data processing optimized for memory usage
Open source & compatible with cloud data storage systems like Amazon Web Services (AWS) S3
[4] Apache Spark
[12] Apache Hadoop Web page
11 / 56
Resilient Distributed Datasets (RDDs)
Key approach
 Immutable collections of objects across a cluster (i.e. partitions)
 Created through parallel transformations (e.g. map, filter, group, etc.)
 Controllable persistence & cached partitions (e.g. caching in RAM, if it fits in memory)
 Automatically rebuilt on failure: RDDs track the transformations that created them (i.e. their lineage) to recompute lost data
The key idea of Resilient Distributed Datasets (RDDs) in Apache Spark is that they enable working with distributed data collections as if they were local data collections
[6] Parallel Programming with Spark
12 / 56
Apache Spark RDD Transformations – Example
[6] Parallel Programming with Spark
(the use of ‘lambda’ enables small anonymous functions, e.g. f(x) = x*x, returning the result directly as an expression)
(domain decomposition: a parallelized collection holding the numbers 1, 2, and 3 in 3 different slices; the slice count can be configured for bigger data)
(the transformation is computed on nums, not on squares or even!)
(the reduce starts with 0, not with x itself!)
13 / 56
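The transformations above can be mimicked in plain Python on a local list; this is only a sketch of the semantics (in actual PySpark the collection would be created with sc.parallelize and processed in parallel over slices), with the variable names nums, squares and even taken from the slide:

```python
from functools import reduce

# Local stand-in for: nums = sc.parallelize([1, 2, 3])  (3 slices in Spark)
nums = [1, 2, 3]

# map with a lambda: the anonymous function f(x) = x * x applied to every element
squares = list(map(lambda x: x * x, nums))           # [1, 4, 9]

# filter: keep only the even squares
even = list(filter(lambda x: x % 2 == 0, squares))   # [4]

# reduce: fold the elements together, starting with 0 (not with x itself!)
total = reduce(lambda acc, x: acc + x, nums, 0)      # 6

print(squares, even, total)
```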
Apache Spark RDD Actions – Example
[6] Parallel Programming with Spark
(collect returns elements of the dataset as an array back to the driver program, often used in debugging)
14 / 56
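The transformation/action split is about laziness: transformations only record what to do, while an action such as collect() triggers the actual computation and returns results to the driver. A plain-Python generator, used here only as a stand-in for an RDD, shows the same behavior:

```python
nums = [1, 2, 3, 4]

# "Transformation": building the generator computes nothing yet (lazy, like an RDD)
squares = (x * x for x in nums)

# "Action" like collect(): materialize the elements back in the driver program
collected = list(squares)     # the computation happens only here

# Further actions such as count() and reduce() would likewise force evaluation
count = len(collected)
total = sum(collected)
print(collected, count, total)
```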
Apache Spark vs. Apache Hadoop
Apache Hadoop: DATA INPUT → HDFS read (slow) → Iteration 1 → HDFS write (slow) → HDFS read (slow) → Iteration 2 → HDFS write (slow) → HDFS read (slow) → RESULT
Apache Spark: DATA INPUT → HDFS read (slow) → Iteration 1 → RAM write (fast) → RAM read (fast) → Iteration 2 → RAM write (fast) → RAM read (fast) → RESULT
Spark results
 Up to 100x faster than traditional Hadoop (for some applications)
 Requires ‘big memory’ systems to perform well
One key difference between Apache Spark and Apache Hadoop is that slow HDFS reads and writes are exchanged for fast RAM reads and writes across several iterations that might be part of a larger workflow
A workflow is a sequence of processing tasks on data that leads to results via ‘iterations’
[5] big-data.tips, Apache Spark vs. Hadoop
15 / 56
Amazon Web Services – High Memory Computing Resource
[7] AWS EC2 Pricing
16 / 56
Distributed Operations & Parallel Job Execution
Different parallel options for tasks SparkContext is head/master and
used for starting applications E.g. use local threads E.g. use cluster via Apache
(e.g. Hadoop YARN)
Spark Task Scheduler Supports general task graphs Cache-aware data re-use & locality
(i.e. multiple iterations over databenefit most, e.g. machine learning)
Partitioning-aware to avoid shuffles
Transformations build RDDs from other RDDs (i.e. map, filter, groupBy, join, union, etc.); actions return a real result or write it to storage (e.g. count, collect, save, etc.)
[6] Parallel Programming with Spark
(also named the Spark Driver master)
[8] Advanced Spark
[12] Apache Hadoop Web page
17 / 56
Hadoop Resource Manager Yarn & Wordcount Example
Apache Hadoop offers the Yet Another Resource Negotiator (YARN) scheduler for map-reduce jobs
 The idea of YARN is to split up the functionalities of resource management & job scheduling/monitoring
 The ResourceManager is a scheduler that controls resources among all applications in the system
 The NodeManager is the per-machine agent that is responsible for containers, monitoring of their resource usage (CPU, memory, disk, network), and reporting to the ResourceManager
[10] Introduction to Yarn
Lecture 5 provides more detailed information on the scheduler Hadoop YARN and its usage
[12] Apache Hadoop Web page
18 / 56
Understand AWS Cloud Service Portfolio – Analytics
Multiple analytics products
 Extracting insights and actionable information from data requires technologies like analytics & machine learning
Products & usage
 Amazon Athena: Serverless Query Service
 Amazon Elastic MapReduce: Hadoop
 Amazon Elasticsearch Service: Elasticsearch on AWS
 Amazon Kinesis: Streaming Data
 Amazon QuickSight: Business Analytics
 Amazon Redshift: Data Warehouse
 …
[9] AWS Web page
Apache Spark on Hadoop YARN is natively supported in Amazon Elastic MapReduce (EMR); the AWS management console makes it easy to create and manage Apache Spark clusters
[11] Apache Spark on Amazon EMR
Lecture 8 provides more detailed information about AWS products with Infrastructure-As-A-Service
19 / 56
Chicago Restaurant Food Inspection Example (cf. Lecture 2)
The execution of the import cell using PySpark kernel notebooks automatically generates in the background a SparkSession with a SparkContext in Azure HDInsight Clusters
The startup and printout of the Spark application status might take a while when executing the cell (shift+enter)
(remember that even idle SparkSessions and Azure HDInsight clusters cost per minute)
A SparkSession (since version 2.0) subsumes the earlier SQLContext and HiveContext; it is the entry point for programming Spark with the Dataset & DataFrame API
[12] Apache Hadoop Web page
(after having started Jupyter in the cloud, we started our example application by reading data into a dataframe)
HDInsight
20 / 56
Spark Dataset & DataFrame API
Example: load dataset
 Read/load the dataset from a CSV file
 Generate a Spark dataframe for further use
A Spark DataFrame is (like an RDD) also an immutable distributed collection of data
The key difference between DataFrames & RDDs is that the data is organized into named columns, which is closer to the traditional handling of data in relational databases with a clear structure/schema
Our assignments will work with more examples of using the Apache Spark DataFrame API
HDInsight
[30] Databricks
(a Dataset is a collection of strongly-typed Java Virtual Machine objects, dictated by a case class you define in Scala or a class in Java)
(a DataFrame is an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped Java Virtual Machine object)
21 / 56
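The named-column idea can be illustrated without a cluster: below, Python's csv module (standing in for spark.read.csv) reads a made-up two-column inspection file into dict rows, so fields are accessed by column name as in a DataFrame. The data and column names are invented for illustration:

```python
import csv
import io

# Hypothetical stand-in for the Chicago inspections CSV file
raw = "name,results\nCafe A,Pass\nDiner B,Fail\nGrill C,Pass\n"

# Each row becomes a dict keyed by the column names from the header line
rows = list(csv.DictReader(io.StringIO(raw)))

# Access by column name -- the key difference from an opaque RDD element
passed = [r["name"] for r in rows if r["results"] == "Pass"]
print(passed)   # ['Cafe A', 'Grill C']
```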
Google Cloud Example – Data Analytics & Machine Learning
The GAE services re-use other existing big data services of the Google Cloud Platform through well-defined interfaces such as Google BigQuery or Google Dataproc for Apache Spark & Hadoop
[13] Google Dataproc service
Google Dataproc enables scalable & automated cluster management for Apache Spark with quick deployment, logging, and monitoring in order to focus on data analysis and not on infrastructure
Lecture 9 provides more detailed information about Google Cloud Platform-As-A-Service tools
22 / 56
IBM Cloud Example – Data Science
IBM Cloud provides 150+ SAAS-based applications for a wide variety of business processes or software for highly specialized purposes such as data science or talent management & recruiting
[14] IBM Cloud SAAS
(executors: 2 worker nodes vs. 30)
Lecture 10 provides detailed information about IBM Cloud Software-As-A-Service Applications
23 / 56
WordCount Application Example including Spark Benefits
Example: error message log analysis (e.g. wordcount)
 Load a log into memory, then interactively search for patterns
 Benefit: scaled to 1 TB of data in 5-7 sec (vs. 170 sec for on-disk data)
 Reasoning: takes advantage of the configured cache at the workers (‘in-memory’)
[6] Parallel Programming with Spark
(.map returns the iterable resulting from applying the given function to each element of this iterable, e.g. (word, 1))
(.flatMap applies the given function to each element of this iterable and THEN concatenates the results, e.g. [1,2], [], [4] → [1,2,4])
(spark is a created SparkContext object) (inherently parallel)
[8] Advanced Spark
24 / 56
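The wordcount pipeline flatMap → map → reduceByKey can be emulated step by step in plain Python; Spark would run the same three steps in parallel over partitions. The sample log lines here are made up:

```python
from collections import defaultdict

# Emulates: lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)
lines = ["error in task", "error in stage"]

# flatMap: split each line into words, THEN concatenate all the results
words = [w for line in lines for w in line.split()]

# map: turn every word into a (word, 1) pair
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word (Spark does this per partition, in parallel)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))   # {'error': 2, 'in': 2, 'task': 1, 'stage': 1}
```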
[Video] Apache Spark Summary
[15] YouTube, Apache Spark
25 / 56
Advanced Concepts & Libraries
26 / 56
Cloud Computing Approach & MS Azure HDInsight Revisited
Microsoft Azure Cloud
 Wide variety of different cloud-based services & resources for many application areas
 Managed via the Microsoft Azure Portal Hub
 Needs a Microsoft Azure account
HDInsight cluster ‘service’
 Fully managed service to deploy known open source analytics frameworks
 Easy deployment & configuration via the Azure Portal Hub
 Pay-per-use model
HDInsight
[28] Azure Portal Hub
[29] Azure HDInsight Web page
(known open source analytics frameworks)
27 / 56
Apache Spark as Base Platform
Generalized platform for a wide variety of applications
 Extension towards Directed Acyclic Graphs (DAGs)
 Improvements in ‘data sharing capabilities’ & ‘in-memory computing’
[4] Apache Spark
Apache Spark can be considered as a base platform for advanced analytics such as provided by higher level libraries: (a) Spark SQL; (b) Spark Streaming; (c) MLlib; and (d) GraphX library
[16] Berkeley Data Analytics Stack
28 / 56
Apache Spark SQL Library
Usage example
 Connect the ‘Apache Spark context’ to an existing Apache Hive database
 Apply functions to the results of SQL queries
[4] Apache Spark
Apache Spark Structured Query Language (SQL) library works with structured data and enables queries for data inside Spark programs using SQL (e.g. based on Apache Hive/HiveQL)
[17] Apache Hive
(i.e. HiveQL, a SQL-like language)
[32] Edureka Spark SQL Tutorial
29 / 56
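The Spark SQL pattern of this slide, running a SQL query inside a program and then applying ordinary functions to its rows, can be sketched with Python's built-in sqlite3 standing in for the Hive setup; the table and data are invented for illustration (in PySpark one would call spark.sql(...) instead):

```python
import sqlite3

# sqlite3 stands in for the Hive/HiveQL database of the slide
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO inspections VALUES (?, ?)",
                 [("Cafe A", 80), ("Diner B", 45), ("Grill C", 95)])

# SQL query inside the program
rows = conn.execute(
    "SELECT name, score FROM inspections WHERE score >= 50 ORDER BY name"
).fetchall()

# Apply an ordinary function to the query results, as the slide describes
labels = [f"{name}: pass" for name, score in rows]
print(labels)   # ['Cafe A: pass', 'Grill C: pass']
```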
Chicago Restaurant Food Inspection Example (cf. Lecture 2)
pyspark is a Python Application Programming Interface (API) for Spark available in Microsoft Azure HDInsight clusters
Import the required libraries for our logistic regression application via selected libraries from pyspark and the Spark Machine Learning Library (MLlib)
Import selected libraries for better analyzing data & creating features using the Apache Spark Structured Query Language (SQL) library
Apache Hive offers an optimized column-style table for ‘big data queries’ by storing columns separately from each other, reducing the reading, decompression & processing of data that is not even queried
[17] Apache Hive
30 / 56
Apache Spark Streaming Library
[4] Apache Spark
The Apache Spark Streaming library enables users to write streaming jobs the same way they would write batch jobs, or to combine streaming with batch and interactive queries or filters
Usage example
 Combine streaming with batch and interactive queries or use ‘windows’
 E.g. find words with a higher frequency than in historic data
 Recovers lost work without any extra code for users to write (cf. the RDD fault-tolerance features)
31 / 56
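The sliding-window idea can be sketched in plain Python: keep the last few micro-batches in a bounded deque and recompute word frequencies over the window. The window size and the batch contents below are made up for illustration:

```python
from collections import Counter, deque

# Window over the 3 most recent micro-batches (older batches fall out automatically)
window = deque(maxlen=3)

def process_batch(batch_words):
    """Add one micro-batch, then return word frequencies over the whole window."""
    window.append(Counter(batch_words))
    total = Counter()
    for batch in window:
        total += batch
    return total

process_batch(["error", "ok"])
process_batch(["ok", "ok"])
freq = process_batch(["error"])
print(freq)   # Counter({'ok': 3, 'error': 2})
```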
Spark Machine Learning Library (MLlib)
[4] Apache Spark
Apache Spark Machine Learning library (MLlib) is a scalable and parallel machine learning library with a number of implemented algorithms for classification, clustering, and regression
Usage example
 E.g. clustering data with K-Means
 MLlib contains high-quality algorithms that leverage iteration (a key benefit of Apache Spark)
 Parallel algorithms for clustering, classification, and regression
32 / 56
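One K-Means iteration, the kind of step MLlib parallelizes and repeats until convergence, can be sketched in plain Python with 1-D points and k = 2; the points and starting centroids are made up:

```python
# One K-Means iteration: assign each point to the nearest centroid,
# then recompute every centroid as the mean of its assigned points.
points = [1.0, 1.5, 9.0, 10.0]
centroids = [0.0, 5.0]

# Assignment step: nearest centroid per point
clusters = {0: [], 1: []}
for p in points:
    idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
    clusters[idx].append(p)

# Update step: centroid = mean of its assigned points (skip empty clusters)
centroids = [sum(c) / len(c) for c in clusters.values() if c]
print(centroids)   # [1.25, 9.5]
```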
Chicago Restaurant Food Inspection Example (cf. Lecture 2)
pyspark is a Python Application Programming Interface (API) for Spark available in Microsoft Azure HDInsight clusters
Import the required libraries for our logistic regression application via selected libraries from pyspark and the Spark Machine Learning Library (MLlib)
Import selected libraries for better analyzing data & creating features using the Apache Spark Structured Query Language (SQL) library
MLlib offers several feature engineering techniques (e.g. tokenizers for Natural Language Processing) & offers algorithms for classification like Logistic Regression or Random Forests
(i.e. using a logistic function like the sigmoid function in Logistic Regression)
33 / 56
Pipelines: Logistic Regression on Spark Cluster Revisited
[31] Apache Spark ML Pipelines
Pipelines are part of the Machine Learning library in Apache Spark
The Machine Learning (ML) Pipelines offer a set of high-level APIs that are built on top of DataFrames
ML Pipelines enable users to create & tune common practical machine learning pipelines
In the Chicago Food Inspection example the pipeline first generates features using the Tokenizer & HashingTF before applying the actual machine learning via logistic regression
Pipeline explained
 Before using machine learning algorithms, experts perform feature engineering and/or feature selection on the datasets
 • E.g. split each violation’s text into words
 • E.g. convert each violation’s words into a numerical feature vector
 Once good features in the datasets are found, processing continues with concrete machine learning algorithms
 • E.g. fit a prediction model using feature vectors & labels
34 / 56
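The pipeline stages described above can be sketched in plain Python: tokenize the violation text, hash the words into a fixed-size feature vector (as HashingTF does), and score the vector with a logistic (sigmoid) function. The sample text and the weight vector are invented for illustration; a real pipeline would learn the weights by training:

```python
import math

def tokenize(text):
    """Tokenizer stage: split the violation text into lowercase words."""
    return text.lower().split()

def hash_features(words, dim=8):
    """HashingTF stage: map words into a fixed-size term-frequency vector."""
    vec = [0.0] * dim
    for w in words:
        vec[hash(w) % dim] += 1.0
    return vec

def logistic(features, weights):
    """Logistic regression scoring: sigmoid of the weighted feature sum."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

feats = hash_features(tokenize("food stored at unsafe temperature"))
score = logistic(feats, [0.3] * 8)   # assumed, untrained weights
print(score)
```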
Spark Machine Learning Library (MLlib) – Parallel Algorithms
Classification
 Algorithms include logistic regression and naïve Bayes
Regression
 Generalized linear regression and survival regression
Tree-based approaches
 Standard decision trees, random forests, gradient-boosted trees
Alternating least squares (ALS)
 Used to create recommendation engines
[4] Apache Spark
Clustering
 K-Means and Gaussian Mixture Models (GMM)
Pattern mining
 Frequent itemsets & association rule mining
35 / 56
Machine Learning Applications & Performance
A complementary Data Mining or Machine Learning course offers more specific algorithm details
[6] Parallel Programming with Spark
E.g. Logistic Regression: Explore unknown trends in datasets E.g. K-Means Clustering: Explore unknown groups in datasets
36 / 56
Apache Spark GraphX Library
[4] Apache Spark
Apache Spark GraphX enables iterative graph computations within a single system, with performance comparable to the fastest specialized graph processing systems
Usage example
 Tables and graphs are composable ‘views’ of the same physical data
 Each ‘view’ has its own operators that exploit the semantics of the view to achieve efficient execution
[16] Berkeley Data Analytics Stack
(e.g. the ranking of Web pages can be represented as a graph whereby the rank computation depends only on the neighbours that link to the corresponding page: a graph-parallel pattern)
37 / 56
Web Application Example: PageRank Technique – Revisited
Change of life: availability of efficient & accurate Web search
 Many search engines exist, but the search engine Google was the pioneer
 The innovation by Google was a nontrivial technological advance: ‘PageRank’ (the term comes from Larry Page, its inventor and a Google founder)
Goal: an approach against so-called ‘term spam’
 Provides techniques against approaches that fool search engines into believing a page is about something it is not (e.g. terms hidden in the background)
Basic algorithm
 Idea: analyse links and assign weights to each vertex in a ‘Web graph’ by iteratively aggregating the weights of inbound links
PageRank is a tool for evaluating the importance of Web pages in a way that is not easy to fool
modified from [23] Mining of Massive Datasets
38 / 56
Page Rank Technique
Big data problem: the ‘whole Web of links’
 Memory challenge for the whole Web: in order to compute the PageRank at a given iteration we also need the PageRank of the previous iteration
 E.g. two extremely large arrays are required: PR(t) and PR(t+1)
 It is important in implementations not to mix up these two time steps
Simplified formula
 R(u) = c · Σv∈Bu R(v) / Nv, where c is a factor of normalization
Better formula
 Solves the ‘rank sink’ problem: two pages that just link to each other plus one incoming link ‘accumulate’ rank
 Add ‘random picks’ (aka a rank source): R′(u) = c · Σv∈Bu R′(v) / Nv + c · E(u)
[25] Big-data.tips, ‘Page Rank Technique‘
PageRank is driven by the probability of landing on one page or another (aka ‘being at a node’); this probability can be interpreted as: the higher the chance, the more important the page
[24] The PageRank Citation Ranking
(Bu is the set of pages that point to u; Nv is the number of links from v)
(the added decay factor E(u), aka ‘random picks’, is called a rank source)
39 / 56
Web Application Example: PageRank (1)
Goal: give Web pages a rank (score)
 Reasoning: the score is based on two different types of links to a page
 Links from many pages bring a high rank
 A link from a high-rank page gives a high rank
Easy problem, complex algorithm
 Multiple stages of map & reduce
 Benefits from in-memory caching
 Multiple iterations over the same data
 Known also to have been used initially with plain map-reduce
[6] Parallel Programming with Spark
40 / 56
Web Application Example: PageRank (2)
Simplified basic algorithm
 Start each page at a rank of 1
 On each iteration, have page p contribute rank_p / |neighbors_p| to each of its neighbors
 Set each page’s rank to 0.15 + 0.85 × contribs
[6] Parallel Programming with Spark
(aka ‘link out‘)
41 / 56
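The simplified algorithm above can be run directly in plain Python on a tiny made-up link graph (in PySpark, links and ranks would be RDDs joined on each iteration):

```python
# Simplified PageRank: every page starts at rank 1, contributes
# rank / |out-links| to each neighbor, and new rank = 0.15 + 0.85 * contribs.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # tiny example web graph
ranks = {page: 1.0 for page in links}

for _ in range(20):                                  # iterate until roughly stable
    contribs = {page: 0.0 for page in links}
    for page, neighbors in links.items():
        for n in neighbors:
            contribs[n] += ranks[page] / len(neighbors)
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

print(ranks)   # page "c" (two in-links) ends up ranked above "b" (one in-link)
```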
Web Application Example: PageRank (3)
42 / 56
Web Application Example: PageRank (4)
43 / 56
Web Application Example: PageRank (5)
44 / 56
Web Application Example: PageRank (6)
45 / 56
Web Application Example: PageRank (7)
46 / 56
Web Application Example: PageRank (8)
[24] The PageRank Citation Ranking
47 / 56
Web Application Example: PageRank (9)
Performance comparisons
 Using Apache Spark (i.e. cache, etc.) vs. plain Apache Hadoop
[6] Parallel Programming with Spark
48 / 56
Apache Spark Applications in Business
Spark application use cases at Yahoo
 Personalizing news pages for Web visitors: ML algorithms running on Spark figure out what individual users are interested in and categorize news stories as they arise to determine what types of users would be interested in reading them, …
 Running analytics for advertising: Hive (cf. previous lectures) on Spark (‘Shark’) offers interactive capability; existing BI tools can view and query the advertising analytics data collected in Hadoop
Spark application use cases at Conviva
 One of the largest streaming video companies on the Internet; uses Spark Streaming to learn network conditions in real time
Spark application use cases at ClearStory
 Relies on Spark technology as one of the core underpinnings of its interactive & real-time product
[20] Yahoo
[21] Conviva
[22] ClearStory
49 / 56
Spark Machine Learning Library (MLlib) – Revisited
[4] Apache Spark
Apache Spark Machine Learning library (MLlib) is a scalable and parallel machine learning library with a number of implemented algorithms for classification, clustering, and regression
Usage example
 E.g. clustering data with K-Means
 MLlib contains high-quality algorithms that leverage iteration (a key benefit of Apache Spark)
 Parallel algorithms for clustering, classification, and regression
50 / 56
[Video] K-Means Clustering Example
[26] Animation of the k-means clustering algorithm, YouTube Video
51 / 56
Lecture Bibliography
52 / 56
Lecture Bibliography (1)
[1] Species Iris Group of North America Database, Online: http://www.signa.org
[2] Github template: Deploy a Spark cluster in Azure HDInsight,Online: https://github.com/Azure/azure-quickstart-templates/tree/master/101-hdinsight-spark-linux
[3] J. Dean, S. Ghemawat, ‘MapReduce: Simplified Data Processing on Large Clusters’, OSDI'04: Sixth Symposium on Operating System Design and Implementation, December, 2004.
[4] Apache Spark Web page, Online: http://spark.apache.org/
[5] Big Data Tips, ‘Apache Spark vs Hadoop‘, Online: http://www.big-data.tips/apache-spark-vs-hadoop
[6] M. Zaharia, ‘Parallel Programming with Spark’, 2013
[7] Amazon Web Services EC2 On-Demand Pricing models, Online: https://aws.amazon.com/ec2/pricing/on-demand/
[8] Reynold Xin, ‘Advanced Spark’, 2014, Spark Summit Training
[9] Amazon Web Services Web Page, Online: https://aws.amazon.com
[10] SlideShare, ‘Introduction to Yarn and MapReduce’, Online: https://www.slideshare.net/cloudera/introduction-to-yarn-and-mapreduce-2
[11] Apache Spark on Amazon EMR, Online: https://aws.amazon.com/emr/details/spark/
53 / 56
Lecture Bibliography (2)
[12] Apache Hadoop Web page, Online: http://hadoop.apache.org/
[13] Google DataProc Service, Online: https://cloud.google.com/dataproc/
[14] IBM Cloud SAAS Applications, Online: https://www.ibm.com/cloud-computing/products/saas/
[15] YouTube Video, ‘Solving Big Data with Apache Spark’, Online: https://www.youtube.com/watch?v=WFoFLJOCOLA
[16] The Berkeley Data Analytics Stack: Present and Future, Online: http://events-tce.technion.ac.il/files/2014/04/Michael-Franklin-UC-Berkeley.pdf
[17] Apache Hive, Online: https://hive.apache.org/
[18] Big Data Tips, Spark Summit, Online: http://www.big-data.tips/spark-summit
[19] Docker Hub, Online: https://hub.docker.com/
[20] Yahoo Web page, Online: http://www.yahoo.com
[21] Conviva Web page, Online: http://www.conviva.com
[22] ClearStory Web page, Online: http://www.clearstorydata.com
54 / 56
Lecture Bibliography (3)
[23] Mining of Massive Datasets, Online: http://infolab.stanford.edu/~ullman/mmds/book.pdf
[24] L. Page, S. Brin, R. Motwani, T. Winograd, ‘The PageRank Citation Ranking: Bringing Order to the Web‘, Technical Report of the Stanford InfoLab, 1999
[25] www.big-data.tips, ‘Page Rank Technique‘, Online: http://www.big-data.tips/page-rank
[26] YouTube Video, ‘Animation of the k-means algorithm using Matlab 2013‘, Online: http://www.youtube.com/watch?v=5FmnJVv73fU
[27] HDFS Architecture Guide, Online: http://hadoop.apache.org/docs/stable/hdfs_design.html
[28] Microsoft Azure Portal Hub,Online: https://portal.azure.com/#create/hub
[29] Microsoft Azure HDInsight Cluster Web page, Online: https://azure.microsoft.com/en-us/services/hdinsight/
[30] Databricks, A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets, Online: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
[31] Apache Spark Machine Learning Pipelines,Online: https://spark.apache.org/docs/latest/ml-pipeline.html
[32] Edureka Spark SQL Tutorial,Online: https://www.edureka.co/blog/spark-sql-tutorial/
55 / 56
56 / 56