
 · 2018-10-30 · Run the same PageRank application (in Task 2) on Azure Databricks to compare the...

Date post: 20-May-2020
Page 5:

Prepare data for the next iteration

Page 8:

Status of RDD actions being computed

Info about cached RDDs and memory usage

In-depth job info

Page 9:

Resilient Distributed Datasets (RDD)
● Distributed collection of JVM objects
● Functional operators (map, filter, etc.)

DataFrame
● Distributed collection of Row objects
● Expression-based operations
● Fast, efficient internal representations

Dataset
● Internally rows, externally JVM objects
● Type safe and fast
● Slower than DataFrames

Page 10:

[Figure: an RDD operation (e.g. map, filter) transforms partitions RDD1, RDD2 and RDD3, distributed across Machines A, B and C, into RDD1', RDD2' and RDD3'.]

Page 11:

>>> input_RDD = sc.textFile("text.file")
>>> transform_RDD = input_RDD.filter(lambda x: "abcd" in x)
>>> # count() returns an int, so convert it before concatenating
>>> print('Number of "abcd": ' + str(transform_RDD.count()))
>>> # the slide saved an undefined name `output`; the filtered RDD is assumed here
>>> transform_RDD.saveAsTextFile("hdfs:///output")


Page 15:

people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

val df = spark.read.json("people.json")

// show() returns Unit, so keep the filtered DataFrame and display it separately
val sqlDF = df.filter($"age" > 20)
sqlDF.show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

df.filter($"age" > 20).select("name").write.format("parquet").save("output")

Note: Parquet is a column-based storage format for Hadoop. You will need special dependencies to read this file.

Page 16:

[Table columns: Task, Points, Description, Language]

Page 17:

● map, filter
● reduceByKey, aggregateByKey, groupByKey


Page 21:

● How do we measure influence?
  ○ Intuitively, it should be the node with the most followers

Page 22:

● Influence scores are initialized to 1.0 / number of vertices

[Figure: three-node graph; each node starts with a score of 0.333]


Page 25:

● Influence scores are initialized to 1.0 / number of vertices
● In each iteration of the algorithm, scores of each user are redistributed between the users they are following
● Convergence is achieved when the scores of nodes do not change between iterations
● PageRank is guaranteed to converge

[Figure: first iteration on the three-node graph. One node receives 0.333/2 = 0.167 (from Node 1), one receives 0.333 + 0.333/2 = 0.500 (from Node 0 and Node 1), and one receives 0.333 (from Node 2).]

Page 26:

[Figure: the same three-node graph after further iterations; scores shown: 0.208, 0.396, 0.396]
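The redistribution step in the bullets above can be sketched in plain Python (no Spark). The edge list is an assumption inferred from the "From Node ..." labels in the figure, with no self-loops: Node 1 follows Nodes 0 and 2, Node 0 follows Node 2, and Node 2 follows Node 1. The sketch applies no damping and no dangling-vertex handling, since every node in this graph follows someone.

```python
# Assumed graph, inferred from the figure labels (not given explicitly on the slide).
following = {0: [2], 1: [0, 2], 2: [1]}

def redistribute(scores, following):
    """One iteration: each node splits its score equally among the nodes it follows."""
    new_scores = {node: 0.0 for node in scores}
    for node, targets in following.items():
        share = scores[node] / len(targets)
        for target in targets:
            new_scores[target] += share
    return new_scores

def iterate(following, iterations):
    """Start every node at 1/N and apply the redistribution step repeatedly."""
    n = len(following)
    scores = {node: 1.0 / n for node in following}
    for _ in range(iterations):
        scores = redistribute(scores, following)
    return scores

after_one = iterate(following, 1)
after_eight = iterate(following, 8)
print({k: round(v, 3) for k, v in after_one.items()})    # {0: 0.167, 1: 0.333, 2: 0.5}
print({k: round(v, 3) for k, v in after_eight.items()})  # {0: 0.208, 1: 0.396, 2: 0.396}
```

Under this assumed graph, one iteration gives the 0.167 / 0.333 / 0.500 values from the figure, and eight iterations give 0.208 / 0.396 / 0.396. The total stays at 1.0 throughout.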

Page 27:

val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  // (a: probability of a random jump, N: number of pages)
  ranks = contribs.reduceByKey((x, y) => x + y)
    .mapValues(sum => a/N + (1-a)*sum)
}

Page 28:

● Dangling or sink vertex
  ○ No outgoing edges
  ○ Redistribute contribution equally among all vertices
● Isolated vertex
  ○ No incoming and outgoing edges
  ○ No isolated nodes in Project 4.1 dataset
● Damping factor d
  ○ Represents the probability that a user clicking on links will continue clicking on them, traveling down an edge
  ○ Use d = 0.85

[Figure: example graph labeling a dangling vertex and an isolated vertex]
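A minimal sketch of how these three rules might combine, in plain Python rather than Spark. The function name, the example graph, and the fixed iteration count are illustrative assumptions; the sketch also assumes every edge target appears as a key in the adjacency dict.

```python
def pagerank(out_edges, d=0.85, iterations=50):
    """Iterative PageRank with damping and equal redistribution of dangling rank."""
    nodes = list(out_edges)
    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread over all vertices.
        dangling_mass = sum(ranks[v] for v in nodes if not out_edges[v])
        incoming = {v: 0.0 for v in nodes}
        for v in nodes:
            for target in out_edges[v]:
                incoming[target] += ranks[v] / len(out_edges[v])
        ranks = {
            v: (1 - d) / n + d * (incoming[v] + dangling_mass / n)
            for v in nodes
        }
    return ranks

# Hypothetical graph: "c" is a dangling vertex.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
ranks = pagerank(graph)
print(round(sum(ranks.values()), 6))
```

Because the dangling rank is handed back to all vertices, the scores keep summing to 1.0 after every iteration, which is a cheap correctness check.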

Page 29:

● Adjacency matrix:

● Transition matrix: (rows sum to 1)
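The matrices themselves are not preserved in this transcript. As a hedged illustration, the sketch below row-normalizes an adjacency matrix into a transition matrix whose rows sum to 1. The convention that `adjacency[i][j] == 1` means an edge from vertex i to vertex j, the uniform row for a dangling vertex, and the example graph are all assumptions.

```python
def transition_matrix(adjacency):
    """Divide each row by its out-degree so that every row sums to 1."""
    n = len(adjacency)
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Dangling vertex: assume it links to every vertex with equal weight.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append([value / out_degree for value in row])
    return matrix

# Hypothetical three-vertex graph: 0 -> 2, 1 -> 0, 1 -> 2, 2 -> 1.
adjacency = [
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
transition = transition_matrix(adjacency)
for row in transition:
    print(row)
```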

Page 32:

Formula for calculating rank

d = 0.85

Note: contributions from isolated and dangling vertices are constant in an iteration

Let

This simplifies the formula to
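The formula images are not preserved in this transcript. A standard formulation consistent with the surrounding bullets (damping factor d, dangling rank redistributed equally among all vertices) is sketched below. The symbols are assumptions rather than the slide's own notation: N is the number of vertices, r_t(v) the rank of v at iteration t, in(v) the vertices linking to v, out(u) the number of outgoing edges of u, and D the set of vertices with no outgoing edges.

```latex
r_{t+1}(v) = \frac{1-d}{N}
  + d\left( \sum_{u \in \mathrm{in}(v)} \frac{r_t(u)}{\mathrm{out}(u)}
  + \sum_{u \in D} \frac{r_t(u)}{N} \right)

% The first and last terms do not depend on v, so compute them once per iteration:
c_t = \frac{1-d}{N} + d \sum_{u \in D} \frac{r_t(u)}{N}

r_{t+1}(v) = c_t + d \sum_{u \in \mathrm{in}(v)} \frac{r_t(u)}{\mathrm{out}(u)}
```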

Page 35:

● Run the same PageRank application (in Task 2) on Azure Databricks to compare the differences with Azure HDInsight

● Databricks is an Apache Spark-based analytics platform optimized for Azure

● One-click setup, an interactive workspace, and an optimized Databricks runtime

● Optimized connectors to Azure storage platforms for fast data access

● Software-as-a-Service

Page 36:

● reduceByKey, groupByKey, aggregateByKey
  ○ ./spark/bin/spark-shell
  ○ ./spark/bin/pyspark

Page 37:

● Ensuring correctness
  ○ Make sure total scores sum to 1.0 in every iteration
  ○ Understand closures in Spark
    ■ Do not do something like this:
      val data = Array(1,2,3,4,5)
      var counter = 0
      var rdd = sc.parallelize(data)
      // Each executor updates its own copy of counter, not the driver's
      rdd.foreach(x => counter += x)
      println("Counter value: " + counter)
  ○ Graph representation
    ■ Adjacency lists use less memory than matrices for sparse graphs
  ○ More detailed walkthroughs and sample calculations can be found here
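The closure pitfall above can be imitated without a cluster. In the sketch below, `pickle` stands in for Spark's closure serialization and `run_task` for an executor task. Both are illustrative stand-ins, not Spark APIs, and the two-way partition split is arbitrary. The second half shows the safer pattern of returning partial results and combining them, which is the idea behind aggregations such as Spark's `reduce`.

```python
import pickle
from functools import reduce

data = [1, 2, 3, 4, 5]
partitions = [data[:3], data[3:]]  # two hypothetical partitions of the RDD

def run_task(shipped_state, partition):
    """Stand-in for an executor task: it works on its own deserialized copy."""
    state = pickle.loads(shipped_state)
    for x in partition:
        state["counter"] += x
    return state["counter"]  # the foreach pattern never sends this back

# Broken pattern: mutate "driver" state from inside the tasks.
driver_state = {"counter": 0}
shipped = pickle.dumps(driver_state)
for partition in partitions:
    run_task(shipped, partition)
broken_total = driver_state["counter"]  # only the copies were updated

# Safer pattern: each task returns a partial result and the driver combines them.
partial_sums = [sum(partition) for partition in partitions]
correct_total = reduce(lambda a, b: a + b, partial_sums)

print(broken_total, correct_total)
```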

Page 38:

● Optimization
  ○ Eliminate repeated calculations
  ○ Use the Spark Web UI
    ■ Monitor your instances to make sure they are fully utilized
    ■ Identify bottlenecks
  ○ Understand RDD manipulations
    ■ Actions vs transformations
    ■ Lazy transformations
  ○ Explore parameter tuning to optimize resource usage
  ○ Be careful with repartition on your RDDs
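As a rough analogy for lazy transformations and repeated calculations, Python's built-in `map` is also lazy: nothing runs until a consuming call forces it, much as a Spark transformation waits for an action. The sketch is plain Python, not Spark, and `expensive_square` is a hypothetical stand-in for a costly transformation. Materializing the result once plays the role of caching an RDD that several actions reuse.

```python
calls = []

def expensive_square(x):
    """Hypothetical costly step; records each invocation."""
    calls.append(x)
    return x * x

data = [1, 2, 3, 4]

squares = map(expensive_square, data)   # "transformation": nothing runs yet
calls_before_action = len(calls)
total = sum(squares)                    # "action": forces the computation

# Materialize once, then reuse the stored result for several "actions".
cached = list(map(expensive_square, data))
calls_after_cache = len(calls)
largest, smallest = max(cached), min(cached)
calls_after_reuse = len(calls)          # unchanged: no recomputation

print(calls_before_action, total, calls_after_cache, calls_after_reuse)
```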

Page 39:

TWITTER DATA ANALYTICS: TEAM PROJECT

Page 40:

Team Project

Page 41:

Team Project
● Phase 1:
  ○ Q1
  ○ Q2 (MySQL AND HBase)
● Phase 2:
  ○ Q1
  ○ Q2 & Q3 (MySQL AND HBase)
● Phase 3:
  ○ Q1
  ○ Q2 & Q3 (MySQL OR HBase)

Page 42:

Team Project Deadlines
● Writeup and queries were released on Monday, October 29th, 2018.
● Phase 2 milestones:
  ○ Q2:
    ■ Q2 on scoreboard, due on Sunday, 11/11
  ○ Phase 2, Live test:
    ■ Q1, Q2 and Q3, on Sunday, 11/11
  ○ Phase 2, code and report:
    ■ due on Tuesday, 11/13

Page 43:

Query 3, Definitions

● time_start, time_end: in Unix time / Epoch time format, e.g. time_start=1480000000

● uid_start, uid_end: marks the user search boundary, e.g. uid_end=492600000

● n1: the maximum number of topic words that should be included in the response

● n2: the maximum number of tweets that should be included in the response

Page 44:

Query 3: Effective Word Count (EWC)

EWC:
● One or more consecutive alphanumeric characters (A through Z, a through z, 0 through 9) with zero or more ' and/or - characters.

Query 3 is su-per-b! I'mmmm lovin' it! ⇐ 6 EWC

Don’t forget to remove the short URL and stop words before calculation
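One way the word definition might translate into code is sketched below. Both regular expressions and the stop-word handling are assumptions: the slide gives neither a pattern nor the stop-word list, so the list is passed in by the caller and the count for any sentence depends on it.

```python
import re

# Assumed patterns (not from the slide): a t.co-style short URL, and a word that
# contains at least one alphanumeric character plus optional ' and - characters.
SHORT_URL = re.compile(r"https?://t\.co/[A-Za-z0-9]+")
WORD = re.compile(r"[A-Za-z0-9'-]*[A-Za-z0-9][A-Za-z0-9'-]*")

def words(text):
    """Remove short URLs, then return every token matching the word definition."""
    return WORD.findall(SHORT_URL.sub("", text))

def effective_word_count(text, stop_words):
    """Count the words that are not in the caller-supplied stop-word set."""
    return sum(1 for w in words(text) if w.lower() not in stop_words)

tokens = words("Query 3 is su-per-b! I'mmmm lovin' it!")
print(tokens)
```

Lower-casing before the stop-word lookup is also an assumption; it mirrors the case-insensitive handling that the deck describes for topic words.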

Page 45:

Query 3, Impact Score

Impact Score = EWC * (favorite_count + retweet_count + followers_count)

Treat a negative impact_score as 0.
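The impact score formula, including the clamp for negative values, can be written directly. The function and parameter names are illustrative; the field names follow the slide.

```python
def impact_score(ewc, favorite_count, retweet_count, followers_count):
    """EWC times the sum of the three counts, with negative results treated as 0."""
    score = ewc * (favorite_count + retweet_count + followers_count)
    return max(score, 0)

print(impact_score(6, 10, 5, 100))  # 690
print(impact_score(4, 0, 0, -3))    # 0
```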

Page 46:

Query 3, Topic Words

Topic words:
● After filtering short URLs
● Exclude stop words
● Before censoring
● Case insensitive (lower case)

TF-IDF:
● TF: term frequency of a topic word w
● IDF: ln(Total number of tweets in range / Number of tweets with w in it)
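A small sketch of the TF-IDF calculation. The IDF follows the slide's definition. The TF normalization (occurrences of w divided by the number of words in the tweet) is an assumption, since the slide only says "term frequency". Tweets are represented as lists of lower-cased topic words, and `idf` assumes the word occurs in at least one tweet.

```python
import math

def idf(word, tweets):
    """ln(total tweets in range / tweets containing the word)."""
    containing = sum(1 for tweet in tweets if word in tweet)
    return math.log(len(tweets) / containing)

def tf(word, tweet):
    """Assumed normalization: occurrences of the word / words in the tweet."""
    return tweet.count(word) / len(tweet)

def tf_idf(word, tweet, tweets):
    return tf(word, tweet) * idf(word, tweets)

# Hypothetical tweets, already filtered and lower-cased.
tweets = [
    ["cloud", "computing", "cloud"],
    ["spark", "cloud"],
    ["hbase", "mysql"],
]
print(tf_idf("cloud", tweets[0], tweets))
```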

Page 47:

Query 3, Topic Score

Topic Score = sum(x * ln(y + 1)) (i from 1 to n)

n: The total number of tweets in the given time and uid range

x: TF-IDF score of word w in tweet Ti

y: The impact score of Ti
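The topic score sum can be sketched as follows, taking one precomputed (x, y) pair per tweet in range. The function name and the sample values are illustrative only.

```python
import math

def topic_score(pairs):
    """Sum x * ln(y + 1) over (tf_idf, impact_score) pairs, one pair per tweet."""
    return sum(x * math.log(y + 1) for x, y in pairs)

# Two hypothetical tweets: an impact score of 0 contributes nothing, since ln(1) = 0.
score = topic_score([(0.5, 0), (0.2, 99)])
print(score)
```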

Page 48:

Query 3 Example

word1:score1\tword2:score2...\twordn1:scoren1
Impactscore1\ttid1\ttext1…..

Example:
channel:2270.04 amp:1586.31 new:1166.24 just:1153.70 love:1063.31 like:1015.71 good:937.63
26200650 461159182406672384 I just buyed the comedy album of my bestest friend in the entire world @briangaar. https://t.co/hwDB4veaYG #RacesAsToad...

Don’t forget to censor the tweets
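A sketch of how the response layout might be assembled. The two-decimal score format is inferred from the example values, and the newline between the topic-word line and each tweet line is an assumption based on how the template is laid out. Censoring is not shown.

```python
def format_response(topic_words, tweets):
    """topic_words: (word, score) pairs; tweets: (impact_score, tid, text) triples."""
    first_line = "\t".join(f"{word}:{score:.2f}" for word, score in topic_words)
    tweet_lines = [f"{impact}\t{tid}\t{text}" for impact, tid, text in tweets]
    return "\n".join([first_line] + tweet_lines)

response = format_response(
    [("channel", 2270.04), ("amp", 1586.31)],
    [(26200650, 461159182406672384, "hello")],
)
print(response)
```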

Page 49:

Warning!!! Any Hadoop Cluster

For any Hadoop cluster on AWS, Azure or GCP:

● Don’t open ports to the public, except
  ○ Ports: 22, 80, 25, 443, or 465
● Follow the HBase Primer and use an SSH tunnel to communicate with your YARN UI

Page 50:

Note:
● There will be a report due at the end of each phase, where you are expected to discuss optimizations
● WARNING: Check your AWS instance limits on the new account (should be > 10 instances)

[Table columns: Phase (and query due), Start, Deadline, Code and Report Due]

Page 52:

Questions?

