Spark Meetup July 2015

Date post: 12-Apr-2017
Upload: debasish-das
Page 1: Spark Meetup July 2015

Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.

Spark Meetup

Big Data Analytics
Verizon Lab, Palo Alto

July 28th, 2015

Page 2: Spark Meetup July 2015

Similarity Computation

Page 3: Spark Meetup July 2015

Similarity Computation Flows

• Column based flow for tall-skinny matrices (60M users, 100K items)
  – Mapper: emit (item-i, item-j), score-ij
  – Reducer: reduce over (item-i, item-j) to get similarity-ij
  – Spark 1.2 RowMatrix.columnSimilarities
• Row based flow https://issues.apache.org/jira/browse/SPARK-4823
  – Column similarity in tall-wide matrices
    60M users, 1M-10M items from advertising use-cases
  – Kernel generation for tall-skinny matrices
    60M users, 50-400 latent factors from advertising use-cases
    10M devices, skinny features from IoT use-cases
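
The mapper and reducer steps of the column based flow can be sketched on plain Scala collections. This is a minimal illustration, not the Spark implementation: it assumes cosine similarity between columns, represents each row as a sparse map, and ignores any sampling or thresholding. The object `ColumnFlowSketch` and its method names are hypothetical.

```scala
object ColumnFlowSketch {
  // One row of the data matrix: column (item) index -> value.
  type SparseRow = Map[Int, Double]

  // "Mapper": for each row, emit ((item-i, item-j), score-ij) for every pair i < j.
  def emitPairs(row: SparseRow): Seq[((Int, Int), Double)] = {
    val entries = row.toSeq.sortBy(_._1)
    for {
      (i, vi) <- entries
      (j, vj) <- entries
      if i < j
    } yield ((i, j), vi * vj)
  }

  // "Reducer": sum the scores per (item-i, item-j) key, then divide by the column norms.
  def columnSimilarities(rows: Seq[SparseRow]): Map[(Int, Int), Double] = {
    val norms: Map[Int, Double] = rows
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (col, vs) => col -> math.sqrt(vs.map { case (_, v) => v * v }.sum) }
    rows
      .flatMap(emitPairs)
      .groupBy(_._1)
      .map { case (key @ (i, j), scores) => key -> scores.map(_._2).sum / (norms(i) * norms(j)) }
  }
}
```

Every row emits one record per pair of its non-zero columns, which is why this flow suits tall-skinny matrices: the number of emitted pairs grows quadratically with the number of non-zeros per row.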

Page 4: Spark Meetup July 2015

Row Based Flow

• Preprocess
  – Column similarity in tall-wide matrices: transpose data matrix
  – Kernel generation for tall-skinny matrices: input data matrix
• Algorithm
  – Distributed matrix multiply using blocked cartesian pattern
  – Shuffle space control using topK and similarity threshold
  – User specified kernel for vector dot product
  – Supported kernels: Cosine, Euclidean, RBF, ScaledProduct
• Code optimization
  – Norm caching for efficiency (kernel abstraction differs from scikit-learn)
  – DGEMM for dense vectors: Spark 1.4 recommendForAll
  – BLAS.dot for sparse vectors: https://github.com/apache/spark/pull/6213
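
The blocked cartesian pattern with topK and threshold pruning can be illustrated on a local collection. This is a sketch under stated assumptions, not the SPARK-4823 code: rows are dense arrays, blocks are plain sub-sequences rather than partitions, and `RowFlowSketch`, `rowSimilarities`, and `blockSize` are hypothetical names.

```scala
object RowFlowSketch {
  type Row = (Long, Array[Double]) // (row index, dense feature vector)

  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // For each row, keep at most topK neighbors whose kernel value exceeds threshold.
  def rowSimilarities(
      rows: Seq[Row],
      blockSize: Int,
      topK: Int,
      threshold: Double,
      kernel: (Array[Double], Array[Double]) => Double): Map[Long, Seq[(Long, Double)]] = {
    val blocks = rows.grouped(blockSize).toSeq
    val candidates = for {
      left <- blocks // cartesian product of blocks
      right <- blocks
      (i, vi) <- left
    } yield {
      val scored = for {
        (j, vj) <- right
        if i != j
        s = kernel(vi, vj)
        if s > threshold
      } yield (j, s)
      // Prune inside each block pair so that fewer candidates need to be shuffled.
      i -> scored.sortWith(_._2 > _._2).take(topK)
    }
    // Merge the per-block candidates and keep the global topK per row.
    candidates
      .groupBy(_._1)
      .map { case (i, parts) => i -> parts.flatMap(_._2).sortWith(_._2 > _._2).take(topK) }
  }
}
```

Pruning to topK inside each block pair bounds the candidates per row at topK times the number of blocks, which is the shuffle space control the slide refers to.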

Page 5: Spark Meetup July 2015

Kernel Examples

CosineKernel: item->item similarity

    case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel {
      override def compute(vi: Vector, indexi: Long, vj: Vector, indexj: Long): Double = {
        // Cosine similarity from the dot product and the cached row norms.
        val similarity = BLAS.dot(vi, vj) / rowNorms(indexi) / rowNorms(indexj)
        // Values at or below the threshold are reported as 0.0.
        if (similarity <= threshold) 0.0 else similarity
      }
    }

ScaledProductKernel: memory based recommendation

    case class ScaledProductKernel(rowNorms: Map[Long, Double]) extends Kernel {
      override def compute(vi: Vector, indexi: Long, vj: Vector, indexj: Long): Double = {
        // Dot product scaled by the norm of the first row only.
        BLAS.dot(vi, vj) / rowNorms(indexi)
      }
    }
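
The kernels above depend on a `Kernel` trait and on Spark's `BLAS.dot`, neither of which is shown on the slide. A self-contained sketch of the same idea, with norm caching and a sparse dot product, might look as follows. The `SparseVec` type, the `Kernel` trait as written here, and the `rowNorms` helper are assumptions made for illustration.

```scala
object KernelSketch {
  // Sparse vector: strictly increasing indices with matching values.
  final case class SparseVec(indices: Array[Int], values: Array[Double])

  // Merge-style dot product over the two sorted index arrays.
  def dot(a: SparseVec, b: SparseVec): Double = {
    var i = 0
    var j = 0
    var sum = 0.0
    while (i < a.indices.length && j < b.indices.length) {
      if (a.indices(i) == b.indices(j)) { sum += a.values(i) * b.values(j); i += 1; j += 1 }
      else if (a.indices(i) < b.indices(j)) i += 1
      else j += 1
    }
    sum
  }

  trait Kernel {
    def compute(vi: SparseVec, indexi: Long, vj: SparseVec, indexj: Long): Double
  }

  // Norm caching: each norm is computed once and then looked up by row index.
  def rowNorms(rows: Map[Long, SparseVec]): Map[Long, Double] =
    rows.map { case (id, v) => id -> math.sqrt(dot(v, v)) }

  final case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel {
    def compute(vi: SparseVec, indexi: Long, vj: SparseVec, indexj: Long): Double = {
      val similarity = dot(vi, vj) / rowNorms(indexi) / rowNorms(indexj)
      if (similarity <= threshold) 0.0 else similarity
    }
  }
}
```

Passing the row indices alongside the vectors is what lets the kernel reuse cached norms instead of recomputing them for every pair.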

Page 6: Spark Meetup July 2015

Runtime Analysis

Dataset Details

          ML-1M   ML-10M   ML-20M   Netflix
ratings   1M      10M      20M      100M
users     6040    69878    138493   480189
items     3706    10677    26744    17770

• Production Examples
  – Data matrix: 60M x 2.5M
  – minSupport: 500
  – itemThreshold: 1000
  – Runtime: ~4 hrs

[Figure: run time (s) per dataset (ML-1M, ML-10M, ML-20M, Netflix); series: col 1e-2, row 1e-2; x-axis label: items; y-axis range: 0 to 900]

Page 7: Spark Meetup July 2015

Shuffle Write Analysis

[Figure: shuffle write (MB) per dataset (ML-1M, ML-10M, ML-20M, Netflix); series: col 1e-2, row 1e-2; x-axis label: movies; y-axis range: 0 to 40000]

Page 8: Spark Meetup July 2015

TopK Shuffle Write Analysis For Row Based Flow

[Figure: shuffle write (MB) against topk for the row based flow; x-axis values: 50, 100, 200, 400, 1000, row 1e-2; series: ML-1M, ML-10M, ML-20M, Netflix; y-axis range: 0 to 5000]

Page 9: Spark Meetup July 2015

Recommendation Engine

Page 10: Spark Meetup July 2015

Recommendation Algorithms

• Memory based: kNN based recommendation algorithm using similarity engine
• Model based: ALS based implicit feedback formulation
• Datasets
  – MovieLens 1M
  – Netflix
• Mapped ratings to binary features for comparison
• Evaluate recommendation performance using
  – RMSE
  – Precision @ k
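
The two evaluation metrics named on this slide are easy to state in code. The sketch below uses the usual textbook definitions; `EvalSketch` and its method names are hypothetical, and dividing by k even when fewer than k items are recommended is one common convention, assumed here.

```scala
object EvalSketch {
  // Root mean squared error over (predicted, actual) pairs.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (predicted, actual) pair")
    math.sqrt(pairs.map { case (p, a) => (p - a) * (p - a) }.sum / pairs.size)
  }

  // Fraction of the top-k recommended items that are relevant for the user.
  def precisionAtK(recommended: Seq[Int], relevant: Set[Int], k: Int): Double = {
    require(k > 0, "k must be positive")
    recommended.take(k).count(relevant.contains).toDouble / k
  }
}
```

With ratings mapped to binary features, the actual value in each RMSE pair is 0 or 1, so both methods can be scored on the same scale.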

Page 11: Spark Meetup July 2015

kNN Based Formulation

    val similarItems = SimilarityMatrix.rowSimilarities(
      itemFeatures, numNeighbors, threshold)

    val kernel = new ScaledProductKernel(rowNorms)

    val recommendation = SimilarityMatrix.multiply(
      similarItems, userFeatures, kernel, k)

Predicted rating
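
The predicted-rating formula itself did not survive in the transcript. As a rough illustration of memory based scoring, the sketch below uses a common item based kNN rule on binary feedback: the score of a candidate item is the similarity mass of its neighbors that the user has interacted with, divided by the total similarity mass of its neighbors. This rule, and the names `KnnSketch` and `recommend`, are assumptions and may differ from what `SimilarityMatrix.multiply` computes.

```scala
object KnnSketch {
  // similarItems: item -> its nearest neighbors as (neighbor item, similarity).
  // userItems: items the user has interacted with (binary feedback).
  def recommend(
      similarItems: Map[Int, Seq[(Int, Double)]],
      userItems: Set[Int],
      k: Int): Seq[(Int, Double)] =
    similarItems.toSeq
      .filterNot { case (item, _) => userItems.contains(item) } // skip already seen items
      .map { case (item, neighbors) =>
        val total = neighbors.map(_._2).sum
        val hit = neighbors.collect { case (j, s) if userItems.contains(j) => s }.sum
        item -> (if (total > 0.0) hit / total else 0.0)
      }
      // Highest score first; ties broken by item id for a deterministic order.
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
}
```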

Page 12: Spark Meetup July 2015

ALS Formulation

• Implicit feedback datasets: unobserved items are considered 0 (implicit feedback)
• Minimize
• Needs Gram matrix aggregation for 0-ratings

    val als = new ALSQp()
      .setRank(params.rank)
      .setIterations(params.numIterations)
      .setUserConstraint(Constraints.SMOOTH)
      .setItemConstraint(Constraints.SMOOTH)
      .setImplicitPrefs(true)
      .setLambda(params.lambda)
      .setAlpha(params.alpha)

    val mfModel = als.run(training)
    RankingUtils.recommendItemsForUsers(mfModel, k, skipItems)
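
The objective after "Minimize" is missing from the transcript. For reference, the standard implicit feedback formulation (Hu, Koren and Volinsky), which the `setImplicitPrefs(true)` and `setAlpha` calls suggest, is shown below. Treat it as an assumption about what the slide displayed; the symbols x_u (user factors), y_i (item factors), r_ui (observed value), p_ui (binary preference), and c_ui (confidence) are introduced here for illustration.

```latex
\min_{X,\,Y}\; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr),
\qquad
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}

% The sum runs over all (u, i) pairs, including unobserved ones. For the user update
% this is kept tractable by splitting the weighted Gram matrix:
Y^{\top} C^{u} Y \;=\; Y^{\top} Y \;+\; Y^{\top} \bigl(C^{u} - I\bigr)\, Y
```

Because c_ui = 1 for every unobserved pair, the term Y^T Y is shared by all users and can be aggregated once per iteration, while the correction term only touches the items a user actually interacted with. This is the Gram matrix aggregation for 0-ratings mentioned in the bullet.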

Page 13: Spark Meetup July 2015

Comparing kNN and ALS on RMSE

[Figure: bar chart, RMSE on MovieLens; bars: kNN 30 neighbors, ALS; value labels: 0.561, 0.62]

[Figure: bar chart, RMSE on Netflix; bars: kNN 30 neighbors, ALS; value labels: 0.571, 0.661]

Page 14: Spark Meetup July 2015

Comparing kNN and ALS on Prec@k (Netflix)

Page 15: Spark Meetup July 2015

Segmentation Engine

Page 16: Spark Meetup July 2015

Segmentation Feature Extraction

• Input data contains location and time information along with other features
• Extract time-unit features for each location (zip code)

Raw data

Id    Time (Hour)   Zip Code   websites
abc   10            94301      website1
abc   15            94085      website2
def   10            94301      website1
...   ...           ...        ...

Sparse Website Matrix

        website1            website2            …
94301   # of hours (1-24)   # of hours (1-24)   …
94085   # of hours (1-24)   # of hours (1-24)   …
...     ...                 ...

Column      Count
Zip codes   31516
Websites    11646
Ratings     45M
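
The step from raw data to the sparse website matrix can be sketched as a group-and-count. The slide does not define "# of hours (1-24)" precisely; the sketch assumes it means the number of distinct hours of the day in which a website was visited from a zip code. `SegmentFeaturesSketch`, `Visit`, and `hoursPerZipAndSite` are hypothetical names.

```scala
object SegmentFeaturesSketch {
  // One raw record: user id, hour of day, zip code, website.
  final case class Visit(id: String, hour: Int, zip: String, website: String)

  // Sparse matrix entries: (zip, website) -> number of distinct hours with a visit.
  def hoursPerZipAndSite(visits: Seq[Visit]): Map[(String, String), Int] =
    visits
      .groupBy(v => (v.zip, v.website))
      .map { case (key, vs) => key -> vs.map(_.hour).distinct.size }
}
```

Only (zip, website) pairs that occur in the raw data get an entry, so the result stays sparse; the user id is dropped because the segmentation works at zip code level.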

Page 17: Spark Meetup July 2015

ALS with Positive Constraints

ZipCode x Website Sparse Matrix (n×m), W^T (n×k), H (k×m)

• Each row of W^T represents ZipCode factors
• Each column of H represents Website factors
• What do columns of W^T and rows of H represent?

minimize
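
The objective that followed "minimize" is not preserved in the transcript. A generic regularized nonnegative matrix factorization objective consistent with the dimensions on this slide is shown below, as an assumption rather than the slide's exact formula. A denotes the n×m ZipCode x Website matrix, and λ is a regularization weight introduced for illustration.

```latex
\min_{W,\,H}\; \bigl\lVert A - W^{\top} H \bigr\rVert_F^2
  + \lambda \bigl( \lVert W \rVert_F^2 + \lVert H \rVert_F^2 \bigr)
\quad \text{subject to} \quad W \ge 0,\; H \ge 0,
\qquad
A \in \mathbb{R}^{n \times m},\;
W \in \mathbb{R}^{k \times n},\;
H \in \mathbb{R}^{k \times m}
```

Under this reading, each of the k columns of W^T (equivalently each row of H) corresponds to one latent factor, that is, one segment with a nonnegative weight for every zip code and every website. With implicit preferences enabled, the squared error terms would additionally carry confidence weights.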

Page 18: Spark Meetup July 2015

Segment Analysis I

Local websites

Global websites

Page 19: Spark Meetup July 2015

Segment Analysis II

Segments

Most factors display geographic affinity.

Page 20: Spark Meetup July 2015

Use ALSQp for Nonnegative Matrix Factorization

    val als = new ALSQp()
      .setRank(params.rank)
      .setIterations(params.numIterations)
      .setUserConstraint(Constraints.POSITIVE)
      .setItemConstraint(Constraints.POSITIVE)
      .setImplicitPrefs(true)
      .setLambda(params.lambda)

    val mfModel = als.run(training)

Other constraints:

    .setItemConstraint(Constraints.SIMPLEX) // 1^T w = s, w >= 0, where s is a constant

https://github.com/apache/spark/pull/3221
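
To make the SIMPLEX constraint concrete, the sketch below computes the Euclidean projection of a vector onto the set { w : 1^T w = s, w >= 0 } with the standard sort-based algorithm. It only illustrates what the constraint set looks like; whether ALSQp enforces the constraint through such a projection is not stated on the slide, and `SimplexSketch.project` is a hypothetical helper.

```scala
object SimplexSketch {
  // Euclidean projection of v onto { w : sum(w) = s, w >= 0 }, for s > 0.
  def project(v: Array[Double], s: Double): Array[Double] = {
    require(s > 0.0 && v.nonEmpty, "need s > 0 and a non-empty vector")
    val u = v.sortWith(_ > _) // descending order
    var cumulative = 0.0
    var theta = 0.0
    var j = 0
    while (j < u.length) {
      cumulative += u(j)
      val candidate = (cumulative - s) / (j + 1)
      // The condition holds for a prefix of u; the last index where it holds fixes theta.
      if (u(j) - candidate > 0.0) theta = candidate
      j += 1
    }
    // Shift every entry by theta and clip at zero.
    v.map(x => math.max(x - theta, 0.0))
  }
}
```

A factor row projected this way is nonnegative and sums to s, so with s = 1 it can be read as a probability distribution over the k segments.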

Page 21: Spark Meetup July 2015

Q and A

