Sparking Science up with Research Recommendations by Maya Hristakeva

Date post: 13-Apr-2017
Upload: spark-summit
Transcript
Page 1: Sparking Science up with Research Recommendations by Maya Hristakeva

Sparking Science up with Research Recommendations

Maya Hristakeva
@mayahhf

Page 2: Sparking Science up with Research Recommendations by Maya Hristakeva

Overview

• What is Mendeley Suggest?

• Computation Layer

• Conclusions

Page 3: Sparking Science up with Research Recommendations by Maya Hristakeva

Mendeley builds tools to help researchers …

• Read & Organize
• Search & Discover
• Collaborate & Network
• Experiment & Synthesize

Page 4: Sparking Science up with Research Recommendations by Maya Hristakeva

Being the best researcher you can be!

• Good researchers are on top of their game
• Large amount of research produced
• Takes time to get what you need

• Help researchers by recommending relevant research

Page 5: Sparking Science up with Research Recommendations by Maya Hristakeva

Mendeley Suggest: Personalized Article Recommender

Page 6: Sparking Science up with Research Recommendations by Maya Hristakeva

Recommender System Components

Information flow (components often built in parallel):

Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience

Page 7: Sparking Science up with Research Recommendations by Maya Hristakeva

Mendeley Suggest Components (Past)

Information flow (components often built in parallel):

Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience

Page 8: Sparking Science up with Research Recommendations by Maya Hristakeva

Mendeley Suggest Components (Present)

Information flow (components often built in parallel):

Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience

Page 9: Sparking Science up with Research Recommendations by Maya Hristakeva

Mendeley Suggest Components (Goal)

Information flow (components often built in parallel):

Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience

Page 10: Sparking Science up with Research Recommendations by Maya Hristakeva

Overview

• What is Mendeley Suggest?

• Computation Layer

– Algorithms

– Evaluation

– Implementations & Performance

• Conclusions

Page 11: Sparking Science up with Research Recommendations by Maya Hristakeva

Personalized Article Recommendations

Input: user libraries

Output: suggested articles to read

Algorithms:

• Collaborative Filtering
– Item-based
– User-based
– Matrix Factorization

• Content-based

Page 12: Sparking Science up with Research Recommendations by Maya Hristakeva

Item-based Collaborative Filtering

Recommend articles that are similar to the ones you read
– Similarity is based on article co-occurrences in users’ libraries
– “Users who read x also read y”

Page 13: Sparking Science up with Research Recommendations by Maya Hristakeva

User-based Collaborative Filtering

Find users who have similar appreciation for articles as you
– Similarity is based on the overlap between users’ libraries

Recommend new articles based on what the users similar to you read
– “Users similar to you (based on a, b, c) also read x”
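
The user-based idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Mendeley implementation: Jaccard overlap is an assumed choice of similarity (the slide only says "libraries overlap"), and the function names, the `k_neighbours` parameter and the toy libraries are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two libraries (sets of article ids); an assumed similarity."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend_user_based(libraries, user, k_neighbours=2, top_n=3):
    """Score unseen articles by the summed similarity of the neighbours who hold them."""
    mine = libraries[user]
    sims = [
        (jaccard(mine, lib), other)
        for other, lib in libraries.items()
        if other != user
    ]
    # Keep only the k most similar users that share at least one article.
    neighbours = [pair for pair in sorted(sims, reverse=True) if pair[0] > 0][:k_neighbours]
    scores = defaultdict(float)
    for sim, other in neighbours:
        for article in libraries[other] - mine:
            scores[article] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]


# Toy data (hypothetical): each user maps to the set of articles in their library.
libraries = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b", "c", "x"},
    "carol": {"a", "b", "y"},
    "dave": {"z"},
}
recs = recommend_user_based(libraries, "alice")
```

Here `bob` and `carol` are the users similar to `alice` (based on a, b, c), so their unseen articles are the candidates.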

Page 14: Sparking Science up with Research Recommendations by Maya Hristakeva

Matrix Factorization CF

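[Figure: sparse matrix $X_{n \times m}$ with observed entries and one unknown entry marked ?, factored into $U_{n \times k}$ and $V_{k \times m}$]

$$f_{ij} = \langle U_{i*}, V_{*j} \rangle$$

$$E(U, V) = L(X_{ij}, f_{ij}) + R(U, V)$$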
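
To make the notation concrete, here is a small sketch that evaluates the prediction f_ij and the objective E(U, V) on toy data. It assumes one common instantiation that the slide does not specify: L is squared error summed over the observed entries, and R is an L2 penalty with weight `lam`. The function names and the toy factors are hypothetical.

```python
def predict(U, V, i, j):
    """f_ij = <U_i*, V_*j>: dot product of row i of U and column j of V."""
    return sum(U[i][f] * V[f][j] for f in range(len(V)))


def objective(X, U, V, lam=0.1):
    """E(U, V) with assumed squared-error loss and L2 regularisation.

    X is a dict {(i, j): value} holding only the observed entries.
    """
    loss = sum((x - predict(U, V, i, j)) ** 2 for (i, j), x in X.items())
    reg = lam * (
        sum(u * u for row in U for u in row)
        + sum(v * v for row in V for v in row)
    )
    return loss + reg


# Toy factors with k = 2 (hypothetical values).
U = [[1.0, 0.0], [0.0, 1.0]]  # n x k
V = [[2.0, 4.0], [5.0, 1.0]]  # k x m
X = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 5.0}  # entry (1, 1) is the unknown "?"

missing = predict(U, V, 1, 1)  # model's estimate for the unknown entry
```

Minimising the objective over U and V fits the observed entries, and the fitted factors then give estimates for the unknown ones.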

Page 15: Sparking Science up with Research Recommendations by Maya Hristakeva

Overview

• What is Mendeley Suggest?

• Computation Layer

– Algorithms

– Evaluation

– Implementations

• Conclusions

Page 16: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad]

Page 19: Sparking Science up with Research Recommendations by Maya Hristakeva

How to measure quality?

• Offline Evaluation
– Parameter sweep is quick
– Don’t offend real users

• Methodology
– n-fold cross-validation
– time-based validation

• Metrics
– precision, recall and f-measure
– AUC (area under ROC curve), NDCG (normalized discounted cumulative gain)
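
As a rough sketch of the ranking metrics listed above, the functions below compute precision, recall, f-measure and NDCG for one user's top-k list. Binary relevance (an article is either in the held-out set or not) is an assumption, and the function names and sample data are hypothetical.

```python
import math


def precision_recall_f1(recommended, relevant, k):
    """Precision, recall and f-measure of the top-k recommendations."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def ndcg_at_k(recommended, relevant, k):
    """NDCG with binary gains: hits near the top of the list count more."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


recommended = ["a", "b", "c", "d"]  # ranked output of a recommender
relevant = {"a", "c", "z"}          # held-out articles the user actually added
precision, recall, f1 = precision_recall_f1(recommended, relevant, k=4)
ndcg = ndcg_at_k(recommended, relevant, k=4)
```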

Page 20: Sparking Science up with Research Recommendations by Maya Hristakeva

Overview

• What is Mendeley Suggest?

• Computation Layer

– Algorithms

– Evaluation

– Implementations

• Conclusions

Page 21: Sparking Science up with Research Recommendations by Maya Hristakeva

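Implementations

[Table: implementations (columns) against algorithms (rows); cell contents not recoverable]

Columns: Mahout (Hadoop), Mendeley (Hadoop), Mahout (Spark), Mendeley (Spark), MLlib (Spark)

Rows: Item-based CF, User-based CF, Matrix Factorization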

Page 22: Sparking Science up with Research Recommendations by Maya Hristakeva

Setup

• EMR Cluster
– Master: 1 x r3.xlarge instance (4 core, 32GB)
– Core: 10 x r3.2xlarge instances (8 core, 64GB)

• Data: user libraries
– 15mil documents >>> 1mil users
– 150mil interactions

• Offline Evaluation
– Methodology: time-based evaluation
– Metric: precision@10
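
A time-based evaluation with precision@10 could look roughly like the sketch below: interactions before a cutoff are used for training, later ones are held out, and each user's top 10 are checked against what they added afterwards. The `(user, item, timestamp)` tuple layout, the function names and the popularity baseline are assumptions for illustration only, not the setup used in the talk.

```python
def time_based_split(interactions, cutoff):
    """Split (user, item, timestamp) tuples into train (before cutoff) and test."""
    train, test = {}, {}
    for user, item, ts in interactions:
        target = train if ts < cutoff else test
        target.setdefault(user, set()).add(item)
    return train, test


def mean_precision_at_k(recommend, train, test, k=10):
    """Average precision@k over users present in both train and test."""
    users = [u for u in test if u in train]
    if not users:
        return 0.0
    total = 0.0
    for u in users:
        recs = recommend(u, train)[:k]
        held_out = test[u] - train[u]
        total += sum(1 for item in recs if item in held_out) / k
    return total / len(users)


def popular_recommender(user, train):
    """Hypothetical baseline: most popular training items the user lacks."""
    counts = {}
    for items in train.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return [item for item in ranked if item not in train[user]]


interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 5),
    ("u2", "a", 1), ("u2", "c", 2), ("u2", "d", 6),
]
train, test = time_based_split(interactions, cutoff=4)
score = mean_precision_at_k(popular_recommender, train, test, k=10)
```

Any recommender with the same `(user, train)` signature can be plugged in, which is what makes offline parameter sweeps cheap.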

Page 24: Sparking Science up with Research Recommendations by Maya Hristakeva

Apache Mahout

• Mahout (out-of-the-box)
– Item-based CF
  • org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
– ALS Matrix Factorization
  • org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
  • org.apache.mahout.cf.taste.hadoop.als.RecommenderJob

• Implemented User-based CF on top of Mahout at Mendeley

Page 25: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; marker at ~$125]

Points: Orig. item-based Mahout, Tuned item-based Mahout

Annotation: -0.5K (-60%)

Page 26: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; marker at ~$125]

Points: Orig. item-based Mahout, Tuned item-based Mahout, Orig. user-based Mahout, Tuned user-based Mahout

Annotations: -0.5K (-60%), -0.1K (-40%)

Page 27: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; marker at ~$125]

Points: Orig. item-based Mahout, Tuned item-based Mahout, Orig. user-based Mahout, Tuned user-based Mahout

Annotations: +150%, -0.2K (-55%), -0.7K (-82%)

Page 28: Sparking Science up with Research Recommendations by Maya Hristakeva

Mahout Performance

• Mahout’s recommender is already efficient
– But your data may have unusual properties

• We got improvements by
– Tuning Hadoop’s mapper and reducer allocation over the Recommender Job steps
– Using an appropriate partitioner

• Improve quality
– Mahout provides Item-based CF
– We have many more items than users
– Typically, user-based is more appropriate in that case

Page 30: Sparking Science up with Research Recommendations by Maya Hristakeva

Mahout Spark

• Co-occurrence Recommenders with Spark
– Item-Item similarity
  • mahout spark-itemsimilarity
  • SimilarityAnalysis.cooccurrencesIDSs(ratings, …)
– User-User similarity
  • mahout spark-rowsimilarity
  • SimilarityAnalysis.rowSimilarityIDSs(ratings, …)

• Only supports Boolean data and log-likelihood similarity

• Does not generate actual recommendations
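
For readers unfamiliar with log-likelihood similarity on Boolean data, the sketch below computes a log-likelihood ratio from a 2x2 co-occurrence table. It follows the entropy-based formulation commonly attributed to Dunning's test; treat it as an illustration of the measure under that assumption, not as the code Mahout runs. The function names and the count layout (`k11` = users with both items, `k12` and `k21` = users with only one, `k22` = users with neither) are introduced here for the example.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalised entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Higher values mean the two items co-occur more than chance would predict."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    matrix = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - matrix))


independent = log_likelihood_ratio(10, 10, 10, 10)  # no association
associated = log_likelihood_ratio(10, 0, 0, 10)     # always read together
```

Because it only needs counts of who has which item, the measure fits Boolean library data, where there are no ratings.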

Page 31: Sparking Science up with Research Recommendations by Maya Hristakeva

Mahout Spark

• Could not get it to run successfully on our data

• Got further by tuning parameters, but it still failed with OOM (out of memory)
– spark.driver.maxResultSize
– spark.kryoserializer.buffer.max
– spark.default.parallelism
– spark.storage.memoryFraction

• Gave best runtime performance on MovieLens datasets
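
The parameters named above are ordinary Spark configuration properties, so they can be passed to `spark-submit` as `--conf key=value` pairs. The sketch below only shows how such a list might be assembled. The values are placeholders chosen for illustration, not the settings used in the talk, and `to_submit_args` is a hypothetical helper.

```python
# Placeholder values (assumptions); tune them against your own data and cluster.
spark_conf = {
    "spark.driver.maxResultSize": "2g",
    "spark.kryoserializer.buffer.max": "512m",
    "spark.default.parallelism": "200",
    "spark.storage.memoryFraction": "0.4",
}


def to_submit_args(conf):
    """Turn a property dict into spark-submit style ['--conf', 'key=value', ...]."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(spark_conf)
```

Keeping the settings in one dictionary makes it easier to sweep them one at a time and record which combination a given run used.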

Page 33: Sparking Science up with Research Recommendations by Maya Hristakeva

Mendeley Spark

• Started as hack-day project
– Implement Item-based and User-based CF in Spark

• Can be implemented in two steps
1. Compute Item-Item or User-User Similarities
  • given user preferences
2. Compute Recommendations
  • given similarities and user preferences

Page 34: Sparking Science up with Research Recommendations by Maya Hristakeva

Spark: Item-Item Similarity
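
The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal plain-Python sketch of the similarity step, assuming similarity is the raw co-occurrence count of two articles across user libraries. It is not the Spark code from the talk; in Spark the same shape would typically be expressed as emitting item pairs per user and summing counts by key. The function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def item_cooccurrences(libraries):
    """Count, for every pair of items, how many libraries contain both."""
    counts = Counter()
    for items in libraries.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


def top_neighbours(counts, k=2):
    """For each item, keep the k items it co-occurs with most often."""
    neighbours = defaultdict(list)
    for (a, b), n in counts.items():
        neighbours[a].append((n, b))
        neighbours[b].append((n, a))
    return {
        item: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for item, pairs in neighbours.items()
    }


libraries = {"u1": {"x", "y"}, "u2": {"x", "y", "z"}, "u3": {"x", "z"}}
counts = item_cooccurrences(libraries)
neighbours = top_neighbours(counts)
```

Note that the number of pairs grows quadratically with library size, which is one reason very large libraries make this step expensive.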

Page 38: Sparking Science up with Research Recommendations by Maya Hristakeva

Spark: Item-Based Recs
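
The code shown on this slide did not survive in the transcript. The recommendation step can be sketched as follows, assuming item-item similarities are already available as a nested dictionary: every article in the user's library votes for its similar articles, and articles the user already has are excluded. The function name, the dictionary layout and the similarity values are hypothetical.

```python
from collections import defaultdict


def recommend_item_based(library, similarities, top_n=10):
    """Sum similarity scores of candidates over all items in the user's library."""
    scores = defaultdict(float)
    for seed in library:
        for candidate, sim in similarities.get(seed, {}).items():
            if candidate not in library:
                scores[candidate] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Precomputed item-item similarities (made-up values for illustration).
similarities = {
    "x": {"y": 0.8, "z": 0.5},
    "y": {"x": 0.8, "z": 0.7},
    "z": {"x": 0.5, "y": 0.7},
}
recs_small = recommend_item_based({"x"}, similarities)
recs_large = recommend_item_based({"x", "y"}, similarities)
```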

Page 40: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; marker at ~$50]

Points: Tuned IB Mahout, Tuned UB Mahout, Orig. UB Spark

Page 41: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; marker at ~$50]

Points: Tuned IB Mahout, Tuned UB Mahout, Orig. UB Spark, Tuned UB Spark, Tuned IB Spark

Annotation: -0.1K (-40%)

Page 42: Sparking Science up with Research Recommendations by Maya Hristakeva

Mendeley Spark Performance

• Spark implementation of User-based CF performs well

• Managed to run a variation of Item-based CF
– Uses fewer of each user’s items as seeds for finding similar items
– Quality not impacted much

• We got improvements by tuning
– Resource allocation
– Parallelism
– http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Page 44: Sparking Science up with Research Recommendations by Maya Hristakeva

Spark MLlib DimSum

• DimSum: efficient algorithm for computing all-pairs similarity
– “Dimension Independent Matrix Square using MapReduce”
– Contributed by Twitter

• Replace similarity computation with DimSum
– Only supports cosine similarity

• Does not generate actual recommendations
– Compute recommendations as before
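
For reference, the quantity DimSum targets is the cosine similarity between every pair of columns of a sparse matrix. The sketch below computes it exactly by brute force in plain Python; DimSum itself uses sampling to reduce the work and is exposed in MLlib through `RowMatrix.columnSimilarities`. The function name, the `min_similarity` filter and the toy rows are assumptions made for this example.

```python
import math


def column_cosine_similarities(rows, min_similarity=0.0):
    """Exact all-pairs cosine similarity between columns of a sparse matrix.

    rows is a list of sparse rows, each a dict {column: value}.
    """
    norms = {}  # squared norm of each column
    dots = {}   # dot product of each co-occurring column pair
    for row in rows:
        cols = sorted(row)
        for c in cols:
            norms[c] = norms.get(c, 0.0) + row[c] ** 2
        for i, a in enumerate(cols):
            for b in cols[i + 1:]:
                dots[(a, b)] = dots.get((a, b), 0.0) + row[a] * row[b]
    sims = {}
    for (a, b), dot in dots.items():
        sim = dot / math.sqrt(norms[a] * norms[b])
        if sim >= min_similarity:
            sims[(a, b)] = sim
    return sims


# Rows are users, columns are articles; 1.0 marks "in library".
rows = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 1.0},
    {"a": 1.0},
]
sims = column_cosine_similarities(rows)
strong = column_cosine_similarities(rows, min_similarity=0.6)
```

With rows as users this yields item-item similarities; transposing the input (rows as articles) yields user-user similarities.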

Page 45: Sparking Science up with Research Recommendations by Maya Hristakeva

MLlib DimSum Item-Item Similarity

Page 46: Sparking Science up with Research Recommendations by Maya Hristakeva

MLlib DimSum User-User Similarity

Page 47: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; marker at ~$50]

Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum (Spark MLlib)

Page 48: Sparking Science up with Research Recommendations by Maya Hristakeva

Spark MLlib Matrix Factorization

Implements alternating least squares (ALS)

1. Compute Model
2. Compute Recommendations
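
Step 2 can be illustrated without Spark: once a factor model exists, a user's recommendations are the unseen items with the largest dot product between the user vector and the item vector. The sketch below assumes the factors are already computed; the function name and the factor values are hypothetical.

```python
def top_n_from_factors(user_factors, item_factors, seen, n=10):
    """Rank unseen items by the dot product of user and item factor vectors."""
    scored = []
    for item, vec in item_factors.items():
        if item in seen:
            continue
        score = sum(u * v for u, v in zip(user_factors, vec))
        scored.append((score, item))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:n]]


# Made-up rank-2 factors for one user and three items.
user_vector = [1.0, 0.5]
item_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
recs = top_n_from_factors(user_vector, item_vectors, seen={"a"})
```

Scoring every item for every user this way costs on the order of users × items × k operations, so at large scale this step usually needs batching or approximate search.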

Page 49: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; marker at ~$50]

Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum (Spark MLlib), ALS Matrix Fact. (Spark MLlib)

Annotation: -50%

Page 50: Sparking Science up with Research Recommendations by Maya Hristakeva

MLlib Performance

• Provides good alternative for computing user-user similarities
– Due to data sparsity, not getting big gains in runtime
– Only supports cosine similarity

• Failed to compute item-item similarities
– Exceeds maximum allowed value of 2G for spark.kryoserializer.buffer.max

• User-based CF outperforms ALS CF

• Need scalable solution for generating recommendations based on ALS CF model

Page 52: Sparking Science up with Research Recommendations by Maya Hristakeva

Overview

• What is Mendeley Suggest?

• Computation Layer

• Conclusions

Page 53: Sparking Science up with Research Recommendations by Maya Hristakeva

[Chart: Performance, with quadrants Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad; marker at ~$50]

Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum (Spark MLlib), ALS Matrix Fact. (Spark MLlib)

Annotations: +100%, +150%

Page 54: Sparking Science up with Research Recommendations by Maya Hristakeva

Mendeley Suggest Components (Future)

Information flow (components often built in parallel):

Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience

Page 55: Sparking Science up with Research Recommendations by Maya Hristakeva

Conclusions

• Mendeley Suggest is a personalized article recommender

• Spark is a good alternative to Mahout as computation layer
– Needs some love and tuning
– Far fewer lines of code: easier to maintain and extend

• User-based CF can outperform item-based CF and matrix factorization

• Save resources and money by understanding your data

• Test offline before deploying
– but online tests are also needed to measure real performance

Page 56: Sparking Science up with Research Recommendations by Maya Hristakeva

Thank you!

mendeley.com/suggest

