Date posted: 13-Apr-2017
Category: Data & Analytics
Upload: spark-summit
Sparking Science up with Research Recommendations
Maya Hristakeva (@mayahhf)
Overview
• What is Mendeley Suggest?
• Computation Layer
• Conclusions
Mendeley builds tools to help researchers …
• Read & Organize
• Search & Discover
• Collaborate & Network
• Experiment & Synthesize
Being the best researcher you can be!
• Good researchers are on top of their game
• Large amount of research produced
• Takes time to get what you need
• Help researchers by recommending relevant research
Mendeley Suggest: Personalized Article Recommender
Recommender System Components
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
Mendeley Suggest Components (Past)
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
Mendeley Suggest Components (Present)
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
Mendeley Suggest Components (Goal)
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
Overview
• What is Mendeley Suggest?
• Computation Layer
  – Algorithms
  – Evaluation
  – Implementations & Performance
• Conclusions
Personalized Article Recommendations
Input: User libraries
Output: Suggested articles to read
Algorithms:
• Collaborative Filtering
  – Item-based
  – User-based
  – Matrix Factorization
• Content-based
Item-based Collaborative Filtering
Recommend articles that are similar to the ones you read
  – Similarity is based on article co-occurrences in users’ libraries
  – “Users who read x also read y”
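A minimal sketch of the co-occurrence idea above, assuming libraries are held as plain Python sets. The toy data and the names `cooccurrence_counts` and `also_read` are illustrative only and do not come from the talk.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy libraries: user id -> set of article ids.
libraries = {
    "u1": {"x", "y", "z"},
    "u2": {"x", "y"},
    "u3": {"y", "z"},
}

def cooccurrence_counts(libraries):
    """For each article, count how often every other article shares a library with it."""
    counts = defaultdict(Counter)
    for articles in libraries.values():
        for a, b in combinations(sorted(articles), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def also_read(article, libraries, top_n=2):
    """Articles most often found alongside `article`: "users who read x also read y"."""
    ranked = cooccurrence_counts(libraries)[article].most_common(top_n)
    return [other for other, _ in ranked]

counts = cooccurrence_counts(libraries)
print(also_read("x", libraries))
```

Raw counts favour popular articles, so a real system would normally turn them into a normalised similarity score before ranking.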
User-based Collaborative Filtering
Find users who have a similar appreciation for articles as you
  – Similarity is based on the overlap between users’ libraries
Recommend new articles based on what the users similar to you read
  – “Users similar to you (based on a, b, c) also read x”
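The user-based approach can be sketched as follows. The slides only say that similarity is based on library overlap; Jaccard similarity, the neighbourhood size `k`, and the toy data are assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy libraries: user id -> set of article ids.
libraries = {
    "me": {"a", "b", "c"},
    "u1": {"a", "b", "c", "x"},
    "u2": {"a", "x", "y"},
    "u3": {"q", "r"},
}

def jaccard(s, t):
    """Library overlap: size of the intersection divided by size of the union."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def recommend_user_based(user, libraries, k=2, top_n=3):
    """Score unseen articles by the summed similarity of the k nearest users who have them."""
    mine = libraries[user]
    neighbours = sorted(
        ((jaccard(mine, lib), other) for other, lib in libraries.items() if other != user),
        reverse=True,
    )[:k]
    scores = Counter()
    for sim, other in neighbours:
        if sim <= 0:
            continue  # users with no overlap contribute nothing
        for article in libraries[other] - mine:
            scores[article] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]

print(recommend_user_based("me", libraries))
```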
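Matrix Factorization CF
[Figure: sparse matrix X (n × m) with example entries between 1 and 5 and an unknown entry marked “?”, approximated by the product of U (n × k) and V (k × m)]
$X_{n \times m} \approx U_{n \times k} V_{k \times m}$
$f_{ij} = \langle U_{i*}, V_{*j} \rangle$
$E(U, V) = L(X_{ij}, f_{ij}) + R(U, V)$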
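To make the notation concrete, here is a small sketch that evaluates f_ij and E(U, V). The slide leaves the loss L and the regulariser R generic; squared error over observed entries and an L2 penalty with weight `lam` are assumptions for this example, as are the toy matrices.

```python
def predict(U, V, i, j):
    """f_ij = <U_i*, V_*j>: dot product of row i of U with column j of V."""
    return sum(U[i][f] * V[f][j] for f in range(len(V)))

def objective(X, U, V, lam=0.1):
    """E(U, V) with an assumed squared-error loss and an assumed L2 regulariser."""
    loss = sum(
        (x - predict(U, V, i, j)) ** 2
        for i, row in enumerate(X)
        for j, x in enumerate(row)
        if x is not None  # only observed entries contribute to the loss
    )
    reg = lam * (
        sum(u * u for row in U for u in row) + sum(v * v for row in V for v in row)
    )
    return loss + reg

# Hypothetical toy data: None marks an unobserved entry of X.
X = [[5.0, None], [4.0, 1.0]]   # n x m
U = [[1.0, 0.0], [0.0, 1.0]]    # n x k
V = [[5.0, 2.0], [4.0, 1.0]]    # k x m

print(predict(U, V, 0, 1))      # model's estimate for the unobserved entry
print(objective(X, U, V))
```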
Performance
[Figure: quadrant chart with regions Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad]
How to measure quality?
• Offline Evaluation
  – Parameter sweep is quick
  – Don’t offend real users
• Methodology
  – n-fold cross-validation
  – time-based validation
• Metrics
  – precision, recall and f-measure
  – AUC (area under ROC curve), NDCG (normalized discounted cumulative gain)
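A short sketch of some of the metrics listed above, computed for one user's ranked recommendations. Binary relevance (an article is either in the held-out set or not) and dividing precision by `k` are assumptions for this example; the function names are illustrative.

```python
import math

def precision_recall_f1(recommended, relevant, k=10):
    """Precision@k, recall@k and their harmonic mean (f-measure)."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def ndcg(recommended, relevant, k=10):
    """NDCG@k with binary gains: hits near the top of the list count more."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranked list and held-out articles for one user.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(precision_recall_f1(recommended, relevant, k=4))
print(ndcg(recommended, relevant, k=4))
```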
Implementations
[Table: implementations (columns) Mahout (Hadoop), Mendeley (Hadoop), Mahout (Spark), Mendeley (Spark), MLlib (Spark) against algorithms (rows) Item-based CF, User-based CF, Matrix Factorization]
Setup
• EMR Cluster
  – Master: 1 x r3.xlarge instance (4 cores, 32GB)
  – Core: 10 x r3.2xlarge instances (8 cores, 64GB)
• Data: user libraries
  – 15 million documents >>> 1 million users
  – 150 million interactions
• Offline Evaluation
  – Methodology: time-based evaluation
  – Metric: precision@10
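One way to picture time-based evaluation with precision@10: train on interactions before a cutoff, then check how many recommendations show up in what users added afterwards. The `(user, article, timestamp)` layout, the cutoff value, and the `recommend` callback are assumptions for this sketch, not the setup used in the talk.

```python
def time_based_split(interactions, cutoff):
    """Split (user, article, timestamp) records into train (before cutoff) and test (from cutoff on)."""
    train, test = {}, {}
    for user, article, timestamp in interactions:
        target = train if timestamp < cutoff else test
        target.setdefault(user, set()).add(article)
    return train, test

def mean_precision_at_k(recommend, train, test, k=10):
    """Average precision@k over users that appear in both train and test."""
    scores = []
    for user, held_out in test.items():
        if user not in train:
            continue  # nothing to base recommendations on
        recs = recommend(user, train)[:k]
        scores.append(sum(1 for article in recs if article in held_out) / k)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical interactions with integer timestamps.
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 5),
    ("u2", "a", 1), ("u2", "d", 6),
]
train, test = time_based_split(interactions, cutoff=4)

# Placeholder recommender that always returns the same list.
always_c = lambda user, train: ["c", "z"]
print(mean_precision_at_k(always_c, train, test, k=2))
```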
Apache Mahout
• Mahout (out-of-the-box)
  – Item-based CF
    • org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
  – ALS Matrix Factorization
    • org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob
    • org.apache.mahout.cf.taste.hadoop.als.RecommenderJob
• Implemented User-based CF on top of Mahout at Mendeley
Performance
[Figure: quadrant chart (Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad). Points: Orig. item-based Mahout, Tuned item-based Mahout. Annotations: -0.5K (-60%), ~$125.]
Performance
[Figure: quadrant chart (Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad). Points: Orig. item-based Mahout, Tuned item-based Mahout, Orig. user-based Mahout, Tuned user-based Mahout. Annotations: -0.5K (-60%), -0.1K (-40%), ~$125.]
Performance
[Figure: quadrant chart (Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad). Points: Orig. item-based Mahout, Tuned item-based Mahout, Orig. user-based Mahout, Tuned user-based Mahout. Annotations: +150%, -0.2K (-55%), -0.7K (-82%), ~$125.]
Mahout Performance
• Mahout’s recommender is already efficient
  – But your data may have unusual properties
• We got improvements by
  – Tuning Hadoop’s mapper and reducer allocation over the RecommenderJob steps
  – Using an appropriate partitioner
• Improve quality
  – Mahout provides Item-based CF
  – We have many more items than users
  – In that situation, user-based CF is typically more appropriate
Mahout Spark
• Co-occurrence Recommenders with Spark
  – Item-Item similarity
    • mahout spark-itemsimilarity
    • SimilarityAnalysis.cooccurrencesIDSs(ratings, …)
  – User-User similarity
    • mahout spark-rowsimilarity
    • SimilarityAnalysis.rowSimilarityIDSs(ratings, …)
• Only supports Boolean data and log-likelihood similarity
• Does not generate actual recommendations
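For readers unfamiliar with log-likelihood similarity on Boolean data, the sketch below computes the standard log-likelihood ratio (G-test) for a 2×2 co-occurrence table. It is assumed here to correspond to the similarity the slide refers to; the function names and the example counts are illustrative and this is not Mahout's code.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalised entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """
    k11: users with both items    k12: users with item A only
    k21: users with item B only   k22: users with neither item
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb small negative values from floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))

print(log_likelihood_ratio(10, 10, 10, 10))  # items occur independently of each other
print(log_likelihood_ratio(10, 0, 0, 10))    # items always occur together
```

A score near zero means the two items co-occur about as often as chance would predict; larger scores indicate a stronger association.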
Mahout Spark
• Could not get it to run successfully on our data
• Got further by tuning parameters, but it still failed with OOM (out of memory)
  – spark.driver.maxResultSize
  – spark.kryoserializer.buffer.max
  – spark.default.parallelism
  – spark.storage.memoryFraction
• Gave best runtime performance on MovieLens datasets
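The parameters above can be passed to `spark-submit` as `--conf key=value` pairs. The helper below only builds that argument list; every value shown, and the `job.py` file name, is a placeholder chosen for illustration and not a setting from the talk.

```python
def spark_submit_conf_args(settings):
    """Turn a dict of Spark properties into a flat list of --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args

# Placeholder values only; suitable settings depend on the cluster and the data.
settings = {
    "spark.driver.maxResultSize": "4g",
    "spark.kryoserializer.buffer.max": "1g",
    "spark.default.parallelism": "400",
    "spark.storage.memoryFraction": "0.4",
}

conf_args = spark_submit_conf_args(settings)
print(" ".join(["spark-submit"] + conf_args + ["job.py"]))
```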
Mendeley Spark
• Started as a hack-day project
  – Implement Item-based and User-based CF in Spark
• Can be implemented in two steps
  1. Compute Item-Item or User-User Similarities, given user preferences
  2. Compute Recommendations, given similarities and user preferences
Spark: Item-Item Similarity
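The code shown on this slide is not part of the transcript. As a stand-in, here is a plain-Python sketch of the first step (similarities from user preferences), written in a map and reduce style that would translate to Spark's `flatMap` and `reduceByKey`. Cosine similarity over Boolean libraries is an assumption, and this is not Mendeley's implementation.

```python
import math
from collections import Counter
from itertools import combinations

def item_item_similarity(libraries):
    """Cosine similarity between articles, treating each library as a Boolean vector."""
    item_counts = Counter()   # number of libraries containing each article
    pair_counts = Counter()   # number of libraries containing both articles of a pair
    for articles in libraries.values():
        ordered = sorted(articles)
        # "map": emit one record per article and per article pair in this library;
        # "reduce": the Counters sum the records that share a key.
        item_counts.update(ordered)
        pair_counts.update(combinations(ordered, 2))
    return {
        (a, b): both / math.sqrt(item_counts[a] * item_counts[b])
        for (a, b), both in pair_counts.items()
    }

# Hypothetical toy libraries: user id -> set of article ids.
libraries = {
    "u1": {"x", "y", "z"},
    "u2": {"x", "y"},
    "u3": {"y", "z"},
}
sims = item_item_similarity(libraries)
print(sims)
```

Swapping the roles of users and articles in the input gives user-user similarities with the same function.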
Spark: Item-Based Recs
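The code on this slide is likewise missing from the transcript. The sketch below illustrates the second step (recommendations from similarities and user preferences) in plain Python; scoring by summed similarity and the toy similarity values are assumptions for this example.

```python
from collections import defaultdict

def recommend_item_based(library, sims, top_n=10):
    """Score each unseen article by the sum of its similarities to articles in the library."""
    scores = defaultdict(float)
    for (a, b), sim in sims.items():
        if a in library and b not in library:
            scores[b] += sim
        elif b in library and a not in library:
            scores[a] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]

# Hypothetical precomputed similarities, keyed by unordered article pairs.
sims = {("x", "y"): 0.8, ("x", "z"): 0.5, ("y", "z"): 0.6, ("w", "y"): 0.3}

print(recommend_item_based({"x"}, sims))
print(recommend_item_based({"x", "y"}, sims))
```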
Performance
[Figure: quadrant chart (Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad). Points: Tuned IB Mahout, Tuned UB Mahout, Orig. UB Spark. Annotation: ~$50.]
Performance
[Figure: quadrant chart (Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad). Points: Tuned IB Mahout, Tuned UB Mahout, Orig. UB Spark, Tuned UB Spark, Tuned IB Spark. Annotations: -0.1K (-40%), ~$50.]
Mendeley Spark Performance
• Spark implementation of User-based CF performs well
• Managed to run a variation of Item-based CF
  – Uses fewer items per user when looking up similar items to recommend
  – Quality not impacted much
• We got improvements by tuning
  – Resource allocation
  – Parallelism
  – http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Spark MLlib DimSum
• DimSum: efficient algorithm for computing all-pairs similarity
  – “Dimension Independent Matrix Square using MapReduce”
  – Contributed by Twitter
• Replace similarity computation with DimSum
  – Only supports cosine similarity
• Does not generate actual recommendations
  – Compute recommendations as before
MLlib DimSum Item-Item Similarity
MLlib DimSum User-User Similarity
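The code for these two slides is not in the transcript. In MLlib the computation is exposed as `RowMatrix.columnSimilarities(threshold)`. The plain-Python sketch below only illustrates the sampling idea behind DimSum in simplified form: columns with large norms are kept with lower probability, and kept values are rescaled so the estimate stays unbiased. The `gamma` parameter and the simplified per-row sampling are assumptions for this example, not MLlib's code.

```python
import math
import random
from collections import defaultdict

def dimsum_column_similarities(rows, gamma=float("inf"), seed=0):
    """Estimate cosine similarity between columns of a sparse matrix.

    rows: list of dicts mapping column id -> non-zero value.
    gamma: oversampling parameter; infinity disables sampling and gives exact values.
    """
    rng = random.Random(seed)
    squared_norms = defaultdict(float)
    for row in rows:
        for col, value in row.items():
            squared_norms[col] += value * value
    norms = {col: math.sqrt(total) for col, total in squared_norms.items()}

    sqrt_gamma = math.sqrt(gamma)
    keep_prob = {col: min(1.0, sqrt_gamma / norm) for col, norm in norms.items()}
    scale = {col: min(sqrt_gamma, norm) for col, norm in norms.items()}

    sims = defaultdict(float)
    for row in rows:
        # Keep each entry with its column's probability, rescaled to stay unbiased.
        kept = [
            (col, value / scale[col])
            for col, value in sorted(row.items())
            if rng.random() < keep_prob[col]
        ]
        for idx, (a, value_a) in enumerate(kept):
            for b, value_b in kept[idx + 1:]:
                sims[(a, b)] += value_a * value_b
    return dict(sims)

# Hypothetical Boolean user x article matrix: one dict per user.
rows = [{"x": 1.0, "y": 1.0, "z": 1.0}, {"x": 1.0, "y": 1.0}, {"y": 1.0, "z": 1.0}]
exact = dimsum_column_similarities(rows)
print(exact)
```

With users as rows this yields item-item similarities; transposing the matrix so that articles are rows yields user-user similarities.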
Performance
[Figure: quadrant chart (Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad). Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum (Spark MLlib). Annotation: ~$50.]
Spark MLlib Matrix Factorization
Implements alternating least squares (ALS)
1. Compute Model
2. Compute Recommendations
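To show what the two steps involve, here is a toy ALS with a single latent factor in plain Python. MLlib's ALS handles arbitrary rank and runs distributed; the rank-one restriction, the regularisation weight `lam`, and the toy ratings are simplifications assumed for this sketch.

```python
def als_rank_one(ratings, n_users, n_items, lam=0.01, iterations=50):
    """Step 1: fit u (users) and v (items) so that u[i] * v[j] approximates observed ratings.

    ratings: dict mapping (user_index, item_index) -> observed value.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix v and solve the regularised least-squares problem for each u[i].
        for i in range(n_users):
            num = sum(x * v[j] for (r, j), x in ratings.items() if r == i)
            den = lam + sum(v[j] ** 2 for (r, j) in ratings if r == i)
            u[i] = num / den
        # Fix u and solve for each v[j] in the same way.
        for j in range(n_items):
            num = sum(x * u[i] for (i, c), x in ratings.items() if c == j)
            den = lam + sum(u[i] ** 2 for (i, c) in ratings if c == j)
            v[j] = num / den
    return u, v

def recommend(user, u, v, ratings, top_n=10):
    """Step 2: rank the items the user has not interacted with by predicted score."""
    unseen = [j for j in range(len(v)) if (user, j) not in ratings]
    return sorted(unseen, key=lambda j: -u[user] * v[j])[:top_n]

# Hypothetical ratings drawn from a rank-one matrix, with entry (2, 1) held out.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
u, v = als_rank_one(ratings, n_users=3, n_items=2)
print(u[2] * v[1])                    # prediction for the held-out entry
print(recommend(2, u, v, ratings))
```

Scoring every unseen item for every user, as `recommend` does, is the part that becomes expensive at scale.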
Performance
[Figure: quadrant chart (Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad). Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum (Spark MLlib), ALS Matrix Fact. (Spark MLlib). Annotations: -50%, ~$50.]
MLlib Performance
• Provides a good alternative for computing user-user similarities
  – Due to data sparsity, not getting big gains in runtime
  – Only supports cosine similarity
• Failed to compute item-item similarities
  – Exceeds the maximum allowed value of 2G for spark.kryoserializer.buffer.max
• User-based CF outperforms ALS CF
• Need a scalable solution for generating recommendations based on the ALS CF model
Performance
[Figure: quadrant chart (Costly & Good, Costly & Bad, Cheap & Good, Cheap & Bad). Points: Tuned IB Mahout, Tuned UB Mahout, Tuned UB Spark, Tuned IB Spark, UB DimSum (Spark MLlib), ALS Matrix Fact. (Spark MLlib). Annotations: +100%, +150%, ~$50.]
Mendeley Suggest Components (Future)
Data (Feature Engineering) → Algorithms → Business Logic and Analytics → User Experience
(information flow; components often built in parallel)
Conclusions
• Mendeley Suggest is a personalized article recommender
• Spark is a good alternative to Mahout as computation layer
  – Needs some love and tuning
  – Far fewer lines of code, so easier to maintain and extend
• User-based can outperform item-based and matrix factorization
• Save resources and money by understanding your data
• Test offline before deploying
  – but also need online tests to get real performance
Thank you!
mendeley.com/suggest