13.09.2012 DIMA – TU Berlin 1
Database Systems and Information Management Group (DIMA)
Technische Universität Berlin
http://www.dima.tu-berlin.de/
An Introduction to Collaborative Filtering with Apache Mahout
Sebastian Schelter
Recommender Systems Challenge at ACM RecSys 2012
Overview
■ Apache Mahout: an Apache-licensed library that aims to provide highly scalable data mining and machine learning
■ its collaborative filtering module is based on Sean Owen's Taste framework
■ mostly aimed at production scenarios, with a focus on
□ processing efficiency
□ ease of integration with different datastores, web applications, Amazon EC2
□ scalability: recommendations, item similarities and matrix decompositions can be computed via MapReduce on Apache Hadoop
■ not used much in recommender challenges
□ not enough different algorithms implemented?
□ not enough tooling for evaluation?
→ it's open source, so it's up to you to change that!
Preference & DataModel
■ Preference encapsulates a user-item interaction as a (user, item, value) triple
□ only numeric userIDs and itemIDs are allowed, for memory efficiency
□ PreferenceArray encapsulates a set of preferences
■ DataModel encapsulates a dataset
□ lots of convenient accessor methods like getNumUsers(), getPreferencesForItem(itemID), ...
□ allows adding temporal information to preferences
□ lots of options for storing the data (in-memory, file, database, key-value store)
□ drawback: for many use cases, all the data has to fit into memory to allow efficient recommendation
DataModel dataModel = new FileDataModel(new File("movielens.csv"));
PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
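To illustrate why numeric IDs help with memory efficiency, here is a minimal plain-Java sketch that stores (user, item, value) triples in parallel primitive arrays. It does not use Mahout's classes: the class name PreferenceArraySketch is hypothetical, and the input is assumed to be comma-separated lines of the form userID,itemID,value.

```java
import java.util.List;

/** Hypothetical sketch, not Mahout's PreferenceArray: triples in primitive arrays. */
final class PreferenceArraySketch {
  final long[] userIDs;
  final long[] itemIDs;
  final float[] values;

  /** Assumes every line has the form "userID,itemID,value". */
  PreferenceArraySketch(List<String> csvLines) {
    int n = csvLines.size();
    userIDs = new long[n];
    itemIDs = new long[n];
    values = new float[n];
    for (int k = 0; k < n; k++) {
      String[] fields = csvLines.get(k).split(",");
      userIDs[k] = Long.parseLong(fields[0].trim());
      itemIDs[k] = Long.parseLong(fields[1].trim());
      values[k] = Float.parseFloat(fields[2].trim());
    }
  }

  /** Number of stored preferences. */
  int length() {
    return values.length;
  }
}
```

Primitive arrays avoid one object per preference, which matters when millions of interactions have to be held in memory.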
Recommender
■ Recommender is the basic interface for all of Mahout's recommenders
□ recommend n items for a particular user
□ estimate the preference of a user towards an item
■ a CandidateItemsStrategy fetches all items that might be recommended for a particular user
■ a Rescorer allows postprocessing of recommendations
List<RecommendedItem> topItems = recommender.recommend(1, 10);
float preference = recommender.estimatePreference(1, 25);
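The interplay of candidate items, preference estimation and top-n selection can be sketched in plain Java as follows. This is an illustration only, not Mahout's implementation: TopNSketch, ScoredItem and the estimator function are hypothetical names, and candidates whose estimate is NaN are simply skipped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.LongToDoubleFunction;

/** Hypothetical sketch of top-n selection over candidate items. */
final class TopNSketch {

  static final class ScoredItem {
    final long itemID;
    final double score;

    ScoredItem(long itemID, double score) {
      this.itemID = itemID;
      this.score = score;
    }
  }

  /** Returns the howMany candidates with the highest estimated preference, best first. */
  static List<ScoredItem> topN(long[] candidateItemIDs, LongToDoubleFunction estimator, int howMany) {
    // Min-heap on score: the weakest of the current top items sits at the head.
    PriorityQueue<ScoredItem> heap =
        new PriorityQueue<>(Comparator.comparingDouble((ScoredItem s) -> s.score));
    for (long itemID : candidateItemIDs) {
      double score = estimator.applyAsDouble(itemID);
      if (Double.isNaN(score)) {
        continue; // no estimate possible for this candidate
      }
      heap.add(new ScoredItem(itemID, score));
      if (heap.size() > howMany) {
        heap.poll(); // drop the weakest item
      }
    }
    List<ScoredItem> result = new ArrayList<>(heap);
    result.sort(Comparator.comparingDouble((ScoredItem s) -> s.score).reversed());
    return result;
  }
}
```

A rescoring step would fit between estimating the score and adding the item to the heap.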
Item-Based Collaborative Filtering
■ ItemBasedRecommender
□ can also compute item similarities
□ can provide preferences for items as justification for recommendations
■ lots of similarity measures available (Pearson correlation, Jaccard coefficient, ...)
■ also supports precomputed item similarities stored in a file (via FileItemSimilarity)
ItemBasedRecommender recommender =
    new GenericItemBasedRecommender(dataModel,
        new PearsonCorrelationSimilarity(dataModel));
List<RecommendedItem> similarItems =
    recommender.mostSimilarItems(5, 10);
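For intuition about two of the similarity measures named above, here is a minimal plain-Java sketch of Pearson correlation and the Jaccard coefficient between two items. It is not Mahout's implementation: the class name is hypothetical, each item is assumed to be given as a map from userID to rating (Pearson) or as the set of users who interacted with it (Jaccard).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of two item similarity measures. */
final class ItemSimilaritySketch {

  /** Pearson correlation over the users who rated both items; NaN if undefined. */
  static double pearson(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
    int n = 0;
    double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
    for (Map.Entry<Long, Double> entry : ratingsA.entrySet()) {
      Double b = ratingsB.get(entry.getKey());
      if (b == null) {
        continue; // user did not rate the second item
      }
      double a = entry.getValue();
      n++;
      sumA += a;
      sumB += b;
      sumAA += a * a;
      sumBB += b * b;
      sumAB += a * b;
    }
    if (n == 0) {
      return Double.NaN;
    }
    double covariance = sumAB - sumA * sumB / n;
    double varianceA = sumAA - sumA * sumA / n;
    double varianceB = sumBB - sumB * sumB / n;
    double denominator = Math.sqrt(varianceA * varianceB);
    return denominator == 0.0 ? Double.NaN : covariance / denominator;
  }

  /** Jaccard coefficient: shared users divided by all users of either item. */
  static double jaccard(Set<Long> usersA, Set<Long> usersB) {
    if (usersA.isEmpty() && usersB.isEmpty()) {
      return 0.0;
    }
    Set<Long> intersection = new HashSet<>(usersA);
    intersection.retainAll(usersB);
    int unionSize = usersA.size() + usersB.size() - intersection.size();
    return (double) intersection.size() / unionSize;
  }
}
```

Pearson uses the rating values, while Jaccard only looks at who interacted with an item, which makes it a natural fit for data without ratings.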
Latent factor models
■ SVDRecommender
□ uses a decomposition of the user-item interaction matrix to compute recommendations
■ uses a Factorizer to compute a Factorization from a DataModel; several implementations are available
□ Simon Funk's SGD
□ Alternating Least Squares
□ Weighted matrix factorization for implicit feedback data
Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures,
    lambda, numIterations);
Recommender svdRecommender =
    new SVDRecommender(dataModel, factorizer);
List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
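To show what a factorization computes, here is a minimal plain-Java sketch of matrix factorization trained with stochastic gradient descent, in the spirit of the SGD approach listed above. It is not one of Mahout's Factorizer implementations: the class name, the learning rate parameter and the use of dense 0-based user and item indices are assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical sketch: regularized matrix factorization trained with SGD. */
final class SgdFactorizationSketch {
  final double[][] userFactors;
  final double[][] itemFactors;

  SgdFactorizationSketch(int numUsers, int numItems, int numFeatures, long seed) {
    Random random = new Random(seed);
    userFactors = randomMatrix(numUsers, numFeatures, random);
    itemFactors = randomMatrix(numItems, numFeatures, random);
  }

  private static double[][] randomMatrix(int rows, int cols, Random random) {
    double[][] matrix = new double[rows][cols];
    for (double[] row : matrix) {
      for (int f = 0; f < cols; f++) {
        row[f] = 0.1 * random.nextGaussian(); // small random start values
      }
    }
    return matrix;
  }

  /** Estimated preference: dot product of the user and item factor vectors. */
  double predict(int user, int item) {
    double dot = 0;
    for (int f = 0; f < userFactors[user].length; f++) {
      dot += userFactors[user][f] * itemFactors[item][f];
    }
    return dot;
  }

  /** Each row of ratings is {userIndex, itemIndex, value}. */
  void train(double[][] ratings, int numIterations, double learningRate, double lambda) {
    for (int iteration = 0; iteration < numIterations; iteration++) {
      for (double[] rating : ratings) {
        int u = (int) rating[0];
        int i = (int) rating[1];
        double error = rating[2] - predict(u, i);
        for (int f = 0; f < userFactors[u].length; f++) {
          double pu = userFactors[u][f];
          double qi = itemFactors[i][f];
          // Gradient step on the squared error with L2 regularization.
          userFactors[u][f] += learningRate * (error * qi - lambda * pu);
          itemFactors[i][f] += learningRate * (error * pu - lambda * qi);
        }
      }
    }
  }

  /** Sum of squared prediction errors over the given ratings. */
  double squaredError(double[][] ratings) {
    double sum = 0;
    for (double[] rating : ratings) {
      double error = rating[2] - predict((int) rating[0], (int) rating[1]);
      sum += error * error;
    }
    return sum;
  }
}
```

Once trained, recommending for a user amounts to calling predict for the candidate items and keeping the highest scores.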
Evaluating recommenders
■ RecommenderEvaluator, RecommenderIRStatsEvaluator
□ measure the prediction quality of a recommender using a random split of the dataset
□ support for MAE, RMSE, precision, recall, ...
□ need a DataModel, a RecommenderBuilder and a DataModelBuilder for the training data
RecommenderEvaluator maeEvaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();
double mae = maeEvaluator.evaluate(
    new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
    new InteractionCutDataModelBuilder(maxPrefsPerUser),
    dataModel, trainingPercentage, 1 - trainingPercentage);
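The two error metrics mentioned above are easy to state in code. The following plain-Java sketch computes MAE and RMSE from arrays of actual and predicted preferences; it does not use Mahout's evaluator classes, and the class name is hypothetical.

```java
/** Hypothetical sketch of the MAE and RMSE error metrics. */
final class ErrorMetricsSketch {

  /** Mean absolute error: average of |actual - predicted|. */
  static double mae(double[] actual, double[] predicted) {
    checkLengths(actual, predicted);
    double sum = 0;
    for (int k = 0; k < actual.length; k++) {
      sum += Math.abs(actual[k] - predicted[k]);
    }
    return sum / actual.length;
  }

  /** Root mean squared error: square root of the average squared difference. */
  static double rmse(double[] actual, double[] predicted) {
    checkLengths(actual, predicted);
    double sum = 0;
    for (int k = 0; k < actual.length; k++) {
      double difference = actual[k] - predicted[k];
      sum += difference * difference;
    }
    return Math.sqrt(sum / actual.length);
  }

  private static void checkLengths(double[] actual, double[] predicted) {
    if (actual.length == 0 || actual.length != predicted.length) {
      throw new IllegalArgumentException("arrays must be non-empty and of equal length");
    }
  }
}
```

RMSE penalizes large errors more strongly than MAE, so the two can rank recommenders differently.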
Starting to work on Mahout
■ Prerequisites
□ Java 6
□ Maven
□ svn client
■ check out the source code from http://svn.apache.org/repos/asf/mahout/trunk
■ import it as a Maven project into your favorite IDE
Project: novel item similarity measure
■ in the Million Song Dataset Challenge, the winning solution used a novel item similarity measure
■ it would be great to see this measure featured in Mahout as well
■ Task
□ implement the novel item similarity measure as an implementation of Mahout's ItemSimilarity
■ Future Work
□ this novel similarity measure is asymmetric; ensure that it is applied correctly in all scenarios
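The slide does not name the measure. Assuming it refers to the asymmetric cosine similarity described for the winning Million Song Dataset Challenge entry, a minimal plain-Java sketch could look like this. The class name, the set-based input and the alpha parameter are assumptions for illustration, and the code does not implement Mahout's ItemSimilarity.

```java
import java.util.Set;

/** Hypothetical sketch of an asymmetric cosine similarity between two items. */
final class AsymmetricCosineSketch {

  /**
   * |U(i) ∩ U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha)), where U(x) is the set of
   * users who interacted with item x. alpha = 0.5 gives the ordinary cosine on sets.
   */
  static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
    if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
      return 0.0;
    }
    int common = 0;
    for (Long user : usersOfI) {
      if (usersOfJ.contains(user)) {
        common++;
      }
    }
    return common / (Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha));
  }
}
```

For any alpha other than 0.5, similarity(i, j) and similarity(j, i) generally differ, which is why code that assumes a symmetric measure needs to be checked.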
Project: temporal split evaluator
■ currently, Mahout's standard RecommenderEvaluator randomly splits the data into training and test sets
■ for datasets with timestamps, it would be much more interesting to use this temporal information to split the data into training and test sets
■ Task
□ create a TemporalSplitRecommenderEvaluator similar to the existing AbstractDifferenceRecommenderEvaluator
■ Future Work
□ factor out the logic for splitting datasets into training and test sets
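One possible shape of the splitting logic, sketched in plain Java: sort all preferences by timestamp and use the oldest share for training and the rest for testing. This is only one way to define a temporal split (a per-user cut would be another), and TemporalSplitSketch, TimedPreference and Split are hypothetical names, not Mahout classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a global temporal training/test split. */
final class TemporalSplitSketch {

  static final class TimedPreference {
    final long userID;
    final long itemID;
    final float value;
    final long timestamp;

    TimedPreference(long userID, long itemID, float value, long timestamp) {
      this.userID = userID;
      this.itemID = itemID;
      this.value = value;
      this.timestamp = timestamp;
    }
  }

  static final class Split {
    final List<TimedPreference> training;
    final List<TimedPreference> test;

    Split(List<TimedPreference> training, List<TimedPreference> test) {
      this.training = training;
      this.test = test;
    }
  }

  /** The oldest trainingPercentage share of the preferences is used for training. */
  static Split split(List<TimedPreference> preferences, double trainingPercentage) {
    List<TimedPreference> sorted = new ArrayList<>(preferences);
    sorted.sort(Comparator.comparingLong((TimedPreference p) -> p.timestamp));
    int cut = (int) Math.round(sorted.size() * trainingPercentage);
    return new Split(
        new ArrayList<>(sorted.subList(0, cut)),
        new ArrayList<>(sorted.subList(cut, sorted.size())));
  }
}
```

With this split, every test preference is at least as recent as every training preference, so the evaluation never predicts the past from the future.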
Project: baseline method for rating prediction
■ port MyMediaLite's UserItemBaseline to Mahout (preliminary port already available)
■ user-item baseline estimation is a simple approach that estimates the global tendency of a user or an item to deviate from the average rating (described in Y. Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD 2009)
■ Task
□ polish the code
□ make it work with Mahout's DataModel
■ Future Work
□ create an ItemBasedRecommender that makes use of the estimated biases
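The baseline estimate has the form global mean + user bias + item bias. The following plain-Java sketch estimates the biases with regularized averages of residuals, alternating between items and users. It is neither MyMediaLite's code nor the preliminary port: the class name and the parallel-array input are hypothetical, and lambda2 (items), lambda3 (users) and numIterations are assumed to play the roles their names suggest.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of user-item baseline estimation. */
final class UserItemBaselineSketch {
  private final Map<Long, Double> userBias = new HashMap<>();
  private final Map<Long, Double> itemBias = new HashMap<>();
  private final double globalMean;

  /** users[k], items[k], values[k] together describe the k-th rating. */
  UserItemBaselineSketch(long[] users, long[] items, double[] values,
                         double lambda2, double lambda3, int numIterations) {
    double sum = 0;
    for (double value : values) {
      sum += value;
    }
    globalMean = values.length == 0 ? 0.0 : sum / values.length;
    for (int iteration = 0; iteration < numIterations; iteration++) {
      updateBias(items, users, values, itemBias, userBias, lambda2);
      updateBias(users, items, values, userBias, itemBias, lambda3);
    }
  }

  /** bias[id] = sum of (rating - mean - other side's bias) / (lambda + number of ratings). */
  private void updateBias(long[] ids, long[] otherIds, double[] values,
                          Map<Long, Double> bias, Map<Long, Double> otherBias, double lambda) {
    Map<Long, Double> residualSum = new HashMap<>();
    Map<Long, Integer> count = new HashMap<>();
    for (int k = 0; k < values.length; k++) {
      double residual = values[k] - globalMean - otherBias.getOrDefault(otherIds[k], 0.0);
      residualSum.merge(ids[k], residual, Double::sum);
      count.merge(ids[k], 1, Integer::sum);
    }
    for (Map.Entry<Long, Double> entry : residualSum.entrySet()) {
      bias.put(entry.getKey(), entry.getValue() / (lambda + count.get(entry.getKey())));
    }
  }

  /** Unknown users or items contribute a bias of zero. */
  double estimate(long userID, long itemID) {
    return globalMean + userBias.getOrDefault(userID, 0.0) + itemBias.getOrDefault(itemID, 0.0);
  }
}
```

Larger lambda values shrink the biases of users and items with few ratings towards zero, so their estimates stay close to the global mean.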
Thank you.
Questions?
Sebastian Schelter
Database Systems and Information Management Group (DIMA)
Technische Universität Berlin