
13.09.2012 DIMA – TU Berlin

Database Systems and Information Management Group (DIMA)

Technische Universität Berlin

http://www.dima.tu-berlin.de/

An Introduction to Collaborative Filtering with Apache Mahout

Sebastian Schelter

Recommender Systems Challenge at ACM RecSys 2012


Overview

■ Apache Mahout: Apache-licensed library that aims to provide highly scalable data mining and machine learning

■ its collaborative filtering module is based on Sean Owen’s Taste framework

■ mostly aimed at production scenarios, with a focus on

□ processing efficiency

□ easy integration with different datastores, web applications, Amazon EC2

□ scalability: recommendations, item similarities and matrix decompositions can be computed via MapReduce on Apache Hadoop

■ not used much in recommender challenges

□ not enough different algorithms implemented?

□ not enough tooling for evaluation?

→ it’s open source, so it’s up to you to change that!


Preference & DataModel

■ Preference encapsulates a user-item interaction as a (user,item,value) triple

□ only numeric userIDs and itemIDs are allowed, for memory efficiency

□ PreferenceArray encapsulates a set of preferences

■ DataModel encapsulates a dataset

□ lots of convenient accessor methods like getNumUsers(), getPreferencesForItem(itemID), ...

□ allows adding temporal information to preferences

□ lots of options to store the data (in-memory, file, database, key-value store)

□ drawback: for a lot of use cases, all the data has to fit into memory to allow efficient recommendation

// load the (user,item,value) triples from a CSV file
DataModel dataModel = new FileDataModel(new File("movielens.csv"));

// all preferences of the user with ID 1
PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
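The memory argument behind numeric IDs can be illustrated with a minimal sketch that keeps one user's preferences in parallel primitive arrays instead of allocating one object per preference. The class `CompactPrefs` and its methods are hypothetical names for illustration, not Mahout's PreferenceArray implementation.

```java
// Hypothetical illustration, not Mahout code: one user's preferences stored in
// parallel primitive arrays, so no object is allocated per (user,item,value) triple.
class CompactPrefs {
    private final long userID;
    private final long[] itemIDs;
    private final float[] values;

    CompactPrefs(long userID, long[] itemIDs, float[] values) {
        if (itemIDs.length != values.length) {
            throw new IllegalArgumentException("itemIDs and values must have equal length");
        }
        this.userID = userID;
        this.itemIDs = itemIDs.clone();
        this.values = values.clone();
    }

    long getUserID() { return userID; }

    int length() { return itemIDs.length; }

    long getItemID(int i) { return itemIDs[i]; }

    float getValue(int i) { return values[i]; }

    // Returns the stored value for itemID, or NaN if this user has no preference for it.
    float valueFor(long itemID) {
        for (int i = 0; i < itemIDs.length; i++) {
            if (itemIDs[i] == itemID) {
                return values[i];
            }
        }
        return Float.NaN;
    }

    public static void main(String[] args) {
        CompactPrefs prefs =
            new CompactPrefs(1L, new long[] {10L, 25L}, new float[] {4.0f, 2.5f});
        System.out.println("user " + prefs.getUserID() + " rated item 25 with " + prefs.valueFor(25L));
    }
}
```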


Recommender

■ Recommender is the basic interface for all of Mahout’s recommenders

□ recommend n items for a particular user

□ estimate the preference of a user towards an item

■ a CandidateItemsStrategy fetches all items that might be recommended for a particular user

■ a Rescorer allows postprocessing recommendations

// the 10 best recommendations for user 1
List<RecommendedItem> topItems = recommender.recommend(1, 10);

// estimated preference of user 1 towards item 25
float preference = recommender.estimatePreference(1, 25);
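Conceptually, recommending n items amounts to scoring each candidate item with an estimated preference and keeping the best n. The following is a rough sketch of that idea only; `TopNSketch`, `topN` and the toy estimator are hypothetical and do not reproduce how Mahout implements recommend().

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.LongToDoubleFunction;

// Hypothetical sketch, not Mahout's implementation.
class TopNSketch {

    // Scores every candidate item with the estimator and returns the IDs of the
    // n best-scoring items, highest estimate first. Assumes n >= 0.
    static List<Long> topN(long[] candidateItemIDs, LongToDoubleFunction estimator, int n) {
        List<Long> items = new ArrayList<>();
        for (long id : candidateItemIDs) {
            items.add(id);
        }
        Comparator<Long> byEstimate =
            Comparator.comparingDouble((Long id) -> estimator.applyAsDouble(id));
        items.sort(byEstimate.reversed());
        return new ArrayList<>(items.subList(0, Math.min(n, items.size())));
    }

    public static void main(String[] args) {
        // toy estimator: pretends the estimated preference is itemID modulo 7
        List<Long> top = topN(new long[] {3L, 12L, 20L, 8L}, id -> id % 7, 2);
        System.out.println(top);
    }
}
```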


Item-Based Collaborative Filtering

■ ItemBasedRecommender

□ can also compute item similarities

□ can provide preferences for items as justification for recommendations

■ lots of similarity measures available (Pearson correlation, Jaccard coefficient, ...)

■ also allows usage of precomputed item similarities stored in a file (via FileItemSimilarity)

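// item-based recommender using Pearson correlation between items
ItemBasedRecommender recommender =
    new GenericItemBasedRecommender(dataModel,
        new PearsonCorrelationSimilarity(dataModel));

// the 10 items most similar to item 5
List<RecommendedItem> similarItems =
    recommender.mostSimilarItems(5, 10);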
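As an illustration of what such a similarity measure computes, here is a minimal sketch of the Pearson correlation between two items, taken over the users who rated both. `PearsonSketch` is a hypothetical stand-alone class; Mahout's PearsonCorrelationSimilarity may handle weighting and edge cases differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch, not Mahout's PearsonCorrelationSimilarity.
class PearsonSketch {

    // Pearson correlation between two items over the users who rated both.
    // Each map goes from userID to that user's rating of the item.
    // Returns NaN if fewer than two users overlap or if one item has zero variance.
    static double pearson(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : ratingsA.entrySet()) {
            Double other = ratingsB.get(e.getKey());
            if (other == null) {
                continue; // this user did not rate the second item
            }
            double a = e.getValue();
            double b = other;
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        Map<Long, Double> item5 = new HashMap<>();
        item5.put(1L, 1.0);
        item5.put(2L, 2.0);
        item5.put(3L, 3.0);
        Map<Long, Double> item7 = new HashMap<>();
        item7.put(1L, 2.0);
        item7.put(2L, 4.0);
        item7.put(3L, 6.0);
        System.out.println(pearson(item5, item7));
    }
}
```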


Latent factor models

■ SVDRecommender

□ uses a decomposition of the user-item interaction matrix to compute recommendations

■ uses a Factorizer to compute a Factorization from a DataModel; several different implementations are available

□ Simon Funk’s SGD

□ Alternating Least Squares

□ Weighted matrix factorization for implicit feedback data

Factorizer factorizer =
    new ALSWRFactorizer(dataModel, numFeatures, lambda, numIterations);

Recommender svdRecommender =
    new SVDRecommender(dataModel, factorizer);

List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
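In a latent factor model, the estimated preference is the dot product of a user's and an item's feature vector. The sketch below shows that prediction plus one regularized stochastic gradient step in the style of Simon Funk's SGD. `FactorizationSketch`, `learningRate` and the starting values are assumptions made for illustration, not Mahout's Factorizer code.

```java
// Hypothetical sketch of a latent factor model, not Mahout's Factorizer code.
class FactorizationSketch {

    // Estimated preference: dot product of user and item feature vectors.
    static double predict(double[] userFeatures, double[] itemFeatures) {
        double sum = 0.0;
        for (int f = 0; f < userFeatures.length; f++) {
            sum += userFeatures[f] * itemFeatures[f];
        }
        return sum;
    }

    // One stochastic gradient step on a single observed rating, updating both
    // vectors in place. lambda is the regularization weight.
    static void sgdStep(double[] userFeatures, double[] itemFeatures,
                        double rating, double learningRate, double lambda) {
        double error = rating - predict(userFeatures, itemFeatures);
        for (int f = 0; f < userFeatures.length; f++) {
            double u = userFeatures[f];
            double i = itemFeatures[f];
            userFeatures[f] += learningRate * (error * i - lambda * u);
            itemFeatures[f] += learningRate * (error * u - lambda * i);
        }
    }

    public static void main(String[] args) {
        double[] user = {0.1, 0.1};
        double[] item = {0.1, 0.1};
        for (int step = 0; step < 100; step++) {
            sgdStep(user, item, 4.0, 0.05, 0.01);
        }
        System.out.println("estimate after training: " + predict(user, item));
    }
}
```

A real factorizer repeats such steps over all observed ratings for several iterations; the single-rating loop in main only shows that the estimate moves towards the observed value.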


Evaluating recommenders

■ RecommenderEvaluator, RecommenderIRStatsEvaluator

□ measure the prediction quality of a recommender using a random split of the dataset

□ support for MAE, RMSE, Precision, Recall, ...

□ need a DataModel, a RecommenderBuilder, a DataModelBuilder for the training data

RecommenderEvaluator maeEvaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();

maeEvaluator.evaluate(
    new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
    new InteractionCutDataModelBuilder(maxPrefsPerUser),
    dataModel, trainingPercentage, 1 - trainingPercentage);
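For reference, the two rating-prediction metrics mentioned above can be written down in a few lines. This is a stand-alone sketch over arrays of held-out ratings and their predictions; `ErrorMetricsSketch` is a hypothetical helper, not part of Mahout's evaluator classes.

```java
// Hypothetical helper, not part of Mahout: MAE and RMSE over held-out ratings.
class ErrorMetricsSketch {

    // Mean absolute error: average of |actual - predicted|.
    static double mae(double[] actual, double[] predicted) {
        checkLengths(actual, predicted);
        double sum = 0.0;
        for (int k = 0; k < actual.length; k++) {
            sum += Math.abs(actual[k] - predicted[k]);
        }
        return sum / actual.length;
    }

    // Root mean squared error: square root of the average squared difference.
    static double rmse(double[] actual, double[] predicted) {
        checkLengths(actual, predicted);
        double sum = 0.0;
        for (int k = 0; k < actual.length; k++) {
            double diff = actual[k] - predicted[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    private static void checkLengths(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};
        double[] predicted = {3.0, 3.0, 3.0};
        System.out.println("MAE  = " + mae(actual, predicted));
        System.out.println("RMSE = " + rmse(actual, predicted));
    }
}
```

RMSE penalizes large errors more strongly than MAE, which is why the two can rank recommenders differently.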



Starting to work on Mahout

■ Prerequisites

□ Java 6

□ Maven

□ svn client

■ check out the source code from

http://svn.apache.org/repos/asf/mahout/trunk

■ import it as a Maven project into your favorite IDE



Project: novel item similarity measure

■ in the Million Song DataSet Challenge, a novel item similarity measure was used in the winning solution

■ would be great to see this one also featured in Mahout

■ Task

□ implement the novel item similarity measure as an implementation of Mahout’s ItemSimilarity interface

■ Future Work

□ this novel similarity measure is asymmetric; ensure that it is applied correctly in all scenarios
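The slide does not state the formula of the measure. To show what asymmetry means in practice, the following sketch uses an asymmetric cosine on binary feedback, sim(i, j) = |U(i) ∩ U(j)| / (|U(i)|^alpha · |U(j)|^(1 - alpha)), where U(i) is the set of users who interacted with item i. That this is the measure meant here is an assumption, and `AsymmetricCosineSketch` and `alpha` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the exact measure used in the challenge is an assumption.
class AsymmetricCosineSketch {

    // sim(i, j) = |U(i) ∩ U(j)| / (|U(i)|^alpha * |U(j)|^(1 - alpha)).
    // alpha = 0.5 gives the ordinary (symmetric) cosine on binary data;
    // alpha = 1 gives the fraction of i's users who also interacted with j.
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0;
        }
        int overlap = 0;
        for (Long user : usersOfI) {
            if (usersOfJ.contains(user)) {
                overlap++;
            }
        }
        double norm = Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha);
        return overlap / norm;
    }

    public static void main(String[] args) {
        Set<Long> popular = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> niche = new HashSet<>(Arrays.asList(3L, 4L));
        // the two directions differ, so callers must not assume sim(i, j) == sim(j, i)
        System.out.println(similarity(popular, niche, 1.0));
        System.out.println(similarity(niche, popular, 1.0));
    }
}
```

Code that caches or looks up similarities by unordered item pair would silently break with such a measure, which is the concern raised under Future Work.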


Project: temporal split evaluator

■ currently Mahout‘s standard RecommenderEvaluator randomly splits the data into training and test set

■ for datasets with timestamps it would be much more interesting to use this temporal information to split the data into training and test set

■ Task

□ create a TemporalSplitRecommenderEvaluator similar to the existing AbstractDifferenceRecommenderEvaluator

■ Future Work

□ factor out the logic for splitting datasets into training and test set
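The core of a temporal split can be sketched independently of Mahout: sort the interactions by timestamp, train on the oldest part and test on the most recent part. The sketch below splits globally across all users, which is only one possible design (a per-user split is another). `TemporalSplitSketch` and `Interaction` are hypothetical names, not existing Mahout classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a global temporal split, not an existing Mahout class.
class TemporalSplitSketch {

    // One timestamped (user,item,value) interaction.
    static final class Interaction {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;

        Interaction(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Returns two lists: index 0 holds the oldest trainingPercentage of the
    // interactions (training set), index 1 holds the remaining newer ones (test set).
    static List<List<Interaction>> split(List<Interaction> interactions, double trainingPercentage) {
        if (trainingPercentage < 0.0 || trainingPercentage > 1.0) {
            throw new IllegalArgumentException("trainingPercentage must be in [0,1]");
        }
        List<Interaction> sorted = new ArrayList<>(interactions);
        sorted.sort(Comparator.comparingLong((Interaction x) -> x.timestamp));
        int cut = (int) Math.round(sorted.size() * trainingPercentage);
        List<List<Interaction>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(sorted.subList(0, cut)));
        parts.add(new ArrayList<>(sorted.subList(cut, sorted.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Interaction> data = Arrays.asList(
            new Interaction(1L, 10L, 4.0f, 40L),
            new Interaction(1L, 11L, 3.0f, 10L),
            new Interaction(2L, 10L, 5.0f, 30L),
            new Interaction(2L, 12L, 2.0f, 20L));
        List<List<Interaction>> parts = split(data, 0.75);
        System.out.println("training: " + parts.get(0).size() + ", test: " + parts.get(1).size());
    }
}
```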


Project: baseline method for rating prediction

■ port MyMediaLite’s UserItemBaseline to Mahout (a preliminary port is already available)

■ user-item-baseline estimation is a simple approach that estimates the global tendency of a user or an item to deviate from the average rating (described in Y. Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD 2009)

■ Task

□ polish the code

□ make it work with Mahout’s DataModel

■ Future Work

□ create an ItemBasedRecommender that makes use of the estimated biases
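The baseline estimate has the form b_ui = mu + b_u + b_i, with mu the average rating and b_u, b_i the user and item biases. A simple way to fit it, sketched below, computes the item biases as regularized average deviations from mu and then the user biases from what remains. `BaselineSketch` and its parallel-array input are hypothetical, `lambda2` and `lambda3` are taken to be the regularization weights for items and users, and the code is neither MyMediaLite's nor the preliminary port mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a user-item baseline, not the MyMediaLite or Mahout code.
class BaselineSketch {
    private final double mu;
    private final Map<Long, Double> itemBias = new HashMap<>();
    private final Map<Long, Double> userBias = new HashMap<>();

    // Rating k is (users[k], items[k], values[k]).
    BaselineSketch(long[] users, long[] items, double[] values, double lambda2, double lambda3) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        mu = values.length == 0 ? 0.0 : total / values.length;

        // item bias: b_i = sum(r_ui - mu) / (lambda2 + number of ratings of i)
        Map<Long, Double> sums = new HashMap<>();
        Map<Long, Integer> counts = new HashMap<>();
        for (int k = 0; k < values.length; k++) {
            sums.merge(items[k], values[k] - mu, Double::sum);
            counts.merge(items[k], 1, Integer::sum);
        }
        for (Map.Entry<Long, Double> e : sums.entrySet()) {
            itemBias.put(e.getKey(), e.getValue() / (lambda2 + counts.get(e.getKey())));
        }

        // user bias: b_u = sum(r_ui - mu - b_i) / (lambda3 + number of ratings by u)
        sums.clear();
        counts.clear();
        for (int k = 0; k < values.length; k++) {
            sums.merge(users[k], values[k] - mu - itemBias.get(items[k]), Double::sum);
            counts.merge(users[k], 1, Integer::sum);
        }
        for (Map.Entry<Long, Double> e : sums.entrySet()) {
            userBias.put(e.getKey(), e.getValue() / (lambda3 + counts.get(e.getKey())));
        }
    }

    // Baseline estimate mu + b_u + b_i; unknown users or items contribute a bias of 0.
    double estimate(long userID, long itemID) {
        return mu + userBias.getOrDefault(userID, 0.0) + itemBias.getOrDefault(itemID, 0.0);
    }

    public static void main(String[] args) {
        BaselineSketch baseline = new BaselineSketch(
            new long[] {1L, 1L, 2L, 2L},
            new long[] {10L, 20L, 10L, 20L},
            new double[] {5.0, 3.0, 4.0, 2.0},
            0.0, 0.0);
        System.out.println(baseline.estimate(1L, 10L));
    }
}
```

Larger values of lambda2 and lambda3 shrink the biases of items and users with few ratings towards zero, so their estimates stay close to the global average.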


Thank you.

Questions?

Sebastian Schelter

Database Systems and Information Management Group (DIMA)

Technische Universität Berlin