
Probabilistic Memory-Based Collaborative Filtering

Kai Yu, Anton Schwaighofer, Volker Tresp, Xiaowei Xu, and Hans-Peter Kriegel

Abstract—Memory-based collaborative filtering (CF) has been studied extensively in the literature and has proven to be successful in various types of personalized recommender systems. In this paper, we develop a probabilistic framework for memory-based CF (PMCF). While this framework has clear links with classical memory-based CF, it allows us to find principled solutions to known problems of CF-based recommender systems. In particular, we show that a probabilistic active learning method can be used to actively query the user, thereby solving the “new user problem.” Furthermore, the probabilistic framework allows us to reduce the computational cost of memory-based CF by working on a carefully selected subset of user profiles, while retaining high accuracy. We report experimental results based on two real-world data sets, which demonstrate that our proposed PMCF framework allows an accurate and efficient prediction of user preferences.

Index Terms—Collaborative filtering, recommender systems, profile density model, active learning, data sampling.

1 INTRODUCTION

INFORMATION on the Web has been growing explosively in recent years. Information filters emerged to meet the challenge of information searching on the WWW, a problem which may be compared to “locating needles in a haystack that is growing exponentially” [1]. Recommender systems are a class of information filters which have proven to be successful. For example, recommender systems on e-commerce Web sites assist users in finding their favorite CDs or books. Similarly, recommender systems assist in locating items like Web pages, news, jokes, or movies, from thousands or even millions of items.

Content-based filtering (CBF) and collaborative filtering (CF) are two technologies used in recommender systems. CBF systems analyze the contents of a set of items together with the ratings provided by individual users to infer which nonrated items might be of interest for a specific user. Examples include [2], [3], [4]. In contrast, collaborative filtering methods [5], [6], [1] typically accumulate a database of item ratings cast by a large set of users and then use those ratings to predict a query user’s preferences for unseen items. Collaborative filtering does not rely on the content descriptions of items, but purely depends on preferences expressed by a set of users. These preferences can either be expressed explicitly by numeric ratings or can be indicated implicitly by user behaviors, such as clicking on a hyperlink, purchasing a book, or reading a particular news article.

One major difficulty in designing CBF systems lies in the problem of formalizing human perception and preferences. Why one user likes or dislikes a joke, or prefers one CD over another, is virtually impossible to formalize. Similarly, it is difficult to derive features which represent the difference between an average news article and one of high quality. CF provides a powerful way to overcome these difficulties. The information on personal preferences, tastes, and quality is all carried in (explicit or implicit) user ratings.

CF-based recommender systems have successfully been applied in areas ranging from e-commerce (for example, Amazon and CDnow¹) to computer-supported collaborative work [7]. CF research projects include Grouplens (the first automatic CF algorithm, [5]), Ringo [6], Video Recommender [8], Movielens [9], and Jester [10].

1.1 Collaborative Filtering Algorithms

A variety of CF algorithms have been proposed in the last decade. One can identify two major classes of CF algorithms [11]: memory-based approaches and model-based approaches.

Memory-based CF can be motivated from the observation that people usually trust the recommendations from like-minded friends. These methods apply a nearest-neighbor-like scheme to predict a user’s ratings based on the ratings given by like-minded users. The first CF systems, Grouplens [5] and Ringo [6], fall into this category. In the literature, the term collaborative filtering is sometimes used to refer only to the memory-based methods.

In contrast, model-based CF first learns a descriptive model of user preferences and then uses it for predicting ratings. Many of these methods are inspired by machine learning algorithms. Examples include neural network classifiers [1], induction rule learning [12], linear classifiers [13], Bayesian networks [11], dependency networks [14],

56 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004

. K. Yu and V. Tresp are with Siemens Corporate Technology, Information and Communications, CT IC4, 81730 Munich, Germany. E-mail: [email protected] and [email protected].

. A. Schwaighofer is with the Institute for Theoretical Computer Science, Graz University of Technology, Inffeldgasse 16b, 8010 Graz, Austria. E-mail: [email protected].

. X. Xu is with the Information Science Department, University of Arkansas at Little Rock, 2801 S. University Ave., Little Rock, AR 72227. E-mail: [email protected].

. H.-P. Kriegel is with the Institute for Computer Science, University of Munich, Oettingenstraße 67, 80538 Munich, Germany. E-mail: [email protected].

Manuscript received 1 Sept. 2002; revised 1 Apr. 2003; accepted 10 Apr. 2003. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 118555.

1. www.amazon.com and www.cdnow.com.

1041-4347/04/$17.00 © 2004 IEEE. Published by the IEEE Computer Society.


latent class models or mixture models [15], [16], item-based CF [17], principal component analysis-based CF [10], association rule mining [18], and hybrids of model- and memory-based approaches [19].

1.2 Motivation

Up to now, research on CF has primarily focused on exploring various learning methods, hoping to improve the prediction accuracy of recommender systems. Other important aspects, like scalability, accommodating new data, and comprehensibility, have received little attention. In the following, we will review five general issues which are important for CF and greatly motivated the work presented in this paper.

1. Accuracy. As a central issue in CF research, prediction accuracy has received a high degree of attention, and various methods have been proposed for improvement. Still, conventional memory-based methods using the Pearson correlation coefficient remain among the most successful methods in terms of accuracy. The experiments presented in Section 5.4 show that our proposed probabilistic interpretation of memory-based CF can outperform a set of other memory- and model-based CF approaches.

2. Interactive Learning of User Profiles. A recommender system cannot provide accurate service to a new user, whose preferences are initially unknown. This has been referred to as the “new user problem” [2], [20], [21]. Before being able to make predictions, a CF system typically requires the new user to rate a list of query items in an initial information gathering stage. Efficient heuristics [21] are essential to select informative query items and thus keep the information gathering stage as short as possible, since users may easily lose patience when faced with a long list of query items.

Within our proposed probabilistic framework for CF, we show in Section 3 how informative query items can be selected in a principled way. At each information gathering step, those query items are presented to the user which are expected to maximally sharpen the user’s profile. Our experiments (see Section 5.5) confirm that this interactive approach outperforms other ways of selecting query items [21], both in terms of necessary user effort and achieved accuracy of predictions.

3. Efficiency. Memory-based CF often suffers from slow response time because each single prediction requires scanning a whole database of user ratings. This is a clear disadvantage when compared to the typically very fast responses of model-based CF. In the proposed probabilistic memory-based CF approach, predictions are generated from a carefully selected small subset of the overall database of user ratings, which we call the profile space. As a consequence, predictions can be made much faster than in a classical memory-based CF system. Still, the accuracy of a system using the full data set can be maintained. We will describe this process of data selection in Section 4. The results presented in Section 5.6 confirm that the constructed profile space does indeed allow both an accurate and fast prediction of user ratings.

4. Incrementally Accommodating New Data. Recommender systems must be capable of handling new data, be it new users or new items. For example, a music recommender system must be able to adapt itself to newly arising styles of music and thus to new preference patterns. This suggests that the training process of any underlying CF algorithm should be incremental. However, model-based CF approaches are typically trained using batch algorithms, and, to our knowledge, little work has addressed the use of online learning in CF. Thus, retraining a model with new data can become quite expensive, in particular if it needs to be performed regularly [11]. In contrast, memory-based CF can easily accommodate new data by simply storing it. In the proposed probabilistic memory-based CF framework, this goal can be achieved by a straightforward extension of the data selection procedure introduced in Section 4.

5. Comprehensibility. The results in [22] indicate that allowing users to know more about the result-generating process can help them understand the strengths and weaknesses of CF systems. With this knowledge, users can make low-risk decisions. For example, consider the following two cases: 1) Among Julia’s like-minded users, 50 percent rated “like” for Titanic, while 50 percent rated “dislike.” 2) In the other case, most of her neighbors give neutral ratings to that movie. A traditional CF system may only give a neutral rating in both cases. A more sophisticated system may remind Julia of the underlying reasons in the first case and, for example, output an estimated distribution of a user’s rating for some item, either in graphical or textual form (“I guess you will like that movie and I am pretty sure (or very unsure) about that”). This suggests that a probabilistic CF approach, as presented in this paper, can improve the comprehensibility and, thus, the acceptance of a CF system. Furthermore, memory-based CF has a clear interpretation that can be easily conveyed to users, such as “You seem to be sharing opinions with user A, who liked the following items...”.

1.3 Overview of Our Approach

In this paper, we introduce probabilistic memory-based collaborative filtering (PMCF), a probabilistic framework for CF systems that is similar in spirit to the classical memory-based CF approach. A schematic drawing of the components of PMCF is shown in Fig. 1.

As the basic ingredient, we present a probabilistic model for user preferences in Section 2. We use a mixture model built on the basis of a set of stored user profiles; thus, the model clearly links with memory-based CF methods.

Various heuristics to improve memory-based CF have been proposed in the literature. In contrast, extensions to PMCF can be derived in a principled probabilistic way. We argue that this is one of the major advantages of PMCF. We use PMCF to derive solutions for two particularly important problems in CF.



The first one concerns the new user problem. An active learning extension to the PMCF system can actively query a user for additional information in case the available information is insufficient.

The second major extension aims at reducing the computational burden in the prediction phase typically associated with memory-based CF. PMCF allows us to select a small subset, called the profile space, from a (possibly huge) database of user ratings. The selection procedure is derived directly from the probabilistic framework and ensures that the small profile space leads to predictions that are as accurate as predictions made by using the whole database of user ratings.

1.4 Structure of this Article

This paper is organized as follows: In Section 2, we describe the framework of probabilistic memory-based CF (PMCF). In Section 3, we present an active learning extension of PMCF to gather information about a new user in a particularly efficient way that requires a minimum of user interaction. In Section 4, we show how to construct the profile space for the PMCF model, which is a small subset of the available user rating data. We present experimental results that demonstrate the effectiveness of PMCF, the active learning extension, and the profile space construction in Section 5. We end the paper with conclusions and an outlook in Section 6.

2 PROBABILISTIC MEMORY-BASED CF

In this section, a general probabilistic memory-based CF (PMCF) approach is introduced. Probabilistic CF has been a vivid research topic. Examples include Bayesian networks [11], dependency networks [14], latent class models or mixture models [15], [16], and hybrids of memory- and model-based systems [19]. The work presented here has been inspired by [19], in that we also aim at connecting memory- and model-based CF in a probabilistic way. While [19] mainly focuses on making predictions, we use the probabilistic model for further extensions of the CF system, which will be described in Section 3 and Section 4.

2.1 Notation

Suppose that we have gathered $K$ users’ ratings on a given item set $I$ of size $M = |I|$. Let $x_{i,j} \in \mathbb{R}$ be the rating of user $i$ on item $j$, and let $D$ with $(D)_{i,j} = x_{i,j}$ be the $K \times M$ matrix of all ratings. $R_i \subseteq I$ is the set of items for which user $i$ has actually given ratings. If an item has not been rated, we set $x_{i,j}$ to a neutral rating $n_i$, which we will define later. We denote by $\mathbf{x}_i$ the vector of all ratings of user $i$. In the following text, user $i$’s ratings $\mathbf{x}_i$ are often referred to as user $i$’s profile. We also maintain a smaller set of user profiles, the profile space $\mathcal{P}$, which consists of a subset of rows of $D$. Without loss of generality, we assume that the profile space is built up² from the ratings of the first $N$ users, i.e., the first $N$ rows of $D$, where typically $N \ll K$.

In CF terminology, the active user is the user that queries the CF system for recommendations on some items. We denote the active user’s ratings by $\mathbf{a}$. By $\mathbf{a}^r$ we denote the ratings the active user has already provided (for items $j \in R_a$), and $\mathbf{a}^n$ are the yet unknown ratings. The total rating vector $\mathbf{a}$ is thus the union of $\mathbf{a}^r$ and $\mathbf{a}^n$.

As mentioned above, we use a neutral rating $n_i$ for all items a user $i$ has not given an explicit rating, i.e., $x_{i,j} = n_i$ if $j \notin R_i$. In order to compute $n_i$, we assume a Gaussian prior for the neutral rating with mean $m_0$, which is estimated as the overall mean of user ratings. If we further assume that $n_i$ is also Gaussian distributed with mean $m_0$, we can estimate the neutral rating as

$$n_i = \frac{\sum_{j \in R_i} x_{i,j} + C\,m_0}{|R_i| + C}, \qquad (1)$$

where $C$ is the ratio of the variance of the ratings for user $i$ and the variance of $m_0$. We determined a suitable value for $C$ based on cross-validation experiments. We found $C = 9$ to work effectively on the data we consider.
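Eq. (1) is a simple shrinkage estimate. As a rough illustration, it can be computed as follows (a minimal NumPy sketch; the array layout and function name are our own choices, not from the paper):

```python
import numpy as np

def neutral_rating(ratings_i, rated_mask, m0, C=9.0):
    """Eq. (1): shrink user i's mean rating toward the overall mean m0.
    C is the ratio of the user's rating variance to the variance of m0;
    the paper found C = 9 to work well in cross-validation."""
    rated = ratings_i[rated_mask]            # the ratings in R_i
    return (rated.sum() + C * m0) / (len(rated) + C)
```

A user with few ratings is pulled strongly toward $m_0$; with many ratings, the estimate approaches the user’s own mean.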

2.2 A Density Model for Preference Profiles

We assume a generative probabilistic model in which the ratings $\mathbf{a}$ of an active user are generated based on a probability density of the form

$$p(\mathbf{a} \mid \mathcal{P}) = \frac{1}{N} \sum_{i=1}^{N} p(\mathbf{a} \mid i), \qquad \mathbf{x}_i \in \mathcal{P}, \qquad (2)$$

where $p(\mathbf{a} \mid i)$ is the probability of observing the active user’s ratings $\mathbf{a}$ if we assume that $a$ has the same profile class as the $i$th profile prototype in $\mathcal{P}$, i.e., user $i$’s profile. The density expressed by (2) models the influences of other like-minded users’ preferences on the active user $a$. For the mixture components $p(\mathbf{a} \mid i)$, we use Gaussian³ density


2. We will show in Section 4 how a compact and accurate profile space $\mathcal{P}$ can be incrementally built from a given set of user ratings $D$.

3. We are a little inaccurate here and assume, for simplicity, that our rating scale is continuous and unbounded, ignoring the fact that ratings are often given on a discrete scale. One might also choose mixture components that fit particular data, for example, binomial distributions for discrete ratings.

Fig. 1. A schematic drawing of the components of probabilistic memory-based collaborative filtering (PMCF). Through an active learning scheme (presented in Section 3), the profile of a new user can be inferred with a minimum of required user effort. User ratings are stored in a database, from which a compact representation—the profile space—can be constructed in order to make fast predictions (presented in Section 4).


functions. Assuming that ratings on individual items are independent given a profile $i$, we get

$$p(\mathbf{a} \mid i) = \prod_{j \in I} p(a_j \mid i) = \prod_{j \in I} \frac{(2\pi)^{-1/2}}{\sqrt{\sigma^2 + \delta_{j \notin R_i}\,\sigma_0^2}} \exp\!\left( -\frac{(a_j - x_{i,j})^2}{2\left(\sigma^2 + \delta_{j \notin R_i}\,\sigma_0^2\right)} \right). \qquad (3)$$

Here, $\delta_{j \notin R_i} = 1$ if $x_{i,j}$ is unrated and $\delta_{j \notin R_i} = 0$ otherwise. This model can be motivated as a mixture model with the prototype profiles $\mathbf{x}_i$ serving as cluster centers, or as a Parzen density model on the profile space $\mathcal{P}$. The additional variance for unrated items takes into account the uncertainty of the estimated rating.
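To make the density concrete, here is a minimal NumPy sketch of Eqs. (2) and (3) in log space (function names, array shapes, and the log-sum-exp trick are our own choices; the paper does not prescribe an implementation):

```python
import numpy as np

def log_p_a_given_i(a, x_i, rated_i, sigma2, sigma0_2):
    """Log of Eq. (3): independent Gaussians per item, with the extra
    variance sigma0^2 added for items that prototype i has not rated."""
    var = sigma2 + np.where(rated_i, 0.0, sigma0_2)
    return float(np.sum(-0.5 * np.log(2 * np.pi * var)
                        - (a - x_i) ** 2 / (2 * var)))

def log_p_a_given_P(a, X, rated, sigma2, sigma0_2):
    """Log of Eq. (2): a uniform mixture over the N prototype profiles.
    X holds the prototype profiles row-wise; rated marks rated items."""
    logs = np.array([log_p_a_given_i(a, X[i], rated[i], sigma2, sigma0_2)
                     for i in range(len(X))])
    m = logs.max()                            # log-sum-exp for stability
    return float(m + np.log(np.exp(logs - m).mean()))
```

The log-sum-exp step avoids underflow, which matters because per-profile likelihoods over many items quickly become tiny.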

In our experiments, we set $\sigma_0^2$ to be the overall variance of user ratings. $\sigma^2$ was optimized by maximizing the leave-one-out likelihood of profiles

$$\sum_{\mathbf{a} \in \mathcal{P}} p(\mathbf{a} \mid \mathcal{P} \setminus \mathbf{a}) \qquad (4)$$

with respect to $\sigma^2$. $\sigma^2$ is tuned after constructing the profile space (see Section 4) and left constant thereafter. Note that, technically, profiles take on different meanings: If they are part of the database, they represent prototype vectors defining the component densities in (3). If we consider the active user’s profile, the profile corresponds to a sample generated from the probability density defined in the same equation.
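A brute-force way to carry out this tuning is a grid search over candidate values of $\sigma^2$, scoring each by the leave-one-out likelihood of Eq. (4) under the Gaussian component density of Eq. (3). The grid and the function names below are our own; the paper does not specify the optimizer:

```python
import numpy as np

def loo_loglik(X, rated, sigma2, sigma0_2):
    """Log of Eq. (4): sum over profiles a in P of log p(a | P \\ {a})."""
    N = len(X)
    total = 0.0
    for a in range(N):
        comps = []
        for i in range(N):
            if i == a:
                continue                      # leave profile a out
            var = sigma2 + np.where(rated[i], 0.0, sigma0_2)
            comps.append(np.sum(-0.5 * np.log(2 * np.pi * var)
                                - (X[a] - X[i]) ** 2 / (2 * var)))
        comps = np.array(comps)
        m = comps.max()
        total += m + np.log(np.exp(comps - m).mean())  # log mean over N-1
    return float(total)

def tune_sigma2(X, rated, sigma0_2, grid=(0.1, 0.5, 1.0, 2.0, 5.0)):
    """Pick the grid value maximizing the leave-one-out likelihood."""
    return max(grid, key=lambda s2: loo_loglik(X, rated, s2, sigma0_2))
```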

2.3 A Probabilistic Approach to Estimating User Ratings

We can now calculate the posterior density of the active user $a$’s ratings on not yet rated items, denoted by $\mathbf{a}^n$, based on the ratings $\mathbf{a}^r$ user $a$ has already given. Using the previously defined density model for user ratings, we find

$$p(\mathbf{a}^n \mid \mathbf{a}^r, \mathcal{P}) = \frac{p(\mathbf{a}^n, \mathbf{a}^r \mid \mathcal{P})}{p(\mathbf{a}^r \mid \mathcal{P})} \qquad (5)$$

$$= \frac{\sum_{i=1}^{N} p(\mathbf{a}^n, \mathbf{a}^r \mid i)}{\sum_{i=1}^{N} p(\mathbf{a}^r \mid i)} \qquad (6)$$

$$= \sum_{i=1}^{N} p(\mathbf{a}^n \mid i)\,\Pr(i \mid \mathbf{a}^r, \mathcal{P}). \qquad (7)$$

$\Pr(i \mid \mathbf{a}^r, \mathcal{P})$ indicates the a posteriori probability of user $a$ having the $i$th prototype profile, given the ratings user $a$ has already provided. It thus models the “like-mindedness” of active user $a$ to other users $i$ in the profile space $\mathcal{P}$:

$$\Pr(i \mid \mathbf{a}^r, \mathcal{P}) = \frac{p(\mathbf{a}^r \mid i)}{\sum_{i'=1}^{N} p(\mathbf{a}^r \mid i')}. \qquad (8)$$

Within the PMCF model, predictions for the active user are thus made by combining the predictions based on other prototype users $\mathbf{x}_i$, weighted by their degree of like-mindedness to user $a$. This puts the key idea of memory-based collaborative filtering into a probabilistic framework.

Note that the computational complexity of prediction is $O(NM)$, i.e., it is linear in the size of the profile space. In Section 4, we will show how to obtain a profile space that is much smaller than the complete user rating database $D$. Making predictions only on the basis of the small profile space thus brings a significant reduction of overall computational cost.
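The weighted combination of Eqs. (7) and (8) can be sketched as follows (a minimal NumPy version under the Gaussian model of Eq. (3); reporting the posterior mean as a point prediction is our own choice, since the paper works with the full predictive density):

```python
import numpy as np

def like_mindedness(a_r, rated_a, X, rated, sigma2, sigma0_2):
    """Eq. (8): posterior weight Pr(i | a^r, P) of each prototype,
    computed from the items the active user has actually rated."""
    logw = np.empty(len(X))
    for i in range(len(X)):
        var = (sigma2 + np.where(rated[i], 0.0, sigma0_2))[rated_a]
        diff = a_r[rated_a] - X[i][rated_a]
        logw[i] = np.sum(-0.5 * np.log(2 * np.pi * var)
                         - diff ** 2 / (2 * var))
    w = np.exp(logw - logw.max())             # normalize in log space
    return w / w.sum()

def predict_mean(a_r, rated_a, X, rated, sigma2, sigma0_2):
    """Posterior-mean rating for every item, following Eq. (7)."""
    w = like_mindedness(a_r, rated_a, X, rated, sigma2, sigma0_2)
    return w @ X
```

An active user who agrees closely with one prototype on the rated items inherits that prototype’s ratings for the unseen items.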

3 AN ACTIVE LEARNING APPROACH TO LEARNING USER PROFILES

In the previous section, we introduced the PMCF framework and showed how predictions can be made. In this section, we will use an active learning approach to efficiently learn the profile of an individual user. The active learning approach integrates smoothly into the PMCF framework and provides a solution for the “new user problem.” By presenting a set of the most informative query items in an interactive process, we can learn about the profile of a new user with a minimum of user effort.

3.1 The New User Problem

For users that are new to a recommender system, no information about their preferences is initially known. Thus, the recommender system typically requests them to rate a set of query items. Using the ratings on these query items, the CF system can then start making recommendations.

There are several important reasons why this set of query items should be selected carefully:

1. Users are not willing to rate a long list of items.
2. Users cannot rate items unknown to them.
3. Rating results for some items might be very informative for determining a user’s profile, whereas rating results for other items might not provide useful new information.

So far, little work has been done to address⁴ the new user problem [21].

In the next sections, we will present an approach for selecting query items that requires particularly little user effort, yet allows fast learning about the user’s preferences.

3.2 Identifying Informative Query Items

To achieve an efficient interactive learning of user profiles, we put the selection of query items into a decision-theoretic framework (see, for example, Section 4.3 of [24]). First, one needs to define a loss function evaluating the quality of the system before querying a new item, $\Lambda(\mathbf{a}^r, \mathcal{P})$, and after querying the user for item $j$, $j \notin R_a$, and having obtained rating $a_j$. We denote the loss after querying by $\Lambda(a_j, \mathbf{a}^r, \mathcal{P})$. The goal is now to select the query item $j$ such that the expected loss

$$E_{p(a_j \mid \mathbf{a}^r, \mathcal{P})}\!\left[ \Lambda(a_j, \mathbf{a}^r, \mathcal{P}) \right] \qquad (9)$$

is minimized. The expectation is calculated here with respect to the predicted probability of user $a$’s ratings for item $j$.

The most important ingredient is the loss function $\Lambda(a_j, \mathbf{a}^r, \mathcal{P})$. We propose to use the entropy of the like-mindedness $\Pr(i \mid \mathbf{a}^r, \mathcal{P})$ as the loss function. $\Pr(i \mid \mathbf{a}^r, \mathcal{P})$


4. A method for improving the accuracy of CF systems by adding extra query items has been presented in [23]. This approach might also be adapted to solve the new user problem.


describes the like-mindedness of a user $i$ in the profile space $\mathcal{P}$ with active user $a$, given $a$’s ratings $\mathbf{a}^r$. In an extreme case, $\Pr(i \mid \mathbf{a}^r, \mathcal{P})$ has a uniform distribution, which means that the profile of user $a$ is completely unclear. In contrast, a sharp peak in the distribution of $\Pr(i \mid \mathbf{a}^r, \mathcal{P})$ indicates that user $a$ has similar preferences as a small group of like-minded users. It thus seems natural to choose those query items that minimize the uncertainty (thus, the entropy) of user $a$’s like-mindedness.

Putting this into a formal setting, we can write for the loss function

$$\Lambda(a_j, \mathbf{a}^r, \mathcal{P}) = -\sum_{i=1}^{N} \Pr(i \mid a_j, \mathbf{a}^r, \mathcal{P}) \log \Pr(i \mid a_j, \mathbf{a}^r, \mathcal{P}). \qquad (10)$$

By $\Pr(i \mid \mathbf{a}^r, a_j, \mathcal{P})$, we denote the like-mindedness computed with an updated vector of ratings for the active user, who now also has rated the (previously unrated) item $j$.

We can now define the expected benefit [24, Section 4.3.2] for querying item $j$ as

$$E[B(j)] = E_{p(a_j \mid \mathbf{a}^r, \mathcal{P})}\!\left[ \Lambda(a_j, \mathbf{a}^r, \mathcal{P}) \right] - \Lambda(\mathbf{a}^r, \mathcal{P}) \qquad (11)$$

and terminate the query process if the expected benefit is less than a threshold related to the cost of querying.

Our algorithm for query item selection is myopic in the sense that the algorithm only looks one step ahead. In contrast, a hyperopic algorithm would aim at finding the optimal sequence of query items to be presented. However, since hyperopic optimization is computationally intractable, myopia is a standard approximation used in sequential decision-making problems [25], [26].
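As an illustration, the entropy loss of Eq. (10) and the expected benefit of Eq. (11) can be approximated by discretizing the rating scale and averaging over the predictive distribution of $a_j$ (Eq. (9)). The discrete grid and the per-prototype Gaussian likelihood are our own simplifications of the paper’s continuous model:

```python
import numpy as np

def entropy(w):
    """Entropy of the like-mindedness distribution Pr(i | a^r, P)."""
    return float(-(w * np.log(w + 1e-12)).sum())

def expected_benefit(w, x_col, sigma2, grid):
    """Eq. (11): expected entropy of the updated like-mindedness after
    querying one item (Eqs. (9)-(10)), minus the current entropy.
    w     : current Pr(i | a^r, P), shape (N,)
    x_col : prototype ratings for the candidate item, shape (N,)
    grid  : discrete rating values used to approximate Eq. (9)."""
    phi = np.exp(-(grid[:, None] - x_col[None, :]) ** 2 / (2 * sigma2))
    u = w[None, :] * phi                      # joint over (rating, prototype)
    p_r = u.sum(axis=1)
    p_r = p_r / p_r.sum()                     # predictive Pr(a_j = r)
    w_post = u / u.sum(axis=1, keepdims=True) # Pr(i | a^r, a_j = r, P)
    H = -(w_post * np.log(w_post + 1e-12)).sum(axis=1)
    return float(p_r @ H) - entropy(w)        # negative = entropy reduced
```

An item on which the prototypes disagree yields a strongly negative expected benefit (a large expected entropy reduction), while an item all prototypes rate alike leaves the like-mindedness unchanged.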

3.3 Identifying the Items Possibly Known to the Active User

If we wanted to use the active learning approach described in the previous section directly, we would most often get a “don’t know” as the answer to most of the query items. Users of a CF system can typically provide ratings for only a few of the items. For example, in a recommender system for movies, users may typically have seen a few dozen movies out of the several hundred movies contained in the database. It may be quite informative to know the user’s opinion on an unusual movie, yet it is likely that the user will not be able to give this movie any rating.

Thus, we must also predict the probability that a user is able to rate⁵ a given query item. This can be achieved by again referring to the like-mindedness of users. In (5), predictions for active user $a$ were built from a sum of other users’ ratings, weighted by their degree of like-mindedness $\Pr(i \mid \mathbf{a}^r, \mathcal{P})$. Similarly, we can predict the probability of user $a$ being able to rate item $j$, given his or her other ratings $\mathbf{a}^r$, by checking user $a$’s like-minded users:

$$\Pr(\text{user } a \text{ can rate item } j \mid \mathbf{a}^r, \mathcal{P}) = \sum_{i=1}^{N} \Pr(\text{user } a \text{ can rate item } j \mid i)\,\Pr(i \mid \mathbf{a}^r, \mathcal{P}). \qquad (12)$$

$\Pr(\text{user } a \text{ can rate item } j \mid i)$ is the probability that $a$ can rate item $j$, given that users $a$ and $i$ (as described by prototype profile $\mathbf{x}_i$) agree on which items they are able to rate. We assume for simplicity that user $a$ can rate exactly the same⁶ movies as user $i$:

$$\Pr(\text{user } a \text{ can rate item } j \mid i) = \begin{cases} 1 & \text{if user } i \text{ has rated item } j, \\ 0 & \text{otherwise.} \end{cases} \qquad (13)$$
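Under the indicator assumption of Eq. (13), Eq. (12) reduces to a weighted vote over which prototype users rated the item. A one-line sketch (the names are our own):

```python
import numpy as np

def prob_can_rate(w, rated, j):
    """Eq. (12) with Eq. (13): the probability that the active user can
    rate item j is the total like-mindedness weight of the prototype
    users who have rated j."""
    return float(w @ rated[:, j].astype(float))
```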

3.4 A Summary of the Active Learning Process

Using the ideas described in the previous sections, wepropose the following iterative scheme to learn the profileof the active user a:

1. Out of the set of items that have not yet been rated by user a, find those k_1 items with the highest probability of being known to user a, i.e., those items with the highest value for (12).

2. Out of these k_1 items, select a subset of k_2 items that lead to the highest reduction of uncertainty about the user's profile, i.e., the items with the highest expected benefit in (11).

3. Display those k_2 items to the user for rating. Collect the ratings and update the vector of ratings a.

4. Terminate if the user is not willing to answer any more queries or if the expected benefit of querying (as defined in (11)) is below a certain threshold. Otherwise, go to Step 1.

In the very first step, where nothing is known about user a, we assume equal like-mindedness of user a with all profiles in P. Thus, user a will be presented the k_2 most popular items as query items.
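The control flow of Steps 1-4 can be sketched as below; the helper functions are hypothetical stand-ins for (12), (11), the user interface, and the posterior update, so only the loop structure follows the scheme above:

```python
def learn_profile(items, prob_known, expected_benefit,
                  ask_user, update_profile, k1=50, k2=10, threshold=0.0):
    """Iterative query selection of Section 3.4 (sketch)."""
    unrated = set(items)
    while unrated:
        # Step 1: the k1 items most likely to be known to user a, cf. (12).
        candidates = sorted(unrated, key=prob_known, reverse=True)[:k1]
        # Step 2: among those, the k2 items with the highest expected
        # reduction of uncertainty about like-mindedness, cf. (11).
        queries = sorted(candidates, key=expected_benefit, reverse=True)[:k2]
        # Step 4: terminate once querying is no longer worthwhile.
        if not queries or expected_benefit(queries[0]) < threshold:
            break
        # Step 3: display the items, collect ratings, update the profile.
        update_profile(ask_user(queries))
        unrated -= set(queries)
```

With uniform like-mindedness at the very first step, prob_known in effect ranks items by popularity, matching the remark above.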

3.5 Implementation

1. Parameters for Active Learning. The value of k_1 (see Step 1 of Section 3.4) should be carefully selected. If k_1 is too small, for example, as small as k_2, then the selection procedure is biased too strongly by (12) and thus might miss out on informative items: the system performs too little exploration. If k_1 is too large, too many items will be presented to the user which the user is not able to rate. In cross-validation experiments, we found that k_1 = 50 gives the best results for the data we consider. The value for k_2 is rather uncritical. We used k_2 = 10 because it seems reasonable to display 10 items on a normal-sized PC screen. Thus, at each iteration, we first find the 50 candidate items with the largest probability of being known and then identify 10 query items according to the expected reduction of uncertainty in like-mindedness.

2. Computational Complexity. The most costly part of this active learning approach is the evaluation of (11), where the expected reduction of uncertainty in like-mindedness is computed. The algorithm needs to exhaust O(c k_1) possibilities of user feedback at each iteration (where c is the number of ratings a

60 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004

5. Another way of solving this problem would be to integrate this probability into the loss function (10) for the active learning approach. We do not pursue this solution in the present article.

6. This is a strong assumption, yet due to the weighting introduced by the like-mindedness, we obtain meaningful results.


user might possibly give to a presented query item, and k_1 is the number of candidate items) and calculate the entropy of the like-mindedness for each case. This again requires evaluating (2) with a changed preference vector a. Fortunately, (2) factorizes along items, so the distances only need to be recalculated along the dimensions of the newly rated items. This greatly reduces the overall computational cost.

3. Alternative Methods. Several of the approaches proposed in the active learning literature may be adopted for CF. A common approach is uncertainty sampling [27], which has been successfully applied to text categorization [27] and image retrieval [26] to reduce the number of training examples. The general idea behind all proposed variants of uncertainty sampling is to present the unlabeled examples for which the outcome is most uncertain, based on the current predictions. In a CF scenario, one is interested in predicting a user's ratings for nonrated items. Thus, the variance of the predictions, var p(a_j | a_r, P), is an appropriate measure of uncertainty. An advantage of this approach lies in its low computational cost, since we only have to compute the predictions p(a_j | a_r, P) for all yet unrated items.

Another low-complexity method for query item selection is entropy sampling [21]. Here, we consider Pr_j(s), the fraction of users who had given a particular rating s ∈ {s_1, ..., s_c} for item j. Query items are selected such that the entropy of Pr_j(s) is maximized.

We will show in Section 5.5 that the method based on uncertainty of like-mindedness (as outlined in Section 3.2) achieves the best results, both in terms of achieved accuracy and in terms of required user input.
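Of the alternative methods just discussed, entropy sampling is the simplest to make concrete. A short sketch (encoding unrated items as 0 is an assumption of this sketch, not the paper's representation):

```python
import numpy as np

def entropy_scores(R, scale):
    """Entropy of the empirical rating distribution Pr_j(s) per item."""
    scores = []
    for j in range(R.shape[1]):
        ratings = R[:, j][R[:, j] != 0]     # observed ratings for item j
        if ratings.size == 0:
            scores.append(0.0)
            continue
        p = np.array([(ratings == s).mean() for s in scale])
        p = p[p > 0]                        # 0 log 0 = 0 by convention
        scores.append(float(-(p * np.log(p)).sum()))
    return np.array(scores)

R = np.array([[1, 1],
              [1, 2]])
print(entropy_scores(R, scale=[1, 2]))      # item 1 has the higher entropy
```

Query items would then be chosen in order of decreasing entropy score.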

4 INCREMENTALLY CONSTRUCTING PROFILE SPACE

In Section 2, we introduced a probabilistic model for describing user preferences. This model was based on a given set of user profiles, the profile space P. In this section, we will show how this profile space can be constructed by selecting informative user profiles from the overall database of user ratings D. Since the profile space typically contains only a low number of user profiles (as compared to the often huge D), it allows us to build compact models and make predictions efficiently, while maintaining a high accuracy. It thus solves the well-known problem that predictions of traditional memory-based CF methods are rather time-consuming.

4.1 Kullback-Leibler Divergence for User Profile Sampling

Let us assume that there exists an optimal density model for user ratings, which we denote by p_opt(x). Naturally, we do not have access to this optimal model, but we work with a nonoptimal model p(x | P), as given in (2), based on some profile space P. The key idea of our proposed selection procedure is to select the profile space P such that the density p(x | P) is as close as possible to the optimal density p_opt(x).

To measure the distance of these two distributions, we use the Kullback-Leibler divergence (KL-divergence [28]). We denote the KL-divergence of the two distributions by

D(p(x | P) || p_opt(x)) = \int p_opt(x) log [ p_opt(x) / p(x | P) ] dx,    (14)

where the integral is over the whole space of user rating vectors. The KL-divergence is always nonnegative and is zero when the two compared distributions are identical. Assuming that the total set of user ratings D constitutes a set of independent samples drawn from p_opt(x), we can approximate the KL-divergence by Monte-Carlo integration [29]:

\tilde{D}(p(x | P) || p_opt(x)) = (1/K) \sum_{i=1}^{K} log [ p_opt(x_i) / p(x_i | P) ]    (15)

= (1/K) log [ p_opt(D) / p(D | P) ],    (16)
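As a numerical aside (not part of the paper), the Monte-Carlo estimator in (15) is easy to check on a toy model where the optimal density is known; here both densities are unit-variance Gaussians, for which the KL-divergence has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss(x, mu):
    # Unit-variance Gaussian density.
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

mu_opt, mu_model = 0.0, 0.5
x = rng.normal(mu_opt, 1.0, size=200_000)       # samples from p_opt
# Monte-Carlo estimate of (15): average log-ratio over samples from p_opt.
kl_mc = np.mean(np.log(gauss(x, mu_opt) / gauss(x, mu_model)))
kl_exact = 0.5 * (mu_opt - mu_model) ** 2       # closed form, unit variances
print(kl_mc, kl_exact)                          # both close to 0.125
```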

Here, K denotes the number of users in D. As stated above, we wish to minimize the KL-divergence

\tilde{D}(p(x | P) || p_opt(x)) so that the density p(x | P) best approximates p_opt(x). Since p_opt(D) is constant, (15) can be minimized by maximizing the likelihood of the user rating database D with respect to the profile space P. Finding the optimal profile space P is clearly an intractable task; we thus switch to an iterative greedy approach for constructing P.

4.2 Incremental Profile Space Construction

For constructing the profile space P from a database D of user ratings, we consider an incremental scenario. Given the current profile space P, which profile pattern x_i ∈ D should be included such that the updated profile space P ∪ {x_i} can achieve the maximum reduction in KL-divergence, according to (15)?

The reduction in KL-divergence caused by including x_i in P can be written as

Δ_i = \tilde{D}(p(x | P) || p_opt(x)) - \tilde{D}(p(x | P ∪ {x_i}) || p_opt(x)) = (1/K) log [ p(D | P ∪ {x_i}) / p(D | P) ].    (17)

Note that in this step the optimal density p_opt(x) drops out. According to Bayes' rule, the likelihood of the overall data D, given the updated profile space P ∪ {x_i}, can be written as follows:

p(D | P ∪ {x_i}) = p(D | P) · p(x_i | D) / p(x_i | P),    (18)

where p(x_i | D) is the likelihood of x_i, based on a model that uses the complete data as the profile space. Combining (17) and (18), the optimal profile x to be selected is given by

argmax_i Δ_i = argmax_{x_i ∈ D \ P} p(x_i | D) / p(x_i | P).    (19)

An intuitive interpretation of this selection scheme is as follows: Equation (19) suggests that profiles x_i with low p(x_i | P), but high p(x_i | D), will be selected. p(x_i | P) encodes how likely a profile x_i is given our current knowledge P, while p(x_i | D) encodes the likelihood and, thus, the "degree of typicalness" of profile x_i in the overall data D. The

YU ET AL.: PROBABILISTIC MEMORY-BASED COLLABORATIVE FILTERING 61

Page 7: Probabilistic memory-based collaborative filtering - Knowledge …gkmc.utah.edu/7910F/papers/IEEE TKDE memory-based collaborative... · Kai Yu, Anton Schwaighofer, Volker Tresp, Xiaowei

profile selection scheme thus focuses on profiles that are novel to our current knowledge (encoded by the current profile space), but are, in fact, typical in the real world (represented by the whole data D). This sampling scheme will thus remove redundancies (we only focus on novel data that is not yet included in the profile space) and remove outliers (outliers can be considered untypical data).

Still, (19) does not give a practical algorithm, since it requires evaluating O(K) profiles, K = |D|, where each evaluation requires O(K) steps to actually build p(x_i | D). This leads to the clearly impractical overall runtime of O(K^2). Practical variants will be discussed in the next section.

4.3 Implementation

Constructing a profile space P according to (19) is sometimes referred to as full greedy selection. This can only be done efficiently if the associated objective function can be computed cheaply, which is not the case for the likelihood ratio we consider here. In related problems, it has been suggested to consider small subsets of candidates, evaluate the objective function for each candidate, and select the best candidate out of this subset (see, for example, [30, Section 6.5]).

We thus obtain the following profile sampling scheme to build P from D:

1. Select a subset C of candidate profiles at random from D \ P.

2. Compute the likelihood p(x_i | P) for each candidate profile x_i ∈ C, based on the current profile space P.

3. Compute the likelihood p(x_i | D) for each x_i ∈ C, based on the complete data D.

4. Include the best candidate profile in the profile space:

P ← P ∪ argmax_{x_i ∈ C} p(x_i | D) / p(x_i | P).    (20)

5. Terminate if the profile space has reached a given maximum size or if the reduction of KL-divergence is below a given threshold.

It has been suggested in [30] that subsets of size |C| = 59 can be guaranteed to select profiles that are better than 95 percent of all other profiles with confidence 95 percent. In our experiments, we aim at achieving higher efficiency and thus use subsets of size |C| = 7. This corresponds to selecting profiles that are better than 80 percent of all others with confidence 80 percent.
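The five steps above can be sketched as follows. The profile likelihood p(x | S) from (2) is passed in as a function lik(x, S), so that only the control flow of the sampling scheme is shown (the caller's lik must also handle an initially empty P); the default candidate set size follows the |C| = 7 choice above:

```python
import random

def build_profile_space(D, lik, target_size, n_candidates=7, seed=0):
    """Greedy candidate-subset construction of P from D (sketch)."""
    rng = random.Random(seed)
    P = []
    while len(P) < target_size:
        pool = [x for x in D if x not in P]
        if not pool:
            break
        # Step 1: a random candidate subset C from D \ P.
        C = rng.sample(pool, min(n_candidates, len(pool)))
        # Steps 2-4: include the candidate maximizing p(x | D) / p(x | P).
        P.append(max(C, key=lambda x: lik(x, D) / lik(x, P)))
    return P
```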

4.4 Constructing Profile Spaces in a Dynamic Environment

While the sampling approach presented in the previous section works fine in a static environment with a fixed database of user ratings, it needs to be refined to work in a dynamic environment. The dynamics arise from changing preference patterns (for example, new styles of music in a music recommender system) and the ever growing database of user ratings. Since user profiles are typically collected incrementally, we suggest an incremental extension to the

basic sampling scheme presented in Section 4.3. We assume that the profile space is updated after a fixed period of time, e.g., each day or week. The new user profiles gathered during this period are processed, and some of them will be added to the profile space.

Assume that we have a database of user ratings D, from which we have already constructed a profile space P. After collecting user profile data for some time, we get an updated database D^+, with D^+ = D ∪ ΔD. In order to build the corresponding profile space P^+, select the set of candidate items C from D^+. Select the most informative profile and update the profile space P^+:

P^+ ← P^+ ∪ argmax_{x_i ∈ C} p(x_i | D^+) / p(x_i | P^+).    (21)

Terminate if the new profile space P^+ has reached a given size or if none of the candidate items x_i ∈ C leads to a reduction of KL-divergence. Otherwise, select a new candidate set and proceed.

Through this straightforward extension, we can retain the basic idea of using a small profile space, as introduced in Section 4.2, while now being capable of incrementally processing new data.^7

4.5 Computational Complexity

For the basic profile space construction, as outlined in Section 4.2, the computational complexity is as follows.

Evaluating the density function p(x_i | D) for a candidate profile x_i (see (19)) requires scanning the whole database D with K user ratings. Its complexity is thus O(K). Since all potential profile spaces P are subsets of D, P ⊆ D, one can easily construct p(x_i | P) as a by-product when scanning the database in order to find p(x_i | D). Both steps are thus O(K), with K = |D|. Constructing a profile space of size N requires a total of O(KN) operations. Once the profile space is constructed, one also needs to update the variance σ^2 according to (4). This is done with a leave-one-out scheme; its complexity is thus O(N^2).

Since one would typically keep the profile space resident in memory, the memory consumption of the profile space construction is O(N), with N = |P|.

The suggested method for constructing a profile space P thus has the same complexity as making predictions in a traditional memory-based CF method. Yet, as described in Section 4.4, profile space construction can be seen as a background process that is triggered by time or when unused computing power is available. Thus, its time consumption is not visible to a user of the CF system. We argue that the shift of workload achieved in this way is important, since it greatly improves the efficiency of front-end processing, namely, making predictions.

5 EMPIRICAL STUDY

In this section, we report results from applying the probabilistic memory-based collaborative filtering (PMCF) framework to two CF benchmark data sets, EACHMOVIE and JESTER. We report results on prediction accuracy,


7. One might also consider the case of removing certain (outdated) user profiles from P, yet we did not evaluate this idea in the present work.


efficiency of learning individual user profiles (based on the ideas presented in Section 3), and accuracy of the constructed profile spaces (using the incremental scenario of Section 4).

5.1 Data Sets

We apply the PMCF framework to the following twobenchmark data sets:

. EACHMOVIE^8 contains ratings from 72,916 users on 1,628 movies. User ratings were recorded on a discrete scale from zero to five. On average, each user rated about 30 movies. EACHMOVIE is one of the most widely used data sets in recommender system research.

. JESTER^9 contains ratings from 17,998 users on 100 jokes, continuously valued from -10 to 10. On average, each user rated about 50 jokes. We transferred the ratings to a discrete scale {-10, -9, ..., 9, 10}.
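A one-line sketch of the JESTER discretization; rounding to the nearest integer is our assumption, since the exact mapping is not stated:

```python
def discretize(rating):
    # Clip to [-10, 10] and round to the nearest integer on the
    # 21-point scale {-10, -9, ..., 9, 10} (assumed mapping).
    return max(-10, min(10, round(rating)))

print([discretize(r) for r in (-9.7, 0.2, 9.6)])   # [-10, 0, 10]
```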

5.2 Evaluation Metrics and Experimental Setup

In collaborative filtering research, one is typically interested in two types of accuracy: the accuracy for predicting ratings and the accuracy for making recommendations. The first one measures the performance when explicitly predicting the active user's ratings on some unseen items. The second one focuses on finding an accurate ordering of a set of unseen items, in order to recommend the top-ranked items to the active user. These two scenarios require different experimental setups and metrics, which we will describe now.

1. Accuracy of Predicting Ratings. To evaluate the accuracy when the CF system is asked to predict an active user's ratings, we use the mean absolute error (MAE, the average absolute difference between the actual ratings and the predicted ratings). This measure has been widely used in previous collaborative filtering research [11], [31], [19], [5], [6].

We examine the accuracy of predictions in two experimental setups, ALLBUTONE and GIVEN5, which were introduced in [11]:

. ALLBUTONE evaluates the prediction accuracy when sufficient information about the active user is available. For each active user (from the test set^10), we randomly hide one of the rated items and predict its rating, based on the ratings on the other, nonhidden items.

. GIVEN5 evaluates the performance of a CF system when only little information about a user is available. For each active user, we retain only five ratings. The CF system predicts the ratings of hidden items, based on the five visible ratings.

It has been argued that the accuracy of a CF system is most critical when predicting extreme ratings (very high or very low) for items [19], [6]. Since the goal of a CF system is to make recommendations, high accuracy on high and low rated items is of most importance. One would like to present those items (in particular, products) that the active user likes most and avoid anything the user dislikes. Therefore, for both of the above ALLBUTONE and GIVEN5 setups, we use two settings, EXTREME and ALL (see [19]). The ALL setting corresponds to the standard case where the CF system is asked to predict any of the hidden ratings. In the EXTREME setting, the CF system only predicts ratings that are at the ends of the rating scale. For EACHMOVIE, these extreme ratings are {0, 1, 2, 4, 5}; for JESTER, they are the ratings below -5 or above 5.

2. Accuracy of Recommendations. We use precision and recall to evaluate the accuracy of recommendations. These two metrics have been extensively used in information retrieval and collaborative filtering research [1], [18]. In our experiments, precision is the percentage of items recommended to a user that the user actually likes. Recall is the percentage of items the user likes that are also recommended by the CF system. For the EACHMOVIE data, we assume that users like those items (movies) which they had rated 4 or 5. For JESTER, we assume that users like those jokes that had been given a rating larger than 5.

To compute precision and recall, we use the following setup. For each active user (from the test set^11), we randomly hide 30 of the user's ratings.^12 The CF system then predicts the ratings for these items, based on the remaining visible ratings. The top-ranked items out of these 30 items are then recommended to the user and used to evaluate precision and recall. We compute precision and recall for two cases, where we recommend either the top 5 or the top 10 ranked items. These two cases will be labeled TOP5 and TOP10 in the table of results.

3. Training and Test Sets. For comparing the accuracy of predictions of PMCF with that of Bayesian network CF [11] on the EACHMOVIE data, we use exactly the same split as reported in [11], [19], with training and test sets of size 5,000. To be able to evaluate the significance of our results, we use training and test sets (both of size 5,000) drawn at random from the data, and repeat this five times.

Similarly, for evaluating the accuracy of prediction on the JESTER data, we take the first 5,000 users as the training set and the next 5,000 as the test set. Five random splits are used for significance tests.
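The metrics from items 1 and 2 above can be stated compactly (the item sets in the example are illustrative):

```python
def mae(true, pred):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

def precision_recall(recommended, liked):
    """Fraction of recommendations the user likes, and fraction of
    liked items that were recommended."""
    hits = len(set(recommended) & set(liked))
    return hits / len(recommended), hits / len(liked)

print(mae([4, 2, 5], [3, 2, 3]))                                # 1.0
print(precision_recall(["a", "b", "c"], ["b", "c", "d", "e"]))  # precision 2/3, recall 0.5
```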

As mentioned above, we skip all test users that have rated fewer than 31 items when computing precision and recall, respectively, fewer than two (six) items when comput-


8. Available from the Digital Equipment Research Center at http://www.research.digital.com/SRC/EachMovie/.

9. JESTER stems from a WWW-based joke recommender system, developed at the University of California, Berkeley [10]. It is available from http://shadow.ieor.berkley.edu/humor/.

10. This naturally requires that we skip users in the test set that have only rated one single item, respectively, users that rated fewer than six items in the GIVEN5 setup.

11. The setup requires that we skip users who had rated fewer than 31 items.

12. We experimented with different numbers here, for example, hiding 20 of the user's ratings. We found that the results were consistent throughout these experiments, thus we present only results for one setup.


ing the MAE in the ALLBUTONE (GIVEN5) setup. Final results for MAE, precision, and recall are always averaged over all users in the test set.

5.3 Comparison With Other CF Methods

To compare the results of PMCF with other established CF methods, we report results in terms of MAE, precision, and recall for PMCF and for the following methods that have proven successful in the CF literature:

. Memory-based CF with the Pearson correlation coefficient [5], one of the most popular memory-based CF algorithms.

. Bayesian network CF [11]. Since we use exactly the same experimental setup and evaluation metrics for the EACHMOVIE data as reported in [11], we can directly compare the performance of Bayesian network CF with other methods. We did not implement Bayesian network CF for the JESTER data.

. Naive Bayesian CF [32]. Despite its simplicity, the naive Bayesian classifier has proven to be competitive with Pearson correlation CF.

All methods are evaluated in the setup described in Section 5.2.

We compare the above listed methods with two variants of PMCF, which we label PMCF P and PMCF D. For the PMCF D variant, we use the full training set to build the density model in (2), that is, the profile space is taken to be the full training data P = D. The other variant, PMCF P, is PMCF with a profile space constructed from the training set D in the way described in Section 4. For both EACHMOVIE and JESTER, we constructed profile spaces with 1,000 profiles (out of the training data of size 5,000).

5.4 Evaluation of Accuracy

Tables 1 and 2 summarize the performance of all evaluated CF methods in terms of accuracy for prediction and recommendation.

Table 1 lists results for accuracy of prediction that are based on one particular split of the data into training and test set that has also been used in [11]. It can be clearly seen that PMCF achieves an MAE that is about 7-8 percent lower than the MAE of the competing methods. The results also suggest that PMCF is particularly suitable for making predictions when only very little information about the active user is given: PMCF achieved a particularly high improvement of accuracy for the GIVEN5 scenarios.

64 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004

TABLE 1: Accuracy of Predictions, Measured by Mean Absolute Error (MAE), of Different CF Methods

Details on the individual experiments are given in Sections 5.2 and 5.3. Both PMCF P and PMCF D consistently outperform the competing methods, in particular when little information is given about the active user in the GIVEN5 scenario. The results shown here are based on the training/test split reported in Section 5.2.3. Additional experiments with five random splits and a paired t-test confirmed that PMCF outperformed the competing methods at a significance level of 99 percent or above.

TABLE 2: Accuracy of Recommendations, Measured by Precision and Recall, of Different CF Methods

All results in this table are averaged over five runs, where training and test sets had been drawn at random from the total data sets. Marked in bold are PMCF results that are significantly better (with a significance level of 95 percent or above in a paired t-test) than the competing approaches. Marked in italics are PMCF results that are better than the competing approaches with a significance level of 90 percent or above. Further details on the individual experiments are given in Sections 5.2 and 5.3.


For the accuracy of predictions, we also evaluated all methods (except for the Bayesian network) with five different randomly drawn training and test sets of size 5,000 and did a pairwise comparison of results using a paired t-test. The test confirmed that both variants of PMCF performed better than all of the competing methods with a significance level of 99 percent or above. Comparing PMCF P and PMCF D, we noted that both performed almost identically for the GIVEN5 setups. For the two ALLBUTONE setups, PMCF D achieved a slightly better performance.

The results for accuracy of recommendation listed in Table 2 are averages over five different random splits into training and test data, as described above. The large advantage of PMCF in terms of accuracy of prediction does not fully carry over to the accuracy of recommendation. Still, a consistent and statistically significant gain in performance could be achieved. Precision and recall of PMCF are typically about 2 to 3 percent better than those of the competing methods. A larger performance gain was always achieved in the TOP5 setup. Again, a pairwise comparison of results in a paired t-test was conducted. Results for one of the two PMCF variants that are marked in bold in Table 2 are better than those of the two competing methods with a significance level of 95 percent or above. Similarly, results marked in italics achieve a significance level of 90 percent or above.

Overall, we could verify that our proposed probabilistic memory-based CF framework achieves an accuracy that is comparable or superior to other approaches that have been proposed for collaborative filtering.

5.5 Evaluation of Profile Learning

In Section 3, we proposed an active learning approach to interactively learn user profiles. In this section, we investigate the performance of this learning process in a series of experiments that simulate the interaction between users and the recommender system.

We use the training/test split described in Section 5.2.3. For each test user, ratings are randomly split into a set S of 30 items and the remaining items U. We assume that the test user initially has not rated any items, and we wish to infer his profile using the active learning approach. To obtain long learning curves, we restrict the test set to users who had rated at least 60 items. This leaves us with 972 and 1,340 test users, respectively, for the EACHMOVIE and JESTER data sets.

The interactive sessions are simulated as follows: The recommender system selects the 10 most informative items^13 according to the criterion described in Section 3.4. User feedback is taken from the actual ratings the user has given on an item, if the item is in set U. Otherwise, it is left unrated, simulating that the user is not able to give feedback on this particular item. We make a series of such simulated interactions, t = 1, 2, ..., gaining more and more knowledge about the user's profile. For test user a, we compute the MAE when predicting the ratings in set S and the precision for making recommendations in set S, denoted by MAE(a, t) and precision(a, t). By averaging over all users in the test set, we obtain MAE(t) and precision(t).
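The protocol just described can be sketched as follows; the selection strategy and prediction function are hypothetical stand-ins, and only the mechanics (10 queries per step, answers only for items the user actually rated) follow the text:

```python
def simulate(user_ratings, holdout, select_queries, predict, steps):
    """Simulated interactive sessions of Section 5.5 (sketch).

    user_ratings: the user's true ratings on the visible pool U.
    holdout: the held-out set S used for evaluation.
    """
    known = {}                                   # ratings revealed so far
    curve = []
    for t in range(steps):
        for item in select_queries(known, k=10):
            if item in user_ratings:             # user can rate this item
                known[item] = user_ratings[item]
            # otherwise: the item is left unrated, as in the paper
        errs = [abs(predict(known, j) - r) for j, r in holdout.items()]
        curve.append(sum(errs) / len(errs))      # MAE(a, t)
    return curve
```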

Using MAE and precision, we compare the following five methods for selecting the query items:

1. Query item selection by minimizing the entropy of the like-mindedness, as outlined in Section 3.4.

2. Uncertainty sampling, as described in Section 3.5.

3. Entropy sampling, as described in Section 3.5.

4. Popularity sampling: At each iteration, we present 10 of the most popular items to the test user.

5. Random sampling: At each iteration t, we randomly select 10 query items.

Methods 3, 4, and 5 have also been studied in [21].

The resulting learning curves MAE(t) and precision(t) for the five methods mentioned above are shown in Fig. 2 (for


Fig. 2. Learning individual user profiles for the EACHMOVIE data. Mean absolute error MAE(t) and precision(t) achieved after t = 1, 2, ... steps of user interaction with different strategies for query item selection. Details of the experimental setup are given in Section 5.5. (a) Mean absolute error MAE(t). (b) precision(t).

13. Query items might also be presented one by one, instead of using batches of 10 items. We chose the variant with 10 items since it seems more natural in an application scenario. Presenting items one by one can easily make users impatient.


the EACHMOVIE data) and in Fig. 3 (for JESTER). The graphs clearly indicate that query item selection based on like-mindedness outperforms all other tested methods. Like-mindedness-based selection is thus a method which achieves a maximum gain of information about a particular user with only a minimum of user effort.

For all of the tested methods, we also investigated the average number of items the user is able to rate at a particular iteration t. The low performance of random and entropy-based sampling, in particular on EACHMOVIE, can be explained by the fact that users are not able to answer the posed queries. The remaining three methods all achieve similar results for the average number of rated items. Yet, like-mindedness sampling seems to ask more informative questions, leading to the steepest learning curves among all methods in Figs. 2 and 3.

From the presented results, we conclude that like-mindedness-based sampling is a sensible and accurate method of inferring user profiles which requires only a minimum amount of user effort. It performs particularly well on data sets with high sparsity such as EACHMOVIE, where only 3 percent of the items are rated, yet it also performs better than competing approaches on dense data sets (JESTER).

5.6 Evaluation of Constructing Profile Spaces

We showed in Section 4 how a small profile space P for the PMCF model can be constructed out of a large database of user ratings D. In this section, we investigate how the profile space construction relates to the achievable accuracy for predictions and recommendations in the PMCF model.

To this aim, we use the split of training and test data described in Section 5.2.3. From the training data D, the profile space P is constructed iteratively as outlined in Section 4. At certain intervals,^14 we evaluate the performance of the PMCF method, based on the profile space constructed so far, on the test set. We use the mean absolute error MAE in the ALLBUTONE setting and precision in the TOP10 setting as the measures of performance.

We obtain a curve of performance versus size of the profile space. Since constructing the profile space uses a randomized strategy to select candidate profiles (see Section 4.3), we repeat this procedure 10 times. Thus, error bars for the performance of PMCF with a profile space of a given size can be plotted. As the baseline method, we use a PMCF model with a profile space drawn at random from the full training data D.

The resulting curves for accuracy of prediction (MAE) and recommendation (precision) are shown in Fig. 4 for the EACHMOVIE data and in Fig. 5 for the JESTER data. All plots clearly indicate that the profile space construction presented in Section 4 brings significant advantages in terms of performance over a randomly chosen profile space. The gain in performance was particularly large for accuracy of recommendation on the JESTER data.

6 CONCLUSIONS

In this paper, we proposed a probabilistic framework for memory-based collaborative filtering (PMCF). PMCF is based on user profiles in a specially constructed profile space. With PMCF, the posterior distribution of user ratings can be used to predict an active user's ratings. An experimental comparison with other CF methods (memory-based CF with Pearson correlation, Bayesian networks, naive Bayes) showed that PMCF outperforms the competing methods both in terms of accuracy for prediction and recommendation.

As one of its major advantages, PMCF allows extensions to the basic model on a sound probabilistic basis. We showed in Section 3 how an active learning approach can be integrated smoothly into the PMCF framework. Through active learning, the CF system can interactively learn about a new user's preferences by presenting well-selected query

66 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004

Fig. 3. Learning individual user profiles for the JESTER data. Mean absolute error MAE(t) and precision(t) achieved after t = 1, 2, ... steps of user interaction with different strategies for query item selection. Details of the experimental setup are given in Section 5.5. (a) Mean absolute error MAE(t). (b) precision(t).

14. Evaluation is done when the profile space has reached a size of 60, 125, 250, 500, 1000, 2000, and 4000.


items to the user. Our results showed that the active learning approach performed better than other methods for learning user profiles, in the sense that it can make accurate predictions with only a minimum amount of user input.
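As a concrete illustration of query selection, a minimal sketch of uncertainty sampling, one of the criteria compared in the paper, picks the candidate item whose predicted rating distribution has the highest entropy. The function and item names below are illustrative assumptions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def pick_query_item(rating_dists):
    """Uncertainty sampling: among candidate items, query the one whose
    predicted rating distribution is most uncertain (highest entropy).
    `rating_dists` maps item id -> distribution over rating values."""
    return max(rating_dists, key=lambda item: entropy(rating_dists[item]))

dists = {
    "item_a": np.array([0.9, 0.05, 0.05]),   # nearly certain prediction
    "item_b": np.array([0.4, 0.3, 0.3]),     # uncertain -> informative query
}
query = pick_query_item(dists)   # selects "item_b"
```

Asking about items whose ratings the system can already predict well wastes the user's effort; the entropy criterion formalizes that intuition.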

In Section 4, we used the probabilistic framework to derive a data selection scheme that allows the recommender system to make fast and accurate predictions. Instead of operating on a possibly huge database of user preferences (as traditional memory-based CF does), the data selection scheme retains only a carefully selected subset, which we call the profile space. Using the so-selected profile space in the PMCF model allows us to make fast predictions with only a small drop in performance over a PMCF model operating on the full data.
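One way to picture such a data selection scheme is the greedy sketch below. It is a hedged illustration only: the Gaussian kernel density, bandwidth, and candidate-batch scheme are assumptions; Section 4 of the paper defines the actual KL-divergence-based procedure:

```python
import numpy as np

def kde_loglik(points, centers, h=1.0):
    """Average log-likelihood of `points` under a Gaussian kernel density
    centered at `centers` (a stand-in for the profile-space density).
    Maximizing this fit minimizes the KL divergence from the data
    distribution to the model, up to a constant."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    dens = np.exp(-d2 / (2 * h * h)).mean(axis=1) + 1e-12
    return float(np.log(dens).mean())

def grow_profile_space(data, size, batch=20, rng=None):
    """Greedy sketch: repeatedly draw a random candidate batch and add
    the profile whose inclusion most improves the density fit to the
    full data."""
    rng = rng or np.random.default_rng(0)
    space = [data[rng.integers(len(data))]]
    while len(space) < size:
        cands = data[rng.choice(len(data), size=batch, replace=False)]
        best = max(cands, key=lambda c: kde_loglik(data, np.array(space + [c])))
        space.append(best)
    return np.array(space)

data = np.random.default_rng(1).normal(size=(200, 3))
P = grow_profile_space(data, size=10)
```

The randomized candidate batches mirror the randomized strategy of Section 4.3, which is why the construction is repeated several times in the experiments.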

We believe that the PMCF framework will allow more extensions and thus can contribute to further improvements of recommender systems. A particularly promising research direction is the combination of CF methods with content-based filtering into hybrid systems. We are currently working on a PMCF-based hybrid system for image and text retrieval [33]. This system implicitly also solves the new item problem: if no user ratings are available for an item, predictions can still be made on the basis of the content description.

Our further work on the PMCF model will also include an improved model for user preferences. In (3), only items that were actually rated contribute to the model. An improved model could also take into account the information about which items had not been rated. For example, in the EACHMOVIE data, a movie may have been unrated because a friend had dissuaded the user from seeing the movie. Thus, one may be able to extract a certain degree of

YU ET AL.: PROBABILISTIC MEMORY-BASED COLLABORATIVE FILTERING 67

Fig. 5. Evaluating the profile space construction for the JESTER data set. Mean absolute error MAE and precision achieved with profile spaces of different size, which are either constructed based on KL-divergence (see Section 4) or drawn at random from the training data. The plot is averaged over 10 runs, with error bars. (a) Mean absolute error MAE. (b) Precision.

Fig. 4. Evaluating the profile space construction for the EACHMOVIE data set. Mean absolute error MAE and precision achieved with profile spaces of different size, which are either constructed based on KL-divergence (see Section 4) or drawn at random from the training data. The plot is averaged over 10 runs, with error bars. (a) Mean absolute error MAE. (b) Precision.


information from the set of unrated items as well and further improve the accuracy of a CF system.

For the current PMCF system, as described in this article, the efficiency of the active learning scheme still needs to be improved. Active learning based on minimization of the entropy of like-mindedness achieves the best recommendation accuracy, yet its computational complexity is higher than that of competing methods such as uncertainty sampling.
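The cost gap can be made concrete with a small sketch: uncertainty sampling scores an item from a single predictive distribution, whereas an expected-entropy criterion must additionally loop over every possible answer, each with its own updated posterior over profiles. All names and distributions below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def uncertainty_score(pred_dist):
    """One entropy evaluation per candidate item: cheap."""
    return entropy(pred_dist)

def expected_likemindedness_entropy(pred_dist, posterior_given_answer):
    """Expected posterior entropy of the like-mindedness weights after
    observing the user's answer. The inner loop over all possible
    ratings, each needing an updated posterior over profiles, is what
    makes this criterion costlier. `posterior_given_answer[r]` is the
    profile posterior if the user answers with rating r."""
    return sum(p_r * entropy(post)
               for p_r, post in zip(pred_dist, posterior_given_answer))

pred = np.array([0.5, 0.5])                       # two possible answers
posts = [np.array([0.8, 0.2]), np.array([0.2, 0.8])]
cheap = uncertainty_score(pred)
costly = expected_likemindedness_entropy(pred, posts)
```

In a real system the posterior update per hypothetical answer dominates, so the expected-entropy criterion pays roughly a factor of (number of rating values) x (posterior-update cost) per candidate item.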

ACKNOWLEDGMENTS

Anton Schwaighofer gratefully acknowledges support through an Ernst-von-Siemens scholarship. The authors thank the three anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

REFERENCES

[1] D. Billsus and M.J. Pazzani, "Learning Collaborative Information Filters," Proc. 15th Int'l Conf. Machine Learning, pp. 46-54, 1998.

[2] M. Balabanovic and Y. Shoham, "Fab: Content-Based, Collaborative Recommendation," Comm. ACM, vol. 40, no. 3, pp. 66-72, 1997.

[3] R.J. Mooney and L. Roy, "Content-Based Book Recommending Using Learning for Text Categorization," Proc. Fifth ACM Conf. Digital Libraries, pp. 195-204, 2000.

[4] M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill and Webert: Identifying Interesting Web Sites," Proc. 13th Nat'l Conf. Artificial Intelligence, pp. 54-61, 1996.

[5] P. Resnick, N. Iacovou, M. Sushak, P. Bergstrom, and J. Riedl, "GroupLens: An Open Architecture for Collaborative Filtering of Netnews," Proc. 1994 Computer Supported Collaborative Work Conf., pp. 175-186, 1994.

[6] U. Shardanand and P. Maes, "Social Information Filtering: Algorithms for Automating 'Word of Mouth'," Proc. ACM CHI '95 Conf. Human Factors in Computing Systems, vol. 1, pp. 210-217, 1995.

[7] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Analysis of Recommendation Algorithms for E-Commerce," Proc. ACM E-Commerce Conf., pp. 158-167, 2000.

[8] W. Hill, L. Stead, M. Rosenstein, and G. Furnas, "Recommending and Evaluating Choices in a Virtual Community of Use," Proc. ACM CHI '95 Conf. Human Factors in Computing Systems, pp. 194-201, 1995.

[9] B.J. Dahlen, J.A. Konstan, J.L. Herlocker, and J. Riedl, "Jump-Starting Movielens: User Benefits of Starting a Collaborative Filtering System with Dead Data," Technical Report 7, Univ. of Minnesota, 1998.

[10] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, "Eigentaste: A Constant Time Collaborative Filtering Algorithm," Information Retrieval J., vol. 4, no. 2, pp. 133-151, 2001.

[11] J.S. Breese, D. Heckerman, and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," Proc. 14th Conf. Uncertainty in Artificial Intelligence, pp. 43-52, 1998.

[12] C. Basu, H. Hirsh, and W.W. Cohen, "Recommendation as Classification: Using Social and Content-Based Information in Recommendation," Proc. 15th Nat'l Conf. Artificial Intelligence (AAAI/IAAI), pp. 714-720, 1998.

[13] T. Zhang and V.S. Iyengar, "Recommender Systems Using Linear Classifiers," J. Machine Learning Research, vol. 2, pp. 313-334, 2002.

[14] D. Heckerman, D.M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie, "Dependency Networks for Inference, Collaborative Filtering, and Data Visualization," J. Machine Learning Research, vol. 1, pp. 49-75, 2000.

[15] T. Hofmann and J. Puzicha, "Latent Class Models for Collaborative Filtering," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 688-693, 1999.

[16] W.S. Lee, "Collaborative Learning for Recommender Systems," Proc. 18th Int'l Conf. Machine Learning, pp. 314-321, 2001.

[17] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-Based Collaborative Filtering Recommendation Algorithms," Proc. 10th World Wide Web Conf. (WWW10), pp. 285-295, 2001.

[18] W. Lin, S.A. Alvarez, and C. Ruiz, "Collaborative Recommendation via Adaptive Association Rule Mining," Data Mining and Knowledge Discovery, vol. 6, no. 1, pp. 83-105, Jan. 2002.

[19] D.M. Pennock, E. Horvitz, S. Lawrence, and C.L. Giles, "Collaborative Filtering by Personality Diagnosis: A Hybrid Memory- and Model-Based Approach," Proc. 16th Conf. Uncertainty in Artificial Intelligence, pp. 473-480, 2000.

[20] N. Good, J.B. Schafer, J. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl, "Combining Collaborative Filtering with Personal Agents for Better Recommendations," Proc. AAAI-99, pp. 439-446, 1999.

[21] A.M. Rashid, I. Albert, D. Cosley, S.K. Lam, S.M. McNee, J.A. Konstan, and J. Riedl, "Getting to Know You: Learning New User Preferences in Recommender Systems," Proc. Int'l Conf. Intelligent User Interfaces (IUI 2002), 2002.

[22] J.L. Herlocker, J.A. Konstan, and J. Riedl, "Explaining Collaborative Filtering Recommendations," Proc. Computer Supported Cooperative Work Conf. (CSCW '00), pp. 241-250, 2000.

[23] C. Boutilier and R.S. Zemel, "Online Queries for Collaborative Filtering," Proc. Ninth Int'l Workshop Artificial Intelligence and Statistics, 2003.

[24] F.V. Jensen, Bayesian Networks and Decision Graphs. Statistics for Engineering and Information Science, Springer, 2001.

[25] D. Heckerman, J. Breese, and K. Rommelse, "Troubleshooting under Uncertainty," Technical Report MSR-TR-94-07, Microsoft Research, 1994.

[26] S. Tong, "Active Learning: Theory and Applications," PhD thesis, Stanford Univ., 2001.

[27] D. Lewis and J. Catlett, "Heterogeneous Uncertainty Sampling for Supervised Learning," Proc. 11th Int'l Conf. Machine Learning, pp. 148-156, 1994.

[28] T. Cover and J. Thomas, Elements of Information Theory. Wiley, 1991.

[29] G. Fishman, Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, 1996.

[30] B. Schölkopf and A.J. Smola, Learning with Kernels. MIT Press, 2002.

[31] J.L. Herlocker, J.A. Konstan, A. Borchers, and J. Riedl, "An Algorithmic Framework for Performing Collaborative Filtering," Proc. ACM Conf. Research and Development in Information Retrieval (SIGIR '99), pp. 230-237, 1999.

[32] K. Miyahara and M.J. Pazzani, "Collaborative Filtering with the Simple Bayesian Classifier," Proc. Sixth Pacific Rim Int'l Conf. Artificial Intelligence (PRICAI 2000), pp. 679-689, 2000.

[33] K. Yu, A. Schwaighofer, V. Tresp, W.-Y. Ma, and H. Zhang, "Collaborative Ensemble Learning: Combining Collaborative and Content-Based Information Filtering," Proc. 19th Conf. Uncertainty in Artificial Intelligence, pp. 616-623, 2003.

Kai Yu received the BS and MS degrees in 1998 and 2000, respectively, from Nanjing University, China. He is a PhD student in the Institute for Computer Science at the University of Munich. His research work is supported through a scholarship from Siemens Corporate Technology in Munich. He has been working in speech separation, noise reduction, information retrieval, and data mining. Currently, his research interests are mainly focused on statistical machine learning and its applications in data mining, information and image retrieval, and medical data analysis.



Anton Schwaighofer received the MSc degree in computer science from Graz University of Technology, Austria, in 2000. He is currently a PhD student at Graz University of Technology, in cooperation with Siemens Corporate Technology in Munich. He has been working in pattern recognition for biometric applications and medical diagnosis systems. His major research interests are kernel-based learning systems, in particular, Gaussian processes for large-scale regression problems, graphical models, and clustering methods.

Volker Tresp received the Diploma degree in physics from the University of Göttingen, Germany, in 1984 and the MSc and PhD degrees from Yale University, New Haven, Connecticut, in 1986 and 1989, respectively. He joined the Central Research and Development Unit of Siemens AG in 1989, where he currently is the head of a research team. In 1994, he was a visiting scientist at the Massachusetts Institute of Technology's Center for Biological and Computational Learning. His main interests include learning systems, in particular, neural networks and graphical models, and medical decision support systems. He has published papers on various topics, including the combination of rule-based knowledge and neural networks, the problem of missing data in neural networks, time-series modeling with missing and noisy data, committee machines, learning structure in graphical models, and kernel-based learning systems. He is coeditor of Neural Information Processing Systems 13.

Xiaowei Xu received the PhD degree from the University of Munich in 1998. In 2002, he joined the Department of Information Science, University of Arkansas at Little Rock, as an associate professor, coming from Corporate Technology, Siemens AG, Munich, Germany. His specialty is in data mining and database management systems. His recent research interests focus on text mining, database systems for biological and medical applications, and multimodal information retrieval. Dr. Xu has been an active member of the ACM.

Hans-Peter Kriegel received the MS and PhD degrees in 1973 and 1976, respectively, from the University of Karlsruhe, Germany. He is a full professor of database systems in the Institute for Computer Science at the University of Munich. His research interests are in spatial and multimedia database systems, particularly in query processing, performance issues, similarity search, dimensional indexing, and parallel systems. Data exploration using visualization led him to the area of knowledge discovery and data mining. Dr. Kriegel has been chairman and program committee member of many international database conferences. He has published more than 200 refereed conference and journal papers. In 1997, he received the internationally prestigious SIGMOD Best Paper Award for the publication and prototype implementation "Fast Parallel Similarity Search in Multimedia Databases," together with four members of his research team.

For more information on this or any computing topic, please visit our Digital Library at http://computer.org/publications/dlib.


