Source: cui.unige.ch/tcs/cours/algoweb/2005/articles/jin.pdf (retrieved 2006-01-12)

A Unified Approach to Personalization Based on Probabilistic Latent Semantic Models of Web Usage and Content

Xin Jin, Yanzan Zhou, Bamshad Mobasher
{xjin,yzhou,mobasher}@cs.depaul.edu

Center for Web Intelligence
School of Computer Science, Telecommunication, and Information Systems

DePaul University, Chicago, Illinois, USA

Abstract

Web usage mining techniques, such as clustering of user sessions, are often used to identify Web user access patterns. However, to understand the factors that lead to common navigational patterns, it is necessary to develop techniques that can automatically characterize users’ navigational tasks and intentions. Such a characterization must be based both on the common usage patterns and on the common semantic information associated with the visited Web resources. The integration of semantic content and usage patterns allows the system to make inferences based on the underlying reasons for which a user may or may not be interested in particular items. In this paper, we propose a unified framework based on Probabilistic Latent Semantic Analysis to create models of Web users, taking into account both the navigational usage data and the Web site content information. Our joint probabilistic model is based on a set of discovered latent factors that “explain” the underlying relationships among pageviews in terms of their common usage and their semantic relationships. Based on the discovered user models, we propose algorithms for characterizing Web user segments and for providing dynamic and personalized recommendations based on these segments. Our experiments, performed on real usage data, show that this approach can more accurately capture users’ access patterns and generate more effective recommendations than more traditional methods based on clustering.

1 Introduction

Web users exhibit different types of behavior depending on their information needs and their intended tasks. These behavior “types” are captured implicitly by the collection of actions taken by users during their visits to a site. The actions can range from viewing pages or buying products to interacting with online applications or Web services. For example, in an e-commerce site, there may be many user groups with different (but overlapping) behavior types. These groups may include visitors who engage in “window shopping” by browsing through a variety of product pages in different categories, visitors who are goal-oriented, showing interest in a specific product category, or visitors who tend to place items in their shopping cart but not purchase those items. Identifying these behavior types may, for example, allow a site to distinguish between those who show a high propensity to buy and those who don’t. This, in turn, can lead to automatic tools that tailor the content of pages for those users accordingly.

Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Web usage mining techniques (Srivastava et al. 2000), which capture usage patterns from users’ navigational data, have achieved great success in various application areas such as Web personalization and recommender systems (Mobasher, Cooley, & Srivastava 2000; Mobasher, Dai, & T. Luo 2002; Nasraoui et al. 2002; Pierrakos et al. 2003), link prediction and analysis (Sarukkai 2000), Web site evaluation or reorganization (Spiliopoulou 2000; Srikant & Yang 2001), and e-commerce data analysis (Ghani & Fano 2002; Kohavi et al. 2004). An important problem in Web usage mining is to identify the underlying user goals and functional needs that lead to common navigational activity.

Most current Web usage mining systems use different data mining techniques, such as clustering, association rule mining, and sequential pattern mining, to extract usage patterns from users’ navigational data. Generally, these usage patterns are standalone patterns at the pageview level. They do not, however, capture the intrinsic characteristics of Web users’ activities, nor can they quantify the underlying and unobservable factors that lead to specific navigational patterns. Thus, to better understand the factors that lead to common navigational patterns, it is necessary to develop techniques that can automatically characterize the users’ underlying navigational objectives and discover the hidden semantic relationships among users, as well as between users and Web objects. This, in part, requires new approaches that can seamlessly integrate different sources of knowledge, from both the usage and the semantic content of Web sites.

The integration of content information about Web objects with usage patterns involving those objects provides two primary advantages. First, the semantic information provides additional clues about the underlying reasons for which a user may or may not be interested in particular items. This, in turn, allows the system to make inferences based on this additional source of knowledge, possibly improving the quality of discovered patterns or the accuracy of recommendations. Secondly, in cases where little or no rating or usage information is available (such as in the case of newly added items, or in very sparse data sets), the system can still use the semantic information to draw reasonable conclusions about user interests.

Recent work (Mobasher et al. 2000; Anderson, Domingos, & Weld 2002; Dai & Mobasher 2002; Ghani & Fano 2002) has shown the benefits of integrating semantic knowledge about the domain (e.g., from page content features, relational structure, or domain ontologies) into the Web usage mining and personalization processes. There has also been a growing body of work on enhancing collaborative filtering systems by integrating data from other sources such as content and user demographics (Claypool et al. 1999; Pazzani 1999; Melville, Mooney, & Nagarajan 2001). Content-oriented approaches, in particular, can be used to address the “new item problem” discussed above. Generally, in these approaches, keywords are extracted from the content of Web pages and are used to recommend other pages or items to a user, not only based on user ratings or visit patterns, but also (or alternatively) based on the content similarity of these pages to other pages already visited by the user.

In most cases, however, these techniques involve independently learning user and content models, integrating them after the fact in the recommendation process. In this paper, we are interested in developing a unified model of usage and content which can seamlessly integrate these sources of knowledge during the mining process. We believe that such an approach is better able to capture the hidden semantic associations among Web objects and users, and thus results in patterns that more closely represent the true interests of users and the context of their navigational behavior.

Latent semantic analysis (LSA) based on singular value decomposition (SVD) can capture the latent or hidden semantic associations among co-occurring objects (Deerwester et al. 1990). It is mostly used in automatic indexing and information retrieval (Berry, Dumais, & O’Brien 1995), where LSA usually takes the (high-dimensional) vector space representation of documents based on term frequency as a starting point and applies a dimension-reducing linear projection, such as SVD, to generate a reduced latent space representation. LSA has been applied with remarkable success in different domains. Probabilistic Latent Semantic Analysis (PLSA) is a probabilistic variant of LSA which provides a more solid statistical foundation than standard LSA and has many applications in information retrieval and filtering, text learning, and related fields (Hofmann 1999; 2001; Brants, Chen, & Tsochantaridis 2002; Brants & Stolle 2002). Approaches based on PLSA have also been used in the context of co-citation analysis (Cohn & Chang 2000; Cohn & Hofmann 2001).

In this paper we propose a Web usage mining approach based on PLSA. We begin with Web navigational data and Web site content information, and use these two sources of knowledge to create a joint probabilistic model of users’ navigational activities. We then use the probabilistic model to discover and characterize Web user segments that capture both the common navigational activity of users and the content characteristics which lead to such behavior. Based on the discovered patterns, we propose a recommendation algorithm to provide dynamic content to an active user. The flexibility of this model allows for varying degrees to which content and usage information are taken into account. It can, therefore, be utilized for personalization even when there is inadequate semantic knowledge about Web objects or sparse historical usage information.

We have conducted experiments on usage and content data collected from two different Web sites. The results show that our approach can successfully distinguish between different types of Web user segments according to the types of tasks performed by these users or the interest they showed in semantic attributes of the visited objects. Our results also suggest that the proposed approach results in more effective personalized recommendations when compared to other model-based approaches such as those based on clustering.

The paper is organized as follows. In Section 2 we provide an overview of our unified Probabilistic Latent Semantic Model as applied to both Web usage data and Web content information. Our algorithms for discovering Web user segments based on the joint probabilistic model, and for generating recommendations, are described in Section 3. Finally, in Section 4 we provide some examples of the discovered patterns and present our experimental evaluation.

2 Probabilistic Latent Semantic Models of Web User Navigations

The overall process of Web usage mining consists of three phases: data preparation and transformation, pattern discovery, and pattern analysis. The data preparation phase transforms raw Web log data into transaction data that can be processed by various data mining tasks. In the pattern discovery phase, a variety of data mining techniques, such as clustering, association rule mining, and sequential pattern discovery, can be applied to the transaction data. The discovered patterns should then be analyzed and interpreted for use in various applications, such as personalization, or for further analysis.

[Figure 1: Example of a hypothetical pageview-attribute matrix]

The usage data preprocessing phase (Cooley, Mobasher, & Srivastava 1999) results in a set of n pageviews, P = {p_1, p_2, ..., p_n}, and a set of m user sessions, U = {u_1, u_2, ..., u_m}. A pageview is an aggregate representation of a collection of Web objects (e.g., pages) contributing to the display on a user’s browser resulting from a single user action (such as a click-through, product purchase, or database query). The Web session data can be conceptually viewed as a session-pageview matrix (also called the usage observation data), UP_{m×n}, where the entry UP_ij corresponds to a weight associated with the pageview p_j in the user session u_i. The weights can be binary (representing the existence or non-existence of a pageview within the session), based on the amount of time spent on a page, or based on user ratings (such as in a collaborative filtering application).
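As a concrete illustration, the session-pageview matrix UP can be assembled from preprocessed session data in a few lines of Python (the input format and function name here are hypothetical, not from the paper):

```python
def build_session_matrix(sessions, pageviews, weight="time"):
    """Build the session-pageview matrix UP (hypothetical input format).

    sessions: list of dicts mapping a pageview id to seconds spent on it.
    Returns an m x n list of lists; pass weight="binary" for 0/1 entries.
    """
    index = {p: j for j, p in enumerate(pageviews)}
    UP = [[0.0] * len(pageviews) for _ in sessions]
    for i, sess in enumerate(sessions):
        for p, secs in sess.items():
            UP[i][index[p]] = 1.0 if weight == "binary" else float(secs)
    return UP
```

Rating-based weights, as in a collaborative filtering application, would plug into the same structure.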

Another important source of information in the discovery of navigational patterns is the content of the Web site. Each pageview contains certain semantic knowledge represented by the content information associated with that pageview. By applying text mining and information retrieval techniques, we can represent each pageview as an attribute vector. Attributes may be the keywords extracted from the pageviews, or structured semantic attributes of the Web objects contained in the pageviews. For instance, in an e-commerce site there may be many pageviews associated with specific products or product categories. Each product page can be represented by the product attributes (product name, price, category, etc.). For example, suppose that a pageview A represents information about an HP laptop computer. This pageview may be represented as a vector (price=1200, brand=HP, sub-category=computer, ...). Similarly, a pageview B about a Kodak camera can be represented as (price=600, brand=Kodak, sub-category=camera, ...). Applying content preprocessing techniques (Mobasher et al. 2000) to the Web site content results in a set of s distinctive attribute values, A = {a_1, a_2, ..., a_s}, which comprise the content observation data. We can view these content observations as an attribute-pageview matrix AP_{s×n}, where the entry AP_tj means that pageview p_j contains the distinctive attribute value a_t. A portion of the attribute-pageview matrix for the above hypothetical example is depicted in Figure 1.
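For the hypothetical laptop/camera pages above, the attribute-pageview matrix AP could be assembled as follows (function name and input format are illustrative assumptions):

```python
def build_attribute_matrix(page_attrs, pageviews):
    """Build the attribute-pageview matrix AP from per-page attribute values.

    page_attrs maps a pageview id to its attribute-value pairs (hypothetical
    data). Rows correspond to the distinct attribute values a_1, ..., a_s;
    AP[t][j] = 1 iff pageview p_j contains attribute value a_t.
    """
    values = sorted({f"{k}={v}" for attrs in page_attrs.values()
                     for k, v in attrs.items()})
    row = {a: t for t, a in enumerate(values)}
    AP = [[0] * len(pageviews) for _ in values]
    for j, p in enumerate(pageviews):
        for k, v in page_attrs.get(p, {}).items():
            AP[row[f"{k}={v}"]][j] = 1
    return values, AP
```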

The PLSA model can be used to identify the hidden associations among variables in co-occurrence observation data. For the usage observations, a hidden (unobserved) factor variable z_k ∈ Z = {z_1, z_2, ..., z_l} is associated (with a certain probability) with each observation (u_i, p_j) corresponding to an access by user u_i to a Web resource p_j (i.e., an entry of the matrix UP). Similarly, the hidden factor z_k is also probabilistically associated with each observation (a_t, p_j) (an entry of the attribute-pageview matrix AP). Our goal is to discover this set of latent factors Z = {z_1, z_2, ..., z_l} from the usage and content observation data. The assumption behind this joint PLSA model is that the discovered latent factors “explain” the underlying relationships among pageviews both in terms of their common usage patterns and in terms of their semantic relationships. The degree to which such relationships are explained by each factor is captured by the derived conditional probabilities that associate the pageviews with each of the latent factors.

The probabilistic latent factor model can be described as the following generative model:

1. select a user session u_i from U with probability Pr(u_i);

2. select a latent factor z_k associated with u_i with probability Pr(z_k|u_i);

3. given the factor z_k, generate a pageview p_j from P with probability Pr(p_j|z_k).

As a result we obtain an observed pair (u_i, p_j), while the latent factor variable z_k is discarded. Translating this process into a joint probability model results in the following:

    Pr(u_i, p_j) = Pr(u_i) • Pr(p_j|u_i),

where

    Pr(p_j|u_i) = Σ_{z_k ∈ Z} Pr(p_j|z_k) • Pr(z_k|u_i).
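The three generative steps can be sketched as a small sampling routine (a toy sketch with illustrative container names, not the authors’ code):

```python
import random

def generate_observation(Pu, Pz_u, Pp_z, seed=0):
    """Sample one observed pair (u_i, p_j) via the three generative steps.

    Pu[i] = Pr(u_i), Pz_u[i][k] = Pr(z_k|u_i), Pp_z[k][j] = Pr(p_j|z_k);
    all names are illustrative assumptions.
    """
    rng = random.Random(seed)
    def draw(dist):  # sample an index from a discrete distribution
        r, acc = rng.random(), 0.0
        for idx, p in enumerate(dist):
            acc += p
            if r < acc:
                return idx
        return len(dist) - 1
    i = draw(Pu)          # 1. select a user session u_i with Pr(u_i)
    k = draw(Pz_u[i])     # 2. select a latent factor z_k with Pr(z_k|u_i)
    j = draw(Pp_z[k])     # 3. generate a pageview p_j with Pr(p_j|z_k)
    return i, j           # the latent z_k is discarded
```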

Furthermore, using Bayes’ rule, we can transform this probability into:

    Pr(u_i, p_j) = Σ_{z_k ∈ Z} Pr(z_k) • Pr(u_i|z_k) • Pr(p_j|z_k).

Similarly, each attribute-page observation (a_t, p_j) can be modeled as:

    Pr(a_t, p_j) = Σ_{z_k ∈ Z} Pr(z_k) • Pr(a_t|z_k) • Pr(p_j|z_k).

Combining these, the total likelihood L(U, A, P) of the two observation matrices is then:

    L(U, A, P) = α Σ_{i,j} UP_ij • log Pr(u_i, p_j) + (1 − α) Σ_{t,j} AP_tj • log Pr(a_t, p_j),

where α is the combination parameter, which is used to adjust the relative weights of the usage observations and the attribute observations.
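Assuming the parameter estimates are held as plain lists (Pz[k] = Pr(z_k), Pu_z[k][i] = Pr(u_i|z_k), Pa_z[k][t] = Pr(a_t|z_k), Pp_z[k][j] = Pr(p_j|z_k); all names are illustrative), the total likelihood can be computed directly from this definition:

```python
import math

def log_likelihood(UP, AP, Pz, Pu_z, Pa_z, Pp_z, alpha):
    """Total likelihood L(U, A, P) of the two observation matrices (sketch)."""
    K = len(Pz)
    def pr_up(i, j):  # Pr(u_i, p_j) = sum_k Pr(z_k)·Pr(u_i|z_k)·Pr(p_j|z_k)
        return sum(Pz[k] * Pu_z[k][i] * Pp_z[k][j] for k in range(K))
    def pr_ap(t, j):  # Pr(a_t, p_j) = sum_k Pr(z_k)·Pr(a_t|z_k)·Pr(p_j|z_k)
        return sum(Pz[k] * Pa_z[k][t] * Pp_z[k][j] for k in range(K))
    L = alpha * sum(UP[i][j] * math.log(pr_up(i, j))
                    for i in range(len(UP)) for j in range(len(UP[0]))
                    if UP[i][j] > 0)
    L += (1 - alpha) * sum(AP[t][j] * math.log(pr_ap(t, j))
                           for t in range(len(AP)) for j in range(len(AP[0]))
                           if AP[t][j] > 0)
    return L
```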


Thus, the process of generating a model that “explains” the observations (U, P) and (A, P) amounts to estimating the parameters Pr(z_k), Pr(u_i|z_k), Pr(a_t|z_k), and Pr(p_j|z_k) which maximize the overall likelihood L(U, A, P).

The Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin 1977) is a well-known approach to performing maximum likelihood parameter estimation in latent variable models. It alternates two steps: (1) an expectation (E) step, where posterior probabilities are computed for the latent variables z_k ∈ Z based on the current estimates of the parameters; and (2) a maximization (M) step, where the parameters are updated given the posterior probabilities computed in the previous E-step. The EM algorithm is guaranteed to reach a local optimum.

To apply the EM algorithm in this context, we begin with some initial values of Pr(z_k), Pr(u_i|z_k), Pr(a_t|z_k), and Pr(p_j|z_k). In the expectation step we compute:

    Pr(z_k|u_i, p_j) = [Pr(z_k) • Pr(u_i|z_k) • Pr(p_j|z_k)] / [Σ_{z ∈ Z} Pr(z) • Pr(u_i|z) • Pr(p_j|z)]

and, similarly,

    Pr(z_k|a_t, p_j) = [Pr(z_k) • Pr(a_t|z_k) • Pr(p_j|z_k)] / [Σ_{z ∈ Z} Pr(z) • Pr(a_t|z) • Pr(p_j|z)].

In the maximization step, we compute:

    Pr(z_k) ∝ α Σ_{i,j} UP_ij • Pr(z_k|u_i, p_j) + (1 − α) Σ_{t,j} AP_tj • Pr(z_k|a_t, p_j),

    Pr(u_i|z_k) = [Σ_{p_j ∈ P} UP_ij • Pr(z_k|u_i, p_j)] / [Σ_{u ∈ U, p_j ∈ P} UP_uj • Pr(z_k|u, p_j)],

    Pr(a_t|z_k) = [Σ_{p_j ∈ P} AP_tj • Pr(z_k|a_t, p_j)] / [Σ_{a ∈ A, p_j ∈ P} AP_aj • Pr(z_k|a, p_j)],

    Pr(p_j|z_k) ∝ α Σ_i UP_ij • Pr(z_k|u_i, p_j) + (1 − α) Σ_t AP_tj • Pr(z_k|a_t, p_j).

Iterating the expectation and maximization steps monotonically increases the total likelihood of the observed data, L(U, A, P), until a locally optimal solution is reached. Furthermore, note that varying the combination parameter α allows the model to take the usage-based and content-based relationships among pageviews into account in varying degrees, as may be appropriate for a particular Web site. For instance, in a content-rich Web site, α may be set to 0.5, equally taking into account relationships from both sources. On the other hand, if adequate content information is not available, α can be set to 1, in which case the model will be based solely on the usage observations.
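A single EM iteration for the joint model might look as follows in pure Python (a minimal dense-data sketch with hypothetical names, not the authors’ implementation; real data would use sparse structures):

```python
def em_step(UP, AP, Pz, Pu_z, Pa_z, Pp_z, alpha):
    """One EM iteration for the joint PLSA model (illustrative sketch).

    UP: m x n session-pageview weights; AP: s x n attribute-pageview weights.
    Pz[k], Pu_z[k][i], Pa_z[k][t], Pp_z[k][j] hold the current estimates of
    Pr(z_k), Pr(u_i|z_k), Pr(a_t|z_k), Pr(p_j|z_k).
    """
    K, m, s, n = len(Pz), len(UP), len(AP), len(UP[0])
    # E-step: posteriors Pr(z_k|u_i, p_j) and Pr(z_k|a_t, p_j) on nonzero entries
    post_u, post_a = {}, {}
    for i in range(m):
        for j in range(n):
            if UP[i][j] > 0:
                p = [Pz[k] * Pu_z[k][i] * Pp_z[k][j] for k in range(K)]
                z = sum(p)
                post_u[(i, j)] = [x / z for x in p]
    for t in range(s):
        for j in range(n):
            if AP[t][j] > 0:
                p = [Pz[k] * Pa_z[k][t] * Pp_z[k][j] for k in range(K)]
                z = sum(p)
                post_a[(t, j)] = [x / z for x in p]
    # M-step: re-estimate each parameter, then normalize to a distribution
    newPz = [alpha * sum(UP[i][j] * post_u[(i, j)][k] for (i, j) in post_u)
             + (1 - alpha) * sum(AP[t][j] * post_a[(t, j)][k] for (t, j) in post_a)
             for k in range(K)]
    tot = sum(newPz)
    newPz = [x / tot for x in newPz]
    newPu = [[sum(UP[i][j] * post_u[(i, j)][k] for j in range(n) if (i, j) in post_u)
              for i in range(m)] for k in range(K)]
    newPu = [[x / sum(row) for x in row] for row in newPu]
    newPa = [[sum(AP[t][j] * post_a[(t, j)][k] for j in range(n) if (t, j) in post_a)
              for t in range(s)] for k in range(K)]
    newPa = [[x / sum(row) for x in row] for row in newPa]
    newPp = [[alpha * sum(UP[i][j] * post_u[(i, j)][k] for i in range(m) if (i, j) in post_u)
              + (1 - alpha) * sum(AP[t][j] * post_a[(t, j)][k] for t in range(s) if (t, j) in post_a)
              for j in range(n)] for k in range(K)]
    newPp = [[x / sum(row) for x in row] for row in newPp]
    return newPz, newPu, newPa, newPp
```

Repeating `em_step` until the likelihood change falls below a tolerance yields a locally optimal parameter set.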

The computational complexity of this algorithm is O(mnl + snl), where m, n, s, and l represent the number of user sessions, pageviews, attribute values, and factors, respectively. Since both the usage observation matrix and the attribute matrix are very sparse, the memory requirement can be dramatically reduced using an efficient sparse matrix implementation.
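For example, a sparse representation that stores only the nonzero observation weights can be a simple dictionary (an illustrative sketch; production code would likely use a compressed sparse-row library):

```python
def to_sparse(M):
    """Keep only the nonzero entries of a weight matrix as {(row, col): value}."""
    return {(i, j): v for i, row in enumerate(M)
            for j, v in enumerate(row) if v != 0}
```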

3 A Recommendation Framework Based on the Joint PLSA Model

Applying the EM algorithm, as described in Section 2, results in estimates of Pr(z_k), Pr(u_i|z_k), Pr(a_t|z_k), and Pr(p_j|z_k), for each z_k ∈ Z, u_i ∈ U, a_t ∈ A, and p_j ∈ P. In this context, the hidden factors z_k ∈ Z correspond to users’ different task-oriented navigational patterns. In this section, we discuss how the generated probability estimates can be used for common Web usage mining tasks and applications. We first present an algorithm for creating aggregate representations that characterize typical Web user segments based on their common navigational behaviors and interests. These aggregate representations constitute the discovered user models. We then present a recommendation algorithm that combines the discovered models with the ongoing activity of a current user to provide dynamic recommendations.

3.1 Characterizing Web User Segments

We can use the probability estimates generated by the model to characterize user segments according to users’ navigational behavior. Each segment will be represented as a collection of pageviews which are visited by users who are performing a similar task. We take each latent factor z_k, generated by the model, as corresponding to one such user segment. To this end we will use Pr(u_i|z_k), which represents the probability of observing a user session u_i, given that a particular user segment is chosen.

A particular advantage of the probabilistic factor model, in contrast to probabilistic mixture models, is that a particular user session can be seen as belonging not to just one, but to a combination of segments represented by the latent factors. For instance, a user session may correspond (with different probabilities) to two different factors z_1 and z_2. This is important in the context of user navigational patterns, since a user may, indeed, perform different information-seeking or functional tasks during the same session. We describe our approach for characterizing user segments based on the latent factors below.

For each hidden factor z_k, the top user sessions with the highest Pr(u|z_k) values can be considered the “prototypical” user sessions corresponding to that factor. In other words, these user sessions represent the typical navigational activity of users who are performing a similar task. Thus, we can characterize task-oriented user segments based on the top sessions corresponding to each factor.

For each user segment z_k, we choose all the user sessions with probability Pr(u_i|z_k) exceeding a certain threshold μ. Since each user session u_i can also be viewed as a pageview vector in the original n-dimensional space of pageviews, we can create an aggregate representation of the collection of user sessions related to z_k, also as a pageview vector. The algorithm to generate this aggregate representation of user segments is as follows.

1. Input: Pr(u_i|z_k), the user session-pageview matrix UP, and a threshold μ.

2. For each z_k, choose all the sessions with Pr(u_i|z_k) ≥ μ to get a user session set R.

3. For each z_k, compute the weighted average of all the chosen sessions in R to get a pageview vector v:

       v = [Σ_{u_i ∈ R} u_i • Pr(u_i|z_k)] / |R|,

   where |R| denotes the total number of chosen sessions for the factor z_k.

4. For each factor z_k, output the pageview vector v. This vector consists of a weight for each pageview in P, representing the relative visit frequency of that pageview for this user segment.

We can sort the pageviews in v by weight so that the top elements correspond to the most frequently visited pages within the user segment. In this way, each user segment is characterized by an “aggregate” representation of the navigational activities of all the individual users in that group. In the following, by “user segments” we mean their aggregate representations as described above.
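Steps 2 and 3 of the algorithm, for a single factor z_k, can be sketched as follows (hypothetical names; Pu_z_k[i] stands for Pr(u_i|z_k)):

```python
def segment_vector(UP, Pu_z_k, mu):
    """Aggregate pageview vector for one user segment z_k (sketch).

    Sessions with Pr(u_i|z_k) >= mu form the set R; the segment vector is
    the Pr-weighted average of their rows in the session-pageview matrix UP.
    """
    R = [i for i, p in enumerate(Pu_z_k) if p >= mu]
    n = len(UP[0])
    return [sum(UP[i][j] * Pu_z_k[i] for i in R) / len(R) for j in range(n)]
```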

The characterization of Web user segments, by itself, can help analysts to understand the behavior of individual users or groups of users based on their navigational activity, as well as their interest in specific content information. However, the probabilistic latent factor model also provides the flexibility to perform a variety of other supervised or unsupervised analysis tasks.

For example, we can categorize the pages in a Web site according to the common usage patterns corresponding to different user segments. Specifically, given a Web page p, for each factor z we can compute

    Pr(z|p) = [Pr(p|z) • Pr(z)] / [Σ_{z′} Pr(p|z′) • Pr(z′)].

Then, we can select the dominant z with the highest Pr(z|p) as the class label for this page.

We can also use a similar approach to classify users, or to predict the likelihood that a user may visit a previously unvisited page. Specifically, given a user u_i, we can compute the probability of a previously unvisited page p_j being visited by u_i as:

    Pr(p_j|u_i) = Σ_z Pr(p_j|z) • Pr(z|u_i).
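Both computations follow directly from these formulas; the helpers below are an illustrative sketch (Pz[k] = Pr(z_k), Pp_z[k][j] = Pr(p_j|z_k), and Pz_u[k] = Pr(z_k|u_i) are assumed container layouts):

```python
def classify_page(j, Pz, Pp_z):
    """Class label for page p_j: the factor z with the highest Pr(z|p_j)."""
    scores = [Pp_z[k][j] * Pz[k] for k in range(len(Pz))]  # Pr(p|z)·Pr(z)
    total = sum(scores)
    post = [x / total for x in scores]                     # normalize over z'
    return max(range(len(Pz)), key=lambda k: post[k])

def predict_visit(j, Pz_u, Pp_z):
    """Pr(p_j|u_i) = sum_z Pr(p_j|z)·Pr(z|u_i), with Pz_u[k] = Pr(z_k|u_i)."""
    return sum(Pp_z[k][j] * Pz_u[k] for k in range(len(Pz_u)))
```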

In the following section we present an approach for Web personalization, based on the joint probabilistic user models generated using the latent factor model.

3.2 Using the Joint Probability Model for Personalization

Web personalization usually refers to dynamically tailoring the presentation of a Web site according to the interests or preferences of individual users or groups of users. This can be accomplished by recommending Web resources to a user based on the current user’s behavior together with learned models of past users (e.g., collaborative filtering). Here we will use the probabilistic user models generated via our joint PLSA framework to generate recommendations.

Given a set of user segments and an active user session, the method for generating top-N recommendations is as follows.

1. Represent each user segment C as an n-dimensional pageview vector using the approach described above, where n is the total number of pages. Thus, C = ⟨ω^C_1, ω^C_2, ..., ω^C_n⟩, where ω^C_i is the weight associated with pageview p_i in C. Similarly, the active user session S is represented as S = ⟨S_1, ..., S_n⟩, where S_i is set to 1 if pageview p_i is visited, and to 0 otherwise.

2. Choose the segment that best matches the active user session. Here we use the standard cosine coefficient to compute the similarity between the active user session and the discovered user segments:

       match(S, C) = [Σ_i (S_i × ω^C_i)] / √[Σ_i (S_i)^2 × Σ_i (ω^C_i)^2].

3. Given the top matching segment C_top and the active user session S, a recommendation score Rec(S, p) is computed for each page p ∈ C_top as:

       Rec(S, p) = √[weight(p, C_top) • match(S, C_top)].

   Thus, each page receives a normalized value between 0 and 1. If the page p is already in the current active session S, its recommendation value is set to zero.

4. Choose the top N pages with the highest recommendation values to get a top-N recommendation set.
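The four steps can be sketched as a single function (hypothetical names; a minimal sketch of the described procedure, assuming segment weights lie in [0, 1]):

```python
from math import sqrt

def recommend(S, segments, N):
    """Top-N page indices from the best-matching segment (sketch).

    S: binary active-session vector; segments: list of aggregate vectors C.
    """
    def match(S, C):  # cosine coefficient between session and segment
        num = sum(s * w for s, w in zip(S, C))
        den = sqrt(sum(s * s for s in S) * sum(w * w for w in C))
        return num / den if den else 0.0
    C_top = max(segments, key=lambda C: match(S, C))
    m = match(S, C_top)
    # Rec(S, p) = sqrt(weight(p, C_top) * match(S, C_top)); 0 for visited pages
    scores = [0.0 if S[j] else sqrt(C_top[j] * m) for j in range(len(S))]
    ranked = sorted(range(len(S)), key=lambda j: scores[j], reverse=True)
    return ranked[:N]
```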

The above approach for generating recommendations is not unique to the PLSA model. The traditional approach for discovering user segments is based on clustering of user records. In (Mobasher, Dai, & T. Luo 2002), k-means clustering was used to partition user sessions into clusters. The centroid of each cluster was then obtained as an n-dimensional pageview vector, in a similar manner as described above. The cluster centroids thus provide an aggregate representation of common user patterns corresponding to each segment. This process was called Profile Aggregation Based on Clustering Transactions (PACT). The discovered user segments were then used to generate recommendations using the algorithm described above. In our experiments, discussed below, we use PACT as a point of comparison for the effectiveness of generated recommendations.

4 Experimental Evaluation

To evaluate the effectiveness of the PLSA-based model, we perform two types of evaluation using two different data sets. First, we evaluate individual user segments to determine the degree to which they actually represent the activities of similar users. Secondly, we evaluate our recommendation algorithm, based on the user segments, in the context of a top-N recommendation framework. In each case, we compare our approach with the clustering approach for the discovery of Web user segments (PACT) (Mobasher, Dai, & T. Luo 2002), as described above.

4.1 Description of the Data Sets

In our experiments, we have used Web server log data from two Web sites. The first data set is based on the server log data from the host Computer Science department. After data preprocessing, we identified 21,299 user sessions (U) and 692 Web pageviews (P), where each user session consists of 9.8 pageviews on average. We refer to this data set as the “CTI data.” In this data set we used the time spent on each pageview as the weight associated with that pageview in the given session. Since most of the Web pages are dynamically generated, we do not use any content information from this site. Hence, this data set is used to evaluate our approach when only usage information is available for analysis.

The second data set is from the server logs of a local affiliate of a national real estate company. The primary function of the Web site is to allow prospective buyers to visit various pages and information related to some 300 residential properties. The portion of the Web usage data during the period of analysis contained approximately 24,000 user sessions from 3,800 unique users. The preprocessing phase for this data was focused on extracting a full record, for each user, of the properties they visited. This required performing the necessary aggregation operations on pageviews in order to treat a property as the atomic unit of analysis. In addition, the visit frequency for each user-property pair was recorded, since the number of times a user comes back to a property listing is a good measure of that user's interest in the property. Finally, the data was filtered to limit the final data set to those users who had visited at least 3 properties. On average, each user visited 5.6 properties. In our final data matrix, each row represented

a user vector with properties as dimensions and visit frequencies as the corresponding dimension values. We refer to this data set as the "Realty data."
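
This preprocessing step can be sketched as follows. The function name and data shapes are our own illustration, not code from the paper; it builds the user-property visit-frequency matrix and applies the minimum-visit filter described above:

```python
from collections import Counter, defaultdict

def build_user_property_matrix(visits, min_props=3):
    """visits: iterable of (user_id, property_id) pairs, one per visit.
    Returns {user_id: {property_id: visit_count}}, keeping only users
    who visited at least `min_props` distinct properties."""
    counts = defaultdict(Counter)
    for user, prop in visits:
        counts[user][prop] += 1  # visit frequency as the cell value
    return {u: dict(c) for u, c in counts.items() if len(c) >= min_props}
```

The sparse dictionary form can then be densified into the row-per-user matrix used for model building.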

For the "Realty data," in addition to the usage observations, we also extracted the content information related to the properties. Each property has a set of attributes, including price, number of bedrooms, number of bathrooms, size, garage size (cars), and school district. After content preprocessing, we built an attribute-page matrix (similar to Figure 1) to represent the content observations. In this matrix, each column represents a property, and each row is a distinct attribute value associated with the properties.
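
A minimal sketch of this attribute-page matrix construction (function and variable names are illustrative, not from the paper): each row corresponds to a distinct (attribute, value) pair, each column to a property, with a 1 where the property carries that attribute value:

```python
def attribute_property_matrix(properties):
    """properties: {prop_id: {attr: value}}, e.g.
    {"p1": {"price": "low", "school": "ANK"}}.
    Returns (rows, cols, matrix) where rows are the distinct
    (attribute, value) pairs, cols are the property ids, and
    matrix[i][j] = 1 iff property cols[j] has attribute-value rows[i]."""
    cols = sorted(properties)
    rows = sorted({(a, v) for attrs in properties.values()
                   for a, v in attrs.items()})
    matrix = [[1 if properties[c].get(a) == v else 0 for c in cols]
              for a, v in rows]
    return rows, cols, matrix
```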

Each data set (the usage observations) was randomly divided into multiple training and test sets for 10-fold cross-validation. The training sets were used to build the models, while the test sets were used to evaluate the user segments and the recommendations generated by the models. In our experiments, the results in Sections 4.3 and 4.4 represent averages over the 10 folds.
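
As an illustration of this protocol (the paper does not specify its splitting procedure; the function below is a hypothetical sketch), sessions can be shuffled once and partitioned into ten folds, with each fold serving in turn as the test set:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle n session indices and split them into k roughly equal
    folds; fold i is the test set and the remaining folds the training
    set in round i."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```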

By conducting some sensitivity analysis, we chose 30 factors in the case of the CTI data, and 15 factors for the Realty data. Furthermore, we employ a "tempered" version of the EM algorithm, as described in (Hofmann 2001), to avoid "overtraining."

4.2 Examples of Extracted Latent Factors

We begin by providing an example of the latent factors generated using the joint PLSA model for the Realty data. Figure 2 depicts six of the 15 factors extracted from this data set. The factors were generated using a combination of usage and content observations associated with the real estate properties. For each of the factors, the most significant attributes (based on the probability of their association with that factor) are given, along with descriptions of the attributes.

Factors 1 and 2 in this figure clearly indicate the interest of users in properties of similar price and size. The model, however, distinguishes the two groups based on other attribute information. In particular, factor 1 represents interest in "town homes" located in the ANK school district, while factor 2 represents 2-story family units in the WDM school district. Factors 3 and 4 both represent larger properties in a higher price range, but again, they are distinguished based on users' interests in different school districts. Finally, factors 5 and 6 both represent much lower priced properties of the same type. However, the relationship between these two factors is more nuanced than in the other cases above. Here, factor 6 represents a special case of factor 5, in which users have focused particularly on the DSM school district. Indeed, our experiments show that the joint PLSA model can capture overlapping interests of a similar group of users in different items at various levels of abstraction.


Figure 2: Examples of extracted factors from the Realty data

4.3 Evaluation of Discovered User Segments

In order to evaluate the quality of individual user segments, we use a metric called the Weighted Average Visit Percentage (WAVP) (Mobasher, Dai, & Luo 2002). WAVP allows us to evaluate each segment individually, according to the likelihood that a user who visits any page in the (aggregate representation of the) segment will visit the rest of the pages in that segment during the same session. Specifically, let T be the set of transactions in the evaluation set and, for a segment S, let T_S denote the subset of T whose elements contain at least one page from S. Now, taking both the transactions t and the segment S as vectors of pageviews, the weighted average similarity to the segment S over all transactions is computed as:

WAVP = ( ∑_{t ∈ T_S} (t • S) / |T_S| ) / ( ∑_{p ∈ S} weight(p, S) )

Figure 3: Comparison of user segments in the CTI site; PLSA model vs. k-means clustering (PACT). (The p-value over the average WAVP for all segments is < 0.05, at the 95% confidence level.)

Figure 4: Comparison of user segments in the real estate site; PLSA model vs. k-means clustering (PACT). (The p-value over the average WAVP for all segments is < 0.05, at the 95% confidence level.)

Note that a higher WAVP value implies better quality of a segment, in the sense that the segment represents the actual behavior of users based on their similar activities.
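
The WAVP metric defined above can be computed directly; the sketch below uses dictionaries as sparse pageview vectors (function name and data shapes are our own illustration):

```python
def wavp(segment, transactions):
    """segment: {page: weight} (the aggregate segment vector S);
    transactions: list of {page: weight} vectors (the evaluation set T).
    Computes the average dot product of each transaction in T_S (those
    sharing at least one page with S) with S, divided by the total
    segment weight."""
    # T_S: transactions containing at least one page from the segment
    ts = [t for t in transactions if any(p in t for p in segment)]
    if not ts:
        return 0.0
    avg_sim = sum(sum(t.get(p, 0) * w for p, w in segment.items())
                  for t in ts) / len(ts)
    return avg_sim / sum(segment.values())
```

For example, a segment {a: 1, b: 1} evaluated against transactions {a: 1, b: 1}, {a: 1}, and {c: 1} yields (2 + 1) / 2 similarity over T_S, normalized by segment weight 2, i.e. 0.75.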

In these experiments, we compare the WAVP values for the segments generated using the PLSA model with those generated by the PACT model. Figures 3 and 4 depict these results for the CTI and Realty data sets, respectively. In each case, the segments are ranked in decreasing order of WAVP. The results show clearly that the probabilistic segments based on the latent factors provide a significant advantage over the clustering approach (p-value < 0.05, at the 95% confidence level).


Figure 5: Comparison of generated page recommendations based on PLSA segments versus PACT segments in the CTI site. (For N ≥ 4, p-value < 0.05, at the 95% confidence level.)

4.4 Evaluation of the Recommendation Algorithm

For evaluating recommendation effectiveness, we use a measure called Hit Ratio in the context of top-N recommendation. For each user session in the test set, we took the first K pages as a representation of an active user and generated a top-N recommendation set. We then compared the recommendations with pageview K+1 in the test session. If there is a match, this is considered a hit. We define the Hit Ratio as the total number of hits divided by the total number of user sessions in the test set. Note that the Hit Ratio increases as the value of N (the number of recommendations) increases. Thus, in our experiments, we pay special attention to smaller numbers of recommendations that result in good hit ratios.
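
The evaluation loop just described can be sketched as follows; `recommend` stands in for any recommender (PLSA- or PACT-based), and the function name and signature are our own illustration:

```python
def hit_ratio(sessions, recommend, k=3, n=10):
    """sessions: list of page sequences from the test set;
    recommend(prefix, n): returns a list of n recommended pages for an
    active session prefix.  A hit occurs when page K+1 of a session
    appears in the top-N recommendations built from its first K pages."""
    eligible = [s for s in sessions if len(s) > k]
    hits = sum(1 for s in eligible if s[k] in recommend(s[:k], n))
    return hits / len(eligible) if eligible else 0.0
```

Sweeping n from 1 upward reproduces the hit-ratio-versus-N curves compared in Figures 5 and 6.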

In these experiments, we compared the recommendation accuracy of the PLSA-based algorithm with that of PACT. In each case, the recommendations were generated according to the algorithms presented in Section 3.2. The recommendation accuracy was measured based on the hit ratio for different numbers of generated recommendations. These results are depicted in Figures 5 and 6 for the CTI and Realty data sets, respectively. In the case of the CTI data, we used α = 1, taking into account only the usage observations. In the case of the Realty data, however, we compared the results of PACT recommendations with the PLSA-based recommendations both with α = 1 (usage only) and with α = 0.5 (equally weighted content and usage).

The results show a general advantage for the PLSA model. In most realistic situations, we are interested in a small, but accurate, set of recommendations. Generally, a reasonable recommendation set might contain 5 to 10 recommendations. (At these levels, the difference in performance between the PLSA model and the clustering approach is statistically significant.) Indeed, this range of values seems to represent the largest improvements of the PLSA model over the clustering approach. In the case of the Realty data, the combined usage-content model provides a small gain in accuracy over the usage-only model (particularly at lower numbers of recommendations). The more significant advantage of the combined content-usage model, however, is its ability to generate recommendations in the face of sparse or insufficient usage data, as well as to provide a better semantic characterization of user segments.

Figure 6: Comparison of generated property recommendations based on PLSA segments versus PACT segments in the real estate site. (For N ∈ [4, 10], p-value < 0.05, at the 95% confidence level.)

5 Conclusions

Users of a Web site exhibit different types of navigational behavior depending on their intended tasks or their information needs. However, to understand the factors that lead to common navigational patterns, it is necessary to develop techniques that can automatically characterize users' tasks and intentions, based both on their common navigational behavior and on semantic information associated with the visited resources. In this paper, we have used a joint probabilistic latent semantic analysis framework to develop a unified model of Web user behavior based on the usage and content data associated with the site. The probabilistic model provides a great deal of flexibility, as the derived probability distributions over the space of latent factors can be used for a variety of Web mining and analysis tasks. In particular, we have presented algorithms based on the joint PLSA model to discover and characterize Web user segments and to provide dynamic and personalized recommendations based on these segments.

Our experimental results show clearly that, in addition to greater flexibility, the PLSA approach to Web usage mining generally results in a more accurate representation of user behavior. This, in turn, results in higher-quality patterns that can be used effectively in Web recommendation.

References

Anderson, C.; Domingos, P.; and Weld, D. 2002. Relational Markov models and their application to adaptive web navigation. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002).

Berry, M.; Dumais, S.; and O'Brien, G. 1995. Using linear algebra for intelligent information retrieval. SIAM Review 37:573–595.

Brants, T., and Stolle, R. 2002. Finding similar documents in document collections. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC-2002).

Brants, T.; Chen, F.; and Tsochantaridis, I. 2002. Topic-based document segmentation with probabilistic latent semantic analysis. In Proceedings of the Eleventh International Conference on Information and Knowledge Management.

Claypool, M.; Gokhale, A.; Miranda, T.; Murnikov, P.; Netes, D.; and Sartin, M. 1999. Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems: Algorithms and Evaluation.

Cohn, D., and Chang, H. 2000. Probabilistically identifying authoritative documents. In Proceedings of the Seventeenth International Conference on Machine Learning.

Cohn, D., and Hofmann, T. 2001. The missing link: A probabilistic model of document content and hypertext connectivity. In Leen, T. K.; Dietterich, T. G.; and Tresp, V., eds., Advances in Neural Information Processing Systems 13. MIT Press.

Cooley, R.; Mobasher, B.; and Srivastava, J. 1999. Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems 1(1).

Dai, H., and Mobasher, B. 2002. Using ontologies to discover domain-level web usage profiles. In Proceedings of the 2nd Semantic Web Mining Workshop at ECML/PKDD 2002.

Deerwester, S.; Dumais, S.; Furnas, G.; Landauer, T.; and Harshman, R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6).

Dempster, A.; Laird, N.; and Rubin, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B(39):1–38.

Ghani, R., and Fano, A. 2002. Building recommender systems using a knowledge base of product semantics. In Proceedings of the Workshop on Recommendation and Personalization in E-Commerce, at the 2nd Int'l Conf. on Adaptive Hypermedia and Adaptive Web Based Systems.

Hofmann, T. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval.

Hofmann, T. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal 42(1):177–196.

Kohavi, R.; Mason, L.; Parekh, R.; and Zheng, Z. 2004. Lessons and challenges from mining retail e-commerce data. To appear in Machine Learning.

Melville, P.; Mooney, R.; and Nagarajan, R. 2001. Content-boosted collaborative filtering. In Proceedings of the SIGIR 2001 Workshop on Recommender Systems.

Mobasher, B.; Dai, H.; Luo, T.; Sun, Y.; and Zhu, J. 2000. Integrating web usage and content mining for more effective personalization. In E-Commerce and Web Technologies: Proceedings of the EC-WEB 2000 Conference, Lecture Notes in Computer Science (LNCS) 1875, 165–176. Springer.

Mobasher, B.; Cooley, R.; and Srivastava, J. 2000. Automatic personalization based on web usage mining. Communications of the ACM 43(8):142–151.

Mobasher, B.; Dai, H.; Luo, T.; and Nakagawa, M. 2002. Discovery and evaluation of aggregate usage profiles for web personalization. Data Mining and Knowledge Discovery 6:61–82.

Nasraoui, O.; Krishnapuram, R.; Joshi, A.; and Kamdar, T. 2002. Automatic web user profiling and personalization using robust fuzzy relational clustering. In Segovia, J.; Szczepaniak, P.; and Niedzwiedzinski, M., eds., Studies in Fuzziness and Soft Computing. Springer-Verlag.

Pazzani, M. 1999. A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review 13(5-6):393–408.

Pierrakos, D.; Paliouras, G.; Papatheodorou, C.; and Spyropoulos, C. 2003. Web usage mining as a tool for personalization: A survey. User Modeling and User-Adapted Interaction 13:311–372.

Sarukkai, R. 2000. Link prediction and path analysis using Markov chains. In Proceedings of the 9th International World Wide Web Conference.

Spiliopoulou, M. 2000. Web usage mining for web site evaluation. Communications of the ACM 43(8):127–134.

Srikant, R., and Yang, Y. 2001. Mining web logs to improve website organization. In Proceedings of the 10th International World Wide Web Conference.

Srivastava, J.; Cooley, R.; Deshpande, M.; and Tan, P. 2000. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1(2):12–23.

