A prototype WWW literature recommendation system for digital libraries

San-Yih Hwang, Wen-Chiang Hsiung and Wan-Shiou Yang

The authors

San-Yih Hwang, Wen-Chiang Hsiung and Wan-Shiou Yang work at the Department of Information Management, National Sun Yat-sen University, Taiwan.

Keywords

Digital libraries, Recommendations, Web sites, Cluster analysis, Data mining

Abstract

This article describes a service for providing literature recommendations, which is part of a networked digital library project whose principal goal is to develop technologies for supporting digital services. The proposed literature recommendation system makes use of the Web usage logs of a literature digital library. The recommendation framework consists of three sequential steps: data preparation of the Web usage log, discovery of article associations, and article recommendations. We discuss several design alternatives for conducting these steps. These alternatives are evaluated using the Web logs of our university's electronic thesis and dissertation (ETD) system. The proposed literature recommendation system has been incorporated into our university's ETD system, and is currently operational.

Electronic access

The Emerald Research Register for this journal is available at http://www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at http://www.emeraldinsight.com/1468-4527.htm

Refereed article received 26 February 2003. Accepted for publication 4 April 2003.

This work is supported in part by the National Science Council of Taiwan under grant no. NSC91-2413-H-110-004-003 and the National Central Library of Taiwan under grant no. 91-A-015.

Online Information Review, Volume 27, Number 3, 2003, pp. 169-182. © MCB UP Limited. ISSN 1468-4527. DOI 10.1108/14684520310481436

Introduction

This is a fascinating period in the history of libraries and publishing. For the first time - with the advances in computing and networking techniques - it is possible to build large-scale services where collections of information are stored in digital formats and accessed over networks (Arms, 2000). Libraries are making good use of digital capacity to provide services such as information seeking and filtering (Furner, 2002; Spink et al., 2002), organising (Arms, 2000), and delivering (Kessler, 1996). Systems intended to provide such digital services are appearing accordingly and are being investigated in digital library environments (Andresen et al., 1995).

This paper reports on a networked digital library project at National Sun Yat-sen University in Taiwan. The project, whose principal goal is to develop technologies for supporting digital services, is a series of three investigations, sponsored by the National Science Council and the National Central Library. The first stage involved the design and construction of a literature recommendation system. The second investigation focuses on the integration of various information sources, and the third investigation addresses the representation and retrieval of multimedia content. The progress of the first investigation is reported in this paper.

The literature recommendation system aims at recommending relevant articles to researchers and library patrons. The system adopted a WWW framework so that subscribers can access the system without time and location constraints, and so that the task of service spreading can be facilitated by a common browsing interface. The core of the literature recommendation system is a recommender mechanism, which analyses literature usage so that publications can be ranked according to the preferences of an active user. Various characteristics of publications and WWW interactions are taken into account, and the endeavour has
resulted in a recommendation system that is particularly suitable for recommending literature in digital library environments.

Related work

Interest in digital libraries has increased tremendously, with several research projects addressing the wealth of challenges in this field. For example, a University of Illinois project has focused on providing integrated access to diverse and distributed collections of scientific literature (Chen et al., 1996). That project deals with heterogeneous interfaces to multiple indices, semantic federation across repositories, and other related issues. A group at the University of California at Berkeley is working on providing work-centred digital information services (Wilensky, 1996). The issues involved include document image analysis, natural language analysis, and computer vision analysis for effective information extraction. Carnegie Mellon University intends to build a large online digital video library featuring full-content and knowledge-based searching and retrieval. The University of California at Santa Barbara has concentrated on geographical information systems, and a Stanford University project addresses the problem of interoperability, using CORBA to implement information-access and payment protocols (Baldonado et al., 1997).

The focus of our research reported here is to tackle the problem of information overload. Proposed solutions to this emphasise the need for specialisation in information retrieval services to help people effectively locate information that meets their individual needs (Bowman et al., 1994). Interest in recommending has increased in the information technology community, and especially in the design of digital libraries (Furner, 2002). The research reported here concentrates on literature recommendations.

The past few years have seen the emergence of many recommendation systems intended to provide personal recommendations for various types of products and services, including:
- news and e-mail messages (see www.netperceptions.com for a commercial site and Goldberg et al. (1992), Lang (1995), Konstan et al. (1997) and Billsus and Pazzani (1999) for research prototypes);
- Web pages (see http://my.yahoo.com for a commercial site and Balabanović and Shoham (1997), Terveen et al. (1997), Pazzani and Billsus (1997) and Armstrong et al. (1997) for research prototypes);
- books (see http://www.amazon.com for a commercial site and Mooney and Roy (2000) for a research prototype);
- music (see http://www.CDNow.com for a commercial site and Shardanand and Maes (1995) for a research prototype); and
- movies (see http://movies.eonline.com for a commercial site and Alspector et al. (1998), Breese et al. (1998), Basu et al. (1998), Ansari et al. (2000), Pennock et al. (2000) and Schafer et al. (2001) for research prototypes).

The first type of recommendation technique was called the content-based approach (Loeb and Terry, 1992). A content-based approach characterises recommendable items by a set of content features and represents a user's interests by a similar feature set. Then the relevance of a given content item to the user's interest profile is measured as the similarity of this recommendable item to the user's interest profile. Content-based approaches select recommendable items that have a high degree of similarity to the user's interest profile.

Another type of recommendation technique, the collaborative approach (sometimes called the social-based approach), takes into account the given user's interest profile and the profiles of other users with similar interests (Shardanand and Maes, 1995). The collaborative approach looks for relevance among users by observing their ratings assigned to products in a training set of limited size. The "nearest-neighbour" users are those that exhibit the strongest similarity to the target user. These users then act as "recommendation partners" for the target user, and collaborative approaches recommend to the target user items that appear in the profiles of these recommendation partners (but not in the target user's profile). It has been observed in several practical settings that the collaborative approach generally achieves more effective recommendations than its content-based counterpart (Alspector et al., 1998; Breese et al., 1998; Mooney and Roy, 2000; Pazzani, 1999).

Pennock et al. (2000) proposed using a collaborative approach for recommending articles in CiteSeer (www.citeseer.com). Their approach implicitly derives users' ratings of articles by observing their actions when viewing an article. Each action is assigned a weight. For example, adding a document to a profile produces a two-point increment, downloading a document a one-point increment, and ignoring a recommendation a one-point decrement in the rating score. However, this approach suffers from the shortage of negative examples, and the method is applicable only to individual members who are willing to identify themselves each time they use the digital library.

We consider that traditional recommendation techniques are not suitable for recommending articles in digital libraries. First, both content-based and collaborative approaches require that users' rating scores on selected items (including both positive and negative instances) are available for analysis. For a typical literature digital library, requiring users to rate some articles before making a recommendation is not realistic. Second, identifying an individual user of a literature digital library is generally not possible, since many literature digital libraries are freely available on the Internet and users can search or browse articles without having to identify themselves. Even for proprietary literature digital libraries, many users gain access via site subscriptions, making it difficult to track an individual's (long-term) browsing behaviour.

For the reasons mentioned above, we propose making use of a task-focused approach (Herlocker and Konstan, 2001). In this approach, a task profile (a set of recently accessed items), rather than the long-term interest profile, is used to make recommendations. One notable implementation of this approach is Web usage mining, which aims to identify interesting usage patterns of a Web-based system from the Web usage logs that record interactions between users and Web pages (Srivastava et al., 2000). Recently, several approaches have been proposed for recommending Web pages based on the Web page associations discovered by Web-usage mining algorithms (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001). While these approaches vary in their details, they follow the same recommendation framework, which starts with the identification of aggregate usage profiles of Web pages by some data mining method. They then make recommendations by looking into the similarity between the set of recently accessed Web pages of an active user and the collected aggregate usage profiles.

Obviously, literature digital libraries store articles rather than Web pages, and articles differ from Web pages in several respects:
- Web pages are more diversified: some serve as index pages, some are content pages, and others have a mixture of indexes and content. On the other hand, since literature articles are more homogeneous in structure, they are more likely to have the same set of metadata features.
- A Web site can be viewed as a directed graph whose vertices are Web pages, while a literature digital library is better visualised as a set of articles.
- Literature articles are often retrieved by search queries provided by the system, while Web pages are often browsed through a static site topology.
- Literature articles are incrementally inserted into the digital library at a faster rate than are Web pages inserted into a Web site.

Contributions

The above considerations indicate that literature recommendation services require a different technical approach. Here we describe a recommendation framework for recommending articles in a literature digital library. Several alternatives are proposed for implementing the constituent components of the recommendation framework. These alternatives are compared and analysed by applying the Web usage logs collected from the electronic thesis and dissertation (ETD) system at National Sun Yat-sen University (NSYSU-ETD). We have incorporated the proposed recommendation framework into NSYSU-ETD (www.lib.nsysu.edu.tw/eThesys/english/default_e.htm); the corresponding implementation status is also reported.

This paper is structured as follows. Next, the overall architecture is described, followed by detailed design and construction methods. Then implementation experience and evaluation results are presented. The final part summarises this paper and discusses our future research directions.

Architecture

The overall architecture of the literature recommendation system, shown as Figure 1, consists of two basic subsystems: offline and online.
(1) The offline subsystem comprises two sequentially executed tasks: data preparation and log mining. Although Web usage logs are potentially able to provide useful knowledge for making recommendations, the raw log data cannot be used before appropriate pre-processing. Therefore, we first convert raw Web usage logs into a set of user transactions before performing the log mining task. The objectives of log mining tasks include the discovery of article association rules and the derivation of article clusters.
(2) The online subsystem interacts with an active user and provides recommended articles in real time. It keeps track of a set of articles browsed recently by the active user by consulting the current Web usage log provided by the Web server. Then, by comparing the similarities between this set and the article clusters (or associations) produced by the offline subsystem, the online subsystem recommends articles in the clusters (associations) that are highly similar.

We have incorporated the literature recommendation system into NSYSU-ETD. Figure 2 depicts a page view of an article in NSYSU-ETD. The Web page comprises two frames: the left frame displays the metadata of the browsed article, and the right frame shows the unseen articles (up to 15) recommended by our literature recommendation system, displayed in order of relevance. In this example, the active user has reviewed two articles and the literature recommendation system has recommended another seven articles. Once the active user browses another article, the content of the recommendation frame is updated accordingly.

Approaches

In this section, we first describe the tasks conducted in the offline subsystem of the literature recommendation system, and then those conducted in the online recommendation subsystem.

Figure 1 Architecture of a literature recommendation system

Data preparation for the literature usage log

To prepare data from the Web usage logs of a literature digital library, we basically follow the heuristics adopted by Cooley et al. (1999) for processing Web usage logs that involve static Web pages. In their work, the Web usage logs are assumed to be in the extended NCSA format (including referrer and agent fields). The approach contains three sequential steps: data cleansing, user session identification, and transaction identification. The objective of data cleansing is to prune out unwanted Web log records and to add back missing Web log records: some Web log records are surplus, as they are accesses to non-HTML pages (e.g. images and other HTTP requests involving no Web page accesses), while other Web log records are missing due to the existence of the local cache, firewalls, and proxy servers. Identifying missing Web log records is especially difficult - several heuristics have been proposed for achieving this. However, we found this difficulty nonexistent when processing the Web usage log of NSYSU-ETD, because article Web pages are dynamically generated and are not cacheable. Our university ETD system is database driven, in that the theses' metadata are stored in a DBMS. Most large-scale digital libraries adopt the same method for maintaining their collections. In the context of a literature digital library, we are concerned with, and retain, only the Web usage records that involve the following two types of accesses:
(1) Lookup accesses. Each lookup access is an execution of a CGI program with searching or browsing conditions specified in the parameters. An example lookup access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/search_by_advisor?advisor_name=San-Yih+Hwang, which lists all theses supervised by Professor San-Yih Hwang.
(2) Article accesses. Each article access is an execution of a CGI program that displays the detailed metadata on an article. An example article access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0726100-135739, which shows the metadata of the thesis whose URN is etd-0726100-135739.
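The distinction between the two access types can be illustrated with a small sketch. It is not the authors' implementation: the function name `classify_access`, the constants, and the rule that any CGI program whose name starts with `search` is a lookup access are assumptions made for illustration, modelled only on the two example URLs quoted in the list.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical program names, inferred from the two example URLs only.
ARTICLE_PROGRAM = "view_etd"
LOOKUP_PREFIX = "search"


def classify_access(url):
    """Return ('article', urn), ('lookup', params) or ('other', None)."""
    parsed = urlparse(url)
    # The CGI program is the last component of the request path.
    program = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    params = parse_qs(parsed.query)
    if program == ARTICLE_PROGRAM and "URN" in params:
        return ("article", params["URN"][0])
    if program.startswith(LOOKUP_PREFIX):
        # Keep the first value of each search condition.
        return ("lookup", {key: values[0] for key, values in params.items()})
    # Images and any other request are pruned during data cleansing.
    return ("other", None)
```

Records classified as `other` would be dropped, so that only lookup and article accesses reach the later steps.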

Note that article accesses display the detailed metadata of articles and are therefore of primary interest. Lookup accesses provide lookup information to facilitate browsing or searching, and can thus be considered auxiliary. Most of the time, a user will first execute a lookup access, followed by a selective list of article accesses. We found that some user sessions contained article accesses without prior lookup accesses in the Web usage log of NSYSU-ETD. This is because several information sources had provided hyperlinks directly to articles' page views. In this case, each session can be viewed as a list of queries. Each query (optionally) starts with a lookup access, followed by a list of article accesses related to the articles that the user chose to look at in more detail.

Figure 2 Page view of an article in NSYSU-ETD

The goal of user session identification is to divide the article accesses of each user into individual sessions. It is reasonable to assume that two records with different IP addresses, browsers, or operating systems belong to two different user sessions. In addition, the time interval between two consecutive requests in a user session should not be too large. As in many commercial products, we use 30 minutes as the default timeout period. When the time interval between the current access and the previous one exceeds this, a new user session is assumed to have started. Some of the identified user sessions are made by Internet robots and hence should not be considered. Some robots have known agent types and/or IP addresses and can be easily identified. Analysis of the user sessions of these known robots revealed that most of these sessions either have more than 100 article accesses or exhibit a mean adjacent Web page access interval of less than three seconds. User sessions that satisfy this condition are considered as robot sessions and consequently are removed.
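The session rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: the record layout `(ip, agent, timestamp_seconds, article_id)` and all function names are hypothetical, the agent string stands in for browser and operating system, and the mean access interval is computed over article accesses only.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60    # seconds; the default timeout given in the text
ROBOT_MAX_ACCESSES = 100     # more article accesses than this marks a robot
ROBOT_MIN_MEAN_GAP = 3.0     # seconds; a smaller mean gap marks a robot


def identify_sessions(records):
    """Split (ip, agent, timestamp_seconds, article_id) records into sessions."""
    by_user = defaultdict(list)
    for ip, agent, timestamp, article in records:
        by_user[(ip, agent)].append((timestamp, article))
    sessions = []
    for accesses in by_user.values():
        accesses.sort()
        current = [accesses[0]]
        for previous, access in zip(accesses, accesses[1:]):
            if access[0] - previous[0] > SESSION_TIMEOUT:
                # Gap exceeds the timeout: close the session, start a new one.
                sessions.append(current)
                current = []
            current.append(access)
        sessions.append(current)
    return sessions


def is_robot(session):
    """Apply the two robot heuristics to one session."""
    if len(session) > ROBOT_MAX_ACCESSES:
        return True
    if len(session) < 2:
        return False
    mean_gap = (session[-1][0] - session[0][0]) / (len(session) - 1)
    return mean_gap < ROBOT_MIN_MEAN_GAP


def human_sessions(records):
    """Sessions that survive robot removal."""
    return [s for s in identify_sessions(records) if not is_robot(s)]
```

Robots with known agent types or IP addresses would be filtered by a separate lookup before these heuristics are applied.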

Finally, a user session is further divided into a number of transactions, each of which represents a semantically meaningful unit. However, the various transaction identification approaches proposed in Cooley et al. (1999) make use of either the index (auxiliary) pages or the Web site topology. Since neither exists in the context of literature digital libraries, these proposed approaches are not applicable. Our approach identifies transactions by considering the types of accesses, namely lookup and article accesses. In fact, articles listed by the same query must have some degree of similarity in their content (e.g. keyword, title, author, discipline). On the other hand, articles selected in the same user session or query also display some degree of similarity due to inherent human behaviour. Therefore, we have four methods for defining transactions:
(1) Query-chosen method: the articles selected in a query.
(2) Session-chosen method: the articles selected in a user session.
(3) Query-result method: the articles listed in a query.
(4) Session-result method: the articles listed in queries of a user session.

For the query-chosen and session-chosen methods, article accesses present in the Web logs are grouped into a set of transactions. For the query-result and session-result methods, we construct transactions by reissuing queries to the literature digital library.
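A short sketch of the two log-based methods may help. It assumes, purely for illustration, that a session is a time-ordered list of `('lookup', params)` and `('article', article_id)` pairs; the function names are hypothetical. The query-result and session-result methods are not sketched because they require reissuing queries to the digital library.

```python
def split_into_queries(session):
    """Group article accesses by the lookup access that precedes them."""
    queries, current = [], []
    for kind, value in session:
        if kind == "lookup":
            # A lookup starts a new query; keep the previous one if non-empty.
            if current:
                queries.append(current)
            current = []
        elif kind == "article":
            current.append(value)
    if current:
        queries.append(current)
    return queries


def query_chosen(session):
    """One transaction per query: the articles selected in that query."""
    return [set(query) for query in split_into_queries(session)]


def session_chosen(session):
    """One transaction per session: every article selected in it."""
    chosen = {value for kind, value in session if kind == "article"}
    return [chosen] if chosen else []
```

Article accesses that arrive before any lookup (for example via an external hyperlink) form a query of their own, which matches the "optional" lookup described earlier.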

As mentioned, the query-chosen and session-chosen methods incorporate knowledge on human selection in making the recommendations. We expect that they will yield more effective recommendations than their counterparts without such knowledge, namely the query-result and session-result methods. We therefore form the following hypothesis:

H1. Users tend to browse the metadata of only the articles they find of interest. Recommendation schemes that consider the Web accesses of these articles will result in more effective recommendations.

Mining the literature usage log

There have been several approaches proposed in the literature (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001) for identifying aggregate usage profiles from Web usage logs. Aggregate usage profiles can be represented in the form of association rules, sequential patterns (or an n-gram Markov model), or clusters of Web pages. In the context of literature digital libraries, however, we decided not to consider sequential patterns, because the order of articles in a transaction may not relate to users' preferences. Instead, only association rules and clusters of articles will be discussed.

The problem of finding frequent associations between items in a transaction database, called the association-rule discovery problem, was first introduced by Agrawal et al. (1993). Association-rule discovery methods, such as the Apriori algorithm (Agrawal et al., 1993; Agrawal and Srikant, 1994), typically decompose the problem into two phases: the extraction of frequent itemsets and the generation of strong association rules. In the initial extraction phase, the methods find sets of items that frequently occur together. The support of each itemset $I = X \cup Y$, denoted as Sup(I), is the fraction of transactions containing both X and Y. This itemset I is referred to as a frequent itemset if Sup(I) exceeds a user-specified minimal support threshold, Minsup.

In the second phase, each discovered itemset $I = X \cup Y$ is used to construct association rules in the form $X \Rightarrow Y$. The confidence of each rule, denoted as $\mathrm{Conf}(X \Rightarrow Y)$, is the fraction of transactions containing X that also contain Y. An association rule is said to be strong if its confidence exceeds a user-specified confidence threshold. By applying association-rule discovery algorithms to the transactions of the transformed logs, we can find association rules in the form $a_1, a_2, \ldots, a_m \Rightarrow a$, which can be used to recommend article a to users who have browsed $a_1, a_2, \ldots, a_m$ but not a.
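The two measures can be made concrete with a small sketch. It computes them by brute force over a list of transactions (each a set of article identifiers) and is only meant to illustrate the definitions; it is not an Apriori implementation, and the function names are assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also
    contain the consequent, i.e. Sup(X u Y) / Sup(X)."""
    antecedent_support = support(antecedent, transactions)
    if antecedent_support == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / antecedent_support
```

For instance, with the four transactions {a, b, c}, {a, b}, {a, c} and {b}, the itemset {a, b} has support 2/4, and the rule a ⇒ b has confidence (2/4) / (3/4) = 2/3.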

The traditional approaches for identifying itemsets with a uniform minimum support threshold, however, cannot be directly applied, because articles that arrive later tend to have smaller support even if they are actually more popular. Therefore, a nonuniform support threshold scheme, originally proposed in Liu et al. (1999), is adopted. In this scheme, each item is assigned a distinct minimum support value (called the minimum item support, MIS), and the minimum support of an itemset is the minimum of the MIS values of its constituent items. In the literature recommendation system, we view the MIS value of an article as a function of its creation time. That is, articles that are added to the digital library more recently should be assigned smaller MIS values. Let N(t) be the number of transactions after time t in the Web usage logs. The MIS of an article a is defined as follows:

$$
\mathrm{MIS}(a) =
\begin{cases}
M(a) & \text{if } M(a) > LS \\
LS & \text{otherwise}
\end{cases}
$$

$$
M(a) = \frac{N(\mathrm{CreationTime}(a))}{N(0)} \times \mathit{Minsup}
$$

where LS is the lower bound for support values, Minsup is the minimum support threshold based on the entire article browsing log, and N(0) denotes the total number of transactions in the Web usage log. Both LS and Minsup are user-defined constants. After assigning the MIS values of all articles in the literature digital library, the method proposed in Liu et al. (1999) can be applied to derive the association rules for articles.
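As an illustration of the MIS definition, the following sketch computes N(t) from a sorted list of transaction timestamps. The function names and the representation of the log as a sorted list are assumptions for this example; N(0) is taken to be the length of that list.

```python
from bisect import bisect_right


def transactions_after(t, transaction_times):
    """N(t): number of transactions logged after time t.

    transaction_times must be sorted in ascending order.
    """
    return len(transaction_times) - bisect_right(transaction_times, t)


def minimum_item_support(creation_time, transaction_times, minsup, ls):
    """MIS of an article created at creation_time."""
    n_total = len(transaction_times)                     # N(0)
    n_since = transactions_after(creation_time, transaction_times)
    m = n_since / n_total * minsup                       # M(a)
    return m if m > ls else ls
```

With 100 transactions, Minsup = 0.02 and LS = 0.005, an article present for the whole log keeps MIS = 0.02, one that has seen only the last 50 transactions gets 0.01, and one that has seen only the last 10 would get 0.002 and is therefore raised to the lower bound 0.005.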

For the clustering technique, we adopt the Association Rule Hypergraph Partitioning (ARHP) approach (Mobasher et al., 1999, 2000) rather than traditional clustering techniques. The main reason for this is that ARHP is more efficient in handling high-dimensional data such as those present in literature digital libraries. The dimensions of a transaction are the set of articles, which is huge for a large-scale digital library. This approach starts with the identification of frequent itemsets (as in association-rule discovery methods), each of which contains articles often accessed together in transactions. Each such frequent itemset is then viewed as a hyperedge with a specific weight.

As mentioned, each article has a distinct creation time. We therefore normalise the support values of itemsets as follows before computing their weights:

$$
\mathrm{Sup}'(a_i) = \frac{N(0)}{N(\mathrm{CreationTime}(a_i))} \times \mathrm{Sup}(a_i)
$$

$$
\mathrm{Sup}'(a_1, \ldots, a_k) = \frac{N(0)}{N\left(\max_{1 \le i \le k} \mathrm{CreationTime}(a_i)\right)} \times \mathrm{Sup}(a_1, \ldots, a_k)
$$

There are several ways to define the weight of an itemset, such as using either the support for or the interest in the itemset. The former favours itemsets of smaller size, whereas the latter gives priority to larger itemsets. We define a general weighting formula that covers the broad spectrum between these two extremes. In addition, the supports for or interests in different itemsets can have very diverse values. To prevent itemsets of large weight from dominating the subsequent clustering procedure, we apply the logarithm on the weight. The following is our definition of the weight of an itemset $(a_1, a_2, \ldots, a_k)$:

$$
\mathrm{weight}(a_1, a_2, \ldots, a_k) = \log\left( \frac{\mathrm{Sup}'(a_1, a_2, \ldots, a_k)}{\left[ \mathrm{Sup}'(a_1) \cdot \mathrm{Sup}'(a_2) \cdots \mathrm{Sup}'(a_k) \right]^{\alpha}} \times \frac{1}{\mathit{Minsup}} \right)
$$

where $0 \le \alpha \le 1$. Note that when $\alpha = 0$ or $\alpha = 1$, this formula is equivalent to using support or interest as the weight, respectively. 1/Minsup is a constant that keeps the weight non-negative. This definition leads to the following hypotheses:

H2. A better recommendation effectiveness can be achieved by striking a balance between using support and interest as the weight of an itemset.

H3. A better recommendation effectiveness can be achieved by incorporating the logarithm function in the weight of an itemset.
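The normalisation and weighting steps can be sketched as below. The function names and the `use_log` switch (which turns the logarithm off, as the comparison in H3 would require) are assumptions for illustration; supports are passed in already normalised.

```python
import math


def normalised_support(sup, n_total, n_since_creation):
    """Sup' = N(0) / N(creation time) * Sup.

    For an itemset, n_since_creation is N evaluated at the latest
    creation time among its member articles.
    """
    return n_total / n_since_creation * sup


def itemset_weight(sup_itemset, sup_items, alpha, minsup, use_log=True):
    """Weight of a hyperedge.

    sup_itemset: normalised support of the whole itemset.
    sup_items:   normalised supports of its individual articles.
    alpha:       0 gives a support-based weight, 1 an interest-based one.
    """
    denominator = math.prod(sup_items) ** alpha
    raw = sup_itemset / denominator / minsup
    return math.log(raw) if use_log else raw
```

For example, an itemset with normalised support 0.04 whose two articles have normalised supports 0.2 and 0.1 receives, with Minsup = 0.02, a weight of log 2 when alpha is 0 and log 100 when alpha is 1.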

After deciding the weight of each itemset, the hypergraph partitioning algorithm proposed in Karypis (2002) is applied to partition the set of articles into disjoint clusters of articles. Articles in the same cluster are more "similar" in the sense that they are more likely to be accessed together in the same transaction. To reflect the fact that an article may indeed interest more than one group of users, we adopt the same heuristic as used in Mobasher et al. (1999) by adding back articles to clusters, which results in overlapping clusters. Specifically, for a given hyperedge, if the percentage of involved vertices in a cluster is larger than a threshold, the other involved vertices are included in the same cluster.

Online recommendations

We propose two recommendation approaches that use the article association rules and article clusters obtained by the methods described above. The goal is to recommend the top-N articles that potentially interest the active user. The first approach makes use of article association rules. The idea is to treat each frequent itemset as the interest profile of a user group and to recommend articles based on the similarity between the current session of the active user and interest profiles of the relevant user groups. Specifically, let s be the active user's current session of length k. We first identify the set of frequent itemsets of size k + 1 that contain all elements in s and an extra element m (not in s). For each such itemset I, the confidence of the rule $I - \{m\} \Rightarrow \{m\}$ is calculated. These extra elements are then recommended to the user in descending order of confidence value. If these elements are not sufficient (i.e. there are fewer than N of them), we then search for frequent itemsets of size k that contain k - 1 elements in s and an extra element (not in s). Again, these extra elements are recommended to the user in descending order of confidence value. This procedure continues until N articles are recommended.
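The association-rule-based procedure can be illustrated with a minimal Python sketch. Several details are assumptions made for illustration only: the function name `recommend_by_rules`, the representation of frequent itemsets as a dictionary from `frozenset` to support, the use of the highest confidence when one article is reachable through several itemsets of the same size, and alphabetical tie-breaking. The sketch also assumes that the dictionary holds every frequent itemset, so that the antecedent of each rule is present.

```python
def recommend_by_rules(session, supports, top_n):
    """Rank unseen articles by rule confidence, relaxing the match step by step.

    session  -- articles in the active user's current session
    supports -- dict mapping frozenset(itemset) -> support (all frequent itemsets)
    top_n    -- number of articles to recommend
    """
    session = set(session)
    recommended = []
    # First require all k session articles to match, then k - 1, and so on.
    for matched in range(len(session), 0, -1):
        best = {}
        for itemset, support in supports.items():
            if len(itemset) != matched + 1:
                continue
            extra = itemset - session
            if len(extra) != 1:
                continue  # the itemset must add exactly one unseen article
            (m,) = extra
            if m in recommended:
                continue
            # Confidence of the rule (I - {m}) => {m}.
            confidence = support / supports[itemset - extra]
            best[m] = max(best.get(m, 0.0), confidence)
        for m in sorted(best, key=lambda a: (-best[a], a)):
            recommended.append(m)
            if len(recommended) == top_n:
                return recommended
    return recommended
```

Because every candidate itemset has to be compared with the session, the cost of this loop grows with the number of frequent itemsets and with the session length.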

Our other proposed method uses a hypergraph-based approach. In this approach, the recommendation score of each article a is computed by considering the similarity between the current user session and the clusters C to which a belongs, and the coherence weight of a with respect to C. Specifically, each cluster of articles can be viewed as a vector with binary elements, each of which indicates whether an article appears in the cluster. Similarly, the current user session can also be represented as a vector. Then the similarity between the current session S and a cluster C can be defined as a cosine function as follows:

$$\mathrm{match}(S, C) = \frac{\sum_k a_k^C \times S_k}{\sqrt{\sum_k (S_k)^2 \times \sum_k (a_k^C)^2}}$$

where $S_k$ is the k'th element in S and $a_k^C$ is the k'th element in C.

The coherence weight of an article a with respect to the cluster C that it belongs to is defined as

$$\mathrm{weight}(a, C) = \frac{\sum_{a \in e,\ e \subseteq C} \mathrm{weight}(e)}{\sum_{e \subseteq C} \mathrm{weight}(e)}$$

where weight(e) is the weight of a hyperedge e.

The recommendation score Rec(S, a) of an article a with respect to the current user session S is then defined as

$$\mathrm{Rec}(S, a) = \max_{a \in C} \sqrt{\mathrm{match}(S, C) \times \mathrm{weight}(a, C)}$$

The top-N articles for recommendation are those with the N highest values in the recommendation score.
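The scoring just described can be sketched as follows. The code is a minimal illustration, not the system's implementation: the function names, the representation of clusters as Python sets and of hyperedge weights as a dictionary keyed by `frozenset`, the exclusion of articles already in the session, and alphabetical tie-breaking are all assumptions. For binary vectors the cosine reduces to the size of the intersection divided by the square root of the product of the two set sizes, which is what `match` computes.

```python
import math


def match(session, cluster):
    """Cosine similarity between the binary vectors of a session and a cluster."""
    if not session or not cluster:
        return 0.0
    return len(session & cluster) / math.sqrt(len(session) * len(cluster))


def coherence(article, cluster, edge_weights):
    """Share of the hyperedge weight inside the cluster that involves the article."""
    inside = [(e, w) for e, w in edge_weights.items() if e <= cluster]
    total = sum(w for _, w in inside)
    if total == 0:
        return 0.0
    return sum(w for e, w in inside if article in e) / total


def recommend_by_clusters(session, clusters, edge_weights, top_n):
    """Return the top_n unseen articles ranked by sqrt(match * coherence)."""
    session = set(session)
    scores = {}
    for cluster in clusters:
        similarity = match(session, cluster)
        for article in cluster - session:
            score = math.sqrt(similarity * coherence(article, cluster, edge_weights))
            # An article in several clusters keeps its best score.
            scores[article] = max(scores.get(article, 0.0), score)
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked[:top_n]
```

The work per request depends on the number and size of the clusters rather than on the number of frequent itemsets.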


Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

The hypergraph-based approach is more carefully designed than the association-rule-based approach, and we expect the former to perform better.

H4 The hypergraph-based approach will result in more effective recommendations and has a quicker response time than the association-rule-based approach.

Empirical evaluations

This section reports our experience in applying the Web usage logs of NSYSU-ETD to the proposed literature recommendation system. The main objective was to test our four hypotheses. NSYSU-ETD runs on PC Solaris 2.7 and uses Apache 1.3.9 as the Web server. Since being commissioned in May 2000, it has been loaded with more than 3,000 electronic theses of National Sun Yat-sen University. Up to February 2003, these theses had been browsed more than 400,000 times and downloaded more than 100,000 times. We analysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002 for our experiments: the data collected from February 1 to April 30 were designated as the training data set, and those collected in May served as the test data set.

We first applied the data cleansing technique on the training data and obtained 43,349 lookup accesses and 41,627 article accesses. Applying the session identification technique revealed 16,922 user sessions, among which 392 sessions were robot generated, 6,068 sessions contained only one article access, and 5,253 sessions contained no article accesses. We eliminated these trivial user sessions and applied transaction identification techniques, resulting in 5,617 transactions for the query-chosen method, 5,272 transactions for the session-chosen method and 17,742 transactions for the query-result method. Queries whose results are never chosen by the users are removed from the query-chosen method but remain in the query-result method. The session-result method produced transactions of huge size, each containing thousands of article accesses. We therefore decided not to consider this method in the subsequent experiments.

The two proposed methods for mining literature usage logs both require the identification of frequent itemsets from transactions, which needs the minimum support to be specified. However, the three transaction identification methods under comparison have different numbers of transactions. To be fair, we specify a different minimum support threshold for each method such that the total number of articles involved in large two-item sets of each method - called recommendable articles - is approximately the same. Our recommendation framework recommends an article only if it is associated with other articles a sufficient number of times in the Web usage log. Therefore, articles that are not involved in large two-item sets cannot possibly be recommended. Table I shows the specified minimum support threshold and the number of recommendable articles of each method.
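To make the notion of recommendable articles concrete, the following Python sketch counts the articles that take part in at least one large two-item set. It is a simplified illustration: the function name `recommendable_articles` is hypothetical, support is taken as the fraction of transactions containing the pair, and pairs are enumerated by brute force rather than by a dedicated frequent-itemset mining algorithm.

```python
from collections import Counter
from itertools import combinations


def recommendable_articles(transactions, min_support):
    """Articles that occur in at least one two-item set meeting min_support.

    transactions -- list of lists of article ids
    min_support  -- minimum fraction of transactions that must contain a pair
    """
    pair_counts = Counter()
    for transaction in transactions:
        # Count each unordered pair once per transaction.
        for pair in combinations(sorted(set(transaction)), 2):
            pair_counts[pair] += 1
    articles = set()
    for pair, count in pair_counts.items():
        if count / len(transactions) >= min_support:
            articles.update(pair)
    return articles
```

Lowering `min_support` enlarges the returned set, which is how a separate threshold per transaction identification method can be tuned until the methods yield roughly the same number of recommendable articles. The pair enumeration is quadratic in transaction length, so it is only practical for short transactions.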

To illustrate how we conducted experiments, we define the following notation: let $T_{eval}$ be the set of transactions in the test set, $t_{eval}$ be a transaction in $T_{eval}$, and $a_t(i)$ be the i'th article in $t_{eval}$. Given a window size $W_{size}$, we divide each transaction $t_{eval}$ in the test data set into two lists, $t_{eval}[W]$ and $t_{eval}[R]$, where $t_{eval}[W]$ is the first $W_{size}$ article accesses of $t_{eval}$ and $t_{eval}[R]$ is the remaining articles. By treating $t_{eval}[W]$ as the current session, the recommender system will choose the set $t_{pr}$ of top-N articles for recommendation.

The performance metric we adopted for measuring the quality of recommendation is the precision and recall scheme. The precision is the ratio of the number of recommended articles accessed by a user to the total number of recommended articles, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{pr}|$, and recall is the ratio of the number of recommended articles accessed by a user to the total number of articles of interest to the user, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{eval}[R]|$. The precision (recall) of a recommendation approach is the average precision (recall) of all transactions in the test set.
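The evaluation protocol can be sketched in Python as follows. The function name `evaluate` and the `recommend` callback are hypothetical, and two details are assumptions rather than statements from the paper: transactions with no articles beyond the window are skipped, and an empty recommendation list counts as precision 0.

```python
def evaluate(test_transactions, recommend, window_size, top_n):
    """Average precision and recall over the test transactions.

    recommend -- callable (window, top_n) -> list of recommended article ids
    """
    precisions, recalls = [], []
    for transaction in test_transactions:
        window = transaction[:window_size]          # plays the role of the current session
        remaining = set(transaction[window_size:])  # articles the user went on to access
        if not remaining:
            continue  # assumption: nothing left to predict, so skip
        recommended = set(recommend(window, top_n))
        hits = len(recommended & remaining)
        precisions.append(hits / len(recommended) if recommended else 0.0)
        recalls.append(hits / len(remaining))
    if not precisions:
        return 0.0, 0.0
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```

Either recommendation approach can be plugged in as the `recommend` callback, so both are measured on exactly the same windows.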

Test of H1

We first evaluate the performance impact of the three transaction identification methods. In this experiment, $\alpha$ was set to 0.5 and the logarithm was taken when computing the weight of an itemset. Figure 3(a, b) shows the precisions and recalls, respectively, under association-rule-based recommendation.


The precisions and recalls under hypergraph-based recommendation are shown in Figure 4(a, b), respectively.

Overall, the trends are the same for the association-rule-based and hypergraph-based approaches. It can be clearly seen that both the query-chosen and session-chosen methods outperform the query-result method in terms of both precision and recall. This implies that the information about articles browsed plays a crucial role in making recommendations - both the query-chosen and session-chosen methods incorporate this information in forming transactions. We therefore accept H1. However, the performance difference between the query-chosen and session-chosen methods is not significant.

Test of H2 and H3

We then conducted experiments to shed light on the impact of $\alpha$ and the logarithmic

Figure 3 (a) Precisions of the association-based approach (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Method                   Minimum support (per cent)   No. of recommended articles
Session-chosen method    0.16                         253
Query-chosen method      0.12                         250
Query-result method      3.3                          229


function in the weight definition for the hypergraph-based recommendation approach. Figure 5(a, b) shows the precision and recall values when $\alpha$ = 0, 0.5 and 1, using the session-chosen method for transaction identification. The window size was set at 2. As can be seen, differences in the precision and recall under different settings of $\alpha$ are very small. We have performed the same experiments for different transaction identification methods and window sizes and obtained similar results. H2 is therefore rejected.

We then computed the precisions and recalls with and without application of the logarithmic function on the itemset weight. Figure 6(a, b) shows the resulting precisions and recalls, respectively.

Figure 6 shows that applying the logarithmic function on the itemset weight definition achieves significantly better precision and recall values. This meets our expectation, and H3 is accepted.

Test of H4

Finally, we evaluated the impact of association-rule-based and hypergraph-based recommendation approaches for different window sizes. Figure 7(a, b) shows the precisions and recalls for window sizes of 2, 3, 4, 5 and 6 for a top-15 recommendation. It can be seen that the hypergraph-based approach performs better than the association-rule-based approach, especially for larger window sizes.

We also compared the running times of both approaches whilst setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a

Figure 5 (a) Impact of $\alpha$ on precisions and (b) impact of $\alpha$ on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification


result of an increased window size), the association-rule-based approach incurs a larger running time.

Overall, we conclude that the hypergraph-based approach is more attractive, since it yields better-quality article recommendation and has a more consistent running time. Thus H4 is accepted.

Conclusions

In this paper we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:

(1) data preparation of the Web logs;

(2) usage log mining; and

(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches - association-rule based and hypergraph based - for making recommendations. These alternatives and approaches were evaluated using the Web

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values


usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification, and that the hypergraph-based approach yields better-quality article recommendation and exhibits a more consistent running time, and thus is more scalable.

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.

References

Agrawal, R., Imielinski, T. and Swami, A. (1993), "Mining association rules between sets of items in large databases", in Buneman, P. and Jajodia, S. (Eds), Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 26-28, ACM Press, New York, NY, pp. 207-16.

Agrawal, R. and Srikant, R. (1994), "Fast algorithms for mining association rules", in Bocca, J.B., Jarke, M. and Zaniolo, C. (Eds), Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, Santiago, Chile, Morgan Kaufmann, San Francisco, CA, pp. 487-99.

Alspector, J., Kolcz, A. and Karunanithi, N. (1998), "Comparing feature-based and clique-based user models for movie selection", Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, ACM Press, New York, NY, pp. 11-18.

Andresen, D., Carver, L., Dolin, R., Fischer, C., Frew, J., Goodchild, M., Ibarra, O., Kothuri, R., Larsgaard, M., Nebert, D., Simpson, J., Smith, T., Yang, T. and Zheng, Q. (1995), "The WWW prototype of the Alexandria digital library", Proceedings of the International Symposium on Digital Libraries, Tsukuba, Japan, 22-5 August, pp. 17-27.

Ansari, A., Essegaier, S. and Kohli, R. (2000), "Internet recommendation systems", Journal of Marketing Research, Vol. 37 No. 3, pp. 67-85.

Arms, W. (2000), Digital Libraries, MIT Press, Cambridge, MA.

Armstrong, R., Freitag, D., Joachims, T. and Mitchell, T. (1997), "WebWatcher: a learning apprentice for the World Wide Web", AAAI Spring Symposium on Information Gathering from Heterogeneous Distributed Environments, pp. 6-12.

Balabanović, M. and Shoham, Y. (1997), "Fab: content-based collaborative recommendation", Communications of the ACM, Vol. 40 No. 3, pp. 66-72.

Baldonado, M., Chang, C., Gravano, L. and Paepcke, A. (1997), "Metadata for digital libraries: architecture and design rationale", Proceedings of the 2nd ACM International Conference on Digital Libraries, ACM Press, New York, NY, pp. 47-56.

Basu, C., Hirsh, H. and Cohen, W. (1998), "Recommendation as classification: using social and content-based information in recommendation", Proceedings of the 15th National Conference on Artificial Intelligence, 26-30 July, Madison, WI, AAAI Press, Menlo Park, CA, pp. 714-20.

Billsus, D. and Pazzani, M. (1999), "A hybrid user model for news story classification", in Kay, J. (Ed.), Proceedings of the 7th International Conference on User Modelling, Banff, Canada, 20-4 June, Springer-Verlag, New York, NY, pp. 99-108.

Bowman, C., Manber, P. and Schwartz, U. (1994), "Scalable Internet resources discovery: research problems and approaches", Communications of the ACM, Vol. 37 No. 8, pp. 98-107.

Breese, J., Heckerman, D. and Kadie, C. (1998), "Empirical analysis of predictive algorithms for collaborative filtering", Technical Report MSR-TR-98-12, Microsoft Research, Seattle, CA.

Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A. and Lin, C. (1996), "A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois digital library initiative project", IEEE Transactions on PAMI, Vol. 18 No. 8, pp. 17-34.

Cooley, R., Mobasher, B. and Srivastava, J. (1999), "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1 No. 1, pp. 5-32.

Furner, J. (2002), "On recommending", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 747-63.

Goldberg, D., Nichols, D., Oki, B. and Terry, D. (1992), "Using collaborative filtering to weave an information tapestry", Communications of the ACM, Vol. 35 No. 12, pp. 61-70.

Herlocker, J. and Konstan, J. (2001), "Content-independent task-focused recommendation", IEEE Internet Computing, Vol. 5 No. 6, pp. 40-7.

Karypis, G. (2002), "Multilevel hypergraph partitioning", Tech. Report TR02-25, Department of Computer Science, University of Minnesota, MN.

Kessler, J. (1996), Internet Digital Libraries: The International Dimension, Artech House Publishers, Norwood, MA.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997), "GroupLens: applying collaborative filtering to Usenet news", Communications of the ACM, Vol. 40 No. 3, pp. 77-87.


Lang, K. (1995), "Newsweeder: learning to filter netnews", in Prieditis, A. and Russell, S. (Eds), Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, Morgan Kaufmann, San Francisco, CA, pp. 331-9.

Liu, B., Hsu, W. and Ma, Y. (1999), "Mining association rules with multiple minimum supports", Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 15-18 August, San Diego, ACM Press, New York, NY, pp. 430-4.

Loeb, S. and Terry, D. (1992), "Information filtering", Communications of the ACM, Special Issue on Information Filtering, Vol. 35 No. 12, pp. 26-8.

Mobasher, B., Cooley, R. and Srivastava, J. (1999), "Creating adaptive Web sites through usage-based clustering of URLs", Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop.

Mobasher, B., Dai, H., Luo, T., Nakagawa, M. and Wiltshire, J. (2000), "Discovery of aggregate usage profiles for Web personalisation", Proceedings of the webKDD Workshop.

Mooney, R. and Roy, L. (2000), "Content-based book recommending using learning for text categorisation", Proceedings of the 5th ACM Conference on Digital Libraries, San Antonio, ACM Press, New York, NY, pp. 195-204.

Pazzani, M. and Billsus, D. (1997), "Learning and revising user profiles: the identification of interesting Web sites", Machine Learning, Vol. 27 No. 4, pp. 313-31.

Pazzani, M. (1999), "A framework for collaborative, content-based and demographic filtering", Artificial Intelligence Review, Vol. 13 No. 5/6, pp. 393-408.

Pennock, D., Horvitz, E., Lawrence, S. and Giles, C. (2000), "Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach", Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, 30 June-3 July, Morgan Kaufmann, San Francisco, CA, pp. 473-80.

Pitkow, J. and Pirolli, P. (1999), "Mining longest repeating subsequences to predict World Wide Web surfing", Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 139-150.

Schafer, J., Konstan, J. and Riedl, J. (2001), "E-commerce recommendation applications", Data Mining and Knowledge Discovery, Vol. 5 No. 1, pp. 10-22.

Shardanand, U. and Maes, P. (1995), "Social information filtering: algorithms for automating 'word of mouth'", Proceedings of the Conference on Human Factors in Computing Systems, Denver, CO, ACM Press, New York, NY, pp. 210-17.

Spink, A., Wilson, T., Ford, N., Foster, A. and Ellis, D. (2002), "Information seeking and mediated searching", Journal of The American Society For Information Science and Technology, Vol. 53 No. 9, pp. 695-703.

Srivastava, J., Cooley, R., Deshpande, M. and Tang, P. (2000), "Web usage mining: discovery and applications of usage patterns from Web data", SIGKDD Explorations, Vol. 1 No. 2, pp. 12-23.

Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. (1997), "PHOAKS: a system for sharing recommendations", Communications of the ACM, Vol. 40 No. 3, pp. 59-62.

Wilensky, R. (1996), "Toward work-centred digital information services", IEEE Computer, Vol. 29 No. 5, pp. 7-44.

Yan, T., Jacobsen, M., Molina, H. and Dayal, U. (1996), "From user access patterns to dynamic hypertext linking", Proceedings of the 5th International World Wide Web Conference, pp. 1007-14.

Yang, Q., Zhang, H.H. and Li, T. (2001), "Mining Web logs for prediction models in WWW caching and prefetching", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-8.


resulted in a recommendation system that is particularly suitable for recommending literature in digital library environments.

Related work

Interest in digital libraries has increased tremendously, with several research projects addressing the wealth of challenges in this field. For example, a University of Illinois project has focused on providing integrated access to diverse and distributed collections of scientific literature (Chen et al., 1996). That project deals with heterogeneous interfaces to multiple indices, semantic federation across repositories and other related issues. A group at the University of California at Berkeley is working on providing work-centred digital information services (Wilensky, 1996). The issues involved include document image analysis, natural language analysis and computer vision analysis for effective information extraction. Carnegie Mellon University intends to build a large online digital video library featuring full-content and knowledge-based searching and retrieval. The University of California at Santa Barbara has concentrated on geographical information systems, and a Stanford University project addresses the problem of interoperability using CORBA to implement information-access and payment protocols (Baldonado et al., 1997).

The focus of our research reported here is to tackle the problem of information overload. Proposed solutions to this emphasise the need for specialisation in information retrieval services to help people effectively locate information that meets their individual needs (Bowman et al., 1994). Interest in recommending has increased in the information technology community, and especially in the design of digital libraries (Furner, 2002). The research reported here concentrates on literature recommendations.

The past few years have seen the emergence of many recommendation systems intended to provide personal recommendations for various types of products and services, including:

news and e-mail messages (see www.netperceptions.com for a commercial site, and Goldberg et al. (1992), Lang (1995), Konstan et al. (1997) and Billsus and Pazzani (1999) for research prototypes);

Web pages (see http://my.yahoo.com for a commercial site, and Balabanović and Shoham (1997), Terveen et al. (1997), Pazzani and Billsus (1997) and Armstrong et al. (1997) for research prototypes);

books (see http://www.amazon.com for a commercial site, and Mooney and Roy (2000) for a research prototype);

music (see http://www.CDNow.com for a commercial site, and Shardanand and Maes (1995) for a research prototype); and

movies (see http://movies.eonline.com for a commercial site, and Alspector et al. (1998), Breese et al. (1998), Basu et al. (1998), Ansari et al. (2000), Pennock et al. (2000) and Schafer et al. (2001) for research prototypes).

The first type of recommendation technique was called the content-based approach (Loeb and Terry, 1992). A content-based approach characterises recommendable items by a set of content features and represents a user's interests by a similar feature set. Then the relevance of a given content item to the user's interest profile is measured as the similarity of this recommendable item to the user's interest profile. Content-based approaches select recommendable items that have a high degree of similarity to the user's interest profile.

Another type of recommendation technique, the collaborative approach (sometimes called the social-based approach), takes into account the given user's interest profile and the profiles of other users with similar interests (Shardanand and Maes, 1995). The collaborative approach looks for relevance among users by observing their ratings assigned to products in a training set of limited size. The "nearest-neighbour" users are those that exhibit the strongest similarity to the target user. These users then act as "recommendation partners" for the target user, and collaborative approaches recommend to the target user items that appear in the profiles of these recommendation partners (but not in the target user's profile). It has been observed in several practical settings that the collaborative approach generally achieves more effective recommendations than its content-based counterpart (Alspector et al., 1998; Breese et al., 1998; Mooney and Roy, 2000; Pazzani, 1999).


Pennock et al. (2000) proposed using a collaborative approach for recommending articles in CiteSeer (www.citeseer.com). Their approach implicitly derives users' ratings of articles by observing their actions when viewing an article. Each action is assigned a weight. For example, adding a document to a profile produces a two-point increment, downloading a document a one-point increment, and ignoring a recommendation a one-point decrement in the rating score. However, this approach suffers from the shortage of negative examples, and the method is applicable only to individual members who are willing to identify themselves each time they use the digital library.

We consider that traditional recommendation techniques are not suitable for recommending articles in digital libraries. First, both content-based and collaborative approaches require that users' rating scores on selected items (including both positive and negative instances) are available for analysis. For a typical literature digital library, requiring users to rate some articles before making a recommendation is not realistic. Second, identifying an individual user of a literature digital library is generally not possible, since many literature digital libraries are freely available on the Internet and users can search or browse articles without having to identify themselves. Even for proprietary literature digital libraries, many users gain access via site subscriptions, making it difficult to track an individual's (long-term) browsing behaviour.

For the reasons mentioned above, we propose making use of a task-focused approach (Herlocker and Konstan, 2001). In this approach, a task profile (a set of recently accessed items), rather than the long-term interest profile, is used to make recommendations. One notable implementation of this approach is Web usage mining, which aims to identify interesting usage patterns of a Web-based system from the Web usage logs that record interactions between users and Web pages (Srivastava et al., 2000). Recently, several approaches have been proposed for recommending Web pages based on the Web page associations discovered by Web-usage mining algorithms (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001). While these approaches vary in their details, they follow the same recommendation framework, which starts with the identification of aggregate usage profiles of Web pages by some data mining method. They then make recommendations by looking into the similarity between the set of recently accessed Web pages of an active user and the collected aggregate usage profiles.

Obviously, literature digital libraries store articles rather than Web pages, and articles differ from Web pages in several respects:

Web pages are more diversified: some serve as index pages, some are content pages, and others have a mixture of indexes and content. On the other hand, since literature articles are more homogeneous in structure, they are more likely to have the same set of metadata features.

A Web site can be viewed as a directed graph whose vertices are Web pages, while a literature digital library is better visualised as a set of articles.

Literature articles are often retrieved by search queries provided by the system, while Web pages are often browsed through a static site topology.

Literature articles are incrementally inserted into the digital library at a faster rate than Web pages are inserted into a Web site.

Contributions

The above considerations indicate that literature recommendation services require a different technical approach. Here we describe a recommendation framework for recommending articles in a literature digital library. Several alternatives are proposed for implementing the constituent components of the recommendation framework. These alternatives are compared and analysed by applying the Web usage logs collected from the electronic thesis and dissertation (ETD) system at National Sun Yat-sen University (NSYSU-ETD). We have incorporated the proposed recommendation framework into NSYSU-ETD (www.lib.nsysu.edu.tw/eThesys/english/default_e.htm); the corresponding implementation status is also reported.

This paper is structured as follows. Next, the overall architecture is described, followed by detailed design and construction methods. Then implementation experience and evaluation results are presented. The final


part summarises this paper and discusses our future research directions.

Architecture

The overall architecture of the literature recommendation system, shown as Figure 1, consists of two basic subsystems, offline and online:

(1) The offline subsystem comprises two sequentially executed tasks: data preparation and log mining. Although Web usage logs are potentially able to provide useful knowledge for making recommendations, the raw log data cannot be used before appropriate pre-processing. Therefore, we first convert raw Web usage logs into a set of user transactions before performing the log mining task. The objectives of log mining tasks include the discovery of article association rules and the derivation of article clusters.

(2) The online subsystem interacts with an active user and provides recommended articles in real time. It keeps track of a set of articles browsed recently by the active user by consulting the current Web usage log provided by the Web server. Then, by comparing the similarities between this set and the article clusters (or associations) produced by the offline subsystem, the online subsystem recommends articles in the clusters (associations) that are highly similar.

We have incorporated the literature recommendation system into NSYSU-ETD. Figure 2 depicts a page view of an article in NSYSU-ETD. The Web page comprises two frames: the left frame displays the metadata of the browsed article, and the right frame shows the unseen articles (up to 15) recommended by our literature recommendation system, displayed in the order of relevance. In this example, the active user has reviewed two articles and the literature recommendation system has recommended another seven articles. Once the active user browses another article, the content of the recommendation frame will update accordingly.

Approaches

In this section we first describe the tasks conducted in the offline subsystem of the literature recommendation system, and then those conducted in the online recommendation subsystem.

Figure 1 Architecture of a literature recommendation system


Data preparation for the literature usage log

To prepare data from the Web usage logs of a literature digital library, we basically follow the heuristics adopted by Cooley et al. (1999) for processing Web usage logs that involve static Web pages. In their work, the Web usage logs are assumed to be in the extended NCSA format (including referrer and agent fields). The approach contains three sequential steps: data cleansing, user session identification and transaction identification.

The objective of data cleansing is to prune unwanted Web log records and to add back missing ones. Some Web log records are surplus because they are accesses to non-HTML pages (e.g. images and other HTTP requests involving no Web page accesses), while other Web log records are missing owing to local caches, firewalls and proxy servers. Identifying missing Web log records is especially difficult, and several heuristics have been proposed for achieving this. However, we found this difficulty nonexistent when processing the Web usage log of NSYSU-ETD, because article Web pages are dynamically generated and are not cacheable. Our university ETD system is database driven, in that the theses' metadata are stored in a DBMS. Most large-scale digital libraries adopt the same method for maintaining their collections. In the context of a literature digital library, we are concerned with, and retain, only the Web usage records that involve the following two types of accesses:

(1) Lookup accesses. Each lookup access is an execution of a CGI program with searching or browsing conditions specified in the parameters. An example lookup access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/search_by_advisor?advisor_name=San-Yih+Hwang, which lists all theses supervised by Professor San-Yih Hwang.

(2) Article accesses. Each article access is an execution of a CGI program that displays the detailed metadata of an article. An example article access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0726100-135739, which shows the metadata of the thesis whose URN is etd-0726100-135739.
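The two access types can be told apart from the requested URL alone. The following is a minimal sketch, not the system's actual code: it assumes that article accesses invoke a program named view_etd with a URN parameter (as in the example above) and that every lookup program name begins with "search", which is a guess based on the single search_by_advisor example. The function name classify_access is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def classify_access(url):
    """Classify a requested URL as a lookup access, an article access or other.

    Returns ("article", urn), ("lookup", params) or ("other", None).
    The "search" prefix rule is an assumption made for illustration.
    """
    parsed = urlparse(url)
    program = parsed.path.rsplit("/", 1)[-1]   # last path component = CGI program
    params = parse_qs(parsed.query)            # decodes "+" and %xx escapes
    if program == "view_etd" and "URN" in params:
        return ("article", params["URN"][0])
    if program.startswith("search"):
        return ("lookup", {key: values[0] for key, values in params.items()})
    return ("other", None)                     # images, style sheets, etc. are pruned
```

Records classified as "other" correspond to the surplus log entries that data cleansing discards.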

Note that article accesses display the detailedmetadata of articles and therefore are of

Figure 2 Page view of an article in NSYSU-ETD


primary interest. Lookup accesses provide lookup information to facilitate browsing or searching and can thus be considered auxiliary. Most of the time, a user will first execute a lookup access, followed by a selective list of article accesses. We found that some user sessions in the Web usage log of NSYSU-ETD contained article accesses without prior lookup accesses. This is because several information sources had provided hyperlinks directly to articles' page views. In this case, each session can be viewed as a list of queries. Each query (optionally) starts with a lookup access, followed by a list of accesses to the articles that the user chose to look at in more detail.

The goal of user session identification is to divide the article accesses of each user into individual sessions. It is reasonable to assume that two records with different IP addresses, browsers or operating systems belong to two different user sessions. In addition, the time interval between two consecutive requests in a user session should not be too large. As in many commercial products, we use 30 minutes as the default timeout period. When the time interval between the current access and the previous one exceeds this, a new user session is assumed to have started. Some of the identified user sessions are made by Internet robots and hence should not be considered. Some robots have known agent types and/or IP addresses and can be easily identified. Analysis of the user sessions of these known robots revealed that most of these sessions either have more than 100 article accesses or exhibit a mean interval between adjacent Web page accesses of less than three seconds. User sessions that satisfy this condition are considered to be robot sessions and are consequently removed.
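The session rules above can be sketched as follows. This is an illustrative sketch only: the record layout (timestamp in seconds, IP address, agent string, article identifier) and the function names are assumptions, and the agent string is taken to stand for both browser and operating system.

```python
from collections import defaultdict

TIMEOUT_SECONDS = 30 * 60  # default timeout period of 30 minutes


def identify_sessions(records, timeout=TIMEOUT_SECONDS):
    """Split (timestamp, ip, agent, article_id) records into user sessions.

    Records with a different IP address or agent go to different sessions;
    a gap longer than `timeout` between consecutive accesses starts a new one.
    Each session is returned as a list of (timestamp, article_id) pairs.
    """
    by_user = defaultdict(list)
    for timestamp, ip, agent, article_id in records:
        by_user[(ip, agent)].append((timestamp, article_id))

    sessions = []
    for accesses in by_user.values():
        accesses.sort()
        current = [accesses[0]]
        for previous, access in zip(accesses, accesses[1:]):
            if access[0] - previous[0] > timeout:
                sessions.append(current)
                current = []
            current.append(access)
        sessions.append(current)
    return sessions


def looks_like_robot(session, max_accesses=100, min_mean_gap=3.0):
    """Apply the two robot heuristics: too many accesses or too small a mean gap."""
    if len(session) > max_accesses:
        return True
    if len(session) < 2:
        return False
    gaps = [b[0] - a[0] for a, b in zip(session, session[1:])]
    return sum(gaps) / len(gaps) < min_mean_gap
```

Sessions for which looks_like_robot returns True would be dropped before transaction identification.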

Finally, a user session is further divided into a number of transactions, each of which represents a semantically meaningful unit. However, the various transaction identification approaches proposed in Cooley et al. (1999) make use of either the index (auxiliary) pages or the Web site topology. Since neither exists in the context of literature digital libraries, these proposed approaches are not applicable. Our approach identifies transactions by considering the types of accesses, namely lookup and article accesses. In fact, articles listed by the same query must have some degree of similarity in their content (e.g. keyword, title, author, discipline). On the other hand, articles selected in the same user session or query also display some degree of similarity due to inherent human behaviour. Therefore, we have four methods for defining transactions:

(1) Query-chosen method: the articles selected in a query.
(2) Session-chosen method: the articles selected in a user session.
(3) Query-result method: the articles listed in a query.
(4) Session-result method: the articles listed in queries of a user session.

For the query-chosen and session-chosen methods, article accesses present in the Web logs are grouped into a set of transactions. For the query-result and session-result methods, we construct transactions by reissuing queries to the literature digital library.
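The two log-based methods can be illustrated with a short sketch. It assumes, hypothetically, that a session has already been reduced to an ordered list of ("lookup", query) and ("article", article_id) events; the function names are not from the original system. Article accesses that precede any lookup are treated as a query without a lookup access, and queries from which no article is chosen yield no transaction.

```python
def query_chosen(session):
    """One transaction per query: the articles chosen after each lookup access."""
    transactions, current = [], []
    for kind, value in session:
        if kind == "lookup":
            if current:                 # close the previous query, if it chose anything
                transactions.append(current)
            current = []
        elif kind == "article" and value not in current:
            current.append(value)
    if current:
        transactions.append(current)
    return transactions


def session_chosen(session):
    """One transaction per session: every distinct article chosen in it."""
    chosen = []
    for kind, value in session:
        if kind == "article" and value not in chosen:
            chosen.append(value)
    return [chosen] if chosen else []
```

The query-result and session-result methods cannot be derived from the log alone, since they require the result lists obtained by reissuing the queries.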

As mentioned, the query-chosen and session-chosen methods incorporate knowledge of human selection in making the recommendations. We expect that they will yield more effective recommendations than their counterparts without such knowledge, namely the query-result and session-result methods. We therefore form the following hypothesis:

H1. Users tend to browse the metadata of only the articles they find of interest. Recommendation schemes that consider the Web accesses of these articles will result in more effective recommendations.

Mining the literature usage log

Several approaches have been proposed in the literature (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001) for identifying aggregate usage profiles from Web usage logs. Aggregate usage profiles can be represented in the form of association rules, sequential patterns (or an n-gram Markov model), or clusters of Web pages. In the context of literature digital libraries, however, we decided not to consider sequential patterns, because the order of articles in a transaction may not relate to users' preferences. Instead, only association rules and clusters of articles will be discussed.

The problem of finding frequent associations between items in a transaction database, called the association-rule discovery problem, was first introduced by Agrawal et al. (1993). Association-rule discovery methods


such as the Apriori algorithm (Agrawal et al., 1993; Agrawal and Srikant, 1994) are typically decomposed into two phases: the extraction of frequent itemsets and the generation of strong association rules. In the initial extraction phase, the methods find sets of items that frequently occur together. The support of each itemset $I = X \cup Y$, denoted as Sup(I), is the fraction of transactions containing both X and Y. This itemset I is referred to as a frequent itemset if Sup(I) exceeds a user-specified minimal support threshold, Minsup.

In the second phase, each discovered itemset $I = X \cup Y$ is used to construct association rules of the form $X \Rightarrow Y$. The confidence of each rule, denoted as $\mathrm{Conf}(X \Rightarrow Y)$, is the fraction of transactions containing X that also contain Y. An association rule is said to be strong if its confidence exceeds a user-specified confidence threshold. By applying association-rule discovery algorithms to the transactions of the transformed logs, we can find association rules of the form $\{a_1, a_2, \ldots, a_m\} \Rightarrow a$, which can be used to recommend article a to users who have browsed $a_1, a_2, \ldots, a_m$ but not a.
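As a small worked illustration of the two measures, the following sketch computes support and confidence directly from their definitions. It is not an implementation of the Apriori algorithm; the function names are hypothetical and each transaction is assumed to be a set of article identifiers.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= transaction for transaction in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of the transactions containing X that also contain Y.

    Assumes the antecedent occurs in at least one transaction.
    """
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(xy, transactions) / support(x, transactions)
```

For example, if articles a and b appear together in two of four transactions and a appears in three, then Sup({a, b}) = 0.5 and Conf({a} ⇒ {b}) = 0.5 / 0.75 ≈ 0.67.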

The traditional approaches for identifying itemsets with a uniform minimum support threshold, however, cannot be directly applied, because articles that arrive later tend to have smaller support even if they are actually more popular. Therefore, a nonuniform support threshold scheme, originally proposed in Liu et al. (1999), is adopted. In this scheme, each item is assigned a distinct minimum support value (called the minimum item support, MIS), and the minimum support of an itemset is the minimum of the MIS values of its constituent items. In the literature recommendation system, we view the MIS value of an article as a function of its creation time. That is, articles that are added to the digital library more recently should be assigned smaller MIS values. Let N(t) be the number of transactions after time t in the Web usage logs. The MIS of an article a is defined as follows:

$$\mathrm{MIS}(a) = \begin{cases} M(a), & \text{if } M(a) > LS \\ LS, & \text{otherwise} \end{cases}$$

$$M(a) = \frac{N(\mathrm{CreationTime}(a))}{N(0)} \cdot \mathit{Minsup}$$

where LS is the lower bound for support values, Minsup is the minimum support threshold based on the entire article browsing log, and N(0) denotes the total number of transactions in the Web usage log. Both LS and Minsup are user-defined constants. After assigning the MIS values of all articles in the literature digital library, the method proposed in Liu et al. (1999) can be applied to derive the association rules for articles.
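The MIS assignment can be sketched in a few lines. The sketch assumes that transaction timestamps are positive numbers, so that N(0) counts every transaction; the helper names make_n and mis are hypothetical.

```python
from bisect import bisect_right


def make_n(transaction_times):
    """Build N(t): the number of transactions that occur after time t."""
    times = sorted(transaction_times)

    def n(t):
        return len(times) - bisect_right(times, t)

    return n


def mis(creation_time, n, minsup, ls):
    """Minimum item support of an article created at `creation_time`.

    M(a) scales Minsup by the share of transactions logged after the
    article was created; LS is the lower bound on the result.
    """
    m = n(creation_time) / n(0) * minsup
    return m if m > ls else ls
```

With ten transactions at times 1 to 10, Minsup = 0.02 and LS = 0.005, an article created at time 5 gets M(a) = 5/10 x 0.02 = 0.01, whereas an article created at time 9 gets M(a) = 0.002 and is therefore assigned LS = 0.005.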

For the clustering technique, we adopt the Association Rule Hypergraph Partitioning (ARHP) approach (Mobasher et al., 1999, 2000) rather than traditional clustering techniques. The main reason for this is that ARHP is more efficient in handling high-dimensional data such as those present in literature digital libraries. The dimensions of a transaction are the set of articles, which is huge for a large-scale digital library. This approach starts with the identification of frequent itemsets (as in association-rule discovery methods), each of which contains articles often accessed together in transactions. Each such frequent itemset is then viewed as a hyperedge with a specific weight.

As mentioned, each article has a distinct creation time. We therefore normalise the support values of itemsets as follows before computing their weights:

$$\mathrm{Sup}'(a_i) = \frac{N(0)}{N(\mathrm{CreationTime}(a_i))} \cdot \mathrm{Sup}(a_i)$$

$$\mathrm{Sup}'(a_1, \ldots, a_k) = \frac{N(0)}{N\left(\max_{1 \le i \le k} \mathrm{CreationTime}(a_i)\right)} \cdot \mathrm{Sup}(a_1, \ldots, a_k)$$

There are several ways to define the weight of an itemset, such as using either the support for or the interest in the itemset. The former favours itemsets of smaller size, whereas the latter gives priority to larger itemsets. We define a general weighting formula that covers the broad spectrum between these two extremes. In addition, the supports for or interests in different itemsets can have very diverse values. To prevent itemsets of large weight from dominating the subsequent clustering procedure, we apply the logarithm to the weight. The following is our definition of the weight of an itemset $(a_1, a_2, \ldots, a_k)$:


$$\mathrm{weight}(a_1, a_2, \ldots, a_k) = \log\left(\frac{\mathrm{Sup}'(a_1, \ldots, a_k)}{\left[\mathrm{Sup}'(a_1) \cdot \mathrm{Sup}'(a_2) \cdots \mathrm{Sup}'(a_k)\right]^{\alpha}} \times \frac{1}{\mathit{Minsup}}\right)$$

where $0 \le \alpha \le 1$. Note that when $\alpha = 0$ or $\alpha = 1$, this formula is equivalent to using support or interest as the weight, respectively. 1/Minsup is a constant that keeps the weight non-negative. This definition supports our following hypotheses:

H2. A better recommendation effectiveness can be achieved by striking a balance between using support and interest as the weight of an itemset.

H3. A better recommendation effectiveness can be achieved by incorporating the logarithm function in the weight of an itemset.
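The normalisation and weighting steps can be illustrated with a brief sketch. It takes already computed support values as input; the function names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated. The use_log switch corresponds to the comparison made under H3.

```python
import math


def normalised_support(sup, n_total, n_since_newest):
    """Scale a raw support by N(0) / N(creation time of the newest article)."""
    return n_total / n_since_newest * sup


def itemset_weight(sup_itemset, sup_items, alpha, minsup, use_log=True):
    """Weight of a hyperedge from normalised supports.

    alpha = 0 reduces to (scaled) support, alpha = 1 to (scaled) interest;
    values in between trade the two off.
    """
    ratio = sup_itemset / (math.prod(sup_items) ** alpha) / minsup
    return math.log(ratio) if use_log else ratio
```

For an itemset with normalised support 0.02 whose two articles have normalised supports 0.1 and 0.2, and Minsup = 0.01, the weight is log(2) for α = 0 and log(100) for α = 1.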

After deciding the weight of each itemset, the hypergraph partitioning algorithm proposed in Karypis (2002) is applied to partition the set of articles into disjoint clusters. Articles in the same cluster are more "similar" in the sense that they are more likely to be accessed together in the same transaction. To reflect the fact that an article may indeed interest more than one group of users, we adopt the same heuristic as used in Mobasher et al. (1999) by adding articles back to clusters, which results in overlapping clusters. Specifically, for a given hyperedge, if the percentage of its vertices that lie in a cluster is larger than a threshold, the other vertices of the hyperedge are included in the same cluster.
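The add-back heuristic can be sketched as follows, under one reading of the description: the overlap of each hyperedge is measured against the original disjoint clusters, not against clusters that have already grown. The function name and the data layout (clusters as sets, hyperedges as frozensets) are assumptions for illustration.

```python
def add_back(clusters, hyperedges, threshold):
    """Return overlapping clusters obtained by adding hyperedge vertices back.

    For each cluster, any hyperedge whose share of vertices inside the
    original cluster exceeds `threshold` contributes its remaining vertices.
    """
    grown = [set(cluster) for cluster in clusters]
    for original, result in zip(clusters, grown):
        for edge in hyperedges:
            share = len(edge & original) / len(edge)
            if share > threshold:
                result |= edge
    return grown
```

With clusters {a, b} and {c, d}, a hyperedge {a, b, c} and a threshold of 0.5, article c is added to the first cluster because two-thirds of the hyperedge already lies in it.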

Online recommendations

We propose two recommendation approaches that use the article association rules and article clusters obtained by the methods described above. The goal is to recommend the top-N articles that potentially interest the active user. The first approach makes use of article association rules. The idea is to treat each frequent itemset as the interest profile of a user group and to recommend articles based on the similarity between the current session of the active user and the interest profiles of the relevant user groups. Specifically, let s be the active user's current session of length k. We first identify the set of frequent itemsets of size k + 1 that contain all elements in s and an extra element m (not in s). For each such itemset I, the confidence of the rule $I - \{m\} \Rightarrow \{m\}$ is calculated. These extra elements are then recommended to the user in descending order of confidence value. If these elements are not sufficient (i.e. there are fewer than N of them), we then search for frequent itemsets of size k that contain k - 1 elements in s and an extra element (not in s). Again, these extra elements are recommended to the user in descending order of confidence value. This procedure continues until N articles are recommended.
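The back-off procedure can be sketched as below. This is a simplified illustration, not the system's implementation: it assumes the frequent itemsets are available as a dictionary from frozensets to supports, skips a rule whose antecedent support is not in that dictionary (which can happen under multiple minimum supports), and breaks ties by article identifier. The function name is hypothetical.

```python
from itertools import combinations


def recommend_by_rules(session, frequent, top_n):
    """Recommend up to `top_n` articles from frequent itemsets.

    Starts with all k session articles as the antecedent, then backs off to
    subsets of size k - 1, k - 2, ... until enough articles are found.
    """
    if top_n <= 0:
        return []
    session = list(dict.fromkeys(session))        # drop repeats, keep order
    recommended = []
    for size in range(len(session), 0, -1):
        scored = {}
        for subset in combinations(session, size):
            base = frozenset(subset)
            if base not in frequent:
                continue
            for itemset, sup in frequent.items():
                if len(itemset) == size + 1 and base < itemset:
                    (extra,) = itemset - base     # the single extra element m
                    if extra in session or extra in recommended:
                        continue
                    conf = sup / frequent[base]   # Conf(I - {m} => {m})
                    scored[extra] = max(conf, scored.get(extra, 0.0))
        for article in sorted(scored, key=lambda a: (-scored[a], a)):
            recommended.append(article)
            if len(recommended) == top_n:
                return recommended
    return recommended
```

The nested scan over all frequent itemsets makes the cost grow with both the session length and the number of frequent itemsets.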

Our other proposed method uses a hypergraph-based approach. In this approach, the recommendation score of each article a is computed by considering the similarity between the current user session and the clusters C to which a belongs, and the coherence weight of a with respect to C. Specifically, each cluster of articles can be viewed as a vector with binary elements, each of which indicates whether an article appears in the cluster. Similarly, the current user session can also be represented as a vector. Then, the similarity between the current session S and a cluster C can be defined as a cosine function as follows:

$$\mathrm{match}(S, C) = \frac{\sum_k a^C_k \times S_k}{\sqrt{\sum_k (S_k)^2 \times \sum_k (a^C_k)^2}}$$

where $S_k$ is the k'th element in S and $a^C_k$ is the k'th element in C.

The coherence weight of an article a with respect to the cluster C that it belongs to is defined as:

$$\mathrm{weight}(a, C) = \frac{\sum_{e \subseteq C,\; a \in e} \mathrm{weight}(e)}{\sum_{e \subseteq C} \mathrm{weight}(e)}$$

where weight(e) is the weight of a hyperedge e.

The recommendation score Rec(S, a) of an article a with respect to the current user session S is then defined as:

$$\mathrm{Rec}(S, a) = \max_{C:\; a \in C} \sqrt{\mathrm{match}(S, C) \times \mathrm{weight}(a, C)}$$

The top-N articles for recommendation are those with the N highest recommendation scores.
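The scoring scheme can be sketched as follows. For binary vectors, the cosine similarity reduces to the number of shared articles divided by the square root of the product of the two set sizes, which is what match computes here. The function names, the dictionary of hyperedge weights, and the decisions to skip articles already in the session and articles with a zero score are assumptions for illustration.

```python
import math


def match(session, cluster):
    """Cosine similarity between the binary vectors of a session and a cluster."""
    session, cluster = set(session), set(cluster)
    if not session or not cluster:
        return 0.0
    return len(session & cluster) / math.sqrt(len(session) * len(cluster))


def coherence(article, cluster, hyperedges):
    """Share of the hyperedge weight inside `cluster` that involves `article`."""
    cluster = set(cluster)
    inside = {edge: w for edge, w in hyperedges.items() if edge <= cluster}
    total = sum(inside.values())
    if total == 0:
        return 0.0
    return sum(w for edge, w in inside.items() if article in edge) / total


def recommend_by_clusters(session, clusters, hyperedges, top_n):
    """Rank unseen articles by the best sqrt(match * coherence) over their clusters."""
    scores = {}
    for cluster in clusters:
        similarity = match(session, cluster)
        for article in cluster:
            if article in session:
                continue
            score = math.sqrt(similarity * coherence(article, cluster, hyperedges))
            scores[article] = max(score, scores.get(article, 0.0))
    ranked = sorted((a for a in scores if scores[a] > 0), key=lambda a: (-scores[a], a))
    return ranked[:top_n]
```

Because the work is one pass over the clusters, the cost of a recommendation does not depend on the number of frequent itemsets.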


The hypergraph-based approach is more carefully designed than the association-rule-based approach, and we expect the former to perform better.

H4. The hypergraph-based approach will result in more effective recommendations and will have a quicker response time than the association-rule-based approach.

Empirical evaluations

This section reports our experience in applying the proposed literature recommendation system to the Web usage logs of NSYSU-ETD. The main objective was to test our four hypotheses. NSYSU-ETD runs on PC Solaris 2.7 and uses Apache 1.3.9 as the Web server. Since being commissioned in May 2000, it has been loaded with more than 3,000 electronic theses of National Sun Yat-sen University. Up to February 2003, these theses had been browsed more than 400,000 times and downloaded more than 100,000 times. We analysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002 for our experiments: the data collected from February 1 to April 30 were designated as the training data set, and those collected in May served as the test data set.

We first applied the data cleansing technique to the training data and obtained 43,349 lookup accesses and 41,627 article accesses. Applying the session identification technique revealed 16,922 user sessions, among which 392 sessions were robot generated, 6,068 sessions contained only one article access, and 5,253 sessions contained no article accesses. We eliminated these trivial user sessions and applied transaction identification techniques, resulting in 5,617 transactions for the query-chosen method, 5,272 transactions for the session-chosen method, and 17,742 transactions for the query-result method. Queries whose results are never chosen by the users are removed from the query-chosen method but remain in the query-result method. The session-result method produced transactions of huge size, each containing thousands of article accesses. We therefore decided not to consider this method in the subsequent experiments.

The two proposed methods for mining literature usage logs both require the identification of frequent itemsets from transactions, which needs the minimum support to be specified. However, the three transaction identification methods under comparison have different numbers of transactions. To be fair, we specify a different minimum support threshold for each method such that the total number of articles involved in large two-item sets of each method - called recommendable articles - is approximately the same. Our recommendation framework recommends an article only if it is associated with other articles a sufficient number of times in the Web usage log. Therefore, articles that are not involved in large two-item sets cannot possibly be recommended. Table I shows the specified minimum support threshold and the number of recommendable articles of each method.

To illustrate how we conducted experiments, we define the following notation: let $T_{eval}$ be the set of transactions in the test set, $t_{eval}$ be a transaction in $T_{eval}$, and $a_t(i)$ be the i'th article in $t_{eval}$. Given a window size $W_{size}$, we divide each transaction $t_{eval}$ in the test data set into two lists, $t_{eval}[W]$ and $t_{eval}[R]$, where $t_{eval}[W]$ is the first $W_{size}$ article accesses of $t_{eval}$ and $t_{eval}[R]$ is the remaining articles. By treating $t_{eval}[W]$ as the current session, the recommender system will choose the set $t_{pr}$ of top-N articles for recommendation.

The performance metric we adopted for measuring the quality of recommendation is the precision and recall scheme. The precision is the ratio of the number of recommended articles accessed by a user to the total number of recommended articles, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{pr}|$, and recall is the ratio of the number of recommended articles accessed by a user to the total number of articles of interest to the user, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{eval}[R]|$. The precision (recall) of a recommendation approach is the average precision (recall) of all transactions in the test set.
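The evaluation protocol can be sketched as follows. The recommender is passed in as a function taking the window and N; the function names are hypothetical, and skipping test transactions that are no longer than the window (and so have no remaining articles to predict) is an assumption not stated in the text.

```python
def precision_recall(transaction, window_size, recommend, top_n):
    """Precision and recall for one test transaction.

    The first `window_size` accesses act as the current session; the
    remaining accesses are the articles the user actually went on to view.
    """
    window = transaction[:window_size]
    remaining = set(transaction[window_size:])
    recommended = set(recommend(window, top_n))
    hits = len(recommended & remaining)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(remaining) if remaining else 0.0
    return precision, recall


def evaluate(transactions, window_size, recommend, top_n):
    """Average precision and recall over the usable test transactions."""
    usable = [t for t in transactions if len(t) > window_size]
    if not usable:
        return 0.0, 0.0
    pairs = [precision_recall(t, window_size, recommend, top_n) for t in usable]
    return (sum(p for p, _ in pairs) / len(pairs),
            sum(r for _, r in pairs) / len(pairs))
```

For instance, if the transaction is [a, b, c, d] with a window of 2 and the recommender returns {c, x}, one of the two recommendations is hit and one of the two remaining articles is covered, giving a precision and a recall of 0.5 each.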

Test of H1

We first evaluate the performance impact of the three transaction identification methods. In this experiment, $\alpha$ was set to 0.5, and the logarithm was taken when computing the weight of an itemset. Figure 3(a, b) shows the precisions and recalls, respectively, under association-rule-based recommendation.


The precisions and recalls under hypergraph-based recommendation are shown in Figure 4(a, b), respectively.

Overall, the trends are the same for the association-rule-based and hypergraph-based approaches. It can be clearly seen that both the query-chosen and session-chosen methods outperform the query-result method in terms of both precision and recall. This implies that the information about articles browsed plays a crucial role in making recommendations - both the query-chosen and session-chosen methods incorporate this information in forming transactions. We therefore accept H1. However, the performance difference between the query-chosen and session-chosen methods is not significant.

Test of H2 and H3

We then conducted experiments to shed light on the impact of $\alpha$ and the logarithmic

Figure 3 (a) Precisions of the association-based approach (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Method                   Minimum support (per cent)   No. of recommendable articles
Session-chosen method    0.16                         253
Query-chosen method      0.12                         250
Query-result method      3.3                          229


function in the weight definition for the hypergraph-based recommendation approach. Figure 5(a, b) shows the precision and recall values when $\alpha$ = 0, 0.5 and 1, using the session-chosen method for transaction identification. The window size was set at 2. As can be seen, differences in the precision and recall under different settings of $\alpha$ are very small. We performed the same experiments for different transaction identification methods and window sizes and obtained similar results. H2 is therefore rejected.

We then computed the precisions and recalls with and without application of the logarithmic function to the itemset weight. Figure 6(a, b) shows the resulting precisions and recalls, respectively.

Figure 6 shows that applying the logarithmic function to the itemset weight definition achieves significantly better precision and recall values. This meets our expectation, and H3 is accepted.

Test of H4

Finally, we evaluated the impact of the association-rule-based and hypergraph-based recommendation approaches for different window sizes. Figure 7(a, b) shows the precisions and recalls for window sizes of 2, 3, 4, 5 and 6 for a top-15 recommendation. It can be seen that the hypergraph-based approach performs better than the association-rule-based approach, especially for larger window sizes.

We also compared the running times of both approaches while setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a

Figure 5 (a) Impact of $\alpha$ on precisions and (b) impact of $\alpha$ on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification


result of an increased window size), the association-rule-based approach incurs a longer running time.

Overall, we conclude that the hypergraph-based approach is more attractive, since it yields better-quality article recommendations and has a more consistent running time. Thus, H4 is accepted.

Conclusions

In this paper, we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:

(1) data preparation of the Web logs;
(2) usage log mining; and
(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches - association-rule based and hypergraph based - for making recommendations. These alternatives and approaches were evaluated using the Web

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values


usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification, and that the hypergraph-based approach yields better-quality article recommendations and exhibits a more consistent running time, and thus is more scalable.

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As is evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.

References

Agrawal, R., Imielinski, T. and Swami, A. (1993), "Mining association rules between sets of items in large databases", in Buneman, P. and Jajodia, S. (Eds), Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 26-28, ACM Press, New York, NY, pp. 207-16.

Agrawal, R. and Srikant, R. (1994), "Fast algorithms for mining association rules", in Bocca, J.B., Jarke, M. and Zaniolo, C. (Eds), Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, Santiago, Chile, Morgan Kaufmann, San Francisco, CA, pp. 487-99.

Alspector, J., Kolcz, A. and Karunanithi, N. (1998), "Comparing feature-based and clique-based user models for movie selection", Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, ACM Press, New York, NY, pp. 11-18.

Andresen, D., Carver, L., Dolin, R., Fischer, C., Frew, J., Goodchild, M., Ibarra, O., Kothuri, R., Larsgaard, M., Nebert, D., Simpson, J., Smith, T., Yang, T. and Zheng, Q. (1995), "The WWW prototype of the Alexandria digital library", Proceedings of the International Symposium on Digital Libraries, Tsukuba, Japan, 22-5 August, pp. 17-27.

Ansari, A., Essegaier, S. and Kohli, R. (2000), "Internet recommendation systems", Journal of Marketing Research, Vol. 37 No. 3, pp. 67-85.

Arms, W. (2000), Digital Libraries, MIT Press, Cambridge, MA.

Armstrong, R., Freitag, D., Joachims, T. and Mitchell, T. (1997), "WebWatcher: a learning apprentice for the World Wide Web", AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, pp. 6-12.

Balabanović, M. and Shoham, Y. (1997), "Fab: content-based, collaborative recommendation", Communications of the ACM, Vol. 40 No. 3, pp. 66-72.

Baldonado, M., Chang, C., Gravano, L. and Paepcke, A. (1997), "Metadata for digital libraries: architecture and design rationale", Proceedings of the 2nd ACM International Conference on Digital Libraries, ACM Press, New York, NY, pp. 47-56.

Basu, C., Hirsh, H. and Cohen, W. (1998), "Recommendation as classification: using social and content-based information in recommendation", Proceedings of the 15th National Conference on Artificial Intelligence, 26-30 July, Madison, WI, AAAI Press, Menlo Park, CA, pp. 714-20.

Billsus, D. and Pazzani, M. (1999), "A hybrid user model for news story classification", in Kay, J. (Ed.), Proceedings of the 7th International Conference on User Modelling, Banff, Canada, 20-4 June, Springer-Verlag, New York, NY, pp. 99-108.

Bowman, C., Manber, P. and Schwartz, U. (1994), "Scalable Internet resources discovery: research problems and approaches", Communications of the ACM, Vol. 37 No. 8, pp. 98-107.

Breese, J., Heckerman, D. and Kadie, C. (1998), "Empirical analysis of predictive algorithms for collaborative filtering", Technical Report MSR-TR-98-12, Microsoft Research, Seattle, CA.

Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A. and Lin, C. (1996), "A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois digital library initiative project", IEEE Transactions on PAMI, Vol. 18 No. 8, pp. 17-34.

Cooley, R., Mobasher, B. and Srivastava, J. (1999), "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1 No. 1, pp. 5-32.

Furner, J. (2002), "On recommending", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 747-63.

Goldberg, D., Nichols, D., Oki, B. and Terry, D. (1992), "Using collaborative filtering to weave an information tapestry", Communications of the ACM, Vol. 35 No. 12, pp. 61-70.

Herlocker, J. and Konstan, J. (2001), "Content-independent task-focused recommendation", IEEE Internet Computing, Vol. 5 No. 6, pp. 40-7.

Karypis, G. (2002), "Multilevel hypergraph partitioning", Tech. Report TR02-25, Department of Computer Science, University of Minnesota, MN.

Kessler, J. (1996), Internet Digital Libraries: The International Dimension, Artech House Publishers, Norwood, MA.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997), "GroupLens: applying collaborative filtering to Usenet news", Communications of the ACM, Vol. 40 No. 3, pp. 77-87.


Lang K (1995) ` Newsweeder learning to filternetnewsrsquorsquo in Prieditis A and Russell S (Eds)Proceedings of the 12th International Conference onMachine Learning Lake Tahoe Morgan KaufmannSan Francisco CA pp 331-9

Liu B Hsu W and Ma Y (1999) ` Mining associationrules with multiple minimum supportsrsquorsquo Proceedingsof the 5th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining 15-18

August San Diego ACM Press New York NYpp 430-4

Loeb S and Terry D (1992) ` Information filteringrsquorsquoCommunications of the ACM Special Issue onInformation Filtering Vol 35 No 12 pp 26-8

Mobasher B Cooley R and Srivastava J (1999)` Creating adaptive Web sites through usage-basedclustering of URLsrsquorsquo Proceedings of the IEEEKnowledge and Data Engineering ExchangeWorkshop

Mobasher B Dai H Luo T Nakagawa M andWiltshire J (2000) ` Discovery of aggregate usageprofiles for Web personalisationrsquorsquo Proceedings ofthe webKDD Workshop

Mooney R and Roy L (2000) ` Content-based bookrecommending using learning for text

categorisationrsquorsquo Proceedings of the 5th ACMConference on Digital Libraries San Antonio ACMPress New York NY pp 195-204

Pazzani, M. and Billsus, D. (1997), "Learning and revising user profiles: the identification of interesting Web sites", Machine Learning, Vol. 27 No. 4, pp. 313-31.

Pazzani, M. (1999), "A framework for collaborative, content-based and demographic filtering", Artificial Intelligence Review, Vol. 13 No. 5/6, pp. 393-408.

Pennock, D., Horvitz, E., Lawrence, S. and Giles, C. (2000), "Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach", Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, 30 June-3 July, Morgan Kaufmann, San Francisco, CA, pp. 473-80.

Pitkow, J. and Pirolli, P. (1999), "Mining longest repeating subsequences to predict World Wide Web surfing", Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 139-150.

Schafer, J., Konstan, J. and Riedl, J. (2001), "E-commerce recommendation applications", Data Mining and Knowledge Discovery, Vol. 5 No. 1, pp. 10-22.

Shardanand, U. and Maes, P. (1995), "Social information filtering: algorithms for automating 'word of mouth'", Proceedings of the Conference on Human Factors in Computing Systems, Denver, CO, ACM Press, New York, NY, pp. 210-17.

Spink, A., Wilson, T., Ford, N., Foster, A. and Ellis, D. (2002), "Information seeking and mediated searching", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 695-703.

Srivastava, J., Cooley, R., Deshpande, M. and Tang, P. (2000), "Web usage mining: discovery and applications of usage patterns from Web data", SIGKDD Explorations, Vol. 1 No. 2, pp. 12-23.

Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. (1997), "PHOAKS: a system for sharing recommendations", Communications of the ACM, Vol. 40 No. 3, pp. 59-62.

Wilensky, R. (1996), "Toward work-centred digital information services", IEEE Computer, Vol. 29 No. 5, pp. 7-44.

Yan, T., Jacobsen, M., Molina, H. and Dayal, U. (1996), "From user access patterns to dynamic hypertext linking", Proceedings of the 5th International World Wide Web Conference, pp. 1007-14.

Yang, Q., Zhang, H.H. and Li, T. (2001), "Mining Web logs for prediction models in WWW caching and prefetching", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-8.



Pennock et al. (2000) proposed using a collaborative approach for recommending articles in CiteSeer (www.citeseer.com). Their approach implicitly derives users' ratings of articles by observing their actions when viewing an article. Each action is assigned a weight. For example, adding a document to a profile produces a two-point increment, downloading a document a one-point increment, and ignoring a recommendation a one-point decrement in the rating score. However, this approach suffers from a shortage of negative examples, and the method is applicable only to individual members who are willing to identify themselves each time they use the digital library.

We consider that traditional recommendation techniques are not suitable for recommending articles in digital libraries. First, both content-based and collaborative approaches require that users' rating scores on selected items (including both positive and negative instances) are available for analysis. For a typical literature digital library, requiring users to rate some articles before making a recommendation is not realistic. Second, identifying an individual user of a literature digital library is generally not possible, since many literature digital libraries are freely available on the Internet and users can search or browse articles without having to identify themselves. Even for proprietary literature digital libraries, many users gain access via site subscriptions, making it difficult to track an individual's (long-term) browsing behaviour.

For the reasons mentioned above, we propose making use of a task-focused approach (Herlocker and Konstan, 2001). In this approach, a task profile (a set of recently accessed items), rather than the long-term interest profile, is used to make recommendations. One notable implementation of this approach is Web usage mining, which aims to identify interesting usage patterns of a Web-based system from the Web usage logs that record interactions between users and Web pages (Srivastava et al., 2000). Recently, several approaches have been proposed for recommending Web pages based on the Web page associations discovered by Web-usage mining algorithms (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001). While these approaches vary in their details, they follow the same recommendation framework, which starts with the identification of aggregate usage profiles of Web pages by some data mining method. They then make recommendations by looking into the similarity between the set of recently accessed Web pages of an active user and the collected aggregate usage profiles.

Obviously, literature digital libraries store articles rather than Web pages, and articles differ from Web pages in several respects:
(1) Web pages are more diversified: some serve as index pages, some are content pages, and others have a mixture of indexes and content. Literature articles, on the other hand, are more homogeneous in structure and are therefore more likely to have the same set of metadata features.
(2) A Web site can be viewed as a directed graph whose vertices are Web pages, while a literature digital library is better visualised as a set of articles.
(3) Literature articles are often retrieved by search queries provided by the system, while Web pages are often browsed through a static site topology.
(4) Literature articles are incrementally inserted into the digital library at a faster rate than Web pages are inserted into a Web site.

Contributions

The above considerations indicate that literature recommendation services require a different technical approach. Here we describe a recommendation framework for recommending articles in a literature digital library. Several alternatives are proposed for implementing the constituent components of the recommendation framework. These alternatives are compared and analysed by applying the Web usage logs collected from the electronic thesis and dissertation (ETD) system at National Sun Yat-sen University (NSYSU-ETD). We have incorporated the proposed recommendation framework into NSYSU-ETD (www.lib.nsysu.edu.tw/eThesys/english/default_e.htm); the corresponding implementation status is also reported.

This paper is structured as follows. Next, the overall architecture is described, followed by detailed design and construction methods. Then implementation experience and evaluation results are presented. The final part summarises this paper and discusses our future research directions.

Architecture

The overall architecture of the literature recommendation system, shown as Figure 1, consists of two basic subsystems, offline and online:
(1) The offline subsystem comprises two sequentially executed tasks: data preparation and log mining. Although Web usage logs are potentially able to provide useful knowledge for making recommendations, the raw log data cannot be used before appropriate pre-processing. Therefore, we first convert raw Web usage logs into a set of user transactions before performing the log mining task. The objectives of the log mining task include the discovery of article association rules and the derivation of article clusters.
(2) The online subsystem interacts with an active user and provides recommended articles in real time. It keeps track of the set of articles browsed recently by the active user by consulting the current Web usage log provided by the Web server. Then, by comparing the similarities between this set and the article clusters (or associations) produced by the offline subsystem, the online subsystem recommends articles in the clusters (associations) that are highly similar.

We have incorporated the literature recommendation system into NSYSU-ETD. Figure 2 depicts a page view of an article in NSYSU-ETD. The Web page comprises two frames: the left frame displays the metadata of the browsed article, and the right frame shows the unseen articles (up to 15) recommended by our literature recommendation system, displayed in order of relevance. In this example, the active user had reviewed two articles and the literature recommendation system has recommended another seven articles. Once the active user browses another article, the content of the recommendation frame is updated accordingly.

Approaches

In this section, we first describe the tasks conducted in the offline subsystem of the literature recommendation system and then those conducted in the online recommendation subsystem.

Figure 1 Architecture of a literature recommendation system


Data preparation for the literature usage log

To prepare data from the Web usage logs of a literature digital library, we basically follow the heuristics adopted by Cooley et al. (1999) for processing Web usage logs that involve static Web pages. In their work, the Web usage logs are assumed to be in the extended NCSA format (including referrer and agent fields). The approach contains three sequential steps: data cleansing, user session identification and transaction identification. The objective of data cleansing is to prune out unwanted Web log records and to add back missing Web log records: some Web log records are surplus as they are accesses to non-HTML pages (e.g. images and other http requests involving no Web page accesses), while other Web log records are missing due to the existence of local caches, firewalls and proxy servers. Identifying missing Web log records is especially difficult, and several heuristics have been proposed for achieving this. However, we found this difficulty nonexistent when processing the Web usage log of NSYSU-ETD, because article Web pages are dynamically generated and are not cacheable. Our university ETD system is database driven, in that the theses' metadata are stored in a DBMS. Most large-scale digital libraries adopt the same method for maintaining their collections. In the context of a literature digital library, we are concerned with, and retain, only the Web usage records that involve the following two types of accesses:
(1) Lookup accesses. Each lookup access is an execution of a CGI program with searching or browsing conditions specified in the parameters. An example lookup access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/search_by_advisor?advisor_name=San-Yih+Hwang, which lists all theses supervised by Professor San-Yih Hwang.
(2) Article accesses. Each article access is an execution of a CGI program that displays the detailed metadata on an article. An example article access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0726100-135739, which shows the metadata of the thesis whose URN is etd-0726100-135739.
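As an illustration of how the two access types might be told apart during data cleansing, the following minimal sketch classifies a requested URL. It assumes that the CGI script name is the last path component, that article accesses use a script called view_etd with a URN parameter, and that lookup scripts have names starting with "search"; the function name classify_access is hypothetical and is not part of NSYSU-ETD.

```python
from urllib.parse import urlparse, parse_qs


def classify_access(url):
    """Return ('article', urn), ('lookup', params) or ('other', None).

    Assumption: the script name is the last path component of the URL.
    """
    parsed = urlparse(url)
    script = parsed.path.rsplit("/", 1)[-1]
    params = parse_qs(parsed.query)
    if script == "view_etd" and "URN" in params:
        # Article access: the URN identifies the thesis being viewed.
        return ("article", params["URN"][0])
    if script.startswith("search"):
        # Lookup access: keep the search conditions for later use.
        return ("lookup", {key: values[0] for key, values in params.items()})
    # Images and any other requests are dropped by data cleansing.
    return ("other", None)
```

Records classified as "other" would simply be discarded, which corresponds to pruning surplus log records.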

Figure 2 Page view of an article in NSYSU-ETD

Note that article accesses display the detailed metadata of articles and therefore are of primary interest. Lookup accesses provide lookup information to facilitate browsing or searching and can thus be considered auxiliary. Most of the time, a user will first execute a lookup access, followed by a selective list of article accesses. We found that some user sessions in the Web usage log of NSYSU-ETD contained article accesses without prior lookup accesses. This is because several information sources had provided hyperlinks directly to articles' page views. In this case, each session can be viewed as a list of queries. Each query (optionally) starts with a lookup access, followed by a list of article accesses related to the articles that the user chose to look at in more detail.

The goal of user session identification is to divide the article accesses of each user into individual sessions. It is reasonable to assume that two records with different IP addresses, browsers or operating systems belong to two different user sessions. In addition, the time interval between two consecutive requests in a user session should not be too large. As in many commercial products, we use 30 minutes as the default timeout period. When the time interval between the current access and the previous one exceeds this, a new user session is assumed to have started. Some of the identified user sessions are made by Internet robots and hence should not be considered. Some robots have known agent types and/or IP addresses and can be easily identified. Analysis of the user sessions of these known robots revealed that most of these sessions either have more than 100 article accesses or exhibit a mean adjacent Web page access interval of less than three seconds. User sessions that satisfy this condition are considered as robot sessions and consequently are removed.
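The session identification and robot filtering heuristics just described can be sketched as follows. This is a minimal illustration, not the NSYSU-ETD implementation: the record layout (IP address, agent string, timestamp in seconds, article identifier) and the function names are assumptions, and the agent string is taken to stand for both browser and operating system.

```python
from collections import defaultdict

TIMEOUT = 30 * 60      # 30-minute timeout, in seconds
MAX_ARTICLES = 100     # robot heuristic: more than 100 article accesses
MIN_MEAN_GAP = 3.0     # robot heuristic: mean access interval under 3 s


def identify_sessions(records):
    """Group (ip, agent, timestamp, article) records into sessions."""
    by_user = defaultdict(list)
    for ip, agent, timestamp, article in records:
        by_user[(ip, agent)].append((timestamp, article))
    sessions = []
    for accesses in by_user.values():
        accesses.sort(key=lambda access: access[0])
        current = [accesses[0]]
        for previous, access in zip(accesses, accesses[1:]):
            if access[0] - previous[0] > TIMEOUT:
                # Gap exceeds the timeout: close the session, start a new one.
                sessions.append(current)
                current = []
            current.append(access)
        sessions.append(current)
    return sessions


def is_robot(session):
    """Apply the two robot heuristics to one session."""
    if len(session) > MAX_ARTICLES:
        return True
    if len(session) < 2:
        return False
    mean_gap = (session[-1][0] - session[0][0]) / (len(session) - 1)
    return mean_gap < MIN_MEAN_GAP
```

Sessions for which is_robot returns True would be removed before transaction identification.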

Finally, a user session is further divided into a number of transactions, each of which represents a semantically meaningful unit. However, the various transaction identification approaches proposed in Cooley et al. (1999) make use of either the index (auxiliary) pages or the Web site topology. Since neither exists in the context of literature digital libraries, these proposed approaches are not applicable. Our approach identifies transactions by considering the types of accesses, namely lookup and article accesses. In fact, articles listed by the same query must have some degree of similarity in their content (e.g. keyword, title, author, discipline). On the other hand, articles selected in the same user session or query also display some degree of similarity due to inherent human behaviour. Therefore, we have four methods for defining transactions:
(1) Query-chosen method: the articles selected in a query.
(2) Session-chosen method: the articles selected in a user session.
(3) Query-result method: the articles listed in a query.
(4) Session-result method: the articles listed in queries of a user session.

For the query-chosen and session-chosen methods, article accesses present in the Web logs are grouped into a set of transactions. For the query-result and session-result methods, we construct transactions by reissuing queries to the literature digital library.
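The two "chosen" methods can be sketched as below. The sketch assumes a session has already been reduced to an ordered list of ("lookup", query) and ("article", article_id) events; this representation and the function names are illustrative assumptions. The result-based methods are not shown because they require reissuing queries to the library.

```python
def query_chosen(session):
    """One transaction per query: the articles selected after each lookup."""
    transactions, current = [], []
    for kind, value in session:
        if kind == "lookup":
            # A new query starts; keep the previous one only if it is non-empty.
            if current:
                transactions.append(current)
            current = []
        elif kind == "article" and value not in current:
            current.append(value)
    if current:
        transactions.append(current)
    return transactions


def session_chosen(session):
    """One transaction per session: every article selected in it."""
    chosen = []
    for kind, value in session:
        if kind == "article" and value not in chosen:
            chosen.append(value)
    return [chosen] if chosen else []
```

Article accesses that precede any lookup (for example, arrivals via external hyperlinks) form their own query-chosen transaction in this sketch, and a lookup with no selected article contributes nothing.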

As mentioned, the query-chosen and session-chosen methods incorporate knowledge on human selection in making the recommendations. We expect that they will yield more effective recommendations than their counterparts without such knowledge, namely the query-result and session-result methods. We therefore form the following hypothesis:

H1. Users tend to browse the metadata of only the articles they find of interest. Recommendation schemes that consider the Web accesses of these articles will result in more effective recommendations.

Mining the literature usage log

Several approaches have been proposed in the literature (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001) for identifying aggregate usage profiles from Web usage logs. Aggregate usage profiles can be represented in the form of association rules, sequential patterns (or an n-gram Markov model) or clusters of Web pages. In the context of literature digital libraries, however, we decided not to consider sequential patterns, because the order of articles in a transaction may not relate to users' preferences. Instead, only association rules and clusters of articles will be discussed.

The problem of finding frequent associations between items in a transaction database, called the association-rule discovery problem, was first introduced by Agrawal et al. (1993). Association-rule discovery methods, such as the Apriori algorithm (Agrawal et al., 1993; Agrawal and Srikant, 1994), are typically decomposed into two phases: the extraction of frequent itemsets and the generation of strong association rules. In the initial extraction phase, the methods find sets of items that frequently occur together. The support of each itemset $I = X \cup Y$, denoted as Sup(I), is the fraction of transactions containing both X and Y. This itemset I is referred to as a frequent itemset if Sup(I) exceeds a user-specified minimal support threshold, Minsup.

In the second phase, each discovered itemset $I = X \cup Y$ is used to construct association rules of the form $X \Rightarrow Y$. The confidence of each rule, denoted as Conf($X \Rightarrow Y$), is the fraction of transactions containing X that also contain Y. An association rule is said to be strong if its confidence exceeds a user-specified confidence threshold. By applying association-rule discovery algorithms to the transactions of the transformed logs, we can find association rules of the form $a_1, a_2, \ldots, a_m \Rightarrow a$, which can be used to recommend article a to users who have browsed $a_1, a_2, \ldots, a_m$ but not a.
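The two measures can be illustrated with a few lines of code. This is a direct, unoptimised sketch of the definitions of support and confidence, not an implementation of the Apriori algorithm; transactions are assumed to be collections of article identifiers.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing X that also contain Y."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)
```

For instance, with the four transactions {a1, a2, a3}, {a1, a2}, {a1, a3} and {a2, a3}, the itemset {a1, a2} has support 2/4, and the rule a1 ⇒ a2 has confidence (2/4) / (3/4) = 2/3.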

The traditional approaches for identifying itemsets with a uniform minimum support threshold, however, cannot be directly applied, because articles that arrive later tend to have smaller support even if they are actually more popular. Therefore, a nonuniform support threshold scheme, originally proposed in Liu et al. (1999), is adopted. In this scheme, each item is assigned a distinct minimum support value (called the minimum item support, MIS), and the minimum support of an itemset is the minimum of the MIS values of its constituent items. In the literature recommendation system, we view the MIS value of an article as a function of its creation time. That is, articles that are added to the digital library more recently should be assigned smaller MIS values. Let N(t) be the number of transactions after time t in the Web usage logs. The MIS of an article a is defined as follows:

\[
\mathit{MIS}(a) =
\begin{cases}
M(a) & \text{if } M(a) > LS \\
LS & \text{otherwise}
\end{cases}
\]

\[
M(a) = \frac{N(\mathit{CreationTime}(a))}{N(0)} \cdot \mathit{Minsup}
\]

where LS is the lower bound for support values, Minsup is the minimum support threshold based on the entire article browsing log, and N(0) denotes the total number of transactions in the Web usage log. Both LS and Minsup are user-defined constants. After assigning the MIS values of all articles in the literature digital library, the method proposed in Liu et al. (1999) can be applied to derive the association rules for articles.
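A minimal sketch of the MIS assignment is given below, assuming MIS(a) equals M(a) = N(CreationTime(a)) / N(0) · Minsup when this exceeds LS, and LS otherwise. The function names and the representation of the log as a sorted list of transaction timestamps are assumptions made for illustration.

```python
from bisect import bisect_right


def transactions_after(t, transaction_times):
    """N(t): number of transactions after time t (times sorted ascending)."""
    return len(transaction_times) - bisect_right(transaction_times, t)


def minimum_item_support(creation_time, transaction_times, minsup, ls):
    """MIS of an article created at creation_time."""
    n_total = len(transaction_times)  # N(0): all transactions in the log
    m = transactions_after(creation_time, transaction_times) / n_total * minsup
    # Never let the threshold fall below the lower bound LS.
    return m if m > ls else ls
```

An article present since the start of the log receives the full Minsup, whereas a recently added article receives a proportionally smaller threshold, bounded below by LS.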

For the clustering technique, we adopt the Association Rule Hypergraph Partitioning (ARHP) approach (Mobasher et al., 1999, 2000) rather than traditional clustering techniques. The main reason for this is that ARHP is more efficient in handling high-dimensional data, such as those present in literature digital libraries. The dimensions of a transaction are the set of articles, which is huge for a large-scale digital library. This approach starts with the identification of frequent itemsets (as in association-rule discovery methods), each of which contains articles often accessed together in transactions. Each such frequent itemset is then viewed as a hyperedge with a specific weight.

As mentioned, each article has a distinct creation time. We therefore normalise the support values of itemsets as follows before computing their weights:

\[
\mathit{Sup}'(a_i) = \frac{N(0)}{N(\mathit{CreationTime}(a_i))} \cdot \mathit{Sup}(a_i)
\]

\[
\mathit{Sup}'(a_1, \ldots, a_k) = \frac{N(0)}{N\left(\max_{1 \le i \le k} \mathit{CreationTime}(a_i)\right)} \cdot \mathit{Sup}(a_1, \ldots, a_k)
\]

There are several ways to define the weight of an itemset, such as using either the support for or the interest in the itemset. The former favours itemsets of smaller size, whereas the latter gives priority to larger itemsets. We define a general weighting formula that covers the broad spectrum between these two extremes. In addition, the supports for or interests in different itemsets can have very diverse values. To prevent itemsets of large weight from dominating the subsequent clustering procedure, we apply the logarithm to the weight. The following is our definition of the weight of an itemset $(a_1, a_2, \ldots, a_k)$:


\[
\mathit{weight}(a_1, a_2, \ldots, a_k) = \log\left( \frac{\mathit{Sup}'(a_1, a_2, \ldots, a_k)}{\left[\mathit{Sup}'(a_1) \cdot \mathit{Sup}'(a_2) \cdots \mathit{Sup}'(a_k)\right]^{\alpha}} \times \frac{1}{\mathit{Minsup}} \right)
\]

where $0 \le \alpha \le 1$. Note that when α = 0 or α = 1, this formula is equivalent to using support or interest as the weight, respectively. 1/Minsup is a constant that keeps the weight non-negative. This definition supports our following hypotheses:

H2. A better recommendation effectiveness can be achieved by striking a balance between using support and interest as the weight of an itemset.

H3. A better recommendation effectiveness can be achieved by incorporating the logarithm function in the weight of an itemset.
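To make the role of the two design choices in H2 and H3 concrete, the sketch below computes an itemset weight under the assumption that the weight has the form log(Sup'(I) / [Sup'(a_1) · ... · Sup'(a_k)]^α × 1/Minsup), with the logarithm optionally switched off. The parameter names (alpha, use_log) and function names are illustrative only.

```python
import math


def normalised_support(sup, n_total, n_since_creation):
    """Sup' = N(0) / N(latest creation time of the items) * Sup."""
    return n_total / n_since_creation * sup


def itemset_weight(sup_itemset, sup_items, minsup, alpha=0.5, use_log=True):
    """Weight of a hyperedge from normalised supports.

    alpha = 0 reduces to support, alpha = 1 to interest (both scaled by
    1/minsup); values in between balance the two.
    """
    denominator = math.prod(sup_items) ** alpha
    raw = sup_itemset / denominator / minsup
    return math.log(raw) if use_log else raw
```

For an itemset with normalised support 0.02, item supports 0.1 and 0.2, and Minsup 0.01, the weight is log 2 when alpha = 0 and log 100 when alpha = 1.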

After deciding the weight of each itemset, the hypergraph partitioning algorithm proposed in Karypis (2002) is applied to partition the set of articles into disjoint clusters of articles. Articles in the same cluster are more "similar" in the sense that they are more likely to be accessed together in the same transaction. To reflect the fact that an article may indeed interest more than one group of users, we adopt the same heuristic as used in Mobasher et al. (1999) by adding back articles to clusters, which results in overlapping clusters. Specifically, for a given hyperedge, if the percentage of involved vertices in a cluster is larger than a threshold, the other involved vertices are included in the same cluster.
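The add-back heuristic can be sketched as follows. The sketch assumes the disjoint clusters have already been produced by a hypergraph partitioner (not shown) and that overlap is always measured against the original disjoint clusters; the function name add_back and the single pass over the hyperedges are assumptions.

```python
def add_back(clusters, hyperedges, threshold):
    """Grow disjoint clusters into overlapping ones.

    clusters: list of sets of articles; hyperedges: iterable of frozensets.
    """
    grown = [set(cluster) for cluster in clusters]
    for original, result in zip(clusters, grown):
        for edge in hyperedges:
            # Share of the hyperedge's vertices that lie in this cluster.
            if len(edge & original) / len(edge) > threshold:
                result |= edge
    return grown
```

With clusters {a, b, c} and {d, e}, hyperedges {a, b, d} and {c, d, e}, and a threshold of 0.6, article d is added to the first cluster and article c to the second.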

Online recommendations

We propose two recommendation approaches that use the article association rules and article clusters obtained by the methods described above. The goal is to recommend the top-N articles that potentially interest the active user. The first approach makes use of article association rules. The idea is to treat each frequent itemset as the interest profile of a user group and to recommend articles based on the similarity between the current session of the active user and the interest profiles of the relevant user groups. Specifically, let s be the active user's current session of length k. We first identify the set of frequent itemsets of size k + 1 that contain all elements in s and an extra element m (not in s). For each such itemset I, the confidence of the rule $I - \{m\} \Rightarrow \{m\}$ is calculated. These extra elements are then recommended to the user in descending order of confidence value. If these elements are not sufficient (i.e. there are fewer than N of them), we then search for frequent itemsets of size k that contain k - 1 elements in s and an extra element (not in s). Again, these extra elements are recommended to the user in descending order of confidence value. This procedure continues until N articles are recommended.

Our other proposed method uses a hypergraph-based approach. In this approach, the recommendation score of each article a is computed by considering the similarity between the current user session and the clusters C to which a belongs, and the coherence weight of a with respect to C. Specifically, each cluster of articles can be viewed as a vector with binary elements, each of which indicates whether an article appears in the cluster. Similarly, the current user session can also be represented as a vector. Then the similarity between the current session S and a cluster C can be defined as a cosine function as follows:

\[
\mathit{match}(S, C) = \frac{\sum_k a_k^C \times S_k}{\sqrt{\sum_k (S_k)^2 \times \sum_k (a_k^C)^2}}
\]

where $S_k$ is the k'th element in S and $a_k^C$ is the k'th element in C.

The coherence weight of an article a with respect to the cluster C that it belongs to is defined as

\[
\mathit{weight}(a, C) = \frac{\sum_{a \in e,\ e \subseteq C} \mathit{weight}(e)}{\sum_{e \subseteq C} \mathit{weight}(e)}
\]

where weight(e) is the weight of a hyperedge e.

The recommendation score Rec(S, a) of an article a with respect to the current user session S is then defined as

\[
\mathit{Rec}(S, a) = \max_{a \in C} \sqrt{\mathit{match}(S, C) \times \mathit{weight}(a, C)}
\]

The top-N articles for recommendation are those with the N highest values in the recommendation score.
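A minimal sketch of the hypergraph-based scoring is given below. It assumes that the score of an article is the maximum, over the clusters containing it, of the square root of match(S, C) times weight(a, C); that sessions and clusters are sets of article identifiers (so the binary-vector cosine reduces to set arithmetic); and that hyperedge weights are non-negative. All function names are illustrative.

```python
import math


def match(session, cluster):
    """Cosine similarity of two binary vectors represented as sets."""
    if not session or not cluster:
        return 0.0
    return len(session & cluster) / math.sqrt(len(session) * len(cluster))


def coherence(article, cluster, hyperedges):
    """Share of the cluster's hyperedge weight carried by edges with article."""
    inside = {e: w for e, w in hyperedges.items() if e <= cluster}
    total = sum(inside.values())
    if total == 0:
        return 0.0
    return sum(w for e, w in inside.items() if article in e) / total


def rec_score(session, article, clusters, hyperedges):
    """Best score of the article over all clusters that contain it."""
    scores = [math.sqrt(match(session, c) * coherence(article, c, hyperedges))
              for c in clusters if article in c]
    return max(scores, default=0.0)


def top_n(session, clusters, hyperedges, n):
    """The n unseen articles with the highest recommendation scores."""
    candidates = set().union(*clusters) - session
    ranked = sorted(
        candidates,
        key=lambda a: (-rec_score(session, a, clusters, hyperedges), a))
    return ranked[:n]
```

The work per request depends on the number and size of the clusters rather than on the number of frequent itemsets.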


The hypergraph-based approach is more carefully designed than the association-rule-based approach, and we expect the former to perform better:

H4. The hypergraph-based approach will result in more effective recommendations and has a quicker response time than the association-rule-based approach.

Empirical evaluations

This section reports our experience in applying the Web usage logs of NSYSU-ETD to the proposed literature recommendation system. The main objective was to test our four hypotheses. NSYSU-ETD runs on PC Solaris 2.7 and uses Apache 1.3.9 as the Web server. Since being commissioned in May 2000, it has been loaded with more than 3,000 electronic theses of National Sun Yat-sen University. Up to February 2003, these theses had been browsed more than 400,000 times and downloaded more than 100,000 times. We analysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002 for our experiments: the data collected from February 1 to April 30 were designated as the training data set, and those collected in May served as the test data set.

We first applied the data cleansing technique to the training data and obtained 43,349 lookup accesses and 41,627 article accesses. Applying the session identification technique revealed 16,922 user sessions, among which 392 sessions were robot generated, 6,068 sessions contained only one article access, and 5,253 sessions contained no article accesses. We eliminated these trivial user sessions and applied transaction identification techniques, resulting in 5,617 transactions for the query-chosen method, 5,272 transactions for the session-chosen method and 17,742 transactions for the query-result method. Queries whose results are never chosen by the users are removed from the query-chosen method but remain in the query-result method. The session-result method produced transactions of huge size, each containing thousands of article accesses. We therefore decided not to consider this method in the subsequent experiments.

The two proposed methods for mining literature usage logs both require the identification of frequent itemsets from transactions, which needs the minimum support to be specified. However, the three transaction identification methods under comparison have different numbers of transactions. To be fair, we specify a different minimum support threshold for each method such that the total number of articles involved in large two-item sets of each method (called recommendable articles) is approximately the same. Our recommendation framework recommends an article only if it is associated with other articles a sufficient number of times in the Web usage log. Therefore, articles that are not involved in large two-item sets cannot possibly be recommended. Table I shows the specified minimum support threshold and the number of recommendable articles of each method.

To illustrate how we conducted experiments, we define the following notation: let $T_{eval}$ be the set of transactions in the test set, $t_{eval}$ be a transaction in $T_{eval}$, and $a_t(i)$ be the i'th article in $t_{eval}$. Given a window size $W_{size}$, we divide each transaction $t_{eval}$ in the test data set into two lists, $t_{eval}[W]$ and $t_{eval}[R]$, where $t_{eval}[W]$ is the first $W_{size}$ article accesses of $t_{eval}$ and $t_{eval}[R]$ is the remaining articles. By treating $t_{eval}[W]$ as the current session, the recommender system will choose the set $t_{pr}$ of top-N articles for recommendation.

The performance metric we adopted for measuring the quality of recommendation is the precision and recall scheme. The precision is the ratio of the number of recommended articles accessed by a user to the total number of recommended articles, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{pr}|$, and recall is the ratio of the number of recommended articles accessed by a user to the total number of articles of interest to the user, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{eval}[R]|$. The precision (recall) of a recommendation approach is the average precision (recall) of all transactions in the test set.
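The evaluation protocol for a single test transaction can be sketched as follows. The recommender is passed in as a function taking the window and N; this interface, the function name and the convention of returning 0 when a denominator is empty are assumptions made for illustration.

```python
def precision_recall(transaction, window_size, recommend, n):
    """Precision and recall of top-n recommendations for one transaction."""
    window = transaction[:window_size]          # t_eval[W]: current session
    rest = set(transaction[window_size:])       # t_eval[R]: held-out articles
    recommended = set(recommend(window, n))     # t_pr
    hits = len(recommended & rest)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(rest) if rest else 0.0
    return precision, recall
```

Averaging the two values over all test transactions gives the precision and recall of a recommendation approach.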

Test of H1

We first evaluate the performance impact of the three transaction identification methods. In this experiment, α was set to 0.5 and the logarithm was taken when computing the weight of an itemset. Figure 3(a, b) shows the precisions and recalls, respectively, under association-rule-based recommendation.


The precisions and recalls under hypergraph-based recommendation are shown in Figure 4(a, b), respectively.

Overall, the trends are the same for the association-rule-based and hypergraph-based approaches. It can be clearly seen that both the query-chosen and session-chosen methods outperform the query-result method in terms of both precision and recall. This implies that the information about articles browsed plays a crucial role in making recommendations: both the query-chosen and session-chosen methods incorporate this information in forming transactions. We therefore accept H1. However, the performance difference between the query-chosen and session-chosen methods is not significant.

Figure 3 (a) Precisions of the association-based approach; (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Transaction identification method   Minimum support (per cent)   No. of recommended articles
Session-chosen method               0.16                         253
Query-chosen method                 0.12                         250
Query-result method                 3.3                          229

Test of H2 and H3

We then conducted experiments to shed light on the impact of α and the logarithmic function in the weight definition for the hypergraph-based recommendation approach. Figure 5(a, b) shows the precision and recall values when α = 0, 0.5 and 1, using the session-chosen method for transaction identification. The window size was set at 2. As can be seen, differences in the precision and recall under different settings of α are very small. We performed the same experiments for different transaction identification methods and window sizes and obtained similar results. H2 is therefore rejected.

We then computed the precisions and recalls with and without application of the logarithmic function to the itemset weight. Figure 6(a, b) shows the resulting precisions and recalls, respectively.

Figure 6 shows that applying the logarithmic function to the itemset weight definition achieves significantly better precision and recall values. This meets our expectation, and H3 is accepted.

Test of H4

Finally, we evaluated the impact of the association-rule-based and hypergraph-based recommendation approaches for different window sizes. Figure 7(a, b) shows the precisions and recalls for window sizes of 2, 3, 4, 5 and 6 for a top-15 recommendation. It can be seen that the hypergraph-based approach performs better than the association-rule-based approach, especially for larger window sizes.

We also compared the running times of both approaches whilst setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a result of an increased window size), the association-rule-based approach incurs a larger running time.

Figure 5 (a) Impact of α on precisions and (b) impact of α on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Overall, we conclude that the hypergraph-based approach is more attractive, since it yields better-quality article recommendation and has a more consistent running time. Thus H4 is accepted.

Conclusions

In this paper we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:

(1) data preparation of the Web logs;

(2) usage log mining; and

(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches, association-rule based and hypergraph based, for making recommendations. These alternatives and approaches were evaluated using the Web usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification, and that the hypergraph-based approach yields better-quality article recommendation and exhibits a more consistent running time, and thus is more scalable.

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.

References

Agrawal, R., Imielinski, T. and Swami, A. (1993), "Mining association rules between sets of items in large databases", in Buneman, P. and Jajodia, S. (Eds), Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 26-28, ACM Press, New York, NY, pp. 207-16.

Agrawal, R. and Srikant, R. (1994), "Fast algorithms for mining association rules", in Bocca, J.B., Jarke, M. and Zaniolo, C. (Eds), Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, Santiago, Chile, Morgan Kaufmann, San Francisco, CA, pp. 487-99.

Alspector, J., Kolcz, A. and Karunanithi, N. (1998), "Comparing feature-based and clique-based user models for movie selection", Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, ACM Press, New York, NY, pp. 11-18.

Andresen, D., Carver, L., Dolin, R., Fischer, C., Frew, J., Goodchild, M., Ibarra, O., Kothuri, R., Larsgaard, M., Nebert, D., Simpson, J., Smith, T., Yang, T. and Zheng, Q. (1995), "The WWW prototype of the Alexandria digital library", Proceedings of the International Symposium on Digital Libraries, Tsukuba, Japan, 22-5 August, pp. 17-27.

Ansari, A., Essegaier, S. and Kohli, R. (2000), "Internet recommendation systems", Journal of Marketing Research, Vol. 37 No. 3, pp. 67-85.

Arms, W. (2000), Digital Libraries, MIT Press, Cambridge, MA.

Armstrong, R., Freitag, D., Joachims, T. and Mitchell, T. (1997), "WebWatcher: a learning apprentice for the World Wide Web", AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, pp. 6-12.

Balabanović, M. and Shoham, Y. (1997), "Fab: content-based, collaborative recommendation", Communications of the ACM, Vol. 40 No. 3, pp. 66-72.

Baldonado, M., Chang, C., Gravano, L. and Paepcke, A. (1997), "Metadata for digital libraries: architecture and design rationale", Proceedings of the 2nd ACM International Conference on Digital Libraries, ACM Press, New York, NY, pp. 47-56.

Basu, C., Hirsh, H. and Cohen, W. (1998), "Recommendation as classification: using social and content-based information in recommendation", Proceedings of the 15th National Conference on Artificial Intelligence, 26-30 July, Madison, WI, AAAI Press, Menlo Park, CA, pp. 714-20.

Billsus, D. and Pazzani, M. (1999), "A hybrid user model for news story classification", in Kay, J. (Ed.), Proceedings of the 7th International Conference on User Modelling, Banff, Canada, 20-4 June, Springer-Verlag, New York, NY, pp. 99-108.

Bowman, C., Manber, P. and Schwartz, U. (1994), "Scalable Internet resources discovery: research problems and approaches", Communications of the ACM, Vol. 37 No. 8, pp. 98-107.

Breese, J., Heckerman, D. and Kadie, C. (1998), "Empirical analysis of predictive algorithms for collaborative filtering", Technical Report MSR-TR-98-12, Microsoft Research, Seattle, CA.

Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A. and Lin, C. (1996), "A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois digital library initiative project", IEEE Transactions on PAMI, Vol. 18 No. 8, pp. 17-34.

Cooley, R., Mobasher, B. and Srivastava, J. (1999), "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1 No. 1, pp. 5-32.

Furner, J. (2002), "On recommending", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 747-63.

Goldberg, D., Nichols, D., Oki, B. and Terry, D. (1992), "Using collaborative filtering to weave an information tapestry", Communications of the ACM, Vol. 35 No. 12, pp. 61-70.

Herlocker, J. and Konstan, J. (2001), "Content-independent, task-focused recommendation", IEEE Internet Computing, Vol. 5 No. 6, pp. 40-7.

Karypis, G. (2002), "Multilevel hypergraph partitioning", Tech. Report TR02-25, Department of Computer Science, University of Minnesota, MN.

Kessler, J. (1996), Internet Digital Libraries: The International Dimension, Artech House Publishers, Norwood, MA.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997), "GroupLens: applying collaborative filtering to Usenet news", Communications of the ACM, Vol. 40 No. 3, pp. 77-87.


Lang, K. (1995), "Newsweeder: learning to filter netnews", in Prieditis, A. and Russell, S. (Eds), Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, Morgan Kaufmann, San Francisco, CA, pp. 331-9.

Liu, B., Hsu, W. and Ma, Y. (1999), "Mining association rules with multiple minimum supports", Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 15-18 August, San Diego, ACM Press, New York, NY, pp. 430-4.

Loeb, S. and Terry, D. (1992), "Information filtering", Communications of the ACM, Special Issue on Information Filtering, Vol. 35 No. 12, pp. 26-8.

Mobasher, B., Cooley, R. and Srivastava, J. (1999), "Creating adaptive Web sites through usage-based clustering of URLs", Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop.

Mobasher, B., Dai, H., Luo, T., Nakagawa, M. and Wiltshire, J. (2000), "Discovery of aggregate usage profiles for Web personalisation", Proceedings of the webKDD Workshop.

Mooney, R. and Roy, L. (2000), "Content-based book recommending using learning for text categorisation", Proceedings of the 5th ACM Conference on Digital Libraries, San Antonio, ACM Press, New York, NY, pp. 195-204.

Pazzani, M. and Billsus, D. (1997), "Learning and revising user profiles: the identification of interesting Web sites", Machine Learning, Vol. 27 No. 4, pp. 313-31.

Pazzani, M. (1999), "A framework for collaborative, content-based and demographic filtering", Artificial Intelligence Review, Vol. 13 No. 5/6, pp. 393-408.

Pennock, D., Horvitz, E., Lawrence, S. and Giles, C. (2000), "Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach", Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, 30 June-3 July, Morgan Kaufmann, San Francisco, CA, pp. 473-80.

Pitkow, J. and Pirolli, P. (1999), "Mining longest repeating subsequences to predict World Wide Web surfing", Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 139-150.

Schafer, J., Konstan, J. and Riedl, J. (2001), "E-commerce recommendation applications", Data Mining and Knowledge Discovery, Vol. 5 No. 1, pp. 10-22.

Shardanand, U. and Maes, P. (1995), "Social information filtering: algorithms for automating 'word of mouth'", Proceedings of the Conference on Human Factors in Computing Systems, Denver, CO, ACM Press, New York, NY, pp. 210-17.

Spink, A., Wilson, T., Ford, N., Foster, A. and Ellis, D. (2002), "Information seeking and mediated searching", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 695-703.

Srivastava, J., Cooley, R., Deshpande, M. and Tang, P. (2000), "Web usage mining: discovery and applications of usage patterns from Web data", SIGKDD Explorations, Vol. 1 No. 2, pp. 12-23.

Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. (1997), "PHOAKS: a system for sharing recommendations", Communications of the ACM, Vol. 40 No. 3, pp. 59-62.

Wilensky, R. (1996), "Toward work-centred digital information services", IEEE Computer, Vol. 29 No. 5, pp. 7-44.

Yan, T., Jacobsen, M., Molina, H. and Dayal, U. (1996), "From user access patterns to dynamic hypertext linking", Proceedings of the 5th International World Wide Web Conference, pp. 1007-14.

Yang, Q., Zhang, H.H. and Li, T. (2001), "Mining Web logs for prediction models in WWW caching and prefetching", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-8.


part summarises this paper and discusses our future research directions.

Architecture

The overall architecture of the literature recommendation system, shown as Figure 1, consists of two basic subsystems, offline and online:

(1) The offline subsystem comprises two sequentially executed tasks: data preparation and log mining. Although Web usage logs are potentially able to provide useful knowledge for making recommendations, the raw log data cannot be used before appropriate pre-processing. Therefore, we first convert raw Web usage logs into a set of user transactions before performing the log mining task. The objectives of log mining tasks include the discovery of article association rules and the derivation of article clusters.

(2) The online subsystem interacts with an active user and provides recommended articles in real time. It keeps track of a set of articles browsed recently by the active user by consulting the current Web usage log provided by the Web server. Then, by comparing the similarities between this set and the article clusters (or associations) produced by the offline subsystem, the online subsystem recommends articles in the clusters (associations) that are highly similar.

We have incorporated the literature recommendation system into NSYSU-ETD. Figure 2 depicts a page view of an article in NSYSU-ETD. The Web page comprises two frames: the left frame displays the metadata of the browsed article, and the right frame shows the unseen articles (up to 15) recommended by our literature recommendation system, displayed in the order of relevance. In this example, the active user had reviewed two articles and the literature recommendation system has recommended another seven articles. Once the active user browses another article, the content of the recommendation frame will update accordingly.

Approaches

In this section, we first describe the tasks conducted in the offline subsystem of the literature recommendation system, and then those conducted in the online recommendation subsystem.

Figure 1 Architecture of a literature recommendation system


Data preparation for the literature usage log

To prepare data from the Web usage logs of a literature digital library, we basically follow the heuristics adopted by Cooley et al. (1999) for processing Web usage logs that involve static Web pages. In their work, the Web usage logs are assumed to be in the extended NCSA format (including referrer and agent fields). The approach contains three sequential steps: data cleansing, user session identification and transaction identification. The objective of data cleansing is to prune out unwanted Web log records and to add back missing Web log records: some Web log records are surplus, as they are accesses to non-HTML pages (e.g. images and other http requests involving no Web page accesses), while other Web log records are missing due to the existence of the local cache, firewalls and proxy servers. Identifying missing Web log records is especially difficult; several heuristics have been proposed for achieving this. However, we found this difficulty nonexistent when processing the Web usage log of NSYSU-ETD, because article Web pages are dynamically generated and are not cacheable. Our university ETD system is database driven, in that the theses metadata are stored in a DBMS. Most large-scale digital libraries adopt the same method for maintaining their collections. In the context of a literature digital library, we are concerned with, and retain only, the Web usage records that involve the following two types of accesses:

(1) Lookup accesses. Each lookup access is an execution of a CGI program with searching or browsing conditions specified in the parameters. An example lookup access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/search_by_advisor?advisor_name=San-Yih+Hwang, which lists all theses supervised by Professor San-Yih Hwang.

(2) Article accesses. Each article access is an execution of a CGI program that displays the detailed metadata on an article. An example article access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0726100-135739, which shows the metadata of the thesis whose URN is etd-0726100-135739.

Note that article accesses display the detailed metadata of articles and therefore are of primary interest. Lookup accesses provide lookup information to facilitate browsing or searching and can thus be considered auxiliary. Most of the time, a user will first execute a lookup access, followed by a selective list of article accesses. We found that some user sessions contained article accesses without prior lookup accesses in the Web usage log of NSYSU-ETD. This is because several information sources had provided hyperlinks directly to articles' page views. In this case, each session can be viewed as a list of queries. Each query (optionally) starts with a lookup access, followed by a list of article accesses related to the articles that the user chose to look at in more detail.

Figure 2 Page view of an article in NSYSU-ETD

The goal of user session identification is to divide the article accesses of each user into individual sessions. It is reasonable to assume that two records with different IP addresses, browsers or operating systems belong to two different user sessions. In addition, the time interval between two consecutive requests in a user session should not be too large. As in many commercial products, we use 30 minutes as the default timeout period. When the time interval between the current access and the previous one exceeds this, a new user session is assumed to have started. Some of the identified user sessions are made by Internet robots and hence should not be considered. Some robots have known agent types and/or IP addresses and can be easily identified. Analysis of the user sessions of these known robots revealed that most of these sessions either have more than 100 article accesses or exhibit a mean adjacent Web page access interval of less than three seconds. User sessions that satisfy this condition are considered as robot sessions and consequently are removed.
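The session identification and robot filtering rules described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used for NSYSU-ETD: the `Access` record, its `client` key (standing in for the combination of IP address, browser and operating system) and the function names are hypothetical, and the mean access interval is computed over all accesses in a session.

```python
from dataclasses import dataclass
from typing import Dict, List

SESSION_TIMEOUT = 30 * 60    # seconds; the default timeout stated in the text
MAX_ARTICLE_ACCESSES = 100   # robot heuristic stated in the text
MIN_MEAN_INTERVAL = 3.0      # seconds; robot heuristic stated in the text


@dataclass
class Access:
    client: str       # hypothetical key: IP address + browser + operating system
    timestamp: float  # seconds since an arbitrary epoch
    kind: str         # "lookup" or "article"
    target: str       # query string or article URN


def split_sessions(accesses: List[Access]) -> List[List[Access]]:
    """Group accesses by client and start a new session after a long gap."""
    sessions: List[List[Access]] = []
    open_session: Dict[str, List[Access]] = {}
    for acc in sorted(accesses, key=lambda a: (a.client, a.timestamp)):
        current = open_session.get(acc.client)
        # A new session starts when the gap to the previous access exceeds the timeout.
        if current is None or acc.timestamp - current[-1].timestamp > SESSION_TIMEOUT:
            current = []
            sessions.append(current)
            open_session[acc.client] = current
        current.append(acc)
    return sessions


def looks_like_robot(session: List[Access]) -> bool:
    """Apply the two robot heuristics: too many article accesses or too fast."""
    n_articles = sum(1 for a in session if a.kind == "article")
    if n_articles > MAX_ARTICLE_ACCESSES:
        return True
    if len(session) < 2:
        return False
    mean_gap = (session[-1].timestamp - session[0].timestamp) / (len(session) - 1)
    return mean_gap < MIN_MEAN_INTERVAL


log = [
    Access("u1", 0, "lookup", "q1"),
    Access("u1", 60, "article", "a1"),
    Access("u1", 1861, "article", "a2"),   # 1801 s after the previous access
    Access("bot", 0, "article", "a1"),
    Access("bot", 1, "article", "a2"),
    Access("bot", 2, "article", "a3"),
]
sessions = split_sessions(log)
human_sessions = [s for s in sessions if not looks_like_robot(s)]
```

In this toy log the client `u1` yields two sessions because of the gap longer than 30 minutes, and the `bot` session is dropped because its mean interval is one second.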

Finally, a user session is further divided into a number of transactions, each of which represents a semantically meaningful unit. However, the various transaction identification approaches proposed in Cooley et al. (1999) make use of either the index (auxiliary) pages or the Web site topology. Since neither exists in the context of literature digital libraries, these proposed approaches are not applicable. Our approach identifies transactions by considering the types of accesses, namely lookup and article accesses. In fact, articles listed by the same query must have some degree of similarity in their content (e.g. keyword, title, author, discipline). On the other hand, articles selected in the same user session or query also display some degree of similarity due to inherent human behaviour. Therefore, we have four methods for defining transactions:

(1) Query-chosen method: the articles selected in a query.

(2) Session-chosen method: the articles selected in a user session.

(3) Query-result method: the articles listed in a query.

(4) Session-result method: the articles listed in queries of a user session.

For the query-chosen and session-chosen methods, article accesses present in the Web logs are grouped into a set of transactions. For the query-result and session-result methods, we construct transactions by reissuing queries to the literature digital library.
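A minimal Python sketch of the two log-based methods follows. It assumes a session is a list of `(kind, target)` pairs in time order, which is a hypothetical representation chosen for illustration; the function names are likewise not from the paper. Queries with no chosen article produce no transaction, and a session that begins with article accesses is treated as a query without a leading lookup.

```python
def query_chosen(session):
    """One transaction per query: the articles chosen after each lookup access."""
    transactions, current = [], []
    for kind, target in session:
        if kind == "lookup":
            # A lookup access closes the previous query and opens a new one.
            if current:
                transactions.append(current)
            current = []
        elif target not in current:
            current.append(target)
    if current:
        transactions.append(current)
    return transactions


def session_chosen(session):
    """One transaction per session: every article chosen during the session."""
    chosen = []
    for kind, target in session:
        if kind == "article" and target not in chosen:
            chosen.append(target)
    return [chosen] if chosen else []


example_session = [
    ("lookup", "q1"), ("article", "a1"), ("article", "a2"),
    ("lookup", "q2"),                      # no article chosen for this query
    ("lookup", "q3"), ("article", "a3"),
]
by_query = query_chosen(example_session)
by_session = session_chosen(example_session)
```

The query-result and session-result methods are not sketched here because they require reissuing queries to the digital library itself.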

As mentioned, the query-chosen and session-chosen methods incorporate knowledge on human selection in making the recommendations. We expect that they will yield more effective recommendations than their counterparts without such knowledge, namely the query-result and session-result methods. We therefore form the following hypothesis:

H1. Users tend to browse the metadata of only the articles they find of interest. Recommendation schemes that consider the Web accesses of these articles will result in more effective recommendations.

Mining the literature usage log

There have been several approaches proposed in the literature (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001) for identifying aggregate usage profiles from Web usage logs. Aggregate usage profiles can be represented in the form of association rules, sequential patterns (or an n-gram Markov model), or clusters of Web pages. In the context of literature digital libraries, however, we decided not to consider sequential patterns, because the order of articles in a transaction may not relate to users' preferences. Instead, only association rules and clusters of articles will be discussed.

The problem of finding frequent associations between items in a transaction database, called the association-rule discovery problem, was first introduced by Agrawal et al. (1993). Association-rule discovery methods, such as the Apriori algorithm (Agrawal et al., 1993; Agrawal and Srikant, 1994), are typically based on two decompositions: extraction of itemsets and the generation of strong association rules. In the initial extraction phase, the methods find sets of items that frequently occur together. The support of each itemset $I = X \cup Y$, denoted as Sup(I), is the fraction of transactions containing both X and Y. This itemset I is referred to as a frequent itemset if Sup(I) exceeds a user-specified minimal support threshold, Minsup.

In the second phase, each discovered itemset $I = X \cup Y$ is used to construct association rules in the form $X \Rightarrow Y$. The confidence of each rule, denoted as Conf($X \Rightarrow Y$), is the fraction of transactions containing X that also contain Y. An association rule is said to be strong if its confidence exceeds a user-specified confidence threshold. By applying association-rule discovery algorithms to the transactions of the transformed logs, we can find association rules in the form $\{a_1, a_2, \ldots, a_m\} \Rightarrow a$, which can be used to recommend article a to users who have browsed $a_1, a_2, \ldots, a_m$ but not a.
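The definitions of support and confidence can be checked on a toy transaction set. The sketch below is only an illustration of the two definitions, computed by brute-force counting rather than by the Apriori algorithm; the function names and the sample data are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(x, y, transactions):
    """Conf(X => Y): fraction of transactions containing X that also contain Y."""
    sup_x = support(x, transactions)
    if sup_x == 0:
        return 0.0  # assumption: a rule with an unseen antecedent gets confidence 0
    return support(set(x) | set(y), transactions) / sup_x


toy_transactions = [
    {"a1", "a2", "a3"},
    {"a1", "a2"},
    {"a1", "a3"},
    {"a2", "a4"},
]
sup_a1_a2 = support({"a1", "a2"}, toy_transactions)        # 2 of 4 transactions
conf_a1_to_a2 = confidence({"a1"}, {"a2"}, toy_transactions)  # 2 of the 3 containing a1
```

With a confidence threshold of 0.6, the rule {a1} => a2 would count as strong on this toy data, so a2 could be recommended to a user who has browsed a1 but not a2.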

The traditional approaches for identifying itemsets with a uniform minimum support threshold, however, cannot be directly applied, because articles that arrive later tend to have smaller support even if they are actually more popular. Therefore, a nonuniform support threshold scheme, originally proposed in Liu et al. (1999), is adopted. In this scheme, each item is assigned a distinct minimum support value (called the minimum item support, MIS), and the minimum support of an itemset is the minimum of the MIS values of its constituent items. In the literature recommendation system, we view the MIS value of an article as a function of its creation time. That is, articles that are added to the digital library more recently should be assigned smaller MIS values. Let N(t) be the number of transactions after time t in the Web usage logs. The MIS of an article a is defined as follows:

$$\mathrm{MIS}(a) = \begin{cases} M(a), & M(a) > LS \\ LS, & \text{otherwise} \end{cases}$$

$$M(a) = \frac{N(\mathrm{CreationTime}(a))}{N(0)} \cdot \mathit{Minsup}$$

where LS is the lower bound for support values, Minsup is the minimum support threshold based on the entire article browsing log, and N(0) denotes the total number of transactions in the Web usage log. Both LS and Minsup are user-defined constants. After assigning the MIS values of all articles in the literature digital library, the method proposed in Liu et al. (1999) can be applied to derive the association rules for articles.
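The MIS assignment can be sketched in Python as follows. This is an illustrative reading of the definition, assuming that M(a) scales Minsup by the share of transactions that occurred after the article's creation time; timestamps are plain numbers, the transaction times are kept in a sorted list, and all function names are hypothetical.

```python
import bisect


def transactions_after(t, transaction_times):
    """N(t): number of transactions after time t (times sorted ascending)."""
    return len(transaction_times) - bisect.bisect_right(transaction_times, t)


def minimum_item_support(creation_time, transaction_times, minsup, ls):
    """MIS(a): Minsup scaled by the article's exposure, bounded below by LS."""
    n0 = len(transaction_times)  # N(0): all transactions in the log
    m = transactions_after(creation_time, transaction_times) / n0 * minsup
    return m if m > ls else ls


def itemset_minimum_support(creation_times, transaction_times, minsup, ls):
    """Minimum support of an itemset: the smallest MIS among its items."""
    return min(
        minimum_item_support(c, transaction_times, minsup, ls)
        for c in creation_times
    )


times = list(range(1, 11))  # ten transactions at times 1..10
mis_old = minimum_item_support(0, times, minsup=0.01, ls=0.001)   # all 10 follow
mis_mid = minimum_item_support(5, times, minsup=0.01, ls=0.001)   # 5 follow
mis_new = minimum_item_support(10, times, minsup=0.01, ls=0.001)  # none follow
pair_mis = itemset_minimum_support([0, 5], times, minsup=0.01, ls=0.001)
```

An article present for the whole log keeps the full Minsup, an article added halfway through gets half of it, and a brand-new article falls back to the lower bound LS.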

For the clustering technique, we adopt the Association Rule Hypergraph Partitioning (ARHP) approach (Mobasher et al., 1999, 2000) rather than traditional clustering techniques. The main reason for this is that ARHP is more efficient in handling high dimensional data, such as those present in literature digital libraries. The dimensions of a transaction are the set of articles, which is huge for a large-scale digital library. This approach starts with the identification of frequent itemsets (as in association-rule discovery methods), each of which contains articles often accessed together in transactions. Each such frequent itemset is then viewed as a hyperedge with a specific weight.

As mentioned, each article has a distinct creation time. We therefore normalise the support values of itemsets as follows before computing their weights:

$$\mathrm{Sup}'(a_i) = \frac{N(0)}{N(\mathrm{CreationTime}(a_i))} \cdot \mathrm{Sup}(a_i)$$

$$\mathrm{Sup}'(a_1, \ldots, a_k) = \frac{N(0)}{N\left(\max_{1 \le i \le k} \mathrm{CreationTime}(a_i)\right)} \cdot \mathrm{Sup}(a_1, \ldots, a_k)$$

There are several ways to define the weight of an itemset, such as using either the support for, or the interest in, the itemset. The former favours itemsets of smaller size, whereas the latter gives priority to larger itemsets. We define a general weighting formula that covers the broad spectrum between these two extremes. In addition, the supports for, or interests in, different itemsets can have very diverse values. To prevent itemsets of large weight from dominating the subsequent clustering procedure, we apply the logarithm on the weight. The following is our definition of the weight of an itemset $(a_1, a_2, \ldots, a_k)$:

$$\mathrm{weight}(a_1, a_2, \ldots, a_k) = \log\left( \frac{\mathrm{Sup}'(a_1, a_2, \ldots, a_k)}{\left[\mathrm{Sup}'(a_1) \cdot \mathrm{Sup}'(a_2) \cdots \mathrm{Sup}'(a_k)\right]^{\alpha}} \times \frac{1}{\mathit{Minsup}} \right)$$

where $0 \le \alpha \le 1$. Note that when $\alpha = 0$ or $\alpha = 1$, this formula is equivalent to using support or interest as the weight, respectively. 1/Minsup is a constant that keeps the weight non-negative. This definition supports our following hypotheses:

H2. A better recommendation effectiveness can be achieved by striking a balance between using support and interest as the weight of an itemset.

H3. A better recommendation effectiveness can be achieved by incorporating the logarithm function in the weight of an itemset.
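To make the role of the balancing parameter concrete, the following Python sketch evaluates one reading of the weight definition. It assumes the weight is the logarithm of the normalised itemset support, divided by the product of the normalised item supports raised to the balancing exponent (called `alpha` here), and scaled by 1/Minsup; that reading, the function name and the `use_log` switch are assumptions made for illustration.

```python
import math


def itemset_weight(sup_itemset, sup_items, alpha, minsup, use_log=True):
    """Weight of a hyperedge under the assumed reading of the definition.

    sup_itemset: normalised support of the whole itemset
    sup_items:   normalised supports of the individual articles
    alpha:       0 gives a support-based weight, 1 an interest-based weight
    """
    denominator = math.prod(sup_items) ** alpha
    value = sup_itemset / denominator / minsup
    # use_log=False corresponds to the variant without the logarithm (as in H3).
    return math.log(value) if use_log else value


# alpha = 0: only the itemset support matters.
w_support = itemset_weight(0.02, [0.1, 0.2], alpha=0.0, minsup=0.01)
# alpha = 1: the itemset support is divided by the product of item supports.
w_interest = itemset_weight(0.02, [0.1, 0.2], alpha=1.0, minsup=0.01)
# alpha = 0.5 sits between the two extremes.
w_balanced = itemset_weight(0.02, [0.1, 0.2], alpha=0.5, minsup=0.01)
```

In this example the support-based weight is log 2 and the interest-based weight is log 100, with the balanced setting falling in between.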

After deciding the weight of each itemset, the hypergraph partitioning algorithm proposed in Karypis (2002) is applied to partition the set of articles into disjoint clusters of articles. Articles in the same cluster are more "similar" in the sense that they are more likely to be accessed together in the same transaction. To reflect the fact that an article may indeed interest more than one group of users, we adopt the same heuristic as used in Mobasher et al. (1999) by adding back articles to clusters, which results in overlapping clusters. Specifically, for a given hyperedge, if the percentage of involved vertices in a cluster is larger than a threshold, the other involved vertices are included in the same cluster.

Online recommendations

We propose two recommendation approaches that use the article association rules and article clusters obtained by the methods described above. The goal is to recommend the top-N articles that potentially interest the active user. The first approach makes use of article association rules. The idea is to treat each frequent itemset as the interest profile of a user group, and to recommend articles based on the similarity between the current session of the active user and interest profiles of the relevant user groups. Specifically, let s be the active user's current session of length k. We first identify the set of frequent itemsets of size k + 1 that contain all elements in s and an extra element m (not in s). For each such itemset I, the confidence of the rule $I - \{m\} \Rightarrow \{m\}$ is calculated. These extra elements are then recommended to the user in descending order of confidence value. If these elements are not sufficient (i.e. there are fewer than N of them), we then search for frequent itemsets of size k that contain k - 1 elements in s and an extra element (not in s). Again, these extra elements are recommended to the user in descending order of confidence value. This procedure continues until N articles are recommended.
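The back-off procedure above can be sketched in Python. This is a simplified illustration, not the authors' code: it assumes the frequent itemsets are available as a dictionary from `frozenset` to support, that the support of every matched subset of the session is also in that dictionary, and that ties in confidence are broken alphabetically. The function name is hypothetical.

```python
from itertools import combinations


def recommend_by_rules(session, frequent, n):
    """Recommend up to n articles, matching ever smaller parts of the session.

    frequent maps frozenset(itemset) -> support.
    """
    session = set(session)
    recommended = []
    # First match all k session articles, then k - 1 of them, and so on.
    for matched in range(len(session), 0, -1):
        candidates = {}
        for subset in combinations(sorted(session), matched):
            base = frozenset(subset)
            if base not in frequent:
                continue
            for itemset, sup in frequent.items():
                if len(itemset) == matched + 1 and base < itemset:
                    (extra,) = itemset - base
                    if extra in session or extra in recommended:
                        continue
                    # Confidence of the rule base => {extra}.
                    conf = sup / frequent[base]
                    candidates[extra] = max(conf, candidates.get(extra, 0.0))
        ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
        for extra, _ in ranked:
            recommended.append(extra)
            if len(recommended) == n:
                return recommended
    return recommended


toy_frequent = {
    frozenset({"a"}): 0.5, frozenset({"b"}): 0.4,
    frozenset({"c"}): 0.3, frozenset({"d"}): 0.3,
    frozenset({"a", "b"}): 0.3, frozenset({"a", "d"}): 0.25,
    frozenset({"b", "d"}): 0.1, frozenset({"a", "b", "c"}): 0.15,
}
top_two = recommend_by_rules(["a", "b"], toy_frequent, n=2)
```

Here article c is found by matching the whole session {a, b}, and article d is added in the second round by matching single session articles. The inner scan over all frequent itemsets grows with their number and with the session length, which is in line with the running-time behaviour reported for this approach.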

Our other proposed method uses a hypergraph-based approach. In this approach, the recommendation score of each article a is computed by considering the similarity between the current user session and the clusters C to which a belongs, and the coherence weight of a with respect to C. Specifically, each cluster of articles can be viewed as a vector with binary elements, each of which indicates whether an article appears in the cluster. Similarly, the current user session can also be represented as a vector. Then the similarity between the current session S and a cluster C can be defined as a cosine function as follows:

$$\mathrm{match}(S, C) = \frac{\sum_k a^C_k \times S_k}{\sqrt{\sum_k (S_k)^2 \times \sum_k (a^C_k)^2}}$$

where $S_k$ is the k'th element in S and $a^C_k$ is the k'th element in C.

The coherence weight of an article a with respect to the cluster C that it belongs to is defined as:

$$\mathrm{weight}(a, C) = \frac{\sum_{a \in e,\, e \subseteq C} \mathrm{weight}(e)}{\sum_{e \subseteq C} \mathrm{weight}(e)}$$

where weight(e) is the weight of a hyperedge e.

The recommendation score Rec(S, a) of an article a with respect to the current user session S is then defined as:

$$\mathrm{Rec}(S, a) = \max_{a \in C} \sqrt{\mathrm{match}(S, C) \times \mathrm{weight}(a, C)}$$

The top-N articles for recommendation are those with the N highest values in the recommendation score.
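A compact Python sketch of this scoring scheme is given below. It assumes the recommendation score is the square root of the product of the match and coherence values, maximised over the clusters containing the article, and that hyperedge weights are non-negative. Clusters are represented as `frozenset`s and hyperedges as `(frozenset, weight)` pairs; these representations and all function names are hypothetical.

```python
import math


def match(session, cluster):
    """Cosine similarity between the binary vectors of a session and a cluster."""
    session, cluster = set(session), set(cluster)
    if not session or not cluster:
        return 0.0
    return len(session & cluster) / math.sqrt(len(session) * len(cluster))


def coherence(article, cluster, hyperedges):
    """Share of the cluster's hyperedge weight carried by edges containing the article."""
    inside = [(e, w) for e, w in hyperedges if e <= cluster]
    total = sum(w for _, w in inside)
    if total == 0:
        return 0.0
    return sum(w for e, w in inside if article in e) / total


def recommend_by_clusters(session, clusters, hyperedges, n):
    """Score every unseen article by its best cluster and return the top n."""
    session = set(session)
    scores = {}
    for cluster in clusters:
        m = match(session, cluster)
        for article in cluster - session:
            score = math.sqrt(m * coherence(article, cluster, hyperedges))
            scores[article] = max(score, scores.get(article, 0.0))
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked[:n]


toy_clusters = [frozenset({"a", "b", "c"}), frozenset({"c", "d"})]
toy_edges = [
    (frozenset({"a", "b"}), 2.0),
    (frozenset({"b", "c"}), 1.0),
    (frozenset({"c", "d"}), 1.0),
]
top_clustered = recommend_by_clusters(["a"], toy_clusters, toy_edges, n=2)
```

Because the clusters and hyperedge weights are computed offline, the online work is one pass over the clusters, which does not depend on the number of frequent itemsets that match the session.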


The hypergraph-based approach is more carefully designed than the association-rule-based approach, and we expect the former to perform better:

H4. The hypergraph-based approach will result in more effective recommendations and have a quicker response time than the association-rule-based approach.

Empirical evaluations

This section reports our experience in applying the Web usage logs of NSYSU-ETD to the proposed literature recommendation system. The main objective was to test our four hypotheses. NSYSU-ETD runs on PC Solaris 2.7 and uses Apache 1.3.9 as the Web server. Since being commissioned in May 2000, it has been loaded with more than 3,000 electronic theses of National Sun Yat-sen University. Up to February 2003, these theses had been browsed more than 400,000 times and downloaded more than 100,000 times. We analysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002 for our experiments: the data collected from February 1 to April 30 were designated as the training data set, and those collected in May served as the test data set.

We first applied the data cleansing technique on the training data and obtained 43,349 lookup accesses and 41,627 article accesses. Applying the session identification technique revealed 16,922 user sessions, among which 392 sessions were robot generated, 6,068 sessions contained only one article access, and 5,253 sessions contained no article accesses. We eliminated these trivial user sessions and applied transaction identification techniques, resulting in 5,617 transactions for the query-chosen method, 5,272 transactions for the session-chosen method, and 17,742 transactions for the query-result method. Queries whose results are never chosen by the users are removed from the query-chosen method but remain in the query-result method. The session-result method produced transactions of huge size, each containing thousands of article accesses. We therefore decided not to consider this method in the subsequent experiments.

The two proposed methods for mining literature usage logs both require the identification of frequent itemsets from transactions, which needs the minimum support to be specified. However, the three transaction identification methods under comparison have different numbers of transactions. To be fair, we specify a different minimum support threshold for each method, such that the total number of articles involved in large two-item sets of each method, called recommendable articles, is approximately the same. Our recommendation framework recommends an article only if it is associated with other articles a sufficient number of times in the Web usage log. Therefore, articles that are not involved in large two-item sets cannot possibly be recommended. Table I shows the specified minimum support threshold and the number of recommendable articles of each method.

To illustrate how we conducted experiments, we define the following notation: let $T_{eval}$ be the set of transactions in the test set, $t_{eval}$ be a transaction in $T_{eval}$, and $a_t(i)$ be the i'th article in $t_{eval}$. Given a window size $W_{size}$, we divide each transaction $t_{eval}$ in the test data set into two lists, $t_{eval}[W]$ and $t_{eval}[R]$, where $t_{eval}[W]$ is the first $W_{size}$ article accesses of $t_{eval}$ and $t_{eval}[R]$ is the remaining articles. By treating $t_{eval}[W]$ as the current session, the recommender system will choose the set $t_{pr}$ of top-N articles for recommendation.

The performance metric we adopted for measuring the quality of recommendation is the precision and recall scheme. The precision is the ratio of the number of recommended articles accessed by a user to the total number of recommended articles, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{pr}|$, and recall is the ratio of the number of recommended articles accessed by a user to the total number of articles of interest to the user, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{eval}[R]|$. The precision (recall) of a recommendation approach is the average precision (recall) of all transactions in the test set.
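The evaluation protocol can be sketched in Python as follows. The sketch assumes a recommender is any function from the window (a list of articles) to a list of recommended articles, and that test transactions no longer than the window are skipped because nothing remains to predict; both points, and the function names, are assumptions made for illustration.

```python
def precision_recall(transaction, window_size, recommend):
    """Split one test transaction and score the recommendations for its window."""
    window = transaction[:window_size]       # treated as the current session
    rest = set(transaction[window_size:])    # articles the user went on to access
    top = set(recommend(window))
    hits = len(top & rest)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(rest) if rest else 0.0
    return precision, recall


def evaluate(test_transactions, window_size, recommend):
    """Average precision and recall over the usable test transactions."""
    usable = [t for t in test_transactions if len(t) > window_size]
    if not usable:
        return 0.0, 0.0
    pairs = [precision_recall(t, window_size, recommend) for t in usable]
    mean_precision = sum(p for p, _ in pairs) / len(pairs)
    mean_recall = sum(r for _, r in pairs) / len(pairs)
    return mean_precision, mean_recall


def fixed_recommender(window):
    # Stand-in recommender that always proposes the same three articles.
    return ["c", "e", "f"]


toy_test = [["a", "b", "c", "d"], ["a", "b", "c"], ["a"]]
avg_precision, avg_recall = evaluate(toy_test, window_size=2, recommend=fixed_recommender)
```

In the toy run the first transaction scores precision 1/3 and recall 1/2, the second scores precision 1/3 and recall 1, and the third is skipped because it is shorter than the window.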

Test of H1
We first evaluate the performance impact of the three transaction identification methods. In this experiment, $\alpha$ was set to 0.5 and the logarithm was taken when computing the weight of an itemset. Figure 3(a, b) shows the precisions and recalls, respectively, under association-rule-based recommendation.


Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

The precisions and recalls under hypergraph-based recommendation are shown in Figure 4(a, b), respectively.

Overall, the trends are the same for the association-rule-based and hypergraph-based approaches. It can be clearly seen that both the query-chosen and session-chosen methods outperform the query-result method in terms of both precision and recall. This implies that the information about articles browsed plays a crucial role in making recommendations - both the query-chosen and session-chosen methods incorporate this information in forming transactions. We therefore accept H1. However, the performance difference between the query-chosen and session-chosen methods is not significant.

Figure 3 (a) Precisions of the association-based approach; (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Method                   Minimum support (per cent)   No. of recommended articles
Session-chosen method    0.16                         253
Query-chosen method      0.12                         250
Query-result method      3.3                          229

Test of H2 and H3
We then conducted experiments to shed light on the impact of $\alpha$ and the logarithmic function in the weight definition for the hypergraph-based recommendation approach. Figure 5(a, b) shows the precision and recall values when $\alpha$ = 0, 0.5 and 1, using the session-chosen method for transaction identification. The window size was set at 2. As can be seen, differences in the precision and recall under different settings of $\alpha$ are very small. We performed the same experiments for different transaction identification methods and window sizes and obtained similar results. H2 is therefore rejected.

We then computed the precisions and recalls with and without application of the logarithmic function on the itemset weight. Figure 6(a, b) shows the resulting precisions and recalls, respectively.

Figure 6 shows that applying the logarithmic function on the itemset weight definition achieves significantly better precision and recall values. This meets our expectation, and H3 is accepted.

Test of H4
Finally, we evaluated the impact of the association-rule-based and hypergraph-based recommendation approaches for different window sizes. Figure 7(a, b) shows the precisions and recalls for window sizes of 2, 3, 4, 5 and 6 for a top-15 recommendation. It can be seen that the hypergraph-based approach performs better than the association-rule-based approach, especially for larger window sizes.

We also compared the running times of both approaches whilst setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a result of an increased window size), the association-rule-based approach incurs a larger running time.

Figure 5 (a) Impact of $\alpha$ on precisions and (b) impact of $\alpha$ on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Overall, we conclude that the hypergraph-based approach is more attractive, since it yields better-quality article recommendation and has a more consistent running time. Thus, H4 is accepted.

Conclusions

In this paper, we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:
(1) data preparation of the Web logs;
(2) usage log mining; and
(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches - association-rule based and hypergraph based - for making recommendations. These alternatives and approaches were evaluated using the Web usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification, and that the hypergraph-based approach yields better-quality article recommendation and exhibits a more consistent running time, and thus is more scalable.

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.

References

Agrawal, R., Imielinski, T. and Swami, A. (1993), "Mining association rules between sets of items in large databases", in Buneman, P. and Jajodia, S. (Eds), Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 26-28, ACM Press, New York, NY, pp. 207-16.

Agrawal, R. and Srikant, R. (1994), "Fast algorithms for mining association rules", in Bocca, J.B., Jarke, M. and Zaniolo, C. (Eds), Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, Santiago, Chile, Morgan Kaufmann, San Francisco, CA, pp. 487-99.

Alspector, J., Kolcz, A. and Karunanithi, N. (1998), "Comparing feature-based and clique-based user models for movie selection", Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, ACM Press, New York, NY, pp. 11-18.

Andresen, D., Carver, L., Dolin, R., Fischer, C., Frew, J., Goodchild, M., Ibarra, O., Kothuri, R., Larsgaard, M., Nebert, D., Simpson, J., Smith, T., Yang, T. and Zheng, Q. (1995), "The WWW prototype of the Alexandria digital library", Proceedings of the International Symposium on Digital Libraries, Tsukuba, Japan, 22-5 August, pp. 17-27.

Ansari, A., Essegaier, S. and Kohli, R. (2000), "Internet recommendation systems", Journal of Marketing Research, Vol. 37 No. 3, pp. 67-85.

Arms, W. (2000), Digital Libraries, MIT Press, Cambridge, MA.

Armstrong, R., Freitag, D., Joachims, T. and Mitchell, T. (1997), "WebWatcher: a learning apprentice for the World Wide Web", AAAI Spring Symposium on Information Gathering from Heterogeneous Distributed Environments, pp. 6-12.

Balabanović, M. and Shoham, Y. (1997), "Fab: content-based collaborative recommendation", Communications of the ACM, Vol. 40 No. 3, pp. 66-72.

Baldonado, M., Chang, C., Gravano, L. and Paepcke, A. (1997), "Metadata for digital libraries: architecture and design rationale", Proceedings of the 2nd ACM International Conference on Digital Libraries, ACM Press, New York, NY, pp. 47-56.

Basu, C., Hirsh, H. and Cohen, W. (1998), "Recommendation as classification: using social and content-based information in recommendation", Proceedings of the 15th National Conference on Artificial Intelligence, 26-30 July, Madison, WI, AAAI Press, Menlo Park, CA, pp. 714-20.

Billsus, D. and Pazzani, M. (1999), "A hybrid user model for news story classification", in Kay, J. (Ed.), Proceedings of the 7th International Conference on User Modelling, Banff, Canada, 20-4 June, Springer-Verlag, New York, NY, pp. 99-108.

Bowman, C., Manber, P. and Schwartz, U. (1994), "Scalable Internet resources discovery: research problems and approaches", Communications of the ACM, Vol. 37 No. 8, pp. 98-107.

Breese, J., Heckerman, D. and Kadie, C. (1998), "Empirical analysis of predictive algorithms for collaborative filtering", Technical Report MSR-TR-98-12, Microsoft Research, Seattle, CA.

Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A. and Lin, C. (1996), "A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois digital library initiative project", IEEE Transactions on PAMI, Vol. 18 No. 8, pp. 17-34.

Cooley, R., Mobasher, B. and Srivastava, J. (1999), "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1 No. 1, pp. 5-32.

Furner, J. (2002), "On recommending", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 747-63.

Goldberg, D., Nichols, D., Oki, B. and Terry, D. (1992), "Using collaborative filtering to weave an information tapestry", Communications of the ACM, Vol. 35 No. 12, pp. 61-70.

Herlocker, J. and Konstan, J. (2001), "Content-independent task-focused recommendation", IEEE Internet Computing, Vol. 5 No. 6, pp. 40-7.

Karypis, G. (2002), "Multilevel hypergraph partitioning", Tech. Report TR02-25, Department of Computer Science, University of Minnesota, MN.

Kessler, J. (1996), Internet Digital Libraries: The International Dimension, Artech House Publishers, Norwood, MA.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997), "GroupLens: applying collaborative filtering to Usenet news", Communications of the ACM, Vol. 40 No. 3, pp. 77-87.


Lang, K. (1995), "Newsweeder: learning to filter netnews", in Prieditis, A. and Russell, S. (Eds), Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, Morgan Kaufmann, San Francisco, CA, pp. 331-9.

Liu, B., Hsu, W. and Ma, Y. (1999), "Mining association rules with multiple minimum supports", Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 15-18 August, San Diego, ACM Press, New York, NY, pp. 430-4.

Loeb, S. and Terry, D. (1992), "Information filtering", Communications of the ACM, Special Issue on Information Filtering, Vol. 35 No. 12, pp. 26-8.

Mobasher, B., Cooley, R. and Srivastava, J. (1999), "Creating adaptive Web sites through usage-based clustering of URLs", Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop.

Mobasher, B., Dai, H., Luo, T., Nakagawa, M. and Wiltshire, J. (2000), "Discovery of aggregate usage profiles for Web personalisation", Proceedings of the webKDD Workshop.

Mooney, R. and Roy, L. (2000), "Content-based book recommending using learning for text categorisation", Proceedings of the 5th ACM Conference on Digital Libraries, San Antonio, ACM Press, New York, NY, pp. 195-204.

Pazzani, M. and Billsus, D. (1997), "Learning and revising user profiles: the identification of interesting Web sites", Machine Learning, Vol. 27 No. 4, pp. 313-31.

Pazzani, M. (1999), "A framework for collaborative, content-based and demographic filtering", Artificial Intelligence Review, Vol. 13 No. 5/6, pp. 393-408.

Pennock, D., Horvitz, E., Lawrence, S. and Giles, C. (2000), "Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach", Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, 30 June-3 July, Morgan Kaufmann, San Francisco, CA, pp. 473-80.

Pitkow, J. and Pirolli, P. (1999), "Mining longest repeating subsequences to predict World Wide Web surfing", Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 139-150.

Schafer, J., Konstan, J. and Riedl, J. (2001), "E-commerce recommendation applications", Data Mining and Knowledge Discovery, Vol. 5 No. 1, pp. 10-22.

Shardanand, U. and Maes, P. (1995), "Social information filtering: algorithms for automating 'word of mouth'", Proceedings of the Conference on Human Factors in Computing Systems, Denver, CO, ACM Press, New York, NY, pp. 210-17.

Spink, A., Wilson, T., Ford, N., Foster, A. and Ellis, D. (2002), "Information seeking and mediated searching", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 695-703.

Srivastava, J., Cooley, R., Deshpande, M. and Tang, P. (2000), "Web usage mining: discovery and applications of usage patterns from Web data", SIGKDD Explorations, Vol. 1 No. 2, pp. 12-23.

Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. (1997), "PHOAKS: a system for sharing recommendations", Communications of the ACM, Vol. 40 No. 3, pp. 59-62.

Wilensky, R. (1996), "Toward work-centred digital information services", IEEE Computer, Vol. 29 No. 5, pp. 7-44.

Yan, T., Jacobsen, M., Molina, H. and Dayal, U. (1996), "From user access patterns to dynamic hypertext linking", Proceedings of the 5th International World Wide Web Conference, pp. 1007-14.

Yang, Q., Zhang, H.H. and Li, T. (2001), "Mining Web logs for prediction models in WWW caching and prefetching", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-8.



Data preparation for the literature usage log
To prepare data from the Web usage logs of a literature digital library, we basically follow the heuristics adopted by Cooley et al. (1999) for processing Web usage logs that involve static Web pages. In their work, the Web usage logs are assumed to be in the extended NCSA format (including referrer and agent fields). The approach contains three sequential steps: data cleansing, user session identification, and transaction identification. The objective of data cleansing is to prune out unwanted Web log records and to add back missing Web log records: some Web log records are surplus, as they are accesses to non-HTML pages (e.g. images and other HTTP requests involving no Web page accesses), while other Web log records are missing due to the existence of local caches, firewalls, and proxy servers. Identifying missing Web log records is especially difficult - several heuristics have been proposed for achieving this. However, we found this difficulty nonexistent when processing the Web usage log of NSYSU-ETD, because article Web pages are dynamically generated and are not cacheable. Our university ETD system is database driven, in that the theses metadata are stored in a DBMS. Most large-scale digital libraries adopt the same method for maintaining their collections. In the context of a literature digital library, we are concerned with, and retain, only the Web usage records that involve the following two types of accesses:
(1) Lookup accesses. Each lookup access is an execution of a CGI program with searching or browsing conditions specified in the parameters. An example lookup access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/search_by_advisor?advisor_name=San-Yih+Hwang, which lists all theses supervised by Professor San-Yih Hwang.
(2) Article accesses. Each article access is an execution of a CGI program that displays the detailed metadata on an article. An example article access of NSYSU-ETD is http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0726100-135739, which shows the metadata of the thesis whose URN is etd-0726100-135739.

Note that article accesses display the detailed metadata of articles and therefore are of primary interest. Lookup accesses provide lookup information to facilitate browsing or searching and can thus be considered auxiliary. Most of the time, a user will first execute a lookup access, followed by a selective list of article accesses. We found that some user sessions contained article accesses without prior lookup accesses in the Web usage log of NSYSU-ETD. This is because several information sources had provided hyperlinks directly to articles' page views. Each session can therefore be viewed as a list of queries. Each query (optionally) starts with a lookup access, followed by a list of article accesses related to the articles that the user chose to look at in more detail.

Figure 2 Page view of an article in NSYSU-ETD

The goal of user session identification is to divide the article accesses of each user into individual sessions. It is reasonable to assume that two records with different IP addresses, browsers, or operating systems belong to two different user sessions. In addition, the time interval between two consecutive requests in a user session should not be too large. As in many commercial products, we use 30 minutes as the default timeout period. When the time interval between the current access and the previous one exceeds this, a new user session is assumed to have started. Some of the identified user sessions are made by Internet robots and hence should not be considered. Some robots have known agent types and/or IP addresses and can be easily identified. Analysis of the user sessions of these known robots revealed that most of these sessions either have more than 100 article accesses or exhibit a mean adjacent Web page access interval of less than three seconds. User sessions that satisfy this condition are considered robot sessions and consequently are removed.
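The session rules above (same IP address and agent, a 30-minute timeout, and the robot heuristic) can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(ip, agent, timestamp, kind)`, the function names, and the sample log are hypothetical, and the agent string is assumed to stand in for both browser and operating system.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # default timeout period stated in the text


def identify_sessions(records):
    """Group (ip, agent, timestamp, kind) records into user sessions."""
    sessions = []
    open_session = {}  # (ip, agent) -> index of that visitor's latest session
    for rec in sorted(records, key=lambda r: r[2]):
        key = (rec[0], rec[1])
        idx = open_session.get(key)
        # A gap exceeding the timeout starts a new session.
        if idx is not None and rec[2] - sessions[idx][-1][2] <= TIMEOUT:
            sessions[idx].append(rec)
        else:
            sessions.append([rec])
            open_session[key] = len(sessions) - 1
    return sessions


def looks_like_robot(session, max_articles=100, min_mean_gap=3.0):
    """Flag sessions with >100 article accesses or a mean access gap under 3 s."""
    articles = sum(1 for r in session if r[3] == "article")
    if articles > max_articles:
        return True
    if len(session) < 2:
        return False
    span = (session[-1][2] - session[0][2]).total_seconds()
    return span / (len(session) - 1) < min_mean_gap


# Hypothetical log: one human visitor with a long pause, one fast crawler.
t0 = datetime(2002, 2, 1, 10, 0, 0)
log = [
    ("1.1.1.1", "Mozilla", t0, "lookup"),
    ("1.1.1.1", "Mozilla", t0 + timedelta(minutes=5), "article"),
    ("1.1.1.1", "Mozilla", t0 + timedelta(minutes=50), "article"),
    ("2.2.2.2", "Bot", t0, "article"),
    ("2.2.2.2", "Bot", t0 + timedelta(seconds=1), "article"),
]
sessions = identify_sessions(log)
robot_flags = [looks_like_robot(s) for s in sessions]
```

The 45-minute pause splits the first visitor into two sessions, and the second visitor is flagged because its mean access interval is one second.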

Finally, a user session is further divided into a number of transactions, each of which represents a semantically meaningful unit. However, the various transaction identification approaches proposed in Cooley et al. (1999) make use of either the index (auxiliary) pages or the Web site topology. Since neither exists in the context of literature digital libraries, these proposed approaches are not applicable. Our approach identifies transactions by considering the types of accesses, namely lookup and article accesses. In fact, articles listed by the same query must have some degree of similarity in their content (e.g. keyword, title, author, discipline). On the other hand, articles selected in the same user session or query also display some degree of similarity due to inherent human behaviour. Therefore, we have four methods for defining transactions:
(1) Query-chosen method: the articles selected in a query.
(2) Session-chosen method: the articles selected in a user session.
(3) Query-result method: the articles listed in a query.
(4) Session-result method: the articles listed in queries of a user session.

For the query-chosen and session-chosen methods, article accesses present in the Web logs are grouped into a set of transactions. For the query-result and session-result methods, we construct transactions by reissuing queries to the literature digital library.
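The two log-based methods can be sketched as follows. This is an illustration only: a session is assumed to be a time-ordered list of hypothetical `(kind, value)` pairs, and dropping repeated accesses to the same article within a transaction is an assumption. The query-result and session-result methods are omitted because they require reissuing queries to the library.

```python
def session_chosen(session):
    """Session-chosen method: all articles viewed in the session form one transaction."""
    seen = []
    for kind, value in session:
        if kind == "article" and value not in seen:
            seen.append(value)
    return [seen] if seen else []


def query_chosen(session):
    """Query-chosen method: each lookup opens a query; its viewed articles form a transaction."""
    transactions, current = [], []
    for kind, value in session:
        if kind == "lookup":
            if current:
                transactions.append(current)
            current = []  # a query with no chosen article yields no transaction
        elif value not in current:
            current.append(value)
    if current:
        transactions.append(current)
    return transactions


# Hypothetical session: a direct article hit, then three lookups (q2 chooses nothing).
session = [
    ("article", "etd-1"),
    ("lookup", "q1"), ("article", "etd-2"), ("article", "etd-3"),
    ("lookup", "q2"),
    ("lookup", "q3"), ("article", "etd-3"), ("article", "etd-4"),
]
by_query = query_chosen(session)
by_session = session_chosen(session)
```

The leading article access without a lookup is treated as its own query, matching the observation that a query only optionally starts with a lookup access.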

As mentioned, the query-chosen and session-chosen methods incorporate knowledge on human selection in making the recommendations. We expect that they will yield more effective recommendations than their counterparts without such knowledge, namely the query-result and session-result methods. We therefore form the following hypothesis:

H1. Users tend to browse the metadata of only the articles they find of interest. Recommendation schemes that consider the Web accesses of these articles will result in more effective recommendations.

Mining the literature usage log
There have been several approaches proposed in the literature (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001) for identifying aggregate usage profiles from Web usage logs. Aggregate usage profiles can be represented in the form of association rules, sequential patterns (or an n-gram Markov model), or clusters of Web pages. In the context of literature digital libraries, however, we decided not to consider sequential patterns, because the order of articles in a transaction may not relate to users' preferences. Instead, only association rules and clusters of articles will be discussed.

The problem of finding frequent associations between items in a transaction database, called the association-rule discovery problem, was first introduced by Agrawal et al. (1993). Association-rule discovery methods such as the Apriori algorithm (Agrawal et al., 1993; Agrawal and Srikant, 1994) typically decompose the problem into two phases: the extraction of frequent itemsets and the generation of strong association rules. In the initial extraction phase, the methods find sets of items that frequently occur together. The support of each itemset $I = X \cup Y$, denoted as Sup(I), is the fraction of transactions containing both X and Y. This itemset I is referred to as a frequent itemset if Sup(I) exceeds a user-specified minimal support threshold, Minsup.

In the second phase, each discovered itemset $I = X \cup Y$ is used to construct association rules of the form $X \Rightarrow Y$. The confidence of each rule, denoted as Conf($X \Rightarrow Y$), is the fraction of transactions containing X that also contain Y. An association rule is said to be strong if its confidence exceeds a user-specified confidence threshold. By applying association-rule discovery algorithms to the transactions of the transformed logs, we can find association rules of the form $a_1, a_2, \ldots, a_m \Rightarrow a$, which can be used to recommend article a to users who have browsed $a_1, a_2, \ldots, a_m$ but not a.
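The support and confidence definitions above can be checked on a toy transaction set. The sketch below computes both directly by counting; it is not the Apriori algorithm itself, and the function names and sample transactions are hypothetical.

```python
def support(itemset, transactions):
    """Sup(I): fraction of transactions that contain every item of I."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x, y, transactions):
    """Conf(X => Y): fraction of transactions containing X that also contain Y.

    Assumes Sup(X) > 0, which holds whenever X is a frequent itemset.
    """
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical transactions of article identifiers.
transactions = [frozenset(t) for t in (
    {"a1", "a2", "a3"},
    {"a1", "a2"},
    {"a1", "a3"},
    {"a2", "a4"},
)]
sup_a1_a2 = support({"a1", "a2"}, transactions)
conf_a1_to_a2 = confidence({"a1"}, {"a2"}, transactions)
```

Here {a1, a2} occurs in two of four transactions, and two of the three transactions containing a1 also contain a2.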

The traditional approaches for identifying itemsets with a uniform minimum support threshold, however, cannot be directly applied, because articles that arrive later tend to have smaller support even if they are actually more popular. Therefore, a nonuniform support threshold scheme, originally proposed in Liu et al. (1999), is adopted. In this scheme, each item is assigned a distinct minimum support value (called the minimum item support, MIS), and the minimum support of an itemset is the minimum of the MIS values of its constituent items. In the literature recommendation system, we view the MIS value of an article as a function of its creation time. That is, articles that are added to the digital library more recently should be assigned smaller MIS values. Let N(t) be the number of transactions after time t in the Web usage logs. The MIS of an article a is defined as follows:

$$MIS(a) = \begin{cases} M(a) & \text{if } M(a) > LS \\ LS & \text{otherwise} \end{cases}$$

$$M(a) = \frac{N(CreationTime(a))}{N(0)} \cdot Minsup$$

where LS is the lower bound for support values, Minsup is the minimum support threshold based on the entire article browsing log, and N(0) denotes the total number of transactions in the Web usage log. Both LS and Minsup are user-defined constants. After assigning the MIS values of all articles in the literature digital library, the method proposed in Liu et al. (1999) can be applied to derive the association rules for articles.
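The MIS assignment above can be sketched numerically. In this illustration, times are plain day numbers with the log starting after day 0, and `make_n`, `mis`, and `itemset_minsup` are hypothetical helper names; the multiple-minimum-support mining procedure of Liu et al. (1999) itself is not reproduced.

```python
from bisect import bisect_right


def make_n(transaction_times):
    """Return N(t), the number of logged transactions occurring after time t."""
    times = sorted(transaction_times)
    return lambda t: len(times) - bisect_right(times, t)


def mis(creation_time, n, minsup, ls):
    """MIS(a) = M(a) if M(a) > LS, else LS, with M(a) = N(CreationTime(a)) / N(0) * Minsup."""
    m = n(creation_time) / n(0) * minsup
    return m if m > ls else ls


def itemset_minsup(creation_times, n, minsup, ls):
    """Minimum support of an itemset: the smallest MIS among its articles."""
    return min(mis(t, n, minsup, ls) for t in creation_times)


# Hypothetical log: one transaction on each of days 1..10.
n = make_n(range(1, 11))
old_article = mis(0, n, minsup=0.2, ls=0.05)     # present for the whole log
mid_article = mis(5, n, minsup=0.2, ls=0.05)     # sees half of the transactions
new_article = mis(8, n, minsup=0.2, ls=0.05)     # M(a) falls below LS
pair_threshold = itemset_minsup([0, 8], n, minsup=0.2, ls=0.05)
```

An article added on day 8 can appear in only two of the ten transactions, so its scaled threshold would be 0.04 and the lower bound LS takes over.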

For the clustering technique, we adopt the Association Rule Hypergraph Partitioning (ARHP) approach (Mobasher et al., 1999, 2000) rather than traditional clustering techniques. The main reason for this is that ARHP is more efficient in handling high-dimensional data, such as those present in literature digital libraries. The dimensions of a transaction are the set of articles, which is huge for a large-scale digital library. This approach starts with the identification of frequent itemsets (as in association-rule discovery methods), each of which contains articles often accessed together in transactions. Each such frequent itemset is then viewed as a hyperedge with a specific weight.

As mentioned, each article has a distinct creation time. We therefore normalise the support values of itemsets as follows before computing their weights:

$$Sup'(a_i) = \frac{N(0)}{N(CreationTime(a_i))} \cdot Sup(a_i)$$

$$Sup'(a_1, \ldots, a_k) = \frac{N(0)}{N\left(\max_{1 \le i \le k} CreationTime(a_i)\right)} \cdot Sup(a_1, \ldots, a_k)$$

There are several ways to define the weight of an itemset, such as using either the support for or the interest in the itemset. The former favours itemsets of smaller size, whereas the latter gives priority to larger itemsets. We define a general weighting formula that covers the broad spectrum between these two extremes. In addition, the supports for or interests in different itemsets can have very diverse values. To prevent itemsets of large weight from dominating the subsequent clustering procedure, we apply the logarithm on the weight. The following is our definition of the weight of an itemset $(a_1, a_2, \ldots, a_k)$:

$$weight(a_1, a_2, \ldots, a_k) = \log\left( \frac{Sup'(a_1, a_2, \ldots, a_k)}{\left[ Sup'(a_1) \cdot Sup'(a_2) \cdots Sup'(a_k) \right]^{\alpha}} \times \frac{1}{Minsup} \right)$$

where $0 \le \alpha \le 1$. Note that when $\alpha = 0$ or $\alpha = 1$, this formula is equivalent to using support or interest as the weight, respectively. 1/Minsup is a constant that keeps the weight non-negative. This definition supports our following hypotheses:

H2. A better recommendation effectiveness can be achieved by striking a balance between using support and interest as the weight of an itemset.

H3. A better recommendation effectiveness can be achieved by incorporating the logarithm function in the weight of an itemset.
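The normalisation and the itemset weight behind H2 and H3 can be sketched as follows. The function names and sample numbers are hypothetical, the natural logarithm is an assumption (the base is not stated in the text), and `math.prod` requires Python 3.8 or later.

```python
import math


def normalised_support(sup, n_total, n_after_newest):
    """Sup' = N(0) / N(latest creation time in the itemset) * Sup."""
    return n_total / n_after_newest * sup


def itemset_weight(sup_itemset, sup_items, minsup, alpha, use_log=True):
    """Weight of an itemset from its normalised support and those of its articles.

    alpha = 0 reduces to support, alpha = 1 to interest (both scaled by 1/Minsup).
    """
    raw = sup_itemset / math.prod(sup_items) ** alpha / minsup
    # Assumption: natural logarithm.
    return math.log(raw) if use_log else raw


# Hypothetical two-article itemset with Minsup = 0.01.
sup_pair = normalised_support(0.01, n_total=1000, n_after_newest=500)
w_support = itemset_weight(sup_pair, [0.1, 0.2], minsup=0.01, alpha=0.0)
w_interest = itemset_weight(sup_pair, [0.1, 0.2], minsup=0.01, alpha=1.0)
w_balanced = itemset_weight(sup_pair, [0.1, 0.2], minsup=0.01, alpha=0.5)
```

Setting `use_log=False` gives the unlogged variant compared in the test of H3, and intermediate values of `alpha` correspond to the balance examined in H2.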

After deciding the weight of each itemset, the hypergraph partitioning algorithm proposed in Karypis (2002) is applied to partition the set of articles into disjoint clusters of articles. Articles in the same cluster are more "similar", in the sense that they are more likely to be accessed together in the same transaction. To reflect the fact that an article may indeed interest more than one group of users, we adopt the same heuristic as used in Mobasher et al. (1999) by adding back articles to clusters, which results in overlapping clusters. Specifically, for a given hyperedge, if the percentage of involved vertices in a cluster is larger than a threshold, the other involved vertices are included in the same cluster.

Online recommendations
We propose two recommendation approaches that use the article association rules and article clusters obtained by the methods described above. The goal is to recommend the top-N articles that potentially interest the active user. The first approach makes use of article association rules. The idea is to treat each frequent itemset as the interest profile of a user group and to recommend articles based on the similarity between the current session of the active user and the interest profiles of the relevant user groups. Specifically, let s be the active user's current session of length k. We first identify the set of frequent itemsets of size k + 1 that contain all elements in s and an extra element m (not in s). For each such itemset I, the confidence of the rule $I - \{m\} \Rightarrow \{m\}$ is calculated. These extra elements are then recommended to the user in descending order of confidence value. If these elements are not sufficient (i.e. there are fewer than N of them), we then search for frequent itemsets of size k that contain k - 1 elements in s and an extra element (not in s). Again, these extra elements are recommended to the user in descending order of confidence value. This procedure continues until N articles are recommended.
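The back-off procedure above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `supports` is a hypothetical dictionary mapping each frequent itemset to its support, a session subset missing from that dictionary is simply skipped, and ties in confidence are broken alphabetically.

```python
from itertools import combinations


def recommend(session, supports, n):
    """Recommend up to n articles, matching ever smaller subsets of the session."""
    session = set(session)
    picked = []
    for size in range(len(session), 0, -1):  # number of session articles matched
        scored = {}
        for subset in combinations(sorted(session), size):
            base = frozenset(subset)
            if base not in supports:
                continue  # assumption: non-frequent subsets cannot form a rule
            for itemset, sup in supports.items():
                if len(itemset) == size + 1 and base < itemset:
                    (extra,) = itemset - base
                    if extra in session or extra in picked:
                        continue
                    conf = sup / supports[base]  # Conf(I - {m} => {m})
                    scored[extra] = max(conf, scored.get(extra, 0.0))
        for extra in sorted(scored, key=lambda m: (-scored[m], m)):
            picked.append(extra)
            if len(picked) == n:
                return picked
    return picked


# Hypothetical frequent itemsets with their supports.
supports = {frozenset(k): v for k, v in [
    (("a1",), 0.4), (("a2",), 0.3), (("a3",), 0.2), (("a4",), 0.2),
    (("a1", "a2"), 0.2), (("a1", "a3"), 0.1), (("a2", "a4"), 0.15),
    (("a1", "a2", "a3"), 0.08),
]}
top2 = recommend(["a1", "a2"], supports, n=2)
```

With the session {a1, a2}, the size-3 itemset contributes a3 first; only one candidate is found at that level, so the search backs off to size-2 itemsets and adds a4. Each request scans the frequent itemsets, which is consistent with the running time growing as the minimum support decreases.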

Our other proposed method uses a hypergraph-based approach. In this approach, the recommendation score of each article a is computed by considering the similarity between the current user session and the clusters C to which a belongs, and the coherence weight of a with respect to C. Specifically, each cluster of articles can be viewed as a vector with binary elements, each of which indicates whether an article appears in the cluster. Similarly, the current user session can also be represented as a vector. Then, the similarity between the current session S and a cluster C can be defined as a cosine function as follows:

$$match(S, C) = \frac{\sum_k a_k^C \times S_k}{\sqrt{\sum_k (S_k)^2 \times \sum_k (a_k^C)^2}}$$

where $S_k$ is the k'th element in S and $a_k^C$ is the k'th element in C.

The coherence weight of an article a with respect to the cluster C that it belongs to is defined as:

$$weight(a, C) = \frac{\sum_{a \in e,\, e \subseteq C} weight(e)}{\sum_{e \subseteq C} weight(e)}$$

where weight(e) is the weight of a hyperedge e.

The recommendation score Rec(S, a) of an article a with respect to the current user session S is then defined as:

$$Rec(S, a) = \max_{a \in C} \sqrt{weight(a, C) \times match(S, C)}$$

The top-N articles for recommendation are those with the N highest values in the recommendation score.
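The three formulas above can be sketched as follows. The sketch assumes that the clusters and hyperedge weights have already been produced by the partitioning step, which is not reproduced here; the function names and sample data are hypothetical, and returning 0.0 for empty inputs is an assumption.

```python
import math


def match(session, cluster):
    """Cosine similarity between the binary vectors of a session and a cluster."""
    session, cluster = set(session), set(cluster)
    if not session or not cluster:
        return 0.0
    return len(session & cluster) / math.sqrt(len(session) * len(cluster))


def coherence(article, cluster, hyperedges):
    """weight(a, C): share of in-cluster hyperedge weight on edges containing a."""
    cluster = set(cluster)
    inside = {e: w for e, w in hyperedges.items() if e <= cluster}
    total = sum(inside.values())
    if total == 0:
        return 0.0
    return sum(w for e, w in inside.items() if article in e) / total


def rec_score(session, article, clusters, hyperedges):
    """Rec(S, a): best sqrt(weight(a, C) * match(S, C)) over clusters containing a."""
    return max(
        (math.sqrt(coherence(article, c, hyperedges) * match(session, c))
         for c in clusters if article in c),
        default=0.0,
    )


def top_n(session, clusters, hyperedges, n):
    """Rank articles outside the session by score (ties broken alphabetically)."""
    candidates = set().union(*clusters) - set(session)
    ranked = sorted(
        candidates,
        key=lambda a: (-rec_score(session, a, clusters, hyperedges), a),
    )
    return ranked[:n]


# Hypothetical overlapping clusters and hyperedge weights.
clusters = [{"a1", "a2", "a3"}, {"a3", "a4"}]
hyperedges = {
    frozenset({"a1", "a2"}): 1.0,
    frozenset({"a2", "a3"}): 3.0,
    frozenset({"a3", "a4"}): 2.0,
}
session = ["a1", "a2"]
score_a3 = rec_score(session, "a3", clusters, hyperedges)
best = top_n(session, clusters, hyperedges, n=1)
```

Scoring touches only the clusters and their hyperedges, not the full set of frequent itemsets, which is consistent with the roughly constant running time reported for this approach.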


The hypergraph-based approach is more carefully designed than the association-rule-based approach, and we expect the former to perform better:

H4. The hypergraph-based approach will result in more effective recommendations and have a quicker response time than the association-rule-based approach.

Empirical evaluations

This section reports our experience in applying the Web usage logs of NSYSU-ETD to the proposed literature recommendation system. The main objective was to test our four hypotheses. NSYSU-ETD runs on PC Solaris 2.7 and uses Apache 1.3.9 as the Web server. Since being commissioned in May 2000, it has been loaded with more than 3,000 electronic theses of National Sun Yat-sen University. Up to February 2003, these theses had been browsed more than 400,000 times and downloaded more than 100,000 times. We analysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002 for our experiments: the data collected from February 1 to April 30 were designated as the training data set, and those collected in May served as the test data set.

We first applied the data cleansingtechnique on the training data and obtained43349 lookup accesses and 41627 articleaccesses Applying the session identificationtechnique revealed 16922 user sessionsamong which 392 sessions were robotgenerated 6068 sessions contained only onearticle access and 5253 sessions containedno article accesses We eliminated these trivialuser sessions and applied transactionidentification techniques resulting in 5617transactions for the query-chosen method5272 transactions for the session-chosenmethod and 17742 transactions for thequery-result method Queries whose resultsare never chosen by the users are removedfrom the query-chosen method but remain inthe query-result method The session-resultmethod produced transactions of huge sizeeach containing thousands of article accessesWe therefore decided not to consider thismethod in the subsequent experiments

The two proposed methods for mining literature usage logs both require the identification of frequent itemsets from transactions, which needs the minimum support to be specified. However, the three transaction identification methods under comparison have different numbers of transactions. To be fair, we specify a different minimum support threshold for each method such that the total number of articles involved in large two-item sets of each method - called recommendable articles - is approximately the same. Our recommendation framework recommends an article only if it is associated with other articles a sufficient number of times in the Web usage log. Therefore, articles that are not involved in large two-item sets cannot possibly be recommended. Table I shows the specified minimum support threshold and the number of recommendable articles of each method.

To illustrate how we conducted experiments, we define the following notation: let $T_{eval}$ be the set of transactions in the test set, $t_{eval}$ be a transaction in $T_{eval}$, and $a_t(i)$ be the i'th article in $t_{eval}$. Given a window size $W_{size}$, we divide each transaction $t_{eval}$ in the test data set into two lists, $t_{eval}[W]$ and $t_{eval}[R]$, where $t_{eval}[W]$ is the first $W_{size}$ article accesses of $t_{eval}$ and $t_{eval}[R]$ is the remaining articles. By treating $t_{eval}[W]$ as the current session, the recommender system will choose the set $t_{pr}$ of top-N articles for recommendation.

The performance metric we adopted for measuring the quality of recommendation is the precision and recall scheme. The precision is the ratio of the number of recommended articles accessed by a user to the total number of recommended articles, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{pr}|$, and recall is the ratio of the number of recommended articles accessed by a user to the total number of articles of interest to the user, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{eval}[R]|$. The precision (recall) of a recommendation approach is the average precision (recall) of all transactions in the test set.
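The evaluation protocol described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `evaluate`, the `recommend` callback and the decision to skip transactions that have no remaining articles (where recall would be undefined) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def evaluate(
    test_transactions: Sequence[Sequence[str]],
    recommend: Callable[[List[str], int], List[str]],
    window_size: int,
    top_n: int,
) -> Tuple[float, float]:
    """Return (average precision, average recall) over the test transactions."""
    precisions: List[float] = []
    recalls: List[float] = []
    for t in test_transactions:
        window = list(t[:window_size])   # t_eval[W]: treated as the current session
        rest = set(t[window_size:])      # t_eval[R]: articles the user went on to access
        if not rest:
            continue                     # assumption: skip when recall is undefined
        recommended = set(recommend(window, top_n))  # t_pr
        hits = len(recommended & rest)
        precisions.append(hits / len(recommended) if recommended else 0.0)
        recalls.append(hits / len(rest))
    if not precisions:
        return 0.0, 0.0
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```

Any recommender that maps a window of article identifiers to a ranked list can be plugged in as `recommend`, so the same harness can compare both recommendation approaches under identical window sizes.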

Test of H1

We first evaluate the performance impact of the three transaction identification methods. In this experiment, $\alpha$ was set to 0.5 and the logarithm was taken when computing the weight of an itemset. Figure 3(a, b) shows the precisions and recalls, respectively, under association-rule-based recommendation.


The precisions and recalls under hypergraph-based recommendation are shown in Figure 4(a, b), respectively.

Overall, the trends are the same for the association-rule-based and hypergraph-based approaches. It can be clearly seen that both the query-chosen and session-chosen methods outperform the query-result method in terms of both precision and recall. This implies that the information about articles browsed plays a crucial role in making recommendations - both the query-chosen and session-chosen methods incorporate this information in forming transactions. We therefore accept H1. However, the performance difference between the query-chosen and session-chosen methods is not significant.

Figure 3. (a) Precisions of the association-based approach; (b) recalls of the association-based approach

Figure 4. (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification methods

Table I. Minimum support and number of recommended candidates for each transaction identification method

Method                    Minimum support (per cent)    No. of recommended articles
Session-chosen method     0.16                          253
Query-chosen method       0.12                          250
Query-result method       3.3                           229

Test of H2 and H3

We then conducted experiments to shed light on the impact of $\alpha$ and the logarithmic function in the weight definition for the hypergraph-based recommendation approach. Figure 5(a, b) shows the precision and recall values when $\alpha$ = 0, 0.5 and 1, using the session-chosen method for transaction identification. The window size was set at 2. As can be seen, differences in the precision and recall under different settings of $\alpha$ are very small. We have performed the same experiments for different transaction identification methods and window sizes and obtained similar results. H2 is therefore rejected.

We then computed the precisions and recalls with and without application of the logarithmic function on the itemset weight. Figure 6(a, b) shows the resulting precisions and recalls, respectively.

Figure 6 shows that applying the logarithmic function on the itemset weight definition achieves significantly better precision and recall values. This meets our expectation, and H3 is accepted.

Test of H4

Finally, we evaluated the impact of association-rule-based and hypergraph-based recommendation approaches for different window sizes. Figure 7(a, b) shows the precisions and recalls for window sizes of 2, 3, 4, 5 and 6 for a top-15 recommendation. It can be seen that the hypergraph-based approach performs better than the association-rule-based approach, especially for larger window sizes.

We also compared the running times of both approaches whilst setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a result of an increased window size), the association-rule-based approach incurs a larger running time.

Figure 5. (a) Impact of $\alpha$ on precisions and (b) impact of $\alpha$ on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6. (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Overall, we conclude that the hypergraph-based approach is more attractive, since it yields better-quality article recommendation and has a more consistent running time. Thus, H4 is accepted.

Conclusions

In this paper, we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:
(1) data preparation of the Web logs;
(2) usage log mining; and
(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches - association-rule based and hypergraph based - for making recommendations. These alternatives and approaches were evaluated using the Web usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification, and that the hypergraph-based approach yields better-quality article recommendation and exhibits a more consistent running time, and thus is more scalable.

Figure 7. (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8. (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.

References

Agrawal, R., Imielinski, T. and Swami, A. (1993), "Mining association rules between sets of items in large databases", in Buneman, P. and Jajodia, S. (Eds), Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 26-28, ACM Press, New York, NY, pp. 207-16.

Agrawal, R. and Srikant, R. (1994), "Fast algorithms for mining association rules", in Bocca, J.B., Jarke, M. and Zaniolo, C. (Eds), Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, Santiago, Chile, Morgan Kaufmann, San Francisco, CA, pp. 487-99.

Alspector, J., Kolcz, A. and Karunanithi, N. (1998), "Comparing feature-based and clique-based user models for movie selection", Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, ACM Press, New York, NY, pp. 11-18.

Andresen, D., Carver, L., Dolin, R., Fischer, C., Frew, J., Goodchild, M., Ibarra, O., Kothuri, R., Larsgaard, M., Nebert, D., Simpson, J., Smith, T., Yang, T. and Zheng, Q. (1995), "The WWW prototype of the Alexandria digital library", Proceedings of the International Symposium on Digital Libraries, Tsukuba, Japan, 22-5 August, pp. 17-27.

Ansari, A., Essegaier, S. and Kohli, R. (2000), "Internet recommendation systems", Journal of Marketing Research, Vol. 37 No. 3, pp. 67-85.

Arms, W. (2000), Digital Libraries, MIT Press, Cambridge, MA.

Armstrong, R., Freitag, D., Joachims, T. and Mitchell, T. (1997), "WebWatcher: a learning apprentice for the World Wide Web", AAAI Spring Symposium on Information Gathering from Heterogeneous Distributed Environments, pp. 6-12.

Balabanović, M. and Shoham, Y. (1997), "Fab: content-based, collaborative recommendation", Communications of the ACM, Vol. 40 No. 3, pp. 66-72.

Baldonado, M., Chang, C., Gravano, L. and Paepcke, A. (1997), "Metadata for digital libraries: architecture and design rationale", Proceedings of the 2nd ACM International Conference on Digital Libraries, ACM Press, New York, NY, pp. 47-56.

Basu, C., Hirsh, H. and Cohen, W. (1998), "Recommendation as classification: using social and content-based information in recommendation", Proceedings of the 15th National Conference on Artificial Intelligence, 26-30 July, Madison, WI, AAAI Press, Menlo Park, CA, pp. 714-20.

Billsus, D. and Pazzani, M. (1999), "A hybrid user model for news story classification", in Kay, J. (Ed.), Proceedings of the 7th International Conference on User Modelling, Banff, Canada, 20-4 June, Springer-Verlag, New York, NY, pp. 99-108.

Bowman, C., Manber, P. and Schwartz, U. (1994), "Scalable Internet resources discovery: research problems and approaches", Communications of the ACM, Vol. 37 No. 8, pp. 98-107.

Breese, J., Heckerman, D. and Kadie, C. (1998), "Empirical analysis of predictive algorithms for collaborative filtering", Technical Report MSR-TR-98-12, Microsoft Research, Seattle, CA.

Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A. and Lin, C. (1996), "A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois digital library initiative project", IEEE Transactions on PAMI, Vol. 18 No. 8, pp. 17-34.

Cooley, R., Mobasher, B. and Srivastava, J. (1999), "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1 No. 1, pp. 5-32.

Furner, J. (2002), "On recommending", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 747-63.

Goldberg, D., Nichols, D., Oki, B. and Terry, D. (1992), "Using collaborative filtering to weave an information tapestry", Communications of the ACM, Vol. 35 No. 12, pp. 61-70.

Herlocker, J. and Konstan, J. (2001), "Content-independent task-focused recommendation", IEEE Internet Computing, Vol. 5 No. 6, pp. 40-7.

Karypis, G. (2002), "Multilevel hypergraph partitioning", Tech. Report TR02-25, Department of Computer Science, University of Minnesota, MN.

Kessler, J. (1996), Internet Digital Libraries: The International Dimension, Artech House Publishers, Norwood, MA.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997), "GroupLens: applying collaborative filtering to Usenet news", Communications of the ACM, Vol. 40 No. 3, pp. 77-87.

Lang, K. (1995), "Newsweeder: learning to filter netnews", in Prieditis, A. and Russell, S. (Eds), Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, Morgan Kaufmann, San Francisco, CA, pp. 331-9.

Liu, B., Hsu, W. and Ma, Y. (1999), "Mining association rules with multiple minimum supports", Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 15-18 August, San Diego, ACM Press, New York, NY, pp. 430-4.

Loeb, S. and Terry, D. (1992), "Information filtering", Communications of the ACM, Special Issue on Information Filtering, Vol. 35 No. 12, pp. 26-8.

Mobasher, B., Cooley, R. and Srivastava, J. (1999), "Creating adaptive Web sites through usage-based clustering of URLs", Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop.

Mobasher, B., Dai, H., Luo, T., Nakagawa, M. and Wiltshire, J. (2000), "Discovery of aggregate usage profiles for Web personalisation", Proceedings of the webKDD Workshop.

Mooney, R. and Roy, L. (2000), "Content-based book recommending using learning for text categorisation", Proceedings of the 5th ACM Conference on Digital Libraries, San Antonio, ACM Press, New York, NY, pp. 195-204.

Pazzani, M. and Billsus, D. (1997), "Learning and revising user profiles: the identification of interesting Web sites", Machine Learning, Vol. 27 No. 4, pp. 313-31.

Pazzani, M. (1999), "A framework for collaborative, content-based and demographic filtering", Artificial Intelligence Review, Vol. 13 No. 5/6, pp. 393-408.

Pennock, D., Horvitz, E., Lawrence, S. and Giles, C. (2000), "Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach", Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, 30 June-3 July, Morgan Kaufmann, San Francisco, CA, pp. 473-80.

Pitkow, J. and Pirolli, P. (1999), "Mining longest repeating subsequences to predict World Wide Web surfing", Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 139-150.

Schafer, J., Konstan, J. and Riedl, J. (2001), "E-commerce recommendation applications", Data Mining and Knowledge Discovery, Vol. 5 No. 1, pp. 10-22.

Shardanand, U. and Maes, P. (1995), "Social information filtering: algorithms for automating 'word of mouth'", Proceedings of the Conference on Human Factors in Computing Systems, Denver, CO, ACM Press, New York, NY, pp. 210-17.

Spink, A., Wilson, T., Ford, N., Foster, A. and Ellis, D. (2002), "Information seeking and mediated searching", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 695-703.

Srivastava, J., Cooley, R., Deshpande, M. and Tang, P. (2000), "Web usage mining: discovery and applications of usage patterns from Web data", SIGKDD Explorations, Vol. 1 No. 2, pp. 12-23.

Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. (1997), "PHOAKS: a system for sharing recommendations", Communications of the ACM, Vol. 40 No. 3, pp. 59-62.

Wilensky, R. (1996), "Toward work-centred digital information services", IEEE Computer, Vol. 29 No. 5, pp. 7-44.

Yan, T., Jacobsen, M., Molina, H. and Dayal, U. (1996), "From user access patterns to dynamic hypertext linking", Proceedings of the 5th International World Wide Web Conference, pp. 1007-14.

Yang, Q., Zhang, H.H. and Li, T. (2001), "Mining Web logs for prediction models in WWW caching and prefetching", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-8.


primary interest. Lookup accesses provide lookup information to facilitate browsing or searching and can thus be considered auxiliary. Most of the time, a user will first execute a lookup access, followed by a selective list of article accesses. We found that some user sessions contained article accesses without prior lookup accesses in the Web usage log of NSYSU-ETD. This is because several information sources had provided hyperlinks directly to articles' page views. In this case, each session can be viewed as a list of queries. Each query (optionally) starts with a lookup access, followed by a list of article accesses related to the articles that the user chose to look at in more detail.

The goal of user session identification is to divide the article accesses of each user into individual sessions. It is reasonable to assume that two records with different IP addresses, browsers or operating systems belong to two different user sessions. In addition, the time interval between two consecutive requests in a user session should not be too large. As in many commercial products, we use 30 minutes as the default timeout period. When the time interval between the current access and the previous one exceeds this, a new user session is assumed to have started. Some of the identified user sessions are made by Internet robots and hence should not be considered. Some robots have known agent types and/or IP addresses and can be easily identified. Analysis of the user sessions of these known robots revealed that most of these sessions either have more than 100 article accesses or exhibit a mean adjacent Web page access interval of less than three seconds. User sessions that satisfy this condition are considered as robot sessions and consequently are removed.
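The session rules above (same IP address and agent, 30-minute timeout, robot heuristics) can be sketched as follows. The `Access` record, its field names and both function names are hypothetical and chosen only for this illustration; the three thresholds are the ones stated in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SESSION_TIMEOUT = 30 * 60    # 30-minute default timeout, in seconds
ROBOT_MAX_ARTICLES = 100     # more than 100 article accesses suggests a robot
ROBOT_MIN_MEAN_GAP = 3.0     # a mean gap under three seconds suggests a robot


@dataclass
class Access:
    ip: str
    agent: str        # browser / operating system string (hypothetical field)
    timestamp: float  # seconds since an arbitrary epoch
    kind: str         # "lookup" or "article"
    target: str       # query string or article identifier


def identify_sessions(log: List[Access]) -> List[List[Access]]:
    """Group records by (ip, agent), then cut a session when a gap exceeds the timeout."""
    by_user: Dict[Tuple[str, str], List[Access]] = {}
    for rec in sorted(log, key=lambda r: r.timestamp):
        by_user.setdefault((rec.ip, rec.agent), []).append(rec)

    sessions: List[List[Access]] = []
    for records in by_user.values():
        current = [records[0]]
        for prev, rec in zip(records, records[1:]):
            if rec.timestamp - prev.timestamp > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(rec)
        sessions.append(current)
    return sessions


def looks_like_robot(session: List[Access]) -> bool:
    """Apply the two heuristics from the text to one session."""
    articles = sum(1 for r in session if r.kind == "article")
    if articles > ROBOT_MAX_ARTICLES:
        return True
    if len(session) < 2:
        return False  # no adjacent pair, so no mean interval to test
    mean_gap = (session[-1].timestamp - session[0].timestamp) / (len(session) - 1)
    return mean_gap < ROBOT_MIN_MEAN_GAP
```

Sessions for which `looks_like_robot` returns `True` would be dropped before transaction identification; sessions from robots with known agent types or IP addresses would need a separate list, which is not modelled here.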

Finally, a user session is further divided into a number of transactions, each of which represents a semantically meaningful unit. However, the various transaction identification approaches proposed in Cooley et al. (1999) make use of either the index (auxiliary) pages or the Web site topology. Since neither exists in the context of literature digital libraries, these proposed approaches are not applicable. Our approach identifies transactions by considering the types of accesses, namely lookup and article accesses. In fact, articles listed by the same query must have some degree of similarity in their content (e.g. keyword, title, author, discipline). On the other hand, articles selected in the same user session or query also display some degree of similarity due to inherent human behaviour. Therefore, we have four methods for defining transactions:
(1) Query-chosen method: the articles selected in a query.
(2) Session-chosen method: the articles selected in a user session.
(3) Query-result method: the articles listed in a query.
(4) Session-result method: the articles listed in queries of a user session.

For the query-chosen and session-chosen methods, article accesses present in the Web logs are grouped into a set of transactions. For the query-result and session-result methods, we construct transactions by reissuing queries to the literature digital library.
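As a rough illustration of the two log-based methods, the sketch below derives query-chosen and session-chosen transactions from one user session. The tuple layout and function names are assumptions for the example, as is the choice to drop repeated accesses to the same article within a transaction.

```python
from typing import List, Tuple

# (kind, target): kind is "lookup" or "article" (hypothetical representation)
Access = Tuple[str, str]


def query_chosen(session: List[Access]) -> List[List[str]]:
    """One transaction per query: the articles chosen after each lookup access."""
    transactions: List[List[str]] = []
    current: List[str] = []
    for kind, target in session:
        if kind == "lookup":
            # A new lookup closes the previous query; queries with no chosen
            # article produce no transaction.
            if current:
                transactions.append(current)
            current = []
        elif kind == "article" and target not in current:
            current.append(target)
    if current:
        transactions.append(current)
    return transactions


def session_chosen(session: List[Access]) -> List[str]:
    """One transaction per session: every article chosen anywhere in the session."""
    chosen: List[str] = []
    for kind, target in session:
        if kind == "article" and target not in chosen:
            chosen.append(target)
    return chosen
```

Article accesses that precede any lookup are treated as a query with an omitted lookup. The query-result and session-result methods are not covered, since they require reissuing queries to the digital library itself.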

As mentioned, the query-chosen and session-chosen methods incorporate knowledge on human selection in making the recommendations. We expect that they will yield more effective recommendations than their counterparts without such knowledge, namely the query-result and session-result methods. We therefore form the following hypothesis:

H1. Users tend to browse the metadata of only the articles they find of interest. Recommendation schemes that consider the Web accesses of these articles will result in more effective recommendations.

Mining the literature usage log

There have been several approaches proposed in the literature (Yan et al., 1996; Mobasher et al., 1999; Pitkow and Pirolli, 1999; Yang et al., 2001) for identifying aggregate usage profiles from Web usage logs. Aggregate usage profiles can be represented in the form of association rules, sequential patterns (or an n-gram Markov model), or clusters of Web pages. In the context of literature digital libraries, however, we decided not to consider sequential patterns, because the order of articles in a transaction may not relate to users' preferences. Instead, only association rules and clusters of articles will be discussed.

The problem of finding frequent associations between items in a transaction database, called the association-rule discovery problem, was first introduced by Agrawal et al. (1993). Association-rule discovery methods, such as the Apriori algorithm (Agrawal et al., 1993; Agrawal and Srikant, 1994), are typically based on two decompositions: extraction of itemsets and the generation of strong association rules. In the initial extraction phase, the methods find sets of items that frequently occur together. The support of each itemset $I = X \cup Y$, denoted as Sup(I), is the fraction of transactions containing both X and Y. This itemset I is referred to as a frequent itemset if Sup(I) exceeds a user-specified minimal support threshold, Minsup.

In the second phase, each discovered itemset $I = X \cup Y$ is used to construct association rules in the form $X \Rightarrow Y$. The confidence of each rule, denoted as Conf($X \Rightarrow Y$), is the fraction of transactions containing X that also contain Y. An association rule is said to be strong if its confidence exceeds a user-specified confidence threshold. By applying association-rule discovery algorithms to the transactions of the transformed logs, we can find association rules in the form $\{a_1, a_2, \ldots, a_m\} \Rightarrow \{a\}$, which can be used to recommend article a to users who have browsed $a_1, a_2, \ldots, a_m$ but not a.
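The two measures can be computed directly from their definitions. The sketch below is a brute-force illustration over a small in-memory list of transactions, not an Apriori implementation; the function names are chosen for the example.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(
    antecedent: Iterable[str],
    consequent: Iterable[str],
    transactions: List[FrozenSet[str]],
) -> float:
    """Fraction of transactions containing X that also contain Y, for the rule X => Y."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    containing_x = [t for t in transactions if x <= t]
    if not containing_x:
        return 0.0
    return sum(1 for t in containing_x if xy <= t) / len(containing_x)
```

For four transactions {a1, a2, a3}, {a1, a2}, {a1, a3} and {a2, a3}, the itemset {a1, a2} appears in two of four transactions, so its support is 0.5, and the rule {a1} => {a2} holds in two of the three transactions containing a1, giving a confidence of 2/3.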

The traditional approaches for identifying itemsets with a uniform minimum support threshold, however, cannot be directly applied, because articles that arrive later tend to have smaller support even if they are actually more popular. Therefore, a nonuniform support threshold scheme, originally proposed in Liu et al. (1999), is adopted. In this scheme, each item is assigned a distinct minimum support value (called the minimum item support, MIS), and the minimum support of an itemset is the minimum of the MIS values of its constituent items. In the literature recommendation system, we view the MIS value of an article as a function of its creation time. That is, articles that are added to the digital library more recently should be assigned smaller MIS values. Let N(t) be the number of transactions after time t in the Web usage logs. The MIS of an article a is defined as follows:

$$\mathrm{MIS}(a) = \begin{cases} M(a) & \text{if } M(a) > LS \\ LS & \text{otherwise} \end{cases}$$

$$M(a) = \frac{N(\mathrm{CreationTime}(a))}{N(0)} \cdot Minsup$$

where LS is the lower bound for support values, Minsup is the minimum support threshold based on the entire article browsing log, and N(0) denotes the total number of transactions in the Web usage log. Both LS and Minsup are user-defined constants. After assigning the MIS values of all articles in the literature digital library, the method proposed in Liu et al. (1999) can be applied to derive the association rules for articles.
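A small sketch of the MIS assignment may help. It assumes the definition MIS(a) = M(a) if M(a) > LS and LS otherwise, with M(a) = N(CreationTime(a)) / N(0) x Minsup; the function and parameter names, and the numbers in the usage note, are illustrative only.

```python
from typing import Iterable


def minimum_item_support(
    n_after_creation: int,  # N(CreationTime(a)): transactions logged after the article was added
    n_total: int,           # N(0): all transactions in the usage log
    minsup: float,          # Minsup: threshold for an article present for the whole log
    ls: float,              # LS: lower bound for support values
) -> float:
    """Scale Minsup by the share of the log the article could appear in, floored at LS."""
    m = n_after_creation / n_total * minsup
    return m if m > ls else ls


def itemset_minimum_support(mis_values: Iterable[float]) -> float:
    """The minimum support of an itemset is the smallest MIS among its items."""
    return min(mis_values)
```

With 10,000 transactions, Minsup = 0.002 and LS = 0.0005, an article present for the whole log keeps the full threshold of 0.002, whereas one added when only 1,000 transactions remained would get M(a) = 0.0002 and is therefore assigned the floor value 0.0005.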

For the clustering technique, we adopt the Association Rule Hypergraph Partitioning (ARHP) approach (Mobasher et al., 1999, 2000) rather than traditional clustering techniques. The main reason for this is that ARHP is more efficient in handling high-dimensional data, such as those present in literature digital libraries. The dimensions of a transaction are the set of articles, which is huge for a large-scale digital library. This approach starts with the identification of frequent itemsets (as in association-rule discovery methods), each of which contains articles often accessed together in transactions. Each such frequent itemset is then viewed as a hyperedge with a specific weight.

As mentioned, each article has a distinct creation time. We therefore normalise the support values of itemsets as follows before computing their weights:

$$Sup'(a_i) = \frac{N(0)}{N(\mathrm{CreationTime}(a_i))} \cdot Sup(a_i)$$

$$Sup'(a_1, \ldots, a_k) = \frac{N(0)}{N\left(\max_{1 \le i \le k} \mathrm{CreationTime}(a_i)\right)} \cdot Sup(a_1, \ldots, a_k)$$

There are several ways to define the weight of an itemset, such as using either the support for or the interest in the itemset. The former favours itemsets of smaller size, whereas the latter gives priority to larger itemsets. We define a general weighting formula that covers the broad spectrum between these two extremes. In addition, the supports for or interests in different itemsets can have very diverse values. To prevent itemsets of large weight from dominating the subsequent clustering procedure, we apply the logarithm on the weight. The following is our definition of the weight of an itemset $(a_1, a_2, \ldots, a_k)$:


$$weight(a_1, a_2, \ldots, a_k) = \log\left( \frac{Sup'(a_1, a_2, \ldots, a_k)}{\left[ Sup'(a_1) \cdot Sup'(a_2) \cdots Sup'(a_k) \right]^{\alpha}} \times \frac{1}{Minsup} \right)$$

where $0 \le \alpha \le 1$. Note that when $\alpha = 0$ or $\alpha = 1$, this formula is equivalent to using support or interest as the weight, respectively. 1/Minsup is a constant that keeps the weight non-negative. This definition supports our following hypotheses:

H2. A better recommendation effectiveness can be achieved by striking a balance between using support and interest as the weight of an itemset.

H3. A better recommendation effectiveness can be achieved by incorporating the logarithm function in the weight of an itemset.
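To make the role of $\alpha$ and of the logarithm concrete, the sketch below evaluates an itemset weight of the form log(Sup'(a_1, ..., a_k) / [Sup'(a_1) ... Sup'(a_k)]^alpha x 1/Minsup). The function names are invented for the example, the natural logarithm is an assumption (the text does not state the base), and the `use_log` switch exists only to mirror the with/without comparison behind H3.

```python
import math
from typing import Sequence


def normalised_support(sup: float, n_total: int, n_after_creation: int) -> float:
    """Sup' = N(0) / N(creation time) * Sup, compensating for late-arriving articles."""
    return n_total / n_after_creation * sup


def itemset_weight(
    sup_itemset: float,          # normalised support of the whole itemset
    sup_items: Sequence[float],  # normalised supports of the individual articles
    minsup: float,
    alpha: float,
    use_log: bool = True,
) -> float:
    """alpha = 0 weights by support, alpha = 1 by interest; values between blend the two."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    raw = sup_itemset / (math.prod(sup_items) ** alpha) / minsup
    return math.log(raw) if use_log else raw  # natural log is an assumption
```

For an itemset with normalised support 0.004, item supports 0.1 and 0.2 and Minsup = 0.002, the raw weight is 2 at $\alpha = 0$ and 100 at $\alpha = 1$. Taking logarithms compresses this fifty-fold spread, which is the effect the logarithm is meant to have on the subsequent clustering.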

After deciding the weight of each itemset, the hypergraph partitioning algorithm proposed in Karypis (2002) is applied to partition the set of articles into disjoint clusters of articles. Articles in the same cluster are more "similar" in the sense that they are more likely to be accessed together in the same transaction. To reflect the fact that an article may indeed interest more than one group of users, we adopt the same heuristic as used in Mobasher et al. (1999) by adding back articles to clusters, which results in overlapping clusters. Specifically, for a given hyperedge, if the percentage of involved vertices in a cluster is larger than a threshold, the other involved vertices are included in the same cluster.

Online recommendations

We propose two recommendation approaches that use the article association rules and article clusters obtained by the methods described above. The goal is to recommend the top-N articles that potentially interest the active user. The first approach makes use of article association rules. The idea is to treat each frequent itemset as the interest profile of a user group and to recommend articles based on the similarity between the current session of the active user and interest profiles of the relevant user groups. Specifically, let s be the active user's current session of length k. We first identify the set of frequent itemsets of size k + 1 that contain all elements in s and an extra element m (not in s). For each such itemset I, the confidence of the rule $I - \{m\} \Rightarrow \{m\}$ is calculated. These extra elements are then recommended to the user in descending order of confidence value. If these elements are not sufficient (i.e. there are fewer than N of them), we then search for frequent itemsets of size k that contain k - 1 elements in s and an extra element (not in s). Again, these extra elements are recommended to the user in descending order of confidence value. This procedure continues until N articles are recommended.
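The back-off procedure can be sketched as follows, assuming the frequent itemsets and their supports are already available as a dictionary. The function name, the tie-breaking by article identifier and the rule of keeping the highest confidence when an article is reachable through several itemsets are assumptions of this example, not details given in the text.

```python
from typing import Dict, FrozenSet, List, Set


def recommend_by_rules(
    session: Set[str],
    frequent: Dict[FrozenSet[str], float],  # frequent itemset -> support
    top_n: int,
) -> List[str]:
    """Match k session articles first, then k - 1, and so on, until top_n are found."""
    recommended: List[str] = []
    for matched in range(len(session), 0, -1):
        best: Dict[str, float] = {}
        for itemset, sup in frequent.items():
            if len(itemset) != matched + 1:
                continue
            extra = itemset - session
            if len(extra) != 1:
                continue                  # need exactly one article outside the session
            (m,) = extra
            antecedent = itemset - extra  # I - {m}, a subset of the session
            # With per-item minimum supports a subset need not be frequent, so guard.
            if m in recommended or antecedent not in frequent:
                continue
            conf = sup / frequent[antecedent]  # Conf(I - {m} => {m})
            best[m] = max(conf, best.get(m, 0.0))
        for m in sorted(best, key=lambda a: (-best[a], a)):
            recommended.append(m)
            if len(recommended) == top_n:
                return recommended
    return recommended
```

Each call scans every frequent itemset once per back-off level, so the cost of this naive version grows with both the number of frequent itemsets and the session length, in line with the running-time behaviour reported for the association-rule-based approach.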

Our other proposed method uses a hypergraph-based approach. In this approach, the recommendation score of each article a is computed by considering the similarity between the current user session and the clusters C to which a belongs, and the coherence weight of a with respect to C. Specifically, each cluster of articles can be viewed as a vector with binary elements, each of which indicates whether an article appears in the cluster. Similarly, the current user session can also be represented as a vector. Then, the similarity between the current session S and a cluster C can be defined as a cosine function as follows:

$$match(S, C) = \frac{\sum_k a^C_k \times S_k}{\sqrt{\sum_k (S_k)^2 \times \sum_k (a^C_k)^2}}$$

where $S_k$ is the k'th element in S and $a^C_k$ is the k'th element in C.

The coherence weight of an article a with respect to the cluster C that it belongs to is defined as:

$$weight(a, C) = \frac{\sum_{a \in e,\, e \subseteq C} weight(e)}{\sum_{e \subseteq C} weight(e)}$$

where weight(e) is the weight of a hyperedge e.

The recommendation score Rec(S, a) of an article a with respect to the current user session S is then defined as:

$$Rec(S, a) = \max_{a \in C} \sqrt{weight(a, C) \times match(S, C)}$$

The top-N articles for recommendation are those with the N highest values in the recommendation score.
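A compact sketch of this scoring scheme is given below. It assumes clusters and sessions are plain sets of article identifiers, so the cosine reduces to the overlap size divided by the square root of the product of the two set sizes, and it assumes Rec(S, a) is the maximum over clusters containing a of sqrt(weight(a, C) x match(S, C)). Excluding articles already in the session and breaking ties by identifier are choices made for the example.

```python
import math
from typing import Dict, FrozenSet, List, Set


def match(session: Set[str], cluster: Set[str]) -> float:
    """Cosine similarity between the binary vectors of a session and a cluster."""
    if not session or not cluster:
        return 0.0
    return len(session & cluster) / math.sqrt(len(session) * len(cluster))


def coherence(article: str, cluster: Set[str], edge_weights: Dict[FrozenSet[str], float]) -> float:
    """Share of the cluster's internal hyperedge weight carried by edges containing the article."""
    inside = {e: w for e, w in edge_weights.items() if e <= cluster}
    total = sum(inside.values())
    if total == 0:
        return 0.0
    return sum(w for e, w in inside.items() if article in e) / total


def recommend_by_clusters(
    session: Set[str],
    clusters: List[Set[str]],
    edge_weights: Dict[FrozenSet[str], float],  # hyperedge -> non-negative weight
    top_n: int,
) -> List[str]:
    """Score each unseen article by its best cluster and return the top_n."""
    scores: Dict[str, float] = {}
    for cluster in clusters:
        similarity = match(session, cluster)
        for article in cluster - session:  # assumption: do not re-recommend seen articles
            score = math.sqrt(coherence(article, cluster, edge_weights) * similarity)
            scores[article] = max(score, scores.get(article, 0.0))
    return sorted(scores, key=lambda a: (-scores[a], a))[:top_n]
```

The work per request depends on the number and size of the clusters rather than on the number of frequent itemsets, which may help explain why the running time of the hypergraph-based approach varies little with the window size or the minimum support.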

176

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

The hypergraph-based approach is morecarefully designed than theassociation-rule-based approach and weexpect the former to perform better

H4 The hypergraph-based approach willresult in more effectiverecommendations and has aquicker response time than theassociation-rule-based approach

Empirical evaluations

This section reports our experience inapplying the Web usage logs of NSYSU-ETD to the proposed literaturerecommendation system The main objectivewas to test our four hypotheses NSYSU-ETD runs on PC Solaris 27 and uses Apache139 as the Web server Since beingcommissioned in May 2000 it has beenloaded with more than 3000 electronictheses of National Sun Yat-sen UniversityUp to February 2003 these theses had beenbrowsed more than 400000 times anddownloaded more than 100000 times Weanalysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002for our experiments the data collected fromFebruary 1 to April 30 were designated as thetraining data set and those collected in Mayserved as the test data set

We first applied the data cleansingtechnique on the training data and obtained43349 lookup accesses and 41627 articleaccesses Applying the session identificationtechnique revealed 16922 user sessionsamong which 392 sessions were robotgenerated 6068 sessions contained only onearticle access and 5253 sessions containedno article accesses We eliminated these trivialuser sessions and applied transactionidentification techniques resulting in 5617transactions for the query-chosen method5272 transactions for the session-chosenmethod and 17742 transactions for thequery-result method Queries whose resultsare never chosen by the users are removedfrom the query-chosen method but remain inthe query-result method The session-resultmethod produced transactions of huge sizeeach containing thousands of article accessesWe therefore decided not to consider thismethod in the subsequent experiments

The two proposed methods for miningliterature usage logs both require the

identification of frequent itemsets fromtransactions which needs the minimumsupport to be specified However the threetransaction identification methods undercomparison have different numbers oftransactions To be fair we specify adifferent minimum support threshold foreach method such that the total number ofarticles involved in large two-item sets ofeach method - called recommendablearticles - is approximately the same Ourrecommendation framework recommends anarticle only if it is associated with otherarticles a sufficient number of times in Webusage log Therefore articles that are notinvolved in large two-item sets cannotpossibly be recommended Table I shows thespecified minimum support threshold andthe number of recommendable articles ofeach method

To illustrate how we conducted experiments, we define the following notation: let $T_{eval}$ be the set of transactions in the test set, $t_{eval}$ be a transaction in $T_{eval}$, and $a_t(i)$ be the $i$th article in $t_{eval}$. Given a window size $W_{size}$, we divide each transaction $t_{eval}$ in the test data set into two lists, $t_{eval}[W]$ and $t_{eval}[R]$, where $t_{eval}[W]$ is the first $W_{size}$ article accesses of $t_{eval}$ and $t_{eval}[R]$ is the remaining articles. By treating $t_{eval}[W]$ as the current session, the recommender system will choose the set $t_{pr}$ of top-N articles for recommendation.

The performance metric we adopted for measuring the quality of recommendation is the precision and recall scheme. The precision is the ratio of the number of recommended articles accessed by a user to the total number of recommended articles, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{pr}|$, and recall is the ratio of the number of recommended articles accessed by a user to the total number of articles of interest to the user, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{eval}[R]|$. The precision (recall) of a recommendation approach is the average precision (recall) of all transactions in the test set.
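The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names, the representation of a transaction as a list of article identifiers, and the decision to skip transactions with an empty remainder are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Set, Tuple


def split_transaction(t_eval: Sequence[str], w_size: int) -> Tuple[List[str], List[str]]:
    """Split a test transaction into the window t_eval[W] and the remainder t_eval[R]."""
    return list(t_eval[:w_size]), list(t_eval[w_size:])


def precision_recall(t_pr: Set[str], remainder: Sequence[str]) -> Tuple[float, float]:
    """Precision and recall of one recommendation set t_pr against t_eval[R]."""
    relevant = set(remainder)
    hits = len(t_pr & relevant)
    precision = hits / len(t_pr) if t_pr else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(
    test_set: Sequence[Sequence[str]],
    recommend: Callable[[List[str], int], Sequence[str]],
    w_size: int,
    top_n: int,
) -> Tuple[float, float]:
    """Average precision and recall over the test transactions."""
    scores = []
    for t_eval in test_set:
        window, remainder = split_transaction(t_eval, w_size)
        if not remainder:
            # Assumption: a transaction with nothing left to predict is skipped.
            continue
        t_pr = set(recommend(window, top_n))
        scores.append(precision_recall(t_pr, remainder))
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n
```

Here `recommend` stands for either recommendation approach; any callable that maps a current session and N to a list of article identifiers can be plugged in.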

Test of H1

We first evaluate the performance impact of the three transaction identification methods. In this experiment, $\alpha$ was set to 0.5 and the logarithm was taken when computing the weight of an itemset. Figure 3(a, b) shows the precisions and recalls, respectively, under association-rule-based recommendation.


Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

The precisions and recalls under hypergraph-based recommendation are shown in Figure 4(a, b), respectively.

Overall, the trends are the same for the association-rule-based and hypergraph-based approaches. It can be clearly seen that both the query-chosen and session-chosen methods outperform the query-result method in terms of both precision and recall. This implies that the information about articles browsed plays a crucial role in making recommendations - both the query-chosen and session-chosen methods incorporate this information in forming transactions. We therefore accept H1. However, the performance difference between the query-chosen and session-chosen methods is not significant.

Figure 3 (a) Precisions of the association-based approach; (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Method                   Minimum support (per cent)   No. of recommended articles
Session-chosen method    0.16                         253
Query-chosen method      0.12                         250
Query-result method      3.3                          229

Test of H2 and H3

We then conducted experiments to shed light on the impact of $\alpha$ and the logarithmic function in the weight definition for the hypergraph-based recommendation approach. Figure 5(a, b) shows the precision and recall values when $\alpha$ = 0, 0.5 and 1, using the session-chosen method for transaction identification. The window size was set at 2. As can be seen, differences in the precision and recall under different settings of $\alpha$ are very small. We have performed the same experiments for different transaction identification methods and window sizes, and obtained similar results. H2 is therefore rejected.

We then computed the precisions and recalls with and without application of the logarithmic function on the itemset weight. Figure 6(a, b) shows the resulting precisions and recalls, respectively.

Figure 6 shows that applying the logarithmic function on the itemset weight definition achieves significantly better precision and recall values. This meets our expectation, and H3 is accepted.

Figure 5 (a) Impact of $\alpha$ on precisions and (b) impact of $\alpha$ on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Test of H4

Finally, we evaluated the impact of association-rule-based and hypergraph-based recommendation approaches for different window sizes. Figure 7(a, b) shows the precisions and recalls for window sizes of 2, 3, 4, 5 and 6 for a top-15 recommendation. It can be seen that the hypergraph-based approach performs better than the association-rule-based approach, especially for larger window sizes.

We also compared the running times of both approaches whilst setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a result of an increased window size), the association-rule-based approach incurs a larger running time.

Overall, we conclude that the hypergraph-based approach is more attractive, since it yields better-quality article recommendation and has a more consistent running time. Thus, H4 is accepted.

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values

Conclusions

In this paper we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:
(1) data preparation of the Web logs;
(2) usage log mining; and
(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches - association-rule based and hypergraph based - for making recommendations. These alternatives and approaches were evaluated using the Web usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification, and that the hypergraph-based approach yields better-quality article recommendation and exhibits a more consistent running time, and thus is more scalable.

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.

References

Agrawal, R., Imielinski, T. and Swami, A. (1993), "Mining association rules between sets of items in large databases", in Buneman, P. and Jajodia, S. (Eds), Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 26-28, ACM Press, New York, NY, pp. 207-16.

Agrawal, R. and Srikant, R. (1994), "Fast algorithms for mining association rules", in Bocca, J.B., Jarke, M. and Zaniolo, C. (Eds), Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, Santiago, Chile, Morgan Kaufmann, San Francisco, CA, pp. 487-99.

Alspector, J., Kolcz, A. and Karunanithi, N. (1998), "Comparing feature-based and clique-based user models for movie selection", Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, ACM Press, New York, NY, pp. 11-18.

Andresen, D., Carver, L., Dolin, R., Fischer, C., Frew, J., Goodchild, M., Ibarra, O., Kothuri, R., Larsgaard, M., Nebert, D., Simpson, J., Smith, T., Yang, T. and Zheng, Q. (1995), "The WWW prototype of the Alexandria digital library", Proceedings of the International Symposium on Digital Libraries, Tsukuba, Japan, 22-5 August, pp. 17-27.

Ansari, A., Essegaier, S. and Kohli, R. (2000), "Internet recommendation systems", Journal of Marketing Research, Vol. 37 No. 3, pp. 67-85.

Arms, W. (2000), Digital Libraries, MIT Press, Cambridge, MA.

Armstrong, R., Freitag, D., Joachims, T. and Mitchell, T. (1997), "WebWatcher: a learning apprentice for the World Wide Web", AAAI Spring Symposium on Information Gathering from Heterogeneous Distributed Environments, pp. 6-12.

Balabanović, M. and Shoham, Y. (1997), "Fab: content-based collaborative recommendation", Communications of the ACM, Vol. 40 No. 3, pp. 66-72.

Baldonado, M., Chang, C., Gravano, L. and Paepcke, A. (1997), "Metadata for digital libraries: architecture and design rationale", Proceedings of the 2nd ACM International Conference on Digital Libraries, ACM Press, New York, NY, pp. 47-56.

Basu, C., Hirsh, H. and Cohen, W. (1998), "Recommendation as classification: using social and content-based information in recommendation", Proceedings of the 15th National Conference on Artificial Intelligence, 26-30 July, Madison, WI, AAAI Press, Menlo Park, CA, pp. 714-20.

Billsus, D. and Pazzani, M. (1999), "A hybrid user model for news story classification", in Kay, J. (Ed.), Proceedings of the 7th International Conference on User Modelling, Banff, Canada, 20-4 June, Springer-Verlag, New York, NY, pp. 99-108.

Bowman, C., Manber, P. and Schwartz, U. (1994), "Scalable Internet resources discovery: research problems and approaches", Communications of the ACM, Vol. 37 No. 8, pp. 98-107.

Breese, J., Heckerman, D. and Kadie, C. (1998), "Empirical analysis of predictive algorithms for collaborative filtering", Technical Report MSR-TR-98-12, Microsoft Research, Seattle, CA.

Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A. and Lin, C. (1996), "A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois digital library initiative project", IEEE Transactions on PAMI, Vol. 18 No. 8, pp. 17-34.

Cooley, R., Mobasher, B. and Srivastava, J. (1999), "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1 No. 1, pp. 5-32.

Furner, J. (2002), "On recommending", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 747-63.

Goldberg, D., Nichols, D., Oki, B. and Terry, D. (1992), "Using collaborative filtering to weave an information tapestry", Communications of the ACM, Vol. 35 No. 12, pp. 61-70.

Herlocker, J. and Konstan, J. (2001), "Content-independent task-focused recommendation", IEEE Internet Computing, Vol. 5 No. 6, pp. 40-7.

Karypis, G. (2002), "Multilevel hypergraph partitioning", Tech. Report TR02-25, Department of Computer Science, University of Minnesota, MN.

Kessler, J. (1996), Internet Digital Libraries: The International Dimension, Artech House Publishers, Norwood, MA.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997), "GroupLens: applying collaborative filtering to Usenet news", Communications of the ACM, Vol. 40 No. 3, pp. 77-87.


Lang, K. (1995), "Newsweeder: learning to filter netnews", in Prieditis, A. and Russell, S. (Eds), Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, Morgan Kaufmann, San Francisco, CA, pp. 331-9.

Liu, B., Hsu, W. and Ma, Y. (1999), "Mining association rules with multiple minimum supports", Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 15-18 August, San Diego, ACM Press, New York, NY, pp. 430-4.

Loeb, S. and Terry, D. (1992), "Information filtering", Communications of the ACM, Special Issue on Information Filtering, Vol. 35 No. 12, pp. 26-8.

Mobasher, B., Cooley, R. and Srivastava, J. (1999), "Creating adaptive Web sites through usage-based clustering of URLs", Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop.

Mobasher, B., Dai, H., Luo, T., Nakagawa, M. and Wiltshire, J. (2000), "Discovery of aggregate usage profiles for Web personalisation", Proceedings of the webKDD Workshop.

Mooney, R. and Roy, L. (2000), "Content-based book recommending using learning for text categorisation", Proceedings of the 5th ACM Conference on Digital Libraries, San Antonio, ACM Press, New York, NY, pp. 195-204.

Pazzani, M. and Billsus, D. (1997), "Learning and revising user profiles: the identification of interesting Web sites", Machine Learning, Vol. 27 No. 4, pp. 313-31.

Pazzani, M. (1999), "A framework for collaborative, content-based and demographic filtering", Artificial Intelligence Review, Vol. 13 No. 5/6, pp. 393-408.

Pennock, D., Horvitz, E., Lawrence, S. and Giles, C. (2000), "Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach", Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, 30 June-3 July, Morgan Kaufmann, San Francisco, CA, pp. 473-80.

Pitkow, J. and Pirolli, P. (1999), "Mining longest repeating subsequences to predict World Wide Web surfing", Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 139-150.

Schafer, J., Konstan, J. and Riedl, J. (2001), "E-commerce recommendation applications", Data Mining and Knowledge Discovery, Vol. 5 No. 1, pp. 10-22.

Shardanand, U. and Maes, P. (1995), "Social information filtering: algorithms for automating 'word of mouth'", Proceedings of the Conference on Human Factors in Computing Systems, Denver, CO, ACM Press, New York, NY, pp. 210-17.

Spink, A., Wilson, T., Ford, N., Foster, A. and Ellis, D. (2002), "Information seeking and mediated searching", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 695-703.

Srivastava, J., Cooley, R., Deshpande, M. and Tang, P. (2000), "Web usage mining: discovery and applications of usage patterns from Web data", SIGKDD Explorations, Vol. 1 No. 2, pp. 12-23.

Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. (1997), "PHOAKS: a system for sharing recommendations", Communications of the ACM, Vol. 40 No. 3, pp. 59-62.

Wilensky, R. (1996), "Toward work-centred digital information services", IEEE Computer, Vol. 29 No. 5, pp. 7-44.

Yan, T., Jacobsen, M., Molina, H. and Dayal, U. (1996), "From user access patterns to dynamic hypertext linking", Proceedings of the 5th International World Wide Web Conference, pp. 1007-14.

Yang, Q., Zhang, H.H. and Li, T. (2001), "Mining Web logs for prediction models in WWW caching and prefetching", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-8.



such as the Apriori algorithm (Agrawal et al., 1993; Agrawal and Srikant, 1994), are typically based on two decompositions: extraction of itemsets and the generation of strong association rules. In the initial extraction phase, the methods find sets of items that frequently occur together. The support of each itemset $I = X \cup Y$, denoted as Sup(I), is the fraction of transactions containing both X and Y. This itemset I is referred to as a frequent itemset if Sup(I) exceeds a user-specified minimal support threshold, Minsup.

In the second phase, each discovered itemset $I = X \cup Y$ is used to construct association rules in the form $X \Rightarrow Y$. The confidence of each rule, denoted as $\mathrm{Conf}(X \Rightarrow Y)$, is the fraction of transactions containing X that also contain Y. An association rule is said to be strong if its confidence exceeds a user-specified confidence threshold. By applying association-rule discovery algorithms to the transactions of the transformed logs, we can find association rules in the form $\{a_1, a_2, \ldots, a_m\} \Rightarrow a$, which can be used to recommend article $a$ to users who have browsed $a_1, a_2, \ldots, a_m$ but not $a$.
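The two measures defined above can be illustrated with a short sketch. It assumes that each transaction is a set of article identifiers; the function names are illustrative and are not taken from the system described in this paper.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Sup(I): fraction of transactions that contain every item of I."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Conf(X => Y): fraction of transactions containing X that also contain Y."""
    x_set, y_set = frozenset(x), frozenset(y)
    sup_x = support(x_set, transactions)
    if sup_x == 0:
        return 0.0
    return support(x_set | y_set, transactions) / sup_x
```

For example, if article a1 occurs in three of four transactions and a1 and a2 occur together in two of them, then Sup({a1, a2}) is 0.5 and Conf({a1} => {a2}) is two-thirds.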

The traditional approaches for identifying itemsets with a uniform minimum support threshold, however, cannot be directly applied, because articles that arrive later tend to have smaller support even if they are actually more popular. Therefore, a nonuniform support threshold scheme, originally proposed in Liu et al. (1999), is adopted. In this scheme, each item is assigned a distinct minimum support value (called the minimum item support, MIS), and the minimum support of an itemset is the minimum of the MIS values of its constituent items. In the literature recommendation system, we view the MIS value of an article as a function of its creation time. That is, articles that are added to the digital library more recently should be assigned smaller MIS values. Let N(t) be the number of transactions after time t in the Web usage logs. The MIS of an article $a$ is defined as follows:

$$\mathrm{MIS}(a) = \begin{cases} M(a) & \text{if } M(a) > LS \\ LS & \text{otherwise} \end{cases}$$

$$M(a) = \frac{N(\mathrm{CreationTime}(a))}{N(0)} \cdot \mathit{Minsup}$$

where LS is the lower bound for support values, Minsup is the minimum support threshold based on the entire article browsing log, and N(0) denotes the total number of transactions in the Web usage log. Both LS and Minsup are user-defined constants. After assigning the MIS values of all articles in the literature digital library, the method proposed in Liu et al. (1999) can be applied to derive the association rules for articles.
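As a small worked sketch of the MIS definition above, the following assumes that the transaction counts N(CreationTime(a)) and N(0) have already been obtained from the log; the function and parameter names, and the numbers in the usage note, are hypothetical.

```python
from typing import Iterable


def minimum_item_support(n_after_creation: int, n_total: int, minsup: float, ls: float) -> float:
    """MIS(a) = M(a) if M(a) > LS, otherwise LS.

    n_after_creation is N(CreationTime(a)) and n_total is N(0).
    """
    m = n_after_creation / n_total * minsup
    return m if m > ls else ls


def itemset_minimum_support(mis_values: Iterable[float]) -> float:
    """The minimum support of an itemset is the smallest MIS of its items."""
    return min(mis_values)
```

With 10,000 transactions in total, Minsup = 0.002 and LS = 0.0005, an article present from the start keeps the full threshold of 0.002, whereas an article followed by only 1,000 transactions would get M(a) = 0.0002 and is therefore assigned the lower bound 0.0005.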

For the clustering technique, we adopt the Association Rule Hypergraph Partitioning (ARHP) approach (Mobasher et al., 1999, 2000) rather than traditional clustering techniques. The main reason for this is that ARHP is more efficient in handling high-dimensional data such as those present in literature digital libraries. The dimensions of a transaction are the set of articles, which is huge for a large-scale digital library. This approach starts with the identification of frequent itemsets (as in association-rule discovery methods), each of which contains articles often accessed together in transactions. Each such frequent itemset is then viewed as a hyperedge with a specific weight.

As mentioned, each article has a distinct creation time. We therefore normalise the support values of itemsets as follows before computing their weights:

$$\mathrm{Sup}'(a_i) = \frac{N(0)}{N(\mathrm{CreationTime}(a_i))} \cdot \mathrm{Sup}(a_i)$$

$$\mathrm{Sup}'(a_1, \ldots, a_k) = \frac{N(0)}{N\left(\max_{1 \le i \le k} \mathrm{CreationTime}(a_i)\right)} \cdot \mathrm{Sup}(a_1, \ldots, a_k)$$

There are several ways to define the weight of an itemset, such as using either the support for or the interest in the itemset. The former favours itemsets of smaller size, whereas the latter gives priority to larger itemsets. We define a general weighting formula that covers the broad spectrum between these two extremes. In addition, the supports for or interests in different itemsets can have very diverse values. To prevent itemsets of large weight from dominating the subsequent clustering procedure, we apply the logarithm on the weight. The following is our definition of the weight of an itemset $\{a_1, a_2, \ldots, a_k\}$:


$$\mathrm{weight}(a_1, a_2, \ldots, a_k) = \log\left( \frac{\mathrm{Sup}'(a_1, a_2, \ldots, a_k)}{\left[ \mathrm{Sup}'(a_1) \cdot \mathrm{Sup}'(a_2) \cdots \mathrm{Sup}'(a_k) \right]^{\alpha}} \times \frac{1}{\mathit{Minsup}} \right)$$

where $0 \le \alpha \le 1$. Note that when $\alpha = 0$ or $\alpha = 1$, this formula is equivalent to using support or interest as the weight, respectively. 1/Minsup is a constant that keeps the weight non-negative. This definition supports our following hypotheses:

H2. A better recommendation effectiveness can be achieved by striking a balance between using support and interest as the weight of an itemset.

H3. A better recommendation effectiveness can be achieved by incorporating the logarithm function in the weight of an itemset.
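The role of $\alpha$ and of the logarithm in the weight of an itemset can be made concrete with a brief sketch. It takes already normalised supports as input; the function names, the `use_log` switch (which mirrors the with/without-logarithm comparison of H3) and the choice of the natural logarithm are assumptions, since the base of the logarithm is not stated here.

```python
import math
from typing import Sequence


def normalised_support(sup: float, n_total: int, n_after_creation: int) -> float:
    """Sup' = N(0) / N(creation time) * Sup."""
    return n_total / n_after_creation * sup


def itemset_weight(
    sup_itemset: float,
    sup_items: Sequence[float],
    alpha: float,
    minsup: float,
    use_log: bool = True,
) -> float:
    """Weight of an itemset from its normalised support and those of its items.

    alpha = 0 reduces to a support-based weight, alpha = 1 to an interest-based one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    denominator = math.prod(sup_items) ** alpha
    raw = sup_itemset / denominator / minsup
    # Assumption: natural logarithm.
    return math.log(raw) if use_log else raw
```

For an itemset with normalised support 0.02 whose two articles each have normalised support 0.1, and Minsup = 0.01, the weight is log(2) at $\alpha = 0$ and log(200) at $\alpha = 1$; intermediate values of $\alpha$ interpolate between the two.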

After deciding the weight of each itemset, the hypergraph partitioning algorithm proposed in Karypis (2002) is applied to partition the set of articles into disjoint clusters of articles. Articles in the same cluster are more "similar" in the sense that they are more likely to be accessed together in the same transaction. To reflect the fact that an article may indeed interest more than one group of users, we adopt the same heuristic as used in Mobasher et al. (1999) by adding back articles to clusters, which results in overlapping clusters. Specifically, for a given hyperedge, if the percentage of involved vertices in a cluster is larger than a threshold, the other involved vertices are included in the same cluster.

Online recommendations

We propose two recommendation approaches that use the article association rules and article clusters obtained by the methods described above. The goal is to recommend the top-N articles that potentially interest the active user. The first approach makes use of article association rules. The idea is to treat each frequent itemset as the interest profile of a user group and to recommend articles based on the similarity between the current session of the active user and interest profiles of the relevant user groups. Specifically, let $s$ be the active user's current session of length $k$. We first identify the set of frequent itemsets of size $k + 1$ that contain all elements in $s$ and an extra element $m$ (not in $s$). For each such itemset $I$, the confidence of the rule $I - \{m\} \Rightarrow \{m\}$ is calculated. These extra elements are then recommended to the user in descending order of confidence value. If these elements are not sufficient (i.e. there are fewer than N of them), we then search for frequent itemsets of size $k$ that contain $k - 1$ elements in $s$ and an extra element (not in $s$). Again, these extra elements are recommended to the user in descending order of confidence value. This procedure continues until N articles are recommended.
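The back-off procedure described above can be sketched as follows. This is an illustrative reading rather than the implementation evaluated in this paper: it assumes the frequent itemsets are available as a dictionary from itemset to support, skips antecedents whose support is not in that dictionary, and breaks ties between equal confidences alphabetically. The name `recommend_by_rules` is hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence


def recommend_by_rules(
    session: Sequence[str],
    frequent: Dict[FrozenSet[str], float],
    top_n: int,
) -> List[str]:
    """Recommend up to top_n articles, matching ever smaller parts of the session."""
    if top_n <= 0:
        return []
    s = list(dict.fromkeys(session))  # drop repeated accesses, keep order
    seen = set(s)
    recommended: List[str] = []
    # First match all k session articles, then k - 1 of them, and so on.
    for size in range(len(s), 0, -1):
        best: Dict[str, float] = {}  # extra article -> best confidence at this level
        for antecedent in combinations(s, size):
            ante = frozenset(antecedent)
            sup_ante = frequent.get(ante)
            if not sup_ante:
                continue  # assumption: antecedents without a recorded support are skipped
            for itemset, sup in frequent.items():
                if len(itemset) == size + 1 and ante < itemset:
                    (m,) = itemset - ante  # the single extra element
                    if m in seen or m in recommended:
                        continue
                    best[m] = max(best.get(m, 0.0), sup / sup_ante)
        for m, _ in sorted(best.items(), key=lambda kv: (-kv[1], kv[0])):
            recommended.append(m)
            if len(recommended) == top_n:
                return recommended
    return recommended
```

The nested scan over all frequent itemsets at each level also suggests why the running time of this approach grows with both the session length and the number of frequent itemsets.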

Our other proposed method uses a hypergraph-based approach. In this approach, the recommendation score of each article $a$ is computed by considering the similarity between the current user session and the clusters $C$ to which $a$ belongs, and the coherence weight of $a$ with respect to $C$. Specifically, each cluster of articles can be viewed as a vector with binary elements, each of which indicates whether an article appears in the cluster. Similarly, the current user session can also be represented as a vector. Then the similarity between the current session $S$ and a cluster $C$ can be defined as a cosine function as follows:

$$\mathrm{match}(S, C) = \frac{\sum_k a^C_k \times S_k}{\sqrt{\sum_k (S_k)^2 \times \sum_k (a^C_k)^2}}$$

where $S_k$ is the $k$th element in $S$ and $a^C_k$ is the $k$th element in $C$.

The coherence weight of an article $a$ with respect to the cluster $C$ that it belongs to is defined as:

$$\mathrm{weight}(a, C) = \frac{\sum_{a \in e,\, e \subseteq C} \mathrm{weight}(e)}{\sum_{e \subseteq C} \mathrm{weight}(e)}$$

where weight(e) is the weight of a hyperedge $e$.

The recommendation score Rec(S, a) of an article $a$ with respect to the current user session $S$ is then defined as:

$$\mathrm{Rec}(S, a) = \max_{a \in C} \sqrt{\mathrm{match}(S, C) \times \mathrm{weight}(a, C)}$$

The top-N articles for recommendation are those with the N highest values in the recommendation score.
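A compact sketch of the three quantities above is given below. It assumes clusters and hyperedges are held as sets of article identifiers, that hyperedge weights are non-negative, and that articles already in the current session are not recommended again; all names are illustrative. For binary vectors, the cosine reduces to the size of the intersection divided by the square root of the product of the two set sizes, which is what `match` computes.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Set


def match(session: Set[str], cluster: FrozenSet[str]) -> float:
    """Cosine similarity between the binary vectors of a session and a cluster."""
    if not session or not cluster:
        return 0.0
    return len(session & cluster) / math.sqrt(len(session) * len(cluster))


def coherence_weight(article: str, cluster: FrozenSet[str],
                     hyperedges: Dict[FrozenSet[str], float]) -> float:
    """Share of the cluster's hyperedge weight carried by edges containing the article."""
    inside = [(e, w) for e, w in hyperedges.items() if e <= cluster]
    total = sum(w for _, w in inside)
    if total == 0:
        return 0.0
    return sum(w for e, w in inside if article in e) / total


def recommend_by_clusters(session: Sequence[str], clusters: List[FrozenSet[str]],
                          hyperedges: Dict[FrozenSet[str], float], top_n: int) -> List[str]:
    """Rank articles by the maximum of sqrt(match * weight) over their clusters."""
    s = set(session)
    scores: Dict[str, float] = {}
    for cluster in clusters:
        m = match(s, cluster)
        for article in cluster - s:
            score = math.sqrt(m * coherence_weight(article, cluster, hyperedges))
            scores[article] = max(scores.get(article, 0.0), score)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [a for a, _ in ranked[:top_n]]
```

Because the clusters and their hyperedge weights are computed offline, the online cost in this sketch depends on the number and size of clusters rather than on the number of frequent itemsets.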


The hypergraph-based approach is more carefully designed than the association-rule-based approach, and we expect the former to perform better.

H4. The hypergraph-based approach will result in more effective recommendations and has a quicker response time than the association-rule-based approach.

Empirical evaluations

This section reports our experience inapplying the Web usage logs of NSYSU-ETD to the proposed literaturerecommendation system The main objectivewas to test our four hypotheses NSYSU-ETD runs on PC Solaris 27 and uses Apache139 as the Web server Since beingcommissioned in May 2000 it has beenloaded with more than 3000 electronictheses of National Sun Yat-sen UniversityUp to February 2003 these theses had beenbrowsed more than 400000 times anddownloaded more than 100000 times Weanalysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002for our experiments the data collected fromFebruary 1 to April 30 were designated as thetraining data set and those collected in Mayserved as the test data set

We first applied the data cleansingtechnique on the training data and obtained43349 lookup accesses and 41627 articleaccesses Applying the session identificationtechnique revealed 16922 user sessionsamong which 392 sessions were robotgenerated 6068 sessions contained only onearticle access and 5253 sessions containedno article accesses We eliminated these trivialuser sessions and applied transactionidentification techniques resulting in 5617transactions for the query-chosen method5272 transactions for the session-chosenmethod and 17742 transactions for thequery-result method Queries whose resultsare never chosen by the users are removedfrom the query-chosen method but remain inthe query-result method The session-resultmethod produced transactions of huge sizeeach containing thousands of article accessesWe therefore decided not to consider thismethod in the subsequent experiments

The two proposed methods for miningliterature usage logs both require the

identification of frequent itemsets fromtransactions which needs the minimumsupport to be specified However the threetransaction identification methods undercomparison have different numbers oftransactions To be fair we specify adifferent minimum support threshold foreach method such that the total number ofarticles involved in large two-item sets ofeach method - called recommendablearticles - is approximately the same Ourrecommendation framework recommends anarticle only if it is associated with otherarticles a sufficient number of times in Webusage log Therefore articles that are notinvolved in large two-item sets cannotpossibly be recommended Table I shows thespecified minimum support threshold andthe number of recommendable articles ofeach method

To illustrate how we conductedexperiments we define the followingnotation let Teval be the set of transactions inthe test set teval be a transaction in Teval andat(i) be the irsquoth article in teval Given a windowsize Wsize we divide each transaction teval inthe test data set into two lists teval[W] andteval[R] where teval[W] is the first Wsize articleaccesses of teval and teval[R] is the remainingarticles By treating teval[W] as the currentsession the recommender system will choosethe set tpr of top-N articles forrecommendation

The performance metric we adopted formeasuring the quality of recommendation isthe precision and recall scheme Theprecision is the ratio of the number ofrecommended articles accessed by a user tothe total number of recommended articlesdefined as tpr teval permilRŠ=tpr and recall is theratio of the number of recommended articlesaccessed by a user to the total number ofarticles of interest to the user defined astpr tevalpermilRŠ=teval permilRŠ The precision (recall) of arecommendation approach is the averageprecision (recall) of all transactions in the testset

Test of H1We first evaluate the performance impact ofthe three transaction identification methodsIn this experiment not was set to be 05 and thelogarithm was taken when computing theweight of an itemset Figure 3(a b) shows theprecisions and recalls respectively underassociation-rule-based recommendation

177

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

The precisions and recalls underhypergraph-based recommendation areshown in Figure 4(a b) respectively

Overall the trends are the same for theassociation-rule-based and hypergraph-basedapproaches It can be clearly seen that boththe query-chosen and session-chosenmethods outperform the query-resultmethod in terms of both precision and recallThis implies that the information aboutarticles browsed plays a crucial role in making

recommendations - both the query-chosenand session-chosen methods incorporate thisinformation in forming transactions Wetherefore accept H1 However theperformance difference between thequery-chosen and session-chosen methods isnot significant

Test of H2 and H3We then conducted experiments to shed lighton the impact of not and the logarithmic

Figure 3 (a) Precisions of the association-based approach (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification

methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Minimum support (per cent) No of recommended articles

Session-chosen method 016 253

Query-chosen method 012 250

Query-result method 33 229

178

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

function in the weight definition for thehypergraph-based recommendationapproach Figure 5(a b) shows the precisionand recall values when not = 0 05 and 1 usingthe session-chosen method for transactionidentification The window size was set at 2As can be seen differences in the precisionand recall under different settings of not arevery small We have performed the sameexperiments for different transactionidentification methods and window sizes andobtained similar results H2 is thereforerejected

We then computed the precisions andrecalls with and without application of thelogarithmic function on the itemset weightFigure 6(a b) shows the resulting precisionsand recalls respectively

Figure 6 shows that applying thelogarithmic function on the itemset weightdefinition achieves significantly betterprecision and recall values This meets ourexpectation and H3 is accepted

Test of H4Finally we evaluated the impact ofassociation-rule-based and hypergraph-basedrecommendation approaches for different

window sizes Figure 7(a b) shows theprecisions and recalls for window sizes of 2 34 5 and 6 for a top-15 recommendation Itcan be seen that the hypergraph-basedapproach performs better than theassociation-rule-based approach especiallyfor larger window sizes

We also compared the running times of both approaches whilst setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a result of an increased window size), the association-rule-based approach incurs a larger running time.

Figure 5 (a) Impact of α on precisions and (b) impact of α on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang, Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27, Number 3, 2003, 169-182

Overall, we conclude that the hypergraph-based approach is more attractive, since it yields better-quality article recommendation and has a more consistent running time. Thus, H4 is accepted.

Conclusions

In this paper, we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:
(1) data preparation of the Web logs;
(2) usage log mining; and
(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches - association-rule-based and hypergraph-based - for making recommendations. These alternatives and approaches were evaluated using the Web usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification, and that the hypergraph-based approach yields better-quality article recommendation and exhibits a more consistent running time, and thus is more scalable.

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.

References

Agrawal, R., Imielinski, T. and Swami, A. (1993), "Mining association rules between sets of items in large databases", in Buneman, P. and Jajodia, S. (Eds), Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 26-28, ACM Press, New York, NY, pp. 207-16.

Agrawal, R. and Srikant, R. (1994), "Fast algorithms for mining association rules", in Bocca, J.B., Jarke, M. and Zaniolo, C. (Eds), Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, Santiago, Chile, Morgan Kaufmann, San Francisco, CA, pp. 487-99.

Alspector, J., Kolcz, A. and Karunanithi, N. (1998), "Comparing feature-based and clique-based user models for movie selection", Proceedings of the 3rd ACM International Conference on Digital Libraries, June 23-26, 1998, Pittsburgh, PA, ACM Press, New York, NY, pp. 11-18.

Andresen, D., Carver, L., Dolin, R., Fischer, C., Frew, J., Goodchild, M., Ibarra, O., Kothuri, R., Larsgaard, M., Nebert, D., Simpson, J., Smith, T., Yang, T. and Zheng, Q. (1995), "The WWW prototype of the Alexandria digital library", Proceedings of the International Symposium on Digital Libraries, Tsukuba, Japan, 22-5 August, pp. 17-27.

Ansari, A., Essegaier, S. and Kohli, R. (2000), "Internet recommendation systems", Journal of Marketing Research, Vol. 37 No. 3, pp. 67-85.

Arms, W. (2000), Digital Libraries, MIT Press, Cambridge, MA.

Armstrong, R., Freitag, D., Joachims, T. and Mitchell, T. (1997), "WebWatcher: a learning apprentice for the World Wide Web", AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, pp. 6-12.

Balabanović, M. and Shoham, Y. (1997), "Fab: content-based, collaborative recommendation", Communications of the ACM, Vol. 40 No. 3, pp. 66-72.

Baldonado, M., Chang, C., Gravano, L. and Paepcke, A. (1997), "Metadata for digital libraries: architecture and design rationale", Proceedings of the 2nd ACM International Conference on Digital Libraries, ACM Press, New York, NY, pp. 47-56.

Basu, C., Hirsh, H. and Cohen, W. (1998), "Recommendation as classification: using social and content-based information in recommendation", Proceedings of the 15th National Conference on Artificial Intelligence, 26-30 July, Madison, WI, AAAI Press, Menlo Park, CA, pp. 714-20.

Billsus, D. and Pazzani, M. (1999), "A hybrid user model for news story classification", in Kay, J. (Ed.), Proceedings of the 7th International Conference on User Modelling, Banff, Canada, 20-4 June, Springer-Verlag, New York, NY, pp. 99-108.

Bowman, C., Manber, P. and Schwartz, U. (1994), "Scalable Internet resources discovery: research problems and approaches", Communications of the ACM, Vol. 37 No. 8, pp. 98-107.

Breese, J., Heckerman, D. and Kadie, C. (1998), "Empirical analysis of predictive algorithms for collaborative filtering", Technical Report MSR-TR-98-12, Microsoft Research, Seattle, CA.

Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A. and Lin, C. (1996), "A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois digital library initiative project", IEEE Transactions on PAMI, Vol. 18 No. 8, pp. 17-34.

Cooley, R., Mobasher, B. and Srivastava, J. (1999), "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1 No. 1, pp. 5-32.

Furner, J. (2002), "On recommending", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 747-63.

Goldberg, D., Nichols, D., Oki, B. and Terry, D. (1992), "Using collaborative filtering to weave an information tapestry", Communications of the ACM, Vol. 35 No. 12, pp. 61-70.

Herlocker, J. and Konstan, J. (2001), "Content-independent task-focused recommendation", IEEE Internet Computing, Vol. 5 No. 6, pp. 40-7.

Karypis, G. (2002), "Multilevel hypergraph partitioning", Tech. Report TR02-25, Department of Computer Science, University of Minnesota, MN.

Kessler, J. (1996), Internet Digital Libraries: The International Dimension, Artech House Publishers, Norwood, MA.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997), "GroupLens: applying collaborative filtering to Usenet news", Communications of the ACM, Vol. 40 No. 3, pp. 77-87.


Lang, K. (1995), "Newsweeder: learning to filter netnews", in Prieditis, A. and Russell, S. (Eds), Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, Morgan Kaufmann, San Francisco, CA, pp. 331-9.

Liu, B., Hsu, W. and Ma, Y. (1999), "Mining association rules with multiple minimum supports", Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 15-18 August, San Diego, ACM Press, New York, NY, pp. 430-4.

Loeb, S. and Terry, D. (1992), "Information filtering", Communications of the ACM, Special Issue on Information Filtering, Vol. 35 No. 12, pp. 26-8.

Mobasher, B., Cooley, R. and Srivastava, J. (1999), "Creating adaptive Web sites through usage-based clustering of URLs", Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop.

Mobasher, B., Dai, H., Luo, T., Nakagawa, M. and Wiltshire, J. (2000), "Discovery of aggregate usage profiles for Web personalisation", Proceedings of the webKDD Workshop.

Mooney, R. and Roy, L. (2000), "Content-based book recommending using learning for text categorisation", Proceedings of the 5th ACM Conference on Digital Libraries, San Antonio, ACM Press, New York, NY, pp. 195-204.

Pazzani, M. and Billsus, D. (1997), "Learning and revising user profiles: the identification of interesting Web sites", Machine Learning, Vol. 27 No. 4, pp. 313-31.

Pazzani, M. (1999), "A framework for collaborative, content-based and demographic filtering", Artificial Intelligence Review, Vol. 13 No. 5-6, pp. 393-408.

Pennock, D., Horvitz, E., Lawrence, S. and Giles, C. (2000), "Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach", Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, 30 June-3 July, Morgan Kaufmann, San Francisco, CA, pp. 473-80.

Pitkow, J. and Pirolli, P. (1999), "Mining longest repeating subsequences to predict World Wide Web surfing", Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 139-150.

Schafer, J., Konstan, J. and Riedl, J. (2001), "E-commerce recommendation applications", Data Mining and Knowledge Discovery, Vol. 5 No. 1, pp. 10-22.

Shardanand, U. and Maes, P. (1995), "Social information filtering: algorithms for automating 'word of mouth'", Proceedings of the Conference on Human Factors in Computing Systems, Denver, CO, ACM Press, New York, NY, pp. 210-17.

Spink, A., Wilson, T., Ford, N., Foster, A. and Ellis, D. (2002), "Information seeking and mediated searching", Journal of The American Society For Information Science and Technology, Vol. 53 No. 9, pp. 695-703.

Srivastava, J., Cooley, R., Deshpande, M. and Tang, P. (2000), "Web usage mining: discovery and applications of usage patterns from Web data", SIGKDD Explorations, Vol. 1 No. 2, pp. 12-23.

Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. (1997), "PHOAKS: a system for sharing recommendations", Communications of the ACM, Vol. 40 No. 3, pp. 59-62.

Wilensky, R. (1996), "Toward work-centred digital information services", IEEE Computer, Vol. 29 No. 5, pp. 7-44.

Yan, T., Jacobsen, M., Molina, H. and Dayal, U. (1996), "From user access patterns to dynamic hypertext linking", Proceedings of the 5th International World Wide Web Conference, pp. 1007-14.

Yang, Q., Zhang, H.H. and Li, T. (2001), "Mining Web logs for prediction models in WWW caching and prefetching", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-8.



$$
\mathrm{weight}(a_1, a_2, \ldots, a_k) = \log\left( \frac{\sigma(a_1, a_2, \ldots, a_k)}{\left[ \sigma(a_1) \cdot \sigma(a_2) \cdots \sigma(a_k) \right]^{\alpha}} \times \frac{1}{\mathit{Minsup}} \right)
$$

where 0 ≤ α ≤ 1 and σ(·) denotes the support of an itemset. Note that when α = 0 or α = 1, this formula is equivalent to using support or interest as the weight, respectively. 1/Minsup is a constant that keeps the weight non-negative. This definition supports our following hypotheses:

H2. A better recommendation effectiveness can be achieved by striking a balance between using support and interest as the weight of an itemset.

H3. A better recommendation effectiveness can be achieved by incorporating the logarithm function in the weight of an itemset.
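The weight definition can be illustrated with a minimal Python sketch. The function name `itemset_weight`, its parameter names, the use of the natural logarithm, and the `use_log` switch (which simply drops the logarithm, as one plausible reading of the variant tested under H3) are assumptions made for illustration; they are not taken from the paper.

```python
import math


def itemset_weight(itemset_support, item_supports, alpha, minsup, use_log=True):
    """Weight of a frequent itemset (illustrative sketch, names are hypothetical).

    itemset_support: support of the whole itemset, as a fraction of transactions
    item_supports:   supports of the individual articles in the itemset
    alpha:           0 gives a support-based weight, 1 an interest-based weight
    minsup:          minimum support threshold used when mining
    use_log:         assumption: False returns the ratio without the logarithm
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Product of the single-article supports, damped by the exponent alpha.
    denominator = math.prod(item_supports) ** alpha
    ratio = itemset_support / denominator / minsup
    # Natural logarithm is an assumption; the source does not state the base.
    return math.log(ratio) if use_log else ratio
```

With α = 0 the denominator is 1, so the weight reduces to log(support / Minsup), which is non-negative for any frequent itemset because its support is at least Minsup.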

After deciding the weight of each itemset, the hypergraph partitioning algorithm proposed in Karypis (2002) is applied to partition the set of articles into disjoint clusters of articles. Articles in the same cluster are more "similar" in the sense that they are more likely to be accessed together in the same transaction. To reflect the fact that an article may indeed interest more than one group of users, we adopt the same heuristic as used in Mobasher et al. (1999) by adding back articles to clusters, which results in overlapping clusters. Specifically, for a given hyperedge, if the percentage of involved vertices in a cluster is larger than a threshold, the other involved vertices are included in the same cluster.
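The add-back heuristic can be sketched as follows. This is only an illustration under stated assumptions: the function name `add_back_articles` is hypothetical, hyperedges and clusters are represented as Python sets of article identifiers, and the overlap is measured against the original disjoint partition so that the result does not depend on the order in which hyperedges are visited. The source does not specify these details.

```python
def add_back_articles(clusters, hyperedges, threshold):
    """Turn a disjoint partition into overlapping clusters (illustrative sketch).

    clusters:   list of sets of article ids produced by the partitioning step
    hyperedges: iterable of non-empty sets of article ids (frequent itemsets)
    threshold:  fraction in [0, 1]; a hyperedge is absorbed by a cluster when
                more than this fraction of its vertices already lie in it
    """
    expanded = [set(cluster) for cluster in clusters]
    for edge in hyperedges:
        for original, grown in zip(clusters, expanded):
            # Assumption: compare against the original partition, not the grown one.
            overlap = len(edge & original) / len(edge)
            if overlap > threshold:
                grown |= edge  # add the remaining vertices of the hyperedge
    return expanded
```

For example, with clusters {1, 2, 3} and {4, 5}, the hyperedge {1, 2, 4} and a threshold of 0.5, article 4 is added to the first cluster because two of the three vertices of the hyperedge already belong to it.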

Online recommendations

We propose two recommendation approaches that use the article association rules and article clusters obtained by the methods described above. The goal is to recommend the top-N articles that potentially interest the active user. The first approach makes use of article association rules. The idea is to treat each frequent itemset as the interest profile of a user group, and to recommend articles based on the similarity between the current session of the active user and interest profiles of the relevant user groups. Specifically, let s be the active user's current session of length k. We first identify the set of frequent itemsets of size k + 1 that contain all elements in s and an extra element m (not in s). For each such itemset I, the confidence of the rule I − {m} ⇒ {m} is calculated. These extra elements are then recommended to the user in descending order of confidence value. If these elements are not sufficient (i.e. there are less than N of them), we then search for frequent itemsets of size k that contain k − 1 elements in s and an extra element (not in s). Again, these extra elements are recommended to the user in descending order of confidence value. This procedure continues until N articles are recommended.
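The procedure above can be sketched in Python as follows. The sketch assumes that `supports` maps every frequent itemset (as a `frozenset`) to its support, including all subsets of frequent itemsets, so that the antecedent of each rule can be looked up. The function name, the tie-breaking by article identifier, and the choice to keep the highest confidence when the same article is reachable through several itemsets are assumptions for illustration, not details given in the paper.

```python
def recommend_by_rules(session, supports, top_n):
    """Top-N recommendation from frequent itemsets (illustrative sketch).

    session:  iterable of article ids in the active user's current session
    supports: dict mapping frozenset(article ids) -> support, assumed to
              contain every frequent itemset together with its subsets
    top_n:    number of articles to recommend
    """
    session = frozenset(session)
    recommended = []
    # First match all k session articles, then k - 1, and so on.
    for matched in range(len(session), 0, -1):
        best = {}
        for itemset, support in supports.items():
            if len(itemset) != matched + 1:
                continue
            extras = itemset - session
            if len(extras) != 1:
                continue  # must contain exactly one article outside the session
            (m,) = extras
            if m in recommended:
                continue
            antecedent = itemset - extras
            confidence = support / supports[antecedent]
            # Assumption: keep the best confidence seen for each candidate.
            best[m] = max(best.get(m, 0.0), confidence)
        ranked = sorted(best, key=lambda a: (-best[a], a))
        recommended.extend(ranked[: top_n - len(recommended)])
        if len(recommended) >= top_n:
            break
    return recommended
```

Because every level scans the whole collection of frequent itemsets, the cost of this sketch grows with both the number of frequent itemsets and the session length.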

Our other proposed method uses a hypergraph-based approach. In this approach, the recommendation score of each article a is computed by considering the similarity between the current user session and the clusters C to which a belongs, and the coherence weight of a with respect to C. Specifically, each cluster of articles can be viewed as a vector with binary elements, each of which indicates whether an article appears in the cluster. Similarly, the current user session can also be represented as a vector. Then, the similarity between the current session S and a cluster C can be defined as a cosine function as follows:

$$
\mathrm{match}(S, C) = \frac{\sum_k a_k^C \times S_k}{\sqrt{\sum_k (S_k)^2 \times \sum_k (a_k^C)^2}}
$$

where $S_k$ is the k'th element in S and $a_k^C$ is the k'th element in C.

The coherence weight of an article a with respect to the cluster C that it belongs to is defined as:

$$
\mathrm{weight}(a, C) = \frac{\sum_{a \in e,\; e \subseteq C} \mathrm{weight}(e)}{\sum_{e \subseteq C} \mathrm{weight}(e)}
$$

where weight(e) is the weight of a hyperedge e.

The recommendation score Rec(S, a) of an article a with respect to the current user session S is then defined as:

$$
\mathrm{Rec}(S, a) = \max_{a \in C} \sqrt{\mathrm{match}(S, C) \times \mathrm{weight}(a, C)}
$$

The top-N articles for recommendation are those with the N highest values in the recommendation score.
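A minimal sketch of the three definitions follows, assuming clusters and hyperedges are represented as sets of article identifiers and `edge_weights` maps each hyperedge (a `frozenset`) to a non-negative weight. The function names, the exclusion of articles already in the session, and the tie-breaking rule are assumptions for illustration. For binary vectors the cosine reduces to |S ∩ C| / sqrt(|S| × |C|), which is what `match` computes.

```python
import math


def match(session, cluster):
    """Cosine similarity between the binary vectors of a session and a cluster."""
    session, cluster = set(session), set(cluster)
    if not session or not cluster:
        return 0.0
    return len(session & cluster) / math.sqrt(len(session) * len(cluster))


def coherence_weight(article, cluster, edge_weights):
    """Share of the hyperedge weight inside the cluster that involves the article."""
    cluster = set(cluster)
    inside = {e: w for e, w in edge_weights.items() if e <= cluster}
    total = sum(inside.values())
    if total == 0:
        return 0.0
    return sum(w for e, w in inside.items() if article in e) / total


def recommend_by_clusters(session, clusters, edge_weights, top_n):
    """Rank articles by the maximum of sqrt(match * weight) over their clusters."""
    session = set(session)
    scores = {}
    for cluster in clusters:
        similarity = match(session, cluster)
        for article in cluster:
            if article in session:
                continue  # assumption: do not recommend what is already in the session
            score = math.sqrt(similarity * coherence_weight(article, cluster, edge_weights))
            scores[article] = max(scores.get(article, 0.0), score)
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked[:top_n]
```

Scoring touches each cluster once, so in this sketch the cost depends on the number and size of clusters rather than on the number of frequent itemsets matching the session.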


The hypergraph-based approach is more carefully designed than the association-rule-based approach, and we expect the former to perform better.

H4. The hypergraph-based approach will result in more effective recommendations and has a quicker response time than the association-rule-based approach.

Empirical evaluations

This section reports our experience in applying the Web usage logs of NSYSU-ETD to the proposed literature recommendation system. The main objective was to test our four hypotheses. NSYSU-ETD runs on PC Solaris 2.7 and uses Apache 1.3.9 as the Web server. Since being commissioned in May 2000, it has been loaded with more than 3,000 electronic theses of National Sun Yat-sen University. Up to February 2003, these theses had been browsed more than 400,000 times and downloaded more than 100,000 times. We analysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002 for our experiments: the data collected from February 1 to April 30 were designated as the training data set, and those collected in May served as the test data set.

We first applied the data cleansing technique on the training data and obtained 43,349 lookup accesses and 41,627 article accesses. Applying the session identification technique revealed 16,922 user sessions, among which 392 sessions were robot generated, 6,068 sessions contained only one article access, and 5,253 sessions contained no article accesses. We eliminated these trivial user sessions and applied transaction identification techniques, resulting in 5,617 transactions for the query-chosen method, 5,272 transactions for the session-chosen method, and 17,742 transactions for the query-result method. Queries whose results are never chosen by the users are removed from the query-chosen method but remain in the query-result method. The session-result method produced transactions of huge size, each containing thousands of article accesses. We therefore decided not to consider this method in the subsequent experiments.

The two proposed methods for mining literature usage logs both require the identification of frequent itemsets from transactions, which needs the minimum support to be specified. However, the three transaction identification methods under comparison have different numbers of transactions. To be fair, we specify a different minimum support threshold for each method such that the total number of articles involved in large two-item sets of each method - called recommendable articles - is approximately the same. Our recommendation framework recommends an article only if it is associated with other articles a sufficient number of times in the Web usage log. Therefore, articles that are not involved in large two-item sets cannot possibly be recommended. Table I shows the specified minimum support threshold and the number of recommendable articles of each method.
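The notion of recommendable articles can be made concrete with a short sketch that counts the articles appearing in at least one large two-item set for a given minimum support. The function name and the representation of transactions as lists of article identifiers are assumptions for illustration; the paper does not describe how this count was computed.

```python
from collections import Counter
from itertools import combinations


def recommendable_articles(transactions, minsup):
    """Articles involved in at least one large two-item set (illustrative sketch).

    transactions: list of lists of article ids
    minsup:       minimum support as a fraction of the number of transactions
    """
    pair_counts = Counter()
    for transaction in transactions:
        # Count each unordered pair at most once per transaction.
        for pair in combinations(sorted(set(transaction)), 2):
            pair_counts[pair] += 1
    articles = set()
    for pair, count in pair_counts.items():
        if count / len(transactions) >= minsup:
            articles.update(pair)
    return articles
```

Running such a count for several candidate thresholds on each transaction set is one way to pick thresholds that yield a similar number of recommendable articles across methods.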

To illustrate how we conducted experiments, we define the following notation: let T_eval be the set of transactions in the test set, t_eval be a transaction in T_eval, and a_t(i) be the i'th article in t_eval. Given a window size W_size, we divide each transaction t_eval in the test data set into two lists, t_eval[W] and t_eval[R], where t_eval[W] is the first W_size article accesses of t_eval and t_eval[R] is the remaining articles. By treating t_eval[W] as the current session, the recommender system will choose the set t_pr of top-N articles for recommendation.

The performance metric we adopted for measuring the quality of recommendation is the precision and recall scheme. The precision is the ratio of the number of recommended articles accessed by a user to the total number of recommended articles, defined as |t_pr ∩ t_eval[R]| / |t_pr|, and recall is the ratio of the number of recommended articles accessed by a user to the total number of articles of interest to the user, defined as |t_pr ∩ t_eval[R]| / |t_eval[R]|. The precision (recall) of a recommendation approach is the average precision (recall) of all transactions in the test set.
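The evaluation protocol can be sketched as follows. The function names are hypothetical, `recommender` stands for any callable taking the current session and N, and two choices are assumptions rather than details from the paper: transactions with no remaining articles after the window are skipped, and repeated accesses are collapsed by treating both lists as sets.

```python
def split_transaction(transaction, window_size):
    """Split a test transaction into the window t_eval[W] and the rest t_eval[R]."""
    return transaction[:window_size], transaction[window_size:]


def precision_recall(recommended, remaining):
    """Precision and recall of one recommendation list against t_eval[R]."""
    recommended, remaining = set(recommended), set(remaining)
    hits = len(recommended & remaining)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(remaining) if remaining else 0.0
    return precision, recall


def evaluate(test_transactions, recommender, window_size, top_n):
    """Average precision and recall over all usable test transactions."""
    precisions, recalls = [], []
    for transaction in test_transactions:
        window, remaining = split_transaction(transaction, window_size)
        if not remaining:
            continue  # assumption: nothing left to predict, so skip
        p, r = precision_recall(recommender(window, top_n), remaining)
        precisions.append(p)
        recalls.append(r)
    if not precisions:
        return 0.0, 0.0
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```

Either recommendation approach can be plugged in as `recommender`, which keeps the comparison between approaches on identical test splits.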

Test of H1

We first evaluate the performance impact of the three transaction identification methods. In this experiment, α was set to be 0.5 and the logarithm was taken when computing the weight of an itemset. Figure 3(a, b) shows the precisions and recalls, respectively, under association-rule-based recommendation.


The precisions and recalls under hypergraph-based recommendation are shown in Figure 4(a, b), respectively.

Overall, the trends are the same for the association-rule-based and hypergraph-based approaches. It can be clearly seen that both the query-chosen and session-chosen methods outperform the query-result method in terms of both precision and recall. This implies that the information about articles browsed plays a crucial role in making recommendations - both the query-chosen and session-chosen methods incorporate this information in forming transactions. We therefore accept H1. However, the performance difference between the query-chosen and session-chosen methods is not significant.

Figure 3 (a) Precisions of the association-based approach; (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Method                   Minimum support (per cent)   No. of recommended articles
Session-chosen method    0.16                         253
Query-chosen method      0.12                         250
Query-result method      33                           229

Test of H2 and H3

We then conducted experiments to shed light on the impact of α and the logarithmic function in the weight definition for the hypergraph-based recommendation approach.

We then computed the precisions andrecalls with and without application of thelogarithmic function on the itemset weightFigure 6(a b) shows the resulting precisionsand recalls respectively

Figure 6 shows that applying thelogarithmic function on the itemset weightdefinition achieves significantly betterprecision and recall values This meets ourexpectation and H3 is accepted

Test of H4Finally we evaluated the impact ofassociation-rule-based and hypergraph-basedrecommendation approaches for different

window sizes Figure 7(a b) shows theprecisions and recalls for window sizes of 2 34 5 and 6 for a top-15 recommendation Itcan be seen that the hypergraph-basedapproach performs better than theassociation-rule-based approach especiallyfor larger window sizes

We also compared the running times ofboth approaches whilst setting differentminimum support thresholds Therelative performance of the two approachesunder different window sizes are shown inFigure 8(a b) Overall the running time ofthe hypergraph-based approach remainedrelatively constant not varying with changesin window size and minimum supportthreshold In contrast the running time of theassociation-rule-based approach increasedwith an increase in window size or a decreasein the minimum support This is because theassociation-rule-based approach has to searchfor the frequent itemsets that match thecurrent session As the number of frequentitemsets increase (as a result of a decreasein the minimum support) or the length ofcurrent user session increases (as a

Figure 5 (a) Impact of not on precisions and (b) impact of not on recalls of

the hypergraph-based approach using the session-chosen method for

transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact

of logarithmic function on recalls of the hypergraph-based approach using

the session-chosen method for transaction identification

179

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

result of an increased window size)the association-rule-based approach incurs alarger running time

Overall we conclude that thehypergraph-based approach is more attractivesince it yields better-quality articlerecommendation and has a more consistentrunning time Thus H4 is accepted

Conclusions

In this paper we have investigated issuesrelated to the recommendation of articles in aliterature digital library We have developed a

literature recommendation system that makes

use of the Web usage logs of a literature

digital library for making recommendations

The literature recommendation system

consists of three sequential steps(1) data preparation of the Web logs(2) usage log mining and(3) generation of article recommendations

We proposed three alternatives for identifying

transactions from Web usage logs and

discussed two approaches - association-rule

based and hypergraph based - for making

recommendations These alternatives and

approaches were evaluated using the Web

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes

(Minsup = 016 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different

Minsup values and (b) running times of the two recommendation approaches for a window size of six under

different Minsup values

180

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

usage logs of an operational electronic thesissystem at National Sun Yat-sen University Ithas been found that the query-chosen andsession-chosen methods are better fortransaction identification and that thehypergraph-based approach yieldsbetter-quality article recommendation andexhibits a more consistent running time andthus is more scalable

Our recommendation framework identifiesarticle associations present in the Web usagelogs of a digital library While this approachresults in effective recommendations it failsto recommend those independent articles thatare seldom accessed together with others Asevident from our experiments an analysis ofthe three-month collection of Web usage logsof NSYSU-ETD (from 1 February to 30April 2002) showed that only about one-tenthof the total collection are recommendable (seeTable I) To extend the scope ofrecommendable articles we are currentlyinvestigating approaches that make use ofmultiple sources when making articlerecommendations in digital libraries Onesuch source of course is the metadataalready collected by digital libraries

References

Agrawal R Imielinski T and Swami A (1993) ` Mining

association rules between sets of items in largedatabasesrsquorsquo in Buneman P and Jajodia S (Eds)

Proceedings of the ACM SIGMOD Conference onManagement of Data Washington DC May 26-28ACM Press New York NY pp 207-16

Agrawal R and Srikant R (1994) ` Fast algorithms formining association rulesrsquorsquo in Bocca JB Jarke M

and Zaniolo C (Eds) Proceedings of the 20thInternational Conference on Very Large Data Bases

September 12-15 Santiago Chile MorganKaufmann San Francisco CA pp 487-99

Alspector J Kolcz A and Karunanithi N (1998)` Comparing feature-based and clique-based usermodels for movie selectionrsquorsquo Proceedings of the 3rd

ACM International Conference on Digital LibrariesJune 23-26 1998 Pittsburgh PA ACM Press New

York NY pp 11-18Andresen D Carver L Dolin R Fischer C Frew J

Goodchild M Ibarra O Kothuri R Larsgaard MNebert D Simpson J Smith T Yang T andZheng Q (1995) ` The WWW prototype of the

Alexandria digital libraryrsquorsquo Proceedings of theInternational Symposium on Digital Libraries

Tsukuba Japan 22-5 August pp 17-27Ansari A Essegaier S and Kohli R (2000) ` Internet

recommendation systemsrsquorsquo Journal of MarketingResearch Vol 37 No 3 pp 67-85

Arms W (2000) Digital Libraries MIT Press CambridgeMA

Armstrong R Freitag D Joachims T and Mitchell T(1997) ` WebWatcher a learning apprentice for theWorld Wide Webrsquorsquo AAAI Spring Symposium onInformation Gathering from HeterogeneousDistributed Environments pp 6-12

Balabanovirsquoc M and Shoham Y (1997) ` Fabcontent-based collaborative recommendationrsquorsquoCommunications of the ACM Vol 40 No 3pp 66-72

Baldonado M Chang C Gravano L and Paepcke A(1997) ` Metadata for digital libraries architectureand design rationalersquorsquo Proceedings of the 2nd ACMInternational Conference on Digital Libraries ACMPress New York NY pp 47-56

Basu C Hirsh H and Cohen W (1998)` Recommendation as classification using social andcontent-based information in recommendationrsquorsquoProceedings of the 15th National Conference onArtificial Intelligence 26-30 July Madison WI AAAIPress Menlo Park CA pp 714-20

Billsus D and Pazzani M (1999) ` A hybrid user modelfor news story classificationrsquorsquo in Kay J (Ed)Proceedings of the 7th International Conference onUser Modelling Banff Canada 20-4 JuneSpringer-Verlag New York NY pp 99-108

Bowman C Manber P and Schwartz U (1994)` Scalable Internet resources discovery researchproblems and approachesrsquorsquo Communications of theACM Vol 37 No 8 pp 98-107

Breese J Heckerman D and Kadie C (1998) ` Empiricalanalysis of predictive algorithms for collaborativefilteringrsquorsquo Technical Report MSR-TR-98-12Microsoft Research Seattle CA

Chen H Schatz B Ng T Martinez J Kirchhoff A andLin C (1996) ` A parallel computing approach tocreating engineering concept spaces for semanticretrieval the Illinois digital library initiativeprojectrsquorsquo IEEE Transactions on PAMI Vol 18 No 8pp 17-34

Cooley R Mobasher B and Srivastava J (1999) ` Datapreparation for mining World Wide Web browsingpatternsrsquorsquo Journal of Knowledge and InformationSystems Vol 1 No 1 pp 5-32

Furner J (2002) ` On recommendingrsquorsquo Journal of theAmerican Society for Information Science andTechnology Vol 53 No 9 pp 747-63

Goldberg D Nichols D Oki B and Terry D (1992)` Using collaborative filtering to weave aninformation tapestryrsquorsquo Communications of the ACMVol 35 No 12 pp 61-70

Herlocker J and Konstan J (2001) ` Content-independent task-focused recommendationrsquorsquo IEEEInternet Computing Vol 5 No 6 pp 40-7

Karypis G (2002) ` Multilevel hypergraph partitioningrsquorsquoTech Report TR02-25 Department of ComputerScience University of Minnesota MN

Kessler J (1996) Internet Digital Libraries TheInternational Dimension Artech House PublishersNorwood MA

Konstan J Miller B Maltz D Herlocker J Gordon Land Riedl J (1997) ` GroupLens applyingcollaborative filtering to Usenet newsrsquorsquoCommunications of the ACM Vol 40 No 3pp 77-87

181

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Lang K (1995) ` Newsweeder learning to filternetnewsrsquorsquo in Prieditis A and Russell S (Eds)Proceedings of the 12th International Conference onMachine Learning Lake Tahoe Morgan KaufmannSan Francisco CA pp 331-9

Liu B Hsu W and Ma Y (1999) ` Mining associationrules with multiple minimum supportsrsquorsquo Proceedingsof the 5th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining 15-18

August San Diego ACM Press New York NYpp 430-4

Loeb S and Terry D (1992) ` Information filteringrsquorsquoCommunications of the ACM Special Issue onInformation Filtering Vol 35 No 12 pp 26-8

Mobasher B Cooley R and Srivastava J (1999)` Creating adaptive Web sites through usage-basedclustering of URLsrsquorsquo Proceedings of the IEEEKnowledge and Data Engineering ExchangeWorkshop

Mobasher B Dai H Luo T Nakagawa M andWiltshire J (2000) ` Discovery of aggregate usageprofiles for Web personalisationrsquorsquo Proceedings ofthe webKDD Workshop

Mooney R and Roy L (2000) ` Content-based bookrecommending using learning for text

categorisationrsquorsquo Proceedings of the 5th ACMConference on Digital Libraries San Antonio ACMPress New York NY pp 195-204

Pazzani M and Billsus D (1997) ` Learning and revisinguser profiles the identification of interesting Websitesrsquorsquo Machine Learning Vol 27 No 4 pp 313-31

Pazzani M (1999) ` A framework for collaborativecontent-based and demographic filteringrsquorsquo ArtificialIntelligence Review Vol 13 No 56 pp 393-408

Pennock D Horvitz E Lawrence S and Giles C (2000)` Collaborative filtering by personality diagnosis ahybrid memory- and model-based approachrsquorsquoProceedings of the 16th Conference onUncertainty in Artificial Intelligence San Francisco

30 June-3 July Morgan Kaufmann San FranciscoCA pp 473-80

Pitkow J and Pirolli P (1999) ` Mining longest repeatingsubsequences to predict World Wide Web surfingrsquorsquoProceedings of the 2nd USENIX Symposium onInternet Technologies and Systems Boulder COpp 139-150

Schafer J Konstan J and Riedl J (2001) ` E-commercerecommendation applicationsrsquorsquo Data Mining andKnowledge Discovery Vol 5 No 1 pp 10-22

Shardanand U and Maes P (1995) ` Social informationfiltering algorithms for automating `word ofmouthrsquorsquorsquo Proceedings of the Conference on HumanFactors in Computing Systems Denver CO ACMPress New York NY pp 210-17

Spink A Wilson T Ford N Foster A and Ellis D(2002) ` Information seeking and mediatedsearchingrsquorsquo Journal of The American Society ForInformation Science and Technology Vol 53 No 9pp 695-703

Srivastava J Cooley R Deshpande M and Tang P(2000) ` Web usage mining discovery andapplications of usage patterns from Web datarsquorsquoSIGKDD Explorations Vol 1 No 2 pp 12-23

Terveen L Hill W Amento B McDonald D and CreterJ (1997) ` PHOAKS a system for sharingrecommendationsrsquorsquo Communications of the ACMVol 40 No 3 pp 59-62

Wilensky R (1996) ` Toward work-centred digitalinformation servicesrsquorsquo IEEE Computer Vol 29 No 5pp 7-44

Yan T Jacobsen M Molina H and Dayal U (1996)` From user access patterns to dynamic hypertextlinkingrsquorsquo Proceedings of the 5th International WorldWide Web Conference pp 1007-14

Yang Q Zhang HH and Li T (2001) ` Mining Web logsfor prediction models in WWW caching andprefetchingrsquorsquo Proceedings of the ACM SIGKDDInternational Conference on Knowledge Discoveryand Data Mining pp 473-8

182

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Page 9: A prototype WWW literature recommendation system for digital libraries

The hypergraph-based approach is more carefully designed than the association-rule-based approach, and we expect the former to perform better.

H4. The hypergraph-based approach will result in more effective recommendations and has a quicker response time than the association-rule-based approach.

Empirical evaluations

This section reports our experience in applying the Web usage logs of NSYSU-ETD to the proposed literature recommendation system. The main objective was to test our four hypotheses. NSYSU-ETD runs on PC Solaris 2.7 and uses Apache 1.3.9 as the Web server. Since being commissioned in May 2000, it has been loaded with more than 3,000 electronic theses of National Sun Yat-sen University. Up to February 2003, these theses had been browsed more than 400,000 times and downloaded more than 100,000 times. We analysed the Web usage logs of NSYSU-ETD between February 2002 and May 2002 for our experiments: the data collected from February 1 to April 30 were designated as the training data set, and those collected in May served as the test data set.

We first applied the data cleansing technique to the training data and obtained 43,349 lookup accesses and 41,627 article accesses. Applying the session identification technique revealed 16,922 user sessions, among which 392 sessions were robot generated, 6,068 sessions contained only one article access, and 5,253 sessions contained no article accesses. We eliminated these trivial user sessions and applied the transaction identification techniques, resulting in 5,617 transactions for the query-chosen method, 5,272 transactions for the session-chosen method, and 17,742 transactions for the query-result method. Queries whose results are never chosen by the users are removed from the query-chosen method but remain in the query-result method. The session-result method produced transactions of huge size, each containing thousands of article accesses. We therefore decided not to consider this method in the subsequent experiments.
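The elimination of trivial sessions described above can be sketched as a simple filter. This is a minimal illustration, not the system's actual code: the `Session` structure, its field names, and the `drop_trivial_sessions` helper are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Session:
    """One user session rebuilt from the Web log (hypothetical structure)."""
    session_id: str
    is_robot: bool = False
    article_accesses: List[str] = field(default_factory=list)


def drop_trivial_sessions(sessions: List[Session]) -> List[Session]:
    """Discard robot-generated sessions and sessions with fewer than two article accesses."""
    return [
        s for s in sessions
        if not s.is_robot and len(s.article_accesses) >= 2
    ]


sessions = [
    Session("s1", article_accesses=["a1", "a2", "a3"]),           # kept
    Session("s2", is_robot=True, article_accesses=["a1", "a2"]),  # robot generated
    Session("s3", article_accesses=["a4"]),                       # only one article access
    Session("s4"),                                                # no article accesses
]
kept = drop_trivial_sessions(sessions)
print([s.session_id for s in kept])
```

Only the surviving sessions would then be passed on to transaction identification.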

The two proposed methods for mining literature usage logs both require the identification of frequent itemsets from transactions, which needs the minimum support to be specified. However, the three transaction identification methods under comparison have different numbers of transactions. To be fair, we specify a different minimum support threshold for each method, such that the total number of articles involved in large two-item sets of each method - called recommendable articles - is approximately the same. Our recommendation framework recommends an article only if it is associated with other articles a sufficient number of times in the Web usage log. Therefore, articles that are not involved in large two-item sets cannot possibly be recommended. Table I shows the specified minimum support threshold and the number of recommendable articles of each method.
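To make the notion of recommendable articles concrete, the sketch below counts the articles that take part in at least one large two-item set for a given minimum support. It is a brute-force illustration under stated assumptions: the function name, the representation of a transaction as a list of article identifiers, and the toy data are hypothetical, and a real system would use a frequent-itemset algorithm rather than enumerating all pairs.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set


def recommendable_articles(transactions: List[Iterable[str]],
                           min_support: float) -> Set[str]:
    """Return the articles that occur in at least one large two-item set.

    min_support is a fraction of the number of transactions
    (for example 0.0016 for 0.16 per cent).
    """
    pair_counts: Counter = Counter()
    for t in transactions:
        # Count each unordered pair of distinct articles once per transaction.
        for pair in combinations(sorted(set(t)), 2):
            pair_counts[pair] += 1

    threshold = min_support * len(transactions)
    result: Set[str] = set()
    for pair, count in pair_counts.items():
        if count >= threshold:
            result.update(pair)
    return result


toy = [["a", "b"], ["a", "b", "c"], ["b", "c"], ["d", "e"]]
print(sorted(recommendable_articles(toy, 0.5)))
```

Lowering `min_support` can only enlarge the returned set, which is why a separate threshold can be tuned per transaction identification method until the set sizes are roughly equal.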

To illustrate how we conducted the experiments, we define the following notation: let $T_{eval}$ be the set of transactions in the test set, $t_{eval}$ be a transaction in $T_{eval}$, and $a_t(i)$ be the $i$'th article in $t_{eval}$. Given a window size $W_{size}$, we divide each transaction $t_{eval}$ in the test data set into two lists, $t_{eval}[W]$ and $t_{eval}[R]$, where $t_{eval}[W]$ is the first $W_{size}$ article accesses of $t_{eval}$ and $t_{eval}[R]$ is the remaining articles. By treating $t_{eval}[W]$ as the current session, the recommender system will choose the set $t_{pr}$ of top-N articles for recommendation.

The performance metric we adopted for measuring the quality of recommendation is the precision and recall scheme. The precision is the ratio of the number of recommended articles accessed by a user to the total number of recommended articles, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{pr}|$, and recall is the ratio of the number of recommended articles accessed by a user to the total number of articles of interest to the user, defined as $|t_{pr} \cap t_{eval}[R]| / |t_{eval}[R]|$. The precision (recall) of a recommendation approach is the average precision (recall) of all transactions in the test set.
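The evaluation procedure above can be sketched as follows. This is an illustrative outline only: the function names and the `recommend` callback are hypothetical, the remaining articles are treated as a set, and test transactions with no remaining articles are skipped, which is an assumption the text does not state.

```python
from typing import Callable, List, Sequence, Set, Tuple


def split_transaction(t_eval: Sequence[str],
                      w_size: int) -> Tuple[List[str], List[str]]:
    """Split a test transaction into the window part and the remaining part."""
    return list(t_eval[:w_size]), list(t_eval[w_size:])


def precision_recall(t_pr: Set[str],
                     remaining: Sequence[str]) -> Tuple[float, float]:
    """Precision and recall of the recommended set against the remaining articles."""
    remaining_set = set(remaining)
    hits = len(t_pr & remaining_set)
    precision = hits / len(t_pr) if t_pr else 0.0
    recall = hits / len(remaining_set) if remaining_set else 0.0
    return precision, recall


def evaluate(test_set: List[Sequence[str]],
             w_size: int,
             recommend: Callable[[List[str]], Set[str]]) -> Tuple[float, float]:
    """Average precision and recall over all usable test transactions."""
    scores = []
    for t_eval in test_set:
        window, remaining = split_transaction(t_eval, w_size)
        if not remaining:
            continue  # assumption: nothing left to predict, so skip
        scores.append(precision_recall(recommend(window), remaining))
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


window, remaining = split_transaction(["a", "b", "c", "d", "e"], 2)
p, r = precision_recall({"c", "d", "x", "y"}, remaining)
print(window, remaining, p, r)
```

In this toy call two of the four recommended articles appear among the three remaining accesses, so precision is 2/4 and recall is 2/3.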

Test of H1

We first evaluate the performance impact of the three transaction identification methods. In this experiment, the parameter in the weight definition was set to 0.5 and the logarithm was taken when computing the weight of an itemset. Figure 3(a, b) shows the precisions and recalls, respectively, under association-rule-based recommendation. The precisions and recalls under hypergraph-based recommendation are shown in Figure 4(a, b), respectively.

Overall, the trends are the same for the association-rule-based and hypergraph-based approaches. It can be clearly seen that both the query-chosen and session-chosen methods outperform the query-result method in terms of both precision and recall. This implies that the information about articles browsed plays a crucial role in making recommendations - both the query-chosen and session-chosen methods incorporate this information in forming transactions. We therefore accept H1. However, the performance difference between the query-chosen and session-chosen methods is not significant.

Figure 3 (a) Precisions of the association-based approach; (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Method                   Minimum support (per cent)   No. of recommended articles
Session-chosen method    0.16                         253
Query-chosen method      0.12                         250
Query-result method      3.3                          229

Test of H2 and H3

We then conducted experiments to shed light on the impact of the weight parameter and the logarithmic function in the weight definition for the hypergraph-based recommendation approach. Figure 5(a, b) shows the precision and recall values when the parameter is 0, 0.5, and 1, using the session-chosen method for transaction identification. The window size was set at 2. As can be seen, differences in the precision and recall under different settings of the parameter are very small. We performed the same experiments for different transaction identification methods and window sizes and obtained similar results. H2 is therefore rejected.

We then computed the precisions and recalls with and without application of the logarithmic function on the itemset weight. Figure 6(a, b) shows the resulting precisions and recalls, respectively.

Figure 6 shows that applying the logarithmic function to the itemset weight definition achieves significantly better precision and recall values. This meets our expectation, and H3 is accepted.

Figure 5 (a) Impact of the weight parameter on precisions and (b) impact of the weight parameter on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Test of H4

Finally, we evaluated the impact of association-rule-based and hypergraph-based recommendation approaches for different window sizes. Figure 7(a, b) shows the precisions and recalls for window sizes of 2, 3, 4, 5, and 6 for a top-15 recommendation. It can be seen that the hypergraph-based approach performs better than the association-rule-based approach, especially for larger window sizes.

We also compared the running times of both approaches whilst setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a result of an increased window size), the association-rule-based approach incurs a larger running time.
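The cost argument above can be illustrated with a small sketch of itemset matching at recommendation time. The matching rule used here (an itemset matches when it shares at least one article with the current session and contributes its other articles as candidates, scored by support) is an assumption made for illustration and is not taken from the paper; the function names and toy data are likewise hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence


def candidates_from_itemsets(frequent_itemsets: Dict[FrozenSet[str], float],
                             window: Sequence[str]) -> Dict[str, float]:
    """Scan every frequent itemset and score the articles outside the window."""
    window_set = set(window)
    scores: Dict[str, float] = {}
    # One pass over all frequent itemsets: work grows with their number,
    # and each overlap test grows with the size of the window.
    for itemset, support in frequent_itemsets.items():
        if not (itemset & window_set):
            continue  # itemset does not match the current session
        for article in itemset - window_set:
            scores[article] = max(scores.get(article, 0.0), support)
    return scores


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n best-scoring articles, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:n]]


frequent = {
    frozenset({"a", "b"}): 0.4,
    frozenset({"a", "c"}): 0.3,
    frozenset({"d", "e"}): 0.2,
    frozenset({"b", "c", "f"}): 0.1,
}
print(top_n(candidates_from_itemsets(frequent, ["a"]), 2))
print(top_n(candidates_from_itemsets(frequent, ["a", "b"]), 2))
```

Under this assumed rule, a lower minimum support enlarges `frequent_itemsets` and a longer window makes more itemsets match, both of which add work per request.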

Overall, we conclude that the hypergraph-based approach is more attractive since it yields better-quality article recommendation and has a more consistent running time. Thus, H4 is accepted.

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values

Conclusions

In this paper we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:
(1) data preparation of the Web logs;
(2) usage log mining; and
(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches - association-rule based and hypergraph based - for making recommendations. These alternatives and approaches were evaluated using the Web usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification and that the hypergraph-based approach yields better-quality article recommendation and exhibits a more consistent running time, and thus is more scalable.

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.

References

Agrawal R Imielinski T and Swami A (1993) ` Mining

association rules between sets of items in largedatabasesrsquorsquo in Buneman P and Jajodia S (Eds)

Proceedings of the ACM SIGMOD Conference onManagement of Data Washington DC May 26-28ACM Press New York NY pp 207-16

Agrawal R and Srikant R (1994) ` Fast algorithms formining association rulesrsquorsquo in Bocca JB Jarke M

and Zaniolo C (Eds) Proceedings of the 20thInternational Conference on Very Large Data Bases

September 12-15 Santiago Chile MorganKaufmann San Francisco CA pp 487-99

Alspector J Kolcz A and Karunanithi N (1998)` Comparing feature-based and clique-based usermodels for movie selectionrsquorsquo Proceedings of the 3rd

ACM International Conference on Digital LibrariesJune 23-26 1998 Pittsburgh PA ACM Press New

York NY pp 11-18Andresen D Carver L Dolin R Fischer C Frew J

Goodchild M Ibarra O Kothuri R Larsgaard MNebert D Simpson J Smith T Yang T andZheng Q (1995) ` The WWW prototype of the

Alexandria digital libraryrsquorsquo Proceedings of theInternational Symposium on Digital Libraries

Tsukuba Japan 22-5 August pp 17-27Ansari A Essegaier S and Kohli R (2000) ` Internet

recommendation systemsrsquorsquo Journal of MarketingResearch Vol 37 No 3 pp 67-85

Arms W (2000) Digital Libraries MIT Press CambridgeMA

Armstrong R Freitag D Joachims T and Mitchell T(1997) ` WebWatcher a learning apprentice for theWorld Wide Webrsquorsquo AAAI Spring Symposium onInformation Gathering from HeterogeneousDistributed Environments pp 6-12

Balabanovirsquoc M and Shoham Y (1997) ` Fabcontent-based collaborative recommendationrsquorsquoCommunications of the ACM Vol 40 No 3pp 66-72

Baldonado M Chang C Gravano L and Paepcke A(1997) ` Metadata for digital libraries architectureand design rationalersquorsquo Proceedings of the 2nd ACMInternational Conference on Digital Libraries ACMPress New York NY pp 47-56

Basu C Hirsh H and Cohen W (1998)` Recommendation as classification using social andcontent-based information in recommendationrsquorsquoProceedings of the 15th National Conference onArtificial Intelligence 26-30 July Madison WI AAAIPress Menlo Park CA pp 714-20

Billsus D and Pazzani M (1999) ` A hybrid user modelfor news story classificationrsquorsquo in Kay J (Ed)Proceedings of the 7th International Conference onUser Modelling Banff Canada 20-4 JuneSpringer-Verlag New York NY pp 99-108

Bowman C Manber P and Schwartz U (1994)` Scalable Internet resources discovery researchproblems and approachesrsquorsquo Communications of theACM Vol 37 No 8 pp 98-107

Breese J Heckerman D and Kadie C (1998) ` Empiricalanalysis of predictive algorithms for collaborativefilteringrsquorsquo Technical Report MSR-TR-98-12Microsoft Research Seattle CA

Chen H Schatz B Ng T Martinez J Kirchhoff A andLin C (1996) ` A parallel computing approach tocreating engineering concept spaces for semanticretrieval the Illinois digital library initiativeprojectrsquorsquo IEEE Transactions on PAMI Vol 18 No 8pp 17-34

Cooley R Mobasher B and Srivastava J (1999) ` Datapreparation for mining World Wide Web browsingpatternsrsquorsquo Journal of Knowledge and InformationSystems Vol 1 No 1 pp 5-32

Furner J (2002) ` On recommendingrsquorsquo Journal of theAmerican Society for Information Science andTechnology Vol 53 No 9 pp 747-63

Goldberg D Nichols D Oki B and Terry D (1992)` Using collaborative filtering to weave aninformation tapestryrsquorsquo Communications of the ACMVol 35 No 12 pp 61-70

Herlocker J and Konstan J (2001) ` Content-independent task-focused recommendationrsquorsquo IEEEInternet Computing Vol 5 No 6 pp 40-7

Karypis G (2002) ` Multilevel hypergraph partitioningrsquorsquoTech Report TR02-25 Department of ComputerScience University of Minnesota MN

Kessler J (1996) Internet Digital Libraries TheInternational Dimension Artech House PublishersNorwood MA

Konstan J Miller B Maltz D Herlocker J Gordon Land Riedl J (1997) ` GroupLens applyingcollaborative filtering to Usenet newsrsquorsquoCommunications of the ACM Vol 40 No 3pp 77-87

181

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Lang K (1995) ` Newsweeder learning to filternetnewsrsquorsquo in Prieditis A and Russell S (Eds)Proceedings of the 12th International Conference onMachine Learning Lake Tahoe Morgan KaufmannSan Francisco CA pp 331-9

Liu B Hsu W and Ma Y (1999) ` Mining associationrules with multiple minimum supportsrsquorsquo Proceedingsof the 5th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining 15-18

August San Diego ACM Press New York NYpp 430-4

Loeb S and Terry D (1992) ` Information filteringrsquorsquoCommunications of the ACM Special Issue onInformation Filtering Vol 35 No 12 pp 26-8

Mobasher B Cooley R and Srivastava J (1999)` Creating adaptive Web sites through usage-basedclustering of URLsrsquorsquo Proceedings of the IEEEKnowledge and Data Engineering ExchangeWorkshop

Mobasher B Dai H Luo T Nakagawa M andWiltshire J (2000) ` Discovery of aggregate usageprofiles for Web personalisationrsquorsquo Proceedings ofthe webKDD Workshop

Mooney R and Roy L (2000) ` Content-based bookrecommending using learning for text

categorisationrsquorsquo Proceedings of the 5th ACMConference on Digital Libraries San Antonio ACMPress New York NY pp 195-204

Pazzani M and Billsus D (1997) ` Learning and revisinguser profiles the identification of interesting Websitesrsquorsquo Machine Learning Vol 27 No 4 pp 313-31

Pazzani M (1999) ` A framework for collaborativecontent-based and demographic filteringrsquorsquo ArtificialIntelligence Review Vol 13 No 56 pp 393-408

Pennock D Horvitz E Lawrence S and Giles C (2000)` Collaborative filtering by personality diagnosis ahybrid memory- and model-based approachrsquorsquoProceedings of the 16th Conference onUncertainty in Artificial Intelligence San Francisco

30 June-3 July Morgan Kaufmann San FranciscoCA pp 473-80

Pitkow J and Pirolli P (1999) ` Mining longest repeatingsubsequences to predict World Wide Web surfingrsquorsquoProceedings of the 2nd USENIX Symposium onInternet Technologies and Systems Boulder COpp 139-150

Schafer J Konstan J and Riedl J (2001) ` E-commercerecommendation applicationsrsquorsquo Data Mining andKnowledge Discovery Vol 5 No 1 pp 10-22

Shardanand U and Maes P (1995) ` Social informationfiltering algorithms for automating `word ofmouthrsquorsquorsquo Proceedings of the Conference on HumanFactors in Computing Systems Denver CO ACMPress New York NY pp 210-17

Spink A Wilson T Ford N Foster A and Ellis D(2002) ` Information seeking and mediatedsearchingrsquorsquo Journal of The American Society ForInformation Science and Technology Vol 53 No 9pp 695-703

Srivastava J Cooley R Deshpande M and Tang P(2000) ` Web usage mining discovery andapplications of usage patterns from Web datarsquorsquoSIGKDD Explorations Vol 1 No 2 pp 12-23

Terveen L Hill W Amento B McDonald D and CreterJ (1997) ` PHOAKS a system for sharingrecommendationsrsquorsquo Communications of the ACMVol 40 No 3 pp 59-62

Wilensky R (1996) ` Toward work-centred digitalinformation servicesrsquorsquo IEEE Computer Vol 29 No 5pp 7-44

Yan T Jacobsen M Molina H and Dayal U (1996)` From user access patterns to dynamic hypertextlinkingrsquorsquo Proceedings of the 5th International WorldWide Web Conference pp 1007-14

Yang Q Zhang HH and Li T (2001) ` Mining Web logsfor prediction models in WWW caching andprefetchingrsquorsquo Proceedings of the ACM SIGKDDInternational Conference on Knowledge Discoveryand Data Mining pp 473-8

182

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Page 10: A prototype WWW literature recommendation system for digital libraries

The precisions and recalls underhypergraph-based recommendation areshown in Figure 4(a b) respectively

Overall the trends are the same for theassociation-rule-based and hypergraph-basedapproaches It can be clearly seen that boththe query-chosen and session-chosenmethods outperform the query-resultmethod in terms of both precision and recallThis implies that the information aboutarticles browsed plays a crucial role in making

recommendations - both the query-chosenand session-chosen methods incorporate thisinformation in forming transactions Wetherefore accept H1 However theperformance difference between thequery-chosen and session-chosen methods isnot significant

Test of H2 and H3We then conducted experiments to shed lighton the impact of not and the logarithmic

Figure 3 (a) Precisions of the association-based approach (b) recalls of the association-based approach

Figure 4 (a) Precisions and (b) recalls of the hypergraph-based approach under different transaction identification

methods

Table I Minimum support and number of recommended candidates for each transaction identification method

Minimum support (per cent) No of recommended articles

Session-chosen method 016 253

Query-chosen method 012 250

Query-result method 33 229

178

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

function in the weight definition for thehypergraph-based recommendationapproach Figure 5(a b) shows the precisionand recall values when not = 0 05 and 1 usingthe session-chosen method for transactionidentification The window size was set at 2As can be seen differences in the precisionand recall under different settings of not arevery small We have performed the sameexperiments for different transactionidentification methods and window sizes andobtained similar results H2 is thereforerejected

We then computed the precisions andrecalls with and without application of thelogarithmic function on the itemset weightFigure 6(a b) shows the resulting precisionsand recalls respectively

Figure 6 shows that applying thelogarithmic function on the itemset weightdefinition achieves significantly betterprecision and recall values This meets ourexpectation and H3 is accepted

Test of H4Finally we evaluated the impact ofassociation-rule-based and hypergraph-basedrecommendation approaches for different

window sizes Figure 7(a b) shows theprecisions and recalls for window sizes of 2 34 5 and 6 for a top-15 recommendation Itcan be seen that the hypergraph-basedapproach performs better than theassociation-rule-based approach especiallyfor larger window sizes

We also compared the running times ofboth approaches whilst setting differentminimum support thresholds Therelative performance of the two approachesunder different window sizes are shown inFigure 8(a b) Overall the running time ofthe hypergraph-based approach remainedrelatively constant not varying with changesin window size and minimum supportthreshold In contrast the running time of theassociation-rule-based approach increasedwith an increase in window size or a decreasein the minimum support This is because theassociation-rule-based approach has to searchfor the frequent itemsets that match thecurrent session As the number of frequentitemsets increase (as a result of a decreasein the minimum support) or the length ofcurrent user session increases (as a

Figure 5 (a) Impact of not on precisions and (b) impact of not on recalls of

the hypergraph-based approach using the session-chosen method for

transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact

of logarithmic function on recalls of the hypergraph-based approach using

the session-chosen method for transaction identification

179

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

result of an increased window size)the association-rule-based approach incurs alarger running time

Overall we conclude that thehypergraph-based approach is more attractivesince it yields better-quality articlerecommendation and has a more consistentrunning time Thus H4 is accepted

Conclusions

In this paper we have investigated issuesrelated to the recommendation of articles in aliterature digital library We have developed a

literature recommendation system that makes

use of the Web usage logs of a literature

digital library for making recommendations

The literature recommendation system

consists of three sequential steps(1) data preparation of the Web logs(2) usage log mining and(3) generation of article recommendations

We proposed three alternatives for identifying

transactions from Web usage logs and

discussed two approaches - association-rule

based and hypergraph based - for making

recommendations These alternatives and

approaches were evaluated using the Web

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes

(Minsup = 016 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different

Minsup values and (b) running times of the two recommendation approaches for a window size of six under

different Minsup values

180

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

usage logs of an operational electronic thesissystem at National Sun Yat-sen University Ithas been found that the query-chosen andsession-chosen methods are better fortransaction identification and that thehypergraph-based approach yieldsbetter-quality article recommendation andexhibits a more consistent running time andthus is more scalable

Our recommendation framework identifiesarticle associations present in the Web usagelogs of a digital library While this approachresults in effective recommendations it failsto recommend those independent articles thatare seldom accessed together with others Asevident from our experiments an analysis ofthe three-month collection of Web usage logsof NSYSU-ETD (from 1 February to 30April 2002) showed that only about one-tenthof the total collection are recommendable (seeTable I) To extend the scope ofrecommendable articles we are currentlyinvestigating approaches that make use ofmultiple sources when making articlerecommendations in digital libraries Onesuch source of course is the metadataalready collected by digital libraries

References

Agrawal R Imielinski T and Swami A (1993) ` Mining

association rules between sets of items in largedatabasesrsquorsquo in Buneman P and Jajodia S (Eds)

Proceedings of the ACM SIGMOD Conference onManagement of Data Washington DC May 26-28ACM Press New York NY pp 207-16

Agrawal R and Srikant R (1994) ` Fast algorithms formining association rulesrsquorsquo in Bocca JB Jarke M

and Zaniolo C (Eds) Proceedings of the 20thInternational Conference on Very Large Data Bases

September 12-15 Santiago Chile MorganKaufmann San Francisco CA pp 487-99

Alspector J Kolcz A and Karunanithi N (1998)` Comparing feature-based and clique-based usermodels for movie selectionrsquorsquo Proceedings of the 3rd

ACM International Conference on Digital LibrariesJune 23-26 1998 Pittsburgh PA ACM Press New

York NY pp 11-18Andresen D Carver L Dolin R Fischer C Frew J

Goodchild M Ibarra O Kothuri R Larsgaard MNebert D Simpson J Smith T Yang T andZheng Q (1995) ` The WWW prototype of the

Alexandria digital libraryrsquorsquo Proceedings of theInternational Symposium on Digital Libraries

Tsukuba Japan 22-5 August pp 17-27Ansari A Essegaier S and Kohli R (2000) ` Internet

recommendation systemsrsquorsquo Journal of MarketingResearch Vol 37 No 3 pp 67-85

Arms W (2000) Digital Libraries MIT Press CambridgeMA

Armstrong R Freitag D Joachims T and Mitchell T(1997) ` WebWatcher a learning apprentice for theWorld Wide Webrsquorsquo AAAI Spring Symposium onInformation Gathering from HeterogeneousDistributed Environments pp 6-12

Balabanovirsquoc M and Shoham Y (1997) ` Fabcontent-based collaborative recommendationrsquorsquoCommunications of the ACM Vol 40 No 3pp 66-72

Baldonado M Chang C Gravano L and Paepcke A(1997) ` Metadata for digital libraries architectureand design rationalersquorsquo Proceedings of the 2nd ACMInternational Conference on Digital Libraries ACMPress New York NY pp 47-56

Basu C Hirsh H and Cohen W (1998)` Recommendation as classification using social andcontent-based information in recommendationrsquorsquoProceedings of the 15th National Conference onArtificial Intelligence 26-30 July Madison WI AAAIPress Menlo Park CA pp 714-20

Billsus D and Pazzani M (1999) ` A hybrid user modelfor news story classificationrsquorsquo in Kay J (Ed)Proceedings of the 7th International Conference onUser Modelling Banff Canada 20-4 JuneSpringer-Verlag New York NY pp 99-108

Bowman C Manber P and Schwartz U (1994)` Scalable Internet resources discovery researchproblems and approachesrsquorsquo Communications of theACM Vol 37 No 8 pp 98-107

Breese J Heckerman D and Kadie C (1998) ` Empiricalanalysis of predictive algorithms for collaborativefilteringrsquorsquo Technical Report MSR-TR-98-12Microsoft Research Seattle CA

Chen H Schatz B Ng T Martinez J Kirchhoff A andLin C (1996) ` A parallel computing approach tocreating engineering concept spaces for semanticretrieval the Illinois digital library initiativeprojectrsquorsquo IEEE Transactions on PAMI Vol 18 No 8pp 17-34

Cooley R Mobasher B and Srivastava J (1999) ` Datapreparation for mining World Wide Web browsingpatternsrsquorsquo Journal of Knowledge and InformationSystems Vol 1 No 1 pp 5-32

Furner J (2002) ` On recommendingrsquorsquo Journal of theAmerican Society for Information Science andTechnology Vol 53 No 9 pp 747-63

Goldberg D Nichols D Oki B and Terry D (1992)` Using collaborative filtering to weave aninformation tapestryrsquorsquo Communications of the ACMVol 35 No 12 pp 61-70

Herlocker J and Konstan J (2001) ` Content-independent task-focused recommendationrsquorsquo IEEEInternet Computing Vol 5 No 6 pp 40-7

Karypis G (2002) ` Multilevel hypergraph partitioningrsquorsquoTech Report TR02-25 Department of ComputerScience University of Minnesota MN

Kessler J (1996) Internet Digital Libraries TheInternational Dimension Artech House PublishersNorwood MA

Konstan J Miller B Maltz D Herlocker J Gordon Land Riedl J (1997) ` GroupLens applyingcollaborative filtering to Usenet newsrsquorsquoCommunications of the ACM Vol 40 No 3pp 77-87

181

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Lang K (1995) ` Newsweeder learning to filternetnewsrsquorsquo in Prieditis A and Russell S (Eds)Proceedings of the 12th International Conference onMachine Learning Lake Tahoe Morgan KaufmannSan Francisco CA pp 331-9

Liu B Hsu W and Ma Y (1999) ` Mining associationrules with multiple minimum supportsrsquorsquo Proceedingsof the 5th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining 15-18

August San Diego ACM Press New York NYpp 430-4

Loeb S and Terry D (1992) ` Information filteringrsquorsquoCommunications of the ACM Special Issue onInformation Filtering Vol 35 No 12 pp 26-8

Mobasher B Cooley R and Srivastava J (1999)` Creating adaptive Web sites through usage-basedclustering of URLsrsquorsquo Proceedings of the IEEEKnowledge and Data Engineering ExchangeWorkshop

Mobasher B Dai H Luo T Nakagawa M andWiltshire J (2000) ` Discovery of aggregate usageprofiles for Web personalisationrsquorsquo Proceedings ofthe webKDD Workshop

Mooney R and Roy L (2000) ` Content-based bookrecommending using learning for text

categorisationrsquorsquo Proceedings of the 5th ACMConference on Digital Libraries San Antonio ACMPress New York NY pp 195-204

Pazzani M and Billsus D (1997) ` Learning and revisinguser profiles the identification of interesting Websitesrsquorsquo Machine Learning Vol 27 No 4 pp 313-31

Pazzani M (1999) ` A framework for collaborativecontent-based and demographic filteringrsquorsquo ArtificialIntelligence Review Vol 13 No 56 pp 393-408

Pennock D Horvitz E Lawrence S and Giles C (2000)` Collaborative filtering by personality diagnosis ahybrid memory- and model-based approachrsquorsquoProceedings of the 16th Conference onUncertainty in Artificial Intelligence San Francisco

30 June-3 July Morgan Kaufmann San FranciscoCA pp 473-80

Pitkow J and Pirolli P (1999) ` Mining longest repeatingsubsequences to predict World Wide Web surfingrsquorsquoProceedings of the 2nd USENIX Symposium onInternet Technologies and Systems Boulder COpp 139-150

Schafer J Konstan J and Riedl J (2001) ` E-commercerecommendation applicationsrsquorsquo Data Mining andKnowledge Discovery Vol 5 No 1 pp 10-22

Shardanand U and Maes P (1995) ` Social informationfiltering algorithms for automating `word ofmouthrsquorsquorsquo Proceedings of the Conference on HumanFactors in Computing Systems Denver CO ACMPress New York NY pp 210-17

Spink A Wilson T Ford N Foster A and Ellis D(2002) ` Information seeking and mediatedsearchingrsquorsquo Journal of The American Society ForInformation Science and Technology Vol 53 No 9pp 695-703

Srivastava J Cooley R Deshpande M and Tang P(2000) ` Web usage mining discovery andapplications of usage patterns from Web datarsquorsquoSIGKDD Explorations Vol 1 No 2 pp 12-23

Terveen L Hill W Amento B McDonald D and CreterJ (1997) ` PHOAKS a system for sharingrecommendationsrsquorsquo Communications of the ACMVol 40 No 3 pp 59-62

Wilensky R (1996) ` Toward work-centred digitalinformation servicesrsquorsquo IEEE Computer Vol 29 No 5pp 7-44

Yan T Jacobsen M Molina H and Dayal U (1996)` From user access patterns to dynamic hypertextlinkingrsquorsquo Proceedings of the 5th International WorldWide Web Conference pp 1007-14

Yang Q Zhang HH and Li T (2001) ` Mining Web logsfor prediction models in WWW caching andprefetchingrsquorsquo Proceedings of the ACM SIGKDDInternational Conference on Knowledge Discoveryand Data Mining pp 473-8

182

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Page 11: A prototype WWW literature recommendation system for digital libraries

function in the weight definition for the hypergraph-based recommendation approach. Figure 5(a, b) shows the precision and recall values when not = 0, 0.5 and 1, using the session-chosen method for transaction identification. The window size was set at 2. As can be seen, differences in the precision and recall under different settings of not are very small. We have performed the same experiments for different transaction identification methods and window sizes and obtained similar results. H2 is therefore rejected.

We then computed the precisions and recalls with and without application of the logarithmic function on the itemset weight. Figure 6(a, b) shows the resulting precisions and recalls, respectively.

Figure 6 shows that applying the logarithmic function on the itemset weight definition achieves significantly better precision and recall values. This meets our expectation, and H3 is accepted.
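The precision and recall values compared in these tests can be illustrated with a short sketch. It assumes the usual definitions for a top-N list (the share of recommended articles that were actually accessed, and the share of accessed articles that were recommended); the function name and the sample article ids are hypothetical and are not taken from the experiments.

```python
def precision_recall(recommended, relevant):
    """Return (precision, recall) for one recommendation list.

    recommended: list of article ids suggested to the user (the top-N list).
    relevant: set of article ids the user actually accessed afterwards.
    Both definitions are the standard ones and are an assumption here.
    """
    recommended_set = set(recommended)
    relevant_set = set(relevant)
    hits = len(recommended_set & relevant_set)
    precision = hits / len(recommended_set) if recommended_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: a top-5 list checked against three held-out articles.
top_n = ["a1", "a2", "a3", "a4", "a5"]
held_out = {"a2", "a5", "a9"}
p, r = precision_recall(top_n, held_out)
print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging the two values over all test sessions would give one point on a precision or recall curve of the kind plotted in the figures.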

Test of H4

Finally, we evaluated the impact of association-rule-based and hypergraph-based recommendation approaches for different window sizes. Figure 7(a, b) shows the precisions and recalls for window sizes of 2, 3, 4, 5 and 6 for a top-15 recommendation. It can be seen that the hypergraph-based approach performs better than the association-rule-based approach, especially for larger window sizes.

Figure 5 (a) Impact of not on precisions and (b) impact of not on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

Figure 6 (a) Impact of logarithmic function on precisions and (b) impact of logarithmic function on recalls of the hypergraph-based approach using the session-chosen method for transaction identification

We also compared the running times of both approaches whilst setting different minimum support thresholds. The relative performance of the two approaches under different window sizes is shown in Figure 8(a, b). Overall, the running time of the hypergraph-based approach remained relatively constant, not varying with changes in window size and minimum support threshold. In contrast, the running time of the association-rule-based approach increased with an increase in window size or a decrease in the minimum support. This is because the association-rule-based approach has to search for the frequent itemsets that match the current session. As the number of frequent itemsets increases (as a result of a decrease in the minimum support) or the length of the current user session increases (as a result of an increased window size), the association-rule-based approach incurs a larger running time.
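The dependence of the association-rule-based running time on the number of frequent itemsets and on the window size can be seen in a minimal sketch. It assumes a generic confidence-based scoring over a dictionary of frequent itemsets; the function name, the data layout and the scoring rule are illustrative assumptions and not necessarily the scheme used in the prototype.

```python
def recommend_from_itemsets(frequent_itemsets, session, window_size, top_n):
    """Rank candidate articles by scanning every frequent itemset.

    frequent_itemsets: dict mapping frozenset of article ids -> support count
        (an assumed layout; all subsets of a frequent itemset are present).
    session: ordered list of article ids accessed so far.
    """
    window = set(session[-window_size:])  # only the most recent accesses
    scores = {}
    # One pass over all frequent itemsets: the work grows as a lower
    # minimum support lets more itemsets through.
    for itemset, support in frequent_itemsets.items():
        matched = itemset & window
        if not matched:
            continue  # a larger window matches more itemsets
        antecedent_support = frequent_itemsets.get(frozenset(matched))
        if not antecedent_support:
            continue
        confidence = support / antecedent_support
        for article in itemset - window:
            scores[article] = max(scores.get(article, 0.0), confidence)
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked[:top_n]


# Hypothetical frequent itemsets with their support counts.
itemsets = {
    frozenset({"a"}): 10,
    frozenset({"b"}): 8,
    frozenset({"a", "b"}): 6,
    frozenset({"a", "c"}): 5,
    frozenset({"b", "d"}): 2,
}
wide = recommend_from_itemsets(itemsets, ["x", "a", "b"], window_size=2, top_n=2)
narrow = recommend_from_itemsets(itemsets, ["x", "a", "b"], window_size=1, top_n=1)
print(wide, narrow)
```

Every request scans the whole itemset table, which is consistent with the growth in running time described above.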

Overall, we conclude that the hypergraph-based approach is more attractive, since it yields better-quality article recommendations and has a more consistent running time. Thus, H4 is accepted.

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes (Minsup = 0.16 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different Minsup values and (b) running times of the two recommendation approaches for a window size of six under different Minsup values

Conclusions

In this paper, we have investigated issues related to the recommendation of articles in a literature digital library. We have developed a literature recommendation system that makes use of the Web usage logs of a literature digital library for making recommendations. The literature recommendation system consists of three sequential steps:
(1) data preparation of the Web logs;
(2) usage log mining; and
(3) generation of article recommendations.

We proposed three alternatives for identifying transactions from Web usage logs and discussed two approaches - association-rule-based and hypergraph-based - for making recommendations. These alternatives and approaches were evaluated using the Web usage logs of an operational electronic thesis system at National Sun Yat-sen University. It has been found that the query-chosen and session-chosen methods are better for transaction identification, and that the hypergraph-based approach yields better-quality article recommendations and exhibits a more consistent running time, and thus is more scalable.
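The three sequential steps of such a system can be sketched end to end. The sketch below is a simplified stand-in under loud assumptions: log records are hypothetical (user, timestamp in seconds, article id) tuples, transactions are cut with a plain 30-minute timeout rather than the query-chosen or session-chosen methods, and mining is reduced to counting article pairs.

```python
from collections import defaultdict
from itertools import combinations

SESSION_GAP_SECONDS = 30 * 60  # hypothetical timeout separating two visits


def prepare_transactions(log_records):
    """Step 1: group (user, timestamp, article) records into transactions."""
    by_user = defaultdict(list)
    for user, timestamp, article in log_records:
        by_user[user].append((timestamp, article))
    transactions = []
    for accesses in by_user.values():
        accesses.sort()
        current, last_time = [], None
        for timestamp, article in accesses:
            if last_time is not None and timestamp - last_time > SESSION_GAP_SECONDS:
                transactions.append(current)  # long pause: close the transaction
                current = []
            if article not in current:
                current.append(article)
            last_time = timestamp
        if current:
            transactions.append(current)
    return transactions


def mine_pairs(transactions, min_support):
    """Step 2: keep article pairs seen together in at least min_support transactions."""
    counts = defaultdict(int)
    for transaction in transactions:
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def recommend(pairs, article, top_n):
    """Step 3: rank the articles most often accessed together with `article`."""
    scores = {}
    for (left, right), n in pairs.items():
        if article == left:
            scores[right] = n
        elif article == right:
            scores[left] = n
    return sorted(scores, key=lambda a: (-scores[a], a))[:top_n]


# Hypothetical log: two users, one long pause in the first user's activity.
log = [("u1", 0, "t1"), ("u1", 60, "t2"), ("u1", 5000, "t3"),
       ("u2", 10, "t1"), ("u2", 20, "t2"), ("u2", 30, "t3")]
transactions = prepare_transactions(log)
pairs = mine_pairs(transactions, min_support=2)
print(recommend(pairs, "t1", top_n=5))
```

Pair counting is only the simplest form of usage mining; larger frequent itemsets or hypergraph partitions would replace `mine_pairs` in a fuller implementation.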

Our recommendation framework identifies article associations present in the Web usage logs of a digital library. While this approach results in effective recommendations, it fails to recommend those independent articles that are seldom accessed together with others. As evident from our experiments, an analysis of the three-month collection of Web usage logs of NSYSU-ETD (from 1 February to 30 April 2002) showed that only about one-tenth of the total collection is recommendable (see Table I). To extend the scope of recommendable articles, we are currently investigating approaches that make use of multiple sources when making article recommendations in digital libraries. One such source, of course, is the metadata already collected by digital libraries.
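The coverage limitation can be made concrete by measuring which share of a collection appears in at least one mined association. The following is a minimal sketch with hypothetical names and toy data; it is not the computation behind Table I.

```python
def recommendable_fraction(collection, associations):
    """Share of the collection that appears in at least one mined association.

    collection: iterable of all article ids held by the library.
    associations: iterable of article-id groups (for example, frequent itemsets).
    """
    collection = set(collection)
    covered = set()
    for group in associations:
        covered.update(group)
    covered &= collection  # ignore ids that are not part of the collection
    return len(covered) / len(collection) if collection else 0.0


# Hypothetical toy collection: 20 theses, only two of which are ever
# accessed together often enough to form an association.
theses = [f"t{i}" for i in range(20)]
mined = [{"t0", "t1"}]
fraction = recommendable_fraction(theses, mined)
print(f"{fraction:.0%} of the collection is recommendable")
```

Articles outside the covered set can never be recommended from usage data alone, which is why additional sources such as metadata are of interest.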

References

Agrawal R Imielinski T and Swami A (1993) ` Mining

association rules between sets of items in largedatabasesrsquorsquo in Buneman P and Jajodia S (Eds)

Proceedings of the ACM SIGMOD Conference onManagement of Data Washington DC May 26-28ACM Press New York NY pp 207-16

Agrawal R and Srikant R (1994) ` Fast algorithms formining association rulesrsquorsquo in Bocca JB Jarke M

and Zaniolo C (Eds) Proceedings of the 20thInternational Conference on Very Large Data Bases

September 12-15 Santiago Chile MorganKaufmann San Francisco CA pp 487-99

Alspector J Kolcz A and Karunanithi N (1998)` Comparing feature-based and clique-based usermodels for movie selectionrsquorsquo Proceedings of the 3rd

ACM International Conference on Digital LibrariesJune 23-26 1998 Pittsburgh PA ACM Press New

York NY pp 11-18Andresen D Carver L Dolin R Fischer C Frew J

Goodchild M Ibarra O Kothuri R Larsgaard MNebert D Simpson J Smith T Yang T andZheng Q (1995) ` The WWW prototype of the

Alexandria digital libraryrsquorsquo Proceedings of theInternational Symposium on Digital Libraries

Tsukuba Japan 22-5 August pp 17-27Ansari A Essegaier S and Kohli R (2000) ` Internet

recommendation systemsrsquorsquo Journal of MarketingResearch Vol 37 No 3 pp 67-85

Arms W (2000) Digital Libraries MIT Press CambridgeMA

Armstrong R Freitag D Joachims T and Mitchell T(1997) ` WebWatcher a learning apprentice for theWorld Wide Webrsquorsquo AAAI Spring Symposium onInformation Gathering from HeterogeneousDistributed Environments pp 6-12

Balabanovirsquoc M and Shoham Y (1997) ` Fabcontent-based collaborative recommendationrsquorsquoCommunications of the ACM Vol 40 No 3pp 66-72

Baldonado M Chang C Gravano L and Paepcke A(1997) ` Metadata for digital libraries architectureand design rationalersquorsquo Proceedings of the 2nd ACMInternational Conference on Digital Libraries ACMPress New York NY pp 47-56

Basu C Hirsh H and Cohen W (1998)` Recommendation as classification using social andcontent-based information in recommendationrsquorsquoProceedings of the 15th National Conference onArtificial Intelligence 26-30 July Madison WI AAAIPress Menlo Park CA pp 714-20

Billsus D and Pazzani M (1999) ` A hybrid user modelfor news story classificationrsquorsquo in Kay J (Ed)Proceedings of the 7th International Conference onUser Modelling Banff Canada 20-4 JuneSpringer-Verlag New York NY pp 99-108

Bowman C Manber P and Schwartz U (1994)` Scalable Internet resources discovery researchproblems and approachesrsquorsquo Communications of theACM Vol 37 No 8 pp 98-107

Breese J Heckerman D and Kadie C (1998) ` Empiricalanalysis of predictive algorithms for collaborativefilteringrsquorsquo Technical Report MSR-TR-98-12Microsoft Research Seattle CA

Chen H Schatz B Ng T Martinez J Kirchhoff A andLin C (1996) ` A parallel computing approach tocreating engineering concept spaces for semanticretrieval the Illinois digital library initiativeprojectrsquorsquo IEEE Transactions on PAMI Vol 18 No 8pp 17-34

Cooley R Mobasher B and Srivastava J (1999) ` Datapreparation for mining World Wide Web browsingpatternsrsquorsquo Journal of Knowledge and InformationSystems Vol 1 No 1 pp 5-32

Furner J (2002) ` On recommendingrsquorsquo Journal of theAmerican Society for Information Science andTechnology Vol 53 No 9 pp 747-63

Goldberg D Nichols D Oki B and Terry D (1992)` Using collaborative filtering to weave aninformation tapestryrsquorsquo Communications of the ACMVol 35 No 12 pp 61-70

Herlocker J and Konstan J (2001) ` Content-independent task-focused recommendationrsquorsquo IEEEInternet Computing Vol 5 No 6 pp 40-7

Karypis G (2002) ` Multilevel hypergraph partitioningrsquorsquoTech Report TR02-25 Department of ComputerScience University of Minnesota MN

Kessler J (1996) Internet Digital Libraries TheInternational Dimension Artech House PublishersNorwood MA

Konstan J Miller B Maltz D Herlocker J Gordon Land Riedl J (1997) ` GroupLens applyingcollaborative filtering to Usenet newsrsquorsquoCommunications of the ACM Vol 40 No 3pp 77-87

181

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Lang K (1995) ` Newsweeder learning to filternetnewsrsquorsquo in Prieditis A and Russell S (Eds)Proceedings of the 12th International Conference onMachine Learning Lake Tahoe Morgan KaufmannSan Francisco CA pp 331-9

Liu B Hsu W and Ma Y (1999) ` Mining associationrules with multiple minimum supportsrsquorsquo Proceedingsof the 5th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining 15-18

August San Diego ACM Press New York NYpp 430-4

Loeb S and Terry D (1992) ` Information filteringrsquorsquoCommunications of the ACM Special Issue onInformation Filtering Vol 35 No 12 pp 26-8

Mobasher B Cooley R and Srivastava J (1999)` Creating adaptive Web sites through usage-basedclustering of URLsrsquorsquo Proceedings of the IEEEKnowledge and Data Engineering ExchangeWorkshop

Mobasher B Dai H Luo T Nakagawa M andWiltshire J (2000) ` Discovery of aggregate usageprofiles for Web personalisationrsquorsquo Proceedings ofthe webKDD Workshop

Mooney R and Roy L (2000) ` Content-based bookrecommending using learning for text

categorisationrsquorsquo Proceedings of the 5th ACMConference on Digital Libraries San Antonio ACMPress New York NY pp 195-204

Pazzani M and Billsus D (1997) ` Learning and revisinguser profiles the identification of interesting Websitesrsquorsquo Machine Learning Vol 27 No 4 pp 313-31

Pazzani M (1999) ` A framework for collaborativecontent-based and demographic filteringrsquorsquo ArtificialIntelligence Review Vol 13 No 56 pp 393-408

Pennock D Horvitz E Lawrence S and Giles C (2000)` Collaborative filtering by personality diagnosis ahybrid memory- and model-based approachrsquorsquoProceedings of the 16th Conference onUncertainty in Artificial Intelligence San Francisco

30 June-3 July Morgan Kaufmann San FranciscoCA pp 473-80

Pitkow J and Pirolli P (1999) ` Mining longest repeatingsubsequences to predict World Wide Web surfingrsquorsquoProceedings of the 2nd USENIX Symposium onInternet Technologies and Systems Boulder COpp 139-150

Schafer J Konstan J and Riedl J (2001) ` E-commercerecommendation applicationsrsquorsquo Data Mining andKnowledge Discovery Vol 5 No 1 pp 10-22

Shardanand U and Maes P (1995) ` Social informationfiltering algorithms for automating `word ofmouthrsquorsquorsquo Proceedings of the Conference on HumanFactors in Computing Systems Denver CO ACMPress New York NY pp 210-17

Spink A Wilson T Ford N Foster A and Ellis D(2002) ` Information seeking and mediatedsearchingrsquorsquo Journal of The American Society ForInformation Science and Technology Vol 53 No 9pp 695-703

Srivastava J Cooley R Deshpande M and Tang P(2000) ` Web usage mining discovery andapplications of usage patterns from Web datarsquorsquoSIGKDD Explorations Vol 1 No 2 pp 12-23

Terveen L Hill W Amento B McDonald D and CreterJ (1997) ` PHOAKS a system for sharingrecommendationsrsquorsquo Communications of the ACMVol 40 No 3 pp 59-62

Wilensky R (1996) ` Toward work-centred digitalinformation servicesrsquorsquo IEEE Computer Vol 29 No 5pp 7-44

Yan T Jacobsen M Molina H and Dayal U (1996)` From user access patterns to dynamic hypertextlinkingrsquorsquo Proceedings of the 5th International WorldWide Web Conference pp 1007-14

Yang Q Zhang HH and Li T (2001) ` Mining Web logsfor prediction models in WWW caching andprefetchingrsquorsquo Proceedings of the ACM SIGKDDInternational Conference on Knowledge Discoveryand Data Mining pp 473-8

182

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Page 12: A prototype WWW literature recommendation system for digital libraries

result of an increased window size)the association-rule-based approach incurs alarger running time

Overall we conclude that thehypergraph-based approach is more attractivesince it yields better-quality articlerecommendation and has a more consistentrunning time Thus H4 is accepted

Conclusions

In this paper we have investigated issuesrelated to the recommendation of articles in aliterature digital library We have developed a

literature recommendation system that makes

use of the Web usage logs of a literature

digital library for making recommendations

The literature recommendation system

consists of three sequential steps(1) data preparation of the Web logs(2) usage log mining and(3) generation of article recommendations

We proposed three alternatives for identifying

transactions from Web usage logs and

discussed two approaches - association-rule

based and hypergraph based - for making

recommendations These alternatives and

approaches were evaluated using the Web

Figure 7 (a) Precisions and (b) recalls of the two recommendation approaches under different window sizes

(Minsup = 016 per cent)

Figure 8 (a) Running times of the two recommendation approaches for a window size of two under different

Minsup values and (b) running times of the two recommendation approaches for a window size of six under

different Minsup values

180

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

usage logs of an operational electronic thesissystem at National Sun Yat-sen University Ithas been found that the query-chosen andsession-chosen methods are better fortransaction identification and that thehypergraph-based approach yieldsbetter-quality article recommendation andexhibits a more consistent running time andthus is more scalable

Our recommendation framework identifiesarticle associations present in the Web usagelogs of a digital library While this approachresults in effective recommendations it failsto recommend those independent articles thatare seldom accessed together with others Asevident from our experiments an analysis ofthe three-month collection of Web usage logsof NSYSU-ETD (from 1 February to 30April 2002) showed that only about one-tenthof the total collection are recommendable (seeTable I) To extend the scope ofrecommendable articles we are currentlyinvestigating approaches that make use ofmultiple sources when making articlerecommendations in digital libraries Onesuch source of course is the metadataalready collected by digital libraries

References

Agrawal R Imielinski T and Swami A (1993) ` Mining

association rules between sets of items in largedatabasesrsquorsquo in Buneman P and Jajodia S (Eds)

Proceedings of the ACM SIGMOD Conference onManagement of Data Washington DC May 26-28ACM Press New York NY pp 207-16

Agrawal R and Srikant R (1994) ` Fast algorithms formining association rulesrsquorsquo in Bocca JB Jarke M

and Zaniolo C (Eds) Proceedings of the 20thInternational Conference on Very Large Data Bases

September 12-15 Santiago Chile MorganKaufmann San Francisco CA pp 487-99

Alspector J Kolcz A and Karunanithi N (1998)` Comparing feature-based and clique-based usermodels for movie selectionrsquorsquo Proceedings of the 3rd

ACM International Conference on Digital LibrariesJune 23-26 1998 Pittsburgh PA ACM Press New

York NY pp 11-18Andresen D Carver L Dolin R Fischer C Frew J

Goodchild M Ibarra O Kothuri R Larsgaard MNebert D Simpson J Smith T Yang T andZheng Q (1995) ` The WWW prototype of the

Alexandria digital libraryrsquorsquo Proceedings of theInternational Symposium on Digital Libraries

Tsukuba Japan 22-5 August pp 17-27Ansari A Essegaier S and Kohli R (2000) ` Internet

recommendation systemsrsquorsquo Journal of MarketingResearch Vol 37 No 3 pp 67-85

Arms W (2000) Digital Libraries MIT Press CambridgeMA

Armstrong R Freitag D Joachims T and Mitchell T(1997) ` WebWatcher a learning apprentice for theWorld Wide Webrsquorsquo AAAI Spring Symposium onInformation Gathering from HeterogeneousDistributed Environments pp 6-12

Balabanovirsquoc M and Shoham Y (1997) ` Fabcontent-based collaborative recommendationrsquorsquoCommunications of the ACM Vol 40 No 3pp 66-72

Baldonado M Chang C Gravano L and Paepcke A(1997) ` Metadata for digital libraries architectureand design rationalersquorsquo Proceedings of the 2nd ACMInternational Conference on Digital Libraries ACMPress New York NY pp 47-56

Basu C Hirsh H and Cohen W (1998)` Recommendation as classification using social andcontent-based information in recommendationrsquorsquoProceedings of the 15th National Conference onArtificial Intelligence 26-30 July Madison WI AAAIPress Menlo Park CA pp 714-20

Billsus D and Pazzani M (1999) ` A hybrid user modelfor news story classificationrsquorsquo in Kay J (Ed)Proceedings of the 7th International Conference onUser Modelling Banff Canada 20-4 JuneSpringer-Verlag New York NY pp 99-108

Bowman C Manber P and Schwartz U (1994)` Scalable Internet resources discovery researchproblems and approachesrsquorsquo Communications of theACM Vol 37 No 8 pp 98-107

Breese J Heckerman D and Kadie C (1998) ` Empiricalanalysis of predictive algorithms for collaborativefilteringrsquorsquo Technical Report MSR-TR-98-12Microsoft Research Seattle CA

Chen H Schatz B Ng T Martinez J Kirchhoff A andLin C (1996) ` A parallel computing approach tocreating engineering concept spaces for semanticretrieval the Illinois digital library initiativeprojectrsquorsquo IEEE Transactions on PAMI Vol 18 No 8pp 17-34

Cooley R Mobasher B and Srivastava J (1999) ` Datapreparation for mining World Wide Web browsingpatternsrsquorsquo Journal of Knowledge and InformationSystems Vol 1 No 1 pp 5-32

Furner J (2002) ` On recommendingrsquorsquo Journal of theAmerican Society for Information Science andTechnology Vol 53 No 9 pp 747-63

Goldberg D Nichols D Oki B and Terry D (1992)` Using collaborative filtering to weave aninformation tapestryrsquorsquo Communications of the ACMVol 35 No 12 pp 61-70

Herlocker J and Konstan J (2001) ` Content-independent task-focused recommendationrsquorsquo IEEEInternet Computing Vol 5 No 6 pp 40-7

Karypis G (2002) ` Multilevel hypergraph partitioningrsquorsquoTech Report TR02-25 Department of ComputerScience University of Minnesota MN

Kessler J (1996) Internet Digital Libraries TheInternational Dimension Artech House PublishersNorwood MA

Konstan J Miller B Maltz D Herlocker J Gordon Land Riedl J (1997) ` GroupLens applyingcollaborative filtering to Usenet newsrsquorsquoCommunications of the ACM Vol 40 No 3pp 77-87

181

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Lang K (1995) ` Newsweeder learning to filternetnewsrsquorsquo in Prieditis A and Russell S (Eds)Proceedings of the 12th International Conference onMachine Learning Lake Tahoe Morgan KaufmannSan Francisco CA pp 331-9

Liu B Hsu W and Ma Y (1999) ` Mining associationrules with multiple minimum supportsrsquorsquo Proceedingsof the 5th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining 15-18

August San Diego ACM Press New York NYpp 430-4

Loeb S and Terry D (1992) ` Information filteringrsquorsquoCommunications of the ACM Special Issue onInformation Filtering Vol 35 No 12 pp 26-8

Mobasher B Cooley R and Srivastava J (1999)` Creating adaptive Web sites through usage-basedclustering of URLsrsquorsquo Proceedings of the IEEEKnowledge and Data Engineering ExchangeWorkshop

Mobasher B Dai H Luo T Nakagawa M andWiltshire J (2000) ` Discovery of aggregate usageprofiles for Web personalisationrsquorsquo Proceedings ofthe webKDD Workshop

Mooney R and Roy L (2000) ` Content-based bookrecommending using learning for text

categorisationrsquorsquo Proceedings of the 5th ACMConference on Digital Libraries San Antonio ACMPress New York NY pp 195-204

Pazzani M and Billsus D (1997) ` Learning and revisinguser profiles the identification of interesting Websitesrsquorsquo Machine Learning Vol 27 No 4 pp 313-31

Pazzani M (1999) ` A framework for collaborativecontent-based and demographic filteringrsquorsquo ArtificialIntelligence Review Vol 13 No 56 pp 393-408

Pennock D Horvitz E Lawrence S and Giles C (2000)` Collaborative filtering by personality diagnosis ahybrid memory- and model-based approachrsquorsquoProceedings of the 16th Conference onUncertainty in Artificial Intelligence San Francisco

30 June-3 July Morgan Kaufmann San FranciscoCA pp 473-80

Pitkow J and Pirolli P (1999) ` Mining longest repeatingsubsequences to predict World Wide Web surfingrsquorsquoProceedings of the 2nd USENIX Symposium onInternet Technologies and Systems Boulder COpp 139-150

Schafer J Konstan J and Riedl J (2001) ` E-commercerecommendation applicationsrsquorsquo Data Mining andKnowledge Discovery Vol 5 No 1 pp 10-22

Shardanand U and Maes P (1995) ` Social informationfiltering algorithms for automating `word ofmouthrsquorsquorsquo Proceedings of the Conference on HumanFactors in Computing Systems Denver CO ACMPress New York NY pp 210-17

Spink A Wilson T Ford N Foster A and Ellis D(2002) ` Information seeking and mediatedsearchingrsquorsquo Journal of The American Society ForInformation Science and Technology Vol 53 No 9pp 695-703

Srivastava J Cooley R Deshpande M and Tang P(2000) ` Web usage mining discovery andapplications of usage patterns from Web datarsquorsquoSIGKDD Explorations Vol 1 No 2 pp 12-23

Terveen L Hill W Amento B McDonald D and CreterJ (1997) ` PHOAKS a system for sharingrecommendationsrsquorsquo Communications of the ACMVol 40 No 3 pp 59-62

Wilensky R (1996) ` Toward work-centred digitalinformation servicesrsquorsquo IEEE Computer Vol 29 No 5pp 7-44

Yan T Jacobsen M Molina H and Dayal U (1996)` From user access patterns to dynamic hypertextlinkingrsquorsquo Proceedings of the 5th International WorldWide Web Conference pp 1007-14

Yang Q Zhang HH and Li T (2001) ` Mining Web logsfor prediction models in WWW caching andprefetchingrsquorsquo Proceedings of the ACM SIGKDDInternational Conference on Knowledge Discoveryand Data Mining pp 473-8

182

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Page 13: A prototype WWW literature recommendation system for digital libraries

usage logs of an operational electronic thesissystem at National Sun Yat-sen University Ithas been found that the query-chosen andsession-chosen methods are better fortransaction identification and that thehypergraph-based approach yieldsbetter-quality article recommendation andexhibits a more consistent running time andthus is more scalable

Our recommendation framework identifiesarticle associations present in the Web usagelogs of a digital library While this approachresults in effective recommendations it failsto recommend those independent articles thatare seldom accessed together with others Asevident from our experiments an analysis ofthe three-month collection of Web usage logsof NSYSU-ETD (from 1 February to 30April 2002) showed that only about one-tenthof the total collection are recommendable (seeTable I) To extend the scope ofrecommendable articles we are currentlyinvestigating approaches that make use ofmultiple sources when making articlerecommendations in digital libraries Onesuch source of course is the metadataalready collected by digital libraries

References

Agrawal R Imielinski T and Swami A (1993) ` Mining

association rules between sets of items in largedatabasesrsquorsquo in Buneman P and Jajodia S (Eds)

Proceedings of the ACM SIGMOD Conference onManagement of Data Washington DC May 26-28ACM Press New York NY pp 207-16

Agrawal R and Srikant R (1994) ` Fast algorithms formining association rulesrsquorsquo in Bocca JB Jarke M

and Zaniolo C (Eds) Proceedings of the 20thInternational Conference on Very Large Data Bases

September 12-15 Santiago Chile MorganKaufmann San Francisco CA pp 487-99

Alspector J Kolcz A and Karunanithi N (1998)` Comparing feature-based and clique-based usermodels for movie selectionrsquorsquo Proceedings of the 3rd

ACM International Conference on Digital LibrariesJune 23-26 1998 Pittsburgh PA ACM Press New

York NY pp 11-18Andresen D Carver L Dolin R Fischer C Frew J

Goodchild M Ibarra O Kothuri R Larsgaard MNebert D Simpson J Smith T Yang T andZheng Q (1995) ` The WWW prototype of the

Alexandria digital libraryrsquorsquo Proceedings of theInternational Symposium on Digital Libraries

Tsukuba Japan 22-5 August pp 17-27Ansari A Essegaier S and Kohli R (2000) ` Internet

recommendation systemsrsquorsquo Journal of MarketingResearch Vol 37 No 3 pp 67-85

Arms W (2000) Digital Libraries MIT Press CambridgeMA

Armstrong R Freitag D Joachims T and Mitchell T(1997) ` WebWatcher a learning apprentice for theWorld Wide Webrsquorsquo AAAI Spring Symposium onInformation Gathering from HeterogeneousDistributed Environments pp 6-12

Balabanovirsquoc M and Shoham Y (1997) ` Fabcontent-based collaborative recommendationrsquorsquoCommunications of the ACM Vol 40 No 3pp 66-72

Baldonado, M., Chang, C., Gravano, L. and Paepcke, A. (1997), "Metadata for digital libraries: architecture and design rationale", Proceedings of the 2nd ACM International Conference on Digital Libraries, ACM Press, New York, NY, pp. 47-56.

Basu, C., Hirsh, H. and Cohen, W. (1998), "Recommendation as classification: using social and content-based information in recommendation", Proceedings of the 15th National Conference on Artificial Intelligence, 26-30 July, Madison, WI, AAAI Press, Menlo Park, CA, pp. 714-20.

Billsus, D. and Pazzani, M. (1999), "A hybrid user model for news story classification", in Kay, J. (Ed.), Proceedings of the 7th International Conference on User Modelling, Banff, Canada, 20-4 June, Springer-Verlag, New York, NY, pp. 99-108.

Bowman, C., Manber, P. and Schwartz, U. (1994), "Scalable Internet resources discovery: research problems and approaches", Communications of the ACM, Vol. 37 No. 8, pp. 98-107.

Breese, J., Heckerman, D. and Kadie, C. (1998), "Empirical analysis of predictive algorithms for collaborative filtering", Technical Report MSR-TR-98-12, Microsoft Research, Seattle, CA.

Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A. and Lin, C. (1996), "A parallel computing approach to creating engineering concept spaces for semantic retrieval: the Illinois digital library initiative project", IEEE Transactions on PAMI, Vol. 18 No. 8, pp. 17-34.

Cooley, R., Mobasher, B. and Srivastava, J. (1999), "Data preparation for mining World Wide Web browsing patterns", Journal of Knowledge and Information Systems, Vol. 1 No. 1, pp. 5-32.

Furner, J. (2002), "On recommending", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 747-63.

Goldberg, D., Nichols, D., Oki, B. and Terry, D. (1992), "Using collaborative filtering to weave an information tapestry", Communications of the ACM, Vol. 35 No. 12, pp. 61-70.

Herlocker, J. and Konstan, J. (2001), "Content-independent task-focused recommendation", IEEE Internet Computing, Vol. 5 No. 6, pp. 40-7.

Karypis, G. (2002), "Multilevel hypergraph partitioning", Tech. Report TR02-25, Department of Computer Science, University of Minnesota, MN.

Kessler, J. (1996), Internet Digital Libraries: The International Dimension, Artech House Publishers, Norwood, MA.

Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L. and Riedl, J. (1997), "GroupLens: applying collaborative filtering to Usenet news", Communications of the ACM, Vol. 40 No. 3, pp. 77-87.

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

Lang, K. (1995), "Newsweeder: learning to filter netnews", in Prieditis, A. and Russell, S. (Eds), Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, Morgan Kaufmann, San Francisco, CA, pp. 331-9.

Liu, B., Hsu, W. and Ma, Y. (1999), "Mining association rules with multiple minimum supports", Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 15-18 August, San Diego, ACM Press, New York, NY, pp. 430-4.

Loeb, S. and Terry, D. (1992), "Information filtering", Communications of the ACM, Special Issue on Information Filtering, Vol. 35 No. 12, pp. 26-8.

Mobasher, B., Cooley, R. and Srivastava, J. (1999), "Creating adaptive Web sites through usage-based clustering of URLs", Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop.

Mobasher, B., Dai, H., Luo, T., Nakagawa, M. and Wiltshire, J. (2000), "Discovery of aggregate usage profiles for Web personalisation", Proceedings of the webKDD Workshop.

Mooney, R. and Roy, L. (2000), "Content-based book recommending using learning for text categorisation", Proceedings of the 5th ACM Conference on Digital Libraries, San Antonio, ACM Press, New York, NY, pp. 195-204.

Pazzani, M. and Billsus, D. (1997), "Learning and revising user profiles: the identification of interesting Web sites", Machine Learning, Vol. 27 No. 4, pp. 313-31.

Pazzani, M. (1999), "A framework for collaborative, content-based and demographic filtering", Artificial Intelligence Review, Vol. 13 No. 5/6, pp. 393-408.

Pennock, D., Horvitz, E., Lawrence, S. and Giles, C. (2000), "Collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach", Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, San Francisco, 30 June-3 July, Morgan Kaufmann, San Francisco, CA, pp. 473-80.

Pitkow, J. and Pirolli, P. (1999), "Mining longest repeating subsequences to predict World Wide Web surfing", Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 139-150.

Schafer, J., Konstan, J. and Riedl, J. (2001), "E-commerce recommendation applications", Data Mining and Knowledge Discovery, Vol. 5 No. 1, pp. 10-22.

Shardanand, U. and Maes, P. (1995), "Social information filtering: algorithms for automating 'word of mouth'", Proceedings of the Conference on Human Factors in Computing Systems, Denver, CO, ACM Press, New York, NY, pp. 210-17.

Spink, A., Wilson, T., Ford, N., Foster, A. and Ellis, D. (2002), "Information seeking and mediated searching", Journal of the American Society for Information Science and Technology, Vol. 53 No. 9, pp. 695-703.

Srivastava, J., Cooley, R., Deshpande, M. and Tang, P. (2000), "Web usage mining: discovery and applications of usage patterns from Web data", SIGKDD Explorations, Vol. 1 No. 2, pp. 12-23.

Terveen, L., Hill, W., Amento, B., McDonald, D. and Creter, J. (1997), "PHOAKS: a system for sharing recommendations", Communications of the ACM, Vol. 40 No. 3, pp. 59-62.

Wilensky, R. (1996), "Toward work-centred digital information services", IEEE Computer, Vol. 29 No. 5, pp. 7-44.

Yan, T., Jacobsen, M., Molina, H. and Dayal, U. (1996), "From user access patterns to dynamic hypertext linking", Proceedings of the 5th International World Wide Web Conference, pp. 1007-14.

Yang, Q., Zhang, H.H. and Li, T. (2001), "Mining Web logs for prediction models in WWW caching and prefetching", Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 473-8.


Page 14: A prototype WWW literature recommendation system for digital libraries

Lang K (1995) ` Newsweeder learning to filternetnewsrsquorsquo in Prieditis A and Russell S (Eds)Proceedings of the 12th International Conference onMachine Learning Lake Tahoe Morgan KaufmannSan Francisco CA pp 331-9

Liu B Hsu W and Ma Y (1999) ` Mining associationrules with multiple minimum supportsrsquorsquo Proceedingsof the 5th ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining 15-18

August San Diego ACM Press New York NYpp 430-4

Loeb S and Terry D (1992) ` Information filteringrsquorsquoCommunications of the ACM Special Issue onInformation Filtering Vol 35 No 12 pp 26-8

Mobasher B Cooley R and Srivastava J (1999)` Creating adaptive Web sites through usage-basedclustering of URLsrsquorsquo Proceedings of the IEEEKnowledge and Data Engineering ExchangeWorkshop

Mobasher B Dai H Luo T Nakagawa M andWiltshire J (2000) ` Discovery of aggregate usageprofiles for Web personalisationrsquorsquo Proceedings ofthe webKDD Workshop

Mooney R and Roy L (2000) ` Content-based bookrecommending using learning for text

categorisationrsquorsquo Proceedings of the 5th ACMConference on Digital Libraries San Antonio ACMPress New York NY pp 195-204

Pazzani M and Billsus D (1997) ` Learning and revisinguser profiles the identification of interesting Websitesrsquorsquo Machine Learning Vol 27 No 4 pp 313-31

Pazzani M (1999) ` A framework for collaborativecontent-based and demographic filteringrsquorsquo ArtificialIntelligence Review Vol 13 No 56 pp 393-408

Pennock D Horvitz E Lawrence S and Giles C (2000)` Collaborative filtering by personality diagnosis ahybrid memory- and model-based approachrsquorsquoProceedings of the 16th Conference onUncertainty in Artificial Intelligence San Francisco

30 June-3 July Morgan Kaufmann San FranciscoCA pp 473-80

Pitkow J and Pirolli P (1999) ` Mining longest repeatingsubsequences to predict World Wide Web surfingrsquorsquoProceedings of the 2nd USENIX Symposium onInternet Technologies and Systems Boulder COpp 139-150

Schafer J Konstan J and Riedl J (2001) ` E-commercerecommendation applicationsrsquorsquo Data Mining andKnowledge Discovery Vol 5 No 1 pp 10-22

Shardanand U and Maes P (1995) ` Social informationfiltering algorithms for automating `word ofmouthrsquorsquorsquo Proceedings of the Conference on HumanFactors in Computing Systems Denver CO ACMPress New York NY pp 210-17

Spink A Wilson T Ford N Foster A and Ellis D(2002) ` Information seeking and mediatedsearchingrsquorsquo Journal of The American Society ForInformation Science and Technology Vol 53 No 9pp 695-703

Srivastava J Cooley R Deshpande M and Tang P(2000) ` Web usage mining discovery andapplications of usage patterns from Web datarsquorsquoSIGKDD Explorations Vol 1 No 2 pp 12-23

Terveen L Hill W Amento B McDonald D and CreterJ (1997) ` PHOAKS a system for sharingrecommendationsrsquorsquo Communications of the ACMVol 40 No 3 pp 59-62

Wilensky R (1996) ` Toward work-centred digitalinformation servicesrsquorsquo IEEE Computer Vol 29 No 5pp 7-44

Yan T Jacobsen M Molina H and Dayal U (1996)` From user access patterns to dynamic hypertextlinkingrsquorsquo Proceedings of the 5th International WorldWide Web Conference pp 1007-14

Yang Q Zhang HH and Li T (2001) ` Mining Web logsfor prediction models in WWW caching andprefetchingrsquorsquo Proceedings of the ACM SIGKDDInternational Conference on Knowledge Discoveryand Data Mining pp 473-8

182

Prototype WWW literature recommendation system for digital libraries

San-Yih Hwang Wen-Chiang Hsiung and Wan-Shiou Yang

Online Information Review

Volume 27 Number 3 2003 169-182

