
WEB MINING: A ROADMAP
Magdalini Eirinaki, Dept. of Informatics, Athens University of Economics and Business

CHAPTER 1 Introduction – The three axes of Web Mining 1.1 WWW Impact

The World Wide Web has grown in the past few years from a small research network into the biggest and most popular medium for communication and information dissemination. Every day, the WWW grows by roughly a million electronic pages, adding to the hundreds of millions already on-line. The WWW serves as a platform for exchanging various kinds of information, ranging from research papers and educational content to multimedia content and software.

The continuous growth in the size and use of the WWW calls for new methods for processing these huge amounts of data. Because of its rapid and chaotic growth, the resulting network of information lacks organization and structure. Moreover, the content is published in diverse formats. As a result, users sometimes feel disoriented, lost in an information overload that continues to expand.

Issues that must be addressed include the detection of relevant information, involving the searching and indexing of Web content; the creation of meta-knowledge out of the information available on the Web; and the addressing of individual users’ needs and interests by personalizing the provided information and services.

Web mining is a very broad research area that has emerged to solve the issues arising from the WWW phenomenon. Web mining research converges from several research communities, such as Databases, IR and AI. In this work we overview the most important issues of each of the three axes of Web mining, namely Web structure, Web content and Web usage mining. We also make a prediction concerning the future of Web mining: the combination of the methods used in all three categories, towards the Semantic Web vision.

1.2 Web data

Web data are those that can be collected and used in the context of Web personalization. These data are classified in four categories according to [SC+00]:


• Content data are presented to the end-user appropriately structured. They can be simple text, images, or structured data, such as information retrieved from databases.

• Structure data represent the way content is organized. They can be either data entities used within a Web page, such as HTML or XML tags, or data entities used to put a Web site together, such as hyperlinks connecting one page to another.

• Usage data represent a Web site’s usage, such as a visitor’s IP address, time and date of access, complete path (files or directories) accessed, referrers’ address, and other attributes that can be included in a Web access log.

• User profile data provide information about the users of a Web site. A user profile contains demographic information for each user of a Web site, as well as information about users’ interests and preferences. Such information is acquired through registration forms or questionnaires, or can be inferred by analyzing Web usage logs.

1.3 The three axes of Web Mining
Web mining research is a converging area from several research communities, such as databases, IR, machine learning and NLP. Even though it is strongly related to data mining, it is not equivalent to it. Web mining processes Web data, which come in various categories and formats; this is what makes the combination of techniques from the research areas mentioned above essential. Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services [E96]. In the literature, three main axes of Web mining have been identified, according to the Web data used as input in the data mining process, namely Web structure, Web content and Web usage mining. The goal of Web structure mining is to categorize Web pages and generate information such as the similarity and relationships between them, taking advantage of their hyperlink topology. In recent years, the area of Web structure mining has focused on the identification of authorities, i.e. pages that are considered important sources of information by many people in the Web community. Web content mining deals with transforming the information (content) available on the Web into more structured forms, as well as indexing it so that information can be located easily. Web content may be unstructured (plain text), semi-structured (HTML documents), or structured (extracted from databases into dynamic Web pages). Such dynamic data cannot be indexed and constitute what is called “the hidden Web”. A research area closely related to content mining is text mining. Web usage mining is the process of identifying browsing patterns by analyzing users’ navigational behavior. This process takes as input the usage data, i.e. the data residing in the Web server logs, recording the visits of the users to a Web site. Extensive research in the area of Web usage mining led to the appearance of a related research area, that of Web personalization. Web personalization utilizes the results produced by Web usage mining in order to dynamically provide recommendations to each user.


Even though there exist three distinct axes in Web mining research, the distinctions between them are not clear-cut [KB00]. Web content mining might utilize the text contained in links, or even the text surrounding them, while in Web structure mining the same information can be used for different purposes. Moreover, Web usage mining can be enhanced if the content semantics are taken into consideration. Most research efforts nowadays propose systems or algorithms that combine methods from these Web mining categories. Therefore, Web mining moves to a more abstract level, where data representation is achieved using semantics. These semantics are defined using tools that emerged along with the Semantic Web vision, such as XML, RDF and, most importantly, ontologies.
1.4 Paper Outline
In the rest of this work we overview in more detail the three axes of Web mining. In Chapter 2, we examine the role of hyperlinks in Web searching, concentrating on the link analysis research area. A brief description of the two most important algorithms, PageRank and HITS, along with the search engines in which they are employed, is then given. Finally, we also examine how hyperlink information is taken into consideration during the Web structure mining process. Related research efforts are reported throughout the Chapter. The Web content mining research area is analyzed in Chapter 3. A brief discussion concerning data preprocessing methods is followed by an overview of the most popular Web document representation models, namely the Vector Space Model and Support Vector Machines. A quite detailed survey of Web document clustering research approaches follows, and Web document classification techniques are also described. In Chapter 4 we discuss the most important issues in the area of Web usage mining. The format of the input data is described, along with the preprocessing techniques applied to them. Next, the most important Web usage mining algorithms employed during the pattern discovery phase are presented, followed by an overview of the pattern analysis phase and the emerging area of Web personalization. Finally, we present the most important research initiatives of the area. In Chapter 5, we overview the application of Markov models in the Web usage mining and personalization areas. We describe some preliminaries on modeling Web usage using Markov models, and then refer to the most important research initiatives. Finally, in Chapter 6, we give an overview of the ways the three axes of Web mining can be combined, as one step towards the creation of the Semantic Web. We describe the most important structures used in that context, namely XML and RDF, as well as ontologies. More specifically, we give a categorization of ontologies, as well as different ways of using them. Next, we present in more detail the way content and structure information can be combined, as well as usage and content semantics. We also overview some initial projects dealing with Semantic Web sites based on ontologies. For all these research directions, we also present the most important initiatives. Finally, we provide an overview of the most important commercial products of the area.


CHAPTER 2 Web Structure Mining 2.1 The Web is a Graph

What differentiates the World Wide Web from other document collections is that the Web, besides the documents it contains, is also strongly characterized by the hyperlinks that interconnect them. Therefore, the Web is a graph. More precisely, it is a directed labeled graph whose nodes are the documents and whose edges are the hyperlinks between them. Several researchers have tried to analyse the properties of this graph. One of the latest studies [BK+00] suggests that the structure of the Web graph looks like a giant bow tie. Roughly speaking, this bow tie has a strongly connected core component (SCC) of 56 million pages in the middle, and two others with 44 million pages each on the sides, one containing pages pointing to the SCC (the IN set), and one with pages pointed to by the SCC (the OUT set). Additionally, there exist some “tubes” connecting the IN and OUT sets directly. Finally, there also exist some smaller components (groups of pages) that cannot be reached from any other point of this structure. It is therefore evident that the Web is a huge structure, growing rapidly. This network of information lacks organization and structure, and is only held together by the hyperlinks.
2.2 The role of hyperlinks in web searching
In order to make navigation in this chaotic structure easier, people use search engines, trying to focus their search by querying with specific terms/keywords. In the beginning, when the amount of information contained in the Web had not yet reached such proportions, search engines used manually-built lists covering popular topics. They maintained an index containing, for every word, a list of all Web pages containing this word. This index was then used to answer the users’ queries. However, after a few years, when the Web grew to include millions of pages, the manual maintenance of such indices became very expensive. Automated search engines relying on keyword matching return results including hundreds (or more) of Web pages, most of them of low quality. The need to somehow rank the importance and relevance of the results was more than evident. To handle this problem, search engines such as AltaVista, Lycos, Infoseek, HotBot and Excite use some simple heuristics in order to accomplish a page ranking. Such heuristics take into consideration the number of times a term appears in the document, whether it appears at the beginning of the text, or in areas considered more important (such as headings, italics, etc.).

However, Web designers soon tried to take advantage of this way of ranking by including words or phrases many times, or in invisible places, so that their pages are favored in the search engine’s ranking. This practice is called spamming, and it became one reason why search engines tried to find more clever ways to rank Web pages.

The main breakthrough in the research area of searching the Web came with the realization that the characterization as well as the assessment of a page, i.e. how important it is considered, is enhanced if we take into account how many people consider it important, and how they characterize it. This is performed by using the link structure of the Web, taking into consideration the incoming links to a page. The most important algorithms based on this idea are PageRank [BP98] and HITS [K99].
2.3 Link Analysis – Co-citations, Hubs & Authorities, PageRank
The intuition behind link analysis is that a Web page is popular/important if many other pages point to it. Moreover, it is considered very important if it is pointed to by reliable pages. Conversely, a page that points to other pages is reliable if these pages are important.
2.3.1 Citation Analysis
Link analysis has close ties to social networks and citation analysis, the study of the co-citations occurring between scientific papers. The best-known measure of a publication’s importance is the “impact factor” [G72], developed by Eugene Garfield. This metric takes into account the number of citations received by a publication: the impact factor is proportional to the number of citations a publication has. This measure counts all references equally. However, it is evident that some “important” citations should be given additional weight. The problem is to define what is “important”. Pinski et al. [PN76] overcame this by developing a model for computing the equilibrium of what they defined as “influence weights”. The weight of each publication is equal to the sum of its citations, scaled by the importance of these citations. On the Web, the notion of citations corresponds to the links pointing to a Web page. The most simplified ranking of a Web page could be accomplished by summing up the number of links pointing to it. However, this approach favors the most popular Web sites, such as universally known portals, news pages etc. Moreover, the diversity of the content and its quality on the Web should also be taken into consideration. In scientific literature all publications have a certain standard and their value is measured by their impact on the scientific community; usually the co-citations occur within closed networks of knowledge. On the other hand, on the Web there exist vast amounts of information serving different purposes. What is more, a commonly observed phenomenon is that links between competitors, i.e. Web pages referring to the same subject with conflicting goals, usually do not exist. What interconnects such sites are other Web sites serving as indices.

The metric proposed by Pinski et al. influenced Brin and Page to develop PageRank [BP98], the algorithm behind the most popular search engine, Google [Google]. On the other hand, the notion of pages important in terms of content (authorities), as well as of important pages serving as indices (hubs), was introduced in the HITS algorithm [K99], developed by Kleinberg. This algorithm supports another prototype search engine, Clever [CD+99].

Page 6: WEB MINING: A ROADMAP - University of Albertagolmoham/SW/web mining 23Jan2008/WEB...1 WEB MINING: A ROADMAP Magdalini Eirinaki Dept. of Informatics Athens University of Economics and

6

2.3.2 PageRank
PageRank [BP98] was developed by Brin and Page during their PhD studies at Stanford University. The algorithm is influenced by citation analysis, considering the incoming links as citations to Web pages. However, simply applying citation analysis techniques to the diverse set of Web documents would not yield equally good results. Therefore, PageRank provides a more sophisticated way to compute the importance of a Web page than simply counting the number of pages that link to it (known as “backlinks”). If a backlink comes from an “important” Web page, then it weighs more than backlinks from minor pages. Intuitively, “a page has high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page has many backlinks and when a page has few highly ranked backlinks” [BP98]. In other words, we may consider a link from one page to another as a vote; however, not only the number of votes a page receives is considered important, but also the “importance” of the pages that cast these votes. The first thing that must be computed is the number of links pointing to every Web page. This is not known a priori, therefore a technique based on random walks on graphs is employed. Intuitively, it models the behaviour of a “random surfer”. The “random surfer” visits a page and then follows its links to other pages equiprobably. In the steady state each page has a “visit rate”, which is what defines its importance. However, the surfer may fall into loops (i.e. pages pointing to each other), or may visit a page with no outgoing links (“dangling links”). To handle these situations, the surfer follows one of the outgoing links equiprobably with probability d, whereas with probability 1 – d the surfer jumps to another (random) page. Therefore, the surfer never gets stuck locally, and there is a long-term rate at which every page is visited. The following equation calculates a page’s PageRank: PR(A) = (1 – d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn)), where t1 … tn are the pages linking to page A, C(ti) is the number of outbound links of page ti, and d is a damping factor, usually set to 0.85. In other words, a page “votes” an amount of PageRank (a little less than its own) onto each page that it links to, and this value is shared equally between all the pages that it links to. Therefore, the PageRank of a page linking to another one is important, yet the total number of outlinks contained in that page is also taken into consideration: the more links the page contains, the less PageRank value each linked page receives. The algorithm runs iteratively a few times before reaching equilibrium.
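To make the iteration above concrete, the following is a minimal, illustrative sketch of the PageRank computation over a small hypothetical link graph; the graph, the damping factor d = 0.85 and the iteration count are arbitrary choices for demonstration, not part of the original system.

```python
# Minimal, illustrative PageRank iteration over a hypothetical toy graph.
# graph maps each page to the list of pages it links to ("outlinks").
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],   # D has one outlink; nothing links to D
}

def pagerank(graph, d=0.85, iterations=50):
    pages = list(graph)
    pr = {p: 1.0 for p in pages}          # initial rank for every page
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # sum PR(t)/C(t) over all pages t that link to p
            incoming = sum(pr[t] / len(graph[t])
                           for t in pages if p in graph[t])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

print(pagerank(graph))   # pages with more or better backlinks end up with higher rank
```

Note that this follows the unnormalized form of the formula quoted above; other presentations of PageRank divide the (1 – d) term by the total number of pages.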

2.3.3 Hubs & Authorities – The HITS algorithm
Kleinberg [K99] also identified a form of equilibrium between Web sources on a common topic. He identified two different forms of Web pages: hubs and authorities. Authorities are pages bearing important content. Hubs are pages that act as resource lists, directing users to authorities. Thus, a good hub page for a subject points to many authoritative pages on that subject, and a good authority page is pointed to by many good hub pages on the same subject. We should stress that a page might be a good hub and a good authority at the same time. This circular relationship leads to the definition of an iterative algorithm, HITS. HITS assigns two numbers, a hub weight H(p) and an authority weight A(p), to each page p. These weights are initially set to 1 and are iteratively updated using the following formulas:

H_{i+1}(p) = Σ_{(p,q)} A_i(q)
A_{i+1}(p) = Σ_{(q,p)} H_i(q),

where (p, q) denotes that there is a hyperlink from page p to page q. Therefore, a page’s authority weight is proportional to the sum of the hub weights of the pages that link to it. Similarly, a page’s hub weight is proportional to the sum of the authority weights of the pages that it links to [K99b]. The algorithm converges after a few iterations. However, since HITS performs iterative computations on the query result, it is difficult for it to meet the real-time constraints of an on-line search engine.
2.3.4 Search Engines
As already mentioned, PageRank is the supporting algorithm of the most popular search engine at this time, Google, while HITS is used by Clever. Google and Clever have two main differences. Firstly, Google is query-independent, whereas Clever computes the rankings according to the query terms. Secondly, Google looks only in the forward direction, whereas Clever also takes the backward direction into consideration, leading through its computation to the creation of Web communities. Therefore, Google works well on answering specific queries, whereas HITS works well on answering broad-topic queries.

To be more specific, in Google, when a query is given, all pages matching the query (i.e. containing the search terms) are first retrieved, but are presented to the end user ranked according to their PageRank. Therefore, the order is query-independent. Of course, PageRank is not the only algorithm behind Google. Even though it is not clearly stated, a number of heuristics are also used to support the ranking of the results presented to the end user. Such heuristics, based on hyperlink information, will be discussed in the following section.

Unlike Google, Clever is query-dependent. Given a query, it creates a root set containing all the pages retrieved (200–1000 nodes). All pages that are pointed to by, or point to, a page in the root set are then added, finally creating the base set (about 5000 nodes). After a few iterations the algorithm converges, creating thematically coherent clusters of information on the same subject.

These “hyperlinked communities that appear to span a wide range of interests and disciplines” [GK98] are also called “Web communities” and the process of identifying them is called “trawling” [KR+99]. For more details on Web communities and trawling, please refer to the Web document clustering section.
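Before moving on to hyperlink text, here is a minimal sketch of the iterative hub/authority computation that HITS (and hence Clever) performs over a base set; the toy link graph and iteration count are illustrative, and the weights are normalized at every step, as is commonly done, so that they remain bounded.

```python
from math import sqrt

# Hypothetical base set: each page maps to the pages it links to.
links = {
    "p1": ["p3", "p4"],
    "p2": ["p3"],
    "p3": ["p1"],
    "p4": [],
}

def hits(links, iterations=20):
    hub = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}
    for _ in range(iterations):
        # authority weight: sum of hub weights of the pages linking to p
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
        # hub weight: sum of authority weights of the pages p links to
        hub = {p: sum(auth[q] for q in links[p]) for p in links}
        # normalize both weight vectors so they stay bounded
        for w in (auth, hub):
            norm = sqrt(sum(v * v for v in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return hub, auth

hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))   # most authoritative pages first
```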


2.4 Hyperlink information

Many researchers have suggested solutions to the problems of searching, indexing, or querying the Web, taking into account its structure as well as the meta-information included in the hyperlinks and the text surrounding them [FS92, MB94, WV+96, HSD97, CD+99, PW00, HG+02, VV+03].

The underlying idea is that many Web pages do not include words that are descriptive of their content (for example, a portal Web site rarely includes the word “portal” in its home page), and there exist Web pages which contain very little text (such as image, music, or video resources), making text-based search techniques difficult. However, how others characterize such a page may be useful. This “characterization” is included in the text that surrounds the hyperlinks pointing to the page.

Both aforementioned search engines use PageRank and HITS respectively in combination with other heuristics incorporating this idea. Brin and Page [BP98] have also stressed the importance of incorporating anchor information when dealing with pages that cannot be indexed by text-based search engines, associating the text of a link with the page the link appears on, as well as the page the link points at.

Chakrabarti et al. define this as the “anchor-window” [CD+99]. If a text descriptive of a topic occurs around the link pointing to a page from a good hub, then this reinforces the belief that this page probably is a good authority on the topic. In their experiments they use a window of 50 bytes on either side of the link. The amount of topic-related text in that anchor-window is incorporated as a positive numerical weight to the link when the hubs and authorities scores are computed.

Investigating similarity search on the Web, Haveliwala et al. [HG+02] found that the anchor-based strategy requires fewer citations than the link-based strategy, since it contributes more terms. After experimenting with different anchor-window widths, they concluded that large fixed anchor-windows of 23 words (approximately 150 bytes) give the best results. On the other hand, Varlamis et al. [VV+03], when enhancing the information of Web pages with link semantics, use an anchor-window of 100 bytes, trimmed whenever certain HTML tags appear, so that the resulting mean number of keywords is approximately five, incorporating the results of [PW00].

We may therefore conclude that even though the anchor-window size may vary according to the specific application it is used in, the information that is contained in it has proven to be very useful and has been increasingly used by researchers in the Web mining and related areas.
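As a rough illustration of the anchor-window idea, the sketch below extracts a fixed number of characters around each hyperlink in an HTML string. The window size, the regular-expression parsing and the sample page are simplifications of my own; the systems cited above use proper HTML parsing and application-specific window tuning.

```python
import re

# Illustrative only: grab a fixed window of text around each <a href=...> link.
ANCHOR = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)

def anchor_windows(html, window=50):
    """Return (target_url, surrounding_text) pairs for every link in `html`."""
    results = []
    for match in ANCHOR.finditer(html):
        url, anchor_text = match.group(1), match.group(2)
        start, end = match.start(), match.end()
        before = html[max(0, start - window):start]
        after = html[end:end + window]
        # strip any markup left in the window for a crude text-only view
        context = re.sub(r"<[^>]+>", " ", before + anchor_text + after)
        results.append((url, " ".join(context.split())))
    return results

page = '... a well known <a href="http://example.org">search portal</a> for daily news ...'
print(anchor_windows(page, window=30))
```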


CHAPTER 3 Web Content Mining 3.1 Searching the Web
In the past few years, the WWW has become the largest data repository, containing hundreds of millions of documents in several semi-structured or structured formats, pictures, and other multimedia files. As its size grows exponentially, so does its usage, and more and more people use it as their main source of information. The existence of an abundance of information, in combination with the dynamic and heterogeneous nature of the Web, makes information retrieval a very difficult task for the average user. Web content mining provides methods enabling the automated discovery, retrieval, organization, and management of the vast amount of information and resources available on the Web. Cooley et al. [CMS97] categorize the main research efforts in the area of content mining into two approaches, the Information Retrieval (IR) approach and the Database (DB) approach. The IR approach involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user, to discover and organize Web-based information. On the other hand, the Database (DB) approach focuses on techniques for integrating and organizing the heterogeneous and semi-structured data on the Web into more structured and higher-level collections of resources, using standard database querying mechanisms and data mining techniques to access and analyze this information. Web content mining is nowadays strongly interrelated with Web structure mining, since the two are usually used in combination for extracting and organizing information from the Web. In this section we focus on the purely content-oriented areas of Web content mining, dealing with representing, processing and organizing the content retrieved from the WWW in order to assist users in their search.
3.2 Data Preprocessing

Web content mining is strongly related to the domain of text mining, since in order to process and organize Web pages their content must first be appropriately processed to extract properties of interest. These selected properties are subsequently used to represent the documents and assist the clustering or classification processes. Based on [RV03], we discriminate four stages of data preprocessing, based on techniques used in text mining, namely Data Selection, Filtering, Cleaning, and Representation. Data selection involves the identification and retrieval of textual data from relevant data sources. During data selection, exogenous information, usually represented by descriptive metadata such as keywords attached to the document, is used. Data selection as defined in text mining cannot be applied to Web documents, since no metadata are attached to them (except for the rare case of Semantic Web documents), and therefore it will not be discussed further.


On the other hand, data filtering processes the endogenous information, usually called “implicit metadata”, to identify document relevance. At this stage NLP techniques are used, and the main issue to be addressed is that of language identification. The goal of data cleaning is to remove noise (errors, inconsistencies and outlier values) from the data to improve its quality. When processing heterogeneous data sources, data integration has to be applied first. The most important step of data preprocessing is the data representation stage. The content should be transformed into a normalized representation. This representation is usually called the feature vector, comprising the most important attributes selected to represent the content. An issue that has to be dealt with is that currently available algorithms do not cope efficiently with high-dimensional vector spaces, making data reduction techniques essential.

Another issue during this stage is semantic analysis. Semantic analysis deals mainly with the problems of synonymy (different names for the same concept) and polysemy (different concepts having the same name). Research in the area of “Word Sense Disambiguation” (WSD) has dealt with this problem (for an overview see [VI98]). Word sense disambiguation is achieved by assigning words to appropriate concepts. The mapping from words to concepts should be done in a reliable way, depending on the relations between the words under examination. In the Semantic Web context, concepts are organized in ontologies, providing means to express the inter-concept relationships and connections to a vocabulary. Ontologies and their use in Web mining are discussed in Chapter 6. Data representation methods are examined in more detail in the next section.
3.3 Web document representation models

In order to reduce the complexity of the documents and make them easier to handle during the clustering and/or classification processes, one should first choose the type of characteristics or attributes (e.g. words, phrases, or links) of the documents that are of importance, and how these should be represented. Once documents are represented in a uniform way, the similarity between two documents can be easily calculated. The simplest case is the bag-of-words representation; however, more sophisticated representations incorporating weighting of terms and relations between them are usually used.

The most commonly used model in clustering is the Vector Space Model [SWY75], whereas in classification it is Support Vector Machines [V95]. We examine these two models in more detail in the following sections.
3.3.1 Vector Space Model
In the Vector Space Model, each document is represented as a feature vector, whose length is equal to the number of unique document attributes in the collection. Each component of that vector has a weight indicating the importance of the corresponding attribute in the characterization of the document. Usually, these attributes are terms that are extracted from the document using IR techniques.


The phase of extracting the terms that characterize a document is called document indexing. Next, in the term weighting phase, these terms are given weights indicating their significance in the characterization of the document. These weights may be binary, indicating the existence (1) or absence (0) of the term in the document. However, it is more common to use the frequency of occurrence of the term in the document (raw term frequency), or an algorithm belonging to the Tf*Idf family [SB98]. Raw term frequency is based on term statistics within a document and is the simplest way of assigning weights to terms. Tf*Idf is a measure used in collections of documents that favors terms that are frequent in relevant documents, but infrequent in the collection as a whole. Tf stands for the frequency of occurrence of the term in the document, and Idf is the inverse document frequency of the term in the whole collection: Idf = log(N/nk), where nk is the number of documents including the term and N is the total number of documents. After the term weighting has taken place, a similarity measure must be chosen for the calculation of the similarity between two documents (or clusters). Considering that each document is now represented by a weighted vector, the similarity can be found, in the simplest way, by calculating their inner product; however, this similarity measure is rarely used on its own. The most popular similarity measure is the Cosine Coefficient, which measures the cosine of the angle between the two feature vectors. Other measures are the Jaccard Coefficient and the Dice Coefficient, all normalized versions of the simple matching coefficient. More on similarity measures may be found in [R79, SJM00].
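As a small illustration of the weighting and similarity computation just described, the sketch below builds Tf*Idf vectors for a tiny hypothetical collection and compares documents with the cosine coefficient. The tokenization and the plain log(N/nk) idf are simplifications; real systems add stemming, stop-word removal and smoothing.

```python
import math
from collections import Counter

docs = [
    "web mining roadmap web structure content usage",
    "clustering of web documents using the vector space model",
    "support vector machines for text classification",
]

def tfidf_vectors(docs):
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    # nk: number of documents containing each term
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```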

3.3.2 Support Vector Machines (SVM)
The SVM method was introduced by Vapnik [V95] and was further examined by Joachims [J99, J01]. SVMs have proven to be fast and effective classifiers for text documents and address the problem of dimensionality, since instead of restricting the number of features they use a refined structure which does not necessarily depend on the dimensionality of the input space.

The idea is to separate the vector space with a hyperplane in such a way that members of different classes are separated as well as possible. The data points that are closest to the hyperplane, and therefore define it, are the support vectors. Thanks to the kernel mapping, calculations only involve inner products, which is efficient. Since SVMs do not lose their efficiency or ability to generalize as the number of input features grows, they are an ideal model for document classification, using all the words in a text directly as features.

Joachims [J99] showed that SVMs perform substantially better at classifying text documents into topic categories than the best-performing conventional methods used for document classification (naïve Bayes, Rocchio, decision trees, k-NN). A brief description of SVMs can also be found in [P03].
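A minimal sketch of linear-SVM topic classification in the spirit of [J99], assuming the scikit-learn library is available; the tiny labeled corpus and the two categories are made up purely for illustration.

```python
# Sketch only: assumes scikit-learn is installed; the toy corpus is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = [
    "stock market shares rise on earnings",
    "team wins championship final match",
    "shares fall as market dips",
    "player scores twice in cup match",
]
train_labels = ["finance", "sports", "finance", "sports"]

# Words are used directly as (tf-idf weighted) features, as discussed above.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

print(model.predict(["market earnings report", "cup final tonight"]))
```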


3.4 Web document clustering (from [OV03]) 3.4.1 Steps of the clustering process

Clustering is one of the main data analysis techniques and deals with the organization of a set of objects in a multidimensional space into cohesive groups, called clusters. Many uses of clustering as part of the Web Information Retrieval process have been proposed in the literature. Firstly, based on the cluster hypothesis, clustering can increase the efficiency and the effectiveness of the retrieval. Furthermore, clustering can be used as a very powerful mechanism for browsing a collection of documents (e.g. scatter/gather) or for presenting the results of the retrieval (e.g. suffix tree clustering). Finally, other applications of clustering include query expansion, tracing of similar documents and ranking of the retrieval results. As it was already mentioned, in order to cluster documents, one must first choose the type of the characteristics or attributes (e.g. words, phrases or links) of the documents on which the clustering will be based and their representation. The most commonly used model is the Vector Space Model described in a previous section. Clustering is then performed using as input the vectors that represent the documents and a Web document clustering algorithm. The existing web document clustering algorithms differ in many parts, such as the types of attributes they use to characterize the documents, the similarity measure used, the representation of the clusters etc. Based on the characteristics or attributes of the documents that are used by the clustering algorithm, the different approaches can be categorized into i. text-based, in which the clustering is based on the content of the document, ii. link-based, based on the link structure of the pages in the collection and iii. hybrid ones, which take into account both the content and the links of the document. 3.4.2 Web document clustering algorithms 3.4.2.1 Text based Clustering

The text-based Web document clustering approaches characterize each document according to its content, i.e. the words (or phrases, or snippets) contained in it. The basic idea is that if two documents contain many common words, then it is very likely that the two documents are similar.

The approaches in this category can be further categorised according to the clustering method used into the following categories: partitional, hierarchical, graph-based, neural network-based and probabilistic algorithms. Furthermore, according to the way a clustering algorithm handles uncertainty in terms of cluster overlapping, an algorithm can be either crisp (or hard), which considers non-overlapping partitions, or fuzzy (or soft), with which a document can be classified to more than one cluster. Most of the existing algorithms are crisp, meaning that a document either belongs to a cluster or not. In the paragraphs that follow we present the main text-based document clustering approaches, their characteristics and the representative algorithms of each category.

Partitional Clustering. The partitional or non-hierarchical document clustering approaches attempt a flat partitioning of a collection of documents into a predefined number of disjoint clusters. More specifically, these algorithms produce an integer number of partitions that optimize a certain criterion function (e.g. maximize the sum of the average pairwise intra-cluster similarities).

Partitional clustering algorithms are divided into iterative or reallocation methods and single-pass methods. Most of them are iterative, and the single-pass methods are usually used at the beginning of a reallocation method. The most common partitional clustering algorithm is k-means, which relies on the idea that the center of the cluster, called the centroid, can be a good representation of the cluster. The algorithm starts by selecting k cluster centroids. Then the cosine distance between each document in the collection and the centroids is calculated, and each document is assigned to the cluster with the nearest centroid. Then the new cluster centroids are recalculated, and the procedure runs iteratively until some criterion is reached. Other partitional clustering algorithms are the single-pass method and the nearest neighbor algorithm. Many variations of the k-means algorithm have also been proposed, e.g. ISODATA and bisecting k-means [JMF99, SKK00].
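A minimal sketch of the k-means loop just described, applied to already-vectorized documents: documents are assigned to the most similar centroid (by cosine similarity) and centroids are recomputed as cluster means. The toy vectors, k and iteration count are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def kmeans(doc_vectors, k=2, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(doc_vectors, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # pick k initial centroids
    for _ in range(iterations):
        # assign every document to the most similar centroid
        labels = [max(range(k), key=lambda c: cosine_sim(x, centroids[c])) for x in X]
        # recompute each centroid as the mean of its cluster
        for c in range(k):
            members = X[[i for i, l in enumerate(labels) if l == c]]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels

# Toy tf-idf-like vectors for six "documents"
docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 1], [0, 0.8, 1.2], [1, 0.2, 0], [0, 1, 0.9]]
print(kmeans(docs, k=2))
```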

Hierarchical Clustering. Hierarchical clustering algorithms produce a sequence of nested partitions. Usually the similarity between each pair of documents is stored in an n×n similarity matrix. At each stage, the algorithm either merges two clusters (agglomerative methods) or splits a cluster in two (divisive methods). The result of the clustering can be displayed in a tree-like structure, called a dendrogram, with one cluster at the top containing all the documents of the collection and many clusters at the bottom with one document each. By choosing the appropriate level of the dendrogram we get a partitioning into as many clusters as we wish.

Almost all the hierarchical algorithms used for document clustering are agglomerative (HAC). A typical HAC algorithm starts by assigning each document in the collection to a single cluster. The similarity between all pairs of clusters is computed and stored in a similarity matrix. Then, the two most similar (closest) clusters are merged and the similarity matrix is updated to reflect the change in the similarity between the new cluster and the original clusters. This process is repeated until only one cluster remains or until a threshold is reached. The hierarchical agglomerative clustering methods differ in the way they calculate the similarity between two clusters. The existing methods are the single link method [R79, S73, R92, V86], the complete link method [D77, V86], the group average method [V86, SKK00], Ward’s method [EW86], and centroid/median methods (a small sketch using an off-the-shelf implementation is given after the graph-based discussion below).

Graph based clustering. The documents to be clustered can be viewed as a set of nodes, and the edges between the nodes represent the relationship between them. The edges bear a weight, which denotes the degree of that relationship. Graph-based algorithms rely on graph partitioning, that is, they identify the clusters by cutting edges from the graph such that the edge-cut, i.e. the sum of the weights of the edges that are cut, is minimized. Since each edge in the graph represents the similarity between the documents, by cutting the edges with the minimum sum of weights the algorithm minimizes the similarity between documents in different clusters. The basic idea is that the weights of the edges in the same cluster will be greater than the weights of the edges across clusters. Hence, the resulting clusters will contain highly related documents. The most important graph-based algorithms are Chameleon [KHK99], Association Rule Hypergraph Partitioning (ARHP) [BG+99] and the one proposed by Dhillon [D01].
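Returning to the agglomerative procedure described above, hierarchical clustering is also available off the shelf; the following sketch (assuming SciPy is installed) builds a group-average linkage over made-up document vectors and cuts the resulting dendrogram into a chosen number of clusters.

```python
# Sketch only: assumes SciPy is installed; the document vectors are made up.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

doc_vectors = np.array([
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.9],
    [0.1, 0.8, 1.0],
])

# 'average' = group average method; cosine distance suits tf-idf style vectors
Z = linkage(doc_vectors, method="average", metric="cosine")

# Cut the dendrogram to obtain a flat partition into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2]
```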


Neural Network based Clustering. Kohonen’s Self-Organizing Feature Maps (SOM) [K95] is a widely used unsupervised neural network model. It consists of two layers: the input layer with n input nodes, which correspond to the n documents, and an output layer with k output nodes, which correspond to k decision regions. The input units receive the input data and propagate them onto the output units. Each of the k output units is assigned a weight vector. During each learning step, a document from the collection is associated with the output node which has the most similar weight vector. The weight vector of that ‘winner’ node is then adapted in such a way that it becomes even more similar to the vector that represents that document. The output of the algorithm is the arrangement of the input documents in a 2-dimensional space in such a way that the similarity between the input documents is mirrored in terms of topographic distance between the k decision regions. Another approach proposed in the literature is the hierarchical feature map [M98] model, which is based on a hierarchical organization of more than one self-organizing feature map.

Fuzzy Clustering. All the above approaches produce clusters in such a way that each document is assigned to one and only one cluster. Fuzzy clustering approaches, on the other hand, are non-exclusive, in the sense that each document can belong to more than one cluster. Fuzzy algorithms usually try to find the best clustering by optimising a certain criterion function. The fact that a document can belong to more than one cluster is described by a membership function. The membership function calculates for each document a membership vector, in which the i-th element indicates the degree of membership of the document in the i-th cluster. The most widely used fuzzy clustering algorithm is Fuzzy c-means [BEF84], a variation of the partitional k-means algorithm. Another fuzzy approach is the Fuzzy Clustering and Fuzzy Merging algorithm (FCFM) [L99].

Probabilistic Clustering. Another way of dealing with uncertainty is to use probabilistic clustering algorithms. These algorithms use statistical models to calculate the similarity between the data instead of some predefined measures. The basic idea is the assignment of probabilities for the membership of a document in a cluster. Each document can belong to more than one cluster according to the probability of belonging to each cluster. Probabilistic clustering approaches are based on finite mixture modeling [EH81]. Two widely used probabilistic algorithms are Expectation Maximization (EM) and AutoClass [CS96]. The output of the probabilistic algorithms is the set of distribution function parameter values and the probability of membership of each document in each cluster.

Text-based clustering approaches were developed for use in small, static and homogeneous collections of documents. In contrast, the Web is a very large collection of heterogeneous and interconnected Web pages. Moreover, Web pages have additional information attached to them (Web document metadata, hyperlinks) that can be very useful for clustering. The link-based document clustering approaches characterize the documents by information extracted from the link structure of the collection. The underlying idea is that when two documents are connected via a link, there exists a semantic relationship between them, which can be the basis for the partitioning of the collection into clusters. The use of the link structure for the clustering of a collection is based on citation analysis from the field of bibliometrics. Link-based clustering is an area where Web content and Web structure mining overlap. Therefore, certain issues discussed in the Web structure mining Chapter won’t be further analyzed here.

Botafogo et al. [BS91, B93] proposed a graph-theoretic algorithm that finds strongly connected components in a hypertext’s graph structure. The algorithm uses a compactness measure, which indicates the interconnectedness of the hypertext and is a function of the average link distance between the hypertext nodes. Another link-based algorithm was proposed by Larson [L96], who applied co-citation analysis to a collection of Web documents.

Finally, another interesting approach to clustering of Web pages is trawling [KR+99], which clusters related pages on the Web in order to discover new emerging cyber-communities that have not yet been identified by large Web directories. The underlying idea in trawling is that these related pages are very frequently cited together even before their creators realize that they have created a community. Furthermore, based on Kleinberg’s idea, trawling assumes that these communities consist of mutually reinforcing hubs and authorities. So, trawling combines the idea of co-citation and HITS to discover clusters. Based on the above assumptions, Web communities are characterized by dense directed bipartite subgraphs [1]. These graphs, which are the signatures of Web communities, contain at least one core, i.e. a complete directed bipartite graph with a minimum number of nodes. Trawling aims at discovering these cores and then applies graph-based algorithms to discover the clusters.
3.4.2.3 Hybrid Approaches

The link-based document clustering approaches described above characterize a document solely by the information extracted from the link structure of the collection, just as the text-based approaches characterize documents only by the words they contain. Although a link can be seen as a recommendation of the creator of one page to another page, links are not always intended to indicate similarity. Furthermore, these algorithms may suffer from poor or very dense link structures. On the other hand, text-based algorithms have problems when dealing with different languages or with particularities of a language (synonyms, homonyms etc.). Also, Web pages contain other forms of information besides text, such as images or multimedia. As a consequence, hybrid document clustering approaches have been proposed in order to combine the advantages and limit the disadvantages of the two approaches.

Pirolli et al. [PPR96] described a method that represents the pages as vectors containing information from the content, the linkage, the usage data and the meta-information attached to each document. The ‘content-link clustering’ algorithm, which was proposed by Weiss et al. [WV+96], is a hierarchical agglomerative clustering algorithm that uses the complete link method and a hybrid similarity measure. Finally, another hybrid text- and link-based clustering approach is the toric k-means algorithm, proposed by Modha & Spangler [MS00]. The algorithm starts by gathering the results returned to a user’s query from a search engine and expands the set by including the Web pages that are linked to the pages in the original set. Modha & Spangler also provide a scheme for presenting the contents of each cluster to the users by describing various aspects of the cluster.

[1] A bipartite graph is a graph whose node set can be partitioned into two sets N1 and N2. Each directed edge in the graph is directed from a node in N1 to a node in N2.


3.4.2.4 Conclusions
Clustering is a very complex procedure, as it depends on the collection on which it is applied as well as on the choice of the various parameter values. Hence, a careful selection of these is crucial to the success of the clustering. Furthermore, the development of link-based clustering approaches has proven that links can be a very useful source of information for the clustering process.

Although much research has already been conducted in the field of Web document clustering, it is clear that there are still some open issues that call for more research. These include the achievement of better quality-complexity tradeoffs, as well as efforts to deal with each method’s disadvantages. In addition, another very important issue is incrementality, since Web pages change very frequently and new pages are constantly added to the Web. Also, the fact that a Web page very often relates to more than one subject should be considered, leading to more algorithms that allow for overlapping clusters. Finally, more attention should also be given to the description of the clusters’ contents to the users, the labeling issue.
3.5 Web document classification techniques (from [H02])
Web document classification involves assigning Web documents to one of several pre-defined categories. To achieve this, the input documents are characterized by a set of attributes, usually called features. Unlike Web document clustering, which involves unsupervised learning, classification needs a training set of data with pre-assigned class labels (supervised machine learning). The objective of classification is to analyze the input data and develop an accurate model for each class using these features. New documents are then classified into one of these classes.

In the case of text classification, the attributes are words contained in the text documents. Feature (attribute) selection is widely used prior to machine learning to reduce the feature space, as the number of features under consideration might otherwise become prohibitive.

In general, we distinguish between rule-based classifiers (rules are constructed manually, and the resulting set of rules is difficult to modify) and inductive-learning classifiers. Classifiers based on inductive learning are constructed using labeled training data; they are easy to construct and update, and do not require rule-writing skills. In this section, based on the work of Hynek [H02], we focus on the inductive-learning approach to classifier construction.

Besides document categorization, we can come across the issue of web page and link classification, as introduced by Haas and Grams [HG98], with useful applications in searching and authoring.

3.5.1 Existing Document Classification Methods

An interesting survey of five (supervised learning) document classification algorithms is presented by Dumais et al. [DPH98], focusing mainly on the promising Support Vector Machines (SVM) method; kNN, Naïve Bayes, Bayesian Networks, and Decision Trees are also discussed. Another detailed test of text categorization methods is presented by Yang and Liu [YL99], discussing various algorithms such as SVM, kNN, LLSF (linear least-squares fit), NNet and Naïve Bayes. Here we overview the most important document classification methods.

3.5.1.1 K-Nearest Neighbor (KNN)

This method’s principle is to classify a new document by finding the most similar documents in the training set. Methods utilizing this principle are sometimes called “memory based learning” methods. Tf*Idf term weights are used, and similarity is computed between test examples and category centroids. The weight assigned to a term is a combination of its weight in an original query and in the judged relevant and irrelevant documents. The similarity between two documents is usually measured using cosine similarity. In general this algorithm is easy to interpret and performs very well. On the other hand, it does not reduce the feature space, and cannot do off-line preprocessing.

The decision tree model consists of a series of simple decision rules, often presented in the form of a graph. Decision trees are one of the most popular machine learning techniques currently used. Several methods have been proposed for inducing decision trees, including the CART algorithm, the ID3 algorithm, and its more recent version, C4.5. A decision tree is a probabilistic classifier – confidence(class) represents a probability distribution. Decision trees are easy to interpret; however, they require a number of model parameters which are usually hard to determine, and error estimates are difficult.

Naïve Bayes is a probabilistic classifier too. It is constructed from the training data and estimates the probability of each class given the document feature values (words) of a new instance; Bayes’ theorem is used to estimate these probabilities. It is based on a simplifying assumption (conditional independence of words), yet it works well in practice even when this independence assumption does not hold.
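A minimal multinomial Naïve Bayes sketch showing how the class probabilities are estimated from word counts in the training data (with add-one smoothing). The toy corpus and labels are made up; a real implementation would work with much larger vocabularies and more careful feature handling.

```python
import math
from collections import Counter, defaultdict

train = [("cheap pills buy now", "spam"),
         ("meeting agenda attached", "ham"),
         ("buy cheap meds", "spam"),
         ("project meeting tomorrow", "ham")]

class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for label in class_docs:
        # log P(class), estimated from class frequencies in the training data
        score = math.log(class_docs[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            # log P(word | class), add-one smoothed; words assumed independent
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("cheap meeting pills"))
```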

3.5.1.4 Unrestricted Bayesian classifier

Unlike the Naïve Bayes classifier, in this case the assumption of word independence is not made. Its alternative, the semi-naïve Bayesian classifier, iteratively joins pairs of attributes to relax the strongest independence assumptions. Its implementation is simple and its results are easy to interpret. On the other hand, because it assumes conditional dependence among words, its computation has exponential complexity.

3.5.1.5 Neural Networks (perceptrons)

Using this method, a separate neural network is constructed per category, learning a non-linear mapping from input words (or more complex features, such as itemsets) to a category. Its design is easy to modify, and various models can be constructed quickly and flexibly. The output model, however, does not provide any clear interpretation. Moreover, the training cost is high (more time-consuming than the other classifiers).


3.5.1.6 Linear SVM

As already mentioned in the relevant section, an SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin. The margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples. The SVM (optimization) problem is to find the decision surface that maximizes the margin between the data points in a training set. SVMs show good generalization performance on a wide variety of classification problems, achieve high classification accuracy, and are fast both to train and to classify new instances. However, not all problems are linearly separable.

3.5.2 Assessment of Classification Algorithms

Classification algorithms are evaluated in terms of speed and accuracy. The speed of a classifier must be assessed separately for two different tasks: learning (training the classifier) and classification of new instances. Many evaluation criteria have been proposed for this purpose; precision and recall are mentioned most often. The break-even point is proposed by Dumais et al. [DPH98] as an average of precision and recall.
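The following minimal sketch (illustrative only; the toy predictions are assumptions, not data from the cited studies) shows how precision, recall, and a break-even point reported as their average can be computed for a single category.

```python
def precision_recall(y_true, y_pred, positive="relevant"):
    """Compute precision and recall for one category from parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example (hypothetical labels).
y_true = ["relevant", "relevant", "other", "relevant", "other", "other"]
y_pred = ["relevant", "other", "other", "relevant", "relevant", "other"]
p, r = precision_recall(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} break-even (avg)={(p + r) / 2:.2f}")
```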

Decision thresholds in classification algorithms can be modified in order to produce higher precision (at the cost of lower recall), or vice versa – as appropriate for different applications. In the case of mono-classification, some researchers report an error rate measure, which is the percentage of documents misclassified.

It is important to note that a classifier's performance largely depends on how the corpus is split into training and testing data. Testing a classifier on the same data it was trained on often leads to significantly (and misleadingly) better results. Another problem with evaluating classifiers is their domain dependence: each classifier has a particular sub-domain for which it is most reliable [OKA01]. In order to overcome this issue, multiple learned classifiers can be combined to obtain more accurate classification. Separating the training data into subsets where classifiers either succeed or fail to make predictions was used in Schapire's Boosting algorithm [S90].


CHAPTER 4 Web Usage Mining (from [EV03])

4.1 Identifying navigational patterns

The users' activity when browsing through Web sites is registered in these sites' Web logs. Considering the average number of visits to a medium-sized Web site per day, we can presume that the amount of information hidden in the site's Web logs is huge, yet meaningless if it is not appropriately processed. By processing these data, either using simple statistical methods or using more complicated data mining techniques, we can identify interesting trends and patterns concerning the activity in the Web site. Site administrators can then use this information to redesign or customize the Web site according to the interests and behavior of its visitors, or to improve the performance of their systems, whereas the managers of a commercial site can acquire valuable business intelligence, creating consumer profiles and achieving market segmentation. Furthermore, this knowledge can be used to automatically or semi-automatically adjust the content of the site to the needs of specific groups of users, i.e. to personalize the site.

The process of analyzing the users' browsing behavior is called Web usage mining. It can be regarded as a three-phase process, consisting of the data preparation, pattern discovery and pattern analysis phases [SC+00]. In the first phase, Web data are preprocessed in order to identify users, sessions, pageviews, and so on. The input data are mainly the hits registered in the Web usage logs of the site, sometimes combined with other information such as registered user profiles, referrer logs, cookies, etc. In the second phase, statistical methods, as well as data mining methods (such as association rules, sequential pattern discovery, clustering, and classification), are applied in order to detect interesting patterns. These patterns are stored so that they can be further analyzed in the third phase of the Web usage mining process. In this Chapter we overview the Web usage mining area, analyzing the various methods used to preprocess, mine and analyze the data. We also provide an overview of one of the most important applications of Web usage mining, namely Web personalization, along with the research initiatives in these areas. This work is part of the survey on Web usage mining and personalization that can be found in [EV03].

4.2 Input data – Web usage logs

Each access to a Web page is recorded in the access log of the Web server that hosts it. The entries of a Web log file consist of fields that follow a predefined format. The fields of the common log format are:

remotehost rfc931 authuser date "request" status bytes

where remotehost is the remote hostname or IP number if the DNS hostname is not available; rfc931, the remote log name of the user; authuser, the username with which the user has authenticated himself, available when using password-protected WWW pages; date, the date and time of the request; "request", the request line exactly as it came from the client (the file, the name, and the method used to retrieve it); status, the HTTP status code returned to the client, indicating whether the file was successfully retrieved and, if not, what error message was returned; and bytes, the content-length of the documents transferred. If any of the fields cannot be determined, a minus sign (-) is placed in this field.
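As an illustration, the following is a minimal sketch of parsing one entry in the common log format described above; the regular expression and the sample log line are assumptions for demonstration, not an official parser.

```python
import re

# One hypothetical entry in common log format.
line = '192.168.1.10 - alice [23/Jan/2008:10:15:32 +0200] "GET /papers/index.html HTTP/1.0" 200 2326'

CLF_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

match = CLF_PATTERN.match(line)
if match:
    entry = match.groupdict()
    # Split the request line into method, URI and protocol.
    entry["method"], entry["uri"], entry["protocol"] = entry["request"].split()
    print(entry["remotehost"], entry["date"], entry["method"], entry["uri"], entry["status"])
```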

W3C [W3Clog] presented an improved format for Web server log files, called the “extended” log file format, partially motivated by the need to support the collection of data for demographic analysis and for log summaries. This format permits customized log files to be recorded in a format readable by generic analysis tools. The main extension to the common log format is that a number of fields are added to it. The most important are: referrer, which is the URL the client was visiting before requesting that URL, user_agent, which is the software the client claims to be using, and cookie, in the case where the site visited uses cookies.

In general, the extended log format consists of a list of prefixes such as c (client), s (server), r (remote), cs (client to server), sc (server to client), sr (server to remote server, used by proxies), rs (remote server to server, used by proxies), x (application-specific identifier), and a list of identifiers such as date, time, ip, dns, bytes, cached (records whether a cache hit occurred), status, comment (comment returned with the status code), method, uri, uri-stem and uri-query. Using a combination of the aforementioned prefixes and identifiers, additional information such as referrers' IPs or keywords used in search engines can be stored. Apart from Web server logs, which are the main source of information, usage data can also be acquired from proxy server logs, browser logs, user profiles, registration data, cookies, mouse clicks, etc.

4.3 Data Preprocessing

Some important technical issues must be taken into consideration during this phase in the context of the Web personalization process, since Web log data need to be prepared and preprocessed before they can be used in the subsequent phases of the process. An extensive description of data preparation and preprocessing methods can be found in [CMS99]. In the sequel, we provide a brief overview of the most important ones.

The first issue in the preprocessing phase is data preparation. Depending on the application, Web log data may need to be cleaned of entries involving pages that returned an error or accesses to graphics files. Furthermore, crawler activity can be filtered out, because such entries do not provide useful information about the site's usability. Another problem concerns caching. Accesses to cached pages are not recorded in the Web log, and therefore such information is missed. Caching depends heavily on the client-side technologies used and therefore cannot be dealt with easily; in such cases, cached pages can usually be inferred using the referrer information in the logs. Moreover, it is useful to perform pageview identification, i.e. to determine which page file accesses contribute to a single pageview. Again, such a decision is application-oriented.
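A minimal sketch of such cleaning, applied to entries parsed as in the earlier example, might look as follows; the file-extension list, the bot keywords, and the field names are illustrative assumptions rather than a standard.

```python
# Hypothetical parsed log entries: dicts with 'uri', 'status', and 'user_agent' fields.
GRAPHICS_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")
BOT_KEYWORDS = ("bot", "crawler", "spider")

def is_relevant(entry):
    """Keep only successful page requests made by (presumably) human visitors."""
    if not entry["status"].startswith("2"):                  # drop errors and redirects
        return False
    if entry["uri"].lower().endswith(GRAPHICS_EXTENSIONS):   # drop graphics/support files
        return False
    agent = entry.get("user_agent", "").lower()
    if any(word in agent for word in BOT_KEYWORDS):          # drop known crawler activity
        return False
    return True

def clean_log(entries):
    return [e for e in entries if is_relevant(e)]

entries = [
    {"uri": "/index.html", "status": "200", "user_agent": "Mozilla/4.0"},
    {"uri": "/logo.gif", "status": "200", "user_agent": "Mozilla/4.0"},
    {"uri": "/missing.html", "status": "404", "user_agent": "Mozilla/4.0"},
    {"uri": "/index.html", "status": "200", "user_agent": "Googlebot/2.1"},
]
print(clean_log(entries))   # only the first entry survives
```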

Most important of all is the user identification issue. There are several ways to identify individual visitors. The most obvious solution is to assume that each IP address (or each IP address/client agent pair) identifies a single visitor. Nonetheless, this is not very accurate because, for example, a visitor may access the Web from different computers, or many users may share the same IP address (if a proxy is used). A further assumption can then be made, that consecutive accesses from the same host during a certain time interval come from the same user. More accurate approaches for a priori identification of unique visitors are the use of cookies or similar mechanisms, or the requirement for user registration. However, a potential problem with such methods is the reluctance of users to share personal information.

Assuming a user has been identified, the next step is to perform session identification by dividing the clickstream of each user into sessions. The usual solution in this case is to set a minimum timeout and assume that consecutive accesses within it belong to the same session, or to set a maximum timeout, where two consecutive accesses that exceed it are assigned to different sessions.
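Below is a minimal sketch of the timeout-based heuristic just described, grouping requests from the same (IP, user agent) pair into sessions whenever the gap between consecutive requests stays below a threshold; the 30-minute value and the record format are illustrative assumptions.

```python
SESSION_TIMEOUT = 30 * 60  # assumed inter-request gap threshold, in seconds

def sessionize(requests):
    """requests: list of (ip, user_agent, timestamp_seconds, uri) tuples, assumed sorted by time."""
    sessions = []
    current = {}   # (ip, agent) -> index of that visitor's open session
    for ip, agent, ts, uri in requests:
        key = (ip, agent)
        if key in current and ts - sessions[current[key]][-1][0] <= SESSION_TIMEOUT:
            sessions[current[key]].append((ts, uri))
        else:
            sessions.append([(ts, uri)])
            current[key] = len(sessions) - 1
    return [[uri for _, uri in s] for s in sessions]

requests = [
    ("10.0.0.1", "Mozilla", 0,    "/index.html"),
    ("10.0.0.1", "Mozilla", 120,  "/papers.html"),
    ("10.0.0.2", "Mozilla", 200,  "/index.html"),
    ("10.0.0.1", "Mozilla", 4000, "/contact.html"),   # gap > 30 minutes: new session
]
print(sessionize(requests))
# [['/index.html', '/papers.html'], ['/index.html'], ['/contact.html']]
```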

4.4 Pattern discovery: Web usage-mining algorithms

4.4.1 Log Analysis

Log analysis tools (also called traffic analysis tools) take as input raw Web data and process them in order to extract statistical information. Such information includes statistics on the site activity (such as total number of visits, average number of hits, successful/failed/redirected/cached hits, average view time, and average length of a path through the site), diagnostic statistics (such as server errors and page not found errors), server statistics (such as top pages visited, entry/exit pages, and single-access pages), referrer statistics (such as top referring sites, search engines, and keywords), user demographics (such as top geographical locations and most active countries/cities/organizations), client statistics (the visitor's Web browser, operating system, and cookies), and so on.

Some tools also perform clickstream analysis, which refers to identifying paths through the site followed by individual visitors by grouping together consecutive hits from the same IP, or include limited low-level error analysis, such as detecting unauthorized entry points or finding the most common invalid URL. These statistics are usually output to reports and can also be displayed as diagrams.

This information is used by administrators for improving the system performance, facilitating the site modification task, and providing support for marketing decisions [SC+00]. However, most advanced Web mining systems further process this information to extract more complex observations that convey knowledge, utilizing data mining techniques such as association rules and sequential pattern discovery, clustering, and classification. These techniques are described in more detail in the next paragraph.

4.4.2 Web Usage Mining

Log analysis is regarded as the simplest method used in the Web usage mining process. The purpose of Web usage mining is to apply statistical and data mining techniques to the preprocessed Web log data in order to discover useful patterns. As mentioned before, the most common and simplest method that can be applied to such data is statistical analysis. More advanced data mining methods and algorithms, tailored appropriately for use in the Web domain, include association rules, sequential pattern discovery, clustering, and classification.

Association rule mining is a technique for finding frequent patterns, associations, and correlations among sets of items. Association rules are used in order to reveal correlations between pages accessed together during a server session. Such rules indicate the possible relationships between pages that are often viewed together even if they are not directly connected, and can reveal associations between groups of users with specific interests. Aside from being exploited for business applications, such observations can also be used as a guide for Web site restructuring, for example by adding links that interconnect pages often viewed together, or as a way to improve the system's performance through prefetching of Web data.
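As a simple illustration of this idea (a minimal sketch, not a full Apriori implementation), the following counts page pairs that co-occur in the same session and reports rules whose support and confidence exceed assumed thresholds.

```python
from itertools import combinations
from collections import Counter

def pairwise_rules(sessions, min_support=0.3, min_confidence=0.6):
    """Derive simple 'A -> B' rules from page pairs co-occurring in sessions."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = set(session)
        page_count.update(pages)
        pair_count.update(combinations(sorted(pages), 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Toy sessions (hypothetical pageview sequences).
sessions = [["/index", "/papers", "/contact"],
            ["/index", "/papers"],
            ["/index", "/products"],
            ["/papers", "/contact"]]
for lhs, rhs, s, c in pairwise_rules(sessions):
    print(f"{lhs} -> {rhs}  support={s:.2f} confidence={c:.2f}")
```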

Sequential pattern discovery is an extension of association rule mining in that it reveals patterns of co-occurrence that incorporate the notion of time sequence. In the Web domain such a pattern might be a Web page or a set of pages accessed immediately after another set of pages. Using this approach, useful user trends can be discovered, and predictions concerning visit patterns can be made.

Clustering is used to group together items that have similar characteristics. In the context of Web mining, we can distinguish two cases, user clusters and page clusters. Page clustering identifies groups of pages that seem to be conceptually related according to the users’ perception. User clustering results in groups of users that seem to behave similarly when navigating through a Web site. Such knowledge is used in e-commerce in order to perform market segmentation but is also helpful when the objective is to personalize a Web site.
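For instance, user clustering can be sketched as follows: sessions are turned into binary page-visit vectors and grouped with k-means (here via scikit-learn, an assumed dependency; the page list and the number of clusters are illustrative).

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy sessions (hypothetical): which pages each user visited.
sessions = [["/sports", "/scores"], ["/sports", "/teams"],
            ["/finance", "/stocks"], ["/stocks", "/markets"]]

# Build a binary user-by-page matrix from the sessions.
pages = sorted({p for s in sessions for p in s})
X = np.array([[1 if p in s else 0 for p in pages] for s in sessions])

# Group users with similar navigation behaviour into two clusters (assumed k=2).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for session, label in zip(sessions, kmeans.labels_):
    print(label, session)
```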

Classification is a process that maps a data item into one of several predetermined classes. In the Web domain classes usually represent different user profiles and classification is performed using selected features that describe each user’s category. The most common classification algorithms are decision trees, naïve Bayesian classifier, neural networks, and so on.

Other methods also exist for extracting usage patterns from Web logs. The most important one is the use of Markov models, an approach we examine in detail in Chapter 5. Other approaches rely on mathematical frameworks, introduce fuzziness into the production of recommendations, or model navigation sequences in a tree-like structure.

4.4.3 Pattern analysis

After discovering patterns from usage data, further analysis has to be conducted. The exact methodology to be followed depends on the technique previously used. The most common ways of analyzing such patterns are either using a query mechanism on a database where the results are stored, or loading the results into a data cube and then performing OLAP operations. Additionally, visualization techniques are used for easier interpretation of the results. By combining these results with content and structure information concerning the Web site, useful knowledge can be extracted for modifying the site according to the correlation between user and content groups.


4.4.4 Web personalization

Web personalization is defined as any action that adapts the information or services provided by a Web site to the needs of a user or a set of users, taking advantage of the knowledge gained from the users' navigational behavior and individual interests, in combination with the content and the structure of the Web site [EVV03].

The steps of a Web personalization process include: (a) the collection of Web data, (b) the modelling and categorization of these data (pre-processing phase), (c) the analysis of the collected data, and (d) the determination of the actions that should be performed. The methods employed to analyse the collected data include content-based filtering, collaborative filtering, rule-based filtering and Web usage mining. The site is personalized through the highlighting of existing hyperlinks, the dynamic insertion of new hyperlinks that seem to be of interest to the current user, or even the creation of new index pages. Most of the research efforts in Web personalization correspond to the evolution of extensive research in Web usage mining.

4.6 Research Initiatives

Recently, many research projects have been dealing with the Web usage mining and Web personalization areas. Most of the efforts focus on extracting useful patterns and rules using data mining techniques in order to understand the users' navigational behavior, so that decisions concerning site restructuring or modification can then be made by humans. In several cases, a recommendation engine helps the user navigate through a site. Some of the more advanced systems provide much more functionality, introducing the notion of adaptive Web sites and providing means of dynamically changing a site's structure. In the sequel we provide a brief description of the most important research efforts in the Web mining and personalization domain, as overviewed in [EV03] by Eirinaki and Vazirgiannis, as well as some research initiatives proposed in the last two years, which illustrate the current trends in the area.

One of the earliest attempts to take advantage of the information that can be gained by exploring a visitor's navigation through a Web site resulted in Letizia [L95], a client-side agent that monitors the user's browsing behavior and searches for potentially interesting pages to recommend. The agent looks ahead at the neighbouring pages using a best-first search augmented by heuristics that infer user interest from the user's navigational behavior, and offers suggestions.

An approach for automatically classifying a Web site’s visitors according to their access patterns is presented in the work of Yan et al. [YZ+96]. The model they propose consists of two modules; an offline module that performs cluster analysis on the Web logs and an online module aiming at dynamic link generation. Every user is assigned to a single cluster based on his current traversal patterns. The authors have implemented the offline module (Analog) and have given a brief description of the way the online module should function.

One of the most popular systems from the early days of Web usage mining is WebWatcher [JF+97]. The system starts by profiling the user, acquiring information about her interests. Each time the user requests a page, this information is routed through a proxy server in order to easily track the user session across the Web site, and any links believed to be of interest to the user are highlighted. Its strategy for giving advice is learned from feedback on earlier tours.

A similar system is Personal WebWatcher [M99], which is structured to specialize for a particular user, modeling his interests. It records only the addresses of the pages requested by the user and highlights interesting hyperlinks, without involving the user in its learning process.

Chen et al. [CPY96] introduce the “maximal forward reference” concept in order to characterize user episodes for the mining of traversal patterns. Their work is based on statistically dominant paths and association rules discovery, and a maximal forward reference is defined as the sequence of pages requested by a user up to the last page before backtracking.
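The sketch below illustrates one simple way to extract maximal forward references from a single user's click sequence (an illustrative reading of the definition above, not the exact algorithm of [CPY96]): whenever the user backtracks to a page already in the current forward path, the path accumulated so far is emitted.

```python
def maximal_forward_references(clicks):
    """Split one user's click sequence into maximal forward paths (backtracks start a new path)."""
    references = []
    path = []
    moving_forward = True
    for page in clicks:
        if page in path:                     # backward reference: user returned to an earlier page
            if moving_forward and len(path) > 1:
                references.append(list(path))
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:                                # forward reference: extend the current path
            path.append(page)
            moving_forward = True
    if moving_forward and len(path) > 1:
        references.append(path)
    return references

# A, B, C, back to B, then D: yields the forward paths A-B-C and A-B-D.
print(maximal_forward_references(["A", "B", "C", "B", "D"]))
# [['A', 'B', 'C'], ['A', 'B', 'D']]
```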

The SpeedTracer project [WYB98] is built on the work proposed in [CPY96]. SpeedTracer uses the referrer page and the URL of the requested page as a traversal step and reconstructs the user traversal paths for session identification. Each identified user session is mapped into a transaction and then data mining techniques are applied in order to discover the most frequent user traversal paths and the most frequently visited groups of pages.

A different approach is adopted by Zaiane et al. [ZXH98]. The authors combine OLAP and data mining techniques over a multidimensional data cube in order to interactively extract implicit knowledge. Their WebLogMiner system transforms Web log data into a relational database. In the next phase a data cube is built, with each dimension representing a field whose possible values are described by attributes. OLAP technology is then used in combination with data mining techniques for prediction, classification, and time-series analysis of Web log data.

Huang et al. [HN+01] also propose the use of a cube model that explicitly identifies Web access sessions, maintains the order of the session’s components and uses multiple attributes to describe the Web pages visited. Borges and Levene [BL99] model the set of user navigation sessions as a hypertext probabilistic grammar whose higher probability generated strings correspond to the user’s preferred trails. Shahabi et al. [SZ+97] propose the use of a client-side agent that captures the client’s behavior creating a profile. Their system then creates clusters of users with similar interests.

Joshi et al. [KJ+01] introduce the notion of uncertainty in Web usage mining, discovering clusters of user session profiles using robust fuzzy algorithms. In their approach, a user or a page can be assigned to more than one cluster. To achieve this, they introduce a similarity measure that takes into account both the individual URLs in a Web session, as well as the structure of the site.

The prototype system WebSIFT [SC+00] was introduced by Srivastava et al. After data cleansing and preprocessing for identifying users, server sessions, and inferring cached page references, pattern discovery is accomplished through the use of general data mining techniques such as association rules, sequential pattern analysis, clustering, and classification. The results are then analyzed through a simple knowledge query mechanism, a visualization tool, or the information filter, that makes use of the preprocessed content and structure information to automatically filter the results of the knowledge discovery algorithms.

Masseglia et al. [MPT99] apply data mining techniques such as association rules and sequential pattern discovery to Web log files and then use them to customize the server hypertext organization dynamically. Their prototype system, WebTool, also provides a visual query language in order to improve the mining process. A generator of dynamic links uses the rules generated from sequential patterns or association rules, and each time the navigation pattern of a visitor matches a rule, the hypertext organization is dynamically modified.

Buchner et al. [BB+99] introduce the data mining algorithm MiDAS for discovering sequential patterns from Web log files, in order to obtain behavioral marketing intelligence. In this work, domain knowledge is described as flexible navigation templates that specify navigational behavior, as network structures capturing Web site topologies, and as concept hierarchies and syntactic constraints.

Spiliopoulou et al. [SFW99] have designed MINT, another mining language, for the implementation of WUM, a sequence mining system for the specification, discovery, and visualization of interesting navigation patterns. In the data preparation phase, in addition to log data filtering and completion, user sessions are identified using timeout mechanisms. The language supports the specification of criteria based on statistical, structural, and textual features.

Berendt [B01] has implemented STRATDYN, an add-on module that extends WUM’s capabilities by identifying the differences between navigation patterns and exploiting the site’s semantics in the visualization of the results. In this approach, concept hierarchies are used as the basic method of grouping Web pages together.

Berendt et al. [BM+02] investigated different methods for evaluating the reliability of sessionizing mechanisms. More specifically, they studied the impact of the Web site structure on sessionizing reliability, both for frame-based and frame-free Web sites. They concluded that different session reconstruction heuristics can be proposed based on each Web site's characteristics; such heuristics are time-oriented or referrer-based.

Perkowitz and Etzioni [PE00a] propose a system that semi-automatically modifies a Web site. The authors propose PageGather, an algorithm that uses a clustering methodology to discover Web pages visited together and to place them in the same group. In a more recent work [PE00b], they move from the statistical cluster-mining algorithm PageGather to IndexFinder, which fuses statistical and logical information to synthesize index pages.

WebPersonalizer, proposed by Mobasher et al. [MD+00a], provides a framework for mining Web log files to discover knowledge for the provision of recommendations to current users based on their browsing similarities to previous users. Data mining techniques are applied, and the results are then used for the creation of aggregated usage profiles, in order to create decision rules. The recommendation engine matches each user’s activity against these profiles and provides him with a list of recommended hypertext links.

Nanopoulos and Manolopoulos [NM02] propose a recommendation system for e-commerce Web sites. For this purpose, they use market basket databases and collaborative filtering, addressing the problem of finding similarities in market basket data. They propose a new method, S3B, which is based on similarity search over transaction signatures.

Nasraoui and Petenes [NP03] present a fast and intuitive approach to providing recommendations using a fuzzy inference engine with rules that are automatically derived from prediscovered user profiles. More specifically, they propose a profile-based fuzzy recommendation engine with an extensive empirical comparison of different fuzzy input membership derivation and parametrization options, and a comparison with approaches based on collaborative filtering and nearest-profile recommendations. They also claim that fuzzy recommendations are very intuitive, deal with the natural overlap in user interests, and are very low in cost compared to collaborative filtering.

A different approach is that of Hooker and Finkelman [HF04], who present a mathematical framework in which the system tries to learn directly a user’s mode of browsing during a given session. They are motivated by the fact that different users navigate web sites differently, and that even the same user may exhibit different behaviours at different times. Therefore, the system should be able to understand the type of behaviour, to tailor the information it displays appropriately. The proposed framework is inspired by sequential analysis in the setting of educational testing.

Zhao and Bhowmick [ZB04] model Web logs as sequences of events with user identifications and timestamps. They call this tree-like structure a WAP (Web Access Pattern). WAPs are mined in order to extract association and sequential patterns with certain metrics. In their latest work, they propose a novel approach for discovering hidden knowledge from historical changes to WAPs; in this way, they also deal with the dynamic and constantly changing nature of Web usage data. Rather than focusing on the occurrence of the WAPs, they focus on frequently changing Web access patterns. For this purpose, they define the FM-WAP (Frequent Mutating WAP), based on the historical changes of WAPs.

Finally, we refer to some approaches which utilize machine learning techniques in order to provide recommendations to the users. Such an approach was proposed by Edwards et al. [EGP02]. They investigate how machine learning algorithms can be used in the Semantic Web context in order to create user profiles and personalize a Web site. They used semantically annotated datasets and concluded that ILP (Inductive Logic Programming) techniques discovered valuable knowledge.

Nasraoui et al. [NC+03] propose a scalable clustering methodology inspired by the natural immune system in order to mine evolving user profiles in noisy Web clickstream data. Their motivation was the difficulty of data mining algorithms to scale and adapt, since Web logs are updated on a daily basis, and the need for scalable, noise-tolerant, initialization-independent techniques that can continuously discover possibly evolving Web profiles without any stoppages or reconfigurations of the clustering schemas. The proposed clustering algorithm plays the role of the cognitive agent of an artificial immune system, whose goal is to continuously perform an intelligent organization of the incoming noisy data into clusters. They claim that it exhibits superior learning abilities, while requiring modest memory and computational costs.

In a different work, Nasraoui and Pavuluri [NP04] present a context ultra-sensitive approach based on a two-step recommender system that relies on a committee of profile-specific neural networks. The proposed approach provides recommendations that are accurate and fast to train, because only the URLs relevant to a specific profile are used to define the architecture of each network. Since a recommendation model is designed for each profile separately, the approach is context ultra-sensitive. Moreover, compared to collaborative filtering, it achieves higher coverage and precision while being faster and requiring less main memory.

Finally, Baraglia and Silvestri [BS04] propose SUGGEST 3.0, a Web usage mining recommender system that dynamically generates links to pages that have not yet been visited by a user. The difference from other recommender systems is that SUGGEST 3.0 does not make use of any off-line component and is able to manage dynamic Web sites. In order to achieve this, the proposed system incrementally builds and maintains historical information by means of an incremental graph-partitioning algorithm, requiring no off-line component. Experiments conducted to evaluate the system's performance demonstrated that it introduces only a limited overhead on the Web server activity.

There also exist various research efforts combining usage mining with the other axes of Web mining, namely content and structure, in order to provide integrated solutions to the problems of Web searching, querying and personalization. The most important of these efforts will be outlined in Chapter 6. In Chapter 5, we examine in more detail the use of Markov models in the Web usage mining and personalization process.


CHAPTER 5 Web Usage Mining and Personalization using Markov Models

5.1 Modeling Web usage with Markov Models

If we parallel the navigation of a user in a Web site with a Markov chain, we can create a Markov model for predicting this user's next action. The prediction is based on the current position of the user on the Web graph and/or his past visits. This approach differs from the ones mentioned in the previous section, since it is based on probabilities rather than data mining techniques. It addresses, however, the same problems of predicting user actions and personalizing a site. In the next paragraphs we present how the users' navigation is modeled, as well as the most representative research initiatives in this area. The preprocessing phase is the same as in the data mining approach described in Chapter 4.

5.1.1 The Transition Graph

In order to model the paths followed by the visitors of a Web site, we create a weighted transition graph using the data residing in the Web logs. Its nodes represent the Web pages of the site, whereas the links between them represent the hyperlinks between the pages. These links carry weights, which represent the number of transitions from the "source" Web page to the "target" Web page. An example of such a graph is shown in Figure 1 [ZHH02]:

Figure 1


5.1.2 Markov Models

Every node in the transition graph may be considered as a state in a discrete Markov model, which may be defined as a tuple $\langle S, Q, L \rangle$, where $S$ is the state space, which includes all nodes of the transition graph, $Q$ is the probability matrix that includes the one-step transition probabilities between the nodes, and $L$ is the initial probability distribution over the states in $S$. The navigation of a user may then be represented as a stochastic process $\{X_n\}$ that has $S$ as its state space. If $P^{(m)}_{i,j}$ is the conditional probability of visiting page $j$ in the next step, based on the last $m$ pages visited, then $\{X_n\}$ is called an $m$-order Markov chain [ZHH02, Kij97]. Given that the user is currently at page $i$ and has already visited pages $i_{n-1}, \ldots, i_0$, then $P^{(m)}_{i,j}$ depends only on the pages $i_n, i_{n-1}, \ldots, i_{n-m+1}$ and is given by Equation 1:

$$P^{(m)}_{i,j} = P(X_{n+1}=j \mid X_n=i_n, X_{n-1}=i_{n-1}, \ldots, X_0=i_0) = P(X_{n+1}=j \mid X_n=i_n, X_{n-1}=i_{n-1}, \ldots, X_{n-m+1}=i_{n-m+1}) \quad (1)$$

That is, the conditional probability of $X_{n+1}$, given the states of all previous events, equals the conditional probability of $X_{n+1}$ given only the $m$ most recent events. When $m = 1$, $X_{n+1}$ depends only on the current state $X_n$, and the chain is called a 1st-order Markov chain. In that case,

$$P^{(1)}_{i,j} = P_{i,j} = P(X_{n+1}=j \mid X_n=i),$$

where $P_{i,j}$ is the one-step probability of the transition from state $i$ to state $j$ [ZHH02]. This one-step transition probability from page $i$ to page $j$ may be computed using a transition graph such as the one introduced before. Using the weights of the graph links as underlying information about the preferences of past users (i.e. which paths were followed by users in the past), we can create a transition probability matrix that contains the one-step transition probabilities of the Markov model.

The Markov model is then used to compute transition probabilities, i.e. the conditional probability of the user visiting further pages in the future, given his current position and/or his past visits. The one-step transition probability from page $i$ to page $j$ can be estimated as the number of transitions from $i$ to $j$ divided by the total number of transitions from $i$ to all other pages and the "Exit" node. This probability is given by Equation 2:

$$P_{i,j} = P(X_{n+1}=j \mid X_n=i, X_{n-1}=i_{n-1}, \ldots, X_0=i_0) = P(X_{n+1}=j \mid X_n=i) = \frac{w_{i,j}}{\sum_k w_{i,k}} \quad (2)$$

where $w_{i,j}$ is the weight of the link from $i$ to $j$, and $w_{i,k}$ is the weight of the link from $i$ to $k$.
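A minimal sketch of this estimation is given below: transition counts (the link weights $w_{i,j}$) are accumulated from session page sequences, each row is normalized as in Equation 2, and the most probable next page is read off the resulting first-order model; the toy sessions are illustrative assumptions.

```python
from collections import defaultdict

def transition_probabilities(sessions):
    """Estimate first-order transition probabilities P(j | i) from session page sequences."""
    weights = defaultdict(lambda: defaultdict(int))   # weights[i][j] = observed i -> j transitions
    for session in sessions:
        for src, dst in zip(session, session[1:] + ["Exit"]):
            weights[src][dst] += 1
    probs = {}
    for src, outgoing in weights.items():
        total = sum(outgoing.values())
        probs[src] = {dst: count / total for dst, count in outgoing.items()}
    return probs

def predict_next(probs, current_page):
    """Return the most probable next page given the current one (first-order prediction)."""
    candidates = probs.get(current_page, {})
    return max(candidates, key=candidates.get) if candidates else None

sessions = [["A", "B", "C"], ["A", "B", "D"], ["A", "C"], ["B", "C"]]
model = transition_probabilities(sessions)
print(model["A"])                 # e.g. {'B': 0.667, 'C': 0.333}
print(predict_next(model, "A"))   # 'B'
```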

We may now create the transition probability matrix, which, as already mentioned, contains the probabilities of one-step transitions between Web pages. In such a matrix, the i-th row includes all the transition probabilities from page i to all other pages, and their sum equals 1. An example of the transition probability matrix for the graph depicted in Figure 1 is shown in Figure 2 [ZHH02].


Figure 2

The transition probability matrix may become very large, especially in the case of big Web sites. In that case, there exist several techniques for transforming the matrix into a more compact form. One such approach is clustering the Web pages based on similar navigation patterns. Following the previous example, the compacted transition probability matrix for the matrix in Figure 2 is shown in Figure 3.

Figure 3

5.2 Research Initiatives

Many approaches have been proposed for using Markov models in the Web usage mining and personalization areas. From these, it has been shown that 1st-order Markov models are much simpler to build, but are not as accurate as higher-order ones. This can be explained intuitively, since a user's next action does not depend only on the current one (as 1st-order Markov models assume), but on several steps before (long-term memory models).


Higher-order Markov models are therefore much more accurate, and their precision increases as the order of the model increases. They are, however, much more complex, since the state space grows, and therefore the requirements in time and space are much larger. Moreover, there exist hybrid Markov models, which combine parts from models of various orders, so that the resulting model has reduced state complexity and increased prediction precision [DK01]. These models are called selective and are the ones most commonly used in the Web personalization area. Before referring to the most recent approaches dealing with Web personalization, it is useful to mention some approaches of the past decade, which inspired current researchers to use Markov models in this area.

Chen and Goodman [CG96] used empirical models which combine parts from Markovian models of 1st to kth order. This approach was the first step towards selective Markov models. Cadez et al. [CH+00] proposed the WebCANVAS system, which depicts similar user groups using Markov models. Ji et al. [JKW00] proposed MANIC (Multimedia Asynchronous Networked Individualized Courseware), a system for helping students while studying. For the system implementation they used Hidden Markov Models (HMMs). Their objective was to capture the students' behavior and implement algorithms for pre-fetching lecture notes. The results showed that HMMs worked well as predictive tools.

Zhu et al. [ZHH02] proposed an improvement of Sarukkai's method, computing the probability of a user arriving at a state of the transition probability matrix within the next n steps. This computation takes the normalized sum of the probabilities of the user's arrival at a certain state, given his history, as the total probability of his arrival at this state in the future. In that way, compared to Sarukkai's method, it is possible to predict more steps into the future. Moreover, the size of the transition probability matrix was reduced to an optimal size, thus reducing the complexity of the algorithm. Finally, the same research team proposed the "maximal forward path" method in order to improve the precision of the prediction results. The maximal forward path is a sequence of strongly connected Web pages within a user visit; only these pages are included in the user's history and are used for predicting his next step.

Deshpande and Karypis [DK01] proposed the "selective" Markov models, which reduce the state space using pruning techniques. These techniques improve prediction and reduce the model size. Selective Markov models are based on selecting parts from Markov models of different orders, so that the resulting model has reduced state complexity as well as good precision in predicting the user's next step. The key idea of the proposed work is that many of the states of the Markov models of various orders can be pruned without affecting the performance of the integrated model. More specifically, they present three pruning schemes for the states of the All-Kth-Order Markov model: (i) support pruning (SPMM), (ii) confidence pruning (CPMM), and (iii) error pruning (EPMM). Their experiments showed that the pruning techniques improved both complexity and prediction precision. Cadez et al. [CGS00] as well as Sen and Hansen [SH03] also proposed the use of mixed Markov models.

Nanopoulos et al. [NKM03] present a new framework for using prefetching algorithms as Markovian prediction algorithms. Moreover, they propose WMo, an algorithm based on data mining, which is a generalization of other data mining algorithms. It was designed to examine the factors that affect the performance of prefetching algorithms.

Ypma and Heskes [YH03] propose a mixture of Hidden Markov Models (HMMs) for modeling user clickstreams. They then cluster the Web pages and label the users, using their clickstreams and some additional static user data. In this way, user transitions are abstracted from Web pages to Web page categories, thus reducing the size of the transition matrix.

Anderson et al. [ADW02] proposed Relational Markov Models (RMMs) and PROTEUS, a system architecture for automatically personalizing a Web site. PROTEUS is a two-phase process. The first phase involves mining the Web logs to extract usage patterns. Next, PROTEUS selects, among a set of different ways in which a Web site may be personalized (e.g. adding a new hyperlink, creating an index list, removing content), the one that produces the maximal usability for every pattern. After a set of experiments they concluded that the addition of hyperlinks proved very useful, and for that reason they proposed the MINPATH algorithm. They used Markov models of 1st and 2nd order and concluded that a mixture of 1st-order Markov models behaved best, requiring 40% less effort from the site visitors. RMMs, however, have a few shortcomings. The most important one is that they need training data in order to provide recommendations; therefore, if a page was not included in the training data (or was not visited by any user), the Markov model cannot predict it. This is a common phenomenon in dynamically created Web sites, such as portals.

Finally, Jespersen et al. [JPT03] try to evaluate the Markov assumption for Web usage mining. More specifically, they systematically investigate the quality of browsing patterns mined from structures based on the Markov assumption of depth n, i.e. that the next page requested depends only on the last n pages visited. They define formal measures of quality, based on the closeness of the mined patterns to the true traversal patterns, and perform an extensive experimental evaluation. The results indicated that a large number of rules must be considered to achieve high quality, that long rules are generally more distorted than shorter ones, and that the model yields knowledge of higher quality when applied to more random usage patterns. Their research showed that Markov-based structures for Web usage mining are best suited for tasks demanding less accuracy, such as pre-fetching, personalization and targeted ads.

5.3 Conclusions

Markov chains have proved a useful tool for modeling the navigation of the visitors of a Web site. The main disadvantage of Markov models is that they cannot predict pages not previously visited by the users. Moreover, they have two major restrictions:

1. Need for a large amount of training data: Markov models are based on statistical methods, therefore the resulting prediction model is totally dependent on the amount of available data. This is not a problem when the analysis is performed on a single Web site with high visit rates. The approach, however, is problematic when the analysis is performed across multiple Web sites with low visit rates.


2. Dimensionality: The second restriction in the use of Markov chains is dimensionality. The transition matrix is usually very large, but it can be reduced by clustering similar Web pages [Sar00].


CHAPTER 6 State-of-the-Art – Combining Web Mining techniques

In this Chapter we examine the current trends in the Web mining area. We overview the most important elements of what is called "the Semantic Web", such as metadata, ontologies, and the RDF and XML technologies. We then present how these elements are combined with Web mining techniques in order to create more advanced applications, as well as more sophisticated solutions to fundamental Web mining problems. Some of the most representative research initiatives are presented, which depict the current trends in the Semantic Web Mining area. We also present an overview of some commercial products that were implemented to address the same needs. We finally conclude with some thoughts on the future of Web mining.

6.1 Definition of Web Semantics

6.1.1 The Semantic Web

Tim Berners-Lee, the founder of the World Wide Web, predicts that the Web will evolve toward what he calls the Semantic Web: "To date, the Web has developed most rapidly as a medium of documents for people rather than of information that can be manipulated automatically. By augmenting Web pages with data targeted at computers and by adding documents solely for computers, we will transform the Web into the Semantic Web. Computers will find the meaning of semantic data by following hyperlinks to definitions of key terms and rules for reasoning about them logically" [BHL01]. In other words, he envisions that in the future the vast amount of information on the Web will bear machine-readable metadata, enabling computers to manipulate this content automatically, without human intervention.

Therefore, the Semantic Web is envisaged as an extension of the Web, in which information is given a well-defined meaning. To accomplish this, the provided information should be structured, accompanied by sets of inference rules that can be used by computers to conduct automated reasoning. The challenge of the Semantic Web, according to Berners-Lee, “is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web”.

Two important technologies in that direction are the eXtensible Markup Language (XML) and the Resource Description Framework (RDF). XML is simple and easy to use; however, due to the freedom it gives users to add structure to the documents they create, it does not impose any underlying rules for expressing content in a universally understandable way. On the other hand, RDF is based on stricter rules for describing content, and is therefore easier for machines to "understand". However, this adds to its complexity, making it more difficult for a user to annotate a document using RDF.

This modeling still proves insufficient since, even if there exists a "grammar" for expressing content using RDF, the same concept may be expressed using different identifiers, making interoperability across different systems difficult. The solution to this problem is the use of ontologies. An ontology is defined as "a formal explicit specification of a shared conceptualization" [G93]. In simpler words, an ontology is a document that formally defines the relations among terms. The most typical kind of ontology used in the Web context consists of a taxonomy (concept hierarchy) and a set of inference rules. Tim Berners-Lee considers ontologies to be a critical part of the Semantic Web, since they will provide a common vocabulary for solving such terminology problems.

6.1.2 RDF and XML

XML is a very simple language that permits users to create their own tags in order to annotate Web documents. XML is called "extensible" because it is not a fixed format like HTML. Instead, it is actually a "metalanguage" that lets users design their own customized markup languages for limitless different types of documents. As claimed in the XML Specification [XML], XML is not just for Web pages; it can be used to store any kind of structured information, and to enclose or encapsulate information in order to pass it between different computing systems which would otherwise be unable to communicate.

XML allows users to create structured documents in a very simple way. It provides a robust, non-proprietary, persistent, and verifiable file format for the storage and transmission of text and data both on and off the Web. However, due to the lack of an explicit grammar, these structures are meaningless to other systems. The meaning of the data, i.e. the data semantics, is expressed by RDF [RDF]. Whereas an XML document is a tree, an RDF document consists of sets of triples. Each triple contains a subject, a predicate and an object, and these triples can be written using XML tags. In RDF, a document makes assertions that things have properties, and this is how most data can be described and further processed by computers. Subject and object are identified by a URI (Universal Resource Identifier), ensuring that everything contained in a document is tied to a unique definition available to anyone on the Web.

XML and RDF are two complementary technologies used to build a more "intelligent" Web. XML is syntax-oriented, whereas RDF brings the notion of data semantics. Using RDF models, two different systems may communicate with different syntax using the concept of equivalences. This can be achieved if the RDF model uses a vocabulary defined by the terms of an ontology. The combination of an RDF model and the associated ontology gives the computer enough information to discover the meaning of the data.
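As a small illustration of such triples (a sketch using the rdflib Python library as an assumed dependency; the URIs and property names are made up for the example), a single subject-predicate-object assertion can be built and serialized as follows.

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical namespace and resources, chosen only for illustration.
EX = Namespace("http://example.org/terms/")
paper = URIRef("http://example.org/docs/webmining-roadmap")

g = Graph()
g.bind("ex", EX)
# Each triple asserts: subject (the document), predicate (a property), object (a value).
g.add((paper, EX.hasTopic, Literal("Web mining")))
g.add((paper, EX.author, Literal("M. Eirinaki")))

print(g.serialize(format="turtle"))
```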

6.1.3 Ontologies

Ontologies are developed to provide machine-processable semantics of information sources that can be communicated between different agents (software and humans). Many definitions of ontologies exist. Some are more restrictive: "An ontology consists of a representational vocabulary with precise definitions of the meanings of the terms of this vocabulary plus a set of formal axioms that constrain interpretation and well-formed use of these terms" [CS95]. Others are more relaxed: "an ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning" [JU99]. In simple words, we may say that ontologies provide a shared and common understanding of some domain, in a way that can be communicated between people and computers.

As defined by Berners-Lee et al. [BHL01], the most representative form of ontology is a taxonomy (is-a hierarchy) together with a set of inference rules. The taxonomy defines the object classes and the relations among them; the classes, subclasses, and relations constitute the backbone of an ontology, while the inference rules provide extra strength. Boszak et al. [BE+02] claim that the Semantic Web will prove valuable in e-commerce applications, since data exchange will be performed using ontologies.

6.1.3.1 A categorization of ontologies

Depending on their level of generality, different types of ontologies may be identified that fulfil different roles in the process of building a knowledge-based system. Among others, we can distinguish the following ontology types:

• Domain ontologies capture the knowledge valid for a particular type of domain. Some examples are “EngMath”, an ontology modelling mathematical formulas, and “PhysSys”, an ontology aiming at modelling, simulating and designing natural (physical) systems. Other ontologies aiming at modelling enterprises are “Enterprise”, a set of ontologies used for describing the processes within an organization, and “TOVE” (Toronto Virtual Enterprise).

• Metadata ontologies provide a vocabulary for describing the content of on-line information sources. The most popular example is “Dublin Core”.

• Linguistic ontologies are extended dictionaries, since they include words, their meaning and the relations between them (such as synonyms, hypernyms, antonyms etc.). The most popular linguistic ontology is “WordNet”. Other linguistic ontologies are “SENSUS” and the “Generalized Upper Model”.

• Generic or common sense ontologies aim at capturing general knowledge about the world, providing basic notions and concepts for things like time, space, state, event etc. As a consequence, they are valid across several domains. An example is “upper Cyc® ontology”, an ontology containing almost 3000 generic terms describing the human activity.

• Representational ontologies do not commit themselves to any particular domain. Such ontologies provide representational entities without stating what should be represented. A well-known representational ontology is the “Frame Ontology”, which defines concepts such as frames, slots, and slot constraints allowing the expression of knowledge in an object-oriented or frame-based way.

• Other types of ontology are so-called method and task ontologies. Task ontologies provide terms specific for particular tasks (e.g. ’hypothesis’ belongs to the diagnosis task ontology), and method ontologies provide terms specific to particular PSMs (e.g. ‘correct state’ belongs to the Propose-and-Revise method ontology). Task and method ontologies provide a reasoning point of view on domain knowledge.

6.1.3.2 Use of Ontologies


Often an ontology of the domain is not a goal in itself. Developing an ontology is akin to defining a set of data and their structure for other programs to use. Therefore, most ontologies are constructed to enable reusability across applications. However, we may also discriminate between ontologies in terms of their usage. For instance, we may distinguish between ontologies in systems used by a small number of people and ontologies used by a large community. From a different point of view, an ontology can be considered as a way of creating a knowledge base, or, alternatively, as part of the knowledge base itself. Consequently, we may distinguish between three main categories of ontology use [UG96], namely communication, interoperability, and systems engineering.

In the Web context, there exist various ways in which ontologies can be used. For instance, they may be used to improve the Web searching process by focusing the search on specific terms/concepts. More advanced applications can use ontologies to relate the information on a page to the associated knowledge structures and inference rules, enabling further processing in terms of characterizing and categorizing this information. In the sections that follow, we overview some research efforts aiming at the creation of the Semantic Web. We focus on research that uses Web mining methods in combination with structures such as taxonomies or ontologies, and we try to investigate different ways of combining these powerful mechanisms.

6.2 Towards Semantic Web - Combining mining techniques and Ontologies for extracting semantics

6.2.1 Semantic Web Mining

It is evident that the distinctions between the three axes of Web mining (especially of Web content and Web structure mining) are in many cases ambiguous. Most of the current research efforts concentrate on combining methods proposed in either one of them, in order to enhance the results of their studies. This combination implies the advancing of Web mining to a more abstract level. To achieve this abstraction, Web data (usage, content, structure) are represented using another emerging model of representation, ontologies. This representation closes the gap between the Semantic Web and Web Mining areas, creating a fast-emerging research area, that of Semantic Web Mining.

Berendt et al. [BHS02] presented an overview of the ways the Semantic Web and Web Mining areas can benefit from each other. The results of Web mining can be improved by exploiting the new semantic structures in the Web, whereas Web Mining can help to build the Semantic Web. The connective link between the two areas is the use of ontologies, which are the backbone of the Semantic Web. Web mining techniques can be applied to create the Semantic Web in terms of semi-automatic creation of domain ontologies; on the other hand, knowledge in the form of ontologies and other semantic structures can be used to improve the results of Web mining.

Since the objective of this survey is to outline the future directions in the Web mining area, we will focus on the second aspect of Semantic Web Mining, that of exploiting Web semantics in order to improve the results of Web content, structure, and usage mining. As already mentioned, the current trend is to combine methods used in these three areas. In the following sections we will briefly describe how this is achieved and present the most important research initiatives in these areas.

6.2.2 Combining content and structure information

As the WWW started growing rapidly, the need for combining Web content and Web structure mining emerged. The requirements for effective search and management of Web content became stronger than ever, and content-based classification/characterization techniques alone, as well as indexing and searching the Web based solely on its link structure, proved inadequate.

This was the crucial point where researchers suggested solutions to the problems of searching, indexing, characterizing and querying the Web that take into account both the content of a Web page and the link structure surrounding it. Web structure mining benefits by incorporating into the proposed algorithms and methods the meta-information included in the hyperlinks and the text surrounding them. What is more, Web content mining is enhanced if Web pages are characterized using an abstract representation based on ontologies.

6.2.2.1 Using hyperlink information to enforce content semantics

Kleinberg [K99] states: “The link structure of a hypermedia environment can be a rich source of information about the content of the environment (…) But for the problem of searching in hyperlinked environments such as the World Wide Web, it is clear from the prevalent techniques that the information inherent in the links has yet to be fully exploited”. As already described in the Hyperlink Information Section, the characterization of a Web page and, as a consequence, the indexing and searching of the Web are better performed if the information contained in the incoming or outgoing hyperlinks is taken into account [CD+99, PW00, HG+02, VV+03, EVV03].
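As a rough illustration of this idea (a minimal sketch, not the method of any of the cited works), the snippet below collects, for each hyperlink in a page, its target URL, its anchor text, and a few surrounding words; this is the kind of link context that the approaches discussed here exploit to characterize the target page. The page content and the `window` parameter are made-up examples, and the third-party `beautifulsoup4` package is assumed to be available.

```python
# Sketch: collect hyperlink targets together with their anchor text and a small
# window of surrounding words, as a crude source of descriptions of the target pages.
from bs4 import BeautifulSoup

def link_contexts(html: str, window: int = 10):
    """Return (href, anchor_text, context) triples for every link in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    contexts = []
    for a in soup.find_all("a", href=True):
        anchor = a.get_text(" ", strip=True)
        # Text of the enclosing element, used as a rough approximation of the
        # text "around" the link.
        surrounding = a.parent.get_text(" ", strip=True) if a.parent else anchor
        words = surrounding.split()
        contexts.append((a["href"], anchor, " ".join(words[:window])))
    return contexts

if __name__ == "__main__":
    page = '<p>A survey of <a href="/webmining.html">Web mining</a> techniques and tools.</p>'
    for href, anchor, ctx in link_contexts(page):
        print(href, "|", anchor, "|", ctx)
```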

The hypothesis is that the way others characterize a Web page is descriptive of the page’s contents and, in many cases, is more objective than the way the author characterizes it. Moreover, in most cases the text around a hyperlink pointing to a page contains, in compact form, a description of that page’s contents and should therefore be taken into account. Chakrabarti et al. define this as the “anchor-window” [CD+99]. Eirinaki et al. [EVV03] state that the information included in the pages that are pointed to by the page under examination is also a useful source of information, since in most Web pages the authors include links to topics that are of importance in the page’s context.

6.2.2.2 Replacing lexical descriptions with semantics

Web documents are mainly characterized by extracted keywords and by a rank that takes into account link structures [BP98]. Finding the similarity between documents is based on exact matching between these terms. This, however, is hardly similarity; rather, it is binary matching. By replacing keywords with concepts, and moreover with concepts in an ontology (concept hierarchy), a more flexible document matching process than binary matching can be achieved, handling both specializations and generalizations of senses.

First, the keywords that characterize each document should be mapped to the concepts of the ontology. This can be performed by using a thesaurus and a similarity measure. Halkidi et al. in [HN+03] perform this mapping using WordNet as a thesaurus.


WordNet [WN] is a lexical database containing English nouns, verbs, adjectives and adverbs organized into synonym sets, each representing one underlying lexical concept. It provides mechanisms for finding synonyms, hypernyms, hyponyms, etc. for a given word. Using WordNet in combination with a similarity measure, every keyword can be mapped to the most relevant ontology term (concept). There exist several similarity measures for handling the simpler case of calculating the similarity between two given terms of the ontology. Richardson et al. [RSM94] and Resnik [R95, R99] propose different measures inside a taxonomy such as WordNet, and Lin [L98] provides a comparison between these measures and others, such as Wu and Palmer [WP94] and Miller and Charles, together with a novel similarity measure. [DJ02], [HN+03] and [EVV03] also propose the use of the Wu and Palmer measure in the ontology context.

Since all documents are characterized using ontology terms, this knowledge can be used in order to form semantically coherent clusters. The clustering algorithm is based on a similarity measure between sets of weighted words. A key task here is the definition of a distance between Web documents. Traditionally, Information Retrieval techniques such as those described in [SM83] would be applied to this end. However, these techniques most often rely on exact keyword matching and do not take into account the fact that the keywords may have some semantic proximity to each other. Halkidi et al. propose a clustering scheme based on a novel similarity measure (a generalization of the Wu and Palmer similarity measure for weighted sets of terms) between sets of terms that are hierarchically related [HN+03]. Once the clusters have been found, a very important issue that should also be considered is their labeling.
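As a hedged illustration of this keyword-to-concept mapping (a sketch of the general idea, not the exact procedure of [HN+03]), the snippet below uses NLTK’s WordNet interface and the Wu-Palmer measure to map each extracted keyword to the closest term of a small, hypothetical ontology vocabulary:

```python
# Sketch: map extracted keywords to the closest concept of a (hypothetical)
# ontology vocabulary using WordNet and the Wu-Palmer similarity measure.
# Requires: pip install nltk ; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

ONTOLOGY_TERMS = ["vehicle", "animal", "sport", "computer"]  # toy concept vocabulary

def wup(word_a: str, word_b: str) -> float:
    """Best Wu-Palmer similarity over all noun senses of the two words."""
    best = 0.0
    for s1 in wn.synsets(word_a, pos=wn.NOUN):
        for s2 in wn.synsets(word_b, pos=wn.NOUN):
            sim = s1.wup_similarity(s2) or 0.0
            best = max(best, sim)
    return best

def map_keyword(keyword: str):
    """Return the ontology concept most similar to `keyword`, with its score."""
    scored = [(concept, wup(keyword, concept)) for concept in ONTOLOGY_TERMS]
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    for kw in ["bicycle", "dog", "football"]:
        print(kw, "->", map_keyword(kw))
```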

6.2.2.3 Research initiatives

As already mentioned, link information is already used by Web search engines to better filter and rank query results; for example, as mentioned in Chapter 2, many search engines’ underlying algorithms, including Google’s PageRank, prioritize pages with many incoming or outgoing links. An interesting search engine that uses the hyperlink structure to group interconnected results is Kartoo [Kar99].

Vivísimo [Viv] proposes a clustering approach for web document organization. It makes use of the contents (titles and brief descriptions) that are returned by the underlying search engines. Northern Light [Nor] classifies each document within an entire source collection into pre-defined subjects and then, at query time, selects those subjects that best match the search results. Vivísimo does not use pre-defined subjects; its annotations are created spontaneously.

Haveliwala et al. [HG+02] also propose a methodology for evaluating strategies for similarity search on the Web. According to this approach, a Web document is represented by a set of terms found in its content and in the anchor-windows of links to it, together with the corresponding term weights. The similarity between documents is then measured as the similarity between their term bags, using the Jaccard coefficient as the similarity metric.
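For reference, the Jaccard coefficient between two term bags is simply the size of their intersection divided by the size of their union. A minimal sketch with toy documents (not the actual evaluation setup of the cited work):

```python
# Minimal sketch of the Jaccard coefficient between two documents represented
# as sets of terms drawn from their content and anchor-windows.
def jaccard(bag_a: set, bag_b: set) -> float:
    if not bag_a and not bag_b:
        return 0.0
    return len(bag_a & bag_b) / len(bag_a | bag_b)

doc1 = {"web", "mining", "usage", "logs"}
doc2 = {"web", "mining", "ontology", "semantics"}
print(jaccard(doc1, doc2))  # 2 shared terms out of 6 distinct ones -> 0.333...
```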

A method for classifying and describing Web documents is discussed in [GT+02]. The authors use inbound links and the words surrounding them to describe Web pages; an SVM classifier is trained and used to categorize the pages. A method for selecting features and characterizing the classes of Web pages, based on the expected entropy loss metric, is also proposed. The results of the proposed approach show that the text in citing documents has more descriptive power than the text in the target document itself.


Chekuri et al. [CG+97] propose a system for automatic classification of Web documents into predefined categories using a training set of pre-classified documents. The documents are represented using word frequency vectors.

Bloehdorn and Hotho [BH04] propose an enhancement of the classical document representation through concepts extracted from background knowledge. They claim that Semantic Web technologies allow the usage of features on a higher semantic level than single words for text classification purposes. They use boosting, a machine learning technique, to achieve this classification.

The idea of personalizing the results of a search engine led Aktas et al. [ANM04] to propose a personalized version of the PageRank algorithm. They introduce a methodology for personalizing PageRank vectors based on URL features such as Internet domains. The users specify interest profiles, which are defined as binary feature vectors. Given such a vector, a weighted PageRank is computed, assigning a weight to each URL based on binary matching. Preliminary experimentation showed that personalized PageRank performed favorably compared to pure similarity-based ranking and traditional PageRank. A hedged sketch of this general idea appears at the end of this subsection.

A similar approach was proposed by Deng et al. [DC+04]; the difference is that they try to personalize the results of a metasearch engine. They proposed the “Spy Naïve Bayes” (SpyNB) technique, which identifies user preference pairs generated from clickthrough (i.e. usage) data. A ranking SVM algorithm is then employed in order to build a metasearch engine optimizer. They performed experiments on a metasearch engine prototype and showed that SpyNB significantly improves the average ranks of users’ clicks.

Halkidi et al. [HN+03, VV+03] propose THESUS, a prototype system that organizes thematic Web documents into semantic clusters. The system extracts keywords from pages’ incoming links and converts them to semantics by mapping them to a domain ontology. Subsequently, a clustering algorithm is applied to discover groups of Web documents. The effectiveness of the clustering process is based on the use of a novel similarity measure between documents characterized by sets of terms. At the end of the process, the documents are organized into thematic subsets based on their semantics. This categorization enables a more effective search using a search engine, since all documents are classified into labeled clusters.

The idea of clustering documents using taxonomies is also proposed by Adami et al. [AAS03]. They address the problem of the high demand for labeled examples when hierarchically categorizing Web documents. In their approach, the documents are classified according to class labels, and they propose two clustering approaches where training is constrained by the a-priori knowledge of the taxonomy structure. More specifically, they propose the TaxSOM model, which clusters a set of documents in a predefined hierarchy of classes, directly exploiting the knowledge of both their topological organization and their lexical description.

Finally, Ngo and Nguyen [NN04] propose an approach to search results clustering based on Tolerance Rough Sets. They define search results clustering as a process of automatically grouping the results into thematic groups. Unlike traditional document clustering, they perform this process on-the-fly (per user query request) and locally on a limited set of results. Tolerance classes are used to approximate concepts that exist in the Web documents. In this way, the document and cluster representation is enriched, and the authors claim that this approach increases the clustering performance.
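Returning to the personalized-PageRank idea of [ANM04] mentioned above, here is a hedged sketch of the general principle (toy graph, hypothetical profile-matching function, not the exact algorithm of the cited work): the user’s binary interest profile biases the teleportation vector, so random jumps land preferentially on URLs that match the profile.

```python
# Sketch of a profile-biased ("personalized") PageRank on a toy Web graph.
# The teleportation weights come from binary matching between a URL feature
# (here, its domain) and a user's binary interest profile; both are made up.

def personalized_pagerank(links, weights, d=0.85, iters=50):
    """links: {url: [outgoing urls]}; weights: {url: teleport weight >= 0}."""
    pages = list(links)
    total = sum(weights.values())
    if total > 0:
        teleport = {p: weights[p] / total for p in pages}
    else:
        teleport = {p: 1 / len(pages) for p in pages}  # fall back to uniform
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) * teleport[p] for p in pages}
        for p in pages:
            out = links[p] or pages            # dangling pages spread rank evenly
            share = d * rank[p] / len(out)
            for q in out:
                new[q] = new.get(q, 0.0) + share
        rank = new
    return rank

# Toy example: the profile matches ".edu" URLs, so those get teleport weight 1.
profile_match = lambda url: 1 if ".edu" in url else 0
graph = {
    "a.edu/p1": ["b.com/p2", "c.edu/p3"],
    "b.com/p2": ["c.edu/p3"],
    "c.edu/p3": ["a.edu/p1"],
}
weights = {url: profile_match(url) for url in graph}
print(personalized_pagerank(graph, weights))
```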



6.2.3 Combining usage and content information

The exploitation of the pages’ semantics hidden in user paths can considerably improve the results of Web usage mining, since it provides a more abstract, yet uniform and both machine- and human-understandable way of processing and analyzing the data. The underlying idea is to use a concept hierarchy or an ontology and to classify the pages belonging to a Web site according to this hierarchy/ontology. The simplest way to accomplish this is to use hand-crafted ontologies and automated classifiers. This classification can make Web usage mining results more comprehensible both for Web site analysts and administrators (assisting in acquiring business intelligence or redesigning the Web site) and for Web personalization.

Most of the research efforts in Web personalization correspond to the evolution of extensive research in Web usage mining. As noted in [MD+00b], usage-based personalization can be problematic either when there is not enough usage data in order to extract patterns related to certain categories, or when the site content changes and new pages are added but are not yet included in the web log. The incorporation of information related to the content and/or the structure of the Web site provides a way of overcoming such problems, thus improving the whole personalization process [EVV03].

Up to now, only a few research efforts have tried to incorporate content into the Web usage mining and personalization process, and even fewer have performed this using ontologies. In the following sections we will present how a personalization system may benefit from incorporating semantics into the input of the Web usage mining process, i.e. the usage logs.

6.2.3.1 Enhance logs with semantics

Eirinaki et al. [EVV03] introduced C-logs, a conceptual abstraction of the original Web usage logs based on the Web site’s semantics. Web content is semantically characterized using a domain-specific taxonomy and a combination of IR techniques. The keywords that are extracted using these techniques are mapped to the categories of the taxonomy. After this process, every document falls under one or more taxonomy categories. C-logs are created when each record in the processed Web server log is updated to include the related set of concepts (categories). The idea of registering the user behavior in terms of an ontology is also described in [OB+03], where the Web logs are likewise semantically enriched with ontology concepts. This framework, however, assumes Web sites built on an underlying ontology and does not cover the majority of (non-Semantic Web) sites.
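A hedged sketch of this log-enrichment step (toy log records and a hypothetical URL-to-category mapping; in the actual approach the mapping is derived with IR techniques and a taxonomy):

```python
# Sketch: enrich processed Web-server log records with the taxonomy categories
# of the requested pages, producing "C-log"-style records.
import csv
import io

# Hypothetical mapping produced by the content-characterization step.
URL_CATEGORIES = {
    "/papers/webmining.html": ["research", "data mining"],
    "/courses/db.html": ["education", "databases"],
}

def enrich_log(log_text: str):
    """Yield (ip, timestamp, url, categories) for each raw 'ip,timestamp,url' record."""
    for ip, ts, url in csv.reader(io.StringIO(log_text)):
        yield ip, ts, url, URL_CATEGORIES.get(url, ["unknown"])

raw = ("10.0.0.1,2004-11-02T10:00,/papers/webmining.html\n"
       "10.0.0.2,2004-11-02T10:01,/courses/db.html\n")
for record in enrich_log(raw):
    print(record)
```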

Subsequently, data mining algorithms can be applied to cluster users with similar interests, provide personalized views of the ontology [OB+03], semi-automatically generate interesting queries for usage mining, or create meaningful visualizations of usage paths [B02]. Such results can also be used to create a set of recommendations that include thematic categories, in addition to recommendations including URIs [EVV03].

6.2.3.2 Enrich recommendations with semantically similar docs

Since the content of the Web site is characterized using a uniform vocabulary derived from an ontology, the documents can be classified accordingly. Automatic classifiers may be used [BS00] or, as mentioned in the previous section, clustering may be applied based on the similarity between the category terms that characterize the documents [EVV03]. This classification enables a personalization system to recommend documents not only based on exact keyword matching but on semantic similarity as well. The enhanced usage logs are used as input in the Web usage mining process. Since they encapsulate knowledge derived from the site semantics, the results of the usage mining process are further augmented.

Alternatively, as proposed in [EVV03], the usual Web personalization process, which is based on the Web site’s logs, can be enhanced by taking into account the semantic proximity of the content. In this way, the system’s suggestions can be enriched with content bearing similar semantics. Following this approach, the recommendations that are provided to the end user are a combination of the ones derived using the traditional usage mining approach, expanded with documents that fall under the same cluster. Moreover, since the logs incorporate semantic information, category-based rules can be extracted, i.e. association rules based on category terms instead of URIs.
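As a rough illustration of this expansion step (toy data and hypothetical cluster assignments; not the actual SEWeP implementation), usage-based recommendations can be augmented with the other members of each recommended document’s semantic cluster:

```python
# Sketch: expand usage-based recommendations with semantically similar documents,
# i.e. the other members of the cluster that each recommended page belongs to.
DOC_CLUSTER = {            # hypothetical output of the semantic clustering step
    "/a.html": "sports",
    "/b.html": "sports",
    "/c.html": "politics",
    "/d.html": "politics",
}

def expand(recommendations, visited):
    """Add same-cluster documents to the usage-based recommendation list."""
    expanded = list(recommendations)
    for doc in recommendations:
        cluster = DOC_CLUSTER.get(doc)
        for other, c in DOC_CLUSTER.items():
            if c == cluster and other not in expanded and other not in visited:
                expanded.append(other)
    return expanded

print(expand(["/a.html"], visited={"/c.html"}))   # -> ['/a.html', '/b.html']
```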

6.2.3.3 Research Initiatives

Since we have already mentioned research efforts concentrating on Web usage mining and Web personalization, here we focus on the systems that combine usage and content knowledge in order to perform Web mining or (semi)automatically modify a Web site, utilizing taxonomies/ontologies for this purpose.

Berendt et al. introduced “service based” concept hierarchies in [BS00], for analysing the search behaviour of visitors, i.e. “how they navigate rather than what they retrieve”. This idea is further analysed in [B02], where concept hierarchies are the basic method of grouping Web pages together. The site’s semantics are exploited to semi-automatically generate interesting queries for usage mining, and to create visualizations of the resulting usage paths. Web pages are treated as instances of a higher-level concept, reflecting content (type of object the page describes), structure (function of pages in object search) or service (type of search functionality chosen by the user).

Mobasher et al. present in [DM02] a general framework where domain ontologies are used for automatically characterizing the usage and content profiles defined in their previous work [MD+00b]. In this work they concentrate on the creation of the ontology as a representation of groups of objects having a homogeneous concept structure, such as similar properties and data types, rather than on explaining how the content is finally characterized using those concepts. In later work, Nakagawa and Mobasher [NM03] present a framework for a hybrid Web personalization system that can intelligently switch among different recommendation models, based on the degree of connectivity and the current location of the user within the site. The hybrid system selects less constrained models, such as frequent itemsets, when the user is navigating portions of the site with a higher degree of connectivity, while sequential recommendation models are chosen for deeper navigational depths and lower degrees of connectivity.

Holland et al. [HEK03] propose an approach to mining user preferences for personalized applications, modeling them as strict partial orders with “A is better than B” semantics. They present Preference Mining techniques for detecting such strict partial order preferences in user log data. The results of this process have strong semantic expressiveness. The Preference Mining implementation uses sophisticated SQL statements to execute all data-intensive operations on the database layer; therefore, the proposed algorithms are also scalable. This approach is beneficial for personalized e-applications and can improve customer service quality.

Jin et al. [JZM04] propose a unified framework based on Probabilistic Latent Semantic Analysis to create models of Web users, taking into account both the navigational usage data and the Web site content information. The proposed model is based on a set of discovered latent factors that “explain” the underlying relationships among pageviews in terms of their common usage and their semantic relationships. Based on the discovered models, they propose algorithms for characterizing Web user segments, in order to provide dynamic and personalized recommendations.

Acharyya and Ghosh [AG03] propose a general framework for modeling users whose surfing behaviour is dynamically governed by their current topic of interest. Such an approach allows a modeled surfer to behave differently on the same page, depending on his situational context. The proposed methodology involves mapping each visited page to a topic or concept, imposing a tree hierarchy (taxonomy) on these topics, and then estimating the parameters of a semi-Markov process defined on this tree based on the observed transitions among the underlying visited pages. The semi-Markovian assumption imparts additional flexibility by allowing for non-exponential state re-visit times, and the concept hierarchy provides a nice way of capturing context and user intent. They also prove that the proposed approach is computationally much less demanding as compared to the alternative approach of using higher order Markov models.

The idea of enhancing usage mining by registering the user behavior in terms of an ontology is described in [OB+03]. This framework is based on a Web site having an underlying ontology; the Web logs are semantically enriched with ontology concepts. Data mining may then be performed on these semantic Web logs to extract knowledge about groups of users, users’ preferences and rules. The authors perform this process on a knowledge portal, exploiting its inherent RDF annotations. This framework is Web mining- rather than Web personalization-oriented; therefore, no further processing is performed.

Meo et al. [ML04] also proposed the creation of conceptual logs, integrating the usual information about user requests with meta-data concerning the Web site structure. The logs are in XML format, produced by Web applications specified with the WebML conceptual model. After creating the conceptual logs, they apply a data mining language (MINE RULE) in order to identify different types of patterns, such as recurrent navigation paths, page contents most frequently visited, and anomalies such as intrusion attempts. They also suggest that the use of queries in advanced languages, as opposed to ad-hoc heuristics, eases the specification and discovery of a large spectrum of patterns.

Finally, Eirinaki et al. [EVV03] present SEWeP, a Web personalization system that combines usage mining and content semantics. This system utilizes Web content and structure mining methods in combination with a taxonomy to create semantics for previously unannotated Web pages. This knowledge is subsequently fed to Web usage mining algorithms in order to personalize the site. The (semantic) recommendations that are produced by this method are enriched with documents that fall under the same cluster as the ones that would otherwise be presented to the end user. In their later work [EL+04], they extend the set of recommendations to include category-based ones, i.e. recommendations on general categories (terms of the taxonomy) as well. They also propose an automatic translation method for processing multilingual content. After experimentation using all three sets of recommendations provided by SEWeP (original, semantic and category-based), they conclude that a hybrid model, including recommendations from all sources, performs better in terms of user satisfaction.

6.2.4 Semantic Web Sites and Ontologies


As already mentioned, Tim Berners-Lee predicted in 2001 that the Web will evolve toward what he calls the Semantic Web [BHL01]. He envisions that in the future the vast amount of information in the Web will bear machine-readable metadata, resulting in computers being able to manipulate this content automatically, without human intervention. These metadata will be part of an underlying structure, more specifically an ontology, i.e. a hierarchical set of concepts and a set of inference rules. The challenge of the Semantic Web, according to him “is to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web”.
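As a hedged illustration of such machine-readable page metadata (a minimal sketch using the third-party rdflib library and the Dublin Core vocabulary mentioned earlier; the page URL and property values are made up), a page description can be expressed as a small RDF graph:

```python
# Minimal sketch: attach machine-readable metadata to a Web page using RDF and
# the Dublin Core vocabulary (hypothetical page URL and values). Requires rdflib.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

graph = Graph()
page = URIRef("http://www.example.org/papers/webmining.html")  # hypothetical page

graph.add((page, DC.title, Literal("Web Mining: A Roadmap")))
graph.add((page, DC.creator, Literal("M. Eirinaki")))
graph.add((page, DC.subject, Literal("Web mining")))
graph.add((page, DC.date, Literal("2004")))

# Serialize the metadata so that software agents, not only humans, can process it.
print(graph.serialize(format="turtle"))
```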

Therefore, in the past few years many research teams have investigated this area and built systems for supporting the creation and maintenance of Semantic Web sites, as well as methods and tools for processing ontologies, in terms of creation, merging, updating etc. Moreover, problems such as the diversity of domain-specific ontologies and the need for interoperability are addressed. In what follows, we overview the most important of these efforts.

6.2.4.1 Research Initiatives

Berendt et al. [BHS02] presented an overview of the ways the Semantic Web and Web Mining areas can benefit from each other. Web mining techniques can be applied to create the Semantic Web in terms of semi-automatic creation of domain ontologies; on the other hand, knowledge in the form of ontologies and other semantic structures can be used to improve the results of Web mining. In this work, we adopt several of these research directions and further develop them. We use Web content and structure mining to derive semantics out of the Web site’s pages. These semantics are in turn used to enhance the Web usage logs. The enhanced Web logs, called C-logs, are then used as input to the Web mining process, resulting in the creation of a broader set of recommendations. The whole process, which bridges the gap between the two aforementioned areas, is firmly connected with the utilization of a domain-specific taxonomy (concept hierarchy).

Bozsak et al. [BE+02] presented KAON (the Karlsruhe Ontology and Semantic Web Tool Suite). KAON provides the metadata infrastructure needed for creating, using, and accessing Semantic Web services and sites. It includes tools for discovering, managing and presenting ontologies and metadata, implements a platform for applying Semantic Web services in e-commerce and B2B scenarios, and provides utilities for accessing semantic knowledge bases using OntoMat-REVERSE.

Stumme and Maedche [SM01] address the ontology interoperability problem. There may exist several different ontologies, created by different editors using multiple parameters (e.g. an enterprise’s sections). If data need to be exchanged, there should exist methods for making the local ontologies communicate with each other. For that reason, they implemented the FCA-Merge algorithm, which provides a structural definition of the ontology merging procedure. The algorithm is based on Formal Concept Analysis and is now part of the KAON platform.

Another approach in the Semantic Web area is that of Soldar and Smith [SS02]. In their work, they try to describe data files in a machine-readable way, using RDF. They propose an architecture for describing, processing, retrieving and extracting semantic data from certain scientific areas, distinguishing between Instance Metadata, referring to RDF structures, and Schema Metadata, referring to ontologies. After extracting semantic metadata, they build an ontology that is used for creating file description templates. The infrastructure of this architecture is based on the Semantic Retrieval Model.


Patel et al. [PS+03] proposed OntoKhoj, a Semantic Web portal that is designed to simplify the Ontology Engineering process. The methodology behind OntoKhoj is based on algorithms used for searching, aggregating, ranking and classifying ontologies in the Semantic Web. The proposed portal allows agents and engineers to retrieve trustworthy, authoritative knowledge, as well as to expedite the process of ontology engineering through extensive reuse of ontologies.

Finally, Becker et al. [BB+03] addressed the issue of conceptual modeling of semantic navigation structures. They claim that the emergence of the Semantic Web raises new issues for Web information engineers and that conceptual modeling is an approach for dealing with high complexity and diversified project teams. In their work, they present an approach to modeling semantic navigation structures, aiming at complex, role-based and integrated navigation structures for structured and semi-structured data. They claim that comprehensible and useful navigation structures can be derived from existing structures and hierarchies in organizations, achieving an alignment of information systems with the information needs of the users. Their approach, MoSeNa, also supports emergent technologies for the Semantic Web.

6.3 Commercial Products

In this section we present some of the most representative products that are commercially available. We distinguish two main target areas, that of Web searching and content mining, and that of Web usage mining. We should point out, however, that the commercially available products do not include algorithms as sophisticated as those of the research initiatives. Their main purpose is to serve as supportive tools for enterprises; therefore, their main objective is performance.

6.3.1 Web Searching and Web Content Mining Software

• Digital Information Gateway (DIG) [DIG] is an advanced information sharing and retrieval solution for searching database servers, documents, Web pages, and e-mail servers from a common user interface.

• Information Crawler [IC] provides free information retrieval tools and services.

• miner 3D [M3D] is a tool for visual Web searching and info-immersioning.

• mnoGoSearch [MGS] is a full-featured open source search engine for Internet and Intranet sites.

• MyNet-Anywhere [MNA] collects data and information from the Internet and stores it in a structured form on the individual user’s machine.

• Navagent Surf 3D [NS3D] provides smart Web search and visualization.

• Sav Z Server [SZS] is a free Web-based object-relational database server implemented in Java.

• Web DataBase Connectivity [WDBC] allows standard SQL queries against Web pages.

6.3.2 Web Usage Mining Software


• Amadea Web Mining [AWM] includes multiple transformations, reports, and parametric and modular marketing indicators for effective CRM.

• ANGOSS KnowledgeWebMiner [AKWM] combines ANGOSS KnowledgeSTUDIO with proprietary algorithms for clickstream analysis, the Acxiom Data Network, and interfaces to web log reporting tools.

• Blue Martini Customer Interaction System’s Micro Marketing module [BM] collects clickstreams at the application server level, transforms them into the data warehouse, and provides mining operations.

• Clementine [CL] offers sequence association and clustering used for Web data analysis.

• ClickTracks [CT] displays visitor patterns directly on the pages of a website.

• Conversion Track [CTA] from Antisoft is a web log analysis tool that reports on visitor conversion ratios.

• Datanautics [DN], formerly known as Accrue, offers the G2 and Insight analytic solutions for on-line customer behavior.

• eNuggets [EN] is a real-time middleware that refines models automatically as new click-throughs occur.

• LiveStats [LS] from DeepMatrix provides sophisticated real-time log analysis featuring click paths, campaign tracking, keywords by search engine, geographic pinpointing and more.

• Lumio Re:cognition [LR], formerly known as MineIt EasyMiner, features cross-session analysis, clickstream analysis and cross-sales.

• Megaputer WebAnalyst [MWA] integrates the data and text mining capabilities of Megaputer’s analytical software directly into the website.

• The MicroStrategy Web Traffic Analysis module [WTAM], built on the MicroStrategy 7 platform, provides traffic highlights, content analysis, and Web visitor analysis reports.

• NetGenesis Web Analytics [NGWA] is another SPSS product.

• The NetTracker [NT] family, provided by Sane Solutions, contains powerful and easy-to-use Internet usage tracking programs.

• Nihuo Web Log Analyzer [NWLA] provides a comprehensive analysis of the “who, what, when, where and how” of the customers who visited a web site.

• Prudsys ECOMMINER [PE] provides combined clickstream and database analysis for e-commerce.

• SAS Webhound [SW] analyzes Web site traffic to answer questions such as who is visiting, how long they stay, and what the visitors are looking at.

• WebLog Expert 2.0 [WLE] is an easy-to-use and feature-packed web log analyzer.

• WebTrends [WT] is a suite for data mining of web traffic information.

• XAffinity [XA] identifies affinities or patterns in transaction and clickstream data.

• XML Miner [XM] data-mines XML code to find relationships and predict values using fuzzy-logic rules.

• 123LogAnalyzer [123L] is simple to use and provides high-speed processing, low disk space requirements, filtering and built-in IP mapping.

There also exist some freely available tools for Web mining. Analog [AN] is a free and fast program for analyzing any Web server logfiles. WebIC [WIC] is an effective “complete-Web” recommender system. Finally, WUM (Web Utilization Miner) [WUM] is an integrated, Java-based Web mining environment for log file preparation, basic reporting, discovery of sequential patterns, and visualization.

6.4 Conclusions

The World Wide Web is huge, universal, heterogeneous and unstructured. Web mining is a very broad research area trying to solve issues that arise due to the WWW phenomenon. In this work, after analyzing the three separate categories of Web mining, we tried to make a prediction concerning its future. The distinctions between the three axes of Web mining (especially of Web content and Web structure mining) are in many cases ambiguous. Therefore, in the past few years, most research efforts have concentrated on combining methods proposed in either one of them. This information is usually expressed using the expressive power of the Semantic Web backbone, namely ontologies. This representation closes the gap between the Semantic Web and Web Mining areas, creating a fast-emerging research area, that of Semantic Web Mining. The WWW will keep growing, even if in a somewhat different form than the one we know today. Therefore, the need for discovering new methods and techniques to handle the amounts of data existing in this universal framework will always exist.


REFERENCES

[AAS03] G. Adami, P. Avesani, D. Sona, Clustering Documents in a Web Directory, in Proceedings of the 5th International Workshop on Web Information and Data Management (ACM WIDM 03), September 2003

[AG03] S. Acharyya, J. Ghosh, Context-Sensitive Modeling of Web Surfing Behaviour Using Concept Trees, in Proceedings of the 5th WEBKDD Workshop, Washington, August 2003

[ANM04] M. Aktas, M. Nacar, F. Menczer, Personalizing PageRank Based on Domain Profiles, in Proceedings of the 6th WEBKDD Workshop, Seattle, August 2004

[ADW02] C. Anderson, P. Domingos, D. S. Weld, Relational Markov Models and their Application to Adaptive Web Navigation, in Proceedings of the 8th ACM SIGKDD Conference, Canada, August 2002

[B93] R.A. Botafogo, Cluster analysis for hypertext systems, in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (1993), 116-125

[Ber01] Bettina Berendt, Understanding Web usage at different levels of abstraction: coarsening and visualizing sequences, in Proceedings of the Mining Log Data Across All Customer TouchPoints Workshop (WEBKDD’01), August 2001, San Francisco, CA

[B02] B. Berendt, Using site semantics to analyze, visualize and support navigation, in Data Mining and Knowledge Discovery Journal (2002), 6:37-59

[BB+99] A.G. Buchner, M. Baumgarten, S.S. Anand, M.D. Mulvenna, J.G. Hughes, Navigation pattern discovery from Internet data, in Proceedings of the Web Usage Analysis and User Profiling Workshop (WEBKDD’99), August 1999, San Diego, CA

[BB+03] J. Becker, C. Brelage, K. Klose, M. Thygs, Conceptual modeling of semantic navigation structures: the MoSeNa-approach, in Proceedings of the 5th International Workshop on Web Information and Data Management (ACM WIDM 03), September 2003

[BEF84] J.C. Bezdek, R. Ehrlich, W. Full, FCM: Fuzzy C-Means Algorithm, Computers and Geosciences (1984)

[BE+02] E. Bozsak, M. Ehrig, S. Handschuh, A. Hotho, A. Maedche, B. Motik, D. Oberle, C. Schmitz, S. Staab, L. Stojanovic, N. Stojanovic, R. Studer, G. Sure, J. Tane, R. Volz and V. Zacharias, KAON – Towards a large scale Semantic Web, in Proceedings of EC-Web, France, 2002

[BG+99] D. Boley, M. Gini, R. Gross, E.H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Partitioning-based clustering for web document categorization, Decision Support Systems, 27(3) (1999) 329-341

[BH04] S. Bloehdorn, A. Hotho, Boosting for Text Classification with Semantic Features, in Proceedings of the 6th WEBKDD Workshop, Seattle, August 2004

[BHL01] T. Berners-Lee, J. Hendler, O. Lassila, The semantic Web, Scientific American 284, 5 (2001), 34-43


[BHS02] B. Berendt, A. Hotho, G. Stumme, Towards Semantic Web Mining, in Proceedings of 1st International Semantic Web Conference (ISWC 2002)

[BK+00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener, Graph Structure in the Web, Computer Networks (2000), 33(1-6):309-320, Proceedings of the 9th International World Wide Web Conference (WWW9)

[BL99] Jose Borges, Mark Levene, Data Mining of User Navigation Patterns, in Web Usage Analysis and User Profiling, published by Springer-Verlag as Lecture Notes in Computer Science, Vol. 1836, 92-111

[BP98] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks, 30(1-7): 107-117, 1998, Proceedings of the 7th International World Wide Web Conference (WWW7)

[BS91] R.A. Botafogo, B. Shneiderman, Identifying aggregates in hypertext structures, in Proceedings of the 3rd ACM Conference on Hypertext (1991) 63-74

[BS00] B. Berendt, M. Spiliopoulou, Analysing navigation behaviour in web sites integrating multiple information systems, The VLDB Journal (2000), 9(1):56-75

[BS04] R. Baraglia, F. Silvestri, An Online Recommender System for Large Web Sites, in Proceedings of ACM/IEEE Web Intelligence Conference (WI’04), China, September 2004

[BM+02] B. Berendt, B. Mobasher, M. Nakagawa, M. Spiliopoulou, The impact of site structure and user environment on session reconstruction in web usage analysis, in Proceedings of 4th WEBKDD, July, 2002

[CD+98] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. Kleinberg, Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, in Proceedings of WWW7, 1998

[CD+99] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Mining the Link Structure of the World Wide Web, IEEE Computer (1999) Vol.32 No.6

[CG96] S.F. Chen, J. Goodman, An Empirical Study of Smoothing Techniques for Language Modeling, in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics

[CG+97] C. Chekuri, M. Goldwasser, P. Raghavan, E. Upfal, Web Search Using Automatic Classification, in Proceedings of the 6th International Conference on the World Wide Web (WWW6), 1997

[CGS00] I. Cadez, S. Gaffney, P. Smyth, A general probabilistic framework for clustering individuals and objects, in Proceedings of the 6th ACM SIGKDD Conference, Boston, 2000

[CH+00] I. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Visualization of Navigation Patterns on a Web Site Using Model Based Clustering, in Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston MA, 2000, pp. 280-284

[CMS97] R. Cooley, B. Mobasher, J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97)


[CMS99] Robert Cooley, Bamshad Mobasher, Jaideep Srivastava, Data preparation for mining world wide Web browsing patterns, Knowledge and Information Systems, February 1999/Vol.1, No. 1

[CPY96] M.S. Chen, J.S. Park, P.S. Yu, Data Mining for Path Traversal Patterns in a Web Environment, in 16th International Conference on Distributed Computing Systems, pp. 385-392, May 1996

[CS95] A.E. Campbell, S.C. Shapiro, Ontological mediation: An overview, in Proceedings of the IJCAI Workshop on Basic Ontological Issues in Knowledge Sharing, 1995

[CS96] P. Cheeseman, J. Stutz, Bayesian Classification (AutoClass): Theory and Results, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press (1996), 153-180

[D77] D. Defays, An efficient algorithm for the complete link method, The Computer Journal 20 (1977) 364-366

[D01] I.S. Dhillon, Co-clustering documents and words using Bipartite Spectral Graph Partitioning, UT CS Technical Report # TR 2001-05

[DJ02] E. Desmontils, C. Jacquin, Indexing a Web Site with a Terminology Oriented Ontology, in I.F. Cruz, S. Decker, J. Euzenat et D.L. McGuinness, eds., The Emerging Semantic Web. IOS Press, pages 181-198 (2002)

[DC+04] L. Deng, X. Chai, Q. Tan, W. Ng, D. L. Lee, Spying Out Real User Preferences for Metasearch Engine Personalization, in Proceedings of the 6th WEBKDD Workshop, Seattle, August 2004

[DK01] M. Deshpande, G. Karypis, Selective Markov Models for Predicting Web-Page Accesses, in Proceedings of the 1st SIAM International Conference on Data Mining, 2001

[DM02] H. Dai, B. Mobasher, Using Ontologies to Discover Domain-Level Web Usage Profiles, in Proceedings of the Second Workshop on Semantic Web Mining, 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02), Helsinki, Finland (2002)

[DPH98] S. Dumais, J. Platt, D. Heckerman, Inductive Learning Algorithms and Representations for Text Categorization, CIKM 98, Bethesda MD, USA, pp. 148-155

[E96] O. Etzioni, The world wide web: Quagmire or gold mine?, Communications of the ACM, 39(11):65-68, 1996

[EGP02] P. Edwards, G.A. Grimnes, A. Preece, An Empirical Investigation for Learning from the Semantic Web, in Proceedings of the 2nd Semantic Web Mining Workshop, Finland, 2002

[EH81] B.S. Everitt, D.J. Hand, Finite Mixture Distributions. London: Chapman and Hall (1981)

[EL+04] M. Eirinaki, C. Lampos, S. Paulakis, M. Vazirgiannis, Web Personalization Integrating Content Semantics and Navigational Patterns, to appear in Proceedings of the 6th International Workshop on Web Information and Data Management (ACM WIDM 04), November 2004

[EV03] M. Eirinaki, M. Vazirgiannis, Web Mining for Web Personalization, in ACM Transactions on Internet Technology (TOIT), 3(1), February 2003, 1-29


[EVV03] M. Eirinaki, M. Vazirgiannis, I. Varlamis, SEWeP: Using Site Semantics and a Taxonomy to Enhance the Web Personalization Process, in Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’03), Washington DC, August 2003

[EW86] A. El-Hamdouchi, P. Willett, Comparison of hierarchic agglomerative clustering methods for document retrieval, The Computer Journal 32, 1989

[FS92] H.P. Frei, D. Stieger, Making Use of Hypertext Links when Retrieving Information, in Proceedings of the 4th ACM Conference on Hypertext ECHT’92, pp. 102-111

[G72] Eugene Garfield, Citation analysis as a tool in journal evaluation, Science 178, 1972

[G93] T. Gruber, A translation approach to portable ontologies, Knowledge Acquisition, 5(2) (1993) 199- 220

[GK98] D. Gibson, J. Kleinberg, P. Raghavan, Inferring Web Communities from Link Topology, in the Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, 1998

[GT+02] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, G. W. Flake, Using Web Structure for Classifying and Describing Web Pages, WWW 2002 Conference, Hawaii, USA (2002)

[Google] http://www.google.com

[H02] J. Hynek, Document Classification in a Digital Library, Technical Report, University of West Bohemia, Department of Computer Science and Engineering, DCSE/TR-2002-04

[HEK03] S. Holland, M. Ester, W. Kießling, Preference Mining: A Novel Approach on Mining User Preferences for Personalized Applications, in Proceedings of 7th PKDD Conference, September 2003

[HF04] G. Hooker, M. Finkelman, Sequential Analysis for Learning Modes of Browsing, in Proceedings of the 6th WEBKDD Workshop, Seattle, August 2004

[HG98] S. Haas, E. Grams, Page and Link Classifications: Connecting Diverse Resources, Digital Libraries 98, Pittsburgh PA, USA, ACM 1998, pp. 99-107

[HG+02] T.H. Haveliwala, A. Gionis, D. Klein, P. Indyk, Evaluating Strategies for Similarity Search on the Web, in Proc. of WWW11, Hawaii, USA, May 2002

[HN+01] Zhexue Huang, Joe Ng, David W. Cheung, Michael K. Ng, Wai-Ki Ching, A Cube Model for Web Access Sessions and Cluster Analysis, in Proceedings of the Mining Log Data Across All Customer TouchPoints Workshop (WEBKDD’01), August 2001

[HN+03] M. Halkidi, B. Nguyen, I. Varlamis, M. Vazirgiannis, THESUS: Organizing Web Document Collections Based on Link Semantics, to appear in VLDB Journal, special issue on Semantic Web (2003)

[HSD97] V. Harmandas, M. Sanderson, M.D. Dunlop, Image retrieval by hypertext links, in Proceedings of SIGIR-97

[JKW00] P. Ji, J. Kurose, B. Woolf, Student Behavioral Model Based Prefetching in Online Tutoring System, Umass CMPSCI Technical Report 01-27


[J99] T. Joachims, Making large-Scale SVM Learning Practical, Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT Press, 1999

[J01] T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, in Proceedings of the ACM Conference on Research and Development in Information Retrieval (SIGIR), 2001

[JF+97] T. Joachims, D. Freitag, T. Mitchell, WebWatcher: A Tour Guide for the World Wide Web, in Proc. of IJCAI97, August 1997

[JMF99] A.K. Jain, M.N. Murty, P.J. Flyn, Data Clustering: A Review, ACM Computing Surveys, Vol. 31, No. 3 (1999)

[JPT03] S. Jespersen, T. B. Pedersen, J. Thorhauge, Evaluating the Markov Assumption for Web Usage Mining, in Proceedings of the 5th International Workshop on Web Information and Data Management (ACM WIDM 03), September 2003

[JU99] R. Jasper, M. Uschold, A framework for understanding and classifying ontology applications, in IJCAI-99 Ontology Workshop, Stockholm, Sweden, July 1999

[JZM04] X. Jin, Y. Zhou, B. Mobasher, A Unified Approach to Personalization based on Probabilistic Latent Semantic Models of Web usage and Content, in Proceedings of AAAI Workshop on Semantic Web Personalization (SWP’04), July 2004

[K95] T. Kohonen, Self-organizing maps, Springer-Verlag, Berlin (1995)

[K99] J.M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM, 46(5):604-632, September 1999

[K99b] J.M. Kleinberg, Hubs, Authorities, and Communities, ACM Computing Surveys, 31(4), December 1999

[Kar99] The Kartoo System: http://www.kartoo.fr

[KB00] R. Kosala, H. Blockeel, Web Mining Research: A Survey, SIGKDD Explorations, 2(1):1-15, 2000

[KHK99] G. Karypis, E.H. Han, V. Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modelling. IEEE Computer, Vol. 32, No. 8 (1999) 68-75

[Kij97] M. Kijima, Markov Processes for Stochastic Modeling, Chapman & Hall, London, 1997

[KJ+01] Raghu Krishnapuram, Anupam Joshi, Olfa Nasraoui, Liyu Yi, Low-Complexity Fuzzy Relational Clustering Algorithms for Web Mining, in IEEE Transactions of Fuzzy Systems

[KR+99] R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Trawling the Web for Emerging Cyber-Communities, in Proceedings of the 8th WWW Conference (WWW8), 1999

[L95] H. Lieberman, Letizia: An agent that assists Web browsing, in Proc. of the 14th International Joint Conference on Artificial Intelligence (IJCAI95), Montreal, Canada, 1995


[L96] R.R. Larson, Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace, in Proceedings of the 1996 American Society for Information Science Annual Meeting (1996)

[L98] D. Lin, An Information-Theoretic Definition of Similarity, in Proceedings of 15th ICML (1998)

[L99] C. Looney, A Fuzzy Clustering and Fuzzy Merging Algorithm, Technical Report, CS-UNR-101-1999

[M98] D. Merkl, Text Data Mining, In: R. Dale, H. Moisl, H. Somers (eds.), A handbook of natural language processing: techniques and applications for the processing of language as text, Marcel Dekker, New York (1998)

[M99] D. Mladenic, Machine Learning used by Personal Webwatcher, in Proc. Of ACAI-99 Workshop on Machine Learning and Intelligent Agents, Chania, Crete, July 1999

[MB94] O. McBryan, GENVL and WWW: Tools for Taming the Web, in Proceedings of the 1st WWW Conference, Geneva, 1994

[MD+00a] B. Mobasher, H. Dai, T. Luo, Y. Sung, J. Zhu, Discovery of Aggregate Usage Profiles for Web Personalization, in Proceedings of the Web Mining for E-Commerce Workshop (WEBKDD'2000), August 2000, Boston

[MD+00b] B. Mobasher, H. Dai, T. Luo, Y. Sung, J. Zhu, Integrating Web Usage and Content Mining for More Effective Personalization, in Proceedings of the International Conference on E-Commerce and Web Technologies (ECWeb2000), September 2000, Greenwich, UK

[ML04] R. Meo, P.L. Lanzi, M. Matera, R. Esposito, Integrating Web Conceptual Modeling and Web Usage Mining, in Proceedings of the 6th WEBKDD Workshop, Seattle, August 2004

[MPT99] F. Masseglia, P. Poncelet, M. Teisseire, Using Data Mining Techniques on Web Access Logs to Dynamically Improve Hypertext Structure, in ACM SigWeb Letters, Vol. 8, N. 3, pp. 13-19, October 1999

[MS00] D. Modha, W.S. Spangler, Clustering hypertext with applications to web searching, in Proc. ACM Conference on Hypertext and Hypermedia (2000)

[Nor] The Northern line search engine: http://www.northernlight.com

[NC+03] O. Nasraoui, C. Cardona, C. Rojas, F. Gonzales, Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm, in Proceedings of the 5th WEBKDD Workshop, Washington, August 2003

[NKM03] A. Nanopoulos, D. Katsaros, Y. Manolopoulos, A Data Mining Algorithm for Generalized Web Prefetching, in IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, September/October 2003

[NM02] A. Nanopoulos, Y. Manolopoulos, Efficient Similarity Search for Market Basket Data, in the VLDB Journal, 2002

[NN04] C.L. Ngo, H.D. Nguyen, A Tolerance Rough Set Approach to Clustering Web Search Results, in Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2004), Italy, September 2004


[NP04] O. Nasraoui, M. Pavuluri, Complete this Puzzle: A Connectionist Approach to Accurate Web Recommendations based on a Committee of Predictors, in Proceedings of the 6th WEBKDD Workshop, Seattle, August 2004

[NP03] O. Nasraoui, C. Petenes, Combining Web Usage Mining and Fuzzy Inference for Website Personalization, in Proceedings of the 5th WEBKDD Workshop, Washington, August 2003

[OB+03] D. Oberle, B. Berendt, A. Hotho, J. Gonzalez, Conceptual User Tracking, to appear in Proceedings of the Atlantic Web Intelligence Conference (AWIC), Madrid, Spain (2003)

[OKA01] J. Ortega, M. Koppel, S. Argamon, Arbitrating Among Competing Classifiers Using Learned Referees, Knowledge and Information Systems (2001) 3: 470-490

[OV03] N. Oikonomakou, M. Vazirgiannis, A Review of Web Document Clustering approaches, in Proceedings of the NEMIS Launch Conference, International Workshop on Text Mining & its Applications, Patras, Greece, April 2003

[P03] G. Paaß, Text Classification of News Articles with Support Vector Machines, in Proceedings of the NEMIS Launch Conference, International Workshop on Text Mining & its Applications, Patras, Greece, April 2003

[PE00a] Mike Perkowitz, Oren Etzioni, Towards Adaptive Web Sites: Conceptual Framework and Case Study, in Artificial Intelligence 118[1-2] (2000), pp. 245-275

[PE00b] Mike Perkowitz, Oren Etzioni, Adaptive Web Sites, Communications of the ACM, August 2000/Vol. 43, No. 8, pp. 152-158

[PN76] G. Pinski, F. Narin, Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics, in Information Processing and Management. 12, 1976

[PPR96] P. Pirolli, J. Pitkow, R. Rao, Silk from a sow's ear: Extracting usable structures from the Web, in Proceedings of the ACM SIGCHI Conference on Human Factors in Computing, 1996

[PS+03] C. Patel, K. Supekar, Y. Lee, E.K. Park, OntoKhoj: A Semantic Web Portal for Ontology Searching, Ranking and Classification, in Proceedings of the 5th International Workshop on Web Information and Data Management (ACM WIDM 03), September 2003

[PW00] T. Phelps, R. Wilensky, Robust hyperlinks: Cheap, Everywhere, Now, in Proceedings of Digital Documents and Electronic Publishing (DDEP00), Munich, Germany, September 2000

[R79] C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)

[R92] E. Rasmussen, Clustering Algorithms. In Information Retrieval, W.B. Frakes & R. Baeza-Yates (eds.), Prentice Hall PTR, New Jersey (1992)

[R95] P. Resnik, Using Information Content to Evaluate Semantic Similarity in a Taxonomy, IJCAI-95, pages 448-453 (1995)

[R99] P. Resnik, Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language, Journal of Artificial Intelligence Research, 11 (1999), pp. 95-130


[RDF] Resource Description Framework (RDF), http://www.w3.org/RDF/

[RSM94] R. Richardson, A. Smeaton, J. Murphy, Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words, AICS Conference. Dublin (1994)

[RV03] M. Rajman, M. Vesely, From Text to Knowledge: Document Processing and Visualization: a Text Mining Approach, in Proceedings of the NEMIS Launch Conference, International Workshop on Text Mining & its Applications, Patras, Greece, April 2003

[S90] R. Schapire, The Strength of Weak Learnability, Machine Learning 5(2):197-227, 1990

[Sar00] R. Sarukkai, Link Prediction and Path Analysis Using Markov Chains, in Computer Networks, vol. 33, nos 1-6, pp. 377-386, June 2000

[SB98] G. Salton, C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and Management (1988), Vol. 24, pp. 513-523

[SC+00] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, January 2000/Vol. 1, Issue 2, pp. 12-23

[SH03] R. Sen, M. Hansen, Predicting a Web user’s next access based on log data, in Journal of Computational Graphics and Statistics, 12(1):143-155, 2003

[SM83] G. Salton, M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New-York (1983).

[SWY75] G. Salton, A. Wong, C. Yang, A vector space model for information retrieval, Journal of the American Society for Information Science, Vol. 18, pp. 613-620, 1975

[SCI] Science Citation Index, http://www.isinet.com/isi/products/citation/sci/

[SZ+97] C. Shahabi, A.M. Zarkesh, J. Adibi, V. Shah, Knowledge Discovery from Users' Web-Page Navigation, in Workshop on Research Issues in Data Engineering, Birmingham, England, 1997

[S73] R. Sibson, SLINK: an optimally efficient algorithm for the single link cluster method. The Computer Journal 16 (1973) 30-34

[SS02] G. Soldar, D. Smith, Semantic Web and Retrieval of Scientific Data Semantics, in Proceedings of the 2nd Semantic Web Mining Workshop, Finland, 2002

[SFW99] Myra Spiliopoulou, Lukas C. Faulstich, K. Winkler, A data miner analyzing the navigational behaviour of Web users, in Proceedings of the Workshop on Machine Learning in User Modelling of the ACAI99, Greece, July 1999

[SJM00] A. Strehl, J. Ghosh, R. Mooney, Impact of Similarity Measures on Web-page Clustering, in Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000)

[SM01] G. Stumme, A. Maedche, FCA-Merge: Bottom-Up Merging of Ontologies, in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI'01), Seattle, 2001

[SKK00] M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, in KDD Workshop on Text Mining (2000)

[UG96] M. Uschold, M. Gruninger, Ontologies: Principles, Methods, and Applications, in Knowledge Engineering Review, Vol. 11, No. 2, June 1996

[Viv] Vivisimo search engine: http://www.vivisimo.com/

[V95] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995

[VV+03] I. Varlamis, M. Vazirgiannis, M. Halkidi, B. Nguyen, THESUS: A Closer View on Web Content Management Enhanced with Link Semantics, to appear in IEEE Transactions on Knowledge and Data Engineering (TKDE)

[VI98] J. Veronis, N. Ide, Word Sense Disambiguation: The State of the Art, 1998

[V86] E.M. Voorhees, Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing & Management, 22 (1986) 465-476

[W3Clog] Extended Log File Format, http://www.w3.org/TR/WD-logfile.html

[WN] WordNet, A lexical database for the English language, http://www.cogsci.princeton.edu/~wn/

[WP94] Z. Wu, M. Palmer, Verb Semantics and Lexical Selection, in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133-138, 1994

[WV+96] R. Weiss, B. Velez, M. A. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, D.K. Gifford, HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering, in Proceedings of the 7th ACM Conference on Hypertext, Washington DC, 1996

[WYB98] K. Wu, P.S. Yu, A. Ballman, SpeedTracer: A Web usage mining and analysis tool, IBM Systems Journal, Vol. 37, No. 1, 1998

[XML] Extensible Markup Language, http://www.w3.org/XML/

[YH03] A. Ypma, T. Heskes, Categorization of web pages and user clustering with mixtures of Hidden Markov Models, in Proceedings of the 4th WEBKDD Workshop, Canada, 2002

[YL99] Y. Yang, X. Liu, A Re-examination of Text Categorization Methods, in Proceedings of SIGIR '99, Berkeley, CA, USA, August 1999, pp. 42-49

[YZ+96] T.W. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal, From User Access Patterns to Dynamic Hypertext Linking, in Proceedings of the Fifth International World Wide Web Conference (WWW5), Paris, France, 1996

[ZB04] Q. Zhao, S.S. Bhowmick, Mining History of Changes to Web Access Patterns, in Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2004), Italy, September 2004

[ZHH02] J. Zhu, J. Hong, J.G.Hughes, Using Markov Chains for Link Prediction in Adaptive Web sites, in Proceedings of the First International Conference on Computing in an Imperfect World, 2002

[ZXH98] Osmar R. Zaiane, Man Xin, Jiawei Han, Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs, in Proceedings of Advances in Digital Libraries Conference (ADL'98), Santa Barbara, CA, April 1998

WEB SITES

[AN] Analog, http://www.analog.cx

[AWM] Amadea Web Mining, http://alice-soft.com

[AKWM] Angoss KnowledgeWebMiner, http://www.angoss.com

[BM] Blue Martini Customer Interaction System, http://www.bluemartini.com

[CLE] Clementine, http://www.spss.com/clementine

[CT] ClickTracks, http://www.clicktracks.com

[CTA] Conversion Track, http://www.antssoft.com

[DN] Datanautics, http://www.datanautics.com

[DIG] Digital Information Gateway, http://www.visualanalytics.com

[EN] eNuggets, http://www.data-mine.com

[IC] Information Crawler, http://www.informationcrawler.com

[LS] LiveStats, http://www.deepmetrix.com

[LR] Lumio Re:Cognition, http://www.lumio.com

[MWA] Megaputer WebAnalyst, http://www.megaputer.com

[WTAM] MicroStrategy Web Traffic Analysis Module, http://www.microstrategy.com/software/Application/WTAM/

[M3D] miner 3D for Web, http://miner3d.com

[MGS] mnoGoSearch, http://mnoGoSearch.org

[MNA] MyNet-Anywhere, http://bluesoftware.tripod.com

[NS3D] Navagent Surf3D, http://www.navagent.com

[NGWA] NetGenesis Web Analytics, http://www.customersolutions.com

[NT] Net Tracker, http://www.sane.com/products/NetTracker

[NWLA] Nihuo Web Log Analyzer, http://www.loganalyzer.net

[PE] Prudsys ECOMMINER, http://www.prudsys.com/Produkte/Softwarepakete/Ecomminer

[SZS] Sav Z Server, http://savtechno.com

[SW] SAS Webhound, http://www.sas.com/solutions/hrmanagement/index.html

[WDBC] Web DataBase Connectivity, http://www.lotontech.com/wdbc.html

[WIC] WebIC, http://www.web-ic.com

[WLE] WebLog Expert, http://www.weblogexpert.com

[WUM] Web Utilization Miner, http://www.hypknowsys.com

[XA] XAffinity, http://www.xore.com

[XM] XML Miner, http://www.metadatamining.com

[123L] 123LogAnalyzer, http://www.123LogAnalyzer.com
