
Web Usage Mining and Pattern Discovery: A Survey Paper

By

Naresh Barsagade

CSE 8331

December 8, 2003



1. Introduction

Web technology is not evolving in comfortable and incremental steps, but it is turbulent,

erratic, and often rather uncomfortable. It is estimated that the Internet, arguably the

most important part of the new technological environment, has expanded by about

2000% and is doubling in size every six to ten months. In recent years, the advance in

computer and web technologies and the decrease in their cost have expanded the

means available to collect and store data. As an immediate consequence, the amount

of information (meaningful data) stored has been increasing at a very fast pace.

Traditional information analysis techniques are useful to create informative reports from

data and to confirm predefined hypotheses about the data. However, huge volumes of

data being collected create new challenges for such techniques as organizations look for

ways to make use of the stored information to gain an edge over competitors. It is

reasonable to believe that data collected over an extended period contains hidden

knowledge about the business or patterns characterizing customer profile and behavior.

With the rapid growth of the World Wide Web, the study of knowledge discovery on the web,

including modeling and predicting users' access to a web site, has become very important

[GO2003].

From the administration, business and application point of view, knowledge obtained

from the Web usage patterns could be directly applied to efficiently manage activities

related to e-Business, e-CRM, e-Services, e-Education, e-Newspapers, e-Government,

Digital Libraries, and so on [AR2003]. The Web is becoming a necessity for businesses

and organizations because of demand from their clients. Since web technology

largely feeds on ideas and knowledge rather than being dependent on fixed assets, it

gave birth to new companies such as Yahoo, Google, Netscape, e-Bay, e-Trade,

Survey Paper: Barsagade Page 2 of 27 4/23/2012



Expedia, Amazon and so on. With the large number of companies using the Internet to

distribute and collect information, knowledge discovery on the web has become an

important research area [JTP2002]. With the explosive growth of information sources

available on the World Wide Web, it has become necessary for organizations to discover

the usage patterns and analyze the discovered patterns to gain an edge over

competitors.

Jespersen et al. [JTB2002] proposed a hybrid approach for analyzing visitor click-stream

sequences. A combination of hypertext probabilistic grammar and the click fact table

approach is used to mine Web logs, and it could also be used for general sequence

mining tasks. Mobasher et al. [MCS1999] proposed a web personalization system,

which consists of offline tasks related to the mining of usage data and an online process of

automatic Web page customization based on the knowledge discovered. LOGSOM,

proposed by Smith et al. [SN2003], uses Kohonen's self-organizing map (SOM) to organize

web pages into a two-dimensional map based solely on users' navigation behavior, rather

than the content of the web pages. LumberJack, proposed by Chi et al. [CRHL2002], builds

up user profiles by combining clustering of user sessions with traditional statistical traffic

analysis, using the k-means algorithm. Joshi et al. [JJYK1999] used a relational online

analytical processing approach for creating a Web log warehouse using access logs and

mined logs. A comprehensive overview of web usage mining research is found in [SCDT2000,

CMS1997, CMS1999, RWC2000].

Web mining can be divided into three areas, namely web content mining, web structure

mining and web usage mining [SCDT2000]. Web Content mining focuses on discovery of

information stored on the Internet. Web Structure mining focuses on improvement in




structural design of a website. Web Usage mining, the main topic of this paper, focuses

on knowledge discovery from the usage of individual web sites.

The Global Internet Usage figures for average usage [NN2003] show the current usage around the

globe and in the United States.

Month of September 2003, Panel Type: Home

                                     September      August         % Change
Number of Sessions per Month         22             22             1.65
Number of Unique Domains Visited     55             54             0.89
Page Views per Month                 901            899            0.3
Page Views per Surfing Session       41             41             0
Time Spent per Month                 11:59:20       11:50:30       1.24
Time Spent During Surfing Session    0:32:29        0:32:37        -0.4
Duration of a Page Viewed            0:00:48        0:00:47        0.94
Active Internet Universe             252,672,070    253,054,814    -0.15
Current Internet Universe Estimate   419,054,724    416,339,888    0.65

United States: Average Web Usage

Month of October 2003, Panel Type: Home

Sessions/Visits Per Person                 71
Domains Visited Per Person                 103
PC Time Per Person                         80:46:37
Duration of a Web Page Viewed              0:01:00
Active Digital Media Universe              47,003,165
Current Digital Media Universe Estimate    51,012,930

The remainder of the paper is organized as follows. Section 2 contains applications of

web usage mining. Section 3 covers the basic web mining terminology, the

taxonomy of web mining, the architecture of web usage mining, and an explanation of the

individual components in that architecture. Section 4 summarizes the paper and

identifies several future research directions, and Section 5 contains the bibliography.

2. Applications of Web Usage Mining




Each of the applications can benefit from patterns that are ranked by subjective

interestingness.

Web usage mining is used in the following areas:

- Web usage mining offers users the ability to analyze massive volumes of
  clickstream or click-flow data, integrate the data seamlessly with transaction and
  demographic data from offline sources, and apply sophisticated analytics for web
  personalization, e-CRM and other interactive marketing programs.

- Personalization for a user can be achieved by keeping track of previously
  accessed pages. These pages can be used to identify the typical browsing
  behavior of a user and subsequently to predict desired pages.

- By determining frequent access behavior for users, needed links can be identified
  to improve the overall performance of future accesses.

- Information concerning frequently accessed pages can be used for caching.

- In addition to modifications to the linkage structure, identifying common access
  behaviors can be used to improve the actual design of Web pages and to make
  other modifications to the site.

- Web usage patterns can be used to gather business intelligence to improve
  customer attraction, customer retention, sales, marketing and advertisement,
  and cross sales.

- Mining of web usage patterns can help in the study of how browsers are used
  and of the user's interaction with a browser interface.

- Usage characterization can also look into navigational strategy when browsing a
  particular site.




- Web usage mining focuses on techniques that could predict user behavior while
  the user interacts with the Web.

- Web usage mining helps in improving the attractiveness of a Web site, in terms
  of content and structure.

- Performance and other service quality attributes are crucial to user satisfaction,
  and high-quality performance of a web application is expected.

- Web usage mining of patterns provides a key to understanding Web traffic
  behavior, which can be used to deal with policies on web caching, network
  transmission, load balancing, or data distribution.

- Web usage and data mining is also useful for detecting intrusion, fraud, and
  attempted break-ins to the system.

- Web usage mining can be used in:

    - e-Learning, e-Business, e-Commerce, e-CRM, e-Services, e-Education,
      e-Newspapers, e-Government, and Digital Libraries.

    - Customer Relationship Management, Manufacturing and Planning,
      Telecommunications and Financial Planning.

    - Physical Sciences, Social Sciences, Engineering, Medicine, and Biotechnology.

    - Counter Terrorism and Fraud Detection, and detection of unusual accesses to
      secure data.

- Web usage mining can be used in determination of common behaviors or traits
  of users who perform certain actions, such as purchasing merchandise.




- Web usage mining can be used in usability studies to determine the interface
  quality.

- Web usage mining can be used in network traffic analysis for determining
  equipment requirements and data distribution in order to efficiently handle site
  traffic.

3. Web Usage Mining and Pattern Discovery

Web usage mining is the application of data mining techniques to discover usage

patterns from Web data, in order to understand and better serve the needs of Web-based

applications [CMS1997]. Web usage mining consists of three phases, namely

preprocessing, pattern discovery, and pattern analysis. A high-level web usage mining

process is presented in Figure 1 [SCDT2000]. Mobasher et al. [CMS1997] propose that

the web mining process can be divided into two main parts. The first part includes the

domain dependent processes of transforming the Web data into suitable transaction

form. This includes preprocessing, transaction identification, and data integration

components. The second part includes some data mining and pattern matching

techniques such as association rules and sequential patterns. In the absence of cookies

or dynamically embedded session IDs in the URIs, the combination of IP address and user agent can be

used as a first-pass estimate of unique users. This estimate can be refined using the

referrer field as described in [CMS1999]. Some authors have proposed global

architectures to handle the web usage mining process. Cooley et al [CTS1999] proposed

a site information filter, named WebSIFT that establishes a framework for web usage

mining as shown in Figure 2. The WebSIFT performs the mining in distinct tasks.
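The first-pass estimate of unique users described above can be sketched as follows. This is only an illustration under stated assumptions: the key is taken to be the pair (IP address, user agent), and the function name `group_by_user` and the dictionary layout of a log entry are hypothetical, not part of WebSIFT or [CMS1999].

```python
from collections import defaultdict


def group_by_user(entries):
    """Group requested URLs by an assumed (IP address, user agent) key.

    Each entry is a dict with 'ip', 'agent' and 'url' keys (assumed layout).
    """
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry["url"])
    return dict(users)


# Two different browsers behind the same IP address count as two users.
entries = [
    {"ip": "192.0.2.10", "agent": "Mozilla/4.0", "url": "/index.html"},
    {"ip": "192.0.2.10", "agent": "Opera/7.0", "url": "/index.html"},
    {"ip": "192.0.2.10", "agent": "Mozilla/4.0", "url": "/products.html"},
]
users = group_by_user(entries)
```

Such a key still merges distinct visitors who share both a proxy and a browser version, which is why the text suggests refining the estimate with the referrer field.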




The WebSIFT system divides the web usage mining process into three main parts, as shown in

Figure 1. For a particular Web site, the three server logs (access, referrer, and agent, often

combined into a single log), the HTML files, template files, script files or databases that

make up the site content, and any optional data such as registration data or remote

agent logs provide the information to construct the different information abstractions.

The preprocessing phase uses the input data to construct a server session file based on

the method and heuristics discussed in [CMS1999]. In order to preprocess a server

log, the log must first be cleaned, which consists of removing unsuccessful requests,

parsing relevant CGI name/value pairs and rolling up file accesses into page views. Once

the log is converted into a list of page views, users must be identified. The preprocessing

phase allows the option of converting the server sessions

into episodes prior to performing knowledge discovery.

Figure 2: A General Architecture for Web Usage Mining

In this case, episodes are either all of the page views in a server session that the user

spent a significant amount of time viewing, or all of the navigation page views leading

up to each content page view. The details of how a cutoff time is determined for

classifying a page view as content or navigation are also contained in [CMS1999]. The

click-stream or click-flow for each user is divided into sessions based on a simple thirty-

minute timeout. The notion of what makes discovered knowledge interesting has been

addressed in [PT1998]. A survey of methods that have been used to characterize the

interestingness of discovered patterns is given in [HH1999]. Four dimensions used by

[HH1999] to classify interestingness measures are pattern-form, representation, scope,

and class. Pattern-form defines what type of patterns a measure is applicable to, such

as association rules or classification rules. The representation dimension defines the

nature of the framework, such as probabilistic or logical. Scope is a binary dimension

that indicates whether the measure applies to a single pattern, or to the entire discovered




set. The final dimension, class is also a binary dimension that can be labeled as

subjective or objective.
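The simple thirty-minute timeout mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of [CMS1999]: it assumes the timeout is applied to the gap between consecutive page views of one user, and the function name `split_sessions`, the `(timestamp, url)` tuple layout, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def split_sessions(page_views, timeout=timedelta(minutes=30)):
    """Split one user's (timestamp, url) page views into sessions.

    A new session starts whenever the gap to the previous page view
    exceeds the timeout (assumed interpretation of the timeout rule).
    """
    sessions = []
    for ts, url in sorted(page_views):
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, url))
        else:
            sessions.append([(ts, url)])
    return sessions


clicks = [
    (datetime(2003, 10, 1, 9, 0), "/index.html"),
    (datetime(2003, 10, 1, 9, 10), "/products.html"),
    (datetime(2003, 10, 1, 10, 5), "/index.html"),  # 55 minutes later
]
sessions = split_sessions(clicks)
```

With the sample data, the first two page views fall into one session and the third, arriving 55 minutes later, opens a second one.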

Preprocessing for the content and structure of a site involves assembling each page

view for parsing and/or analysis. Page views are accessed through HTTP requests by a

site crawler to assemble the components of the page view. This handles both static

and dynamic content. In addition to being used to derive a site topology, the site files

are used to classify the pages of a site. Both the site topology and page classification can

then be fed into the information filter. The knowledge discovery phase uses existing

data mining techniques to generate rules and patterns. Included in this phase is the

generation of general usage statistics, such as the number of hits per page, the pages most

frequently accessed, the most common starting page, and the average time spent on each

page.
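The general usage statistics listed above can be computed directly from a session file. The sketch below assumes a hypothetical in-memory layout in which each session is a list of `(url, seconds_viewed)` pairs; the function name `usage_statistics` and the result keys are illustrative only.

```python
from collections import Counter


def usage_statistics(sessions):
    """Compute simple usage statistics from sessions.

    sessions: list of sessions, each a list of (url, seconds_viewed) pairs
    (assumed layout, not a format defined in the paper).
    """
    hits = Counter()
    start_pages = Counter()
    total_time = Counter()
    for session in sessions:
        if not session:
            continue
        start_pages[session[0][0]] += 1
        for url, seconds in session:
            hits[url] += 1
            total_time[url] += seconds
    return {
        "hits_per_page": dict(hits),
        "most_frequent_page": max(hits, key=hits.get, default=None),
        "most_common_start_page": max(start_pages, key=start_pages.get, default=None),
        "avg_seconds_per_page": {u: total_time[u] / hits[u] for u in hits},
    }


sessions = [
    [("/index.html", 30), ("/products.html", 60)],
    [("/index.html", 10), ("/contact.html", 20)],
    [("/products.html", 40), ("/index.html", 20)],
]
stats = usage_statistics(sessions)
```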

The WebSIFT system performs the mining in distinct stages. The first stage is preprocessing, in

which user sessions are inferred from log data. The second searches for patterns in the

data by making use of standard data mining techniques, such as association rules or

mining for sequential patterns. In the third stage, an information filter based on domain

knowledge and the web site structure is applied to the mined patterns in search of

interesting patterns. Links between pages and the similarity between contents of

pages provide evidence that the pages are related. This information is used to identify

interesting patterns; for example, itemsets that contain pages not directly connected are

declared interesting. Mobasher et al. [MCS1999] propose to group the

itemsets obtained by the mining stage into clusters of URL references. These clusters are

aimed at real time web page personalization. A hypergraph is inferred from the mined

itemsets where the nodes correspond to pages and the hyperedges connect pages in an

itemset. The weight of a hyperedge is given by the confidence of the rules involved. The

graph is subsequently partitioned into clusters and an occurring user session is matched

against such clusters. For each URL in the matching clusters a recommendation score is

computed and the recommendation set is composed of all the URLs whose

recommendation score is above a specified threshold.

In Buchner et al. [BBAMH1999], a new approach, in the form of a process, is proposed to

find marketing intelligence from Internet data. An n-dimensional web log data cube is

created to store the collected data. Domain knowledge is incorporated into the data

cube in order to reduce the pattern search space. They proposed an algorithm to extract

navigation patterns from the data cube. The patterns conform to pre-specified

navigation templates whose use enables the analyst to express his knowledge about the

field and to guide the mining process. This model does not store the log data in compact

form, and that can be a major drawback when handling very large daily log files.

Information on how customers are using a Web site is critical for marketers of electronic

commerce businesses. Buchner et al [BM1998] have presented a knowledge discovery

process in order to discover marketing intelligence from Web data. They define a Web

log data hypercube that consolidates Web usage data along with marketing data for

electronic commerce applications. Four distinct steps are identified in the customer

relationship life cycle that can be supported by their knowledge discovery techniques:

customer attraction, customer retention, cross sales and customer departure.

Masseglia et al. [MPC1999] proposed an integrated tool for mining access patterns

and association rules from log files. The techniques implemented pay particular attention

to the handling of time constraints, such as the minimum and maximum time gap

between adjacent requests in a pattern. The system provides a real time generator of




dynamic links, which aims at automatically modifying the hypertext organization when

user navigation matches a previously mined rule.

Fundamental methods of data cleaning and preparation have been well studied by

Srivastava et al. [SCDT2000]. The main techniques traditionally used for modeling usage

patterns in a Web site are collaborative filtering (CF), clustering pages or user sessions,

association rule generation, sequential pattern generation and Markov Models. The

prediction step is the real-time processing of the model, which considers the active user

session and makes recommendations based on the discovered patterns. The time spent

on a page is a good measure of the user's interest in that page, providing an implicit

rating for it [GO2003]. If a user is interested in the content of a page, she will likely

spend more time there compared to the other pages in her session. The authors of [GO2003]

presented a new model that uses both the sequence of visited pages and the time spent on

those pages, which reflects the structural information of a user session and handles

two-dimensional information.

Data preprocessing consists of data filtering, user identification, session/transaction

identification, and topology extraction. Data filtering filters out some noise, i.e.,

unsuccessful requests, automatically downloaded graphics, or requests from robots, to

get more compact training data. Users are currently identified with heuristic rules based on,

for example, IP addresses or cookies. Preprocessing consists of converting the usage,

content, and structure information contained in the various available data sources into

the data abstractions necessary for pattern discovery.
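The data filtering step described above can be sketched as follows. The suffix list, the robot markers, the treatment of 2xx and 3xx status codes as successful, and the dictionary layout of a log entry are all assumptions chosen for illustration; real sites need their own rules.

```python
# Assumed heuristics: file suffixes treated as automatically downloaded
# graphics, and substrings of the user agent treated as robot markers.
GRAPHIC_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_noise(entry):
    """Return True for entries that data filtering would drop.

    entry: dict with 'url', 'status' (int) and 'agent' keys (assumed layout).
    """
    if not 200 <= entry["status"] < 400:
        return True  # unsuccessful request
    path = entry["url"].split("?", 1)[0].lower()
    if path.endswith(GRAPHIC_SUFFIXES):
        return True  # embedded graphic, not a page view
    if path == "/robots.txt":
        return True  # only robots are expected to request this file
    agent = entry["agent"].lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def clean_log(entries):
    """Keep only the entries that are not classified as noise."""
    return [e for e in entries if not is_noise(e)]


entries = [
    {"url": "/index.html", "status": 200, "agent": "Mozilla/4.0"},
    {"url": "/logo.gif", "status": 200, "agent": "Mozilla/4.0"},
    {"url": "/missing.html", "status": 404, "agent": "Mozilla/4.0"},
    {"url": "/index.html", "status": 200, "agent": "Googlebot/2.1"},
    {"url": "/products.html?id=7", "status": 200, "agent": "Mozilla/4.0"},
]
cleaned = clean_log(entries)
```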

Usage Preprocessing: Usage preprocessing works on data that describes the pattern of usage of Web pages, such as IP

addresses, page references, and the date and time of accesses [SCDT2000]. Typically,

the usage data comes from an Extended Common Log Format (ECLF) server log

[RWC2000].

Content Preprocessing: Content preprocessing consists of converting the text, images, scripts, and multimedia data into forms that are useful for the web usage

mining process. Often this consists of performing content mining such as classification

or clustering. In the context of web usage mining, the content of Web sites can be used

to filter the input to the pattern discovery algorithms [SCDT2000].

Structure Preprocessing: Web structure mining analyses the link structure of the web in order to identify relevant documents [SCDT2000]. The structure of a site is

created by the hypertext links between page views. Intra-page structure information

includes the arrangement of various HTML or XML tags within a given page. The

principal kind of inter-page structure information is hyper-links connecting one page to

another. The Google Search engine [GOO] makes use of the web link structure in the

process of determining the relevance of a page. The Google search engine achieves

good results because, while the keyword similarity analysis ensures high precision, the

use of a probability measure ensures high quality of the pages returned.

The information provided by the data sources listed above can be used to construct a

data model consisting of several data abstractions, notably users, page views, click-

streams, server sessions, and episodes [RWC2000]. A page view is defined as all of the

files that contribute to the client-side presentation seen as the result of a single mouse

click of a user. A click-stream is then the sequence of page views that are accessed by

a user. A server session is the click-stream for a single visit of a user to a

Web site. Finally, an episode is a subset of page views from a server session. Data can




be collected at the server level, client level, or proxy level, or obtained from an

organization's database. Each type of data collection differs not only in terms of the

location of the data source, but also in the kinds of data available, the segment of

population from which the data was collected, and its method of implementation.

The usage data collected at the different sources, such as the server level, client level and

proxy level, represent the navigation patterns of different segments of the overall Web

traffic [SCDT2000].

Server-level Collection: A Web server log records the browsing behavior of site visitors [SCDT2000]. The data recorded in server logs reflect the concurrent and

interleaved access of a Web site by multiple users. These log files can be stored in

various formats such as Common Log Format (CLF) or Extended Common Log Format

(ECLF). ECLF contains client IP address, User ID, time/date, request, status, bytes,

referrer, and agent. Tracking of individual users is not an easy task due to the stateless

connection model of the HTTP protocol. In order to handle this problem, Web servers

can also store other kinds of usage information such as cookies in separate logs, or

appended to the CLF or ECLF logs. Cookies are tokens generated by the Web server for

individual client browsers in order to automatically track the site visitors. Packet sniffing

technology (also referred to as network monitors) is an alternative to

collecting usage data through server logs. Packet sniffers monitor network traffic coming

to a Web server and extract usage data directly from TCP/IP packets. Besides usage

data, the server side also provides access to the site files, e.g. content data,

structure information, local databases, and Web page meta-information such as the size

of a file and its last modified time.
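The ECLF fields listed above can be extracted with a regular expression. The sketch below assumes the log lines follow the widely used "combined" layout (host, identity, user ID, bracketed time stamp, quoted request, status, bytes, quoted referrer, quoted agent); the pattern, the function name `parse_eclf_line`, and the sample line are illustrative and may need adjusting for a particular server configuration.

```python
import re
from datetime import datetime

# Assumed "combined"-style layout for one ECLF log line.
ECLF_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_eclf_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = ECLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A dash in the bytes field means no body was sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


line = (
    '192.0.2.10 - - [08/Dec/2003:10:15:32 -0600] "GET /index.html HTTP/1.0" '
    '200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98)"'
)
entry = parse_eclf_line(line)
```

Returning `None` for lines that do not match lets the caller count and inspect malformed entries instead of failing on the first one.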




Client-level Collection: Client-side collection can be implemented by using a remote agent (such as JavaScript or Java applets) or by modifying the source code of an

existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities

[SCDT2000].

Proxy-level Collection: The Internet Service Provider (ISP) machine that

users connect to through a modem is a common form of proxy server. A web proxy acts

as an intermediary between client browsers and Web servers. Proxy-level caching can

be used to reduce the loading time of a Web page experienced by users as well as

the network traffic load at the server and client sides.

Pattern Discovery: Pattern discovery uses methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition

[SCDT2000]. Zaiane et al. [ZXH1998] proposed the use of On-Line Analytical Processing

(OLAP) technology in web usage mining. OLAP and the data cube structure offer a

highly interactive and powerful data retrieval and analysis environment. The knowledge

that can be discovered is represented in the form of rules, tables, charts, graphs, and

other visual presentation forms for characterizing, comparing, predicting, or classifying

data from the web access log. Visualization can also be used in web usage mining;

it presents the data in a way that users can understand more easily.

Statistical Analysis: Statistical techniques are the most common method to extract knowledge about visitors to a web site. By analyzing the session file, one can perform

different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on

variables such as page views, viewing time and length of a navigational path. For

example, e-Trade developed a German-language website for Germany and scrapped it

because German visitors were using the English site rather than the German site. Many

web traffic analysis tools produce a periodic report containing statistical information




such as the most frequently accessed pages, average view time of a page or average

length of a path through a site. This type of knowledge can be potentially useful for

improving the system performance, enhancing the security of the system, facilitating the

site modification task, and providing support for marketing decisions. Many

commercial tools are available for statistical analysis.

Association Rules: Association rule generation can be used to relate pages that are most often referenced together in a single server session [SCDT2000]. In the context

of web usage mining, association rules refer to sets of pages that are accessed together

with a support value exceeding some specified threshold. Association rule mining has

been well studied in Data Mining, especially for basket transaction data analysis. Many

association rule algorithms have been used, such as Apriori and Partition [MHD2003]. Aside

from being applicable to e-Commerce, business intelligence and marketing applications,

association rules can help web designers restructure their web sites. Results on the usefulness

of such rules in supermarket transactions or in web applications have not been reported.

Constraints can also be placed on the mining process, and the extracted rules can be

pruned. The association rules may also serve as a heuristic for prefetching documents in

order to reduce user-perceived latency when loading a page from a remote site. In

electronic CRM, an existing customer can be retained by dynamically creating web offers

based on associations with threshold support and/or confidence value [BM1998].
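The support and confidence thresholds mentioned above can be illustrated with a small sketch. It counts page pairs by brute force rather than implementing Apriori or Partition, and it is limited to rules with a single page on each side; the function name `pair_rules`, the default thresholds, and the sample sessions are assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules between page pairs.

    support    = fraction of sessions containing both pages
    confidence = sessions containing both pages / sessions containing lhs
    """
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = sorted(set(session))  # each page counts once per session
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


sessions = [
    ["/index.html", "/products.html", "/order.html"],
    ["/index.html", "/products.html"],
    ["/index.html", "/contact.html"],
    ["/products.html", "/order.html"],
]
rules = pair_rules(sessions)
```

In the sample data every session that contains `/order.html` also contains `/products.html`, so that rule has confidence 1.0 at support 0.5.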

Clustering: Clustering is a technique to group together a set of items having similar characteristics [SCDT2000]. Clustering can be performed on either the users or the page

views. Clustering analysis in web usage mining intends to find clusters of users, pages,

or sessions from the web log file, where each cluster represents a group of objects with

common interests or characteristics. User clustering is designed to find user groups




that have common interests based on their behaviors, and it is critical for user

community construction. Page clustering is the process of clustering pages according to

the users' access to them. Knowledge of user clusters is especially useful for inferring user

demographics in order to perform market segmentation in e-Commerce applications or

provide personalized web content to the users. On the other hand, clustering of pages

will discover groups of pages having related content. This information is useful for

Internet search engines and Web assistance providers. In both applications, permanent

or dynamic HTML pages can be created that suggest related hyperlinks to the user

according to the user's query or past history of information needs. The intuition is that if

the probability of visiting one page, given that another page has also been visited, is high, then

the two pages can be grouped into one cluster. For session clustering, all the sessions are

processed to find some interesting session clusters. Each session cluster may be one

interesting topic within the web site. Mobasher et al [MCS1999] generated

recommendations from URL clusters to build an adaptive web site by using ARHP

(Association Rule Hypergraph Partitioning).
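The conditional-probability intuition behind page clustering described above can be sketched as follows. This is not ARHP or any published algorithm: it simply estimates, from sessions, the probability that one page is visited given that another is visited, and reports pairs for which the probability is high in both directions. The function names, the threshold of 0.8, and the sample sessions are assumptions.

```python
from collections import Counter
from itertools import permutations


def conditional_visit_probability(sessions):
    """Return {(a, b): P(a visited | b visited)} estimated over sessions."""
    page_sessions = Counter()
    joint = Counter()
    for session in sessions:
        pages = set(session)
        page_sessions.update(pages)
        # Ordered pairs, so both (a, b) and (b, a) are counted.
        joint.update(permutations(sorted(pages), 2))
    return {(a, b): joint[(a, b)] / page_sessions[b] for (a, b) in joint}


def related_pairs(sessions, threshold=0.8):
    """Page pairs whose conditional probability is high in both directions."""
    prob = conditional_visit_probability(sessions)
    return sorted(
        {
            tuple(sorted(pair))
            for pair, p in prob.items()
            if p >= threshold and prob[(pair[1], pair[0])] >= threshold
        }
    )


sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b"],
    ["/a", "/b", "/d"],
    ["/c", "/d"],
]
prob = conditional_visit_probability(sessions)
related = related_pairs(sessions)
```

Pairs found this way could serve as seeds that a real clustering method then merges into larger groups of pages.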

Abraham et al. [AR2003] proposed an ant-clustering algorithm to discover web usage

patterns and a linear genetic programming approach to analyze the visitor trends. They

proposed a hybrid framework, which uses an ant colony optimization algorithm to cluster

Web usage patterns. The raw data from the log files are cleaned and preprocessed and

the ACLUSTER algorithm is used to identify the usage patterns. The developed clusters

of data are fed to a linear genetic programming model to analyze the usage trends.

WebCANVAS (Web Clustering Analysis and VisuAlization of Sequences) [CHMSW2003]

presented a new methodology for exploring and analyzing navigation patterns on a

web site. The patterns that can be analyzed consist of sequences of URL categories




traversed by users. In their approach, they first partitioned site users into clusters such

that users with similar navigation paths through the site are placed into the same cluster.

The clustering approach they employed was model-based (as opposed to

distance-based) and partitioned users according to the order in which they request web pages.

Another feature of their use of model-based clustering is that learning time scales

linearly with sample size. In contrast, agglomerative distance-based methods scale

quadratically with sample size.

The purpose of knowledge discovery from user profiles is to find clusters of similar

interests among the users [SZAS1997]. If the site is well designed, there will be a strong

correlation between the similarity of the navigation paths and the similarity of the users'

interests. Therefore, clustering of the former could be used to cluster the latter. The

definition of the similarity is application dependent. The authors provide an overview of a

powerful path clustering method called path mining. This approach is suitable for

knowledge discovery in databases with partial ordering in their data. In this method,

first a general path feature space is characterized. Then a similarity measure among the

paths over the feature space is introduced. Finally, this similarity measure is used for

clustering. They implemented the path-mining algorithm to cluster the

navigation paths detected by the profiler. This algorithm finds a scalar number as the

similarity among the paths. These similarity numbers could be fed to standard data-

mining algorithms to cluster the user interests.

Classification: Classification is the task of mapping a data item into one of several predefined classes [SCDT2000]. In Internet marketing, a customer can be classified

as no customer, visitor once, or visitor regular based on browsing patterns, and

rules can be discovered for attracting customers by displaying special offers [BM1998].




In the web domain, one is interested in developing a profile of users belonging to a

particular class or category. This requires extraction and selection of features that best

describe the properties of a given class or category. Classification can be done by using

supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian

classifiers, k-nearest neighbor classifiers, Support Vector Machines etc. For example,

classification on server logs may lead to the discovery of interesting rules such as: 30%

of users who placed an online order in /Product/Music are in the 18-25 age group and

live on the west coast. Classification algorithms such as C4.5, CART, BAYES, and

RIPPER can be used to predict whether a page is of interest to the user.

Sequential Patterns: The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another

item in a time-ordered set of sessions or episodes [SCDT2000]. A new algorithm MiDAS

(Mining Internet data for Associative Sequences) for discovering sequential patterns

from web log files has been proposed that provides behavioral marketing intelligence for

e-commerce scenarios [BBAMH1999]. MiDAS contains three phases: 1. The a priori phase is

the input data preparation, which consists of data reduction and data type substitution.

2. The discovery phase discovers the sequences of hits and generates the pattern tree. 3. The a

posteriori phase filters out all sequences that do not fulfill the criteria laid down in the

specified navigation templates and topology network; pruning is also done in this

phase. By using this approach, Web marketers can predict future visit patterns, which

will be helpful in placing advertisements aimed at certain user groups. Other types of

temporal analysis that can be performed on sequential patterns include trend analysis,

change point detection, or similarity analysis.
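
To make the idea of "one item followed by another" concrete, the sketch below counts how often page a is visited before page b and keeps the pairs that reach a minimum support. It is a deliberately simplified illustration, not the MiDAS algorithm: it only looks at page order within a single session, and the function name frequent_ordered_pairs and the sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_ordered_pairs(sessions, min_support):
    """Return {(a, b): support} for pages a visited before b in at least
    a `min_support` fraction of the sessions."""
    counts = Counter()
    for session in sessions:
        # combinations() preserves input order, so each pair is (earlier, later).
        # A set is used so that a pair counts at most once per session.
        pairs = {(a, b) for a, b in combinations(session, 2) if a != b}
        counts.update(pairs)
    total = len(sessions)
    return {pair: c / total for pair, c in counts.items() if c / total >= min_support}


# Hypothetical sessions, each a time-ordered list of requested pages.
SESSIONS = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/checkout"],
    ["/home", "/about"],
    ["/products", "/cart", "/checkout"],
]
```

With the sample data and a minimum support of 0.5, three ordered pairs survive, each appearing in two of the four sessions.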




Oyanagi et al. [OKN2002] explore issues in sequence mining of WWW access
logs. The Apriori algorithm is well known as a typical algorithm for
sequence pattern mining. However, it suffers from inherent difficulties in finding long
sequential patterns and in finding interesting patterns among a huge number of results.
The authors propose a new method for finding sequence patterns by matrix clustering.

This method decomposes a sequence into a set of sequence elements, each of which

corresponds to an ordered pair of items. Then matrix clustering is applied to extract a

cluster of similar sequences. The resulting sequence elements are composed into a
graph.

The Web Utilization Miner, WUM [SS1998], uses an efficient data structure called the
Aggregated Tree to store the user sessions, and it also provides a query language to
extract interesting patterns from the aggregated session data. WUM employs an
innovative technique for the discovery of navigation patterns over an aggregated
materialized view of the web log. After the classical preparation steps (i.e.,
user and session identification), the user sessions are merged into an Aggregated Tree,
which is a tree constructed by merging trails with the same prefix. WUM
provides a query language called MINT that lets users specify queries concerning
the content, structure, and statistics of navigation patterns. MINT supports the
specification of criteria of a statistical, structural, and textual nature. The WEBMINER tool
[CMS1997] provides a query language on top of external mining software for association
rules and for sequential patterns.
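
The description of the Aggregated Tree as "a tree constructed by merging trails with the same prefix" corresponds to a prefix tree with visit counts. The sketch below shows that idea only; it is not WUM's implementation and does not model MINT. The names Node, build_aggregated_tree, and support are hypothetical.

```python
class Node:
    """One page in the tree, with the number of trails that pass through it."""

    def __init__(self, page):
        self.page = page
        self.count = 0
        self.children = {}


def build_aggregated_tree(sessions):
    """Merge trails that share a prefix into a single tree of counted nodes."""
    root = Node(None)
    for trail in sessions:
        node = root
        node.count += 1  # the root counts every trail
        for page in trail:
            node = node.children.setdefault(page, Node(page))
            node.count += 1
    return root


def support(root, prefix):
    """Return the number of trails that start with the given page prefix."""
    node = root
    for page in prefix:
        node = node.children.get(page)
        if node is None:
            return 0
    return node.count
```

Because shared prefixes are stored once, a question such as "how many sessions began with page a and then went to page b" is answered by walking two edges rather than rescanning the log.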

Dependency modeling: Dependency modeling is another useful pattern discovery
task in web mining [SCDT2000]. The goal here is to develop a model capable of
representing significant dependencies among the various variables in the web domain.
As an example, one may be interested in building a model representing the different stages
a visitor undergoes while shopping in an online store, based on the actions chosen (i.e.,
from a casual visitor to a serious potential buyer). There are several probabilistic learning

techniques that can be employed to model the browsing behavior of users. Such

techniques include Hidden Markov Models and Bayesian Belief Networks. Modeling of

Web usage patterns will not only provide a theoretical framework for analyzing the

behavior of users but is potentially useful for predicting future Web resource

consumption. Such information may help develop strategies to increase the sales of

products offered by the Web site or improve the navigational convenience of users.

Borges and Levene [BL1999] proposed a formal data mining model, the Hypertext Probabilistic
Grammar (HPG), to capture user web navigation patterns. User sessions are represented
as an HPG whose higher-probability strings correspond to the navigation trails preferred by
the user. An HPG is a Markov model, which assumes that
the probability of a link being chosen depends more on the contents of the page being
viewed than on the entire previous history of the session [LL1999]. Note that this assumption
can be relaxed by making use of the N-gram concept or dynamic Markov chain
techniques. There are situations in which a Markov assumption is realistic, such as, for

example, an online newspaper where a user chooses which article to read in the sports

section independently of the contents of the front page. However, there are also cases

in which such assumption is not very realistic, such as, for example, an online tutorial

providing a sequence of pages explaining how to perform a given task.
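
The Markov assumption discussed above can be illustrated with a first-order model in which the probability of the next page depends only on the current page. The sketch below estimates these transition probabilities from sessions and scores a trail. It covers only this first-order piece and is not the full HPG formalism of [BL1999]; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive page pairs."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for page, followers in counts.items()
    }


def trail_probability(probs, trail):
    """Probability of following `trail` link by link, given its first page."""
    p = 1.0
    for current, nxt in zip(trail, trail[1:]):
        # A link never observed in the sessions gets probability 0.
        p *= probs.get(current, {}).get(nxt, 0.0)
    return p
```

Trails with a high probability under such a model play the role of the preferred navigation trails; an N-gram variant would condition on the last N-1 pages instead of only the current one.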

Deviation/Outlier Detection: This task comprises techniques aimed at detecting unusual
changes in the data relative to the expected values. Such techniques are useful, for
example, in fraud detection, where the inconsistent use of a credit card can identify
situations where the card has been stolen. Inconsistent use could be noted if
there were transactions performed in different geographic locations within a given time
window.
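
The credit card example can be sketched as a simple rule: flag two consecutive transactions on the same card when they occur in different locations within a given time window. The tuple layout (card_id, timestamp, location), the one-hour default window, and the function name flag_location_conflicts are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def flag_location_conflicts(transactions, window=timedelta(hours=1)):
    """Return (card, earlier location, later location) for consecutive
    transactions on the same card that occur in different locations
    less than `window` apart.

    `transactions` is an iterable of (card_id, timestamp, location) tuples.
    """
    last_seen = {}
    conflicts = []
    for card, when, where in sorted(transactions, key=lambda t: t[1]):
        previous = last_seen.get(card)
        if previous and previous[1] != where and when - previous[0] < window:
            conflicts.append((card, previous[1], where))
        last_seen[card] = (when, where)
    return conflicts
```

A production system would compare the distance between locations with the elapsed time rather than use a fixed window, but the fixed window keeps the sketch close to the description above.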

Pattern analysis: Pattern analysis is the last step in the overall Web usage mining
process, as described in Figure 2. The motivation behind pattern analysis is to filter out

uninteresting rules or patterns from the set found in the pattern discovery phase. The

exact analysis methodology is usually governed by the application for which Web mining

is done. The most common form of pattern analysis consists of a knowledge query

mechanism such as SQL. Another method is to load usage data into a data cube in order

to perform OLAP operations. Visualization techniques, such as graphing patterns or

assigning colors to different values, can often highlight overall patterns or trends in the

data. Content and structure information can be used to filter out patterns containing

pages of a certain usage type, content type, or pages that match a certain hyperlink

structure.
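
A knowledge query mechanism of the kind mentioned above can be sketched by loading discovered rules into a relational table and filtering them with SQL. The example uses an in-memory SQLite database from the Python standard library; the table layout, the sample rules, and the thresholds are hypothetical.

```python
import sqlite3

# Hypothetical discovered rules: (antecedent, consequent, support, confidence).
RULES = [
    ("/home", "/products", 0.40, 0.80),
    ("/products", "/cart", 0.10, 0.55),
    ("/home", "/about", 0.02, 0.30),
    ("/cart", "/checkout", 0.08, 0.90),
]


def interesting_rules(rules, min_support, min_confidence):
    """Load rules into an in-memory table and keep those above both thresholds."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rules "
        "(antecedent TEXT, consequent TEXT, support REAL, confidence REAL)"
    )
    conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", rules)
    rows = conn.execute(
        "SELECT antecedent, consequent FROM rules "
        "WHERE support >= ? AND confidence >= ? "
        "ORDER BY confidence DESC",
        (min_support, min_confidence),
    ).fetchall()
    conn.close()
    return rows
```

Thresholds on support and confidence are only one possible filter; content or structure information could be added as further columns and used in the WHERE clause in the same way.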

4. Summary and Future Research Directions

This paper has attempted to provide an up-to-date survey of the rapidly growing area of
Web usage mining, which is in strong demand given current technology. A general
overview of Web usage mining is presented in the introduction section. Web usage mining is
used in many areas such as e-Business, e-CRM, e-Services, e-Education, e-Newspapers,
e-Government, Digital Libraries, advertising, marketing, bioinformatics and so on. The
major classes of recommendation services are based on the discovery of navigational
patterns of users. The main techniques for pattern discovery are sequential patterns,
association rules, classification, clustering, and path analysis. The basic
components of Web usage mining, the taxonomy of web mining, the architecture of Web usage mining, its individual
components, and detailed research in this area by researchers such as
Jaideep Srivastava, Bamshad Mobasher, Robert Cooley, Cyrus Shahabi, Ming-Syan Chen,
and A.G. Büchner are described in the detail section.

With the growth of Web-based applications, specifically e-commerce, there is significant

interest in analyzing Web usage data. As the web mining area is growing fast, there is
considerable demand for Web usage mining and a need to develop a common
framework, comparable to J2EE and .NET. The Cross Industry Standard Process for Data Mining
(CRISP-DM) project has developed an industry- and tool-neutral data mining process
model [CRISP-DM]. A similar process model or framework needs to be
developed for Web usage mining to attract new researchers, business strategists,
and developers. We need a systematic web-site design methodology to create new web
pages, or modify existing web pages, such that different users' navigation patterns can
be better mapped to answers to a set of specific questions. There is a need to develop
tools that incorporate statistical methods, visualization, and human factors to help
analysts better understand the mined knowledge. Since the output of knowledge mining
algorithms is often not in a form suitable for direct human consumption, there is a need
to develop techniques and tools for helping an analyst better assimilate it. One of the
open issues in data mining in general, and Web mining in particular, is the creation of
intelligent tools that can assist in the interpretation of mined knowledge. Clearly, these

tools need to have specific knowledge about the particular problem domain to do any

more than filtering based on statistical attributes of the discovered rules or patterns.




More research needs to be done in e-Commerce, bioinformatics, computer security, Web

intelligence, intelligent learning, Database systems, Finance, Marketing, Healthcare, and

Telecommunications by using Web usage mining.

5. Bibliography

[AR2003]. Ajith Abraham, Vitorino Ramos, Web Usage Mining Using Artificial Ant

Colony Clustering and Linear Genetic Programming, to appear in CEC03

- Congress on Evolutionary Computation, IEEE Press, Canberra, Australia,

8-12 Dec. 2003.

[BBAMH1999]. A.G. Büchner, M. Baumgarten, S.S. Anand, M.D. Mulvenna, J. G. Hughes,

Navigation Pattern Discovery from Internet Data, in WEBKDD, San Diego,

CA 1999.

[BL1999]. José Borges, Mark Levene, Data Mining of User Navigation Patterns,

WEBKDD, 1999.

[BM1998]. A.G. Büchner, M.D. Mulvenna, Discovering Internet Marketing Intelligence

through Online Analytical Web Usage Mining, ACM SIGMOD Record, Vol. 27, No.

4, pp. 54-61, 1998.

[CHMSW2003]. I. V. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White,

Model-based clustering and visualization of navigation patterns on a Web

Site, Journal of Data Mining and Knowledge Discovery, 7(4), 2003.

(extended version of ACM SIGKDD 2000 conference paper).

[CMS1997]. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Web Mining:




Information and Pattern Discovery on the World Wide Web (A Survey

Paper) (1997), in Proceedings of the 9th IEEE International Conference

on Tools with Artificial Intelligence (ICTAI'97), November 1997

[CMS1999]. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Data

Preparation for mining world wide web browsing patterns, Knowledge

and Information Systems 1(1), 1999.

[CRHL2002]. Chi E.H., Rosien A. and Heer J., LumberJack: Intelligent Discovery and

Analysis of Web User Traffic Composition. In Proceedings of ACM-SIGKDD

Workshop on Web Mining for Usage Patterns and User Profiles, Canada,

ACM press, 2002.

[CRISP-DM]. http://www.crisp-dm.org .

[CTS1999]. Robert Cooley, Pang-Ning Tan, Jaideep Srivastava, WebSIFT: The Web

Site Information Filter System (1999). Proceedings of the Web Usage

Analysis and User Profiling Workshop, August 1999.

[GO2003]. Sule Gunduz, M. Tamer Ozsu, A Web Page Prediction Model Based on

Click-Stream Tree Representation of User Behavior, The Ninth ACM

SIGKDD International Conference on Knowledge Discovery and Data

Mining. Washington, DC, USA, August 24 - 27, 2003.

[GOO]. Google Search Engine http://www.google.com

[HH1999]. Robert J. Hilderman and Howard J. Hamilton, Knowledge discovery and

interestingness measures, A survey, Technical Report, University of

Regina, 1999.




[JJYK1999]. Joshi K. P., Joshi A., Yesha Y., Krishnapuram, R., Warehousing and

Mining Web Logs, Proceedings of the 2nd ACM CIKM Workshop on Web

Information and Data Management, pp. 63-68, 1999.

[JTB2002]. Jespersen S.E., Thorhauge J., and Pedersen T.B., A Hybrid Approach to Web

Usage Mining, Data Warehousing and Knowledge Discovery (DaWaK'02),

LNCS 2454, Springer Verlag, Germany, pp. 73-82, 2002.

[JTP2002]. Soren E. Jespersen, Jesper Thorhauge, Torben Bach Pedersen, A Hybrid

Approach to Web Usage Mining, Technical Report 02-5002, Department

of Computer Science Aalborg University, July 2002.

[MHD2003]. Margaret H. Dunham, Data Mining Introductory and Advanced Topics,

Prentice Hall, 2003.

[LL1999]. Levene, M. and Loizou, G. Computing the entropy of user navigation in

the web, Department of Computer Science, University College London,

1999.

[NN2003]. http://www.nielsen-netratings.com

[MCS1999]. Bamshad Mobasher, Robert Cooley, Jaideep Srivastava, Creating

Adaptive Web Sites Through Usage-Based Clustering of URLs, in

Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange

Workshop (KDEX'99), November 1999

[MPC1999]. Masseglia, F., Poncelet, P., and Cicchetti, R., 1999a, Webtool: An

Integrated framework for data mining, In proceedings of the Ninth

International Conference on Database and Expert System Application

(DEXA99), Florence, Italy, August, 1999.




[OKN2002]. Shigeru Oyanagi, Kazuto Kubota, Akihiko Nakase, Mining WWW Access

Sequence by Matrix Clustering, SIGKDD Explorations, Volume 4, Issue 2,

page 125.

[RWC2000]. Robert W. Cooley, Web Usage Mining: Discovery and Application of

Interesting Patterns from Web Data., A Ph. D. Thesis, May 2000.

[SCDT2000]. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan,

Web Usage Mining: Discovery and Applications of Usage Patterns from

Web Data (2000). SIGKDD Explorations, Vol. 1, Issue 2, 2000.

[SN2003]. Smith K.A. and Ng A., Web page clustering using a self-organizing map of

user navigation patterns, Decision Support Systems, Volume 35 , Issue 2

(May 2003) Special issue: Web data mining, Pages: 245-256.

[SZAS1997]. Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi, and Vishal Shah, Knowledge

Discovery from Users Web-page Navigation, IEEE RIDE 1997.

[SS1998]. Myra Spiliopoulou and Lukas C. Faulstich, WUM: A Web Utilization Miner, in

International Workshop on the Web and Databases (WebDB98), Valencia,

Spain, March 1998.

[ZA1997]. A. Zarkesh and J. Adibi, Pathmining: Knowledge discovery in partially

ordered databases. Submitted to KDD-1997.

[ZXH1998]. O. R. Zaiane, M. Xin, and J. Han, Discovering Web access patterns and

trends by applying OLAP and data mining technology on Web logs, in Proc.

Advances in Digital Libraries Conference (ADL'98), Santa Barbara, CA, April,

1998.
