CHAPTER 2
Literature Review
The intent of web usage mining is to analyze users’ access patterns
from the data generated by browsing the web. The output of these
analyses has tremendous practical applications such as personalized web
search, target marketing, adaptive websites and several kinds of sales analysis.
This chapter follows up on the introductory concepts and approaches of web
usage mining. First, we introduce web usage data, then move on to
preprocessing and a review of various pattern discovery approaches for web
usage mining. Finally, we summarize the concepts discussed.
2.1 WEB MINING
There exists abundant content information in web pages and also in
their hyperlinks. These pages are accessed by users, and hence a
new kind of data, called web logs, is generated. These logs contain
the access patterns of the users. The techniques used for mining these
logs automatically discover and identify interesting information from the
logs. Hence the inputs for web mining come from several areas such as databases,
information retrieval, machine learning and natural language processing.
Web mining techniques can be broadly classified into three types
(Fig. 2.1), namely
1. Web content mining,
2. Web structure mining, and
3. Web usage mining.
2.2 WEB CONTENT MINING (WCM)
Web content is a combination of several types of data, such as structured
data, semi-structured data and unstructured data; further, this data
could be text, images, audio or video content. The category of algorithms
that uncovers useful information from these data types or documents is
called web content mining.
The main goals of WCM include assisting information finding (e.g.,
search engines) and filtering information to users based on user profiles.
The database view of WCM models the data on the web and integrates
it to answer a large number of sophisticated queries. Researchers have
developed many intelligent tools, called web agents, for information
processing and retrieval, and data mining techniques provide a higher
level of abstraction for the semi-structured data on the web.
Text mining [4] and multimedia data mining [5] techniques are useful
for mining the content of web pages. Some of these efforts are
summarized as follows.
2.2.1 Agent-Based Approach
Generally, agent based Web mining systems can be categorized as:
a. Intelligent Search Agents.
b. Information Filtering/Categorization.
c. Personalized Web Agents.
a. Intelligent Search Agents
Various intelligent web agents have been developed that search for
relevant information using domain characteristics and user profiles
to organize and interpret the discovered information. Some of these web
agents are Harvest [6], FAQ-Finder [7], Information Manifold [8], OCCAM
[9] and ParaSite [10].
b. Information Filtering/Categorization
These web agents use various information retrieval techniques [11] and
the characteristics of open hypertext web documents to
automatically retrieve and evaluate them [12, 13, 14, 15, 16].
c. Personalized Web Agents
Many web agents learn user interests from their web usage and
discover patterns based on their preferences and interests. Examples
of such personalized web agents are WebWatcher [17], PAINT [18],
Syskill & Webert [19], GroupLens [20], Firefly [21] and others [22]. For
instance, Syskill & Webert uses a Bayesian classifier to rate web pages
of interest to the user based on the user’s profile.
2.2.2 Database Approach
Semi-structured data is organized into structured data using various
database approaches. Various database query processing mechanisms
and data mining techniques are then used to analyze the structured data
available on the web. The database approaches are listed as:
a. Multilevel databases
b. Web query systems.
a. Multilevel databases
The main idea behind this approach is that the lowest level of the
database contains semi-structured information stored in various web
repositories, such as hypertext documents.
b. Web query systems
A large number of web-based query systems and languages use
standard database query languages such as SQL, structural
information about web documents, and even natural language
processing for the queries that are used in web searches [23].
2.3 WEB STRUCTURE MINING (WSM)
Web structure mining is concerned with discovering the model or
patterns underlying the link structure of the web. It is used to study
the hierarchical structure of the hyperlinks, which may appear with or
without descriptions. This model is useful to classify web pages and
helps to reveal information such as the similarity and relationships
among different websites. WSM can be used to discover authoritative
sites. Also significant are the structure of the web pages themselves and
the quality of the hierarchy of hyperlinks within the website of a
specific domain.
A few algorithms have been proposed to model the web topology, for
example HITS [24], PageRank [25], and improvements of HITS obtained
by adding content information to the link structure [26] and by using
outlier filtering [27]. These models are mainly applied as a method to
estimate the quality rank or relevance of each web page. Some instances
are the Clever system [26] and Google [25]. Further applications of these
models include web page categorization [28] and discovering micro
communities on the web [29].
2.4 WEB USAGE MINING (WUM)
Web usage mining concentrates on techniques that anticipate user
behavior while the user interacts with the web. WUM intends to
uncover interesting recurrent user access patterns, produced while
surfing the web, which are recorded in the web server logs, proxy
server logs or client logs.
WUM is about finding patterns of page views by Web users or finding
the usage of a particular Website. There are many applications of Web
usage mining, such as targeting advertisements. The objective is to find
the set of customers who are most likely to respond to an advertisement.
By sending advertisement materials to these potential customers
significant savings in mailing costs can be achieved. Another application
is in designing of Web pages. By studying the sequence of page visits by
the customers, a Web page may be designed so that the majority of
customers can find the information they desire with a minimum number
of mouse clicks, so that the Web page design is appealing to most
users [31].
This research investigates web usage mining techniques and suggests
improvements in web services.
2.5 SEMANTIC WEB USAGE MINING
Semantic Web and Web Mining are two fast-developing research
domains, both built on the success of the World Wide Web (WWW).
They complement each other well, as each addresses one part of a new
challenge posed by the great success of the present WWW: most of the
data on the web is so unstructured that it can only be understood by
humans, yet the quantity of data is so vast that it can only be processed
efficiently and effectively by machines. The Semantic Web addresses the
first part of this challenge by attempting to make the data (also)
machine-understandable, while Web Mining addresses the second part
by (semi-)automatically extracting the useful knowledge hidden in this
data and making it available as an aggregation of manageable
proportions, as shown in Fig. 2.2.
The Semantic Web is based on a vision of Tim Berners-Lee, the
inventor of the WWW. The huge success of the present WWW leads to
a new challenge: a large quantity of information is interpretable by
humans only, and machine support is limited. Berners-Lee proposes
enriching the WWW with machine-processable data that supports the
user in his tasks. For instance, today's search engines too often return
overlarge or inadequate lists of hits.
Machine-processable information can point the search engine to the
relevant pages and can thus improve both precision and recall.
For instance, it is almost impossible to retrieve information with a
keyword search when the information is spread over several pages.
Consider e.g., the query for Web mining experts in a company intranet
where the only explicit information stored are the relationships between
people and the courses they attended on one hand and between courses
and the topics they covered on the other hand. In that case, the use of a
rule stating that people who attended a course about a certain
topic have knowledge about that topic might improve the results.
The process of building the Semantic Web is still under way.
Its structure has to be defined, and this structure must then be filled
with life. In order to make this task feasible, one should start with the
simpler tasks first.
The following steps show the direction in which the
Semantic Web is heading:
(1) Providing a common syntax for machine understandable
statements,
(2) Establishing common vocabulary,
(3) Agreeing on a logical language, and
(4) Using the language for exchanging proofs.
Berners-Lee suggested a layer structure for the Semantic Web:
i. Unicode/URI,
ii. XML/namespaces/XML Schema,
iii. RDF/RDF Schema,
iv. Ontology vocabulary
v. Logic
vi. Proof
vii. Trust
This structure reflects the steps listed above. It ensures that each
step alone already provides added value, so that the Semantic Web can
be realized in an incremental fashion.
On the first two layers, a common syntax is provided. Uniform
Resource Identifiers (URIs) provide a standard way to refer to entities;
Unicode is a standard for exchanging symbols. The Extensible Markup
Language (XML) fixes a notation for describing labeled trees, and XML
Schema allows the definition of grammars for valid XML documents. XML
documents can refer to different namespaces to make explicit the context
of different tags. The formalizations on these two layers are widely
accepted and the number of XML documents is increasing rapidly.
The next three layers form the current core of the Web enriched by
formal semantics. These are the most important for our ensuing
formalization of Semantic Web Usage Mining.
Proof and trust are the remaining layers. They follow the
understanding that it is important to be able to check the validity of
statements made in the (Semantic) Web. These two layers are rarely
tackled today but are interesting topics for future research.
Although the Semantic Web has great potential and scope, the following
issues have to be addressed to reap its benefits:
Presently there is a scarcity of content on the Semantic Web; hence
currently available web content, such as unstructured, semi-structured
and structured content, dynamic content, multimedia content etc.,
needs to be converted into Semantic Web content.
In the days to come, ontologies can capture the semantics of web
content. These ontologies can be developed given adequate
infrastructure for their generation, change management and mapping.
Significant efforts are necessary to organize and store Semantic Web
content, along with techniques to find it. All of these must be
coordinated and scalable, since the growth of the data is
exponential.
The multilingualism problem already exists in the current web; here we
need techniques that can convert content in one language into
the user's native language.
Intuitive visualization is gaining importance, since users expect
better presentation and easily recognizable content for their needs.
Hence there is a pressing need not only for building Semantic Web
content but also for exploiting it using semantics.
In the present work, the primary intention is to apply semantics to usage
mining. Knowledge of the topics of the pages navigated by users can
further enhance the outcomes of web usage mining, and it is very useful
for understanding the priorities and interests of users. Therefore, usage
mining needs to be combined with an ontology that can suitably classify
the pages of a website. Such classification is very helpful in making the
outcomes more accurate and interesting for restructuring the website,
which is also known as personalization. The results of web usage mining
become more interpretable when the semantics of the web pages are made
explicit through the topics of the ontology. Using the ontology to model
the users' behavior recorded in the web logs, semantic web mining is
carried out on those logs. For example, the web logs can be used to
find common activities and to bestow personalized services using the
ontology.
2.6 WEB USAGE DATA
The web usage data primarily maintains logs of access patterns of the
visitors on a website. It can also include user profiles, bookmarks,
cookies, registration data, user queries and any other interactions of the
user while on the website. For easy manageability and convenience the
data is grouped into three divisions, namely web server logs,
proxy server logs and client browser logs.
The web server maintains information that is crucial for web usage
mining; these logs in general record the access of websites by multiple
users. Each record contains the IP address of the user, the request time,
the Uniform Resource Locator (URL), the HTTP status code, etc. The
information gathered is available in several standard formats, such as
the common log file format, the extended log file format, etc. A portion
of a web server log in the W3C format is shown in Fig. 2.3.
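To make the structure of such a record concrete, the sketch below parses log lines written in the common log format. It is an illustration under stated assumptions rather than part of any cited system: the regular expression, the field names and the sample line are our own.

    import re

    # Common log format: host ident authuser [time] "request" status bytes
    # (a simplified illustration; extended/W3C formats carry more fields)
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
    )

    def parse_line(line):
        """Return a dict with ip, time, url and status, or None if malformed."""
        m = LOG_PATTERN.match(line)
        if m is None:
            return None
        record = m.groupdict()
        record["status"] = int(record["status"])
        return record

    sample = '10.0.0.1 - - [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'
    print(parse_line(sample))

Each parsed record then carries exactly the fields named above: the IP address of the user, the request time, the URL and the HTTP status code.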
A gateway-like server, known as the proxy server, acts as a gate
between the users and the servers. Proxy caching is useful to decrease
the loading time of web pages that users visit frequently, and it also
gives a complete view of the traffic load at both the server and the
client. The proxy server can record the complete set of requests made
over the Hypertext Transfer Protocol by different users to different web
servers. Using the proxy server logs, the browsing activities of a group
of users who share the same proxy server can be analyzed and studied.
An agent available at the client side is helpful to gather the usage
information of the user there. This agent can be seen as a web browser
with the ability to record the tasks carried out by the user. These logs
collect the information of a particular user across various websites. The
information captured at the client side includes details that web server
or proxy logs miss, for example whether a page is reloaded with mouse
clicks or with the back button. The present chapter gives a summary of
web server logs, on which many of the web mining approaches useful
for web usage mining are based.
2.7 PREPROCESSING
Preprocessing is applied to the raw web logs before the actual mining
process; its main intention is to recognize complete web sessions or
events. When web server logs are used, the web server stores the
complete information about every user's access behavior; a snapshot is
shown in Fig. 2.3. During this process the users are treated as
anonymous, since an IP address cannot be matched to any known profile
in the repository. A web log can be regarded as a collection of sequences
of access events from individual users or sessions, in increasing order of
time. This method is applicable to all log files to ascertain the
information on web sessions [53]. The tasks included in preprocessing
are data cleaning, user identification and session identification.
2.7.1 Data Cleaning
This step consists of removing from the web logs all the data that is
useless for mining purposes, e.g., requests for graphical page content
(such as jpg and gif images), requests for any other file which might be
embedded in a web page, or even navigation sessions performed
by robots and web spiders. While requests for graphical content and
embedded files are easy to eliminate, robot and web spider navigation
patterns must be identified explicitly. This is normally
done, for example, by referring to the remote hostname, by referring to
the user agent, or by checking the access to the robots.txt file. However,
some robots actually send a false user agent in the HTTP request. In
such cases, a heuristic based on navigational behavior can be used to
separate robot sessions from real users' sessions: search engine
navigation patterns are characterized by breadth-first navigation of the
tree representing the website structure and by an unassigned referrer
(the referrer reports the page the user claims to have come from). The
heuristic suggested is grounded on this assumption and on the
classification of navigation patterns.
The web logs recorded during users' interactions cannot be directly
mined. Hence only the requested HTML documents are treated as access
events. Records whose URLs refer to image files, in formats such as gif,
jpg or bmp, are removed. The HTTP status code in each record indicates
the availability or unavailability of the requested item. Events with
status codes from 200 to 299 are regarded as successful events, and the
remaining ones are removed when the web logs are used. URLs in
formats other than HTML, ASP, JSP, etc. are likewise removed from the
logs.
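These cleaning rules can be sketched as a simple filter over parsed records. The extension lists and field names below are illustrative assumptions, not a prescribed implementation.

    # Illustrative data-cleaning step: keep successful page requests only.
    IMAGE_EXTS = (".gif", ".jpg", ".jpeg", ".bmp", ".png")
    PAGE_EXTS = (".html", ".htm", ".asp", ".jsp", "/")

    def is_clean(record):
        """True if the record is a successful request for a page document."""
        url = record["url"].lower().split("?")[0]   # ignore query strings
        if url.endswith(IMAGE_EXTS):                # rule 1: drop images
            return False
        if not (200 <= record["status"] <= 299):    # rule 2: keep 2xx only
            return False
        return url.endswith(PAGE_EXTS)              # rule 3: page formats only

    records = [
        {"url": "/index.html", "status": 200},
        {"url": "/logo.gif", "status": 200},
        {"url": "/missing.html", "status": 404},
    ]
    print([r["url"] for r in records if is_clean(r)])   # ['/index.html']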
35
2.7.2 User Identification
To describe users' behavior, the users first need to be identified, since,
as mentioned earlier, they are treated as anonymous. One way of
identifying a user is by the client IP address: the requests from the same
IP address can be treated as coming from the same user. Additional
information regarding the client can help us gain further insight into the
users' behavioral patterns. When many users access a website through
the same proxy, the IP address is the same but the agent type may
differ. Thus we can assume that each agent type for the same IP address
represents a distinct user.
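This heuristic can be expressed compactly: records are grouped by the (IP address, user agent) pair, so the same proxy IP with two agent strings yields two users. The record layout is a hypothetical continuation of the earlier sketches.

    from collections import defaultdict

    def identify_users(records):
        """Group log records by (IP address, user agent) as a proxy for a user."""
        users = defaultdict(list)
        for rec in records:
            users[(rec["ip"], rec.get("agent", ""))].append(rec)
        return users

    records = [
        {"ip": "10.0.0.1", "agent": "Firefox", "url": "/a.html"},
        {"ip": "10.0.0.1", "agent": "Chrome",  "url": "/b.html"},  # same proxy IP
        {"ip": "10.0.0.1", "agent": "Firefox", "url": "/c.html"},
    ]
    for user, recs in identify_users(records).items():
        print(user, [r["url"] for r in recs])   # two distinct users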
2.7.3 Session Identification
Whenever the recorded activity of a user spans a long period, it is likely
that the user has visited the website more than once. The aim of session
identification is to divide the web log records of each user into their
access sessions. A new session is started whenever the difference
between the request times of two consecutive records from a user
exceeds a timeout threshold. In this work, we have set the default
timeout threshold to 30 minutes.
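The timeout rule translates directly into code. The sketch below assumes each user's requests are already time-ordered, with timestamps in seconds; both are assumptions made for illustration.

    TIMEOUT = 30 * 60   # the 30-minute default threshold used in this work

    def split_sessions(requests, timeout=TIMEOUT):
        """Split one user's time-ordered (timestamp, url) pairs into sessions."""
        sessions = []
        for t, url in requests:
            if not sessions or t - sessions[-1][-1][0] > timeout:
                sessions.append([])             # gap too long: new session
            sessions[-1].append((t, url))
        return sessions

    requests = [(0, "/a"), (600, "/b"), (3000, "/c")]   # 3000 - 600 > 1800
    print(len(split_sessions(requests)))                # 2 sessions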
2.7.4 Other Preprocessing Tasks
The preprocessing tasks used depend on the purpose of mining. Path
completion is used to find the actual access path among the web pages.
The referrer field in the web logs can be checked to find out from which
page a request has come. If the referrer is unavailable, the link
structure of the website can also help to estimate the access path of
users. The goal of transaction identification is to create meaningful
clusters of requested web pages for each user. Hence, the job of
identifying transactions is either to split a large transaction into a
number of smaller ones or to merge smaller transactions into larger
ones. A few transaction identification methods, for example reference
length, maximum forward reference and time window, have been
suggested.
2.8 PATTERN DISCOVERY TECHNIQUES
A number of approaches have been investigated for extracting knowledge
from web logs. They are statistical analysis, association rule
mining, clustering, classification and sequential pattern
mining.
The number of pattern discovery techniques that can be applied
to the usage data is almost unlimited. The methods range in complexity
from relatively simple techniques such as statistical analysis to more
computationally expensive methods. In practice, several techniques are
usually applied to a set of usage data in order to form a well-rounded
picture of how a web site is being used [30].
2.8.1 Statistical Approach
Statistical techniques are the most common method to extract knowledge
about the visitors to a website. By studying the session file, one can
perform various kinds of statistical analysis (frequency, mean, median,
etc.) on variables such as page views, viewing time and the length of a
navigational path. Many web traffic analysis tools produce a periodic
report containing statistical information such as the most frequently
accessed pages, the average viewing time of a page, or the average length
of a path through a site. This report may include limited low-level error
analysis such as detecting unauthorized entry points or finding the
most common invalid URLs. Despite lacking in the depth of its analysis,
this type of knowledge can be potentially useful for improving system
performance, enhancing system security, facilitating site modification
and providing support for marketing decisions.
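As an illustration of such descriptive statistics, the short sketch below computes page view frequencies, mean viewing time and median path length from hypothetical session data.

    from collections import Counter
    from statistics import mean, median

    # Hypothetical sessions: lists of (url, viewing_time_in_seconds) pairs.
    sessions = [
        [("/home", 12), ("/products", 40)],
        [("/home", 8), ("/cart", 25), ("/products", 30)],
    ]

    page_views = Counter(url for s in sessions for url, _ in s)
    view_times = [t for s in sessions for _, t in s]
    path_lengths = [len(s) for s in sessions]

    print("Most accessed page:", page_views.most_common(1))
    print("Mean viewing time:", mean(view_times))
    print("Median path length:", median(path_lengths))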
2.8.2 Frequent Itemset and Association Rule Mining
Frequent itemset discovery can be used to relate pages that are most
frequently referenced together in a single server session. Examples
of frequent itemsets are stated below:
• The home page and shopping cart page are accessed together in 20%
of the sessions.
• The Donkey Kong video game and stainless steel flatware set product
pages are accessed together in 1.2% of the sessions. Any group of n
frequent items can be further broken into n different association rules,
where direction is added to the rule. A frequent itemset
of pages X and Y leads to the two association rules X → Y and Y →
X. The first frequent itemset example listed above now becomes:
When the shopping cart page is accessed in a session, the
home page is also accessed 95% of the time.
When the home page is accessed in a session, the shopping
cart page is also accessed 20% of the time.
In the context of web usage mining, frequent itemsets and
association rules refer to sets of pages that are accessed
together with a support value exceeding some specified threshold.
These pages may or may not be directly connected to one another via
web links. For instance, rule discovery using the Apriori
algorithm may reveal a correlation between users who visit a page
containing electronic products and those who access a page about
sporting equipment. This is useful for discovering cross-
promotional opportunities. The rules may also serve as a heuristic for
prefetching documents so as to reduce user-perceived latency when
loading pages from a remote site [56].
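The support and confidence values quoted in the examples above can be computed as follows; the session data is a made-up toy example.

    # Support: fraction of sessions containing an itemset.
    # Confidence of X -> Y: support(X and Y) / support(X).
    sessions = [
        {"home", "cart"}, {"home", "cart"}, {"home"},
        {"home", "products"}, {"cart", "home"},
    ]

    def support(itemset):
        return sum(itemset <= s for s in sessions) / len(sessions)

    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    print(support({"home", "cart"}))        # both pages in one session
    print(confidence({"cart"}, {"home"}))   # rule: cart -> home
    print(confidence({"home"}, {"cart"}))   # rule: home -> cart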
2.8.3 Clustering
A cluster is a collection of data objects that are similar to each other
and dissimilar to the data objects in other clusters. Clustering is often
the first data mining task applied to a given collection of data and is used
to explore whether any underlying patterns exist in the data. The presence
of dense, well-separated clusters indicates that there are structures and
patterns to be explored in the data. On the other hand, clustering of pages
can discover groups of pages having related content.
This information is useful for search engines and web assistance
providers. In both applications, permanent or dynamic HTML pages can
be created that suggest related hyperlinks to the user according to the
user's query or past history of information needs.
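A minimal sketch of page clustering is given below: pages are represented by binary vectors over sessions and grouped with k-means. It assumes scikit-learn is available; the page/session matrix is a toy example, not real data.

    import numpy as np
    from sklearn.cluster import KMeans   # assumes scikit-learn is installed

    # Rows: pages; columns: sessions; 1 if the page occurs in the session.
    pages = ["/news", "/sports", "/tv", "/camera", "/phone"]
    X = np.array([
        [1, 1, 0, 0],   # /news   appears in sessions 1 and 2
        [1, 1, 0, 0],   # /sports appears in sessions 1 and 2
        [1, 0, 0, 0],   # /tv
        [0, 0, 1, 1],   # /camera appears in sessions 3 and 4
        [0, 0, 1, 1],   # /phone  appears in sessions 3 and 4
    ])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for page, label in zip(pages, labels):
        print(page, "-> cluster", label)   # co-accessed pages cluster together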
2.8.4 Classification
Classification is the task of mapping a data item into one of
several predefined classes. In the web domain, one is mostly
interested in developing a profile of users belonging to a particular
class or category. This requires the extraction and selection of features
that best describe the properties of a given class or category. Classification
can be performed using supervised inductive learning algorithms like
decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor
classifiers, support vector machines etc. For instance, classification on
server logs may lead to the discovery of interesting rules such as:
40% of users who placed an online order in product/Music are in the 19-
26 age group and live on the west coast.
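A rule of that kind can be learned from labeled profiles with any of the classifiers named above; the sketch below uses a decision tree from scikit-learn on invented user features, purely for illustration.

    from sklearn.tree import DecisionTreeClassifier   # assumes scikit-learn

    # Hypothetical user profiles: [age, pages_viewed, placed_music_order(0/1)],
    # with an invented target class for each user.
    X = [[20, 15, 1], [24, 22, 1], [45, 3, 0], [52, 5, 0], [19, 30, 1]]
    y = ["responds", "responds", "ignores", "ignores", "responds"]

    clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(clf.predict([[22, 18, 1]]))   # classify a new user profile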
2.8.5 Sequential Pattern Mining
The technique of sequential pattern discovery attempts to find inter-
session patterns such that the presence of a set of items is followed by
another item in a time-ordered set of sessions or episodes. For example:
The video game shopping cart page view is accessed after the Donkey
Kong Video Game page view 50% of the time.
By using this approach, web marketers can predict future visit
patterns which will be helpful in placing advertisements aimed at certain
user groups. Other types of temporal analysis that can be performed on
sequential patterns includes trend analysis or change point detection.
Trend analysis can be used to detect changes in the usage patterns of a
site over time and change point detection identifies when specific changes
take place.
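The support of a sequential pattern counts order-preserving occurrences, in contrast to the unordered itemsets of Section 2.8.2. A minimal sketch over hypothetical sessions:

    def contains(session, pattern):
        """True if `pattern` occurs in `session` as an ordered subsequence."""
        it = iter(session)
        return all(page in it for page in pattern)   # consumes `it` in order

    def seq_support(sessions, pattern):
        return sum(contains(s, pattern) for s in sessions) / len(sessions)

    sessions = [
        ["/donkey-kong", "/cart", "/checkout"],
        ["/donkey-kong", "/reviews", "/cart"],
        ["/cart", "/donkey-kong"],                # wrong order: no match
        ["/home", "/donkey-kong"],
    ]
    print(seq_support(sessions, ["/donkey-kong", "/cart"]))   # 0.5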
2.8.5.1 Apriori-based Mining Algorithms
This traditional algorithm involves three steps for mining sequential
patterns [67]. Initially, it discovers all the frequent itemsets,
that is, the itemsets with support greater than the minimum support.
In the next step, it replaces each actual transaction with the set of all
frequent itemsets contained in the transaction. Finally, the sequential
patterns are determined. This is a very costly algorithm, as it has to
transform the transactions at every step, and handling constraints for
time periods and taxonomies makes it a complex task.
A generalized definition of sequential patterns that includes time
constraints, sliding windows and user-specified taxonomies has been
suggested [68]. A more general sequential pattern mining algorithm
known as GSP (Generalized Sequential Pattern mining algorithm) has
also been proposed. Similar to the conventional Apriori algorithm, GSP
scans the database many times. Initially, it scans the database to
determine all the frequent items, and the set of frequent sequences of
length one is formed. In subsequent scans, it produces candidate
sequences from the set of frequent sequences obtained in the previous
scan and verifies their supports. The method stops when no frequent
candidate is found. GSP is effective only in circumstances when the
sequences are not too long and when very large transaction databases
are not considered.
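The level-wise idea behind GSP can be sketched as follows. This is a deliberate simplification, assuming single-page elements and no time constraints, windows or taxonomies; it is not the published GSP candidate-generation join.

    def contains(session, pattern):
        """Order-preserving subsequence test, as in the earlier sketch."""
        it = iter(session)
        return all(page in it for page in pattern)

    def gsp_like(sessions, min_support):
        """Grow frequent sequences one item per pass, pruning by support."""
        n = len(sessions)
        support = lambda seq: sum(contains(s, seq) for s in sessions) / n

        items = sorted({p for s in sessions for p in s})
        freq_items = [p for p in items if support((p,)) >= min_support]
        frequent = [(p,) for p in freq_items]            # pass 1
        result = list(frequent)
        while frequent:                                  # later passes
            candidates = [seq + (p,) for seq in frequent for p in freq_items]
            frequent = [c for c in candidates if support(c) >= min_support]
            result.extend(frequent)
        return result

    sessions = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
    print(gsp_like(sessions, min_support=0.5))
    # [('a',), ('b',), ('c',), ('a','b'), ('a','c'), ('b','c')]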
2.8.5.2 WAP-tree Based Mining Algorithms
A powerful approach based on a compact data structure called the
Web Access Pattern tree (or WAP-tree), which is FP-tree oriented, has
been discussed [48]. The WAP-tree structure enables the development
of novel algorithms for mining access patterns efficiently from a large
set of web log pieces. In particular, the WAP-mine algorithm has been
suggested for mining web access patterns from the WAP-tree. The
present technique avoids the problem of producing a huge number of
candidates, as seen in Apriori-like algorithms. Apart from this, the
experimental results show that the WAP-mine algorithm is in general
an order of magnitude faster than the conventional sequential pattern
mining approaches. This can be credited to the compact structure of
the WAP-tree and the novel conditional search strategies employed in
WAP-mine.
The WAP-tree is a very compact structure for maintaining information
from web logs. WAP-mine is the chief mining algorithm based on the
WAP-tree; it does not generate the large numbers of candidate sets
produced by conventional Apriori-based algorithms. However, the
construction of the intermediate conditional WAP-trees while the mining
is in process is very costly. At present, several extensions of the
WAP-tree and the associated mining algorithms are being investigated.
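The following sketch shows the core of the structure: a prefix tree over access sequences with per-node counts, so shared prefixes are stored once. It is a simplification; the header table and event-queue links of the published WAP-tree are omitted.

    class WAPNode:
        """Node of a simplified WAP-tree: an event label and a count."""
        def __init__(self, label):
            self.label = label
            self.count = 0
            self.children = {}

    def build_wap_tree(sequences):
        """Insert each cleaned access sequence, counting along shared prefixes."""
        root = WAPNode(None)
        for seq in sequences:
            node = root
            for event in seq:
                node = node.children.setdefault(event, WAPNode(event))
                node.count += 1
        return root

    def dump(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.label}:{child.count}")
            dump(child, depth + 1)

    dump(build_wap_tree([("a", "b"), ("a", "b", "c"), ("a", "c")]))
    # a:3 / b:2 / c:1 / c:1 -- shared prefixes are stored only once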
The Pre-Order Linked WAP-tree Mining (PLWAP) algorithm has been
discussed, which avoids producing WAP-mine's intermediate conditional
WAP-trees by assigning binary position codes to all tree nodes [70]. The
PLWAP algorithm quickly determines the suffix trees or forests of any
prefix event of a frequent pattern by comparing the assigned binary
position codes of the nodes. The Tree Binary Code Formatting (TreBCF)
technique is then used to assign unique binary position codes to the
nodes of any general tree, by first transforming the tree into its
equivalent binary tree and then using a rule similar to the one used in
Huffman coding to derive a unique code for each node.
The RSC-tree (Recurrent Successive Chain tree) extends the WAP-tree
structure for incremental and interactive mining [69]. The mining
algorithm RSC-mining traverses the RSC-tree to extract frequent
sequences. The suggested RSC-Miner system can incorporate newly
arriving input sequences and respond incrementally without complete
recomputation.
The system also allows users to modify the mining parameters (e.g.,
minimum support and required pattern length) interactively, without
requiring complete recomputation in most cases. The incremental
update capability of the system offers substantial performance
advantages over complete recomputation, even for very large updates.
2.9 SUMMARY
The present chapter has focused in depth on related work on web
usage mining, covering web usage data, preprocessing
tasks and several pattern discovery approaches. Web
usage data is the main data source for web usage
mining; it primarily consists of web server logs, proxy server logs
and client browser logs. Since web server logs have almost
identical structures and are readily available on all web
servers, they are in general the most convenient and commonly used
data source in research on web usage mining. Preprocessing of the web
usage data comprises data cleaning, user identification and session
identification.
The basic approaches to extracting patterns from web logs comprise
statistical analysis, association rule mining, clustering,
classification and sequential pattern mining.
Statistical approaches are specifically used to discover statistical
knowledge from the web logs. This type of knowledge is widely used to
analyze the traffic of a website. Association rule mining determines the
sets of pages that are accessed together in an access session.
The clustering approach is very helpful for extracting page groups
and user groups from web logs. Page groups are used to improve
search engines and to provide web page categorization, while the
user groups are useful to infer user demographics so as to deliver
personalized web content to visitors. Classification is the task of
mapping a data item into one of several predefined
groups. These groups are generally used to represent different user
profiles, and classification is carried out using the
features that best describe the properties of a particular group
or class. Sequential pattern mining is a further extension of
association rule mining that takes into account the order in which
items occur in events. Sequential patterns are sequences of
web pages frequently accessed by users. These kinds of patterns are
helpful for characterizing user behavior and predicting the next pages
to be browsed by a user.