CHAPTER 2
Literature Review
The intent of web usage mining is to analyze users’ access patterns
from the data generated by browsing the web. The output of these
analyses has tremendous practical applications such as personalized web
search, target marketing, adaptive websites and several kinds of sales analysis.
This chapter follows up on the introductory concepts and approaches of web
usage mining. First, we introduce web usage data, then move on to
preprocessing and a review of various pattern discovery approaches for web
usage mining. Finally, we summarize the concepts discussed.
2.1 WEB MINING
There exists abundant content information in web pages and also in
their hyperlinks. These pages are accessed by users, and hence a
new kind of data, called web logs, is generated. These logs contain
the access patterns of the users. The techniques used for mining these
logs automatically discover and identify interesting information from the
logs. Hence the inputs for web mining come from several areas such as databases,
information retrieval, machine learning and natural language processing.
Web mining techniques can be broadly classified into three types
(Fig. 2.1), namely
1. Web content mining,
2. Web structure mining, and
3. Web usage mining.
2.2 WEB CONTENT MINING (WCM)
Web content is a combination of several types of data, such as structured
data, semi-structured data and unstructured data; further, this data
could be text, images, audio or video content. The category of algorithms
that uncovers useful information from these data types or documents is
called web content mining.
The main goals of WCM include assisting information finding (e.g.,
search engines) and filtering information to users based on user profiles.
The database view of WCM models the data on the web and integrates
it to answer a large number of sophisticated queries. Researchers have
developed many intelligent tools, called web agents, for information
processing and retrieval, and data mining techniques provide a higher
level of abstraction for the semi-structured data on the web.
Text mining [4] and multimedia data mining [5] techniques are useful
for mining the content of web pages. Some of these efforts are
summarized as follows.
2.2.1 Agent-Based Approach
Generally, agent based Web mining systems can be categorized as:
a. Intelligent Search Agents.
b. Information Filtering/Categorization.
c. Personalized Web Agents.
a. Intelligent Search Agents
Various intelligent web agents have been developed that search for
relevant information using domain characteristics and user profiles
to organize and interpret the discovered information. Some of these web
agents are Harvest [6], FAQ-Finder [7], Information Manifold [8], OCCAM
[9] and ParaSite [10].
b. Information Filtering/Categorization
These web agents use various information retrieval techniques [11] and
the characteristics of open hypertext web documents to
automatically retrieve and evaluate them [12, 13, 14, 15, 16].
c. Personalized Web Agents
Many web agents learn user interests from their web usage and
discover patterns based on their preferences and interests. Examples
of such personalized web agents are WebWatcher [17], PAINT [18],
Syskill & Webert [19], GroupLens [20], Firefly [21] and others [22]. For
instance, Syskill & Webert uses a Bayesian classifier to rate web pages
of interest to the user based on the user’s profile.
2.2.2 Database Approach
Semi-structured data is organized into structured data using various
database approaches. Various database query processing mechanisms
and data mining techniques are then used to analyze the structured data
available on the web. The database approaches are listed as:
a. Multilevel databases
b. Web query systems.
a. Multilevel databases
The main idea behind this approach is that the lowest level of the
database contains semi-structured information stored in various web
repositories, such as hypertext documents.
b. Web query systems
A large number of web-based query systems and languages use
standard database query languages such as SQL, structural
information about web documents, and even natural language
processing for the queries that are used in web searches [23].
2.3 WEB STRUCTURE MINING (WSM)
Web structure mining is concerned with discovering the model or
patterns underlying the link structure of the web. It is used to study
the hierarchical structure of the hyperlinks, which may appear with or
without descriptions. This model is useful to classify web pages and
helps to reveal information such as the similarity and relationships
among different websites. WSM can be used to discover authoritative
sites. Also significant are the structure of the web pages themselves and
the quality of the hierarchy of hyperlinks within the website of a
specific domain.
A few algorithms have been proposed to model the web topology, for
example HITS [24], PageRank [25], and improvements of HITS obtained
by adding content information to the link structure [26] and by using
outlier filtering [27]. These models are mainly applied as a method to
estimate the quality rank or relevance of each web page. Some instances
are the Clever system [26] and Google [25]. Further applications of these
models include web page categorization [28] and discovering micro
communities on the web [29].
2.4 WEB USAGE MINING (WUM)
Web usage mining concentrates on techniques that anticipate user
behavior while the user interacts with the web. WUM intends to
uncover interesting recurrent user access patterns, produced while
surfing the web, which are recorded in the web server logs, proxy
server logs or client logs.
WUM is about finding patterns of page views by Web users or finding
the usage of a particular Website. There are many applications of Web
usage mining, such as targeting advertisements. The objective is to find
the set of customers who are most likely to respond to an advertisement.
By sending advertisement materials to these potential customers
significant savings in mailing costs can be achieved. Another application
is in designing of Web pages. By studying the sequence of page visits by
the customers, a Web page may be designed so that the majority of
customers can find the information they desire with a minimum number
of mouse clicks, so that the Web page design is appealing to most
users [31].
This research investigates web usage mining techniques and suggests
improvements in web services.
2.5 SEMANTIC WEB USAGE MINING
Semantic Web and Web Mining are two fast-developing research
domains, both built on the success of the World Wide Web (WWW).
They complement each other well, as each addresses one part of a new
challenge posed by the great success of the present WWW: most of the
data on the web is so unstructured that it can only be understood by
humans, yet the quantity of data is so vast that it can only be processed
efficiently and effectively by machines. The Semantic Web addresses the
first part of this challenge by attempting to make the data (also)
machine-understandable, while Web Mining addresses the second part
by (semi-)automatically extracting the useful knowledge hidden in this
data and making it available as an aggregation of manageable
proportions, as shown in Fig. 2.2.
The Semantic Web is based on a vision of Tim Berners-Lee, the
inventor of the WWW. The huge success of the present WWW leads to
a new challenge: a large quantity of information is interpretable by
humans only, and machine support is limited. Berners-Lee proposes
enriching the WWW with machine-processable data that supports the
user in his tasks. For instance, today's search engines too often return
overlarge or inadequate lists of hits.
Machine-processable information can point the search engine to the
relevant pages and can thus improve both precision and recall.
For instance, it is almost impossible to retrieve information with a
keyword search when the information is spread over several pages.
Consider e.g., the query for Web mining experts in a company intranet
where the only explicit information stored are the relationships between
people and the courses they attended on one hand and between courses
and the topics they covered on the other hand. In that case, the use of a
rule stating that people who attended a course about a certain
topic have knowledge about that topic might improve the results.
The process of building the Semantic Web is still under way.
Its structure has to be defined, and this structure must then be filled
with life. In order to make this task feasible, one should start with the
simpler tasks first.
The following steps show the direction in which the
Semantic Web is heading:
(1) Providing a common syntax for machine understandable
statements,
(2) Establishing common vocabulary,
(3) Agreeing on a logical language, and
(4) Using the language for exchanging proofs.
Berners-Lee suggested a layer structure for the Semantic Web:
i. Unicode/URI,
ii. XML/namespaces/XML Schema,
iii. RDF/RDF Schema,
iv. Ontology vocabulary
v. Logic
vi. Proof
vii. Trust
This structure reflects the steps listed above. It ensures that each
step alone already provides added value, so that the Semantic Web can
be realized in an incremental fashion.
On the first two layers, a common syntax is provided. Uniform
Resource Identifiers (URIs) provide a standard way to refer to entities;
Unicode is a standard for exchanging symbols. The Extensible Markup
Language (XML) fixes a notation for describing labeled trees, and XML
Schema allows the definition of grammars for valid XML documents. XML
documents can refer to different namespaces to make explicit the context
of different tags. The formalizations on these two layers are widely
accepted and the number of XML documents is increasing rapidly.
The next three layers form the current core of the Web enriched by
formal semantics. These are the most important for our ensuing
formalization of Semantic Web Usage Mining.
Proof and trust are the remaining layers. They follow the
understanding that it is important to be able to check the validity of
statements made in the (Semantic) Web. These two layers are rarely
tackled today but are interesting topics for future research.
Although the Semantic Web has great potential and scope, the following
issues have to be addressed to reap its benefits:
Presently there is a scarcity of content on the Semantic Web; hence
currently available web content, such as unstructured, semi-structured
and structured content, dynamic content, multimedia content etc.,
needs to be converted into Semantic Web content.
In the days to come, ontologies can capture the semantics of web
content. These ontologies can be developed given adequate
infrastructure for their generation, change management and mapping.
Significant efforts are necessary to organize and store Semantic Web
content, along with techniques to find it. All of these must be
coordinated and scalable, since the growth of the data is
exponential.
The multilingualism problem already exists in the current web; here we
need techniques that can convert content in one language into
the user's native language.
Intuitive visualization is gaining importance, since users expect
better presentation and easily recognizable content for their needs.
Hence there is a pressing need not only for building Semantic Web
content but also for exploiting it using semantics.
In the present work, the primary intention is to apply semantics to usage
mining. Knowledge of the topics of the pages navigated by users can
further enhance the outcomes of web usage mining, and it is very useful
for understanding the priorities and interests of users. Therefore, usage
mining needs to be combined with an ontology that can suitably classify
the pages of a website. Such classification is very helpful in making the
outcomes more accurate and interesting for restructuring the website,
which is also known as personalization. The results of web usage mining
become more interpretable when the semantics of the web pages are made
explicit through the topics of the ontology. Using the ontology to model
the users' behavior recorded in the web logs, semantic web mining is
carried out on those logs. For example, the web logs can be used to
find common activities and to bestow personalized services using the
ontology.
2.6 WEB USAGE DATA
The web usage data primarily maintains logs of access patterns of the
visitors on a website. It can also include user profiles, bookmarks,
cookies, registration data, user queries and any other interactions of the
user while on the website. For easy manageability and convenience the
data is grouped into three divisions, namely web server logs,
proxy server logs and client browser logs.
The web server maintains information that is crucial for web usage
mining; these logs in general record the access of websites by multiple
users. Each record contains the IP address of the user, the request time,
the Uniform Resource Locator (URL), the HTTP status code, etc. The
information gathered is available in several standard formats, such as
the common log file format, the extended log file format, etc. A portion
of a web server log in the W3C format is shown in Fig. 2.3.
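To make the structure of such a record concrete, the sketch below parses log lines written in the common log format. It is an illustration under stated assumptions rather than part of any cited system: the regular expression, the field names and the sample line are our own.

    import re

    # Common log format: host ident authuser [time] "request" status bytes
    # (a simplified illustration; extended/W3C formats carry more fields)
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
    )

    def parse_line(line):
        """Return a dict with ip, time, url and status, or None if malformed."""
        m = LOG_PATTERN.match(line)
        if m is None:
            return None
        record = m.groupdict()
        record["status"] = int(record["status"])
        return record

    sample = '10.0.0.1 - - [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'
    print(parse_line(sample))

Each parsed record then carries exactly the fields named above: the IP address of the user, the request time, the URL and the HTTP status code.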
A gateway-like server, known as the proxy server, acts as a gate
between the users and the servers. Proxy caching is useful to decrease
the loading time of web pages that users visit frequently, and it also
gives a complete view of the traffic load at both the server and the
client. The proxy server can record the complete set of requests made
over the Hypertext Transfer Protocol by different users to different web
servers. Using the proxy server logs, the browsing activities of a group
of users who share the same proxy server can be analyzed and studied.
An agent available at the client side is helpful to gather the usage
information of the user there. This agent can be seen as a web browser
with the ability to record the tasks carried out by the user. These logs
collect the information of a particular user across various websites. The
information captured at the client side includes details that web server
or proxy logs miss, for example whether a page is reloaded with mouse
clicks or with the back button. The present chapter gives a summary of
web server logs, on which many of the web mining approaches useful
for web usage mining are based.
2.7 PREPROCESSING
Preprocessing is applied to the raw web logs before the actual mining
process; its main intention is to recognize complete web sessions or
events. When web server logs are used, the web server stores the
complete information about every user's access behavior; a snapshot is
shown in Fig. 2.3. During this process the users are treated as
anonymous, since an IP address cannot be matched to any known profile
in the repository. A web log can be regarded as a collection of sequences
of access events from individual users or sessions, in increasing order of
time. This method is applicable to all log files to ascertain the
information on web sessions [53]. The tasks included in preprocessing
are data cleaning, user identification and session identification.
2.7.1 Data Cleaning
This step consists of removing from the web logs all the data that is
useless for mining purposes, e.g., requests for graphical page content
(such as jpg and gif images), requests for any other file which might be
embedded in a web page, or even navigation sessions performed
by robots and web spiders. While requests for graphical content and
embedded files are easy to eliminate, robot and web spider navigation
patterns must be identified explicitly. This is normally
done, for example, by referring to the remote hostname, by referring to
the user agent, or by checking the access to the robots.txt file. However,
some robots actually send a false user agent in the HTTP request. In
such cases, a heuristic based on navigational behavior can be used to
separate robot sessions from real users' sessions: search engine
navigation patterns are characterized by breadth-first navigation of the
tree representing the website structure and by an unassigned referrer
(the referrer reports the page the user claims to have come from). The
heuristic suggested is grounded on this assumption and on the
classification of navigation patterns.
The web logs recorded during users' interactions cannot be directly
mined. Hence only the requested HTML documents are treated as access
events. Records whose URLs refer to image files, in formats such as gif,
jpg or bmp, are removed. The HTTP status code in each record indicates
the availability or unavailability of the requested item. Events with
status codes from 200 to 299 are regarded as successful events, and the
remaining ones are removed when the web logs are used. URLs in
formats other than HTML, ASP, JSP, etc. are likewise removed from the
logs.
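These cleaning rules can be sketched as a simple filter over parsed records. The extension lists and field names below are illustrative assumptions, not a prescribed implementation.

    # Illustrative data-cleaning step: keep successful page requests only.
    IMAGE_EXTS = (".gif", ".jpg", ".jpeg", ".bmp", ".png")
    PAGE_EXTS = (".html", ".htm", ".asp", ".jsp", "/")

    def is_clean(record):
        """True if the record is a successful request for a page document."""
        url = record["url"].lower().split("?")[0]   # ignore query strings
        if url.endswith(IMAGE_EXTS):                # rule 1: drop images
            return False
        if not (200 <= record["status"] <= 299):    # rule 2: keep 2xx only
            return False
        return url.endswith(PAGE_EXTS)              # rule 3: page formats only

    records = [
        {"url": "/index.html", "status": 200},
        {"url": "/logo.gif", "status": 200},
        {"url": "/missing.html", "status": 404},
    ]
    print([r["url"] for r in records if is_clean(r)])   # ['/index.html']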
35
2.7.2 User Identification
To describe users' behavior, the users first need to be identified, since,
as mentioned earlier, they are treated as anonymous. One way of
identifying a user is by the client IP address: the requests from the same
IP address can be treated as coming from the same user. Additional
information regarding the client can help us gain further insight into the
users' behavioral patterns. When many users access a website through
the same proxy, the IP address is the same but the agent type may
differ. Thus we can assume that each agent type for the same IP address
represents a distinct user.
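This heuristic can be expressed compactly: records are grouped by the (IP address, user agent) pair, so the same proxy IP with two agent strings yields two users. The record layout is a hypothetical continuation of the earlier sketches.

    from collections import defaultdict

    def identify_users(records):
        """Group log records by (IP address, user agent) as a proxy for a user."""
        users = defaultdict(list)
        for rec in records:
            users[(rec["ip"], rec.get("agent", ""))].append(rec)
        return users

    records = [
        {"ip": "10.0.0.1", "agent": "Firefox", "url": "/a.html"},
        {"ip": "10.0.0.1", "agent": "Chrome",  "url": "/b.html"},  # same proxy IP
        {"ip": "10.0.0.1", "agent": "Firefox", "url": "/c.html"},
    ]
    for user, recs in identify_users(records).items():
        print(user, [r["url"] for r in recs])   # two distinct users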
2.7.3 Session Identification
Whenever the recorded activity of a user spans a long period, it is likely
that the user has visited the website more than once. The aim of session
identification is to divide the web log records of each user into their
access sessions. A new session is started whenever the difference
between the request times of two consecutive records from a user
exceeds a timeout threshold. In this work, we have set the default
timeout threshold to 30 minutes.
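The timeout rule translates directly into code. The sketch below assumes each user's requests are already time-ordered, with timestamps in seconds; both are assumptions made for illustration.

    TIMEOUT = 30 * 60   # the 30-minute default threshold used in this work

    def split_sessions(requests, timeout=TIMEOUT):
        """Split one user's time-ordered (timestamp, url) pairs into sessions."""
        sessions = []
        for t, url in requests:
            if not sessions or t - sessions[-1][-1][0] > timeout:
                sessions.append([])             # gap too long: new session
            sessions[-1].append((t, url))
        return sessions

    requests = [(0, "/a"), (600, "/b"), (3000, "/c")]   # 3000 - 600 > 1800
    print(len(split_sessions(requests)))                # 2 sessions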
2.7.4 Other Preprocessing Tasks
The preprocessing tasks used depend on the purpose of mining. Path
completion is used to find the actual access path among the web pages.
The referrer field in the web logs can be checked to find out from which
page a request has come. If the referrer is unavailable, the link
structure of the website can also help to estimate the access path of
users. The goal of transaction identification is to create meaningful
clusters of requested web pages for each user. Hence, the job of
identifying transactions is either to split a large transaction into a
number of smaller ones or to merge smaller transactions into larger
ones. A few transaction identification methods, for example reference
length, maximum forward reference and time window, have been
suggested.
2.8 PATTERN DISCOVERY TECHNIQUES
A number of approaches have been investigated for extracting knowledge
from web logs. They are statistical analysis, association rule
mining, clustering, classification and sequential pattern
mining.
The number of pattern discovery techniques that can be applied
to the usage data is almost unlimited. The methods range in complexity
from relatively simple techniques such as statistical analysis to more
computationally expensive methods. In practice, several techniques are
usually applied to a set of usage data in order to form a well-rounded
picture of how a web site is being used [30].
2.8.1 Statistical Approach
Statistical techniques are the most common method to extract knowledge
about the visitors to a website. By studying the session file, one can
perform various kinds of statistical analysis (frequency, mean, median,
etc.) on variables such as page views, viewing time and the length of a
navigational path. Many web traffic analysis tools produce a periodic
report containing statistical information such as the most frequently
accessed pages, the average viewing time of a page, or the average length
of a path through a site. This report may include limited low-level error
analysis such as detecting unauthorized entry points or finding the
most common invalid URLs. Despite lacking in the depth of its analysis,
this type of knowledge can be potentially useful for improving system
performance, enhancing system security, facilitating site modification
and providing support for marketing decisions.
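As an illustration of such descriptive statistics, the short sketch below computes page view frequencies, mean viewing time and median path length from hypothetical session data.

    from collections import Counter
    from statistics import mean, median

    # Hypothetical sessions: lists of (url, viewing_time_in_seconds) pairs.
    sessions = [
        [("/home", 12), ("/products", 40)],
        [("/home", 8), ("/cart", 25), ("/products", 30)],
    ]

    page_views = Counter(url for s in sessions for url, _ in s)
    view_times = [t for s in sessions for _, t in s]
    path_lengths = [len(s) for s in sessions]

    print("Most accessed page:", page_views.most_common(1))
    print("Mean viewing time:", mean(view_times))
    print("Median path length:", median(path_lengths))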
2.8.2 Frequent Itemset and Association Rule Mining
Frequent itemset discovery can be used to relate pages that are most
frequently referenced together in a single server session. Examples
of frequent itemsets are stated below:
• The home page and shopping cart page are accessed together in 20%
of the sessions.
• The Donkey Kong video game and stainless steel flatware set product
pages are accessed together in 1.2% of the sessions. Any group of n
frequent items can be further broken into n different association rules,
where direction is added to the rule. A frequent itemset
of pages X and Y leads to the two association rules X → Y and Y →
X. The first frequent itemset example listed above now becomes:
When the shopping cart page is accessed in a session, the
home page is also accessed 95% of the time.
When the home page is accessed in a session, the shopping
cart page is also accessed 20% of the time.
In the context of web usage mining, frequent itemsets and
association rules refer to sets of pages that are accessed
together with a support value exceeding some specified threshold.
These pages may or may not be directly connected to one another via
web links. For instance, rule discovery using the Apriori
algorithm may reveal a correlation between users who visit a page
containing electronic products and those who access a page about
sporting equipment. This is useful for discovering cross-
promotional opportunities. The rules may also serve as a heuristic for
prefetching documents so as to reduce user-perceived latency when
loading pages from a remote site [56].
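The support and confidence values quoted in the examples above can be computed as follows; the session data is a made-up toy example.

    # Support: fraction of sessions containing an itemset.
    # Confidence of X -> Y: support(X and Y) / support(X).
    sessions = [
        {"home", "cart"}, {"home", "cart"}, {"home"},
        {"home", "products"}, {"cart", "home"},
    ]

    def support(itemset):
        return sum(itemset <= s for s in sessions) / len(sessions)

    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    print(support({"home", "cart"}))        # both pages in one session
    print(confidence({"cart"}, {"home"}))   # rule: cart -> home
    print(confidence({"home"}, {"cart"}))   # rule: home -> cart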
2.8.3 Clustering
A cluster is a collection of data objects that are similar to each other
and dissimilar to the data objects in other clusters. Clustering is often
the first data mining task applied to a given collection of data and is used
to explore whether any underlying patterns exist in the data. The presence
of dense, well-separated clusters indicates that there are structures and
patterns to be explored in the data. On the other hand, clustering of pages
can discover groups of pages having related content.
This information is useful for search engines and web assistance
providers. In both applications, permanent or dynamic HTML pages can
be created that suggest related hyperlinks to the user according to the
user's query or past history of information needs.
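A minimal sketch of page clustering is given below: pages are represented by binary vectors over sessions and grouped with k-means. It assumes scikit-learn is available; the page/session matrix is a toy example, not real data.

    import numpy as np
    from sklearn.cluster import KMeans   # assumes scikit-learn is installed

    # Rows: pages; columns: sessions; 1 if the page occurs in the session.
    pages = ["/news", "/sports", "/tv", "/camera", "/phone"]
    X = np.array([
        [1, 1, 0, 0],   # /news   appears in sessions 1 and 2
        [1, 1, 0, 0],   # /sports appears in sessions 1 and 2
        [1, 0, 0, 0],   # /tv
        [0, 0, 1, 1],   # /camera appears in sessions 3 and 4
        [0, 0, 1, 1],   # /phone  appears in sessions 3 and 4
    ])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for page, label in zip(pages, labels):
        print(page, "-> cluster", label)   # co-accessed pages cluster together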
2.8.4 Classification
Classification is the task of mapping a data item into one of
several predefined classes. In the web domain, one is mostly
interested in developing a profile of users belonging to a particular
class or category. This requires the extraction and selection of features
that best describe the properties of a given class or category. Classification
can be performed using supervised inductive learning algorithms like
decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor
classifiers, support vector machines etc. For instance, classification on
server logs may lead to the discovery of interesting rules such as:
40% of users who placed an online order in product/Music are in the 19-
26 age group and live on the west coast.
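A rule of that kind can be learned from labeled profiles with any of the classifiers named above; the sketch below uses a decision tree from scikit-learn on invented user features, purely for illustration.

    from sklearn.tree import DecisionTreeClassifier   # assumes scikit-learn

    # Hypothetical user profiles: [age, pages_viewed, placed_music_order(0/1)],
    # with an invented target class for each user.
    X = [[20, 15, 1], [24, 22, 1], [45, 3, 0], [52, 5, 0], [19, 30, 1]]
    y = ["responds", "responds", "ignores", "ignores", "responds"]

    clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(clf.predict([[22, 18, 1]]))   # classify a new user profile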
2.8.5 Sequential Pattern Mining
The technique of sequential pattern discovery attempts to find inter-
session patterns such that the presence of a set of items is followed by
another item in a time-ordered set of sessions or episodes. For example:
The video game shopping cart page view is accessed after the Donkey
Kong Video Game page view 50% of the time.
By using this approach, web marketers can predict future visit
patterns which will be helpful in placing advertisements aimed at certain
user groups. Other types of temporal analysis that can be performed on
sequential patterns includes trend analysis or change point detection.
Trend analysis can be used to detect changes in the usage patterns of a
site over time and change point detection identifies when specific changes
take place.
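The support of a sequential pattern counts order-preserving occurrences, in contrast to the unordered itemsets of Section 2.8.2. A minimal sketch over hypothetical sessions:

    def contains(session, pattern):
        """True if `pattern` occurs in `session` as an ordered subsequence."""
        it = iter(session)
        return all(page in it for page in pattern)   # consumes `it` in order

    def seq_support(sessions, pattern):
        return sum(contains(s, pattern) for s in sessions) / len(sessions)

    sessions = [
        ["/donkey-kong", "/cart", "/checkout"],
        ["/donkey-kong", "/reviews", "/cart"],
        ["/cart", "/donkey-kong"],                # wrong order: no match
        ["/home", "/donkey-kong"],
    ]
    print(seq_support(sessions, ["/donkey-kong", "/cart"]))   # 0.5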
2.8.5.1 Apriori-based Mining Algorithms
This traditional algorithm involves three steps for mining sequential
patterns [67]. Initially, it discovers all the frequent itemsets,
that is, the itemsets with support greater than the minimum support.
In the next step, it replaces each actual transaction with the set of all
frequent itemsets contained in the transaction. Finally, the sequential
patterns are determined. This is a very costly algorithm, as it has to
transform the transactions at every step, and handling constraints for
time periods and taxonomies makes it a complex task.
A generalized definition of sequential patterns that includes time
constraints, sliding windows and user-specified taxonomies has been
suggested [68]. A more general sequential pattern mining algorithm
known as GSP (Generalized Sequential Pattern mining algorithm) has
also been proposed. Similar to the conventional Apriori algorithm, GSP
scans the database many times. Initially, it scans the database to
determine all the frequent items, and the set of frequent sequences of
length one is formed. In subsequent scans, it produces candidate
sequences from the set of frequent sequences obtained in the previous
scan and verifies their supports. The method stops when no frequent
candidate is found. GSP is effective only in circumstances when the
sequences are not too long and when very large transaction databases
are not considered.
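The level-wise idea behind GSP can be sketched as follows. This is a deliberate simplification, assuming single-page elements and no time constraints, windows or taxonomies; it is not the published GSP candidate-generation join.

    def contains(session, pattern):
        """Order-preserving subsequence test, as in the earlier sketch."""
        it = iter(session)
        return all(page in it for page in pattern)

    def gsp_like(sessions, min_support):
        """Grow frequent sequences one item per pass, pruning by support."""
        n = len(sessions)
        support = lambda seq: sum(contains(s, seq) for s in sessions) / n

        items = sorted({p for s in sessions for p in s})
        freq_items = [p for p in items if support((p,)) >= min_support]
        frequent = [(p,) for p in freq_items]            # pass 1
        result = list(frequent)
        while frequent:                                  # later passes
            candidates = [seq + (p,) for seq in frequent for p in freq_items]
            frequent = [c for c in candidates if support(c) >= min_support]
            result.extend(frequent)
        return result

    sessions = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
    print(gsp_like(sessions, min_support=0.5))
    # [('a',), ('b',), ('c',), ('a','b'), ('a','c'), ('b','c')]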
2.8.5.2 WAP-tree Based Mining Algorithms
A powerful approach based on a compact data structure called the
Web Access Pattern tree (or WAP-tree), which is FP-tree oriented, has
been discussed [48]. The WAP-tree structure enables the development
of novel algorithms for mining access patterns efficiently from a large
set of web log pieces. In particular, the WAP-mine algorithm has been
suggested for mining web access patterns from the WAP-tree. The
present technique avoids the problem of producing a huge number of
candidates, as seen in Apriori-like algorithms. Apart from this, the
experimental results show that the WAP-mine algorithm is in general
an order of magnitude faster than the conventional sequential pattern
mining approaches. This can be credited to the compact structure of
the WAP-tree and the novel conditional search strategies employed in
WAP-mine.
The WAP-tree is a very compact structure for maintaining information
from web logs. WAP-mine is the chief mining algorithm based on the
WAP-tree; it does not generate the large numbers of candidate sets
produced by conventional Apriori-based algorithms. However, the
construction of the intermediate conditional WAP-trees while the mining
is in process is very costly. At present, several extensions of the
WAP-tree and the associated mining algorithms are being investigated.
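The following sketch shows the core of the structure: a prefix tree over access sequences with per-node counts, so shared prefixes are stored once. It is a simplification; the header table and event-queue links of the published WAP-tree are omitted.

    class WAPNode:
        """Node of a simplified WAP-tree: an event label and a count."""
        def __init__(self, label):
            self.label = label
            self.count = 0
            self.children = {}

    def build_wap_tree(sequences):
        """Insert each cleaned access sequence, counting along shared prefixes."""
        root = WAPNode(None)
        for seq in sequences:
            node = root
            for event in seq:
                node = node.children.setdefault(event, WAPNode(event))
                node.count += 1
        return root

    def dump(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.label}:{child.count}")
            dump(child, depth + 1)

    dump(build_wap_tree([("a", "b"), ("a", "b", "c"), ("a", "c")]))
    # a:3 / b:2 / c:1 / c:1 -- shared prefixes are stored only once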
The Pre-Order Linked WAP-tree Mining (PLWAP) algorithm has been
discussed, which avoids producing WAP-mine's intermediate conditional
WAP-trees by assigning binary position codes to all tree nodes [70]. The
PLWAP algorithm quickly determines the suffix trees or forests of any
prefix event of a frequent pattern by comparing the assigned binary
position codes of the nodes. The Tree Binary Code Formatting (TreBCF)
technique is then used to assign unique binary position codes to the
nodes of any general tree, by first transforming the tree into its
equivalent binary tree and then using a rule similar to the one used in
Huffman coding to derive a unique code for each node.
The RSC-tree (Recurrent Successive Chain tree) extends the WAP-tree
structure for incremental and interactive mining [69]. The mining
algorithm RSC-mining traverses the RSC-tree to extract frequent
sequences. The suggested RSC-Miner system can incorporate newly
arriving input sequences and respond incrementally without complete
recomputation.
The system also allows users to modify the mining parameters (e.g.,
minimum support and required pattern length) interactively, without
requiring complete recomputation in most cases. The incremental
update capability of the system offers substantial performance
advantages over complete recomputation, even for very large updates.
2.9 SUMMARY
The present chapter has focused in depth on related work on web
usage mining, covering web usage data, preprocessing
tasks and several pattern discovery approaches. Web
usage data is the main data source for web usage
mining; it primarily consists of web server logs, proxy server logs
and client browser logs. Since web server logs have almost
identical structures and are readily available on all web
servers, they are in general the most convenient and commonly used
data source in research on web usage mining. Preprocessing of the web
usage data comprises data cleaning, user identification and session
identification.
The basic approaches to extracting patterns from web logs comprise
statistical analysis, association rule mining, clustering,
classification and sequential pattern mining.
Statistical approaches are specifically used to discover statistical
knowledge from the web logs. This type of knowledge is widely used to
analyze the traffic of a website. Association rule mining determines the
sets of pages that are accessed together in an access session.
The clustering approach is very helpful for extracting page groups
and user groups from web logs. Page groups are used to improve
search engines and to provide web page categorization, while the
user groups are useful to infer user demographics so as to deliver
personalized web content to visitors. Classification is the task of
mapping a data item into one of several predefined
groups. These groups are generally used to represent different user
profiles, and classification is carried out using the
features that best describe the properties of a particular group
or class. Sequential pattern mining is a further extension of
association rule mining that takes into account the order in which
items occur in events. Sequential patterns are sequences of
web pages frequently accessed by users. These kinds of patterns are
helpful for characterizing user behavior and predicting the next pages
to be browsed by a user.