
A Personalized Ontology Model for Web Information Gathering

Abstract:

As a model for knowledge description and formalization, ontologies are

widely used to represent user profiles in personalized web information gathering.

Current web information gathering systems attempt to satisfy user requirements by

capturing their information needs. User profiles represent the concept models

possessed by users when gathering web information. However, when representing

user profiles, many models have utilized only knowledge from either a global knowledge base or user local information. A personalized ontology model is

proposed for knowledge representation and reasoning over user profiles. This

model learns ontological user profiles from both a world knowledge base and user

local instance repositories. The ontology model is evaluated by comparing it

against benchmark models in web information gathering. The proposed ontology

model provides a solution to emphasizing global and local knowledge in a single

computational model. The findings in this project can be applied to the design of

web information gathering systems. The model also has extensive contributions to

the fields of Information Retrieval, web Intelligence, Recommendation Systems,

and Information Systems.

Introduction

Overview Of the Project:

With the advent of the Internet, the web has become a major source of information. It is widely used by people from a variety of backgrounds. The need of the hour is to make the process of searching the Internet for information more and more efficient.

With the size of the internet increasing exponentially, the volume of data to be

crawled also proportionally increases, as a result of which it becomes increasingly

necessary to have appropriate crawling mechanisms in order to make crawls

efficient. Search engines have to answer millions of queries every day. This has

made engineering a search engine a highly challenging task. Search engines primarily perform three basic tasks: (a) they search the Internet, or select pages, based on important words; (b) they keep an index of the words they find and where they find them; and (c) they allow users to look for words or combinations of words found in that index. A web crawler is a computer program that browses the Internet in a methodical, automated manner. The crawler typically crawls through links, grabbing content from websites and adding it to search engine indexes. The World Wide

Web provides a vast source of information of almost all types. However, this

information is often scattered among many web servers and hosts, using many

different formats. We all want the best possible search results in the least time. In this paper, we introduce the working of a focused, ontology-guided crawler, which is merged with a procedure for finding copyright infringement. For any crawler

there are two issues that it should consider. First, the crawler should have the

capability to plan, i.e., to decide which pages to download next. Second, it

needs to have a highly optimized and robust system architecture so that it can download a large number of pages per second, remain robust against crashes, and be manageable and considerate of resources and web servers. There has been some recent academic interest in the first issue, including work on deciding which important pages the crawler should fetch first. In contrast, less work has been done on the second issue. Clearly, all the major search engines have highly optimized crawling systems, although the workings and documentation of these systems usually remain with their owners.

The crawler that is known in detail and described in the literature is the Mercator system, which was used by AltaVista. It is easy to build a crawler that works slowly, downloading a few pages per second for a short period of time. In contrast, it is a big challenge to build a high-performance system whose design must address I/O and network efficiency, robustness, and manageability. Every search engine is divided into different modules; among these, the user interface module is the one on which the search engine relies the most, because it helps to provide the best possible results. Crawlers are small programs that ‘browse’ the web on

the search engine’s behalf, similarly to how a human user would follow links to

reach different pages. The programs are given a starting set of seed URLs, whose pages

they retrieve from the web. The crawler extracts URLs appearing in the retrieved

pages, and gives this information to the crawler control module. This module

determines what links to visit next, and feeds the links to visit back to the crawlers.

The crawler also passes the retrieved pages into a page repository. Crawlers

continue visiting the web, until local resources, such as storage, are exhausted.
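
To make the cycle just described concrete, the following fragment is a minimal, illustrative sketch of a crawler in Java (the language used for this project). The class name SimpleCrawler, the regular-expression link extraction, and the in-memory page repository are simplifying assumptions for illustration, not the project's actual implementation.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;
import java.util.regex.*;

// Simplified crawler sketch: take seed URLs, fetch pages, extract links, repeat.
public class SimpleCrawler {
    private final Queue<String> frontier = new LinkedList<String>();              // URLs still to visit
    private final Set<String> visited = new HashSet<String>();                    // pages already fetched
    private final Map<String, String> repository = new HashMap<String, String>(); // page repository
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public void crawl(List<String> seeds, int maxPages) {
        frontier.addAll(seeds);
        while (!frontier.isEmpty() && repository.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;               // skip URLs we have already seen
            try {
                String page = download(url);
                repository.put(url, page);                 // hand the page to the repository
                Matcher m = LINK.matcher(page);
                while (m.find()) frontier.add(m.group(1)); // feed extracted links back to the frontier
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }

    private String download(String url) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
        StringBuilder page = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) page.append(line).append('\n');
        in.close();
        return page.toString();
    }
}

A real crawler would additionally respect robots.txt, limit the request rate per host, and persist the repository to disk; those concerns are omitted here for brevity.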

Literature Survey:

Over the last decades, the amount of web-based information available has

increased dramatically. How to gather useful information from the web has

become a challenging issue for users. Current web information gathering systems

attempt to satisfy user requirements by capturing their information needs. For this

purpose, user profiles are created for user background knowledge description. User

profiles represent the concept models possessed by users when gathering web

information. A concept model is implicitly possessed by users and is generated

from their background knowledge. While this concept model cannot be proven in

laboratories, many web ontologists have observed it in user behavior. When users

read through a document, they can easily determine whether or not it is of their

interest or relevance to them, a judgment that arises from their implicit concept

models. If a user’s concept model can be simulated, then a superior representation

of user profiles can be built. To simulate user concept models, ontologies—a

knowledge description and formalization model—are utilized in personalized web

information gathering. Such ontologies are called ontological user profiles or

personalized ontologies. To represent user profiles, many researchers have

attempted to discover user background knowledge through global or local analysis.

Global analysis uses existing global knowledge bases for user background

knowledge representation. Commonly used knowledge bases include generic

ontologies (e.g., WordNet ), thesauruses (e.g., digital libraries), and online

knowledge bases (e.g., online categorizations and Wikipedia). The global analysis

techniques produce effective performance for user background knowledge

extraction. However, global analysis is limited by the quality of the used

knowledge base. For example, WordNet was reported as helpful in capturing user

interest in some areas but useless for others.

Local analysis investigates user local information or observes user behavior

in user profiles. In some works, users were provided with a set of

documents and asked for relevance feedback. User background knowledge was

then discovered from this feedback for user profiles. However, because local

analysis techniques rely on data mining or classification techniques for knowledge

discovery, occasionally the discovered results contain noisy and uncertain

information. As a result, local analysis suffers from ineffectiveness at capturing

formal user knowledge. From this, we can hypothesize that user background

knowledge can be better discovered and represented if we can integrate global and

local analysis within a hybrid model. The knowledge formalized in a global

knowledge base will constrain the background knowledge discovery from the user

local information. Such a personalized ontology model should produce a superior

representation of user profiles for web information gathering. An ontology model

to evaluate this hypothesis is proposed. This model simulates users’ concept

models by using personalized ontologies, and attempts to improve web information

gathering performance by using ontological user profiles. The world knowledge

and a user’s local instance repository (LIR) are used in the proposed model. World

knowledge is commonsense knowledge acquired by people from experience and

education. An LIR is a user’s personal collection of information items. From a

world knowledge base, we construct personalized ontologies by adopting user

feedback on interesting knowledge. A multidimensional ontology mining method,

Specificity and Exhaustivity, is also introduced in the proposed model for

analyzing concepts specified in ontologies. The users’ LIRs are then used to

discover background knowledge and to populate the personalized ontologies. The

proposed ontology model is evaluated by comparison against some benchmark

models through experiments using a large standard data set. The evaluation results

show that the proposed ontology model is successful. The research contributes to

knowledge engineering, and has the potential to improve the design of

personalized web information gathering systems. The contributions are original

and increasingly significant, considering the rapid explosion of web information

and the growing accessibility of online documents.
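
The exact formulas of the Specificity and Exhaustivity method mentioned above belong to the proposed model and are not reproduced here. Purely to illustrate the intuition, the sketch below assumes that a subject's specificity decreases as the subject becomes more general (computed bottom-up from leaf subjects) and that exhaustivity aggregates specificity over a subject and its descendants; the SubjectNode class and the decay factor are illustrative assumptions only, not the model's actual definitions.

import java.util.*;

// Illustrative-only sketch of specificity/exhaustivity over a subject taxonomy.
// The proposed model's formulas differ; this only shows the bottom-up intuition.
public class SubjectNode {
    String label;
    List<SubjectNode> children = new ArrayList<SubjectNode>();

    SubjectNode(String label) { this.label = label; }

    // Assumed: leaf subjects are fully specific (1.0); a parent is less specific
    // than its children, modeled here by a simple decay factor.
    double specificity() {
        if (children.isEmpty()) return 1.0;
        double sum = 0.0;
        for (SubjectNode c : children) sum += c.specificity();
        return 0.5 * (sum / children.size());   // decay factor 0.5 is an arbitrary choice
    }

    // Assumed: exhaustivity aggregates specificity over the subject and its descendants.
    double exhaustivity() {
        double total = specificity();
        for (SubjectNode c : children) total += c.exhaustivity();
        return total;
    }
}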

Global knowledge bases were used by many existing models to learn

ontologies for web information gathering. On the basis of the Dewey Decimal

Classification, King et al. developed IntelliOnto to improve performance in

distributed web information retrieval. Wikipedia was used by Downey et al. These

works effectively discovered user background knowledge; however, their

performance was limited by the quality of the global knowledge bases. Aiming at

learning personalized ontologies, many works mined user background knowledge

from user local information. Pattern recognition and association rule mining techniques were used to discover knowledge from user local documents for ontology construction. OntoLearn was used to discover semantic concepts and relations from web

documents. Web content mining techniques were used by Jiang and Tan to

discover semantic knowledge from domain-specific text documents for ontology

learning. Finally, Shehata et al. captured user information needs at the sentence

level rather than the document level, and represented user profiles by the

Conceptual Ontological Graph. The use of data mining techniques in these models

led to more user background knowledge being discovered. However, the

knowledge discovered in these works contained noise and uncertainties.

Additionally, ontologies were used in many works to improve the performance of

knowledge discovery. Using a fuzzy domain ontology extraction algorithm, a

mechanism was developed by Lau et al. These works attempted to explore a route

to model world knowledge more efficiently.

“An Integrated Architecture for Personalized Query Expansion in Web Search”

Alexander Salamanca and Elizabeth León

Personalization has been seen as one of the most promising trends in the near future for significantly improving the enjoyment of the search experience on the Web. The main idea is to deliver quality results prepared uniquely for different users, who do not necessarily share the same long-term interests with other people. The approach described in this paper exploits relations between keywords of the user’s search history and a more general set of keywords that expands the user’s search scope. It works through a three-stage cycle: extracting key terms by local analysis, extracting further key terms through an automatic recommendation system, and applying an algorithm to personalize the final list of suggested terms. Thus, it can produce high-quality and relevant query suggestions, i.e., reformulations of the user intent, verbalized as a web query, that increase the chances of retrieving better results.

Query expansion is the process of adding additional terms to a user’s

original query, with the purpose of improving retrieval performance (Efthimiadis

1995). Although query expansion can be conducted manually by the searcher, or

automatically by the information retrieval system, the focus here is on interactive

query expansion which provides computer support for users to make choices which

result in the expansion of their queries. A common method for query expansion is

the relevance feedback technique (Salton 1979), in which the users judge relevance

from the results of a search. This information is then used to modify the vector-model query, with increased weights on the terms found in the relevant documents. In (Salton and Buckley 1990), these techniques were shown to be effective. A comprehensive literature review of methods for extracting keywords based on term frequency, document frequency, etc., is presented in (Efthimiadis 1995).
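
To make the relevance feedback idea concrete, the fragment below sketches a Rocchio-style expansion in Java: terms from documents the user judged relevant are added to the original query vector with increased weights. The QueryExpander class, the alpha/beta parameters, and the term-weighting scheme are illustrative assumptions, not details taken from the cited works.

import java.util.*;

// Rocchio-style query expansion sketch: boost terms that occur in relevant documents.
public class QueryExpander {
    // Expanded query = alpha * original query + beta * centroid of relevant documents.
    public static Map<String, Double> expand(Map<String, Double> query,
                                             List<Map<String, Double>> relevantDocs,
                                             double alpha, double beta) {
        Map<String, Double> expanded = new HashMap<String, Double>();
        for (Map.Entry<String, Double> e : query.entrySet())
            expanded.put(e.getKey(), alpha * e.getValue());
        for (Map<String, Double> doc : relevantDocs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                double add = beta * e.getValue() / relevantDocs.size();
                Double old = expanded.get(e.getKey());
                expanded.put(e.getKey(), (old == null ? 0.0 : old) + add); // increase weight of feedback terms
            }
        }
        return expanded;   // a caller would typically keep only the top-k new terms as suggestions
    }
}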

“Link Analysis Ranking: Algorithms, Theory, and Experiments”

Allan Borodin

Ranking is an integral component of any information retrieval system. In the

case of Web search, because of the size of the Web and the special nature of the

Web users, the role of ranking becomes critical. It is common for Web search

queries to have thousands or millions of results. On the other hand, Web users do

not have the time and patience to go through them to find the ones they are

interested in. It has actually been documented [Broder 2002; Silverstein et al.

1998; Jansen et al. 1998] that most Web users do not look beyond the first page of

results. Therefore, it is important for the ranking function to output the desired

results within the top few pages, otherwise the search engine is rendered useless.

Furthermore, the needs of the users when querying the Web are different from

traditional information retrieval. For example, a user that poses the query

“microsoft” to a Web search engine is most likely looking for the home page of

Microsoft Corporation, rather than the page of some random user that complains

about the Microsoft products. In a traditional information retrieval sense, the page

of the random user may be highly relevant to the query.

However, Web users are most interested in pages that are not only relevant,

but also authoritative, that is, trusted sources of correct information that have a

strong presence in the Web. In Web search, the focus shifts from relevance to

authoritativeness. The task of the ranking function is to identify and rank highly

the authoritative documents within a collection of Web pages. To this end, the Web

offers a rich context of information which is expressed through the hyperlinks. The

hyperlinks define the “context” in which a Web page appears. Intuitively, a link

from page p to page q denotes an endorsement for the quality of page q. We can

think of the Web as a network of recommendations which contains information

about the authoritativeness of the pages. The task of the ranking function is to

extract this latent information and produce a ranking that reflects the relative

authority of Web pages. Building upon this idea, the seminal papers of Kleinberg

[1998], and Brin and Page [1998] introduced the area of Link Analysis Ranking,

where hyperlink structures are used to rank Web pages.
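
As a concrete illustration of link analysis ranking, the sketch below computes PageRank scores by power iteration over a small link graph, following the idea of Brin and Page [1998]. The damping factor of 0.85 and the treatment of dangling pages are conventional simplifications; the code is illustrative rather than any search engine's actual implementation.

import java.util.*;

// PageRank by power iteration over an adjacency list: page -> pages it links to.
public class PageRank {
    public static Map<String, Double> rank(Map<String, List<String>> links, int iterations) {
        double d = 0.85;                        // conventional damping factor
        int n = links.size();
        Map<String, Double> rank = new HashMap<String, Double>();
        for (String page : links.keySet()) rank.put(page, 1.0 / n);   // uniform start

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<String, Double>();
            for (String page : links.keySet()) next.put(page, (1 - d) / n);
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> out = e.getValue();
                if (out.isEmpty()) continue;    // dangling pages ignored in this sketch
                double share = rank.get(e.getKey()) / out.size();
                for (String target : out) {
                    Double cur = next.get(target);
                    if (cur != null) next.put(target, cur + d * share); // endorsement passed along the link
                }
            }
            rank = next;
        }
        return rank;
    }
}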

Learning to Map between Ontologies on the Semantic Web

AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy

The current World-Wide Web has well over 1.5 billion pages [3], but the

vast majority of them are in human-readable format only (e.g., HTML). As a consequence, software agents (softbots) cannot understand and process this information, and much of the potential of the Web has so far remained untapped. In response, researchers have created the vision of the Semantic Web [6], where data has structure and ontologies describe the semantics of the data. Ontologies allow users to organize information into taxonomies of concepts, each with their attributes, and describe relationships between concepts. When data is marked up using ontologies, softbots can better understand the semantics and therefore more intelligently locate and integrate data for a wide variety of tasks. The following

example illustrates the vision of the Semantic Web.

On the World-Wide Web of today you will have trouble finding this person. The above information is not contained within a single Web page, thus making keyword search ineffective. On the Semantic Web, however, you should be able to quickly find the answers. A marked-up directory service makes it easy for your personal softbot to find nearby Computer Science departments. These departments have marked up data using some ontology. Here the data is organized into a taxonomy that includes courses, people, and professors. Professors have attributes such as name, degree, and degree-granting institution. Such marked-up data makes it easy for your softbot to find a professor with the last name Cook. Then by examining the attribute “granting institution”, the softbot quickly finds the alma mater CS department in Australia.

“Ontology-Based Personalized Search and Browsing”

Susan Gauch, Jason Chaffee

The Web has experienced continuous growth since its creation. As of March

2002, the largest search engine contained approximately 968 million indexed pages

in its database [SES 02]. As the number of Internet users and the number of

accessible Web pages grows, it is becoming increasingly difficult for users to find

documents that are relevant to their particular needs. Users of the Internet basically

have two ways to find the information for which they are looking: they can browse

or they can search with a search engine. Browsing is usually done by clicking

through a hierarchy of concepts, or ontology, until the area of interest has been

reached. The corresponding node then provides the user with links to related Web

sites. Search engines allow users to enter keywords to retrieve documents that

contain these keywords. The browsing and searching algorithms are essentially the

same for all users. The ontologies that are used for browsing content at a Web site

are generally different for each site that a user visits. Even if there are similarly

named concepts in the ontology, they may contain different types of pages.

Frequently, the same concepts will appear with different names and/or in different

areas of the ontology. Not only are there differences between sites, but between

users as well. One user may consider a certain topic to be an “Arts” topic, while a

different user might consider the same topic to be a “Recreation” topic. Thus,

although browsing provides a very simple mechanism for information navigation,

it can be time consuming for users when they take the wrong paths through the

ontology in search of information. The alternate navigation strategy, search, has its

own problems. Indeed, approximately one half of all retrieved documents have

been reported to be irrelevant [Casasola 98]. One of the main reasons for obtaining

poor search results is that many words have multiple meanings [Krovetz 92]. For

instance, two people searching for “wildcats” may be looking for two completely

different things (wild animals and sports teams), yet they will get exactly the same

results. It is highly unlikely that the millions of users with access to the Internet are

so similar in their interests that one approach to browsing or searching,

respectively, fits all needs. What is needed is a solution that will “personalize” the

information selection and presentation for each user. This paper explores the

OBIWAN project’s use of ontologies as the key to providing personalized

information access. Our goal is to automatically create ontology-based user profiles and use these profiles to personalize the results from an Internet search engine, and also to use them to create personalized navigation hierarchies of remote Web sites.

SYSTEM ANALYSIS

Existing System:

User profiles represent the concept models possessed by users when

gathering web information. A concept model is implicitly possessed by users and is

generated from their background knowledge. While this concept model cannot be

proven in laboratories, many web ontologists have observed it in user behavior.

When users read through a document, they can easily determine whether or not it is

of their interest or relevance to them, a judgment that arises from their implicit

concept models. If a user’s concept model can be simulated, then a superior

representation of user profiles can be built.

Drawbacks:

It is a blind search.

If the search space is large, then the search performance will be poor.

Using web documents for training sets has one severe drawback: web information contains much noise and uncertainty.

Proposed System:

In our project, an ontology model to evaluate this hypothesis is proposed.

This model simulates users’ concept models by using personalized ontologies, and

attempts to improve web information gathering performance by using ontological

user profiles. The world knowledge and a user’s local instance repository (LIR) are

used in the proposed model. World knowledge is commonsense knowledge

acquired by people from experience and education; an LIR is a user’s personal

collection of information items. From a world knowledge base, we construct

personalized ontologies by adopting user feedback on interesting knowledge. A

multidimensional ontology mining method, Specificity and Exhaustivity, is also

introduced in the proposed model for analyzing concepts specified in ontologies.

The users’ LIRs are then used to discover background knowledge and to populate

the personalized ontologies. The proposed ontology model is evaluated by

comparison against some benchmark models through experiments using a large

standard data set. The evaluation results show that the proposed ontology model is

successful.

Advantages:

A multidimensional ontology mining method, Specificity and Exhaustivity, is

also introduced in the proposed model for analyzing concepts specified in

ontologies.

The users’ LIRs are then used to discover background knowledge and to

populate the personalized ontologies.


DEVELOPMENT ENVIRONMENT

SYSTEM CONFIGURATION

HARDWARE REQUIREMENTS

Processor : 733 MHz Pentium III Processor

RAM : 128 MB

Hard Drive : 10GB

Monitor : 14” VGA COLOR MONITOR

Mouse : Logitech Serial Mouse

Disk Space : 1 GB

SOFTWARE REQUIREMENTS:

Platform : JDK 1.6

Operating System : Microsoft Windows NT 4.0 or

Windows 2000 or XP

Programming Language : Java

Database : MySQL 5.1

Tool : NETBEANS 6.8, SQLYog

Module Description:

Modules:

Knowledge Based Grouping

User Query for Category Identification

Algorithm Implementation

Gathering Information

1. Knowledge Based Grouping:

An LIR is a user’s personal collection of information items. From a world knowledge base, we construct personalized ontologies by adopting user feedback on interesting knowledge. Analyzing the concepts specified in ontologies has been used in many works to improve the performance of knowledge discovery. Such a personalized ontology model should produce a superior representation of user profiles for web information gathering. The user requests a URL for gathering information through the ontology model, and the model partitions the relevant links given by the user. Each relevant URL gathers information into a separate category, such as general, education, health, or entertainment.

2. User Query for Category Identification:

User profiles are used in web information gathering to interpret the semantic meanings of queries and to capture user information needs. User profiles can be categorized into four groups: general, education, health, and entertainment. A user-supplied URL is matched to one of these categories so that information can be gathered under that category. For example, if the user gives a query about an education-oriented link, the link is used to identify the category, and the gathered information is saved under that particular category. The user query collects information by searching the subject catalog of the user-given link. The categories are useful for separating the links, and the gathered information is saved accordingly.
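
The description above does not state how a query or URL is mapped to one of the four categories. One plausible, purely illustrative realization is simple keyword matching against small category vocabularies, as sketched below; the Categorizer class and its keyword lists are assumptions of ours, not the project's actual code.

import java.util.*;

// Illustrative keyword-based mapping of a query/URL text onto the four profile categories.
public class Categorizer {
    private final Map<String, List<String>> vocab = new HashMap<String, List<String>>();

    public Categorizer() {
        vocab.put("education", Arrays.asList("university", "course", "tutorial", "school"));
        vocab.put("health", Arrays.asList("medicine", "doctor", "fitness", "disease"));
        vocab.put("entertainment", Arrays.asList("movie", "music", "game", "sport"));
        // anything that matches nothing falls back to the "general" category
    }

    public String categorize(String text) {
        String lower = text.toLowerCase();
        String best = "general";
        int bestHits = 0;
        for (Map.Entry<String, List<String>> e : vocab.entrySet()) {
            int hits = 0;
            for (String keyword : e.getValue())
                if (lower.contains(keyword)) hits++;
            if (hits > bestHits) { bestHits = hits; best = e.getKey(); }
        }
        return best;   // gathered information would then be saved under this category
    }
}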

3. Algorithm Implementation:

User background knowledge can be discovered from user local

information collections, such as a user’s stored documents, browsed

web pages, and composed/received relevant details. The ontology

constructed has only subject labels and semantic relations specified.

We populate the ontology with the instances generated from user local

information collections. Classification is another strategy to map the

unstructured/semi-structured documents in a user profile to the representation in the global knowledge base. However, ontology mapping and text classification/clustering are beyond the scope of the work presented in our project. The present work assumes that all user

local instance repositories have content-based descriptors referring to

the subjects; however, a large volume of documents existing on the

web may not have such content-based descriptors. For this problem,

strategies like ontology mapping and text classification/clustering

were suggested. These strategies will be investigated in future work to

solve this problem. Here we use the naïve Bayes algorithm for text classification of the gathered information, and a hierarchical clustering algorithm for clustering the data of the link given by the user.
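
The module names the naïve Bayes algorithm for text classification without listing its implementation. The sketch below shows a standard multinomial naïve Bayes classifier with add-one (Laplace) smoothing as one way such a component could look; the class structure and method names are our own assumption.

import java.util.*;

// Multinomial naive Bayes sketch with add-one (Laplace) smoothing.
public class NaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<String, Map<String, Integer>>();
    private final Map<String, Integer> docCounts = new HashMap<String, Integer>();
    private final Set<String> vocabulary = new HashSet<String>();
    private int totalDocs = 0;

    public void train(String category, String[] words) {
        totalDocs++;
        Integer dc = docCounts.get(category);
        docCounts.put(category, dc == null ? 1 : dc + 1);
        Map<String, Integer> counts = wordCounts.get(category);
        if (counts == null) { counts = new HashMap<String, Integer>(); wordCounts.put(category, counts); }
        for (String w : words) {
            vocabulary.add(w);
            Integer c = counts.get(w);
            counts.put(w, c == null ? 1 : c + 1);
        }
    }

    public String classify(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            Map<String, Integer> counts = wordCounts.get(category);
            int totalWords = 0;
            for (int c : counts.values()) totalWords += c;
            double score = Math.log(docCounts.get(category) / (double) totalDocs); // log prior
            for (String w : words) {
                Integer c = counts.get(w);
                // add-one smoothing over the vocabulary size
                score += Math.log(((c == null ? 0 : c) + 1.0) / (totalWords + vocabulary.size()));
            }
            if (score > bestScore) { bestScore = score; best = category; }
        }
        return best;
    }
}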

4. Gathering Information:

Good URLs correspond to the concepts relevant to the information need, and bad URLs correspond to the concepts causing paradoxical or ambiguous interpretations of the information need. The user wants to retrieve web information from a given URL link. In this gathering process, the URL goes through a checking and validation process to identify whether it is a good URL or a bad URL. Information is gathered from good URLs only; bad URL links are omitted by the validation process. A bad URL link is an irrelevant link and is not considered by the web information gathering process when requesting pages. After URL checking, the information is downloaded. The downloaded links are processed by the algorithms and the URL validation process, so that the gathered information contains only relevant details.
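
The checking and validation step is not spelled out above. One straightforward, illustrative interpretation is to treat a URL as good when it is well formed and the server responds successfully, as in the sketch below; the UrlValidator class and the HTTP status rule are assumptions for illustration only.

import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative good/bad URL check: well-formed and reachable => good.
public class UrlValidator {
    public static boolean isGoodUrl(String link) {
        try {
            URL url = new URL(link);                       // malformed URLs throw here => bad URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("HEAD");                 // only ask for headers, not the page body
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            return status >= 200 && status < 400;          // 2xx/3xx treated as good in this sketch
        } catch (Exception e) {
            return false;                                  // unreachable or malformed => bad URL
        }
    }

    public static void main(String[] args) {
        System.out.println(isGoodUrl("http://java.sun.com"));  // example usage with a site listed later in this report
    }
}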

Architecture Diagram:

(Figure) Read URL -> Pattern Recognition -> URL Validation / Checking -> Identification Process -> Downloading Process -> Identify the Required URL -> Get the Required URL

Overall DFD (Data Flow Diagram):

(Figure) Nodes for web sites (W1-W4), facts (f1-f4), and objects (o1, o2)

Data Flow Diagram:

(Figure) Input: the user enters a URL. The system searches the URL and classifies the results into internal URLs (in order of gathering), internal URLs (in order of size), external URLs, other URLs, bad URLs, exceptions, and CSS classes, then creates the output as HTML files or exits.

References:

[1] B. Amento, L.G. Terveen, and W.C. Hill, “Does ‘Authority’ Mean Quality?

Predicting Expert Quality Ratings of Web Documents,” Proc. ACM SIGIR ’00,

July 2000.

[2] M. Blaze, J. Feigenbaum, and J. Lacy, “Decentralized Trust Management,”

Proc. IEEE Symp. Security and Privacy (ISSP ’96), May 1996.

[3] A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, “Link Analysis

Ranking: Algorithms, Theory, and Experiments,” ACM Trans. Internet

Technology, vol. 5, no. 1, pp. 231-297, 2005.

[4] J.S. Breese, D. Heckerman, and C. Kadie, “Empirical Analysis of Predictive

Algorithms for Collaborative Filtering,” technical report, Microsoft Research,

1998.

[5] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins, “Propagation of Trust and

Distrust,” Proc. 13th Int’l Conf. World Wide Web (WWW), 2004.

[6] G. Jeh and J. Widom, “SimRank: A Measure of Structural-Context Similarity,”

Proc. ACM SIGKDD ’02, July 2002.

[7] J.M. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” J.

ACM, vol. 46, no. 5, pp. 604-632, 1999.

[8] Logistic Equation, Wolfram MathWorld, http://mathworld.wolfram.com/LogisticEquation.html, 2008.

[9] T. Mandl, “Implementation and Evaluation of a Quality-Based Search Engine,”

Proc. 17th ACM Conf. Hypertext and Hypermedia, Aug. 2006.

[10] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation

Ranking: Bringing Order to the Web,” technical report, Stanford Digital Library

Technologies Project, 1998.

Sites Referred:

http://java.sun.com

http://www.sourcefordgde.com

http://www.networkcomputing.com/

Conclusion:

In this paper, an ontology model is proposed for representing user

background knowledge for personalized web information gathering. The model

constructs user personalized ontologies by extracting world knowledge from the

LCSH system and discovering user background knowledge from user local

instance repositories. A multidimensional ontology mining method, exhaustivity and specificity, is also introduced for user background knowledge discovery. In

evaluation, the standard topics and a large test bed were used for experiments. The

model was compared against benchmark models by applying it to a common

system for information gathering. The experiment results demonstrate that our

proposed model is promising. A sensitivity analysis was also conducted for the

ontology model. In this investigation, we found that the combination of global and

local knowledge works better than using any one of them. In addition, the ontology

model using knowledge with both is-a and part-of semantic relations works better

than using only one of them. When using only global knowledge, these two kinds

of relations have the same contributions to the performance of the ontology model.

While using both global and local knowledge, the knowledge with part-of relations

is more important than that with is-a.

The proposed ontology model in this paper provides a solution to

emphasizing global and local knowledge in a single computational model. The

findings in this paper can be applied to the design of web information gathering

systems. The model also has extensive contributions to the fields of Information

Retrieval, web Intelligence, Recommendation Systems, and Information Systems.

Future Work:

In our future work, we will investigate the methods that generate user

local instance repositories to match the representation of a global knowledge base.

The present work assumes that all user local instance repositories have content-

based descriptors referring to the subjects; however, a large volume of documents

existing on the web may not have such content-based descriptors. For this problem,

strategies like ontology mapping and text classification/clustering were suggested.

These strategies will be investigated in future work to solve this problem. The

investigation will extend the applicability of the ontology model to the majority of

the existing web documents and increase the contribution and significance of the

present work.

