LC COUPLED FILTER
- A TECHNIQUE FOR INFORMATION RETRIEVAL
A PROJECT REPORT
Submitted by
S. ABDUL LATHEEF
S. MURELI
R. PADMANABHAN
R. RAVISHANKAR
in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
B.S. ABDUR RAHMAN CRESCENT ENGINEERING COLLEGE,
CHENNAI 600 048
ANNA UNIVERSITY: CHENNAI 600 025
APRIL 2006
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report “LC COUPLED FILTER - A TECHNIQUE FOR
INFORMATION RETRIEVAL” is the bonafide work of “S.ABDUL LATHEEF
(40402205001), S.MURELI (40402205051), R.PADMANABHAN (40402205057),
R.RAVISHANKAR (40402205066)” who carried out the project work under my
supervision.
SIGNATURE
Prof. S. P. REDDY
HEAD OF THE DEPARTMENT
Department of Information Technology
B.S.A. Crescent Engineering College,
Seethakathi Estate, Vandalur,
Chennai – 600 048.

SIGNATURE
Ms. ANGELINA GEETHA
SUPERVISOR
Asst. Professor
Department of Information Technology
B.S.A. Crescent Engineering College,
Seethakathi Estate, Vandalur,
Chennai – 600 048.
ANNA UNIVERSITY: CHENNAI 600 025
VIVA VOCE EXAMINATION
The viva-voce examination of the following students, who have submitted
the project report "LC COUPLED FILTER - A TECHNIQUE FOR
INFORMATION RETRIEVAL", is held on ________________.
S. ABDUL LATHEEF (40402205001)
S. MURELI (40402205051)
R. PADMANABHAN (40402205057)
R. RAVISHANKAR (40402205066)
INTERNAL EXAMINER EXTERNAL EXAMINER
ACKNOWLEDGEMENT
We express our deepest gratitude to Prof. S. PEER MOHAMMED,
Correspondent, B.S.A. Crescent Engineering College, for his invaluable guidance
and blessings.
We are grateful to our Principal, Dr. V. M. PERIASAMY, for providing us with an
environment in which to carry out our course successfully.
We are deeply indebted to our beloved Head of the Department, Prof. S. P.
REDDY, who moulded us both technically and morally for achieving greater success in
life.
We record our sincere thanks to Prof. A. PANNEERSELVAM, Professor and
Dean, Department of Information Technology, for his constant encouragement and
support throughout our course, especially for the useful suggestions given during the
project period.
We express our thanks to the project coordinators Ms. G. KAVITHA, Lecturer
and Ms. LATHA TAMILSELVAN, Asst. Professor, Department of Information
Technology, for their valuable suggestions during our project reviews.
We express our sincere thanks to our guide Ms. ANGELINA GEETHA, Asst.
Professor, Department of Information Technology, for being instrumental in the
completion of our project with her exemplary guidance.
We thank all the staff members of our department who helped us at various
levels for the successful completion of our project.
We take this opportunity to extend our deep appreciation to our family and
friends for all that they meant to us during the crucial times of our project completion.
TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
  1.1 SEARCH ENGINE
  1.2 INFORMATION RETRIEVAL
  1.3 INFORMATION FILTER
  1.4 RE-RANKING
  1.5 LITERATURE SURVEY
  1.6 ORGANIZATION OF REPORT
2 PROBLEM DEFINITION
3 DEVELOPMENT PROCESS
  3.1 REQUIREMENT ANALYSIS AND SPECIFICATIONS
    3.1.1 Input Requirements
    3.1.2 Output Requirements
    3.1.3 Functional Requirements
  3.2 RESOURCE REQUIREMENTS
    3.2.1 Hardware
    3.2.2 Software
  3.3 DESIGN
    3.3.1 Architectural Design
    3.3.2 Detailed Design
      3.3.2.1 User Interface
      3.3.2.2 Module Description
  3.4 IMPLEMENTATION: ALGORITHM AND CODE
  3.5 TESTING
4 APPLICATIONS AND FUTURE ENHANCEMENTS
5 CONCLUSION
REFERENCES
APPENDIX A – SCREENSHOTS OF INSTALLATION PROCESS
APPENDIX B – SAMPLE INTERFACE SCREENS
APPENDIX C – RESULTS AND ANALYSIS
ABSTRACT

Search engines have been widely used since their inception in the form of
"Archie" through to present-day engines such as Google, AltaVista and MSN.
These modern search engines are broadly similar; the small differences that
make each of them unique lie mostly in the technology used for searching and
in the way the retrieved results are filtered. The Internet has grown to such
a level that a single query returns a plethora of results, which overwhelms
the user. The only way to overcome this overload is to filter the results and
weed out redundant links. There are three types of filtering, namely content
filtering, link filtering and collaborative filtering. Our work focuses on
these issues. The search results obtained from the search engine are passed
through a link filter to eliminate duplicate links. They then pass through a
content filter that measures the relevance of each web document to the given
query. Though this brings about a radical reduction in the results, one more
issue has to be addressed: ranking the filtered results. What we propose is a
combination of these filters in the right proportion, together with re-ranking
based on specific parameters, which gives results better than those given by a
standard search engine. An effective user interface is designed for the system
to help users navigate easily. These features will be highly useful for
novices, amateurs and even connoisseurs of the web. The results of our work
are graphically compared with the standard search engine results and are found
to be better.
LIST OF TABLES

S. No   Table Name
1       Hardware Resource Requirement Table
2       Software Resource Requirement Table
3       UTC_LCFilter_001
4       UTC_LCFilter_002
5       UTC_LCFilter_003
6       UTC_LCFilter_004
7       UTC_LCFilter_005
8       UTC_LCFilter_006
9       UTC_LCFilter_007
10      UTC_LCFilter_008
11      UTC_LCFilter_009
12      UTC_LCFilter_010
13      UTC_LCFilter_011
LIST OF FIGURES

S. No   Figure Name
1       Architectural Diagram
2       Flow Diagram
3       Algorithm for Stopword Removal
4       Algorithm for Query Match
5       Algorithm for Query Match Percentage
6       Algorithm for In-link Referral
7       Algorithm for In-link Referral Percentage
8       Algorithm for Page Size Calculation
9       Algorithm for Page Size Percentage
10      Algorithm for Re-Ranking
11      Algorithm for Source Generation
12      Algorithm for Detaging
13      Algorithm for Frequency and Distance Calculation
LIST OF ABBREVIATIONS

S. No   Acronym   Expansion
1       SERP      Search Engine Result Page
2       WWW       World Wide Web
3       JVM       Java Virtual Machine
4       IR        Information Retrieval
5       URL       Uniform Resource Locator
6       CF        Collaborative Filter
7       TF        Term Frequency
8       TDF       Inverse Term Document Frequency
9       QMP       Query Match Percentage
10      IRP       In-Link Referral Percentage
11      SP        Page Size Percentage
12      MP        Mean Percentage
1. INTRODUCTION

The Internet is revolutionizing and enhancing the way we communicate, both
locally and around the globe. Simply put, the Internet is a network of linked
computers that allows participants to share the information stored on those
computers. It is presently experiencing a tremendous rate of growth. Once
reserved for the military, then scientists, the government and universities,
this global computer network now also acts as a conduit of information for
private business and personal use. A search engine is a program used to
retrieve the information stored on various systems based on a user query. Due
to the growth of the World Wide Web, i.e. the increase in the number of web
pages, search engine users are being presented with an enormous number of
results, so retrieving results efficiently from the web has become an issue of
concern. A query on a crawler-based search engine often turns up thousands or
even millions of matching web pages. This project is taken up to overcome this
problem with the ranking used in search engines. Users must be provided with
the most appropriate search results for a given query, and the contents of the
result documents must be analyzed to measure their relevance to the query;
this is done by content analysis. This work aims at re-ranking the search
results of a search engine by performing link analysis and content analysis.
1.1 SEARCH ENGINE
A search engine is a program designed to help find information
available on the World Wide Web. A search engine operates in the following order:
• Web crawling
• Indexing
• Searching
Web Crawling
A web crawler is a program which browses the World Wide Web in a
methodical, automated manner. Web crawlers are mainly used to create a copy of all
the visited pages for later processing by a search engine that will index the
downloaded pages to provide fast searches. Crawlers can also be used for automating
maintenance tasks on a web site, such as checking links or validating HTML code.
Also, crawlers can be used to gather specific types of information from Web pages,
such as harvesting e-mail addresses.
Indexing
The contents of each page are then analyzed to determine how it should
be indexed (for example, words are extracted from the titles, headings, or special
fields called meta tags). Data about web pages is stored in an index database for use in
later queries. Some search engines, such as Google, store all or part of the source page
(referred to as a cache) as well as information about the web pages, whereas
others store every word of every page they find.
Searching
When a user requests a search engine to search for a given query,
typically by giving key words, the engine looks up the index and provides a listing of
best-matching web pages according to its criteria, usually with a short summary
containing the document's title and sometimes parts of the text.
1.2 INFORMATION RETRIEVAL
The term "information retrieval" was coined by Calvin Mooers in
1948-50. Information retrieval (IR) is the science of searching for information in
3
documents, searching for documents themselves, searching for metadata which
describe documents, or searching within databases, whether relational stand-alone
databases or hypertext networked databases such as the Internet or intranets, for text,
sound, images or data. Though the areas of data retrieval, document retrieval,
information retrieval, and text retrieval, fall under the category of information
processing each of these have their own bodies of literature, theory, praxis and
technologies.
1.3 INFORMATION FILTER
Information filters are in general used to screen out unnecessary
information published over the net. Such filters are also used in search
engines to filter the irrelevant search results out of the Search Engine
Result Pages (SERPs). These filters utilize the indexed information and
extract the needed data or pages based on the query. The different types of
filters are:
• Collaborative filter
• Link filter
• Content filter
Collaborative Filter
Collaborative filtering (CF) is the method of making automatic
predictions (filtering) about the interests of a user by collecting feedback
about a web document from many users (collaborating). The underlying
assumption of the CF approach is that those who agreed in the past tend to
agree again in the future. Collaborative filtering systems usually take two
steps:
1. Look for users who share the same rating patterns with the active
user (the user whom the prediction is for).
2. Use the ratings from those like-minded users found in step 1 to
calculate a prediction for the active user.
Note that these predictions are specific to the user but use
information gleaned from many users. This differs from the simpler
approach of giving an average (non-specific) score for each item of
interest, for example based on its number of votes.
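
As an illustration of these two steps, here is a minimal Java sketch; the rating data model and the class and method names are hypothetical, and our system itself does not implement collaborative filtering:

import java.util.*;

// Minimal user-based collaborative filtering sketch (hypothetical data model).
public class CollaborativePredictor {

    // Step 1: similarity between two users' item-rating maps (cosine style).
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) dot += e.getValue() * rb;
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0)
                ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Step 2: predict the active user's rating for an item as a
    // similarity-weighted average of like-minded users' ratings.
    static double predict(Map<String, Double> active,
                          List<Map<String, Double>> others, String item) {
        double num = 0, den = 0;
        for (Map<String, Double> other : others) {
            Double rating = other.get(item);
            if (rating == null) continue;
            double sim = similarity(active, other);
            num += sim * rating;
            den += Math.abs(sim);
        }
        return den == 0 ? 0 : num / den;
    }
}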
Content Filter
A content filter is used where the indexing of web pages is based
on the information present in the pages. With the query as the criterion,
relevant pages are searched for in the index (the index is supposed to hold
the keywords present in the web pages). The pages that contain the keywords
of the query are retrieved, and the ones that do not are filtered out.
Link Filter
Unlike the content filter, the link filter uses the hyperlinks of
web pages to retrieve the relevant pages. Here the index contains the links
of the web pages. Hyperlinks provide a valuable source of information for web
information retrieval, so based on the query the relevant hyperlinks are
searched for, rather than the content of those pages. Hyperlink analysis is
done to find the needed web pages, and the rest of the results are avoided.
1.4 RE-RANKING
Most search engines rank web pages based on their relevance to the
user query, so as to present the results in decreasing order of relevance.
Re-ranking is the process of rearranging the ranked search results based on
some predefined criteria.
1.5 LITERATURE SURVEY
Information retrieval (IR) systems help users retrieve
information that is relevant or close to their information needs. The basic
operations of typical information retrieval systems can be grouped into two main
categories: indexing and matching (or retrieval). The purpose of the indexing
process is to identify the most descriptive words occurring in a text. After the
elimination of the stop words and the identification of the unique stems of the
remaining words, the term frequency (tf) and the inverse document frequency (tdf)
of each unique stem are calculated. Each document is then described by a set of
keywords along with their tf and tdf values.
The aim of the query matching process is to arrive at a list of
documents ranked in decreasing order of relevance to a given query. When a
content-based query, expressed in natural language, is submitted to the search
system, it undergoes a process similar to the indexing process. Both documents
and queries can be represented as weighted vectors where each term's weight is
usually a combination of tf and tdf. In this case the similarity between
documents and queries is based on the vector space model [2]. In this model
the documents and queries are viewed as n-dimensional vectors, where n
corresponds to the number of unique index terms. The similarity between a
query q and a document d is computed by the cosine measure. In the work of
Fotis [3], a new model for estimating the relevance of documents to user
requests is presented; in this model, information retrieval and information
extraction are integrated for effective retrieval in the domain of calls for
papers.
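
For reference, the cosine measure between a query q and a document d takes the standard vector space form (a textbook formulation, not reproduced verbatim from [2]):

sim(q,d) = \frac{\sum_{i=1}^{n} w_{i,q}\, w_{i,d}}{\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}\; \sqrt{\sum_{i=1}^{n} w_{i,d}^{2}}}

where w_{i,q} and w_{i,d} are the weights (usually combinations of tf and tdf) of term i in the query and the document respectively, and the sums run over the n unique index terms.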
The network structure of a hyperlinked environment can be a rich
source of information about the content of the environment, provided there
exists an effective means for understanding it. The work of Kleinberg [1]
focuses on the use of links for discovering the most "authoritative" pages.
Jingyu Hou focuses on finding relevant web pages from linkage information [7].
The analysis of the hyperlink structure of the web has led to significant
improvements in web information retrieval. In the article by Monika Henzinger,
link analysis algorithms are studied [8].
The World Wide Web has developed fast, and many people use search
engines to capture information from the web. A survey by RealNames [4] reports
that 44% of users are frustrated by navigation and search engine use. Current
search engines build indices mostly based on keyword occurrences and frequency
and use these indices for query negotiation.
Internet search engines were in use before the emergence and growth
of the web. The first pre-web search engine was Archie, which allowed keyword
searches of a database of names of files available via FTP. The first robot
and search engine of the web was Wandex [5], which was developed by Matthew
Gray in 1993. Early search engines were designed based on traditional
information retrieval methods: AltaVista and Excite made huge centralized
indices of web pages, and to answer a query they simply retrieved results from
their indexed databases and showed the cached pages based on keyword
occurrences and proximity. While traditional indexing models have been
successful for databases, it became clear that these methods are not
sufficient for a tremendously unstructured information resource such as the
web. The completeness of the index is not the only factor in the quality of
search results. In order to increase the quality of search, Google made an
innovative ranking system for the entire web: PageRank used the citation graph
of the web, and Google thereby introduced link analysis into search engine
systems [6]. Other efforts have been made to customize and specialize search
tools.
1.6 ORGANIZATION OF THE REPORT
This report is divided into five chapters. The first chapter gives an
introduction to search engines, information retrieval, information filters and
re-ranking, followed by the literature survey and the organization of the
report. Chapter 2 deals with the problem definition and the methodology of the
project. Chapter 3 deals with the requirements analysis and specifications,
resource requirements, architectural and detailed design, implementation and
testing. Chapter 4 deals with applications of the project and the future
enhancements that can be made to it. Chapter 5 concludes the report.
2 PROBLEM DEFINITION
The search engine uses various techniques of information retrieval
to retrieve the required pages from the WWW. Due to the immense growth of the
WWW, the number of web pages has increased to a large extent. Though
advantageous, this has increased the size of the web, and the impact is felt
in the volume of results returned by search engines. Though many filtering
techniques are in use, reducing the number of SERPs has become practically
impossible. Re-ranking the results of the search engine is a possible solution
to this issue: it will not reduce the number of SERPs, but it will push the
most relevant results to the top, so that users are provided with the
appropriate results in the first SERP.

The objective of this project is to re-rank the search results and present
the results most relevant to the given query. The software takes up the query,
retrieves the results from a search engine and re-ranks them based on criteria
specific to the web pages. Ranking is done using two different methodologies,
namely the link filter and the content filter. Several attributes related to
each web page are calculated, the page is given a rank based on them, and the
results produced by the search engine are then re-ranked according to this
rank and displayed through the user interface.
3 DEVELOPMENT PROCESS
3.1 Requirement Analysis and Specifications
The requirement engineering process consists of feasibility study,
requirements elicitation and analysis, requirements specification, requirements
validation and requirements management. Requirement elicitation and analysis
is an iterative process that can be represented as a spiral of activities namely
requirements discovery, requirements classification and organization,
requirements negotiation and requirements documentation.
3.1.1 Input Requirements
The input for the LC coupled filter will be obtained from the user.
The user will be provided with an input area where he or she can enter the
search query.
3.1.2 Output Requirements
The output will be the results of the link and content analysis. The search
results will be displayed with the help of a tab control in a table structure.
Each table contains a detailed report of the link and content filtering
process. The links retrieved will be displayed along with the webpage size,
the query match of the link, the in-link referral, and the frequency and
distance values of the webpage. Finally, the results of the link and content
analysis will be represented graphically in the form of a line graph.
3.1.3 Functional Requirements
The information present in the search engine will be retrieved via the
Internet based on the user's query. The user should be able to specify his or
her own input query to be searched and retrieved. The user must be able to
view the results and compare the search results with the help of the graph.
3.2 Resource Requirements
Hardware (minimum)
The minimum hardware requirements for this project are listed in Table 1.

Hardware          Required
Processor Speed   700 MHz
RAM               64 MB
Hard Disk Space   5 MB
Connectivity      A modem and an Internet account to access the World Wide Web

Table 1: Hardware Resource Requirement Table
Software
The minimum software requirements for this project are listed in Table 2.

Software                Required
Operating System        Windows 2000/XP
Runtime Package         Java Virtual Machine (JVM)
Development Front-End   NetBeans 4.1

Table 2: Software Resource Requirement Table
3.3 Design
3.3.1 Architectural Design
The user needs to input the text in the text box available in
the user interface. The user query will be checked for stop words, and the
processed query will be sent to the WWW; the results from the search engine
will then be obtained for the given user query.
Figure 1: Architecture Diagram
The links of the webpages will be obtained from the search results and the web
page will be analyzed based on the link and content of the webpage. After
analyzing the webpage the results will be displayed in the user interface.
The user query will be obtained from the user and the stopwords
will be removed if any. Then the query will be sent to the www and the search
results will be obtained from the search engine. The links available in the search
results will be extracted and link analysis will be done. After the link analysis
the web page content will be retrieved using the link address and detaging
process will be carried out. After removing the HTML tags the content of the
web page will be analyzed and a statistical report will be displayed in the user
interface.
[Figure 1 depicts the architecture: the user query passes from the User Interface through the Query Weightage stage to the Search Engine on the WWW, and the results return through Link & Content Filtering to the User Interface.]
Figure 2: Flow Diagram (User Interface)
3.3.2 Detailed Design
3.3.2.1 User interface
The user interface has been developed with utmost care so
that all types of users, from novices to experts, can use it. The interface
consists of a text input area where the user gives the input query. On
processing the user query, the output is displayed in a tab pane, with
different tabs presenting different levels of the analysis. This interface is
given in Appendix B, Screen Shot 2.
[Figure 2 depicts the flow: User Query → Query Weightage → Search Engine (WWW) → Link Filtering → Detag HTML (URL / text file) → Content Analysis and Filtering → Re-Ranking → Display Output in the User Interface.]
Search Results
The search results tab displays the results obtained from the
Google search engine. The page size and webpage link are also displayed, in
the rank order of the search engine.

Link Filter Analysis
The results of the link filter analysis are displayed in a table
structure with the query match, page size and in-link referral details. The
links are re-ranked based on the calculated percentages.

Content Filter Analysis
As a result of the content filter analysis, the webpage details are
displayed in a table structure containing the frequency of the query terms and
the distance between them.

Detailed Analysis
The results of the link and content analysis are analyzed and
represented graphically. Graphs are generated based on the search engine
results and the system-generated values.
3.3.2.2 Module description
The LC Coupled Filter ranks the links so as to provide the user
with only the relevant links. This is accomplished by performing various
tasks, which take form as modules. The modules identified are:
Link Filter:
This is one of the decisive parts of the filter. The link filter is
used to obtain the relevant links from the results provided by the search
engine; with the hyperlinks and hypertext as input, it ranks the links based
on three criteria.
Link and Anchor retrieval
Here the results provided by the search engine are retrieved, and
from those results the hyperlinks and hypertext are extracted. The results of
the search are shown for user reference.
Query Match
This operation finds whether the query is present in the hypertext
retrieved from the search engine results. The number of words matched in the
hypertext is divided by the number of words in the query, which gives the
query match ratio.
Page Size Calculation
This takes the hyperlinks of the search results as input, connects
to each hyperlink individually, gets the webpage source, and extracts only
the contents, stripping off the waste material that unnecessarily adds to the
page size.
Page weighting
The operations described above filter out the superfluous links;
the remaining links are then ranked based on the size of the page, the query
match and the in-link referrals.
Content Filter:
The unique feature of the LC Coupled Filter is that it filters
using both the content and the link filter. The content filter gets the
filtered links from the link filter as input and processes them based on the
content present in those links.
Detag
The detaging process retrieves the HTML code of the web
page and removes the HTML tags present in it, thereby keeping only the
content present in the web page.
Query Weightage:
The query given by the user in the user interface is not sent to
the search engine as such, but is pruned first. This pruning improves
efficiency and ameliorates the relevance between the query and the documents.
Get search query
The query given is retrieved from the user interface for further
processing and sending to the web.
Removal of stop word
Stop words, such as the most common prepositions and other
frequently occurring function words, are searched for and removed.
User Interface
All the processing described above takes shape on the screen
through the user interface provided. It bridges to the user while masking the
strenuous work being done by the filter.
Actual Result
This pane of the user interface gives the user the privilege of
seeing the actual results obtained from the search engine, for comparison
with the filtered results.
Link Based
This pane gives the results that are link filtered.
Content Based
This pane gives the results that are content filtered.
3.4 Implementation
Our main task is to enhance the relevance of the retrieved
information to the user query. Retrieval based on indexing alone might not be
appropriate to the user's query; enhancement of the search results therefore
involves analysis of both the link and the webpage content.

Each webpage has its own parameters which can be used to calculate its
weightage, and each webpage can be weighed based on its link or its content.
Before the webpage is analyzed, the user query needs to be filtered for
stopwords.
User Query Weighing:
The input given by the user is passed to a stopword remover to
eliminate the stopwords. A list of the most commonly used stopwords is
maintained and compared against the user query. The output of the stopword
remover becomes the input query for the search engine.
Step 1: Input the query from the user.
Step 2: Pass the query to the stopword_remover().
Step 3: Compare with the list of stopwords.
Step 4: Rephrase the input query.
Figure 3: Algorithm for Stopword Removal
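
A minimal Java sketch of this algorithm follows; the stopword list shown is illustrative, standing in for the list maintained by the system, and stopwordRemover() corresponds to the stopword_remover() function referred to above:

import java.util.*;

public class StopwordRemover {
    // Illustrative stopword list; the system maintains its own, larger list.
    private static final Set<String> STOPWORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "of", "in", "on", "for", "to", "and", "is"));

    // Steps 2-4: compare each word against the list and rephrase the query.
    public static String stopwordRemover(String query) {
        StringBuilder rephrased = new StringBuilder();
        for (String word : query.trim().split("\\s+")) {
            if (!STOPWORDS.contains(word.toLowerCase())) {
                rephrased.append(word).append(' ');
            }
        }
        return rephrased.toString().trim();
    }
}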
Link Filter Analysis:
With the user input query, the search results must be obtained and the links
processed. The following stream of processes leads to the ranked results:
• Query Match Calculation
• Query Match Percentage (QMP)
• In-link Referral Calculation
• In-Link Referral Percentage (IRP)
• Page size calculation
• Page Size Percentage (SP)
• Re-ranking
Query Match Calculation
The anchor link text will be processed for the presence of the user
query. The anchor text will be compared with the input query, and the strength
of the link will be calculated. The user query may contain more than one
keyword, and the anchor text may not contain all of those keywords. The
algorithm is given in figure 4.
Step 1: Get the input query from the user.
Step 2: Get the anchor text from the retrieved links.
Step 3: Compare anchor text with the user query.
Step 4: Check whether all words in user query are present in anchor text.
Step 5: Display the query matched value for the respective link.
Figure 4: Algorithm for Query Match
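
A sketch of this comparison in Java; substring containment is used here as a simple stand-in for whatever word matching the system actually performs:

// Query match ratio: number of query words found in the anchor text
// divided by the total number of query words (Steps 1-5 above).
static double queryMatchRatio(String query, String anchorText) {
    String[] queryWords = query.toLowerCase().split("\\s+");
    String anchor = anchorText.toLowerCase();
    int matched = 0;
    for (String word : queryWords) {
        if (anchor.contains(word)) matched++;  // simple containment check
    }
    return (double) matched / queryWords.length;
}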
Query Match Percentage (QMP)
The query match ratio of respective links is taken and divided by the
maximum query match value among all the links and converted into percentage,
thereby determining the QMP value of a particular link. The algorithm is given
in figure 5.
Step 1: Get the query match ratio of a particular link.
Step 2: Calculate the maximum query match ratio among all the links.
Step 3: Convert it to percentage using the formula,
QMP = (Current QMR / Maximum QMR) * 100
Figure 5: Algorithm for Query Match Percentage
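
Since QMP here, IRP (figure 7) and SP (figure 9) all normalize a per-link value against the maximum over all links, a single Java helper can illustrate all three computations (the method name is ours, not the system's):

// Normalize a link's raw value against the maximum over all links,
// expressed as a percentage; usable alike for QMP, IRP and SP.
static double toPercentage(double current, double[] allValues) {
    double max = 0;
    for (double v : allValues) {
        if (v > max) max = v;
    }
    return max == 0 ? 0 : (current / max) * 100.0;
}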
In-link referral
The links obtained from the search engine might be referred to by
other webpages. The in-link referrals are obtained from the Google search
engine by submitting the retrieved link concatenated with "link:". Calculating
the in-link referrals for each link contributes to the weightage of the
webpage. The algorithm is given in figure 6.
Step 1: Retrieve the link from the search engine.
Step 2: Concatenate “link:” to the obtained link.
Step 3: Pass the concatenated address to the Google search engine.
Step 4: Calculate the in-link referrals.
Figure 6: Algorithm for In-link Referral
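
A Java sketch of the query construction; countResults() is a hypothetical stand-in for the system's code that submits the query and parses the reported hit count:

// Steps 1-4: build the "link:" query for a retrieved link and count
// the in-link referrals reported by the search engine.
static int inLinkReferrals(String pageUrl) {
    String linkQuery = "link:" + pageUrl;   // Step 2: concatenate "link:"
    return countResults(linkQuery);         // Steps 3-4: submit and count
}

// Hypothetical helper: submits a query to the search engine and
// returns the number of results (project-specific parsing omitted).
static int countResults(String query) {
    return 0; // placeholder
}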
In-Link Referral Percentage (IRP)
The in-link referral of respective links is taken and divided by the
maximum in-link referral value among all the links and converted into
percentage, thereby determining the IRP value of a particular link. The
algorithm is given in figure 7.
Step 1: Get the in-link referral of a particular link.
Step 2: Calculate the maximum in-link referral among all the links.
Step 3: Convert it to percentage using the formula,
IRP = (Current In-link Referral / Maximum In-link Referral) * 100
Figure 7: Algorithm for In-link Referral Percentage
Page size calculation
The content of the webpage will be obtained from the WWW and then
detaged. During the detaging process the HTML tags will be removed and only
the contents will be retained. After the detaging process the original size of
the webpage will be found. The algorithm is given in figure 8.
Step 1: Get the source of the webpage using the retrieved link.
Step 2: Remove the HTML tags.
Step 3: Find the size of the detaged web page content.
Figure 8: Algorithm for Page Size Calculation
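
A minimal Java sketch of this measurement, assuming a simple regex-based tag strip (the detaging step itself is elaborated in figure 12):

// Steps 1-3: strip the HTML tags from the page source and measure
// the size of the remaining content in bytes.
static int detaggedPageSize(String htmlSource) {
    String content = htmlSource.replaceAll("<[^>]*>", "");
    return content.getBytes().length;
}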
Page Size Percentage (SP)
The page size of respective links is taken and divided by the maximum
page size value among all the links and converted into percentage, thereby
determining the SP value of a particular link. The algorithm is given in figure 9.
Step 1: Get the page size of a particular link.
Step 2: Calculate the maximum page size among all the links.
Step 3: Convert it to percentage using the formula,
SP = (Current Page Size / Maximum Page Size) * 100
Figure 9: Algorithm for Page Size Percentage
Re-Ranking
With the link parameters calculated, the web pages will be re-ranked.
The algorithm is given in figure 10.
Step 1: Get the result of Link Analysis.
Step 2: Calculate the mean percentage of the parameters for each webpage using the formula,
Mean Percentage (MP) = (QMP + IRP + SP) / 3
Step 3: Sort the webpages in descending order of mean percentage and re-rank accordingly.
Figure 10: Algorithm for Re-Ranking
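
A Java sketch of steps 2 and 3; the LinkScore holder class is illustrative, not the system's actual data structure:

import java.util.*;

// Holds the three computed percentages for one link.
class LinkScore {
    String url;
    double qmp, irp, sp;

    // Step 2: Mean Percentage (MP) = (QMP + IRP + SP) / 3
    double meanPercentage() {
        return (qmp + irp + sp) / 3.0;
    }
}

class ReRanker {
    // Step 3: sort the links in descending order of mean percentage.
    static void rerank(List<LinkScore> links) {
        Collections.sort(links, new Comparator<LinkScore>() {
            public int compare(LinkScore a, LinkScore b) {
                return Double.compare(b.meanPercentage(), a.meanPercentage());
            }
        });
    }
}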
Content Filter Analysis
Another method of weighing a webpage is to analyze the content of that
particular webpage. Content analysis covers the frequency of the keywords, the
distance between them, and the page size, and proceeds through the following
steps:
• Source Generation.
• Detaging.
• Frequency and Distance Calculation.
Source Generation
Using the links retrieved from the search engine, the webpage content
will be obtained. The source is obtained along with the images, programming
tags and other design-related content. The algorithm is given in figure 11.
Step 1: Get the retrieved webpage link.
Step 2: Connect to the World Wide Web (WWW) through the Internet.
Step 3: Retrieve the content of the webpage using the specified link address.
Figure 11: Algorithm for Source Generation
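
A Java sketch of this retrieval using java.net.URL; the system's own retrieval code may differ:

import java.io.*;
import java.net.*;

public class SourceGenerator {
    // Steps 1-3: open a connection to the link address and read back
    // the raw HTML source of the webpage.
    static String fetchSource(String link) throws IOException {
        URL url = new URL(link);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        StringBuilder source = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            source.append(line).append('\n');
        }
        in.close();
        return source.toString();
    }
}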
Detaging
The obtained content of a webpage contains all the HTML tags and
images. In the detaging process, all the HTML tags that occur between angle
brackets are identified and removed. The algorithm is given in figure 12.
Step 1: Get the webpage content.
Step 2: Check for HTML tags.
Step 3: Remove the tags.
Step 4: Save the content of the webpage.
Figure 12: Algorithm for Detaging
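
A simple regex-based Java sketch of detaging; it gives scripts and comments no special treatment:

// Steps 1-4: remove everything between angle brackets and collapse
// the remaining whitespace, keeping only the visible page content.
static String detag(String html) {
    return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
}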
Frequency and Distance Calculation
The detaged web content will be compared with the user query, and the
frequency of occurrence of the keywords will be calculated. While determining
the frequency, the distance between the keywords, if any, will also be
checked. The algorithm is given in figure 13.
Step 1: Get the user input query.
Step 2: Get the detaged content of the webpage.
Step 3: Compare the webpage content and user query and find the occurrence of the keywords.
Step 4: Find the distance between the keywords, if any.
Step 5: Tabulate the results.
Figure 13: Algorithm for Frequency and Distance Calculation
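
A Java sketch of this calculation; in line with the analysis described below, when a keyword occurs more than once the distance is taken as the mean of the gaps between successive occurrences:

import java.util.*;

public class ContentAnalyzer {
    // Steps 1-5: count keyword occurrences in the detaged content and
    // compute the mean word distance between successive occurrences.
    static int[] frequencyAndMeanDistance(String content, String keyword) {
        String[] words = content.toLowerCase().split("\\s+");
        String key = keyword.toLowerCase();
        List<Integer> positions = new ArrayList<Integer>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(key)) positions.add(i);  // exact token match
        }
        int frequency = positions.size();
        int totalGap = 0;
        for (int i = 1; i < positions.size(); i++) {
            totalGap += positions.get(i) - positions.get(i - 1);
        }
        int meanDistance = frequency > 1 ? totalGap / (frequency - 1) : 0;
        return new int[] { frequency, meanDistance };
    }
}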
Statistical Analysis:
GRAPH 1: Link Filter Analysis
The query match between the user query and each Google-extracted link
is calculated, along with the referral links for each link. Based on the
referral links and the query match, the links are re-ranked, as shown in
Appendix C, Screen Shot 1.
GRAPH 2: Content Filter Analysis
The frequency of the user-specified query is calculated in each
retrieved web page, and the distance between the keywords is also calculated
where applicable. When keywords occur more than once, the distance is taken
as the mean of the distances, as shown in Appendix C, Screen Shot 2.
GRAPH 3: Page Size Analysis
This graph compares the page size reported by Google with the actual
page size of the webpage after filtering out the tags (images, links,
advertisements, etc.). The actual size of a web page is less than or equal to
the Google-reported size, as shown in Appendix C, Screen Shot 3.
3.5 Testing
Testing is the process of evaluating the correctness, quality and
completeness of the system developed. The majority of the testing activity can
be carried out by writing test cases for the system. A test case can be a
single step, or a series of steps, along with the expected result or outcome,
prerequisite states or steps, and descriptions. A collection of test cases is
called a test suite or test plan. The test suite for our system is as follows.
1. UTC_LCFilter_001
Test Case Description: Execute the LC Filter User Interface
Module: Query Weightage Module
Action: The input query is filtered by the function Stopword_remover() and then passed as input to the search engine.
Account/Data Criteria: The input query will be obtained from the user.
Result: PASS
Table 3: UTC_LCFilter_001
2. UTC_LCFilter_002
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: The links will be obtained from the search engine based on the user query.
Account/Data Criteria: The user query will be passed as input to the search engine.
Result: PASS
Table 4: UTC_LCFilter_002
3. UTC_LCFilter_003
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: The obtained link query will be compared with the user query.
Account/Data Criteria: The link obtained from the search engine.
Result: PASS
Table 5: UTC_LCFilter_003
4. UTC_LCFilter_004
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: In-link referrals for each obtained link will be calculated.
Account/Data Criteria: The links obtained from the search engine.
Result: PASS
Table 6: UTC_LCFilter_004
5. UTC_LCFilter_005
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: The page size of each webpage will be calculated.
Account/Data Criteria: The webpage content representing each link. The webpage size is calculated by filesize().
Result: PASS
Table 7: UTC_LCFilter_005
6. UTC_LCFilter_006
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: The links will be re-ranked based on the link parameters.
Account/Data Criteria: The links will be rearranged by rerank().
Result: PASS
Table 8: UTC_LCFilter_006
7. UTC_LCFilter_007
Test Case Description: Execute the LC Filter User Interface
Module: Content Filter Module
Action: The frequency of the user query will be calculated.
Account/Data Criteria: Webpage content of the obtained links. The frequency will be calculated by content_filtering().
Result: PASS
Table 9: UTC_LCFilter_007
8. UTC_LCFilter_008
Test Case Description: Execute the LC Filter User Interface
Module: Content Filter Module
Action: The distance between the keywords will be calculated.
Account/Data Criteria: Webpage content and the frequency of the keyword.
Result: PASS
Table 10: UTC_LCFilter_008
9. UTC_LCFilter_009
Test Case Description: Execute the LC Filter User Interface
Module: User Interface Module
Action: The input from the user will be obtained, as shown in Appendix B, Screen Shot 1.
Account/Data Criteria: The input text should not be empty.
Result: PASS
Table 11: UTC_LCFilter_009
10. UTC_LCFilter_010
Test Case Description: Execute the LC Filter User Interface
Module: User Interface Module
Action: The results of the link filter analysis will be displayed in the table structure, as shown in Appendix B, Screen Shots 2 & 3.
Account/Data Criteria: In case of unavailability of the data, it will be specified.
Result: PASS
Table 12: UTC_LCFilter_010
11. UTC_LCFilter_011
Test Case Description: Execute the LC Filter User Interface
Module: User Interface Module
Action: The result of the content filter analysis will be displayed in a table structure, as shown in Appendix B, Screen Shot 4.
Account/Data Criteria: The result values obtained will be displayed.
Result: PASS
Table 13: UTC_LCFilter_011
4 APPLICATIONS AND FUTURE ENHANCEMENTS
This software can be used for comparing documents, ranking documents,
and similar tasks. It can also be enhanced in various ways. Future
enhancements of the project include the following:

• Optimization of Internet Connectivity:
Execution time and speed are the two criteria to be improved; the
connectivity to the Internet affects the performance and speed of our system.

• Effective Search Results:
We have focused on re-ranking the search results of one search
engine. This can be upgraded to hybrid search result re-ranking by combining
the search results of more than one search engine.

• Personalization:
Personalization can be considered as an issue for future
enhancement: user-specific search results based on the past search history of
a particular user would provide improved, user-specific results.
5 CONCLUSION
The growth of the Internet is so tremendous that, for whatever keyword
the user enters (intelligible or not), a voluminous number of results is
returned, and the user has to pick out the ones needed. Search engines accept
user queries and provide results based on them, but these results may not meet
the user's expectations. In this project, the results provided by the search
engine are retrieved, various attributes related to each page in the results
are computed, and the pages are ranked based on these criteria. The links are
re-ranked based on the rankings of the content and link filters independently.
The use of this software enhances the relevance between the search query and
the documents, both in terms of links and in terms of content. The results
produced are found to be more effective than the actual results produced by
the search engines.
REFERENCES
1. Jon M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment",
Journal of the ACM, Vol 46, No. 5, September 1999, pp 604-632.
2. Salton G., Wong A., Yang C. S., "A Vector Space Model for Automatic
Indexing", Communications of the ACM, Vol 18, No. 11, 1975, pp 613-620.
3. Fotis Lazarinis, "Combining Information Retrieval with Information Extraction
for Efficient Retrieval of Calls for Papers", Proceedings of IRSG98, Grenoble,
France, Electronic Workshops in Computing, 1998.
4. Sullivan D., "Survey Reveals Search Habits – The Search Engine Report",
http://www.searchenginewatch.com/sereport/00/06-realnames.html
5. Wall A., "History of Search Engines and Web History",
http://www.search-marketing.info/search-engine-history/
6. Brin S., Page L., "The Anatomy of a Large-Scale Hypertextual Web Search
Engine", Proceedings of the 7th International WWW Conference, Brisbane,
Australia, 1998, pp 107-117.
7. Jingyu Hou, Yanchun Zhang, "Effectively Finding Relevant Web Pages from
Linkage Information", IEEE Transactions on Knowledge and Data Engineering,
Vol 15, No. 4, July-August 2003.
8. Monika R. Henzinger, "Hyperlink Analysis for the Web", IEEE Internet
Computing, January-February 2001, pp 45-50.
9. Deitel & Deitel, "Java How to Program", Third Edition, Prentice Hall.
10. http://developers.sun.com/prodtech/
11. http://www.google.co.in/about.html
Appendices
Appendix A – Screenshots of installation process
Appendix B - Screenshots of various functionalities of the software.
Screen Shot 1: Loading Screen
Screen Shot 2: Initial User Interface
Screen Shot 3: Google Search Results
Screen Shot 4: Link Filter Table 1 [Link Parameter Values]
Screen Shot 5: Link Filter Table 2 [Link Analysis Ranking]
Screen Shot 6: Content Filter Table [Content Parameter Values]
Appendix C - Screenshots of Graphs generated.
Screen Shot 1: Link Analysis Graph
Screen Shot 2: Content Filter Graph
Screen Shot 3: Page Size Graph