LC COUPLED FILTER
- A TECHNIQUE FOR INFORMATION RETRIEVAL
A PROJECT REPORT
Submitted by
S. ABDUL LATHEEF
S. MURELI
R. PADMANABHAN
R. RAVISHANKAR
in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
B.S. ABDUR RAHMAN CRESCENT ENGINEERING COLLEGE,
CHENNAI 600 048
ANNA UNIVERSITY: CHENNAI 600 025
APRIL 2006
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report “LC COUPLED FILTER - A TECHNIQUE FOR
INFORMATION RETRIEVAL” is the bonafide work of “S.ABDUL LATHEEF
(40402205001), S.MURELI (40402205051), R.PADMANABHAN (40402205057),
R.RAVISHANKAR (40402205066)” who carried out the project work under my
supervision.
SIGNATURE
Prof. S. P. REDDY
HEAD OF THE DEPARTMENT
Department of Information Technology
B.S.A. Crescent Engineering College,
Seethakathi Estate, Vandalur,
Chennai – 600 048.

SIGNATURE
Ms. ANGELINA GEETHA
SUPERVISOR
Asst. Professor
Department of Information Technology
B.S.A. Crescent Engineering College,
Seethakathi Estate, Vandalur,
Chennai – 600 048.
ANNA UNIVERSITY: CHENNAI 600 025
VIVA VOCE EXAMINATION
The viva-voce examination of the following students, who have submitted
the project report "LC COUPLED FILTER - A TECHNIQUE FOR
INFORMATION RETRIEVAL", is held on ________________.
S. ABDUL LATHEEF (40402205001)
S. MURELI (40402205051)
R. PADMANABHAN (40402205057)
R. RAVISHANKAR (40402205066)
INTERNAL EXAMINER EXTERNAL EXAMINER
ACKNOWLEDGEMENT
We express our deepest gratitude to Prof. S. PEER MOHAMMED,
Correspondent, B.S.A. Crescent Engineering College, for his invaluable guidance
and blessings.
We are grateful to our Principal, Dr. V. M. PERIASAMY, for providing us with an
environment in which to carry out our course successfully.
We are deeply indebted to our beloved Head of the Department, Prof. S. P.
REDDY, who moulded us both technically and morally for achieving greater success in
life.
We record our sincere thanks to Prof. A. PANNEERSELVAM, Professor and
Dean, Department of Information Technology, for his constant encouragement and
support throughout our course, especially for the useful suggestions given during the
project period.
We express our thanks to the project coordinators Ms. G. KAVITHA, Lecturer
and Ms. LATHA TAMILSELVAN, Asst. Professor, Department of Information
Technology, for their valuable suggestions during our project reviews.
We express our sincere thanks to our guide Ms. ANGELINA GEETHA, Asst.
Professor, Department of Information Technology, for being instrumental in the
completion of our project with her exemplary guidance.
We thank all the staff members of our department who helped us at various
levels for the successful completion of our project.
We take this opportunity to extend our deep appreciation to our family and
friends for all that they meant to us during the crucial times of our project completion.
TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
  1.1 SEARCH ENGINE
  1.2 INFORMATION RETRIEVAL
  1.3 INFORMATION FILTER
  1.4 RE-RANKING
  1.5 LITERATURE SURVEY
  1.6 ORGANIZATION OF REPORT
2 PROBLEM DEFINITION
3 DEVELOPMENT PROCESS
  3.1 REQUIREMENT ANALYSIS AND SPECIFICATIONS
    3.1.1 Input Requirements
    3.1.2 Output Requirements
    3.1.3 Functional Requirements
  3.2 RESOURCE REQUIREMENTS
    3.2.1 Hardware
    3.2.2 Software
  3.3 DESIGN
    3.3.1 Architectural Design
    3.3.2 Detailed Design
      3.3.2.1 User Interface
      3.3.2.2 Module Description
  3.4 IMPLEMENTATION: ALGORITHM AND CODE
  3.5 TESTING
4 APPLICATIONS AND FUTURE ENHANCEMENTS
5 CONCLUSION
REFERENCES
APPENDIX A – SCREENSHOTS OF INSTALLATION PROCESS
APPENDIX B – SAMPLE INTERFACE SCREENS
APPENDIX C – RESULTS AND ANALYSIS
ABSTRACT

Search engines have been widely used since their inception in the form of
"Archie" through to present-day engines such as Google, AltaVista and MSN.
These modern search engines are broadly similar; the small differences that
make each of them unique lie mostly in the technology used for searching and
in the way the retrieved results are filtered. The Internet has grown to such
a level that a single query returns a plethora of results, which overwhelms
the user. The only way to overcome this overload is to filter the results and
weed out redundant links. There are three types of filtering, namely content
filtering, link filtering and collaborative filtering. Our work focuses on
these issues. The search results obtained from the search engine are passed
through a link filter to eliminate duplicate links. They then pass through a
content filter that measures the relevance of each web document to the given
query. Though this brings about a radical reduction in the results, one more
issue has to be addressed: ranking the filtered results. What we propose is a
combination of these filters in the right proportion, together with re-ranking
based on specific parameters, which gives results better than those given by a
standard search engine. An effective user interface is designed for the system
to help users navigate easily. These features will be highly useful for
novices, amateurs and even connoisseurs of the web. The results of our work
are graphically compared with the standard search engine results and are found
to be better.
LIST OF TABLES

S. No   Table Name
1       Hardware Resource Requirement Table
2       Software Resource Requirement Table
3       UTC_LCFilter_001
4       UTC_LCFilter_002
5       UTC_LCFilter_003
6       UTC_LCFilter_004
7       UTC_LCFilter_005
8       UTC_LCFilter_006
9       UTC_LCFilter_007
10      UTC_LCFilter_008
11      UTC_LCFilter_009
12      UTC_LCFilter_010
13      UTC_LCFilter_011
LIST OF FIGURES

S. No   Figure Name
1       Architectural Diagram
2       Flow Diagram
3       Algorithm for Stopword Removal
4       Algorithm for Query Match
5       Algorithm for Query Match Percentage
6       Algorithm for In-link Referral
7       Algorithm for In-link Referral Percentage
8       Algorithm for Page Size Calculation
9       Algorithm for Page Size Percentage
10      Algorithm for Re-Ranking
11      Algorithm for Source Generation
12      Algorithm for Detaging
13      Algorithm for Frequency and Distance Calculation
LIST OF ABBREVIATIONS

S. No   Acronym   Expansion
1       SERP      Search Engine Result Page
2       WWW       World Wide Web
3       JVM       Java Virtual Machine
4       IR        Information Retrieval
5       URL       Uniform Resource Locator
6       CF        Collaborative Filter
7       TF        Term Frequency
8       TDF       Inverse Term Document Frequency
9       QMP       Query Match Percentage
10      IRP       In-Link Referral Percentage
11      SP        Page Size Percentage
12      MP        Mean Percentage
1. INTRODUCTION

The Internet is revolutionizing and enhancing the way we communicate, both
locally and around the globe. Simply put, the Internet is a network of linked
computers that allows participants to share the information stored on those
computers. It is presently experiencing a tremendous rate of growth. Once
reserved for the military, then scientists, the government and universities,
this global computer network now also acts as a conduit of information for
private business and personal use. A search engine is a program used to
retrieve the information stored on various systems based on a user query. Due
to the growth of the World Wide Web, i.e. the increase in the number of web
pages, search engine users are being presented with an enormous number of
results, so retrieving results efficiently from the web has become an issue of
concern. A query on a crawler-based search engine often turns up thousands or
even millions of matching web pages. This project is taken up to overcome this
problem with the ranking used in search engines. Users must be provided with
the most appropriate search results for a given query, and the contents of the
result documents must be analyzed to measure their relevance to the query;
this is done by content analysis. This work aims at re-ranking the search
results of a search engine by performing link analysis and content analysis.
1.1 SEARCH ENGINE
A search engine is a program designed to help find information
available on the World Wide Web. A search engine operates in the following order:
• Web crawling
• Indexing
• Searching
Web Crawling
A web crawler is a program which browses the World Wide Web in a
methodical, automated manner. Web crawlers are mainly used to create a copy of all
the visited pages for later processing by a search engine that will index the
downloaded pages to provide fast searches. Crawlers can also be used for automating
maintenance tasks on a web site, such as checking links or validating HTML code.
Also, crawlers can be used to gather specific types of information from Web pages,
such as harvesting e-mail addresses.
Indexing
The contents of each page are then analyzed to determine how it should
be indexed (for example, words are extracted from the titles, headings, or special
fields called meta tags). Data about web pages is stored in an index database for use in
later queries. Some search engines, such as Google, store all or part of the source page
(referred to as a cache) as well as information about the web pages, whereas
others store every word of every page they find.
Searching
When a user requests a search engine to search for a given query,
typically by giving key words, the engine looks up the index and provides a listing of
best-matching web pages according to its criteria, usually with a short summary
containing the document's title and sometimes parts of the text.
1.2 INFORMATION RETRIEVAL
The term "information retrieval" was coined by Calvin Mooers in
1948-50. Information retrieval (IR) is the science of searching for information in
3
documents, searching for documents themselves, searching for metadata which
describe documents, or searching within databases, whether relational stand-alone
databases or hypertext networked databases such as the Internet or intranets, for text,
sound, images or data. Though the areas of data retrieval, document retrieval,
information retrieval, and text retrieval, fall under the category of information
processing each of these have their own bodies of literature, theory, praxis and
technologies.
1.3 INFORMATION FILTER
Information filters are in general used to screen out unnecessary
information published over the net. Such filters are also used in search
engines to filter the irrelevant search results out of the Search Engine
Result Pages (SERPs). These filters utilize the indexed information and
extract the needed data or pages based on the query. The different types of
filters are:
• Collaborative filter
• Link filter
• Content filter
Collaborative Filter
Collaborative filtering (CF) is the method of making automatic
predictions (filtering) about the interests of a user by collecting feedback
about a web document from many users (collaborating). The underlying
assumption of the CF approach is that those who agreed in the past tend to
agree again in the future. Collaborative filtering systems usually take two
steps:
1. Look for users who share the same rating patterns with the active
user (the user whom the prediction is for).
2. Use the ratings from those like-minded users found in step 1 to
calculate a prediction for the active user.
Note that these predictions are specific to the user but use
information gleaned from many users. This differs from the simpler
approach of giving an average (non-specific) score for each item of
interest, for example based on its number of votes.
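
As an illustration of these two steps, here is a minimal Java sketch; the rating data model and the class and method names are hypothetical, and our system itself does not implement collaborative filtering:

import java.util.*;

// Minimal user-based collaborative filtering sketch (hypothetical data model).
public class CollaborativePredictor {

    // Step 1: similarity between two users' item-rating maps (cosine style).
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) dot += e.getValue() * rb;
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0)
                ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Step 2: predict the active user's rating for an item as a
    // similarity-weighted average of like-minded users' ratings.
    static double predict(Map<String, Double> active,
                          List<Map<String, Double>> others, String item) {
        double num = 0, den = 0;
        for (Map<String, Double> other : others) {
            Double rating = other.get(item);
            if (rating == null) continue;
            double sim = similarity(active, other);
            num += sim * rating;
            den += Math.abs(sim);
        }
        return den == 0 ? 0 : num / den;
    }
}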
Content Filter
A content filter is used where the indexing of web pages is based
on the information present in the pages. With the query as the criterion,
relevant pages are searched for in the index (the index is supposed to hold
the keywords present in the web pages). The pages that contain the keywords
of the query are retrieved, and the ones that do not are filtered out.
Link Filter
Unlike the content filter, the link filter uses the hyperlinks of
web pages to retrieve the relevant pages. Here the index contains the links
of the web pages. Hyperlinks provide a valuable source of information for web
information retrieval, so based on the query the relevant hyperlinks are
searched for, rather than the content of those pages. Hyperlink analysis is
done to find the needed web pages, and the rest of the results are avoided.
1.4 RE-RANKING
Most search engines rank web pages based on their relevance to the
user query, so as to present the results in decreasing order of relevance.
Re-ranking is the process of rearranging the ranked search results based on
some predefined criteria.
1.5 LITERATURE SURVEY
Information retrieval (IR) systems help users retrieve
information that is relevant or close to their information needs. The basic
operations of typical information retrieval systems can be grouped into two main
categories: indexing and matching (or retrieval). The purpose of the indexing
process is to identify the most descriptive words occurring in a text. After the
elimination of the stop words and the identification of the unique stems of the
remaining words, the term frequency (tf) and the inverse document frequency (tdf)
of each unique stem are calculated. Each document is then described by a set of
keywords along with their tf and tdf values.
The aim of the query matching process is to arrive at a list of
documents ranked in decreasing order of relevance to a given query. When a
content-based query, expressed in natural language, is submitted to the search
system, it undergoes a process similar to the indexing process. Both documents
and queries can be represented as weighted vectors where each term's weight is
usually a combination of tf and tdf. In this case the similarity between
documents and queries is based on the vector space model [2]. In this model
the documents and queries are viewed as n-dimensional vectors, where n
corresponds to the number of unique index terms. The similarity between a
query q and a document d is computed by the cosine measure. In the work of
Fotis [3], a new model for estimating the relevance of documents to user
requests is presented; in this model, information retrieval and information
extraction are integrated for effective retrieval in the domain of calls for
papers.
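
For reference, the cosine measure between a query q and a document d takes the standard vector space form (a textbook formulation, not reproduced verbatim from [2]):

sim(q,d) = \frac{\sum_{i=1}^{n} w_{i,q}\, w_{i,d}}{\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}\; \sqrt{\sum_{i=1}^{n} w_{i,d}^{2}}}

where w_{i,q} and w_{i,d} are the weights (usually combinations of tf and tdf) of term i in the query and the document respectively, and the sums run over the n unique index terms.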
The network structure of a hyperlinked environment can be a rich
source of information about the content of the environment, provided there
exists an effective means for understanding it. The work of Kleinberg [1]
focuses on the use of links for discovering the most "authoritative" pages.
Jingyu Hou focuses on finding relevant web pages from linkage information [7].
The analysis of the hyperlink structure of the web has led to significant
improvements in web information retrieval. In the article by Monika Henzinger,
link analysis algorithms are studied [8].
The World Wide Web has developed fast, and many people use search
engines to capture information from the web. A survey by RealNames [4] reports
that 44% of users are frustrated by navigation and search engine use. Current
search engines build indices mostly based on keyword occurrences and frequency
and use these indices for query negotiation.
Internet search engines were in use before the emergence and growth
of the web. The first pre-web search engine was Archie, which allowed keyword
searches of a database of names of files available via FTP. The first robot
and search engine of the web was Wandex [5], which was developed by Matthew
Gray in 1993. Early search engines were designed based on traditional
information retrieval methods: AltaVista and Excite made huge centralized
indices of web pages, and to answer a query they simply retrieved results from
their indexed databases and showed the cached pages based on keyword
occurrences and proximity. While traditional indexing models have been
successful for databases, it became clear that these methods are not
sufficient for a tremendously unstructured information resource such as the
web. The completeness of the index is not the only factor in the quality of
search results. In order to increase the quality of search, Google made an
innovative ranking system for the entire web: PageRank used the citation graph
of the web, and Google thereby introduced link analysis into search engine
systems [6]. Other efforts have been made to customize and specialize search
tools.
1.6 ORGANIZATION OF THE REPORT
This report is divided into five chapters. The first chapter gives an
introduction to search engines, information retrieval, information filters and
re-ranking, followed by the literature survey and the organization of the
report. Chapter 2 deals with the problem definition and the methodology of the
project. Chapter 3 deals with the requirements analysis and specifications,
resource requirements, architectural and detailed design, implementation and
testing. Chapter 4 deals with applications of the project and the future
enhancements that can be made to it. Chapter 5 concludes the report.
2 PROBLEM DEFINITION
The search engine uses various techniques of information retrieval
to retrieve the required pages from the WWW. Due to the immense growth of the
WWW, the number of web pages has increased to a large extent. Though
advantageous, this has increased the size of the web, and the impact is felt
in the volume of results returned by search engines. Though many filtering
techniques are in use, reducing the number of SERPs has become practically
impossible. Re-ranking the results of the search engine is a possible solution
to this issue: it will not reduce the number of SERPs, but it will push the
most relevant results to the top, so that users are provided with the
appropriate results in the first SERP.

The objective of this project is to re-rank the search results and present
the results most relevant to the given query. The software takes up the query,
retrieves the results from a search engine and re-ranks them based on criteria
specific to the web pages. Ranking is done using two different methodologies,
namely the link filter and the content filter. Several attributes related to
each web page are calculated, the page is given a rank based on them, and the
results produced by the search engine are then re-ranked according to this
rank and displayed through the user interface.
3 DEVELOPMENT PROCESS
3.1 Requirement Analysis and Specifications
The requirement engineering process consists of feasibility study,
requirements elicitation and analysis, requirements specification, requirements
validation and requirements management. Requirement elicitation and analysis
is an iterative process that can be represented as a spiral of activities namely
requirements discovery, requirements classification and organization,
requirements negotiation and requirements documentation.
3.1.1 Input Requirements
The input for the LC coupled filter will be obtained from the user.
The user will be provided with an input area where he or she can enter the
search query.
3.1.2 Output Requirements
The output will be the results of the link and content analysis. The search
results will be displayed with the help of a tab control in a table structure.
Each table contains a detailed report of the link and content filtering
process. The links retrieved will be displayed along with the webpage size,
the query match of the link, the in-link referral, and the frequency and
distance values of the webpage. Finally, the results of the link and content
analysis will be represented graphically in the form of a line graph.
3.1.3 Functional Requirements
The information present in the search engine will be retrieved via the
Internet based on the user's query. The user should be able to specify his or
her own input query to be searched and retrieved. The user must be able to
view the results and compare the search results with the help of the graph.
3.2 Resource Requirements
Hardware (minimum)
The minimum hardware requirements for this project are listed in Table 1.

Hardware          Required
Processor Speed   700 MHz
RAM               64 MB
Hard Disk Space   5 MB
Connectivity      A modem and an Internet account to access the World Wide Web

Table 1: Hardware Resource Requirement Table
Software
The minimum software requirements for this project are listed in Table 2.

Software                Required
Operating System        Windows 2000/XP
Runtime Package         Java Virtual Machine (JVM)
Development Front-End   NetBeans 4.1

Table 2: Software Resource Requirement Table
3.3 Design
3.3.1 Architectural Design
The user needs to input the text in the text box available in
the user interface. The user query will be checked for stop words, and the
processed query will be sent to the WWW; the results from the search engine
will then be obtained for the given user query.
Figure 1: Architecture Diagram
The links of the webpages will be obtained from the search results and the web
page will be analyzed based on the link and content of the webpage. After
analyzing the webpage the results will be displayed in the user interface.
The user query will be obtained from the user and the stopwords
will be removed if any. Then the query will be sent to the www and the search
results will be obtained from the search engine. The links available in the search
results will be extracted and link analysis will be done. After the link analysis
the web page content will be retrieved using the link address and detaging
process will be carried out. After removing the HTML tags the content of the
web page will be analyzed and a statistical report will be displayed in the user
interface.
[Figure 1 depicts the architecture: the user query passes from the User Interface through the Query Weightage stage to the Search Engine on the WWW, and the results return through Link & Content Filtering to the User Interface.]
Figure 2: Flow Diagram (User Interface)
3.3.2 Detailed Design
3.3.2.1 User interface
The user interface has been developed with utmost care so
that all types of users, from novices to experts, can use it. The interface
consists of a text input area where the user gives the input query. On
processing the user query, the output is displayed in a tab pane, with
different tabs presenting different levels of the analysis. This interface is
given in Appendix B, Screen Shot 2.
[Figure 2 depicts the flow: User Query → Query Weightage → Search Engine (WWW) → Link Filtering → Detag HTML (URL / text file) → Content Analysis and Filtering → Re-Ranking → Display Output in the User Interface.]
Search Results
The search results tab displays the results obtained from the
Google search engine. The page size and webpage link are also displayed, in
the rank order of the search engine.

Link Filter Analysis
The results of the link filter analysis are displayed in a table
structure with the query match, page size and in-link referral details. The
links are re-ranked based on the calculated percentages.

Content Filter Analysis
As a result of the content filter analysis, the webpage details are
displayed in a table structure containing the frequency of the query terms and
the distance between them.

Detailed Analysis
The results of the link and content analysis are analyzed and
represented graphically. Graphs are generated based on the search engine
results and the system-generated values.
3.3.2.2 Module description
The LC Coupled Filter ranks the links so as to provide the user
with only the relevant links. This is accomplished by performing various
tasks, which take form as modules. The modules identified are:
Link Filter:
This is one of the decisive parts of the filter. The link filter is
used to obtain the relevant links from the results provided by the search
engine; with the hyperlinks and hypertext as input, it ranks the links based
on three criteria.
Link and Anchor retrieval
Here the results provided by the search engine are retrieved, and
from those results the hyperlinks and hypertext are extracted. The results of
the search are shown for user reference.
Query Match
This operation finds whether the query is present in the hypertext
retrieved from the search engine results. The number of words matched in the
hypertext is divided by the number of words in the query, which gives the
query match ratio.
Page Size Calculation
This takes the hyperlinks of the search results as input, connects
to each hyperlink individually, gets the webpage source, and extracts only
the contents, stripping off the waste material that unnecessarily adds to the
page size.
Page weighting
The operations described above filter out the superfluous links;
the remaining links are then ranked based on the size of the page, the query
match and the in-link referrals.
Content Filter:
The unique feature of the LC Coupled Filter is that it filters
using both the content and the link filter. The content filter gets the
filtered links from the link filter as input and processes them based on the
content present in those links.
Detag
The detaging process retrieves the HTML code of the web
page and removes the HTML tags present in it, thereby keeping only the
content present in the web page.
Query Weightage:
The query given by the user in the user interface is not sent to
the search engine as such, but is pruned first. This pruning improves
efficiency and ameliorates the relevance between the query and the documents.
Get search query
The query given is retrieved from the user interface for further
processing and sending to the web.
Removal of stop word
Stop words, such as the most common prepositions and other
frequently occurring function words, are searched for and removed.
User Interface
All the processing described above takes shape on the screen
through the user interface provided. It bridges to the user while masking the
strenuous work being done by the filter.
Actual Result
This pane of the user interface gives the user the privilege of
seeing the actual results obtained from the search engine, for comparison
with the filtered results.
Link Based
This pane gives the results that are link filtered.
Content Based
This pane gives the results that are content filtered.
3.4 Implementation
Our main task is to enhance the relevance of the retrieved
information to the user query. Retrieval based on indexing alone might not be
appropriate to the user's query; enhancement of the search results therefore
involves analysis of both the link and the webpage content.

Each webpage has its own parameters which can be used to calculate its
weightage, and each webpage can be weighed based on its link or its content.
Before the webpage is analyzed, the user query needs to be filtered for
stopwords.
User Query Weighing:
The input given by the user is passed to a stopword remover to
eliminate the stopwords. A list of the most commonly used stopwords is
maintained and compared against the user query. The output of the stopword
remover becomes the input query for the search engine.
Step 1: Input the query from the user.
Step 2: Pass the query to the stopword_remover().
Step 3: Compare with the list of stopwords.
Step 4: Rephrase the input query.
Figure 3: Algorithm for Stopword Removal
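
A minimal Java sketch of this algorithm follows; the stopword list shown is illustrative, standing in for the list maintained by the system, and stopwordRemover() corresponds to the stopword_remover() function referred to above:

import java.util.*;

public class StopwordRemover {
    // Illustrative stopword list; the system maintains its own, larger list.
    private static final Set<String> STOPWORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "of", "in", "on", "for", "to", "and", "is"));

    // Steps 2-4: compare each word against the list and rephrase the query.
    public static String stopwordRemover(String query) {
        StringBuilder rephrased = new StringBuilder();
        for (String word : query.trim().split("\\s+")) {
            if (!STOPWORDS.contains(word.toLowerCase())) {
                rephrased.append(word).append(' ');
            }
        }
        return rephrased.toString().trim();
    }
}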
Link Filter Analysis:
With the user input query, the search results must be obtained and the links
processed. The following stream of processes leads to the ranked results:
• Query Match Calculation
• Query Match Percentage (QMP)
• In-link Referral Calculation
• In-Link Referral Percentage (IRP)
• Page size calculation
• Page Size Percentage (SP)
• Re-ranking
Query Match Calculation
The anchor link text will be processed for the presence of the user
query. The anchor text will be compared with the input query, and the strength
of the link will be calculated. The user query may contain more than one
keyword, and the anchor text may not contain all of those keywords. The
algorithm is given in figure 4.
Step 1: Get the input query from the user.
Step 2: Get the anchor text from the retrieved links.
Step 3: Compare anchor text with the user query.
Step 4: Check whether all words in user query are present in anchor text.
Step 5: Display the query matched value for the respective link.
Figure 4: Algorithm for Query Match
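
A sketch of this comparison in Java; substring containment is used here as a simple stand-in for whatever word matching the system actually performs:

// Query match ratio: number of query words found in the anchor text
// divided by the total number of query words (Steps 1-5 above).
static double queryMatchRatio(String query, String anchorText) {
    String[] queryWords = query.toLowerCase().split("\\s+");
    String anchor = anchorText.toLowerCase();
    int matched = 0;
    for (String word : queryWords) {
        if (anchor.contains(word)) matched++;  // simple containment check
    }
    return (double) matched / queryWords.length;
}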
Query Match Percentage (QMP)
The query match ratio of respective links is taken and divided by the
maximum query match value among all the links and converted into percentage,
thereby determining the QMP value of a particular link. The algorithm is given
in figure 5.
Step 1: Get the query match ratio of a particular link.
Step 2: Calculate the maximum query match ratio among all the links.
Step 3: Convert it to percentage using the formula,
QMP = (Current QMR / Maximum QMR) * 100
Figure 5: Algorithm for Query Match Percentage
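
Since QMP here, IRP (figure 7) and SP (figure 9) all normalize a per-link value against the maximum over all links, a single Java helper can illustrate all three computations (the method name is ours, not the system's):

// Normalize a link's raw value against the maximum over all links,
// expressed as a percentage; usable alike for QMP, IRP and SP.
static double toPercentage(double current, double[] allValues) {
    double max = 0;
    for (double v : allValues) {
        if (v > max) max = v;
    }
    return max == 0 ? 0 : (current / max) * 100.0;
}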
In-link referral
The links obtained from the search engine might be referred to by
other webpages. The in-link referrals are obtained from the Google search
engine by submitting the retrieved link concatenated with "link:". Calculating
the in-link referrals for each link contributes to the weightage of the
webpage. The algorithm is given in figure 6.
Step 1: Retrieve the link from the search engine.
Step 2: Concatenate “link:” to the obtained link.
Step 3: Pass the concatenated address to the Google search engine.
Step 4: Calculate the in-link referrals.
Figure 6: Algorithm for In-link Referral
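
A Java sketch of the query construction; countResults() is a hypothetical stand-in for the system's code that submits the query and parses the reported hit count:

// Steps 1-4: build the "link:" query for a retrieved link and count
// the in-link referrals reported by the search engine.
static int inLinkReferrals(String pageUrl) {
    String linkQuery = "link:" + pageUrl;   // Step 2: concatenate "link:"
    return countResults(linkQuery);         // Steps 3-4: submit and count
}

// Hypothetical helper: submits a query to the search engine and
// returns the number of results (project-specific parsing omitted).
static int countResults(String query) {
    return 0; // placeholder
}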
In-Link Referral Percentage (IRP)
The in-link referral of respective links is taken and divided by the
maximum in-link referral value among all the links and converted into
percentage, thereby determining the IRP value of a particular link. The
algorithm is given in figure 7.
Step 1: Get the in-link referral of a particular link.
Step 2: Calculate the maximum in-link referral among all the links.
Step 3: Convert it to percentage using the formula,
IRP = (Current In-link Referral / Maximum In-link Referral) * 100
Figure 7: Algorithm for In-link Referral Percentage
Page size calculation
The content of the webpage will be obtained from the WWW and then
detaged. During the detaging process the HTML tags will be removed and only
the contents will be retained. After the detaging process the original size of
the webpage will be found. The algorithm is given in figure 8.
Step 1: Get the source of the webpage using the retrieved link.
Step 2: Remove the HTML tags.
Step 3: Find the size of the detaged web page content.
Figure 8: Algorithm for Page Size Calculation
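
A minimal Java sketch of this measurement, assuming a simple regex-based tag strip (the detaging step itself is elaborated in figure 12):

// Steps 1-3: strip the HTML tags from the page source and measure
// the size of the remaining content in bytes.
static int detaggedPageSize(String htmlSource) {
    String content = htmlSource.replaceAll("<[^>]*>", "");
    return content.getBytes().length;
}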
Page Size Percentage (SP)
The page size of respective links is taken and divided by the maximum
page size value among all the links and converted into percentage, thereby
determining the SP value of a particular link. The algorithm is given in figure 9.
Step 1: Get the page size of a particular link.
Step 2: Calculate the maximum page size among all the links.
Step 3: Convert it to percentage using the formula,
SP = (Current Page Size / Maximum Page Size) * 100
Figure 9: Algorithm for Page Size Percentage
Re-Ranking
With the link parameters calculated, the web pages will be re-ranked.
The algorithm is given in figure 10.
Step 1: Get the result of Link Analysis.
Step 2: Calculate the mean percentage of the parameters for each webpage using the formula,
Mean Percentage (MP) = (QMP + IRP + SP) / 3
Step 3: Sort the webpages in descending order of mean percentage and re-rank accordingly.
Figure 10: Algorithm for Re-Ranking
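
A Java sketch of steps 2 and 3; the LinkScore holder class is illustrative, not the system's actual data structure:

import java.util.*;

// Holds the three computed percentages for one link.
class LinkScore {
    String url;
    double qmp, irp, sp;

    // Step 2: Mean Percentage (MP) = (QMP + IRP + SP) / 3
    double meanPercentage() {
        return (qmp + irp + sp) / 3.0;
    }
}

class ReRanker {
    // Step 3: sort the links in descending order of mean percentage.
    static void rerank(List<LinkScore> links) {
        Collections.sort(links, new Comparator<LinkScore>() {
            public int compare(LinkScore a, LinkScore b) {
                return Double.compare(b.meanPercentage(), a.meanPercentage());
            }
        });
    }
}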
Content Filter Analysis
Another method of weighing a webpage is to analyze the content of that
particular webpage. Content analysis covers the frequency of the keywords, the
distance between them, and the page size, and proceeds through the following
steps:
• Source Generation.
• Detaging.
• Frequency and Distance Calculation.
Source Generation
Using the links retrieved from the search engine, the webpage content
will be obtained. The source is obtained along with the images, programming
tags and other design-related content. The algorithm is given in figure 11.
Step 1: Get the retrieved webpage link.
Step 2: Connect to the World Wide Web (WWW) through the Internet.
Step 3: Retrieve the content of the webpage using the specified link address.
Figure 11: Algorithm for Source Generation
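
A Java sketch of this retrieval using java.net.URL; the system's own retrieval code may differ:

import java.io.*;
import java.net.*;

public class SourceGenerator {
    // Steps 1-3: open a connection to the link address and read back
    // the raw HTML source of the webpage.
    static String fetchSource(String link) throws IOException {
        URL url = new URL(link);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        StringBuilder source = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            source.append(line).append('\n');
        }
        in.close();
        return source.toString();
    }
}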
Detaging
The obtained content of a webpage contains all the HTML tags and
images. In the detaging process, all the HTML tags that occur between angle
brackets are identified and removed. The algorithm is given in figure 12.
Step 1: Get the webpage content.
Step 2: Check for HTML tags.
Step 3: Remove the tags.
Step 4: Save the content of the webpage.
Figure 12: Algorithm for Detaging
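
A simple regex-based Java sketch of detaging; it gives scripts and comments no special treatment:

// Steps 1-4: remove everything between angle brackets and collapse
// the remaining whitespace, keeping only the visible page content.
static String detag(String html) {
    return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
}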
Frequency and Distance Calculation
The detaged web content will be compared with the user query, and the
frequency of occurrence of the keywords will be calculated. While determining
the frequency, the distance between the keywords, if any, will also be
checked. The algorithm is given in figure 13.
Step 1: Get the user input query.
Step 2: Get the detaged content of the webpage.
Step 3: Compare the webpage content and user query and find the occurrence of the keywords.
Step 4: Find the distance between the keywords, if any.
Step 5: Tabulate the results.
Figure 13: Algorithm for Frequency and Distance Calculation
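
A Java sketch of this calculation; in line with the analysis described below, when a keyword occurs more than once the distance is taken as the mean of the gaps between successive occurrences:

import java.util.*;

public class ContentAnalyzer {
    // Steps 1-5: count keyword occurrences in the detaged content and
    // compute the mean word distance between successive occurrences.
    static int[] frequencyAndMeanDistance(String content, String keyword) {
        String[] words = content.toLowerCase().split("\\s+");
        String key = keyword.toLowerCase();
        List<Integer> positions = new ArrayList<Integer>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(key)) positions.add(i);  // exact token match
        }
        int frequency = positions.size();
        int totalGap = 0;
        for (int i = 1; i < positions.size(); i++) {
            totalGap += positions.get(i) - positions.get(i - 1);
        }
        int meanDistance = frequency > 1 ? totalGap / (frequency - 1) : 0;
        return new int[] { frequency, meanDistance };
    }
}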
Statistical Analysis:
GRAPH 1: Link Filter Analysis
The query match between the user query and each Google-extracted link
is calculated, along with the referral links for each link. Based on the
referral links and the query match, the links are re-ranked, as shown in
Appendix C, Screen Shot 1.
GRAPH 2: Content Filter Analysis
The frequency of the user-specified query is calculated in each
retrieved web page, and the distance between the keywords is also calculated
where applicable. When keywords occur more than once, the distance is taken
as the mean of the distances, as shown in Appendix C, Screen Shot 2.
GRAPH 3: Page Size Analysis
This graph compares the page size reported by Google with the actual
page size of the webpage after filtering out the tags (images, links,
advertisements, etc.). The actual size of a web page is less than or equal to
the Google-reported size, as shown in Appendix C, Screen Shot 3.
3.5 Testing
Testing is the process of evaluating the correctness, quality and
completeness of the system developed. The majority of the testing activity can
be carried out by writing test cases for the system. A test case can be a
single step, or a series of steps, along with the expected result or outcome,
prerequisite states or steps, and descriptions. A collection of test cases is
called a test suite or test plan. The test suite for our system is as follows.
1. UTC_LCFilter_001
Test Case Description: Execute the LC Filter User Interface
Module: Query Weightage Module
Action: The input query is filtered by the function Stopword_remover() and then passed as input to the search engine.
Account/Data Criteria: The input query will be obtained from the user.
Result: PASS
Table 3: UTC_LCFilter_001
2. UTC_LCFilter_002
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: The links will be obtained from the search engine based on the user query.
Account/Data Criteria: The user query will be passed as input to the search engine.
Result: PASS
Table 4: UTC_LCFilter_002
3. UTC_LCFilter_003
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: The obtained link query will be compared with the user query.
Account/Data Criteria: The link obtained from the search engine.
Result: PASS
Table 5: UTC_LCFilter_003
4. UTC_LCFilter_004
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: In-link referrals for each obtained link will be calculated.
Account/Data Criteria: The links obtained from the search engine.
Result: PASS
Table 6: UTC_LCFilter_004
5. UTC_LCFilter_005
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: The page size of each webpage will be calculated.
Account/Data Criteria: The webpage content representing each link. The webpage size is calculated by filesize().
Result: PASS
Table 7: UTC_LCFilter_005
6. UTC_LCFilter_006
Test Case Description: Execute the LC Filter User Interface
Module: Link Filter Module
Action: The links will be re-ranked based on the link parameters.
Account/Data Criteria: The links will be rearranged by rerank().
Result: PASS
Table 8: UTC_LCFilter_006
7. UTC_LCFilter_007
Test Case Description: Execute the LC Filter User Interface
Module: Content Filter Module
Action: The frequency of the user query will be calculated.
Account/Data Criteria: Webpage content of the obtained links. The frequency will be calculated by content_filtering().
Result: PASS
Table 9: UTC_LCFilter_007
8. UTC_LCFilter_008
Test Case Description: Execute the LC Filter User Interface
Module: Content Filter Module
Action: The distance between the keywords will be calculated.
Account/Data Criteria: Webpage content and the frequency of the keyword.
Result: PASS
Table 10: UTC_LCFilter_008
9. UTC_LCFilter_009
Test Case Description: Execute the LC Filter User Interface
Module: User Interface Module
Action: The input from the user will be obtained, as shown in Appendix B, Screen Shot 1.
Account/Data Criteria: The input text should not be empty.
Result: PASS
Table 11: UTC_LCFilter_009
10. UTC_LCFilter_010
Test Case Description: Execute the LC Filter User Interface
Module: User Interface Module
Action: The results of the link filter analysis will be displayed in the table structure, as shown in Appendix B, Screen Shots 2 & 3.
Account/Data Criteria: In case of unavailability of the data, it will be specified.
Result: PASS
Table 12: UTC_LCFilter_010
11. UTC_LCFilter_011
Test Case Description: Execute the LC Filter User Interface
Module: User Interface Module
Action: The result of the content filter analysis will be displayed in a table structure, as shown in Appendix B, Screen Shot 4.
Account/Data Criteria: The result values obtained will be displayed.
Result: PASS
Table 13: UTC_LCFilter_011
4 APPLICATIONS AND FUTURE ENHANCEMENTS
This software can be used for comparing documents, ranking documents,
and similar tasks. It can also be enhanced in various ways. Future
enhancements of the project include the following:

• Optimization of Internet Connectivity:
Execution time and speed are the two criteria to be improved; the
connectivity to the Internet affects the performance and speed of our system.

• Effective Search Results:
We have focused on re-ranking the search results of one search
engine. This can be upgraded to hybrid search result re-ranking by combining
the search results of more than one search engine.

• Personalization:
Personalization can be considered as an issue for future
enhancement: user-specific search results based on the past search history of
a particular user would provide improved, user-specific results.
5 CONCLUSION
The growth of the Internet is so tremendous that, for whatever keyword
the user enters (intelligible or not), a voluminous number of results is
returned, and the user has to pick out the ones needed. Search engines accept
user queries and provide results based on them, but these results may not meet
the user's expectations. In this project, the results provided by the search
engine are retrieved, various attributes related to each page in the results
are computed, and the pages are ranked based on these criteria. The links are
re-ranked based on the rankings of the content and link filters independently.
The use of this software enhances the relevance between the search query and
the documents, both in terms of links and in terms of content. The results
produced are found to be more effective than the actual results produced by
the search engines.
REFERENCES
1. Jon M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment",
Journal of the ACM, Vol 46, No. 5, September 1999, pp 604-632.
2. Salton G., Wong A., Yang C. S., "A Vector Space Model for Automatic
Indexing", Communications of the ACM, Vol 18, No. 11, 1975, pp 613-620.
3. Fotis Lazarinis, "Combining Information Retrieval with Information Extraction
for Efficient Retrieval of Calls for Papers", Proceedings of IRSG98, Grenoble,
France, Electronic Workshops in Computing, 1998.
4. Sullivan D., "Survey Reveals Search Habits – The Search Engine Report",
http://www.searchenginewatch.com/sereport/00/06-realnames.html
5. Wall A., "History of Search Engines and Web History",
http://www.search-marketing.info/search-engine-history/
6. Brin S., Page L., "The Anatomy of a Large-Scale Hypertextual Web Search
Engine", Proceedings of the 7th International WWW Conference, Brisbane,
Australia, 1998, pp 107-117.
7. Jingyu Hou, Yanchun Zhang, "Effectively Finding Relevant Web Pages from
Linkage Information", IEEE Transactions on Knowledge and Data Engineering,
Vol 15, No. 4, July-August 2003.
8. Monika R. Henzinger, "Hyperlink Analysis for the Web", IEEE Internet
Computing, January-February 2001, pp 45-50.
9. Deitel & Deitel, "Java How to Program", Third Edition, Prentice Hall.
10. http://developers.sun.com/prodtech/
11. http://www.google.co.in/about.html
Appendices
Appendix A – Screenshots of installation process
Appendix B - Screenshots of various functionalities of the software.
Screen Shot 1: Loading Screen
Screen Shot 2: Initial User Interface
Screen Shot 3: Google Search Results
Screen Shot 4: Link Filter Table 1 [Link Parameter Values]
Screen Shot 5: Link Filter Table 2 [Link Analysis Ranking]
Screen Shot 6: Content Filter Table [Content Parameter Values]
Appendix C - Screenshots of Graphs generated.
Screen Shot 1: Link Analysis Graph
Screen Shot 2: Content Filter Graph
Screen Shot 3: Page Size Graph