Content Based Search Engine “Global” for XML Database
A DISSERTATION Submitted in partial fulfillment of the requirements for the award of the degree
of
MASTER OF TECHNOLOGY
In
INFORMATION TECHNOLOGY (Specialization: SOFTWARE ENGINEERING)
Submitted by
Dinesh Garg (MS200506)
Under the Guidance of
Prof. M. Radhakrishna & Mr. Manish Kumar
IIIT-Allahabad
2005-2007 INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,
ALLAHABAD
Date: ______________
WE DO HEREBY RECOMMEND THAT THE THESIS WORK PREPARED
UNDER OUR SUPERVISION BY DINESH GARG ENTITLED CONTENT
BASED SEARCH ENGINE “GLOBAL” FOR XML DATABASE BE ACCEPTED
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE
OF MASTER OF TECHNOLOGY IN INFORMATION TECHNOLOGY
(SOFTWARE ENGINEERING) FOR EXAMINATION.
COUNTERSIGNED
Mr. Manish Kumar Prof. M. Radhakrishna
(THESIS ADVISERS)
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY
ALLAHABAD
(A University Established under sec.3 of UGC Act, 1956 vide Notification No. F.9-4/99-U.3 Dated 04.08.2000 of the Govt. of India)
(A Centre of Excellence in Information Technology Established by Govt. of India)
DR. U. S. TIWARY (DEAN ACADEMICS)
CERTIFICATE OF APPROVAL*
The foregoing thesis is hereby approved as a creditable study in the area of
knowledge management carried out and presented in a manner satisfactory
to warrant its acceptance as a pre-requisite to the degree for which it has
been submitted. It is understood that by this approval the undersigned do
not necessarily endorse or approve any statement made, opinion expressed
or conclusion drawn therein but approve the thesis only for the purpose for
which it is submitted.
COMMITTEE ON
FINAL EXAMINATION
FOR EVALUATION
OF THE THESIS
* Only in case the recommendation is concurred in
CANDIDATE DECLARATION
This is to certify that the report entitled “Content Based Search
Engine “Global” for XML Database”, which is submitted by me in partial
fulfillment of the requirements for the completion of the M.Tech. in
Information Technology (with specialization in Software Engineering) to the
Indian Institute of Information Technology, Allahabad, comprises only
my original work, and due acknowledgement has been made in the text to
all other material used.
Dinesh Garg
M. Tech. IT (Spl. Software Engineering)
MS200506
Acknowledgements
I am highly grateful to the honorable Director, IIIT-Allahabad, Dr. M. D. Tiwari, for
his ever-helping attitude and for encouraging us to excel in our studies. I am thankful to
Prof. U. S. Tiwari, Dean Academics, IIIT-Allahabad, for providing all the necessary
requirements and for his moral support for this dissertation work.
I would like to express my sincere gratitude to Mr. Manish Kumar for his invaluable
guidance and constant encouragement through the last semester of my project work.
He served as a motivating force in whatever I did and was always readily available
whenever needed. From him I have learned to combine theoretical knowledge with
intuition effectively.
I also thank Prof. M. Radhakrishna for their expert guidance and encouragement.
In spite of their hectic schedule they were always approachable and took time off
to attend to my problems and give the appropriate advice.
I am highly obliged to all my friends for their encouragement and for helping me at
the points where I got stuck. I am deeply indebted to all of them for always helping
and inspiring me.
Finally, I thank everyone who was connected with this thesis in one way or another.
Thanks to everyone; it has been a wonderful year!
Dinesh Garg
10-June-2007
Abstract
With the rapid development of the Internet, the Web has become a main information
source through which we can obtain useful information. Nowadays there are
millions of websites and billions of pages on the Internet. This explosive growth of
information has greatly increased the need for information retrieval systems such as
search engines.
Most popular search engines today, such as Google, AltaVista and Yahoo, are
based on HTML documents. Despite the success of HTML-based keyword search
engines, shortcomings have emerged, such as the lack of semantic retrieval. These
search engines rely on a web server model based on HTML files, which has certain
limitations. Extensible Markup Language (XML) has recently emerged as the
document standard for representing and exchanging data on the Web. XML
turns the Web into a database: a database of XML websites. Helping Web
users retrieve useful information from XML documents quickly has become
a hot topic.
The goal of this thesis is to develop a search engine for websites written in
XML/XSL. In contrast to existing search engines, it provides two levels of search:
basic search and refine search. Basic search is similar to a conventional HTML
search engine, but because the websites are written in XML it also gives the user
semantic information about the keyword.
Further functionality comes at the refine level, where the user can refine the
search according to the DTD/tag information presented. In addition, an efficient
compressed trie data structure is used to implement the indexer. The search engine
also frees the user from remembering the structure of the XML document and from
writing sophisticated queries to search XML documents.
Table of Contents
Acknowledgements
Abstract
Table of Contents
List of Figures
Abbreviations
Chapter 1. Introduction
  1.1. Introduction
  1.2. Motivation
  1.3. Organization of Thesis
Chapter 2. Literature Review
  2.1. Overview of XML
    2.1.1 XML (eXtensible Markup Language)
    2.1.2 DTD (Document Type Definition)
    2.1.3 XSL (EXtensible Stylesheet Language)
    2.1.4 A Simple XML example
    2.1.5 Document Object Model (DOM)
  2.2. Existing Query languages for XML
    2.2.1 Problem faced in the existing system
  2.3. Search Engines
    2.3.1 Types of Search Engines
    2.3.2 Shortcomings of most popular Search Engine
    2.3.3 Problems in Web Server model based on HTML pages
  2.4. Global Search engine Web Server Model
    2.4.1 Site Data & Storage
    2.4.2 Advantages of Global Search engine Web Server Model
Chapter 3. Requirement Specification of Search engine
  3.1. Functional Requirements
  3.2. Use Cases diagram of Search engine
  3.3. Non-functional Requirements
  3.4. Design Constraints
    3.4.1 Hardware Requirements
    3.4.2 Software Requirements
Chapter 4. Indexing Data Structure
  4.1. Standard Trie
  4.2. Basic Compressed Trie
  4.3. Basic M-Way Trie
    4.3.1 Drawbacks of M-Way Trie
  4.4. Indexer
    4.4.1 Searching Algorithm
    4.4.2 The Insertion Algorithm
  4.5. Advantage of compressed trie
Chapter 5. High Level Design
  5.1. System Architecture
  5.2. Gatherer Module (Module 1)
  5.3. Indexer (Module 2)
    5.3.1 Parsing of XML document
    5.3.2 Stop Word Removal
    5.3.3 Word Stemming
    5.3.4 Index Routine
  5.4. Search Processor (Module 3)
    5.4.1 Query processor
    5.4.2 Stop list and stemming
    5.4.3 Search routine
    5.4.4 Results Ranking and Display
  5.5. Indexing Process
  5.6. Search Process
  5.7. Transformation process of XML/XSL into HTML
Chapter 6. Detail Level Design
  6.1. Package Diagram
  6.2. Class Diagram
    6.2.1 Indexing Package Classes
    6.2.2 Parsing Package Class
  6.3. Sequence diagram
    6.3.1 Sequence Diagram of website Gather Module
    6.3.2 Sequence diagram of Insertion Module
    6.3.3 Sequence diagram of Simple Search Module
    6.3.4 Sequence diagram of Refine search Module
Chapter 7. Result and Discussion
  7.1. User Interface
  7.2. Simple Search Result page
    7.2.1 Visual understanding of result
    7.2.2 Analysis of Result
  7.3. Refine Search Result page
  7.4. Simple Search Result page of searching the keyword “xml”
    7.4.1 Analysis of Result
    7.4.2 Refine search result
  7.5. Simple Search Result page of searching the keyword “Gate”
  7.6. Procedure of Registering an XML website
  7.7. Procedure of Indexing a website
Chapter 8. Conclusion and Future Work
  8.1. Conclusion
  8.2. Future Amendments
References
Appendix-A: Configuring the project
List of Figures
Figure 2.1: Transformation Process
Figure 2.2: Hierarchical Structure of a Document Object
Figure 2.3: XML Parser Creating DOM
Figure 2.4: Existing Search Engine’s Web Server Model
Figure 2.5: Global Search Engine Web Server Model
Figure 3.1: Use Cases Diagram of Search Engine
Figure 4.1: Standard Trie
Figure 4.2: Basic Compressed Trie
Figure 4.3: Basic 10-Way Trie
Figure 4.4: Trie of 36 Elements English Alphabets and Numbers
Figure 4.5: Compressed Tries
Figure 4.6: Anode Structure
Figure 4.7: Bnode Structure
Figure 4.8: Datablock Structure
Figure 4.9: Record Structure
Figure 4.10: Search Compressed Tries
Figure 4.11: Case 2 Insertion Compressed Tries
Figure 4.12: Case 2.1 Insertion Compressed Tries
Figure 5.1: System Architecture
Figure 5.2: The High Level Design of Search Engine
Figure 5.3: Indexing Process
Figure 5.4: Search Process
Figure 5.5: Represents a Fragment of the Transformation
Figure 6.1: Package Diagram
Figure 6.2: Class Diagram
Figure 6.3: Class anode
Figure 6.4: Class bnode
Figure 6.5: Class block
Figure 6.6: Class recordlist
Figure 6.7: Class record
Figure 6.8: Class index
Figure 6.9: Class data
Figure 6.10: Class dom
Figure 6.11: Class server
Figure 6.12: Class addurl
Figure 6.13: Simple servlet
Figure 6.14: Refine servlet
Figure 6.15: Sequence Diagram of Website Gather Module
Figure 6.16: Sequence Diagram of Insertion Module
Figure 6.17: Sequence Diagram of Simple Search
Figure 6.18: Sequence Diagram of Refine Search
Figure 7.1: Snap Shot of User Interface
Figure 7.2: Snap Shot of Simple Search Result Page of the Keyword “Allahabad”
Figure 7.3: Snap Shot of Refine Search Result Page
Figure 7.4: Snap Shot of Searching the Keyword “Xml”
Figure 7.5: Search Result of Keyword “Xml” After Registering Few More Website Related to Xml Domain
Figure 7.6: Snap Shot of Refine Search Result Keyword “xml”
Figure 7.7: Snap Shot of Simple Search Result of the Keyword “Gate”
Figure 7.8: Snap Shot for Registering XML Website to Search Engine
Figure 7.9: Command to Run Apache Web Server
Figure 7.10: Command to Run Servlet
Figure 7.11: Command to Run Server
Abbreviations
XML eXtensible Markup Language
XSL eXtensible Stylesheet Language
DTD Document Type Definition
XSLT XSL Transformations
CSS Cascading Style Sheets
HTML Hypertext Markup Language
DOM Document Object Model
W3C World Wide Web Consortium
XQL XML Query Language
URL Uniform Resource Locator
WWW World Wide Web
CGI Common Gateway Interface
UML Unified Modeling Language
GUI Graphical User Interface
Chapter 1 Introduction
This chapter introduces the thesis and the motivation for pursuing work in the
area of XML technology. It also gives the reader an insight into the theoretical
principles involved in the conception and design of the search engine developed
during the course of this thesis. Finally, it presents the organization of the thesis.
1.1. Introduction
With the rapid development of the Internet, large amounts of digitally stored information
have become readily available online. Nowadays there are millions of websites and billions
of pages on the Internet. The volume of information is so large that it becomes
progressively more difficult and time consuming for users to find the information
relevant to their needs. This explosive growth of information on the Internet has
greatly increased the need for information retrieval systems.
However, the most popular search engines, such as Google, AltaVista and Yahoo, are based on
HTML documents and lack semantics. HTML provides a simple way to mark up
the structure of a document (major headings, minor headings, title, lists, etc.).
HTML also includes information about how to display text; for example, a browser knows
that H1 means a very large header line. But HTML does not give us a way to describe the
content of the text: the meaning is lost because there is no way to tag it. As a result,
management of Internet content is inefficient.
An information retrieval system needs to implement sophisticated pattern matching tools
to determine the semantics, context and purpose of the content. The problem is that
search engines can usually index only document titles, word frequencies and some
metadata that describe the content of a page.
We need a way to mark up the significant portions of documents in order to understand
their semantics, so that a search engine gets the appropriate information to
index and avoids the noise related to presentation. The eXtensible
Indian Institute Of Information Technology – Allahabad June-2007
Markup Language has recently emerged as the document standard for representing
and exchanging data on the Web. An XML document is built by nesting tagged
elements. This nested structure makes XML suitable for
representing data on the Web. The tags identify the meaning of the data rather than its
display format, as in HTML.
Unlike HTML, however, XML does not specify any information about document
appearance. The browser gets this missing information from style sheets. Style sheet
languages such as XSLT support the complete separation of content and presentation. We need
a search engine that not only provides a list of links related to a keyword but also
provides descriptive information about the content.
1.2. Motivation
The World Wide Web, the most popular application of the Internet, plays an
important role in information sharing. The volume of information is so large that it becomes
progressively more difficult and time consuming for users to find the information
relevant to their needs. This explosive growth of information on the Internet has
greatly increased the need for search engines.
Existing search engines simply match keywords and do not retrieve metadata.
Most popular search engines are based on HTML documents. Despite
the success of HTML-based keyword search engines, shortcomings have emerged,
such as the lack of semantic retrieval.
We need a search engine that not only returns a list of URLs but
also provides the semantic information, the type of document and the structure of the
document related to the search keyword. In addition, it should let the
user refine the search result according to the DTD/tag information presented. Such a
search engine should be based on a currently emerging technology such as
XML, the document standard for representing and exchanging data on the Web.
1.3. Organization of Thesis
The thesis report is organized in eight chapters. Chapter 1 introduces the thesis
and the motivation for pursuing work in the area of XML technology. The rest of
the thesis is organized as follows.
Chapter 2 Literature Review. This chapter presents the literature review. It gives
an overview of XML, introduces the existing query languages for
XML databases and discusses the shortcomings of current search engines. At
the end it presents the Global search engine web server model.
Chapter 3 Requirement Specification of Search engine. This chapter describes
the requirement specification of the search engine. It covers the functional
and non-functional requirements and shows the use case
diagram of the search engine.
Chapter 4 Indexing Data Structure. Describes the compressed trie indexing data
structure used in the implementation of the search engine.
Chapter 5 High Level Design. Describes the system architecture and high level
design of the search engine.
Chapter 6 Detail Level Design. Deals with the detail design phase, presenting the
class diagrams for structural modeling and the sequence diagrams for various
classes and functionalities of the search engine.
Chapter 7 Result and Discussion. Shows the results obtained, with
snapshots of the search engine output.
Chapter 8 Conclusion and Future Work. Summarizes the work done, giving
the conclusions and the possible future work that can be carried out in
the area.
Appendix-A Provides information regarding the configuration of the project.
Chapter 2 Literature Review
This chapter gives an overview of XML and of some related publications and systems. It
also introduces the existing query languages for XML databases and the problems with these
query languages, and discusses the shortcomings of current search engines. At the end it
presents the Global search engine web server model and the benefits of this model.
2.1. Overview of XML
XML stands for “Extensible Markup Language” (extensible because it is not a fixed
format like HTML).
2.1.1 XML (eXtensible Markup Language)
XML [4] is a set of rules for defining semantic tags that break a document into parts
and identify the different parts of the document. It is a meta-markup language that
defines a syntax in which other field-specific markup languages can be written. It is a
language in which you make up the tags you need as you go along. These tags must
be organized according to certain general principles, but they are quite flexible in their
meaning.
The eXtensible Markup Language is a standard recommended by the World Wide Web
Consortium for data representation and exchange on the Web. XML documents are
made up of storage units called entities, which contain either parsed or unparsed data.
XML provides a format that can represent both simple and extremely complex
information, and it allows developers to create their own vocabulary for describing the
information.
2.1.2 DTD (Document Type Definition)
A document type definition [11] lists the elements, attributes, entities, and notations that
can be used in a document, as well as their possible relationships to one another. It also
specifies a set of rules for the structure of a document. A DTD can be declared inside an XML
document. The document type declaration is very important in
checking whether a document is valid or just well-formed. The tasks carried out by a
document type declaration are:
1) Specifying the document’s root element.
2) Defining elements, attributes and entities for the document (internal DTD).
3) Identifying the external DTD for the document. [11]
The main uses of a DTD are as follows. Independent groups of people can agree to use
a standard DTD for interchanging data. An application can use a standard DTD to verify
that the data it receives from the outside world is valid. We can also use a DTD to
verify our own data.
2.1.3 XSL (EXtensible Stylesheet Language)
Since XML is a content-based meta-language, it does not mean much to refer to
“viewing an XML document”. How do we view something that does not include any
information about how it is to be displayed? In order to view an XML document, one must
provide information about how it is to be displayed. This is accomplished using CSS or
XSL style sheets. [5, 20]
[Figure 2.1: Transformation process. The XSL transformation takes a source document (XML) and a transformation sheet (XSL) and produces a target document (HTML, text, etc.).]
The XML style sheet model for displaying content is advantageous because:
1) Content is separated from display. Hence, if one wants to change the look of a web
page, only the XSL needs to be changed and not the data, which is in
XML format.
2) Through the use of style sheets, future Web documents will be accessible
everywhere, from PCs to TVs to palm devices to cellular phones. The same
content can be ported easily to different user agents such as mobile
devices and web browsers.
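The separation of content from display can be illustrated with a small sketch. It is not XSLT: Python's standard library has no XSLT processor, so two hypothetical rendering functions (`to_html` and `to_text`) stand in for two style sheets applied to the same XML content. The sample data and all names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical content: only the data, with no display information.
SOURCE = "<loan><house>Housing Loan Detail</house><personal>Quantum of Loan</personal></loan>"

def to_html(root):
    """Stand-in for one style sheet: render the children as an HTML list."""
    items = "".join(f"<li>{child.tag}: {child.text}</li>" for child in root)
    return f"<h1>{root.tag}</h1><ul>{items}</ul>"

def to_text(root):
    """Stand-in for a second style sheet: render the same data as plain text."""
    lines = [root.tag.upper()] + [f"- {child.tag}: {child.text}" for child in root]
    return "\n".join(lines)

root = ET.fromstring(SOURCE)
html_view = to_html(root)   # presentation for a web browser
text_view = to_text(root)   # presentation for a text-only device
print(html_view)
print(text_view)
```

Changing the look of either view means editing only the corresponding rendering function; the XML data in `SOURCE` stays untouched, which is the property point 1) describes.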
2.1.4 A Simple XML example
Sample XML file:
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE bank SYSTEM "allahbadbank.dtd">
    <bank>
      <allahabadbank>
        <account>Get the account detail</account>
        <InterestRate>
          <deposit>deposit interest rate</deposit>
          <credit>credit interest rate</credit>
        </InterestRate>
        <loan>
          <Education>
            <Eligibility>Courses Eligible Studies in India</Eligibility>
          </Education>
          <personal>Quantum of Loan</personal>
          <house>Housing Loan Detail</house>
        </loan>
      </allahabadbank>
    </bank>
This XML document gives information about a bank. The allahabadbank element
contains the account, InterestRate and loan elements, and some of these elements
contain further elements.
The corresponding Document Type Definition (DTD) for the above XML document is:
    <?xml version='1.0' encoding='UTF-8'?>
    <!ELEMENT bank (allahabadbank)*>
    <!ELEMENT allahabadbank (loan|InterestRate|account)*>
    <!ELEMENT account (#PCDATA)>
    <!ELEMENT InterestRate (credit|deposit)*>
    <!ELEMENT deposit (#PCDATA)>
    <!ELEMENT credit (#PCDATA)>
    <!ELEMENT loan (house|personal|Education)*>
    <!ELEMENT Education (Eligibility)*>
    <!ELEMENT Eligibility (#PCDATA)>
    <!ELEMENT personal (#PCDATA)>
    <!ELEMENT house (#PCDATA)>
This DTD ensures that the InterestRate element may contain only credit and deposit
elements, just as the loan element may contain only house, personal and Education
elements. The DTD is used to verify our XML data.
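To make the idea of verifying data against a DTD concrete, the following is a minimal sketch, assuming the content models above are transcribed by hand into a Python table. It is not a validating XML parser (Python's standard library does not validate against DTDs); `ALLOWED_CHILDREN` and `check_content_model` are hypothetical names, and the check covers only which child elements may appear inside which parent.

```python
import xml.etree.ElementTree as ET

# Allowed child elements per parent, transcribed from the DTD above.
# An empty set stands for (#PCDATA): text only, no child elements.
ALLOWED_CHILDREN = {
    "bank": {"allahabadbank"},
    "allahabadbank": {"loan", "InterestRate", "account"},
    "account": set(),
    "InterestRate": {"credit", "deposit"},
    "deposit": set(),
    "credit": set(),
    "loan": {"house", "personal", "Education"},
    "Education": {"Eligibility"},
    "Eligibility": set(),
    "personal": set(),
    "house": set(),
}

def check_content_model(element):
    """Return a list of violations found in element and its descendants."""
    allowed = ALLOWED_CHILDREN.get(element.tag)
    if allowed is None:
        return [f"undeclared element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in allowed:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        problems.extend(check_content_model(child))
    return problems

good = ET.fromstring(
    "<bank><allahabadbank><loan><house>Housing Loan Detail</house></loan></allahabadbank></bank>"
)
bad = ET.fromstring(
    "<bank><allahabadbank><InterestRate><house>x</house></InterestRate></allahabadbank></bank>"
)

good_problems = check_content_model(good)
bad_problems = check_content_model(bad)
print(good_problems)
print(bad_problems)
```

Because every content model in this DTD is either `(#PCDATA)` or a starred choice such as `(credit|deposit)*`, a simple set-membership test is enough here; DTDs with sequences or cardinality constraints would need a real validator.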
2.1.5 Document Object Model (DOM)
The Document Object Model is a platform- and language-independent standard object
model for representing HTML, XML and related formats. According to the DOM,
everything in an XML document is a node, and the entire document is a document node.
The DOM supports navigating and modifying XML documents and presents a document
as a hierarchical tree. [16]
[Figure 2.2: Hierarchical structure of a document object. The document node is the document root. Below it, Bank is the parent of Allahabadbank, whose children are account, InterestRate (with children Credit and Deposit) and Loan.]
The XML Document Object Model is a programming interface for XML documents.
It defines the way an XML document can be accessed and manipulated. The DOM is
usually added as a layer between the XML parser and the application that needs the
information in the document: the parser reads the data from the XML
document and feeds that data into a DOM. The DOM is then used by a higher-level
application.
[Figure 2.3: XML parser creating DOM. The XML document is read by the XML parser, which builds the DOM used by the application.]
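The parser-to-DOM-to-application layering can be sketched with Python's standard `xml.dom.minidom` module. This is an illustrative example, not the parser used in the thesis; the document is a shortened copy of the bank sample from Section 2.1.4, and `element_children` is a hypothetical helper.

```python
from xml.dom import minidom, Node

SAMPLE = """<bank>
  <allahabadbank>
    <account>Get the account detail</account>
    <InterestRate>
      <deposit>deposit interest rate</deposit>
      <credit>credit interest rate</credit>
    </InterestRate>
  </allahabadbank>
</bank>"""

def element_children(node):
    """Return the tag names of child nodes that are elements.

    Whitespace between tags appears in the DOM as text nodes, so those are skipped.
    """
    return [c.tagName for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]

# Parser layer: reads the XML text and builds the DOM tree.
document = minidom.parseString(SAMPLE)

# Application layer: navigates the tree through the DOM interface.
root = document.documentElement
bank_node = root.getElementsByTagName("allahabadbank")[0]
rate_node = bank_node.getElementsByTagName("InterestRate")[0]

root_name = root.tagName
bank_children = element_children(bank_node)
rate_children = element_children(rate_node)
print(root_name, bank_children, rate_children)
```

The application never touches the raw XML text; it only walks parent and child nodes, which is the hierarchical view shown in Figure 2.2.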
2.2. Existing Query languages for XML
There are many query languages that can be used to query an XML database, e.g.
XSL, XML-QL, and XQL.
XML-QL
XML-QL integrates XML syntax with query language techniques. Path
expressions and patterns are used to extract data from the input XML data, and
variables are bound to the matched data. Templates describe the output XML data. Both
templates and patterns use XML syntax. XML-QL is based on a WHERE/CONSTRUCT
syntax and offers features such as regular path expressions, XML patterns,
joins over multiple input sources, and Skolem functions for grouping.
Syntax
    WHERE elementPatterns IN xmlSource CONSTRUCT template
Example
    WHERE <book>
            <publisher><name>Addison-Wesley</name></publisher>
            <title> $t </title>
            <author> $a </author>
          </book> IN "www.x.y.z/bib.xml"
    CONSTRUCT $a
As the example shows, XML-QL offers a well-defined way to query a specified XML
document. It does not inherently support querying an XML repository that holds
XML data in semi-structured form. [18]
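To show what the WHERE pattern and CONSTRUCT clause express, here is a rough Python equivalent using the standard `xml.etree.ElementTree` module. It is only a sketch of the query's meaning, not an XML-QL engine; the contents of `BIB` are invented sample data and `authors_by_publisher` is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of bib.xml, invented for this illustration.
BIB = """<bib>
  <book>
    <publisher><name>Addison-Wesley</name></publisher>
    <title>Book One</title>
    <author>Author A</author>
  </book>
  <book>
    <publisher><name>Other Press</name></publisher>
    <title>Book Two</title>
    <author>Author B</author>
  </book>
</bib>"""

def authors_by_publisher(root, publisher_name):
    """Collect the author of every <book> whose publisher name matches.

    The test on publisher/name plays the role of the WHERE pattern, and the
    returned author texts play the role of CONSTRUCT $a.
    """
    authors = []
    for book in root.iter("book"):
        if book.findtext("publisher/name") == publisher_name:
            authors.extend(a.text.strip() for a in book.findall("author"))
    return authors

result = authors_by_publisher(ET.fromstring(BIB), "Addison-Wesley")
print(result)
```

Note that the function hard-codes the path `publisher/name`: whoever writes the query has to know how the document is structured.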
XQL
For XQL, a document is a labeled, ordered tree whose nodes represent
elements, processing instructions, document entities, attributes and comments. XQL is
similar to XPath and XSL patterns. XQL engines can represent the input to a query
via XSL nodes, DOM nodes, index structures or XML text.
XQuery
XQuery was devised primarily as a query language for data stored in XML form, so
its main role is to get information out of XML databases. XQuery uses predicates to
limit the data extracted from XML documents. The language is based on a
tree-structured model of the information content of an XML document, containing
seven kinds of node: processing instructions, elements, text nodes, comments,
attributes, document nodes and namespaces [12].
XQuery is a case-sensitive language. Keywords are in lower case. XQuery is a
functional language comprising several kinds of expressions that can be nested and
composed with full generality. Every expression has a value and no side effects.
Expressions can raise errors, and usually propagate lower-level errors.
Sample XQuery example:
List all suppliers. If a supplier offers medical items, list the descriptions of the items.
for $s in doc("suppliers.xml")//supplier
order by $s/name
return
  <supplier>
    { $s/name,
      for $ci in doc("catalog.xml")//item[supp_no = $s/number],
          $mi in doc("medical_item.xml")//item[number = $ci/item_no]
      return $mi/description
    }
  </supplier>
XQuery also provides aggregate functions (avg, count, etc.). The language allows
several data sources to be queried simultaneously, producing an integrated view of
their data.
FOR-WHERE-RETURN XQuery example:
Find the titles of all books published after 1999:
for $x in doc("abc.xml")/bib/book where $x/year/text() > 1999 return $x/title
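The JDK does not ship an XQuery engine, so the following sketch expresses the same selection as an XPath expression evaluated through the standard javax.xml.xpath API. It is only an illustration of the query, not the mechanism used by the search engine; the class name TitleQuerySketch is hypothetical, and the bib/book/year/title structure is taken from the query above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: XPath equivalent of the for/where/return query.
class TitleQuerySketch {
    /** Returns the text of every /bib/book/title whose sibling year is greater than 1999. */
    static List<String> titlesAfter1999(String bibXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    "/bib/book[year > 1999]/title",          // predicate plays the role of "where"
                    new InputSource(new StringReader(bibXml)),
                    XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                titles.add(hits.item(i).getTextContent());
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalStateException("query failed", e);
        }
    }
}
```

Note that even this short query cannot be written without knowing that books live under bib and carry year and title children, which is exactly the burden discussed in the next section.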
2.2.1 Problem Faced in the Existing System
With these query languages we can get whatever we want from a given database by
writing sophisticated queries. To write such queries, however, users need to be
familiar with the document structure and need to know which tags the document
uses. Every XML document has its own structure, and structures differ from one
document to another. So before searching the database, the user has to remember
the structure of the corresponding XML document. In short:
• Searching the database requires writing sophisticated queries.
• The user carries the burden of remembering the document structure.
• The user’s familiarity with the document structure is an unreasonable
assumption.
Jennifer Widom’s pioneering “whitepaper” [7] points out the challenges and the
technologies available and required to meet the requirements of the WWW. The paper
focuses on XML as a standard for data representation and exchange on the
Internet. It discusses various features of XML and comments that XML
will radically change the face of the Web. It also discusses research and
development issues related to XML and databases, such as query languages,
information retrieval and database systems. According to her, XML “turns
Web into a database.”
In [25], Kerer and Kirda present an experience report on building and managing
XML/XSL-powered websites. Their observations include the following: LC separation
through XML/XSL provides high layout flexibility; XML/XSL deployment needs more
data organization planning; XML/XSL enables multi-lingual websites; learning XML
and XSL concepts is not easy for the developer; and graphical design companies are
slow to pick up XML/XSL know-how.
2.3. Search Engines
A search engine is a tool that allows you to look up information. A search engine
accepts the keywords entered by the user, examines its index and returns a list of
the best-matching web pages according to its criteria.
2.3.1 Types of Search Engines
Search engines can be divided into various categories.
Full-text Search Engines
Some search engines, such as Google, store all or part of the web page as well as
information about it, whereas others, such as AltaVista, store every word of every
page they find. They allow users to search for any string of text and build a
ranked list that tries to present the most useful pages at the top. Ranking can be
based on various factors, such as whether the search keyword appears near the top
of the document, in sub-headings, in the meta tags or in the title of the page, and
the number of times the search keyword occurs in the text.
Directory-based Search Engines
Directory-based search engines use some form of category system. They organize
documents into categories such as Movies, Travel, Shopping and Sports.
Examples of this category are Yahoo!, AOL Search, MSN.com and InfoSeek.
MetaSearch Engines
A meta-search engine (also known as a multi-search engine) is a search engine that
sends user requests to several other search engines, either simultaneously or
sequentially. The results are then blended together onto one page. Meta-search
engines do not have their own database of Web pages; they create a virtual
database. They also enable users to enter search criteria. "Smarter" meta-search
technology includes clustering and linguistic analysis. Examples of this category
are Dogpile, Vivisimo, Kartoo, SurfWax and Mamma.
Specialist Search Engine
Specialist search engines are specifically designed to provide search relevant to
some specific area of information. This category does not include search engines
run by individual companies. It is confined to the wide range of database search
tools that cover the needs of a particular organization. The Internet Movie
Database (IMDb) search is an example of this category.
2.3.2 Shortcomings of Most Popular Search Engines
The most popular search engines are Google, AltaVista and Yahoo. Despite their
success, these search engines have a few shortcomings.
• They simply match keywords and do not retrieve metadata (such as XML tags).
• They always return entire documents as search results instead of returning the
part of a document that is relevant to the search.
• They are based on HTML documents. The problems this causes are explained in
Section 2.3.3.
• They lack semantic retrieval, i.e. they give no information about the type of
a document or any related semantic information about it.
[Figure: a web browser sends a request over the Internet to the web server, which responds with an HTML page; the web server holds HTML pages and CGI, Servlet and JSP programs that connect to other system databases]
Figure 2.4: Existing search engine’s web server model
Figure 2.4 briefly explains the basic functioning of the Internet; the dotted box
contains the web server based on HTML pages. Existing search engines fetch content
from HTML pages.
2.3.3 Problems in Web Server model based on HTML pages
The most common way of storing information on the Web is to write web pages in
HTML. This HTML file-based storage architecture has certain limitations.
Intermixing of Content and Representation: The Web as it stands today is mostly
a collection of a large number of HTML files. HTML files are presentation-oriented.
There is no clear separation between the information content and its rendering
details.
Unable to provide semantic information: An HTML page includes information about
how to display text. For example, the browser knows that an H2 means a large header
line. But HTML gives us no way to describe the content of the text, so the meaning
is lost and no semantic information about it can be provided.
Poor support for device independence: HTML has turned out to have poor support
for device independence. Users want to access the Web not only from their personal
computers but also from mobile devices with varying display sizes and
characteristics.
Difficult to manage and stiff to change: HTML web sites are difficult to manage
and resistant to change, and their content is hard to reuse and extract. They are
too inflexible to incorporate layout design changes easily.
2.4. Global Search Engine Web Server Model
The Global search engine has a database-based Web server model. In this model
content is given the highest importance. The world is content rich, and content
storage and management should be given high importance in the design. Unlike the
existing scenario, in this model the content and its display format are totally
separated.
The Web is an unlimited collection of data that cannot be managed properly in a
bounded, document-based system. In this model, the content is stored in the kind of
database system that best suits the World Wide Web, such as one for
semi-structured data.
[Figure: a web browser sends a request over the Internet to the web server, which responds with an HTML page; the web server's CGI, Servlet and JSP programs draw on a Style Sheet Repository, a DTD Repository and an XML Repository, and connect to other system databases]
Figure 2.5: Global Search engine web server model
Semi Structured Data
Semi-structured data is data that is neither raw data nor strictly typed as in a
conventional relational database system. It is often described as ‘self-describing’
or ‘schema-less’, terms that indicate that there is no separate description of the
structure or type of the data. The structure is irregular, implicit and partial.
Because semi-structured data is self-describing, its structure can be obtained by
computation.
As shown in figure 2.5, the Global search engine web server is integrated with an
XML database. In this model the HTML repository is replaced with three
repositories: the XML repository, the DTD repository and the style sheet
repository. The XML repository contains the data of the website; no representation
information is stored in it. The DTD repository contains the DTD files used for
validating the XML files. The style sheet repository contains the representation
information used to display the web sites.
2.4.1 Site Data & Storage
Xml Repository
XML repository contains XML documents. XML stores information in hierarchical
formats. XML documents are made up of storage units called entities, which contain
either parsed (#PCDATA) or unparsed data. It provides a format that can represent
both simple and extremely complex information, and allows developers to create their
own vocabularies for describing information.
Style Sheet Repository
Although data and the storage of data are important, the rendering information
cannot be ignored completely. The rendering information for XML data can be
provided using the eXtensible Stylesheet Language (XSL) or Cascading Style Sheets
(CSS). In this web server model the display formatting information is held in the
Style Sheet Repository. The XML repository has proper links to the style sheet
repository.
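As an illustration of keeping content and rendering apart, the following sketch applies an XSL style sheet to an XML document using the standard javax.xml.transform API. It is a minimal example under assumed inputs: the class name StyleSheetSketch is hypothetical, and both documents are passed as strings here, whereas in the model above they would come from the XML and style sheet repositories.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: content (XML) and presentation (XSL) are supplied separately.
class StyleSheetSketch {
    /** Renders the XML content with the given XSL style sheet and returns the output text. */
    static String render(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter page = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(page));
            return page.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }
}
```

Changing the layout then only requires a different style sheet; the XML content is left untouched.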
DTD Repository
The DTD Repository contains the DTD files used for validating XML documents. A DTD
contains the rules for a particular type of XML document. A DTD describes elements.
An element declaration consists of the text <!ELEMENT, followed by the name of the
element, followed by a description of the element’s content. For instance:
<!ELEMENT brand (#PCDATA)>.
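A minimal sketch of DTD validation with the standard JAXP parser is given below. To keep the example self-contained, the DTD is assumed to be embedded in the document as an internal subset; in the model described here the DTD would instead be a separate file in the DTD repository. The class name DtdCheckSketch is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper: reports whether a document conforms to the DTD named in its DOCTYPE.
class DtdCheckSketch {
    static boolean isValid(String xmlWithDoctype) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);                 // switch on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            final boolean[] ok = { true };
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* warnings do not affect validity */ }
                public void error(SAXParseException e) { ok[0] = false; }   // validity errors
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xmlWithDoctype)));
            return ok[0];
        } catch (Exception e) {
            return false;                                // not well-formed or unreadable
        }
    }
}
```

With the declaration <!ELEMENT brand (#PCDATA)>, a brand element that contains only text is accepted, while a brand element containing an undeclared child element is rejected.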
2.4.2 Advantages of Global Search Engine Web Server Model
This model has several advantages over the existing World Wide Web server model.
Structured Declaration: In this model, the information is stored in a database-like
system that can store all types of information in XML format and allows the data to
be well structured.
Management of large amount of data: Large amounts of data are now managed by the
database, not the file system.
Separation of Content and Representation: The content and the display format are
stored separately with appropriate linking. This clearly separates the information
content from the representation detail, which helps in delivering more structured
information.
Dynamic generation of Web pages and their updating: In this model, information
is always collected from the database, which can be kept up to date using
well-understood control and maintenance mechanisms.
Searching: Since the data is stored in a database system, searching capability can
be provided and customized. This model provides the flexibility to tune to any
searching mechanism.
Chapter 3 Requirement Specification of Search Engine
This chapter focuses on the requirement analysis of the search engine being
developed. It describes the functional and non-functional requirements of the
search engine and then presents the use case diagram. The diagrams are drawn using
the Unified Modeling Language (UML).
UML is a standard language for writing software blueprints. The UML may be used
to visualize, specify, construct and document the artifacts of a software system.
The right set of diagrams has to be chosen to model the system, which increases the
chances of its success. For modeling a simple application, the use case view and
the design view are sufficient.
After the system has been specified, the next logical step is to present the design
of the system to be implemented. Chapter 5 discusses the high level design of the
search engine.
Requirements can be divided into two major types, functional and non-functional.
3.1. Functional Requirement
Functional requirements describe “what” the system should do. The following is a
list of the functionalities of the system.
1. Gather Module
The web interface module. It lets a user register a website with the search engine.
2. Indexer
It takes the list of URLs from the Gather Module, parses each XML document using
xerces-2_7_1 [9] and extracts the information about each word. A record is prepared
from that information. Records are indexed using the compressed trie data structure.
3. Simple Search
Similar to a usual search engine for HTML documents. It performs the search based
on user-specified keywords, retrieves a list of records from the database through
the indexer and displays the following information as the search result:
• Keyword
• Semantic information about the keyword, in the form of XML tags.
• URL of the XML website/document.
• Name of the corresponding Document Type Definition, and the lists of DTDs and tags.
• URL of the Document Type Definition.
4. Refine Search
The user can refine the search by selecting a tag from the tag list or a DTD from
the DTD list.
5. Ranking of Search Results
Frequency of keywords
One of the main rules in a ranking algorithm involves the location and frequency of
keywords on a web page [29]. The document containing the largest number of
occurrences of the search string is ranked highest.
Order of tags list
Search engines also check whether the search keywords appear near the top of a
web page [29]. The tag list is ordered on the basis of tag closeness: the nearer a
tag is to the search string in the document, the higher its rank.
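The frequency rule can be sketched as a simple sort. This is only an illustration under stated assumptions: the Hit class, its field names and the tie-breaking behavior are hypothetical, and the frequency value is assumed to be the per-document keyword count held in the index record.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of frequency-based ranking of search results.
class RankingSketch {
    /** One search hit: the document URL and how often the keyword occurs in that document. */
    static final class Hit {
        final String url;
        final int frequency;
        Hit(String url, int frequency) { this.url = url; this.frequency = frequency; }
    }

    /** Returns a new list ordered by descending keyword frequency; ties keep their input order. */
    static List<Hit> rank(List<Hit> hits) {
        List<Hit> ranked = new ArrayList<>(hits);
        ranked.sort(Comparator.comparingInt((Hit h) -> h.frequency).reversed());
        return ranked;
    }
}
```

Because the sort is stable, documents with equal frequency stay in the order in which the indexer returned them; any secondary criterion would have to be added to the comparator.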
3.2. Use Cases diagram of Search engine
A use case represents a sequence of actions performed by the users of the
software. A use case diagram shows the relationships among a set of use cases and
actors. As shown in figure 3.1, there are two main symbols: an actor is shown as a
stick person and a use case is shown as an ellipse. Lines indicate which actor
performs which use cases.
[Figure: use case diagram with actors user and administrator; use cases addurl, crawling, parsing, indexing, index file generated, view index url, server process, search, refinesearch, taglist and dtdlist, connected by <<include>> and <<extend>> relationships]
Figure 3.1: Use Cases diagram of Search engine
The use case diagram specifies the operations that can be performed by the user,
such as adding a web page for indexing, searching, and refining the search results
by the tag list or DTD list. It also shows the operations performed by the system
administrator.
3.3. Non-functional Requirements
Non-functional requirements describe “how well” the functional requirements of the
system are satisfied. The user can judge this “how well” in terms of the
characteristics that he is concerned with.
All user interaction with the search engine is conducted via a GUI, which must
therefore meet the demands of the user. From the user’s point of view the
requirements are:
1. Conformance to standard: The GUI conforms to web browser look-and-feel
guidelines.
2. Response time: The response time of the system should not be more than 3
seconds.
3. Robust: The system runs smoothly under normal circumstances, without failing
abruptly.
4. Performance: Using Java with Web service technology reduces bandwidth
consumption and makes the environment more reliable, available and safe.
5. Recovery: If the server crashes because of a power failure or for some other
reason, we are able to recover from a backup of the index file.
6. Reusability: Application components must be developed in a platform-independent
and portable language, for example Java.
7. Interfaces: The GUI is similar to that of the familiar search engine Google.
8. Usability: The tool is easy to use. It allows the user to operate it with very
little training.
Ease of use, portability, maintainability, expandability and system administration
are addressed by using advanced features of J2EE to develop the software.
3.4. Design Constraints
3.4.1 Hardware Requirements
Pentium III or above with 512 MB RAM, or compatible systems, for the machines
running client processes; Pentium IV or above with 1 GB RAM for the machine running
the server process.
3.4.2 Software Requirements
Clients:
• Browser: Internet Explorer, Mozilla Firefox.
Server:
• Linux or Solaris 10 Operating System
• Java Servlet Development Kit 2.0 (JSDK2.0)
• Java SE Development Kit (JDK) & Java JRE 1.5.x
• Apache Tomcat Server
Java API:
• Xerces-2_7_1 [9] for parsing XML documents.
• JavaMail for sending mail.
• Porter stemmer for the stemming algorithm.
Chapter 4 Indexing Data Structure
This chapter introduces the indexing data structure used in the implementation of
the search engine. It describes the standard trie, the basic compressed trie and
the M-way trie data structures. The end of the chapter presents the compressed trie
data structure used in the implementation of the search engine and explains its
searching and insertion algorithms.
4.1 Standard Trie
The trie (pronounced “try” and derived from the word retrieval) is a data structure
that uses the digits in the keys to organize and search the dictionary. The
standard trie for a set of strings S is an ordered tree such that each node but the
root is labeled with a character. The children of a node are alphabetically
ordered. The paths from the root to the external nodes yield the strings of S. The
height of the tree is the length of the longest string [19].
Example: standard trie for the set of strings
S = {bear, bell, bid, bull, buy, sell, stock, stop}
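[Figure: standard trie for S; the root has children b and s, and the root-to-leaf paths spell bear, bell, bid, bull, buy, sell, stock and stop]
Figure 4.1: Standard Trie [19]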
A standard trie uses O(n) space and supports searches, insertions and deletions in
time O(dm), where:
n is the total size of the strings in S,
m is the size of the string parameter of the operation,
d is the size of the alphabet.
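A standard trie for the example set S can be sketched as follows. This is an illustrative implementation, not the indexer used in this work; the class name StandardTrieSketch is hypothetical. Children are kept in a TreeMap so that they stay alphabetically ordered, which makes each step O(log d) rather than the O(d) scan assumed in the bound above.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a standard trie: one node per character.
class StandardTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>(); // alphabetically ordered children
        boolean endOfWord;                                      // true if a string of S ends here
    }

    private final Node root = new Node();

    /** Inserts a word, creating one node per character that is not yet on the path. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    /** Follows the characters of the word from the root; fails as soon as a child is missing. */
    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.endOfWord;
    }

    /** Number of labeled nodes, i.e. all nodes except the root. */
    int nodeCount() {
        return count(root) - 1;
    }

    private int count(Node node) {
        int total = 1;
        for (Node child : node.children.values()) total += count(child);
        return total;
    }
}
```

For S = {bear, bell, bid, bull, buy, sell, stock, stop} this structure has 21 labeled nodes, many of which have a single child; those chains are what the compressed trie removes.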
4.2. Basic Compressed Trie
The compressed trie is obtained from the standard trie by compressing chains of
“redundant” nodes. It improves the space efficiency of tries by removing nodes with
only one child. Each internal node in a compressed trie has at least two children
and each external node is associated with a string [14].
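[Figure: compressed trie for S; the root has children b and s; b has children e (with children ar and ll), id and u (with children ll and y); s has children ell and to (with children ck and p)]
Figure 4.2: Basic Compressed Trie [14]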
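The compression can be sketched by labeling each node with a substring instead of a single character and splitting a label only when two keys diverge inside it. This is a generic compressed (radix) trie written for illustration, under the assumption that keys are inserted one at a time; it is not the two-level structure of Section 4.4, and the class name CompressedTrieSketch is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a basic compressed trie: nodes carry substrings.
class CompressedTrieSketch {
    private static final class Node {
        String label;                                           // substring leading to this node
        boolean endOfWord;
        final Map<Character, Node> children = new TreeMap<>();  // keyed by first character of label
        Node(String label, boolean endOfWord) { this.label = label; this.endOfWord = endOfWord; }
    }

    private final Node root = new Node("", false);

    void insert(String word) {
        Node node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            Node child = node.children.get(rest.charAt(0));
            if (child == null) {                       // no shared prefix: hang the remainder here
                node.children.put(rest.charAt(0), new Node(rest, true));
                return;
            }
            int common = commonPrefix(rest, child.label);
            if (common < child.label.length()) {       // keys diverge inside the label: split it
                Node tail = new Node(child.label.substring(common), child.endOfWord);
                tail.children.putAll(child.children);
                child.children.clear();
                child.label = child.label.substring(0, common);
                child.endOfWord = false;
                child.children.put(tail.label.charAt(0), tail);
            }
            rest = rest.substring(common);
            node = child;
        }
        node.endOfWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            Node child = node.children.get(rest.charAt(0));
            if (child == null || !rest.startsWith(child.label)) return false;
            rest = rest.substring(child.label.length());
            node = child;
        }
        return node.endOfWord;
    }

    /** Number of labeled nodes, i.e. all nodes except the root. */
    int nodeCount() { return count(root) - 1; }

    private int count(Node node) {
        int total = 1;
        for (Node child : node.children.values()) total += count(child);
        return total;
    }

    private static int commonPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

For the example set S this yields 13 labeled nodes (b, e, ar, ll, id, u, ll, y, s, ell, to, ck, p), matching the shape of figure 4.2, compared with 21 nodes in the standard trie.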
4.3. M-Way Trie
An M-way trie (where M is the size of the alphabet) is a trie in which the root
node points to another node for each of the possible characters a word may begin
with. Each of these nodes, likewise, contains a pointer to a node for each possible
second character, and so forth. Each node on level k represents the set of all keys
that start with the same sequence of k characters; this node specifies an M-way
branch, depending on the (k + 1)st character of a key [15].
Example of M-Way Trie
A trie representation for the five elements
951-94-1654, 562-44-2169, 271-16-3624, 278-49-1515, 951-23-7625
results in a trie structure that has 10-way branching, as shown in figure 4.3. The
trie employs two types of nodes: element nodes and branch nodes. Each branch node
has 10 children fields. These fields, child[0:9], have been labeled 0, 1, ..., 9
for the root node of Figure 4.3. root.child[i] points to the root of a subtrie that
contains all elements whose first digit is i. In Figure 4.3, nodes A, B, D, E, F
and I are branch nodes. The remaining nodes, C, G, H, J and K, are element
nodes [10].
Figure 4.3: basic 10-Way Trie [10]
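The 10-way trie of the example can be sketched as below. This is an illustrative reading of the description above, assuming fixed-length, distinct keys whose hyphens are ignored when branching; the class name DigitTrieSketch and its methods are hypothetical.

```java
// Hypothetical sketch of a 10-way trie with branch nodes and element nodes.
class DigitTrieSketch {
    private static final class Node {
        final Node[] child;    // non-null for branch nodes: one slot per digit 0..9
        final String element;  // non-null for element nodes: the stored key
        Node() { this.child = new Node[10]; this.element = null; }
        Node(String element) { this.child = null; this.element = element; }
    }

    private final Node root = new Node();

    /** Inserts a key; an element node is pushed down only as far as needed to separate two keys. */
    void insert(String key) {
        String digits = key.replace("-", "");
        Node node = root;
        int level = 0;
        while (true) {
            int d = digits.charAt(level) - '0';
            Node next = node.child[d];
            if (next == null) {                        // free slot: store the element here
                node.child[d] = new Node(key);
                return;
            }
            if (next.element != null) {                // slot holds an element node
                if (next.element.equals(key)) return;  // already present
                Node branch = new Node();              // replace it by a branch node and
                String old = next.element.replace("-", "");
                branch.child[old.charAt(level + 1) - '0'] = next;  // move the old element down
                node.child[d] = branch;
                next = branch;
            }
            node = next;
            level++;
        }
    }

    boolean contains(String key) {
        String digits = key.replace("-", "");
        Node node = root;
        for (int level = 0; node != null && node.element == null; level++) {
            if (level >= digits.length()) return false;
            node = node.child[digits.charAt(level) - '0'];
        }
        return node != null && node.element.equals(key);
    }

    /** Number of branch nodes, including the root. */
    int branchCount() { return countBranches(root); }

    private int countBranches(Node node) {
        if (node == null || node.child == null) return 0;
        int total = 1;
        for (Node c : node.child) total += countBranches(c);
        return total;
    }
}
```

Inserting the five elements of the example produces six branch nodes and five element nodes, the same counts as in figure 4.3. Each branch node reserves ten slots even when only one or two are used, which is the space overhead discussed in Section 4.3.1.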
A basic M-way trie whose alphabet is the set of all English letters and digits is a
trie of 36 elements.
Figure 4.4: Trie of 36 elements English alphabets and numbers
4.3.1 Drawbacks of M-Way Trie
The basic M-way trie structure is space inefficient: storing a single word
sometimes requires creating a full 36-element array node for it. This data
structure still makes better use of space than tree or hashing index methods, but
more compression is needed. A search engine has to store large amounts of data, so
we use compressed tries, which are discussed below.
4.4. Indexer
The indexer is implemented with the compressed trie [15, 17] data structure. It
divides the keyword database into two levels. The upper level defines the indexing
structure and the lower level consists of the database that holds the records,
which are arranged in lexicographic order. The upper level starts with a level A
node, which points to the next level, termed level B, a linked list of nodes.
The lower level, i.e. the leaf level, is a doubly linked list of data blocks.
[Figure: an A level node with slots null, 0-9, A-E, F-K, L-Q, R-T and U-Z; a slot points to a linked list of B level nodes, which lead to other B level nodes or to the doubly linked list of data blocks]
Figure 4.5: Compressed Tries
A level node
The A level node is an array whose elements cover the 26 English letters and the
digits 0 to 9, divided into 6 ranges: ‘0-9’, ‘a-e’, ‘f-k’, ‘l-q’, ‘r-t’ and ‘u-z’,
as shown in figure 4.6. The array elements are pointers to level B nodes.
Range:    null      ‘0-9’     ‘a-e’     ‘f-k’     ‘l-q’     ‘r-t’     ‘u-z’
Pointer:  Bnode[0]  Bnode[1]  Bnode[2]  Bnode[3]  Bnode[4]  Bnode[5]  Bnode[6]
Figure 4.6: Anode Structure
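Mapping a character to its A level slot can be sketched as a chain of range checks. The slot numbers follow the Bnode[0] to Bnode[6] layout of figure 4.6. Treating upper and lower case alike and sending every other character to slot 0 are assumptions made for this illustration; the class name ANodeRangeSketch is hypothetical.

```java
// Hypothetical sketch: choose the A level slot for a character.
class ANodeRangeSketch {
    /** Returns the slot 1..6 for digits and letters, or 0 (the null slot) for any other character. */
    static int slotFor(char c) {
        char ch = Character.toLowerCase(c);   // assumption: keys are compared case-insensitively
        if (ch >= '0' && ch <= '9') return 1; // '0-9'
        if (ch >= 'a' && ch <= 'e') return 2; // 'a-e'
        if (ch >= 'f' && ch <= 'k') return 3; // 'f-k'
        if (ch >= 'l' && ch <= 'q') return 4; // 'l-q'
        if (ch >= 'r' && ch <= 't') return 5; // 'r-t'
        if (ch >= 'u' && ch <= 'z') return 6; // 'u-z'
        return 0;                             // assumption: everything else goes to the null slot
    }
}
```

Grouping 36 characters into 6 ranges means an A level node needs 7 pointers instead of 36; the characters within a range are then told apart by the B level nodes.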
B level node
The B level node distinguishes the path for one attribute (i.e. character) from
another within the same range of the A level node. The B level consists of an
ordered linked list of nodes arranged in lexicographic order; these nodes are
created on the fly and are pointed to by the level A nodes. As shown in figure 4.7,
a Bnode contains the following fields: key, nextbnode, anode and block.
key | nextbnode | anode | block
Figure 4.7: Bnode Structure
Data blocks
Data blocks are contiguous sets of blocks that hold the indexing keys and the
records, arranged in lexicographic order with respect to the indexing key. Each
data block contains a header that gives, among other things, the total number of B
level nodes pointing to it. A data block contains the following fields: size,
fileoffset, lowKey, highKey, prevBlock, nextBlock, numofRecs and numofBnodes. The
fileoffset tells where the record is located in the file, lowKey and highKey
contain the smallest and largest indexing keys of the block, and prevBlock and
nextBlock contain the addresses of the previous block and the next block.
size | numOfRecs | numOfBnodes | fileOffset | lowKey | highKey | nextBlock | prevBlock
Figure 4.8: DataBlock Structure
Size          Size of the data block.
NumofRecs     Number of records present in the data block.
NumofBnodes   Number of Bnodes pointing to the data block.
Fileoffset    FileSize + 1024 - FileSize % 1024
Lowkey        Smallest indexing key of the block.
Nextblock     Address of the next immediate block.
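The Fileoffset expression rounds the current file size up to the next 1024-byte boundary. The arithmetic can be checked with a short sketch; the class and method names are hypothetical, and the interpretation of 1024 as a block alignment unit is an assumption drawn from the formula itself.

```java
// Hypothetical sketch of the Fileoffset computation from the table above.
class BlockOffsetSketch {
    static final long BLOCK_ALIGN = 1024;   // alignment unit taken from the formula

    /** FileSize + 1024 - FileSize % 1024: the next multiple of 1024 strictly above fileSize. */
    static long nextBlockOffset(long fileSize) {
        return fileSize + BLOCK_ALIGN - fileSize % BLOCK_ALIGN;
    }
}
```

For a file of 1500 bytes the expression gives 1500 + 1024 - 476 = 2048. Note that a file whose size is already a multiple of 1024 still advances by a full 1024 bytes, since the remainder term is then zero.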
Record
The records are stored in the indexing file index.dat. A record contains all the
information related to a key. The data block contains the offset addresses of these
records in the file. The structure of a record, with all the information it
contains, is shown in figure 4.9.
Key | doc | dtd | dtdname | Frequency | elts | Size
Figure 4.9: Record Structure
Key         Name of the word being stored.
Doc         URL of the document in which the word is present.
Dtdname     Name of the document type definition document.
Dtd         URL of the document type definition.
Elts        Linked list of tags.
Frequency   Number of times the word occurs in the document.
Size        Size of the record.
4.4.1 Searching Algorithm
Steps:
1. To search for records with indexing key ‘X’, the search starts from the
root, i.e. the level A node.
2. The search proceeds by mapping the first character of the indexing key to a
field of the level A node, i.e. to one of its character ranges, and taking the
address of the B node linked list attached to that field.
3. The linked list of level B nodes is traversed to find the node whose key
matches the character. If such a node is found, the search continues by
following its pointer to the next level; otherwise it returns failure.
4. The next level associated with the level B node is checked, as shown in figure
4.10. If the pointer refers to a valid level A node, the search continues by
repeating the above steps with the next character of the indexing key.
5. If a data block is reached, the key is searched for inside the data block.
The matching records are retrieved into a record list, which is returned as
the search result.
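The five steps can be sketched in memory as follows. This is a simplified illustration, not the implementation of the indexer: the class names, the addBNode helper and the use of a sorted map of URL lists in place of an on-disk data block are all assumptions, and slot 0 is used here for characters outside the six ranges.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the A level / B level search.
class TrieSearchSketch {
    static final class DataBlock {
        final Map<String, List<String>> records = new TreeMap<>(); // indexing key -> records, sorted
    }

    static final class BNode {
        final char key;      // the character this node stands for
        BNode next;          // next B node in the same range
        ANode anode;         // next level, if the path continues
        DataBlock block;     // data block, if the path ends here
        BNode(char key) { this.key = key; }
    }

    static final class ANode {
        final BNode[] heads = new BNode[7];  // slot 0 = null slot, slots 1..6 = character ranges

        /** Adds a B node for the character, keeping the range's linked list in sorted order. */
        BNode addBNode(char key) {
            int slot = slotFor(key);
            BNode fresh = new BNode(key);
            BNode prev = null;
            BNode cur = heads[slot];
            while (cur != null && cur.key < key) { prev = cur; cur = cur.next; }
            fresh.next = cur;
            if (prev == null) heads[slot] = fresh; else prev.next = fresh;
            return fresh;
        }
    }

    static int slotFor(char c) {
        if (c >= '0' && c <= '9') return 1;
        if (c >= 'a' && c <= 'e') return 2;
        if (c >= 'f' && c <= 'k') return 3;
        if (c >= 'l' && c <= 'q') return 4;
        if (c >= 'r' && c <= 't') return 5;
        if (c >= 'u' && c <= 'z') return 6;
        return 0;
    }

    /** Returns the records stored under the key, or an empty list when the search fails. */
    static List<String> search(ANode root, String key) {
        ANode a = root;                                        // step 1: start at the root A node
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            BNode b = a.heads[slotFor(c)];                     // step 2: pick the character range
            while (b != null && b.key != c) b = b.next;        // step 3: scan the B node list
            if (b == null) return Collections.emptyList();     //         no match: failure
            if (b.anode != null) { a = b.anode; continue; }    // step 4: descend with next character
            if (b.block == null) return Collections.emptyList();
            List<String> hit = b.block.records.get(key);       // step 5: look inside the data block
            return hit == null ? Collections.<String>emptyList() : hit;
        }
        return Collections.emptyList();
    }
}
```

The number of B node lists visited is bounded by the length of the key, not by the number of keys indexed, which is the property listed among the advantages in Section 4.5.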
[Figure: search path through an A level node (slots NULL, 0-9, a-e, f-k, l-q, r-t, u-z), a linked list of B level nodes (b, c, e; r, t), further A level nodes and a data block]
Figure 4.10: Search Compressed Tries
4.4.2 The Insertion Algorithm
To insert a record with indexing key ‘X’, a search is first carried out according
to the search algorithm defined above. On completion, the search returns the last
level B node and level A node along with the current data block. The last level B
node is then checked to find out whether it contains an address or not. This gives
the two cases, Case 1 and Case 2, discussed below.
Case 1
If the last level B node returned by the search process contains an address, it is
not certain whether the current data block can hold the new record. This gives rise
to two further cases: Case 1.1 deals with data block splitting and Case 1.2 deals
with the creation of a new two-level A-B node structure.
Case 1.1
If the size of the data block is not sufficient to accommodate the new record, the
data block has to be split. The records are read from the data block into a record
list and the new record is inserted into this record list. Records from the record
list are written to the older data block until it becomes full. The rest of the
records are written to a newer data block, which is inserted to the right of the
older data block.
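The splitting step of Case 1.1 can be sketched as follows. For simplicity the capacity of a block is counted in records rather than bytes, and only keys are stored; both are assumptions made for this illustration, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of data block splitting (Case 1.1).
class BlockSplitSketch {
    static final class DataBlock {
        final int capacity;                           // assumed: maximum number of records
        final List<String> keys = new ArrayList<>();  // kept in lexicographic order
        DataBlock prevBlock;
        DataBlock nextBlock;
        DataBlock(int capacity) { this.capacity = capacity; }
    }

    /** Inserts the key in order; returns the newly created right-hand block, or null if none. */
    static DataBlock insert(DataBlock block, String key) {
        // Read the records into a record list and place the new record at its sorted position.
        List<String> recordList = new ArrayList<>(block.keys);
        int pos = Collections.binarySearch(recordList, key);
        if (pos < 0) pos = -pos - 1;
        recordList.add(pos, key);

        block.keys.clear();
        if (recordList.size() <= block.capacity) {    // enough room: no split needed
            block.keys.addAll(recordList);
            return null;
        }
        // Fill the older block, then write the remaining records to a newer block on its right.
        block.keys.addAll(recordList.subList(0, block.capacity));
        DataBlock newer = new DataBlock(block.capacity);
        newer.keys.addAll(recordList.subList(block.capacity, recordList.size()));
        newer.prevBlock = block;
        newer.nextBlock = block.nextBlock;
        if (block.nextBlock != null) block.nextBlock.prevBlock = newer;
        block.nextBlock = newer;
        return newer;
    }
}
```

Because the record list is sorted before it is written back, every key in the older block precedes every key in the newer block, so the doubly linked list of data blocks stays in lexicographic order after the split.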
Case 1.2
If all the records in the data block have the same prefix as the search path up to
and including the last level A node, and there is not sufficient space to
accommodate the new record in the data block, then a new two-level A-B node
structure is created, and the level B nodes at this level are split on the
attribute just after the prefix.
Case 2
If the last level B node returned by the search process is null, the following
steps are followed.
Steps:
1. Because the last level B node is null, a new B level node has to be created.
The algorithm locates the neighboring B level nodes and determines which data
blocks these nodes point to.
2. If they refer to the same data block, the record is simply put in it and a new
B level node is created that points to this data block. The record is inserted
at the correct position in the data block according to lexicographic order. As
shown in figure 4.11, ‘b’ and ‘e’ are the Bnodes adjacent to ‘c’ and point to
the same data block.
3. If they do not refer to the same block, the record is inserted into the data
block pointed to by the left neighboring B level node.
4. If the data block pointed to by the left neighboring B level node cannot
accommodate the new record without splitting, the algorithm moves to the right
neighboring B level node. If the data block of the right neighboring B level
node has sufficient space for the new record, the record is simply placed in
it; otherwise that block has to be split.
[Figure: an A level node (slots NULL, 0-9, a-e, f-k, l-q, r-t, u-z) with a linked list of B level nodes b, c, e and r, t; the B level nodes point to data blocks and to a further A level node]
Figure 4.11: Case 2 Insertion Compressed Tries
Case 2.1
This case arises if the neighboring level B nodes do not refer to data blocks but
instead refer to level A nodes, as shown in figure 4.12.
1. To insert the record, a new data block has to be created.
2. The process begins from the right neighboring B level node, from which it
traverses the level A node to locate the leftmost data block. Once it finds
that block, it inserts the new data block to its left in order to maintain the
lexicographic order.
3. If no leftmost data block is present, the process begins from the left
neighboring B level node, from which it traverses the level A node to locate
the rightmost data block. Once it finds that block, it inserts the new data
block to its right.
[Figure: an A level node (slots NULL, 0-9, a-e, f-k, l-q, r-t, u-z) with a linked list of B level nodes b, c, e, r, t; the neighboring B level nodes point to further A level nodes, which lead to other B level nodes]
Figure 4.12: Case 2.1 Insertion Compressed Tries
4.5 Advantages of Compressed Tries
• To retrieve a data record, the number of comparisons does not depend on the
number of keys indexed. Instead it depends on the length of the key.
• An insertion into a trie is localized and does not propagate to higher levels of
the indexing structure. Insertion only causes expansion of the trie structure [26].
• Unsuccessful searches are determined quickly and key lookup is fast. Looking
up a key of length m takes O(m) time in the worst case.
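The O(m) bound and the quick detection of unsuccessful searches can be illustrated with a minimal sketch. It assumes a plain, uncompressed trie with one character per level, not the two-level A/B structure of this thesis; the class name SimpleTrie and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plain trie: one character per level, no compression.
final class SimpleTrie {
    private final Map<Character, SimpleTrie> children = new HashMap<Character, SimpleTrie>();
    private boolean terminal; // true if a key ends at this node

    void insert(String key) {
        SimpleTrie node = this;
        for (char c : key.toCharArray()) {
            SimpleTrie next = node.children.get(c);
            if (next == null) {
                next = new SimpleTrie();
                node.children.put(c, next);
            }
            node = next;
        }
        node.terminal = true;
    }

    // Number of characters examined before the search stops.
    // It never exceeds key.length(), whatever the number of stored keys.
    int stepsExamined(String key) {
        SimpleTrie node = this;
        int steps = 0;
        for (char c : key.toCharArray()) {
            steps++;
            node = node.children.get(c);
            if (node == null) {
                return steps; // unsuccessful search detected early
            }
        }
        return steps;
    }

    boolean contains(String key) {
        SimpleTrie node = this;
        for (char c : key.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }
}
```

With the keys "allahabad" and "bank" stored, a search for "xml" stops after examining a single character, while a search for "allahabad" examines nine.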
Chapter 5 High Level Design
This chapter describes the system architecture and high level design of the search
engine. The high level design shows the whole process from the user entering the
search string to the display of results. It also represents the entire process from
a user registering his XML website to the indexing of that website by the search
engine.
5.1. System Architecture
The general architecture of the search engine is shown below. It shows the
relationship between the parts of the system.
[Figure labels: www, Gatherer, Xml Parser, Dom tree, Indexer, Indices, Query, Query processor, Search processor, Ranker, Displayer, Results.]
Figure 5.1: System Architecture
At the top level the search engine can be divided into three parts:
1. Gatherer
2. Indexer
3. Search processor
The high level design is shown in figure 5.2.
5.2. Gatherer (Module 1)
Web search engines work by storing information about a large number of web pages.
These pages are retrieved by a web crawler. Web crawlers (spiders, robots) are
computer programs that roam the Web and store links and information about each
page they visit. Such software generally starts with a list of the best or most popular
websites, follows the hyperlinks on these pages and adds the pages found to the
database. It is mainly used to create a copy of all the visited pages for later
processing (indexing and retrieval) by a search engine.
However, there is no web crawler for collecting XML websites. To collect web pages,
the Global search engine provides a GUI to the user. The user registers his website by
entering the URL of an XML website. The list of URLs is stored in the file ‘urls.add’.
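The registration step can be sketched as follows. This is a minimal illustration, assuming that ‘urls.add’ holds one URL per line and that only http/https addresses are accepted; the class UrlRegistry and its methods are hypothetical and are not the thesis code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: appends registered URLs to a file such as 'urls.add'.
final class UrlRegistry {
    // Assumption: a URL is acceptable if it parses and uses http or https.
    static boolean isValid(String url) {
        try {
            URI uri = URI.create(url);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                    && ("http".equals(scheme) || "https".equals(scheme));
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Appends the URL as one line; returns false if the URL is rejected.
    static boolean register(Path urlFile, String url) {
        if (!isValid(url)) {
            return false;
        }
        try {
            Files.write(urlFile, Collections.singletonList(url), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back the list of registered URLs, one per line.
    static List<String> registered(Path urlFile) {
        try {
            return Files.readAllLines(urlFile, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The indexer could later read the same file line by line to obtain the documents to be parsed.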
5.3. Indexer (Module 2)
The purpose of indexing is to process the documents to be indexed and to extract
appropriate information. This information is stored in a data structure that allows fast
searching of the text. The indexing process of the search engine is done in two phases.
The first phase starts with the gatherer module, from which it takes the list of XML
document URLs. It then parses these XML websites and collects the information about
each word. Stop word removal and stemming are also done in the first phase. This
information is passed to the second phase, i.e. the indexing routine. The purpose of the
index routine is to write the record to the indexing file, i.e. to build the indexing structure.
[Figure labels: URLs, Gatherer; Indexer (XML parser, Stop Word Removal, Stemming Find Root Word, Indexing Routine); DISK, Indexing structure; SEARCH PROCESSOR (Query processor, Stop word removal, Stemming Find Root Word, Searching Routine, Sorting in Rank); Query By User, Keywords, Search Results, Results To The User; Browser: XML and XSL transformed into HTML prior to rendering.]
Figure 5.2: The High Level Design of Search Engine
5.3.1 Parsing of Xml document
The Xerces-2_7_1 parser [9] is used for parsing documents. The parser loads the
document into the computer's memory. Once the document is loaded, the document
object model (DOM) tree of the XML document is created. The DOM supports navigation
in any direction (e.g., parent and previous sibling). From this DOM tree the following
information is extracted for each word:
• Key
• List of tags (semantic information about the keyword)
• DTDname
• URL of the XML document
• URL of the DTD
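The extraction of a key together with its list of tags can be sketched as a recursive DOM walk. The sketch below assumes the standard JAXP API (javax.xml.parsers) shipped with the JDK rather than the Xerces-2_7_1 distribution named above; the class DomWordExtractor and the "tag->tag->word" output format are hypothetical illustrations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: collects every word with the chain of enclosing tags.
final class DomWordExtractor {
    static List<String> extract(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<String>();
            walk(doc.getDocumentElement(), "", out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Elements extend the tag path; text nodes emit one entry per word.
    private static void walk(Node node, String path, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String here = path.isEmpty() ? node.getNodeName() : path + "->" + node.getNodeName();
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                walk(children.item(i), here, out);
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            for (String word : node.getNodeValue().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(path + "->" + word);
                }
            }
        }
    }
}
```

For the document `<mnnit><institute>Allahabad</institute></mnnit>` this yields the single entry mnnit->institute->Allahabad, which corresponds to the semantic information shown in the result pages of chapter 7.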
5.3.2 Stop Word Removal
This step saves system resources by eliminating from further processing, as well as
from potential matching, those words that have little value in finding useful
documents. A stop word list typically consists of word classes known to convey little
substantive meaning, such as conjunctions (and, but), articles (a, the), prepositions
(in, over), interjections (oh), pronouns (he, it), and forms of the verb "to be"
(is, are).
These words occur in almost every document of the language, and therefore do not
help in distinguishing between documents that are about different topics. For this
reason, these words are removed and are not indexed.
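A minimal sketch of this filtering step is given below. The stop list contains only the example words quoted above and the class name StopWords is hypothetical; the list actually used by the search engine is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stop word filter using the example words from the text.
final class StopWords {
    static final Set<String> STOP = new HashSet<String>(Arrays.asList(
            "and", "but", "a", "the", "in", "over", "oh", "he", "it", "is", "are"));

    // Returns the words that are not stop words, in their original order.
    static List<String> remove(List<String> words) {
        List<String> kept = new ArrayList<String>();
        for (String word : words) {
            if (!STOP.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

Applied to the words The, bank, is, in, Allahabad, only bank and Allahabad remain to be indexed.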
5.3.3 Word Stemming
Linguistic analysis is used to get the root form of a word. Search engines use
stemming to compare the root form of the search terms to the documents in their
database. Stemming removes word suffixes, recursively in layer after layer of
processing. For example, if the user enters "viewing" as the query, the search engine
reduces the word to its root ("view") and returns all documents containing a word
with that root, such as view, views, viewing and viewed.
The process has two goals. In terms of effectiveness, stemming improves recall by
reducing all forms of a word to a base or stemmed form. In terms of efficiency,
stemming reduces the number of unique words in the index, which in turn reduces the
storage space required for the index and speeds up the search process. Of course
stemming does have a downside.
It may negatively affect precision, in that all forms of a stem will match when in fact a
successful query for the user would have come from matching only the word form
actually used in the query. There are several stemming algorithms, which differ in
accuracy and performance, e.g. the Paice/Husk, Porter, Lovins, Dawson and Krovetz
stemming algorithms. We use the Porter stemming algorithm.
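The idea of removing suffixes layer after layer can be illustrated with a deliberately simplified suffix stripper. This is not the Porter algorithm, which applies a much larger set of rules and conditions; the suffix list, the minimum stem length of three characters and the class name SuffixStripper are assumptions made for illustration only.

```java
import java.util.Locale;

// Simplified illustration of repeated suffix stripping (NOT the Porter algorithm).
final class SuffixStripper {
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    static String stem(String word) {
        String current = word.toLowerCase(Locale.ROOT);
        boolean changed = true;
        while (changed) { // strip one suffix per pass until nothing matches
            changed = false;
            for (String suffix : SUFFIXES) {
                // Do not strip the final "s" of words ending in "ss" (e.g. "class").
                boolean doubleS = suffix.equals("s") && current.endsWith("ss");
                if (!doubleS && current.endsWith(suffix)
                        && current.length() - suffix.length() >= 3) {
                    current = current.substring(0, current.length() - suffix.length());
                    changed = true;
                    break;
                }
            }
        }
        return current;
    }
}
```

Here "viewing", "viewed" and "views" all reduce to "view", so a query and a document that use different forms of the word meet at the same index key.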
5.3.4 Index Routine
The index routine comes in the second phase of the indexer, to build the indexing
structure. A record is prepared using the information extracted from the DOM tree. The
record contains the key, the URL of the document, the DTD name, the URL of the DTD,
and a linked list of the tags which are associated with this key. To insert the record,
the routine starts searching for the block using the compressed tries algorithm.
After reaching that particular block, it checks whether the current block can contain
this record or not. If the block can contain this record, it is simply inserted. Otherwise
a new block is created and the record is inserted in it. To write the record in the block,
the block first needs to be read. The records are then sorted and the new record is
added. At the end the whole block is written to the index file. More detail of the index
routine is given in section 5.5, Indexing process.
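The check-then-insert behaviour can be sketched as follows. The sketch keeps the block in memory and stores only keys; the capacity of three records, the class BlockSketch and its methods are hypothetical, and reading and writing of ‘index.dat’ is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory block holding keys in lexicographic order.
final class BlockSketch {
    static final int CAPACITY = 3; // assumed block capacity, in records
    final List<String> keys = new ArrayList<String>();

    boolean hasRoom() {
        return keys.size() < CAPACITY;
    }

    // Inserts the key at the position that keeps the list sorted.
    void add(String key) {
        int pos = Collections.binarySearch(keys, key);
        if (pos < 0) {
            pos = -pos - 1; // binarySearch encodes the insertion point
        }
        keys.add(pos, key);
    }

    // Inserts into the target block if it has room, otherwise into a new block.
    // Returns the block that received the key.
    static BlockSketch insert(BlockSketch target, String key) {
        BlockSketch destination = target.hasRoom() ? target : new BlockSketch();
        destination.add(key);
        return destination;
    }
}
```

In the real index the new block would additionally have to be linked to its neighbours and to a B level node, as described for the insertion cases in chapter 4.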
5.4. Search Processor (Module 3)
The search processor is the third part of a search engine. This is the program that sifts
through the millions of pages recorded in the index to find the matches to a search and
rank them in order of what it believes is most relevant. The search processor is
implemented as servlet programs. There are two servlet programs, one for simple
search and another for refine search. The interface of these programs is an HTML form.
When the form is submitted, the search processor takes the values from the form and
performs the actual search in the compressed trie indexing structure. The searching
process is divided into the following modules, which execute in the given sequence.
5.4.1 Query processor
The search processor gets a list of words from the HTML form that invoked it. The query
processor takes this list and performs a syntax check on it. If there are syntax errors, it
displays them. The query processor carries out its task in two steps: tokenizing and
checking the Boolean expression.
Tokenizing
As soon as a user inputs a query, the first task of the search processor is to extract the
keywords from the user's input. The search processor uses a string tokenizer to tokenize
the query stream, i.e., to break it down into understandable segments. A token is an
alphanumeric string that occurs between white space and/or punctuation.
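A minimal sketch of this step with java.util.StringTokenizer is shown below. The set of delimiter characters and the lower-casing of the query are assumptions; the class name QueryTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical tokenizer: splits a query on white space and punctuation.
final class QueryTokenizer {
    // Assumed delimiter set; '*' is not a delimiter so a trailing wildcard survives.
    private static final String DELIMITERS = " \t\n\r\f,.;:!?()\"'";

    static List<String> tokenize(String query) {
        StringTokenizer st = new StringTokenizer(query.toLowerCase(Locale.ROOT), DELIMITERS);
        List<String> tokens = new ArrayList<String>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}
```

The query "XML, search engine!" is broken into the tokens xml, search and engine.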
Boolean expression
The search processor checks for the Boolean expression specified with the user query, i.e.
(AND, OR). After searching for each keyword, the results are combined according to the
Boolean expression given in the query and displayed to the user. It also checks that any
‘*’ present is at the end of the keyword, where it requests prefix matching.
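The wildcard check and the resulting prefix match can be sketched as below. The rule that a keyword must have at least one character before the ‘*’ is an assumption, and the class WildcardCheck is hypothetical.

```java
// Hypothetical sketch of the '*' syntax check and of prefix matching.
final class WildcardCheck {
    // Valid if there is no '*', or exactly one '*' as the last character
    // with at least one character before it (assumed rule).
    static boolean isValid(String keyword) {
        int star = keyword.indexOf('*');
        return star < 0 || (star > 0 && star == keyword.length() - 1);
    }

    // A keyword ending in '*' matches every indexed key with that prefix;
    // any other keyword must match the indexed key exactly.
    static boolean matches(String keyword, String indexedKey) {
        if (keyword.endsWith("*")) {
            String prefix = keyword.substring(0, keyword.length() - 1);
            return indexedKey.startsWith(prefix);
        }
        return indexedKey.equals(keyword);
    }
}
```

Under these assumptions "alla*" is accepted and matches the key "allahabad", whereas "al*la" is reported as a syntax error.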
5.4.2 Stop Word Removal and Stemming
The search processor removes the stop words from the query and stems the remaining
keywords before searching, because this speeds up the search. Both steps are the same as
the processes described above in the Indexer section.
5.4.3 Search routine
The search processor performs the actual search using the searching algorithm of the
compressed trie data structure. The purpose of the search routine is to fetch the records
from the indexing file and display them to the user. Details of the search routine are
given below in section 5.6, Search Process.
5.4.4 Results Ranking and Display
After retrieving the list of records from the database, the search processor sorts the
records on the basis of the frequency of the search string. The document containing the
largest number of occurrences of the search string is ranked the highest. The list of tags
obtained after a simple search is also ordered. This ordering is done on the basis of tag
closeness: the nearer the tag is to the search string in the document, the higher its rank.
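The frequency-based ranking can be sketched as a descending sort. The class RankSketch and its nested Hit type are hypothetical stand-ins for the record list of the search engine, and the ordering of tags by closeness is not covered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rank documents by how often the search string occurs.
final class RankSketch {
    static final class Hit {
        final String url;
        final int frequency; // occurrences of the search string in the document

        Hit(String url, int frequency) {
            this.url = url;
            this.frequency = frequency;
        }
    }

    // Returns a new list ordered from highest to lowest frequency.
    static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        Collections.sort(sorted, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Integer.compare(b.frequency, a.frequency);
            }
        });
        return sorted;
    }
}
```

Because Collections.sort is stable, documents with equal frequency keep the order in which they were retrieved.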
Display of simple search Result
After the lists of records have been sorted according to ranking, a join operation is
performed depending upon whether the user wants “or-ing” or “and-ing”. For “or-ing” the
lists are simply concatenated. “And-ing” is done by taking only those records that are
common to the various lists of records. Finally the record list obtained after the join
operation is given as output.
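The two join operations can be sketched on lists of document URLs. The class JoinSketch is hypothetical and uses plain strings in place of the records of the search engine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the join step on two result lists.
final class JoinSketch {
    // "or-ing": the lists are simply concatenated.
    static List<String> or(List<String> first, List<String> second) {
        List<String> joined = new ArrayList<String>(first);
        joined.addAll(second);
        return joined;
    }

    // "and-ing": only the entries common to both lists are kept.
    static List<String> and(List<String> first, List<String> second) {
        List<String> joined = new ArrayList<String>(first);
        joined.retainAll(second);
        return joined;
    }
}
```

Note that plain concatenation repeats a document that matches both keywords; a display step that should show each document once would need to remove such duplicates.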
Display of refine search Result
To refine the search result, the user selects a tag or DTD from the simple search result.
Refining is done simply by matching the selected tag or DTD name against the list of
records returned by the simple search. Finally all the records which match the tag or DTD
selected by the user are joined in a record list and displayed to the user.
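The refinement can be sketched as a filter over the records of the simple search. The class RefineSketch, its nested Rec type and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep only records whose DTD name or tag list
// matches the item selected by the user.
final class RefineSketch {
    static final class Rec {
        final String url;
        final String dtdName;
        final List<String> tags;

        Rec(String url, String dtdName, List<String> tags) {
            this.url = url;
            this.dtdName = dtdName;
            this.tags = tags;
        }
    }

    static List<Rec> refine(List<Rec> simpleResult, String selected) {
        List<Rec> refined = new ArrayList<Rec>();
        for (Rec rec : simpleResult) {
            if (rec.dtdName.equals(selected) || rec.tags.contains(selected)) {
                refined.add(rec);
            }
        }
        return refined;
    }
}
```

No second lookup in the index is needed in this sketch, because refinement only narrows the record list that the simple search has already returned.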
5.5. Indexing process
The purpose of the index routine is to write the record to the indexing file, i.e. index.dat.
A brief explanation of the indexing routine and a description of its functions follow:
[Figure labels: Urls.add, Crawling, Checkurl, ParDoc, abc.xml, Stop word & Stemming, Makeindex, Index.insert (), Create new record, BlockSearch (), InsetData (data, block, rec), WriteBlock (rec, block), ReadBlock, readData, addRecord (), Writedata (record, offset), Index.dat.]
Figure 5.3: Indexing Process
The indexing routine starts from the InsertIndex function. The Makeindex function sends
the information about keywords, i.e. (key, list of tags, DTD name, URL of the XML
document, and URL of the DTD), to the Insert.index function. The Insert.index function
starts searching for the block by taking the first character of the key, and creates the
new record for that key. The Insert.data function checks whether the block can contain the
record or not. If the block is already full, a new block is created and the record is
inserted in it.
To write the block to the indexing file, the ReadBlock function first reads the block using
the readData function. After the records of the block have been read into a recordList,
the new record is inserted into the recordList in lexicographic order. At the end the
WriteData function writes the whole block to the indexing file ‘index.dat’.
5.6. Search Process
The purpose of the search routine is to fetch the records from the indexing file and display
them to the user. The search routine and its various functions are shown in figure 5.4.
[Figure labels: init, doget, Search_string, checkstr, Stop word & Stemming, Index.search string () (Root, str.toLowercaser ()), St = StringTokenizer(s), Block search (root, st), ReadBlock (), ReadData (), Index.dat, Fetching records, RecordList r1, Sorting of records in recordList, Displaying records, Display Records on Browser, Index.dump, Blocks.dump.]
Figure 5.4: Search process
The searching process starts from the Index.SearchString function. The search processor
uses the string tokenizer to tokenize the query stream and extract the keywords from the
user's input. After that it eliminates the stop words and stems the keywords.
The search routine starts from the BlockSearch function. It finds the address of the block
where the key is stored, using the searching algorithm. This address is then passed to the
ReadBlock function, which calls the ReadData function to read the records from the
indexing file into a recordList. The recordList is then sorted according to rank. Finally,
the result is displayed in sorted order in the browser.
5.7. Transformation Process of XML/XSL into Html
The transformation process can take place inside an XML-enabled browser. XSL is a
language specifically designed for transforming the structure of an XML document.
The transformation processor takes as input an XML document and the corresponding
XSL, which contains the transformation rules. The transformation proceeds sequentially
according to the instructions contained in the rules, as shown in figure 5.5.
Figure 5.5: Represents a fragment of the transformation [20]
Figure 5.5 shows an XSL style sheet for transforming the XML data into ordinary
HTML. The style sheet specifies transformation rules. A transformation rule contains a
pattern and an action. The XML document’s natural parse tree structure is traversed to
find nodes that match the pattern part of a rule. At matching nodes, the action part is
used to derive a transformed sub-tree, which is attached at the current node. This
process continues recursively until no patterns match.
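Although the thesis performs this transformation inside the browser, the same pattern/action mechanism can be tried outside a browser with the standard Java XSLT API (javax.xml.transform). The sketch below is only an illustration under that assumption; the class XslToHtml and the sample documents mentioned afterwards are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: apply an XSL style sheet to an XML document.
final class XslToHtml {
    static String transform(String xml, String xsl) {
        try {
            // The style sheet supplies the transformation rules (pattern + action).
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter html = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(html));
            return html.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }
}
```

Given a rule whose pattern is `/institute` and whose action emits an `h1` element containing the value of the `name` child, the document `<institute><name>iiita</name></institute>` is turned into HTML containing `<h1>iiita</h1>`.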
Chapter 6 Detail level Design
This chapter presents the detailed design view of the system, through class diagrams
for structural modeling and sequence diagrams for behavioral modeling. This design was
created using object-oriented principles and techniques. Wherever diagrams were
needed, UML was used.
6.1. Package Diagram
A package diagram gives a way to organize large models and enforce a cleaner
architecture. Packages are groups of related classes. The core software has more than
two thousand lines of code, with 5 packages and 12 classes developed in the Java
programming language, and is used by the web server to provide searching and
indexing.
Figure 6.1: Package Diagram
As shown in figure 6.1, the Global search engine contains the index engine, search processor,
web pages, parser and stemming packages to implement the whole functionality of the
search engine. The classes present in each package are discussed below in detail.
6.2. Class Diagram
Class diagrams are used for structural modeling of the system. Classes are depicted as
boxes with three sections: the top one indicates the name of the class, the middle one
lists the attributes of the class, and the third one lists the methods.
Figure 6.2: class diagram
This class diagram represents the interaction between the core classes that are used in the
implementation of the search engine. Details of each class are given in the sub-sections below.
6.2.1 Indexing Package Classes
6.2.1.1 Class Anode
Class Anode implements indexing level A of the compressed tries. Level A is an array
of elements which correspond to character ranges. These array elements contain
(address) pointers to the linked list of B level nodes. It contains various functions,
like getBnode () for getting the address of a Bnode and insertBnode () for inserting a
Bnode in the trie.
Figure 6.3: class Anode
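The mapping from a character to an element of the level A array can be sketched as follows. The seven ranges are taken from figures 4.11 and 4.12; the use of the NULL slot for any other character, and the class AnodeSketch itself, are assumptions and do not reproduce the actual Anode class.

```java
// Hypothetical sketch of a level A node: one slot per character range.
final class AnodeSketch {
    static final String[] RANGES = {"NULL", "0-9", "a-e", "f-k", "l-q", "r-t", "u-z"};

    // Each slot would point to the head of a linked list of B level nodes.
    final Object[] slots = new Object[RANGES.length];

    // Returns the index of the slot responsible for the given character.
    // Assumption: characters outside 0-9 and a-z fall into the NULL slot.
    static int rangeIndex(char c) {
        if (c >= '0' && c <= '9') return 1;
        if (c >= 'a' && c <= 'e') return 2;
        if (c >= 'f' && c <= 'k') return 3;
        if (c >= 'l' && c <= 'q') return 4;
        if (c >= 'r' && c <= 't') return 5;
        if (c >= 'u' && c <= 'z') return 6;
        return 0;
    }
}
```

Selecting the slot takes a fixed number of comparisons per character, which is consistent with the lookup cost depending on the key length rather than on the number of keys.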
6.2.1.2 Class Bnode
Class Bnode implements indexing level B of the compressed tries. Class Bnode has
various attributes such as key, nextbnode, anode and block, which contain respectively
the key, the address of the next Bnode in the linked list, the address of the Anode, and
the address of the block to which it is pointing. It contains various functions for
getting and setting the values of the above attributes.
Figure 6.4: class Bnode
6.2.1.3 Class Block
The Block class has various attributes such as size, fileoffset, lowKey, highkey,
prevBlock, nextBlock, numofRecs, and numofBnodes to contain the header information
of a data block. These attributes contain the size of the block, the file offset which tells
where the records are located in the file, the smallest and largest indexing keys of the
block, the addresses of the previous block and next block, the number of records present
in the block, and the number of Bnodes pointing to it. The class contains various functions
for getting and setting the values of the above attributes.
Figure 6.5: Class Block
6.2.1.4 Class RecordList
The RecordList class contains a list of records. It has various functions, such as the
addRecord () function for adding a record to the recordList in such a way that
lexicographic order is maintained. The function andList() finds the common records
between two record lists.
Figure 6.6: Class RecordList
6.2.1.5. Class Record
The Record class contains the basic attributes key, doc, dtd, dtdname, elts, frequency and
size to store the basic information of a record. It contains functions such as
setKey(), setDtd(), setDtdName(), setDoc(), setFreq() and setElts() to set the values of the
record in the indexing file. Matching functions with a get prefix retrieve the values from
the indexing file.
Figure 6.7: Class Record
6.2.1.6. Class Index
This class is the heart of the software; most of the functions that implement the compressed
trie algorithm are present in this class. The function BlockSearch () finds the address of
the block in which to store records at the right position. Various other functions like
rightmostBlock (), leftmostBlock () and insertdata () are also present in this class.
Figure 6.8: Class Index
6.2.1.7. Class Data
The Data class provides the functionality for reading from and writing to the indexing file.
The functions readBlock () and readdata () read from the index file. The functions
writeBlock () and writeData () write to the index file. The function getNewoffset ()
returns the offset address for creating a new block.
Figure 6.9: Class Data
6.2.2 Parsing Package Class
6.2.2.1 Class Dom
Class Dom is used to parse the XML website. During parsing it extracts the following
information from the file: dtdname, dtdurl, docurl and the semantic information, i.e. the
list of tags. The function makeIndex() makes the record from that information. At the end
it calls the InsertIndex() function to insert the record.
Figure 6.10: Class Dom
6.2.9. Class Server
This class contains the main() method. When the administrator runs this class, the search
engine takes websites from the ‘urls.add’ file one by one and starts the indexing process.
At the end it takes a dump of the whole index.
Figure 6.11: Class Server
6.2.10. Class AddURL
The AddURL servlet enables the user to register his XML website with the search engine.
Figure 6.12: Class AddURL
6.2.11. Class Simple
This servlet provides the simple search functionality.
Figure 6.13: simple servlet
6.2.11. Class Refine
This servlet provides the refine search functionality.
Figure 6.14: Refine servlet
6.3. Sequence Diagram
The interaction of the classes is shown as the sequence diagram. A sequence diagram
represents the behavioral modeling. The boxes across the top of the diagram
represent objects, classes, actors, classifiers or their instances or typically use cases.
The dashed lines hanging from the boxes are called object lifelines, representing the
life span of the object during the scenario being modeled.
The long, thin boxes on the lifelines are activation boxes which indicate processing is
being performed by the target object/class to fulfill a message. Messages are indicated
on UML sequence diagrams as labeled arrows; when the source and target of a
message is an object or class the label is the signature of the method invoked in
response to the message [30].
6.3.1 Sequence Diagram of Website Gather Module
The sequence diagram shows the main steps by which the user registers his page with the
search engine. First of all, the user enters the URL of his website and his email address
using the Adurl.html page interface. This invokes the doGet() method of the Addurl servlet
program. The servlet program verifies the URL and the email address. Finally it writes into
the ‘urls.add’ and indexpage.add files.
[Sequence diagram. Participants: user, Addurl.html, AddUrl Servlet, urls.add (urls to be indexed file), indexpage.add (list of all indexed urls).
Messages: 1: xmlurl, email, add; 2: doGet(xmlurl, email, add); 3: check(url); 4: check(email); 5: writeFile(urls, email); 6: writeFile(urls, email).]
Figure 6.15: Sequence Diagram website Gather Module
6.3.2 Sequence Diagram of Insertion Module
This sequence diagram shows the whole process of indexing an XML document. The
process takes place through the following messages and method calls shown in the figure.
[Sequence diagram. Participants: Server, Anode, :Dom, Index, Block, Record, Data, RecordList, index.dat file.
Messages: 1: Anode( ); 2: crawl(String, boolean); 3: Dom.DomMain(root, str1, true); 4: checkURL(); 5: parseDoc(); 6: makeIndex(index, root, dat, element, elts, insert); 7: insert(Anode, Data, String, LinkedList, String, String, String); 8: search(Anode, String); 9: Record( ); 10: setDoc(String); 11: insertdata(Data, Block, Record); 12: writeBlock(Record, Block, Anode, int, RecordList); 13: readBlock(Block); 14: RecordList( ); 15: readData(long, boolean); 16: Record( ); 17: new File(index.dat); 18: fp.readLine(); 19: setDoc(String); 20: (unlabeled); 21: addRec(Record); 22: writeData(Record, long, boolean).]
Figure 6.16: Sequence diagram of insertion module
6.3.3 Sequence Diagram of Simple Search Module
The sequence diagram shows the main steps of the simple search module. It takes the
input from the user and passes it to the simple servlet. The simple servlet interacts with
various classes to perform the search process. At the end the result is displayed to the user.
[Sequence diagram. Participants: html page, simple servlet, Anode, index, Data, index.dat.
Messages: 1: init(); 2: Anode getroot(); 3: root; 4: search_str, add; 5: doget(); 6: check(str); 7: recordList searchString (root, str, and); 8: block search(root, str); 9: RecordList readBlock(block); 10: Record readData(); 11: <<read data>>; 12: RecordList; 13: RecordList; 14: display result.]
Figure 6.17: Sequence diagram of Simple Search
6.3.4 Sequence diagram of Refine search Module
This sequence diagram shows the main steps of the refine search process. The refine search
servlet gets its input from the simple search servlet. It then checks for the DTD/tag names
in the record list returned by the simple search. The refine process takes place through
the following messages and method calls shown in figure 6.18.
[Sequence diagram. Participants: simple search, refineSearch servlet, Anode, index, Data, index.dat.
Messages: 1: search_str, dtds/tags; 2: getroot(); 3: root; 5: recordList searchString (root, str, and); 6: block search(root, str); 7: RecordList readBlock(block); 8: Record readData(); 9: <<read data>>; 10: RecordList; 11: RecordList; 12: check1(tags, RecordList); 13: check2(dtds, RecordList); 14: RefineRecordList; 15: << display refine result >>.]
Figure 6.18: Sequence diagram of Refine Search
Chapter 7 Result and Discussion
This chapter shows the results obtained, by providing snapshots of the search
engine results.
7.1 User Interface
Figure 7.1: Snap shot of User Interface
Objective
Provides a graphical user interface to the user for searching.
How to use this page
Enter the search query keyword in the textbox provided.
If the user wants to find all the keywords, click on the find all words checkbox.
Submit the query by pressing the Global Search button.
7.2 Simple Search Result Page
Figure 7.2: Snap shot of Simple Search Result page
of the keyword “Allahabad“
7.2.1 Visual Understanding of Result
In the result page there are two colors of links. Blue links represent the hyperlinks to
XML documents. Green links represent the hyperlinks to the document type
definition (DTD) file. On the right hand side of the links, the numbers in green
represent the rank of the page.
DTDName represents the type of the document. Semantic information is represented
using arrows; it shows where the keyword has come from. On the right hand side there
are two lists. The upper one is the list of tags, the lower one is the list of DTD names.
Select one of the elements from a list to refine the search result. Click on the numbers
1, 2, 3… to browse further result pages.
7.2.2 Analysis of Result
When a user performs a search, he/she gets as a result not only the list of URLs which are
related to the keyword but also the semantic information of the keyword. The
user also gets information about the types of documents and is able to refine the search
by selecting tags/DTDs from the respective lists.
When the keyword ‘Allahabad’ is entered in the search text box of the search engine,
it displays the related XML websites that contain the searched word ‘Allahabad’. In
the result it shows four types of documents, Bank, university, up tourism and institute, as
shown in figure 7.2. It gives the user the flexibility to choose their own search area to
refine the search.
In the first link it shows the semantic information mnnit->institute->Allahabad, in the
second link it shows the semantic information uptourisum->PlOfInterest->Kumbhmela->Allahabad
and so on. This semantic information is also helpful to the users in
understanding the result. Current HTML based search engines don’t provide this
information about the keyword.
The user is able to browse the document type definition link to gain familiarity
with the structure of the document, which helps them in refinement of the search result.
7.3 Refine Search Result Page
Figure 7.3: Snap shot of Refine search result page
Figure 7.3 shows the result of the refine search based on the DTD selected from
the list. When the keyword ‘Allahabad’ is entered in the search text box the search
engine displays all the XML websites that contain the keyword ‘Allahabad’. In the
snapshot described above, Allahabad is attached to the university, bank, up
tourism and institute. It gives the user the flexibility of choosing their own search area,
e.g. if a user selects ‘institute’ from the DTD list then the search engine presents the
names of the institutes in Allahabad, i.e. mnnit and iiita.
7.4 Simple Search Result Page of Searching the Keyword “Xml”
Result of searching the keyword “xml“
Figure 7.4: Snap shot of searching the keyword “xml“
7.4.1 Analysis of result
When the keyword ‘XML’ is entered in the search text area, the search engine
displays links to the related XML websites which are currently registered with it.
Currently the websites registered in the search engine are
http://172.19.6.53:8080/learning_xml.xml, http://172.19.6.53:8080/processing_xml.xml and
http://172.19.6.53:8080/java_xml.xml. These are all of the xml book type. That is why in
the List of Dtds we got only one type of document, xmlbook.
If a few more sites are registered with the search engine, some related to the XMLTutorial
and XMLResearch fields, then when the user searches for the keyword “xml”, in the List of
Dtds the user gets three types of documents, i.e. xml book, Tutorial, Research. The result is
shown in figure 7.5. The user can refine his result according to his field of interest.
Figure 7.5: Search result of keyword “xml“ after registering few more website related to xml domain.
7.4.2 Refine Search Result
If the user selects ‘XMLResearch’ from the List of Dtds then the search engine
presents only those websites which are related to the XML Research field.
Figure 7.6: Snap shot of refine search result keyword “xml“
7.5 Simple Search Result page of searching the keyword “Gate”
Figure 7.7: Snap shot of simple search result of the keyword “Gate”
7.6 Procedure of Registering an XML website
User interface for registering a website with the search engine.
Figure 7.8: Snap shot for Registering Xml website to Search engine
Objective
Provides a graphical user interface for registering an XML website with the Global Search
engine.
How to use this page
The user enters the URL of the website and an email address in the textboxes provided and
presses the submit button. To register a complete website the user needs to register each
individual page. An email will be sent to the user when the website page has been indexed
by the search engine.
7.7 Procedure of Indexing a Website
Administrator steps for indexing a website
Before starting indexing, start the following processes using the Linux command shell.
Step 1: Run the Apache Web Server using the commands shown in figure 7.9.
Figure 7.9: Command to Run Apache Web Server
Step 2: Start the servletrunner using the commands shown in figure 7.10.
Figure 7.10: Command to Run Servlet
Step 3: Run the server class file using the commands shown in figure 7.11.
Figure 7.11: Command to Run Server
Chapter 8 Conclusions and future work
8.1 Conclusion
In this thesis a new search engine has been proposed for searching websites written in XML/XSL. In contrast to existing search engines, it provides two levels of search: basic search and refine search. The basic search is similar to a conventional HTML search engine, but because the websites are written in XML it also gives the user the semantic information associated with the keyword.
Further functionality comes with the refine-level search, where the user can refine the search according to the DTD/tag information presented. The beauty of the model lies in the fact that it frees the user from remembering the structure of XML documents and writing sophisticated queries to search them.
In addition, an efficient compressed trie data structure is used to implement the indexer. Its properties include fast retrieval, quick determination of an unsuccessful search, and finding the longest match to a given identifier.
8.2 Future Amendments
It is obvious that it is not possible to cover the whole functionality of a search engine. Functionality that can be provided in future includes the following. Currently the search engine does not search for synonyms; this can be added so that it searches for the original words as well as their synonyms. A query expansion feature can be provided. One more feature that can be added is to return only the relevant part of a web page as the search result, rather than the entire document.
Spell-checking functionality can be added. The search engine described here does not support Indian languages; it should be extended to include them.
An XML web crawler can be built to gather all XML websites.
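As an illustration of the trie properties noted in Section 8.1 (fast retrieval, early detection of an unsuccessful search, and longest match to a given identifier), the following is a minimal sketch of a compressed trie in Java. It is not the indexer implemented in this work: the class name `CompressedTrie`, its method names, and the sample keywords are assumptions chosen for illustration, and the sketch stores keywords only, without the document references a real index would need.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative compressed (radix) trie: each edge carries a string label
// instead of a single character, so chains of single-child nodes are merged.
final class CompressedTrie {
    private static final class Node {
        String label;       // edge label leading into this node
        boolean terminal;   // true if a stored keyword ends here
        final Map<Character, Node> children = new HashMap<>();

        Node(String label, boolean terminal) {
            this.label = label;
            this.terminal = terminal;
        }
    }

    private final Node root = new Node("", false);

    void insert(String word) {
        Node node = root;
        int i = 0;
        while (i < word.length()) {
            Node child = node.children.get(word.charAt(i));
            if (child == null) {
                // No edge starts with this character: add the rest as one edge.
                node.children.put(word.charAt(i), new Node(word.substring(i), true));
                return;
            }
            int k = commonPrefixLength(child.label, word, i);
            if (k < child.label.length()) {
                // Split the edge at the point where the keyword diverges.
                Node rest = new Node(child.label.substring(k), child.terminal);
                rest.children.putAll(child.children);
                child.children.clear();
                child.label = child.label.substring(0, k);
                child.terminal = false;
                child.children.put(rest.label.charAt(0), rest);
            }
            i += k;
            node = child;
        }
        node.terminal = true;
    }

    // Exact lookup; returns false as soon as an edge label fails to match.
    boolean contains(String word) {
        Node node = root;
        int i = 0;
        while (i < word.length()) {
            Node child = node.children.get(word.charAt(i));
            if (child == null || !word.startsWith(child.label, i)) {
                return false;
            }
            i += child.label.length();
            node = child;
        }
        return node.terminal;
    }

    // Longest stored keyword that is a prefix of text, or null if none.
    String longestPrefixOf(String text) {
        Node node = root;
        int i = 0;
        int best = root.terminal ? 0 : -1;
        while (i < text.length()) {
            Node child = node.children.get(text.charAt(i));
            if (child == null || !text.startsWith(child.label, i)) {
                break;
            }
            i += child.label.length();
            node = child;
            if (node.terminal) {
                best = i;
            }
        }
        return best < 0 ? null : text.substring(0, best);
    }

    private static int commonPrefixLength(String label, String word, int from) {
        int k = 0;
        while (k < label.length() && from + k < word.length()
                && label.charAt(k) == word.charAt(from + k)) {
            k++;
        }
        return k;
    }
}
```

For example, after inserting "gate", "gateway" and "game", a lookup of "gat" fails at the second edge without scanning any further, and `longestPrefixOf("gatekeeper")` returns "gate". The number of edges followed is bounded by the length of the query rather than by the number of stored keywords.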
Content Based Search Engine “Global” For Xml Database
- 66 -
Indian Institute Of Information Technology – Allahabad June-2007
R
[1] Mark P. Sinka, David W
Stoplists for Web Document Analysis” IEEE/WIC International Conference on
ce IEEE Computer Society Page: 396, 2003
umber 3 September
1999.
[8] Fang Yuan; Ya-Nan Hao; Ge Yu; “The study of key techniques in intelligent
http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/tries.htm
ery
eferences
. Corne “Towards Modernised and Web-Specific
Web Intelligen
[2] W. B. Frakes "Stemming algorithms" Information retrieval: data structures and
algorithms Pages: 131 – 160, 1992.
[3] Marden, P.M, Jr. Munson, E.V. ”Today's style sheet standards: the great vision
blinded” , Computer IEEE JNL Volume 32, Issue 11, Page(s):123 – 125,
Nov. 1999
[4] http://www.w3schools.com/xml/xml_whatis.asp
[5] S. Adler and Co. “Extensible Stylesheet Language (XSL) Version 1.0” W3C
Working Draft, available at http://www.w3.org/TR/xsl/, 18 October 2000.
[6] Arne Andersson and Stefan Nilsson “Faster searching in tries and quadtrees An
analysis of level compression” Springer Berlin / Heidelberg Volume 855, 1994
[7] J. Widom “Data Management for XML - Research Directions”, IEEE Data
Engineering Bulletin, Special Issue on XML, Volume 22, N
XML search engine” Machine Learning and Cybernetics International
Conference Volume 2, Page(s):1194 – 1197, 2004
[9] http://xerces.apache.org/
[10] Sartaj Sahni “Data Structures, Algorithms, & Applications in Java Tries” 1999
[11] http://www.w3schools.com/dtd/dtd_intro.asp
[12] XQuery - Wikipedia http://en.wikipedia.org/wiki/XQu
Content Based Search Engine “Global” For Xml Database
- 67 -
Indian Institute Of Information Technology – Allahabad June-2007
[13] http://www.comp.lancs.ac.uk/computing/research/stem
m
[14] http://ww3.algorithmdesign.net/handouts/Tries.pdf
ue 2 Pages: 243 -
263 , June 1984
: the W3C DOM specification" Volume: 3 ,
Issue: 1, pages 48 – 54, Jan.-Feb. 1999
ly 1976
escu, Alon Levy, Dan Suciu,
Press XML Applications , Pages: 474 - 485 , 2002
Nilsson "Improved behaviour of tries by adaptive
branching” Information Processing Letters, Elsevier North-Holland, Inc.
[23] Aleman-Meza, B. Halaschek-Weiner, C. Arpinar, I.B. Cartic Ramakrishnan
” Internet
44 , May-June 2005
ming/general/index.ht
[15] M. Al-Suwaiyel, E Horowitz "Algorithms for trie compaction" ACM
Transactions on Database Systems (TODS) Volume 9 , Iss
[16] Wood, L. "Programming the Web
[17] Kurt Maly “Compressed tries” Communications of the ACM, Volume 19
Issue 7 Ju
[18] Alin Deutsch, Mary Fernandez, Daniela Flor
“XML-QL: A Query Language for XML”,W3C Notes:http://www.w3.org/TR/
NOTE-xml-ql, August 1998.
[19] http://ww0.java4.datastructures.net/handouts/Tries.pdf
[20] Lionel Villard, Nabil Layaïda “XML Applications: An incremental XSLT
transformation processor for XML document manipulation” Proceedings of
the 11th international conference on World Wide Web WWW '02 Session:
ACM
[21] Tin Kam Ho “Fast identification of stop words for font learning and keyword
spotting” Document Analysis and Recognition, ICDAR '99. 20-22 IEEE CNF
Page(s):333 – 336 , Sept. 1999
[22] "Arne Andersson, Stefan
Volume 46 ,Issue 6 Pages:295-300 Year of Publication: 1993
Sheth, A.P. “Ranking complex relationships on the semantic Web
Computing, IEEE JNL Volume 9, Issue 3, Page(s):37 –
Content Based Search Engine “Global” For Xml Database
- 68 -
Indian Institute Of Information Technology – Allahabad June-2007
[24] Stefan Nilsson and Matti Tikkanen “Implementation of dynamic compresse
trie
d
s” Springer Berlin Volume 844, 1994
uilding and managing
XML/XSL-powered Web sites: an experience report” Computer Software and
Web-based Search Engine for Indian Languages”,
http://www.cse.iitk.ac.in/research/ mtech1997/9711112.ps.gz Dept. of CSE,
[27] Kaplan, A. Lunn, “FlexXML: engineering a more flexible and adaptable web”
2001
[29] Danny Sullivan "How Search Engines Rank Web Pages" http://searchengine
[25] Kerer, C. Kirda, E. Jazayeri, M. Kurmanowytsch “B
Applications Conference, IEEE CNF pp. 547 – 554, Oct. 2001.
[26] Manoj Malviya, “A
Indian Institute of Technology, Kanpur.
Information Technology: Coding and Computing, 2001. IEEE CNF Page(s):
405 – 410 , April
[28] Angela Bonifati, Stefano Ceri “Comparative analysis of five XML query
languages “ ACM SIGMOD Record, Volume 29 Issue 1, March 2000
watch.com/showPage.html?page=2167961 March 15, 2007
[30] http://www.agilemodeling.com/artifacts/sequenceDiagram.htm
Content Based Search Engine “Global” For Xml Database
- 69 -
Indian Institute Of Information Technology – Allahabad June-2007
Appendix A Configuring the project
A.1 Configuring the project
Step 1: Create the following directory structure
[root]# cd /dinesh/mtech/global/servlet
Put all servlet files (simple, refine, addurl, etc.) here.
[root]# cd /dinesh/mtech/global/src
Put the Java class source code here.
[root]# cd /dinesh/mtech/global/index
The Index.dat and urls.add files will be generated here.
[root]# cd /usr/local/apache-tomcat-5.5.17/webapps/GlobalSearch
Put all the web pages here, under the Apache web server directory.
Step 2: Compile source code
Compile the full source code. Before compiling, the administrator needs to include the path of the Java API (specified in Chapter 2) in the CLASSPATH environment variable. Each class is compiled using the command
javac classname.java
Step 3: Run project
To run the project, start the following processes from the Linux shell.
Step 3.1: Run the Apache web server using the commands
[root]# export JAVA_HOME=/usr/java/jdk1.5.0_05/
[root]# /usr/local/apache-tomcat-5.5.17/bin/startup.sh
Step 3.2: Start the servletrunner using the following command
[root]# servletrunner -p 8084 -d /dinesh/mtech/global/servlet -r /dinesh/mtech/global/servlet -s /dinesh/mtech/global/servlet/servlet.properties
Surfing the sample site
http://localhost:8080/GlobalSearch/main.html