
Content Based Search Engine “Global” for XML Database

A DISSERTATION Submitted in partial fulfillment of the requirements for the award of the degree

of

MASTER OF TECHNOLOGY

In

INFORMATION TECHNOLOGY (Specialization: SOFTWARE ENGINEERING)

Submitted by

Dinesh Garg (MS200506)

Under the Guidance of

Prof. M. Radhakrishna & Mr. Manish Kumar

IIIT-Allahabad

2005-2007 INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,

ALLAHABAD

Date: ______________

WE DO HEREBY RECOMMEND THAT THE THESIS WORK PREPARED

UNDER OUR SUPERVISION BY DINESH GARG ENTITLED CONTENT

BASED SEARCH ENGINE “GLOBAL” FOR XML DATABASE BE ACCEPTED

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE

OF MASTER OF TECHNOLOGY IN INFORMATION TECHNOLOGY

(SOFTWARE ENGINEERING) FOR EXAMINATION.

COUNTERSIGNED

Mr. Manish Kumar Prof. M. Radhakrishna

(THESIS ADVISERS)

INDIAN INSTITUTE OF INFORMATION TECHNOLOGY

ALLAHABAD (A University Established under sec.3 of UGC Act, 1956 vide Notification No. F.9-4/99-U.3 Dated 04.08.2000 of the Govt. of India)

(A Centre of Excellence in Information Technology Established by Govt. of India)

DR. U. S. TIWARY (DEAN ACADEMICS)

CERTIFICATE OF APPROVAL*

The foregoing thesis is hereby approved as a creditable study in the area of

knowledge management carried out and presented in a manner satisfactory

to warrant its acceptance as a pre-requisite to the degree for which it has

been submitted. It is understood that by this approval the undersigned do

not necessarily endorse or approve any statement made, opinion expressed

or conclusion drawn therein but approve the thesis only for the purpose for

which it is submitted.

COMMITTEE ON

FINAL EXAMINATION

FOR EVALUATION

OF THE THESIS

* Only in case the recommendation is concurred in





CANDIDATE DECLARATION

This is to certify that the report entitled “Content Based Search Engine ‘Global’ for XML Database”, which is submitted by me in partial fulfillment of the requirements for the completion of M.Tech. in Information Technology (with specialization in Software Engineering) to the Indian Institute of Information Technology, Allahabad, comprises only my original work, and due acknowledgement has been made in the text to all other material used.

Dinesh Garg

M. Tech. IT (Spl. Software Engineering)

MS200506






Acknowledgements

I am highly grateful to the honorable Director, IIIT-Allahabad, Dr. M. D. Tiwari, for his ever helping attitude and for encouraging us to excel in our studies. I am thankful to Prof. U. S. Tiwari, Dean Academics, IIIT-Allahabad, for providing everything necessary for this dissertation work and for his moral support.

I would like to express my sincere gratitude to Mr. Manish Kumar for his invaluable guidance and constant encouragement through the last semester of my project work. He served as a motivating force in whatever I did and was always readily available whenever needed. From him I have learned to combine theoretical knowledge with intuition effectively.

I also thank Prof. M. Radhakrishna for his expert guidance and encouragement. In spite of his hectic schedule he was always approachable and took time off to attend to my problems and give appropriate advice.

I am highly obliged to all my friends for their encouragement and for helping me at the points where I got stuck. I am deeply indebted to all of them for always helping and inspiring me.

Finally, I thank everyone who is related to this thesis in one way or another.

Thanks to everyone; it has been a wonderful year!

Dinesh Garg

10-June-2007



Abstract

With the rapid development of the Internet, the Web has become a main source of information. There are now millions of websites and billions of pages on the Internet. This explosive growth of information has greatly increased the need for information retrieval systems such as search engines.

The most popular search engines, such as Google, AltaVista and Yahoo, are all based on HTML documents. Despite the success of HTML-based keyword search engines, they have shortcomings, such as the lack of semantic retrieval, and the HTML file based web server model they rely on has certain limitations. Extensible Markup Language (XML) has recently emerged as the document standard for representing and exchanging data on the Web. XML turns the Web into a database, in this case a database of XML websites. Helping Web users retrieve useful information from XML documents quickly has become a hot topic.

The goal of this thesis is to develop a search engine for searching websites built with XML/XSL. Unlike existing search engines, it provides two levels of search: basic search and refine search. The basic search is similar to that of a conventional HTML search engine, but because the websites are written in XML it also gives the user semantic information about the keyword.

The refine search adds one more capability: the user can refine the search according to the DTD/tag information presented to him. In addition, an efficient compressed trie data structure is used to implement the indexer. The search engine also frees the user from remembering the structure of XML documents and from writing sophisticated queries to search them.



Table of Contents

Acknowledgements
Abstract
Table of Contents
List of Figures
Abbreviations

Chapter 1. Introduction
  1.1. Introduction
  1.2. Motivation
  1.3. Organization of Thesis

Chapter 2. Literature Review
  2.1. Overview of XML
    2.1.1 XML (eXtensible Markup Language)
    2.1.2 DTD (Document Type Definition)
    2.1.3 XSL (EXtensible Stylesheet Language)
    2.1.4 A Simple XML example
    2.1.5 Document Object Model (DOM)
  2.2. Existing Query languages for XML
    2.2.1 Problem faced in the existing system
  2.3. Search Engines
    2.3.1 Types of Search Engines
    2.3.2 Shortcomings of most popular Search Engine
    2.3.3 Problems in Web Server model based on HTML pages
  2.4 Global Search engine Web Server Model
    2.4.1 Site Data & Storage
    2.4.2 Advantages of Global Search engine Web Server Model

Chapter 3. Requirement Specification of Search engine
  3.1. Functional Requirements
  3.2. Use Cases diagram of Search engine
  3.3. Non-functional Requirements
  3.4. Design Constraints
    3.4.1. Hardware Requirements
    3.4.2. Software Requirements

Chapter 4. Indexing Data Structure
  4.1. Standard Trie
  4.2. Basic Compressed Trie
  4.3. Basic M-Way Trie
    4.3.1. Drawbacks of M-Way Trie
  4.4 Indexer
    4.4.1. Searching Algorithm
    4.4.2. The Insertion Algorithm
  4.5. Advantage of compressed trie

Chapter 5. High Level Design
  5.1. System Architecture
  5.2. Gatherer Module (Module 1)
  5.3 Indexer (Module 2)
    5.3.1. Parsing of XML document
    5.3.2. Stop Word Removal
    5.3.3. Word Stemming
    5.3.4. Index Routine
  5.4 Search Processor (Module 3)
    5.4.1. Query processor
    5.4.2. Stop list and stemming
    5.4.3. Search routine
    5.4.4 Results Ranking and Display
  5.5 Indexing Process
  5.6 Search Process
  5.7 Transformation process of XML/XSL into HTML

Chapter 6. Detail Level Design
  6.1. Package Diagram
  6.2. Class Diagram
    6.2.1 Indexing Package Classes
    6.2.2 Parsing Package Class
  6.3. Sequence diagram
    6.3.1. Sequence Diagram of website Gather Module
    6.3.2. Sequence diagram of Insertion Module
    6.3.3. Sequence diagram of Simple Search Module
    6.3.4. Sequence diagram of Refine search Module

Chapter 7. Result and Discussion
  7.1 User Interface
  7.2 Simple Search Result page
    7.2.1 Visual understanding of result
    7.2.2 Analysis of Result
  7.3 Refine Search Result page
  7.4 Simple Search Result page of searching the keyword “xml”
    7.4.1 Analysis of Result
    7.4.2 Refine search result
  7.5 Simple Search Result page of searching the keyword “Gate”
  7.6 Procedure of Registering an XML website
  7.7 Procedure of Indexing a website

Chapter 8. Conclusion and Future Work
  8.1. Conclusion
  8.2. Future Amendments

References

Appendix-A: Configuring the project



List of Figures

Figure 2.1: Transformation Process
Figure 2.2: Hierarchical Structure of a Document Object
Figure 2.3: XML Parser Creating DOM
Figure 2.4: Existing Search Engine’s Web Server Model
Figure 2.5: Global Search Engine Web Server Model
Figure 3.1: Use Cases Diagram of Search Engine
Figure 4.1: Standard Trie
Figure 4.2: Basic Compressed Trie
Figure 4.3: Basic 10-Way Trie
Figure 4.4: Trie of 36 Elements English Alphabets and Numbers
Figure 4.5: Compressed Tries
Figure 4.6: Anode Structure
Figure 4.7: Bnode Structure
Figure 4.8: Datablock Structure
Figure 4.9: Record Structure
Figure 4.10: Search Compressed Tries
Figure 4.11: Case 2 Insertion Compressed Tries
Figure 4.12: Case 2.1 Insertion Compressed Tries
Figure 5.1: System Architecture
Figure 5.2: The High Level Design of Search Engine
Figure 5.3: Indexing Process
Figure 5.4: Search Process
Figure 5.5: Represents a Fragment of the Transformation
Figure 6.1: Package Diagram
Figure 6.2: Class Diagram
Figure 6.3: Class anode
Figure 6.4: Class bnode
Figure 6.5: Class block
Figure 6.6: Class recordlist
Figure 6.7: Class record
Figure 6.8: Class index
Figure 6.9: Class data
Figure 6.10: Class dom
Figure 6.11: Class server
Figure 6.12: Class addurl
Figure 6.13: Simple servlet
Figure 6.14: Refine servlet
Figure 6.15: Sequence Diagram Website Gather Module
Figure 6.16: Sequence Diagram of Insertion Module
Figure 6.17: Sequence Diagram of Simple Search
Figure 6.18: Sequence Diagram of Refine Search
Figure 7.1: Snap Shot of User Interface
Figure 7.2: Snap Shot of Simple Search Result Page of the Keyword “Allahabad”
Figure 7.3: Snap Shot of Refine Search Result Page
Figure 7.4: Snap Shot of Searching the Keyword “Xml”
Figure 7.5: Search Result of Keyword “Xml” After Registering Few More Website Related to Xml Domain
Figure 7.6: Snap Shot of Refine Search Result Keyword “xml”
Figure 7.7: Snap Shot of Simple Search Result of the Keyword “Gate”
Figure 7.8: Snap Shot for Registering XML Website to Search Engine
Figure 7.9: Command to Run Apache Web Server
Figure 7.10: Command to Run Servlet
Figure 7.11: Command to Run Server



Abbreviations

XML    eXtensible Markup Language
XSL    eXtensible Stylesheet Language
DTD    Document Type Definition
XSLT   XSL Transformations
CSS    Cascading Style Sheets
HTML   Hypertext Markup Language
DOM    Document Object Model
W3C    World Wide Web Consortium
XQL    XML Query Language
URL    Uniform Resource Locator
WWW    World Wide Web
CGI    Common Gateway Interface
UML    Unified Modeling Language
GUI    Graphical User Interface



Chapter 1 Introduction

This chapter introduces the thesis and the motivation for pursuing work in the area of XML technology. It also gives the reader an insight into the theoretical principles involved in the conception and design of the search engine developed during the course of this thesis. Finally, it presents the organization of the thesis.

1.1. Introduction

With the rapid development of the Internet, large amounts of digitally stored information are readily available online. There are now millions of websites and billions of pages on the Internet. There is so much information that it becomes progressively more difficult and time consuming for users to find what is relevant to their needs. This explosive growth of information has greatly increased the need for information retrieval systems.

However, the most popular search engines, such as Google, AltaVista and Yahoo, are based on HTML documents and lack semantics. HTML provides a simple way to mark up the structure of a document (major headings, minor headings, title, lists, etc.) and includes information about how to display text. For example, a browser knows that H1 means a very large header line. But HTML does not give us a way to describe the content of the text; the meaning is lost because there is no way to tag it. As a result, management of Internet content is inefficient.

An information retrieval system would need to implement sophisticated pattern matching tools to determine the semantics, context and purpose of the content. The problem is that search engines can usually index only document titles, word frequencies and some metadata that describes the content of a page.

We need a way to mark up the significant portions of documents so that their semantics can be understood. The search engine then gets the appropriate information for its index and avoids all of the noise related to presentation. The eXtensible

Indian Institute Of Information Technology – Allahabad June-2007



Markup Language (XML) has recently emerged as the document standard for representing and exchanging data on the web. An XML document is built by nesting tagged elements, and this nested structure makes it suitable for representing data on the Web. The tags identify the meaning of the data rather than its display format, as in HTML.

Unlike HTML, however, XML does not specify any information about document appearance. The browser gets this missing information from style sheets. The XSLT style sheet language supports the complete separation of content and presentation. We need a search engine that not only provides a list of links related to the keyword but also provides descriptive information about the content.

1.2. Motivation

The World Wide Web, the most popular application of the Internet, plays an important role in information sharing. There is so much information that it becomes progressively more difficult and time consuming for users to find what is relevant to their needs. This explosive growth of information on the Internet has greatly increased the need for search engines.

Existing search engines simply match the keyword and do not retrieve metadata. The most popular search engines are based on HTML documents. Despite the success of HTML-based keyword search engines, they have shortcomings, such as the lack of semantic retrieval.

We need a search engine that not only returns a list of URLs but also provides the semantic information, document type and document structure related to the search keyword. It should also let the user refine the search result according to the DTD/tag information presented to him. Such a search engine should be based on a currently emerging technology such as XML, the document standard for representing and exchanging data on the web.




1.3. Organization of Thesis

The thesis report is organized in eight chapters. Chapter 1 introduces the thesis and the motivation for pursuing work in the area of XML technology. The rest of the thesis is organized as follows.

Chapter 2 Literature Review. This chapter gives an overview of XML, introduces the existing query languages for XML databases, and discusses the shortcomings of current search engines. At the end it presents the Global search engine web server model.

Chapter 3 Requirement Specification of Search engine. This chapter describes the functional and non-functional requirements of the search engine and shows its use case diagram.

Chapter 4 Indexing Data Structure. This chapter describes the compressed trie indexing data structure used in the implementation of the search engine.

Chapter 5 High Level Design. This chapter describes the system architecture and high level design of the search engine.

Chapter 6 Detail Level Design. This chapter deals with the detailed design phase, presenting the class diagrams for structural modeling and the sequence diagrams for the various classes and functionalities of the search engine.

Chapter 7 Result and Discussion. This chapter shows the results obtained, with snapshots of the search engine output.

Chapter 8 Conclusion and Future Work. This chapter summarizes the work done, giving the conclusions and the possible future work that can be carried out in the area.

Appendix-A provides information on configuring the project.




Chapter 2 Literature Review

This chapter gives an overview of XML and of some related publications and systems. It introduces the existing query languages for XML databases and the problems with them, and discusses the shortcomings of current search engines. At the end it presents the Global search engine web server model and the benefits of this model.

2.1. Overview of XML

XML stands for “Extensible Markup Language” (extensible because it is not a fixed

format like HTML).

2.1.1 XML (eXtensible Markup Language)

XML [4] is a set of rules for defining semantic tags that break a document into parts and identify the different parts of the document. It is a meta-markup language that defines a syntax in which other field-specific markup languages can be written. It is a language in which you make up the tags you need as you go along. These tags must be organized according to certain general principles, but they are quite flexible in their meaning.

The eXtensible Markup Language is a standard recommended by the World Wide Web Consortium for data representation and exchange on the Web. XML documents are made up of storage units called entities, which contain either parsed or unparsed data. XML provides a format that can represent both simple and extremely complex information, and it allows developers to create their own vocabulary for describing that information.

2.1.2 DTD (Document Type Definition)

A document type definition (DTD) [11] lists the elements, attributes, entities and notations that can be used in a document, as well as their possible relationships to one another; it thereby specifies a set of rules for the structure of a document. A DTD can be declared inside the XML document. The document type declaration is what makes it possible to check whether a document is valid or merely well-formed. A document type declaration carries out the following tasks [11]:

• Specifying the document’s root element.

• Defining elements, attributes and entities for the document (internal DTD).

• Identifying the external DTD for the document.

The main uses of a DTD are:

• Independent groups of people can agree to use a standard DTD for interchanging data.

• An application can use a standard DTD to verify that the data it receives from the outside world is valid.

• We can also use a DTD to verify our own data.

2.1.3 XSL (EXtensible Stylesheet Language)

Since XML is a content-based meta language, it does not mean much to refer to “viewing an XML document”. How do we view something that does not include any information about how it is to be displayed? In order to view an XML document, one must provide information about how it is to be displayed. This is accomplished using CSS or XSL style sheets. [5, 20]

[Figure 2.1: Transformation process. The source document (XML) and the transformation sheet (XSL) go into the transformation, which produces the target document (HTML, text, etc.).]

The XML style sheet model for displaying content is advantageous because:

1) Content is separated from display. Hence, to change the look of a web page, only the XSL needs to be changed and not the data, which is in XML format.

2) Through the use of style sheets, Web documents become accessible everywhere, from PCs to TVs to palm devices to cellular phones. The same content can be ported easily to different user agents such as mobile devices and web browsers.




2.1.4 A Simple XML example

Sample XML file

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE bank SYSTEM "allahbadbank.dtd">
    <bank>
      <allahabadbank>
        <account>Get the account detail</account>
        <InterestRate>
          <deposit>deposit interest rate</deposit>
          <credit>credit interest rate </credit>
        </InterestRate>
        <loan>
          <Education>
            <Eligibility>Courses Eligible Studies in India </Eligibility>
          </Education>
          <personal> Quantum of Loan </personal>
          <house> Housing Loan Detail</house>
        </loan>
      </allahabadbank>
    </bank>

This XML document gives information about a bank. The allahabadbank element contains the account, InterestRate and loan elements, and some of these elements contain other elements in turn.

The corresponding Document Type Definition (DTD) for the above XML document:

    <?xml version='1.0' encoding='UTF-8'?>
    <!ELEMENT bank (allahabadbank)*>
    <!ELEMENT allahabadbank (loan|InterestRate|account)*>
    <!ELEMENT account (#PCDATA)>
    <!ELEMENT InterestRate (credit|deposit)*>
    <!ELEMENT deposit (#PCDATA)>
    <!ELEMENT credit (#PCDATA)>
    <!ELEMENT loan (house|personal|Education)*>
    <!ELEMENT Education (Eligibility)*>
    <!ELEMENT Eligibility (#PCDATA)>
    <!ELEMENT personal (#PCDATA)>
    <!ELEMENT house (#PCDATA)>

This DTD ensures that the InterestRate element contains only credit and deposit elements and, in the same way, that the loan element contains only house, personal and Education elements. The DTD is used to verify our XML data.
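To make the verification idea concrete, the following minimal Python sketch checks the nesting of elements against the content models of the DTD above. It is an illustration only and not a DTD validator: the `CONTENT_MODEL` table is transcribed by hand from the DTD, the function name `structure_errors` is hypothetical, and the check works only because every content model here is either `(a|b|...)*` or `(#PCDATA)`. Stray text inside container elements is not checked.

```python
import xml.etree.ElementTree as ET

# Allowed child element names per parent, transcribed by hand from the DTD.
# An empty set stands for (#PCDATA): text only, no child elements.
CONTENT_MODEL = {
    "bank": {"allahabadbank"},
    "allahabadbank": {"loan", "InterestRate", "account"},
    "account": set(),
    "InterestRate": {"credit", "deposit"},
    "deposit": set(),
    "credit": set(),
    "loan": {"house", "personal", "Education"},
    "Education": {"Eligibility"},
    "Eligibility": set(),
    "personal": set(),
    "house": set(),
}


def structure_errors(element):
    """Return messages for every child element the content model forbids."""
    if element.tag not in CONTENT_MODEL:
        return [f"undeclared element <{element.tag}>"]
    errors = []
    allowed = CONTENT_MODEL[element.tag]
    for child in element:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            errors.extend(structure_errors(child))
    return errors


good = ET.fromstring(
    "<bank><allahabadbank><loan><house>Housing Loan Detail</house></loan>"
    "</allahabadbank></bank>"
)
bad = ET.fromstring(
    "<bank><allahabadbank><loan><credit>x</credit></loan></allahabadbank></bank>"
)

print(structure_errors(good))
print(structure_errors(bad))
```

The second document is well-formed XML but places a credit element inside loan, which the DTD does not permit; this is the difference between a well-formed and a valid document.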




2.1.5 Document Object Model (DOM)

The Document Object Model (DOM) is a platform- and language-independent standard object model for representing HTML, XML and related formats. According to the DOM, everything in an XML document is a node; the entire document is a document node. The DOM supports navigating and modifying XML documents and presents them as a hierarchical tree [16].

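[Figure 2.2: Hierarchical structure of a document object. A tree with the document node (document root) at the top, followed by Bank and Allahabadbank; Allahabadbank has the children account, InterestRate (with children Credit and Deposit) and Loan. Edges are labelled as parent and child relations.]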

The XML Document Object Model is a programming interface for XML documents. It defines the way an XML document can be accessed and manipulated. The DOM is usually added as a layer between the XML parser and the application that needs the information in the document: the parser reads the data from the XML document and feeds that data into a DOM, and the DOM is then used by a higher-level application.

[Figure 2.3: XML parser creating the DOM. XML document → XML parser → DOM → application.]
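The parser-to-DOM-to-application layering can be sketched in a few lines of Python using the standard library module `xml.dom.minidom`. This is an illustrative example, not the parsing code of the search engine described in this thesis; the shortened sample document and the helper name `element_paths` are assumptions made for the example.

```python
from xml.dom import minidom

# Shortened version of the sample bank document from Section 2.1.4.
SAMPLE = """<bank><allahabadbank>
  <account>Get the account detail</account>
  <InterestRate>
    <deposit>deposit interest rate</deposit>
    <credit>credit interest rate</credit>
  </InterestRate>
</allahabadbank></bank>"""


def element_paths(node, prefix=""):
    """Collect the path of every element node at or below `node`."""
    paths = []
    if node.nodeType == node.ELEMENT_NODE:
        prefix = f"{prefix}/{node.tagName}"
        paths.append(prefix)
    # Document, element and text nodes all expose childNodes.
    for child in node.childNodes:
        paths.extend(element_paths(child, prefix))
    return paths


document = minidom.parseString(SAMPLE)  # the parser builds the DOM tree
paths = element_paths(document)         # the application walks that tree

for path in paths:
    print(path)
```

The application never touches the raw XML text: it sees only nodes, starting from the document node, and moves from parent to child through `childNodes`.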




2.2. Existing Query languages for XML

Many query languages can be used to query an XML database, for example XSL, XML-QL and XQL.

XML-QL

XML-QL integrates XML syntax with query language techniques. Path expressions and patterns are used to extract data from the input XML data, and the extracted data is bound to variables. Templates describe the output XML data. Both templates and patterns use XML syntax. XML-QL is based on a WHERE/CONSTRUCT syntax and has features such as regular path expressions, XML patterns, joins on multiple input sources and Skolem functions for grouping.

Syntax

    WHERE elementPatterns IN xmlSource
    CONSTRUCT template

Example

    WHERE <book>
            <publisher><name>Addison-Wesley</name></publisher>
            <title> $t </title>
            <author> $a </author>
          </book> IN "www.x.y.z/bib.xml"
    CONSTRUCT $a

As the example shows, XML-QL offers a proper way to query a specified XML document. It does not inherently support querying an XML repository consisting of XML data in semi-structured form. [18]

XQL

For XQL, a document is a labeled, ordered tree whose nodes represent elements, processing instructions, document entities, attributes and comments. XQL is similar to XPath and XSL patterns. XQL engines can represent the input to a query via XSL nodes, DOM nodes, an index structure or XML text.




XQuery

XQuery was devised primarily as a query language for data stored in XML form, so its main role is to get information out of XML databases. XQuery uses predicates to limit the data extracted from XML documents. The language is based on a tree-structured model of the information content of an XML document, containing seven kinds of node: processing instructions, elements, text nodes, comments, attributes, document nodes and namespaces [12].

XQuery is a case-sensitive language, and its keywords are in lower case. XQuery is a functional language made up of several kinds of expressions that can be nested and composed with full generality. Every expression has a value and no side effects. Expressions can raise errors, and they usually propagate errors raised by their subexpressions.

Sample XQuery example:

List all suppliers. If a supplier offers medical items, list the descriptions of the items.

    for $s in doc("suppliers.xml")//supplier
    order by $s/name
    return
      <supplier>
        { $s/name,
          for $ci in doc("catalog.xml")//item[supp_no = $s/number],
              $mi in doc("medical_item.xml")//item[number = $ci/item_no]
          return $mi/description
        }
      </supplier>

XQuery also allows the use of aggregation functions (avg, count, etc.). In addition, it allows several data sources to be queried simultaneously, producing an integrated view of their data.

FOR-WHERE-RETURN Xquery example:

Find all book titles published after 2000:

FOR $x IN document("abc.xml")/bib/book WHERE $x/year/text() > 1999 RETURN $x/title




2.2.1 Problem Faced in the Existing System

With these query languages we can retrieve whatever we want from a given database
by writing sophisticated queries. To write such queries, however, users need to be
familiar with the document structure, that is, they need to know which tags the
document contains. Every XML document has its own structure, and the structure
differs from document to document. So before searching the database, the user
needs to remember the structure of the corresponding XML document. In short:

• Searching for a document in the database requires writing sophisticated
queries.

• The user bears the burden of remembering the document structure.

• The approach rests on the unreasonable assumption that the user is familiar
with the document structure.

Jennifer Widom’s pioneering white paper [7] points out the various challenges, and
the technologies available and required, to meet the requirements of the WWW. The
paper focuses on XML as a standard for data representation and exchange on the
Internet. It discusses various features of XML and comments that XML will
radically change the face of the Web. It also discusses research and development
issues related to XML and databases, such as query languages, information
retrieval and database systems. According to her, XML “turns Web into a database.”

In [25], the authors Kerer and C. Kirda present an experience report on building
and managing XML/XSL-powered websites. Their findings include: LC separation
through XML/XSL provides high layout flexibility; XML/XSL deployment needs more
data-organization planning; XML/XSL enables multilingual websites; learning XML
and XSL concepts is not easy for developers; and graphic design companies are
slow to pick up XML/XSL know-how.




2.3. Search Engines

A search engine is a tool that allows users to look up information. It accepts the
keywords entered by the user, examines its index and returns a list of the
best-matching web pages according to its criteria.

2.3.1 Types of Search Engines

Search engines can be divided into various categories.

Full-text Search Engines

Some search engines, such as Google, store all or part of each web page as well as
information about it, whereas others, such as AltaVista, store every word of every
page they find. They allow users to search for any string of text and build a
ranked list that tries to present the most useful pages at the top. Ranking can be
based on various factors, such as whether the search keyword appears near the top
of the document, in sub-headings, in the meta tags or in the title of the page,
and the number of times the search keyword occurs in the text.

Directory-based Search Engines

Directory-based search engines use some form of category system. They organize
documents into categories such as Movies, Travel, Shopping and Sports.
Examples of this category are Yahoo!, AOL Search, MSN.com and InfoSeek.

MetaSearch Engines

A meta-search engine (also known as a multi-search engine) is a search engine that
sends user requests to several other search engines, either simultaneously or
sequentially. The results are then blended together onto one page. Meta-search
engines do not have their own database of web pages; instead they create a virtual
database. They also enable users to enter search criteria. "Smarter" meta-search
technology includes clustering and linguistic analysis. Examples of this category
are Dogpile, Vivisimo, Kartoo, SurfWax and Mamma.




Specialist Search Engine

Specialist search engines are specifically designed to provide search relevant to
specific areas of information. This does not include search engines run by
individual companies. They confine themselves to a wide range of database search
tools that cover the needs of a particular organization. The Internet Movie
Database (IMDb) search is an example of this category.

2.3.2 Shortcomings of Most Popular Search Engines

The most popular search engines are Google, AltaVista and Yahoo!. Despite their
success, these search engines have a few shortcomings.

• These search engines simply match keywords and do not retrieve metadata
(such as XML tags).

• Current search engines always return entire documents as search results
instead of returning only the part of a document that is relevant to the search.

• They are based on HTML documents. The problems this causes are explained in
the next section.

• They lack semantic retrieval, i.e. they give no information about the type of
a document or related semantic information about it.

[Figure: a web browser sends a request over the Internet to a web server, which responds with an HTML page. The web server holds HTML pages and CGI, Servlet and JSP programs, and exchanges data with other systems' databases.]

Figure 2.4: Existing Search engine’s web server model




Figure 2.4 briefly explains the basic functioning of the Internet; the dotted box
contains the web server based on HTML pages. Existing search engines fetch content
from HTML pages.

2.3.3 Problems in Web Server model based on HTML pages

The most common way of storing information on the Web is by writing web pages in
HTML. This HTML file-based storage architecture has certain limitations.

Intermixing of Content and Representation: The Web as it stands today is mostly
a collection of a large number of HTML files. HTML files are presentation-oriented.
There is no clear separation between the information content and its rendering
details.

Unable to provide semantic information: An HTML page includes information about
how to display text. For example, a browser knows that H2 means a large header
line. But HTML gives us no way to describe the content of the text, so its meaning
is lost and no semantic information about it can be provided.

Poor support for device independence: HTML has turned out to have poor support
for device independence. Users want to access the Web not only from their personal
computers but also from mobile devices with varying display sizes and
characteristics.

Difficult to manage and resistant to change: HTML websites are difficult to
manage and stiff to change, and their content cannot easily be reused or
extracted. They are too inflexible to incorporate layout design changes easily.




2.4. Global Search Engine Web Server Model

The Global search engine has a database-based web server model. In this model,
content is given the highest importance. The world is content-rich, and content
storage and management should be given high importance in the design. Unlike the
existing scenario, in this model the content and its display format are completely
separated.

The Web is an unlimited collection of data that cannot be managed properly in a
bounded, document-based system. In this model, the content is stored in the kind
of database system that best suits the World Wide Web, such as one for
semi-structured data.

[Figure: a web browser sends a request over the Internet to the web server, which responds with an HTML page. The web server holds CGI, Servlet and JSP programs together with a style sheet repository, a DTD repository and an XML repository, and exchanges data with other systems' databases.]

Figure 2.5: Global Search engine web server model

Semi Structured Data

Semi-structured data is data that is neither raw data nor strictly typed as in a
conventional relational database system. It is often described as ‘self-describing’
or ‘schema-less’, terms that indicate that there is no separate description of the
structure or type of the data. The structure is irregular, implicit and partial.
Because semi-structured data is self-describing, its structure can be obtained by
some computation.




As shown in Figure 2.5, the Global search engine web server is integrated with an
XML database. In this model we have replaced the HTML repository with three
repositories: an XML repository, a DTD repository and a style sheet repository.
The XML repository contains the data of the website; no representation information
is stored in it. The DTD repository contains the DTD files used for validating the
XML files. The style sheet repository contains the representation information used
to display the websites.

2.4.1 Site Data & Storage

Xml Repository

The XML repository contains XML documents. XML stores information in a
hierarchical format. XML documents are made up of storage units called entities,
which contain either parsed (#PCDATA) or unparsed data. XML provides a format that
can represent both simple and extremely complex information, and allows developers
to create their own vocabularies for describing information.

Style Sheet Repository

Although data and the storage of data are important, the rendering information
cannot be ignored completely. The rendering information for XML data can be
provided using the Extensible Stylesheet Language (XSL) or Cascading Style Sheets
(CSS). In this web server model the display formatting information is held in the
style sheet repository. The XML repository has proper links to the style sheet
repository.

DTD Repository

The DTD repository contains DTD files for validating XML documents. A DTD contains
the rules for a particular type of XML document. A DTD describes elements: each
declaration consists of the text <!ELEMENT, followed by the name of the element,
followed by a description of the element's content. For instance:
<!ELEMENT brand (#PCDATA)>.




2.4.2 Advantages of Global Search Engine Web Server Model

This model has many advantages over the existing World Wide Web web server
model.

Structured Declaration: In this model, the information is stored in a
database-like system that can store all types of information in XML format and
allows data to be well structured.

Management of large amounts of data: Large amounts of data are now managed by the
database rather than the file system.

Separation of Content and Representation: The content and the display format are
stored separately with appropriate linking. This clearly separates the information
content from the representation detail, which helps in delivering more structured
information.

Dynamic generation and updating of Web pages: In this model, information is always
collected from the database, which can be kept up to date using well-understood
control and maintenance mechanisms.

Searching: Since the data is stored in a database system, searching capability can
be provided and customized. This model provides the flexibility to tune to any
searching mechanism.




Chapter 3 Requirement Specification of Search Engine

This chapter focuses on the requirement analysis of the search engine being
developed. It describes the functional and non-functional requirements of the
search engine and then presents the use case diagram, drawn using the Unified
Modeling Language (UML).

UML is a standard language for writing software blueprints. It may be used to
visualize, specify, construct and document the artifacts of a software system. The
right set of diagrams has to be chosen to model the system, which increases the
chances of its success. For modeling a simple application, the use case view and
the design view are sufficient.

After the system has been specified, the next logical step is to present the
design of the system to be implemented. Chapter 5 discusses the high level design
of the search engine.

Requirements can be divided into two major types, functional and non-functional.

3.1. Functional Requirement

Functional requirements describe “what” the system should do. The following is a
list of the functionalities of the system.

1. Gather Module

This web interface module allows the user to register a website with the search
engine.




2. Indexer

It takes the list of URLs from the Gather Module, parses each XML document using
Xerces-2_7_1 [9] and extracts the information for each word. A record is prepared
from that information. Records are indexed using the compressed trie data
structure.

3. Simple Search

This is similar to a usual search engine for HTML documents. It performs the
search based on user-specified keywords, retrieves a list of records from the
database through the indexer, and displays the following information as the
search result:

• Keyword

• Semantic information about the keyword, as XML tags.

• URL of the XML website/document.

• Name of the associated Document Type Definition, and the lists of DTDs and tags.

• URL of the Document Type Definition.

4. Refine Search

The user can refine the search by selecting a tag from the tag list or a DTD from
the DTD list.

5. Ranking of Search Results

Frequency of keywords

One of the main rules in a ranking algorithm involves the location and frequency
of keywords on a web page [29]. The document containing the largest number of
occurrences of the search string is ranked highest.

Order of tags list

Search engines also check whether the search keywords appear near the top of a
web page [29]. This ordering is based on tag closeness: the nearer a tag is to the
search string in the document, the higher its rank.
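The frequency rule can be illustrated with a minimal sketch. It is written in Python for brevity and is not the thesis implementation; the `Hit` class, its field names and the sample URLs are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical search hit: one document that contains the keyword."""
    url: str
    frequency: int  # number of occurrences of the keyword in the document


def rank_by_frequency(hits):
    """Order hits by descending keyword frequency (ties keep their input order)."""
    return sorted(hits, key=lambda h: h.frequency, reverse=True)


hits = [Hit("a.xml", 2), Hit("b.xml", 7), Hit("c.xml", 4)]
ranked = rank_by_frequency(hits)
print([h.url for h in ranked])
```

Tag closeness could be added as a secondary sort key; it is left out here because the text does not define how closeness is measured.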




3.2. Use Cases diagram of Search engine

A use case diagram represents the sequences of actions performed by the users of
the software. It shows the relationships among a set of use cases and actors, as
shown in Figure 3.1. There are two main symbols: an actor is shown as a stick
person and a use case as an ellipse. Lines indicate which actor performs which use
cases.

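[Figure: use case diagram with the actors user, administrator and server process, and the use cases addurl, search, refinesearch, taglist, dtdlist, view index url, crawling, parsing, indexing and index file generated, connected by two <<include>> relationships and one <<extend>> relationship.]

Figure 3.1: Use Cases diagram of Search engine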

This use case diagram specifies the operations that can be performed by the user,
such as adding a web page for indexing, searching, and refining the search results
by tag list or DTD. It also shows the operations performed by the system
administrator.




3.3. Non-functional Requirements

Non-functional requirements describe “how well” the functional requirements of the
system are satisfied. The user can judge this “how well” in terms of the
characteristics that he is concerned with.

All user interaction with the search engine is conducted via a GUI, which must
therefore meet the demands of the user. From the user's point of view the
requirements are:

1. Conformance to standard: The GUI conforms to web browser look-and-feel
guidelines.

2. Response time: The response time of the system should not be more than 3
seconds.

3. Robustness: The system runs smoothly under normal circumstances, without
failing abruptly.

4. Performance: Using Java with Web service technology reduces bandwidth
consumption and makes the environment more reliable, available and safe.

5. Recovery: If the server crashes because of a power failure or for some other
reason, we are able to recover using a backup of the index file.

6. Reusability: Application components must be developed in a platform-independent
and portable language, for example Java.

7. Interfaces: The GUI is similar to that of the familiar search engine Google.

8. Usability: The tool is easy to use. It allows the user to operate it with very
little training.

Other requirements are ease of use, portability, maintainability, expandability
and system administration. The software is developed using advanced features of
J2EE.




3.4. Design Constraints

3.4.1 Hardware Requirements

Pentium III or above with 512 MB RAM, or a compatible system, for the machines
running client processes; Pentium IV or above with 1 GB RAM for the machine
running the server process.

3.4.2 Software Requirements

Clients:

• Browser: Internet Explorer, Mozilla Firefox.

Server:

• Linux or Solaris 10 Operating System

• Java Servlet Development Kit 2.0 (JSDK2.0)

• Java SE Development Kit (JDK) & Java JRE 1.5.x

• Apache Tomcat Server

Java Api:

• Xerces-2_7_1 [9] for parsing XML documents.

• JavaMail for sending mail.

• Porter stemmer for the stemming algorithm.




Chapter 4 Indexing Data Structure

This chapter introduces the indexing data structure used in the implementation of
the search engine. It describes the standard trie, the basic compressed trie and
the M-way trie. The end of the chapter presents the compressed trie data structure
used in the implementation of the search engine and explains its searching and
insertion algorithms.

4.1 Standard Trie

The trie (pronounced “try” and derived from the word retrieval) is a data
structure that uses the digits (characters) of the keys to organize and search
the dictionary. The standard trie for a set of strings S is an ordered tree in
which each node except the root is labeled with a character. The children of a
node are alphabetically ordered. The paths from the root to the external nodes
yield the strings of S. The height of the tree is the length of the longest
string [19].

Example: standard trie for the set of strings

S = {bear, bell, bid, bull, buy, sell, stock, stop}

[Figure: standard trie for S; the root has the children b and s, and each path from the root to an external node spells one string of S.]

Figure 4.1: Standard Trie [19]




A standard trie uses O(n) space and supports searches, insertions and deletions in
time O(dm), where:

n = total size of the strings in S

m = size of the string parameter of the operation

d = size of the alphabet
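A standard trie over the example set S can be sketched as follows. This is an illustrative Python sketch rather than the thesis code; the names `TrieNode`, `StandardTrie`, `insert` and `contains` are assumptions. It keeps the children in a hash map, so a lookup costs expected O(m) instead of the O(dm) obtained when the children of a node are scanned in order.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_end = False   # True if a string of S ends at this node


class StandardTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Walk down from the root, creating one node per character."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def contains(self, word):
        """Follow the characters of word; succeed only on a complete string."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end


trie = StandardTrie()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    trie.insert(w)
print(trie.contains("bell"), trie.contains("be"))
```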

4.2. Basic Compressed Trie

It is obtained from the standard trie by compressing chains of “redundant” nodes.
It improves the space efficiency of tries by removing nodes with only one child.
Each internal node in a compressed trie has at least two children, and each
external node is associated with a string [14].

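[Figure: compressed trie for S; the root has the children b and s, and chains of single-child nodes are merged into substring labels such as ar, ll, id, ell, to and ck.]

Figure 4.2: Basic Compressed Trie [14]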

4.3. M-Way Trie

An M-way trie (M is the size of the alphabet) is a trie in which the root node
points to another node for each possible character with which a word may begin.
Each of these nodes, likewise, contains a pointer to a node for each possible
second character, and so forth. Each node on level k represents the set of all
keys that start with the same sequence of k characters; this node specifies an
M-way branch, depending on the (k + 1)st character of a key [15].




Example of M-Way Trie

A trie representation for the five elements

951-94-1654, 562-44-2169, 271-16-3624, 278-49-1515, 951-23-7625

results in a trie structure with 10-way branching, as shown in Figure 4.3. The
trie employs two types of nodes: element nodes and branch nodes. Each branch node
has 10 child fields. These fields, child[0:9], have been labeled 0, 1, ..., 9 for
the root node of Figure 4.3. root.child[i] points to the root of a subtrie that
contains all elements whose first digit is i. In Figure 4.3, nodes A, B, D, E, F
and I are branch nodes. The remaining nodes, C, G, H, J and K, are element
nodes [10].

Figure 4.3: basic 10-Way Trie [10]
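The 10-way trie of the example can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions: the names `Branch`, `digits`, `insert` and `search` are hypothetical, an element node is represented simply by the key string, and all keys are assumed to have the same number of digits.

```python
class Branch:
    """Branch node: child[i] covers the keys whose next digit is i."""
    def __init__(self):
        self.child = [None] * 10


def digits(key):
    """Digits of a key such as '951-94-1654', ignoring the dashes."""
    return [int(c) for c in key if c.isdigit()]


def insert(branch, key, depth=0):
    """Insert key below a branch node that sits at the given depth."""
    d = digits(key)[depth]
    slot = branch.child[d]
    if slot is None:
        branch.child[d] = key                 # new element node
    elif isinstance(slot, Branch):
        insert(slot, key, depth + 1)
    elif slot != key:
        # Two keys share this prefix: replace the element node by a branch
        # node and push both keys one level down.
        new_branch = Branch()
        branch.child[d] = new_branch
        insert(new_branch, slot, depth + 1)
        insert(new_branch, key, depth + 1)


def search(root, key):
    """Follow one digit per level until an element node or an empty slot."""
    node, depth, d = root, 0, digits(key)
    while isinstance(node, Branch):
        node = node.child[d[depth]]
        depth += 1
    return node == key


root = Branch()
for k in ["951-94-1654", "562-44-2169", "271-16-3624",
          "278-49-1515", "951-23-7625"]:
    insert(root, k)
print(search(root, "278-49-1515"), search(root, "278-49-1516"))
```

With the five keys of the example, this construction produces six branch nodes and five element nodes, which agrees with the node counts given for Figure 4.3.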




A basic M-way trie whose alphabet is the set of all English letters and digits is
a trie with 36-way branching.

Figure 4.4: Trie of 36 elements English alphabets and numbers

4.3.1 Drawbacks of M-Way Trie

The basic M-way trie structure is space-inefficient: storing a single word
sometimes requires creating a full 36-element array node for it. Although this
data structure utilizes space much better than tree or hashing index methods, we
need more compression. A search engine needs to store a lot of data, so we use
compressed tries, which are discussed below.




4.4. Indexer

The indexer is implemented with the help of the compressed trie [15, 17] data
structure. It divides the keyword database into two levels. The upper level
defines the indexing structure, and the lower level consists of the database that
holds the records, which are arranged in lexicographic order. The upper level
consists of level A nodes, each of which points to level B nodes organized as a
linked list. The lower level, i.e. the leaf level, is a doubly linked list of data
blocks.

[Figure: an A level node with the slots null, 0-9, A-E, F-K, L-Q, R-T and U-Z; each slot points to a linked list of B level nodes, which point to other A level nodes or into a doubly linked list of data blocks.]

Figure 4.5: Compressed Tries

A level node

The A level node is an array whose elements cover the 26 English letters and the
digits 0 to 9, divided into 6 ranges: ‘0-9’, ‘a-e’, ‘f-k’, ‘l-q’, ‘r-t’ and
‘u-z’, as shown in Figure 4.6. These array elements are pointers to level B nodes.

Range:    null      ‘0-9’     ‘a-e’     ‘f-k’     ‘l-q’     ‘r-t’     ‘u-z’
Pointer:  Bnode[0]  Bnode[1]  Bnode[2]  Bnode[3]  Bnode[4]  Bnode[5]  Bnode[6]

Figure 4.6: Anode Structure
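The mapping from a character to its slot in the A level node can be sketched as below. This is an illustrative Python sketch, not the thesis code; the names `RANGES` and `anode_slot` are hypothetical, and treating slot 0 (the null slot) as the slot for an exhausted key is an assumption, since the text does not state its purpose.

```python
# The six character ranges of the A level node, in slot order 1..6.
RANGES = [("0", "9"), ("a", "e"), ("f", "k"), ("l", "q"), ("r", "t"), ("u", "z")]


def anode_slot(ch):
    """Return the A level slot (1..6) for a character, or 0 for the null slot."""
    if not ch:
        return 0  # assumption: the null slot is used when no character is left
    c = ch.lower()
    for i, (lo, hi) in enumerate(RANGES, start=1):
        if lo <= c <= hi:
            return i
    raise ValueError(f"character {ch!r} is outside the indexed alphabet")


print(anode_slot("b"), anode_slot("S"), anode_slot("7"))
```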




B level node

The B level nodes distinguish the path for one attribute (i.e. character) from
another within the same range of the A level node. The B level consists of an
ordered linked list of nodes arranged in lexicographic order; these nodes are
created on the fly and are pointed to by the level A nodes. As shown in Figure
4.7, a Bnode contains the following fields: key, nextbnode, anode and block.

key Nextbnode Anode block

Figure 4.7: Bnode Structure

Data blocks

These are contiguous sets of blocks that hold the indexing keys and the records,
arranged in lexicographic order of the indexing key. Each block contains a header
that records, among other things, the total number of B level nodes pointing to
it. A data block contains the following fields: size, fileOffset, lowKey, highKey,
prevBlock, nextBlock, numOfRecs and numOfBnodes. The fileOffset tells where the
records are stored in the file, lowKey and highKey contain the smallest and
largest indexing keys of the block, and prevBlock and nextBlock contain the
addresses of the previous and next blocks.

size  numOfRecs  numOfBnodes  fileOffset  lowKey  highKey  nextBlock  prevBlock

Figure 4.8: DataBlock Structure

size         Size of the data block.
numOfRecs    Number of records present in the data block.
numOfBnodes  Number of Bnodes pointing to the data block.
fileOffset   FileSize + 1024 - FileSize % 1024
lowKey       Smallest indexing key of the block.
nextBlock    Address of the next immediate block.
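The fileOffset formula places each new block at the next 1024-byte boundary strictly beyond the current file size. A small Python sketch makes the arithmetic concrete; the names `BLOCK` and `next_block_offset` are hypothetical.

```python
BLOCK = 1024  # alignment unit, taken from the fileOffset formula above


def next_block_offset(file_size):
    """FileSize + 1024 - FileSize % 1024: the next 1024-byte boundary."""
    return file_size + BLOCK - file_size % BLOCK


for size in (0, 1500, 2048):
    print(size, next_block_offset(size))
```

Note that a file size that is already a multiple of 1024 still advances by a full block, for example from 2048 to 3072.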




Record

The records are stored in the index file index.dat. A record contains all the
information related to a key. A data block contains the offset addresses of these
records in the file. The structure of a record, with all the information it
contains, is shown in Figure 4.9.

Key doc dtd dtdname Frequency elts Size

Figure 4.9: Record Structure

Key        Word being stored.
Doc        URL of the document in which the word is present.
Dtdname    Name of the document type definition document.
Dtd        URL of the document type definition.
elts       Linked list of tags.
Frequency  Number of times the word occurs in the document.
Size       Size of the record.




4.4.1 Searching Algorithm

Steps:

1. To search for records with indexing key ‘X’, the search process starts from
the root, i.e. the level A node.

2. The search maps the first character of the indexing key to a field of the
level A node, i.e. to the character range containing it, and takes the address of
the B node linked list connected to that field.

3. The linked list of level B nodes is then traversed to find the node whose key
matches the character. If such a node is found, the search continues by following
its pointer to the next level; otherwise the search returns failure.

4. The next level associated with the level B node is checked, as shown in Figure
4.10. If it is a valid pointer to a level A node, the search continues by
repeating the above steps with the next character of the indexing key.

5. If a data block is reached, the key is searched for inside the data block. The
matching records are collected in a record list, which is returned as the search
result.

[Figure: search path through an A level node (slots NULL, 0-9, a-e, f-k, l-q, r-t, u-z), its linked list of B level nodes (b, c, e), a second A level node with B level nodes (r, t), and a data block.]

Figure 4.10: Search Compressed Tries
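The search steps can be sketched as follows. This is a simplified Python illustration, not the thesis implementation: the classes `ANode`, `BNode` and `DataBlock`, the helper `slot_of` and the sample records are assumptions, a data block is modeled as an in-memory dictionary instead of a region of index.dat, keys are assumed to be lower case, and a key that runs out of characters before a data block is reached is simply reported as not found.

```python
RANGES = ["09", "ae", "fk", "lq", "rt", "uz"]  # slots 1..6; slot 0 is the null slot


def slot_of(ch):
    """Map a character to its A level slot; unknown characters go to slot 0."""
    for i, (lo, hi) in enumerate(RANGES, start=1):
        if lo <= ch <= hi:
            return i
    return 0


class DataBlock:
    def __init__(self, records):
        self.records = records      # indexing key -> list of records


class BNode:
    def __init__(self, key, anode=None, block=None):
        self.key = key              # one character
        self.next_bnode = None      # next B node in the ordered linked list
        self.anode = anode          # next-level A node, or None
        self.block = block          # data block, or None


class ANode:
    def __init__(self):
        self.slots = [None] * 7     # heads of the B node linked lists


def search(root, key):
    anode = root                                   # step 1: start at the root
    for ch in key:
        bnode = anode.slots[slot_of(ch)]           # step 2: pick the range
        while bnode is not None and bnode.key != ch:
            bnode = bnode.next_bnode               # step 3: walk the B list
        if bnode is None:
            return []                              # no matching B node: failure
        if bnode.anode is not None:
            anode = bnode.anode                    # step 4: descend one level
        elif bnode.block is not None:
            return bnode.block.records.get(key, [])  # step 5: look in the block
        else:
            return []
    return []


# Tiny index: 'b' leads to a second A level node, 'c' points to a data block.
second = ANode()
second.slots[slot_of("e")] = BNode(
    "e", block=DataBlock({"bear": ["zoo.xml"], "bell": ["shop.xml"]}))
b = BNode("b", anode=second)
c = BNode("c", block=DataBlock({"cat": ["pets.xml"]}))
b.next_bnode = c
root = ANode()
root.slots[slot_of("b")] = b
print(search(root, "bell"), search(root, "bid"))
```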




4.4.2 The Insertion Algorithm

To insert a record with indexing key ‘X’, a search is first carried out according
to the search algorithm defined above. On completion, the search returns the last
level B and level A nodes reached, along with the current data block. The last
level B node is then checked to see whether it contains an address. This gives
rise to two cases, Case 1 and Case 2, discussed below.

Case 1

If the last level B node returned by the search process contains an address, it
is still not certain whether the current data block can hold the new record. This
gives rise to two further cases: Case 1.1, data block splitting, and Case 1.2,
creation of a new two-level A-B node structure.

Case 1.1

If the data block does not have enough space to accommodate the new record, the
data block must be split. The records are read from the data block into a record
list, and the new record is inserted into this list. Records from the list are
written back to the old data block until it is full. The remaining records are
written to a new data block, which is inserted to the right of the old data block.
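Data block splitting (Case 1.1) can be sketched as below. This Python sketch makes simplifying assumptions: a block is a sorted list of keys, its capacity is counted in records rather than bytes, and the name `insert_with_split` is hypothetical.

```python
import bisect


def insert_with_split(block, record, capacity):
    """Insert record into a sorted block; split when the capacity is exceeded.

    Returns (old_block, new_block); new_block is None when no split is needed.
    """
    records = list(block)             # read the records into a record list
    bisect.insort(records, record)    # insert the new record in sorted order
    if len(records) <= capacity:
        return records, None
    # Refill the old block until it is full; the rest goes to a new block that
    # is placed to the right of the old one.
    return records[:capacity], records[capacity:]


old, new = insert_with_split(["bear", "bid", "bull"], "bell", capacity=3)
print(old, new)
```

Because the overflow goes to the right-hand block, every key in the old block still precedes every key in the new block, so the lexicographic order of the data block list is preserved.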

Case 1.2

If all the records in the data block have the same prefix as the search path up to
and including the last level A node, and there is not enough space to accommodate
the new record in the data block, a new two-level A-B node structure is created,
and the splitting of level B nodes at this level occurs on the attribute
(character) just after the prefix.




Case 2

If the last level B node returned by the search process is null, the following
steps are followed.

Steps:

1. When the last level B node is null, a new B level node must be created. The
neighboring B level nodes are located, and the data blocks to which they point are
determined.

2. If they refer to the same data block, the record is simply put into that block,
and a new B level node is created to point to it. The record is inserted at the
correct position in the data block according to lexicographic order. As shown in
Figure 4.11, ‘b’ and ‘e’ are the adjacent B nodes of ‘c’ and point to the same
data block.

3. If they do not refer to the same block, the record is inserted into the data
block pointed to by the left neighboring B level node.

4. If the data block pointed to by the left neighboring B level node cannot
accommodate the new record without splitting, the insertion moves to the right
neighboring B level node. If that node's data block has sufficient space for the
new record, the record is simply placed in it; otherwise the block has to be
split.

[Figure: an A level node with a linked list of B level nodes (b, c, e, r, t) pointing to data blocks.]

Figure 4.11: Case 2 Insertion Compressed Tries




Case 2.1

If the neighboring level B nodes refer not to data blocks but to level A nodes, as
shown in Figure 4.12, the following steps are followed.

1. A new data block must be created for the insertion of the record.

2. The process begins from the right neighboring B level node, traversing its
level A node to locate the leftmost data block. Once found, the new data block is
inserted to the left of that leftmost data block in order to maintain
lexicographic order.

3. If no leftmost data block is present, the process begins from the left
neighboring B level node, traversing its level A node to locate the rightmost data
block. Once found, the new data block is inserted to the right of it.

[Figure: a linked list of B level nodes (b, c, e, r, t) under an A level node, with neighboring B level nodes pointing to further A level nodes.]

Figure 4.12: Case 2.1 Insertion Compressed Tries

4.5 Advantage of Compressed Tries

• To retrieve a data record, the number of comparisons does not depend on the
number of keys indexed; it depends on the length of the key.

• An insertion into the trie is localized and does not propagate to higher levels
of the indexing structure. Insertion only causes expansion of the trie
structure [26].

• Unsuccessful searches are determined quickly, and looking up keys is fast:
looking up a key of length m takes O(m) time in the worst case.




Chapter 5 High Level Design

This chapter describes the system architecture and the high level design of the
search engine. The high level design shows the whole process from the user
entering the search string to the display of results. It also represents the
entire process from the user registering his XML website to the indexing of that
website by the search engine.

5.1. System Architecture

The general architecture of the search engine is shown below. It shows the

relationship between the parts of the system.

[Figure: components WWW, Gatherer, XML parser (DOM tree), Indexer (indices), Query processor (query), Search processor, Ranker, Displayer and Results.]

Figure 5.1: System Architecture




At the top level, the search engine can be divided into three parts:

1. Gatherer

2. Indexer

3. Search processor

The high level design is shown in figure 5.2.

5.2. Gatherer (Module 1)

Web search engines work by storing information about a large number of web pages.
These pages are retrieved by a web crawler. Web crawlers (also called spiders or
robots) are computer programs that roam the Web and store links to, and
information about, each page they visit. These programs generally start with a
list of the best or most popular websites, follow the hyperlinks on those pages
and add the linked pages to the database. A crawler is mainly used to create a
copy of all the visited pages for later processing (indexing and retrieval) by a
search engine.

However, there is no web crawler for collecting XML websites. To collect web
pages, the Global search engine instead provides a GUI to the user. The user
registers a website by entering the URL of an XML website. The list of URLs is
stored in the file ‘urls.add’.

5.3. Indexer (Module 2)

The purpose of indexing is to process the documents to be indexed and to extract
appropriate information. This information is stored in a data structure that
allows fast searching of the text. The indexing process of the search engine is
done in two phases. The first phase starts with the gatherer module, from which it
takes the list of XML document URLs. It then parses these XML websites and
collects the information about each word. Stop word removal and stemming are also
done in the first phase. This information is passed to the second phase, i.e. the
indexing routine. The purpose of the indexing routine is to write the records into
the index file, i.e. to build the indexing structure.




[Figure: the Indexer pipeline runs from user-submitted URLs through the Gatherer, XML parser, stop word removal, stemming (find root word) and indexing routine to the indexing structure on disk. The Search Processor pipeline runs from the user's query keywords through the query processor, stop word removal, stemming (find root word) and searching routine to sorting in rank; the search results go to the user's browser, where XML and XSL are transformed into HTML prior to rendering.]

Figure 5.2: The High Level Design of Search Engine




5.3.1 Parsing of Xml document

The Xerces-2_7_1 parser [9] is used for parsing documents. The parser loads the
document into the computer's memory and creates the Document Object Model (DOM)
tree of the XML document. The DOM supports navigation in any direction (e.g., to
the parent and the previous sibling). From this DOM tree the following information
is extracted for each word:

• Key

• List of tags (semantic information about the keyword)

• DTD name

• URL of the XML document

• URL of the DTD
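The extraction of keywords and their enclosing tags from a DOM tree can be sketched as follows. The thesis uses the Xerces parser from Java; this illustration instead uses Python's standard `xml.dom.minidom` module, and the function name `extract_words` and the sample document are assumptions. The DTD name and URLs are not handled in this sketch.

```python
from xml.dom import minidom


def extract_words(xml_text):
    """Return (word, tag_path) pairs for every word in every text node."""
    doc = minidom.parseString(xml_text)
    pairs = []

    def walk(node, path):
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                walk(child, path + [child.tagName])    # descend with longer path
            elif child.nodeType == child.TEXT_NODE:
                for word in child.data.split():
                    pairs.append((word.lower(), list(path)))

    walk(doc, [])
    return pairs


sample = "<bib><book><title>XML Search</title><year>2006</year></book></bib>"
for word, tags in extract_words(sample):
    print(word, tags)
```

Each word is paired with the list of tags on the path from the document root, which corresponds to the "list of tags" stored with a key.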

5.3.2 Stop Word Removal

This step helps save system resources by eliminating from further processing, as well as from potential matching, those words that have little value in finding useful documents. A stop word list typically consists of those word classes known to convey little substantive meaning, such as conjunctions (and, but), articles (a, the), prepositions (in, over), interjections (oh), pronouns (he, it), and forms of the "to be" verb (is, are).

These words occur in almost every document of the language and therefore do not help in distinguishing between documents that are about different topics. For this reason, these words are removed and are not indexed.
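A minimal sketch of stop word filtering is given below. The stop list here contains only the example words mentioned above; the actual list used by the search engine is not given in this report, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: drops tokens that appear in a small illustrative stop list.
class StopWordFilter {

    // Assumption: a tiny list for illustration only, not the engine's real stop list.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "and", "but", "a", "the", "in", "over", "oh", "he", "it", "is", "are"));

    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(Arrays.asList("The", "university", "is", "in", "Allahabad")));
    }
}
```

A hash set is used so that each lookup takes constant time regardless of the length of the stop list.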

5.3.3 Word Stemming

Linguistic analysis is used to get the root form of a word. Search engines use stemming to compare the root form of the search terms with the root forms of the words in the documents in their database. Stemming removes word suffixes, recursively, in layer after layer of processing. For example, if the user enters "viewing" as the query, the search engine reduces the word to its root ("view") and returns all documents containing a word with that root, such as view, views, viewed and viewing. Because only suffixes are removed, words such as preview and review, which differ in their prefix, are not reduced to the same root.




The process has two goals. In terms of effectiveness, stemming improves recall by reducing all forms of a word to a base or stemmed form. In terms of efficiency, stemming reduces the number of unique words in the index, which in turn reduces the storage space required for the index and speeds up the search process. Stemming does, however, have a downside.

It may negatively affect precision, in that all forms of a stem will match when in fact a successful query for the user would have come from matching only the word form actually used in the query. There are several stemming algorithms, which differ with respect to accuracy and performance, e.g. the Paice/Husk, Porter, Lovins, Dawson and Krovetz stemming algorithms. We use the Porter stemming algorithm.
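The idea of suffix stripping can be illustrated with the deliberately simplified sketch below. It is not the Porter algorithm, which applies several ordered rule sets with conditions on the remaining stem; it only strips a few common suffixes, and the suffix list, the minimum stem length and the class name are assumptions made for the example.

```java
// Hypothetical, heavily simplified suffix stripper (NOT the Porter algorithm).
class SuffixStemmer {

    // Assumption: a handful of suffixes, tried in this order.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            // Keep at least three characters so that short words such as "is" or "red" survive.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"viewing", "viewed", "views", "preview"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Applying the same function to index terms and to query terms is what lets a query for one word form match documents containing another form.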

5.3.4 Index Routine

The index routine runs in the second phase of the indexer and builds the indexing structure. A record is prepared using the information extracted from the DOM tree. The record contains the key, the URL of the document, the DTD name, the URL of the DTD, and a linked list of the tags associated with this key. To insert the record, the routine first searches for the target block using the compressed tries algorithm.

After reaching that block, it checks whether the block can hold the record. If it can, the record is simply inserted; otherwise a new block is created and the record is inserted into it. To write the record into the block, the block is first read, the new record is added so that the records stay sorted, and finally the whole block is written back to the index file. More detail on the index routine is given in section 5.5, Indexing process.
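The "insert if the block has room, otherwise create a new block" step can be sketched in memory as follows. This is an illustration only: records are reduced to their keys, the capacity of three records per block is an arbitrary assumption, and file I/O and the trie lookup that selects the block are omitted. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory model of sorted insertion into fixed-capacity blocks.
class BlockInsertSketch {

    static final int BLOCK_CAPACITY = 3; // assumption: chosen small for illustration

    static class Block {
        final List<String> keys = new ArrayList<>(); // kept in lexicographic order
        Block next;

        boolean isFull() {
            return keys.size() >= BLOCK_CAPACITY;
        }
    }

    // Inserts the key into 'block' if it has room; otherwise links a new block
    // after it and inserts there. Returns the block that received the key.
    static Block insert(Block block, String key) {
        Block target = block;
        if (block.isFull()) {
            target = new Block();
            target.next = block.next;
            block.next = target;
        }
        int pos = Collections.binarySearch(target.keys, key);
        if (pos < 0) {
            pos = -pos - 1; // binarySearch encodes the insertion point as -(point) - 1
        }
        target.keys.add(pos, key);
        return target;
    }

    public static void main(String[] args) {
        Block first = new Block();
        for (String k : new String[] {"mnnit", "allahabad", "xml", "gate"}) {
            insert(first, k);
        }
        System.out.println(first.keys + " -> " + first.next.keys);
    }
}
```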




5.4. Search Processor (Module 3)

The search processor is the third part of the search engine. This is the program that sifts through the pages recorded in the index to find the matches to a search and ranks them in order of what it believes is most relevant. The search processor is implemented as servlet programs: one servlet for simple search and another for refine search. The interface of the program is an HTML form. When the form is submitted, the search processor takes the values from the form and performs the actual search in the compressed trie indexing structure. The searching process is divided into the following modules, which execute in the given sequence.

5.4.1 Query processor

The search processor gets a list of words from the HTML form that invoked it. The query processor takes this list and performs a syntax check on it. If there are syntax errors, it displays them. The query processor carries out its task in two steps: tokenizing and Boolean expression handling.

Tokenizing

As soon as a user inputs a query, the first task of the search processor is to extract the keywords from the user's input. The search processor uses a string tokenizer to tokenize the query stream, i.e., to break it down into understandable segments. A token is an alphanumeric string that occurs between white space and/or punctuation.

Boolean expression

The search processor checks for a Boolean expression (AND, OR) specified in the user query. After each keyword has been searched, the results are combined according to the Boolean expression given in the query and displayed to the user. It also checks that any ‘*’ in a keyword appears at the end of the keyword, where it is used for prefix matching.
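The two query processing steps can be sketched together as below, using java.util.StringTokenizer. The delimiter set, the treatment of OR as the default operator and the class names are assumptions made for the example; the report does not specify them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical query parser: tokenizes, detects AND/OR, validates a trailing '*'.
class QueryParser {

    static class ParsedQuery {
        final List<String> keywords = new ArrayList<>();
        boolean and = false; // true if the query contains AND; OR is assumed otherwise
    }

    static ParsedQuery parse(String query) {
        ParsedQuery parsed = new ParsedQuery();
        // Assumption: tokens are separated by white space, commas or semicolons.
        StringTokenizer st = new StringTokenizer(query, " \t\n\r,;");
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals("AND")) {
                parsed.and = true;
            } else if (!token.equals("OR")) {
                int star = token.indexOf('*');
                // Syntax check: '*' may only follow a non-empty prefix at the very end.
                if (star >= 0 && (star == 0 || star != token.length() - 1)) {
                    throw new IllegalArgumentException("misplaced '*' in keyword: " + token);
                }
                parsed.keywords.add(token.toLowerCase());
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        ParsedQuery q = parse("Allahabad AND inst*");
        System.out.println(q.keywords + " and=" + q.and);
    }
}
```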




5.4.2 Stop Word Removal and Stemming

The search processor removes the stop words from the query and searches only the remaining keywords, because this speeds up the search. The remaining keywords are then stemmed. Both steps are similar to the processes described above in the Indexer section.

5.4.3 Search routine

The search processor performs the actual search using the searching algorithm of the compressed trie data structure. The purpose of the search routine is to fetch the records from the indexing file and display them to the user. Details of the search routine are given below in section 5.6, Search Process.

5.4.4 Results Ranking and Display

After retrieving the list of records from the database, the search processor sorts the records by the frequency of the search string. The document containing the largest number of occurrences of the search string is ranked highest. The list of tags obtained after a simple search is also ordered. This ordering is based on tag closeness: the nearer the tag is to the search string in the document, the higher its rank.
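Frequency based ranking amounts to a descending sort on the occurrence count, as in the sketch below. The Record class here is a reduced, hypothetical stand-in with only a document URL and a frequency; tag closeness ordering is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical ranking helper: highest frequency first.
class FrequencyRanking {

    static class Record {
        final String doc;
        final int frequency;

        Record(String doc, int frequency) {
            this.doc = doc;
            this.frequency = frequency;
        }
    }

    // Returns a new list sorted by descending frequency; the input list is not modified.
    static List<Record> rank(List<Record> records) {
        List<Record> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingInt((Record r) -> r.frequency).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Record> ranked = rank(Arrays.asList(
                new Record("bank.xml", 1), new Record("mnnit.xml", 5), new Record("tourism.xml", 3)));
        for (Record r : ranked) {
            System.out.println(r.doc + " " + r.frequency);
        }
    }
}
```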

Display of simple search Result

After the lists of records have been sorted according to rank, a join operation is performed depending on whether the user wants “OR-ing” or “AND-ing”. For “OR-ing” the lists are simply concatenated. “AND-ing” is done by taking only those records that are common to the various lists of records. Finally, the record list obtained after the join operation is given as output.
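The two join operations can be sketched on plain lists as follows. Records are represented only by their document URLs, which is a simplification; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical join helper for per-keyword result lists.
class ResultJoin {

    // "OR-ing": the second list is appended to the first (duplicates are not removed).
    static List<String> orJoin(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // "AND-ing": only the entries of the first list that also occur in the second are kept.
    static List<String> andJoin(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("mnnit.xml", "bank.xml");
        List<String> second = Arrays.asList("bank.xml", "tourism.xml");
        System.out.println(orJoin(first, second));
        System.out.println(andJoin(first, second));
    }
}
```

Because the "AND" result preserves the order of the first list, any ranking already applied to that list is retained.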

Display of refine search Result

To refine the search result, the user selects a tag or DTD from the simple search result. Refinement is done simply by matching the selected tag or DTD name against the list of records returned by the simple search. Finally, all the records that match the tag or DTD selected by the user are joined into a record list and displayed to the user.
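Refinement is therefore a filter over the simple search result, as the following sketch shows. The Record class is a reduced, hypothetical stand-in holding only a document URL, a DTD name and a tag list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical refine step: keep records whose DTD name or tag list matches the selection.
class RefineFilter {

    static class Record {
        final String doc;
        final String dtdName;
        final List<String> tags;

        Record(String doc, String dtdName, List<String> tags) {
            this.doc = doc;
            this.dtdName = dtdName;
            this.tags = tags;
        }
    }

    static List<Record> refine(List<Record> simpleResult, String selected) {
        List<Record> refined = new ArrayList<>();
        for (Record r : simpleResult) {
            if (r.dtdName.equals(selected) || r.tags.contains(selected)) {
                refined.add(r);
            }
        }
        return refined;
    }

    public static void main(String[] args) {
        List<Record> result = Arrays.asList(
                new Record("mnnit.xml", "institute", Arrays.asList("mnnit", "institute")),
                new Record("bank.xml", "bank", Arrays.asList("bank", "branch")));
        for (Record r : refine(result, "institute")) {
            System.out.println(r.doc);
        }
    }
}
```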




5.5. Indexing process

The purpose of the index routine is to write records to the indexing file, i.e. index.dat. The functions involved are shown in figure 5.3, followed by a brief explanation of the indexing routine.

[Figure 5.3 (flow diagram). Labels in the diagram: Urls.add, Crawling, Checkurl, ParDoc, abc.xml, Stop word & Stemming, Makeindex, Index.insert (), Create new record, BlockSearch (), InsetData (data, block, rec), WriteBlock (rec, block), ReadBlock, readData, addRecord (), Writedata (record, offset), Index.dat.]

Figure 5.3: Indexing Process

The indexing routine starts from the Index.insert function. The Makeindex function sends the information for each keyword (key, list of tags, DTD name, URL of the XML document, and URL of the DTD) to the Index.insert function. Index.insert searches for the block by taking the first character of the key and creates a new record for that key. The insertdata function checks whether the block can hold the record. If the block is already full, a new block is created and the record is inserted into it.

To write the block to the indexing file, the ReadBlock function first reads the block using the readData function. After the records of the block have been read into a recordList, the new record is inserted into the recordList in lexicographic order. Finally, the WriteData function writes the whole block to the indexing file ‘index.dat’.




5.6. Search Process

The purpose of the search routine is to fetch the records from the indexing file and display them to the user. The functions involved are shown in figure 5.4, followed by a brief explanation of the search routine.

[Figure 5.4 (flow diagram). Labels in the diagram: init, doget, checkstr, Search_string, St = StringTokenizer(s), Stop word & Stemming, Index.search string () (Root, str.toLowercaser ()), Block search (root, st), ReadBlock (), ReadData (), Index.dat, RecordList r1, Sorting of records in recordList, Display Records on Browser, Index.dump, Blocks.dump. The diagram is divided into two phases: Fetching records and Displaying records.]

Figure 5.4: Search process

The searching process starts from the Index.SearchString function. The search processor uses a string tokenizer to tokenize the query stream and extract the keywords from the user's input. It then eliminates the stop words and stems the keywords.

The search routine starts from the BlockSearch function, which finds the address of the block where the key is stored using the searching algorithm. This address is passed to the ReadBlock function, which calls the ReadData function to read the records from the indexing file into a recordList. The recordList is then sorted according to rank. Finally, the result is displayed in sorted order in the browser.




5.7. Transformation Process of XML/XSL into Html

The transformation process can take place inside an XML-enabled browser. XSL is a language specifically designed for transforming the structure of an XML document. The transformation processor takes as input an XML document and the corresponding XSL style sheet, which contains the transformation rules. The transformation proceeds sequentially according to the instructions contained in the rules, as shown in figure 5.5.

Figure 5.5: Represents a fragment of the transformation [20]

Figure 5.5 shows an XSL style sheet for transforming the XML data into ordinary HTML. The style sheet specifies transformation rules. Each transformation rule contains a pattern and an action. The XML document’s natural parse tree structure is traversed to find nodes that match the pattern part of a rule. At matching nodes, the action part is used to derive a transformed sub-tree, which is attached at the current node. This process continues recursively until no patterns match.
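Although in this system the transformation takes place in the browser, the same pattern and action mechanism can be tried out on the server side with the standard javax.xml.transform API. The sketch below is only an illustration; the sample document, the style sheet and the class name XslToHtml are hypothetical and are not the ones shown in figure 5.5.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies an XSL style sheet to an XML document and returns the HTML.
class XslToHtml {

    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<books><book>Learning XML</book><book>Java and XML</book></books>";
        // Two rules: the pattern '/books' produces a list, the pattern 'book' produces one item.
        String xsl =
              "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='html'/>"
            + "<xsl:template match='/books'><ul><xsl:apply-templates select='book'/></ul></xsl:template>"
            + "<xsl:template match='book'><li><xsl:value-of select='.'/></li></xsl:template>"
            + "</xsl:stylesheet>";
        System.out.println(transform(xml, xsl));
    }
}
```

Each xsl:template element is one transformation rule: its match attribute is the pattern and its body is the action.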




Chapter 6 Detail level Design

This chapter presents the detailed design view of the system through class diagrams for structural modeling and sequence diagrams for behavioral modeling. The design was created using object-oriented principles and techniques. Wherever diagrams were needed, UML was used.

6.1. Package Diagram

A package diagram gives a way to organize large models and enforce a cleaner architecture. Packages are groups of related classes. The core software has more than two thousand lines of code, with 5 packages and 12 classes developed in the Java programming language, and is used by the web server to provide searching and indexing.

Figure 6.1: Package Diagram

As shown in figure 6.1, the Global search engine contains the index engine, search processor, web pages, parser and stemming packages, which together implement the whole functionality of the search engine. The classes present in each package are discussed below in detail.




6.2. Class Diagram

Class diagrams are used for the structural modeling of the system. Classes are depicted as boxes with three sections: the top one indicates the name of the class, the middle one lists the attributes of the class, and the third one lists the methods.

Figure 6.2: class diagram

This class diagram represents the interaction between the core classes that are used in the implementation of the search engine. Details of each class are given in the sub-sections below.




6.2.1 Indexing Package Classes

6.2.1.1 Class Anode

Class Anode implements indexing level A of the compressed trie. Level A is an array of elements, which correspond to character ranges. These array elements contain pointers (addresses) to the linked lists of B level nodes. The class contains functions such as getBnode () for getting the address of a Bnode and insertBnode () for inserting a Bnode in the trie.

Figure 6.3: class Anode

6.2.1.2 Class Bnode

Class Bnode implements indexing level B of the compressed trie. Class Bnode has the attributes key, nextbnode, anode and block, which contain respectively the key, the address of the next Bnode in the linked list, the address of the Anode, and the address of the block to which the Bnode points. It contains functions for getting and setting the values of these attributes.

Figure 6.4: class Bnode
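The relationship between the two levels can be sketched as below. This is a simplified illustration, not the thesis implementation: level A is assumed to have one slot per first letter (a to z), each B level list is assumed to be kept sorted by key, and a lookup is assumed to return the block of the last B node whose key is not greater than the search key. Blocks are represented by a string identifier, and all names are hypothetical.

```java
// Hypothetical two-level index: level A array -> linked lists of level B nodes -> blocks.
class TwoLevelIndexSketch {

    static class BNode {
        final String key;
        final String blockId; // stands in for the address of a data block
        BNode next;

        BNode(String key, String blockId) {
            this.key = key;
            this.blockId = blockId;
        }
    }

    static class ANode {
        private final BNode[] slots = new BNode[26]; // assumption: one slot per letter

        private static int slot(String key) {
            if (key.isEmpty()) {
                throw new IllegalArgumentException("empty key");
            }
            char c = Character.toLowerCase(key.charAt(0));
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("key must start with a letter: " + key);
            }
            return c - 'a';
        }

        // Inserts a B node so that the list of its slot stays sorted by key.
        void insertBnode(String key, String blockId) {
            String k = key.toLowerCase();
            int i = slot(k);
            BNode node = new BNode(k, blockId);
            if (slots[i] == null || slots[i].key.compareTo(k) > 0) {
                node.next = slots[i];
                slots[i] = node;
                return;
            }
            BNode cur = slots[i];
            while (cur.next != null && cur.next.key.compareTo(k) <= 0) {
                cur = cur.next;
            }
            node.next = cur.next;
            cur.next = node;
        }

        // Returns the block of the last B node whose key is <= the search key, or null.
        String findBlock(String key) {
            String k = key.toLowerCase();
            String found = null;
            for (BNode cur = slots[slot(k)]; cur != null && cur.key.compareTo(k) <= 0; cur = cur.next) {
                found = cur.blockId;
            }
            return found;
        }
    }

    public static void main(String[] args) {
        ANode root = new ANode();
        root.insertBnode("a", "block-1");
        root.insertBnode("am", "block-2");
        System.out.println(root.findBlock("Allahabad"));
        System.out.println(root.findBlock("apple"));
    }
}
```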




6.2.1.3 Class Block

The Block class has the attributes size, fileoffset, lowKey, highkey, prevBlock, nextBlock, numofRecs, and numofBnodes, which contain the header information of a data block. These attributes hold the size of the block, the file offset that tells where the block's records are located in the file, the smallest and largest indexing keys of the block, the addresses of the previous and next blocks, the number of records present in the block, and the number of Bnodes pointing to it. The class contains functions for getting and setting the values of these attributes.

Figure 6.5: Class Block

6.2.1.4 Class RecordList

The RecordList class contains a list of records. Its addRecord () function adds a record to the recordList in such a way that lexicographic order is maintained. The andList() function finds the records common to two record lists.

Figure 6.6: Class RecordList




6.2.1.5. Class Record

The Record class contains the basic attributes key, doc, dtd, dtdname, elts, frequency and size, which store the basic information of a record. It contains the functions setKey(), setDtd(), setDtdName(), setDoc(), setFreq() and setElts() to set the values of the record to be written to the indexing file, and corresponding get functions to retrieve the values read from the indexing file.

Figure 6.7: Class Record

6.2.1.6. Class Index

This class is the heart of the software; most of the functions that implement the compressed trie algorithm are present in this class. The function BlockSearch () finds the address of the block in which a record is to be stored at the right position. Other functions such as rightmostBlock (), leftmostBlock () and insertdata () are also present in this class.

Figure 6.8: Class Index




6.2.1.7. Class Data

The Data class provides the functionality for reading from and writing to the indexing file. The functions readBlock () and readdata () read from the index file. The functions writeBlock () and writeData () write to the index file. The function getNewoffset () returns the offset address for creating a new block.

Figure 6.9: Class Data

6.2.2 Parsing Package Class

6.2.2.1 Class Dom

Class Dom is used to parse the XML website. During parsing it extracts the following information from the file: dtdname, dtdurl, docurl and the semantic information, i.e. the list of tags. The function makeIndex() makes the record from that information. At the end it calls the InsertIndex() function for inserting the record.

Figure 6.10: Class Dom





6.2.9. Class Server

This class contains the main() method. When the administrator runs this class, the search engine takes the websites from the ‘urls.add’ file one by one and starts the indexing process. At the end it takes a dump of the whole index.

Figure 6.11: Class Server

6.2.10. Class AddURL

The AddURL servlet enables a user to register his XML website with the search engine.

Figure 6.12: Class AddURL

6.2.11. Class Simple

This servlet provides the simple search functionality.

Figure 6.13: simple servlet




6.2.12. Class Refine

This servlet provides the refine search functionality.

Figure 6.14: Refine servlet

6.3. Sequence Diagram

The interaction of the classes is shown as sequence diagrams. A sequence diagram represents behavioral modeling. The boxes across the top of the diagram represent objects, classes, actors, classifiers or their instances, or typically use cases. The dashed lines hanging from the boxes are called object lifelines, representing the life span of the object during the scenario being modeled.

The long, thin boxes on the lifelines are activation boxes, which indicate that processing is being performed by the target object/class to fulfill a message. Messages are indicated on UML sequence diagrams as labeled arrows; when the source and target of a message are objects or classes, the label is the signature of the method invoked in response to the message [30].




6.3.1 Sequence Diagram of Website Gather Module

The sequence diagram shows the main steps by which a user registers his page with the search engine. First of all, the user enters the URL address of his website and an email address using the Adurl.html page interface. This invokes the doGet() method of the Addurl servlet program. The servlet program verifies the URL address and the email. Finally it writes them into the ‘urls.add’ and indexpage.add files.

[Figure 6.15 (sequence diagram). Participants: user, Addurl.html, AddUrl Servlet, urls.add (urls to be indexed file), indexpage.add (list of all indexed urls). Messages:

1: xmlurl, email, add

2: doGet(xmlurl, email, add)

3: check(url)

4: check(email)

5: writeFile(urls, email)

6: writeFile(urls, email)]

Figure 6.15: Sequence Diagram website Gather Module




6.3.2 Sequence Diagram of Insertion Module

This sequence diagram shows the whole process of indexing an XML document. The process takes place through the messages and method calls shown in figure 6.16.

[Figure 6.16 (sequence diagram). Participants: Server, Anode, Dom, Index, Block, Record, Data, RecordList, index.dat file. Messages:

1: Anode( )

2: crawl(String, boolean)

3: Dom.DomMain(root, str1, true)

4: checkURL()

5: parseDoc()

6: makeIndex(index, root, dat, element, elts, insert)

7: insert(Anode, Data, String, LinkedList, String, String, String)

8: search(Anode, String)

9: Record( )

10: setDoc(String)

11: insertdata(Data, Block, Record)

12: writeBlock(Record, Block, Anode, int, RecordList)

13: readBlock(Block)

14: RecordList( )

15: readData(long, boolean)

16: Record( )

17: new File(index.dat)

18: fp.readLine()

19: setDoc(String)

20: (no label)

21: addRec(Record)

22: writeData(Record, long, boolean)]

Figure 6.16: Sequence diagram of insertion module




6.3.3 Sequence Diagram of Simple Search Module

The sequence diagram shows the main steps of the simple search module. It takes the input from the user and passes it to the simple servlet. The simple servlet interacts with various classes to perform the search process. At the end the result is displayed to the user.

[Figure 6.17 (sequence diagram). Participants: html page, simple servlet, Anode, index, Data, index.dat. Messages:

1: init()

2: Anode getroot()

3: root

4: search_str, add

5: doget()

6: check(str)

7: recordList searchString (root, str, and)

8: block search(root, str)

9: RecordList readBlock(block)

10: Record readData()

11: <<read data>>

12: RecordList

13: RecordList

14: display result]

Figure 6.17: Sequence diagram of Simple Search




6.3.4 Sequence diagram of Refine search Module

This sequence diagram shows the main steps of the refine search process. The refine search servlet gets its input from the simple search servlet. It then checks for the DTD/tag names in the record list returned by the simple search. The refine process takes place through the messages and method calls shown in figure 6.18.

[Figure 6.18 (sequence diagram). Participants: simple search, refineSearch servlet, Anode, index, Data, index.dat. Messages:

1: search_str, dtds/tags

2: getroot()

3: root

5: recordList searchString (root, str, and)

6: block search(root, str)

7: RecordList readBlock(block)

8: Record readData()

9: <<read data>>

10: RecordList

11: RecordList

12: check1(tags, RecordList)

13: check2(dtds, RecordList)

14: RefineRecordList

15: << display refine result >>]

Figure 6.18: Sequence diagram of Refine Search




Chapter 7 Result and Discussion

This chapter shows the results obtained, by providing snapshots of the search engine results.

7.1 User Interface

Figure 7.1: Snap shot of User Interface

Objective

Provides a graphical user interface to the user for searching.

How to use this page

Enter the search query keyword in the textbox provided.

If the user wants to find all the keywords, click on the "find all words" checkbox.

Submit the query by pressing the Global Search button.




7.2 Simple Search Result Page

Figure 7.2: Snap shot of Simple Search Result page of the keyword “Allahabad”

7.2.1 Visual Understanding of Result

The result page uses two colors of links. Blue links represent the hyperlinks to XML documents. Green links represent the hyperlink to the document type definition (DTD) file. The numbers in green on the right hand side of the links represent the rank of the page.

DTDName represents the type of the document. Semantic information is represented using arrows; it shows where the keyword has come from. On the right hand side there are two lists: the upper one is the list of tags and the lower one is the list of DTD names. Select one of the elements from a list to refine the search result. Click on the numbers 1, 2, 3… to browse more result pages.

7.2.2 Analysis of Result

When a user performs a search, the result contains not only the list of URLs related to the keyword but also the semantic information of the keyword. The user also gets information about the types of documents and is able to refine the search by selecting a tag or DTD from the respective list.

When the keyword ‘Allahabad’ is entered in the search text box of the search engine, it displays the related XML websites that contain the searched word ‘Allahabad’. The result shows four types of documents (bank, university, up tourism and institute), as shown in figure 7.2. This gives the user the flexibility to choose their own search area to refine the search.

In the first link it shows the semantic information mnnit->institute->Allahabad, in the second link it shows the semantic information uptourisum->PlOfInterest->Kumbhmela->Allahabad, and so on. This semantic information is also helpful to users in understanding the result. Current HTML based search engines do not provide this information about the keyword.

The user can browse the document type definition link to gain familiarity with the structure of the document, which helps in the refinement of the search result.




7.3 Refine Search Result Page

Figure 7.3: Snap shot of Refine search result page

Figure 7.3 shows the result of the refine search based on the DTD selected from the list. When the keyword ‘Allahabad’ is entered in the search text box, the search engine displays all the XML websites that contain the keyword ‘Allahabad’. In the snapshot described above, Allahabad is associated with the university, bank, up tourism and institute document types. This gives the user the flexibility of choosing their own search area, e.g. if a user selects ‘institute’ from the DTD list, the search engine presents the names of the institutes in Allahabad, i.e. mnnit and iiita.




7.4 Simple Search Result Page of Searching the Keyword “Xml”

Result of searching the keyword “xml”

Figure 7.4: Snap shot of searching the keyword “xml”

7.4.1 Analysis of result

When the keyword ‘XML’ is entered in the search text area, the search engine displays links to the related XML websites that are currently registered with it.

The websites currently registered with the search engine are http://172.19.6.53:8080/learning_xml.xml, http://172.19.6.53:8080/processing_xml.xml and http://172.19.6.53:8080/java_xml.xml. These are all of the XML book type. That is why the List of Dtds shows only one type of document, xmlbook.




If a few more sites are registered with the search engine, some related to the XMLTutorial and XMLResearch fields, then when the user searches for the keyword “xml”, the List of Dtds shows three types of documents, i.e. xml book, Tutorial and Research. The result is shown in figure 7.5. The user can refine the result according to the field of interest.

Figure 7.5: Search result of keyword “xml” after registering a few more websites related to the xml domain.




7.4.2 Refine Search Result

If the user selects ‘XMLResearch’ from the List of Dtds, the search engine presents only those websites which are related to the XML Research field.

Figure 7.6: Snap shot of refine search result keyword “xml“




7.5 Simple Search Result page of searching the keyword “Gate”

Figure 7.7: Snap shot of simple search result of the keyword “Gate”




7.6 Procedure of Registering an XML website

User interface for registering a website with the search engine.

Figure 7.8: Snap shot for Registering Xml website to Search engine

Objective

Provides a graphical user interface for registering an XML website with the Global Search engine.

How to use this page

The user enters the URL address of the website and an email address in the textboxes provided, then presses the submit button. To register a complete website, the user needs to register each individual page. An email will be sent to the user once the website page has been indexed by the search engine.




7.7 Procedure of Indexing a Website

Administrator steps for indexing a website

Before indexing, start the following processes from a Linux command shell.

Step 1: Run Apache Web Server using the commands shown in figure 7.9.

Figure 7.9: Command to Run Apache Web Server

Step 2: Start the servletrunner using the commands shown in figure 7.10.

Figure 7.10: Command to Run Servlet

Step 3: Run the server class file using the commands shown in figure 7.11.

Figure 7.11: Command to Run Server




Chapter 8 Conclusions and future work

8.1 Conclusion

In this thesis work a new search engine has been proposed for searching websites in XML/XSL. In contrast to existing search engines, it provides two levels of search: basic search and refine search. The basic search is similar to that of a conventional HTML search engine, but because the websites are made in XML it also provides the semantic information of the keyword to the user.

One more functionality comes in the refine level search, where the user can refine his search according to the DTD/tag information given to him. The beauty of the model lies in the fact that it frees the user from remembering the structure of XML documents and from writing sophisticated queries for searching XML documents.

In addition, an efficient compressed tries data structure is used to implement the indexer, whose properties include fast retrieval time, quick determination of unsuccessful searches, and finding the longest match to a given identifier.

8.2 Future Amendments

It is obvious that it is not possible to cover the whole functionality of a search engine. Functionality that can be provided in future includes the following. Currently the search engine does not include search for synonyms; this can be added so that it searches for the original words as well as their synonyms. A query expansion feature can be provided. One more feature can be added: rather than returning the entire document as the search result, return only the relevant part of the web page.

Spell checking functionality can be added. The search engine described here does not support Indian languages; it should be extended to include them. An XML web crawler can be made to gather all the XML websites.




References

[1] Mark P. Sinka, David W. Corne “Towards Modernised and Web-Specific Stoplists for Web Document Analysis” IEEE/WIC International Conference on Web Intelligence, IEEE Computer Society, Page: 396, 2003

[2] W. B. Frakes “Stemming algorithms” Information retrieval: data structures and algorithms, Pages: 131 – 160, 1992.

[3] Marden, P.M, Jr. Munson, E.V. “Today's style sheet standards: the great vision blinded”, Computer, IEEE JNL Volume 32, Issue 11, Page(s): 123 – 125, Nov. 1999

[4] http://www.w3schools.com/xml/xml_whatis.asp

[5] S. Adler and Co. “Extensible Stylesheet Language (XSL) Version 1.0” W3C Working Draft, available at http://www.w3.org/TR/xsl/, 18 October 2000.

[6] Arne Andersson and Stefan Nilsson “Faster searching in tries and quadtrees An analysis of level compression” Springer Berlin / Heidelberg Volume 855, 1994

[7] J. Widom “Data Management for XML - Research Directions”, IEEE Data Engineering Bulletin, Special Issue on XML, Volume 22, Number 3, September 1999.

[8] Fang Yuan; Ya-Nan Hao; Ge Yu; “The study of key techniques in intelligent XML search engine” Machine Learning and Cybernetics International Conference Volume 2, Page(s): 1194 – 1197, 2004

[9] http://xerces.apache.org/

[10] Sartaj Sahni “Data Structures, Algorithms, & Applications in Java Tries” 1999 http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/tries.htm

[11] http://www.w3schools.com/dtd/dtd_intro.asp

[12] XQuery - Wikipedia http://en.wikipedia.org/wiki/XQuery

[13] http://www.comp.lancs.ac.uk/computing/research/stemming/general/index.htm

[14] http://ww3.algorithmdesign.net/handouts/Tries.pdf

[15] M. Al-Suwaiyel, E Horowitz “Algorithms for trie compaction” ACM Transactions on Database Systems (TODS) Volume 9, Issue 2, Pages: 243 - 263, June 1984

[16] Wood, L. “Programming the Web: the W3C DOM specification” Volume: 3, Issue: 1, pages 48 – 54, Jan.-Feb. 1999

[17] Kurt Maly “Compressed tries” Communications of the ACM, Volume 19, Issue 7, July 1976

[18] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, Dan Suciu, “XML-QL: A Query Language for XML”, W3C Notes: http://www.w3.org/TR/NOTE-xml-ql, August 1998.

[19] http://ww0.java4.datastructures.net/handouts/Tries.pdf

[20] Lionel Villard, Nabil Layaïda “XML Applications: An incremental XSLT transformation processor for XML document manipulation” Proceedings of the 11th international conference on World Wide Web WWW '02, Session: XML Applications, ACM Press, Pages: 474 - 485, 2002

[21] Tin Kam Ho “Fast identification of stop words for font learning and keyword spotting” Document Analysis and Recognition, ICDAR '99. 20-22 IEEE CNF Page(s): 333 – 336, Sept. 1999

[22] Arne Andersson, Stefan Nilsson “Improved behaviour of tries by adaptive branching” Information Processing Letters, Elsevier North-Holland, Inc. Volume 46, Issue 6, Pages: 295-300, Year of Publication: 1993

[23] Aleman-Meza, B. Halaschek-Weiner, C. Arpinar, I.B. Cartic Ramakrishnan Sheth, A.P. “Ranking complex relationships on the semantic Web” Internet Computing, IEEE JNL Volume 9, Issue 3, Page(s): 37 – 44, May-June 2005

[24] Stefan Nilsson and Matti Tikkanen “Implementation of dynamic compressed tries” Springer Berlin Volume 844, 1994

[25] Kerer, C. Kirda, E. Jazayeri, M. Kurmanowytsch “Building and managing XML/XSL-powered Web sites: an experience report” Computer Software and Applications Conference, IEEE CNF pp. 547 – 554, Oct. 2001.

[26] Manoj Malviya, “A Web-based Search Engine for Indian Languages”, http://www.cse.iitk.ac.in/research/mtech1997/9711112.ps.gz Dept. of CSE, Indian Institute of Technology, Kanpur.

[27] Kaplan, A. Lunn, “FlexXML: engineering a more flexible and adaptable web” Information Technology: Coding and Computing, 2001. IEEE CNF Page(s): 405 – 410, April 2001

[28] Angela Bonifati, Stefano Ceri “Comparative analysis of five XML query languages” ACM SIGMOD Record, Volume 29, Issue 1, March 2000

[29] Danny Sullivan “How Search Engines Rank Web Pages” http://searchenginewatch.com/showPage.html?page=2167961 March 15, 2007

[30] http://www.agilemodeling.com/artifacts/sequenceDiagram.htm




Appendix A Configuring the project

A.1 Configuring the project

Step 1: Create the Following Directory Structure

[root]# cd /dinesh/mtech/global/servlet
Put all servlet files (simple, refine, addurl etc.) here.

[root]# cd /dinesh/mtech/global/src
Put the Java class source code here.

[root]# cd /dinesh/mtech/global/index
The Index.dat and urls.add files will be generated here.

[root]# cd /usr/local/apache-tomcat-5.5.17/webapps/GlobalSearch
Put all the web pages under this Apache web server directory.

Step 2: Compile Source Code

In the second step compile the full source code. Before compiling the source code the administrator needs to include the path of the Java API (specified in chapter 2) in the CLASSPATH environment variable. Compile using the command

javac classname.java

Step 3: Run project

To run the project, start the following processes using Linux shell commands.

Step 1: Run Apache Web Server using the commands

[root]# export JAVA_HOME=/usr/java/jdk1.5.0_05/
[root]# /usr/local/apache-tomcat-5.5.17/bin/startup.sh

Step 2: Start the servletrunner using the following command

[root]# servletrunner -p 8084 -d /dinesh/mtech/global/servlet -r /dinesh/mtech/global/servlet -s /dinesh/mtech/global/servlet/servlet.properties

Surfing the sample site

http://localhost:8080/GlobalSearch/main.html
