ICON-BASED INTERFACE TO INTERNET FOR
LANGUAGE ILLITERATE PEOPLE
Thesis submitted to the
Indian Institute of Technology Kharagpur
for award of the degree
of
Master of Science (by Research)
by
Santa Maiti
Under the guidance of
Dr. Debasis Samanta
School of Information Technology
Indian Institute of Technology Kharagpur
Kharagpur - 721 302, India
May 2013
© 2013 Santa Maiti. All rights reserved.
CERTIFICATE OF APPROVAL
Date:
Certified that the thesis entitled Icon-Based Interface to Internet for Language
Illiterate People, submitted by Santa Maiti to the Indian Institute of Technology,
Kharagpur, for the award of the degree of Master of Science, has been accepted by the
external examiners and that the student has successfully defended the thesis in the
viva-voce examination held today.
(Member of DAC) (Member of DAC) (Member of DAC)
(Member of DAC) (Member of DAC) (Member of DAC)
(Supervisor)
(Internal Examiner) (Chairman)
CERTIFICATE
This is to certify that the thesis entitled Icon-Based Interface to Internet for
Language Illiterate People, submitted by Santa Maiti to the Indian Institute of
Technology Kharagpur, is a record of bona fide research work carried out under my
supervision, and I consider it worthy of consideration for the award of the degree of
Master of Science (by Research) of the Institute.
Date: 30/05/2013
Dr. Debasis Samanta
Associate Professor
School of Information Technology
Indian Institute of Technology Kharagpur
Kharagpur - 721 302, India
DECLARATION
I certify that
a. The work contained in the thesis is original and has been done by myself under the
general supervision of my supervisor.
b. The work has not been submitted to any other Institute for any degree or diploma.
c. I have followed the guidelines provided by the Institute in writing the thesis.
d. I have conformed to the norms and guidelines given in the Ethical Code of Conduct
of the Institute.
e. Whenever I have used materials (data, theoretical analysis, and text) from other
sources, I have given due credit to them by citing them in the text of the thesis
and giving their details in the references.
f. Whenever I have quoted written materials from other sources, I have put them
under quotation marks and given due credit to the sources by citing them and
giving required details in the references.
Santa Maiti
ACKNOWLEDGMENT
First and foremost, I wish to convey my deep sense of gratitude to my mentor, Prof.
Debasis Samanta. It has been my blessed opportunity to be his student. I appreciate
all his contributions of time, ideas, and vision in making my research experience
productive and memorable. I have learned humility, patience, and hard work from him.
I would like to thank Prof. Jayanta Mukhopadhyay, Head of SIT, for extending to me
all possible facilities to carry out the research work. I wish to thank all of my
departmental academic committee members, Prof. A. Gupta, Prof. C. R. Mandal, Prof.
S. Sural, Prof. S. K. Ghosh, Prof. K. S. Rao, Prof. S. Misra, and Prof. R. R. Sahay,
for their valuable suggestions during my research. I would also like to thank Prof.
P. Mitra for his valuable guidance in my research work. I sincerely remember the
support of the office staff: Mithun Da, Soma Di, Malay Da, Vinod Da, and others. I am
also grateful to all members of the School of Information Technology.
I owe my deepest gratitude to Somnath Da, Sayan Da, Debasish Da, Sankar Da,
Barik Da, and Maunendra Da for strengthening my research with constant moral support
and necessary guidance whenever required. I really learnt a lot from them. I wish
to convey my heartfelt thanks to Manoj Kumar Sharma, Soumalya Ghosh, Pradipta
Kumar Saha, Puspak Das, Jaya Krishna, Vinit Sinha, Satya Ranjan Das, Sudhamay
Maity, Ramu Reddy Vempada, Narendra NP, Krishnendu Ghosh, Partha De, Ruchira
Naskar, Tuhin Chakraborti, Tamoghna Ojha, Jayeeta Mukherjee, Arindam Dasgupta,
Soumya Maity, Nirnay Ghosh, and many more.
I wish to thank my friends for helping me get through the difficult times, and for
all the emotional support, entertainment, and caring they provided. I am greatly
indebted to my friends Rajasri Bandyopadhyay and Soumya Bhattacharya for their
constant inspiration. It is really very difficult to express in just a few words my
gratitude to my brother, who never teaches me anything but stays beside me whatever
I do, who never leads but shows all the paths to travel, and with whom I never fear
a fall. He is not rare but unique. Thank you for being my brother.
Lastly, and most importantly, I would like to thank my parents and other family
members for their continuous inspiration and moral support. I am deeply indebted to
them.
Santa Maiti
Abstract
With the goal of free knowledge distribution, a vast information repository is being
built up on the Web. People are able to access and share this information through the
Internet. It helps people to enrich their knowledge base as well as to get quick
suggestions for any problem. However, this opportunity is limited to language-literate
people, who can read, write, and comprehend a language, specifically English. This
work aims to develop an icon-based interaction system with which less educated or
uneducated people can search for and access information on the Internet.
To achieve our target, we first develop an icon-based interface with which the target
users can generate and fire queries in a search engine. We select the icon as the
interaction medium because it is language independent and easy to learn. It can also
be treated as a faster mechanism of communication, facilitating recognition rather
than recall. Towards the development of the icon-based interface, the major issues
addressed are deciding the domain and domain-related queries, preparing the icon
vocabulary, and icon management.
Usually, the search result returned by a search engine is large and may come from
different domains; not all results belong to the user's area of interest. Moreover,
the ranked representation of Web search results is incomprehensible to our target
users. As a way out, a clustering mechanism is advocated, by which similar Web pages
can be grouped together so that representation and extraction of information related
to the search query become easier. To achieve this, in this work we propose
preprocessing of Web pages, document feature extraction, an inter-document similarity
measure, and clustering of Web pages.
Further, for representing the information, it is neither worthwhile nor possible to
represent all Web pages' information in terms of icons. Instead, salient information
is mined according to predefined patterns related to the user's query, and the
selected information is displayed to the user in terms of icons. The subtasks
addressed in this regard are building an entity-attribute model, query recognition,
potential answer modelling, and attribute value extraction.
A few experiments have been performed to check the effectiveness of the proposed
methodology. We evaluated the proposed icon-based interface with respect to user
friendliness and efficiency in query generation; the results reveal that the
developed interface is around 79% effective in generating search queries. A
comparison of our proposed clustering algorithm with benchmark clustering algorithms
shows that the proposed algorithm provides an optimal solution by balancing cluster
quality and time complexity. Finally, we check the comprehensibility of iconic
messages. The experimental results substantiate the efficacy of the proposal.
To the best of our knowledge, the proposed icon-based interaction for accessing
information from the Internet is the first of its kind. It alleviates the digital
divide between privileged and unprivileged users of the Internet. Moreover, the
proposed interaction mechanism can be reconfigured for use by motor-impaired users.
Keywords: Icon-based user interface, language-independent communication,
human-computer interaction, information retrieval, document clustering, clustering
algorithm, icon-based information representation.
Contents
Approval i
Certificate iii
Declaration v
Dedication vii
Acknowledgment ix
Abstract xi
Contents xv
List of Figures xvii
List of Tables xix
List of Symbols and Abbreviations xxi
1 Introduction 1
1.1 Need and Urgency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Scope of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Objective of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Plan of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Related Work 13
2.1 Icon-Based Interface: State of Art . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Use of Icon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Icon Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Icon-Based Applications . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Document Clustering Techniques . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Basic Clustering Techniques . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Other Search Result Clustering Algorithms . . . . . . . . . . . . . 22
2.3 Question Answering Techniques . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Development of Icon-Based Interface 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Deciding Icon Vocabulary and Domain Related Queries . . . . . . 31
3.2.2 Maintaining Large Icon Repository . . . . . . . . . . . . . . . . . . 32
3.2.3 Icon Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.4 Developed Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Experiments and Experimental Results . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Experimental Setup and User Details . . . . . . . . . . . . . . . . . 39
3.3.2 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Clustering Web Search Results 43
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Preprocessing of Web Documents . . . . . . . . . . . . . . . . . . . 46
4.2.2 Document Feature Extraction . . . . . . . . . . . . . . . . . . . . . 47
4.2.3 Inter Document Similarity Measure . . . . . . . . . . . . . . . . . . 49
4.2.4 Our Proposed Clustering Algorithm . . . . . . . . . . . . . . . . . 50
4.3 Experiments and Experimental Results . . . . . . . . . . . . . . . . . . . . 58
4.3.1 Experimental Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.2 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.3 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Icon-Based Information Representation 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2.1 Building Supportive Model . . . . . . . . . . . . . . . . . . . . . . 67
5.2.2 Information Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.3 Icon-Based Knowledge Representation . . . . . . . . . . . . . . . . 77
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6 Conclusion and Future Work 81
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Contribution of Our Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Publications 87
References 89
List of Figures
1.1 Statistical Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
(a) Size of World Wide Web 1 . . . . . . . . . . . . . . . . . . . . . . . . 3
(b) Internet users 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
(c) World literacy statistics 3 . . . . . . . . . . . . . . . . . . . . . . . . 3
(d) Web page language 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
(e) User classification . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 An overview of our work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 An overview of our approach . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 A snapshot of XML file . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Icon-based interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Example of query generation. . . . . . . . . . . . . . . . . . . . . . . . . . 36
(a) Selection of `weather' icon. . . . . . . . . . . . . . . . . . . . . . . . 36
(b) Display of `weather' icon. . . . . . . . . . . . . . . . . . . . . . . . . 36
(c) Move to `where' hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Example of query generation. . . . . . . . . . . . . . . . . . . . . . . . . . 37
(d) Enable hierarchy and select `India' icon. . . . . . . . . . . . . . . . . 37
(e) Disable hierarchy and select `Kolkata' icon. . . . . . . . . . . . . . . 37
(f) Move to `when' hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Example of query generation. . . . . . . . . . . . . . . . . . . . . . . . . . 38
(g) Enable hierarchy and select `month' icon. . . . . . . . . . . . . . . . 38
(h) Disable hierarchy and select `December' icon. . . . . . . . . . . . . . 38
(i) Query completion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Overview of our work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Overview of our proposed HK Clustering algorithm . . . . . . . . . . . . . 50
4.3 Illustration of the proposed HK Clustering algorithm. . . . . . . . . . . . . 55
4.4 Download time of documents . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Comparison of Dunn Index . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 Overview of the proposed approach . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Query template: Culture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Entity-attribute model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Visualisation of information related to Assam culture . . . . . . . . . . 78
List of Tables
3.1 Domain related word extraction . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Tourism related benchmark queries . . . . . . . . . . . . . . . . . . . . . . 33
3.3 User details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 User training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Interface testing result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Comparison of cluster quality . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Comparison of time complexity . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Web document download time with threading and without threading . . . 62
4.4 Determination of threshold values and number of feature vectors . . . . . 64
5.1 n-gram from prefix pattern . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 n-gram from suffix pattern . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Ranked candidate phrase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 Test result of visual representation . . . . . . . . . . . . . . . . . . . . . . 79
List of Symbols and Abbreviations
List of Symbols
α Dissimilarity threshold
β Merging similarity threshold
γ Belonging similarity threshold
CP Cluster Pool
Ctemp Dequeued cluster
DI Dunn Index
OldS Seed document at previous level
S Seed document at current level
Sim Similarity matrix
List of Abbreviations
AAC Augmentative and Alternative Communication
HCI Human Computer Interaction
LSI Latent Semantic Indexing
NLG Natural Language Generation
PDA Personal Digital Assistant
TDM Term-Document Matrix
Chapter 1
Introduction
With the advancement of information technology, the Internet has become an essential
part of every sphere of our lives. Every day, thousands of questions arise in the
human mind, and people face difficulties at their workplaces as well as in their
daily lives. Earlier, the only way to get help with a problem was to take advice from
expert or experienced persons. But since a single person cannot be an expert in every
field, and one person cannot be in contact with experts in all fields, such problems
remained unsolved most of the time. The key reason is the absence of a proper
communication medium. With the development of computers, the idea of the Internet was
introduced in 1973 to establish communication between computers [70]. It offered a
new way of information exchange. Next, the idea of the World Wide Web (WWW) was
proposed to make any information available anywhere. Gradually, a vast information
repository has been built up. At present, the WWW is a global information medium
through which users can access and share information via computers connected to the
Internet. More formally, the WWW is a system of interlinked hypertext documents
accessed via the Internet through a Web browser. The Internet and the WWW have thus
enabled people to obtain their desired information in the form of text, images,
videos, etc., and to navigate between them via hyperlinks [95]. Still, the
opportunity is limited to educated people who can read, write, and comprehend a
language, specifically English. This work aims to make Web information accessible to
language-illiterate and semi-illiterate people.
The rest of the chapter is organized as follows. In Section 1.1, the need and urgency
of the work and the corresponding challenges are discussed, and some related works
are explored in this regard. The scope of this work is discussed in Section 1.2.
Section 1.3 describes the objective of our work, and our work plan is presented in
Section 1.4.
Finally, the contributions and the outline of the thesis are presented in Section 1.5
and Section 1.6.
1.1 Need and Urgency
At present, the Web repository works as a mine of information, and people share and
access this information through the Internet. The World Wide Web has expanded about
2000% since its inception and is doubling in size every six to ten months [35].
According to a recent survey1, the indexed Web contains 7.78 billion pages
(Figure 1.1a). However, the benefits of the Internet are limited only to educated
people. Recent statistics2 show that in the year 2011, per 100 inhabitants, the
estimated numbers of Internet users in developed and developing countries were 74 and
26, respectively (Figure 1.1b). One of the main reasons for such a difference is
language illiteracy (mainly English illiteracy). Traditionally, literacy is described
as the ability to read and write. Though literacy all over the world (Figure 1.1c) is
83.7%3, it remains a matter of concern for many developing countries, where
approximately 54.74% of the total population is literate4. Further, this literacy
refers to familiarity with native languages, which is not sufficient because 55.4% of
Web pages are in English5 (Figure 1.1d). So, to get the maximum advantage of the
Internet, a user should be English literate.
Presently, English is the most widely published language all over the world. Over 1.8
billion people use English as a first, second, or foreign language. It is an official
language in 52 countries as well as in many small colonies and territories.
Figure 1.1e depicts a Venn diagram of user classification with respect to language
familiarity, as far as users' ability to read (R), write (W), and speak (S) their
native (N) and English (E) languages is concerned. The users are classified as U1,
U2, U3, and U4. Users in U1 and U4 can read, write, and speak English and their
native language, respectively. Users in U3, a subset of U1, can speak English;
similarly, users in U2 (a superset of U4) can speak their own native language. So,
the users in U2-U4 and a major portion of U3 and U4 are our target users. Such people
cannot read, write, or understand English properly. We term such people
underprivileged or novice people with respect to computer and Internet use.
Typically, they are rickshaw pullers, porters, farmers, shopkeepers, gatekeepers,
etc.
1 The size of the World Wide Web, www.worldwidewebsize.com
2 ITU: Committed to connecting the world, www.itu.int
3 UNESCO, www.uis.unesco.org/FactSheets/Documents/FS16-2011-Literacy-EN.pdf
4 Human Development Reports, www.hdr.undp.org/en/reports
5 W3Techs, www.w3techs.com/technologies/overview/content_language/all
Figure 1.1: Statistical reports: (a) size of the World Wide Web1; (b) Internet
users2; (c) world literacy statistics3; (d) Web page language5; (e) user
classification.
In the current context, it is clear that the main issue is not scarcity of
information resources but the information access medium. Therefore, the challenge is
to bring Internet information within reach of underprivileged users.
• The very first problem is to determine the interaction mode that is suitable for
our target users. From the point of view of feasibility, we note that different
communication modes, such as speech and gesture, could be used to meet our objective.
But along with feasibility, we have to take care of several other issues such as cost
effectiveness, user friendliness, and adaptability. The interaction mode should
support an easy input mechanism as well as easy understanding of the output.
• The next challenge is to identify the information domain. A query, that is, a word
or a group of words generated by a user, is generally abstract in nature and may
imply multiple meanings in different contexts. A Web search engine, however, cannot
distinguish the context and hence retrieves a huge amount of information, not all of
which is related to the user's query need. Therefore, we have to identify information
related only to the user's area of interest.
• Another problem is that, in general, a Web page covers multiple topics, related
information, related links, etc. It is not worthwhile to give the entire Web page
content to our target users. The challenge is to distinguish between valuable, less
valuable, and non-valuable information.
A work on optimal audio-visual representations for illustrating concepts [83] checked
the comprehensibility of such representations for illiterate and semi-literate users.
This work was done in the health-care domain, and it reveals that richer audio-visual
information is not necessarily better understood by the target users. Another work
aimed to increase mobile usability for illiterate, and literate but novice, users
[82]. It considered three text-free interfaces for three different requirements: a
spoken dialog system, a graphical interface, and a live operator. The graphical user
interface was used in the context of mobile banking, but the developed interface was
not suitable for independent use by first-time users. In order to make users more
autonomous, a user interface was then designed such that even novice, illiterate
users require no intervention from anyone at all to use it. Two applications were
developed in this regard: one for job search for domestic laborers, and another for a
generic map that could be used for navigating a city. Our work addresses a different
issue: Internet accessibility for illiterate and semi-illiterate users. Previous
works for illiterate users preferred the mixed mode, that is, visual plus speech, for
interaction. But in all these cases, the amount of data or information exchanged is
quite small compared to the information obtained from search results, and most of the
implemented applications deal with a very limited vocabulary. In the case of the
speech mode, we would have to translate the information into all existing languages
to support users across the world. A work on searching the Internet using an iconic
interface [113] has been proposed for school-going children. There, the target users
have limited reading and spelling abilities, but they are not illiterate. The work
did not mention how to comprehend the searched information; perhaps it assumed that
the children can understand the information because they are literate. For our target
users, we have to filter only the important information according to their desired
query; otherwise, the large amount of information generally returned by the search
engine will confuse them.
1.2 Scope of Work
In order to achieve our goal, we can use different interaction modes and interacting
devices. Input can be given mainly through the motor control of the effectors
(fingers, voice, eyes, head and body position, etc.), and the responses of the system
are sensed through the various human senses (vision, hearing, touch, etc.). Based on
the interacting devices, interaction can be of three types: visual-based,
audio-based, and sensor-based interaction [57].
• Visual-based interaction includes facial expression analysis, body movement
tracking, gesture recognition, gaze detection, etc. It helps to recognize and analyze
human emotion as well as body language. This type of interaction is generally used in
augmentative and alternative communication (AAC) systems for quadriplegic people. It
can also be used in an intelligent system that considers user expression as input.
But it is really difficult to train our target users in body language; this
interaction type is more suitable for quadriplegic users.
• In the case of audio-based interaction, system interaction takes place via audio
signals such as speech, music, and different sound patterns. Audio signals can be
more trustworthy, helpful, and in some cases unique providers of information compared
to visual signals. Speech recognition, speaker recognition, auditory emotion
analysis, detection of human-made noises and signs (gasp, sigh, laugh, cry, etc.),
and musical interaction are some important areas in audio-based interaction systems.
Audio, specifically speech-based interaction, is quite helpful for our target users,
but for that we would have to make it compatible with the users' languages.
• The third type of interaction medium is sensor-based interaction, which uses
sensors ranging from primitive to very sophisticated ones: pen-based interaction,
mouse and keyboard, joysticks, motion-tracking sensors and digitizers, haptic
sensors, pressure sensors, and taste/smell sensors.
Among these three types of interaction media, the mouse and keyboard are the most
popular and widely used, and are known as cost-effective interacting devices. The
keyboard is generally used to generate text, and the mouse for positioning, pointing,
and drawing. We can use different alternative modes to give input to the search
engine, but for output we have very narrow options. We cannot use text as the medium
of front-end user interaction. On the other hand, search engines take text as input
and provide search results as text. So, we need an interpreter that converts the
search results into a user-understandable form. As an interpreter we could use
speech, gesture, icons, gaze, etc. Representing entire Web page resources in all
possible languages in text or speech form requires enormous effort, while
gesture-based interaction suffers from adaptability issues and requires sufficient
training. Therefore, we choose the icon as the medium of interaction for both input
and output.
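As a toy illustration of what such a front-end interpreter does in both directions,
the sketch below maps icon selections to a textual query and known result words back
to icons. The vocabulary and icon file names are hypothetical stand-ins, not the
thesis's actual icon database.

```python
# Toy sketch of a front-end interpreter between words and icon identifiers.
# The vocabulary below is an illustrative assumption, not the actual icon
# database used in this work.

WORD_TO_ICON = {
    "weather": "weather.png",
    "kolkata": "kolkata.png",
    "december": "december.png",
}
ICON_TO_WORD = {icon: word for word, icon in WORD_TO_ICON.items()}

def to_query(selected_icons):
    """Input side: a sequence of icon selections becomes a textual query."""
    return " ".join(ICON_TO_WORD[icon] for icon in selected_icons)

def to_icons(result_words):
    """Output side: known words in a search result are rendered as icons;
    words with no icon are dropped rather than shown as text."""
    return [WORD_TO_ICON[w] for w in result_words if w in WORD_TO_ICON]
```

For example, `to_query(["weather.png", "kolkata.png"])` yields the text query
"weather kolkata", and `to_icons(["cool", "weather"])` keeps only the icon for
"weather", illustrating why a sufficiently rich icon vocabulary matters.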
We choose the icon for interaction: the user gives input to the system by selecting
icons and understands the system's response by comprehending iconic messages. An icon
can be defined as a small graphic representation of some information or object. The
icon is selected for the following reasons.
• Language independent: Icons offer a language-independent interaction medium. Unlike
speech or text, they require only a single, unified, language-independent icon
database understandable by users from any background.
• Easy to learn: Text-based interaction requires a good understanding of the grammar
of the corresponding language, which is really hard for our target users. On the
other hand, iconic sense is easy to comprehend. We may note that icons are used in
public places to convey information or concepts to people irrespective of their
background knowledge.
• Recognition rather than recall: In icon-based communication, memorizing complex
language-dependent characters is not required; the user only needs to recognize the
word-related icon to interact with the system. Further, this recognition is easy
because, in general, an icon resembles the actual object.
• Faster and expressive: Icons offer a faster way of interaction compared to text
because of their expressiveness. A single icon is able to represent a word, a
sentence, or even a concept. For example, the icon of a handicapped person in a train
compartment implies that the compartment is reserved for handicapped people.
Using icons as an interaction medium also raises some potential problems. The main
problem with icons is their ambiguity: ambiguity or misrepresentation of an icon may
lead to major confusion. At present, icon standardization exists only in a few
domains with very specific requirements, such as computing and transport. The
scarcity of standard icon databases is a major drawback in building any icon-based
system. In order to achieve our goal, we have used icons as the mode of interaction
and have tried to address the icon-related problems by building a domain-dependent
icon vocabulary. In our work, the icon is used as a front-end interpreter. Our
exhaustive literature survey reveals a few shortcomings in the practicability of
icons in general and iconic interfaces in particular. The various scopes related to
icon-based interaction are listed below.
• Icon management: Design strategies for icons in an interface have been explored
extensively, but the management of a large icon database still needs to be worked
out. Work on proper icon arrangement in the interface, to reduce visual search
overhead and the possibility of error, is not reported elsewhere.
• User friendliness: A desirable property of any interface is that it should be user
friendly, allowing the user to interact with minimum effort. Work on rate-enhancement
strategies in the context of iconic interfaces is scarcely reported. Semantic
dependency between icons and prediction methodology can be incorporated into the
interface to offer a faster and easier way of query generation. The feature of query
expansion can also be incorporated to offer the user similar and related query
options.
• Handling icon ambiguity: Icons are ambiguous in nature; a single icon can represent
different meanings in different contexts. Only a few works have addressed metaphoric
representation of icons and the implementation of rule-based, context-sensitive
disambiguation strategies.
• Searching the Web: Results retrieved from any search engine against a query are
huge and mixed. The existing format of search result presentation against a user
query is not user friendly, and this is becoming an important issue, particularly
with the exponential growth of the Web repository.
• Search result representation: Another shortcoming concerns the representation of
search results in a user-understandable form. To the best of our knowledge,
representing information in terms of icons has not been reported elsewhere.
1.3 Objective of the Thesis
We plan to develop an icon-based interface that helps the target users frame a
query. This query is fed to the Google search engine to obtain Web search results.
The vast set of retrieved results is then mined to extract concrete results. Finally, the
concrete results are transformed into iconic form. The objectives of our work are as
follows.
1. Development of an icon-based interface: We plan to develop an iconic interface to
access a certain range of information from the Internet.
2. Clustering Web search results: We propose an intermediate processing step that
returns search results using clustering. The proposed approach produces coherent
clusters, where the number of clusters is decided at run time. Our main target in this
phase is to obtain better cluster quality with less response time.
3. Icon-based information representation: In this part, we first mine a selected cluster
to obtain information. A mapping from text to icons is planned to represent search
results in a user-understandable form.
Figure 1.2 presents an overview of our work plan: an icon-based query is used to retrieve
search results, which are clustered and mined into a concrete result that is finally given a
visual representation as icon-based information.

Figure 1.2: An overview of our work
1.4 Plan of the Thesis
In order to achieve our target, the work plan is as follows.
• We plan to develop an icon-based interface for searching information on the Internet
by means of icon selection. Towards the development of the icon-based interface,
the first challenge is to decide the icon vocabulary. The icon vocabulary should
be optimal: it should not confuse the target user, and it should support all types
of queries and search results. The next task is to decide the domain-related queries that the
user will feed into the search engine. Another challenge is to manage a large set of
icons in a limited display area.
• It is neither worthwhile nor possible to represent all the information of the retrieved
Web pages in terms of icons. A large number of iconic messages may lead to major
confusion in message comprehension, so only the important information needs
to be mined and displayed. However, the search results returned by a search engine
are large in size and, in general, come from different domains; not all of them may
belong to the user's area of interest. Therefore, we propose clustering the Web
search results in order to group Web pages with similar content. Time and cluster-quality
optimization, pre-processing of Web pages, document feature extraction,
and inter-document similarity measures are also addressed in this regard.
• In this phase, mining of the clustered results and a mapping from text to icons are
planned to represent search results in a user-understandable form. This includes
several issues such as building an entity-attribute model, query recognition, potential-answer
modeling, attribute-value extraction, and icon sense disambiguation.
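At its simplest, the text-to-icon mapping planned above can be sketched as a keyword lookup against the icon vocabulary. The tourism words, icon identifiers, and the `map_to_icons` helper below are illustrative assumptions only, not the thesis implementation, which additionally requires entity-attribute modeling and icon sense disambiguation:

```python
# Hypothetical tourism-domain icon vocabulary: keyword -> icon identifier.
ICON_VOCABULARY = {
    "hotel": "icon_hotel",
    "train": "icon_train",
    "beach": "icon_beach",
    "temple": "icon_temple",
}

def map_to_icons(text):
    """Return icon identifiers for every vocabulary word found in `text`.
    A real system would add entity-attribute modeling and icon sense
    disambiguation on top of this plain lookup."""
    return [ICON_VOCABULARY[w] for w in text.lower().split()
            if w in ICON_VOCABULARY]

print(map_to_icons("Cheap hotel near the beach"))  # icons for "hotel" and "beach"
```

Words outside the vocabulary are simply dropped, which is why the vocabulary must cover the whole query domain, as discussed above.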
1.5 Contribution of this Thesis
The objective of the present work is to develop an icon-based interaction system for
underprivileged users. The following is a summary of the contributions made in this
regard. The very first problem is to determine the communication mode. From the
point of view of feasibility, we notice that different communication modes can be used to
meet the objective. To select one, we consider several features, e.g., cost effectiveness,
user friendliness, and adaptability. The other contributions of our work are listed next.
• Design of a system that supports an iconic mode of communication. Implementing a
robust system that supports all queries, independent of any domain, is a complex
task, so some realistic assumptions have been made for the implementation. The
physical interface supports tourism-domain queries. One important issue is the
selection of icons that suit the socio-cultural environment of the target users. Instead
of creating icons directly, we take existing icons from the Web and modify them as
required, because creating icons is itself a complex issue.
• Having chosen icons as the mode of communication, the next issue is to determine
the nature of the physical interface. The interface works in an interactive way. The
iconic mode is designed to work the same way as the word mode, but with less cognitive
load due to the presence of icons. Another contribution is the decision of the required,
relevant icon set covering all domain-related queries. The developed interface consists
of a selected set of organized icons.
• The next critical issue is the selection of a proper clustering algorithm. After a
thorough literature survey, we propose a new clustering algorithm to cluster Web
search results, optimizing both cluster quality and response time.
• To the best of our knowledge, no work has been done on text-to-iconic-message
conversion. We address this issue in our work.
1.6 Organization of the Thesis
This thesis contains six chapters, including this introductory chapter. This chapter
explains the current scenario and the need for the work, briefly describes different
interaction techniques, and presents the work proposal, the issues and challenges it raises,
the scope and objectives related to our goal, and the flow of the work.
Chapter 2 : Related Work
This chapter covers the state of the art for different icon-based interaction systems as well
as iconic interfaces. Next, we discuss the primary methods of clustering and the existing
clustering engines. Finally, some works related to question-answer mining are discussed.
Chapter 3 : Icon-Based Interface
This chapter discusses the steps involved in designing the icon-based interface. This
includes major issues such as deciding the domain-related queries, preparing the icon
vocabulary, and icon management. We evaluate the proposed interface with respect to its
efficiency in search query generation. The experiment, along with its results, is presented
in this chapter.
Chapter 4 : Clustering Web Search Results
This chapter deals with the pre-processing of Web search results. We propose a new
clustering methodology to obtain better cluster quality in affordable response time. Finally,
we discuss the efficacy of the proposed clustering mechanism with respect to other
clustering algorithms.
Chapter 5 : Icon-Based Information Representation
The process of mining the clustered results is discussed in this chapter. It also focuses on
the representation of mined information in a user-understandable iconic form. Experimental
results conclude the chapter.
Chapter 6 : Summary and Conclusion
In this chapter, we summarize our work and discuss its future scope.
Chapter 2
Related Work
Our proposed work covers three different research areas: developing an icon-based interface,
clustering Web search results, and icon-based representation of mined information. We
survey the literature of these areas as it relates to our work. First, we discuss the uses
of icons in different forms and in different fields. We also review the research in the
field of icon-based interfaces. The next discussion is in the area of clustering. Clustering
is one of the most efficient mechanisms for grouping similar objects without any advance
knowledge of the group definitions; it can be used to group Web pages with similar
content. We illustrate the various clustering procedures for clustering Web search results.
To the best of our knowledge, no work has been done on text-to-icon representation.
However, some work has been done on question answering, where the answer to a
particular question is extracted from structured or unstructured resources. We review the
literature in the field of question answering.
This chapter is organized as follows. In Section 2.1, we present the use of icons in
different forms in different applications. Several clustering techniques used for document
clustering as well as for Web search result clustering are described in Section 2.2.
Different strategies used for question answering are discussed in Section 2.3. Finally, we
summarize the reported works in Section 2.4.
2.1 Icon-Based Interface: State of the Art
In this section, we discuss the existing approaches dealing with icons, followed by the
state of the art on icon-based interfaces in different contexts.
2.1.1 Use of Icon
Recently, icons have been chosen as a primary mode of man-machine interaction for
their expressiveness. Most modern written languages are derived from pictorial languages,
but iconic languages seem to be a more recent invention. An experiment in the
Peruvian Amazon area indicates that pencil sketches and photos are important tools for
communication research and praxis^1, mainly for communicating about topics that are
difficult to speak about. Rogers discussed the usefulness of icons in interfaces [103].
Previous attempts to create international languages have not been very successful,
partly because of the need for a significant number of people to know them, and partly
because they have to be learned like any other new language [12]. Iconic communication
is the attempt to build cross-language communication systems that completely avoid the
use of words and rely solely on pictorial symbols. The earliest writing systems were
essentially pictographic in nature, such as ancient Egyptian hieroglyphs, in which a
vocabulary of about 700 different characters was used. Logograms, in which an icon represents
an actual object, are used in the Chinese language. In the 1960s, some interest was also
shown in the potential for a general international pictographic language, in which an
element is defined as a graphic representation; however, its decomposition as well as its
interpretation proved really difficult. Kolers identified technological applications as being
better suited to iconic representation, because the objects and functions involved were
ubiquitous to, and consistent between, cultures [62]. In a more limited but successful way,
symbols have been used in national and international signposting for public service
functions. This includes signs for highways and airports, electronics and packaging [6]. In
addition, international standardization of signs has been achieved in such contexts, as has
the procedure for their definition, proposal and evaluation (International Standards
Organization, 1979). One actual use of the pictographic representation of data can be found
in work on military battlefield displays, in which army units on the battlefield are
represented on a computer display [48, 61, 66]. In 2003, a set of standard map icons was
declared by the U.S. Government for emergency response applications and for sharing
information among emergency responders. This set of symbols has also been used by the
governments of Australia and New Zealand. Tatomir et al. have developed a set of icons
for constructing a map representing features such as crossing types and road blocks [41, 110].
^1 Pencils and Photos as Tools of Communicative Research and Praxis, http://academics.utep.edu/Portals/1800/Singhal-RattineFlaherty%20Article.pdf
2.1.2 Icon Taxonomy
The success of an iconic interface depends on the icon design strategy. The effects of icon
design on human-computer interaction are discussed in [16, 104], where icon characteristics
(semantic distance, concreteness, familiarity, and visual complexity) are investigated to
determine the speed and accuracy of icon identification. Icon metaphors, design alternatives,
display structures, implementation, and a summary of icon design guidelines are
addressed by Gittins in the context of icon-based human-computer interaction [48]. In iconic
interaction, ambiguities in message generation and interpretation need to be removed.
Abhishek et al. introduced a disambiguation strategy for ambiguous iconic environments
based on constraint satisfaction [1]. Icons can be classified in different ways, known as
taxonomies of icons. Different researchers have addressed icon taxonomy according to
their needs, although the basic idea is more or less the same. Lodding classified icons into
three categories: representational, abstract, and arbitrary. Representational icons were
described by Lodding and Blattner et al. as icons that can serve as an example for a
general class of objects. Lodding gave the image of a petrol pump, to represent a petrol
pump, as an example of a representational icon [17, 89, 115]. The same icon category is
introduced by other researchers as `associative' by Gittins, `nomic' by Gaver, `purely
pictographic' by Lindgaard et al., `resemblance' by Rogers, `pictorial' by Lodding and
Webb et al., `concrete' by Purchase, and `similar' by Lidwell et al., where the only
difference is in terminology [43, 48, 73, 75, 97, 103]. Lodding and Purchase described
`abstract' icons as icons that attempt to convey concepts rather than to display the object
itself; Lodding used the image of a broken glass to represent fragility. Later researchers
refer to this type of icon as `symbolic' (Lodding, Webb et al.), `mixed' (Lindgaard et al.),
`metaphorical' (Gaver), or `semi-abstract' (Blattner et al.). Rogers and Lindgaard et al.
divided this icon category into two sub-categories: `exemplar' (e.g., airplane for airport)
and `symbolic' (e.g., lightning for electricity). In the `arbitrary' (Lodding, Rogers, Lidwell
et al.) type of icon there is no intuitive connection between the icon and its referent. This
type of icon is also called `symbolic' (Gaver, Purchase), `purely symbolic' (Lindgaard et
al.), `key' (Gittins), `sign' (Lodding, Webb et al.), `abstract' (Blattner et al.), etc. Gittins
classified icons not only on the basis of type but also on form and color [48]. By form,
icons can be of two types: static and dynamic. Non-movable icons are considered static
icons, whereas animated icons are considered dynamic icons. On the basis of color,
Gittins classified icons into two types
: monochrome and color. According to Dinesh Katre (2007), icons can be classified on
30 different attributes and sub-options^1. Some of them are: detailing, dimension,
light-shadow, size, appearance, effects, and pixelation.
^1 Beware of style in icon design, www.hceye.org/HCInsight-KATRE22.htm
Nowadays, various types of icons are found in different computer environments and
applications: (1) as part of an operating system's desktop environment, such as Windows
XP, Macintosh, or Linux KDE; (2) as part of a specific computer application (within
software toolbars), such as Microsoft Word; and (3) within Internet websites or other
online applications, including website interfaces, forums, blogs, bulletin boards, and
Internet chat applications such as AOL Instant Messenger^1. Some analysis has also been
done on icon usability in the corresponding graphical interfaces in order to provide better
user-interface icons [121].
2.1.3 Icon-Based Applications
An icon can be interpreted by its perceivable form (syntax), by the relation between its
form and what it means (semantics), and by its use (pragmatics) [40]. In this way, icons
can also form a language, where each sentence is formed by a spatial arrangement of
icons [39]. A comparative study of natural language and the design of an iconic language
is given in [11].
Some work has been done towards man-machine interaction through iconic interfaces.
A hotel booking system, based on a form fill-up mechanism, has been designed for
booking hotel rooms by users from different linguistic backgrounds [124].
An iconic environment for programming support, namely HI-VISUAL, is introduced
by Hirakawa et al. and Miller et al. [52, 85].
Pictorial dialogue methods (Barker) are designed for pure person-to-person communication
[9]. CD-Icon, an iconic language based on conceptual dependency (Beardon), was
developed for composing messages by selecting options from a series of interconnected
screens (in the spirit of systemic grammar) [10]. A graphical chatting program called
visual messenger [26] is implemented in Java. Another iconic communication methodology
using a PDA, proposed by Fitrianie et al., is able to interpret and convert iconic
messages into (natural-language) text and speech in different languages [39].
The Elephant's Memory presents a playful learning environment for children^2. It
allows the user to build a visual message by combining symbols from its vocabulary.
A similar type of work has been done by Uden et al. for the youngest children starting to
read and write [113]; the work emphasizes understanding the mental models of children.
Clicker^3 is a writing support and multimedia tool for children of all abilities. It enables
one to write with whole words, phrases, or pictures. Clicker has a powerful graphics
feature with pictures, animations, and movies to illustrate a concept.
^1 Definition of Icon: Types of Icons, www.entity.cc/icon-types.php
^2 The Elephant's Memory, http://www.khm.de/ timot/PageElephant.html
^3 The clicker 5 guide, http://www.cricksoft.com/us/products/clicker/guide/Clicker5 guideus.pdf
Some icon-based Augmentative and Alternative Communication (AAC) systems have
been built for quadriplegic people. An AAC iconic system was developed by Albacete et
al. for people with significant speech and multiple impairments, based on the theory of
icon algebra and the theory of conceptual dependency [4]. Another mobile AAC system
has been proposed to let handicapped persons communicate with others in a free and
convenient manner; a predicate-prediction method is used to support faster interaction
as well as to satisfy the space limitation [67]. Sanyog, an iconic system for multilingual
communication for people with speech and motor impairments, was developed by
Bhattacharya et al. [15]. The Sanyog project initiates a dialog with the user to obtain the
different portions (e.g., subject, verb, predicate) of a sentence and automatically constructs
a grammatically correct sentence based on NLG techniques. Its intended users are children
suffering from cerebral palsy. Sanyog offers communication through icon-to-speech
conversion.
An icon-based interface for communicating in crisis situations on a PDA has been
developed for generating alarm or help messages by people of different backgrounds, roles,
and professions [38]. Work on optimal audio-visual representations for illiterate users of
computers helps illiterate and semi-literate users to express and understand information;
the proposed concept is implemented in the health domain [83].
An XML-based iconic communication system (SCILX) that enables communication
through the Internet has been proposed by Kuicheu et al. [64]. The approach has a formal
foundation based on formal grammars of icons. It allows an iconic sentence to be
translated into an XML document and vice versa.
The MinspeakTM iconic keyboard system, conceived by Baker, uses the principle of
semantic compaction [7]. It involves mapping concepts to multi-meaning iconic sentences
and using these icon sentences to retrieve messages stored in the memory of a
microcomputer. The stored messages can be words or word sequences. A built-in speech
synthesizer is used to generate the voice output. Over the past ten years, more than
20,000 MinspeakTM units have been distributed all over the world, and Swedish, German,
Italian, and other MinspeakTM systems have been developed. An interactive environment
for iconic language design has been proposed by Chang et al., based on the theory of icon
algebra, to derive the meaning of an iconic sentence^1.
^1 A Methodology and Interactive Environment for Iconic Language Design, www.cs.pitt.edu/ chang/365/mins.html
Most systems are based on linguistic theories, such as conceptual dependency theory
and Basic English [41]. Using these systems, a message can be composed by arranging
icons or by combining different symbols to compose new symbols with new meanings. The
arrangement can be realized in a rigid linear order [39, 124] or a non-linear order [40].
Only a few works produce text as the output of interpreting the visual representation.
Some systems are hard to learn and some are language-specific; they are based either on
overly complex linguistic theories or on non-intuitive icons. An iconic visual interlingua,
Visual Inter Language (VIL), based on the notion of simplified speech, is introduced in
[69]. VIL reduces the complexity of an iconic language significantly by avoiding inflection,
number, gender, tense markers, and articles.
2.2 Document Clustering Techniques
The next part of our literature survey relates to clustering. In this section, we present a
survey of existing approaches to document clustering, especially Web search result
clustering. A Web search engine generally returns thousands of pages in response
to a query, making it difficult for users to identify relevant information. Clustering
methods can be used to automatically group the retrieved results into a list of meaningful
categories. Clustering of Web documents is done on the basis of some similarity measure.
Similarity between Web pages usually means content-based similarity, which emphasizes
the content of a Web page instead of its embedded links. It is also possible to consider
link-based similarity and usage-based similarity. Link-based similarity is related to the
concept of co-citation and is primarily used for discovering a core set of Web pages
on a topic. Usage-based similarity tries to discover user navigation patterns from Web
data and useful information from the secondary data derived from the interactions
of users while surfing the Web [35]. Document clustering can be performed in advance
on the whole Web collection, before conducting any search; this is known as pre-retrieval
or offline clustering. Offline clustering produces directory-structured search results.
Dmoz, or ODP (Open Directory Project), is an example of such a human-edited directory
of the Web. Web directories are most often used to influence the output of a direct search
in response to common user queries. This method combines the best features of
query-based and category-based search. On the other hand, post-retrieval or online
clustering is performed only on the retrieved Web pages. As online clustering considers
only relevant Web pages (a subset of the vast Web repository) as input, it is faster and
produces superior results. There are two types of post-retrieval clustering. The clustering
system may re-rank the results and offer a new list to the user; in this case, the system
usually returns the items contained in one or more optimal clusters. Alternatively, the
clustering system groups the ranked results and gives the user the ability to choose the
groups of interest in an interactive manner [24].
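For the content-based similarity mentioned above, a standard choice is the cosine of term-frequency vectors. The helper below is a generic sketch of that measure, with a hypothetical `cosine_similarity` name; it is not the similarity measure defined later in the thesis:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical content scores (numerically) 1; disjoint content scores 0.
print(cosine_similarity("hotel beach goa", "hotel beach goa"))
print(cosine_similarity("hotel beach", "train schedule"))
```

Real Web-page comparison would first apply the pre-processing steps (tag stripping, stop-word removal, stemming) discussed in Chapter 4.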
2.2.1 Basic Clustering Techniques
In this section, we present a brief overview of two fundamental clustering approaches:
partitional and hierarchical clustering.
2.2.1.1 Partitional Clustering
Partitional clustering attempts to directly decompose the document set into a set of
disjoint clusters. The clustering algorithm emphasizes either the local structure of the
data, e.g., by assigning clusters to peaks in the probability density function, or the global
structure. Typically, the global criteria involve minimizing some measure of dissimilarity
within each cluster while maximizing the dissimilarity between different clusters.
k-Means clustering is the most common type of partitional clustering and is widely used
in the field of document clustering.
k-Means clustering: The algorithm was first proposed by Stuart Lloyd in 1957 as a
technique for pulse-code modulation [76]. k-Means is based on the idea of a centroid
being a good representative of a cluster. In this process, k documents are randomly
selected and considered the initial centroids. The rest of the documents are assigned to
their nearest centroids to generate the clusters. Next, the centroid of each cluster is
recomputed. The process of document assignment and centroid computation is repeated
until the centroids generated in two consecutive iterations are the same. The advantage
of the k-Means algorithm is that it is simple and fast for low-dimensional data. The
computational complexity of k-Means is O(i.k.m.n), where i is the number of iterations,
k the number of clusters, m the number of features or attributes, and n the number of
data points [77]. As for space complexity, both the data points and the centroids must be
stored, so the space complexity is O((k+n)m), that is, O(n) [84]. In fact, both complexities
are quite low compared to other clustering techniques. However, the number of clusters
must be predefined. We may note that k-Means is sensitive to outliers; medoid-based
methods can eliminate this problem [123]. A limitation of both of these partitional
schemes is that they are sensitive to the initial centroids, and neither can handle
non-globular data of different sizes and densities [71].
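The assignment-and-recomputation loop just described can be sketched in a few lines. This is a generic illustration on 2-D points with Euclidean distance, not the thesis implementation (which clusters documents):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(cluster):
    """Centroid (component-wise mean) of a non-empty list of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def k_means(points, k, iterations=100):
    """Basic k-Means: random initial centroids, then alternate assignment
    and centroid recomputation until the centroids stop changing."""
    centroids = random.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new = [mean(c) if c else centroids[i]  # update step
               for i, c in enumerate(clusters)]
        if new == centroids:                   # converged
            break
        centroids = new
    return clusters, centroids

random.seed(0)                                 # fixed seed for a repeatable demo
groups, _ = k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(sorted(sorted(g) for g in groups))       # two tight clusters
```

The `if c else centroids[i]` guard keeps an empty cluster's old centroid, a common practical workaround the text does not specify.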
Bisecting k-Means: Bisecting k-Means is an enhancement of the basic k-Means algorithm
in which a cluster is split into two sub-clusters in iterative steps until the desired
number of clusters is found. We start with a single cluster containing all documents. Next,
we select a cluster to split based on some criterion, such as the largest cluster at each step,
the cluster with the least overall similarity, or both. From the selected cluster, two
sub-clusters are generated with the help of the basic k-Means algorithm. This step is
repeated several times, taking the split that produces the clusters with the highest overall
similarity [108]. The whole process is repeated until the desired number of clusters is
produced. We may note that bisecting k-Means tends to produce clusters of relatively
uniform size, while regular k-Means is known to produce clusters of widely different
sizes. The complexity of bisecting k-Means is linear in the number of documents.
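The bisecting loop can be sketched as follows. For a deterministic illustration, the 2-way split here seeds its centroids with the farthest pair of points instead of random documents, and the largest cluster is always the one selected for splitting (one of the criteria mentioned above); both choices are simplifying assumptions:

```python
def d2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def two_means(points, iterations=100):
    """Split `points` in two with 2-centroid k-Means, seeded by the farthest
    pair of points (a deterministic stand-in for random seeding)."""
    cents = list(max(((p, q) for p in points for q in points),
                     key=lambda pq: d2(*pq)))
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for p in points:                       # assign to the nearer centroid
            halves[0 if d2(p, cents[0]) <= d2(p, cents[1]) else 1].append(p)
        new = [(sum(x for x, _ in h) / len(h), sum(y for _, y in h) / len(h))
               if h else cents[i] for i, h in enumerate(halves)]
        if new == cents:                       # converged
            break
        cents = new
    return [h for h in halves if h]

def bisecting_k_means(points, target_k):
    """Repeatedly bisect the largest cluster until target_k clusters exist.
    Assumes the points are distinct, so every bisection actually splits."""
    clusters = [list(points)]
    while len(clusters) < target_k:
        largest = max(clusters, key=len)       # selection criterion: largest cluster
        clusters.remove(largest)
        clusters.extend(two_means(largest))
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
print(sorted(sorted(c) for c in bisecting_k_means(pts, 3)))  # the three natural pairs
```

Unlike the version in [108], this sketch takes a single split per step rather than keeping the best of several trial splits.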
Fuzzy c-Means clustering and the QT clustering algorithm are variations of k-Means
clustering. Some other types of partitional clustering are locality-sensitive hashing and
graph-theoretic methods. It has been observed that linear-time clustering algorithms are
the best candidates to comply with the speed requirement of online clustering; these
include k-Means, Single-Pass, Buckshot, and Fractionation. An experimental analysis
of six different clustering techniques (k-Means, Single-Pass, Fractionation, Buckshot,
Suffix Tree, and AprioriAll) is given by Sambasivam and Theodosopoulos [105]. k-Means
clustering via principal component analysis is proposed by Ding and He [33].
2.2.1.2 Hierarchical Clustering
As an alternative to partitional clustering, hierarchical clustering produces a nested
sequence of partitions. There are two approaches to hierarchical clustering: agglomerative
(bottom-up) and divisive (top-down). In agglomerative hierarchical clustering [77], each
document is initially considered an individual cluster, and in each iterative step the closest
clusters are merged together. In contrast, divisive hierarchical clustering [49] considers
all documents as a single cluster at the initial stage and iteratively divides a cluster into
two or more child clusters. The decision of which two clusters should be combined (for
agglomerative), or where a cluster should be split (for divisive), depends on the distance
metric and the linkage criterion. Euclidean distance, Manhattan distance, cosine similarity,
etc. are different distance metrics used for clustering. The distance between any two
clusters can be decided using different linkage criteria; three commonly used ones are
single-linkage, complete-linkage, and average-linkage.
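The three distance metrics and three linkage criteria just listed can be written down directly; this is a generic sketch, not tied to the thesis implementation:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """1 - cosine similarity, so that smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def linkage(cluster_a, cluster_b, metric=euclidean, criterion="single"):
    """Inter-cluster distance under single-, complete-, or average-linkage."""
    dists = [metric(u, v) for u in cluster_a for v in cluster_b]
    if criterion == "single":       # distance of the closest pair
        return min(dists)
    if criterion == "complete":     # distance of the farthest pair
        return max(dists)
    return sum(dists) / len(dists)  # average-linkage

a, b = [(0, 0), (0, 1)], [(3, 0), (4, 0)]
print(linkage(a, b, criterion="single"))    # closest pair: (0,0)-(3,0)
print(linkage(a, b, criterion="complete"))  # farthest pair: (0,1)-(4,0)
```

The choice of criterion changes which clusters merge first: single-linkage tends to chain elongated clusters, while complete-linkage favors compact ones.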
Hierarchical agglomerative clustering: Generally, in the field of document clustering,
hierarchical clustering implies the agglomerative approach [77]. Hierarchical agglomerative
clustering [108] has four basic steps: cluster initialization and distance-matrix
preparation; merging the two nearest clusters; updating the distance matrix; and repetition.
In this process, each individual document is initially considered an individual cluster.
The distance between each pair of clusters is represented by a distance matrix. Based on
the distance matrix, the two closest clusters are merged together. The distance matrix is
then updated to reflect the pairwise distances between the new cluster and the previous
ones. Any of the linkage criteria can be used for the matrix update. This process is
repeated until a single cluster is produced. Hierarchical clustering has some advantages
and limitations. In general, the complexity of agglomerative clustering is O(n^3), where n
is the number of documents [77]. Use of a priority queue can reduce it to O(n^2 log n) [77],
and in a special case (e.g., single linkage) the complexity is O(n^2) [77]. Hierarchical
agglomerative clustering uses a distance matrix to keep the intra-cluster distances along
with the generated clusters, so the space complexity becomes O(n^2) [84]. Though the
time and space complexities are quite high, the quality of the clusters produced by
hierarchical clustering is better than that of k-Means. Another advantage is that the
number of clusters can be decided at run time depending on a threshold value. A drawback
of hierarchical clustering is that if a point is misclassified once, it cannot be corrected in
later steps, whereas k-Means offers iterative improvement.
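The four steps above can be sketched naively. The illustration below uses single-linkage on 1-D points and recomputes pairwise distances each round instead of maintaining an explicit distance matrix, a simplification made to keep the sketch short:

```python
def agglomerative(points, num_clusters):
    """Naive single-linkage agglomerative clustering on 1-D points: start
    from singleton clusters and repeatedly merge the closest pair."""
    clusters = [[p] for p in points]              # initialization: singletons
    while len(clusters) > num_clusters:
        best = None                               # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                            # merge the two closest clusters
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

print(sorted(sorted(c) for c in agglomerative([0, 1, 10, 11, 12], 2)))
```

Stopping when a threshold distance is exceeded, rather than at a fixed `num_clusters`, gives the run-time cluster-count decision mentioned above.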
Hierarchical divisive clustering: The hierarchical divisive clustering approach^1 consists
of four different steps: cluster initialization; selection of a cluster to split; splitting
the chosen cluster and replacing it with the newly generated sub-clusters; and repetition.
At the initial stage, all the documents are placed in a single cluster. Generally, the cluster
with the maximum diameter is chosen for splitting, where the diameter of a cluster is
computed as the largest dissimilarity between a pair of documents in that particular cluster.
There are different partitioning criteria [49], such as cut-based measures, enumerative
ones (e.g., the graph-coloring algorithm of Hansen and Delattre for minimum-diameter
partitioning), or cutting-plane ones (e.g., the branch-and-cut method of Grotschel and
Wakabayashi for clique partitioning). For example, with a cut-based measure a cluster is
split in such a way that the cost of the clustering (cutcost / intracosts) is minimized. Next,
the old cluster is replaced by the newly generated sub-clusters. This process is repeated
until every cluster comprises a single document. Hierarchical divisive clustering is
conceptually more complex than agglomerative clustering, since we need a second algorithm
as a "subroutine". In this approach, there are 2^(n-1) - 1 possibilities for splitting the
documents into two clusters (n being the number of documents in the cluster), which is
considerably larger than in the case of an agglomerative method^2. Variations of the
algorithm can reduce the number of splitting possibilities. In general, the complexity of
divisive clustering with an exhaustive search
^1 Clustering Algorithms: Divisive hierarchical and flat, http://www.cs.princeton.edu/courses/archive/spr08/cos435/Class_notes/clustering4.pdf
^2 Divisive Analysis (Diana), http://www.unesco.org/webworld/idams/advguide/Chapt7_1_5.htm
is O(2^n)^1. The space complexity is the same as for the agglomerative approach, that is,
O(n^2) [84]. Some variations of hierarchical clustering are Birch, Cure, and Chameleon [123].
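The divisive loop, with the maximum-diameter rule for choosing which cluster to split, can be sketched on 1-D data. The split rule used here (seed with the two extreme points and assign each point to the nearer seed) is a deliberate simplification of the enumerative and cut-based criteria discussed above:

```python
def diameter(cluster):
    """Largest pairwise dissimilarity; absolute difference on 1-D data."""
    return max(abs(a - b) for a in cluster for b in cluster)

def divisive(points, num_clusters):
    """Top-down clustering: keep splitting the widest cluster until
    num_clusters remain. Assumes distinct 1-D values."""
    clusters = [list(points)]                     # initialization: one big cluster
    while len(clusters) < num_clusters:
        widest = max(clusters, key=diameter)      # pick the cluster to split
        lo, hi = min(widest), max(widest)         # the two most dissimilar points
        left = [p for p in widest if abs(p - lo) <= abs(p - hi)]
        right = [p for p in widest if abs(p - lo) > abs(p - hi)]
        clusters.remove(widest)                   # replace it with its sub-clusters
        clusters.extend([left, right])
    return clusters

print(sorted(sorted(c) for c in divisive([0, 1, 10, 11, 12], 3)))
```

Because each step considers only one seeded split rather than all 2^(n-1) - 1 bipartitions, this sketch avoids the exhaustive-search cost at the price of possibly suboptimal splits.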
An efficient and effective algorithm for hierarchical classification of search results has
been proposed to produce hierarchically organized search results [56, 60]. Rather than
using a clustering technique, this approach employs a domain ontology in order to obtain
a better hierarchical classification [28, 60]. In [46], cluster labelling is achieved by
combining intra-cluster and inter-cluster term extraction based on a variant of the IG
measure. Semantic, hierarchical, online clustering of Web search results (SHOC) is
proposed by Zhang and Dong, using key phrases as natural-language information features
and a suffix array for key-phrase discovery [127]. In WISE (hierarchical soft clustering of
Web page search results based on Web content-mining techniques), documents are
represented by their most relevant key concepts [23].
2.2.2 Other Search Result Clustering Algorithms
According to [24], clustering algorithms can be classified into three categories: data-
centric, description-aware, and description-centric. Data-centric algorithms include con-
ventional clustering algorithms such as hierarchical, optimization and spectral ones. Some
examples of data-centric algorithms can be found in systems such as Lassi, WebCat,
AIsearch, Scatter/Gather and TRSC. The problem with data-centric algorithms is cluster
label description. Description-aware algorithms emphasize cluster label description. Suffix
Tree Clustering (STC) - Grouper, HSTC and SnakeT are examples of description-aware
algorithms. Description-centric algorithms are designed specifically for clustering search
results and take into account both the quality of the clustering and of the descriptions.
Vivisimo, Accumo, Clusterizer, Carrot Search, SRC, DisCover and the CREDO system [24]
are commercial search result clustering systems of this type.
Scatter-Gather: The earliest work on clustering search results was done by Pedersen,
Hearst et al. on the Scatter/Gather system [51]. The algorithms used in the Scatter/Gather
approach are Buckshot and Fractionation. Fractionation is considered more accurate, while
Buckshot is much faster, making it more suitable for searching in real time on the Web
(Cutting et al.). A divide-and-merge methodology for clustering [29] is proposed which
combines a top-down "divide" phase with a bottom-up "merge" phase. A two-pass approach
based on multilevel graph partitioning is introduced in [123], which includes a bottom-up
cluster-merging phase and a top-down refinement phase.
1. Hierarchical clustering - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Hierarchical_clustering
Suffix Tree Clustering: Clustering algorithms generally consider Web documents
as bags of words or as abstract concepts. Suffix tree clustering (STC) instead treats each
document as a collection of strings or phrases. Based on STC, Grouper was developed to
cluster Web search results labelled by phrases extracted from the snippets [125, 126]. STC
has also been implemented for the Polish language, with its complex inflection and
syntax [118]. For Chinese Web search results, Wang et al. proposed an interactive suffix
tree algorithm [117]. An improved suffix tree data structure offering a new base-cluster
combining algorithm with a new partial phrase join operation is proposed in [53] to
overcome the inadequacy of generating interrupted cluster labels due to the use of the
n-gram technique.
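As an illustrative sketch (assumed for exposition, not taken from the cited systems), the base-cluster step of STC can be approximated by treating every word n-gram shared by at least two snippets as a phrase that defines a base cluster of the documents containing it:

```python
# Assumed sketch of STC's base-cluster step: each phrase (word n-gram)
# shared by two or more snippets defines a base cluster.

def phrases(text, max_len=3):
    # Yield all word n-grams of length 1..max_len.
    words = text.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def base_clusters(snippets, max_len=3):
    by_phrase = {}
    for doc_id, text in enumerate(snippets):
        for p in set(phrases(text, max_len)):
            by_phrase.setdefault(p, set()).add(doc_id)
    # Keep only phrases shared by at least two documents.
    return {p: docs for p, docs in by_phrase.items() if len(docs) > 1}
```

A real STC implementation builds these base clusters from a generalized suffix tree in linear time and then merges overlapping base clusters; the dictionary above only illustrates the grouping criterion.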
Lingo: In the field of Web search result clustering, Lingo, a description-oriented
algorithm, is a breakthrough. The key idea of this method is to first discover meaningful
cluster labels and then, based on the labels, determine the actual content of the groups.
Cluster label discovery is accomplished with the use of the Latent Semantic Indexing
(LSI) [32] technique. Lingo basically consists of five main phases: preprocessing, feature
extraction, cluster label induction, cluster content discovery and final cluster formation.
The preprocessing phase converts the raw data into filtered form. In the second
phase, a suffix array is used to discover phrases and terms which are potentially capable of
explaining the verbal meaning behind the LSI-found abstract concepts. In the next phase,
cluster labels are finalised based on the singular value decomposition (SVD) of the
term-document matrix. Then documents are assigned to the proper cluster labels in the
cluster content discovery phase. Finally, cluster scores are calculated so that clusters can
be presented in the interface according to their score. Lingo produces better clusters than
Suffix Tree Clustering (STC), another benchmark algorithm used for Web search result
clustering [92]. The time complexity of Lingo is O(n^3), and its high number of matrix
transformations leads to larger memory requirements [94]. In spite of its high computational
complexity and space requirements, Lingo is well accepted for clustering search results, as it
produces better quality clusters by taking a semantic approach. [81] considers the whole
document for LSI and proposes dynamic SVD clustering to discover the optimal number of
singular values. [127] extends Lingo using WordNet and by adding semantic recognition to
the frequent extracted phrases. Syntactic clustering of the Web is also proposed by Broder
et al. [20].
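A much-simplified sketch of the label-induction idea follows. It is an assumed illustration: a full Lingo implementation performs a complete SVD of a weighted term-document matrix and matches phrase vectors against several abstract concepts, whereas here we only approximate the dominant left singular vector by power iteration and pick the single term that best explains that one abstract concept.

```python
# Assumed sketch of concept discovery behind Lingo's label induction:
# power iteration on M M^T approximates the dominant left singular
# vector of the term-document matrix M; the strongest term in that
# vector stands in for a cluster label.

def dominant_term(terms, matrix, iters=50):
    # matrix: one row per term (aligned with `terms`), one column per document.
    m = len(matrix)
    cols = len(matrix[0])
    v = [1.0] * m
    for _ in range(iters):
        # w = M (M^T v), i.e. one power-iteration step on M M^T.
        mt_v = [sum(matrix[i][j] * v[i] for i in range(m)) for j in range(cols)]
        w = [sum(matrix[i][j] * mt_v[j] for j in range(cols)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]
    # The term with the largest weight best "explains" the abstract concept.
    return terms[max(range(m), key=lambda i: abs(v[i]))]
```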
He et al. introduced Web document clustering using hyperlink structures [50]. Content
as well as link information is considered to improve information interpretation for
search result clustering [25, 88, 116]. A phrase-based method using hierarchical clustering
of Web snippets is proposed where documents are clustered according to phrase
similarity [72, 78]. [122] proposed a combination of query-based and phrase-based approaches.
RFCMdd, a robust fuzzy algorithm efficient at handling noise, is proposed in [54] based
on n-gram and vector space methods. Snippets remain unrelated because of their short
representation. In the Vector Space Model, it has been noticed that a single document is
usually represented by relatively few terms. A method of Web search result clustering
based on rough sets is proposed by Ngo and Nguyen, where tolerance classes are used to
approximate concepts existing in documents and to enrich the vector representation of
snippets to increase clustering performance [91].
Another clustering algorithm based on formal concept analysis is proposed to build
a two-level hierarchy for retrieved search results [129]. Two improved objective metrics
ANMI@K and ANCE@K are introduced to measure cluster quality. Similar work on
retrieval and clustering of Web resources based on pedagogical objectives is introduced
by Mayorga et al. [80].
A mechanism combining ranking and clustering is proposed in [35], which provides
ordered results in the form of clusters in accordance with the user's query. Term ranking
for clustering Web search results shows a different way of ranking terms using a variation
of the PageRank algorithm based on a relational graph representation [44].
Scuba Driver is proposed to help users better interpret the coverage of millions of
search results and to refine their search queries through a keyword-guided interface [45].
A recent work in [14] aims to overcome the mismatch between the information most
wanted and the information retrieved.
Some other clustering approaches have been proposed using a transduction-based relevance
model [79, 120], a label language model [68], temporal information [5, 22], approximate
matrix factorisation [93], query term expansion [112], topological trees [42], randomized
partitioning of query-induced sub-graphs [19], heuristic search in the Web graph [13],
compression [18], and word sense communities in the extracted keyword network [27, 90],
as well as density-based (GDBSCAN), grid-based (OptiGrid) and probabilistic approaches.
[111], based on c-Means fuzzy clustering, gives the idea of cluster visualization.
Commercial search engines: The systems that perform clustering of Web search
results, also known as clustering engines, have become popular in recent years. The first
commercial clustering engine was probably Northern Light, in 1996. It was based on a
predefined set of categories to which the search results were assigned. A major break-
through was then made by Vivisimo [24], whose clusters and cluster labels were dynam-
ically generated from the search results. In recent times, several commercial clustering
engines have been launched in the market [24, 56], namely Grouper (1997) [126], WISE
(2002) [23], Carrot (2003) [92], WebCat (2003) [47], AISearch (2004)1, SnakeT (2005)2,
Quintura (2005)3, WebClust (2006)4, YIPPY (2009)5, etc. [37, 96] provide metrics
for quantitative comparison of cluster quality using external and internal measures.
2.3 Question Answering Techniques
In order to find out the answer to a particular question from a fixed database or free text,
question answering systems have been proposed. They mainly handle the information
overload problem. Here, the target is to obtain a specific answer rather than an entire
document or best-matching paragraph. Questions can be broadly classified into two types:
fact-based questions (who, when, where, which, etc., the `wh' questions) are related to
named-entity-type answers, while harder questions (why and how questions) are related to
explanatory-type answers. The information maintained in the resource can be structured,
like database information, or unstructured, like free text. To extract answers from a
structured information source, the knowledge annotation technique is used, which relies on
syntactic parsing. On the other hand, for an unstructured information resource, knowledge
mining is required, which uses statistical tools. The existing Web repository is a mix of
structured, unstructured and semi-structured information.
In general, question-answering systems consist of three main components: question
classification, answer retrieval and answer extraction. In question classification, the user's
questions are classified to derive expected answer types. From the answer patterns, key-
words are extracted to reformulate the question into semantically equivalent multiple
questions, which is also known as query expansion. It boosts the recall of the information
retrieval system. In the answer retrieval phase, the probable Web pages or paragraphs
containing the answer are retrieved. In the final answer extraction phase, probable candidate
answers are identified and the most suitable answer is selected by ranking.
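The question-classification step above can be sketched with simple rules. This is a hedged illustration, not any cited system: the answer-type table, the stopword set and the function names are assumptions chosen only to show how a leading question word maps to an expected answer type and how retrieval keywords are extracted.

```python
# Assumed rule-based sketch of question classification: map the leading
# question word to an expected answer type and strip it (plus stopwords)
# to obtain retrieval keywords.

ANSWER_TYPE = {
    "who": "PERSON", "when": "DATE", "where": "LOCATION",
    "which": "ENTITY", "what": "ENTITY",
    "why": "EXPLANATION", "how": "EXPLANATION",
}
STOPWORDS = {"is", "the", "a", "an", "of", "in", "to", "was", "did"}

def classify(question):
    words = question.lower().rstrip("?").split()
    answer_type = ANSWER_TYPE.get(words[0], "ENTITY")
    keywords = [w for w in words[1:] if w not in STOPWORDS]
    return answer_type, keywords
```

Real systems use richer taxonomies and learned classifiers; the point here is only the pipeline shape: classify, then pass the keywords to retrieval, then extract and rank candidates of the expected type.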
Annotation of a knowledge base in order to obtain answers with the help of natural
language processing is proposed by Katz [58]. The developed system consists of two
modules, namely an understanding module and a generating module. The understanding
module analyzes English text and produces a knowledge base with the help of ternary
expressions and S-rules, which incorporates the information found in the text. The questions
are also analyzed using ternary expressions and matched with the knowledge base. The
1. AI Search Engine From MIT, http://www.netpaths.net/blog/ai-search-engine-from-mit
2. Search SnakeT Clustering Engine, Meta Search Cluster MetaSearch, http://snaket.di.unipi.it
3. Quintura - visual search engine, http://www.quintura.com
4. WebClust - Clustering Search Engine, http://www.webclust.com
5. Yippy Clustering Search Engine - iTools, http://itools.com/tool/yippy-web-search
generating module produces English sentences from a given segment of the knowledge base.
The idea of using database techniques for the World Wide Web is not new. Some sys-
tems like Araneus, Ariadne and Tsimmis have attempted to integrate heterogeneous
Web sources under a common interface. Unfortunately, queries to such systems must
be formulated in SQL, Datalog, or some similar formal language, which renders them
inaccessible to the average user. Along with this, the unstructured, random collection of
large amounts of Web information and its rapid growth posed a challenge to this proposal.
Later, Katz proposed knowledge mining of Web information and integrated it with the
corpus-based knowledge annotation technique in order to achieve better performance [59, 74].
Knowledge mining takes advantage of massive amounts of Web data to overcome
many natural language processing challenges. This technique follows eight sub-modules:
formulate requests, execute requests, generate n-grams, vote, filter candidates, combine
candidates, score candidates and get support. An early question answering system,
Mulder [65], attempted to perform sophisticated linguistic analysis on both
questions and potential answer candidates. As a result, it did not take advantage of
data redundancy. Similar work on answer extraction has been done in the Shapqa [21]
system. In contrast, AskMSR [8] embraced data redundancy and applied extremely simple
word-counting techniques to Web data. Another system, Aranea [74], followed a modular
architecture that also serves as a testbed for a variety of knowledge mining techniques.
MultiText [30] employed a different approach: instead of using the Web directly to
answer questions, it treated the Web as an auxiliary corpus to validate candidate answers
extracted from a primary, more authoritative, corpus. Question answering by searching
large corpora with linguistic methods is proposed in [55]. The work developed a
rephrasing algorithm based on linguistic patterns that describe the structure of ques-
tions and candidate sentences, and where precisely to find the answer in the candidate
sentences. A way of handling the unstructured as well as the structured Web is suggested
by Cucerzan and Agichtein [31]. For unstructured content, novel features such as Web
snippet pattern matching and generic answer type matching using Web counts are used.
For structured content, an approach that uses information from the millions of tables
and lists that abound on the Web is experimented with. In a different way, Radev et al.
used a probabilistic approach [98] to answer the query. This approach focused on developing
a unified framework that not only uses multiple resources for validating answer candi-
dates, but also considers evidence of similarity among answer candidates in order to
boost the ranking of the correct answer. This framework has been used to select answers
from candidates generated by four different answer extraction methods. The approach
of combining syntactic information with traditional question answering can be found in
the Quaero [109] system. This system performed better for complex questions,
i.e. `how' and `why' questions, which are more representative of real user needs. Some
other established question answering systems are Ionaut1 (AT&T Research), InsightSoft-
M (Moscow, Russia) [107], MultiText (University of Waterloo) [30], TextMap (USC/ISI)
[36], LAMP (National University of Singapore) [128], NSIR (University of Michigan) [99],
AnswerBus (University of Michigan) [131], etc.
2.4 Summary
In this chapter, we have discussed various approaches related to icon-based interfaces,
clustering methodologies and question answering techniques. Using icons as an interaction
medium may lead to major ambiguity in understanding. But from the literature survey
it is quite clear that icons play a vital role in communication in the absence of a common
interaction medium. All applications related to icon-based interfaces consider icons as
a way to give input to a system, but representation of knowledge by means of icons is
not addressed anywhere. CD-Icon, Sanyog, crisis management, etc. emphasized natural
language generation from icon selection. But a query fired at a search engine is not a
well-formed sentence in general; along with that, a search engine takes care of the query
phrase on its own. Therefore, icon sequence to natural language generation is not our
primary target. The proposed interfaces (Elephant's Memory, Clicker, hotel booking,
AAC systems, crisis management, etc.) did not provide any guideline about how they
were developed. How to decide on the icons and how to organize them in the interface
is not reported anywhere. So, we build a new framework to fulfill our requirements as
well as provide a general guideline for interface developers. Using the interface a user can
generate an icon-based input query to search the Internet, and the retrieved information
is reconverted into iconic form for user understandability.
Our second target is to cluster Web search results. We have discussed several clustering
algorithms, including fundamental clustering algorithms as well as benchmark
clustering techniques specially applied to Web search result clustering. We noticed that
each technique has its own merits and demerits. From the point of view of time and space
complexity, the k-Means algorithm is unbeatable. But it is sensitive to outliers, requires a
predefined cluster number, and the generated clusters are non-definite. In contrast, clusters
produced by the hierarchical approach are definite and better in quality, and no predefined
cluster number is required. But the time and space complexity of both the agglomerative
and divisive approaches is quite high. Among all other clustering algorithms, most of the
1. Ionaut, www.ionaut.com/
clustering techniques (STC, Lingo, etc.) consider the search result snippet as input to
achieve faster response time, but this affects the quality of the clusters, as a snippet is not
always a good representative of a Web page. Our target is to obtain better clusters so that
the information extraction part becomes easier. Therefore, we need a clustering technique
that provides good quality clusters in less response time.
Finally, we need to extract the necessary information from the clustered Web documents.
We follow a way similar to how question answering systems work. Unlike a general search
result retrieval system, a question answering system finds out the answer to a particular
question present in a structured or unstructured resource. So, we can say the range of
information retrieved by a question answering system is very narrow and the scope of a
general Web searching system is very wide. We need a system that performs in between:
it should not extract so much information that it confuses the target user, nor provide a
very narrow range of information. Along with this, we need to represent the extracted
information in terms of icons. We have to be conscious while selecting the icons for
information representation, as a word may imply different meanings in different contexts.
Overall, it is clear that the existing approaches to icon-based interfaces, clustering of
search results and question-answering techniques cannot fulfill our requirements. In order
to reach our target, we need to develop an icon-based interface. With the help of the
interface, a user will be able to compose an iconic query, and the equivalent English query
is automatically fed into the search engine. Next, we need a clustering approach which can
produce good quality clusters from search results in reasonable response time. From the
clustered search results, we shall find out a cluster which may contain query-related
information. All the information present in that cluster would be an overburden for our
target user. Therefore, the selected cluster will be mined to find out some query-related
specific information. Only those specific answers will be conveyed to the target user in
terms of icons.
Chapter 3
Development of Icon-Based Interface
In this chapter, we discuss the development of an icon-based interface using which our
target user can pose a query by means of icon selection and feed it to the Google search
engine. We selected icons as the mode of interaction as they are language independent and
understandable to our target user. In developing the icon-based interface, we faced
several challenges. Issues like icon selection for the interface, icon ordering, icon
optimization and building an icon vocabulary are not addressed in earlier iconic-interface
related works. The major issues addressed in our work are deciding the domain and the
domain-related queries, maintaining the icon vocabulary, and organizing the icons in the
interface.
3.1 Introduction
To develop the icon-based interface, the main deciding factors concern the icons; that
is, how many and which icons we should keep in the interface, and in which order. Since
an innumerable number of queries is possible in general and one or more icons are
needed to represent each query, it is beyond the scope of this work to plan such a vast icon
vocabulary. Therefore, we need to decide on a domain for which we will seek information
from the Internet. Along with this, we need to decide on the important domain-related
queries and the corresponding icons to support those queries. As we target unprivileged
users, we have to be conscious about the icon selection and the user friendliness of the
interface. A huge number of icons may confuse our target users. On the other hand, too
few icons may be insufficient to cover all domain-related queries. We need to balance these
two factors. The selected icons need to be maintained in an efficient way so that placing
them in the interface at run time is easier. Again, the presence of all icons in the primary
interface increases the cognitive load on users in selecting their desired icons. We have to be
careful about the icons' positioning in the interface so that icon navigation as well as the
query generation task become easier. Therefore, in order to develop the icon-based interface
we need to address issues like deciding the icon vocabulary and domain-related queries,
maintaining a large icon database, icon organization, etc.
To our knowledge, most of the previous works that deal with icon-based
interfaces emphasized other issues, like the proper icon characteristics for an interface
[10, 83, 106, 115], iconic sequence to sentence generation [11, 38, 124], next-icon prediction
to meet size limitations [38, 67], icon sense disambiguation [1], etc. But the issues we
highlight are not addressed anywhere. In order to develop the interface, we first chose
the tourism domain as the domain of interest. Next, we find domain-related important
words and queries using Google AdWords1. To build the selected queries we need a proper
icon database. We generate the icon database and maintain an index to support easy
retrieval of icons when needed. Not all selected icons can be placed in the primary
interface; therefore, we use a hierarchy to place the icons. Icons are placed in such a way
that the navigation time is minimized. The architecture of the proposed approach is shown
in Figure 3.1.
Figure 3.1: An overview of our approach. It comprises three parts: deciding the icon
vocabulary and domain-related queries (word collection, term importance, term ranking,
term selection, query selection), maintaining a large icon repository (indexing, icon
properties), and icon organization (hierarchy, icon groups).
This chapter consists of four sections. The proposed approach is discussed in
Section 3.2. Section 3.3 presents the details of the experiments conducted, followed by
the experimental results. Finally, Section 3.4 concludes with a summary.
3.2 Proposed Approach
Our approach consists of deciding the icon vocabulary and domain-related queries,
maintaining a large icon repository, and icon organization. All these tasks are discussed in
the following subsections.
1. Google AdWords, https://adwords.google.com/o/KeywordTool
3.2.1 Deciding Icon Vocabulary and Domain Related Queries
Towards the design of the icon-based interface, our first task is to decide on basic
tourism-related words and corresponding icons, and then put them in the interface in a
proper way. Redundancy of similar kinds of icons is avoided as it opposes icon optimization.
To build the tourism corpus, basic tourism-related words were collected from different
tourism-related magazines and websites. Stopwords1 (e.g. about, the, in, etc.) are frequent
in every domain but they are not important. So, before calculating the weight of each word,
571 stopwords are filtered out from the tourism corpus. Again, a word can occur in the
corpus in different morphological forms (e.g. transport, transports, transported,
transporting, transportation, etc.) whose stem or root is the same. A single icon is
sufficient to represent all these morphological forms. So, the stem form of each word is
found and all words of different morphological forms are mapped to their stem words. Then
the unique words of the corpus are identified.
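The corpus-reduction steps above (stopword filtering, stemming, collecting unique stems) can be sketched as follows. The tiny suffix-stripping stemmer and the miniature stopword set here are stand-ins, assumed purely for illustration; the actual work uses a 571-word stop list and a proper stemmer.

```python
# Assumed sketch of the corpus-reduction pipeline: filter stopwords, map
# each word to a stem, and collect the unique stems.
STOPWORDS = {"about", "the", "in", "of", "to", "and", "a"}  # stand-in stop list

def stem(word):
    # Toy suffix stripper standing in for a real stemmer (e.g. Porter).
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def unique_stems(corpus_words):
    kept = [w.lower() for w in corpus_words if w.lower() not in STOPWORDS]
    return sorted({stem(w) for w in kept})
```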
The importance of a term increases proportionally with the number of times the word
appears in a document, but is offset by how common the word is across the corpus. The
term count in a given document is simply the number of times the given term appears in
that document. This count is generally normalized to prevent a bias toward longer
documents (which may have a higher term count regardless of the actual importance of the
term in the document) and to give a measure of the importance of the term ti within the
particular document dj. Thus, we have the term frequency as defined in Equation 3.1:

    tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}                    (3.1)

where n_{i,j} is the number of occurrences of the considered term t_i in document d_j and
the denominator is the sum of the numbers of occurrences of all terms in document d_j.
The inverse document frequency is a measure of the general importance of the term
(obtained by dividing the total number of documents by the number of documents
containing the term, and taking the logarithm of that quotient), as shown in Equation 3.2:

    idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}                    (3.2)

where |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number
of documents in which the term t_i appears (that is, n_{i,j} ≠ 0). Then

    tf\text{-}idf_{i,j} = tf_{i,j} \times idf_i                    (3.3)
1. http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
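The tf-idf weighting can be sketched directly from these definitions (a minimal illustration over tokenized documents, using the standard logarithmic inverse document frequency):

```python
# Minimal sketch of tf-idf over tokenized documents.
import math

def tf(term, doc):
    # Equation 3.1: occurrences of the term, normalized by document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Equation 3.2: log of total documents over documents containing the term.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    # Equation 3.3: product of the two measures.
    return tf(term, doc) * idf(term, corpus)
```

A term that appears often in one document but in few documents overall scores highest, which is exactly the behaviour used below to rank and shortlist the domain terms.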
The tf-idf value of a term is always greater than or equal to zero. The term ranks
are decided by tf-idf weighting. After calculating the tf-idf weight of each term, the terms
are sorted on average tf-idf to finalize the top 500 terms. Among these 500 words
we point out the domain-related important words, and the frequent queries fired at Google
using those keywords are collected. Some supporting query words are added to the
main keyword list to complete the query phrases. Table 3.1 summarises the icon vocabulary
building. The initial tourism corpus contains 169610 words related to the tourism domain.
Table 3.1: Domain related word extraction
Corpus size Reduced corpus size Unique words Final words
169610 139052 30588 453
After filtering out the stopwords, the corpus size is reduced to 139052 words. After
stemming, 30588 words are identified as unique words. The average tf-idf values of these
unique words are calculated and ranked in decreasing order. Out of the top 500 words, 316
words are selected as tourism-related important words, as some are implicit in queries (e.g.
the word "list" is implicit in the query "hotel list of Kolkata") and some are synonymous.
These words are used to extract frequent queries fired at Google in the tourism domain
using Google AdWords1. We select 50 queries as tourism-related important queries, as
shown in Table 3.2. 127 supporting words are added to the corpus to form the queries. At
present the interface contains 235 icons. We collected these icons from different websites.
While selecting, we preferred bi-colour icons which resemble the actual word-object. In
some cases the concept is represented where it is difficult to represent a real object. Some
icons were modified according to target users' feedback as needed. We follow the
consistency rule for icon characteristics like colour, dimension, etc. In fact, designing
proper icons for the tourism domain is beyond the scope of this work.
3.2.2 Maintaining Large Icon Repository
Presently 235 icons are used in the interface. To manage the icon database, XML doc-
ument is used as an index to support easy icon retrieval. Di�erent icon attributes like
corresponding keyword, synonyms, semantically similar words, word sense, storage loca-
tion (path), position of that icon in the hierarchy are maintained in the XML �le. For
query generation when user selects icon from interface, icon corresponding keyword is
fetched from vocabulary to generate equivalent English query. On the other hand, when
we search a word corresponding icon from icon vocabulary for display we �rst match
1. Google AdWords, https://adwords.google.com/o/KeywordTool
Table 3.2: Tourism related benchmark queries

1. Nearest station <place>                  2. Sea beach in south India
3. Distance <place> and <place>             4. Single room hotel rent in <place>
5. Festivals in <place>                     6. Photo gallery of <place>
7. Hotel <place>                            8. Climate of <place> in time
9. Transport <place>                        10. <place> zoo
11. People culture <place>                  12. Buddha stupa in <place>
13. Food <place>                            14. Shiva temple in <place>
15. Map <place>                             16. <place> lake
17. Weather <place>                         18. Botanical garden in <place>
19. Fair in rural India                     20. Church in <place>
21. Five star hotel in <place>              22. Forest in <place>
23. Tour package of <place>                 24. <place> border
25. Route from <place> to <place>           26. Train between <place> and <place>
27. Adventure sports in <place>             28. Hindu pilgrim in <place>
29. Beaches in <place>                      30. Train ticket reservation <place> to <place>
31. Wildlife in <place>                     32. Royal museum in <place>
33. Hospital in <place>                     34. Reservation status in <place> hotel
35. Five day tour plan <place>              36. Temperature of <place> in <month>
37. Best season to visit <place>            38. Stadium in <place>
39. <place> college                         40. Low budget <place> and <place> travel
41. Market in <place>                       42. Village craft fair in <place>
43. Migratory bird watching <place>         44. Arrival time of <place> and <place> plane
45. Fort in <place>                         46. Sunset in <place> beach
47. Fruit in <place>                        48. Tribal festival in <place>
49. Orchid of north <place>                 50. <place> and <place> flight ticket price
the keyword and find the most suitable sense according to the context. The icon
corresponding to the most appropriate sense is selected for display. If the keyword is not
found in the vocabulary, then we look for synonyms, followed by semantically similar
words. The details of icon selection for display are explained in Section 5.2.3. The
structure of the XML file is shown in Figure 3.2. The part of the XML file shown in the
figure contains the `town' icon's properties. It contains the parent icon, named `pname', to
declare the hierarchical position in the interface; it tells us that the icon `town' is under the
`where' icon. `Collection' indicates whether the icon has a further hierarchy or not: `2'
implies a further hierarchy and `1' implies no further hierarchy. `Keyword' declares the
icon's corresponding text. Synonymous words are declared within `synonyms'. Next, we
declare the different senses of the keyword. To get the synonyms and the different senses
of the keyword, we take the help of the WordNet database [87] of Princeton University.
Semantically similar words are declared within `seman'. The images corresponding to the
senses are given in `imagepath'.
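The lookup order described above (keyword first, then synonyms, then semantically similar words) can be sketched over such an index. The element and attribute names in this fragment are assumptions modelled on the description, not the thesis's actual XML schema:

```python
# Hypothetical sketch of the icon-index lookup order: match the keyword
# first, then synonyms, then semantically similar words. Element names
# (icon, keyword, synonyms, seman, imagepath) are assumed for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """
<icons>
  <icon pname="where" collection="2">
    <keyword>town</keyword>
    <synonyms>city township</synonyms>
    <seman>village suburb</seman>
    <imagepath>icons/town.png</imagepath>
  </icon>
</icons>
"""

def find_icon(root, word):
    word = word.lower()
    for field in ("keyword", "synonyms", "seman"):   # lookup priority order
        for icon in root.iter("icon"):
            node = icon.find(field)
            if node is not None and word in node.text.lower().split():
                return icon.findtext("imagepath")
    return None  # no icon found for this word

root = ET.fromstring(SAMPLE)
```

Scanning field by field (rather than icon by icon) guarantees that an exact keyword match anywhere in the index always wins over a synonym match.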
Figure 3.2: A snapshot of the XML file
3.2.3 Icon Organization
All the icons cannot be kept in the default interface, so a hierarchy is maintained to
organize them. The selected icons are categorized under 5 basic `wh' icons: `what',
`when', `where', `who' and `how'. Icons related to any place are kept under `where'.
Time, person and verb related icons are kept under `when', `who' and `how' respectively.
The rest of the icons are placed under `what'. Qualifying types of icons are put separately
in the interface. Along with this basic categorization, grouping and sub-grouping are also
followed (e.g. all types of vehicles are kept under the main transport icon). Similar types
of icons are placed together. Presently, the interface maintains a 4-level hierarchy (e.g.
travel → transport → train → ac-2-tier). A snapshot of the interface is shown in Figure 3.3.
The interface contains 5 icon groups and satisfies Miller's rule of thumb: magic seven,
plus or minus two [86]. Icon group 1 contains functional icons (backspace, write back,
erase, search) and a display panel to show the icons selected by the user. This offers easy
reversal of actions to the user. Icon group 2 contains `help' and `feedback' icons. The
basic `wh' icons (`what', `when', `where', `who' and `how') are kept in icon group 3. Using
a `wh' icon, the user can see the basic icons related to that particular `wh'. To go further
in the hierarchy, icon group 4 is helpful. The hierarchy icon enables the user to find icons
which are kept in hierarchical order. As an example, the transport icon can be found
3.2. Proposed Approach
Figure 3.3: Icon-based interface. Icon group 1: icon display panel with undo, redo and clear; icon group 2: help and feedback; icon group 3: basic wh buttons (what, where, when, who and how); icon group 4: hierarchy; icon group 5: icons for selection.
by enabling the hierarchy icon and selecting the travel icon. To reduce short-term memory load, the hierarchical path that a user follows is displayed in icon group 4; this also allows the user easy state (hierarchy) transitions. Icon group 5 is the display area of all `wh' icons, and the user can select any icon displayed in this area with a single click. Informative feedback is provided on mouse hover or click by highlighting and displaying an enlarged icon for clarity. Using this interface, users are able to generate a travel-related query and feed it to the search engine using the `search' icon. The Google search engine is used in this application for searching.
3.2.4 Developed Interface
Using the developed interface, the target user can frame tourism-related queries. The formation of an example query ("weather of Kolkata in December") is shown in Figure 3.4. To construct this query the user has to select three icons: weather, Kolkata and December. The `weather' icon is in the primary interface; the user can select it with a single click. The selected icon is then displayed in the display panel of icon group 1. The next target icon, `Kolkata', is a place-related icon, so we select the `where' icon from icon group 3. In the icon panel we find the icon of India. Since Kolkata is a city of India, the icon of `Kolkata' must be under the `India' icon. We enable the `hierarchy'
Figure 3.4: Example of query generation. (a) Selection of `weather' icon. (b) Display of `weather' icon. (c) Move to `where' hierarchy.
Figure 3.4 (continued): Example of query generation. (d) Enable hierarchy and select `India' icon. (e) Disable hierarchy and select `Kolkata' icon. (f) Move to `when' hierarchy.
Figure 3.4 (continued): Example of query generation. (g) Enable hierarchy and select `month' icon. (h) Disable hierarchy and select `December' icon. (i) Query completion.
option and select the India icon. We can �nd the icon of `Kolkata' in the icon panel.
We disable the hierarchy and select the `Kolkata' icon by single click. The change will
be re�ected in display panel. The last icon needs to be selected is `December' which is a
time related icon. So, we move to `when' icon panel. There we can �nd the `month' icon.
In a similar way discussed earlier we go to the next hierarchy and �nd the `December'
icon. By selecting `December' icon we complete the query generation. The `search' icon
will feed the generated query into Google search engine.
3.3 Experiments and Experimental Results
To substantiate the efficiency of the developed interface, we have conducted a few experiments. The judgment is made with respect to icon recognizability and efficiency in query formation, based on user evaluation.
3.3.1 Experimental Setup and User Details
All experiments are carried out in a Windows environment (Windows 7) on an Intel Core 2 Duo (2.0 GHz) processor with 2.0 GB memory. The proposed approach is implemented in C# on the .NET 3.5 platform using Microsoft Visual Web Developer 2008 Express Edition. Internet Explorer is used as the default Web browser to access the Internet. In our evaluation procedure we have considered users from different backgrounds. The user profiles are summarized in Table 3.3.
3.3.2 Training and Testing
As the developed system is totally new to our target users, they undergo a training procedure followed by testing. We arrange five different sessions with 18 users. First, we conduct a test in which the target users recognize icons without any prior knowledge, to check the appropriateness of the designed icons. Users are given 25 randomly selected icons and asked to describe their idea of each icon. We separate out the icons that are not recognized or are misinterpreted. In the next two sessions, we familiarize the users with the separated icons: in session 3 we introduce the icons by group (e.g. all vehicles), and in the following session we jumble the icons and again acquaint the users with them. In the final session, 15 icons randomly selected from the icon database are provided to each user, who is asked to recognize them. Our observations are summarized in Table 3.4. The table shows that before training around 37% of the icons in the icon database are not recognized or are misinterpreted. Of these icons, 53% are recognized after
Table 3.3: User details

User type                       Age                 Education (class)  Mother language    Computer proficiency
Office-peon (S1, S2)            24, 46              XII, X             Bengali            Intermediate
Sweeper (S3, S4, S5, S6, S7)    21, 28, 33, 33, 37  V, IV, VII, V, -   Hindi, Bengali     None
Gatekeeper (S8, S9, S10)        16, 20, 26          III, -, VI         Bengali            None, Novice
Shopkeeper (S11, S12)           35, 42              VII, X             Telugu, Bengali    Novice
Cook (S13, S14)                 43, 56              -, V               Bengali, Oriya     None
Waiter (S15, S16, S17, S18)     19, 22, 26, 28      VIII, X, -, V      Bengali, Kannada   None, Novice

User's computer proficiency. None: never used a computer before. Novice: used a computer a few times, but not regularly or for less than 6 months. Intermediate: used a computer for more than 6 months but less than 1 year. Expert: used a computer for more than 1 year.
Table 3.4: User training

Training  Number of icons  Correctly recognized  Concept recognized  Not recognized  Misinterpreted
Before    392              78                    169                 62              83
After     145              31                    47                  24              43
the training sessions, and the remaining icons are familiarized once again in the last session.
Next, the interface is tested with respect to query-construction ability. We use the fifty tourism-related queries (Ref. Table 3.2) as a benchmark. Participants are asked to generate five queries each using the interface. The accuracy of query generation is calculated as the number of icons correctly chosen by the user divided by the total number of icons needed to form the query correctly. The average accuracy for each user is then calculated, as shown in Table 3.5. The overall average accuracy of finding search keywords with the proposed icon-based keyboard is approximately 0.764.
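The accuracy measure can be sketched as follows; the numbers in the example call are hypothetical, not taken from Table 3.5.

```python
def query_accuracy(correct_icons, needed_icons):
    """Accuracy of one query: icons correctly chosen / icons needed for the correct query."""
    return correct_icons / needed_icons

def average_accuracy(per_query):
    """Average accuracy over a user's queries; per_query is a list of (correct, needed) pairs."""
    scores = [query_accuracy(c, n) for c, n in per_query]
    return round(sum(scores) / len(scores), 3)

# Hypothetical user: three queries, each needing 3 icons, with 3, 2 and 3 chosen correctly.
avg = average_accuracy([(3, 3), (2, 3), (3, 3)])
```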
3.4 Conclusion
The developed icon-based interface is a prototype version of the proposed approach. In this work, the prototype implementation works fine with respect to tourism-related queries. A primary evaluation of the interface has been done, but more formative evaluation as well
Table 3.5: Interface testing result

User  Queries             Accuracy per query           Average accuracy
S1    8, 13, 22, 10, 17   1, 1, 0.75, 0.33, 1          0.816
S2    36, 43, 44, 46, 47  1, 1, 1, 0.5, 1              0.9
S3    1, 23, 35, 28, 20   0.66, 0.5, 1, 0.5, 1         0.732
S4    3, 11, 16, 21, 26   0.5, 0.75, 1, 0.5, 1         0.75
S5    5, 7, 19, 32, 34    1, 1, 0.75, 1, 0.66          0.882
S6    38, 39, 40, 48, 49  0.33, 0.66, 0.6, 0.66, 0.33  0.516
S7    37, 41, 42, 45, 50  1, 0.66, 0.5, 0.8, 1         0.792
S8    6, 14, 18, 27, 24   0.33, 0.33, 0.8, 0.8, 1      0.652
S9    2, 15, 25, 29, 30   1, 1, 0.66, 1, 1             0.932
S10   4, 9, 12, 31, 33    0.5, 1, 0.66, 0.33, 0.5      0.598
S11   8, 13, 22, 10, 17   1, 0.66, 0.5, 1, 1           0.832
S12   36, 43, 44, 46, 47  1, 0.66, 1, 0.75, 0.75       0.832
S13   1, 23, 35, 28, 20   0.33, 1, 1, 1, 1             0.866
S14   3, 11, 16, 21, 26   1, 1, 0.5, 1, 0.75           0.85
S15   5, 7, 19, 32, 34    0.5, 1, 1, 0.66, 0.66        0.764
S16   38, 39, 40, 48, 49  1, 0.33, 0.8, 1, 0.33        0.692
S17   37, 41, 42, 45, 50  0.5, 0.33, 0.75, 0.6, 0.8    0.596
S18   6, 14, 18, 27, 24   0.66, 0.66, 0.6, 0.8, 1      0.744
as summative evaluation is needed to enhance the user-friendliness of the interface. The developed interface helps the target user compose tourism-related queries, and the proposed approach can be extended to other domains of interest such as health, education, shopping and job search.
Chapter 4
Clustering Web Search Results
The query composed in terms of icons is transformed into text and fed into the search engine. As a query, a word or a group of words can carry multiple meanings in different contexts. A Web search engine, however, cannot distinguish the context and hence retrieves a huge amount of information. If the query is posed precisely, the relevancy of the retrieved information is high; in that case, the user obtains the desired information precisely and with less navigation effort. On the other hand, when the query is imprecise or too broad, the target information may be present in the search results but ranked far down the result list. It is not worthwhile to present all the search results to our target users. Instead, we need to find precise and relevant information for the user query. To group Web pages with similar content we need clustering. Clustering yields a collection of clusters, from which we find the cluster containing the most relevant documents. Generally, a Web page contains much extra information along with the valuable content, which may result in poor cluster quality; so, before clustering, we preprocess the Web pages. From an exhaustive literature survey we realize that there are different methodologies for clustering: some compromise cluster quality and some compromise response time. We propose a new clustering algorithm combining k-Means and hierarchical clustering in order to obtain better cluster quality with an affordable time delay, and we compare the proposed method with existing clustering methods.
4.1 Introduction
After firing the query, the search engine returns the probable Web pages related to it, which are huge in number and, in general, from different domains. As an example, Google returns 9,095,238 search results on average for a query¹. An expert computer user can use several advanced options (e.g. pages containing a particular phrase, page language, file type, time of upload, particular site or domain) to get accurate information in the least time. Users may not get the appropriate answer every time, but generally some suggestions or hints are provided. The accuracy of retrieving related Web pages for a query depends mainly on query formation, and composing an accurate query is not possible for our target users.
Retrieved Web page snippets are ranked depending on PageRank and relevancy measures. A snippet contains the title of the Web page, a brief description containing the search word (that is, a content summary) and the link to the Web page. In general, the user has to predict their target snippet(s) by going through the snippets or navigating the pages. The actual solution or suggestion to the problem is then obtained by reading the predicted Web pages by trial and error. This procedure is infeasible for our target users. As a way out, clustering has been advocated, whereby similar Web pages are grouped together so that representation and extraction of information related to the search query becomes easier.
In general, clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense; it is a method of unsupervised learning [34]. In other words, clustering is a data mining (machine learning) technique used to place data elements into related groups without any advance knowledge of the group definitions [100]. Document clustering was proposed mainly as a method of improving the effectiveness of document ranking, following the hypothesis that closely associated documents will match the same requests [102]. Systems that perform clustering of Web search results, also known as clustering engines, have become popular in recent years. The first commercial clustering engine was probably Northern Light, in 1996; it was based on a predefined set of categories to which the search results were assigned. A major breakthrough was then made by Vivisimo [24], whose clusters and cluster labels were dynamically generated from the search results. In recent times, several commercial clustering engines have been launched in the market [24], [56], namely Grouper (1997) [126], WISE (2002) [23], Carrot (2003) [92], WebCat (2003) [47], AISearch (2004)², SnakeT (2005)³, Quintura (2005)⁴, WebClust (2006)⁵ and YIPPY
¹ Search engine statistics from the Search Engine Yearbook, http://www.searchengineyearbook.com/search-engines-statistics.shtml
² AI Search Engine From MIT, http://www.netpaths.net/blog/ai-search-engine-from-mit
³ SnakeT Clustering Engine, Meta Search Cluster MetaSearch, http://snaket.di.unipi.it
⁴ Quintura - visual search engine, http://www.quintura.com
⁵ WebClust - Clustering Search Engine, http://www.webclust.com
(2009)¹, etc. These engines [24] consider search-result snippets as input to achieve faster response times. As a consequence, cluster quality degrades, because a snippet is not always a good representative of the whole document [114].
The existing techniques for Web search results clustering are mainly based on two basic clustering mechanisms: partitional and hierarchical clustering. k-Means clustering is the most common type of partitional clustering and produces flat clusters. Hierarchical clustering, on the other hand, creates a hierarchy of clusters, which may be represented in a tree structure called a dendrogram [130]. Hierarchical clustering is usually either agglomerative ("bottom-up") or divisive ("top-down") [77]. Both of these clustering techniques have limitations when applied directly to clustering Web search results. Hierarchical clustering yields better-quality clusters, although the computational complexity of k-Means is lower; but hierarchical clustering is trapped by past mistakes, whereas k-Means offers iterative improvement. Noticing the limitations of these two algorithms, we propose a combination of the hierarchical and k-Means algorithms to cluster Web documents. Our main objective is to obtain better-clustered search results so as to extract concrete information with a reasonable time delay. In this work, we analyze two basic ways of grouping search results and implement a hybrid clustering method. The whole Web page content, instead of the snippet, is considered as input to improve cluster quality. Finally, we compare the performance of the proposed technique with existing approaches.
The organization of this chapter is as follows. In Section 4.2 we discuss the proposed approach. To substantiate the efficacy of the proposed algorithm, we have conducted experiments; the experiments and their results are presented in Section 4.3. Finally, Section 4.4 concludes the chapter.
4.2 Proposed Methodology
We have proposed an approach to group the Web search results into a number of clusters depending on the relevancy of terms in each document. To do this, we introduce a new clustering algorithm. Our proposed approach builds on four tasks: Web page content extraction, preprocessing of the Web documents, document feature extraction and inter-document similarity measurement. An overview of the entire work is shown in Figure 4.1, and a detailed description of each step is given in the following subsections.
¹ Yippy Clustering Search Engine - iTools, http://itools.com/tool/yippy-web-search
Figure 4.1: Overview of our work. The query term and Web page content extraction produce raw text, which enters the preprocessing stage (text filtration and lemmatization); the filtered text goes through document feature extraction to yield feature vectors; the inter-document similarity measure produces a similarity matrix, from which document clustering produces the clustered documents.
4.2.1 Preprocessing of Web Documents
Web pages contain a lot of extra information and noise (e.g. advertisements) along with the desirable information. The presence of these elements often makes it troublesome to identify the important document features and may mislead the clustering process. Hence, before feature extraction and clustering proper, document preprocessing is necessary. In this stage, we first identify the extra information and noise and filter them out. Finally, we perform lemmatization [77] to map inflected words to their root forms.
4.2.1.1 Text Filtration
The source code of a Web page is essentially raw data that cannot be used directly for clustering. In this step, we consider the following elements for removal from any Web page.
• HTML tags: Different HTML tags are used to represent documents in the Web browser. The tags themselves do not carry any important information. Some common HTML tags are <head>, <title>, <body>, <table>, <strong>, etc.
• Special characters: Some special characters are used to create spaces and represent symbols (e.g. ©, &, <, >, ").
• Scripts: Different scripting languages are used to incorporate different features (e.g. JavaScript).
• Non-letter characters: A Web page may contain non-letter characters (e.g. $, %, #).
• Non-printable ASCII characters: A number of non-printable characters may also occur in Web pages (e.g. NUL, SOH, STX).
After downloading the source code, these elements are eliminated so that only the text (sentences, phrases and words) is extracted from the Web documents. Starting and ending tags are identified to separate unnecessary terms from important text. All the unnecessary elements mentioned above are deleted using regular expressions [2].
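The filtration step can be sketched with regular expressions as follows; the patterns are deliberately simplified compared with what a full Web page would require, and are ours, not the thesis's C# expressions.

```python
import re

def filter_text(html):
    """Strip scripts, tags, character entities and non-letter symbols from raw page source."""
    # Remove script and style blocks entirely (their content is not document text).
    text = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    # Remove all remaining HTML tags such as <head>, <title>, <body>, <table>, <strong>.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove character entities like &copy; &amp; &lt; &gt; &quot;
    text = re.sub(r"&[#\w]+;", " ", text)
    # Keep letters, digits and basic punctuation; drop symbols such as $, %, #.
    text = re.sub(r"[^A-Za-z0-9.,;:'\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```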
4.2.1.2 Lemmatization
In order to perform clustering, Web documents need to be represented as term vectors. Important terms are identified using the term-weighting scheme discussed in Section 4.2.2. Generally, a word can occur in a document in different morphological forms. A lemmatizer maps an inflected word to its base form with the use of a vocabulary and morphological analysis [77]. We use LemmaGen¹ to convert all the words of the Web documents to their base forms. This guarantees that all inflected forms of a term are mapped to a single term, which helps to determine each term's importance. The output of the preprocessing stage, that is, the filtered text (sentences, phrases and words), is the input to the next stage.
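The thesis uses LemmaGen for this step; the toy sketch below only illustrates the idea with a small hand-built lookup table (the entries are examples, not LemmaGen output).

```python
# Toy lemmatizer: maps inflected forms to their base form via a lookup table.
# LemmaGen derives such mappings from a morphological lexicon; this table is illustrative.
LEMMAS = {
    "travelling": "travel", "travelled": "travel", "travels": "travel",
    "trains": "train", "cities": "city", "went": "go",
}

def lemmatize(words):
    """Replace each word by its base form so all inflections count as one term."""
    return [LEMMAS.get(w.lower(), w.lower()) for w in words]
```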
4.2.2 Document Feature Extraction
The aim of this phase is to identify the important terms from the extracted text that are potentially capable of representing the documents. In our work, the vector space model [77] is used to represent a document, with term importance as the dimensions. At first we find the unique terms among the bag of words of all documents. Next, we identify the important terms of each document. To do this, we use the term frequency-inverse document frequency (tf-idf) [119] metric to measure a term's importance. Note that stopwords (e.g. about, the, in) are frequent in any document and are not important; we identify 571 stopwords² which are filtered out from every document. The calculation of tf-idf and the representation of documents in terms of document features are discussed in the following.
Calculation of tf-idf: The tf-idf weight is calculated by multiplying the term frequency (tf) by the inverse document frequency (idf). The term frequency of a term ti in a given document dj is calculated as in Equation 3.1. The inverse document frequency is a measure of the general importance of a term; it is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that
¹ LemmaGen, http://lemmatise.ijs.si/
² http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
quotient. The mathematical representation of the inverse document frequency is given in Equation 4.1.

    idfi = log( |D| / |{d ∈ D : ti ∈ d}| )    (4.1)

The inverse document frequency of the term ti is denoted idfi in Equation 4.1. Here, |D| is the total number of documents and |{d ∈ D : ti ∈ d}| is the number of documents containing the term ti. Combining these two quantities, we can assign a weight tf-idfi,j to a term ti in a document dj as in Equation 4.2.

    tf-idfi,j = tfi,j × idfi    (4.2)
After obtaining the tf-idf value of each unique term for each document, we calculate the average tf-idf value of each term. The terms are then sorted in descending order of this average value, and the top m terms are considered important. Any document can then be represented in terms of these m terms. The value of m is decided experimentally (Ref. Section 4.3.3).
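Equations 4.1 and 4.2, together with the average-based top-m selection, can be sketched in Python as follows. This is an illustration written for this chapter, not the thesis's C# implementation, and tf is simplified to the raw count divided by document length.

```python
import math
from collections import Counter

def tf_idf_table(docs):
    """Return {term: {doc_index: tf-idf}} for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                        # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    table = {}
    for j, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            tf = count / len(doc)             # simplified term frequency
            idf = math.log(n / df[term])      # Equation 4.1
            table.setdefault(term, {})[j] = tf * idf   # Equation 4.2
    return table

def top_m_terms(table, n_docs, m):
    """Terms sorted by average tf-idf over all documents; keep the top m."""
    avg = {t: sum(v.values()) / n_docs for t, v in table.items()}
    return sorted(avg, key=avg.get, reverse=True)[:m]
```

A term occurring in every document gets idf = log(n/n) = 0 and so carries no weight, which is why such terms never reach the top m.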
Term-document matrix (TDM): Our next task is to represent all documents using the m terms identified in the previous step. We consider a term-document matrix TDM that represents all documents using the tf-idf values of the m terms. In TDM, a row represents a term and a column represents a document; each element of the matrix is the tf-idf value of a term in a document. In other words, if the tf-idf value of a particular term ti in a document dj is tf-idfi,j, and m terms {t0, t1, ..., tm-1} represent all documents, then a document dj is represented by the vector [tf-idf0,j, tf-idf1,j, ..., tf-idfm-1,j] [101]. The TDM matrix for a set of n documents is given in Equation 4.3.
TDM =
    | tf-idf0,0      tf-idf0,1      ...  tf-idf0,n-1    |
    | tf-idf1,0      tf-idf1,1      ...  tf-idf1,n-1    |
    | ...                                               |
    | tf-idfm-1,0    tf-idfm-1,1    ...  tf-idfm-1,n-1  |    (4.3)
Generally, column-length normalisation [92] is applied to the TDM matrix to avoid bias towards very short or very long documents. The normalised value of tf-idfi,j, denoted by ai,j, is calculated as

    ai,j = tf-idfi,j / √( (tf-idf0,j)² + (tf-idf1,j)² + ... + (tf-idfm-1,j)² )    (4.4)
Equation 4.5 shows the term-document matrix with column-length normalisation.

TDM' =
    | a0,0      a0,1      ...  a0,n-1    |
    | a1,0      a1,1      ...  a1,n-1    |
    | ...                                |
    | am-1,0    am-1,1    ...  am-1,n-1  |    (4.5)
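Column-length normalisation (Equation 4.4) can be sketched as follows; this illustrative Python takes the matrix as a list of rows, one per term.

```python
import math

def normalise_columns(tdm):
    """Divide each column of the term-document matrix by its Euclidean length (Equation 4.4).

    tdm[i][j] is the tf-idf value of term i in document j.
    """
    m, n = len(tdm), len(tdm[0])
    # Euclidean length of each column (document vector).
    norms = [math.sqrt(sum(tdm[i][j] ** 2 for i in range(m))) for j in range(n)]
    return [[tdm[i][j] / norms[j] if norms[j] else 0.0 for j in range(n)]
            for i in range(m)]
```

After this step every column has unit length, which is what lets Equation 4.6 reduce to Equation 4.7.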
4.2.3 Inter Document Similarity Measure
The main objective of the document clustering phase is to group similar documents. For this, we need to compute similarity values between every pair of documents. As similarity values are compared often during clustering, it is better to compute all pairwise document similarities first; once computed, the values can simply be looked up when needed instead of being recomputed every time. The input to this phase is the feature vectors obtained in the previous phase. Our approach to measuring similarity is discussed in the following.
Cosine similarity: The similarity between two vectors of m dimensions can be measured using cosine similarity, which is the cosine of the angle between them. In this work, cosine similarity is used to measure the similarity between two documents and is computed using the dot product and magnitudes [77]. Let two documents di and dj be represented by the vectors [a0,i, a1,i, ..., am-1,i] and [a0,j, a1,j, ..., am-1,j], respectively. Then the cosine similarity si,j between di and dj is given by Equation 4.6.
    si,j = (di · dj) / (||di|| ||dj||)
         = (a0,i·a0,j + a1,i·a1,j + ... + am-1,i·am-1,j) / ( √(a0,i² + a1,i² + ... + am-1,i²) · √(a0,j² + a1,j² + ... + am-1,j²) )    (4.6)
Previously, we performed column-length normalisation on the matrix TDM in such a way that the sum of the squares of the elements in each column is 1, that is, Σ(i=0 to m-1) ai,j² = 1 for any document dj. Therefore, Equation 4.6 reduces to

    si,j = a0,i·a0,j + a1,i·a1,j + ... + am-1,i·am-1,j    (4.7)
Similarity matrix: The pairwise similarities of documents can be represented by a similarity matrix. A similarity value varies between 0 and 1, where 0 indicates no similarity and 1 indicates maximum similarity (identical documents). The similarity matrix, denoted Sim, is symmetric about the main diagonal, as Sim[di, dj] = Sim[dj, di]. An element si,j of the similarity matrix Sim is the cosine similarity of the documents di and dj. The
mathematical representation of the similarity matrix is given in Equation 4.8.

Simn,n =
    | s0,0      s0,1      ...  s0,n-1    |
    | s1,0      s1,1      ...  s1,n-1    |
    | ...                                |
    | sn-1,0    sn-1,1    ...  sn-1,n-1  |    (4.8)
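Because each column of TDM' has unit length, Equation 4.7 reduces the cosine of Equation 4.6 to a plain dot product, so the whole similarity matrix can be sketched as below (illustrative Python, not the thesis's C# code).

```python
def similarity_matrix(a):
    """Pairwise cosine similarity of documents given the normalised TDM' (Equation 4.7).

    a[i][j] is the normalised weight of term i in document j; since every column has
    unit length, the cosine of documents di and dj is simply their dot product.
    """
    m, n = len(a), len(a[0])
    return [[sum(a[t][i] * a[t][j] for t in range(m)) for j in range(n)]
            for i in range(n)]
```

The matrix is symmetric, so in practice only the upper triangle needs to be computed and stored.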
4.2.4 Our Proposed Clustering Algorithm
We propose an algorithm to cluster documents that combines the divisive hierarchical approach and k-Means; hence, we name it HK Clustering. Our objective is to produce coherent clusters [126], meaning that a document does not strictly belong to a single cluster: depending on certain conditions it may belong to several clusters, as a document often covers multiple topics. The flowchart of the HK Clustering algorithm is shown in Figure 4.2. Let us consider a set of n documents {d0, d1, d2, ..., dn-1} as the clustering elements. The documents based on which the clusters are formed are termed seed documents. Suppose S and Old_S denote the sets of seed documents at the current and previous level, respectively. We store all clusters generated at a level in a pool of clusters called CP.
Step 1. Cluster initialization: Initially, all documents are in a single cluster C0, that is, C0 = {d0, d1, d2, ..., dn-1}, and the cluster pool CP = C0. S and Old_S are initialized to null, as at the root level there are no seed documents.
Figure 4.2: Overview of our proposed HK Clustering algorithm (cluster initialization → seed selection → check the termination condition; if not satisfied, distribute documents and repeat, otherwise produce the final coherent clusters).
Step 2. Seed selection: Suppose Ci,j is the ith cluster at level j. We select seeds in Ci,j for level (j+1) as follows. We find a document pair, say (di, dj), with the minimum similarity value among all document pairs in the cluster Ci,j. If the similarity value of the pair (di, dj) is less than a limit called the dissimilarity threshold, denoted α, we consider di and dj as seeds for level (j+1). Otherwise, the current seed documents of Ci,j at level j remain the seeds for level (j+1). To find all the probable seeds for level (j+1), we consider all the clusters of level j one by one.
Next, we check whether any two or more seeds are mergeable, by checking the similarity value of each pair of seeds. If the similarity value of any seed pair exceeds a threshold called the merging similarity threshold, denoted β, we merge those two seeds into one. It may happen that two or more seed pairs satisfy the condition and share a common seed document: for example, the similarity values of (di, dj) and (dj, dk) are both greater than β, where di, dj, dk ∈ S. In this case, we merge all three of them into a single representative seed. Thus a seed may contain more than one document; we term such a seed a composite seed. We store all seeds selected for the next level in S.
Step 3. Check the terminating condition: The process terminates if CP = null and S = Old_S. If the terminating condition is not satisfied, go to Step 4; otherwise, go to Step 5.
Step 4. Assign documents to their nearest seeds: For each seed in S, we create a cluster initially containing the seed document only. Thus we have |S| newly created clusters, where |S| denotes the number of seeds in S. Next, we assign the non-seed documents to these clusters as follows. Suppose di is a non-seed document that we want to assign to a seed. Among all seeds in S, let dj ∈ S have the maximum similarity with di; then we assign di to the cluster corresponding to the seed dj. In the case of a composite seed, where a single seed is represented by two or more documents, we compare the similarity value of a non-seed document with each document of the composite seed as well as with the other seed documents. All newly obtained clusters are kept in CP. We then set Old_S = S and go to Step 2.
Step 5. Produce the final coherent clusters: Finally, we produce the coherent clusters as stated below. Let C'i,j be the centroid of the cluster Ci,j; we calculate C'i,j by taking the arithmetic mean of all the document vectors (discussed in Section 4.2.2) belonging to the cluster Ci,j. Next, we check the similarity value of each document with the centroids of the clusters other than the one it belongs to. Suppose a document di belongs to cluster Ci,j; we check the similarity values of di with the cluster centroids other than C'i,j. If any of these similarity values crosses a limit called the belonging similarity threshold, denoted γ, then we assign di to that cluster as well. A document may satisfy the condition for more than one cluster centroid; in that case, the document is assigned to all those clusters.
The proposed HK Clustering algorithm is stated precisely in pseudocode form in Algorithm 1. We consider n documents {d0, d1, ..., dn-1} for clustering and prepare the similarity matrix Simn,n (as discussed in Section 4.2.3). In any cluster Ci, Ci.seed denotes the seed document(s) and Ci.doc denotes all documents within the cluster, including the seed(s). We also consider three threshold values for comparison in different phases of the algorithm: the dissimilarity threshold (α), the merging similarity threshold (β) and the belonging similarity threshold (γ). In Algorithm 1, i, j, x and N are positive integer variables; line numbers 16-30 determine the probable seeds for the next level, line numbers 8-12 merge close seeds, and line numbers 32-37 produce the final coherent clusters. The procedure Document Distribution assigns documents to their nearest seeds.
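Under stated simplifications, the seed-selection and distribution steps of one level can be sketched in Python as follows. This is an illustration of Steps 2 and 4 only, not the full Algorithm 1: composite seeds, the level loop and the final γ-based coherent assignment are omitted, and all function and variable names are ours.

```python
def select_seeds(cluster, sim, alpha):
    """Step 2 (simplified): pick the most dissimilar pair in a cluster as next-level seeds.

    cluster is a list of document indices; sim is the similarity matrix. Returns the
    pair if its similarity is below the dissimilarity threshold alpha, otherwise None
    (meaning the cluster keeps its current seeds).
    """
    pairs = [(sim[i][j], i, j) for i in cluster for j in cluster if i < j]
    value, di, dj = min(pairs)
    return [di, dj] if value < alpha else None

def distribute(docs, seeds, sim):
    """Step 4: assign every non-seed document to the cluster of its most similar seed."""
    clusters = {s: [s] for s in seeds}
    for d in docs:
        if d not in seeds:
            nearest = max(seeds, key=lambda s: sim[d][s])
            clusters[nearest].append(d)
    return list(clusters.values())
```

Repeating these two steps level by level until the seed set stops changing reproduces the divisive part of the algorithm; the k-Means flavour comes from reassigning every document to its nearest seed at each level.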
4.2.4.1 Illustration of the Proposed HK Clustering Algorithm
For better understanding, the algorithm is explained with Figure 4.3. In this figure, Ci,j denotes a cluster, where i represents the level and j represents the cluster number. The document(s) marked with a star within a cluster are the seed document(s). Suppose we want to cluster ten documents {d0, d1, d2, ..., d9}. The initial cluster is C0,1 = {d0, d1, d2, ..., d9} at level 0, and the cluster pool CP = C0,1; S and Old_S are null. Now we dequeue the cluster from CP and determine the two documents with minimum similarity. Let (d0, d8) be the most dissimilar document pair in the cluster C0,1, with a similarity value less than the dissimilarity threshold α. So (d0) and (d8) are considered as seeds at level 1. Next, we check whether the seeds selected at level 1 are mergeable. In this case the check is trivial, because both seeds are generated from the same parent and their similarity value is less than the dissimilarity threshold; since obviously α < β, the similarity of (d0, d8) cannot cross the merging similarity threshold β. So (d0) and (d8) are finalised as seeds at level 1. Then we check the terminating condition: the first condition, CP = null, is true, but the second condition, S = Old_S, is false, because S = {d0, d8} and Old_S = null. Next, each non-seed document is assigned
Input: Simn,n : similarity matrix.
Output: C = {C0, C1, ..., CN-1} : set of coherent clusters.

 1  begin
 2      Initialise S = Old_S = Null, C0 = {d0, d1, ..., dn-1}
 3      Add C0 into the cluster pool CP
 4      while true do
 5          if CP = Null then
 6              if S = Old_S then
 7                  Exit                               // Terminating condition
 8              else
 9                  /* Merging close seeds */
10                  for all di ∈ S do
11                      Find the mergeable seed dj such that Sim[di, dj] > β
12                  end
13                  Merge all mergeable seed documents and modify S
14                  /* Assign non-seed documents to seeds; N denotes the number of seeds */
15                  Document Distribution(S, Old_S, CP, N)
16              endif
17          else
18              /* Find probable seeds for next level */
19              Ctemp ← Dequeue cluster pool CP
20              if Count(Ctemp.doc) = Count(Ctemp.seed) then
21                  S ← Ctemp.seed                     // Cluster contains only seed(s)
22              else
23                  /* Find two documents in Ctemp having minimum similarity */
24                  Min_Sim.value ← Min(Sim[di, dj]), ∀ di, dj ∈ Ctemp, di ≠ dj
25                  Min_Sim.doc ← Documents corresponding to Min_Sim.value
26                  if Min_Sim.value < dissimilarity threshold α then
27                      S ← Min_Sim.doc
28                  else
29                      Seed of cluster Ctemp is passed to the next level: S ← Ctemp.seed
30                  endif
31              endif
32          endif
33      end
34      /* Produce coherent clusters */
35      Document Distribution(S, Old_S, CP, N) where N = Count(S)
36      Calculate the centroid C'i of each cluster Ci ∈ CP
37      if SimC(di, C'j) > γ for i = 0 to n-1 and j = 0 to N-1 then
38          Assign di to Cj, where C'j is the centroid of cluster Cj and di ∉ Cj
39      endif
40  end

Algorithm 1: HK Clustering
4. Clustering Web Search Results
Input: S: seed documents of the current level, Old_S: seed documents of the previous
level, CP: cluster pool, N: number of seeds.

begin
    Create N buckets
    C0.seed ← S(0), C1.seed ← S(1), ..., C_{N−1}.seed ← S(N−1)
    for j = 0 to n−1 do
        Find the closest seed di for document dj, where di ∈ S
        Cx.doc ← dj : di ∈ Cx.seed, 0 ≤ x ≤ N−1             // assign document to its nearest seed
    end
    CP = C0 ∪ C1 ∪ ... ∪ C_{N−1}
    for i = 0 to N−1 do
        Old_S ← Ci.seed
    end
    Return
end

Procedure: Document Distribution
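The two listings above can be condensed into a short Python sketch. This is our own simplified rendering, not the thesis implementation: a seed is represented as a tuple of document ids (composite after merging), `sim` is a precomputed similarity matrix with `sim[i][i] = 1.0`, and the final coherent (γ) pass is omitted for brevity.

```python
from itertools import combinations

def hk_cluster(docs, sim, alpha, beta):
    """Simplified HK Clustering loop: split the least-similar pair of each
    cluster into new seeds, merge seeds closer than beta, redistribute the
    documents, and stop when the seed set stabilises."""
    def seed_sim(d, seed):                        # doc-to-seed similarity
        return max(sim[d][s] for s in seed)

    clusters = {(docs[0],): list(docs)}           # level 0: one cluster
    old_seeds = None
    while True:
        seeds = []
        for seed, members in clusters.items():
            pairs = list(combinations(members, 2))
            if not pairs:                         # singleton cluster
                seeds.append(seed)
                continue
            a, b = min(pairs, key=lambda p: sim[p[0]][p[1]])
            if sim[a][b] < alpha:                 # dissimilar enough: split
                seeds += [(a,), (b,)]
            else:                                 # pass the previous seed on
                seeds.append(seed)
        merged = []                               # merge seeds closer than beta
        for s in seeds:
            for i, m in enumerate(merged):
                if max(sim[x][y] for x in s for y in m) > beta:
                    merged[i] = m + s
                    break
            else:
                merged.append(s)
        seeds = merged
        if set(seeds) == old_seeds:               # terminating condition
            return clusters
        old_seeds = set(seeds)
        clusters = {s: [] for s in seeds}         # document distribution
        for d in docs:
            clusters[max(seeds, key=lambda s: seed_sim(d, s))].append(d)
```

On the worked example, the same mechanics apply: each level either splits the least-similar pair of a cluster into fresh seeds or carries the old seed forward, and the loop ends once two consecutive levels produce identical seeds.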
to its nearest seed depending on the similarity measure. Suppose we want to assign the
non-seed document (d1) to seed (d0) or (d8). Let Sim[d0, d1] > Sim[d8, d1]. So, (d1)
is assigned to the cluster corresponding to the seed (d0). Similarly, the other non-seed
documents are assigned to their respective seeds. Suppose, after completion of level 0, two
clusters are generated: C11 = {d0*, d1, d4, d5} and C12 = {d8*, d2, d3, d6, d7, d9} (see Figure 4.3).
C11 and C12 are added into the cluster pool CP. Old_S is set to {d0, d8}.
Suppose the four documents (d0), (d5) from the cluster C11 and (d3), (d7) from the
cluster C12 are determined as seeds at level 2. Assume that these seeds are not mergeable.
At this level, the termination condition is not satisfied. After distributing the documents,
four clusters are generated, namely C21 = {d0*, d4}, C22 = {d5*, d6}, C23 = {d3*, d1, d8, d9}
and C24 = {d7*, d2} (also see Figure 4.3). We may note that document (d1) of cluster
C11 at level 1 is assigned to cluster C23 at level 2, as (d1) is nearer to (d3) than to (d0), (d5)
and (d7). Similarly, (d6) is assigned to cluster C22. Old_S becomes {d0, d5, d3, d7}.
At level 3, (d0), (d4) from C21, (d5), (d6) from C22, (d8), (d3) from C23 and (d7) from
C24 are selected as seeds. Here, we assume that no document pair in the cluster C24 has
similarity below the dissimilarity threshold α, so the seed of the previous level, (d7), is passed
to level 3. Among all the seeds, let the similarity of (d6) and (d8) exceed β, the merging
similarity threshold. Therefore, they are merged. {(d0), (d4), (d5), (d6, d8), (d3), (d7)} are
finalised as seeds at level 3. At level 3, the termination condition is not satisfied. Hence,
the rest of the documents are assigned to their nearest seed. For this, each non-seed
[Tree diagram: level 0 cluster C01 = {d0, ..., d9}; level 1: C11 = {d0*, d1, d4, d5},
C12 = {d8*, d2, d3, d6, d7, d9}; level 2: C21–C24; level 3: C31–C36; level 4: C41–C46;
seed documents marked with *.]
Figure 4.3: Illustration of the proposed HK Clustering algorithm.
document is compared to all seeds (including all documents in a composite seed) to
find the maximum similarity value among them. This time one seed is composite, namely
(d6, d8), so both of its members are considered, along with the other seeds, when a
non-seed document is assigned to a seed. As an example, to assign document d9
to its nearest seed, d9 is compared with d0, d4, d5, d3, d7 and with both d6 and d8.
Assuming Sim[d5, d9] < Sim[d0, d9] < Sim[d4, d9] < Sim[d7, d9] < Sim[d6, d9] <
Sim[d3, d9] < Sim[d8, d9], d9 is assigned to seed (d6, d8). After completion of
document assignment at level 3, six clusters are generated: C31 = {d0*}, C32 = {d4*},
C33 = {d5*}, C34 = {(d6, d8)*, d9}, C35 = {d3*, d1} and C36 = {d7*, d2}. Old_S is set
to {(d0), (d4), (d5), (d6, d8), (d3), (d7)}.
Next, at level 4, six seeds, (d0) from C31, (d4) from C32, (d5) from C33, (d6, d8) from
C34, (d3) from C35 and (d7) from C36, are selected; none of them are mergeable with
each other. At this stage the cluster pool CP is empty and the seeds of level 4 are the
same as the level 3 seeds. Thus, we meet the terminating condition.
The documents other than the seeds are assigned to their nearest seed one last
time. This produces six clusters: (d0), (d4), (d5), (d6, d8, d9), (d3, d1), (d7, d2). At
this stage the generated clusters are hard in nature, as each document strictly belongs
to a single cluster. In order to produce coherent [126] clusters, the centroids of the
generated hard clusters are determined. The similarity between each document and the
centroids of the clusters other than its own is then checked. Let the similarity between
d9 and the centroid of the fifth cluster {d3, d1} exceed γ, the belonging similarity threshold.
So, d9 is also assigned to that cluster. Thus, at the end of the process, six coherent
clusters C41 = {d0*}, C42 = {d4*}, C43 = {d5*}, C44 = {(d6, d8)*, d9}, C45 = {d3*, d1, d9} and
C46 = {d7*, d2} are generated.
4.2.4.2 Complexity Analysis of HK Clustering Algorithm
The performance of the HK Clustering algorithm is analysed from the time and storage
requirement points of view as follows.
Time complexity: There are three main tasks in the algorithm: seed selection for the
next level, assigning documents to their nearest seeds, and producing the final coherent
clusters. Suppose there are n documents to be clustered. We consider a general level h,
where h ∈ {0, 1, ..., log2 n}. For simplicity, we assume that 2^h clusters are generated
from 2^h seeds at level h. So, the numbers of clusters generated in consecutive levels
are {2^0, 2^1, ..., 2^(log2 n)} = {1, 2, ..., n}. We also assume that the n documents are
equally distributed among all the clusters at the same level; therefore, each cluster contains
approximately n/2^h documents. Let the three main tasks take T1, T2, T3 time, respectively.
For these three tasks we analyse the time complexity as follows.
1. Seed selection: In this phase, we find the seeds which will create the clusters at the
current level h. This task can be divided into two sub-tasks: finding the two most
dissimilar documents in each cluster, and merging close seeds. Let these two
sub-tasks take t11 and t12 time, respectively, that is, T1 = t11 + t12.
To find the two documents in a cluster with minimum similarity, we have to compare
the similarity values of all document pairs of that cluster. As a single cluster contains
n/2^h documents at level h, the document pair with minimum similarity can be found
in O((n/2^h)^2) time. This takes t11 = O(2^h · (n/2^h)^2) = O(n^2/2^h) time when
we consider all the clusters of that particular level.
The next sub-task is to find the mergeable seeds at level h. Among the n documents,
2^h are seeds at level h. In general, these seeds are generated from the 2^(h−1) clusters
of the previous step. To find the mergeable seeds at level h we need 2^h · (2^h − 1)/2
comparisons. But seeds from the same parent are not mergeable, so 2^(h−1) comparisons
are not required.
Therefore, 2^h · (2^h − 1)/2 − 2^(h−1) = 2^h · (2^(h−1) − 1) comparisons are actually
required to find the mergeable seeds at level h. Therefore, t12 = O(2^h · (2^(h−1) − 1)).
2. Assigning documents to their nearest seeds: After merging the mergeable seeds, we
need to assign the non-seed documents to their nearest seed. If, at any level h, 2^h seeds
are generated, then (n − 2^h) documents are left to be assigned. Assigning these
(n − 2^h) documents among 2^h seeds takes T2 = O((n − 2^h) · 2^h) time.
3. Producing final coherent clusters: To produce the final coherent clusters, we need to
calculate the centroids of the generated clusters. For 2^h clusters at the final level h,
we have to compute 2^h centroids. Next, the similarity of each document to the
centroids of the clusters other than the one the document belongs to is compared.
This needs n · 2^h comparisons.
Now, h varies from 0 to log2 n. At level h = log2 n, n clusters would be produced,
which implies that each cluster contains only a single document. In this case centroid
determination is not required. Instead, at level h = (log2 n) − 1, n/2 clusters would
be generated, for which n/2 cluster centroids need to be determined. To check the
similarity values between cluster centroids and documents, we need n · n/2 comparisons.
So, the maximum possible time required for T3 is O(n^2/2).
Now, among the three tasks stated above, the first two, seed selection and assigning
documents to their nearest seeds, occur repeatedly at each level. The required time for
these two tasks at level h is given by Equation 4.9.
T_1^h + T_2^h = {n^2 / 2^(h−1) + 2^h (2^(h−1) − 1)} + {2^h (n − 2^h)}
             = n^2 / 2^(h−1) + 2^h (n − 1) − 2^(2h) / 2        (4.9)
The total time required for these two tasks over all levels is

T' = Σ_{h=1}^{log2 n} {n^2 / 2^(h−1) + 2^h (n − 1) − 2^(2h) / 2}
   = n^2 Σ_{h=1}^{log2 n} 1/2^(h−1) + (n − 1) Σ_{h=1}^{log2 n} 2^h − (1/2) Σ_{h=1}^{log2 n} 2^(2h)
   = 2n(n − 1) + 2(n − 1)^2 − (2/3)(n^2 − 1)        (4.10)
Therefore,

T' = O(n^2)        (4.11)

The total time required is T = T' + T3 = O(n^2). So, the time complexity of the proposed
algorithm is O(n^2).
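The closed form in Equation 4.10 can be checked numerically against the summation it came from. The snippet below is our own sanity check, not part of the thesis; it evaluates both sides for a few powers of two.

```python
import math

def t_prime_sum(n):
    """Evaluate the summation of Eq. 4.10 term by term (n a power of two)."""
    H = int(math.log2(n))
    return sum(n**2 / 2**(h - 1) + 2**h * (n - 1) - 2**(2 * h) / 2
               for h in range(1, H + 1))

def t_prime_closed(n):
    """Closed form of Eq. 4.10: 2n(n-1) + 2(n-1)^2 - (2/3)(n^2 - 1)."""
    return 2 * n * (n - 1) + 2 * (n - 1) ** 2 - (2 / 3) * (n ** 2 - 1)

# the two expressions agree (up to float rounding in the 2/3 term),
# and both grow quadratically, consistent with T' = O(n^2)
for n in (8, 64, 1024):
    assert abs(t_prime_sum(n) - t_prime_closed(n)) < 1e-6 * t_prime_closed(n)
```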
Space complexity: In the proposed algorithm, a similarity matrix is created to store the
inter-document similarity values. Along with this, we have to store the cluster centroids
in the final step and the generated clusters at each step. So, the overall space complexity
is O(n^2 + k·n), that is, O(n^2), where k is the number of clusters generated in the final
step and n is the number of documents.
4.3 Experiments and Experimental Results
In this section, we present a detailed description of the experiments and the observed
results. We use the same experimental setup as discussed in Section 3.3.1.
4.3.1 Experimental Data
In order to substantiate the efficacy of the proposed algorithm and compare different
clustering algorithms, we need a tagged, well-classified data set. We use the standard
document collections of Reuters-21578¹, from which we collect only the tagged documents.
Thirteen sample data sets are created, from reut2-000 to reut2-012, each having a
different number of classes. We also prepare a tagged document collection using the
Web search results for some specific queries. To create our own document collection,
ten different tourism-related benchmark queries are fed to the Google search engine. We
consider the first 100 returned results for each query to build up a collection. Ten expert
users (all UG and PG students of our institute) are asked to tag the Web pages; each
Web page is tagged with one or more topics. So our own database contains 1000 tagged
documents. These datasets are used for both training and testing purposes.
4.3.2 Performance Metrics
To evaluate the performance of the proposed algorithm, cluster quality is analysed using
two different measures: an internal measure and an external measure [108], [37]. The
internal measure allows comparing different sets of clusters without reference to any
external knowledge, such as an already-classified dataset, whereas the external measure
quantifies how well a clustering result matches a known classified dataset. In our work,
the known classified dataset is the previously categorised document collection provided
by human editors.
¹Reuters-21578 Text Categorization Test Collection, http://www.daviddlewis.com/resources/testcollections/reuters21578
4.3.2.1 Internal Measure
We use the Dunn index [37] as an internal measure of cluster quality. It aims to
identify dense and well-separated clusters, and is defined as the ratio of the minimal
inter-cluster distance to the maximal intra-cluster distance. Equation 4.12 shows the
Dunn index, where δ(Ci, Cj) represents the distance between clusters Ci and Cj and
∆(Cl) measures the intra-cluster distance of a cluster Cl.
DI(C) = min_{i≠j} {δ(Ci, Cj)} / max_{1≤l≤k} {∆(Cl)}        (4.12)

δ(Ci, Cj) = (1 / (|Ci| |Cj|)) Σ_{di∈Ci, dj∈Cj} φ(di, dj)        (4.13)

∆(Ci) = 2 (Σ_{di∈Ci} φ(di, C'i) / |Ci|)        (4.14)
Here, φ(di, dj) is the distance between two documents di and dj belonging to Ci and
Cj, respectively; |Ci| and |Cj| are the sizes of clusters Ci and Cj, and C'i is the centroid
of cluster Ci.
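Equations 4.12–4.14 translate directly into code. The sketch below is an illustrative implementation (all names are ours), assuming Euclidean distance for φ and clusters given as lists of coordinate tuples.

```python
import math

def phi(a, b):
    """Distance between two document vectors (Euclidean, as an assumption)."""
    return math.dist(a, b)

def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    n = len(cluster)
    return [sum(x[i] for x in cluster) / n for i in range(len(cluster[0]))]

def delta(ci, cj):
    """Average inter-cluster distance, Eq. 4.13."""
    return sum(phi(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

def diameter(ci):
    """Intra-cluster distance around the centroid, Eq. 4.14."""
    c = centroid(ci)
    return 2 * sum(phi(a, c) for a in ci) / len(ci)

def dunn_index(clusters):
    """Eq. 4.12: minimal inter-cluster over maximal intra-cluster distance."""
    k = len(clusters)
    inter = min(delta(clusters[i], clusters[j])
                for i in range(k) for j in range(k) if i != j)
    intra = max(diameter(c) for c in clusters)
    return inter / intra
```

Tight, well-separated clusters yield a large ratio; overlapping or sprawling clusters push it toward zero.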
4.3.2.2 External Measure
The F measure, which combines the precision (P) and recall (R) ideas from information
retrieval [37], [77], is widely used to measure external quality. For the document set D,
let Cj be one of the output clusters and C*i the corresponding human-edited class. Then
the precision and recall are

P = |C*i ∩ Cj| / |Cj|        (4.15)

R = |C*i ∩ Cj| / |C*i|        (4.16)
The F measure is the harmonic mean of precision and recall. The F measure (Fi,j) and
the overall F measure (F) are computed as shown in Equation 4.17 and Equation 4.18, where
l is the total number of human-edited classes, k is the number of output clusters, and |V|
is the total number of documents present in the l human-edited classes.

Fi,j = 2·P·R / (P + R)        (4.17)

F = Σ_{i=1}^{l} (|C*i| / |V|) · max_{j=1,...,k} {Fi,j}        (4.18)
The value of the Dunn index is high for clusters with high intra-cluster similarity and
low inter-cluster similarity, and a high overall F measure indicates that the clusters map
more accurately to the original classes. So, algorithms that produce clusters with a high
Dunn index and a high overall F measure are more desirable.
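Equations 4.15–4.18 can be computed directly when each class and cluster is a set of document ids. The function below is our own illustrative rendering (names are ours), reading the precision and recall ratios as counts of the set intersections.

```python
def f_measure(classes, clusters):
    """Overall F measure of Eq. 4.18. `classes` holds the human-edited
    classes C*_i and `clusters` the output clusters C_j, both as sets of
    document ids."""
    V = sum(len(c) for c in classes)        # total documents in the l classes
    total = 0.0
    for ci in classes:
        best = 0.0
        for cj in clusters:
            inter = len(ci & cj)
            if inter == 0:
                continue
            p = inter / len(cj)             # precision, Eq. 4.15
            r = inter / len(ci)             # recall, Eq. 4.16
            best = max(best, 2 * p * r / (p + r))   # F_{i,j}, Eq. 4.17
        total += len(ci) / V * best         # weighted by class size, Eq. 4.18
    return total

# a clustering that reproduces the classes exactly scores F = 1.0
assert abs(f_measure([{1, 2, 3}, {4, 5}], [{1, 2, 3}, {4, 5}]) - 1.0) < 1e-12
```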
4.3.3 Evaluations
We compare the HK Clustering algorithm with basic clustering methods as well as the
benchmarking clustering algorithms specially used for clustering Web search results.
Seven data sets of Reuters from reut2-005 to reut2-012 and �ve data sets of our data col-
lection are used for the comparison. Some data sets containing large number documents
of Reuters are used for testing to show the e�ciency of algorithm for higher number of
document collection. We apply HK Clustering algorithm on the document collections
to obtain clusters. Basic clustering methods: k -Means and agglomerative hierarchical
clustering (group average) are also used to cluster the document collections. Similarly,
the latest benchmarking algorithm Lingo using Singular Value Decomposition (SDV)
[92] is used for clustering. Then we check the quality of clusters obtained from di�erent
algorithms with respect to both internal and external measures. Lingo considers search
result snippet as an input. We feed the whole document content in Lingo to maintain
the symmetry. Table 4.1 presents the comparative study of k -Means, agglomerative hi-
erarchical clustering, Lingo and HK Clustering algorithm with respect to cluster quality.
From Table 4.1 it is clear that the proposed HK Clustering algorithm produces better
clusters than the k-Means and Lingo algorithms. The time complexities of the four
algorithms are shown in Table 4.2. From the time complexity point of view, HK
Clustering performs better than the hierarchical agglomerative clustering algorithm as
well as the Lingo algorithm. Hence, HK Clustering offers a good balance between cluster
quality and time complexity.
Table 4.1: Comparison of cluster quality
Document name | #Docs | #Classes | #Clusters | Hier. agglom. DI / F | k-Means DI / F | Lingo DI / F | HK Clustering DI / F
reut2-005 | 100 | 6 | 8 | 2.71 / 0.70 | 2.41 / 0.49 | 2.55 / 0.56 | 2.64 / 0.68
reut2-006 | 100 | 10 | 9 | 2.40 / 0.78 | 2.33 / 0.42 | 2.30 / 0.62 | 2.36 / 0.71
reut2-007 | 100 | 4 | 5 | 1.42 / 0.58 | 1.27 / 0.34 | 2.29 / 0.41 | 1.31 / 0.58
reut2-008 | 100 | 7 | 10 | 2.67 / 0.55 | 2.43 / 0.32 | 2.46 / 0.38 | 2.49 / 0.52
reut2-009 | 100 | 6 | 7 | 1.57 / 0.67 | 1.42 / 0.51 | 1.42 / 0.55 | 1.46 / 0.64
Adventure India | 100 | 9 | 8 | 2.71 / 0.78 | 2.49 / 0.51 | 2.52 / 0.58 | 2.53 / 0.69
Hotel Hyderabad | 100 | 14 | 16 | 4.61 / 0.77 | 4.35 / 0.53 | 4.39 / 0.73 | 4.66 / 0.81
India pilgrim | 100 | 10 | 13 | 3.89 / 0.38 | 3.71 / 0.24 | 3.73 / 0.26 | 3.83 / 0.35
Kolkata travel | 100 | 14 | 16 | 4.79 / 0.67 | 4.62 / 0.52 | 4.65 / 0.49 | 4.77 / 0.59
Weather Kolkata | 100 | 8 | 7 | 1.59 / 0.68 | 1.48 / 0.41 | 1.52 / 0.53 | 1.57 / 0.67
reut2-010 | 400 | 14 | 17 | 4.51 / 0.56 | 4.14 / 0.50 | 4.37 / 0.51 | 4.48 / 0.53
reut2-011 | 800 | 12 | 15 | 3.45 / 0.49 | 3.19 / 0.24 | 3.52 / 0.33 | 3.36 / 0.44
reut2-012 | 1000 | 15 | 16 | 4.26 / 0.46 | 4.11 / 0.31 | 4.19 / 0.34 | 4.24 / 0.37
Table 4.2: Comparison of time complexity

Clustering algorithm | Time complexity
k-Means | O(n)
HK Clustering | O(n^2)
Hierarchical agglomerative* | O(n^2 log n)
Lingo | O(n^3)

*Considering the group-average linkage criterion
4.3.4 Discussion
Apart from the efficiency of the proposed algorithm, there are two key issues in the
proposed methodology: the document download time, and the setting of the threshold
values and the number of feature vectors used in the algorithm. We address these two
issues with two different experiments. We use threading to download the Web pages
returned by the search engine; the first experiment tests the effectiveness of threading.
The second experiment is conducted to set the threshold values and the number of
feature vectors used in the algorithm.
Document download time: As discussed in Section 4.2.1, we use threading to download
Web pages from the Web repository. After feeding the query to the search engine, the
system takes a large amount of time to download the Web pages one by one. As the
response time of the proposed system is a key factor, we reduce the download time using
threading. Five different tourism-related queries are used to test the effectiveness
of threading. As the download speed varies, each query is run thrice and the average
download time is calculated. A comparison of the time required for downloading documents
with and without threading is given in Table 4.3. It shows that threading reduces the
download time by a factor of about 13 for 100 documents. Figure 4.4 depicts the difference
clearly.
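The thesis does not list its threading code; a minimal sketch of the idea using Python's standard thread pool looks like the following. The worker count and timeout are illustrative assumptions, not values from the experiment.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download one page; a network error yields an empty document."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except OSError:
        return url, b""

def fetch_all(urls, workers=20):
    """Download the result pages concurrently rather than one by one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, urls))
```

Because page downloads are I/O-bound, overlapping them in threads gives roughly the order-of-magnitude speed-up reported in Table 4.3.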
Table 4.3: Web document download time with threading and without threading
Web pages | Culture Assam (3.64 MB) WT / WOT | Delhi guide (4.67 MB) WT / WOT | Market Kolkata (4.83 MB) WT / WOT | Royal Rajasthan (3.56 MB) WT / WOT | Wildlife India (3.44 MB) WT / WOT
10 | 00:08.85 / 00:21.99 | 00:09.10 / 00:39.21 | 00:08.92 / 00:18.94 | 00:08.87 / 00:29.76 | 00:08.85 / 00:31.78
20 | 00:08.97 / 01:33.48 | 00:09.91 / 01:24.51 | 00:09.87 / 00:43.63 | 00:09.02 / 01:11.86 | 00:09.18 / 01:36.18
30 | 00:11.58 / 02:13.70 | 00:11.91 / 02:08.51 | 00:13.26 / 01:06.27 | 00:09.25 / 01:58.62 | 00:09.52 / 02:14.18
40 | 00:13.51 / 02:44.42 | 00:18.36 / 02:49.64 | 00:17.84 / 01:27.62 | 00:10.69 / 02:34.39 | 00:10.21 / 02:44.56
50 | 00:14.32 / 03:32.07 | 00:24.85 / 03:28.76 | 00:24.53 / 03:11.93 | 00:14.18 / 03:14.52 | 00:14.60 / 03:14.04
60 | 00:23.67 / 04:16.47 | 00:30.78 / 05:15.68 | 00:33.10 / 03:35.56 | 00:22.16 / 04:24.06 | 00:21.66 / 03:57.92
70 | 00:25.47 / 04:59.79 | 00:31.64 / 06:04.89 | 00:33.98 / 04:24.28 | 00:22.35 / 05:15.69 | 00:25.97 / 04:29.43
80 | 00:27.19 / 05:39.02 | 00:32.28 / 06:49.74 | 00:35.01 / 04:55.78 | 00:23.42 / 05:52.83 | 00:26.50 / 05:14.39
90 | 00:28.42 / 06:16.56 | 00:33.56 / 07:22.20 | 00:35.69 / 05:18.35 | 00:24.07 / 06:35.51 | 00:26.83 / 05:52.23
100 | 00:28.97 / 06:58.62 | 00:34.71 / 08:49.76 | 00:39.37 / 05:36.21 | 00:26.75 / 07:33.74 | 00:29.65 / 06:17.58

Notation: WT → with threading, WOT → without threading; all times in mm:ss.
Average download time for 100 documents without threading = 07:03.18 mm:ss.
Average download time for 100 documents with threading = 00:31.89 mm:ss.
Speed-up = 13.27.
[Bar chart comparing document download time without threading and with threading]
Figure 4.4: Download time of documents
Determination of optimal threshold values and number of feature vectors:
We conducted another experiment to set the threshold values and the number of feature
vectors used in the proposed algorithm. This experiment essentially trains the algorithm
on a classified data set. Five Reuters data sets, from reut2-000 to reut2-004, and five
data sets from our own collection are used to set the threshold values. First, we calculate
the Dunn index for the samples with the given classes. The threshold values α, β and γ
can each vary from 0 to 1. As the number of classes is known to us, we try to maximise
the Dunn index for the given number of classes: we select the set of threshold values for
which the number of clusters is closest to the given number of classes and the Dunn index
is maximal. After considering all samples, the arithmetic mean and harmonic mean of
the individual thresholds and numbers of features are calculated. Using these average
values, we again compute the Dunn index of the training samples. We have plotted the
Dunn index for every sample considering the optimal values as well as the average values.
The deviation of the Dunn index for the average threshold values compared to the optimal
threshold values is shown in Figure 4.5. We find that this deviation is moderate, and the
difference between the arithmetic mean and harmonic mean values is very small. We use
the harmonic mean values for testing. The experimental results are summarised in
Table 4.4.
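The training procedure above — pick the (α, β, γ) whose cluster count best matches the known classes and whose Dunn index is highest — can be sketched as a grid search. In the sketch below, `cluster_fn` and `dunn_fn` stand in for the clustering algorithm and the Dunn-index computation, and the grid of candidate values is an illustrative assumption.

```python
from itertools import product

def tune_thresholds(docs, sim, n_classes, cluster_fn, dunn_fn, grid=None):
    """Grid-search (alpha, beta, gamma): prefer settings whose cluster count
    is closest to the known class count, then the highest Dunn index."""
    grid = grid or [round(0.1 * i, 1) for i in range(1, 10)]
    best, best_key = None, None
    for a, b, g in product(grid, repeat=3):
        if not a < b:                       # alpha must stay below beta
            continue
        clusters = cluster_fn(docs, sim, a, b, g)
        # lexicographic key: closeness to class count first, Dunn index second
        key = (-abs(len(clusters) - n_classes), dunn_fn(clusters))
        if best_key is None or key > best_key:
            best, best_key = (a, b, g), key
    return best
```

Averaging the per-sample optima (arithmetically or harmonically), as done in Table 4.4, then yields a single setting for testing.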
[Plot: Dunn index for each training sample under the given classification, the optimal
thresholds, the arithmetic-mean thresholds and the harmonic-mean thresholds]
Figure 4.5: Comparison of Dunn Index
Table 4.4: Determination of threshold values and number of feature vectors
No. | Document collection | #Docs | #Classes | DI (given classification) | α | β | γ | #Features | DI (proposed classification)
1 | reut2-000 | 100 | 9 | 2.42 | 0.025 | 0.335 | 0.8 | 200 | 2.57
2 | reut2-001 | 100 | 13 | 3.37 | 0.030 | 0.365 | 0.6 | 200 | 3.49
3 | reut2-002 | 120 | 12 | 3.51 | 0.025 | 0.440 | 0.6 | 300 | 3.55
4 | reut2-003 | 120 | 18 | 4.63 | 0.020 | 0.345 | 0.8 | 300 | 4.71
5 | reut2-004 | 100 | 13 | 3.48 | 0.025 | 0.540 | 0.8 | 200 | 3.43
6 | Culture Assam | 100 | 15 | 2.89 | 0.045 | 0.460 | 0.7 | 300 | 3.65
7 | Delhi guide | 100 | 24 | 4.66 | 0.035 | 0.665 | 0.7 | 400 | 4.72
8 | Market Kolkata | 100 | 14 | 4.49 | 0.025 | 0.550 | 0.5 | 300 | 4.68
9 | Royal Rajasthan | 100 | 12 | 3.41 | 0.030 | 0.755 | 0.6 | 300 | 3.59
10 | Wildlife India | 100 | 14 | 4.57 | 0.035 | 0.765 | 0.8 | 300 | 4.61

Arithmetic mean: α = 0.029, β = 0.522, γ = 0.690, number of features = 280.
Harmonic mean: α = 0.028, β = 0.479, γ = 0.673, number of features = 267.
4.4 Conclusion
With the advancement of information technology, the size of the Web repository is
increasing rapidly, and this trend will continue in the coming years. As a consequence,
for a given user query a search engine returns a huge number of results, and it becomes
difficult to extract the right information for our target users. To tackle this, we cluster
the search results. A number of clustering techniques can cluster Web documents, but
the existing techniques either need user intervention to decide the number of clusters or
are computationally expensive. The present work addresses these limitations and proposes
a new clustering approach that takes advantage of both the k-Means and hierarchical
approaches. Our clustering technique is able to produce coherent clusters of good quality
without excessive computational overhead. The proposed technique is useful for clustering
a large number of Web search results into groups of similar documents in real time; this
grouping helps direct the search interest. The technique presented in this work is composed
of several sub-tasks, for which we use naive approaches, resulting in an overall time
complexity of O(n^2). There is scope for reducing this time complexity by solving the
sub-tasks in more efficient ways.
Chapter 5
Icon-Based Information
Representation
After clustering, we get a number of clusters containing similar types of information. Our
next task is to identify the most relevant cluster for the user query. It is not worthwhile
to present the content of all Web pages in the selected cluster word by word; instead,
we need to find the important information related to the query. In order to do so, we
identify the entities and attributes corresponding to the query and extract the attribute
values from the clustered Web pages. The selected information is represented in the form
of icons. In fulfilling this goal we face several challenges: which information to present,
how to identify and extract the answer, how to present the information, and so on. We
address the major issues of building a supportive model, information mining, and
icon-based information representation.
5.1 Introduction
The clustering methodology groups the Web pages returned as search results into clusters
containing similar types of information. As the search query is in general not very precise,
the search results returned by the search engine are huge in size and cover various domains.
A user has to guess which Web page may contain the desired information and then go
through the whole Web page content to find the desired piece of information. If the guess
goes wrong, the user has to try another Web page. This trial-and-error process continues
until the user finds the information. For our target users, we prefer to reduce this
prediction and searching overhead. We have noticed that users tend to seek some
additional information along with the desired, specific piece.
Therefore, we plan to present a precise answer and some additional information related
to the query. While representing precise information in iconic form, we face the problem
of word-sense ambiguity: the same word can convey different senses in different contexts.
Before representation, we need to disambiguate the word sense so that we can choose the
correct icon from the icon database.
To the best of our knowledge, no work has been reported so far on text-to-icon
representation. Some question-answering works have addressed finding the answer to a
particular question from a fixed database or free text. Annotation of a knowledge base
in order to obtain answers with the help of natural language processing was proposed by
Katz [58]. Later he proposed knowledge mining of Web information, integrating it with a
corpus-based knowledge annotation technique to achieve better performance in question
answering [59, 74]. Question answering by searching large corpora with linguistic methods
is proposed in [55]. A way of handling the unstructured as well as the structured Web is
suggested by Cucerzan and Agichtein [31]. Radev et al. used a probabilistic approach [98]
to answer queries. The approach of combining syntactic information with traditional
question answering can be found in the Quaero [109] system. Some other established
question-answering systems are Ionaut¹ (AT&T Research), MULDER (University of
Washington) [65], AskMSR (Microsoft Research) [8], InsightSoft-M (Moscow, Russia) [107],
MultiText (University of Waterloo) [30], Shapaqa (Tilburg University) [21], Aranea (MIT)
[74], TextMap (USC/ISI) [36], LAMP (National University of Singapore) [128], NSIR
(University of Michigan) [99], and AnswerBus (University of Michigan) [131]. The
objective of question answering is to find the exact answer. Our objective is slightly
different: to find specific as well as related important information in a precise form.
Therefore, we generate query-related templates that identify the important entities and
corresponding attributes related to the query. With the help of the entity-attributes of all
domain-related queries, we build an entity-attribute model. This model helps to find the
most relevant cluster for a specific query from the bag of clusters. Next, we model the
potential answers that can be present in the clustered Web pages as the values of
attributes; this model assists in finding the values of the attributes. Extracted words,
phrases or sentences are disambiguated with the help of the semantic similarity measure
of WordNet [87] and the XML icon base of Section 3.2. Finally, according to the template,
we display the finalised icons to the user.
The organization of the chapter is as follows. Section 5.1 has discussed the context and
motivation of the work. Section 5.2 describes the proposed methodology for representing
Web page information. The experimental results are discussed in Section 5.3, and
Section 5.4 concludes the chapter.
¹Ionaut, www.ionaut.com/
5.2 Proposed Methodology
The objective of this section is to find query-related information in the clustered results.
To achieve this, we first have to identify the cluster most likely to contain the query-related
information. To mine answers from the selected cluster and to represent them in a
user-understandable iconic form, we follow the modules stated below.
• Building the supportive model: In Section 3.2.1, we identified our domain of interest
and the corresponding queries. With the help of this domain knowledge, we build an
entity-attribute model and a potential answer model, which work as supportive models
for information extraction.
• Information mining: In this phase, we select the most appropriate cluster for
information mining. As we know the query-related entities and attributes, we identify
and extract the values of those target attributes with the help of the entity-attribute
model and the answer patterns.
• Icon-based information representation: In this phase, we use the icon database to
display iconic information. The finalised attribute values are disambiguated and
presented in the form of icons using the output template.
Figure 5.1 gives an overview of the proposed methodology. A detailed description of each
module is given in the following subsections.
5.2.1 Building Supportive Model
The queries fired at the search engine can be general or specific in nature. We have
observed that the chance of the desired result being present in the Web repository is low
if the query is too specific (e.g. the query "date of Bihu dance festival in Assam in 2012").
On the other hand, if the query is too general, then the chance of retrieving a relevant
result is also low: for example, the chance of obtaining Web pages relevant to "Assam's
jeep safari" is low if the search query is simply "Assam's transport". So, a moderately
specific query is best suited to retrieve the desired information. A statistical analysis¹
also shows that users fire general queries more often than very specific ones. For instance,
the queries "culture Assam", "dance Assam", "Bihu dance Assam" and "Bihu dance
Assam in 2012" are fired 2400, 2900, 720 and 0 times, respectively, in a month globally
in the Google
¹Google AdWords, https://adwords.google.com/o/KeywordTool
[Flow diagram: a domain-related user query is fed to the search engine; the retrieved
results are clustered and the relevant cluster is selected; the supportive model
(query-corresponding templates, entity-attribute model, potential answer modelling)
drives attribute-value extraction in the information-mining stage; word-sense
disambiguation, the icon database and answer-template filling produce the icon-based
information representation]
Figure 5.1: Overview of the proposed approach
search engine. This reveals that users are interested in obtaining some general and related
information along with their specific query need. Sometimes a number of queries are
implied by a single query: the query "culture of Assam" implies the language, festivals,
songs, dances, crafts, religions, dress, food, etc. of Assam. We can consider "culture" here
as an entity, and language, festival, song, dance, craft, religion, dress, food, etc. as its
attributes. Note that every "culture" has some "history"; so the entity "culture" is related
to the entity "history". To extract information from a Web page we need to know the
main entity we want to know about and the attributes corresponding to that entity. We
also check the related entities to extract related information if it is provided in the Web
page. Therefore, we need a general entity-attribute model which represents the related
entities and their corresponding attributes.
5.2.1.1 Developing Query Corresponding Templates
We have already finalised 50 queries related to the tourism domain. To represent those
queries, we create 23 templates. A query template represents the main object of the
query, the main characteristics of that object, and related characteristics. Each template
can represent one or more queries. A "*" denotes a mandatory attribute in a template,
whereas "#" denotes an optional attribute. An example of a culture-related template, which
represents three different types of culture queries (the culture or a specific festival of a
place at a particular time, the main festival of a place, and the season of a festival of
a place) is shown in Figure 5.2.
5.2.1.2 Developing Entity-Attribute Model
Once we prepare the templates of the tourism related queries, it is easy to identify
the main entities and related attributes of tourism domain. Along with these entity-
attributes other related entity-attributes are considered to build up the model. We also
take the help of an established tourism ontology of owl. Our prototype model contains
42 entities and their attributes. Figure 5.3 shows the model where ellipse represents
an entity. Related entities are linked by line. While building the model we generate
a database which includes the attribute synonyms, semantically similar words and the
words which are closely related with the attributes.
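The entity-attribute model can be sketched as a small dictionary structure. This is a minimal illustration, assuming the entities, relations and synonyms shown here, which are only the examples mentioned in the text, not the full 42-entity model:

```python
# A minimal sketch of the entity-attribute model. Entity names,
# relations and synonym lists are illustrative examples from the text.

entity_attributes = {
    "culture": ["language", "festival", "song", "dance",
                "craft", "religion", "dress", "food"],
    "transport": ["reservation", "departure", "arrival"],
    "hotel": ["reservation", "check-in", "check-out"],
}

# Related entities are linked, mirroring the lines in Figure 5.3.
related_entities = {
    "culture": ["history"],
    "transport": ["hotel"],
}

# Synonyms and closely related words per attribute, as in the database.
attribute_synonyms = {
    "language": ["language", "speak", "talk", "communicate"],
}

def attributes_for(entity):
    """Return an entity's attributes plus those of related entities."""
    attrs = list(entity_attributes.get(entity, []))
    for rel in related_entities.get(entity, []):
        attrs += entity_attributes.get(rel, [])
    return attrs

print(attributes_for("culture"))
```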
5.2.1.3 Potential Answer Modeling
In this phase we model the probable answers related to the queries, which may be present
in the retrieved documents in different forms. As we have already identified our query-related
entities and attributes, we can identify the values of those attributes depending
CULTURE
*Place: State/province, District, City/town/village
Specification: Dance, Song, Language, Festival, Craft, Religion, Dress, Food
Time: Year, Month, Season
Queries:
Culture/*specification of *place in #time
Main *specification in *place
Festival season in *place
Figure 5.2: Query template: Culture
[Figure 5.3 depicts the 42 linked entities: state, town, river, desert, hill, forest, waterfall, fort, palace, sacred_place, sanctuary, tower, park, dam, lake, valley, culture, transport, hospital, college, weather, school, market, zoo, museum, theatre, stadium, clothing, animal, village, bridge, glacier, adventure, king, tour, sea, island, beach, package, hotel, reservation, sports.]
Figure 5.3: Entity-attribute model
on some predefined patterns. While building the entity-attribute database, we have
included the words which are closely related to the attributes. For each attribute we
define some patterns. For example, the related words of the attribute "language" are language,
speak, talk, communicate etc. So, the probable answer patterns are
P1: #s11 language s′12
P2: #s21 speak/talk/communicate s′22
P3: s′32 language #s31
P4: s′42 speak/talk/communicate #s41
Here sij and s′ij represent strings: sij is the subject string and s′ij is the target string.
# implies that the string following it may or may not occur in the phrase
or sentence. If the string (si1) occurs, it generally contains the name of the place
or people for which we are searching the language. In this way we generate the rules
for each attribute corresponding to a query. Sometimes two different entities may have a
common attribute. For example, the two entities "transport" and "hotel" have the common
attribute "reservation". The related words for "reservation" are "from", "to", "departure",
"arrival", "check-in", "check-out" etc. But "departure" and "arrival" appear in the context
of "transport", while "check-in" and "check-out" occur in the context of "hotel". So, at the
time of searching for the attribute value, we need to consider not only the attribute but also
the context, that is, the corresponding entity.
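The answer patterns P1-P4 can be sketched as regular expressions over the attribute's related words. This is a simplified sketch, not the thesis's actual implementation; the group names "subject" and "target" are illustrative:

```python
import re

# A sketch of the answer patterns P1-P4 for the attribute "language",
# expressed over its related words (language, speak, talk, communicate).

KEYWORDS = r"(?:language|speak|talk|communicate)"

# P1/P2: optional subject string, then a keyword, then the target string.
PREFIX = re.compile(r"(?:(?P<subject>\S+)\s+)?" + KEYWORDS + r"\s+(?P<target>\S+)")
# P3/P4: target string, then a keyword, then an optional subject string.
SUFFIX = re.compile(r"(?P<target>\S+)\s+" + KEYWORDS + r"(?:\s+(?P<subject>\S+))?")

m = SUFFIX.search("which is also the state language of Assam")
print(m.group("target"))  # -> state
```

In practice the context entity (e.g. "transport" vs. "hotel") would also be checked before accepting a match, as the paragraph above explains.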
5.2.2 Information Mining
In this subsection, we discuss the way to find query-related attribute values from the
most preferable cluster. In order to find the most preferable cluster, we use the query
templates and the entity-attribute model generated in the previous section. From the query
corresponding template, we can easily identify the main entity and attributes of the query.
The entity-attribute model gives information about related attributes. Next, we find
the similarity of each cluster with the query corresponding entity and attributes. The semantic
similarity measure of WordNet [87] is used for similarity measurement. The cluster with the
maximum similarity value is considered the most preferable cluster for containing the attribute
values.
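The cluster selection step can be sketched as follows. The thesis uses the WordNet semantic similarity measure [87]; to keep the example self-contained, a plain word-overlap (Jaccard) score stands in for it, and the cluster labels are illustrative:

```python
# A sketch of most-preferable-cluster selection. A Jaccard word-overlap
# score stands in for the WordNet similarity measure used in the thesis.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_cluster(clusters, query_terms):
    """Pick the cluster whose terms overlap most with the query's
    entity and attributes."""
    return max(clusters, key=lambda c: jaccard(clusters[c], query_terms))

clusters = {
    "festivals": ["culture", "festival", "dance", "assam"],
    "hotels": ["hotel", "reservation", "assam"],
}
query_terms = ["culture", "language", "festival", "dance"]
print(best_cluster(clusters, query_terms))  # -> festivals
```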
5.2.2.1 Attribute Value Extraction
Generally, information in a Web page is presented in paragraphs, and we notice that the
topic differs more or less from paragraph to paragraph. So, we consider a paragraph as the processing unit. With
the help of the predefined patterns (Ref. 5.2.1.3) we find the candidate attribute values
in the selected documents.
n-gram selection: We observed that the target string may not occur exactly at the
position of s′i,j due to the occurrence of some adjectives, adverbs, qualifying words etc.
But in most cases, it occurs within six words of the s′i,j string position. Therefore, we
find the matching patterns in a paragraph and extract all n-gram words at the s′i,j string
position, where n is 6. The n-gram words can be selected from the left or right direction. If the
target string precedes the matching pattern (in the case of a prefix pattern) then we consider
n-gram words from the left direction, and from the right direction vice versa. For example,
consider the sentence "The natives of the state of Assam are known as "Asomiya"
(Assamese), which is also the state language of Assam.". To extract the value of the
attribute "language" we use the pattern P3: s′32 language #s31. This is a suffix pattern
where the matching pattern (language #s31) follows the target string (s′32). So, we consider
n-gram words from the right direction. The selected n-grams are: state, the state, also the
state, is also the state, which is also the state, (Assamese), which is also the state. We
term these n-gram words candidate phrases. For an infix pattern we consider n-grams
from both directions. While selecting the n-grams, we never cross the sentence boundary.
Next, the stopwords are filtered out from the candidate phrases.
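The n-gram extraction for the suffix pattern above can be sketched as follows: the candidate chunks are the 1..6-word sequences ending just before the matched word "language", never crossing the sentence boundary. The helper name is illustrative:

```python
# A sketch of n-gram candidate extraction (n = 6) for a suffix pattern:
# chunks of 1..n words ending just before the anchor word.

def left_ngrams(sentence, anchor, n=6):
    """Return the 1..n-word chunks ending just before the anchor word."""
    words = sentence.rstrip(".").split()
    i = words.index(anchor)
    return [" ".join(words[i - k:i]) for k in range(1, min(n, i) + 1)]

sentence = ('The natives of the state of Assam are known as "Asomiya" '
            '(Assamese), which is also the state language of Assam.')
for chunk in left_ngrams(sentence, "language"):
    print(chunk)
```

Run on the worked example, this reproduces the six candidate phrases listed in the text, from "state" up to "(Assamese), which is also the state".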
Ranking candidate phrases: From the clustered documents, we may find more than
one candidate value for a single attribute. We need to rank these values in order to find
the top-ranked value. At first, we simply count the number of occurrences of each candidate
phrase.
• Merging candidates: As we consider n-gram word chunks as candidate phrases, a
shorter candidate occurs more often than a longer one. If a longer candidate
phrase contains a shorter candidate phrase as a substring and their occurrences
are the same, then we eliminate the shorter one. We also eliminate the shorter
candidate if the longer candidate containing it occurs more than a
threshold value (we use 5).
• Allowing tolerance: In some cases, values presented as numbers (e.g. distance)
may differ depending on the source of information. The difference may be very
small, but it causes a number of separate candidate phrases. Suppose two sentences from
two different Web pages are "The total rainfall is 230 mm" and "The total rainfall
is 233 mm". The corresponding pattern is P1: #s11 rainfall s′12. As it is a prefix
pattern, the candidate phrases are: is, is 230, is 230 mm, is 233, is 233 mm. In order
to avoid this condition we allow a tolerance on number-type attributes. The tolerance
varies from attribute to attribute.
• Unit conversion: It is possible that the value of an attribute is given in two
different units in two different Web pages, e.g. "The total rainfall is 230 mm" and
"The total rainfall is 23 cm". These two Web pages support the same answer but it
results in two different probable answers. We apply a unit conversion method to map
different types of units to the same one. Then we compare and allow tolerance to
reduce the number of candidate phrases.
Note that, while modifying the candidate phrases we never change the word sequence.
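The merging rule above can be sketched as follows; the phrase counts are illustrative (drawn from the worked example later in the section), and the helper name is hypothetical:

```python
from collections import Counter

# A sketch of candidate merging: a shorter candidate phrase is dropped
# when a longer candidate contains it as a substring and either occurs
# the same number of times or occurs more than the threshold (5).

THRESHOLD = 5

def merge_candidates(counts):
    kept = dict(counts)
    for short in counts:
        for long_ in counts:
            if short != long_ and short in long_:
                if counts[long_] == counts[short] or counts[long_] > THRESHOLD:
                    kept.pop(short, None)
    return kept

counts = Counter({"tradition craft silk": 4, "tradition craft": 4, "tradition": 4})
print(merge_candidates(counts))  # -> {'tradition craft silk': 4}
```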
Constraints: As we consider n-gram chunks of words as candidate phrases, they
may contain some extra text along with the exact attribute value. We apply some
constraints to filter out those extra terms.
• Datatype and attribute value: We define the datatype of each attribute to remove extra
text. For example, for the attribute "distance" the datatype is number. In some cases the
values of an attribute can only be one of a fixed list (e.g. the states of India). For
this type of attribute we can filter the candidate phrases using the valid values of
the list.
• Attribute characteristics: There may be multiple values present in the text for
a single attribute. So, the attribute characteristic (single-valued or multi-valued)
needs to be declared. For a multi-valued attribute, the values in general are
separated by `,'. In that case we consider not only a single n-gram chunk but all
text separated by `,', the n-gram chunk before the first `,' and the n-gram chunk after
the last `,' or `and'. If an attribute supports the multi-valued characteristic then we
consider more than one value as the attribute value if the occurrences of the finalized
values vary within 10%.
• Part-of-speech: Sometimes, for emphasis, descriptive words (e.g. adjectives,
adverbs) are attached to the exact value. These extra words can be eliminated
by knowing the part-of-speech of the target attribute value.
All the candidate phrases are modified according to these constraints. Finally, the
candidate phrases are ranked in decreasing order of their occurrence. The top-ranked
phrase(s) is (are) considered the value(s) of the attribute.
We illustrate the process of attribute value extraction with an example. Suppose, using
the answer patterns discussed in sub-section 5.2.1.3, we want to find the value of the
attribute "language" corresponding to the query "culture Assam". We consider the most
preferable cluster, containing eight Web pages. From the Web pages we identify the
following sentences containing answer patterns.
• The natives of the state of Assam are known as "Asomiya" (Assamese), which is
also the state language of Assam.
• Diverse tribes like Bodo, Kachari, Karbi, Miri, Mishimi, Rabha, etc co-exist in
Assam, most tribes have their own languages though Assamese is the principal
language of the state.
• Bengali-speaking Hindus and Muslims represent the largest minorities, followed by
Nepalis and populations from neighboring regions of India.
• However, in each of the elements of Assamese culture, i.e. language, traditional
crafts, performing arts, festivity and beliefs either local elements or the local ele-
ments in a Hinduised / Sanskritised forms are always present.
• The records of many aspects of the language, traditional crafts (silk, lac, gold,
bronze, etc), etc are available in different forms.
• The original Tai-Shans assimilated with the local culture, adopted the language
on one hand and on the other also influenced the main-stream culture with the
elements from their own.
The movement contributed greatly towards language, literature and performing and
fine arts.
• Brajavali a language specially created by introducing words from other Indian lan-
guages had failed as a language but left its traces on the Assamese language.
• The language was standardised by the American Missionaries with the form avail-
able in the Sibsagar (Xiwoxagor) District (the nerve centre of the Ahom politico-
economic system).
• Sanskritisation was increasingly adopted for developing Assamese language and
grammar.
Using the patterns, we select n-grams from the above-mentioned sentences, as shown in Ta-
ble 5.1 and Table 5.2. Stopword removal reduces the number of candidate phrases.
Table 5.1: n-gram from prefix pattern
No. | Matching phrase | n-gram phrases | Filtered phrases
1 | of Assam | of, of Assam | Assam
2 | though Assamese is the principal language | though, though Assamese, though Assamese is, though Assamese is the, though Assamese is the principal, though Assamese is the principal language | Assamese, Assamese principal, Assamese principal language
3 | of the state | of, of the, of the state | state
4 | Hindu and Muslim represent the large | Hindu, Hindu and, Hindu and Muslim, Hindu and Muslim represent, Hindu and Muslim represent the, Hindu and Muslim represent the large | Hindu, Hindu Muslim, Hindu Muslim represent, Hindu Muslim represent large
5 | tradition craft perform art festive and | tradition, tradition craft, tradition craft perform, tradition craft perform art, tradition craft perform art festive, tradition craft perform art festive and | tradition, tradition craft, tradition craft perform, tradition craft perform art, tradition craft perform art festive
6 | tradition craft silk lac gold bronze | tradition, tradition craft, tradition craft silk, tradition craft silk lac, tradition craft silk lac gold, tradition craft silk lac gold bronze | tradition, tradition craft, tradition craft silk, tradition craft silk lac, tradition craft silk lac gold, tradition craft silk lac gold bronze
7 | on one hand and on the | on, on one, on one hand, on one hand and, on one hand and on, on one hand and on the | hand
8 | literature and perform and fine art | literature, literature and, literature and perform, literature and perform and, literature and perform and fine, literature and perform and fine art | literature, literature perform, literature perform fine, literature perform fine art
9 | special create by introduce word from | special, special create, special create by, special create by introduce, special create by introduce word, special create by introduce word from | special, special create, special create introduce, special create introduce word
10 | have fail as a language but | have, have fail, have fail as, have fail as a, have fail as a language, have fail as a language but | fail, fail language
11 | but leave its trace on the | but, but leave, but leave its, but leave its trace, but leave its trace on, but leave its trace on the | leave, leave trace
12 | be standardize by the American Missionary | be, be standardize, be standardize by, be standardize by the, be standardize by the American, be standardize by the American Missionary | standardize, standardize American, standardize American Missionary
13 | and grammar | and, and grammar | grammar
Table 5.2: n-gram from suffix pattern
No. | Matching phrase | n-gram phrases | Filtered phrases
1 | Assamese which is also the state | Assamese which is also the state, which is also the state, is also the state, also the state, the state, state | Assamese state, state
2 | Assam most tribe have their own | though, Assam most tribe have their own, most tribe have their own, tribe have their own, have their own, their own, own | Assam tribe
3 | language though Assamese is the principal | language though Assamese is the principal, though Assamese is the principal, Assamese is the principal, is the principal, the principal, principal | language Assamese principal, Assamese principal, principal
4 | Bengali | Bengali | Bengali
5 | the element of Assamese culture i.e. | the element of Assamese culture i.e., element of Assamese culture i.e., of Assamese culture i.e., Assamese culture i.e., culture i.e., i.e. | element Assamese culture, Assamese culture, culture
6 | record of many aspect of the | record of many aspect of the, of many aspect of the, many aspect of the, aspect of the, of the, the | record aspect, aspect
7 | with the local culture adopt the | with the local culture adopt the, the local culture adopt the, local culture adopt the, culture adopt the, adopt the, the | local culture adopt, culture adopt, adopt
8 | The movement contribute great toward | The movement contribute great toward, movement contribute great toward, contribute great toward, great toward, toward | movement contribute great, contribute great, great
9 | Brajavali a | Brajavali a, a | Brajavali
10 | by introduce word from other Indian | by introduce word from other Indian, introduce word from other Indian, word from other Indian, from other Indian, other Indian, Indian | introduce word Indian, word Indian, Indian
11 | Indian language have fail as a | Indian language have fail as a, language have fail as a, have fail as a, fail as a, as a, a | Indian language, language
12 | left its trace on the Assamese | left its trace on the Assamese, its trace on the Assamese, trace on the Assamese, on the Assamese, the Assamese, Assamese | left trace Assamese, trace Assamese, Assamese
13 | The | The | —
14 | be increase adopt for develop Assamese | be increase adopt for develop Assamese, increase adopt for develop Assamese, adopt for develop Assamese, for develop Assamese, develop Assamese, Assamese | increase adopt develop Assamese, adopt develop Assamese, develop Assamese, Assamese
Table 5.3: Ranked candidate phrases
Phrase | Assamese | tradition craft | principal | culture | adopt | language | Assamese principal | Hindu | tradition craft silk
Count | 15 | 9 | 5 | 5 | 5 | 5 | 4 | 4 | 4
Next, we count the occurrences of all candidate phrases and merge short candidates into
longer ones if their occurrences are the same. We also eliminate a shorter candidate if the
longer candidate containing it occurs more than the threshold value of 5. In
this example there is no scope for allowing tolerance or unit conversion. Table 5.3 shows
the top candidates after ranking. Next, we check the constraints. The datatype of the
language attribute is a simple string, but it must be one of the Indian languages. There
may be more than one language, and the part-of-speech of the target answer is noun. Among all the
candidates only one satisfies all the constraints: the top-ranked term "Assamese".
So, we finalize "Assamese" as the value of the attribute language.
5.2.3 Icon-Based Knowledge Representation
In this phase we represent the finalized value of the attribute in terms of icons. We select
icons from the icon database according to the attribute value. This phase consists of the
following two sub-phases.
5.2.3.1 Word Sense Disambiguation
Some words in the English vocabulary convey different meanings in different contexts though
their spelling and pronunciation are the same. These types of words are known as homonyms
[63] (e.g. river bank - river bed, reserve bank - a financial institution). On the other hand,
a polyseme [63] is a word or phrase with different, but related, senses (e.g. bank - financial
institution, bank on - rely upon). We distinguish these types of words by their context.
Therefore, to represent the finalized values along with the attributes we have to consider
the context, and accordingly we select icons from the icon database. To disambiguate a
word we use the icon vocabulary of Section 3.2.2. Our vocabulary contains keywords,
synonyms of each keyword, semantically similar words of the keyword, different senses of the
keyword and corresponding images. First, each value phrase we want to represent is
tokenized into a list of words. Each word is checked against the word vocabulary. If the word
is present in the vocabulary as a keyword then we check the different senses of the word.
We calculate the similarity score [3] of each sense with the word along with its context.
The sense that scores highest is considered the correct sense of the word, and the corresponding
image is selected for representation. If the word is not present in the vocabulary as a
keyword, we next check the word as a synonym and as a semantically similar word, one after
another. If it is found, the corresponding keyword is tracked and the senses are examined in
a similar way. If the word is absent from our vocabulary, then a dummy icon related to the
word is generated for display.
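The lookup described above can be sketched as follows. The vocabulary entries, icon file names and the overlap-based score are illustrative stand-ins for the icon vocabulary of Section 3.2.2 and the similarity measure [3]:

```python
# A sketch of the word sense disambiguation lookup: each keyword maps
# to senses, each with context words and an icon; the sense whose
# context best matches the word's surroundings wins.

vocabulary = {
    "bank": {
        "financial": {"context": {"money", "reserve", "account"},
                      "icon": "bank_building.png"},
        "river": {"context": {"river", "water", "bed"},
                  "icon": "river_bank.png"},
    },
}

def pick_icon(word, context_words):
    """Choose the icon of the sense best matching the word's context."""
    senses = vocabulary.get(word)
    if senses is None:
        return "dummy_icon.png"  # word absent from the vocabulary
    best = max(senses,
               key=lambda s: len(senses[s]["context"] & set(context_words)))
    return senses[best]["icon"]

print(pick_icon("bank", ["river", "bed"]))  # -> river_bank.png
print(pick_icon("lighthouse", []))          # -> dummy_icon.png
```

A fuller version would first fall through synonym and semantically-similar-word lists before giving up, as the paragraph above describes.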
5.2.3.2 Answer Template Filling
Once the information-related icons are decided, the answer template is filled using those
icons to display to the user. Generally, a paragraph led by a heading represents infor-
mation about the heading. Therefore, if there is any heading before a paragraph, we first
present the heading and then the corresponding attribute and its values. For the query
"culture assam" the following attribute values are selected: dance - Bihu, Satriya, Barpeta,
Jhumur; language - Assamese; festival - Bihu; craft - weaving, cane-bamboo craft, paint-
ing, jewellery making, wood craft; religion - Hindu, Muslim, Buddhist, Christian; cloth
- cotton, silk. The values of the attributes music and food are not found
in the clustered Web pages. The icon-based representation of all the information
is shown in Figure 5.4.
Figure 5.4: Visualisation of information related Assam culture
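The template filling step can be sketched as follows, using the attribute values listed above for "culture assam". Each value stands in for the icon chosen for it, and the bracketed layout is illustrative:

```python
# A sketch of answer template filling: one line per attribute, with
# each value standing in for its icon in the interface.

answers = {
    "dance": ["Bihu", "Satriya", "Barpeta", "Jhumur"],
    "language": ["Assamese"],
    "festival": ["Bihu"],
    "craft": ["weaving", "cane-bamboo craft", "painting",
              "jewellery making", "wood craft"],
    "religion": ["Hindu", "Muslim", "Buddhist", "Christian"],
    "cloth": ["cotton", "silk"],
}

def fill_template(answers):
    """Lay out one template line per attribute."""
    lines = [f"[{attr}] " + " ".join(f"<{v}>" for v in values)
             for attr, values in answers.items()]
    return "\n".join(lines)

print(fill_template(answers))
```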
5.3 Experimental Results
We evaluated our proposed approach to information representation on the basis of infor-
mation understanding by the target users. In this section, we present a detailed description
of our experiments and the results observed. For testing we use the same experimental
setup and users mentioned in Section 3.3.1.
We selected ten tourism-related queries from the fifty benchmarked queries for testing.
Users were asked to generate these queries with the help of the developed icon-based interface.
The retrieved search results are clustered in the intermediate clustering stage and a single
cluster is selected as the most probable cluster containing the information. After mining that
cluster we obtain the query-related information and represent it in terms of icons. The users
are asked to recognize those icons as well as to understand the iconic message. Display of
a wrong icon by the system indicates a failure in word sense disambiguation. We grade the
users on identifying the icons and interpreting the iconic message, and the system on word
sense disambiguation. Table 5.4 presents the results.
The icon recognition percentages for the above-mentioned ten Web pages (86.14%, 87.50%,
Table 5.4: Test result of visual representation
Web page | Recognised concept | Number of icons | Wrong icons | Correctly recognized icons | Message interpreted
Markets of Kolkata | market | 108 | 7 | 87 | B
Culture of Assam | culture | 165 | 13 | 133 | G
Delhi Guide | city | 126 | 17 | 79 | Bd
Hotels of Hyderabad | hotel | 52 | 6 | 41 | Av
Wildlife of India | wildlife | 143 | 9 | 112 | G
Royal Rajasthan | state | 62 | 6 | 45 | Av
Transport of Bangalore | transport | 45 | 4 | 34 | G
Tour package of Kashmir | package | 73 | 7 | 61 | B
Sea beach of south India | beach | 39 | 4 | 27 | G
Fort in Delhi | fort | 52 | 6 | 40 | B
Grade for icon identification: above 90%: 5; (80-90)%: 4; (70-80)%: 3; (60-70)%: 2; below 60%: 1.
Grade for message interpretation: Excellent (Ex): 5; Better (B): 4; Good (G): 3; Average (Av): 2; Bad (Bd): 1.
Grade for word sense disambiguation: above 95%: 5; (90-95)%: 4; (85-90)%: 3; (80-85)%: 2; below 80%: 1.
72.48%, 89.13%, 84.21%, 80.36%, 82.93%, 92.42%, 77.14%, 86.96%) are calculated as
(No. of correctly recognized icons) / (Total no. of icons - No. of wrong icons). We
can also calculate the failure rate of word sense disambiguation by the formula (No. of
wrong icons) / (Total no. of icons). The failure percentages are 6.48%, 7.88%, 13.49%,
11.54%, 6.29%, 9.68%, 8.89%, 9.59%, 10.26%, 11.54% respectively. The overall efficiency of
the system can be calculated with respect to three criteria: icon recognition, word sense
disambiguation and message interpretation.
The mean score for icon recognition = (4 + 4 + 3 + 4 + 4 + 4 + 4 + 5 + 3 + 4)/10 = 3.9
The mean score for word sense disambiguation = (4+4+3+3+4+4+4+4+3+3)/10 = 3.6
The mean score for message interpretation = (4+3+1+2+3+2+3+4+3+4)/10 = 2.9
So, overall performance = (3.9 + 3.6 + 2.9)/3 = 3.47
The calculation shows that our proposed system is (3.47/5) ∗ 100 = 69.33% efficient.
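The grading arithmetic above can be checked in a few lines; the two rows shown are taken from Table 5.4:

```python
# Reproducing the evaluation arithmetic: icon recognition rate is
# correct / (total - wrong); WSD failure rate is wrong / total.
# Figures below are the first two rows of Table 5.4.

rows = {  # concept: (total icons, wrong icons, correctly recognized)
    "market": (108, 7, 87),
    "culture": (165, 13, 133),
}

for concept, (total, wrong, correct) in rows.items():
    recognition = 100 * correct / (total - wrong)
    failure = 100 * wrong / total
    print(f"{concept}: recognition {recognition:.2f}%, WSD failure {failure:.2f}%")

overall = (3.9 + 3.6 + 2.9) / 3  # mean of the three criteria scores
print(f"overall {overall:.2f}/5 = {overall / 5 * 100:.2f}%")
```

This reproduces the 86.14% and 87.50% recognition rates, the 6.48% and 7.88% failure rates, and the overall 3.47/5 = 69.33% reported above.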
5.4 Conclusion
In this chapter, we presented a simple way to represent basic Web information in terms
of icons, understandable to the underprivileged user. The proposed approach identifies the
appropriate Web page cluster for information mining. From that cluster we find the
user's query-related basic information and produce a simple iconic sequence. All the iconic
sequences are presented in a template. This iconic message will help our target users to
obtain and understand some basic information related to the query. The approach can
also be utilized for other purposes like interacting with uneducated people, cross-language
communication etc. We have developed a prototype version of our proposed approach
which can be enhanced further to represent any type of knowledge independent of any
domain. Some extensions, like avoidance of redundant information, improvement of iconic
message expression and automation of pattern generation, can give the work a
complete shape. We consider these extensions as our future work.
Chapter 6
Conclusion and Future Work
Of late, a huge information repository has been built up and maintained on the Web. People
share and access this information through the Internet. But access to this repository
is limited to a certain group of privileged people who have good reading,
writing and understanding capability in the English language. The rest of the people cannot
avail themselves of the benefits of the Internet. As a solution to this problem we have proposed an icon-
based interface to retrieve information from the Internet. Using the interface, our target users
can generate their desired query by means of icon selection. The English query equivalent to
the generated iconic query is fed to a search engine. As a query, a word or a group of words
can imply multiple meanings in different contexts. A Web search engine, however, cannot
distinguish the context and hence retrieves a huge amount of information from different domains in
response. These search results are incomprehensible for our target users, as they can neither
understand nor find their desired information in the returned results. Therefore, we
have clustered the search results based on Web page content similarity and found the cluster
most relevant to the query. Next, we have found the query-related important information
in the clustered Web pages. Finally, the selected information is displayed to the users
in the form of icons.
The work solves a real-life problem and overcomes the language barrier to Internet
access. Regarding the development of the interface, we have provided a general approach for
developing any icon-based interface. We have addressed some basic issues, like building the
icon vocabulary, icon management and icon arrangement in the interface, which are not ad-
dressed in any prior work. Though we limited our implementation to the tourism
domain, extension of the work to any other domain is possible. To cluster Web search
results, we have proposed a new clustering algorithm. The algorithm takes care of clus-
ter quality as well as the time constraint. To substantiate its efficacy we compared our
algorithm with established clustering algorithms. Finally, we have mined the clustered Web
pages to find query-related information. Developing query corresponding templates,
developing the entity-attribute model, potential answer modeling, attribute value extrac-
tion and icon-based knowledge representation are addressed in this regard. Existing question
answering approaches target a particular answer, whereas we target query-related
important and precise information. Representation of information in iconic form has not been
addressed elsewhere.
6.1 Introduction
This vast Web repository is used to share and access information according to user need.
However, the benefits of the Internet are confined to a certain group of people. Statistics
show that only 34.3%1 of the world population uses the Internet. One of the main reasons for
this poor participation is language illiteracy. A major portion of the Web repository (55.4%)
is written in English2, whereas the world's English literacy rate is quite poor. Therefore, it
is clear that a vast information repository is freely available but is not consumable for
underprivileged people. To solve this issue, a common interaction medium is needed which
is understandable by any user irrespective of their cultural and language background. In
this direction, we have developed an icon-based interface to retrieve and represent Web
information in a user-understandable form.
To the best of our knowledge, the problem we address is the first of its kind. Our goal is to
make Web information accessible to illiterate or semi-illiterate people. In order to achieve
our target we have identified three main challenges: giving input to the search engine,
finding query-related important information and representing it in a user-understandable form.
Previously, icons have been used as the medium of interaction for different objectives, like man-
machine interaction, interacting with quadriplegic people, interacting with semi-illiterate
people, various applications (hotel booking, chatting, alerting in crisis situations etc.)
and children's learning. Most of the works used an icon-based interface for user interaction
but did not address basic issues such as what the icon vocabulary should be, how
to manage the icon vocabulary and how to organize icons in the interface. In our work
we have addressed these issues. We did not emphasize natural language
generation from the iconic sequence because, firstly, a query we generally fire at a search engine is
not a well-defined sentence and, secondly, the search engine handles query formation on
its own.
1World Internet Users Statistics Usage and World Population Stats, www.internetworldstats.com/stats.htm
2W3Techs, www.w3techs.com/technologies/overview/content_language/all
We have faced several issues regarding Web information representation. As a search
engine generally returns a large number of Web pages in response to a query, a major
decision needs to be taken regarding which information we should present, how to find it
and how to present it. To fulfill our requirement we have used clustering and informa-
tion mining mechanisms. Though several clustering mechanisms (Hierarchical, k-Means,
Lingo, STC etc.) are available, we propose a new clustering mechanism, HK Clustering,
to cluster Web search results. Most of the existing Web clustering mechanisms concen-
trate either on cluster quality or on response time. In the HK Clustering mechanism, we
balance the time constraint and the cluster quality constraint. For mining information from
the clustered Web pages, we have developed an entity-attribute model. With the help of this
model we have found query-related precise information. We have followed the ques-
tion answering mechanism for mining information. In our approach, we have handled a
normal search query and mined all important as well as related information, whereas the
question answering mechanism concentrates only on a specific query and a specific answer.
Finally, to represent the mined information in terms of icons, we disambiguated each word
of the information phrase and found the proper icon in the icon base.
In this chapter, we first summarized our work. Next, in Section 6.1 we discuss
the importance of our work. Section 6.2 describes the contribution of our work.
Finally, Section 6.3 concludes the work and shows future directions.
6.2 Contribution of Our Work
In the complete work, we have made several contributions. Each of the contributions is
discussed next.
1. Deciding the medium of interaction: As our target users are illiterate and semi-
illiterate people, we cannot use normal text as the interaction medium between
the user and the Internet. We analyzed several alternatives, like speech, gesture and icons,
and selected icons for various reasons: language independence, ease of adoption,
a faster and more expressive medium, and offering recognition rather than recall.
2. Deciding the implementation domain: Solving the problem for a general domain is a quite
voluminous task. Therefore, we implemented our proposal in the tourism domain. How-
ever, a similar application can be built for any other domain following the proposed
methodology.
3. Developing the icon vocabulary: We described a way of deciding the icon vocabulary for
any domain. The proposal gives an idea of how to choose icons in any domain so that they
cover the entire domain without redundancy. It also gives an idea of icon optimization.
4. Management of icons: We also described a way of efficient icon management:
how to use icons to represent similar concepts, and how to maintain an index that
supports easy icon storage and retrieval.
5. Arrangement of icons: All the icons cannot be kept in the primary interface.
Therefore, this work addresses how to maintain the icon hierarchy and how to
organize icons in the interface.
6. Proposing a new clustering methodology to cluster search results considering com-
plexity and cluster quality: To group informative Web pages from the search results,
we performed clustering. As our main target is to mine precise information, we pri-
oritize cluster quality over response time. Therefore, we propose a new clustering
mechanism that obtains good clusters within an affordable response time.
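As a rough illustration of quality-first grouping of search results (not the clustering mechanism proposed in the thesis; the Jaccard measure and the threshold value are assumptions), snippets can be merged greedily only when they are sufficiently similar, so a stricter threshold trades speed for purer clusters:

```python
# Illustrative greedy snippet grouping; thresholds and names are assumed.

def jaccard(a, b):
    # Word-overlap similarity between two snippets, treated as token sets.
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_snippets(snippets, threshold=0.3):
    # Assign each snippet to the first cluster whose representative
    # (first member) is similar enough; otherwise open a new cluster.
    clusters = []
    for s in snippets:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

For example, `cluster_snippets(["taj mahal agra timings", "taj mahal agra entry fee", "goa beach resorts"])` yields two clusters, separating the Agra results from the Goa one.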
7. Developing an entity-attribute model: To identify the important attributes related
to a query, we developed query-specific templates, and from these templates we
derived an entity-attribute model. This entity-attribute model helps to identify
the cluster appropriate to a query for mining information.
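In spirit, an entity-attribute model associates each entity type with the attributes worth mining for it; the entity types and attribute names below are illustrative tourism-domain examples, not the thesis's actual templates:

```python
# Hypothetical entity-attribute model for the tourism domain.
ENTITY_ATTRIBUTES = {
    "hotel": ["location", "tariff", "contact", "facilities"],
    "place": ["location", "attractions", "best_season", "how_to_reach"],
    "train": ["source", "destination", "departure", "fare"],
}

def attributes_for_query(query_words):
    # Identify which entity the query refers to and return the attributes
    # worth mining for it; unrecognized queries yield no attributes.
    for word in query_words:
        if word in ENTITY_ATTRIBUTES:
            return word, ENTITY_ATTRIBUTES[word]
    return None, []
```

A query such as "hotel darjeeling" would thus resolve to the hotel entity, directing the miner toward tariff, location, and contact information.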
8. Preparing an answer model and mining Web information: With the help of the
entity-attribute model, we developed an answer model that describes the different
possibilities and forms in which information is present in a Web page. It helps to
extract query-related answers, which are processed further by ranking and by
applying constraints to obtain the query-related information.
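The extract-then-rank shape of this step can be sketched as follows; the thesis's answer model is richer, and the sentence splitting and scoring below are simplifying assumptions:

```python
import re

# Illustrative answer extraction: keep sentences mentioning the query's
# attributes and rank them by how many attributes they cover.

def extract_answers(page_text, attributes):
    sentences = re.split(r'(?<=[.!?])\s+', page_text)
    scored = []
    for s in sentences:
        hits = sum(1 for a in attributes if a.lower() in s.lower())
        if hits:
            scored.append((hits, s))
    # Higher attribute coverage ranks first (stable for ties).
    scored.sort(key=lambda t: -t[0])
    return [s for _, s in scored]
```

Sentences mentioning none of the modeled attributes are dropped, which is one simple way of applying constraints before ranking.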
9. Icon-based information representation: As English exhibits homonymy (unrelated
meanings sharing the same word form) and polysemy (different but related senses),
understanding the word sense is necessary to select icons from the icon base.
Therefore, we disambiguated the finalized information word by word and retrieved
the proper icon that can convey the information.
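The disambiguation step rests on gloss overlap, in the manner of the simplified Lesk algorithm; the thesis works with WordNet senses, whereas the two-sense inventory for "bank" below is a hand-made stand-in so the sketch stays self-contained:

```python
# Toy word-sense disambiguation by gloss overlap (simplified Lesk).
# The sense inventory is invented; a real system would query WordNet.

SENSES = {
    "bank": {
        "bank_river": "sloping land beside a river or lake",
        "bank_money": "financial institution that accepts deposits money",
    },
}

def disambiguate(word, context):
    # Choose the sense whose gloss shares the most words with the context;
    # the winning sense id can then be looked up in the icon base.
    ctx = set(context.lower().split())
    best, best_overlap = None, -1
    for sense_id, gloss in SENSES.get(word, {}).items():
        overlap = len(ctx & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense_id, overlap
    return best
```

Given the context "walk along the river near the bank", the river sense wins, so the river-bank icon, not the money-bank icon, would be selected.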
6.3 Future Work
We limited our implementation to tourism-related information retrieval because of the vast-
ness of the work and the absence of a proper icon base. Designing appropriate icons for our
target users is beyond our scope. We targeted the tourism domain because travel-related
icons are used in public places and are somewhat more familiar to general users than icons
of other domains. We used various tourism-related icons available on the Internet, which
may not be the best for our users; the use of proper icons may increase the efficacy of
the system. Another limitation of our work concerns the representation of Web infor-
mation. We narrow down the amount of information as per the requirements of our
target users, assuming that the users are not interested in all the information present
in a Web page and that a large number of iconic messages would confuse them. We
manually decided the important attributes related to a query, which may not cover all
the important information in a Web page that a user may need. In future, the selection
of attributes from a Web page can be automated using machine learning. Extensive
testing is required to decide the exact amount of information preferred by the target
users, and proper feedback can improve system performance. Apart from this, we can
also provide a prediction mechanism to help the user in query generation. This work
can be extended in various directions: for example, the system can be reconfigured for
motor-impaired users or for multilingual communication.
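The envisaged prediction mechanism could, for instance, learn which icon tends to follow which in past icon queries; the bigram predictor below is a future-work sketch only, and the training sequences and icon names are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical next-icon predictor trained on past icon query sequences.

class IconPredictor:
    def __init__(self):
        self._next = defaultdict(Counter)

    def train(self, queries):
        # Count which icon follows which in past icon sequences.
        for q in queries:
            for cur, nxt in zip(q, q[1:]):
                self._next[cur][nxt] += 1

    def predict(self, icon):
        # Most frequent successor of the given icon, or None if unseen.
        counts = self._next.get(icon)
        return counts.most_common(1)[0][0] if counts else None
```

After training on sequences such as hotel→tariff, the interface could pre-highlight the tariff icon whenever the user selects the hotel icon, reducing the effort of query composition.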
Publications out of this work
Published
• S. Maiti, D. Samanta, S. R. Das, and M. Sharma. Language Independent Icon-
Based Interface for Accessing Internet. In First International Conference on Advances
in Computing and Communications (ACC 2011), pp. 172-182, July 2011,
Kochi, India (Springer)
• S. Maiti, S. Dey, and D. Samanta. Development of Iconic Interface to Retrieve In-
formation from Internet. In IEEE Students' Technology Symposium (TechSym 2010),
pp. 268-275, April 2010, Kharagpur, India (IEEE)
Accepted
• S. Maiti and D. Samanta. Icon-Based Representation of Web Information. In 4th
International Conference on Intelligent Human Computer Interaction (IHCI 2012),
December 2012, Kharagpur, India (IEEE)
Communicated
• S. Maiti and D. Samanta. Clustering Web Search Results to Identify Information
Domain. Foundations and Trends in Information Retrieval (FTIR)