ICON-BASED INTERFACE TO INTERNET FOR
LANGUAGE ILLITERATE PEOPLE
Thesis submitted to the
Indian Institute of Technology Kharagpur
for award of the degree
of
Master of Science (by Research)
by
Santa Maiti
Under the guidance of
Dr. Debasis Samanta
School of Information Technology
Indian Institute of Technology Kharagpur
Kharagpur - 721 302, India
May 2013
© 2013 Santa Maiti. All rights reserved.
CERTIFICATE OF APPROVAL
Date:
Certified that the thesis entitled Icon-Based Interface to Internet for Language
Illiterate People, submitted by Santa Maiti to the Indian Institute of Technology,
Kharagpur, for the award of the degree of Master of Science, has been accepted by the
external examiners and that the student has successfully defended the thesis in the
viva-voce examination held today.
(Member of DAC) (Member of DAC) (Member of DAC)
(Member of DAC) (Member of DAC) (Member of DAC)
(Supervisor)
(Internal Examiner) (Chairman)
CERTIFICATE
This is to certify that the thesis entitled Icon-Based Interface to Internet for
Language Illiterate People, submitted by Santa Maiti to the Indian Institute of
Technology Kharagpur, is a record of bona fide research work carried out under my
supervision, and I consider it worthy of consideration for the award of the degree of
Master of Science (by Research) of the Institute.
Date: 30/05/2013
Dr. Debasis Samanta
Associate Professor
School of Information Technology
Indian Institute of Technology Kharagpur
Kharagpur - 721 302, India
DECLARATION
I certify that
a. The work contained in the thesis is original and has been done by myself under the
general supervision of my supervisor.
b. The work has not been submitted to any other Institute for any degree or diploma.
c. I have followed the guidelines provided by the Institute in writing the thesis.
d. I have conformed to the norms and guidelines given in the Ethical Code of Conduct
of the Institute.
e. Whenever I have used materials (data, theoretical analysis, and text) from other
sources, I have given due credit to them by citing them in the text of the thesis
and giving their details in the references.
f. Whenever I have quoted written materials from other sources, I have put them
under quotation marks and given due credit to the sources by citing them and
giving required details in the references.
Santa Maiti
ACKNOWLEDGMENT
First and foremost, I wish to convey my deep sense of gratitude to my mentor, Prof.
Debasis Samanta. It has been my blessed opportunity to be his student. I appreciate
all his contributions of time, ideas, and vision in making my research experience
productive and memorable. I have learned humility, patience, and hard work from him.
I would like to thank Prof. Jayanta Mukhopadhyay, Head of SIT, for extending to me
all possible facilities to carry out the research work. I wish to thank all of my
departmental academic committee members, Prof. A. Gupta, Prof. C. R. Mandal, Prof.
S. Sural, Prof. S. K. Ghosh, Prof. K. S. Rao, Prof. S. Misra, and Prof. R. R. Sahay,
for their valuable suggestions during my research. I would also like to thank Prof.
P. Mitra for his valuable guidance in my research work. I sincerely remember the
support of the office staff: Mithun Da, Soma Di, Malay Da, Vinod Da, and others. I am
also grateful to all members of the School of Information Technology.
I owe my deepest gratitude to Somnath Da, Sayan Da, Debasish Da, Sankar Da,
Barik Da, and Maunendra Da for strengthening my research with constant moral support
and necessary guidance whenever required. I really learnt a lot from them. I wish
to convey my heartfelt thanks to Manoj Kumar Sharma, Soumalya Ghosh, Pradipta
Kumar Saha, Puspak Das, Jaya Krishna, Vinit Sinha, Satya Ranjan Das, Sudhamay
Maity, Ramu Reddy Vempada, Narendra NP, Krishnendu Ghosh, Partha De, Ruchira
Naskar, Tuhin Chakraborti, Tamoghna Ojha, Jayeeta Mukherjee, Arindam Dasgupta,
Soumya Maity, Nirnay Ghosh, and many more.
I wish to thank my friends for helping me get through the difficult times, and for
all the emotional support, entertainment, and caring they provided. I am greatly
indebted to my friends Rajasri Bandyopadhyay and Soumya Bhattacharya for their
constant inspiration. It is really very difficult to express in just a few words my
gratitude to my brother, who never teaches me anything but stays beside me whatever
I do, who never leads but shows all the paths to travel, and with whom I never fear
a fall. He is not rare but unique. Thank you for being my brother.
Lastly, and most importantly, I would like to thank my parents and other family
members for their continuous inspiration and moral support. I am deeply indebted to
them.
Santa Maiti
Abstract
With the goal of free knowledge distribution, a vast information repository is being
built up on the Web. People are able to access and share this information through the
Internet. It helps people to enrich their knowledge base as well as to get quick
suggestions for any problem. However, this opportunity is limited to language-literate
people, who can read, write, and comprehend a language, specifically English. This
work aims to develop an icon-based interaction system with which less educated or
uneducated people can search for and access information on the Internet.
To achieve our target, we first develop an icon-based interface with which the target
users can generate and fire queries in a search engine. We select the icon as the
interaction medium because it is language independent and easy to learn. It can also
be treated as a faster mechanism of communication, facilitating recognition rather
than recall. Towards the development of the icon-based interface, the major issues
addressed are deciding the domain and domain-related queries, preparing the icon
vocabulary, and icon management.
Usually, the search result returned by a search engine is large and may come from
different domains; not all results belong to the user's area of interest. Moreover,
the ranked representation of Web search results is incomprehensible to our target
users. As a way out, a clustering mechanism is advocated, by which similar Web pages
can be grouped together so that representation and extraction of information related
to the search query become easier. To achieve this, in this work we propose
preprocessing of Web pages, document feature extraction, an inter-document similarity
measure, and clustering of Web pages.
Further, for representing the information, it is neither worthwhile nor possible to
represent all Web pages' information in terms of icons. Instead, salient information
is mined according to predefined patterns related to the user's query, and the
selected information is displayed to the user in terms of icons. The subtasks
addressed in this regard are building an entity-attribute model, query recognition,
potential answer modelling, and attribute value extraction.
A few experiments have been performed to check the effectiveness of the proposed
methodology. We evaluated the proposed icon-based interface with respect to user
friendliness and efficiency in query generation; the results reveal that the
developed interface is around 79% effective in generating search queries. A
comparison of our proposed clustering algorithm with benchmark clustering algorithms
shows that the proposed algorithm provides an optimal solution by balancing cluster
quality and time complexity. Finally, we check the comprehensibility of iconic
messages. The experimental results substantiate the efficacy of the proposal.
To the best of our knowledge, the proposed icon-based interaction for accessing
information from the Internet is the first of its kind. It alleviates the digital
divide between privileged and unprivileged users of the Internet. Moreover, the
proposed interaction mechanism can be reconfigured for use by motor-impaired users.
Keywords: Icon-based user interface, language-independent communication,
human-computer interaction, information retrieval, document clustering, clustering
algorithm, icon-based information representation.
Contents
Approval i
Certificate iii
Declaration v
Dedication vii
Acknowledgment ix
Abstract xi
Contents xv
List of Figures xvii
List of Tables xix
List of Symbols and Abbreviations xxi
1 Introduction 1
1.1 Need and Urgency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Scope of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Objective of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Plan of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Contribution of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Related Work 13
2.1 Icon-Based Interface: State of Art . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Use of Icon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Icon Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Icon-Based Applications . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Document Clustering Techniques . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Basic Clustering Techniques . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Other Search Result Clustering Algorithms . . . . . . . . . . . . . 22
2.3 Question Answering Techniques . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Development of Icon-Based Interface 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Deciding Icon Vocabulary and Domain Related Queries . . . . . . 31
3.2.2 Maintaining Large Icon Repository . . . . . . . . . . . . . . . . . . 32
3.2.3 Icon Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.4 Developed Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Experiments and Experimental Results . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Experimental Setup and User Details . . . . . . . . . . . . . . . . . 39
3.3.2 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Clustering Web Search Results 43
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Preprocessing of Web Documents . . . . . . . . . . . . . . . . . . . 46
4.2.2 Document Feature Extraction . . . . . . . . . . . . . . . . . . . . . 47
4.2.3 Inter Document Similarity Measure . . . . . . . . . . . . . . . . . . 49
4.2.4 Our Proposed Clustering Algorithm . . . . . . . . . . . . . . . . . 50
4.3 Experiments and Experimental Results . . . . . . . . . . . . . . . . . . . . 58
4.3.1 Experimental Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.2 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.3 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Icon-Based Information Representation 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2.1 Building Supportive Model . . . . . . . . . . . . . . . . . . . . . . 67
5.2.2 Information Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.3 Icon-Based Knowledge Representation . . . . . . . . . . . . . . . . 77
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6 Conclusion and Future Work 81
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2 Contribution of Our Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Publications 87
References 89
List of Figures
1.1 Statistical Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
(a) Size of World Wide Web 1 . . . . . . . . . . . . . . . . . . . . . . . . 3
(b) Internet users 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
(c) World literacy statistics 3 . . . . . . . . . . . . . . . . . . . . . . . . 3
(d) Web page language 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
(e) User classification . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 An overview of our work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 An overview of our approach . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 A snapshot of XML file . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Icon-based interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Example of query generation. . . . . . . . . . . . . . . . . . . . . . . . . . 36
(a) Selection of `weather' icon. . . . . . . . . . . . . . . . . . . . . . . . 36
(b) Display of `weather' icon. . . . . . . . . . . . . . . . . . . . . . . . . 36
(c) Move to `where' hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Example of query generation. . . . . . . . . . . . . . . . . . . . . . . . . . 37
(d) Enable hierarchy and select `India' icon. . . . . . . . . . . . . . . . . 37
(e) Disable hierarchy and select `Kolkata' icon. . . . . . . . . . . . . . . 37
(f) Move to `when' hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Example of query generation. . . . . . . . . . . . . . . . . . . . . . . . . . 38
(g) Enable hierarchy and select `month' icon. . . . . . . . . . . . . . . . 38
(h) Disable hierarchy and select `December' icon. . . . . . . . . . . . . . 38
(i) Query completion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Overview of our work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Overview of our proposed HK Clustering algorithm . . . . . . . . . . . . . 50
4.3 Illustration of the proposed HK Clustering algorithm. . . . . . . . . . . . . 55
4.4 Download time of documents . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Comparison of Dunn Index . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 Overview of the proposed approach . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Query template: Culture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Entity-attribute model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Visualisation of information related to Assam culture . . . . . . . . . . 78
List of Tables
3.1 Domain related word extraction . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Tourism related benchmark queries . . . . . . . . . . . . . . . . . . . . . . 33
3.3 User details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 User training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Interface testing result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Comparison of cluster quality . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Comparison of time complexity . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Web document download time with threading and without threading . . . 62
4.4 Determination of threshold values and number of feature vectors . . . . . 64
5.1 n-gram from prefix pattern . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 n-gram from suffix pattern . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Ranked candidate phrase . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 Test result of visual representation . . . . . . . . . . . . . . . . . . . . . . 79
List of Symbols and Abbreviations
List of Symbols
α Dissimilarity threshold
β Merging similarity threshold
γ Belonging similarity threshold
CP Cluster Pool
Ctemp Dequeued cluster
DI Dunn Index
OldS Seed document at previous level
S Seed document at current level
Sim Similarity matrix
List of Abbreviations
AAC Augmentative and Alternative Communication
HCI Human Computer Interaction
LSI Latent Semantic Indexing
NLG Natural Language Generation
PDA Personal Digital Assistant
TDM Term-Document Matrix
Chapter 1
Introduction
With the advancement of information technology, the Internet has become an essential
part of every sphere of our lives. Every day, thousands of questions arise in the
human mind, and people face difficulties at their workplaces as well as in their
daily lives. Earlier, the only way to get help with a problem was to take advice from
expert or experienced persons. But since a single person cannot be an expert in every
field, and one person cannot be in contact with experts in all fields, such problems
remained unsolved most of the time. The key reason is the absence of a proper
communication medium. With the development of computers, the idea of the Internet was
introduced in 1973 to establish communication between computers [70]. It offered a
new way of information exchange. Next, the idea of the World Wide Web (WWW) was
proposed to make any information available anywhere. Gradually, a vast information
repository has been built up. At present, the WWW is a global information medium
through which users can access and share information via computers connected to the
Internet. More formally, the WWW is a system of interlinked hypertext documents
accessed via the Internet through a Web browser. The Internet and the WWW have thus
enabled people to obtain their desired information in the form of text, images,
videos, etc., and to navigate between them via hyperlinks [95]. Still, the
opportunity is limited to educated people who can read, write, and comprehend a
language, specifically English. This work aims to make Web information accessible to
language-illiterate and semi-illiterate people.
The rest of the chapter is organized as follows. In Section 1.1, the need and urgency
of the work and the corresponding challenges are discussed, and some related works
are explored in this regard. The scope of this work is discussed in Section 1.2.
Section 1.3 describes the objective of our work, and our work plan is presented in
Section 1.4.
Finally, the contributions and the outline of the thesis are presented in Section 1.5
and Section 1.6.
1.1 Need and Urgency
At present, the Web repository works as a mine of information, and people share and
access this information through the Internet. The World Wide Web has expanded about
2000% since its inception and is doubling in size every six to ten months [35].
According to a recent survey1, the indexed Web contains 7.78 billion pages
(Figure 1.1a). However, the benefits of the Internet are limited only to educated
people. Recent statistics2 show that in the year 2011, per 100 inhabitants, the
estimated numbers of Internet users in developed and developing countries were 74 and
26, respectively (Figure 1.1b). One of the main reasons for such a difference is
language illiteracy (mainly English illiteracy). Traditionally, literacy is described
as the ability to read and write. Though literacy all over the world (Figure 1.1c) is
83.7%3, it remains a matter of concern for many developing countries, where
approximately 54.74% of the total population is literate4. Further, this literacy
refers to familiarity with native languages, which is not sufficient because 55.4% of
Web pages are in English5 (Figure 1.1d). So, to get the maximum advantage of the
Internet, a user should be English literate.
Presently, English is the most widely published language all over the world. Over 1.8
billion people use English as a first, second, or foreign language. It is an official
language in 52 countries as well as in many small colonies and territories.
Figure 1.1e depicts a Venn diagram of user classification with respect to language
familiarity, as far as users' ability to read (R), write (W), and speak (S) their
native (N) and English (E) languages is concerned. The users are classified as U1,
U2, U3, and U4. Users in U1 and U4 can read, write, and speak English and their
native language, respectively. Users in U3, a subset of U1, can speak English;
similarly, users in U2 (a superset of U4) can speak their own native language. So,
the users in U2-U4 and a major portion of U3 and U4 are our target users. Such people
cannot read, write, or understand English properly. We term such people
underprivileged or novice people with respect to computer and Internet use.
Typically, they are rickshaw pullers, porters, farmers, shopkeepers, gatekeepers,
etc.
1 The size of the World Wide Web, www.worldwidewebsize.com
2 ITU: Committed to connecting the world, www.itu.int
3 UNESCO, www.uis.unesco.org/FactSheets/Documents/FS16-2011-Literacy-EN.pdf
4 Human Development Reports, www.hdr.undp.org/en/reports
5 W3Techs, www.w3techs.com/technologies/overview/content_language/all
Figure 1.1: Statistical reports: (a) size of the World Wide Web1; (b) Internet
users2; (c) world literacy statistics3; (d) Web page language5; (e) user
classification.
In the current context, it is clear that the main issue is not scarcity of
information resources but the information access medium. Therefore, the challenge is
to bring Internet information within reach of underprivileged users.
• The very first problem is to determine the interaction mode that is suitable for
our target users. From the point of view of feasibility, we note that different
communication modes, such as speech and gesture, could be used to meet our objective.
But along with feasibility, we have to take care of several other issues such as cost
effectiveness, user friendliness, and adaptability. The interaction mode should
support an easy input mechanism as well as easy understanding of the output.
• The next challenge is to identify the information domain. A query, that is, a word
or a group of words generated by a user, is generally abstract in nature and may
imply multiple meanings in different contexts. A Web search engine, however, cannot
distinguish the context and hence retrieves a huge amount of information, not all of
which is related to the user's query need. Therefore, we have to identify information
related only to the user's area of interest.
• Another problem is that, in general, a Web page covers multiple topics, related
information, related links, etc. It is not worthwhile to give the entire Web page
content to our target users. The challenge is to distinguish between valuable, less
valuable, and non-valuable information.
A work on optimal audio-visual representations for illustrating concepts [83] checked
the comprehensibility of such representations for illiterate and semi-literate users.
This work was done in the health-care domain, and it reveals that richer audio-visual
information is not necessarily better understood by the target users. Another work
aimed to increase mobile usability for illiterate, and literate but novice, users
[82]. It considered three text-free interfaces for three different requirements: a
spoken dialog system, a graphical interface, and a live operator. The graphical user
interface was used in the context of mobile banking, but the developed interface was
not suitable for independent use by first-time users. In order to make users more
autonomous, a user interface was then designed such that even novice, illiterate
users require no intervention from anyone at all to use it. Two applications were
developed in this regard: one for job search for domestic laborers, and another for a
generic map that could be used for navigating a city. Our work addresses a different
issue: Internet accessibility for illiterate and semi-illiterate users. Previous
works for illiterate users preferred the mixed mode, that is, visual plus speech, for
interaction. But in all these cases, the amount of data or information exchanged is
quite small compared to the information obtained from search results, and most of the
implemented applications deal with a very limited vocabulary. In the case of the
speech mode, we would have to translate the information into all existing languages
to support users across the world. A work on searching the Internet using an iconic
interface [113] has been proposed for school-going children. There, the target users
have limited reading and spelling abilities, but they are not illiterate. The work
did not mention how to comprehend the searched information; perhaps it assumed that
the children can understand the information because they are literate. For our target
users, we have to filter only the important information according to their desired
query; otherwise, the large amount of information generally returned by the search
engine will confuse them.
1.2 Scope of Work
In order to achieve our goal, we can use different interaction modes and interacting
devices. Input can be given mainly through the motor control of the effectors
(fingers, voice, eyes, head and body position, etc.), and the responses of the system
are sensed through the various human senses (vision, hearing, touch, etc.). Based on
the interacting devices, interaction can be of three types: visual-based,
audio-based, and sensor-based interaction [57].
• Visual-based interaction includes facial expression analysis, body movement
tracking, gesture recognition, gaze detection, etc. It helps to recognize and analyze
human emotion as well as body language. This type of interaction is generally used in
augmentative and alternative communication (AAC) systems for quadriplegic people. It
can also be used in an intelligent system that considers user expression as input.
But it is really difficult to train our target users in body language; this
interaction type is more suitable for quadriplegic users.
• In the case of audio-based interaction, system interaction takes place via audio
signals such as speech, music, and different sound patterns. Audio signals can be
more trustworthy, helpful, and in some cases unique providers of information compared
to visual signals. Speech recognition, speaker recognition, auditory emotion
analysis, detection of human-made noises and signs (gasp, sigh, laugh, cry, etc.),
and musical interaction are some important areas in audio-based interaction systems.
Audio, specifically speech-based interaction, is quite helpful for our target users,
but for that we would have to make it compatible with the users' languages.
• The third type of interaction medium is sensor-based interaction, which uses
sensors ranging from primitive to very sophisticated ones: pen-based interaction,
mouse and keyboard, joysticks, motion-tracking sensors and digitizers, haptic
sensors, pressure sensors, and taste/smell sensors.
Among these three types of interaction media, the mouse and keyboard are the most
popular and widely used, and are known as cost-effective interacting devices. The
keyboard is generally used to generate text, and the mouse for positioning, pointing,
and drawing. We can use different alternative modes to give input to the search
engine, but for output we have very narrow options. We cannot use text as the medium
of front-end user interaction. On the other hand, search engines take text as input
and provide search results as text. So, we need an interpreter that converts the
search results into a user-understandable form. As an interpreter we could use
speech, gesture, icons, gaze, etc. Representing entire Web page resources in all
possible languages in text or speech form requires enormous effort, while
gesture-based interaction suffers from adaptability issues and requires sufficient
training. Therefore, we choose the icon as the medium of interaction for both input
and output.
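As a toy illustration of what such a front-end interpreter does in both directions,
the sketch below maps icon selections to a textual query and known result words back
to icons. The vocabulary and icon file names are hypothetical stand-ins, not the
thesis's actual icon database.

```python
# Toy sketch of a front-end interpreter between words and icon identifiers.
# The vocabulary below is an illustrative assumption, not the actual icon
# database used in this work.

WORD_TO_ICON = {
    "weather": "weather.png",
    "kolkata": "kolkata.png",
    "december": "december.png",
}
ICON_TO_WORD = {icon: word for word, icon in WORD_TO_ICON.items()}

def to_query(selected_icons):
    """Input side: a sequence of icon selections becomes a textual query."""
    return " ".join(ICON_TO_WORD[icon] for icon in selected_icons)

def to_icons(result_words):
    """Output side: known words in a search result are rendered as icons;
    words with no icon are dropped rather than shown as text."""
    return [WORD_TO_ICON[w] for w in result_words if w in WORD_TO_ICON]
```

For example, `to_query(["weather.png", "kolkata.png"])` yields the text query
"weather kolkata", and `to_icons(["cool", "weather"])` keeps only the icon for
"weather", illustrating why a sufficiently rich icon vocabulary matters.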
We choose the icon for interaction: the user gives input to the system by selecting
icons and understands the system's response by comprehending iconic messages. An icon
can be defined as a small graphic representation of some information or object. The
icon is selected for the following reasons.
• Language independent: Icons offer a language-independent interaction medium. Unlike
speech or text, they require only a single, unified, language-independent icon
database understandable by users from any background.
• Easy to learn: Text-based interaction requires a good understanding of the grammar
of the corresponding language, which is really hard for our target users. On the
other hand, iconic sense is easy to comprehend. We may note that icons are used in
public places to convey information or concepts to people irrespective of their
background knowledge.
• Recognition rather than recall: In icon-based communication, memorizing complex
language-dependent characters is not required; the user only needs to recognize the
word-related icon to interact with the system. Further, this recognition is easy
because, in general, an icon resembles the actual object.
• Faster and expressive: Icons offer a faster way of interaction compared to text
because of their expressiveness. A single icon is able to represent a word, a
sentence, or even a concept. For example, the icon of a handicapped person in a train
compartment implies that the compartment is reserved for handicapped people.
Using icons as an interaction medium also raises some potential problems. The main
problem with icons is their ambiguity: ambiguity or misrepresentation of an icon may
lead to major confusion. At present, icon standardization exists only in a few
domains with very specific requirements, such as computing and transport. The
scarcity of standard icon databases is a major drawback in building any icon-based
system. In order to achieve our goal, we have used icons as the mode of interaction
and have tried to address the icon-related problems by building a domain-dependent
icon vocabulary. In our work, the icon is used as a front-end interpreter. Our
exhaustive literature survey reveals a few shortcomings in the practicability of
icons in general and iconic interfaces in particular. The various scopes related to
icon-based interaction are listed below.
• Icon management: Design strategies for icons in an interface have been explored
extensively, but the management of a large icon database still needs to be worked
out. Work on proper icon arrangement in the interface, to reduce visual search
overhead and the possibility of error, is not reported elsewhere.
• User friendliness: A desirable property of any interface is that it should be user
friendly, allowing the user to interact with minimum effort. Work on rate-enhancement
strategies in the context of iconic interfaces is scarcely reported. Semantic
dependency between icons and prediction methodology can be incorporated into the
interface to offer a faster and easier way of query generation. The feature of query
expansion can also be incorporated to offer the user similar and related query
options.
• Handling icon ambiguity: Icons are ambiguous in nature; a single icon can represent
different meanings in different contexts. Only a few works have addressed metaphoric
representation of icons and the implementation of rule-based, context-sensitive
disambiguation strategies.
• Searching the Web: Results retrieved from any search engine against a query are
huge and mixed. The existing format of search result presentation against a user
query is not user friendly, and this is becoming an important issue, particularly
with the exponential growth of the Web repository.
• Search result representation: Another shortcoming concerns the representation of
search results in a user-understandable form. To the best of our knowledge,
representing information in terms of icons has not been reported elsewhere.
1.3 Objective of the Thesis
We plan to develop an icon-based interface that helps the target users frame a
query. This query is fed to the Google search engine to obtain Web search results.
The vast set of retrieved results is then mined to extract concrete results. Finally, the
concrete results are transformed into iconic form. The objectives of our work are as
follows.
1. Development of an icon-based interface: We plan to develop an iconic interface to
access a certain range of information from the Internet.
2. Clustering Web search results: We propose an intermediate processing step that
returns search results using clustering. The proposed approach produces coherent
clusters, where the number of clusters is decided at run time. Our main target in this
phase is to obtain better cluster quality with less response time.
3. Icon-based information representation: In this part, we first mine a selected cluster
to obtain information. A mapping from text to icons is planned to represent search
results in a user-understandable form.
Figure 1.2 presents an overview of our work plan: an icon-based query is used to retrieve
search results, which are clustered and mined into a concrete result that is finally given a
visual representation as icon-based information.

Figure 1.2: An overview of our work
1.4 Plan of the Thesis
In order to achieve our target, the work plan is as follows.
• We plan to develop an icon-based interface for searching information on the Internet
by means of icon selection. Towards the development of the icon-based interface,
the first challenge is to decide the icon vocabulary. The icon vocabulary should
be optimal: it should not confuse the target user, and it should support all types
of queries and search results. The next task is to decide the domain-related queries that the
user will feed into the search engine. Another challenge is to manage a large set of
icons in a limited display area.
• It is neither worthwhile nor possible to represent all the information of the retrieved
Web pages in terms of icons. A large number of iconic messages may lead to major
confusion in message comprehension, so only the important information needs
to be mined and displayed. However, the search results returned by a search engine
are large in size and, in general, come from different domains; not all of them may
belong to the user's area of interest. Therefore, we propose clustering the Web
search results in order to group Web pages with similar content. Time and cluster-quality
optimization, pre-processing of Web pages, document feature extraction,
and inter-document similarity measures are also addressed in this regard.
• In this phase, mining of the clustered results and a mapping from text to icons are
planned to represent search results in a user-understandable form. This includes
several issues such as building an entity-attribute model, query recognition, potential-answer
modeling, attribute-value extraction, and icon sense disambiguation.
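At its simplest, the text-to-icon mapping planned above can be sketched as a keyword lookup against the icon vocabulary. The tourism words, icon identifiers, and the `map_to_icons` helper below are illustrative assumptions only, not the thesis implementation, which additionally requires entity-attribute modeling and icon sense disambiguation:

```python
# Hypothetical tourism-domain icon vocabulary: keyword -> icon identifier.
ICON_VOCABULARY = {
    "hotel": "icon_hotel",
    "train": "icon_train",
    "beach": "icon_beach",
    "temple": "icon_temple",
}

def map_to_icons(text):
    """Return icon identifiers for every vocabulary word found in `text`.
    A real system would add entity-attribute modeling and icon sense
    disambiguation on top of this plain lookup."""
    return [ICON_VOCABULARY[w] for w in text.lower().split()
            if w in ICON_VOCABULARY]

print(map_to_icons("Cheap hotel near the beach"))  # icons for "hotel" and "beach"
```

Words outside the vocabulary are simply dropped, which is why the vocabulary must cover the whole query domain, as discussed above.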
1.5 Contribution of this Thesis
The objective of the present work is to develop an icon-based interaction system for
underprivileged users. The following is a summary of the contributions made in this
regard. The very first problem is to determine the communication mode. From the
point of view of feasibility, we notice that different communication modes can be used to
meet the objective. To select one, we consider several features, e.g., cost effectiveness,
user friendliness, and adaptability. The other contributions of our work are listed next.
• Design of a system that supports an iconic mode of communication. Implementing a
robust system that supports all queries, independent of any domain, is a complex
task, so some realistic assumptions have been made for the implementation. The
physical interface supports tourism-domain queries. One important issue is the
selection of icons that suit the socio-cultural environment of the target users. Instead
of creating icons directly, we take existing icons from the Web and modify them as
required, because creating icons is itself a complex issue.
• Having chosen icons as the mode of communication, the next issue is to determine
the nature of the physical interface. The interface works in an interactive way. The
iconic mode is designed to work the same way as the word mode, but with less cognitive
load due to the presence of icons. Another contribution is the decision of the required,
relevant icon set covering all domain-related queries. The developed interface consists
of a selected set of organized icons.
• The next critical issue is the selection of a proper clustering algorithm. After a
thorough literature survey, we propose a new clustering algorithm to cluster Web
search results, optimizing both cluster quality and response time.
• To the best of our knowledge, no work has been done on text-to-iconic-message
conversion. We address this issue in our work.
1.6 Organization of the Thesis
This thesis contains six chapters, including this introductory chapter. This chapter
explains the current scenario and the need for the work, briefly describes different
interaction techniques, and presents the work proposal, the issues and challenges it raises,
the scope and objectives related to our goal, and the flow of the work.
Chapter 2 : Related Work
This chapter covers the state of the art for different icon-based interaction systems as well
as iconic interfaces. Next, we discuss the primary methods of clustering and the existing
clustering engines. Finally, some works related to question-answer mining are discussed.
Chapter 3 : Icon-Based Interface
This chapter discusses the steps involved in designing the icon-based interface. This
includes major issues such as deciding the domain-related queries, preparing the icon
vocabulary, and icon management. We evaluate the proposed interface with respect to its
efficiency in search query generation. The experiment, along with its results, is presented
in this chapter.
Chapter 4 : Clustering Web Search Results
This chapter deals with the pre-processing of Web search results. We propose a new
clustering methodology to obtain better cluster quality in affordable response time. Finally,
we discuss the efficacy of the proposed clustering mechanism with respect to other
clustering algorithms.
Chapter 5 : Icon-Based Information Representation
The process of mining the clustered results is discussed in this chapter. It also focuses on
the representation of mined information in a user-understandable iconic form. Experimental
results conclude the chapter.
Chapter 6 : Summary and Conclusion
In this chapter, we summarize our work and discuss its future scope.
Chapter 2
Related Work
Our proposed work covers three different research areas: developing an icon-based interface,
clustering Web search results, and icon-based representation of mined information. We
survey the literature of these areas as it relates to our work. First, we discuss the uses
of icons in different forms and in different fields. We also review the research in the
field of icon-based interfaces. The next discussion is in the area of clustering. Clustering
is one of the most efficient mechanisms for grouping similar objects without any advance
knowledge of the group definitions; it can be used to group Web pages with similar
content. We illustrate the various clustering procedures for clustering Web search results.
To the best of our knowledge, no work has been done on text-to-icon representation.
However, some work has been done on question answering, where the answer to a
particular question is extracted from structured or unstructured resources. We review the
literature in the field of question answering.
This chapter is organized as follows. In Section 2.1, we present the use of icons in
different forms in different applications. Several clustering techniques used for document
clustering as well as for Web search result clustering are described in Section 2.2.
Different strategies used for question answering are discussed in Section 2.3. Finally, we
summarize the reported works in Section 2.4.
2.1 Icon-Based Interface: State of the Art
In this section, we discuss the existing approaches dealing with icons, followed by the
state of the art on icon-based interfaces in different contexts.
2.1.1 Use of Icon
Recently, icons have been chosen as a primary mode of man-machine interaction for
their expressiveness. Most modern written languages are derived from pictorial languages,
but iconic languages seem to be a more recent invention. An experiment in the
Peruvian Amazon area indicates that pencil sketches and photos are important tools for
communication research and praxis^1, mainly for communicating about topics that are
difficult to speak about. Rogers discussed the usefulness of icons in interfaces [103].
Previous attempts to create international languages have not been very successful,
partly because of the need for a significant number of people to know them, and partly
because they have to be learned like any other new language [12]. Iconic communication
is the attempt to build cross-language communication systems that completely avoid the
use of words and rely solely on pictorial symbols. The earliest writing systems were
essentially pictographic in nature, such as ancient Egyptian hieroglyphs, in which a
vocabulary of about 700 different characters was used. Logograms, in which an icon represents
an actual object, are used in the Chinese language. In the 1960s, some interest was also
shown in the potential for a general international pictographic language, in which an
element is defined as a graphic representation; however, its decomposition as well as its
interpretation proved really difficult. Kolers identified technological applications as being
better suited to iconic representation, because the objects and functions involved were
ubiquitous to, and consistent between, cultures [62]. In a more limited but successful way,
symbols have been used in national and international signposting for public service
functions. This includes signs for highways and airports, electronics and packaging [6]. In
addition, international standardization of signs has been achieved in such contexts, as has
the procedure for their definition, proposal and evaluation (International Standards
Organization, 1979). One actual use of the pictographic representation of data can be found
in work on military battlefield displays, in which army units on the battlefield are
represented on a computer display [48, 61, 66]. In 2003, a set of standard map icons was
declared by the U.S. Government for emergency response applications and for sharing
information among emergency responders. This set of symbols has also been used by the
governments of Australia and New Zealand. Tatomir et al. have developed a set of icons
for constructing a map representing features such as crossing types and road blocks [41, 110].
^1 Pencils and Photos as Tools of Communicative Research and Praxis, http://academics.utep.edu/Portals/1800/Singhal-RattineFlaherty%20Article.pdf
2.1.2 Icon Taxonomy
The success of an iconic interface depends on the icon design strategy. The effects of icon
design on human-computer interaction are discussed in [16, 104], where icon characteristics
(semantic distance, concreteness, familiarity, and visual complexity) are investigated to
determine the speed and accuracy of icon identification. Icon metaphors, design alternatives,
display structures, implementation, and a summary of icon design guidelines are
addressed by Gittins in the context of icon-based human-computer interaction [48]. In iconic
interaction, ambiguities in message generation and interpretation need to be removed.
Abhishek et al. introduced a disambiguation strategy for ambiguous iconic environments
based on constraint satisfaction [1]. Icons can be classified in different ways, known as
taxonomies of icons. Different researchers have addressed icon taxonomy according to
their needs, although the basic idea is more or less the same. Lodding classified icons into
three categories: representational, abstract, and arbitrary. Representational icons were
described by Lodding and Blattner et al. as icons that can serve as an example for a
general class of objects. Lodding gave the image of a petrol pump, to represent a petrol
pump, as an example of a representational icon [17, 89, 115]. The same icon category is
introduced by other researchers as `associative' by Gittins, `nomic' by Gaver, `purely
pictographic' by Lindgaard et al., `resemblance' by Rogers, `pictorial' by Lodding and
Webb et al., `concrete' by Purchase, and `similar' by Lidwell et al., where the only
difference is in terminology [43, 48, 73, 75, 97, 103]. Lodding and Purchase described
`abstract' icons as icons that attempt to convey concepts rather than to display the object
itself; Lodding used the image of a broken glass to represent fragility. Later researchers
refer to this type of icon as `symbolic' (Lodding, Webb et al.), `mixed' (Lindgaard et al.),
`metaphorical' (Gaver), or `semi-abstract' (Blattner et al.). Rogers and Lindgaard et al.
divided this icon category into two sub-categories: `exemplar' (e.g., airplane for airport)
and `symbolic' (e.g., lightning for electricity). In the `arbitrary' (Lodding, Rogers, Lidwell
et al.) type of icon there is no intuitive connection between the icon and its referent. This
type of icon is also called `symbolic' (Gaver, Purchase), `purely symbolic' (Lindgaard et
al.), `key' (Gittins), `sign' (Lodding, Webb et al.), `abstract' (Blattner et al.), etc. Gittins
classified icons not only on the basis of type but also on form and color [48]. By form,
icons can be of two types: static and dynamic. Non-movable icons are considered static
icons, whereas animated icons are considered dynamic icons. On the basis of color,
Gittins classified icons into two types
: monochrome and color. According to Dinesh Katre (2007), icons can be classified on
30 different attributes and sub-options^1. Some of them are: detailing, dimension,
light-shadow, size, appearance, effects, and pixelation.
^1 Beware of style in icon design, www.hceye.org/HCInsight-KATRE22.htm
Nowadays, various types of icons are found in different computer environments and
applications: (1) as part of an operating system's desktop environment, such as Windows
XP, Macintosh, or Linux KDE; (2) as part of a specific computer application (within
software toolbars), such as Microsoft Word; and (3) within Internet websites or other
online applications, including website interfaces, forums, blogs, bulletin boards, and
Internet chat applications such as AOL Instant Messenger^1. Some analysis has also been
done on icon usability in the corresponding graphical interfaces in order to provide better
user-interface icons [121].
2.1.3 Icon-Based Applications
An icon can be interpreted by its perceivable form (syntax), by the relation between its
form and what it means (semantics), and by its use (pragmatics) [40]. In this way, icons
can also form a language, where each sentence is formed by a spatial arrangement of
icons [39]. A comparative study of natural language and the design of an iconic language
is given in [11].
Some work has been done towards man-machine interaction through iconic interfaces.
A hotel booking system, based on a form fill-up mechanism, has been designed for
booking hotel rooms by users from different linguistic backgrounds [124].
An iconic environment for programming support, namely HI-VISUAL, is introduced
by Hirakawa et al. and Miller et al. [52, 85].
Pictorial dialogue methods (Barker) are designed for pure person-to-person communication
[9]. CD-Icon, an iconic language based on conceptual dependency (Beardon), was
developed for composing messages by selecting options from a series of interconnected
screens (in the spirit of systemic grammar) [10]. A graphical chatting program called
visual messenger [26] is implemented in Java. Another iconic communication methodology
using a PDA, proposed by Fitrianie et al., is able to interpret and convert iconic
messages into (natural-language) text and speech in different languages [39].
The Elephant's Memory presents a playful learning environment for children^2. It
allows the user to build a visual message by combining symbols from its vocabulary.
A similar type of work has been done by Uden et al. for the youngest children starting to
read and write [113]; the work emphasizes understanding the mental models of children.
Clicker^3 is a writing support and multimedia tool for children of all abilities. It enables
one to write with whole words, phrases, or pictures. Clicker has a powerful graphics
feature with pictures, animations, and movies to illustrate a concept.
^1 Definition of Icon: Types of Icons, www.entity.cc/icon-types.php
^2 The Elephant's Memory, http://www.khm.de/ timot/PageElephant.html
^3 The clicker 5 guide, http://www.cricksoft.com/us/products/clicker/guide/Clicker5 guideus.pdf
Some icon-based Augmentative and Alternative Communication (AAC) systems have
been built for quadriplegic people. An AAC iconic system was developed by Albacete et
al. for people with significant speech and multiple impairments, based on the theory of
icon algebra and the theory of conceptual dependency [4]. Another mobile AAC system
has been proposed to let handicapped persons communicate with others in a free and
convenient manner; a predicate-prediction method is used to support faster interaction
as well as to satisfy the space limitation [67]. Sanyog, an iconic system for multilingual
communication for people with speech and motor impairments, was developed by
Bhattacharya et al. [15]. The Sanyog project initiates a dialog with the user to obtain the
different portions (e.g., subject, verb, predicate) of a sentence and automatically constructs
a grammatically correct sentence based on NLG techniques. Its intended users are children
suffering from cerebral palsy. Sanyog offers communication through icon-to-speech
conversion.
An icon-based interface for communicating in crisis situations on a PDA has been
developed for generating alarm or help messages by people of different backgrounds, roles,
and professions [38]. Work on optimal audio-visual representations for illiterate users of
computers helps illiterate and semi-literate users to express and understand information;
the proposed concept is implemented in the health domain [83].
An XML-based iconic communication system (SCILX) that enables communication
through the Internet has been proposed by Kuicheu et al. [64]. The approach has a formal
foundation based on formal grammars of icons. It allows an iconic sentence to be
translated into an XML document and vice versa.
The MinspeakTM iconic keyboard system, conceived by Baker, uses the principle of
semantic compaction [7]. It involves mapping concepts to multi-meaning iconic sentences
and using these icon sentences to retrieve messages stored in the memory of a
microcomputer. The stored messages can be words or word sequences. A built-in speech
synthesizer is used to generate the voice output. Over the past ten years, more than
20,000 MinspeakTM units have been distributed all over the world, and Swedish, German,
Italian, and other MinspeakTM systems have been developed. An interactive environment
for iconic language design has been proposed by Chang et al., based on the theory of icon
algebra, to derive the meaning of an iconic sentence^1.
^1 A Methodology and Interactive Environment for Iconic Language Design, www.cs.pitt.edu/ chang/365/mins.html
Most systems are based on linguistic theories, such as conceptual dependency theory
and Basic English [41]. Using these systems, a message can be composed by arranging
icons or by combining different symbols to compose new symbols with new meanings. The
arrangement can be realized in a rigid linear order [39, 124] or a non-linear order [40].
Only a few works produce text as the output of interpreting the visual representation.
Some systems are hard to learn and some are language-specific; they are based either on
overly complex linguistic theories or on non-intuitive icons. An iconic visual interlingua,
Visual Inter Language (VIL), based on the notion of simplified speech, is introduced in
[69]. VIL reduces the complexity of an iconic language significantly by avoiding inflection,
number, gender, tense markers, and articles.
2.2 Document Clustering Techniques
The next part of our literature survey relates to clustering. In this section, we present a
survey of existing approaches to document clustering, especially Web search result
clustering. A Web search engine generally returns thousands of pages in response
to a query, making it difficult for users to identify relevant information. Clustering
methods can be used to automatically group the retrieved results into a list of meaningful
categories. Clustering of Web documents is done on the basis of some similarity measure.
Similarity between Web pages usually means content-based similarity, which emphasizes
the content of a Web page instead of its embedded links. It is also possible to consider
link-based similarity and usage-based similarity. Link-based similarity is related to the
concept of co-citation and is primarily used for discovering a core set of Web pages
on a topic. Usage-based similarity tries to discover user navigation patterns from Web
data and useful information from the secondary data derived from the interactions
of users while surfing the Web [35]. Document clustering can be performed in advance
on the whole Web collection, before conducting any search; this is known as pre-retrieval
or offline clustering. Offline clustering produces directory-structured search results.
Dmoz, or ODP (Open Directory Project), is an example of such a human-edited directory
of the Web. Web directories are most often used to influence the output of a direct search
in response to common user queries. This method combines the best features of
query-based and category-based search. On the other hand, post-retrieval or online
clustering is performed only on the retrieved Web pages. As online clustering considers
only relevant Web pages (a subset of the vast Web repository) as input, it is faster and
produces superior results. There are two types of post-retrieval clustering. The clustering
system may re-rank the results and offer a new list to the user; in this case, the system
usually returns the items contained in one or more optimal clusters. Alternatively, the
clustering system groups the ranked results and gives the user the ability to choose the
groups of interest in an interactive manner [24].
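For the content-based similarity mentioned above, a standard choice is the cosine of term-frequency vectors. The helper below is a generic sketch of that measure, with a hypothetical `cosine_similarity` name; it is not the similarity measure defined later in the thesis:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical content scores (numerically) 1; disjoint content scores 0.
print(cosine_similarity("hotel beach goa", "hotel beach goa"))
print(cosine_similarity("hotel beach", "train schedule"))
```

Real Web-page comparison would first apply the pre-processing steps (tag stripping, stop-word removal, stemming) discussed in Chapter 4.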
2.2.1 Basic Clustering Techniques
In this section, we present a brief overview of two fundamental clustering approaches:
partitional and hierarchical clustering.
2.2.1.1 Partitional Clustering
Partitional clustering attempts to directly decompose the document set into a set of
disjoint clusters. The clustering algorithm emphasizes either the local structure of the
data, e.g., by assigning clusters to peaks in the probability density function, or the global
structure. Typically, the global criteria involve minimizing some measure of dissimilarity
within each cluster while maximizing the dissimilarity between different clusters.
k-Means clustering is the most common type of partitional clustering and is widely used
in the field of document clustering.
k-Means clustering: The algorithm was first proposed by Stuart Lloyd in 1957 as a
technique for pulse-code modulation [76]. k-Means is based on the idea of a centroid
being a good representative of a cluster. In this process, k documents are randomly
selected and considered the initial centroids. The rest of the documents are assigned to
their nearest centroids to generate the clusters. Next, the centroid of each cluster is
recomputed. The process of document assignment and centroid computation is repeated
until the centroids generated in two consecutive iterations are the same. The advantage
of the k-Means algorithm is that it is simple and fast for low-dimensional data. The
computational complexity of k-Means is O(i.k.m.n), where i is the number of iterations,
k the number of clusters, m the number of features or attributes, and n the number of
data points [77]. As for space complexity, both the data points and the centroids must be
stored, so the space complexity is O((k+n)m), that is, O(n) [84]. In fact, both complexities
are quite low compared to other clustering techniques. However, the number of clusters
must be predefined. We may note that k-Means is sensitive to outliers; medoid-based
methods can eliminate this problem [123]. A limitation of both of these partitional
schemes is that they are sensitive to the initial centroids, and neither can handle
non-globular data of different sizes and densities [71].
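The assignment-and-recomputation loop just described can be sketched in a few lines. This is a generic illustration on 2-D points with Euclidean distance, not the thesis implementation (which clusters documents):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(cluster):
    """Centroid (component-wise mean) of a non-empty list of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def k_means(points, k, iterations=100):
    """Basic k-Means: random initial centroids, then alternate assignment
    and centroid recomputation until the centroids stop changing."""
    centroids = random.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new = [mean(c) if c else centroids[i]  # update step
               for i, c in enumerate(clusters)]
        if new == centroids:                   # converged
            break
        centroids = new
    return clusters, centroids

random.seed(0)                                 # fixed seed for a repeatable demo
groups, _ = k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(sorted(sorted(g) for g in groups))       # two tight clusters
```

The `if c else centroids[i]` guard keeps an empty cluster's old centroid, a common practical workaround the text does not specify.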
Bisecting k-Means: Bisecting k-Means is an enhancement of the basic k-Means algorithm
in which a cluster is split into two sub-clusters in iterative steps until the desired
number of clusters is found. We start with a single cluster containing all documents. Next,
we select a cluster to split based on some criterion, such as the largest cluster at each step,
the cluster with the least overall similarity, or both. From the selected cluster, two
sub-clusters are generated with the help of the basic k-Means algorithm. This step is
repeated several times, taking the split that produces the clusters with the highest overall
similarity [108]. The whole process is repeated until the desired number of clusters is
produced. We may note that bisecting k-Means tends to produce clusters of relatively
uniform size, while regular k-Means is known to produce clusters of widely different
sizes. The complexity of bisecting k-Means is linear in the number of documents.
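The bisecting loop can be sketched as follows. For a deterministic illustration, the 2-way split here seeds its centroids with the farthest pair of points instead of random documents, and the largest cluster is always the one selected for splitting (one of the criteria mentioned above); both choices are simplifying assumptions:

```python
def d2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def two_means(points, iterations=100):
    """Split `points` in two with 2-centroid k-Means, seeded by the farthest
    pair of points (a deterministic stand-in for random seeding)."""
    cents = list(max(((p, q) for p in points for q in points),
                     key=lambda pq: d2(*pq)))
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for p in points:                       # assign to the nearer centroid
            halves[0 if d2(p, cents[0]) <= d2(p, cents[1]) else 1].append(p)
        new = [(sum(x for x, _ in h) / len(h), sum(y for _, y in h) / len(h))
               if h else cents[i] for i, h in enumerate(halves)]
        if new == cents:                       # converged
            break
        cents = new
    return [h for h in halves if h]

def bisecting_k_means(points, target_k):
    """Repeatedly bisect the largest cluster until target_k clusters exist.
    Assumes the points are distinct, so every bisection actually splits."""
    clusters = [list(points)]
    while len(clusters) < target_k:
        largest = max(clusters, key=len)       # selection criterion: largest cluster
        clusters.remove(largest)
        clusters.extend(two_means(largest))
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
print(sorted(sorted(c) for c in bisecting_k_means(pts, 3)))  # the three natural pairs
```

Unlike the version in [108], this sketch takes a single split per step rather than keeping the best of several trial splits.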
Fuzzy c-Means clustering and the QT clustering algorithm are variations of k-Means
clustering. Some other types of partitional clustering are locality-sensitive hashing and
graph-theoretic methods. It has been observed that linear-time clustering algorithms are
the best candidates to comply with the speed requirement of online clustering; these
include k-Means, Single-Pass, Buckshot, and Fractionation. An experimental analysis
of six different clustering techniques (k-Means, Single-Pass, Fractionation, Buckshot,
Suffix Tree, and AprioriAll) is given by Sambasivam and Theodosopoulos [105]. k-Means
clustering via principal component analysis is proposed by Ding and He [33].
2.2.1.2 Hierarchical Clustering
As an alternative to partitional clustering, hierarchical clustering produces a nested
sequence of partitions. There are two approaches to hierarchical clustering: agglomerative
(bottom-up) and divisive (top-down). In agglomerative hierarchical clustering [77], each
document is initially considered an individual cluster, and in each iterative step the closest
clusters are merged together. In contrast, divisive hierarchical clustering [49] considers
all documents as a single cluster at the initial stage and iteratively divides a cluster into
two or more child clusters. The decision of which two clusters should be combined (for
agglomerative), or where a cluster should be split (for divisive), depends on the distance
metric and the linkage criterion. Euclidean distance, Manhattan distance, cosine similarity,
etc. are different distance metrics used for clustering. The distance between any two
clusters can be decided using different linkage criteria; three commonly used ones are
single-linkage, complete-linkage, and average-linkage.
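The three distance metrics and three linkage criteria just listed can be written down directly; this is a generic sketch, not tied to the thesis implementation:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """1 - cosine similarity, so that smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def linkage(cluster_a, cluster_b, metric=euclidean, criterion="single"):
    """Inter-cluster distance under single-, complete-, or average-linkage."""
    dists = [metric(u, v) for u in cluster_a for v in cluster_b]
    if criterion == "single":       # distance of the closest pair
        return min(dists)
    if criterion == "complete":     # distance of the farthest pair
        return max(dists)
    return sum(dists) / len(dists)  # average-linkage

a, b = [(0, 0), (0, 1)], [(3, 0), (4, 0)]
print(linkage(a, b, criterion="single"))    # closest pair: (0,0)-(3,0)
print(linkage(a, b, criterion="complete"))  # farthest pair: (0,1)-(4,0)
```

The choice of criterion changes which clusters merge first: single-linkage tends to chain elongated clusters, while complete-linkage favors compact ones.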
Hierarchical agglomerative clustering: Generally, in the field of document clustering,
hierarchical clustering implies the agglomerative approach [77]. Hierarchical agglomerative
clustering [108] has four basic steps: cluster initialization and distance-matrix
preparation; merging the two nearest clusters; updating the distance matrix; and repetition.
In this process, each individual document is initially considered an individual cluster.
The distance between each pair of clusters is represented by a distance matrix. Based on
the distance matrix, the two closest clusters are merged together. The distance matrix is
then updated to reflect the pairwise distances between the new cluster and the previous
ones. Any of the linkage criteria can be used for the matrix update. This process is
repeated until a single cluster is produced. Hierarchical clustering has some advantages
and limitations. In general, the complexity of agglomerative clustering is O(n^3), where n
is the number of documents [77]. Use of a priority queue can reduce it to O(n^2 log n) [77],
and in a special case (e.g., single linkage) the complexity is O(n^2) [77]. Hierarchical
agglomerative clustering uses a distance matrix to keep the intra-cluster distances along
with the generated clusters, so the space complexity becomes O(n^2) [84]. Though the
time and space complexities are quite high, the quality of the clusters produced by
hierarchical clustering is better than that of k-Means. Another advantage is that the
number of clusters can be decided at run time depending on a threshold value. A drawback
of hierarchical clustering is that if a point is misclassified once, it cannot be corrected in
later steps, whereas k-Means offers iterative improvement.
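The four steps above can be sketched naively. The illustration below uses single-linkage on 1-D points and recomputes pairwise distances each round instead of maintaining an explicit distance matrix, a simplification made to keep the sketch short:

```python
def agglomerative(points, num_clusters):
    """Naive single-linkage agglomerative clustering on 1-D points: start
    from singleton clusters and repeatedly merge the closest pair."""
    clusters = [[p] for p in points]              # initialization: singletons
    while len(clusters) > num_clusters:
        best = None                               # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                            # merge the two closest clusters
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

print(sorted(sorted(c) for c in agglomerative([0, 1, 10, 11, 12], 2)))
```

Stopping when a threshold distance is exceeded, rather than at a fixed `num_clusters`, gives the run-time cluster-count decision mentioned above.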
Hierarchical divisive clustering: The hierarchical divisive clustering approach^1 consists
of four different steps: cluster initialization; selection of a cluster to split; splitting
the chosen cluster and replacing it with the newly generated sub-clusters; and repetition.
At the initial stage, all the documents are placed in a single cluster. Generally, the cluster
with the maximum diameter is chosen for splitting, where the diameter of a cluster is
computed as the largest dissimilarity between a pair of documents in that particular cluster.
There are different partitioning criteria [49], such as cut-based measures, enumerative
ones (e.g., the graph-coloring algorithm of Hansen and Delattre for minimum-diameter
partitioning), or cutting-plane ones (e.g., the branch-and-cut method of Grotschel and
Wakabayashi for clique partitioning). For example, with a cut-based measure a cluster is
split in such a way that the cost of the clustering (cutcost / intracosts) is minimized. Next,
the old cluster is replaced by the newly generated sub-clusters. This process is repeated
until every cluster comprises a single document. Hierarchical divisive clustering is
conceptually more complex than agglomerative clustering, since we need a second algorithm
as a "subroutine". In this approach, there are 2^(n-1) - 1 possibilities for splitting the
documents into two clusters (n being the number of documents in the cluster), which is
considerably larger than in the case of an agglomerative method^2. Variations of the
algorithm can reduce the number of splitting possibilities. In general, the complexity of
divisive clustering with an exhaustive search
^1 Clustering Algorithms: Divisive hierarchical and flat, http://www.cs.princeton.edu/courses/archive/spr08/cos435/Class_notes/clustering4.pdf
^2 Divisive Analysis (Diana), http://www.unesco.org/webworld/idams/advguide/Chapt7_1_5.htm
is O(2^n)^1. The space complexity is the same as for the agglomerative approach, that is,
O(n^2) [84]. Some variations of hierarchical clustering are Birch, Cure, and Chameleon [123].
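The divisive loop, with the maximum-diameter rule for choosing which cluster to split, can be sketched on 1-D data. The split rule used here (seed with the two extreme points and assign each point to the nearer seed) is a deliberate simplification of the enumerative and cut-based criteria discussed above:

```python
def diameter(cluster):
    """Largest pairwise dissimilarity; absolute difference on 1-D data."""
    return max(abs(a - b) for a in cluster for b in cluster)

def divisive(points, num_clusters):
    """Top-down clustering: keep splitting the widest cluster until
    num_clusters remain. Assumes distinct 1-D values."""
    clusters = [list(points)]                     # initialization: one big cluster
    while len(clusters) < num_clusters:
        widest = max(clusters, key=diameter)      # pick the cluster to split
        lo, hi = min(widest), max(widest)         # the two most dissimilar points
        left = [p for p in widest if abs(p - lo) <= abs(p - hi)]
        right = [p for p in widest if abs(p - lo) > abs(p - hi)]
        clusters.remove(widest)                   # replace it with its sub-clusters
        clusters.extend([left, right])
    return clusters

print(sorted(sorted(c) for c in divisive([0, 1, 10, 11, 12], 3)))
```

Because each step considers only one seeded split rather than all 2^(n-1) - 1 bipartitions, this sketch avoids the exhaustive-search cost at the price of possibly suboptimal splits.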
An efficient and effective algorithm for hierarchical classification of search results has
been proposed to produce hierarchically organized search results [56, 60]. Rather than
using a clustering technique, this approach employs a domain ontology in order to obtain
a better hierarchical classification [28, 60]. In [46], cluster labelling is achieved by
combining intra-cluster and inter-cluster term extraction based on a variant of the IG
measure. Semantic, hierarchical, online clustering of Web search results (SHOC) is
proposed by Zhang and Dong, using key phrases as natural-language information features
and a suffix array for key-phrase discovery [127]. In WISE (hierarchical soft clustering of
Web page search results based on Web content-mining techniques), documents are
represented by their most relevant key concepts [23].
2.2.2 Other Search Result Clustering Algorithms
According to [24], clustering algorithms can be classified into three categories: data-
centric, description-aware, and description-centric. Data-centric algorithms include con-
ventional clustering algorithms such as hierarchical, optimization and spectral ones. Some
examples of data-centric algorithms can be found in systems such as Lassi, WebCat,
AIsearch, Scatter/Gather and TRSC. The problem with data-centric algorithms is cluster
label description. Description-aware algorithms emphasize cluster label description. Suffix
Tree Clustering (STC) - Grouper, HSTC and SnakeT are examples of description-aware
algorithms. Description-centric algorithms are designed specifically for clustering search
results and take into account both the quality of the clustering and of the descriptions.
Vivisimo, Accumo, Clusterizer, Carrot Search, SRC, DisCover and the CREDO system [24]
are commercial search result clustering systems of this type.
Scatter-Gather: The earliest work on clustering search results was done by Pedersen,
Hearst et al. on the Scatter/Gather system [51]. The algorithms used in the Scatter/Gather
approach are Buckshot and Fractionation. Fractionation is considered more accurate, while
Buckshot is much faster, making it more suitable for searching in real time on the Web
(Cutting et al.). A divide-and-merge methodology for clustering [29] is proposed which
combines a top-down "divide" phase with a bottom-up "merge" phase. A two-pass approach
based on multilevel graph partitioning is introduced in [123], which includes a bottom-up
cluster-merging phase and a top-down refinement phase.
1. Hierarchical clustering - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Hierarchical_clustering
Suffix Tree Clustering: Clustering algorithms generally consider Web documents
as bags of words or as abstract concepts. Suffix tree clustering (STC) instead treats each
document as a collection of strings or phrases. Based on STC, Grouper was developed to
cluster Web search results labelled by phrases extracted from the snippets [125, 126]. STC
has also been implemented for the Polish language, with its complex inflection and
syntax [118]. For Chinese Web search results, Wang et al. proposed an interactive suffix
tree algorithm [117]. An improved suffix tree data structure offering a new base-cluster
combining algorithm with a new partial phrase join operation is proposed in [53] to
overcome the inadequacy of generating interrupted cluster labels due to the use of the
n-gram technique.
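As an illustrative sketch (assumed for exposition, not taken from the cited systems), the base-cluster step of STC can be approximated by treating every word n-gram shared by at least two snippets as a phrase that defines a base cluster of the documents containing it:

```python
# Assumed sketch of STC's base-cluster step: each phrase (word n-gram)
# shared by two or more snippets defines a base cluster.

def phrases(text, max_len=3):
    # Yield all word n-grams of length 1..max_len.
    words = text.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def base_clusters(snippets, max_len=3):
    by_phrase = {}
    for doc_id, text in enumerate(snippets):
        for p in set(phrases(text, max_len)):
            by_phrase.setdefault(p, set()).add(doc_id)
    # Keep only phrases shared by at least two documents.
    return {p: docs for p, docs in by_phrase.items() if len(docs) > 1}
```

A real STC implementation builds these base clusters from a generalized suffix tree in linear time and then merges overlapping base clusters; the dictionary above only illustrates the grouping criterion.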
Lingo: In the field of Web search result clustering, Lingo, a description-oriented
algorithm, is a breakthrough. The key idea of this method is to first discover meaningful
cluster labels and then, based on the labels, determine the actual content of the groups.
Cluster label discovery is accomplished with the use of the Latent Semantic Indexing
(LSI) [32] technique. Lingo basically consists of five main phases: preprocessing, feature
extraction, cluster label induction, cluster content discovery and final cluster formation.
The preprocessing phase converts the raw data into filtered form. In the second
phase, a suffix array is used to discover phrases and terms which are potentially capable of
explaining the verbal meaning behind the LSI-found abstract concepts. In the next phase,
cluster labels are finalised based on the singular value decomposition (SVD) of the
term-document matrix. Then documents are assigned to the proper cluster labels in the
cluster content discovery phase. Finally, cluster scores are calculated so that clusters can
be presented in the interface according to their score. Lingo produces better clusters than
Suffix Tree Clustering (STC), another benchmark algorithm used for Web search result
clustering [92]. The time complexity of Lingo is O(n^3), and its high number of matrix
transformations leads to larger memory requirements [94]. In spite of its high computational
complexity and space requirements, Lingo is well accepted for clustering search results, as it
produces better quality clusters by taking a semantic approach. [81] considers the whole
document for LSI and proposes dynamic SVD clustering to discover the optimal number of
singular values. [127] extends Lingo using WordNet and by adding semantic recognition to
the frequent extracted phrases. Syntactic clustering of the Web is also proposed by Broder
et al. [20].
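A much-simplified sketch of the label-induction idea follows. It is an assumed illustration: a full Lingo implementation performs a complete SVD of a weighted term-document matrix and matches phrase vectors against several abstract concepts, whereas here we only approximate the dominant left singular vector by power iteration and pick the single term that best explains that one abstract concept.

```python
# Assumed sketch of concept discovery behind Lingo's label induction:
# power iteration on M M^T approximates the dominant left singular
# vector of the term-document matrix M; the strongest term in that
# vector stands in for a cluster label.

def dominant_term(terms, matrix, iters=50):
    # matrix: one row per term (aligned with `terms`), one column per document.
    m = len(matrix)
    cols = len(matrix[0])
    v = [1.0] * m
    for _ in range(iters):
        # w = M (M^T v), i.e. one power-iteration step on M M^T.
        mt_v = [sum(matrix[i][j] * v[i] for i in range(m)) for j in range(cols)]
        w = [sum(matrix[i][j] * mt_v[j] for j in range(cols)) for i in range(m)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]
    # The term with the largest weight best "explains" the abstract concept.
    return terms[max(range(m), key=lambda i: abs(v[i]))]
```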
He et al. introduced Web document clustering using hyperlink structures [50]. Content
as well as link information is considered to improve information interpretation for
search result clustering [25, 88, 116]. A phrase-based method using hierarchical clustering
of Web snippets is proposed where documents are clustered according to phrase
similarity [72, 78]. [122] proposed a combination of query-based and phrase-based approaches.
RFCMdd, a robust fuzzy algorithm efficient at handling noise, is proposed in [54] based
on n-gram and vector space methods. Snippets remain unrelated because of their short
representation. In the Vector Space Model, it has been noticed that a single document is
usually represented by relatively few terms. A method of Web search result clustering
based on rough sets is proposed by Ngo and Nguyen, where tolerance classes are used to
approximate concepts existing in documents and to enrich the vector representation of
snippets to increase clustering performance [91].
Another clustering algorithm based on formal concept analysis is proposed to build
a two-level hierarchy for retrieved search results [129]. Two improved objective metrics
ANMI@K and ANCE@K are introduced to measure cluster quality. Similar work on
retrieval and clustering of Web resources based on pedagogical objectives is introduced
by Mayorga et al. [80].
A mechanism combining ranking and clustering is proposed in [35], which provides
ordered results in the form of clusters in accordance with the user's query. Term ranking
for clustering Web search results shows a different way of ranking terms using a variation
of the PageRank algorithm based on a relational graph representation [44].
Scuba Driver is proposed to help users better interpret the coverage of millions of
search results and to refine their search queries through a keyword-guided interface [45].
A recent work in [14] aims to overcome the mismatch between the information most
wanted and the information retrieved.
Some other clustering approaches have been proposed using a transduction-based relevance
model [79, 120], a label language model [68], temporal information [5, 22], approximate
matrix factorisation [93], query term expansion [112], topological trees [42], randomized
partitioning of query-induced sub-graphs [19], heuristic search in the Web graph [13],
compression [18], and word sense communities in the extracted keyword network [27, 90],
as well as density-based (GDBSCAN), grid-based (OptiGrid) and probabilistic approaches.
[111], based on c-Means fuzzy clustering, gives the idea of cluster visualization.
Commercial search engines: The systems that perform clustering of Web search
results, also known as clustering engines, have become popular in recent years. The first
commercial clustering engine was probably Northern Light, in 1996. It was based on a
predefined set of categories to which the search results were assigned. A major break-
through was then made by Vivisimo [24], whose clusters and cluster labels were dynam-
ically generated from the search results. In recent times, several commercial clustering
engines have been launched in the market [24, 56], namely Grouper (1997) [126], WISE
(2002) [23], Carrot (2003) [92], WebCat (2003) [47], AISearch (2004)1, SnakeT (2005)2,
Quintura (2005)3, WebClust (2006)4, YIPPY (2009)5, etc. [37, 96] provide metrics
for quantitative comparison of cluster quality using external and internal measures.
2.3 Question Answering Techniques
In order to find out the answer to a particular question from a fixed database or free text,
question answering systems have been proposed. They mainly handle the information
overload problem. Here, the target is to obtain a specific answer rather than an entire
document or best-matching paragraph. Questions can be broadly classified into two types:
fact-based questions (who, when, where, which, etc., the `wh' questions) are related to
named-entity-type answers, while harder questions (why and how questions) are related to
explanatory-type answers. The information maintained in the resource can be structured,
like database information, or unstructured, like free text. To extract answers from a
structured information source, the knowledge annotation technique is used, which relies on
syntactic parsing. On the other hand, for an unstructured information resource, knowledge
mining is required, which uses statistical tools. The existing Web repository is a mix of
structured, unstructured and semi-structured information.
In general, question-answering systems consist of three main components: question
classification, answer retrieval and answer extraction. In question classification, the user's
questions are classified to derive expected answer types. From the answer patterns, key-
words are extracted to reformulate the question into semantically equivalent multiple
questions, which is also known as query expansion. It boosts the recall of the information
retrieval system. In the answer retrieval phase, the probable Web pages or paragraphs
containing the answer are retrieved. In the final answer extraction phase, probable candidate
answers are identified and the most suitable answer is selected by ranking.
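The question-classification step above can be sketched with simple rules. This is a hedged illustration, not any cited system: the answer-type table, the stopword set and the function names are assumptions chosen only to show how a leading question word maps to an expected answer type and how retrieval keywords are extracted.

```python
# Assumed rule-based sketch of question classification: map the leading
# question word to an expected answer type and strip it (plus stopwords)
# to obtain retrieval keywords.

ANSWER_TYPE = {
    "who": "PERSON", "when": "DATE", "where": "LOCATION",
    "which": "ENTITY", "what": "ENTITY",
    "why": "EXPLANATION", "how": "EXPLANATION",
}
STOPWORDS = {"is", "the", "a", "an", "of", "in", "to", "was", "did"}

def classify(question):
    words = question.lower().rstrip("?").split()
    answer_type = ANSWER_TYPE.get(words[0], "ENTITY")
    keywords = [w for w in words[1:] if w not in STOPWORDS]
    return answer_type, keywords
```

Real systems use richer taxonomies and learned classifiers; the point here is only the pipeline shape: classify, then pass the keywords to retrieval, then extract and rank candidates of the expected type.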
Annotation of a knowledge base in order to obtain answers with the help of natural
language processing is proposed by Katz [58]. The developed system consists of two
modules, namely an understanding module and a generating module. The understanding
module analyzes English text and produces a knowledge base with the help of ternary
expressions and S-rules, which incorporates the information found in the text. The questions
are also analyzed using ternary expressions and matched with the knowledge base. The
1. AI Search Engine From MIT, http://www.netpaths.net/blog/ai-search-engine-from-mit
2. Search SnakeT Clustering Engine, Meta Search Cluster MetaSearch, http://snaket.di.unipi.it
3. Quintura - visual search engine, http://www.quintura.com
4. WebClust - Clustering Search Engine, http://www.webclust.com
5. Yippy Clustering Search Engine - iTools, http://itools.com/tool/yippy-web-search
generating module produces English sentences from a given segment of the knowledge base.
The idea of using database techniques for the World Wide Web is not new. Some sys-
tems like Araneus, Ariadne and Tsimmis have attempted to integrate heterogeneous
Web sources under a common interface. Unfortunately, queries to such systems must
be formulated in SQL, Datalog, or some similar formal language, which renders them
inaccessible to the average user. Along with this, the unstructured, random collection of
large amounts of Web information and its rapid growth posed a challenge to this proposal.
Later, Katz proposed knowledge mining of Web information and integrated it with the
corpus-based knowledge annotation technique in order to achieve better performance [59, 74].
Knowledge mining takes advantage of massive amounts of Web data to overcome
many natural language processing challenges. This technique follows eight sub-modules:
formulate requests, execute requests, generate n-grams, vote, filter candidates, combine
candidates, score candidates and get support. An early question answering system,
Mulder [65], attempted to perform sophisticated linguistic analysis on both
questions and potential answer candidates. As a result, it did not take advantage of
data redundancy. Similar work on answer extraction has been done in the Shapqa [21]
system. In contrast, AskMSR [8] embraced data redundancy and applied extremely simple
word-counting techniques to Web data. Another system, Aranea [74], followed a modular
architecture that also serves as a testbed for a variety of knowledge mining techniques.
MultiText [30] employed a different approach: instead of using the Web directly to
answer questions, it treated the Web as an auxiliary corpus to validate candidate answers
extracted from a primary, more authoritative, corpus. Question answering by searching
large corpora with linguistic methods is proposed in [55]. The work developed a
rephrasing algorithm based on linguistic patterns that describe the structure of ques-
tions and candidate sentences, and where precisely to find the answer in the candidate
sentences. A way of handling the unstructured as well as the structured Web is suggested
by Cucerzan and Agichtein [31]. For unstructured content, novel features such as Web
snippet pattern matching and generic answer type matching using Web counts are used.
For structured content, an approach that uses information from the millions of tables
and lists that abound on the Web is experimented with. In a different way, Radev et al.
used a probabilistic approach [98] to answer the query. This approach focused on developing
a unified framework that not only uses multiple resources for validating answer candi-
dates, but also considers evidence of similarity among answer candidates in order to
boost the ranking of the correct answer. This framework has been used to select answers
from candidates generated by four different answer extraction methods. The approach
of combining syntactic information with traditional question answering can be found in
the Quaero [109] system. This system performed better for complex questions,
i.e. `how' and `why' questions, which are more representative of real user needs. Some
other established question answering systems are Ionaut1 (AT&T Research), InsightSoft-
M (Moscow, Russia) [107], MultiText (University of Waterloo) [30], TextMap (USC/ISI)
[36], LAMP (National University of Singapore) [128], NSIR (University of Michigan) [99],
AnswerBus (University of Michigan) [131], etc.
2.4 Summary
In this chapter, we have discussed various approaches related to icon-based interfaces,
clustering methodologies and question answering techniques. Using icons as an interaction
medium may lead to major ambiguity in understanding. But from the literature survey
it is quite clear that icons play a vital role in communication in the absence of a common
interaction medium. All applications related to icon-based interfaces consider icons as
a way to give input to a system, but representation of knowledge by means of icons is
not addressed anywhere. CD-Icon, Sanyog, crisis management, etc. emphasized natural
language generation from icon selection. But a query fired at a search engine is not a
well-formed sentence in general; along with that, a search engine takes care of the query
phrase on its own. Therefore, icon sequence to natural language generation is not our
primary target. The proposed interfaces (Elephant's Memory, Clicker, hotel booking,
AAC systems, crisis management, etc.) did not provide any guideline about how they
were developed. How to decide on the icons and how to organize them in the interface
is not reported anywhere. So, we build a new framework to fulfill our requirements as
well as provide a general guideline for interface developers. Using the interface a user can
generate an icon-based input query to search the Internet, and the retrieved information
is reconverted into iconic form for user understandability.
Our second target is to cluster Web search results. We have discussed several clustering
algorithms, including fundamental clustering algorithms as well as benchmark
clustering techniques specially applied to Web search result clustering. We noticed that
each technique has its own merits and demerits. From the point of view of time and space
complexity, the k-Means algorithm is unbeatable. But it is sensitive to outliers, requires a
predefined cluster number, and the generated clusters are non-definite. In contrast, clusters
produced by the hierarchical approach are definite and better in quality, and no predefined
cluster number is required. But the time and space complexity of both the agglomerative
and divisive approaches is quite high. Among all other clustering algorithms, most of the
1. Ionaut, www.ionaut.com/
clustering techniques (STC, Lingo, etc.) consider the search result snippet as input to
achieve faster response time, but this affects the quality of the clusters, as a snippet is not
always a good representative of a Web page. Our target is to obtain better clusters so that
the information extraction part becomes easier. Therefore, we need a clustering technique
that provides good quality clusters in less response time.
Finally, we need to extract the necessary information from the clustered Web documents.
We follow a way similar to how question answering systems work. Unlike a general search
result retrieval system, a question answering system finds out the answer to a particular
question present in a structured or unstructured resource. So, we can say the range of
information retrieved by a question answering system is very narrow and the scope of a
general Web searching system is very wide. We need a system that performs in between:
it should not extract so much information that it confuses the target user, nor provide a
very narrow range of information. Along with this, we need to represent the extracted
information in terms of icons. We have to be conscious while selecting the icons for
information representation, as a word may imply different meanings in different contexts.
Overall, it is clear that the existing approaches to icon-based interfaces, clustering of
search results and question-answering techniques cannot fulfill our requirements. In order
to reach our target, we need to develop an icon-based interface. With the help of the
interface, a user will be able to compose an iconic query, and the equivalent English query
is automatically fed into the search engine. Next, we need a clustering approach which can
produce good quality clusters from search results in reasonable response time. From the
clustered search results, we shall find out a cluster which may contain query-related
information. All the information present in that cluster would be an overburden for our
target user. Therefore, the selected cluster will be mined to find out some query-related
specific information. Only those specific answers will be conveyed to the target user in
terms of icons.
Chapter 3
Development of Icon-Based Interface
In this chapter, we discuss the development of an icon-based interface using which our
target user can pose a query by means of icon selection and feed it to the Google search
engine. We selected icons as the mode of interaction as they are language independent and
understandable to our target user. In developing the icon-based interface, we faced
several challenges. Issues like icon selection for the interface, icon ordering, icon
optimization and building an icon vocabulary are not addressed in earlier iconic-interface
related works. The major issues addressed in our work are deciding the domain and the
domain-related queries, maintaining the icon vocabulary, and organizing the icons in the
interface.
3.1 Introduction
To develop the icon-based interface, the main deciding factors concern the icons; that
is, how many and which icons we should keep in the interface, and in which order. Since
an innumerable number of queries is possible in general and one or more icons are
needed to represent each query, it is beyond the scope of this work to plan such a vast icon
vocabulary. Therefore, we need to decide on a domain for which we will seek information
from the Internet. Along with this, we need to decide on the important domain-related
queries and the corresponding icons to support those queries. As we target unprivileged
users, we have to be conscious about the icon selection and the user friendliness of the
interface. A huge number of icons may confuse our target users. On the other hand, too
few icons may be insufficient to cover all domain-related queries. We need to balance these
two factors. The selected icons need to be maintained in an efficient way so that placing
them in the interface at run time is easier. Again, the presence of all icons in the primary
interface increases the cognitive load on users in selecting their desired icons. We have to be
careful about the icons' positioning in the interface so that icon navigation as well as the
query generation task become easier. Therefore, in order to develop the icon-based interface
we need to address issues like deciding the icon vocabulary and domain-related queries,
maintaining a large icon database, icon organization, etc.
To our knowledge, most of the previous works that deal with icon-based
interfaces emphasized other issues, like the proper icon characteristics for an interface
[10, 83, 106, 115], iconic sequence to sentence generation [11, 38, 124], next-icon prediction
to meet size limitations [38, 67], icon sense disambiguation [1], etc. But the issues we
highlight are not addressed anywhere. In order to develop the interface, we first chose
the tourism domain as the domain of interest. Next, we find domain-related important
words and queries using Google AdWords1. To build the selected queries we need a proper
icon database. We generate the icon database and maintain an index to support easy
retrieval of icons when needed. Not all selected icons can be placed in the primary
interface; therefore, we use a hierarchy to place the icons. Icons are placed in such a way
that the navigation time is minimized. The architecture of the proposed approach is shown
in Figure 3.1.
Figure 3.1: An overview of our approach. It comprises three parts: deciding the icon
vocabulary and domain-related queries (word collection, term importance, term ranking,
term selection, query selection), maintaining a large icon repository (indexing, icon
properties), and icon organization (hierarchy, icon groups).
This chapter consists of four sections. The proposed approach is discussed in
Section 3.2. Section 3.3 presents the details of the experiments conducted, followed by
the experimental results. Finally, Section 3.4 concludes with a summary.
3.2 Proposed Approach
Our approach consists of deciding the icon vocabulary and domain-related queries,
maintaining a large icon repository, and icon organization. All these tasks are discussed in
the following subsections.
1. Google AdWords, https://adwords.google.com/o/KeywordTool
3.2.1 Deciding Icon Vocabulary and Domain Related Queries
Towards the design of the icon-based interface, our first task is to decide on basic
tourism-related words and corresponding icons, and then put them in the interface in a
proper way. Redundancy of similar kinds of icons is avoided as it opposes icon optimization.
To build the tourism corpus, basic tourism-related words were collected from different
tourism-related magazines and websites. Stopwords1 (e.g. about, the, in, etc.) are frequent
in every domain but they are not important. So, before calculating the weight of each word,
571 stopwords are filtered out from the tourism corpus. Again, a word can occur in the
corpus in different morphological forms (e.g. transport, transports, transported,
transporting, transportation, etc.) whose stem or root is the same. A single icon is
sufficient to represent all these morphological forms. So, the stem form of each word is
found and all words of different morphological forms are mapped to their stem words. Then
the unique words of the corpus are identified.
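The corpus-reduction steps above (stopword filtering, stemming, collecting unique stems) can be sketched as follows. The tiny suffix-stripping stemmer and the miniature stopword set here are stand-ins, assumed purely for illustration; the actual work uses a 571-word stop list and a proper stemmer.

```python
# Assumed sketch of the corpus-reduction pipeline: filter stopwords, map
# each word to a stem, and collect the unique stems.
STOPWORDS = {"about", "the", "in", "of", "to", "and", "a"}  # stand-in stop list

def stem(word):
    # Toy suffix stripper standing in for a real stemmer (e.g. Porter).
    for suffix in ("ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def unique_stems(corpus_words):
    kept = [w.lower() for w in corpus_words if w.lower() not in STOPWORDS]
    return sorted({stem(w) for w in kept})
```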
The importance of a term increases proportionally with the number of times the word
appears in a document, but is offset by how common the word is across the corpus. The
term count in a given document is simply the number of times the given term appears in
that document. This count is generally normalized to prevent a bias toward longer
documents (which may have a higher term count regardless of the actual importance of the
term in the document) and to give a measure of the importance of the term ti within the
particular document dj. Thus, we have the term frequency as defined in Equation 3.1:

    tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}                    (3.1)

where n_{i,j} is the number of occurrences of the considered term t_i in document d_j and
the denominator is the sum of the numbers of occurrences of all terms in document d_j.
The inverse document frequency is a measure of the general importance of the term
(obtained by dividing the total number of documents by the number of documents
containing the term, and taking the logarithm of that quotient), as shown in Equation 3.2:

    idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}                    (3.2)

where |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number
of documents in which the term t_i appears (that is, n_{i,j} ≠ 0). Then

    tf\text{-}idf_{i,j} = tf_{i,j} \times idf_i                    (3.3)
1. http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
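The tf-idf weighting can be sketched directly from these definitions (a minimal illustration over tokenized documents, using the standard logarithmic inverse document frequency):

```python
# Minimal sketch of tf-idf over tokenized documents.
import math

def tf(term, doc):
    # Equation 3.1: occurrences of the term, normalized by document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Equation 3.2: log of total documents over documents containing the term.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    # Equation 3.3: product of the two measures.
    return tf(term, doc) * idf(term, corpus)
```

A term that appears often in one document but in few documents overall scores highest, which is exactly the behaviour used below to rank and shortlist the domain terms.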
The tf-idf value of a term is always greater than or equal to zero. The term ranks
are decided by tf-idf weighting. After calculating the tf-idf weight of each term, the terms
are sorted on average tf-idf to finalize the top 500 terms. Among these 500 words
we point out the domain-related important words, and the frequent queries fired at Google
using those keywords are collected. Some supporting query words are added to the
main keyword list to complete the query phrases. Table 3.1 summarises the icon vocabulary
building. The initial tourism corpus contains 169610 words related to the tourism domain.
Table 3.1: Domain related word extraction
Corpus size Reduced corpus size Unique words Final words
169610 139052 30588 453
After filtering out the stopwords, the corpus size is reduced to 139052 words. After
stemming, 30588 words are identified as unique words. The average tf-idf values of these
unique words are calculated and ranked in decreasing order. Out of the top 500 words, 316
words are selected as tourism-related important words, as some are implicit in queries (e.g.
the word "list" is implicit in the query "hotel list of Kolkata") and some are synonymous.
These words are used to extract frequent queries fired at Google in the tourism domain
using Google AdWords1. We select 50 queries as tourism-related important queries, as
shown in Table 3.2. 127 supporting words are added to the corpus to form the queries. At
present the interface contains 235 icons. We collected these icons from different websites.
While selecting, we preferred bi-colour icons which resemble the actual word-object. In
some cases the concept is represented where it is difficult to represent a real object. Some
icons were modified according to target users' feedback as needed. We follow the
consistency rule for icon characteristics like colour, dimension, etc. In fact, designing
proper icons for the tourism domain is beyond the scope of this work.
3.2.2 Maintaining Large Icon Repository
Presently 235 icons are used in the interface. To manage the icon database, XML doc-
ument is used as an index to support easy icon retrieval. Di�erent icon attributes like
corresponding keyword, synonyms, semantically similar words, word sense, storage loca-
tion (path), position of that icon in the hierarchy are maintained in the XML �le. For
query generation when user selects icon from interface, icon corresponding keyword is
fetched from vocabulary to generate equivalent English query. On the other hand, when
we search a word corresponding icon from icon vocabulary for display we �rst match
1. Google AdWords, https://adwords.google.com/o/KeywordTool
Table 3.2: Tourism related benchmark queries

1. Nearest station <place>                  2. Sea beach in south India
3. Distance <place> and <place>             4. Single room hotel rent in <place>
5. Festivals in <place>                     6. Photo gallery of <place>
7. Hotel <place>                            8. Climate of <place> in time
9. Transport <place>                        10. <place> zoo
11. People culture <place>                  12. Buddha stupa in <place>
13. Food <place>                            14. Shiva temple in <place>
15. Map <place>                             16. <place> lake
17. Weather <place>                         18. Botanical garden in <place>
19. Fair in rural India                     20. Church in <place>
21. Five star hotel in <place>              22. Forest in <place>
23. Tour package of <place>                 24. <place> border
25. Route from <place> to <place>           26. Train between <place> and <place>
27. Adventure sports in <place>             28. Hindu pilgrim in <place>
29. Beaches in <place>                      30. Train ticket reservation <place> to <place>
31. Wildlife in <place>                     32. Royal museum in <place>
33. Hospital in <place>                     34. Reservation status in <place> hotel
35. Five day tour plan <place>              36. Temperature of <place> in <month>
37. Best season to visit <place>            38. Stadium in <place>
39. <place> college                         40. Low budget <place> and <place> travel
41. Market in <place>                       42. Village craft fair in <place>
43. Migratory bird watching <place>         44. Arrival time of <place> and <place> plane
45. Fort in <place>                         46. Sunset in <place> beach
47. Fruit in <place>                        48. Tribal festival in <place>
49. Orchid of north <place>                 50. <place> and <place> flight ticket price
the keyword and find the most suitable sense according to the context. The icon
corresponding to the most appropriate sense is selected for display. If the keyword is not
found in the vocabulary, then we look for synonyms, followed by semantically similar
words. The details of icon selection for display are explained in Section 5.2.3. The
structure of the XML file is shown in Figure 3.2. The part of the XML file shown in the
figure contains the `town' icon's properties. It contains the parent icon, named `pname', to
declare the hierarchical position in the interface; it tells us that the icon `town' is under the
`where' icon. `Collection' indicates whether the icon has a further hierarchy or not: `2'
implies a further hierarchy and `1' implies no further hierarchy. `Keyword' declares the
icon's corresponding text. Synonymous words are declared within `synonyms'. Next, we
declare the different senses of the keyword. To get the synonyms and the different senses
of the keyword, we take the help of the WordNet database [87] of Princeton University.
Semantically similar words are declared within `seman'. The images corresponding to the
senses are given in `imagepath'.
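The lookup order described above (keyword first, then synonyms, then semantically similar words) can be sketched over such an index. The element and attribute names in this fragment are assumptions modelled on the description, not the thesis's actual XML schema:

```python
# Hypothetical sketch of the icon-index lookup order: match the keyword
# first, then synonyms, then semantically similar words. Element names
# (icon, keyword, synonyms, seman, imagepath) are assumed for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """
<icons>
  <icon pname="where" collection="2">
    <keyword>town</keyword>
    <synonyms>city township</synonyms>
    <seman>village suburb</seman>
    <imagepath>icons/town.png</imagepath>
  </icon>
</icons>
"""

def find_icon(root, word):
    word = word.lower()
    for field in ("keyword", "synonyms", "seman"):   # lookup priority order
        for icon in root.iter("icon"):
            node = icon.find(field)
            if node is not None and word in node.text.lower().split():
                return icon.findtext("imagepath")
    return None  # no icon found for this word

root = ET.fromstring(SAMPLE)
```

Scanning field by field (rather than icon by icon) guarantees that an exact keyword match anywhere in the index always wins over a synonym match.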
Figure 3.2: A snapshot of the XML file
3.2.3 Icon Organization
All the icons cannot be kept in the default interface, so a hierarchy is maintained to
organize them. The selected icons are categorized under 5 basic `wh' icons: `what',
`when', `where', `who' and `how'. Icons related to any place are kept under `where'.
Time, person and verb related icons are kept under `when', `who' and `how' respectively.
The rest of the icons are placed under `what'. Qualifying types of icons are put separately
in the interface. Along with this basic categorization, grouping and sub-grouping are also
followed (e.g. all types of vehicles are kept under the main transport icon). Similar types
of icons are placed together. Presently, the interface maintains a 4-level hierarchy (e.g.
travel → transport → train → ac-2-tier). A snapshot of the interface is shown in Figure 3.3.
The interface contains 5 icon groups and satisfies Miller's rule of thumb: magic seven,
plus or minus two [86]. Icon group 1 contains functional icons (backspace, write back,
erase, search) and a display panel to show the icons selected by the user. This offers easy
reversal of actions to the user. Icon group 2 contains `help' and `feedback' icons. The
basic `wh' icons (`what', `when', `where', `who' and `how') are kept in icon group 3. Using
a `wh' icon, the user can see the basic icons related to that particular `wh'. To go further
in the hierarchy, icon group 4 is helpful. The hierarchy icon enables the user to find icons
which are kept in hierarchical order. As an example, the transport icon can be found
3.2. Proposed Approach
Figure 3.3: Icon-based interface. Icon group 1: icon display panel with undo, redo and clear; icon group 2: help and feedback; icon group 3: basic wh buttons (what, where, when, who and how); icon group 4: hierarchy; icon group 5: icons for selection.
by enabling the hierarchy icon and selecting the travel icon. To reduce short-term memory load, the hierarchical path that a user follows is displayed in icon group 4; this also allows the user easy state (hierarchy) transitions. Icon group 5 is the display area of all `wh' icons, and the user can select any icon displayed in this area with a single click. Informative feedback is provided on mouse hover or click by highlighting and displaying an enlarged icon for clarity. Using this interface, users are able to generate a travel-related query and feed it to the search engine using the `search' icon. The Google search engine is used in this application for searching.
3.2.4 Developed Interface
Using the developed interface, the target user can frame tourism-related queries. The formation of an example query ("weather of Kolkata in December") is shown in Figure 3.4. To construct this query the user has to select three icons: weather, Kolkata and December. The `weather' icon is in the primary interface; the user can select it with a single click. The selected icon is then displayed in the display panel of icon group 1. The next target icon, `Kolkata', is a place-related icon, so we select the `where' icon from icon group 3. In the icon panel we find the icon of India. Since Kolkata is a city of India, the icon of `Kolkata' must be under the `India' icon. We enable the `hierarchy'
Figure 3.4: Example of query generation. (a) Selection of `weather' icon. (b) Display of `weather' icon. (c) Move to `where' hierarchy.
Figure 3.4 (continued): Example of query generation. (d) Enable hierarchy and select `India' icon. (e) Disable hierarchy and select `Kolkata' icon. (f) Move to `when' hierarchy.
Figure 3.4 (continued): Example of query generation. (g) Enable hierarchy and select `month' icon. (h) Disable hierarchy and select `December' icon. (i) Query completion.
option and select the India icon. We can �nd the icon of `Kolkata' in the icon panel.
We disable the hierarchy and select the `Kolkata' icon by single click. The change will
be re�ected in display panel. The last icon needs to be selected is `December' which is a
time related icon. So, we move to `when' icon panel. There we can �nd the `month' icon.
In a similar way discussed earlier we go to the next hierarchy and �nd the `December'
icon. By selecting `December' icon we complete the query generation. The `search' icon
will feed the generated query into Google search engine.
3.3 Experiments and Experimental Results
To substantiate the efficiency of the developed interface, we have conducted a few experiments. The judgment is made with respect to icon recognizability and efficiency in query formation, based on user evaluation.
3.3.1 Experimental Setup and User Details
All experiments are carried out in a Windows environment (Windows 7) on an Intel Core 2 Duo (2.0 GHz) processor with 2.0 GB memory. The proposed approach is implemented in C# on the .NET 3.5 platform using Microsoft Visual Web Developer 2008 Express Edition. Internet Explorer is used as the default Web browser to access the Internet. In our evaluation procedure we have considered users from different backgrounds. The user profiles are summarized in Table 3.3.
3.3.2 Training and Testing
As the developed system is totally new to our target users, they undergo a training procedure followed by testing. We arrange five different sessions with 18 users. First, we conduct a test in which the target users recognize icons without any prior knowledge, to check the appropriateness of the designed icons. Users are given 25 randomly selected icons and asked to describe their idea of each icon. We separate out the icons that are not recognized or are misinterpreted. In the next two sessions, we familiarize the users with the separated icons: in session 3 we introduce the icons by group (e.g. all vehicles), and in the following session we jumble the icons and again acquaint the users with them. In the final session, 15 icons randomly selected from the icon database are provided to each user, who is asked to recognize them. Our observations are summarized in Table 3.4. The table shows that before training around 37% of the icons in the icon database are not recognized or are misinterpreted. Of these icons, 53% are recognized after
Table 3.3: User details

User type                       Age                 Education (class)  Mother language    Computer proficiency
Office-peon (S1, S2)            24, 46              XII, X             Bengali            Intermediate
Sweeper (S3, S4, S5, S6, S7)    21, 28, 33, 33, 37  V, IV, VII, V, -   Hindi, Bengali     None
Gatekeeper (S8, S9, S10)        16, 20, 26          III, -, VI         Bengali            None, Novice
Shopkeeper (S11, S12)           35, 42              VII, X             Telugu, Bengali    Novice
Cook (S13, S14)                 43, 56              -, V               Bengali, Oriya     None
Waiter (S15, S16, S17, S18)     19, 22, 26, 28      VIII, X, -, V      Bengali, Kannada   None, Novice

User's computer proficiency. None: never used a computer before. Novice: used a computer a few times, but not regularly or for less than 6 months. Intermediate: used a computer for more than 6 months but less than 1 year. Expert: used a computer for more than 1 year.
Table 3.4: User training

Training  Number of icons  Correctly recognized  Concept recognized  Not recognized  Misinterpreted
Before    392              78                    169                 62              83
After     145              31                    47                  24              43
the training sessions, and the remaining icons are familiarized once again in the last session.
Next, the interface is tested with respect to query-construction ability. We use the fifty tourism-related queries (Ref. Table 3.2) as a benchmark. Participants are asked to generate five queries each using the interface. The accuracy of query generation is calculated as the number of icons correctly chosen by the user divided by the total number of icons needed to form the query correctly. The average accuracy for each user is then calculated, as shown in Table 3.5. The overall average accuracy of finding search keywords with the proposed icon-based keyboard is approximately 0.764.
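The accuracy measure can be sketched as follows; the numbers in the example call are hypothetical, not taken from Table 3.5.

```python
def query_accuracy(correct_icons, needed_icons):
    """Accuracy of one query: icons correctly chosen / icons needed for the correct query."""
    return correct_icons / needed_icons

def average_accuracy(per_query):
    """Average accuracy over a user's queries; per_query is a list of (correct, needed) pairs."""
    scores = [query_accuracy(c, n) for c, n in per_query]
    return round(sum(scores) / len(scores), 3)

# Hypothetical user: three queries, each needing 3 icons, with 3, 2 and 3 chosen correctly.
avg = average_accuracy([(3, 3), (2, 3), (3, 3)])
```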
3.4 Conclusion
The developed icon-based interface is a prototype version of the proposed approach. In this work, the prototype implementation works fine with respect to tourism-related queries. A primary evaluation of the interface has been done, but more formative evaluation as well
Table 3.5: Interface testing result

User  Queries             Accuracy per query           Average accuracy
S1    8, 13, 22, 10, 17   1, 1, 0.75, 0.33, 1          0.816
S2    36, 43, 44, 46, 47  1, 1, 1, 0.5, 1              0.9
S3    1, 23, 35, 28, 20   0.66, 0.5, 1, 0.5, 1         0.732
S4    3, 11, 16, 21, 26   0.5, 0.75, 1, 0.5, 1         0.75
S5    5, 7, 19, 32, 34    1, 1, 0.75, 1, 0.66          0.882
S6    38, 39, 40, 48, 49  0.33, 0.66, 0.6, 0.66, 0.33  0.516
S7    37, 41, 42, 45, 50  1, 0.66, 0.5, 0.8, 1         0.792
S8    6, 14, 18, 27, 24   0.33, 0.33, 0.8, 0.8, 1      0.652
S9    2, 15, 25, 29, 30   1, 1, 0.66, 1, 1             0.932
S10   4, 9, 12, 31, 33    0.5, 1, 0.66, 0.33, 0.5      0.598
S11   8, 13, 22, 10, 17   1, 0.66, 0.5, 1, 1           0.832
S12   36, 43, 44, 46, 47  1, 0.66, 1, 0.75, 0.75       0.832
S13   1, 23, 35, 28, 20   0.33, 1, 1, 1, 1             0.866
S14   3, 11, 16, 21, 26   1, 1, 0.5, 1, 0.75           0.85
S15   5, 7, 19, 32, 34    0.5, 1, 1, 0.66, 0.66        0.764
S16   38, 39, 40, 48, 49  1, 0.33, 0.8, 1, 0.33        0.692
S17   37, 41, 42, 45, 50  0.5, 0.33, 0.75, 0.6, 0.8    0.596
S18   6, 14, 18, 27, 24   0.66, 0.66, 0.6, 0.8, 1      0.744
as summative evaluation is needed to enhance the user-friendliness of the interface. The developed interface helps the target user compose tourism-related queries, and the proposed approach can be extended to other domains of interest such as health, education, shopping and job search.
Chapter 4
Clustering Web Search Results
The query composed in terms of icons is transformed into text and fed into the search engine. As a query, a word or a group of words can carry multiple meanings in different contexts. A Web search engine, however, cannot distinguish the context and hence retrieves a huge amount of information. If the query is posed precisely, the relevancy of the retrieved information is high; in that case, the user obtains the desired information precisely and with less navigation effort. On the other hand, when the query is imprecise or too broad, the target information may be present in the search results but ranked far down the result list. It is not worthwhile to present all the search results to our target users. Instead, we need to find precise and relevant information for the user query. To group Web pages with similar content we need clustering. Clustering yields a collection of clusters, from which we find the cluster containing the most relevant documents. Generally, a Web page contains much extra information along with the valuable content, which may result in poor cluster quality; so, before clustering, we preprocess the Web pages. From an exhaustive literature survey we realize that there are different methodologies for clustering: some compromise cluster quality and some compromise response time. We propose a new clustering algorithm combining k-Means and hierarchical clustering in order to obtain better cluster quality with an affordable time delay, and we compare the proposed method with existing clustering methods.
4.1 Introduction
After firing the query, the search engine returns the probable Web pages related to it, which are huge in number and, in general, from different domains. As an example, Google returns 9,095,238 search results on average for a query¹. An expert computer user can use several advanced options (e.g. pages containing a particular phrase, page language, file type, time of upload, particular site or domain) to get accurate information in the least time. Users may not get the appropriate answer every time, but generally some suggestions or hints are provided. The accuracy of retrieving related Web pages for a query depends mainly on query formation, and composing an accurate query is not possible for our target users.
Retrieved Web page snippets are ranked depending on PageRank and relevancy measures. A snippet contains the title of the Web page, a brief description containing the search word (that is, a content summary) and the link to the Web page. In general, the user has to predict their target snippet(s) by going through the snippets or navigating the pages. The actual solution or suggestion to the problem is then obtained by reading the predicted Web pages by trial and error. This procedure is infeasible for our target users. As a way out, clustering has been advocated, whereby similar Web pages are grouped together so that representation and extraction of information related to the search query becomes easier.
In general, clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense; it is a method of unsupervised learning [34]. In other words, clustering is a data mining (machine learning) technique used to place data elements into related groups without any advance knowledge of the group definitions [100]. Document clustering was proposed mainly as a method of improving the effectiveness of document ranking, following the hypothesis that closely associated documents will match the same requests [102]. Systems that perform clustering of Web search results, also known as clustering engines, have become popular in recent years. The first commercial clustering engine was probably Northern Light, in 1996; it was based on a predefined set of categories to which the search results were assigned. A major breakthrough was then made by Vivisimo [24], whose clusters and cluster labels were dynamically generated from the search results. In recent times, several commercial clustering engines have been launched in the market [24], [56], namely Grouper (1997) [126], WISE (2002) [23], Carrot (2003) [92], WebCat (2003) [47], AISearch (2004)², SnakeT (2005)³, Quintura (2005)⁴, WebClust (2006)⁵ and YIPPY
¹ Search engine statistics from the Search Engine Yearbook, http://www.searchengineyearbook.com/search-engines-statistics.shtml
² AI Search Engine From MIT, http://www.netpaths.net/blog/ai-search-engine-from-mit
³ SnakeT Clustering Engine, Meta Search Cluster MetaSearch, http://snaket.di.unipi.it
⁴ Quintura - visual search engine, http://www.quintura.com
⁵ WebClust - Clustering Search Engine, http://www.webclust.com
(2009)¹, etc. These engines [24] consider search-result snippets as input to achieve faster response times. As a consequence, cluster quality degrades, because a snippet is not always a good representative of the whole document [114].
The existing techniques for Web search results clustering are mainly based on two basic clustering mechanisms: partitional and hierarchical clustering. k-Means clustering is the most common type of partitional clustering and produces flat clusters. Hierarchical clustering, on the other hand, creates a hierarchy of clusters, which may be represented in a tree structure called a dendrogram [130]. Hierarchical clustering is usually either agglomerative ("bottom-up") or divisive ("top-down") [77]. Both of these clustering techniques have limitations when applied directly to clustering Web search results. Hierarchical clustering yields better-quality clusters, although the computational complexity of k-Means is lower; but hierarchical clustering is trapped by past mistakes, whereas k-Means offers iterative improvement. Noticing the limitations of these two algorithms, we propose a combination of the hierarchical and k-Means algorithms to cluster Web documents. Our main objective is to obtain better-clustered search results so as to extract concrete information with a reasonable time delay. In this work, we analyze two basic ways of grouping search results and implement a hybrid clustering method. The whole Web page content, instead of the snippet, is considered as input to improve cluster quality. Finally, we compare the performance of the proposed technique with existing approaches.
The organization of this chapter is as follows. In Section 4.2 we discuss the proposed approach. To substantiate the efficacy of the proposed algorithm, we have conducted experiments; the experiments and their results are presented in Section 4.3. Finally, Section 4.4 concludes the chapter.
4.2 Proposed Methodology
We have proposed an approach to group the Web search results into a number of clusters depending on the relevancy of terms in each document. To do this, we introduce a new clustering algorithm. Our proposed approach builds on four tasks: Web page content extraction, preprocessing of the Web documents, document feature extraction and inter-document similarity measurement. An overview of the entire work is shown in Figure 4.1, and a detailed description of each step is given in the following subsections.
¹ Yippy Clustering Search Engine - iTools, http://itools.com/tool/yippy-web-search
Figure 4.1: Overview of our work. The query term and Web page content extraction produce raw text, which enters the preprocessing stage (text filtration and lemmatization); the filtered text goes through document feature extraction to yield feature vectors; the inter-document similarity measure produces a similarity matrix, from which document clustering produces the clustered documents.
4.2.1 Preprocessing of Web Documents
Web pages contain a lot of extra information and noise (e.g. advertisements) along with the desirable information. The presence of these elements often makes it troublesome to identify the important document features and may mislead the clustering process. Hence, before feature extraction and clustering proper, document preprocessing is necessary. In this stage, we first identify the extra information and noise and filter them out. Finally, we perform lemmatization [77] to map inflected words to their root forms.
4.2.1.1 Text Filtration
The source code of a Web page is essentially raw data that cannot be used directly for clustering. In this step, we consider the following elements for removal from any Web page.
• HTML tags: Different HTML tags are used to represent documents in the Web browser. The tags themselves do not carry any important information. Some common HTML tags are <head>, <title>, <body>, <table>, <strong>, etc.
• Special characters: Some special characters are used to create spaces and represent symbols (e.g. ©, &, <, >, ").
• Scripts: Different scripting languages are used to incorporate different features (e.g. JavaScript).
• Non-letter characters: A Web page may contain non-letter characters (e.g. $, %, #).
• Non-printable ASCII characters: A number of non-printable characters may also occur in Web pages (e.g. NUL, SOH, STX).
After downloading the source code, these elements are eliminated so that only the text (sentences, phrases and words) is extracted from the Web documents. Starting and ending tags are identified to separate unnecessary terms from important text. All the unnecessary elements mentioned above are deleted using regular expressions [2].
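The filtration step can be sketched with regular expressions as follows; the patterns are deliberately simplified compared with what a full Web page would require, and are ours, not the thesis's C# expressions.

```python
import re

def filter_text(html):
    """Strip scripts, tags, character entities and non-letter symbols from raw page source."""
    # Remove script and style blocks entirely (their content is not document text).
    text = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    # Remove all remaining HTML tags such as <head>, <title>, <body>, <table>, <strong>.
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove character entities like &copy; &amp; &lt; &gt; &quot;
    text = re.sub(r"&[#\w]+;", " ", text)
    # Keep letters, digits and basic punctuation; drop symbols such as $, %, #.
    text = re.sub(r"[^A-Za-z0-9.,;:'\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```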
4.2.1.2 Lemmatization
In order to perform clustering, Web documents need to be represented as term vectors. Important terms are identified using the term-weighting scheme discussed in Section 4.2.2. Generally, a word can occur in a document in different morphological forms. A lemmatizer maps an inflected word to its base form with the use of a vocabulary and morphological analysis [77]. We use LemmaGen¹ to convert all the words of the Web documents to their base forms. This guarantees that all inflected forms of a term are mapped to a single term, which helps to determine each term's importance. The output of the preprocessing stage, that is, the filtered text (sentences, phrases and words), is the input to the next stage.
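The thesis uses LemmaGen for this step; the toy sketch below only illustrates the idea with a small hand-built lookup table (the entries are examples, not LemmaGen output).

```python
# Toy lemmatizer: maps inflected forms to their base form via a lookup table.
# LemmaGen derives such mappings from a morphological lexicon; this table is illustrative.
LEMMAS = {
    "travelling": "travel", "travelled": "travel", "travels": "travel",
    "trains": "train", "cities": "city", "went": "go",
}

def lemmatize(words):
    """Replace each word by its base form so all inflections count as one term."""
    return [LEMMAS.get(w.lower(), w.lower()) for w in words]
```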
4.2.2 Document Feature Extraction
The aim of this phase is to identify the important terms from the extracted text that are potentially capable of representing the documents. In our work, the vector space model [77] is used to represent a document, with term importance as the dimensions. At first we find the unique terms among the bag of words of all documents. Next, we identify the important terms of each document. To do this, we use the term frequency-inverse document frequency (tf-idf) [119] metric to measure a term's importance. Note that stopwords (e.g. about, the, in) are frequent in any document and are not important; we identify 571 stopwords² which are filtered out from every document. The calculation of tf-idf and the representation of documents in terms of document features are discussed in the following.
Calculation of tf-idf: The tf-idf weight is calculated by multiplying the term frequency (tf) by the inverse document frequency (idf). The term frequency of a term ti in a given document dj is calculated as in Equation 3.1. The inverse document frequency is a measure of the general importance of a term; it is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that
¹ LemmaGen, http://lemmatise.ijs.si/
² http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
quotient. The mathematical representation of the inverse document frequency is given in Equation 4.1.

    idfi = log( |D| / |{d ∈ D : ti ∈ d}| )    (4.1)

The inverse document frequency of the term ti is denoted idfi in Equation 4.1. Here, |D| is the total number of documents and |{d ∈ D : ti ∈ d}| is the number of documents containing the term ti. Combining these two quantities, we can assign a weight tf-idfi,j to a term ti in a document dj as in Equation 4.2.

    tf-idfi,j = tfi,j × idfi    (4.2)
After obtaining the tf-idf value of each unique term for each document, we calculate the average tf-idf value of each term. The terms are then sorted in descending order of this average value, and the top m terms are considered important. Any document can then be represented in terms of these m terms. The value of m is decided experimentally (Ref. Section 4.3.3).
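Equations 4.1 and 4.2, together with the average-based top-m selection, can be sketched in Python as follows. This is an illustration written for this chapter, not the thesis's C# implementation, and tf is simplified to the raw count divided by document length.

```python
import math
from collections import Counter

def tf_idf_table(docs):
    """Return {term: {doc_index: tf-idf}} for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                        # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    table = {}
    for j, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            tf = count / len(doc)             # simplified term frequency
            idf = math.log(n / df[term])      # Equation 4.1
            table.setdefault(term, {})[j] = tf * idf   # Equation 4.2
    return table

def top_m_terms(table, n_docs, m):
    """Terms sorted by average tf-idf over all documents; keep the top m."""
    avg = {t: sum(v.values()) / n_docs for t, v in table.items()}
    return sorted(avg, key=avg.get, reverse=True)[:m]
```

A term occurring in every document gets idf = log(n/n) = 0 and so carries no weight, which is why such terms never reach the top m.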
Term-document matrix (TDM): Our next task is to represent all documents using the m terms identified in the previous step. We consider a term-document matrix TDM that represents all documents using the tf-idf values of the m terms. In TDM, a row represents a term and a column represents a document; each element of the matrix is the tf-idf value of a term in a document. In other words, if the tf-idf value of a particular term ti in a document dj is tf-idfi,j, and m terms {t0, t1, ..., tm-1} represent all documents, then a document dj is represented by the vector [tf-idf0,j, tf-idf1,j, ..., tf-idfm-1,j] [101]. The TDM matrix for a set of n documents is given in Equation 4.3.
TDM =
    | tf-idf0,0      tf-idf0,1      ...  tf-idf0,n-1    |
    | tf-idf1,0      tf-idf1,1      ...  tf-idf1,n-1    |
    | ...                                               |
    | tf-idfm-1,0    tf-idfm-1,1    ...  tf-idfm-1,n-1  |    (4.3)
Generally, column-length normalisation [92] is applied to the TDM matrix to avoid bias towards very short or very long documents. The normalised value of tf-idfi,j, denoted by ai,j, is calculated as

    ai,j = tf-idfi,j / √( (tf-idf0,j)² + (tf-idf1,j)² + ... + (tf-idfm-1,j)² )    (4.4)
Equation 4.5 shows the term-document matrix with column-length normalisation.

TDM' =
    | a0,0      a0,1      ...  a0,n-1    |
    | a1,0      a1,1      ...  a1,n-1    |
    | ...                                |
    | am-1,0    am-1,1    ...  am-1,n-1  |    (4.5)
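Column-length normalisation (Equation 4.4) can be sketched as follows; this illustrative Python takes the matrix as a list of rows, one per term.

```python
import math

def normalise_columns(tdm):
    """Divide each column of the term-document matrix by its Euclidean length (Equation 4.4).

    tdm[i][j] is the tf-idf value of term i in document j.
    """
    m, n = len(tdm), len(tdm[0])
    # Euclidean length of each column (document vector).
    norms = [math.sqrt(sum(tdm[i][j] ** 2 for i in range(m))) for j in range(n)]
    return [[tdm[i][j] / norms[j] if norms[j] else 0.0 for j in range(n)]
            for i in range(m)]
```

After this step every column has unit length, which is what lets Equation 4.6 reduce to Equation 4.7.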
4.2.3 Inter Document Similarity Measure
The main objective of the document clustering phase is to group similar documents. For this, we need to compute similarity values between every pair of documents. As similarity values are compared often during clustering, it is better to compute all pairwise document similarities first; once computed, the values can simply be looked up when needed instead of being recomputed every time. The input to this phase is the feature vectors obtained in the previous phase. Our approach to measuring similarity is discussed in the following.
Cosine similarity: The similarity between two vectors of m dimensions can be measured using cosine similarity, which is the cosine of the angle between them. In this work, cosine similarity is used to measure the similarity between two documents and is computed using the dot product and magnitudes [77]. Let two documents di and dj be represented by the vectors [a0,i, a1,i, ..., am-1,i] and [a0,j, a1,j, ..., am-1,j], respectively. Then the cosine similarity si,j between di and dj is given by Equation 4.6.
    si,j = (di · dj) / (||di|| ||dj||)
         = (a0,i·a0,j + a1,i·a1,j + ... + am-1,i·am-1,j) / ( √(a0,i² + a1,i² + ... + am-1,i²) · √(a0,j² + a1,j² + ... + am-1,j²) )    (4.6)
Previously, we performed column-length normalisation on the matrix TDM in such a way that the sum of the squares of the elements in each column is 1, that is, Σ(i=0 to m-1) ai,j² = 1 for any document dj. Therefore, Equation 4.6 reduces to

    si,j = a0,i·a0,j + a1,i·a1,j + ... + am-1,i·am-1,j    (4.7)
Similarity matrix: The pairwise similarities of documents can be represented by a similarity matrix. A similarity value varies between 0 and 1, where 0 indicates no similarity and 1 indicates maximum similarity (identical documents). The similarity matrix, denoted Sim, is symmetric about the main diagonal, as Sim[di, dj] = Sim[dj, di]. An element si,j of the similarity matrix Sim is the cosine similarity of the documents di and dj. The
mathematical representation of the similarity matrix is given in Equation 4.8.

Simn,n =
    | s0,0      s0,1      ...  s0,n-1    |
    | s1,0      s1,1      ...  s1,n-1    |
    | ...                                |
    | sn-1,0    sn-1,1    ...  sn-1,n-1  |    (4.8)
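Because each column of TDM' has unit length, Equation 4.7 reduces the cosine of Equation 4.6 to a plain dot product, so the whole similarity matrix can be sketched as below (illustrative Python, not the thesis's C# code).

```python
def similarity_matrix(a):
    """Pairwise cosine similarity of documents given the normalised TDM' (Equation 4.7).

    a[i][j] is the normalised weight of term i in document j; since every column has
    unit length, the cosine of documents di and dj is simply their dot product.
    """
    m, n = len(a), len(a[0])
    return [[sum(a[t][i] * a[t][j] for t in range(m)) for j in range(n)]
            for i in range(n)]
```

The matrix is symmetric, so in practice only the upper triangle needs to be computed and stored.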
4.2.4 Our Proposed Clustering Algorithm
We propose an algorithm to cluster documents that combines the divisive hierarchical approach and k-Means; hence, we name it HK Clustering. Our objective is to produce coherent clusters [126], meaning that a document does not strictly belong to a single cluster: depending on certain conditions it may belong to several clusters, as a document often covers multiple topics. The flowchart of the HK Clustering algorithm is shown in Figure 4.2. Let us consider a set of n documents {d0, d1, d2, ..., dn-1} as the clustering elements. The documents based on which the clusters are formed are termed seed documents. Suppose S and Old_S denote the sets of seed documents at the current and previous level, respectively. We store all clusters generated at a level in a pool of clusters called CP.
Step 1. Cluster initialization: Initially, all documents are in a single cluster C0, that is, C0 = {d0, d1, d2, ..., dn-1}, and the cluster pool CP = C0. S and Old_S are initialized to null, as at the root level there are no seed documents.
Figure 4.2: Overview of our proposed HK Clustering algorithm (cluster initialization → seed selection → check the termination condition; if not satisfied, distribute documents and repeat, otherwise produce the final coherent clusters).
Step 2. Seed selection: Suppose Ci,j is the ith cluster at level j. We select seeds in Ci,j for level (j+1) as follows. We find a document pair, say (di, dj), with the minimum similarity value among all document pairs in the cluster Ci,j. If the similarity value of the pair (di, dj) is less than a limit called the dissimilarity threshold, denoted α, we consider di and dj as seeds for level (j+1). Otherwise, the current seed documents of Ci,j at level j remain the seeds for level (j+1). To find all the probable seeds for level (j+1), we consider all the clusters of level j one by one.
Next, we check whether any two or more seeds are mergeable, by checking the similarity value of each pair of seeds. If the similarity value of any seed pair exceeds a threshold called the merging similarity threshold, denoted β, we merge those two seeds into one. It may happen that two or more seed pairs satisfy the condition and share a common seed document: for example, the similarity values of (di, dj) and (dj, dk) are both greater than β, where di, dj, dk ∈ S. In this case, we merge all three of them into a single representative seed. Thus a seed may contain more than one document; we term such a seed a composite seed. We store all seeds selected for the next level in S.
Step 3. Check the terminating condition: The process terminates if CP = null and S = Old_S. If the terminating condition is not satisfied, go to Step 4; otherwise, go to Step 5.
Step 4. Assign documents to their nearest seeds: For each seed in S, we create a cluster initially containing the seed document only. Thus we have |S| newly created clusters, where |S| denotes the number of seeds in S. Next, we assign the non-seed documents to these clusters as follows. Suppose di is a non-seed document that we want to assign to a seed. Among all seeds in S, let dj ∈ S have the maximum similarity with di; then we assign di to the cluster corresponding to the seed dj. In the case of a composite seed, where a single seed is represented by two or more documents, we compare the similarity value of a non-seed document with each document of the composite seed as well as with the other seed documents. All newly obtained clusters are kept in CP. We then set Old_S = S and go to Step 2.
Step 5. Produce the final coherent clusters: Finally, we produce the coherent clusters as stated below. Let C'i,j be the centroid of the cluster Ci,j; we calculate C'i,j by taking the arithmetic mean of all the document vectors (discussed in Section 4.2.2) belonging to the cluster Ci,j. Next, we check the similarity value of each document with the centroids of the clusters other than the one it belongs to. Suppose a document di belongs to cluster Ci,j; we check the similarity values of di with the cluster centroids other than C'i,j. If any of these similarity values crosses a limit called the belonging similarity threshold, denoted γ, then we assign di to that cluster as well. A document may satisfy the condition for more than one cluster centroid; in that case, the document is assigned to all those clusters.
The proposed HK Clustering algorithm is stated precisely in pseudocode form in Algorithm 1. We consider n documents {d0, d1, ..., dn-1} for clustering and prepare the similarity matrix Simn,n (as discussed in Section 4.2.3). In any cluster Ci, Ci.seed denotes the seed document(s) and Ci.doc denotes all documents within the cluster, including the seed(s). We also consider three threshold values for comparison in different phases of the algorithm: the dissimilarity threshold (α), the merging similarity threshold (β) and the belonging similarity threshold (γ). In Algorithm 1, i, j, x and N are positive integer variables; line numbers 16-30 determine the probable seeds for the next level, line numbers 8-12 merge close seeds, and line numbers 32-37 produce the final coherent clusters. The procedure Document Distribution assigns documents to their nearest seeds.
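Under stated simplifications, the seed-selection and distribution steps of one level can be sketched in Python as follows. This is an illustration of Steps 2 and 4 only, not the full Algorithm 1: composite seeds, the level loop and the final γ-based coherent assignment are omitted, and all function and variable names are ours.

```python
def select_seeds(cluster, sim, alpha):
    """Step 2 (simplified): pick the most dissimilar pair in a cluster as next-level seeds.

    cluster is a list of document indices; sim is the similarity matrix. Returns the
    pair if its similarity is below the dissimilarity threshold alpha, otherwise None
    (meaning the cluster keeps its current seeds).
    """
    pairs = [(sim[i][j], i, j) for i in cluster for j in cluster if i < j]
    value, di, dj = min(pairs)
    return [di, dj] if value < alpha else None

def distribute(docs, seeds, sim):
    """Step 4: assign every non-seed document to the cluster of its most similar seed."""
    clusters = {s: [s] for s in seeds}
    for d in docs:
        if d not in seeds:
            nearest = max(seeds, key=lambda s: sim[d][s])
            clusters[nearest].append(d)
    return list(clusters.values())
```

Repeating these two steps level by level until the seed set stops changing reproduces the divisive part of the algorithm; the k-Means flavour comes from reassigning every document to its nearest seed at each level.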
4.2.4.1 Illustration of the Proposed HK Clustering Algorithm
For better understanding, the algorithm is explained with Figure 4.3. In this figure, Ci,j denotes a cluster, where i represents the level and j represents the cluster number. The document(s) marked with a star within a cluster are the seed document(s). Suppose we want to cluster ten documents {d0, d1, d2, ..., d9}. The initial cluster is C0,1 = {d0, d1, d2, ..., d9} at level 0, and the cluster pool CP = C0,1; S and Old_S are null. Now we dequeue the cluster from CP and determine the two documents with minimum similarity. Let (d0, d8) be the most dissimilar document pair in the cluster C0,1, with a similarity value less than the dissimilarity threshold α. So (d0) and (d8) are considered as seeds at level 1. Next, we check whether the seeds selected at level 1 are mergeable. In this case the check is trivial, because both seeds are generated from the same parent and their similarity value is less than the dissimilarity threshold; since obviously α < β, the similarity of (d0, d8) cannot cross the merging similarity threshold β. So (d0) and (d8) are finalised as seeds at level 1. Then we check the terminating condition: the first condition, CP = null, is true, but the second condition, S = Old_S, is false, because S = {d0, d8} and Old_S = null. Next, each non-seed document is assigned
Input: Simn,n : similarity matrix.
Output: C = {C0, C1, ..., CN-1} : set of coherent clusters.

 1  begin
 2      Initialise S = Old_S = Null, C0 = {d0, d1, ..., dn-1}
 3      Add C0 into the cluster pool CP
 4      while true do
 5          if CP = Null then
 6              if S = Old_S then
 7                  Exit                               // Terminating condition
 8              else
 9                  /* Merging close seeds */
10                  for all di ∈ S do
11                      Find the mergeable seed dj such that Sim[di, dj] > β
12                  end
13                  Merge all mergeable seed documents and modify S
14                  /* Assign non-seed documents to seeds; N denotes the number of seeds */
15                  Document Distribution(S, Old_S, CP, N)
16              endif
17          else
18              /* Find probable seeds for next level */
19              Ctemp ← Dequeue cluster pool CP
20              if Count(Ctemp.doc) = Count(Ctemp.seed) then
21                  S ← Ctemp.seed                     // Cluster contains only seed(s)
22              else
23                  /* Find two documents in Ctemp having minimum similarity */
24                  Min_Sim.value ← Min(Sim[di, dj]), ∀ di, dj ∈ Ctemp, di ≠ dj
25                  Min_Sim.doc ← Documents corresponding to Min_Sim.value
26                  if Min_Sim.value < dissimilarity threshold α then
27                      S ← Min_Sim.doc
28                  else
29                      Seed of cluster Ctemp is passed to the next level: S ← Ctemp.seed
30                  endif
31              endif
32          endif
33      end
34      /* Produce coherent clusters */
35      Document Distribution(S, Old_S, CP, N) where N = Count(S)
36      Calculate the centroid C'i of each cluster Ci ∈ CP
37      if SimC(di, C'j) > γ for i = 0 to n-1 and j = 0 to N-1 then
38          Assign di to Cj, where C'j is the centroid of cluster Cj and di ∉ Cj
39      endif
40  end

Algorithm 1: HK Clustering
4. Clustering Web Search Results
Input: S: seed documents of the current level, Old_S: seed documents of the previous
level, CP: cluster pool, N: number of seeds.

begin
    Create N buckets
    C0.seed ← S(0), C1.seed ← S(1), ..., C_{N−1}.seed ← S(N−1)
    for j = 0 to n−1 do
        Find the closest seed di for document dj, where di ∈ S
        Cx.doc ← dj : di ∈ Cx.seed, 0 ≤ x ≤ N−1             // assign document to its nearest seed
    end
    CP = C0 ∪ C1 ∪ ... ∪ C_{N−1}
    for i = 0 to N−1 do
        Old_S ← Ci.seed
    end
    Return
end

Procedure: Document Distribution
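The two listings above can be condensed into a short Python sketch. This is our own simplified rendering, not the thesis implementation: a seed is represented as a tuple of document ids (composite after merging), `sim` is a precomputed similarity matrix with `sim[i][i] = 1.0`, and the final coherent (γ) pass is omitted for brevity.

```python
from itertools import combinations

def hk_cluster(docs, sim, alpha, beta):
    """Simplified HK Clustering loop: split the least-similar pair of each
    cluster into new seeds, merge seeds closer than beta, redistribute the
    documents, and stop when the seed set stabilises."""
    def seed_sim(d, seed):                        # doc-to-seed similarity
        return max(sim[d][s] for s in seed)

    clusters = {(docs[0],): list(docs)}           # level 0: one cluster
    old_seeds = None
    while True:
        seeds = []
        for seed, members in clusters.items():
            pairs = list(combinations(members, 2))
            if not pairs:                         # singleton cluster
                seeds.append(seed)
                continue
            a, b = min(pairs, key=lambda p: sim[p[0]][p[1]])
            if sim[a][b] < alpha:                 # dissimilar enough: split
                seeds += [(a,), (b,)]
            else:                                 # pass the previous seed on
                seeds.append(seed)
        merged = []                               # merge seeds closer than beta
        for s in seeds:
            for i, m in enumerate(merged):
                if max(sim[x][y] for x in s for y in m) > beta:
                    merged[i] = m + s
                    break
            else:
                merged.append(s)
        seeds = merged
        if set(seeds) == old_seeds:               # terminating condition
            return clusters
        old_seeds = set(seeds)
        clusters = {s: [] for s in seeds}         # document distribution
        for d in docs:
            clusters[max(seeds, key=lambda s: seed_sim(d, s))].append(d)
```

On the worked example, the same mechanics apply: each level either splits the least-similar pair of a cluster into fresh seeds or carries the old seed forward, and the loop ends once two consecutive levels produce identical seeds.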
to its nearest seed depending on the similarity measure. Suppose we want to assign the
non-seed document (d1) to seed (d0) or (d8). Let Sim[d0, d1] > Sim[d8, d1]. So, (d1)
is assigned to the cluster corresponding to the seed (d0). Similarly, the other non-seed
documents are assigned to their respective seeds. Suppose, after completion of level 0, two
clusters are generated: C11 = {d0*, d1, d4, d5} and C12 = {d8*, d2, d3, d6, d7, d9} (see Figure 4.3).
C11 and C12 are added into the cluster pool CP. Old_S is set to {d0, d8}.
Suppose the four documents (d0), (d5) from the cluster C11 and (d3), (d7) from the
cluster C12 are determined as seeds at level 2. Assume that these seeds are not mergeable.
At this level, the termination condition is not satisfied. After distributing the documents,
four clusters are generated, namely C21 = {d0*, d4}, C22 = {d5*, d6}, C23 = {d3*, d1, d8, d9}
and C24 = {d7*, d2} (also see Figure 4.3). We may note that document (d1) of cluster
C11 at level 1 is assigned to cluster C23 at level 2, as (d1) is nearer to (d3) than to (d0), (d5)
and (d7). Similarly, (d6) is assigned to cluster C22. Old_S becomes {d0, d5, d3, d7}.
At level 3, (d0), (d4) from C21, (d5), (d6) from C22, (d8), (d3) from C23 and (d7) from
C24 are selected as seeds. Here, we assume that no document pair in the cluster C24 has
similarity below the dissimilarity threshold α, so the seed of the previous level, (d7), is passed
to level 3. Among all the seeds, let the similarity of (d6) and (d8) exceed β, the merging
similarity threshold. Therefore, they are merged. {(d0), (d4), (d5), (d6, d8), (d3), (d7)} are
finalised as seeds at level 3. At level 3, the termination condition is not satisfied. Hence,
the rest of the documents are assigned to their nearest seed. For this, each non-seed
[Tree diagram: level 0 cluster C01 = {d0, ..., d9}; level 1: C11 = {d0*, d1, d4, d5},
C12 = {d8*, d2, d3, d6, d7, d9}; level 2: C21–C24; level 3: C31–C36; level 4: C41–C46;
seed documents marked with *.]
Figure 4.3: Illustration of the proposed HK Clustering algorithm.
document is compared to all seeds (including all documents in a composite seed) to
find the maximum similarity value among them. This time one seed is composite, namely
(d6, d8), so both of its members are considered, along with the other seeds, when a
non-seed document is assigned to a seed. As an example, to assign document d9
to its nearest seed, d9 is compared with d0, d4, d5, d3, d7 and with both d6 and d8.
Assuming Sim[d5, d9] < Sim[d0, d9] < Sim[d4, d9] < Sim[d7, d9] < Sim[d6, d9] <
Sim[d3, d9] < Sim[d8, d9], d9 is assigned to seed (d6, d8). After completion of
document assignment at level 3, six clusters are generated: C31 = {d0*}, C32 = {d4*},
C33 = {d5*}, C34 = {(d6, d8)*, d9}, C35 = {d3*, d1} and C36 = {d7*, d2}. Old_S is set
to {(d0), (d4), (d5), (d6, d8), (d3), (d7)}.
Next, at level 4, six seeds, (d0) from C31, (d4) from C32, (d5) from C33, (d6, d8) from
C34, (d3) from C35 and (d7) from C36, are selected; none of them are mergeable with
each other. At this stage the cluster pool CP is empty and the seeds of level 4 are the
same as the level 3 seeds. Thus, we meet the terminating condition.
The documents other than the seeds are assigned to their nearest seed one last
time. This produces six clusters: (d0), (d4), (d5), (d6, d8, d9), (d3, d1), (d7, d2). At
this stage the generated clusters are hard in nature, as each document strictly belongs
to a single cluster. In order to produce coherent [126] clusters, the centroids of the
generated hard clusters are determined. The similarity between each document and the
centroids of the clusters other than its own is then checked. Let the similarity between
d9 and the centroid of the fifth cluster {d3, d1} exceed γ, the belonging similarity threshold.
So, d9 is also assigned to that cluster. Thus, at the end of the process, six coherent
clusters C41 = {d0*}, C42 = {d4*}, C43 = {d5*}, C44 = {(d6, d8)*, d9}, C45 = {d3*, d1, d9} and
C46 = {d7*, d2} are generated.
4.2.4.2 Complexity Analysis of HK Clustering Algorithm
The performance of the HK Clustering algorithm is analysed from the time and storage
requirement points of view as follows.
Time complexity: There are three main tasks in the algorithm: seed selection for the
next level, assigning documents to their nearest seeds, and producing the final coherent
clusters. Suppose there are n documents to be clustered. We consider a general level h,
where h ∈ {0, 1, ..., log2 n}. For simplicity, we assume that 2^h clusters are generated
from 2^h seeds at level h. So, the numbers of clusters generated in consecutive levels
are {2^0, 2^1, ..., 2^(log2 n)} = {1, 2, ..., n}. We also assume that the n documents are
equally distributed among all the clusters at the same level; therefore, each cluster contains
approximately n/2^h documents. Let the three main tasks take T1, T2, T3 time, respectively.
For these three tasks we analyse the time complexity as follows.
1. Seed selection: In this phase, we find the seeds which will create the clusters at the
current level h. This task can be divided into two sub-tasks: finding the two most
dissimilar documents in each cluster, and merging close seeds. Let these two
sub-tasks take t11 and t12 time, respectively, that is, T1 = t11 + t12.
To find the two documents in a cluster with minimum similarity, we have to compare
the similarity values of all document pairs of that cluster. As a single cluster contains
n/2^h documents at level h, the document pair with minimum similarity can be found
in O((n/2^h)^2) time. This takes t11 = O(2^h · (n/2^h)^2) = O(n^2/2^h) time when
we consider all the clusters of that particular level.
The next sub-task is to find the mergeable seeds at level h. Among the n documents,
2^h are seeds at level h. In general, these seeds are generated from the 2^(h−1) clusters
of the previous step. To find the mergeable seeds at level h we need 2^h · (2^h − 1)/2
comparisons. But seeds from the same parent are not mergeable, so 2^(h−1) comparisons
are not required.
Therefore, 2^h · (2^h − 1)/2 − 2^(h−1) = 2^h · (2^(h−1) − 1) comparisons are actually
required to find the mergeable seeds at level h. Therefore, t12 = O(2^h · (2^(h−1) − 1)).
2. Assigning documents to their nearest seeds: After merging the mergeable seeds, we
need to assign the non-seed documents to their nearest seed. If, at any level h, 2^h seeds
are generated, then (n − 2^h) documents are left to be assigned. Assigning these
(n − 2^h) documents among 2^h seeds takes T2 = O((n − 2^h) · 2^h) time.
3. Producing final coherent clusters: To produce the final coherent clusters, we need to
calculate the centroids of the generated clusters. For 2^h clusters at the final level h,
we have to compute 2^h centroids. Next, the similarity of each document to the
centroids of the clusters other than the one the document belongs to is compared.
This needs n · 2^h comparisons.
Now, h varies from 0 to log2 n. At level h = log2 n, n clusters would be produced,
which implies that each cluster contains only a single document. In this case centroid
determination is not required. Instead, at level h = (log2 n) − 1, n/2 clusters would
be generated, for which n/2 cluster centroids need to be determined. To check the
similarity values between cluster centroids and documents, we need n · n/2 comparisons.
So, the maximum possible time required for T3 is O(n^2/2).
Now, among the three tasks stated above, the first two, seed selection and assigning
documents to their nearest seeds, occur repeatedly at each level. The required time for
these two tasks at level h is given by Equation 4.9.
T_1^h + T_2^h = {n^2 / 2^(h−1) + 2^h (2^(h−1) − 1)} + {2^h (n − 2^h)}
             = n^2 / 2^(h−1) + 2^h (n − 1) − 2^(2h) / 2        (4.9)
The total time required for these two tasks over all levels is

T' = Σ_{h=1}^{log2 n} {n^2 / 2^(h−1) + 2^h (n − 1) − 2^(2h) / 2}
   = n^2 Σ_{h=1}^{log2 n} 1/2^(h−1) + (n − 1) Σ_{h=1}^{log2 n} 2^h − (1/2) Σ_{h=1}^{log2 n} 2^(2h)
   = 2n(n − 1) + 2(n − 1)^2 − (2/3)(n^2 − 1)        (4.10)
Therefore,

T' = O(n^2)        (4.11)

The total time required is T = T' + T3 = O(n^2). So, the time complexity of the proposed
algorithm is O(n^2).
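The closed form in Equation 4.10 can be checked numerically against the summation it came from. The snippet below is our own sanity check, not part of the thesis; it evaluates both sides for a few powers of two.

```python
import math

def t_prime_sum(n):
    """Evaluate the summation of Eq. 4.10 term by term (n a power of two)."""
    H = int(math.log2(n))
    return sum(n**2 / 2**(h - 1) + 2**h * (n - 1) - 2**(2 * h) / 2
               for h in range(1, H + 1))

def t_prime_closed(n):
    """Closed form of Eq. 4.10: 2n(n-1) + 2(n-1)^2 - (2/3)(n^2 - 1)."""
    return 2 * n * (n - 1) + 2 * (n - 1) ** 2 - (2 / 3) * (n ** 2 - 1)

# the two expressions agree (up to float rounding in the 2/3 term),
# and both grow quadratically, consistent with T' = O(n^2)
for n in (8, 64, 1024):
    assert abs(t_prime_sum(n) - t_prime_closed(n)) < 1e-6 * t_prime_closed(n)
```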
Space complexity: In the proposed algorithm, a similarity matrix is created to store the
inter-document similarity values. Along with this, we have to store the cluster centroids
in the final step and the generated clusters at each step. So, the overall space complexity
is O(n^2 + k·n), that is, O(n^2), where k is the number of clusters generated in the final
step and n is the number of documents.
4.3 Experiments and Experimental Results
In this section, we present a detailed description of the experiments and the observed
results. We use the same experimental setup as discussed in Section 3.3.1.
4.3.1 Experimental Data
In order to substantiate the efficacy of the proposed algorithm and compare different
clustering algorithms, we need a tagged, well-classified data set. We use the standard
document collections of Reuters-21578¹, from which we collect only the tagged documents.
Thirteen sample data sets are created, from reut2-000 to reut2-012, each having a
different number of classes. We also prepare a tagged document collection using the
Web search results for some specific queries. To create our own document collection,
ten different tourism-related benchmark queries are fed to the Google search engine. We
consider the first 100 returned results for each query to build up a collection. Ten expert
users (all UG and PG students of our institute) are asked to tag the Web pages; each
Web page is tagged with one or more topics. So our own database contains 1000 tagged
documents. These datasets are used for both training and testing purposes.
4.3.2 Performance Metrics
To evaluate the performance of the proposed algorithm, cluster quality is analysed using
two different measures: an internal measure and an external measure [108], [37]. The
internal measure allows comparing different sets of clusters without reference to any
external knowledge, such as an already-classified dataset, whereas the external measure
quantifies how well a clustering result matches a known classified dataset. In our work,
the known classified dataset is the previously categorised document collection provided
by human editors.
¹Reuters-21578 Text Categorization Test Collection, http://www.daviddlewis.com/resources/testcollections/reuters21578
4.3.2.1 Internal Measure
We use the Dunn index [37] as an internal measure of cluster quality. It aims to
identify dense and well-separated clusters, and is defined as the ratio of the minimal
inter-cluster distance to the maximal intra-cluster distance. Equation 4.12 shows the
Dunn index, where δ(Ci, Cj) represents the distance between clusters Ci and Cj and
∆(Cl) measures the intra-cluster distance of a cluster Cl.
DI(C) = min_{i≠j} {δ(Ci, Cj)} / max_{1≤l≤k} {∆(Cl)}        (4.12)

δ(Ci, Cj) = (1 / (|Ci| |Cj|)) Σ_{di∈Ci, dj∈Cj} φ(di, dj)        (4.13)

∆(Ci) = 2 (Σ_{di∈Ci} φ(di, C'i) / |Ci|)        (4.14)
Here, φ(di, dj) is the distance between two documents di and dj belonging to Ci and
Cj, respectively; |Ci| and |Cj| are the sizes of clusters Ci and Cj, and C'i is the centroid
of cluster Ci.
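Equations 4.12–4.14 translate directly into code. The sketch below is an illustrative implementation (all names are ours), assuming Euclidean distance for φ and clusters given as lists of coordinate tuples.

```python
import math

def phi(a, b):
    """Distance between two document vectors (Euclidean, as an assumption)."""
    return math.dist(a, b)

def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    n = len(cluster)
    return [sum(x[i] for x in cluster) / n for i in range(len(cluster[0]))]

def delta(ci, cj):
    """Average inter-cluster distance, Eq. 4.13."""
    return sum(phi(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

def diameter(ci):
    """Intra-cluster distance around the centroid, Eq. 4.14."""
    c = centroid(ci)
    return 2 * sum(phi(a, c) for a in ci) / len(ci)

def dunn_index(clusters):
    """Eq. 4.12: minimal inter-cluster over maximal intra-cluster distance."""
    k = len(clusters)
    inter = min(delta(clusters[i], clusters[j])
                for i in range(k) for j in range(k) if i != j)
    intra = max(diameter(c) for c in clusters)
    return inter / intra
```

Tight, well-separated clusters yield a large ratio; overlapping or sprawling clusters push it toward zero.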
4.3.2.2 External Measure
The F measure, which combines the precision (P) and recall (R) ideas from information
retrieval [37], [77], is widely used to measure external quality. For the document set D,
let Cj be one of the output clusters and C*i the corresponding human-edited class. Then
the precision and recall are

P = |C*i ∩ Cj| / |Cj|        (4.15)

R = |C*i ∩ Cj| / |C*i|        (4.16)
The F measure is the harmonic mean of precision and recall. The F measure (Fi,j) and
the overall F measure (F) are computed as shown in Equation 4.17 and Equation 4.18, where
l is the total number of human-edited classes, k is the number of output clusters, and |V|
is the total number of documents present in the l human-edited classes.

Fi,j = 2·P·R / (P + R)        (4.17)

F = Σ_{i=1}^{l} (|C*i| / |V|) · max_{j=1,...,k} {Fi,j}        (4.18)
The value of the Dunn index is high for clusters with high intra-cluster similarity and
low inter-cluster similarity, and a high overall F measure indicates that the clusters map
more accurately to the original classes. So, algorithms that produce clusters with a high
Dunn index and a high overall F measure are more desirable.
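Equations 4.15–4.18 can be computed directly when each class and cluster is a set of document ids. The function below is our own illustrative rendering (names are ours), reading the precision and recall ratios as counts of the set intersections.

```python
def f_measure(classes, clusters):
    """Overall F measure of Eq. 4.18. `classes` holds the human-edited
    classes C*_i and `clusters` the output clusters C_j, both as sets of
    document ids."""
    V = sum(len(c) for c in classes)        # total documents in the l classes
    total = 0.0
    for ci in classes:
        best = 0.0
        for cj in clusters:
            inter = len(ci & cj)
            if inter == 0:
                continue
            p = inter / len(cj)             # precision, Eq. 4.15
            r = inter / len(ci)             # recall, Eq. 4.16
            best = max(best, 2 * p * r / (p + r))   # F_{i,j}, Eq. 4.17
        total += len(ci) / V * best         # weighted by class size, Eq. 4.18
    return total

# a clustering that reproduces the classes exactly scores F = 1.0
assert abs(f_measure([{1, 2, 3}, {4, 5}], [{1, 2, 3}, {4, 5}]) - 1.0) < 1e-12
```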
4.3.3 Evaluations
We compare the HK Clustering algorithm with basic clustering methods as well as the
benchmarking clustering algorithms specially used for clustering Web search results.
Seven data sets of Reuters from reut2-005 to reut2-012 and �ve data sets of our data col-
lection are used for the comparison. Some data sets containing large number documents
of Reuters are used for testing to show the e�ciency of algorithm for higher number of
document collection. We apply HK Clustering algorithm on the document collections
to obtain clusters. Basic clustering methods: k -Means and agglomerative hierarchical
clustering (group average) are also used to cluster the document collections. Similarly,
the latest benchmarking algorithm Lingo using Singular Value Decomposition (SDV)
[92] is used for clustering. Then we check the quality of clusters obtained from di�erent
algorithms with respect to both internal and external measures. Lingo considers search
result snippet as an input. We feed the whole document content in Lingo to maintain
the symmetry. Table 4.1 presents the comparative study of k -Means, agglomerative hi-
erarchical clustering, Lingo and HK Clustering algorithm with respect to cluster quality.
From Table 4.1 it is clear that the proposed HK Clustering algorithm produces better
clusters than the k-Means and Lingo algorithms. The time complexities of the four
algorithms are shown in Table 4.2. From the time complexity point of view, HK
Clustering performs better than the hierarchical agglomerative clustering algorithm as
well as the Lingo algorithm. Hence, HK Clustering offers a good balance between cluster
quality and time complexity.
Table 4.1: Comparison of cluster quality
Document name | #Docs | #Classes | #Clusters | Hier. agglom. DI / F | k-Means DI / F | Lingo DI / F | HK Clustering DI / F
reut2-005 | 100 | 6 | 8 | 2.71 / 0.70 | 2.41 / 0.49 | 2.55 / 0.56 | 2.64 / 0.68
reut2-006 | 100 | 10 | 9 | 2.40 / 0.78 | 2.33 / 0.42 | 2.30 / 0.62 | 2.36 / 0.71
reut2-007 | 100 | 4 | 5 | 1.42 / 0.58 | 1.27 / 0.34 | 2.29 / 0.41 | 1.31 / 0.58
reut2-008 | 100 | 7 | 10 | 2.67 / 0.55 | 2.43 / 0.32 | 2.46 / 0.38 | 2.49 / 0.52
reut2-009 | 100 | 6 | 7 | 1.57 / 0.67 | 1.42 / 0.51 | 1.42 / 0.55 | 1.46 / 0.64
Adventure India | 100 | 9 | 8 | 2.71 / 0.78 | 2.49 / 0.51 | 2.52 / 0.58 | 2.53 / 0.69
Hotel Hyderabad | 100 | 14 | 16 | 4.61 / 0.77 | 4.35 / 0.53 | 4.39 / 0.73 | 4.66 / 0.81
India pilgrim | 100 | 10 | 13 | 3.89 / 0.38 | 3.71 / 0.24 | 3.73 / 0.26 | 3.83 / 0.35
Kolkata travel | 100 | 14 | 16 | 4.79 / 0.67 | 4.62 / 0.52 | 4.65 / 0.49 | 4.77 / 0.59
Weather Kolkata | 100 | 8 | 7 | 1.59 / 0.68 | 1.48 / 0.41 | 1.52 / 0.53 | 1.57 / 0.67
reut2-010 | 400 | 14 | 17 | 4.51 / 0.56 | 4.14 / 0.50 | 4.37 / 0.51 | 4.48 / 0.53
reut2-011 | 800 | 12 | 15 | 3.45 / 0.49 | 3.19 / 0.24 | 3.52 / 0.33 | 3.36 / 0.44
reut2-012 | 1000 | 15 | 16 | 4.26 / 0.46 | 4.11 / 0.31 | 4.19 / 0.34 | 4.24 / 0.37
Table 4.2: Comparison of time complexity

Clustering algorithm | Time complexity
k-Means | O(n)
HK Clustering | O(n^2)
Hierarchical agglomerative* | O(n^2 log n)
Lingo | O(n^3)

*Considering the group-average linkage criterion
4.3.4 Discussion
Apart from the efficiency of the proposed algorithm, there are two key issues in the
proposed methodology: the document download time, and the setting of the threshold
values and the number of feature vectors used in the algorithm. We address these two
issues with two different experiments. We use threading to download the Web pages
returned by the search engine; the first experiment tests the effectiveness of threading.
The second experiment is conducted to set the threshold values and the number of
feature vectors used in the algorithm.
Document download time: As discussed in Section 4.2.1, we use threading to download
Web pages from the Web repository. After feeding the query to the search engine, the
system takes a large amount of time to download the Web pages one by one. As the
response time of the proposed system is a key factor, we reduce the download time using
threading. Five different tourism-related queries are used to test the effectiveness
of threading. As the download speed varies, each query is run thrice and the average
download time is calculated. A comparison of the time required for downloading documents
with and without threading is given in Table 4.3. It shows that threading reduces the
download time by a factor of about 13 for 100 documents. Figure 4.4 depicts the difference
clearly.
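The thesis does not list its threading code; a minimal sketch of the idea using Python's standard thread pool looks like the following. The worker count and timeout are illustrative assumptions, not values from the experiment.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download one page; a network error yields an empty document."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except OSError:
        return url, b""

def fetch_all(urls, workers=20):
    """Download the result pages concurrently rather than one by one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, urls))
```

Because page downloads are I/O-bound, overlapping them in threads gives roughly the order-of-magnitude speed-up reported in Table 4.3.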
Table 4.3: Web document download time with threading and without threading
Web pages | Culture Assam (3.64 MB) WT / WOT | Delhi guide (4.67 MB) WT / WOT | Market Kolkata (4.83 MB) WT / WOT | Royal Rajasthan (3.56 MB) WT / WOT | Wildlife India (3.44 MB) WT / WOT
10 | 00:08.85 / 00:21.99 | 00:09.10 / 00:39.21 | 00:08.92 / 00:18.94 | 00:08.87 / 00:29.76 | 00:08.85 / 00:31.78
20 | 00:08.97 / 01:33.48 | 00:09.91 / 01:24.51 | 00:09.87 / 00:43.63 | 00:09.02 / 01:11.86 | 00:09.18 / 01:36.18
30 | 00:11.58 / 02:13.70 | 00:11.91 / 02:08.51 | 00:13.26 / 01:06.27 | 00:09.25 / 01:58.62 | 00:09.52 / 02:14.18
40 | 00:13.51 / 02:44.42 | 00:18.36 / 02:49.64 | 00:17.84 / 01:27.62 | 00:10.69 / 02:34.39 | 00:10.21 / 02:44.56
50 | 00:14.32 / 03:32.07 | 00:24.85 / 03:28.76 | 00:24.53 / 03:11.93 | 00:14.18 / 03:14.52 | 00:14.60 / 03:14.04
60 | 00:23.67 / 04:16.47 | 00:30.78 / 05:15.68 | 00:33.10 / 03:35.56 | 00:22.16 / 04:24.06 | 00:21.66 / 03:57.92
70 | 00:25.47 / 04:59.79 | 00:31.64 / 06:04.89 | 00:33.98 / 04:24.28 | 00:22.35 / 05:15.69 | 00:25.97 / 04:29.43
80 | 00:27.19 / 05:39.02 | 00:32.28 / 06:49.74 | 00:35.01 / 04:55.78 | 00:23.42 / 05:52.83 | 00:26.50 / 05:14.39
90 | 00:28.42 / 06:16.56 | 00:33.56 / 07:22.20 | 00:35.69 / 05:18.35 | 00:24.07 / 06:35.51 | 00:26.83 / 05:52.23
100 | 00:28.97 / 06:58.62 | 00:34.71 / 08:49.76 | 00:39.37 / 05:36.21 | 00:26.75 / 07:33.74 | 00:29.65 / 06:17.58

Notation: WT → with threading, WOT → without threading; all times in mm:ss.
Average download time for 100 documents without threading = 07:03.18 mm:ss.
Average download time for 100 documents with threading = 00:31.89 mm:ss.
Speed-up = 13.27.
[Bar chart comparing document download time without threading and with threading]
Figure 4.4: Download time of documents
Determination of optimal threshold values and number of feature vectors:
We conducted another experiment to set the threshold values and the number of feature
vectors used in the proposed algorithm. This experiment essentially trains the algorithm
on a classified data set. Five Reuters data sets, from reut2-000 to reut2-004, and five
data sets from our own collection are used to set the threshold values. First, we calculate
the Dunn index for the samples with the given classes. The threshold values α, β and γ
can each vary from 0 to 1. As the number of classes is known to us, we try to maximise
the Dunn index for the given number of classes: we select the set of threshold values for
which the number of clusters is closest to the given number of classes and the Dunn index
is maximal. After considering all samples, the arithmetic mean and harmonic mean of
the individual thresholds and numbers of features are calculated. Using these average
values, we again compute the Dunn index of the training samples. We have plotted the
Dunn index for every sample considering the optimal values as well as the average values.
The deviation of the Dunn index for the average threshold values compared to the optimal
threshold values is shown in Figure 4.5. We find that this deviation is moderate, and the
difference between the arithmetic mean and harmonic mean values is very small. We use
the harmonic mean values for testing. The experimental results are summarised in
Table 4.4.
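The training procedure above — pick the (α, β, γ) whose cluster count best matches the known classes and whose Dunn index is highest — can be sketched as a grid search. In the sketch below, `cluster_fn` and `dunn_fn` stand in for the clustering algorithm and the Dunn-index computation, and the grid of candidate values is an illustrative assumption.

```python
from itertools import product

def tune_thresholds(docs, sim, n_classes, cluster_fn, dunn_fn, grid=None):
    """Grid-search (alpha, beta, gamma): prefer settings whose cluster count
    is closest to the known class count, then the highest Dunn index."""
    grid = grid or [round(0.1 * i, 1) for i in range(1, 10)]
    best, best_key = None, None
    for a, b, g in product(grid, repeat=3):
        if not a < b:                       # alpha must stay below beta
            continue
        clusters = cluster_fn(docs, sim, a, b, g)
        # lexicographic key: closeness to class count first, Dunn index second
        key = (-abs(len(clusters) - n_classes), dunn_fn(clusters))
        if best_key is None or key > best_key:
            best, best_key = (a, b, g), key
    return best
```

Averaging the per-sample optima (arithmetically or harmonically), as done in Table 4.4, then yields a single setting for testing.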
[Plot: Dunn index for each training sample under the given classification, the optimal
thresholds, the arithmetic-mean thresholds and the harmonic-mean thresholds]
Figure 4.5: Comparison of Dunn Index
Table 4.4: Determination of threshold values and number of feature vectors
No. | Document collection | #Docs | #Classes | DI (given classification) | α | β | γ | #Features | DI (proposed classification)
1 | reut2-000 | 100 | 9 | 2.42 | 0.025 | 0.335 | 0.8 | 200 | 2.57
2 | reut2-001 | 100 | 13 | 3.37 | 0.030 | 0.365 | 0.6 | 200 | 3.49
3 | reut2-002 | 120 | 12 | 3.51 | 0.025 | 0.440 | 0.6 | 300 | 3.55
4 | reut2-003 | 120 | 18 | 4.63 | 0.020 | 0.345 | 0.8 | 300 | 4.71
5 | reut2-004 | 100 | 13 | 3.48 | 0.025 | 0.540 | 0.8 | 200 | 3.43
6 | Culture Assam | 100 | 15 | 2.89 | 0.045 | 0.460 | 0.7 | 300 | 3.65
7 | Delhi guide | 100 | 24 | 4.66 | 0.035 | 0.665 | 0.7 | 400 | 4.72
8 | Market Kolkata | 100 | 14 | 4.49 | 0.025 | 0.550 | 0.5 | 300 | 4.68
9 | Royal Rajasthan | 100 | 12 | 3.41 | 0.030 | 0.755 | 0.6 | 300 | 3.59
10 | Wildlife India | 100 | 14 | 4.57 | 0.035 | 0.765 | 0.8 | 300 | 4.61

Arithmetic mean: α = 0.029, β = 0.522, γ = 0.690, number of features = 280.
Harmonic mean: α = 0.028, β = 0.479, γ = 0.673, number of features = 267.
4.4 Conclusion
With the advancement of information technology, the size of the Web repository is
increasing rapidly, and this trend will continue in the coming years. As a consequence,
for a given user query a search engine returns a huge number of results, and it becomes
difficult to extract the right information for our target users. To tackle this, we cluster
the search results. A number of clustering techniques can cluster Web documents, but
the existing techniques either need user intervention to decide the number of clusters or
are computationally expensive. The present work addresses these limitations and proposes
a new clustering approach that takes advantage of both the k-Means and hierarchical
approaches. Our clustering technique is able to produce coherent clusters of good quality
without excessive computational overhead. The proposed technique is useful for clustering
a large number of Web search results into groups of similar documents in real time; this
grouping helps direct the search interest. The technique presented in this work is composed
of several sub-tasks, for which we use naive approaches, resulting in an overall time
complexity of O(n^2). There is scope for reducing this time complexity by solving the
sub-tasks in more efficient ways.
Chapter 5
Icon-Based Information
Representation
After clustering, we get a number of clusters containing similar types of information. Our
next task is to identify the most relevant cluster for the user query. It is not worthwhile
to present the content of all Web pages in the selected cluster word by word; instead,
we need to find the important information related to the query. In order to do so, we
identify the entities and attributes corresponding to the query and extract the attribute
values from the clustered Web pages. The selected information is represented in the form
of icons. In fulfilling this goal we face several challenges: which information to present,
how to identify and extract the answer, how to present the information, and so on. We
address the major issues of building a supportive model, information mining, and
icon-based information representation.
5.1 Introduction
The clustering methodology groups the Web pages returned as search results into clusters
containing similar types of information. As the search query is in general not very precise,
the search results returned by the search engine are huge in size and cover various domains.
A user has to guess which Web page may contain the desired information and then go
through the whole Web page content to find the desired piece of information. If the guess
goes wrong, the user has to try another Web page. This trial-and-error process continues
until the user finds the information. For our target users, we prefer to reduce this
prediction and searching overhead. We have noticed that users tend to seek some
additional information along with the desired, specific piece.
Therefore, we plan to present a precise answer and some additional information related
to the query. While representing precise information in iconic form, we face the problem
of word-sense ambiguity: the same word can convey different senses in different contexts.
Before representation, we need to disambiguate the word sense so that we can choose the
correct icon from the icon database.
To the best of our knowledge, no work has been reported so far on text-to-icon
representation. Some question-answering works have addressed finding the answer to a
particular question from a fixed database or free text. Annotation of a knowledge base
in order to obtain answers with the help of natural language processing was proposed by
Katz [58]. Later he proposed knowledge mining of Web information, integrating it with a
corpus-based knowledge annotation technique to achieve better performance in question
answering [59, 74]. Question answering by searching large corpora with linguistic methods
is proposed in [55]. A way of handling the unstructured as well as the structured Web is
suggested by Cucerzan and Agichtein [31]. Radev et al. used a probabilistic approach [98]
to answer queries. The approach of combining syntactic information with traditional
question answering can be found in the Quaero [109] system. Some other established
question-answering systems are Ionaut¹ (AT&T Research), MULDER (University of
Washington) [65], AskMSR (Microsoft Research) [8], InsightSoft-M (Moscow, Russia) [107],
MultiText (University of Waterloo) [30], Shapaqa (Tilburg University) [21], Aranea (MIT)
[74], TextMap (USC/ISI) [36], LAMP (National University of Singapore) [128], NSIR
(University of Michigan) [99], and AnswerBus (University of Michigan) [131]. The
objective of question answering is to find the exact answer. Our objective is slightly
different: to find specific as well as related important information in a precise form.
Therefore, we generate query-related templates that identify the important entities and
corresponding attributes related to the query. With the help of the entity-attributes of all
domain-related queries, we build an entity-attribute model. This model helps to find the
most relevant cluster for a specific query from the bag of clusters. Next, we model the
potential answers that can be present in the clustered Web pages as the values of
attributes; this model assists in finding the values of the attributes. Extracted words,
phrases or sentences are disambiguated with the help of the semantic similarity measure
of WordNet [87] and the XML icon base of Section 3.2. Finally, according to the template,
we display the finalised icons to the user.
The organization of the chapter is as follows. Section 5.1 has discussed the context and
motivation of the work. Section 5.2 describes the proposed methodology for representing
Web page information. The experimental results are discussed in Section 5.3, and
Section 5.4 concludes the chapter.
¹Ionaut, www.ionaut.com/
5.2 Proposed Methodology
The objective of this section is to find query-related information in the clustered results.
To achieve this, we first have to identify the cluster most likely to contain the query-related
information. To mine answers from the selected cluster and to represent them in a
user-understandable iconic form, we follow the modules stated below.
• Building the supportive model: In Section 3.2.1, we identified our domain of interest
and the corresponding queries. With the help of this domain knowledge, we build an
entity-attribute model and a potential answer model, which work as supportive models
for information extraction.
• Information mining: In this phase, we select the most appropriate cluster for
information mining. As we know the query-related entities and attributes, we identify
and extract the values of those target attributes with the help of the entity-attribute
model and the answer patterns.
• Icon-based information representation: In this phase, we use the icon database to
display iconic information. The finalised attribute values are disambiguated and
presented in the form of icons using the output template.
Figure 5.1 gives an overview of the proposed methodology. A detailed description of each
module is given in the following subsections.
5.2.1 Building Supportive Model
The queries fired at the search engine can be general or specific in nature. We have
observed that the chance of the desired result being present in the Web repository is low
if the query is too specific (e.g. the query "date of Bihu dance festival in Assam in 2012").
On the other hand, if the query is too general, then the chance of retrieving a relevant
result is also low: for example, the chance of obtaining Web pages relevant to "Assam's
jeep safari" is low if the search query is simply "Assam's transport". So, a moderately
specific query is best suited to retrieve the desired information. A statistical analysis¹
also shows that users fire general queries more often than very specific ones. For instance,
the queries "culture Assam", "dance Assam", "Bihu dance Assam" and "Bihu dance
Assam in 2012" are fired 2400, 2900, 720 and 0 times, respectively, in a month globally
in the Google
¹Google AdWords, https://adwords.google.com/o/KeywordTool
[Flow diagram: a domain-related user query is fed to the search engine; the retrieved
results are clustered and the relevant cluster is selected; the supportive model
(query-corresponding templates, entity-attribute model, potential answer modelling)
drives attribute-value extraction in the information-mining stage; word-sense
disambiguation, the icon database and answer-template filling produce the icon-based
information representation]
Figure 5.1: Overview of the proposed approach
search engine. This reveals that users are interested in obtaining some general and related
information along with their specific query need. Sometimes a number of queries are
implied by a single query: the query "culture of Assam" implies the language, festivals,
songs, dances, crafts, religions, dress, food, etc. of Assam. We can consider "culture" here
as an entity, and language, festival, song, dance, craft, religion, dress, food, etc. as its
attributes. Note that every "culture" has some "history"; so the entity "culture" is related
to the entity "history". To extract information from a Web page we need to know the
main entity we want to know about and the attributes corresponding to that entity. We
also check the related entities to extract related information if it is provided in the Web
page. Therefore, we need a general entity-attribute model which represents the related
entities and their corresponding attributes.
5.2.1.1 Developing Query Corresponding Templates
We have already finalised 50 queries related to the tourism domain. To represent those
queries, we create 23 templates. A query template represents the main object of the
query, the main characteristics of that object, and related characteristics. Each template
can represent one or more queries. A "*" denotes a mandatory attribute in a template,
whereas "#" denotes an optional attribute. An example of a culture-related template, which
represents three different types of culture queries (the culture or a specific festival of a
place at a particular time, the main festival of a place, and the season of a festival of
a place) is shown in Figure 5.2.
5.2.1.2 Developing Entity-Attribute Model
Once we prepare the templates of the tourism related queries, it is easy to identify
the main entities and related attributes of tourism domain. Along with these entity-
attributes other related entity-attributes are considered to build up the model. We also
take the help of an established tourism ontology of owl. Our prototype model contains
42 entities and their attributes. Figure 5.3 shows the model where ellipse represents
an entity. Related entities are linked by line. While building the model we generate
a database which includes the attribute synonyms, semantically similar words and the
words which are closely related with the attributes.
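The entity-attribute model can be sketched as a small dictionary structure. This is a minimal illustration, assuming the entities, relations and synonyms shown here, which are only the examples mentioned in the text, not the full 42-entity model:

```python
# A minimal sketch of the entity-attribute model. Entity names,
# relations and synonym lists are illustrative examples from the text.

entity_attributes = {
    "culture": ["language", "festival", "song", "dance",
                "craft", "religion", "dress", "food"],
    "transport": ["reservation", "departure", "arrival"],
    "hotel": ["reservation", "check-in", "check-out"],
}

# Related entities are linked, mirroring the lines in Figure 5.3.
related_entities = {
    "culture": ["history"],
    "transport": ["hotel"],
}

# Synonyms and closely related words per attribute, as in the database.
attribute_synonyms = {
    "language": ["language", "speak", "talk", "communicate"],
}

def attributes_for(entity):
    """Return an entity's attributes plus those of related entities."""
    attrs = list(entity_attributes.get(entity, []))
    for rel in related_entities.get(entity, []):
        attrs += entity_attributes.get(rel, [])
    return attrs

print(attributes_for("culture"))
```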
5.2.1.3 Potential Answer Modeling
In this phase we model the probable answers related to the queries, which may be present
in the retrieved documents in different forms. As we have already identified our query-related
entities and attributes, we can identify the values of those attributes depending
CULTURE
*Place: State/province, District, City/town/village
Specification: Dance, Song, Language, Festival, Craft, Religion, Dress, Food
Time: Year, Month, Season
Queries:
Culture/*specification of *place in #time
Main *specification in *place
Festival season in *place
Figure 5.2: Query template: Culture
[Figure 5.3 depicts the 42 linked entities: state, town, river, desert, hill, forest, waterfall, fort, palace, sacred_place, sanctuary, tower, park, dam, lake, valley, culture, transport, hospital, college, weather, school, market, zoo, museum, theatre, stadium, clothing, animal, village, bridge, glacier, adventure, king, tour, sea, island, beach, package, hotel, reservation, sports.]
Figure 5.3: Entity-attribute model
on some predefined patterns. While building the entity-attribute database, we have
included the words which are closely related to the attributes. For each attribute we
define some patterns. For example, the related words of the attribute "language" are language,
speak, talk, communicate etc. So, the probable answer patterns are
P1: #s11 language s′12
P2: #s21 speak/talk/communicate s′22
P3: s′32 language #s31
P4: s′42 speak/talk/communicate #s41
Here sij and s′ij represent strings: sij is the subject string and s′ij is the target string.
# implies that the string following it may or may not occur in the phrase
or sentence. If the string (si1) occurs, it generally contains the name of the place
or people for which we are searching the language. In this way we generate the rules
for each attribute corresponding to a query. Sometimes two different entities may have a
common attribute. For example, the two entities "transport" and "hotel" have the common
attribute "reservation". The related words for "reservation" are "from", "to", "departure",
"arrival", "check-in", "check-out" etc. But "departure" and "arrival" appear in the context
of "transport", while "check-in" and "check-out" occur in the context of "hotel". So, at the
time of searching for the attribute value, we need to consider not only the attribute but also
the context, that is, the corresponding entity.
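The answer patterns P1-P4 can be sketched as regular expressions over the attribute's related words. This is a simplified sketch, not the thesis's actual implementation; the group names "subject" and "target" are illustrative:

```python
import re

# A sketch of the answer patterns P1-P4 for the attribute "language",
# expressed over its related words (language, speak, talk, communicate).

KEYWORDS = r"(?:language|speak|talk|communicate)"

# P1/P2: optional subject string, then a keyword, then the target string.
PREFIX = re.compile(r"(?:(?P<subject>\S+)\s+)?" + KEYWORDS + r"\s+(?P<target>\S+)")
# P3/P4: target string, then a keyword, then an optional subject string.
SUFFIX = re.compile(r"(?P<target>\S+)\s+" + KEYWORDS + r"(?:\s+(?P<subject>\S+))?")

m = SUFFIX.search("which is also the state language of Assam")
print(m.group("target"))  # -> state
```

In practice the context entity (e.g. "transport" vs. "hotel") would also be checked before accepting a match, as the paragraph above explains.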
5.2.2 Information Mining
In this subsection, we discuss the way to find query-related attribute values from the
most preferable cluster. In order to find the most preferable cluster, we use the query
templates and the entity-attribute model generated in the previous section. From the query
corresponding template, we can easily identify the main entity and attributes of the query.
The entity-attribute model gives information about related attributes. Next, we find
the similarity of each cluster with the query corresponding entity and attributes. The semantic
similarity measure of WordNet [87] is used for similarity measurement. The cluster with the
maximum similarity value is considered the most preferable cluster for containing the attribute
values.
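The cluster selection step can be sketched as follows. The thesis uses the WordNet semantic similarity measure [87]; to keep the example self-contained, a plain word-overlap (Jaccard) score stands in for it, and the cluster labels are illustrative:

```python
# A sketch of most-preferable-cluster selection. A Jaccard word-overlap
# score stands in for the WordNet similarity measure used in the thesis.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_cluster(clusters, query_terms):
    """Pick the cluster whose terms overlap most with the query's
    entity and attributes."""
    return max(clusters, key=lambda c: jaccard(clusters[c], query_terms))

clusters = {
    "festivals": ["culture", "festival", "dance", "assam"],
    "hotels": ["hotel", "reservation", "assam"],
}
query_terms = ["culture", "language", "festival", "dance"]
print(best_cluster(clusters, query_terms))  # -> festivals
```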
5.2.2.1 Attribute Value Extraction
Generally, information in a Web page is presented in paragraphs, and we notice that the
topic differs more or less from paragraph to paragraph. So, we consider a paragraph as the processing unit. With
the help of the predefined patterns (Ref. 5.2.1.3) we find the candidate attribute values
in the selected documents.
n-gram selection: We observed that the target string may not occur exactly at the
position of s′i,j due to the occurrence of some adjectives, adverbs, qualifying words etc.
But in most cases, it occurs within six words of the s′i,j string position. Therefore, we
find the matching patterns in a paragraph and extract all n-gram words at the s′i,j string
position, where n is 6. The n-gram words can be selected from the left or right direction. If the
target string precedes the matching pattern (in the case of a prefix pattern) then we consider
n-gram words from the left direction, and from the right direction vice versa. For example,
consider the sentence "The natives of the state of Assam are known as "Asomiya"
(Assamese), which is also the state language of Assam.". To extract the value of the
attribute "language" we use the pattern P3: s′32 language #s31. This is a suffix pattern
where the matching pattern (language #s31) follows the target string (s′32). So, we consider
n-gram words from the right direction. The selected n-grams are: state, the state, also the
state, is also the state, which is also the state, (Assamese), which is also the state. We
term these n-gram words candidate phrases. For an infix pattern we consider n-grams
from both directions. While selecting the n-grams, we never cross the sentence boundary.
Next, the stopwords are filtered out from the candidate phrases.
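The n-gram extraction for the suffix pattern above can be sketched as follows: the candidate chunks are the 1..6-word sequences ending just before the matched word "language", never crossing the sentence boundary. The helper name is illustrative:

```python
# A sketch of n-gram candidate extraction (n = 6) for a suffix pattern:
# chunks of 1..n words ending just before the anchor word.

def left_ngrams(sentence, anchor, n=6):
    """Return the 1..n-word chunks ending just before the anchor word."""
    words = sentence.rstrip(".").split()
    i = words.index(anchor)
    return [" ".join(words[i - k:i]) for k in range(1, min(n, i) + 1)]

sentence = ('The natives of the state of Assam are known as "Asomiya" '
            '(Assamese), which is also the state language of Assam.')
for chunk in left_ngrams(sentence, "language"):
    print(chunk)
```

Run on the worked example, this reproduces the six candidate phrases listed in the text, from "state" up to "(Assamese), which is also the state".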
Ranking candidate phrases: From the clustered documents, we may find more than
one candidate value for a single attribute. We need to rank these values in order to find
the top-ranked value. At first, we simply count the number of occurrences of each candidate
phrase.
• Merging candidates: As we consider n-gram word chunks as candidate phrases, a
shorter candidate occurs more often than a longer one. If a longer candidate
phrase contains a shorter candidate phrase as a substring and their occurrences
are the same, then we eliminate the shorter one. We also eliminate the shorter
candidate if the longer candidate containing it occurs more than a
threshold value (we use 5).
• Allowing tolerance: In some cases, values presented as numbers (e.g. distance)
may differ depending on the source of information. The difference may be very
small, but it causes a number of separate candidate phrases. Suppose two sentences from
two different Web pages are "The total rainfall is 230 mm" and "The total rainfall
is 233 mm". The corresponding pattern is P1: #s11 rainfall s′12. As it is a prefix
pattern, the candidate phrases are: is, is 230, is 230 mm, is 233, is 233 mm. In order
to avoid this condition we allow a tolerance on number-type attributes. The tolerance
varies from attribute to attribute.
• Unit conversion: It is possible that the value of an attribute is given in two
different units in two different Web pages, e.g. "The total rainfall is 230 mm" and
"The total rainfall is 23 cm". These two Web pages support the same answer but it
results in two different probable answers. We apply a unit conversion method to map
different types of units to the same one. Then we compare and allow tolerance to
reduce the number of candidate phrases.
Note that, while modifying the candidate phrases we never change the word sequence.
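The merging rule above can be sketched as follows; the phrase counts are illustrative (drawn from the worked example later in the section), and the helper name is hypothetical:

```python
from collections import Counter

# A sketch of candidate merging: a shorter candidate phrase is dropped
# when a longer candidate contains it as a substring and either occurs
# the same number of times or occurs more than the threshold (5).

THRESHOLD = 5

def merge_candidates(counts):
    kept = dict(counts)
    for short in counts:
        for long_ in counts:
            if short != long_ and short in long_:
                if counts[long_] == counts[short] or counts[long_] > THRESHOLD:
                    kept.pop(short, None)
    return kept

counts = Counter({"tradition craft silk": 4, "tradition craft": 4, "tradition": 4})
print(merge_candidates(counts))  # -> {'tradition craft silk': 4}
```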
Constraints: As we consider n-gram chunks of words as candidate phrases, they
may contain some extra text along with the exact attribute value. We apply some
constraints to filter out those extra terms.
• Datatype and attribute value: We define the datatype of each attribute to remove extra
text. For example, for the attribute "distance" the datatype is number. In some cases the
values of an attribute can only be one of a fixed list (e.g. the states of India). For
this type of attribute we can filter the candidate phrases using the valid values of
the list.
• Attribute characteristics: There may be multiple values present in the text for
a single attribute. So, the attribute characteristic (single-valued or multi-valued)
needs to be declared. For a multi-valued attribute, the values in general are
separated by `,'. In that case we consider not only a single n-gram chunk but all
text separated by `,', the n-gram chunk before the first `,' and the n-gram chunk after
the last `,' or `and'. If an attribute supports the multi-valued characteristic then we
consider more than one value as the attribute value if the occurrences of the finalized
values vary within 10%.
• Part-of-speech: Sometimes, for emphasis, descriptive words (e.g. adjectives,
adverbs) are attached to the exact value. These extra words can be eliminated
by knowing the part-of-speech of the target attribute value.
All the candidate phrases are modified according to these constraints. Finally, the
candidate phrases are ranked in decreasing order of their occurrence. The top-ranked
phrase(s) is (are) considered the value(s) of the attribute.
We illustrate the process of attribute value extraction with an example. Suppose, using
the answer patterns discussed in sub-section 5.2.1.3, we want to find the value of the
attribute "language" corresponding to the query "culture Assam". We consider the most
preferable cluster, containing eight Web pages. From the Web pages we identify the
following sentences containing answer patterns.
• The natives of the state of Assam are known as "Asomiya" (Assamese), which is
also the state language of Assam.
• Diverse tribes like Bodo, Kachari, Karbi, Miri, Mishimi, Rabha, etc co-exist in
Assam, most tribes have their own languages though Assamese is the principal
language of the state.
• Bengali-speaking Hindus and Muslims represent the largest minorities, followed by
Nepalis and populations from neighboring regions of India.
• However, in each of the elements of Assamese culture, i.e. language, traditional
crafts, performing arts, festivity and beliefs either local elements or the local ele-
ments in a Hinduised / Sanskritised forms are always present.
• The records of many aspects of the language, traditional crafts (silk, lac, gold,
bronze, etc), etc are available in different forms.
• The original Tai-Shans assimilated with the local culture, adopted the language
on one hand and on the other also influenced the main-stream culture with the
elements from their own.
The movement contributed greatly towards language, literature and performing and
fine arts.
• Brajavali a language specially created by introducing words from other Indian lan-
guages had failed as a language but left its traces on the Assamese language.
• The language was standardised by the American Missionaries with the form avail-
able in the Sibsagar (Xiwoxagor) District (the nerve centre of the Ahom politico-
economic system).
• Sanskritisation was increasingly adopted for developing Assamese language and
grammar.
Using the patterns, we select n-grams from the above-mentioned sentences, as shown in Ta-
ble 5.1 and Table 5.2. Stopword removal reduces the number of candidate phrases.
Table 5.1: n-gram from prefix pattern
No. | Matching phrase | n-gram phrases | Filtered phrases
1 | of Assam | of, of Assam | Assam
2 | though Assamese is the principal language | though, though Assamese, though Assamese is, though Assamese is the, though Assamese is the principal, though Assamese is the principal language | Assamese, Assamese principal, Assamese principal language
3 | of the state | of, of the, of the state | state
4 | Hindu and Muslim represent the large | Hindu, Hindu and, Hindu and Muslim, Hindu and Muslim represent, Hindu and Muslim represent the, Hindu and Muslim represent the large | Hindu, Hindu Muslim, Hindu Muslim represent, Hindu Muslim represent large
5 | tradition craft perform art festive and | tradition, tradition craft, tradition craft perform, tradition craft perform art, tradition craft perform art festive, tradition craft perform art festive and | tradition, tradition craft, tradition craft perform, tradition craft perform art, tradition craft perform art festive
6 | tradition craft silk lac gold bronze | tradition, tradition craft, tradition craft silk, tradition craft silk lac, tradition craft silk lac gold, tradition craft silk lac gold bronze | tradition, tradition craft, tradition craft silk, tradition craft silk lac, tradition craft silk lac gold, tradition craft silk lac gold bronze
7 | on one hand and on the | on, on one, on one hand, on one hand and, on one hand and on, on one hand and on the | hand
8 | literature and perform and fine art | literature, literature and, literature and perform, literature and perform and, literature and perform and fine, literature and perform and fine art | literature, literature perform, literature perform fine, literature perform fine art
9 | special create by introduce word from | special, special create, special create by, special create by introduce, special create by introduce word, special create by introduce word from | special, special create, special create introduce, special create introduce word
10 | have fail as a language but | have, have fail, have fail as, have fail as a, have fail as a language, have fail as a language but | fail, fail language
11 | but leave its trace on the | but, but leave, but leave its, but leave its trace, but leave its trace on, but leave its trace on the | leave, leave trace
12 | be standardize by the American Missionary | be, be standardize, be standardize by, be standardize by the, be standardize by the American, be standardize by the American Missionary | standardize, standardize American, standardize American Missionary
13 | and grammar | and, and grammar | grammar
Table 5.2: n-gram from suffix pattern
No. | Matching phrase | n-gram phrases | Filtered phrases
1 | Assamese which is also the state | Assamese which is also the state, which is also the state, is also the state, also the state, the state, state | Assamese state, state
2 | Assam most tribe have their own | though, Assam most tribe have their own, most tribe have their own, tribe have their own, have their own, their own, own | Assam tribe
3 | language though Assamese is the principal | language though Assamese is the principal, though Assamese is the principal, Assamese is the principal, is the principal, the principal, principal | language Assamese principal, Assamese principal, principal
4 | Bengali | Bengali | Bengali
5 | the element of Assamese culture i.e. | the element of Assamese culture i.e., element of Assamese culture i.e., of Assamese culture i.e., Assamese culture i.e., culture i.e., i.e. | element Assamese culture, Assamese culture, culture
6 | record of many aspect of the | record of many aspect of the, of many aspect of the, many aspect of the, aspect of the, of the, the | record aspect, aspect
7 | with the local culture adopt the | with the local culture adopt the, the local culture adopt the, local culture adopt the, culture adopt the, adopt the, the | local culture adopt, culture adopt, adopt
8 | The movement contribute great toward | The movement contribute great toward, movement contribute great toward, contribute great toward, great toward, toward | movement contribute great, contribute great, great
9 | Brajavali a | Brajavali a, a | Brajavali
10 | by introduce word from other Indian | by introduce word from other Indian, introduce word from other Indian, word from other Indian, from other Indian, other Indian, Indian | introduce word Indian, word Indian, Indian
11 | Indian language have fail as a | Indian language have fail as a, language have fail as a, have fail as a, fail as a, as a, a | Indian language, language
12 | left its trace on the Assamese | left its trace on the Assamese, its trace on the Assamese, trace on the Assamese, on the Assamese, the Assamese, Assamese | left trace Assamese, trace Assamese, Assamese
13 | The | The | —
14 | be increase adopt for develop Assamese | be increase adopt for develop Assamese, increase adopt for develop Assamese, adopt for develop Assamese, for develop Assamese, develop Assamese, Assamese | increase adopt develop Assamese, adopt develop Assamese, develop Assamese, Assamese
Table 5.3: Ranked candidate phrases
Phrase | Assamese | tradition craft | principal | culture | adopt | language | Assamese principal | Hindu | tradition craft silk
Count | 15 | 9 | 5 | 5 | 5 | 5 | 4 | 4 | 4
Next, we count the occurrences of all candidate phrases and merge short candidates into
longer ones if their occurrences are the same. We also eliminate a shorter candidate if the
longer candidate containing it occurs more than the threshold value of 5. In
this example there is no scope for allowing tolerance or unit conversion. Table 5.3 shows
the top candidates after ranking. Next, we check the constraints. The datatype of the
language attribute is a simple string, but it must be one of the Indian languages. There
may be more than one language, and the part-of-speech of the target answer is noun. Among all the
candidates only one satisfies all the constraints: the top-ranked term "Assamese".
So, we finalize "Assamese" as the value of the attribute language.
5.2.3 Icon-Based Knowledge Representation
In this phase we represent the finalized value of the attribute in terms of icons. We select
icons from the icon database according to the attribute value. This phase consists of the
following two sub-phases.
5.2.3.1 Word Sense Disambiguation
Some words in the English vocabulary convey different meanings in different contexts though
their spelling and pronunciation are the same. These types of words are known as homonyms
[63] (e.g. river bank - river bed, reserve bank - a financial institution). On the other hand,
a polyseme [63] is a word or phrase with different, but related, senses (e.g. bank - financial
institution, bank on - rely upon). We distinguish these types of words by their context.
Therefore, to represent the finalized values along with the attributes we have to consider
the context, and accordingly we select icons from the icon database. To disambiguate a
word we use the icon vocabulary of Section 3.2.2. Our vocabulary contains keywords,
synonyms of each keyword, semantically similar words of the keyword, different senses of the
keyword and corresponding images. First, each value phrase we want to represent is
tokenized into a list of words. Each word is checked against the word vocabulary. If the word
is present in the vocabulary as a keyword then we check the different senses of the word.
We calculate the similarity score [3] of each sense with the word along with its context.
The sense that scores highest is considered the correct sense of the word, and the corresponding
image is selected for representation. If the word is not present in the vocabulary as a
keyword, we next check the word as a synonym and as a semantically similar word, one after
another. If it is found, the corresponding keyword is tracked and the senses are examined in
a similar way. If the word is absent from our vocabulary, then a dummy icon related to the
word is generated for display.
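The lookup described above can be sketched as follows. The vocabulary entries, icon file names and the overlap-based score are illustrative stand-ins for the icon vocabulary of Section 3.2.2 and the similarity measure [3]:

```python
# A sketch of the word sense disambiguation lookup: each keyword maps
# to senses, each with context words and an icon; the sense whose
# context best matches the word's surroundings wins.

vocabulary = {
    "bank": {
        "financial": {"context": {"money", "reserve", "account"},
                      "icon": "bank_building.png"},
        "river": {"context": {"river", "water", "bed"},
                  "icon": "river_bank.png"},
    },
}

def pick_icon(word, context_words):
    """Choose the icon of the sense best matching the word's context."""
    senses = vocabulary.get(word)
    if senses is None:
        return "dummy_icon.png"  # word absent from the vocabulary
    best = max(senses,
               key=lambda s: len(senses[s]["context"] & set(context_words)))
    return senses[best]["icon"]

print(pick_icon("bank", ["river", "bed"]))  # -> river_bank.png
print(pick_icon("lighthouse", []))          # -> dummy_icon.png
```

A fuller version would first fall through synonym and semantically-similar-word lists before giving up, as the paragraph above describes.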
5.2.3.2 Answer Template Filling
Once the information-related icons are decided, the answer template is filled using those
icons to display to the user. Generally, a paragraph led by a heading represents infor-
mation about the heading. Therefore, if there is any heading before a paragraph, we first
present the heading and then the corresponding attribute and its values. For the query
"culture assam" the following attribute values are selected: dance - Bihu, Satriya, Barpeta,
Jhumur; language - Assamese; festival - Bihu; craft - weaving, cane-bamboo craft, paint-
ing, jewellery making, wood craft; religion - Hindu, Muslim, Buddhist, Christian; cloth
- cotton, silk. The values of the attributes music and food are not found
in the clustered Web pages. The icon-based representation of all the information
is shown in Figure 5.4.
Figure 5.4: Visualisation of information related Assam culture
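The template filling step can be sketched as follows, using the attribute values listed above for "culture assam". Each value stands in for the icon chosen for it, and the bracketed layout is illustrative:

```python
# A sketch of answer template filling: one line per attribute, with
# each value standing in for its icon in the interface.

answers = {
    "dance": ["Bihu", "Satriya", "Barpeta", "Jhumur"],
    "language": ["Assamese"],
    "festival": ["Bihu"],
    "craft": ["weaving", "cane-bamboo craft", "painting",
              "jewellery making", "wood craft"],
    "religion": ["Hindu", "Muslim", "Buddhist", "Christian"],
    "cloth": ["cotton", "silk"],
}

def fill_template(answers):
    """Lay out one template line per attribute."""
    lines = [f"[{attr}] " + " ".join(f"<{v}>" for v in values)
             for attr, values in answers.items()]
    return "\n".join(lines)

print(fill_template(answers))
```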
5.3 Experimental Results
We evaluated our proposed approach to information representation on the basis of infor-
mation understanding by the target users. In this section, we present a detailed description
of our experiments and the results observed. For testing we use the same experimental
setup and users mentioned in Section 3.3.1.
We selected ten tourism-related queries from the fifty benchmarked queries for testing.
Users were asked to generate these queries with the help of the developed icon-based interface.
The retrieved search results are clustered in the intermediate clustering stage and a single
cluster is selected as the most probable cluster containing the information. After mining that
cluster we obtain the query-related information and represent it in terms of icons. The users
are asked to recognize those icons as well as to understand the iconic message. Display of
a wrong icon by the system indicates a failure in word sense disambiguation. We grade the
users on identifying the icons and interpreting the iconic message, and the system on word
sense disambiguation. Table 5.4 presents the results.
The icon recognition percentages for the above-mentioned ten Web pages (86.14%, 87.50%,
Table 5.4: Test result of visual representation
Web page | Recognised concept | Number of icons | Wrong icons | Correctly recognized icons | Message interpreted
Markets of Kolkata | market | 108 | 7 | 87 | B
Culture of Assam | culture | 165 | 13 | 133 | G
Delhi Guide | city | 126 | 17 | 79 | Bd
Hotels of Hyderabad | hotel | 52 | 6 | 41 | Av
Wildlife of India | wildlife | 143 | 9 | 112 | G
Royal Rajasthan | state | 62 | 6 | 45 | Av
Transport of Bangalore | transport | 45 | 4 | 34 | G
Tour package of Kashmir | package | 73 | 7 | 61 | B
Sea beach of south India | beach | 39 | 4 | 27 | G
Fort in Delhi | fort | 52 | 6 | 40 | B
Grade for icon identification: above 90%: 5; (80-90)%: 4; (70-80)%: 3; (60-70)%: 2; below 60%: 1.
Grade for message interpretation: Excellent (Ex): 5; Better (B): 4; Good (G): 3; Average (Av): 2; Bad (Bd): 1.
Grade for word sense disambiguation: above 95%: 5; (90-95)%: 4; (85-90)%: 3; (80-85)%: 2; below 80%: 1.
72.48%, 89.13%, 84.21%, 80.36%, 82.93%, 92.42%, 77.14%, 86.96%) are calculated as
(No. of correctly recognized icons) / (Total no. of icons - No. of wrong icons). We
can also calculate the failure rate of word sense disambiguation by the formula (No. of
wrong icons) / (Total no. of icons). The failure percentages are 6.48%, 7.88%, 13.49%,
11.54%, 6.29%, 9.68%, 8.89%, 9.59%, 10.26%, 11.54% respectively. The overall efficiency of
the system can be calculated with respect to three criteria: icon recognition, word sense
disambiguation and message interpretation.
The mean score for icon recognition = (4 + 4 + 3 + 4 + 4 + 4 + 4 + 5 + 3 + 4)/10 = 3.9
The mean score for word sense disambiguation = (4+4+3+3+4+4+4+4+3+3)/10 = 3.6
The mean score for message interpretation = (4+3+1+2+3+2+3+4+3+4)/10 = 2.9
So, overall performance = (3.9 + 3.6 + 2.9)/3 = 3.47
The calculation shows that our proposed system is (3.47/5) ∗ 100 = 69.33% efficient.
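The grading arithmetic above can be checked in a few lines; the two rows shown are taken from Table 5.4:

```python
# Reproducing the evaluation arithmetic: icon recognition rate is
# correct / (total - wrong); WSD failure rate is wrong / total.
# Figures below are the first two rows of Table 5.4.

rows = {  # concept: (total icons, wrong icons, correctly recognized)
    "market": (108, 7, 87),
    "culture": (165, 13, 133),
}

for concept, (total, wrong, correct) in rows.items():
    recognition = 100 * correct / (total - wrong)
    failure = 100 * wrong / total
    print(f"{concept}: recognition {recognition:.2f}%, WSD failure {failure:.2f}%")

overall = (3.9 + 3.6 + 2.9) / 3  # mean of the three criteria scores
print(f"overall {overall:.2f}/5 = {overall / 5 * 100:.2f}%")
```

This reproduces the 86.14% and 87.50% recognition rates, the 6.48% and 7.88% failure rates, and the overall 3.47/5 = 69.33% reported above.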
5.4 Conclusion
In this chapter, we presented a simple way to represent basic Web information in terms
of icons, understandable to the underprivileged user. The proposed approach identifies the
appropriate Web page cluster for information mining. From that cluster we find the
user's query-related basic information and produce a simple iconic sequence. All the iconic
sequences are presented in a template. This iconic message will help our target users to
obtain and understand some basic information related to the query. The approach can
also be utilized for other purposes like interacting with uneducated people, cross-language
communication etc. We have developed a prototype version of our proposed approach
which can be enhanced further to represent any type of knowledge independent of any
domain. Some extensions, like avoidance of redundant information, improvement of iconic
message expression and automation of pattern generation, can give the work a
complete shape. We consider these extensions as our future work.
Chapter 6
Conclusion and Future Work
Of late, a huge information repository has been built up and maintained on the Web. People
share and access this information through the Internet. But access to this repository
is limited to a certain group of privileged people who have good reading,
writing and understanding capability in the English language. The rest of the people cannot
avail themselves of the benefits of the Internet. As a solution to this problem we have proposed an icon-
based interface to retrieve information from the Internet. Using the interface, our target users
can generate their desired query by means of icon selection. The English query equivalent to
the generated iconic query is fed to a search engine. As a query, a word or a group of words
can imply multiple meanings in different contexts. A Web search engine, however, cannot
distinguish the context and hence retrieves a huge amount of information from different domains in
response. These search results are incomprehensible for our target users, as they can neither
understand nor find their desired information in the returned results. Therefore, we
have clustered the search results based on Web page content similarity and found the cluster
most relevant to the query. Next, we have found the query-related important information
in the clustered Web pages. Finally, the selected information is displayed to the users
in the form of icons.
The work solves a real-life problem and overcomes the language barrier to Internet
access. Regarding the development of the interface, we have provided a general approach for
developing any icon-based interface. We have addressed some basic issues, like building the
icon vocabulary, icon management and icon arrangement in the interface, which are not ad-
dressed in any prior work. Though we limited our implementation to the tourism
domain, extension of the work to any other domain is possible. To cluster Web search
results, we have proposed a new clustering algorithm. The algorithm takes care of clus-
ter quality as well as the time constraint. To substantiate its efficacy we compared our
algorithm with established clustering algorithms. Finally, we have mined the clustered Web
pages to find query-related information. Developing query corresponding templates,
developing the entity-attribute model, potential answer modeling, attribute value extrac-
tion and icon-based knowledge representation are addressed in this regard. Existing question
answering approaches target a particular answer, whereas we target query-related
important and precise information. Representation of information in iconic form has not been
addressed elsewhere.
6.1 Introduction
This vast Web repository is used to share and access information according to user need.
However, the benefits of the Internet are confined to a certain group of people. Statistics
show that only 34.3%1 of the world population uses the Internet. One of the main reasons for
this poor participation is language illiteracy. A major portion of the Web repository (55.4%)
is written in English2, whereas the world's English literacy rate is quite poor. Therefore, it
is clear that a vast information repository is freely available but is not consumable for
underprivileged people. To solve this issue, a common interaction medium is needed which
is understandable by any user irrespective of their cultural and language background. In
this direction, we have developed an icon-based interface to retrieve and represent Web
information in a user-understandable form.
To the best of our knowledge, the problem we address is the first of its kind. Our goal is to
make Web information accessible to illiterate or semi-illiterate people. In order to achieve
our target we have identified three main challenges: giving input to the search engine,
finding query-related important information and representing it in a user-understandable form.
Previously, icons have been used as the medium of interaction for different objectives, like man-
machine interaction, interacting with quadriplegic people, interacting with semi-illiterate
people, various applications (hotel booking, chatting, alerting in crisis situations etc.)
and children's learning. Most of the works used an icon-based interface for user interaction
but did not address basic issues such as what the icon vocabulary should be, how
to manage the icon vocabulary and how to organize icons in the interface. In our work
we have addressed these issues. We did not emphasize natural language
generation from the iconic sequence because, firstly, a query we generally fire at a search engine is
not a well-defined sentence and, secondly, the search engine handles query formation on
its own.
1World Internet Users Statistics Usage and World Population Stats, www.internetworldstats.com/stats.htm
2W3Techs, www.w3techs.com/technologies/overview/content_language/all
We have faced several issues regarding Web information representation. As a search
engine generally returns a large number of Web pages in response to a query, a major
decision needs to be taken regarding which information we should present, how to find it
and how to present it. To fulfill our requirement we have used clustering and informa-
tion mining mechanisms. Though several clustering mechanisms (Hierarchical, k-Means,
Lingo, STC etc.) are available, we propose a new clustering mechanism, HK Clustering,
to cluster Web search results. Most of the existing Web clustering mechanisms concen-
trate either on cluster quality or on response time. In the HK Clustering mechanism, we
balance the time constraint and the cluster quality constraint. For mining information from
the clustered Web pages, we have developed an entity-attribute model. With the help of this
model we have found query-related precise information. We have followed the ques-
tion answering mechanism for mining information. In our approach, we have handled a
normal search query and mined all important as well as related information, whereas the
question answering mechanism concentrates only on a specific query and a specific answer.
Finally, to represent the mined information in terms of icons, we disambiguated each word
of the information phrase and found the proper icon in the icon base.
In this chapter, we first summarized our work. Next, in Section 6.1 we discuss
the importance of our work. Section 6.2 describes the contribution of our work.
Finally, Section 6.3 concludes the work and shows future directions.
6.2 Contribution of Our Work
In the complete work, we have made several contributions. Each of the contributions is
discussed next.
1. Deciding the medium of interaction: As our target users are illiterate and semi-
illiterate people, we cannot use normal text as the interaction medium between
the user and the Internet. We analyzed several alternatives, like speech, gesture and icons,
and selected icons for various reasons: language independence, ease of adoption,
a faster and more expressive medium, and offering recognition rather than recall.
2. Deciding the implementation domain: Solving the problem for a general domain is a quite
voluminous task. Therefore, we implemented our proposal in the tourism domain. How-
ever, a similar application can be built for any other domain following the proposed
methodology.
3. Developing the icon vocabulary: We described a way of deciding the icon vocabulary for
any domain. The proposal gives an idea of how to choose icons in any domain so that they
cover the entire domain without redundancy. It also gives an idea of icon optimization.
4. Management of icons: We also described a way of efficient icon management:
how to use icons to represent similar concepts, and how to maintain an index that
supports easy icon storage and retrieval.
5. Arrangement of icons: All the icons cannot be kept in the primary interface.
Therefore, this work addresses how to maintain the icon hierarchy and how to
organize icons in the interface.
6. Proposing a new clustering methodology to cluster search results considering com-
plexity and cluster quality: To group informative Web pages from the search results,
we performed clustering. As our main target is to mine precise information, we pri-
oritize cluster quality over response time. Therefore, we propose a new clustering
mechanism that obtains good clusters within an affordable response time.
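As a rough illustration of quality-first grouping of search results (not the clustering mechanism proposed in the thesis; the Jaccard measure and the threshold value are assumptions), snippets can be merged greedily only when they are sufficiently similar, so a stricter threshold trades speed for purer clusters:

```python
# Illustrative greedy snippet grouping; thresholds and names are assumed.

def jaccard(a, b):
    # Word-overlap similarity between two snippets, treated as token sets.
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_snippets(snippets, threshold=0.3):
    # Assign each snippet to the first cluster whose representative
    # (first member) is similar enough; otherwise open a new cluster.
    clusters = []
    for s in snippets:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

For example, `cluster_snippets(["taj mahal agra timings", "taj mahal agra entry fee", "goa beach resorts"])` yields two clusters, separating the Agra results from the Goa one.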
7. Developing an entity-attribute model: To identify the important attributes related
to a query, we developed query-specific templates, and from these templates we
derived an entity-attribute model. This entity-attribute model helps to identify
the cluster appropriate to a query for mining information.
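In spirit, an entity-attribute model associates each entity type with the attributes worth mining for it; the entity types and attribute names below are illustrative tourism-domain examples, not the thesis's actual templates:

```python
# Hypothetical entity-attribute model for the tourism domain.
ENTITY_ATTRIBUTES = {
    "hotel": ["location", "tariff", "contact", "facilities"],
    "place": ["location", "attractions", "best_season", "how_to_reach"],
    "train": ["source", "destination", "departure", "fare"],
}

def attributes_for_query(query_words):
    # Identify which entity the query refers to and return the attributes
    # worth mining for it; unrecognized queries yield no attributes.
    for word in query_words:
        if word in ENTITY_ATTRIBUTES:
            return word, ENTITY_ATTRIBUTES[word]
    return None, []
```

A query such as "hotel darjeeling" would thus resolve to the hotel entity, directing the miner toward tariff, location, and contact information.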
8. Preparing an answer model and mining Web information: With the help of the
entity-attribute model, we developed an answer model that describes the different
possibilities and forms in which information is present in a Web page. It helps to
extract query-related answers, which are processed further by ranking and by
applying constraints to obtain the query-related information.
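The extract-then-rank shape of this step can be sketched as follows; the thesis's answer model is richer, and the sentence splitting and scoring below are simplifying assumptions:

```python
import re

# Illustrative answer extraction: keep sentences mentioning the query's
# attributes and rank them by how many attributes they cover.

def extract_answers(page_text, attributes):
    sentences = re.split(r'(?<=[.!?])\s+', page_text)
    scored = []
    for s in sentences:
        hits = sum(1 for a in attributes if a.lower() in s.lower())
        if hits:
            scored.append((hits, s))
    # Higher attribute coverage ranks first (stable for ties).
    scored.sort(key=lambda t: -t[0])
    return [s for _, s in scored]
```

Sentences mentioning none of the modeled attributes are dropped, which is one simple way of applying constraints before ranking.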
9. Icon-based information representation: As English exhibits homonymy (unrelated
meanings sharing the same word form) and polysemy (different but related senses),
understanding the word sense is necessary to select icons from the icon base.
Therefore, we disambiguated the finalized information word by word and retrieved
the proper icon that can convey the information.
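The disambiguation step rests on gloss overlap, in the manner of the simplified Lesk algorithm; the thesis works with WordNet senses, whereas the two-sense inventory for "bank" below is a hand-made stand-in so the sketch stays self-contained:

```python
# Toy word-sense disambiguation by gloss overlap (simplified Lesk).
# The sense inventory is invented; a real system would query WordNet.

SENSES = {
    "bank": {
        "bank_river": "sloping land beside a river or lake",
        "bank_money": "financial institution that accepts deposits money",
    },
}

def disambiguate(word, context):
    # Choose the sense whose gloss shares the most words with the context;
    # the winning sense id can then be looked up in the icon base.
    ctx = set(context.lower().split())
    best, best_overlap = None, -1
    for sense_id, gloss in SENSES.get(word, {}).items():
        overlap = len(ctx & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense_id, overlap
    return best
```

Given the context "walk along the river near the bank", the river sense wins, so the river-bank icon, not the money-bank icon, would be selected.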
6.3 Future Work
We limited our implementation to tourism-related information retrieval because of the vast-
ness of the work and the absence of a proper icon base. Designing appropriate icons for our
target users is beyond our scope. We targeted the tourism domain because travel-related
icons are used in public places and are somewhat more familiar to general users than icons
of other domains. We used various tourism-related icons available on the Internet, which
may not be the best for our users; the use of proper icons may increase the efficacy of
the system. Another limitation of our work concerns the representation of Web infor-
mation. We narrow down the amount of information as per the requirements of our
target users, assuming that the users are not interested in all the information present
in a Web page and that a large number of iconic messages would confuse them. We
manually decided the important attributes related to a query, which may not cover all
the important information in a Web page that a user may need. In future, the selection
of attributes from a Web page can be automated using machine learning. Extensive
testing is required to decide the exact amount of information preferred by the target
users, and proper feedback can improve system performance. Apart from this, we can
also provide a prediction mechanism to help the user in query generation. This work
can be extended in various directions: for example, the system can be reconfigured for
motor-impaired users or for multilingual communication.
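The envisaged prediction mechanism could, for instance, learn which icon tends to follow which in past icon queries; the bigram predictor below is a future-work sketch only, and the training sequences and icon names are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical next-icon predictor trained on past icon query sequences.

class IconPredictor:
    def __init__(self):
        self._next = defaultdict(Counter)

    def train(self, queries):
        # Count which icon follows which in past icon sequences.
        for q in queries:
            for cur, nxt in zip(q, q[1:]):
                self._next[cur][nxt] += 1

    def predict(self, icon):
        # Most frequent successor of the given icon, or None if unseen.
        counts = self._next.get(icon)
        return counts.most_common(1)[0][0] if counts else None
```

After training on sequences such as hotel→tariff, the interface could pre-highlight the tariff icon whenever the user selects the hotel icon, reducing the effort of query composition.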
Publications out of this work
Published
• S. Maiti, D. Samanta, S. R. Das, and M. Sharma. Language Independent Icon-
Based Interface for Accessing Internet. In First International Conference on Advances
in Computing and Communications (ACC 2011), pp. 172-182, July 2011,
Kochi, India (Springer)
• S. Maiti, S. Dey, and D. Samanta. Development of Iconic Interface to Retrieve In-
formation from Internet. In IEEE Students' Technology Symposium (TechSym 2010),
pp. 268-275, April 2010, Kharagpur, India (IEEE)
Accepted
• S. Maiti and D. Samanta. Icon-Based Representation of Web Information. In 4th
International Conference on Intelligent Human Computer Interaction (IHCI 2012),
December 2012, Kharagpur, India (IEEE)
Communicated
• S. Maiti and D. Samanta. Clustering Web Search Results to Identify Information
Domain. Foundations and Trends in Information Retrieval (FTIR)