IEEE POTENTIALS 0278-6648/10/$26.00 © 2010 IEEE
The majority of current search
engines generate a huge list in
reply to a user query. This result
is normally ranked by criteria such as
PageRank or relevance to the query.
However, such a list is inconvenient
for users, since it expects them to
examine each page sequentially, in an
exhaustive manner, to find the relevant
information. As a result, most users
inspect only the first few Web pages on
the list, so much other relevant information
can be overlooked. The clustering method
is one solution to this problem.
Instead of a sequential list, it
groups the search results into clusters and
labels these with representative words for
each cluster. These labeled clusters of
search results are presented to users. The
clustering method thus reduces the
amount of information the end users
must scan.
The clustering approach is even
more important for the largest databases,
the gigantic data silos of companies such as
Google, Yahoo, and Microsoft. The need
for better presentation of search results
retrieved from millions, and then billions,
of highly unstructured and untagged Web
pages was obvious. Clustering became a
popular software tool to enhance relevance
ranking by grouping items in the typically
very large result list. The clusters of items
with common semantic and/or other char-
acteristics can guide users in refining their
original queries, to zoom in on smaller
clusters, and drill down through subgroups
within the cluster.
Search result clustering has several
specific requirements that may not be
essential for other cluster algorithms. First,
search result clustering should allow fast
clustering and rapid generation of a label
on the fly, since it is an online process.
This requirement can be met by adopting
“snippets” (the snippets contain phrases
that help in the correct clustering of the
document and do not contain some of the
“noise” present in the original documents
that might cause misclassification of
the documents) rather than entire
documents of a search result set.

SCHISM—A Web search engine using semantic taxonomy
Ramesh Singh, Dhruv Dhingra, and Aman Arora
IEEE Potentials, September/October 2010
Digital Object Identifier 10.1109/MPOT.2010.937055

Second, labels
annotated for clusters should be mean-
ingful to users because they are pre-
sented to users as a general view of
results. For this reason, recent search
result clustering research focuses on
selecting meaningful labels. This differs
from general clustering which focuses on
the similarity of documents. In Zamir and
Etzioni, a few other key requirements of
search result clustering are presented.
The suffix tree clustering (STC) algo-
rithm (Fig. 1) was introduced by Zamir
and Etzioni and further developed in the
Overview of Data Mining—Bayesian
Classification. The algorithm focuses on
clustering snippets faster than standard data
mining approaches to clustering by using
a data structure called a suffix tree. Its
time complexity is linear to the number of
snippets, making it attractive when clus-
tering a large number of documents. Since
this is the core clustering approach used
by the software presented in this article,
the algorithm will be presented in detail.
Why cluster search results

Search results on the Web are
traditionally presented as a flat ranked list of
documents, frequently millions of docu-
ments long. The main use for clustering
is not to improve the actual ranking, but
to give the user a quick overview of the
results. Having divided the result set
into clusters (Fig. 2), the user can quickly
narrow down his search further by se-
lecting a cluster. This resembles query
refinement but avoids the need to query
the search engine for each step. Evaluations
done using the Grouper system
indicate that users tend to investigate
more documents per query than in
normal search engines. It is assumed
that this is because the user clicks on
the desired cluster rather than reformu-
lating his query. The evaluation also
indicates that once one interesting
document has been found, users often
find other interesting documents in the
same cluster.
Traditional clustering approaches

These techniques transform the
documents to vectors and employ standard
means of calculating differences between
vectors to cluster the documents.
Hierarchical clustering

In general, there are two types of
hierarchical clustering methods:
• Agglomerative or bottom-up hierarchical methods create a cluster for
each document and then merge the two
most similar clusters until just one cluster
remains or some termination condition
is satisfied. Most hierarchical cluster-
ing methods fall into this category, and
they only differ in their definition of
intercluster similarity and termination
conditions.
• Divisive or top-down hierarchical methods do the opposite, by starting
with all documents in one cluster.
The initial cluster is divided until some
termination condition is satisfied. Hier-
archical methods are widely adopted,
but often struggle to meet the speed
requirements of the Web. Usually operating
on document vectors with a time
complexity of O(n²) or more, clustering
more than a few hundred snippets is
often infeasible. Another problem is that
if two clusters are incorrectly merged at
an early stage, there is no way of fixing
this later in the process. Finding the best
halting criterion that works well with all
queries can also be very difficult.
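To make the agglomerative procedure concrete, here is a minimal Python sketch (the function name, single-link distance, and termination condition are illustrative choices, not taken from the article) that merges the two closest clusters until a target count remains; its nested scan over cluster pairs shows why the method costs O(n²) or more:

```python
def agglomerative(vectors, target_clusters):
    """Bottom-up clustering sketch: start with one cluster per document
    and repeatedly merge the two most similar (closest) clusters until a
    termination condition - here, a target cluster count - is satisfied."""

    def dist(a, b):
        # Squared Euclidean distance between two document vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def cluster_dist(c1, c2):
        # Single-link intercluster distance: closest pair across clusters.
        return min(dist(a, b) for a in c1 for b in c2)

    clusters = [[v] for v in vectors]  # one cluster per document
    while len(clusters) > target_clusters:
        # Scanning all pairs each round is what makes this O(n^2) or worse.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge j into i; no way to undo later
    return clusters
```

The final line of the loop also illustrates the early-merge problem: once two clusters are joined, the decision is never revisited.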
K-means clustering

The k-means algorithm comes in many
flavors and produces a fixed number (k)
of flat clusters. The algorithms generally
follow the following process: Random
samples from the collection are drawn to
serve as centroids for initial clusters.
Based on document vector similarity, all
documents are assigned to the closest
centroid. New centroids are calculated
for each cluster, and the process is
repeated until nothing changes or some
termination condition is satisfied.
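The k-means process just described can be sketched as follows. This is a minimal illustrative implementation; the initialization, distance measure, and iteration cap are assumptions, since the algorithm "comes in many flavors":

```python
import random

def kmeans(vectors, k, iterations=20):
    """Sketch of the k-means loop described in the text."""
    # Random samples from the collection serve as initial centroids.
    centroids = random.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign every document to the closest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            closest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(v, centroids[c])))
            clusters[closest].append(v)
        # Recompute each centroid as the mean of its assigned documents.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # nothing changed: the process has converged
        centroids = new_centroids
    return clusters
```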
Fig. 1 Suffix tree (example nodes include “car review,” “pantera onca,” and “picture,” each annotated with the indices of the documents containing that phrase).
The process can be sped up by
clustering a subset of the documents,
and later assigning all documents to the
precomputed clusters. Several problems
exist with this approach: It can only pro-
duce a fixed number of clusters (k). It
performs optimally when the clusters are
spherical, but we have no reason to
assume that document clusters are
spherical. Finally, a “bad choice” in the
random selection of initial clusters can
severely degrade performance.
Similar works and the differences with SCHISM

“On-Line Clustering of Web Search
Results” (Borch) developed a prototype for
online clustering of Web search results
and also used STC. SCHISM, however,
uses a different technique for
labeling clusters.
Technical components

The project consists of three main
components:
1) Yahoo! application programming
interface: The suffix tree clustering
application takes as input the results
produced by the Yahoo! search engine.
2) Suffix tree clustering: Since this is
the core clustering approach used by the
software presented in this article, the
algorithm will be presented in detail.
The algorithm consists of three steps:
document cleaning, identifying base
clusters, and combining base clusters
into final clusters.
Document cleaning uses a light stem-
ming algorithm on the text. Sentence
boundaries are marked and nonword
tokens are stripped.
The process of identifying base clus-
ters resembles building an inverted index
of phrases for the document collection.
The data structure used is a suffix tree as
suggested by Anastasia Krithara, Cyril
Goutte et al., which can be built in time
linear to the collection size. The follow-
ing is a definition of a suffix tree: For-
mally, a suffix tree is a rooted, directed
tree. Each internal node has at least two
children. Each edge is labeled with a
nonempty substring of S (hence it is a
trie). A trie, or prefix tree, is an ordered
tree data structure that is used to store an
associative array where the keys are usu-
ally strings. The label of a node is defined
to be the concatenation of the edge-la-
bels on the path from the root to that
node. No two edges out of the same
node can have edge-labels that begin
with the same word (hence it is com-
pact). For each suffix s of S, there exists
a suffix-node whose label equals s.
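The following sketch shows the structure this definition describes, at the word level used by STC. For clarity it builds an uncompacted suffix trie in quadratic time; a true suffix tree compacts single-child chains and, as noted above, can be built in time linear to the collection size:

```python
def build_suffix_trie(snippets):
    """Word-level suffix trie sketch: every suffix of every snippet is
    inserted from the root, one word per edge.  Each node records which
    documents contain the phrase spelled out on the path from the root.
    (A real suffix tree compacts single-child chains and can be built in
    linear time; this quadratic version only illustrates the structure.)"""
    root = {"children": {}, "docs": set()}
    for doc_id, snippet in enumerate(snippets):
        words = snippet.lower().split()
        for start in range(len(words)):   # insert every suffix
            node = root
            for word in words[start:]:
                node = node["children"].setdefault(
                    word, {"children": {}, "docs": set()})
                node["docs"].add(doc_id)  # this document shares the phrase
    return root
```

Each internal node then corresponds to a phrase shared by the documents listed at that node, which is exactly what the base-cluster step reads off the tree.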
The combining base clusters step
merges base clusters with highly over-
lapping document sets. The similarity of
base clusters Bn and Bm is a binary
function defined as
sim(Bm, Bn) = 1 if |Bm ∩ Bn|/|Bm| > 0.5
                  and |Bm ∩ Bn|/|Bn| > 0.5,
            = 0 otherwise,

where |Bm ∩ Bn| is the number of
documents shared by Bm and Bn. Calculating
this similarity between all base
clusters, we can create a base cluster
graph, where each node is a base cluster,
and two nodes are connected if the two
base clusters have a similarity of 1. Using
this graph, a cluster is defined as a con-
nected component in the graph.
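The merging step can be sketched directly from this definition: build the base cluster graph from the binary similarity, then read off connected components. This is an illustrative implementation; the function name and the representation of base clusters as sets of document IDs are assumptions:

```python
def merge_base_clusters(base_clusters, threshold=0.5):
    """Combine base clusters (given as sets of document IDs) whose document
    sets overlap by more than `threshold` in both directions, then return
    the connected components of the resulting base cluster graph."""
    n = len(base_clusters)
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(base_clusters[i] & base_clusters[j])
            # Similarity is 1 iff the shared fraction exceeds the threshold
            # relative to both base clusters; otherwise 0 (no edge).
            if (shared > threshold * len(base_clusters[i])
                    and shared > threshold * len(base_clusters[j])):
                adjacency[i].append(j)
                adjacency[j].append(i)
    # Each final cluster is a connected component (found by DFS).
    seen, final_clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, docs = [i], set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                docs |= base_clusters[v]
                stack.extend(adjacency[v])
        final_clusters.append(docs)
    return final_clusters
```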
3) Graphic user interface (GUI): The
GUI takes input from the suffix tree
clusterer and displays the results to the
user. A separate thread lets an advanced
user adjust the parameters of the
algorithm, making the application more
flexible and effective to use.
Implementation of the suffix tree clustering algorithm
The results from the search engine
are clustered using the STC algorithm.
The algorithm takes as input the gener-
ated snippets and returns a list of labeled
clusters of documents. The original
paper describing the algorithm lists sev-
eral key requirements for Web document
clustering, including:
• Relevance: The clusters should be
relevant to the user query and their
labels easily understood by the user.
• Overlap: Documents can have
several topics, so it is favorable to
allow a document to appear in several
clusters.
• Snippet tolerance: In order to
be feasible in large-scale applications,
the algorithm should be able to pro-
duce high-quality clusters based only
on the snippets returned by the search
engine.
• Speed: The clusters should be
generated in a matter of milliseconds to avoid
user annoyance and scalability issues.
The STC algorithm has all these
qualities, but the main reasons for choosing
it were its speed, simplicity, and ability to
produce good cluster labels. Several factors
influence the performance of STC,
and the following sections describe the
choices made at each step.
Fig. 2 Cluster selection.

Stemming snippets

The authors of STC suggest stemming
all words in the snippets with a
light stemming algorithm. To present
readable labels to the user, they save a
pointer for each phrase into the snip-
pet from which it originated. In this
way, one original form of the phrase
can be retrieved. Several original
phrases from different snippets may
belong to the same base cluster, but
which of these original forms is chosen
is not detailed in the article. The soft-
ware only stems plurals to singular,
using a very light stemming algorithm.
The effects of stemming the snippets
are further assessed in the evaluation.
Original snippets are stored to allow
original phrases to be reconstructed
after clustering.
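Since the text only says that plurals are stemmed to singular with a very light stemmer, the exact rules are unknown; the sketch below shows one plausible rule set (all of it an assumption) for such a stemmer:

```python
def light_stem(word):
    """Illustrative light stemmer that only maps plurals to singular.
    (The article does not give the exact rules; these are assumptions.)"""
    lower = word.lower()
    if lower.endswith("ies") and len(lower) > 4:
        return lower[:-3] + "y"            # "queries" -> "query"
    if lower.endswith("es") and lower[-3:-2] in ("s", "x"):
        return lower[:-2]                  # "boxes"   -> "box"
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]                  # "engines" -> "engine"
    return lower                           # already singular, or e.g. "class"
```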
Removing stop words in snippets

The authors of STC suggest dealing
with stop words in phrases by allowing
them as long as they are not the first or
the last word in the phrase. Phrases con-
taining stop words are rarely selected as
labels. Therefore, the software simply
skips stop words and inserts phrase
boundaries instead. Testing indicates
that this has very little impact on the
resulting clusters and it’s therefore pre-
ferred for its simplicity.
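Skipping stop words and inserting phrase boundaries in their place might look like the following sketch (the stop-word list and function name are illustrative):

```python
# Illustrative stop-word list; the software's actual list is not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is"}

def split_on_stop_words(sentence):
    """Skip stop words and insert a phrase boundary in their place, so no
    extracted phrase can begin with, end with, or contain a stop word."""
    phrases, current = [], []
    for word in sentence.lower().split():
        if word in STOP_WORDS:
            if current:            # the stop word closes the open phrase
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases
```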
Labeling clusters

The clustering algorithm outputs a
set of labeled base clusters for each
cluster, and the authors propose using
these base cluster labels as labels for the
final cluster (Fig. 2). In their Grouper
system, clusters are labeled with all the
labels from the base clusters. The STC
algorithm assigns a score to each base
cluster but never utilizes it. The approach
taken in the software is to treat the base
cluster labels as candidate labels for the
final cluster, and use the scoring to select
the highest ranked candidate as the final
label. It is assumed that having one or
two phrases as a label, instead of labeling
each cluster with all candidates,
enhances usability.
The original paper describing STC
suggests scoring base clusters using the
formula
s(B) = |B| · f(|P|)
where |B| is the number of documents in
base cluster B, and |P| is the number of
words in the cluster’s phrase P that have
a nonzero score.
Zero-scoring words are defined as
stop words, words that appear in fewer
than three documents, or words that
appear in more than 40% of the
document collection. The function f
penalizes single-word phrases, is
linear for phrases two to six words long,
and constant for longer phrases.
The scoring scheme used by the soft-
ware closely resembles the suggested for-
mula because initial testing indicated that
it worked very well. Whether the algo-
rithm succeeds in selecting the “correct”
candidate is addressed in the evaluation
section. The software doesn’t include
stop words in the zero-scoring words
since all stop words are removed during
preprocessing. The function f gives a
score of 0.5 for single-word phrases, 2 to
6 for two- to six-word phrases respec-
tively, and 7 for longer phrases. If the two
top-ranked base clusters have the same
score, the one with the largest document
set is used as the label. This is because
phrases occurring in many documents
are assumed to represent more important
categories.
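Putting the scoring scheme together, a sketch of the software's variant (stop words already removed, f as described above) could look like the following; the function signatures and the document-frequency representation are assumptions:

```python
def base_cluster_score(phrase, cluster_size, doc_freq, collection_size):
    """Sketch of s(B) = |B| * f(|P|): cluster_size is |B| and |P| counts
    the words in the phrase with a nonzero score.  Words in fewer than
    three documents, or in more than 40% of the collection, score zero
    (stop words are assumed already removed during preprocessing)."""
    effective = sum(
        1 for word in phrase.split()
        if 3 <= doc_freq.get(word, 0) <= 0.4 * collection_size
    )
    # f gives 0.5 for single-word phrases, is linear for two- to
    # six-word phrases, and is constant (7) for longer phrases.
    if effective <= 1:
        f = 0.5
    elif effective <= 6:
        f = effective
    else:
        f = 7
    return cluster_size * f

def pick_label(candidates):
    """Choose the highest-scoring candidate label; on a tie, prefer the
    candidate backed by the larger document set.  Each candidate is a
    (phrase, score, num_docs) tuple."""
    return max(candidates, key=lambda c: (c[1], c[2]))[0]
```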
Ranking clusters

As previously mentioned, the original
STC paper does not detail how the final
clusters are ranked. One of the best per-
forming systems available, Clusty.com,
seems to almost always present cluster
lists sorted by the number of documents
they contain. Minor deviations from this
approach can be observed, indicating
that they use some additional heuristic.
This suggests a small trade-off between
quality and performance: additional
processing of the clusters can produce a
slightly altered (and hopefully improved)
ordering. In the software, this extra
processing is assumed to cost more time
than it gains in quality. It is assumed that
large clusters are more interesting, and
clusters are thus simply ranked by the
number of documents they contain.
Presentation

The software includes a Web front-end
to the search engine merely for testing
purposes. It allows the user to submit a
query and returns a list of matching docu-
ments (Fig. 3) and browsable clusters. The
documents are presented with the docu-
ment title, the generated snippet, and a
link to the corresponding Web page.
Fig. 3 Search query.

Added functionalities

• Ignore words in fewer documents:
This functionality controls the minimum
number of times a word must occur in
the documents to form a cluster. It is
used to limit the view (Fig. 4) to search
cluster results having higher frequencies.
• Ignore words in more documents:
This functionality ensures that only
relevant words form the clusters. It
curbs very common words from forming
clusters, thereby increasing relevance.
• Minimum base cluster size: This
functionality provides the minimum
number of base clusters that should be
present.
• Maximum base cluster size: This is
used to limit the number of base clusters
that would be present.
• Merge threshold: This provides the
threshold value (ideally 0.5) that must be
met for the merging of clusters to take
place.
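The five knobs above can be gathered into a small configuration object; the field names and defaults here are illustrative assumptions, since the article describes the parameters but does not name them:

```python
from dataclasses import dataclass

@dataclass
class ClusteringParams:
    """Tunable parameters exposed to the advanced user (illustrative)."""
    min_word_docs: int = 3            # ignore words in fewer documents
    max_word_doc_ratio: float = 0.4   # ignore words in more documents
    min_base_cluster_size: int = 2    # smallest base cluster kept
    max_base_cluster_size: int = 100  # largest base cluster kept
    merge_threshold: float = 0.5      # overlap required to merge clusters
```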
Future considerations

Implementing semantic taxonomy in
Web search engines has been a first
step toward changing the way people
search. A natural next step is to apply
such an approach to image search.
Automatic semantic classification of
image databases
would be of great use to users’ search-
ing and browsing. Grouping images
into semantically meaningful categories
using basic low-level visual features is a
challenge and an important problem for
content-based image retrieval. The
enormity and diversity of the visual
content of Web images add another
dimension to this challenging task.
Read more about it

• A. Krithara, C. Goutte, J.-M. Renders,
and M.-R. Amini, “Semi-supervised
document classification with a mislabeling
error model,” in Proc. Annu. European Conf. Information Retrieval (ECIR), 2008, pp. 370–381.
• C. Vogel, “Regularization Methods:
An Applied Mathematician’s Perspective”
(tutorial), Dept. Math. Sci., Montana State
Univ., Bozeman, MT [Online]. Available:
www.samsi.info/talks/inverse/Inverse-Vogel.pdf
• L. Kerschberg, A. Scime, and W.
Kim, “A personalizable agent for se-
mantic taxonomy-based web search,”
in Proc. 1st Int. Workshop Radical Agent Concepts (WRAC) 2002, pp. 3–34.
• O. Zamir, O. Etzioni, O. Madani, and
R. Karp, “Fast and intuitive clustering of
web documents,” in Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, 1997, pp. 287–290.
• O. Zamir and O. Etzioni, “Web
document clustering: A feasibility
demonstration,” in Proc. 21st Annu. Int. ACM SIGIR Conf. Research and
Development in Information Retrieval, SIGIR’98. New York, NY: ACM Press,
1998, pp. 46–54.
• Y. Lu, C. Hu, X. Zhu, H. Zhang,
and Q. Yang, “A unified framework for
semantics and feature based relevance
feedback in image retrieval systems,” in
Proc. Int. Multimedia Conf. ACM Multi-media, 2000, pp. 31–37.
• Y.-S. Chen and T.-H. Chu, “A
neural network classification tree,” in
Proc. IEEE Int. Conf. Neural Networks, 1995, vol. 1, pp. 409–413.
• Z. Yang and C. C. J. Kuo, “A
semantic classification and composite
indexing approach to robust image re-
trieval,” in Proc. Int. Conf. Image Process-ing (ICIP), 1999, vol. 1, pp. 134–138.
• Resampling Stats [Online]. Available:
http://www.resample.com/xlminer/help/NNC/NNClass_intro.htm
• Information Technology and Systems
Center Data Mining Solutions Center,
Overview of Data Mining—Bayesian
Classification [Online]. Available:
http://datamining.itsc.uah.edu/adam/tutorials/adam_tut_02_overview_05.html
• Wikipedia [Online]. Available:
http://en.wikipedia.org/wiki/Naive_Bayesian_classification
• H. O. Borch, “On-Line Clustering
of Web Search Results,” master’s thesis,
2006 [Online]. Available:
http://daim.idi.ntnu.no/masteroppgaver/IME/IDI/2006/1414/masteroppgave.pdf
About the authors

Ramesh Singh ([email protected]) is a
senior technical director at NIC, New
Delhi, India. He received his M.Tech.
in computer applications from the
Indian Institute of Technology in New
Delhi.
Dhruv Dhingra (dhruv.dhingra@ece.
dce.edu) studied at the Indian Institute
of Management, Indore, Ecole de Man-
agement de Lyon, and the Delhi College
of Engineering. He is currently working
as manager, mergers and acquisitions, in
the Delhi Region.
Aman Arora ([email protected].
edu) graduated from the Delhi College
of Engineering with a degree in computer
engineering. He served as a software
engineer at Cisco in Bangalore. He
is currently studying at the Indian Insti-
tute of Management, Lucknow.
Fig. 4 Search cluster results.