IEEE POTENTIALS 0278-6648/10/$26.00 © 2010 IEEE
The majority of current search
engines generate a huge list in
reply to a user query. This result
is normally ranked by criteria such as
PageRank or relevance to the query.
However, such a list is inconvenient
for users, since it expects them to
examine each page sequentially, in an
exhaustive manner, to find the relevant
information. As a result, most users
inspect only the first few Web pages on
the list, so much other relevant information
can be overlooked. The clustering method
is one solution to this problem.
Instead of a sequential list, it
groups the search results into clusters and
labels these with representative words for
each cluster. These labeled clusters of
search results are presented to users. The
clustering method thus reduces the
amount of information the end users
must scan.
The clustering approach is even
more important for the largest databases,
the gigantic data silos of companies such as
Google, Yahoo, and Microsoft. The need
for better presentation of search results
retrieved from millions, and then billions,
of highly unstructured and untagged Web
pages was obvious. Clustering became a
popular software tool to enhance relevance
ranking by grouping items in the typically
very large result list. The clusters of items
with common semantic and/or other char-
acteristics can guide users in refining their
original queries, to zoom in on smaller
clusters, and drill down through subgroups
within the cluster.
Search result clustering has several
specific requirements that may not be
essential for other cluster algorithms. First,
search result clustering should allow fast
clustering and rapid generation of a label
on the fly, since it is an online process.
This requirement can be met by adopting
“snippets” (the snippets contain phrases
that help in the correct clustering of the
document and do not contain some of the
“noise” present in the original documents
that might cause misclassification of
the documents) rather than entire
documents of a search result set.

SCHISM—A Web search engine using semantic taxonomy
Ramesh Singh, Dhruv Dhingra, and Aman Arora
IEEE Potentials, September/October 2010
Digital Object Identifier 10.1109/MPOT.2010.937055

Second, labels
annotated for clusters should be mean-
ingful to users because they are pre-
sented to users as a general view of
results. For this reason, recent search
result clustering research focuses on
selecting meaningful labels. This differs
from general clustering which focuses on
the similarity of documents. In Zamir and
Etzioni, a few other key requirements of
search result clustering are presented.
The suffix tree clustering (STC) algo-
rithm (Fig. 1) was introduced by Zamir
and Etzioni and further developed in the
Overview of Data Mining—Bayesian
Classification. The algorithm focuses on
clustering snippets faster than standard data
mining approaches to clustering by using
a data structure called a suffix tree. Its
time complexity is linear to the number of
snippets, making it attractive when clus-
tering a large number of documents. Since
this is the core clustering approach used
by the software presented in this article,
the algorithm will be presented in detail.
Why cluster search results

Search results on the Web are
traditionally presented as a flat ranked list of
documents, frequently millions of docu-
ments long. The main use for clustering
is not to improve the actual ranking, but
to give the user a quick overview of the
results. Having divided the result set
into clusters (Fig. 2), the user can quickly
narrow down his search further by se-
lecting a cluster. This resembles query
refinement but avoids the need to query
the search engine for each step. Evaluations
done using the Grouper system
indicate that users tend to investigate
more documents per query than in
normal search engines. It is assumed
that this is because the user clicks on
the desired cluster rather than reformu-
lating his query. The evaluation also
indicates that once one interesting
document has been found, users often
find other interesting documents in the
same cluster.
Traditional clustering approaches

These techniques transform the
documents to vectors and employ standard
means of calculating differences between
vectors to cluster the documents.
Hierarchical clustering

In general, there are two types of
hierarchical clustering methods:
• Agglomerative or bottom-up hierarchical methods create a cluster for
each document and then merge the two
most similar clusters until just one cluster
remains or some termination condition
is satisfied. Most hierarchical cluster-
ing methods fall into this category, and
they only differ in their definition of
intercluster similarity and termination
conditions.
• Divisive or top-down hierarchical methods do the opposite, by starting
with all documents in one cluster.
The initial cluster is divided until some
termination condition is satisfied. Hier-
archical methods are widely adopted,
but often struggle to meet the speed
requirements of the Web. Usually operating
on document vectors with a time
complexity of O(n²) or more, clustering
more than a few hundred snippets is
often infeasible. Another problem is that
if two clusters are incorrectly merged at
an early stage, there is no way of fixing
this later in the process. Finding the best
halting criterion that works well with all
queries can also be very difficult.
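To make the agglomerative procedure concrete, here is a minimal Python sketch (the function name, single-link distance, and termination condition are illustrative choices, not taken from the article) that merges the two closest clusters until a target count remains; its nested scan over cluster pairs shows why the method costs O(n²) or more:

```python
def agglomerative(vectors, target_clusters):
    """Bottom-up clustering sketch: start with one cluster per document
    and repeatedly merge the two most similar (closest) clusters until a
    termination condition - here, a target cluster count - is satisfied."""

    def dist(a, b):
        # Squared Euclidean distance between two document vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def cluster_dist(c1, c2):
        # Single-link intercluster distance: closest pair across clusters.
        return min(dist(a, b) for a in c1 for b in c2)

    clusters = [[v] for v in vectors]  # one cluster per document
    while len(clusters) > target_clusters:
        # Scanning all pairs each round is what makes this O(n^2) or worse.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge j into i; no way to undo later
    return clusters
```

The final line of the loop also illustrates the early-merge problem: once two clusters are joined, the decision is never revisited.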
K-means clustering

The k-means algorithm comes in many
flavors and produces a fixed number (k)
of flat clusters. The algorithms generally
follow the following process: Random
samples from the collection are drawn to
serve as centroids for initial clusters.
Based on document vector similarity, all
documents are assigned to the closest
centroid. New centroids are calculated
for each cluster, and the process is
repeated until nothing changes or some
termination condition is satisfied.
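The k-means process just described can be sketched as follows. This is a minimal illustrative implementation; the initialization, distance measure, and iteration cap are assumptions, since the algorithm "comes in many flavors":

```python
import random

def kmeans(vectors, k, iterations=20):
    """Sketch of the k-means loop described in the text."""
    # Random samples from the collection serve as initial centroids.
    centroids = random.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign every document to the closest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            closest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(v, centroids[c])))
            clusters[closest].append(v)
        # Recompute each centroid as the mean of its assigned documents.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # nothing changed: the process has converged
        centroids = new_centroids
    return clusters
```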
Fig. 1 Suffix tree (example nodes include “car review,” “pantera onca,” and “picture,” each annotated with the indices of the documents containing that phrase).
The process can be sped up by
clustering a subset of the documents,
and later assigning all documents to the
precomputed clusters. Several problems
exist with this approach: It can only pro-
duce a fixed number of clusters (k). It
performs optimally when the clusters are
spherical, but we have no reason to
assume that document clusters are
spherical. Finally, a “bad choice” in the
random selection of initial clusters can
severely degrade performance.
Similar works and the differences with SCHISM

“On-Line Clustering of Web Search
Results” (Borch) developed a prototype for
online clustering of Web search results
and also used STC. SCHISM, however,
uses a different technique for
labeling clusters.
Technical components

The project consists of three main
components:
1) Yahoo! application programming
interface: The suffix tree clustering
application takes as input the results
produced by the Yahoo! search engine.
2) Suffix tree clustering: Since this is
the core clustering approach used by the
software presented in this article, the
algorithm will be presented in detail.
The algorithm consists of three steps:
document cleaning, identifying base
clusters, and combining base clusters
into final clusters.
Document cleaning uses a light stem-
ming algorithm on the text. Sentence
boundaries are marked and nonword
tokens are stripped.
The process of identifying base clus-
ters resembles building an inverted index
of phrases for the document collection.
The data structure used is a suffix tree as
suggested by Anastasia Krithara, Cyril
Goutte et al., which can be built in time
linear to the collection size. The follow-
ing is a definition of a suffix tree: For-
mally, a suffix tree is a rooted, directed
tree. Each internal node has at least two
children. Each edge is labeled with a
nonempty substring of S (hence it is a
trie). A trie, or prefix tree, is an ordered
tree data structure that is used to store an
associative array where the keys are usu-
ally strings. The label of a node is defined
to be the concatenation of the edge-la-
bels on the path from the root to that
node. No two edges out of the same
node can have edge-labels that begin
with the same word (hence it is com-
pact). For each suffix s of S, there exists
a suffix-node whose label equals s.
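The following sketch shows the structure this definition describes, at the word level used by STC. For clarity it builds an uncompacted suffix trie in quadratic time; a true suffix tree compacts single-child chains and, as noted above, can be built in time linear to the collection size:

```python
def build_suffix_trie(snippets):
    """Word-level suffix trie sketch: every suffix of every snippet is
    inserted from the root, one word per edge.  Each node records which
    documents contain the phrase spelled out on the path from the root.
    (A real suffix tree compacts single-child chains and can be built in
    linear time; this quadratic version only illustrates the structure.)"""
    root = {"children": {}, "docs": set()}
    for doc_id, snippet in enumerate(snippets):
        words = snippet.lower().split()
        for start in range(len(words)):   # insert every suffix
            node = root
            for word in words[start:]:
                node = node["children"].setdefault(
                    word, {"children": {}, "docs": set()})
                node["docs"].add(doc_id)  # this document shares the phrase
    return root
```

Each internal node then corresponds to a phrase shared by the documents listed at that node, which is exactly what the base-cluster step reads off the tree.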
The combining base clusters step
merges base clusters with highly over-
lapping document sets. The similarity of
base clusters Bn and Bm is a binary
function defined as
sim(Bm, Bn) = 1 if |Bm ∩ Bn|/|Bm| > 0.5
                  and |Bm ∩ Bn|/|Bn| > 0.5,
            = 0 otherwise,

where |Bm ∩ Bn| is the number of
documents shared by Bm and Bn. Calculating
this similarity between all base
clusters, we can create a base cluster
graph, where each node is a base cluster,
and two nodes are connected if the two
base clusters have a similarity of 1. Using
this graph, a cluster is defined as a con-
nected component in the graph.
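The merging step can be sketched directly from this definition: build the base cluster graph from the binary similarity, then read off connected components. This is an illustrative implementation; the function name and the representation of base clusters as sets of document IDs are assumptions:

```python
def merge_base_clusters(base_clusters, threshold=0.5):
    """Combine base clusters (given as sets of document IDs) whose document
    sets overlap by more than `threshold` in both directions, then return
    the connected components of the resulting base cluster graph."""
    n = len(base_clusters)
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(base_clusters[i] & base_clusters[j])
            # Similarity is 1 iff the shared fraction exceeds the threshold
            # relative to both base clusters; otherwise 0 (no edge).
            if (shared > threshold * len(base_clusters[i])
                    and shared > threshold * len(base_clusters[j])):
                adjacency[i].append(j)
                adjacency[j].append(i)
    # Each final cluster is a connected component (found by DFS).
    seen, final_clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, docs = [i], set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                docs |= base_clusters[v]
                stack.extend(adjacency[v])
        final_clusters.append(docs)
    return final_clusters
```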
3) Graphic user interface (GUI): The
GUI takes input from the suffix tree
clusterer and displays the results to the
user. A separate thread lets an advanced
user adjust the parameters of the
algorithm, making the application more
flexible and effective to use.
Implementation of the suffix tree clustering algorithm
The results from the search engine
are clustered using the STC algorithm.
The algorithm takes as input the gener-
ated snippets and returns a list of labeled
clusters of documents. The original
paper describing the algorithm lists sev-
eral key requirements for Web document
clustering, including:
• Relevance: The clusters should be
relevant to the user query and their
labels easily understood by the user.
• Overlap: Documents can have
several topics, so it is favorable to
allow a document to appear in several
clusters.
• Snippet tolerance: In order to
be feasible in large-scale applications,
the algorithm should be able to pro-
duce high-quality clusters based only
on the snippets returned by the search
engine.
• Speed: The clusters should be
generated in a matter of milliseconds to avoid
user annoyance and scalability issues.
The STC algorithm has all these
qualities, but the main reasons for choosing
it were its speed, simplicity, and ability to
produce good cluster labels. Several factors
influence the performance of STC,
and the following sections describe the
choices made at each step.
Fig. 2 Cluster selection.

Stemming snippets

The authors of STC suggest stemming
all words in the snippets with a
light stemming algorithm. To present
readable labels to the user, they save a
pointer for each phrase into the snip-
pet from which it originated. In this
way, one original form of the phrase
can be retrieved. Several original
phrases from different snippets may
belong to the same base cluster, but
which of these original forms is chosen
is not detailed in the article. The soft-
ware only stems plurals to singular,
using a very light stemming algorithm.
The effects of stemming the snippets
are further assessed in the evaluation.
Original snippets are stored to allow
original phrases to be reconstructed
after clustering.
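Since the text only says that plurals are stemmed to singular with a very light stemmer, the exact rules are unknown; the sketch below shows one plausible rule set (all of it an assumption) for such a stemmer:

```python
def light_stem(word):
    """Illustrative light stemmer that only maps plurals to singular.
    (The article does not give the exact rules; these are assumptions.)"""
    lower = word.lower()
    if lower.endswith("ies") and len(lower) > 4:
        return lower[:-3] + "y"            # "queries" -> "query"
    if lower.endswith("es") and lower[-3:-2] in ("s", "x"):
        return lower[:-2]                  # "boxes"   -> "box"
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]                  # "engines" -> "engine"
    return lower                           # already singular, or e.g. "class"
```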
Removing stop words in snippets

The authors of STC suggest dealing
with stop words in phrases by allowing
them as long as they are not the first or
the last word in the phrase. Phrases con-
taining stop words are rarely selected as
labels. Therefore, the software simply
skips stop words and inserts phrase
boundaries instead. Testing indicates
that this has very little impact on the
resulting clusters and it’s therefore pre-
ferred for its simplicity.
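Skipping stop words and inserting phrase boundaries in their place might look like the following sketch (the stop-word list and function name are illustrative):

```python
# Illustrative stop-word list; the software's actual list is not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is"}

def split_on_stop_words(sentence):
    """Skip stop words and insert a phrase boundary in their place, so no
    extracted phrase can begin with, end with, or contain a stop word."""
    phrases, current = [], []
    for word in sentence.lower().split():
        if word in STOP_WORDS:
            if current:            # the stop word closes the open phrase
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases
```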
Labeling clusters

The clustering algorithm outputs a
set of labeled base clusters for each
cluster, and the authors propose using
these base cluster labels as labels for the
final cluster (Fig. 2). In their Grouper
system, clusters are labeled with all the
labels from the base clusters. The STC
algorithm assigns a score to each base
cluster but never utilizes it. The approach
taken in the software is to treat the base
cluster labels as candidate labels for the
final cluster, and use the scoring to select
the highest ranked candidate as the final
label. It is assumed that having one or
two phrases as a label, instead of labeling
each cluster with all candidates,
enhances usability.
The original paper describing STC
suggests scoring base clusters using the
formula
s(B) = |B| · f(|P|)
where |B| is the number of documents in
base cluster B, and |P| is the number of
words in the cluster’s phrase P that have
a nonzero score.
Zero-scoring words are defined as
stop words, words that appear in fewer
than three documents, or words that
appear in more than 40% of the
document collection. The function f
penalizes single-word phrases, is
linear for phrases two to six words long,
and constant for longer phrases.
The scoring scheme used by the soft-
ware closely resembles the suggested for-
mula because initial testing indicated that
it worked very well. Whether the algo-
rithm succeeds in selecting the “correct”
candidate is addressed in the evaluation
section. The software doesn’t include
stop words in the zero-scoring words
since all stop words are removed during
preprocessing. The function f gives a
score of 0.5 for single-word phrases, 2 to
6 for two- to six-word phrases respec-
tively, and 7 for longer phrases. If the two
top-ranked base clusters have the same
score, the one with the largest document
set is used as the label. This is because
phrases occurring in many documents
are assumed to represent more important
categories.
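Putting the scoring scheme together, a sketch of the software's variant (stop words already removed, f as described above) could look like the following; the function signatures and the document-frequency representation are assumptions:

```python
def base_cluster_score(phrase, cluster_size, doc_freq, collection_size):
    """Sketch of s(B) = |B| * f(|P|): cluster_size is |B| and |P| counts
    the words in the phrase with a nonzero score.  Words in fewer than
    three documents, or in more than 40% of the collection, score zero
    (stop words are assumed already removed during preprocessing)."""
    effective = sum(
        1 for word in phrase.split()
        if 3 <= doc_freq.get(word, 0) <= 0.4 * collection_size
    )
    # f gives 0.5 for single-word phrases, is linear for two- to
    # six-word phrases, and is constant (7) for longer phrases.
    if effective <= 1:
        f = 0.5
    elif effective <= 6:
        f = effective
    else:
        f = 7
    return cluster_size * f

def pick_label(candidates):
    """Choose the highest-scoring candidate label; on a tie, prefer the
    candidate backed by the larger document set.  Each candidate is a
    (phrase, score, num_docs) tuple."""
    return max(candidates, key=lambda c: (c[1], c[2]))[0]
```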
Ranking clusters

As previously mentioned, the original
STC paper does not detail how the final
clusters are ranked. One of the best per-
forming systems available, Clusty.com,
seems to almost always present cluster
lists sorted by the number of documents
they contain. Minor deviations from this
approach can be observed, indicating
that they use some additional heuristic.
This suggests a small trade-off between
quality and performance: additional
processing of the clusters can produce a
slightly altered (and hopefully improved)
ordering. In the software, this extra
processing is assumed to cost more time
than it gains in quality. It is assumed that
large clusters are more interesting, and
clusters are thus simply ranked by the
number of documents they contain.
Presentation

The software includes a Web front-end
to the search engine merely for testing
purposes. It allows the user to submit a
query and returns a list of matching docu-
ments (Fig. 3) and browsable clusters. The
documents are presented with the docu-
ment title, the generated snippet, and a
link to the corresponding Web page.
Fig. 3 Search query.

Added functionalities

• Ignore words in fewer documents:
This functionality controls the minimum
number of times a word must occur in
the documents to form a cluster. It is
used to limit the view (Fig. 4) to search
cluster results having higher frequencies.
• Ignore words in more documents:
This functionality ensures that only
relevant words form the clusters. It
curbs very common words from forming
clusters, thereby increasing relevance.
• Minimum base cluster size: This
functionality provides the minimum
number of base clusters that should be
present.
• Maximum base cluster size: This is
used to limit the number of base clusters
that would be present.
• Merge threshold: This provides the
threshold value (ideally 0.5) that must be
met for the merging of clusters to take
place.
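The five knobs above can be gathered into a small configuration object; the field names and defaults here are illustrative assumptions, since the article describes the parameters but does not name them:

```python
from dataclasses import dataclass

@dataclass
class ClusteringParams:
    """Tunable parameters exposed to the advanced user (illustrative)."""
    min_word_docs: int = 3            # ignore words in fewer documents
    max_word_doc_ratio: float = 0.4   # ignore words in more documents
    min_base_cluster_size: int = 2    # smallest base cluster kept
    max_base_cluster_size: int = 100  # largest base cluster kept
    merge_threshold: float = 0.5      # overlap required to merge clusters
```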
Future considerations

Implementing semantic taxonomy in
Web search engines has been a first
step toward changing the way people
search. A natural next step is to apply
such an approach to image search.
Automatic semantic classification of
image databases
would be of great use to users’ search-
ing and browsing. Grouping images
into semantically meaningful categories
using basic low-level visual features is a
challenge and an important problem for
content-based image retrieval. The
enormity and diversity of the visual
content of Web images add another
dimension to this challenging task.
Read more about it

• A. Krithara, C. Goutte, J.-M. Renders,
and M.-R. Amini, “Semi-supervised
document classification with a mislabeling
error model,” in Proc. Annu. European Conf. Information Retrieval (ECIR), 2008, pp. 370–381.
• C. Vogel, “Regularization Methods:
An Applied Mathematician’s Perspective”
(tutorial), Dept. Math. Sci., Montana State
Univ., Bozeman, MT [Online]. Available:
www.samsi.info/talks/inverse/Inverse-Vogel.pdf
• L. Kerschberg, A. Scime, and W.
Kim, “A personalizable agent for se-
mantic taxonomy-based web search,”
in Proc. 1st Int. Workshop Radical Agent Concepts (WRAC) 2002, pp. 3–34.
• O. Zamir, O. Etzioni, O. Madani, and
R. Karp, “Fast and intuitive clustering of
web documents,” in Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, 1997, pp. 287–290.
• O. Zamir and O. Etzioni, “Web
document clustering: A feasibility
demonstration,” in Proc. 21st Annu. Int. ACM SIGIR Conf. Research and
Development in Information Retrieval, SIGIR’98. New York, NY: ACM Press,
1998, pp. 46–54.
• Y. Lu, C. Hu, X. Zhu, H. Zhang,
and Q. Yang, “A unified framework for
semantics and feature based relevance
feedback in image retrieval systems,” in
Proc. Int. Multimedia Conf. ACM Multi-media, 2000, pp. 31–37.
• Y.-S. Chen and T.-H. Chu, “A
neural network classification tree,” in
Proc. IEEE Int. Conf. Neural Networks, 1995, vol. 1, pp. 409–413.
• Z. Yang and C. C. J. Kuo, “A
semantic classification and composite
indexing approach to robust image re-
trieval,” in Proc. Int. Conf. Image Process-ing (ICIP), 1999, vol. 1, pp. 134–138.
• Resampling Stats [Online]. Available:
http://www.resample.com/xlminer/help/NNC/NNClass_intro.htm
• Information Technology and Systems
Center Data Mining Solutions Center,
Overview of Data Mining—Bayesian
Classification [Online]. Available:
http://datamining.itsc.uah.edu/adam/tutorials/adam_tut_02_overview_05.html
• Wikipedia [Online]. Available:
http://en.wikipedia.org/wiki/Naive_Bayesian_classification
• H. O. Borch, “On-Line Clustering
of Web Search Results,” master’s thesis,
2006 [Online]. Available:
http://daim.idi.ntnu.no/masteroppgaver/IME/IDI/2006/1414/masteroppgave.pdf
About the authors

Ramesh Singh ([email protected]) is a
senior technical director at NIC, New
Delhi, India. He received his M.Tech.
in computer applications from the
Indian Institute of Technology in New
Delhi.
Dhruv Dhingra (dhruv.dhingra@ece.
dce.edu) studied at the Indian Institute
of Management, Indore, Ecole de Man-
agement de Lyon, and the Delhi College
of Engineering. He is currently working
as manager, mergers and acquisitions, in
the Delhi Region.
Aman Arora ([email protected].
edu) graduated from the Delhi College
of Engineering with a degree in computer
engineering. He served as a software
engineer at Cisco in Bangalore. He
is currently studying at the Indian Insti-
tute of Management, Lucknow.
Fig. 4 Search cluster results.