Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

WIKIPEDIA AS SENSE INVENTORY TO IMPROVE DIVERSITY IN WEB SEARCH RESULTS

Celina Santamaría, Julio Gonzalo and Javier Artiles

UNED (National University of Distance Education), c/Juan del Rosal, 16, 28040 Madrid, Spain

ACL 2010

MOTIVATION

Motivation

Test Set
  Set of Words
  Set of Documents
  Manual Annotation

Coverage of Web Search Results: Wikipedia vs WordNet

Diversity in Google Search Results

Sense Frequency Estimators for Wikipedia

Association of Wikipedia Senses to Web Pages
  VSM Approach
  WSD Approach
  Classification Results
  Precision/Coverage Trade-off
  Using Classification to Promote Diversity

Related Work

Conclusions

FOR VERY SHORT QUERIES

For very short (one-word) queries, disambiguation may not be possible.

We focus on two broad-coverage lexical resources: WordNet and Wikipedia.

TEST SET



SET OF WORDS

Corpus annotation: two annotators handled 40 nouns:

15 nouns from the Senseval-3 lexical sample dataset

25 additional words which satisfy two conditions: they are all ambiguous, and they are all names for music bands in one of their senses

CORPUS

The Senseval set is: {argument, arm, atmosphere, bank, degree, difference, disc, image, paper, party, performance, plan, shelter, sort, source}.

The Bands set is: {amazon, apple, camel, cell, columbia, cream, foreigner, fox, genesis, jaguar, oasis, pioneer, police, puma, rainbow, shell, skin, sun, tesla, thunder, total, traffic, trapeze, triumph, yes}.

TABLE 1: COVERAGE OF SEARCH RESULTS: WIKIPEDIA VS. WORDNET

For each noun in the set, we looked up all its possible senses in WordNet 3.0 and in Wikipedia disambiguation pages.

Wikipedia has an average of 22 senses per noun: 25.2 in the Bands set and 16.1 in the Senseval set.

WordNet has a much smaller figure, 4.5 senses per noun: 3.12 for the Bands set and 6.13 for the Senseval set.

SET OF DOCUMENTS

Step 1: we retrieved the top 150 documents per noun from Google.

Step 2: for each document, we stored both the snippet and the whole HTML document.

We assume a "one sense per document" scenario.

MANUAL ANNOTATION

Two annotators decided, for every document, whether there were appropriate senses in each of the dictionaries.

They provided annotations for 100 documents per noun.

If a URL in the list was corrupt or not available, it had to be discarded; hence 150 documents were retrieved to obtain 100 annotated documents per noun.

COVERAGE OF WEB SEARCH RESULTS: WIKIPEDIA VS WORDNET





TOP TEN RESULTS NOT COVERED BY WIKIPEDIA

32% of the top ten documents are not covered by Wikipedia. These documents were manually examined:

a majority of the missing senses consists of names of (generally not well-known) companies (45%) and products or services (26%);

the other frequent type (12%) of non-annotated document is disambiguation pages.

DEGREE OF OVERLAP BETWEEN WIKIPEDIA AND WORDNET SENSES

Just 3% of the documents fit WordNet only. Wikipedia seems to extend the coverage of WordNet.

DIVERSITY IN GOOGLE SEARCH RESULTS





Diversity is not a major priority for ranking results: the top ten results cover, on average, only 3 Wikipedia senses, while the average number of senses listed in Wikipedia is 22.

In the first 100 documents, this number grows to 6.85 senses per noun.

On average, 63% of the pages in search results belong to the most frequent sense of the query word.






SENSE FREQUENCY ESTIMATORS FOR WIKIPEDIA

Wikipedia disambiguation pages don't contain the relative importance of senses for a given word.

Internal relevance: incoming links for the URL of a given sense in Wikipedia. Stable.

External relevance: number of visits for the URL of a given sense (as reported in http://stats.grok.se). Not stable.

MEASURED CORRELATION

For each noun w and for each sense wi, we consider three values:

the proportion of documents retrieved for w which are manually assigned to each sense;

inlinks(wi): relative amount of incoming links to each sense wi;

visits(wi): relative number of visits to the URL for each sense wi.

We have measured the correlation between these three values using a linear regression correlation coefficient: the correlation value is .54 for the number of visits and .71 for the number of incoming links. Both estimators seem to be positively correlated with the manually annotated sense proportions.


freq(wi) = k * inlinks(wi) + (1 - k) * visits(wi), k = 0, 0.1, 0.2, ..., 1

When k is 0.9, the function has a maximal correlation value of .73:

freq(wi) = 0.9 * inlinks(wi) + 0.1 * visits(wi)

This weighted estimator provides a slight advantage over the use of incoming links only (0.73 vs 0.71).
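The search over k can be illustrated with a short Python sketch. It is only an illustration under assumptions: the function names (pearson, freq, best_k) and the toy input vectors are hypothetical, and the correlation is computed as a plain Pearson coefficient over whatever vectors are passed in, which may differ from how the figures above were actually obtained.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def freq(inlinks, visits, k):
    """Weighted estimator: k * inlinks(wi) + (1 - k) * visits(wi)."""
    return k * inlinks + (1 - k) * visits


def best_k(observed, inlinks, visits, steps=10):
    """Try k = 0, 0.1, ..., 1 and keep the k with the highest correlation
    between the estimate and the observed sense proportions."""
    best = None
    for i in range(steps + 1):
        k = i / steps
        estimate = [freq(a, b, k) for a, b in zip(inlinks, visits)]
        r = pearson(estimate, observed)
        if best is None or r > best[1]:
            best = (k, r)
    return best


# Hypothetical toy values for three senses of one noun (not from the test set).
observed = [0.6, 0.3, 0.1]
inlinks = [0.5, 0.4, 0.1]
visits = [0.7, 0.1, 0.2]
k, r = best_k(observed, inlinks, visits)
```

In practice the vectors would hold one entry per (noun, sense) pair in the annotated test set rather than three toy values.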






TWO DIFFERENT TECHNIQUES

Two different techniques: Vector Space Model (VSM) and a WSD system.

Two baselines: random assignment of senses and most frequent sense.






VSM APPROACH

For each word sense, we represent its Wikipedia page in a (unigram) vector space model.

idf weights are computed in two different ways:

VSM: IDF in the collection of retrieved documents.

VSM-GT: uses the statistics provided by the Google Terabyte collection.

VSM-mixed: VSM + VSM-GT.

VSM APPROACH

Compute the cosine similarity between each sense page and the document, and assign the sense with the highest similarity to the document.

In case of ties, pick the first sense in the Wikipedia disambiguation page.

VSM-GT+freq: in case of ties, we pick the sense which has the largest frequency according to our estimator.
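The assignment step can be sketched in Python as follows. This is a minimal illustration, not the system described in the slides: the helper names, the tokenization by whitespace, the smoothed idf formula log(1 + N/df), and the toy "jaguar" pages are all assumptions, and ties are detected by exact score equality.

```python
import math
from collections import Counter


def idf_from_collection(docs):
    """idf from a list of tokenized documents.
    The smoothing log(1 + N/df) is an assumption of this sketch."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(1 + n / c) for t, c in df.items()}


def tfidf(tokens, idf):
    """Unigram tf*idf vector as a dict; unknown terms get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def assign_sense(doc_tokens, sense_pages, idf, sense_freq=None):
    """sense_pages: list of (sense, tokens) in disambiguation-page order.
    Ties go to the first listed sense, or to the most frequent tied sense
    when a frequency estimate is supplied (the "+freq" variant)."""
    doc_vec = tfidf(doc_tokens, idf)
    scores = [(cosine(tfidf(toks, idf), doc_vec), s) for s, toks in sense_pages]
    top = max(score for score, _ in scores)
    tied = [s for score, s in scores if score == top]
    if sense_freq and len(tied) > 1:
        return max(tied, key=lambda s: sense_freq.get(s, 0.0))
    return tied[0]


# Hypothetical toy data.
pages = [
    ("Jaguar (animal)", "big cat jungle prey".split()),
    ("Jaguar Cars", "car engine luxury british".split()),
]
doc = "the jaguar car has a powerful engine".split()
idf = idf_from_collection([toks for _, toks in pages] + [doc])
sense = assign_sense(doc, pages, idf)
```

Swapping the idf dictionary for one built from a larger reference corpus would correspond to the VSM-GT configuration, without changing the rest of the sketch.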

WSD APPROACH

TiMBL: a state-of-the-art supervised WSD system that uses Memory-Based Learning.

TiMBL-core: uses occurrences of the word in the Wikipedia page for the word sense.

TiMBL-inlinks: uses occurrences of the word in Wikipedia pages pointing to the page for the word sense.

TiMBL-all: core + inlinks.

TIMBL

First: disambiguate all occurrences of word w in the page p.

Then: choose the sense which appears most frequently in the page according to TiMBL results.

In case of ties: pick the first sense listed in the Wikipedia disambiguation page.

TiMBL-core+freq: in case of ties, we pick the sense with the highest frequency according to our estimator. When no sense reaches 30% of the cases in the page to be disambiguated, we also resort to the most frequent sense heuristic.
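The page-level decision on top of the per-occurrence TiMBL output can be sketched as below. TiMBL itself is not called here: the list of per-occurrence predictions is taken as given, and the function name page_sense, its parameters, and the behaviour for a page with no predictions (returning None) are assumptions of this sketch.

```python
from collections import Counter


def page_sense(predictions, sense_order, sense_freq=None, min_share=0.30):
    """Pick one sense for a page from per-occurrence predictions.

    predictions: sense predicted for each occurrence of the word in the page
                 (all assumed to appear in sense_order).
    sense_order: senses in Wikipedia disambiguation-page order.
    sense_freq:  optional estimated frequency per sense ("+freq" variant).
    """
    if not predictions:
        return None  # not specified in the slides; an assumption here
    counts = Counter(predictions)
    top = max(counts.values())
    if sense_freq is not None and top / len(predictions) < min_share:
        # No sense reaches the threshold: fall back to the most frequent sense.
        return max(sense_order, key=lambda s: sense_freq.get(s, 0.0))
    tied = [s for s in sense_order if counts.get(s, 0) == top]
    if sense_freq is not None and len(tied) > 1:
        return max(tied, key=lambda s: sense_freq.get(s, 0.0))
    return tied[0]  # first sense listed in the disambiguation page
```

For example, with predictions ["a", "b"] and sense_order ["b", "a"], the plain variant returns "b" (first listed), while supplying sense_freq={"a": 0.8, "b": 0.2} returns "a".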

TABLE 4:

Precision: the number of pages correctly classified divided by the total number of predictions.






USING CLASSIFICATION TO PROMOTE DIVERSITY

We fill each position in the rank (starting at rank 1) with the document which has the highest similarity to some of the senses which are not yet represented in the rank.
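A greedy re-ranking of this kind can be sketched in Python as follows. The function name diversify and the input layout (a dict of per-document similarity scores) are hypothetical. What happens once every sense is represented is not stated above; the sketch simply starts a new round over all senses, which is an assumption.

```python
def diversify(similarity, senses, k=10):
    """Greedy diversity-oriented ranking.

    similarity: dict doc_id -> dict sense -> similarity score
                (insertion order is used to break ties between documents).
    senses:     list of candidate senses for the query word.
    """
    remaining = list(similarity)
    if not senses:
        return remaining[:k]
    uncovered = set(senses)
    ranking = []
    while remaining and len(ranking) < k:
        if not uncovered:
            uncovered = set(senses)  # assumption: begin a second round
        # Best (document, sense) pair among senses not yet represented.
        candidates = [
            (d, s) for d in remaining for s in senses if s in uncovered
        ]
        doc, sense = max(
            candidates, key=lambda p: similarity[p[0]].get(p[1], 0.0)
        )
        ranking.append(doc)
        remaining.remove(doc)
        uncovered.discard(sense)
    return ranking


# Hypothetical toy scores for three documents and two senses.
sim = {
    "d1": {"animal": 0.9, "car": 0.1},
    "d2": {"animal": 0.8, "car": 0.2},
    "d3": {"animal": 0.1, "car": 0.7},
}
top2 = diversify(sim, ["animal", "car"], k=2)
```

With the toy scores, the second position goes to d3 rather than d2, because d3 is the best match for the sense that d1 left unrepresented.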

ALTERNATIVE RANKING FOR COMPARISON

clustering (centroids): this method applies Hierarchical Agglomerative Clustering.

clustering (top ranked): this time the top ranked document (in the original Google rank) of each cluster is selected.

random: randomly selects ten documents from the set of retrieved results.

upper bound: coverage is not 100% because some words have more than ten meanings in Wikipedia and we are only considering the top ten documents.

Coverage: number of senses in the top 10 / number of senses in all results.

Coverage of senses goes from 49% to 77%.

The coverage of Wikipedia senses in the top ten results is 70% larger than in the original ranking.
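The coverage measure can be written as a small Python function. This is a sketch under assumptions: the name sense_coverage is hypothetical, each ranked document carries at most one sense label (following the "one sense per document" assumption), and documents without a Wikipedia sense are marked None and ignored.

```python
def sense_coverage(ranked_senses, k=10):
    """Senses seen in the top k divided by senses seen in the whole list.

    ranked_senses: sense label of each document in rank order,
                   or None when no Wikipedia sense applies.
    """
    all_senses = {s for s in ranked_senses if s is not None}
    if not all_senses:
        return 0.0
    top_senses = {s for s in ranked_senses[:k] if s is not None}
    return len(top_senses) / len(all_senses)


# Hypothetical toy ranking: three senses overall, two of them in the top 3.
labels = ["a", "a", "b", None, "c", "a"]
coverage_at_3 = sense_coverage(labels, k=3)
```

Averaging this value over all nouns, for the original ranking and for each alternative ranking, gives figures comparable in kind to the ones reported above.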

Using Wikipedia to enhance diversity seems to work much better than clustering.

Bias: only Wikipedia senses are considered to estimate diversity. Our results do not imply that the Wikipedia-modified rank is better than the original Google rank.

Wikipedia can be used as a reference to improve search results diversity for one-word queries.

CONCLUSIONS

We have investigated whether generic lexical resources can be used to promote diversity in Web search results for one-word, ambiguous queries. We have compared WordNet and Wikipedia:

(i) unsurprisingly, Wikipedia has a much better coverage of senses in search results, and is therefore more appropriate for the task;

(ii) the distribution of senses in search results can be estimated using the internal graph structure of Wikipedia and the relative number of visits received by each sense in Wikipedia;

(iii) associating Web pages to Wikipedia senses with simple and efficient algorithms, we can produce modified rankings that cover 70% more Wikipedia senses than the original search engine rankings.
