WIKIPEDIA AS SENSE INVENTORY TO IMPROVE DIVERSITY IN WEB SEARCH RESULTS
Celina Santamaría, Julio Gonzalo and Javier Artiles
UNED (National University of Distance Education), c/Juan del Rosal, 16, 28040 Madrid, Spain
ACL 2010
MOTIVATION
Motivation
Test Set
  Set of Words
  Set of Documents
  Manual Annotation
Coverage of Web Search Results: Wikipedia vs WordNet
Diversity in Google Search Results
Sense Frequency Estimators for Wikipedia
Association of Wikipedia Senses to Web Pages
  VSM Approach
  WSD Approach
  Classification Results
  Precision/Coverage Trade-off
  Using Classification to Promote Diversity
Related Work
Conclusions
FOR VERY SHORT QUERIES
For very short queries (for example, a single word), disambiguation may not be possible.
We focus on two broad-coverage lexical resources: WordNet and Wikipedia.
TEST SET
The test set has three components: a set of words, a set of documents, and manual annotation.
SET OF WORDS
Corpus annotation: two annotators handled 40 nouns.
15 nouns from the Senseval-3 lexical sample dataset.
25 additional words which satisfy two conditions: they are all ambiguous, and they are all names of music bands in one of their senses.
CORPUS
The Senseval set is: {argument, arm, atmosphere, bank, degree, difference, disc, image, paper, party, performance, plan, shelter, sort, source}.
The Bands set is: {amazon, apple, camel, cell, columbia, cream, foreigner, fox, genesis, jaguar, oasis, pioneer, police, puma, rainbow, shell, skin, sun, tesla, thunder, total, traffic, trapeze, triumph, yes}.
TABLE 1: COVERAGE OF SEARCH RESULTS: WIKIPEDIA VS. WORDNET
For each noun in the set, we looked up all its possible senses in WordNet 3.0 and in Wikipedia disambiguation pages.
Wikipedia has an average of 22 senses per noun: 25.2 in the Bands set and 16.1 in the Senseval set.
WordNet has a much smaller figure, 4.5 senses per noun: 3.12 for the Bands set and 6.13 for the Senseval set.
SET OF DOCUMENTS
Step 1: we retrieved the top 150 documents per noun from Google.
Step 2: for each document, we stored both the snippet and the whole HTML document.
We assume "one sense per document".
MANUAL ANNOTATION
Two annotators decided, for every document, whether there were appropriate senses in each of the dictionaries.
They provided annotations for 100 documents per noun.
If a URL in the list was corrupt or not available, it had to be discarded, so 150 documents were retrieved to reach 100 annotated documents per noun.
COVERAGE OF WEB SEARCH RESULTS: WIKIPEDIA VS WORDNET
TOP TEN RESULTS NOT COVERED BY WIKIPEDIA
32% of the top ten documents are not covered by Wikipedia. These documents were manually examined:
a majority of the missing senses consists of names of (generally not well-known) companies (45%) and products or services (26%);
the other frequent type (12%) of non-annotated document is disambiguation pages.
DEGREE OF OVERLAP BETWEEN WIKIPEDIA AND WORDNET SENSES
Just 3% of the documents fit a WordNet sense only.
Wikipedia seems to extend the coverage of WordNet.
DIVERSITY IN GOOGLE SEARCH RESULTS
Diversity is not a major priority for ranking results: the top ten results only cover, on average, 3 Wikipedia senses, while the average number of senses listed in Wikipedia is 22.
In the first 100 documents, this number grows to 6.85 senses per noun.
On average, 63% of the pages in search results belong to the most frequent sense of the query word.
SENSE FREQUENCY ESTIMATORS FOR WIKIPEDIA
Wikipedia disambiguation pages do not contain the relative importance of senses for a given word. Two estimators are considered:
Internal relevance: incoming links for the URL of a given sense in Wikipedia. Stable.
External relevance: number of visits for the URL of a given sense (as reported in http://stats.grok.se). Not stable.
MEASURED CORRELATION
For each noun w and for each sense wi, we consider three values:
the proportion of documents retrieved for w which are manually assigned to sense wi
inlinks(wi): relative amount of incoming links to sense wi
visits(wi): relative number of visits to the URL for sense wi
We have measured the correlation between these three values using a linear regression correlation coefficient:
correlation value of .54 for the number of visits
correlation value of .71 for the number of incoming links
Both estimators seem to be positively correlated with the proportion of documents assigned to each sense.
The two estimators can be combined into a weighted estimator:
freq(wi) = k * inlinks(wi) + (1 - k) * visits(wi), for k = 0, 0.1, 0.2, ..., 1
When k is 0.9, the function has its maximal correlation value of .73:
freq(wi) = 0.9 * inlinks(wi) + 0.1 * visits(wi)
This weighted estimator provides a slight advantage over the use of incoming links only (.73 vs .71).
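The sweep over k can be illustrated with a minimal sketch. It assumes a plain Pearson coefficient as the correlation measure, and the function names (`pearson`, `best_k`) and the four-sense toy values are hypothetical, not data from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def best_k(inlinks, visits, observed, steps=10):
    """Try k = 0, 1/steps, ..., 1 and keep the k with the highest correlation."""
    best = None
    for i in range(steps + 1):
        k = i / steps
        # freq(wi) = k * inlinks(wi) + (1 - k) * visits(wi)
        freq = [k * a + (1 - k) * b for a, b in zip(inlinks, visits)]
        r = pearson(freq, observed)
        if best is None or r > best[1]:
            best = (k, r)
    return best

# Hypothetical relative values for four senses of one noun (illustration only).
inlinks = [0.60, 0.25, 0.10, 0.05]
visits = [0.40, 0.40, 0.15, 0.05]
observed = [0.63, 0.20, 0.12, 0.05]  # proportion of documents per sense

k, r = best_k(inlinks, visits, observed)
print(f"best k = {k:.1f}, correlation = {r:.3f}")
```

Because k = 0 and k = 1 are both part of the grid, the selected value can never correlate worse than either estimator used alone.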
TWO DIFFERENT TECHNIQUES
Two different techniques for associating Wikipedia senses with Web pages: a Vector Space Model (VSM) and a WSD system.
Two baselines: random assignment of senses and most frequent sense.
VSM APPROACH
For each word sense, we represent its Wikipedia page in a (unigram) vector space model.
idf weights are computed in two different ways:
VSM: idf is computed in the collection of retrieved documents
VSM-GT: uses the statistics provided by the Google Terabyte collection
A third variant, VSM-mixed, combines VSM and VSM-GT.
Each document is compared with the Wikipedia page of each sense using cosine similarity, and the sense with the highest similarity is assigned to the document.
In case of ties, we pick the first sense in the Wikipedia disambiguation page.
VSM-GT+freq: in case of ties, we pick the sense which has the largest frequency according to our estimator.
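The assignment step can be sketched as follows. This is a minimal illustration under stated assumptions: idf is computed over the retrieved documents (the plain VSM variant) with a `log(N/df) + 1` formula chosen here for convenience, and the helper names, sense labels and toy token lists are all hypothetical.

```python
import math
from collections import Counter

def compute_idf(docs):
    """idf over a collection of tokenized documents (assumed formula: log(N/df) + 1)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / c) + 1.0 for t, c in df.items()}

def tfidf(tokens, idf):
    """Unigram tf*idf vector; terms unseen in the collection get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_sense(doc, senses, idf, freq=None):
    """senses: list of (name, tokens) in disambiguation-page order.
    Ties go to the first listed sense, or to the most frequent one if freq is given."""
    dvec = tfidf(doc, idf)
    sims = [(name, cosine(dvec, tfidf(page, idf))) for name, page in senses]
    top = max(s for _, s in sims)
    tied = [name for name, s in sims if s == top]
    if freq is not None:
        return max(tied, key=lambda name: freq.get(name, 0.0))
    return tied[0]

# Toy data (hypothetical): three retrieved pages and two sense pages for "jaguar".
retrieved = [
    ["jaguar", "car", "engine", "dealer"],
    ["jaguar", "cat", "jungle", "prey"],
    ["jaguar", "photo", "gallery"],
]
senses = [
    ("Jaguar (animal)", ["cat", "jungle", "prey", "species"]),
    ("Jaguar Cars", ["car", "engine", "luxury"]),
]
idf = compute_idf(retrieved)
freq = {"Jaguar Cars": 0.7, "Jaguar (animal)": 0.3}  # hypothetical estimator output

first = assign_sense(retrieved[0], senses, idf)             # clear match on "car", "engine"
tie_default = assign_sense(retrieved[2], senses, idf)       # no overlap: first listed sense
tie_freq = assign_sense(retrieved[2], senses, idf, freq)    # no overlap: most frequent sense
print(first, tie_default, tie_freq, sep=" | ")
```

The third page shares no weighted term with either sense page, so both similarities are zero and the two tie-breaking rules give different answers.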
WSD APPROACH
TiMBL, a state-of-the-art supervised WSD system, uses Memory-Based Learning. Three variants differ in the examples they use:
TiMBL-core: occurrences of the word in the Wikipedia page for the word sense.
TiMBL-inlinks: occurrences of the word in Wikipedia pages pointing to the page for the word sense.
TiMBL-all: core + inlinks.
TIMBL
First, we disambiguate all occurrences of word w in the page p.
Then, we choose the sense which appears most frequently in the page according to the TiMBL results.
In case of ties, we pick the first sense listed in the Wikipedia disambiguation page.
TiMBL-core+freq: in case of ties, we pick the sense with the highest frequency according to our estimator. When no sense reaches 30% of the cases in the page to be disambiguated, we also resort to the most frequent sense heuristic.
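The page-level decision described above can be sketched independently of TiMBL itself. In this minimal sketch the per-occurrence labels are assumed to come from some WSD classifier; the function name `page_sense`, its parameters and the sense labels are hypothetical, and every label is assumed to appear in `sense_order`.

```python
from collections import Counter

def page_sense(labels, sense_order, freq=None, min_share=0.30):
    """Pick one sense for a page from the per-occurrence labels.

    labels: sense assigned to each occurrence of the word in the page.
    sense_order: senses in Wikipedia disambiguation-page order.
    freq: optional estimated sense frequencies (the "+freq" variant).
    """
    most_frequent = None
    if freq is not None:
        most_frequent = max(sense_order, key=lambda s: freq.get(s, 0.0))
    if not labels:
        return most_frequent  # nothing to vote on
    counts = Counter(labels)
    top = max(counts.values())
    # "+freq" variant: no sense reaches the threshold, use the most frequent sense.
    if freq is not None and top / len(labels) < min_share:
        return most_frequent
    tied = [s for s in sense_order if counts.get(s, 0) == top]
    if freq is not None:
        return max(tied, key=lambda s: freq.get(s, 0.0))
    return tied[0]  # default: first sense listed in the disambiguation page

# Hypothetical senses and frequencies for illustration.
order = ["landform", "band", "software", "film"]
freq = {"band": 0.6, "landform": 0.3, "software": 0.07, "film": 0.03}

print(page_sense(["band", "band", "landform"], order))            # clear majority
print(page_sense(["band", "landform"], order))                    # tie, first listed sense
print(page_sense(["band", "landform"], order, freq))              # tie, most frequent sense
print(page_sense(["landform", "band", "software", "film"], order, freq))  # all below 30%
```

Keeping the vote separate from the classifier makes the three fallback rules (list order, estimated frequency, 30% threshold) easy to compare on the same labels.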
TABLE 4:
Precision: the number of pages correctly classified divided by the total number of predictions.
USING CLASSIFICATION TO PROMOTE DIVERSITY
We fill each position in the rank (starting at rank 1) with the document which has the highest similarity to some of the senses which are not yet represented in the rank.
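A greedy re-ranking of this kind might look like the sketch below. The slide does not say what happens once every sense is represented, so filling the remaining positions in the original order is an assumption here, as are the function name `diversify`, the similarity table and the document identifiers.

```python
def diversify(sim, senses, original_rank, top_n=10):
    """Greedy re-ranking: each position takes the document most similar to
    some sense that is not yet represented in the new rank.

    sim: {doc_id: {sense: similarity}}; missing pairs count as 0.
    """
    ranked, covered = [], set()
    remaining = list(original_rank)
    while len(ranked) < top_n and remaining:
        open_senses = [s for s in senses if s not in covered]
        if not open_senses:
            break  # every sense is already represented
        doc, sense = max(
            ((d, s) for d in remaining for s in open_senses),
            key=lambda pair: sim[pair[0]].get(pair[1], 0.0),
        )
        ranked.append(doc)
        covered.add(sense)
        remaining.remove(doc)
    # Assumption: leftover positions keep the original search engine order.
    ranked.extend(remaining[: top_n - len(ranked)])
    return ranked

# Hypothetical similarities for four retrieved documents and three senses.
original_rank = ["d1", "d2", "d3", "d4"]
senses = ["fruit", "company", "band"]
sim = {
    "d1": {"company": 0.9, "fruit": 0.1},
    "d2": {"company": 0.8},
    "d3": {"fruit": 0.7},
    "d4": {"band": 0.6},
}

top3 = diversify(sim, senses, original_rank, top_n=3)
print(top3)
```

In the toy data the second "company" page (d2) is pushed out of the top three in favour of pages for the two senses that were not yet represented.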
ALTERNATIVE RANKING FOR COMPARISON
clustering (centroids): this method applies Hierarchical Agglomerative Clustering.
clustering (top ranked): this time the top ranked document (in the original Google rank) of each cluster is selected.
random: randomly selects ten documents from the set of retrieved results.
upper bound: coverage is not 100% because some words have more than ten meanings in Wikipedia and we are only considering the top ten documents.
Coverage: number of senses in the top 10 / number of senses in all results.
Coverage of senses goes from 49% to 77%; the coverage of Wikipedia senses in the top ten results is 70% larger than in the original ranking.
Using Wikipedia to enhance diversity seems to work much better than clustering.
Bias: only Wikipedia senses are considered to estimate diversity, so our results do not imply that the Wikipedia modified rank is better than the original Google rank.
Wikipedia can be used as a reference to improve search results diversity for one-word queries.
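The coverage measure defined above can be computed with a few lines. This is a minimal sketch: the function name `sense_coverage` is hypothetical, documents without a Wikipedia sense are assumed to be marked `None`, and the two toy rankings are invented for illustration.

```python
def sense_coverage(ranked_senses, top_n=10):
    """Distinct senses in the top_n results divided by distinct senses in all results.

    ranked_senses: one sense label per document, in rank order
    (None for documents with no Wikipedia sense).
    """
    all_senses = {s for s in ranked_senses if s is not None}
    top_senses = {s for s in ranked_senses[:top_n] if s is not None}
    return len(top_senses) / len(all_senses) if all_senses else 0.0

# Hypothetical rankings over the same twelve documents.
original = ["a", "a", "a", "b", "a", None, "a", "a", "a", "a", "c", "d"]
reranked = ["a", "b", "c", "d", "a", "a", "a", "a", "a", "a", "a", None]

print(sense_coverage(original))  # two of four senses in the top ten
print(sense_coverage(reranked))  # all four senses in the top ten
```

As the slide on bias notes, a higher value only says that more Wikipedia senses are visible in the top ten, not that the ranking is better overall.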
CONCLUSIONS
We have investigated whether generic lexical resources can be used to promote diversity in Web search results for one-word, ambiguous queries. We have compared WordNet and Wikipedia and reached three conclusions:
(i) unsurprisingly, Wikipedia has a much better coverage of senses in search results, and is therefore more appropriate for the task;
(ii) the distribution of senses in search results can be estimated using the internal graph structure of Wikipedia and the relative number of visits received by each sense in Wikipedia;
(iii) by associating Web pages to Wikipedia senses with simple and efficient algorithms, we can produce modified rankings that cover 70% more Wikipedia senses than the original search engine rankings.