Date post: | 28-Dec-2015 |
Upload: | annabella-williams |
1
Wikification
CSE 6339 (Section 002)
Abhijit Tendulkar
2
Wikify! Linking Documents to Encyclopedic Knowledge. R. Mihalcea and A. Csomai
Learning to Link with Wikipedia. D. Milne and I. H. Witten
3
What is Wikification
• Automatic keyword extraction
• Word sense disambiguation
• Automatic cross-referencing of documents (unstructured text) with Wikipedia.
4
Wikify! - Introduction
• Introduces the annotation of documents by linking them with Wikipedia.
• Possible applications: the semantic web, educational applications, and a number of text processing problems.
• Previous similar work: Microsoft Smart Tags and Google AutoLink, which are based merely on word or phrase lookup (no keyword extraction or disambiguation).
5
Wikify! - Text Wikification
6
Wikify! - Keyword Extraction
• Recommendations from the Wikipedia style manual: link terms that provide a deeper understanding of the topic, avoid linking unrelated terms, and select an appropriate number of keywords.
• Unsupervised algorithms involve two steps:
– Candidate extraction: extract all possible n-grams.
– Keyword ranking: assign a numeric value to each candidate. Three methods were used: tf-idf, the χ² independence test, and keyphraseness.
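The two steps above can be sketched as follows. This is a minimal illustration, not the Wikify! implementation: it assumes keyphraseness is estimated as the share of documents containing a term in which that term was selected as a keyword, and the function names and all counts are hypothetical.

```python
def extract_candidates(tokens, max_n=3):
    """Candidate extraction: every n-gram of length 1..max_n, as a string."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates


def keyphraseness(term, keyword_doc_count, total_doc_count):
    """Assumed estimate: docs where the term is a keyword / docs containing it."""
    total = total_doc_count.get(term, 0)
    if total == 0:
        return 0.0
    return keyword_doc_count.get(term, 0) / total


def rank_candidates(tokens, keyword_doc_count, total_doc_count, max_n=3):
    """Keyword ranking: score each unique candidate, highest score first."""
    unique = set(extract_candidates(tokens, max_n))
    scored = [(t, keyphraseness(t, keyword_doc_count, total_doc_count))
              for t in unique]
    # Sort by descending score, then alphabetically to make ties deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical document-frequency counts, for illustration only.
keyword_docs = {"semantic web": 80, "linked data": 30, "the": 1}
all_docs = {"semantic web": 100, "linked data": 60, "the": 10000}
tokens = "the semantic web uses linked data".split()
ranking = rank_candidates(tokens, keyword_docs, all_docs)
```

With these made-up counts, "semantic web" ranks first and "linked data" second, while the stop word "the" scores close to zero.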
7
Wikify! - Evaluation of Keyword Extraction
8
Wikify! - Word Sense Disambiguation
• Ambiguity is inherent to human language.
• Disambiguation algorithms:
– Knowledge-based: rely exclusively on knowledge derived from dictionaries.
– Data-driven: based on probabilities collected from sense-annotated data.
• Here a voting scheme is used, which seeks agreement between the two approaches.
• Wikify! provides highly precise annotations, even if recall is lower.
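A minimal sketch of such a voting scheme, assuming each disambiguator returns a single sense label (or `None` when it has no answer). The function name and the sense labels are hypothetical.

```python
def vote(knowledge_based_sense, data_driven_sense):
    """Return a sense only when both disambiguators agree; otherwise abstain."""
    if knowledge_based_sense is None or data_driven_sense is None:
        return None
    if knowledge_based_sense == data_driven_sense:
        return knowledge_based_sense
    return None  # disagreement: no annotation is produced


agreed = vote("Bar (law)", "Bar (law)")
abstained = vote("Bar (law)", "Bar (music)")
```

Abstaining on disagreement is what trades recall for precision: fewer words get annotated, but the annotations that remain are backed by both methods.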
9
Wikify! - Disambiguation Evaluation
Word sense disambiguation results: total number of attempted (A) and correct (C) word senses, together with precision (P), recall (R) and F-measure (F) evaluations.
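The measures named in the caption relate to each other as in the sketch below. It assumes the standard definitions, with recall measured against the total number of word senses in the gold standard (a quantity the caption does not name); the example numbers are made up.

```python
def precision_recall_f(attempted, correct, total):
    """P = C / A, R = C / total, F = harmonic mean of P and R."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 80 senses attempted, 60 correct, 100 in the gold standard.
p, r, f = precision_recall_f(attempted=80, correct=60, total=100)
```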
10
Wikify! - Overall Evaluation and Conclusion
• Wikify! lets the user upload a text file or supply the URL of a webpage, processes the document provided, and returns the wikified version of the document.
• The user can also set the keyword density in the range 2%-10%; the default is 6%.
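• In an evaluation with human evaluators (20 users evaluating 10 documents each), the manually annotated version was correctly identified in only 57% of the cases (50% would be the ideal case, meaning the automatic annotations are indistinguishable from manual ones).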
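The keyword density option can be read as a cap on how many ranked candidates are kept. The sketch below assumes density is the ratio of selected keywords to words in the document; the function name and the rounding rule are assumptions.

```python
def select_keywords(ranked_candidates, word_count, density=0.06):
    """Keep the top-ranked candidates so that keywords / words is about `density`."""
    if not 0.02 <= density <= 0.10:
        raise ValueError("density must lie between 2% and 10%")
    limit = int(round(word_count * density))
    return ranked_candidates[:limit]


# Hypothetical input: 50 ranked candidates from a 100-word document.
selected = select_keywords([f"term{i}" for i in range(50)], word_count=100)
```

With the default of 6%, a 100-word document keeps its six best-ranked candidates.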
11
Learning to Link with Wikipedia
• Machine learning approach to identify significant terms within unstructured text.
• It can provide structured knowledge about any unstructured text.
• Uses Wikipedia articles as training data, which improves recall and precision.
12
Snapshot of Wikified document
13
Learning to Disambiguate Links
• Uses disambiguation to inform detection.
• Features such as the commonness and relatedness of the term are used as measures to resolve ambiguity.
• The commonness of a sense is defined by the number of times it is used as a link destination in Wikipedia articles.
• Link probability = (No. of times term is used as link) / (No. of times term appears in Wikipedia articles)
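The two ratios on this slide can be sketched side by side. The sketch assumes commonness is the share of a term's links that point to a given sense, and that the link/appearance ratio is computed from simple occurrence counts; all names and counts are hypothetical.

```python
def commonness(destination_counts, sense):
    """Share of the term's links whose destination is `sense`."""
    total_links = sum(destination_counts.values())
    if total_links == 0:
        return 0.0
    return destination_counts.get(sense, 0) / total_links


def link_ratio(times_used_as_link, times_appearing):
    """Times the term is used as a link / times it appears at all."""
    if times_appearing == 0:
        return 0.0
    return times_used_as_link / times_appearing


# Hypothetical destinations of links whose anchor text is "tree".
tree_destinations = {"Tree": 93, "Tree (data structure)": 5, "Tree (set theory)": 2}
plant_sense = commonness(tree_destinations, "Tree")
structure_sense = commonness(tree_destinations, "Tree (data structure)")
ratio = link_ratio(times_used_as_link=20, times_appearing=400)
```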
14
Disambiguation (Continued)
• Relatedness is given by the following formula:
relatedness(a, b) = (log(max(|A|, |B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|, |B|)))
where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia.
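A sketch of this measure, assuming the link-overlap formula from Milne and Witten's paper: (log(max(|A|, |B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|, |B|))). As written, that expression is 0 when the two link sets are identical and grows as the overlap shrinks, so the sketch also returns 1 minus the value, clipped at 0, as a similarity score. That conversion and the function names are assumptions, not something the slides state.

```python
import math


def link_distance(links_to_a, links_to_b, wikipedia_size):
    """Evaluate the formula exactly as written (smaller means more related)."""
    a, b = set(links_to_a), set(links_to_b)
    overlap = len(a & b)
    if overlap == 0:
        return math.inf  # log(0) is undefined: no shared in-links at all
    numerator = math.log(max(len(a), len(b))) - math.log(overlap)
    denominator = math.log(wikipedia_size) - math.log(min(len(a), len(b)))
    if denominator == 0:
        return 0.0  # both sets cover all of Wikipedia
    return numerator / denominator


def relatedness_score(links_to_a, links_to_b, wikipedia_size):
    """Assumed conversion to a similarity in [0, 1]: 1 - distance, clipped at 0."""
    return max(0.0, 1.0 - link_distance(links_to_a, links_to_b, wikipedia_size))


# Hypothetical article ids linking to a and to b, in a 1000-article Wikipedia.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
distance = link_distance(A, B, wikipedia_size=1000)
score = relatedness_score(A, B, wikipedia_size=1000)
```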
15
Disambiguation (Continued)
Commonness and Relatedness
16
Disambiguation (Continued)
• Not all context terms are equally useful, so each context term is assigned a weight that is the average of its link probability and its relatedness to the other context terms.
• The above features are combined, and a further feature, context quality, is defined as the sum of the weights previously assigned to each context term.
• These features are used to train the classifier.
• To configure the classifier, a parameter specifying the minimum probability of a sense is used.
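The weighting described above can be sketched as follows. Each context term's weight is taken as the average of its link probability and its relatedness, context quality is the sum of those weights, and senses below a minimum probability are dropped before classification. All names and numbers are hypothetical.

```python
def context_weight(link_probability, relatedness):
    """Weight of one context term: average of the two measures."""
    return (link_probability + relatedness) / 2


def context_quality(context_terms):
    """Sum of the weights of all context terms.

    `context_terms` is a list of (link_probability, relatedness) pairs.
    """
    return sum(context_weight(lp, rel) for lp, rel in context_terms)


def plausible_senses(sense_probabilities, min_sense_probability):
    """Drop candidate senses whose probability is below the configured minimum."""
    return {sense: p for sense, p in sense_probabilities.items()
            if p >= min_sense_probability}


# Hypothetical values, for illustration only.
quality = context_quality([(0.2, 0.6), (0.5, 0.5)])
kept = plausible_senses({"Tree": 0.93, "Tree (set theory)": 0.004},
                        min_sense_probability=0.01)
```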
17
Disambiguation Evaluation
• The disambiguation classifier was trained on 500 articles (instead of the entire Wikipedia) on a modest desktop with a 3 GHz dual-core processor and 4 GB of RAM.
• The classifier was configured using 100 Wikipedia articles.
• It was trained in 13 minutes and tested in 4 minutes; another 3 minutes were required to load the required summaries of Wikipedia's link structure and anchor statistics into memory.
• To evaluate the classifier, 11,000 anchors were gathered from 100 random articles.
18
Disambiguation Evaluation
19
Learning to Detect Links
• The central difference between Wikify!'s link detection approach and this new link detector: Wikify! relies exclusively on link probability, whereas the new approach also takes the context surrounding the terms into consideration.
• This link detector discards only terms with a very low link probability, so that nonsense phrases and stop words are removed.
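The discarding step can be sketched as a simple filter on link probability. The function name, the minimum value, and the example probabilities are all hypothetical.

```python
def discard_unlikely_candidates(link_probabilities, min_link_probability):
    """Keep only the terms whose link probability reaches the minimum."""
    return {term: p for term, p in link_probabilities.items()
            if p >= min_link_probability}


# Hypothetical link probabilities for candidate terms in a document.
candidates = {"the": 0.0001, "of the": 0.0, "semantic web": 0.35, "tree": 0.04}
survivors = discard_unlikely_candidates(candidates, min_link_probability=0.01)
```

Everything that survives this filter is passed on to the later stages, which weigh the surrounding context instead of relying on link probability alone.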
20
Features used for Link Detection
• Link probability: the average link probability is considered.
• Relatedness: semantic relatedness, i.e. the average relatedness between each topic and all other candidates.
• Disambiguation confidence
• Generality
• Location and spread
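A sketch of how these features might be gathered into one record per candidate topic. The field names are hypothetical, and location and spread are assumed here to be the first and last occurrence of the topic and the distance between them, each normalized by document length.

```python
from dataclasses import dataclass


@dataclass
class LinkFeatures:
    avg_link_probability: float
    avg_relatedness: float
    disambiguation_confidence: float
    generality: int
    first_occurrence: float
    last_occurrence: float
    spread: float


def location_and_spread(positions, document_length):
    """Normalized first occurrence, last occurrence, and the distance between them."""
    if not positions or document_length <= 0:
        raise ValueError("need at least one position and a positive length")
    first = min(positions) / document_length
    last = max(positions) / document_length
    return first, last, last - first


# Hypothetical topic mentioned at word offsets 10, 50 and 90 of a 100-word text.
first, last, spread = location_and_spread([10, 50, 90], document_length=100)
features = LinkFeatures(
    avg_link_probability=0.3,
    avg_relatedness=0.6,
    disambiguation_confidence=0.9,
    generality=4,
    first_occurrence=first,
    last_occurrence=last,
    spread=spread,
)
```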
21
Link Detector
22
Link Detector Performance
• The same dataset as for the disambiguation classifier was used for training, configuration, and evaluation.
• The minimum link probability was set to 6.5%, the point at which recall and precision balance.
• The link detector was trained on unambiguous terms.
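One way to find the point where recall and precision balance is to sweep candidate thresholds and pick the one with the smallest gap between the two. This is only a sketch of that idea; the data and the function names are hypothetical, and it is not claimed to be how the 6.5% figure was obtained.

```python
import math


def precision_recall_at(threshold, candidates):
    """`candidates` is a list of (link_probability, is_gold_link) pairs."""
    predicted = [gold for p, gold in candidates if p >= threshold]
    true_positives = sum(predicted)
    total_gold = sum(gold for _, gold in candidates)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / total_gold if total_gold else 0.0
    return precision, recall


def balance_threshold(candidates, thresholds):
    """Threshold at which |precision - recall| is smallest."""
    def gap(threshold):
        precision, recall = precision_recall_at(threshold, candidates)
        if precision == 0 and recall == 0:
            return math.inf  # nothing correct is predicted: not a real balance
        return abs(precision - recall)
    return min(thresholds, key=gap)


# Hypothetical candidates: (link probability, whether the gold data links it).
data = [(0.9, True), (0.8, True), (0.6, False),
        (0.4, True), (0.2, False), (0.1, False)]
best = balance_threshold(data, thresholds=[0.1, 0.3, 0.5, 0.7])
```

On this toy data, precision and recall are both 2/3 at a threshold of 0.5, so that value is selected.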
23
Link Detector Performance (Continued)
24
Wikification in the Wild
• The system was tested on news articles instead of Wikipedia articles, and it achieved 76.4% accuracy in link detection.
25
Conclusions
• This system resolves ambiguity as well as polysemy.
• A common hurdle in all such applications: they must somehow move from unstructured text to a collection of relevant Wikipedia articles.
• This paper has contributed a proven method for extracting key concepts from plain text.
• Ultimately, these are attempts to explain and organize the sum total of human knowledge.
26
Application on itself
27
Questions?