Wikification CSE 6339 (Section 002) Abhijit Tendulkar

Date post: 28-Dec-2015

Wikification

CSE 6339 (Section 002)

Abhijit Tendulkar

Wikify! Linking Documents to Encyclopedic Knowledge. R. Mihalcea and A. Csomai

Learning to Link with Wikipedia. D. Milne and I. H. Witten

What is Wikification

• Automatic keyword extraction

• Word sense disambiguation

• Automatically cross-references documents (unstructured text) with Wikipedia.

Wikify! - Introduction

• Introduces annotation of documents by linking them with Wikipedia

• Possible applications include the semantic web and educational tools; it is also useful in a number of text processing problems.

• Previous similar work: Microsoft Smart Tags and Google AutoLink, which are based merely on word or phrase lookup (no keyword extraction or disambiguation).

Wikify! - Text Wikification

Wikify! - Keyword Extraction

• Recommendations from the Wikipedia style manual: link terms that provide a deeper understanding of the topic, avoid linking unrelated terms, and select an appropriate number of keywords.

• Unsupervised algorithms involve two steps:

– Candidate extraction: extract all possible n-grams.

– Keyword ranking: assign a numeric value to each candidate. Three methods were used: tf-idf, χ² (chi-squared) independence test, and keyphraseness.
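
The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes keyphraseness is estimated as the number of documents in which a term is used as a link divided by the number of documents in which it appears, and the function names and the counts in `LINK_COUNTS` and `DOC_COUNTS` are invented for the example.

```python
def extract_candidates(tokens, max_n=3):
    """Return every n-gram (as a space-joined string) up to max_n tokens long."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates


def keyphraseness(term, link_counts, doc_counts):
    """Assumed estimate: docs where the term is a link / docs where it appears."""
    appearances = doc_counts.get(term, 0)
    if appearances == 0:
        return 0.0
    return link_counts.get(term, 0) / appearances


def rank_keywords(tokens, link_counts, doc_counts, max_n=3):
    """Score each distinct candidate found in the vocabulary, best first."""
    candidates = set(extract_candidates(tokens, max_n))
    scored = [(c, keyphraseness(c, link_counts, doc_counts))
              for c in candidates if c in doc_counts]
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical counts, invented purely for illustration.
LINK_COUNTS = {"machine learning": 80, "learning": 5, "machine": 10}
DOC_COUNTS = {"machine learning": 100, "learning": 500, "machine": 200}

ranked = rank_keywords("machine learning is fun".split(), LINK_COUNTS, DOC_COUNTS)
```

With these made-up counts the bigram "machine learning" ranks above its component unigrams, because it is linked in a much larger share of the documents that mention it.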

Wikify! - Evaluation of Keyword Extraction

Wikify! - Word Sense Disambiguation

• Ambiguity is inherent to human language.

• Disambiguation algorithms:

– Knowledge-based: rely exclusively on knowledge derived from dictionaries.

– Data-driven: based on probabilities collected from sense-annotated data.

• Here a voting scheme is used that seeks agreement between the two approaches.

• Wikify! provides highly precise annotations, even if recall is lower.
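
One way to read the voting scheme is that a term is annotated only when both disambiguators pick the same sense. The sketch below assumes exactly that; the `vote` function and the example predictions are hypothetical and do not come from the paper.

```python
from typing import Optional


def vote(knowledge_based_sense: Optional[str],
         data_driven_sense: Optional[str]) -> Optional[str]:
    """Return a sense only when both disambiguators agree; otherwise abstain."""
    if knowledge_based_sense is not None and knowledge_based_sense == data_driven_sense:
        return knowledge_based_sense
    return None


# Hypothetical outputs of the two methods: (knowledge-based, data-driven).
predictions = {
    "bar": ("Bar (music)", "Bar (music)"),
    "plant": ("Plant", "Factory"),
}
annotations = {term: vote(kb, dd) for term, (kb, dd) in predictions.items()}
```

Abstaining on disagreement is what trades recall for precision: "plant" is left unannotated here rather than risk a wrong link.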

Wikify! - Disambiguation Evaluation

Word sense disambiguation results: total number of attempted (A) and correct (C) word senses, together with precision (P), recall (R) and F-measure (F) evaluations.
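
The measures named in the caption can be computed from the attempted and correct counts as in the sketch below. It assumes the standard definitions (precision = C / A, recall = C / total, F = harmonic mean of the two); the `total` argument, the number of ambiguous words in the gold standard, is an assumption that the slide does not name, and the figures are invented.

```python
def precision_recall_f(attempted: int, correct: int, total: int):
    """Return (precision, recall, F-measure) for a disambiguation run."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Invented figures: 80 senses attempted, 72 correct, 100 ambiguous words overall.
p, r, f = precision_recall_f(attempted=80, correct=72, total=100)
```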

Wikify! - Overall Evaluation and Conclusion

• Wikify! lets the user upload a text file or supply the URL of a web page, processes the document, and returns the wikified version.

• The user can also set the keyword density in the range 2%-10%; the default is 6%.

• In a human evaluation (20 users evaluating 10 documents each), the evaluators correctly identified the human-annotated version in only 57% of cases (50% would be the ideal case, meaning the automatic and manual annotations are indistinguishable).
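
The keyword density option mentioned above can be pictured as a cap on how many ranked keywords are kept. The sketch assumes density means the ratio of linked keywords to the total number of words in the document; `select_by_density` is a hypothetical helper, not part of the Wikify! system.

```python
def select_by_density(ranked_terms, word_count, density=0.06):
    """Keep the top-ranked terms so that keywords / words is about `density`."""
    if not 0.02 <= density <= 0.10:
        raise ValueError("density must be between 0.02 and 0.10")
    limit = round(word_count * density)
    return ranked_terms[:limit]


# Ten hypothetical candidates, already sorted best first, for a 50-word document.
candidates = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"]
default_selection = select_by_density(candidates, word_count=50)
dense_selection = select_by_density(candidates, word_count=50, density=0.10)
```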

Learning to Link with Wikipedia

• Machine learning approach to identify significant terms within unstructured text.

• It can provide structured knowledge about any unstructured text.

• Uses Wikipedia articles as training data, which improves recall and precision.

Snapshot of Wikified document

Learning to Disambiguate Links

• Uses disambiguation to inform detection.

• Features such as commonness and relatedness of the term are used as measures to resolve ambiguity.

• Commonness of a sense is defined by the number of times Wikipedia articles use it as the destination of the term's links.

• Commonness of a sense = (No. of times the term links to that sense) / (No. of times the term is used as a link)
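
As a rough illustration, commonness can be computed from anchor statistics as in the sketch below. It takes commonness to be the share of a term's link occurrences that point to each candidate sense, which is an assumption about the intended definition; the `ANCHOR_STATS` counts are invented.

```python
def commonness(anchor_stats, term):
    """Map each sense of `term` to its share of the term's link occurrences."""
    senses = anchor_stats.get(term, {})
    total = sum(senses.values())
    if total == 0:
        return {}
    return {sense: count / total for sense, count in senses.items()}


# Hypothetical link counts: how often the anchor "tree" points to each article.
ANCHOR_STATS = {
    "tree": {"Tree": 90, "Tree (data structure)": 8, "Tree (graph theory)": 2},
}

scores = commonness(ANCHOR_STATS, "tree")
most_common = max(scores, key=scores.get)
```

Commonness alone always picks the dominant sense, which is why it is combined with relatedness to the surrounding context.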

Disambiguation (Continued)

• Relatedness is given by the following formula:

relatedness(a, b) = (log(max(|A|, |B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|, |B|)))

where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia.
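
The sketch below evaluates the link-based measure from Milne and Witten's paper, (log(max(|A|, |B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|, |B|))), on sets of in-linking article ids. The function name, the handling of empty overlap, and the example sets are assumptions made for illustration. Evaluated literally, the expression is 0 for identical link sets and grows as the overlap shrinks, so a score in which higher means more related would need a further conversion such as subtracting it from 1 (an assumption, not something the slides state).

```python
import math


def link_measure(links_to_a, links_to_b, total_articles):
    """Evaluate the link-based measure for two articles from their in-link sets."""
    size_a, size_b = len(links_to_a), len(links_to_b)
    overlap = len(links_to_a & links_to_b)
    if overlap == 0:
        # log(0) is undefined; treat articles with no shared in-links as maximally distant.
        return math.inf
    numerator = math.log(max(size_a, size_b)) - math.log(overlap)
    denominator = math.log(total_articles) - math.log(min(size_a, size_b))
    if denominator == 0:
        # Both sets cover every article, so they are identical.
        return 0.0
    return numerator / denominator


# Hypothetical in-link sets (article ids) in a 16-article collection.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8, 9, 10}
partial = link_measure(A, B, total_articles=16)
identical = link_measure(A, A, total_articles=16)
disjoint = link_measure({1, 2}, {3, 4}, total_articles=16)
```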

Disambiguation (Continued)

Commonness and Relatedness

Disambiguation (Continued)

• Not all context terms are equally useful, so each context term is assigned a weight: the average of its link probability (how often the term is used as a link when it appears in Wikipedia articles) and its relatedness to the other context terms.

• Commonness and relatedness are combined with a third feature, context quality, defined as the sum of the weights previously assigned to each context term.

• These features are used to train the classifier.

• To configure the classifier, a parameter specifying the minimum probability of a sense is used.
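
The weighting and the context quality feature described above amount to a small calculation, sketched here. The function names and the numbers in `CONTEXT` are hypothetical; each context term is assumed to arrive with a precomputed link probability and an average relatedness score.

```python
def context_weight(link_probability, avg_relatedness):
    """Weight of one context term: the mean of its two scores."""
    return (link_probability + avg_relatedness) / 2


def context_quality(context_terms):
    """Sum of the weights of all context terms."""
    return sum(context_weight(lp, rel) for lp, rel in context_terms.values())


# Hypothetical context terms: (link probability, average relatedness).
CONTEXT = {
    "forest": (0.4, 0.8),   # informative term, strongly related to the others
    "the": (0.0, 0.1),      # almost never linked, weakly related
}

weights = {term: context_weight(lp, rel) for term, (lp, rel) in CONTEXT.items()}
quality = context_quality(CONTEXT)
```

A low context quality signals that the surrounding terms say little about the topic, in which case a classifier may do better to lean on commonness.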

Disambiguation Evaluation

• The disambiguation classifier was trained on 500 articles (instead of the entire Wikipedia) on a modest desktop with a 3 GHz dual-core processor and 4 GB of RAM.

• The classifier was configured using 100 Wikipedia articles.

• It was trained in 13 minutes and tested in 4 minutes; another 3 minutes were required to load the required summaries of Wikipedia’s link structure and anchor statistics into memory.

• To evaluate the classifier, 11,000 anchors were gathered from 100 random articles.

Disambiguation Evaluation

Learning to Detect Links

• Central difference between Wikify’s link detection approach and this new link detector: Wikify relies exclusively on link probability, whereas the new approach also takes the context surrounding the terms into consideration.

• This link detector discards only terms with very low link probability, so that nonsense phrases and stop words are removed.

Features used for Link Detection

• Link probability: It considers average link probability.

• Relatedness: semantic relatedness, average relatedness between each topic and all other candidates.

• Disambiguation confidence

• Generality

• Location and spread

Link Detector

Link Detector Performance

• The same dataset as for the disambiguation classifier was used for training, configuration and evaluation.

• A minimum link probability of 6.5% was chosen, because recall and precision balance at that point.

• The link detector was trained on unambiguous terms.
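
The 6.5% figure can be read as a pre-filter applied before the learned link detector sees any candidates. The sketch below assumes that reading; `filter_candidates` and the probabilities are hypothetical.

```python
def filter_candidates(link_probabilities, min_link_probability=0.065):
    """Drop candidate terms whose link probability falls below the threshold."""
    return {term: p for term, p in link_probabilities.items()
            if p >= min_link_probability}


# Hypothetical link probabilities for candidate terms in a document.
CANDIDATES = {"the": 0.0001, "of the": 0.001, "machine learning": 0.4}

kept = filter_candidates(CANDIDATES)
```

Only the surviving terms would then be passed to the classifier together with the other features (relatedness, disambiguation confidence, generality, location and spread).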

Link Detector Performance (Continued)

Wikification in the Wild

• This system was tested on news articles instead of Wikipedia articles, and it achieved 76.4% accuracy in link detection.

Conclusions

• This system resolves ambiguity as well as polysemy.

• Common hurdle in all such applications: they must somehow move from unstructured text to a collection of relevant Wikipedia articles.

• This paper has contributed a proven method for extracting key concepts from plain text.

• Finally, these are attempts to explain and organize the sum total of human knowledge.

Application on itself

Questions?

