
Natural Language Processing for Information Retrieval

Date post: 14-Jan-2016
Upload: misae
Transcript
Page 1: Natural Language Processing  for Information Retrieval

Natural Language Processing

for Information Retrieval

-KVMV Kiran (04005031)

-Neeraj Bisht (04005035)

-L.Srikanth (04005029)

Page 2: Natural Language Processing  for Information Retrieval

OUTLINE

- What is Information Retrieval (IR)?
- Approaches to IR
- Evaluation of IR methods
- Statistical IR methods
- Linguistic IR methods
- Conclusion
- Q&A

Page 3: Natural Language Processing  for Information Retrieval

What is Information Retrieval?

- Retrieving information media whose information content is relevant to a user's information need.
- Information media can be text, documents, images, or videos.
- Used for:
  - Searching
  - Organization

Page 5: Natural Language Processing  for Information Retrieval

Approaches to IR

- Two types of retrieval:
  - By metadata (subject, heading, keywords, etc.)
  - By content
- Metadata can be:
  - Manually assigned
  - Automatically assigned
- Content-based IR is the more successful of the two.

Page 7: Natural Language Processing  for Information Retrieval

Evaluation of IR methods

- Precision: proportion of the retrieved set that is relevant
  Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)
- Recall: probability that a relevant document is retrieved by the query
  Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)

Page 8: Natural Language Processing  for Information Retrieval

Example

- 1000 documents: 400 relevant and 600 non-relevant to a query.
- An IR procedure retrieves 75 relevant and 25 non-relevant documents.
- Precision = 75/100 = 0.75
- Recall = 75/400 = 0.1875
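The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function and variable names are illustrative and do not come from the slides.

```python
def precision(relevant_retrieved: int, retrieved: int) -> float:
    """|relevant & retrieved| / |retrieved|"""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved: int, relevant: int) -> float:
    """|relevant & retrieved| / |relevant|"""
    return relevant_retrieved / relevant


# Numbers taken from the example slide.
relevant_total = 400          # relevant documents in the collection
retrieved_relevant = 75       # relevant documents that were retrieved
retrieved_nonrelevant = 25    # non-relevant documents that were retrieved
retrieved_total = retrieved_relevant + retrieved_nonrelevant

p = precision(retrieved_relevant, retrieved_total)
r = recall(retrieved_relevant, relevant_total)
print(p, r)  # 0.75 0.1875
```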

Page 9: Natural Language Processing  for Information Retrieval

Evaluating IR methods

- It is trivial to achieve a recall of one (retrieve every document).
- Precision tends to decrease as recall increases.
- A good IR procedure should keep both high.
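Why a recall of one is trivial can be seen by retrieving the whole collection. The sketch below reuses the sizes from the earlier example (1000 documents, 400 relevant); which document ids count as relevant is an arbitrary assumption made for illustration.

```python
# Assumption: documents are identified by the integers 0..999,
# and ids 0..399 are the relevant ones.
collection = set(range(1000))
relevant = set(range(400))

# "Retrieve everything" strategy.
retrieved = collection

hits = relevant & retrieved
recall = len(hits) / len(relevant)
precision = len(hits) / len(retrieved)
print(recall, precision)  # 1.0 0.4
```

Recall is perfect, but precision falls to the share of relevant documents in the collection, which is why both measures have to be considered together.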

Page 10: Natural Language Processing  for Information Retrieval

Content based IR

- Two approaches:
  - Statistical
  - Linguistic

Page 12: Natural Language Processing  for Information Retrieval

Statistical IR

- A simple approach based on the "bag of words" model.
- All words in a document are treated as its index terms.
- Each term is assigned a weight as a function of its importance, usually determined by its frequency of appearance.
- Retrieval matches the document's words against those of the query.
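A minimal sketch of the bag-of-words idea follows. The scoring rule (sum of the document frequencies of the query terms) and the sample texts are assumptions chosen for illustration; the slides do not prescribe a particular matching function.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Treat every whitespace-separated word as an index term with its count."""
    return Counter(text.lower().split())


def overlap_score(doc_bag: Counter, query_bag: Counter) -> int:
    """Sum the document frequencies of the terms that appear in the query."""
    return sum(doc_bag[term] for term in query_bag)


# Hypothetical two-document collection.
docs = {
    "d1": "information retrieval finds relevant information",
    "d2": "stock markets fell sharply",
}
query = bag_of_words("information retrieval")

scores = {name: overlap_score(bag_of_words(text), query) for name, text in docs.items()}
print(scores)  # {'d1': 3, 'd2': 0}
```

Word order and grammar play no role here: only which words occur, and how often.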

Page 13: Natural Language Processing  for Information Retrieval

Statistical IR(cont..)

Stages in statistical IR:

- Document preprocessing: preparing the documents for parameterisation by eliminating any elements considered superfluous.
- Parameterisation: once the relevant terms have been identified, quantifying the document's characteristics (that is, its terms).

Page 14: Natural Language Processing  for Information Retrieval

Statistical IR(cont..)

An example: an XML document.

Page 15: Natural Language Processing  for Information Retrieval

Statistical IR(cont..)

Preprocessing phases:

- Remove elements that are not meant for indexing, such as tags and headers.
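As a sketch of this phase, the snippet below strips the markup from a small XML document and skips a header element entirely. The sample document, the element names and the `SKIP_TAGS` set are hypothetical; the XML document shown on the original slide is not reproduced in this transcript.

```python
import xml.etree.ElementTree as ET

# Assumption: elements with these tag names carry no indexable content.
SKIP_TAGS = {"header"}


def indexable_text(xml_source: str) -> str:
    """Return the text of an XML document without tags or skipped elements."""
    root = ET.fromstring(xml_source)
    parts = []

    def walk(elem):
        if elem.tag in SKIP_TAGS:
            return
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
            # Text that follows a child element belongs to the parent.
            if child.tail:
                parts.append(child.tail)

    walk(root)
    # Normalise whitespace left over from the removed markup.
    return " ".join(" ".join(parts).split())


sample = (
    "<doc><header>ID 42 2007-01-01</header><title>NLP for IR</title>"
    "<body>Retrieval of <b>relevant</b> documents.</body></doc>"
)
print(indexable_text(sample))  # NLP for IR Retrieval of relevant documents.
```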

Page 16: Natural Language Processing  for Information Retrieval

Statistical IR(cont..)

Text standardising:

- Convert to lower case.
- Remove numerals and dates.
- Remove words that appear in stopword lists:
  - a stopword list is a list of empty words (prepositions, determiners, pronouns, etc.) considered to have little semantic value.
- Identify n-grams:
  - identify words that usually occur together (compound words, proper nouns, etc.) so that they can be processed as a single conceptual unit;
  - done by estimating the probability that two words that often occur together make up a single term (compound), e.g. Artificial Intelligence, European Union.
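The standardising steps above can be sketched as follows. The stopword list, the sample texts and the thresholds are assumptions made for illustration, and the compound test used here (the estimated probability that the second word follows the first) is only one simple way to estimate that two words form a single term.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "in", "a", "is", "and", "on"}


def standardise(text: str) -> list:
    """Lower-case, drop tokens containing digits (numerals, dates), drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if not any(ch.isdigit() for ch in t)]
    return [t for t in tokens if t not in STOPWORDS]


def frequent_bigrams(token_lists, min_count=2, min_prob=0.5):
    """Keep word pairs (w1, w2) seen at least min_count times
    whose estimated P(w2 follows w1) is at least min_prob."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        pair
        for pair, n in bigrams.items()
        if n >= min_count and n / unigrams[pair[0]] >= min_prob
    }


texts = [
    "The European Union funds artificial intelligence research.",
    "Artificial intelligence in the European Union grew in 2016.",
    "Union members discuss intelligence.",
]
token_lists = [standardise(t) for t in texts]
compounds = frequent_bigrams(token_lists)
print(sorted(compounds))  # [('artificial', 'intelligence'), ('european', 'union')]
```

Pairs found this way could then be indexed as one term each instead of two independent words.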

Page 17: Natural Language Processing  for Information Retrieval

Statistical IR(cont..)

Page 18: Natural Language Processing  for Information Retrieval

Statistical IR(cont..)

Stemming

- Remove suffixes (and prefixes) to find the root of each word.
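A deliberately naive suffix stripper illustrates the idea. It is a toy under stated assumptions (the suffix list and minimum stem length are invented here) and not an established algorithm such as Porter's stemmer, which applies many more rules.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, provided at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["retrieving", "retrieved", "retrieves", "documents"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['retriev', 'retriev', 'retriev', 'document']
```

After stemming, the three forms of "retrieve" share one index term, so a query using any of them can match documents using the others.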

Page 19: Natural Language Processing  for Information Retrieval

Statistical IR(cont..)

Parameterising the document:

- Assign a weight to each of the relevant terms associated with a document (usually by frequency of appearance).

Page 20: Natural Language Processing  for Information Retrieval

Statistical IR(cont..)

Estimating the importance of a term: TF*IDF (Term Frequency * Inverse Document Frequency)

- Term frequency: if a term appears often in one document, that indicates the term is representative of the document's content.
- Inverse document frequency: if a term appeared frequently in all documents, it would have no discriminatory value.
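The slides do not give a formula, so the sketch below assumes one common variant: raw term count for TF and log(N / df) for IDF, where N is the number of documents and df is the number of documents containing the term. The three sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: dict) -> dict:
    """Weight each term of each document by count * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        weights[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return weights


# Hypothetical, already preprocessed documents.
docs = {
    "d1": ["stock", "report", "stock", "price"],
    "d2": ["soup", "stock", "report"],
    "d3": ["market", "price", "report"],
}
weights = tf_idf(docs)

# "report" occurs in every document, so its weight is zero everywhere.
print(weights["d1"]["report"])  # 0.0
```

In this variant a term that is frequent within one document but rare across the collection receives the highest weight, which matches the two intuitions on the slide.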

Page 21: Natural Language Processing  for Information Retrieval

Drawbacks of Statistical IR

- Linguistic variance:
  - Synonyms: different words convey the same meaning.
  - Might provoke document silence: relevant documents might not be retrieved, so recall decreases.
- Linguistic ambiguity:
  - Homographs: the same word has different meanings.
  - Will provoke document noise: too many documents might be retrieved, relating to each meaning of the word, so precision decreases.
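Both drawbacks show up with a plain exact-match retriever. The four sample documents and the matching rule below are assumptions made for illustration.

```python
# Hypothetical collection.
docs = {
    "d1": "the automobile dealer sells used vehicles",
    "d2": "the car was parked outside",
    "d3": "the bank approved the loan",
    "d4": "we walked along the river bank",
}


def retrieve(query: str) -> list:
    """Return the names of documents that share at least one word with the query."""
    terms = set(query.lower().split())
    return sorted(
        name for name, text in docs.items() if terms & set(text.lower().split())
    )


# Document silence: d1 is about cars but uses the synonym "automobile".
print(retrieve("car"))   # ['d2']

# Document noise: both senses of "bank" are returned.
print(retrieve("bank"))  # ['d3', 'd4']
```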

Page 22: Natural Language Processing  for Information Retrieval

Summary

- Statistical IR treats documents as bags of words.
- It does not take the linguistics of the language into consideration.
- There is a need for a more linguistics-based approach using complex NLP techniques.

Page 24: Natural Language Processing  for Information Retrieval

Linguistic IR

- The documents are analysed at different linguistic levels by linguistic tools that add each level's own annotations to the text.
- The techniques involved are:
  - Morphological analysis: taggers assign each word to a grammatical category.

Page 25: Natural Language Processing  for Information Retrieval

Linguistic IR (cont..)

- Syntax analysis:
  - examines how words are related and used together to make larger grammatical units, phrases and sentences;
  - restricted to identifying the most meaningful structures: nominal sentences.

Page 26: Natural Language Processing  for Information Retrieval

Linguistic IR (cont..)

- Word sense disambiguation:
  - Index by concept rather than by word.
  - e.g. "bank" as a financial institution versus "bank" as the edge of a river.
  - Disambiguation helps for queries like “Runs on a bank”.
  - One of the most often used tools for word sense disambiguation is the lexicographic database WordNet, an annotated semantic lexicon in different languages made up of synonym groups called synsets.

Page 27: Natural Language Processing  for Information Retrieval

Linguistic IR (cont..)

- Synsets provide short definitions along with the different semantic relationships between synonym groups.
- There are 23 synsets for "stock", including:
  - broth, stock
  - livestock, stock, farm animal
  - stock certificate, stock
  - stock, gillyflower
  - stock, carry, stockpile (verb)
  - standard, stock (adjective)

Page 28: Natural Language Processing  for Information Retrieval

Linguistic IR (cont..)

Use of synsets:

- For each query word, find its synsets.
  - Query “punch recipes”: punch (3 synsets), recipe (1 synset)
- Expand that synset into its “neighborhood”.
  - Grow with WordNet hyponym (is a kind of) relationships until any additional growth would include a different sense of any word in the core synset.
- To disambiguate words in a document:
  - Look at all synset neighborhoods for words in the document.
  - Compare to the way they overlap throughout the collection.

Page 29: Natural Language Processing  for Information Retrieval

Linguistic IR (cont..)

- Choose the neighborhoods where local activity is greater than the expected global activity.
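The selection rule can be sketched with hand-written neighborhoods. Everything below is an assumption made for illustration: the two sense neighborhoods are invented rather than derived from WordNet, and "activity" is taken to mean the fraction of the other words in a text that fall inside a neighborhood, which is one possible reading of the slide rather than the method of the cited work.

```python
# Hypothetical sense neighborhoods; a real system would grow them from WordNet.
SENSES = {
    "bank": {
        "finance": {"bank", "money", "loan", "deposit", "account"},
        "river": {"bank", "river", "shore", "water", "slope"},
    }
}


def activity(tokens, neighborhood, target):
    """Fraction of tokens, other than the target word, inside the neighborhood."""
    others = [t for t in tokens if t != target]
    if not others:
        return 0.0
    return sum(t in neighborhood for t in others) / len(others)


def disambiguate(target, doc_tokens, collection_tokens):
    """Pick the sense whose local activity most exceeds its global activity,
    or None when no sense is more active locally than globally."""
    best, best_gain = None, 0.0
    for sense, neighborhood in SENSES[target].items():
        local = activity(doc_tokens, neighborhood, target)
        expected = activity(collection_tokens, neighborhood, target)
        if local - expected > best_gain:
            best, best_gain = sense, local - expected
    return best


finance_doc = "the bank approved the loan and opened a deposit account".split()
river_doc = "we walked along the river bank near the water".split()
collection = finance_doc + river_doc

print(disambiguate("bank", finance_doc, collection))  # finance
print(disambiguate("bank", river_doc, collection))    # river
```

A document with no supporting context, such as one containing only "the bank", yields `None` here, reflecting that the sketch refuses to guess without evidence.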

Page 30: Natural Language Processing  for Information Retrieval

Problems with Linguistic techniques in IR

- Linguistic techniques must be essentially perfect to help.
- Queries are difficult.
- Non-linguistic techniques implicitly exploit linguistic knowledge.

Page 31: Natural Language Processing  for Information Retrieval

Conclusion

- Statistical IR methods have some drawbacks.
- Linguistic IR methods try to solve those problems but have been fairly unsuccessful.
- Effective IR depends upon properties of queries that make some NLP techniques redundant.
- Current NLP techniques are not of much help in strict document retrieval.

Page 32: Natural Language Processing  for Information Retrieval

Q&A

Page 33: Natural Language Processing  for Information Retrieval

References

- Ellen M. Voorhees, "Natural Language Processing and Information Retrieval"
- Mari Vallez and Rafael Pedraza-Jimenez, "Natural Language Processing in Textual Information Retrieval and Related Topics" (http://www.hipertext.net/english/pag1025.htm)
- James Allan, "NLP for IR" (http://citeseer.ist.psu.edu/308641.html)

Page 34: Natural Language Processing  for Information Retrieval

References (Contd..)

- Douglas W. Oard, "A lecture on information retrieval" (http://www.glue.umd.edu/~oard/papers/CMSC723.ppt)

