Date post: | 13-Jan-2016 |
Category: |
Documents |
Upload: | anabel-burns |
View: | 216 times |
Download: | 0 times |
NATURAL LANGUAGE PROCESSING AND
INFORMATION RETRIEVAL
Dr. Eleni Galiotou
Assistant Professor,
Department of Informatics,
TEI of Athens
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 2
(Computer-based) Information Retrieval
• Locate (electronically available) documents satisfying user´s information needs
• Information need: A statement in a query language matched against document surrogates (title, abstract, keywords etc)
• Outcome of IR process: articles, memos, reports, books, annotated image and sound files
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 3
The IR strategy• Purpose:
– Retrieve all relevant documents– Retrieve as few of non-relevant documents as possible
• Techniques in classical IR: – Empirical and ad-hoc– Quantitative methods
• IR : also a Natural Language Processing problem• Heterogeneous Collections of full-text documents
– Need for Content Understanding => NLP techniques
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 4
Main areas of research in IR
• Content analysis
• Relationships between documents to improve efficiency and effectiveness of IR strategies
• Measurement of effectiveness of retrieval
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 5
Example: The vector-space model (1)
• The SMART Text Retrieval System
• Documents and queries represented as vectors in T-dimensional space ( T: number of distinct terms in document collection)
• Automated indexing: Assigning of terms to a piece of text
• Weighted terms to reflect their relative importance in the text
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 6
The vector-space model (2)
• Result of a query: Ranked list of documents ordered by similarity to the query
• Similarity measure: cosine of the angle formed by the query and the document vector (cosine correlation)
qidi
sim = cos(Q,D) =
( qi2 di
2 )1/2
T
i=1
T
i=1
T
i=1
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 7
Extended vector-space model
• Vector : collection of subvectors used to represent different aspects of documents in collection
• Overall similarity between two extended vectors:
sim(Q,D) = αi simi (Qi , Di)
subvector i
ai = importance of subvector i in the overall similarity between texts
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 8
Indexing
• Index language used to describe documents and requests
• Pre-coordinate index terms : logical combination of any index terms used as a label to identify a class of documents
• Post-coordinate terms: combination of classes of documents labeled with the individual index terms
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 9
LMI vs. NLI• Non-Linguistic Indexing: Removing
stopwords, Applying statistical criteria
• Linguistically Motivated Indexing: – Applying syntactic and/or semantic techniques
for term identification and description formation– Identifying multi-word units and characterizing
their internal structure
• NLP needed for automated indexing ?!
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 10
Index Term Weighting
• tf (t): within-document frequency of term t• idf(t): inverse document frequency = log (N/n) N= total number of documents in collection n : number of documents containing term t• General weighting schema: w(t) = tf(t) X idf(t) • Assumptions on term independence often false• Situation worse when single-word terms are
intermixed with phrasal terms
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 11
NLP Based Indexing• Example: TREC Experiments– “joint venture” important in Wall Street
Journal database– “joint”, “venture” dropped from list of
terms by the system because of too low idf • Identify groups creating meaningful phrases• Simple collocations, Statistically-validated
N-grams, Part-Of-Speech tagged sequences, Syntactic structures, Semantic Concepts
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 12
Obstacles in the application of NLP techniques in IR
• Lack of robustness and efficiency
• Representations produced : Complex structures effectively compared to determine relevance
Solution: Use NLP to assist IR system (boolean, statistical, probabilistic) in representing documents for search purposes
Off-line database indexing
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 13
Stream-based IR Model (1)
• Combination of Statistical and NLP Techniques
• Term Extraction Steps
1. Elimination of Stopwords (no-content or low content words: determiners, preposition, pronouns, very frequent words)
2. Morphological Stemming: Affix-stripping process or Morphological Analysis)
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 14
Stream-based IR Model (2)
4. Phrase Extraction : Shallow text processing techniques (POS tagging, Phrase boundary detection, Word co-occurrence metrics) used to identify relatively stable groups of words
5. Phrase Normalization: “Head+Modifier” pairs to normalize across syntactic variants and reduce to a common “concept” , e.g. weapon proliferation, proliferation of weapons weapon+ proliferate
6. Proper Name Extraction: People names and titles, Location names, Organization names used for indexing
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 15
Stream-based IR Model (3)
• Final results: merge ranked lists of documents obtained from searching all streams with appropriately preprocessed queries.
• Contributions from each stream are weighted using an effective combination of alternative retrieval and routing methods
Meta-search strategy which maximizes contributions of each stream (base search engines: SMART v. 11, PRISE v.2 e.t.c)
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 16
Advantages of Stream Architecture
• Easier to compare contributions of different indexing features or representations
• Convenient testbed to experiment with algorithms designed to merge results obtained using different IR engines and/ot techniques
• Easier to fine-tune system in order to obtain optimum performance
• Allows usage of IR engines without having to adopt them
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 17
Part Of Speech Tagging (1)
• Allows resolution of lexical ambiguities in a running text assuming a known general type of text and a context in which a word is used
more accurate lexical normalization, phrase boundary detection
• Assigns POS label(s) to each word in a text depending on labels assigned to preceding words
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 18
POS Tagging (2)
• Best-tag-only option: Only top-ranked for each word is output
gain in speed and robustness of subsequent processes (e.g. parsing)
• Brill´s rule based Tagger trained on Wall Street Journal texts to preprocess linguistic streams used by SMART
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 19
Syntactic Tagging (1)• Capturing semantic dependencies critical for
accurate text indexing• Need to exploit syntactic structures produced by a
fairly comprehensive parser• TREC experiment: TTP (Tagged Text Parser)
based on Linguistic String Grammar• Full grammar parser with a built-in timer
regulating amount of time allowed for parsing a sentence
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 20
Syntactic Tagging (2)
• If no parse is returned before allotated time elapses parser in “skip-and-fit” mode
• Result: approximate parse
• Fragments skipped in first pass:
– analyzed by simple phrasal parser looking for noun phrases and relative clauses
– attached to main parse structure
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 21
Corpus-based disambiguation of long Noun Phrases (1)
• Relationships between in complex phases required to decompose longer phrases into meaningful head+modifier pairs
• Pair extractor looks at distribution statistics of compound terms
– association between any two words in noun phrase syntactically valid and semantically significant
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 22
Corpus-based disambiguation (2)
• Phrasal terms extracted in two phases:1. Only unambiguous head-modifier pairs are
generated2. Distributional statistics gathered in first phase
are used to predict the strength of alternative modifier-modified links within ambiguous phrases
• Example: multiple unambiguous occurrences : “inside trading”, a few of “trading case”, numerous phrases: “insider trading case”, “insider trading legislation”
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 23
Language Resources
• Machine Readable Dictionaries (MRD)
Mixed results in experiments
• Knowledge bases
– CYC : Huge Knowledge base of Common Sense Knowledge, Untested contribution to IR
– WordNet : Models Lexical Knowledge of a native user of English
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 24
Usage of WordNet in Information Retrieval Tasks (1)• WordNet: organized around logical
groupings of related terms (synsets)
• Synset: list of synonymous word forms and semantic pointers describing relationships between current and other synsets
• Knowledge Base: Nouns in WordNet
• Nouns: Most content-bearing of all word classes and occur in every sentence
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 25
Usage of WordNet (2)
• WordNet partitioned into Hierarchical Concept Graphs (HCG) based on the IS-A hierarchical links between synsets
• Information content of each synset approximated by estimating the probability of occurrence of all nouns in all subordinate synsets.
• Semantic similarity between two nouns (synsets from which the nouns are drawn): information content of first synset which subsumes the two synsets
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 26
Usage of WordNet (3)
• Simple word sense disambiguation process for documents which choose the single most likely sense of a noun occurrence
• Experiments: Top 1000 documents pre-fetched from the collection using term weighting (conventional IR technique) and exhaustive word distance based measure on these documents
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 27
Usage of WordNet (4)
• Retrieval effectiveness results using word-word distances, (in terms of precision and recall): poor compared to the tf X idf term weighting strategy
• Possibility of errors in syntactic tagging of documents, in word sense disambiguation, in semantic matching between words.
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 28
Other Roles for NLP (1)
• Routing (Filtering): Amount of training data is the dominant factor in performance
• Text categorization (automatic assignment: to prior headings): Using complex terms had no extra beneficial effect
• ?Real contribution to selective content-based information management
Feb 24, 2004 Eleni Galiotou : Tempus Seminar 29
Other Roles for NLP (2)
• Displaying information about whole documents: giving selected phrases more informative than highlighting matching terms or listing key individual words
Information Extraction and Summarizing
• Real role of NLP: Supporting more exigent information-management functions within a larger, multi-functional whole