Using Semantic Relations to Improve Information Retrieval
Tom Morton
Introduction
NLP techniques have been largely unsuccessful at information retrieval. Why?
– Document retrieval has been the primary measure of information retrieval success. Document retrieval reduces the need for NLP techniques.
– Discourse factors can be ignored.
– Query words perform word-sense disambiguation.
– Lack of robustness: NLP techniques are typically not as robust as word indexing.
Introduction
Paragraph retrieval for natural-language questions:
– Paragraphs can be influenced by discourse factors.
– Correctness of answers to natural-language questions can be accurately determined automatically.
– Standard precursor to the TREC question-answering task.
What NLP technologies might help at this information retrieval task, and are they robust enough?
Introduction
Question Analysis:
– Questions tend to specify the semantic type of their answer. This component tries to identify that type.
Named-Entity Detection:
– Named-entity detection determines the semantic type of proper nouns and numeric amounts in text.
Introduction
Question Analysis:
– The predicted category is appended to the question.
Named-Entity Detection:
– The NE categories found in the text are included as new terms.
This approach requires additional question terms to be present in the paragraph.
What party is John Major in? (ORGANIZATION)
It probably won't be clear for some time whether the Conservative Party has chosen in John Major a truly worthy successor to Margaret Thatcher, who has been a giant on the world stage. +ORGANIZATION +PERSON
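The term expansion in the example above can be sketched as follows. The helper names and the hand-supplied NE categories are illustrative stand-ins for the real indexing pipeline, which gets its categories from the named-entity detector and question-analysis components.

```python
# A minimal sketch of semantic-category term expansion: pseudo-terms for each
# NE category found in a paragraph are appended to its term list, and the
# predicted answer category is appended to the question's terms. The NE
# categories below are hand-supplied stand-ins for a real NE detector's output.

def expand_paragraph(tokens, ne_categories):
    """Append one pseudo-term per NE category present in the paragraph."""
    return tokens + ["+" + cat for cat in sorted(set(ne_categories))]

def expand_question(tokens, answer_category):
    """Append the predicted answer category so it can match at retrieval."""
    return tokens + ["+" + answer_category]

paragraph = "the Conservative Party has chosen in John Major".split()
# Categories an NE detector might assign here:
# "Conservative Party" -> ORGANIZATION, "John Major" -> PERSON
indexed = expand_paragraph(paragraph, ["ORGANIZATION", "PERSON"])

question = "What party is John Major in ?".split()
query = expand_question(question, "ORGANIZATION")

print(indexed[-2:])  # ['+ORGANIZATION', '+PERSON']
print(query[-1])     # +ORGANIZATION
```

The paragraph now contains the pseudo-term the expanded question requires, so a plain term-matching engine can exploit the semantic-type constraint.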
Introduction
Coreference Relations:
– Interpretation of a paragraph may depend on the context in which it occurs.
Syntactically-based Categorical Relation Extraction:
– Appositive and predicate-nominative constructions provide descriptive terms about entities.
Coreference:
– Use coreference relationships to introduce terms referred to but not present in the paragraph's text.
Introduction
How long was Margaret Thatcher the prime minister? (DURATION)
The truth, which has been added to over each of her 11 1/2 years in power, is that they don't make many like her anymore. +MARGARET +THATCHER +PRIME +MINISTER +DURATION
Introduction
Categorical Relation Extraction:
– Identifies the DESCRIPTION category.
– Allows descriptive terms to be used in term expansion.
Famed architect Frank Lloyd Wright… +DESCRIPTION
Buildings he designed include the Guggenheim Museum in New York and Robie House in Chicago. +FRANK +LLOYD +WRIGHT +FAMED +ARCHITECT
Who is Frank Lloyd Wright? (DESCRIPTION) What architect designed Robie House? (PERSON)
Introduction
[System diagram: Indexing: Documents → Pre-processing → NE Detection → Coreference Resolution → Categorical Relation Extraction → Paragraphs+ → Search Engine. Retrieval: Question → Question Analysis → Search Engine → Paragraphs.]
Introduction
Will these semantic relations improve paragraph retrieval?
– Are the implementations robust enough to see a benefit across large document collections and question sets?
– Are there enough questions where these relationships are required to find an answer? Questions need only be answered once.
Short Answer: Yes!
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Preprocessing
Paragraph Detection Sentence Detection Tokenization POS Tagging NP-Chunking
Preprocessing
Paragraph finding:
– Explicitly marked: newline, <p>, blank line, etc.
– Implicitly marked: What is the column width of this document? Would this capitalized, likely sentence-initial word fit on the previous line?
Sentence Detection:
– Is this [.?!] the end of a sentence?
– Use software developed in Reynar & Ratnaparkhi 97.
Preprocessing
Tokenization:
– Are there additional tokens in this initial space-delimited set of tokens?
– Use techniques described in Reynar 98.
POS Tagging:
– Use software developed in Ratnaparkhi 96.
Preprocessing
NP-Chunking:
– Developed a maxent tagging model where each token is assigned one of three tags: Start-NP, Continue-NP, Other.
– The software is very similar to the POS tagger.
– Performance was evaluated to be at or near state-of-the-art.
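The three-tag scheme determines chunks once the tagger has run. The sketch below shows only the tag-to-chunk decoding step; the tag sequence is invented for illustration, since the actual tags come from the maxent model.

```python
# A sketch of decoding the three-tag NP-chunking scheme (Start-NP,
# Continue-NP, Other) back into base noun-phrase chunks. The tag sequence
# here is hand-written for illustration, not model output.

def tags_to_chunks(tokens, tags):
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "Start-NP":
            if current:
                chunks.append(current)
            current = [tok]
        elif tag == "Continue-NP" and current:
            current.append(tok)
        else:  # "Other" (or a stray Continue-NP) closes any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

tokens = "Famed architect Frank Lloyd Wright designed the Guggenheim Museum".split()
tags = ["Start-NP", "Continue-NP", "Continue-NP", "Continue-NP", "Continue-NP",
        "Other", "Start-NP", "Continue-NP", "Continue-NP"]
print(tags_to_chunks(tokens, tags))
# ['Famed architect Frank Lloyd Wright', 'the Guggenheim Museum']
```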
Preprocessing
Producing Robust Components:
– The sentence-detection, tokenization, and POS-tagging components were all retrained:
Added small samples of text from the paragraph-retrieval domains to the WSJ-based training data.
This allowed the components to deal with editorial conventions which differed from the Wall Street Journal.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Question Analysis Conclusion Proposed Work
Named-Entity Detection
Task Approach 1 Approach 2
Named-Entity Detection
Task:
– Identify the following categories: Person, Location, Organization, Money, Percentage, Time Point.
Approach 1:
– Use an existing NE detector.
Performance on some genres of text was poor. Couldn't add new categories. Couldn't retrain the classifier.
Named-Entity Detection
Approach 2:
– Train a maxent classifier on the output of an existing NE detector.
Used BBN's MUC NE tagger (Bikel et al. 1997) to create a corpus.
– Combined the Time and Date tags to create a "Time Point" category.
Added a small sample of tagged text from the paragraph-retrieval domains.
– Constructed rule-based models for additional categories: Distance and Amount.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Coreference
Task Approach Results Related Work
Coreference
Task:
– Determine the space of entity extents:
Basal noun phrases:
– Named entities consisting of multiple basal noun phrases are treated as a single entity.
Pre-nominal proper nouns.
Possessive pronouns.
– Determine which extents refer to the same entity in the world.
Coreference
Approach (Morton 2000):
– Divide referring expressions into three classes:
Singular third-person pronouns. Proper nouns. Definite noun phrases.
– Create a separate resolution approach for each class.
– Apply the resolution approaches to the text in an interleaved fashion.
Coreference
Singular Third-Person Pronouns:
– Compare the pronoun to each entity in the current sentence and the previous two sentences.
– Compute argmax_i p(coref | pronoun, entity_i) using a maxent model.
– Compute p(nonref | pronoun) using a maxent model.
– If p(coref_i) > p(nonref), resolve the pronoun.
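The decision rule above can be sketched directly. The candidate probabilities here are hard-coded stand-ins for the two maxent models' outputs, echoing the illustrative figures on the following slide.

```python
# A sketch of the pronoun-resolution decision rule: pick the candidate entity
# maximizing p(coref | pronoun, entity), and resolve only if that probability
# exceeds the non-referential probability p(nonref | pronoun). The numbers
# below are illustrative stand-ins for maxent model outputs.

def resolve_pronoun(pronoun, candidate_probs, p_nonref):
    """candidate_probs: list of (entity, p(coref | pronoun, entity)) pairs."""
    best_entity, best_p = max(candidate_probs, key=lambda pair: pair[1])
    if best_p > p_nonref:
        return best_entity
    return None  # treat the pronoun as non-referential

candidates = [("John Major", 0.20), ("Margaret Thatcher", 0.70),
              ("The Conservative Party", 0.10)]
print(resolve_pronoun("she", candidates, p_nonref=0.05))
# Margaret Thatcher
```

Note that the comparison is against entities, not individual extents, so a recently mentioned but low-probability extent does not win by default.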
Coreference
[Figure: candidate entities for resolving "she": 1. John Major, a truly worthy…; 2. Margaret Thatcher, her, …; 3. The Conservative Party; 4. the undoubted exception; 5. Winston Churchill; 6. …, with illustrative coreference probabilities of 20%, 70%, 10%, 5%, and 10%.]
The pronoun is resolved to an entity rather than to the most recent extent.
Coreference
Classifier Features:
– Distance: in NPs, in sentences, left-to-right, right-to-left.
– Syntactic Context: the NP's position in the sentence; the NP's surrounding context; the pronoun's syntactic context.
– Salience: the number of times the entity has been mentioned.
– Gender: pairings of the pronoun's gender and the lexical items in the entity.
Coreference
Proper Nouns:
– Remove honorifics, corporate designators, determiners, and pre-nominal appositives.
– Compare the proper noun to each entity preceding it.
– Resolve it to the first preceding proper-noun extent of which this proper noun is a substring (observing word boundaries).
Bob Smith <- Mr. Smith <- Bob <- Smith
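The substring-at-word-boundaries rule can be sketched with a regular expression. The honorific list below is an illustrative fragment; the actual system also strips corporate designators, determiners, and pre-nominal appositives.

```python
import re

# A sketch of the proper-noun resolution rule: strip honorifics, then resolve
# to the first preceding proper-noun extent that contains the stripped name
# as a substring at word boundaries. The honorific list is illustrative.
HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr."}

def strip_honorifics(name):
    return " ".join(w for w in name.split() if w not in HONORIFICS)

def resolve_proper_noun(name, preceding_extents):
    """preceding_extents: proper-noun extents, ordered most recent first."""
    stripped = strip_honorifics(name)
    pattern = r"\b" + re.escape(stripped) + r"\b"
    for extent in preceding_extents:
        if re.search(pattern, extent):
            return extent
    return None

# "Mr. Smith" resolves to "Bob Smith": "Smith" matches at a word boundary,
# whereas it would not match inside a word like "Smithson".
print(resolve_proper_noun("Mr. Smith", ["Bob Smith", "Winston Churchill"]))
# Bob Smith
```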
Coreference
Definite Noun Phrases:
– Remove determiners.
– Resolve to the first entity which shares the same head word and modifiers.
the big mean man <- the big man <- the man
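A rough sketch of this rule follows. Treating the last content word as the head is a simplifying assumption made here for illustration; the actual system determines heads from the NP chunks.

```python
# A sketch of the definite-NP rule: drop determiners, then resolve to the
# first preceding entity with the same head word whose modifiers are
# compatible (every modifier of the anaphor also appears on the antecedent).
# "Head = last content word" is an illustrative simplification.
DETERMINERS = {"the", "a", "an", "this", "that"}

def content_words(np):
    return [w for w in np.lower().split() if w not in DETERMINERS]

def resolve_definite_np(np, preceding_entities):
    """preceding_entities: candidate entities, ordered most recent first."""
    words = content_words(np)
    head, modifiers = words[-1], set(words[:-1])
    for entity in preceding_entities:
        ewords = content_words(entity)
        if ewords[-1] == head and modifiers <= set(ewords[:-1]):
            return entity
    return None

# "the man" resolves to "the big man" (same head, no modifier clash):
print(resolve_definite_np("the man", ["the big man", "the big mean man"]))
# the big man
```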
Coreference
Results:
– Trained the pronominal model on 200 WSJ documents with only pronouns annotated.
Interleaved with the other resolution approaches to compute mention statistics.
– Evaluated using 10-fold cross-validation.
– P 94.4%, R 76.0%, F 84.2%.
Coreference
Results:
– Evaluated the proper-noun and definite-noun-phrase approaches on 80 hand-annotated WSJ files.
Proper Nouns: P 92.1%, R 88.0%, F 90.0%.
Definite NPs: P 82.5%, R 47.4%, F 60.2%.
– Combined Evaluation: MUC-6 Coreference Task:
Annotation guidelines are not identical.
Ignored headline and dateline coreference.
Included appositives and predicate nominatives.
P 79.6%, R 44.5%, F 57.1%.
Coreference
Related Work:
– Ge et al. 1998: presents a similar statistical treatment; assumes non-referential pronouns are pre-marked; assumes mention statistics are pre-computed.
– Soon et al. 2001: targets the MUC tasks; P 65.5-67.3%, R 56.1-58.3%, F 60.4-62.6%.
– Ng and Cardie 2002: targets the MUC tasks; P 70.8-78.0%, R 55.7-64.2%, F 63.1-70.4%.
Our approach favors precision over recall:
– Coreference relationships are used in passage retrieval.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Categorical Relation Extraction
Task Approach Results Related Work
Categorical Relation Extraction
Task:
– Identify whether a categorical relation exists between NPs in the following contexts:
Appositives: NP, NP.
Predicate Nominatives: NP copula NP.
Pre-nominal appositives:
– (NP (SNP Japanese automaker) Mazda Motor Corp.)
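Two of these candidate contexts can be sketched with toy patterns over sentences whose NPs have been bracketed. The bracketing convention and regexes are illustrative assumptions; the actual system finds candidates from NP chunks and then lets a maxent classifier decide whether the relation holds.

```python
import re

# Toy patterns for two candidate contexts, over NP-bracketed sentences
# (an illustrative convention, not the system's actual representation).
# A matched candidate would then go to the maxent classifier, which decides
# whether it really expresses a categorical relation.
APPOSITIVE = re.compile(r"\[([^\]]+)\] , \[([^\]]+)\]")                  # NP , NP
PRED_NOM = re.compile(r"\[([^\]]+)\] (?:is|are|was|were) \[([^\]]+)\]")  # NP copula NP

sent = "[Mazda Motor Corp.] is [a Japanese automaker]"
print(PRED_NOM.search(sent).groups())
# ('Mazda Motor Corp.', 'a Japanese automaker')

appos = "[John Major] , [a truly worthy successor]"
print(APPOSITIVE.search(appos).groups())
# ('John Major', 'a truly worthy successor')
```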
Categorical Relation Extraction
Approach:
– Appositives and predicate nominatives:
Create a single binary maxent classifier to determine when NPs in the appropriate syntactic context express a categorical relationship.
– Pre-nominal appositives:
Create a maxent classifier to determine where the split lies between the appositive and the rest of the noun phrase.
– Use the lexical and POS-based features of the noun phrases:
Use word/POS-pair features.
Differentiate between head and modifier words.
The pre-nominal appositive classifier also uses a word's presence on a list of 69 titles as a feature.
Categorical Relation Extraction
Results:
– Appositives and predicate nominatives:
Training: 1000/1200 examples. Test: 3-fold cross-validation.
Appositives: P 90.9%, R 79.1%, F 84.6%.
Predicate Nominatives: P 78.8%, R 74.4%, F 76.5%.
– Pre-nominal appositives:
Training: 2000 examples. Used active learning to select new examples for annotation (884 positive).
Test: 1500 examples (81 positive). P 98.6%, R 85.2%, F 91.4%.
Categorical Relation Extraction
Related Work:
– Soon et al. (2001) define a specific feature to identify appositive constructions.
– Hovy et al. (2001) use syntactic patterns to identify "DEFINITION" and "WHY FAMOUS" types.
Our work is unique in that it:
– Provides a statistical treatment of extracting categorical relations.
– Uses categorical relations for term expansion in paragraph indexing.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Question Analysis
Task Approach Results Related Work
Question Analysis
Task:
– Map natural-language questions onto 10 categories:
Person, Location, Organization, Time Point, Duration, Money, Percentage, Distance, Amount, Description, Other.
– Where is West Point Military Academy? (Location)
– When was ice cream invented? (Time Point)
– How high is Mount Shasta? (Distance)
Question Analysis
Approach:
– Identify the question word:
Who, What, When, Where, Why, Which, Whom, How (JJ|RB)*, Name.
– Identify the focus noun:
The noun phrase which specifies the type of the answer. Use a series of syntactic patterns to identify it.
– Train a maxent classifier to predict which category the answer falls into.
Question Analysis
Focus Noun Syntactic Patterns:
– Who copula (np)
– What copula* (np)
– Which copula (np)
– Which of (np)
– How (JJ|RB) (np)
– Name of (np)
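These patterns can be approximated as regular expressions over questions whose candidate focus NP has been bracketed. The bracketing convention and the small copula list are simplifying assumptions for illustration; the actual patterns operate over the NP-chunked question.

```python
import re

# An illustrative rendering of the focus-noun patterns as regexes over
# questions with a bracketed focus NP (the bracketing is assumed here,
# not the system's real representation).
COPULA = r"(?:is|are|was|were)"
PATTERNS = [
    rf"^Who {COPULA} \[(?P<np>[^\]]+)\]",       # Who copula (np)
    rf"^What (?:{COPULA} )?\[(?P<np>[^\]]+)\]", # What copula* (np)
    rf"^Which {COPULA} \[(?P<np>[^\]]+)\]",     # Which copula (np)
    rf"^Which of \[(?P<np>[^\]]+)\]",           # Which of (np)
    rf"^How \w+ \[(?P<np>[^\]]+)\]",            # How (JJ|RB) (np)
    rf"^Name of \[(?P<np>[^\]]+)\]",            # Name of (np)
]

def focus_noun(question):
    for pattern in PATTERNS:
        match = re.match(pattern, question)
        if match:
            return match.group("np")
    return None

print(focus_noun("Who is [Colin Powell] ?"))          # Colin Powell
print(focus_noun("What [party] is John Major in ?"))  # party
```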
Question Analysis
Classifier Features:
– Lexical features: the question word, the matrix verb, the head noun of the focus noun phrase, the modifiers of the focus noun.
– Word-class features: WordNet synsets and entry number of the focus noun.
– Location of the focus noun: is it the last NP?
Who is (NP-Focus Colin Powell)?
Question Analysis
Question:
– Which poet was born in 1572 and appointed Dean of St. Paul's Cathedral in 1621?
Features:
– def qw=which verb=which_was rw=was rw=born rw=in rw=1572 rw=and rw=appointed rw=Dean rw=of rw=St rw=. rw=Paul rw='s rw=Cathedral rw=in rw=1621 rw=? hw=poet ht=NN s0=poet1 s0=writer1 s0=communicator1 s0=person1 s0=life_form1 s0=causal_agent1 s0=entity1 fnIsLast=false
Question Analysis
Results:
– Training: 1888 hand-tagged examples from web logs and web searches.
– Test:
TREC-8 Questions: 89.0%.
TREC-9 Questions: 76.6%.
Question Analysis
Related Work:
– Ittycheriah et al. 2001:
Similar: uses a maximum-entropy model; uses focus nouns and WordNet.
Differs: assumes the first NP is the focus noun; 3300 annotated questions; uses MUC NE categories plus PHRASE and REASON; uses feature selection with held-out data.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Paragraph Retrieval
Task Approach Results Related Work
Paragraph Retrieval
Task:
– Given a natural-language question:
TREC-9 question collection.
– And a collection of ~1M documents:
AP, LA Times, WSJ, Financial Times, FBIS, and SJM.
– Return a paragraph which answers the question.
Used the TREC-9 answer patterns to evaluate.
Paragraph Retrieval
Approach:
– Indexing:
Use the named-entity detector to supplement paragraphs with terms for each NE category present in the text.
Use coreference relationships to introduce new terms referred to but not present in the paragraph's text.
Use syntactically-based categorical relations to create a DESCRIPTION category and for term expansion.
Used an open-source tf*idf-based search engine for retrieval (Lucene).
– No length normalization.
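As a rough illustration of how the expanded term lists are ranked, here is a toy tf*idf scorer with no length normalization. It is a stand-in for, not a reproduction of, Lucene's actual scoring formula, and the example paragraphs are invented.

```python
import math
from collections import Counter

# A toy tf*idf ranker over expanded paragraph term lists, with no length
# normalization. Illustrative only: the real system used Lucene, whose
# scoring formula differs in detail.

def rank(paragraphs, query):
    """Return paragraph indices ordered best-first by tf*idf overlap."""
    n = len(paragraphs)
    df = Counter()  # document frequency of each term
    for terms in paragraphs:
        df.update(set(terms))

    def score(terms):
        tf = Counter(terms)
        return sum(tf[t] * math.log(n / df[t]) for t in query if df.get(t))

    return sorted(range(n), key=lambda i: score(paragraphs[i]), reverse=True)

paras = [
    "the Conservative Party chose John Major +ORGANIZATION +PERSON".split(),
    "John Major spoke yesterday +PERSON".split(),
]
query = "What party is John Major in ? +ORGANIZATION".split()
print(rank(paras, query))  # [0, 1]: the +ORGANIZATION pseudo-term breaks the tie
```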
Paragraph Retrieval
Approach:
– Retrieval:
Use the question-analysis component to predict the answer category and append it to the question.
– Evaluate using the TREC-9 questions and answer patterns:
500 questions.
Paragraph Retrieval
[System diagram: Indexing: Documents → Pre-processing → NE Detection → Coreference Resolution → Syntactic Relation Extraction → Paragraphs+ → Search Engine. Retrieval: Question → Question Analysis → Search Engine → Paragraphs.]
Paragraph Retrieval
Results:
[Chart: Number of Questions Answered (y-axis, 205 to 405) vs. Number of Passages (x-axis, 5 to 45), comparing Baseline, w/ Semantic Categories, and w/ Term Expansion.]
Paragraph Retrieval
Related Work:
– Prager et al. 2000: indexes NE categories as terms for question-answering passage retrieval.
Our approach is unique in that it:
– Uses coreference and categorical relation extraction to perform term expansion.
– Demonstrates that this improves performance.
Overview
Introduction Pre-processing Question Analysis Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Conclusion Proposed Work
Conclusion
Developed and evaluated new techniques in:
– Coreference Resolution.
– Categorical Relation Extraction.
– Question Analysis.
Integrated these techniques with existing NLP components:
– NE detection, POS tagging, sentence detection, etc.
Demonstrated that these techniques can be used to improve performance in an information retrieval task.
– Paragraph retrieval for natural language questions.
Overview
Introduction Pre-processing Question Analysis Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Conclusion Proposed Work
Proposed Work
Named-Entity Detection:
– Evaluate existing NE performance:
Use the MUC NE evaluation data.
– Add additional NE categories: Age.
Use active learning to annotate data for the classifiers.
Proposed Work
Coreference:
– Annotate a 200-document corpus with all NP coreference. (done)
– Create statistical models for proper nouns and definite noun phrases. (in progress)
– Incorporate named-entity information into the coreference model. (in progress)
– Evaluate using the new corpus and the MUC-6 and MUC-7 data.
Proposed Work
Categorical Relation Extraction:
– Incorporate named-entity information and WordNet classes for common nouns.
Similar to the approach used in the Question Analysis component.
Proposed Work
Question Analysis:
– Use a parser to provide a richer set of features for the classifier. (implemented: Ratnaparkhi 97)
– Construct a model to identify the focus noun phrase.
Where did Hillary Clinton go to (NP-Focus college)?
– Expand the set of answer categories.
How old is Dick Clark? (Age)
Proposed Work
Paragraph Retrieval:
– Rerun the paragraph-retrieval evaluation after completion of the proposed work.
– Evaluate using the TREC X questions.