Using Semantic Relations to Improve Information Retrieval
Tom Morton
Introduction
NLP techniques have been largely unsuccessful at information retrieval. Why?
– Document retrieval has been the primary measure of information retrieval success. Document retrieval reduces the need for NLP techniques.
– Discourse factors can be ignored.
– Query words perform word-sense disambiguation.
– Lack of robustness: NLP techniques are typically not as robust as word indexing.
Introduction
Paragraph retrieval for natural-language questions:
– Paragraphs can be influenced by discourse factors.
– Correctness of answers to natural-language questions can be accurately determined automatically.
– Standard precursor to the TREC question-answering task.
What NLP technologies might help at this information retrieval task, and are they robust enough?
Introduction
Question Analysis:
– Questions tend to specify the semantic type of their answer. This component tries to identify that type.
Named-Entity Detection:
– Named-entity detection determines the semantic type of proper nouns and numeric amounts in text.
Introduction
Question Analysis:
– The predicted category is appended to the question.
Named-Entity Detection:
– The NE categories found in the text are included as new terms.
This approach requires additional question terms to be present in the paragraph.
What party is John Major in? (ORGANIZATION)
It probably won't be clear for some time whether the Conservative Party has chosen in John Major a truly worthy successor to Margaret Thatcher, who has been a giant on the world stage. +ORGANIZATION +PERSON
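The term expansion in the example above can be sketched as follows. The helper names and the hand-supplied NE categories are illustrative stand-ins for the real indexing pipeline, which gets its categories from the named-entity detector and question-analysis components.

```python
# A minimal sketch of semantic-category term expansion: pseudo-terms for each
# NE category found in a paragraph are appended to its term list, and the
# predicted answer category is appended to the question's terms. The NE
# categories below are hand-supplied stand-ins for a real NE detector's output.

def expand_paragraph(tokens, ne_categories):
    """Append one pseudo-term per NE category present in the paragraph."""
    return tokens + ["+" + cat for cat in sorted(set(ne_categories))]

def expand_question(tokens, answer_category):
    """Append the predicted answer category so it can match at retrieval."""
    return tokens + ["+" + answer_category]

paragraph = "the Conservative Party has chosen in John Major".split()
# Categories an NE detector might assign here:
# "Conservative Party" -> ORGANIZATION, "John Major" -> PERSON
indexed = expand_paragraph(paragraph, ["ORGANIZATION", "PERSON"])

question = "What party is John Major in ?".split()
query = expand_question(question, "ORGANIZATION")

print(indexed[-2:])  # ['+ORGANIZATION', '+PERSON']
print(query[-1])     # +ORGANIZATION
```

The paragraph now contains the pseudo-term the expanded question requires, so a plain term-matching engine can exploit the semantic-type constraint.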
Introduction
Coreference Relations:
– Interpretation of a paragraph may depend on the context in which it occurs.
Syntactically-based Categorical Relation Extraction:
– Appositive and predicate-nominative constructions provide descriptive terms about entities.
Coreference:
– Use coreference relationships to introduce terms referred to but not present in the paragraph's text.
Introduction
How long was Margaret Thatcher the prime minister? (DURATION)
The truth, which has been added to over each of her 11 1/2 years in power, is that they don't make many like her anymore. +MARGARET +THATCHER +PRIME +MINISTER +DURATION
Introduction
Categorical Relation Extraction:
– Identifies the DESCRIPTION category.
– Allows descriptive terms to be used in term expansion.
Famed architect Frank Lloyd Wright… +DESCRIPTION
Buildings he designed include the Guggenheim Museum in New York and Robie House in Chicago. +FRANK +LLOYD +WRIGHT +FAMED +ARCHITECT
Who is Frank Lloyd Wright? (DESCRIPTION) What architect designed Robie House? (PERSON)
Introduction
[System diagram: Indexing: Documents → Pre-processing → NE Detection → Coreference Resolution → Categorical Relation Extraction → Paragraphs+ → Search Engine. Retrieval: Question → Question Analysis → Search Engine → Paragraphs.]
Introduction
Will these semantic relations improve paragraph retrieval?
– Are the implementations robust enough to see a benefit across large document collections and question sets?
– Are there enough questions where these relationships are required to find an answer? Questions need only be answered once.
Short Answer: Yes!
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Preprocessing
Paragraph Detection Sentence Detection Tokenization POS Tagging NP-Chunking
Preprocessing
Paragraph finding:
– Explicitly marked: newline, <p>, blank line, etc.
– Implicitly marked: What is the column width of this document? Would this capitalized, likely sentence-initial word fit on the previous line?
Sentence Detection:
– Is this [.?!] the end of a sentence?
– Use software developed in Reynar & Ratnaparkhi 97.
Preprocessing
Tokenization:
– Are there additional tokens in this initial space-delimited set of tokens?
– Use techniques described in Reynar 98.
POS Tagging:
– Use software developed in Ratnaparkhi 96.
Preprocessing
NP-Chunking:
– Developed a maxent tagging model where each token is assigned one of three tags: Start-NP, Continue-NP, Other.
– The software is very similar to the POS tagger.
– Performance was evaluated to be at or near state-of-the-art.
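The three-tag scheme determines chunks once the tagger has run. The sketch below shows only the tag-to-chunk decoding step; the tag sequence is invented for illustration, since the actual tags come from the maxent model.

```python
# A sketch of decoding the three-tag NP-chunking scheme (Start-NP,
# Continue-NP, Other) back into base noun-phrase chunks. The tag sequence
# here is hand-written for illustration, not model output.

def tags_to_chunks(tokens, tags):
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "Start-NP":
            if current:
                chunks.append(current)
            current = [tok]
        elif tag == "Continue-NP" and current:
            current.append(tok)
        else:  # "Other" (or a stray Continue-NP) closes any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

tokens = "Famed architect Frank Lloyd Wright designed the Guggenheim Museum".split()
tags = ["Start-NP", "Continue-NP", "Continue-NP", "Continue-NP", "Continue-NP",
        "Other", "Start-NP", "Continue-NP", "Continue-NP"]
print(tags_to_chunks(tokens, tags))
# ['Famed architect Frank Lloyd Wright', 'the Guggenheim Museum']
```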
Preprocessing
Producing Robust Components:
– The sentence-detection, tokenization, and POS-tagging components were all retrained:
Added small samples of text from the paragraph-retrieval domains to the WSJ-based training data.
This allowed the components to deal with editorial conventions which differed from the Wall Street Journal.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Question Analysis Conclusion Proposed Work
Named-Entity Detection
Task Approach 1 Approach 2
Named-Entity Detection
Task:
– Identify the following categories: Person, Location, Organization, Money, Percentage, Time Point.
Approach 1:
– Use an existing NE detector.
Performance on some genres of text was poor. Couldn't add new categories. Couldn't retrain the classifier.
Named-Entity Detection
Approach 2:
– Train a maxent classifier on the output of an existing NE detector.
Used BBN's MUC NE tagger (Bikel et al. 1997) to create a corpus.
– Combined the Time and Date tags to create a "Time Point" category.
Added a small sample of tagged text from the paragraph-retrieval domains.
– Constructed rule-based models for additional categories: Distance and Amount.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Coreference
Task Approach Results Related Work
Coreference
Task:
– Determine the space of entity extents:
Basal noun phrases:
– Named entities consisting of multiple basal noun phrases are treated as a single entity.
Pre-nominal proper nouns.
Possessive pronouns.
– Determine which extents refer to the same entity in the world.
Coreference
Approach (Morton 2000):
– Divide referring expressions into three classes:
Singular third-person pronouns. Proper nouns. Definite noun phrases.
– Create a separate resolution approach for each class.
– Apply the resolution approaches to the text in an interleaved fashion.
Coreference
Singular Third-Person Pronouns:
– Compare the pronoun to each entity in the current sentence and the previous two sentences.
– Compute argmax_i p(coref | pronoun, entity_i) using a maxent model.
– Compute p(nonref | pronoun) using a maxent model.
– If p(coref_i) > p(nonref), resolve the pronoun.
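The decision rule above can be sketched directly. The candidate probabilities here are hard-coded stand-ins for the two maxent models' outputs, echoing the illustrative figures on the following slide.

```python
# A sketch of the pronoun-resolution decision rule: pick the candidate entity
# maximizing p(coref | pronoun, entity), and resolve only if that probability
# exceeds the non-referential probability p(nonref | pronoun). The numbers
# below are illustrative stand-ins for maxent model outputs.

def resolve_pronoun(pronoun, candidate_probs, p_nonref):
    """candidate_probs: list of (entity, p(coref | pronoun, entity)) pairs."""
    best_entity, best_p = max(candidate_probs, key=lambda pair: pair[1])
    if best_p > p_nonref:
        return best_entity
    return None  # treat the pronoun as non-referential

candidates = [("John Major", 0.20), ("Margaret Thatcher", 0.70),
              ("The Conservative Party", 0.10)]
print(resolve_pronoun("she", candidates, p_nonref=0.05))
# Margaret Thatcher
```

Note that the comparison is against entities, not individual extents, so a recently mentioned but low-probability extent does not win by default.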
Coreference
[Figure: candidate entities for resolving "she": 1. John Major, a truly worthy…; 2. Margaret Thatcher, her, …; 3. The Conservative Party; 4. the undoubted exception; 5. Winston Churchill; 6. …, with illustrative coreference probabilities of 20%, 70%, 10%, 5%, and 10%.]
The pronoun is resolved to an entity rather than to the most recent extent.
Coreference
Classifier Features:
– Distance: in NPs, in sentences, left-to-right, right-to-left.
– Syntactic Context: the NP's position in the sentence; the NP's surrounding context; the pronoun's syntactic context.
– Salience: the number of times the entity has been mentioned.
– Gender: pairings of the pronoun's gender and the lexical items in the entity.
Coreference
Proper Nouns:
– Remove honorifics, corporate designators, determiners, and pre-nominal appositives.
– Compare the proper noun to each entity preceding it.
– Resolve it to the first preceding proper-noun extent of which this proper noun is a substring (observing word boundaries).
Bob Smith <- Mr. Smith <- Bob <- Smith
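The substring-at-word-boundaries rule can be sketched with a regular expression. The honorific list below is an illustrative fragment; the actual system also strips corporate designators, determiners, and pre-nominal appositives.

```python
import re

# A sketch of the proper-noun resolution rule: strip honorifics, then resolve
# to the first preceding proper-noun extent that contains the stripped name
# as a substring at word boundaries. The honorific list is illustrative.
HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr."}

def strip_honorifics(name):
    return " ".join(w for w in name.split() if w not in HONORIFICS)

def resolve_proper_noun(name, preceding_extents):
    """preceding_extents: proper-noun extents, ordered most recent first."""
    stripped = strip_honorifics(name)
    pattern = r"\b" + re.escape(stripped) + r"\b"
    for extent in preceding_extents:
        if re.search(pattern, extent):
            return extent
    return None

# "Mr. Smith" resolves to "Bob Smith": "Smith" matches at a word boundary,
# whereas it would not match inside a word like "Smithson".
print(resolve_proper_noun("Mr. Smith", ["Bob Smith", "Winston Churchill"]))
# Bob Smith
```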
Coreference
Definite Noun Phrases:
– Remove determiners.
– Resolve to the first entity which shares the same head word and modifiers.
the big mean man <- the big man <- the man
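A rough sketch of this rule follows. Treating the last content word as the head is a simplifying assumption made here for illustration; the actual system determines heads from the NP chunks.

```python
# A sketch of the definite-NP rule: drop determiners, then resolve to the
# first preceding entity with the same head word whose modifiers are
# compatible (every modifier of the anaphor also appears on the antecedent).
# "Head = last content word" is an illustrative simplification.
DETERMINERS = {"the", "a", "an", "this", "that"}

def content_words(np):
    return [w for w in np.lower().split() if w not in DETERMINERS]

def resolve_definite_np(np, preceding_entities):
    """preceding_entities: candidate entities, ordered most recent first."""
    words = content_words(np)
    head, modifiers = words[-1], set(words[:-1])
    for entity in preceding_entities:
        ewords = content_words(entity)
        if ewords[-1] == head and modifiers <= set(ewords[:-1]):
            return entity
    return None

# "the man" resolves to "the big man" (same head, no modifier clash):
print(resolve_definite_np("the man", ["the big man", "the big mean man"]))
# the big man
```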
Coreference
Results:
– Trained the pronominal model on 200 WSJ documents with only pronouns annotated.
Interleaved with the other resolution approaches to compute mention statistics.
– Evaluated using 10-fold cross-validation.
– P 94.4%, R 76.0%, F 84.2%.
Coreference
Results:
– Evaluated the proper-noun and definite-noun-phrase approaches on 80 hand-annotated WSJ files.
Proper Nouns: P 92.1%, R 88.0%, F 90.0%.
Definite NPs: P 82.5%, R 47.4%, F 60.2%.
– Combined Evaluation: MUC-6 Coreference Task:
Annotation guidelines are not identical.
Ignored headline and dateline coreference.
Included appositives and predicate nominatives.
P 79.6%, R 44.5%, F 57.1%.
Coreference
Related Work:
– Ge et al. 1998: presents a similar statistical treatment; assumes non-referential pronouns are pre-marked; assumes mention statistics are pre-computed.
– Soon et al. 2001: targets the MUC tasks; P 65.5-67.3%, R 56.1-58.3%, F 60.4-62.6%.
– Ng and Cardie 2002: targets the MUC tasks; P 70.8-78.0%, R 55.7-64.2%, F 63.1-70.4%.
Our approach favors precision over recall:
– Coreference relationships are used in passage retrieval.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Categorical Relation Extraction
Task Approach Results Related Work
Categorical Relation Extraction
Task:
– Identify whether a categorical relation exists between NPs in the following contexts:
Appositives: NP, NP.
Predicate Nominatives: NP copula NP.
Pre-nominal appositives:
– (NP (SNP Japanese automaker) Mazda Motor Corp.)
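Two of these candidate contexts can be sketched with toy patterns over sentences whose NPs have been bracketed. The bracketing convention and regexes are illustrative assumptions; the actual system finds candidates from NP chunks and then lets a maxent classifier decide whether the relation holds.

```python
import re

# Toy patterns for two candidate contexts, over NP-bracketed sentences
# (an illustrative convention, not the system's actual representation).
# A matched candidate would then go to the maxent classifier, which decides
# whether it really expresses a categorical relation.
APPOSITIVE = re.compile(r"\[([^\]]+)\] , \[([^\]]+)\]")                  # NP , NP
PRED_NOM = re.compile(r"\[([^\]]+)\] (?:is|are|was|were) \[([^\]]+)\]")  # NP copula NP

sent = "[Mazda Motor Corp.] is [a Japanese automaker]"
print(PRED_NOM.search(sent).groups())
# ('Mazda Motor Corp.', 'a Japanese automaker')

appos = "[John Major] , [a truly worthy successor]"
print(APPOSITIVE.search(appos).groups())
# ('John Major', 'a truly worthy successor')
```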
Categorical Relation Extraction
Approach:
– Appositives and predicate nominatives:
Create a single binary maxent classifier to determine when NPs in the appropriate syntactic context express a categorical relationship.
– Pre-nominal appositives:
Create a maxent classifier to determine where the split lies between the appositive and the rest of the noun phrase.
– Use the lexical and POS-based features of the noun phrases:
Use word/POS-pair features.
Differentiate between head and modifier words.
The pre-nominal appositive classifier also uses a word's presence on a list of 69 titles as a feature.
Categorical Relation Extraction
Results:
– Appositives and predicate nominatives:
Training: 1000/1200 examples. Test: 3-fold cross-validation.
Appositives: P 90.9%, R 79.1%, F 84.6%.
Predicate Nominatives: P 78.8%, R 74.4%, F 76.5%.
– Pre-nominal appositives:
Training: 2000 examples. Used active learning to select new examples for annotation (884 positive).
Test: 1500 examples (81 positive). P 98.6%, R 85.2%, F 91.4%.
Categorical Relation Extraction
Related Work:
– Soon et al. (2001) define a specific feature to identify appositive constructions.
– Hovy et al. (2001) use syntactic patterns to identify "DEFINITION" and "WHY FAMOUS" types.
Our work is unique in that it:
– Provides a statistical treatment of extracting categorical relations.
– Uses categorical relations for term expansion in paragraph indexing.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Question Analysis
Task Approach Results Related Work
Question Analysis
Task:
– Map natural-language questions onto 10 categories:
Person, Location, Organization, Time Point, Duration, Money, Percentage, Distance, Amount, Description, Other.
– Where is West Point Military Academy? (Location)
– When was ice cream invented? (Time Point)
– How high is Mount Shasta? (Distance)
Question Analysis
Approach:
– Identify the question word:
Who, What, When, Where, Why, Which, Whom, How (JJ|RB)*, Name.
– Identify the focus noun:
The noun phrase which specifies the type of the answer. Use a series of syntactic patterns to identify it.
– Train a maxent classifier to predict which category the answer falls into.
Question Analysis
Focus Noun Syntactic Patterns:
– Who copula (np)
– What copula* (np)
– Which copula (np)
– Which of (np)
– How (JJ|RB) (np)
– Name of (np)
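These patterns can be approximated as regular expressions over questions whose candidate focus NP has been bracketed. The bracketing convention and the small copula list are simplifying assumptions for illustration; the actual patterns operate over the NP-chunked question.

```python
import re

# An illustrative rendering of the focus-noun patterns as regexes over
# questions with a bracketed focus NP (the bracketing is assumed here,
# not the system's real representation).
COPULA = r"(?:is|are|was|were)"
PATTERNS = [
    rf"^Who {COPULA} \[(?P<np>[^\]]+)\]",       # Who copula (np)
    rf"^What (?:{COPULA} )?\[(?P<np>[^\]]+)\]", # What copula* (np)
    rf"^Which {COPULA} \[(?P<np>[^\]]+)\]",     # Which copula (np)
    rf"^Which of \[(?P<np>[^\]]+)\]",           # Which of (np)
    rf"^How \w+ \[(?P<np>[^\]]+)\]",            # How (JJ|RB) (np)
    rf"^Name of \[(?P<np>[^\]]+)\]",            # Name of (np)
]

def focus_noun(question):
    for pattern in PATTERNS:
        match = re.match(pattern, question)
        if match:
            return match.group("np")
    return None

print(focus_noun("Who is [Colin Powell] ?"))          # Colin Powell
print(focus_noun("What [party] is John Major in ?"))  # party
```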
Question Analysis
Classifier Features:
– Lexical features: the question word, the matrix verb, the head noun of the focus noun phrase, the modifiers of the focus noun.
– Word-class features: WordNet synsets and entry number of the focus noun.
– Location of the focus noun: is it the last NP?
Who is (NP-Focus Colin Powell)?
Question Analysis
Question:
– Which poet was born in 1572 and appointed Dean of St. Paul's Cathedral in 1621?
Features:
– def qw=which verb=which_was rw=was rw=born rw=in rw=1572 rw=and rw=appointed rw=Dean rw=of rw=St rw=. rw=Paul rw='s rw=Cathedral rw=in rw=1621 rw=? hw=poet ht=NN s0=poet1 s0=writer1 s0=communicator1 s0=person1 s0=life_form1 s0=causal_agent1 s0=entity1 fnIsLast=false
Question Analysis
Results:
– Training: 1888 hand-tagged examples from web logs and web searches.
– Test:
TREC-8 Questions: 89.0%.
TREC-9 Questions: 76.6%.
Question Analysis
Related Work:
– Ittycheriah et al. 2001:
Similar: uses a maximum-entropy model; uses focus nouns and WordNet.
Differs: assumes the first NP is the focus noun; 3300 annotated questions; uses MUC NE categories plus PHRASE and REASON; uses feature selection with held-out data.
Overview
Introduction Pre-processing Named-Entity Detection Coreference Categorical Relation Extraction Question Analysis Paragraph Retrieval Conclusion Proposed Work
Paragraph Retrieval
Task Approach Results Related Work
Paragraph Retrieval
Task:
– Given a natural-language question:
TREC-9 question collection.
– And a collection of ~1M documents:
AP, LA Times, WSJ, Financial Times, FBIS, and SJM.
– Return a paragraph which answers the question.
Used the TREC-9 answer patterns to evaluate.
Paragraph Retrieval
Approach:
– Indexing:
Use the named-entity detector to supplement paragraphs with terms for each NE category present in the text.
Use coreference relationships to introduce new terms referred to but not present in the paragraph's text.
Use syntactically-based categorical relations to create a DESCRIPTION category and for term expansion.
Used an open-source tf*idf-based search engine for retrieval (Lucene).
– No length normalization.
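As a rough illustration of how the expanded term lists are ranked, here is a toy tf*idf scorer with no length normalization. It is a stand-in for, not a reproduction of, Lucene's actual scoring formula, and the example paragraphs are invented.

```python
import math
from collections import Counter

# A toy tf*idf ranker over expanded paragraph term lists, with no length
# normalization. Illustrative only: the real system used Lucene, whose
# scoring formula differs in detail.

def rank(paragraphs, query):
    """Return paragraph indices ordered best-first by tf*idf overlap."""
    n = len(paragraphs)
    df = Counter()  # document frequency of each term
    for terms in paragraphs:
        df.update(set(terms))

    def score(terms):
        tf = Counter(terms)
        return sum(tf[t] * math.log(n / df[t]) for t in query if df.get(t))

    return sorted(range(n), key=lambda i: score(paragraphs[i]), reverse=True)

paras = [
    "the Conservative Party chose John Major +ORGANIZATION +PERSON".split(),
    "John Major spoke yesterday +PERSON".split(),
]
query = "What party is John Major in ? +ORGANIZATION".split()
print(rank(paras, query))  # [0, 1]: the +ORGANIZATION pseudo-term breaks the tie
```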
Paragraph Retrieval
Approach:
– Retrieval:
Use the question-analysis component to predict the answer category and append it to the question.
– Evaluate using the TREC-9 questions and answer patterns:
500 questions.
Paragraph Retrieval
[System diagram: Indexing: Documents → Pre-processing → NE Detection → Coreference Resolution → Syntactic Relation Extraction → Paragraphs+ → Search Engine. Retrieval: Question → Question Analysis → Search Engine → Paragraphs.]
Paragraph Retrieval
Results:
[Chart: Number of Questions Answered (y-axis, 205 to 405) vs. Number of Passages (x-axis, 5 to 45), comparing Baseline, w/ Semantic Categories, and w/ Term Expansion.]
Paragraph Retrieval
Related Work:
– Prager et al. 2000: indexes NE categories as terms for question-answering passage retrieval.
Our approach is unique in that it:
– Uses coreference and categorical relation extraction to perform term expansion.
– Demonstrates that this improves performance.
Overview
Introduction Pre-processing Question Analysis Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Conclusion Proposed Work
Conclusion
Developed and evaluated new techniques in:
– Coreference Resolution.
– Categorical Relation Extraction.
– Question Analysis.
Integrated these techniques with existing NLP components:
– NE detection, POS tagging, sentence detection, etc.
Demonstrated that these techniques can be used to improve performance in an information retrieval task.
– Paragraph retrieval for natural language questions.
Overview
Introduction Pre-processing Question Analysis Named-Entity Detection Coreference Categorical Relation Extraction Paragraph Retrieval Conclusion Proposed Work
Proposed Work
Named-Entity Detection:
– Evaluate existing NE performance:
Use the MUC NE evaluation data.
– Add additional NE categories: Age.
Use active learning to annotate data for the classifiers.
Proposed Work
Coreference:
– Annotate a 200-document corpus with all NP coreference. (done)
– Create statistical models for proper nouns and definite noun phrases. (in progress)
– Incorporate named-entity information into the coreference model. (in progress)
– Evaluate using the new corpus and the MUC-6 and MUC-7 data.
Proposed Work
Categorical Relation Extraction:
– Incorporate named-entity information and WordNet classes for common nouns.
Similar to the approach used in the Question Analysis component.
Proposed Work
Question Analysis:
– Use a parser to provide a richer set of features for the classifier. (implemented: Ratnaparkhi 97)
– Construct a model to identify the focus noun phrase.
Where did Hillary Clinton go to (NP-Focus college)?
– Expand the set of answer categories.
How old is Dick Clark? (Age)
Proposed Work
Paragraph Retrieval:
– Rerun the paragraph-retrieval evaluation after completion of the proposed work.
– Evaluate using the TREC X questions.