Automatic Discovery of Scenario-Level Patterns for Information Extraction

Transcript
Page 1: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

NYU, ANLP-00

Automatic Discovery of Scenario-Level Patterns for Information Extraction

Roman Yangarber
Ralph Grishman
Pasi Tapanainen
Silja Huttunen

Page 2: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Outline

Information Extraction: background
Problems in IE
Prior Work: Machine Learning for IE
Discover patterns from raw text
Experimental results
Current work

Page 3: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Quick Overview

What is Information Extraction? Definition:

– finding facts about a specified class of events from free text

– filling a table in a database (slots in a template)

Events: instances of relations, with many arguments

Page 4: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Example: Management Succession

– "George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA."

Page 5: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Example: Management Succession

– "George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA."

Position  | Company                              | Location | Person         | Status
President | European Information Services, Inc.  | London   | George Garrick | Out
CEO       | Nielsen Marketing Research           | USA      | George Garrick | In

Page 6: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

System Architecture: Proteus

Input Text
→ Lexical Analysis       (sentence level)
→ Name Recognition       (sentence level)
→ Partial Syntax         (sentence level)
→ Scenario Patterns      (sentence level)
→ Reference Resolution   (discourse level)
→ Discourse Analyzer     (discourse level)
→ Output Generation
→ Extracted Information


Page 8: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Problems

Customization
Performance

Page 9: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Problems: Customization

To customize a system for a new extraction task, we have to develop:
– new patterns for new types of events
– word classes for the domain
– inference rules

This can be a large job requiring skilled labor; the expense of customization limits the uses of extraction.

Page 10: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Problems: Performance

Performance on event IE is limited. On MUC tasks, typical top performance is recall < 55%, precision < 75%.

Errors propagate through multiple phases:
– name recognition errors
– syntax analysis errors
– missing patterns
– reference resolution errors
– complex inference required

Page 11: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Missing Patterns

As with many language phenomena:
– a few common patterns
– a large number of rare patterns

Rare patterns do not surface sufficiently often in a limited corpus.

Missing patterns make customization expensive and limit performance.

Finding good patterns is necessary to improve customization and performance.

[Figure: pattern frequency vs. rank]

Page 12: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Prior Research

Build patterns from examples
– Yangarber '97

Generalize from multiple examples: annotated text
– Crystal, Whisk (Soderland), Rapier (Califf)

Active learning: reduce annotation
– Soderland '99, Califf '99

Learning from a corpus with relevance judgements
– Riloff '96, '99

Co-learning / bootstrapping
– Brin '98, Agichtein '00

Page 13: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Our Goals

Minimize the manual labor required to construct pattern bases for a new domain:
– un-annotated text
– un-classified text
– un-supervised learning

Use very large corpora (larger than we could ever tag manually) to improve coverage of patterns.

Page 14: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Principle: Pattern Density

If we have relevance judgements for the documents in a corpus, for a given task, then the patterns that are much more frequent in relevant documents than in the corpus overall will generally be good patterns.

Riloff (1996) uses this to find patterns related to terrorist attacks.

Page 15: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Principle: Duality

Duality between patterns and documents:
– relevant documents are strong indicators of good patterns
– good patterns are strong indicators of relevant documents

Page 16: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Outline of Procedure

Initial query: a small set of seed patterns which partially characterize the topic of interest.

Repeat:
– Retrieve documents containing the seed patterns: the "relevant documents"
– Rank patterns in the relevant documents by their frequency in relevant documents vs. their overall frequency
– Add the top-ranked pattern to the seed pattern set (sketched below)
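A minimal Python sketch of this bootstrapping loop, assuming each document is represented as the set of candidate patterns it contains; the fixed iteration count and the exact scoring function are illustrative assumptions, not the authors' implementation:

    def discover_patterns(corpus, seeds, iterations=80):
        # corpus: list of documents, each a set of candidate patterns (clause tuples)
        # seeds: initial hand-written pattern set (the "query")
        accepted = set(seeds)
        for _ in range(iterations):
            # 1. Retrieve "relevant" documents: those matching any accepted pattern.
            relevant = [doc for doc in corpus if doc & accepted]
            if not relevant:
                break
            # 2. Rank remaining candidate patterns by density:
            #    frequency in relevant documents vs. overall frequency.
            candidates = set().union(*relevant) - accepted
            if not candidates:
                break
            def score(p):
                rel = sum(1 for doc in relevant if p in doc)
                overall = sum(1 for doc in corpus if p in doc)
                return (rel / overall) * rel
            # 3. Add the top-ranked pattern to the pattern set and repeat.
            accepted.add(max(candidates, key=score))
        return accepted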

Page 17: Automatic Discovery of Scenario-Level Patterns for  Information Extraction


#1: pick seed pattern

Seed: < person retires >

Page 18: Automatic Discovery of Scenario-Level Patterns for  Information Extraction


#2: retrieve relevant documents

Seed: < person retires >

Fred retired....

Harry was named president.

Maki retired....

Yuki was named president.

Relevant documents / Other documents

Page 19: Automatic Discovery of Scenario-Level Patterns for  Information Extraction


#3: pick new pattern

Seed: < person retires >

< person was named president > appears in several relevant documents (top-ranked by Riloff metric)

Fred retired....

Harry was named president.

Maki retired....

Yuki was named president.

Page 20: Automatic Discovery of Scenario-Level Patterns for  Information Extraction


#4: add new pattern to pattern set

Pattern set: < person retires >

< person was named president >

Page 21: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Pre-processing

For each document, find and classify names: { person | location | organization | … }

Parse the document (regularize passives, relative clauses, etc.)

For each clause, collect a candidate pattern: a tuple of the heads of
[ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
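Purely as an illustration of what such a candidate tuple might look like (the field names and the class substitution shown are assumptions, not the Proteus data structures):

    from typing import NamedTuple, Optional

    class CandidatePattern(NamedTuple):
        # Heads of the clause constituents, with classified names replaced by their class.
        subject: Optional[str]      # e.g. "company"
        verb: str                   # e.g. "appoint"
        obj: Optional[str]          # direct object head, e.g. "person"
        complement: Optional[str]   # object/subject complement, e.g. "officer"

    # "Nielsen Marketing Research appointed George Garrick chief executive officer"
    # would yield, after name classification and parsing:
    example = CandidatePattern(subject="company", verb="appoint",
                               obj="person", complement="officer")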

Page 22: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Experiment

Task: management succession (as in MUC-6)
Source: Wall Street Journal
Training corpus: ~6,000 articles
Test corpus:
– 100 documents: MUC-6 formal training
– + 150 documents judged manually

Page 23: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Experiment: two seed patterns

v-appoint = { appoint, elect, promote, name }
v-resign  = { resign, depart, quit, step-down }

Subject | Verb      | Object
company | v-appoint | person
person  | v-resign  | -

Run the discovery procedure for 80 iterations.
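One possible encoding of these seeds and their verb classes, shown for illustration only (the triple format and the matching helper are assumptions, not the Proteus pattern language):

    # Verb classes used by the two seed patterns.
    V_APPOINT = {"appoint", "elect", "promote", "name"}
    V_RESIGN = {"resign", "depart", "quit", "step-down"}

    # Seed patterns as (subject class, verb class, object class); None = unconstrained.
    SEED_PATTERNS = [
        ("company", V_APPOINT, "person"),   # company v-appoint person
        ("person", V_RESIGN, None),         # person v-resign -
    ]

    def matches(pattern, clause):
        # clause is a (subject, verb, object) tuple of clause heads.
        subj, verbs, obj = pattern
        c_subj, c_verb, c_obj = clause
        return c_subj == subj and c_verb in verbs and (obj is None or c_obj == obj)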

Page 24: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Evaluation

Look at discovered patterns
– new patterns, missed in manual training

Document filtering

Slot filling

Page 25: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Discovered patterns

Subject | Verb               | Object
company | v-appoint          | person
person  | v-resign           | -
person  | succeed            | person
person  | be / become        | president / officer / chairman / executive
company | name               | president / …
person  | join / run / leave | company
person  | serve              | board / company
person  | leave              | post

Page 26: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Evaluation: new patterns

Not found in manual training:

Subject | Verb                     | Object   | Complements
company | bring                    | person   | [as+officer]
person  | come / return            | -        | [to+company] [as+officer]
person  | rejoin                   | company  | [as+officer]
person  | continue / remain / stay | -        | [as+officer]
person  | replace                  | person   | [as+officer]
person  | pursue                   | interest | -

Page 27: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Evaluation: Text Filtering

How effective are the discovered patterns at selecting relevant documents?
– IR-style evaluation
– a document is selected if it matches at least one pattern

Pattern set       | Recall | Precision
Seed              | 11%    | 93%
Seed + discovered | 88%    | 81% (85)
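A small sketch of this document-level evaluation, assuming a corpus mapping each document id to its pattern set and a dictionary of manual relevance judgements (both representations are assumptions for illustration):

    def text_filtering_scores(corpus, judgements, patterns):
        # corpus: {doc_id: set of patterns occurring in the document}
        # judgements: {doc_id: True if the document is relevant}
        selected = {d for d, pats in corpus.items() if pats & patterns}
        relevant = {d for d, rel in judgements.items() if rel}
        hits = len(selected & relevant)
        recall = hits / len(relevant) if relevant else 0.0
        precision = hits / len(selected) if selected else 0.0
        return recall, precision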

Page 28: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

[Figure: "250 Test Documents (.5)" – text-filtering recall and precision vs. generation # (0 to 80).]

Page 29: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

[Figure: "Choice of Test Corpus (.5)" – precision vs. recall for the 250-document corpus, the 100-document MUC corpus, and the MUC-6 participant systems ("MUC-6 Players").]

Page 30: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Evaluation: Slot filling

How effective are patterns within a complete IE system?
MUC-style IE on MUC-6 corpora
Caveat

Pattern set  | Training: R / P / F | Test: R / P / F
seed         | 28 / 78 / 41        |
+ discovered | 51 / 76 / 61        |
manual–MUC   | 54 / 71 / 62        | 47 / 70 / 56.40
manual–now   | 69 / 79 / 74        | 54.7 / 74.4 / 63.02

(R = recall, P = precision, F = F-measure)



Page 33: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Conclusion: Automatic discovery

Performance comparable to manual development (4 weeks of human effort)

Works from un-annotated text, which allows us to take advantage of very large corpora:
– redundancy
– duality

Will likely help wider use of IE


Page 35: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Good Patterns

U – universe of all documents
R – set of relevant documents
H = H(p) – set of documents where pattern p matched

Density criterion: p is a good pattern when the density of relevant documents among the documents it matches is much higher than their overall density, i.e. |H ∩ R| / |H| >> |R| / |U|

Page 36: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Graded Relevance

Documents matching the seed patterns are considered 100% relevant.
Discovered patterns are considered less certain.
Documents containing them are considered only partially relevant.
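One way such partial relevance could be computed, shown only as an assumed illustration (the combination rule below is not necessarily the one used by the authors):

    def document_relevance(doc_patterns, confidence):
        # confidence: pattern -> confidence in [0, 1]; seed patterns have confidence 1.0.
        # A document's relevance is taken here as 1 - prod(1 - conf) over its patterns,
        # so any seed match yields relevance 1.0, and weaker patterns yield less.
        not_relevant = 1.0
        for p in doc_patterns:
            not_relevant *= 1.0 - confidence.get(p, 0.0)
        return 1.0 - not_relevant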

Page 37: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Scoring Patterns

Score(p) = (document frequency in relevant documents / overall document frequency) × document frequency in relevant documents

– (metrics similar to those used in Riloff-96)

