Automatic Discovery of Scenario-Level Patterns for Information Extraction

Transcript
Page 1: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

NYU, ANLP-00

Automatic Discovery of Scenario-Level Patterns for Information Extraction

Roman Yangarber
Ralph Grishman
Pasi Tapanainen
Silja Huttunen

Page 2: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Outline

Information Extraction: background
Problems in IE
Prior Work: Machine Learning for IE
Discover patterns from raw text
Experimental results
Current work

Page 3: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Quick Overview

What is Information Extraction? Definition:

– finding facts about a specified class of events from free text

– filling a table in a database (slots in a template)

Events: instances of relations, with many arguments

Page 4: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Example: Management Succession

– "George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA."

Page 5: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Example: Management Succession

– "George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA."

Position  | Company                              | Location | Person         | Status
President | European Information Services, Inc.  | London   | George Garrick | Out
CEO       | Nielsen Marketing Research           | USA      | George Garrick | In

Page 6: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

System Architecture: Proteus

Input Text
→ Lexical Analysis       (sentence level)
→ Name Recognition       (sentence level)
→ Partial Syntax         (sentence level)
→ Scenario Patterns      (sentence level)
→ Reference Resolution   (discourse level)
→ Discourse Analyzer     (discourse level)
→ Output Generation
→ Extracted Information


Page 8: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Problems

Customization
Performance

Page 9: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Problems: Customization

To customize a system for a new extraction task, we have to develop:
– new patterns for new types of events
– word classes for the domain
– inference rules

This can be a large job requiring skilled labor; the expense of customization limits the uses of extraction.

Page 10: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Problems: Performance

Performance on event IE is limited. On MUC tasks, typical top performance is recall < 55%, precision < 75%.

Errors propagate through multiple phases:
– name recognition errors
– syntax analysis errors
– missing patterns
– reference resolution errors
– complex inference required

Page 11: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Missing Patterns

As with many language phenomena:
– a few common patterns
– a large number of rare patterns

Rare patterns do not surface sufficiently often in a limited corpus.

Missing patterns make customization expensive and limit performance.

Finding good patterns is necessary to improve customization and performance.

[Figure: pattern frequency vs. rank]

Page 12: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Prior Research

Build patterns from examples
– Yangarber '97

Generalize from multiple examples: annotated text
– Crystal, Whisk (Soderland), Rapier (Califf)

Active learning: reduce annotation
– Soderland '99, Califf '99

Learning from a corpus with relevance judgements
– Riloff '96, '99

Co-learning / bootstrapping
– Brin '98, Agichtein '00

Page 13: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Our Goals

Minimize the manual labor required to construct pattern bases for a new domain:
– un-annotated text
– un-classified text
– un-supervised learning

Use very large corpora (larger than we could ever tag manually) to improve coverage of patterns.

Page 14: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Principle: Pattern Density

If we have relevance judgements for the documents in a corpus, for a given task, then the patterns that are much more frequent in relevant documents than in the corpus overall will generally be good patterns.

Riloff (1996) uses this to find patterns related to terrorist attacks.

Page 15: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Principle: Duality

Duality between patterns and documents:
– relevant documents are strong indicators of good patterns
– good patterns are strong indicators of relevant documents

Page 16: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Outline of Procedure

Initial query: a small set of seed patterns which partially characterize the topic of interest.

Repeat:
– Retrieve documents containing the seed patterns: the "relevant documents"
– Rank patterns in the relevant documents by their frequency in relevant documents vs. their overall frequency
– Add the top-ranked pattern to the seed pattern set (sketched below)
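A minimal Python sketch of this bootstrapping loop, assuming each document is represented as the set of candidate patterns it contains; the fixed iteration count and the exact scoring function are illustrative assumptions, not the authors' implementation:

    def discover_patterns(corpus, seeds, iterations=80):
        # corpus: list of documents, each a set of candidate patterns (clause tuples)
        # seeds: initial hand-written pattern set (the "query")
        accepted = set(seeds)
        for _ in range(iterations):
            # 1. Retrieve "relevant" documents: those matching any accepted pattern.
            relevant = [doc for doc in corpus if doc & accepted]
            if not relevant:
                break
            # 2. Rank remaining candidate patterns by density:
            #    frequency in relevant documents vs. overall frequency.
            candidates = set().union(*relevant) - accepted
            if not candidates:
                break
            def score(p):
                rel = sum(1 for doc in relevant if p in doc)
                overall = sum(1 for doc in corpus if p in doc)
                return (rel / overall) * rel
            # 3. Add the top-ranked pattern to the pattern set and repeat.
            accepted.add(max(candidates, key=score))
        return accepted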

Page 17: Automatic Discovery of Scenario-Level Patterns for  Information Extraction


#1: pick seed pattern

Seed: < person retires >

Page 18: Automatic Discovery of Scenario-Level Patterns for  Information Extraction


#2: retrieve relevant documents

Seed: < person retires >

Fred retired....

Harry was named president.

Maki retired....

Yuki was named president.

Relevant documents / Other documents

Page 19: Automatic Discovery of Scenario-Level Patterns for  Information Extraction


#3: pick new pattern

Seed: < person retires >

< person was named president > appears in several relevant documents (top-ranked by Riloff metric)

Fred retired....

Harry was named president.

Maki retired....

Yuki was named president.

Page 20: Automatic Discovery of Scenario-Level Patterns for  Information Extraction


#4: add new pattern to pattern set

Pattern set: < person retires >

< person was named president >

Page 21: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Pre-processing

For each document, find and classify names: { person | location | organization | … }

Parse the document (regularize passives, relative clauses, etc.)

For each clause, collect a candidate pattern: a tuple of the heads of
[ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
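Purely as an illustration of what such a candidate tuple might look like (the field names and the class substitution shown are assumptions, not the Proteus data structures):

    from typing import NamedTuple, Optional

    class CandidatePattern(NamedTuple):
        # Heads of the clause constituents, with classified names replaced by their class.
        subject: Optional[str]      # e.g. "company"
        verb: str                   # e.g. "appoint"
        obj: Optional[str]          # direct object head, e.g. "person"
        complement: Optional[str]   # object/subject complement, e.g. "officer"

    # "Nielsen Marketing Research appointed George Garrick chief executive officer"
    # would yield, after name classification and parsing:
    example = CandidatePattern(subject="company", verb="appoint",
                               obj="person", complement="officer")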

Page 22: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Experiment

Task: management succession (as in MUC-6)
Source: Wall Street Journal
Training corpus: ~6,000 articles
Test corpus:
– 100 documents: MUC-6 formal training
– + 150 documents judged manually

Page 23: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Experiment: two seed patterns

v-appoint = { appoint, elect, promote, name }
v-resign  = { resign, depart, quit, step-down }

Subject | Verb      | Object
company | v-appoint | person
person  | v-resign  | -

Run the discovery procedure for 80 iterations.
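One possible encoding of these seeds and their verb classes, shown for illustration only (the triple format and the matching helper are assumptions, not the Proteus pattern language):

    # Verb classes used by the two seed patterns.
    V_APPOINT = {"appoint", "elect", "promote", "name"}
    V_RESIGN = {"resign", "depart", "quit", "step-down"}

    # Seed patterns as (subject class, verb class, object class); None = unconstrained.
    SEED_PATTERNS = [
        ("company", V_APPOINT, "person"),   # company v-appoint person
        ("person", V_RESIGN, None),         # person v-resign -
    ]

    def matches(pattern, clause):
        # clause is a (subject, verb, object) tuple of clause heads.
        subj, verbs, obj = pattern
        c_subj, c_verb, c_obj = clause
        return c_subj == subj and c_verb in verbs and (obj is None or c_obj == obj)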

Page 24: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Evaluation

Look at discovered patterns
– new patterns, missed in manual training

Document filtering

Slot filling

Page 25: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Discovered patterns

Subject | Verb               | Object
company | v-appoint          | person
person  | v-resign           | -
person  | succeed            | person
person  | be / become        | president / officer / chairman / executive
company | name               | president / …
person  | join / run / leave | company
person  | serve              | board / company
person  | leave              | post

Page 26: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Evaluation: new patterns

Not found in manual training:

Subject | Verb                     | Object   | Complements
company | bring                    | person   | [as+officer]
person  | come / return            | -        | [to+company] [as+officer]
person  | rejoin                   | company  | [as+officer]
person  | continue / remain / stay | -        | [as+officer]
person  | replace                  | person   | [as+officer]
person  | pursue                   | interest | -

Page 27: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Evaluation: Text Filtering

How effective are the discovered patterns at selecting relevant documents?
– IR-style evaluation
– a document is selected if it matches at least one pattern

Pattern set       | Recall | Precision
Seed              | 11%    | 93%
Seed + discovered | 88%    | 81% (85)
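A small sketch of this document-level evaluation, assuming a corpus mapping each document id to its pattern set and a dictionary of manual relevance judgements (both representations are assumptions for illustration):

    def text_filtering_scores(corpus, judgements, patterns):
        # corpus: {doc_id: set of patterns occurring in the document}
        # judgements: {doc_id: True if the document is relevant}
        selected = {d for d, pats in corpus.items() if pats & patterns}
        relevant = {d for d, rel in judgements.items() if rel}
        hits = len(selected & relevant)
        recall = hits / len(relevant) if relevant else 0.0
        precision = hits / len(selected) if selected else 0.0
        return recall, precision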

Page 28: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

[Figure: "250 Test Documents (.5)" – text-filtering recall and precision vs. generation # (0 to 80).]

Page 29: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

[Figure: "Choice of Test Corpus (.5)" – precision vs. recall for the 250-document corpus, the 100-document MUC corpus, and the MUC-6 participant systems ("MUC-6 Players").]

Page 30: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Evaluation: Slot filling

How effective are patterns within a complete IE system?
MUC-style IE on MUC-6 corpora
Caveat

Pattern set  | Training: R / P / F | Test: R / P / F
seed         | 28 / 78 / 41        |
+ discovered | 51 / 76 / 61        |
manual–MUC   | 54 / 71 / 62        | 47 / 70 / 56.40
manual–now   | 69 / 79 / 74        | 54.7 / 74.4 / 63.02

(R = recall, P = precision, F = F-measure)



Page 33: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Conclusion: Automatic discovery

Performance comparable to manual development (4 weeks of human effort)

Works from un-annotated text, which allows us to take advantage of very large corpora:
– redundancy
– duality

Will likely help wider use of IE


Page 35: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Good Patterns

U – universe of all documents
R – set of relevant documents
H = H(p) – set of documents where pattern p matched

Density criterion: p is a good pattern when the density of relevant documents among the documents it matches is much higher than their overall density, i.e. |H ∩ R| / |H| >> |R| / |U|

Page 36: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Graded Relevance

Documents matching the seed patterns are considered 100% relevant.
Discovered patterns are considered less certain.
Documents containing them are considered only partially relevant.
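One way such partial relevance could be computed, shown only as an assumed illustration (the combination rule below is not necessarily the one used by the authors):

    def document_relevance(doc_patterns, confidence):
        # confidence: pattern -> confidence in [0, 1]; seed patterns have confidence 1.0.
        # A document's relevance is taken here as 1 - prod(1 - conf) over its patterns,
        # so any seed match yields relevance 1.0, and weaker patterns yield less.
        not_relevant = 1.0
        for p in doc_patterns:
            not_relevant *= 1.0 - confidence.get(p, 0.0)
        return 1.0 - not_relevant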

Page 37: Automatic Discovery of Scenario-Level Patterns for  Information Extraction

Scoring Patterns

Score(p) = (document frequency in relevant documents / overall document frequency) × document frequency in relevant documents

– (metrics similar to those used in Riloff-96)

