Page 1: Querying Text Databases for  Efficient Information Extraction

Querying Text Databases for Efficient Information Extraction

Eugene Agichtein, Luis Gravano

Columbia University

Page 2: Querying Text Databases for  Efficient Information Extraction

Extracting Structured Information “Buried” in Text Documents

Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

Microsoft's central headquarters in Redmond is home to almost every product group and division.

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer’s headquarters in Cupertino, was fired Monday for "thinking a little too different."

Organization     Location
Microsoft        Redmond
Apple Computer   Cupertino
Nike             Portland

Page 3: Querying Text Databases for  Efficient Information Extraction


Information Extraction Applications

• Over a corporation’s customer report or email complaint database: enabling sophisticated querying and analysis

• Over biomedical literature: identifying drug/condition interactions

• Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence

Significant progress over the last decade [MUC]

Page 4: Querying Text Databases for  Efficient Information Extraction

Information Extraction Example: Organizations’ Headquarters

Input: Documents (doc4):
Brent Barlow, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for "thinking a little too different."

Named-Entity Tagging (doc4):
<PERSON>Brent Barlow</PERSON>, a software analyst and beta-tester at <ORGANIZATION>Apple Computer</ORGANIZATION>'s headquarters in <LOCATION>Cupertino</LOCATION>, was fired Monday for "thinking a little too different."

Extraction Patterns:
p1: <ORGANIZATION>'s headquarters in <LOCATION>
p2: <ORGANIZATION>, based in <LOCATION>

Pattern Matching (doc4, pattern p1): <ORGANIZATION> = Apple Computer, <LOCATION> = Cupertino

Output: Tuples
tid   Organization     Location    W     Useful document
1     Eastman Kodak    Rochester   0.9   doc2
2     Apple Computer   Cupertino   0.8   doc4

Page 5: Querying Text Databases for  Efficient Information Extraction

Goal: Extract All Tuples of a Relation from a Document Database

Text Database → Information Extraction System → Extracted Tuples

• One approach: feed every document to the information extraction system
• Problem: efficiency!

Page 6: Querying Text Databases for  Efficient Information Extraction

Information Extraction is Expensive

• Efficiency is a problem even after training the information extraction system. Example: NYU’s Proteus extraction system takes around 9 seconds per document, or over 15 days to process 135,000 news articles
• “Filtering” before further processing a document might help
• Can’t afford to “scan the web” to process each page!
• “Hidden-Web” databases don’t allow crawling

Page 7: Querying Text Databases for  Efficient Information Extraction

Information Extraction Without Processing All Documents

• Observation: often only a small fraction of the database is relevant for an extraction task
• Our approach: exploit the database search engine to retrieve and process only “promising” documents

Page 8: Querying Text Databases for  Efficient Information Extraction

Architecture of our QXtract System

User-Provided Seed Tuples (e.g., Microsoft–Redmond, Apple–Cupertino)
  → Query Generation → Queries
  → Search Engine (over the Text Database) → Promising Documents
  → Information Extraction → Extracted Relation (e.g., Microsoft–Redmond, Apple–Cupertino, Exxon–Irving, IBM–Armonk, Intel–Santa Clara)

Key problem: Learn queries to retrieve “promising” documents

Page 9: Querying Text Databases for  Efficient Information Extraction

Generating Queries to Retrieve Promising Documents

1. Seed Sampling: get a document sample with “likely negative” and “likely positive” examples.
2. Label the sample documents using the information extraction system as an “oracle.”
3. Classifier Training: train classifiers to “recognize” useful documents.
4. Query Generation: generate queries from the classifier model/rules.
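Taken together, the four steps form a single pipeline. The sketch below is a toy stand-in, not the paper's implementation: the substring-matching search engine, the regex "extractor," and the degenerate word-frequency "classifier" are all hypothetical simplifications of the black-box components QXtract assumes.

```python
import re
from collections import Counter

def search(db, query):
    """Toy search engine: return documents containing every query term."""
    return [d for d in db if all(t in d.lower() for t in query)]

def extract(doc):
    """Toy black-box extractor: finds "<org>, based in <loc>" phrases."""
    return re.findall(r"(\w+), based in (\w+)", doc)

def qxtract(db, seed_tuples, max_fraction=0.25):
    # 1. Seed Sampling: query with each seed tuple's attribute values.
    sample = []
    for org, loc in seed_tuples:
        sample += search(db, [org.lower(), loc.lower()])
    # 2. Oracle labeling: a sampled document is useful if extraction succeeds.
    positives = [d for d in sample if extract(d)]
    # 3./4. Degenerate "classifier": rank words by frequency in useful
    # documents and turn the top words into single-term queries.
    counts = Counter(w for d in positives for w in d.lower().split())
    queries = [[w] for w, _ in counts.most_common(3)]
    # Retrieve "promising" documents, up to the retrieval budget, and run
    # the (expensive) extractor on those documents only.
    budget = int(max_fraction * len(db))
    promising = []
    for q in queries:
        for d in search(db, q):
            if d not in promising:
                promising.append(d)
    promising = promising[:budget]
    return {t for d in promising for t in extract(d)}
```

Note that the extractor runs only over the retrieved subset, which is where the efficiency gain comes from: documents never retrieved are never processed.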

Page 10: Querying Text Databases for  Efficient Information Extraction

Getting a Training Document Sample

Get a document sample with “likely negative” and “likely positive” examples:

• Queries derived from the User-Provided Seed Tuples (e.g., Microsoft AND Redmond, Apple AND Cupertino) retrieve likely positive examples
• “Random” queries retrieve likely negative examples
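Both kinds of sampling queries can be generated mechanically. A minimal sketch (the function names and the vocabulary source are hypothetical, not from the paper):

```python
import random

def seed_queries(seed_tuples):
    """One conjunctive query per seed tuple, from its attribute values."""
    return [" AND ".join(attrs) for attrs in seed_tuples]

def random_queries(vocabulary, n, rng_seed=0):
    """n single-word "random" queries drawn from some word list."""
    return random.Random(rng_seed).sample(vocabulary, n)

seeds = [("Microsoft", "Redmond"), ("Apple", "Cupertino")]
print(seed_queries(seeds))  # ['Microsoft AND Redmond', 'Apple AND Cupertino']
```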

Page 11: Querying Text Databases for  Efficient Information Extraction

Labeling the Training Document Sample

Use the information extraction system as an “oracle” to label the sampled examples as “true positive” or “true negative”: documents from which the system extracts tuples (e.g., Microsoft–Redmond, Apple–Cupertino, IBM–Armonk) are positive; the rest are negative.
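The oracle-labeling step reduces to one pass of the extractor over the sample. A sketch, with a toy regex extractor standing in for the black-box extraction system (both helper names are hypothetical):

```python
import re

def extract(doc):
    """Toy stand-in extractor: finds "<org>'s headquarters in <loc>"."""
    return re.findall(r"(\w+)'s headquarters in (\w+)", doc)

def label_sample(sample):
    """Split the document sample into true positives and true negatives."""
    positives = [d for d in sample if extract(d)]
    negatives = [d for d in sample if not extract(d)]
    return positives, negatives
```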

Page 12: Querying Text Databases for  Efficient Information Extraction

Training Classifiers to Recognize “Useful” Documents

Document features: words

Example word features from the labeled sample:
• Useful (+): is, based, near, city; spokesperson, reported, news, earnings, release
• Not useful (−): products, made, used, exported, far; past, old, homerun, sponsored, event

Classifiers trained over these features:
• Ripper (rule learner): based AND near => Useful
• SVM (term weights): based 3, spokesperson 2, sponsored −1
• Okapi (IR): ranks terms such as is, based, near, spokesperson, earnings, sponsored, event, far, homerun
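The slide's actual classifiers are Ripper, an SVM, and Okapi. Purely as an unofficial illustration of learning per-word weights of the kind shown for the SVM (based 3, spokesperson 2, sponsored −1), here is a tiny perceptron over bag-of-words features:

```python
from collections import defaultdict

def featurize(doc):
    """Bag-of-words features: the set of lowercased tokens."""
    return set(doc.lower().split())

def train_perceptron(labeled_docs, epochs=10):
    """labeled_docs: list of (text, label), label +1 (useful) or -1 (not)."""
    w = defaultdict(int)
    for _ in range(epochs):
        for doc, y in labeled_docs:
            feats = featurize(doc)
            score = sum(w[f] for f in feats)
            if y * score <= 0:          # misclassified: nudge weights toward y
                for f in feats:
                    w[f] += y
    return dict(w)

docs = [
    ("company based near city", +1),
    ("spokesperson reported earnings", +1),
    ("old homerun sponsored event", -1),
]
w = train_perceptron(docs)
```

After training, words that occur in useful documents carry positive weight and words from useless documents carry negative weight, which is exactly the signal the next step mines for queries.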

Page 13: Querying Text Databases for  Efficient Information Extraction

Generating Queries from Classifiers

• Ripper: the rule “based AND near => Useful” becomes the query based AND near
• SVM: the highest positive-weight terms (based 3, spokesperson 2) become the queries based and spokesperson
• Okapi (IR): top-ranked terms yield queries such as spokesperson and earnings
• QCombined: the union of the queries generated from all three classifiers (e.g., based AND near, based, spokesperson, earnings)
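For a linear model like the SVM above, query generation can be as simple as keeping the top positive-weight terms; a hypothetical sketch:

```python
def queries_from_weights(weights, k=2):
    """Turn a linear model's term weights into single-term queries:
    keep the k highest-weighted terms whose weight is positive."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, wt in ranked[:k] if wt > 0]

w = {"based": 3, "spokesperson": 2, "sponsored": -1}
print(queries_from_weights(w))  # ['based', 'spokesperson']
```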

Page 14: Querying Text Databases for  Efficient Information Extraction

Architecture of our QXtract System

User-Provided Seed Tuples (e.g., Microsoft–Redmond, Apple–Cupertino)
  → Query Generation → Queries
  → Search Engine (over the Text Database) → Promising Documents
  → Information Extraction → Extracted Relation (e.g., Microsoft–Redmond, Apple–Cupertino, Exxon–Irving, IBM–Armonk, Intel–Santa Clara)

Page 15: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Data

• Training set: 1996 New York Times archive of 137,000 newspaper articles; used to tune QXtract parameters
• Test set: 1995 New York Times archive of 135,000 newspaper articles

Page 16: Querying Text Databases for  Efficient Information Extraction


Final Configuration of QXtract, from Training

Page 17: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Information Extraction Systems and Associated Relations

• DIPRE [Brin 1998] – Headquarters(Organization, Location)
• Snowball [Agichtein and Gravano 2000] – Headquarters(Organization, Location)
• Proteus [Grishman et al. 2002] – DiseaseOutbreaks(DiseaseName, Location, Country, Date, …)

Page 18: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Seed Tuples

Headquarters                    DiseaseOutbreaks
Organization  Location          DiseaseName       Location
Microsoft     Redmond           Malaria           Ethiopia
Exxon         Irving            Typhus            Bergen-Belsen
Boeing        Seattle           Flu               The Midwest
IBM           Armonk            Mad Cow Disease   The U.K.
Intel         Santa Clara       Pneumonia         The U.S.

Page 19: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Metrics

• Gold standard: relation Rall, obtained by running the information extraction system over every document in the database Dall
• Recall: % of Rall captured in the approximation extracted from the retrieved documents
• Precision: % of retrieved documents that are “useful” (i.e., produced tuples)
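These two metrics are straightforward to compute once per-document extraction results are in hand; a sketch with illustrative (hypothetical) names:

```python
def recall(r_all, r_retrieved):
    """% of the gold-standard relation recovered from retrieved documents."""
    return 100.0 * len(r_all & r_retrieved) / len(r_all)

def precision(retrieved_docs, tuples_per_doc):
    """% of retrieved documents that produced at least one tuple."""
    useful = [d for d in retrieved_docs if tuples_per_doc.get(d)]
    return 100.0 * len(useful) / len(retrieved_docs)
```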

Page 20: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Relation Statistics

Relation and Extraction System   |Dall|    % Useful   |Rall|
Headquarters: Snowball           135,000   23         24,536
Headquarters: DIPRE              135,000   22         20,952
DiseaseOutbreaks: Proteus        135,000    4          8,859

Page 21: Querying Text Databases for  Efficient Information Extraction

Alternative Query Generation Strategies

• QXtract, with the final configuration from training
• Tuples: keep deriving queries from extracted tuples
  – Problem: “disconnected” databases
• Patterns: derive queries from the extraction patterns of the information extraction system
  – “<ORGANIZATION>, based in <LOCATION>” => “based in”
  – Problems: pattern features are often not suitable for querying, or not visible from a “black-box” extraction system
• Manual: construct queries manually [MUC]
  – Obtained for Proteus from its developers
  – Not available for DIPRE and Snowball

Plus a simple additional baseline: retrieve a random document sample of the appropriate size

Page 22: Querying Text Databases for  Efficient Information Extraction

Recall and Precision: Headquarters Relation; Snowball Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%–25% of |Dall|), for QXtract, Patterns, Tuples, and Baseline.]

Page 23: Querying Text Databases for  Efficient Information Extraction

Recall and Precision: Headquarters Relation; DIPRE Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%–25% of |Dall|), for QXtract, Patterns, Tuples, and Baseline.]

Page 24: Querying Text Databases for  Efficient Information Extraction

Extraction Efficiency and Recall: DiseaseOutbreaks Relation; Proteus Extraction System

[Figure: recall (%) vs. MaxFractionRetrieved for QXtract, Manual, Tuples, and Baseline; running time in days for QXtract over 10% of the documents (1.4 days) vs. a full Scan over 100% (15.5 days).]

60% of the relation extracted from just 10% of the documents of the 135,000 newspaper article database

Page 25: Querying Text Databases for  Efficient Information Extraction


Snowball/Headquarters Queries

Page 26: Querying Text Databases for  Efficient Information Extraction


DIPRE/Headquarters Queries

Page 27: Querying Text Databases for  Efficient Information Extraction


Proteus/DiseaseOutbreaks Queries

Page 28: Querying Text Databases for  Efficient Information Extraction

Current Work: Characterizing Databases for an Extraction Task

Choosing a strategy from database properties:
• If the database is not sparse (useful documents are common), Scan it.
• If it is sparse, check whether it is “connected”: do queries built from extracted tuples lead to documents containing further tuples?
  – Connected: use the Tuples strategy.
  – Not connected: use QXtract.

Page 29: Querying Text Databases for  Efficient Information Extraction

Related Work

• Information extraction: focus on the quality of extracted relations [MUC]; most relevant sub-task: text filtering
  – Filters derived from extraction patterns, or consisting of words (manually created or from supervised learning)
  – Grishman et al.’s manual pattern-based filters for disease outbreaks
  – Related to the Manual and Patterns strategies in our experiments
  – Focus not on querying using a simple search interface
• Information retrieval: focus on relevant documents for queries
  – In our scenario, relevance is determined by the “extraction task” and the associated information extraction system
• Automatic query generation: several efforts for different tasks:
  – Minority language corpora construction [Ghani et al. 2001]
  – Topic-specific document search (e.g., [Cohen & Singer 1996])

Page 30: Querying Text Databases for  Efficient Information Extraction

Contributions: An Unsupervised Query-Based Technique for Efficient Information Extraction

• Adapts to an “arbitrary” underlying information extraction system and document database
• Can work over non-crawlable “Hidden-Web” databases
• Minimal user input required: a handful of example tuples
• Can trade off relation completeness against extraction efficiency
• Particularly interesting in conjunction with unsupervised/bootstrapping-based information extraction systems (e.g., DIPRE, Snowball)

Page 31: Querying Text Databases for  Efficient Information Extraction

Questions?

Page 32: Querying Text Databases for  Efficient Information Extraction

Overflow Slides

Page 33: Querying Text Databases for  Efficient Information Extraction

Related Work (II)

• Focused crawling (e.g., [Chakrabarti et al. 2002]): uses link and page classification to crawl pages on a topic
• Hidden-Web crawling [Raghavan & Garcia-Molina 2001]: retrieves pages from non-crawlable Hidden-Web databases
  – Needs a rich query interface, with distinguishable attributes
  – Related to the Tuples strategy, but “tuples” are derived from pull-down menus, etc., of the search interfaces as found
  – Our goal: retrieve as few documents as possible from one database to extract the relation
• Question-answering systems

Page 34: Querying Text Databases for  Efficient Information Extraction

Related Work (III)

• [Mitchell, Riloff, et al. 1998] use “linguistic phrases” derived from information extraction patterns as features for text categorization
  – Related to the Patterns strategy; requires document parsing, so can’t directly generate simple queries
• [Gaizauskas & Robertson 1997] use 9 manually generated keywords to search for documents relevant to a MUC extraction task

Page 35: Querying Text Databases for  Efficient Information Extraction

Recall and Precision: DiseaseOutbreaks Relation; Proteus Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%–25% of |Dall|), for QXtract, Manual, Manual+QXtract, Tuples, and Baseline.]

Page 36: Querying Text Databases for  Efficient Information Extraction

Running Times

[Figure: running time vs. MaxFractionRetrieved (5%, 10%, 100% of |Dall|) for Proteus (days), Snowball (minutes), and DIPRE (minutes), comparing FullScan, QuickScan, and QXtract, with QXtract time broken down into extraction and training.]

Page 37: Querying Text Databases for  Efficient Information Extraction

Extracting Relations from Text: Snowball [ACM DL’00]

• Exploit redundancy on the web to focus on “easy” instances
• Require only minimal training (a handful of seed tuples)

Bootstrapping loop: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table

ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA
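The bootstrapping loop above can be sketched in miniature. This is a toy string-matching caricature, not Snowball itself (Snowball uses named-entity tags and vector-space pattern similarity); all helper names are hypothetical:

```python
import re

def find_patterns(db, seeds):
    """Generalize each seed occurrence into the text between its attributes."""
    patterns = set()
    for org, loc in seeds:
        for doc in db:
            m = re.search(re.escape(org) + r"(.{1,30}?)" + re.escape(loc), doc)
            if m:
                patterns.add(m.group(1))  # e.g. "'s headquarters in "
    return patterns

def apply_patterns(db, patterns):
    """Match the learned patterns anywhere in the database for new tuples."""
    tuples = set()
    for p in patterns:
        for doc in db:
            tuples.update(re.findall(r"(\w+)" + re.escape(p) + r"(\w+)", doc))
    return tuples

def snowball(db, seed_tuples, iterations=2):
    """Augment the table by alternating pattern and tuple generation."""
    table = set(seed_tuples)
    for _ in range(iterations):
        table |= apply_patterns(db, find_patterns(db, table))
    return table
```

Even this caricature shows why the approach needs only a handful of seed tuples: each iteration's new tuples generate new patterns, which in turn reach new tuples.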

