Page 1: Querying Text Databases for  Efficient Information Extraction

Querying Text Databases for Efficient Information Extraction

Eugene Agichtein, Luis Gravano

Columbia University

Page 2: Querying Text Databases for  Efficient Information Extraction

Extracting Structured Information “Buried” in Text Documents

Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

Microsoft's central headquarters in Redmond is home to almost every product group and division.

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer’s headquarters in Cupertino, was fired Monday for "thinking a little too different."

Organization     Location
Microsoft        Redmond
Apple Computer   Cupertino
Nike             Portland

Page 3: Querying Text Databases for  Efficient Information Extraction


Information Extraction Applications

• Over a corporation’s customer report or email complaint database: enabling sophisticated querying and analysis

• Over biomedical literature: identifying drug/condition interactions

• Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence

Significant progress over the last decade [MUC]

Page 4: Querying Text Databases for  Efficient Information Extraction

Information Extraction Example: Organizations’ Headquarters

Input: Documents (doc4):
Brent Barlow, a software analyst and beta-tester at Apple Computer's headquarters in Cupertino, was fired Monday for "thinking a little too different."

Named-Entity Tagging (doc4):
<PERSON>Brent Barlow</PERSON>, a software analyst and beta-tester at <ORGANIZATION>Apple Computer</ORGANIZATION>'s headquarters in <LOCATION>Cupertino</LOCATION>, was fired Monday for "thinking a little too different."

Extraction Patterns:
p1: <ORGANIZATION>'s headquarters in <LOCATION>
p2: <ORGANIZATION>, based in <LOCATION>

Pattern Matching (doc4, pattern p1): <ORGANIZATION> = Apple Computer, <LOCATION> = Cupertino

Output: Tuples
tid   Organization     Location    W     Useful document
1     Eastman Kodak    Rochester   0.9   doc2
2     Apple Computer   Cupertino   0.8   doc4

Page 5: Querying Text Databases for  Efficient Information Extraction

Goal: Extract All Tuples of a Relation from a Document Database

Text Database → Information Extraction System → Extracted Tuples

• One approach: feed every document to the information extraction system
• Problem: efficiency!

Page 6: Querying Text Databases for  Efficient Information Extraction

Information Extraction is Expensive

• Efficiency is a problem even after training the information extraction system. Example: NYU’s Proteus extraction system takes around 9 seconds per document, or over 15 days to process 135,000 news articles
• “Filtering” before further processing a document might help
• Can’t afford to “scan the web” to process each page!
• “Hidden-Web” databases don’t allow crawling

Page 7: Querying Text Databases for  Efficient Information Extraction

Information Extraction Without Processing All Documents

• Observation: often only a small fraction of the database is relevant for an extraction task
• Our approach: exploit the database search engine to retrieve and process only “promising” documents

Page 8: Querying Text Databases for  Efficient Information Extraction

Architecture of our QXtract System

User-Provided Seed Tuples (e.g., Microsoft–Redmond, Apple–Cupertino)
  → Query Generation → Queries
  → Search Engine (over the Text Database) → Promising Documents
  → Information Extraction → Extracted Relation (e.g., Microsoft–Redmond, Apple–Cupertino, Exxon–Irving, IBM–Armonk, Intel–Santa Clara)

Key problem: Learn queries to retrieve “promising” documents

Page 9: Querying Text Databases for  Efficient Information Extraction

Generating Queries to Retrieve Promising Documents

1. Seed Sampling: get a document sample with “likely negative” and “likely positive” examples.
2. Label the sample documents using the information extraction system as an “oracle.”
3. Classifier Training: train classifiers to “recognize” useful documents.
4. Query Generation: generate queries from the classifier model/rules.
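Taken together, the four steps form a single pipeline. The sketch below is a toy stand-in, not the paper's implementation: the substring-matching search engine, the regex "extractor," and the degenerate word-frequency "classifier" are all hypothetical simplifications of the black-box components QXtract assumes.

```python
import re
from collections import Counter

def search(db, query):
    """Toy search engine: return documents containing every query term."""
    return [d for d in db if all(t in d.lower() for t in query)]

def extract(doc):
    """Toy black-box extractor: finds "<org>, based in <loc>" phrases."""
    return re.findall(r"(\w+), based in (\w+)", doc)

def qxtract(db, seed_tuples, max_fraction=0.25):
    # 1. Seed Sampling: query with each seed tuple's attribute values.
    sample = []
    for org, loc in seed_tuples:
        sample += search(db, [org.lower(), loc.lower()])
    # 2. Oracle labeling: a sampled document is useful if extraction succeeds.
    positives = [d for d in sample if extract(d)]
    # 3./4. Degenerate "classifier": rank words by frequency in useful
    # documents and turn the top words into single-term queries.
    counts = Counter(w for d in positives for w in d.lower().split())
    queries = [[w] for w, _ in counts.most_common(3)]
    # Retrieve "promising" documents, up to the retrieval budget, and run
    # the (expensive) extractor on those documents only.
    budget = int(max_fraction * len(db))
    promising = []
    for q in queries:
        for d in search(db, q):
            if d not in promising:
                promising.append(d)
    promising = promising[:budget]
    return {t for d in promising for t in extract(d)}
```

Note that the extractor runs only over the retrieved subset, which is where the efficiency gain comes from: documents never retrieved are never processed.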

Page 10: Querying Text Databases for  Efficient Information Extraction

Getting a Training Document Sample

Get a document sample with “likely negative” and “likely positive” examples:

• Queries derived from the User-Provided Seed Tuples (e.g., Microsoft AND Redmond, Apple AND Cupertino) retrieve likely positive examples
• “Random” queries retrieve likely negative examples
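Both kinds of sampling queries can be generated mechanically. A minimal sketch (the function names and the vocabulary source are hypothetical, not from the paper):

```python
import random

def seed_queries(seed_tuples):
    """One conjunctive query per seed tuple, from its attribute values."""
    return [" AND ".join(attrs) for attrs in seed_tuples]

def random_queries(vocabulary, n, rng_seed=0):
    """n single-word "random" queries drawn from some word list."""
    return random.Random(rng_seed).sample(vocabulary, n)

seeds = [("Microsoft", "Redmond"), ("Apple", "Cupertino")]
print(seed_queries(seeds))  # ['Microsoft AND Redmond', 'Apple AND Cupertino']
```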

Page 11: Querying Text Databases for  Efficient Information Extraction

Labeling the Training Document Sample

Use the information extraction system as an “oracle” to label the sampled examples as “true positive” or “true negative”: documents from which the system extracts tuples (e.g., Microsoft–Redmond, Apple–Cupertino, IBM–Armonk) are positive; the rest are negative.
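The oracle-labeling step reduces to one pass of the extractor over the sample. A sketch, with a toy regex extractor standing in for the black-box extraction system (both helper names are hypothetical):

```python
import re

def extract(doc):
    """Toy stand-in extractor: finds "<org>'s headquarters in <loc>"."""
    return re.findall(r"(\w+)'s headquarters in (\w+)", doc)

def label_sample(sample):
    """Split the document sample into true positives and true negatives."""
    positives = [d for d in sample if extract(d)]
    negatives = [d for d in sample if not extract(d)]
    return positives, negatives
```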

Page 12: Querying Text Databases for  Efficient Information Extraction

Training Classifiers to Recognize “Useful” Documents

Document features: words

Example word features from the labeled sample:
• Useful (+): is, based, near, city; spokesperson, reported, news, earnings, release
• Not useful (−): products, made, used, exported, far; past, old, homerun, sponsored, event

Classifiers trained over these features:
• Ripper (rule learner): based AND near => Useful
• SVM (term weights): based 3, spokesperson 2, sponsored −1
• Okapi (IR): ranks terms such as is, based, near, spokesperson, earnings, sponsored, event, far, homerun
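The slide's actual classifiers are Ripper, an SVM, and Okapi. Purely as an unofficial illustration of learning per-word weights of the kind shown for the SVM (based 3, spokesperson 2, sponsored −1), here is a tiny perceptron over bag-of-words features:

```python
from collections import defaultdict

def featurize(doc):
    """Bag-of-words features: the set of lowercased tokens."""
    return set(doc.lower().split())

def train_perceptron(labeled_docs, epochs=10):
    """labeled_docs: list of (text, label), label +1 (useful) or -1 (not)."""
    w = defaultdict(int)
    for _ in range(epochs):
        for doc, y in labeled_docs:
            feats = featurize(doc)
            score = sum(w[f] for f in feats)
            if y * score <= 0:          # misclassified: nudge weights toward y
                for f in feats:
                    w[f] += y
    return dict(w)

docs = [
    ("company based near city", +1),
    ("spokesperson reported earnings", +1),
    ("old homerun sponsored event", -1),
]
w = train_perceptron(docs)
```

After training, words that occur in useful documents carry positive weight and words from useless documents carry negative weight, which is exactly the signal the next step mines for queries.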

Page 13: Querying Text Databases for  Efficient Information Extraction

Generating Queries from Classifiers

• Ripper: the rule “based AND near => Useful” becomes the query based AND near
• SVM: the highest positive-weight terms (based 3, spokesperson 2) become the queries based and spokesperson
• Okapi (IR): top-ranked terms yield queries such as spokesperson and earnings
• QCombined: the union of the queries generated from all three classifiers (e.g., based AND near, based, spokesperson, earnings)
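For a linear model like the SVM above, query generation can be as simple as keeping the top positive-weight terms; a hypothetical sketch:

```python
def queries_from_weights(weights, k=2):
    """Turn a linear model's term weights into single-term queries:
    keep the k highest-weighted terms whose weight is positive."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, wt in ranked[:k] if wt > 0]

w = {"based": 3, "spokesperson": 2, "sponsored": -1}
print(queries_from_weights(w))  # ['based', 'spokesperson']
```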

Page 14: Querying Text Databases for  Efficient Information Extraction

Architecture of our QXtract System

User-Provided Seed Tuples (e.g., Microsoft–Redmond, Apple–Cupertino)
  → Query Generation → Queries
  → Search Engine (over the Text Database) → Promising Documents
  → Information Extraction → Extracted Relation (e.g., Microsoft–Redmond, Apple–Cupertino, Exxon–Irving, IBM–Armonk, Intel–Santa Clara)

Page 15: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Data

• Training set: 1996 New York Times archive of 137,000 newspaper articles; used to tune QXtract parameters
• Test set: 1995 New York Times archive of 135,000 newspaper articles

Page 16: Querying Text Databases for  Efficient Information Extraction


Final Configuration of QXtract, from Training

Page 17: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Information Extraction Systems and Associated Relations

• DIPRE [Brin 1998] – Headquarters(Organization, Location)
• Snowball [Agichtein and Gravano 2000] – Headquarters(Organization, Location)
• Proteus [Grishman et al. 2002] – DiseaseOutbreaks(DiseaseName, Location, Country, Date, …)

Page 18: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Seed Tuples

Headquarters                    DiseaseOutbreaks
Organization  Location          DiseaseName       Location
Microsoft     Redmond           Malaria           Ethiopia
Exxon         Irving            Typhus            Bergen-Belsen
Boeing        Seattle           Flu               The Midwest
IBM           Armonk            Mad Cow Disease   The U.K.
Intel         Santa Clara       Pneumonia         The U.S.

Page 19: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Metrics

• Gold standard: relation Rall, obtained by running the information extraction system over every document in the database Dall
• Recall: % of Rall captured in the approximation extracted from the retrieved documents
• Precision: % of retrieved documents that are “useful” (i.e., produced tuples)
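These two metrics are straightforward to compute once per-document extraction results are in hand; a sketch with illustrative (hypothetical) names:

```python
def recall(r_all, r_retrieved):
    """% of the gold-standard relation recovered from retrieved documents."""
    return 100.0 * len(r_all & r_retrieved) / len(r_all)

def precision(retrieved_docs, tuples_per_doc):
    """% of retrieved documents that produced at least one tuple."""
    useful = [d for d in retrieved_docs if tuples_per_doc.get(d)]
    return 100.0 * len(useful) / len(retrieved_docs)
```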

Page 20: Querying Text Databases for  Efficient Information Extraction

Experimental Evaluation: Relation Statistics

Relation and Extraction System   |Dall|    % Useful   |Rall|
Headquarters: Snowball           135,000   23         24,536
Headquarters: DIPRE              135,000   22         20,952
DiseaseOutbreaks: Proteus        135,000    4          8,859

Page 21: Querying Text Databases for  Efficient Information Extraction

Alternative Query Generation Strategies

• QXtract, with the final configuration from training
• Tuples: keep deriving queries from extracted tuples
  – Problem: “disconnected” databases
• Patterns: derive queries from the extraction patterns of the information extraction system
  – “<ORGANIZATION>, based in <LOCATION>” => “based in”
  – Problems: pattern features are often not suitable for querying, or not visible from a “black-box” extraction system
• Manual: construct queries manually [MUC]
  – Obtained for Proteus from its developers
  – Not available for DIPRE and Snowball

Plus a simple additional baseline: retrieve a random document sample of the appropriate size

Page 22: Querying Text Databases for  Efficient Information Extraction

Recall and Precision: Headquarters Relation; Snowball Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%–25% of |Dall|), for QXtract, Patterns, Tuples, and Baseline.]

Page 23: Querying Text Databases for  Efficient Information Extraction

Recall and Precision: Headquarters Relation; DIPRE Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%–25% of |Dall|), for QXtract, Patterns, Tuples, and Baseline.]

Page 24: Querying Text Databases for  Efficient Information Extraction

Extraction Efficiency and Recall: DiseaseOutbreaks Relation; Proteus Extraction System

[Figure: recall (%) vs. MaxFractionRetrieved for QXtract, Manual, Tuples, and Baseline; running time in days for QXtract over 10% of the documents (1.4 days) vs. a full Scan over 100% (15.5 days).]

60% of the relation extracted from just 10% of the documents of the 135,000 newspaper article database

Page 25: Querying Text Databases for  Efficient Information Extraction


Snowball/Headquarters Queries

Page 26: Querying Text Databases for  Efficient Information Extraction


DIPRE/Headquarters Queries

Page 27: Querying Text Databases for  Efficient Information Extraction


Proteus/DiseaseOutbreaks Queries

Page 28: Querying Text Databases for  Efficient Information Extraction

Current Work: Characterizing Databases for an Extraction Task

Choosing a strategy from database properties:
• If the database is not sparse (useful documents are common), Scan it.
• If it is sparse, check whether it is “connected”: do queries built from extracted tuples lead to documents containing further tuples?
  – Connected: use the Tuples strategy.
  – Not connected: use QXtract.

Page 29: Querying Text Databases for  Efficient Information Extraction

Related Work

• Information extraction: focus on the quality of extracted relations [MUC]; most relevant sub-task: text filtering
  – Filters derived from extraction patterns, or consisting of words (manually created or from supervised learning)
  – Grishman et al.’s manual pattern-based filters for disease outbreaks
  – Related to the Manual and Patterns strategies in our experiments
  – Focus not on querying using a simple search interface
• Information retrieval: focus on relevant documents for queries
  – In our scenario, relevance is determined by the “extraction task” and the associated information extraction system
• Automatic query generation: several efforts for different tasks:
  – Minority language corpora construction [Ghani et al. 2001]
  – Topic-specific document search (e.g., [Cohen & Singer 1996])

Page 30: Querying Text Databases for  Efficient Information Extraction

Contributions: An Unsupervised Query-Based Technique for Efficient Information Extraction

• Adapts to an “arbitrary” underlying information extraction system and document database
• Can work over non-crawlable “Hidden-Web” databases
• Minimal user input required: a handful of example tuples
• Can trade off relation completeness against extraction efficiency
• Particularly interesting in conjunction with unsupervised/bootstrapping-based information extraction systems (e.g., DIPRE, Snowball)

Page 31: Querying Text Databases for  Efficient Information Extraction

Questions?

Page 32: Querying Text Databases for  Efficient Information Extraction

Overflow Slides

Page 33: Querying Text Databases for  Efficient Information Extraction

Related Work (II)

• Focused crawling (e.g., [Chakrabarti et al. 2002]): uses link and page classification to crawl pages on a topic
• Hidden-Web crawling [Raghavan & Garcia-Molina 2001]: retrieves pages from non-crawlable Hidden-Web databases
  – Needs a rich query interface, with distinguishable attributes
  – Related to the Tuples strategy, but “tuples” are derived from pull-down menus, etc., of the search interfaces as found
  – Our goal: retrieve as few documents as possible from one database to extract the relation
• Question-answering systems

Page 34: Querying Text Databases for  Efficient Information Extraction

Related Work (III)

• [Mitchell, Riloff, et al. 1998] use “linguistic phrases” derived from information extraction patterns as features for text categorization
  – Related to the Patterns strategy; requires document parsing, so can’t directly generate simple queries
• [Gaizauskas & Robertson 1997] use 9 manually generated keywords to search for documents relevant to a MUC extraction task

Page 35: Querying Text Databases for  Efficient Information Extraction

Recall and Precision: DiseaseOutbreaks Relation; Proteus Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%–25% of |Dall|), for QXtract, Manual, Manual+QXtract, Tuples, and Baseline.]

Page 36: Querying Text Databases for  Efficient Information Extraction

Running Times

[Figure: running time vs. MaxFractionRetrieved (5%, 10%, 100% of |Dall|) for Proteus (days), Snowball (minutes), and DIPRE (minutes), comparing FullScan, QuickScan, and QXtract, with QXtract time broken down into extraction and training.]

Page 37: Querying Text Databases for  Efficient Information Extraction

Extracting Relations from Text: Snowball [ACM DL’00]

• Exploit redundancy on the web to focus on “easy” instances
• Require only minimal training (a handful of seed tuples)

Bootstrapping loop: Initial Seed Tuples → Occurrences of Seed Tuples → Tag Entities → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table

ORGANIZATION   LOCATION
MICROSOFT      REDMOND
IBM            ARMONK
BOEING         SEATTLE
INTEL          SANTA CLARA
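The bootstrapping loop above can be sketched in miniature. This is a toy string-matching caricature, not Snowball itself (Snowball uses named-entity tags and vector-space pattern similarity); all helper names are hypothetical:

```python
import re

def find_patterns(db, seeds):
    """Generalize each seed occurrence into the text between its attributes."""
    patterns = set()
    for org, loc in seeds:
        for doc in db:
            m = re.search(re.escape(org) + r"(.{1,30}?)" + re.escape(loc), doc)
            if m:
                patterns.add(m.group(1))  # e.g. "'s headquarters in "
    return patterns

def apply_patterns(db, patterns):
    """Match the learned patterns anywhere in the database for new tuples."""
    tuples = set()
    for p in patterns:
        for doc in db:
            tuples.update(re.findall(r"(\w+)" + re.escape(p) + r"(\w+)", doc))
    return tuples

def snowball(db, seed_tuples, iterations=2):
    """Augment the table by alternating pattern and tuple generation."""
    table = set(seed_tuples)
    for _ in range(iterations):
        table |= apply_patterns(db, find_patterns(db, table))
    return table
```

Even this caricature shows why the approach needs only a handful of seed tuples: each iteration's new tuples generate new patterns, which in turn reach new tuples.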

