To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
Panos Ipeirotis – New York University
Eugene Agichtein – Microsoft Research
Pranay Jain – Columbia University
Luis Gravano – Columbia University
2
Text-Centric Task I: Information Extraction
Information extraction applications extract structured relations from unstructured text.
"May 19, 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
Date | Disease Name | Location
Jan. 1995 | Malaria | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia | U.S.
May 1995 | Ebola | Zaire
Information Extraction System (e.g., NYU's Proteus)
Disease Outbreaks in The New York Times
Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan
3
Text-Centric Task II: Metasearching
Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately.
"Friday, June 16, NEW YORK (Forbes) - Starbucks Corp. may be next on the target list of CSPI, a consumer-health group that this week sued the operator of the KFC restaurant chain."
Word | Frequency
Starbucks | 102
consumer | 215
soccer | 1,295
… | …
Content Summary Extractor
Word | Frequency
Starbucks | 103
consumer | 216
soccer | 1,295
… | …
Content Summary of Forbes.com
4
Text-Centric Task III Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed simple classifiers focused crawlers focused queryinghellip)
URL
httpbiologyaboutcom
httpwwwamjbotorg
httpwwwsysbotorg
httpwwwbotanyubcca
Web Page
Classifier
Web Pages about Botany
5
An Abstract View of Text-Centric Tasks
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Task | Token
Information Extraction | Relation Tuple
Database Selection | Word (+Frequency)
Focused Crawling | Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric Task
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Similar to the relational world, there are two major execution paradigms:
- Scan-based: retrieve and process documents sequentially
- Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world:
- Indexes are only "approximate": the index is on keywords, not on the tokens of interest
- The choice of execution plan affects output completeness (not only speed)
→ The underlying data distribution dictates what is best.
7
Execution Plan Characteristics
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Execution plans have two main characteristics: execution time and recall (the fraction of tokens retrieved).
Question: How do we choose the fastest execution plan for reaching a target recall?
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
8
Outline
Description and analysis of crawl- and query-based plans: Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based (Index-based)
9
Scan
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve docs from database
2. Process documents
3. Extract output tokens
Scan retrieves and processes documents sequentially (until reaching target recall).
Execution time = |Retrieved Docs| · (R + P)
R: time for retrieving a document; P: time for processing a document
Question: How many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
10
Estimating Recall of Scan
Modeling Scan for token t: what is the probability of seeing t (with frequency g(t)) after retrieving S documents? This is a "sampling without replacement" process.
After retrieving S documents, the frequency of token t follows a hypergeometric distribution.
Recall for token t is the probability that the frequency of t in the S documents is greater than 0.
[Diagram: sampling without replacement for token t (e.g., <SARS, China>) over documents d1 … dN in database D; the probability of seeing token t after retrieving S documents depends on g(t), the frequency of token t]
11
Estimating Recall of Scan
Modeling Scan: multiple "sampling without replacement" processes, one for each token.
Overall recall is the average recall across tokens.
→ We can compute the number of documents required to reach a target recall.
[Diagram: parallel sampling processes for tokens t1 … tM (e.g., <SARS, China>, <Ebola, Zaire>) over documents d1 … dN in database D]
Execution time = |Retrieved Docs| · (R + P)
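The hypergeometric model above can be computed directly. A minimal sketch (function names are mine, not from the talk): the probability that token t is missed after S of N documents is C(N − g(t), S)/C(N, S), and the expected recall is the average over tokens.

```python
from math import comb

def scan_recall(token_freqs, N, S):
    """Expected recall of Scan after retrieving S of N documents.

    token_freqs: list of g(t), the number of documents containing each token.
    Under sampling without replacement, P(token t missed) =
    C(N - g(t), S) / C(N, S); recall for t is the complement, and the
    overall recall is the average across tokens."""
    total = 0.0
    for g in token_freqs:
        # If fewer than S documents lack the token, it is seen for sure.
        missed = comb(N - g, S) / comb(N, S) if N - g >= S else 0.0
        total += 1.0 - missed
    return total / len(token_freqs)

def docs_for_target_recall(token_freqs, N, target):
    """Smallest S whose expected recall meets the target (linear search)."""
    for S in range(N + 1):
        if scan_recall(token_freqs, N, S) >= target:
            return S
    return N
```

For example, a token appearing in 1 of 2 documents is seen with probability 0.5 after retrieving one document.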
12
Scan and Filtered Scan
[Diagram: Text Database → Classifier → Extraction System → Output Tokens]
1. Retrieve docs from database
2. Filter documents (Classifier)
3. Process documents
4. Extract output tokens
Scan retrieves and processes all documents (until reaching target recall).
Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of NYT is unlikely to describe disease outbreaks).
Execution time = |Retrieved Docs| · (R + F + σ · P)
R: time for retrieving a document; F: time for filtering a document; P: time for processing a document; σ: classifier selectivity (σ ≤ 1), the fraction of retrieved documents that pass the filter and get processed
Question: How many documents does (Filtered) Scan retrieve to reach target recall?
13
Estimating Recall of Filtered Scan
Modeling Filtered Scan: the analysis is similar to Scan. The main difference is that the classifier rejects documents, which:
- Decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity)
- Decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall)
[Diagram: parallel sampling processes for tokens t1 … tM over documents d1 … dN in database D; documents rejected by the classifier decrease the effective database size, and tokens in rejected documents have lower effective token frequency]
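A sketch of the same hypergeometric model with the Filtered Scan adjustments described above. The rounding of σ·N and r·g(t) to integers is my assumption for the sketch; the paper's analysis treats these parameters more carefully.

```python
from math import comb, floor

def filtered_scan_recall(token_freqs, N, S, sigma, r):
    """Expected recall of Filtered Scan: like Scan, but over an effective
    database of sigma*N documents, with each token frequency reduced to
    r*g(t) because tokens in rejected documents are never extracted."""
    N_eff = floor(sigma * N)
    S = min(S, N_eff)  # cannot process more docs than pass the filter
    total = 0.0
    for g in token_freqs:
        g_eff = min(floor(r * g), N_eff)
        if N_eff - g_eff >= S:
            missed = comb(N_eff - g_eff, S) / comb(N_eff, S)
        else:
            missed = 0.0
        total += 1.0 - missed
    return total / len(token_freqs)
```

With sigma = 1 and r = 1 this reduces to plain Scan; with r = 0 no token is ever extracted.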
14
Outline
Description and analysis of crawl- and query-based plans: Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set Expansion
[Diagram: Text Database → Extraction System → Output Tokens, with a Query Generation loop feeding new queries back to the database]
1. Query database with seed tokens (e.g., [Ebola AND Zaire])
2. Process retrieved documents
3. Extract tokens from docs (e.g., <Malaria, Ethiopia>)
4. Augment seed tokens with new tokens
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
R: time for retrieving a document; P: time for processing a document; Q: time for answering a query
Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents:
- Each token (transformed into a keyword query) retrieves documents
- Documents contain tokens
[Diagram: bipartite querying graph with tokens t1 … t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) and documents d1 … d5]
17
Using the Querying Graph for Analysis
We need to compute:
- The number of documents retrieved after sending Q tokens as queries (estimates time)
- The number of tokens that appear in the retrieved documents (estimates recall)
To estimate these, we need to compute:
- The degree distribution of the tokens discovered by retrieving documents
- The degree distribution of the documents retrieved by the tokens
(These are not the same as the degree distributions of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
Elegant analysis framework based on generating functions; details in the paper.
[Diagram: bipartite querying graph with tokens t1 … t5 and documents d1 … d5]
18
Recall Limit: Reachability Graph
In the reachability graph, there is an edge t1 → t2 when t1 retrieves a document d1 that contains t2.
[Diagram: bipartite querying graph with tokens t1 … t5 and documents d1 … d5, and the corresponding reachability graph over the tokens]
The upper recall limit is determined by the size of the biggest connected component of the reachability graph.
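The recall ceiling can be illustrated with a small sketch: the tokens reachable from the seed set by breadth-first search over the reachability graph bound what Iterative Set Expansion can ever discover. The edge list below is hypothetical, echoing the t1 → t2 example above.

```python
from collections import defaultdict, deque

def reachable_tokens(edges, seeds):
    """Tokens reachable in the reachability graph: an edge (a, b) means a
    query for token a retrieves a document containing token b."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(seeds), deque(seeds)
    while queue:
        tok = queue.popleft()
        for nxt in graph[tok]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical graph: t1 -> t2 -> t3, with t4 -> t5 in a separate component.
edges = [("t1", "t2"), ("t2", "t3"), ("t4", "t5")]
print(sorted(reachable_tokens(edges, ["t1"])))  # ['t1', 't2', 't3']
```

Starting from t1, the plan can never discover t4 or t5: its recall is capped at 3 of the 5 tokens, no matter how long it runs.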
19
Automatic Query Generation
Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
Automatic Query Generation avoids this problem by creating queries offline (using machine learning); these queries are designed to return documents with tokens.
20
Automatic Query Generation
[Diagram: Offline Query Generation → Text Database → Extraction System → Output Tokens]
1. Generate queries offline that tend to retrieve documents with tokens
2. Query database
3. Process retrieved documents
4. Extract tokens from docs
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
R: time for retrieving a document; P: time for processing a document; Q: time for answering a query
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs and has precision p(q): p(q)·g(q) useful docs and [1 − p(q)]·g(q) useless docs.
We compute the total number of useful (and useless) documents retrieved.
The analysis is similar to Filtered Scan:
- The effective database size is |D_useful|
- The sample size S is the number of useful documents retrieved
[Diagram: query q splits the text database into p(q)·g(q) useful docs and (1 − p(q))·g(q) useless docs]
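As a minimal sketch of the bookkeeping above (ignoring overlap between query results, which the paper's analysis handles more carefully):

```python
def useful_doc_counts(queries):
    """queries: list of (g, p) pairs, where g = g(q) is the number of
    documents a query retrieves and p = p(q) is its precision.
    Returns the (useful, useless) document totals."""
    useful = sum(p * g for g, p in queries)
    useless = sum((1.0 - p) * g for g, p in queries)
    return useful, useless
```

The useful total plays the role of the sample size S in the Filtered-Scan-style recall analysis, against an effective database of |D_useful| documents.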
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far takes as input a target recall and gives as output the time for each plan to reach that target recall (time = ∞ if a plan cannot reach the target recall).
Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
Next, we show how to estimate degree distributions on the fly.
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families, so we can characterize the distributions with only a few parameters.
Task | Document Distribution | Token Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform
[Plots: log-log histograms with fitted power laws — document degree vs. number of documents (fit y = 43060·x^(−3.3863)) and token degree vs. number of tokens (fit y = 54922·x^(−2.0254))]
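Power-law fits like the ones reported on the slide (e.g., y = 43060·x^(−3.3863)) can be reproduced with an ordinary least-squares fit in log-log space; a minimal sketch:

```python
import math

def fit_power_law(degrees, counts):
    """Fit y = a * x**b by least squares on (log x, log y).
    Returns (a, b); b is the power-law exponent."""
    xs = [math.log(x) for x in degrees]
    ys = [math.log(y) for y in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope of the regression line in log-log space is the exponent b.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b
```

On data generated from y = 5·x^(−2) the fit recovers a = 5 and b = −2 exactly (up to floating-point error).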
25
Parameter Estimation
Naïve solution for parameter estimation: start with a separate "parameter-estimation" phase, perform random sampling on the database, and stop when cross-validation indicates high confidence.
We can do better than this: no separate sampling phase is needed, because sampling is equivalent to executing the task.
→ Piggyback parameter estimation onto execution.
26
On-the-fly Parameter Estimation
- Pick the most promising execution plan for the target recall, assuming "default" parameter values
- Start executing the task
- Update parameter estimates during execution
- Switch plans if the updated statistics indicate so
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).
[Plot: correct (but unknown) distribution vs. the initial default estimate and successively updated estimates]
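The switching logic described above can be sketched as a loop over cost re-estimation. The plan names and the cost-estimate interface here are assumptions for illustration, not the paper's API:

```python
def pick_plan(cost_estimates):
    """Choose the plan with the lowest predicted time to reach the target
    recall (use float('inf') for plans that cannot reach it)."""
    return min(cost_estimates, key=cost_estimates.get)

def run_with_reoptimization(steps, initial_estimates, reestimate):
    """Execute for `steps` batches; after each batch, update the cost
    estimates from observed statistics and switch plans if indicated.
    Returns the sequence of plans used, in order."""
    estimates = dict(initial_estimates)
    plan = pick_plan(estimates)
    history = [plan]
    for step in range(steps):
        estimates = reestimate(plan, step, estimates)
        new_plan = pick_plan(estimates)
        if new_plan != plan:
            plan = new_plan
            history.append(plan)
    return history
```

For example, if the default estimates favor Iterative Set Expansion but the observed statistics reveal Scan is cheaper, the loop switches to Scan after the first batch.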
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines: actual time. Dotted lines: predicted time with correct parameters.
Task: Disease Outbreaks, with the Snowball IE system; 182,531 documents from NYT; 16,921 tokens.
[Plot: execution time (secs, log scale, 100 to 100,000) vs. recall (0.00 to 1.00) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time. Green line: time with the optimizer. (Results are similar in the other experiments; see paper.)
[Plot: execution time (secs, log scale, 100 to 100,000) vs. recall (0.00 to 1.00) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting the execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
An optimization framework that picks, on the fly, the fastest plan for a target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive "next-K" optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomed. Inf., 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | – | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | – | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task: Company Headquarters, with the Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall.
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
The optimizer underestimated the recall of AQG and switched to ISE.
38
Experimental Results (Information Extraction)
[Plot: execution time (secs, log scale, 100 to 100,000) vs. recall (0.00 to 1.00) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated the recall of Filtered Scan, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.
39
Focused Resource Discovery
800,000 web pages
12,000 tokens
2
Text-Centric Task I Information Extraction
Information extraction applications extract structured relations from unstructured text
May 19 1995 Atlanta -- The Centers for Disease Control and Prevention which is in the front line of the worlds response to the deadly Ebola epidemic in Zaire is finding itself hard pressed to cope with the crisishellip
Date Disease Name Location
Jan 1995 Malaria Ethiopia
July 1995 Mad Cow Disease UK
Feb 1995 Pneumonia US
May 1995 Ebola Zaire
Information Extraction System
(eg NYUrsquos Proteus)
Disease Outbreaks in The New York Times
Information Extraction tutorial yesterday by AnHai Doan Raghu Ramakrishnan Shivakumar Vaithyanathan
3
Text-Centric Task II Metasearching
Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately
Friday June 16 NEW YORK (Forbes) - Starbucks Corp may be next on the target list of CSPI a consumer-health group that this week sued the operator of the KFC restaurant chain
Word Frequency
Starbucks 102
consumer 215
soccer 1295
hellip hellip
Content Summary
Extractor
Word Frequency
Starbucks 103
consumer 216
soccer 1295
hellip hellip
Content Summary of Forbescom
4
Text-Centric Task III Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed simple classifiers focused crawlers focused queryinghellip)
URL
httpbiologyaboutcom
httpwwwamjbotorg
httpwwwsysbotorg
httpwwwbotanyubcca
Web Page
Classifier
Web Pages about Botany
5
An Abstract View of Text-Centric Tasks Output Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Task Token
Information Extraction Relation Tuple
Database Selection Word (+Frequency)
Focused Crawling Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric TaskOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve documents from database
Similar to relational world
Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results
Unlike the relational world
Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)
rarrunderlying data distribution dictates what is best
7
Execution Plan CharacteristicsOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)
Question How do we choose the fastest execution plan for reaching
a target recall
ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
3
Text-Centric Task II Metasearching
Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately
Friday June 16 NEW YORK (Forbes) - Starbucks Corp may be next on the target list of CSPI a consumer-health group that this week sued the operator of the KFC restaurant chain
Word Frequency
Starbucks 102
consumer 215
soccer 1295
hellip hellip
Content Summary
Extractor
Word Frequency
Starbucks 103
consumer 216
soccer 1295
hellip hellip
Content Summary of Forbescom
4
Text-Centric Task III Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed simple classifiers focused crawlers focused queryinghellip)
URL
httpbiologyaboutcom
httpwwwamjbotorg
httpwwwsysbotorg
httpwwwbotanyubcca
Web Page
Classifier
Web Pages about Botany
5
An Abstract View of Text-Centric Tasks Output Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Task Token
Information Extraction Relation Tuple
Database Selection Word (+Frequency)
Focused Crawling Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric TaskOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve documents from database
Similar to relational world
Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results
Unlike the relational world
Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)
rarrunderlying data distribution dictates what is best
7
Execution Plan CharacteristicsOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)
Question How do we choose the fastest execution plan for reaching
a target recall
ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered Scan
[Diagram: 1. Retrieve docs from database → 2. Filter documents (classifier) → 3. Process documents → 4. Extract output tokens]
Scan retrieves and processes all documents (until reaching the target recall).
Filtered Scan uses a classifier, with selectivity σ ≤ 1, to identify and process only promising documents (e.g., the Sports section of NYT is unlikely to describe disease outbreaks).
Execution time = |Retrieved Docs| · (R + F + P), where R is the time for retrieving, F the time for filtering, and P the time for processing a document.
Question: How many documents does (Filtered) Scan retrieve to reach the target recall?
13
Estimating Recall of Filtered Scan
Modeling Filtered Scan: the analysis is similar to Scan. The main difference is that the classifier rejects documents, which:
Decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity)
Decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall)
[Diagram: documents rejected by the classifier decrease the effective database size; tokens in rejected documents have a lower effective token frequency]
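The same recall computation then applies after shrinking the database and the token frequency. A sketch (σ and r would come from evaluating the classifier; here they are assumed inputs):

```python
from math import comb

def hyper_recall(g, S, D):
    # P(a token with frequency g appears in S of D docs)
    if g <= 0:
        return 0.0
    if S > D - g:
        return 1.0
    return 1.0 - comb(D - g, S) / comb(D, S)

def filtered_scan_recall(g_t, S, D, sigma, r):
    """Recall of Filtered Scan for one token: the classifier keeps
    sigma * D documents and shrinks the token's frequency to r * g(t)."""
    D_eff = round(sigma * D)       # effective database size
    g_eff = round(r * g_t)         # effective token frequency
    S_eff = min(S, D_eff)          # cannot process more docs than survive
    return hyper_recall(g_eff, S_eff, D_eff)
```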
14
Outline
Description and analysis of crawl- and query-based plans: Scan, Filtered Scan (crawl-based); Iterative Set Expansion, Automatic Query Generation (query-based)
Optimization strategy
Experimental results and conclusions
15
Iterative Set Expansion
[Diagram: 1. Query database with seed tokens (e.g., [Ebola AND Zaire]) → 2. Process retrieved documents → 3. Extract tokens from docs (e.g., <Malaria, Ethiopia>) → 4. Augment seed tokens with new tokens]
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.
Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?
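The Iterative Set Expansion loop itself is short. A sketch with hypothetical `search` and `extract` callables standing in for the database query interface and the extraction system:

```python
def iterative_set_expansion(seeds, search, extract, target_tokens):
    """Query with seed tokens, extract new tokens from the retrieved
    documents, and use those tokens as the next queries.
    `search(token) -> docs` and `extract(doc) -> tokens` are
    placeholders for the text database and the extraction system."""
    tokens, queue = set(seeds), list(seeds)
    seen_docs = set()
    while queue and len(tokens) < target_tokens:
        q = queue.pop(0)                  # next token to send as a query
        for doc in search(q):
            if doc in seen_docs:
                continue                  # never process a document twice
            seen_docs.add(doc)
            for t in extract(doc):
                if t not in tokens:
                    tokens.add(t)
                    queue.append(t)       # augment the seed set
    return tokens
```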
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
[Diagram: bipartite querying graph between tokens t1 … t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) and documents d1 … d5]
17
Using the Querying Graph for Analysis
We need to compute:
The number of documents retrieved after sending Q tokens as queries (estimates time)
The number of tokens that appear in the retrieved documents (estimates recall)
To estimate these, we need to compute:
The degree distribution of the tokens discovered by retrieving documents
The degree distribution of the documents retrieved by the tokens
(These are not the same as the degree distribution of a randomly chosen token or document – it is easier to discover documents and tokens with high degrees.)
Elegant analysis framework based on generating functions – details in the paper.
18
Recall Limit: Reachability Graph
t1 retrieves document d1, which contains t2.
[Diagram: the querying graph (tokens t1 … t5, documents d1 … d5) induces a reachability graph over the tokens]
Upper recall limit: determined by the size of the biggest connected component of the reachability graph.
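The recall ceiling can be checked directly from the querying graph: add an edge t_i → t_j whenever a document retrieved by t_i contains t_j, then count how many tokens are reachable from the seeds. A sketch (the graph encoding is mine):

```python
from collections import defaultdict, deque

def reachable_fraction(seeds, retrieves, contains, all_tokens):
    """Fraction of tokens reachable in the reachability graph,
    where t_i -> t_j if t_i retrieves a doc that contains t_j.
    `retrieves[t]` = docs returned for token t as a query,
    `contains[d]` = tokens appearing in doc d."""
    edges = defaultdict(set)
    for t, docs in retrieves.items():
        for d in docs:
            edges[t].update(contains.get(d, ()))
    seen, queue = set(seeds), deque(seeds)
    while queue:                          # BFS from the seed tokens
        for nxt in edges[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) / len(all_tokens)    # upper bound on ISE recall
```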
19
Automatic Query Generation
Iterative Set Expansion has a recall limitation, due to the iterative nature of its query generation.
Automatic Query Generation avoids this problem by creating queries offline (using machine learning); the queries are designed to return documents that contain tokens.
20
Automatic Query Generation
[Diagram: 1. Generate queries offline that tend to retrieve documents with tokens → 2. Query database → 3. Process retrieved documents → 4. Extract tokens from docs]
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs and has precision p(q): it returns p(q)·g(q) useful docs and (1 − p(q))·g(q) useless docs.
We compute the total number of useful (and useless) documents retrieved.
The analysis is similar to Filtered Scan:
The effective database size is |D_useful|
The sample size S is the number of useful documents retrieved
[Diagram: query q against the text database splits the retrieved documents into p(q)·g(q) useful and (1 − p(q))·g(q) useless docs]
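Summing p(q)·g(q) over the generated queries estimates how many useful documents AQG retrieves, which then plays the role of the sample size S in the Filtered Scan style analysis. A toy sketch (overlap between the result sets of different queries is ignored here for simplicity):

```python
def aqg_useful_docs(queries):
    """queries: list of (precision p(q), docs retrieved g(q)) pairs.
    Returns (useful, useless) document counts, ignoring overlap
    between the result sets of different queries."""
    useful = sum(p * g for p, g in queries)
    useless = sum((1 - p) * g for p, g in queries)
    return useful, useless

# roughly 100 useful and 40 useless docs for these two queries:
u, w = aqg_useful_docs([(0.8, 100), (0.5, 40)])
```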
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far takes as input a target recall, and gives as output the time for each plan to reach that recall (time = ∞ if a plan cannot reach the target recall).
Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families, so we can characterize the distributions with only a few parameters.
Task | Document Distribution | Token Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform
[Log-log plot: Number of Documents vs. Document Degree, with power-law fit y = 43060·x^(−3.3863)]
[Log-log plot: Number of Tokens vs. Token Degree, with power-law fit y = 54922·x^(−2.0254)]
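A power-law family y = a·x^(−b), like the fits shown in the plots, can be estimated from a handful of (degree, count) points by ordinary least squares in log-log space. A minimal sketch:

```python
from math import log, exp

def fit_power_law(points):
    """Fit count = a * degree^(-b) by least squares on
    (log degree, log count) pairs. Returns (a, b)."""
    xs = [log(d) for d, _ in points]
    ys = [log(c) for _, c in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return exp(my - slope * mx), -slope

# exact power-law data recovers the parameters:
a, b = fit_power_law([(d, 5000 * d ** -2.0) for d in (1, 2, 5, 10, 50)])
```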
25
Parameter Estimation
Naïve solution for parameter estimation: start with a separate "parameter-estimation" phase, perform random sampling on the database, and stop when cross-validation indicates high confidence.
We can do better than this:
There is no need for a separate sampling phase, since sampling is equivalent to executing the task.
→ Piggyback parameter estimation onto task execution.
26
On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall, assuming "default" parameter values.
Start executing the task, updating the parameter estimates during execution.
Switch plans if the updated statistics indicate so.
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).
[Plot: the initial default estimate of the (correct but unknown) distribution is successively updated during execution]
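The optimizer itself reduces to a loop: estimate each plan's cost for the target recall under the current parameter estimates, run the cheapest plan for a while, refine the estimates, and reconsider. A schematic sketch (the `optimize` name, the cost functions, and the parameter update are hypothetical stand-ins for the paper's models):

```python
def optimize(plans, target_recall, max_rounds=100):
    """plans: name -> (cost(target, params), step(recall, params)).
    cost estimates the time to reach the target under current params;
    step runs one batch and returns (new_recall, new_params)."""
    params = {"exponent": 2.0}      # "default" starting parameter values
    recall = 0.0
    for _ in range(max_rounds):
        if recall >= target_recall:
            break
        # pick the currently cheapest plan for the target recall
        best = min(plans, key=lambda name: plans[name][0](target_recall, params))
        recall, params = plans[best][1](recall, params)
    return recall
```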
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines: actual time. Dotted lines: predicted time with correct parameters.
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT
16,921 tokens
[Plot: Execution Time (secs, log scale) vs. Recall (0.0–1.0) for Scan, Filtered Scan, Automatic Query Gen., and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time. Green line: time with the optimizer.
(Results are similar in the other experiments – see paper.)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0–1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate the precision and recall of the extraction system into the framework
Create non-parametric optimization (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create an adaptive "next-K" optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomedical Informatics, 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | – | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | – | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE, Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive plan for high target recall.
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
The optimizer underestimated the recall of AQG and switched to ISE.
38
Experimental Results (Information Extraction)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0–1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated the recall of Filtered Scan, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.
39
Focused Resource Discovery
800,000 web pages
12,000 tokens
4
Text-Centric Task III Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed simple classifiers focused crawlers focused queryinghellip)
URL
httpbiologyaboutcom
httpwwwamjbotorg
httpwwwsysbotorg
httpwwwbotanyubcca
Web Page
Classifier
Web Pages about Botany
5
An Abstract View of Text-Centric Tasks Output Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Task Token
Information Extraction Relation Tuple
Database Selection Word (+Frequency)
Focused Crawling Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric TaskOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve documents from database
Similar to relational world
Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results
Unlike the relational world
Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)
rarrunderlying data distribution dictates what is best
7
Execution Plan CharacteristicsOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)
Question How do we choose the fastest execution plan for reaching
a target recall
ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
5
An Abstract View of Text-Centric Tasks Output Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Task Token
Information Extraction Relation Tuple
Database Selection Word (+Frequency)
Focused Crawling Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric TaskOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve documents from database
Similar to relational world
Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results
Unlike the relational world
Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)
rarrunderlying data distribution dictates what is best
7
Execution Plan CharacteristicsOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)
Question How do we choose the fastest execution plan for reaching
a target recall
ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomed. Inf., 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | - | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | - | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
Underestimated recall for AQG; the optimizer switched to ISE
38
Experimental Results (Information Extraction)
[Figure: execution time (secs, logarithmic scale) vs. recall (0.00-1.00) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen, and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated Filtered Scan recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan
39
Focused Resource Discovery
800,000 web pages
12,000 tokens
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall, assuming "default" parameter values
Start executing the task; update parameter estimates during execution
Switch plan if the updated statistics indicate so
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper)
[Figure: the correct (but unknown) distribution vs. the initial default estimate and successively updated estimates]
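The loop above can be sketched as follows. The plan names are the paper's, but the cost formulas and the "skew" parameter are placeholders invented for illustration, not the paper's actual models:

```python
# Sketch of on-the-fly plan selection: pick a plan under default
# parameter values, execute in batches, re-estimate parameters, and
# switch when another plan's estimated time-to-target-recall is lower.
# The cost model and the "skew" parameter are made up for illustration.

def estimate_time(plan, params, target_recall):
    # Placeholder cost model: query-based plans start cheap but
    # degrade quickly as the estimated skew and target recall grow.
    base = {"scan": 100.0, "filtered_scan": 60.0, "ise": 10.0, "aqg": 20.0}
    penalty = {"scan": 1.0, "filtered_scan": 1.5, "ise": 50.0, "aqg": 10.0}
    return base[plan] + penalty[plan] * params["skew"] * target_recall ** 4

def choose_plan(params, target_recall):
    plans = ("scan", "filtered_scan", "ise", "aqg")
    return min(plans, key=lambda p: estimate_time(p, params, target_recall))

params = {"skew": 1.0}              # "default" parameter values
target = 0.9
plan = choose_plan(params, target)  # most promising plan a priori

for _ in range(5):                  # execute the task batch by batch
    # ... retrieve and process a batch of documents with `plan` ...
    params["skew"] += 2.0           # estimate updated from observed docs
    plan = choose_plan(params, target)  # switch if statistics indicate so

print(plan)
```

In this toy run the optimizer starts with a query-based plan and, as the updated estimates make it look more expensive, switches to a crawl-based plan, mirroring the behavior described above.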
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines: actual time; dotted lines: predicted time with correct parameters
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT
16,921 tokens
[Plot: Execution Time (secs), log scale from 100 to 100,000, vs. Recall from 0.00 to 1.00, for Scan, Filtered Scan, Automatic Query Gen., and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time; green line: time with optimizer
(results similar in other experiments – see paper)
[Plot: Execution Time (secs), log scale from 100 to 100,000, vs. Recall from 0.00 to 1.00, for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of the extraction system in the framework
Create non-parametric optimization (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create an adaptive "next-K" optimizer
32
Thank you
Task                          Filtered Scan                             Iterative Set Expansion           Automatic Query Generation
Information Extraction        Grishman et al., J. of Biomed. Inf. 2002  Agichtein and Gravano, ICDE 2003  Agichtein and Gravano, ICDE 2003
Content Summary Construction  -                                         Callan et al., SIGMOD 1999        Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery    Chakrabarti et al., WWW 1999              -                                 Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
Underestimated recall for AQG; switched to ISE
38
Experimental Results (Information Extraction)
[Plot: Execution Time (secs), log scale from 100 to 100,000, vs. Recall from 0.00 to 1.00, for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated Filtered Scan recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan
39
Focused Resource Discovery
Task: Focused Resource Discovery
800,000 web pages
12,000 tokens
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task / Filtered Scan / Iterative Set Expansion / Automatic Query Generation:
Information Extraction: Grishman et al., J. of Biomed. Inf. 2002 / Agichtein and Gravano, ICDE 2003 / Agichtein and Gravano, ICDE 2003
Content Summary Construction: - / Callan et al., SIGMOD 1999 / Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery: Chakrabarti et al., WWW 1999 / - / Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE, Headquarters)
Task: Company Headquarters; Snowball IE system; 182,531 documents from NYT; 16,921 tokens.
35
Experimental Results (Content Summaries)
Content Summary Extraction: 19,997 documents from 20newsgroups; 120,024 tokens.
36
Experimental Results (Content Summaries)
Content Summary Extraction: 19,997 documents from 20newsgroups; 120,024 tokens.
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall.
37
Experimental Results (Content Summaries)
Content Summary Extraction: 19,997 documents from 20newsgroups; 120,024 tokens.
The optimizer underestimated the recall of AQG and switched to ISE.
38
Experimental Results (Information Extraction)
[Chart: execution time (secs, log scale, 100–100,000) vs. recall (0.00–1.00) for Scan, Filt. Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated Filtered Scan's recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.
39
Focused Resource Discovery
800,000 web pages; 12,000 tokens.
12
Scan and Filtered Scan
1. Retrieve documents from the text database. 2. Filter documents with a classifier. 3. Process documents with the extraction system. 4. Extract output tokens.
Scan retrieves and processes all documents (until reaching the target recall).
Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of the NYT is unlikely to describe disease outbreaks).
Execution time = |Retrieved Docs| · (R + F + P), where R is the time to retrieve a document, F the time to filter it, and P the time to process it; the classifier has selectivity σ ≤ 1.
Question: how many documents does (Filtered) Scan retrieve to reach the target recall?
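The two cost formulas can be written down directly. The numbers below are illustrative, and folding σ into the processing term (only documents accepted by the classifier are processed) is an assumption about how the selectivity annotation enters the slide's formula.

```python
# Execution-time formulas from the slide: Scan pays retrieval + processing
# for every document; Filtered Scan also pays filtering, but processes
# only the fraction sigma accepted by the classifier.

def scan_time(n_retrieved, R, P):
    """Scan: every retrieved document is processed."""
    return n_retrieved * (R + P)

def filtered_scan_time(n_retrieved, R, F, P, sigma):
    """Filtered Scan: every retrieved document is filtered; only the
    accepted fraction sigma is processed (assumed interpretation)."""
    return n_retrieved * (R + F + sigma * P)

# With a cheap, selective classifier, Filtered Scan wins when P dominates:
t_scan = scan_time(1000, R=0.1, P=2.0)                             # 2100.0
t_fs = filtered_scan_time(1000, R=0.1, F=0.05, P=2.0, sigma=0.2)   # 550.0
print(t_scan, t_fs)
```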
13
Estimating Recall of Filtered Scan
The analysis is similar to Scan. The main difference is that the classifier rejects documents, which decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity) and decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall).
[Diagram: sampling for tokens t1…tM over documents d1…dN in D. Documents rejected by the classifier decrease the effective database size; tokens in rejected documents have lower effective token frequency.]
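One way to make the "effective database size / effective token frequency" adjustment concrete is a uniform-sampling model: treat Scan as sampling S documents without replacement, so a token with frequency g is missed with hypergeometric probability C(D−g, S)/C(D, S), and apply the σ and r shrinkage for Filtered Scan. This is a sketch under that assumption, not the paper's exact estimator.

```python
# Sampling view of (Filtered) Scan recall.
from math import comb

def p_token_seen(D, g, S):
    """Probability a token occurring in g of D documents appears among
    S documents sampled uniformly without replacement."""
    if S >= D:
        return 1.0 if g > 0 else 0.0
    return 1.0 - comb(D - g, S) / comb(D, S)

def expected_recall(D, freqs, S):
    """Expected fraction of tokens discovered after S sampled documents."""
    return sum(p_token_seen(D, g, S) for g in freqs) / len(freqs)

def filtered_scan_recall(D, freqs, S, sigma, r):
    """Filtered Scan (per the slide): shrink the database to sigma*D and
    each token frequency to r*g, then apply the same sampling model."""
    D_eff = max(1, round(sigma * D))
    eff = [min(D_eff, round(r * g)) for g in freqs]
    return expected_recall(D_eff, eff, min(S, D_eff))
```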
14
Outline
Description and analysis of crawl- and query-based plans: Scan and Filtered Scan (crawl-based); Iterative Set Expansion and Automatic Query Generation (query-based)
Optimization strategy
Experimental results and conclusions
15
Iterative Set Expansion
1. Query the database with seed tokens (e.g., <Malaria, Ethiopia>). 2. Process the retrieved documents. 3. Extract tokens from the documents. 4. Augment the seed tokens with the new tokens via query generation (e.g., [Ebola AND Zaire]).
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process it, and Q the time to answer a query.
Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
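The four steps above can be sketched as a worklist loop over a toy in-memory index; the two dictionaries are hypothetical stand-ins for the search interface and the extraction system.

```python
# Skeleton of Iterative Set Expansion over toy token->documents and
# document->tokens maps (illustrative data, not the real system).

def iterative_set_expansion(seeds, query_index, doc_tokens, max_queries):
    seen_tokens = set(seeds)
    seen_docs = set()
    frontier = list(seeds)
    queries_sent = 0
    while frontier and queries_sent < max_queries:
        token = frontier.pop(0)            # 1. query with a (seed) token
        queries_sent += 1
        for doc in query_index.get(token, []):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)             # 2. retrieve and process document
            for t in doc_tokens[doc]:      # 3. extract tokens
                if t not in seen_tokens:
                    seen_tokens.add(t)     # 4. augment the seed set
                    frontier.append(t)
    return seen_tokens, seen_docs, queries_sent
```

Both cost terms of the formula fall out of the return values: |Retrieved Docs| is `len(seen_docs)` and |Queries| is `queries_sent`.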
16
Querying Graph
The querying graph is a bipartite graph of tokens and documents: each token (transformed into a keyword query) retrieves documents, and documents contain tokens.
[Diagram: tokens t1–t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) linked to documents d1–d5]
17
Using the Querying Graph for Analysis
We need to compute the number of documents retrieved after sending Q tokens as queries (which estimates time) and the number of tokens that appear in the retrieved documents (which estimates recall).
To estimate these, we need the degree distribution of the tokens discovered by retrieving documents, and the degree distribution of the documents retrieved by the tokens. (These are not the same as the degree distributions of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
An elegant analysis framework based on generating functions handles this; details in the paper.
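The parenthetical caveat can be made concrete: a node reached by following a random edge has a size-biased degree distribution, with probability proportional to degree times the plain probability. A minimal sketch:

```python
# Following an edge in the querying graph reaches a node with probability
# proportional to its degree, so the distribution observed while querying
# is the size-biased version of the plain degree distribution.

def size_biased(degree_dist):
    """degree_dist: {degree: probability}.  Returns the degree distribution
    of a node reached by following a uniformly random edge."""
    mean = sum(k * p for k, p in degree_dist.items())
    return {k: k * p / mean for k, p in degree_dist.items()}

plain = {1: 0.9, 10: 0.1}      # most nodes have degree 1...
biased = size_biased(plain)    # ...but edges mostly lead to degree-10 nodes
print(biased)
```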
18
Recall Limit: Reachability Graph
The reachability graph has an edge t1 → t2 when token t1 retrieves a document d1 that contains token t2.
[Diagram: querying graph over tokens t1–t5 and documents d1–d5, and the induced reachability graph over t1–t5]
The upper recall limit is determined by the size of the biggest connected component.
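The recall ceiling can be computed on a toy querying graph by BFS from the seed tokens: everything outside the component reachable from the seeds can never be discovered, however many queries are sent.

```python
# Recall ceiling of Iterative Set Expansion: only tokens reachable from
# the seeds via token -> document -> token hops can ever be discovered.
from collections import deque

def reachable_tokens(seeds, token_to_docs, doc_to_tokens):
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        t = queue.popleft()
        for d in token_to_docs.get(t, []):
            for t2 in doc_to_tokens.get(d, []):
                if t2 not in seen:
                    seen.add(t2)
                    queue.append(t2)
    return seen

# Two components: {t1, t2} and {t4, t5}; seeds in one cannot reach the other.
t2d = {"t1": ["d1"], "t2": ["d2"], "t4": ["d4"]}
d2t = {"d1": ["t1", "t2"], "d2": ["t1", "t2"], "d4": ["t4", "t5"]}
print(sorted(reachable_tokens({"t1"}, t2d, d2t)))
```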
19
Automatic Query Generation
Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents containing tokens.
20
Automatic Query Generation
1. Generate, offline, queries that tend to retrieve documents with tokens. 2. Query the database. 3. Process the retrieved documents. 4. Extract tokens from the documents.
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process it, and Q the time to answer a query.
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) documents with precision p(q): p(q)·g(q) useful documents and [1 − p(q)]·g(q) useless ones.
We compute the total number of useful (and useless) documents retrieved.
The analysis is similar to Filtered Scan: the effective database size is |Duseful|, and the sample size S is the number of useful documents retrieved.
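The useful/useless accounting is simple arithmetic; a sketch with illustrative (g, p) values for two hypothetical queries:

```python
# Each query q contributes p(q)*g(q) useful documents and
# (1 - p(q))*g(q) useless ones; the recall analysis then mirrors
# Filtered Scan with |Duseful| as the effective database size.

def aqg_doc_counts(queries):
    """queries: list of (g, p) pairs = (docs retrieved, precision)."""
    useful = sum(g * p for g, p in queries)
    useless = sum(g * (1.0 - p) for g, p in queries)
    return useful, useless

useful, useless = aqg_doc_counts([(100, 0.8), (50, 0.2)])
print(useful, useless)
```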
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far takes as input a target recall and gives as output the time for each plan to reach that recall (time = infinity if a plan cannot reach the target recall).
Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
Next, we show how to estimate these degree distributions on the fly.
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families, so each distribution can be characterized with only a few parameters.
Task / Document Distribution / Token Distribution:
Information Extraction: power-law / power-law
Content Summary Construction: lognormal / power-law (Zipf)
Focused Resource Discovery: uniform / uniform
[Log-log plots: number of documents vs. document degree, with power-law fit y = 43060·x^(-3.3863); number of tokens vs. token degree, with fit y = 54922·x^(-2.0254)]
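Fits like the ones shown (y = a·x^(−b)) are typically obtained by a least-squares line fit in log-log space, since a power law is linear there. A minimal pure-Python sketch, not the paper's estimation procedure:

```python
# Fit y = a * x^(-b) by linear regression on (log x, log y).
from math import log, exp

def fit_power_law(xs, ys):
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    num = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    den = sum((u - mx) ** 2 for u in lx)
    slope = num / den
    a = exp(my - slope * mx)
    return a, -slope          # y = a * x^slope, so the exponent b = -slope

# Exact power-law data recovers its parameters:
xs = [1, 2, 4, 8, 16]
ys = [1000 * x ** -2.0 for x in xs]
a, b = fit_power_law(xs, ys)
print(a, b)   # recovers a ≈ 1000, b ≈ 2
```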
25
Parameter Estimation
Naïve solution: run a separate "parameter-estimation" phase, performing random sampling on the database and stopping when cross-validation indicates high confidence.
We can do better than this: no separate sampling phase is needed, because sampling is equivalent to executing the task.
→ Piggyback parameter estimation onto task execution.
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
[Figure: execution time (secs, log scale) vs. recall (0.0 to 1.0) for Scan, Filtered Scan, Automatic Query Gen., and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time; green line: time with optimizer
(results similar in other experiments; see paper)
[Figure: execution time (secs, log scale) vs. recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework that picks, on the fly, the fastest plan for a given target recall
31
Future Work
Incorporate the precision and recall of the extraction system into the framework
Create a non-parametric optimizer (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create an adaptive "next-K" optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomedical Informatics, 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | - | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | - | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE: Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
Underestimated recall for AQG; switched to ISE
38
Experimental Results (Information Extraction)
[Figure: execution time (secs, log scale) vs. recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated FS recall, but after FS ran to completion, OPTIMIZED simply switched to Scan
39
Focused Resource Discovery
800,000 web pages
12,000 tokens
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
[Figure: bipartite querying graph linking tokens t1-t5 (⟨SARS, China⟩, ⟨Ebola, Zaire⟩, ⟨Malaria, Ethiopia⟩, ⟨Cholera, Sudan⟩, ⟨H5N1, Vietnam⟩) to documents d1-d5]
17
Using Querying Graph for Analysis
We need to compute:
the number of documents retrieved after sending Q tokens as queries (estimates time)
the number of tokens that appear in the retrieved documents (estimates recall)
To estimate these, we need the degree distribution of the tokens discovered by retrieving documents, and the degree distribution of the documents retrieved by the tokens
(Not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees)
[Figure: bipartite querying graph, as on the previous slide]
Elegant analysis framework based on generating functions (details in the paper)
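The two quantities above can also be read off by simulating the querying process on a small graph. This is only an illustrative sketch; the paper derives them analytically via generating functions.

```python
def simulate_querying(retrieves, contains, seed_tokens, max_queries):
    # After issuing up to max_queries token queries, return how many
    # documents were retrieved and how many distinct tokens discovered.
    known, docs, queried = set(seed_tokens), set(), 0
    frontier = list(seed_tokens)
    while frontier and queried < max_queries:
        t = frontier.pop(0)
        queried += 1
        for d in retrieves.get(t, ()):
            if d not in docs:
                docs.add(d)
                for t2 in contains.get(d, ()):
                    if t2 not in known:
                        known.add(t2)
                        frontier.append(t2)
    return len(docs), len(known)

# Toy graph: querying t1 eventually discovers t2 and t3
n_docs, n_tokens = simulate_querying(
    {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}},
    {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3"}},
    ["t1"], max_queries=10,
)
```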
18
Recall Limit: Reachability Graph
t1 retrieves document d1 that contains t2
[Figure: querying graph over tokens t1-t5 and documents d1-d5, and the induced reachability graph with an edge from t to t' when t retrieves a document containing t']
Upper recall limit determined by the size of the biggest connected component
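A sketch of that recall bound, using union-find over an undirected version of the token-document graph (a simplification of the directed reachability graph on the slide):

```python
from collections import Counter

def max_recall(retrieves, contains, all_tokens):
    # Fraction of tokens inside the largest connected component.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for t, ds in retrieves.items():
        for d in ds:
            union(("t", t), ("d", d))
    for d, ts in contains.items():
        for t in ts:
            union(("d", d), ("t", t))
    sizes = Counter(find(("t", t)) for t in all_tokens)
    return max(sizes.values()) / len(all_tokens)

bound = max_recall(
    {"t1": {"d1"}, "t2": {"d1"}, "t3": {"d2"}},
    {"d1": {"t1", "t2"}, "d2": {"t3"}},
    ["t1", "t2", "t3", "t4"],   # t4 is unreachable
)
```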
19
Automatic Query Generation
Iterative Set Expansion has a recall limitation, due to the iterative nature of its query generation
Automatic Query Generation avoids this problem by creating queries offline (using machine learning), designed to return documents with tokens
20
Automatic Query Generation
1. Generate queries offline that tend to retrieve documents with tokens
2. Query database
3. Process retrieved documents
4. Extract tokens from docs
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query
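The cost formula is direct to compute; a sketch with made-up per-operation times:

```python
def aqg_execution_time(n_docs, n_queries, R, P, Q):
    # time = |Retrieved Docs| * (R + P) + |Queries| * Q, with
    # R = time to retrieve a doc, P = time to process a doc,
    # Q = time to answer a query
    return n_docs * (R + P) + n_queries * Q

# e.g. 1000 retrieved docs, 50 queries, R=1s, P=2s, Q=5s (hypothetical)
t = aqg_execution_time(1000, 50, 1, 2, 5)
```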
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs; the query has precision p(q):
p(q)·g(q) useful docs, [1 − p(q)]·g(q) useless docs
We compute the total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan: the effective database size is |D_useful|, and the sample size S is the number of useful documents retrieved
[Figure: query q splits the text database into p(q)·g(q) useful and (1 − p(q))·g(q) useless documents]
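The useful/useless split is a one-liner per query; a sketch that, for simplicity, ignores overlap between query results:

```python
def split_retrieved(queries):
    # queries: list of (precision p(q), result size g(q)) pairs.
    # Returns (useful docs, useless docs) summed over all queries.
    useful = sum(p * g for p, g in queries)
    useless = sum((1 - p) * g for p, g in queries)
    return useful, useless

useful, useless = split_retrieved([(0.75, 100), (0.5, 40)])
```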
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) documents and has precision p(q):
p(q)·g(q) useful documents, (1 - p(q))·g(q) useless documents
We compute the total number of useful (and useless) documents retrieved
Analysis is similar to Filtered Scan:
The effective database size is |D_useful|
The sample size S is the number of useful documents retrieved
[Diagram: query q retrieves p(q)·g(q) useful and (1 - p(q))·g(q) useless documents from the text database]
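As a back-of-the-envelope sketch of this accounting (function name and inputs are hypothetical, not from the paper), the expected useful and useless documents can be accumulated query by query:

```python
def docs_retrieved(queries):
    """Accumulate expected useful/useless documents over a query sequence.

    Each query is a (g, p) pair: g = number of documents retrieved,
    p = precision (fraction of retrieved documents that are useful).
    Illustrative sketch of the per-query accounting on the slide.
    """
    useful = useless = 0.0
    for g, p in queries:
        useful += p * g          # p(q)·g(q) useful documents
        useless += (1 - p) * g   # (1 - p(q))·g(q) useless documents
    return useful, useless

# Example: three queries with varying reach and precision
u, w = docs_retrieved([(100, 0.5), (80, 0.25), (64, 0.125)])
print(u, w)  # 78.0 useful, 166.0 useless
```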
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far:
Takes as input a target recall
Gives as output the time for each plan to reach the target recall
(time = infinity if a plan cannot reach the target recall)
Time and recall depend on task-specific properties of the database:
Token degree distribution
Document degree distribution
Next, we show how to estimate the degree distributions on the fly
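A minimal sketch of that input/output contract (all names and the linear cost model are hypothetical): each plan reports an estimated time to reach the target recall, with infinity marking an unreachable target, and the optimizer takes the minimum:

```python
import math

def plan_time(target_recall, max_recall, docs_needed_fn, time_per_doc):
    """Estimated execution time for one plan to reach target_recall.

    max_recall: the highest recall this plan can ever reach.
    docs_needed_fn: maps a recall level to the number of documents the
    plan must process (this is where the token/document degree
    distributions enter). Returns math.inf when the target is unreachable.
    """
    if target_recall > max_recall:
        return math.inf
    return docs_needed_fn(target_recall) * time_per_doc

# Toy comparison at 60% target recall: a query-based plan that tops out
# at 50% recall loses to a full scan, despite being far cheaper per recall point.
scan = plan_time(0.6, 1.0, lambda r: 100_000 * r, time_per_doc=1.0)
query = plan_time(0.6, 0.5, lambda r: 2_000 * r, time_per_doc=1.0)
best = min(scan, query)
```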
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters

Task | Document Distribution | Token Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform
[Plot: Number of Documents vs. Document Degree (log-log), power-law fit y = 43060·x^(-3.3863)]
[Plot: Number of Tokens vs. Token Degree (log-log), power-law fit y = 54922·x^(-2.0254)]
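For the power-law rows above, one simple illustrative estimator (not necessarily the paper's fitting procedure) recovers the exponent by least squares on the log-log degree histogram:

```python
import math

def fit_power_law(degrees, counts):
    """Fit count ≈ C * degree**(-alpha) by least squares in log-log space.

    Returns (C, alpha). A simple illustrative estimator: taking logs turns
    the power law into a straight line, whose slope is -alpha.
    """
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

# Synthetic data drawn exactly from y = 1000 * x**(-2)
degrees = [1, 2, 4, 8, 16]
counts = [1000 * d ** -2 for d in degrees]
C, alpha = fit_power_law(degrees, counts)
```

Only two numbers (C and alpha) then characterize the whole distribution, which is what makes on-the-fly estimation feasible.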
25
Parameter Estimation
Naïve solution for parameter estimation:
Start with a separate "parameter-estimation" phase
Perform random sampling on the database
Stop when cross-validation indicates high confidence
We can do better than this:
No need for a separate sampling phase; sampling is equivalent to executing the task
→ Piggyback parameter estimation onto execution
26
On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall, assuming "default" parameter values
Start executing the task; update the parameter estimates during execution
Switch plans if the updated statistics indicate so
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper)
[Figure: the correct (but unknown) distribution vs. the initial default estimate and successively updated estimates]
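The switching logic above can be sketched as a control loop (all names are hypothetical; the real parameter updates are plan-specific, since only Scan behaves like random sampling):

```python
import math

def run_with_reoptimization(plans, target_recall, default_params, batch=100):
    """Execute the cheapest plan, re-estimating parameters as documents arrive.

    plans: dict name -> (cost_fn, execute_batch_fn), where
    cost_fn(params, target_recall) returns the estimated time to reach the
    target (math.inf if unreachable), and execute_batch_fn(batch) processes
    a batch of documents, returning (recall_so_far, parameter_updates).
    Returns the name of the plan that was running when the target was met.
    """
    params = dict(default_params)
    recall = 0.0
    while recall < target_recall:
        # Re-pick the most promising plan under the current estimates
        name = min(plans, key=lambda p: plans[p][0](params, target_recall))
        recall, updates = plans[name][1](batch)  # process a batch of documents
        params.update(updates)                   # refine distribution estimates
    return name
```

Because the cheapest plan is re-picked after every batch, a plan that looked promising under the default parameters is abandoned as soon as the refined estimates say otherwise.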
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines: actual time; dotted lines: predicted time with correct parameters
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT, 16,921 tokens
[Plot: Execution Time (secs, log scale) vs. Recall (0.0 to 1.0) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time; green line: time with the optimizer
(results are similar in the other experiments; see paper)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting the execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework that picks, on the fly, the fastest plan for a target recall
31
Future Work
Incorporate the precision and recall of the extraction system into the framework
Create non-parametric optimizers (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create an adaptive "next-K" optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomed. Inf., 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | - | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | - | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT, 16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups, 120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups, 120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups, 120,024 tokens
Underestimated recall for AQG; switched to ISE
38
Experimental Results (Information Extraction)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated FS recall, but after FS ran to completion, OPTIMIZED simply switched to Scan
39
Focused Resource Discovery
800,000 web pages, 12,000 tokens
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
Underestimated recall for AQG; the optimizer switched to ISE
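The switching behavior above, where the optimizer revises a plan's parameter estimates during execution and abandons it for another plan, can be sketched with a toy simulation. Everything here is hypothetical (the per-step costs, the recall gains, and the crude estimate-blending rule); it only illustrates the on-the-fly re-estimation idea, not the paper's estimators:

```python
class SimPlan:
    """Toy plan: each execution step costs `cost` and adds `true_gain`
    recall. The optimizer only sees `est_gain`, its current estimate,
    which it refines after every step it runs (all values hypothetical)."""
    def __init__(self, name, cost, true_gain, est_gain):
        self.name, self.cost = name, cost
        self.true_gain, self.est_gain = true_gain, est_gain

    def est_time(self, remaining_recall):
        return self.cost * remaining_recall / self.est_gain

    def step(self):
        # On-the-fly estimation: blend the estimate toward the observed gain.
        self.est_gain += 0.5 * (self.true_gain - self.est_gain)
        return self.true_gain

def run_adaptive(plans, target):
    """Repeatedly run the plan that looks cheapest for the *remaining*
    recall, updating estimates as results come in."""
    recall, total_cost, trace = 0.0, 0.0, []
    while recall < target:
        plan = min(plans, key=lambda p: p.est_time(target - recall))
        trace.append(plan.name)
        recall += plan.step()
        total_cost += plan.cost
    return recall, total_cost, trace

# AQG starts with an optimistic recall-gain estimate, so it is chosen
# first; once the estimate is corrected, the optimizer switches to ISE.
aqg = SimPlan("AQG", cost=1.0, true_gain=0.02, est_gain=0.10)
ise = SimPlan("ISE", cost=2.0, true_gain=0.05, est_gain=0.05)
recall, cost, trace = run_adaptive([aqg, ise], target=0.5)
print(trace[0], trace[-1])  # starts with AQG, finishes with ISE
```

As in the content-summary experiment, the initially attractive plan loses its advantage once its estimated parameters are corrected mid-execution, and the optimizer moves to the plan that is cheaper for the remaining recall.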
38
Experimental Results (Information Extraction)
[Figure: execution time (secs, log scale) vs. recall for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
OPTIMIZED is faster than the "best" plan: the optimizer overestimated FS recall, but after FS ran to completion it simply switched to Scan
39
Focused Resource Discovery
800,000 web pages
12,000 tokens