To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks
Panos Ipeirotis – New York University
Eugene Agichtein – Microsoft Research
Pranay Jain – Columbia University
Luis Gravano – Columbia University
2
Text-Centric Task I: Information Extraction
Information extraction applications extract structured relations from unstructured text.
"May 19, 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
Date | Disease Name | Location
Jan. 1995 | Malaria | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia | U.S.
May 1995 | Ebola | Zaire
Information Extraction System (e.g., NYU's Proteus)
Disease Outbreaks in The New York Times
Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan
3
Text-Centric Task II: Metasearching
Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately.
"Friday, June 16, NEW YORK (Forbes) - Starbucks Corp. may be next on the target list of CSPI, a consumer-health group that this week sued the operator of the KFC restaurant chain."
Word | Frequency
Starbucks | 102
consumer | 215
soccer | 1,295
… | …
Content Summary Extractor
Word | Frequency
Starbucks | 103
consumer | 216
soccer | 1,295
… | …
Content Summary of Forbes.com
4
Text-Centric Task III Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed simple classifiers focused crawlers focused queryinghellip)
URL
httpbiologyaboutcom
httpwwwamjbotorg
httpwwwsysbotorg
httpwwwbotanyubcca
Web Page
Classifier
Web Pages about Botany
5
An Abstract View of Text-Centric Tasks
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Task | Token
Information Extraction | Relation Tuple
Database Selection | Word (+Frequency)
Focused Crawling | Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric Task
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Similar to the relational world, there are two major execution paradigms:
- Scan-based: retrieve and process documents sequentially
- Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world:
- Indexes are only "approximate": the index is on keywords, not on the tokens of interest
- The choice of execution plan affects output completeness (not only speed)
→ The underlying data distribution dictates what is best.
7
Execution Plan Characteristics
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve documents from database
2. Process documents
3. Extract output tokens
Execution plans have two main characteristics: execution time and recall (the fraction of tokens retrieved).
Question: How do we choose the fastest execution plan for reaching a target recall?
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
8
Outline
Description and analysis of crawl- and query-based plans: Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based (Index-based)
9
Scan
[Diagram: Text Database → Extraction System → Output Tokens]
1. Retrieve docs from database
2. Process documents
3. Extract output tokens
Scan retrieves and processes documents sequentially (until reaching target recall).
Execution time = |Retrieved Docs| · (R + P)
R: time for retrieving a document; P: time for processing a document
Question: How many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
10
Estimating Recall of Scan
Modeling Scan for token t: what is the probability of seeing t (with frequency g(t)) after retrieving S documents? This is a "sampling without replacement" process.
After retrieving S documents, the frequency of token t follows a hypergeometric distribution.
Recall for token t is the probability that the frequency of t in the S documents is greater than 0.
[Diagram: sampling without replacement for token t (e.g., <SARS, China>) over documents d1 … dN in database D; the probability of seeing token t after retrieving S documents depends on g(t), the frequency of token t]
11
Estimating Recall of Scan
Modeling Scan: multiple "sampling without replacement" processes, one for each token.
Overall recall is the average recall across tokens.
→ We can compute the number of documents required to reach a target recall.
[Diagram: parallel sampling processes for tokens t1 … tM (e.g., <SARS, China>, <Ebola, Zaire>) over documents d1 … dN in database D]
Execution time = |Retrieved Docs| · (R + P)
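The hypergeometric model above can be computed directly. A minimal sketch (function names are mine, not from the talk): the probability that token t is missed after S of N documents is C(N − g(t), S)/C(N, S), and the expected recall is the average over tokens.

```python
from math import comb

def scan_recall(token_freqs, N, S):
    """Expected recall of Scan after retrieving S of N documents.

    token_freqs: list of g(t), the number of documents containing each token.
    Under sampling without replacement, P(token t missed) =
    C(N - g(t), S) / C(N, S); recall for t is the complement, and the
    overall recall is the average across tokens."""
    total = 0.0
    for g in token_freqs:
        # If fewer than S documents lack the token, it is seen for sure.
        missed = comb(N - g, S) / comb(N, S) if N - g >= S else 0.0
        total += 1.0 - missed
    return total / len(token_freqs)

def docs_for_target_recall(token_freqs, N, target):
    """Smallest S whose expected recall meets the target (linear search)."""
    for S in range(N + 1):
        if scan_recall(token_freqs, N, S) >= target:
            return S
    return N
```

For example, a token appearing in 1 of 2 documents is seen with probability 0.5 after retrieving one document.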
12
Scan and Filtered Scan
[Diagram: Text Database → Classifier → Extraction System → Output Tokens]
1. Retrieve docs from database
2. Filter documents (Classifier)
3. Process documents
4. Extract output tokens
Scan retrieves and processes all documents (until reaching target recall).
Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of NYT is unlikely to describe disease outbreaks).
Execution time = |Retrieved Docs| · (R + F + σ · P)
R: time for retrieving a document; F: time for filtering a document; P: time for processing a document; σ: classifier selectivity (σ ≤ 1), the fraction of retrieved documents that pass the filter and get processed
Question: How many documents does (Filtered) Scan retrieve to reach target recall?
13
Estimating Recall of Filtered Scan
Modeling Filtered Scan: the analysis is similar to Scan. The main difference is that the classifier rejects documents, which:
- Decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity)
- Decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall)
[Diagram: parallel sampling processes for tokens t1 … tM over documents d1 … dN in database D; documents rejected by the classifier decrease the effective database size, and tokens in rejected documents have lower effective token frequency]
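A sketch of the same hypergeometric model with the Filtered Scan adjustments described above. The rounding of σ·N and r·g(t) to integers is my assumption for the sketch; the paper's analysis treats these parameters more carefully.

```python
from math import comb, floor

def filtered_scan_recall(token_freqs, N, S, sigma, r):
    """Expected recall of Filtered Scan: like Scan, but over an effective
    database of sigma*N documents, with each token frequency reduced to
    r*g(t) because tokens in rejected documents are never extracted."""
    N_eff = floor(sigma * N)
    S = min(S, N_eff)  # cannot process more docs than pass the filter
    total = 0.0
    for g in token_freqs:
        g_eff = min(floor(r * g), N_eff)
        if N_eff - g_eff >= S:
            missed = comb(N_eff - g_eff, S) / comb(N_eff, S)
        else:
            missed = 0.0
        total += 1.0 - missed
    return total / len(token_freqs)
```

With sigma = 1 and r = 1 this reduces to plain Scan; with r = 0 no token is ever extracted.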
14
Outline
Description and analysis of crawl- and query-based plans: Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set Expansion
[Diagram: Text Database → Extraction System → Output Tokens, with a Query Generation loop feeding new queries back to the database]
1. Query database with seed tokens (e.g., [Ebola AND Zaire])
2. Process retrieved documents
3. Extract tokens from docs (e.g., <Malaria, Ethiopia>)
4. Augment seed tokens with new tokens
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
R: time for retrieving a document; P: time for processing a document; Q: time for answering a query
Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents:
- Each token (transformed into a keyword query) retrieves documents
- Documents contain tokens
[Diagram: bipartite querying graph with tokens t1 … t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) and documents d1 … d5]
17
Using the Querying Graph for Analysis
We need to compute:
- The number of documents retrieved after sending Q tokens as queries (estimates time)
- The number of tokens that appear in the retrieved documents (estimates recall)
To estimate these, we need to compute:
- The degree distribution of the tokens discovered by retrieving documents
- The degree distribution of the documents retrieved by the tokens
(These are not the same as the degree distributions of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
Elegant analysis framework based on generating functions; details in the paper.
[Diagram: bipartite querying graph with tokens t1 … t5 and documents d1 … d5]
18
Recall Limit: Reachability Graph
In the reachability graph, there is an edge t1 → t2 when t1 retrieves a document d1 that contains t2.
[Diagram: bipartite querying graph with tokens t1 … t5 and documents d1 … d5, and the corresponding reachability graph over the tokens]
The upper recall limit is determined by the size of the biggest connected component of the reachability graph.
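The recall ceiling can be illustrated with a small sketch: the tokens reachable from the seed set by breadth-first search over the reachability graph bound what Iterative Set Expansion can ever discover. The edge list below is hypothetical, echoing the t1 → t2 example above.

```python
from collections import defaultdict, deque

def reachable_tokens(edges, seeds):
    """Tokens reachable in the reachability graph: an edge (a, b) means a
    query for token a retrieves a document containing token b."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(seeds), deque(seeds)
    while queue:
        tok = queue.popleft()
        for nxt in graph[tok]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical graph: t1 -> t2 -> t3, with t4 -> t5 in a separate component.
edges = [("t1", "t2"), ("t2", "t3"), ("t4", "t5")]
print(sorted(reachable_tokens(edges, ["t1"])))  # ['t1', 't2', 't3']
```

Starting from t1, the plan can never discover t4 or t5: its recall is capped at 3 of the 5 tokens, no matter how long it runs.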
19
Automatic Query Generation
Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
Automatic Query Generation avoids this problem by creating queries offline (using machine learning); these queries are designed to return documents with tokens.
20
Automatic Query Generation
[Diagram: Offline Query Generation → Text Database → Extraction System → Output Tokens]
1. Generate queries offline that tend to retrieve documents with tokens
2. Query database
3. Process retrieved documents
4. Extract tokens from docs
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
R: time for retrieving a document; P: time for processing a document; Q: time for answering a query
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs and has precision p(q): p(q)·g(q) useful docs and [1 − p(q)]·g(q) useless docs.
We compute the total number of useful (and useless) documents retrieved.
The analysis is similar to Filtered Scan:
- The effective database size is |D_useful|
- The sample size S is the number of useful documents retrieved
[Diagram: query q splits the text database into p(q)·g(q) useful docs and (1 − p(q))·g(q) useless docs]
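As a minimal sketch of the bookkeeping above (ignoring overlap between query results, which the paper's analysis handles more carefully):

```python
def useful_doc_counts(queries):
    """queries: list of (g, p) pairs, where g = g(q) is the number of
    documents a query retrieves and p = p(q) is its precision.
    Returns the (useful, useless) document totals."""
    useful = sum(p * g for g, p in queries)
    useless = sum((1.0 - p) * g for g, p in queries)
    return useful, useless
```

The useful total plays the role of the sample size S in the Filtered-Scan-style recall analysis, against an effective database of |D_useful| documents.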
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far takes as input a target recall and gives as output the time for each plan to reach that target recall (time = ∞ if a plan cannot reach the target recall).
Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
Next, we show how to estimate degree distributions on the fly.
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families, so we can characterize the distributions with only a few parameters.
Task | Document Distribution | Token Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform
[Plots: log-log histograms with fitted power laws — document degree vs. number of documents (fit y = 43060·x^(−3.3863)) and token degree vs. number of tokens (fit y = 54922·x^(−2.0254))]
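Power-law fits like the ones reported on the slide (e.g., y = 43060·x^(−3.3863)) can be reproduced with an ordinary least-squares fit in log-log space; a minimal sketch:

```python
import math

def fit_power_law(degrees, counts):
    """Fit y = a * x**b by least squares on (log x, log y).
    Returns (a, b); b is the power-law exponent."""
    xs = [math.log(x) for x in degrees]
    ys = [math.log(y) for y in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope of the regression line in log-log space is the exponent b.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b
```

On data generated from y = 5·x^(−2) the fit recovers a = 5 and b = −2 exactly (up to floating-point error).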
25
Parameter Estimation
Naïve solution for parameter estimation: start with a separate "parameter-estimation" phase, perform random sampling on the database, and stop when cross-validation indicates high confidence.
We can do better than this: no separate sampling phase is needed, because sampling is equivalent to executing the task.
→ Piggyback parameter estimation onto execution.
26
On-the-fly Parameter Estimation
- Pick the most promising execution plan for the target recall, assuming "default" parameter values
- Start executing the task
- Update parameter estimates during execution
- Switch plans if the updated statistics indicate so
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).
[Plot: correct (but unknown) distribution vs. the initial default estimate and successively updated estimates]
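The switching logic described above can be sketched as a loop over cost re-estimation. The plan names and the cost-estimate interface here are assumptions for illustration, not the paper's API:

```python
def pick_plan(cost_estimates):
    """Choose the plan with the lowest predicted time to reach the target
    recall (use float('inf') for plans that cannot reach it)."""
    return min(cost_estimates, key=cost_estimates.get)

def run_with_reoptimization(steps, initial_estimates, reestimate):
    """Execute for `steps` batches; after each batch, update the cost
    estimates from observed statistics and switch plans if indicated.
    Returns the sequence of plans used, in order."""
    estimates = dict(initial_estimates)
    plan = pick_plan(estimates)
    history = [plan]
    for step in range(steps):
        estimates = reestimate(plan, step, estimates)
        new_plan = pick_plan(estimates)
        if new_plan != plan:
            plan = new_plan
            history.append(plan)
    return history
```

For example, if the default estimates favor Iterative Set Expansion but the observed statistics reveal Scan is cheaper, the loop switches to Scan after the first batch.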
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines: actual time. Dotted lines: predicted time with correct parameters.
Task: Disease Outbreaks, with the Snowball IE system; 182,531 documents from NYT; 16,921 tokens.
[Plot: execution time (secs, log scale, 100 to 100,000) vs. recall (0.00 to 1.00) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time. Green line: time with the optimizer. (Results are similar in the other experiments; see paper.)
[Plot: execution time (secs, log scale, 100 to 100,000) vs. recall (0.00 to 1.00) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting the execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
An optimization framework that picks, on the fly, the fastest plan for a target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive "next-K" optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomed. Inf., 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | – | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | – | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task: Company Headquarters, with the Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall.
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
The optimizer underestimated the recall of AQG and switched to ISE.
38
Experimental Results (Information Extraction)
[Plot: execution time (secs, log scale, 100 to 100,000) vs. recall (0.00 to 1.00) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated the recall of Filtered Scan, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.
39
Focused Resource Discovery
800,000 web pages
12,000 tokens
2
Text-Centric Task I Information Extraction
Information extraction applications extract structured relations from unstructured text
May 19 1995 Atlanta -- The Centers for Disease Control and Prevention which is in the front line of the worlds response to the deadly Ebola epidemic in Zaire is finding itself hard pressed to cope with the crisishellip
Date Disease Name Location
Jan 1995 Malaria Ethiopia
July 1995 Mad Cow Disease UK
Feb 1995 Pneumonia US
May 1995 Ebola Zaire
Information Extraction System
(eg NYUrsquos Proteus)
Disease Outbreaks in The New York Times
Information Extraction tutorial yesterday by AnHai Doan Raghu Ramakrishnan Shivakumar Vaithyanathan
3
Text-Centric Task II Metasearching
Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately
Friday June 16 NEW YORK (Forbes) - Starbucks Corp may be next on the target list of CSPI a consumer-health group that this week sued the operator of the KFC restaurant chain
Word Frequency
Starbucks 102
consumer 215
soccer 1295
hellip hellip
Content Summary
Extractor
Word Frequency
Starbucks 103
consumer 216
soccer 1295
hellip hellip
Content Summary of Forbescom
4
Text-Centric Task III Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed simple classifiers focused crawlers focused queryinghellip)
URL
httpbiologyaboutcom
httpwwwamjbotorg
httpwwwsysbotorg
httpwwwbotanyubcca
Web Page
Classifier
Web Pages about Botany
5
An Abstract View of Text-Centric Tasks Output Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Task Token
Information Extraction Relation Tuple
Database Selection Word (+Frequency)
Focused Crawling Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric TaskOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve documents from database
Similar to relational world
Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results
Unlike the relational world
Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)
rarrunderlying data distribution dictates what is best
7
Execution Plan CharacteristicsOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)
Question How do we choose the fastest execution plan for reaching
a target recall
ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
3
Text-Centric Task II Metasearching
Metasearchers create content summaries of databases (words + frequencies) to direct queries appropriately
Friday June 16 NEW YORK (Forbes) - Starbucks Corp may be next on the target list of CSPI a consumer-health group that this week sued the operator of the KFC restaurant chain
Word Frequency
Starbucks 102
consumer 215
soccer 1295
hellip hellip
Content Summary
Extractor
Word Frequency
Starbucks 103
consumer 216
soccer 1295
hellip hellip
Content Summary of Forbescom
4
Text-Centric Task III Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed simple classifiers focused crawlers focused queryinghellip)
URL
httpbiologyaboutcom
httpwwwamjbotorg
httpwwwsysbotorg
httpwwwbotanyubcca
Web Page
Classifier
Web Pages about Botany
5
An Abstract View of Text-Centric Tasks Output Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Task Token
Information Extraction Relation Tuple
Database Selection Word (+Frequency)
Focused Crawling Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric TaskOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve documents from database
Similar to relational world
Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results
Unlike the relational world
Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)
rarrunderlying data distribution dictates what is best
7
Execution Plan CharacteristicsOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)
Question How do we choose the fastest execution plan for reaching
a target recall
ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered Scan
[Diagram: 1. Retrieve docs from database → 2. Filter documents (classifier) → 3. Process documents → 4. Extract output tokens]
Scan retrieves and processes all documents (until reaching the target recall).
Filtered Scan uses a classifier, with selectivity σ ≤ 1, to identify and process only promising documents (e.g., the Sports section of NYT is unlikely to describe disease outbreaks).
Execution time = |Retrieved Docs| · (R + F + P), where R is the time for retrieving, F the time for filtering, and P the time for processing a document.
Question: How many documents does (Filtered) Scan retrieve to reach the target recall?
13
Estimating Recall of Filtered Scan
Modeling Filtered Scan: the analysis is similar to Scan. The main difference is that the classifier rejects documents, which:
Decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity)
Decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall)
[Diagram: documents rejected by the classifier decrease the effective database size; tokens in rejected documents have a lower effective token frequency]
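The same recall computation then applies after shrinking the database and the token frequency. A sketch (σ and r would come from evaluating the classifier; here they are assumed inputs):

```python
from math import comb

def hyper_recall(g, S, D):
    # P(a token with frequency g appears in S of D docs)
    if g <= 0:
        return 0.0
    if S > D - g:
        return 1.0
    return 1.0 - comb(D - g, S) / comb(D, S)

def filtered_scan_recall(g_t, S, D, sigma, r):
    """Recall of Filtered Scan for one token: the classifier keeps
    sigma * D documents and shrinks the token's frequency to r * g(t)."""
    D_eff = round(sigma * D)       # effective database size
    g_eff = round(r * g_t)         # effective token frequency
    S_eff = min(S, D_eff)          # cannot process more docs than survive
    return hyper_recall(g_eff, S_eff, D_eff)
```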
14
Outline
Description and analysis of crawl- and query-based plans: Scan, Filtered Scan (crawl-based); Iterative Set Expansion, Automatic Query Generation (query-based)
Optimization strategy
Experimental results and conclusions
15
Iterative Set Expansion
[Diagram: 1. Query database with seed tokens (e.g., [Ebola AND Zaire]) → 2. Process retrieved documents → 3. Extract tokens from docs (e.g., <Malaria, Ethiopia>) → 4. Augment seed tokens with new tokens]
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.
Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?
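The Iterative Set Expansion loop itself is short. A sketch with hypothetical `search` and `extract` callables standing in for the database query interface and the extraction system:

```python
def iterative_set_expansion(seeds, search, extract, target_tokens):
    """Query with seed tokens, extract new tokens from the retrieved
    documents, and use those tokens as the next queries.
    `search(token) -> docs` and `extract(doc) -> tokens` are
    placeholders for the text database and the extraction system."""
    tokens, queue = set(seeds), list(seeds)
    seen_docs = set()
    while queue and len(tokens) < target_tokens:
        q = queue.pop(0)                  # next token to send as a query
        for doc in search(q):
            if doc in seen_docs:
                continue                  # never process a document twice
            seen_docs.add(doc)
            for t in extract(doc):
                if t not in tokens:
                    tokens.add(t)
                    queue.append(t)       # augment the seed set
    return tokens
```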
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
[Diagram: bipartite querying graph between tokens t1 … t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) and documents d1 … d5]
17
Using the Querying Graph for Analysis
We need to compute:
The number of documents retrieved after sending Q tokens as queries (estimates time)
The number of tokens that appear in the retrieved documents (estimates recall)
To estimate these, we need to compute:
The degree distribution of the tokens discovered by retrieving documents
The degree distribution of the documents retrieved by the tokens
(These are not the same as the degree distribution of a randomly chosen token or document – it is easier to discover documents and tokens with high degrees.)
Elegant analysis framework based on generating functions – details in the paper.
18
Recall Limit: Reachability Graph
t1 retrieves document d1, which contains t2.
[Diagram: the querying graph (tokens t1 … t5, documents d1 … d5) induces a reachability graph over the tokens]
Upper recall limit: determined by the size of the biggest connected component of the reachability graph.
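The recall ceiling can be checked directly from the querying graph: add an edge t_i → t_j whenever a document retrieved by t_i contains t_j, then count how many tokens are reachable from the seeds. A sketch (the graph encoding is mine):

```python
from collections import defaultdict, deque

def reachable_fraction(seeds, retrieves, contains, all_tokens):
    """Fraction of tokens reachable in the reachability graph,
    where t_i -> t_j if t_i retrieves a doc that contains t_j.
    `retrieves[t]` = docs returned for token t as a query,
    `contains[d]` = tokens appearing in doc d."""
    edges = defaultdict(set)
    for t, docs in retrieves.items():
        for d in docs:
            edges[t].update(contains.get(d, ()))
    seen, queue = set(seeds), deque(seeds)
    while queue:                          # BFS from the seed tokens
        for nxt in edges[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) / len(all_tokens)    # upper bound on ISE recall
```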
19
Automatic Query Generation
Iterative Set Expansion has a recall limitation, due to the iterative nature of its query generation.
Automatic Query Generation avoids this problem by creating queries offline (using machine learning); the queries are designed to return documents that contain tokens.
20
Automatic Query Generation
[Diagram: 1. Generate queries offline that tend to retrieve documents with tokens → 2. Query database → 3. Process retrieved documents → 4. Extract tokens from docs]
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs and has precision p(q): it returns p(q)·g(q) useful docs and (1 − p(q))·g(q) useless docs.
We compute the total number of useful (and useless) documents retrieved.
The analysis is similar to Filtered Scan:
The effective database size is |D_useful|
The sample size S is the number of useful documents retrieved
[Diagram: query q against the text database splits the retrieved documents into p(q)·g(q) useful and (1 − p(q))·g(q) useless docs]
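Summing p(q)·g(q) over the generated queries estimates how many useful documents AQG retrieves, which then plays the role of the sample size S in the Filtered Scan style analysis. A toy sketch (overlap between the result sets of different queries is ignored here for simplicity):

```python
def aqg_useful_docs(queries):
    """queries: list of (precision p(q), docs retrieved g(q)) pairs.
    Returns (useful, useless) document counts, ignoring overlap
    between the result sets of different queries."""
    useful = sum(p * g for p, g in queries)
    useless = sum((1 - p) * g for p, g in queries)
    return useful, useless

# roughly 100 useful and 40 useless docs for these two queries:
u, w = aqg_useful_docs([(0.8, 100), (0.5, 40)])
```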
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far takes as input a target recall, and gives as output the time for each plan to reach that recall (time = ∞ if a plan cannot reach the target recall).
Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families, so we can characterize the distributions with only a few parameters.
Task | Document Distribution | Token Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform
[Log-log plot: Number of Documents vs. Document Degree, with power-law fit y = 43060·x^(−3.3863)]
[Log-log plot: Number of Tokens vs. Token Degree, with power-law fit y = 54922·x^(−2.0254)]
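A power-law family y = a·x^(−b), like the fits shown in the plots, can be estimated from a handful of (degree, count) points by ordinary least squares in log-log space. A minimal sketch:

```python
from math import log, exp

def fit_power_law(points):
    """Fit count = a * degree^(-b) by least squares on
    (log degree, log count) pairs. Returns (a, b)."""
    xs = [log(d) for d, _ in points]
    ys = [log(c) for _, c in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return exp(my - slope * mx), -slope

# exact power-law data recovers the parameters:
a, b = fit_power_law([(d, 5000 * d ** -2.0) for d in (1, 2, 5, 10, 50)])
```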
25
Parameter Estimation
Naïve solution for parameter estimation: start with a separate "parameter-estimation" phase, perform random sampling on the database, and stop when cross-validation indicates high confidence.
We can do better than this:
There is no need for a separate sampling phase, since sampling is equivalent to executing the task.
→ Piggyback parameter estimation onto task execution.
26
On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall, assuming "default" parameter values.
Start executing the task, updating the parameter estimates during execution.
Switch plans if the updated statistics indicate so.
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).
[Plot: the initial default estimate of the (correct but unknown) distribution is successively updated during execution]
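The optimizer itself reduces to a loop: estimate each plan's cost for the target recall under the current parameter estimates, run the cheapest plan for a while, refine the estimates, and reconsider. A schematic sketch (the `optimize` name, the cost functions, and the parameter update are hypothetical stand-ins for the paper's models):

```python
def optimize(plans, target_recall, max_rounds=100):
    """plans: name -> (cost(target, params), step(recall, params)).
    cost estimates the time to reach the target under current params;
    step runs one batch and returns (new_recall, new_params)."""
    params = {"exponent": 2.0}      # "default" starting parameter values
    recall = 0.0
    for _ in range(max_rounds):
        if recall >= target_recall:
            break
        # pick the currently cheapest plan for the target recall
        best = min(plans, key=lambda name: plans[name][0](target_recall, params))
        recall, params = plans[best][1](recall, params)
    return recall
```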
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines: actual time. Dotted lines: predicted time with correct parameters.
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT
16,921 tokens
[Plot: Execution Time (secs, log scale) vs. Recall (0.0–1.0) for Scan, Filtered Scan, Automatic Query Gen., and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time. Green line: time with the optimizer.
(Results are similar in the other experiments – see paper.)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0–1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate the precision and recall of the extraction system into the framework
Create non-parametric optimization (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create an adaptive "next-K" optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomedical Informatics, 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | – | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | – | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE, Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive plan for high target recall.
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
The optimizer underestimated the recall of AQG and switched to ISE.
38
Experimental Results (Information Extraction)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0–1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated the recall of Filtered Scan, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.
39
Focused Resource Discovery
800,000 web pages
12,000 tokens
4
Text-Centric Task III Focused Resource Discovery
Identify web pages about a given topic (multiple techniques proposed simple classifiers focused crawlers focused queryinghellip)
URL
httpbiologyaboutcom
httpwwwamjbotorg
httpwwwsysbotorg
httpwwwbotanyubcca
Web Page
Classifier
Web Pages about Botany
5
An Abstract View of Text-Centric Tasks Output Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Task Token
Information Extraction Relation Tuple
Database Selection Word (+Frequency)
Focused Crawling Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric TaskOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve documents from database
Similar to relational world
Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results
Unlike the relational world
Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)
rarrunderlying data distribution dictates what is best
7
Execution Plan CharacteristicsOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)
Question How do we choose the fastest execution plan for reaching
a target recall
ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
5
An Abstract View of Text-Centric Tasks Output Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Task Token
Information Extraction Relation Tuple
Database Selection Word (+Frequency)
Focused Crawling Web Page about a Topic
For the rest of the talk
6
Executing a Text-Centric TaskOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve documents from database
Similar to relational world
Two major execution paradigms Scan-based Retrieve and process documents sequentially Index-based Query database (eg [case fatality rate]) retrieve and process documents in results
Unlike the relational world
Indexes are only ldquoapproximaterdquo index is on keywords not on tokens of interest Choice of execution plan affects output completeness (not only speed)
rarrunderlying data distribution dictates what is best
7
Execution Plan CharacteristicsOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens2 Process documents1 Retrieve documents from database
Execution Plans have two main characteristicsExecution TimeRecall (fraction of tokens retrieved)
Question How do we choose the fastest execution plan for reaching
a target recall
ldquoWhat is the fastest plan for discovering 10 of the disease outbreaks mentioned in The New York Times archiverdquo
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomed. Inf., 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | - | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | - | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
Underestimated recall for AQG; the optimizer switched to ISE
38
Experimental Results (Information Extraction)
[Figure: execution time (secs, logarithmic scale) vs. recall (0.00-1.00) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen, and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated Filtered Scan recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan
39
Focused Resource Discovery
800,000 web pages
12,000 tokens
8
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based(Index-based)
9
ScanOutput Tokens
hellipExtraction
System
Text Database
3 Extract output tokens
2 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes documents sequentially (until reaching target recall)
Execution time = |Retrieved Docs| middot (R + P)
Time for retrieving a document
Question How many documents does Scan retrieve
to reach target recall
Time for processing a document
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents (details in paper)
10
Estimating Recall of ScanModeling Scan for Token t What is the probability of seeing t (with
frequency g(t)) after retrieving S documents A ldquosampling without replacementrdquo process
After retrieving S documents frequency of token t follows hypergeometric distribution
Recall for token t is the probability that frequency of t in S docs gt 0
t
d1
d2
dS
dN
D
Token
Samplingfor t
ltSARS Chinagt
S documents
Probability of seeing token t after retrieving S
documentsg(t) = frequency of token t
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall, assuming "default" parameter values
Start executing the task; update parameter estimates during execution
Switch plan if the updated statistics indicate so
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper)
[Figure: the correct (but unknown) distribution vs. the initial default estimate and successively updated estimates]
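The loop above can be sketched as follows. The plan names are the paper's, but the cost formulas and the "skew" parameter are placeholders invented for illustration, not the paper's actual models:

```python
# Sketch of on-the-fly plan selection: pick a plan under default
# parameter values, execute in batches, re-estimate parameters, and
# switch when another plan's estimated time-to-target-recall is lower.
# The cost model and the "skew" parameter are made up for illustration.

def estimate_time(plan, params, target_recall):
    # Placeholder cost model: query-based plans start cheap but
    # degrade quickly as the estimated skew and target recall grow.
    base = {"scan": 100.0, "filtered_scan": 60.0, "ise": 10.0, "aqg": 20.0}
    penalty = {"scan": 1.0, "filtered_scan": 1.5, "ise": 50.0, "aqg": 10.0}
    return base[plan] + penalty[plan] * params["skew"] * target_recall ** 4

def choose_plan(params, target_recall):
    plans = ("scan", "filtered_scan", "ise", "aqg")
    return min(plans, key=lambda p: estimate_time(p, params, target_recall))

params = {"skew": 1.0}              # "default" parameter values
target = 0.9
plan = choose_plan(params, target)  # most promising plan a priori

for _ in range(5):                  # execute the task batch by batch
    # ... retrieve and process a batch of documents with `plan` ...
    params["skew"] += 2.0           # estimate updated from observed docs
    plan = choose_plan(params, target)  # switch if statistics indicate so

print(plan)
```

In this toy run the optimizer starts with a query-based plan and, as the updated estimates make it look more expensive, switches to a crawl-based plan, mirroring the behavior described above.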
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines: actual time; dotted lines: predicted time with correct parameters
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT
16,921 tokens
[Plot: Execution Time (secs), log scale from 100 to 100,000, vs. Recall from 0.00 to 1.00, for Scan, Filtered Scan, Automatic Query Gen., and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time; green line: time with optimizer
(results similar in other experiments – see paper)
[Plot: Execution Time (secs), log scale from 100 to 100,000, vs. Recall from 0.00 to 1.00, for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of the extraction system in the framework
Create non-parametric optimization (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create an adaptive "next-K" optimizer
32
Thank you
Task                          Filtered Scan                             Iterative Set Expansion           Automatic Query Generation
Information Extraction        Grishman et al., J. of Biomed. Inf. 2002  Agichtein and Gravano, ICDE 2003  Agichtein and Gravano, ICDE 2003
Content Summary Construction  -                                         Callan et al., SIGMOD 1999        Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery    Chakrabarti et al., WWW 1999              -                                 Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
Underestimated recall for AQG; switched to ISE
38
Experimental Results (Information Extraction)
[Plot: Execution Time (secs), log scale from 100 to 100,000, vs. Recall from 0.00 to 1.00, for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated Filtered Scan recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan
39
Focused Resource Discovery
Task: Focused Resource Discovery
800,000 web pages
12,000 tokens
11
Estimating Recall of ScanModeling Scan Multiple ldquosampling without replacementrdquo
processes one for each token Overall recall is average recall across
tokens
rarr We can compute number of documents required to reach target recall
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Samplingfor tM
ltSARS Chinagt
ltEbola Zairegt
Execution time = |Retrieved Docs| middot (R + P)
12
Scan and Filtered ScanOutput Tokens
hellipExtraction
System
Text Database
4 Extract output tokens
3 Process documents
1 Retrieve docs from database
ScanScan retrieves and processes all documents (until reaching target recall)
Filtered ScanFiltered Scan uses a classifier to identify and process only promising documents(eg the Sports section of NYT is unlikely to describe disease outbreaks)
Execution time = |Retrieved Docs| ( R + F + P)
Time for retrieving a document
Time for filteringa document
Question How many documents does (Filtered) Scan
retrieve to reach target recall
Classifier
2 Filter documents
Time for processing a document
Classifier selectivity (σle1)
σ
filtered
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task / Filtered Scan / Iterative Set Expansion / Automatic Query Generation:
Information Extraction: Grishman et al., J. of Biomed. Inf. 2002 / Agichtein and Gravano, ICDE 2003 / Agichtein and Gravano, ICDE 2003
Content Summary Construction: - / Callan et al., SIGMOD 1999 / Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery: Chakrabarti et al., WWW 1999 / - / Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE, Headquarters)
Task: Company Headquarters; Snowball IE system; 182,531 documents from NYT; 16,921 tokens.
35
Experimental Results (Content Summaries)
Content Summary Extraction: 19,997 documents from 20newsgroups; 120,024 tokens.
36
Experimental Results (Content Summaries)
Content Summary Extraction: 19,997 documents from 20newsgroups; 120,024 tokens.
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall.
37
Experimental Results (Content Summaries)
Content Summary Extraction: 19,997 documents from 20newsgroups; 120,024 tokens.
The optimizer underestimated the recall of AQG and switched to ISE.
38
Experimental Results (Information Extraction)
[Chart: execution time (secs, log scale, 100–100,000) vs. recall (0.00–1.00) for Scan, Filt. Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated Filtered Scan's recall, but after Filtered Scan ran to completion, OPTIMIZED simply switched to Scan.
39
Focused Resource Discovery
800,000 web pages; 12,000 tokens.
12
Scan and Filtered Scan
1. Retrieve documents from the text database. 2. Filter documents with a classifier. 3. Process documents with the extraction system. 4. Extract output tokens.
Scan retrieves and processes all documents (until reaching the target recall).
Filtered Scan uses a classifier to identify and process only promising documents (e.g., the Sports section of the NYT is unlikely to describe disease outbreaks).
Execution time = |Retrieved Docs| · (R + F + P), where R is the time to retrieve a document, F the time to filter it, and P the time to process it; the classifier has selectivity σ ≤ 1.
Question: how many documents does (Filtered) Scan retrieve to reach the target recall?
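The two cost formulas can be written down directly. The numbers below are illustrative, and folding σ into the processing term (only documents accepted by the classifier are processed) is an assumption about how the selectivity annotation enters the slide's formula.

```python
# Execution-time formulas from the slide: Scan pays retrieval + processing
# for every document; Filtered Scan also pays filtering, but processes
# only the fraction sigma accepted by the classifier.

def scan_time(n_retrieved, R, P):
    """Scan: every retrieved document is processed."""
    return n_retrieved * (R + P)

def filtered_scan_time(n_retrieved, R, F, P, sigma):
    """Filtered Scan: every retrieved document is filtered; only the
    accepted fraction sigma is processed (assumed interpretation)."""
    return n_retrieved * (R + F + sigma * P)

# With a cheap, selective classifier, Filtered Scan wins when P dominates:
t_scan = scan_time(1000, R=0.1, P=2.0)                             # 2100.0
t_fs = filtered_scan_time(1000, R=0.1, F=0.05, P=2.0, sigma=0.2)   # 550.0
print(t_scan, t_fs)
```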
13
Estimating Recall of Filtered Scan
The analysis is similar to Scan. The main difference is that the classifier rejects documents, which decreases the effective database size from |D| to σ·|D| (σ: classifier selectivity) and decreases the effective token frequency from g(t) to r·g(t) (r: classifier recall).
[Diagram: sampling for tokens t1…tM over documents d1…dN in D. Documents rejected by the classifier decrease the effective database size; tokens in rejected documents have lower effective token frequency.]
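One way to make the "effective database size / effective token frequency" adjustment concrete is a uniform-sampling model: treat Scan as sampling S documents without replacement, so a token with frequency g is missed with hypergeometric probability C(D−g, S)/C(D, S), and apply the σ and r shrinkage for Filtered Scan. This is a sketch under that assumption, not the paper's exact estimator.

```python
# Sampling view of (Filtered) Scan recall.
from math import comb

def p_token_seen(D, g, S):
    """Probability a token occurring in g of D documents appears among
    S documents sampled uniformly without replacement."""
    if S >= D:
        return 1.0 if g > 0 else 0.0
    return 1.0 - comb(D - g, S) / comb(D, S)

def expected_recall(D, freqs, S):
    """Expected fraction of tokens discovered after S sampled documents."""
    return sum(p_token_seen(D, g, S) for g in freqs) / len(freqs)

def filtered_scan_recall(D, freqs, S, sigma, r):
    """Filtered Scan (per the slide): shrink the database to sigma*D and
    each token frequency to r*g, then apply the same sampling model."""
    D_eff = max(1, round(sigma * D))
    eff = [min(D_eff, round(r * g)) for g in freqs]
    return expected_recall(D_eff, eff, min(S, D_eff))
```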
14
Outline
Description and analysis of crawl- and query-based plans: Scan and Filtered Scan (crawl-based); Iterative Set Expansion and Automatic Query Generation (query-based)
Optimization strategy
Experimental results and conclusions
15
Iterative Set Expansion
1. Query the database with seed tokens (e.g., <Malaria, Ethiopia>). 2. Process the retrieved documents. 3. Extract tokens from the documents. 4. Augment the seed tokens with the new tokens via query generation (e.g., [Ebola AND Zaire]).
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process it, and Q the time to answer a query.
Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
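The four steps above can be sketched as a worklist loop over a toy in-memory index; the two dictionaries are hypothetical stand-ins for the search interface and the extraction system.

```python
# Skeleton of Iterative Set Expansion over toy token->documents and
# document->tokens maps (illustrative data, not the real system).

def iterative_set_expansion(seeds, query_index, doc_tokens, max_queries):
    seen_tokens = set(seeds)
    seen_docs = set()
    frontier = list(seeds)
    queries_sent = 0
    while frontier and queries_sent < max_queries:
        token = frontier.pop(0)            # 1. query with a (seed) token
        queries_sent += 1
        for doc in query_index.get(token, []):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)             # 2. retrieve and process document
            for t in doc_tokens[doc]:      # 3. extract tokens
                if t not in seen_tokens:
                    seen_tokens.add(t)     # 4. augment the seed set
                    frontier.append(t)
    return seen_tokens, seen_docs, queries_sent
```

Both cost terms of the formula fall out of the return values: |Retrieved Docs| is `len(seen_docs)` and |Queries| is `queries_sent`.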
16
Querying Graph
The querying graph is a bipartite graph of tokens and documents: each token (transformed into a keyword query) retrieves documents, and documents contain tokens.
[Diagram: tokens t1–t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) linked to documents d1–d5]
17
Using the Querying Graph for Analysis
We need to compute the number of documents retrieved after sending Q tokens as queries (which estimates time) and the number of tokens that appear in the retrieved documents (which estimates recall).
To estimate these, we need the degree distribution of the tokens discovered by retrieving documents, and the degree distribution of the documents retrieved by the tokens. (These are not the same as the degree distributions of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
An elegant analysis framework based on generating functions handles this; details in the paper.
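The parenthetical caveat can be made concrete: a node reached by following a random edge has a size-biased degree distribution, with probability proportional to degree times the plain probability. A minimal sketch:

```python
# Following an edge in the querying graph reaches a node with probability
# proportional to its degree, so the distribution observed while querying
# is the size-biased version of the plain degree distribution.

def size_biased(degree_dist):
    """degree_dist: {degree: probability}.  Returns the degree distribution
    of a node reached by following a uniformly random edge."""
    mean = sum(k * p for k, p in degree_dist.items())
    return {k: k * p / mean for k, p in degree_dist.items()}

plain = {1: 0.9, 10: 0.1}      # most nodes have degree 1...
biased = size_biased(plain)    # ...but edges mostly lead to degree-10 nodes
print(biased)
```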
18
Recall Limit: Reachability Graph
The reachability graph has an edge t1 → t2 when token t1 retrieves a document d1 that contains token t2.
[Diagram: querying graph over tokens t1–t5 and documents d1–d5, and the induced reachability graph over t1–t5]
The upper recall limit is determined by the size of the biggest connected component.
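The recall ceiling can be computed on a toy querying graph by BFS from the seed tokens: everything outside the component reachable from the seeds can never be discovered, however many queries are sent.

```python
# Recall ceiling of Iterative Set Expansion: only tokens reachable from
# the seeds via token -> document -> token hops can ever be discovered.
from collections import deque

def reachable_tokens(seeds, token_to_docs, doc_to_tokens):
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        t = queue.popleft()
        for d in token_to_docs.get(t, []):
            for t2 in doc_to_tokens.get(d, []):
                if t2 not in seen:
                    seen.add(t2)
                    queue.append(t2)
    return seen

# Two components: {t1, t2} and {t4, t5}; seeds in one cannot reach the other.
t2d = {"t1": ["d1"], "t2": ["d2"], "t4": ["d4"]}
d2t = {"d1": ["t1", "t2"], "d2": ["t1", "t2"], "d4": ["t4", "t5"]}
print(sorted(reachable_tokens({"t1"}, t2d, d2t)))
```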
19
Automatic Query Generation
Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents containing tokens.
20
Automatic Query Generation
1. Generate, offline, queries that tend to retrieve documents with tokens. 2. Query the database. 3. Process the retrieved documents. 4. Extract tokens from the documents.
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process it, and Q the time to answer a query.
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) documents with precision p(q): p(q)·g(q) useful documents and [1 − p(q)]·g(q) useless ones.
We compute the total number of useful (and useless) documents retrieved.
The analysis is similar to Filtered Scan: the effective database size is |Duseful|, and the sample size S is the number of useful documents retrieved.
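The useful/useless accounting is simple arithmetic; a sketch with illustrative (g, p) values for two hypothetical queries:

```python
# Each query q contributes p(q)*g(q) useful documents and
# (1 - p(q))*g(q) useless ones; the recall analysis then mirrors
# Filtered Scan with |Duseful| as the effective database size.

def aqg_doc_counts(queries):
    """queries: list of (g, p) pairs = (docs retrieved, precision)."""
    useful = sum(g * p for g, p in queries)
    useless = sum(g * (1.0 - p) for g, p in queries)
    return useful, useless

useful, useless = aqg_doc_counts([(100, 0.8), (50, 0.2)])
print(useful, useless)
```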
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far takes as input a target recall and gives as output the time for each plan to reach that recall (time = infinity if a plan cannot reach the target recall).
Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
Next, we show how to estimate these degree distributions on the fly.
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families, so each distribution can be characterized with only a few parameters.
Task / Document Distribution / Token Distribution:
Information Extraction: power-law / power-law
Content Summary Construction: lognormal / power-law (Zipf)
Focused Resource Discovery: uniform / uniform
[Log-log plots: number of documents vs. document degree, with power-law fit y = 43060·x^(-3.3863); number of tokens vs. token degree, with fit y = 54922·x^(-2.0254)]
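Fits like the ones shown (y = a·x^(−b)) are typically obtained by a least-squares line fit in log-log space, since a power law is linear there. A minimal pure-Python sketch, not the paper's estimation procedure:

```python
# Fit y = a * x^(-b) by linear regression on (log x, log y).
from math import log, exp

def fit_power_law(xs, ys):
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    num = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    den = sum((u - mx) ** 2 for u in lx)
    slope = num / den
    a = exp(my - slope * mx)
    return a, -slope          # y = a * x^slope, so the exponent b = -slope

# Exact power-law data recovers its parameters:
xs = [1, 2, 4, 8, 16]
ys = [1000 * x ** -2.0 for x in xs]
a, b = fit_power_law(xs, ys)
print(a, b)   # recovers a ≈ 1000, b ≈ 2
```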
25
Parameter Estimation
Naïve solution: run a separate "parameter-estimation" phase, performing random sampling on the database and stopping when cross-validation indicates high confidence.
We can do better than this: no separate sampling phase is needed, because sampling is equivalent to executing the task.
→ Piggyback parameter estimation onto task execution.
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
13
Estimating Recall of Filtered ScanModeling Filtered Scan
Analysis similar to Scan Main difference the classifier rejects
documents and Decreases effective database size
from |D| to σ|D| (σ classifier selectivity)
Decreases effective token frequencyfrom g(t) to rg(t)(r classifier recall)
t1 t2 tM
d1
d2
d3
dN
D
Tokens
Samplingfor t1
Samplingfor t2
Sampling
for tM
Documents rejected by classifier decrease effective
database size
Tokens in rejected documents have lower
effective token frequency
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
14
Outline
Description and analysis of crawl- and query-based plans Scan Filtered Scan Iterative Set Expansion Automatic Query Generation
Optimization strategy
Experimental results and conclusions
Crawl-based
Query-based
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
15
Iterative Set ExpansionOutput Tokens
hellipExtraction
System
Text Database
3 Extract tokensfrom docs
2 Process retrieved documents
1 Query database with seed tokens
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
Time for processing a document
Query
Generation
4 Augment seed tokens with new tokens
Question How many queries and how many documents
does Iterative Set Expansion need to reach target recall
(eg [Ebola AND Zaire])(eg ltMalaria Ethiopiagt)
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
[Figure: execution time (secs, log scale) vs. recall (0.0 to 1.0) for Scan, Filtered Scan, Automatic Query Gen., and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time; green line: time with optimizer
(results similar in other experiments; see paper)
[Figure: execution time (secs, log scale) vs. recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework that picks, on the fly, the fastest plan for a given target recall
31
Future Work
Incorporate the precision and recall of the extraction system into the framework
Create a non-parametric optimizer (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create an adaptive "next-K" optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomedical Informatics, 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | - | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | - | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE: Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT
16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20 Newsgroups
120,024 tokens
Underestimated recall for AQG; switched to ISE
38
Experimental Results (Information Extraction)
[Figure: execution time (secs, log scale) vs. recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Gen., and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated FS recall, but after FS ran to completion, OPTIMIZED simply switched to Scan
39
Focused Resource Discovery
800,000 web pages
12,000 tokens
16
Querying Graph
The querying graph is a bipartite graph containing tokens and documents
Each token (transformed to a keyword query) retrieves documents
Documents contain tokens
[Figure: bipartite querying graph linking tokens t1-t5 (⟨SARS, China⟩, ⟨Ebola, Zaire⟩, ⟨Malaria, Ethiopia⟩, ⟨Cholera, Sudan⟩, ⟨H5N1, Vietnam⟩) to documents d1-d5]
17
Using Querying Graph for Analysis
We need to compute:
the number of documents retrieved after sending Q tokens as queries (estimates time)
the number of tokens that appear in the retrieved documents (estimates recall)
To estimate these, we need the degree distribution of the tokens discovered by retrieving documents, and the degree distribution of the documents retrieved by the tokens
(Not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees)
[Figure: bipartite querying graph, as on the previous slide]
Elegant analysis framework based on generating functions (details in the paper)
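The two quantities above can also be read off by simulating the querying process on a small graph. This is only an illustrative sketch; the paper derives them analytically via generating functions.

```python
def simulate_querying(retrieves, contains, seed_tokens, max_queries):
    # After issuing up to max_queries token queries, return how many
    # documents were retrieved and how many distinct tokens discovered.
    known, docs, queried = set(seed_tokens), set(), 0
    frontier = list(seed_tokens)
    while frontier and queried < max_queries:
        t = frontier.pop(0)
        queried += 1
        for d in retrieves.get(t, ()):
            if d not in docs:
                docs.add(d)
                for t2 in contains.get(d, ()):
                    if t2 not in known:
                        known.add(t2)
                        frontier.append(t2)
    return len(docs), len(known)

# Toy graph: querying t1 eventually discovers t2 and t3
n_docs, n_tokens = simulate_querying(
    {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}},
    {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3"}},
    ["t1"], max_queries=10,
)
```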
18
Recall Limit: Reachability Graph
t1 retrieves document d1 that contains t2
[Figure: querying graph over tokens t1-t5 and documents d1-d5, and the induced reachability graph with an edge from t to t' when t retrieves a document containing t']
Upper recall limit determined by the size of the biggest connected component
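A sketch of that recall bound, using union-find over an undirected version of the token-document graph (a simplification of the directed reachability graph on the slide):

```python
from collections import Counter

def max_recall(retrieves, contains, all_tokens):
    # Fraction of tokens inside the largest connected component.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for t, ds in retrieves.items():
        for d in ds:
            union(("t", t), ("d", d))
    for d, ts in contains.items():
        for t in ts:
            union(("d", d), ("t", t))
    sizes = Counter(find(("t", t)) for t in all_tokens)
    return max(sizes.values()) / len(all_tokens)

bound = max_recall(
    {"t1": {"d1"}, "t2": {"d1"}, "t3": {"d2"}},
    {"d1": {"t1", "t2"}, "d2": {"t3"}},
    ["t1", "t2", "t3", "t4"],   # t4 is unreachable
)
```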
19
Automatic Query Generation
Iterative Set Expansion has a recall limitation, due to the iterative nature of its query generation
Automatic Query Generation avoids this problem by creating queries offline (using machine learning), designed to return documents with tokens
20
Automatic Query Generation
1. Generate queries offline that tend to retrieve documents with tokens
2. Query database
3. Process retrieved documents
4. Extract tokens from docs
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query
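The cost formula is direct to compute; a sketch with made-up per-operation times:

```python
def aqg_execution_time(n_docs, n_queries, R, P, Q):
    # time = |Retrieved Docs| * (R + P) + |Queries| * Q, with
    # R = time to retrieve a doc, P = time to process a doc,
    # Q = time to answer a query
    return n_docs * (R + P) + n_queries * Q

# e.g. 1000 retrieved docs, 50 queries, R=1s, P=2s, Q=5s (hypothetical)
t = aqg_execution_time(1000, 50, 1, 2, 5)
```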
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs; the query has precision p(q):
p(q)·g(q) useful docs, [1 − p(q)]·g(q) useless docs
We compute the total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan: the effective database size is |D_useful|, and the sample size S is the number of useful documents retrieved
[Figure: query q splits the text database into p(q)·g(q) useful and (1 − p(q))·g(q) useless documents]
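The useful/useless split is a one-liner per query; a sketch that, for simplicity, ignores overlap between query results:

```python
def split_retrieved(queries):
    # queries: list of (precision p(q), result size g(q)) pairs.
    # Returns (useful docs, useless docs) summed over all queries.
    useful = sum(p * g for p, g in queries)
    useless = sum((1 - p) * g for p, g in queries)
    return useful, useless

useful, useless = split_retrieved([(0.75, 100), (0.5, 40)])
```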
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
17
Using Querying Graph for Analysis
We need to compute the Number of documents retrieved after
sending Q tokens as queries (estimates time) Number of tokens that appear in the
retrieved documents (estimates recall)
To estimate these we need to compute the Degree distribution of the tokens
discovered by retrieving documents Degree distribution of the documents
retrieved by the tokens (Not the same as the degree distribution of a
randomly chosen token or document ndash it is easier to discover documents and tokens with high degrees)
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Elegant analysis framework based on generating functions ndash details in the paper
ltSARS Chinagt
ltEbola Zairegt
ltMalaria Ethiopiagt
ltCholera Sudangt
ltH5N1 Vietnamgt
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
18
Recall Limit Reachability Graph
t1 retrieves document d1 that contains t2
t1
t2 t3
t4t5
Tokens
Documents
t1
t2
t3
t4
t5
d1
d2
d3
d4
d5
Upper recall limit determined by the size of the biggest connected component
Reachability Graph
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
19
Automatic Query Generation
Iterative Set ExpansionIterative Set Expansion has recall limitation due to iterative nature of query generation
Automatic Query GenerationAutomatic Query Generation avoids this problem by creating queries offline (using machine learning) which are designed to return documents with tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
20
Automatic Query GenerationOutput Tokens
hellipExtraction
System
Text Database
4 Extract tokensfrom docs
3 Process retrieved documents
2 Query database
Execution time = |Retrieved Docs| (R + P) + |Queries| Q
Time for retrieving a document
Time for answering a query
Time for processing a document
OfflineQuery
Generation
1 Generate queries that tend to retrieve documents with tokens
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) docs Query has precision p(q)
p(q)g(q) useful docs [1-p(q)]g(q) useless docs
We compute total number of useful (and useless) documents retrieved
Analysis similar to Filtered Scan Effective database size is |Duseful| Sample size S is number of useful
documents retrieved
Text Database
Useful Doc
Useless Doc
q p(q)g(q)
(1-p(q))g(q)
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
21
Estimating Recall of Automatic Query Generation
Query q retrieves g(q) documents and has precision p(q):
p(q)·g(q) useful documents, (1 - p(q))·g(q) useless documents
We compute the total number of useful (and useless) documents retrieved
Analysis is similar to Filtered Scan:
The effective database size is |D_useful|
The sample size S is the number of useful documents retrieved
[Diagram: query q retrieves p(q)·g(q) useful and (1 - p(q))·g(q) useless documents from the text database]
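As a back-of-the-envelope sketch of this accounting (function name and inputs are hypothetical, not from the paper), the expected useful and useless documents can be accumulated query by query:

```python
def docs_retrieved(queries):
    """Accumulate expected useful/useless documents over a query sequence.

    Each query is a (g, p) pair: g = number of documents retrieved,
    p = precision (fraction of retrieved documents that are useful).
    Illustrative sketch of the per-query accounting on the slide.
    """
    useful = useless = 0.0
    for g, p in queries:
        useful += p * g          # p(q)·g(q) useful documents
        useless += (1 - p) * g   # (1 - p(q))·g(q) useless documents
    return useful, useless

# Example: three queries with varying reach and precision
u, w = docs_retrieved([(100, 0.5), (80, 0.25), (64, 0.125)])
print(u, w)  # 78.0 useful, 166.0 useless
```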
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far:
Takes as input a target recall
Gives as output the time for each plan to reach the target recall
(time = infinity if a plan cannot reach the target recall)
Time and recall depend on task-specific properties of the database:
Token degree distribution
Document degree distribution
Next, we show how to estimate the degree distributions on the fly
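A minimal sketch of that input/output contract (all names and the linear cost model are hypothetical): each plan reports an estimated time to reach the target recall, with infinity marking an unreachable target, and the optimizer takes the minimum:

```python
import math

def plan_time(target_recall, max_recall, docs_needed_fn, time_per_doc):
    """Estimated execution time for one plan to reach target_recall.

    max_recall: the highest recall this plan can ever reach.
    docs_needed_fn: maps a recall level to the number of documents the
    plan must process (this is where the token/document degree
    distributions enter). Returns math.inf when the target is unreachable.
    """
    if target_recall > max_recall:
        return math.inf
    return docs_needed_fn(target_recall) * time_per_doc

# Toy comparison at 60% target recall: a query-based plan that tops out
# at 50% recall loses to a full scan, despite being far cheaper per recall point.
scan = plan_time(0.6, 1.0, lambda r: 100_000 * r, time_per_doc=1.0)
query = plan_time(0.6, 0.5, lambda r: 2_000 * r, time_per_doc=1.0)
best = min(scan, query)
```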
24
Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters

Task | Document Distribution | Token Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform
[Plot: Number of Documents vs. Document Degree (log-log), power-law fit y = 43060·x^(-3.3863)]
[Plot: Number of Tokens vs. Token Degree (log-log), power-law fit y = 54922·x^(-2.0254)]
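For the power-law rows above, one simple illustrative estimator (not necessarily the paper's fitting procedure) recovers the exponent by least squares on the log-log degree histogram:

```python
import math

def fit_power_law(degrees, counts):
    """Fit count ≈ C * degree**(-alpha) by least squares in log-log space.

    Returns (C, alpha). A simple illustrative estimator: taking logs turns
    the power law into a straight line, whose slope is -alpha.
    """
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

# Synthetic data drawn exactly from y = 1000 * x**(-2)
degrees = [1, 2, 4, 8, 16]
counts = [1000 * d ** -2 for d in degrees]
C, alpha = fit_power_law(degrees, counts)
```

Only two numbers (C and alpha) then characterize the whole distribution, which is what makes on-the-fly estimation feasible.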
25
Parameter Estimation
Naïve solution for parameter estimation:
Start with a separate "parameter-estimation" phase
Perform random sampling on the database
Stop when cross-validation indicates high confidence
We can do better than this:
No need for a separate sampling phase; sampling is equivalent to executing the task
→ Piggyback parameter estimation onto execution
26
On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall, assuming "default" parameter values
Start executing the task; update the parameter estimates during execution
Switch plans if the updated statistics indicate so
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper)
[Figure: the correct (but unknown) distribution vs. the initial default estimate and successively updated estimates]
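The switching logic above can be sketched as a control loop (all names are hypothetical; the real parameter updates are plan-specific, since only Scan behaves like random sampling):

```python
import math

def run_with_reoptimization(plans, target_recall, default_params, batch=100):
    """Execute the cheapest plan, re-estimating parameters as documents arrive.

    plans: dict name -> (cost_fn, execute_batch_fn), where
    cost_fn(params, target_recall) returns the estimated time to reach the
    target (math.inf if unreachable), and execute_batch_fn(batch) processes
    a batch of documents, returning (recall_so_far, parameter_updates).
    Returns the name of the plan that was running when the target was met.
    """
    params = dict(default_params)
    recall = 0.0
    while recall < target_recall:
        # Re-pick the most promising plan under the current estimates
        name = min(plans, key=lambda p: plans[p][0](params, target_recall))
        recall, updates = plans[name][1](batch)  # process a batch of documents
        params.update(updates)                   # refine distribution estimates
    return name
```

Because the cheapest plan is re-picked after every batch, a plan that looked promising under the default parameters is abandoned as soon as the refined estimates say otherwise.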
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines: actual time; dotted lines: predicted time with correct parameters
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT, 16,921 tokens
[Plot: Execution Time (secs, log scale) vs. Recall (0.0 to 1.0) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion]
29
Experimental Results (Information Extraction)
Solid lines: actual time; green line: time with the optimizer
(results are similar in the other experiments; see paper)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting the execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework that picks, on the fly, the fastest plan for a target recall
31
Future Work
Incorporate the precision and recall of the extraction system into the framework
Create non-parametric optimizers (i.e., no assumptions about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create an adaptive "next-K" optimizer
32
Thank you
Task | Filtered Scan | Iterative Set Expansion | Automatic Query Generation
Information Extraction | Grishman et al., J. of Biomed. Inf., 2002 | Agichtein and Gravano, ICDE 2003 | Agichtein and Gravano, ICDE 2003
Content Summary Construction | - | Callan et al., SIGMOD 1999 | Ipeirotis and Gravano, VLDB 2002
Focused Resource Discovery | Chakrabarti et al., WWW 1999 | - | Cohen and Singer, AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task: Company Headquarters
Snowball IE system
182,531 documents from NYT, 16,921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups, 120,024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups, 120,024 tokens
ISE is a cheap plan for low target recall, but becomes the most expensive for high target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups, 120,024 tokens
Underestimated recall for AQG; switched to ISE
38
Experimental Results (Information Extraction)
[Plot: Execution Time (secs, log scale) vs. Recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
OPTIMIZED is faster than the "best plan": it overestimated FS recall, but after FS ran to completion, OPTIMIZED simply switched to Scan
39
Focused Resource Discovery
800,000 web pages, 12,000 tokens
22
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
23
Summary of Cost Analysis
Our analysis so far Takes as input a target recall Gives as output the time for each plan to reach target recall
(time = infinity if plan cannot reach target recall)
Time and recall depend on task-specific properties of database Token degree distribution Document degree distribution
Next we show how to estimate degree distributions on-the-fly
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
24
Estimating Cost Model ParametersToken and document degree distributions belong to known distribution families
Can characterize distributions with only a few parameters
Task Document Distribution Token Distribution
Information Extraction Power-law Power-lawContent Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
y = 43060x-33863
1
10
100
1000
10000
100000
1 10 100Document Degree
Nu
mb
er
of
Do
cum
en
ts
y = 54922x-20254
1
10
100
1000
10000
1 10 100 1000Token Degree
Num
ber
of T
oken
s
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
25
Parameter Estimation
Naiumlve solution for parameter estimation Start with separate ldquoparameter-estimationrdquo phase Perform random sampling on database Stop when cross-validation indicates high confidence
We can do better than this
No need for separate sampling phase Sampling is equivalent to executing the task
rarrPiggyback parameter estimation into execution
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
26
On-the-fly Parameter Estimation
Pick most promising execution plan for target recall assuming ldquodefaultrdquo parameter values
Start executing task Update parameter estimates
during execution Switch plan if updated statistics
indicate so
ImportantOnly Scan acts as ldquorandom samplingrdquoAll other execution plan need parameter adjustment (see paper)
Correct (but unknown) distribution
Initial default estimationUpdated estimationUpdated estimation
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
27
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
Underestimated recall for AQG switched to ISE
38
Experimental Results (Information Extraction)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s) Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
OPTIMIZED is faster than ldquobest planrdquo overestimated
FS recall but after FS run to completion OPTIMIZED
just switched to Scan
39
Focused Resource Discovery
Focused Resource Discovery
800000 web pages
12000 tokens
28
Correctness of Theoretical Analysis
Solid lines Actual time Dotted lines Predicted time with correct parameters
Task Disease Outbreaks
Snowball IE system
182531 documents from NYT
16921 tokens
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100Recall
Exe
cutio
n T
ime
(se
cs)
Scan
Filt Scan
Automatic Query Gen
Iterative Set Expansion
29
Experimental Results (Information Extraction)
Solid lines Actual time Green line Time with optimizer
(results similar in other experiments ndash see paper)
100
1000
10000
100000
000 010 020 030 040 050 060 070 080 090 100
Recall
Exe
cutio
n T
ime
(sec
s)
Scan
Filt Scan
Iterative Set Expansion
Automatic Query Gen
OPTIMIZED
30
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
31
Future Work
Incorporate precision and recall of extraction system in framework
Create non-parametric optimization (ie no assumption about distribution families)
Examine other text-centric tasks and analyze new execution plans
Create adaptive ldquonext-Krdquo optimizer
32
Thank you
Task Filtered Scan Iterative Set Expansion
Automatic Query Generation
Information Extraction
Grishman et al Jof Biomed Inf 2002
Agichtein and Gravano ICDE 2003
Agichtein and Gravano ICDE 2003
Content Summary Construction
- Callan et al SIGMOD 1999
Ipeirotis and Gravano VLDB 2002
Focused Resource Discovery
Chakrabarti et al WWW 1999
- Cohen and Singer AAAI WIBIS 1996
33
Overflow Slides
34
Experimental Results (IE Headquarters)
Task Company Headquarters
Snowball IE system
182531 documents from NYT
16921 tokens
35
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
36
Experimental Results (Content Summaries)
Content Summary Extraction
19997 documents from 20newsgroups
120024 tokens
ISE is a cheap plan for low target recall
but becomes the most expensive for high
target recall
37
Experimental Results (Content Summaries)
Content Summary Extraction
19,997 documents from 20newsgroups
120,024 tokens
Underestimated recall for AQG; the optimizer switched to ISE
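The switching behavior above, where the optimizer revises a plan's parameter estimates during execution and abandons it for another plan, can be sketched with a toy simulation. Everything here is hypothetical (the per-step costs, the recall gains, and the crude estimate-blending rule); it only illustrates the on-the-fly re-estimation idea, not the paper's estimators:

```python
class SimPlan:
    """Toy plan: each execution step costs `cost` and adds `true_gain`
    recall. The optimizer only sees `est_gain`, its current estimate,
    which it refines after every step it runs (all values hypothetical)."""
    def __init__(self, name, cost, true_gain, est_gain):
        self.name, self.cost = name, cost
        self.true_gain, self.est_gain = true_gain, est_gain

    def est_time(self, remaining_recall):
        return self.cost * remaining_recall / self.est_gain

    def step(self):
        # On-the-fly estimation: blend the estimate toward the observed gain.
        self.est_gain += 0.5 * (self.true_gain - self.est_gain)
        return self.true_gain

def run_adaptive(plans, target):
    """Repeatedly run the plan that looks cheapest for the *remaining*
    recall, updating estimates as results come in."""
    recall, total_cost, trace = 0.0, 0.0, []
    while recall < target:
        plan = min(plans, key=lambda p: p.est_time(target - recall))
        trace.append(plan.name)
        recall += plan.step()
        total_cost += plan.cost
    return recall, total_cost, trace

# AQG starts with an optimistic recall-gain estimate, so it is chosen
# first; once the estimate is corrected, the optimizer switches to ISE.
aqg = SimPlan("AQG", cost=1.0, true_gain=0.02, est_gain=0.10)
ise = SimPlan("ISE", cost=2.0, true_gain=0.05, est_gain=0.05)
recall, cost, trace = run_adaptive([aqg, ise], target=0.5)
print(trace[0], trace[-1])  # starts with AQG, finishes with ISE
```

As in the content-summary experiment, the initially attractive plan loses its advantage once its estimated parameters are corrected mid-execution, and the optimizer moves to the plan that is cheaper for the remaining recall.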
38
Experimental Results (Information Extraction)
[Figure: execution time (secs, log scale) vs. recall for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED]
OPTIMIZED is faster than the "best" plan: the optimizer overestimated FS recall, but after FS ran to completion it simply switched to Scan
39
Focused Resource Discovery
800,000 web pages
12,000 tokens