CIKM 2005
Finding and Approximating Top-k Answers in Keyword Proximity Search
Benny Kimelfeld and Yehoshua Sagiv
The Selim and Rachel Benin School of Engineering and Computer Science
The Hebrew University of Jerusalem
Finding and Approximating Top-k Answers in Keyword Proximity Search, PODS'06
Keyword Proximity Search (KPS)
A paradigm for data extraction
• Data have varying degrees of structure
  – Relational databases, XML, Web sites
• Queries are sets of keywords
  – No structural constraints
The Goal: Extract meaningful parts of the data w.r.t. the keywords
Querying Structure & Content by Keywords
Keywords appear in different parts of the data
Answers show occurrences of keywords, as well as the associations among these occurrences
Proximity of the keywords in the answer indicates a close (strong) semantic association among them
[Figure: a search for the keywords Vardi and Databases; node labels include journal, name, article, title and author]
Past Work on KPS (Keyword Proximity Search)
• DataSpot (SIGMOD 1998)
• Information Units (WWW 2001)
• BANKS (ICDE 2002, VLDB 2005)
• DISCOVER (VLDB 2002)
• DBXplorer (ICDE 2002)
• XKeyword (ICDE 2003)
• …
The Goal of this Paper
Devise efficient algorithms for finding high-quality answers in keyword proximity search
Contents
Introduction
Formal Setting
The Main Results
Enumerating in the Exact Order
Enumerating in an Approximate Order
Conclusion and Future Work
Formal Setting
Data Graphs
[Figure: example data graph; node labels include company, supplies, supply, product, supplier, president, department, manager, papers, A4, coffee, Cohen, Summers, Paris and hq]
Structural and keyword nodes
Edges may have weights
– Weak relationships are penalized by high weights
Queries
Q = {Summers, Cohen, coffee}
[Figure: the example data graph, which contains the keyword nodes Summers, Cohen and coffee]
Queries are sets of keywords from the data graph
Query Answers
[Figure: an answer to the query Q in the example data graph]
An answer is a directed subtree of the data graph
Contains all keywords of the query
Has no redundant edges (and nodes)
The keywords of the query are the leaves
The root has two or more children
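The conditions above can be checked mechanically. The following is a minimal sketch, assuming an answer is given as a collection of directed (parent, child) edges; the function name `is_answer` and this edge-list representation are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def is_answer(edges, keywords):
    """Check the answer conditions for directed (parent, child) edges."""
    children, parents, nodes = defaultdict(set), defaultdict(set), set()
    for u, v in edges:
        children[u].add(v)
        parents[v].add(u)
        nodes.update((u, v))
    # A directed tree has exactly one root, and every other node has one parent.
    roots = [n for n in nodes if not parents[n]]
    if len(roots) != 1 or any(len(parents[n]) > 1 for n in nodes):
        return False
    root = roots[0]
    # Every node must be reachable from the root (rules out detached cycles).
    seen, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if seen != nodes:
        return False
    # No redundant edges: the leaves are exactly the keywords,
    # and the root has two or more children.
    leaves = {n for n in nodes if not children[n]}
    return leaves == set(keywords) and len(children[root]) >= 2
```

For example, a tree whose root has a single child is rejected even if it contains all the keywords, because the edge leaving the root is redundant.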
Ranking: Inversely Proportional to Weight
[Figure: example answers for the keywords Vardi and databases in a dblp data graph with weighted edges; node labels include article, title, cite and references]
rank(A) = (weight(A))^(-1)
Smaller subtrees represent closer associations
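As a small illustration of the ranking, the sketch below assumes an answer is a list of (parent, child, edge_weight) triples. The names `weight`, `rank` and `top_k` and the example weights in the usage note are hypothetical, not values read from the slide.

```python
import heapq

def weight(answer):
    """Total weight of an answer given as (parent, child, edge_weight) triples."""
    return sum(w for _, _, w in answer)

def rank(answer):
    """rank(A) = 1 / weight(A); assumes a strictly positive total weight."""
    return 1.0 / weight(answer)

def top_k(answers, k):
    """The k answers of smallest weight, i.e. of highest rank."""
    return heapq.nsmallest(k, answers, key=weight)
```

With a two-edge answer of weight 2 and a three-edge answer of weight 7, the first has the higher rank and is returned by `top_k(..., 1)`. Note that this helper sorts an explicit list of answers, which is feasible only for small examples.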
Enumerating in Exact (Ranked) Order
If an answer A is generated before an answer A′, then weight(A) ≤ weight(A′)
The first k answers generated are the top-k answers
Enumerating in a C-Approximate Order
If an answer A is generated before an answer A′, then weight(A) ≤ C · weight(A′)
The first k answers generated are a C-approximation of the top-k answers (Fagin et al., PODS'01)
C may be a function of G and Q
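A minimal sketch of the condition, assuming a C-approximate order means that an earlier answer is never heavier than C times a later one; the function name `is_c_approximate_order` is an assumption made for illustration. With C = 1 the check reduces to the exact ranked order.

```python
def is_c_approximate_order(weights, c):
    """True if weights[i] <= c * weights[j] whenever i < j.

    `weights` lists the weights of the answers in the order they were generated.
    """
    heaviest_so_far = float("-inf")
    for w in weights:
        # It suffices to compare each answer with the heaviest earlier one.
        if heaviest_so_far > c * w:
            return False
        heaviest_so_far = max(heaviest_so_far, w)
    return True
```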
Polynomial Delay
Yardstick of efficiency: polynomial delay
Polynomial time between generating successive answers
Exponentially many answers even for 2 keywords (it is inefficient to generate all answers and then sort)
The Main Results
Top Answers are Steiner Trees
• Finding the top answer in KPS (a.k.a. the Steiner-tree problem) is intractable
  – Therefore, one cannot enumerate all answers in ranked order with polynomial delay
• However, the top answer can be found efficiently under data complexity
  – That is, the number of keywords is fixed
• Approximations can be found efficiently under query-and-data complexity
  – There is a lot of work on Steiner-tree approximations
So What Can Be Done?
• Can answers of KPS be enumerated in the exact order with polynomial delay, under data complexity?
• Can approximations of Steiner trees be used for efficiently enumerating in an approximate order (while preserving the approximation ratio)?
Our Results
Theorem 1: Under data complexity, answers of KPS can be enumerated in the exact order with polynomial delay
Our Results (cont’d)
Theorem 2: Under query-and-data complexity, given an efficient C-approximation for finding Steiner trees, one can enumerate with polynomial delay in a (C+1)-approximate order
The Meaning of the Results
KPS is tractable under data complexity
Under query-and-data complexity, an efficient enumeration in an approximate order can be done with almost the same ratios as Steiner trees
All results on Steiner trees can be applied to KPS
Existing approaches to KPS are heuristics
– Exponential delay in the worst case
– No provable nontrivial approximation ratios
From a theoretical point of view, using heuristics is not the only option
Enumerating in the Exact Order
Lawler’s Method
• We use the technique of Lawler (1972), which is an iterative method for finding the top-k answers
• Each iteration generates the next answer by finding the top answer under constraints
• Lawler’s method is designed for general (discrete) optimization problems
• When applying it to a specific problem, one needs to deal with the following two issues
Two Problems to Solve
1. What exactly are the constraints? (That is, how can we apply Lawler’s method so that the constraints make it possible to find top answers efficiently?)
2. How can we efficiently find the top answer under constraints?
Solving the First Problem
Constraints are subtrees of the graph
• Pairwise node-disjoint
• Their leaves are exactly the keywords of the query
An answer satisfies the constraints if it contains all the subtrees (i.e., it is a supertree)
[Figure: example constraint subtrees and a supertree that contains them]
Two Problems to Solve (One Left)
1. What exactly are the constraints? (That is, how can we apply Lawler’s method in a way that the constraints enable finding the top answer efficiently?)
2. How can we efficiently find the top answer under constraints?
Formulation of the Second Problem
Input: constraints (node-disjoint subtrees, keywords as leaves)
Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees)
Next, an algorithm that solves “almost” this problem, namely:
(Almost the same) Objective: A minimal supertree satisfying the constraints
Finding a Minimal Supertree
Input: G, 𝒯 (constraints, i.e., subtrees)
1. Collapse each of the subtrees of 𝒯 into a node
2. Find a Steiner tree T of the collapsed subtrees
3. Restore the collapsed subtrees in T
(more details in the proceedings…)
This is not Enough!
Input: constraints (node-disjoint subtrees, keywords as leaves)
Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees)
(Almost the same) Objective: A minimal supertree satisfying the constraints
Not the same!
Query Answers Revisited
[Figure: an answer to the query in the example data graph]
An answer is a directed subtree of the data graph
Contains all keywords of the query
Has no redundant edges (and nodes)
Keywords are the leaves
The root has two or more children
An Example
[Figure: a graph over the nodes A, B, C and D, showing the minimal supertree satisfying the constraints next to the minimal answer satisfying the constraints]
In the minimal supertree, one edge is redundant! But it cannot be removed, since it is a constraint!
The minimal answer can be completely different from the minimal supertree
Furthermore, there can be no answer even if there is a supertree
What if We Remove Edges of Constraints?
• What if we first generate a minimal supertree and then, as long as the root has only one child, remove the root (until an answer is obtained)?
• The constraints are violated, leading to a failure of Lawler’s method!
• That is,
  – Some answers will be duplicated
  – While other answers will not be generated at all
Our Approach
[Figure: Constraints → Transform → New constraints → Min. Supertree → Answer, illustrated on a graph with the nodes A to H]
The root of this subtree has more than one child and it must be the root of the answer
This Process is Repeated
[Figure: the constraints are transformed in several different ways, and each transformation is followed by a minimal-supertree computation]
Up to 2^(#keywords) times (fixed & usually fewer)
The best is the final answer
About the Transformation
• The details of the exact transformation and the proof of correctness are intricate
• All can be found in the proceedings…
This concludes the algorithm for enumerating in the exact order
A Different View: Chain of Reductions
Enumerating answers in ranked order
  ↓ Adapting Lawler’s method
Finding the top answer under constraints
  ↓ Transformation of constraints
Finding minimal supertrees
  ↓ Collapse and restore
Finding Steiner trees
Enumerating in an Approximate Order
Modifying the Chain of Reductions
Enumeration in an approximate order
  ↓
Finding approximate answers under constraints
  ↓
Finding approximations of minimal supertrees
  ↓
Finding approximations of Steiner trees
Two of the three reductions are similar to the exact case; one is completely different!
Exact Order Revisited
[Figure: the repeated transform and minimal-supertree process of the exact-order algorithm]
Up to 2^(#keywords) repetitions: we cannot allow this under query-and-data complexity!
The Algorithm
[Figure: two subtrees are computed from the constraints]
• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum
Combine the Subtrees
[Figure: the two subtrees are combined into one subgraph]
• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum
The combined subgraph contains an answer: ≤ (C+1) times the optimum
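The (C+1) factor can be read as simple arithmetic over the two subtrees. The sketch below is one way to write it out, under assumptions that are not stated on the slide: edge weights are nonnegative, T_1 denotes the C-approximate supertree, T_2 the minimal answer for the small set of constraints, A the answer contained in the combined subgraph, and OPT the weight of a minimal answer satisfying all the constraints (which is at least the weight of a minimal supertree and at least w(T_2)).

```latex
\begin{align*}
w(A) \;\le\; w(T_1 \cup T_2)
     \;\le\; w(T_1) + w(T_2)
     \;\le\; C \cdot \mathrm{OPT} + 1 \cdot \mathrm{OPT}
     \;=\; (C+1) \cdot \mathrm{OPT}
\end{align*}
```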
Conclusion and Future Work
Keyword Proximity Search
• A common paradigm for keyword search over structured databases
• In the formal model:
  – Data are directed and weighted graphs
  – Queries are sets of keywords (i.e., nodes) from the data graph
  – Query answers are non-redundant subtrees containing the keywords of the query
• The goal is to find the top-k answers, where the rank is inversely proportional to the weight
• A stronger goal: enumeration with polynomial delay
Our Results
• Under data complexity, answers can be enumerated in the exact ranked order with polynomial delay
• Under query-and-data complexity, every efficient C-approximation to the Steiner-tree problem yields an algorithm for enumerating answers with polynomial delay in a (C+1)-approximate order
Our Chain of Reductions
Enumerating answers in sorted order
  ↓ Lawler’s approach
Finding the top answer under constraints
  ↓ The intricate part …
Finding minimal supertrees
  ↓ Subtree Collapse/Restore
Finding Steiner trees
Other Variants of KPS
Our algorithms can be adapted to other popular variants of KPS
Undirected Variant
[Figure: an answer of the undirected variant in the example data graph]
Answers are undirected trees
Strong Variant
[Figure: an answer of the strong variant in the example data graph]
Answers are undirected trees and keywords are leaves
Open Problems
• Can we improve the space efficiency of our algorithms?
• Some ranking functions (e.g., height) are easier than weight when looking for the top answer (no constraints), but
  – The chain of reductions doesn’t work
  – The complexity of finding the top answer under constraints is unknown
• Can our results hold for richer queries that also have structural constraints?
Implementation Considerations
• Bottlenecks: Steiner-tree algorithms and approximations
• Thin graphs allow in-memory execution of our algorithms, even for large XML documents (e.g., DBLP)
• New and intuitive ranking functions that are easier to implement efficiently
Related Work: Order vs. Efficiency
[Figure: four kinds of order (Exact Order, Approximate Order, Heuristic Order with no approximation guaranteed, No Order) placed on a scale from More Desirable to More Efficient, with regions marked "This work" and "Past work"; one label notes "(Queries have a fixed size)"]
Thank you.
Questions?
Illustration of Lawler’s Method
Lawler’s Method (1972)
1. Find the Top Answer
In principle, at this point we should find the second-best answer
But instead…
2. Partition the Remaining Answers
Each partition is defined by a distinct set of constraints
3. Find the Top of each Set
4. Find the Second Answer
The second answer is the best among all the top answers in the partitions
5. Further Divide the Chosen Partition
And so on…
Adapting Lawler’s Method
Our Constraints
Inclusion constraints
• Node-disjoint subtrees of the data graph
• All the leaves are keywords
• An answer must contain all the subtrees
Exclusion constraints
• Edges of the data graph
• An answer must not contain any of the edges
Partitioning a Partition (cont)
A is the top answer of the partition with inclusion constraints I and exclusion constraints E
edges(A) \ I = {e_1,…,e_k}
The partition is divided into k sub-partitions, each with its own top answer:
  Inclusion               Exclusion      Top answer
  I                       E ∪ {e_1}      A_0
  I ∪ {e_1}               E ∪ {e_2}      A_1
  I ∪ {e_1,e_2}           E ∪ {e_3}      A_2
  I ∪ {e_1,e_2,e_3}       E ∪ {e_4}      A_3
  …
  I ∪ {e_1,…,e_{k-1}}     E ∪ {e_k}      A_{k-1}
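The partitioning scheme can be tried out on a toy optimization problem. The sketch below is not the KPS algorithm: it swaps in a much simpler problem (choose exactly m items of minimum total weight) so that the top solution under inclusion and exclusion constraints is trivial to find. The names `best_under` and `lawler_top_k` are assumptions made for illustration.

```python
import heapq
from itertools import count

def best_under(weights, m, include, exclude):
    """Cheapest m-subset that contains `include` and avoids `exclude`, or None."""
    free = sorted((w, x) for x, w in weights.items()
                  if x not in include and x not in exclude)
    need = m - len(include)
    if need < 0 or need > len(free):
        return None
    chosen = frozenset(include) | {x for _, x in free[:need]}
    return sum(weights[x] for x in chosen), chosen

def lawler_top_k(weights, m, k):
    """Yield up to k m-subsets as (cost, subset), in nondecreasing cost."""
    tie = count()  # tie-breaker so the heap never has to compare sets
    heap = []
    first = best_under(weights, m, frozenset(), frozenset())
    if first is not None:
        heapq.heappush(heap, (first[0], next(tie), first[1], frozenset(), frozenset()))
    for _ in range(k):
        if not heap:
            return
        cost, _, answer, inc, exc = heapq.heappop(heap)
        yield cost, answer
        # Split the rest of this partition: sub-partition i must include
        # e_1..e_{i-1} and must exclude e_i, so the sub-partitions are disjoint.
        new_items = sorted(answer - inc)
        for i, item in enumerate(new_items):
            sub_inc = inc | frozenset(new_items[:i])
            sub_exc = exc | {item}
            best = best_under(weights, m, sub_inc, sub_exc)
            if best is not None:
                heapq.heappush(heap, (best[0], next(tie), best[1], sub_inc, sub_exc))
```

Each iteration pops the best answer among the tops of all current partitions and then divides only the partition it came from, which is the pattern the illustration slides walk through.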
Generating Constraints (intuition)
[Figure: constraint generation illustrated over the items A, B, C, D and E]
Constraints (subtrees/edges) are obtained from existing constraints of the current partition and the top answer
Collapsing Subtrees
Collapsing a Subtree
1. Remove All Edges and Internal Nodes
Only the root is left
2. Remove Incoming Edges of Internal Nodes
3. Add Outgoing Edges to the Root
An edge that emanates from an internal node becomes an outgoing edge of the root
More Details
• When adding an outgoing edge (r,u) to the root, the weight of (r,u) is the minimal weight among all the edges from the collapsed subtree to u
• When restoring a subtree, each outgoing edge (r,u) of the root is replaced with an (arbitrary) original edge from the restored subtree to u, with the same weight
• Incoming edges of internal nodes of the subtree are never restored
  – Such edges cannot participate in G-supertrees
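The collapse and restore steps can be sketched as follows. This is a minimal sketch under stated assumptions: the graph is a dict mapping a directed edge (u, v) to its weight, "internal nodes" is read as all non-root nodes of the subtree (since only the root is left after collapsing), and the names `collapse`, `restore` and `origin` are hypothetical.

```python
def collapse(edges, root, subtree_nodes):
    """Collapse a subtree into its root.

    edges: dict mapping a directed edge (u, v) to its weight.
    subtree_nodes: all nodes of the subtree, including `root`.
    Returns (collapsed_edges, origin); origin maps each outgoing edge
    (root, u) to the minimum-weight original edge it stands for.
    """
    inside = set(subtree_nodes)
    removed = inside - {root}
    collapsed, origin = {}, {}
    for (u, v), w in edges.items():
        if v in removed:
            continue  # subtree edges and incoming edges of non-root nodes
        if u in inside:
            if v == root:
                continue  # would become a self-loop on the root
            key = (root, v)
            # Keep the minimal weight among all edges from the subtree to v.
            if key not in collapsed or w < collapsed[key]:
                collapsed[key] = w
                origin[key] = (u, v)
        else:
            collapsed[(u, v)] = w
    return collapsed, origin

def restore(tree_edges, origin, subtree_edges):
    """Swap each collapsed edge for its original edge and re-add the subtree."""
    return {origin.get(e, e) for e in tree_edges} | set(subtree_edges)
```

In this sketch, `restore` picks the recorded minimum-weight edge as the replacement, which keeps the weight of the restored tree equal to the weight found on the collapsed graph plus the weight of the subtree.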