Finding and Approximating Top-k Answers in Keyword Proximity Search


CIKM 2005

Finding and Approximating Top-k Answers in Keyword Proximity Search

Benny Kimelfeld and Yehoshua Sagiv

The Selim and Rachel Benin School of Engineering and Computer Science

The Hebrew University of Jerusalem

Finding and Approximating Top-k Answers in Keyword Proximity Search, PODS'06

A paradigm for data extraction

Data have varying degrees of structure

– Relational databases, XML, Web sites

Queries are sets of keywords

− No structural constraints

Keyword Proximity Search (KPS)

The Goal:

Extract meaningful parts of data w.r.t. the keywords


Querying Structure & Content by Keywords

Keywords appear in different parts of the data

Answers show occurrences of keywords, as well as the associations among these occurrences

Proximity of the keywords in the answer indicates a close (strong) semantic association among them

[Figure: a search for the keywords "Vardi" and "Databases", with example data fragments built from the labels journal, name, article, title, and author]


Past Work on KPS (Keyword Proximity Search)

• DataSpot (SIGMOD 1998)

• Information Units (WWW 2001)

• BANKS (ICDE 2002, VLDB 2005)

• DISCOVER (VLDB 2002)

• DBXplorer (ICDE 2002)

• XKeyword (ICDE 2003)

• …


The Goal of this Paper

Devise efficient algorithms for finding high-quality answers in keyword proximity search


Contents

Introduction

Formal Setting

The Main Results

Enumerating in the Exact Order

Enumerating in an Approximate Order

Conclusion and Future Work


Formal Setting





Data Graphs

[Figure: an example data graph with structural nodes (company, supplies, supply, product, supplier, president, department, manager, hq) and keyword nodes (papers, A4, coffee, Cohen, Summers, Paris)]

Structural and keyword nodes

Edges may have weights

– Weak relationships are penalized by high weights


Queries

Q = {Summers, Cohen, coffee}

[Figure: the example data graph, repeated for the query Q]

Queries are sets of keywords from the data graph


Query Answers

[Figure: the example data graph, with query answers marked as subtrees]

An answer is a directed subtree of the data graph

Contains all keywords of the query

Has no redundant edges (and nodes)

The keywords of the query are the leaves

The root has two or more children
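The five conditions above can be checked mechanically. The following is a minimal sketch, not taken from the paper: it assumes that an answer is given as a set of (parent, child) edge pairs, and the name `is_answer` and the toy edges (loosely based on the labels of the example graph) are hypothetical.

```python
from collections import defaultdict

def is_answer(answer_edges, graph_edges, keywords):
    """Check the answer conditions for a set of (parent, child) edges."""
    answer_edges, keywords = set(answer_edges), set(keywords)
    if not answer_edges or not answer_edges <= set(graph_edges):
        return False  # must be a nonempty subgraph of the data graph
    children, indegree, nodes = defaultdict(list), defaultdict(int), set()
    for parent, child in answer_edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))
    roots = [n for n in nodes if indegree[n] == 0]
    if len(roots) != 1 or any(indegree[n] > 1 for n in nodes):
        return False  # a directed tree has one root and in-degree 1 elsewhere
    seen, stack = {roots[0]}, [roots[0]]
    while stack:  # every node must be reachable from the root
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if seen != nodes:
        return False
    leaves = {n for n in nodes if not children[n]}
    # The leaves are exactly the keywords (so there are no redundant
    # branches), and the root has two or more children.
    return leaves == keywords and len(children[roots[0]]) >= 2

# Hypothetical toy graph; the real example graph is not recoverable here.
graph = {("hq", "company"), ("company", "president"), ("president", "Cohen"),
         ("company", "department"), ("department", "Summers"),
         ("company", "coffee")}
query = {"Summers", "Cohen", "coffee"}
good = graph - {("hq", "company")}
print(is_answer(good, graph, query))   # True
print(is_answer(graph, graph, query))  # False: the root "hq" has one child
```

The second call fails only because of the last condition: the edge from "hq" adds nothing to the association among the keywords.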


Ranking: Inversely Proportional to Weight

[Figure: example answers for the keywords "Vardi" and "databases", drawn as edge-weighted subtrees of a dblp data graph with article, title, cite, and references nodes]

rank(A) = (weight(A))^(-1)

Smaller subtrees represent closer associations
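As a small illustration of this ranking, the sketch below sums edge weights and inverts the total. It assumes that the weight of an answer is the sum of its edge weights; the edge identifiers and the weights are made up for the example.

```python
def weight(answer_edges, edge_weight):
    """Total weight of an answer, assumed to be the sum of its edge weights."""
    return sum(edge_weight[e] for e in answer_edges)

def rank(answer_edges, edge_weight):
    """rank(A) = 1 / weight(A): lighter answers rank higher."""
    return 1.0 / weight(answer_edges, edge_weight)

# Hypothetical edge identifiers and weights.
edge_weight = {"e1": 1, "e2": 1, "e3": 1.5, "e4": 1}
small = ["e1", "e2"]        # weight 2
large = ["e1", "e3", "e4"]  # weight 3.5
ordered = sorted([large, small], key=lambda a: rank(a, edge_weight), reverse=True)
print(ordered[0])  # ['e1', 'e2']
```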


Enumerating in Exact (Ranked) Order

[Figure: a sequence of answers in the order they are generated. If one answer is generated before another, then its weight is ≤ the weight of the other. The first k answers are the Top-k Answers.]


Enumerating in a C-Approximate Order

[Figure: a sequence of answers in the order they are generated. If one answer is generated before another, then its weight is ≤ C times the weight of the other.]

C-Approximation of the Top-k Answers (Fagin et al., PODS’01)

C may be a function of G and Q
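The order condition can be stated as a short check over the weights of the answers in the order they were generated. This is an illustrative sketch only; the function name and the sample weights are hypothetical, and C = 1 recovers the exact ranked order.

```python
def is_c_approximate_order(weights, c=1.0):
    """True if every answer weighs at most c times any answer generated later."""
    heaviest_so_far = float("-inf")
    for w in weights:
        # It suffices to compare each answer with the heaviest earlier one.
        if heaviest_so_far > c * w:
            return False
        heaviest_so_far = max(heaviest_so_far, w)
    return True

print(is_c_approximate_order([1, 2, 2, 5]))     # True: exact order
print(is_c_approximate_order([2, 1, 3]))        # False: 2 precedes 1
print(is_c_approximate_order([2, 1, 3], c=2))   # True: 2 <= 2 * 1
```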


Polynomial Delay

Yardstick of efficiency: polynomial delay

[Figure: answers generated one after another]

Polynomial time between generating successive answers

Exponentially many answers even for 2 keywords (it is inefficient to generate all answers and then sort)


The Main Results





Top Answers are Steiner Trees

• Finding the top answer in KPS (a.k.a. the Steiner-tree problem) is intractable

– Therefore, one cannot enumerate all answers in ranked order with polynomial delay

• However, the top answer can be found efficiently under data complexity

– That is, the number of keywords is fixed

• Approximations can be found efficiently under query-and-data complexity

– There is a lot of work on Steiner-tree approximations
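To see why a fixed number of keywords helps, consider the classic Dreyfus-Wagner style dynamic program over subsets of the keywords, written here for directed graphs. This is a standard textbook technique and not necessarily the algorithm used in the paper. The sketch computes only the weight of a minimum directed Steiner tree, ignores the extra answer conditions (keywords as leaves, root with two or more children), and assumes nonnegative weights and mutually comparable node labels. The function name and the toy graph are hypothetical.

```python
import heapq

def min_steiner_weight(nodes, edges, terminals):
    """Minimum weight of a directed tree whose root reaches all terminals.

    edges: dict {(u, v): w} of directed edges with nonnegative weights.
    Runs in roughly O(3^k * n + 2^k * m log n) time for k terminals.
    """
    incoming = {v: [] for v in nodes}
    for (u, v), w in edges.items():
        incoming[v].append((u, w))
    full = (1 << len(terminals)) - 1
    inf = float("inf")
    # dp[mask][v]: cheapest tree rooted at v that reaches the terminals in mask
    dp = [{v: inf for v in nodes} for _ in range(full + 1)]
    for i, t in enumerate(terminals):
        dp[1 << i][t] = 0.0
    for mask in range(1, full + 1):
        best = dp[mask]
        # Merge two smaller trees that share the root v.
        sub = (mask - 1) & mask
        while sub:
            for v in nodes:
                cand = dp[sub][v] + dp[mask ^ sub][v]
                if cand < best[v]:
                    best[v] = cand
            sub = (sub - 1) & mask
        # Extend a tree rooted at v upward along an edge (u, v),
        # using Dijkstra's algorithm on the reversed edges.
        heap = [(d, v) for v, d in best.items() if d < inf]
        heapq.heapify(heap)
        while heap:
            d, v = heapq.heappop(heap)
            if d > best[v]:
                continue
            for u, w in incoming[v]:
                if d + w < best[u]:
                    best[u] = d + w
                    heapq.heappush(heap, (d + w, u))
    return min(dp[full].values())

# Hypothetical toy graph with keywords x, y, z.
nodes = ["r", "a", "x", "y", "z"]
edges = {("a", "x"): 1, ("a", "y"): 1, ("r", "a"): 1, ("r", "z"): 1, ("a", "z"): 5}
print(min_steiner_weight(nodes, edges, ["x", "y", "z"]))  # 4.0
```

The number of table entries is exponential in the number of keywords but polynomial in the size of the graph, which matches the distinction between query-and-data complexity and data complexity.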


So What Can Be Done?

Can answers of KPS be enumerated in the exact order with polynomial delay, under data complexity?

Can approximations of Steiner trees be used for efficiently enumerating in an approximate order (while preserving the approximation ratio)?


Our Results

Theorem 1: Under data complexity, answers of KPS can be enumerated in the exact order with polynomial delay


Our Results (cont’d)

Theorem 2: Under query-and-data complexity, given an efficient C-approximation for finding Steiner trees, one can enumerate with polynomial delay in a (C+1)-approximate order


The Meaning of the Results

KPS is tractable under data complexity

Under query-and-data complexity, an efficient enumeration in an approximate order can be done with almost the same ratios as Steiner trees

All results on Steiner trees can be applied to KPS

Existing approaches to KPS are heuristics

– Exponential delay in the worst case

– No provable nontrivial approximation ratios

From a theoretical point of view, using heuristics is not the only option


Enumerating in the Exact Order




Lawler’s Method

• We use the technique of Lawler (1972), which is an iterative method for finding the top-k answers

• Each iteration generates the next answer by finding the top answer under constraints

• Lawler’s method is designed for general (discrete) optimization problems

• When applying it to a specific problem, one needs to deal with the following two issues


Two Problems to Solve

1. What exactly are the constraints? (That is, how can we apply Lawler’s method so that the constraints make it possible to find top answers efficiently?)

2. How can we efficiently find the top answer under constraints?


Solving the First Problem

Constraints are subtrees of the graph

• Pairwise node disjoint

• Their leaves are exactly the keywords of the query

An answer satisfies the constraints if it contains all the subtrees (i.e., it is a supertree)

[Figure: example constraint subtrees over the nodes A, B, C, E, F, G]


Two Problems to Solve (One Left)

1. What exactly are the constraints? (That is, how can we apply Lawler’s method in a way that the constraints enable finding the top answer efficiently?)

2. How can we efficiently find the top answer under constraints?


Formulation of the Second Problem

Input: constraints (node-disjoint subtrees, keywords as leaves)

Objective: a minimal answer satisfying the constraints (i.e., containing all the subtrees)

Next, an algorithm that solves “almost” this problem, namely:

(Almost the same) Objective: a minimal supertree satisfying the constraints


Finding a Minimal Supertree

Input: G, 𝒯 (constraints, i.e., subtrees)

1. Collapse each of the subtrees of 𝒯 into a node

2. Find a Steiner tree T of the collapsed subtrees

3. Restore the collapsed subtrees in T

(more details in the proceedings…)


This is not Enough!

Input: constraints (node-disjoint subtrees, keywords as leaves)

Objective: a minimal answer satisfying the constraints (i.e., containing all the subtrees)

(Almost the same) Objective: a minimal supertree satisfying the constraints

Not the same!


Query Answers Revisited

[Figure: the example data graph, with query answers marked as subtrees]

An answer is a directed subtree of the data graph

Contains all keywords of the query

Has no redundant edges (and nodes)

Keywords are the leaves

The root has two or more children


An Example

[Figure: a small graph over the nodes A, B, C, D, showing the minimal supertree satisfying the constraints next to the minimal answer satisfying the constraints]

This edge is redundant! But it cannot be removed, since it is a constraint!

The minimal answer can be completely different from the minimal supertree

Furthermore, there can be no answer even if there is a supertree


What if We Remove Edges of Constraints?

• What if we first generate a minimal supertree and if the root has only one child, then we just remove it (until an answer is obtained)?

• The constraints are violated, leading to a failure of Lawler’s method!

• That is,

– Some answers will be duplicated

– While other answers will not be generated at all


Our Approach

[Figure: Constraints → Transform → New constraints → Min. Supertree → Answer, illustrated on a graph with the nodes A to H]

The root of this subtree has more than one child and it must be the root of the answer


This Process is Repeated

[Figure: starting from the same constraints, several different Transform steps are applied, each followed by a Min. Supertree computation]

Up to 2^(#keywords) times (fixed & usually fewer)

The best is the final answer


About the Transformation

• The details of the exact transformation and the proof of correctness are intricate

• All can be found in the proceedings…

This concludes the algorithm for enumerating in the exact order


A Different View: Chain of Reductions

Enumerating answers in ranked order

  ↓ Adapting Lawler’s method

Finding the top answer under constraints

  ↓ Transformation of constraints

Finding minimal supertrees

  ↓ Collapse and restore

Finding Steiner trees


Enumerating in an Approximate Order



Modifying the Chain of Reductions

Enumeration in an approximate order

  ↓

Finding approximate answers under constraints

  ↓

Finding approximations of minimal supertrees

  ↓

Finding approximations of Steiner trees

Step labels, in the order they appear on the slide: Similar, Similar, Completely different!


Exact Order Revisited

[Figure: the repeated Transform and Min. Supertree steps of the exact-order algorithm]

Up to 2^(#keywords) repetitions: we cannot allow it under query-and-data complexity!


The Algorithm

[Figure: two subtrees computed from the constraints, over the nodes A to H]

• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum

• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum


Combine the Subtrees

[Figure: the two subtrees are merged into one subgraph]

• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum

• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum

The combined subgraph contains an answer: ≤ (C+1) times the optimum
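The (C+1) bound is the sum of the two bounds above. As a sketch under stated assumptions: write OPT for the weight of the minimal answer satisfying the constraints, S for the approximate supertree, A' for the minimal answer for 3 or fewer constraints, and A for the answer contained in the combined subgraph (these symbols are introduced here for illustration and do not appear on the slides), and assume nonnegative edge weights. Then:

```latex
\begin{align*}
w(A) &\le w(S \cup A')
      && \text{$A$ uses only edges of the combined subgraph} \\
     &\le w(S) + w(A')
      && \text{shared edges are counted at most twice} \\
     &\le C \cdot \mathrm{OPT} + 1 \cdot \mathrm{OPT}
      && \text{the two bounds stated above} \\
     &= (C+1) \cdot \mathrm{OPT}.
\end{align*}
```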


Conclusion and Future Work


Keyword Proximity Search

• A common paradigm for keyword search over structured databases

• In the formal model:

– Data are directed and weighted graphs

– Queries are sets of keywords (i.e., nodes) from the data graph

– Query answers are non-redundant subtrees containing the keywords of the query

• The goal is to find the top-k answers, where the rank is inversely proportional to the weight

• A stronger goal: enumeration with polynomial delay


Our Results

• Under data complexity, answers can be enumerated in the exact ranked order with polynomial delay

• Under query-and-data complexity, every efficient C-approximation to the Steiner-tree problem yields an algorithm for enumerating answers with polynomial delay in a (C+1)-approximate order


Our Chain of Reductions

Enumerating answers in sorted order

  ↓ Lawler’s approach

Finding the top answer under constraints

  ↓ The intricate part …

Finding minimal supertrees

  ↓ Subtree Collapse/Restore

Finding Steiner trees


Other Variants of KPS

Our algorithms can be adapted to other popular variants of KPS


Undirected Variant

[Figure: the example data graph, with an answer for the undirected variant]

Answers are undirected trees


Strong Variant

[Figure: the example data graph, with an answer for the strong variant]

Answers are undirected trees and keywords are leaves


Open Problems

• Can we improve the space efficiency of our algorithms?

• Some ranking functions (e.g., height) are easier than weight when looking for the top answer (no constraints), but

– The chain of reductions doesn’t work

– The complexity of finding the top answer under constraints is unknown

• Can our results hold for richer queries that also have structural constraints?


Implementation Considerations

• Bottlenecks: Steiner-tree algorithms and approximations

• Thin graphs allow in-memory execution of our algorithms, even for large XML documents (e.g., DBLP)

• New and intuitive ranking functions that are easier to implement efficiently


Related Work: Order vs. Efficiency

[Figure: a scale of enumeration orders, with "More Desirable" toward one end and "More Efficient" toward the other: Exact Order, Approximate Order, Heuristic Order (no approx. guaranteed), No Order. Markers: "This work", "Past work", and the note "(Queries have a fixed size)".]


Thank you.

Questions?


Illustration of Lawler’s Method

Lawler’s Method (1972)

1. Find the Top Answer

In principle, at this point we should find the second-best answer

But Instead…

2. Partition the Remaining Answers

Each partition is defined by a distinct set of constraints

3. Find the Top of each Set

4. Find the Second Answer

The second answer is the best among all the top answers in the partitions

5. Further Divide the Chosen Partition

And so on…


Adapting Lawler’s Method


Our Constraints

Inclusion constraints

• Node-disjoint subtrees of the data graph

• All the leaves are keywords

• An answer must contain all the subtrees

Exclusion constraints

• Edges of the data graph

• An answer must not contain any of the edges

[Figure: examples of inclusion and exclusion constraints over the nodes A, B, C, D]


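Partitioning a Partition (cont)

The current partition has inclusion constraints I, exclusion constraints E, and top answer A, where edges(A) \ I = {e1,…,ek}. It is split into k partitions:

Inclusion             Exclusion      Top answer
I                     E ∪ {e1}       A0
I ∪ {e1}              E ∪ {e2}       A1
I ∪ {e1,e2}           E ∪ {e3}       A2
I ∪ {e1,e2,e3}        E ∪ {e4}       A3
…
I ∪ {e1,…,ek-1}       E ∪ {ek}       Ak-1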
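The partitioning scheme can be tried out on a toy optimization problem. The sketch below is an illustration of Lawler's method, not the KPS algorithm: the "answers" are m-element subsets of weighted items (a stand-in problem chosen because its constrained optimum is trivial to compute), and the constraints are plain inclusion and exclusion sets of items rather than subtrees and edges. All names are hypothetical.

```python
import heapq
from itertools import count

def best_under_constraints(weights, m, include, exclude):
    """Cheapest m-subset containing `include` and avoiding `exclude`, or None."""
    free = sorted((w, x) for x, w in weights.items()
                  if x not in include and x not in exclude)
    need = m - len(include)
    if need < 0 or need > len(free):
        return None  # the partition is empty
    chosen = frozenset(include) | {x for _, x in free[:need]}
    return sum(weights[x] for x in chosen), chosen

def lawler_top_k(weights, m, k):
    """Return the k cheapest m-subsets in nondecreasing order of weight."""
    tie = count()  # tie-breaker so the heap never compares sets
    heap = []

    def push(include, exclude):
        best = best_under_constraints(weights, m, include, exclude)
        if best is not None:
            heapq.heappush(heap, (best[0], next(tie), best[1], include, exclude))

    push(frozenset(), frozenset())
    results = []
    while heap and len(results) < k:
        w, _, answer, include, exclude = heapq.heappop(heap)
        results.append((w, sorted(answer)))
        # Split the rest of this partition: the i-th new partition keeps
        # e1..e(i-1) of the answer and forbids ei.
        fixed = include
        for e in sorted(answer - include):
            push(fixed, exclude | {e})
            fixed = fixed | {e}
    return results

weights = {"a": 1, "b": 2, "c": 4, "d": 8}
print(lawler_top_k(weights, m=2, k=3))
# [(3, ['a', 'b']), (5, ['a', 'c']), (6, ['b', 'c'])]
```

The new partitions are pairwise disjoint and together cover every answer of the old partition except the one just emitted, which is why no answer is duplicated or lost.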


Generating Constraints (intuition)

[Figure: rows of partitions over the items A, B, C, D, E]

Constraints (subtrees/edges) are obtained from existing constraints of the current partition and the top answer


Collapsing Subtrees

Collapsing a Subtree

1. Remove All Edges and Internal Nodes

Only the root is left

2. Remove Incoming Edges of Internal Nodes

3. Add Outgoing Edges to the Root

An edge that emanates from an internal node becomes an outgoing edge of the root


More Details

• When adding an outgoing edge (r,u) to the root, the weight of (r,u) is the minimal weight among all the edges from the collapsed subtree to u

• When restoring a subtree, each outgoing edge (r,u) of the root is replaced with an (arbitrary) original edge from the restored subtree to u, with the same weight

• Incoming edges of internal nodes of the subtree are never restored

– Such edges cannot participate in G-supertrees
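The collapse step can be sketched on a graph stored as a dictionary of weighted edges. This is an illustrative reading of the slides, not the paper's exact procedure: it assumes that every non-root node of the subtree is removed ("only the root is left"), and the function name, the `origin` map used for restoring, and the toy graph are hypothetical.

```python
def collapse_subtree(edges, root, subtree_nodes):
    """Collapse a subtree into its root.

    edges: dict {(u, v): w}. Returns (collapsed, origin), where origin maps
    each edge of the collapsed graph to the original edge it stands for.
    """
    members = set(subtree_nodes)
    removed = members - {root}
    collapsed, origin = {}, {}
    for (u, v), w in edges.items():
        if v in removed:
            continue  # edges inside the subtree and edges into removed nodes
        if u in members:
            if v == root:
                continue  # would become a self-loop on the root
            key = (root, v)  # outgoing edge is reattached to the root
        else:
            key = (u, v)  # untouched edge (including edges into the root)
        # Among parallel edges to the same target, keep the minimal weight.
        if key not in collapsed or w < collapsed[key]:
            collapsed[key] = w
            origin[key] = (u, v)
    return collapsed, origin

# Hypothetical toy graph; the subtree is r -> a -> k1.
edges = {("r", "a"): 1, ("a", "k1"): 1, ("a", "x"): 2, ("r", "x"): 5,
         ("y", "a"): 1, ("y", "r"): 3, ("x", "k2"): 1}
collapsed, origin = collapse_subtree(edges, "r", {"r", "a", "k1"})
print(collapsed)           # {('r', 'x'): 2, ('y', 'r'): 3, ('x', 'k2'): 1}
print(origin[("r", "x")])  # ('a', 'x')
```

When a subtree is restored, each edge (r,u) of the root can be replaced with `origin[(r, u)]`, which has the same weight by construction.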
