Keyword Proximity Search on Graphs
M.Sc. Systems Course
The Hebrew University of Jerusalem, Winter 2006
Keyword Proximity Search on Graphs MSSYS 2006
A rapidly evolving paradigm for data extraction
Data have varying degrees of structure
Queries are sets of keywords
No structural constraints
Keyword Proximity Search
Relational Databases
Web Sites
XML Documents
The Goal:
Extract meaningful parts of data w.r.t. the keywords
Recent Work on KPS (Keyword Proximity Search)
• DataSpot (SIGMOD 1998)
• Information Units (WWW 2001)
• BANKS (ICDE 2002, VLDB 2005)
• DISCOVER (VLDB 2002)
• DBXplorer (ICDE 2002)
• XKeyword (ICDE 2003)
• ……
Systems for KPS on Relational Data
BANKS, DISCOVER and DBXplorer implemented KPS (Keyword Proximity Search) on relational databases
– Different algorithms are used
– Slight differences in semantics
G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431–440, 2002.
V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, pages 670–681, 2002.
S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: enabling keyword search over relational databases. In SIGMOD Conference, page 627, 2002.
Example: KPS on RDB
search Belgium , Brussels

Cities
ID    Name         Population
22    Amsterdam    1101407
73    Brussels     951580

Organizations
ID    Name    Head Q.
135   EU      73
175   ESA     81

Memberships
Country    Org.
B          135
NL         135

Countries
Code    Name           Area     Capital
NL      Netherlands    37330    22
B       Belgium        30510    73
Brussels is the capital city of Belgium
Result tuples:
Countries: B, Belgium, 30510, 73
Cities: 73, Brussels, 951580
Brussels hosts EU and Belgium is a member
Result tuples:
Countries: B, Belgium, 30510, 73
Memberships: B, 135
Organizations: 135, EU, 73
Cities: 73, Brussels, 951580
XKeyword: KPS on XML
XKeyword implemented KPS on XML
Architecture is based on that of DISCOVER
A demo over DBLP is available
• http://kebab.ucsd.edu:81/xkeyword
V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE, pages 367–378, 2003.
Example: KPS on XML
search Yannakakis , Approximation

dblp
  article
    author: Mihalis Yannakakis
    title: On the Approximation of Maximum Satisfiability
  article
    title: Improved Approximation Algorithms for MAX SAT
    author: Takao Asano
    author: David P. Williamson
    references
      cite (reference to the first article)
Yannakakis wrote a paper about Approximation
Yannakakis is cited by a paper about Approximation
KPS on Web Sites (Information Units)
• KPS can also be used for retrieving information from Web sites
• For a given query, results are collections of Web pages from the site
– Pages are relevant w.r.t. the keywords
– Pages are connected by hyperlinks
Wen-Syan Li, K. Selçuk Candan, Quoc Vu, and Divyakant Agrawal. Retrieving and organizing web pages by “information unit”. In WWW, pages 230-244, 2001.
Example: KPS in Web Sites
http://www.goisrael.com/
search Hilton , Beach
[Figure: a result made of the linked pages Eilat, Eilat Beaches and Hilton Eilat Queen of Sheba]
A Formal Framework for KPS
Data Graphs
[Figure: a data graph with structural nodes (company, supplies, supply, product, supplier, president, department, manager, hq) and keyword nodes (papers, A4, coffee, Cohen, Summers, Paris)]
Data graphs have two types of nodes:
– Structural nodes
– Keywords
Queries
K = { Summers , Cohen , coffee }
Queries are sets of keywords from the data graph
Query Results
Query results are subtrees of the data graph that
– Contain all keywords in the query
– Have no redundant edges
That is, a result is a subtree that is reduced w.r.t. the keywords
Three Variants
Three variants of keyword proximity search are considered:
Rooted proximity
Undirected proximity
Strong proximity
Rooted Variant
Used in BANKS
Results are rooted trees
Undirected Variant
Used in Interconnection Semantics for XML
Results are undirected trees
Strong Variant
Used in XKeyword, Information Units, DBXplorer and DISCOVER
Results are undirected trees and keywords are leaves
Problem Definition
Input:
– Data: a data graph G
– Query: a set K of keywords in G
Output:
– Query results: subtrees of G that are reduced w.r.t. K (Rooted/Undirected/Strong)
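The notion of a reduced subtree can be made concrete with a small check. The sketch below takes the undirected reading only: it assumes the input edges already form a tree and tests that every keyword is present and that every leaf is a keyword (so no edge can be dropped). The function name and the edge-set representation are illustrative choices, not part of the slides.

```python
from collections import defaultdict

def is_reduced_undirected(edges, keywords):
    """Check that an undirected tree, given as a set of (u, v) edges,
    contains every keyword and that each of its leaves is a keyword.
    Assumption: `edges` really forms a tree; this is not verified here."""
    keywords = set(keywords)
    if not edges:
        # A tree without edges is a single node; it can only answer
        # a single-keyword query.
        return len(keywords) == 1
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    if not keywords <= set(degree):
        return False                      # some keyword is missing
    leaves = {node for node, d in degree.items() if d == 1}
    return leaves <= keywords             # a non-keyword leaf is a redundant edge

tree = {("company", "president"), ("president", "Cohen"),
        ("company", "department"), ("department", "Summers")}
print(is_reduced_undirected(tree, {"Cohen", "Summers"}))
print(is_reduced_undirected(tree | {("department", "manager")}, {"Cohen", "Summers"}))
```

The second call adds a dangling edge to manager, which is exactly the kind of redundant edge that a reduced subtree must not have.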
Creating Data Graphs from Relational Databases
Nodes are tuples
Edges are foreign-key references
Edges from each tuple node to all the keywords in that tuple
Example: the tuple node (B, Belgium, 30510, 73) has edges to the keyword nodes B, Belgium, 30510 and 73
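A minimal sketch of this construction, under stated assumptions: tables are given as lists of dictionaries, a tuple node is identified by (table name, row index), every attribute value is treated as one keyword, and a foreign-key edge points from the referencing tuple to the referenced one. These representation choices and the function name are illustrative, not taken from the slides.

```python
def relational_to_data_graph(tables, foreign_keys):
    """tables: {table_name: [row_dict, ...]}
    foreign_keys: [(table, column, referenced_table, referenced_column), ...]
    Returns a set of directed edges (source, target)."""
    edges = set()
    # Edges from each tuple node to all the keywords in that tuple.
    for table, rows in tables.items():
        for i, row in enumerate(rows):
            for value in row.values():
                edges.add(((table, i), str(value)))
    # Edges for foreign-key references between tuple nodes.
    for table, column, ref_table, ref_column in foreign_keys:
        for i, row in enumerate(tables[table]):
            for j, ref_row in enumerate(tables[ref_table]):
                if row[column] == ref_row[ref_column]:
                    edges.add(((table, i), (ref_table, j)))
    return edges

tables = {
    "Countries": [{"Code": "B", "Name": "Belgium", "Area": 30510, "Capital": 73}],
    "Cities": [{"ID": 73, "Name": "Brussels", "Population": 951580}],
}
graph = relational_to_data_graph(tables, [("Countries", "Capital", "Cities", "ID")])
for edge in sorted(graph, key=str):
    print(edge)
```

Keyword nodes are plain strings and tuple nodes are pairs, so the two kinds of nodes cannot collide in this representation.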
Creating Data Graphs from XML
Nodes are XML elements
Edges represent nesting of elements …
… and ID references
Keywords appear in PCDATA
[Figure: the dblp example as a data graph, with two article elements, their title and author children, and a cite reference between the articles]
All Occurrences of a Keyword are Represented by One Node
[Figure: in the dblp example, both title elements point to the same keyword node Approximation]
A keyword is represented by a single node
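The XML construction can be sketched with the standard library parser. In this sketch, each element becomes a structural node identified by (tag, counter), nesting gives a directed edge from parent to child, and each whitespace-separated word of the element text becomes a keyword node shared by all of its occurrences. ID references are left out for brevity, and lower-casing the keywords is an assumption of this example.

```python
import itertools
import xml.etree.ElementTree as ET

def xml_to_data_graph(xml_text):
    """Return a set of directed edges (source, target) for an XML document."""
    counter = itertools.count()
    edges = set()

    def visit(elem):
        node = (elem.tag, next(counter))        # one structural node per element
        for word in (elem.text or "").split():
            edges.add((node, word.lower()))     # one keyword node per distinct word
        for child in elem:
            edges.add((node, visit(child)))     # nesting edge
        return node

    visit(ET.fromstring(xml_text))
    return edges

doc = """<dblp>
  <article><title>On the Approximation of Maximum Satisfiability</title></article>
  <article><title>Improved Approximation Algorithms for MAX SAT</title></article>
</dblp>"""
graph = xml_to_data_graph(doc)
print(sorted(src for src, dst in graph if dst == "approximation"))
```

Both title elements end up with an edge to the single keyword node approximation.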
Creating Data Graphs from Web Sites
Nodes are Web pages …
Keywords appear in these pages …
Edges are hyperlinks/XLinks
http://www.goisrael.com/
A keyword is represented by a single node
Ranking and Enumeration Order
Ranking Results
[Figure: three results for the keywords Yannakakis and Approximation, labeled with their ranks 1, 2 and 3]
Ranking of results is determined by size
Edges Have Weights
[Figure: the same three results with weighted edges and their ranks 1, 2 and 3]
Edges incident to dblp have a large weight (2 in the figure)
Edges from cite to article have a medium weight (1.5 in the figure)
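Ranking by weight can be sketched as summing the weights of a result's edges and sorting. The node names, the specific weights (2 for edges at dblp, 1.5 for the cite edge, 1 otherwise) and the default of 1 are assumptions of this example.

```python
def tree_weight(edges, weight, default=1.0):
    """Weight of a result = sum of the weights of its edges."""
    return sum(weight.get(edge, default) for edge in edges)

def rank_results(results, weight):
    """Return the results ordered from lightest to heaviest."""
    return sorted(results, key=lambda edges: tree_weight(edges, weight))

weight = {
    ("dblp", "article1"): 2.0,       # edges incident to dblp: large weight
    ("dblp", "article2"): 2.0,
    ("article2", "article1"): 1.5,   # citation edge: medium weight
}
same_article = [("article1", "author"), ("article1", "title")]
via_cite = [("article2", "title2"), ("article2", "article1"), ("article1", "author")]
via_dblp = [("dblp", "article1"), ("dblp", "article2"),
            ("article1", "author"), ("article2", "title2")]

for result in rank_results([via_dblp, via_cite, same_article], weight):
    print(tree_weight(result, weight), result)
```

With unit weights this reduces to ranking by size, as on the previous slide.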
Order of Results
Arbitrary Order
Exact Order: $i \le j \Rightarrow \mathrm{weight}(R_i) \le \mathrm{weight}(R_j)$
Heuristic Order
C-Approximate Order: $i \le j \Rightarrow \mathrm{weight}(R_i) \le C \cdot \mathrm{weight}(R_j)$
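One way to read the C-approximate condition is that no result may be heavier than C times any result that comes after it. Under that reading (an assumption of this sketch), an enumeration can be checked in one pass by tracking the largest weight seen so far; the exact order is the special case C = 1.

```python
def is_c_approximate_order(weights, c=1.0):
    """True if every earlier weight is at most c times every later weight."""
    largest_so_far = float("-inf")
    for w in weights:
        if largest_so_far > c * w:
            return False                 # an earlier result is too heavy
        largest_so_far = max(largest_so_far, w)
    return True

print(is_c_approximate_order([2.0, 3.5, 6.0]))          # exact order
print(is_c_approximate_order([3.5, 2.0, 6.0]))          # not exact
print(is_c_approximate_order([3.5, 2.0, 6.0], c=2.0))   # but 2-approximate
```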
Measuring the Efficiency of Enumerations
Polynomial Runtime is not Appropriate for KPS
• In the theory of CS, the usual notion of efficiency is polynomial running time
– That is, the algorithm terminates in time that is polynomial in the size of the input
• However, in KPS the number of results can be exponential in the size of the input
– Algorithms cannot be expected to terminate in polynomial time
– Even for two keywords
• Therefore, other notions are required
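A small illustration of the blow-up, under the assumption that the two endpoints of a chain of n "diamonds" play the role of the two keywords: the graph has only 3n + 1 nodes, yet the number of distinct paths connecting the endpoints is 2^n. The graph family and the function names are chosen for this example only.

```python
def diamond_chain(n):
    """Directed graph s0 -> {a0, b0} -> s1 -> {a1, b1} -> ... -> sn."""
    children = {}
    for i in range(n):
        children[f"s{i}"] = [f"a{i}", f"b{i}"]
        children[f"a{i}"] = [f"s{i + 1}"]
        children[f"b{i}"] = [f"s{i + 1}"]
    children[f"s{n}"] = []
    return children

def count_paths(children, source, target):
    """Count directed paths from source to target in an acyclic graph."""
    if source == target:
        return 1
    return sum(count_paths(children, child, target) for child in children[source])

n = 10
graph = diamond_chain(n)
print(len(graph), count_paths(graph, "s0", f"s{n}"))
```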
Time Efficiency
Polynomial Total Time
– Polynomial runtime in the combined size of the input and the output
Polynomial Delay
– The runtime between two successive results is polynomial in the size of the input
About Polynomial Delay
• With polynomial delay you can:
– Generate the first few results quickly
– Efficiently return results in pages
• In most cases of keyword search, this is the suitable notion of efficiency
• Goal: develop algorithms that enumerate KPS results with polynomial delay
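Returning results in pages can be sketched with a lazy stream. Here `enumerate_results` is only a stand-in for a polynomial-delay enumerator (it is hypothetical and yields dummy strings); the point is that a page is produced after computing only as many results as the page holds.

```python
from itertools import islice

def enumerate_results():
    """Stand-in for a polynomial-delay enumerator: yields one result at a time."""
    n = 0
    while True:
        yield f"result-{n}"
        n += 1

def pages(results, page_size):
    """Group a (possibly huge) stream of results into pages, lazily."""
    stream = iter(results)
    while True:
        page = list(islice(stream, page_size))
        if not page:
            return
        yield page

paged = pages(enumerate_results(), 3)
print(next(paged))   # first page, computed without touching later results
print(next(paged))   # second page, computed on demand
```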
Space Efficiency
Polynomial Space
Linearly-Incremental Space
– i results require i times polynomial space in the input
Data and Query-and-Data Complexity
• Under query-and-data complexity, we assume that both the query and the data are of unbounded size
– Many problems in database theory, e.g., computing joins of relational tables, are intractable under this measure
• In practice, however, queries are very small compared to the data
• Under data complexity, the size of the query is assumed to be fixed
Enumerating Results of KS with Polynomial Delay
Keyword Search with Polynomial Delay
• The following algorithm enumerates reduced subtrees (i.e., results of keyword search) with polynomial delay
– Results are not ranked
• A different version of the algorithm for each of the three variants:
– rooted
– undirected
– strong
Importance of the Algorithm
• An upper bound for ranked keyword search: results can be enumerated in ranked order in polynomial total time
– Generate all the results and then sort them
• In some cases, ranking is not required
• A basis for developing efficient heuristics that enumerate in an “almost” ranked order (discussed later)
The Algorithm for Enumerating Rooted Reduced Subtrees
Overview
• The algorithm uses two reductions
• Each reduction alone either does not solve the problem or runs in exponential total time
• However, the two reductions can be combined together to enumerate reduced subtrees with polynomial delay
Data Reduction
1. Choose an arbitrary node v in K
2. For each parent p of v do:
I. In K: replace v with p
II. In G: remove v
III. Generate all results for the new input
IV. Add p→v to each result of the new input
[Figure: illustration of data reduction, showing G and K before and after v is replaced by its parent p]
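The four steps of data reduction can be sketched as a recursive generator. This is data reduction on its own, not the full algorithm. Assumptions of the sketch: the graph is given as a map from each node to its set of parents, a tree is a frozenset of (parent, child) edges, the "arbitrary" node v is taken to be the smallest keyword so that runs are repeatable, and a single remaining keyword is the base case (a one-node tree).

```python
def data_reduction(parents, keywords):
    """Yield edge sets of rooted trees found by data reduction alone.
    parents: {node: set of parent nodes}, i.e. edges go parent -> node."""
    keywords = frozenset(keywords)
    if len(keywords) == 1:
        yield frozenset()                   # single-node tree, no edges
        return
    v = min(keywords)                       # step 1: choose a node v in K
    # Step 2.II: remove v from G.
    reduced = {n: ps - {v} for n, ps in parents.items() if n != v}
    for p in sorted(parents.get(v, ())):    # step 2: each parent p of v
        new_keywords = (keywords - {v}) | {p}          # step 2.I
        for tree in data_reduction(reduced, new_keywords):   # step 2.III
            yield tree | {(p, v)}           # step 2.IV: add the edge p -> v

parents = {"r": set(), "a": {"r"}, "b": {"r"}}
print(list(data_reduction(parents, {"a", "b"})))
```

Because the sketch always treats v as a leaf, it can miss results: with parents = {"top": set(), "x": {"top"}} and K = {"top", "x"}, it picks v = "top", finds no parent, and yields nothing, although the tree top → x contains both nodes.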
Example Showing Failure
[Figure: a data graph over the nodes A, B and C, and a query K]
Four results! Two with each of the two possible roots
Failure Example
[Figure: data reduction applied step by step to the example; the graph and K shrink at each step]
Only one result!
Three others are missing!
Why Data Reduction Fails
• We assumed that v is a leaf in every result
• It does not hold for structural nodes in recursive steps!
• Therefore, some results are not found!
• Solution(?): Repeat data reduction for every v in K
– Exponential total time in the worst case!
Query Reduction
1. Remove one keyword from the query
2. Find all results for the smaller query
3. Extend each result to include the missing keyword, in every possible way
[Figure: query reduction for K = {A,B,C}; results for a smaller query are extended to include the removed keyword]
Extending Partial Results
• In query reduction, we need to extend a result T of the query K\{k} to all results of the query K
• This is done as follows:
– For all nodes v of T:
  • Remove from G all nodes of T, except for v
  • Find all simple directed paths P from v to k and print the concatenation of T and P
– If v is the root of T, we also need to concatenate T with all subtrees that are reduced w.r.t. v and k
• More details can be found in the paper
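The path-extension step relies on enumerating simple directed paths from v to k while avoiding the other nodes of T. A minimal depth-first sketch follows; the adjacency-map representation and the `removed` parameter (standing for the nodes of T other than v) are assumptions of this example, and no claim is made that this is how the paper implements the step.

```python
def simple_paths(children, source, target, removed=frozenset()):
    """Yield every simple directed path from source to target as a list of
    nodes, never entering a node listed in `removed`."""
    def extend(path, on_path):
        node = path[-1]
        if node == target:
            yield list(path)                # copy, since path is mutated below
            return
        for child in children.get(node, ()):
            if child not in on_path and child not in removed:
                path.append(child)
                on_path.add(child)
                yield from extend(path, on_path)
                on_path.remove(child)
                path.pop()
    yield from extend([source], {source})

children = {"v": ["x", "k", "t"], "x": ["k"], "t": ["k"]}
# "t" stands for a node of T that was removed from G before the search.
print(sorted(simple_paths(children, "v", "k", removed={"t"})))
```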
Extensions by Directed Paths
Extensions by Directed Subtrees
Query Reduction is not Efficient!
• Query reduction completely solves the problem, but it is inefficient
• Problem: A subset of the query may have many more results than the query itself
– Exponential total time!
[Figure: a data graph in which {A,B} has 2^n results but {A,B,C} has 1 result]
Combining the Reductions
• In order to enumerate in polynomial total time, combine query and data reductions:
– If some node v of K is reachable, in the data graph, from another node u of K, use query reduction
  • remove v from K
– Otherwise, use data reduction
• By combining the two reductions, results can be enumerated in polynomial total time
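The rule for choosing between the two reductions can be sketched as a reachability test among the nodes of K. The function names, the adjacency-map representation and the returned tuples are illustrative assumptions; the sketch only decides which reduction applies and does not perform it.

```python
def reachable_from(children, start):
    """Nodes reachable from start by a directed path of at least one edge."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def choose_reduction(children, keywords):
    """Return ("query", v) if some v in K is reachable from another u in K,
    otherwise ("data", None)."""
    for u in sorted(keywords):
        below_u = reachable_from(children, u)
        for v in sorted(keywords):
            if v != u and v in below_u:
                return ("query", v)         # query reduction: remove v from K
    return ("data", None)                   # no such pair: use data reduction

children = {"r": ["a", "b"], "a": ["k"]}
print(choose_reduction(children, {"a", "k"}))
print(choose_reduction(children, {"b", "k"}))
```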
Achieving Polynomial Delay
• To achieve polynomial delay, we cannot wait until a recursive subroutine terminates
• Use coroutines instead of subroutines!
• That is, each recursive execution of the algorithm
stops after generating each result
resumes when the next result is required
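Python generators behave like the coroutines described here, so the difference can be shown on a toy recursive enumeration (binary strings of length n, chosen only for illustration). The subroutine version returns after building its whole list at every level; the generator version pauses after each result and resumes on demand.

```python
def all_strings_subroutine(n):
    """Subroutine style: each call returns only after building its whole list."""
    if n == 0:
        return [""]
    shorter = all_strings_subroutine(n - 1)
    return [bit + s for s in shorter for bit in "01"]

def all_strings_coroutine(n):
    """Coroutine style: each level hands over one result and then pauses."""
    if n == 0:
        yield ""
        return
    for s in all_strings_coroutine(n - 1):
        for bit in "01":
            yield bit + s

# The first result of a very large enumeration is available immediately;
# the subroutine version would have to build all 2**50 strings first.
print(next(all_strings_coroutine(50)))
print(list(all_strings_coroutine(2)) == all_strings_subroutine(2))
```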
Subroutines
[Figure: nested routines 1, 2 and 3 above the base case]
Polynomial Total Time
Coroutines
[Figure: nested routines 1, 2 and 3 above the base case]
Polynomial Delay
For papers and projects related to this topic, see the home page of Benny Kimelfeld