Post on 13-Jan-2016
Cost-based Optimization of Graph Queries
Silke Trißl
Humboldt-Universität zu Berlin
Knowledge Management in Bioinformatics
IDAR 2007
Silke Trißl - Cost-based Optimization of Graph Queries - 1st PhD Workshop IDAR
Motivation – Biological Networks
from http://www.genome.jp/kegg
Name
Sequence
TYPE
Function
Location
…
Motivation
Optimization
Nodes
Paths
Future Work
Relational algebra
Conclusion
Querying Networks - PQL
Pathway Query Language (PQL) [Leser, 2005]
Syntax for querying graphs
Find subgraphs matching the query graph
SELECT B
FROM network
LET node A, node B, path P
WHERE A.name = 'Glucose' AND A ISA compound AND B ISA enzyme AND P.path = A[-*]B;
[Figure: query graph. Node A (name = Glucose, ISA compound) is connected by path P to node B (ISA enzyme).]
Find all enzymes that are directly or indirectly affected by "Glucose"
Node Conditions
Nodes can contain conditions on
[Figure: query graph (node A: name = Glucose, ISA compound; node B: ISA enzyme; path P) and part of the TYPE hierarchy with the concepts root, molecule, interaction, macro-molecule, compound, sugar, gene, protein, ion, mRNA, catalysis, inhibition, enzyme.]
– Attributes: A.name = 'Glucose'
– TYPE (of hierarchy): A ISA compound
– Function (of ontology): A HASFUNC ('catalysis', GO)
– Location: A ISIN ('Human', taxonomy)
Path Conditions
Paths can contain conditions on
[Figure: query graph (node A: name = Glucose, ISA compound; node B: ISA gene; path P) next to a graph containing nodes a and b.]
– Edges: P.path = A[-*]B AND P.length = 1
– Path existence: P.path = A[-*]B
– Path length: P.path = A[-*]B AND P.length < 10
– Start node: P.start = A
– Containment: P ⊆ R
Result of Graph Queries
Search for matching subgraphs
– Find node and path bindings for the query variables in the network
[Figure: the query graph (node A: name = Glucose, ISA compound; node B: ISA enzyme; path P) matched against the network.]
Outline
Motivation
Optimize Graph Queries
– Evaluate node conditions
– Evaluate path conditions
Future Work
– Relational algebra for graph queries
Conclusion
Evaluation of Node Conditions
Node attributes
– Select operator (σ) on Node table
Node types, functions, and locations
– Hierarchy operator (χ)
– Return the specified concept and all successor concepts
[Figure: query graph. Node A (name = Glucose, ISA compound) is connected by path P to node B (ISA gene).]
Query plan for node A: σ_name=Glucose(Node) ⋈_Node.TYPE=TYPE χ_compound(TYPE)
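The plan for node A can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Node rows, the parent-to-children shape of the TYPE hierarchy, and the helper names `chi`, `select` and `bind_node` are hypothetical and do not come from the talk.

```python
# Hypothetical TYPE hierarchy (parent -> children); the exact shape is assumed.
TYPE_HIERARCHY = {
    "root": ["molecule", "interaction"],
    "molecule": ["macro-molecule", "compound"],
    "compound": ["sugar", "ion"],
    "macro-molecule": ["gene", "protein", "mRNA"],
    "protein": ["enzyme"],
    "interaction": ["catalysis", "inhibition"],
}

# Hypothetical Node table.
NODES = [
    {"id": 1, "name": "Glucose", "type": "sugar"},
    {"id": 2, "name": "Hexokinase", "type": "enzyme"},
    {"id": 3, "name": "Glucose", "type": "gene"},
]


def chi(hierarchy, concept):
    """Hierarchy operator: the concept plus all of its successor concepts."""
    found, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(hierarchy.get(current, []))
    return found


def select(rows, attribute, value):
    """Select operator: rows whose attribute equals the given value."""
    return [row for row in rows if row[attribute] == value]


def bind_node(rows, name, concept, hierarchy):
    """sigma_name(Node) joined with chi_concept(TYPE) on Node.TYPE = TYPE."""
    types = chi(hierarchy, concept)
    return [row for row in select(rows, "name", name) if row["type"] in types]


bindings_a = bind_node(NODES, "Glucose", "compound", TYPE_HIERARCHY)
```

In this toy data only the row with id 1 binds to A, because the second "Glucose" row has a type outside the compound subtree.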
How to evaluate Path conditions?
Recursively traverse the graph
– Arbitrary number of joins
– No possibility to optimize the execution
[Figure: graph with nodes a and b; plan: Edge ⋈ Edge ⋈ Edge ⋈ Edge ⋈ …]
Need for new logical and physical operators
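To see why the number of joins is not known in advance, here is a small Python sketch that expands a frontier by joining it with an Edge relation until nothing new is found. The edge list is an assumption (it matches the example graph used for GRIPP in this talk), and `reachable_by_joins` is a hypothetical helper name.

```python
# Assumed Edge relation as (source, target) pairs.
EDGES = [
    ("R", "A"), ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "E"), ("B", "F"), ("D", "G"), ("D", "H"),
    ("G", "B"), ("H", "A"),
]


def reachable_by_joins(edges, start):
    """Join the frontier with Edge repeatedly; count the joins that were needed."""
    frontier = {start}
    reached = set()
    joins = 0
    while frontier:
        # One join of the current frontier with the Edge relation.
        step = {dst for (src, dst) in edges if src in frontier}
        joins += 1
        frontier = step - reached
        reached |= step
    return reached, joins


reached_from_d, joins_needed = reachable_by_joins(EDGES, "D")
```

For start node D this needs four joins on the assumed edges; a different start node or graph needs a different number, which is what makes a fixed join plan impossible.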
Path Existence Operator, Φ
Node variables A and B
– Set of nodes V bound to A
– Set of nodes W bound to B
Path variable P
– Condition on P: path from A to B
A Φ B returns the set of node pairs (v,w) for which a path from v ∈ V to w ∈ W exists in G.
[Figure: query graph. Node A (name = Glucose, ISA compound) is connected by path P to node B (ISA gene).]
Physical Implementation of Φ
Graph traversal at query time
– Breadth-first or depth-first search
Query precomputed index structure
– Transitive closure (only for small graphs)
– GRIPP [Trißl et al., 2007]
  – GRIPP index table, IND(G)
  – one instance for every node v in G
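The first option, traversal at query time, can be sketched as a breadth-first search per start node. This is a minimal illustration, not the implementation from the talk: the adjacency list is an assumption (it matches the example graph used for GRIPP in this talk), `phi_bfs` is a hypothetical name, and a path is taken to have at least one edge.

```python
from collections import deque

# Assumed example graph as an adjacency list (node -> child nodes).
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def phi_bfs(graph, start_nodes, target_nodes):
    """Return all pairs (v, w) with v in V, w in W and a path from v to w."""
    targets = set(target_nodes)
    pairs = set()
    for v in start_nodes:
        seen = set()
        queue = deque(graph.get(v, []))
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            if node in targets:
                pairs.add((v, node))
            queue.extend(graph.get(node, []))
    return pairs


pairs_from_d = phi_bfs(GRAPH, {"D"}, {"C", "E"})
```

The cost of this variant grows with the part of the graph that has to be visited for every start node, which is the motivation for a precomputed index.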
GRIPP Index Creation
Depth-first traversal of G
We reach a node v
for the first time
– add tree instance of v to IND(G)
– proceed traversal
again
– add non-tree instance of v to IND(G)
– do not traverse child nodes of v
[Figure: example graph G with nodes R, A, B, C, D, E, F, G, H, annotated with the [pre,post] intervals assigned during the traversal.]
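The traversal rules above can be sketched in Python as follows. The adjacency list is an assumption reconstructed from the index table shown in this talk, `build_gripp_index` is a hypothetical name, and the sketch assumes a single root from which every node is reachable; it is an illustration, not the authors' implementation.

```python
# Assumed example graph (node -> child nodes, in traversal order).
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def build_gripp_index(graph, root):
    """Return IND(G) as rows (node, pre, post, inst), ordered by pre value."""
    index = []
    seen = set()
    counter = 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        slot = len(index)
        index.append(None)  # reserve the row so that rows stay sorted by pre
        if node in seen:
            inst = "non"    # reached again: do not traverse child nodes
        else:
            inst = "tree"   # reached for the first time: proceed traversal
            seen.add(node)
            for child in graph.get(node, []):
                visit(child)
        post = counter
        counter += 1
        index[slot] = (node, pre, post, inst)

    visit(root)
    return index


IND = build_gripp_index(GRAPH, "R")
```

On the assumed graph this yields eleven rows: one tree instance per node plus non-tree instances for B and A.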
Is node C reachable from node D?
GRIPP Index Table, IND(G)
[Figure: graph G with nodes R, A, B, C, D, E, F, G, H and their [pre,post] intervals, next to the GRIPP index IND(G); nodes C and D are highlighted.]
node  pre  post  inst
R     0    21    tree
A     1    20    tree
B     2    7     tree
E     3    4     tree
F     5    6     tree
C     8    9     tree
D     10   19    tree
G     11   14    tree
B     12   13    non
H     15   18    tree
A     16   17    non
Order Tree, O(G)
[Figure: order tree O(G), drawn next to the index table IND(G).]
w is reachable from v in O(G) iff
v_pre < w_pre < v_post
Query strategy – Step 1
Retrieve the reachable instance set of start node v, called RIS(v)
– Retrieve RIS(D)
– Requires only a single query on IND(G)
If C ∈ RIS(D)
– return true
– stop the search
Else
– proceed to Step 2
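Step 1 can be sketched in Python on the index table from this talk. The helper names `tree_instance` and `reachable_instance_set` are hypothetical, and the list scan stands in for what would be a single range query on IND(G) in a database.

```python
# IND(G) copied from the index table in this talk: (node, pre, post, inst).
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def tree_instance(index, node):
    """Return (pre, post) of the tree instance of a node."""
    for name, pre, post, inst in index:
        if name == node and inst == "tree":
            return pre, post
    raise KeyError(node)


def reachable_instance_set(index, node):
    """RIS(v): all instances whose pre value lies inside the interval of v."""
    v_pre, v_post = tree_instance(index, node)
    return [row for row in index if v_pre < row[1] < v_post]


ris_d = reachable_instance_set(IND, "D")
c_found = any(name == "C" for name, _, _, _ in ris_d)
```

RIS(D) contains instances of G, B, H and A but none of C, so the search has to continue with Step 2.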
Query strategy – Step 2
Search for non-tree instances in RIS(v)
– The nodes of these instances are hop nodes
Check every i ∈ RIS(D)
If i is a tree instance [G and H]
– Done
If i is a non-tree instance [A and B]
– i has no successors in O(G), but possibly in G
– proceed to Step 3
Query strategy – Step 3
Extend the search using hop nodes v1, …, vn
– Obtain the tree instance of node B
– Proceed to Step 1
Repeat steps 1…3 until
– an instance of node C is found
– or no more hop nodes are available
Depth-first traversal of O(G) using hop nodes
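Steps 1 to 3 together can be sketched as a depth-first search over hop nodes. This is a simplified illustration on the index table from this talk: `gripp_reachable` is a hypothetical name, the list scans stand in for queries on IND(G), and any pruning the real GRIPP implementation may apply is left out.

```python
# IND(G) copied from the index table in this talk: (node, pre, post, inst).
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def gripp_reachable(index, start, target):
    """Return True if an instance of target is found via RIS and hop nodes."""
    tree = {name: (pre, post)
            for name, pre, post, inst in index if inst == "tree"}
    used = set()        # nodes whose RIS has already been retrieved
    stack = [start]     # hop nodes still to be expanded (depth-first)
    while stack:
        node = stack.pop()
        if node in used:
            continue
        used.add(node)
        pre, post = tree[node]
        # Step 1: retrieve RIS(node) and look for the target.
        ris = [row for row in index if pre < row[1] < post]
        if any(name == target for name, _, _, _ in ris):
            return True
        # Steps 2 and 3: non-tree instances give hop nodes to continue from.
        for name, _, _, inst in ris:
            if inst == "non" and name not in used:
                stack.append(name)
    return False


d_reaches_c = gripp_reachable(IND, "D", "C")
```

For the question from the talk, RIS(D) contains no instance of C, but the hop node A does, so the answer is true.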
GRIPP – Sets of Nodes
[Figure: graph G with nodes R, A, B, C, D, E, F, G, H, and a path query in which node variable A is bound to {D} and node variable B is bound to {C, E}.]
Two different strategies
Single node pair
– Evaluate reachability for every node pair separately
Set-oriented
– Evaluate reachability for the set in one step
Query GRIPP – Single Node Pair
First evaluate reachability(D,E)
Then reachability(D,C) separately
Result: true for both node pairs
Query GRIPP – Set-oriented
First query the order tree completely
Then search used nodes and target nodes
If pre_Used < pre_Target < post_Used, return true
Used nodes
node  pre  post
D     10   19
B     2    7
A     1    20
Target nodes
node  pre  post
C     8    9
E     3    4
Result: true for both target nodes
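The set-oriented strategy can be sketched in two phases: first collect all used nodes by following hop nodes without stopping early, then compare the intervals of used nodes with the target instances. The function name `gripp_reachable_set` is hypothetical and the code is a simplified illustration on the index table from this talk.

```python
# IND(G) copied from the index table in this talk: (node, pre, post, inst).
IND = [
    ("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
    ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
    ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
    ("H", 15, 18, "tree"), ("A", 16, 17, "non"),
]


def gripp_reachable_set(index, start, targets):
    """Return the subset of targets that is reachable from start."""
    tree = {name: (pre, post)
            for name, pre, post, inst in index if inst == "tree"}
    # Phase 1: query the order tree completely, collecting all used nodes.
    used, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in used:
            continue
        used.append(node)
        pre, post = tree[node]
        for name, inst_pre, _, inst in index:
            if pre < inst_pre < post and inst == "non" and name not in used:
                stack.append(name)
    # Phase 2: pre_Used < pre_Target < post_Used for any used/target pair.
    reached = set()
    for node in used:
        pre, post = tree[node]
        for name, inst_pre, _, _ in index:
            if name in targets and pre < inst_pre < post:
                reached.add(name)
    return reached


reached_targets = gripp_reachable_set(IND, "D", {"C", "E"})
```

Starting from D, the used nodes are D, A and B, and both targets C and E fall inside the interval of A.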
Cost model
Single node pair strategy
– query time linear in size of target set
– better for few target nodes
Set-oriented strategy
– almost constant query times
– better for many target nodes
[Figure: average query time for both strategies with increasing size of the target node set, on a graph with 10,000 nodes and 20,000 edges.]
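An optimizer could use these two observations to pick a strategy per query. The sketch below only illustrates the idea: the function `choose_strategy` and both cost constants are hypothetical placeholders, not measured values from the talk, and would have to be calibrated on real query times.

```python
def choose_strategy(num_targets, cost_per_pair=1.0, cost_set_oriented=25.0):
    """Pick the cheaper strategy under a linear vs. constant cost assumption.

    cost_per_pair and cost_set_oriented are made-up units for illustration.
    """
    single_pair_cost = cost_per_pair * num_targets  # linear in the target set
    if single_pair_cost <= cost_set_oriented:       # roughly constant cost
        return "single-pair"
    return "set-oriented"


few = choose_strategy(3)
many = choose_strategy(100)
```

With the placeholder constants the break-even point lies at 25 target nodes; the real break-even point depends on the graph and the system.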
Future Work
Towards an algebra for graph queries
Define new operators
– Logical
– Physical
Determine cost functions
– Estimate the size of result sets
Define rewrite rules
– Which operations can be pushed?
Future Work – New Operators
Path length operator
– Evaluate the length of a path
Possible solution
– Store parts of paths, e.g., up to length x [Giugno & Shasha, 2002]
[Figure: graph containing nodes a and b.]
Future Work
Cost model
– Assign cost models to physical operators
Estimate the size of result sets
– Between how many node pairs does a path exist?
– Possibly of a certain length?
Possible solution
– Sampling
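One way sampling could work, shown only as an assumption since the talk does not fix a scheme: pick a few start nodes at random, count how many nodes each of them reaches, and scale the count up to all nodes. The graph, the helper names and the seed parameter are hypothetical.

```python
import random

# Assumed example graph (node -> child nodes).
GRAPH = {
    "R": ["A"],
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "D": ["G", "H"],
    "G": ["B"],
    "H": ["A"],
}


def reachable_from(graph, start):
    """All nodes reachable from start over at least one edge."""
    seen, stack = set(), list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def estimate_path_pairs(graph, nodes, sample_size, seed=0):
    """Estimate the number of node pairs (v, w) connected by a path."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(nodes), sample_size)
    hits = sum(len(reachable_from(graph, v)) for v in sample)
    return hits * len(nodes) / sample_size


ALL_NODES = set(GRAPH) | {w for children in GRAPH.values() for w in children}
exact_pairs = estimate_path_pairs(GRAPH, ALL_NODES, len(ALL_NODES))
rough_pairs = estimate_path_pairs(GRAPH, ALL_NODES, 3)
```

Sampling every node gives the exact count, which makes a handy sanity check; smaller samples trade accuracy for speed.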
Rewrite Query Plan
[Figure: query graph. Node A (name = Glucose, ISA compound) is connected by path P to node B (ISA enzyme).]
SELECT B
FROM network
LET node A, node B, path P
WHERE A.name = 'Glucose' AND A ISA compound AND B ISA enzyme AND P.path = A[-*]B;
Query plan: π_B( (σ_name=Glucose(Node) ⋈_Node.TYPE=TYPE χ_compound(TYPE)) Φ (Node ⋈_Node.TYPE=TYPE χ_enzyme(TYPE)) )
Better Plan?
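Plan 1: π_B( (σ_name=Glucose(Node) ⋈_Node.TYPE=TYPE χ_compound(TYPE)) Φ (Node ⋈_Node.TYPE=TYPE χ_enzyme(TYPE)) )
– numbers annotated on the plan: 1; 18,000
Plan 2: π_B( ((σ_name=Glucose(Node) ⋈_Node.TYPE=TYPE χ_compound(TYPE)) Φ Node) ⋈_B.TYPE=TYPE χ_enzyme(TYPE) )
– numbers annotated on the plan: 2; 1; 20,000; 2,000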
Conclusion
Optimize the execution of graph queries
– Use cost-based query optimization
Extend relational algebra
New operators
– Path existence operator, Φ
– Path length operator
Cost functions
– Estimate the size of result sets
Rewrite rules
Thanks for your attention
Special thanks to my PhD supervisor Ulf Leser
Silke Trißl
Humboldt-Universität zu Berlin
Work sponsored by
IDAR 2007
References
U. Leser. A query language for biological networks. Bioinformatics, 21 Suppl 2:ii33–ii39, Sep 2005.
B. Eckman and P. G. Brown. Graph data management for molecular and cell biology. IBM J. Res. & Dev., 50(6):545–560, Nov 2006.
F. Sohler and R. Zimmer. Identifying active transcription factors and kinases from expression data using pathway queries. Bioinformatics, 21 Suppl 2:ii115–ii122, Sep 2005.
J. McHugh and J. Widom. Query Optimization for XML. In Proc. of the VLDB Conference, pages 315–326, 1999. Morgan Kaufmann.
Y. Wu, J. M. Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. In Proc. of the ICDE Conference, pages 443–454, 2003. IEEE Computer Society.
S. Trißl and U. Leser. Fast and Practical Indexing and Querying of Very Large Graphs. In Proc. of the ACM SIGMOD Conference, to appear, 2007. ACM Press.