Page 1: Cost-based Optimization  of Graph Queries

Cost-based Optimization of Graph Queries

Silke Trißl

Humboldt-Universität zu Berlin

Knowledge Management in Bioinformatics

IDAR 2007

Page 2: Cost-based Optimization  of Graph Queries

Silke Trißl - Cost-based Optimization of Graph Queries - 1st PhD Workshop IDAR

Motivation – Biological Networks

[Figure: biological network, from http://www.genome.jp/kegg]

Name, Sequence, TYPE, Function, Location, …

Slide navigation bar: Motivation, Optimization, Nodes, Paths, Future Work, Relational algebra, Conclusion

Page 3: Cost-based Optimization  of Graph Queries

Querying Networks - PQL

Pathway Query Language (PQL) [Leser, 2005]
– Syntax for querying graphs
– Find subgraphs matching the query graph

SELECT B
FROM network
LET node A, node B, path P
WHERE A.name = ‘Glucose’ AND A ISA compound AND B ISA enzyme AND P.path = A[-*]B;

[Query graph: node A (name = Glucose, ISA compound) connected by path P to node B (ISA enzyme)]

Find all enzymes that are directly or indirectly affected by ‘Glucose’

Page 4: Cost-based Optimization  of Graph Queries

Node Conditions

Nodes can contain conditions on
– Attributes: A.name = ‘Glucose’
– TYPE (of hierarchy): A ISA compound
– Function (of ontology): A HASFUNC (‘catalysis’, GO)
– Location: A ISIN (‘Human’, taxonomy)

[Query graph: node A (name = Glucose, ISA compound) connected by path P to node B (ISA enzyme)]

[Figure: partial TYPE hierarchy with the concepts root, molecule, interaction, macro-molecule, compound, sugar, gene, protein, ion, mRNA, catalysis, inhibition, enzyme]

Page 5: Cost-based Optimization  of Graph Queries

Path Conditions

[Query graph: node A (name = Glucose, ISA compound) connected by path P to node B (ISA gene), shown next to an example graph with nodes a and b]

Paths can contain conditions on
– Edges: P.path = A[-*]B AND P.length = 1
– Path existence: P.path = A[-*]B
– Path length: P.path = A[-*]B AND P.length < 10
– Start node: P.start = A

– Containment: P ⊆ R

Page 6: Cost-based Optimization  of Graph Queries

Result of Graph Queries

Search for matching subgraphs
– Find node and path bindings for the query variables in the network

[Figure: the query graph (node A: name = Glucose, ISA compound; path P; node B: ISA enzyme) matched against the network]

Page 7: Cost-based Optimization  of Graph Queries

Outline

Motivation
Optimize Graph Queries
– Evaluate node conditions
– Evaluate path conditions
Future Work
– Relational algebra for graph queries
Conclusion

Page 8: Cost-based Optimization  of Graph Queries

Evaluation of Node Conditions

Node attributes
– Select operator (σ) on Node table
Node types, functions, and locations
– Hierarchy operator (χ)
– Return the specified concept and all successor concepts

[Query graph: node A (name = Glucose, ISA compound) connected by path P to node B (ISA gene)]

Query plan for node A: σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE)
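The query plan for node A can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy Node rows, the parent links of the TYPE hierarchy, and the function names `chi`, `select_name` and `join_on_type` are all hypothetical and not taken from the talk.

```python
# Hypothetical toy data; table layout and hierarchy edges are assumptions.
TYPE_PARENT = {              # child concept -> parent concept
    "molecule": "root",
    "compound": "molecule",
    "sugar": "compound",
    "ion": "compound",
}
NODE = [                     # rows of an assumed Node table: (id, name, type)
    (1, "Glucose", "sugar"),
    (2, "Glucose", "molecule"),
    (3, "Calcium", "ion"),
]


def chi(concept, parent_of):
    """Hierarchy operator: the concept plus all of its successor concepts."""
    result = {concept}
    changed = True
    while changed:           # repeat until no further successor is found
        changed = False
        for child, parent in parent_of.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def select_name(rows, name):
    """Select operator on the Node table (condition on the name attribute)."""
    return [row for row in rows if row[1] == name]


def join_on_type(rows, types):
    """Join Node.TYPE = TYPE against the concepts returned by chi."""
    return [row for row in rows if row[2] in types]


bindings_for_a = join_on_type(select_name(NODE, "Glucose"),
                              chi("compound", TYPE_PARENT))
print(bindings_for_a)  # [(1, 'Glucose', 'sugar')]
```

Only the row typed as a successor of compound survives the join; the Glucose row typed molecule is filtered out because molecule is an ancestor of compound, not a successor.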

Page 9: Cost-based Optimization  of Graph Queries

How to evaluate Path conditions?

Recursively traverse the graph
– Edge ⋈ Edge ⋈ Edge ⋈ Edge ⋈ …
– Arbitrary number of joins
– No possibility to optimize the execution

[Figure: example graph with nodes a and b]

Need for new logical and physical operators

Page 10: Cost-based Optimization  of Graph Queries

Path Existence Operator, Φ

Node variables A and B
– Set of nodes V bound to A
– Set of nodes W bound to B
Path variable P
– Condition on P: path from A to B

A Φ B returns the set of node pairs (v, w) for which a path from v ∈ V to w ∈ W exists in G.

[Query graph: node A (name = Glucose, ISA compound) connected by path P to node B (ISA gene)]
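The semantics of Φ can be illustrated with a straightforward traversal-based sketch. The edge list and the names `reachable_from` and `phi` are assumptions made for this example; it shows what the operator returns, not how it should be implemented efficiently.

```python
from collections import deque

# Hypothetical example graph as adjacency lists (an assumption for this sketch).
EDGES = {"R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
         "D": ["G", "H"], "G": ["B"], "H": ["A"]}


def reachable_from(start, edges):
    """All nodes reachable from start over a path of length >= 1 (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def phi(v_set, w_set, edges):
    """Path existence operator: pairs (v, w) with a path from v to w in G."""
    return {(v, w)
            for v in v_set
            for w in reachable_from(v, edges) & set(w_set)}


print(sorted(phi({"D"}, {"C", "E"}, EDGES)))  # [('D', 'C'), ('D', 'E')]
```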

Page 11: Cost-based Optimization  of Graph Queries

Physical Implementation of Φ

Graph traversal at query time
– Breadth-first or depth-first search
Query precomputed index structure
– Transitive closure (only for small graphs)
– GRIPP [Trißl et al., 2007]
  – GRIPP index table, IND(G)
  – one instance for every node v in G

Page 12: Cost-based Optimization  of Graph Queries

GRIPP Index Creation

Depth-first traversal of G

[Figure: example graph G with nodes R, A, B, C, D, E, F, G, H, annotated with the preorder and postorder values assigned during the traversal]

We reach a node v
for the first time
– add tree instance of v to IND(G)
– proceed traversal
again
– add non-tree instance of v to IND(G)
– do not traverse child nodes of v
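The traversal rules above can be sketched as follows. The edge list is reconstructed from the example index table on the slides, and the order of children is an assumption chosen so that the numbering matches; `build_gripp_index` is a hypothetical name, not the author's code.

```python
# Edge list reconstructed from the example index table; child order is assumed.
EDGES = {"R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
         "D": ["G", "H"], "G": ["B"], "H": ["A"]}


def build_gripp_index(root, edges):
    """Depth-first traversal that emits (node, pre, post, inst) rows."""
    rows, seen, counter = [], set(), 0

    def visit(node):
        nonlocal counter
        pre = counter
        counter += 1
        if node in seen:
            inst = "non"      # reached again: do not traverse child nodes
        else:
            inst = "tree"     # first visit: proceed with the traversal
            seen.add(node)
            for child in edges.get(node, []):
                visit(child)
        post = counter
        counter += 1
        rows.append((node, pre, post, inst))

    visit(root)
    return sorted(rows, key=lambda row: row[1])   # order by preorder value


for row in build_gripp_index("R", EDGES):
    print(row)
```

The recursive form keeps the sketch short; a graph with very long paths would need an explicit stack to stay within Python's recursion limit.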

Page 13: Cost-based Optimization  of Graph Queries

Is node C reachable from node D?

GRIPP Index Table, IND(G)

[Figure: graph G with nodes R, A, B, C, D, E, F, G, H and their preorder/postorder values, shown next to the GRIPP index IND(G)]

node  pre  post  inst
R     0    21    tree
A     1    20    tree
B     2    7     tree
E     3    4     tree
F     5    6     tree
C     8    9     tree
D     10   19    tree
G     11   14    tree
B     12   13    non
H     15   18    tree
A     16   17    non

Page 14: Cost-based Optimization  of Graph Queries

Order Tree, O(G)

[Figure: the GRIPP index table IND(G) from the previous slide, shown next to the order tree O(G) that it encodes]

w reachable from v iff

v_pre < w_pre < v_post
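The interval test on the order tree can be tried out directly on the index table from the slides. The helper names `tree_instance` and `reachable_in_order_tree` are hypothetical; the sketch only checks reachability within O(G), which is why D to C comes out false here even though a path exists in G.

```python
# GRIPP index table from the slides: (node, pre, post, inst).
IND = [("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
       ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
       ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
       ("H", 15, 18, "tree"), ("A", 16, 17, "non")]


def tree_instance(node, ind):
    """Return the (pre, post) interval of the tree instance of node."""
    return next((pre, post) for n, pre, post, inst in ind
                if n == node and inst == "tree")


def reachable_in_order_tree(v, w, ind):
    """True if some instance of w satisfies v_pre < w_pre < v_post."""
    v_pre, v_post = tree_instance(v, ind)
    return any(v_pre < pre < v_post for n, pre, post, inst in ind if n == w)


print(reachable_in_order_tree("A", "C", IND))  # True: 1 < 8 < 20
print(reachable_in_order_tree("D", "C", IND))  # False: 8 is outside (10, 19)
print(reachable_in_order_tree("D", "B", IND))  # True: non-tree instance at 12
```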

Page 16: Cost-based Optimization  of Graph Queries

Query strategy – Step 1

Retrieve the reachable instance set of start node v, called RIS(v)
– Retrieve RIS(D)
– Requires only a single query on IND(G)

If C ∈ RIS(D)
– return true
– stop the search
Else
– proceed to Step 2

Page 17: Cost-based Optimization  of Graph Queries

Query strategy – Step 2

Search for non-tree instances in RIS(v)
– The nodes of these instances are hop nodes

Check every i ∈ RIS(D)
If i is a tree instance [G and H]
– Done
If i is a non-tree instance [A and B]
– i has no successors in O(G), but possibly in G
– proceed to Step 3

Page 18: Cost-based Optimization  of Graph Queries

Query strategy – Step 3

Extend the search using hop nodes v1, …, vn
– Obtain the tree instance of node B
– Proceed to Step 1

Repeat steps 1…3 until
– an instance of node C is found
– or no more hop nodes are available

Depth-first traversal of O(G) using hop nodes
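Steps 1 to 3 can be put together in a short sketch that runs on the example index table. The function names `ris` and `gripp_reachable` are hypothetical, and the sketch leaves out any pruning that the published GRIPP algorithm may apply; it only follows the three steps as described on the slides.

```python
# GRIPP index table from the slides: (node, pre, post, inst).
IND = [("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
       ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
       ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
       ("H", 15, 18, "tree"), ("A", 16, 17, "non")]


def tree_interval(node, ind):
    """(pre, post) of the tree instance of node."""
    return next((pre, post) for n, pre, post, inst in ind
                if n == node and inst == "tree")


def ris(node, ind):
    """Step 1: reachable instance set, one range scan over IND(G)."""
    v_pre, v_post = tree_interval(node, ind)
    return [row for row in ind if v_pre < row[1] < v_post]


def gripp_reachable(v, w, ind):
    """Depth-first search over O(G), extended through hop nodes."""
    used, stack = set(), [v]
    while stack:
        node = stack.pop()
        if node in used:
            continue
        used.add(node)
        instances = ris(node, ind)
        if any(row[0] == w for row in instances):
            return True                      # an instance of w was found
        # Step 2: non-tree instances identify hop nodes.
        # Step 3: continue from the tree instance of each unused hop node.
        stack.extend(row[0] for row in instances
                     if row[3] == "non" and row[0] not in used)
    return False                             # no more hop nodes available


print(gripp_reachable("D", "C", IND))  # True, found via hop node A
print(gripp_reachable("E", "A", IND))  # False, RIS(E) is empty
```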

Page 19: Cost-based Optimization  of Graph Queries

GRIPP – Sets of Nodes

[Figure: graph G with nodes R, A, B, C, D, E, F, G, H]

[Figure: query with path variable P between node variables A and B, each bound to a set of nodes from the Node table; here D for A, and C and E for B]

Page 20: Cost-based Optimization  of Graph Queries

GRIPP – Sets of Nodes

Two different strategies
Single node pair
– Evaluate reachability for every node pair separately
Set-oriented
– Evaluate reachability for the set in one step

Page 21: Cost-based Optimization  of Graph Queries

Query GRIPP – Single Node Pair

First evaluate reachability(D,E): true

Then evaluate reachability(D,C) separately: true

Page 22: Cost-based Optimization  of Graph Queries

Query GRIPP – Set-oriented

First query the order tree completely
Then search used nodes and target nodes
– If preUsed < preTarget < postUsed, return true

Used nodes
node  pre  post
D     10   19
B     2    7
A     1    20

Target nodes
node  pre  post
C     8    9
E     3    4

Both target nodes, C and E, evaluate to true.
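A minimal sketch of the set-oriented strategy on the same example follows. The names `used_nodes` and `set_oriented` are hypothetical; the sketch first expands the order tree through all hop nodes and only then compares the target instances against the intervals of the used nodes.

```python
# GRIPP index table from the slides: (node, pre, post, inst).
IND = [("R", 0, 21, "tree"), ("A", 1, 20, "tree"), ("B", 2, 7, "tree"),
       ("E", 3, 4, "tree"), ("F", 5, 6, "tree"), ("C", 8, 9, "tree"),
       ("D", 10, 19, "tree"), ("G", 11, 14, "tree"), ("B", 12, 13, "non"),
       ("H", 15, 18, "tree"), ("A", 16, 17, "non")]


def tree_interval(node, ind):
    """(pre, post) of the tree instance of node."""
    return next((pre, post) for n, pre, post, inst in ind
                if n == node and inst == "tree")


def used_nodes(start, ind):
    """Query the order tree completely: start node plus all hop nodes."""
    used, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in used:
            continue
        used.append(node)
        v_pre, v_post = tree_interval(node, ind)
        stack.extend(n for n, pre, post, inst in ind
                     if v_pre < pre < v_post and inst == "non" and n not in used)
    return used


def set_oriented(start, targets, ind):
    """Targets with preUsed < preTarget < postUsed for some used node."""
    intervals = [tree_interval(n, ind) for n in used_nodes(start, ind)]
    return {n for n, pre, post, inst in ind
            if n in targets and any(lo < pre < hi for lo, hi in intervals)}


print(sorted(used_nodes("D", IND)))                     # ['A', 'B', 'D']
print(sorted(set_oriented("D", {"C", "E", "R"}, IND)))  # ['C', 'E']
```

The order tree is expanded once regardless of how many targets are checked, which is consistent with the nearly constant query times reported for this strategy.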

Page 23: Cost-based Optimization  of Graph Queries

Cost model

Single node pair strategy
– query time linear in size of target set
– better for few target nodes
Set-oriented strategy
– almost constant query times
– better for many target nodes

[Figure: average query time for both strategies and increasing size of target node set on a graph with 10,000 nodes and 20,000 edges]

Page 25: Cost-based Optimization  of Graph Queries

Future Work

Towards an algebra for graph queries
Define new operators
– Logical
– Physical
Determine cost functions
– Estimate the size of result sets
Define rewrite rules
– Which operations can be pushed?

Page 26: Cost-based Optimization  of Graph Queries

Future Work – New Operators

Path length operator
– Evaluate the length of a path
Possible solution
– Store parts of paths, e.g., up to length x [Giugno & Shasha, 2002]

[Figure: example graph with nodes a and b]

Page 27: Cost-based Optimization  of Graph Queries

Future Work

Cost Model
– Assign cost models to physical operators
Estimate the size of result sets
– Between how many node pairs does a path exist?
– Possibly of certain length?
Possible solution
– Sampling
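One way sampling could be used to estimate the result size of Φ is sketched below. The talk only names sampling as a possible solution, so everything here is an assumption: the example graph, the uniform sampling of node pairs, and the names `has_path` and `estimate_pairs`.

```python
import random
from itertools import product

# Hypothetical example graph as adjacency lists (an assumption for this sketch).
EDGES = {"R": ["A"], "A": ["B", "C", "D"], "B": ["E", "F"],
         "D": ["G", "H"], "G": ["B"], "H": ["A"]}
NODES = sorted(set(EDGES) | {c for cs in EDGES.values() for c in cs})


def has_path(v, w, edges):
    """True if a path of length >= 1 leads from v to w (iterative DFS)."""
    seen, stack = set(), [v]
    while stack:
        for child in edges.get(stack.pop(), []):
            if child == w:
                return True
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return False


def estimate_pairs(v_set, w_set, edges, sample_size, seed=0):
    """Estimate |V Φ W| by testing a uniform sample of candidate pairs."""
    pairs = list(product(sorted(v_set), sorted(w_set)))
    sample = random.Random(seed).sample(pairs, min(sample_size, len(pairs)))
    hits = sum(has_path(v, w, edges) for v, w in sample)
    return hits / len(sample) * len(pairs)   # scale the hit rate up


exact = sum(has_path(v, w, EDGES) for v, w in product(NODES, NODES))
print(exact, estimate_pairs(NODES, NODES, EDGES, sample_size=20))
```

With a sample that covers all candidate pairs the estimate equals the exact count; smaller samples trade accuracy for fewer reachability checks.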

Page 28: Cost-based Optimization  of Graph Queries

Rewrite Query Plan

[Query graph: node A (name = Glucose, ISA compound) connected by path P to node B (ISA enzyme)]

SELECT B
FROM network
LET node A, node B, path P
WHERE A.name = ‘Glucose’ AND A ISA compound AND B ISA enzyme AND P.path = A[-*]B;

Query plan: π_B( (σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE)) Φ (Node ⋈_{Node.TYPE=TYPE} χ_{enzyme}(TYPE)) )

Page 29: Cost-based Optimization  of Graph Queries

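Better Plan?

Plan 1: π_B( (σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE)) Φ (Node ⋈_{Node.TYPE=TYPE} χ_{enzyme}(TYPE)) )
[Numbers annotated on this plan in the figure: 1; 18,000]

Plan 2: π_B( ((σ_{name=Glucose}(Node) ⋈_{Node.TYPE=TYPE} χ_{compound}(TYPE)) Φ Node) ⋈_{B.TYPE=TYPE} χ_{enzyme}(TYPE) )
[Numbers annotated on this plan in the figure: 2; 1; 20,000; 2,000]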

Page 30: Cost-based Optimization  of Graph Queries

Conclusion

Optimize the execution of graph queries
– Use cost-based query optimization
Extend relational algebra
– New operators
  – Path existence operator, Φ
  – Path length operator
– Cost functions
  – Estimate the size of result sets
– Rewrite rules

Page 31: Cost-based Optimization  of Graph Queries

Thanks for your attention

Special thanks to my PhD supervisor Ulf Leser

Silke Trißl

Humboldt-Universität zu Berlin

Work sponsored by

IDAR 2007

Page 32: Cost-based Optimization  of Graph Queries

References

U. Leser. A query language for biological networks. Bioinformatics, 21 Suppl 2:ii33–ii39, Sep 2005.

B. Eckman and P. G. Brown. Graph data management for molecular and cell biology. IBM J. Res. & Dev., 50(6):545–560, Nov 2006.

F. Sohler and R. Zimmer. Identifying active transcription factors and kinases from expression data using pathway queries. Bioinformatics, 21 Suppl 2:ii115–ii122, Sep 2005.

J. McHugh and J. Widom. Query Optimization for XML. In Proc. of the VLDB Conference, pages 315–326, 1999. Morgan Kaufmann.

Y. Wu, J. M. Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. In Proc. of the ICDE Conference, pages 443–454, 2003. IEEE Computer Society.

S. Trißl and U. Leser. Fast and Practical Indexing and Querying of Very Large Graphs. In Proc. of the ACM SIGMOD Conference, to appear, 2007. ACM Press.

