GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries
Sherif Sakr
School of Computer Science and Engineering
University of New South Wales
http://www.cse.unsw.edu.au/~ssakr/
BIT Seminars ’09 - Free University of Bolzano, Italy
16 November 2009
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 1 / 40
Outline
Previous Work: Pathfinder - Relational XQuery Compiler.
Current Work: GraphREL - General Graph Query Processor.
Future Work: Scalable Graph Query Processing for New Generation of Database Applications.
Pathfinder: A Relational XQuery Processor
[Figure: Pathfinder compiles an XQuery expression into relational algebra, which feeds one of two back ends.
MIL Code Generator -> MIL Scripts -> Monet DBMS
SQL Code Generator -> SQL Scripts -> Conventional RDBMS]
http://pathfinder-xquery.org/
[Figure: Pathfinder architecture on a conventional RDBMS. Components shown: XML document, XQuery expression, XPath Accelerator encoding tuples, statistical guide, statistical histograms, translation templates, relational algebra with special properties, XQuery estimator with estimation rules and cardinality properties, cardinality-properties-aware SQL generator and SQL scripts, relational results, XML serializer, system administrator. Citations shown in the figure: [VLDB’04], [VLDB’08], [SIGMOD’07], [IJWIS’09], [JDM’09].]
GraphREL: Motivations
Graphs are among the most complicated and general forms of data structures.
Recently, they have been widely used to model many kinds of complex structured and schemaless data such as social networks, chemical compounds, biological pathways, spatial databases, the semantic web and business process models.
Retrieving the graphs that contain a query graph from a large graph database is a key performance issue in all of these graph-based applications.
The success of any graph database application depends directly on the efficiency of its graph indexing and query processing mechanisms.
RDBMSs have repeatedly shown that they are efficient, scalable and successful in hosting different kinds of data.
Preliminaries: Graph Data Model
In labelled graphs, vertices and edges represent the entities and the relationships between them, respectively.
The attributes associated with these entities and relationships are called labels.
A graph database D is a collection of member graphs D = {g1, g2, ..., gn}, where each member graph gi is denoted as (V, E, Lv, Le):
V is the set of vertices.
E ⊆ V × V is the set of edges joining two distinct vertices.
Lv is the set of vertex labels.
Le is the set of edge labels.
Labelled graphs are classified according to the direction of their edges into two main classes:
1 Directed labelled graphs such as XML, RDF and traffic networks.
2 Undirected labelled graphs such as social networks and chemical compounds.
Preliminaries: Graph Queries
In principle, queries in graph databases can be broadly classified into the following main categories:
Subgraph queries: this category searches for a specific pattern in the graph database. The pattern can be either a small graph or a graph in which some parts are uncertain, e.g., vertices with wildcard labels.
Supergraph queries: this category searches for the graph database members whose whole structures are contained in the input query.
Similarity (Approximate Matching) queries: this category finds graphs which are similar, but not necessarily isomorphic, to a given query graph.
Preliminaries: Subgraph Search Queries
Given a graph database D = {g1, g2, ..., gn} and a graph query q, a subgraph search query returns the answer set A = {gi | q ⊆ gi, gi ∈ D}.
A graph q is described as a sub-graph of a graph database member gi if the vertices and edges of q form a subset of the vertices and edges of gi.
Formally, g1(V1, E1, Lv1, Le1) is defined as a sub-graph of g2(V2, E2, Lv2, Le2) if and only if:
1 For every distinct vertex x ∈ V1 with a label vl ∈ Lv1, there is a distinct vertex y ∈ V2 with the same label vl ∈ Lv2.
2 For every distinct edge ab ∈ E1 with a label el ∈ Le1, there is a distinct edge ab ∈ E2 with the same label el ∈ Le2.
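The definition above can be checked directly with a brute-force search, shown here as a minimal Python sketch. It is only an illustration of the definition, not the GraphREL evaluation strategy; the function names, the dictionary-based graph representation and the two toy graphs are assumptions made for this example.

```python
from itertools import permutations

def is_subgraph(q_vertices, q_edges, g_vertices, g_edges):
    """Return True if the query graph occurs in the data graph.

    q_vertices / g_vertices: dict mapping vertex id -> vertex label.
    q_edges / g_edges: set of (source id, destination id, edge label).
    """
    q_ids = list(q_vertices)
    # Try every assignment of distinct data vertices to the query vertices.
    for image in permutations(g_vertices, len(q_ids)):
        mapping = dict(zip(q_ids, image))
        # Condition 1: mapped vertices carry the same labels.
        if any(q_vertices[v] != g_vertices[mapping[v]] for v in q_ids):
            continue
        # Condition 2: every query edge exists with the same label.
        if all((mapping[s], mapping[d], label) in g_edges
               for s, d, label in q_edges):
            return True
    return False

def answer_set(query, database):
    """Compute A = {gi | q is a sub-graph of gi} for a dict of member graphs."""
    q_vertices, q_edges = query
    return [gid for gid, (g_vertices, g_edges) in database.items()
            if is_subgraph(q_vertices, q_edges, g_vertices, g_edges)]

# Hypothetical toy database with two directed labelled graphs.
database = {
    "g1": ({1: "A", 2: "B", 3: "C"}, {(1, 2, "x"), (2, 3, "y")}),
    "g2": ({1: "A", 2: "B"}, {(2, 1, "x")}),
}
query = ({"u": "A", "v": "B"}, {("u", "v", "x")})
result = answer_set(query, database)
```

The search is exponential in the query size, which is one reason a filtering step inside the database engine is attractive.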
[Figure: An example graph database and graph query. Panel (a): sample graph database with three member graphs g1, g2 and g3. Panel (b): graph query q. Vertices carry the labels A, B, C and D; edges carry labels such as m, n, x, y, z, e and f.]
Our Approach: GraphREL
Relational encoding of graph data.
SQL translation of sub-graph search queries.
Filtering phase.
Optional verification phase.
Partitioned B-tree Indexes.
Statistical Summaries.
Decomposition-Based and Selectivity-Aware SQL Translation.
Relational Encoding of Graph Data
The starting point of our relational framework is to find an efficient and suitable encoding for each graph member gi in the graph database D.
We use the Vertex-Edge mapping scheme for storing directed labelled graphs with the following structure:
Vertices(graphID, vertexID, vertexLabel)
Edges(graphID, sVertex, dVertex, edgeLabel)
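As a rough illustration, the two relations could be created and populated as follows. This is a sketch using Python's built-in sqlite3 module; the choice of SQLite, the column types, the primary key and the helper names create_graph_store and store_graph are assumptions, and the three-vertex graph is a toy input.

```python
import sqlite3

def create_graph_store(conn):
    """Create the two encoding tables of the Vertex-Edge mapping scheme."""
    conn.executescript("""
        CREATE TABLE Vertices (
            graphID     INTEGER NOT NULL,
            vertexID    INTEGER NOT NULL,
            vertexLabel TEXT    NOT NULL,
            PRIMARY KEY (graphID, vertexID)
        );
        CREATE TABLE Edges (
            graphID   INTEGER NOT NULL,
            sVertex   INTEGER NOT NULL,  -- source vertex of the directed edge
            dVertex   INTEGER NOT NULL,  -- destination vertex
            edgeLabel TEXT    NOT NULL
        );
    """)

def store_graph(conn, graph_id, vertices, edges):
    """Encode one member graph.

    vertices: dict vertex id -> label; edges: list of (source, dest, label).
    """
    conn.executemany(
        "INSERT INTO Vertices VALUES (?, ?, ?)",
        [(graph_id, vid, label) for vid, label in vertices.items()])
    conn.executemany(
        "INSERT INTO Edges VALUES (?, ?, ?, ?)",
        [(graph_id, s, d, label) for s, d, label in edges])

conn = sqlite3.connect(":memory:")
create_graph_store(conn)
store_graph(conn, 1, {1: "A", 2: "A", 3: "D"},
            [(1, 2, "n"), (1, 3, "m"), (2, 3, "n")])
vertex_count = conn.execute("SELECT COUNT(*) FROM Vertices").fetchone()[0]
edge_count = conn.execute("SELECT COUNT(*) FROM Edges").fetchone()[0]
```

Because every row carries its graphID, adding or removing a member graph only touches that graph's rows.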
[Figure: the member graphs g1 and g2 with their vertices numbered, next to their encoding in the two tables below.]

Vertices Table
graphID  vertexID  vLabel
1        1         A
1        2         A
1        3         D
1        4         A
1        5         C
1        6         B
2        1         A
2        2         C
2        3         D
2        4         C
2        5         B

Edges Table
graphID  sVertex  dVertex  eLabel
1        1        2        n
1        1        3        m
1        2        3        n
1        4        3        x
1        5        4        x
1        6        5        y
1        5        2        z
1        1        6        m
2        1        2        e
2        2        3        m
2        4        3        m
2        4        2        n
2        5        4        x
2        1        5        f
SQL Translation of Graph Queries
Filtering Phase: a sub-graph query q consisting of a set of vertices QV of size m and a set of edges QE of size n is evaluated using the following SQL translation template:
SELECT DISTINCT V1.graphID, Vi.vertexID
FROM Vertices AS V1, ..., Vertices AS Vm, Edges AS E1, ..., Edges AS En
WHERE ∀ i=2..m (V1.graphID = Vi.graphID)
AND ∀ j=1..n (V1.graphID = Ej.graphID)
AND ∀ i=1..m (Vi.vertexLabel = QVi.vertexLabel)
AND ∀ j=1..n (Ej.edgeLabel = QEj.edgeLabel)
AND ∀ j=1..n (Ej.sVertex = Vf.vertexID AND Ej.dVertex = Vf.vertexID);
Here Vf stands for the Vertices instance that the query maps to the corresponding endpoint of edge Ej, so the source and destination predicates generally refer to two different instances.
Verification Phase: an optional phase which verifies that the filtered vertices of each candidate graph are distinct. It is applied only if more than one vertex in the set of query vertices QV has the same label. This can easily be achieved using the vertex IDs.
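The template can be instantiated mechanically from a query graph. The sketch below is one possible generator in Python, run against SQLite with the g1 and g2 rows from the encoding example. The function name, the list-based query representation and the use of SQLite are assumptions. As a simplification, it returns only graph IDs and folds the distinctness check of the verification phase into extra vertexID predicates rather than running it as a separate phase.

```python
import sqlite3

def translate_subgraph_query(q_vertices, q_edges):
    """Build the filtering SQL for a query graph.

    q_vertices: list of vertex labels; position i is query vertex i.
    q_edges: list of (source index, destination index, edge label).
    Returns (sql, params) for a DB-API cursor.
    """
    v = [f"V{i + 1}" for i in range(len(q_vertices))]
    e = [f"E{j + 1}" for j in range(len(q_edges))]
    tables = [f"Vertices AS {a}" for a in v] + [f"Edges AS {a}" for a in e]
    preds, params = [], []
    # All table instances must refer to the same member graph.
    preds += [f"{v[0]}.graphID = {a}.graphID" for a in v[1:] + e]
    # Label predicates: vertices first, then edges (parameter order matters).
    for alias, label in zip(v, q_vertices):
        preds.append(f"{alias}.vertexLabel = ?")
        params.append(label)
    for alias, (_, _, label) in zip(e, q_edges):
        preds.append(f"{alias}.edgeLabel = ?")
        params.append(label)
    # Structural predicates: tie each edge to its two endpoint instances.
    for alias, (s, d, _) in zip(e, q_edges):
        preds.append(f"{alias}.sVertex = {v[s]}.vertexID")
        preds.append(f"{alias}.dVertex = {v[d]}.vertexID")
    # Assumed simplification: equally labelled query vertices must be distinct.
    for i in range(len(v)):
        for k in range(i + 1, len(v)):
            if q_vertices[i] == q_vertices[k]:
                preds.append(f"{v[i]}.vertexID <> {v[k]}.vertexID")
    sql = (f"SELECT DISTINCT {v[0]}.graphID FROM {', '.join(tables)} "
           f"WHERE {' AND '.join(preds)}")
    return sql, params

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT);
    CREATE TABLE Edges (graphID INTEGER, sVertex INTEGER, dVertex INTEGER,
                        edgeLabel TEXT);
""")
conn.executemany("INSERT INTO Vertices VALUES (?, ?, ?)", [
    (1, 1, "A"), (1, 2, "A"), (1, 3, "D"), (1, 4, "A"), (1, 5, "C"),
    (1, 6, "B"), (2, 1, "A"), (2, 2, "C"), (2, 3, "D"), (2, 4, "C"),
    (2, 5, "B")])
conn.executemany("INSERT INTO Edges VALUES (?, ?, ?, ?)", [
    (1, 1, 2, "n"), (1, 1, 3, "m"), (1, 2, 3, "n"), (1, 4, 3, "x"),
    (1, 5, 4, "x"), (1, 6, 5, "y"), (1, 5, 2, "z"), (1, 1, 6, "m"),
    (2, 1, 2, "e"), (2, 2, 3, "m"), (2, 4, 3, "m"), (2, 4, 2, "n"),
    (2, 5, 4, "x"), (2, 1, 5, "f")])

def run(q_vertices, q_edges):
    sql, params = translate_subgraph_query(q_vertices, q_edges)
    return sorted(row[0] for row in conn.execute(sql, params))

# Query "A -m-> D" occurs in g1 only; "C -m-> D" occurs in g2 only.
a_m_d = run(["A", "D"], [(0, 1, "m")])
c_m_d = run(["C", "D"], [(0, 1, "m")])
```

Each query vertex and each query edge contributes one table instance to the FROM clause, so the number of joins grows linearly with the size of the query graph.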
Partitioned B-tree Indexes
Partitioned B-tree indexing is a slight variant of the B-tree indexing structure.
The main idea is the use of low-selectivity leading columns to maintain partitions within the associated B-tree.
In labelled graphs, it is generally the case that the number of distinct vertex and edge labels is far smaller than the number of vertices and edges, respectively.
For example, an index defined on the columns (vertexLabel, graphID) can reduce the access cost of a sub-graph query with only one label to one disk page. In contrast, an index defined on the two columns (graphID, vertexLabel) requires scanning a large number of disk pages.
Partitioned B-tree indexes on the high-selectivity attributes achieve fixed execution times that no longer depend on the size of the whole graph database.
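The column ordering in the example can be tried out with an ordinary composite index. The sketch below uses SQLite from Python; note that SQLite offers plain B-tree indexes rather than partitioned B-trees in the sense of [CIDR’03], so this only illustrates the effect of leading with the label column. The index name idx_label_first is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT);
    -- Label first: all entries for one label are contiguous in the B-tree.
    CREATE INDEX idx_label_first ON Vertices (vertexLabel, graphID);
""")

# Ask the optimizer how it would evaluate a single-label lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT DISTINCT graphID FROM Vertices WHERE vertexLabel = ?",
    ("D",)).fetchall()
# The last column of each plan row holds the human-readable plan step.
plan_steps = [row[-1] for row in plan]
```

Printing plan_steps shows which access path the engine picked; the exact wording differs between SQLite versions.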
Limitations of SQL-Based Translation Approach
An obvious problem of the SQL translation template is that it involves a large number of conjunctive SQL predicates and join operations between the encoding tables.
Most relational query engines will fail to execute the SQL translations of medium-sized or large sub-graph queries because they are too long and too complex (this does not mean they must consequently be too expensive).
Therefore, we need a decomposition mechanism to divide this large and complex SQL translation query into a sequence of intermediate queries.
Applying this decomposition mechanism blindly may lead to inefficient execution plans with very large, non-required and expensive intermediate results.
We use statistical summary information to achieve an efficient decomposition process.
Statistical Summaries
In general, one of the most effective techniques for optimizing the execution times of SQL queries is to select the relational execution plan based on accurate selectivity information for the query predicates.
We construct three Markov tables to store information about the frequency of occurrence of the distinct vertex labels, the distinct edge labels and the connections between pairs of edges.
Vertex Label  Frequency    Edge Label  Frequency    Edge Connection  Frequency
A             100          a           40           ab               3
B             200          c           5            ac               15
C             38           e           28           ae               45
D             4            l           54           ec               14
E             50           m           140          em               103
L             6            n           3            la               5
M             10           o           20           pc               18
N             250          p           15           px               45
O             3            x           8            xy               25
P             40           y           60           xz               2
R             55           z           15           za               1
Left: Markov table summary of vertex labels. Middle: Markov table summary of edge labels. Right: Markov table summary of pair-wise edge connections.
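Summaries of this kind can be collected in a single pass over the graph database. The following Python sketch shows one way to do it. The slides do not spell out what counts as an edge connection, so the sketch assumes that two edges are connected when the destination of the first is the source of the second; the function name and the input representation are also assumptions.

```python
from collections import Counter

def build_markov_tables(graphs):
    """Count label frequencies over a collection of member graphs.

    graphs: iterable of (vertices, edges), where vertices maps
    vertex id -> label and edges is a list of (source, dest, label).
    Returns (vertex_freq, edge_freq, connection_freq) as Counters.
    """
    vertex_freq, edge_freq, connection_freq = Counter(), Counter(), Counter()
    for vertices, edges in graphs:
        vertex_freq.update(vertices.values())
        edge_freq.update(label for _, _, label in edges)
        # Index outgoing edge labels by source vertex.
        outgoing = {}
        for source, _, label in edges:
            outgoing.setdefault(source, []).append(label)
        # Assumed definition: edge "x" followed head-to-tail by edge "y"
        # contributes one occurrence of the connection "xy".
        for _, dest, label in edges:
            for next_label in outgoing.get(dest, []):
                connection_freq[label + next_label] += 1
    return vertex_freq, edge_freq, connection_freq

# Hypothetical toy graph: A -x-> B -y-> C.
toy = [({1: "A", 2: "B", 3: "C"}, [(1, 2, "x"), (2, 3, "y")])]
vertex_freq, edge_freq, connection_freq = build_markov_tables(toy)
```

In a relational setting the same counts could be obtained with GROUP BY queries over the Vertices and Edges tables.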
Decomposition-Based and Selectivity-Aware SQL Translation
Identifying the pruning points.
Calculating the number of partitions.
Decomposed SQL translation:
  Blindly Single-Level Decomposition.
  Pruned Single-Level Decomposition.
  Pruned Multi-Level Decomposition.
Selectivity-aware Annotations.
Identifying the pruning points
Each vertex label, edge label or edge connection with low frequency is considered a pruning point in our relational evaluation mechanism.
Given a query graph q, we first check the structure of q against our summary Markov tables to identify the possible pruning points; NPP denotes their number.
Calculating the number of partitions
Suppose a sub-graph query q requires NJP join operations and the relational query engine can evaluate at most MJP join operations in one query.
The number of partitions (NOP) is then computed as NOP = NJP / MJP.
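Both steps are simple lookups and arithmetic, as the Python sketch below suggests. The slides do not quantify what counts as low frequency, so the threshold parameter is an assumption, as are the function names and the decision to round NJP / MJP up to a whole number of partitions.

```python
import math

def find_pruning_points(query_labels, frequency_table, threshold):
    """Return the query labels whose summary frequency is at most threshold.

    Labels missing from the table have frequency 0 and therefore qualify.
    """
    return [label for label in query_labels
            if frequency_table.get(label, 0) <= threshold]

def number_of_partitions(njp, mjp):
    """NOP = NJP / MJP, rounded up (assumption) and never below one."""
    return max(1, math.ceil(njp / mjp))

# Vertex label frequencies taken from the Markov table example.
vertex_freq = {"A": 100, "B": 200, "C": 38, "D": 4, "E": 50, "L": 6,
               "M": 10, "N": 250, "O": 3, "P": 40, "R": 55}
pruning_points = find_pruning_points(["A", "D", "O"], vertex_freq, threshold=10)
npp = len(pruning_points)
nop = number_of_partitions(njp=20, mjp=8)
```

With these inputs the rare labels D and O become pruning points, and a query needing 20 joins on an engine limited to 8 joins per query is split into three partitions.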
Blindly Single-Level Decomposition
If NPP = 0, we blindly decompose the query q into NOP partitions.
Each partition is translated into an intermediate evaluation step Si.
The final evaluation step joins all intermediate evaluation steps and adds the conjunctive conditions of the partitions' connectors.
Pruned Single-Level Decomposition
If NPP >= NOP, we distribute the pruning points across the NOP intermediate partitions.
This ensures a balanced, effective pruning of all intermediate results.
Pruned Multi-Level Decomposition
If NPP < NOP, we distribute the pruning points across a first level of intermediate results. An intermediate collective pruned step IPS is constructed by joining all the pruned first-level intermediate results.
IPS is used as an entry pruning point for the remaining (NOP − NPP) non-pruned partitions in a hierarchical multi-level fashion.
Each pruning point can be used to prune more than one partition (if possible).
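The three cases amount to a small decision rule on NPP and NOP. The Python sketch below encodes that rule and one possible way to spread pruning points over partitions. The round-robin assignment is an assumption made for illustration, since the slides only state that the points are distributed across the partitions; the function names are hypothetical too.

```python
def choose_decomposition(npp, nop):
    """Pick the decomposition strategy from the pruning-point and partition counts."""
    if npp == 0:
        return "blindly single-level"
    if npp >= nop:
        return "pruned single-level"
    return "pruned multi-level"

def distribute_pruning_points(pruning_points, nop):
    """Assign pruning points to nop partitions round-robin (assumed policy).

    When there are fewer points than partitions, the trailing partitions stay
    empty; those are the non-pruned partitions that the multi-level strategy
    would join against the intermediate pruned step.
    """
    partitions = [[] for _ in range(nop)]
    for index, point in enumerate(pruning_points):
        partitions[index % nop].append(point)
    return partitions

strategy = choose_decomposition(npp=1, nop=3)
assignment = distribute_pruning_points(["D", "O", "n"], nop=2)
```

Here a single pruning point for three partitions selects the multi-level strategy, while three points over two partitions are spread as evenly as the counts allow.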
[Figure: Selectivity-aware decomposition process. Panel (a): NPP > NOP. Panel (b): NPP < NOP. Each panel shows the query partitions S1, S2 and S3, their intermediate SQL steps S1-SQL, S2-SQL and S3-SQL, and the final evaluation step FES-SQL.]
Selectivity-aware Annotations
For any given SQL query, there are a large number of alternative execution plans. These alternative execution plans may differ significantly in their use of system resources or response time.
We use the statistical summary information to give hints to the query optimizer by injecting additional selectivity information for the individual query predicates into the SQL translations of the graph queries:
SELECT fieldlist FROM tablelist
WHERE Pi SELECTIVITY Si
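A translator could attach such annotations while it emits the label predicates. The Python sketch below follows the template above and estimates each Si as the relative frequency of the label in a Markov table. That estimate, the function name and the fixed six-digit formatting are assumptions, and whether a given engine accepts a SELECTIVITY clause is engine-specific.

```python
def annotate_predicates(predicates, frequency_table):
    """Render label predicates with a SELECTIVITY annotation.

    predicates: list of (column reference, label) pairs.
    frequency_table: dict label -> frequency (a Markov table).
    The selectivity is estimated as frequency / total (assumption).
    """
    total = sum(frequency_table.values())
    annotated = []
    for column, label in predicates:
        selectivity = frequency_table.get(label, 0) / total if total else 0.0
        annotated.append(f"{column} = '{label}' SELECTIVITY {selectivity:.6f}")
    return " AND ".join(annotated)

# Hypothetical two-label summary: 'a' covers a quarter of all edges.
where_clause = annotate_predicates(
    [("E1.edgeLabel", "a"), ("E2.edgeLabel", "b")],
    {"a": 1, "b": 3})
```

A production translator would bind the labels as parameters instead of inlining them as string literals.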
Experimental Results: Performance and Scalability
[Figure: The scalability of GraphREL. Both panels plot execution time (ms, logarithmic axis) against query size (Q4, Q8, Q12, Q16, Q20). Panel (a): synthetic datasets D2kV10E20L40M50, D10kV10E20L40M50, D50kV30E40L90M150 and D100kV30E40L90M150. Panel (b): DBLP datasets of 1MB, 10MB, 50MB and 100MB.]
Experimental Results: The Effect of Using Partitioned B-tree Indexes and Selectivity Injections
[Figure: The speedup improvement for the relational evaluation of sub-graph queries using partitioned B-tree indexes and selectivity-aware annotations, for the Synthetic and DBLP datasets and query sizes Q4 to Q20. Panel (a): partitioned B-tree indexes, y-axis "Percentage of Improvement (%)" from 0 to 100. Panel (b): injection of selectivity annotations, y-axis "Execution Times (ms)" from 0 to 40.]
QBP: An Application of GraphREL
Many of today’s information systems are driven by explicit process models.
A business process is a set of coordinated activities to achieve a specific business objective.
With the rapid increase in the number of process models, it becomes crucial for business process designers to be able to search their repository for models efficiently.
QBP is a query processor for business process models.
QBP is based on a new visual query language for business processes called BPMN-Q. The language addresses process definitions and extends the standard BPMN notation for modeling business processes for its concrete syntax.
A BPMN-Q query is treated as a graph that is matched against process graph(s).
[Figure: Example process model for a real-estate credit application. Activities: Customer applies for real-estate credit; Check credit rating; Check real-estate construction document; Check land register record; Prepare contract; Reject application; Offer loan protection insurance; Offer residence insurance. Conditions: Credit Rating [accepted] / [rejected]; Const. Doc. [valid] / [invalid]; Record [present] / [absent]; All OK.]
QBP: Application Architecture
[Figure: QBP application architecture. Business process designers use a Model Editor and a BPM-Q Query Editor. Components shown: Translation Middleware (BPMN, BPEL, EPC, UML ADs, ...), Semantic Query Expander, SQL-Based Query Processor and Relational Business Process Repository on an RDBMS. Flows labelled in the figure: updates, BPM-Q query, semantically expanded queries, SQL script, query results, result process models.]
BPMN-Q Query Constructs
Anonymous Activity: It is used to indicate unknown activities in a query. It resembles an activity but is distinguished by the @ sign at the beginning of the label.
Generic Node: It indicates an unknown node in a process. It could evaluate to any node type.
Generic Split: It refers to any type of split gateway.
Generic Join: It refers to any type of join gateway.
Negative Sequence Flow: It states that two nodes A and B are not directly related by sequence flow.
Path: It states that there must be a path from A to B. A query usually returns all paths.
Negative Path: It states that there is not any path between two nodes A and B.
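The path-related constructs correspond to reachability tests on the process graph. The Python sketch below illustrates them on an adjacency-list view of the sequence flow. It is not how QBP evaluates queries, which goes through SQL; the function names, the directed reading of Negative Path (from A to B) and the three-node example are assumptions.

```python
from collections import deque

def has_path(flows, source, target):
    """Path: True if target is reachable from source via one or more flows.

    flows: dict mapping a node to the list of its direct successors.
    """
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for successor in flows.get(node, []):
            if successor == target:
                return True
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False

def negative_path(flows, a, b):
    """Negative Path: no path leads from a to b."""
    return not has_path(flows, a, b)

def negative_sequence_flow(flows, a, b):
    """Negative Sequence Flow: a and b are not directly connected by a flow."""
    return b not in flows.get(a, [])

# Hypothetical process fragment: A -> B -> C.
flows = {"A": ["B"], "B": ["C"]}
path_a_c = has_path(flows, "A", "C")
no_direct_a_c = negative_sequence_flow(flows, "A", "C")
```

In this fragment a path from A to C exists although the two nodes are not directly connected, so a Path edge and a Negative Sequence Flow edge between A and C would both be satisfied.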
[Figure: Panel (a), BPMN-Q query example: a path edge (//) from "Customer applies for real-estate credit" to "Reject application". Panel (b), BPMN-Q query match, covering the activities Customer applies for real-estate credit, Check credit rating, Check real-estate construction document, Check land register record and Reject application.]
QBP: Use Cases
Searching the structure of the process models.
Compliance checking.
Detecting design anomalies.
Discovery of frequent process patterns/anti-patterns.
QBP: An Application of GraphREL
http://bpmnq.sourceforge.net/
Conclusions
GraphREL is a purely relational framework to store and query graph data.
In principle, GraphREL has the following advantages:
It can reside on any relational database system and exploit its well-known, mature query optimization techniques as well as its efficient and scalable query processing techniques.
It requires no time for offline or pre-processing steps.
It can handle both static and dynamic (frequently updated) graph databases very well.
The selectivity annotations for the SQL evaluation scripts give the relational query optimizers the ability to select the most efficient execution plans and to prune the non-required graph database members efficiently.
Future Work: Large Scale Graph Query Processing (e.g., Social Networks)
Future Work: Parallel Processing / MapReduce (HadoopDB)
Future Work: Storing and Querying Hypergraphs
References
[CIDR’03] G. Graefe. Sorting and Indexing with Partitioned B-Trees.
[VLDB’04] T. Grust, S. Sakr, and J. Teubner. XQuery on SQL Hosts.
[SIGMOD’07] T. Grust, M. Mayr, J. Rittinger, S. Sakr, and J. Teubner. A SQL:1999 Code Generator for the Pathfinder XQuery Compiler.
[VLDB’08] J. Teubner, T. Grust, S. Maneth, and S. Sakr. Dependable Cardinality Forecasts for XQuery.
[IJWIS’08] S. Sakr. Algebraic-Based XQuery Cardinality Estimation.
[DASFAA’09] S. Sakr. GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries.
[UNISCON’09] S. Sakr. Storing and Querying Graph Data Using Efficient Relational Processing Techniques.
[JDM’09] S. Sakr. Purely Relational Implementation of an XQuery Processor.
The End
Thank You