GraphREL: A Relational Graph Query Processor

GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries

Sherif Sakr

School of Computer Science and Engineering
University of New South Wales

http://www.cse.unsw.edu.au/~ssakr/

BIT Seminars ’09 - Free University of Bolzano, Italy

16 November 2009

S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 1 / 40

Outline

Previous Work: Pathfinder - Relational XQuery Compiler.

Current Work: GraphREL - General Graph Query Processor.

Future Work: Scalable Graph Query Processing for New Generation of Database Applications.







Pathfinder: A Relational XQuery Processor

Figure: Pathfinder compilation pipeline. An XQuery expression is compiled into relational algebra; a MIL code generator then emits MIL scripts for the Monet DBMS, and a SQL code generator emits SQL scripts for a conventional RDBMS.

http://pathfinder-xquery.org/

Pathfinder: A Relational XQuery Processor

Figure: Pathfinder architecture with cardinality estimation. Components shown: XML document, XPath Accelerator encoding tuples, XQuery expression, translation templates, relational algebra + special properties, estimation rules, cardinality properties, XQuery estimator, statistical guide, statistical histograms, system administrator, cardinality-properties-aware SQL generator, SQL scripts, conventional RDBMS, relational results, XML serializer, XML.

[VLDB’04] [VLDB’08] [SIGMOD’07] [IJWIS’09] [JDM’09]







GraphREL: Motivations

Graphs are among the most complicated and general forms of data structures.

Recently, they have been widely used to model many complex structured and schemaless data such as social networks, chemical compounds, biological pathways, spatial databases, semantic web and business process models.

Retrieving related graphs containing a query graph from a large graph database is a key performance issue in all of these graph-based applications.

The success of any graph database application is directly dependent on the efficiency of the graph indexing and query processing mechanisms.

RDBMSs have repeatedly shown that they are very efficient, scalable and successful in hosting different kinds of data.


Preliminaries: Graph Data Model

In labelled graphs, vertices and edges represent the entities and the relationships between them, respectively.

The attributes associated with these entities and relationships are called labels.

A graph database D is a collection of member graphs D = {g1, g2, ..., gn}, where each member graph gi is denoted as (V, E, Lv, Le):

V is the set of vertices.
E ⊆ V × V is the set of edges joining two distinct vertices.
Lv is the set of vertex labels.
Le is the set of edge labels.

Labelled graphs are classified according to the direction of their edges into two main classes:

1 Directed-labelled graphs such as XML, RDF and traffic networks.
2 Undirected-labelled graphs such as social networks and chemical compounds.


Preliminaries: Graph Queries

In principle, queries in graph databases can be broadly classified into the following main categories:

Subgraph queries: this category searches for a specific pattern in the graph database. The pattern can be either a small graph or a graph where some parts of it are uncertain, e.g., vertices with wildcard labels.

Supergraph queries: this category searches for the graph database members whose whole structures are contained in the input query.

Similarity (Approximate Matching) queries: this category finds graphs which are similar, but not necessarily isomorphic, to a given query graph.


Preliminaries: Subgraph Search Queries

Given a graph database D = {g1, g2, ..., gn} and a graph query q, a subgraph search returns the query answer set A = {gi | q ⊆ gi, gi ∈ D}.

A graph q is described as a sub-graph of a graph database member gi if the vertices and edges of q form a subset of the vertices and edges of gi.

Formally, g1(V1, E1, Lv1, Le1) is defined as a sub-graph of g2(V2, E2, Lv2, Le2) if and only if:

1 For every distinct vertex x ∈ V1 with a label vl ∈ Lv1, there is a distinct vertex y ∈ V2 with a label vl ∈ Lv2.

2 For every distinct edge ab ∈ E1 with a label el ∈ Le1, there is a distinct edge ab ∈ E2 with a label el ∈ Le2.
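The two conditions above can be checked directly by brute force. The following is a minimal sketch, not the method used by GraphREL: the dictionary-based graph representation, the function name `is_subgraph` and the toy graph are all assumptions made for illustration, and the search is exponential in the query size.

```python
from itertools import permutations


def is_subgraph(q_vertices, q_edges, g_vertices, g_edges):
    """Return True if query graph q is contained in graph g.

    Vertices are dicts {vertex_id: label}; edges are dicts
    {(source_id, destination_id): label}. This representation is an
    assumption of the sketch, not something the slides prescribe.
    """
    q_ids = list(q_vertices)
    # Try every assignment of distinct g vertices to the query vertices.
    for image in permutations(g_vertices, len(q_ids)):
        mapping = dict(zip(q_ids, image))
        # Condition 1: each query vertex maps to a vertex with the same label.
        if any(q_vertices[v] != g_vertices[mapping[v]] for v in q_ids):
            continue
        # Condition 2: each query edge maps to an edge with the same label.
        if all(g_edges.get((mapping[s], mapping[d])) == label
               for (s, d), label in q_edges.items()):
            return True
    return False


# Hypothetical toy graph: A -e-> C -m-> D
toy_vertices = {1: "A", 2: "C", 3: "D"}
toy_edges = {(1, 2): "e", (2, 3): "m"}

# The query C -m-> D is contained in the toy graph; A -m-> D is not.
contained = is_subgraph({"u": "C", "v": "D"}, {("u", "v"): "m"},
                        toy_vertices, toy_edges)
not_contained = is_subgraph({"u": "A", "v": "D"}, {("u", "v"): "m"},
                            toy_vertices, toy_edges)
```

The cost of enumerating every vertex assignment is one reason to push the matching work into an indexed query engine instead.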


Preliminaries: Subgraph Search Queries

Figure: An example graph database and graph query. (a) Sample graph database with member graphs g1, g2 and g3. (b) Graph query q. Vertices carry labels from {A, B, C, D}; edges carry labels from {e, f, m, n, x, y, z}.


Our Approach: GraphREL

Relational encoding of graph data.

SQL translation of sub-graph search queries.

Filtering phase.

Optional verification phase.

Partitioned B-tree Indexes.

Statistical Summaries.

Decomposition-Based and Selectivity-Aware SQL Translation.


Relational Encoding of Graph Data

The starting point of our relational framework is to find an efficient and suitable encoding for each graph member gi in the graph database D.

We use the Vertex-Edge mapping scheme for storing directed labelled graphs with the following structure:

Vertices(graphID, vertexID, vertexLabel)

Edges(graphID, sVertex, dVertex, edgeLabel)
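As a minimal sketch of this scheme, the two tables can be created and populated in an in-memory SQLite database. The column types, the helper names `create_graph_store` and `store_graph`, and the dictionary input format are assumptions made for illustration; the slides do not prescribe a particular DBMS or DDL.

```python
import sqlite3


def create_graph_store():
    """Create an in-memory database holding the Vertex-Edge mapping tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER,
                               vertexLabel TEXT);
        CREATE TABLE Edges (graphID INTEGER, sVertex INTEGER,
                            dVertex INTEGER, edgeLabel TEXT);
    """)
    return conn


def store_graph(conn, graph_id, vertices, edges):
    """Encode one member graph as rows of the two tables.

    vertices: {vertex_id: label}; edges: {(source_id, dest_id): label}.
    """
    conn.executemany(
        "INSERT INTO Vertices VALUES (?, ?, ?)",
        [(graph_id, vid, label) for vid, label in vertices.items()])
    conn.executemany(
        "INSERT INTO Edges VALUES (?, ?, ?, ?)",
        [(graph_id, s, d, label) for (s, d), label in edges.items()])
    conn.commit()


# Hypothetical member graph with three vertices and two directed edges.
conn = create_graph_store()
store_graph(conn, 1, {1: "A", 2: "C", 3: "D"}, {(1, 2): "e", (2, 3): "m"})
vertex_count = conn.execute("SELECT COUNT(*) FROM Vertices").fetchone()[0]
edge_count = conn.execute("SELECT COUNT(*) FROM Edges").fetchone()[0]
```

Every member graph shares the same two tables, so adding a graph only appends rows and never changes the schema.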


Relational Encoding of Graph Data

Figure: Relational encoding of the sample graphs g1 (six vertices, numbered 1 to 6) and g2 (five vertices, numbered 1 to 5).

Vertices Table

graphID  vertexID  vertexLabel
1        1         A
1        2         A
1        3         D
1        4         A
1        5         C
1        6         B
2        1         A
2        2         C
2        3         D
2        4         C
2        5         B

Edges Table

graphID  sVertex  dVertex  edgeLabel
1        1        2        n
1        1        3        m
1        2        3        n
1        4        3        x
1        5        4        x
1        6        5        y
1        5        2        z
1        1        6        m
2        1        2        e
2        2        3        m
2        4        3        m
2        4        2        n
2        5        4        x
2        1        5        f


SQL Translation of Graph Queries

Filtering Phase: a sub-graph query q with a set of vertices QV of size m and a set of edges QE of size n is evaluated using the following SQL translation template:

SELECT DISTINCT V1.graphID, Vi.vertexID
FROM Vertices AS V1, ..., Vertices AS Vm, Edges AS E1, ..., Edges AS En
WHERE ∀ i = 2..m (V1.graphID = Vi.graphID)
AND ∀ j = 1..n (V1.graphID = Ej.graphID)
AND ∀ i = 1..m (Vi.vertexLabel = QVi.vertexLabel)
AND ∀ j = 1..n (Ej.edgeLabel = QEj.edgeLabel)
AND ∀ j = 1..n (Ej.sVertex = Vf.vertexID AND Ej.dVertex = Vt.vertexID);

Here Vf and Vt denote the aliases of the source and destination vertices of query edge QEj.

Verification Phase: an optional phase that verifies that the vertices in the set of filtered vertices for each candidate graph are distinct. It is applied only if more than one vertex in the set of query vertices QV has the same label, and it is easily implemented by comparing vertex IDs.
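The template can be instantiated mechanically from a query graph. The sketch below is one possible reading of it, run against SQLite for illustration: the function names `translate` and `answer_graphs`, the input format, and the choice to join each query edge to a source alias and a destination alias are assumptions. The sample rows are those of g1 and g2 from the relational encoding example.

```python
import sqlite3


def translate(query_vertices, query_edges):
    """Build the filtering-phase SQL for a sub-graph query.

    query_vertices: list of vertex labels; the i-th label gets alias Vi.
    query_edges: list of (source_index, dest_index, edge_label) with
    1-based indexes into query_vertices. Returns (sql_text, parameters).
    """
    m, n = len(query_vertices), len(query_edges)
    columns = ["V1.graphID"] + [f"V{i}.vertexID" for i in range(1, m + 1)]
    tables = ([f"Vertices AS V{i}" for i in range(1, m + 1)]
              + [f"Edges AS E{j}" for j in range(1, n + 1)])
    conditions, params = [], []
    for i in range(2, m + 1):                      # same graph for all vertices
        conditions.append(f"V1.graphID = V{i}.graphID")
    for j in range(1, n + 1):                      # same graph for all edges
        conditions.append(f"V1.graphID = E{j}.graphID")
    for i, label in enumerate(query_vertices, start=1):
        conditions.append(f"V{i}.vertexLabel = ?")
        params.append(label)
    for j, (src, dst, label) in enumerate(query_edges, start=1):
        conditions.append(f"E{j}.edgeLabel = ?")
        params.append(label)
        conditions.append(f"E{j}.sVertex = V{src}.vertexID "
                          f"AND E{j}.dVertex = V{dst}.vertexID")
    sql = (f"SELECT DISTINCT {', '.join(columns)} FROM {', '.join(tables)} "
           f"WHERE {' AND '.join(conditions)}")
    return sql, params


def answer_graphs(conn, query_vertices, query_edges):
    """Run the filtering query, then apply the verification phase."""
    sql, params = translate(query_vertices, query_edges)
    answers = set()
    for graph_id, *vertex_ids in conn.execute(sql, params):
        # Verification: the matched vertices must be pairwise distinct.
        if len(set(vertex_ids)) == len(vertex_ids):
            answers.add(graph_id)
    return sorted(answers)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT);
    CREATE TABLE Edges (graphID INTEGER, sVertex INTEGER, dVertex INTEGER,
                        edgeLabel TEXT);
""")
conn.executemany("INSERT INTO Vertices VALUES (?, ?, ?)", [
    (1, 1, "A"), (1, 2, "A"), (1, 3, "D"), (1, 4, "A"), (1, 5, "C"),
    (1, 6, "B"), (2, 1, "A"), (2, 2, "C"), (2, 3, "D"), (2, 4, "C"),
    (2, 5, "B")])
conn.executemany("INSERT INTO Edges VALUES (?, ?, ?, ?)", [
    (1, 1, 2, "n"), (1, 1, 3, "m"), (1, 2, 3, "n"), (1, 4, 3, "x"),
    (1, 5, 4, "x"), (1, 6, 5, "y"), (1, 5, 2, "z"), (1, 1, 6, "m"),
    (2, 1, 2, "e"), (2, 2, 3, "m"), (2, 4, 3, "m"), (2, 4, 2, "n"),
    (2, 5, 4, "x"), (2, 1, 5, "f")])
```

For example, `answer_graphs(conn, ["C", "D"], [(1, 2, "m")])` asks for graphs containing a C vertex with an m-labelled edge to a D vertex. Each additional query vertex or edge adds one more table alias to the FROM clause, which is why the generated query grows quickly with query size.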


Partitioned B-tree Indexes

Partitioned B-tree indexing is a slight variant of the B-tree indexing structure.

The main idea is the use of low-selectivity leading columns to maintain partitions within the associated B-tree.

In labelled graphs, it is generally the case that the number of distinct vertex and edge labels is far smaller than the number of vertices and edges, respectively.

For example, an index defined on the columns (vertexLabel, graphID) can reduce the access cost of a sub-graph query with only one label to a single disk page. In contrast, an index defined on the two columns (graphID, vertexLabel) requires scanning a large number of disk pages.

Partitioned B-tree indexes on the high-selectivity attributes achieve fixed execution times that no longer depend on the size of the whole graph database.
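The column-order argument can be tried out with ordinary composite indexes. The sketch below uses SQLite, which has no dedicated partitioned B-tree structure, so a regular B-tree index with the label as its leading column only approximates the idea. The index names, the helper functions and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT);
    CREATE TABLE Edges (graphID INTEGER, sVertex INTEGER, dVertex INTEGER,
                        edgeLabel TEXT);
    -- Label first: all index entries for one label are stored contiguously.
    CREATE INDEX idx_vertex_label ON Vertices (vertexLabel, graphID);
    CREATE INDEX idx_edge_label ON Edges (edgeLabel, graphID);
""")
conn.executemany("INSERT INTO Vertices VALUES (?, ?, ?)",
                 [(1, 1, "A"), (1, 3, "D"), (2, 1, "A"), (2, 2, "C")])


def graphs_with_label(label):
    """Return the IDs of graphs containing a vertex with the given label."""
    rows = conn.execute(
        "SELECT DISTINCT graphID FROM Vertices "
        "WHERE vertexLabel = ? ORDER BY graphID", (label,))
    return [graph_id for (graph_id,) in rows]


def query_plan(sql, params=()):
    """Return the textual plan steps SQLite reports for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


plan = query_plan("SELECT graphID FROM Vertices WHERE vertexLabel = ?", ("A",))
```

Inspecting `plan` shows whether the engine chose the label-leading index for the single-label lookup; the exact wording of the plan text depends on the SQLite version.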


Limitations of SQL-Based Translation Approach

An obvious problem of the SQL translation template is that it involves a large number of conjunctive SQL predicates and join operations between the encoding tables.

Most relational query engines fail to execute the SQL translations of medium-sized or large sub-graph queries because the queries are too long and too complex (this does not necessarily mean they are too expensive).

Therefore, we need a decomposition mechanism to divide this large and complex SQL translation query into a sequence of intermediate queries.

Applying this decomposition mechanism blindly may lead to inefficient execution plans with very large, unnecessary and expensive intermediate results.

We use statistical summary information to achieve an efficient decomposition process.


Statistical Summaries

In general, one of the most effective techniques for optimizing the execution times of SQL queries is to select the relational execution plan based on accurate selectivity information for the query predicates.

We construct three Markov tables to store information about the frequency of occurrence of the distinct vertex labels, the distinct edge labels and the connections between pairs of vertices (edges).

Markov Table summary of vertices labels

Vertex Label  Frequency
A             100
B             200
C             38
D             4
E             50
L             6
M             10
N             250
O             3
P             40
R             55

Markov Table summary of edges labels

Edge Label  Frequency
a           40
c           5
e           28
l           54
m           140
n           3
o           20
p           15
x           8
y           60
z           15

Markov Table summary of pair-wise edge connections

Edge Label Connection  Frequency
ab                     3
ac                     15
ae                     45
ec                     14
em                     103
la                     5
pc                     18
px                     45
xy                     25
xz                     2
za                     1
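Summaries of this kind can be derived from the encoded rows with simple counting. The sketch below is an illustration under stated assumptions: the function name `build_markov_tables` is hypothetical, and two edges are treated as connected when the destination vertex of the first is the source vertex of the second, which is one plausible reading of "pair-wise edge connections". The sample rows are those of graph g2 from the relational encoding example.

```python
from collections import Counter, defaultdict


def build_markov_tables(vertex_rows, edge_rows):
    """Count label frequencies from encoded graph rows.

    vertex_rows: iterable of (graphID, vertexID, vertexLabel).
    edge_rows: iterable of (graphID, sVertex, dVertex, edgeLabel).
    Returns (vertex_freq, edge_freq, connection_freq) as Counters.
    """
    edge_rows = list(edge_rows)
    vertex_freq = Counter(label for _, _, label in vertex_rows)
    edge_freq = Counter(label for _, _, _, label in edge_rows)

    # Group edge labels by (graph, source vertex) to find follow-on edges.
    outgoing = defaultdict(list)
    for graph_id, source, _, label in edge_rows:
        outgoing[(graph_id, source)].append(label)

    # Assumed definition: edge e1 connects to e2 if e1 ends where e2 starts.
    connection_freq = Counter()
    for graph_id, _, dest, label in edge_rows:
        for next_label in outgoing.get((graph_id, dest), []):
            connection_freq[label + next_label] += 1
    return vertex_freq, edge_freq, connection_freq


g2_vertices = [(2, 1, "A"), (2, 2, "C"), (2, 3, "D"), (2, 4, "C"), (2, 5, "B")]
g2_edges = [(2, 1, 2, "e"), (2, 2, 3, "m"), (2, 4, 3, "m"),
            (2, 4, 2, "n"), (2, 5, 4, "x"), (2, 1, 5, "f")]
vertex_freq, edge_freq, connection_freq = build_markov_tables(
    g2_vertices, g2_edges)
```

Because the tables hold only counts per label, they stay small even when the graph database itself is large.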


Decomposition-Based and Selectivity-Aware SQL Translation

Identifying the pruning points.

Calculating the number of partitions.

Decomposed SQL translation:

    Blindly Single-Level Decomposition.

    Pruned Single-Level Decomposition.

    Pruned Multi-Level Decomposition.

Selectivity-aware Annotations.



Identifying the pruning points

Each vertex label, edge label or edge connection with low frequency is considered a pruning point in our relational evaluation mechanism.

Given a query graph q, we first check the structure of q against our summary Markov tables to identify the possible pruning points; their number is denoted NPP.

Calculating the number of partitions

Suppose a sub-graph query q requires NJP join operations, and that the relational query engine can evaluate at most MJP join operations in one query.

The number of partitions (NOP) is then computed as NOP = NJP / MJP.
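Both quantities are straightforward to compute once the Markov tables are available. The sketch below is illustrative only: the slides do not state what counts as "low frequency" or how NJP / MJP is rounded, so the `threshold` parameter and the rounding up to a whole number of partitions are assumptions, as are the function names. The frequencies are a subset of the vertex-label table shown earlier.

```python
import math


def find_pruning_points(query_items, frequency_table, threshold):
    """Return the query labels whose frequency is at most the threshold.

    A label missing from the table is treated as frequency 0, so it also
    qualifies as a pruning point.
    """
    return [item for item in query_items
            if frequency_table.get(item, 0) <= threshold]


def number_of_partitions(njp, mjp):
    """NOP = NJP / MJP, rounded up (assumed) and never less than one."""
    if mjp <= 0:
        raise ValueError("MJP must be positive")
    return max(1, math.ceil(njp / mjp))


# Subset of the vertex-label Markov table from the slides.
vertex_freq = {"A": 100, "B": 200, "C": 38, "D": 4}

# Hypothetical threshold of 5 occurrences.
points = find_pruning_points(["A", "D", "C"], vertex_freq, threshold=5)
npp = len(points)

# Hypothetical query needing 10 joins on an engine limited to 4 per query.
nop = number_of_partitions(10, 4)
```

The same `find_pruning_points` call can be repeated with the edge-label and edge-connection tables, and the three results combined to obtain NPP for the whole query.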



Blindly Single-Level Decomposition

If NPP = 0 ⇒ we blindly decompose the query q into NOP partitions.
Each partition is translated into an intermediate evaluation step Si.
The final evaluation step joins all intermediate evaluation steps and adds the conjunctive conditions of the partition’s connectors.

Pruned Single-Level Decomposition

If NPP >= NOP ⇒ we distribute the pruning points across the different intermediate NOP partitions.
It ensures a balanced effective pruning of all intermediate results.

Pruned Multi-Level Decomposition

If NPP < NOP ⇒ we distribute the pruning points across a first level of intermediate results of NOP partitions. An intermediate collective pruned step IPS is constructed by joining all the pruned first-level intermediate results.
IPS is used as an entry pruning point for the remaining (NOP − NPP) non-pruned partitions in a hierarchical multi-level fashion.
Each pruning point can be used to prune more than one partition (if possible).
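The choice between the three strategies depends only on NPP and NOP. The sketch below shows that selection together with one simple way of spreading pruning points over partitions. The function name `plan_decomposition`, the round-robin assignment, and the reading of the multi-level case as 0 < NPP < NOP are assumptions; the slides do not specify how points are assigned.

```python
def plan_decomposition(pruning_points, nop):
    """Pick a decomposition strategy and assign pruning points to partitions.

    Returns (strategy_name, partitions), where partitions is a list of NOP
    lists holding the pruning points assigned to each partition.
    """
    if nop < 1:
        raise ValueError("NOP must be at least 1")
    npp = len(pruning_points)
    partitions = [[] for _ in range(nop)]
    if npp == 0:
        # No selective labels: split the query without any guidance.
        return "blind single-level", partitions
    # Round-robin assignment (an assumption) keeps the pruning balanced.
    for position, point in enumerate(pruning_points):
        partitions[position % nop].append(point)
    if npp >= nop:
        return "pruned single-level", partitions
    # Fewer points than partitions: the empty partitions would be entered
    # through the joined result of the pruned ones (the IPS step).
    return "pruned multi-level", partitions


single = plan_decomposition(["D", "n", "xz"], 2)
multi = plan_decomposition(["D"], 3)
blind = plan_decomposition([], 2)
```

In the multi-level result, the partitions left with an empty list are the (NOP − NPP) non-pruned partitions that rely on the IPS step.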



Figure: Selectivity-aware decomposition process. (a) NPP > NOP. (b) NPP < NOP. Each panel shows the intermediate steps S1, S2 and S3 (S1-SQL, S2-SQL, S3-SQL) and the step FES (FES-SQL).



Selectivity-aware Annotations

For any given SQL query, there are a large number of alternative execution plans. These alternative execution plans may differ significantly in their use of system resources or response time.

We use the statistical summary information to give hints to the query optimizer by injecting additional selectivity information for the individual query predicates into the SQL translations of the graph queries.

SELECT fieldlist FROM tablelist WHERE Pi SELECTIVITY Si
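Generating such annotations amounts to attaching an estimated selectivity to each predicate string. The sketch below only builds the SQL text in the shape of the template above; whether an engine accepts a SELECTIVITY clause, and in what exact syntax, varies by product. Estimating selectivity as a label's frequency divided by the total frequency is an assumption, as are the function names and the sample counts.

```python
def label_selectivity(label, frequency_table):
    """Estimate selectivity as the label's share of all occurrences."""
    total = sum(frequency_table.values())
    if total == 0:
        return 0.0
    return frequency_table.get(label, 0) / total


def annotate(predicates):
    """Join (predicate, selectivity) pairs into an annotated WHERE clause."""
    parts = [f"{predicate} SELECTIVITY {selectivity:.6f}"
             for predicate, selectivity in predicates]
    return "WHERE " + " AND ".join(parts)


# Hypothetical vertex-label counts.
vertex_freq = {"A": 3, "D": 1}

where_clause = annotate([
    ("V1.vertexLabel = 'A'", label_selectivity("A", vertex_freq)),
    ("V2.vertexLabel = 'D'", label_selectivity("D", vertex_freq)),
])
```

Keeping the annotation step separate from the translation step means the same generated query can be emitted with or without hints, depending on what the target engine supports.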


Experimental Results: Performance and Scalability

Figure: The scalability of GraphREL. Both panels plot execution time (ms) against query size (Q4, Q8, Q12, Q16, Q20). (a) Synthetic dataset, with series D2kV10E20L40M50, D10kV10E20L40M50, D50kV30E40L90M150 and D100kV30E40L90M150. (b) DBLP dataset, with series 1MB, 10MB, 50MB and 100MB.


Experimental Results: The effect of using Partitioned B-tree Indexes and Selectivity Injections

Figure: The speedup improvement for the relational evaluation of sub-graph queries using partitioned B-tree indexes and selectivity-aware annotations. Both panels cover query sizes Q4, Q8, Q12, Q16 and Q20 for the Synthetic and DBLP datasets. (a) Partitioned B-tree indexes: percentage of improvement (%). (b) Injection of selectivity annotations: vertical axis labelled execution times (ms).


QBP: An Application of GraphREL

Many of today’s Information Systems are driven by explicit process models.

A business process is a set of coordinated activities to achieve a specific business objective.

With the rapid and incremental increase in the number of process models, it becomes crucial for business process designers to be able to search their repository for models efficiently.

QBP is a query processor for business process models.

QBP is based on a new visual query language for business processes called BPMN-Q. The language addresses process definitions and extends the standard BPMN notation for modeling business processes, which it uses as its concrete syntax.

A BPMN-Q query is considered to be a graph which is matched against process graph(s).



Figure: Example business process model for a real-estate credit application. Activities: Customer applies for real-estate credit; Check credit rating; Check real-estate construction document; Check land register record; Prepare contract; Reject application; Offer loan protection insurance; Offer residence insurance. Annotations: Credit Rating [rejected], Credit Rating [accepted], Const. Doc. [invalid], Const. Doc. [valid], Record [absent], Record [present], All OK.


QBP: Application Architecture

Figure: QBP application architecture. Business process designers work with a model editor and a BPMN-Q query editor. Components: translation middleware (BPMN, BPEL, EPC, UML ADs, ...), relational business process repository, semantic query expander, SQL-based query processor and RDBMS. Labelled flows: updates, BPMN-Q query, semantically expanded queries, SQL script, query results, result process models.


BPMN-Q Query Constructs

Anonymous Activity: It is used to indicate unknown activities in a query. It resembles an activity but is distinguished by the @ sign at the beginning of the label.

Generic Node: It indicates an unknown node in a process. It could evaluate to any node type.

Generic Split: It refers to any type of split gateway.

Generic Join: It refers to any type of join gateway.

Negative Sequence Flow: It states that two nodes A and B are not directly related by sequence flow.

Path: It states that there must be a path from A to B. A query usually returns all paths.

Negative Path: It states that there is not any path between two nodes A and B.




Figure: (a) BPMN-Q query example, showing the labels “real-estate credit” and “Reject application” and the symbol //. (b) BPMN-Q query match, showing the labels “real-estate credit”, “Check credit rating”, “Check real-estate construction document”, “Check land register record” and “Reject application”.


QBP: Use Cases

Searching the structure of the process models.

Compliance checking.

Detecting design anomalies.

Discovery of frequent process patterns/anti-patterns.



http://bpmnq.sourceforge.net/


Conclusions

GraphREL is a purely relational framework to store and query graph data.

In principle, GraphREL has the following advantages:

It can reside on any relational database system and exploits its well-known, mature query optimization techniques as well as its efficient and scalable query processing techniques.

It has no required time cost for offline or pre-processing steps.

It can handle static and dynamic (with frequent updates) graph databases very well.

The selectivity annotations for the SQL evaluation scripts provide the relational query optimizers with the ability to select the most efficient execution plans and to prune unneeded graph database members efficiently.







Future Work: Large Scale Graph Query Processing (e.g., Social Networks)


Future Work: Parallel Processing / MapReduce (HadoopDB)


Future Work: Storing and Querying Hypergraphs


References

[CIDR’03] G. Graefe. Sorting and Indexing with Partitioned B-Trees.

[VLDB’04] T. Grust, S. Sakr, and J. Teubner. XQuery on SQL Hosts.

[SIGMOD’07] T. Grust, M. Mayr, J. Rittinger, S. Sakr, and J. Teubner. A SQL:1999 Code Generator for the Pathfinder XQuery Compiler.

[VLDB’08] J. Teubner, T. Grust, S. Maneth, and S. Sakr. Dependable Cardinality Forecasts for XQuery.

[IJWIS’08] S. Sakr. Algebraic-Based XQuery Cardinality Estimation.

[DASFAA’09] S. Sakr. GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries.

[UNISCON’09] S. Sakr. Storing and Querying Graph Data Using Efficient Relational Processing Techniques.

[JDM’09] S. Sakr. Purely Relational Implementation of an XQuery Processor.


The End

Thank You

