
Sakr S, Al-Naymat G. Efficient relational techniques for processing graph queries. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(6): 1237–1255 Nov. 2010. DOI 10.1007/s11390-010-1098-z

Efficient Relational Techniques for Processing Graph Queries

Sherif Sakr1,2, Member, ACM, and Ghazi Al-Naymat1, Member, ACM

1School of Computer Science and Engineering, University of New South Wales, Sydney, Australia

2Managing Complexity Group, National ICT Australia (NICTA), ATP, Sydney, Australia

E-mail: {ssakr, ghazi}@cse.unsw.edu.au

Received February 22, 2010; revised August 19, 2010.

Abstract Graphs are widely used for modeling complicated data such as social networks, chemical compounds, protein interactions and the semantic web. To effectively understand and utilize any collection of graphs, a graph database that efficiently supports elementary querying mechanisms is crucially required. For example, Subgraph and Supergraph queries are important types of graph queries which have many applications in practice. A primary challenge in computing the answers of graph queries is that pair-wise comparisons of graphs are usually hard problems. Relational database management systems (RDBMSs) have repeatedly been shown to be able to efficiently host different types of data such as complex objects and XML data. RDBMSs derive much of their performance from sophisticated optimizer components which make use of physical properties that are specific to the relational model such as sortedness, proper join ordering and powerful indexing mechanisms. In this article, we study the problem of indexing and querying graph databases using the relational infrastructure. We present a purely relational framework for processing graph queries. This framework relies on building a layer of graph features knowledge which captures metadata and summary features of the underlying graph database. We describe different querying mechanisms which make use of the layer of graph features knowledge to achieve scalable performance for processing graph queries. Finally, we conduct an extensive set of experiments on real and synthetic datasets to demonstrate the efficiency and the scalability of our techniques.

Keywords graph database, graph query, subgraph query, supergraph query

1 Introduction

The graph is a powerful tool for representing and understanding objects and their relationships in various application domains. Recently, graphs have been widely used to model many complex structured and schemaless data such as the semantic web[1], social networks[2], biological networks[3], protein networks[4], chemical compounds[5-6] and business process models[7]. The growing popularity of graph databases has generated interesting data management problems. Graph querying problems such as subgraph search[8-14], supergraph search[15-16], approximate subgraph matching[17-18] and correlation subgraph query[19-20] have attracted a lot of attention from the database community. In principle, retrieving related graphs containing a query graph from a large graph database is a key performance issue in any graph-based application. Therefore, it is important to develop efficient algorithms for processing queries on graph databases. Among the many graph-based search operations, in this article we focus on the following two main types of graph queries which have generated a lot of interest in practical applications:

1) Subgraph Queries. This category searches for a specific pattern in the graph database. Formally, given a graph database D = {g1, g2, ..., gn} and a subgraph query q, the answer set A = {gi | q ⊆ gi, gi ∈ D}.

2) Supergraph Queries. This category searches for graphs with structures that are fully contained in the input query. Formally, given a graph database D = {g1, g2, ..., gn} and a supergraph query q, the answer set A = {gi | q ⊇ gi, gi ∈ D}.

Clearly, performing a sequential scan over the entire graph database D to check whether each graph database member gi belongs to the answer set of a graph query q is inefficient and very time-consuming. Hence, there is a clear necessity to build graph indices in order to improve the performance of processing graph queries.

Relational database management systems (RDBMSs) have repeatedly shown that they are very efficient, scalable and successful in hosting types of data which were formerly not anticipated to be stored inside relational databases, such as complex objects[21], spatio-temporal data[22] and XML data[23]. In addition, RDBMSs have shown their ability to handle vast amounts of data very efficiently using their powerful indexing mechanisms. In this article we focus on employing the powerful features of the relational infrastructure to implement efficient mechanisms for processing graph queries. In particular, we present a purely relational framework to speed up the search efficiency in the context of graph databases.

Regular Paper
©2010 Springer Science+Business Media, LLC & Science Press, China

In our approach, the graph database is first encoded using an intuitive fixed relational mapping scheme, after which the graph query is translated into a sequence of SQL evaluation steps over the defined storage scheme. An obvious problem in the relational-based evaluation of graph queries is the large cost which may result from the large number of join operations that must be performed between the encoding relations. In order to overcome this problem, we exploit an observation from our previous work, namely that the size of the intermediate results dramatically affects the overall evaluation performance of SQL scripts[24-26]. Hence, we construct a metadata and summary layer of graph features knowledge which consists of three main components:

1) A descriptor record for the main features of each graph database member;

2) An aggregated summary of the unique features of all graph database members;

3) An index structure of the most pruning (least frequently occurring) nodes and edges in the underlying graph database.

Using this layer of graph features knowledge, we can apply effective and efficient pruning algorithms to filter out as many as possible of the false-positive graphs that are guaranteed not to exist in the final results of the graph query before passing the candidate result set to a light-weight verification process. Moreover, we attempt to achieve the maximum performance improvement for our relational execution plans in two main ways:

a) Utilizing the existing powerful partitioned B-trees indexing mechanism[27] to reduce the access costs of the secondary storage to a minimum[28];

b) Using the information of the graph features knowledge to guide the relational query optimizer towards the most efficient join order and the cheapest execution plan[29], by providing selectivity annotations of the translated query predicates in order to reduce the size of the intermediate results to a minimum.

We summarize the main contributions of this article as follows.

• We present a purely relational framework for evaluating graph queries. In this framework, we encode graph databases using a fixed relational schema and translate the graph queries into standard SQL scripts.

• We describe an approach of building a metadata and summary layer about the underlying graph database, called the graph features knowledge layer. This layer extracts the main features of each graph database member and keeps information about the most pruning features in the graph database.

• We describe efficient mechanisms for processing subgraph and supergraph queries. These mechanisms use the information of the graph features knowledge layer to apply effective and efficient pruning strategies in order to reduce the search space to a minimum.

• We exploit the powerful features of the relational infrastructure such as partitioned B-trees indexes and selectivity annotations to reduce the secondary storage access costs and the size of the intermediate results of our SQL scripts to a minimum.

• We empirically evaluate the efficiency and the scalability of our techniques through an extensive set of experiments.

The remainder of this article is organized as follows. Section 2 defines the preliminary concepts used in this article. Section 3 discusses different relational encoding schemes for storing graph databases. Section 4 presents the components of the layer of the graph features knowledge. Sections 5 and 6 present our mechanisms to utilize the graph features knowledge to achieve an efficient relational processing for subgraph and supergraph queries respectively. Section 7 describes our decomposition mechanism to avoid the complexity of the generated SQL scripts for large graph queries. We demonstrate the efficiency and scalability of our mechanisms by conducting an extensive set of experiments which are described in Section 8. Related work is surveyed in Section 9 before we finally conclude the article in Section 10.

2 Preliminaries

This section introduces the fundamental definitions used in this article and gives the formal problem statement.

Definition 1 (Labeled Graph). A labeled graph g is denoted as (V, E, Lv, Le, Fv, Fe) where V is the set of vertices; E ⊆ V × V is the set of edges joining two distinct vertices; Lv is the set of vertex labels; Le is the set of edge labels; Fv is a function V → Lv that assigns labels to vertices and Fe is a function E → Le that assigns labels to edges.

In principle, labeled graphs are classified according to the direction of their edges into two main classes: 1) Directed-labeled graphs such as XML and RDF. 2) Undirected-labeled graphs such as social networks and chemical compounds. The following inequalities usually hold: |V| ≫ |Lv| and |E| ≫ |Le|.

This article focuses on directed-labeled and connected simple graphs (we refer to them as graphs in the rest of the article). However, the algorithms proposed in the article can be easily extended to other kinds of graphs.

Definition 2 (Graph Database). A graph database D is a collection of member graphs gi where D = {g1, g2, ..., gn}.

In principle, there are two main types of graph databases. The first type consists of a small number of very large graphs such as the Web graph and social networks. The second type consists of a large set of small graphs such as chemical compounds and biological pathways. The main focus of this article is the second type of graph databases.

Definition 3 (Subgraph Isomorphism). Let g1 = (V1, E1, Lv1, Le1, Fv1, Fe1) and g2 = (V2, E2, Lv2, Le2, Fv2, Fe2) be two graphs. g1 is defined as subgraph-isomorphic to g2 if and only if there exists at least one injective function f : V1 → V2 such that: 1) for any edge uv ∈ E1, there is an edge f(u)f(v) ∈ E2; 2) Fv1(u) = Fv2(f(u)) and Fv1(v) = Fv2(f(v)); 3) Fe1(uv) = Fe2(f(u)f(v)).

The subgraph search and supergraph search problems in graph databases are formulated as follows.

Definition 4 (Subgraph Search). Given a graph database D = {g1, g2, ..., gn} and a subgraph query q, the query answer set A = {gi | q ⊆ gi, gi ∈ D}.

Definition 5 (Supergraph Search). Given a graph database D = {g1, g2, ..., gn} and a supergraph query q, the query answer set A = {gi | q ⊇ gi, gi ∈ D}.

Fig.1 provides illustrative examples for the graph database querying operations. Fig.1(a) describes the subgraph search problem while Fig.1(b) describes the supergraph search problem. These two problems are the main focus of this article.

Fig.1. Graph database querying. (a) Subgraph search problem. (b) Supergraph search problem.

3 Relational Encoding of Graph Databases

The starting point of any relational framework for processing graph queries is to select a suitable encoding for each graph member gi in the graph database D. In principle, several relational encodings can be used to store graph structured databases[30-31]. In this section, we discuss two alternative mapping schemes that can be used in our framework and compare their advantages and disadvantages. However, the graph query processing algorithms of this article are independent of the underlying relational schema and can be adapted to deal with any encoding.

3.1 Vertex-Edge Mapping Scheme

The first alternative relational mapping scheme for storing graph databases is the Vertex-Edge mapping scheme. It is a simple and intuitive encoding scheme where each graph database member gi is assigned a unique identity graphID. Each vertex is assigned a sequence number (vertexID) inside its graph. Each vertex is represented by one tuple in a single table (Vertices table) which stores all vertices of the graph database. Each vertex is identified by the graphID of the graph to which it belongs and by its vertexID. Additionally, each vertex tuple has an attribute that stores the vertex label. Similarly, all edges of the graph database are stored in a single table (Edges table) where each edge is represented by a single tuple. Each edge tuple describes the graph database member to which the edge belongs, the ID of the source vertex of the edge, the ID of the destination vertex of the edge and the edge label. Therefore, the relational storage scheme of our Vertex-Edge mapping is described as follows:

• Vertices(graphID, vertexID, vertexLabel),
• Edges(graphID, sVertex, dVertex, edgeLabel).

Fig.2(a) illustrates an example of the Vertex-Edge relational mapping scheme of a sample graph database.
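As a minimal sketch of the Vertex-Edge scheme, the following Python snippet creates the two relations in an in-memory SQLite database and stores two tiny graphs. The column types, the primary key, the helper names (`create_vertex_edge_schema`, `store_graph`) and the sample data are assumptions made for illustration only; the article prescribes only the attribute names of the two relations.

```python
import sqlite3

def create_vertex_edge_schema(conn):
    # Attribute names follow the Vertex-Edge scheme; types and keys are assumed.
    conn.executescript("""
        CREATE TABLE Vertices (
            graphID     INTEGER NOT NULL,
            vertexID    INTEGER NOT NULL,
            vertexLabel TEXT    NOT NULL,
            PRIMARY KEY (graphID, vertexID)
        );
        CREATE TABLE Edges (
            graphID   INTEGER NOT NULL,
            sVertex   INTEGER NOT NULL,
            dVertex   INTEGER NOT NULL,
            edgeLabel TEXT    NOT NULL
        );
    """)

def store_graph(conn, graph_id, vertex_labels, edges):
    """Store one graph: vertexID is the 1-based position in vertex_labels,
    edges is a list of (sVertex, dVertex, edgeLabel) triples."""
    conn.executemany(
        "INSERT INTO Vertices VALUES (?, ?, ?)",
        [(graph_id, i, label) for i, label in enumerate(vertex_labels, start=1)],
    )
    conn.executemany(
        "INSERT INTO Edges VALUES (?, ?, ?, ?)",
        [(graph_id, s, d, label) for s, d, label in edges],
    )

conn = sqlite3.connect(":memory:")
create_vertex_edge_schema(conn)
store_graph(conn, 1, ["A", "B", "C"], [(1, 2, "x"), (2, 3, "y")])
store_graph(conn, 2, ["A", "B"], [(1, 2, "x")])
```

Each graph contributes one row per vertex and one row per edge, so updating a graph touches only its own tuples.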

3.2 Edge-Edge Mapping Scheme

The second alternative relational mapping scheme for storing graph databases is the Edge-Edge mapping scheme. This scheme starts by assigning a unique identifier to each graph database member, vertex and edge in the graph database. Each edge in the graph database is represented by a single tuple in the encoding relation. In addition to the ID of the graph member, each tuple stores the IDs and the labels of the source and destination vertices. Hence, the relational storage scheme of the Edge-Edge mapping is described as follows:

• EdgeEdge(gID, eID, eLabel, sVID, sVLabel, dVID, dVLabel).

Fig.2(b) illustrates an example of the Edge-Edge relational mapping scheme of graph databases.

Fig.2. Relational encoding of graph databases. (a) Vertex-Edge mapping scheme. (b) Edge-Edge mapping scheme.

3.3 Comparison and Discussion

In principle, the Edge-Edge scheme can be considered as a denormalized version of the Vertex-Edge scheme. It groups the information of all vertices and edges into one relation. Therefore, on the one hand, the Edge-Edge scheme is considered to be more efficient for querying purposes. For example, retrieving a subgraph which consists of two nodes and an edge connecting them requires a single lookup in the Edge-Edge table, while in the Vertex-Edge scheme it requires joining the Vertices table with the Edges table. In general, let us assume that we would like to retrieve a graph q that consists of m vertices and n edges. Using the Vertex-Edge scheme, the number of joins between the encoding relations is equal to m + n. However, using the Edge-Edge scheme, it is reduced to only n join operations. On the other hand, the Vertex-Edge scheme is more suitable for dynamic graph databases (with frequent updates). It allows adding, deleting or updating the structure of existing graphs with no special processing, while the Edge-Edge scheme suffers from update anomalies because the label of each vertex is duplicated in the tuple of every edge to which the vertex belongs.
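The trade-off can be illustrated with a small sketch that derives an Edge-Edge relation from Vertex-Edge tables and evaluates the same one-edge pattern (a vertex labeled A connected by an edge labeled x to a vertex labeled B) on both encodings. The sample data, the use of SQLite's `rowid` as the edge identifier and the concrete queries are assumptions for illustration, not part of the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INT, vertexID INT, vertexLabel TEXT);
    CREATE TABLE Edges (graphID INT, sVertex INT, dVertex INT, edgeLabel TEXT);
    INSERT INTO Vertices VALUES (1,1,'A'),(1,2,'B'),(1,3,'C'),(2,1,'A'),(2,2,'C');
    INSERT INTO Edges VALUES (1,1,2,'x'),(1,2,3,'y'),(2,1,2,'x');

    -- Denormalized Edge-Edge relation: vertex labels are copied into each edge tuple.
    CREATE TABLE EdgeEdge AS
    SELECT e.graphID AS gID, e.rowid AS eID, e.edgeLabel AS eLabel,
           s.vertexID AS sVID, s.vertexLabel AS sVLabel,
           d.vertexID AS dVID, d.vertexLabel AS dVLabel
    FROM Edges e
    JOIN Vertices s ON s.graphID = e.graphID AND s.vertexID = e.sVertex
    JOIN Vertices d ON d.graphID = e.graphID AND d.vertexID = e.dVertex;
""")

# Vertex-Edge encoding: the pattern needs the Edges table joined with two
# instances of the Vertices table.
vertex_edge_sql = """
    SELECT DISTINCT e.graphID
    FROM Edges e, Vertices s, Vertices d
    WHERE s.graphID = e.graphID AND d.graphID = e.graphID
      AND s.vertexID = e.sVertex AND d.vertexID = e.dVertex
      AND s.vertexLabel = 'A' AND e.edgeLabel = 'x' AND d.vertexLabel = 'B'
"""

# Edge-Edge encoding: the same pattern is a single-table selection.
edge_edge_sql = """
    SELECT DISTINCT gID FROM EdgeEdge
    WHERE sVLabel = 'A' AND eLabel = 'x' AND dVLabel = 'B'
"""

vertex_edge_result = sorted(r[0] for r in conn.execute(vertex_edge_sql))
edge_edge_result = sorted(r[0] for r in conn.execute(edge_edge_sql))
```

Both queries return the same graphs; the price of the cheaper Edge-Edge lookup is that relabeling one vertex requires updating every edge tuple that mentions it.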

In the remainder of this article, we base the discussion of our algorithms on the Vertex-Edge mapping scheme. However, in Subsection 8.2 we compare the performance of the query processing algorithms using the two alternative storage schemes.

4 Graph Features Knowledge

In order to improve the efficiency of the SQL-based approach for evaluating graph queries, we follow the idea of automatic creation and usage of higher-level data about data (metadata)[32-34], here referred to as graph features knowledge (GFK). We construct a layer of summary data which consists of three main components: graph descriptors, aggregate graph descriptors and Nodes-Edges Markov Summaries. We describe these three components in this section. In the next sections, we describe how to make use of these components in order to improve the performance of processing graph queries by effectively pruning the search space, reducing the size of intermediate results and guiding the relational query optimizer to select the most efficient join order and the cheapest execution plan.

4.1 Graph Descriptors

This component maintains a description record about the features of each graph database member in the graph database. Each graph descriptor includes the following components:

• No. Nodes (NN): the total number of nodes in the graph;

• No. Edges (NE): the total number of edges in the graph;

• Unique Node Labels (NL): the number of unique node labels in the graph;

• Unique Edge Labels (EL): the number of unique edge labels in the graph;

• Maximum In-node Degree (MID): the maximum number of incoming edges for a node in the graph;

• Maximum Out-node Degree (MOD): the maximum number of outgoing edges for a node in the graph;

• Maximum io-node Degree (MD): the maximum total number of incoming and outgoing edges for a node in the graph;

• Maximum Path Length (MPL): the length of the longest path in the graph, where a path is a sequence of consecutive edges and the length of the path is the number of edges traversed.


Fig.3 illustrates an example of a graph descriptor table for a sample graph database. In principle, the components of the graph descriptor can vary from one database to another based on the characteristics of the underlying graph database members. For example, we can avoid storing the components which store the number of unique node and edge labels if there is a guarantee that each node or edge in a graph has a unique label. Conversely, we can add more descriptive fields such as the number of occurrences of the maximum path length or the number of occurrences of the maximum (in/out/io)-node degrees.

Fig.3. Sample graph descriptors table.

On the one hand, the more features are indexed, the higher the pruning power of the filtering phase, but also the higher the space cost. On the other hand, indexing fewer features leads to poorer pruning power in the filtering process and consequently to longer query processing times. In general, computing the components of our graph descriptor (except the maximum path length, which is recursive in nature) can be achieved in a straightforward way using SQL-based aggregate queries. For example, computing the number of nodes graph descriptor component (NN) can be represented using the following SQL query:

SELECT graphID, COUNT(*) AS NN
FROM Vertices
GROUP BY graphID;

Similarly, computing the number of edges descriptor component (NE) can be represented using the following SQL query:

SELECT graphID, COUNT(*) AS NE
FROM Edges
GROUP BY graphID;
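The remaining non-recursive descriptor components can be computed in the same aggregate style. The sketch below is one possible formulation over the Vertex-Edge tables, run through SQLite from Python; the queries for NL, EL, MID, MOD and MD, the `QUERIES` dictionary and the `graph_descriptors` helper are assumptions written for illustration and are not taken from the article. MPL is omitted because, as noted above, it is recursive in nature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INT, vertexID INT, vertexLabel TEXT);
    CREATE TABLE Edges (graphID INT, sVertex INT, dVertex INT, edgeLabel TEXT);
    INSERT INTO Vertices VALUES (1,1,'A'),(1,2,'B'),(1,3,'B');
    INSERT INTO Edges VALUES (1,1,2,'x'),(1,1,3,'x'),(1,2,3,'y');
""")

# One aggregate query per descriptor component (assumed formulations).
QUERIES = {
    "NN": "SELECT graphID, COUNT(*) FROM Vertices GROUP BY graphID",
    "NE": "SELECT graphID, COUNT(*) FROM Edges GROUP BY graphID",
    "NL": "SELECT graphID, COUNT(DISTINCT vertexLabel) FROM Vertices GROUP BY graphID",
    "EL": "SELECT graphID, COUNT(DISTINCT edgeLabel) FROM Edges GROUP BY graphID",
    # Maximum in-degree: count edges per destination vertex, then take the max.
    "MID": """SELECT graphID, MAX(c) FROM
              (SELECT graphID, COUNT(*) AS c FROM Edges GROUP BY graphID, dVertex)
              GROUP BY graphID""",
    # Maximum out-degree: count edges per source vertex, then take the max.
    "MOD": """SELECT graphID, MAX(c) FROM
              (SELECT graphID, COUNT(*) AS c FROM Edges GROUP BY graphID, sVertex)
              GROUP BY graphID""",
    # Maximum total degree: every edge contributes once to each endpoint.
    "MD": """SELECT graphID, MAX(c) FROM
             (SELECT graphID, v, COUNT(*) AS c FROM
                (SELECT graphID, sVertex AS v FROM Edges
                 UNION ALL
                 SELECT graphID, dVertex AS v FROM Edges)
              GROUP BY graphID, v)
             GROUP BY graphID""",
}

def graph_descriptors(conn):
    """Return {graphID: {component: value}} for all stored graphs."""
    descriptors = {}
    for name, sql in QUERIES.items():
        for graph_id, value in conn.execute(sql):
            descriptors.setdefault(graph_id, {})[name] = value
    return descriptors

descriptors = graph_descriptors(conn)
```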

4.2 Aggregated Graph Descriptors

This component represents a compact aggregated version of the graph descriptors in the following form:

• (BID, NN, NE, NL, EL, MID, MOD, MD, MPL, graphCount).

It groups the graphs with the same features into one bucket identified by its bucket ID (BID). Each bucket has a graphCount attribute which represents the number of graph database members that share the features identified by the values of the bucket components. Therefore, the aggregated graph descriptors component can be considered as a guide summary of the features of the graph database members[35]. The size of this aggregated version is usually far smaller than the size of the descriptor table and can easily fit into main memory. Based on a defined threshold t, we create an auxiliary inverted list[36] of the related graph database members for each summary bucket whose number of graph database members (graphCount) is less than t. The main idea of these inverted lists is to allow cheap and quick access to the few graphs which share the features of that bucket.
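The bucketing step can be sketched in a few lines of Python. The function name `aggregate_descriptors`, the dictionary layout of a bucket and the shortened four-component feature tuples in the example are assumptions for illustration; only the grouping rule and the threshold t come from the text.

```python
from collections import defaultdict

def aggregate_descriptors(descriptors, t):
    """Group graphs with identical feature tuples into buckets.

    descriptors: {graphID: feature tuple}
    Returns (buckets, inverted_lists); an inverted list is kept only for
    buckets with fewer than t member graphs.
    """
    groups = defaultdict(list)
    for graph_id, features in sorted(descriptors.items()):
        groups[features].append(graph_id)

    buckets, inverted_lists = [], {}
    for bid, (features, members) in enumerate(groups.items(), start=1):
        buckets.append({"BID": bid, "features": features, "graphCount": len(members)})
        if len(members) < t:
            inverted_lists[bid] = members
    return buckets, inverted_lists

# Example with shortened feature tuples (NN, NE, NL, EL) to keep it readable.
sample = {1: (3, 3, 2, 2), 2: (3, 3, 2, 2), 3: (3, 3, 2, 2), 4: (5, 4, 3, 1)}
buckets, inverted_lists = aggregate_descriptors(sample, t=2)
```

In this example the first bucket summarizes three graphs and gets no inverted list, while the second bucket holds a single graph that can be reached directly through its inverted list.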

4.3 Nodes-Edges Markov Summaries

This component uses three Markov tables to store information about the frequency of occurrence of the distinct labels of vertices, the distinct labels of edges and the connections between pairs of vertices (edges) in the whole of the underlying graph database[37]. Fig.4 presents examples of these Markov tables. Due to the requirements of our query processing algorithms (Sections 5, 6 and 7), we are only interested in vertex and edge information with low frequency. Therefore, we summarize these Markov tables by deleting high-frequency tuples up to a defined threshold freqt for each table. This threshold value is defined according to the available memory budget. Moreover, for any vertex label, edge label or pair of vertices which appears in a number of graph database members less than another predefined threshold gct, we create an inverted list with all its related graphs.

Fig.4. Sample nodes-edges Markov summaries.
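A rough sketch of building one such summary table follows. It assumes that "deleting high-frequency tuples" means dropping every entry whose frequency is at least freqt; the function name `build_summary`, the input format and this reading of the threshold are assumptions and may differ from the authors' implementation. The same function can be applied to vertex labels, edge labels or label pairs.

```python
from collections import Counter, defaultdict

def build_summary(occurrences, freq_t, gct):
    """Build a pruned frequency table and inverted lists for one feature kind.

    occurrences: iterable of (graphID, feature), one entry per occurrence.
    freq_t: entries with frequency >= freq_t are dropped (assumed reading).
    gct: features appearing in fewer than gct graphs get an inverted list.
    """
    frequency = Counter()
    graphs = defaultdict(set)
    for graph_id, feature in occurrences:
        frequency[feature] += 1
        graphs[feature].add(graph_id)

    table = {f: c for f, c in frequency.items() if c < freq_t}
    inverted_lists = {f: sorted(graphs[f]) for f in table if len(graphs[f]) < gct}
    return table, inverted_lists

# Vertex-label occurrences of a three-graph sample database.
vertex_occurrences = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A")]
vertex_table, vertex_lists = build_summary(vertex_occurrences, freq_t=3, gct=2)
```

Here the frequent label A is dropped from the summary, while the rare labels B and C are kept together with the graphs that contain them.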

5 Relational Processing of Subgraph Search Queries

In this section, we present our query processing and optimization techniques for processing subgraph queries. Subsection 5.1 presents our framework for answering graph queries. Subsection 5.2 describes the general case of the SQL-based evaluation of subgraph queries. Subsection 5.3 shows how the components of the layer of the graph features knowledge are used to effectively improve the query performance. Subsection 5.5 describes using the powerful features of the relational indexing infrastructure to further optimize the features-based evaluation and boost the performance of the query processing.

5.1 Query Processing Framework

Fig.5 depicts an overview of our relational framework for processing graph queries. The processing of graph queries in our approach goes through the following main steps:

1) Database Preprocessing. In this step we analyze the underlying graph database to build the components of the layer of the graph features knowledge (GFK).

2) Determining the Pruning Strategy. The query processor recognizes the features of the input graph query and probes the components of the GFK with the query features to identify the pruning strategy of the filtering phase.

3) Filtering. This step uses the outcome of probing the GFK to apply an effective and efficient pruning strategy for filtering the graph database members. It discards many false-positive graphs (graphs that cannot belong to the query answer) and produces a list of candidate graph database members for the query result.

4) Verification. This step validates each candidate graph to check whether it really belongs to the answer set of the query graph.

Fig.5. Framework of processing graph queries.

Clearly, the query response time consists of the filtering time and the verification time. The fewer candidates remain after the filtering, the faster the query response time. Hence, the pruning power of the filtering process is critical to the overall query performance.

5.2 Subgraph Query Evaluation: General Case

Let us assume that we have a subgraph query q which consists of a set of vertices QV of size m and a set of edges QE of size n. Using the Vertex-Edge mapping scheme (Subsection 3.1), we employ the following SQL-based filtering-and-verification mechanism to compute the answer set of this type of query.

Filtering Phase. In this phase we specify the set of graph database members which contain the set of vertices and edges that describe the subgraph query using the following SQL template:

1 SELECT DISTINCT V1.graphID
2 FROM Vertices AS V1, ..., Vertices AS Vm,
3      Edges AS E1, ..., Edges AS En
4 WHERE
5   ∀ i=2..m (V1.graphID = Vi.graphID)
6   AND ∀ j=1..n (V1.graphID = Ej.graphID)
7   AND ∀ i=1..m (Vi.vertexLabel = QVi.vertexLabel)
8   AND ∀ j=1..n (Ej.edgeLabel = QEj.edgeLabel)
9   AND ∀ j=1..n (Ej.sVertex = Vf(QEj.sVertex).vertexID AND
10      Ej.dVertex = Vf(QEj.dVertex).vertexID);

where each referenced table Vi (Line 2) represents an instance of the table Vertices and maps the information of one vertex of the set of subgraph query vertices QV. Similarly, each referenced table Ej (Line 3) represents an instance of the table Edges and maps the information of one edge of the set of subgraph query edges QE. f is the mapping function between each vertex of QV and its associated Vertices table instance Vi. Line 5 represents a set of m − 1 conjunctive conditions to ensure that all queried vertices belong to the same graph. Similarly, Line 6 represents a set of n conjunctive conditions to ensure that all queried edges belong to the same graph as the queried vertices. Lines 7 and 8 represent the set of conjunctive predicates on the vertex and edge labels respectively. Lines 9 and 10 represent the topology and connection information between the mapped vertices.
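To make the template concrete, the following sketch instantiates it for a given query graph and runs it on SQLite. The function `build_filter_sql`, the 0-based query representation and the sample database are assumptions for illustration; the mapping function f is realized simply by giving query vertex i the alias V(i+1).

```python
import sqlite3

def build_filter_sql(q_vertices, q_edges):
    """Instantiate the filtering template for one subgraph query.

    q_vertices: list of vertex labels (query vertex i maps to alias V{i+1}).
    q_edges: list of (source_index, destination_index, edge_label), 0-based.
    Returns (sql, params) with labels passed as bound parameters.
    """
    m, n = len(q_vertices), len(q_edges)
    tables = [f"Vertices AS V{i}" for i in range(1, m + 1)]
    tables += [f"Edges AS E{j}" for j in range(1, n + 1)]
    # Same-graph conditions (template lines 5 and 6).
    conds = [f"V1.graphID = V{i}.graphID" for i in range(2, m + 1)]
    conds += [f"V1.graphID = E{j}.graphID" for j in range(1, n + 1)]
    # Label predicates (template lines 7 and 8).
    conds += [f"V{i}.vertexLabel = ?" for i in range(1, m + 1)]
    conds += [f"E{j}.edgeLabel = ?" for j in range(1, n + 1)]
    # Topology conditions (template lines 9 and 10).
    conds += [
        f"E{j}.sVertex = V{s + 1}.vertexID AND E{j}.dVertex = V{d + 1}.vertexID"
        for j, (s, d, _) in enumerate(q_edges, start=1)
    ]
    params = list(q_vertices) + [label for _, _, label in q_edges]
    sql = ("SELECT DISTINCT V1.graphID FROM " + ", ".join(tables)
           + " WHERE " + " AND ".join(conds))
    return sql, params

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INT, vertexID INT, vertexLabel TEXT);
    CREATE TABLE Edges (graphID INT, sVertex INT, dVertex INT, edgeLabel TEXT);
    INSERT INTO Vertices VALUES (1,1,'A'),(1,2,'B'),(1,3,'C'),
                                (2,1,'A'),(2,2,'B'),(2,3,'C'),
                                (3,1,'A'),(3,2,'C');
    INSERT INTO Edges VALUES (1,1,2,'x'),(1,2,3,'y'),
                             (2,1,2,'x'),(2,1,3,'y'),
                             (3,1,2,'x');
""")

# Query: A -x-> B -y-> C
sql, params = build_filter_sql(["A", "B", "C"], [(0, 1, "x"), (1, 2, "y")])
candidates = sorted(row[0] for row in conn.execute(sql, params))
```

For this path query of m = 3 vertices and n = 2 edges the generated statement references five table instances, which illustrates why the number of joins grows with the size of the query.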

Verification Phase. This phase is optional. We apply the verification process only if more than one vertex of the set of query vertices QV have the same label. In this case we need to verify that the vertices in the set of filtered vertices for each candidate graph database member gi are distinct. This can be easily achieved using their vertex IDs. Although the conditions of the verification process could be injected into the SQL translation template of the filtering phase, we found that it is more efficient to avoid the cost of evaluating these conditions over every graph database member gi by delaying their processing, if required, to a separate phase after pruning the candidate list.
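The need for this check can be seen on a query with two vertices that carry the same label. In the sketch below, the query asks for a vertex A with x-labeled edges to two different vertices labeled B; the filtering conditions alone also accept a graph with a single B, because both query vertices can bind to it. The hard-coded SQL, the sample data and the helper `verify_candidates` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vertices (graphID INT, vertexID INT, vertexLabel TEXT);
    CREATE TABLE Edges (graphID INT, sVertex INT, dVertex INT, edgeLabel TEXT);
    INSERT INTO Vertices VALUES (1,1,'A'),(1,2,'B'),
                                (2,1,'A'),(2,2,'B'),(2,3,'B');
    INSERT INTO Edges VALUES (1,1,2,'x'),
                             (2,1,2,'x'),(2,1,3,'x');
""")

# Filtering conditions for the query A -x-> B, A -x-> B, extended to return
# the vertex IDs bound to each query vertex.
BINDINGS_SQL = """
    SELECT V1.graphID, V1.vertexID, V2.vertexID, V3.vertexID
    FROM Vertices AS V1, Vertices AS V2, Vertices AS V3,
         Edges AS E1, Edges AS E2
    WHERE V1.graphID = V2.graphID AND V1.graphID = V3.graphID
      AND V1.graphID = E1.graphID AND V1.graphID = E2.graphID
      AND V1.vertexLabel = 'A' AND V2.vertexLabel = 'B' AND V3.vertexLabel = 'B'
      AND E1.edgeLabel = 'x' AND E2.edgeLabel = 'x'
      AND E1.sVertex = V1.vertexID AND E1.dVertex = V2.vertexID
      AND E2.sVertex = V1.vertexID AND E2.dVertex = V3.vertexID
"""

def verify_candidates(conn):
    """Return (candidates, verified): a candidate is verified if at least one
    of its bindings maps the query vertices to pairwise distinct vertex IDs."""
    candidates, verified = set(), set()
    for graph_id, *vertex_ids in conn.execute(BINDINGS_SQL):
        candidates.add(graph_id)
        if len(set(vertex_ids)) == len(vertex_ids):
            verified.add(graph_id)
    return sorted(candidates), sorted(verified)

candidates, verified = verify_candidates(conn)
```

Graph 1 passes the filter but is rejected by the distinctness check, whereas graph 2 has two different B vertices and is kept.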

5.3 Features-Based Query Optimization

Subsection 5.2 presented the general case of the SQL-based evaluation of subgraph queries. In this subsection we show how our query processor uses the components of the graph features knowledge (GFK) to reduce the search space and optimize the query performance in several ways. Given a subgraph query q, the subgraph query evaluation process starts by applying the following two main steps in order to determine its query evaluation strategy:

1) Subgraph Query Bucket Matching. In this step we compute the subgraph query descriptor (qd) and then use the aggregated graph descriptors component (AGD) to identify the set of buckets in which the features of the graph query are fully contained. The subgraph query descriptor is considered to be fully contained in a bucket if and only if the value of each component in the query descriptor is less than or equal to its associated component in the bucket. This subgraph bucket matching process is defined as follows.

Definition 6 (Subgraph Bucket Match). Given an aggregate graph descriptor AGD = {b1, b2, ..., bn} and a subgraph query descriptor qd, the matched bucket set MB = {bi | qd ⊆ bi, bi ∈ AGD} where qd ⊆ bi iff qd.NN ≤ bi.NN ∧ qd.NE ≤ bi.NE ∧ qd.NL ≤ bi.NL ∧ qd.EL ≤ bi.EL ∧ qd.MID ≤ bi.MID ∧ qd.MOD ≤ bi.MOD ∧ qd.MD ≤ bi.MD ∧ qd.MPL ≤ bi.MPL.

Fig.6 illustrates an example of the matching process of a subgraph query descriptor against the buckets of an aggregate graph descriptor.

Fig.6. Bucket matching of subgraph query descriptor.
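Definition 6 translates directly into a component-wise comparison. The following sketch uses plain dictionaries for descriptors and buckets; the names `FEATURES`, `is_contained` and `matched_buckets` and the sample values are assumptions for illustration.

```python
FEATURES = ("NN", "NE", "NL", "EL", "MID", "MOD", "MD", "MPL")

def is_contained(qd, bucket):
    """qd is contained in bucket iff every component of qd is <= the bucket's."""
    return all(qd[f] <= bucket[f] for f in FEATURES)

def matched_buckets(qd, agd):
    """Return the buckets of the aggregated descriptors that contain qd."""
    return [b for b in agd if is_contained(qd, b)]

agd = [
    {"BID": 1, "NN": 5, "NE": 6, "NL": 3, "EL": 2, "MID": 2, "MOD": 3,
     "MD": 4, "MPL": 4, "graphCount": 12},
    {"BID": 2, "NN": 3, "NE": 2, "NL": 2, "EL": 1, "MID": 1, "MOD": 1,
     "MD": 2, "MPL": 2, "graphCount": 3},
]
qd = {"NN": 4, "NE": 3, "NL": 2, "EL": 2, "MID": 1, "MOD": 2, "MD": 2, "MPL": 3}

mb = matched_buckets(qd, agd)
matched_ids = [b["BID"] for b in mb]
```

The second bucket is rejected because its graphs have only three nodes while the query has four, so none of its member graphs can contain the query.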

2) Pruning Features Identification. We use the Markov summary tables to determine the most pruning (least frequently occurring) query features as follows:

• A vertex v in the subgraph query q is defined as a pruning vertex if its label frequency is less than a predefined threshold freqv.

• An edge e in the subgraph query q is defined as a pruning edge if its label frequency is less than a predefined threshold freqe.

• Two connected vertices are defined as a pruning pair if the connection frequency of their labels is less than a predefined threshold freqp.

Based on the outcome of the subgraph query bucketmatching and the pruning features identification steps,the query evaluation strategy of the filtering phase isdetermined and executed as depicted in the illustratedAlgorithms in Figs. 7 and 8. The details of these twoalgorithms are described as follows.• Case SubStrategy 1. If the count of the matched

buckets is equal to zero then it means that the featuresof the subgraph query are not contained in any graphdatabase member. Therefore, we end the query eva-luation process by returning an empty result set. Inthis case, using the information of the aggregate graphdescriptors component is quite useful as it avoids anyaccess to the underlying graph database.• Case SubStrategy 2. If any of the query pruning

features appears in a number of graphs that is less thanthe defined threshold gct then we unfold its associatedinverted list to retrieve and verify its related graphsto compute the answer set of the subgraph query q.If we have more than one prunning feature satisfyingthis condition then we select the one with the smallestsearch space.• Case SubStrategy 3. If the count of the matched

buckets is greater than zero and the size of eachmatched bucket (bi.graphCount) is less than the de-fined threshold t then we unfold their associated in-verted lists to retrieve and verify their related graph


database members in order to compute the answer setof the subgraph query q.• If the conditions of the two cases SubStrategy2

and SubStrategy3 are both satisfied then we comparebetween their search spaces and apply the one with thesmaller search space.• Case SubStrategy 4. If we have a number of

matched buckets with a size that is less than the thresh-old t and a number of matched buckets with a sizegreater than the threshold t then for each matchedbucket with a size that is less than t, we unfold its as-sociated inverted list to retrieve and verify the related

Require: q: subgraph query,AD: aggregate descriptors,MVv : Markov table of vertices labels,MVe: Markov table of edge labels,MVp: Markov table of connected pair of vertices,t: threshold of bucket size with an inverted list,gct: threshold of an inverted list size for a Markov table

entryEnsure: SubStrategy : integer

pruningFeature : Markov table entrymb: array of matched aggregated descriptors entries

1 SubStrategy :=0;2 qd:= ComputeGraphQueryDescriptor(q)3 mb:= IdentifySubMatchedBuckets(qd ,AD)4 if mb.length = 0 then5 SubStrategy:= 16 return7 end if8 pf := IdentifyPruningFeatures(q,MVv ,Mve,Mvp)9 if pf.length > 0 then

10 pfSearchSpace := gct + 111 for i = 1 to pf.length do12 if pf [i].graphCount < fbSearchSpace then13 SubStrategy := 214 pruningFeature := pf [i]15 pfSearchSpace := pf [i].graphCount16 end if17 end for18 end if19 fbSearchSpace := 020 for i = 1 to mb.length do21 if mb[i].graphCount > t then22 if SubStrategy = 0 then23 SubStrategy := 424 end if25 exit for26 else27 fbSearchSpace+= mb[i].graphCount28 end if29 if SubStrategy = 2 then30 if pfSearchSpace > fbSearchSpace then31 SubStrategy := 332 end if33 else34 SubStrategy= 335 end if36 end for

Fig.7. IdentifySubStrategy: Determines the subgraph query eva-

luation strategy.

Require: D: graph database, q: subgraph query,GD: graph descriptors, AD: agg. descriptors,MVv : Markov table of vertices labels,MVe: Markov table of edge labels,MVp: Markov table of connected pair of vertices,t: threshold of bucket size with an inverted list,gct: threshold of an inverted list size for a Markov table

entryEnsure: ASGQ: answer set of the subgraph query (q)1: IdentifySubStrategy(q, AD, MVv , MVe, MVp, t, gct,

SubStrategy, pruningFeature, mb)2 if SubStrategy = 1 then3 return empty4 end if5 if SubStrategy = 2 then6 for all g ∈ pruningFeature.InvertedList.graphs do7 if Verify(g, q) then8 ASGQ.ADD(g)9 end if

10 end for11 return ASGQ12 end if13 if SubStrategy = 3 then14 for i = 1 to mb.length do15 for all g ∈ mb[i].InvertedList.graphs do16 if Verify(g, q) then17 ASGQ.ADD(g)18 end if19 end for20 end for21 return ASGQ22 end if23 if SubStrategy = 4 then24 for i = 1 to mb.length do25 if mb[i].graphCount 6 t then26 for all g ∈ mb[i].InvertedList.graphs do27 if Verify(g, q) then28 ASGQ.ADD(g)29 end if30 end for31 else32 cg := SQLFeaturedFilter(D, q,GD ,mb[i])33 for all g ∈ cg do34 if Verify(g, q) then35 ASGQ.ADD(g)36 end if37 end for38 end if39 end for40 return ASGQ41 end if

Fig.8. Features-based evaluation of subgraph queries.

graphs where the target graphs are added to the query answer set. For each matched bucket (bs) with a size greater than t, we apply a features-based filtering phase where the bucket features and the graph descriptor component are used to limit the search space. The SQL template of this features-based filtering phase is depicted in Fig.9.

In this template, m and n are the numbers of vertices and edges of the subgraph query q, respectively. Each referenced table Vi and Ej maps the information of one of the query vertices and query edges, respectively. Lines 7∼13 use the graph descriptor information to retrieve only the graph database members which satisfy the features of the matched buckets. It should be noted that each candidate graph database member can belong to only one of the intermediate results, based on the matching of its graph descriptor with the unique features of each matched bucket. Thus, the intermediate results are guaranteed to be disjoint and duplicate-free. However, an optional verification phase is required if and only if more than one vertex of the set of subgraph query vertices have the same label. In this case, we need to verify that each vertex in the set of filtered vertices for each candidate graph is distinct. This can be easily achieved using their vertexID. Although the conditions of the verification process could be injected into the SQL translation template of the features-based filtering phase, we found that it is more efficient to avoid the cost of evaluating these conditions over every graph database member by delaying their processing (if required) to a separate phase after pruning the candidate list.

1  SELECT DISTINCT V1.graphID
2  FROM Vertices as V1, ..., Vertices as Vm,
3       Edges as E1, ..., Edges as En,
4       GraphDescriptors as GD
5  WHERE
6       V1.graphID = GD.graphID
7       AND GD.NN = bs.NN
8       AND GD.NE = bs.NE
9       AND GD.NL = bs.NL
10      AND GD.EL = bs.EL
11      AND GD.MID = bs.MID
12      AND GD.MOD = bs.MOD
13      AND GD.MD = bs.MD
14      AND ∀_{i=2}^{m} (V1.graphID = Vi.graphID)
15      AND ∀_{j=1}^{n} (V1.graphID = Ej.graphID)
16      AND ∀_{i=1}^{m} (Vi.vertexLabel = QVi.vertexLabel)
17      AND ∀_{j=1}^{n} (Ej.edgeLabel = QEj.edgeLabel)
18      AND ∀_{j=1}^{n} (Ej.sVertex = Vf.vertexID AND
19                       Ej.dVertex = Vf.vertexID);

Fig.9. Features-based SQL template for the filtering phase.
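To make the template concrete, the following Python sketch expands it for one small query and one matched bucket. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function name `features_based_sql`, the 1-based edge tuples and the dictionary form of the bucket are hypothetical, and the bucket values are inlined as literals where the template writes bs.NN, bs.NE and so on.

```python
from typing import Dict, List, Tuple


def features_based_sql(vertex_labels: List[str],
                       edges: List[Tuple[int, int, str]],
                       bucket: Dict[str, int]) -> str:
    """Expand the features-based template for one query and one bucket.

    vertex_labels[i - 1] is the label of query vertex i. Each edge is a
    (source vertex, destination vertex, edge label) tuple with 1-based
    vertex positions. Labels are inlined without escaping, which is
    acceptable for a sketch but not for production SQL.
    """
    m, n = len(vertex_labels), len(edges)

    # One Vertices alias per query vertex, one Edges alias per query edge.
    tables = [f"Vertices AS V{i}" for i in range(1, m + 1)]
    tables += [f"Edges AS E{j}" for j in range(1, n + 1)]
    tables.append("GraphDescriptors AS GD")

    conds = ["V1.graphID = GD.graphID"]
    # Descriptor conditions: keep only graphs with the bucket's features.
    conds += [f"GD.{name} = {value}" for name, value in bucket.items()]
    # All aliases must refer to the same graph.
    conds += [f"V1.graphID = V{i}.graphID" for i in range(2, m + 1)]
    conds += [f"V1.graphID = E{j}.graphID" for j in range(1, n + 1)]
    # Label conditions for vertices, then label and endpoint conditions for edges.
    conds += [f"V{i}.vertexLabel = '{label}'"
              for i, label in enumerate(vertex_labels, 1)]
    for j, (src, dst, label) in enumerate(edges, 1):
        conds.append(f"E{j}.edgeLabel = '{label}'")
        conds.append(f"E{j}.sVertex = V{src}.vertexID")
        conds.append(f"E{j}.dVertex = V{dst}.vertexID")

    return ("SELECT DISTINCT V1.graphID\nFROM " + ", ".join(tables)
            + "\nWHERE " + "\n  AND ".join(conds) + ";")


if __name__ == "__main__":
    print(features_based_sql(["A", "B"], [(1, 2, "x")], {"NN": 5, "NE": 4}))
```

For a query with m vertices and n edges the generated statement references m + n + 1 table aliases, which is why large queries motivate the decomposition mechanism of Section 7.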

5.4 Selectivity Annotation of Query Predicates

The main goal of RDBMS query optimizers is to find the most efficient execution plan for every given SQL query. For any given SQL query, there are a large number of alternative execution plans. These alternative execution plans may differ significantly in their use of system resources or response time. In general, one of the most effective techniques for optimizing the execution time of SQL queries is to select the relational execution plan based on accurate selectivity information of the query predicates. For example, the query optimizer may need to estimate the selectivities of the occurrences of two vertices in one subgraph, one with label A and the other with label B, in order to choose the more selective vertex to be filtered first. Sometimes query optimizers are not able to select the optimal execution plan for the input queries because the required statistical information is unavailable or inaccurate. To tackle this problem, we use our statistical summary information to give influencing hints to the query optimizer by injecting additional selectivity information for the individual query predicates into the SQL queries. These hints enable the query optimizer to decide the optimal join order, utilize the most useful indexes and select the cheapest execution plan.

More specifically, in SubStrategy 4, we use the selectivity information of the query pruning features (acquired from the Markov summary tables) to hint the selectivity of the associated query predicates to the query optimizer of DB2 (our hosting experimental engine) using the following syntax:

SELECT fieldlist FROM tablelist

WHERE Pi SELECTIVITY Si

where Si indicates the selectivity value for the query predicate Pi. These selectivity values range between 0 and 1. Lower selectivity values (close to 0) inform the query optimizer that the associated predicates will effectively prune the size of the intermediate result and thus should be executed first.
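As an illustration of how such hints might be generated, the Python sketch below attaches a SELECTIVITY value to each predicate for which a frequency is known. The function name `annotate_selectivity` and the dictionary view of the Markov table are hypothetical, and estimating Si as the fraction of graphs containing the feature is an assumption made for this example; the text above does not state how Si is derived from the summary tables.

```python
from typing import Dict, List, Tuple


def annotate_selectivity(predicates: List[Tuple[str, str]],
                         graph_counts: Dict[str, int],
                         total_graphs: int) -> str:
    """Build a WHERE clause whose predicates carry SELECTIVITY hints.

    predicates: (SQL predicate text, feature key) pairs.
    graph_counts: hypothetical Markov-table view mapping a feature key
    to the number of graphs that contain the feature.
    """
    parts = []
    for sql_text, key in predicates:
        if key in graph_counts and total_graphs > 0:
            # Assumed estimate: fraction of graphs containing the feature,
            # capped at 1 so the value stays in the documented range.
            s = min(1.0, graph_counts[key] / total_graphs)
            parts.append(f"{sql_text} SELECTIVITY {s:.4f}")
        else:
            # No statistics available: leave the predicate unannotated.
            parts.append(sql_text)
    return "WHERE " + "\n  AND ".join(parts)


if __name__ == "__main__":
    print(annotate_selectivity(
        [("V1.vertexLabel = 'A'", "A"), ("V2.vertexLabel = 'B'", "B")],
        {"A": 10},
        1000,
    ))
```

Predicates without statistics are left untouched, so the optimizer falls back to its own estimates for them.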

5.5 Supporting Relational Indexes

Relational database indexes have proven to be very efficient tools for speeding up the evaluation of SQL queries. Moreover, the performance of query evaluation in relational database systems is very sensitive to the index structures defined over the data of the source tables. In principle, using relational indexes can accelerate query evaluation in several ways[38]. For example, applying highly selective predicates first can limit the data that must be accessed to only a very limited set of rows that satisfy those predicates. Additionally, query evaluation can be achieved using index-only access, which avoids accessing the original data pages because the index provides all the data needed for the query evaluation.

Partitioned B-tree indexes are considered to be a slight variant of the B-tree indexing structure. The main idea of this indexing technique was presented by Graefe in [27], where he recommended the use of low-selectivity leading columns to maintain the partitions within the associated B-tree. In labeled graphs, it is generally the case that the numbers of distinct vertex and edge labels are far smaller than the numbers of vertices and edges, respectively. Hence, for example, an index defined in terms of the columns (vertexLabel, graphID) can reduce the access cost of a subgraph query with only one label to one disk page, which stores the list of graphID values of all graphs that include a vertex with the target query label. On the contrary, an index defined in terms of the two columns (graphID, vertexLabel) requires scanning a large number of disk pages to get the same list of targeted graphs. Conceptually, this approach could be considered as a horizontal partitioning of the Vertices and Edges tables using the high-selectivity partitioning attributes[39]. Therefore, instead of requiring an execution time which is linear in the number of graph database members (graph database size), having partitioned B-tree indexes on the high-selectivity attributes can achieve a fixed execution time which no longer depends on the size of the whole graph database[27-28].
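The effect of the leading column can be observed on a small scale with the following Python sketch. It uses SQLite from the Python standard library purely as a stand-in engine (an assumption of this example, since the experiments in this article use DB2), and the index name `idx_label_graph` and the toy data are hypothetical. With vertexLabel as the leading column, the single-label lookup can be answered from the index alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT)"
)

# Toy data: 100 graphs with 5 vertices each; label 'A' is rare and
# occurs only in graphs whose ID is a multiple of 10.
rows = [
    (g, v, "A" if (g % 10 == 0 and v == 0) else "C")
    for g in range(100)
    for v in range(5)
]
conn.executemany("INSERT INTO Vertices VALUES (?, ?, ?)", rows)

# Label-leading index: entries with the same label are stored contiguously.
conn.execute("CREATE INDEX idx_label_graph ON Vertices (vertexLabel, graphID)")

query = "SELECT DISTINCT graphID FROM Vertices WHERE vertexLabel = ?"

# The fourth column of each EXPLAIN QUERY PLAN row is the plan description.
plan = " ".join(
    str(row[3]) for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("A",))
)
graphs_with_a = [
    row[0] for row in conn.execute(query + " ORDER BY graphID", ("A",))
]

print(plan)
print(graphs_with_a)
```

The printed plan names the index that SQLite chose; the exact wording of the plan text varies between SQLite versions.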

Moreover, by relying on a pure relational infrastructure, we are able to use the ready-made tools provided by RDBMSs to propose a candidate set of indexes that are effective for accelerating our SQL-based query evaluation workloads[40-41]. For example, we can use the db2advis tool provided by the DB2 engine to recommend efficient indexes for our query workload. Through the use of this tool we can significantly improve the quality of the designed indexes and speed up the query performance by reducing the number of calls to the database engine. Similar tools are also available in most of the commonly used commercial RDBMSs.

6 Relational Processing of Supergraph Search Queries

In this section, we present our query mechanism for processing supergraph queries. Similar to our previously introduced mechanism for processing subgraph queries (Section 5), the information of the graph features knowledge (GFK) plays an effective role in reducing the search space and optimizing the query performance. Given a supergraph query q, we start by applying the following two main steps:

1) Supergraph Query Bucket Matching. A bucket of the aggregate graph descriptors is considered to be matched with the features of the supergraph query descriptor if and only if the value of each component in the query descriptor is greater than or equal to its associated component in the bucket. This supergraph bucket matching process is defined as follows.

Definition 7 (Supergraph Bucket Match). Given an aggregate graph descriptor AGD = {b1, b2, . . . , bn} and a supergraph query descriptor qd, the matched bucket set MB = {bi | qd ⊆ bi, bi ∈ AGD} where qd ⊆ bi iff qd.NN ≥ bi.NN ∧ qd.NE ≥ bi.NE ∧ qd.NL ≥ bi.NL ∧ qd.EL ≥ bi.EL ∧ qd.MID ≥ bi.MID ∧ qd.MOD ≥ bi.MOD ∧ qd.MD ≥ bi.MD ∧ qd.MPL ≥ bi.MPL.

Fig.12 illustrates an example of the matching process of a supergraph query descriptor against the buckets of an aggregate graph descriptor.
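The matching condition of Definition 7 can be sketched in a few lines of Python. The dictionary representation of descriptors and the function name `sup_bucket_match` are assumptions made for this illustration; only the component names and the component-wise comparison come from the definition.

```python
from typing import Dict, List

# Descriptor components compared by Definition 7.
COMPONENTS = ("NN", "NE", "NL", "EL", "MID", "MOD", "MD", "MPL")


def sup_bucket_match(qd: Dict[str, int],
                     buckets: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Return the buckets matched by a supergraph query descriptor.

    A bucket matches when every component of the query descriptor is
    greater than or equal to the corresponding bucket component.
    """
    return [b for b in buckets if all(qd[c] >= b[c] for c in COMPONENTS)]


if __name__ == "__main__":
    qd = dict(NN=6, NE=7, NL=3, EL=2, MID=3, MOD=3, MD=4, MPL=5)
    small = dict(NN=4, NE=4, NL=2, EL=1, MID=2, MOD=2, MD=3, MPL=3)
    large = dict(NN=9, NE=7, NL=3, EL=2, MID=3, MOD=3, MD=4, MPL=5)
    # Only the first bucket matches: the second has more vertices than the query.
    print(sup_bucket_match(qd, [small, large]) == [small])
```

A single component that exceeds the query's value is enough to exclude a bucket, which is what allows whole groups of graphs to be discarded without inspecting them.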

2) Pruning Features Identification. This step determines the most pruning (least frequent) query features in exactly the same way as for subgraph queries (Subsection 5.3).

Based on the outcome of these two steps, the supergraph query evaluation strategy is determined as depicted in the algorithms in Figs. 10 and 11, which are further described as follows.

• Case SupStrategy 1. If the count of the matched buckets is equal to zero, we return an empty result set because none of the graph database members has features which are fully contained in the supergraph query descriptor.

• Case SupStrategy 2. This case is the same as the subgraph query case SubStrategy 3. It applies when the size of each matched bucket (bi.graphCount) does not exceed the defined threshold t. In this case, we unfold the associated inverted lists of the matched buckets to retrieve and verify their related graphs in order to compute the answer set of the supergraph query q.

• Case SupStrategy 3. This case handles the situation when we have a number of matched buckets with a size of at most t and a number of matched buckets with a size greater than t. Similar to SupStrategy 2, for each matched bucket with a size of at most t, we unfold its associated inverted list to retrieve and verify the related graphs. However, for each matched bucket (mbs) with a size greater than

Require: q: supergraph query,
         AD: aggregate descriptors,
         t: threshold of bucket size with an inverted list
Ensure: SupStrategy: integer,
        mb: array of matched aggregate descriptor entries

qd := ComputeGraphQueryDescriptor(q)
mb := IdentifySupMatchedBuckets(qd, AD)
if mb.length = 0 then
    SupStrategy := 1
    return
end if
SupStrategy := 2
for i = 1 to mb.length do
    if mb[i].graphCount > t then
        SupStrategy := 3
        return
    end if
end for

Fig.10. IdentifySupStrategy: Determines the supergraph query evaluation strategy.


Require: D: graph database, q: supergraph query,
         GD: graph descriptors, AD: aggregate descriptors,
         t: threshold of bucket size with an inverted list
Ensure: ASGQ: answer set of the supergraph query (q)

IdentifySupStrategy(q, AD, t, SupStrategy, mb)
if SupStrategy = 1 then
    return empty
end if
if SupStrategy = 2 then
    for i = 1 to mb.length do
        for all g ∈ mb[i].InvertedList.graphs do
            if Verify(g, q) then
                ASGQ.ADD(g)
            end if
        end for
    end for
    return ASGQ
end if
if SupStrategy = 3 then
    for i = 1 to mb.length do
        if mb[i].graphCount ≤ t then
            for all g ∈ mb[i].InvertedList.graphs do
                if Verify(g, q) then
                    ASGQ.ADD(g)
                end if
            end for
        else
            sqg := ExtractFeaturedSubgraphs(q, mb[i])
            for all eqg ∈ sqg do
                cg := SQLFeaturedFilter(D, eqg, GD, mb[i])
                for all g ∈ cg do
                    if Verify(g, q) then
                        ASGQ.ADD(g)
                    end if
                end for
            end for
        end if
    end for
    return ASGQ
end if

Fig.11. Features-based evaluation of supergraph queries.

Fig.12. Bucket matching of supergraph query descriptor.

t, we convert the supergraph search problem into a subgraph search problem. We apply a backtracking step which uses the features of the matched bucket (mbs) to extract from the supergraph query (q) all subgraphs (sqg) with exactly the same features.

We then treat each extracted subgraph (eqg) as a subgraph query, for which we issue a features-based SQL query using the template presented for SubStrategy 4 to retrieve all graph database members that fulfill the features of both the extracted subgraph (eqg) and the matched bucket (mbs). Clearly, this backtracking step dramatically reduces the search space and avoids the need for any further verification phase, as the extracted subgraphs are guaranteed to be contained in the supergraph query (q). Similarly to SubStrategy 4, an optional verification phase is required if and only if more than one vertex of the set of extracted subgraph query vertices have the same label. Hence, the verification phase ensures that the filtered vertices for each candidate graph are distinct.
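The strategy selection of Fig.10 reduces to a short decision over the sizes of the matched buckets. The Python sketch below mirrors that logic; the function name `identify_sup_strategy` and the use of a plain list of bucket sizes (the graphCount values) are assumptions made for illustration.

```python
from typing import List


def identify_sup_strategy(matched_bucket_sizes: List[int], t: int) -> int:
    """Choose the supergraph evaluation strategy.

    Returns 1 when no bucket matched (empty answer), 3 when at least one
    matched bucket is larger than t (SQL-based filtering is needed for
    it), and 2 when every matched bucket has an inverted list to unfold.
    """
    if not matched_bucket_sizes:
        return 1
    if any(size > t for size in matched_bucket_sizes):
        return 3
    return 2


if __name__ == "__main__":
    print(identify_sup_strategy([], t=100))          # no matched buckets
    print(identify_sup_strategy([10, 80], t=100))    # all within threshold
    print(identify_sup_strategy([10, 250], t=100))   # one bucket exceeds t
```

A bucket whose size equals t is treated as within the threshold, consistent with the `graphCount ≤ t` test in Fig.11.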

7 Decomposition Mechanism of Large Subgraph Queries

In Subsection 5.3, we presented our SQL template for the features-based evaluation strategy of subgraph queries. One concern with this one-step translation template is that the generated SQL scripts for large (in terms of the number of nodes and edges) subgraph queries can involve a large number of join operations. As we previously discussed in Subsection 3.3, given a subgraph query q that consists of m vertices and n edges, the number of join operations between the encoding relations based on the Vertex-Edge encoding scheme is equal to m + n, while the Edge-Edge encoding scheme involves n join operations. Most query engines either have a strictly defined limit on the number of join operations they can evaluate in one query, or their performance degrades dramatically if the number of join operations exceeds a certain limit.

In order to tackle this problem, we employ a decomposition mechanism that divides the one-step SQL translation of large subgraph queries into a sequence of intermediate queries (using temporary tables). Applying this decomposition mechanism blindly can lead to inefficient execution plans with very large intermediate results. Therefore, we use the Nodes-Edges Markov Summaries to perform an effective selectivity-aware decomposition process. Specifically, given a subgraph query q with a required number of join operations equal to NJ, our decomposition mechanism goes through the following steps:

• Identifying the Number of the Pruning Features. As we previously mentioned in Subsection 5.3, we consider each query vertex, query edge or pair of connected query vertices with low frequency as a pruning feature. We check the structure of q against our Markov tables to identify the possible pruning features. We refer to the number of the identified pruning features as NPP.

• Computing the Number of Partitions. Assuming that the relational query engine has an explicit or implicit limit, equal to MNJ, on the number of join operations that can be involved in one query, the number of required partitions NOP can be simply computed as follows: NOP = NJ/MNJ.

• Decomposing the Subgraph Query. Based on the identified number of pruning features (NPP) and the number of partitions (NOP), we choose one of the following three alternatives:

1) Blinded decomposition: if NPP = 0 then we blindly decompose the subgraph query q into the calculated number of partitions NOP, where each partition is translated using our translation template into an intermediate evaluation step Si. The final evaluation step FES represents a join operation between the results of all intermediate evaluation steps Si in addition to the conjunctive condition of the subgraph connectors. Because no information about effective pruning features is available, some intermediate results may contain a large set of non-required graph members.

2) Pruned single-level decomposition: if NPP > NOP then we distribute the pruning features across the different intermediate NOP partitions. Thus, we ensure a balanced effective pruning of all intermediate results. All intermediate results Si of all pruned partitions are constructed before the final evaluation step FES joins all these intermediate results in addition to the connecting conditions to constitute the final result.

3) Pruned multi-level decomposition: if NPP < NOP then we distribute the pruning features across the first-level intermediate results of the NOP partitions. This step ensures an effective pruning of a fraction NPP/NOP of the partitions. An intermediate collective pruned step IPS is constructed by joining all these pruned first-level intermediate results in addition to the connecting conditions between them. Progressively, IPS is used as an entry pruning feature for the rest of the (NOP − NPP) non-pruned partitions in a hierarchical multi-level fashion to constitute the final result set. However, the number of non-pruned partitions can be reduced if any of them can be connected to one of the pruning features. In other words, each pruning feature can be used to prune, if possible, more than one partition to avoid the cost of having any large intermediate results.

Fig.13 illustrates two examples of our decomposition mechanism where the pruning vertices are marked by solid fillings, pruning edges are marked by bold line styles and the connectors between subgraphs are marked by dashed edges. Fig.13(a) represents an example where the number of pruning features is greater than the number of partitions. Fig.13(b) represents another example where the number of pruning features is less than the number of partitions and one pruning vertex is used to prune different partitioned subgraphs.

Fig.13. Decomposition of large subgraph queries. (a) NPP > NOP. (b) NPP < NOP.
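The choice between the three decomposition alternatives can be summarized by the Python sketch below. Several details are assumptions of this illustration and are not stated in the text: NJ/MNJ is rounded up so that no partition exceeds the join limit, a query that fits in a single partition is reported as needing no decomposition, and the boundary case NPP = NOP is handled as single-level because every partition can then receive a pruning feature. The function name `plan_decomposition` and the returned labels are hypothetical.

```python
import math
from typing import Tuple


def plan_decomposition(nj: int, mnj: int, npp: int) -> Tuple[int, str]:
    """Return (number of partitions, decomposition alternative).

    nj:  number of join operations required by the one-step translation.
    mnj: maximum number of joins the engine handles well in one query.
    npp: number of pruning features identified from the Markov tables.
    """
    if mnj <= 0:
        raise ValueError("mnj must be positive")

    # Assumption: round up so that each partition stays within the limit.
    nop = max(1, math.ceil(nj / mnj))

    if nop == 1:
        return nop, "one-step"            # query fits in a single SQL statement
    if npp == 0:
        return nop, "blind"               # no pruning features available
    if npp >= nop:                        # assumption: equality handled here
        return nop, "pruned single-level"
    return nop, "pruned multi-level"      # fewer features than partitions


if __name__ == "__main__":
    print(plan_decomposition(40, 15, 0))
    print(plan_decomposition(40, 15, 4))
    print(plan_decomposition(40, 15, 2))
```

The sketch only selects the alternative; assigning pruning features to partitions and building the intermediate steps Si, IPS and FES would follow the descriptions above.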

8 Performance Evaluation

In this section, we present an extensive experimental study of our proposed graph querying mechanisms. We conducted our experiments using the IBM DB2 RDBMS running on a PC with 3.2 GHz Intel Xeon processors, 4 GB of main memory and 250 GB of SCSI secondary storage.

8.1 Datasets

In our experiments we used two kinds of datasets:

1) A real AIDS antiviral screen dataset (NCI/NIH) which is publicly available on the website of the Developmental Therapeutics Program[42]. For this dataset, different query sets are used, each of which has 1000 queries. These 1000 queries are constructed by randomly selecting 1000 graphs from the antiviral screen dataset and then randomly extracting a connected m-edge subgraph from each graph. Each query set is denoted by its edge size (m) as Qm.

2) A set of synthetic datasets generated by our implemented data generator, which follows the same idea presented by Kuramochi and Karypis in [43]. The generator allows the user to specify the number of graphs (D), the average number of vertices for each graph (V), the average number of edges for each graph (E), the number of distinct vertex labels (L) and the number of distinct edge labels (M). We use the notation DdEeVvLlMm to represent the generation parameters of each dataset.
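For readers who want to script experiments over such datasets, the naming notation can be decoded mechanically. The Python sketch below parses a name such as D50kV30E40L90M150 (the dataset used in Subsection 8.2.4) into its generation parameters. The function name `parse_dataset_name` is hypothetical, and reading the suffix "k" as a factor of one thousand is an assumption based on that example.

```python
import re
from typing import Dict

# One field: a parameter letter, an integer, and an optional "k" suffix.
_FIELD = re.compile(r"([DVELM])(\d+)(k?)")


def parse_dataset_name(name: str) -> Dict[str, int]:
    """Decode a dataset name into its generation parameters.

    Fields may appear in any order, but all five parameters
    (D, V, E, L, M) must be present and nothing else may appear.
    """
    params: Dict[str, int] = {}
    pos = 0
    for match in _FIELD.finditer(name):
        if match.start() != pos:
            raise ValueError(f"unexpected text in {name!r} at position {pos}")
        key, digits, kilo = match.groups()
        # Assumption: a trailing "k" multiplies the value by 1000.
        params[key] = int(digits) * (1000 if kilo else 1)
        pos = match.end()
    if pos != len(name) or set(params) != set("DVELM"):
        raise ValueError(f"malformed dataset name: {name!r}")
    return params


if __name__ == "__main__":
    print(parse_dataset_name("D50kV30E40L90M150"))
```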


8.2 Experiments

8.2.1 Offline Database Preprocessing Cost

We first examined the offline preprocessing cost of our approach. Several synthetic datasets have been used in this experiment. Each dataset is denoted as dn, where n represents the number of graphs in the dataset. Fig.14 shows the offline preprocessing cost of these datasets. Fig.14(a) shows the disk storage size of the graph features knowledge in comparison to the GIndex [12] and TreePi [13] approaches. The results of this experiment show that the storage size of the graph features knowledge is the highest and that it increases linearly w.r.t. the size of the graph database. The main reason behind this is that each graph database member requires a new record in the graph descriptor table, while both GIndex and TreePi index only the frequent features (graphs or trees). In fact, the size of the aggregate graph descriptors component is very small and could be neglected. The size of this aggregate component is not affected by the increase in the number of graph database members. However, it is affected by the number of unique features of the graph database members.

Fig.14(b) represents the running time of constructing the graph features knowledge in comparison to the indexing time of GIndex and TreePi. The results of this experiment show that the construction time of the graph features knowledge is the lowest. The main reason behind this is the straightforward computation of the features of the graph database members, while GIndex and TreePi apply sophisticated mining techniques to choose their frequent indexing features. Moreover, the index construction time of GIndex and TreePi is approximately proportional to the database size, while TreePi is relatively faster due to the fact that frequent subtree mining is simpler than frequent subgraph mining. The construction time of the graph features knowledge is more stable and scalable, and it grows much more slowly than the number of graph database members. The main reason behind this is the efficiency of the relational indexing infrastructure in computing the aggregate values that represent the components of the descriptor record of each graph database member.

Fig.14. Offline database preprocessing cost. (a) Index size. (b) Construction time.

8.2.2 Subgraph Query Performance

One of the main advantages of using a relational database to store and query graph databases is to exploit its well-known scalability. Fig.15(a) shows the query performance of our features-based SQL evaluation of subgraph queries. The results of this experiment illustrate the average execution time for the features-based SQL translations of 1000 subgraph queries with different sizes on graph databases with scaling characteristics. In this figure, the running time for the subgraph query processing is presented on the Y-axis, the X-axis represents the different graph databases and the different bars represent the different query sizes. The running time of these experiments includes both the filtering and verification phases. However, on average the running time of the verification phase represents 5% of the total running time and can be considered to have a negligible effect on all queries with small result sets. The results of this experiment show that the execution time of our approach scales in a near-linear fashion with respect to the graph database and query sizes. The growth of the execution time slows down for the larger datasets because of the efficiency of our query optimization techniques. The speedup improvements of these different optimization techniques are discussed in the next set of experiments. Fig.15(b) shows the results of comparing the performance of our approach with the TreePi and GIndex approaches (using the two largest synthetic datasets). The results of this experiment show that our approach is able to outperform the TreePi and GIndex approaches due to the well-known scalability advantage of the relational infrastructure in handling large datasets which may not fit in main memory. In principle, the results of this experiment are expected because the TreePi and GIndex approaches are not designed to handle these very large datasets. Moreover, they are designed to deal with the general subgraph-isomorphism problem.

Fig.15. Features-based evaluation of subgraph queries. (a) Relational scalability. (b) Comparison with TreePi and GIndex.

8.2.3 Speedup Improvements

Fig.16 represents the speedup improvements of the different query optimization techniques which are applied in our approach. We evaluated the effect of each optimization technique independently in a separate experiment. The reported percentages of speedup improvement in these experiments are computed using the formula (1 − G/C)%, where G represents the execution time of the SQL execution plans using one of the optimization techniques while C represents the execution time of the SQL execution plans without using this optimization technique.
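As a worked example of this measure, the Python sketch below computes the speedup percentage from two execution times. The function name `speedup_percent` and the sample timings are hypothetical and are not measurements from these experiments.

```python
def speedup_percent(g: float, c: float) -> float:
    """Speedup improvement (1 - G/C) expressed as a percentage.

    g: execution time with the optimization technique enabled.
    c: execution time without the optimization technique.
    """
    if c <= 0:
        raise ValueError("baseline execution time must be positive")
    return (1 - g / c) * 100


if __name__ == "__main__":
    # Hypothetical timings: 30 s with the optimization, 120 s without it.
    print(speedup_percent(30, 120))
```

With these made-up timings the optimized plan takes a quarter of the baseline time, so the reported improvement is 75%. A negative value would indicate that the technique slowed the query down.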

In principle, the results of these experiments confirm the efficiency of our optimization techniques. Fig.16(a) indicates that the speedup improvement introduced by using the graph descriptor component is the highest. The main reason behind this is that injecting the filtering conditions for components of the matched buckets with the descriptor of the subgraph query dramatically reduces the search space and gets rid of many irrelevant graph database members which can never appear in the answer set of the subgraph query.

Fig.16(b) indicates that the speedup improvement of using the partitioned B-tree indexes is quite effective. The main reason behind this is that it effectively reduces the access cost of the secondary storage. As we previously mentioned, in labeled graphs, it is generally the case that the numbers of distinct vertex and edge labels are far smaller than the numbers of vertices and edges, respectively. In principle, the effect of partitioned B-tree indexes increases with the size of the graph database and the percentage of distinct node/edge labels. Therefore, the effect on the synthetic database is greater than the effect on the AIDS dataset. Fig.16(c) shows the speedup improvement of injecting the selectivity annotations of the filtering predicates into our SQL translation scripts. The results of this experiment show that the bigger the query size, the more join operations are required to be executed and consequently the higher the effect of the selectivity annotations on helping the query optimizer to decide the cheapest execution plan.

Fig.16(d) indicates the further speedup improvement which can be achieved by encoding the graph database using the Edge-Edge scheme instead of the Vertex-Edge scheme which we used in all of our experiments. In this experiment, the reported percentage of speedup improvement is also computed using the formula (1 − G/C)%. However, here G represents the execution time of the SQL execution plans using the Edge-Edge encoding scheme while C represents the execution time of the SQL execution plans using the Vertex-Edge encoding scheme. In principle, the results of this experiment are expected and can be explained by the effective reduction in the number of expensive join operations between the encoding relations required to evaluate the subgraph queries. As we previously discussed in Subsection 3.3, the Edge-Edge scheme can be considered as a denormalized version of the Vertex-Edge scheme, which explains why it is more efficient and more suitable for querying static graph databases.

In principle, the measurements of the speedup improvements of the different optimization techniques show that the bigger the query size or graph database size, the higher the cost of the query evaluation and consequently the bigger the effect of the different optimization techniques on improving the performance of the query processing.


Fig.16. Speedup improvements. (a) Graph descriptors. (b) Partitioned B-trees. (c) Selectivity injections. (d) Edge-Edge encoding scheme.

8.2.4 Analysis of the Effect of the Threshold Parameter (t)

The value of the threshold parameter t determines whether an inverted list is created for a bucket of the aggregate descriptors component: a list is created if the number of its related graphs is less than or equal to the value of t. In this experiment, we have tested the effect of the value of t on the construction time of these inverted lists and on the performance of the query processing. The experiment analyzing the effect of t on the query processing time has been executed using a synthetic dataset generated with the parameters D50kV30E40L90M150. The results show that the smaller the value of t, the shorter the construction time of the inverted lists (Fig.17(a)) but the longer the query processing time (Fig.17(b)). The main reason behind this is that the smaller the value of t, the fewer inverted lists are constructed. Therefore, most of the matched buckets of any query will not find an associated inverted list, which increases the number of SQL queries issued to retrieve the candidate graphs which satisfy the features of the matched buckets. However, it is not entirely true that when the value of t is larger, the query response time will always be shorter. The advantage of using a larger value of t is to avoid accessing the secondary storage of the encoding relations of the graph database. However, the larger the value of t, the bigger the search space and the cost of probing the inverted list to verify the related graphs and identify the target ones. This probing cost can increase dramatically if the memory required for retrieving the inverted list is bigger than the available budget. Therefore, there is a point at which the query processing time becomes longer, namely when the increase in the inverted list probing cost is greater than the cost of the SQL-based evaluation. Thus, a moderate value of t which takes into consideration the available budget of main memory is actually a better choice.
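The role of t during construction can be illustrated with the Python sketch below, which groups graphs into buckets by their descriptor and keeps an inverted list only for buckets with at most t graphs. It is a simplified in-memory illustration: the function name `build_aggregate_descriptors`, the dictionary layout and the assumption that a bucket collects graphs with identical descriptor values are choices made for this example, not the relational implementation evaluated here.

```python
from collections import defaultdict
from typing import Dict, List


def build_aggregate_descriptors(graph_descriptors: Dict[int, Dict[str, int]],
                                t: int) -> List[dict]:
    """Group graphs by descriptor and attach inverted lists to small buckets.

    graph_descriptors maps a graph ID to its descriptor components.
    A bucket receives an inverted list only if its graphCount is <= t.
    """
    groups = defaultdict(list)
    for graph_id, descriptor in graph_descriptors.items():
        # Graphs with identical component values fall into the same bucket.
        groups[tuple(sorted(descriptor.items()))].append(graph_id)

    buckets = []
    for features, graph_ids in groups.items():
        has_list = len(graph_ids) <= t
        buckets.append({
            "features": dict(features),
            "graphCount": len(graph_ids),
            # Large buckets get no list and fall back to SQL-based filtering.
            "invertedList": sorted(graph_ids) if has_list else None,
        })
    return buckets


if __name__ == "__main__":
    descriptors = {
        1: {"NN": 3, "NE": 2},
        2: {"NN": 3, "NE": 2},
        3: {"NN": 5, "NE": 6},
    }
    for bucket in build_aggregate_descriptors(descriptors, t=1):
        print(bucket)
```

Raising t in this sketch gives more buckets an inverted list, which corresponds to the trade-off discussed above: fewer SQL queries at the cost of longer lists to probe and verify.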

8.2.5 Supergraph Query Performance

Fig.18 shows the query performance of our features-based SQL evaluation of supergraph queries. The results of this experiment illustrate the average execution times for the features-based SQL translations of 1000 supergraph queries with different sizes on graph databases with scaling characteristics.

The running time of these experiments includes both the filtering and verification phases. Similar to the experiment of subgraph queries, the average running time of the verification phase can be neglected as it represents less than 3% of the total running time. The SQL translation scripts of this experiment use the different optimization techniques introduced in our approach (Graph Descriptors, Partitioned B-trees and Selectivity Injections). In a previous experiment, we presented the offline preprocessing cost which is required in our approach in order to construct the graph features knowledge (Fig.14). Although the size of the graph features knowledge (Fig.14(a)) can be considered as an additional overhead, it is worth paying as it significantly improves the performance of query evaluation, especially in the particular case of supergraph queries. Without this knowledge, evaluating supergraph queries is quite challenging and expensive for any SQL-based approach, as it requires issuing an intermediate evaluation step for each pair of integer values x and y where x ≤ the number of nodes, y ≤ the number of edges and x > y. The main reason behind this is that the filtering predicates will not be able to consider any information about the features or the topology of the supergraph query or the graphs of the underlying database. Moreover, such an approach has no means to check whether these filtered graphs are fully contained in the supergraph query q or not. This will also yield too many false positive and duplicate graphs in the intermediate results. In our approach, the supergraph bucket matching process (Definition 7) plays a crucial role in solving this problem by identifying the set of target features of the intermediate results and converting the supergraph search problem into an efficient subgraph search problem.

Fig.17. Effect of the value of the threshold parameter (t). (a) Inverted list construction time. (b) Query performance.

Fig.18. Features-based evaluation of supergraph queries.

9 Related Work

Graph databases and graph query processing have attracted a lot of attention from the database community. Several approaches have been proposed to deal with different types of graph queries[44]. This section gives an overview of the literature classified by the type of their target graph queries.

9.1 Subgraph Queries

Many graph indexing techniques have been recently proposed to deal with the problem of subgraph query processing[8,12,17,45-46]. They can be classified into two main approaches: 1) Mining-Based Graph Indexing Techniques. 2) Non Mining-Based Graph Indexing Techniques.

9.1.1 Mining-Based Graph Indexing Techniques

The GIndex [12] index structure uses frequent subgraphs as the basic indexing unit. Given a query graph q, if q is a frequent subgraph, the exact set of query answers containing q can be retrieved directly. If the query graph q is infrequent, then it is contained in only a small number of graphs in the database. Cheng et al.[8] have extended the ideas of GIndex[12] by using a nested inverted-index in a new graph index structure named FG-Index. In this index structure, a memory-resident inverted-index is built using the set of frequent subgraphs. A disk-resident inverted-index is built on the closure of the frequent graphs. If the closure is too large, a local set of closure frequent subgraphs is computed from the set of frequent graphs and a further nested inverted-index is constructed.

The TreePi[13] index structure uses frequent subtrees as the indexing unit. The main observation behind this approach is that the frequent subtree mining process is relatively easier than the general frequent subgraph mining process. Zhao et al.[46] have extended the idea of TreePi to achieve better pruning ability by adding a small number of discriminative graphs to the index structure.

9.1.2 Non Mining-Based Graph Indexing Techniques

GraphGrep[9] uses enumerated paths as its indexing feature. For each graph database member, it enumerates all paths up to a certain maximum length and records the number of occurrences of each path. During query processing, the path index is used to find a set of candidate graphs which contain the paths in the query structure and to check if the counts of such paths are beyond the threshold specified in the query. The main limitation of this approach is that the size of the indexed paths could drastically increase with the size of the graph database.
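The path-based filtering idea described above can be illustrated with a minimal Python sketch. It is not the GraphGrep implementation: the function names `path_counts` and `is_candidate`, the adjacency-dictionary graph representation, and the simplification of comparing raw path counts (rather than a query-specified threshold) are all assumptions made for illustration.

```python
from collections import Counter


def path_counts(labels, adj, max_len):
    """Count the label sequences of all simple paths with at most max_len edges.

    labels: dict mapping vertex id -> label (hypothetical representation)
    adj:    dict mapping vertex id -> list of neighbouring vertex ids
    Each undirected path is counted once per direction; this is applied
    consistently to both database graphs and queries.
    """
    counts = Counter()

    def extend(path):
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # keep paths simple (no repeated vertex)
                extend(path + [nxt])

    for v in adj:
        extend([v])
    return counts


def is_candidate(graph_counts, query_counts):
    """A graph survives filtering only if it has at least as many
    occurrences of every labelled path as the query does."""
    return all(graph_counts[p] >= c for p, c in query_counts.items())


# Database graph: A - B - C (a labelled path with three vertices)
g = path_counts({0: "A", 1: "B", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]}, max_len=2)
# Query 1: A - B, Query 2: A - A
q1 = path_counts({0: "A", 1: "B"}, {0: [1], 1: [0]}, max_len=2)
q2 = path_counts({0: "A", 1: "A"}, {0: [1], 1: [0]}, max_len=2)
print(is_candidate(g, q1), is_candidate(g, q2))
```

Surviving graphs are only candidates and would still need a verification step. The number of distinct keys in `path_counts` grows quickly with `max_len` and graph size, which is the limitation noted above.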

The GCoding[14] index structure uses a tree data structure to represent the local structure associated with each graph node. A spectral graph code is generated by combining all node signatures in each graph. A vertex signature dictionary is built to store all distinct vertex signatures. The filtering phase uses both the graph codes and the local node signatures to avoid introducing many false positive candidates.

The GDIndex[11] technique relies on a structured graph decomposition approach where all connected and induced subgraphs of a given graph are enumerated. A graph of size n is decomposed into at most 2^n subgraphs when each of the vertices has a unique label. A hash table is used to index the subgraphs enumerated during the decomposition process. A clear advantage of the GDIndex approach is that no candidate verification is required. However, the index is designed for databases that consist of relatively small graphs and do not have a large number of distinct graphs.
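To make the decomposition idea concrete, the following Python sketch enumerates the connected induced subgraphs of small graphs and stores them in a hash table. It is only an illustration under stated assumptions, not the GDIndex data structure: the helper names, the adjacency-dictionary representation, and the canonical key (which is valid only when every vertex label within a graph is unique) are hypothetical. Since every induced subgraph corresponds to a vertex subset, a graph with n vertices yields at most 2^n entries.

```python
from itertools import combinations


def is_connected(nodes, adj):
    """Check that the subgraph induced by `nodes` is connected (DFS)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def connected_induced_subgraphs(adj):
    """Enumerate every non-empty vertex subset whose induced subgraph is connected."""
    vertices = sorted(adj)
    return [frozenset(s)
            for k in range(1, len(vertices) + 1)
            for s in combinations(vertices, k)
            if is_connected(s, adj)]


def subgraph_key(nodes, labels, adj):
    """Hashable key for an induced subgraph; assumes unique vertex labels."""
    vertex_labels = frozenset(labels[v] for v in nodes)
    edge_labels = frozenset(frozenset((labels[u], labels[w]))
                            for u in nodes for w in adj[u] if w in nodes)
    return vertex_labels, edge_labels


def build_index(graphs):
    """Map each enumerated subgraph key to the ids of the graphs containing it."""
    index = {}
    for gid, (labels, adj) in graphs.items():
        for nodes in connected_induced_subgraphs(adj):
            index.setdefault(subgraph_key(nodes, labels, adj), set()).add(gid)
    return index


def lookup(index, labels, adj):
    """Answer an induced-subgraph query with a single hash-table probe."""
    return index.get(subgraph_key(adj, labels, adj), set())


path = ({0: "A", 1: "B", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]})
triangle = ({0: "A", 1: "B", 2: "C"}, {0: [1, 2], 1: [0, 2], 2: [0, 1]})
index = build_index({"g1": path, "g2": triangle})
print(sorted(lookup(index, {0: "A", 1: "B"}, {0: [1], 1: [0]})))
```

Because the lookup is an exact match on the key, no separate verification step is needed in this sketch. The cost is the exponential number of enumerated subsets, which is why such an index suits only small graphs.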

ClosureTree[47] is a tree-based index structure which is very similar to the R-tree indexing mechanism[48] but extended to support graph matching queries. In this index structure, each node in the tree contains discriminative information about its descendants in order to facilitate effective pruning. Each node represents the graph closure of its children, where the children of an internal node are nodes and the children of a leaf node are database graphs.

GString[10] focuses on modeling graph objects in the context of organic chemistry using basic structures (Line, Star and Cycle) that have semantic meaning and uses them as indexing features. GString represents both graphs and queries as string sequences and transforms the problem to the subsequence string matching domain. A suffix tree-based index structure is then created to support an efficient string matching process. A key disadvantage of the GString approach is that converting subgraph search queries into a string matching problem could be inefficient, especially if the size of the graph database or the subgraph query is large. Additionally, it is not trivial to extend this approach to other application domains.

9.2 Supergraph Queries

Despite the importance of the supergraph querying problem in many practical applications, it has not been extensively considered in the literature. Very few approaches have been presented to deal with this problem. The cIndex[15] technique uses as its indexing unit the subgraphs that are extracted from graph databases based on their rare occurrence in historical query graphs. It tries to find a set of contrast subgraphs that collectively perform well. An advantage of the cIndex approach is that the size of the feature index is small. However, the query logs may change frequently, so the feature index may become outdated quite often and need to be recomputed to stay effective.

Zhang et al.[16] have proposed an approach for compact organization of graph database members named GPTree. In this approach, all of the graph database members are stored in one graph where the common subgraphs are stored only once. Based on the containment relationship between the support sets of the extracted features, a mathematical approach is used to determine the ordering of the feature set which can reduce the number of subgraph isomorphism tests during query processing.

10 Conclusions

Efficient processing of graph queries plays a critical role in many application domains which involve complex structures, such as social networks, bioinformatics and chemical compounds. In this paper, we presented a purely relational framework for processing graph queries. Our approach encodes a graph database using a fixed relational schema and translates graph queries into SQL scripts. A main advantage of this approach is that it can reside on any RDBMS and exploit its well-known, mature query optimization and indexing techniques. Relying on a layer of graph features knowledge which captures metadata and summary features of the underlying graph database, we presented different mechanisms to effectively prune the search space and achieve efficient and scalable performance for graph queries. Moreover, an effective selectivity-aware decomposition mechanism has been presented to tackle the join problem of large subgraph queries. Finally, we conducted an extensive set of experiments on real and synthetic datasets to demonstrate the efficiency and the scalability of our techniques. The results show that our purely relational approach for processing graph queries deserves to be pursued further.
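As a rough illustration of what encoding graphs in a fixed relational schema and translating a subgraph query into SQL can look like, the following Python sketch uses the standard `sqlite3` module. The table names (`vertices`, `edges`), column names, sample data and the hand-written query are assumptions chosen for this example; they are not necessarily the schema or the SQL templates used by the framework.

```python
import sqlite3


def build_database():
    """Create an in-memory database holding two small labelled, directed graphs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vertices (graph_id INTEGER, vertex_id INTEGER, label TEXT);
        CREATE TABLE edges (graph_id INTEGER, src INTEGER, dst INTEGER, label TEXT);
        """
    )
    # Graph 1: A -x-> B -y-> C        Graph 2: A -x-> B
    conn.executemany(
        "INSERT INTO vertices VALUES (?, ?, ?)",
        [(1, 1, "A"), (1, 2, "B"), (1, 3, "C"), (2, 1, "A"), (2, 2, "B")],
    )
    conn.executemany(
        "INSERT INTO edges VALUES (?, ?, ?, ?)",
        [(1, 1, 2, "x"), (1, 2, 3, "y"), (2, 1, 2, "x")],
    )
    return conn


# Subgraph query A -x-> B -y-> C written as one conjunctive SQL query:
# one tuple variable per query vertex and per query edge, label predicates
# for filtering, join predicates for the topology, and inequality predicates
# so that distinct query vertices map to distinct graph vertices.
PATH_QUERY = """
SELECT DISTINCT v1.graph_id
FROM vertices v1, vertices v2, vertices v3, edges e1, edges e2
WHERE v1.graph_id = v2.graph_id AND v2.graph_id = v3.graph_id
  AND e1.graph_id = v1.graph_id AND e2.graph_id = v1.graph_id
  AND v1.label = 'A' AND v2.label = 'B' AND v3.label = 'C'
  AND e1.src = v1.vertex_id AND e1.dst = v2.vertex_id AND e1.label = 'x'
  AND e2.src = v2.vertex_id AND e2.dst = v3.vertex_id AND e2.label = 'y'
  AND v1.vertex_id <> v2.vertex_id
  AND v2.vertex_id <> v3.vertex_id
  AND v1.vertex_id <> v3.vertex_id
"""


def matching_graphs(conn):
    """Return the ids of the graphs that contain the query pattern."""
    return sorted(row[0] for row in conn.execute(PATH_QUERY))


print(matching_graphs(build_database()))  # prints [1]
```

The number of joins in such a query grows with the number of query vertices and edges, which is the join problem that a decomposition mechanism for large subgraph queries is meant to address.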

In the future, we will conduct further experiments on using our approach with other types of graphs and will explore extending our querying mechanisms to handle other types of graph queries such as similarity queries or reachability queries. For example, adjusting our approach to handle the general subgraph isomorphism problem can be achieved by dropping the filtering predicates on the labels of the query vertices and edges in our SQL templates. Clearly, this will reduce the pruning power of the filtering phase and will increase the number of candidate graphs and the query response time. It is expected that addressing this problem effectively would require the addition of more sophisticated components to the graph descriptors in addition to the introduction of new summary components to the graph features knowledge layer.

In general, different types of graph queries are different in their nature. Hence, the components of GFK will be subject to adjustment based on the target query types. However, the aggregate graph descriptors and aggregate descriptors components are two basic components that are expected to be utilized to support efficient evaluation for different types of queries.

Acknowledgments The authors would like to thank Dr. Xifeng Yan and Prof. Jiawei Han for providing GIndex, and Dr. Shijie Zhang, Mr. Meng Hu and Dr. Jiong Yang for providing TreePi.

References

[1] Manola F, Miller E. RDF primer: World wide web consortium proposed recommendation. February 2004. http://www.w3.org/TR/rdfprimer/.

[2] Cai D, Shao Z, He X, Yan X, Han J. Community mining from multi-relational networks. In Proc. the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, Oct. 3-7, 2005, pp.445-452.

[3] Yang Q, Sze S. Path matching and graph matching in biological networks. Journal of Computational Biology, 2007, 14(1): 56-67.

[4] Huan J, Wang W, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A. Mining protein family specific residue packing patterns from protein structure graphs. In Proc. the Eighth Annual International Conference on Computational Molecular Biology, San Diego, USA, Mar. 27-31, 2004, pp.308-315.

[5] Klinger S, Austin J. Chemical similarity searching using a neural graph matcher. In Proc. the 13th European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, Apr. 27-29, 2005, pp.479-484.

[6] Willett P, Barnard J, Downs G. Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 1998, 38(6): 983-996.

[7] Sakr S, Awad A. A framework for querying graph-based business process models. In Proc. the 19th International World Wide Web Conference (WWW), Raleigh, USA, Apr. 26-30, 2010, pp.1297-1300.

[8] Cheng J, Ke Y, Ng W, Lu A. FG-Index: Towards verification-free query processing on graph databases. In Proc. the ACM SIGMOD International Conference on Management of Data, San Diego, USA, Aug. 5-9, 2007, pp.857-872.

[9] Giugno R, Shasha D. GraphGrep: A fast and universal method for querying graphs. In Proc. the IEEE International Conference in Pattern Recognition (ICPR), Quebec, Canada, Aug. 11-15, 2002, pp.112-115.

[10] Jiang H, Wang H, Yu P, Zhou S. A novel approach for efficient search in graph databases. In Proc. the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.566-575.

[11] Williams D, Huan J, Wang W. Graph database indexing using structured graph decomposition. In Proc. the 23rd International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.976-985.

[12] Yan X, Yu P, Han J. Graph indexing: A frequent structure based approach. In Proc. the ACM SIGMOD International Conference on Management of Data, Los Angeles, USA, Aug. 8-12, 2004, pp.335-346.

[13] Zhang S, Hu M, Yang J. TreePi: A novel graph indexing method. In Proc. the 23rd International Conference on Data Engineering, Istanbul, Turkey, Apr. 15-20, 2007, pp.966-975.

[14] Zou L, Chen L, Yu J, Lu Y. A novel spectral coding in a large graph database. In Proc. the 11th International Conference on Extending Database Technology (EDBT), Nantes, France, Mar. 25-29, 2008, pp.181-192.

[15] Chen C, Yan X, Yu P, Han J, Zhang D, Gu X. Towards graph containment search and indexing. In Proc. the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, Austria, Sept. 23-27, 2007, pp.926-937.

[16] Zhang S, Li J, Gao H, Zou Z. A novel approach for efficient supergraph query processing on graph databases. In Proc. the 12th International Conference on Extending Database Technology (EDBT), Saint-Petersburg, Russia, Mar. 24-26, 2009, pp.204-215.

[17] Tian Y, Patel J. TALE: A tool for approximate large graph matching. In Proc. the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico, Apr. 7-12, 2008, pp.963-972.

[18] Yan X, Yu P, Han J. Substructure similarity search in graph databases. In Proc. the ACM SIGMOD International Conference on Management of Data, Los Angeles, USA, Jul. 31-Aug. 4, 2005, pp.766-777.

[19] Ke Y, Cheng J, Ng W. Efficient correlation search from graph databases. IEEE Transactions on Knowledge and Data Engineering, 2008, 20(12): 1601-1615.

[20] Zou L, Chen L, Lu Y. Top-K correlation sub-graph search in graph databases. In Proc. the International Conference on Database Systems for Advanced Applications (DASFAA), Brisbane, Australia, Apr. 21-23, 2009, pp.168-185.

[21] Cohen S, Hurley P, Schulz K et al. Scientific formats for object-relational database systems: A study of suitability and performance. SIGMOD Record, 2006, 35(2): 10-15.

[22] Botea V, Mallett D, Nascimento M, Sander J. PIST: An efficient and practical indexing technique for historical spatio-temporal point data. GeoInformatica, 2008, 12(2): 143-168.

[23] Grust T, Sakr S, Teubner J. XQuery on SQL hosts. In Proc. the 30th International Conference on Very Large Data Bases, Toronto, Canada, Aug. 31-Sept. 3, 2004, pp.252-263.

[24] Grust T, Mayr M, Rittinger J et al. A SQL:1999 code generator for the pathfinder XQuery compiler. In Proc. the 26th ACM SIGMOD International Conference on Management of Data, San Diego, USA, Aug. 5-9, 2007, pp.1162-1164.

[25] Sakr S. Algebraic-based XQuery cardinality estimation. International Journal of Web Information Systems (IJWIS), 2008, 4(1): 7-46.

[26] Teubner J, Grust T, Maneth S, Sakr S. Dependable cardinality forecasts for XQuery. Proceedings of the VLDB Endowment (PVLDB), 2008, 1(1): 463-477.

[27] Graefe G. Sorting and indexing with partitioned B-trees. In Proc. the 1st International Conference on Data Systems Research (CIDR), Asilomar, USA, Jan. 5-8, 2003.

[28] Grust T, Rittinger J, Teubner J. Why off-the-shelf RDBMSs are better at XPath than you might expect. In Proc. the 26th ACM SIGMOD International Conference on Management of Data, San Diego, USA, Aug. 5-9, 2007, pp.949-958.

[29] Bruno N, Chaudhuri S, Ramamurthy R. Power hints for query optimization. In Proc. the 25th International Conference on Data Engineering (ICDE), Shanghai, China, Mar. 29-Apr. 2, 2009, pp.469-480.

[30] Florescu D, Kossmann D. Storing and querying XML data using an RDMBS. IEEE Data Engineering Bulletin, 1999, 22(3): 27-34.

[31] Sakr S. Storing and querying graph data using efficient relational processing techniques. In Proc. the 3rd International United Information Systems Conference (UNISCON), Sydney, Australia, Apr. 21-24, 2009, pp.379-392.

[32] Beyer K, Haas P, Reinwald B et al. On synopses for distinct-value estimation under multiset operations. In Proc. the ACM SIGMOD International Conference on Management of Data, San Diego, USA, Aug. 5-9, 2007, pp.199-210.

[33] Chakkappen S, Cruanes T, Dageville B, Jiang L, Shaft U, Su H, Zait M. Efficient and scalable statistics gathering for large databases in Oracle 11g. In Proc. the ACM SIGMOD International Conference on Management of Data, Los Angeles, USA, Aug. 11-15, 2008, pp.1053-1064.

[34] Graefe G, Fayyad U, Chaudhuri S. On the efficient gathering of sufficient statistics for classification from large SQL databases. In Proc. the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), New York City, USA, Aug. 27-31, 1998, pp.204-208.

[35] Goldman R, Widom J. Enabling query formulation and optimization in semistructured databases. In Proc. the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, Aug. 25-29, 1997, pp.436-445.

[36] Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.

[37] Aboulnaga A, Alameldeen A, Naughton J. Estimating the selectivity of XML path expressions for Internet scale applications. In Proc. the 27th Int. Conf. Very Large Data Bases (VLDB), Rome, Italy, Sept. 11-14, 2001, pp.591-600.

[38] Graefe G. Query evaluation techniques for large databases. ACM Computing Surveys, 1993, 25(2): 73-170.

[39] Agrawal S, Narasayya V, Yang B. Integrating vertical and horizontal partitioning into automated physical database design. In Proc. the ACM SIGMOD Int. Conf. Management of Data, Toronto, Canada, Aug. 31-Sept. 3, 2004, pp.359-370.

[40] Agrawal S, Chaudhuri S, Narasayya V. Automated selection of materialized views and indexes in SQL databases. In Proc. the 26th International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, Sept. 10-14, 2000, pp.496-505.

[41] Agrawal S, Chu E, Narasayya V. Automatic physical design tuning: Workload as a sequence. In Proc. the ACM SIGMOD International Conference on Management of Data, Chicago, USA, Jun. 26-29, 2006, pp.683-694.

[42] Developmental therapeutics program. NCI/NIH. http://dtp.nci.nih.gov/.

[43] Kuramochi M, Karypis G. Frequent subgraph discovery. In Proc. the IEEE International Conference on Data Mining (ICDM), San Jose, USA, Nov. 29-Dec. 2, 2001, pp.313-320.

[44] Sakr S, Al-Naymat G. Graph indexing and querying: A review. International Journal of Web Information Systems (IJWIS), 2010, 6(2): 101-120.

[45] Sakr S. GraphREL: A decomposition-based and selectivity-aware relational framework for processing sub-graph queries. In Proc. the 14th International Conference on Database Systems for Advanced Applications (DASFAA), Brisbane, Australia, Apr. 21-23, 2009, pp.123-137.

[46] Zhao P, Yu J, Yu P. Graph indexing: Tree + delta > graph. In Proc. the 33rd Int. Conf. Very Large Data Bases (VLDB), Vienna, Austria, Sept. 23-27, 2007, pp.938-949.

[47] He H, Singh A. Closure-tree: An index structure for graph queries. In Proc. the 22nd International Conference on Data Engineering (ICDE), Atlanta, USA, Apr. 3-8, 2006, pp.38-52.

[48] Guttman A. R-trees: A dynamic index structure for spatial searching. In Proc. the ACM SIGMOD Int. Conf. Management of Data, Minneapolis, USA, Jul. 23-27, 1984, pp.47-57.

Sherif Sakr is a research scientist in the Managing Complexity Group at National ICT Australia (NICTA). He is also a conjoint lecturer in the School of Computer Science and Engineering (CSE) at the University of New South Wales (UNSW), Australia. He received his Ph.D. degree in computer science from Konstanz University, Germany in 2007. His research interests are in data and information management in general, particularly in the areas of indexing techniques, query processing and optimization techniques, graph data management and large scale data management in cloud platforms. His work has been published in international journals and conferences such as PVLDB, SIGMOD, JCSS, JDM, IJWIS, WWW, BPM, TPCTC, DASFAA and DEXA. One of his papers was awarded the Outstanding Paper Excellence Award 2009 of Emerald Literati Network.

Ghazi Al-Naymat is a postdoctoral research fellow at the School of Computer Science and Engineering at the University of New South Wales, Australia. He received his Ph.D. degree in May 2009 from the School of Information Technologies at the University of Sydney, Australia. Dr. Al-Naymat's research focuses on developing novel data mining techniques for different applications and datasets such as graph, spatial, spatio-temporal, and time series databases. Dr. Al-Naymat has published a number of papers in excellent international journals and conferences.
