
GRAPHITE: An Extensible Graph Traversal Framework for Relational Database Management Systems

Marcus Paradies, Wolfgang Lehner
Database Technology Group, Technische Universität Dresden
[email protected], [email protected]

Christof Bornhövd
SAP Labs, LLC, Palo Alto, CA 94304, USA
[email protected]

arXiv:1412.6477v1 [cs.DB] 19 Dec 2014

ABSTRACT

Graph traversals are a basic but fundamental ingredient for a variety of graph algorithms and graph-oriented queries. To achieve the best possible query performance, they need to be implemented at the core of a database management system that aims at storing, manipulating, and querying graph data. Increasingly, modern business applications demand native graph query and processing capabilities for enterprise-critical operations on data stored in relational database management systems. In this paper we propose an extensible graph traversal framework (GRAPHITE) as a central graph processing component on a common storage engine inside a relational database management system.

We study the influence of the graph topology on the execution time of graph traversals and derive two traversal algorithm implementations specialized for different graph topologies and traversal queries. We conduct extensive experiments on GRAPHITE for a large variety of real-world graph data sets and input configurations. Our experiments show that the proposed traversal algorithms differ by up to two orders of magnitude for different input configurations and therefore demonstrate the need for a versatile framework to efficiently process graph traversals on a wide range of different graph topologies and types of queries. Finally, we highlight that the query performance of our traversal implementations is competitive with that of two native graph database management systems.

1. INTRODUCTION

Increasingly, enterprises from various domains, such as the financial, insurance, and pharmaceutical industries, explore and analyze the connections between data records in traditional customer-relationship management and enterprise-resource-planning systems. Typically, these industries rely on mature RDBMS technology to retain a single source of truth and access. Although graph structure is already latent in the relational schema and inherently represented in foreign-key relationships, managing native graph data is moving into focus as it allows rapid application development in the absence of an upfront-defined database schema. Specifically, novel and traditional business applications leverage the advantages of a graph data model, such as schema flexibility and an explicit representation of relationships between data records.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

[Figure 1: Architecture alternatives for graph processing. (a) Separated RDBMS and GMS with specialized storage engines and processing orchestrated by the application. (b) RDBMS with integrated graph processing capabilities (GRAPHITE) and a common storage engine.]

Although these business applications mainly operate on graph-structured data, they still require direct access to the relational base data.

Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic, or employ a graph management system (GMS) such as Neo4j [3] or Sparksee [24], or a distributed graph system such as GraphLab [21] or Apache Giraph [1]. For the first approach, relying only on SQL typically results in poor execution performance caused by the functional mismatch between a traversal algebra [29] and the relational algebra. Even worse, the relational query optimizer is not graph-aware, i.e., it keeps statistics neither about the graph topology nor about graph query patterns, and is therefore likely to construct a suboptimal execution plan. The other alternative is to process the data in a native GMS to overcome the unsuitability of the relational algebra for expressing complex graph queries in an RDBMS.

Figure 1a depicts a traditional system landscape with an RDBMS and a GMS located next to each other and orchestrated at the application level. A GMS is superior to an RDBMS for complex graph processing as it provides a natural understanding of a graph data model, a rich set of graph processing functionality, and optimized data structures for fast data access. Especially scenarios that neither require accessing the most recent data snapshot nor combine operations from different data models into cross-data-model operations can be handled efficiently by GMSs. Cross-data-model operations combining data from various data models, i.e., relational, text, spatial, temporal, and graph, however, will play a key role for graph analytics in the future [5]. For example, a clinical information system stores data from patient records in an RDBMS. Graph analytics on a knowledge graph of patient records and their relationships to each other helps physicians to improve diagnostics and identify complex co-morbidity conditions. Such a medical knowledge graph contains not only information about the relationships between diagnoses and patients, but also text data from patient records and temporal information about prescriptions.


In this paper we propose the seamless integration of graph processing functionality into an RDBMS sharing a common storage engine, as depicted in Figure 1b. Located next to the relational runtime stack in the same system, a graph runtime with a set of graph operators provides native support for querying graph data on top of a common relational storage engine. In the context of this paper, we focus on graph traversals as they are a vital component of every GMS and the foundation for a large variety of graph algorithms, such as finding shortest paths, detecting connected components, and answering reachability queries.

We introduce GRAPHITE, a traversal framework that provides an extensible set of logical graph traversal operators and their corresponding implementations. Similar to the distinction between a logical and a physical layer in a relational runtime, GRAPHITE provides a set of logical operators and a set of corresponding physical implementations. In the context of this paper we propose two traversal implementations optimized for an in-memory columnar RDBMS, but argue that the general concept of a traversal framework can be extended with specialized traversal implementations and cost models for row-oriented or even disk-based RDBMSs. GRAPHITE operates on a physical column group model (cf. Figure 2). We summarize our main contributions as follows:

• We introduce GRAPHITE as a modular and extensible foundation of a traversal framework inside an RDBMS, which allows seamless reuse of existing physical data structures and deployment of novel traversal implementations.

• We present two different implementations of the traversal operator: a naive level-synchronous (LS) traversal and a novel fragmented-incremental (FI) traversal algorithm that is superior to the naive approach for specific graph topologies and traversal queries.

• We conduct an extensive experimental evaluation for a large variety of real-world data sets and traversal queries, and show an execution time improvement of our FI-traversal by up to two orders of magnitude compared to the LS-traversal for certain graph topologies and traversal queries. Moreover, we show that the query performance of our implementations is competitive with that of two native graph database management systems.

The remainder of this paper is structured as follows: in Section 2, we describe GRAPHITE as the foundation of the traversal operator that we present in Section 3. We detail the two physical implementations of the traversal operator in Sections 4 and 5, respectively. A set of topology-aware clustering techniques that can be applied to both physical implementations is presented in Section 6. In Section 7 we provide an extensive experimental evaluation of our traversal implementations. Finally, we discuss related work in Section 8 before we conclude the paper in Section 9.

2. GRAPH TRAVERSAL FRAMEWORK

GRAPHITE is a general and extensible framework that allows easy deployment, testing, and benchmarking of traversal implementations on top of a common relational storage engine of an RDBMS. It provides a common interface for traversal configuration parameters and is tightly integrated with a unified, graph-aware controller infrastructure leveraging a comprehensive set of available graph statistics. We show that the query optimizer of an RDBMS can benefit from having extensive information about the graph topology to choose the best traversal operator for a given traversal query.

[Figure 2: Mapping of a property graph to column groups. (a) Example graph: a User vertex (id 1, name John), two Product vertices (id 2, title Shining; id 3, title It), and two Category vertices (id 4, name Horror; id 5, name Literature), connected by edges of type rated (with ratings 4.0 and 5.0), similar, belongs, and category.]

(b) Vertex column group ("?" denotes a null value):

id  type      name        title
1   User      John        ?
2   Product   ?           Shining
3   Product   ?           It
4   Category  Horror      ?
5   Category  Literature  ?

(c) Edge column group:

Vs  Vt  type      rating
2   3   similar   ?
2   4   belongs   ?
3   4   belongs   ?
1   3   rated     5.0
1   2   rated     4.0
4   5   category  ?

Graph Model and Physical Representation. GRAPHITE provides a property graph model as its logical data model. The property graph data model has emerged as the de-facto standard for general-purpose graph processing in enterprise environments [28]. It represents multi-relational directed graphs where vertices and edges can have an arbitrary number of attributes assigned in a key/value fashion.

We store a property graph in a common storage infrastructure shared with the relational runtime stack in two column groups, one for the vertices and one for the edges. A column group represents a column-oriented data layout, where a new attribute can be added by appending a new column to the column group [7]. Null values in sparsely populated columns can be compressed through run-length-based compression techniques [6]. Additionally, the evaluation of column predicates allows a seamless combination of relational predicate filters with the actual traversal operation. Figure 2 depicts an example graph and its mapping to two column groups. We map each vertex and edge to a single entry in the corresponding column group and each attribute to a separate column. Each vertex has a unique identifier as its only mandatory attribute.
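To make the column-group layout concrete, the following minimal Python sketch mirrors the two column groups from Figure 2 with plain lists as columns; the variable names and the use of None for the "?" null markers are illustrative and do not reflect the actual storage-engine representation.

# Minimal sketch of the two column groups from Figure 2 (hypothetical layout):
# each attribute is one list ("column"), and the i-th entry of every list in a
# group belongs to the same vertex or edge.
vertex_group = {
    "id":    [1, 2, 3, 4, 5],
    "type":  ["User", "Product", "Product", "Category", "Category"],
    "name":  ["John", None, None, "Horror", "Literature"],
    "title": [None, "Shining", "It", None, None],
}
edge_group = {
    "Vs":     [2, 2, 3, 1, 1, 4],
    "Vt":     [3, 4, 4, 3, 2, 5],
    "type":   ["similar", "belongs", "belongs", "rated", "rated", "category"],
    "rating": [None, None, None, 5.0, 4.0, None],
}

# Adding a new attribute only appends another column to the group.
edge_group["weight"] = [None] * len(edge_group["Vs"])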

Traversal Configuration. In the following, we introduce a formal notion of the graph traversal operation, its input parameters, and the expected output.

DEFINITION 1. (Traversal Configuration) Let G = (V, E) be a directed, multi-relational graph, where V refers to the set of vertices and E ⊆ (V × V) refers to the set of edges. We define a traversal configuration ρ as a tuple ρ = (S, ϕ, c, r, d) composed of a set of start vertices S ⊆ V, an edge predicate ϕ, a collection boundary c, a recursion boundary r, and a traversal direction d. A graph traversal τG(ρ) is a unary operation on G and returns a set of visited vertices R ⊆ V.

We represent each vertex in S by its unique vertex identifier. The edge predicate ϕ is a propositional formula consisting of atomic attribute predicates that can be combined with the logical operators ∧, ∨, and ¬. For each edge e ∈ E, the traversal algorithm evaluates ϕ and appends matching edges to the working set of active edges. Further, the traversal receives a recursion boundary r ∈ N+ that defines the maximum number of levels to traverse. To support unlimited traversals or transitive closure calculations, the recursion boundary can be infinite (∞). The collection boundary c ∈ N specifies the level of the traversal from which to start collecting discovered vertices. For c = 0, we add all start vertices to the result. For any traversal configuration, the condition c ≤ r must hold. The traversal direction d ∈ {→, ←} specifies the direction in which to traverse the edges: a forward traversal (→) traverses edges from the source vertex to the target vertex, a backward traversal (←) traverses edges in the opposite direction. The traversal operation outputs the set of vertices that have been visited within the boundaries defined by c and r.
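The traversal configuration ρ = (S, ϕ, c, r, d) can be captured in a small data structure; the following Python sketch is illustrative (all field names are ours, not GRAPHITE's API) and only enforces the c ≤ r condition from the definition.

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class TraversalConfig:
    # rho = (S, phi, c, r, d) from Definition 1; all field names are illustrative.
    start_vertices: frozenset                 # S, given as vertex identifiers
    edge_predicate: Callable[[dict], bool]    # phi, evaluated on an edge's attributes
    collect_from: int                         # c, level from which vertices are collected
    recursion_bound: float                    # r, may be float('inf') for unbounded traversals
    forward: bool = True                      # d: True = forward (->), False = backward (<-)

    def __post_init__(self):
        assert self.collect_from <= self.recursion_bound, "c <= r must hold"

# Example: the configuration ({A}, 'type=a', 2, 2, ->) from Figure 3.
cfg = TraversalConfig(frozenset({"A"}), lambda e: e["type"] == "a", 2, 2)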


[Figure 3: Traversal configurations and result sets for an example graph with vertices A-G and edges labeled with types a and b; dashed arrows indicate traversed edges and the path step in which they were discovered.]

Traversal Configuration                  Result
({A}, 'type=a', 0, 1, →)                 {A, B, C, D}
({A}, 'type=a', 1, 1, →)                 {B, C, D}
({A}, 'type=a', 2, 2, →)                 {F}
({A}, 'type=a', 1, ∞, →)                 {B, C, D, F}
({E}, 'type=b', 2, 2, ←)                 {D}
({A}, 'type=a ∨ type=b', 2, 2, →)        {E, F}

Formal Description. We define a traversal by a totally ordered set P of path steps, where each path step describes the transition between two traversal iterations. Path steps are evaluated sequentially according to the total ordering in P. We determine the number of path steps by the recursion boundary r. Formally, we define a graph traversal operation based on the mathematical notion of sets and their basic operations union and complement. Each path step pi ∈ P with 1 ≤ i ≤ r receives the set of vertices Di−1 discovered at level i − 1 and returns a set of adjacent vertices Di. Initially, we assign the set of start vertices to the set of discovered vertices (D0 = S). In the following, we define the transformation rules for pi with i > 0.

D→i = { v | ∃u ∈ Di−1 : e = (u, v) ∈ E ∧ eval(e, ϕ) } (1)

D←i = { u | ∃v ∈ Di−1 : e = (u, v) ∈ E ∧ eval(e, ϕ) } (2)

Depending on the traversal direction d, we select a different transformation rule: Equation 1 gives the definition for forward traversals (→) and Equation 2 for backward traversals (←). In path step pi we generate the set of vertices Di by traversing from each vertex in Di−1 over all outgoing/incoming edges matching the edge predicate ϕ. Once the traversal operation has finished processing the path step, the vertex set Di contains all vertices reachable within one hop from the vertices in Di−1 via edges for which ϕ holds. Equation 3 shows the definition of the resulting vertex set Rτ for a traversal operation τ.

Rτ = ( ⋃_{i=c}^{r} Di ) \ ( ⋃_{i=0}^{c−1} Di )    (3)

The first union is the set of target vertices; the second union is the set of visited vertices.

Conceptually, the collection boundary c and the recursion boundary r divide the discovered vertices into two working sets. The set of visited vertices contains all vertices that have been discovered before the traversal reached the collection boundary c. Vertices within the set of visited vertices are not considered for the final result, but are necessary to complete the traversal operation. We produce the set of visited vertices by forming the union of all partial vertex sets {D0, D1, . . . , Dc−1} from path steps p1 to pc−1. The set of target vertices refers to the set of vertices that are potentially relevant for the final result set. To produce the set of target vertices, we union all partial vertex sets {Dc, . . . , Dr} from path steps pc to pr. To retrieve the final result, we compute the complement of the set of target vertices and the set of visited vertices: only vertices from the set of target vertices that are not in the set of visited vertices are considered for the final result. Figure 3 shows a set of example traversal configurations and their query results for the given graph.

Example. The traversal configuration ({A}, 'type=a', 2, 2, →) in Figure 3 starts from vertex A, traverses edges of type a, and visits all vertices within a distance of 2 from vertex A. Here, we only collect vertices discovered in the last path step p2. Dashed arrows with numbers in Figure 3 show the traversed edges and the path step in which they were discovered.

[Figure 4: Processing phases in GRAPHITE. A traversal configuration ρ(S, ϕ, c, r, d) passes through a Preparation Phase, a Traversal Phase (where a controller backed by graph statistics selects among traversal implementations such as Level-Synchronous and Fragmented-Incremental), and a Decoding Phase.]

First, path step p1 transforms the vertex set D0 = {A} into the vertex set D1 = {B, C, D}. Next, path step p2 transforms vertex set D1 into vertex set D2 = {F}. Finally, the output for this example graph and traversal configuration is a vertex set containing only vertex F.
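The set-based semantics of Equations 1-3 can be transcribed directly; the sketch below is a deliberately unoptimized reference (edges given as (u, v, attrs) triples, cfg a TraversalConfig as sketched above) and not one of the physical operators described next.

def traverse(edges, cfg):
    # Reference semantics of tau_G(rho) following Equations 1-3; a plain transcription
    # for illustration, not one of the optimized physical operators.
    # 'edges' is a list of (u, v, attrs) triples, 'cfg' a TraversalConfig.
    levels = [set(cfg.start_vertices)]                  # D_0 = S
    seen = set(cfg.start_vertices)
    i = 0
    while i < cfg.recursion_bound and levels[-1]:
        frontier = levels[-1]
        if cfg.forward:                                 # Eq. 1: source -> target
            nxt = {v for (u, v, a) in edges if u in frontier and cfg.edge_predicate(a)}
        else:                                           # Eq. 2: target -> source
            nxt = {u for (u, v, a) in edges if v in frontier and cfg.edge_predicate(a)}
        if cfg.recursion_bound == float("inf") and not (nxt - seen):
            break        # fixed point reached; further levels cannot change R (Eq. 3)
        seen |= nxt
        levels.append(nxt)                              # D_i
        i += 1
    c = cfg.collect_from
    target = set().union(*levels[c:])                   # union of D_c .. D_r
    visited = set().union(*levels[:c])                  # union of D_0 .. D_{c-1}
    return target - visited                             # Eq. 3: R_tau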

3. TRAVERSAL OPERATORS

We now discuss the components of GRAPHITE, as shown in Figure 4, in detail. GRAPHITE receives a traversal configuration ρ and processes a traversal in three phases: a Preparation Phase, a Traversal Phase, and a Decoding Phase. All three phases share common interfaces that allow implementations to be exchanged easily.

Preparation Phase. We pass the set of start vertices S to the preparation phase and transform it into a more processing-friendly, set-oriented data structure. If the storage engine leverages dictionary encoding [6], we consult the value dictionary of the source/target vertex column in the edge column group and encode all vertices from S into their internal numerical value code representation. Depending on the traversal direction, we use a different vertex ID column (either Vs or Vt from Figure 2). In addition to the actual value encoding, we select the active edges that are to be considered for the traversal operation. Therefore, we push down the edge predicate ϕ to the storage engine to filter out invalid edges. Active edges are stored in a list that represents the valid and invalid records of a column group. Additionally, transactional visibility is guaranteed by intersecting the obtained list with the visibility information of the current transaction context from the multi-version concurrency control of the RDBMS. Finally, we pass the list of active edges to the traversal phase for further processing.

Traversal Phase. We distinguish two major components in the traversal phase: a set of traversal operator implementations and a traversal controller. Within the scope of this paper, we propose two traversal algorithm strategies, level-synchronous and fragmented-incremental, and describe them in detail in Sections 4 and 5, respectively. Initially, we pass a collection boundary c, a recursion boundary r, a traversal direction d, a set of active edges Ea, and the encoded vertex set S to the traversal phase. We select the best traversal operator implementation based on collected graph statistics and the characteristics of the traversal query. After the traversal operator has finished execution, it returns the discovered vertices in a set-oriented data structure.

Decoding Phase. To translate the internal code representations of discovered vertices back into actual ID values, we consult the value dictionary of the source/target vertex column for each value code and add the actual value to the final output set. If the storage engine does not leverage dictionary encoding, the decoding phase can be omitted and the result set returned directly.
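As a rough, hypothetical illustration of the three phases, the following sketch wires a toy value dictionary around an arbitrary traversal operator; ValueDictionary and run_traversal are stand-ins and do not reflect the actual SAP HANA interfaces.

class ValueDictionary:
    # Toy stand-in for dictionary encoding in the storage engine: values are mapped
    # to dense integer value codes and back.
    def __init__(self, values):
        self.code_of = {v: i for i, v in enumerate(sorted(set(values)))}
        self.value_of = {i: v for v, i in self.code_of.items()}
    def encode(self, value):
        return self.code_of[value]
    def decode(self, code):
        return self.value_of[code]

def run_traversal(cfg, edges, traversal_op):
    # Sketch of the three phases from Figure 4; 'traversal_op' stands in for the
    # operator chosen by the controller (LS- or FI-traversal).
    vdict = ValueDictionary([u for u, v, a in edges] + [v for u, v, a in edges])
    # Preparation phase: encode start vertices and push the edge predicate down
    # to obtain the active edges (here already as encoded (source, target) pairs).
    start_codes = {vdict.encode(s) for s in cfg.start_vertices}
    active = [(vdict.encode(u), vdict.encode(v)) for u, v, a in edges if cfg.edge_predicate(a)]
    # Traversal phase: run the selected operator on the encoded representation.
    result_codes = traversal_op(start_codes, active, cfg)
    # Decoding phase: translate value codes back into vertex identifiers.
    return {vdict.decode(code) for code in result_codes}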

3.1 Strategies and Variations

The core component of an implementation of the graph traversal operator is the traversal algorithm. In general, traversal algorithms occur in variations favoring different graph topologies and traversal queries. While dense graphs with a large vertex outdegree favor a level-synchronous traversal algorithm that is more robust with respect to skewed outdegree distributions, a very sparse graph with a low average vertex outdegree benefits from a more fine-granular traversal implementation.


We divide the algorithm engineering space for traversal implementations into two dimensions: traversal strategy and physical reorganization.

Within the scope of this paper, we propose two traversal algorithms, named LS-traversal and FI-traversal, in the traversal strategy dimension. The second dimension describes the physical organization of the edges. We distinguish between a clustered physical organization, where edges having the same source vertex are clustered together in the column, and an unclustered physical organization, where edges do not have a particular physical ordering. Both dimensions are freely combinable with each other. The LS-traversal operates level-synchronously and thereby emits, within a single traversal iteration, only those vertices that are adjacent to the vertices from the working set. For each traversal iteration, it reads the complete graph to retrieve adjacent vertices. For sparse graphs, each edge is possibly accessed multiple times although each edge is traversed exactly once. In multi-core environments with a large number of available hardware threads, this overhead can be hidden through parallelized read operations on the graph. For an underutilized database management system with a low query workload, such a read-intensive implementation can hide the additional cost for reading data multiple times. However, a single traversal query cannot leverage the full parallelization capabilities in a fully utilized database management system with a high query workload and possibly hundreds of traversal queries running in parallel. In such a scenario, the CPU is fully occupied and the overhead for reading the complete graph multiple times for a single traversal query cannot be concealed anymore. To keep the execution time of a single query low, either more hardware resources have to be added or the resource consumption of a single query has to be reduced.

The FI-traversal uses fewer CPU resources than the LS-traversal algorithm and aims at minimizing the total number of read operations on the complete graph. It processes a graph in column fragments and thereby materializes adjacent vertices immediately after processing the respective column fragment. Column fragments divide a column into logical partitions, where partition sizes can vary within a column. This allows limiting the operation area to those parts of the graph that are relevant for the given query. We give a detailed description of the FI-traversal in Section 5.

4. LS-TRAVERSAL

Figure 5 sketches the execution flow of our LS-traversal implementation. It operates on two columns Vs and Vt that represent the source and target vertices of edges, respectively. To fully exploit thread-level parallelization, we split columns Vs and Vt into n equally sized logical partitions of edges. In the following, s1, . . . , sn denote the partitions of Vs, and t1, . . . , tn the corresponding partitions of Vt. An LS-traversal visits vertices in a strict breadth-first ordering and thereby always discovers vertices on a shortest path. Conceptually, we divide an LS-traversal algorithm into four major algorithmic steps: Distribute, Scan, Materialize, and Merge. The distribute step propagates a search request with the working set of vertices to all n partitions in parallel. Next, each scan worker thread searches for vertices from the working set in its local partition si with 1 ≤ i ≤ n and writes search hits into a local position list pi.

Each materialization worker thread receives a local position list pi and fetches the adjacent vertices from the target vertex column Vt. Subsequently, the merge step collects and combines all locally discovered adjacent vertices into the vertex set R. Finally, the traversal algorithm either terminates and forwards its output to the decoding phase, or continues with the next traversal iteration.

We sketch our LS-traversal implementation in Algorithm 1. Initially, we pass a traversal configuration κ = (Sm, Ea, c, r, d) to the LS-traversal.

[Figure 5: LS-traversal algorithm. The start vertex set S is distributed to the partitions s1, . . . , sn of the source column Vs, each scan produces a local position list p1, . . . , pn, the corresponding partitions t1, . . . , tn of the target column Vt are used to materialize adjacent vertices, and the merge step combines them into the result set R (Distribute, Scan, Materialize, Merge).]

Algorithm 1: LS-traversal
Input: Traversal configuration κ = (Sm, Ea, c, r, d).
Output: Set of discovered vertices R.

1  begin
2    if d is backward then
3      swap(Vs, Vt);                  // adjust column handles
4    if c = 0 then
5      R ← Sm;                        // add start vertices to result
6    p ← 1; Dw ← Sm;
     while p ≤ r do
7      if Dw = ∅ then
8        return R;                    // no more vertices to discover
9      P ← ∅;
10     Vs.scan(Dw, Ea, P);            // parallel scan for Dw
11     Dw ← ∅;                        // reset working vertex set
12     Vt.materialize(P, Dw);         // materialize vertices from P
13     if p ≥ c then
14       R ← R ∪ Dw;                  // add vertices from Dw to result R
15     p ← p + 1;
16   return R;

The preparation phase emits the vertex set Sm, evaluates the edge predicate ϕ, and returns the set of active edges Ea. The output of an LS-traversal execution is a set of discovered vertices R. We collect intermediate results, such as vertex sets and position lists, either in space-efficient bit sets or in dense set data structures, depending on the estimated output cardinality of the traversal iteration.

First, the LS-traversal algorithm analyzes whether the traversal configuration describes a forward or a backward traversal and updates the handles to the columns accordingly (Line 3). If the collection boundary c is zero, all vertices in Sm are added to the final result R (Line 5). Initially, we assign the vertex set Sm to the working vertex set Dw. The major part of the LS-traversal algorithm describes a single traversal iteration and is executed at most r times (Lines 6–15). At the beginning of each traversal iteration, we check the working set Dw for emptiness. If it is empty, no more vertices can be discovered and the execution of the LS-traversal terminates. During each traversal iteration, we scan the source vertex column Vs in parallel and emit matching edges into a position list P. During the scan operation, we use the set of active edges Ea to check the validity of the matching edges and filter out invalid edges. In addition, the scan operation modifies the set of active edges by invalidating all traversed edges. Next, the LS-traversal algorithm materializes adjacent vertices into the working set Dw using the position list P. If the currently active traversal iteration has already passed the collection boundary c, it adds the discovered vertices from Dw to the result set R. Finally, it passes the working set Dw to the next traversal iteration. The traversal algorithm terminates if either no more vertices have been discovered during the last traversal iteration or it has reached the recursion boundary r.
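A compact, single-threaded Python rendering of Algorithm 1 follows; the parallel per-partition scan and the merge step are collapsed into one pass over the active edge positions, and all names are illustrative.

def ls_traversal(Vs, Vt, active, start, c, r, backward=False):
    # Single-threaded sketch of Algorithm 1. Vs and Vt are the source and target
    # columns (parallel lists), 'active' the set of active edge positions from the
    # preparation phase, 'start' the (encoded) start vertices.
    if backward:
        Vs, Vt = Vt, Vs                           # adjust column handles (Line 3)
    R = set(start) if c == 0 else set()
    Dw, p = set(start), 1
    while p <= r and Dw:
        P = [i for i in active if Vs[i] in Dw]    # scan: positions of matching edges
        active = active - set(P)                  # invalidate traversed edges
        Dw = {Vt[i] for i in P}                   # materialize adjacent vertices
        if p >= c:
            R |= Dw                               # collect from level c onwards
        p += 1
    return R

On the edge columns of Figure 2, for example, ls_traversal(edge_group["Vs"], edge_group["Vt"], set(range(6)), {1}, 1, 2) returns {2, 3, 4}, i.e., the vertices discovered in the first two hops from vertex 1.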


[Figure 6: Example transition graph and auxiliary data structures. The edge column group (Vs, Vt) is divided into fragments of size 4:
F1: (1,7) (8,1) (7,8) (8,13)
F2: (13,12) (14,19) (14,15) (14,13)
F3: (8,14) (15,19) (19,18) (15,18)
F4: (12,15) (15,17) (18,17) (17,16)
The transition graph connects the fragments F1–F4; their synopses of distinct source values are {1,7,8}, {13,14}, {8,15,19}, and {12,15,17,18}, respectively. The execution chain shown is F2 → F4 → F?, and the fragment queue holds F3 and F4, each with priority 1.]

Cost Model. The execution time of the LS-traversal is dominated by the total number of edges in the graph and the number of processed traversal iterations. It has a worst-case time complexity of O(r · |E|), where r denotes the recursion boundary and |E| refers to the total number of edges in the graph. For each traversal iteration, it scans the graph for the vertices adjacent to a given vertex set.

We provide a query- and graph-topology-dependent cost model to describe the execution time behavior of the LS-traversal. The cost of an LS-traversal can be derived from the number of edges to read and the number of traversal iterations to perform.

CLS = min{r, δ̃} · |E| · Ce (4)

In Equation 4, we define the cost CLS as the product of the minimum of the recursion boundary r and the estimated diameter δ̃ of the graph, the number of edges |E|, and a constant cost Ce for reading a single edge from main memory.

5. FI-TRAVERSAL

An FI-traversal attempts to limit read operations to those data records that are required for creating the final result. At the same time, it preserves the ability to fully exploit available thread-level parallelism. It differs from an LS-traversal in two fundamental ways. First, an FI-traversal materializes the adjacent vertices of a given set of vertices at column fragment granularity instead of column granularity. Therefore, intermediate results can be accessed immediately and are available before the next scan operation begins. Second, a scan operation searches for adjacent vertices from several unfinished traversal iterations and outputs results for multiple traversal iterations. Consequently, an FI-traversal does not operate level-synchronously, but instead traverses the graph incrementally by processing fragments in sequence.

We select the next fragment to read with the help of a lightweight, synopsis-based transition graph index (TGI). Conceptually, a TGI models a directed graph, where vertices denote fragments and edges represent transitions between them. A transition between two fragments F1 and F2 describes a path of length 2 with an edge e1 = (u, v) in F1 and an edge e2 = (v, w) residing in F2. If we read edge e1 in fragment F1 and proceed with the traversal afterwards, we have to read fragment F2 as well, as it contains edges that extend the traversal path. A fragment has at most one fragment transition to any other fragment, including to itself. Since not every fragment has a transition to every other fragment, we only represent directed edges between fragments if there is a transition between them. In addition to fragment transitions, we store lightweight synopses representing the distinct values of each fragment.

Algorithm 2: FI-traversal
Input: Traversal configuration κ = (Sm, Ea, c, r, d).
Output: Set of discovered vertices R.

1  begin
2    if d is backward then
3      swap(Vs, Vt);                  // adjust column handles
4    Dw[0] ← Sm;
5    Frontiers ← Sm;
6    sFactor ← 1;
7    mFactor ← 1;
8    while getNextFragment(Frontiers, F) do
9      Vs.nWayScan(F, Dw, Ea, sFactor, P);
10     Vt.nWayMaterialize(P, mFactor, Dw, Frontiers);
       if sFactor ≤ r then ++sFactor;
       if mFactor < r then ++mFactor;
11   R ← generateResult(c, r, Dw);

A fragment synopsis is stored as a compact Bloom filter with bits set for all distinct values present in the fragment. Figure 6 depicts a transition graph with fragment size 4 for the edge column group shown on its right-hand side. For example, there is a fragment transition F2 → F4 with a path 13 → 12 → 15, i.e., e1 = (13, 12) in F2 and e2 = (12, 15) in F4. The fragment synopses are directly attached to the corresponding fragments, e.g., the fragment synopsis {13, 14} represents all distinct values in fragment F2.
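Assuming the synopsis of a fragment indexes the distinct source-vertex values of that fragment, a TGI can be sketched as follows; plain sets replace the compact Bloom filters, so this illustration has no false positives.

def build_tgi(Vs, Vt, xi):
    # Sketch of the transition graph index (TGI) for fragment size xi over an edge
    # column group. A plain Python set stands in for the Bloom-filter synopsis of
    # each fragment (assumed here to hold the distinct source values).
    n_frag = (len(Vs) + xi - 1) // xi
    rows = lambda f: range(f * xi, min((f + 1) * xi, len(Vs)))
    synopsis = [{Vs[i] for i in rows(f)} for f in range(n_frag)]
    transitions = {f: set() for f in range(n_frag)}
    for f in range(n_frag):
        targets = {Vt[i] for i in rows(f)}        # vertices a traversal reaches by reading f
        for g in range(n_frag):
            if targets & synopsis[g]:             # some edge in g extends a path ending in f
                transitions[f].add(g)
    return synopsis, transitions

On the edge column of Figure 6 with ξ = 4, this reproduces the synopses {1,7,8}, {13,14}, {8,15,19}, {12,15,17,18} and, for instance, the transitions from F2 to F2, F3, and F4 used in the example below.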

In addition to the TGI, we store query-specific runtime information about already processed fragments and fragment candidates in auxiliary data structures. We keep already processed fragments in a fragment execution chain and append to it whenever a new fragment has been selected for execution. Since we only choose a single fragment at a time, we queue all other generated fragment candidates in a priority-based fragment queue. To choose the next fragment for execution, we select the tail fragment of the execution chain and use the set of newly discovered vertices (the so-called frontiers) from the previous traversal round. Every vertex in the graph can be a frontier vertex exactly once, i.e., when the vertex is first discovered.

For the tail fragment, we probe all adjacent fragment synopses with the frontier vertices. If an adjacent fragment matches, i.e., there is a transition between the tail and the adjacent fragment caused by one or more frontiers, the adjacent fragment is added to the fragment queue. If the fragment is already queued, we increase its priority. After updating the fragment queue, we remove the fragment with the highest priority and return it to the main traversal algorithm. If there are no frontier vertices, we immediately remove the fragment with the highest priority. If the fragment queue becomes empty, the traversal terminates.

Example. Let us consider an example traversal starting from vertex 13, as depicted in Figure 6. We start the traversal at fragment F2 and emit the newly discovered vertex with id 12. During the fragment selection, we probe all adjacent fragments of F2 (i.e., F2, F3, and F4) with frontier vertex 12. Since fragment F4 contains vertex 12 in its fragment synopsis, we select it as the next fragment to read. After processing fragment F4, we emit frontier vertex 15 and select the next fragment to read. Since the probing generates two candidate fragments, F3 and F4, we select one of them and continue the traversal.

After describing the principal workings of the FI-traversal algorithm, we sketch its algorithmic description in Algorithm 2. Initially, we pass a traversal configuration κ with a vertex set Sm, an edge set Ea, a collection boundary c, a recursion boundary r, and a direction d to the FI-traversal algorithm. It outputs a vertex set R with the visited vertices that have been discovered between c and r. Since the execution of an FI-traversal is based on sequential reads of fragments, we parallelize the scan and materialize operations within a single column fragment if necessary.


Algorithm 3: Procedure getNextFragment(Frontiers, F)
Input: Set of frontier vertices Frontiers.
Output: Candidate fragment F to read next.

1  begin
2    Flast ← m_chain.getLast();
3    foreach outgoing edge e = (Flast, Fcand) of Flast do
4      foreach v ∈ Frontiers do
5        if Fcand.matches(v) ∧ ¬I.hasKey(Fcand, v) then
6          I.insert(Fcand, v);
7          if PQ.hasKey(Fcand) then
8            PQ.increasePrio(Fcand);
9          else PQ.insert(Fcand);
10   if ¬PQ.empty() then
11     F ← PQ.extractMin();
12     m_chain.add(F);
13     return true;
14   else return false;

An FI-traversal runs in a series of iterations, where we process one fragment per iteration. At the beginning of each iteration, the procedure getNextFragment receives a set of frontier vertices and returns the next fragment F to read. A fragment contains the start and end position in the column and limits the scan to that range. Initially, we pass the set of start vertices as frontiers to getNextFragment. The body of the main loop performs a scan operation and a materialize operation (Lines 9–10). The scan takes the first sFactor working sets from the traversal iterations and returns matching edges in the corresponding position lists from the vector of position lists P. For example, an n-way scan with sFactor = 2 probes the column with two vertex sets from two different traversal iterations and returns matching edges into two position lists. Subsequently, newly discovered adjacent vertices are materialized in a similar multi-way manner as in the scan operation. Depending on mFactor, we read out the collected position lists and add adjacent vertices to the working sets in Dw. In addition, newly discovered vertices are added to the set of frontier vertices Frontiers. Once the recursion boundary is reached, the traversal reads and processes all remaining fragments from the fragment queue. If getNextFragment does not return any more fragments, the traversal terminates and generates the final result according to the given collection and recursion boundaries (Line 11).

Candidate Fragment Selection. Algorithm 3 describes in detail how to find the next fragment to read given a set of frontier vertices. It starts with the last processed fragment and probes adjacent fragments for matching vertices. For each adjacent fragment, we consult its fragment synopsis and compare the frontiers against it. If a frontier matches, we update the fragment queue accordingly: if the fragment is already in the queue, we increase its priority, otherwise we insert it. Further, we invalidate vertices in the synopses that triggered the candidate fragment selection and keep the invalidated vertices and their corresponding fragments in an invalidation list I (Line 6). Finally, we return the fragment with the highest priority from the fragment queue and append it to the execution chain. Since the fragment synopses are implemented as compact Bloom filters, false positive fragments can occur. However, a false positive does not harm the correctness of the traversal. There is a tradeoff between space consumption and execution time for the fragment synopses; we evaluate the effect of the Bloom filter size in the experimental evaluation. Since the value distribution across fragments might vary, each Bloom filter can have a different size depending on the number of distinct values present in the fragment.
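The interplay of frontiers, synopses, execution chain, and fragment queue can be illustrated with the following simplified scheduler, which ignores the collection and recursion boundaries and the multi-way scan factors, reuses the synopsis/transitions structures from the TGI sketch above, and replaces the priority queue and invalidation list with a plain dictionary.

def fragment_schedule(Vs, Vt, synopsis, transitions, xi, start):
    # Simplified sketch of the fragment selection driving an FI-traversal
    # (cf. Algorithm 3): it only illustrates which fragments are read and in
    # which order; a plain dict serves as the priority-based fragment queue.
    queue = {}                                          # fragment -> priority
    def enqueue(f):
        queue[f] = queue.get(f, 0) + 1                  # repeated matches raise the priority
    discovered = set(start)
    frontiers = set(start)
    for f, syn in enumerate(synopsis):                  # seed with the start vertices
        if syn & frontiers:
            enqueue(f)
    chain = []                                          # fragment execution chain
    while queue:
        f = max(queue, key=queue.get)                   # highest-priority fragment first
        del queue[f]
        chain.append(f)
        rows = range(f * xi, min((f + 1) * xi, len(Vs)))
        new = {Vt[i] for i in rows if Vs[i] in discovered} - discovered
        discovered |= new
        frontiers = new                                 # a vertex is a frontier only once
        for g in transitions.get(f, ()):                # probe adjacent fragment synopses
            if synopsis[g] & frontiers:
                enqueue(g)
    return chain, discovered

Seeded with vertex 13 on the Figure 6 column, this sketch first reads F2, then F4, and then holds F3 and F4 in the queue with priority 1 each, matching the state shown in the figure.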

Cost Model. The cost model of the FI-traversal is slightly more complex than that of the LS-traversal since the calculation of the cost depends on a larger set of input parameters. The cost of an FI-traversal can be directly related to the number and the size of the accessed fragments. Hence, we can use the chain of read fragments Fp to derive the cost of the FI-traversal: the overall cost is the accumulated cost of the reads for all accessed fragments in Fp. Consequently, the traversal cost is no longer directly dependent on the number of traversal iterations. We define the cost CFI of an FI-traversal in Equation 5 as follows.

CFI = Σ_{i=0}^{min{r, δ̃}} (1 + p) · (d̄out)^i · ξ · Ce    (5)

The cost depends on the average false positive rate p, the average vertex outdegree d̄out, and the fragment size ξ. The FI-traversal is bounded by the minimum of the recursion boundary r and the estimated effective graph diameter δ̃. The most important factors affecting the memory consumption of the TGI are the size and the number of fragments. We can minimize the memory consumption of the TGI by grouping edges by source vertex (see edge clustering in Section 6). Then, each vertex with incoming and outgoing edges contributes exactly once to a single fragment transition. For equally sized fragments, we have to choose a fragment size that is at least as large as the largest vertex outdegree. Therefore, we also propose a heterogeneous fragment size distribution in which every fragment is larger than a predefined minimum fragment size; the upper size is determined automatically by the vertex outdegree. We discuss these configuration parameters and their performance implications for the FI-traversal in detail in the evaluation in Section 7.
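A small helper, evaluating Equations 4 and 5 with Ce normalized to 1 and purely illustrative parameter values, shows why the FI-traversal tends to win for small recursion boundaries and the LS-traversal for deep traversals; the numbers below are not measurements.

def cost_ls(r, diam, num_edges, Ce=1.0):
    # Equation 4: C_LS = min{r, diam} * |E| * Ce
    return min(r, diam) * num_edges * Ce

def cost_fi(r, diam, p, d_out, xi, Ce=1.0):
    # Equation 5: C_FI = sum_{i=0}^{min{r, diam}} (1 + p) * d_out**i * xi * Ce
    return sum((1 + p) * (d_out ** i) * xi * Ce for i in range(int(min(r, diam)) + 1))

# Illustrative parameters, roughly in the range of data set PA from Table 1:
# |E| = 16.5e6, average outdegree 8.7, diameter estimate 9.4, xi = 512, p = 0.01.
for r in (1, 2, 3, 4, 5, 6):
    print(r, cost_ls(r, 9.4, 16.5e6), cost_fi(r, 9.4, 0.01, 8.7, 512))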

6. TOPOLOGY-AWARE CLUSTERING

The basic implementation of the LS-traversal algorithm does not rely on a particular ordering of the edges in the edge column group. However, to fully leverage the benefits of a main-memory storage engine, we can use data access patterns that provide more efficient access to data placed in memory. A physical reorganization of records is therefore a common optimization strategy to reduce data access costs [6]. In the following, we describe two strategies to further reduce the overall execution time of the LS-traversal algorithm by maximizing the spatial locality of memory accesses and thereby reducing the number of records to scan.

[Figure 7: Clustered edges. The edge column group is clustered by type on the first level (type clustering) and by source vertex within each type (edge clustering):]

Vs  Vt  Type
D   F   a
A   D   a
A   B   a
A   C   a
E   B   a
E   G   a
D   B   b
B   E   b
F   G   b

Type Clustering. Typically, real-world graph data sets are modeled with a widespread and diverse set of edge types that connect the vertices in the graph. Conceptually, an edge type describes a subgraph and can be interpreted as a separate layer or view on top of the original data graph. Such multi-relational graphs with multiple edge types are common in a variety of scenarios, such as product batch traceability, social network applications, or material flow graphs. For example, a product rating website might store different relationships between the entity types rating, user, and product, such as rating relationships, product hierarchies, and user fellowships. Consequently, traversal queries are specific with regard to which parts of the graph they refer to. We propose to arrange edges sharing the same type physically together, allowing a traversal query to operate directly on the subgraph instead of the entire original graph. Thus, a graph that comprises n different edge types results in n different subgraphs. A subgraph is associated with an area in the column that contains all edges forming the subgraph.


Figure 7 illustrates an edge column group with two edge types. Here, a traversal query that refers to edges of type b would only have to scan the corresponding subgraph, i.e., the contiguous portion of the column that stores the edges of type b. If the edge predicate contains a disjunctive condition, for example to traverse only over edges of type a or b, the LS-traversal algorithm automatically splits the scan operation and unions the partial results afterwards.

Edge Clustering. The most fundamental component of a traversal operation is retrieving the set of adjacent vertices for a given vertex. Therefore, an efficient traversal implementation must provide efficient access to adjacent vertices located in main memory. To achieve this, we introduce the notion of topological locality in a graph. Topological locality describes a concept for accessing all vertices adjacent to a given vertex v ∈ V: if one neighboring vertex of a vertex v is accessed, it is likely that all other vertices adjacent to v are accessed as well.

We translate topological locality in a graph directly into spatial locality in memory by grouping edges based on their source vertex. Such an edge clustering increases spatial locality, i.e., all edges sharing the same source vertex are written consecutively in memory. Maximizing the spatial locality of memory accesses results in a better last-level cache utilization and minimizes the amount of data to be loaded from main memory into the last-level cache of the processor [10]. Figure 7 sketches an example for vertex A: all edges having A as their source vertex are written consecutively into the edge column group. Applying clustering first by type and then by edge thus extends the physical reorganization to a second level. Especially the materialization step of an LS-traversal algorithm benefits from the increased spatial locality while fetching adjacent vertices from the Vt column (see Line 12 in Algorithm 1).
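One possible realization of the two clustering levels is a single stable reordering of the edge column group by (type, source vertex); the sketch below also derives the per-type column ranges that a type-restricted scan can be limited to (the dict-of-lists representation is, again, illustrative).

def cluster_edges(edge_group):
    # Physically reorder the edge column group: first cluster by edge type, then by
    # source vertex within each type (cf. Figure 7). Returns the reordered group and
    # the per-type ranges to which a type-restricted scan can be limited.
    order = sorted(range(len(edge_group["Vs"])),
                   key=lambda i: (edge_group["type"][i], edge_group["Vs"][i]))
    clustered = {c: [vals[i] for i in order] for c, vals in edge_group.items()}
    ranges, start = {}, 0
    types = clustered["type"]
    for pos in range(1, len(types) + 1):
        if pos == len(types) or types[pos] != types[pos - 1]:
            ranges[types[pos - 1]] = (start, pos)       # half-open range [start, pos)
            start = pos
    return clustered, ranges

Applied to the edge_group sketched in Section 2, a predicate such as type = 'rated' can then be evaluated by scanning only the range ranges['rated'] instead of the full column.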

Besides spatial locality, column decompression plays an important role in materializing adjacent vertices. Major in-memory database vendors rely on a two-level compression strategy. The first level is dictionary encoding, where a value is represented by its numerical value code from the dictionary and stored in a bit-packed, space-efficient data container. Here, a lightweight but still noticeable decompression routine is used to reconstruct the actual value code. If adjacent vertices are not stored in a consecutive chunk of memory, the decompression routine might decompress unnecessary value codes. A similar behavior can be observed on the second level of compression, the value-based block compression. Edge clustering allows retrieving blocks of value codes that can be reconstructed efficiently by leveraging SIMD instructions.

7. EVALUATION

We evaluate the LS-traversal and the FI-traversal on a diverse set of real-world graphs and for different types of graph queries. In the following, we describe the environmental setup and provide statistical information about the evaluated data sets. We present an extensive experimental evaluation of the memory consumption, execution time, and cost model, and a system-level comparison with two native graph management systems.

Environmental Setup and Data Sets

We have implemented GRAPHITE as a prototype in the context of the in-memory column-oriented SAP HANA database system. Graph data in SAP HANA is stored in two column groups, where each group has its own read-optimized main storage and write-optimized delta storage. Data manipulation operations exclusively modify the delta storage, which is periodically merged into the main storage. Deletions only invalidate records, and affected records are removed during the next merge process. Within the scope of this paper, we focus on read-only graphs, but argue that the proposed algorithms could easily be extended to support dynamic graphs as well (for example by treating the delta storage as a single fragment and by using general visibility data structures, such as a validity vector, to check for deletions). All values are dictionary-encoded, allowing the traversal algorithms to operate on the value codes directly. Initially, we loaded the data sets into their corresponding vertex and edge column groups and populated the TGI. We ran the experiments on a single server machine running SUSE Linux Enterprise Server 11 (64 bit) with an Intel Xeon X5650 running at 2.67 GHz (6 cores, 12 threads, 12 MB shared L3 cache) and 48 GB RAM. For the LS-traversal we leverage full parallelization with 12 threads; for the FI-traversal we use 1 thread to scan a single fragment. To evaluate our approach on a wide range of different graph topologies, we selected six real-world graph data sets from the following domains: social networks (OR, TW, LJ), citation networks (PA), autonomous system networks (SK), and road networks (CR). For each data set, Table 1 reports the number of vertices |V|, the number of edges |E|, the average vertex outdegree d̄out, the maximum vertex outdegree max(dout), the estimated graph diameter δ̃, and the raw size of the graph.

Table 1: Evaluated data sets with their topology statistics.

ID  |V|     |E|      d̄out   max(dout)  δ̃      Size (MB)
CR  1.9 M   2.7 M    2.8    12         495.0  143
LJ  4.8 M   68.5 M   28.3   635 K      6.5    1 617
OR  3.1 M   117.2 M  76.3   32 K       5.0    3 066
PA  3.7 M   16.5 M   8.7    793        9.4    397
SK  1.7 M   11.1 M   13.1   35 K       5.9    305
TW  40.1 M  1.4 B    36.4   2.9 M      5.4    32 686

All evaluated queries are of the form ({s}, '*', k, k, →), where s is a randomly selected start vertex, * refers to a nonselective edge filter, and k denotes the traversal depth. Without loss of generality, we focus in the evaluation on traversal queries where the collection boundary is equal to the recursion boundary. Such traversal queries only return vertices first discovered in traversal iteration k. For the runtime analysis, we randomly selected start vertices for the traversal and report the median execution time over 50 runs. We report the median since the execution time varies greatly for different start vertices.

We compare our LS-traversal and FI-traversal implementations against two join-based approaches (with and without secondary index support) in SAP HANA, the open-source version of Virtuoso Universal Server 7.1 [13], and the community edition of the native graph database management system Neo4j 2.1.3 [3]. For the experiments we prepared and configured the evaluated systems as follows:

SAP HANA. We loaded the data sets into two columnar tables, one for the vertices and one for the edges. For the indexed join, we created a secondary index on the source vertex column.

Virtuoso. Since Virtuoso is an RDF store, we transformed all data sets into RDF triples of the form <source_id> <edge_type> <target_id> and use SPARQL property paths to emulate a breadth-first traversal. We increased the number of buffers (NumberOfBuffers) and maximum dirty buffers (MaxDirtyBuffers) as recommended.

Neo4j. We configured the object caches of Neo4j so that the data set fits entirely into memory. We warmed up the object cache by running 10,000 random traversal queries against the database instance. To run the experiments, we used Neo4j's declarative query language Cypher and created an additional index on the vertex identifier attribute.


[Figure 8: TGI memory consumption for different false positive rates and fragment sizes (memory in MB, plotted over the fragment size ξ from 2^6 to 2^16, for the data sets CR, LJ, OR, PA, SK, and TW).]

TGI Memory Consumption

In this experiment we study the effect of various algorithm parameter configurations on the memory consumption of the TGI. We populated TGI instances for the clustered physical edge ordering with different fragment sizes and false positive rates, and present the results in Figure 8. To evaluate the impact of the fragment size ξ, we construct the TGI for fragment sizes 2^6, . . . , 2^16 and a fixed average false positive rate of 1%. We analyze the effect of the average false positive rate for a representative fragment size ξ = 512 and construct fragment synopses based on an average false positive rate p selected from {1%, 5%, 10%, 20%}.

The size of the TGI is directly related to the total number of edges of the original input graph. For fragment size ξ = 1024, the TGI consumes only about 10% of the size of the input graph. For ξ = 1024, the TGI of data set TW has the highest memory consumption with about 2.09 GB (about 8.4% of the raw size of the graph) and data set CR the lowest (about 8.8% of the raw size of the graph). When clustering by edge is disabled, the memory consumption of the TGI can grow up to a factor of 10 of the original graph. This makes the unclustered variant of the FI-traversal impractical for a productive system, as it occupies up to two orders of magnitude more memory than the clustered variant.

Impact of Fragment Size. For all evaluated data sets, the memory footprint decreases with increasing fragment size. A larger fragment size leads to a smaller number of vertices in the TGI and consequently to fewer possible transitions between them. Although larger fragments cause a denser TGI topology, the total number of fragment transitions is much lower than for smaller fragments. For input graphs with a larger average vertex outdegree, the memory overhead for ξ = 2^16 can be reduced to up to 13% of the memory overhead for ξ = 2^6. For very sparse graphs, such as CR, the TGI consumes about 37% of the memory for ξ = 2^16 compared to ξ = 2^6. Consequently, the sparser the input graph, the lower the impact of the fragment size on the total memory consumption of the TGI.

Impact of False Positive Rate. We store fragment synopses in space-efficient Bloom filter data structures, where each fragment synopsis occupies as much memory as needed to fulfill the predefined false positive rate. A larger false positive rate causes the FI-traversal to access more fragments, but reduces the memory footprint of the TGI. We show the memory overhead for different false positive rates in Figure 8. For data set PA, a false positive rate of 20% leads to a memory footprint decrease of 13% compared to a false positive rate of 1%. In contrast, data set CR reaches a memory footprint decrease of almost 50% for p = 20% compared to p = 1%.
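The inverse relationship between false positive rate and synopsis size follows from the standard Bloom filter sizing rule; the following sketch illustrates it. This is the textbook formula, not necessarily the exact sizing used in the TGI implementation.

```python
import math

def bloom_filter_bits(n: int, p: float) -> int:
    """Textbook Bloom filter size in bits for n elements and a target
    false positive rate p: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

# Example: a fragment synopsis over 512 distinct values (xi = 512).
for p in (0.01, 0.05, 0.10, 0.20):
    bits = bloom_filter_bits(512, p)
    print(f"p = {p:.0%}: {bits} bits (~{bits / 8:.0f} bytes)")
```

Larger values of p yield smaller synopses, which matches the memory footprint decrease observed for p = 20%.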

Runtime Analysis

Figure 9 presents the runtime results of the LS-traversal for all data sets and different traversal queries. We report average execution times of the three traversal phases (preparation, traversal, and decoding) as well as average output sizes. In general, the traversal phase dominates the overall execution time of the traversal operator and consumes up to 95% of the total runtime. The runtime of the preparation phase is only about 5% of the overall execution time and is independent of the number of traversal iterations; it only evaluates the edge predicate and processes the start vertices. The decoding phase highly depends on the size of the vertex output set, as it translates the value code of each vertex back into the corresponding vertex identifier. For the data set SK, we can see the effect of the output size on the runtime spent for decoding. The output size of the traversal grows steadily until the traversal reaches the effective diameter; consequently, only very few traversals reach a traversal depth larger than the effective diameter. The LS-traversal scales almost linearly with an increasing number of traversal iterations, as the full column scan takes about the same time to complete independent of the traversal iteration.
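To make the scan-based access pattern concrete, the following simplified sketch performs a level-synchronous traversal over two edge columns. It illustrates the general idea only and is not the GRAPHITE operator; preparation (predicate evaluation) and decoding are omitted.

```python
def ls_traversal(source_col, target_col, start_vertices, iterations):
    """Level-synchronous, scan-based traversal over columnar edge data."""
    visited = set(start_vertices)
    frontier = set(start_vertices)
    for _ in range(iterations):
        next_frontier = set()
        # Full column scan: every edge is inspected in every iteration,
        # which is why the cost per iteration is almost constant.
        for src, tgt in zip(source_col, target_col):
            if src in frontier and tgt not in visited:
                next_frontier.add(tgt)
        if not next_frontier:
            break
        visited |= next_frontier
        frontier = next_frontier
    return visited

# Example on a toy edge list: 0->1, 0->2, 1->3, 2->3, 3->4
print(ls_traversal([0, 0, 1, 2, 3], [1, 2, 3, 3, 4], {0}, 2))  # {0, 1, 2, 3}
```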

Figure 10 presents an in-depth comparison of LS-traversal and FI-traversal on all data sets for fragment sizes {2^7, ..., 2^10}. Larger fragment sizes resulted in higher execution times and are therefore omitted from the results. For all evaluated data sets, the LS-traversal shows a linear runtime behavior for an increasing number of traversal iterations. For the data set PA, the runtime increases steadily until the LS-traversal reaches the effective diameter; after that, the plot flattens for longer traversal queries. In comparison, the plots of the FI-traversal grow much faster with an increasing number of traversal iterations. For short traversals with a low number of traversal iterations, the FI-traversal outperforms the LS-traversal by up to two orders of magnitude. This can be explained by the more fine-granular graph access pattern of the FI-traversal. Especially the first traversal iterations process only very small parts of the whole graph, and a fine-granular fragment access clearly outperforms a full column scan. For a large working set, potentially many fragments have to be accessed, which is hard to predict and prefetch for the hardware. If large parts of the graph are accessed, a single full column scan is superior to many small fragment scans. The break-even point at which the FI-traversal outperforms the LS-traversal depends on the graph topology and the given traversal query. From the results we conclude that short traversal queries clearly favor the FI-traversal over the LS-traversal. For two of the data sets, however, even short traversal queries produce large intermediate results due to the power-law distribution of vertex outdegrees, which makes the FI-traversal less effective. For 4 out of 6 data sets, the FI-traversal outperforms the LS-traversal for traversal queries with r ≤ 5. The fragment size has a severe impact on the overall execution performance of the FI-traversal. For data set CR, the fragment size affects not only the total runtime, but also widens the range of traversal queries for which the FI-traversal outperforms the LS-traversal. For example, with a fragment size of ξ = 2^7, a traversal query with traversal depth 14 on data set CR consumes only about 26% of the runtime compared to ξ = 2^10. In general, we conclude that the FI-traversal is superior to the LS-traversal for very sparse graphs or for short traversal queries.
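The fine-granular access pattern of the FI-traversal can be sketched as follows. A plain Python set stands in for the Bloom filter synopsis of each fragment, and the sketch only captures the pruning idea: fragments whose synopsis cannot match the current frontier are skipped entirely. It is a simplification of the TGI, not its actual implementation.

```python
def build_fragments(source_col, target_col, fragment_size):
    """Partition the edge columns into fragments of fragment_size edges and
    attach a synopsis of the source vertices appearing in each fragment."""
    fragments = []
    for i in range(0, len(source_col), fragment_size):
        sources = source_col[i:i + fragment_size]
        targets = target_col[i:i + fragment_size]
        fragments.append((sources, targets, set(sources)))
    return fragments

def fi_traversal(fragments, start_vertices, iterations):
    """Fragment-based traversal: only fragments whose synopsis may contain a
    frontier vertex are scanned; all other fragments are pruned."""
    visited = set(start_vertices)
    frontier = set(start_vertices)
    for _ in range(iterations):
        next_frontier = set()
        for sources, targets, synopsis in fragments:
            if frontier.isdisjoint(synopsis):
                continue  # synopsis check prunes this fragment
            for src, tgt in zip(sources, targets):
                if src in frontier and tgt not in visited:
                    next_frontier.add(tgt)
        if not next_frontier:
            break
        visited |= next_frontier
        frontier = next_frontier
    return visited
```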

Figure 11 depicts the slowdown factors for all data sets with different fragment sizes {2^6, ..., 2^16}. To compute the slowdown factor, we use the data point for the smallest fragment size and false positive rate as the baseline and relate all other results to this baseline. Further, we analyze the effect of the false positive rate on the query runtime. Without loss of generality, we conduct all experiments on a representative query of the form {{s}, '*', 3, 3, →}.


Figure 9: Execution time and output size of LS-traversal for different queries and data sets (per data set: execution time in ms, split into the preparation, traversal, and decoding phases, and output size over the number of traversal iterations).

In general, the FI-traversal with edge clustering enabled finished the execution on average about 3.5 times faster than the unclustered variant. If the graph is not clustered by edge, the probability of a transition to another fragment is significantly higher due to a higher number of distinct values in the fragment. With edge clustering enabled, the maximum number of possible transitions is bounded by the number of vertices in the graph. For all data sets, smaller fragment sizes close to the expected average vertex outdegree are more beneficial with respect to execution performance than larger ones. Although one could specify a fragment size that is very small or even close to 1, the memory overhead would be prohibitively high. Therefore, we couple the minimum fragment size to the vertex outdegree.

A smaller false positive rate increases the memory consumption of the TGI, but speeds up the runtime of the FI-traversal. If the false positive rate is too large, many fragments are read although they do not contribute to the traversal query result.

Impact of Edge Predicates. We study the effect of edge predicates on the query performance of the LS-traversal and the FI-traversal in Figure 12. An edge predicate selects a subgraph of the entire data graph and limits the traversal to a subset of active edges. We generated edge weights following a Zipfian distribution with s = 2 and assigned them randomly to the edges. A selectivity of 25%, i.e., an edge predicate that selects only 25% of all edges, leads to a speedup of 3 for the LS-traversal. We observed that an edge predicate with a low selectivity drastically reduces the size of intermediate and final output results. Since the LS-traversal is a scan-based traversal algorithm, it still has to scan the entire column for each traversal iteration. In contrast, the FI-traversal reaches a speedup of up to a factor of 6 for a selectivity of 25%. If the selectivity is low, more fragments can be pruned during the traversal, which explains the roughly doubled speedup compared to the LS-traversal.
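This contrast can be illustrated with a small variation of the scan-based sketch above (again only an illustration): the predicate is evaluated once into a boolean mask during preparation, but the per-iteration scan still touches every edge, which is why the speedup of the LS-traversal is limited even for highly selective predicates, whereas the FI-traversal can skip whole fragments.

```python
def ls_traversal_with_predicate(source_col, target_col, active, start_vertices, iterations):
    """LS-traversal restricted to active edges: only active edges can expand
    the frontier, but the full column scan still inspects every edge."""
    visited = set(start_vertices)
    frontier = set(start_vertices)
    for _ in range(iterations):
        next_frontier = {tgt for src, tgt, ok in zip(source_col, target_col, active)
                         if ok and src in frontier and tgt not in visited}
        if not next_frontier:
            break
        visited |= next_frontier
        frontier = next_frontier
    return visited

# The mask is computed once in the preparation phase, e.g. from edge weights
# (the threshold below is a made-up value for illustration):
weights = [5, 1, 3, 9, 2]
active = [w < 4 for w in weights]
```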

System-Level Benchmarks. We compared our two traversal implementations with a purely relational self-join-based approach (with and without secondary index support), Neo4j, and Virtuoso. For the join-based traversal, we use the same data layout as for GRAPHITE and leverage the columnar relational engine of SAP HANA. We present our results in Figure 13. For short traversals of 1–3 hops, our FI-traversal is competitive with the native graph implementations of Neo4j and Virtuoso. For the data sets PA, SK, and CR, the FI-traversal outperforms all evaluated systems by up to an order of magnitude.
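For illustration, the self-join-based baseline can be thought of as one join per hop between the current frontier and the edge table. The sketch below mimics this with pandas; the column names source and target are assumptions, and the actual baseline runs as relational plans inside SAP HANA.

```python
import pandas as pd

def join_based_traversal(edges: pd.DataFrame, start_vertices, hops):
    """Each hop is a self-join of the current frontier with the edge table,
    followed by duplicate elimination against the set of visited vertices."""
    visited = set(start_vertices)
    frontier = pd.DataFrame({"vertex": sorted(start_vertices)})
    for _ in range(hops):
        hop = frontier.merge(edges, left_on="vertex", right_on="source")
        new_vertices = set(hop["target"]) - visited
        if not new_vertices:
            break
        visited |= new_vertices
        frontier = pd.DataFrame({"vertex": sorted(new_vertices)})
    return visited

edges = pd.DataFrame({"source": [0, 0, 1, 2, 3], "target": [1, 2, 3, 3, 4]})
print(join_based_traversal(edges, {0}, 2))  # {0, 1, 2, 3}
```

The secondary index on the source vertex column accelerates exactly the merge step of each hop.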

Cost Model Evaluation

To verify our cost model function, we applied regression analysis. We use the coefficient of determination, denoted R², to evaluate the quality of our FI-traversal cost model: the closer R² is to 1, the better the proposed cost model function fits the manually collected data points. We compare the manually collected numbers of accessed edges against the results of the cost function. For each data set, we performed a set of traversal queries with a recursion boundary ranging from 1 to 10. We achieved the best result, an average R² of 0.92, for data set CR. For the data set EP, we achieved R² = 0.78. In general, graphs with a power-law vertex outdegree distribution caused our cost function to underestimate the costs of the FI-traversal. This underestimation can be explained by the method used to describe the vertex outdegree distribution: we use the average vertex outdegree d̄out to estimate the expected number of neighbors of a single vertex. If the traversal discovers a vertex with a considerably larger outdegree, the cost function underestimates the access costs. Additionally, the traversal depth is estimated as the minimum of the recursion boundary and the diameter of the graph. However, traversal queries that terminate before they reach the recursion boundary are not appropriately reflected in the cost function. We believe that additional information about the outdegree distribution and the distribution of path lengths is required to obtain a more accurate estimation from the cost function.
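The following sketch conveys the flavor of such an estimate and of the R² computation against measured edge counts. It is not the concrete GRAPHITE cost function; the growth model (frontier multiplied by the average outdegree per iteration, depth capped by the diameter) is only one plausible reading of the description above.

```python
def estimated_accessed_edges(avg_outdegree, num_start_vertices, recursion_boundary, diameter):
    """Rough traversal cost estimate: the expected frontier grows by the
    average outdegree per iteration, and the depth is capped by the diameter."""
    depth = min(recursion_boundary, diameter)
    frontier, accessed = float(num_start_vertices), 0.0
    for _ in range(depth):
        accessed += frontier * avg_outdegree   # edges touched in this iteration
        frontier *= avg_outdegree              # expected size of the next frontier
    return accessed

def r_squared(observed, predicted):
    """Coefficient of determination between measured and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```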


Figure 10: Comparison of LS-traversal and FI-traversal for different queries and data sets (per data set: execution time in ms over the number of traversal iterations for LS and for FI with fragment sizes 2^7 to 2^10).

Figure 11: Execution time of FI-traversal for the query {{s}, '*', 3, 3, →} (slowdown in multiples of the baseline over the fragment size ξ at p = 1%, and over the false positive rate p at ξ = 2^9, one curve per data set).

Figure 12: Speedup in multiples of the baseline for different edge predicate selectivities with the query {{s}, '*', 3, 3, →}. The baseline is a traversal query without an edge predicate, i.e., a traversal on the entire graph.

8. RELATED WORK

Graph Traversal Algorithms. Graph traversals are one of the most important and fundamental building blocks of graph algorithms, such as finding shortest paths, computing the maximum flow, and identifying strongly connected components. Increasing graph data sizes and the proliferation of parallelism on different hardware levels, as well as heterogeneous processor environments, encouraged researchers to revisit the well-known breadth-first graph traversal and to propose novel techniques to run graph traversals on high-end computers with large numbers of cores and different types of processors in a single machine. A large body of research has been conducted on efficient parallel graph traversals, lately even leveraging co-processors to speed up graph processing on large data graphs [17]. State-of-the-art parallel graph traversals operate with a level-synchronous strategy and parallelize the work to be done at each level. However, all parallel graph traversal implementations rely on sophisticated data structures that are tailored to the graph traversal algorithm. Such an algorithm-dependent data structure is not applicable in our case, since we use the traversal operator in an RDBMS on top of a common storage engine without copying data into separate data structures. As one of our strongest advantages, we do not require the graph data to be copied from possibly already existing legacy relational tables into algorithm-specific data structures. Replicating data into separate data structures wastes memory and also adds a considerable maintenance overhead. Chhugani et al. study scalable breadth-first traversal algorithms on modern hardware with multi-socket, multi-core processor architectures [12, 27]. They achieved impressive performance by tuning the data structures and the traversal algorithm to the underlying hardware. In contrast to our approach, they only consider a single implementation of graph traversals for all graph topologies and types of graph traversal queries. Graph traversals in distributed memory recently gained more attention and resulted in the development of sophisticated data partitioning schemes for distributed graph traversals [11, 26].

Graph Processing on Column Stores. Column stores have shown great potential for storing and querying wide and sparse data [4]. These considerations brought up research projects that aimed to provide efficient access to RDF [7] and XML data [30] kept in a column store. However, none of them covered the design and implementation of a native graph traversal operator that leverages the advantages of columnar data structures and exploits knowledge about the graph topology to speed up the graph traversal execution.

Distributed Graph Engines. The demand to efficiently process real-world billion-scale graphs triggered the development of a variety of distributed graph processing systems [15, 18, 21, 22]. GBase is a distributed graph engine based on MapReduce and relies on distributed matrix-vector multiplications [18]. The vertex-centric programming model, as proposed by [22], has been an area of active research and has been implemented in GraphLab [21] and PowerGraph [15], among others. Although distributed graph engines show good scalability for billion-scale graphs, we see the following disadvantages that make them not applicable in our scenarios: (1) business data from enterprise-critical applications is still mainly stored in RDBMSs and cannot easily be replicated to external graph processing engines; (2) graph engines typically cannot cope with cross-data-model operations (e.g., combining text, graph, and spatial data); (3) distributed graph engines rely on sophisticated graph partitioning algorithms that do not scale well to large graphs and are hard to maintain for dynamic graphs; and (4) they do not provide transactional guarantees.


Figure 13: System-level benchmarks of LS-traversal, FI-traversal, Virtuoso, Neo4j, and a self-join-based approach (with and without secondary index support); execution time in ms over the number of traversal iterations, per data set. Since the execution times vary highly for different start vertices, we report median execution times. We were not able to run experiments for the Twitter data set on Neo4j due to data loading issues.

Single-Machine Graph Engines. An interesting alternative to distributed graph engines has been introduced by Kyrola et al. [19] and conceptually extended by Han et al. [16]. GraphChi is a disk-based graph engine on a single machine that exploits parallel sliding windows and sharding to efficiently process billion-scale graphs from disk [19]. To minimize I/O overhead, they apply a technique similar to edge clustering to improve disk access and maximize data locality on disk. The lack of support for attributes on vertices and edges and for dynamic graphs resulted in GraphChi-DB, a recent extension of GraphChi [20]. Interestingly, they also use a vertically partitioned layout to represent attributes on vertices and edges. In contrast to GraphChi, we run GRAPHITE not as a standalone graph engine, but as part of a graph runtime stack on a common relational storage engine in an RDBMS. Since our targeted scenarios run on main-memory RDBMSs, GRAPHITE can operate completely in memory and aims at maximizing CPU cache locality. Similar to GraphChi is TurboGraph, a single-machine disk-based graph processing engine that uses solid-state disks (SSDs) to store and process large graphs [16]. They use a vertically partitioned layout on SSD to store vertex attributes. However, we see two major drawbacks compared to our approach: (1) GraphChi and TurboGraph are efficient single-machine graph engines, but do not provide transactional access to the data; and (2) graph data has to be available upfront in a specific data format on disk. To be applicable to business data stored in an RDBMS, the data has to be exported and transformed into a file format that can be consumed by these systems.

Graph Databases. A different direction is followed by graph databases, such as Neo4j [3], Sparksee [23], and InfiniteGraph [2]. While Neo4j relies on a disk-based storage accelerated by buffer pools to store recently accessed parts of the graph, Sparksee allows manipulating and querying the graph in memory. The Sparksee-internal data structures rely on efficient bitmaps, which represent the sets of vertices and edges describing the graph [23]. All graph databases are specialized engines that can perform graph-oriented processing efficiently, but they always require loading possibly relational business data in advance from other data sources. In contrast, our approach can directly operate on the relational business data without having to copy it to a dedicated database engine. Moreover, the combination of relational operations and graph operations can be handled efficiently by a single database engine.

Graph Processing in RDBMS. Although graph databases are a rather new research field, path traversals in relational databases with the help of recursive queries have been a focus of research for more than 20 years [8]. There have been proposals for extending relational query languages with support for recursion in the past, and even the SQL:1999 standard offers recursive common table expressions. However, commercial database vendors often provide their own proprietary functionality, if they support recursion at all. Gao et al. leverage recent extensions to the SQL standard, such as window functions and merge statements, to implement algorithms for shortest path discovery on relational tables [14]. Unlike our approach, they reuse existing relational operators and the relational query optimizer to create an optimal execution plan. However, an optimizer that ignores graph-specific statistics is likely to select a suboptimal execution plan. Magic-set transformations are a query rewrite technique for optimizing recursive and non-recursive queries, which was originally devised for Datalog [9] and has been extended to SQL [25]. Since graph traversals can be expressed as recursive database queries, the magic-sets transformation could also be applied to them. However, instead of proposing an optimization strategy for relational execution plans, we approach the problem with a dedicated plan operator.

9. CONCLUSION

We presented GRAPHITE, a modular and versatile graph traversal framework for main-memory RDBMSs. As part of GRAPHITE, we presented two specialized traversal implementations, named LS-traversal and FI-traversal, to support a wide range of different graph topologies and varying graph traversal queries as efficiently as possible. GRAPHITE is extensible, and other graph traversal strategies, such as depth-first-based traversals, could be integrated as well. We derived a basic cost model for the two traversal implementations and experimentally showed that it can assist a query optimizer in selecting the optimal traversal implementation. The FI-traversal outperforms the LS-traversal by up to two orders of magnitude for graphs with a low density and for short traversal queries. In contrast, the LS-traversal performs significantly better than the FI-traversal if the graph is dense or the query traverses a large fraction of the whole graph. Our experimental results illustrate the need for a graph traversal framework with an accompanying set of traversal operator implementations. Finally, we showed that, despite popular belief, graph traversals can be implemented efficiently in an RDBMS on a common relational storage engine and are competitive with native graph management systems.

10. REFERENCES

[1] Apache Giraph project website. http://www.giraph.apache.org/.
[2] InfiniteGraph project website. http://objectivity.com/INFINITEGRAPH.
[3] Neo4j project website. http://neo4j.org/.
[4] D. Abadi. Column Stores for Wide and Sparse Data. In Proc. CIDR'07, pages 292–297, 2007.
[5] D. Abadi, R. Agrawal, A. Ailamaki, M. Balazinska, P. A. Bernstein, M. J. Carey, S. Chaudhuri, J. Dean, A. Doan, M. J. Franklin, et al. The Beckman Report on Database Research. ACM SIGMOD Record, 43(3):61–70, 2014.
[6] D. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-oriented Database Systems. In Proc. SIGMOD'06, pages 671–682, 2006.
[7] D. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proc. VLDB'07, pages 411–422, 2007.
[8] R. Agrawal. Alpha: An Extension of Relational Algebra to Express a Class of Recursive Queries. 14(7):879–885, 1988.
[9] F. Bancilhon, D. Maier, Y. Sagiv, and J. D. Ullman. Magic Sets and Other Strange Ways to Implement Logic Programs. In Proc. PODS'86, pages 1–15, 1986.
[10] P. Boncz, S. Manegold, and M. Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access. In Proc. VLDB'99, pages 54–65, 1999.
[11] A. Buluç and K. Madduri. Parallel Breadth-First Search on Distributed Memory Systems. In Proc. SC'11, pages 1–12, 2011.
[12] J. Chhugani, N. Satish, C. Kim, J. Sewall, and P. Dubey. Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency. In Proc. IPDPS'12, pages 378–389, 2012.
[13] O. Erling. Virtuoso, a Hybrid RDBMS/Graph Column Store. IEEE Data Eng. Bull., 35:3–8, 2012.
[14] J. Gao, R. Jin, J. Zhou, J. X. Yu, X. Jiang, and T. Wang. Relational Approach for Shortest Path Discovery over Large Graphs. Proc. VLDB Endow., 5(4):358–369, Dec. 2011.
[15] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-parallel Computation on Natural Graphs. In Proc. OSDI'12, pages 17–30, 2012.
[16] W.-S. Han, S. Lee, K. Park, J.-H. Lee, M.-S. Kim, J. Kim, and H. Yu. TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC. In Proc. KDD'13, pages 77–85, 2013.
[17] S. Hong, T. Oguntebi, and K. Olukotun. Efficient Parallel Graph Exploration on Multi-Core CPU and GPU. In Proc. PACT'11, pages 78–88, 2011.
[18] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos. GBASE: A Scalable and General Graph Management System. In Proc. KDD'11, pages 1091–1099, 2011.
[19] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale Graph Computation on Just a PC. In Proc. OSDI'12, pages 31–46, 2012.
[20] A. Kyrola and C. Guestrin. GraphChi-DB: Simple Design for a Scalable Graph Database System - on Just a PC. CoRR, abs/1403.0701, 2014.
[21] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow., 5(8):716–727, Apr. 2012.
[22] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-scale Graph Processing. In Proc. SIGMOD'10, pages 135–146, 2010.
[23] N. Martínez-Bazan, M. A. Águila Lorente, V. Muntés-Mulero, D. Dominguez-Sal, S. Gómez-Villamor, and J.-L. Larriba-Pey. Efficient Graph Management Based on Bitmap Indices. In Proc. IDEAS'12, pages 110–119, 2012.
[24] N. Martínez-Bazan, V. Muntés-Mulero, S. Gómez-Villamor, J. Nin, M.-A. Sánchez-Martínez, and J.-L. Larriba-Pey. DEX: High-Performance Exploration on Large Graphs for Information Retrieval. In Proc. CIKM'07, pages 573–582, 2007.
[25] I. S. Mumick and H. Pirahesh. Implementation of Magic-sets in a Relational Database System. In Proc. SIGMOD'94, pages 103–114, 1994.
[26] R. Pearce, M. Gokhale, and N. M. Amato. Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory. In Proc. SC'10, pages 1–11, 2010.
[27] V. Prabhakaran, M. Wu, X. Weng, F. McSherry, L. Zhou, and M. Haridasan. Managing Large Graphs on Multi-cores with Graph Awareness. In Proc. USENIX ATC'12, pages 41–52, 2012.
[28] M. A. Rodriguez and P. Neubauer. Constructions from Dots and Lines. Bulletin of the American Society for Information Science and Technology, 36(6):35–41, 2010.
[29] M. A. Rodriguez and P. Neubauer. A Path Algebra for Multi-Relational Graphs. In Proc. ICDEW'11, pages 128–131, 2011.
[30] J. T. Teubner. Pathfinder: XQuery Compilation Techniques for Relational Database Targets. PhD thesis, TUM, 2006. http://www-db.in.tum.de/~teubnerj/publications/diss.pdf.

