gMark: Schema-Driven Generation of Graphs and Queries

Radu Ciucanu

Université Clermont Auvergne

Joint work with colleagues from Univ. Lille, Univ. Lyon, TU Eindhoven

JIRC 2017, Orléans

Radu Ciucanu gMark: Schema-Driven Generation of Graphs and Queries JIRC 2017, Orleans 1 / 41

Why graph data?

Big graph data sets are ubiquitous

social networks (e.g., LinkedIn, Facebook)

scientific networks (e.g., Uniprot, PubChem)

knowledge graphs (e.g., DBPedia)

...

Focus is on “things” and their relationships

Why graph databases?

Analytics on big graphs increasingly important

role discovery in social networks

identifying interesting patterns in biological networks

finding important publications in a citation network

...

In response to these trends, the past decade has witnessed an explosion of graph data management solutions, e.g.,

Graph databases such as Neo4j

Graph analytics platforms such as GraphX

Triple stores such as Virtuoso

Datalog engines such as LogicBlox

Why graph database benchmarking?

Benchmark = data sets + query workloads

"When a field has good benchmarks, we settle debates and the field makes rapid progress."

D. Patterson (CACM, 2012)

Motivated by success stories in relational and XML engineering, e.g., TPC and XMark, it is clear that good benchmarks are needed for graph DBs

Graph database benchmarking

LDBC-SNB¹ and WatDiv² are current leaders in graph DBMS benchmarking

LDBC is a fixed-schema and fixed-queries benchmark targeting focused stress-testing of query engineering choke-points

– social network scenario

WatDiv is a schema-driven workload-based benchmark targeting broad coverage of query features

– default schema is a products-and-users scenario

¹ Erling, Averbuch, Larriba-Pey, Chafi, Gubichev, Prat, Pham, and Boncz: The LDBC social network benchmark: Interactive workload. SIGMOD’15.

² Aluç, Hartig, Özsu, and Daudjee: Diversified stress testing of RDF data management systems. ISWC’14.

Synthetic graph and workload generation with gMark

We present gMark, an open-source¹ framework for generation of synthetic graphs and workloads.

Given a graph schema, gMark

generates synthetic instances of the schema (of desired size)

generates sophisticated query workloads with targeted structure and runtime behavior (which holds for all instances of the schema)

¹ https://github.com/graphMark/gmark

Why gMark?

We adopt successful aspects of the state of the art

Like WatDiv (and unlike LDBC), gMark is schema-driven,

allowing finely tailored graph instances for specific application domains; and,

allowing tightly controlled generation of query workloads.

Like LDBC (and unlike WatDiv), gMark supports focused stress-testing of query engineering choke-points, through fine control of query selectivities.

Unlike both WatDiv and LDBC, gMark

supports the generation of workloads containing recursive path queries, which are fundamental for graph analytics;

performs selectivity estimation in a purely instance-independent, schema-driven fashion.

– hence, more scalable, more predictable, and easier to explain/understand

Overview of the gMark workflow

Inputs:

• Graph configuration: size, node types, edge predicates, schema constraints, degree distributions

• Query workload configuration: size, selectivity, recursion, shape, arity

gMark graph & query generator, which outputs:

• Graph instance file (CSV)

• Query workload file (UCRPQs as XML)

gMark query translator, which translates the query workload into:

• SPARQL

• openCypher

• PostgreSQL

• Datalog

gMark: Schema-Driven Generation of Graphs and Queries

1 Graph Generation

2 Query Generation

3 Scalability Study of Current Graph Databases

4 Evolving Graph Generation

1 Graph Generation

gMark graph generation

Graph configurations

The user can specify in the graph configuration (i.e., graph schema):

• Size: # of nodes

• Node types: finite set of node labels, e.g., author, citation, journal

• Edge predicates: finite set of edge labels, e.g., authoredBy, referencedBy

• Schema constraints: proportion of nodes/edges of a given type, e.g., 20% of all nodes are authors

• Degree distributions: on the in- and out-degree of edge predicates (uniform, normal, Zipfian), e.g., the out-distribution of citation --authoredBy--> author is Gaussian with parameters µ = 3, σ = 1
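As an illustration only, the parameters listed above could be written down as plain data. The Python sketch below uses hypothetical class and field names (`GraphConfig`, `EdgeConstraint`, `Distribution`) that are assumptions made for this example and do not reflect gMark's actual configuration format; it then draws out-degrees for citation --authoredBy--> author from the Gaussian with µ = 3, σ = 1 mentioned on the slide.

```python
from dataclasses import dataclass, field
from typing import Optional
import random


@dataclass
class Distribution:
    kind: str                                   # "uniform", "gaussian" or "zipfian"
    params: dict = field(default_factory=dict)


@dataclass
class EdgeConstraint:
    source: str
    predicate: str
    target: str
    out_distr: Distribution
    in_distr: Optional[Distribution] = None     # left unspecified in this sketch


@dataclass
class GraphConfig:
    size: int                                   # number of nodes
    node_types: dict                            # node label -> proportion of all nodes
    edge_constraints: list


def sample_gaussian_degree(rng, mu, sigma):
    """Draw a non-negative integer degree from a rounded Gaussian."""
    return max(0, round(rng.gauss(mu, sigma)))


# Hypothetical configuration; the 10% share of citations is an assumed value.
config = GraphConfig(
    size=1000,
    node_types={"author": 0.20, "citation": 0.10},
    edge_constraints=[
        EdgeConstraint("citation", "authoredBy", "author",
                       out_distr=Distribution("gaussian", {"mu": 3, "sigma": 1})),
    ],
)

rng = random.Random(0)
num_citations = int(config.size * config.node_types["citation"])
out_params = config.edge_constraints[0].out_distr.params
out_degrees = [sample_gaussian_degree(rng, out_params["mu"], out_params["sigma"])
               for _ in range(num_citations)]
```

Each entry of `out_degrees` is the number of authoredBy edges leaving one citation node; the average should sit close to µ.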

Graph configurations: Uniprot schema

Node types:

Node type    Constr.
gene         35%
protein      31%
author       20%
citation     10%
organism     1%
...          ...

Edge predicates:

Edge predicate    Constr.
authoredBy        64%
encodedOn         6%
referencedBy      3%
occursIn          2%
...               ...

In- and out-degree distributions:

source type --predicate--> target type    In-distr.    Out-distr.
citation --authoredBy--> author           Zipfian      Gaussian
...                                       ...          ...

Schema-driven graph generation

We have established the intractability of the generation problem

Theorem

Given a graph configuration G, deciding whether or not there exists a graph instance satisfying G is NP-complete.

Hence, gMark follows a ‘best-effort’ strategy in instance generation (O(n)), i.e., it attempts to achieve the exact values of the input parameters and relaxes them whenever this is not possible.
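One way to picture a linear-time best-effort strategy is the sketch below. It is an assumption made for illustration, not necessarily the algorithm gMark implements: for a single edge predicate, each source node receives as many "out-slots" as its sampled out-degree, each target node as many "in-slots" as its sampled in-degree, both lists are shuffled, and slots are paired until the shorter list runs out. The side that asked for more edges is thereby relaxed. The function name `generate_edges` and its parameters are hypothetical.

```python
import random


def generate_edges(sources, targets, out_degree, in_degree, rng):
    """Best-effort pairing of out-slots with in-slots for one edge predicate.

    out_degree and in_degree are callables that draw one degree from rng.
    The running time is linear in the total number of slots.
    """
    out_slots = [s for s in sources for _ in range(out_degree(rng))]
    in_slots = [t for t in targets for _ in range(in_degree(rng))]
    rng.shuffle(out_slots)
    rng.shuffle(in_slots)
    # zip stops at the shorter list, so surplus slots on the other side are
    # dropped: that side's degree distribution is only approximated.
    return list(zip(out_slots, in_slots))


rng = random.Random(7)
citations = [f"citation{i}" for i in range(100)]
authors = [f"author{i}" for i in range(200)]
edges = generate_edges(
    citations, authors,
    out_degree=lambda r: max(0, round(r.gauss(3, 1))),   # Gaussian, mu=3, sigma=1
    in_degree=lambda r: r.randint(0, 3),                 # assumed uniform in-degree
    rng=rng,
)
```

This simple pairing may produce parallel edges between the same two nodes; a fuller generator would have to decide whether to keep or filter them.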

We adapted the scenarios of popular use cases into meaningful gMark configurations, while also adding new gMark features:

Bib: our default bibliographical use-case

LSN: LDBC social network benchmark

WD: WatDiv e-commerce benchmark

SP: SP2Bench DBLP benchmark

Scalability of gMark graph generation

       100K        1M           10M          100M
Bib    0m0.057s    0m0.638s     0m8.344s     1m28.725s
LSN    0m0.225s    0m1.451s     0m23.018s    3m11.318s
WD     0m2.163s    0m25.032s    4m10.988s    113m31.078s
SP     0m0.638s    0m7.048s     1m28.831s    15m23.542s

Graph generation times, with varying graph sizes (# nodes)

Generation time depends heavily on the density of instances (e.g., WD has 100x as many edges as Bib)
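To see how generation time grows with graph size, the durations in the table can be converted to seconds and compared column by column. The short Python sketch below does only this arithmetic; the helper name `to_seconds` is an assumption of the example. Each ratio tells by what factor the time grew when the number of nodes grew tenfold.

```python
import re


def to_seconds(stamp):
    """Convert a shell-style duration such as '1m28.725s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


# Generation times copied from the table (100K, 1M, 10M, 100M nodes).
times = {
    "Bib": ["0m0.057s", "0m0.638s", "0m8.344s", "1m28.725s"],
    "LSN": ["0m0.225s", "0m1.451s", "0m23.018s", "3m11.318s"],
    "WD": ["0m2.163s", "0m25.032s", "4m10.988s", "113m31.078s"],
    "SP": ["0m0.638s", "0m7.048s", "1m28.831s", "15m23.542s"],
}

growth = {}
for scenario, stamps in times.items():
    secs = [to_seconds(s) for s in stamps]
    # Ratio between consecutive columns, i.e. per tenfold increase in nodes.
    growth[scenario] = [b / a for a, b in zip(secs, secs[1:])]
    print(scenario, [round(g, 1) for g in growth[scenario]])
```

A ratio near 10 corresponds to time growing in proportion to the number of nodes.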

2 Query Generation

gMark query generation

A query language for graphs

UCRPQ: Unions of Conjunctions of Regular Path Queries

– Core constructs of the W3C’s SPARQL 1.1, Oracle’s PGQL, and Neo4j’s openCypher

– Well understood theoretical properties (e.g., polynomial data complexity)

UCRPQ includes recursive queries (via the Kleene star *), with applications in social networks, bioinformatics, etc.

gMark generates UCRPQ → the first synthetic workload generator to support recursive queries (and their translation into concrete syntaxes).

Example of UCRPQ

for each researcher, select all of the biological entities (i.e., genes and organisms) relevant to proteins studied in papers authored by people in the researcher’s coauthorship network

$(?x, ?z) \leftarrow (?x, (a^{-} \cdot a)^{*}, ?y), (?y, (a^{-} \cdot r^{-} \cdot e + a^{-} \cdot r^{-} \cdot o), ?z)$

(a=authoredBy, r=referencedBy, e=encodedOn, o=occursIn)

#rules: 1
#conjuncts: 2
#disjuncts: 1, 2
path length: 2, 3, 3
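The semantics of this query can be made concrete by evaluating it over a tiny graph with relational operators: inverse, concatenation (composition), union, and Kleene star (reflexive-transitive closure). The Python sketch below is a naive evaluator written for illustration; the toy graph and all helper names are assumptions of the example, and edge directions follow the schema (citation --a--> author, protein --r--> citation, protein --e--> gene, protein --o--> organism).

```python
def edges(graph, label):
    """All (source, target) pairs connected by an edge with this label."""
    return {(s, t) for s, p, t in graph if p == label}


def inverse(rel):
    return {(y, x) for x, y in rel}


def compose(first, second):
    """Relational composition: follow `first`, then `second`."""
    successors = {}
    for y, z in second:
        successors.setdefault(y, set()).add(z)
    return {(x, z) for x, y in first for z in successors.get(y, ())}


def star(rel, nodes):
    """Reflexive-transitive closure of rel over the given nodes."""
    closure = {(n, n) for n in nodes}
    frontier = set(closure)
    while frontier:
        frontier = compose(frontier, rel) - closure
        closure |= frontier
    return closure


# Hypothetical toy instance; triples are (source, predicate, target).
graph = {
    ("paper1", "a", "alice"), ("paper1", "a", "bob"),
    ("paper2", "a", "bob"), ("paper2", "a", "carol"),
    ("paper3", "a", "dave"),
    ("prot1", "r", "paper2"),
    ("prot1", "e", "gene1"), ("prot1", "o", "org1"),
}
nodes = {s for s, _, _ in graph} | {t for _, _, t in graph}
a, r, e, o = (edges(graph, label) for label in "areo")

# First conjunct: (a^- . a)*, the coauthorship network.
coauthors = star(compose(inverse(a), a), nodes)
# Second conjunct: a^- . r^- . e  +  a^- . r^- . o
prefix = compose(inverse(a), inverse(r))
bio = compose(prefix, e) | compose(prefix, o)
# Join the two conjuncts on ?y and project onto (?x, ?z).
result = compose(coauthors, bio)
```

In this toy graph alice, bob, and carol form one coauthorship network, so each of them is paired with gene1 and org1, while dave has no answers.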

Schema-driven workload generation

The user can specify in the query workload configuration:

• Size: #queries, #conjuncts/#disjuncts/path length per query

• Selectivity: constant, linear, quadratic.

• Recursion: probability to generate Kleene star above a conjunct.

• Shape: chain, star, cycle, star-chain.

• Arity: arbitrary (including 0, i.e., Boolean).

The graph configuration is also input to the query generator.
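To make the shape and recursion parameters more tangible, here is a simplified Python sketch. It is an assumption made for illustration and not gMark's generator: `conjunct_variables` lays out the variables of a chain- or star-shaped query, and `random_conjunct` builds one path expression and wraps it in a Kleene star with the configured probability. The cycle and star-chain shapes, selectivity control, and schema awareness are deliberately left out.

```python
import random


def conjunct_variables(shape, n):
    """Variable pairs (source, target) for n conjuncts of the given shape."""
    if shape == "chain":
        # ?x0 -> ?x1 -> ... -> ?xn
        return [(f"?x{i}", f"?x{i + 1}") for i in range(n)]
    if shape == "star":
        # every conjunct starts from the same center ?x0
        return [("?x0", f"?x{i + 1}") for i in range(n)]
    raise ValueError(f"shape not covered by this sketch: {shape}")


def random_conjunct(rng, predicates, max_length, p_star):
    """One path expression; starred with probability p_star."""
    length = rng.randint(1, max_length)
    path = "·".join(rng.choice(predicates) for _ in range(length))
    return f"({path})*" if rng.random() < p_star else path


rng = random.Random(3)
predicates = ["a", "a-", "r", "r-", "e", "o"]   # "-" marks an inverse edge
body = [(src, random_conjunct(rng, predicates, 3, 0.5), trg)
        for src, trg in conjunct_variables("chain", 2)]
```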

Selectivity estimation quality of gMark

• Given a binary query Q and a graph G, we assume that $|Q(G)| = O(\beta \times |\mathit{nodes}(G)|^{\alpha})$.

• $\alpha$ is the selectivity value (0: constant, 1: linear, 2: quadratic).

• Assigning selectivities required us to develop a selectivity algebra for instance-independent reasoning over query behavior.

• Experiments confirmed the assumption and the estimation quality.

       Constant       Linear         Quadratic
LSN    0.200±0.417    1.189±0.261    2.032±0.059
Bib    0.003±0.010    0.921±0.122    1.405±0.337
WD     0.016±0.044    1.427±0.392    2.004±0.022
SP     0.074±0.130    1.064±0.034    2.034±0.295
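One plausible way to measure the exponent empirically, given result counts of the same query on graphs of increasing size, is a least-squares fit on a log-log scale: if |Q(G)| grows like β × |nodes(G)|^α, then log |Q(G)| is linear in log |nodes(G)| with slope α. The Python sketch below shows this fit; it is an assumption about how such values could be obtained, not a description of the procedure behind the table, and the sample counts are made up.

```python
import math


def estimate_alpha(node_counts, result_counts):
    """Least-squares slope of log|Q(G)| against log|nodes(G)|."""
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(max(r, 1)) for r in result_counts]   # guard against empty results
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


sizes = [2000, 4000, 8000, 16000]
alpha_constant = estimate_alpha(sizes, [7, 7, 7, 7])                     # made-up counts
alpha_linear = estimate_alpha(sizes, [3 * n for n in sizes])             # made-up counts
alpha_quadratic = estimate_alpha(sizes, [0.5 * n * n for n in sizes])    # made-up counts
```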

gMark query translator

Query translation

UCRPQ: $(?x, ?z) \leftarrow (?x, (a^{-} \cdot a)^{*}, ?y), (?y, (a^{-} \cdot r^{-} \cdot e + a^{-} \cdot r^{-} \cdot o), ?z)$

SPARQL:

PREFIX : <http://example.org/gmark/>
SELECT DISTINCT ?x ?z
WHERE { ?x (^:a/:a)* ?y .
        ?y ((^:a/^:r/:e)|(^:a/^:r/:o)) ?z . }

openCypher*:

MATCH (x)<-[:a]-()-[:a]->(y),
      (y)<-[:a]-()<-[:r]-()-[:e]->(z)
RETURN DISTINCT x, z
UNION
MATCH (x)<-[:a]-()-[:a]->(y),
      (y)<-[:a]-()<-[:r]-()-[:o]->(z)
RETURN DISTINCT x, z;

Datalog:

g0(x,y)<- edge(x1,a,x0),edge(x1,a,x2),
          x=x0,y=x2.
g0(x,y)<- g0(x,z),g0(z,y).
g1(x,y)<- edge(x1,a,x0),edge(x2,r,x1),
          edge(x2,e,x3),x=x0,y=x3.
g1(x,y)<- edge(x1,a,x0),edge(x2,r,x1),
          edge(x2,o,x3),x=x0,y=x3.
query(x,z)<- g0(x,y),g1(y,z).

SQL:

WITH RECURSIVE c0(src, trg) AS (
SELECT edge.src, edge.src FROM edge
UNION
SELECT edge.trg, edge.trg FROM edge
UNION
SELECT s0.src, s0.trg
FROM (SELECT trg as src, src as trg,

* openCypher disallows Kleene star above concatenation or inverses.
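The SPARQL translation follows the structure of the path expression closely: an inverse becomes `^`, concatenation becomes `/`, union becomes `|`, and the Kleene star stays `*`. The Python sketch below shows such a recursive translation for SPARQL 1.1 property paths. The tuple-based expression encoding and the function name `to_sparql_path` are assumptions of this example, not the interface of gMark's translator, which reads the workload from XML.

```python
def to_sparql_path(expr):
    """Translate a regular path expression into a SPARQL 1.1 property path.

    Expressions are nested tuples:
      ("label", "a"), ("inv", e), ("concat", [e1, ...]),
      ("union", [e1, ...]), ("star", e)
    """
    kind = expr[0]
    if kind == "label":
        return f":{expr[1]}"
    if kind == "inv":
        return f"^{to_sparql_path(expr[1])}"
    if kind == "concat":
        return "(" + "/".join(to_sparql_path(e) for e in expr[1]) + ")"
    if kind == "union":
        return "(" + "|".join(to_sparql_path(e) for e in expr[1]) + ")"
    if kind == "star":
        return f"{to_sparql_path(expr[1])}*"
    raise ValueError(f"unknown expression kind: {kind}")


def label(name):
    return ("label", name)


def inv(name):
    return ("inv", label(name))


# The two conjuncts of the example query.
first = ("star", ("concat", [inv("a"), label("a")]))
second = ("union", [
    ("concat", [inv("a"), inv("r"), label("e")]),
    ("concat", [inv("a"), inv("r"), label("o")]),
])

query = (
    "PREFIX : <http://example.org/gmark/>\n"
    "SELECT DISTINCT ?x ?z\n"
    f"WHERE {{ ?x {to_sparql_path(first)} ?y .\n"
    f"        ?y {to_sparql_path(second)} ?z . }}"
)
```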

Scalability of gMark workload generation

On a laptop, gMark generates workloads of one thousand queries for Bib in ∼0.3s, for LSN and SP in ∼1.5s, and for the richer WD scenario in ∼10s.

Query translation of the thousand queries into all four supported syntaxes for each of the four scenarios requires ∼0.1s.

3 Scalability Study of Current Graph Databases

State-of-the-art graph DBMSs

We studied query evaluation performance of four mainstream graph DBMSs:

P: PostgreSQL (SQL:1999 recursive views)

S: a popular SPARQL query engine (SPARQL 1.1)

G: a native graph database (openCypher)

D: a modern Datalog engine (Datalog)

Scalability on non-recursive query workloads

Query execution times for diverse graph sizes and query workloads:

– Len (varying path lengths, 1 disjunct, 1 conjunct)

– Dis (multiple disjuncts, 1 conjunct)

– Con (multiple conjuncts and disjuncts)

[Figure: three bar charts, one each for constant, linear, and quadratic queries. x-axis: scenario / system (Con, Dis, and Len, each for D, G, S, and P); y-axis: time (seconds, log scale); one bar per graph size (2K, 4K, 8K, 16K).]

Scalability on recursive query workloads

Query execution times for simple recursive queries on various small graph sizes (from 2K to 32K nodes):

[Figure: execution time (ms, log scale) against result count (0 to 5 × 10^4) for the Datalog system, the SPARQL system, PostgreSQL, and the graph system.]

4 Evolving Graph Generation

Motivation

Graphs naturally evolve over time, e.g.,

Nodes and edges have properties whose values change among consecutive snapshots

Nodes and edges may exist only during specific time intervals

Idea: use gMark to generate schema-driven graphs and enrich them with time-evolving properties

gMark + time-evolving properties = EGG¹

¹ Open-source: https://github.com/karimalami7/EGG

EGG: Evolving Graph Generator

Inputs:

• Static graph configuration: size, node and edge types, occurrence constraints, degree distributions

• Evolving graph configuration: # of snapshots, evolving properties (nodes and edges), evolution constraints

Components: gMark (static graph generator), EGG (evolving graph generator)

Output: RDF annotated with temporal information

Example

Parameter               Description
Size                    e.g., 10M
Node types              e.g., city, hotel
Edge predicates         e.g., train, contains
Schema constraints      e.g., 10% of all nodes are cities
Degree distributions    e.g., the # of hotels in a city follows a Zipfian distribution

Evolving properties:

city: weather, qAir

hotel: star, availableRooms, hotelPrice

train: trainPrice

Each graph snapshot corresponds to a day.

Type: city

• weather: unordered qualitative, has three possible values {sunny, cloudy, rainy}; successors of sunny: sunny and cloudy.

• qAir: ordered qualitative, has ten possible values from 1 to 10; can increment or decrement by 1 between two consecutive snapshots.

Type: hotel

• star: ordered qualitative, has values from 1 to 5; it changes every 365 snapshots with 1% probability, by one position at most

• availableRooms: discrete quantitative, has values in [1,100]; the offset is set to [-15,15]

• hotelPrice: continuous quantitative, dependent on star for domain and on availableRooms for evolution, e.g., for node x of type hotel:
  if star(x)=3, then hotelPrice(x) ∈ [50,100]
  if availableRooms(x) ↑, then hotelPrice(x) ↓
  if availableRooms(x) ↓, then hotelPrice(x) ↑.
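The evolution rules above can be simulated in a few lines. The Python sketch below is an illustrative reading of three of them and not EGG's implementation: qAir moves by at most one step within 1..10 (staying unchanged is assumed to be allowed), availableRooms changes by an offset drawn from [-15,15] within [1,100], and hotelPrice moves against availableRooms while staying in [50,100], the domain given for a 3-star hotel. The function names, the starting values, and the maximum price step of 5 are assumptions of the example.

```python
import random


def evolve_qair(value, rng):
    """Ordered qualitative in 1..10: move by at most one step per snapshot."""
    return min(10, max(1, value + rng.choice([-1, 0, 1])))


def evolve_rooms(value, rng):
    """Discrete quantitative in [1,100] with an offset drawn from [-15,15]."""
    return min(100, max(1, value + rng.randint(-15, 15)))


def evolve_price(price, rooms_before, rooms_after, rng, low=50.0, high=100.0):
    """Price moves against availability and stays inside [low, high]."""
    step = rng.uniform(0.0, 5.0)        # assumed maximum step size
    if rooms_after > rooms_before:
        price -= step
    elif rooms_after < rooms_before:
        price += step
    return min(high, max(low, price))


rng = random.Random(42)
qair, rooms, price = [5], [50], [75.0]  # assumed initial snapshot
for _ in range(30):                     # one snapshot per day
    qair.append(evolve_qair(qair[-1], rng))
    rooms.append(evolve_rooms(rooms[-1], rng))
    price.append(evolve_price(price[-1], rooms[-2], rooms[-1], rng))
```

Each list then holds 31 values, one per snapshot, which can be plotted against time to check that the constraints hold.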

Summary of EGG contributions

Linear-time generation algorithm

[Figure: two plots of generation time in seconds for the dblp, social, and trip use cases: against the # of graph nodes (10^2 to 10^7, with the # of graph snapshots set to 100), and against the # of graph snapshots (10^1 to 10^3, with the # of graph nodes set to 100000).]

Visualization module to emphasize the accuracy of EGG

[Figure: values over time (0 to 30) of the properties availableRooms, hotelPrice, and star and of the validity of node 45 of type hotel; of the property availableRooms of hotel; and of the properties qAir and weather (sunny, rainy, cloudy) and of the validity of node 4 of type city.]

Storage format based on RDF named graphs to decouple static and evolving parts of the graphs, e.g.,

ns1:G31 { <hotel:27> ns2:hasProperty <Property:availableRooms> }
ns1:snapshot9 { ns1:G31 ns3:value "57" }
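Reading the example as follows, the decoupling becomes clear: one named graph (here `ns1:G31`) holds the static statement that hotel 27 has the property availableRooms, and each snapshot graph only attaches the current value to that graph name, so the static statement is written once regardless of the number of snapshots. The Python sketch below emits lines in the notation of the example; the helper names and the values for snapshots 8 and 10 are assumptions made for illustration, and EGG's actual serializer may differ.

```python
def static_part(graph_id, node, prop):
    """Named graph stating, once, that a node carries an evolving property."""
    return f"ns1:G{graph_id} {{ <{node}> ns2:hasProperty <Property:{prop}> }}"


def evolving_part(snapshot, graph_id, value):
    """Snapshot graph attaching the value valid in that snapshot."""
    return f'ns1:snapshot{snapshot} {{ ns1:G{graph_id} ns3:value "{value}" }}'


def serialize(graph_id, node, prop, values_by_snapshot):
    """One static line followed by one line per snapshot value."""
    lines = [static_part(graph_id, node, prop)]
    lines += [evolving_part(snapshot, graph_id, value)
              for snapshot, value in sorted(values_by_snapshot.items())]
    return lines


# Snapshot 9 matches the example above; snapshots 8 and 10 are made up.
lines = serialize(31, "hotel:27", "availableRooms", {8: 61, 9: 57, 10: 49})
print("\n".join(lines))
```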

Evaluation of historical reachability queries¹ on top of EGG:

– A baseline implementation in SPARQL on top of Apache Jena

– Disjunctive-BFS: dynamic programming approach¹

[Figure: "Historical Reachability Queries: Disjunctive-BFS vs SPARQL", graph of size 100K nodes, 500K edges; fixed query size = 10. Time (in seconds) for 10 snapshots (interval=[0,9]), 100 snapshots (interval=[45,54]), and 1000 snapshots (interval=[495,504]); the legend includes a SPARQL 'out of memory' exception. Bar labels as extracted: 60, 658, 76, 111, 165.]

[Figure: "Historical Reachability Queries: Disjunctive-BFS vs SPARQL", graph of size 100K nodes, 500K edges; fixed # of snapshots = 100. Time (in seconds) for interval=[50,50], interval=[45,54], interval=[25,74], and interval=[0,99]. Bar labels as extracted: 633, 634, 633, 635 and 47, 44, 32, 33.]

¹ K. Semertzidis, K. Lillis, E. Pitoura. TimeReach: Reachability Queries on Evolving Graphs. EDBT’15.
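For orientation, a disjunctive historical reachability query asks whether a target node is reachable from a source node in at least one snapshot of a given time interval. The Python sketch below is a naive baseline that runs an ordinary breadth-first search on each snapshot separately. It is an assumption made for illustration and is neither the Disjunctive-BFS dynamic programming approach nor the SPARQL baseline evaluated on the slide; the toy snapshots are made up.

```python
from collections import deque


def reachable(edges, source, target):
    """Plain BFS on one snapshot given as a set of directed edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def disjunctive_reachable(snapshots, source, target, start, end):
    """True if target is reachable from source in some snapshot of [start, end]."""
    return any(reachable(snapshots[t], source, target)
               for t in range(start, end + 1))


# Hypothetical evolving graph with three snapshots.
snapshots = [
    {("a", "b")},
    {("a", "b"), ("b", "c")},
    {("b", "c")},
]
answer = disjunctive_reachable(snapshots, "a", "c", 0, 2)
```

The cost of this baseline grows with the length of the interval, since every snapshot is traversed independently.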

Conclusions

gMark¹

– schema-driven graph and query-workload generator

– finely controlled query workload-centered approach, featuring instance-independent selectivity estimation

– translation to SPARQL, openCypher, SQL, Datalog

– discovery of the poor performance of existing graph DBMSs on evaluating a basic class of graph queries, i.e., regular path queries

EGG²

– evolving graph generator extending the gMark graphs with properties that evolve over time

– storage format using RDF named graphs to reduce redundancy

– easy to use to empirically evaluate evolving graph processing systems

¹ https://github.com/graphMark/gmark

² https://github.com/karimalami7/EGG

gMark & EGG papers

Bagan, Bonifati, Ciucanu, Fletcher, Lemay, Advokaat. gMark: Schema-Driven Generation of Graphs and Queries. TKDE’17 full paper, ICDE’17 extended abstract

Bagan, Bonifati, Ciucanu, Fletcher, Lemay, Advokaat. Generating Flexible Workloads for Graph Databases. VLDB’16 demo

Alami, Ciucanu, Mephu Nguifo. EGG: A Framework for Generating Evolving RDF Graphs. ISWC’17 demo

Alami, Ciucanu, Mephu Nguifo. Synthetic Graph Generation from Finely-Tuned Temporal Constraints. TD-LSG @ PKDD/ECML’17
