HAL Id: hal-01657144
https://hal.inria.fr/hal-01657144
Submitted on 6 Dec 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Data Discovery in RDF Graphs
Ioana Manolescu
To cite this version:
Ioana Manolescu. Data Discovery in RDF Graphs. DEXA 2017 - 28th International Conference on Database and Expert System Applications, Aug 2017, Lyon, France. pp.1-63. hal-01657144
Data Discovery in RDF Graphs
Ioana Manolescu
INRIA and Ecole Polytechnique, [email protected]
http://pages.saclay.inria.fr/Ioana.Manolescu
DEXA Conference, Aug 29, 2017
Ioana Manolescu Discovering RDF Graphs DEXA, Aug 2017 1 / 54
Outline
1 Background: semantic RDF graphs
2 Summarizing semantic-rich RDF graphs [CGM15a, CGM15b, CGM17a]
Joint work with Sejla Cebiric (Inria) and Francois Goasdoue (U. Rennes 1 and Inria)
3 Finding insights in RDF graphs [DMS17]
Joint work with Yanlei Diao and Shu Shang (Ecole Polytechnique and Inria)
RDF
Big Data needs semantics
AI Magazine, Spring 2015
Do we really need the semantics?
Yes. All the time.
Application knowledge / constraints:
Every Senator is an ElectedOfficial which is a Person
(On Wikipedia) being BornInAPlace means being a Person
Without the semantics, we may miss query answers
Data | Constraints | Query
John is a Senator | Every Senator is a Person | Who is a Person?
Semantic constraints are a compact way of encoding information
“Every ElectedOfficial is a Person” is stated only once, even if there are thousands of ElectedOfficials.
Semantics for Web data
Data and metadata on the Web are often structured in graphs, e.g., RDF (W3C’s Resource Description Framework)
Famous application: the Linked Open Data cloud (2017)
The Resource Description Framework (RDF)
RDF graph: set of triples
Assertion | Triple | Relational notation | Intuition
Class | s rdf:type o | o(s) | “s is an o”
Property | s p o | p(s, o) | “The p of s is o”
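The table above can be made concrete with a minimal sketch in which an RDF graph is a Python set of (s, p, o) tuples. The `ex:` identifiers and the `relational` helper are hypothetical names introduced only for illustration; they are not part of the talk.

```python
# A minimal sketch: an RDF graph as a set of (subject, property, object) triples.
# All identifiers below (ex:doi1, ex:Book, ...) are illustrative assumptions.
RDF_TYPE = "rdf:type"

graph = {
    ("ex:doi1", RDF_TYPE, "ex:Book"),          # class assertion
    ("ex:doi1", "ex:hasTitle", '"El Aleph"'),  # property assertion
}

def relational(triple):
    """Render a triple in the relational notation of the table."""
    s, p, o = triple
    if p == RDF_TYPE:
        return f"{o}({s})"      # class assertion: o(s)
    return f"{p}({s}, {o})"     # property assertion: p(s, o)

facts = sorted(relational(t) for t in graph)
```

Representing the graph as a set mirrors the definition "RDF graph: set of triples": duplicates collapse and triple order carries no meaning.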
[Figure: sample RDF graph. Nodes: doi1, Book, “El Aleph”, _:b1, “J. L. Borges”, “1949”. Edge labels: rdf:type, hasTitle, writtenBy, hasName, publishedIn. Legend: class, resource (URI), blank node, “literal (string)”, property.]
RDF RDFS
RDF Schema (RDFS)
Declare deductive constraints between classes and properties
Constraint | Triple | OWA interpretation
Subclass | c1 rdfs:subClassOf c2 | $c_1 \subseteq c_2$
Subproperty | p1 rdfs:subPropertyOf p2 | $p_1 \subseteq p_2$
Domain typing | p rdfs:domain c | $\Pi_{domain}(p) \subseteq c$
Range typing | p rdfs:range c | $\Pi_{range}(p) \subseteq c$
[Figure: sample RDFS schema. Nodes: Book, Publication, Person, writtenBy, hasAuthor. Edge labels: rdfs:subClassOf, rdfs:domain, rdfs:range, rdfs:subPropertyOf.]
“Any c1 is also a c2”
“If two resources are related by p1, they are also related by p2”
“Anyone having p is a c”
“Anyone who is a value of p is a c”
RDF RDF entailment
Open-world assumption and RDF entailment
RDF data model based on the open-world assumption.
Deductive constraints lead to implicit triples: part of the graph even though not explicitly present
explicit triples + entailment rules $\rightarrow$ implicit triples
Exhaustive application of entailment leads to saturation (closure)
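The semantics of an RDF graph G is its saturation $G^\infty$
Sample instance entailment rules from schema and instance triples:
$c_1\ \text{rdfs:subClassOf}\ c_2 \;\wedge\; s\ \text{rdf:type}\ c_1 \;\vdash_{RDF}\; s\ \text{rdf:type}\ c_2$
$p_1\ \text{rdfs:subPropertyOf}\ p_2 \;\wedge\; s\ p_1\ o \;\vdash_{RDF}\; s\ p_2\ o$
$p\ \text{rdfs:domain}\ c \;\wedge\; s\ p\ o \;\vdash_{RDF}\; s\ \text{rdf:type}\ c$
$p\ \text{rdfs:range}\ c \;\wedge\; s\ p\ o \;\vdash_{RDF}\; o\ \text{rdf:type}\ c$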
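The four sample rules can be applied to a fixpoint to obtain the saturation. The following is a minimal sketch under stated assumptions: it implements only these four instance rules (not the full RDFS rule set), the graph is a set of string triples, and the data in the usage lines is hypothetical.

```python
TYPE = "rdf:type"
SC, SP = "rdfs:subClassOf", "rdfs:subPropertyOf"
DOM, RNG = "rdfs:domain", "rdfs:range"

def saturate(graph):
    """Apply the four sample instance entailment rules until no new triple appears."""
    g = set(graph)
    while True:
        sub_class = {(s, o) for (s, p, o) in g if p == SC}
        sub_prop = {(s, o) for (s, p, o) in g if p == SP}
        domain = {(s, o) for (s, p, o) in g if p == DOM}
        rng = {(s, o) for (s, p, o) in g if p == RNG}
        new = set()
        for (s, p, o) in g:
            if p == TYPE:  # subclass rule
                new |= {(s, TYPE, c2) for (c1, c2) in sub_class if c1 == o}
            new |= {(s, p2, o) for (p1, p2) in sub_prop if p1 == p}  # subproperty rule
            new |= {(s, TYPE, c) for (q, c) in domain if q == p}     # domain rule
            new |= {(o, TYPE, c) for (q, c) in rng if q == p}        # range rule
        if new <= g:       # fixpoint reached: nothing new was derived
            return g
        g |= new

# Hypothetical data echoing the Senator example from the introduction.
g = {
    ("John", TYPE, "Senator"),
    ("Senator", SC, "ElectedOfficial"),
    ("ElectedOfficial", SC, "Person"),
    ("doi1", "writtenBy", "b1"),
    ("writtenBy", SP, "hasAuthor"),
    ("hasAuthor", RNG, "Person"),
}
closure = saturate(g)
```

On this input the query "Who is a Person?" finds John only in `closure`, not in `g`, which is the point made earlier about missing answers without the semantics.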
[Figure: the sample RDF graph (doi1, Book, “El Aleph”, _:b1, “J. L. Borges”, “1949”) shown with its schema (Publication, Person, writtenBy, hasAuthor; rdfs:subClassOf, rdfs:domain, rdfs:range, rdfs:subPropertyOf edges) and with the implicit edges added by saturation, labeled rdf:type, hasAuthor and rdfs:domain.]
RDF Discovery
RDF graph discovery
An RDF graph can be large and complex, lack a fixed schema, include many heterogeneous values...
Two approaches:
1 RDF summarization: compactly representing the explicit and implicit structure of a graph
2 Insight discovery in RDF graphs: automatically identify aggregation queries with interesting results
Concepts
RDF summaries
Problem
RDF graph G is large, heterogeneous, partially implicit.
How to compactly represent all its structure?
Existing solutions
Partial representation (frequent patterns, statistics etc.), e.g., [NM11, LYL13]
Potentially not compact, e.g., [GW97, CFKP15]
Only for explicit data, e.g., [CDT13, ZDYZ14]
A summary of DBLP data
150M triples
A summary of geographic data
French territory division in regions, departments, urban areas, cities, districts etc.
368K triples
RDF summaries
We define:
1 RDF node equivalence relation $\equiv$: an equivalence relation such that class and property nodes are only equivalent to themselves
2 RDF summary $G/_{\equiv}$ of an RDF graph G: the quotient of G through $\equiv$
Recall: quotient of a directed graph G by $\equiv$, where G = (V, E) and $\equiv$ is an equivalence relation on V
$G/_{\equiv}$ nodes: one for each $\equiv$ equivalence class of V
$G/_{\equiv}$ edges: $n_{1/\equiv} \xrightarrow{a} n_{2/\equiv}$ iff $\exists\, n_1 \xrightarrow{a} n_2 \in G$ such that $n_1$ is represented by $n_{1/\equiv}$ and $n_2$ is represented by $n_{2/\equiv}$
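The quotient construction recalled above fits in a few lines. This is a minimal sketch, assuming the equivalence relation is supplied as a dictionary `rep` mapping each node to the representative of its equivalence class; the node names and the function name `quotient` are hypothetical.

```python
def quotient(edges, rep):
    """Quotient of a labeled directed graph.

    edges: set of (source, label, target) triples
    rep:   dict mapping every node to the representative of its class
    An edge rep[s] -a-> rep[t] exists iff some edge s -a-> t exists in the input.
    """
    return {(rep[s], a, rep[t]) for (s, a, t) in edges}

# Hypothetical graph: u1 and u2 are declared equivalent, as are v1 and v2.
edges = {("u1", "p", "v1"), ("u2", "p", "v2"), ("u2", "q", "w")}
rep = {"u1": "U", "u2": "U", "v1": "V", "v2": "V", "w": "W"}
summary = quotient(edges, rep)
```

Because the result is a set, parallel edges with the same label between the same two classes collapse into one, so the quotient never has more edges than the input.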
Why do we need a special RDF equivalence?
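Why not use any node equivalence? E.g., forward and backward bisimilarity $\sim_{fb}$ [HHK95]
Sample graph G and its quotient through $\sim_{fb}$
[Figure: sample graph G with schema triples $p_1 \prec_{sp} p_2$ and $p_3 \prec_{sp} p_4$, class nodes A and B, and data nodes u1 to u6 connected by $\tau$ (rdf:type), p1, p2, p3 and p4 edges; the edge labels visible in its $\sim_{fb}$ quotient are $\prec_{sp}$, p1, $\tau$ and p3.]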
Loss of class and (some) property names
Loss of schema triples
Loss of implicit triples
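Quotient of the same graph through the RDF node equivalence $\equiv_{fb}$
[Figure: the sample graph G and its quotient through $\sim_{fb}$, next to the quotient of the same graph through $\equiv_{fb}$. The $\equiv_{fb}$ quotient keeps the schema triples $p_1 \prec_{sp} p_2$ and $p_3 \prec_{sp} p_4$, the class nodes A and B with their $\tau$ edges, and edges labeled p1, p2, p3 and p4.]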
Properties
Formal summary properties
For any RDF equivalence relation $\equiv$:
Size limit: The summary is at most as large as the graph.
Schema preservation: The schema of $G/_{\equiv}$ is the schema of G.
Representativeness: Any conjunctive query q with answers on G also has answers on its summary: $q(G^\infty) \neq \emptyset \Rightarrow q((G/_{\equiv})^\infty) \neq \emptyset$
This enables query pruning (for query answering) without saturating G
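Query pruning uses the contrapositive of representativeness: if q has no answer on the saturated summary, it has none on $G^\infty$ and need not be evaluated. The sketch below is only an illustration under assumptions: a naive backtracking matcher for conjunctive queries, terms starting with `?` treated as variables, a hypothetical summary already saturated, and queries whose constants are property or class names (my reading of the setting, not something stated on the slide).

```python
def has_answer(query, graph, binding=None):
    """True iff the conjunctive query (a list of triple patterns) matches in graph."""
    binding = binding or {}
    if not query:
        return True
    first, rest = query[0], query[1:]
    for triple in graph:
        b = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if b.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            if has_answer(rest, graph, b):
                return True
    return False

def can_prune(query, saturated_summary):
    """No answer on the saturated summary implies no answer on the saturated graph."""
    return not has_answer(query, saturated_summary)

# Hypothetical saturated summary with summary nodes N1, N2, N3.
summary = {("N1", "author", "N2"), ("N1", "title", "N3"), ("N1", "rdf:type", "Book")}
q_join = [("?x", "author", "?y"), ("?x", "title", "?z")]
q_editor = [("?x", "editor", "?y")]
```

Here `can_prune(q_editor, summary)` holds, so the query can be skipped, while `q_join` matches the summary and must still be evaluated on the graph: a match on the summary does not guarantee answers on G.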
Which equivalence relations to use?
Equivalence notions previously studied
Forward / backward / forward and backward simulation
Forward / backward / forward and backward bisimulation
Adapted to semantic RDF graphs
Novel equivalence notions we introduce (see next)
Flexible similarity suited to heterogeneous graphs
Based on property cliques and possibly on RDF types
Clique-based summaries
RDF node equivalence based on property cliques
Intuition: a1, a2 are similar; r1, r2, r3, r4, r5 are similar
[Figure: sample RDF graph G. Resources r1 to r6; values a1, a2 (author), t1 to t4 (title), e1, e2 (editor), c1 (comment); $\tau$ edges to the classes Book, Journal and Spec; published and reviewed edges also appear.]
Output property cliques: {a, t, e, c}; {r}; {p}; $\emptyset$
Input property cliques: {a}; {t}; {e}; {c}; {r, p}; $\emptyset$
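One way to compute such cliques is a union-find over property names: two properties fall into the same output clique when some node is the subject of both (transitively), and into the same input clique when some node is the object of both. The sketch below rests on assumptions: the triples are hypothetical data loosely modeled on the sample graph, rdf:type triples are left out, and the empty clique is not materialized.

```python
from collections import defaultdict

def cliques(triples, position):
    """position=0 gives output cliques (shared subject); position=2 gives input cliques."""
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    props_of = defaultdict(set)
    for t in triples:
        props_of[t[position]].add(t[1])
        find(t[1])  # register the property
    for props in props_of.values():
        first, *others = sorted(props)
        for q in others:
            parent[find(q)] = find(first)  # properties co-occurring on a node are merged
    groups = defaultdict(set)
    for p in list(parent):
        groups[find(p)].add(p)
    return {frozenset(g) for g in groups.values()}

# Hypothetical data (edges chosen for illustration, not read off the figure).
data = {
    ("r1", "author", "a1"), ("r1", "title", "t1"),
    ("r2", "title", "t2"), ("r2", "editor", "e1"),
    ("r3", "editor", "e2"), ("r3", "comment", "c1"),
    ("x1", "reviewed", "r4"), ("x2", "published", "r4"),
}
output_cliques = cliques(data, 0)
input_cliques = cliques(data, 2)
```

On this data author, title, editor and comment end up in one output clique through the chain r1, r2, r3, while reviewed and published share only a target and so form a single input clique.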
Weak clique-based summaries
Two nodes are weakly equivalent ($\equiv_W$) iff they have the same input clique or the same output clique.
Weak summary $G/_{\equiv_W}$ of the sample RDF graph G:
[Figure: weak summary with edges labeled a, t, e, c, r and p, and $\tau$ edges to Book, Journal and Spec.]
Property: In $G/_{\equiv_W}$, each data property appears exactly once $\Rightarrow$ its nodes are “source of p, target of p” for each p [CGM15b].
Detecting errors in the data: why do the birthplace and deathplace loop?
Looking in the data, we find:
<http://dbpedia.org/resource/Kunitomo_Ikkansai> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://dbpedia.org/resource/Kunitomo_Ikkansai> <http://dbpedia.org/ontology/birthPlace> <http://dbpedia.org/resource/Kunitomo_Ikkansai> .
Strong clique-based summaries
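Two nodes are strongly equivalent ($\equiv_S$) iff they have the same input clique and the same output clique.
Strong summary $G/_{\equiv_S}$ of the sample RDF graph G:
[Figure: strong summary with $\tau$ edges to Book, Journal and Spec; the properties a, t and e each label two edges, alongside c, r and p.]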
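The difference between weak and strong equivalence can be sketched once each node is labeled with its input and output clique. The code below is an illustration under assumptions: clique identifiers are given as hypothetical dictionaries (a missing entry stands for the empty clique), only non-empty cliques merge nodes in the weak case, nodes with both cliques empty are grouped together, and weak equivalence is closed transitively with a union-find.

```python
from collections import defaultdict

def find(parent, n):
    while parent[n] != n:
        parent[n] = parent[parent[n]]
        n = parent[n]
    return n

def strong_classes(nodes, in_clique, out_clique):
    """Group nodes having the same input clique AND the same output clique."""
    groups = defaultdict(set)
    for n in nodes:
        groups[(in_clique.get(n), out_clique.get(n))].add(n)
    return {frozenset(g) for g in groups.values()}

def weak_classes(nodes, in_clique, out_clique):
    """Group nodes sharing an input clique OR an output clique, transitively."""
    parent = {n: n for n in nodes}
    first_seen = {}
    for n in nodes:
        keys = []
        if in_clique.get(n) is not None:
            keys.append(("in", in_clique[n]))
        if out_clique.get(n) is not None:
            keys.append(("out", out_clique[n]))
        if not keys:
            keys.append(("empty", None))  # assumption: such nodes form one class
        for key in keys:
            if key in first_seen:
                parent[find(parent, n)] = find(parent, first_seen[key])
            else:
                first_seen[key] = n
    groups = defaultdict(set)
    for n in nodes:
        groups[find(parent, n)].add(n)
    return {frozenset(g) for g in groups.values()}

# Hypothetical clique labels.
nodes = ["r1", "r2", "r3", "a1", "x", "y"]
in_c = {"r2": "I1", "r3": "I1", "a1": "Ia"}
out_c = {"r1": "O1", "r2": "O1", "r3": "O2"}
weak = weak_classes(nodes, in_c, out_c)
strong = strong_classes(nodes, in_c, out_c)
```

In this example r1 and r3 share no clique directly, yet they are weakly equivalent through r2; under strong equivalence all three stay apart, which is why strong summaries are at least as large as weak ones.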
Using types for summarization
Group nodes first by their types; then group untyped nodes by their property cliques.
Typed weak summary $G/_{\equiv_{TW}}$ of the sample RDF graph G:
[Figure: typed weak summary with $\tau$ edges to Book, Journal and Spec, and edges labeled author, title, editor, comment, published and reviewed.]
On this example, this is also the typed strong summary $G/_{\equiv_{TS}}$.
RDF summaries outline
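Summary | Weak? | Strong? | FW bisim? | BW bisim? | Types first?
$G/_{\equiv_W}$ | X | | | |
$G/_{\equiv_S}$ | | X | | |
$G/_{\equiv_{TW}}$ | X | | | | X
$G/_{\equiv_{TS}}$ | | X | | | X
$G/_{\equiv_{fw}}$ | | | X | |
$G/_{\equiv_{bw}}$ | | | | X |
$G/_{\equiv_{fb}}$ | | | X | X |
$G/_{\equiv_{fw,T}}$ | | | X | | X
$G/_{\equiv_{bw,T}}$ | | | | X | X
$G/_{\equiv_{fb,T}}$ | | | X | X | X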
Relations between RDF summaries [CGM17b]
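[Figure: diagram relating G and its summaries G/fb, G/S, G/W, G/TS and G/TW; arrows labeled /fb, /S, /W, /TS and /TW indicate which summary is obtained by summarizing which graph.]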
Summary sizes
Summary size comparison (more in [CGM17b])
Graph G | |G| | Summary $G/_{\equiv}$ | $|G/_{\equiv}|$ | $cf_{\equiv}$
DBLP | 150,787,464 | G/W | 71 | 2,123,767
DBLP | 150,787,464 | G/S | 206 | 731,978
DBLP | 150,787,464 | G/fw | 262,695 | 574
LUBM1M | 1,227,868 | G/W | 161 | 7,579
LUBM1M | 1,227,868 | G/S | 207 | 5,903
LUBM1M | 1,227,868 | G/fw | 1,982 | 617
LUBM10M | 11,990,183 | G/W | 162 | 74,013
LUBM10M | 11,990,183 | G/S | 206 | 58,204
LUBM10M | 11,990,183 | G/fw | 24,958 | 480
LUBM10M | 11,990,183 | G/bw | 6,162 | 1,944
LUBM10M | 11,990,183 | G/fb | 11,990,076 | 1
Summarizing $G^\infty$
Recall: With an RDF Schema, the semantics of G is $G^\infty$ $\Rightarrow$ We really need $(G^\infty)/_{\equiv}$!
1 Saturate G, then summarize
2 Can we avoid saturating G?...
Shortcut theorem [CGM17a]
For the summaries G/W, G/S, G/fw, G/bw, G/fb:
$(G^\infty)/_{\equiv}$ is the same as $((G/_{\equiv})^\infty)/_{\equiv}$
Also: sufficient condition for any $\equiv$ to admit the shortcut.
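Shortcut toward the summary of $G^\infty$
Direct: G $\rightarrow$ sat. $\rightarrow$ $G^\infty$ $\rightarrow$ summ. $\rightarrow$ $(G^\infty)/_{\equiv}$
Shortcut: G $\rightarrow$ summ. $\rightarrow$ $G/_{\equiv}$ $\rightarrow$ sat. $\rightarrow$ $(G/_{\equiv})^\infty$ $\rightarrow$ summ. $\rightarrow$ $((G/_{\equiv})^\infty)/_{\equiv}$
If $G/_{\equiv}$ is much smaller than G, the shortcut may be faster!
Up to 20 times in our experiments [CGM17b]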
Shortcut example: G/W
[Figure: five panels. G, with nodes r1, r2, x, y1, y2, z, properties a, b1, b2, c and schema triples $b_1 \prec_{sp} b$, $b_2 \prec_{sp} b$; its weak summary G/W; the saturations $(G/W)^\infty$ and $G^\infty$, where edges labeled b appear; and the common result $(G^\infty)/W = ((G/W)^\infty)/W$.]
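Shortcut counter-example: G/TW
[Figure: five panels. G, with nodes r1, r2, x, y1, y2, properties a, b and the schema triple a $\hookleftarrow_d$ c (a rdfs:domain c); its typed weak summary G/TW; the saturation $(G/TW)^\infty = ((G/TW)^\infty)/TW$ and the saturation $G^\infty$, in which $\tau$ edges to c appear; and $(G^\infty)/TW$, shown as the result that the shortcut fails to produce.]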
Summary-enabled LOD cloud exploration
ILDA Inria team (E. Pietriga, H. Ozaygen)
Use the summary, instead of the original graph, to derive the visualisation (smaller, faster)
Part III
Finding insights in RDF graphs
Insight in an RDF graph
We consider an insight to be the result of an aggregation query over the RDF graph
We focus on one-dimensional aggregates $\Rightarrow$ 2D layout
An insight is interesting if a certain measure (e.g., variance) on its set of aggregation values is high
Problem: given a graph G, find the top-k insights
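The problem statement can be sketched as a ranking of one-dimensional aggregates by the variance of their values. This is a simplified illustration, not the Dagger implementation: facts are hypothetical Python dictionaries, the aggregation function is fixed to the mean, and population variance stands in for the interestingness measure.

```python
from collections import defaultdict
from statistics import mean, pvariance

def aggregate(facts, dimension, measure, agg=mean):
    """Group facts by the dimension and aggregate the measure within each group."""
    groups = defaultdict(list)
    for f in facts:
        if dimension in f and measure in f:
            groups[f[dimension]].append(f[measure])
    return {key: agg(values) for key, values in groups.items()}

def top_k_insights(facts, candidates, k):
    """Rank (dimension, measure) pairs by the variance of their aggregated values."""
    scored = []
    for dimension, measure in candidates:
        result = aggregate(facts, dimension, measure)
        if len(result) > 1:  # a single group has no variance to speak of
            score = pvariance(list(result.values()))
            scored.append((score, dimension, measure))
    scored.sort(reverse=True)
    return [(d, m) for _, d, m in scored[:k]]

# Hypothetical facts: publications with a year, an author count and a page count.
facts = [
    {"year": 2000, "authors": 2, "pages": 10},
    {"year": 2000, "authors": 2, "pages": 10},
    {"year": 2010, "authors": 6, "pages": 10},
]
best = top_k_insights(facts, [("year", "authors"), ("year", "pages")], k=1)
```

Here the average number of authors per year varies (2 versus 6) while the average page count does not, so the first pair is ranked highest.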
Dagger approach
Dagger: Digging for Interesting Aggregates in RDF Graphs [DMS17] (ongoing)
1. Candidate facts: Resources from G of a certain type, or having certain property sets
2. Candidate dimension: Properties of the candidate facts, with strong support and relatively few distinct values. Also: derived properties, e.g., authors count
3. Candidate measure: Another property of the candidate facts. Also: automatic value typing
4. Candidate aggregation function: Chosen depending on the measure type
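Step 2 above can be illustrated with a simple filter on support and number of distinct values. The thresholds `min_support` and `max_distinct`, the function name and the sample facts are all hypothetical; the slides do not say how Dagger quantifies "strong support" or "relatively few".

```python
def candidate_dimensions(facts, min_support=0.8, max_distinct=20):
    """Keep properties present on most facts and taking few distinct values.

    facts: list of dicts mapping property names to values (hypothetical encoding).
    """
    if not facts:
        return []
    properties = {p for f in facts for p in f}
    selected = []
    for p in sorted(properties):
        values = [f[p] for f in facts if p in f]
        support = len(values) / len(facts)   # fraction of facts having p
        distinct = len(set(values))          # number of distinct values of p
        if support >= min_support and distinct <= max_distinct:
            selected.append(p)
    return selected

# Hypothetical candidate facts.
books = [
    {"year": 2000, "title": "A", "isbn": "123"},
    {"year": 2000, "title": "B"},
    {"year": 2010, "title": "C"},
]
dims = candidate_dimensions(books, min_support=0.8, max_distinct=2)
```

With these settings only year qualifies: title has too many distinct values to group by, and isbn is present on too few facts.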
Dagger-selected aggregate in DBLP data
Average number of authors of journal articles, per publication year
Number of book authors, per book publication year
The number of books by each publisher (highest: Springer)
The need for RDF graph discovery tools
RDF graphs can be large and complex, and they lack a prescriptive schema
Semantic rules lead to implicit data
Toward helping users to discover RDF graphs:
1 Structural quotient summaries representing the complete graph structure; compact clique-based summaries; available at:
https://team.inria.fr/cedar/projects/rdfsummary/
2 Insight discovery: interesting aggregate queries; project Web page:
https://team.inria.fr/cedar/projects/dagger/
Many follow-up directions: parallelization, more interestingness measures, extensions to ML.
References
[CDT13] Stephane Campinas, Renaud Delbru, and Giovanni Tummarello. Efficiency and precision trade-offs in graph summary algorithms. In IDEAS, 2013.
[CFKP15] Mariano P. Consens, Valeria Fionda, Shahan Khatchadourian, and Giuseppe Pirro. S+EPPs: Construct and explore bisimulation summaries + optimize navigational queries (demo). PVLDB, 8(12), 2015.
[CGM15a] Sejla Cebiric, Francois Goasdoue, and Ioana Manolescu. Query-oriented summarization of RDF graphs. In BICOD, 2015.
[CGM15b] Sejla Cebiric, Francois Goasdoue, and Ioana Manolescu. Query-oriented summarization of RDF graphs (demonstration). PVLDB, 8(12), 2015.
[CGM17a] Sejla Cebiric, Francois Goasdoue, and Ioana Manolescu. A framework for efficient representative summarization of RDF graphs. In International Semantic Web Conference (ISWC), 2017.
[CGM17b] Sejla Cebiric, Francois Goasdoue, and Ioana Manolescu. Query-Oriented Summarization of RDF Graphs. Research Report RR-8920, INRIA, 2017.
[DMS17] Yanlei Diao, Ioana Manolescu, and Shu Shang. Dagger: Digging for interesting aggregates in RDF graphs. In International Semantic Web Conference (ISWC), 2017.
[GW97] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. In VLDB, 1997.
[HHK95] Monika Rauch Henzinger, Thomas A. Henzinger, and Peter W. Kopke. Computing simulations on finite and infinite graphs. In FOCS, 1995.
[LYL13] Shou-De Lin, Mi-Yen Yeh, and Cheng-Te Li. Sampling and summarization for social networks (tutorial), 2013.
[NM11] Thomas Neumann and Guido Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE, 2011.
[ZDYZ14] Haiwei Zhang, Yuanyuan Duan, Xiaojie Yuan, and Ying Zhang. ASSG: adaptive structural summary for RDF graph data. In ISWC (Posters and Demonstrations), 2014.