Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for
the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND2011-2324P
Triangle Finding: How Graph Theory can Help the Semantic Web
Edward Jimenez, Eric Goodman
The Semantic Web as a Graph
Optimizing Queries with Graph Theory
Graph theory has a lot to offer the Semantic Web.
One example: triangle finding in O(|E|^1.5) time, much more efficient than what a typical database would do.
Query 2
SELECT ?X ?Y ?Z
WHERE {
  ?X rdf:type ub:GraduateStudent .
  ?Y rdf:type ub:University .
  ?Z rdf:type ub:Department .
  ?X ub:memberOf ?Z .
  ?Z ub:subOrganizationOf ?Y .
  ?X ub:undergraduateDegreeFrom ?Y
}
Query 9
SELECT ?X ?Y ?Z
WHERE {
  ?X rdf:type ub:Student .
  ?Y rdf:type ub:Faculty .
  ?Z rdf:type ub:Course .
  ?X ub:advisor ?Y .
  ?Y ub:teacherOf ?Z .
  ?X ub:takesCourse ?Z
}
Experiment
Compare these three approaches for finding all triangles in a graph:
  Sesame
  Jena
  MultiThreaded Graph Library (MTGL)
MTGL
Open source library of graph algorithms, targeted towards shared memory supercomputers.
Used MTGL’s implementation of J. Cohen’s triangle finding algorithm.
Had to modify it slightly to allow for multiple edges between vertices.
Data
Data: a Recursive Matrix (R-MAT) graph
Specify:
  |V|
  Edge factor (average number of edges per vertex)
  Probabilities a, b, c, d, where a + b + c + d = 1
Has properties similar to real-world graphs, such as short diameters and small-world properties.
Used as the basis of the Graph500 benchmark.
Nodes are given a unique IRI and edges are given a random value.
|V| = {2^5 - 2^19}
Edge factor: {16, 32, 64}
[Figure: adjacency matrix recursively split into quadrants labeled a, b, c, d]
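The R-MAT recipe above can be sketched in a few lines. This is a minimal illustration, not the generator used for the experiment: the function names, the seed handling, and the default probabilities (0.57, 0.19, 0.19, 0.05, the values commonly quoted for Graph500) are assumptions, and the sketch returns plain integer vertex pairs rather than IRIs.

```python
import random


def rmat_edge(scale, a, b, c, rng):
    """Pick one (row, col) cell of a 2**scale x 2**scale adjacency matrix.

    At each of the `scale` levels, one quadrant is chosen with
    probabilities a (top-left), b (top-right), c (bottom-left),
    and d = 1 - a - b - c (bottom-right).
    """
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass                # top-left: both bits stay 0
        elif r < a + b:
            col |= 1            # top-right
        elif r < a + b + c:
            row |= 1            # bottom-left
        else:
            row |= 1            # bottom-right
            col |= 1
    return row, col


def rmat_graph(scale, edge_factor, a=0.57, b=0.19, c=0.19, d=0.05, seed=0):
    """Generate edge_factor * |V| edges, where |V| = 2**scale.

    The default probabilities are an assumption for illustration only.
    Duplicate edges and self-loops are kept, as in a raw R-MAT stream.
    """
    if abs(a + b + c + d - 1.0) > 1e-9:
        raise ValueError("a + b + c + d must equal 1")
    rng = random.Random(seed)
    num_vertices = 1 << scale
    return [rmat_edge(scale, a, b, c, rng)
            for _ in range(edge_factor * num_vertices)]
```

For example, `rmat_graph(5, 16)` corresponds to the smallest configuration listed above (|V| = 2^5, edge factor 16) and yields 512 edges over vertices 0 to 31.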
Possible Triangles
Trying to Find Triangles via SPARQL
SELECT ?X ?Y ?Z
WHERE {
  { ?X ?a ?Y . ?Y ?b ?Z . ?Z ?c ?X }
  UNION { ?Y ?a ?X . ?Z ?b ?Y . ?X ?c ?Z }
  UNION { ?X ?a ?Y . ?Y ?b ?Z . ?X ?c ?Z }
  UNION { ?X ?a ?Y . ?Z ?b ?Y . ?X ?c ?Z }
  UNION { ?Y ?a ?X . ?Y ?b ?Z . ?X ?c ?Z }
  UNION { ?Y ?a ?X . ?Z ?b ?Y . ?Z ?c ?X }
  UNION { ?X ?a ?Y . ?Z ?b ?Y . ?Z ?c ?X }
  UNION { ?Y ?a ?X . ?Y ?b ?Z . ?Z ?c ?X }
}
Redundant Solutions
The Problem: Graph Isomorphism
[Figure: triangle patterns iii and iv over ?X, ?Y, ?Z, each matched against the same Alice, Bob, Charlie triangle]
?X = Alice, ?Y = Bob, ?Z = Charlie
?X = Alice, ?Y = Charlie, ?Z = Bob
The Other Problem: Automorphism
[Figure: triangle pattern i over ?X, ?Y, ?Z, matched against the same Alice, Bob, Charlie triangle in two rotations]
?X = Alice, ?Y = Bob, ?Z = Charlie
?X = Charlie, ?Y = Alice, ?Z = Bob
Possible Triangles
The SPARQL Query
SELECT ?X ?Y ?Z
WHERE {
  { ?X ?a ?Y . ?Y ?b ?Z . ?Z ?c ?X
    FILTER (STR(?X) < STR(?Y))
    FILTER (STR(?Y) < STR(?Z)) }
  UNION
  { ?X ?a ?Y . ?Y ?b ?Z . ?Z ?c ?X
    FILTER (STR(?Y) > STR(?Z))
    FILTER (STR(?Z) > STR(?X)) }
  UNION
  { ?X ?a ?Y . ?Y ?b ?Z . ?X ?c ?Z }
}
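To see why the ordering filters in the query remove automorphic duplicates, the cyclic pattern can be mimicked outside SPARQL. The sketch below is only an illustration under assumed names (`cyclic_matches`, `filtered_matches`); it uses Python string comparison in place of `STR(...)` comparison and handles only the cyclic pattern, not the third UNION branch.

```python
from itertools import permutations


def cyclic_matches(edges):
    """All bindings (x, y, z) with edges x->y, y->z, z->x.

    This mirrors the unfiltered cyclic pattern: every directed
    3-cycle is reported once per rotation, i.e. three times.
    """
    edge_set = set(edges)
    nodes = sorted({v for edge in edge_set for v in edge})
    return [(x, y, z) for x, y, z in permutations(nodes, 3)
            if (x, y) in edge_set and (y, z) in edge_set and (z, x) in edge_set]


def filtered_matches(edges):
    """Keep a binding only if it passes one of the two FILTER pairs.

    First branch:  x < y and y < z.
    Second branch: y > z and z > x.
    In both cases x is the smallest name, so one rotation survives.
    """
    return [(x, y, z) for x, y, z in cyclic_matches(edges)
            if (x < y < z) or (x < z < y)]


edges = [("Alice", "Bob"), ("Bob", "Charlie"), ("Charlie", "Alice")]
print(len(cyclic_matches(edges)), "bindings without filters")
print(len(filtered_matches(edges)), "binding with filters")
```

Two branches are needed because a cycle can run in either direction relative to the name order: once ?X is pinned to the smallest name, the remaining two vertices appear either as ?Y < ?Z or as ?Z < ?Y.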
Cohen’s Triangle Algorithm
Assumptions: simplified graph, completely connected
Map 1: O(m). Use v1 < v2 < ··· < vn for tie-breaking.
Bin v1: <v1,v2>, <v1,v3>, …, <v1,vn>
Bin v2: <v2,v3>, <v2,v4>, …, <v2,vn>
…
Bin vn-1: <vn-1,vn>
Bin <v2,v3>, Bin <v2,v4>, …, Bin <v2,vn>, …
Cohen’s Triangle Algorithm
Reduce 1: O(m^(3/2))
Bin v1: <v1,v2>, <v1,v3>, <v1,v4>, …, <v1,vn>
Bin v2: <v2,v3>, <v2,v4>, …, <v2,vn>
…
Bin vn-1: <vn-1,vn>
Edge pairs emitted: (<v1,v2>, <v1,v3>), (<v1,v2>, <v1,v4>), …, (<v1,v2>, <v1,vn>), …
Cohen’s Triangle Algorithm
Map 2: O(m^(3/2)). Identity mapping of the previous reduce step. Map edges.
[Figure: the <v8, v20> bin holds the vertex triples (v8, v20, v1), (v8, v20, v3), (v8, v20, v2), …, along with the edge between v8 and v20]
Reduce 2: O(m^(3/2)). Emit triangles for the contents of each <vi, vj> bin when the edge exists between vi and vj.
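The two map/reduce rounds can be sketched in a single process. This is a hedged, in-memory reading of the steps on these slides, not MTGL's implementation: the function name `find_triangles` is hypothetical, edges are binned under the lower-degree endpoint with the vertex order as the tie-breaker (an assumption about what "tie-breaking" refers to), and multiple edges are collapsed rather than preserved.

```python
from collections import defaultdict
from itertools import combinations


def find_triangles(edges):
    """Return the set of triangles (sorted vertex triples) in an undirected graph."""
    # Simplify the graph: drop self-loops, direction, and duplicate edges.
    simple = {tuple(sorted(e)) for e in edges if e[0] != e[1]}

    degree = defaultdict(int)
    for u, v in simple:
        degree[u] += 1
        degree[v] += 1

    def rank(v):
        # Lower degree first; the vertex order breaks ties.
        return (degree[v], v)

    # Map 1: bin each edge under its lower-ranked endpoint.
    vertex_bins = defaultdict(list)
    for u, v in simple:
        low, high = sorted((u, v), key=rank)
        vertex_bins[low].append(high)

    # Reduce 1: every pair of edges in a vertex bin forms an open triad,
    # keyed by the pair of far endpoints <vi, vj>.
    pair_bins = defaultdict(list)
    for apex, neighbours in vertex_bins.items():
        for p, q in combinations(sorted(neighbours), 2):
            pair_bins[(p, q)].append(apex)

    # Map 2 / Reduce 2: triads pass through unchanged; a triad in bin
    # <vi, vj> becomes a triangle only if the edge vi-vj exists.
    triangles = set()
    for (p, q), apexes in pair_bins.items():
        if (p, q) in simple:
            for apex in apexes:
                triangles.add(tuple(sorted((apex, p, q))))
    return triangles
```

For instance, `find_triangles([(1, 2), (2, 3), (3, 1), (3, 4)])` returns `{(1, 2, 3)}`. Because each triangle's lowest-ranked vertex holds both of its incident triangle edges in its bin, each triangle is produced from exactly one triad.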
Results: Growth of Triangles
Results
Comparison at Larger Scales
With 1 billion edges, assuming the same constant:
An O(x^1.39) implementation versus an O(x^1.58) one is 50x faster.
An O(x^1.39) implementation versus an O(x^1.83) one is 9000x faster.
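The speedup figures follow from dividing the two cost estimates: with the same constant, c·x^p / (c·x^q) = x^(p−q). The short check below is a sketch with a hypothetical helper name, `speedup`; it evaluates that ratio at x = 10^9 edges.

```python
def speedup(num_edges, slow_exponent, fast_exponent):
    """Ratio of c * x**slow to c * x**fast; the shared constant c cancels."""
    return num_edges ** (slow_exponent - fast_exponent)


one_billion = 1e9
print(speedup(one_billion, 1.58, 1.39))  # ratio for the x^1.58 implementation
print(speedup(one_billion, 1.83, 1.39))  # ratio for the x^1.83 implementation
```

The two ratios come out at roughly 51 and roughly 9100, which the slide rounds to 50x and 9000x.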
Conclusions
The Semantic Web is a graph.
Graph theory can add a lot in terms of speeding up queries.
It also has other approaches for analyzing the data.
SPARQL has unexpected issues when graph isomorphism or automorphisms arise.