G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs
Sherif Sakr (NICTA & UNSW, Sydney, Australia)
Sameh Elnikety (Microsoft Research, Redmond, WA)
Yuxiong He (Microsoft Research, Redmond, WA)
CIKM 2012
Example 1: Social Network
[Figure: social network graph. Node labels: Bob, Hillary, Alice, Chris, David, France, Ed, George; Photo1 to Photo8.]
Example 2: Bibliographical Network
[Figure: attributed bibliographic graph]
• Nodes and node attributes
– John (age: 45)
– Alice (age: 42, location: Sydney)
– Smith (age: 28, office: 518)
– Paper 1 (keyword: graph)
– Paper 2 (keyword: XML, type: Demo)
– VLDB'12 (location: Istanbul)
– UNSW (country: Australia, established: 1949)
– Microsoft (country: USA, established: 1975)
• Edge labels and edge attributes shown
– citedBy
– title: Professor, Senior Researcher
– order: 1, 2
– month: 1, 3
Contributions
1. G-SPARQL language
– Pattern matching
– Reachability
2. Hybrid execution engine
– Graph topology in main memory
– Graph data in relational database
3. Algebraic transformation
– Operators
– Optimizations
4. Experimental evaluation
1. G-SPARQL Query Language
• Extends a subset of SPARQL
– Based on triple pattern: (subject, predicate, object)
• Sub-graph matching patterns on
– Graph structure
– Node attribute
– Edge attribute
• Reachability patterns on
– Path
– Shortest path
[Figure: a triple drawn as a subject node connected to an object node]
G-SPARQL Syntax
G-SPARQL Pattern Matching
• Node attribute
– ?Person @officeNumber "518"
• Edge attribute
– ?E @Role "Programmer"
• Structural
– ?Person worksAt Microsoft
– ?Person ?E(worksAt) Microsoft
[Figure: node Alice (officeNumber = 518) linked to node Microsoft by an edge with Role = Programmer]
G-SPARQL Reachability
• Path
– Subject ??PathVar Object
• Shortest path
– Subject ?*PathVar Object
• Path filters
– Path length
– All edges
– All nodes
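The pattern forms on the last two slides differ only in the predicate position, so they can be told apart syntactically. The following is a minimal sketch of that idea, not the G-SPARQL parser: the function names and the category strings are hypothetical labels chosen for illustration.

```python
import re

# Hypothetical classifier: the category names are illustrative labels,
# not identifiers taken from G-SPARQL.
def classify_predicate(predicate: str) -> str:
    """Classify the predicate position of a G-SPARQL triple pattern."""
    if predicate.startswith("??"):
        return "path"                       # ?X ??P ?Y
    if predicate.startswith("?*"):
        return "shortest_path"              # ?X ?*P ?Y
    if predicate.startswith("@"):
        return "attribute"                  # ?Person @officeNumber "518"
    if re.fullmatch(r"\?\w+\(\w+\)", predicate):
        return "structural_with_edge_var"   # ?Person ?E(worksAt) Microsoft
    return "structural"                     # ?Person worksAt Microsoft


def classify_pattern(pattern: str) -> str:
    """Split 'subject predicate object' and classify by the predicate."""
    subject, predicate, obj = pattern.split(maxsplit=2)
    return classify_predicate(predicate)


for p in ['?Person @officeNumber "518"', "?Person ?E(worksAt) Microsoft",
          "?X ??P ?Y", "?X ?*P ?Y"]:
    print(p, "->", classify_pattern(p))
```

Telling a node attribute from an edge attribute would additionally require knowing whether the subject variable is bound to an edge (as ?E is above); the sketch leaves that out.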
Example: G-SPARQL Query
SELECT ?L1 ?L2
WHERE {
  ?X ??P ?Y.
  ?X @Label ?L1. ?Y @Label ?L2.
  ?X @Age ?Age1. ?Y @Age ?Age2.
  ?X Affiliated UNSW. ?Y ?E(Affiliated) Microsoft.
  ?X LivesIn Sydney. ?E @Title "Researcher".
  FILTER(?Age1 >= 40). FILTER(?Age2 >= 40).
  FILTERPATH( Length( ??P, <= 3) ).
}
2. Hybrid Execution Engine
• Reachability queries
– Main memory algorithms
– Example: BFS and Dijkstra’s algorithm
• Pattern matching queries
– Relational database
– Indexing
» Example: B-tree
– Query optimizations
» Example: selectivity estimation and join ordering
– Recursive queries
» Not efficient: large intermediate results and multiple joins
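To make the main-memory side concrete, here is a minimal sketch of a BFS over an adjacency-list topology that answers a shortest-path pattern with an optional path-length filter. It is not the authors' implementation (the slides mention C++). The function name, the max_length parameter, and treating edges as unweighted are assumptions. The adjacency list follows the edges of the bibliographic example (1 = John, 3 = Alice, 8 = Smith, 2 = Paper 2, 6 = Paper 1, 5 = VLDB'12, 4 = Microsoft, 7 = UNSW).

```python
from collections import deque

def shortest_path(adj, source, target, max_length=None):
    """Return one shortest path as a list of node IDs, or None.

    BFS on unweighted, directed edges. If max_length is given, paths with
    more than max_length edges are not explored (a path-length filter).
    """
    parent = {source: None}
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:      # walk parents back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        if max_length is not None and depth[node] >= max_length:
            continue                     # do not expand beyond the length bound
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return None

# Topology only: attributes stay in the relational store.
adj = {1: [2, 4, 3], 3: [2, 6, 7, 8], 8: [6, 7], 2: [5], 6: [5, 2]}

print(shortest_path(adj, 1, 5))                # [1, 2, 5]
print(shortest_path(adj, 1, 5, max_length=1))  # None
```

Because BFS finds a path with the fewest edges, the same routine also decides a plain path pattern with a length bound: a path of length at most k exists exactly when the shortest one has at most k edges.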
Graph Representation

Node attribute tables (ID = node ID)

Node Label
ID  Value
1   John
2   Paper 2
3   Alice
4   Microsoft
5   VLDB’12
6   Paper 1
7   UNSW
8   Smith

age
ID  Value
1   45
3   42
8   28

office
ID  Value
8   518

location
ID  Value
3   Sydney
5   Istanbul

keyword
ID  Value
2   XML
6   graph

type
ID  Value
2   Demo

country
ID  Value
4   USA
7   Australia

established
ID  Value
4   1975
7   1949

Edge tables (eID = edge ID, sID = source node ID, dID = destination node ID)

authorOf
eID  sID  dID
1    1    2
5    3    2
6    3    6
11   8    6

affiliated
eID  sID  dID
3    1    4
8    3    7
12   8    7

published
eID  sID  dID
4    2    5
10   6    5

citedBy
eID  sID  dID
9    6    2

supervise
eID  sID  dID
7    3    8

know
eID  sID  dID
2    1    3

Edge attribute tables (ID = edge ID)

title
ID  Value
3   Senior Researcher
8   Professor

order
ID  Value
1   2
5   1
6   2
11  1

month
ID  Value
4   3
10  1
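As a rough illustration of how a pattern-matching query maps onto this representation, the sketch below loads a few of the tables into an in-memory SQLite database and evaluates the patterns ?X ?E(affiliated) UNSW, ?X @age ?A, ?E @title ?T with FILTER(?A >= 40) as a join. SQLite stands in for the relational database here (the talk's experiments used IBM DB2). The table and column names are assumptions made for the example.

```python
import sqlite3

# Table and column names are illustrative; SQLite is a stand-in database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node_label (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE age        (id INTEGER PRIMARY KEY, value INTEGER);
CREATE TABLE affiliated (eid INTEGER PRIMARY KEY, sid INTEGER, did INTEGER);
CREATE TABLE title      (id INTEGER PRIMARY KEY, value TEXT);
CREATE INDEX affiliated_did ON affiliated (did);
""")
conn.executemany("INSERT INTO node_label VALUES (?, ?)",
                 [(1, "John"), (2, "Paper 2"), (3, "Alice"), (4, "Microsoft"),
                  (5, "VLDB'12"), (6, "Paper 1"), (7, "UNSW"), (8, "Smith")])
conn.executemany("INSERT INTO age VALUES (?, ?)", [(1, 45), (3, 42), (8, 28)])
conn.executemany("INSERT INTO affiliated VALUES (?, ?, ?)",
                 [(3, 1, 4), (8, 3, 7), (12, 8, 7)])
conn.executemany("INSERT INTO title VALUES (?, ?)",
                 [(3, "Senior Researcher"), (8, "Professor")])

# One join per triple pattern: the edge table gives the structure,
# node attributes join on the node ID, edge attributes on the edge ID.
rows = conn.execute("""
SELECT person.value, t.value
FROM affiliated AS a
JOIN node_label AS person ON person.id = a.sid
JOIN node_label AS org    ON org.id = a.did
JOIN age        AS g      ON g.id = a.sid
JOIN title      AS t      ON t.id = a.eid
WHERE org.value = 'UNSW' AND g.value >= 40
""").fetchall()

print(rows)  # [('Alice', 'Professor')]
```

Smith is also affiliated with UNSW but is filtered out by the age predicate, and that edge carries no title attribute, so the inner join drops it as well.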
Hybrid Execution Engine: interfaces
[Figure: the engine takes a G-SPARQL query and issues SQL commands to the relational database and traversal operations to the in-memory graph topology]
3. Intermediate Language & Compilation
[Figure: compilation pipeline]
G-SPARQL query
-> Step 1: Front-end compilation -> Algebraic query plan
-> Step 2: Back-end compilation -> Physical execution plan (SQL commands and traversal operations)
Intermediate Language
• Objective
– Generate query plan and chop it
» Reachability part -> main-memory algorithms on topology
» Pattern matching part -> relational database
– Optimizations
• Features
– Independent of execution engine and graph representation
– Algebraic query plan
G-SPARQL Algebra
• Variant of "Tuple Algebra"
• Algebra details
– Data: tuples
» Sets of nodes, edges, paths
– Operators
» Relational: select, project, join
» Graph specific: node and edge attributes, adjacency
» Path operators
[Figure: tables of G-SPARQL algebraic operators, grouped as Relational and NOT Relational]
Front-end Compilation (Step 1)
• Input
– G-SPARQL query
• Output
– Algebraic query plan
• Technique
– Map from triple patterns to G-SPARQL operators
– Use inference rules
Front-end Compilation: Inference Rules
Front-end Compilation: Optimizations
• Objective
– Delay execution of traversal operations
• Technique
– Order triple patterns, based on restrictiveness
• Heuristics
– Triple pattern P1 is more restrictive than P2 if
1. P1 has fewer path variables than P2
2. P1 has fewer variables than P2
3. P1’s variables have more filter statements than P2’s variables
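One way to read the three heuristics is as a sort key, so that the most restrictive patterns are evaluated first and path patterns last. The sketch below is an illustration under assumptions the slides do not state: the heuristics are applied lexicographically in the listed order, path variables count toward the total number of variables, and filter_counts (a hypothetical input) maps each variable to the number of FILTER statements that mention it.

```python
import re

PATH_VAR = re.compile(r"\?[?*]\w+")        # ??P or ?*P
PLAIN_VAR = re.compile(r"(?<!\?)\?\w+")    # ?X, but not the tail of ??P

def restrictiveness_key(pattern, filter_counts):
    """Smaller key = more restrictive (assumed lexicographic combination)."""
    path_vars = PATH_VAR.findall(pattern)
    plain_vars = PLAIN_VAR.findall(pattern)
    n_filters = sum(filter_counts.get(v, 0) for v in set(plain_vars))
    return (
        len(path_vars),                    # 1. fewer path variables
        len(path_vars) + len(plain_vars),  # 2. fewer variables overall
        -n_filters,                        # 3. more filter statements
    )

patterns = ["?X ??P ?Y", "?X @label ?L1", "?X @age ?Age1", "?X affiliated UNSW"]
filter_counts = {"?Age1": 1, "?Age2": 1}   # from FILTER(?Age1 >= 40) etc.

ordered = sorted(patterns, key=lambda p: restrictiveness_key(p, filter_counts))
for p in ordered:
    print(p)
```

With this key the path pattern ?X ??P ?Y sorts last, which matches the stated objective of delaying traversal operations.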
Back-end Compilation (Step 2)
• Input
– G-SPARQL algebraic plan
• Output
– SQL commands
– Traversal operations
• Technique
– Substitute G-SPARQL relational operators with SPJ
– Traverse
» Bottom up
» Stop when reaching root or reaching non-relational operator
» Transform relational algebra to SQL commands
– Send non-relational commands to main memory algorithms
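The bottom-up traversal can be sketched as a post-order walk that grows a relational fragment until it hits the root or a non-relational operator. This is a simplified illustration, not the engine's compiler: the Op class, the operator names, and the choice to represent a fragment as a list of operator names (rather than generated SQL text) are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                 # hypothetical operator name
    relational: bool          # True if expressible as select/project/join
    children: list = field(default_factory=list)

def split_plan(root):
    """Return (sql_fragments, traversal_ops) for an algebraic plan tree."""
    sql_fragments, traversal_ops = [], []

    def visit(op):
        # Bottom up: children are processed before their parent.
        child_fragments = [visit(c) for c in op.children]
        if op.relational:
            # Extend the open fragments of the children with this operator.
            merged = [n for frag in child_fragments if frag is not None for n in frag]
            merged.append(op.name)
            return merged
        # Non-relational operator: stop here, close each open child fragment
        # as one SQL unit, and hand the operator to the main-memory side.
        for frag in child_fragments:
            if frag is not None:
                sql_fragments.append(frag)
        traversal_ops.append(op.name)
        return None

    top = visit(root)
    if top is not None:       # reached the root inside a relational fragment
        sql_fragments.append(top)
    return sql_fragments, traversal_ops

plan = Op("project_labels", True, [
    Op("path_join", False, [
        Op("join_x", True, [Op("select_age_x", True),
                            Op("scan_affiliated_unsw", True)]),
        Op("select_age_y", True),
    ]),
])

sql_fragments, traversal_ops = split_plan(plan)
print(sql_fragments)
print(traversal_ops)
```

Each inner list would be translated into one SQL command, while path_join would be sent to the main-memory algorithms.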
Back-end Compilation: Optimizations
• Optimize a fragment of query plan
– Before generating SQL command
• All operators are Select/Project/Join
• Apply standard techniques
– For example, pushing selection
Example: G-SPARQL Query
SELECT ?L1 ?L2
WHERE {
  ?X ??P ?Y.
  ?X @label ?L1. ?Y @label ?L2.
  ?X @age ?Age1. ?Y @age ?Age2.
  ?X affiliated UNSW. ?Y ?E(affiliated) Microsoft.
  ?X livesIn Sydney. ?E @title "Researcher".
  FILTER(?Age1 >= 40). FILTER(?Age2 >= 40).
}
Example: Query Plan
4. Experimental Evaluation
• Objective
– Show that the hybrid approach is a good idea
– Good performance from DBMS and main memory topology
• Data sets
– Real ACM bibliographic network
– Synthetic graphs
» See technical report
Experimental Environment
• Workload
– Created Q1 … Q12
• Process
– Compare to Neo4j (non-optimized, optimized)
• Environment
– Implementation
» Main memory algorithms in C++
» IBM DB2
– PC Server
Results on Real Dataset
Response time on ACM Bibliographic Network
Conclusions
• G-SPARQL Language
– Expresses pattern matching and reachability queries on attributed graphs
• Hybrid engine
– Graph topology in main memory
– Graph data in database
• Compilation into algebraic plan
– Operators and optimizations
• Evaluation
– Real and synthetic datasets
– Good performance
» Leveraging database engine and main memory topology