GStore: Answering SPARQL Queries via Subgraph Matching Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Ozsu...

Post on 13-Jan-2016

gStore: Answering SPARQL Queries via Subgraph Matching

Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao

{zoulei,mojinghui,zdy}@icst.pku.edu.cn, leichen@cse.ust.hk, tamer.ozsu@uwaterloo.ca

Agenda

• Introduction
• Preliminaries
• Overview of gStore
• Storage Scheme and Encoding Technique
• Indexing Structure and Query Algorithm
• Optimized methods
• Experiments and their results
• Conclusions

Introduction - 1/4

• What is RDF?
– Building block of the semantic web
– Represented as a collection of triples: (Subject, Property, Object)

Prefix: y=http://en.wikipedia.org/wiki/

Subject                   Property     Object
y:Abraham_Lincoln         hasName      “Abraham Lincoln”
y:Abraham_Lincoln         BornOnDate   “1809-02-12”
y:Abraham_Lincoln         DiedOnDate   “1865-04-15”
y:Abraham_Lincoln         DiedIn       y:Washington_D.C
y:Washington_D.C          hasName      “Washington D.C”
y:Washington_D.C          FoundYear    “1790”
y:Washington_D.C          rdf:type     y:city
y:United_States           hasName      “United States”
y:United_States           hasCapital   y:Washington_D.C
y:United_States           rdf:type     Country
y:Reese_Witherspoon       rdf:type     y:Actor
y:Reese_Witherspoon       BornOnDate   “1976-03-22”
y:Reese_Witherspoon       BornIn       y:New_Orleans_Louisiana
y:Reese_Witherspoon       hasName      “Reese Witherspoon”
y:New_Orleans_Louisiana   FoundYear    “1718”
y:New_Orleans_Louisiana   rdf:type     y:city
y:New_Orleans_Louisiana   locatedIn    y:United_States

Introduction - 2/4: RDF Graph

Introduction - 3/4

• What is SPARQL?
• Sample query:

SELECT ?name WHERE {
  ?m <hasName> ?name .
  ?m <BornOnDate> "1809-02-12" .
  ?m <DiedOnDate> "1865-04-15" .
}

• Query with wildcards:

SELECT ?name WHERE {
  ?m <hasName> ?name .
  ?m <BornOnDate> ?bd .
  ?m <DiedOnDate> ?dd .
  FILTER (regex(str(?bd), "02-12") && regex(str(?dd), "04-15"))
}

Introduction - 4/4

• Problems with existing solutions:
– they cannot answer SPARQL queries with wildcards in a scalable manner
– they cannot handle frequent updates in RDF repositories
• Answering with subgraph matching
– Modeling RDF data and Query as two graphs
– Cannot use regular graph pattern matching
– Answering SPARQL query ≈ subgraph matching

Preliminaries

• RDF graph G is denoted as G = (V, L_V, E, L_E)
• Query graph Q is denoted as Q = (V, L_V, E, L_E)
• G(u_1, u_2, …, u_n) is a match of Q(v_1, v_2, …, v_n) if:
– when v_i is a literal vertex, v_i and u_i have the same literal value
– when v_i is a class/entity vertex, v_i and u_i have the same URI
– when v_i is a parameter vertex, there is no constraint over u_i
– when v_i is a wildcard vertex, v_i is a substring of u_i and u_i is a literal value
– when there is an edge from v_i to v_j in Q with the property p, there is also an edge from u_i to u_j in G with the same property p
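The match conditions above can be read as a small checker. The following is a minimal sketch, assuming a hypothetical in-memory representation that is not taken from gStore: the RDF graph is a set of (subject, property, object) triples, literals are tracked in a separate set, and each query vertex is a (kind, value) pair. The names `vertex_ok` and `is_match` are illustrative only.

```python
# Hypothetical representation (not gStore's own data structures).
def vertex_ok(query_vertex, u, literals):
    """Check one query vertex against the data vertex u bound to it."""
    kind, value = query_vertex
    if kind == "literal":      # same literal value
        return u in literals and u == value
    if kind == "entity":       # class/entity vertex: same URI
        return u not in literals and u == value
    if kind == "parameter":    # no constraint
        return True
    if kind == "wildcard":     # query value is a substring of a literal
        return u in literals and value in u
    raise ValueError(f"unknown vertex kind: {kind}")

def is_match(q_vertices, q_edges, assignment, triples, literals):
    """assignment maps each query vertex name to a data vertex."""
    for name, query_vertex in q_vertices.items():
        if not vertex_ok(query_vertex, assignment[name], literals):
            return False
    # Every query edge must exist in the data graph with the same property.
    return all((assignment[s], p, assignment[o]) in triples for s, p, o in q_edges)

TRIPLES = {
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "DiedOnDate", "1865-04-15"),
}
LITERALS = {"Abraham Lincoln", "1809-02-12", "1865-04-15"}

# The wildcard query from the introduction, written as a query graph.
Q_VERTICES = {
    "?m": ("parameter", None),
    "?name": ("parameter", None),
    "?bd": ("wildcard", "02-12"),
    "?dd": ("wildcard", "04-15"),
}
Q_EDGES = [("?m", "hasName", "?name"),
           ("?m", "BornOnDate", "?bd"),
           ("?m", "DiedOnDate", "?dd")]

GOOD = {"?m": "y:Abraham_Lincoln", "?name": "Abraham Lincoln",
        "?bd": "1809-02-12", "?dd": "1865-04-15"}
BAD = dict(GOOD, **{"?bd": "1865-04-15"})  # wrong date bound to ?bd

print(is_match(Q_VERTICES, Q_EDGES, GOOD, TRIPLES, LITERALS))  # True
print(is_match(Q_VERTICES, Q_EDGES, BAD, TRIPLES, LITERALS))   # False
```

This checker only verifies a given assignment; finding the assignments efficiently is what the signature encoding and index are for.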

Preliminaries Cont’d

Overview of gStore

• Work directly on the RDF graph and the SPARQL query graph
• Use a signature-based encoding of each entity and class vertex to speed up matching
• Filter and evaluate
– Use a pruning step that admits false positives to prune nodes and obtain a set of candidates; then verify each candidate
• Use an index (VS*-tree) over the data signature graph G* for efficient pruning; the index has a light maintenance load

Storage Scheme & Encoding Technique

• Storage Scheme

Storage Scheme & Encoding Technique

• Encoding technique

(hasName, “Abraham Lincoln”)

hasName   0100 0000 0000

“Abr”     0000 0100 0000 0000
“bra”     1000 0000 0000 0000
“rah”     0000 0000 0100 0000
OR        1000 0100 0100 0000

Storage Scheme & Encoding Technique

• Encoding technique

(hasName, “Abraham Lincoln”)

Property bits   0100 0000 0000
Value bits      1000 0100 0100 0000

Storage Scheme & Encoding Technique

• Encoding technique

(hasName, “Abraham Lincoln”)   0010 0000 0000 1000 0100 0100 0000
(BornOnDate, “1809-02-12”)     0100 0000 0000 0100 0010 0100 1000
(DiedOnDate, “1865-04-15”)     0000 1000 0000 0000 0010 0100 0000
(DiedIn, y:Washington_D.C)     0000 0010 0000 1000 0010 0100 0001
OR                             0110 1010 0000 1100 0110 0100 1001
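The encoding steps on these slides can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the 12-bit property part and 16-bit value part are read off the slides, but the hash function is not given there, so `zlib.crc32` is used purely as a stand-in, and the names `bit_for`, `edge_signature` and `vertex_signature` are hypothetical. The bit patterns it produces will therefore differ from the ones shown on the slides.

```python
import zlib

PROP_BITS, VALUE_BITS = 12, 16  # widths read off the slides (12 + 16 = 28 bits)

def bit_for(text, width):
    # Assumption: CRC32 stands in for the unspecified hash function.
    return 1 << (zlib.crc32(text.encode("utf-8")) % width)

def ngrams(text, n=3):
    """All contiguous n-character substrings, e.g. 'Abr', 'bra', 'rah', ..."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def edge_signature(prop, value):
    """One bit for the property, OR of one bit per 3-gram for the value."""
    value_part = 0
    for gram in ngrams(value):
        value_part |= bit_for(gram, VALUE_BITS)
    return (bit_for(prop, PROP_BITS) << VALUE_BITS) | value_part

def vertex_signature(edges):
    """OR of the signatures of all edges adjacent to the vertex."""
    sig = 0
    for prop, value in edges:
        sig |= edge_signature(prop, value)
    return sig

LINCOLN = [
    ("hasName", "Abraham Lincoln"),
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
    ("DiedIn", "y:Washington_D.C"),
]
data_sig = vertex_signature(LINCOLN)

# A wildcard edge from the query: every 3-gram of "02-12" also occurs in
# "1809-02-12", so the query bits are a subset of the data bits.
query_sig = edge_signature("BornOnDate", "02-12")

print(format(data_sig, "028b"))
print(query_sig & data_sig == query_sig)  # True
```

The subset test on the last line is what makes wildcard filtering possible: a substring can only set bits that the full literal also sets, so a failed test safely rules a vertex out, while a passed test still needs verification.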

Indexing Structure and Query Algorithm

Data Signature Graph G*

Converting Q to Q*

Filter and Evaluate

• Filter: find matches of Q* over G* to get the candidate list (CL)
• Evaluate: verify each candidate in CL against the RDF graph G to get the results (RS)

Generating Candidate List (CL)

• Two-step process:
– for each vertex v_i ∈ V(Q*), find a list R_i = {u_i1, u_i2, ..., u_in}, where v_i & u_ij = v_i, u_ij ∈ V(G*) and u_ij ∈ R_i
– do a multi-way join over the lists R_i to get the candidate list
• Use S-trees
– Height-balanced tree over signatures
– Does not support the second step, so the join is expensive
• VS-tree and VS*-tree
– Multi-resolution summary graph based on the S-tree
– Supports both steps efficiently
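The two-step process can be illustrated without any index at all. The sketch below is a brute-force version under stated assumptions: the eight vertex signatures are the leaf signatures from the S-tree example slides, while the edges of G* and their 5-bit signatures are hypothetical, chosen only so that the example runs. The function names are illustrative.

```python
from itertools import product

# Vertex signatures of G* (8-bit values from the S-tree example slides).
G_SIG = {
    "001": 0b00101000, "002": 0b10000100, "003": 0b10000001, "004": 0b00011000,
    "005": 0b00000001, "006": 0b10001000, "007": 0b01000100, "008": 0b00010100,
}
# Hypothetical edges of G*: (source, target) -> 5-bit edge signature.
G_EDGES = {("001", "002"): 0b10010, ("004", "003"): 0b00100, ("006", "008"): 0b00010}

def covers(data_sig, query_sig):
    """True if every bit set in the query is also set in the data."""
    return query_sig & data_sig == query_sig

def candidates(query_sig):
    """Step 1: the list R_i of data vertices whose signature covers v_i."""
    return sorted(u for u, sig in G_SIG.items() if covers(sig, query_sig))

def candidate_list(q_sigs, q_edges):
    """Step 2: multi-way join of the R_i lists on the query edges."""
    names = list(q_sigs)
    lists = [candidates(q_sigs[v]) for v in names]
    result = []
    for combo in product(*lists):
        binding = dict(zip(names, combo))
        if all((binding[a], binding[b]) in G_EDGES
               and covers(G_EDGES[(binding[a], binding[b])], label)
               for (a, b), label in q_edges.items()):
            result.append(binding)
    return result

Q_SIG = {"v1": 0b00001000, "v2": 0b10000000}
Q_EDGES = {("v1", "v2"): 0b10000}

print(candidates(Q_SIG["v1"]))  # ['001', '004', '006']
print(candidates(Q_SIG["v2"]))  # ['002', '003', '006']
CL = candidate_list(Q_SIG, Q_EDGES)
print(CL)                       # [{'v1': '001', 'v2': '002'}]
```

The join enumerates the full cross product of the lists, which is the cost the VS-tree is meant to avoid.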

S-tree Solution

[Figure: S-tree over the vertex signatures of G*; each inner node's signature is the OR of its children's signatures]

Root:     d_1^1 = 1111 1101
Level 2:  d_1^2 = 1110 1101, d_2^2 = 1001 1101
Level 3:  d_1^3 = 0010 1001, d_2^3 = 1100 0100, d_3^3 = 1001 0101, d_4^3 = 1001 1000
Leaves:   001 = 0010 1000, 005 = 0000 0001 (under d_1^3)
          002 = 1000 0100, 007 = 0100 0100 (under d_2^3)
          003 = 1000 0001, 008 = 0001 0100 (under d_3^3)
          004 = 0001 1000, 006 = 1000 1000 (under d_4^3)

Query Q*: vertex signatures 0000 1000 and 1000 0000, edge signature 10000

• Candidates for 0000 1000: 001, 004, 006
• Candidates for 1000 0000: 002, 003, 006
• The two candidate lists still have to be joined (&)
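The S-tree lookup in this example can be sketched as a pruned tree traversal. This is a minimal sketch, assuming a hypothetical `Node` class and the tree shape implied by the figure (each inner node's signature is the OR of its children); it covers only the first step, the per-vertex lookup, and not the join.

```python
class Node:
    """S-tree node: a leaf holds a data vertex, an inner node holds children."""
    def __init__(self, children=None, vertex=None, sig=0):
        self.children = children or []
        self.vertex = vertex
        self.sig = sig
        for child in self.children:
            self.sig |= child.sig  # inner signature = OR of the children

def leaf(vertex, bits):
    return Node(vertex=vertex, sig=int(bits.replace(" ", ""), 2))

def search(node, query_sig, out):
    """Collect leaves whose signature covers query_sig, pruning whole subtrees."""
    if query_sig & node.sig != query_sig:
        return out  # no descendant can cover the query
    if node.vertex is not None:
        out.append(node.vertex)
    for child in node.children:
        search(child, query_sig, out)
    return out

# Leaf signatures and grouping as in the figure.
d13 = Node([leaf("001", "0010 1000"), leaf("005", "0000 0001")])
d23 = Node([leaf("002", "1000 0100"), leaf("007", "0100 0100")])
d33 = Node([leaf("003", "1000 0001"), leaf("008", "0001 0100")])
d43 = Node([leaf("004", "0001 1000"), leaf("006", "1000 1000")])
d12, d22 = Node([d13, d23]), Node([d33, d43])
root = Node([d12, d22])

print(format(root.sig, "08b"))         # 11111101
print(search(root, 0b00001000, []))    # ['001', '004', '006']
print(search(root, 0b10000000, []))    # ['002', '003', '006']
```

For the first query signature the subtrees under d_2^3 and d_3^3 are skipped without visiting their leaves, which is where the S-tree saves work over a full scan.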

VS-tree Solution

[Figure: VS-tree. The S-tree nodes (d_1^1; d_1^2, d_2^2; d_1^3 to d_4^3; leaves 001 to 008) are additionally linked, level by level, by super edges that carry 5-bit edge signatures]

Query Q*: vertex signatures 0000 1000 and 1000 0000, edge signature 10000

Summary matches found from the root down:

• Level 1: d_1^1 × d_1^1
• Level 2: d_1^2 × d_1^2
• Level 3: d_1^3 × d_2^3
• Leaves: 001 × 002

VS-tree Solution: Limitations

Query Q*: vertex signatures 0000 1000 and 1000 0000, edge signature 10000

• Each level is processed step by step
• If a level is dense, it yields many summary matches, which enlarges the search space
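The level-by-level processing can be sketched as follows. This is a simplified illustration, not gStore's implementation: the tree shape and leaf signatures follow the slides, but the data edges are hypothetical, and super edges are recomputed on demand from the leaf edges here, whereas a real VS-tree would store them in the index. Node names such as `d12` stand for d_1^2.

```python
# Tree shape and leaf signatures as on the slides; "d12" stands for d_1^2, etc.
CHILDREN = {
    "d11": ["d12", "d22"],
    "d12": ["d13", "d23"], "d22": ["d33", "d43"],
    "d13": ["001", "005"], "d23": ["002", "007"],
    "d33": ["003", "008"], "d43": ["004", "006"],
}
LEAF_SIG = {
    "001": 0b00101000, "002": 0b10000100, "003": 0b10000001, "004": 0b00011000,
    "005": 0b00000001, "006": 0b10001000, "007": 0b01000100, "008": 0b00010100,
}
# Hypothetical data edges: (source, target) -> 5-bit edge signature.
LEAF_EDGES = {("001", "002"): 0b10010, ("004", "003"): 0b00100, ("006", "008"): 0b00010}

def leaves(node):
    if node in LEAF_SIG:
        return [node]
    return [l for child in CHILDREN[node] for l in leaves(child)]

def node_sig(node):
    sig = 0
    for l in leaves(node):
        sig |= LEAF_SIG[l]
    return sig

def super_edge(a, b):
    """OR of the signatures of all data edges from a's leaves to b's leaves."""
    sig = 0
    for x in leaves(a):
        for y in leaves(b):
            sig |= LEAF_EDGES.get((x, y), 0)
    return sig

def covers(data_sig, query_sig):
    return query_sig & data_sig == query_sig

def summary_match(a, b, q1, q2, q_edge):
    return (covers(node_sig(a), q1) and covers(node_sig(b), q2)
            and covers(super_edge(a, b), q_edge))

def search(q1, q2, q_edge, root="d11"):
    """Return the surviving (node, node) pairs at each level, root first."""
    level, trace = [(root, root)], []
    while True:
        level = [(a, b) for a, b in level if summary_match(a, b, q1, q2, q_edge)]
        trace.append(level)
        if not level or level[0][0] in LEAF_SIG:
            return trace
        # Expand every surviving pair into the pairs of their children.
        level = [(x, y) for a, b in level for x in CHILDREN[a] for y in CHILDREN[b]]

TRACE = search(0b00001000, 0b10000000, 0b10000)
for pairs in TRACE:
    print(pairs)
```

With these hypothetical edges exactly one pair survives at each level, ending at (001, 002). In a denser graph more pairs would survive at the upper levels, and each of them is expanded into child pairs, which is the growth in search space that the limitation above describes.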

Possible Optimization Methods

• “magically” know which level to begin with to minimize the number of summary matches

• Use DFS (depth-first search) to find the valid child nodes

• While inserting vertices, consider not only the Hamming distance but also the number of super edges introduced

Optimization example

Experimental results: Exact queries

Yago network (20 million triples, 3.1 GB)

[Figure: per-query results for gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM and GRIN]

Experimental results: Wildcard queries

[Figure: per-query results for gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM and GRIN]

Conclusion

• This approach:
– Uses two novel indexes, VS-tree and VS*-tree, to speed up query processing
– Solves the two problems with existing solutions:
  • answers SPARQL queries with wildcards in a scalable manner
  • handles frequent and online updates in RDF repositories

Questions?