Federated SPARQL Query Processing ISWC2015 Tutorial

Post on 12-Apr-2017


Federated SPARQL Query Processing Over the Web of Data

Muhammad Saleem

Tutorial at ISWC 2015, Bethlehem, USA

Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, Germany, 11/10/2015

Agenda

• SPARQL Query Federation Approaches
• SPARQL Query Federation Optimization
– Source Selection
– Data Integration Options
– Join Order Selection
– Join Order Optimization
– Join Implementations
• Performance Metrics and Discussion

SPARQL Query Federation Approaches

• SPARQL Endpoint Federation (SEF)
• Linked Data Federation (LDF)
• Linked Data Fragments Federation (LDFF)
• Distributed Hash Tables (DHTs)
• Hybrid

SPARQL Endpoint Federation Approaches

• Most commonly used approaches
• Make use of SPARQL endpoint URLs
• Fast query execution
• RDF data needs to be exposed via SPARQL endpoints
• E.g., HiBISCuS, FedX, SPLENDID, ANAPSID, LHD, TopFed, QUETSAL etc.

Linked Data Federation Approaches

• Data need not be exposed via SPARQL endpoints
• Uses URI lookups at runtime
• Data should follow Linked Data principles
• Slower than SPARQL endpoint federation approaches
• E.g., LDQPS, SIHJoin, WoDQA etc.

Linked Data Fragments Federation

• Federation over Linked Data Fragments
• Will be explained in detail in an upcoming session

Query federation on top of Distributed Hash Tables

• Uses DHT indexing to federate SPARQL queries
• Space efficient
• Cannot deal with whole LOD
• E.g., ATLAS

Hybrid

• Federation over SPARQL endpoints and Linked Data
• Can potentially deal with whole LOD
• E.g., ADERIS-Hybrid (of SEF+LDF)

SPARQL Endpoint Federation

Figure: architecture of a SPARQL endpoint federation engine over four sources S1-S4, each holding RDF data. Components and their roles:

• Parsing/Rewriting: rewrite query and get individual triple patterns
• Source Selection: identify capable sources against individual triple patterns
• Federator Optimizer: generate optimized sub-query execution plan
• Integrator: integrate sub-query results
• Execute sub-queries (against S1-S4)

Source Selection

FedBench (LD3): Return for all US presidents their party membership and news pages about them.

SELECT ?president ?party ?page WHERE {
  ?president rdf:type dbpedia:President .                  # TP1
  ?president dbpedia:nationality dbpedia:United_States .   # TP2
  ?president dbpedia:party ?party .                        # TP3
  ?x nyt:topicPage ?page .                                 # TP4
  ?x owl:sameAs ?president .                               # TP5
}

Figure (built up over several slides): the source selection algorithm performs triple pattern-wise source selection over nine RDF sources: S1 DBpedia, S2 KEGG, S3 ChEBI, S4 NYT, S5 SWDF, S6 LMDB, S7 Jamendo, S8 GeoNames, S9 DrugBank.

TP1 = S1
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1 S2 S4-S9

Total triple pattern-wise sources selected = 1+1+1+1+8 = 12

Types of Source Selection

• Index-free
– Using SPARQL ASK queries
– No index maintenance required
– Potentially ensures result set completeness
– SPARQL ASK queries can be expensive
– Can make use of the cache to store recent SPARQL ASK query results
– E.g., FedX
• Index-only
– Only make use of index/data summaries
– Less efficient but fast source selection
– Result set completeness is not ensured
– E.g., DARQ, LHD
• Hybrid
– Make use of index + SPARQL ASK
– Most efficient
– Result set completeness is not ensured
– Can make use of the cache to store recent SPARQL ASK query results
– E.g., HiBISCuS, ANAPSID, SPLENDID

Index-free Source Selection

Input: SPARQL query Q, set of all data sources D
Output: Triple pattern to relevant data sources map M

for each triple pattern ti in SPARQL query Q
    Ri = {};  // set of relevant data sources for triple pattern ti
    for each data source di in D
        if SPARQL ASK(di, ti) = true
            Ri = Ri U {di};
        end if
    end for
    M = M U {Ri};
end for
return M

What is the total number of SPARQL ASK requests used?

total number of triple patterns * total number of data sources
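The index-free algorithm can be sketched in Python as follows. This is a minimal illustration, not the code of any federation engine: the endpoints are simulated by in-memory sets of made-up triples, and `ask` is a hypothetical stand-in for sending a SPARQL ASK request to a source.

```python
def is_var(term):
    return term.startswith("?")

# Toy federation: source id -> set of (s, p, o) triples. All triples are made up.
SOURCES = {f"S{i}": {(f"e{i}", "rdf:type", f"Class{i}")} for i in range(1, 10)}
SOURCES["S1"] = {
    ("obama", "rdf:type", "dbpedia:President"),
    ("obama", "dbpedia:nationality", "dbpedia:United_States"),
    ("obama", "dbpedia:party", "democrats"),
}
SOURCES["S4"].add(("nyt_obama", "nyt:topicPage", "page1"))
for sid in SOURCES:
    if sid != "S3":  # assumption: S3 holds no owl:sameAs links
        SOURCES[sid].add((f"x_{sid}", "owl:sameAs", "obama"))

QUERY = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]

def ask(source_id, pattern, counter):
    """Simulated SPARQL ASK: does the source hold a triple matching the pattern?"""
    counter["ask"] += 1
    return any(
        all(is_var(q) or q == t for q, t in zip(pattern, triple))
        for triple in SOURCES[source_id]
    )

def index_free_selection(query, sources):
    counter = {"ask": 0}
    selected = {}
    for tp in query:
        # One ASK request per (triple pattern, data source) pair.
        selected[tp] = [sid for sid in sources if ask(sid, tp, counter)]
    return selected, counter["ask"]

selected, n_ask = index_free_selection(QUERY, SOURCES)
total_selected = sum(len(srcs) for srcs in selected.values())
print(n_ask, total_selected)
```

With five triple patterns and nine sources, the sketch issues one ASK per pair, which matches the product given above.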

Index-free Source Selection

Figure (built up over several slides): index-free source selection for the FedBench LD3 query over sources S1-S9. Each triple pattern is sent as a SPARQL ASK query to every source.

TP1 = S1
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1 S2 S4-S9

Total number of SPARQL ASK requests used = 45
Total triple pattern-wise sources selected = 12

Index-only Source Selection (LHD)

Input: SPARQL query Q, set of all data sources D, data sources index I storing all distinct predicates for all data sources in D
Output: Triple pattern to relevant data sources map M

for each triple pattern ti in SPARQL query Q
    Ri = {};  // set of relevant data sources for triple pattern ti
    p = Pred(ti)  // predicate of ti
    if (bound(p))
        Ri = Lookup(I, p)  // index lookup for predicate of ti
    else
        Ri = D;  // all data sources are relevant
    end if
    M = M U {Ri};
end for
return M

Why is it the less efficient approach (i.e., why does it greatly overestimate the relevant data sources)?

• Source selection is based only on the predicates of triple patterns
• All data sources are simply selected for triple patterns with unbound predicates
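A minimal Python sketch of index-only selection, assuming a hypothetical predicate index for nine sources S1-S9 (the index contents are illustrative, chosen so that rdf:type occurs in every source):

```python
ALL_SOURCES = [f"S{i}" for i in range(1, 10)]

# Hypothetical predicate index I: predicate -> sources that contain it.
INDEX = {
    "rdf:type": list(ALL_SOURCES),
    "dbpedia:nationality": ["S1"],
    "dbpedia:party": ["S1"],
    "nyt:topicPage": ["S4"],
    "owl:sameAs": [s for s in ALL_SOURCES if s != "S3"],
}

QUERY = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]

def index_only_selection(query, index, all_sources):
    selected = {}
    for tp in query:
        predicate = tp[1]
        if predicate.startswith("?"):
            # Unbound predicate: every data source is considered relevant.
            selected[tp] = list(all_sources)
        else:
            # Bound predicate: a single index lookup, no ASK request.
            selected[tp] = list(index.get(predicate, []))
    return selected

index_selected = index_only_selection(QUERY, INDEX, ALL_SOURCES)
index_total = sum(len(srcs) for srcs in index_selected.values())
print(index_total)
```

Because the lookup ignores the bound object dbpedia:President, the rdf:type pattern keeps all nine sources, which is where the overestimation comes from.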

Index-only Source Selection

Figure (built up over several slides): index-only source selection for the FedBench LD3 query over sources S1-S9.

TP1 = S1-S9
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1 S2 S4-S9

Total number of SPARQL ASK requests used = 0
Total triple pattern-wise sources selected = 20

Hybrid Source Selection

Input: SPARQL query Q, set of all data sources D, data sources index I storing all distinct predicates for all data sources in D
Output: Triple pattern to relevant data sources map M

for each triple pattern ti in SPARQL query Q
    Ri = {};  // set of relevant data sources for triple pattern ti
    s = Subj(ti), p = Pred(ti), o = Obj(ti);  // subject, predicate, and object of ti
    if (!bound(p) || bound(s) || bound(o))
        for each data source di in D
            if SPARQL ASK(di, ti) = true
                Ri = Ri U {di};
            end if
        end for
    else
        Ri = Lookup(I, p)  // index lookup for predicate of ti
    end if
    M = M U {Ri}
end for
return M

What is the total number of SPARQL ASK requests used?

total number of triple patterns with bound subject or bound object or unbound predicate * total number of data sources
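The hybrid rule can be sketched in Python as below. It is only an illustration under stated assumptions: sources are simulated by made-up in-memory triples, the predicate index is derived from them, and each simulated ASK is counted as one remote request.

```python
def is_var(term):
    return term.startswith("?")

# Toy federation: source id -> set of (s, p, o) triples. All triples are made up.
SOURCES = {f"S{i}": {(f"e{i}", "rdf:type", f"Class{i}")} for i in range(1, 10)}
SOURCES["S1"] = {
    ("obama", "rdf:type", "dbpedia:President"),
    ("obama", "dbpedia:nationality", "dbpedia:United_States"),
    ("obama", "dbpedia:party", "democrats"),
}
SOURCES["S4"].add(("nyt_obama", "nyt:topicPage", "page1"))
for sid in SOURCES:
    if sid != "S3":  # assumption: S3 holds no owl:sameAs links
        SOURCES[sid].add((f"x_{sid}", "owl:sameAs", "obama"))

# Predicate index I: predicate -> sources containing that predicate.
INDEX = {}
for sid, triples in SOURCES.items():
    for _, p, _ in triples:
        if sid not in INDEX.setdefault(p, []):
            INDEX[p].append(sid)

QUERY = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?president", "dbpedia:nationality", "dbpedia:United_States"),
    ("?president", "dbpedia:party", "?party"),
    ("?x", "nyt:topicPage", "?page"),
    ("?x", "owl:sameAs", "?president"),
]

def hybrid_selection(query, sources, index):
    asks = 0
    selected = {}
    for tp in query:
        s, p, o = tp
        if is_var(p) or not is_var(s) or not is_var(o):
            # Unbound predicate, or bound subject/object: ask every source.
            relevant = []
            for sid, triples in sources.items():
                asks += 1  # one simulated SPARQL ASK request
                if any(all(is_var(q) or q == t for q, t in zip(tp, triple))
                       for triple in triples):
                    relevant.append(sid)
        else:
            # Only the predicate is bound: the index lookup is enough.
            relevant = list(index.get(p, []))
        selected[tp] = relevant
    return selected, asks

hybrid_selected, hybrid_asks = hybrid_selection(QUERY, SOURCES, INDEX)
hybrid_total = sum(len(srcs) for srcs in hybrid_selected.values())
print(hybrid_asks, hybrid_total)
```

Only the two patterns with a bound object trigger ASK requests here; the remaining three are answered from the index.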

Hybrid Source Selection

Figure (built up over several slides): hybrid source selection for the FedBench LD3 query over sources S1-S9.

TP1 = S1
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1 S2 S4-S9

Total number of SPARQL ASK requests used = 18
Total triple pattern-wise sources selected = 12

Does anything still need to be improved?

Source Selection

• Triple pattern-wise source selection
– Ensures 100% recall
– Can over-estimate capable sources
– Can be expensive, e.g., total number of SPARQL ASK requests used
– Performed by FedX, SPLENDID, LHD, DARQ, ADERIS etc.
• Join-aware triple pattern-wise source selection
– Ensures 100% recall
– May select optimal/close to optimal capable sources
– Can be expensive, e.g., total number of SPARQL ASK requests used
– Can significantly reduce the query execution time
– Performed by ANAPSID, HiBISCuS

HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation

• Hybrid source selection
• Join-aware triple pattern-wise source selection
• Makes use of the hypergraph representation of SPARQL queries
• Makes use of the URI authorities
• Makes use of the cache to store recent SPARQL ASK query results

Motivation

Figure (built up over several slides): triple pattern-wise source selection for the FedBench LD3 query over sources S1-S9.

TP1 = S1
TP2 = S1
TP3 = S1
TP4 = S4
TP5 = S1 S2 S4 S5 S6 S7 S8 S9

Total triple pattern-wise selected sources = 12
Total SPARQL ASK queries: 9*5 = 45
Optimal triple pattern-wise selected sources = 5

Problem Statement

• An overestimation of triple pattern-wise source selection can be expensive
– Resources are wasted
– Query runtime is increased
– Extra traffic is generated
• How do we perform join-aware triple pattern-wise source selection in a time-efficient way?

HiBISCuS: Key Concept

• Makes use of the URI’s authorities

http://dbpedia.org/ontology/party
Scheme: http
Authority: dbpedia.org
Path: /ontology/party

For URI details: http://tools.ietf.org/html/rfc3986
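As a small illustration of the scheme/authority/path split, Python's standard `urllib.parse.urlsplit` can be used. The helper name `authority` is an assumption made for this sketch; the scheme-plus-authority form it returns mirrors how authorities are written in the data summaries (e.g., <http://dbpedia.org/>).

```python
from urllib.parse import urlsplit

def authority(uri):
    """Return scheme and authority of a URI, e.g. 'http://dbpedia.org/'."""
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}/"

parts = urlsplit("http://dbpedia.org/ontology/party")
print(parts.scheme)   # scheme component
print(parts.netloc)   # authority component
print(parts.path)     # path component
print(authority("http://dbpedia.org/ontology/party"))
```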

HiBISCuS: SPARQL Query as Hypergraph

Figure (built up over several slides): hypergraph representation of the FedBench LD3 query, with one hyperedge added per triple pattern. Vertices: ?president, rdf:type, dbpedia:President, dbpedia:nationality, dbpedia:United_States, dbpedia:party, ?party, ?x, nyt:topicPage, ?page, owl:sameAs. Legend: star, simple, hybrid; tail of hyperedge.

HiBISCuS: Data Summaries

[] a ds:Service ;
   ds:endpointUrl <http://dbpedia.org/sparql> ;
   ds:capability [
      ds:predicate dbpedia:party ;
      ds:sbjAuthority <http://dbpedia.org/> ;
      ds:objAuthority <http://dbpedia.org/> ;
   ] ;
   ds:capability [
      ds:predicate rdf:type ;
      ds:sbjAuthority <http://dbpedia.org/> ;
      ds:objAuthority owl:Thing, dbpedia:President ;  # we store all distinct classes
   ] ;
   ds:capability [
      ds:predicate dbpedia:postalCode ;
      ds:sbjAuthority <http://dbpedia.org/> ;
      # No objAuthority as the object value for dbpedia:postalCode is a string
   ] ;

HiBISCuS: Triple Pattern-wise Source Selection

Figure: hypergraph of the FedBench LD3 query, with each triple pattern labelled with its initially selected sources.

TP1 (rdf:type dbpedia:President) = dbpedia
TP2 (dbpedia:nationality) = dbpedia
TP3 (dbpedia:party) = dbpedia
TP4 (nyt:topicPage) = NYT
TP5 (owl:sameAs) = dbpedia KEGG NYT SWDF LMDB Geo DrgBnk Jamendo

HiBISCuS: Triple Pattern-wise Source Pruning

Figure (built up over several slides): the selected sources of each triple pattern are annotated with their subject authorities (Sbj. auth.) or object authorities (Obj. auth.). The sources of the owl:sameAs pattern are pruned step by step until only NYT remains.

TP1 = dbpedia
TP2 = dbpedia
TP3 = dbpedia
TP4 = NYT
TP5 = NYT

Total triple pattern-wise selected sources = 5
Total SPARQL ASK queries: 0
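The pruning idea can be illustrated with a deliberately simplified Python sketch. It is not the HiBISCuS algorithm as published: it only handles two patterns that share a subject variable (here ?x in TP4 and TP5), and every authority value except dbpedia.org is a hypothetical placeholder.

```python
# Subject authorities per triple pattern and source (placeholder values).
SBJ_AUTH = {
    "TP4": {"NYT": {"nyt.example"}},
    "TP5": {
        "dbpedia": {"dbpedia.org"},
        "KEGG": {"kegg.example"},
        "NYT": {"nyt.example"},
        "SWDF": {"swdf.example"},
        "LMDB": {"lmdb.example"},
        "Geo": {"geo.example"},
        "DrgBnk": {"drugbank.example"},
        "Jamendo": {"jamendo.example"},
    },
}

def prune_on_shared_subject(sbj_auth):
    """Keep only sources whose subject authorities can take part in the join."""
    # Authorities reachable for each pattern: union over its selected sources.
    per_pattern = [set().union(*srcs.values()) for srcs in sbj_auth.values()]
    # A join on the shared subject is only possible for common authorities.
    common = set.intersection(*per_pattern)
    return {
        tp: [sid for sid, auths in srcs.items() if auths & common]
        for tp, srcs in sbj_auth.items()
    }

pruned = prune_on_shared_subject(SBJ_AUTH)
print(pruned)
```

No remote request is needed for this step, since it works only on the authorities stored in the data summaries.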

Data Integration Options

Complete Local Integration

• Triple patterns are individually and completely evaluated against every endpoint
• Triple pattern results are locally integrated using different join techniques, e.g., NLJ, Hash Join etc.
• Less efficient if the query contains common predicates such as rdf:type and owl:sameAs
• Large amount of potentially irrelevant intermediate results is retrieved

Iterative Integration

• Evaluate query iteratively pattern by pattern
• Start with a single triple pattern
• Substitute mappings from the previous triple pattern in the subsequent evaluation
• Evaluate query in a NLJ fashion
• NLJ can cause many remote requests
• A block NLJ minimizes the remote requests
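The difference in remote requests between a plain NLJ and a block NLJ can be sketched in Python. The `Endpoint` class is a hypothetical in-memory stand-in for a remote SPARQL endpoint, and the rows and block size are made up for illustration.

```python
class Endpoint:
    """Simulated remote endpoint holding (?x, ?president) rows."""

    def __init__(self, rows):
        self.rows = rows
        self.requests = 0

    def query(self, presidents):
        """One remote request: rows whose ?president is among the given bindings."""
        self.requests += 1
        wanted = set(presidents)
        return [row for row in self.rows if row[1] in wanted]

def nested_loop_join(bindings, endpoint):
    # One remote request per binding from the previous triple pattern.
    results = []
    for binding in bindings:
        results.extend(endpoint.query([binding]))
    return results

def block_nested_loop_join(bindings, endpoint, block_size):
    # One remote request per block of bindings.
    results = []
    for start in range(0, len(bindings), block_size):
        results.extend(endpoint.query(bindings[start:start + block_size]))
    return results

ROWS = [(f"nyt{i}", f"president{i}") for i in range(10)]
BINDINGS = [f"president{i}" for i in range(10)]

nlj_endpoint = Endpoint(ROWS)
nlj_results = nested_loop_join(BINDINGS, nlj_endpoint)

block_endpoint = Endpoint(ROWS)
block_results = block_nested_loop_join(BINDINGS, block_endpoint, block_size=4)

print(nlj_endpoint.requests, block_endpoint.requests)
```

Both variants return the same rows; only the number of requests sent to the endpoint differs.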

Join Order Selection

• Left-deep trees
– Joins take place in a left-to-right sequential order
– Result of the join is used as an outer input for the next join
– Used in FedX, DARQ
• Right-deep trees
– Joins take place in a right-to-left sequential order
– Result of the join is used as an inner input for the next join
• Bushy trees
– Joins take place in sub-trees both on left and right sides
– Used in ANAPSID
• Dynamic programming
– Used in SPLENDID


Join Order Selection Example

Compute Micronutrients using Drugbank and KEGG

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-cat:micronutrient .   # TP1
  ?drug drugbank:casRegistryNumber ?id .                     # TP2
  ?keggDrug rdf:type kegg:Drug .                             # TP3
  ?keggDrug bio2rdf:xRef ?id .                               # TP4
  ?keggDrug dc:title ?title .                                # TP5
}

Figure: three alternative join trees over TP1-TP5, each topped by the projection π ?drug, ?title: a left-deep tree, a right-deep tree, and a bushy tree.

Goal: Execute smallest cardinality joins first

Join Order Optimization

• Exclusive Groups
  – Group triple patterns that share the same single relevant data source
  – Evaluate the group in a single (remote) sub-query
  – Push the join to the data source, i.e., the endpoint
• Variable count heuristic
  – Iteratively determine the join order based on the free variable count of triple patterns and groups
  – Consider “resolved” variable mappings from earlier iterations
• Using selectivities
  – Store distinct predicates, avg. subject selectivities, and avg. object selectivities for each predicate in the index
  – Use the predicate count, avg. subject selectivities, and avg. object selectivities to estimate the join cardinality
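The variable count heuristic can be sketched as follows. This is a simplified illustration, not the implementation of any particular engine: the function name, the pattern labels P1 to P4 and the tie-breaking rule (input order) are assumptions.

```python
def variable_count_order(patterns):
    """Order patterns by their number of free (not yet resolved) variables.

    patterns: list of (name, set_of_variables) pairs.
    Ties are broken by input order, which is an assumption of this sketch.
    """
    remaining = list(patterns)
    resolved = set()
    order = []
    while remaining:
        # Variables bound by earlier patterns count as resolved, not free.
        best = min(remaining, key=lambda p: len(p[1] - resolved))
        remaining.remove(best)
        order.append(best[0])
        resolved |= best[1]
    return order

# Hypothetical labels for the four patterns of a president/party/topic page query.
patterns = [
    ("P1", {"President"}),
    ("P2", {"President", "Party"}),
    ("P3", {"nytPresident", "President"}),
    ("P4", {"nytPresident", "TopicPage"}),
]
print(variable_count_order(patterns))
```

After P1 resolves ?President, P2 and P3 each have one free variable left while P4 still has two, so P4 is scheduled last.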


Exclusive Groups

SELECT ?President ?Party ?TopicPage WHERE {
  ?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates .  # @ DBpedia
  ?President dbpedia:party ?Party .                               # @ DBpedia
  ?nytPresident owl:sameAs ?President .                           # @ DBpedia, NYTimes
  ?nytPresident nytimes:topicPage ?TopicPage .                    # @ NYTimes
}

Source Selection: the sources after @ are those selected for each triple pattern.

Exclusive Group: the first two triple patterns, whose only relevant source is DBpedia.

Advantage: Delegate joins to the endpoint by forming exclusive groups (i.e., executing the respective patterns in a single subquery)
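Forming exclusive groups from a source selection result can be sketched as below. The function name and the labels tp1 to tp4 are hypothetical; the source sets follow the four @ annotations of the slide, read in pattern order, which is an assumption about the layout of the original figure.

```python
from collections import defaultdict

def exclusive_groups(sources_by_pattern):
    """Split patterns into exclusive groups and remaining patterns.

    sources_by_pattern: dict pattern -> set of relevant sources.
    A group is formed when two or more patterns have the same single relevant source.
    """
    by_source = defaultdict(list)
    remaining = []
    for pattern, sources in sources_by_pattern.items():
        if len(sources) == 1:
            by_source[next(iter(sources))].append(pattern)
        else:
            remaining.append(pattern)
    groups = {}
    for source, patterns in by_source.items():
        if len(patterns) > 1:
            groups[source] = patterns      # can be sent as one sub-query
        else:
            remaining.extend(patterns)     # single pattern, nothing to group
    return groups, remaining

selection = {
    "tp1": {"DBpedia"},
    "tp2": {"DBpedia"},
    "tp3": {"DBpedia", "NYTimes"},
    "tp4": {"NYTimes"},
}
groups, remaining = exclusive_groups(selection)
print(groups, remaining)
```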

Source: http://www.slideshare.net/aschwarte/fedx-for-federated-query-processing-on-linked-data


Exclusive Groups Join Order Optimization

Compute Micronutrients using Drugbank and KEGG

1 SPARQL Query

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-cat:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug dc:title ?title .
}

2 Unoptimized Internal Representation: 4x Local Join = 4x NLJ

3 Optimized Internal Representation: Exclusive Group, Remote Join

Source: http://www.slideshare.net/aschwarte/fedx-for-federated-query-processing-on-linked-data

[] a sd:Service ;
   sd:endpointUrl <http://localhost:8890/sparql> ;
   sd:capability [
     sd:predicate diseasome:name ;
     sd:totalTriples 147 ;      # total number of triples with predicate value sd:predicate
     sd:avgSbjSel "0.0068" ;    # 1 / distinct subjects with predicate value sd:predicate
     sd:avgObjSel "0.0069" ;    # 1 / distinct objects with predicate value sd:predicate
   ] ;
   sd:capability [
     sd:predicate diseasome:chromosomalLocation ;
     sd:totalTriples 160 ;
     sd:avgSbjSel "0.0062" ;
     sd:avgObjSel "0.0072" ;
   ] ;

S1 P O1 .
S1 P O2 .
S2 P O1 .
S3 P O2 .

totalTriples = 4
avgSbjSel(p) = 1/3
avgObjSel(p) = 1/2
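The three statistics of this small example can be recomputed with a short sketch, assuming (as the index comments state) that avgSbjSel is 1 / distinct subjects and avgObjSel is 1 / distinct objects for a predicate. The function name `predicate_stats` is hypothetical.

```python
from fractions import Fraction

def predicate_stats(triples, predicate):
    """Compute totalTriples, avgSbjSel and avgObjSel for one predicate."""
    matching = [(s, o) for (s, p, o) in triples if p == predicate]
    if not matching:
        raise ValueError(f"no triples with predicate {predicate!r}")
    distinct_subjects = {s for s, _ in matching}
    distinct_objects = {o for _, o in matching}
    return {
        "totalTriples": len(matching),
        "avgSbjSel": Fraction(1, len(distinct_subjects)),
        "avgObjSel": Fraction(1, len(distinct_objects)),
    }

# The four triples of the example.
triples = [("S1", "P", "O1"), ("S1", "P", "O2"), ("S2", "P", "O1"), ("S3", "P", "O2")]
stats = predicate_stats(triples, "P")
print(stats["totalTriples"], stats["avgSbjSel"], stats["avgObjSel"])  # 4 1/3 1/2
```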

Selectivity Based Join Order Optimization


• Triple pattern cardinality C(tp)
  – p = pred(tp), totalTriples = total number of triples having predicate p
• Join cardinality

\[
C(J(tp_1, tp_2)) =
\begin{cases}
C(tp_1) \times C(tp_2) \times \mathit{avgPredJoinSel}(tp_1) \times \mathit{avgPredJoinSel}(tp_2) & \text{if p-p join} \\
C(tp_1) \times C(tp_2) \times \mathit{avgSbjJoinSel}(tp_1) \times \mathit{avgSbjJoinSel}(tp_2) & \text{if s-s join} \\
C(tp_1) \times C(tp_2) \times \mathit{avgSbjJoinSel}(tp_1) \times \mathit{avgObjJoinSel}(tp_2) & \text{if s-o join}
\end{cases}
\]

How to calculate avgPredJoinSel, avgSbjJoinSel, and avgObjJoinSel?
DARQ selected 0.5 as the avgJoinSel value for all joins.
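The three cases of the join cardinality formula can be sketched as a lookup of which selectivity applies to each side. All names and numbers below are illustrative assumptions: the cardinalities reuse the totalTriples values 147 and 160 as if both patterns had unbound subject and object, and using 0.5 for every factor merely mirrors the constant attributed to DARQ above; applying it to both factors is a choice of this sketch.

```python
# Which selectivity of tp1 and tp2 enters the product for each join type.
JOIN_SELECTIVITIES = {
    "p-p": ("avgPredJoinSel", "avgPredJoinSel"),
    "s-s": ("avgSbjJoinSel", "avgSbjJoinSel"),
    "s-o": ("avgSbjJoinSel", "avgObjJoinSel"),
}

def join_cardinality(tp1, tp2, join_type):
    """Estimate C(J(tp1, tp2)) = C(tp1) * C(tp2) * sel(tp1) * sel(tp2)."""
    sel1, sel2 = JOIN_SELECTIVITIES[join_type]
    return tp1["C"] * tp2["C"] * tp1[sel1] * tp2[sel2]

def with_constant_selectivity(cardinality, value=0.5):
    """Hypothetical helper: a pattern whose join selectivities all equal one constant."""
    return {
        "C": cardinality,
        "avgPredJoinSel": value,
        "avgSbjJoinSel": value,
        "avgObjJoinSel": value,
    }

tp1 = with_constant_selectivity(147)
tp2 = with_constant_selectivity(160)
estimate = join_cardinality(tp1, tp2, "s-o")
print(estimate)  # 147 * 160 * 0.5 * 0.5 = 5880.0
```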

Join Implementations

• Bound Joins
  – Start with a single triple pattern (lowest cardinality)
  – Substitute mappings from the previous triple pattern into the subsequent evaluation
  – Bound joins in NLJ fashion
    • Execute bound joins in nested loop join fashion
    • Too many remote requests
  – Bound joins in block NLJ fashion
    • Execute bound joins in block nested loop join fashion
    • Make use of the SPARQL UNION construct
    • Remote requests are reduced by a factor of the block size
• Other join techniques
  – E.g., hash joins


Bound Joins in Block NLJ

SELECT ?President ?Party ?TopicPage WHERE {
  ?President rdf:type dbpedia:PresidentsOfTheUnitedStates .
  ?President dbpedia:party ?Party .
  ?nytPresident owl:sameAs ?President .
  ?nytPresident nytimes:topicPage ?TopicPage .
}

Assume that the following intermediate results have been computed as input for the last triple pattern:

Block Input
“Barack Obama”
“George W. Bush”
…

Before (NLJ)
SELECT ?TopicPage WHERE { “Barack Obama” nytimes:topicPage ?TopicPage }
SELECT ?TopicPage WHERE { “George W. Bush” nytimes:topicPage ?TopicPage }
…

Now: Evaluation in a single remote request using a SPARQL UNION construct + local post-processing (SPARQL 1.0)

SPARQL 1.1: VALUES clause (named BINDINGS in earlier SPARQL 1.1 drafts)
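A sketch of how one block of input mappings might be rewritten into a single request. The function names, the per-branch variable numbering used for local post-processing, and the example.org IRIs are assumptions for illustration; prefix declarations are omitted, and the second function uses VALUES, the name under which the BINDINGS clause was standardized in SPARQL 1.1.

```python
def union_query(block, predicate, var="TopicPage"):
    """Build one SPARQL 1.0 query with a UNION branch per input term.

    Each branch gets its own numbered variable so that results can be
    mapped back to the input term during local post-processing.
    """
    names = [f"?{var}_{i}" for i in range(1, len(block) + 1)]
    branches = [f"{{ {term} {predicate} {name} }}" for term, name in zip(block, names)]
    return f"SELECT {' '.join(names)} WHERE {{ {' UNION '.join(branches)} }}"

def values_query(block, predicate, subject_var="nytPresident", var="TopicPage"):
    """Build the SPARQL 1.1 equivalent using a VALUES clause."""
    terms = " ".join(block)
    return (
        f"SELECT ?{subject_var} ?{var} WHERE {{ "
        f"VALUES ?{subject_var} {{ {terms} }} "
        f"?{subject_var} {predicate} ?{var} }}"
    )

# Hypothetical block of two input terms.
block = ["<http://example.org/a>", "<http://example.org/b>"]
print(union_query(block, "nytimes:topicPage"))
print(values_query(block, "nytimes:topicPage"))
```

With a block of size n, the n single-pattern requests of the NLJ variant collapse into one request.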

Source: http://www.slideshare.net/aschwarte/fedx-for-federated-query-processing-on-linked-data

Parallelization and Pipelining
• Execute sub-queries concurrently on different data sources
• Multithreaded worker pool to execute the joins and UNION operators in parallel
• Pipelining approach for intermediate results
• See FedX and LHD implementations
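Concurrent sub-query execution with a worker pool can be sketched with Python's standard `concurrent.futures` module. This only illustrates the idea and is not how FedX or LHD are implemented; `run_subqueries`, `fake_execute` and the endpoint labels are hypothetical, and a real `execute` function would issue an HTTP request to a SPARQL endpoint.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subqueries(subqueries, execute, max_workers=4):
    """Run execute(endpoint, query) for every pair concurrently.

    Results are returned in the same order as the input pairs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(execute, endpoint, query) for endpoint, query in subqueries]
        return [future.result() for future in futures]

def fake_execute(endpoint, query):
    """Stand-in for a remote call; just echoes what would be sent where."""
    return f"{endpoint}|{query}"

subqueries = [("S1", "ASK { ?s ?p ?o }"), ("S2", "ASK { ?s a ?c }")]
print(run_subqueries(subqueries, fake_execute))
```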

Performance Metrics and Discussion

Performance Metrics
• Efficient source selection in terms of
  – Total triple pattern-wise sources selected
  – Total number of SPARQL ASK requests used during source selection
  – Source selection time
• Query execution time
• Result completeness and correctness
• Number of remote requests during query execution
• Index compression ratio (1 - index size / data dump size)
• See https://code.google.com/p/bigrdfbench/

Evaluation Setup

• Local dedicated network
• Local SPARQL endpoints (one per machine)
• Run each query 10 times and present the average results
• Analyze the results statistically, e.g., Wilcoxon signed-rank test, Student's t-test

Thanks

saleem@informatik.uni-leipzig.de
AKSW, University of Leipzig, Germany