Date post: 31-Mar-2015
Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1 , Katja Hose 2 , Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of Aalborg, Denmark 3 University of Passau, Germany

Scalable RDF Data Management

& SPARQL Query Processing

Martin Theobald¹, Katja Hose², Ralf Schenkel³

¹ University of Antwerp, Belgium
² University of Aalborg, Denmark
³ University of Passau, Germany


Outline of this Tutorial

• Part I–RDF in Centralized Relational Databases

• Part II–RDF in Distributed Settings

• Part III–Managing Uncertain RDF Data

Outline for Part I

• Part I.1: Foundations
  – Introduction to RDF and Linked Open Data
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook

Information Extraction

bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)

YAGO/DBpedia et al.

>120 M facts for YAGO2 (mostly from Wikipedia infoboxes & categories)

YAGO2 Knowledge Base

http://www.mpi-inf.mpg.de/yago-naga/

[Figure: excerpt of the YAGO2 graph. Entity nodes include Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Nobel Prize and Max_Planck Society; class nodes include Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, State, Country and Organization. Edge labels: instanceOf, subclass, bornOn, diedOn, bornIn, hasWon, FatherOf, citizenOf, locatedIn and means (linking names such as “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel” and “Angela Dorothea Merkel” to entities). Dates shown: Apr 23, 1858; Oct 4, 1947; Oct 23, 1944.]

3 M entities, 120 M facts, 100 relations, 200k classes

accuracy 95%


Why care about scalability?

Rapid growth of available semantic data

• More than 30 billion triples in more than 200 sources across the LOD cloud
• DBpedia: 3.4 million entities, 1 billion triples
• As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase

Sources: linkeddata.org, wikipedia.org

… and still growing

• Billion triple challenge 2008: 1B triples
• Billion triple challenge 2010: 3B triples
  http://km.aifb.kit.edu/projects/btc-2010/
• Billion triple challenge 2011: 2B triples
  http://km.aifb.kit.edu/projects/btc-2011/
• War stories from http://www.w3.org/wiki/LargeTripleStores
  – BigOWLIM: 12B triples in Jun 2009
  – Garlik 4Store: 15B triples in Oct 2009
  – OpenLink Virtuoso: 15.4B+ triples
  – AllegroGraph: 1+ Trillion triples

Queries can be complex, too

SELECT DISTINCT ?a ?b ?lat ?long WHERE {
  ?a dbpedia:spouse ?b .
  ?a dbpedia:wikilink dbpediares:actor .
  ?b dbpedia:wikilink dbpediares:actor .
  ?a dbpedia:placeOfBirth ?c .
  ?b dbpedia:placeOfBirth ?c .
  ?c owl:sameAs ?c2 .
  ?c2 pos:lat ?lat .
  ?c2 pos:long ?long .
}

Q7 on BTC2008 in [Neumann & Weikum, 2009]


What effects does the financial crisis have on migration rates in the US?


Is there a significant increase of serious weather conditions in Europe over the past 20 years?


Which glutamic-acid proteases are inhibitors of HIV?

Question Answering (QA) Systems

• KB of curated, structured data
• 10 trillion (!) facts, 50k algorithms

• KB from Wikipedia and user edits
• 600 million facts, 25 million entities

IBM Watson: Deep Question Answering

99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain

This town is known as "Sin City" & its downtown is "Glitter Gulch"

William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel

As of 2010, this is the only former Yugoslav republic in the EU

[Figure: Watson pipeline showing question classification & decomposition and the knowledge back-ends (YAGO).]

www.ibm.com/innovation/us/watson/index.htm

D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

SPARQL 1.0 / 1.1

• Query language for RDF recommended by the W3C.
• 3 ways to interpret RDF data:
  – Instances of logical predicates (“facts”)
  – Graphs (subjects/objects as nodes, predicates as directed and labeled edges)
  – Relations (either multiple binary relations or a single, large ternary relation)
• SPARQL main building block:
  – select-project-join combination of relational triple patterns
  – equivalent to subgraph matching queries over a potentially very large RDF graph

SPARQL – Example

Example query: Find all actors from Ontario (that are in the knowledge base)

[Figure: example RDF graph. Nodes: Albert_Einstein, Otto_Hahn, Jim_Carrey, Mike_Myers, vegetarian, physicist, chemist, scientist, actor, Ulm, Frankfurt, Newmarket, Scarborough, Germany, Europe, Ontario, Canada. Edge labels: isA, bornIn, locatedIn. A second view keeps only the part relevant to the query: Jim_Carrey, Mike_Myers, vegetarian, actor, Newmarket, Scarborough, Ontario, Canada.]

Find subgraphs of this form (?person and ?loc are variables; actor and Ontario are constants):

  ?person isA actor
  ?person bornIn ?loc
  ?loc locatedIn Ontario

SELECT ?person WHERE {
  ?person isA actor .
  ?person bornIn ?loc .
  ?loc locatedIn Ontario .
}
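The subgraph search behind this query can be sketched in a few lines of Python. This is a naive nested-loop matcher rather than how any real engine works, and the facts in TRIPLES are illustrative assumptions, since the edges of the figure are not recoverable from the transcript.

```python
# Toy triple store; these facts are assumed for illustration only.
TRIPLES = {
    ("Jim_Carrey", "isA", "actor"),
    ("Mike_Myers", "isA", "actor"),
    ("Albert_Einstein", "isA", "physicist"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Ontario", "locatedIn", "Canada"),
    ("Ulm", "locatedIn", "Germany"),
}


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):              # variable: bind it or check the binding
            if new.setdefault(p, t) != t:
                return None
        elif p != t:                       # constant: must match exactly
            return None
    return new


def evaluate(patterns, triples):
    """Return every variable binding that satisfies all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings


QUERY = [("?person", "isA", "actor"),
         ("?person", "bornIn", "?loc"),
         ("?loc", "locatedIn", "Ontario")]

actors = sorted({b["?person"] for b in evaluate(QUERY, TRIPLES)})
print(actors)
```

Each pattern is joined against the full triple set, which is exactly the cost that the indexing and join techniques later in the tutorial try to avoid.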

SPARQL 1.0 – More Features

• Eliminate duplicates in results
  SELECT DISTINCT ?c WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn ?c}
• Return results in some order, with optional LIMIT n clause
  SELECT ?person WHERE {?person isA actor. ?person bornIn ?loc. ?loc locatedIn Ontario} ORDER BY DESC(?person)
• Optional matches and filters on bound variables
  SELECT ?person WHERE {?person isA actor. OPTIONAL{?person bornIn ?loc}. FILTER (!BOUND(?loc))}
• More operators: ASK, DESCRIBE, CONSTRUCT

See: http://www.w3.org/TR/rdf-sparql-query/

SPARQL 1.1 Extensions of the W3C

W3C SPARQL 1.1:
• Aggregations (COUNT, AVG, …) and grouping
• Subqueries in the WHERE clause
• Safe negation: FILTER NOT EXISTS {?x …}
  – Syntactic sugar for OPTIONAL {?x … } FILTER(!BOUND(?x))
• Expressions in the SELECT clause: SELECT (?a+?b AS ?sum)
• Label constraints on paths: ?x foaf:knows/foaf:knows/foaf:name ?name
• More functions and operators …

RDF+SPARQL: Centralized Engines

• BigOWLIM (now ontotext.com)
• OpenLink Virtuoso
• OntoBroker (now semafora-systems.com)
• Apache Jena (different main-memory/relational backends)
• Sesame (now openRDF.org)
• SW-Store, Hexastore, 3Store, RDF-3X (no reasoning)

System deployments with >10^11 triples (see http://esw.w3.org/LargeTripleStores)

SPARQL: Extensions from Research (1)

More complex graph patterns:

• Transitive paths [Anyanwu et al., WWW’07]

  SELECT ?p, ?c WHERE {
    ?p isA scientist .
    ?p ??r ?c .
    ?c isA Country .
    ?c locatedIn Europe .
    PathFilter(cost(??r) < 5) .
    PathFilter(containsAny(??r, ?t)) .
    ?t isA City .
  }

• Regular expressions [Kasneci et al., ICDE’08]

  SELECT ?p, ?c WHERE {
    ?p isA ?s .
    ?s isA scientist .
    ?p (bornIn | livesIn | citizenOf) locatedIn* Europe .
  }

Meanwhile mostly covered by the SPARQL 1.1 query proposal.


SPARQL: Extensions from Research (2)

Queries over federated RDF sources:
• Determine distribution of triple patterns as part of query (for example in Jena ARQ)
• Automatically route triple predicates to useful sources

Potentially requires mapping of identifiers from different sources

SPARQL 1.1 explicitly supports federation of sources

http://www.w3.org/TR/sparql11-federated-query/

Ranking is Essential!

• Queries often have a huge number of results:
  – “scientists from Canada”
  – “publications in databases”
  – “actors from the U.S.”
• Queries may have no matches at all:
  – “Laboratoire d'informatique de Paris 6”
  – “most beautiful railway stations”
• Ranking is an integral part of search
• Huge number of app-specific ranking methods: paper/citation count, impact, salary, …
• Need for generic ranking of 1) entities and 2) facts

Extending Entities with Keywords

Remember: entities occur in facts & in documents
⇒ Associate entities with terms in those documents, keywords in URIs, literals, … (context of entity)

[Figure: keyword context of an entity: chancellor, Germany, scientist, election, Stuttgart21, Guido Westerwelle, France, Nicolas Sarkozy]

Extensions: Keywords

Problem: not everything is triplified!

• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQ-FT)

European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces?

Select ?p Where {
  ?p instanceOf Composer .
  ?p bornIn ?t .
  ?t inCountry ?c .
  ?c locatedIn Europe .
  ?p hasWon ?a .
  ?a Name AcademyAward .
  ?p contributedTo ?movie [western, gunfight, duel, sunset] .
  ?p composed ?music [classical, orchestra, cantata, opera] .
}

Semantics:
• triples match struct. pred.
• witnesses match text pred.

Extensions: Keywords

Problem: not everything is triplified!

• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQ-FT)

French politicians married to Italian singers?

Select ?p1, ?p2 Where {
  ?p1 instanceOf ?c1 [France, politics] .
  ?p2 instanceOf ?c2 [Italy, singer] .
  ?p1 marriedTo ?p2 .
}

CS researchers whose advisors worked on the Manhattan project?

Select ?r, ?a Where {
  ?r instOf researcher [“computer science“] .
  ?a workedOn ?x [“Manhattan project“] .
  ?r hasAdvisor ?a .
}

Select ?r, ?a Where {
  ?r ?p1 ?o1 [“computer science“] .
  ?a ?p2 ?o2 [“Manhattan project“] .
  ?r ?p3 ?a .
}

Proximity of keywords or phrases boosts expressiveness

Extensions: Keywords

Problem: not everything is triplified!

CLEF/INEX 2012-13 Linked Data Track

<topic id="2012374" category="Politics">
  <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue>
  <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title>
  <sparql_ft>
    SELECT ?s ?s1 WHERE {
      ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
      ?s1 <http://dbpedia.org/property/successor> ?s .
      FILTER FTContains (?s, "stepped down early") .
    }
  </sparql_ft>
</topic>

https://inex.mmci.uni-saarland.de/tracks/lod/

Extensions: Keywords / Multiple Languages

Problem: not everything is triplified!

Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13

<question id="4" answertype="resource" aggregation="false" onlydbo="true">
  <string lang="en">Which river does the Brooklyn Bridge cross?</string>
  <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
  <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
  <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
  <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string>
  <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
  <keywords lang="en">river, cross, Brooklyn Bridge</keywords>
  <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
  <keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
  <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
  <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
  <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
  <query>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX res: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?uri WHERE {
      res:Brooklyn_Bridge dbo:crosses ?uri .
    }
  </query>
</question>

http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

What Makes a Fact “Good”?

Confidence: Prefer results that are likely correct
• accuracy of info extraction
• trust in sources (authenticity, authority)
Example:
  bornIn (Jim Gray, San Francisco) from “Jim Gray was born in San Francisco” (en.wikipedia.org)
  livesIn (Michael Jackson, Tibet) from “Fans believe Jacko hides in Tibet” (www.michaeljacksonsightings.com)

Informativeness: Prefer results with salient facts
Statistical estimation from:
• frequency in answer
• frequency on Web
• frequency in query log
Example:
  q: Einstein isa ?          Einstein isa scientist; Einstein isa vegetarian
  q: ?x isa vegetarian       Einstein isa vegetarian; Whocares isa vegetarian

Conciseness: Prefer results that are tightly connected
• size of answer graph
• weight of Steiner tree
Example:
  Einstein won NobelPrize; Bohr won NobelPrize
  Einstein isa vegetarian; Cruise isa vegetarian
  Cruise born 1962; Bohr died 1962

Diversity: Prefer variety of facts
Example:
  E won … E discovered … E played …
  versus E won … E won … E won … E won …

How Can We Implement This?

Criteria:
• Confidence: Prefer results that are likely correct (accuracy of info extraction; trust in sources: authenticity, authority)
• Informativeness: Prefer results with salient facts (statistical estimation from frequency in answer, frequency on Web, frequency in query log)
• Conciseness: Prefer results that are tightly connected (size of answer graph, weight of Steiner tree)
• Diversity: Prefer variety of facts

Techniques:
• Empirical accuracy of Information Extraction and PageRank-style estimate of trust, combined into: max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }
• Statistical Language Models [Zhai et al., Elbassuoni et al.]
• Graph algorithms (BANKS, STAR, …) [S. Chakrabarti et al., G. Kasneci et al., …]
• PageRank-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …], or IR models: tf*idf … [K. Chang et al., …], Statistical Language Models [de Rijke et al.]
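The confidence formula max { accuracy(f,s) * trust(s) } over the witnesses of a fact can be written down directly. The sketch below is only an illustration: the accuracy and trust numbers are invented, and the function name confidence is an assumption, not part of any system named on the slide.

```python
def confidence(fact, witnesses, trust):
    """Best accuracy(f, s) * trust(s) over all sources s that witness the fact."""
    scores = (acc * trust[src] for src, acc in witnesses.get(fact, {}).items())
    return max(scores, default=0.0)


# Hypothetical extraction accuracies per (fact, source); values are made up.
WITNESSES = {
    "bornIn(Jim Gray, San Francisco)": {"en.wikipedia.org": 0.9},
    "livesIn(Michael Jackson, Tibet)": {"www.michaeljacksonsightings.com": 0.8},
}
# Hypothetical PageRank-style trust scores per source; values are made up.
TRUST = {"en.wikipedia.org": 0.9, "www.michaeljacksonsightings.com": 0.1}

ranked = sorted(WITNESSES, key=lambda f: confidence(f, WITNESSES, TRUST), reverse=True)
print(ranked)
```

Taking the maximum means a single reliable witness is enough to make a fact credible, while many unreliable witnesses do not add up.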

Outline for Part I

• Part I.1: Foundations
  – Introduction to RDF
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook

RDF in Rowstores

• Rowstore: general relational database, storing relations (incl. facts) as complete rows (MySQL, PostgreSQL, Oracle, DB2, SQLServer, …)
• General principles:
  – store triples in one giant three-attribute table (subject, predicate, object)
  – convert SPARQL to equivalent SQL
  – the database will do the rest
• Strings often mapped to unique integer IDs
• Used by many TripleStores, including 3Store, Jena, HexaStore, RDF-3X, …

Simple extension to quadruples (with graph id): (graph, subject, predicate, object)

We consider only triples for simplicity!

Example: Single Triple Table

ex:Katja  ex:teaches   ex:Databases;
          ex:works_for ex:MPI_Informatics;
          ex:PhD_from  ex:TU_Ilmenau.

ex:Martin ex:teaches   ex:Databases;
          ex:works_for ex:MPI_Informatics;
          ex:PhD_from  ex:Saarland_University.

ex:Ralf   ex:teaches   ex:Information_Retrieval;
          ex:PhD_from  ex:Saarland_University;
          ex:works_for ex:Saarland_University, ex:MPI_Informatics.

subject    predicate     object
ex:Katja   ex:teaches    ex:Databases
ex:Katja   ex:works_for  ex:MPI_Informatics
ex:Katja   ex:PhD_from   ex:TU_Ilmenau
ex:Martin  ex:teaches    ex:Databases
ex:Martin  ex:works_for  ex:MPI_Informatics
ex:Martin  ex:PhD_from   ex:Saarland_University
ex:Ralf    ex:teaches    ex:Information_Retrieval
ex:Ralf    ex:PhD_from   ex:Saarland_University
ex:Ralf    ex:works_for  ex:Saarland_University
ex:Ralf    ex:works_for  ex:MPI_Informatics

Conversion of SPARQL to SQL

General approach to translate SPARQL into SQL:

(1) Each triple pattern is translated into a (self-) JOIN over the triple table
(2) Shared variables create JOIN conditions
(3) Constants create WHERE conditions
(4) FILTER conditions create WHERE conditions
(5) OPTIONAL clauses create OUTER JOINS
(6) UNION clauses create UNION expressions
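Rules (1) to (3) are mechanical enough to sketch in Python. The translator below is a minimal illustration under stated assumptions: it handles only basic graph patterns (no FILTER, OPTIONAL or UNION), the function name bgp_to_sql is hypothetical, and SQLite stands in for the rowstore. The data is the single triple table from the example above.

```python
import sqlite3

COLUMNS = ("subject", "predicate", "object")


def bgp_to_sql(patterns, select_vars):
    """Translate triple patterns into one self-join over Triples (rules 1-3)."""
    first_seen = {}                 # variable -> first column reference bound to it
    conditions, params = [], []
    for i, pattern in enumerate(patterns, start=1):
        for col, term in zip(COLUMNS, pattern):
            ref = f"P{i}.{col}"
            if not term.startswith("?"):
                conditions.append(f"{ref} = ?")                    # rule (3): constant
                params.append(term)
            elif term in first_seen:
                conditions.append(f"{ref} = {first_seen[term]}")   # rule (2): shared variable
            else:
                first_seen[term] = ref
    tables = ", ".join(f"Triples P{i}" for i in range(1, len(patterns) + 1))  # rule (1)
    cols = ", ".join(f"{first_seen[v]} AS {v[1:]}" for v in select_vars)
    where = " AND ".join(conditions) or "1=1"
    return f"SELECT {cols} FROM {tables} WHERE {where}", params


DATA = [
    ("ex:Katja", "ex:teaches", "ex:Databases"),
    ("ex:Katja", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Katja", "ex:PhD_from", "ex:TU_Ilmenau"),
    ("ex:Martin", "ex:teaches", "ex:Databases"),
    ("ex:Martin", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Martin", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:teaches", "ex:Information_Retrieval"),
    ("ex:Ralf", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:MPI_Informatics"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO Triples VALUES (?, ?, ?)", DATA)

sql, params = bgp_to_sql(
    [("?a", "ex:works_for", "?u"), ("?b", "ex:works_for", "?u"), ("?a", "ex:PhD_from", "?u")],
    ["?a", "?b"],
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Constants are passed as bound parameters rather than spliced into the SQL string, which keeps the generated statement safe for arbitrary IRIs and literals.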

Example: Conversion to SQL Query

SELECT ?a ?b ?t WHERE {
  ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u .
  OPTIONAL { ?a teaches ?t }
  FILTER (regex(?u, "Saar"))
}

Step 1: one instance of the triple table per triple pattern

  SELECT
  FROM Triples P1, Triples P2, Triples P3

Step 2: constants create WHERE conditions

  WHERE P1.predicate='works_for' AND P2.predicate='works_for' AND P3.predicate='phd_from'

Step 3: shared variables create JOIN conditions

  AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object

Step 4: the FILTER creates a WHERE condition

  AND REGEXP_LIKE(P1.object, 'Saar')

Step 5: projection

  SELECT P1.subject AS A, P2.subject AS B

Step 6: the OPTIONAL clause creates a LEFT OUTER JOIN

  SELECT R1.A, R1.B, R2.T
  FROM ( SELECT P1.subject AS A, P2.subject AS B
         FROM Triples P1, Triples P2, Triples P3
         WHERE P1.predicate='works_for' AND P2.predicate='works_for' AND P3.predicate='phd_from'
           AND P1.object=P2.object AND P1.subject=P3.subject AND P1.object=P3.object
           AND REGEXP_LIKE(P1.object, 'Saar') ) R1
  LEFT OUTER JOIN
       ( SELECT P4.subject AS A, P4.object AS T
         FROM Triples P4
         WHERE P4.predicate='teaches' ) R2
  ON (R1.A=R2.A)

[Figure: operator tree over the triple-table instances P1 to P4, with joins on ?u, on ?a,?u and on ?a, the filter regex(?u, "Saar") and a final projection.]

Is that all?

Well, no.
• Which indexes should be built? (to support efficient evaluation of triple patterns)
• How can we reduce storage space?
• How can we find the best execution plan?

Existing databases need modifications:
• flexible, extensible, generic storage not needed here
• cannot deal with multiple self-joins of a single table
• often generate bad execution plans

Dictionary for Strings

Map all strings to unique integers (e.g., via hashing)
• Regular size (4-8 bytes), much easier to handle
• Dictionary usually small, can be kept in main memory

<http://example.de/Katja>  → 194760
<http://example.de/Martin> → 679375
<http://example.de/Ralf>   → 4634

This may break original lexicographic sorting order
⇒ RANGE conditions (not in SPARQL) are difficult!
⇒ FILTER conditions may be more expensive!
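A dictionary of this kind fits in a few lines. The sketch below assigns sequential ids in insertion order, which is an assumption made for simplicity (the slide mentions hashing as one option), and the class name Dictionary is hypothetical. It also shows why the integer order no longer follows the lexicographic order of the strings.

```python
class Dictionary:
    """Bidirectional mapping between strings and compact integer ids."""

    def __init__(self):
        self.ids = {}        # string -> id
        self.strings = []    # id -> string

    def encode(self, s):
        if s not in self.ids:
            self.ids[s] = len(self.strings)   # next free id (assumed scheme)
            self.strings.append(s)
        return self.ids[s]

    def decode(self, i):
        return self.strings[i]


d = Dictionary()
names = ["<http://example.de/Ralf>", "<http://example.de/Katja>", "<http://example.de/Martin>"]
encoded = [d.encode(n) for n in names]

by_id = [d.decode(i) for i in sorted(encoded)]   # order seen by the storage layer
by_name = sorted(names)                          # lexicographic order of the strings
print(by_id)
print(by_name)
```

Because the two orders differ, a range or regex FILTER generally has to decode ids back to strings before it can be evaluated.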

Indexes for Commonly Used Triple Patterns

Patterns with a single variable are frequent
Example: Albert_Einstein invented ?x
⇒ Build clustered index over (s,p,o)

All triples in (s,p,o) order, with a B+ tree for easy access:
(16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) …

1. Look up ids for constants: Albert_Einstein → 16, invented → 24
2. Look up known prefix in index: (16,24,0)
3. Read results while prefix matches: (16,24,567), (16,24,876); results come already sorted!

Can also be used for patterns like Albert_Einstein ?p ?x

Build similar clustered indexes for all six permutations (3 x 2 x 1 = 6)
• SPO, POS, OSP to cover all possible triple patterns
• SOP, OPS, PSO to have all sort orders for patterns with two variables

Triple table no longer needed, all triples in each index
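The prefix lookup in steps 1 to 3 can be imitated with sorted Python lists and binary search. This is a sketch only: a sorted list stands in for the clustered B+ tree, and the helper names build_indexes and prefix_scan are hypothetical.

```python
from bisect import bisect_left
from itertools import permutations

TRIPLES = [(16, 19, 5356), (16, 24, 567), (16, 24, 876),
           (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]


def build_indexes(triples):
    """One sorted copy of the triples per permutation of (S, P, O)."""
    indexes = {}
    for order in permutations(range(3)):
        name = "".join("SPO"[i] for i in order)
        indexes[name] = sorted(tuple(t[i] for i in order) for t in triples)
    return indexes


def prefix_scan(index, prefix):
    """Return all entries that start with `prefix`, in index order."""
    start = bisect_left(index, prefix)   # a shorter tuple sorts before its extensions
    result = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break                        # left the matching range
        result.append(entry)
    return result


INDEXES = build_indexes(TRIPLES)
einstein_invented = prefix_scan(INDEXES["SPO"], (16, 24))   # Albert_Einstein invented ?x
einstein_any = prefix_scan(INDEXES["SPO"], (16,))           # Albert_Einstein ?p ?x
predicate_19 = prefix_scan(INDEXES["POS"], (19,))           # ?s 19 ?o, sorted by object
print(einstein_invented, einstein_any, predicate_19)
```

Every pattern with one or two constants maps to a prefix of some permutation, which is why six orderings cover all cases and deliver results already sorted.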

Why Sort Order Matters for Joins

When inputs are sorted by the join attribute, use Merge Join (MJ):
• sequentially scan both inputs
• immediately join matching triples
• skip over parts without matches
• allows pipelining

  Input 1: (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456)
  Input 2: (16,33,46578) (16,56,1345) (24,16,1353) (27,18,133) (47,37,20495) (50,134,1056)

When inputs are unsorted or sorted by the wrong attribute, use Hash Join (HJ):
• build hash table from one input
• scan other input, probe hash table
• needs to touch every input triple
• breaks pipelining

  Input 1: (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456)
  Input 2: (27,18,133) (50,134,1056) (16,56,1345) (24,16,1353) (47,37,20495) (16,33,46578)

In general, Merge Joins are preferable: small memory footprint, pipelining
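The two join strategies can be compared on the inputs shown above. The sketch joins on the first component of each triple; it materializes lists for clarity, so it does not demonstrate pipelining, and the function names are chosen for illustration.

```python
LEFT = [(16, 19, 5356), (16, 24, 567), (16, 24, 876),
        (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]
RIGHT_SORTED = [(16, 33, 46578), (16, 56, 1345), (24, 16, 1353),
                (27, 18, 133), (47, 37, 20495), (50, 134, 1056)]
RIGHT_UNSORTED = [(27, 18, 133), (50, 134, 1056), (16, 56, 1345),
                  (24, 16, 1353), (47, 37, 20495), (16, 33, 46578)]


def merge_join(left, right):
    """Join on the first component; both inputs must be sorted on it."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1                          # skip left entries without a partner
        elif lk > rk:
            j += 1                          # skip right entries without a partner
        else:
            j_end = j                       # find the run of equal keys on the right
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def hash_join(left, right):
    """Join on the first component; no sort order required."""
    table = {}
    for l in left:                          # build phase touches every left triple
        table.setdefault(l[0], []).append(l)
    return [(l, r) for r in right for l in table.get(r[0], [])]   # probe phase


mj = merge_join(LEFT, RIGHT_SORTED)
hj = hash_join(LEFT, RIGHT_UNSORTED)
print(len(mj), len(hj))
```

The merge join needs only two cursors and no auxiliary structure, whereas the hash join must hold one entire input in its hash table before producing the first result.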

RDF-3X: Even More Indexes!

SPARQL 1.0 considers duplicates (unless removed with DISTINCT) but does not (yet) support aggregates/counting
⇒ often queries with many duplicates like

  SELECT ?x WHERE { ?x ?y Germany . }

to retrieve entities related to Germany (but counts may be important in the application!)
⇒ this materializes many identical intermediate results

Solution: even more redundancy!
• Pre-compute aggregated indexes SP, SO, PO, PS, OP, OS, S, P, O
  Example: SO contains, for each pair (s,o), the number of triples with subject s and object o
• Do not materialize identical bindings, but keep counts
  Example: ?x=Albert_Einstein:4; ?x=Angela_Merkel:10
• 15 indexes overall (all SPO permutations + their unique subsets)
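An aggregated SO index amounts to a counter over (subject, object) pairs. The sketch below uses Python's collections.Counter on a handful of made-up triples (the facts and the function name related_to are assumptions for illustration) to answer the "entities related to Germany" query with counts instead of duplicate rows.

```python
from collections import Counter

# Made-up example triples; only their shape matters here.
TRIPLES = [
    ("Albert_Einstein", "bornIn", "Germany"),
    ("Albert_Einstein", "citizenOf", "Germany"),
    ("Angela_Merkel", "bornIn", "Germany"),
    ("Angela_Merkel", "citizenOf", "Germany"),
    ("Angela_Merkel", "chancellorOf", "Germany"),
    ("Jim_Carrey", "bornIn", "Canada"),
]

# Aggregated index SO: number of triples per (subject, object) pair.
SO = Counter((s, o) for s, _p, o in TRIPLES)


def related_to(so_index, obj):
    """Bindings for ?x in '?x ?y obj', each with its multiplicity."""
    return {s: n for (s, o), n in so_index.items() if o == obj}


germany = related_to(SO, "Germany")
print(germany)
```

One entry per binding with a count carries the same information as the duplicated rows, so later operators can multiply counts instead of processing identical tuples.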

RDF-3X: Compression Scheme for Triples

• Compress sequences of triples in lexicographic order (v1;v2;v3); for SPO: v1=S, v2=P, v3=O
• Step 1: compute per-attribute deltas

  (16,19,5356)   →  (16,19,5356)
  (16,24,567)    →  (0,5,-4789)
  (16,24,676)    →  (0,0,109)
  (27,19,643)    →  (11,-5,-33)
  (27,48,10486)  →  (0,29,9843)
  (50,10,10456)  →  (23,-38,-30)

• Step 2: variable-byte encoding for each delta triple (1-13 bytes)

  gap bit | header (7 bits) | delta of value 1 (0-4 bytes) | delta of value 2 (0-4 bytes) | delta of value 3 (0-4 bytes)

  – When gap=1, the delta of value 3 is included in the header; all others are 0
  – Otherwise, the header contains the length of the encoding for each of the three deltas (5*5*5=125 combinations stored in 7 bits)

Many variants exist; this one is designed for triples…
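The two steps can be checked with a short calculation. The sketch below computes the per-attribute deltas for the triples on this slide and estimates the encoded size per triple. It is a simplified size model under assumptions, not RDF-3X's actual byte layout: signs are ignored when measuring length, and the gap case is taken to apply when the first two deltas are 0 and the third fits in 7 bits.

```python
TRIPLES = [(16, 19, 5356), (16, 24, 567), (16, 24, 676),
           (27, 19, 643), (27, 48, 10486), (50, 10, 10456)]


def deltas(triples):
    """Step 1: difference of each attribute to the previous triple."""
    prev, out = (0, 0, 0), []
    for t in triples:
        out.append(tuple(c - p for c, p in zip(t, prev)))
        prev = t
    return out


def byte_len(value):
    """Bytes needed for |value| (0 bytes for 0); the sign is ignored here."""
    return (abs(value).bit_length() + 7) // 8


def encoded_size(delta):
    """Step 2 (size only): 1 header byte plus the bytes of the three deltas."""
    d1, d2, d3 = delta
    if d1 == 0 and d2 == 0 and 0 < d3 < 128:
        return 1                      # gap bit set: small delta lives in the header
    return 1 + byte_len(d1) + byte_len(d2) + byte_len(d3)


DELTAS = deltas(TRIPLES)
SIZES = [encoded_size(d) for d in DELTAS]
RAW_BYTES = len(TRIPLES) * 3 * 4      # three 4-byte integers per triple
print(DELTAS)
print(SIZES, sum(SIZES), RAW_BYTES)
```

Under this model the six triples shrink from 72 raw bytes to a small fraction of that, because most deltas fit in one or two bytes.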

Compression Effectiveness vs. Efficiency

• Byte-level encoding almost as effective as bit-level encoding techniques (Gamma, Golomb, Rice, etc.)
• Much faster (10x) for decompressing
• Example for Barton dataset [Neumann & Weikum: VLDB’10]:
  – Raw data: 51 million triples, 7GB uncompressed (as N-Triples)
  – All 6 main indexes: 1.1GB size, 3.2s decompression with byte-level encoding
• Optionally: additional compression with LZ77, 2x more compact but much slower to decompress
• Compression always on page level

Back to the Example Query

SELECT ?a ?b ?t WHERE {
  ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u .
  OPTIONAL { ?a teaches ?t }
  FILTER (regex(?u, "Saar"))
}

Which of the two plans is better? How many intermediate results?

[Figure: two alternative execution plans for the query, annotated with intermediate result sizes.
 Plan 1: index scans POS(works_for,?u,?a), POS(phd_from,?u,?a), PSO(works_for,?u,?b) and POS(teaches,?a,?t), combined by two merge joins (MJ), the filter regex(?u, "Saar") and a projection; join variables ?u,?a, ?u and ?a; annotated sizes: 1000, 1000, 100, 50, 5, 250.
 Plan 2: index scans POS(works_for,?u,?a), POS(works_for,?u,?b), PSO(phd_from,?a,?u) and POS(teaches,?a,?t), combined by a merge join (MJ) and a hash join (HJ), the filter regex(?u, "Saar") and a projection; join variables ?u, ?a,?u and ?a; annotated sizes: 1000, 1000, 100, 50, 2500, 250, 250, 250.]

Core ingredients of a good query optimizer are selectivity estimators for triple patterns (index scans) and joins

RDF-3X: Selectivity Estimation

How many results will a triple pattern have?

Standard databases:
• Per-attribute histograms
• Assume independence of attributes
⇒ too simplistic and inexact

RDF-3X:
• Use aggregated indexes for exact count
• Additional join statistics for triple blocks (pages):
  … (16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) …
• Assume independence between triple patterns; additionally precompute exact statistics for frequent paths in the data

Handling Updates

What should we do when our data changes? (SPARQL 1.1 has updates!)

Assumptions:
• Queries far more frequent than updates
• Updates mostly insertions, hardly any deletions
• Different applications may update concurrently

Solution: Differential Indexing

RDF-3X: Differential Updates

Staging architecture for updates in RDF-3X

[Figure: Workspace A holds the triples inserted by application A, Workspace B the triples inserted by application B. Workspaces are kept in main memory and get on-demand indexes at query time; on completion of A and on completion of B their contents move on to the next stage.]

Deletions:
• Insert the same tuple again with “deleted” flag
• Modify scan/join operators: merge differential indexes with main index
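The idea of merging a differential index with the main index at scan time can be sketched as follows. This is a strongly simplified model with a single in-memory workspace; the class name DifferentialStore and its methods are hypothetical and do not mirror RDF-3X's staging architecture.

```python
class DifferentialStore:
    """Main index plus a small differential index of pending changes."""

    def __init__(self, triples):
        self.main = sorted(triples)   # large, read-optimized index
        self.delta = {}               # triple -> True (inserted) / False ("deleted" flag)

    def insert(self, triple):
        self.delta[triple] = True

    def delete(self, triple):
        self.delta[triple] = False    # re-insert the tuple with a "deleted" flag

    def scan(self):
        """Scan operator that merges the differential index with the main index."""
        live = {t for t in self.main if self.delta.get(t, True)}
        live |= {t for t, alive in self.delta.items() if alive}
        return sorted(live)

    def merge(self):
        """Fold all pending changes into the main index (done occasionally)."""
        self.main = self.scan()
        self.delta.clear()


store = DifferentialStore([(16, 19, 5356), (16, 24, 567), (27, 19, 643)])
store.insert((16, 24, 876))
store.delete((27, 19, 643))
before_merge = store.scan()
store.merge()
print(before_merge, store.main)
```

Queries always see a consistent merged view, while the expensive rewrite of the main index is deferred and batched.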

Outline for Part I

• Part I.1: Foundations
  – Introduction to RDF
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook

Principles

Observations and assumptions:
• Not too many different predicates
• Triple patterns usually have fixed predicate
• Need to access all triples with one predicate

Design consequence:
• Use one two-attribute table for each predicate

Page 50: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Example: Columnstores

ex:Katja ex:teaches ex:Databases;
    ex:works_for ex:MPI_Informatics;
    ex:PhD_from ex:TU_Ilmenau.

ex:Martin ex:teaches ex:Databases;
    ex:works_for ex:MPI_Informatics;
    ex:PhD_from ex:Saarland_University.

ex:Ralf ex:teaches ex:Information_Retrieval;
    ex:PhD_from ex:Saarland_University;
    ex:works_for ex:Saarland_University, ex:MPI_Informatics.

PhD_from
subject | object
ex:Katja | ex:TU_Ilmenau
ex:Martin | ex:Saarland_University
ex:Ralf | ex:Saarland_University

works_for
subject | object
ex:Katja | ex:MPI_Informatics
ex:Martin | ex:MPI_Informatics
ex:Ralf | ex:Saarland_University
ex:Ralf | ex:MPI_Informatics

teaches
subject | object
ex:Katja | ex:Databases
ex:Martin | ex:Databases
ex:Ralf | ex:Information_Retrieval
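The split into one two-attribute table per predicate can be sketched in a few lines. The function name `vertical_partition` and the in-memory dictionary layout are illustrative assumptions, not part of any system discussed here.

```python
from collections import defaultdict

TRIPLES = [
    ("ex:Katja", "ex:teaches", "ex:Databases"),
    ("ex:Katja", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Katja", "ex:PhD_from", "ex:TU_Ilmenau"),
    ("ex:Martin", "ex:teaches", "ex:Databases"),
    ("ex:Martin", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Martin", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:teaches", "ex:Information_Retrieval"),
    ("ex:Ralf", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:MPI_Informatics"),
]


def vertical_partition(triples):
    """Group triples by predicate into (subject, object) tables."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


tables = vertical_partition(TRIPLES)
for predicate, rows in tables.items():
    print(predicate, rows)
```

A triple pattern with a fixed predicate then touches exactly one table, which is the access pattern the design assumes.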

Page 51: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Simplified Example: Query Conversion

SELECT ?a ?b WHERE {
  ?a works_for ?u .
  ?b works_for ?u .
  ?a phd_from ?u .
}

SELECT W1.subject AS A, W2.subject AS B
FROM works_for W1, works_for W2, phd_from P3
WHERE W1.object = W2.object
  AND W1.subject = P3.subject
  AND W1.object = P3.object
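To see the converted query at work, the following sketch loads the example data into an in-memory SQLite database and runs the SQL from this slide. SQLite is a row store, so this only demonstrates the query conversion, not columnstore behaviour; the table contents are taken from the running example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE works_for (subject TEXT, object TEXT);
CREATE TABLE phd_from  (subject TEXT, object TEXT);
""")
conn.executemany("INSERT INTO works_for VALUES (?, ?)", [
    ("ex:Katja", "ex:MPI_Informatics"),
    ("ex:Martin", "ex:MPI_Informatics"),
    ("ex:Ralf", "ex:Saarland_University"),
    ("ex:Ralf", "ex:MPI_Informatics"),
])
conn.executemany("INSERT INTO phd_from VALUES (?, ?)", [
    ("ex:Katja", "ex:TU_Ilmenau"),
    ("ex:Martin", "ex:Saarland_University"),
    ("ex:Ralf", "ex:Saarland_University"),
])

# One self-join of works_for plus a join with phd_from, as in the converted query.
rows = conn.execute("""
SELECT W1.subject AS A, W2.subject AS B
FROM works_for W1, works_for W2, phd_from P3
WHERE W1.object = W2.object
  AND W1.subject = P3.subject
  AND W1.object = P3.object
""").fetchall()
print(rows)
```

On this data only ex:Ralf works at the university that awarded his PhD, and he is also the only person working there, so the single answer pairs him with himself.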

So far, this is yet another relational representation of RDF. So, what is a columnstore?

Page 52: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Columnstores and RDF

Columnstores store all columns of a table separately.

PhD_from
subject | object
ex:Katja | ex:TU_Ilmenau
ex:Martin | ex:Saarland_University
ex:Ralf | ex:Saarland_University

PhD_from:subject
ex:Katja
ex:Martin
ex:Ralf

PhD_from:object
ex:TU_Ilmenau
ex:Saarland_University
ex:Saarland_University

Advantages:
• Fast if only subject or object are accessed, not both
• Allows for a very compact representation

Problems:
• Need to recombine columns if subject and object are accessed
• Inefficient for triple patterns with predicate variable

Page 53: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Compression in Columnstores

General ideas:
• Store subject only once
• Use same order of subjects for all columns, including NULL values when necessary
• Additional compression to get rid of NULL values

subject | PhD_from | works_for | teaches
ex:Katja | ex:TU_Ilmenau | ex:MPI_Informatics | ex:Databases
ex:Martin | ex:Saarland_University | ex:MPI_Informatics | ex:Databases
ex:Ralf | ex:Saarland_University | ex:Saarland_University | ex:Information_Retrieval
ex:Ralf | NULL | ex:MPI_Informatics | NULL

Compressed columns:
PhD_from: bit[1110] ex:TU_Ilmenau, ex:Saarland_University, ex:Saarland_University
teaches: range[1-3] ex:Databases, ex:Databases, ex:Information_Retrieval
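The two NULL-suppression encodings shown above (a presence bit vector and a position range) can be sketched as follows. The helper names `compress`, `decompress`, and `as_range` are hypothetical and do not correspond to any particular columnstore's API.

```python
def compress(column):
    """Split a column into a presence bitmap and the dense non-NULL values."""
    bitmap = [value is not None for value in column]
    values = [value for value in column if value is not None]
    return bitmap, values


def decompress(bitmap, values):
    """Rebuild the original column, re-inserting None where the bit is unset."""
    it = iter(values)
    return [next(it) if present else None for present in bitmap]


def as_range(bitmap):
    """Return the 1-based (first, last) positions if the set bits are contiguous."""
    positions = [i + 1 for i, bit in enumerate(bitmap) if bit]
    if positions and positions[-1] - positions[0] + 1 == len(positions):
        return positions[0], positions[-1]
    return None


phd_from = ["ex:TU_Ilmenau", "ex:Saarland_University", "ex:Saarland_University", None]
bitmap, values = compress(phd_from)
print("".join("1" if b else "0" for b in bitmap), values, as_range(bitmap))
```

When the non-NULL positions form one contiguous run, the range form stores two integers instead of one bit per row.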

Page 54: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Outline for Part I
• Part I.1: Foundations
– Introduction to RDF
– A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook

Page 55: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Property Tables

Group entities with similar predicates into a relational table (for example using RDF types or a clustering algorithm).

ex:Katja ex:teaches ex:Databases;
    ex:works_for ex:MPI_Informatics;
    ex:PhD_from ex:TU_Ilmenau.

ex:Martin ex:teaches ex:Databases;
    ex:works_for ex:MPI_Informatics;
    ex:PhD_from ex:Saarland_University.

ex:Ralf ex:teaches ex:Information_Retrieval;
    ex:PhD_from ex:Saarland_University;
    ex:works_for ex:Saarland_University, ex:MPI_Informatics.

Property table:
subject | teaches | PhD_from
ex:Katja | ex:Databases | ex:TU_Ilmenau
ex:Martin | ex:Databases | ex:Saarland_University
ex:Ralf | ex:IR | ex:Saarland_University
ex:Axel | NULL | ex:TU_Vienna

“Leftover triples”:
subject | predicate | object
ex:Katja | ex:works_for | ex:MPI_Informatics
ex:Martin | ex:works_for | ex:MPI_Informatics
ex:Ralf | ex:works_for | ex:Saarland_University
ex:Ralf | ex:works_for | ex:MPI_Informatics
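A minimal sketch of how such a split might be computed is given below. The rule used here, keeping a predicate as a column only if it is single-valued for every subject, is an assumption chosen to reproduce the example; real systems may instead rely on RDF types or clustering, as noted above. The function name `build_property_table` is hypothetical.

```python
TRIPLES = [
    ("ex:Katja", "ex:teaches", "ex:Databases"),
    ("ex:Katja", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Katja", "ex:PhD_from", "ex:TU_Ilmenau"),
    ("ex:Ralf", "ex:teaches", "ex:Information_Retrieval"),
    ("ex:Ralf", "ex:PhD_from", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:Saarland_University"),
    ("ex:Ralf", "ex:works_for", "ex:MPI_Informatics"),
    ("ex:Axel", "ex:PhD_from", "ex:TU_Vienna"),
]


def build_property_table(triples, columns):
    """Return (property table, leftover triples) for the candidate columns."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {}).setdefault(p, []).append(o)
    # Assumed rule: multi-valued predicates cannot become a column.
    single = [p for p in columns
              if all(len(preds.get(p, [])) <= 1 for preds in by_subject.values())]
    table = {s: {p: preds.get(p, [None])[0] for p in single}
             for s, preds in by_subject.items()}
    leftovers = [(s, p, o) for s, p, o in triples if p not in single]
    return table, leftovers


table, leftovers = build_property_table(
    TRIPLES, ["ex:teaches", "ex:PhD_from", "ex:works_for"])
print(table["ex:Axel"])
print(leftovers)
```

The `None` entry for ex:Axel corresponds to the NULL in the property table, and all ex:works_for triples end up as leftovers because ex:Ralf has two values for that predicate.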

Page 56: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Property Tables: Pros and Cons

Advantages:
• More in the spirit of existing relational systems
• Saves many self-joins over triple tables etc.

Disadvantages:
• Potentially many NULL values
• Multi-value attributes problematic
• Query mapping depends on schema
• Schema changes very expensive

Page 57: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Even More Systems…
• Store RDF data as sparse matrix with bit-vector compression [BitMat, Hendler et al.: ISWC’09]
• Convert RDF into XML and use XML methods (XPath, XQuery, …)
• Store RDF data in graph databases and perform bi-simulation [Fletcher et al.: ESWC’12] or employ specialized graph index structures [gStore, Zou et al.: PVLDB’11]
• And many more …

See our list of readings.

Page 58: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Which Technique is Best?
• Performance depends a lot on precomputation, optimization, implementation, fine-tuning …
• Comparative results on BTC 2008 (from [Neumann & Weikum, 2009]):

[Figure: charts comparing RDF-3X, RDF-3X (2008), COLSTORE, and ROWSTORE]

Page 59: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Challenges and Opportunities

• SPARQL with different entailment regimes
• New SPARQL 1.1 features (grouping, aggregation, updates)
• User-oriented ranking of query results
– Efficient top-k operators
– Effective scoring methods for structured queries
• What are the limits of a centralized RDF engine?
• Dealing with uncertain RDF data: what is the most likely query answer?
– Triples with probabilities → probabilistic databases

Page 60: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Outline of this Tutorial

• Part I–RDF in Centralized Relational Databases

• Part II–RDF in Distributed Settings

• Part III–Managing Uncertain RDF Data

Page 61: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Outline for Part II
• Part II.1: Search Engines for the Semantic Web
• Part II.2: Mediator-based and Federated Architectures

Page 62: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Semantic Web Search Engines

• Querying RDF data collections started by adapting existing search engines to RDF data.
– Crawling for .rdf files, and HTML documents with embedded RDF content (see: RDFa microformat).
– Indexing & search based on keywords extracted from entity and property names.
– Usually generate a virtual document for an entity (string literals and human-readable names).

• Swoogle [Ding et al., CIKM’04] (University of Maryland)

• Falcons [Cheng et al., WWW’08] (Nanjing University)

Page 65: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Outline for Part II
• Part II.1: Search Engines for the Semantic Web
• Part II.2: Mediator-based and Federated Architectures

Page 66: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Classification of Distributed Approaches

Approaches for querying distributed and potentially heterogeneous (RDF) data sources:
• Materialization-based approaches (data-warehousing)
• Virtually materialized approaches

Architectures: Peer-2-Peer; federated systems; mediator-based systems; MapReduce/Hadoop; shared-memory architectures (Message-Passing, RMI, etc.); shared-nothing architectures

Example systems: Shard, Jena-HBase, [Abadi et al. PVLDB’11]; Trinity (MSR); DARQ, FedX, YARS2; Gridvine, RDFPeers; Partout, 4Store, Eagre

Page 67: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

How to Integrate Data Sources?

• Ship and integrate data from different sources to the client.

• Three common approaches:
– Query-driven (single mediator)
– Database federations (exported schemas)
– Warehousing (fully integrated & centrally managed)

[Figure: RDF Source, RDF Source, ?]

Page 68: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Query-Driven Approach

[Figure: SPARQL clients send queries to a mediator, which forwards them through wrappers to three RDF sources and passes the results back.]

List of SPARQL endpoints: http://www.w3.org/wiki/SparqlEndpoints
DBpedia: http://dbpedia.org/sparql
YAGO: https://d5gate.ag5.mpi-sb.mpg.de/webyagospo/Browser

Page 69: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Advantages of Query-Driven Integration

• No need to copy data
– no or little own storage costs
– no need to purchase data
• Potentially more up-to-date data
• Mediator holds catalog (statistics, etc.) and may optimize queries
• Only generic query interface needed at sources (SPARQL endpoints)
• May be less draining on sources
• Sources often even unaware of participation
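The mediator/wrapper split can be illustrated with a small in-memory sketch. The classes `Wrapper` and `Mediator`, the predicate-based catalog, and the triple-pattern interface are all hypothetical simplifications; a real mediator would talk to remote SPARQL endpoints and fill its catalog from service descriptions or statistics rather than from the data itself.

```python
class Wrapper:
    """Hypothetical wrapper that exposes one source through a generic match()."""

    def __init__(self, name, triples):
        self.name = name
        self._triples = list(triples)

    def predicates(self):
        return {p for _, p, _ in self._triples}

    def match(self, s=None, p=None, o=None):
        # None acts as a variable; anything else must match exactly.
        return [t for t in self._triples
                if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]


class Mediator:
    """Holds a catalog (predicate -> sources) and forwards patterns to wrappers."""

    def __init__(self, wrappers):
        self.wrappers = {w.name: w for w in wrappers}
        self.catalog = {}
        for w in wrappers:
            for p in w.predicates():
                self.catalog.setdefault(p, set()).add(w.name)

    def sources_for(self, p):
        if p is None:
            return sorted(self.wrappers)
        return sorted(self.catalog.get(p, ()))

    def query(self, s=None, p=None, o=None):
        results = []
        for name in self.sources_for(p):
            results.extend(self.wrappers[name].match(s, p, o))
        return results


a = Wrapper("source_a", [("ex:Galway", "ex:locatedIn", "ex:Ireland")])
b = Wrapper("source_b", [("ex:Galway", "ex:type", "ex:City"),
                         ("ex:Galway", "ex:locatedIn", "ex:Ireland")])
mediator = Mediator([a, b])
print(mediator.sources_for("ex:type"), mediator.query(p="ex:locatedIn"))
```

The catalog lets the mediator skip sources that cannot contribute to a pattern, which is the simplest form of the query optimization mentioned above.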

Page 70: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Federation-based Approach

[Figure: SPARQL clients send queries to, and receive results from, a federated schema. Each RDF source (Source 1, Source 2, …, Source n) maps its local schema to an exported schema, and the exported schemas feed the federated schema.]

Page 71: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Advantages of Federation-Based Integration

• Very similar to query-driven integration, except
– that the sources know that they are part of a federation;
– and they export their local schemas into a federated schema.

• Intermediate step toward full integration of the data in a single “warehouse”.

Page 72: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Warehousing Architecture

[Figure: SPARQL clients run query & analysis against a warehouse with metadata; an integration layer loads the warehouse from three RDF sources.]

Integrated LOD index: http://lod2.openlinksw.com/sparql

Page 73: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Advantages of Warehousing

• Perform Extract-Transform-Load (ETL) processes with periodic updates over the sources
• High query performance
• Local processing at sources unaffected
• Can operate even when sources are offline
• Can query data that is no longer stored at sources
• More detailed statistics and metadata available at warehouse
– Modify, summarize (store aggregates), analyse
– Add historical information, provenance, timestamps, etc.

Page 74: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Classification of Distributed Approaches

Approaches for querying distributed and potentially heterogeneous (RDF) data sources:
• Materialization-based approaches (data-warehousing)
• Virtually materialized approaches

Architectures: Peer-2-Peer; federated systems; mediator-based systems; MapReduce/Hadoop; shared-memory architectures (Message-Passing, RMI, etc.); shared-nothing architectures

Example systems: Shard, Jena-HBase, [Abadi et al. PVLDB’11]; Trinity (MSR); DARQ, FedX, YARS2; Gridvine, RDFPeers; Partout, 4Store, Eagre

Page 75: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

DARQ [Leser et al., Humboldt University Berlin, ISWC’08]

• Classical mediator-based architecture connecting a given SPARQL endpoint to other endpoints via a combination of wrappers and service descriptions.

• Service descriptions
– RDF data descriptions
– Statistical information
– Binding constraints

• Query optimizer based on rewriting rules and cost estimations for physical join operators.

Page 76: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

FedX [fluid Operations & MPI-INF: ISWC’11]

• Online query optimization over federations of SPARQL endpoints.
• Cost estimates based on result sizes of SPARQL ASK queries.
• “Bound nested-loop joins” by grouping sets of variable bindings into SPARQL UNION queries (instead of using FILTER conditions):

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
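The idea of grouping a block of variable bindings into one UNION query can be sketched as string construction. This is an illustration only: the helper `bound_join_query`, the variable-renaming scheme with numeric suffixes, and the sample binding values are assumptions, not the system's actual query rewriter.

```python
def bound_join_query(subject_var, predicate, bindings):
    """Build one UNION query that evaluates a triple pattern for a block of bindings.

    Each binding gets its own renamed subject variable, so the caller can tell
    which input binding produced which result column.
    """
    if not bindings:
        raise ValueError("need at least one binding")
    variables = [f"?{subject_var}_{i}" for i in range(1, len(bindings) + 1)]
    branches = [f"{{ {var} {predicate} {value} }}"
                for var, value in zip(variables, bindings)]
    return (f"SELECT {' '.join(variables)} WHERE {{ "
            + " UNION ".join(branches) + " }")


# Hypothetical ?id values obtained from evaluating the drugbank patterns first.
ids = ['"A"', '"B"']
query = bound_join_query("keggDrug", "bio2rdf:xRef", ids)
print(query)
```

Sending one such query per block of bindings replaces many single-binding requests to the remote endpoint.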

Page 77: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Partout [Galárraga, Hose, Schenkel: PVLDB’13]

• Materialization-based, distributed & workload-aware SPARQL engine.
• Distribution helps to scale out query processing via parallel join executions.
• Triple fragments are distributed over hosts H1…Hn by
– (1) maximizing query locality, and
– (2) balancing the hosts’ workload.
• H1…Hn run local RDF-3X instances.
– (1) local S,P,O statistics by RDF-3X,
– (2) global (cached) statistics.

[Figure: global query workload (aka “query log”) and global query graph]

Page 78: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Partout Example Query Plan

• H1, H2, H3 hold triples for ?s rdf:type db:city
• H1 has triples for ?s db:located db:Germany
• H2 has triples for ?s db:name ?name

Page 79: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

More Distributed RDF Engines
• Shard TripleStore (Hadoop + Hash-Partitioning)
• RDFPeers (P2P/Chord architecture) [Cai et al., WWW’04]
• Gridvine (P2P/Chord architecture) [Aberer et al., VLDB’07]
• YARS2 (federated architecture) [Decker et al., ISWC’07]
• Jena-HBase (Hadoop & HBase) [Khadilkar et al., ISWC’12]
• SW-Store (Hadoop/RDF-3X) [Abadi et al., PVLDB’11]
• 4Store (materialized, shared-nothing) [Harris et al., SSWS’09]
• Eagre (materialized, shared-nothing) [HKUST & HP Labs, ICDE’13]
• Trinity (materialized, shared-memory, message passing) [MSR, SIGMOD’13]

more in Zoi’s tutorial in the afternoon…

Page 80: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Outline of this Tutorial

• Part I–RDF in Centralized Relational Databases

• Part II–RDF in Distributed Settings

• Part III–Managing Uncertain RDF Data

Page 81: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Outline for Part III
• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond
• Part III.3: Probabilistic Database Engines
– Stanford Trio Project
– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data
– URDF @ Max Planck Institute

Page 82: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

What is “Uncertain” Data?

“Certain” Data | “Uncertain” Data
Temperature is 74.634589 F | Sensor reported 75 ± 0.5 F
Bob works for Yahoo | Bob works for either Yahoo or Microsoft
Mary sighted a Finch | Mary sighted either a Finch (60%) or a Sparrow (40%)
It always rains in Galway | There is an 89% chance of rain in Galway tomorrow
Yahoo stock will be at 100 in a month | Yahoo stock will be between 60 and 120 in a month
John’s age is 23 | John’s age is in [20,30]

Page 83: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

… And Why Does It Arise?

“Certain” Data | “Uncertain” Data | Source of uncertainty
Temperature is 74.634589 F | Sensor reported 75 ± 0.5 F | Precision of devices
Bob works for Yahoo | Bob works for either Yahoo or Microsoft | Lack of exact information (alternatives and missing values)
Mary sighted a Finch | Mary sighted either a Finch (60%) or a Sparrow (40%) | Lack of exact information (alternatives and missing values)
It always rains in Galway | There is an 89% chance of rain in Galway tomorrow | Uncertainty about future events
Yahoo stock will be at 100 in a month | Yahoo stock will be between 60 and 120 in a month | Uncertainty about future events
John’s age is 23 | John’s age is in [20,30] | Anonymization

Page 84: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Applications: Deduplication

Name
John Doe
J. Doe

? 80% match

Page 85: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Applications: Information Integration

name,hPhone,oPhone,hAddr,oAddr

name,phone,address

Combined View

at the schema level: “schema integration”

at the instance level: “record linkage”

Page 86: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Applications: Information Extraction (I)

Restaurant | Zip
Hard Rock Cafe | 94111 / 94133 / 94109

Page 87: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Applications: Information Extraction (II)

What is Uncertain Data and Why Does It Arise?

Subj. | Pred. | Obj.
Galway | type | City
 | locatedIn | Ireland
 | hasPopulation | 75,414
 | areaCode | 091
 | namedAfter | Gaillimh_River
… | … | …

Page 88: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Applications: Information Extraction (III)

YAGO/DBpedia et al.: >120 M facts for YAGO2 (mostly from Wikipedia infoboxes)
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)

New fact candidates: hundreds of millions of additional facts from Wikipedia text
type(Jeff, Author)[0.9]
author(Jeff, Drag_Book)[0.8]
author(Jeff, Cind_Book)[0.6]
worksAt(Jeff, Bell_Labs)[0.7]
type(Jeff, CEO)[0.4]

Page 89: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

How do current database management systems (DBMS) handle uncertainty?

They don’t

Page 90: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

What Do (Most) Applications Do?

• Clean: turn into data that DBMSs can handle

Before cleaning:
Observer | Bird-1
Mary | Finch: 80%, Sparrow: 20%
Susan | Dove: 70%, Sparrow: 30%
Jane | Hummingbird: 65%, Sparrow: 35%

After cleaning:
Observer | Bird-1
Mary | Finch
Susan | Dove
Jane | Hummingbird

(1) Loss of information (2) Errors compound and propagate insidiously

Page 91: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Outline for Part III
• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond
• Part III.3: Probabilistic Database Engines
– Stanford Trio Project
– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data
– URDF @ Max Planck Institute

Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management), Morgan & Claypool Publishers, 2012

Page 92: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Databases Today are Deterministic

• An item either is in the database or it is not.

• A tuple either is in the query answer or it is not.

• This applies to all kinds of data models:
– Relational, E/R, hierarchical, XML, …

Page 93: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

What is a Probabilistic Database ?

• “A tuple belongs to the database” is a probabilistic event.

• “A tuple is an answer to the query” is a probabilistic event.

• Can be extended to all possible kinds of data models; we consider only probabilistic relational data.

Page 94: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Sample Spaces & Venn Diagrams

Sample Space

• Sample space $\Omega$: all possible events that can be observed. $\Pr(\Omega) = 1$.
• Random variable $\chi_t$ assigns a probability to an event s.t. $0 \le \Pr(\chi_t) \le 1$.
• As a convention, we will use tuple identifiers in the place of random variables to denote probabilistic events.

“Tuple t1 is in the database.”

“Tuple t2 is an answer to a query.”

Page 95: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Possible Worlds Semantics

Attribute domains: int, varchar(55), datetime
# values: $2^{32}$, $2^{440}$, $2^{64}$

Relational schema: Employee(ID:int, name:varchar(55), dob:datetime, salary:int)
# of possible tuples: $2^{32} \times 2^{440} \times 2^{64} \times 2^{32}$
# of possible relation instances: $2^{2^{32} \times 2^{440} \times 2^{64} \times 2^{32}}$

Database schema: Employee(. . .), Projects(. . .), Groups(. . .), WorksFor(. . .)
# of possible database instances: N (= big but finite)

Page 96: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

The Definition

Given a finite set of all possible database instances:

INST = {I1, I2, I3, . . ., IN}

Definition: A probabilistic database $I^p$ is a probability distribution on $INST$,

$\Pr : INST \to [0,1]$ s.t. $\sum_{i=1}^{N} \Pr(I_i) = 1$

Definition: A possible world is $I \in INST$ s.t. $\Pr(I) > 0$

Page 97: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Example

$I^p$ =

I1, Pr(I1) = 1/3:
Customer | Address | Product
John | Seattle | Gizmo
John | Seattle | Camera
Sue | Denver | Gizmo

I2, Pr(I2) = 1/12:
Customer | Address | Product
John | Boston | Gadget
Sue | Denver | Gizmo

I3, Pr(I3) = 1/2:
Customer | Address | Product
John | Seattle | Gizmo
John | Seattle | Camera
Sue | Seattle | Camera

I4, Pr(I4) = 1/12:
Customer | Address | Product
John | Boston | Gadget
Sue | Seattle | Camera

Possible worlds = {I1, I2, I3, I4}

Page 98: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Tuples as Events

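One tuple $t$ ⇒ event “$t \in I$”

Two tuples $t_1, t_2$ ⇒ event “$t_1 \in I \wedge t_2 \in I$”

Marginal probability of $t$:
$\Pr(t) = \sum_{I:\, t \in I} \Pr(I)$

Marginal probability of $t_1 \wedge t_2$:
$\Pr(t_1 \wedge t_2) = \sum_{I:\, t_1 \in I \wedge t_2 \in I} \Pr(I)$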
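As a worked illustration, the marginals can be computed directly over the four possible worlds of the preceding example. The assignment of the four tables to I1…I4 follows their reading order on the slide, which is an assumption; the helper `marginal` is hypothetical.

```python
from fractions import Fraction as F

# Possible worlds of the Customer/Address/Product example with their probabilities.
WORLDS = {
    "I1": (F(1, 3), {("John", "Seattle", "Gizmo"), ("John", "Seattle", "Camera"),
                     ("Sue", "Denver", "Gizmo")}),
    "I2": (F(1, 12), {("John", "Boston", "Gadget"), ("Sue", "Denver", "Gizmo")}),
    "I3": (F(1, 2), {("John", "Seattle", "Gizmo"), ("John", "Seattle", "Camera"),
                     ("Sue", "Seattle", "Camera")}),
    "I4": (F(1, 12), {("John", "Boston", "Gadget"), ("Sue", "Seattle", "Camera")}),
}


def marginal(*tuples):
    """Sum Pr(I) over all worlds I that contain every given tuple."""
    return sum((pr for pr, inst in WORLDS.values()
                if all(t in inst for t in tuples)), F(0))


gizmo = ("John", "Seattle", "Gizmo")
camera = ("John", "Seattle", "Camera")
gadget = ("John", "Boston", "Gadget")
print(marginal(gizmo), marginal(gizmo, camera), marginal(gizmo, gadget))
```

Here the Gizmo and Camera tuples for John always occur together, so their joint probability equals each marginal, whereas the Gizmo and Gadget tuples never share a world and have joint probability 0.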

Page 99: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Tuple Correlations

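Correlation | Symbol | Definition
Disjoint-AND | D∧ | $\Pr(t_1 \wedge t_2) = 0$
Negatively correlated | N | $\Pr(t_1 \wedge t_2) < \Pr(t_1)\Pr(t_2)$
Positively correlated | P | $\Pr(t_1 \wedge t_2) > \Pr(t_1)\Pr(t_2)$
Identical | = | $\Pr(t_1 \wedge t_2) = \Pr(t_1) = \Pr(t_2)$
Independent-AND | I∧ | $\Pr(t_1 \wedge t_2) = \Pr(t_1)\Pr(t_2)$
Independent-OR | I∨ | $\Pr(t_1 \vee t_2) = 1 - (1 - \Pr(t_1))(1 - \Pr(t_2))$
Disjoint-OR | D∨ | $\Pr(t_1 \vee t_2) = \Pr(t_1) + \Pr(t_2)$
NOT | ¬ | $\Pr(\neg t_1) = 1 - \Pr(t_1)$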

Page 100: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Example with Correlations

$I^p$ =

I1, Pr(I1) = 1/3:
Customer | Address | Product
John | Seattle | Gizmo
John | Seattle | Camera
Sue | Denver | Gizmo

I2, Pr(I2) = 1/12:
Customer | Address | Product
John | Boston | Gadget
Sue | Denver | Gizmo

I3, Pr(I3) = 1/2:
Customer | Address | Product
John | Seattle | Gizmo
John | Seattle | Camera
Sue | Seattle | Camera

I4, Pr(I4) = 1/12:
Customer | Address | Product
John | Boston | Gadget
Sue | Seattle | Camera

[Figure annotations mark pairs of tuples with the labels =, N, P, D, D.]

Page 101: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Special Case!

Tuple-independent probabilistic database

$TUP = \{t_1, t_2, \ldots, t_M\}$ = all tuples; $INST = \mathcal{P}(TUP)$; $N = 2^M$

$pr : TUP \to (0,1]$ (no restrictions w.r.t. other tuples)

$\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$

Page 102: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

… back to the Venn Diagram (I)

Sample Space

If t1 and t2 are independent (per assumption!):

4 possible worlds = 4 subsets of events

“Tuple t1 is in the database.”

“Tuple t2 is in the database.”

Pr(“Tuple t1 is in the database and tuple t2 is in the database”) := Pr(t1) x Pr(t2) = pr(t1) x pr(t2)

Page 103: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

… back to the Venn Diagram (II)

Sample Space

If t1 and t2 are disjoint (per assumption!):

3 possible worlds = 3 subsets of events

“Tuple t1 is in the database.”

“Tuple t2 is in the database.”

Pr(“Tuple t1 is in the database and tuple t2 is in the database”) := 0

Page 104: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Tuple Prob. Possible Worlds

J =
Name | City | pr
John | Seattle | p1 = 0.8
Sue | Boston | p2 = 0.6
Fred | Boston | p3 = 0.9

Assumption: Tuples are independent!

$I^p$ =
World | Tuples | Probability
I1 | (empty) | (1-p1)(1-p2)(1-p3)
I2 | (John, Seattle) | p1(1-p2)(1-p3)
I3 | (Sue, Boston) | (1-p1)p2(1-p3)
I4 | (Fred, Boston) | (1-p1)(1-p2)p3
I5 | (John, Seattle), (Sue, Boston) | p1p2(1-p3)
I6 | (John, Seattle), (Fred, Boston) | p1(1-p2)p3
I7 | (Sue, Boston), (Fred, Boston) | (1-p1)p2p3
I8 | (John, Seattle), (Sue, Boston), (Fred, Boston) | p1p2p3

Σ = 1
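The eight possible worlds and their probabilities can be enumerated mechanically from the three tuple probabilities. This is a small sketch under the stated independence assumption; the generator name `possible_worlds` is hypothetical.

```python
from itertools import product

# Tuple probabilities from the table J above.
PR = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}


def possible_worlds(pr):
    """Yield (world, probability) for every subset of independent tuples."""
    tuples = list(pr)
    for mask in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, mask) if keep)
        prob = 1.0
        for t, keep in zip(tuples, mask):
            # Present tuples contribute pr(t), absent tuples contribute 1 - pr(t).
            prob *= pr[t] if keep else 1.0 - pr[t]
        yield world, prob


worlds = dict(possible_worlds(PR))
for world, prob in sorted(worlds.items(), key=lambda item: -item[1]):
    print(sorted(world), round(prob, 3))
```

With M tuples the loop visits 2^M worlds, which is why explicit enumeration is only feasible for toy inputs like this one.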

Page 105: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Tuple Prob. Query Evaluation

Person^p
Name | City | pr
John | Seattle | p1
Sue | Boston | p2
Fred | Boston | p3

Purchase^p
Customer | Product | Date | pr
John | Gizmo | . . . | q1
John | Gadget | . . . | q2
John | Gadget | . . . | q3
Sue | Camera | . . . | q4
Sue | Gadget | . . . | q5
Sue | Gadget | . . . | q6
Fred | Gadget | . . . | q7

SELECT DISTINCT x.city
FROM Person^p x, Purchase^p y
WHERE x.Name = y.Customer AND y.Product = 'Gadget'

Result (marginals):
Tuple | Probability
Seattle | $p_1 \left(1 - (1-q_2)(1-q_3)\right)$
Boston | $1 - \left(1 - p_2\left(1 - (1-q_5)(1-q_6)\right)\right) \times \left(1 - p_3 q_7\right)$
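The closed-form marginals can be cross-checked against brute-force enumeration of all possible worlds. In the sketch below, the Person probabilities reuse 0.8, 0.6, 0.9 from the earlier slide, while the Purchase probabilities q1…q7 are made-up values chosen only for illustration; the function names are hypothetical.

```python
from itertools import product

PERSON = [("John", "Seattle", 0.8), ("Sue", "Boston", 0.6), ("Fred", "Boston", 0.9)]
PURCHASE = [("John", "Gizmo", 0.5), ("John", "Gadget", 0.4), ("John", "Gadget", 0.3),
            ("Sue", "Camera", 0.7), ("Sue", "Gadget", 0.2), ("Sue", "Gadget", 0.6),
            ("Fred", "Gadget", 0.5)]  # q1..q7 are assumed values


def extensional(city):
    """Closed form: independent-OR over persons of p * (independent-OR over gadgets)."""
    none_qualifies = 1.0
    for name, c, p in PERSON:
        if c != city:
            continue
        no_gadget = 1.0
        for customer, item, q in PURCHASE:
            if customer == name and item == "Gadget":
                no_gadget *= 1.0 - q
        none_qualifies *= 1.0 - p * (1.0 - no_gadget)
    return 1.0 - none_qualifies


def brute_force(city):
    """Sum the probabilities of all worlds in which the city is a query answer."""
    rows = PERSON + PURCHASE
    total = 0.0
    for mask in product([False, True], repeat=len(rows)):
        prob = 1.0
        for row, keep in zip(rows, mask):
            prob *= row[2] if keep else 1.0 - row[2]
        persons = [r for r, keep in zip(PERSON, mask) if keep]
        buys = [r for r, keep in zip(PURCHASE, mask[len(PERSON):]) if keep]
        if any(c == city and any(cust == name and item == "Gadget"
                                 for cust, item, _ in buys)
               for name, c, _ in persons):
            total += prob
    return total


for city in ("Seattle", "Boston"):
    print(city, extensional(city), brute_force(city))
```

The closed form needs one pass over the tuples, while the brute-force check enumerates 2^10 worlds; both agree here because every base tuple is used at most once in the derivation of each answer.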

Page 106: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Summary of Data Model

Possible Worlds Semantics
• Very powerful model:
– Can capture any tuple correlations.
• Needs separate representation formalism (“just tables” are generally not enough):
– Boolean event expressions to capture complex tuple dependencies: “provenance”, “lineage”, “views”, etc.
• But: query evaluation may be very expensive.
– Need to find good cases, otherwise must approximate.

Page 107: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Outline for Part III
• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond
• Part III.3: Probabilistic Database Engines
– Stanford Trio Project
– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data
– URDF @ Max Planck Institute

Page 108: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Trio’s Data Model

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values
4. Lineage

Uncertainty-Lineage Databases (ULDBs)

[Widom et al.: 2008]

Page 109: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Trio’s Data Model

1. Alternatives: uncertainty about value

Saw (witness, color, car)
Amy | red, Honda ∥ red, Toyota ∥ orange, Mazda

Three possible instances

Page 110: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Trio’s Data Model

1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence

Saw (witness, color, car)
Amy | red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty | blue, Acura | ?

Six possible instances

Page 111: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

Trio’s Data Model

• 1. Alternatives
• 2. ‘?’ (Maybe) Annotations
• 3. Confidences: weighted uncertainty

Saw (witness, color, car)
Amy | red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty | blue, Acura 0.6 | ?

Six possible instances, each with a probability
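The six instances and their probabilities can be enumerated with a short sketch. It assumes that the two rows are independent of each other and that Betty's remaining mass of 0.4 corresponds to her tuple being absent; both are assumptions of this illustration rather than statements from the slide.

```python
from itertools import product

# Each row is a list of (alternative, confidence); None means "tuple absent".
AMY = [(("red", "Honda"), 0.5), (("red", "Toyota"), 0.3), (("orange", "Mazda"), 0.2)]
BETTY = [(("blue", "Acura"), 0.6), (None, 0.4)]

instances = []
for (amy, pa), (betty, pb) in product(AMY, BETTY):
    rows = [("Amy",) + amy]
    if betty is not None:
        rows.append(("Betty",) + betty)
    # Assumed independence between rows: multiply the chosen confidences.
    instances.append((rows, pa * pb))

for rows, prob in instances:
    print(round(prob, 2), rows)
```

The most likely instance pairs Amy's red Honda with Betty's blue Acura (0.5 × 0.6 = 0.3), and the six probabilities add up to 1.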

Page 112: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.

So Far: Model is Not Closed

Saw (witness, car)
Cathy | Honda ∥ Mazda

Drives (person, car)
Jimmy, Toyota ∥ Jimmy, Mazda
Billy, Honda ∥ Frank, Honda
Hank, Honda

Suspects = π_person(Saw ⋈ Drives)

Suspects ???
Jimmy
Billy ∥ Frank
Hank

This table does not, and CANNOT, correctly capture the possible instances in the result.

Page 113: Scalable RDF Data Management & SPARQL Query Processing Martin Theobald 1, Katja Hose 2, Ralf Schenkel 3 1 University of Antwerp, Belgium 2 University of.


Example with Lineage

ID  Saw (witness, car)

11  Cathy  Honda ∥ Mazda

ID  Drives (person, car)

21  Jimmy, Toyota ∥ Jimmy, Mazda

22  Billy, Honda ∥ Frank, Honda

23  Hank, Honda

Suspects = πperson(Saw ⋈ Drives)

ID  Suspects

31  Jimmy  ?

32  Billy ∥ Frank  ?

33  Hank  ?

λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1); λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23

Correctly captures possible instances in the result (7)
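To see why lineage is needed, the following sketch evaluates the join on every possible instance of Saw and Drives and compares the outcome with a lineage-free reading of Suspects, in which the three result rows are treated as independent ‘maybe’ x-tuples. The data structures and function names are illustrative assumptions, not part of Trio.

```python
from itertools import product

# x-tuples keyed by ID; each value lists the mutually exclusive alternatives.
saw = {11: ["Honda", "Mazda"]}  # car seen by Cathy
drives = {
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}

def join_results():
    """Evaluate Suspects on every possible instance of Saw and Drives."""
    results = set()
    for seen_car in saw[11]:
        for world in product(*drives.values()):
            results.add(frozenset(p for p, car in world if car == seen_car))
    return results

def lineage_free_results():
    """Treat the three Suspects rows as independent 'maybe' x-tuples."""
    rows = [[None, "Jimmy"], [None, "Billy", "Frank"], [None, "Hank"]]
    return {
        frozenset(name for name in pick if name is not None)
        for pick in product(*rows)
    }

true_results = join_results()
naive_results = lineage_free_results()
print(sorted(map(sorted, true_results)))
print(len(true_results), "vs", len(naive_results))
```

In this sketch the lineage-free reading admits combinations such as {Jimmy, Hank}, which no instance of the base relations can produce, because Jimmy requires the Mazda alternative (11,2) while Hank requires the Honda alternative (11,1).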


Operational Semantics

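Closure: the up-arrow always exists

Completeness: any (finite) set of possible instances can be represented

[Diagram: D → I1, I2, …, In (possible instances); I1, I2, …, In → J1, J2, …, Jm (Q on each instance); D → D′ (direct implementation); D′ and J1, J2, …, Jm are linked by the arrow labeled “rep. of instances”]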


Summary on Trio’s Data Model

1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidence values
4. Lineage

Uncertainty-Lineage Databases (ULDBs)

Theorem: ULDBs are closed and complete.

Formally studied properties like minimization, equivalence, approximation and membership based on lineage.

[Benjelloun, Widom, et al.: VLDB J. 08]


MYSTIQ: Query Complexity

• Data complexity of a query Q:
• Compute Q(Ip), for probabilistic database J

– Extensional query evaluation: Works for “safe” query plans with PTIME data complexity

– Intensional query evaluation: Works for any plan but has #P-complete data complexity in the general case

• Assume independent tuples in J
• Compute marginal probabilities for tuples in Q
• Boolean event expressions for intensional query evaluation


Extensional Query Evaluation

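Relational operators compute probabilities [Fuhr&Roellke:1997, Dalvi&Suciu:2004]

• Selection σ: (v, p) → (v, p)
• Join ×: (v1, p1), (v2, p2) → (v1 v2, p1 p2)
• Projection Π: (v, p1), (v, p2), … → (v, 1-(1-p1)(1-p2)…), or: p1 + p2 + …
• Difference −: (v, p1), (v, p2) → (v, p1(1-p2))

Data complexity: PTIME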
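The operator rules can be sketched as plain functions over lists of (value, probability) pairs. This is an illustrative sketch under the tuple-independence assumption, not MystiQ code; the function names and the list-of-pairs representation are assumptions made for the example.

```python
from collections import defaultdict

def select(rel, pred):
    """Selection keeps the probability of each qualifying tuple unchanged."""
    return [(v, p) for v, p in rel if pred(v)]

def join(r1, r2, on=lambda v1, v2: True):
    """Join multiplies the probabilities of the two input tuples."""
    return [((v1, v2), p1 * p2)
            for v1, p1 in r1 for v2, p2 in r2 if on(v1, v2)]

def project(rel, key):
    """Duplicate-eliminating projection: 1 - (1-p1)(1-p2)... per group."""
    miss = defaultdict(lambda: 1.0)  # probability that no input tuple is present
    for v, p in rel:
        miss[key(v)] *= 1.0 - p
    return [(k, 1.0 - m) for k, m in miss.items()]

def difference(r1, r2):
    """Difference: p1 * (1 - p2); assumes values in r2 are unique."""
    p2 = dict(r2)
    return [(v, p * (1.0 - p2.get(v, 0.0))) for v, p in r1]

r = [("a", 0.5), ("a", 0.5)]
s = [("b", 0.5)]
print(project(r, key=lambda v: v))
print(join(s, s))
print(difference([("a", 0.5)], [("a", 0.5)]))
```

Each operator touches every input tuple a constant number of times per output tuple, which is where the PTIME data complexity comes from.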


SELECT DISTINCT x.City
FROM Personp x, Purchasep y
WHERE x.Name = y.Customer and y.Product = ‘Gadget’

Personp: (Jon, Sea, p1)
Purchasep: (Jon, q1), (Jon, q2), (Jon, q3)

Plan 1 (join, then project): Wrong !
×: (Jon, Sea, p1q1), (Jon, Sea, p1q2), (Jon, Sea, p1q3)
Π: (Sea, 1-(1-p1q1)(1-p1q2)(1-p1q3))

Plan 2 (project Purchasep, then join): Correct !
Π: (Jon, 1-(1-q1)(1-q2)(1-q3))
×: (Jon, Sea, p1(1-(1-q1)(1-q2)(1-q3)))

Depends on plan !!!

“Safe Plans” [Dalvi&Suciu:2004]
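A small numeric check makes the plan dependence concrete. The sketch below plugs in illustrative values (p1 = 0.5 and q1 = q2 = q3 = 0.5, chosen here only for the example) and compares both plan formulas with a brute-force sum over all possible worlds.

```python
from itertools import product
from math import prod

p1 = 0.5              # probability of the Person tuple (Jon, Sea)
q = [0.5, 0.5, 0.5]   # probabilities of Jon's three Purchase tuples

# Plan 1: join first, then project; p1 is wrongly treated as three independent events.
join_then_project = 1 - prod(1 - p1 * qi for qi in q)

# Plan 2: project Purchase first, then join with Person.
project_then_join = p1 * (1 - prod(1 - qi for qi in q))

def exact(p1, q):
    """Sum the probabilities of all worlds in which 'Sea' is an answer."""
    total = 0.0
    for person, *purchases in product([True, False], repeat=1 + len(q)):
        weight = p1 if person else 1 - p1
        for present, qi in zip(purchases, q):
            weight *= qi if present else 1 - qi
        if person and any(purchases):
            total += weight
    return total

print(join_then_project, project_then_join, exact(p1, q))
```

Only the second plan agrees with the possible-worlds semantics, because it uses the Person tuple's probability exactly once.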


Query Complexity

Sometimes there exists a correct extensional plan, but consider the following:

Qbad :- R(x), S(x,y), T(y)

Data complexity is #P-complete

[Dalvi&Suciu:2004]

NP = class of problems of the form “is there a witness ?”
#P = class of problems of the form “how many witnesses ?” (will be coming back to this…)


Intensional Database [Fuhr&Roellke:1997]

Atomic event ids: e1, e2, e3, …

Probabilities: p1, p2, p3, … ∈ [0,1]

Event expressions: ∧, ∨, ¬ (e.g., e3 ∧ (e5 ∨ ¬e2))

Intensional probabilistic database J: each tuple t has an event attribute t.E


Probability of Boolean Expressions

E = X1X3 v X1X4 v X2X5 v X2X6

Sampling: Randomly make each variable true with the following probabilities

Pr(X1) = p1, Pr(X2) = p2, . . . . . , Pr(X6) = p6

What is Pr(E) ???

Answer: Re-group cleverly E = X1 (X3 v X4 ) v X2 (X5 v X6)

Pr(E) = 1 - (1-p1(1-(1-p3)(1-p4))) (1-p2(1-(1-p5)(1-p6)))

Needed for query evaluation!

“Read once” formula
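The regrouping can be checked numerically. The sketch below evaluates the read-once formula and compares it with a brute-force sum over all 2^6 truth assignments, assuming independent variables; the function names and the sample probabilities are illustrative.

```python
from itertools import product

def pr_read_once(p):
    """Pr(E) via the regrouped form E = X1(X3 v X4) v X2(X5 v X6)."""
    p1, p2, p3, p4, p5, p6 = p
    left = p1 * (1 - (1 - p3) * (1 - p4))
    right = p2 * (1 - (1 - p5) * (1 - p6))
    return 1 - (1 - left) * (1 - right)

def pr_brute_force(p):
    """Pr(E) for E = X1X3 v X1X4 v X2X5 v X2X6 by enumerating all assignments."""
    total = 0.0
    for x in product([0, 1], repeat=6):
        x1, x2, x3, x4, x5, x6 = x
        if (x1 and x3) or (x1 and x4) or (x2 and x5) or (x2 and x6):
            weight = 1.0
            for xi, pi in zip(x, p):
                weight *= pi if xi else 1 - pi
            total += weight
    return total

probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(pr_read_once(probs), pr_brute_force(probs))
```

The closed form works because every variable occurs exactly once after regrouping, so the sub-formulas combined by each operator are independent.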


Complexity Issues

Theorem [Valiant:1979]: For a Boolean expression E, computing Pr(E) is #P-complete

NP = class of problems of the form “is there a witness ?” (SAT)
#P = class of problems of the form “how many witnesses ?” (#SAT)

The decision problem for 2CNF is in PTIME
The counting problem for 2CNF is #P-complete
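To illustrate the difference between asking for one witness and counting all of them, here is a brute-force model counter for a small 2CNF formula. The formula and the signed-integer encoding of literals are assumptions made for this example; the exhaustive loop is exponential in the number of variables and is not a practical #SAT algorithm.

```python
from itertools import product

# 2CNF formula (x1 v x2) ^ (~x1 v x3) ^ (~x2 v ~x3);
# literal k means variable k is true, -k means it is false.
FORMULA = [(1, 2), (-1, 3), (-2, -3)]
NUM_VARS = 3

def satisfies(assignment, formula):
    """assignment[i] holds the truth value of variable i + 1."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in formula
    )

def count_models(formula, num_vars):
    """#SAT by exhaustive enumeration of all 2^n assignments."""
    return sum(
        satisfies(a, formula)
        for a in product([False, True], repeat=num_vars)
    )

models = count_models(FORMULA, NUM_VARS)
print(models, models / 2 ** NUM_VARS)
```

With every variable true with probability 1/2, Pr(E) equals the model count divided by 2^n, which is how computing Pr(E) relates to #SAT.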


MYSTIQ: [Re, Suciu: VLDB’04]

Probabilistic Query Evaluation on Top of a Deterministic Database Engine

Deterministic Database

SQL Query

Probabilistic Query Engine

(Top-k) Answers

1. Sampling

2. Extensional joins

3. Indexes


Outline for Part III

• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond
• Part III.3: Probabilistic Database Engines
– Stanford Trio Project
– MystiQ @ U Washington
• Part III.4: Uncertain RDF Data
– URDF Project @ Max Planck Institute


Uncertain RDF (URDF) Data Model

• Extensional Layer (information extraction & integration)
– High-confidence facts: existing knowledge base (“ground truth”)
– New fact candidates: extracted facts with confidence values
– Integration of different knowledge sources: Ontology merging or explicit Linked Data (owl:sameAs, owl:equivProp.)

Large “Probabilistic Database” of RDF facts

• Intensional Layer (query-time inference)
– Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
– Hard rules: consistency constraints (more general FOL rules)
– Propositional & probabilistic consistency reasoning


Soft Rules vs. Hard Rules

(Soft) Deduction Rules vs. (Hard) Consistency Constraints

• People may live in more than one place
livesIn(x,y) ⇐ marriedTo(x,z) ∧ livesIn(z,y) [0.8]
livesIn(x,y) ⇐ hasChild(x,z) ∧ livesIn(z,y) [0.5]

Deductive database: Datalog, core of SQL & relational algebra, RDF/S, OWL2-RL, etc.

• People are not born in different places/on different dates
bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
bornOn(x,y) ∧ bornOn(x,z) ⇒ y=z

• People are not married to more than one person (at the same time, in most countries?)
marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)

More general FOL constraints: Datalog with constraints, X-Tuples in Prob.-DB’s, owl:FunctionalProperty, etc.


URDF: Running Example

Rules
hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z

KB: Base Facts
[Figure: knowledge graph over the nodes Jeff, Surajit, David, Stanford, Princeton, University, and Computer Scientist, with edges labeled type[1.0] (five edges), worksAt[0.9], hasAdvisor[0.8], hasAdvisor[0.7], graduatedFrom[0.6], graduatedFrom[0.7], and graduatedFrom[0.9]]

Derived Facts
gradFr(Surajit,Stanford): graduatedFrom[?]
gradFr(David,Stanford): graduatedFrom[?]


Basic Types of Inference

• MAP Inference
– Find the most likely assignment to query variables y under a given evidence x.
– Compute: arg max y P( y | x) (NP-hard for MaxSAT)

• Marginal/Success Probabilities
– Probability that query y is true in a random world under a given evidence x.
– Compute: ∑y P( y | x) (#P-hard already for conjunctive queries)


General Route: Grounding & MaxSAT Solving

Query: graduatedFrom(x, y)

1) Grounding
– Consider only facts (and rules) which are relevant for answering the query

2) Propositional formula in CNF, consisting of
– Grounded hard & soft rules
– Weighted base facts

CNF (clause, weight):
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))  1000
∧ (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))  1000
∧ (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))  0.4
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))  0.4
∧ worksAt(Jeff, Stanford)  0.9
∧ hasAdvisor(Surajit, Jeff)  0.8
∧ hasAdvisor(David, Jeff)  0.7
∧ graduatedFrom(Surajit, Princeton)  0.7
∧ graduatedFrom(Surajit, Stanford)  0.6
∧ graduatedFrom(David, Princeton)  0.9

3) Propositional Reasoning
– Find truth assignment to facts such that the total weight of the satisfied clauses is maximized

MAP inference: compute “most likely” possible world
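For a formula this small, step 3 can be sketched by exhaustive search. The code below is a brute-force weighted MaxSAT over the grounded running example, not URDF's solver; the clause encoding is an assumption made for the example, and the fact weights follow the base-fact list of the running example (Princeton 0.7, Stanford 0.6 for Surajit).

```python
from itertools import product

W = "worksAt(Jeff,Stanford)"
HS = "hasAdvisor(Surajit,Jeff)"
HD = "hasAdvisor(David,Jeff)"
SP = "graduatedFrom(Surajit,Princeton)"
SS = "graduatedFrom(Surajit,Stanford)"
DP = "graduatedFrom(David,Princeton)"
DS = "graduatedFrom(David,Stanford)"
FACTS = [W, HS, HD, SP, SS, DP, DS]

# Clause = (weight, [(fact, required truth value), ...]); any matching literal satisfies it.
CLAUSES = [
    (1000.0, [(SS, False), (SP, False)]),          # hard: at most one alma mater
    (1000.0, [(DS, False), (DP, False)]),
    (0.4, [(HS, False), (W, False), (SS, True)]),  # grounded soft rule
    (0.4, [(HD, False), (W, False), (DS, True)]),
    (0.9, [(W, True)]), (0.8, [(HS, True)]), (0.7, [(HD, True)]),
    (0.7, [(SP, True)]), (0.6, [(SS, True)]), (0.9, [(DP, True)]),
]

def total_weight(world):
    """Sum the weights of all clauses satisfied by the given truth assignment."""
    return sum(w for w, lits in CLAUSES
               if any(world[f] == value for f, value in lits))

def map_world():
    """Return the truth assignment with maximal satisfied clause weight."""
    worlds = (dict(zip(FACTS, values))
              for values in product([False, True], repeat=len(FACTS)))
    return max(worlds, key=total_weight)

best = map_world()
for fact in FACTS:
    print(best[fact], fact)
```

Under these weights the most likely world keeps graduatedFrom(Surajit, Stanford) and drops graduatedFrom(Surajit, Princeton), because the satisfied soft rule adds 0.4 to the Stanford fact's 0.6. Exhaustive search is exponential in the number of facts, which is why approximation algorithms are used in practice.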


URDF: MaxSAT Solving with Soft & Hard Rules

[Theobald,Sozio,Suchanek,Nakashole: VLDS‘12]

Find: arg max y P( y | x)
Resolves to a variant of MaxSAT for propositional formulas

Special case: Horn-clauses as soft rules & mutex-constraints as hard rules

S: Mutex-const.
{ graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
{ graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }

C: Weighted Horn clauses (CNF)
(¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))  0.4
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))  0.4
∧ worksAt(Jeff, Stanford)  0.9
∧ hasAdvisor(Surajit, Jeff)  0.8
∧ hasAdvisor(David, Jeff)  0.7
∧ graduatedFrom(Surajit, Princeton)  0.7
∧ graduatedFrom(Surajit, Stanford)  0.6
∧ graduatedFrom(David, Princeton)  0.9

MaxSAT Alg.

Compute W_0 = ∑_{clauses C} w(C) P(C is satisfied);
For each hard constraint S_t {
  For each fact f in S_t {
    Compute W_t^{f+} = ∑_{clauses C} w(C) P(C is sat. | f = true);
  }
  Compute W_t^{S-} = ∑_{clauses C} w(C) P(C is sat. | S_t = false);
  Choose truth assignment to f in S_t that maximizes W_t^{f+}, W_t^{S-};
  Remove satisfied clauses C;
  t++;
}

• Runtime: O(|S||C|)
• Approximation guarantee of 1/2


Deductive Grounding with Lineage (SLD Resolution/Datalog)

Query: graduatedFrom(Surajit, y)

Rules
hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z

Base Facts
graduatedFrom(Surajit, Princeton) [0.7]
graduatedFrom(Surajit, Stanford) [0.6]
graduatedFrom(David, Princeton) [0.9]
hasAdvisor(Surajit, Jeff) [0.8]
hasAdvisor(David, Jeff) [0.7]
worksAt(Jeff, Stanford) [0.9]
type(Princeton, University) [1.0]
type(Stanford, University) [1.0]
type(Jeff, Computer_Scientist) [1.0]
type(Surajit, Computer_Scientist) [1.0]
type(David, Computer_Scientist) [1.0]

[Figure: lineage tree with leaves A = graduatedFrom(Surajit, Princeton)[0.7], B = graduatedFrom(Surajit, Stanford)[0.6], C = hasAdvisor(Surajit,Jeff)[0.8], D = worksAt(Jeff,Stanford)[0.9]; C and D are joined by /\, and the result is joined with B by \/]

Q1: graduatedFrom(Surajit, Princeton) = A ∧ ¬(B ∨ (C ∧ D))
Q2: graduatedFrom(Surajit, Stanford) = ¬A ∧ (B ∨ (C ∧ D))


Lineage & Possible Worlds

1) Deductive Grounding
– Dependency graph of the query
– Trace lineage of individual query answers

2) Lineage DAG (not in CNF), consisting of
– Grounded hard & soft rules
– Weighted base facts
Plus: entire derivation history!

3) Probabilistic Inference
Compute marginals:
P(Q): aggregate probabilities of all possible worlds where the lineage of the query evaluates to “true”
P(Q|H): drop “impossible worlds”

Query: graduatedFrom(Surajit, y)

[Figure: lineage tree with leaves A = graduatedFrom(Surajit, Princeton)[0.7], B = graduatedFrom(Surajit, Stanford)[0.6], C = hasAdvisor(Surajit,Jeff)[0.8], D = worksAt(Jeff,Stanford)[0.9], annotated with probabilities]

C ∧ D: 0.8x0.9=0.72
B ∨ (C ∧ D): 1-(1-0.72)x(1-0.6)=0.888
Q1 = A ∧ ¬(B ∨ (C ∧ D)), graduatedFrom(Surajit, Princeton): 0.7x(1-0.888)=0.078
Q2 = ¬A ∧ (B ∨ (C ∧ D)), graduatedFrom(Surajit, Stanford): (1-0.7)x0.888=0.266

[Das Sarma,Theobald,Widom: ICDE‘08; Dylla, Miliaraki,Theobald: CIKM‘11]
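The bottom-up probability computation over the lineage tree can be sketched as a short recursion. It assumes independent base facts and that every fact occurs only once in the formula (which holds for Q1 and Q2 here); the tuple-based tree encoding and the function name are illustrative assumptions, not URDF's data structures.

```python
P = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}  # base-fact confidences

def prob(node, p):
    """Probability of a lineage formula with independent, non-repeated leaves."""
    op = node[0]
    if op == "var":
        return p[node[1]]
    if op == "not":
        return 1.0 - prob(node[1], p)
    if op == "and":
        result = 1.0
        for child in node[1:]:
            result *= prob(child, p)
        return result
    if op == "or":
        miss = 1.0  # probability that no child is true
        for child in node[1:]:
            miss *= 1.0 - prob(child, p)
        return 1.0 - miss
    raise ValueError(f"unknown operator: {op}")

stanford = ("or", ("var", "B"), ("and", ("var", "C"), ("var", "D")))
Q1 = ("and", ("var", "A"), ("not", stanford))
Q2 = ("and", ("not", ("var", "A")), stanford)

print(round(prob(stanford, P), 4), round(prob(Q1, P), 4), round(prob(Q2, P), 4))
```

If a base fact occurred in several branches, the branches would no longer be independent and this recursion would give wrong results; that case needs possible-worlds enumeration or techniques such as Shannon expansion.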


Possible Worlds Semantics

A:0.7  B:0.6  C:0.8  D:0.9  Q2: ¬A ∧ (B ∨ (C ∧ D))  P(W)

1 1 1 1 0 0.7x0.6x0.8x0.9 = 0.3024

1 1 1 0 0 0.7x0.6x0.8x0.1 = 0.0336

1 1 0 1 0 … = 0.0756

1 1 0 0 0 … = 0.0084

1 0 1 1 0 … = 0.2016

1 0 1 0 0 … = 0.0224

1 0 0 1 0 … = 0.0504

1 0 0 0 0 … = 0.0056

0 1 1 1 1 0.3x0.6x0.8x0.9 = 0.1296

0 1 1 0 1 0.3x0.6x0.8x0.1 = 0.0144

0 1 0 1 1 0.3x0.6x0.2x0.9 = 0.0324

0 1 0 0 1 0.3x0.6x0.2x0.1 = 0.0036

0 0 1 1 1 0.3x0.4x0.8x0.9 = 0.0864

0 0 1 0 0 … = 0.0096

0 0 0 1 0 … = 0.0216

0 0 0 0 0 … = 0.0024

1.0

0.2664

0.412

P(Q2)=0.2664

P(Q2|H)=0.2664 / 0.412 = 0.6466

P(Q1)=0.0784 P(Q1|H)=0.0784 / 0.412 = 0.1903

0.0784

Hard rule H: A (B (CD))
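The marginals P(Q1) and P(Q2) in the table can be reproduced by enumerating all 16 worlds. This is a minimal sketch assuming independent base facts A to D; the helper names are illustrative.

```python
from itertools import product

P = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

def q1(a, b, c, d):
    return a and not (b or (c and d))   # graduatedFrom(Surajit, Princeton)

def q2(a, b, c, d):
    return (not a) and (b or (c and d))  # graduatedFrom(Surajit, Stanford)

def marginal(query):
    """Sum P(W) over all possible worlds W in which the query lineage is true."""
    total = 0.0
    for world in product([True, False], repeat=4):
        weight = 1.0
        for name, value in zip("ABCD", world):
            weight *= P[name] if value else 1.0 - P[name]
        if query(*world):
            total += weight
    return total

print(round(marginal(q1), 4), round(marginal(q2), 4))
```

Enumeration touches 2^n worlds for n base facts, so it only serves as a reference semantics for small lineage formulas.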


More Probabilistic Approaches

• Propositional
– Stochastic MaxSat solvers: MaxWalkSat (MAP-Inference)
– URDF: constrained weighted MaxSat solver for soft & hard rules

• Lineage & Possible Worlds (tuple-independent database)
– Exact probabilistic inference: junction trees, variable elimination
– Approximate inference: decision diagrams/Shannon expansions, sampling

• Combining First-Order Logic & Probabilistic Graphical Models
– Markov Logic Networks* [Richardson & Domingos: Machine Learning 2006]
– Factor Graphs [FactorIE, McCallum et al.: NIPS 2008]
– Variety of MCMC sampling techniques for probabilistic inference (e.g., Gibbs sampling, MC-SAT, etc.)

*Alchemy – Open-Source AI: http://alchemy.cs.washington.edu/


Experiments

• URDF: SLD grounding & MaxSat solving
|C| - # literals in soft rules
|S| - # literals in hard rules

• URDF MaxSat vs. Markov Logic (MAP inference & MC-SAT)

• YAGO Knowledge Base: 2 million entities, 20 million facts
• Basic query answering: SLD grounding & MaxSat solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks: runtime comparisons for synthetic rule expansions


UViz: URDF Visualization Frontend [Meiser, Dylla, Theobald: CIKM’11 Demo]

• System components:
– Flash Player client
– Tomcat server (JRE)
– Relational backend (JDBC)
– Remote Method Invocation & Object Serialization (BlazeDS)


UViz: URDF Visualization Frontend [Meiser, Dylla, Theobald: CIKM’11 Demo]

Demo! http://urdf.mpi-inf.mpg.de


Recommended Readings

PART I
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query/
• SPARQL 1.1 Query Language, W3C Recommendation, 21 March 2013, http://www.w3.org/TR/sparql11-query/
• SPARQL 1.1 Federated Query, W3C Recommendation, 21 March 2013, http://www.w3.org/TR/sparql11-federated-query/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW Conference, 2007
• Krisztian Balog, Edgar Meij, Maarten de Rijke: Entity Search: Building Bridges between Two Worlds. WWW, 2010
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
• Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs. CIKM, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Thomas Neumann, Gerhard Weikum: Scalable join processing on very large RDF graphs. SIGMOD Conference, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
• François Picalausa, Yongming Luo, George H. L. Fletcher, Jan Hidders, Stijn Vansummeren: A Structural Approach to Indexing Triples. ESWC, 2012
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases by Web Services. SIGMOD Conference, 2010
• Cheng Xiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
• Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. PVLDB 4(8), 2011

PART II
• Min Cai, Martin R. Frank: RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network. WWW, 2004
• Gong Cheng, Weiyi Ge, Yuzhong Qu: Falcons: searching and browsing entities on the semantic web. WWW, 2008
• Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal C Doshi, Joel Sachs: Swoogle: A Search and Metadata Engine for the Semantic Web. CIKM, 2004
• Luis Galárraga, Katja Hose, Ralf Schenkel: Partout: A Distributed Engine for Efficient RDF Processing. To appear in PVLDB, 2013
• Steve Harris, Nick Lamb, Nigel Shadbolt: 4store: The Design and Implementation of a Clustered RDF Store. SSWS, 2009
• Jiewen Huang, Daniel J. Abadi, Kun Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011
• Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani Thuraisingham, Paolo Castagna: Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. ISWC, 2012
• Bastian Quilitz, Ulf Leser: Querying Distributed RDF Data Sources with SPARQL. ISWC, 2008
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt: FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. ESWC, 2011
• Bin Shao, Haixun Wang, Yatao Li: Trinity: A Distributed Graph Engine on a Memory Cloud. To appear in SIGMOD, 2013
• Xiaofei Zhang, Lei Chen, Yongxin Tong, Min Wang: EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud. ICDE, 2013

PART III
• Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with uncertainty and lineage. VLDB J. 17(2), 2008
• Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Ré, Dan Suciu: MYSTIQ: a system for finding more answers by using probabilities. SIGMOD Conference, 2005
• Nilesh N. Dalvi, Dan Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB, 2004
• Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE, 2013
• Norbert Fuhr, Thomas Rölleke: A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Trans. Inf. Syst. 15(1), 1997
• Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive reasoning in uncertain RDF knowledge bases. CIKM, 2011
• Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS, 2012
• Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management), Morgan & Claypool Publishers, 2012

