Scalable RDF Data Management
& SPARQL Query Processing
Martin Theobald (1), Katja Hose (2), Ralf Schenkel (3)
(1) University of Antwerp, Belgium  (2) Aalborg University, Denmark  (3) University of Passau, Germany
Outline of this Tutorial
• Part I: RDF in Centralized Relational Databases
• Part II: RDF in Distributed Settings
• Part III: Managing Uncertain RDF Data
Outline for Part I
• Part I.1: Foundations
  – Introduction to RDF and Linked Open Data
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook
bornOn(Jeff, 09/22/42)
gradFrom(Jeff, Columbia)
hasAdvisor(Jeff, Arthur)
hasAdvisor(Surajit, Jeff)
knownFor(Jeff, Theory)
Information Extraction
YAGO/DBpedia et al.
>120 M facts for YAGO2 (mostly from Wikipedia infoboxes & categories)
http://www.mpi-inf.mpg.de/yago-naga/
YAGO2 Knowledge Base
[Figure: excerpt of the YAGO2 knowledge graph. Entities such as Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Nobel Prize, and the Max_Planck Society are connected by relations like bornOn, diedOn, bornIn, hasWon, fatherOf, citizenOf, and locatedIn; means edges link entities to name strings such as "Max Karl Ernst Ludwig Planck" and "Angela Dorothea Merkel"; classes such as Person, Scientist, Physicist, Biologist, Politician, City, State, Country, Location, and Organization are connected via instanceOf and subclass edges.]
3 M entities, 120 M facts, 100 relations, 200k classes; accuracy ~95%
Why care about scalability?
Rapid growth of available semantic data:
• More than 30 billion triples in more than 200 sources across the LOD cloud
• DBpedia: 3.4 million entities, 1 billion triples
Sources: linkeddata.org, wikipedia.org
As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase
… and still growing
• Billion Triple Challenge 2008: 1B triples
• Billion Triple Challenge 2010: 3B triples (http://km.aifb.kit.edu/projects/btc-2010/)
• Billion Triple Challenge 2011: 2B triples (http://km.aifb.kit.edu/projects/btc-2011/)
• War stories from http://www.w3.org/wiki/LargeTripleStores
  – BigOWLIM: 12B triples in Jun 2009
  – Garlik 4Store: 15B triples in Oct 2009
  – OpenLink Virtuoso: 15.4B+ triples
  – AllegroGraph: 1+ trillion triples
Queries can be complex, too
SELECT DISTINCT ?a ?b ?lat ?long WHERE {
  ?a dbpedia:spouse ?b .
  ?a dbpedia:wikilink dbpediares:actor .
  ?b dbpedia:wikilink dbpediares:actor .
  ?a dbpedia:placeOfBirth ?c .
  ?b dbpedia:placeOfBirth ?c .
  ?c owl:sameAs ?c2 .
  ?c2 pos:lat ?lat .
  ?c2 pos:long ?long .
}
Q7 on BTC2008 in [Neumann & Weikum, 2009]
What effects does the financial crisis have on migration rates in the US?
Is there a significant increase of serious weather conditions in Europe over the past 20 years?
Which glutamic-acid proteases are inhibitors of HIV?
Question Answering (QA) Systems
• KB of curated, structured data: 10 trillion (!) facts, 50k algorithms
• KB from Wikipedia and user edits: 600 million facts, 25 million entities
IBM Watson: Deep Question Answering
99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
This town is known as "Sin City" & its downtown is "Glitter Gulch"
William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
As of 2010, this is the only former Yugoslav republic in the EU
[Figure: DeepQA pipeline with question classification & decomposition feeding knowledge back-ends such as YAGO.]
www.ibm.com/innovation/us/watson/index.htm
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.
SPARQL 1.0 / 1.1
• Query language for RDF recommended by the W3C
• 3 ways to interpret RDF data:
  – instances of logical predicates ("facts")
  – graphs (subjects/objects as nodes, predicates as directed and labeled edges)
  – relations (either multiple binary relations or a single, large ternary relation)
• SPARQL's main building block:
  – select-project-join combinations of relational triple patterns, equivalent to subgraph-matching queries over a potentially very large RDF graph
SPARQL – Example
Example query: Find all actors from Ontario (that are in the knowledge base)
[Figure: example RDF graph. Jim_Carrey and Mike_Myers are actors born in Newmarket and Scarborough, both located in Ontario, Canada; Albert_Einstein (physicist, scientist, vegetarian) was born in Ulm and Otto_Hahn (chemist) in Frankfurt, both located in Germany, Europe; edges are labeled isA, bornIn, and locatedIn.]
[Figure: the two matching subgraphs, highlighting Jim_Carrey and Mike_Myers with their bornIn and locatedIn paths to Ontario.]
Find subgraphs of this form (?person and ?loc are variables; actor and Ontario are constants):
  ?person --isA--> actor
  ?person --bornIn--> ?loc
  ?loc --locatedIn--> Ontario
SELECT ?person WHERE {
  ?person isA actor .
  ?person bornIn ?loc .
  ?loc locatedIn Ontario .
}
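Conceptually, evaluating such a basic graph pattern means finding all variable bindings under which every triple pattern matches a data triple. The sketch below illustrates this with a tiny, hypothetical in-memory matcher (nested-loop evaluation over hand-written example data, not how a production engine works):

```python
# Minimal sketch: evaluate a basic graph pattern over an in-memory triple list.
# The data is hypothetical, mirroring the example graph above.
triples = [
    ("Jim_Carrey", "isA", "actor"),
    ("Mike_Myers", "isA", "actor"),
    ("Jim_Carrey", "bornIn", "Newmarket"),
    ("Mike_Myers", "bornIn", "Scarborough"),
    ("Newmarket", "locatedIn", "Ontario"),
    ("Scarborough", "locatedIn", "Ontario"),
    ("Albert_Einstein", "isA", "physicist"),
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Ulm", "locatedIn", "Germany"),
]

def match(pattern, bindings):
    """Yield extended bindings for one triple pattern (terms starting with ? are variables)."""
    for triple in triples:
        b = dict(bindings)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if term in b and b[term] != value:
                    ok = False
                    break
                b[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield b

def query(patterns):
    """Nested-loop evaluation of a conjunction of triple patterns."""
    results = [{}]
    for p in patterns:
        results = [b2 for b in results for b2 in match(p, b)]
    return results

answers = query([("?person", "isA", "actor"),
                 ("?person", "bornIn", "?loc"),
                 ("?loc", "locatedIn", "Ontario")])
print(sorted(b["?person"] for b in answers))  # ['Jim_Carrey', 'Mike_Myers']
```

Shared variables (here ?person and ?loc) act as join conditions between the patterns, which is exactly what the relational translation in Part I.2 exploits.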
SPARQL 1.0 – More Features
• Eliminate duplicates in results:
  SELECT DISTINCT ?c WHERE { ?person isA actor . ?person bornIn ?loc . ?loc locatedIn ?c }
• Return results in some order, with an optional LIMIT n clause:
  SELECT ?person WHERE { ?person isA actor . ?person bornIn ?loc . ?loc locatedIn Ontario } ORDER BY DESC(?person)
• Optional matches and filters on bound variables:
  SELECT ?person WHERE { ?person isA actor . OPTIONAL { ?person bornIn ?loc } . FILTER (!BOUND(?loc)) }
• More operators: ASK, DESCRIBE, CONSTRUCT
See: http://www.w3.org/TR/rdf-sparql-query/
SPARQL 1.1 Extensions of the W3C
• Aggregations (COUNT, AVG, …) and grouping
• Subqueries in the WHERE clause
• Safe negation: FILTER NOT EXISTS { ?x … }
  – syntactic sugar for OPTIONAL { ?x … } FILTER (!BOUND(?x))
• Expressions in the SELECT clause: SELECT (?a+?b) AS ?sum
• Label constraints on paths: ?x foaf:knows/foaf:knows/foaf:name ?name
• More functions and operators …
RDF+SPARQL: Centralized Engines
• BigOWLIM (now ontotext.com)
• OpenLink Virtuoso • OntoBroker (now semafora-systems.com)
• Apache Jena (different main-memory/relational backends)
• Sesame (now openRDF.org)
• SW-Store, Hexastore, 3Store, RDF-3X (no reasoning)
System deployments with >10^11 triples
(see http://esw.w3.org/LargeTripleStores)
SPARQL: Extensions from Research (1)
More complex graph patterns:
• Transitive paths [Anyanwu et al., WWW'07]
  SELECT ?p, ?c WHERE {
    ?p isA scientist . ?p ??r ?c . ?c isA Country . ?c locatedIn Europe .
    PathFilter(cost(??r) < 5) . PathFilter(containsAny(??r, ?t)) . ?t isA City .
  }
• Regular expressions [Kasneci et al., ICDE'08]
  SELECT ?p, ?c WHERE {
    ?p isA ?s . ?s isA scientist .
    ?p (bornIn | livesIn | citizenOf) locatedIn* Europe .
  }
Meanwhile mostly covered by the SPARQL 1.1 query proposal.
SPARQL: Extensions from Research (2)
Queries over federated RDF sources:
• Determine the distribution of triple patterns as part of the query (for example in Jena ARQ)
• Automatically route triple patterns to useful sources
  – potentially requires mapping of identifiers from different sources
SPARQL 1.1 explicitly supports federation of sources:
http://www.w3.org/TR/sparql11-federated-query/
Ranking is Essential!
• Queries often have a huge number of results:
  – "scientists from Canada"
  – "publications in databases"
  – "actors from the U.S."
• Queries may have no matches at all:
  – "Laboratoire d'informatique de Paris 6"
  – "most beautiful railway stations"
• Ranking is an integral part of search
• Huge number of app-specific ranking methods: paper/citation count, impact, salary, …
• Need for generic ranking of 1) entities and 2) facts
Extending Entities with Keywords
Remember: entities occur in facts & in documents
⇒ Associate entities with terms in those documents, keywords in URIs, literals, … (the context of an entity)
Example context terms: chancellor, Germany, scientist, election, Stuttgart21, Guido Westerwelle, France, Nicolas Sarkozy
Extensions: Keywords
• Consider witnesses/sources (provenance meta-facts)
• Allow text predicates with each triple pattern (à la XQuery Full-Text)
Problem: not everything is triplified!

European composers who have won the Oscar, whose music appeared in dramatic western scenes, and who also wrote classical pieces?

SELECT ?p WHERE {
  ?p instanceOf Composer .
  ?p bornIn ?t .
  ?t inCountry ?c .
  ?c locatedIn Europe .
  ?p hasWon ?a .
  ?a Name AcademyAward .
  ?p contributedTo ?movie [western, gunfight, duel, sunset] .
  ?p composed ?music [classical, orchestra, cantata, opera] .
}

Semantics:
• triples match the structural predicates
• witnesses match the text predicates

French politicians married to Italian singers?

SELECT ?p1, ?p2 WHERE {
  ?p1 instanceOf ?c1 [France, politics] .
  ?p2 instanceOf ?c2 [Italy, singer] .
  ?p1 marriedTo ?p2 .
}

CS researchers whose advisors worked on the Manhattan project?

SELECT ?r, ?a WHERE {
  ?r instOf researcher ["computer science"] .
  ?a workedOn ?x ["Manhattan project"] .
  ?r hasAdvisor ?a .
}

or, with relaxed structure:

SELECT ?r, ?a WHERE {
  ?r ?p1 ?o1 ["computer science"] .
  ?a ?p2 ?o2 ["Manhattan project"] .
  ?r ?p3 ?a .
}

Proximity of keywords or phrases boosts expressiveness.

CLEF/INEX 2012-13 Linked Data Track
<topic id="2012374" category="Politics">
  <jeopardy_clue>Which German politician is a successor of another politician who stepped down before his or her actual term was over, and what is the name of their political ancestor?</jeopardy_clue>
  <keyword_title>German politicians successor other stepped down before actual term name ancestor</keyword_title>
  <sparql_ft>
    SELECT ?s ?s1 WHERE {
      ?s rdf:type <http://dbpedia.org/class/yago/GermanPoliticians> .
      ?s1 <http://dbpedia.org/property/successor> ?s .
      FILTER FTContains (?s, "stepped down early") .
    }
  </sparql_ft>
</topic>
https://inex.mmci.uni-saarland.de/tracks/lod/
Extensions: Keywords / Multiple Languages
Multilingual Question Answering over Linked Data (QALD-3), CLEF 2011-13
<question id="4" answertype="resource" aggregation="false" onlydbo="true">
  <string lang="en">Which river does the Brooklyn Bridge cross?</string>
  <string lang="de">Welchen Fluss überspannt die Brooklyn Bridge?</string>
  <string lang="es">¿Por qué río cruza la Brooklyn Bridge?</string>
  <string lang="it">Quale fiume attraversa il ponte di Brooklyn?</string>
  <string lang="fr">Quelle cours d'eau est traversé par le pont de Brooklyn?</string>
  <string lang="nl">Welke rivier overspant de Brooklyn Bridge?</string>
  <keywords lang="en">river, cross, Brooklyn Bridge</keywords>
  <keywords lang="de">Fluss, überspannen, Brooklyn Bridge</keywords>
  <keywords lang="es">río, cruza, Brooklyn Bridge</keywords>
  <keywords lang="it">fiume, attraversare, ponte di Brooklyn</keywords>
  <keywords lang="fr">cours d'eau, pont de Brooklyn</keywords>
  <keywords lang="nl">rivier, Brooklyn Bridge, overspant</keywords>
  <query>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX res: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?uri WHERE { res:Brooklyn_Bridge dbo:crosses ?uri . }
  </query>
</question>
http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/
What Makes a Fact “Good”?
Confidence: prefer results that are likely correct
• accuracy of information extraction
• trust in sources (authenticity, authority)
  Example:
  bornIn(Jim Gray, San Francisco) from "Jim Gray was born in San Francisco" (en.wikipedia.org)
  livesIn(Michael Jackson, Tibet) from "Fans believe Jacko hides in Tibet" (www.michaeljacksonsightings.com)

Informativeness: prefer results with salient facts; statistical estimation from:
• frequency in the answer
• frequency on the Web
• frequency in query logs
  Example: for q: Einstein isa ?, compare "Einstein isa scientist" vs. "Einstein isa vegetarian";
  for q: ?x isa vegetarian, compare "Einstein isa vegetarian" vs. "Whocares isa vegetarian"

Conciseness: prefer results that are tightly connected
• size of the answer graph
• weight of the Steiner tree
  Example: "Einstein won NobelPrize, Bohr won NobelPrize" vs. "Einstein isa vegetarian, Cruise isa vegetarian, Cruise born 1962, Bohr died 1962"

Diversity: prefer a variety of facts
  Example: "E won …, E discovered …, E played …" rather than "E won …" four times
How Can We Implement This?
• Confidence (results that are likely correct): empirical accuracy of information extraction and a PageRank-style estimate of trust, combined into max { accuracy(f,s) * trust(s) | s ∈ witnesses(f) }
• Informativeness (salient facts): statistical language models [Zhai et al., Elbassuoni et al.]
• Conciseness (tightly connected results): graph algorithms (BANKS, STAR, …) [S. Chakrabarti et al., G. Kasneci et al., …]
• Diversity (variety of facts): PageRank-style entity/fact ranking [V. Hristidis et al., S. Chakrabarti, …], or IR models (tf*idf, …) [K. Chang et al., …] and statistical language models [de Rijke et al.]
Outline for Part I
• Part I.1: Foundations
  – Introduction to RDF
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook
RDF in Rowstores
• Rowstore: a general relational database, storing relations (incl. facts) as complete rows (MySQL, PostgreSQL, Oracle, DB2, SQL Server, …)
• General principles:
  – store triples in one giant three-attribute table (subject, predicate, object)
  – convert SPARQL to equivalent SQL
  – the database will do the rest
• Strings are often mapped to unique integer IDs
• Used by many triple stores, including 3Store, Jena, HexaStore, RDF-3X, …
Simple extension to quadruples (with graph id): (graph, subject, predicate, object). We consider only triples for simplicity!
Example: Single Triple Table

ex:Katja ex:teaches ex:Databases ;
         ex:works_for ex:MPI_Informatics ;
         ex:PhD_from ex:TU_Ilmenau .
ex:Martin ex:teaches ex:Databases ;
          ex:works_for ex:MPI_Informatics ;
          ex:PhD_from ex:Saarland_University .
ex:Ralf ex:teaches ex:Information_Retrieval ;
        ex:PhD_from ex:Saarland_University ;
        ex:works_for ex:Saarland_University , ex:MPI_Informatics .

subject   | predicate    | object
----------|--------------|--------------------------
ex:Katja  | ex:teaches   | ex:Databases
ex:Katja  | ex:works_for | ex:MPI_Informatics
ex:Katja  | ex:PhD_from  | ex:TU_Ilmenau
ex:Martin | ex:teaches   | ex:Databases
ex:Martin | ex:works_for | ex:MPI_Informatics
ex:Martin | ex:PhD_from  | ex:Saarland_University
ex:Ralf   | ex:teaches   | ex:Information_Retrieval
ex:Ralf   | ex:PhD_from  | ex:Saarland_University
ex:Ralf   | ex:works_for | ex:Saarland_University
ex:Ralf   | ex:works_for | ex:MPI_Informatics
Conversion of SPARQL to SQL
General approach to translate SPARQL into SQL:
(1) each triple pattern is translated into a (self-) join over the triple table
(2) shared variables create JOIN conditions
(3) constants create WHERE conditions
(4) FILTER conditions create WHERE conditions
(5) OPTIONAL clauses create OUTER JOINs
(6) UNION clauses create UNION expressions
Example: Conversion to SQL Query

SELECT ?a ?b ?t WHERE {
  ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u .
}
OPTIONAL { ?a teaches ?t }
FILTER (regex(?u, "Saar"))

Applying rules (1)-(4) step by step yields:

SELECT P1.subject AS A, P2.subject AS B
FROM Triples P1, Triples P2, Triples P3
WHERE P1.predicate='works_for' AND P2.predicate='works_for'
  AND P3.predicate='phd_from'
  AND P1.object=P2.object AND P1.subject=P3.subject
  AND P1.object=P3.object
  AND REGEXP_LIKE(P1.object, 'Saar')

Adding the OPTIONAL clause (rule 5) as a left outer join:

SELECT R1.A, R1.B, R2.T
FROM (
  SELECT P1.subject AS A, P2.subject AS B
  FROM Triples P1, Triples P2, Triples P3
  WHERE P1.predicate='works_for' AND P2.predicate='works_for'
    AND P3.predicate='phd_from'
    AND P1.object=P2.object AND P1.subject=P3.subject
    AND P1.object=P3.object
    AND REGEXP_LIKE(P1.object, 'Saar')
) R1
LEFT OUTER JOIN (
  SELECT P4.subject AS A, P4.object AS T
  FROM Triples P4
  WHERE P4.predicate='teaches'
) R2 ON (R1.A=R2.A)
[Figure: operator tree for this query, joining the scans P1-P3 on ?u and ?a, applying the filter regex(?u, "Saar"), left-outer-joining the teaches scan P4 on ?a, and projecting the final result.]
Is that all? Well, no.
• Which indexes should be built (to support efficient evaluation of triple patterns)?
• How can we reduce storage space?
• How can we find the best execution plan?
Existing databases need modifications:
• flexible, extensible, generic storage is not needed here
• they cannot deal with multiple self-joins of a single table
• they often generate bad execution plans
Dictionary for Strings
Map all strings to unique integers (e.g., via hashing)
• Regular size (4-8 bytes), much easier to handle
• Dictionary is usually small and can be kept in main memory
  <http://example.de/Katja>  → 194760
  <http://example.de/Martin> → 679375
  <http://example.de/Ralf>   → 4634
But beware:
• this may break the original lexicographic sorting order
• RANGE conditions (not in SPARQL) become difficult
• FILTER conditions may become more expensive
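A minimal dictionary can be sketched in a few lines. Note this sketch assigns IDs sequentially for brevity, whereas real systems typically hash or bulk-load the dictionary, as described above:

```python
# Sketch of a string dictionary: map each URI/literal to a compact integer ID
# and keep the reverse mapping for decoding query results.
class Dictionary:
    def __init__(self):
        self.str2id, self.id2str = {}, []

    def encode(self, s):
        if s not in self.str2id:
            self.str2id[s] = len(self.id2str)  # next free ID
            self.id2str.append(s)
        return self.str2id[s]

    def decode(self, i):
        return self.id2str[i]

d = Dictionary()
encoded = [tuple(d.encode(x) for x in t) for t in
           [("ex:Katja", "ex:teaches", "ex:Databases"),
            ("ex:Martin", "ex:teaches", "ex:Databases")]]
print(encoded)            # [(0, 1, 2), (3, 1, 2)]
print(d.decode(2))        # ex:Databases
```

All further processing (indexes, joins, compression) then operates on small fixed-size ID triples; strings are only decoded when results are returned.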
Indexes for Commonly Used Triple Patterns
Patterns with a single variable are frequent. Example: Albert_Einstein invented ?x
⇒ Build a clustered index over (s,p,o); it can also be used for patterns like Albert_Einstein ?p ?x
Build similar clustered indexes for all six permutations (3 x 2 x 1 = 6):
• SPO, POS, OSP to cover all possible triple patterns
• SOP, OPS, PSO to have all sort orders for patterns with two variables
All triples are stored in (s,p,o) order in the leaves of a B+ tree for easy access:
(16,19,5356) (16,24,567) (16,24,876) (27,19,643) (27,48,10486) (50,10,10456) …
Lookup of Albert_Einstein invented ?x:
1. Look up the IDs of the constants: Albert_Einstein → 16, invented → 24
2. Look up the known prefix in the index: (16,24,0)
3. Read results while the prefix matches: (16,24,567), (16,24,876); they arrive already sorted!
The triple table is no longer needed; all triples are contained in each index.
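The three lookup steps can be sketched with binary search over a sorted list standing in for the B+ tree leaves (a sketch only; a real engine scans leaf pages, not a Python list):

```python
import bisect

# Sketch: all ID triples sorted in (s,p,o) order; a prefix lookup finds the
# first matching position and scans while the prefix still matches, just as
# a range scan over B+ tree leaves would.
spo = sorted([(16, 19, 5356), (16, 24, 567), (16, 24, 876),
              (27, 19, 643), (27, 48, 10486), (50, 10, 10456)])

def prefix_scan(index, *prefix):
    # pad the prefix with 0s to locate the leftmost matching triple
    pos = bisect.bisect_left(index, prefix + (0,) * (3 - len(prefix)))
    while pos < len(index) and index[pos][:len(prefix)] == prefix:
        yield index[pos]
        pos += 1

# Albert_Einstein (=16) invented (=24) ?x  ->  scan prefix (16, 24)
print(list(prefix_scan(spo, 16, 24)))  # [(16, 24, 567), (16, 24, 876)]
```

Because results come out of the scan already sorted by the remaining attributes, they can feed directly into the merge joins discussed next.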
Why Sort Order Matters for Joins
When both inputs are sorted by the join attribute, use a Merge Join:
• sequentially scan both inputs
• immediately join matching triples
• skip over parts without matches
• allows pipelining
When the inputs are unsorted or sorted by the wrong attribute, use a Hash Join:
• build a hash table from one input
• scan the other input and probe the hash table
• needs to touch every input triple
• breaks pipelining
[Figure: two subject-sorted triple lists combined by a merge join (MJ) vs. an unsorted input probed through a hash table (HJ).]
In general, Merge Joins are preferable: small memory footprint, pipelining.
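The merge join itself is short enough to sketch; both inputs must be sorted on the join attribute (here the first component of each triple), and duplicate keys on both sides are joined group-by-group:

```python
# Sketch of a merge join on the join attribute (first tuple component).
# Both inputs must already be sorted on that attribute; matching groups are
# combined, non-matching stretches are skipped, and no hash table is built.
def merge_join(left, right, key=lambda t: t[0]):
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # collect the full group of equal keys on both sides
            i2 = i
            while i2 < len(left) and key(left[i2]) == kl:
                i2 += 1
            j2 = j
            while j2 < len(right) and key(right[j2]) == kr:
                j2 += 1
            for l in left[i:i2]:
                for r in right[j:j2]:
                    yield l, r
            i, j = i2, j2

left = [(16, 19, 5356), (16, 24, 567), (27, 19, 643)]
right = [(16, 33, 46578), (24, 16, 1353), (27, 18, 133)]
print(sum(1 for _ in merge_join(left, right)))  # 3 result pairs
```

Because the function is a generator, results stream out as soon as they are found, which is exactly the pipelining advantage noted above.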
RDF-3X: Even More Indexes!
SPARQL 1.0 considers duplicates (unless removed with DISTINCT) but does not (yet) support aggregates/counting
⇒ often queries with many duplicates, like
  SELECT ?x WHERE { ?x ?y Germany . }
to retrieve entities related to Germany (but counts may be important to the application!)
⇒ this materializes many identical intermediate results
Solution: even more redundancy!
• Pre-compute aggregated indexes SP, SO, PO, PS, OP, OS, S, P, O
  Example: SO contains, for each pair (s,o), the number of triples with subject s and object o
• Do not materialize identical bindings, but keep counts
  Example: ?x=Albert_Einstein: 4; ?x=Angela_Merkel: 10
• 15 indexes overall (all SPO permutations + their unique subsets)
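Conceptually, an aggregated index is just a counting structure over a projection of the triples; a minimal sketch (a plain counter rather than RDF-3X's compressed on-disk layout):

```python
from collections import Counter

# Sketch: an aggregated SO index stores, per (subject, object) pair, the
# number of triples connecting them, instead of materializing every binding.
triples = [(16, 19, 5356), (16, 24, 5356), (16, 48, 5356), (27, 19, 643)]

so_index = Counter((s, o) for s, p, o in triples)

# Subject 16 and object 5356 are connected by 3 triples (predicates 19, 24, 48);
# a query that projects the predicate away can return this count directly.
print(so_index[(16, 5356)])  # 3
```

A query like SELECT ?x WHERE { ?x ?y Germany . } can then be answered from such an index without ever touching the full triples.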
RDF-3X: Compression Scheme for Triples
• Compress sequences of triples in lexicographic order (v1,v2,v3); for SPO: v1=S, v2=P, v3=O
• Step 1: compute per-attribute deltas
  (16,19,5356) (16,24,567) (16,24,676) (27,19,643) (27,48,10486) (50,10,10456)
  becomes
  (16,19,5356) (0,5,-4789) (0,0,109) (11,-5,-33) (0,29,9843) (23,-38,-30)
• Step 2: variable-byte encoding for each delta triple (1-13 bytes):
  a gap bit and a 7-bit header, followed by 0-4 bytes for each of the three deltas.
  When gap=1, the delta of value 3 is included in the header and all others are 0;
  otherwise, the header contains the length of the encoding for each of the three deltas
  (5*5*5=125 combinations, stored in 7 bits).
Many variants exist; this one is designed for triples.
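The two steps can be sketched as follows. Note this is a simplified stand-in: it uses zig-zag plus generic variable-byte coding for the signed deltas instead of RDF-3X's packed gap-bit/header-byte layout, so the exact byte counts differ from the scheme above:

```python
# Sketch of the two compression steps: per-attribute deltas over the sorted
# triple sequence, then a variable-byte code (7 payload bits per byte, high
# bit marks continuation). RDF-3X's header byte is not modeled; signed
# deltas are handled with a zig-zag transform instead.
def deltas(triples):
    prev = (0, 0, 0)
    out = []
    for t in triples:
        out.append(tuple(a - b for a, b in zip(t, prev)))
        prev = t
    return out

def vbyte(n):
    n = 2 * n if n >= 0 else -2 * n - 1   # zig-zag: signed -> unsigned
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

triples = [(16, 19, 5356), (16, 24, 567), (16, 24, 876)]
encoded = b"".join(vbyte(v) for d in deltas(triples) for v in d)
print(len(encoded), "bytes instead of", 3 * 3 * 4)  # 12 bytes instead of 36
```

Because consecutive triples in a sorted index share long prefixes, most deltas are tiny and fit into one or two bytes, which is where the compression comes from.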
Compression Effectiveness vs. Efficiency
• Byte-level encoding is almost as effective as bit-level encoding techniques (Gamma, Golomb, Rice, etc.)
• Much faster (~10x) for decompression
• Example for the Barton dataset [Neumann & Weikum: VLDB'10]:
  – raw data: 51 million triples, 7 GB uncompressed (as N-Triples)
  – all 6 main indexes: 1.1 GB in size, 3.2 s decompression with byte-level encoding
• Optionally: additional compression with LZ77
  – 2x more compact, but much slower to decompress
  – compression is always applied at the page level
Back to the Example Query

SELECT ?a ?b ?t WHERE {
  ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u .
}
OPTIONAL { ?a teaches ?t }
FILTER (regex(?u, "Saar"))

Which of the two plans is better? How many intermediate results?
[Figure: two physical plans over the index scans POS(works_for), POS/PSO(phd_from), PSO(works_for), and POS(teaches) with the filter regex(?u, "Saar"). One plan uses only merge joins; the other forces a hash join. Annotated intermediate-result cardinalities (1000-triple scans reduced to 5 intermediate results in the good plan vs. 2500 in the bad one) show why the join order matters.]

Core ingredients of a good query optimizer are selectivity estimators for triple patterns (index scans) and joins.
RDF-3X: Selectivity Estimation
How many results will a triple pattern have?
Standard databases:
• per-attribute histograms
• assume independence of attributes
⇒ too simplistic and inexact for triples
Instead:
• use the aggregated indexes for exact per-pattern counts
• keep additional join statistics for triple blocks (pages)
• assume independence between triple patterns, but additionally precompute exact statistics for frequent paths in the data
Handling Updates
What should we do when our data changes? (SPARQL 1.1 has updates!)
Assumptions:
• queries are far more frequent than updates
• updates are mostly insertions, hardly any deletions
• different applications may update concurrently
Solution: differential indexing

RDF-3X: Differential Updates
Staging architecture: each application (A, B, …) inserts triples into its own workspace, which is kept in main memory; on-demand indexes are built over the workspaces at query time, and on completion of A or B the workspace triples are merged into the main indexes.
Deletions:
• insert the same triple again with a "deleted" flag
• modify the scan/join operators to merge the differential indexes with the main index
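The merged scan can be sketched as a sorted merge of the main index with the differential workspace, honoring deletion flags. The data layout below is hypothetical, chosen only to illustrate the idea:

```python
import heapq

# Sketch: scan the main index and a differential workspace together.
# Both are sorted; a triple flagged "delete" in the workspace suppresses
# its copy in the main index, an "insert" is merged in at the right place.
main = [(16, 19, 5356), (16, 24, 567), (27, 19, 643)]
diff = [((16, 24, 876), "insert"), ((27, 19, 643), "delete")]

def merged_scan(main, diff):
    deleted = {t for t, op in diff if op == "delete"}
    inserts = [t for t, op in diff if op == "insert"]
    for t in heapq.merge(main, sorted(inserts)):   # both inputs sorted
        if t not in deleted:
            yield t

print(list(merged_scan(main, diff)))
# [(16, 19, 5356), (16, 24, 567), (16, 24, 876)]
```

Since the merged output is still sorted in index order, downstream merge joins work unchanged, which is why only the scan operators need modification.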
Outline for Part I
• Part I.1: Foundations
  – Introduction to RDF
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook
Principles
Observations and assumptions:
• not too many different predicates
• triple patterns usually have a fixed predicate
• need to access all triples with one predicate
Design consequence:
• use one two-attribute table (subject, object) for each predicate
Example: Columnstores

ex:Katja ex:teaches ex:Databases ;
         ex:works_for ex:MPI_Informatics ;
         ex:PhD_from ex:TU_Ilmenau .
ex:Martin ex:teaches ex:Databases ;
          ex:works_for ex:MPI_Informatics ;
          ex:PhD_from ex:Saarland_University .
ex:Ralf ex:teaches ex:Information_Retrieval ;
        ex:PhD_from ex:Saarland_University ;
        ex:works_for ex:Saarland_University , ex:MPI_Informatics .

PhD_from:
subject   | object
ex:Katja  | ex:TU_Ilmenau
ex:Martin | ex:Saarland_University
ex:Ralf   | ex:Saarland_University

works_for:
subject   | object
ex:Katja  | ex:MPI_Informatics
ex:Martin | ex:MPI_Informatics
ex:Ralf   | ex:Saarland_University
ex:Ralf   | ex:MPI_Informatics

teaches:
subject   | object
ex:Katja  | ex:Databases
ex:Martin | ex:Databases
ex:Ralf   | ex:Information_Retrieval
Simplified Example: Query Conversion

SELECT ?a ?b ?t WHERE {
  ?a works_for ?u . ?b works_for ?u . ?a phd_from ?u .
}

becomes

SELECT W1.subject AS A, W2.subject AS B
FROM works_for W1, works_for W2, phd_from P3
WHERE W1.object=W2.object
  AND W1.subject=P3.subject
  AND W1.object=P3.object

So far, this is yet another relational representation of RDF. So, what is a columnstore?
Columnstores and RDF
Columnstores store all columns of a table separately. The two-column table

PhD_from:
subject   | object
ex:Katja  | ex:TU_Ilmenau
ex:Martin | ex:Saarland_University
ex:Ralf   | ex:Saarland_University

becomes two single columns:

PhD_from:subject          PhD_from:object
ex:Katja                  ex:TU_Ilmenau
ex:Martin                 ex:Saarland_University
ex:Ralf                   ex:Saarland_University

Advantages:
• fast if only the subject or only the object is accessed, not both
• allows for a very compact representation
Problems:
• columns must be recombined if subject and object are accessed
• inefficient for triple patterns with a predicate variable

Compression in Columnstores
General ideas:
• store each subject only once
• use the same order of subjects for all columns, including NULL values when necessary
• apply additional compression to get rid of the NULL values

subject   | PhD_from               | works_for              | teaches
ex:Katja  | ex:TU_Ilmenau          | ex:MPI_Informatics     | ex:Databases
ex:Martin | ex:Saarland_University | ex:MPI_Informatics     | ex:Databases
ex:Ralf   | ex:Saarland_University | ex:Saarland_University | ex:Information_Retrieval
ex:Ralf   | NULL                   | ex:MPI_Informatics     | NULL

NULL suppression, e.g.:
PhD_from: bit[1110] ex:TU_Ilmenau, ex:Saarland_University, ex:Saarland_University
teaches: range[1-3] ex:Databases, ex:Databases, ex:Information_Retrieval
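The bit-vector trick can be sketched directly: store the column without its NULLs plus one presence bit per row, and recover row i's value by counting the set bits before position i (a rank operation; real columnstores use precomputed rank structures rather than a linear count):

```python
# Sketch: NULL suppression via a presence bit vector. The value for row i
# sits at position rank(i) in the dense value list, where rank(i) is the
# number of set bits before position i.
def compress(column):
    bits = [v is not None for v in column]
    values = [v for v in column if v is not None]
    return bits, values

def lookup(bits, values, row):
    if not bits[row]:
        return None                      # the row's value was NULL
    return values[sum(bits[:row])]       # rank: set bits before this row

phd_from = ["ex:TU_Ilmenau", "ex:Saarland_University",
            "ex:Saarland_University", None]   # Ralf's second row has no PhD_from
bits, values = compress(phd_from)
print(bits)                       # [True, True, True, False] -> bit[1110]
print(lookup(bits, values, 2))    # ex:Saarland_University
```

The range[1-3] variant shown above is the even simpler special case where all non-NULL rows are contiguous, so only the first and last populated row numbers need to be stored.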
Outline for Part I
• Part I.1: Foundations
  – Introduction to RDF
  – A short overview of SPARQL
• Part I.2: Rowstore Solutions
• Part I.3: Columnstore Solutions
• Part I.4: Other Solutions and Outlook
Property Tables
Group entities with similar predicates into a relational table (for example using RDF types or a clustering algorithm).

ex:Katja ex:teaches ex:Databases ;
         ex:works_for ex:MPI_Informatics ;
         ex:PhD_from ex:TU_Ilmenau .
ex:Martin ex:teaches ex:Databases ;
          ex:works_for ex:MPI_Informatics ;
          ex:PhD_from ex:Saarland_University .
ex:Ralf ex:teaches ex:Information_Retrieval ;
        ex:PhD_from ex:Saarland_University ;
        ex:works_for ex:Saarland_University , ex:MPI_Informatics .

Property table (NULLs appear when an entity lacks a predicate, e.g. ex:Axel):
subject   | teaches      | PhD_from
ex:Katja  | ex:Databases | ex:TU_Ilmenau
ex:Martin | ex:Databases | ex:Saarland_University
ex:Ralf   | ex:IR        | ex:Saarland_University
ex:Axel   | NULL         | ex:TU_Vienna

"Leftover triples" (here: the multi-valued works_for predicate) go into a regular triple table:
subject   | predicate    | object
ex:Katja  | ex:works_for | ex:MPI_Informatics
ex:Martin | ex:works_for | ex:MPI_Informatics
ex:Ralf   | ex:works_for | ex:Saarland_University
ex:Ralf   | ex:works_for | ex:MPI_Informatics
Property Tables: Pros and Cons
Advantages:
• more in the spirit of existing relational systems
• saves many self-joins over triple tables etc.
Disadvantages:
• potentially many NULL values
• multi-valued attributes are problematic
• query mapping depends on the schema
• schema changes are very expensive
Even More Systems…
• Store RDF data as a sparse matrix with bit-vector compression [BitMat, Hendler et al.: ISWC'09]
• Convert RDF into XML and use XML methods (XPath, XQuery, …)
• Store RDF data in graph databases and perform bi-simulation [Fletcher et al.: ESWC'12] or employ specialized graph index structures [gStore, Zou et al.: PVLDB'11]
• And many more … see our list of readings.
Which Technique is Best?
• Performance depends a lot on precomputation, optimization, implementation, fine-tuning, …
• Comparative results on BTC 2008 (from [Neumann & Weikum, 2009]):
[Figure: runtime comparison of RDF-3X, RDF-3X (2008), COLSTORE, and ROWSTORE.]
Challenges and Opportunities
• SPARQL with different entailment regimes
• New SPARQL 1.1 features (grouping, aggregation, updates)
• User-oriented ranking of query results
  – efficient top-k operators
  – effective scoring methods for structured queries
• What are the limits of a centralized RDF engine?
• Dealing with uncertain RDF data: what is the most likely query answer?
  – triples with probabilities ⇒ probabilistic databases
Outline of this Tutorial
• Part I: RDF in Centralized Relational Databases
• Part II: RDF in Distributed Settings
• Part III: Managing Uncertain RDF Data
Outline for Part II
• Part II.1: Search Engines for the Semantic Web
• Part II.2: Mediator-based and Federated Architectures
Semantic Web Search Engines
• Querying RDF data collections started by adapting existing search engines to RDF data:
  – crawling for .rdf files and HTML documents with embedded RDF content (see: RDFa microformat)
  – indexing & search based on keywords extracted from entity and property names
  – usually generate a virtual document per entity (string literals and human-readable names)
• Swoogle [Ding et al., CIKM'04] (University of Maryland)
• Falcons [Cheng et al., WWW'08] (Nanjing University)
Classification of Distributed Approaches
Approaches for querying distributed and potentially heterogeneous (RDF) data sources:
• Materialization-based approaches (data warehousing):
  – MapReduce/Hadoop: Shard, Jena-HBase [Abadi et al. PVLDB'11]
  – shared-memory architectures (message passing, RMI, etc.): Trinity (MSR)
  – shared-nothing architectures: Partout, 4Store, Eagre
• Virtual integration approaches:
  – mediator-based and federated systems: DARQ, FedX, YARS2
  – Peer-2-Peer: GridVine, RDFPeers
How to Integrate Data Sources?
• Ship and integrate data from different RDF sources to the client.
• Three common approaches:
  – query-driven (single mediator)
  – database federations (exported schemas)
  – warehousing (fully integrated & centrally managed)
Query-Driven Approach
[Figure: a SPARQL client sends a query to a mediator, which forwards subqueries through wrappers to several RDF sources and integrates their results.]
List of SPARQL endpoints: http://www.w3.org/wiki/SparqlEndpoints
DBpedia: http://dbpedia.org/sparql
YAGO: https://d5gate.ag5.mpi-sb.mpg.de/webyagospo/Browser
Advantages of Query-Driven Integration
• No need to copy data
  – no or little own storage costs
  – no need to purchase data
• Potentially more up-to-date data
• Mediator holds the catalog (statistics, etc.) and may optimize queries
• Only a generic query interface is needed at the sources (SPARQL endpoints)
• May be less draining on the sources
• Sources are often even unaware of their participation
Federation-based Approach
[Figure: SPARQL clients send queries against a federated schema; each source (Source 1 … Source n) exports its local schema as an exported schema into the federation and returns results for the subqueries routed to it.]
Advantages of Federation-Based Integration
• Very similar to query-driven integration, except
  – that the sources know that they are part of a federation;
  – and they export their local schemas into a federated schema.
• An intermediate step toward full integration of the data in a single "warehouse".
Warehousing Architecture
[Figure: RDF sources feed an integration layer that loads a central warehouse (with metadata); SPARQL clients run query & analysis against the warehouse.]
Integrated LOD index: http://lod2.openlinksw.com/sparql
Advantages of Warehousing
• Perform Extract-Transform-Load (ETL) processes with periodic updates over the sources
• High query performance
• Local processing at the sources is unaffected
• Can operate even when sources are offline
• Can query data that is no longer stored at the sources
• More detailed statistics and metadata available at the warehouse
  – modify, summarize (store aggregates), analyze
  – add historical information, provenance, timestamps, etc.
DARQ [Leser et al., Humboldt University Berlin, ISWC'08]
• Classical mediator-based architecture connecting a given SPARQL endpoint to other endpoints via a combination of wrappers and service descriptions.
• Service descriptions:
  – RDF data descriptions
  – statistical information
  – binding constraints
• Query optimizer based on rewriting rules and cost estimations for physical join operators.
FedEx [fluid Op’s & MPI-INF: ISWC’11]
• Online query optimization over federations of SPARQL endpoints.
• Cost estimates based on result sizes of SPARQL ASK queries.• “Bound nested-loop joins” by grouping sets of variable bindings
into SPARQL UNION queries (instead of using FILTER conditions):
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory drugbank-category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
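A minimal Python sketch (not FedX's actual implementation) of the bound nested-loop join idea: instead of sending one remote request per binding, a set of bindings for one variable is instantiated into a single SPARQL query whose branches are combined with UNION. The helper name, the `BIND`-based branch tagging, and the example values are illustrative assumptions.

```python
# Sketch: group bindings for one variable into a single UNION query,
# so a bound nested-loop join needs one remote request per group
# instead of one request per binding.

def bound_join_union(template: str, var: str, bindings: list) -> str:
    """Instantiate `template` once per binding and combine the
    instantiations with UNION; each branch tags its binding index."""
    branches = []
    for i, value in enumerate(bindings):
        instantiated = template.replace(var, value)
        branches.append("{ %s BIND(%d AS ?branch) }" % (instantiated, i))
    return "SELECT * WHERE { %s }" % " UNION ".join(branches)

# Hypothetical bindings for ?id obtained from the first endpoint:
query = bound_join_union(
    "?keggDrug bio2rdf:xRef ?id .", "?id",
    ['"D00001"', '"D00002"', '"D00003"'])
```

The `?branch` tag lets the client route each result row back to the binding that produced it when merging join results.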
Partout [Galárraga, Hose, Schenkel: PVLDB'13]
• Materialization-based, distributed & workload-aware SPARQL engine.
• Distribution helps to scale-out query processing via parallel join executions.
• Triple fragments are distributed over hosts H1…Hn by
  – (1) maximizing query locality, and
  – (2) balancing the hosts' workload.
• H1…Hn run local RDF-3x instances, with
  – (1) local S,P,O statistics kept by RDF-3x, and
  – (2) global (cached) statistics.
Global query workload (aka. “query log”) Global query graph
Partout Example Query Plan
• H1, H2, H3 hold triples for ?s rdf:type db:city
• H1 holds triples for ?s db:located db:Germany
• H2 holds triples for ?s db:name ?name
More Distributed RDF Engines• Shard TripleStore (Hadoop + Hash-Partitioning)
• RDFPeers (P2P/Chord architecture) [Cai et al., WWW’04]
• Gridvine (P2P/Chord architecture) [Aberer et al., VLDB’07]
• YARS2 (federated architecture) [Decker et al., ISWC'07]
• Jena-HBase (Hadoop & HBase) [Khadilkar et al., ISWC’12]
• Scalable SPARQL querying of large RDF graphs (Hadoop + RDF-3x) [Huang, Abadi et al., PVLDB'11]
• 4Store (materialized, shared-nothing) [Harris et al., SSWS’09]
• Eagre (materialized, shared-nothing) [HKUST & HP Labs, ICDE’13]
• Trinity (materialized, shared-memory, message passing) [MSR, SIGMOD’13]
more in Zoi’s tutorial in the afternoon…
Outline of this Tutorial
• Part I–RDF in Centralized Relational Databases
• Part II–RDF in Distributed Settings
• Part III–Managing Uncertain RDF Data
Outline for Part III• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond• Part III.3: Probabilistic Database Engines
– Stanford Trio Project– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data– URDF @ Max Planck Institute
What is “Uncertain” Data?
"Certain" Data                              "Uncertain" Data
Temperature is 74.634589 F                  Sensor reported 75 ± 0.5 F
Bob works for Yahoo                         Bob works for either Yahoo or Microsoft
Mary sighted a Finch                        Mary sighted either a Finch (60%) or a Sparrow (40%)
It always rains in Galway                   There is an 89% chance of rain in Galway tomorrow
Yahoo stocks will be at 100 in a month      Yahoo stock will be between 60 and 120 in a month
John's age is 23                            John's age is in [20,30]
… And Why Does It Arise?
"Certain" Data                              "Uncertain" Data
Temperature is 74.634589 F                  Sensor reported 75 ± 0.5 F
Bob works for Yahoo                         Bob works for either Yahoo or Microsoft
Mary sighted a Finch                        Mary sighted either a Finch (60%) or a Sparrow (40%)
It always rains in Galway                   There is an 89% chance of rain in Galway tomorrow
Yahoo stocks will be at 100 in a month      Yahoo stock will be between 60 and 120 in a month
John's age is 23                            John's age is in [20,30]
Precision of devices
Lack of exact information(alternatives and missing values)
Uncertainty about futureevents
Anonymization
Applications: Deduplication
Record 1: Name = John Doe
Record 2: Name = J. Doe
? 80% match
Applications: Information Integration
Source 1 schema: name, hPhone, oPhone, hAddr, oAddr
Source 2 schema: name, phone, address
Combined View, with uncertainty arising
• at the schema level: "schema integration"
• at the instance level: "record linkage"
Applications: Information Extraction (I)
Restaurant        Zip
Hard Rock Cafe    94111 ∥ 94133 ∥ 94109
Applications: Information Extraction (II)
Subj.     Pred.           Obj.
Galway    type            City
Galway    locatedIn       Ireland
Galway    hasPopulation   75,414
Galway    areaCode        091
Galway    namedAfter      Gaillimh_River
…         …               …
bornOn(Jeff, 09/22/42)gradFrom(Jeff, Columbia)hasAdvisor(Jeff, Arthur)hasAdvisor(Surajit, Jeff)knownFor(Jeff, Theory)
type(Jeff, Author)[0.9]
author(Jeff, Drag_Book)[0.8]
author(Jeff,Cind_Book)[0.6]
worksAt(Jeff, Bell_Labs)[0.7]
type(Jeff, CEO)[0.4]
Applications: Information Extraction (III)
YAGO/DBpedia et al.
New fact candidates
>120 M facts for YAGO2(mostly from Wikipedia infoboxes)
100’s M additional facts from Wikipedia text
How do current database management systems (DBMS) handle uncertainty?
They don’t
What Do (Most) Applications Do?
• Clean: turn uncertain data into data that DBMSs can handle
• Problems: (1) loss of information, (2) errors compound and propagate insidiously
Observer   Bird-1
Mary       Finch: 80%, Sparrow: 20%
Susan      Dove: 70%, Sparrow: 30%
Jane       Hummingbird: 65%, Sparrow: 35%

After cleaning (keeping only the most likely value):

Observer   Bird-1
Mary       Finch
Susan      Dove
Jane       Hummingbird
Outline for Part III• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond• Part III.3: Probabilistic Database Engines
– Stanford Trio Project– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data– URDF @ Max Planck Institute
Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases
(Synthesis Lectures on Data Management)Morgan & Claypool Publishers, 2012
Databases Today are Deterministic
• An item either is in the database or it is not.
• A tuple either is in the query answer or it is not.
• This applies to all varieties of data models:
  – Relational, E/R, hierarchical, XML, …
What is a Probabilistic Database ?
• “A tuple belongs to the database” is a probabilistic event.
• “A tuple is an answer to the query” is a probabilistic event.
• Can be extended to all possible kinds of data models; we consider only
probabilistic relational data.
Sample Spaces & Venn Diagrams
• Sample space Ω: all possible events that can be observed; Pr(Ω) = 1.
• A random variable χt assigns a probability to an event, s.t. 0 ≤ Pr(χt) ≤ 1.
• As a convention, we will use tuple identifiers in place of random variables to denote probabilistic events, e.g.:
  – "Tuple t1 is in the database."
  – "Tuple t2 is an answer to a query."
Possible Worlds Semantics
Relational schema:
  Employee(ID:int, name:varchar(55), dob:datetime, salary:int)

Attribute domains: int, varchar(55), datetime
  # values: 2^32, 2^440, 2^64

# of possible tuples: 2^32 × 2^440 × 2^64 × 2^32
# of possible relation instances: 2^(2^32 × 2^440 × 2^64 × 2^32)

Database schema:
  Employee(…), Projects(…), Groups(…), WorksFor(…)

# of possible database instances: N (= big but finite)
The Definition
Given the finite set of all possible database instances:

  INST = {I1, I2, I3, …, IN}

Definition: A probabilistic database Ip is a probability distribution on INST,
  Pr : INST → [0,1]  s.t.  Σi=1,…,N Pr(Ii) = 1

Definition: A possible world is an instance I ∈ INST s.t. Pr(I) > 0
Example

Ip with possible worlds = {I1, I2, I3, I4}:

I1 (Pr = 1/3):   John Seattle Gizmo; John Seattle Camera; Sue Denver Gizmo
I2 (Pr = 1/12):  John Boston Gadget; Sue Denver Gizmo
I3 (Pr = 1/2):   John Seattle Gizmo; John Seattle Camera; Sue Seattle Camera
I4 (Pr = 1/12):  John Boston Gadget; Sue Seattle Camera

(Each tuple is a Customer, Address, Product triple; 1/3 + 1/12 + 1/2 + 1/12 = 1.)
Tuples as Events
One tuple t        ⇒ event "t ∈ I"
Two tuples t1, t2  ⇒ event "t1 ∈ I ∧ t2 ∈ I"

Marginal probability of t:        Pr(t) = Σ{I : t ∈ I} Pr(I)
Marginal probability of t1 ∧ t2:  Pr(t1 ∧ t2) = Σ{I : t1 ∈ I ∧ t2 ∈ I} Pr(I)
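The marginal formulas can be evaluated directly over an explicit list of worlds. A minimal Python sketch using the four-world example from the previous slide (tuple names and helper functions are mine, not from the tutorial):

```python
# Possible worlds from the example slide: (frozenset of tuples, prob).
t1 = ("John", "Seattle", "Gizmo")
t2 = ("John", "Seattle", "Camera")
t3 = ("Sue", "Denver", "Gizmo")
t4 = ("John", "Boston", "Gadget")
t5 = ("Sue", "Seattle", "Camera")

worlds = [
    (frozenset({t1, t2, t3}), 1 / 3),   # I1
    (frozenset({t4, t3}), 1 / 12),      # I2
    (frozenset({t1, t2, t5}), 1 / 2),   # I3
    (frozenset({t4, t5}), 1 / 12),      # I4
]

def marginal(t, worlds):
    """Pr(t) = sum of Pr(I) over all worlds I containing t."""
    return sum(p for (world, p) in worlds if t in world)

def marginal_and(ts, worlds):
    """Pr(t1 ∧ … ∧ tk): sum over worlds containing all the tuples."""
    return sum(p for (world, p) in worlds if all(t in world for t in ts))
```

For instance, `marginal(t1, worlds)` sums I1 and I3, giving 1/3 + 1/2 = 5/6.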
Tuple Correlations

Disjoint-AND (D∧):      Pr(t1 ∧ t2) = 0
Negatively correlated:  Pr(t1 ∧ t2) < Pr(t1) Pr(t2)
Positively correlated:  Pr(t1 ∧ t2) > Pr(t1) Pr(t2)
Identical (=):          Pr(t1 ∧ t2) = Pr(t1) = Pr(t2)
Independent-AND (I∧):   Pr(t1 ∧ t2) = Pr(t1) Pr(t2)
Independent-OR (I∨):    Pr(t1 ∨ t2) = 1 − (1 − Pr(t1))(1 − Pr(t2))
Disjoint-OR (D∨):       Pr(t1 ∨ t2) = Pr(t1) + Pr(t2)
NOT (¬):                Pr(¬t1) = 1 − Pr(t1)
Example with Correlations

I1 (Pr = 1/3):   John Seattle Gizmo; John Seattle Camera; Sue Denver Gizmo
I2 (Pr = 1/12):  John Boston Gadget; Sue Denver Gizmo
I3 (Pr = 1/2):   John Seattle Gizmo; John Seattle Camera; Sue Seattle Camera
I4 (Pr = 1/12):  John Boston Gadget; Sue Seattle Camera

The slide annotates tuple pairs with their correlation types: identical (=), negatively correlated (N), positively correlated (P), and disjoint (D).
Special Case!

Tuple-independent probabilistic database:
  pr : TUP → (0,1], with no restrictions w.r.t. other tuples

  Pr(I) = Π{t ∈ I} pr(t) × Π{t ∉ I} (1 − pr(t))

TUP = {t1, t2, …, tM} = all tuples;  INST = P(TUP);  N = 2^M
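The product formula can be made concrete by expanding a tuple-independent table into all 2^M worlds. A small Python sketch (the three example tuples match the slide a few lines below):

```python
from itertools import product

def possible_worlds(tuples):
    """Expand a tuple-independent table [(tuple, pr), ...] into all
    2^M possible worlds, each with
    Pr(I) = prod_{t in I} pr(t) * prod_{t not in I} (1 - pr(t))."""
    worlds = []
    for bits in product([0, 1], repeat=len(tuples)):
        prob, world = 1.0, []
        for bit, (t, pr) in zip(bits, tuples):
            prob *= pr if bit else (1.0 - pr)
            if bit:
                world.append(t)
        worlds.append((world, prob))
    return worlds

J = [(("John", "Seattle"), 0.8),
     (("Sue", "Boston"), 0.6),
     (("Fred", "Boston"), 0.9)]
worlds = possible_worlds(J)
```

With M = 3 tuples this yields the 8 worlds I1…I8 shown next, and their probabilities sum to 1.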
… back to the Venn Diagram (I)

If t1 and t2 are independent (per assumption!), there are 4 possible worlds = 4 subsets of the events
"Tuple t1 is in the database." and "Tuple t2 is in the database."

Pr("Tuple t1 is in the database and tuple t2 is in the database") := Pr(t1) × Pr(t2) = pr(t1) × pr(t2)
… back to the Venn Diagram (II)

If t1 and t2 are disjoint (per assumption!), there are 3 possible worlds = 3 subsets of the events
"Tuple t1 is in the database." and "Tuple t2 is in the database."

Pr("Tuple t1 is in the database and tuple t2 is in the database") := 0
Tuple Prob. Possible Worlds

Ip:   Name   City      pr
      John   Seattle   p1 = 0.8
      Sue    Boston    p2 = 0.6
      Fred   Boston    p3 = 0.9

Assumption: tuples are independent!

World                      Probability
I1 = {}                    (1−p1)(1−p2)(1−p3)
I2 = {John}                p1(1−p2)(1−p3)
I3 = {Sue}                 (1−p1)p2(1−p3)
I4 = {Fred}                (1−p1)(1−p2)p3
I5 = {John, Sue}           p1p2(1−p3)
I6 = {John, Fred}          p1(1−p2)p3
I7 = {Sue, Fred}           (1−p1)p2p3
I8 = {John, Sue, Fred}     p1p2p3
                           Σ = 1
Tuple Prob. Query Evaluation

Person^p:                      Purchase^p:
Name   City     pr             Customer  Product  Date  pr
John   Seattle  p1             John      Gizmo    …     q1
Sue    Boston   p2             John      Gadget   …     q2
Fred   Boston   p3             John      Gadget   …     q3
                               Sue       Camera   …     q4
                               Sue       Gadget   …     q5
                               Sue       Gadget   …     q6
                               Fred      Gadget   …     q7

SELECT DISTINCT x.City
FROM Person^p x, Purchase^p y
WHERE x.Name = y.Customer AND y.Product = 'Gadget'

Marginals:
Tuple     Probability
Seattle   p1 (1 − (1−q2)(1−q3))
Boston    1 − (1 − p2(1 − (1−q5)(1−q6))) × (1 − p3 q7)
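A short Python sketch of how these marginals combine per city (assuming tuple independence throughout; the numeric values for p1…p3 and the q's are my own illustration, the slide keeps them symbolic):

```python
def indep_or(ps):
    """Probability that at least one of the independent events holds:
    1 - prod(1 - p_i)."""
    r = 1.0
    for p in ps:
        r *= (1.0 - p)
    return 1.0 - r

# Assumed numbers: person probabilities and their 'Gadget' purchases.
p1, q_john = 0.8, [0.5, 0.4]   # John's Gadget purchases: q2, q3
p2, q_sue = 0.6, [0.3, 0.2]    # Sue's Gadget purchases: q5, q6
p3, q_fred = 0.9, [0.7]        # Fred's Gadget purchase: q7

# A city is an answer if at least one resident exists AND bought a
# Gadget; residents are combined with independent-OR.
pr_seattle = p1 * indep_or(q_john)
pr_boston = indep_or([p2 * indep_or(q_sue), p3 * indep_or(q_fred)])
```

With these numbers, Seattle gets 0.8 × (1 − 0.5 × 0.6) = 0.56, matching the formula on the slide.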
Summary of Data Model
Possible Worlds Semantics• Very powerful model:
– Can capture any tuple correlations.
• Needs a separate representation formalism ("just tables" are generally not enough):
  Boolean event expressions to capture complex tuple dependencies: "provenance", "lineage", "views", etc.
• But: query evaluation may be very expensive.– Need to find good cases, otherwise must approximate.
Outline for Part III• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond• Part III.3: Probabilistic Database Engines
– Stanford Trio Project– MystiQ @ U Washington
• Part III.4: Managing Uncertain RDF Data– URDF @ Max Planck Institute
Trio’s Data Model
1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidence values4. Lineage
Uncertainty-Lineage Databases (ULDBs)
[Widom et al.: 2008]
Trio’s Data Model
1. Alternatives: uncertainty about value
Saw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Three possibleinstances
Six possibleinstances
Trio’s Data Model
1. Alternatives2. ‘?’ (Maybe): uncertainty about presence
?
Saw (witness, color, car)
Amy red, Honda ∥ red, Toyota ∥ orange, Mazda
Betty blue, Acura
Trio’s Data Model
• 1. Alternatives• 2. ‘?’ (Maybe) Annotations• 3. Confidences: weighted uncertainty
Six possible instances, each with a probability
?
Saw (witness, color, car)
Amy red, Honda 0.5 ∥ red, Toyota 0.3 ∥ orange, Mazda 0.2
Betty blue, Acura 0.6
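The six weighted instances can be enumerated mechanically. A minimal Python sketch of this x-tuple semantics (the representation as lists of alternatives is my own encoding, not Trio's API):

```python
from itertools import product

# Each x-tuple: list of (value, confidence) alternatives. A '?' tuple
# additionally allows absence, carrying the leftover probability.
amy = [(("red", "Honda"), 0.5), (("red", "Toyota"), 0.3),
       (("orange", "Mazda"), 0.2)]
betty = [(("blue", "Acura"), 0.6), (None, 0.4)]   # '?': absent w.p. 0.4

instances = []
for (a, pa), (b, pb) in product(amy, betty):
    inst = [("Amy",) + a]          # Amy contributes exactly one choice
    if b is not None:              # Betty may be absent entirely
        inst.append(("Betty",) + b)
    instances.append((inst, pa * pb))
```

Three alternatives for Amy times two options for Betty give the six instances, whose probabilities sum to 1.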
So Far: Model is Not Closed

Saw (witness, car):
  Cathy: Honda ∥ Mazda

Drives (person, car):
  Jimmy, Toyota ∥ Jimmy, Mazda
  Billy, Honda ∥ Frank, Honda
  Hank, Honda

Suspects = πperson(Saw ⋈ Drives):
  Jimmy
  Billy ∥ Frank
  Hank

Without lineage, this result CANNOT correctly capture the possible instances.
Example with Lineage

ID  Saw (witness, car)
11  Cathy: Honda ∥ Mazda

ID  Drives (person, car)
21  Jimmy, Toyota ∥ Jimmy, Mazda
22  Billy, Honda ∥ Frank, Honda
23  Hank, Honda

ID  Suspects
31  Jimmy
32  Billy ∥ Frank
33  Hank

Suspects = πperson(Saw ⋈ Drives)

λ(31) = (11,2) ∧ (21,2)
λ(32,1) = (11,1) ∧ (22,1);  λ(32,2) = (11,1) ∧ (22,2)
λ(33) = (11,1) ∧ 23
Correctly captures possible instances inthe result (7)
Operational Semantics

A ULDB D represents its possible instances I1, I2, …, In. Applying query Q to each instance yields J1, J2, …, Jm; a direct implementation computes a representation D′ of these results.

Closure: D′ always exists.
Completeness: any (finite) set of possible instances can be represented.
Summary on Trio’s Data Model
1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidence values4. Lineage
Uncertainty-Lineage Databases (ULDBs)
Theorem: ULDBs are closed and complete.
Formally studied properties like minimization, equivalence, approximation and membership based on lineage.
[Benjelloun, Widom, et al.: VLDB J. 08]
MYSTIQ: Query Complexity

• Data complexity of a query Q: compute Q(Ip) for a probabilistic database Ip.
  – Extensional query evaluation: works for "safe" query plans, with PTIME data complexity.
  – Intensional query evaluation: works for any plan, but has #P-complete data complexity in the general case.
• Assume independent tuples in Ip.
• Compute marginal probabilities for the tuples in Q's answer.
• Boolean event expressions are used for intensional query evaluation.
Extensional Query Evaluation

Relational operators compute probabilities directly [Fuhr&Roellke:1997, Dalvi&Suciu:2004]:

σ (selection):    v (p)  →  v (p)
× (join):         v1 (p1), v2 (p2)  →  v1 v2 (p1 p2)
Π (projection with duplicate elimination):
                  v (p1), v (p2), …  →  v (1 − (1−p1)(1−p2)…)
                  (or, for disjoint tuples: p1 + p2 + …)
− (difference):   v (p1), v (p2)  →  v (p1(1−p2))

Data complexity: PTIME

Example:
SELECT DISTINCT x.City
FROM Person^p x, Purchase^p y
WHERE x.Name = y.Customer AND y.Product = 'Gadget'

Plan 1 (join first, then project):
  Jon Sea (p1 q1), Jon Sea (p1 q2), Jon Sea (p1 q3)
  → Sea (1 − (1−p1q1)(1−p1q2)(1−p1q3))          Wrong!

Plan 2 (project Purchase first, then join):
  Jon (1 − (1−q1)(1−q2)(1−q3)), Jon Sea (p1)
  → Jon Sea (p1(1 − (1−q1)(1−q2)(1−q3)))        Correct!

The result depends on the plan! Plans that compute the correct marginals are called "Safe Plans" [Dalvi&Suciu:2004].
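The extensional rules and the plan-dependence can be sketched in a few lines of Python (function names and the numeric values for p1, q1…q3 are my own illustration; tuple independence is assumed by every operator):

```python
def ext_join(p1, p2):
    """Join of two independent tuples: Pr = p1 * p2."""
    return p1 * p2

def ext_project(ps):
    """Duplicate-eliminating projection over (assumed) independent
    duplicates: Pr = 1 - prod(1 - p_i)."""
    r = 1.0
    for p in ps:
        r *= (1.0 - p)
    return 1.0 - r

def ext_difference(p1, p2):
    """Tuple in R - S for independent tuples: Pr = p1 * (1 - p2)."""
    return p1 * (1.0 - p2)

# Assumed numbers for p1, q1..q3 (symbolic on the slide).
p1, qs = 0.8, [0.5, 0.4, 0.3]

# Safe plan: project the purchases first, then join once with p1.
correct = ext_join(p1, ext_project(qs))
# Unsafe plan: join first; the three joined tuples all contain p1,
# so ext_project's independence assumption is violated.
wrong = ext_project([ext_join(p1, q) for q in qs])
```

Here `correct` = 0.8 × (1 − 0.5 × 0.6 × 0.7) = 0.632, while the unsafe plan overestimates the marginal because it treats the three copies of p1 as independent.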
Query Complexity

Sometimes there exists a correct extensional plan, but consider the following:

  Qbad :- R(x), S(x,y), T(y)

Its data complexity is #P-complete [Dalvi&Suciu:2004].

NP = class of problems of the form "is there a witness?"
#P = class of problems of the form "how many witnesses?" (we will come back to this…)
Intensional Database [Fuhr&Roellke:1997]

An intensional probabilistic database J attaches to each tuple t an event attribute t.E:
• Atomic event ids: e1, e2, e3, …
• Probabilities: p1, p2, p3, … ∈ [0,1]
• Event expressions over ∧, ∨, ¬, e.g.: e3 ∧ (e5 ∨ ¬e2)
Probability of Boolean Expressions

E = X1X3 ∨ X1X4 ∨ X2X5 ∨ X2X6

Sampling: randomly make each variable true with the probabilities
Pr(X1) = p1, Pr(X2) = p2, …, Pr(X6) = p6.

What is Pr(E)?

Answer: re-group cleverly into a "read-once" formula:
E = X1(X3 ∨ X4) ∨ X2(X5 ∨ X6)
Pr(E) = 1 − (1 − p1(1 − (1−p3)(1−p4))) (1 − p2(1 − (1−p5)(1−p6)))

Needed for query evaluation!
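A Python sketch checking the read-once factorization against brute-force enumeration over all 2^6 assignments (the uniform probabilities are an assumed instantiation):

```python
from itertools import product

p = {i: 0.5 for i in range(1, 7)}   # assumed variable probabilities

def E(x):
    """E = X1X3 v X1X4 v X2X5 v X2X6."""
    return (x[1] and x[3]) or (x[1] and x[4]) or \
           (x[2] and x[5]) or (x[2] and x[6])

# Exact Pr(E): enumerate all assignments, sum their probabilities.
exact = 0.0
for bits in product([0, 1], repeat=6):
    x = {i + 1: b for i, b in enumerate(bits)}
    if E(x):
        pr = 1.0
        for i in range(1, 7):
            pr *= p[i] if x[i] else (1 - p[i])
        exact += pr

# Read-once factorization: E = X1(X3 v X4) v X2(X5 v X6).
def orr(a, b):
    return 1 - (1 - a) * (1 - b)

factored = orr(p[1] * orr(p[3], p[4]), p[2] * orr(p[5], p[6]))
```

The factored formula touches each variable once and runs in linear time, while the enumeration is exponential; sampling, mentioned above, trades exactness for a fixed number of random assignments.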
Complexity Issues

Theorem [Valiant:1979]: For a Boolean expression E, computing Pr(E) is #P-complete.

NP = class of problems of the form "is there a witness?" (SAT)
#P = class of problems of the form "how many witnesses?" (#SAT)

The decision problem for 2CNF is in PTIME; the counting problem for 2CNF is #P-complete.
MYSTIQ: [Re, Suciu: VLDB’04]
Probabilistic Query Evaluation on Top of a Deterministic Database Engine
Deterministic Database
SQL Query ProbabilisticQuery Engine
(Top-k) Answers
1. Sampling
2. Extensional joins
3. Indexes
Outline for Part III• Part III.1: Motivation
– What is uncertain data, and where does it come from?
• Part III.2: Possible Worlds & Beyond• Part III.3: Probabilistic Database Engines
– Stanford Trio Project– MystiQ @ U Washington
• Part III.4: Uncertain RDF Data– URDF Project @ Max Planck Institute
Uncertain RDF (URDF) Data Model
• Extensional Layer (information extraction & integration)
  – High-confidence facts: existing knowledge base ("ground truth")
  – New fact candidates: extracted facts with confidence values
  – Integration of different knowledge sources:
    ontology merging or explicit Linked Data (owl:sameAs, owl:equivalentProperty)
  → a large "probabilistic database" of RDF facts
• Intensional Layer (query-time inference)
  – Soft rules: deductive grounding & lineage (Datalog/SLD resolution)
  – Hard rules: consistency constraints (more general FOL rules)
  – Propositional & probabilistic consistency reasoning
Soft Rules vs. Hard Rules

(Soft) Deduction Rules vs. (Hard) Consistency Constraints

• People may live in more than one place:
  livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y)   [0.8]
  livesIn(x,y) ∧ hasChild(x,z) ⇒ livesIn(z,y)    [0.5]
• People are not born in different places/on different dates:
  bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
  bornOn(x,y) ∧ bornOn(x,z) ⇒ y=z
• People are not married to more than one person (at the same time, in most countries):
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)
The two kinds of rules correspond to well-known formalisms:
• Soft deduction rules yield a deductive database: Datalog, the core of SQL & relational algebra, RDF/S, OWL2-RL, etc.
• Hard consistency constraints are more general FOL constraints: Datalog with constraints, X-tuples in probabilistic DBs, owl:FunctionalProperty, etc.
URDF: Running Example

Rules:
  hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z)   [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z

[Knowledge-base graph: entities Jeff, Surajit, David (type Computer_Scientist [1.0]) and Stanford, Princeton (type University [1.0]); base facts worksAt(Jeff, Stanford) [0.9], graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7]; derived facts gradFrom(Surajit, Stanford) [?] and gradFrom(David, Stanford) [?].]
Basic Types of Inference

• MAP inference
  – Find the most likely assignment to query variables y under a given evidence x.
  – Compute: arg max_y P(y | x)   (NP-hard for MaxSAT)
• Marginal/success probabilities
  – Probability that query y is true in a random world under a given evidence x.
  – Compute: Σ_y P(y | x)   (#P-hard already for conjunctive queries)
General Route: Grounding & MaxSAT Solving

Query: graduatedFrom(x, y)

CNF (grounded hard & soft rules plus weighted base facts):
  (¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))                      [1000]
  (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))                          [1000]
  (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))    [0.4]
  (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))        [0.4]
  worksAt(Jeff, Stanford) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7],
  graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]

1) Grounding: consider only facts (and rules) which are relevant for answering the query.
2) Propositional formula in CNF, consisting of grounded hard & soft rules and weighted base facts.
3) Propositional reasoning: find a truth assignment to the facts such that the total weight of the satisfied clauses is maximized.

MAP inference computes the "most likely" possible world: arg max_y P(y | x) resolves to a variant of MaxSAT for propositional formulas.

[Theobald, Sozio, Suchanek, Nakashole: VLDS'12]
URDF: MaxSAT Solving with Soft & Hard Rules

Special case: Horn clauses as soft rules & mutex constraints as hard rules.

S (mutex constraints):
  { graduatedFrom(Surajit, Stanford), graduatedFrom(Surajit, Princeton) }
  { graduatedFrom(David, Stanford), graduatedFrom(David, Princeton) }

C (weighted Horn clauses, CNF):
  (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))    [0.4]
  (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))        [0.4]
  worksAt(Jeff, Stanford) [0.9], hasAdvisor(Surajit, Jeff) [0.8], hasAdvisor(David, Jeff) [0.7],
  graduatedFrom(Surajit, Princeton) [0.7], graduatedFrom(Surajit, Stanford) [0.6], graduatedFrom(David, Princeton) [0.9]

MaxSAT algorithm:
  Compute W0 = Σ_{clauses C} w(C) P(C is satisfied);
  For each hard constraint S_t {
    For each fact f in S_t {
      Compute W_f+^t = Σ_{clauses C} w(C) P(C is sat. | f = true);
    }
    Compute W_S-^t = Σ_{clauses C} w(C) P(C is sat. | S_t = false);
    Choose the truth assignment to the facts in S_t that maximizes W_f+^t, W_S-^t;
    Remove satisfied clauses C; t++;
  }

• Runtime: O(|S||C|)
• Approximation guarantee of 1/2
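A highly simplified Python sketch of the greedy step for one mutex constraint (this is my own rendering of the idea, not URDF's code): among the options "exactly one fact in S true" or "all of S false", pick the one maximizing the expected total weight of satisfied clauses, with unassigned facts kept at their prior probabilities.

```python
def clause_sat_prob(clause, assign, prior):
    """Pr(clause satisfied); clause = [(fact, positive?), ...],
    facts assumed independent with probabilities from assign/prior."""
    p_all_false = 1.0
    for fact, positive in clause:
        p_true = assign.get(fact, prior.get(fact, 0.0))
        p_lit = p_true if positive else 1.0 - p_true
        p_all_false *= 1.0 - p_lit
    return 1.0 - p_all_false

def expected_weight(clauses, assign, prior):
    return sum(w * clause_sat_prob(c, assign, prior) for c, w in clauses)

def choose_mutex(S, clauses, prior):
    """Options: each f in S true (rest false), or all of S false."""
    options = [{f: 1.0 if f == g else 0.0 for f in S} for g in S]
    options.append({f: 0.0 for f in S})
    return max(options, key=lambda a: expected_weight(clauses, a, prior))

# Running example: the mutex over Surajit's two graduatedFrom facts.
gfS, gfP = "gradFrom(Surajit,Stanford)", "gradFrom(Surajit,Princeton)"
adv, work = "hasAdvisor(Surajit,Jeff)", "worksAt(Jeff,Stanford)"
clauses = [([(gfS, True)], 0.6), ([(gfP, True)], 0.7),
           ([(adv, False), (work, False), (gfS, True)], 0.4)]
prior = {adv: 0.8, work: 0.9}
best = choose_mutex([gfS, gfP], clauses, prior)
```

With these weights the soft rule tips the balance: setting the Stanford fact true satisfies both its unit clause and the rule clause, so the greedy step prefers it over Princeton.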
Deductive Grounding with Lineage (SLD Resolution/Datalog)

Query: graduatedFrom(Surajit, y)

Rules:
  hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z)   [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z

Base facts:
  graduatedFrom(Surajit, Princeton) [0.7]    (A)
  graduatedFrom(Surajit, Stanford) [0.6]     (B)
  graduatedFrom(David, Princeton) [0.9]
  hasAdvisor(Surajit, Jeff) [0.8]            (C)
  hasAdvisor(David, Jeff) [0.7]
  worksAt(Jeff, Stanford) [0.9]              (D)
  type(Princeton, University) [1.0]
  type(Stanford, University) [1.0]
  type(Jeff, Computer_Scientist) [1.0]
  type(Surajit, Computer_Scientist) [1.0]
  type(David, Computer_Scientist) [1.0]

Lineage of the two answers (taking the mutex constraint into account):
  Q1 = graduatedFrom(Surajit, Princeton):  A ∧ ¬(B ∨ (C∧D))
  Q2 = graduatedFrom(Surajit, Stanford):   ¬A ∧ (B ∨ (C∧D))
Lineage & Possible Worlds

1) Deductive grounding: build the dependency graph of the query; trace the lineage of individual query answers.
2) Lineage DAG (not in CNF), consisting of grounded hard & soft rules and weighted base facts, plus the entire derivation history.
3) Probabilistic inference: compute marginals.
   P(Q): aggregate the probabilities of all possible worlds where the lineage of the query evaluates to "true".
   P(Q|H): drop the "impossible worlds".

Query graduatedFrom(Surajit, y), with
  A = graduatedFrom(Surajit, Princeton) [0.7], B = graduatedFrom(Surajit, Stanford) [0.6],
  C = hasAdvisor(Surajit, Jeff) [0.8], D = worksAt(Jeff, Stanford) [0.9]:

  P(C ∧ D) = 0.8 × 0.9 = 0.72
  P(B ∨ (C∧D)) = 1 − (1 − 0.72) × (1 − 0.6) = 0.888
  P(Q1) = P(A ∧ ¬(B ∨ (C∧D))) = 0.7 × (1 − 0.888) = 0.0784
  P(Q2) = P(¬A ∧ (B ∨ (C∧D))) = (1 − 0.7) × 0.888 = 0.2664

[Das Sarma, Theobald, Widom: ICDE'08; Dylla, Miliaraki, Theobald: CIKM'11]
Possible Worlds Semantics

A:0.7, B:0.6, C:0.8, D:0.9;  Q2 = ¬A ∧ (B ∨ (C∧D));  hard rule H over A and (B ∨ (C∧D)).

A  B  C  D  Q2   P(W)
1  1  1  1  0    0.7×0.6×0.8×0.9 = 0.3024
1  1  1  0  0    0.7×0.6×0.8×0.1 = 0.0336
1  1  0  1  0    … = 0.0756
1  1  0  0  0    … = 0.0084
1  0  1  1  0    … = 0.2016
1  0  1  0  0    … = 0.0224
1  0  0  1  0    … = 0.0504
1  0  0  0  0    … = 0.0056
0  1  1  1  1    0.3×0.6×0.8×0.9 = 0.1296
0  1  1  0  1    0.3×0.6×0.8×0.1 = 0.0144
0  1  0  1  1    0.3×0.6×0.2×0.9 = 0.0324
0  1  0  0  1    0.3×0.6×0.2×0.1 = 0.0036
0  0  1  1  1    0.3×0.4×0.8×0.9 = 0.0864
0  0  1  0  0    … = 0.0096
0  0  0  1  0    … = 0.0216
0  0  0  0  0    … = 0.0024
                 Σ = 1.0

P(Q1) = 0.0784;  P(Q1|H) = 0.0784 / 0.412 = 0.1903
P(Q2) = 0.2664;  P(Q2|H) = 0.2664 / 0.412 = 0.6466
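The marginals P(Q1) and P(Q2) in the table can be reproduced by enumerating the 16 worlds over A, B, C, D; a small Python sketch, taking the lineage formulas Q1 = A ∧ ¬(B ∨ (C∧D)) and Q2 = ¬A ∧ (B ∨ (C∧D)) as read off the slides:

```python
from itertools import product

pr = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

def world_prob(w):
    """Tuple-independent world probability: product over base facts."""
    p = 1.0
    for v, prob in pr.items():
        p *= prob if w[v] else 1 - prob
    return p

def Q1(w):  # graduatedFrom(Surajit, Princeton)
    return w["A"] and not (w["B"] or (w["C"] and w["D"]))

def Q2(w):  # graduatedFrom(Surajit, Stanford)
    return (not w["A"]) and (w["B"] or (w["C"] and w["D"]))

P_Q1 = P_Q2 = 0.0
for bits in product([False, True], repeat=4):
    w = dict(zip("ABCD", bits))
    if Q1(w):
        P_Q1 += world_prob(w)
    if Q2(w):
        P_Q2 += world_prob(w)
```

The enumeration matches the closed-form lineage evaluation (0.0784 and 0.2664); conditioning on the hard rule H then only rescales these by P(H).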
More Probabilistic Approaches
• Propositional– Stochastic MaxSat solvers: MaxWalkSat (MAP-Inference)– URDF: constrained weighted MaxSat solver for soft & hard rules
• Lineage & Possible Worlds (tuple-independent database)– Exact probabilistic inference: junction trees, variable elimination– Approximate inference: decision diagrams/Shannon expansions, sampling
• Combining First-Order Logic & Probabilistic Graphical Models
  – Markov Logic Networks* [Richardson & Domingos: Machine Learning 2006]
  – Factor graphs [FactorIE, McCallum et al.: NIPS 2008]
  – Variety of MCMC sampling techniques for probabilistic inference (e.g., Gibbs sampling, MC-SAT, etc.)
  * Alchemy – Open-Source AI: http://alchemy.cs.washington.edu/
Experiments

• YAGO knowledge base: 2 million entities, 20 million facts.
• URDF: SLD grounding & MaxSAT solving.
  – |C| = # literals in soft rules; |S| = # literals in hard rules.
• Basic query answering: SLD grounding & MaxSAT solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …).
• Asymptotic runtime checks: runtime comparisons for synthetic rule expansions.
• URDF MaxSAT vs. Markov Logic (MAP inference & MC-SAT).
UViz: URDF Visualization Frontend [Meiser, Dylla, Theobald: CIKM'11 Demo]

• System components:
  – Flash Player client
  – Tomcat server (JRE)
  – Relational backend (JDBC)
  – Remote Method Invocation & object serialization (BlazeDS)

Demo! http://urdf.mpi-inf.mpg.de
Recommended Readings

PART I
• SPARQL Query Language for RDF, W3C Recommendation, 15 January 2008, http://www.w3.org/TR/2008/REC-rdf-sparql-query/
• SPARQL 1.1 Query Language, W3C Working Draft, 21 March 2013, http://www.w3.org/TR/sparql11-query/
• SPARQL 1.1 Federated Query, W3C Working Draft, 21 March 2013, http://www.w3.org/TR/sparql11-federated-query/
• Kemafor Anyanwu, Angela Maduko, Amit P. Sheth: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW Conference, 2007
• Krisztian Balog, Edgar Meij, Maarten de Rijke: Entity Search: Building Bridges between Two Worlds. WWW, 2010
• Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE, 2002
• Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang: EntityRank: searching entities directly and holistically. VLDB, 2007
• Shady Elbassuoni, Maya Ramanath, Ralf Schenkel, Marcin Sydow, Gerhard Weikum: Language-model-based ranking for queries on RDF-graphs. CIKM, 2009
• Vagelis Hristidis, Heasoo Hwang, Yannis Papakonstantinou: Authority-based keyword search in databases. ACM Transactions on Database Systems 33(1), 2008
• Gjergji Kasneci, Maya Ramanath, Mauro Sozio, Fabian M. Suchanek, Gerhard Weikum: STAR: Steiner-Tree Approximation in Relationship Graphs. ICDE, 2009
• Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, Gerhard Weikum: NAGA: Searching and Ranking Knowledge. ICDE, 2008
• Thomas Neumann, Gerhard Weikum: Scalable join processing on very large RDF graphs. SIGMOD Conference, 2009
• Thomas Neumann, Gerhard Weikum: The RDF-3X engine for scalable management of RDF data. VLDB Journal 19(1), 2010
• François Picalausa, Yongming Luo, George H. L. Fletcher, Jan Hidders, Stijn Vansummeren: A Structural Approach to Indexing Triples. ESWC, 2012
• Nicoleta Preda, Gjergji Kasneci, Fabian M. Suchanek, Thomas Neumann, Wenjun Yuan, Gerhard Weikum: Active knowledge: dynamically enriching RDF knowledge bases by Web Services. SIGMOD Conference, 2010
• ChengXiang Zhai: Statistical Language Models for Information Retrieval. Morgan & Claypool Publishers, 2008
• Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. PVLDB 4(8), 2011

PART II
• Min Cai, Martin R. Frank: RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network. WWW, 2004
• Gong Cheng, Weiyi Ge, Yuzhong Qu: Falcons: searching and browsing entities on the semantic web. WWW, 2008
• Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal C. Doshi, Joel Sachs: Swoogle: A Search and Metadata Engine for the Semantic Web. CIKM, 2004
• Luis Galárraga, Katja Hose, Ralf Schenkel: Partout: A Distributed Engine for Efficient RDF Processing. To appear in PVLDB, 2013
• Steve Harris, Nick Lamb, Nigel Shadbolt: 4store: The Design and Implementation of a Clustered RDF Store. SSWS, 2009
• Jiewen Huang, Daniel J. Abadi, Kun Ren: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 2011
• Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani Thuraisingham, Paolo Castagna: Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. ISWC, 2012
• Bastian Quilitz, Ulf Leser: Querying Distributed RDF Data Sources with SPARQL. ISWC, 2008
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt: FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. ESWC, 2011
• Bin Shao, Haixun Wang, Yatao Li: Trinity: A Distributed Graph Engine on a Memory Cloud. To appear in SIGMOD, 2013
• Xiaofei Zhang, Lei Chen, Yongxin Tong, Min Wang: EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud. ICDE, 2013

PART III
• Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, Martin Theobald, Jennifer Widom: Databases with uncertainty and lineage. VLDB J. 17(2), 2008
• Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Ré, Dan Suciu: MYSTIQ: a system for finding more answers by using probabilities. SIGMOD Conference, 2005
• Nilesh N. Dalvi, Dan Suciu: Efficient Query Evaluation on Probabilistic Databases. VLDB, 2004
• Maximilian Dylla, Iris Miliaraki, Martin Theobald: Top-k Query Processing in Probabilistic Databases with Non-Materialized Views. ICDE, 2013
• Norbert Fuhr, Thomas Rölleke: A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Trans. Inf. Syst. 15(1), 1997
• Timm Meiser, Maximilian Dylla, Martin Theobald: Interactive reasoning in uncertain RDF knowledge bases. CIKM, 2011
• Ndapandula Nakashole, Mauro Sozio, Fabian Suchanek, Martin Theobald: Query-Time Reasoning in Uncertain RDF Knowledge Bases with Soft and Hard Rules. VLDS, 2012
• Dan Suciu, Dan Olteanu, Christopher Ré, Christoph Koch: Probabilistic Databases (Synthesis Lectures on Data Management), Morgan & Claypool Publishers, 2012